Multiple Choice Format Test Development Guide

Test Construction: During initial test construction, emphasis should be placed on creating the desired number of test items to represent each program objective. At this stage, it is necessary to ensure that all program objectives are represented by a reasonable number of items. If some objectives are more important than others, more items should represent those objectives. It may be helpful to create a list of objectives and the desired percentage of items on the test that should pertain to each objective. Many programs treat individual objectives as potential subscores for the total test. Such specifications can serve as a test blueprint and will be useful throughout the development of the test. If more items than necessary are generated for some objectives, the extra items should not be included in the initial form of the test. The desired balance of items per objective as outlined in the blueprint should be preserved. Any additional items should be saved, however, since they could be useful during test refinement or for creating multiple forms of the test. An item bank can be created for the storage and retrieval of test items.
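
As an illustration of how a blueprint's percentages might be turned into whole-number item counts, the following is a minimal sketch in Python. The function name, the objective labels, the weights, and the 24-item test length are all hypothetical, and largest-remainder rounding is only one reasonable way to make the counts add up to the intended test length.

```python
def allocate_items(weights, total_items):
    """Turn objective weights (e.g. percentages) into whole-number item counts.

    Uses largest-remainder rounding so the counts always sum to total_items.
    """
    total_weight = sum(weights.values())
    exact = {obj: total_items * w / total_weight for obj, w in weights.items()}
    counts = {obj: int(x) for obj, x in exact.items()}
    leftover = total_items - sum(counts.values())
    # Give the remaining items to the objectives with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda obj: exact[obj] - counts[obj], reverse=True)
    for obj in by_remainder[:leftover]:
        counts[obj] += 1
    return counts


# Hypothetical blueprint: three objectives weighted 50%, 30%, and 20% on a 24-item test.
blueprint = allocate_items({"Objective 1": 50, "Objective 2": 30, "Objective 3": 20}, 24)
print(blueprint)
```

Items written beyond these counts would go to the item bank rather than onto the initial form.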

Creating individual test items for a multiple-choice test should be done carefully. First, the phrasing of item roots should be examined by asking the following questions:

  • Is the wording clear?
  • Is the root too long or too short?
  • Is it asking what you think it is asking?
  • Does it include unnecessary jargon?

Second, the phrasing of item answer choices should be examined with these questions:

  • Is there one, and only one, correct answer?
  • Are the answer choices ambiguous?
  • Is there an answer choice that could legitimately be considered correct if looked at from another perspective (i.e., can you confidently explain to a hypothetical student why the keyed response is in fact the correct answer)?
  • Are there enough option choices, but not too many? (Usually 4 choices are adequate.)
  • Is the difficulty level of the test reasonable for the intended examinees?

After initial construction, the test should be examined and evaluated by people both inside and outside the particular program before being administered. Sometimes what seems unambiguous to one may be very ambiguous to others.

Once the test has been evaluated by a sufficient number of people and their recommendations have been responded to, it is time to pilot the test. At first, a small pilot administration may be useful. This will allow us to detect the presence of any gross administration problems that may have been overlooked. After a successful small pilot administration, the test should be piloted again, this time to as large a number of students as possible.

Pilot-Test Refinement: Results of the pilot test should be examined closely in terms of performance, reliability, and perhaps dimensionality.

Student Performance Issues: A look at the distribution of test scores can provide initial information about the quality of the test. Descriptive statistics on central tendency, variability, reliability, and standard errors provide clues to the quality of the test. For example, if the range of scores is too narrow, more items of varying difficulty should be added. If the range is too wide, perhaps some items should be removed from the test. An exceptionally low or high mean score on the test may indicate that the desired difficulty level has not been achieved. If the difficulty level is not close to what is desired, adding easier or more difficult items to the test will help to obtain it. (This assumes that there was some expectation about average performance before the pilot test was administered.)
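
The descriptive statistics mentioned above can be sketched as follows. This is an illustrative example only: the scores and the reliability value of .80 are made up, and it assumes that the standard error of interest is the classical standard error of measurement, computed as the standard deviation times the square root of one minus the reliability.

```python
import statistics
from math import sqrt


def score_summary(total_scores, reliability):
    """Summarize total test scores and estimate the standard error of measurement."""
    mean = statistics.mean(total_scores)
    sd = statistics.pstdev(total_scores)  # population SD; a sample SD is also defensible
    return {
        "mean": mean,
        "sd": sd,
        "min": min(total_scores),
        "max": max(total_scores),
        # Classical standard error of measurement: SD * sqrt(1 - reliability).
        "sem": sd * sqrt(1 - reliability),
    }


# Hypothetical total scores for eight examinees and an assumed reliability of .80.
summary = score_summary([12, 15, 9, 18, 14, 11, 16, 13], reliability=0.80)
print(summary)
```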

A look at the distribution of responses to each item can also be useful. Any items that every, or almost every, examinee answered correctly or incorrectly should be removed or improved. In addition, examination of students' answer choices can provide some insight into how the distracter choices are operating. Some distracters may be modified or replaced based on this information.
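
A minimal sketch of this item-level review is shown below. The response data and answer key are hypothetical, and the 10% and 90% thresholds for "almost every examinee" are assumptions chosen for illustration, not values prescribed by this guide.

```python
from collections import Counter


def item_analysis(responses, key):
    """responses: one list of chosen options per examinee; key: the keyed answers."""
    n = len(responses)
    report = []
    for i, keyed in enumerate(key):
        chosen = Counter(r[i] for r in responses)
        report.append({
            "item": i + 1,
            "p_correct": chosen[keyed] / n,  # proportion answering correctly
            "choices": dict(chosen),         # how often each option was picked
        })
    return report


def flag_extreme(report, low=0.10, high=0.90):
    """Items nearly everyone missed or nearly everyone answered correctly."""
    return [row["item"] for row in report
            if row["p_correct"] <= low or row["p_correct"] >= high]


# Hypothetical data: five examinees, three items.
key = ["B", "C", "A"]
responses = [
    ["B", "C", "A"],
    ["B", "A", "A"],
    ["B", "C", "D"],
    ["B", "D", "A"],
    ["B", "C", "A"],
]
report = item_analysis(responses, key)
flagged = flag_extreme(report)
print(report)
print(flagged)
```

The "choices" counts show which distracters attract responses and which are never selected, which is the information needed when deciding whether to modify or replace a distracter.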

Reliability Issues: A classical reliability analysis should be performed on both the total test and any subtests that may be appropriate. The value of Cronbach's coefficient alpha should be about .80 or higher, depending on the length of the test or subtest. At this stage, the coefficient will probably not reach that criterion. Examination of the item-discrimination indices and item-to-total score correlations can help discern which items are causing the coefficient to be lower than desired.
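
For reference, coefficient alpha can be computed from scored (0/1) item responses with the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The sketch below uses a small made-up data set; in practice a statistical package would normally be used.

```python
import statistics


def cronbach_alpha(scored):
    """scored: one list of 0/1 item scores per examinee (all the same length).

    Assumes at least two items and some variation in total scores.
    """
    k = len(scored[0])
    item_vars = [statistics.pvariance([row[i] for row in scored]) for i in range(k)]
    total_var = statistics.pvariance([sum(row) for row in scored])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical scored responses: six examinees, four items.
scored = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
alpha = cronbach_alpha(scored)
print(round(alpha, 3))
```

The same function can be applied to the columns belonging to a single objective to obtain alpha for a subtest.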

Items that are negatively correlated with the total, and items whose correlation with the total is less than .2, are problematic. The content of these items should be examined. Issues about item content considered during the construction phase should be revisited at this time, and these problematic items should be improved or removed. Insight into reasons for problematic items can be gained by examining the distribution of responses to that item: is there one answer choice that fooled many examinees? Is that answer choice worded ambiguously?
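
The screening rule above can be sketched as follows. Note one deliberate variation: this example correlates each item with the total of the remaining items (the "corrected" item-total correlation) rather than with the full total, so that an item is not correlated with itself. The data are hypothetical, and the .2 threshold is the one given in this guide.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns NaN when either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(scored):
    """Correlate each item with the total score on all other items."""
    k = len(scored[0])
    result = []
    for i in range(k):
        item = [row[i] for row in scored]
        rest = [sum(row) - row[i] for row in scored]  # total excluding this item
        result.append(pearson(item, rest))
    return result


def flag_weak_items(scored, threshold=0.2):
    # NaN compares False with everything, so "not (r >= threshold)" flags it too.
    return [i + 1 for i, r in enumerate(item_rest_correlations(scored))
            if not (r >= threshold)]


# Hypothetical scored responses: eight examinees, four items.
scored = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
correlations = item_rest_correlations(scored)
weak = flag_weak_items(scored)
print([round(r, 2) for r in correlations])
print(weak)
```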

Dimensionality Issues: It may be appropriate to examine the dimensionality of the test at this stage, particularly if the reliability analysis yields fair results. If internal consistency is low, examination of the dimensionality of the entire test would not be worthwhile, but it may be helpful to examine the dimensionality for the subset of good items that will be retained in future versions of the test. The goal is to develop a test that is reasonably unidimensional. If there is not one overall factor to which all the items are related, there may be several sub-factors related to the various objectives being measured. Appropriate scoring of the test depends on some knowledge of the dimensionality of the test. Evidence that the test is reasonably unidimensional provides a rationale for computing and interpreting a total test score as the sum of the correct responses. Evidence of the existence of sub-factors within the test provides a rationale for reporting and interpreting subscores.

Further Test Refinement (after subsequent administrations): Performance, reliability, and dimensionality should be examined closely after each test administration as outlined above.

Validity Issues: When the test has stabilized to some extent, validity should also be examined. Analysis of the relationship of this test score to other similar measures should be performed to answer the following questions:

  • Are the total and subtest scores related to other measures such as specific course grades, overall GPA, major GPA, SAT scores, performance assessments, faculty ratings, other test scores, or any other appropriate measures?
  • Are the relative correlations reasonable (i.e., is the test more related to some indicators than to others, as we would expect)?
  • Can we confidently predict a person's test score within a reasonable margin of error based on other information?
  • Are these relationships stable over time?
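
As one illustration of the prediction question above, the sketch below fits a simple least-squares line predicting test score from a single external measure and reports the standard error of estimate as a rough margin of error. The choice of major GPA as the predictor and all of the numbers are hypothetical; a full validity study would examine several measures and larger samples.

```python
from math import sqrt


def predict_with_error(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, intercept, standard error of estimate).
    Assumes at least three cases and some variation in x.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residual_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    see = sqrt(residual_ss / (n - 2))  # typical size of a prediction error
    return slope, intercept, see


# Hypothetical data: major GPA and total test score for five students.
gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
test_scores = [10, 14, 15, 19, 22]
slope, intercept, see = predict_with_error(gpa, test_scores)
print(round(slope, 2), round(intercept, 2), round(see, 2))
```

Repeating the same analysis on later administrations gives a simple check on whether the relationship is stable over time.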

Setting Standards: In order to allow for confident decisions based on test scores, a test should meet the following criteria:

  • The test contains a reasonable representation of appropriate program objectives.
  • The test is consistently of the desired difficulty level for the intended examinees.
  • Cronbach's coefficient alpha is sufficiently high (at least .85 for longer tests, .8 for shorter tests or subtests).
  • All items have item-total correlations of at least .2.
  • The test consistently demonstrates "sufficient" unidimensionality as described above OR the test consistently falls into clearly defined and interpretable sub-factors.
  • Test scores are related as expected to other measures, such as specific course grades.

It is at this point in the development of the test that it is reasonable to set some standards of performance. Expectations of levels of student performance should be considered. Cut-off scores can be developed for different competency levels for the total test and for subtests where appropriate. Faculty expectations for performance of students graduating from a program should be established.
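
Once cut-off scores have been agreed upon, applying them is mechanical. The following sketch shows one way to map total scores to competency levels; the cut-off values and level labels are invented for illustration and would in practice come from faculty judgment.

```python
from bisect import bisect_right


def classify(score, cutoffs, labels):
    """cutoffs: ascending minimum scores for each level above the lowest.

    labels must contain exactly one more entry than cutoffs.
    """
    return labels[bisect_right(cutoffs, score)]


# Hypothetical standards: 15 or more meets expectations, 22 or more exceeds them.
CUTOFFS = [15, 22]
LABELS = ["Below expectations", "Meets expectations", "Exceeds expectations"]

levels = [classify(s, CUTOFFS, LABELS) for s in (14, 15, 21, 22, 30)]
print(levels)
```

Separate cut-off lists can be kept for the total test and for each subtest where subscores are reported.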

Test Maintenance (and continued development): In addition to the standard analyses delineated above, test maintenance and continued development may involve any or all of the following:

  • Development and maintenance of an item bank.
  • Construction of multiple forms, and equating of those forms.
  • Development of additional test items to be piloted with the refined test.
  • Item Response Theory analysis for further insight into test reliability and item performance, examination of item bias, and preparation for a computer adaptive form of the test.
  • Development of a computer-based test, either non-adaptive, self-adaptive, or computer-adaptive, that could eventually replace the paper-and-pencil format.







PUBLISHER: Center for Assessment and Research Studies | CARS is part of JMU's University Studies
298 Port Republic Rd., MSC 6806 | Harrisonburg, VA | 22807 | PHONE: (540) 568-6706
FOR INFORMATION CONTACT: assessment@jmu.edu