Multiple-Choice Format Test Development Guide
Test Construction: During initial test construction, emphasis
should be placed on creating the desired number of test
items to represent each program objective. At this stage,
it is necessary to ensure that all program objectives are
represented by a reasonable number of items. If some objectives
are more important than others, more items should represent
those objectives. It may be helpful to create a list of
objectives and the desired percentage of items on the test
that should pertain to each objective. Many programs treat
individual objectives as potential subscores for the total
test. Such specifications can serve as a test blueprint
and will be useful throughout the development of the test.
If more items than necessary are generated for some objectives,
the surplus items should not be included in the initial form of the
test. The desired balance of items per objective as outlined
in the blueprint should be preserved. Any additional items
should be saved, however, since they could be useful during
test refinement, or for creating multiple forms of the
test. An item bank can be created for the storage and retrieval
of test items.
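A minimal sketch of such a blueprint check and item bank, written in Python, is shown below. The objective names, target percentages, test length, and item records are hypothetical placeholders, not part of this guide's recommendations.

```python
# Sketch of a test blueprint check against a small item bank (hypothetical data).
from collections import Counter

# Desired share of the test devoted to each program objective (sums to 1.0).
blueprint = {"objective_1": 0.40, "objective_2": 0.35, "objective_3": 0.25}

# A small item bank: each item is tagged with the objective it measures.
item_bank = [
    {"id": "Q01", "objective": "objective_1", "stem": "..."},
    {"id": "Q02", "objective": "objective_1", "stem": "..."},
    {"id": "Q03", "objective": "objective_2", "stem": "..."},
    {"id": "Q04", "objective": "objective_3", "stem": "..."},
]

test_length = 40  # intended number of items on the initial form

counts = Counter(item["objective"] for item in item_bank)
for objective, share in blueprint.items():
    needed = round(share * test_length)
    have = counts.get(objective, 0)
    print(f"{objective}: need {needed}, drafted {have}, surplus {have - needed}")
```

Surplus items reported by a check like this are the ones that can be set aside in the item bank for later refinement or alternate forms.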
Creating individual test items for a multiple-choice test
should be done carefully. First, the phrasing of item roots
should be examined by asking the following questions:
- Is the wording clear?
- Is the root too long or too short?
- Is it asking what you think it is asking?
- Does it include unnecessary jargon?
Second, the phrasing of item answer choices should be
examined with these questions:
- Is there one, and only one, correct answer?
- Are the answer choices ambiguous?
- Is there an answer choice that could be legitimately
considered correct when looked at from another perspective
(i.e., can you confidently explain to a hypothetical student
why the keyed response is in fact the correct answer)?
- Are there enough option choices, but not too many?
(Usually 4 choices are adequate.)
- Is the difficulty level of the test reasonable
for the intended examinees?
After initial construction, the test should be examined
and evaluated by people both inside and outside the particular
program before being administered. Sometimes what seems
unambiguous to one person may be very ambiguous to others.
Once the test has been evaluated by a sufficient number
of people and their recommendations have been responded
to, it is time to pilot the test. At first, a small pilot
administration may be useful. This will allow us to detect
the presence of any gross administration problems that
may have been overlooked. After a successful small pilot
administration, the test should be piloted again, this
time with as large a number of students as possible.
Pilot-Test Refinement: Results of the pilot test should
be examined closely in terms of performance, reliability,
and perhaps dimensionality.
Student Performance Issues: A look at the distribution
of test scores can provide initial information about the
quality of the test. Descriptive statistics concerning
the central tendency, variability, reliability, and standard
errors provide useful clues to the
quality of the test. For example, if the range of scores
is too narrow, more items of varying difficulty should
be added. If the range is too wide, perhaps some items
should be removed from the test. An exceptionally low or
high mean score on the test may indicate that the desired
difficulty level has not been achieved. If the difficulty
level is not close to what is desired, then adding easier
or more difficult items to the test will help to obtain
the desired difficulty level. (This assumes that there
was some expectation about average performance before the
pilot test was administered.)
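The sketch below illustrates how these basic descriptive statistics might be computed for pilot-test total scores. The score vector is a made-up placeholder; the real pilot data would be substituted.

```python
# Sketch of basic descriptive statistics for pilot-test total scores (placeholder data).
import numpy as np

scores = np.array([22, 31, 27, 35, 18, 29, 30, 26, 33, 24])  # total correct per examinee

mean = scores.mean()
sd = scores.std(ddof=1)                    # sample standard deviation
sem_mean = sd / np.sqrt(len(scores))       # standard error of the mean
print(f"N = {len(scores)}, mean = {mean:.1f}, SD = {sd:.1f}")
print(f"Range = {scores.min()}-{scores.max()}, SE of mean = {sem_mean:.2f}")
```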
A look at the distribution of responses to each item can
also be useful. Any items that every, or almost every,
examinee answered correctly or incorrectly should be removed
or improved. In addition, examination of students' answer
choices can provide some insight into how the distracter
choices are operating. Some distracters may be modified
or replaced based on this information.
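One way such a distracter analysis might be tabulated is sketched below. The response matrix, answer key, and option labels are invented placeholders.

```python
# Sketch of a per-item response-distribution (distracter) analysis.
import numpy as np

# Hypothetical examinees x items array of chosen options, plus the keyed answers.
responses = np.array([
    ["A", "C", "B"],
    ["A", "D", "B"],
    ["B", "C", "B"],
    ["A", "C", "A"],
])
key = ["A", "C", "B"]
options = ["A", "B", "C", "D"]

for j, keyed in enumerate(key):
    counts = {opt: int(np.sum(responses[:, j] == opt)) for opt in options}
    p_correct = counts[keyed] / responses.shape[0]
    print(f"Item {j + 1}: keyed={keyed}, p={p_correct:.2f}, choices={counts}")
```

Distracters that attract almost no examinees, or that attract more examinees than the keyed answer, are the natural candidates for modification or replacement.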
Reliability Issues: A classical reliability analysis should
be performed both on the total test, and any subtests that
may be appropriate. The value of Cronbach's coefficient alpha
should be about .80 or higher depending on the length of
the test or subtest. At this stage, the coefficient will
probably not reach that criterion. Examination of the item-discrimination
indices and item-to-total score correlations can help discern
which items are causing the coefficient to be lower than
desired.
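A minimal sketch of computing coefficient alpha from a scored (0/1) examinee-by-item matrix is shown below; the small matrix is a stand-in for real pilot data.

```python
# Sketch of Cronbach's coefficient alpha from an examinees x items 0/1 score matrix.
import numpy as np

X = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

k = X.shape[1]                              # number of items
item_vars = X.var(axis=0, ddof=1)           # variance of each item
total_var = X.sum(axis=1).var(ddof=1)       # variance of total scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```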
Items that are negatively correlated with the total, and
items whose correlation with the total is less than .2
are problematic. The content of these items should be examined.
Issues about item content considered during the construction
phase should be revisited at this time, and these problematic
items should be improved or removed. Insight into reasons
for problematic items can be gained by examining the distribution
of responses to that item - is there one answer choice
that many examinees were fooled by? Is that answer choice
worded ambiguously?
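The sketch below illustrates corrected item-total correlations (each item against the total with that item removed) and flags items falling under the .2 guideline just described; the score matrix is again a placeholder.

```python
# Sketch of corrected item-total correlations, flagging items below the .2 guideline.
import numpy as np

X = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
])

total = X.sum(axis=1)
for j in range(X.shape[1]):
    rest = total - X[:, j]                        # total score excluding item j
    r = np.corrcoef(X[:, j], rest)[0, 1]
    flag = "  <-- examine content" if r < 0.2 else ""
    print(f"Item {j + 1}: corrected item-total r = {r:+.2f}{flag}")
```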
Dimensionality Issues: It may be appropriate to examine
the dimensionality of the test at this stage, particularly
if the reliability analysis yields fair results. If internal
consistency is low, examination of the dimensionality of
the entire test would not be worthwhile, but it may be
helpful to examine the dimensionality for the subset of
good items that will be retained in future versions of
the test. The goal is to develop a test that is reasonably
unidimensional. If there is not one overall factor to which
all the items are related, there may be several sub-factors
related to the various objectives being measured. Appropriate
scoring of the test depends on some knowledge of the dimensionality
of the test. Evidence that the test is reasonably unidimensional
provides a rationale for computing and interpreting a total
test score as the sum of the correct responses. Evidence
of the existence of sub-factors within the test provides
a rationale for reporting and interpreting subscores.
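As a rough first look at dimensionality, one might inspect the eigenvalues of the inter-item correlation matrix, as sketched below with placeholder data; a dominant first eigenvalue is consistent with a single overall factor, while several comparable eigenvalues suggest sub-factors. A proper factor analysis would normally follow this quick check.

```python
# Sketch of a rough dimensionality check via eigenvalues of the inter-item
# correlation matrix (placeholder 0/1 score matrix).
import numpy as np

X = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1],
])

R = np.corrcoef(X, rowvar=False)               # inter-item correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
print("Eigenvalues (largest first):", np.round(eigenvalues, 2))
print("Share of variance on first factor:",
      round(eigenvalues[0] / eigenvalues.sum(), 2))
```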
Further Test Refinement (after subsequent administrations):
Performance, reliability, and dimensionality should be
examined closely after each test administration as outlined
above.
Validity Issues: When the test has stabilized to some
extent, validity should also be examined. Analysis of the
relationship of the test scores to other, similar measures
should be performed, addressing the following questions (a
brief sketch of such an analysis follows this list):
- Are the total and subtest scores related to other measures
such as specific course grades, overall GPA, major GPA,
SAT scores, performance assessments, faculty ratings, other
test scores, or any other appropriate measures?
- Are the relative correlations reasonable (i.e., is the
test more related to some indicators than to others, as
we would expect)?
- Can we confidently predict a person's test score
within a reasonable margin of error based on other
information?
- Are these relationships stable over time?
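A minimal sketch of such an analysis, assuming a single hypothetical external measure (a course grade), is shown below; the scores and grades are placeholders.

```python
# Sketch relating test scores to a hypothetical external measure (course grade)
# via correlation and a simple least-squares prediction.
import numpy as np

test_scores = np.array([22, 31, 27, 35, 18, 29, 30, 26])
course_grades = np.array([2.7, 3.4, 3.0, 3.8, 2.3, 3.1, 3.5, 2.9])

r = np.corrcoef(test_scores, course_grades)[0, 1]
print(f"Correlation with course grade: r = {r:.2f}")

# Predict the test score from the external measure with a one-variable regression.
slope, intercept = np.polyfit(course_grades, test_scores, 1)
predicted = slope * course_grades + intercept
rmse = np.sqrt(np.mean((test_scores - predicted) ** 2))
print(f"Prediction error (RMSE) = {rmse:.1f} score points")
```

Repeating such an analysis across administrations speaks to whether the relationships are stable over time.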
Setting Standards: In order to allow for confident decisions
based on test scores, a test should meet the following
criteria:
- The test contains a reasonable representation of appropriate
program objectives.
- The test is consistently of the desired difficulty
level for the intended examinees.
- Cronbach's coefficient alpha is sufficiently high
(at least .85 for longer tests, .8 for shorter
tests or subtests).
- All items have item-total correlations of at least
.2.
- The test consistently demonstrates "sufficient"
unidimensionality as described above, OR the test
consistently falls into clearly defined and interpretable
sub-factors.
- Test scores are related as expected to other
measures, such as specific course grades.
It is at this point in the development of the test that
it is reasonable to set some standards of performance.
Expectations of levels of student performance should be
considered. Cut-off scores can be developed for different
competency levels for the total test and for subtests where
appropriate. Faculty expectations for performance of students
graduating from a program should be established.
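The sketch below shows one way cut-off scores might be applied to total scores once they are set; the cut points and level labels are hypothetical placeholders, not recommended values.

```python
# Sketch of mapping total scores to competency levels via cut-off scores
# (hypothetical cut points and labels).
cut_scores = [(34, "exceeds expectations"),
              (28, "meets expectations"),
              (0,  "below expectations")]

def competency_level(score):
    """Return the first level whose cut-off the score meets or exceeds."""
    for cutoff, label in cut_scores:
        if score >= cutoff:
            return label
    return "below expectations"

for s in [36, 30, 21]:
    print(s, "->", competency_level(s))
```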
Test Maintenance (and continued development): In addition
to the standard analyses delineated above, test maintenance
and continued development may involve any or all of the
following:
- Development and maintenance of an item bank.
- Construction of multiple forms, and equating of those forms.
- Development of additional test items to be piloted with
the refined test.
- Item Response Theory analysis for further insight into
test reliability and item performance, examination of item
bias, and preparation for a computer-adaptive form of the
test (a minimal model sketch follows this list).
- Development of a computer-based test, either non-adaptive,
self-adaptive, or computer-adaptive, that could eventually
replace the paper-and-pencil format.
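As a point of reference for the Item Response Theory bullet above, the sketch below shows the item response function of the one-parameter (Rasch) model: the probability that an examinee of a given ability answers an item of a given difficulty correctly. The ability and difficulty values are illustrative only; operational IRT calibration would use dedicated estimation software.

```python
# Minimal sketch of the Rasch (one-parameter IRT) item response function.
import math

def rasch_probability(theta, b):
    """P(correct) for ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

for theta in (-1.0, 0.0, 1.0):            # low, average, high ability
    p = rasch_probability(theta, b=0.5)   # item of moderate difficulty
    print(f"theta={theta:+.1f}: P(correct) = {p:.2f}")
```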