Writing Test Items

Selected-Response Assessments
To answer selected-response test items, students choose
the best answer among two or more possible responses.
Multiple-choice items. The multiple-choice item is
a common type of selected-response item. The item begins
with a partial statement or a question, termed the
item stem. Several options that complete the statement
or answer the question are provided, and the student
chooses one. Incorrect options are
sometimes called distractors. In the item below, options
a, b, and d are distractors.
1) What are the primary colors?
a) black and white
b) purple, green, and orange
c) red, yellow, and blue
d) khaki and olive
The stem could also be written as a partial statement
instead of a complete question:
1) The primary colors are _______________ .
a) black and white
b) purple, green, and orange
c) red, yellow, and blue
d) khaki and olive
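To make the parts of a multiple-choice item concrete, the short sketch below represents the example above as a simple data structure (stem, options, and key) and scores a student response against the key. The field names and scoring function are illustrative assumptions, not a standard format.

```python
# A minimal sketch of a multiple-choice item as a data structure.
# Field names (stem, options, key) are chosen only for this example.
item = {
    "stem": "What are the primary colors?",
    "options": {
        "a": "black and white",
        "b": "purple, green, and orange",
        "c": "red, yellow, and blue",
        "d": "khaki and olive",
    },
    "key": "c",  # the correct option; a, b, and d are distractors
}

def score_response(item, response):
    """Return 1 if the chosen option matches the key, otherwise 0."""
    return 1 if response == item["key"] else 0

print(score_response(item, "c"))  # 1 (correct)
print(score_response(item, "b"))  # 0 (distractor chosen)
```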
In some studies, items
were more reliable (led to more stable scores) when
the stems were written as
complete questions (Haladyna & Downing, 1989).
Distractors should be plausible. Trevisan, Sax, and
Michael (1994) found there was little difference in
reliability (stability) between a test composed of
4-option items and a test composed of 5-option items.
When the number of items that can be administered in
a given time frame is considered (items with fewer
options can be answered more quickly), 3-option items
were almost as reliable as 4-option items. Generally,
item-writers try to keep the length of the options
fairly similar; several studies summarized in Haladyna
and Downing (1989) found items were easier when the
correct answer tended to be longer than the other options.
Matching items are another type of selected-response
item. Guessing can be reduced on matching items by
using more options than items, as in the example below.
For each work in column A, choose the author from
column B. Write the letter of the chosen author in
the blank beside the title.
Column A                     Column B
___ Great Expectations       (a) John Grisham
___ Rosemary's Baby          (b) Charlotte Bronte
___ Wuthering Heights        (c) Emily Bronte
___ The Firm                 (d) Shakespeare
___ The Tempest              (e) Ira Levin
                             (f) Tolstoy
                             (g) Charles Dickens
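To see why extra options reduce guessing, consider a student who matches entirely at random, using each option at most once. The small simulation below (the function name and trial count are assumptions made for illustration) estimates the expected number of correct matches by chance alone.

```python
import random

def expected_correct_by_guessing(n_items, n_options, trials=100_000):
    """Estimate the expected number of correct matches when a student
    assigns distinct options to items completely at random."""
    total = 0
    for _ in range(trials):
        # Randomly assign a distinct option to each item.
        guess = random.sample(range(n_options), n_items)
        # Assume the correct option for item i is option i.
        total += sum(1 for i, g in enumerate(guess) if g == i)
    return total / trials

# Five items with five options vs. five items with seven options.
print(expected_correct_by_guessing(5, 5))  # about 1.0 correct by chance
print(expected_correct_by_guessing(5, 7))  # about 0.71 (5/7) correct by chance
```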
True-false items are another familiar type.
1) Ice and rain are forms of precipitation.
a) True
b) False
There is much disagreement
about the limitations of true-false items (see Frisbie & Becker,
1990, for a brief summary of the major views).
A less common item type is
multiple-true-false (MTF). Multiple-true-false items
share a common stem, and
the student responds true or false to each option,
each of which extends the stem.
________ is an automobile
model made by Mazda.
1) Miata
a) True
b) False
2) Camry
a) True
b) False
3) RX7
a) True
b) False
4) Accord
a) True
b) False
5) Altima
a) True
b) False
Frisbie (1990) summarized
studies of MTF items and concluded that MTF items
measure the same constructs as
multiple-choice items, and more MTF items can be
given in the same amount of time. Downing, Baranowski,
Grosso, & Norcini (1995) had similar findings,
but noted that multiple-choice items were somewhat
more correlated with external criteria (for example,
other tests or grades), perhaps because the MTF items
in their study tended to measure lower-level thinking.
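A rough sketch of how an MTF cluster might be scored is shown below: each option is marked independently, which is why a five-option cluster yields five scoreable responses rather than one. The data layout and names are illustrative assumptions, not a standard format.

```python
# One MTF cluster: a shared stem with several true/false options.
# Each option is scored independently.
cluster = {
    "stem": "________ is an automobile model made by Mazda.",
    "options": {"Miata": True, "Camry": False, "RX7": True,
                "Accord": False, "Altima": False},
}

# A hypothetical student's true/false responses to each option.
responses = {"Miata": True, "Camry": False, "RX7": False,
             "Accord": False, "Altima": True}

score = sum(1 for name, key in cluster["options"].items()
            if responses.get(name) == key)
print(f"{score} of {len(cluster['options'])} options answered correctly")  # 3 of 5
```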
A variant on the multiple-choice item is called Type
K (complex multiple choice). This item type is seen
on some standardized tests.
1) Which of the following are needed to calculate simple interest?
I. The amount of money borrowed
II. The interest rate
III. The length of the borrowing period
a) I only
b) I and II
c) I and III
d) I, II, and III
Compared with standard multiple-choice items, Type K
items are more difficult, fewer can be answered in a given
time period, they may be more dependent on
test-taking skills, and they often have lower discriminations (Haladyna, 1992).
Common "Rules" for
Selected-Response Items
Haladyna and Downing
(1989) summarized common rules found in many references.
Some of these rules relate
to the empirical findings discussed above (number of
options, use of Type K). Other rules are termed by
Haladyna and Downing as "values" shared by
measurement experts, often based on common sense rather
than empirical evidence. Some of these rules are paraphrased
below.
- Edit the items for basic
grammar, punctuation, and spelling. All the option
choices should use parallel
grammar to avoid giving clues to the right answer.
- The option choices should address the same content,
and the distractors should be reasonable choices
for a student with limited or incorrect information.
One way to develop distractors is to use common errors students make.
- Items should be as clear and concise as possible,
both so students know what is being asked and
to minimize reading time and the influence of
reading skills
on performance.
- The stem, not the options, should clearly contain
the question or problem situation. Students should
know what the gist of the item is without
reading the options.
- Vocabulary should be appropriate for the level
of the test.
- "Focus on a single problem" items
with multiple clauses may have multiple correct
answers depending on which aspect the student focuses.
- To avoid testing rote facts, do not use the same
words and phrasing as the textbook.
- Multiple-choice items can be used to measure higher-level
thinking. Consider how a student needs to think
to answer the item.
- While the research
on "none of the above" and "all of the
above" is not decisive, many recommend using these
sparingly, if at all.
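A few of these rules lend themselves to simple mechanical checks during item review. The sketch below is a rough illustration rather than a validated tool: it flags a correct answer that is much longer than the distractors (see the discussion of option length above) and the use of "all of the above" or "none of the above." The threshold and function names are assumptions chosen for the example.

```python
def review_item(options, key, length_ratio=1.5):
    """Flag a few mechanical issues in a multiple-choice item.
    `options` maps option letters to text; `key` is the correct letter."""
    warnings = []

    # Rule of thumb: the key should not be conspicuously longer than distractors.
    distractor_lengths = [len(text) for letter, text in options.items()
                          if letter != key]
    avg_distractor = sum(distractor_lengths) / len(distractor_lengths)
    if len(options[key]) > length_ratio * avg_distractor:
        warnings.append("Correct answer is much longer than the distractors.")

    # Flag "all of the above" / "none of the above" options.
    for letter, text in options.items():
        if text.strip().lower() in ("all of the above", "none of the above"):
            warnings.append(f"Option {letter} uses '{text}'; use sparingly.")

    return warnings

# An invented item with a conspicuously long key and a "none of the above" option.
options = {
    "a": "gravity",
    "b": "the gravitational attraction between the Earth and the object",
    "c": "magnetism",
    "d": "none of the above",
}
print(review_item(options, key="b"))
```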
References
Case, S. M., & Swanson,
D. B. (1998). Constructing written test questions
for the basic and clinical sciences
[On-line]. Available: http://www.nbme.org/new.version/item.htm
Downing, S. M. (1992). True-false, alternate-choice,
and multiple-choice items. Educational Measurement:
Issues and Practice, 11(3), 27-30.
Downing, S. M., Baranowski,
R. A., Grosso, L. J., & Norcini,
J. J. (1995). Item type and cognitive ability measured:
The validity evidence for multiple true-false items
in medical specialty certification. Applied Measurement
in Education, 8, 87-97.
Frisbie, D. A., & Becker,
D. F. (1990). An analysis of textbook advice about
true-false tests. Applied
Measurement in Education, 4, 67-83.
Haladyna, T. M. (1992). The effectiveness of several
multiple-choice formats. Applied Measurement in Education,
5, 73-88.
Haladyna, T. M. (1994). Developing and validating
multiple-choice test items. Hillsdale, NJ: Lawrence
Erlbaum Associates.
Haladyna, T. M., & Downing,
S. M. (1989). A taxonomy of multiple-choice item-writing
rules. Applied Measurement
in Education, 2, 37-50.
Haladyna, T. M., & Downing,
S. M. (1989). Validity of a taxonomy of multiple-choice
item-writing rules.
Applied Measurement in Education, 2, 51-78.
Roid, G. H., & Haladyna,
T. M. (1982). A technology for test-item writing.
Orlando, FL: Academic Press.
Trevisan, M. S., Sax,
G., & Michael, W. B. (1994).
Estimating the optimum number of options per item using
an incremental option paradigm. Educational and Psychological
Measurement, 54(1), 86-91.
Constructed-Response Assessments
For constructed-response items, tasks, or projects,
students must supply or construct a response rather
than selecting from among supplied alternatives.
Fill-in-the-blank and short-answer items are
among the simplest of the constructed-response item
formats. They are similar to multiple-choice items;
the student is expected to either complete a statement
or answer a question with a word or phrase (perhaps
a sentence or two).
1) What are the three primary colors?
1) The primary colors are __________, __________,
and __________.
The most common type of constructed-response item
is the extended-written-response item, used broadly
here to include any test item or prompt to which students
respond with an essay, description, or explanation
(including diagrams, charts, or mathematical solutions,
as well as written text).
1) Describe the steps in the scientific method.
1) Tell about a situation where you gave your best
effort.
1) Find the area between these curves: y = x² + 1 and y = -x² + 10
Projects, portfolios,
experiments, and demonstrations are also constructed-response assessments. They may
involve written documents (reports, handouts, PowerPoint
presentations, lab notebooks, journals, mathematical
proofs, collections of assignments), non-written products
(artwork, audio or videotape, executable computer programs,
web pages) and/or presentations, demonstrations, or
performances (dance, theatre, and musical concerts,
athletic performances, teaching demonstrations, counseling
sessions, classroom presentations, oral tests).
Scoring constructed-response assessments is obviously
more complex than scoring selected-response items (which
can be easily delegated to a computer). Rating scales
and checklists are used to score student responses.
The distinction between rating scales and checklists
is that checklists are used to indicate simply presence
or absence of some behavior (Erwin, 1991), while a
rating scale has a continuum. Rating scales have several
possible score points (often four to six). Some recommend
using an even number of points so there is no middle
or neutral point. Each point on the scale should be
described as explicitly as possible.
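To make the distinction concrete, the sketch below represents a checklist (presence or absence of each behavior) and a four-point rating scale with every point described, in the spirit of the recommendations above. The behaviors and scale descriptions are invented for illustration.

```python
# A checklist records only the presence or absence of each behavior.
checklist = {
    "states a hypothesis": True,
    "identifies variables": True,
    "describes a control condition": False,
}
checklist_score = sum(checklist.values())  # 2 of 3 behaviors present

# A rating scale places the response on a continuum; an even number of
# points (here four) avoids a neutral middle point, and each point is
# described as explicitly as possible.
organization_scale = {
    1: "Ideas presented in no discernible order.",
    2: "Some grouping of related ideas, but transitions are missing.",
    3: "Ideas grouped logically with occasional weak transitions.",
    4: "Ideas sequenced logically with clear transitions throughout.",
}
rating = 3
print(checklist_score, organization_scale[rating])
```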
As with selected-response
items, one often wants to generalize from the specific
tasks a student completed
to a broader set of skills measured by the task (in
other words, the domain to which the task belongs).
Then the score on the tasks can be interpreted as an
indication of how the student would generally perform
across similar contexts. Also, one often wants to generalize beyond
a particular rater. If the score depends heavily on
the particular tasks or raters, the score is not reliable.
Using multiple tasks and raters can increase reliability
(Erwin, 1991). Also, clearly specifying the rating
scale and training the raters will help reliability.
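One common way to quantify the gain from adding comparable tasks or raters is the Spearman-Brown formula, which projects the reliability of a score averaged over k tasks or raters from the reliability of a single one. The sketch below applies that formula; the starting reliability of .50 is an arbitrary value chosen for illustration.

```python
def spearman_brown(single_reliability, k):
    """Projected reliability of an average over k comparable tasks or raters."""
    r = single_reliability
    return (k * r) / (1 + (k - 1) * r)

# Starting from a single-task (or single-rater) reliability of .50:
for k in (1, 2, 4):
    print(k, round(spearman_brown(0.50, k), 2))
# 1 -> 0.5, 2 -> 0.67, 4 -> 0.8
```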
The descriptions of rating
scale points are often called rubrics. Rubrics are
holistic when they are
used to give a single overall score for the task, or
analytic when they are used to give scores for separate
aspects of the task. In writing, for example, papers
may be given several analytic scores for style, content,
and writing conventions, or they may be scored with
a holistic rubric, which incorporates all these elements.
Klein et al. (1998) found holistic scores were as reliable
as analytic scores while taking far less rater time,
though they noted holistic scores may be harder to
justify because it is not as apparent how the decision
was reached. Erwin (1991) noted that analytic scores
give greater diagnostic information and feedback.
Rubrics can also be either
general or task-specific. A general rubric can be used
across many tasks of the
same type, while a task-specific rubric contains elements
specific to the task for which it was designed. In
mathematics, for example, a general rubric might contain
descriptions of student problem-solving behavior that
would apply to many contexts. A task-specific rubric,
on the other hand, would describe particular behaviors
that are likely to occur in the specific task context.
References
Erwin, T. D. (1991). Assessing student learning and
development. San Francisco: Jossey-Bass.
Farr, R., & Tone,
B. (1994). Portfolio and performance assessment:
Helping students evaluate their progress
as readers and writers. Fort Worth: Harcourt Brace.
Illinois
State Board of Education, Department of School Improvement
Services, School and
Student Assessment
Section. (1995). Effective Scoring Rubrics: A guide
to their development and use. Springfield, IL: Author.
Klein, S. P., Stecher,
B. M., Shavelson, R. J., McCaffrey, D., Ormseth,
T., Bell, R. M., Comfort, K., & Othman,
A. R. (1998). Analytic versus holistic scoring of science
performance tasks. Applied Measurement in Education,
11, 121-138.
Glossary of Terms
Construct: The cognitive area, skill, or trait measured.
Discrimination: If an item discriminates well, students
who get other items right tend to get this item right
as well. Discrimination is highly related to reliability.
Domain: The set of all possible items that measure
a construct. The domain is usually hypothetical.
Reliability: The particular questions or items on
an assessment instrument are only a few of the possible
items one could write to measure the desired construct.
If we tested students again, with similar but different
items, we would want the scores from the two tests
to be correlated, or reliable. This can be estimated
from the correlations of items within the instrument.
If scores are reliable, they are consistent across
items and testing occasions.
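As a rough illustration of how reliability and discrimination can be estimated from item scores, the sketch below computes coefficient alpha (an internal-consistency estimate based on how item scores covary) and a simple item-total discrimination index for a tiny invented set of 0/1 item scores. The data are made up for the example; real estimates require far more students and items.

```python
import statistics

# Rows are students, columns are items (1 = correct, 0 = incorrect).
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

k = len(scores[0])
totals = [sum(row) for row in scores]

# Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of totals).
item_variances = [statistics.pvariance([row[i] for row in scores]) for i in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_variances) / statistics.pvariance(totals))
print(round(alpha, 2))  # about 0.66 for these invented data

# A simple discrimination index: the correlation between an item and the
# total of the remaining items (students who do well overall should tend
# to get the item right). Requires Python 3.10+ for statistics.correlation.
def discrimination(i):
    item = [row[i] for row in scores]
    rest = [total - row[i] for row, total in zip(scores, totals)]
    return statistics.correlation(item, rest)

print([round(discrimination(i), 2) for i in range(k)])
```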