
Writing Test Items

Selected-Response Assessments

To answer selected-response test items, students choose the best answer among two or more possible responses.

Multiple-choice items. The multiple-choice item is a common type of selected-response item. The item begins with a question or partial statement, termed the item stem. Several options that answer the question or complete the statement are provided, and the student chooses one. Incorrect options are sometimes called distractors. In the item below, options a, b, and d are distractors.

1) What are the primary colors?

a) black and white
b) purple, green, and orange
c) red, yellow, and blue
d) khaki and olive

The stem could also be written as a partial statement instead of a complete question:

1) The primary colors are _______________ .

a) black and white
b) purple, green, and orange
c) red, yellow, and blue
d) khaki and olive

In some studies, items were more reliable (led to more stable scores) when the stems were written as complete questions (Haladyna & Downing, 1989).

Distractors should be plausible. Trevisan, Sax, and Michael (1994) found there was little difference in reliability (stability) between a test composed of 4-option items and a test composed of 5-option items. When the number of items that can be administered in a given time frame is considered (items with fewer options can be answered more quickly), 3-option items were almost as reliable as 4-option items. Generally, item-writers try to keep the length of the options fairly similar; several studies summarized in Haladyna and Downing (1989) found items were easier when the correct answer tended to be longer than the other options.

Matching items are another type of selected-response item. Guessing can be reduced on matching items by using more options than items; a brief simulation of this appears after the example below.

For each work in column A, choose the author from column B. Write the letter of the chosen author in the blank beside the title.

___ Great Expectations (a) John Grisham
___ Rosemary's Baby (b) Charlotte Bronte
___ Wuthering Heights (c) Emily Bronte
___ The Firm (d) Shakespeare
___ The Tempest (e) Ira Levin
  (f) Tolstoy
  (g) Charles Dickens

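To illustrate the point about guessing, here is a minimal Python sketch (the function and data are hypothetical illustrations, not part of the original page) that estimates how many matches a student gets right by guessing alone, with and without extra options:

    import random

    def expected_correct_by_guessing(n_items, n_options, trials=100_000):
        """Estimate the average number of correct matches from pure guessing.

        The correct answer for item i is option i; the guesser assigns
        n_items distinct options at random (no option is reused).
        """
        total = 0
        for _ in range(trials):
            guess = random.sample(range(n_options), n_items)
            total += sum(1 for i, g in enumerate(guess) if g == i)
        return total / trials

    # 5 titles and 5 authors: a pure guesser averages about 1 correct match.
    print(expected_correct_by_guessing(5, 5))
    # 5 titles and 7 authors (two extra distractors): the average drops to about 5/7.
    print(expected_correct_by_guessing(5, 7))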

True-false items are another familiar type.

1) Ice and rain are forms of precipitation.

a) True
b) False

There is much disagreement about the limitations of true-false items (see Frisbie & Becker, 1990, for a brief summary of the major views).

A less common item type is multiple-true-false (MTF). Multiple-true-false items share a common stem, and the student responds true or false to each option, which extends the stem.

________ is an automobile model made by Mazda.

1) Miata

a) True
b) False

2) Camry

a) True
b) False

3) RX7

a) True
b) False

4) Accord

a) True
b) False

5) Altima

a) True
b) False

Frisbie (1990) summarized studies of MTF items and concluded that MTF items measure the same constructs as multiple-choice items, and that more MTF items can be given in the same amount of time. Downing, Baranowski, Grosso, and Norcini (1995) had similar findings, but noted that multiple-choice items were somewhat more highly correlated with external criteria (for example, other tests or grades), perhaps because the MTF items in their study tended to measure lower-level thinking.

A variant on the multiple-choice item is called Type K (complex multiple choice). This item type is seen on some standardized tests.

1) Which of the following are needed to calculate simple interest?

I. The amount of money borrowed
II. The interest rate
III. The length of the borrowing period

a) I only
b) I and II
c) I and III
d) I, II, and III

Compared with standard multiple-choice items, Type K items are more difficult, fewer can be answered in a given time period, they may be more dependent on test-taking skills, and they often have lower discrimination (Haladyna, 1992).

Common "Rules" for Selected-Response Items

Haladyna and Downing (1989) summarized common rules found in many references. Some of these rules relate to the empirical findings discussed above (number of options, use of Type K). Other rules are what Haladyna and Downing term "values": guidelines shared by measurement experts, often based on common sense rather than empirical evidence. Some of these rules are paraphrased below.

  • Edit the items for basic grammar, punctuation, and spelling. All the option choices should use parallel grammar to avoid giving clues to the right answer.
  • The option choices should address the same content, and the distractors should be reasonable choices for a student with limited or incorrect information. One way to develop distractors is to use common errors students make.
  • Items should be as clear and concise as possible, both so students know what is being asked and to minimize reading time and the influence of reading skills on performance.
  • The stem, not the options, should clearly contain the question or problem situation. Students should know what the gist of the item is without reading the options.
  • Vocabulary should be appropriate for the level of the test.
  • "Focus on a single problem" items with multiple clauses may have multiple correct answers depending on which aspect the student focuses.
  • To avoid testing rote facts, do not use the same words and phrasing as the textbook.
  • Multiple-choice items can be used to measure higher-level thinking. Consider how a student needs to think to answer the item.
  • While the research on "none of the above" and "all of the above" is not decisive, many recommend using these sparingly, if at all.

References

Case, S. M., & Swanson, D. B. (1998). Constructing written test questions for the basic and clinical sciences [On-line]. Available: http://www.nbme.org/new.version/item.htm

Downing, S. M. (1992). True-false, alternate-choice, and multiple-choice items. Educational Measurement: Issues and Practice, 11 (3), 27-30.

Downing, S. M., Baranowski, R. A., Grosso, L. J., & Norcini, J. J. (1995). Item type and cognitive ability measured: The validity evidence for multiple true-false items in medical specialty certification. Applied Measurement in Education, 8, 87-97.

Frisbie, D. A., & Becker, D. F. (1990). An analysis of textbook advice about true-false tests. Applied Measurement in Education, 4, 67-83.

Haladyna, T. M. (1992). The effectiveness of several multiple-choice formats. Applied Measurement in Education, 5, 73-88.

Haladyna, T. M. (1994). Developing and validating multiple-choice test items. Hillsdale, NJ: Lawrence Erlbaum Associates.

Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2, 37-50.

Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2, 51-78.

Roid, G. H., & Haladyna, T. M. (1982). A technology for test-item writing. Orlando, FL: Academic Press.

Trevisan, M. S., Sax, G., & Michael, W. B. (1994). Estimating the optimum number of options per item using an incremental option paradigm. Educational and Psychological Measurement, 54 (1), 86-91.

Constructed-Response Assessments

For constructed-response items, tasks, or projects, students must supply or construct a response rather than selecting from among supplied alternatives.

Fill-in-the-blank and short-answer items are among the simplest of the constructed-response formats. They are similar to multiple-choice items: the student is expected either to complete a statement or to answer a question with a word or phrase (perhaps a sentence or two).

1) What are the three primary colors?

1) The primary colors are __________, __________, and __________.

The most common type of constructed-response item is the extended-written-response item, used broadly here to include any test item or prompt to which students respond with an essay, description, or explanation (including diagrams, charts, or mathematical solutions, as well as written text).

1) Describe the steps in the scientific method.

1) Tell about a situation where you gave your best effort.

1) Find the area between these curves: y = x^2 + 1 and y = -x^2 + 10

Projects, portfolios, experiments, and demonstrations are also constructed-response assessments. They may involve written documents (reports, handouts, PowerPoint presentations, lab notebooks, journals, mathematical proofs, collections of assignments), non-written products (artwork, audio or videotape, executable computer programs, web pages), and/or presentations, demonstrations, or performances (dance, theatre, and musical concerts, athletic performances, teaching demonstrations, counseling sessions, classroom presentations, oral tests).

Scoring constructed-response assessments is obviously more complex than scoring selected-response items (which can easily be delegated to a computer). Rating scales and checklists are used to score student responses. The distinction between them is that checklists indicate simply the presence or absence of some behavior (Erwin, 1991), while a rating scale has a continuum. Rating scales have several possible score points (often four to six). Some recommend using an even number of points so there is no middle or neutral point. Each point on the scale should be described as explicitly as possible.
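
To make the distinction concrete, the sketch below (a hypothetical Python illustration, not an instrument described in the text) represents a checklist as presence/absence marks and a rating scale as a small set of explicitly described score points:

    # Hypothetical checklist: records only presence (True) or absence (False)
    # of each behavior.
    presentation_checklist = {
        "stated the main thesis": True,
        "cited at least two sources": False,
        "stayed within the time limit": True,
    }
    checklist_score = sum(presentation_checklist.values())  # count of behaviors observed

    # Hypothetical rating scale: a continuum with an even number of points,
    # each point described as explicitly as possible.
    organization_scale = {
        1: "No discernible structure; ideas appear in random order.",
        2: "Some structure, but transitions are missing or confusing.",
        3: "Clear overall structure with occasional weak transitions.",
        4: "Logical structure throughout, with smooth transitions between ideas.",
    }
    organization_rating = 3  # the rater selects the point that best fits the response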

As with selected-response items, one often wants to generalize from the specific tasks a student completed to the broader set of skills those tasks measure (in other words, the domain to which the tasks belong). The score on the tasks can then be interpreted as how the student would generally perform across similar contexts. One also often wants to generalize beyond a particular rater. If the score depends heavily on the particular tasks or raters, the score is not reliable. Using multiple tasks and raters can increase reliability (Erwin, 1991), as can clearly specifying the rating scale and training the raters.
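
As a rough sketch of the rater point (the scores below and the use of a simple correlation are illustrative assumptions, not a procedure prescribed by Erwin), the correlation between two raters' scores on the same responses gives one crude index of how much a score depends on the particular rater, and averaging the raters reduces the influence of any one rater's severity:

    from statistics import mean

    # Hypothetical scores from two raters on the same six essays (1-6 rubric).
    rater_a = [4, 3, 5, 2, 4, 6]
    rater_b = [5, 3, 4, 2, 5, 6]

    def pearson(x, y):
        """Pearson correlation between two lists of scores."""
        mx, my = mean(x), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        var_x = sum((a - mx) ** 2 for a in x)
        var_y = sum((b - my) ** 2 for b in y)
        return cov / (var_x * var_y) ** 0.5

    print(pearson(rater_a, rater_b))  # values near 1.0 suggest the raters largely agree

    # Averaging the two raters gives a score less dependent on either rater alone.
    combined = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]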

The descriptions of rating scale points are often called rubrics. Rubrics are holistic when they are used to give a single overall score for the task, or analytic when they are used to give scores for separate aspects of the task. In writing, for example, papers may be given several analytic scores for style, content, and writing conventions, or they may be scored with a holistic rubric, which incorporates all these elements. Klein et al. (1998) found holistic scores were as reliable as analytic scores while taking far less rater time, though they noted holistic scores may be harder to justify because it is not as apparent how the decision was reached. Erwin (1991) noted that analytic scores give greater diagnostic information and feedback.

Rubrics can also be either general or task-specific. A general rubric can be used across many tasks of the same type, while a task-specific rubric contains elements specific to the task for which it was designed. In mathematics, for example, a general rubric might contain descriptions of student problem-solving behavior that apply to many contexts. A task-specific rubric, on the other hand, would describe particular behaviors that are likely to occur in the specific task context.

References

Erwin, T. D. (1991). Assessing student learning and development. San Francisco: Jossey-Bass.

Farr, R., & Tone, B. (1994). Portfolio and performance assessment: Helping students evaluate their progress as readers and writers. Fort Worth: Harcourt Brace.

Illinois State Board of Education, Department of School Improvement Services, School and Student Assessment Section. (1995). Effective Scoring Rubrics: A guide to their development and use. Springfield, IL: Author.

Klein, S. P., Stecher, B. M., Shavelson, R. J., McCaffrey, D., Ormseth, T., Bell, R. M., Comfort, K., & Othman, A. R. (1998). Analytic versus holistic scoring of science performance tasks. Applied Measurement in Education, 11, 121-138.

Glossary of Terms

Construct: The cognitive area, skill, or trait measured.

Discrimination: If an item discriminates well, students who get other items right tend to get this item right as well. Discrimination is highly related to reliability.
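
One classical way to quantify discrimination (the index below is a common choice, though the glossary does not prescribe a formula, and the data are hypothetical) is to compare how often high scorers and low scorers on the test answer the item correctly:

    def discrimination_index(item_scores, total_scores):
        """Upper-lower discrimination: p(correct | high group) - p(correct | low group)."""
        paired = sorted(zip(total_scores, item_scores))   # sort students by total score
        half = len(paired) // 2
        low = [item for _, item in paired[:half]]
        high = [item for _, item in paired[-half:]]
        return sum(high) / len(high) - sum(low) / len(low)

    # Hypothetical data: 1 = correct, 0 = incorrect on the item of interest,
    # plus each student's total score on the test.
    item = [1, 1, 0, 1, 0, 0, 1, 0]
    totals = [19, 17, 9, 16, 11, 8, 15, 12]
    print(discrimination_index(item, totals))   # values near +1 indicate strong discrimination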

Domain: The set of all possible items that measure a construct. The domain is usually hypothetical.

Reliability: The particular questions or items on an assessment instrument are only a few of the possible items one could write to measure the desired construct. If we tested students again with similar but different items, we would want the scores from the two tests to be correlated, or reliable. Reliability can be estimated from the correlations of items within the instrument. If scores are reliable, they are consistent across items and testing occasions.
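
One widely used estimate based on the variances and correlations of items within an instrument is Cronbach's alpha; the glossary does not name a specific coefficient, so the Python sketch below is only illustrative, with made-up data:

    from statistics import pvariance

    def cronbach_alpha(item_matrix):
        """Internal-consistency estimate: item_matrix[s][i] is student s's score on item i."""
        k = len(item_matrix[0])                     # number of items
        item_vars = [pvariance([row[i] for row in item_matrix]) for i in range(k)]
        total_var = pvariance([sum(row) for row in item_matrix])
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    # Hypothetical data: five students, four items scored 0/1.
    scores = [
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 0],
    ]
    print(cronbach_alpha(scores))   # values closer to 1.0 indicate more consistent scores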

 
