The third step of the Assessment Cycle is deciding how to measure your stated student learning outcomes (SLOs). During this step of the process, you will need to consider questions such as:

  • Should you select a preexisting instrument or develop a new one?
  • What are pros and cons of selecting or developing a new instrument?
  • Does the instrument align with your SLOs?
  • Which assessment items map to which SLOs?
  • Are you using a direct measure of student learning?
  • Is the instrument reliable?
  • Is there validity evidence to support your desired interpretation of the instrument?
Step 3
Alignment of Instrument to SLOs

Verb-Instrument Agreement
All SLOs include an action verb indicating what a student is expected to know, think, or do as a result of program participation. Each verb acts as a hint about what type of instrument is appropriate. For example, an SLO that states students will be able to "recognize" certain information could be assessed by a multiple choice or matching question. In contrast, for an SLO that states students should be able to "explain" something, an open-ended question would be more appropriate. Consider the additional examples below:

Verb Agreement

Instrument-to-Outcome Map
After selecting or designing an instrument that aligns with your SLOs, it is useful to develop an instrument-to-outcome map. In this map, you will specify which assessment instruments measure which SLOs. If an instrument measures more than one SLO, it may be helpful to identify which specific items map to each SLO. Additionally, it may be helpful to include information about instrument quality and how each instrument/item will be scored in your instrument-to-outcome map. Consider the example below:

Instrument to Outcome map
Types of Instruments

There are three main types of instruments: cognitive instruments, attitudinal instruments, and performance assessments. As alluded to above, the verbs in your SLOs will dictate which type of instrument you should use.

Cognitive Instruments are generally used to assess knowledge or reasoning. Multiple choice tests are a classic example of a cognitive instrument. Other examples include matching items, true/false statements, short essays, and sentence completion items.

Attitudinal Instruments are concerned with the attitudes, beliefs, or preferences students may have. These instruments often use Likert-type scales that ask students to indicate the extent to which they agree or disagree with a certain statement.

Performance Assessments are used to evaluate students' skills and behaviors as evidenced through certain products or performances. More specifically, raters are trained to evaluate the product or performance according to a rubric that specifies various criteria for success.

Direct vs. Indirect Measures
Often a distinction is made between "direct" and "indirect" measures. We generally consider direct measures of student learning to be instruments that require students to actually use the skills we are interested in measuring (e.g. a writing assignment meant to assess writing skills). In contrast, indirect measures do not call on students to use the skills being measured. Instead, these instruments rely on indirect evidence to infer that students possess certain skills (e.g., a self-report survey where students are asked to indicate whether they believe they've mastered various writing skills).

Although we often talk about instruments as being either direct or indirect, there is no such thing as a 100% direct measure. An instrument's "directness" is better conceived along a continuum ranging from relatively more direct to relatively less direct. Measures that are more direct are always preferable to those that are less direct.

Direct or Indirect

It is important to note that the directness of an instrument depends on what is being measured. For example, if we were interested in students' critical thinking skills, asking them to indicate how good they think they are at critical thinking (a self-report measure) would be a less direct measure than an assignment asking them to critique a journal article. However, if we were interested in students' beliefs about their critical thinking skills, then the self-report measure would be considered more direct than the journal critique.

Selecting vs. Designing Instruments

Student affairs practitioners have two options when it comes to measuring their SLOs: selecting a preexisting instrument or designing one from the ground up. There are pros and cons to each approach. Generally speaking, however, it is best to select a preexisting instrument whenever possible.

Designing a good instrument (one that fully captures the knowledge and skills we care about without also capturing irrelevant factors) requires an extensive amount of time, resources, and expertise. We strongly suggest consulting with individuals trained in measurement theory before embarking down this path. Student affairs staff at JMU can schedule an appointment with SASS for assistance. See the table below for more information about selecting vs. designing an instrument:

Selecting

Advantages
  • Convenience
  • Reliability and validity evidence often available
  • Able to make comparisons to others using the same instrument

Disadvantages
  • Alignment (between instrument and SLOs) often less precise
  • May be difficult to find a good instrument (many published instruments have poor psychometric properties)
  • Cost (if purchasing a commercial instrument)

When to select
  • When you want an instrument with strong psychometric properties (i.e., evidence of reliability and validity)
  • When data need to be collected soon
  • When your program is impacting attitudes or skills that have been well-studied (e.g., sense of belonging, cultural competency, career readiness, etc.)
Designing

Advantages
  • The instrument is specifically tailored for your SLOs (i.e., strong alignment)




Disadvantages
  • Resource intensive (it can take a year or more to develop a good instrument)
  • Need to collect validity evidence (this process can also take several months)
  • Comparisons to other groups may be limited



When to design
  • When you can't find an instrument that aligns with your SLOs
  • When your SLOs are highly specific (e.g., knowledge of JMU campus resources)
  • When preexisting instruments have poor psychometric qualities
  • When you have support from someone trained in instrument development
Where to Find Instruments?

The databases below can be used to find both commercial and non-commercial tests. For more information about how to find instruments and for links to other databases, click here.

American Psychological Association Test Database - an unparalleled resource for psychological measures, scales, and instrumentation tools with over 50,000 records

Educational Testing Service (ETS) Test Collection - over 25,000 tests available for researchers, educators and graduate students

ERIC - an extensive online research database; use the advanced search feature to filter by publication type (i.e., tests/questionnaires)

Mental Measurement Yearbook - database that hosts tests from specific areas (developmental, personality). Also includes reviews and critiques for each test

How to Develop an Instrument?

Table of Specifications
One of the first steps to developing an instrument is to create a test blueprint or table of specifications. This table indicates the number or proportion of items on the test that will be written to measure each SLO, skill, or skill dimension. For example, if we are interested in measuring "intercultural competency," we might decide that half of the items should be written to measure students' awareness of their own culture (self-awareness dimension) while the other half should measure students' knowledge of others' cultures (cultural awareness dimension). The development of the table of specifications should be guided by theory and your SLOs. You should ask, what do researchers say are the important dimensions of the skill I'm trying to measure? And, based on my SLOs what types of items should my test include? See the simplified example below:

Specifications

General Guidelines for Item Writing
The following guidelines apply to the writing of cognitive or attitudinal items. For more information and a list of common item-writing mistakes, check out this presentation or this comprehensive guide to item writing.

Cognitive Items

    • Offer 3-4 well-developed answer choices for multiple-choice items. The aim is to provide variability in responses and to include plausible distractors.
    • Distractors should not be easily identified as wrong choices.
    • Use boldface to emphasize negative wording.
    • Make answer choices brief, not repetitive.
    • Avoid the use of all of the above, none of the above, or a combination such as A and B options. It is tempting to use them because they are easy to write, but they tend to lower reliability and make it harder to distinguish between high and low ability students.
    • For matching items, provide more response options than items in the list (e.g., offer 10 response options for a list of seven items). This decreases the likelihood of participants getting an answer correct through the process of elimination.
    • List response options in a set order. Words can be listed in alphabetical order; dates and numbers can be arranged in either ascending or descending order. This makes it easier for a respondent to search for the correct answer.
    • Make sure you do not give hints about the correct answer to other items on the instrument.

Attitudinal Items

    • Avoid “loading” the questions by inadvertently incorporating your own opinions into the items.
    • Avoid statements that are factual or capable of being interpreted as such. Make sure you do not give hints about the correct answer to other items on the instrument.
    • Avoid statements that are so extreme that they are likely to be endorsed by almost everyone or almost no one.
    • Statements should be clearly written -- avoid jargon, colloquialisms, etc.
    • Each item should focus on one idea.
    • Items should be concise. Try to avoid the words “if” or “because,” which complicate the sentence.
Psychometric Properties

Whether selecting a preexisting instrument or designing a new one, it is crucial to evaluate the reliability and validity of the instrument. A broad overview of both concepts is provided below. If you are a JMU employee in the student affairs division and would like to learn more, or need assistance evaluating the reliability and validity of your own instruments, schedule an appointment to meet with a SASS consultant.

Reliability refers to the consistency of scores. We are often interested in one of three types of reliability: internal consistency reliability, inter-rater reliability/agreement, and test-retest reliability.

Internal Consistency - Internal consistency reliability is concerned with the consistency of students' responses across items on the same test. The logic behind internal consistency reliability is simple: if all of the items on a test are believed to measure the same knowlede or skill, then students should respond similarly to the items. In other words, a high-ability student should score relatively high on most items, and a low-ability student should score relatively low on most items. If this is not the case, we may question whether the test is providing meaningful information. The most commonly used index of internal consistency reliability is coefficient alpha. Scores closer to 1 indicate higher reliability.

Inter-Rater Reliability/Agreement - When evaluating a product or performance using a performance assessment, it is desirable to have more than one rater rate each product/performance to reduce potential rater bias. When more than one rater is used, however, it is possible that their ratings might disagree. This is problematic because it introduces uncertainty about the true competency of the student being rated. Inter-rater reliability/agreement indices are useful in this situation because they indicate the degree of similarity between scores from multiple raters. Scores near 1 are desirable for the majority of these indices, as they indicate that raters agree more often.

Test-Retest Reliability - Test-retest reliability refers to the consistency of scores from one time to another one, or one test to another. High test-retest reliability is necessary if we want to use multiple forms of the same test (e.g., introducing a stort version of a lengthy test), or if we plan to collect data at one time point and use it to make decisions at a much later date (e.g., using junior year SAT scores to determine college math placement).

Validity is defined as the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests1. From this definition we can draw three important conclusions:

1. Validity is a process of accumulating evidence. It may be helpful to think about this process as one that mirrors the scientific method: we start with a theory about the validity of our instrument for some purpose (e.g., the Alcohol Use Scale (AUS) is a valid measure of student alcohol use). Based on this theory we then construct a testable hypothesis (if the AUS measures alcohol use, it should be positively correlated with the number of alcohol-related hospitalizations and sanctions that are reported to the student conduct office). Finally, we design a study to test our hypothesis and determine the extent to which our theory is supported based on the results. Just like with the scientific method, the more “tests” we conduct, the stronger the support for our theory (i.e., the validity of our instrument for some purpose).

There are five types of validity evidence instrument developers typically collect: evidence related to test content, response processes, internal structure, relations to other variables, and consequences. For more information about these types of validity evidence, check out our comprehensive guide to selecting and designing instruments.

2. Validity is a matter of degree. Although it is common to hear test developers claim their tests are "valid and reliable," validity is not a benchmark that can be "reached." Determining whether an instrument is valid for a given purpose requires a holistic, subjective judgement that takes into account all of the evidence that has been collected.

3. Validity is concerned with how we interpret and use scores. Another reason it is problematic to call an instrument "valid" is that validity is not a property of instruments; the validity of an instrument depends on how we interpret and use its scores. For example, there may be strong validity evidence to support using the AP Calculus exam for college math placement, however, there would probably be less validity evidence to support using the AP Calculus exam scores for general college admission.

1American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing. (2014). Standards for educational and psychological testing.Washington, DC: AERA. 

Back to Top