Doctoral dissertations from the assessment & measurement program
Stereotype Threat: The Effects of Gender Identification on Standardized Test Performance
by Robin D. Anderson, PsyD (2001)
Advisor: Dr. Donna Sundre
The purpose of this study was to examine whether one of the most common standardized testing procedures, the collection of demographic information prior to testing, facilitates performance decrements in subjects for whom a negative domain performance stereotype exists. The primary investigation involved examining whether the presence of a gender identification section on an optical readable form and the request that the gender section of the form be completed was a priming stimulus sufficient to trigger a stereotype threat effect. This study provided a real world adaptation of previous stereotype threat research. Results indicate that the inclusion of a gender identification item is not a sufficient priming stimulus to trigger stereotype threat patterns in low-stakes assessments. Results do indicate, however, that the removal of such an item may increase motivation and performance for both negatively and positively stereotyped groups.
Validity Study of the Leadership Attitudes and Beliefs Scale III
by Anna Katherine Busby, Ph.D. (2005)
Advisor: Dr. Christine DeMars
This study provides validity evidence for the use of the Leadership Attitudes and Beliefs Scale III (LABS III; Wielkiewicz, 2000) scores. The scale is based upon the ecology theory of leadership (Allen, Stelzner, & Wielkiewicz, 1998), and is designed to measure the attitudes and beliefs college students have toward leadership. This study was conducted with 845 college students at a large, mid-western, urban institution. The content of the LABS III items was examined to determine the relationship between the ecology theory of leadership and the scale. The items did not completely represent the ecology theory. A confirmatory factor analysis (CFA) was conducted to test the hypothesized two-factor model, and the data did not fit the hypothesized model well. The scale was modified using theoretically-supported model modifications and additional research questions were explored. The modified LABS III scores were correlated with scores from the Miville-Guzman Universality-Diversity Scale-Short Form (Fuertes, Miville, Mohr, Sedlacek, & Gretchen, 2000). A moderate correlation was found and this result supported the hypothesis that there is a relationship between attitudes toward diversity and attitudes toward leadership. The modified LABS III scores were also correlated with the subscale scores of the Student Leadership Practices Inventory (Posner & Brodsky, 1992). Moderate correlations were found and this result supports the hypothesis that leadership attitudes are related to leadership practices. It was hypothesized that age would be strongly correlated with leadership attitudes; however, the results did not support this hypothesis. The results also supported the hypothesis that men and women differ in their attitudes toward leadership. Further examination of the ecology theory of leadership in relation to the LABS III and the LABS III factor structure is recommended. The results from this study suggest that a number of theory-based hypotheses were supported.
However, continued refinement of the theory and its relationship to the scale needs to be explicated. Only through continued reflection and careful study can the nomological net of the ecology theory of leadership be developed and contribute to research in leadership.
Invariance of the Modified Achievement Goal Questionnaire Across College Students with and without Disabilities
by Hilary Lynne Campbell, Ph.D. (2007)
Advisor: Dr. Dena Pastor
As an increasing number of students with disabilities (SWDs) is taking part in postsecondary education, postsecondary institutions must meet the needs of this unique population. Because it is linked to important achievement-related outcomes, one area in which educators have historically tried to meet students' needs is achievement goal orientation (AGO). Educators must ensure that they are able to measure AGO for SWDs and to determine whether SWDs would benefit from different services or educational methods than their nondisabled peers. In the K-12 literature, studies suggest that SWDs may have different AGO profiles than their peers, but no such research has been conducted for college students.
One specific instrument designed to measure AGO, the modified Achievement Goal Questionnaire (AGQ-M; Finney, Pieper, & Barron, 2004), was administered to college students with and without disabilities. Confirmatory factor analyses were conducted with both populations to test the four-factor structure of AGO (Mastery-Approach, Mastery-Avoidance, Performance-Approach, Performance-Avoidance). Next, a series of tests were conducted to test the measurement and structural invariance of the AGQ-M across students with and without disabilities. Finally, latent means for the two samples on each dimension of AGO were compared.
The four-factor model of AGO fit both samples well. Further, invariance of factor loadings (metric invariance), intercepts (scalar invariance), error variances, factor variances, and factor covariances were supported. Since the AGQ-M was found to be invariant, latent means were compared. In contrast to previous findings in the literature, results indicated no significant or practically meaningful differences between these two groups on any of the four dimensions of the AGQ-M. These results suggest that college students with and without disabilities may not have markedly different AGO profiles.
Results may differ from previous findings because the sample of SWDs in this study had already completed several semesters of college at a moderately selective institution; these students likely differed in important ways from the general population of SWDs. This study lays the groundwork for a host of future studies, including replication studies, involving specific disability groups, and linking AGO profiles to external achievement-related variables for college students with and without disabilities.
Using explanatory item response models to examine the impact of linguistic features of a reading comprehension test on English language learners
by Jaime A. Cid, Ph.D. (2009)
Advisors: Dr. Dena Pastor and Dr. Joshua Goodman
The unintended consequences of high-stakes testing decisions made on scores that may vary as a function of language proficiency have been noted as a major threat to English language learners (ELLs) (Herman & Abedi, 2004; Mahoney, 2008). While several studies have focused on the effects of language proficiency in high-stakes science and math examinations, the impact of English language proficiency on reading comprehension tests has received far less attention. Furthermore, the effects that specific linguistic features of reading comprehension tasks have on ELLs' test performance have been noticeably understudied. The overall aim of this study was to examine the impact of seven linguistic features (false cognates, homographs, negative wording, propositional density, surface structure, syntactic complexity, and vocabulary) of a high-stakes reading comprehension test on Spanish-speaking ELLs using explanatory item response models conceptualized as Hierarchical Generalized Linear Models (HGLMs). More specifically, in a 40-item reading test, explanatory item response models were used to investigate: (a) differential item functioning (DIF) for ELLs and non-ELLs in a traditional manner; (b) whether items consisting of certain linguistic features were differentially difficult; (c) the extent to which linguistic features may be differentially difficult for ELLs in comparison to non-ELLs; and (d) whether the difficulty of the items with such linguistic features varied across ELLs with different years of formal exposure to Spanish as primary language of academic instruction. The results of investigating DIF in a traditional manner revealed that six items (four favoring non-ELLs and two favoring ELLs) displayed DIF with group differences of at least half a logit. The estimates of the effects of the seven linguistic features were statistically significant (p < 0.0001).
However, only false cognates, negative wording, surface structure, and vocabulary increased the difficulty of an item. The differential functioning of the seven linguistic features revealed that the log-odds of getting a typical item right were 0.4867 logits lower for ELLs compared to non-ELLs. However, from a practical significance perspective, the linguistic features were not differentially difficult for the two groups. While the results of the linguistic feature combinations showed that the majority of the features displayed differential difficulty in favor of non-ELLs, none of them can be considered of practical significance. Finally, items with only false cognates were less difficult for ELLs with more years of exposure to Spanish as primary language of academic instruction. The benefits of the explanatory properties of English language status as a person-level predictor in a reading comprehension test along with practical implications of the current research and directions for future research are discussed.
Methods for Identifying Differential Item and Test Functioning: An Investigation of Type I Error Rates and Power
by Amanda M. Dainis, Ph.D. (2008)
Advisors: Dr. J. Christine Harmes and Dr. Christine DeMars
This study examined bias, and therefore fairness, by investigating methods used for identifying differential item functioning (DIF). Four DIF-detection methods were applied to simulated data and empirical data. These techniques were selected to focus on a relatively new method, DFIT, and compare it to another IRT-based method (likelihood ratio test), and two Classical Test Theory-based methods (logistic regression and Mantel-Haenszel). Within the simulation study, four factors were manipulated: sample size, the presence and absence of impact, the uniformity and non-uniformity of the DIF, and the magnitude of the DIF. The Type I error and power rates of the methods were examined, and results indicated that the performance of the methods depended on the data conditions. The DFIT method had low Type I error rates across all simulated conditions. Regardless of the absence or presence of impact, the likelihood ratio test and the logistic regression main effect test had elevated Type I error rates under both sample size conditions. While the Mantel-Haenszel method's error rates were satisfactory across all conditions, its power was low when detecting non-uniform DIF. High power was demonstrated by the DFIT and likelihood ratio methods, but the logistic regression method yielded unsatisfactory power rates under the impact present condition. The DFIT method, as the central focus of this investigation, warrants further attention. A particular concern is the method's performance when applied to smaller sample sizes, due to fitting a 3PL model to a dataset with insufficient sample size. Another area for further investigation is the Item Parameter Replication (IPR) procedure, which is used to establish statistical significance within the DFIT framework. 
Although it has proven to be a reasonably efficient technique for establishing statistical significance, its conservative performance in the empirical portion of this study suggests the need for further examination under conditions with smaller amounts of DIF. DIF detection plays an integral part in constructing a fair and unbiased test. Based on empirical evidence, such as that reported here, researchers and practitioners should examine how an item or test is functioning statistically before spending resources to examine a conceptual, underlying cause of DIF.
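As an illustration of one of the four methods compared above, the following is a minimal sketch of the Mantel-Haenszel common odds ratio, computed across matched total-score strata. The function names and the tuple layout of each stratum are hypothetical choices made for this example and are not taken from the dissertation; the delta transformation is the conventional -2.35 ln(alpha) rescaling.

```python
from math import log

def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across score strata.

    Each stratum is a tuple (hypothetical layout for this sketch):
    (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # an empty stratum carries no information
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator

def mh_delta(alpha):
    """Rescale the odds ratio to the delta metric; 0 means no uniform DIF."""
    return -2.35 * log(alpha)
```

An odds ratio near 1 (delta near 0) suggests the item behaves similarly for matched examinees in both groups. Because the statistic pools a single odds ratio over strata, it is aimed at uniform DIF, which is consistent with the low power for non-uniform DIF reported in the abstract.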
Achievement Goal Orientation Across the College Career: A Latent Growth Analysis
by Susan Lynn Davis, Ph.D. (2005)
Advisor: Dr. Sara Finney
Assessing student development can be a challenge in that such constructs are difficult to define and difficult to measure. However, the need exists for universities to understand students' personal development as they progress through college. Although there are many important facets of student development worthy of examination, this study focused on one aspect of development commonly referenced in university mission statements: students' premonition for lifelong learning. Previous research has noted the difficulty in determining if universities are creating lifelong learners; however, this study attempted to examine this development by means of a related concept: student achievement goal orientation. One cohort of students was assessed on three occasions during college to estimate change in five dimensions of student achievement goal orientation: mastery-approach, performance-approach, mastery-avoidance, performance-avoidance, and work-avoidance. In addition to addressing the need for information on student development, this study attempted to address the shortcomings of prior longitudinal research, for example, by employing specific methodologies that allow inclusion of partial records, estimation of individual variation within change, examination of measurement invariance, and fluctuation within patterns of change. Before estimating change over time, it was first determined that the measurement of goal orientation was psychometrically stable across the three assessments, as indicated by the sufficient level of measurement invariance. Change was estimated using Latent Growth Modeling, which allowed the estimated pattern of change to be explicitly identified and described. Individual variation in change was also found and used to address ancillary research questions regarding change across dimensions of goal orientation and the relationship between initial goal orientation and change in goal orientation.
All five dimensions of goal orientation exhibited significant change across the three assessments. The identified patterns of change present interesting information for student development and student motivation. Discussion of this estimated change includes exploration of the change in terms of achievement goal orientation, students' motivational perspective, and the development of lifelong learners.
Towards Measuring Lifelong Learning: The Curiosity Index
by Keston H. Fulcher, Ph.D. (2004)
Advisor: Dr. T. Dary Erwin
Construct ambiguity and methodological shortcomings of instrument development have obscured the meaning of curiosity research. Nonetheless, it is an important construct, especially since it has been linked recently to lifelong learning. The purpose of these studies is to collect validity evidence for a new self-report questionnaire, The Curiosity Index (CI), which is based on Ainley's (1987) parsimonious breadth and depth conceptualization of curiosity. Proctors administered the CI to 1042 college freshmen, 854 college sophomore/juniors, and 74 members of a lifelong learning institute. In Study 1, freshmen CI data were analyzed using confirmatory factor analysis in an exploratory manner to identify items best representing the two-factor model. After selective item removal, all indices except for the RMSEA suggested good fit. In addition to the CI, college freshmen took several other instruments. In Study 2, scores derived from these instruments were correlated to the total, breadth, and depth scores. As predicted, the total CI, breadth, and depth scores correlated moderately to highly with trait curiosity and intrinsic motivation, lowly to confidence, not at all to intelligence or extrinsic motivation, and negatively to work-avoidance. In addition, mastery-approach correlated higher to depth than to breadth as predicted. In Study 3, average total, breadth, and depth scores of freshmen, sophomore, and lifelong learners were compared via ANOVAs. It was predicted that lifelong learners would have the highest scores on all categories, then sophomores, then freshmen. Lifelong Learning Institute members and sophomores did score significantly higher on total and depth curiosity than freshmen; however, no other differences were found. In Study 4, item response theory was used to investigate the amount of information obtained by the CI along the continuum of curiosity, from the least curious to the most curious students. 
Generally, information was high; however, students scoring 1.5 SDs above the mean or higher were measured less reliably. Overall, the results support the use of the Curiosity Index for measuring breadth and depth curiosity. Future directions of validation include additional correlational studies with other curiosity measures, reversing the response scale, and creating more difficult breadth items.
Examining the Psychometric Properties of a Multimedia Innovative Item Format: Comparison of Innovative and Non-Innovative Versions of a Situational Judgment Test
by Sara Lambert Gutierrez, Ph.D. (2009)
Advisor: Dr. J. Christine Harmes
In the measurement field, innovative item formats have shown promise for increasing the capability to assess constructs not easily measured with traditional item formats. These items are often assumed to also provide opportunities for better measurement. However, little empirical research exists to support these assumptions. The purpose of this study was to explore the psychometric properties of a multimedia innovative item type and then compare the results to the properties of a non-innovative item format. Participants were administered one of two tests of identical content: one consisting of an innovative item format and the other consisting of a non-innovative item format. Exploratory factor analyses were conducted to evaluate the dimensionality of the two tests. The graded-response model was fit to both tests to produce item and test level characteristic curves, allowing for the examination of the reliability, or information, produced by each test and the individual items. Measurement efficiency, a ratio of the average amount of information provided relative to the average amount of time taken, was also reviewed. Face validity was examined by analyzing participant ratings on an eight-item post-test survey. Finally, criterion-related validity was investigated for the innovative item format by examining the relationship between test scores and supervisors’ ratings of employee performance. Findings from this research suggest that the use of innovative items may alter the underlying construct of an assessment, and could potentially provide more measurement information about examinees with low prioritization skills. Also, innovative item formats do not necessarily decrease measurement efficiency, as has been previously suggested. Participants’ perceptions of the tests indicated that they felt the innovative version provided a more realistic experience and increased levels of engagement. 
Criterion-related validity scores on the innovative version were inconsistent across two samples. The key implication of these results applies to any practitioner employing innovative items: the addition of innovative item formats likely alters the measurement properties of a test. Further examination is needed to understand whether or not the alteration results in better measurement. As the overall psychometric functioning of both versions of the assessment was low, replication is recommended prior to generalizing these results.
Integrating and Evaluating Mathematical Models of Assessing Structural Knowledge: Comparing Associative Network Methodologies
by Emily R. Hoole, Ph.D. (2005)
Advisor: Dr. Christine DeMars
Structural knowledge assessment is a promising area of study for curriculum design and teaching, training, and assessment, but many issues in the field remain unresolved. This study integrates an associative network method, the Power Algorithm from the field of text comprehension, into the realm of structural knowledge assessment by comparing it to an already established associative network method, Pathfinder Analysis. Faculty members selected the fifteen most important concepts in Classical Test Theory. Students and faculty then completed similarity ratings for each concept pair using an online survey program, SurveyMonkey. A variety of similarity measures for the Power Algorithm networks and Pathfinder networks were used to predict course performance in a graduate level measurement class. For the Power Algorithm networks, the correlation between the student and expert links between the concepts in the associative network was computed, along with the congruence coefficient between the associative network links. Finally, a measure of network coherence, harmony, was calculated for each Power Algorithm network. For the Pathfinder networks, the NETSIM measure of similarity between the student and expert networks was computed. An unusual finding for the Pathfinder measure of similarity, NETSIM, was uncovered, in which NETSIM values negatively predicted course performance. Results indicate that the Power Algorithm similarity measures did not uncover a latent structure in the data, but that network harmony might possibly serve as an indicator of quality for knowledge structures. Further investigation of the use of harmony in structural knowledge assessment is recommended.
Using Verbal Reports to Explore Rater Perceptual Processes in Scoring: An Application to Oral Communication Assessment
by Jilliam N. Joe, Ph.D. (2008)
Advisor: Dr. J. Christine Harmes
Performance assessment has shown increasing promise for meeting educators' needs for "authenticity" in assessment that many argue is missing from standardized multiple choice testing. However, for all of its merits, performance assessment continues to present a formidable challenge to measurement theory and practice when human raters are a component of scoring. There is little known about the cognitive processes raters employ in scoring, and in particular, scoring for oral communication assessments. The purpose of this study was to explore feature attention within an oral communication assessment scoring context, and how feature attention influenced decisions. An additional purpose was to investigate the utility of verbal reports as a method for collecting perceptual data within an aurally and visually intensive context. The present study employed a concurrent complementarity mixed methods design (Greene, Caracelli, & Graham, 1989), in which concurrent and retrospective verbal report methods were used to gather cognitive data from experienced and inexperienced raters. Specifically, verbal report data were examined to discover meaningful patterns in feature attention, as well as alignment between raters' internal frameworks and the test developer's scoring framework. Generalizability Theory was used to answer questions related to verbal report impact on scoring. Self-report data on perceived difficulty of the scoring task were also collected within each condition of verbal reporting. The findings from this research suggest that raters' internal frameworks as applied in the service of scoring did not align with the test developer's framework. Raters did not consistently attend to the features found in the scoring rubric, nor did they adhere to the scoring system (analytic). Raters demonstrated complex integrative processes that often violated assumptions held about the rating process.
Experienced raters, in particular, engaged in feature attention and subsequent decision-making that often "borrowed" information from other traits to better inform judgments, particularly when the rater endeavored to establish causal relationships for failures in trait mastery. These findings have several implications for rater selection and training procedures, as well as test development in oral communication.
Using the Right Tool for the Job: An Analysis of Item Selection Statistics for Criterion-Referenced Tests
by Andrew T. Jones, Ph.D. (2009)
Advisor: Dr. Christine DeMars
In test development, researchers often depend upon item analysis in order to select items to retain or add to an exam form. The conventional item analysis statistic is the point-biserial correlation. This statistic was developed to select items that would maximize the reliability indices of norm-referenced tests. When the focus of the exam is norm-referenced scores, then the point-biserial correlation works well as an item selection tool. However, the point-biserial correlation is also used in testing contexts where it may be less useful, specifically on criterion-referenced tests. Criterion-referenced tests have different reliability indices than norm-referenced tests, known as decision consistency indices. As such, using the point-biserial correlation to select items to maximize decision consistency may not have as much utility as other options. Researchers have developed several criterion-referenced item analysis statistics that have yet to be fully evaluated for their utility in selecting items for criterion-referenced tests. The purpose of this research was to evaluate each of the respective criterion-referenced item selection tools as well as the point-biserial correlation to determine which one optimized decision consistency.
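For readers unfamiliar with the conventional statistic discussed above, the following is a minimal sketch of the point-biserial correlation, which is simply the Pearson correlation between a dichotomous (0/1) item score and a continuous total score. The function name and argument layout are hypothetical; the abstract does not specify an implementation, and in practice the total score is often computed with the studied item removed.

```python
from math import sqrt

def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total scores."""
    n = len(item_scores)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 examinees")
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    # Sums of cross-products and squares about the means
    cross = sum((x - mean_item) * (y - mean_total)
                for x, y in zip(item_scores, total_scores))
    ss_item = sum((x - mean_item) ** 2 for x in item_scores)
    ss_total = sum((y - mean_total) ** 2 for y in total_scores)
    if ss_item == 0 or ss_total == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / sqrt(ss_item * ss_total)
```

A high value indicates that examinees who answer the item correctly tend to have higher total scores. That property rewards spreading examinees out along the score scale, which is not necessarily the same as classifying them consistently around a cut score.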
Nonresponse bias in online course evaluations
by Cassandra R. Jones, Ph.D. (2009)
Advisor: Dr. Donna Sundre
Recently, more universities have started administering course evaluations online. With the process no longer in the classroom, some students decide not to complete their course evaluations on their own time, resulting in concerns about online course evaluation results being biased because of lack of response. This study examined course evaluation results at a small, diverse Mid-Atlantic Catholic university. A cross-classified random effects model was used to capture student responses across all of their courses. Nonresponse bias was examined by determining predictors of participation and predictors of online course evaluation ratings. Variables predicting both participation and ratings were considered to be a potential source of nonresponse bias. It was found that gender, ethnicity, and final course grade predicted online course evaluation participation. Only final course grade predicted online course evaluation ratings.
An Empirical Demonstration of Direct and Indirect Mixture Modeling When Studying Personality Traits: A Methodological-Substantive Synergy
by Pamela K. Kaliski, Ph.D. (2009)
Advisor: Dr. Sara Finney
Many personality psychology researchers have employed the person-centered approach of cluster analysis to determine how many categorical Big Five personality types exist. The majority of these researchers have suggested that three Big Five personality types exist; however, results from two recent studies suggested that five types exist. In the first part of the current study, direct mixture modeling (an alternative person-centered approach to cluster analysis) was conducted on Big Five personality variables to explore the number of Big Five personality types that exist in college students, and two methodological approaches for gathering validity evidence for the personality types were demonstrated. Although more validity evidence must be gathered, results of the direct mixture modeling suggested that three personality types may exist in college students; however, the types differed in form from the three types that are commonly reported. In the second part of the current study, the same results were used to demonstrate an application of indirect mixture modeling. As opposed to interpreting the classes as substantively meaningful discrete subgroups, they were interpreted as common configurations that best represent the aggregate dataset. Additionally, the variable-centered approach of multiple regression was conducted. A comparison of the multiple regression results and the indirect mixture modeling results reveals their similarities and differences.
Using Response Time and the Effort-Moderated Model to Investigate the Effects of Rapid Guessing on Estimation of Item and Person Parameters
by Xiaojing Kong, Ph.D. (2007)
Advisor: Dr. Steven Wise
Rapid-guessing behavior, an aberrant examinee behavior observed frequently in testing, creates a possible source of systematic measurement error undermining psychometric quality of items and tests, and the validity of test scores. The purposes of this dissertation were to examine how and to what extent rapid guessing can impact item parameter and proficiency estimates, and to explore and evaluate the effectiveness of specific psychometric models controlling for rapid guesses. Five interrelated studies were conducted, involving the use of item response times for detecting rapid-guessing behavior in the empirical study, and the employment of the observed distribution of response time effort in the simulation studies. The primary investigation involved comparing the performance of the standard IRT models (i.e., 3PL, 2PL, and 1PL) with that of the effort-moderated item response model (Wise & DeMars, 2006) and its variations (i.e., EM-3PL, EM-2PL, and EM-1PL), with respect to model fit, item parameter estimates, proficiency estimates, and test information and reliability. The performance discrepancies were first studied using data from a computer-based, low-stakes achievement test. The direction and magnitude of estimation bias under each model were further examined in such simulated conditions that the proportions of rapid guesses presented in the data varied. Moreover, comparisons between the standard and EM models were conducted for conditions in which the probability of guessing an item right was correlated with examinees' level of proficiency. 
Additionally, the influence of rapid guessing on item parameter estimates was examined in the framework of classical test theory.
Results indicate that a small proportion of rapid guesses can bias item indices and examinee proficiency estimates to a notable extent, and that the undesirable influence can be augmented by increased proportions of rapid guesses. The EM models produced more accurate estimates of item parameters, examinee proficiency, and test information than their counterpart IRT models in most simulated conditions. However, exceptions were observed with the two- and one-parameter models. Also, different patterns were found for conditions in which some level of cognitive process was assumed to be involved during a rapid guess.
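The general idea of an effort-moderated model can be sketched as follows: a response-time threshold flags each response as solution behavior or a rapid guess; solution-behavior responses follow a standard IRT model (here the 3PL), while rapid guesses are assigned a chance-level probability that does not depend on proficiency. This is a minimal sketch under stated assumptions: the function names, the per-item threshold argument, and the omission of any scaling constant are illustrative choices and are not taken from the dissertation.

```python
from math import exp

def p_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))

def is_solution_behavior(response_time, threshold):
    """Flag a response as effortful if it took at least `threshold` seconds.

    The threshold is a hypothetical per-item value chosen by the analyst.
    """
    return response_time >= threshold

def p_effort_moderated(theta, a, b, c, solution_behavior, n_options):
    """Probability of a correct response under the effort-moderated sketch."""
    if solution_behavior:
        return p_3pl(theta, a, b, c)
    # Rapid guess: chance level, independent of theta
    return 1.0 / n_options
```

Because the rapid-guess branch ignores theta, such responses carry no information about proficiency. The abstract's final conditions, in which some cognitive process is assumed during a rapid guess, relax exactly this assumption.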
Using a mixture IRT model to improve parameter estimates when some examinees are amotivated
by Abigail R. Lau, Ph.D. (2009)
Advisor: Dr. Dena Pastor
Test-takers can be required to complete a test form, but cannot be forced to demonstrate their knowledge. Even if an authority mandates completion of a test, examinees can still opt to enter responses randomly. When a test has important consequences for individuals, examinees are unlikely to behave this way. However, random responding becomes more likely when the consequences associated with a test are less significant to the examinees. To thwart random responding, test administrators have explored methods to motivate examinees to respond attentively. Ultimately, differences in how examinees approach low-stakes tests are inevitable, and measurement models that account for this difference are needed. This dissertation provides an overview of the approaches that have been proposed for modeling low-stakes test data. Further, it specifically investigates the performance and utility of the mixed-strategies item response model (Mislevy & Verhelst, 1990) as one method of capturing amotivated examinees. Amotivated examinees are defined here as examinees who do not provide meaningful responses to any test items. A simulation study shows that if a normal item response model is used, parameter recovery rates are unacceptable when 9% or more of the examinees were amotivated. However, normal item response models may still be useful if less than 1% of examinees were amotivated. Use of the mixed-strategies item response model led to better parameter estimation than the normal item response model regardless of the proportion of amotivated examinees in the dataset. Additional research is needed to determine if using the mixed-strategies model results in satisfactory parameter recovery when greater than 20% of examinees were amotivated. A second study shows that when the mixed-strategies model was used on real low-stakes test data, the examinees classified as amotivated reported much lower test-taking effort than other examinees. 
However, examinees classified as amotivated were not very different from other examinees in terms of academic ability. This finding supports the notion that the second class in the mixed-strategies model is capturing amotivated examinees rather than low-ability examinees. Limitations of the mixed-strategies modeling technique are discussed, as is the appropriateness of applying this technique in various testing contexts.
Comparing the Relative Measurement Efficiency of Dichotomous and Polytomous Models in Linear and Adaptive Testing Conditions
by Susan Daffinrud Lottridge, Ph.D. (2006)
Advisor: Dr. Christine DeMars
The purpose of this study was to examine the relative performance of the dichotomous and nominal item response theory models in a linear testing and adaptive testing environment. A simulation study was conducted to investigate the relative measurement efficiency when moving from a dichotomous linear test to a dichotomous adaptive test, nominal linear test, and nominal adaptive test. Item exposure was also considered. Two dichotomous models (2PL, 3PL) and two nominal models (Bock's Nominal Model, Thissen's Nominal Model) were used. The simulated data were based upon responses to a 58-item mathematics test by 6711 students, and Ramsay's nonparametric item response theory method was used to generate option characteristic curves. These curves were then used to generate simulation data. MULTILOG was used to estimate item parameters. An item pool of 522 items was generated from the 58 items, with items being shifted left or right by increments of .05 to create new items. A 30-item fixed-length test was used, as was a 30-item adaptive test. 100 simulees were generated at each of 47 θ points on [-2.3, +2.3]. Using empirically derived standard errors, results indicated that the adaptive test and polytomous linear test outperformed the dichotomous linear test. The Thissen Nominal Model linear test performed similarly to the 3PL adaptive test, suggesting its potential use in place of the more expensive adaptive test. The Bock Nominal Model linear test also performed better than the 2PL linear test, but not as well as either of the adaptive tests. Future studies are suggested for better understanding the Thissen Nominal Model in light of its performance relative to the 3PL adaptive test.
Unfolding Analyses of the Academic Motivation Scale: A Different Approach to Evaluating Scale Validity and Self-Determination Theory
by Betty Jo Miller, Ph.D. (2007)
Advisors: Dr. Donna Sundre and Dr. Christine DeMars
Using the framework of a strong program of construct validation (Benson, 1998), the current study investigated Self-Determination Theory (SDT; Deci & Ryan, 1985), the construct of academic motivation, and the Academic Motivation Scale (AMS; Vallerand et al., 1992). Building upon a body of prior research that provided only limited support for the theory and the seven-factor structure of the scale, a technique other than factor analysis was used to analyze responses to the AMS. Specifically, the utility of a unidimensional unfolding model in analyzing such responses was explored. In addition, scale development efforts were pursued, and multiple measures of academic motivation within a single sample of students were compared. Data were collected from three large samples of university students over the period of one year. The AMS and other instruments were self-report measures administered on computer and by paper-and-pencil. Qualitative data were collected from the second sample for the purposes of exploring new content for pilot items and for explaining certain results. Results have important implications for both SDT and the measurement of academic motivation using the AMS. A unidimensional unfolding model was shown to provide adequate fit to the data, supporting the argument that academic motivation is a single construct ordered along a continuum according to increasingly internal degrees of self-regulation. Using the estimated item locations, a shortened version of the AMS was proposed that was highly reliable and consistent with SDT. Finally, a comparison of unfolded motivation scores with summated AMS subscale scores revealed the folding of the response process.
Refining and Extending the 2 x 2 Achievement Goal Framework: Another Look at Work-Avoidance
by Suzanne L. Pieper, Psy.D. (2003)
Advisors: Dr. Donna L. Sundre and Dr. Sara J. Finney
This study, refining and extending the 2 x 2 achievement goal framework of mastery-approach, mastery-avoidance, performance-approach, and performance-avoidance goals, had three purposes: (1) to investigate the possibility of a fifth goal orientation: work avoidance, (2) to examine the functioning of new items written to better measure the four goal orientations, and (3) to gather validity evidence for the four goal orientations and possibly a fifth goal orientation by examining the association between the variables need for achievement and fear of failure and the goal orientations. The results of this study provided support for the four-factor model of achievement goal orientation using the 12-item Achievement Goal Questionnaire (AGQ) (Elliot & McGregor, 2001) modified for a general academic domain. The four-factor model provided a good fit to the data and a better fit than competing models. Second, the results of this study provided support for the improved reliability and validity of the 16-item AGQ with one item added to each goal orientation subscale to improve measurement. Third, the results of this study provided strong evidence for the existence of a fifth goal orientation: work-avoidance. The five-factor model of goal orientation--mastery-approach, mastery-avoidance, performance-approach, performance-avoidance, and work-avoidance--as measured by the 20-item AGQ provided a good fit to the data. Furthermore, the work-avoidance orientation demonstrated relationships with the criterion variables workmastery, competitiveness, and fear of failure that were expected based on previous theory and research. While this study answers the call of Maehr (2001) to reinvigorate goal theory by considering many possible ways students engage in learning, much still needs to be done in terms of defining and assessing the work-avoidance goal orientation. Additionally, the limitations of this study need to be addressed.
The results of this study need to be validated with other student populations and in a variety of educational contexts. Finally, because the same sample of college students was used for all three analytical stages of this study, thereby increasing the possibility of Type I error, future studies need to validate these results with fresh samples.
Measurement of Critical Thinking in College Students: Assessing the Model
by Kelly A. Williams Scocos, Psy.D. (2002)
Advisor: Dr. Steven L. Wise
The goal of this dissertation was to investigate the viability of the most broadly accepted definition of critical thinking. This definition is the Delphi model (Facione, 1990) and it receives support from professionals both in education and in business. A single, multipart instrument, the Williams Critical Thinking Assessment, was developed to measure the individual facets of critical thinking delineated by the Delphi conceptualization. Results indicated that the Delphi model constituted a workable critical thinking definition. Furthermore, critical thinking defined in a manner consistent with the Delphi model was demonstrated to be distinct from scholastic achievement. Educationally, these discoveries have implications for both critical thinking instruction and learning in a collegiate environment.
Parameter Recovery of the Explanatory Multidimensional Rasch Model
by J. Carl Setzer, Ph.D. (2008)
Advisor: Dr. Dena Pastor
Recently, there have been two types of model formulations used to demonstrate the utility of explanatory item response models. Specifically, the generalized linear mixed model (GLMM) and hierarchical generalized linear model (HGLM) have expanded item response models to include covariates for item effects, person effects, or both simultaneously. Both frameworks have recently been garnering greater attention in the educational measurement field. Despite these two frameworks being conceptually equivalent, much of the related literature has emphasized one or the other. However, to date, there has been little attempt to associate the frameworks together. In addition, item response models that have been described within the GLMM and HGLM frameworks have mostly been of the unidimensional type. Very little has been done to demonstrate the utility of an explanatory multidimensional item response model. As explanatory models become more prevalent in research and practice, it is important to maintain software that can estimate them. SAS is an all-purpose and widely-used program that can estimate explanatory item response models. However, no previous research has examined how well SAS can recover the parameters of an explanatory multidimensional Rasch model (EMRM). There were three main goals of this study. First, several types of Rasch models, including both non-explanatory and explanatory models, were summarized within the GLMM and HGLM frameworks. The equivalence of these two frameworks was demonstrated for each model. Second, a parameter recovery study was performed to determine how well SAS PROC NLMIXED can recover the parameters of an EMRM. The effect of sample size and test length on parameter recovery was assessed. The results of the simulation study indicate that very little bias occurs, even with small sample sizes and short test lengths. The final goal was to demonstrate the utility of an EMRM model using empirical data. 
Using data collected from the Marlowe-Crowne Social Desirability Scale (MCSDS), an EMRM was fit to the data while using gender as a covariate. Interpretations of the model parameter estimates were given and it was concluded that gender did not explain a significant amount of variation in either of the MCSDS subscales.
Cyberspace Versus Face-to-Face: The Influence of Learning Strategies, Self-Regulation, and Achievement Goal Orientation
by Kara Owens Siegert, Ph.D. (2005)
Advisor: Dr. Christine DeMars
Web-based education (WBE) is a popular educational format that allows certain learning and teaching advantages. However, some students may not learn or perform as well in this environment as compared to traditional face-to-face education (F2FE) settings. Little research has examined the differential impact of learner characteristics on performance in these two environments. This study explored differences in learning strategies, self-regulation skills, and achievement goal orientation, in WBE and F2FE college classrooms and found that students in the two environments could be differentiated based on the composite of learner characteristics. Specifically, WBE and F2FE students differed in terms of self-regulation, elaboration, and mastery-avoidance goals. Learner characteristics, however, did not have a differential influence on college student performance in the two environments.
Should We Worry About the Way We Measure Worry Over Time? A Longitudinal Analysis of Student Worry During the First Two Years of College
by Peter J. Swerdzewski, Ph.D. (2008)
Advisor: Dr. Sara Finney
This study evaluated longitudinal change in student worry using the Student Worry Questionnaire-30 (SWQ-30), an instrument that represents worry as six separate factors: (1) Worrisome Thinking, (2) Financial-Related Concerns, (3) Significant Others' Well-Being, (4) Academic Concerns, (5) Social Adequacy Concerns, and (6) Generalized Anxiety Symptoms. Prior to evaluating longitudinal change, the factor structure of the SWQ-30 was examined using four cross-sectional independent samples. A best-fitting six-factor model was found that removed four redundant items from the original 30-item instrument. This six-factor 26-item model was then fit to data from a longitudinal sample of students who completed the measure as entering freshmen and second-semester sophomores. Evidence for full configural and metric invariance was found. When the data were tested for scalar invariance, one item from each of the following subscales was found to be scalar non-invariant: Worrisome Thinking, Social Adequacy Concern, and Financial-Related Concern. Additionally, most of the items from the Generalized Anxiety Symptoms factor were found to be scalar non-invariant, thus making the latent mean difference for the factor uninterpretable. Overall, interpretable latent mean differences and stability estimates provided evidence that student worry was stable over time, although students appeared to decrease in the degree to which they worried about social adequacy. These findings suggest that some aspects of worry and the infamous sophomore slump may be unrelated phenomena. In sum, the SWQ-30 is a promising measure of multidimensional student worry; however, it has not received adequate empirical study. Furthermore, given the dearth of empirical research examining the stability of student worry over time and the unique characteristics of the samples under study, future research must be conducted to better uncover the link between worry and sophomore slump.
An Application of Generalizability Theory to Evaluate the Technical Quality of An Alternate Assessment
by Melinda A. Taylor, Ph.D. (2009)
Advisor: Dr. Dena Pastor
Federal regulations require testing of students with the most severe cognitive disabilities, although little guidance has been given regarding the format of such assessments or how technical quality should be documented. It is well documented that specific challenges exist with the documentation of technical quality for alternate assessments that are often less standardized than their general assessment complements. One of the first steps in documenting technical quality is to determine the reliability of scores resulting from an assessment. Typical measures of reliability under a classical test theory framework, such as coefficient alpha, do little in modeling the multiple sources of error that are characteristic of alternate assessments. Instead, Generalizability theory (G-theory) allows researchers to identify potential sources of variability in scores and to analyze the relative contribution of each of those modeled sources. The purpose of this study was to demonstrate an application of G-theory to examining the technical quality of scores from an alternate assessment. A G-study where rater type, assessment attempts, and tasks were identified as facets was examined to determine the relative contribution of each facet to observed score variance. Data resulting from the G-study were used to examine the reliability of scores using a criterion-referenced interpretation of error variance associated with scores. The current assessment design was then modified to examine how changes in the design might impact the reliability of scores. Based on established criteria, the proposed designs were evaluated in terms of their ability to yield acceptable reliability coefficients. As a final step in the analysis, designs that were deemed satisfactory were evaluated from a practical standpoint with respect to the feasibility of adapting them into a statewide standardized assessment program used for student and school accountability purposes.
Examinee Awareness of Performance Expectations and its Effects on Motivation and Test Scores
by Amy DiMarco Thelk, Ph.D. (2006)
Advisor: Dr. Donna L. Sundre
Published literature reveals little information about whether examinees should be told of established performance expectations prior to test taking. This study investigated whether students who are told of a test's cut scores, information about student performance from previous test administrations, or both types of information have significantly different test performance or motivation scores than those receiving only the standardized instructions. This research was conducted at a community college during regular assessment testing. Students taking a quantitative and scientific reasoning exam (QRSR) were assigned to one of four testing conditions. Motivation information was collected via two measures: Response Time Effort (RTE; Wise & Kong, 2005) and the Student Opinion Scale (SOS; Sundre, 1999). A confirmatory factor analysis was conducted to determine whether the two-factor structure of the SOS held up when administered to a community-college sample. The results support the established structure when administered in this setting. The second phase of analysis involved testing three path models to assess the impact of (a) SOS; (b) RTE; and (c) SOS and RTE on test scores. While the treatments had only small, and contradictory, effects on SOS and RTE, all three models were significant. SOS accounted for 9% of test score variance, RTE alone accounted for 16% of the variance in test scores, and the combination of RTE and SOS accounted for 19% of the variance in test scores. The final phase of the project involved interviewing a sample of students (n=8) following testing. Interviewees were asked about treatment recognition, effort, and ideas about motivating students in testing situations. While students were able to recognize the written information they had seen prior to testing, only one freely recalled seeing additional data prior to testing. These findings call the potency of the manipulations into question.
Also, while students verbally reported variations in how hard they tried, scores on the Effort subscale were not significantly different. The results of this study do not offer strong guidance on whether to tell students about cut scores prior to testing. Limitations of the research and suggestions for future research are offered.
Controlling Computer Adaptive Testing's Capitalization on Chance Errors in Item Parameter Estimates
by John Taylor Willse, Psy.D. (2002)
Advisor: Dr. Christine DeMars
Computer adaptive tests (CAT) have a tendency to capitalize on chance errors in a-parameter estimates (van der Linden and Glas, 2000). A-stratified, match difficulty, separate item-selection/item-scoring (half), and 1-pl only CATs were compared to a maximum information CAT for their ability to address the negative effects associated with controlling capitalization on chance. The CATs were evaluated in 3 simulations (i.e., using 1-, 2-, and 3-pl true item response theory models). Results were presented in terms of prevention of capitalization on chance and overall effectiveness. The phenomenon of capitalization on chance by a maximum information CAT was replicated. The a-stratified, match difficulty, and half CATs were successful at preventing capitalization on chance. Through consideration of overall effectiveness and ease of implementation, the match difficulty CAT was determined to be the best alternative to the maximum information CAT. The 1-pl only CAT was shown to be a poor alternative, especially in the 3-pl true item simulation.