Found 20 similar articles (search time: 15 ms)
1.
Wolfe EW 《Journal of applied measurement》2000,1(4):409-434
This article describes Rasch measurement procedures for equating multiple test forms or calibrating an item bank. The procedures entail (a) selecting an appropriate data collection design, (b) estimating parameters, (c) transforming the parameters from multiple forms to a common scale, and (d) evaluating the quality of the linkage between these forms. Data collection designs include (a) anchor tests, (b) single group, (c) single data set, and (d) equivalent groups. Estimation procedures may involve (a) separate or (b) simultaneous calibration of data from multiple forms. Transformation is typically accomplished using (a) estimation scaling, but may involve (b) parameter anchoring or (c) computing equating constants. Link quality is evaluated using four fit indices: (a) item-within-link, (b) item-between-link, (c) link-within-bank, and (d) form-within-bank. These procedures are illustrated using an anchor test design.
2.
Penfield RD 《Journal of applied measurement》2005,6(4):355-365
The Rasch family of models displays several well-documented properties that distinguish it from the general item response theory (IRT) family of measurement models. This paper describes an additional unique property of Rasch models, referred to as the property of item information constancy. This property asserts that the area under the information function for Rasch models is always equal to the number of response categories minus one, regardless of the values of the item location parameters. The implication of the property of item information constancy is that, for a given number of response categories, all items following a Rasch model contribute equally to the height of the test information function across the entire latent continuum.
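The constancy property stated in this abstract can be checked numerically. The sketch below is not taken from the paper: it assumes a partial credit parameterization with unit discrimination, takes the item information at a given ability to be the conditional variance of the item score, and integrates it with a simple trapezoid rule. The function names and the threshold values are hypothetical.

```python
import math

def pcm_probs(theta, thresholds):
    """Category probabilities for a Rasch partial credit item (assumed form)."""
    # Category k has log-numerator k*theta minus the sum of the first k thresholds.
    logits = [0.0]
    for delta in thresholds:
        logits.append(logits[-1] + theta - delta)
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def item_information(theta, thresholds):
    """Item information, taken here as the variance of the item score at theta."""
    probs = pcm_probs(theta, thresholds)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))

def information_area(thresholds, lo=-30.0, hi=30.0, steps=6000):
    """Trapezoid-rule area under the item information function on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (item_information(lo, thresholds) + item_information(hi, thresholds))
    for i in range(1, steps):
        total += item_information(lo + i * h, thresholds)
    return total * h

# One threshold is a dichotomous item (2 categories); three thresholds give 4 categories.
area_dichotomous = information_area([0.0])
area_polytomous = information_area([-1.0, 0.5, 2.0])
```

Under these assumptions the two areas should come out close to 1 and 3 respectively, whatever threshold values are chosen, as long as the integration range comfortably covers them.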
3.
Smith EV 《Journal of applied measurement》2005,6(2):147-163
One of the assumptions of many latent trait models is local independence. This assumption specifies that, after controlling for the underlying trait, item responses are independent. Given the lack of studies of model robustness against such violations, it appears that this assumption is frequently taken for granted. Therefore, this study investigated the robustness of Rasch item and person estimates with simulated data under varying numbers of items, sample sizes, and levels of item redundancy. Item and person reliabilities, the standard deviations of the person and item estimates, the root mean squared differences and mean signed differences among person and item estimates, the correlations between person estimates, and the percentage of person estimates shifting by more than .50 logits were used to evaluate the impact of item redundancy. Both norm- and criterion-referenced interpretations may be influenced by the imputation of redundancy into the data. However, it appears that the amount of redundancy needs to be considerable before such interpretations would be adversely impacted. Suggestions for further simulation research are provided.
4.
Rijmen F Tuerlinckx F Meulders M Smits DJ Balázs K 《Journal of applied measurement》2005,6(3):273-288
Mixed models take the dependency between observations based on the same person into account by introducing one or more random effects. After introducing the mixed model framework, it is explained, by taking the Rasch model as a generic example, how item response models can be conceptualized as generalized linear and nonlinear mixed models. Common estimation methods for generalized linear and nonlinear models are discussed. In a simulation study, the performance of four estimation methods is assessed for the Rasch model under different conditions regarding the number of items and persons, and the degree of interindividual differences. The estimation methods included in the study are: an approximation of the integral over the random effect by means of Gaussian quadrature; direct maximization with a sixth-order Laplace approximation to the integrand; a linearized approximation of the nonlinear model employing PQL2; and finally a Bayesian MCMC method. It is concluded that the estimation methods perform almost equally well, except for a slightly worse recovery of the variance parameter for PQL2 and MCMC.
5.
The invariance of the estimated parameters across variation in the incidental parameters of a sample is one of the most important properties of Rasch measurement models. This is the property that allows the equating of test forms and the use of computer adaptive testing. It necessarily follows that in Rasch models, if the data fit the model, then the estimation of the parameter of interest must be invariant across sub-samples of the items or persons. This study investigates the degree to which the INFIT and OUTFIT item fit statistics in WINSTEPS detect violations of the invariance property of Rasch measurement models. The test in this study is an 80-item multiple-choice test used to assess mathematics competency. The WINSTEPS analysis of the dichotomous results, based on a sample of 2,000 from a very large number of students who took the exam, indicated that only 7 of the 80 items misfit using the 1.3 mean square criterion advocated by Linacre and Wright. Subsequent calibration of separate samples of 1,000 students from the upper and lower third of the person raw score distribution, followed by a t-test comparison of the item calibrations, indicated that the item difficulties for 60 of the 80 items were more than 2 standard errors apart. The separate calibration t-values ranged from +21.00 to -7.00, with the t-test value of 41 of the 80 comparisons either larger than +5 or smaller than -5. Clearly these data do not exhibit the invariance of the item parameters expected if the data fit the model. Yet the INFIT and OUTFIT mean squares are completely insensitive to the lack of invariance in the item parameters. If the OUTFIT ZSTD from WINSTEPS was used with a critical value of |t| > 2.0, then 56 of the 60 items identified by the separate calibration t-test would be identified as misfitting. A fourth measure of misfit, the between ability-group item fit statistic, identified 69 items as misfitting when a critical value of t > 2.0 was used. Clearly, relying solely on the INFIT and OUTFIT mean squares in WINSTEPS to assess the fit of the data to the model would cause one to miss one of the most important threats to the usefulness of the measurement model.
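The separate-calibration comparison described in this abstract can be sketched as a standardized difference between two difficulty estimates of the same item. The code below is an illustration only: it assumes both calibrations are already expressed on a common origin, uses the pooled standard error as the denominator, and works on made-up numbers rather than the study's data. The function and item names are hypothetical.

```python
import math

def separate_calibration_t(d_low, se_low, d_high, se_high):
    """Standardized difference between two calibrations of one item.

    Assumes both difficulty estimates share the same origin
    (for example, each calibration centered on its mean item difficulty).
    """
    return (d_high - d_low) / math.sqrt(se_low ** 2 + se_high ** 2)

def flag_noninvariant(calibrations, critical=2.0):
    """Return the items whose two calibrations differ by more than `critical`."""
    flagged = []
    for item, (d_low, se_low, d_high, se_high) in calibrations.items():
        t = separate_calibration_t(d_low, se_low, d_high, se_high)
        if abs(t) > critical:
            flagged.append(item)
    return flagged

# Hypothetical calibrations: (difficulty, SE) in the lower group, then the upper group.
example = {
    "item01": (0.10, 0.08, 0.15, 0.09),   # nearly identical in both groups
    "item02": (-0.50, 0.07, 0.40, 0.10),  # much harder in the upper group
}
suspect_items = flag_noninvariant(example)
```

With these invented values only the second item exceeds the |t| > 2.0 criterion mentioned in the abstract.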
6.
The purpose of this paper is to explain the role of the unit implicit in the dichotomous Rasch model in determining the multiplicative factor of separation between measurements in a specified frame of reference. The explanation is provided at two complementary levels: first, in terms of the algebra of the model in which the role of an implicit, multiplicative constant is made explicit; and second, at a more fundamental level, in terms of the classical definition of measurement in the physical sciences. The Rasch model is characterized by statistical sufficiency, which arises from the requirement of invariant comparisons within a specified frame of reference. A frame of reference is defined by a class of persons responding to a class of items in a well-defined response context. The paper shows that two or more frames of reference may have different implicit units without destroying sufficiency. Understanding the role of the unit permits explication of the relationship between the Rasch model and the two parameter logistic model. The paper also summarises an approach that can be used in practice to express measurements across different frames of reference in the same unit.
7.
This paper describes a class of rater effects that depict rater-by-time interactions. We refer to this class of rater effects as DRIFT (differential rater functioning over time). This article describes several types of DRIFT (primacy/recency, differential centrality/extremism, and practice/fatigue) and Rasch measurement procedures designed to identify these types of DRIFT in rating data. These procedures are applied to simulated data and are shown to be useful in classifying raters as being aberrant or non-aberrant for primacy, recency, and differential centrality and extremism, particularly for moderate or larger effect sizes. Rates of correct classification for practice and fatigue were lower, and statistical power exceeded .50 only with very large effect sizes. Type I error rates (i.e., incorrect nomination) were near expected levels in all cases.
8.
Garner M 《Journal of applied measurement》2002,3(2):107-128
The purpose of this paper is to describe a technique for obtaining item parameters of the Rasch model, a technique in which the item parameters are extracted from the eigenvector of a matrix derived from comparisons between pairs of items. The technique can be applied to both dichotomous and polytomous data. In application to a previously published data set, it is shown that the technique provides item parameter estimates comparable to those produced by joint maximum likelihood estimation, and for the most difficult items, the technique appears to produce superior estimates. This method has several advantages. It easily accommodates missing data, and makes transparent the basis for item parameter estimation in the presence of missing data. Furthermore, the method provides a link to other methods in the social sciences and, in particular, provides the framework for application of graph theory to the analysis of assessment networks. Finally, it exploits several characteristics that are unique to the Rasch model.
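One way an eigenvector can yield Rasch item difficulties from pairwise comparisons is sketched below for dichotomous data. This is an illustration under stated assumptions, not necessarily the paper's exact algorithm: it assumes a count matrix `wins` in which `wins[i][j]` is the number of persons who answered item i correctly and item j incorrectly, forms the ratio matrix whose entries estimate exp(d_i - d_j), and takes the log of its principal eigenvector found by power iteration. All names are hypothetical.

```python
import math

def principal_eigenvector(matrix, iterations=200):
    """Power iteration for the dominant eigenvector of a positive matrix."""
    n = len(matrix)
    vec = [1.0] * n
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(nxt)
        vec = [v / norm for v in nxt]  # rescale so the entries sum to one
    return vec

def difficulties_from_pairs(wins):
    """Item difficulties (logits, centered at zero) from pairwise counts.

    wins[i][j]: persons with item i correct and item j incorrect.
    Under the Rasch model wins[j][i] / wins[i][j] estimates exp(d_i - d_j).
    Every off-diagonal count must be positive in this simple sketch.
    """
    n = len(wins)
    ratio = [[1.0 if i == j else wins[j][i] / wins[i][j] for j in range(n)]
             for i in range(n)]
    vec = principal_eigenvector(ratio)
    logs = [math.log(v) for v in vec]
    mean = sum(logs) / n
    return [v - mean for v in logs]

# Hypothetical expected counts generated from known difficulties, for checking.
true_d = [-1.0, 0.0, 1.0]
wins = [[0.0 if i == j else
         1000.0 * math.exp(-true_d[i]) / (math.exp(-true_d[i]) + math.exp(-true_d[j]))
         for j in range(3)] for i in range(3)]
recovered = difficulties_from_pairs(wins)
```

Because the example counts are exact expectations, the recovered values should match the generating difficulties; with real data the ratio matrix is only approximately consistent and the eigenvector acts as a smoothing step.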
9.
10.
Demars CE 《Journal of applied measurement》2004,5(4):350-361
A multidimensional Rasch model was applied to two instruments measuring abilities in two related areas of a university general education curriculum. Grades from related courses were also calibrated using the Rasch model. Thus, course grades, test items, and persons were all placed on the same metric. Incorporating grades within the metric provided additional meaning to the measures; instructors could see which items were matched to students in a particular grade range for a course. This could help not only in interpreting items but also in interpreting grades. Test items and grades fit the model reasonably well, with adequate person separation reliability.
11.
In the present paper, the Rasch measurement model is used in the validation and analysis of data coming from the satisfaction section of the first national survey concerning the social services sector carried out in Italy. A comparison between two Rasch models for polytomous data, namely the Rating Scale Model (RSM) and the Partial Credit Model (PCM), is discussed. Given that the two models provide similar estimates of the item difficulties and worker satisfaction, that for almost all the items the response probabilities computed using the RSM and the PCM are very close, and that the analysis of the bootstrap confidence intervals shows that the estimates obtained applying the RSM are more stable than the ones obtained using the PCM, it can be concluded that, for the present data, the RSM is more appropriate than the PCM.
12.
Linacre JM 《Journal of applied measurement》2004,5(1):95-110
Building on Wright and Masters (1982), several Rasch estimation methods are briefly described, including Marginal Maximum Likelihood Estimation (MMLE) and minimum chi-square methods. General attributes of Rasch estimation algorithms are discussed, including the handling of missing data, precision and accuracy, estimate consistency, bias and symmetry. Reasons for, and the implications of, measure misestimation are explained, including the effect of loose convergence criteria, and failure of Newton-Raphson iteration to converge. Alternative parameterizations of rating scales broaden the scope of Rasch measurement methodology.
13.
Local item dependence (LID) can emerge when the test items are nested within common stimuli or item groups. This study proposes a three-level hierarchical generalized linear model (HGLM) to model LID when LID is due to such contextual effects. The proposed three-level HGLM was examined by analyzing simulated data sets and was compared with the Rasch-equivalent two-level HGLM that ignores such a nested structure of test items. The results demonstrated that the proposed model could capture LID and estimate its magnitude. Also, the two-level HGLM resulted in larger mean absolute differences between the true and the estimated item difficulties than those from the proposed three-level HGLM. Furthermore, it was demonstrated that the proposed three-level HGLM estimated the ability distribution variance unaffected by the LID magnitude, while the two-level HGLM with no LID consideration increasingly underestimated the ability variance as the LID magnitude increased.
14.
Babiar TC 《Journal of applied measurement》2011,12(2):144-164
Traditionally, women and minorities have not been fully represented in science and engineering. Numerous studies have attributed these differences to gaps in science achievement as measured by various standardized tests. Rather than describe mean group differences in science achievement across multiple cultures, this study focused on an in-depth item-level analysis across two countries: Spain and the United States. This study investigated eighth-grade gender differences on science items across the two countries. A secondary purpose of the study was to explore the nature of gender differences using the many-faceted Rasch Model as a way to estimate gender DIF. A secondary analysis of data from the Third International Mathematics and Science Study (TIMSS) was used to address three questions: 1) Does gender DIF in science achievement exist? 2) Is there a relationship between gender DIF and characteristics of the science items? 3) Do the relationships between item characteristics and gender DIF in science items replicate across countries? Participants included 7,087 eighth-grade students from the United States and 3,855 students from Spain who participated in TIMSS. The Facets program (Linacre and Wright, 1992) was used to estimate gender DIF. The results of the analysis indicate that the content of the item seemed to be related to gender DIF. The analysis also suggests that there is a relationship between gender DIF and item format. No pattern of gender DIF related to cognitive demand was found. The general pattern of gender DIF was similar across the two countries used in the analysis. The strength of item-level analysis as opposed to group mean difference analysis is that gender differences can be detected at the item level, even when no mean differences can be detected at the group level.
15.
In state assessment programs that employ Rasch-based common item linking procedures, the linking constant is usually estimated with only those common items not identified as exhibiting item difficulty parameter drift. Since state assessments typically contain a fixed number of items, an item classified as exhibiting parameter drift during the linking process remains on the exam as a scorable item even if it is removed from the common item set. Under the assumption that item parameter drift has occurred for one or more of the common items, the expected effect of including or excluding the "affected" item(s) in the estimation of the linking constant is derived in this article. If the item parameter drift is due solely to factors not associated with a change in examinee achievement, no linking error will (be expected to) occur given that the linking constant is estimated only with the items not identified as "affected"; linking error will (be expected to) occur if the linking constant is estimated with all common items. However, if the item parameter drift is due solely to change in examinee achievement, the opposite is true: no linking error will (be expected to) occur if the linking constant is estimated with all common items; linking error will (be expected to) occur if the linking constant is estimated only with the items not identified as "affected".
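The effect of including or excluding a drifting item on the linking constant can be illustrated with a small sketch. It assumes the common mean-shift form of Rasch linking, in which the constant is the average difference between the old and new difficulty estimates of the common items; the abstract does not spell out the estimator, and the item labels and values below are invented.

```python
def linking_constant(old, new, exclude=()):
    """Mean difference (old minus new) over the common items kept for linking.

    old, new: dicts mapping item label to Rasch difficulty in each calibration.
    exclude: labels of common items dropped from the linking set.
    """
    items = [k for k in old if k in new and k not in exclude]
    if not items:
        raise ValueError("no common items left for linking")
    return sum(old[k] - new[k] for k in items) / len(items)

# Hypothetical common items: A and B shift by 0.5 logits, C drifts the other way.
old_form = {"A": -1.0, "B": 0.0, "C": 1.0}
new_form = {"A": -1.5, "B": -0.5, "C": 1.5}

constant_all = linking_constant(old_form, new_form)                    # all common items
constant_screened = linking_constant(old_form, new_form, exclude=("C",))  # drifted item removed
```

In this made-up case the constant moves from about 0.17 to 0.5 logits once item C is screened out, which is the kind of difference whose consequences the article derives.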
16.
The item infit and outfit mean square errors (MSE) and their t-transformed statistics are widely used to screen poorly fitting items. The t-transformed statistics, however, do not follow the standard normal distribution so that hypothesis testing of item fit based on the conventional critical values is likely to be inaccurate (Wang and Chen, 2005). The MSE statistics are effect-size measures of misfit and have an expected value of unity when the data fit the model's expectation. Unfortunately, most computer programs for item response analysis do not report confidence intervals of the item infit and outfit MSE, mainly because their sampling distributions are analytically intractable. Hence, the user is left without interval estimates of the magnitudes of misfit. In this study, we developed a FORTRAN 90 computer program in conjunction with the commercial program WINSTEPS (Linacre, 2001) that yields confidence intervals of the item infit and outfit MSE using the bootstrap method. The utility of the program is demonstrated through three illustrations of simulated data sets.
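The idea of attaching a bootstrap interval to an item fit mean square can be sketched in a few lines of Python. This is not the FORTRAN 90 program described in the abstract: it assumes dichotomous data, treats the person abilities and the item difficulty as known rather than re-estimating them in each replicate, and uses a simple percentile interval. All function names and data are hypothetical.

```python
import math
import random

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Return (infit, outfit) mean squares for one item."""
    sq_resid = info = z2 = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        var = p * (1.0 - p)
        sq_resid += (x - p) ** 2      # squared raw residual
        info += var                   # model variance of the response
        z2 += (x - p) ** 2 / var      # squared standardized residual
    return sq_resid / info, z2 / len(responses)

def bootstrap_outfit_ci(responses, thetas, b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the outfit mean square (persons resampled)."""
    rng = random.Random(seed)
    n = len(responses)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        _, outfit = item_fit([responses[i] for i in idx],
                             [thetas[i] for i in idx], b)
        stats.append(outfit)
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Hypothetical data: six persons with assumed abilities answering one item (b = 0).
abilities = [-1.5, -0.5, 0.0, 0.5, 1.0, 2.0]
answers = [0, 0, 1, 0, 1, 1]
infit, outfit = item_fit(answers, abilities, 0.0)
low, high = bootstrap_outfit_ci(answers, abilities, 0.0)
```

A fuller implementation would recalibrate the item and persons inside each bootstrap replicate; holding them fixed keeps the sketch short but will understate the sampling variability.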
17.
The purpose of this two-part paper is to introduce researchers to the many-facet Rasch measurement (MFRM) approach for detecting and measuring rater effects. The researcher will learn how to use the Facets (Linacre, 2001) computer program to study five effects: leniency/severity, central tendency, randomness, halo, and differential leniency/severity. Part 1 of the paper provides critical background and context for studying MFRM. We present a catalog of rater effects, introducing effects that researchers have studied over the last three-quarters of a century in order to help readers gain a historical perspective on how those effects have been conceptualized. We define each effect and describe various ways the effect has been portrayed in the research literature. We then explain how researchers theorize that the effect impacts the quality of ratings, pinpoint various indices they have used to measure it, and describe various strategies that have been proposed to try to minimize its impact on the measurement of ratees. The second half of Part 1 provides conceptual and mathematical explanations of many-facet Rasch measurement, focusing on how researchers can use MFRM to study rater effects. First, we present the many-facet version of Andrich's (1978) rating scale model and identify questions about a rating operation that researchers can address using this model. We then introduce three hybrid MFRM models, explain the conceptual distinctions among them, describe how they differ from the rating scale model, and identify questions about a rating operation that researchers can address using these hybrid models.
18.
The purpose of this two-part paper is to introduce researchers to the many-facet Rasch measurement (MFRM) approach for detecting and measuring rater effects. In Part II of the paper, researchers will learn how to use the Facets (Linacre, 2001) computer program to study five effects: leniency/severity, central tendency, randomness, halo, and differential leniency/severity. As we introduce each effect, we operationally define it within the context of an MFRM approach, specify the particular measurement model(s) needed to detect it, identify group- and individual-level statistical indicators of the effect, and show output from a Facets analysis, pinpointing the various indicators and explaining how to interpret each one. At the close of the paper, we describe other statistical procedures that have been used to detect and measure rater effects to help researchers become aware of important and influential literature on the topic and to gain an appreciation for the diversity of psychometric perspectives that researchers bring to bear on their work. Finally, we consider future directions for research in the detection and measurement of rater effects.
19.
Humphry S 《Journal of applied measurement》2012,13(2):165-180
The aim is to show that it is possible to parameterize discrimination for sets of items, rather than individual items, without destroying conditions for sufficiency in a form of the Rasch model. The form of the model is obtained by formalizing the relationship between discrimination and the unit of a metric. The raw score vector across item sets is the sufficient statistic for the person parameter. Simulation studies are used to show the implementation of conditional estimation solution equations based on the relevant form of the Rasch model. The model is also applied to two numeracy tests attempted by a group of common persons in a large-scale testing program. The results show improved fit compared with the Rasch model in its standard form. They also show the units of the scales were more accurately equated. The paper discusses implications for applied measurement using Rasch models and contrasts the approach with the application of the two parameter logistic (2PL) model.
20.
Chi E 《Journal of applied measurement》2001,2(4):379-388
This paper compares holistic and analytic scoring methods to explore how the alternative scorings can make differences for performance assessment using the many-faceted Rasch model. The model is especially pertinent for analyzing performance assessment since the model can include several facets simultaneously. Forty-three students' reports for social studies were scored by four raters with the holistic method and the analytic method. The result demonstrated that scoring rubrics could be improved by investigating rating scale categories. Also, the comparison of student scores between the two scoring methods revealed that the selection of scoring methods might not be significant for the relative comparison of students but it could have serious implications for the assessment of students' absolute abilities. For rater severity, analytic scoring provided more consistency than holistic scoring. These findings can be used to select and improve scoring methods for performance assessment.