Similar literature
 20 similar documents found
1.
This article describes Rasch measurement procedures for equating multiple test forms or calibrating an item bank. The procedures entail (a) selecting an appropriate data collection design, (b) estimating parameters, (c) transforming the parameters from multiple forms to a common scale, and (d) evaluating the quality of the linkage between these forms. Data collection designs include (a) anchor tests, (b) single group, (c) single data set, and (d) equivalent groups. Estimation procedures may involve (a) separate or (b) simultaneous calibration of data from multiple forms. Transformation is typically accomplished using (a) estimation scaling, but may involve (b) parameter anchoring or (c) computing equating constants. Link quality is evaluated using four fit indices: (a) item-within-link, (b) item-between-link, (c) link-within-bank, and (d) form-within-bank. These procedures are illustrated using an anchor test design.
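
As an illustration of option (c), computing equating constants, the following is a minimal sketch that assumes the common-item (anchor) difficulties have already been estimated separately on each form. The function names and the example values are hypothetical and do not come from the article; the sketch uses the mean difference in anchor difficulty as the constant, which is one common choice for Rasch linking and not necessarily the authors' procedure.

```python
from statistics import mean


def equating_constant(anchors_form_a, anchors_form_b):
    """Mean difference (Form A minus Form B) in anchor-item difficulty, in logits.

    Both lists must hold the same anchor items in the same order.
    """
    if not anchors_form_a or len(anchors_form_a) != len(anchors_form_b):
        raise ValueError("anchor lists must be non-empty and of equal length")
    return mean(anchors_form_a) - mean(anchors_form_b)


def to_form_a_scale(measures_form_b, constant):
    """Shift Form B measures (items or persons) onto the Form A scale."""
    return [m + constant for m in measures_form_b]


# Hypothetical anchor difficulties from two separate calibrations.
form_a = [-1.0, 0.0, 1.0]
form_b = [-1.5, -0.5, 0.5]
shift = equating_constant(form_a, form_b)
rescaled = to_form_a_scale([0.0, 1.0], shift)
```

Because the Rasch model fixes a common slope for all items, a single additive constant is enough to place one form's logits on the other's scale under this assumption.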

2.
The objective of this article is to illustrate incremental item banking using health-related quality of life data collected from two samples of patients receiving cancer treatment. The kinds of decisions one faces in establishing an item bank for computerized adaptive testing are also illustrated. Pre-calibration procedures include: identifying common items across databases; creating a new database with data from each pool; reverse-scoring "negative" items; identifying rating scales used in items; identifying pivot points in each rating scale; pivot anchoring items at comparable rating scale categories; and identifying items in each instrument that measure the construct of interest. A series of calibrations was conducted in which a small proportion of new items were added to the common core and misfitting items were identified and deleted until an initial item bank had been developed.

3.
The study investigated five factors which can affect the equating of scores from two tests onto a common score scale. The five factors were: (a) item distribution type (i.e., normal versus uniform); (b) standard deviation of item difficulty (i.e., .68, .95, .99); (c) number of items or test length (i.e., 50, 100, 200); (d) number of common items (i.e., 10, 20, 30); and (e) sample size (i.e., 100, 300, 500). SIMTEST and BIGSTEPS programs were used for the simulation and equating of 4,860 item data sets, respectively. Results from the five-way fixed effects factorial analysis of variance indicated three statistically significant two-way interaction effects. Only the simple effects for the interaction between number of common items and test length were interpreted, given Type I error rate considerations. The eta-squared values for number of common items and test length were small, indicating the effects had little practical importance. The Rasch approach to equating is robust with as few as 10 common items and a test length of 100 items.

4.
The invariance of the estimated parameters across variation in the incidental parameters of a sample is one of the most important properties of Rasch measurement models. This is the property that allows the equating of test forms and the use of computer adaptive testing. It necessarily follows that in Rasch models, if the data fit the model, then the estimation of the parameter of interest must be invariant across sub-samples of the items or persons. This study investigates the degree to which the INFIT and OUTFIT item fit statistics in WINSTEPS detect violations of the invariance property of Rasch measurement models. The test in this study is an 80-item multiple-choice test used to assess mathematics competency. The WINSTEPS analysis of the dichotomous results, based on a sample of 2,000 from a very large number of students who took the exam, indicated that only 7 of the 80 items misfit using the 1.3 mean square criterion advocated by Linacre and Wright. Subsequent calibration of separate samples of 1,000 students from the upper and lower third of the person raw score distribution, followed by a t-test comparison of the item calibrations, indicated that the item difficulties for 60 of the 80 items were more than 2 standard errors apart. The separate calibration t-values ranged from +21.00 to -7.00, with the t-test value of 41 of the 80 comparisons either larger than +5 or smaller than -5. Clearly these data do not exhibit the invariance of the item parameters expected if the data fit the model. Yet the INFIT and OUTFIT mean squares are completely insensitive to the lack of invariance in the item parameters. If the OUTFIT ZSTD from WINSTEPS were used with a critical value of | t | > 2.0, then 56 of the 60 items identified by the separate calibration t-test would be identified as misfitting. A fourth measure of misfit, the between ability-group item fit statistic, identified 69 items as misfitting when a critical value of t > 2.0 was used. Clearly, relying solely on the INFIT and OUTFIT mean squares in WINSTEPS to assess the fit of the data to the model would cause one to miss one of the most important threats to the usefulness of the measurement model.
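
The separate-calibration comparison described in this abstract can be sketched as follows. The abstract does not print the formula, so this assumes the usual form for comparing two independent Rasch calibrations of the same item: the difference in difficulty divided by the pooled standard error. All names and numbers below are hypothetical illustrations, not data from the study.

```python
import math


def separate_calibration_t(d_high, se_high, d_low, se_low):
    """t-like statistic for one item calibrated in two separate samples.

    d_* are item difficulties in logits, se_* their standard errors.
    Assumed form: (d_high - d_low) / sqrt(se_high^2 + se_low^2).
    """
    return (d_high - d_low) / math.sqrt(se_high ** 2 + se_low ** 2)


def flag_noninvariant(items, critical=2.0):
    """Return names of items whose |t| exceeds the critical value.

    items: iterable of (name, d_high, se_high, d_low, se_low) tuples.
    """
    return [
        name
        for name, d_high, se_high, d_low, se_low in items
        if abs(separate_calibration_t(d_high, se_high, d_low, se_low)) > critical
    ]


# Hypothetical calibrations from upper-third and lower-third samples.
example_items = [
    ("item_01", 1.0, 0.1, 0.0, 0.1),  # large shift relative to its error
    ("item_02", 0.1, 0.1, 0.0, 0.1),  # shift well within error
]
flagged = flag_noninvariant(example_items)
```

A statistic of this kind compares item difficulties across person groups directly, which is why it can detect a lack of invariance that mean-square fit statistics computed on the pooled sample may not show.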

5.
A number of state assessment programs that employ Rasch-based common item equating procedures estimate the equating constant with only those common items for which the two tests' Rasch item difficulty parameter estimates differ by less than 0.3 logits. The results of this study present evidence that this practice results in an inflated probability of incorrectly dropping an item from the common item set if the number of examinees is small (e.g., 500 or less) and the reverse if the number of examinees is large (e.g., 5000 or more). An asymptotic experiment-wise error rate criterion was algebraically derived. This same criterion can also be applied to the Mantel-Haenszel statistic. Bonferroni test statistics were found to provide excellent approximations to the (asymptotically) exact test statistics.
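
To make the contrast concrete, here is a minimal sketch comparing the fixed 0.3-logit rule with a Bonferroni-adjusted test that accounts for the standard errors of the two difficulty estimates. This is not the criterion derived in the study, which the abstract does not reproduce; the z-statistic form, the function names, and the example values are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def drop_by_fixed_rule(diff, cutoff=0.3):
    """Fixed rule: drop a common item whose difficulty shift reaches the cutoff."""
    return abs(diff) >= cutoff


def drop_by_bonferroni(diff, se1, se2, n_common, alpha=0.05):
    """Drop only if the shift is significant after a Bonferroni adjustment.

    Assumed form: z = diff / sqrt(se1^2 + se2^2), two-sided test at
    alpha / n_common per item, so the set-wise error rate is at most alpha.
    """
    z = diff / math.hypot(se1, se2)
    critical = NormalDist().inv_cdf(1 - alpha / (2 * n_common))
    return abs(z) > critical


# Small sample (large standard errors): a 0.35-logit shift is within noise.
small_n = (drop_by_fixed_rule(0.35), drop_by_bonferroni(0.35, 0.2, 0.2, 20))

# Large sample (small standard errors): a 0.2-logit shift is clearly real.
large_n = (drop_by_fixed_rule(0.2), drop_by_bonferroni(0.2, 0.03, 0.03, 20))
```

In the first case the fixed rule drops an item that the adjusted test retains; in the second the fixed rule retains an item that the adjusted test drops. That mirrors the two directions of error the abstract attributes to small and large examinee counts.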

6.
Functional Caregiving (FC) is a construct about mothers caring for children (both old and young) with intellectual disabilities, which is operationally defined by two nonequivalent survey forms, urban and suburban, respectively. The first purpose of this research is to generalize school-based achievement test principles to survey methods by equating two nonequivalent survey forms. A second purpose is to expand FC foundations by a) establishing linear measurement properties for new caregiving items, b) replicating a hierarchical item structure across an urban, school-based population, c) consolidating survey forms to establish a calibrated item bank, and d) collecting more external construct validity data. Results supported invariant item parameters of a fixed item form (96 items) for two urban samples (N = 186). FC measures also showed expected construct relationships with age, mental depression, and health status. However, only five common items between urban and suburban forms were statistically stable because suburban mothers' age and child's age appear to interact with medical information and social activities.

7.
There has been some discussion among researchers as to the benefits of using one calibration process over the other during equating. Although the literature is rife with the pros and cons of the different methods, hardly any research has been done on anchoring (i.e., fixing item parameters to their pre-determined values on an established scale) as a method that is commonly used by psychometricians in large-scale assessments. This simulation research compares the fixed form of calibration with the concurrent method (where calibration of the different forms on the same scale is accomplished by a single run of the calibration process, treating all non-included items on the forms as missing or not reached), using the dichotomous Rasch (Rasch, 1960) and the Rasch partial credit (Masters, 1982) models, and the WINSTEPS (Linacre, 2003) computer program. Contrary to the belief and some researchers' contention that the concurrent run with larger n-counts for the common items would provide greater accuracy in the estimation of item parameters, the results of this paper indicate that the greater accuracy of one method over the other is confounded by the sample size, the number of common items, etc., and there is no real benefit in using one method over the other in the calibration and equating of parallel test forms.

8.
We expanded an existing 33-item physical function (PF) item bank with a sufficient number of items to enable computerized adaptive testing (CAT). Ten items were written to expand the bank and the new item pool was administered to 295 people with cancer. For this analysis of the new pool, seven poorly performing items were identified for further examination. This resulted in a bank with items that define an essentially unidimensional PF construct, cover a wide range of that construct, reliably measure the PF of persons with cancer, and distinguish differences in self-reported functional performance levels. We also developed a 5-item (static) assessment form ("BriefPF") that can be used in clinical research to express scores on the same metric as the overall bank. The BriefPF was compared to the PF-10 from the Medical Outcomes Study SF-36. Both short forms significantly differentiated persons across functional performance levels. While the entire bank was more precise across the PF continuum than either short form, there were differences in the area of the continuum in which each short form was more precise: the BriefPF was more precise than the PF-10 at the lower functional levels and the PF-10 was more precise than the BriefPF at the higher levels. Future research on this bank will include the development of a CAT version, the PF-CAT.

9.
BACKGROUND: In the development of health outcome measures, the pool of candidate items may be divided into multiple forms, thus "spreading" response burden over two or more study samples. Item responses collected using this approach result in two or more forms whose scores are not equivalent. Therefore, the item responses must be equated (adjusted) to a common mathematical metric. OBJECTIVES: The purpose of this study was to examine the effect of sample size, test size, and selection of item response theory model in equating three forms of a health status measure. Each of the forms was comprised of a set of items unique to it and a set of anchor items common across forms. RESEARCH DESIGN: The study was a secondary data analysis of patients' responses to the developmental item pool for the Health of Seniors Survey. A completely crossed design was used with 25 replications per study cell. RESULTS: We found that the quality of equatings was affected greatly by sample size. Its effect was far more substantial than choice of IRT model. Little or no advantage was observed for equatings based on 60 or 72 items versus those based on 48 items. CONCLUSIONS: We concluded that samples of less than 300 are clearly unacceptable for equating multiple forms. Additional sample size guidelines are offered based on our results.

10.
The measurement complexities emerging from vertical equating in an educational experiment aiming at an advance in the curriculum are addressed, when calibrating an 'integer ability' scale for year 5 students from Greater Manchester based both on primary (years 5 and 6) and high school (years 7 and 8) data. The need for such a calibration resulted from experimental teaching of 'high school content' in primary school. Substantial Rasch differential item functioning (DIF) arose in the vertical equating between primary and high school in our initial 'all-on-all' 'concurrent' calibration. A second 'Primary anchored-and-extended' calibration which substantially overcame DIF problems is shown to be preferable for our teaching experiment. The relevant methodological challenges and the techniques adopted are discussed. The solution provided might be useful to researchers for educational experiments targeting an advance in the curriculum.

11.
In 2005 PISA published trend indicators that compared the results of PISA 2000 and PISA 2003. In this paper we explore the extent to which the outcomes of these trend analyses are sensitive to the choice of test equating methodologies, the choice of regression models and the choice of linking items. To establish trends PISA equated its 2000 and 2003 tests using a methodology based on Rasch Modelling that involved estimating linear transformations that mapped 2003 Rasch-scaled scores to the previously established PISA 2000 Rasch-scaled scores. In this paper we compare the outcomes of this approach with an alternative, which involves the joint Rasch scaling of the PISA 2000 and PISA 2003 data separately for each country. Note that under this approach the item parameters are estimated separately for each country, whereas the linear transformation approach used a common set of item parameter estimates for all countries. Further, as its primary trend indicators, PISA reported changes in mean scores between 2000 and 2003. These means are not adjusted for changes in the background characteristics of the PISA 2000 and PISA 2003 samples - that is, they are marginal rather than conditional means. The use of conditional rather than marginal means results in some differing conclusions regarding trends at both the country and within-country level.

12.
When a new set of mixed format items is added to an existing multiple-choice (MC) test, those mixed format items should be linked to the existing MC test. This study used simulation to investigate the effect of sample size on recovery of known item parameters from the concurrent calibration in the context of horizontal equating, where the new mixed format tests are equated to the existing MC test, which acts as the common linking items. In the partial credit model following the Andrich style parameterization, item location and item step parameters were differentially affected by the sample size. Item location parameters were recovered better than item step parameters at the individual item, the sub-test, and the total test level. This study also shows the outward bias for the item location parameter estimated by the maximum likelihood estimator.

13.
This paper reviews the literature on ambidexterity in service organizations with a specific focus on the banking industry. We identify three key, cross-unit bank processes: governance (bank headquarters), sales (branch processes) and operations (ICT and facilities to support local (branch) and inter-unit (headquarters-to-branch) tasks). We suggest a framework that incorporates three main “reference models”, from an organizational design perspective. Model 1 (exploitative model) applies when the bank's headquarters work to formalize branch sales processes supported by operations processes. Model 2 (exploratory model) applies when the bank's headquarters allows flexibility in branch sales processes and uses operations processes to decentralize tasks. Model 3 (ambidextrous model) applies when a branch incorporates the characteristics of Models 1 and 2 simultaneously. We ground our claims using fieldwork conducted in 2004–2005 that involved a number of major Italian banks. We show that while large organizations, such as banks, base their ambidextrous innovation on organizational design, contextual elements such as trust and commitment, and management styles and leadership play a role in dealing with efficiency-oriented vs. flexibility-oriented tasks within the same bank branch.

14.
This paper proposes a multilevel measurement model that controls for DIF effects in test equating. The accuracy and stability of item and ability parameter estimates under the proposed multilevel measurement model were examined using randomly simulated data. Estimates from the proposed model were compared with those resulting from two multiple-group concurrent equating designs, including 1) a design that replaced DIF-items with items with no DIF; and 2) a design that retained DIF items with no attempt to control for DIF. In most of the investigated conditions, the results indicated that the proposed multilevel measurement model performed better than the two comparison models.

15.
A series of tests were developed to assess the proficiency of Australian Year 5 and Year 8 students in Asian Studies. This paper presents results of analyses that involved calibrating items distributed over 14 overlapping subtests, developed to cater for state and territory curricula and two year-levels. This allowed for state and year-level preferences to be selected from a common pool of 105 items. The project used common item anchoring to map all students and items onto a single, underpinning scale that was identified and interpreted using concurrent equating procedures and a skills audit of items.

16.
Colleges and universities conduct student satisfaction studies for many important policy making reasons. However, the differences in instrumentation and the use of students' self-reported ratings of satisfaction make such decisions sample-, instrument-, and institution-dependent. A common metric of student satisfaction would assist decision makers by providing a richness of information not typically obtained. The present study investigated the extent to which two nationally known instruments of student satisfaction could be scaled on the same quantitative metric. Pseudo-common item equating (Fisher, 1997) based on five link items of low and high endorsability enabled comparisons of "similar, but not identical items, from different instruments, calibrated on different samples" (p. 87). Results suggest that both instruments measured similar constructs and could be reasonably used to create a single, common metric. While samples used in the experiment were less than ideal, results clearly demonstrated the usefulness and reasonability of the pseudo-common item equating process.

17.
The objective of this paper is to analyze major error sources in the process of simultaneous calibration of a large number of thermocouples (TCs). Additional error sources occur when a large number of TCs are simultaneously calibrated mainly because of additional inhomogeneity effects and characteristics of a particular measuring setup. Such calibrations are needed because a large number of relatively low-cost but traceably calibrated TCs are often required for monitoring purposes and measurements of temperature profiles in applications such as evaluation of temperature and humidity chambers, wind channels, tunnel furnaces in steel plants, etc. The overall uncertainty attributed to each particular TC during simultaneous calibration exceeds the uncertainty assigned to a TC during a calibration process of a single TC. Nevertheless, in most cases, the proposed solution is highly acceptable, especially in the area of testing, due to a significant decrease of calibration costs while still meeting testing requirements.

18.
19.
In this paper, we explore the potential for strategic environmental assessment (SEA) to be a useful tool for banks to manage environmental risks and inform lending decisions. SEA is an environmental assessment tool that was developed to assist strategic-level decision-makers, such as policy-makers, planners, government authorities and environmental practitioners in improving developmental outcomes, aiming to facilitate the transition to sustainable development. We propose that SEA may also be a valuable tool for banks because it has the capacity to provide information about environmental risks at a time when it can be used as an input to bank lending decisions, which can assist banks in making lending decisions with better environmental outcomes. For these reasons, we argue that in some circumstances, and particularly for project finance transactions, SEA may be a more useful environmental assessment tool for lenders than environmental impact assessment, which many banks are currently relying on to help assess and mitigate environmental risks. Furthermore, we suggest that the use of SEA by banks would contribute to the sustainability goals of SEA.

20.
This study describes and demonstrates a set of processes for developing new forms of examinations which are intended to have equivalent cut scores in the raw score metric. This approach goes beyond the traditional Rasch-based approach which develops forms with cut scores that are equated in the logit metric. The methods described in this study can be used to create multiple forms of an assessment, all of which have the same raw score cut score (i.e., the number correct required to pass each examination form represents the same amount of the underlying construct). This paper provides an overview of equating standards, the research related specifically to pre-equating procedures, and three guidelines which can be used to achieve equal raw score cut scores. Three examples of how to use the guidelines as part of an iterative form-development process are provided using simulated data sets.
