Similar Articles
 20 similar articles found (search time: 31 ms)
1.
The study investigated five factors that can affect the equating of scores from two tests onto a common score scale. The five factors were: (a) item distribution type (i.e., normal versus uniform); (b) standard deviation of item difficulty (i.e., .68, .95, .99); (c) number of items, or test length (i.e., 50, 100, 200); (d) number of common items (i.e., 10, 20, 30); and (e) sample size (i.e., 100, 300, 500). The SIMTEST and BIGSTEPS programs were used for the simulation and equating, respectively, of 4,860 item data sets. Results from the five-way fixed-effects factorial analysis of variance indicated three statistically significant two-way interaction effects. Given Type I error rate considerations, only the simple effects for the interaction between number of common items and test length were interpreted. The eta-squared values for number of common items and test length were small, indicating that the effects had little practical importance. The Rasch approach to equating is robust with as few as 10 common items and a test length of 100 items.

2.
The purpose of this study is to explore criteria for common-element test equating for performance examinations. Using the multi-facet Rasch model, each element of each facet is calibrated, or placed in a relative position, on a Benchmark or reference scale. Common elements from each facet, included on the examinations being equated, are used to anchor the facet elements to the Benchmark Scale. This places all examinations on the same scale so that the same criterion standard can be used. Performance examinations typically have three to four facets, including examinees, raters, items, and tasks. Raters rate examinees on tasks related to the items included in the test. The initial anchoring of a current test administration to the Benchmark Scale is evaluated for invariance and fit. If there is too much variance, or a lack of fit, for particular facet elements, it may be necessary to unanchor those elements, which means they are not used in the equating. The equating process was applied to one exam with four facets and another with five facets. Results showed that only a few common facet elements could not be used in the test equating process, and that differences in the difficulty of the equated exams were identified so that the criterion standard on the Benchmark Scale could be used. Careful quality control was necessary when anchoring the common elements in each facet. The common elements should be unaltered from their original use. Strict criteria for displacement and fit must be established and used consistently. Unanchoring inconsistent and/or misfitting facet elements improves the quality of the test equating.

3.
There has been some discussion among researchers as to the benefits of using one calibration process over another during equating. Although the literature is rife with the pros and cons of the different methods, hardly any research has been done on anchoring (i.e., fixing item parameters to their pre-determined values on an established scale) as a method that is commonly used by psychometricians in large-scale assessments. This simulation research compares the fixed form of calibration with the concurrent method (where calibration of the different forms on the same scale is accomplished in a single run of the calibration process, treating all non-included items on the forms as missing or not reached), using the dichotomous Rasch (Rasch, 1960) and Rasch partial credit (Masters, 1982) models and the WINSTEPS (Linacre, 2003) computer program. Contrary to the belief, and some researchers' contention, that the concurrent run with larger n-counts for the common items would provide greater accuracy in the estimation of item parameters, the results of this paper indicate that the relative accuracy of the two methods is confounded by sample size, the number of common items, and other factors, and that there is no real benefit in using one method over the other in the calibration and equating of parallel test forms.

4.
Ueno, Maomi; Fuchimoto, Kazuma; Tsutsumi, Emiko. Behaviormetrika (2021), 48(2), 409-424

This paper presents a review of advanced technologies for e-testing using an artificial intelligence approach. First, it introduces state-of-the-art uniform test assembly methods that guarantee the equivalence of examinees' test scores even when different examinees with the same ability take different tests. More formally, each uniform test form has equivalent measurement accuracy but a different set of items. To increase the number of assembled tests, some test assembly methods allow any two uniform test forms to share common items, as long as the number of shared items does not exceed a user-specified constraint. This situation is designated as the overlapping condition. However, methods used under an overlapping condition are often adversely affected by bias in item exposure frequency and by decreased reliability of items and tests. Second, this paper introduces state-of-the-art uniform test form assembly with a constraint on item exposure. Most earlier studies of e-testing employ item response theory (IRT) to obtain each examinee's test score. However, IRT rests on several strict assumptions. Recently, Deep-IRT, which employs deep learning to relax these assumptions, has attracted attention. Finally, this paper introduces Deep-IRT models.


5.
BACKGROUND: In the development of health outcome measures, the pool of candidate items may be divided into multiple forms, thus "spreading" response burden over two or more study samples. Item responses collected using this approach result in two or more forms whose scores are not equivalent. Therefore, the item responses must be equated (adjusted) to a common mathematical metric. OBJECTIVES: The purpose of this study was to examine the effects of sample size, test size, and choice of item response theory (IRT) model when equating three forms of a health status measure. Each form comprised a set of items unique to it and a set of anchor items common across forms. RESEARCH DESIGN: The study was a secondary data analysis of patients' responses to the developmental item pool for the Health of Seniors Survey. A completely crossed design was used with 25 replications per study cell. RESULTS: We found that the quality of the equatings was affected greatly by sample size; its effect was far more substantial than the choice of IRT model. Little or no advantage was observed for equatings based on 60 or 72 items versus those based on 48 items. CONCLUSIONS: We concluded that samples of fewer than 300 are clearly unacceptable for equating multiple forms. Additional sample size guidelines are offered based on our results.

6.
This study addresses item exposure in a Computerized Adaptive Test (CAT) when the item selection algorithm is permitted to present examinees with questions that they have already been asked in a previous test administration. The results indicate that the combined use of an adaptive algorithm to select items and latent trait theory to estimate person ability provides substantial protection from score contamination. The implications for constraints that prohibit examinees from seeing an item twice are discussed.

7.
The invariance of the estimated parameters across variation in the incidental parameters of a sample is one of the most important properties of Rasch measurement models. This is the property that allows the equating of test forms and the use of computer adaptive testing. It necessarily follows that in Rasch models, if the data fit the model, then the estimate of the parameter of interest must be invariant across sub-samples of the items or persons. This study investigates the degree to which the INFIT and OUTFIT item fit statistics in WINSTEPS detect violations of the invariance property of Rasch measurement models. The test in this study is an 80-item multiple-choice test used to assess mathematics competency. The WINSTEPS analysis of the dichotomous results, based on a sample of 2,000 drawn from the very large number of students who took the exam, indicated that only 7 of the 80 items misfit using the 1.3 mean-square criterion advocated by Linacre and Wright. Subsequent calibration of separate samples of 1,000 students from the upper and lower thirds of the person raw-score distribution, followed by a t-test comparison of the item calibrations, indicated that the item difficulties for 60 of the 80 items were more than 2 standard errors apart. The separate-calibration t-values ranged from +21.00 to -7.00, and 41 of the 80 comparisons had t-values either larger than +5 or smaller than -5. Clearly these data do not exhibit the invariance of the item parameters expected if the data fit the model, yet the INFIT and OUTFIT mean squares are completely insensitive to the lack of invariance in the item parameters. If the OUTFIT ZSTD from WINSTEPS were used with a critical value of |t| > 2.0, then 56 of the 60 items identified by the separate-calibration t-test would be identified as misfitting. A fourth measure of misfit, the between-ability-group item fit statistic, identified 69 items as misfitting when a critical value of t > 2.0 was used. Relying solely on the INFIT and OUTFIT mean squares in WINSTEPS to assess the fit of the data to the model would therefore cause one to miss one of the most important threats to the usefulness of the measurement model.
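The separate-calibration check described in this abstract can be sketched in a few lines. The difficulty estimates and standard errors below are invented for illustration, not taken from the study; the t statistic itself is the standard comparison of two independent Rasch calibrations of the same item.

```python
import math

def separate_calibration_t(d_high, se_high, d_low, se_low):
    """t statistic comparing one item's Rasch difficulty estimates from two
    independent calibrations (e.g., upper- and lower-third ability samples)."""
    return (d_high - d_low) / math.sqrt(se_high**2 + se_low**2)

# Hypothetical (difficulty, SE) pairs from high- and low-ability calibrations.
items = [
    ((0.52, 0.07), (0.10, 0.08)),    # difficulty shifts between samples
    ((-1.10, 0.09), (-1.15, 0.10)),  # stable between samples
]

for i, ((d1, s1), (d2, s2)) in enumerate(items, start=1):
    t = separate_calibration_t(d1, s1, d2, s2)
    flag = "non-invariant" if abs(t) > 2.0 else "invariant"
    print(f"item {i}: t = {t:+.2f} ({flag})")
```

Under the |t| > 2 criterion used in the study, the first hypothetical item would be flagged as violating invariance while the second would not, even though both could easily pass an INFIT/OUTFIT mean-square screen.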

8.
This article describes Rasch measurement procedures for equating multiple test forms or calibrating an item bank. The procedures entail (a) selecting an appropriate data collection design, (b) estimating parameters, (c) transforming the parameters from multiple forms to a common scale, and (d) evaluating the quality of the linkage between these forms. Data collection designs include (a) anchor tests, (b) single group, (c) single data set, and (d) equivalent groups. Estimation procedures may involve (a) separate or (b) simultaneous calibration of data from multiple forms. Transformation is typically accomplished using (a) estimation scaling, but may involve (b) parameter anchoring or (c) computing equating constants. Link quality is evaluated using four fit indices: (a) item-within-link, (b) item-between-link, (c) link-within-bank, and (d) form-within-bank. These procedures are illustrated using an anchor test design.

9.
In state assessment programs that employ Rasch-based common item linking procedures, the linking constant is usually estimated with only those common items not identified as exhibiting item difficulty parameter drift. Since state assessments typically contain a fixed number of items, an item classified as exhibiting parameter drift during the linking process remains on the exam as a scorable item even if it is removed from the common item set. Under the assumption that item parameter drift has occurred for one or more of the common items, the expected effect of including or excluding the "affected" item(s) in the estimation of the linking constant is derived in this article. If the item parameter drift is due solely to factors not associated with a change in examinee achievement, no linking error will (be expected to) occur given that the linking constant is estimated only with the items not identified as "affected"; linking error will (be expected to) occur if the linking constant is estimated with all common items. However, if the item parameter drift is due solely to change in examinee achievement, the opposite is true: no linking error will (be expected to) occur if the linking constant is estimated with all common items; linking error will (be expected to) occur if the linking constant is estimated only with the items not identified as "affected".
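The mechanics behind this result can be sketched simply. In Rasch common-item linking the constant is commonly taken as the mean new-minus-old difficulty difference over the common items, so a single item that drifts by d shifts the constant by d/n if it is retained. The difficulty values below are invented for illustration; they are not from the article.

```python
def linking_constant(new_diffs, old_diffs):
    """Mean new-minus-old difficulty difference over the common items,
    a standard Rasch estimate of the linking constant."""
    return sum(n - o for n, o in zip(new_diffs, old_diffs)) / len(new_diffs)

# Hypothetical common-item difficulties (logits); item 4 drifts by +0.8,
# and the true scale shift between administrations is +0.1.
old = [-1.0, -0.2, 0.3, 0.5, 1.1]
new = [-0.9, -0.1, 0.4, 1.4, 1.2]

with_drift = linking_constant(new, old)                        # all items kept
screened = linking_constant(new[:3] + new[4:], old[:3] + old[4:])  # item 4 dropped

print(round(with_drift, 3))  # 0.26: biased by drift/n = 0.8/5 above the true 0.1
print(round(screened, 3))    # 0.1:  recovers the true shift
```

Whether the 0.26 or the 0.10 is the "correct" constant is exactly the article's point: it depends on whether the drift reflects construct-irrelevant factors or a genuine change in examinee achievement.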

10.
Colleges and universities conduct student satisfaction studies for many important policy-making reasons. However, the differences in instrumentation and the use of students' self-reported satisfaction ratings make such decisions sample-, instrument-, and institution-dependent. A common metric of student satisfaction would assist decision makers by providing a richness of information not typically obtained. The present study investigated the extent to which two nationally known instruments of student satisfaction could be scaled on the same quantitative metric. Pseudo-common item equating (Fisher, 1997), based on five link items of low and high endorsability, enabled comparisons of "similar, but not identical items, from different instruments, calibrated on different samples" (p. 87). Results suggest that both instruments measured similar constructs and could reasonably be used to create a single, common metric. While the samples used in the experiment were less than ideal, the results clearly demonstrated the usefulness and reasonableness of the pseudo-common item equating process.

11.
Reliability is a fundamental concept of test construction. The most common measure of reliability, coefficient alpha, is frequently used without an understanding of its behavior. This article contributes to the understanding of test reliability by demonstrating that questions which lower reliability are inconsistent with the bulk of the test and prone to test-taking tricks and guessing. These qualitative characteristics, obtained from focus groups, suggest possible causes of lower reliability, such as poorly written questions (e.g., the correct answer looks different from the incorrect answers), questions on which students must guess (e.g., the topic is too advanced), and questions where recalling a definition is crucial. Quantitative findings confirm that questions lower reliability when students who answer correctly have lower overall scores than students who answer incorrectly. This phenomenon is quantified by the "gap" between these students' overall scores, which is shown to be highly correlated with other item metrics. An increasing number of concept inventory tests are being developed to assess student learning in engineering. Scores and student comments from the Statistics Concept Inventory are used to make these judgments.
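The "gap" statistic described in this abstract can be sketched as follows; the response patterns and totals are invented for illustration, and the function is a plausible reading of the abstract rather than the authors' exact definition.

```python
def score_gap(responses, totals):
    """For one question: mean overall score of students answering correctly
    minus mean overall score of those answering incorrectly.
    responses: 1 (correct) / 0 (incorrect); totals: overall test scores."""
    correct = [t for r, t in zip(responses, totals) if r == 1]
    wrong = [t for r, t in zip(responses, totals) if r == 0]
    return sum(correct) / len(correct) - sum(wrong) / len(wrong)

totals = [10, 14, 18, 22, 26, 30]   # hypothetical overall test scores
good_item = [0, 0, 0, 1, 1, 1]      # stronger students answer correctly
bad_item = [1, 1, 0, 0, 1, 0]       # weaker students do as well or better

print(score_gap(good_item, totals))           # 12.0: large positive gap
print(round(score_gap(bad_item, totals), 2))  # -6.67: negative gap lowers alpha
```

A question with a negative gap is exactly the case the article flags: students who answer it correctly score lower overall than those who answer it incorrectly, so the item works against the rest of the test.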

12.
When a new set of mixed-format items is added to an existing multiple-choice (MC) test, those mixed-format items should be linked to the existing MC scale. This study used simulation to investigate the effect of sample size on the recovery of known item parameters from concurrent calibration in the context of horizontal equating, where the new mixed-format tests are equated to the existing MC test, which serves as the set of common linking items. In the partial credit model under the Andrich parameterization, item location and item step parameters were differentially affected by sample size: item location parameters were recovered better than item step parameters at the individual-item, sub-test, and total-test levels. This study also found an outward bias in the item location parameters estimated by the maximum likelihood estimator.

13.
Although post-equating (PE) has proven to be an acceptable method for the scaling and equating of items and forms, there are times when the turnaround period for equating and converting raw scores to scale scores is so short that PE cannot be undertaken within the prescribed time frame. In such cases, pre-equating (PrE) could be considered an acceptable alternative. Assessing the feasibility of using item calibrations from the item bank (as in PrE) is conditioned on the equivalency of those calibrations, and of the errors associated with them, vis-à-vis the results obtained via PE. This paper creates item banks over three periods of item introduction and uses the Rasch model to examine the data with respect to the recovery of item parameters, the measurement error, and the effect cut-points have on examinee placement in both the PrE and PE situations. Results indicate that PrE is a viable alternative to PE, provided the stability of the item calibrations is enhanced by using large sample sizes (perhaps as large as the full population) in populating the item bank.

14.
Functional Caregiving (FC) is a construct about mothers caring for children (both old and young) with intellectual disabilities, operationally defined by two nonequivalent survey forms, urban and suburban. The purposes of this research are, first, to generalize school-based achievement test principles to survey methods by equating two nonequivalent survey forms, and second, to expand FC foundations by (a) establishing linear measurement properties for new caregiving items, (b) replicating a hierarchical item structure across an urban, school-based population, (c) consolidating survey forms to establish a calibrated item bank, and (d) collecting more external construct validity data. Results supported invariant item parameters of a fixed item form (96 items) for two urban samples (N = 186). FC measures also showed expected construct relationships with age, mental depression, and health status. However, only five common items between the urban and suburban forms were statistically stable, because suburban mothers' age and child's age appear to interact with medical information and social activities.

15.
Contemporary views on cognitive theory (e.g., Sternberg and Perez, 2005) regard typical measurement tasks, such as ability and achievement test items, as multidimensional rather than unidimensional. Assessing the levels and sources of multidimensionality in an item domain is important for item selection as well as for item revision and development. In this paper, multicomponent latent trait models (MLTM) and traditional multidimensional item response theory models are described mathematically and compared with respect to the nature of the dimensions that can be estimated. Then, some applications are presented to provide examples of MLTM. Last, practical estimation procedures are described, along with syntax, for the estimation of MLTM and a related model.

16.
Although Cruise Control (CC) is available for most cars, no studies were found that examine how this automation system influences driving behaviour. However, a relatively large number of studies have examined Adaptive Cruise Control (ACC), which, compared to CC, also includes distance control. Besides positive effects such as better compliance with speed limits, there are also indications of smaller distances to lead vehicles and slower responses in situations that require immediate braking. Similar effects can be expected for CC, as this system likewise takes over longitudinal control. To test this hypothesis, a simulator study was conducted at the German Aerospace Center (DLR). Twenty-two participants drove different routes (highway and motorway) under three conditions (assisted by ACC, assisted by CC, and manual driving without any system). Different driving scenarios were examined, including a secondary-task condition. On the one hand, both systems led to lower maximum velocities and fewer speed limit violations. There was no indication that drivers shift more of their attention towards secondary tasks when driving with CC or ACC. However, there were delayed driver reactions in critical situations, e.g., in a narrow curve or a fog bank. These results give rise to some caution regarding the safety effects of these systems, especially if in the future their range of functionality (e.g., ACC Stop-and-Go) is further increased.

17.
The use of common tasks and rating procedures when assessing the communicative skills of students from highly diverse linguistic and cultural backgrounds poses particular measurement challenges, which have thus far received little research attention. If assessment tasks or criteria are found to function differentially for particular subpopulations within a test candidature with the same or a similar level of criterion ability, then the test is open to charges of bias in favour of one or another group. While there have been numerous studies involving dichotomous language test items (see e.g. Chen and Henning, 1985 and, more recently, Elder, 1996), few studies have considered the issue of bias in relation to performance-based tasks which are assessed subjectively, via analytic and holistic rating scales. The paper demonstrates how Rasch analytic procedures can be applied to the investigation of item bias, or differential item functioning (DIF), in both dichotomous and scalar items on a test of English for academic purposes. The data were gathered from a pilot English language test administered to a representative sample of undergraduate students (N = 139) enrolled in their first year of study at an English-medium university. The sample included native speakers of English who had completed up to 12 years of secondary schooling in their first language (L1) and immigrant students, mainly from Asian language backgrounds, with varying degrees of prior English language instruction and exposure. The purpose of the test was to diagnose the academic English needs of incoming undergraduates so that additional support could be offered to those deemed at risk of failure in their university study. Some of the tasks included in the assessment procedure involved objectively-scored items (measuring vocabulary knowledge, text-editing skills, and reading and listening comprehension), whereas others (i.e. a report and an argumentative writing task) were subjectively-scored. The study models a methodology for estimating bias with both dichotomous and scalar items, using the programs Quest (Adams and Khoo, 1993) for the former and ConQuest (Wu, Adams and Wilson, 1998) for the latter. It also offers answers to the practical questions of whether a common set of assessment criteria can, in an academic context such as this one, be meaningfully applied to all subgroups within the candidature, and whether analytic criteria are more susceptible to biased ratings than holistic ones. Implications for test fairness and test validity are discussed.

18.
Weapon systems that function destructively (e.g., a missile or torpedo) are to be acquired in a lot of size m. Acceptance of the lot is based on the result of an operational test administered to part of the lot: if the test results indicate positive operational value, the lot is accepted and the remaining part of the lot is fielded; otherwise the lot is "rejected". A test plan is designed that establishes an optimal number of weapon copies to test, given models of the operational gain of the fielded weapon under two tactical options and the uncertainty in the weapon's predicted probability of success after the test is complete. The major test objective is to realize possible operational utility from the lot of items, and secondarily to demonstrate arbitrary levels of certainty.

19.
The problem of assigning a probability of matching a number of spectra is addressed. The context is environmental spills, when an EPA needs to show that the material from a polluting spill (e.g., oil) is likely to have originated at a particular site (factory, refinery) or from a vehicle (road tanker or ship). Samples are taken from the spill and the candidate sources, and are analyzed by spectroscopy (IR, fluorescence) or chromatography (GC or GC/MS). A matching algorithm is applied to pairs of spectra, giving a single statistic (R). This can be a point-to-point match giving a correlation coefficient, a Euclidean distance, or a derivative of these parameters. The distributions of R for same and different samples are established from existing data. For matching statistics with values in the range {0,1}, corresponding to no match (0) through a perfect match (1), a beta distribution can be fitted to most data. The values of R from the match of the spectrum of a spilled oil with that of each of a number of suspects are calculated, and Bayes' theorem is applied to give the probability of a match between the spill sample and each candidate, as well as the probability of no match at all. The method is most effective when simple inspection of the matching parameters does not lead to an obvious conclusion, i.e., when there is overlap of the distributions, giving rise to dubiety of an assignment. The ratio of the probability of finding a matching statistic if there were a match to the probability of finding it if there were no match (the likelihood ratio) is a sensitive and useful parameter to guide the analyst. It is proposed that this approach may be acceptable to a court of law and avoid challenges to the apparently subjective opinion of an analyst. Examples of matching the fluorescence and infrared spectra of diesel oils are given.
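The beta-distribution likelihood-ratio idea in this abstract can be sketched as follows. The beta parameters below are invented for illustration; in the method described, they would be fitted to historical same-pair and different-pair matching statistics.

```python
from math import gamma

def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

def likelihood_ratio(r, same=(40.0, 4.0), diff=(8.0, 8.0)):
    """LR = f(R | same source) / f(R | different source).
    'same' concentrates near R = 1; 'diff' is centred near R = 0.5.
    Both parameter pairs are hypothetical, not fitted values."""
    return beta_pdf(r, *same) / beta_pdf(r, *diff)

def posterior_match(r, prior=0.5):
    """Bayes' theorem: posterior probability that the two spectra match."""
    lr = likelihood_ratio(r)
    return lr * prior / (lr * prior + (1 - prior))

# In the overlap region the LR, not visual inspection, guides the analyst.
for r in (0.95, 0.70):
    print(f"R = {r}: LR = {likelihood_ratio(r):.3g}, "
          f"P(match) = {posterior_match(r):.3f}")
```

With these hypothetical parameters, R = 0.95 yields a likelihood ratio far above 1 (strong support for a common source), while R = 0.70 yields a ratio below 1, pushing the posterior probability of a match below the prior.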

20.
This study examined item calibration stability in relation to response time, and the levels of item difficulty across response-time groups, in a sample of 389 examinees responding to six subtests of the Perceptual Ability Test (PAT). The results indicated that no Differential Item Functioning (DIF) was found, and a significant correlation of item difficulties was observed between slow and fast responders. Three distinct levels of difficulty emerged among the six subtests across groups. Slow responders spent significantly more time than fast responders on the four most difficult subtests. A positive, significant relationship was found between item difficulty and response time across groups on the overall set of PAT items. Overall, this study found that: (1) the same underlying construct is measured across groups; (2) the PAT scores are equally useful across groups; (3) different sources of item difficulty may exist among the six subtests; and (4) more difficult test items may require more time to answer.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号