Similar literature
Found 20 similar documents (search time: 31 ms)
1.
Traditionally, women and minorities have not been fully represented in science and engineering. Numerous studies have attributed this underrepresentation to gaps in science achievement as measured by various standardized tests. Rather than describe mean group differences in science achievement across multiple cultures, this study focused on an in-depth item-level analysis across two countries: Spain and the United States. This study investigated eighth-grade gender differences on science items across the two countries. A secondary purpose of the study was to explore the nature of gender differences using the many-faceted Rasch model as a way to estimate gender differential item functioning (DIF). A secondary analysis of data from the Third International Mathematics and Science Study (TIMSS) was used to address three questions: 1) Does gender DIF in science achievement exist? 2) Is there a relationship between gender DIF and characteristics of the science items? 3) Do the relationships between item characteristics and gender DIF in science items replicate across countries? Participants included 7,087 eighth-grade students from the United States and 3,855 students from Spain who participated in TIMSS. The Facets program (Linacre and Wright, 1992) was used to estimate gender DIF. The results of the analysis indicate that the content of the item seemed to be related to gender DIF. The analysis also suggests that there is a relationship between gender DIF and item format. No pattern of gender DIF related to cognitive demand was found. The general pattern of gender DIF was similar across the two countries used in the analysis. The strength of item-level analysis, as opposed to group mean difference analysis, is that gender differences can be detected at the item level even when no mean differences can be detected at the group level.

2.
Using data from the PISA 2006 field trial, Rasch item response models are used to demonstrate that extreme response tendency was exhibited differentially across culturally distinct countries when answering Likert-type attitude items. A single attitude scale is examined across eight culturally distinct countries in this paper. Two avenues to ameliorate this tendency are explored: first, using dichotomous variants of the items, and second, incorporating the country-specific response tendency into the Rasch item response model. Analysis of the item variants reveals similar scale outcomes and correlations with achievement, but a preference for the Likert variant when test information is considered. A hierarchical analysis using facet models reveals that the data fit significantly better in a model that incorporates an interaction effect between the country and the item delta parameters. The implications for reporting attitudes measured with Likert items across cultures are outlined.

3.
The invariance of the estimated parameters across variation in the incidental parameters of a sample is one of the most important properties of Rasch measurement models. This is the property that allows the equating of test forms and the use of computer adaptive testing. It necessarily follows that in Rasch models, if the data fit the model, then the estimation of the parameter of interest must be invariant across sub-samples of the items or persons. This study investigates the degree to which the INFIT and OUTFIT item fit statistics in WINSTEPS detect violations of the invariance property of Rasch measurement models. The test in this study is an 80-item multiple-choice test used to assess mathematics competency. The WINSTEPS analysis of the dichotomous results, based on a sample of 2,000 from a very large number of students who took the exam, indicated that only 7 of the 80 items misfit using the 1.3 mean square criterion advocated by Linacre and Wright. Subsequent calibration of separate samples of 1,000 students from the upper and lower third of the person raw score distribution, followed by a t-test comparison of the item calibrations, indicated that the item difficulties for 60 of the 80 items were more than 2 standard errors apart. The separate calibration t-values ranged from +21.00 to -7.00, with the t-test value of 41 of the 80 comparisons either larger than +5 or smaller than -5. Clearly these data do not exhibit the invariance of the item parameters expected if the data fit the model. Yet the INFIT and OUTFIT mean squares are completely insensitive to the lack of invariance in the item parameters. If the OUTFIT ZSTD from WINSTEPS were used with a critical value of | t | > 2.0, then 56 of the 60 items identified by the separate calibration t-test would be identified as misfitting. A fourth measure of misfit, the between ability-group item fit statistic, identified 69 items as misfitting when a critical value of t > 2.0 was used. Clearly, relying solely on the INFIT and OUTFIT mean squares in WINSTEPS to assess the fit of the data to the model would cause one to miss one of the most important threats to the usefulness of the measurement model.
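
The separate-calibration comparison described in this abstract can be sketched in a few lines. This is an illustration only, not the authors' code: it assumes the usual form of the statistic, the difference between two independent difficulty estimates divided by the square root of the sum of their squared standard errors, and the function names and item values are hypothetical.

```python
import math


def separate_calibration_t(d_upper, se_upper, d_lower, se_lower):
    """t-statistic comparing two independent calibrations of one item.

    Assumed form: difference in difficulty estimates divided by the
    pooled standard error of that difference.
    """
    return (d_upper - d_lower) / math.sqrt(se_upper ** 2 + se_lower ** 2)


def flag_non_invariant(calibrations, critical=2.0):
    """Return the ids of items whose |t| exceeds the critical value.

    calibrations maps item id -> (d_upper, se_upper, d_lower, se_lower).
    """
    return [
        item
        for item, values in calibrations.items()
        if abs(separate_calibration_t(*values)) > critical
    ]


# Hypothetical calibrations (logits) from upper- and lower-ability samples.
example = {
    "item01": (0.50, 0.10, 0.45, 0.10),  # small shift relative to its error
    "item02": (1.20, 0.08, 0.40, 0.09),  # large shift relative to its error
}
print(flag_non_invariant(example))  # ['item02']
```

An item flagged this way has difficulty estimates that differ by more than the chosen number of standard errors between the two ability groups, which is the kind of invariance violation the abstract reports for 60 of the 80 items.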

4.
In state assessment programs that employ Rasch-based common item linking procedures, the linking constant is usually estimated with only those common items not identified as exhibiting item difficulty parameter drift. Since state assessments typically contain a fixed number of items, an item classified as exhibiting parameter drift during the linking process remains on the exam as a scorable item even if it is removed from the common item set. Under the assumption that item parameter drift has occurred for one or more of the common items, the expected effect of including or excluding the "affected" item(s) in the estimation of the linking constant is derived in this article. If the item parameter drift is due solely to factors not associated with a change in examinee achievement, no linking error will (be expected to) occur given that the linking constant is estimated only with the items not identified as "affected"; linking error will (be expected to) occur if the linking constant is estimated with all common items. However, if the item parameter drift is due solely to change in examinee achievement, the opposite is true: no linking error will (be expected to) occur if the linking constant is estimated with all common items; linking error will (be expected to) occur if the linking constant is estimated only with the items not identified as "affected".
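
To make the include/exclude choice concrete, here is a minimal sketch. It assumes a mean-shift linking constant (the average difference between old-form and new-form difficulty estimates over the common items), which is a common Rasch convention and not necessarily the article's exact procedure. The function name and all item values are hypothetical.

```python
def linking_constant(old, new, exclude=()):
    """Mean difference (old - new) in difficulty over the common items.

    old and new map item id -> Rasch difficulty estimate in logits.
    Items listed in exclude (for example, items flagged for drift)
    are left out of the estimate.
    """
    items = [i for i in old if i in new and i not in exclude]
    if not items:
        raise ValueError("no common items left to estimate the constant")
    return sum(old[i] - new[i] for i in items) / len(items)


# Hypothetical common-item difficulties; item "c4" shifts the other way.
old_form = {"c1": -0.5, "c2": 0.0, "c3": 0.5, "c4": 1.0}
new_form = {"c1": -0.7, "c2": -0.2, "c3": 0.3, "c4": 1.4}

with_all = linking_constant(old_form, new_form)                     # about 0.05
without_c4 = linking_constant(old_form, new_form, exclude=("c4",))  # about 0.20
print(round(with_all, 2), round(without_c4, 2))
```

The two constants differ, and the code alone cannot say which one is free of linking error: as the abstract argues, that depends on whether the drift reflects a change in examinee achievement.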

5.
Kim, Seock-Ho; Cohen, Allan S.; Eom, Hyo Jin. Behaviormetrika, 2021, 48(2): 345-367
This paper contrasts three methods of item analysis for multiple-choice items based on classical test theory, generalized linear modeling, and item response theory. Illustrations...

6.
The Attention Deficit Hyperactivity Disorder (ADHD) criteria from the American Psychiatric Association's Diagnostic and Statistical Manual of Mental Disorders were used to assess a large sample of children at the end of their first year at school in England. These data were explored using Rasch measurement and the measures for the items together with their frequencies are reported. The data were further analysed in three ways: a) The results were compared with a previous similar analysis of college students. b) A principal components analysis of the item residuals from the Rasch analysis was conducted. c) The measures were linked to reading and mathematics attainment assessed at three different time points. The exploration supported previous work and theoretical positions, and in doing so raised issues about the appropriateness of the use of the criteria across all ages. It also suggested that one of the currently recognised ADHD sub-types could be further sub-divided into verbal and physical hyperactivity. The links to academic achievement raised questions about the integrity of the currently recognised ADHD sub-types and the paper calls for further investigations.

7.
In 2005 PISA published trend indicators that compared the results of PISA 2000 and PISA 2003. In this paper we explore the extent to which the outcomes of these trend analyses are sensitive to the choice of test equating methodologies, the choice of regression models and the choice of linking items. To establish trends PISA equated its 2000 and 2003 tests using a methodology based on Rasch Modelling that involved estimating linear transformations that mapped 2003 Rasch-scaled scores to the previously established PISA 2000 Rasch-scaled scores. In this paper we compare the outcomes of this approach with an alternative, which involves the joint Rasch scaling of the PISA 2000 and PISA 2003 data separately for each country. Note that under this approach the item parameters are estimated separately for each country, whereas the linear transformation approach used a common set of item parameter estimates for all countries. Further, as its primary trend indicators, PISA reported changes in mean scores between 2000 and 2003. These means are not adjusted for changes in the background characteristics of the PISA 2000 and PISA 2003 samples - that is, they are marginal rather than conditional means. The use of conditional rather than marginal means results in some differing conclusions regarding trends at both the country and within-country level.

8.
Stout, William; Henson, Robert; DiBello, Lou. Behaviormetrika, 2023, 50(1): 177-215
The paper's extended Diagnostic Classification Modeling setting assumes (a) nominal item (question) coding, thus including multiple-choice (MC) items, and (b)...

9.
When a new set of mixed-format items is added to an existing multiple-choice (MC) test, the mixed-format items should be linked to the existing MC test. This study used simulation to investigate the effect of sample size on the recovery of known item parameters from concurrent calibration in the context of horizontal equating, where the new mixed-format tests are equated to the existing MC test, whose items act as the common linking items. In the partial credit model following the Andrich-style parameterization, item location and item step parameters were differentially affected by the sample size. Item location parameters were recovered better than item step parameters at the individual item, the sub-test, and the total test level. This study also shows an outward bias in the item location parameters estimated by the maximum likelihood estimator.

10.
A factorial procedure for investigating differential distractor functioning in multiple-choice items is proposed. The procedure adopts the formulation of general linear models and treats grouping factors as independent variables and item parameters across the grouping factors as a dependent variable. Specifically, each distractor in a multiple-choice item is modeled with a distinct distractibility parameter. The distractibility parameters across groups are partitioned into a grand mean distractibility, sets of parameters representing main effects of the individual grouping factors, and interaction effects among them. Results of a simulation study show that the parameters of the proposed model were recovered very well. Ten four-choice items in the English test of the 1997 Taiwan Joint College Entrance Examination, with seven thousand examinees classified by two grouping factors, were analyzed.

11.
A series of tests were developed to assess the proficiency of Australian Year 5 and Year 8 students in Asian Studies. This paper presents results of analyses that involved calibrating items distributed over 14 overlapping subtests, developed to cater for state and territory curricula and two year-levels. This allowed for state and year-level preferences to be selected from a common pool of 105 items. The project used common item anchoring to map all students and items onto a single, underpinning scale that was identified and interpreted using concurrent equating procedures and a skills audit of items.

12.
The rating scale model (Andrich, 1978) was applied to data from a survey that directed students to rate their satisfaction with college services on a five-point Likert scale. Because students used different services, and students were directed to rate only the services they used, the items were differentially exposed to a person factor that we call "pleasability." Differential exposure to pleasability makes an item's average rating a biased measure of its performance. In contrast, item parameter estimates in the rating scale model corrected for differential exposure to pleasability. Compared to items' average ratings, item parameter estimates in the rating scale model did a better job of predicting which item received the higher rating when any two items were rated by the same rater.
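
For readers unfamiliar with the rating scale model, the sketch below computes category probabilities for one person-item pair. It assumes the standard Andrich formulation, in which the log-odds of adjacent categories is the person measure minus the item location minus a threshold shared by all items. Treating the person measure as "pleasability" and the sign convention for the item location are illustrative assumptions, and the numeric values are hypothetical.

```python
import math


def rsm_category_probs(theta, delta, taus):
    """Category probabilities under the rating scale model.

    theta: person measure (here read as "pleasability"), in logits
    delta: item location, in logits
    taus:  thresholds shared by all items (4 for a five-point scale)
    """
    # Cumulative sums of (theta - delta - tau_j); category 0 has sum 0.
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (theta - delta - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_rating(theta, delta, taus, lowest=1):
    """Model-expected rating on a scale whose lowest category is `lowest`."""
    probs = rsm_category_probs(theta, delta, taus)
    return sum((lowest + k) * p for k, p in enumerate(probs))


thresholds = [-1.5, -0.5, 0.5, 1.5]  # hypothetical, symmetric about zero
print(expected_rating(0.0, 0.0, thresholds))   # 3.0 up to rounding
print(expected_rating(1.0, 0.0, thresholds))   # higher for a more pleasable rater
```

Because the expected rating depends on both the person measure and the item location, an item rated mostly by easily pleased students will show an inflated raw average, which is the bias the abstract describes.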

13.
14.
Many engineering faculty believe that when students perceive a course to have a high workload, students will rate the course and the performance of the course instructor poorly. This belief can be particularly worrying to engineering faculty since engineering courses are often perceived as uniquely demanding. The present investigation demonstrated that student ratings of workload and of overall instructor performance in engineering courses were not correlated (e.g., Spearman's rho = 0.068) in data sets from either of two institutions. In contrast, a number of evaluation items were strongly correlated (Spearman's rho = 0.7 to 0.899) with ratings of overall instructor performance across engineering, mathematics and science, and humanities courses. The results of the present study provide motivation for faculty seeking to improve their teaching and course evaluations to focus on teaching methods, organization/preparation, and interactions with students, rather than course workload.
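
Spearman's rho, the statistic reported in this abstract, is the Pearson correlation of the ranks of the two variables, with tied values given their average rank. The sketch below is a plain standard-library illustration of that definition; the function names and the ratings are made up and are not the study's data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 1-5 ratings from six course sections.
workload = [2, 4, 3, 5, 1, 4]
instructor = [4, 3, 5, 4, 3, 2]
print(round(spearman_rho(workload, instructor), 3))
```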

15.
Background: At the University of Michigan, qualified first-year students who place out of the first-semester calculus course may enroll in either the regular second-semester calculus course or Applied Honors Calculus II. Students who enroll in Applied Honors Calculus II show higher academic performance than students enrolling in the regular Calculus II. Purpose (Hypothesis): The study addressed the question: does enrollment in Applied Honors Calculus II have a positive causal impact on subsequent academic performance for engineering students at the University of Michigan? Design/Method: We acquired seven years of institutional data for engineering students who entered the University of Michigan from 1996 through 2003 and who qualified to enroll in Applied Honors Calculus II. Using regression analyses, we tested a causal model of the impact of Applied Honors Calculus II on four measures of subsequent academic performance: grade in Physics II and average grade in all subsequent physics, mathematics, and engineering courses. Results: After controlling for students' personal characteristics and prior academic achievement, the impact of Applied Honors Calculus II on students' academic performance was not statistically significant. In particular, Advanced Placement scores accounted for the higher performance observed in Applied Honors Calculus II students. Conclusions: We recommend including Advanced Placement scores in models that predict academic performance. Future research should also include measures of socioeconomic status (SES) and explore interactions between SES and academic background. Finally, in evaluations of specific curricula, the treatment effect (measured as treatment group mean minus control group mean, after controlling for covariates) is unlikely to be large if the control group receives high-quality instruction.

16.
The purpose of this investigation was to use Rasch measurement to study the psychometric properties of a 34-item questionnaire designed to measure second language learners' willingness to communicate (WTC) in English inside their language class. Responses to the questionnaire from 490 Japanese university students were subjected to a number of different analyses. The first involved a comparison of the category threshold estimates produced by the Rating Scale and Partial Credit models. The questionnaire's items were then evaluated according to how well they defined the willingness to communicate construct. The potential dimensionality of using items that involved different speaking and writing tasks/situations in order to gain a more comprehensive understanding of students' willingness to communicate was also investigated. Next, the questionnaire's four-point scale was examined to ensure that it captured meaningful differences in students' WTC. Finally, the questionnaire items were compared using differential item functioning to determine whether second-year students were more willing than first-year students to communicate in any of the different speaking and writing tasks/situations. This investigation closes with some suggestions on how the WTC questionnaire can inform second language instruction and curriculum design.

17.
Background: The U.S. has experienced a shift from a manufacturing-based economy to one that overwhelmingly provides services and information. This shift demands that technological skills be more fully integrated with one's academic knowledge of science and mathematics so that the next generation of engineers can reason adaptively, think critically, and be prepared to learn how to learn. Purpose (Hypothesis): Project Lead the Way (PLTW) provides a pre-college curriculum that focuses on the integration of engineering with science and mathematics. We documented the impact that enrollment in PLTW had on student science and math achievement. We consider the enriched integration hypothesis, which states that students taking PLTW courses will show achievement benefits, after controlling for prior achievement and other student and teacher characteristics. We contrast this with alternative hypotheses that propose little or no impact of the engineering coursework on students' math and science achievement (the insufficient integration hypothesis), or that PLTW enrollment might be negatively associated with student achievement (the adverse integration hypothesis). Design/Method: Using multilevel statistical modeling with students (N = 140) nested within teachers, we report findings from a quantitative analysis of the relationship between PLTW enrollment and student achievement on state standardized tests of math and science. Results: While students gained in math and science achievement overall from eighth to tenth grade, students enrolled in PLTW foundation courses showed significantly smaller math assessment gains than those in a matched group that did not enroll, and no measurable advantages on science assessments, when controlling for prior achievement and teacher experience. The findings do not support the enriched integration hypothesis. Conclusions: Engineering education programs like PLTW face both challenges and opportunities to effectively integrate academic content as they strive to prepare students for college engineering programs and careers.

18.

The existence of an item pool can bring out the various merits of using item response theory (IRT). This study considered the case where the development of an item pool is in progress. We examined the robustness of four calibration methods in three linking designs using simulated data. The data were generated assuming that a small-sized item pool had already been developed and new items were to be added to that item pool. The results suggested that the item characteristic curve method generally performed well. The performance of the fixed common item parameter calibration method and the concurrent calibration method worsened in one of the linking designs where the number of common items was small. The results also suggested that performance was better when the sample size per form and the number of common items were large.

19.
Past research on Computer Adaptive Testing (CAT) has focused almost exclusively on the use of binary items and minimizing the number of items to be administered. To address this situation, extensive computer simulations were performed using partial credit items with two, three, four, and five response categories. Other variables manipulated include the number of available items, the number of respondents used to calibrate the items, and various manipulations of respondents' true locations. Three item selection strategies were used, and the theoretically optimal Maximum Information method was compared to random item selection and Bayesian Maximum Falsification approaches. The Rasch partial credit model proved to be quite robust to various imperfections, and systematic distortions occurred mainly in the absence of sufficient numbers of items located near the trait or performance levels of interest. The findings further indicate that having small numbers of items is more problematic in practice than having small numbers of respondents to calibrate these items. Most importantly, increasing the number of response categories consistently improved CAT's efficiency as well as the general quality of the results. In fact, increasing the number of response categories proved to have a greater positive impact than did the choice of item selection method, as the Maximum Information approach performed only slightly better than the Maximum Falsification approach. Accordingly, issues related to the efficiency of item selection methods are far less important than is commonly suggested in the literature. However, being based on computer simulations only, the preceding presumes that actual respondents behave according to the Rasch model. CAT research could thus benefit from empirical studies aimed at determining whether, and if so how, selection strategies impact performance.
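
A minimal sketch of Maximum Information item selection under the Rasch partial credit model follows. It is an illustration, not the simulation code used in the study. It relies on the fact that, for Rasch-family models, the Fisher information of an item at a given location equals the variance of the item score there. The item pool, step difficulties, and function names are hypothetical.

```python
import math


def pcm_probs(theta, steps):
    """Category probabilities under the partial credit model.

    steps holds the item's step difficulties; an item with one step
    is an ordinary dichotomous Rasch item.
    """
    logits = [0.0]
    for step in steps:
        logits.append(logits[-1] + (theta - step))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def item_information(theta, steps):
    """Fisher information at theta, computed as the item-score variance."""
    probs = pcm_probs(theta, steps)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))


def select_max_information(theta, pool, administered=()):
    """Pick the unused item with the most information at theta."""
    candidates = {name: s for name, s in pool.items() if name not in administered}
    return max(candidates, key=lambda name: item_information(theta, candidates[name]))


# Hypothetical pool: three binary items and one three-category item.
pool = {
    "easy": [-2.0],
    "medium": [0.0],
    "hard": [2.0],
    "poly": [-0.5, 0.5],
}
print(select_max_information(0.0, pool))                          # poly
print(select_max_information(0.0, pool, administered=("poly",)))  # medium
```

In this toy pool the three-category item carries more information at its location than any binary item does, which is consistent with the abstract's finding that adding response categories improves efficiency.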

20.
Many studies on spare parts planning classified items based on their levels of importance using conventional approaches. Classification of spare parts based on the stated approach without considering failure value and/or its consequence may not withstand the test of time due to continuing technological advancement or environmental degradation. This study addressed the stated problem by developing a system that is capable of dynamically determining critical equipment/spare parts based on failure rates using ABC analysis. In this analysis, all operable items were considered to be non-critical, and they became critical when they approached failure time. These transitions were prompted by the items' conditional probability of failure within the limits of 1, 2/3, 1/3 for highly critical, critical and less critical items, respectively. The most critical item(s) (A class) with the highest failure value/consequence were sorted out based on specificity (one manufacturer's item) and generality (many manufacturers' item). Failure remedy was achieved by applying a modified classical inventory model which considered heterogeneity in item failure. The stated conditions were integrated into a time series linear regression model. The performance evaluation results showed that the new scheme was efficient in spare part failure criticality classification, consequence analysis and remedy. The practical implication of the findings is that the developed system could serve as a suitable alternative to the static classification style of the conventional approach in terms of cost savings.
