Similar literature
20 similar records found (search time: 31 ms)
1.
The practice in the literature of computing purported interrater reliability coefficients when only one stimulus (person or object) is rated is examined. This article demonstrates that such coefficients are without logical foundation in standard measurement theory. The methods advanced by L. R. James et al. (see record 1984-11275-001) for computing interrater reliability coefficients when only one stimulus is rated are considered and shown to be inconsistent with standard measurement principles and assumptions. Methods consistent with measurement theory are presented for indexing interrater agreement when only one stimulus is rated. These methods are based on the standard deviation of ratings across raters and the standard error of the mean rating. (PsycINFO Database Record (c) 2010 APA, all rights reserved)
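As a rough illustration of the two quantities named at the end of this abstract, the sketch below computes the standard deviation of ratings across raters and the standard error of the mean rating for a single rated stimulus. The function name `agreement_summary`, the example ratings, and the normal-approximation multiplier `z=1.96` for the interval are assumptions made for illustration; the abstract does not give formulas or data.

```python
import math
import statistics


def agreement_summary(ratings, z=1.96):
    """Summarize agreement among raters who all rated one stimulus.

    Returns (sd, sem, ci): the sample standard deviation across raters,
    the standard error of the mean rating, and an approximate confidence
    interval around the mean (z is an assumed normal multiplier).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need ratings from at least two raters")
    mean = statistics.fmean(ratings)
    sd = statistics.stdev(ratings)   # sample SD (n - 1 denominator)
    sem = sd / math.sqrt(n)          # standard error of the mean rating
    ci = (mean - z * sem, mean + z * sem)
    return sd, sem, ci


# Hypothetical ratings of one target by five raters on a 1-5 scale.
sd, sem, ci = agreement_summary([4, 4, 5, 3, 4])
print(f"SD = {sd:.3f}, SEM = {sem:.3f}, CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

A smaller SD indicates closer agreement among raters, and the SEM shrinks as more raters are added, which is one reason the two values are usually reported together.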

2.
This study used meta-analytic methods to compare the interrater and intrarater reliabilities of ratings of 10 dimensions of job performance used in the literature; ratings of overall job performance were also examined. There was mixed support for the notion that some dimensions are rated more reliably than others. Supervisory ratings appear to have higher interrater reliability than peer ratings. Consistent with H. R. Rothstein (1990), mean interrater reliability of supervisory ratings of overall job performance was found to be .52. In all cases, interrater reliability is lower than intrarater reliability, indicating that the inappropriate use of intrarater reliability estimates to correct for biases from measurement error leads to biased research results. These findings have important implications for both research and practice. (PsycINFO Database Record (c) 2010 APA, all rights reserved)

3.
The Addiction Severity Index–Multimedia Version (ASI–MV) is a CD-ROM-based simulation of the interview-administered Addiction Severity Index (ASI). Clients in treatment (N = 202) self-administered the ASI–MV to examine the test–retest reliability, criterion validity, and convergent–discriminant validity of the ASI–MV. Excellent test–retest reliability was observed for composite scores and severity ratings. Criterion validity, tested against the interviewer-administered ASI, was good for the composite scores. For severity ratings, variable agreement was observed between the ASI–MV and each interviewer, suggesting poor interrater reliability among interviewers. This conclusion was bolstered by a finding of superior convergent–discriminant validity for both composite scores and severity ratings compared to the standard ASI. The ASI–MV is a viable alternative to the expensive and potentially unreliable interviewer-administered version. (PsycINFO Database Record (c) 2010 APA, all rights reserved)

4.
Examined the influence of degree of interview structure, use of scaled-expectation rating scales, and similarity of postinterview trait rating intercorrelations on interrater agreement in employment interviews. 9 nursing interviewers sat as a selection board and interviewed and independently rated 54 senior nursing students. The interviews were highly structured, and ratings were recorded on scaled-expectation scales for general staff nursing positions. Although all interviewers shared essentially the same structure among their postinterview trait ratings, interrater agreement was no better than in previous studies. This finding shows the power of halo in the interview setting. (PsycINFO Database Record (c) 2010 APA, all rights reserved)

5.
Assessing interobserver agreement calls for measuring the degree to which numbers generated by one observer match those generated by another observer. However, for all scales of measurement save one, the absolute scale, using interobserver agreement as a measure of interobserver consistency is too strict because the observers might disagree only on empirically meaningless relationships. Two observers that are rating behaviors on an ordinal scale need only to generate orders that agree, not ratings that agree. This concept is formalized into a notion of relational agreement. Observers need to agree only on empirically meaningful (in a measurement theoretical sense) relationships. Those relationships that are empirically meaningful change as a function of the scale of measurement in use. A class of measures for measuring relational agreement (based on F. E. Zegers and J. M. F. ten Berge [see PA, Vol 72:24356]) is presented. (PsycINFO Database Record (c) 2010 APA, all rights reserved)

6.
The comparability of the absolute level of ratings of 10 psychiatric patients (mean age 43 yrs) on the Inpatient Multidimensional Psychiatric Scale (IMPS) by a homogeneous group of raters was assessed employing both reliability coefficients and ANOVA techniques. Even with attempts to standardize the sampling domain and reduce interrater variance, significant and substantial differences between raters on level of IMPS scores were found. Profile analyses indicated that this level difference was a complex function of a Rater × Score interaction. Specific recommendations on the usage of the IMPS as an ordinal, outcome, and diagnostic instrument are made based on the results. (21 ref) (PsycINFO Database Record (c) 2010 APA, all rights reserved)

7.
Behavioral expectation scales versus summated scales: A fairer comparison. (Times cited: 1; self-citations: 0; citations by others: 1)
Compared the psychometric qualities of ratings from behavioral expectation scales (BES) to the qualities of ratings from 2 types of summated scales. 154 university students each rated 1 of 15 instructors using the 3 scales. The 1st set of summated scales was composed of components of the dimension definitions generated in the BES procedure, and the 2nd set of scales was composed of the behavioral expectation items that had survived all phases of the BES procedure. No significant differences were found between the 3 formats on interrater reliability, leniency error, halo effect, or discrimination across ratees. Results are discussed in terms of the general improvement of BES psychometric properties and the need for more research on behavioral and organizational effects of BES. (PsycINFO Database Record (c) 2010 APA, all rights reserved)

8.
Compared the psychometric properties of ratings on behavioral expectation scales (BES) across 4 groups totalling 156 undergraduate raters. Groups differed with respect to amount of prior training (1 hr or more), the nature of psychometric errors, and the extent of exposure to scales (read scales and recorded observed critical incidents, discussed general scale dimensions, or no exposure to scales). Three Ss from each group rated 1 of 13 instructors during the last week of a 10-wk term. Significantly less leniency error and halo effect, plus higher interrater reliability, were found for the group that had received the hour of training and full exposure to the BES. Ss who had received only training had significantly less halo error than those who had received no training. The need for rater training prior to observation and the use of BES as a context for observation are discussed. (20 ref) (PsycINFO Database Record (c) 2010 APA, all rights reserved)

9.
The interrater reliability of diagnoses, made on the basis of a structured interview for psychiatric patients with and without psychoactive substance use disorders (PSUDs), was examined. 47 pairs of ratings by 9 different clinical interviewers were used. Results supported 3 major findings: (1) The interrater reliability for non-PSUD psychiatric diagnoses is quite high when an S has no diagnosable PSUD; it is lower, though still substantial when a PSUD is present; (2) interviewers are not aware of this and are just as certain of the accuracy of their diagnoses when a PSUD is present as when one is not; and (3) interrater reliability is moderate to substantial as to the judgment of whether, when a non-PSUD diagnosis is present, it is caused by the use of psychoactive substances. The implications of these findings for the appropriate selection of treatments for dually diagnosed patients are discussed. (PsycINFO Database Record (c) 2010 APA, all rights reserved)

10.
This study investigated within-source interrater reliability of supervisor, peer, and subordinate feedback ratings made for managerial development. Raters provided 360-degree feedback ratings on a sample of 153 managers. Using generalizability theory, results indicated that little within-source agreement exists; a large portion of the error variance is attributable to the combined rater main effect and Rater × Ratee effect; more raters are needed than currently used to reach acceptable levels of reliability; supervisors are the most reliable with trivial differences between peers and subordinates when the numbers of raters and items are held constant; and peers are the most reliable, followed by subordinates, followed by supervisors, under conditions commonly encountered in practice. Implications for the validity, design, and maintenance of 360-degree feedback systems are discussed along with directions for future research in this area. (PsycINFO Database Record (c) 2010 APA, all rights reserved)

11.
Connectedness to school, teachers, and family are all significant protective factors in adolescents' lives, yet the measurement of each varies considerably. This article describes the measurement properties of three composite scales of adolescent connectedness, adapted from the Add Health study and the California Healthy Kids Survey. These composite scales are created by either summing or taking the mean of all individual items, measured on an ordinal scale. This approach fails to account for the ordinal, non-normal nature of the data. Using a covariance approach, this article describes the measurement properties of the latent constructs of connectedness to school, teachers, and family and the contribution of each of the items used to compile the relevant composite score. The outcomes of this study will provide researchers and practitioners with information about the validity, reliability, and overall usefulness of each of the measures of adolescents' perception of their connectedness to school, teachers, and family. (PsycINFO Database Record (c) 2010 APA, all rights reserved)

12.
The reliability of current and lifetime Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV; American Psychiatric Association, 1994) anxiety and mood disorders was examined in 362 outpatients who underwent 2 independent administrations of the Anxiety Disorders Interview Schedule for DSM-IV: Lifetime version (ADIS-IV–L). Good to excellent reliability was obtained for the majority of DSM-IV categories. For many disorders, a common source of unreliability was disagreements on whether constituent symptoms were sufficient in number, severity, or duration to meet DSM-IV diagnostic criteria. These analyses also highlighted potential boundary problems for some disorders (e.g., generalized anxiety disorder and major depressive disorder). Analyses of ADIS-IV–L clinical ratings (0–8 scales) indicated favorable interrater agreement for the dimensional features of DSM-IV anxiety and mood disorders. The findings are discussed in regard to their implications for the classification of emotional disorders. (PsycINFO Database Record (c) 2010 APA, all rights reserved)

13.
L. R. James et al. (1984) developed an index, rWG, for assessing within-group agreement appropriate when only a single target is rated. F. L. Schmidt and J. E. Hunter (1989) criticized the conceptual foundation of rWG because it is not consistent with the classical model of reliability, and proposed an alternative approach, the use of the rating standard deviation (SDx), the standard error of the rating mean (SEM), and the associated confidence intervals for SEM to index interrater agreement. This comment argues that the critique of rWG did not clearly distinguish the concepts of interrater consensus (i.e., agreement) and interrater consistency (i.e., reliability). When the distinction between agreement and reliability is drawn, the critique of rWG is shown to divert attention from more critical problems in the assessment of agreement. The approach for assessing within-group agreement proposed by Schmidt and Hunter has several limitations. rWG should not be used as an index of interrater reliability but, within certain bounds, it is suitable as an index of within-group interrater agreement. SDx and SEM are not acceptable substitutes for extant indexes of interrater agreement. (PsycINFO Database Record (c) 2010 APA, all rights reserved)
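For readers unfamiliar with rWG, the sketch below shows the commonly cited single-item form of the index: one minus the ratio of the observed variance of the ratings to the variance expected if raters responded at random. The abstract itself gives no formula, so the uniform null distribution, whose variance for a scale with A response options is (A² − 1)/12, is an assumption here, as are the function name `rwg_single_item` and the example ratings.

```python
import statistics


def rwg_single_item(ratings, n_options):
    """Single-item within-group agreement index for one rated target.

    Assumes a discrete uniform null distribution over `n_options`
    response categories (an illustrative choice, not the only one).
    """
    if len(ratings) < 2:
        raise ValueError("need ratings from at least two raters")
    observed_var = statistics.variance(ratings)   # sample variance across raters
    expected_var = (n_options ** 2 - 1) / 12      # variance of a discrete uniform
    return 1 - observed_var / expected_var


# Hypothetical ratings of one target by five raters on a 5-point scale.
print(rwg_single_item([4, 4, 5, 3, 4], n_options=5))
```

Values near 1 indicate that raters cluster tightly relative to random responding. The index can fall below 0 when the observed variance exceeds the null variance, and the result depends directly on which null distribution is assumed.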

14.
144 deputy sheriffs were rated on 9 job performance dimensions with 2 rating scales by 2 raters. Results indicate that the rating scales (the Multiple Item Appraisal Form and the Global Dimension Appraisal Form) developed in this study were able to minimize the major problems often associated with performance ratings (i.e., leniency error, restriction of range, and low reliability). A multitrait/multimethod analysis indicated that the rating scales possessed high convergent and discriminant validity. A multitrait/multirater analysis indicated that although the interrater agreement and the degree of rated discrimination on different traits by different raters were good, there was a substantial rater bias, or strong halo effect. This halo effect in the ratings, however, may really be a legitimate general factor rather than an error. (11 ref) (PsycINFO Database Record (c) 2010 APA, all rights reserved)

15.
Relationship satisfaction and adjustment have been the target outcome variables for almost all couple research and therapies. In contrast, far less attention has been paid to the assessment of relationship quality. The present study introduces the Relationship Quality Interview (RQI), a semistructured, behaviorally anchored individual interview. The RQI was designed to provide a more objective assessment of relationship quality as a dynamic, dyadic construct across 5 dimensions: (a) quality of emotional intimacy in the relationship, (b) quality of the couple's sexual relationship, (c) quality of support transactions in the relationship, (d) quality of the couple's ability to share power in the relationship, and (e) quality of conflict/problem-solving interactions in the relationship. Psychometric properties of RQI ratings were examined through scores obtained from self-report questionnaires and behavioral observation data collected cross-sectionally from a sample of 91 dating participants and longitudinally from a sample of 101 married couples. RQI ratings demonstrated strong reliability (internal consistency, interrater agreement, interpartner agreement, and correlations among scales), convergent validity (correlations between RQI scale ratings and questionnaire scores assessing similar domains of relationship quality), and divergent validity (correlations between RQI scale ratings and (a) behavioral observation codes assessing related constructs, (b) global relationship satisfaction scores, and (c) scores on individual difference measures of related constructs). Clinical implications of the RQI for improving couple assessment and interventions are discussed. (PsycINFO Database Record (c) 2011 APA, all rights reserved)

16.
One strategy suggested for improving the accuracy of the complex evaluative judgments involved in performance evaluation is to decompose them into a series of simpler judgments. Another is to collect observations in a distributional rating scheme in which raters estimate the frequencies of different classes of behavior and performance is assessed in terms of the relative frequencies of effective and ineffective behaviors. Distributional ratings were compared to Likert-type ratings of videotaped lectures at 3 levels of dimensional decomposition; ratings were evaluated in terms of interrater agreement and rating accuracy. Decomposition led to increased agreement and accuracy, but the use of distributional ratings did not. The practical implications of the results are discussed. (PsycINFO Database Record (c) 2010 APA, all rights reserved)

17.
A recomparison of behavioral expectation scales to summated scales. (Times cited: 1; self-citations: 0; citations by others: 1)
Compared behavioral expectation scales (BES) to summated scales for leniency error, discriminability (among ratees), interrater agreement, and constancy of rater individual differences across dimensions. 27 university instructors were rated by their students. Less leniency error and greater interrater agreement were found for item-analyzed, summated scales. Results are discussed in terms of previous comparisons of the 2 methods and possible effects of variant developmental procedures on BES performance. (18 ref) (PsycINFO Database Record (c) 2010 APA, all rights reserved)

18.
19.
OBJECTIVES: Exposure to pesticides in fruit growing was estimated by pesticide experts, occupational hygienists, and fruit growing experts to determine whether valid subjective assessments can be made by experts. The study objectives were (i) validation of exposure assessment by experts using different sources of information, (ii) assessment of interrater agreement, and (iii) measurement of agreement between experts' assessments and actual quantitative exposure data. METHODS: Three groups with different expertise made four ratings. Three of the ratings were made in three phases in which exposure information was provided. RESULTS: The intraclass correlation was high for each subgroup of experts when tasks in fruit growing were relatively ranked by increasing exposure level. In general, the interrater agreement on factors influencing the internal dose decreased when more information on exposure was provided. Experts correctly considered dermal exposure as the prominent contributor to internal dose. Results were comparable for the three pesticides under study. The ranking of 15 specific sprayings with a fungicide clearly showed differences between raters according to their expertise. The pesticide experts and occupational hygienists were able to rank daily exposure levels during pesticide spraying in a meaningful way. CONCLUSIONS: Experts seem to recognize the most important determinants of external exposure and therefore should be able to play a role in evaluating the effectiveness of control measures taken to reduce external exposure and in determining exposure groups in epidemiologic studies. The expert panel should not be too small, and consensus or average estimates should be used because differences within expert groups can be considerable.

20.
101 incumbents of 25 service jobs rated their respective tasks on relative time spent, difficulty of learning, criticality, and overall importance. Although scale convergence varied as a function of job title, task criticality and importance ratings were similar and presented low to moderate levels of convergence with both time-spent and difficulty-of-learning ratings. Different composites of task importance were compared. All composites and the overall judgments of importance were moderately correlated with each other and showed similar levels of interrater agreement. Several conclusions regarding the choice of scales and the use of composites in task analysis are drawn. (PsycINFO Database Record (c) 2010 APA, all rights reserved)
