
Agreement between Two Independent Groups of Raters

Published online by Cambridge University Press:  01 January 2025

Sophie Vanbelle* (University of Liege)
Adelin Albert (University of Liege)

*Requests for reprints should be sent to Sophie Vanbelle, Medical Informatics and Biostatistics, University of Liege, CHU Sart Tilman (B23), 4000 Liege, Belgium. E-mail: sophie.vanbelle@ulg.ac.be

Abstract

We propose a coefficient of agreement to assess the degree of concordance between two independent groups of raters classifying items on a nominal scale. This coefficient, defined on a population-based model, extends the classical Cohen’s kappa coefficient for quantifying agreement between two raters. Weighted and intraclass versions of the coefficient are also given, and their sampling variance is determined by the jackknife method. The method is illustrated on the medical education data that motivated the research.
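As a concrete reference point, the sketch below (not taken from the paper; the function names and the two-category example are illustrative) computes the classical Cohen’s kappa that the proposed coefficient extends, together with a leave-one-item-out jackknife standard error of the kind the abstract mentions for estimating sampling variance.

```python
# Minimal sketch, assuming ratings are two equal-length lists of nominal labels.
# This illustrates classical Cohen's kappa and a jackknife standard error,
# not the group-level coefficient proposed in the paper.

from collections import Counter

def cohen_kappa(r1, r2):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on a nominal scale."""
    n = len(r1)
    categories = set(r1) | set(r2)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n                # observed agreement
    m1, m2 = Counter(r1), Counter(r2)
    p_e = sum((m1[c] / n) * (m2[c] / n) for c in categories)     # chance agreement
    return (p_o - p_e) / (1 - p_e)

def jackknife_se(r1, r2):
    """Leave-one-item-out jackknife standard error of kappa."""
    n = len(r1)
    loo = [cohen_kappa(r1[:i] + r1[i + 1:], r2[:i] + r2[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    var = (n - 1) / n * sum((k - mean_loo) ** 2 for k in loo)    # jackknife variance
    return var ** 0.5

# Hypothetical example: two raters classifying 10 items into categories "A"/"B".
rater1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
rater2 = ["A", "B", "B", "B", "A", "B", "A", "A", "A", "A"]
print(cohen_kappa(rater1, rater2), jackknife_se(rater1, rater2))
```

In the two-group setting studied in the paper, the observed and chance agreement terms are defined at the level of the two rater groups rather than for a single pair of raters, but the kappa-type chance correction and the jackknife variance idea carry over in the same spirit.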

Type
Theory and Methods
Copyright
Copyright © 2009 The Psychometric Society


References

Barnhart, H.X., Williamson, J.M. (2002). Weighted least squares approach for comparing correlated kappa. Biometrics, 58, 1012–1019.
Bland, A.C., Kreiter, C.D., Gordon, J.A. (2005). The psychometric properties of five scoring methods applied to the Script Concordance Test. Academic Medicine, 80, 395–399.
Charlin, B., Gagnon, R., Sibert, L., Van der Vleuten, C. (2002). Le test de concordance de script: un instrument d’évaluation du raisonnement clinique [The script concordance test: an instrument for assessing clinical reasoning]. Pédagogie Médicale, 3, 135–144.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1968). Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
Efron, B., Tibshirani, R.J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
Feigin, P.D., Alvo, M. (1986). Intergroup diversity and concordance for ranking data: an approach via metrics for permutations. The Annals of Statistics, 14, 691–707.
Fleiss, J.L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.
Fleiss, J.L., Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613–619.
Hollander, M., Sethuraman, J. (1978). Testing for agreement between two groups of judges. Biometrika, 65, 403–411.
Kraemer, H.C. (1979). Ramifications of a population model for κ as a coefficient of reliability. Psychometrika, 44, 461–472.
Kraemer, H.C. (1981). Intergroup concordance: definition and estimation. Biometrika, 68, 641–646.
Kraemer, H.C., Vyjeyanthi, S.P., Noda, A. (2004). Agreement statistics. In D’Agostino, R.B. (Ed.), Tutorials in biostatistics (pp. 85–105). New York: Wiley.
Lipsitz, S.R., Williamson, J., Klar, N., Ibrahim, J., Parzen, M. (2001). A simple method for estimating a regression model for κ between a pair of raters. Journal of the Royal Statistical Society, Series A, 164, 449–465.
Raine, R., Sanderson, C., Hutchings, A., Carter, S., Larking, K., Black, N. (2004). An experimental study of determinants of group judgments in clinical guideline development. Lancet, 364, 429–437.
Schouten, H.J.A. (1982). Measuring pairwise interobserver agreement when all subjects are judged by the same observers. Statistica Neerlandica, 36, 45–61.
Schucany, W.R., Frawley, W.H. (1973). A rank test for two group concordance. Psychometrika, 38, 249–258.
van Hoeij, M.J., Haarhuis, J.C., Wierstra, R.F., van Beukelen, P. (2004). Developing a classification tool based on Bloom’s taxonomy to assess the cognitive level of short essay questions. Journal of Veterinary Medical Education, 31, 261–267.
Vanbelle, S., Massart, V., Giet, G., Albert, A. (2007). Test de concordance de script: un nouveau mode d’établissement des scores limitant l’effet du hasard [Script concordance test: a new scoring method limiting the effect of chance]. Pédagogie Médicale, 8, 71–81.
Vanbelle, S., Albert, A. (2009). Agreement between an isolated rater and a group of raters. Statistica Neerlandica, 63, 82–100.
Williamson, J.M., Lipsitz, S.R., Manatunga, A.K. (2000). Modeling kappa for measuring dependent categorical agreement data. Biostatistics, 1, 191–202.