Agreement between Two Independent Groups of Raters

  • Theory and Methods
  • Published in Psychometrika

Abstract

We propose a coefficient of agreement to assess the degree of concordance between two independent groups of raters classifying items on a nominal scale. This coefficient, defined on a population-based model, extends the classical Cohen’s kappa coefficient for quantifying agreement between two raters. Weighted and intraclass versions of the coefficient are also given, and their sampling variance is determined by the jackknife method. The method is illustrated on medical education data which motivated the research.
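
As a concrete reference point for the quantities named above, the sketch below (an illustration, not the authors’ code) computes the classical two-rater Cohen’s kappa, a weighted variant through an agreement-weight matrix, and a leave-one-item-out jackknife standard error on toy data. The function names, the toy ratings, and the identity-weight default are assumptions made for the example; the paper’s coefficient for two independent groups of raters extends these building blocks and is not reproduced here.

import numpy as np

def cohen_kappa(rater1, rater2, categories, weights=None):
    """Cohen's kappa between two raters on a nominal scale.
    If `weights` is an agreement-weight matrix (1 on the diagonal),
    a weighted kappa is returned instead."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    # joint classification proportions p_ij
    p = np.zeros((k, k))
    for a, b in zip(rater1, rater2):
        p[idx[a], idx[b]] += 1
    p /= len(rater1)
    if weights is None:
        weights = np.eye(k)  # identity weights give the unweighted kappa
    po = np.sum(weights * p)                                       # observed agreement
    pe = np.sum(weights * np.outer(p.sum(axis=1), p.sum(axis=0)))  # chance agreement
    return (po - pe) / (1.0 - pe)

def jackknife_se(stat, rater1, rater2, categories, weights=None):
    """Leave-one-item-out jackknife standard error of an agreement statistic."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    n = len(r1)
    reps = np.array([stat(np.delete(r1, i), np.delete(r2, i), categories, weights)
                     for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((reps - reps.mean()) ** 2))

# Toy data: two raters classifying 12 items into three nominal categories.
r1 = list("AABBCCABCABC")
r2 = list("AABBCCABABCC")
cats = ["A", "B", "C"]
print(f"kappa = {cohen_kappa(r1, r2, cats):.3f}, "
      f"jackknife SE = {jackknife_se(cohen_kappa, r1, r2, cats):.3f}")

The jackknife is used here in the same spirit as in the abstract: each item is deleted in turn, the statistic is recomputed, and the spread of the replicates gives the standard error.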

References

  • Barnhart, H.X., & Williamson, J.M. (2002). Weighted least squares approach for comparing correlated kappa. Biometrics, 58, 1012–1019.
  • Bland, A.C., Kreiter, C.D., & Gordon, J.A. (2005). The psychometric properties of five scoring methods applied to the Script Concordance Test. Academic Medicine, 80, 395–399.
  • Charlin, B., Gagnon, R., Sibert, L., & Van der Vleuten, C. (2002). Le test de concordance de script: un instrument d’évaluation du raisonnement clinique [The script concordance test: an instrument for assessing clinical reasoning]. Pédagogie Médicale, 3, 135–144.
  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
  • Cohen, J. (1968). Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
  • Efron, B., & Tibshirani, R.J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
  • Feigin, P.D., & Alvo, M. (1986). Intergroup diversity and concordance for ranking data: an approach via metrics for permutations. The Annals of Statistics, 14, 691–707.
  • Fleiss, J.L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.
  • Fleiss, J.L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613–619.
  • Hollander, M., & Sethuraman, J. (1978). Testing for agreement between two groups of judges. Biometrika, 65, 403–411.
  • Kraemer, H.C. (1979). Ramifications of a population model for κ as a coefficient of reliability. Psychometrika, 44, 461–472.
  • Kraemer, H.C. (1981). Intergroup concordance: definition and estimation. Biometrika, 68, 641–646.
  • Kraemer, H.C., Vyjeyanthi, S.P., & Noda, A. (2004). Agreement statistics. In R.B. D’Agostino (Ed.), Tutorials in biostatistics (Vol. 1, pp. 85–105). New York: Wiley.
  • Lipsitz, S.R., Williamson, J., Klar, N., Ibrahim, J., & Parzen, M. (2001). A simple method for estimating a regression model for κ between a pair of raters. Journal of the Royal Statistical Society, Series A, 164, 449–465.
  • Raine, R., Sanderson, C., Hutchings, A., Carter, S., Larkin, K., & Black, N. (2004). An experimental study of determinants of group judgments in clinical guideline development. Lancet, 364, 429–437.
  • Schouten, H.J.A. (1982). Measuring pairwise interobserver agreement when all subjects are judged by the same observers. Statistica Neerlandica, 36, 45–61.
  • Schucany, W.R., & Frawley, W.H. (1973). A rank test for two group concordance. Psychometrika, 38, 249–258.
  • van Hoeij, M.J., Haarhuis, J.C., Wierstra, R.F., & van Beukelen, P. (2004). Developing a classification tool based on Bloom’s taxonomy to assess the cognitive level of short essay questions. Journal of Veterinary Medical Education, 31, 261–267.
  • Vanbelle, S., Massart, V., Giet, G., & Albert, A. (2007). Test de concordance de script: un nouveau mode d’établissement des scores limitant l’effet du hasard [Script concordance test: a new scoring method limiting the effect of chance]. Pédagogie Médicale, 8, 71–81.
  • Vanbelle, S., & Albert, A. (2009). Agreement between an isolated rater and a group of raters. Statistica Neerlandica, 63, 82–100.
  • Williamson, J.M., Lipsitz, S.R., & Manatunga, A.K. (2000). Modeling kappa for measuring dependent categorical agreement data. Biostatistics, 1, 191–202.

Author information

Correspondence to Sophie Vanbelle.

Cite this article

Vanbelle, S., Albert, A. Agreement between Two Independent Groups of Raters. Psychometrika 74, 477–491 (2009). https://doi.org/10.1007/s11336-009-9116-1
