Abstract
Whenever multiple observers provide ratings, even of the same performance, inter-rater variation is prevalent. The resulting ‘idiosyncratic rater variance’ is treated as unusable measurement error in psychometric models and is a threat to the defensibility of our assessments. Prior studies of inter-rater variation in clinical assessments have used open response formats to gather raters’ comments and justifications. This design choice allows participants to use idiosyncratic response styles that could result in a distorted representation of the underlying rater cognition and skew subsequent analyses. In this study we explored rater variability using the structured response format of Q methodology. Physician raters viewed video-recorded clinical performances and provided Mini Clinical Evaluation Exercise (Mini-CEX) assessment ratings through a web-based system. They then shared their assessment impressions by sorting statements that described the most salient aspects of the clinical performance onto a forced quasi-normal distribution ranging from “most consistent with my impression” to “most contrary to my impression”. Analysis of the resulting Q-sorts revealed distinct points of view for each performance, each shared by multiple physicians. The points of view corresponded with the ratings physicians assigned to the performance. Each point of view emphasized different aspects of the performance, with either rapport-building and/or medical expertise skills being most salient. It was rare for the points of view to diverge because of disagreements over the interpretation of a specific aspect of the performance. As a result, physicians’ divergent points of view on a given clinical performance cannot be easily reconciled into a single coherent assessment judgment that is merely obscured by measurement error.
If inter-rater variability does not wholly reflect measurement error, it is problematic for our current measurement models and poses challenges for how we can adequately analyze performance assessment ratings.
Acknowledgments
The authors wish to thank everyone who assisted with recruiting participants and especially those who took the time to participate in this study. We also wish to thank Rick Hoodenpyle for designing the online data collection system and hosting it on QSortOnline.
Funding
This study was funded by a National Board of Medical Examiners® (NBME®) Edward J. Stemmler, MD Medical Education Research Fund Grant. The project does not necessarily reflect NBME policy, and NBME support provides no official endorsement.
Gingerich, A., Ramlo, S.E., van der Vleuten, C.P.M. et al. Inter-rater variability as mutual disagreement: identifying raters’ divergent points of view. Adv in Health Sci Educ 22, 819–838 (2017). https://doi.org/10.1007/s10459-016-9711-8