
Inter-rater variability as mutual disagreement: identifying raters’ divergent points of view

Published in Advances in Health Sciences Education

Abstract

Whenever multiple observers provide ratings, even of the same performance, inter-rater variation is prevalent. The resulting ‘idiosyncratic rater variance’ is considered to be unusable error of measurement in psychometric models and is a threat to the defensibility of our assessments. Prior studies of inter-rater variation in clinical assessments have used open response formats to gather raters’ comments and justifications. This design choice allows participants to use idiosyncratic response styles that could result in a distorted representation of the underlying rater cognition and skew subsequent analyses. In this study we explored rater variability using the structured response format of Q methodology. Physician raters viewed video-recorded clinical performances and provided Mini Clinical Evaluation Exercise (Mini-CEX) assessment ratings through a web-based system. They then shared their assessment impressions by sorting statements that described the most salient aspects of the clinical performance onto a forced quasi-normal distribution ranging from “most consistent with my impression” to “most contrary to my impression”. Analysis of the resulting Q-sorts revealed distinct points of view for each performance shared by multiple physicians. The points of view corresponded with the ratings physicians assigned to the performance. Each point of view emphasized different aspects of the performance, with rapport-building and/or medical expertise skills being most salient. It was rare for the points of view to diverge based on disagreements regarding the interpretation of a specific aspect of the performance. As a result, physicians’ divergent points of view on a given clinical performance cannot be easily reconciled into a single coherent assessment judgment that is impacted by measurement error. If inter-rater variability does not wholly reflect error of measurement, it is problematic for our current measurement models and poses challenges for how we can adequately analyze performance assessment ratings.
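The analysis described above rests on the core move of Q methodology: treating people, rather than items, as the variables. Each rater's Q-sort is correlated with every other rater's, and factors extracted from that inter-rater correlation matrix correspond to shared points of view. The sketch below is a minimal, hypothetical illustration of that idea in Python (NumPy). It is not the authors' reported analysis pipeline; the function name, factor count, and toy data are assumptions for demonstration only.

```python
import numpy as np

def q_factor_groups(sorts: np.ndarray, n_factors: int = 2) -> np.ndarray:
    """sorts: (n_raters, n_statements) array of Q-sort column values
    (e.g. -3 .. +3 from a forced quasi-normal grid).
    Returns the factor (0 .. n_factors-1) each rater loads on most strongly."""
    # In Q methodology, people (not items) are the variables:
    # correlate each rater's sort with every other rater's sort.
    r = np.corrcoef(sorts)                          # (n_raters, n_raters)
    # Principal-components extraction of the inter-rater correlation matrix.
    eigvals, eigvecs = np.linalg.eigh(r)
    top = np.argsort(eigvals)[::-1][:n_factors]     # indices of the largest eigenvalues
    loadings = eigvecs[:, top] * np.sqrt(eigvals[top])
    # Group raters by the factor on which they load most strongly;
    # each group approximates one shared "point of view".
    return np.argmax(np.abs(loadings), axis=1)

# Hypothetical illustration: 6 raters, 20 statements, two underlying viewpoints.
rng = np.random.default_rng(0)
grid = np.repeat(np.arange(-3, 4), [1, 2, 4, 6, 4, 2, 1])   # quasi-normal grid, 20 slots
view_a, view_b = rng.permutation(grid), rng.permutation(grid)
sorts = np.array([view_a + rng.normal(0, 0.5, 20) for _ in range(3)] +
                 [view_b + rng.normal(0, 0.5, 20) for _ in range(3)])
print(q_factor_groups(sorts))   # e.g. [0 0 0 1 1 1]: raters cluster into two viewpoints
```

In practice, Q researchers typically use dedicated software with centroid extraction and varimax or judgmental rotation rather than this bare principal-components grouping, and they interpret the factors by reconstructing an idealized sort for each factor rather than only grouping raters.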



Acknowledgments

The authors wish to thank everyone who assisted with recruiting participants and especially those who took the time to participate in this study. We also wish to thank Rick Hoodenpyle for designing the online data collection system and hosting it on QSortOnline.

Funding

This study was funded by a National Board of Medical Examiners® (NBME®) Edward J. Stemmler, MD Medical Education Research Fund Grant. The project does not necessarily reflect NBME policy, and NBME support provides no official endorsement.

Author information

Corresponding author

Correspondence to Andrea Gingerich.


Cite this article

Gingerich, A., Ramlo, S.E., van der Vleuten, C.P.M. et al. Inter-rater variability as mutual disagreement: identifying raters’ divergent points of view. Adv in Health Sci Educ 22, 819–838 (2017). https://doi.org/10.1007/s10459-016-9711-8
