Abstract
Whenever multiple observers provide ratings, even of the same performance, inter-rater variation is prevalent. The resulting ‘idiosyncratic rater variance’ is treated as unusable measurement error in psychometric models and is a threat to the defensibility of our assessments. Prior studies of inter-rater variation in clinical assessments have used open response formats to gather raters’ comments and justifications. This design choice allows participants to use idiosyncratic response styles that could result in a distorted representation of the underlying rater cognition and skew subsequent analyses. In this study we explored rater variability using the structured response format of Q methodology. Physician raters viewed video-recorded clinical performances and provided Mini Clinical Evaluation Exercise (Mini-CEX) assessment ratings through a web-based system. They then shared their assessment impressions by sorting statements that described the most salient aspects of the clinical performance onto a forced quasi-normal distribution ranging from “most consistent with my impression” to “most contrary to my impression”. Analysis of the resulting Q-sorts revealed distinct points of view for each performance, each shared by multiple physicians. The points of view corresponded with the ratings physicians assigned to the performance. Each point of view emphasized different aspects of the performance, with either rapport-building and/or medical expertise skills being most salient. It was rare for the points of view to diverge because of disagreements over the interpretation of a specific aspect of the performance. As a result, physicians’ divergent points of view on a given clinical performance cannot be easily reconciled into a single coherent assessment judgment that is merely obscured by measurement error.
If inter-rater variability does not wholly reflect measurement error, it is problematic for our current measurement models and poses challenges for how we can adequately analyze performance assessment ratings.
Acknowledgments
The authors wish to thank everyone who assisted with recruiting participants and especially those who took the time to participate in this study. We also wish to thank Rick Hoodenpyle for designing the online data collection system and hosting it on QSortOnline.
Funding
This study was funded by a National Board of Medical Examiners® (NBME®) Edward J. Stemmler, MD Medical Education Research Fund Grant. The project does not necessarily reflect NBME policy, and NBME support provides no official endorsement.
Gingerich, A., Ramlo, S.E., van der Vleuten, C.P.M. et al. Inter-rater variability as mutual disagreement: identifying raters’ divergent points of view. Adv in Health Sci Educ 22, 819–838 (2017). https://doi.org/10.1007/s10459-016-9711-8