Journal of Clinical Epidemiology

Volume 66, Issue 10, October 2013, Pages 1105-1117.e1

Original Article
Interrater reliability of grading strength of evidence varies with the complexity of the evidence in systematic reviews

https://doi.org/10.1016/j.jclinepi.2013.06.002

Abstract

Objectives

To examine consistency (interrater reliability) of applying guidance for grading strength of evidence in systematic reviews for the Agency for Healthcare Research and Quality Evidence-based Practice Center program.

Study Design and Setting

Using data from two systematic reviews, authors tested the main components of the approach: (1) scoring evidence on the four required domains (risk of bias, consistency, directness, and precision) separately for randomized controlled trials (RCTs) and observational studies and (2) developing an overall strength of evidence grade, given the scores for each of these domains.

Results

Conclusions about overall strength of evidence reached by experienced systematic reviewers on the basis of the same evidence can differ greatly, especially for complex bodies of evidence. Current instructions may be sufficient for straightforward quantitative evaluations that use meta-analysis to summarize RCT findings. In contrast, agreement suffered when evaluations did not lend themselves to meta-analysis and reviewers needed to rely on their own qualitative judgment. Three areas raised particular concern: (1) evidence from a combination of RCTs and observational studies, (2) outcomes assessed with differing measures, and (3) evidence that appeared to show no differences in outcomes.

Conclusion

Interrater reliability was highly variable both for scoring strength of evidence domains and for combining scores to reach overall strength of evidence grades. Future research can help establish improved methods for evaluating these complex bodies of evidence.

Introduction

What is new?

Key findings

  1. Interrater reliability among systematic reviewers is low for arriving at overall strength of evidence grades, and it is lower than the interrater reliability for most of the domain scores from which such grades are derived.

  2. Complex but not uncommon evidence bases that do not lend themselves to meta-analysis—particularly those with mixes of randomized controlled trials and observational studies and those with benefits or harms assessed with multiple measures and instruments—can be extremely difficult for reviewers to grade.

  3. Dual review with adjudication of differences improves interrater reliability and is critical.

What this adds to what was known?
  1. Our results substantially extend the understanding of interrater agreement in grading systematic review evidence beyond the one earlier published study on this topic: we expanded the testing to include domain scoring and used more complicated exercises deliberately designed to reflect typical scenarios that systematic reviewers face today.

What is the implication and what should change now?
  1. More specific guidance is needed for scoring individual domains and for combining domains into overall strength of evidence grades, to help reviewers account for the increasing diversity of study designs and outcome measures.

  2. Better guidance is needed for experienced systematic reviewers and even more so for those who are relatively new to the field and do not have the experience that our reviewers did.

  3. More research is needed on interrater reliability of domain scoring and grading overall using complex evidence bases, “methods” and “judgments” that different reviewers use, and advantages and disadvantages of concrete rules for reaching strength of evidence grades from domain scores.

  4. If strength of evidence grading judgments can differ among experienced and well-trained reviewers, then providing transparency about methods and the rationale for findings is critical in conducting each systematic review.

Systematic reviews, with or without meta-analyses, are the backbone of many aspects of health care: clinical and community practice guidelines, coverage and reimbursement decisions, quality improvement and patient safety initiatives, innovative delivery systems and payment schemes, and even emergency situations [1], [2], [3], [4], [5], [6], [7]. Numerous groups conduct such reviews with support from national governments, professional societies, industry, and patient or disease advocacy groups. The US Agency for Healthcare Research and Quality (AHRQ), through its Evidence-based Practice Center (EPC) program, has sponsored systematic reviews and related products for 15 years.

Throughout most of that time, AHRQ has supported research into best practices for enhancing the scientific rigor of these reviews [8], [9], [10], [11], [12]; the agency's Methods Guide for Effectiveness and Comparative Effectiveness Reviews (Methods Guide) covers a wide array of related topics [13]. Of particular salience have been the related tasks of rating the quality (i.e., risk of bias or internal validity) of individual studies and grading the strength of bodies of evidence addressing key questions in such reviews [14], [15], [16], [17], [18], [19], [20], [21], [22]. The guidance on methods for completing systematic review tasks is developed through work groups, comprising mainly expert practitioners from the independent EPCs. One such cross-EPC team developed the chapter on grading the strength of evidence related to therapeutic interventions, published in 2010 as definitive guidance for EPCs at that time [18].

In the intervening years, EPC work groups have revisited chapters of the Methods Guide, and systematic reviewers have raised various conceptual and practical questions about evolving standards for reviews. One set of questions concerns the interrater reliability of the two major steps in grading strength of evidence—rating key domains and reaching an overall strength of evidence grade—because these grades are key indicators of a review team's level of confidence that the studies in the review collectively reflect the true effect of an intervention on a health outcome. Others explored such methods and issues in the early 2000s as well; the Grading of Recommendations Assessment, Development and Evaluation (GRADE) working group, in particular, examined the interrater reliability of their earlier approach to grading the quality of evidence [23]. They determined that kappa agreement among raters for 12 outcomes was fair (κ = 0.27; standard error, 0.015), ranging from agreement slightly worse than chance for four outcomes (negative κ values) to a high of κ = 0.823 for one outcome.
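For readers less familiar with the statistic, the κ values reported above are Cohen's kappa, which adjusts the observed agreement between two raters for the agreement expected by chance alone. The following minimal sketch in Python shows the unweighted calculation; the reviewer labels are hypothetical and are not data from either the GRADE study or ours.

    # Unweighted Cohen's kappa for two raters assigning categorical scores.
    # kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    # and p_e is chance agreement from each rater's marginal frequencies.
    from collections import Counter

    def cohen_kappa(rater1, rater2):
        n = len(rater1)
        p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
        c1, c2 = Counter(rater1), Counter(rater2)
        p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical precision scores from two independent reviewers:
    r1 = ["precise", "precise", "imprecise", "precise", "imprecise"]
    r2 = ["precise", "imprecise", "imprecise", "precise", "precise"]
    print(round(cohen_kappa(r1, r2), 3))  # 0.167, i.e., slight agreement

By the Landis and Koch convention, κ of 0.20 or below indicates slight agreement, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and above 0.80 almost perfect; the GRADE result of κ = 0.27 therefore falls in the fair range.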

As of 2010, EPC reviewers evaluate bodies of evidence in relation to each major outcome (i.e., benefits and harms) and each main comparison (e.g., intervention A vs. intervention B) for each key question of a review [18]. First, two independent reviewers rate (score) four required domains: the risk of bias of included studies, consistency of the evidence, directness of the evidence, and precision of the estimates (Table 1, top). They can score four additional domains (dose–response association, plausible confounding that would decrease an observed effect, strength of association or magnitude of effect, and publication bias) if they consider them integral to making a final evaluation. The first three of these additional domains are more likely to apply to observational studies than to randomized controlled trials (RCTs).

The Methods Guide recommends scoring each domain separately for RCTs and observational studies. Second, for each major outcome or comparison, reviewers then aggregate domain scores into a single strength of evidence grade (Table 1, bottom). Reviewers resolve disagreements about domain scores or overall grades through consensus or adjudication by a third reviewer.

For arriving at overall grades, reviewers could use the then-contemporary GRADE approach [17], their own weighting (e.g., numeric) system, or a qualitative approach. AHRQ's primary requirement is transparency; reviewers must clearly explain their rationale for aggregating the domains into a single strength of evidence grade.
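To make the two-step structure concrete, the sketch below encodes the four required domain scores and one possible rule-based aggregation. This is purely illustrative: the Methods Guide deliberately prescribes no single algorithm, so the downgrade rule here is a hypothetical example, not the EPC or GRADE method.

    # Illustrative sketch only; the domain score values follow Methods Guide
    # terminology, but the aggregation rule is a hypothetical example.
    from dataclasses import dataclass

    @dataclass
    class DomainScores:
        risk_of_bias: str  # "low" | "medium" | "high"
        consistency: str   # "consistent" | "inconsistent" | "unknown"
        directness: str    # "direct" | "indirect"
        precision: str     # "precise" | "imprecise"

    def overall_grade(d: DomainScores) -> str:
        """Start at 'high' and step down one level per unfavorable domain."""
        levels = ["high", "moderate", "low", "insufficient"]
        downgrades = sum([
            d.risk_of_bias != "low",
            d.consistency != "consistent",
            d.directness != "direct",
            d.precision != "precise",
        ])
        return levels[min(downgrades, len(levels) - 1)]

    # Medium risk of bias and imprecise estimates -> two downgrades -> "low".
    print(overall_grade(DomainScores("medium", "consistent", "direct", "imprecise")))

A concrete rule of this kind trades reviewer judgment for reproducibility; whether such rules improve agreement without distorting grades is exactly the kind of question flagged above for further research.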

We report here on our 2010–2011 evaluation of the interrater reliability of the strength of evidence guidance when used by experienced reviewers across the EPCs [22]. Our research focused on interrater reliability testing of the two main components of the AHRQ approach: (1) scoring evidence on the four required domains (risk of bias, consistency, directness, and precision) and (2) developing an overall strength of evidence grade, given the scores for each of these four individual domains. Our primary goal was to determine whether different teams of reviewers would reach similar conclusions about the strength of evidence when presented with the same information about studies included in a systematic review. Our secondary goal was to gain a greater understanding of the relative role of each domain that EPCs evaluate in developing strength of evidence grades. Table 2 presents the specific key questions of the research, and Fig. 1 illustrates the complex linkages between populations (individual independent reviewers and reviewer pairs) and outcomes (agreement on domain scores and overall strength of evidence grades) for these key questions.

Section snippets

Interrater reliability testing approach: data collection

We conducted interrater reliability testing using data obtained from two published systematic reviews (comparative effectiveness reviews) focusing on drug treatments for two distinct medical indications: second-generation antidepressants for the treatment of major depressive disorder (MDD) and disease-modifying antirheumatic drugs (DMARDs) for the treatment of rheumatoid arthritis (RA) [24], [25]. From the data in these reviews, our study team designed 10 exercises; all 10 included RCTs, and …

Reviewer characteristics

The 22 reviewers who participated in the interrater reliability test were from nine EPCs and AHRQ. Seven were trained as medical doctors; nine others had doctoral degrees. All reviewers were experienced in conducting systematic reviews (mean number completed, 16.6; range, 2–42); all but two had participated in strength of evidence assessments (mean number completed, 5.0; range, 0–10).

Independent reviewer agreement and difficulty in scoring

The level of independent reviewer interrater agreement for domain scores varied considerably, from substantial …

Discussion

This series of 10 exercises showed the level of diversity and complexity that EPC reviewers can encounter in their day-to-day evaluations of bodies of evidence. Our findings clearly demonstrate that the conclusions that experienced reviewers reach based on the same evidence can differ greatly. In an early interrater reliability test of the GRADE approach, researchers similarly found only a fair level of rater agreement for RCT-only evidence [23]. Of greatest concern is the poor level of …

Conclusions

This series of 10 exercises (with 11 reviewer pairs) showed the level of diversity and complexity that systematic reviewers can encounter in their day-to-day evaluations. Particular challenges arise from evidence bases with both RCTs and observational studies and from situations involving “no differences” as contrasted with “superiority” findings. We demonstrate that the conclusions that experienced reviewers reach, based on the same evidence, can differ greatly. Of greatest concern is the poor …

Acknowledgments

The authors thank Meera Viswanathan, PhD, Douglas Kamerow, MD, MPH, Suzanne L. West, PhD, Emily Richmond, MPH, Elizabeth Tant, BA, and Loraine G. Monroe for their participation and guidance on the project. They are greatly indebted to their interrater reliability reviewers for their assistance: Jeff Andrews, MD, Mohammed Ansari, MBBS, MMedSc, MPhil, Mary Butler, PhD, Donna Dryden, PhD, Lisa Hartling, MSc, PhD, Suchitra Iyer, PhD, Elisabeth Kato, MD, MRP, Stacey Lloyd, MPH, Marian McDonagh, PharmD, …

References (55)

  • K.N. Lohr. Rating the strength of scientific evidence: relevance for quality improvement programs. Int J Qual Health Care (2004)
  • C. Swann et al. Health systems and health-related behaviour change: a review of primary and secondary evidence (2010)
  • Eden J, Levit L, Berg A, Morton S, editors. Finding what works in health care: standards for systematic reviews....
  • Cochrane Collaboration. Evidence Aid project. Available at...
  • M. Helfand et al. Challenges of summarizing better information for better health: the Evidence-based Practice Center experience. A guide to this supplement. Ann Intern Med (2005)
  • M. Pignone et al. Challenges in systematic reviews of economic analyses. Ann Intern Med (2005)
  • G. Gartlehner et al. Clinical heterogeneity in systematic reviews and health technology assessments: synthesis of guidance documents and the literature. Int J Technol Assess Health Care (2012)
  • Agency for Healthcare Research and Quality. Methods guide for effectiveness and comparative effectiveness reviews....
  • S. West et al. Systems to rate the strength of scientific evidence. Evid Rep Technol Assess (Summ) (2002)
  • D. Atkins et al. Systems for grading the quality of evidence and the strength of recommendations I: critical appraisal of existing approaches. The GRADE Working Group. BMC Health Serv Res (2004)
  • G.H. Guyatt et al. What is “quality of evidence” and why is it important to clinicians? BMJ (2008)
  • K.N. Lohr. Grading strength of evidence (2011)
  • Viswanathan M, Ansari MT, Berkman ND, Chang S, Hartling L, McPheeters LM, et al. Assessing the risk of bias of...
  • Berkman ND, Lohr KN, Morgan LC, Richmond E, Kuo TM, Morton S, et al. Reliability testing of the AHRQ EPC approach to...
  • D. Atkins et al. Systems for grading the quality of evidence and the strength of recommendations II: pilot study of a new system. BMC Health Serv Res (2005)
  • Gartlehner G, Hansen RA, Thieda P, DeVeaugh-Geiss AM, Gaynes BN, Krebs EE, et al. Comparative effectiveness of...
Funding: This project was funded under contract no. HHSA-290-2007-10056-I from the Agency for Healthcare Research and Quality, US Department of Health and Human Services.

Conflict of interest: None.
