What is new?
Key findings
• Interrater reliability among systematic reviewers is low for arriving at overall strength of evidence grades, and it is lower than interrater reliability for most of the domain scores from which such grades are derived.
• Complex but not uncommon evidence bases that do not lend themselves to meta-analysis, particularly those mixing randomized controlled trials and observational studies and those assessing benefits or harms with multiple measures and instruments, can be extremely difficult for reviewers to grade.
• Dual review with adjudication of differences improves interrater reliability and is critical.
What this adds to what was known?
• Our results substantially extend the understanding of levels of interrater agreement in grading systematic review evidence beyond the one earlier published study on this topic, expanding the testing to include domain scoring and using more complicated exercises deliberately designed to reflect typical scenarios that systematic reviewers face today.
What is the implication and what should change now?
• More specific guidance is needed for scoring individual domains and for combining domain scores into overall strength of evidence grades, to help reviewers account for the increasing diversity of study designs and outcome measures.
• Better guidance is needed for experienced systematic reviewers, and even more so for those who are relatively new to the field and lack the experience that our reviewers had.
• More research is needed on the interrater reliability of domain scoring and overall grading with complex evidence bases, on the “methods” and “judgments” that different reviewers use, and on the advantages and disadvantages of concrete rules for deriving strength of evidence grades from domain scores.
• If strength of evidence grading judgments can differ even among experienced and well-trained reviewers, then transparency about methods and the rationale for findings is critical in every systematic review.
Systematic reviews, with or without meta-analyses, are the backbone of many aspects of health care: clinical and community practice guidelines, coverage and reimbursement decisions, quality improvement and patient safety initiatives, innovative delivery systems and payment schemes, and even emergency situations [1], [2], [3], [4], [5], [6], [7]. Numerous groups conduct such reviews with support from national governments, professional societies, industry, and patient or disease advocacy groups. The US Agency for Healthcare Research and Quality (AHRQ), through its Evidence-based Practice Center (EPC) program, has sponsored systematic reviews and related products for 15 years.
Throughout most of that time, AHRQ has supported research into best practices for enhancing the scientific rigor of these reviews [8], [9], [10], [11], [12]; the agency's Methods Guide for Effectiveness and Comparative Effectiveness Reviews (Methods Guide) covers a wide array of related topics [13]. Of particular salience have been the related tasks of rating the quality (i.e., risk of bias or internal validity) of individual studies and grading the strength of bodies of evidence addressing key questions in such reviews [14], [15], [16], [17], [18], [19], [20], [21], [22]. The guidance on methods for completing systematic review tasks is developed through work groups, comprising mainly expert practitioners from the independent EPCs. One such cross-EPC team developed the chapter on grading the strength of evidence related to therapeutic interventions, published in 2010 as definitive guidance for EPCs at that time [18].
In the intervening years, EPC work groups have revisited chapters of the Methods Guide, and systematic reviewers have raised various conceptual and practical questions about evolving standards for reviews. One set of questions concerns the interrater reliability of the two major steps in grading strength of evidence (rating key domains and reaching an overall strength of evidence grade), because these grades are key indicators of a review team's confidence that the studies in the review collectively reflect the true effect of an intervention on a health outcome. Others explored such methods and issues in the early 2000s as well; the Grading of Recommendations Assessment, Development and Evaluation (GRADE) working group, in particular, examined the interrater reliability of its earlier approach to grading the quality of evidence [23]. The group found that kappa agreement among raters for 12 outcomes was fair overall (κ = 0.27; standard error, 0.015), ranging from agreement slightly worse than chance for four outcomes (negative κ values) to a high of κ = 0.823 for one outcome.
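For reference, the kappa statistic corrects observed agreement for the agreement expected by chance; in its simplest two-rater (Cohen) form,

κ = (p_o − p_e) / (1 − p_e),

where p_o is the observed proportion of agreement and p_e the proportion expected by chance alone, so that κ = 0 indicates chance-level agreement and κ = 1 perfect agreement. On commonly used benchmarks, values between roughly 0.2 and 0.4 are labeled “fair.”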
As of 2010, EPC reviewers evaluate bodies of evidence for each major outcome (i.e., benefits and harms) and each main comparison (e.g., intervention A vs. intervention B) under each key question of a review [18]. First, two independent reviewers rate (score) four required domains: the risk of bias of included studies, consistency of the evidence, directness of the evidence, and precision of the estimates (Table 1, top). They can score four additional domains (dose–response association, plausible confounding that would decrease the observed effect, strength of association or magnitude of effect, and publication bias) if they consider them integral to the final evaluation. The first three of these additional domains are more likely to apply to observational studies than to randomized controlled trials (RCTs).
The Methods Guide recommends scoring each domain separately for RCTs and observational studies. Second, for each major outcome or comparison, reviewers then aggregate domain scores into a single strength of evidence grade (Table 1, bottom). Reviewers resolve disagreements about domain scores or overall grades through consensus or adjudication by a third reviewer.
To arrive at overall grades, reviewers could use the then-contemporary GRADE approach [17], their own weighting (e.g., numeric) system, or a qualitative approach. AHRQ's primary requirement is transparency: reviewers must clearly explain their rationale for aggregating the domains into a single strength of evidence grade.
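To make this two-step process concrete, the sketch below encodes it in Python. The domain names follow the four required domains above, but the score labels, the downgrade-one-level aggregation rule, and the data structures are hypothetical illustrations; as noted, AHRQ requires a transparent rationale rather than any particular algorithm.

```python
from dataclasses import dataclass

@dataclass
class DomainScores:
    """Scores for the four required domains, per outcome and comparison.
    The label sets here are illustrative, not an AHRQ-mandated encoding."""
    risk_of_bias: str  # e.g., "low", "medium", "high"
    consistency: str   # e.g., "consistent", "inconsistent", "unknown"
    directness: str    # e.g., "direct", "indirect"
    precision: str     # e.g., "precise", "imprecise"

def overall_grade(scores: DomainScores) -> str:
    """One hypothetical qualitative rule: start at 'high' and step down
    one level for each unfavorable domain score."""
    levels = ["insufficient", "low", "moderate", "high"]
    unfavorable = sum([
        scores.risk_of_bias == "high",
        scores.consistency == "inconsistent",
        scores.directness == "indirect",
        scores.precision == "imprecise",
    ])
    return levels[max(0, 3 - unfavorable)]

def adjudicated_grade(reviewer_a: DomainScores, reviewer_b: DomainScores,
                      third_reviewer) -> str:
    """Dual review: each reviewer grades independently; disagreements are
    resolved by consensus or a third reviewer, as described above."""
    grade_a, grade_b = overall_grade(reviewer_a), overall_grade(reviewer_b)
    return grade_a if grade_a == grade_b else third_reviewer(grade_a, grade_b)
```

A numeric weighting system would simply replace overall_grade with a different scoring function; the dual-review adjudication step is unchanged either way.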
We report here on our 2010–11 evaluation of the interrater reliability of the strength of evidence guidance when used by experienced reviewers across the EPCs [22]. Our research focused on interrater reliability testing of the two main components of the AHRQ approach: (1) scoring evidence on the four required domains (risk of bias, consistency, directness, and precision) and (2) developing an overall strength of evidence grade, given the scores for each of these four individual domains. Our primary goal was to determine whether different teams of reviewers would reach similar conclusions about the strength of evidence when presented with the same information about studies included in a systematic review. Our secondary goal was to gain a greater understanding of the relative role of each domain that EPCs evaluate in developing strength of evidence grades. Table 2 presents the specific key questions of the research, and Fig. 1 illustrates the complex linkages between populations (individual independent reviewers and reviewer pairs) and outcomes (agreement on domain scores and overall strength of evidence grades) for these key questions.