Background
A 2010 article estimated that 75 trials and 11 systematic reviews are published in the medical field each day [1]; in 2016, this estimate was updated to 25 systematic reviews published each day [2]. Developing practical, effective tools to use in these reviews is critical to providing timely and useful summaries of a rapidly expanding and evolving evidence base.
Risk of bias in intervention studies has been defined as “the likelihood of inaccuracy in the estimate of causal effect in that study” [3]. In systematic reviews, assessing the risk of bias of individual studies is essential to providing accurate assessments of the overall intervention effect. Many different tools have been developed to assess risk of bias. Several systematic reviews have identified an incredible array of tools (as many as 194 in one review) [4] that have been developed for different purposes, cover a range of study designs, and assess different domains of potential bias [4–6]. Given the diversity of purposes for which they were designed, each of these tools has unique strengths and weaknesses. However, the majority (87% according to one review) are specific to a single type of study design rather than encompassing a range of study designs, and few are validated [4]. While there have been a small number of assessments of the validity and reliability of existing tools in recent years [7–10], there is still generally limited information on which tools perform best [3].
In this article, we present a tool for assessing risk of bias in both randomized and non-randomized intervention studies. The tool was developed by the Evidence Project, which conducts systematic reviews and meta-analyses of behavioral interventions for human immunodeficiency virus (HIV) in low- and middle-income countries. Specifically, we sought to develop a tool that would be appropriate for use across a range of study designs, from randomized trials to observational studies, and that would capture the main aspects of risk of bias in behavioral intervention studies in our field. Our goal here is to describe our risk of bias tool in sufficient detail that readers can interpret its use in Evidence Project reviews and apply it themselves in their own reviews if desired. We also evaluate the reliability of the tool by assessing inter-rater agreement on both individual items and the total count of items.
Discussion
The Evidence Project tool assesses risk of bias in a range of different study designs with moderate to substantial reliability. This tool is one of many existing tools that systematic reviewers and others can select from. Viswanathan et al. [3] advocate that systematic reviewers consider the following general principles when selecting a tool: it should (a) be specifically designed for use in systematic reviews, (b) be specific to the study designs being evaluated, (c) show transparency in how assessments are made, (d) address risk-of-bias categories through specifically related items, and (e) be based on theory or, ideally, empirical evidence. We believe our tool meets these criteria, though like any other tool, it has strengths and weaknesses and should be selected when it best meets the needs of a given review.
One strength of the Evidence Project risk of bias tool is its applicability to a range of study designs, from RCTs to case-control, cohort, and pre-post studies, including both prospective and retrospective designs. Previous reviews have found that the majority (87%) of existing risk of bias tools are design-specific [4], although there may be clear benefits to including a range of study designs in a given systematic review [39]. This aspect also allows the tool to be used across a range of topics, thus facilitating comparison across topics; for example, we have found that some HIV prevention interventions (such as condom social marketing [25]) rarely use RCTs, while other topics (such as school-based sex education [13]) are much more likely to do so. Our risk of bias tool highlights these differences when compared across reviews. Comparability across reviews is further facilitated by the fact that the tool does not need to be adapted for each review or for each included study. This distinguishes it from tools such as ROBINS-I [40], which asks reviewers to assess bias separately for each outcome included in each study (and outcomes may differ across studies and across review topics), and the Newcastle-Ottawa scale [41], which asks reviewers to select the most important factor for which studies should control (which may differ across review topics).
Other strengths of the Evidence Project risk of bias tool include its relative ease of use and clarity. The eight items are fairly straightforward and easy to assess, which should make data extraction less error-prone and more accessible to less experienced reviewers. The tool’s results are also relatively easy for readers to interpret, as all information can be condensed into a single table with one row per study.
However, our tool also has some limitations. First, some items, as noted above, may capture study features other than bias, and may do so differentially across studies. For example, length of follow-up, which differs across studies, affects how difficult the 80% retention cutoff is to meet. Similarly, sample size and the choice of sociodemographic or outcome variables may both affect whether comparison groups appear equivalent on these measures. While these items could be adapted for individual reviews, doing so would reduce the consistency across topics noted above.
Second, while our decision to change the tool to a simple checklist, rather than a checklist with a summary (numerical) judgment, avoids criticisms of summary scores, Viswanathan et al. [3] have recently noted that this approach “devolves the burden of interpretation of a study’s risk of bias from the systematic reviewer to the reader.” When we did present a summary score, readers found it easy to see differences in overall quality across included articles; without the summary score, we find it more difficult to communicate overall risk of bias succinctly when presenting review results. An alternative may be to use individual items in the scale to create general categories, ranking studies as at “low,” “medium,” or “high” risk of bias, as sketched below. We have not done this to date, as the different items and domains do not reflect an equal risk of bias; however, it could be considered by others using the tool.
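For illustration only, one minimal way to implement such categories would be to count the checklist items a study meets and bin the count. The sketch below (in Python) is hypothetical: the thresholds and function name are ours, the item names paraphrase the tool’s eight items, and each item is treated as simply met or not met (ignoring “not applicable” or “not reported” responses), so any real use would need review-specific justification.

# Hypothetical sketch: deriving a coarse risk-of-bias category from the
# eight Evidence Project checklist items. The thresholds (>=7 "low",
# 4-6 "medium", <=3 "high") are illustrative only, not part of the tool.

def overall_category(items: dict[str, bool]) -> str:
    """Map a study's checklist responses to a coarse risk-of-bias label."""
    met = sum(items.values())  # number of items the study meets
    if met >= 7:
        return "low"
    if met >= 4:
        return "medium"
    return "high"

# Example: a study meeting 5 of the 8 items would be labeled "medium".
study = {
    "prospective_cohort": True,
    "control_or_comparison_group": True,
    "pre_post_intervention_data": True,
    "random_assignment": False,
    "random_selection_for_assessment": False,
    "follow_up_rate_at_least_80_percent": True,
    "groups_equivalent_on_sociodemographics": True,
    "groups_equivalent_at_baseline_on_outcomes": False,
}
print(overall_category(study))  # -> medium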
Third, the Evidence Project risk of bias tool does not capture some elements of quality that other tools assess. For example, ROBINS-I [40] assesses bias in the classification of interventions, deviations from intended interventions, measurement of outcomes, and selection of the reported results. The Newcastle-Ottawa scale [41] considers items such as the case definition (for case-control studies) and ascertainment of exposure. The Cochrane Risk of Bias tool [42] includes items such as random sequence generation, allocation concealment, blinding of participants and personnel, blinding of outcome assessment, and selective reporting. For the Evidence Project, we focus on behavioral interventions that are often impossible to blind, and because few RCTs are included in our reviews, items such as random sequence generation and allocation concealment rarely apply. In line with recommendations to “select the most important categories of bias for the outcome(s) and topic at hand” [3], we have found the categories in our risk of bias tool to be useful for an overall assessment of the diverse types of studies we see in the field of HIV behavioral interventions in low- and middle-income countries.
Inter-rater reliability was moderate to substantial for each item in our tool individually, and the median inter-rater reliability across items was substantial. This compares favorably to other risk of bias tools. Assessing the Cochrane Risk of Bias tool, Hartling et al. found that inter-rater agreement ranged from slight (κ = 0.13) to substantial (κ = 0.74) across items [33], while Armijo-Olivo et al. found that inter-rater reliability was poor for both the overall score (κ = 0.02) and individual items (median κ = 0.19, range −0.04 to 0.62). The Newcastle-Ottawa scale has similarly been found to have fair inter-rater reliability overall (κ = 0.29), with individual items ranging from substantial (κ = 0.68) to poor (κ = −0.06) [9]. The relative ease of use and clarity of the items on our tool likely increased its reliability. However, because both reviewers were from the same study team, our inter-rater reliability results may be more consistent than would be expected if the tool were applied by members of different groups. Several studies have found that consistency can be lower still across groups, for example between Cochrane reviewers and blinded external reviewers [7] or across consensus assessments of reviewer pairs [8].
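For reference, the κ values quoted above are kappa statistics (Cohen’s κ for two raters), which adjust the observed agreement between raters for the agreement expected by chance. For observed proportion of agreement p_o and chance-expected proportion p_e:

κ = (p_o − p_e) / (1 − p_e)

Perfect agreement gives κ = 1 and chance-level agreement gives κ = 0, with negative values indicating worse-than-chance agreement. The qualitative labels used above (“slight,” “fair,” “moderate,” “substantial”) follow the widely used Landis and Koch benchmarks, under which 0.41–0.60 indicates moderate and 0.61–0.80 substantial agreement; these benchmarks are an interpretive convention rather than part of any particular tool.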
The Evidence Project risk of bias tool has been used in over 30 systematic reviews to date, including both Evidence Project publications [11–27] and other systematic reviews not connected with the Evidence Project [43–58]. Some of these reviews have changed the tool’s criteria slightly, for example by using a 75% instead of an 80% retention cutoff [44, 48, 49, 52, 54] or by adding an extra item on whether the study adjusted for confounding variables [44, 46, 48, 49, 52–54]. The tool has been used in reviews of a range of topics, including in Cochrane reviews [14, 52] and in reviews informing World Health Organization guidelines [43–48, 50, 53]. We believe this widespread use in reputable settings, including by researchers outside our study team, provides at least some indication that others find the tool useful and consider it to have face validity.