Background
Systematic reviews and meta-analyses of randomized clinical trials (RCTs) are central to evidence-based clinical decision-making [
1,
2]. RCTs are the gold standard design for assessing the effectiveness of treatment interventions. Well-conducted RCTs may eliminate confounding, which allows decision-makers to infer that changes in the outcome of interest are causally linked with the experimental intervention. If results of RCTs included in a meta-analysis are biased, then the results of the meta-analysis will also be biased [
3,
4]. Meta-analyses commonly account for this risk of bias by stratifying the analysis according to whether the included RCTs are at low or high risk of bias.
In 2008, the Cochrane Collaboration published a tool and guidelines for the assessment of risk of bias in RCTs [
5,
6]. The risk of bias tool was widely embraced by the systematic review community [
7]. The tool addresses six domains of bias, each of which is classified as being at low, high, or unclear risk of bias. The domains were selected based on empirical evidence and theoretical considerations that focused on methodological issues likely to influence the results of RCTs.
Several studies reported that the reliability of the risk of bias tool is low [
8‐
10]. Reliability of the risk of bias tool can be assessed between two raters of the same research group when, for instance, they assess in duplicate the risk of bias of RCTs included in a meta-analysis. It can also be assessed across research groups when the same trial is included in two different meta-analyses and its risk of bias is assessed by two different research groups. Disagreements between two raters of the same research group may be less problematic, since they will normally discuss their ratings to reach a consensus. Disagreements between raters from different research groups are more problematic, for example, if for the same outcome a trial is considered at low risk of bias in one meta-analysis but at high risk of bias in another. Low reliability of risk of bias assessments can then ultimately have repercussions on decision-making and quality of patient care [
11,
12].
We recently found that reliability of the risk of bias tool might improve if raters receive intensive standardized training [
8]. However, to our knowledge, no formal evaluation of such a training intervention has been performed. We therefore aimed to investigate whether training raters, with objective and standardized instructions on how to assess risk of bias, would improve the reliability of the Cochrane risk of bias tool within and between pairs of raters.
Discussion
To our knowledge, this prospective pilot study is the first to indicate that the reliability of the risk of bias tool may be improved by standardized training of inexperienced raters. The increase in between-group Kappa agreement ranged from 0.09 to 0.52 across risk of bias items, but only reached statistical significance for allocation concealment and incomplete outcome data. These results indicate that intensive standardized training may reduce the variation in risk of bias assessment across different research groups. The increase in within-group Kappa agreement ranged from 0.62 to 1 across risk of bias items, and there was strong evidence that standardized training improved within-group Kappa agreement for all risk of bias items.
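To make the agreement statistic concrete, the following is a minimal sketch of how a Kappa value can be computed for two sets of risk of bias ratings. It assumes unweighted Cohen's Kappa over the three categories of the tool (low, high, unclear); the exact Kappa variant used in the study is not stated in this section, and the function name `cohen_kappa` and the example ratings are hypothetical and serve only as an illustration.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's Kappa for two equally long lists of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same, non-empty set of trials")
    n = len(ratings_a)
    # Observed agreement: proportion of trials given the same rating by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category; Kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of one risk of bias item for eight trials (not study data).
rater_1 = ["low", "low", "high", "unclear", "high", "low", "unclear", "high"]
rater_2 = ["low", "high", "high", "unclear", "high", "low", "low", "high"]

kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 2))  # 0.61
```

In this made-up example the raters agree on 6 of 8 trials (0.75), agreement expected by chance is 23/64 (about 0.36), and Kappa is therefore 25/41 (about 0.61). The same function could be applied to the two raters within a pair, or to the consensus ratings of two different pairs, depending on which type of agreement is of interest.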
Critics of the risk of bias tool commonly refer to its low agreement within a pair of raters to challenge its usefulness [
8,
10,
20,
21]. Indeed, Kappa within a pair of raters for the Cochrane risk of bias tool has been reported to be generally low [
8,
9,
22]. Our findings are in line with previous studies, in that we also observed rather low agreement within a pair of inexperienced raters who received minimal training, with Kappa values indicating only slight agreement at best. However, such disagreement may have little practical relevance, since raters within a research group usually discuss their assessments to reach consensus when they differ. What is more relevant is whether their risk of bias assessment after discussion is accurate, and whether their assessments are similar to those of other research groups, since low agreement of risk of bias assessments between research groups can have repercussions on decision-making and quality of patient care [
11,
12]. Our results suggest that, although a discussion between minimally trained raters to reach consensus will lead to a more accurate risk of bias assessment, it will not lead to an acceptable level of agreement between different research groups. These findings are in agreement with those of Hartling et al. and Armijo-Olivo et al., who investigated the agreement between pairs of raters from different research groups and also concluded that discussion within pairs of raters to reach consensus is not enough to achieve acceptable levels of agreement across different research groups [
8,
10].
Although low agreement of the Cochrane risk of bias tool has been reported by several studies, none has proposed and investigated ways to improve it. Our study is the first to show that intensive standardized training on risk of bias shows promise as a method to improve agreement not only within pairs of raters, but also across research groups. We found that standardized training improved agreement within a pair of raters for all items assessed. Although standardized training also led to numerically better agreement between pairs of raters for all items assessed, there was only statistical evidence of improvement for the assessment of concealment of allocation and incomplete outcome data. In the present study, assessment of concealment of allocation was most problematic, with 75% of the trials not reporting enough information to allow a proper assessment of this item. Raters receiving standardized training, including explanations and decision rules, had higher agreement between pairs of raters, notwithstanding poor reporting of the item. As a way to circumvent poor reporting of randomization methods, Corbett et al. suggested that reviewers take between-group baseline imbalances in important prognostic indicators into consideration when assessing selection bias, something that could also be included in standardized instructions to further facilitate the risk of bias assessment of this poorly reported item [
23]. The largest difference in agreement between pairs of raters receiving standardized training was observed for the assessment of incomplete outcome data. Savović et al. conducted a survey with stakeholders within the Cochrane Collaboration and reported that most of them (67%) found the assessment of risk of bias due to incomplete outcome data to be the most difficult [
24]. This difficulty may explain why the largest improvement in agreement between pairs of raters was observed for this item, for which the standardized training provided clear instructions on how to assess it.
The standardized instructions and training for risk of bias assessment should be tailored to address the main methodological problems commonly found in the area of research of interest. For instance, for most physical therapy interventions, it is difficult if not impossible to blind the therapist. However, a trial comparing two different spinal manipulation techniques will not necessarily have a high risk of performance bias due to the lack of therapist blinding. This problem can be circumvented, for example, by using expertise-based randomization, where patients are only treated by experts on a particular intervention [
25]. To develop valid instructions for risk of bias assessment within a specific area of research, it is of utmost importance that epidemiologists experienced in this area are involved in the process, so that risks of bias and possible ways to minimize them are properly identified and addressed in the instructions. Properly developed instructions for risk of bias assessment will not only improve the agreement of the risk of bias tool within and between research groups, but will likely also increase the validity and transparency of the risk of bias assessment process within a specific area of research.
The main strength of our study is that we included raters with no experience in risk of bias assessment to investigate the effect of standardized training on the agreement of the risk of bias tool. Randomizing only inexperienced raters to training groups allowed us to maximize the effect of standardized training. If raters had already been experienced in risk of bias assessment, there could have been limited room for improvement, as postulated in a previous study that investigated the effect of training on a similar method for methodological quality assessment [
26]. The main limitation of the present study was the low number of raters randomized to training groups. While this was unproblematic for statistical precision, we cannot exclude relevant baseline imbalances that could partially explain the observed results. To overcome this limitation, an obvious strategy would be to assess the baseline agreement between the risk of bias assessments of each inexperienced rater and those of experienced raters, and then match inexperienced raters according to their baseline performance to conduct a matched-pairs randomization. However, a baseline assessment of students’ performance in risk of bias assessment could itself act as training, which in turn could bias the results of our analysis.
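The matched-pairs randomization described above can be sketched as follows. This is an illustration under stated assumptions, not a procedure used in the present study: the function name `matched_pairs_randomization`, the rater labels, and the baseline agreement values are all hypothetical, and matching is done simply by ranking raters on baseline agreement and pairing neighbors.

```python
import random


def matched_pairs_randomization(baseline_scores, seed=None):
    """Pair raters with similar baseline agreement, then randomize within each pair.

    baseline_scores maps a rater label to that rater's baseline agreement with
    experienced raters (hypothetical input). Returns a dict: rater -> training group.
    """
    if len(baseline_scores) % 2 != 0:
        raise ValueError("an even number of raters is required to form pairs")
    rng = random.Random(seed)
    # Rank raters by baseline agreement so that neighbors have similar performance.
    ranked = sorted(baseline_scores, key=baseline_scores.get)
    allocation = {}
    for i in range(0, len(ranked), 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # random order within the matched pair
        allocation[pair[0]] = "standardized training"
        allocation[pair[1]] = "minimal training"
    return allocation


# Hypothetical baseline agreement values for four inexperienced raters.
baseline = {"rater_A": 0.10, "rater_B": 0.35, "rater_C": 0.15, "rater_D": 0.40}
allocation = matched_pairs_randomization(baseline, seed=2024)
for rater, group in sorted(allocation.items()):
    print(rater, "->", group)
```

With these made-up values, rater_A is matched with rater_C and rater_B with rater_D, so each training group receives one rater from each matched pair. The caveat raised above still applies: collecting the baseline scores needed for matching may itself train the raters.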
Our results could potentially be influenced by performance bias resulting from a nocebo effect in the control group of doctoral students who received minimal training. If students in the control group understood that they were not receiving the best training available in our study, they could have felt discouraged from performing risk of bias assessments to the best of their ability. This could in turn lead to an artificially lower agreement of the risk of bias tool with minimal training as compared to standardized training. Unblinding could also have resulted in an underestimation of the difference in between-group reliability across groups of raters, since inexperienced raters in the minimal training group could alternatively have sought additional training elsewhere or been prompted to self-study. To minimize the risk of such performance bias, inexperienced raters were not informed of the training group to which they were randomized, and they were instructed not to discuss any characteristics of their training with each other. After data extraction was completed, inexperienced raters were asked to guess their group assignment. All four inexperienced raters correctly identified the groups to which they were allocated, but reported that their suspicion did not influence their performance. Moreover, the use of minimal training as a control intervention may have led to an underestimation of the effect of our standardized training. Although “no training” could have been used as a control intervention instead of minimal training to maximize the effect of standardized training, this could have substantially increased the risk of performance bias in our study, as explained above. Finally, the minimal training used in the present study may be better than what reviewers commonly receive. Again, the effect of intensive training may be even larger in a setting where the training that reviewers receive is worse than the minimal training provided here.
The low number of raters randomized to intervention groups limits the generalizability of our findings and may have generated confounding, as previously mentioned. Because this was a pilot study, we included the minimal number of participants needed to calculate Kappa agreements within each study condition. Given the promising large effect of standardized training observed in the present study, a future study using the same methods but including a larger number of inexperienced raters should be conducted. Generalizability may be further limited by the characteristics of the trials assessed in our study. Reliability of the risk of bias assessment could vary if trials with different patient populations, interventions, and outcomes were assessed. However, we believe the sample of trials used allowed us to make a more valid assessment of blinding, given the subjective nature of pain outcomes and the difficulties involved in blinding patients and therapists in physical therapy trials. Our results are further limited by the exclusion of the assessment of selective outcome reporting from our investigation.
Acknowledgements
We thank Kali Tal for her editorial suggestions.