Given the implications of imprecision evaluation for recommending health care interventions as standard of care or cautionary noting that additional clinical trials are warranted, it is expected that imprecision be transparently evaluated and reported. There remains ample room for improvement, however. Almost half of the outcomes were downgraded due to imprecision, but only about 1 in 10 reviews reported the criteria to downgrade imprecision. The width of 95% of confidence intervals and the number of study participants were the most common criteria to infer downgrading due to imprecision. One third of the conclusions that did not downgrade the evidence for imprecision would have been contradicted based on GRADE assessment of imprecision following the GRADE Handbook or TSA if these methods had been applied. This was mainly because GRADE evaluation by the review authors was more lenient and also because the number of patients included in the meta-analyses was often insufficient to make any definitive conclusion.
The GRADE approach and TSA have different connotations that might influence the judgment process. While GRADE is defined as “a semi quantitative approach that encompasses imprecision besides the other certainty domains” and has intrinsic subjectivity [
5], TSA can be viewed as a purely quantitative and objective approach [
10]. Despite their differences, the agreement between the two approaches was substantial. Nonetheless, the two methods give different weight to the extent of imprecision. TSA tended toward more severe judgment, while GRADE seemed to be more easily applied by the review authors, resulting in overestimation of the certainty of evidence. When imprecision is rated as negligible (i.e., the true effect plausibly lies within the 95% CI), it is more likely that the effect estimates can be trusted and evidence quality rated highly. TSA will allow the reader to gauge the extent of confidence on imprecision of primary results of meta-analyses, though it might also be perceived as too radical by some. With a GRADE approach, reviewers’ decisions on downgrading are more open to subjective decisions. To guide such decisions, further research is warranted on optimal information size and of different types of interventions under different circumstances.
Strengths and weaknesses of the study
This study has several limitations. We included only Cochrane reviews of health care interventions, a fairly homogeneous but partial sample of systematic reviews published in the medical literature. Reviews published in other medical journals might provide different results. However, since Cochrane strongly encourages the use of the GRADE approach, including imprecision, it seems implausible that other journals would perform better. Our analyses are valid for pooled results and restricted to dichotomous outcomes, limiting the generalizability of our results. However, in the medical literature, two thirds of primary outcomes reported by systematic reviews are dichotomous [
24]. Besides, calculation and definition of a clinical threshold for continuous outcomes, i.e., the minimal important differences or minimal detectable changes, remain ambiguous and unclear [
25,
26]. Furthermore, our logistic regression results may be under-powered to detect any significant factor, due to the limited number of studies included in the analysis.
While our analysis was protected against confounding by disease area and type of intervention because it is based on a sample of meta-analyses irrespective of interventions, it may still have been confounded by other meta-analysis characteristics. Given the wide diversity of the studies included, the quantitative results are suggestive and might change in the future when GRADE and Cochrane propose methods and policies that strengthen imprecision assessment. Furthermore, imprecision assessment may vary between research areas (i.e., type of interventions), becoming speculative in fields where trials and reviews are too small to permit exploration of imprecision.
When we evaluated the conclusiveness of evidence by TSA in studies where the anticipated treatment effect was not reported, we could not directly assess the authors’ assumptions on relevant intervention effects. We chose one arbitrary hypothesis when we recalculated the required information size based on a 25% relative risk reduction or improvement, which might be unrealistic for some outcomes. Nevertheless, the proportion of primary outcomes downgraded for imprecision did not change substantially when we applied different thresholds for benefits in our sensitivity analyses. Moreover, we only used a two-level downgrading approach when our TSA did not show benefit, harm, or futility. By, e.g., employing the Trial Sequential Analysis-adjusted confidence intervals, this simple approach could have been refined.
As well, our results are based on our subjective application of the GRADE approach following the GRADE Handbook, which would then influence the assessments we made between the methods.
Despite these limitations, this project aspires to be a step toward optimized assessment and interpretation of the certainty of evidence regarding imprecision. Recently, evidence synthesis has evolved into a dynamic process. A new concept of systematic reviews,
living systematic reviews, has been introduced [
27]
. In this context, TSA has been suggested as a valid method to assess constantly updated evidence [
12] since it can offer constantly updating of optimal information size as new trials are added. This dynamicity could affect the imprecision domain, with assessment more closely related to what has been reached for a specific condition, outcome, and intervention at a certain time point.
Implications for systematic reviewers
As part of their mandate to identify potentially effective interventions, systematic reviewers should include precision as a key dimension to evaluate, particularly in situations where uncertainty about the ratio between benefits and harms is high and where new trial data may influence the summary judgment of the review [
28]. More detailed guidance from PRISMA and PRISMA-P could facilitate the analysis and reporting of imprecision [
29,
30]. There is room for better standardization of approaches and inclusion of quantitative methods, such as TSA, to formally evaluate imprecision.
Implications for clinicians
Imprecision assessment seems to be based on GRADEing criteria that vary considerably in meaning, value, or boundaries depending on context or conditions. Incomplete or vague imprecision assessment supports the natural tendency to simplify a review’s findings about an intervention as positive or negative and to over-rely on
P values [
31]. Instead, clinicians using evidence to orient their practice should interpret imprecision as a dynamic and often uncertain dimension that requires thorough examination of all of the evidence, including size of the trials, number of participants with outcomes, and heterogeneity (diversity) across trials. TSA offers the means to model heterogeneity and estimate optimal information size based on different heterogeneity thresholds.