Background
The preferred reporting of clinical outcomes in randomized controlled trials (RCTs) is described in the Consolidated Standards of Reporting Trials (CONSORT) statement [1]. Within CONSORT the use of confidence intervals is emphasized in preference to p-values. Confidence intervals describe the precision of the estimate and “are especially valuable in relation to differences that do not meet conventional statistical significance, for which they often indicate that the result does not rule out an important clinical difference” [1]. Editorials dating back almost 40 years have encouraged authors to use confidence intervals to describe the results of their studies rather than simply reporting the findings as statistically significant or not [2–4]. Despite this, the use of p-values in published articles remains approximately seven times more common than confidence intervals [5]. Furthermore, confidence intervals are often used in a manner similar to p-values, to dichotomize outcomes as statistically significant (SS) or not. We have previously written about three important clinical controversies resulting from this dichotomous activity [6].
Interpretation of trial results when primary outcomes are not statistically significant (NSS) is challenging. In particular, it can be difficult to put the potential clinical relevance of the NSS effect and confidence intervals in the context of the entire study results. Boutron and colleagues demonstrated that authors often place a favorable “spin” (positive portrayal) on trial results when the primary outcome is NSS [7]. Such spin occurred in 58% of abstract conclusions, 50% of main text conclusions, and 18% of titles. Others have similarly reported spin in RCTs evaluating wound care [8] and surgical modalities [9, 10]. Although promotion of results may be common in NSS trial reporting, such evaluations assume that NSS results demonstrate no potentially clinically meaningful effect.
For these reasons we examined the primary outcomes and conclusions of RCTs in six major medical journals. We had two primary questions: (1) How often do the point estimates and confidence intervals of the primary outcome of NSS and SS trials include potentially clinically meaningful effects? and (2) Are the authors’ conclusions in the abstract of NSS trials influenced by potentially clinically meaningful point estimates and confidence intervals? We focused specifically on cardiovascular trials with major adverse cardiovascular events (MACE) because these are established, objective, patient-oriented outcomes that overlap between trials. Additionally, in large cardiovascular trials with hard clinical endpoints, statistical significance can be difficult to attain but the results have high clinical relevance. We hypothesized that authors of cardiovascular trials may discount potentially clinically meaningful effects identified in the confidence intervals and/or point estimates when the results are NSS.
Methods
We followed the basic approach described in PRISMA [11] because there is no agreed-upon methodology for this type of study.
We included all cardiovascular RCTs of superiority design that evaluated preventive or interventional therapies regardless of the nature of the interventions – including medication, surgery, models of care, and lifestyle change. All comparators were valid, including placebo, active control, and no intervention. The primary outcome had to include at least one MACE: myocardial infarction, stroke, or cardiovascular death. We used PubMed to identify relevant trials from five high-impact general medical journals and one high-impact specialty journal: New England Journal of Medicine (N Engl J Med), Lancet, Journal of the American Medical Association (JAMA), British Medical Journal (BMJ), Annals of Internal Medicine (Ann Intern Med), and Circulation.
Study search and selection
Between 17 March and 14 April 2015, we searched PubMed for papers using the full journal title (and abbreviation, if present) with PubMed limits for RCTs and date (1 January 2010 to 31 December 2014). In the case of Circulation, the term circulation could relate to medical/physiologic issues in addition to the journal, so we restricted the search field to “Journal”. For the other five journals we did not apply any search restrictions in order to minimize the unlikely chance of missing relevant articles. For each journal, two authors (from VK, SK, EB, and GMA) independently evaluated and selected studies for inclusion. We excluded studies of subgroups, re-analyses, and studies that were either extensions or follow-ups from previously published trials to avoid including the same data more than once. We also excluded non-inferiority designed studies because authors’ interpretations and conclusions of non-inferiority results are broader, and this would add complexity to our interpretation of abstract conclusions. Disagreements for inclusion were resolved by consensus.
Data extraction and management
Two authors (CF with VK or SK) independently extracted data from the trials. Disagreement was resolved with consensus or third-party review (GMA).
Data extraction on study characteristics included citation, type of intervention and control, primary versus secondary prevention population, mean age in study, and percentage of males studied. Data on traditional risk of bias included allocation concealment, blinding, analysis (intention to treat or per protocol), and withdrawals. We also collected data on funding, and whether the trial was stopped early (if so, why) or extended. Data related to the primary outcome included the clinical endpoint, number of subjects in each study arm, number with the outcomes in each group, point estimate, confidence intervals, and p-values.
To evaluate the authors’ conclusions, the abstract conclusion was rated using a method derived from the technique of Als-Nielsen and colleagues [12]. We condensed the score from six to three possible conclusions: treatment superior, neutral, or control superior.
Assessing potentially meaningful effects
To assess if the primary outcome of an NSS trial included potentially meaningful effects, we focused on the point estimate and lower confidence interval. The margins of potentially clinically meaningful effect are undoubtedly debatable. Over 20 years ago, authors suggested that potentially clinically meaningful effects could be 25% or 50% relative risk reductions [13]. More recently, trials showing a relative risk reduction of 6% for ezetimibe [14] and 14% for empagliflozin [15] have been greeted with enthusiasm [16, 17]. We selected our margins of potentially meaningful effect liberally to be broad and inclusive, thereby ruling out what is likely not a clinically meaningful effect. We decided that the smallest potentially clinically meaningful effect was a 6% relative risk reduction or a 0.94 relative risk, as reported by the IMPROVE-IT trial for ezetimibe [14]. For lower confidence intervals to include potentially meaningful effects, we selected a 25% relative risk reduction or 0.75 relative risk described in meta-analyses of statin trials [18], an established clinical therapy.
Analysis of results
Study characteristics and potential biases are presented descriptively. Relative effect estimates including relative risks, hazard ratios, rate ratios, and odds ratios were used for primary analysis. If not provided, relative risks and 95% confidence intervals were calculated.
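Where only event counts were available, a relative risk and its 95% confidence interval can be derived from the 2×2 table using the standard error of the log relative risk. The sketch below is illustrative only (the function name and arguments are ours, not from the study's analysis code):

```python
import math

def relative_risk_ci(events_tx, n_tx, events_ctl, n_ctl, z=1.96):
    """Relative risk with a 95% CI computed on the log scale.

    events_tx / n_tx  : events and sample size in the treatment arm
    events_ctl / n_ctl: events and sample size in the control arm
    """
    rr = (events_tx / n_tx) / (events_ctl / n_ctl)
    # Standard error of log(RR) from the 2x2 counts
    se = math.sqrt(1 / events_tx - 1 / n_tx + 1 / events_ctl - 1 / n_ctl)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper
```

For example, 50 events among 1000 treated versus 60 among 1000 controls gives a relative risk of about 0.83 with a confidence interval of roughly 0.58 to 1.20, an NSS result whose lower bound nonetheless crosses the 0.75 threshold used here.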
Trials were initially categorized into three groups based on the statistical testing of the primary outcome: SS trials favoring control, SS trials favoring treatment, and NSS trials. Statistical significance was determined by hypothesis testing via the p-value first; if a p-value was not available, we determined whether the confidence interval excluded 1 (the line of no effect).
To analyze and describe the results, the primary outcomes for all RCTs were presented on a forest plot with the potentially clinically meaningful thresholds for point estimate (≤0.94) and confidence interval (≤0.75) indicated. We categorized NSS trials as having (1) both the lower confidence interval and point estimate include potentially meaningful effects; (2) either the lower confidence interval or point estimate include a potentially meaningful effect; or (3) neither the lower confidence interval nor point estimate include a potentially meaningful effect. Among NSS trials, results were further stratified according to authors’ conclusions.
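The three-way categorization of NSS trials described above can be sketched as a small helper function (illustrative only, not the study's actual code; the default thresholds are the ones stated in the text):

```python
def classify_nss_trial(point_estimate, lower_ci,
                       pe_threshold=0.94, ci_threshold=0.75):
    """Categorize an NSS trial by whether its results include
    potentially clinically meaningful effects.

    Returns "both", "either", or "neither" depending on whether the
    point estimate (<= pe_threshold) and/or the lower confidence
    interval (<= ci_threshold) suggest a meaningful effect.
    """
    pe_meaningful = point_estimate <= pe_threshold
    ci_meaningful = lower_ci <= ci_threshold
    if pe_meaningful and ci_meaningful:
        return "both"
    if pe_meaningful or ci_meaningful:
        return "either"
    return "neither"
```

Passing stricter thresholds (e.g. 0.85 and 0.65, as in the sensitivity analyses) re-runs the same categorization without changing the logic.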
We used chi-square and independent samples median test to examine if selected factors were associated with authors’ conclusions in NSS trials. Factors compared included type of control used in the trials, funding (industry, public, or mixed), point estimates, and lower confidence intervals.
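The chi-square test of association between categorical factors and authors' conclusions rests on the Pearson statistic for a contingency table, which can be computed as below (a minimal stdlib sketch; in practice the statistic is compared against a chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom, and the counts in the usage example are invented for illustration):

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table.

    table: list of rows of observed counts, e.g. conclusions (rows)
    cross-tabulated against funding source (columns).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

For a hypothetical 2×2 table `[[10, 20], [20, 10]]` the expected count in every cell is 15, giving a statistic of 4 × 25/15 ≈ 6.67 on 1 degree of freedom.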
Sensitivity analyses
We performed sensitivity analyses to examine the effect of some key variables on the proportion of NSS trials with potentially clinically meaningful effect. Because smaller trials may be expected to have broader confidence intervals, we performed an analysis of trials with <2000 patient-years and those with ≥2000 patient-years. Because primary prevention trials will have smaller absolute benefits for a given relative benefit, we performed an analysis of primary versus secondary prevention trials.
To determine how sensitive the results were to the threshold of potential clinically meaningful effects, we increased the potentially meaningful relative risk reduction threshold for point estimates to ≥15% (or ≤0.85 relative risk) and for lower confidence intervals to ≥35% (or ≤0.65 relative risk).
Discussion
In 61% of NSS cardiovascular trials, the primary outcome had a confidence interval that included an effect similar to or better than statin therapy (relative risk reduction ≥25%) and/or a point estimate similar to or better than ezetimibe (≥6%). These results suggest that if we were to focus strictly on a dichotomous finding of whether results are SS or NSS, we would run the risk of dismissing a potentially meaningful treatment effect in almost two thirds of NSS trials. Furthermore, about one third of NSS trials had an even higher probability of potentially clinically meaningful effects because both the confidence intervals and point estimates included potentially meaningful effects. In fact, visual inspection of Fig. 2 shows that the distribution of effects is very similar between SS trials favoring treatment and NSS trials in which both the confidence interval and point estimate include potentially meaningful effects. This further suggests that strict adherence to an arbitrary threshold for statistical significance may serve poorly as a judgment of treatment benefit.
Within NSS trials, authors’ conclusions were associated with the potentially meaningful effects in the confidence intervals and point estimates. For example, both the point estimate and confidence intervals included potentially meaningful effects in 67% of NSS trials in which the authors concluded treatment was superior. In contrast, both the point estimate and confidence intervals included potentially meaningful effects in only 6% of NSS trials in which the authors concluded control was superior. Past research suggested that just over half of NSS studies have conclusions that are unjustifiably positive and inconsistent with the results [7]. However, our study suggests that some of these favorable interpretations may relate to potentially meaningful benefits suggested by the confidence intervals and/or point estimates. Given this, and the recommendations of CONSORT regarding the presentation of results [1], future research evaluating authors’ interpretations or conclusions of NSS trials should assess trial outcomes beyond statistical significance testing.
Potentially meaningful effects in the point estimates and confidence intervals are not the only factors influencing authors’ conclusions. For example, 28% of NSS trials with a neutral conclusion had both a lower confidence interval and point estimate suggestive of potentially meaningful effects. Perhaps these authors are basing their conclusions solely on statistical significance but it is also possible that other elements of the trial results or intervention play a role: adverse events, costs, and secondary outcomes are all potentially relevant.
Our results were sensitive to two possibly predictable factors. First, trials of smaller size frequently have less precision in the estimate and thus broader confidence intervals. Within our study, this could result in more of the smaller trials having lower confidence intervals crossing a potentially meaningful threshold. This did occur but most of the trials included in this review were large. Therefore, the proportion of NSS trials in which either the point estimate and/or the confidence interval included potentially meaningful effects was only slightly lower in larger trials (having ≥2000 patient-years) than overall (53% versus 61%, respectively). Second, modification of the thresholds of potentially clinically meaningful effects foreseeably reduced the proportion of trials with potentially meaningful effects. The proportion of NSS trials in which either the point estimate and/or the confidence interval included potentially meaningful effects was 61% in our primary analysis but fell to 26% when the relative risk reduction thresholds were increased to ≥15% for point estimates and ≥35% for confidence intervals. However, even with these stricter criteria, a quarter of all NSS cardiovascular trials found potentially meaningful effects.
Despite our findings, it is important not to over-interpret our results and assume that we are suggesting that a 6% relative risk reduction is a meaningful effect in all populations. Nor would we suggest that all researchers use these thresholds for sample size estimation, or that studies be extended or repeated until these small benefits are entirely ruled out. All interventions, and the trials assessing their clinical value, need to be considered in the broader context of many relevant factors, including overall risk of the primary outcome, adverse events, costs, inconvenience, and alternative interventions. We hope this paper draws attention to the need to use confidence intervals and describe potentially meaningful effects. Fortunately, it appears that a number of authors are already doing this. Moreover, we support the advice [19] that authors and evidence-users move away from the dogmatic adherence to hypothesis testing that leads some to believe that a p-value of 0.049 means a positive trial and a treatment that works, while a p-value of 0.051 means a negative trial and a treatment that does not.
There are some notable limitations to our study. First, there are many factors involved in how authors interpret their research, but our study focused only on the point estimates and confidence intervals of primary outcomes. Second, we focused on cardiovascular trials with hard clinical (MACE) endpoints, so confirmation is required to determine if results would be similar for research in other conditions like chronic obstructive pulmonary disease or infectious disease. Third, our definitions of potentially clinically meaningful effects may be seen as arbitrary or too generous. There is no agreed-upon minimal clinically important effect for MACE outcomes, so we derived our definition from established therapies, although some will certainly feel the thresholds are too generous. We used somewhat liberal thresholds because our goal was to determine if results included any “potentially” clinically meaningful effects, but we also performed a sensitivity analysis with stricter criteria. While some will see these cut-offs as arbitrary, a goal of this paper is to reflect on the rigid adherence to the 0.05 statistical significance threshold, which itself can be considered arbitrary. Fourth, we used relative margins. Relative margins allow easier comparison across trials because any assessment of absolute effects must also account for time. Fifth, we assessed authors’ conclusions by focusing on abstract conclusions; this is a previously used method of rating conclusions [12], and the abstract conclusion is the most likely location for promotion of results [7]. It should also be noted that abstract conclusions, like any part of the articles, may have been modified through the peer-review process and editorial recommendations. It is not possible to clarify to what degree, if any, this occurred, but we suspect it is small.
Conclusions
In up to 61% of NSS cardiovascular trials, the primary outcome has a point estimate and/or confidence interval that includes potentially clinically meaningful effects. Furthermore, among the NSS cardiovascular trials, authors’ conclusions were positively associated with point estimates and lower confidence intervals that suggest greater potential effects. In fact, both the point estimates and confidence intervals included potentially meaningful effects in 67% of trials (12/18) in which the authors concluded that treatment was superior, compared to only 6% (1/16) in which authors concluded that control was superior. Given the frequency of NSS cardiovascular trials, it is reassuring that many authors look beyond statistical significance testing and consider the potentially meaningful clinical effects of their results. Additionally, journals and evidence-users should be encouraged, as directed by CONSORT, to consider point estimates and confidence intervals in the context of potentially clinically meaningful effects and not strictly for hypothesis and statistical significance testing.