Background

- How confident can I be about the results?
- Will the results change if I change the definition of the outcome (e.g., using different cut-off points)?
- Will the results change if I change the method of analysis?
- Will the results change if we take missing data into account? Will the method of handling missing data lead to different conclusions?
- How much influence will minor protocol deviations have on the conclusions?
- How will ignoring the serial correlation of measurements within a patient impact the results?
- What if the data were assumed to have a non-Normal distribution or there were outliers?
- Will the results change if one looks at subgroups of patients?
- Will the results change if the full intervention is received (i.e. degree of compliance)?
Discussion
Sensitivity Analysis
What is a sensitivity analysis in clinical research?
Why is sensitivity analysis necessary?
How often is sensitivity analysis reported in practice?
Variable | Medical journals | Health economics journals |
---|---|---|
Number with statistical analysis | 64 | 71 |
Number with sensitivity analysis (%) | 13 (20.3) | 22 (30.9) |
Type of sensitivity analysis | | |
• Methods of analysis | 5 | 12 |
• Outcome definitions | 4 | 1 |
• Distributional assumptions | 1 | 0 |
• Key assumptions* | 2 | 4 |
• Missing data | 1 | 4 |
• Baseline imbalances | 0 | 1 |
Types of sensitivity analyses

Scenario | Sensitivity analysis options |
---|---|
Outliers | Assess outliers using z-scores or boxplots; perform analyses with and without the outliers |
Non-compliance or protocol violations in RCTs | Perform: intention-to-treat analysis (as primary analysis); as-treated analysis; per-protocol analysis |
Missing data | Analyze complete cases only; impute the missing data using single or multiple imputation methods and redo the analysis |
Definitions of outcomes | Perform analyses on outcomes using different cut-offs or definitions |
Clustering or correlation, and multi-center trials | Compare the analysis that ignores clustering with one primary method chosen to account for clustering; perform the analysis with and without adjusting for center; use different methods of adjusting for center [12] |
Competing risks in RCTs | Perform a survival analysis for each event separately; use a proportional sub-distribution hazard model (Fine & Gray approach); fit one model taking all the competing risks into account together [13] |
Baseline imbalance | Perform: the analysis with and without adjustment for baseline characteristics; the analysis with different methods of adjusting for baseline imbalance, e.g. multivariable regression vs. propensity score methods |
Distributional assumptions | Perform analyses under different distributional assumptions: different distributions (e.g. Poisson vs. negative binomial); parametric vs. non-parametric methods; classical vs. Bayesian methods; different prior distributions |
Impact of outliers

- In a cost–utility analysis of a practice-based osteopathy clinic for subacute spinal pain, Williams et al. reported lower cost per quality-of-life-year ratios when they excluded outliers [17]. In other words, certain participants in the trial had very high costs, which made the average costs look higher than they probably were in reality. The observed cost per quality-of-life-year was not robust to the exclusion of outliers, and changed when they were excluded.
- A primary analysis based on the intention-to-treat principle showed no statistically significant difference in reducing depression between a nurse-led cognitive self-help intervention program and standard care among 218 patients hospitalized with angina over 6 months. Sensitivity analyses in this trial, performed by excluding participants with high baseline levels of depression (outliers), showed a statistically significant reduction in depression in the intervention group compared to the control. This implies that the results of the primary analysis were affected by the presence of patients with high baseline depression [18].
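As a minimal sketch of the with-and-without-outliers comparison recommended above, the snippet below flags outliers by z-score and recomputes a mean cost. All data and the `zscore_flags` helper are hypothetical, purely for illustration:

```python
from statistics import mean, stdev

def zscore_flags(xs, threshold=2.5):
    """Flag values whose absolute z-score exceeds the threshold."""
    m, s = mean(xs), stdev(xs)
    return [abs((x - m) / s) > threshold for x in xs]

# Hypothetical per-patient costs: one participant is far more expensive.
costs = [95, 102, 99, 104, 98, 101, 97, 103, 100, 96, 1000]

flags = zscore_flags(costs)
mean_all = mean(costs)                                         # outlier included
mean_trimmed = mean(x for x, f in zip(costs, flags) if not f)  # outlier excluded
```

If `mean_all` and `mean_trimmed` differ materially, as here, the estimate is not robust to the outlier, mirroring the Williams et al. example.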
Impact of non-compliance or protocol deviations

- A trial was designed to investigate the effects of an electronic screening and brief intervention to change risky drinking behaviour in university students. The results of the ITT analysis (on all 2336 participants who answered the follow-up survey) showed that the intervention had no significant effect. However, a sensitivity analysis based on the PP analysis (including only those with risky drinking at baseline who answered the follow-up survey; n = 408) suggested a small beneficial effect on weekly alcohol consumption [31]. A reader might be less confident in the findings of the trial because of the inconsistency between the ITT and PP analyses: the ITT was not robust to sensitivity analyses. A researcher might choose to explore differences in the characteristics of the participants included in the ITT versus the PP analyses.
- A study compared the long-term effects of surgical versus non-surgical management of chronic back pain. Both the ITT and AT analyses showed no significant difference between the two management strategies [32]. A reader would be more confident in the findings because the ITT and AT analyses were consistent: the ITT was robust to sensitivity analyses.
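To see mechanically why ITT and per-protocol estimates can disagree, here is a toy sketch (all records hypothetical) that computes the same treatment effect on the full randomized sample and on protocol adherers only:

```python
from statistics import mean

# Hypothetical trial records: (arm, complied_with_protocol, outcome_score),
# where lower scores are better.
records = [
    ("intervention", True, 4.0), ("intervention", True, 3.5),
    ("intervention", False, 7.0), ("intervention", True, 4.5),
    ("control", True, 6.0), ("control", True, 6.5),
    ("control", False, 6.0), ("control", True, 7.0),
]

def arm_difference(rows):
    """Mean outcome difference: intervention minus control."""
    tx = [y for arm, _, y in rows if arm == "intervention"]
    ct = [y for arm, _, y in rows if arm == "control"]
    return mean(tx) - mean(ct)

itt = arm_difference(records)                      # everyone as randomized
pp = arm_difference([r for r in records if r[1]])  # protocol adherers only
```

Here the per-protocol estimate (`pp`) is larger in magnitude than the ITT estimate (`itt`) because the non-compliant intervention participant did poorly; real analyses face the same mechanism with the added risk that compliers are self-selected.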
Impact of missing data

- A 2011 paper reported sensitivity analyses of different strategies for imputing missing data in cluster RCTs with a binary outcome, using the community hypertension assessment trial (CHAT) as an example. The authors found that the variance of the treatment effect was underestimated when the amount of missing data was large and the imputation strategy did not take the intra-cluster correlation into account. However, the estimated effects of the intervention were similar under the various methods of imputation: the CHAT intervention was not superior to usual care [43].
- In a trial comparing methotrexate with placebo in the treatment of psoriatic arthritis, the authors reported both an intention-to-treat analysis (using multiple imputation techniques to account for missing data) and a complete case analysis (ignoring the missing data). The complete case analysis, which is less conservative, showed some borderline improvement in the primary outcome (psoriatic arthritis response criteria), while the intention-to-treat analysis did not [44]. A reader would be less confident about the effects of methotrexate on psoriatic arthritis due to the discrepancy between the results with imputed data (ITT) and the complete case analysis.
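A minimal sketch of the complete-case versus imputation comparison, using hypothetical scores. Single mean imputation stands in here for the multiple imputation used in practice, precisely because it exposes the pitfall noted in the comments:

```python
from statistics import mean, stdev

# Hypothetical outcome scores; None marks a missing value.
scores = [5.0, None, 6.5, 7.0, None, 5.5, 6.0, 8.0]

# Complete-case analysis: drop participants with missing values.
complete = [x for x in scores if x is not None]
cc_mean, cc_sd = mean(complete), stdev(complete)

# Single (mean) imputation: fill the gaps, then redo the analysis.
imputed = [x if x is not None else cc_mean for x in scores]
imp_mean, imp_sd = mean(imputed), stdev(imputed)

# The point estimate is unchanged, but the spread shrinks: mean
# imputation understates uncertainty, which is one reason multiple
# imputation is generally preferred.
```

Running both analyses and comparing them, as in the methotrexate example, is exactly the sensitivity analysis the table above recommends.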
Impact of different definitions of outcomes (e.g. different cut-off points for binary outcomes)

- In a trial comparing caspofungin to amphotericin B for febrile neutropenic patients, a sensitivity analysis was conducted to investigate the impact of different definitions of fever resolution as part of a composite endpoint, which included: resolution of any baseline invasive fungal infection, no breakthrough invasive fungal infection, survival, no premature discontinuation of study drug, and fever resolution for 48 hours during the period of neutropenia. Response rates were higher when less stringent definitions of fever resolution were used, especially in low-risk patients. The modified definitions of fever resolution were: no fever for 24 hours before the resolution of neutropenia; no fever at the 7-day post-therapy follow-up visit; and removal of fever resolution from the composite endpoint entirely. This implies that the efficacy of both medications depends somewhat on the definition of the outcomes [45].
- In a phase II trial comparing minocycline and creatine to placebo for Parkinson's disease, a sensitivity analysis was conducted based on another definition (threshold) of futility. In the primary analysis, a predetermined futility threshold was set at a 30% reduction in mean change in Unified Parkinson's Disease Rating Scale (UPDRS) score, derived from historical control data. If minocycline or creatine did not bring about at least a 30% reduction in UPDRS score, it would be considered futile and no further testing would be conducted. Based on data from the current control (placebo) group, a new, more stringent threshold of 32.4% was used for the sensitivity analysis. The findings of the primary analysis and the sensitivity analysis both confirmed that neither creatine nor minocycline could be rejected as futile, and that both should be tested in phase III trials [46]. A reader would be more confident in these robust findings.
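The dependence of a binary response rate on the dichotomization threshold can be sketched in a few lines (scores and cut-offs hypothetical):

```python
# Hypothetical continuous scores dichotomized at different cut-offs.
scores = [12, 18, 22, 25, 31, 35, 40, 44, 52, 60]

def response_rate(xs, cutoff):
    """Proportion classified as 'responders' at a given cut-off."""
    return sum(x >= cutoff for x in xs) / len(xs)

# Less stringent cut-offs yield higher apparent response rates.
rates = {c: response_rate(scores, c) for c in (20, 30, 40)}
```

Rerunning the primary analysis over a small grid of plausible cut-offs, as in `rates`, shows directly how much of a "treatment effect" is an artifact of the chosen definition.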
Impact of different methods of analysis to account for clustering or correlation

- Ma et al. performed sensitivity analyses of different methods of analysing cluster RCTs [48]. They compared three cluster-level methods (un-weighted linear regression, weighted linear regression, and random-effects meta-regression) with six individual-level methods (standard logistic regression, the robust standard errors approach, GEE, the random-effects meta-analytic approach, random-effects logistic regression, and Bayesian random-effects regression). Using data from the CHAT trial, all nine methods provided similar results, reinforcing the hypothesis that the CHAT intervention was not superior to usual care.
- Peters et al. conducted sensitivity analyses to compare different methods—three cluster-level (un-weighted regression of practice log odds, regression of log odds weighted by their inverse variance, and random-effects meta-regression of log odds with cluster as a random effect) and five individual-level (standard logistic regression ignoring clustering, robust standard errors, GEE, random-effects logistic regression, and Bayesian random-effects logistic regression)—for analyzing cluster randomized trials, using an example involving a factorial design [13]. They demonstrated that the methods used in the analysis of cluster randomized trials could give varying results, with standard logistic regression ignoring clustering being the least conservative.
- Cheng et al. used sensitivity analyses to compare different methods (six models for clustered binary outcomes and three models for clustered nominal outcomes) of analysing correlated data in discrete choice surveys [49]. The results were robust across the various statistical models, but showed more variability in the presence of a larger cluster effect (higher within-patient correlation).
- A trial evaluated the effects of lansoprazole on gastro-esophageal reflux disease in children with asthma from 19 clinics. The primary analysis used GEE to determine the effect of lansoprazole in reducing asthma symptoms. The authors then performed a sensitivity analysis that included the study site as a covariate. Their finding that lansoprazole did not significantly improve symptoms was robust to this sensitivity analysis [50].
- In addition to comparing the performance of different methods of estimating treatment effects on a continuous outcome in simulated multicenter randomized controlled trials [12], the authors used data from the Computerization of Medical Practices for the Enhancement of Therapeutic Effectiveness (COMPETE) II study [51] to assess the robustness of the primary results (based on GEE to adjust for clustering by provider of care) under different methods of adjusting for clustering. The results, which showed that a shared electronic decision support system improved care and outcomes in diabetic patients, were robust under the different methods of analysis.
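A toy sketch of why ignoring clustering is the least conservative choice: with hypothetical outcomes from three clinics, the standard error computed on the pooled individual values is far smaller than the one from a simple cluster-level (cluster-means) analysis. This is a deliberately crude stand-in for the GEE and random-effects methods cited above:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical outcomes grouped by cluster (e.g. clinic).
clusters = {
    "A": [5.1, 5.3, 5.2, 5.4],
    "B": [7.8, 8.0, 7.9, 8.1],
    "C": [6.4, 6.6, 6.5, 6.7],
}

# Individual-level analysis ignoring clustering.
pooled = [x for xs in clusters.values() for x in xs]
naive_se = stdev(pooled) / sqrt(len(pooled))

# Cluster-level analysis: one summary value per cluster.
summaries = [mean(xs) for xs in clusters.values()]
cluster_se = stdev(summaries) / sqrt(len(summaries))

# With strong within-cluster correlation, the naive analysis
# understates the standard error (naive_se < cluster_se here).
```

Comparing the primary clustered analysis against the naive one, as the table above suggests, reveals how much the conclusions lean on the clustering assumption.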
Impact of competing risks in analysis of trials with composite outcomes

- A previously reported trial compared low molecular weight heparin (LMWH) with oral anticoagulant therapy for the prevention of recurrent venous thromboembolism (VTE) in patients with advanced cancer. A subsequent study presented sensitivity analyses comparing the results of standard survival analysis (Kaplan-Meier method) with those of competing-risk methods, namely the cumulative incidence function (CIF) and Gray's test [52]. The results of both methods were similar, which strengthened confidence in the conclusion that LMWH reduced the risk of recurrent VTE.
- For patients at increased risk of end-stage renal disease (ESRD) but also of premature death not related to ESRD, such as patients with diabetes or vascular disease, analyses considering the two events as different outcomes may be misleading if the possibility of dying before the development of ESRD is not taken into account [49]. Several studies performing sensitivity analyses demonstrated that the results on predictors of ESRD and death from any cause depended on whether the competing risks were taken into account [53, 54], and on which competing-risk method was used [55]. These studies further highlight the need for sensitivity analyses of competing risks when they are present in trials.
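The contrast between the naive Kaplan-Meier approach (treating competing events as censoring) and the cumulative incidence function can be computed directly on a small hypothetical dataset. Event codes are assumptions for this sketch: 1 = event of interest, 2 = competing event, 0 = censored:

```python
# Hypothetical (time, event) pairs for eight subjects.
data = [(1, 1), (2, 2), (2, 1), (3, 2), (4, 1), (5, 0), (5, 1), (6, 0)]

def naive_km_risk(data, horizon):
    """1 - Kaplan-Meier estimate, treating competing events as censoring."""
    surv = 1.0
    for t in sorted({t for t, _ in data}):
        if t > horizon:
            break
        n = sum(1 for u, _ in data if u >= t)            # at risk at t
        d = sum(1 for u, e in data if u == t and e == 1)  # events of interest
        surv *= 1 - d / n
    return 1 - surv

def cif_risk(data, horizon):
    """Cumulative incidence of event 1, accounting for competing risks."""
    surv, cif = 1.0, 0.0  # all-cause survival, cumulative incidence
    for t in sorted({t for t, _ in data}):
        if t > horizon:
            break
        n = sum(1 for u, _ in data if u >= t)
        d1 = sum(1 for u, e in data if u == t and e == 1)
        dall = sum(1 for u, e in data if u == t and e != 0)
        cif += surv * d1 / n     # mass of event 1 at time t
        surv *= 1 - dall / n     # all-cause survival update
    return cif

naive = naive_km_risk(data, 6)
cif = cif_risk(data, 6)
```

On this toy dataset the naive estimate exceeds the CIF, illustrating the well-known upward bias of 1-minus-KM when competing events are present; a sensitivity analysis reporting both quantifies that gap for the trial at hand.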
Impact of baseline imbalance in RCTs

- A paper presented a simulation study in which the risk of the outcome, the effect of the treatment, the power and prevalence of the prognostic factors, and the sample size were all varied to evaluate their effects on the treatment estimates. Logistic regression models were compared with and without adjustment for the prognostic factors. The study concluded that the probability of prognostic imbalance in small trials could be substantial, and that covariate adjustment improved estimation accuracy and statistical power [56].
- In a trial testing the effectiveness of enhanced communication therapy for aphasia and dysarthria after stroke, the authors conducted a sensitivity analysis to adjust for baseline imbalances. Both the primary and the sensitivity analysis showed that enhanced communication therapy had no additional benefit [57].
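A hypothetical sketch of adjusting for baseline imbalance via stratification (a simple stand-in for the multivariable regression or propensity score methods mentioned above). When a strong prognostic factor is unevenly distributed across arms, the unadjusted and stratified estimates can even disagree in sign:

```python
from statistics import mean

# Hypothetical records: (arm, prognostic_factor_present, outcome_score).
# The factor raises outcomes and is over-represented in the "tx" arm.
rows = [
    ("tx", True, 8.0), ("tx", True, 9.0), ("tx", False, 4.0),
    ("ct", True, 9.5), ("ct", False, 4.5), ("ct", False, 5.0),
]

def diff(subset):
    """Mean outcome difference: tx minus ct."""
    tx = [y for a, _, y in subset if a == "tx"]
    ct = [y for a, _, y in subset if a == "ct"]
    return mean(tx) - mean(ct)

unadjusted = diff(rows)

# Stratified analysis: estimate the effect within each level of the
# prognostic factor, then average the stratum-specific effects.
strata = [diff([r for r in rows if r[1] == level]) for level in (True, False)]
adjusted = mean(strata)
```

Here `unadjusted` is positive while `adjusted` is negative, which is exactly the kind of reversal a with-and-without-adjustment sensitivity analysis is designed to surface.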
Impact of distributional assumptions

- Ma et al. performed sensitivity analyses based on Bayesian and classical methods for analysing cluster RCTs with a binary outcome in the CHAT trial. The similarity of the results across the different methods confirmed the results of the primary analysis: the CHAT intervention was not superior to usual care [10].
- A negative binomial regression model was used [52] to analyze discrete outcome data from a clinical trial designed to evaluate the effectiveness of a prehabilitation program in preventing functional decline among physically frail, community-living older persons. The negative binomial model provided a better fit to the data than the Poisson regression model, and offers an alternative approach for analyzing discrete data where over-dispersion is a problem [59].
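A quick over-dispersion check that motivates this kind of sensitivity analysis (counts hypothetical): under a Poisson model the variance should roughly equal the mean, so a variance-to-mean ratio well above 1 is a cue to also fit a negative binomial model:

```python
from statistics import mean, variance

# Hypothetical counts of adverse events per participant.
counts = [0, 0, 1, 0, 2, 7, 0, 1, 0, 9, 0, 3]

m = mean(counts)
v = variance(counts)
dispersion = v / m  # ~1 under a Poisson model; >1 suggests over-dispersion
```

A large `dispersion` does not by itself pick the model, but it flags that the Poisson and negative binomial fits should both be reported and compared, as in the prehabilitation example above.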
Commonly asked questions about sensitivity analyses

- Q: Do I need to adjust the overall level of significance for performing sensitivity analyses?
- Q: Do I have to report all the results of the sensitivity analyses?
- Q: Can I perform sensitivity analyses post hoc?
- Q: How do I choose between the results of different sensitivity analyses? (i.e. which results are the best?)
- Q: When should one perform sensitivity analysis?
- Q: How many sensitivity analyses can one perform for a single primary analysis?
- Q: How many factors can I vary in performing sensitivity analyses?
- Q: What is the difference between secondary analyses and sensitivity analyses?
- Q: What is the difference between subgroup analyses and sensitivity analyses?
Conclusion

Reporting of sensitivity analyses

- In the Methods section: report the planned or post hoc sensitivity analyses and the rationale for each.
- In the Results section: report whether the results or conclusions of the sensitivity analyses are similar to those based on the primary analysis. If similar, simply state that the results or conclusions remain robust. If different, report the results of the sensitivity analyses along with the primary results.
- In the Discussion section: discuss the key limitations and the implications of the sensitivity analysis results for the conclusions or findings. This can be done by describing what changes the sensitivity analyses bring to the interpretation of the data, and whether each sensitivity analysis is more stringent or more relaxed than the primary analysis.