Traditionally in acute stroke clinical trials, the primary clinical outcome employed is a dichotomized modified Rankin Scale (mRS). New statistical methods, such as responder analysis, are being used in stroke studies to address the concern that baseline prognostic variables, such as stroke severity, impact the likelihood of a successful outcome. Responder analysis allows the definition of success to vary according to baseline prognostic variables, producing a more clinically relevant insight into the actual effect of investigational treatments. It is unclear whether statistical analyses should adjust for prognostic variables when responder analysis is used, as the outcome already takes these prognostic variables into account. This research aims to investigate the effect of covariate adjustment in the responder analysis framework in order to determine the appropriate analytic method.
Methods
Using a current stroke clinical trial and its pilot studies to guide simulation parameters, 1,000 clinical trials were simulated at varying sample sizes under several treatment effects to assess power and type I error. Covariate-adjusted and unadjusted logistic regressions were used to estimate the treatment effect under each scenario. In the case of covariate-adjusted logistic regression, the trichotomized National Institutes of Health Stroke Scale (NIHSS) was used in adjustment.
Results
Under various treatment effect settings, the operating characteristics of the unadjusted and adjusted analyses do not substantially differ. Power and type I error are preserved for both the unadjusted and adjusted analyses.
Conclusions
Our results suggest that, under the given treatment effect scenarios, the decision whether to adjust for baseline severity when using a responder analysis outcome should be guided by the needs of the study, as type I error rates and power do not appear to vary substantially between the methods. These findings are applicable to stroke trials which use the mRS for the primary outcome, but also provide a broader insight into the analysis of binary outcomes that are defined based on baseline prognostic variables.
Trial registration
This research is part of the Stroke Hyperglycemia Insulin Network Effort (SHINE) trial, Identification Number NCT01369069.
The online version of this article (doi:10.1186/1745-6215-14-98) contains supplementary material, which is available to authorized users.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
KG carried out the simulation programming and interpretation, manuscript drafting and finalization. SY helped conceive the study concept, assisted in simulation programming and interpretation, and aided in manuscript drafting and finalization. VR helped conceive the study concept, aided in statistical interpretation, as well as manuscript drafting and finalization. KJ and EJ assisted with study concept, design, clinical interpretation, manuscript drafting and finalization. VD helped conceive the study concept, aided in design, analysis and interpretation, as well as manuscript drafting and finalization. VD provided overall supervision as the primary mentor of the first author. All authors read and approved the final manuscript.
Abbreviations
GRASP
Glucose Regulation in Acute Stroke Patients
mRS
modified Rankin Scale
MSE
mean square error
NIHSS
National Institutes of Health Stroke Scale
NINDS tPA
National Institute of Neurological Disorders and Stroke tissue Plasminogen Activator
SHINE trial
Stroke Hyperglycemia Insulin Network Effort trial
THIS
Treatment of Hyperglycemia in Ischemic Stroke
Background
Stroke is a potentially debilitating medical event that affects approximately 800,000 people in the United States each year, leaving as many as 30% of survivors permanently disabled [1]. Given this impact, there is great demand for treatments that significantly improve functional outcome following a stroke. To date, few clinical trials for the treatment of acute stroke have succeeded; of over 125 acute stroke clinical trials, only three successful treatment methods have been identified [2, 3].
One of the possible reasons for the excessive number of neutral or unsuccessful stroke trials is the definition of successful outcome utilized in the studies [4]. In clinical trials, stroke outcome is most commonly measured by the modified Rankin Scale (mRS) of global disability at 90 days. The mRS is a valid and reliable measure of functional outcome following a stroke [5]. Past trials have dichotomized mRS scores into “success” and “failure”: scores of 0 to 1 (or 0 to 2) were considered to be “successes” while scores greater than 1 (or 2) were considered to be “failures,” regardless of baseline stroke severity [6-9]. This method fails to take into account the understanding that baseline severity is highly correlated with outcome. New methods, such as the global statistic, shift analysis, permutation testing and responder analysis, are evolving to make better use of the outcome data with the hopes of providing higher sensitivity to detect true treatment effects [2, 4, 6, 9-17].
Responder analysis, also known as the sliding dichotomy, dichotomizes ordinal outcomes into “success” and “failure,” but addresses the drawbacks of traditional dichotomization by allowing the definition of success to vary by baseline prognostic variables. Various trials have implemented responder analysis where baseline severity is defined by one or more baseline prognostic factors [18-20]. Study subjects in a less severe prognosis group at baseline must achieve a better outcome to be considered a trial “success,” whereas a less stringent criterion for success is applied to subjects in a more severe baseline prognosis category. The currently enrolling Stroke Hyperglycemia Insulin Network Effort (SHINE) trial employs responder analysis for its primary efficacy outcome [18].
The SHINE trial is a large, multicenter, randomized clinical trial designed to determine the efficacy and safety of targeted glucose control in hyperglycemic acute ischemic stroke patients. While the methodological details of the SHINE trial are discussed elsewhere [18], it should be noted that the primary outcome for efficacy is the baseline severity adjusted 90-day mRS score dichotomized as “success” or “failure” according to a sliding dichotomy. Eligibility criteria for SHINE require that a subject’s baseline NIHSS score be between 3 and 22, inclusive. Those with a “mild” prognosis, defined by a baseline NIHSS score of 3 to 7, must achieve a 90-day mRS of 0 to be classified as a “success.” Those with a “moderate” prognosis, defined by a baseline NIHSS score of 8 to 14, must achieve a 90-day mRS of 0 to 1 to be classified as a “success.” Finally, those subjects with a “severe” prognosis, defined by a baseline NIHSS score of 15 to 22, must achieve a 90-day mRS of 0 to 2 to be classified as a “success.” By using responder analysis with a trichotomized NIHSS, the threshold for success is stringent for the milder strokes, while the moderate to severe strokes are allowed more residual deficits in the threshold for success.
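The sliding dichotomy described above can be expressed as a small classifier. The following Python sketch (the trial's own analyses were run in SAS; the function names here are illustrative, not from the trial code) encodes the three prognosis groups and their success thresholds:

```python
def prognosis_group(nihss):
    """Trichotomize baseline NIHSS per the SHINE eligibility range (3 to 22)."""
    if not 3 <= nihss <= 22:
        raise ValueError("NIHSS outside SHINE eligibility range (3-22)")
    if nihss <= 7:
        return "mild"
    if nihss <= 14:
        return "moderate"
    return "severe"

def is_success(nihss, mrs90):
    """Apply the sliding dichotomy: the milder the stroke at baseline,
    the stricter the 90-day mRS threshold for a 'success'."""
    threshold = {"mild": 0, "moderate": 1, "severe": 2}[prognosis_group(nihss)]
    return mrs90 <= threshold
```

For example, a subject with a baseline NIHSS of 5 is a success only with a 90-day mRS of 0, whereas a subject with a baseline NIHSS of 20 is a success with a 90-day mRS of 0, 1 or 2.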
One of the questions that arose from the trial’s Data and Safety Monitoring Board was that of covariate adjustment. Statistical analyses often adjust for prognostic factors, or covariates, that may be predictive of the primary outcome, such as baseline severity [21, 22]; however, in the case of SHINE, this prognostic variable is also used to define the outcome. While the literature provides many resources on the design and implementation of responder analysis, as well as examples of trials which used responder analysis, there are no clear resources discussing whether statistical analyses should be adjusted for the prognostic variables used to define successful outcome.
This research aims to investigate the effect of covariate adjustment in the responder analysis framework, particularly when the covariate is involved in the definition of successful outcome. The cut-points for the SHINE trial are clinically, rather than statistically, defined, and so it is conceivable that adjustment for baseline severity in the statistical analysis may account for additional variation and increase the power to detect a true treatment effect. A simulation study is conducted to assess the operating characteristics (power and type I error) of categorically-adjusted and unadjusted analyses under several possible treatment effect scenarios. In addition, treatment effect estimates and their standard errors are examined across the various scenarios. Since the primary outcome for the SHINE trial is binary, we expect to see an increase in standard error on the treatment effect estimates, consistent with the findings of Robinson and Jewell [23]. However, also consistent with Robinson and Jewell, we expect this increase in standard error to be balanced by a movement of the treatment effect estimate away from the null hypothesis.
By examining the effect of covariate adjustment in responder analysis, we aim to define the most appropriate statistical approach to identify true treatment effects. Our findings are not only applicable to the SHINE and other stroke trials which use the mRS for the primary outcome, but also provide insight into the appropriate use of categorical baseline prognostic variables in other trials which use an ordinal scale as a primary outcome measure.
Methods
Simulation studies were performed to examine the performance of logistic regression models that were unadjusted and adjusted by a trichotomized baseline severity category. Baseline severity category and criteria for successful outcome were defined as in the SHINE trial described above, and are summarized in Table 1. The type I error rate and power were calculated and compared for each method, as were the treatment effect estimates and their standard errors.
Table 1 Sliding dichotomy criteria for successful outcome in the SHINE trial

Baseline NIHSS   Prognosis group   90-day mRS for successful outcome
3 to 7           Mild              0
8 to 14          Moderate          0, 1
15 to 22         Severe            0, 1, 2
The simulation parameters were guided by the SHINE trial design. A total of 1,000 clinical trials were simulated at sample sizes ranging from 498 to 1,958. This sample size range allowed us to cover the planned SHINE sample size of 1,400 while also examining model behavior at smaller and larger sample sizes. A 1:1 randomization scheme was assumed for the purposes of this investigation. All analyses were performed using SAS version 9.2 (SAS Institute, Cary, NC, USA).
The prevalence of each baseline severity category was guided by data from two prior pilot trials of hyperglycemia management in acute stroke, the Glucose Regulation in Acute Stroke Patients (GRASP) [24] and Treatment of Hyperglycemia in Ischemic Stroke (THIS) [25] pilot trials. In the simulations, 42% of subjects were classified as “mild” at baseline, 32% as “moderate”, and the remaining 26% as “severe”. This distribution of prognosis categories was imposed using a uniform (0, 1) random variable. In order to simulate 90-day mRS scores for the control group, we examined the distribution of 90-day mRS scores for the control groups in the GRASP and THIS pilot trials. Though the simulation of 90-day mRS scores was primarily driven by the results of the GRASP and THIS pilot trials, the National Institute of Neurological Disorders and Stroke tissue Plasminogen Activator (NINDS tPA) trial control data [26] were used to aid in the approximation of mRS outcome distributions within each of the baseline severity strata. The NINDS tPA control data helped smooth the distribution of mRS scores, as the GRASP and THIS pilot trials each had small sample sizes that resulted in several empty cells after baseline severity stratification. The exact control group distribution of 90-day mRS scores used in the simulation study is shown in Additional file 1: Table S1.
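The uniform (0, 1) mechanism for imposing the prognosis-category distribution can be sketched in a few lines. This Python fragment (illustrative only; the actual simulations were run in SAS) uses cut-points at 0.42 and 0.74, derived from the stated prevalences of 42%, 32% and 26%:

```python
import random

def draw_severity(u):
    """Map a uniform(0, 1) draw to a baseline severity category using the
    pilot-trial prevalences: 42% mild, 32% moderate, 26% severe."""
    if u < 0.42:
        return "mild"
    if u < 0.74:  # 0.42 + 0.32
        return "moderate"
    return "severe"

# Check the imposed distribution on a large simulated cohort.
rng = random.Random(2013)
counts = {"mild": 0, "moderate": 0, "severe": 0}
for _ in range(10_000):
    counts[draw_severity(rng.random())] += 1
```

Over many draws, the observed category proportions converge to the target 42/32/26 split.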
Type I error rates for each method of analysis were obtained by using the same proportion of success for both the control and intervention groups, simulating the null hypothesis of “no treatment effect”. In order to assess the power of each method, a treatment effect was simulated in the data by altering the success prevalence for the intervention group. A 7% treatment effect was used, as this was the minimal clinically relevant absolute difference in favorable outcome between the two treatment groups in the SHINE study plan. For these analyses, power was examined under several scenarios as illustrated in Table 2: (1) a “flat” scenario, in which the 7% treatment effect was held constant over the three baseline severity strata; (2) a “varying” scenario, in which the overall treatment effect is still 7%, but the magnitude within strata is varied, where the mild and moderate groups see the most benefit; (3) another “varying” scenario, in which the severe group sees the most benefit; (4) a “mild harm” scenario, where the mild group sees a harmful treatment effect; and (5) a “severe harm” scenario, in which the severe group sees a harmful treatment effect.
Table 2 Success prevalence for simulated treatment effect scenarios

Baseline severity   No treatment effect   Flat   Varying 1   Varying 2   Mild harm   Severe harm
Mild                25%                   32%    33.6%       27%         23%         33%
Moderate            35%                   42%    44%         44%         50%         48%
Severe              15%                   22%    17%         27.6%       26.7%       13%
In the first varying scenario, we applied an 8.6% treatment effect in the mild category, a 9% treatment effect in the moderate category and a 2% treatment effect in the severe category; that is, there was an 8.6% increase in the prevalence of an mRS of 0 for the mild stratum, a 9% increase in the prevalence of the 0 to 1 range of mRS scores for the moderate stratum, and a 2% increase in the prevalence of the 0 to 2 range of mRS scores for the severe stratum. This scenario is relevant to the SHINE trial; it is similar to what we may observe if the investigational treatment is largely beneficial to mild and moderate stroke victims, but only marginally beneficial to victims of severe stroke. The second varying treatment effect scenario applies an opposite effect, in which the intensive glucose control intervention is largely beneficial to more severe strokes, but only slightly beneficial to those subjects having mild strokes. Additional file 1: Table S1 shows the exact distribution of 90-day mRS scores for the treatment groups under each of these treatment effect scenarios. These distributions were used to randomly assign 90-day mRS scores to each simulated subject in each simulated trial, with the proportions of success following the scenarios in Table 2. Given a subject’s simulated baseline severity stratum (mild, moderate or severe), an assignment of “success” or “failure” was made according to the sliding dichotomy definitions.
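A minimal sketch of one simulated trial arm under the “flat” scenario of Table 2 follows; this is a Python illustration of the data-generating mechanism (the paper's simulations used SAS), with an illustrative function name:

```python
import random

# Success probabilities from Table 2: control versus the "flat" 7% effect.
P_SUCCESS = {
    "control":   {"mild": 0.25, "moderate": 0.35, "severe": 0.15},
    "treatment": {"mild": 0.32, "moderate": 0.42, "severe": 0.22},
}

def simulate_arm(arm, n, rng):
    """Simulate n subjects in one arm; return the number of
    sliding-dichotomy successes."""
    successes = 0
    for _ in range(n):
        u = rng.random()  # baseline severity: 42% mild, 32% moderate, 26% severe
        severity = "mild" if u < 0.42 else ("moderate" if u < 0.74 else "severe")
        if rng.random() < P_SUCCESS[arm][severity]:
            successes += 1
    return successes
```

Averaged over the severity distribution, the expected success rate is about 25.6% in the control arm and 32.6% in the treatment arm, reproducing the 7% absolute difference specified in the SHINE study plan.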
Logistic regression was used to investigate each of these scenarios. The unadjusted case models “success” as a function only of treatment group, while the categorically-adjusted case models “success” as a function of treatment group and severity category. Severity was defined as “mild,” “moderate” or “severe” based on the NIHSS prognosis groups discussed in the introduction. Power and type I error rate were based on the proportion of simulated trials at a given sample size which rejected the null hypothesis at a nominal level of 0.05. The treatment effect and its standard error were estimated for each trial.
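The unadjusted fit can be sketched with a small Newton-Raphson logistic regression. The pure-Python stand-in below is illustrative only (the paper's models were fit in SAS); with a single binary treatment indicator, the fitted coefficient is exactly the log odds ratio between arms:

```python
import math

def fit_logistic(X, y, iters=25):
    """Tiny Newton-Raphson logistic regression. X: list of feature rows
    (first column is the intercept), y: 0/1 outcomes. Returns coefficients."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        # Gradient and Hessian (Fisher information) of the log-likelihood.
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            eta = sum(b * x for b, x in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += w * xi[j] * xi[k]
        # Solve hess * delta = grad by Gauss-Jordan elimination with pivoting.
        a = [row[:] + [g] for row, g in zip(hess, grad)]
        for col in range(p):
            piv = max(range(col, p), key=lambda r: abs(a[r][col]))
            a[col], a[piv] = a[piv], a[col]
            for r in range(p):
                if r != col:
                    f = a[r][col] / a[col][col]
                    a[r] = [v - f * w2 for v, w2 in zip(a[r], a[col])]
        delta = [a[j][-1] / a[j][j] for j in range(p)]
        beta = [b + d for b, d in zip(beta, delta)]
    return beta
```

Appending severity indicator columns (for example, dummies for “moderate” and “severe”) to each row of X yields the categorically-adjusted model described above.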
Results
The type I error rate at each sample size for each analysis method is plotted in Figure 1. The nominal 5% reference line is shown, along with the upper and lower 95% confidence limits on this nominal level of significance. The confidence limits were calculated using the formula for binomial proportion 95% confidence intervals. The confidence limits remain the same at each sample size, as they are based on the number of trials at each sample size (1,000) rather than the sample size itself. The type I error rates for both the unadjusted and categorically-adjusted methods are within the 95% confidence limits for all the sample sizes, hovering close to the nominal 5% level.
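These confidence limits follow directly from the normal approximation to the binomial proportion; a quick sketch of the computation:

```python
import math

# 95% limits around the nominal 0.05 type I error rate, based on the
# 1,000 simulated trials run at each sample size.
p, n_trials = 0.05, 1000
half_width = 1.96 * math.sqrt(p * (1 - p) / n_trials)
lower, upper = p - half_width, p + half_width
# half_width is about 0.0135, giving limits near (0.036, 0.064)
```

Because n_trials is fixed at 1,000 for every sample size, the limits are constant across Figure 1.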
The first investigation of power was under a “flat” treatment effect of 7%, where the success rates in the control group were 25%, 35% and 15% in the mild, moderate and severe prognosis groups, respectively. The power estimates for this “flat” treatment effect scenario are plotted in Figure 2. The unadjusted and categorically-adjusted methods do not significantly differ, with the categorically-adjusted method having slightly greater power for most of the sample sizes. As planned by the SHINE study investigators, the 80% power threshold is crossed between 650 and 700 subjects per arm (1,300 to 1,400 subjects total).
The next two scenarios varied the treatment effects across the mild, moderate and severe baseline categories as 8.6%, 9% and 2%, respectively, and as 2%, 9% and 12.6%, respectively. The power results for these two scenarios are shown in Figure 3. As in the flat treatment effect scenario, there is no drastic difference between the unadjusted and categorically-adjusted methods with respect to power in these varying treatment effect scenarios.
As previously mentioned, it is conceivable that one of the prognosis groups may experience a slightly harmful treatment effect. When 2% harm is experienced in either the mild or the severe baseline prognosis category, the unadjusted and adjusted analyses still appear to have similar performance, as shown in Figure 4. In the mild harm scenario, the unadjusted and adjusted power curves are still nearly stacked upon one another, with the power curve for the adjusted analysis pulling slightly above that of the unadjusted analysis at a few points. A more noticeable difference can be seen in the severe harm scenario, where the adjusted analysis consistently has a slightly, though not dramatically, higher power than the unadjusted analysis.
In addition to the plots in Figures 2, 3 and 4, we also examined the treatment coefficient estimates and their standard errors for the adjusted and unadjusted models under the various treatment effect scenarios at selected sample sizes. The sample sizes of 498, 722, 946, 1,170 and 1,394 were chosen because they are the closest sample sizes to those at which the planned interim and final analyses will take place for SHINE. In addition to model estimates, the true treatment effect coefficient was calculated by pooling the nominal log-odds ratios for each prognosis group. To visualize the bias of each treatment effect parameter estimate and its standard error, the simulation mean squared error (MSE) was plotted against the squared bias in Figure 5. The MSE quantifies the accuracy and precision of an estimate in terms of both the bias (the difference between the true and estimated treatment effect) and the variance of the estimate. By plotting the MSE against the squared bias, we can illustrate the adequacy of the estimator. In Figure 5, the squared bias is depicted on the x-axis and the MSE on the y-axis. While the bias decreases with increasing sample size, the adjusted estimates of the treatment effect parameter are consistently less biased than the unadjusted estimates. For smaller sample sizes, the MSEs for the adjusted analyses are negligibly larger than those for the unadjusted analyses. The treatment coefficients and standard errors are provided in Additional file 2: Table S2.
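The MSE in such a simulation summary decomposes into squared bias plus variance; a sketch of how summaries of this kind could be computed over a set of simulated estimates (the function name is illustrative):

```python
def mse_summary(estimates, true_value):
    """Summarize simulated treatment-effect estimates: squared bias,
    variance, and their sum, the mean squared error."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias_sq = (mean - true_value) ** 2
    variance = sum((e - mean) ** 2 for e in estimates) / n
    return {"bias_sq": bias_sq, "variance": variance, "mse": bias_sq + variance}
```

The "mse" entry equals the average of (estimate - true_value)**2 across the simulated trials, which is the quantity plotted against the squared bias.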
Discussion
Successful stroke treatments are desperately needed given stroke’s large and detrimental effect on the worldwide population. Consequently, statistical methods that offer high power to detect a true treatment effect are also needed. With this simulation study, we sought to determine whether adjustment for baseline severity within the responder analysis setting would be beneficial or harmful in terms of power and type I error rates when compared to an unadjusted analysis.
The type I error rates did not differ substantially between the two methods. The empirical type I error rates for both methods stayed within the 95% confidence bounds. This is a welcome result, as a test that is either too liberal or too conservative (that is, one that rejects the null hypothesis more or less often than the nominal level, respectively) has implications for the power of the test. The oscillation around the nominal 5% level of significance is likely due to chance, and is to be expected in simulated data. Since neither method shows consistently larger type I error rates than the other, we can conclude that there is no meaningful difference between the two methods with respect to type I error.
The power appears to be approximately the same or slightly higher for the adjusted analyses in the selected scenarios. In the cases where the power is slightly higher, the magnitude is not remarkable and offers little evidence to suggest that adjusting by the single covariate leads to significantly more power. Although the simulation study presented is not exhaustive and, therefore, does not provide additional insight regarding this, the work of Choi and Hernández suggests that an increase in power could occur as other important prognostic variables are added to the model [27, 28]. It is reassuring, however, that neither method appears to be detrimental to power under the given scenarios.
In terms of bias, the unadjusted analyses consistently underestimate the nominal treatment effect, while the adjusted analyses tend to be less biased, but often slightly overestimate the nominal treatment effect. Given the magnitude of the coefficient estimates and their standard errors, neither of these bias tendencies is substantial. In terms of MSE, the two methods do not differ greatly as the sample size increases. At the smaller sample sizes, the adjusted analyses have larger MSE values due to increased standard error; however, as the sample size increases, the MSE values for the two methods converge.
Though negligible differences were identified between the adjusted and unadjusted models, researchers should keep the randomization scheme of the study in mind when deciding whether or not to adjust for baseline severity. In general, it is advisable to “analyze as you randomize,” meaning that any variable used as a stratification variable during randomization should be included as a covariate in the analysis in order to preserve nominal type I error rates and power [22, 29]. Baseline severity is often used as a stratification variable in the randomization of acute stroke clinical trials, and should be included as a covariate in these cases.
It is important to note that these analyses adjust categorically for baseline severity. The categories - mild, moderate and severe - are defined by the NIHSS score, which is a larger scale ranging from 0 to 42 (limited to 3 through 22 by SHINE’s inclusion criteria). A one-unit change in the NIHSS cannot easily be interpreted, as this change may have different meanings depending on the combination of neurological impairments and location along the scale. Despite this issue, the NIHSS is sometimes used as a continuous measure in the literature [30, 31]. This is not necessarily straightforward and should be done with caution. It is possible that adjusting by the actual NIHSS score would provide additional information to the model and increase or maintain power in some treatment effect scenarios. However, due to uncertainties in the clinical interpretation of a continuous NIHSS variable, adjustment by the actual NIHSS score has been left as a topic for future research.
Adjustment for other baseline prognostic variables may also impact study power under the given scenarios. The inclusion of additional covariates that were not used in defining the primary outcome has not been examined in these scenarios, as it is outside the primary focus of this research. It is conceivable that the addition of multiple covariates could reduce overall power due to the increasing standard error on the treatment effect estimate, as studied by Robinson and Jewell [23] and discussed in the Background section of this paper.
Conclusions
Our results show negligible differences between analysis methods in the responder analysis setting, suggesting that in most treatment effect scenarios, adjustment for baseline severity in the primary analyses may best be guided by individual study needs rather than a blanket guideline for all studies. Though we have not shown the results here, we did examine other treatment effect scenarios, which yielded similar results. These scenarios included a flat and a varying 15% treatment effect (instead of the 7% specified in the SHINE study plan), as well as a scenario in which the mild group experienced 5% harm.
Overall, these results shed light on the important concept of adjustment in the context of responder analysis. Though this study only examined a single severity scale, its findings are not restricted to stroke studies; they can provide insight into the treatment of categorical baseline prognostic covariates in other studies which use responder analysis to define their primary outcome of interest.
Authors’ information
KG recently completed her master’s degree in biostatistics at the Medical University of South Carolina. This manuscript is a result of her master’s thesis and her work as a graduate student on the SHINE grant. The other authors were her committee members and VD was her primary mentor.
Acknowledgements
This work was funded by both the SHINE trial NIH/NINDS grant 1U01-NS069498 and the Neurological Emergencies Treatment Trial (NETT) Statistical and Data Management Center NIH/NINDS grant U01-NS059041.
Open Access
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
https://creativecommons.org/licenses/by/2.0
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.