Background
Many randomized controlled trials (RCTs) involve a single post-treatment measurement of a continuous outcome variable previously measured at baseline. Although randomization creates asymptotic balance in important prognostic factors, including baseline values of the outcome variable [1], in finite samples an imbalance in such factors may occur notwithstanding randomization [2-6]; this represents the difference between the expectation of a random process and its realization [6]. Depending crucially on the correlation between the baseline covariate and the outcome variable, this chance imbalance may not only create a potential bias in crude estimates of treatment effect in the outcome variable, but may also affect the precision with which such an effect is measured and the statistical power of the analysis. Attempts are made to address this problem either at the level of design (e.g. stratification and minimization) or at the level of analysis, or indeed both. Although opinions are still divided on the first-line strategy to deal with baseline imbalance in RCTs [7-11], the general consensus seems to be that, whichever method is employed at the design stage to achieve balance in covariate distribution, an adjusted statistical analysis that accounts for important covariates should take precedence over an unadjusted analysis [3,8,9,12-16]. Nonetheless, there appears to be varied practice in this area and further consideration of the relative merits of adjusted and unadjusted analyses has been called for [17].
For a single post-treatment assessment of a continuous outcome variable, three statistical methods have commonly been used: crude comparison of treatment effect by t test or, equivalently, analysis of variance (ANOVA); change-score analysis (CSA); and analysis of covariance (ANCOVA). On occasions, CSA is performed using percentage change, but this has been shown to be an inefficient approach [18]. Whereas CSA compares changes between pre- and post-treatment scores between treatment groups, ANCOVA accounts for the imbalance by including baseline values in a regression model; theoretically, this regression-based procedure yields unbiased estimates of treatment effect [19,20].
Given their different statistical bases, each of these methods has a potentially marked effect on the estimate of the treatment effect and its associated precision, and differing statistical conclusions may therefore be reached according to the method of analysis chosen [21-23]. In addition, contrary views have been reported on the implications of using CSA as a method for statistical adjustment in an RCT [3,12,24,25] and this warrants further investigation, to clarify the appropriateness of particular methods.
This study therefore seeks to quantify, through an established approach based on data simulation [22,26-28], differences in the estimate (bias) and precision of treatment effect and associated statistical power through using either ANOVA or CSA in relation to the unbiased estimate of treatment effect by ANCOVA, in differing hypothetical trial scenarios. Although previous authors [19,29] have provided theoretical accounts of bias and precision in estimates of treatment effect derived through ANOVA and CSA when baseline imbalance exists, we are aware of no previous study that has sought simultaneously to quantify the bias, precision and statistical power of these three methods in relation to a wide range of combinations of different levels of experimental conditions, including baseline imbalance in the outcome variable, that are typical of pragmatic RCTs. Addressing this issue will allow practical recommendations to be made for the future analysis of RCTs in the presence of baseline imbalance.
Methods
Data simulation
A statistical program was developed in Stata to generate hypothetical two-arm trials involving specific levels of experimental conditions, run the regression models for the statistical methods being studied, and then post selected results into a file. Each hypothetical trial scenario was repeated 1000 times, so as to generate robust estimates (e.g. allowing statistical power to be estimated with a margin of error no greater than ±3% at a 95% confidence level). Detailed information on the statistical program is included in the Appendix.
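The basis of this margin-of-error figure can be checked directly (an illustrative sketch, not part of the original program): treating each of the 1000 replications as a Bernoulli trial, the 95% margin of error for an estimated power near the nominal 80% is below ±3 percentage points.

```python
import math

def power_margin_of_error(p, n_sims, z=1.96):
    """95% margin of error for a power estimate based on n_sims simulations."""
    return z * math.sqrt(p * (1 - p) / n_sims)

print(round(power_margin_of_error(0.8, 1000), 4))  # 0.0248, i.e. about +/-2.5%
```

The margin is widest for power near 50% (about ±3.1% with 1000 replications), so the stated ±3% bound applies to estimates in the neighbourhood of the 80% nominal power.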
Levels of experimental conditions
A population standard deviation of 1 (σ = 1) for the outcome data was assumed in each trial, and these data were normally distributed at baseline and at follow-up. A 1:1 allocation ratio was employed. Rather than choose arbitrary levels of other experimental conditions, these were selected in relation to specific criteria so as to reproduce conditions typical of an empirical trial scenario. Data for the outcome variable (Y_T, Y_C, for the treatment and control groups, respectively, with higher values taken to be clinically desirable) were simulated so as to produce a standardized treatment effect:

(mean of Y_T - mean of Y_C) / σ

A treatment effect was taken to be a higher (i.e. better) score in the treatment than in the control group, and was set at three levels of 0.2, 0.5 and 0.8, classified by Cohen [30] as 'small', 'medium' and 'large' respectively.
For a nominal statistical power of 80%, the required sample size was utilized for each of these standardized effect sizes: 394, 64 and 26 per group, respectively. The correlation between baseline values (Z_T, Z_C, for the treatment and control groups, respectively) and post-treatment values was varied from 0.1 to 0.9 in increments of 0.2, as it has been argued that the correlation between baseline covariates and outcome scores in RCTs may range between these values [31]. A correlation of zero was also included as a reference value.
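These per-group sample sizes are consistent with the standard two-sample formula n = 2(z_{1-α/2} + z_{1-β})²/d² (a back-of-envelope check on our part, not the authors' calculation; the normal approximation gives slightly smaller figures, and the published values presumably reflect a t-distribution correction or rounding):

```python
import math

def n_per_group(d, z_alpha=1.959964, z_beta=0.841621):
    """Normal-approximation sample size per group for a two-sample comparison
    at two-sided alpha = 0.05 and 80% power, for standardized effect size d."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print([n_per_group(d) for d in (0.2, 0.5, 0.8)])  # [393, 63, 25]
```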
For each hypothetical trial, imbalance in baseline values of the outcome measure was computed as a standardized score, i.e. the difference in baseline means expressed in terms of its standard error:

(mean of Z_T - mean of Z_C) / (σ√(1/n_T + 1/n_C)) = z

Here, z is a standard normal deviate. In this way, realistic values of imbalance were derived in relation to the sample size, thus avoiding large absolute imbalance for large sample sizes that would contradict the principles of randomization. Imbalance was simulated in both the same direction ('positive' imbalance, where the treatment group has 'better' baseline scores than the control group) and the opposite direction ('negative' imbalance, where the control group has 'better' scores) in relation to the treatment effect. The predetermined levels of imbalance for this study were calculated in relation to standard normal deviates of ±1.28, ±1.64 and ±1.96, representing 20%, 10% and 5% two-tailed probabilities respectively of the standard normal distribution.
Hence, the various levels of imbalance had a predetermined probability of occurring, whatever the sample size and on whatever scale the covariate or outcome variable is scored.
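For instance (an illustrative calculation of our own): with σ = 1 and equal groups of size n, the standard error of the baseline difference in means is √(2/n), so the same standardized imbalance of 1.96 corresponds to quite different absolute imbalances across the trial sizes used here.

```python
import math

def absolute_imbalance(z_dev, n_per_group, sigma=1.0):
    """Absolute baseline mean difference implied by a standardized imbalance."""
    return z_dev * sigma * math.sqrt(2.0 / n_per_group)

print(round(absolute_imbalance(1.96, 394), 3))  # 0.14  (effect-size 0.2 trial)
print(round(absolute_imbalance(1.96, 26), 3))   # 0.544 (effect-size 0.8 trial)
```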
In total, 126 scenarios representing hypothetical combinations of experimental conditions were simulated at 80% nominal power, comprising:
7 standardized baseline imbalances: −1.96; −1.64; −1.28; 0; 1.28; 1.64; 1.96
6 covariate-outcome (ZY) correlations: 0; 0.1; 0.3; 0.5; 0.7; 0.9
3 standardized treatment effect sizes: 0.2; 0.5; 0.8
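The full factorial of these conditions can be enumerated directly (a sketch confirming the count of scenarios):

```python
from itertools import product

imbalances = (-1.96, -1.64, -1.28, 0, 1.28, 1.64, 1.96)
correlations = (0, 0.1, 0.3, 0.5, 0.7, 0.9)
effect_sizes = (0.2, 0.5, 0.8)

# Every combination of imbalance, ZY correlation and effect size is one scenario.
scenarios = list(product(imbalances, correlations, effect_sizes))
print(len(scenarios))  # 126
```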
Each scenario was analysed by each of the statistical methods. In the analyses, a binary variable represented group allocation, such that the estimate of the treatment effect in each simulated dataset was derived from the associated regression coefficient (β).
Bias, precision and power
To quantify the bias associated with the estimates of effect by ANOVA and CSA, the following indices were computed as differences from the ANCOVA estimate:

bias(ANOVA) = β_ANOVA - β_ANCOVA
bias(CSA) = β_CSA - β_ANCOVA

Bias was not assessed in relation to the nominal standardized treatment effect, as estimates of this effect are liable to be biased in the presence of confounding. Rather, bias was determined in relation to the adjusted estimate from ANCOVA, as this is known to provide the unbiased estimate of treatment effect, conditional upon the conditions represented by a given scenario.
In order to quantify the relative precision of the three methods of analysis, ratios of the resulting standard errors (design effects) were calculated:

SE(β_ANOVA)/SE(β_ANCOVA), SE(β_CSA)/SE(β_ANCOVA) and SE(β_CSA)/SE(β_ANOVA)
Finally, the conditional statistical power of each of the three methods of analysis was calculated as the percentage of rejections of the null hypothesis in the 1000 simulations within each scenario; this was compared to the nominal power of 80%.
Discussion
This simulation study has examined the effect of baseline imbalance in an RCT on the bias and precision of estimates of treatment effect, and the power of a statistical test conditional on such imbalance. Although the statistical implications of baseline imbalance have previously been described, they have not hitherto been simultaneously quantified for these three analyses in relation to various combinations of levels of associated trial characteristics: effect size, degree of baseline-outcome (ZY) correlation and both magnitude and direction of baseline imbalance.
ANCOVA is known to produce unbiased estimates of treatment effect in the presence of baseline imbalance when groups are randomized [19,20]. ANOVA and CSA, however, produce biased estimates in such circumstances. For both ANOVA and CSA, the direction of bias is related to the direction of baseline imbalance, and bias is greatest when baseline imbalance, in either direction, is most pronounced. At a low ZY correlation, ANOVA exhibits less bias than CSA, but at a high ZY correlation the reverse is the case. In a situation in which ANOVA overestimates the unbiased treatment effect, CSA underestimates it, and vice versa. Both ANOVA and CSA show equal levels of bias (albeit in different directions) when the ZY correlation is 0.5. When the ZY correlation is 0, estimates from ANCOVA and ANOVA are equivalent, as the absence of correlation means that ANCOVA takes no account of imbalance and thereby reduces to ANOVA.
As regards precision, ANOVA and CSA yield less precise estimates than ANCOVA. ANOVA is progressively less precise than ANCOVA as the ZY correlation increases; by contrast, CSA shows increasing precision as the ZY correlation increases. CSA is less precise than ANOVA at ZY correlations below 0.5, but more precise at ZY correlations greater than 0.5, and the two analyses have standard errors of the same magnitude when the correlation is 0.5. In no situation does either CSA or ANOVA exceed the precision of ANCOVA.
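These orderings follow from the standard large-sample variance expressions for the three estimators (an addition of ours, not stated in the text): with equal groups of size n, common SD σ and ZY correlation ρ, Var(ANOVA) ∝ 2σ²/n, Var(CSA) ∝ 2σ² · 2(1 - ρ)/n and Var(ANCOVA) ∝ 2σ²(1 - ρ²)/n. A short numerical check, with σ and n cancelling in the ratios:

```python
# Relative variances of the treatment-effect estimators as functions of the
# baseline-outcome correlation rho (scaled so that ANOVA's variance is 1).
def var_anova(rho):
    return 1.0                 # proportional to 2*sigma**2/n

def var_csa(rho):
    return 2.0 * (1.0 - rho)   # change scores: Var(Y - Z) term

def var_ancova(rho):
    return 1.0 - rho ** 2      # covariance adjustment removes rho**2 of variance

for rho in (0, 0.1, 0.3, 0.5, 0.7, 0.9):
    # ANCOVA is never less precise than either alternative ...
    assert var_ancova(rho) <= var_anova(rho)
    assert var_ancova(rho) <= var_csa(rho)

# ... while CSA and ANOVA swap order either side of rho = 0.5.
assert var_csa(0.5) == var_anova(0.5)
assert var_csa(0.3) > var_anova(0.3) and var_csa(0.7) < var_anova(0.7)
```

The identity 1 - ρ² = (1 - ρ)(1 + ρ) ≤ 2(1 - ρ) for ρ < 1 shows algebraically why ANCOVA can never be less precise than CSA.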
The results for statistical power of the three analyses are not straightforward. The greater precision noted for ANCOVA might suggest that it would be unconditionally the most powerful analysis. Yet, as Figure 3 shows, whilst under some circumstances its power exceeds the nominal 80% power of ANOVA, under other circumstances ANOVA has greater power. This can be explained by the adjusted treatment effect derived through ANCOVA. When baseline imbalance is in the opposite direction from the treatment effect, ANCOVA corrects the resulting bias by producing an adjusted treatment effect that is larger than the nominal treatment effect, and ANCOVA therefore has greater power to detect this effect than ANOVA has to detect the nominal effect, at the same sample size. Correspondingly, when imbalance is in the same direction as the treatment effect, ANCOVA corrects the bias by adjusting the treatment effect downwards; its power to detect this effect is therefore less than that of ANOVA to detect the nominal treatment effect. However, when the ZY correlation is 0 (Figure 3, graphs A to C), ANCOVA and ANOVA produce equivalent estimates of treatment effect, as noted earlier, and the difference in power therefore essentially disappears. This phenomenon also explains why baseline imbalance affects precision and power differently: precision is unaffected by imbalance, but power reflects imbalance when it is calculated in relation to an adjusted treatment effect. When there is no imbalance, the adjusted treatment effect equals the nominal treatment effect and here ANCOVA is more powerful than ANOVA by virtue of its greater precision [18,31,32]. An important point to emphasize is that, in the presence of imbalance, nominal power is inappropriate due to the underlying bias in the estimation of the true treatment effect by ANOVA, which fails to address the baseline imbalance of the two treatment groups. As regards the analyses that accommodate baseline imbalance, ANCOVA is unconditionally more powerful than CSA, especially at lower ZY correlations [33].
The power of CSA shows a similar pattern to that of ANCOVA when the ZY correlation is 0.7 or greater. At lower correlations, however, it demonstrates greater extremes of power than ANCOVA: higher than ANCOVA with imbalance in the opposite direction from the treatment effect, and lower than ANCOVA with imbalance in the same direction. This indicates CSA's over-correction of bias, in both directions, when the ZY correlation is low; this stems from its failure to account for regression to the mean [24,34]. In the absence of imbalance, the power of CSA exceeds the nominal 80% power of ANOVA when the ZY correlation is high, but is lower than that of ANOVA when the ZY correlation is low. This reflects the relative precision of these two analyses conditional upon the ZY correlation; CSA is the more precise at high correlations, whereas ANOVA is the more precise at low correlations, as indicated by the ratios of standard errors in Table 2.
Relative to ANCOVA, the alternative analyses are thus liable to be either too conservative or too liberal [26]. It is clear, therefore, that the use of either ANOVA or CSA is inadvisable when baseline imbalance exists. Although all three methods are unbiased when there is no baseline imbalance, the likelihood is that in a clinical trial with several baseline covariates there will be some degree of imbalance across a number, if not all, of these variables. Similarly, the level of correlation between these covariates and the outcome variable is likely to be greater than zero (or possibly less than zero, though baseline values of the outcome variable are more likely to be positively than negatively correlated with post-treatment values). Moreover, ANCOVA is consistently the most precise method of analysis and hence delivers the greatest efficiency in respect of testing against the null hypothesis and reducing the type II error. Our results concur with previous literature that emphasizes the advantages of covariate adjustment [3,8,9,12-16,24,35].
These simulations are based on imbalance in a single covariate. Where imbalance exists in a number of covariates, the degree of bias associated with either ANOVA or CSA will depend upon the combined effect of imbalances that may be in different directions, and upon the particular ZY correlations associated with each of these covariates. However, loss of precision (and hence of statistical power) through the use of ANOVA or CSA is likely to be greater with imbalance in multiple covariates than with imbalance in a single covariate, as there will normally be a greater proportion of variance in the outcome measure that is unaccounted for by either of these analyses.
Our results show the advantages of ANCOVA in reducing bias, increasing precision and providing appropriate power of statistical testing across a number of practical situations commonly seen in clinical trials. Several authors [2,34,36-39] argue that covariates should be selected a priori in terms of their prognostic importance, rather than on the basis of examining baseline imbalance in the trial data: even a large imbalance is of little consequence in terms of bias if the covariate is not related to outcome. Moreover, the primary analysis in an RCT should be pre-specified [40,41]. Accordingly, our findings suggest that ANCOVA should be adopted as the analysis of choice, regardless of the magnitude of imbalance observed in the trial data. Consideration should also be given to achieving balance in important prognostic covariates at baseline in addition to subsequent statistical adjustment [42], e.g. through stratified randomization or covariate-adaptive methods of allocation [11,43,44].
Limitations
The conditions under which we have investigated the effect of baseline imbalance, in terms of the magnitudes of effect size, baseline imbalance and ZY correlation, are plausible and realistic, although the extremes of baseline imbalance examined will, reassuringly, be uncommon. Our findings are therefore readily transferable to specific real-life RCT scenarios. However, our findings assume equal allocation, and results may differ where this is not the case. Nor do our findings necessarily generalize fully to trials where groups are not formed by randomization [45] or where outcomes are binary or time-to-event [28,42,46]. These results are also based on analyses whose assumptions were optimally satisfied through the simulation process, and are likely to differ in respect of real-life data that depart from such assumptions, e.g. a skewed outcome variable, or heterogeneous ZY regression coefficients between groups. Large trials will produce data that are robust to certain deviations from the assumptions underlying parametric analysis. Nonetheless, future work could usefully explore the impact of some of these deviations on the conclusions of the current study.
Appendix
Simulation program in Stata. The prime identifies values that are specific to a particular simulation; i.e. r′ indicates r = 0.1, 0.3, 0.5, 0.7 or 0.9; y′ indicates y = 0.2, 0.5 or 0.8; z′ indicates standardized imbalance (the standard error of the absolute imbalance multiplied by the appropriate standard normal deviate).
set seed
set obs n
[defines number of observations (n) for the trial]
g g = mod(_n,2)
[defines two treatment groups: Control (0), Treatment (1)]
g z = invnorm(uniform())*1
[generates normally distributed baseline scores (z) with mean = 0 and SD = 1 and randomly assigns these to treatment groups]
g r = r′
[generates a predetermined correlation between baseline and post-treatment scores]
g k = invnorm(uniform())*1
[generates another normally distributed set of scores (k)]
g y = z*r + k*(1-r^2)^.5
[transforms k into an outcome score (y) that has a predetermined correlation with the baseline score (z)]
replace z = z + g*z′
[applies a predetermined direction-specific baseline imbalance to the treatment groups; with 'z + g', imbalance is in the same direction as the treatment effect, but with 'z - g' it is in the opposite direction]
replace y = y + g*y′
[creates a predetermined treatment effect]
g c = y-z
[generates change scores for the treatment groups]
regress y g
[performs analysis of variance]
regress c g
[performs change-score analysis]
regress y g z
[performs analysis of covariance]
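For readers without Stata, the program above can be sketched in Python/NumPy (an illustrative translation of ours, not the authors' code; function names are our own). It follows the same ordering of steps, including applying the baseline imbalance after the outcome has been generated, and estimates the treatment effect by each of the three methods via least squares:

```python
import numpy as np

def simulate_trial(n_per_group, r, effect, imbalance, rng):
    """Generate one two-arm trial following the steps of the Stata program."""
    g = np.repeat([0, 1], n_per_group)        # 0 = control, 1 = treatment
    z = rng.standard_normal(2 * n_per_group)  # baseline scores, mean 0, SD 1
    k = rng.standard_normal(2 * n_per_group)  # auxiliary normal scores
    y = z * r + k * np.sqrt(1 - r ** 2)       # outcome with corr(z, y) = r
    z = z + g * imbalance                     # add baseline imbalance
    y = y + g * effect                        # add treatment effect
    return g, z, y

def group_coefficient(covariates, y):
    """OLS coefficient of the group indicator (last column of the design)."""
    X = np.column_stack([np.ones(len(y))] + covariates)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[-1]

rng = np.random.default_rng(1)
g, z, y = simulate_trial(1000, r=0.5, effect=0.5, imbalance=0.1, rng=rng)
b_anova = group_coefficient([g], y)        # regress y g
b_csa = group_coefficient([g], y - z)      # regress c g, with c = y - z
b_ancova = group_coefficient([z, g], y)    # regress y g z
```

Repeating such a simulation 1000 times per scenario and counting rejections of the null hypothesis reproduces the conditional power calculation described in the Methods.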
Competing interests
The authors have no competing interests.
Authors’ contributions
JS and ML conceived the study. All authors designed the study. BEE planned and performed the simulations. All authors interpreted the data. All authors drafted the manuscript and approved the final version.