1. Introduction
Cluster randomized trials (CRTs), in which groups of participants rather than individuals are randomized, are increasingly used in health promotion and health services research [1]. When participants must be managed within the same setting, such as a hospital, community, or family physician practice, this randomization strategy is usually adopted to minimize potential treatment "contamination" between intervention and control participants. It is also used when individual-level randomization would be inappropriate, unethical, or infeasible [2]. The main consequence of the cluster-randomized design is that participants cannot be assumed to be independent, owing to the similarity of participants within the same cluster. This similarity is quantified by the intra-cluster correlation coefficient (ICC), ρ. Considering the two components of variation in the outcome, between-cluster and within-cluster, ρ may be interpreted as the proportion of the overall variation in the outcome that is explained by the between-cluster variation [3]. It may also be interpreted as the correlation between the outcomes of any two participants in the same cluster. It is well established that failing to account for the intra-cluster correlation in the analysis increases the chance of obtaining statistically significant but spurious findings [4].
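Writing σ²_b for the between-cluster variance and σ²_w for the within-cluster variance of the outcome, these two interpretations correspond to the standard definition

ρ = σ²_b / (σ²_b + σ²_w),

so that ρ = 0 when clusters contribute no extra variation and ρ = 1 when all of the variation lies between clusters.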
The risk of attrition may be very high in some CRTs because of the lack of direct contact with individual participants and lengthy follow-up [5]. In addition to missing individuals, entire clusters may be missing, which further complicates the handling of missing data in CRTs. The impact of missing data on the results of a statistical analysis depends on the mechanism that caused the data to be missing and on the way it is handled. The default approach to this problem is complete case analysis (also called listwise deletion), i.e., excluding participants with missing data from the analysis. Although this approach is easy to use and is the default option in most statistical packages, it may substantially weaken the statistical power of the trial and may also lead to biased results, depending on the missing data mechanism.
Generally, missingness falls into four categories: missing completely at random (MCAR), missing at random (MAR), covariate dependent (CD) missing, and missing not at random (MNAR) [6]. Understanding these categories is important because the appropriate solution varies with the nature of the missingness. MCAR means that the missing data mechanism, i.e., the probability of being missing, does not depend on the observed or unobserved data. Both MAR and CD mechanisms indicate that the causes of missing data are unrelated to the missing values themselves but may be related to the observed values. In the context of longitudinal data, where serial measurements are taken for each individual, MAR means that the probability of a missing response at a particular visit depends on either observed responses at previous visits or covariates, whereas CD missingness, a special case of MAR, means that the probability of a missing response depends only on covariates. MNAR means that the probability of missing data depends on the unobserved data; it commonly occurs when people drop out of a study because of poor or good health outcomes. A key distinction between these categories is that MNAR is non-ignorable, while the other three (MCAR, CD, and MAR) are ignorable [7]. Under ignorable missingness, imputation strategies such as mean imputation, hot deck, last observation carried forward, or multiple imputation (MI), which substitute one or more plausible values for each missing value, can produce a complete dataset that is not adversely biased [8,9]. Non-ignorable missing data are more challenging and require a different approach [10].
The two main approaches to handling missing outcomes are likelihood-based analyses and imputation [10]. In this paper, we focus on MI strategies, which take into account the variability or uncertainty of the missing data, to impute missing binary outcomes in CRTs. Under the assumption of MAR, MI strategies replace each missing value with a set of plausible values to create multiple imputed datasets, usually between 3 and 10 [11]. These imputed datasets are analyzed using standard procedures for complete data, and the results are then combined to generate the final inference. Standard MI procedures are available in many statistical software packages such as SAS (Cary, NC), SPSS (Chicago, IL), and STATA (College Station, TX). However, these procedures assume that observations are independent and may not be suitable for CRTs, since they do not take the intra-cluster correlation into account.
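The impute-analyze-combine cycle described above can be sketched as follows. This is a minimal illustration for a single proportion under an MCAR-style imputation; the function name, the simple Bernoulli draw from the observed proportion, and the data are ours, not part of the CHAT analysis, and a real strategy would condition on covariates and cluster membership:

```python
import numpy as np

rng = np.random.default_rng(1)

def impute_and_pool(y, m=5):
    """Sketch of multiple imputation for a binary outcome:
    impute m completed datasets, estimate the proportion on each,
    and pool the m results with Rubin's rules."""
    obs = y[~np.isnan(y)]
    p_obs = obs.mean()                      # proportion among observed values
    estimates, variances = [], []
    for _ in range(m):
        y_comp = y.copy()
        miss = np.isnan(y_comp)
        # draw each missing value from Bernoulli(p_obs) -- MCAR-style imputation
        y_comp[miss] = rng.binomial(1, p_obs, miss.sum())
        p_hat = y_comp.mean()
        estimates.append(p_hat)
        variances.append(p_hat * (1 - p_hat) / len(y_comp))
    Q = np.mean(estimates)                  # pooled point estimate
    U = np.mean(variances)                  # within-imputation variance
    B = np.var(estimates, ddof=1)           # between-imputation variance
    T = U + (1 + 1 / m) * B                 # Rubin's total variance
    return Q, T

y = np.array([1, 0, 1, 1, 0, np.nan, 1, np.nan, 0, 1] * 20)
Q, T = impute_and_pool(y, m=5)
```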
To the best of our knowledge, limited investigation has been done on imputation strategies for missing binary or categorical outcomes in CRTs. Yi and Cook reported marginal methods for missing longitudinal data from clustered designs [12]. Hunsberger et al. [13] described three strategies for missing continuous data in CRTs: 1) a multiple imputation procedure in which the missing values are replaced with values re-sampled from the observed data; 2) a median procedure based on the Wilcoxon rank sum test, assigning the missing data in the intervention group the worst ranks; and 3) a multiple imputation procedure in which the missing values are replaced by predicted values from a regression equation. Nixon et al. [14] presented strategies for imputing missing end points from a surrogate. In the analysis of a continuous outcome from the Community Intervention Trial for Smoking Cessation (COMMIT), Green et al. stratified individual participants into groups that were more homogeneous with respect to the predicted outcome; within each stratum, they imputed the missing outcome using the observed data [15,16]. Taljaard et al. [17] compared several imputation strategies for missing continuous outcomes in CRTs under the assumption of missing completely at random: cluster mean imputation, within-cluster MI using the Approximate Bayesian Bootstrap (ABB) method, pooled MI using the ABB method, standard regression MI, and mixed-effects regression MI. As Kenward et al. point out, if a substantive model that reflects the data structure, such as a generalized linear mixed model, is to be used, it is important that the imputation model also reflect this structure [18].
The objectives of this paper are to: i) investigate the performance of various imputation strategies for missing binary outcomes in CRTs under different percentages of missingness, assuming a missing completely at random or covariate dependent missing mechanism; ii) compare the agreement between the complete dataset and the imputed datasets obtained from the different imputation strategies; and iii) compare the robustness of the results from two commonly used statistical analysis methods, generalized estimating equations (GEE) and random-effects (RE) logistic regression, under the different imputation strategies.
5. Discussion
In this paper, under the assumption of MCAR and CD missingness, we used a simulation study to compare six MI strategies that account for the intra-cluster correlation of missing binary outcomes in CRTs with standard imputation strategies and the complete case analysis approach. Our results show that, first, when the percentage of missing data is low or the intra-cluster correlation coefficient is small, the different imputation strategies and the complete case analysis approach generate quite similar results. Second, standard MI strategies, which do not take the intra-cluster correlation into account, underestimate the variance of the treatment effects; they may therefore lead to statistically significant but spurious conclusions when used to handle missing data from CRTs. Third, under the assumption of MCAR and CD missingness, the point estimates (ORs) are quite similar across the different approaches to handling missing data, except for the random-effects logistic regression MI strategy. Fourth, both within-cluster and across-cluster MI strategies take the intra-cluster correlation into account and provide more conservative treatment effect estimates than MI strategies that ignore the clustering effect. Fifth, within-cluster imputation strategies lead to wider CIs than across-cluster imputation strategies, especially when the percentage of missingness is high; this may be because within-cluster imputation strategies use only a fraction of the data, which increases the variation of the estimated treatment effect. Sixth, a larger estimated kappa, which indicates higher agreement between the imputed and observed values, is associated with better performance of the MI strategies in terms of generating an estimated treatment effect and 95% CI closer to those obtained from the complete CHAT dataset. Seventh, under the same imputation strategy and percentage of missingness, the estimates of the treatment effect from the GEE and RE logistic regression models are similar.
To the best of our knowledge, limited work has been done on comparing different multiple imputation strategies for missing binary outcomes in CRTs. Taljaard et al. [17] compared four MI strategies (pooled ABB, within-cluster ABB, standard regression, and mixed-effects regression) for a missing continuous outcome in CRTs under the assumption of missing completely at random. Their findings are similar to ours.
It should be noted that within-cluster MI strategies may only be applicable when the cluster size is sufficiently large and the percentage of missingness is relatively small. In the CHAT study, there were 55 patients in each cluster, which provided enough data to carry out the within-cluster imputation strategies using the propensity score and MCMC methods. However, the logistic regression method failed when the percentage of missingness was high: when a large percentage (≥20%) of missing outcomes was generated, all patients with an outcome of "0" were simulated as missing in some clusters, and the logistic regression model failed for those clusters. In addition, our results show that the complete case analysis approach performs relatively well even with 50% missingness. Because of the intra-cluster correlation, one would not expect the missing values to have much impact if a large proportion of a cluster is still present; however, further investigation of this issue in a simulation study would be helpful.
Our results show that the across-cluster random-effects logistic regression strategy leads to a potentially biased estimate, especially when the percentage of missingness is high. As described in section 2.4.2, we assume that the cluster-level random effects follow a normal distribution, i.e., u_j ~ N(0, σ_u²). Researchers have shown that misspecification of the distributional shape has little impact on inferences about the fixed effects [31]. Incorrectly assuming that the random effects distribution is independent of the cluster size may affect inferences about the intercept, but does not seriously affect inferences about the regression parameters; however, incorrectly assuming that the random effects distribution is independent of covariates may seriously affect inferences about the regression parameters [32,33]. In our dataset, the mean or the variance of the random effects distribution could be associated with a covariate, which might explain the potential bias of the across-cluster random-effects logistic regression strategy. In contrast, the imputation strategy using logistic regression with cluster as a fixed effect performs better; however, it can only be applied when the cluster size is large enough to provide a stable estimate of the cluster effect.
For multiple imputation, the overall variance of the estimated treatment effect consists of two parts: the within-imputation variance U and the between-imputation variance B. The total variance T is calculated as T = U + (1 + 1/m)B, where m is the number of imputed datasets [10]. Since standard MI strategies ignore the between-cluster variance and fail to account for the intra-cluster correlation, the within-imputation variance may be underestimated, which can lead to underestimation of the total variance and consequently to narrower confidence intervals. In addition, the adequacy of standard MI strategies depends on the ICC. In our study, the ICC of the CHAT dataset is 0.055, and the cluster effect in the random-effects model is statistically significant.
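To get a rough sense of the size of this effect for the CHAT values quoted above (ICC of 0.055, 55 patients per cluster), the familiar design effect 1 + (n − 1)ρ indicates by roughly what factor a variance that ignores clustering is understated. The calculation below is an illustration we have added, not part of the original analysis:

```python
# Design effect illustration using the CHAT values: rho = 0.055, n = 55
rho, n = 0.055, 55
deff = 1 + (n - 1) * rho   # variance inflation factor due to clustering
se_ratio = deff ** 0.5     # how much wider the clustering-aware SE is
# deff -> 3.97, se_ratio -> about 1.99
```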
Among the three imputation methods considered, the predictive model (logistic regression), the propensity score method, and the MCMC method, the MCMC method is the most popular for multiple imputation of missing data and is the default method implemented in SAS. Although it is widely used to impute binary and polytomous data, there are concerns about the consequences of violating the normality assumption. Experience has repeatedly shown that multiple imputation using the MCMC method tends to be quite robust even when the real data depart from the multivariate normal distribution [20]. Therefore, when handling missing binary or ordered categorical variables, it is acceptable to impute under a normality assumption and then round the continuous imputed values to the nearest category. For example, the imputed values for a missing binary variable can be any real value rather than being restricted to 0 and 1; we rounded the imputed values so that values greater than or equal to 0.5 were set to 1 and values less than 0.5 were set to 0 [34]. Horton et al. [35] showed that such rounding may produce biased estimates of proportions when the true proportion is near 0 or 1, but does well under most other conditions. The propensity score method was originally designed to impute missing values on the response variables of randomized experiments with repeated measures [21]. Since it uses only the covariate information associated with the missingness and ignores the correlations among variables, it may produce badly biased estimates of regression coefficients when data on predictor variables are missing. In addition, with small sample sizes and a relatively large number of propensity score groups, application of the ABB method is problematic, especially for binary variables; in this case, a modified version of the ABB should be used [36].
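The rounding rule described above can be written in one line; the continuous imputed values below are made up for illustration:

```python
import numpy as np

# Round continuous normal-model imputations to the nearest binary category:
# values >= 0.5 become 1, values < 0.5 become 0.
imputed = np.array([-0.21, 0.38, 0.49, 0.50, 0.73, 1.12])
binary = (imputed >= 0.5).astype(int)
# binary -> [0, 0, 0, 1, 1, 1]
```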
Some limitations of the present study need to be acknowledged. First, the simulation study is based on a real dataset with a relatively large cluster size and a small ICC; further research should investigate the performance of the different imputation strategies under other design settings. Second, the scenario in which an entire cluster is missing is not investigated in this paper, and the proposed within-cluster and across-cluster MI strategies may not apply to it. Third, we investigated the performance of the different MI strategies assuming MCAR and CD missing data mechanisms, so the results cannot be generalized to MAR or MNAR scenarios. Fourth, since the estimated treatment effects were similar under the different imputation strategies, we presented only the OR and 95% CI for each simulation scenario; estimates of standardized bias and coverage would be more informative and would also provide a quantitative guideline for assessing the adequacy of the imputations [37].
Appendix A: SAS code for across-cluster random-effects logistic regression method
%let maximum = 1000;
%macro parameter_estimate(percent,index);
ods listing close;
proc nlmixed data = mcar&percent&index cov;
parms b0 = -0.0645 b_group = -0.1433 b_diabbase = -0.04 b_hdbase = 0.1224 b_age = -0.0066
b_base_bpcontrolled = 1.1487 b_sex = 0.0873 s2u = 0.5;
eta = b0 + b_group*group + b_diabbase*diabbase + b_hdbase*hdbase + b_age*age
+ b_base_bpcontrolled*base_bpcontrolled + b_sex*sex + u;
expeta = exp(eta);
p = expeta/(1+expeta);
model outcome ~ binary(p);
random u ~ normal(0,s2u) subject = assfpid;
ods output ParameterEstimates = parameter&percent&index
CovMatParmEst = covariance&percent&index;
run;
data parameter&percent&index;
set parameter&percent&index;
keep estimate;
run;
data covariance&percent&index;
set covariance&percent&index;
drop row parameter;
run;
%mend parameter_estimate;
%macro mvn(percent, index, n);
/* arguments for the macro:
1. percent: percentage of missingness (used in dataset names)
2. index: simulation replicate number (used in dataset names)
3. n: number of multivariate normal draws (one per imputation) */
proc iml;
use covariance&percent&index;/* read in data for variance-covariance matrix */
read all into sigma;
use parameter&percent&index;/* read in data for means */
read all into mu;
p = nrow(sigma);/* calculate number of variables */
n = &n;
l = t(half(sigma));/* calculate cholesky root of cov matrix */
z = normal(j(p,&n,1234));/* generate nvars*samplesize normals */
y = l*z;/* premultiply by cholesky root */
yall = t(repeat(mu,1,&n)+y);/* add in the means */
varnames = { b0 b_group b_diabbase b_hdbase b_age b_base_bpcontrolled b_sex s2u};
create myMVN&percent&index from yall [colname = varnames];
append from yall;
quit;
%mend mvn;
%macro mi_random_effect(percent, index);
%parameter_estimate(&percent, &index);
%mvn(&percent, &index, 5);
proc iml symsize = 512;
use mymvn&percent&index;
read all into mvndata;
use mcar&percent&index;
read all var {ptid DIABBASE HDBASE base_bpcontrolled last_bpimproved sex age assfpid_num group missing outcome} into temp_data;
log_icca_cov = j(7700,12,0);
do i = 0 to 4;
do j = 1 to 1540;
do k = 1 to 11;
log_icca_cov[i*1540+j,k] = temp_data[j,k];
end;
log_icca_cov[i*1540+j,12] = i+1;
end;
end;
do i = 1 to 7700;
if log_icca_cov[i, 11] = . then do;
num = log_icca_cov[i, 12];
logit_p = mvndata[num, 1] + mvndata[num, 2]*log_icca_cov[i, 9]
+ mvndata[num, 3]*log_icca_cov[i, 2] + mvndata[num, 4]*log_icca_cov[i,3]
+ mvndata[num, 5]*log_icca_cov[i, 7] + mvndata[num, 6]*log_icca_cov[i,4]
+ mvndata[num, 7]*log_icca_cov[i, 6] + rand('NORMAL', 0, sqrt(mvndata[num, 8]));
log_icca_cov[i, 11] = rand('BERNOULLI', exp(logit_p)/(1+exp(logit_p)));
end;
end;
varnames = {ptid DIABBASE HDBASE base_bpcontrolled last_bpimproved sex age assfpid_num group missing outcome _imputation_};
create log_icca_cov&percent&index from log_icca_cov [colname = varnames];
append from log_icca_cov;
quit;
%mend mi_random_effect;
%macro mi_icca_log(percent, index);
ods listing close;
%mi_random_effect(&percent, &index);
data log_icca_cov&percent&index;
set log_icca_cov&percent&index;
if outcome >= 1 then outcome = 1;
else if outcome < 1 then outcome = 0;
run;
proc freq data = log_icca_cov&percent&index;
table last_bpimproved*outcome/kappa;
ods output SimpleKappa = log_icca_kappapool&percent&index;
run;
data log_icca_kappapool&percent&index;
set log_icca_kappapool&percent&index;
if Label1 = 'Kappa';
run;
proc sort data = log_icca_cov&percent&index;
by _imputation_;
run;
proc genmod data = log_icca_cov&percent&index;
class outcome assfpid_num;
model outcome = group diabbase hdbase age base_bpcontrolled sex/D = B link = logit;
repeated subject = assfpid_num/type = exch covb;
by _imputation_;
ods output GEEEmpPEst = log_icca_geepar&percent&index
GEERCov = log_icca_geecov&percent&index;
run;
data log_icca_geepar&percent&index;
set log_icca_geepar&percent&index;
if Parameter ~= 'Scale';
if Parm = 'Prm' then Parm = 'Prm1';
else if Parm = 'GROUP' then Parm = 'Prm2';
else if Parm = 'DIABBASE' then Parm = 'Prm3';
else if Parm = 'HDBASE' then Parm = 'Prm4';
else if Parm = 'AGE' then Parm = 'Prm5';
else if Parm = 'BASE_BPCONTROLLED' then Parm = 'Prm6';
else if Parm = 'SEX' then Parm = 'Prm7';
run;
proc mianalyze parms = log_icca_geepar&percent&index covb = log_icca_geecov&percent&index;
modeleffects Prm2;
ods output ParameterEstimates = pool_log_icca_gee&percent&index;
run;
proc nlmixed data = log_icca_cov&percent&index cov;
by _imputation_;
parms b0 = -0.0645 b_group = -0.1433 b_diabbase = -0.04 b_hdbase = 0.1224 b_age = -0.0066
b_base_bpcontrolled = 1.1487 b_sex = 0.0873 s2u = 0.5;
eta = b0 + b_group*group + b_diabbase*diabbase + b_hdbase*hdbase + b_age*age
+ b_base_bpcontrolled*base_bpcontrolled + b_sex*sex + u;
expeta = exp(eta);
p = expeta/(1+expeta);
model outcome ~ binary(p);
random u ~ normal(0,s2u) subject = assfpid_num;
ods output ParameterEstimates = log_icca_repar&percent&index
CovMatParmEst = log_icca_recov&percent&index;
run;
proc mianalyze parms = log_icca_repar&percent&index covb = log_icca_recov&percent&index;
modeleffects b_group;
ods output ParameterEstimates = pool_log_icca_re&percent&index;
run;
ods listing;
%mend mi_icca_log;
%macro append_log_icca(percent);
%do index = 1 %to &maximum;
%if &index = 1 %then %do;
data pool_log_icca_re&percent;
set pool_log_icca_re&percent&index;
run;
data pool_log_icca_gee&percent;
set pool_log_icca_gee&percent&index;
run;
data log_icca_kappa&percent;
set log_icca_kappapool&percent&index;
run;
%end;
%else %do;
proc append base = pool_log_icca_re&percent data = pool_log_icca_re&percent&index;
run;
proc append base = pool_log_icca_gee&percent data = pool_log_icca_gee&percent&index;
run;
proc append base = log_icca_kappa&percent data = log_icca_kappapool&percent&index;
run;
%end;
%end;
%mend append_log_icca;
%macro collect_result_log_icca(percent);
%do index = 1 %to &maximum;
%mi_icca_log(&percent,&index);
%end;
%append_log_icca(&percent);
proc univariate data = log_icca_kappa&percent;
var nValue1;
run;
proc univariate data = pool_log_icca_gee&percent;
var Estimate StdErr;
run;
proc univariate data = pool_log_icca_re&percent;
var Estimate StdErr;
run;
%mend collect_result_log_icca;
filename junk dummy;
proc printto log = junk;run;
%collect_result_log_icca(05);
%collect_result_log_icca(10);
%collect_result_log_icca(15);
%collect_result_log_icca(30);
%collect_result_log_icca(50);
proc printto; run;