Background
In clinical epidemiology, missing data are generally classified as (i) missing completely at random (MCAR), when the probability of data being missing depends on neither the observed nor the unobserved data; (ii) missing at random (MAR), when, conditional on the observed data, the probability of data being missing does not depend on unobserved data; or (iii) missing not at random (MNAR), when, conditional on the observed data, the probability of data being missing still depends on unobserved data, i.e., neither MCAR nor MAR [1, 2]. Unfortunately, these mechanisms generally cannot be tested unless the missing data mechanism is modelled directly. Although methods for handling MCAR or MAR data in clinical epidemiology have been widely described and studied, methods adapted to MNAR mechanisms have received less attention.
In the presence of MNAR missing outcomes, valid statistical inference requires describing the missing data mechanism [1, 3]. Hence, it often requires joint models for the missing outcomes and their indicators of missingness [4]. Two principal factorisations of these joint models have been proposed: pattern-mixture models and selection models [1, 5–7]. The first uses different distributions to model individuals with and without missing observations [8, 9]. The second directly models the relationship between the risk of a variable being missing and its unseen value. It involves defining an analysis model for the outcome and a selection model (i.e., the missing data mechanism), and generally relies on a bivariate distribution to model the outcome and its binary missingness indicator simultaneously [10]. This approach, called the sample selection model, Tobit type-2 model [11] or Heckman’s model, was first introduced by Heckman for continuous outcomes [12, 13]. For continuous outcomes, two approaches have been proposed to estimate the model parameters: a one-step process that directly estimates all parameters of the joint model by maximum likelihood [11], and a two-step process [12, 13]. The first step of the latter estimates the parameters of the selection model. The second step fits the outcome model adjusted for a correction term named the “inverse Mills ratio” (IMR), which is obtained from the first step. The IMR arises from the mean of the conditional distribution of the outcome, under the bivariate normal distribution, given that the outcome has been observed [14]. This allows unbiased estimates of the parameters of the outcome model to be calculated.
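As a reminder of where this correction term comes from, the standard textbook result for Heckman’s model with a continuous outcome can be sketched as follows. The notation is assumed here rather than taken from the paper's numbered equations: \(X_i\beta\) and \(X_i^s\beta^s\) denote the outcome and selection linear predictors, \(\rho\) the correlation between the two error terms, \(\sigma_\varepsilon\) the standard deviation of the outcome error, and the selection error is taken to have unit variance.

```latex
\[
E\left(Y_i \mid R_{yi}=1\right)
  = X_i\beta + \rho\,\sigma_\varepsilon\,\lambda\!\left(X_i^s\beta^s\right),
\qquad
\lambda(z) = \frac{\phi(z)}{\Phi(z)},
\]
```

Here \(\phi\) and \(\Phi\) are the standard normal density and distribution functions, and \(\lambda\) is the IMR. Under these assumptions, the second step amounts to regressing the observed outcomes on \(X_i\) and on \(\lambda\) evaluated at the first-step estimates of \(\beta^s\).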
For binary outcomes, sample selection methods rely on a different model. This model is not a simple adaptation of the continuous case; in particular, it cannot be obtained by applying the two-step estimator with a generalised linear model as the outcome model. In the setting of binary outcomes, the use of a bivariate probit model and a one-step maximum likelihood estimator is mandatory [10]. Indeed, Heckman’s model links the outcome model and the selection model through their error terms. Some authors, by analogy with Heckman’s two-step estimator, proposed modelling binary outcomes using a probit model adjusted for the IMR [15]. Although such approaches have been used, it has been specifically demonstrated that a two-step approach including the IMR in a probit model for binary outcomes is not valid [10, 16]. More generally, Heckman’s two-step estimator cannot be extended straightforwardly to generalised linear outcome models by plugging the IMR into the linear predictor. This is because the outcome expectation in non-linear models subject to selection does not reduce to a simple correction term in the linear predictor [16].
Although Heckman’s model handles MNAR missing binary outcomes well using a bivariate probit model, in the presence of additional missing data on predictors there is no process that can address all the missing data simultaneously. In this setting, missing data on predictors are typically treated using an unsatisfactory complete-predictors approach, i.e., cases with at least one missing predictor are removed from the analysis. In the presence of missing data on more than one variable (including the outcome), multiple imputation (MI) appears to be one of the most flexible and easiest methods to apply, owing to the numerous types of variables handled and the extensive development of statistical packages dedicated to its implementation [17]. Galimard et al. [18] previously developed an approach based on a conditional imputation model for an MNAR mechanism, using Heckman’s model and a two-step estimator to impute MNAR missing continuous outcomes. This approach allows MAR missing covariates and MNAR missing outcomes to be imputed within a multiple imputation by chained equations (MICE) procedure [18]. MICE specifies a suitable conditional imputation model for each incomplete variable and iteratively imputes the missing values until convergence. The key concept of MI procedures is to use the distribution of the observed data to draw a set of plausible values for the missing data. Thus, imputing MNAR missing binary outcomes requires a valid method for obtaining the distribution of the missing binary outcomes. As mentioned above, the work of Galimard et al. [18] on continuous outcomes cannot be extended directly, because it involves a two-step estimator that is not compatible with Heckman’s model for binary outcomes.
Aims of this work
The first aim of this work is to propose an approach to handle MNAR binary outcomes. To our knowledge, the use of sample selection models as imputation models has never been proposed for missing binary outcomes, although such outcomes are common in clinical research. Thus, we propose an imputation method for binary outcomes based on a bivariate probit model associated with a one-step maximum likelihood estimator.
The second aim is to extend this approach to continuous outcomes, thereby proposing a new solution to the issue raised by Galimard et al. [18]. For continuous outcomes, one of the main drawbacks of Heckman’s two-step estimator is that the uncertainty of the first-step estimates is not taken into account in the second step. Indeed, the IMR values are treated as known, observed quantities in the second step, whereas they were estimated in the first step. Thus, the uncertainty around the final estimates is not fully assessed using a two-step estimator [19]. This could affect the quality of the imputation. We therefore hypothesised that the use of a one-step estimator could also improve the performance of Heckman’s model as an imputation model for continuous outcomes, and we also propose a new approach for continuous missing outcomes.
The final aim is to integrate the newly developed MNAR imputation model into a MICE procedure, so that both MNAR outcomes and MAR predictors are handled in the same process.
In what follows, we introduce the study that motivated this work. Then, the “Methods” section develops our proposed imputation model using one-step ML estimation for binary and continuous outcomes. The “Results” section presents the evaluation of its performance using a simulation study and an illustrative example using data from our motivating example. Finally, a discussion and some conclusions are provided.
Simulation study
Data-generating process
We generated three independent and identically distributed normal variables, \(X_{1}\), \(X_{2}\) and \(X_{3}\), with \(X_{j}\sim N\left (0,\sigma ^{2}\right)\). Two error terms, \(\varepsilon \) and \(\varepsilon ^{s}\), were generated from a bivariate normal distribution according to the model given in Eq. (3), with \(\rho \) fixed at 0, 0.3 and 0.6 to simulate MAR, light MNAR and heavy MNAR settings, respectively.
For binary outcomes, \(Y\) was generated as follows: if \(\beta _{0}+\beta _{1} X_{1}+\beta _{2} X_{2}+\varepsilon >0\), then \(Y=1\); otherwise, \(Y=0\). The missingness indicator \(R_{y}\) of \(Y\) was generated as follows: if \(\beta _{0}^{s}+\beta _{1}^{s} X_{1}+\beta _{2}^{s} X_{2}+\beta _{3}^{s} X_{3}+\varepsilon ^{s}>0\), then \(R_{y}=1\) (\(Y\) observed); otherwise, \(R_{y}=0\) (\(Y\) missing).
For continuous outcomes, \(Y\) was generated according to \(Y=\beta _{0}+\beta _{1} X_{1}+\beta _{2} X_{2}+\varepsilon \). Note that in this case, according to the model given in Eq. (4), \(\sigma _{\varepsilon }=1\).
We fixed \(\sigma ^{2}\) to 0.5 and \(\left (\beta _{0},\beta _{1},\beta _{2}\right)\) to (0, 1, 1). \(\left (\beta _{0}^{s},\beta _{1}^{s},\beta _{2}^{s},\beta _{3}^{s}\right)\) was fixed to (0.75, 1, -0.5, 1), which resulted in approximately 30% missing data for the outcome.
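The data-generating process for binary outcomes can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' R code (which is provided in their supplementary materials). The function name `simulate_heckman_binary` is hypothetical, and the sketch assumes that both error terms have unit variance and correlation \(\rho \), and that \(R_{y}=1\) marks an observed outcome.

```python
import math
import random


def simulate_heckman_binary(n, rho, seed=0):
    """Simulate one dataset under a Heckman-type selection mechanism.

    Assumptions (not taken verbatim from the paper): unit-variance errors
    with correlation rho, and r_y == 1 meaning that Y is observed.
    """
    rng = random.Random(seed)
    sd_x = math.sqrt(0.5)              # sigma^2 = 0.5
    b0, b1, b2 = 0.0, 1.0, 1.0         # outcome coefficients
    s0, s1, s2, s3 = 0.75, 1.0, -0.5, 1.0  # selection coefficients
    rows = []
    for _ in range(n):
        x1, x2, x3 = (rng.gauss(0.0, sd_x) for _ in range(3))
        # Build two standard normal errors with correlation rho.
        eps_s = rng.gauss(0.0, 1.0)
        eps = rho * eps_s + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        y = 1 if b0 + b1 * x1 + b2 * x2 + eps > 0 else 0
        r_y = 1 if s0 + s1 * x1 + s2 * x2 + s3 * x3 + eps_s > 0 else 0
        rows.append({"x1": x1, "x2": x2, "x3": x3, "y_full": y, "r_y": r_y})
    return rows


data = simulate_heckman_binary(n=20000, rho=0.6, seed=1)
prop_missing = sum(1 for row in data if row["r_y"] == 0) / len(data)
observed = [row["y_full"] for row in data if row["r_y"] == 1]
missing = [row["y_full"] for row in data if row["r_y"] == 0]
mean_y_observed = sum(observed) / len(observed)
mean_y_missing = sum(missing) / len(missing)
print(f"proportion of missing Y: {prop_missing:.3f}")
print(f"mean of Y when observed: {mean_y_observed:.3f}, when missing: {mean_y_missing:.3f}")
```

A large `n` is used here only to make the missingness proportion stable; the simulation study itself uses datasets of size 500.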
To evaluate the robustness of our approach, we also generated a non-Heckman MNAR mechanism by directly including \(Y\) in the following logistic selection equation: \(\text {logit}\left \{P(R_{y}=1)\right \}=\beta _{0}^{sl}+X_{1}-0.5 \times X_{2}+X_{3}+\beta _{Y}^{sl} Y\). Two sets of parameters were considered. To obtain approximately 30% missing data on \(Y\), we fixed \(\beta _{0}^{sl}\) to 0.60 and 0.20 for binary outcomes and to 1.31 and 1.86 for continuous outcomes, with \(\beta _{Y}^{sl}\) equal to 0, 1 and 2.
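The logistic mechanism for continuous outcomes can be checked with a short simulation. The sketch below is illustrative only: the function name `observed_fraction_logistic` is hypothetical, and pairing the intercepts 1.31 and 1.86 with \(\beta _{Y}^{sl}=1\) and \(\beta _{Y}^{sl}=2\), respectively, is an assumption about how the two parameter sets are composed. It also assumes that \(R_{y}=1\) marks an observed outcome and that the right-hand side is passed through the inverse logit.

```python
import math
import random


def observed_fraction_logistic(beta0_sl, beta_y_sl, n=20000, seed=0):
    """Fraction of observed continuous outcomes under a logistic MNAR mechanism.

    Assumed pairing of (beta0_sl, beta_y_sl); R_y = 1 is taken as 'observed'.
    """
    rng = random.Random(seed)
    sd_x = math.sqrt(0.5)  # sigma^2 = 0.5
    n_observed = 0
    for _ in range(n):
        x1, x2, x3 = (rng.gauss(0.0, sd_x) for _ in range(3))
        y = x1 + x2 + rng.gauss(0.0, 1.0)  # beta = (0, 1, 1), sigma_eps = 1
        lin_pred = beta0_sl + x1 - 0.5 * x2 + x3 + beta_y_sl * y
        p_observed = 1.0 / (1.0 + math.exp(-lin_pred))  # inverse logit
        if rng.random() < p_observed:
            n_observed += 1
    return n_observed / n


frac_set1 = observed_fraction_logistic(1.31, 1.0, seed=1)
frac_set2 = observed_fraction_logistic(1.86, 2.0, seed=2)
print(f"observed fraction, set 1: {frac_set1:.3f}; set 2: {frac_set2:.3f}")
```

Under these assumptions both settings leave roughly 70% of outcomes observed, which is consistent with the stated target of approximately 30% missing data.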
We first simulated scenarios with only missing outcomes to validate our approach in a simple setting. Then, to evaluate the performance of the MICE process, we generated missing data on \(X_{2}\) using two MAR mechanisms depending on either \((X_{1},X_{3})\) or \((X_{1},Y)\). Thus, \(R_{2}\), the indicator of \(X_{2}\) missingness, was defined by either:
- \(P(R_{2}=1|X_{1},X_{3})=\Phi \left (0.25+X_{1}+X_{3}\right)\)
- \(P(R_{2}=1|X_{1},Y)=\Phi \left (\beta _{0}^{R_{2}}+X_{1}+Y\right)\)
\(\beta _{0}^{R_{2}}\) was fixed to 1.10 and 0.25 for binary and continuous outcomes, respectively. We obtained approximately 30% missing data for \(X_{2}\).
A total of N=1000 independent datasets of size 500 were generated for each setting. The sample size was chosen to be similar to our motivating example.
Analysis methods
The analysis models were probit models and linear models for binary and continuous outcomes, respectively, including \(X_{1}\) and \(X_{2}\) as predictors. The simulated data were first analysed prior to data deletion as a benchmark. The incomplete data were then analysed using the following methods:
For continuous outcomes only, two-step approaches were also applied:
- Heckman’s two-step estimation (HE2steps), i.e., Heckman’s two-step estimator for continuous outcomes as described in the “Methods” section.
- Multiple imputation using Heckman’s two-step model estimation (MIHE2steps) for continuous outcomes, as described in Galimard et al. [18].
For HEml, MIHEml, HE2steps and MIHE2steps, the selection equation included \(X_{1}\), \(X_{2}\) and \(X_{3}\). For MIHEml and MIHE2steps, the incomplete data were imputed \(m=50\) times, and final estimates were obtained by applying Rubin’s rules for small samples [29].
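The pooling step can be sketched as follows. This is a minimal standard-library illustration of Rubin's rules with the Barnard and Rubin small-sample degrees of freedom, which is assumed here to be the adjustment meant by "Rubin's rules for small samples"; the function name `pool_rubin` and the toy inputs are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances, df_complete):
    """Pool m point estimates and their squared standard errors.

    Returns the pooled estimate, its standard error, and the adjusted
    degrees of freedom (Barnard-Rubin adjustment, assumed here).
    """
    m = len(estimates)
    q_bar = mean(estimates)                 # pooled point estimate
    u_bar = mean(variances)                 # within-imputation variance
    b = variance(estimates)                 # between-imputation variance
    total = u_bar + (1.0 + 1.0 / m) * b     # total variance
    lam = (1.0 + 1.0 / m) * b / total       # share of variance due to missingness
    # Degrees of freedom available from the observed data.
    df_obs = (df_complete + 1.0) / (df_complete + 3.0) * df_complete * (1.0 - lam)
    if lam == 0.0:
        df = df_obs                         # no between-imputation variability
    else:
        df_old = (m - 1) / lam ** 2         # classical large-sample formula
        df = df_old * df_obs / (df_old + df_obs)
    return q_bar, math.sqrt(total), df


# Toy example: three imputed-data estimates with equal squared standard errors.
pooled_est, pooled_se, pooled_df = pool_rubin(
    estimates=[1.0, 1.2, 0.8],
    variances=[0.04, 0.04, 0.04],
    df_complete=497,
)
print(pooled_est, pooled_se, pooled_df)
```

Confidence intervals would then be built from a Student t distribution with `pooled_df` degrees of freedom. In the study itself this step is handled by the imputation software rather than by hand.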
For scenarios with missing \(X_{2}\): (1) for the HEml and HE2steps approaches, observations with missing \(X_{2}\) were deleted from the analyses, as previously described for the complete-predictors approach; (2) for MIHEml and MIHE2steps, a MICE procedure was applied. \(X_{2}\) was imputed using a linear regression model and an approximate proper imputation algorithm [2]. As recommended, we included \(R_{y}\) and \(Y\) in its imputation model [2, 18]. Twenty iterations of the chained equation process were applied.
In each data-generating scenario, the performance of each method was assessed by computing the percent relative bias (%Rbias), the root mean square of the estimated standard errors (SEcal), the empirical Monte Carlo standard error (SEemp), the root mean square error (RMSE) and the coverage percentage of the nominal 95% confidence intervals (Cover) for \(\beta _{1}\) and \(\beta _{2}\).
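These performance measures can be computed from the per-replicate estimates as sketched below. The function name `performance` and the toy inputs are hypothetical. The coverage calculation assumes normal-based intervals with a 1.96 multiplier, which is a simplification when intervals are actually built from a t distribution after pooling.

```python
import math
from statistics import mean, stdev


def performance(estimates, std_errors, true_value, z=1.96):
    """Summarise simulation replicates for a single coefficient."""
    n = len(estimates)
    rbias = 100.0 * (mean(estimates) - true_value) / true_value
    se_cal = math.sqrt(sum(se ** 2 for se in std_errors) / n)
    se_emp = stdev(estimates)  # Monte Carlo standard deviation of the estimates
    rmse = math.sqrt(sum((est - true_value) ** 2 for est in estimates) / n)
    covered = sum(
        1 for est, se in zip(estimates, std_errors)
        if abs(est - true_value) <= z * se  # normal-based interval (assumption)
    )
    return {
        "rbias_pct": rbias,
        "se_cal": se_cal,
        "se_emp": se_emp,
        "rmse": rmse,
        "cover_pct": 100.0 * covered / n,
    }


# Toy example with four replicates and a true coefficient of 1.
summary = performance(
    estimates=[0.9, 1.0, 1.1, 1.2],
    std_errors=[0.1, 0.1, 0.1, 0.1],
    true_value=1.0,
)
print(summary)
```

Comparing `se_cal` with `se_emp` indicates whether a method's estimated standard errors track the actual sampling variability, which is what the coverage figure ultimately reflects.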
Computational settings
Simulations and analyses were performed using R statistical software, version 3.3.0 [30]. The imputation procedure was implemented within the mice R package, version 2.25 [31]. Heckman’s one-step model estimator was supplied by the functions semiParBIV() and copulaSampleSel() of the GJRM R package, version 0.1-1, for the binary and continuous cases, respectively [19, 32]. Our code is available in the supplementary materials (S1 for binary outcomes and S2 for continuous outcomes). Heckman’s two-step model estimator was fitted using the function heckit() of the sampleSelection package, version 1.0-4 [14].
Application to illustrative examples
The impact of treatment group on adherence was assessed using a probit model adjusted for severity score. Adherence was missing for 115 (21%) patients. There were 51 and 375 non-adherent and adherent patients, respectively. The missing data mechanism of adherence was strongly suspected to be MNAR. The severity score was missing for 114 (21%) patients, and its missing data mechanism was suspected to be MAR. Four methods were applied: CCA, HEml, MIHEml and MI. The standard MI approach used a MICE procedure with a linear imputation model for severity score and a probit imputation model for adherence. It was included to assess the performance of a readily available, widely used but misspecified approach. The missing data mechanisms assumed by each method are presented in Table 7. The HEml and MIHEml selection equations for adherence included treatment group, severity score and antibiotic treatment. The latter binary variable was chosen to fulfil the exclusion-restriction criterion. The MAR variables were imputed using linear and probit regression models for continuous and binary variables, respectively. With MIHEml, the indicator of adherence missingness was included in the severity score imputation model. The MICE procedure was applied for 20 iterations, and \(m=100\) datasets were generated. Finally, Rubin’s rules for small samples were applied.
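Table 7 Estimation of the predictive value of the randomisation group and severity score
Method (% of cases analysed) | Assumed mechanism for adherence | Assumed mechanism for severity score | Oseltamivir-Placebo: estimate | SE | Zanamivir-Placebo: estimate | SE | Severity score: estimate | SE |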
CCA (66%) | MCAR | MCAR | 0.243 | 0.217 | 0.061 | 0.206 | 0.021 | 0.163 |
MI (100%) | MAR | MAR | 0.380 | 0.205 | 0.055 | 0.183 | 0.035 | 0.163 |
HEml (79%) | MNAR | MCAR | 0.272 | 0.268 | 0.077 | 0.223 | 0.048 | 0.223 |
MIHEml (100%) | MNAR | MAR | 0.396 | 0.188 | 0.105 | 0.182 | 0.123 | 0.181 |
The results are presented in Table 7. The reference group for treatment is the combination group. The severity score coefficient corresponds to an increase of 20 units. CCA includes only 359 cases, i.e., 66% of the entire dataset. Observations with missing predictors are ignored in the HEml analyses, i.e., only 427 (79%) cases are retained. MI and MIHEml consider all observations. As expected, MI and MIHEml have lower standard errors than CCA and HEml. The coefficients estimated for Oseltamivir-Placebo with MI and MIHEml are similar and higher than those obtained with CCA or HEml. The effect of Oseltamivir-Placebo reached significance with MIHEml, thus enabling the assessment of the impact of Oseltamivir-Placebo on adherence. The estimated coefficients of Zanamivir-Placebo and severity score are similar for CCA and MI, slightly higher for HEml and higher for MIHEml. Not surprisingly, the proportions of imputed values corresponding to the non-adherent outcome are 13% and 47% for MI and MIHEml, respectively, indicating that missing values on self-reported adherence are more likely to correspond to non-adherent patients.
We also challenged the MAR assumption for the missing data mechanism of the severity score. To do so, we performed a new MICE procedure specifying two Heckman’s imputation models, one for adherence and one for severity score. This involved defining selection and outcome models for severity score. The estimated effects were similar: 0.376 (0.186) and 0.096 (0.179) for Oseltamivir-Placebo and Zanamivir-Placebo, respectively. These results suggest a weak impact of a possible MNAR mechanism for severity score.
Discussion
The first aim of this work was to propose a single approach that addresses both binary outcomes with an MNAR mechanism and missing predictors with a MAR mechanism. According to our simulation results, for MNAR outcomes, only MIHEml and HEml were unbiased. These simulations were, however, generated under a true Heckman’s model. We therefore also generated MNAR outcomes using a logistic selection model that directly included \(Y\) as a predictor, i.e., an MNAR mechanism that is not compatible with Heckman’s model. Although the results remained biased in this setting, the use of MIHEml reduced the biases compared to CCA. Because it is not possible to confirm the validity of Heckman’s model from the observed data alone [17, 33], it is reassuring that the developed approach appears to at least reduce the biases under an MNAR mechanism even when Heckman’s assumptions do not hold.
To thoroughly evaluate our approach in a MICE procedure, we simulated missing data on predictors following two scenarios: one where the MAR mechanism for \(X_{2}\) depended on the fully observed \(X_{1}\) and \(X_{3}\), and one where the mechanism depended on \(X_{1}\) and \(Y\). For these two scenarios, Heckman’s model (HEml) used only cases with complete predictors to estimate the model parameters, i.e., it did not use all available information. This loss of information produced larger standard errors, particularly for \(\beta _{1}\) and only slightly for \(\beta _{2}\). This result is not surprising because the information lost by ignoring patients with missing \(X_{2}\) primarily affected \(X_{1}\). In terms of bias, the first scenario presented similar results to those obtained without missing \(X_{2}\) data. In the second scenario, where the missing data mechanism for \(X_{2}\) also depended on \(Y\), MIHEml outperformed all the other methods.
The second aim was to validate the proposition of Galimard et al. [18] using a one-step ML estimator for continuous outcomes. Our simulations showed that MIHEml performs slightly better than MIHE2steps in terms of standard errors for MNAR missing outcomes.
Although our method performs well in the presence of a MAR mechanism, i.e., when \(\rho =0\), it is preferable to determine whether the missing data mechanism is most likely to be MNAR or MAR, to avoid modelling a selection equation unnecessarily. Indeed, for \(\rho =0\) the standard errors are greater than those of the standard approaches. Unfortunately, it is not possible to distinguish between MAR and MNAR from the observed data alone [17, 33]. Hence, sensitivity analyses are often performed to evaluate departures from MAR. Some authors have proposed a pattern-mixture model using \(\delta \)-adjustment, i.e., systematically adding a certain increment \(\delta \) to the linear predictors of the imputed values. Despite its simplicity, van Buuren considered this method to be a powerful approach for evaluating the MAR mechanism by varying \(\delta \) [2, 8, 17]. This method identifies two patterns: one for the observed data and one for the unobserved data. Missing values are imputed conditionally on the observed data with an additional shift parameter \(\delta \), which is the magnitude of the departure from MAR. The model for the observed data therefore differs from the model for the missing data. Similarly, MIHEml can be viewed as a method that applies, in the imputation model, a shift or correction term for the selection bias that is specific to each observation \(i\). Precisely, as \(E(Y_{i}|R_{yi}=0)=X_{i}\beta +\rho \sigma _{\varepsilon }\left (-\phi \left (X^{s}_{i}\beta ^{s}\right)\right)/\Phi \left (-X^{s}_{i}\beta ^{s}\right)\), MIHEml uses a selection correction term that can be considered as an individual \(\delta _{i}\) for each patient (adjusted on the parameters of the selection equation). In this sense, we obtained a more precise \(\delta \)-adjustment approach.
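The individual shift implied by the conditional expectation above can be evaluated numerically. The sketch below is illustrative only: the function name `heckman_shift` and the example values of the selection linear predictor are hypothetical, and it simply evaluates \(\delta _{i}=E(Y_{i}|R_{yi}=0)-X_{i}\beta \) for a continuous outcome.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def heckman_shift(selection_lin_pred, rho, sigma_eps=1.0):
    """Individual shift delta_i = E(Y_i | R_yi = 0) - X_i beta.

    selection_lin_pred is X_i^s beta^s, the selection linear predictor.
    """
    z = selection_lin_pred
    return -rho * sigma_eps * STD_NORMAL.pdf(z) / STD_NORMAL.cdf(-z)


# Three hypothetical patients with low, middling and high chances of being observed.
shift_low = heckman_shift(-1.5, rho=0.6)
shift_mid = heckman_shift(0.0, rho=0.6)
shift_high = heckman_shift(1.5, rho=0.6)
shift_mar = heckman_shift(0.0, rho=0.0)
print(shift_low, shift_mid, shift_high, shift_mar)
```

With a positive \(\rho \), the shift is negative and grows in magnitude as the probability of being observed increases: a patient who was very likely to respond but did not is imputed further from \(X_{i}\beta \). When \(\rho =0\) the shift vanishes and the imputation reduces to a MAR model, whereas a fixed \(\delta \) applies the same shift to every patient.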
The construction of the selection model follows strict rules [14, 23]. In our experience, the exclusion-restriction criterion should be strictly respected. Indeed, Heckman’s model can inflate standard errors because of collinearity between the regressors and the IMR, and this problem is exacerbated when the exclusion-restriction criterion does not hold [34]. Moreover, MICE (or full conditional specification) follows certain rules. Each variable with missing data requires a specific conditional imputation model that is generally defined by a link function and a linear predictor with its set of predictors. Theoretically, imputation models should be derived from the global joint distribution of the variables, including the outcome [2, 35], and misspecification may result in biased parameter estimates [36]. Despite recent work in simple cases, the theoretical properties of MICE are not fully understood [25, 26, 28, 37]. Nevertheless, it performs well in practice, particularly when the conditional imputation models are well suited to the substantive model. The efficiency of the MICE approach is generally validated by simulation studies, and the results appear robust even when the compatibility between the full conditional distributions and the global joint distribution is not proven [2]. Although no simulation study can be exhaustive, our simulations suggest that multiple imputation using Heckman’s model and its use in a MICE process are valid and could be useful when the MNAR mechanism on the outcome is compatible with Heckman’s model. To avoid the bivariate normality assumption of Heckman’s model, Marchenko and Genton [38] proposed a Heckman’s model with a bivariate Student distribution for the error terms. Ogundimu and Collins [39] developed an imputation model using this selection-t model. Unfortunately, their imputation model is only available for continuous outcomes. We compare the proposal of the current paper for continuous outcomes with those of Ogundimu and Collins [39] and Galimard et al. [18] in Additional file 2. Not surprisingly, the results were similar. Indeed, t-distributions are very close to a normal distribution for high degrees of freedom. In this paper, we focused on frequentist sample selection approaches within a MICE procedure. Nevertheless, the Bayesian posterior distribution of sample selection models can be obtained using Gibbs sampling and data augmentation [40, 41]. Such a fully Bayesian framework could improve the imputation when based on small samples; this could be evaluated in further research.
Finally, our simulation study does not explore MNAR mechanisms affecting both covariates and outcomes. Such a situation requires specifying a Heckman’s imputation model (i.e., selection and outcome models) for each MNAR variable. Nevertheless, we used this type of approach in our example analysis to evaluate the departure from MAR for the missing predictors.