Notation
Suppose K random variables are intended to be observed on N subjects. We use subscripts i and j to index subjects and variables respectively (i = 1,…,N; j = 1,…,K). Let x = (x_ij) denote an (N × K) matrix whose (i,j) element is x_ij. Column j of matrix x is denoted by x_j. It is assumed that the rows of matrix x are independent and identically distributed draws from a probability distribution with probability distribution function p(X ∣ θ), where θ is an unknown parameter.
In practice some subjects have missing observations on up to K - 1 variables, and we write x_j = (x_j^obs, x_j^mis) for any j, x = (x^obs, x^mis) and p(x ∣ θ) = p(x^obs, x^mis ∣ θ), with superscripts obs and mis denoting the observed and missing data respectively. In keeping with the assumptions of joint modelling imputation and chained equations imputation, the missing data mechanism is assumed to be ignorable for Bayesian inference [20] p. 120, so that inferences about θ can be based on the marginal observed data posterior p(θ ∣ x^obs).
Joint modelling imputation
Joint modelling imputation requires the specification of a parametric joint model p(x^obs, x^mis ∣ θ) for the complete data and a prior distribution p(θ) for the parameter θ. Imputations are independent draws from the posterior predictive distribution of the missing data given the observed data, p(x^mis ∣ x^obs) [2] p. 105, which under the ignorability assumption is

p(x^mis ∣ x^obs) = ∫ p(x^mis ∣ x^obs, θ) p(θ ∣ x^obs) dθ.
Therefore, to draw from this posterior predictive distribution, first draw θ* ∼ p(θ ∣ x^obs) and then draw x^mis* ∼ p(x^mis ∣ x^obs, θ*) [2] p. 105. When it is difficult to draw from the observed data posterior p(θ ∣ x^obs), Markov chain Monte Carlo methods can be used. For example, the data augmentation algorithm of Tanner and Wong [21] draws missing values from the posterior predictive distribution x^mis* ∼ p(x^mis ∣ x^obs, θ*) and then draws θ from the complete data posterior θ* ∼ p(θ ∣ x^obs, x^mis*), where * denotes the most recently drawn value of θ or x^mis. Upon convergence this produces a draw from the joint posterior distribution p(θ, x^mis ∣ x^obs).
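The alternating draws of the data augmentation algorithm can be illustrated with a minimal sketch. The following Python code assumes, purely for illustration, a univariate N(θ, 1) complete-data model with a flat prior on θ (this is not the model used in this paper); the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy complete-data model: x_i ~ N(theta, 1) with a flat prior on theta.
x = rng.normal(5.0, 1.0, size=100)
mis = np.zeros(100, dtype=bool)
mis[:30] = True                      # first 30 values missing (MCAR)

x_imp = x.copy()
x_imp[mis] = x[~mis].mean()          # crude starting values

for t in range(1000):
    # P-step: draw theta* ~ p(theta | x_obs, x_mis*) = N(sample mean, 1/N).
    theta = rng.normal(x_imp.mean(), 1.0 / np.sqrt(x_imp.size))
    # I-step: draw x_mis* ~ p(x_mis | x_obs, theta*) = N(theta*, 1).
    x_imp[mis] = rng.normal(theta, 1.0, size=mis.sum())
```

Upon convergence, the pair (theta, x_imp[mis]) constitutes a draw from the joint posterior p(θ, x^mis ∣ x^obs).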
Chained equations imputation
For every incomplete variable the chained equations algorithm requires an imputation model, typically a univariate regression model, and an accompanying prior distribution for the model's parameter. Let X_-j = (X_1,…,X_j-1, X_j+1,…,X_K)^T denote the vector of random variables excluding variable X_j, and x_-j the submatrix of x corresponding to variables X_-j. We write p(x_j ∣ x_-j, ψ_j) for the probability distribution function of the imputation model for variable X_j and p(ψ_j) for the prior distribution of the unknown parameter ψ_j.
Chained equations draws the imputations using an iterative algorithm, typically with 10 to 20 iterations [15]. To start off, the missing values of each incomplete variable are replaced by its mean or by a random sample of its observed values. Suppose, without loss of generality, that variables X_1,…,X_R (R ≤ K) are incomplete and variables X_R+1,…,X_K are fully observed. Given the imputations from the previous iteration, iteration t of the chained equations algorithm applies the following two draws to each incomplete variable X_j (j = 1,…,R) in turn [18]: the parameter ψ_j^(t) is drawn from the posterior distribution proportional to p(ψ_j) p(x_j^obs ∣ x_-j, ψ_j), and the missing values x_j^mis(t) are then drawn from the posterior predictive distribution p(x_j^mis ∣ x_-j, ψ_j^(t)), where x_-j contains the most recently imputed values of the other variables. The imputations from the last iteration form the imputed dataset. The whole iterative algorithm is repeated to obtain further imputed datasets.
Simulation study
We conducted a simulation study to explore the consequences for chained equations imputation when the conditional models were compatible with the same joint model but the non-informative margins condition of Proposition 1 was not satisfied. In particular, we looked for evidence of “order effects”, where the distribution from which the final imputed values of the variables were drawn differed according to the order in which the variables were updated in the chained equations sampler. If the chained equations algorithm imputes all variables from the predictive distribution of the missing data implied by a specific joint model, then order effects cannot occur [
22]. Thus, the existence of order effects implies that the chained equations algorithm is not equivalent to imputing from any joint model.
The simulation study was based on a general location model, discussed in the Theoretical results section below, with one incomplete binary variable Y and two continuous variables W_1 and W_2, where W_1 was also incompletely observed. We compared joint modelling imputation under the general location model, considered as a gold standard, with the chained equations algorithm that imputes the binary variable Y under a logistic regression model and the continuous variable W_1 under a normal linear regression model.
We generated 500 datasets, each with a sample size of 100. For each dataset, the rows were independent, identically distributed realizations of the general location model Y ∼ Bernoulli(3/10), W_1 ∣ Y ∼ N(10 + βY, 9) and W_2 ∣ W_1, Y ∼ N(9 + 8/9 + (1/9)W_1 + βY, 8 + 8/9). The data model was a simplified version of data that can occur in the medical literature [23]. The simulation study was repeated with β, the regression coefficient for covariate Y, set to 1 and then 3. The analysis of interest was the normal linear regression of W_2 on W_1 and Y. To ensure that any observed order effects could only be due to the failure of the non-informative margins condition we considered the simplest setting, that of data missing completely at random [20] p. 16, and set the values of Y and W_1 to be missing for the first 50 individuals in each dataset. Below we describe the joint modelling imputation procedure and the chained equations algorithm that were separately applied to the same 500 datasets.
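The data-generating step above can be sketched as follows; the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta = 100, 1.0                   # beta was set to 1 and then 3 in the study

y = rng.binomial(1, 3 / 10, n)                      # Y ~ Bernoulli(3/10)
w1 = rng.normal(10 + beta * y, 3.0)                 # W1 | Y ~ N(10 + beta*Y, 9)
w2 = rng.normal(9 + 8 / 9 + (1 / 9) * w1 + beta * y,
                np.sqrt(8 + 8 / 9))                 # W2 | W1, Y

# MCAR mechanism: Y and W1 missing for the first 50 individuals.
y_mis = y.astype(float)
w1_mis = w1.copy()
y_mis[:50] = np.nan
w1_mis[:50] = np.nan
```

Note that NumPy's normal generator is parameterized by the standard deviation, hence 3 = sqrt(9) and sqrt(8 + 8/9) for the two conditional variances.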
We used the data augmentation algorithm (as described under the heading "Joint modelling imputation") to perform joint modelling imputation under the general location model and the joint prior given in the general location example (see example 4 of the Results), setting hyperparameters τ = ν = 1/2 and κ = 3/2. The number of imputed datasets generated, the burn-in period and the number of iterations between imputed datasets were each set to 100. The analysis model was applied to each dataset separately and the mean of the multiple estimates of β, the coefficient for Y, was calculated.
In the (standard) chained equations algorithm, a logistic regression model for Y given W_1 and W_2 was first fitted to those rows of the dataset in which Y was observed. Let ψ̂_Y denote the maximum likelihood estimate of the parameters of this model and V̂_Y its associated estimated variance-covariance matrix. A draw was then made from the multivariate normal approximation N(ψ̂_Y, V̂_Y) and used to impute the missing Y values. The continuous variable W_1 was imputed using the linear regression model W_1 ∣ Y, W_2 ∼ N(λ + ξY + ϕW_2, ω) and prior distribution p(λ, ξ, ϕ, ω) ∝ ω^(-3/2).
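The logistic-imputation step can be sketched as follows. To keep the sketch self-contained it uses a hand-rolled Newton-Raphson fit and illustrative simulated data; it is not the authors' exact implementation, and all names are our own.

```python
import numpy as np

rng = np.random.default_rng(3)

def logit_mle(X, y, iters=25):
    """Maximum likelihood fit of a logistic regression by Newton-Raphson.
    Returns the estimate and the inverse observed information, which serves
    as the estimated variance-covariance matrix."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        H = X.T @ (X * (p * (1 - p))[:, None])    # observed information
        b = b + np.linalg.solve(H, X.T @ (y - p))
    return b, np.linalg.inv(H)

# Illustrative data with Y missing for the first 50 rows.
n = 200
w1 = rng.normal(10.0, 3.0, n)
w2 = rng.normal(11.0, 3.0, n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-3.0 + 0.3 * w1))))
obs = np.arange(n) >= 50

# Fit to rows with Y observed, draw from the multivariate normal
# approximation, and impute the missing Y from the fitted probabilities.
X = np.column_stack([np.ones(n), w1, w2])
psi_hat, V = logit_mle(X[obs], y[obs])
psi_star = rng.multivariate_normal(psi_hat, V)
p_mis = 1.0 / (1.0 + np.exp(-X[~obs] @ psi_star))
y_imp = y.astype(float)
y_imp[~obs] = rng.binomial(1, p_mis)
```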
To start off the chained equations algorithm the missing values of Y and W_1 were replaced with a random sample of their observed values. We augmented the chained equations algorithm so that, within each iteration, we fitted the analysis model immediately after updating the binary variable Y and again immediately after updating the continuous variable W_1. The simulation study focused on systematic differences between the two resulting estimates of β. Given the imputations y^(t-1) and w_1^(t-1) from the previous iteration, iteration t of the augmented chained equations algorithm consisted of the following steps:
1. Generate y^(t) by imputing values for the missing binary observations, conditioning on w_1^(t-1) and w_2.
2. Linearly regress w_2 on w_1^(t-1) and y^(t) and store the estimate of the coefficient of Y, denoted by β̂_Y^(t).
3. Generate w_1^(t) by imputing values for the missing continuous observations, conditioning on y^(t) and w_2.
4. Linearly regress w_2 on w_1^(t) and y^(t) and store the estimate of the coefficient of Y, denoted by β̂_W^(t).
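The four steps above can be sketched as a single loop. To keep the sketch short it replaces the logistic imputation of Y with a normal linear imputation of a continuous stand-in for Y, so it illustrates only the bookkeeping (impute, refit, store, repeat), not the exact models of the study; all names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

def draw_lm(y, X):
    # One posterior draw of (beta, sigma2) under p(beta, sigma2) ~ 1/sigma2.
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    bh = XtX_inv @ X.T @ y
    resid = y - X @ bh
    s2 = resid @ resid / rng.chisquare(n - p)
    return rng.multivariate_normal(bh, s2 * XtX_inv), s2

def coef_of_y(w2, w1_cur, y_cur):
    # OLS fit of the analysis model W2 ~ W1 + Y; return the Y coefficient.
    A = np.column_stack([np.ones(len(w2)), w1_cur, y_cur])
    return np.linalg.lstsq(A, w2, rcond=None)[0][2]

n = 100
y = rng.normal(0.0, 1.0, n)          # continuous stand-in for the binary Y
w1 = rng.normal(10.0 + y, 3.0)
w2 = rng.normal(10.0 + w1 / 9.0 + y, 3.0)
mis = np.arange(n) < 50
zy, zw = y.copy(), w1.copy()
zy[mis] = rng.choice(y[~mis], mis.sum())
zw[mis] = rng.choice(w1[~mis], mis.sum())

est_after_y, est_after_w = [], []
for t in range(110):
    # Step 1: re-impute the missing Y values given (W1, W2).
    X = np.column_stack([np.ones(n), zw, w2])
    b, s2 = draw_lm(zy[~mis], X[~mis])
    zy[mis] = X[mis] @ b + rng.normal(0.0, np.sqrt(s2), mis.sum())
    # Step 2: fit the analysis model and store the coefficient of Y.
    est_after_y.append(coef_of_y(w2, zw, zy))
    # Step 3: re-impute the missing W1 values given (Y, W2).
    X = np.column_stack([np.ones(n), zy, w2])
    b, s2 = draw_lm(zw[~mis], X[~mis])
    zw[mis] = X[mis] @ b + rng.normal(0.0, np.sqrt(s2), mis.sum())
    # Step 4: refit the analysis model and store the coefficient of Y again.
    est_after_w.append(coef_of_y(w2, zw, zy))

# Order-effect estimate: difference of the post-burn-in means.
D = np.mean(est_after_y[10:]) - np.mean(est_after_w[10:])
```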
The chained equations algorithm was implemented with 10010 iterations. The first 10 iterations were regarded as burn-in and the estimates from these iterations discarded. The remaining 10000 estimates of β̂_Y^(t) were averaged, and likewise for β̂_W^(t). We denote these means by β̄_Y and β̄_W and their difference by D = β̄_Y - β̄_W. The quantity D can be interpreted as an estimate of the order effect for imputation in one dataset. We estimated the (Monte Carlo) standard error of D using the batch-means method, a method for computing standard errors for correlated output [24] p. 124, and calculated a 95% confidence interval from this.
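The batch-means estimate of a Monte Carlo standard error can be sketched as follows, here applied to an illustrative AR(1) sequence standing in for the correlated sampler output; the batch count of 50 is an arbitrary choice for the sketch.

```python
import numpy as np

def batch_means_se(x, n_batches=50):
    """Monte Carlo standard error of the mean of a correlated sequence by the
    batch-means method: split the sequence into batches, compute batch means,
    and use their sample variability to estimate the variance of the mean."""
    x = np.asarray(x, dtype=float)
    m = len(x) // n_batches
    batch_means = x[: m * n_batches].reshape(n_batches, m).mean(axis=1)
    return batch_means.std(ddof=1) / np.sqrt(n_batches)

# Illustrative correlated output: an AR(1) sequence of 10000 values.
rng = np.random.default_rng(4)
d = np.empty(10_000)
d[0] = 0.0
for t in range(1, d.size):
    d[t] = 0.8 * d[t - 1] + rng.normal()

se = batch_means_se(d)
ci = (d.mean() - 1.96 * se, d.mean() + 1.96 * se)
```

Because the sequence is positively autocorrelated, the batch-means standard error is noticeably larger than the naive standard error that treats the draws as independent.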
Linear discriminant analysis is an alternative way to estimate a logistic regression [25, 26]. A modified chained equations algorithm that applies linear discriminant analysis to all individuals with observed Y has been proposed as an alternative way to impute the binary variable Y [27]. Because the linear discriminant likelihood is the joint distribution of Y, W_1 and W_2, this model has the advantage of recovering some information about ψ_Y from the W margin. We repeated the simulation study using this modified chained equations algorithm.
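The linear discriminant class probability used to impute Y can be sketched as follows. For brevity this is a maximum-likelihood plug-in version with illustrative data, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(5)

def lda_posterior(W, y, W_new):
    """P(Y = 1 | W) under linear discriminant analysis: each class of Y has a
    multivariate normal distribution for W with a pooled covariance matrix."""
    pi1 = y.mean()
    W0, W1 = W[y == 0], W[y == 1]
    mu0, mu1 = W0.mean(axis=0), W1.mean(axis=0)
    S = ((len(W0) - 1) * np.cov(W0.T)
         + (len(W1) - 1) * np.cov(W1.T)) / (len(y) - 2)
    Sinv = np.linalg.inv(S)
    # The linear discriminant score is the log-odds of class 1 versus class 0.
    a = Sinv @ (mu1 - mu0)
    c = np.log(pi1 / (1 - pi1)) - 0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0)
    return 1.0 / (1.0 + np.exp(-(W_new @ a + c)))

# Illustrative data with Y missing for the first 50 individuals.
n = 300
y = rng.binomial(1, 0.3, n)
W = np.column_stack([rng.normal(10.0 + y, 3.0), rng.normal(11.0 + y, 3.0)])
obs = np.arange(n) >= 50

p1 = lda_posterior(W[obs], y[obs], W[~obs])
y_imp = y.astype(float)
y_imp[~obs] = rng.binomial(1, p1)
```

Fitting the class means and pooled covariance uses the W values of all individuals with observed Y, which is how this model recovers information about ψ_Y from the W margin.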
We assessed the sensitivity of our results by repeating the simulation study using different specifications. For joint modelling imputation we increased the number of imputed datasets generated, the burn-in period and the number of iterations between imputed datasets to 250. For the standard and modified chained equations procedures we (1) increased the burn-in period of the chained equations sampler to 1000 iterations and (2) sampled every 50th iteration, thereby reducing serial correlation (with a burn-in period of 10 iterations). To check that our results were not dependent upon our choice of prior distributions we repeated the simulation study with improper imputation procedures; that is, using maximum likelihood estimates of ψ_j instead of Bayesian draws of ψ_j from its posterior distribution. Lastly, we also repeated the simulation study with a sample size of 1000 observations.