Background
It is not unusual to encounter missing data in epidemiological studies [1, 2]. Its presence affects the analysis of the data, and the methods employed in handling missing data can affect the results of the analysis, potentially compromising the conclusions drawn from those results. Types of missingness have been well documented [3]. The popular classifications are “missing completely at random” (MCAR – the probability of missingness is independent of both observed and unobserved data); “missing at random” (MAR – the probability of missingness is independent of the unobserved data but may depend on the observed data); and “not missing at random” (MNAR – the probability of missingness depends on the unobserved values themselves, and possibly also on the observed data).
Commonly, missing data is managed by simply dropping all cases that are not fully measured. However, such a complete case analysis can introduce bias into the results and, in some cases, wrong conclusions can be drawn [4]. While this approach is acceptable when the incomplete cases do not exceed 5% [5] and the missingness can be classified as MCAR, when these conditions are not met, alternative means of dealing with the missing data need to be considered. One such method that is increasingly being used is multiple imputation (MI) [6].
Imputation of missing data on a variable involves replacing the missing value by a value drawn from an estimate of the distribution of the variable [7]. Multiple imputation does not replace the missing item with a single predicted value, but rather imputes several values for each missing data item. These multiple imputations, together with the addition of random error to each imputed item, ensure that the variation in the imputed values follows the true distribution of the original measure more closely. Multiple imputation is successfully applied to data that is MAR and yields unbiased results with accurate estimates of the standard errors [7]. Unfortunately, the missingness mechanism is not usually fully known and is often a combination of more than one mechanism. However, by ensuring that the imputation model is more general than the analysis model, multiple imputation will usually produce sound results [8-11]. This is achieved by including, in the imputation model, variables that are related to the incomplete variables as well as those related to their missingness; the outcome variable; and all interactions that will be examined in the analysis.
Rubin [12] notes that the need to include all possibly relevant predictors in the imputation model is demanding in practice. If interactions are selected a priori, it is a straightforward exercise to include them in the imputation model [9]. If, on the other hand, the relevant interactions have not been identified beforehand, then ideally all possible interactions should be included in the imputation model. This is neither practical nor, in some cases, possible [13, 14], particularly when the number of variables is large. While model development with multiple imputation has been documented [13, 15-17], none of these studies addresses the issue of how to include, in the imputation model, interactions that are not known a priori. Developing a model with many variables in the presence of missing data, when the predictor variables include not only main effects but also interactions that are not pre-selected, presents a challenge that is not extensively reported in the literature. On the one hand, the data is needed to identify the relevant interactions; on the other hand, the interactions are needed to impute the data. This paper addresses this dilemma and suggests a method by which model development, including interactions, and analysis can be carried out when missing data is imputed using multiple imputation.
We propose to identify the relevant interactions using a single complete set of data generated using the expectation-maximization (EM) algorithm for covariance matrices and then include these interactions in the imputation model.
Methods
The data
Cross-sectional data were used to investigate the relationship between environmental, socio-economic and genetic factors and the respiratory health of children in the Durban South region of KwaZulu-Natal, South Africa. The data comes from research commissioned by the eThekwini Municipality, Durban, South Africa in 2004 to investigate possible causal effects of environmental and lifestyle factors on respiratory health in children [18]. Ethical approval was obtained from the Ethics Committee of the University of KwaZulu-Natal (Ref No.: E117/03). All the legal guardians of the child participants in this study gave written informed consent, participated voluntarily, and had the right to withdraw at any stage.
After an asthma symptoms screening survey, a sample of 423 primary school children was invited to participate in the study, and multiple questionnaires were required to be completed for each participant. Of the 423 children included in the study, the 382 who were deemed to have reliable data, as well as complete data on the outcome variable (asthma severity), were used for this analysis. The removal of the remaining children was not found to introduce any selection bias.
Most of the predictor variables suffered from missing data. The missingness mechanism was examined prior to imputing the missing values. For each incomplete variable, an indicator variable was created and Chi-square analyses were performed to test whether either the incomplete variable or its missingness was related to the observed values of other variables.
Selection of interactions for the imputation model
In order to ensure that the imputation model is at least as complex as the analysis model, and that the assumption of MAR is plausible, it is necessary to include the outcome variable and all likely predictors for the analysis model in the imputation model. The selection of the interaction terms presents difficulties [16, 17]. In line with the suggestion made by White et al. [16], we generated a single complete set of data using the EM algorithm for covariance matrices. The EM algorithm is an iterative procedure that can be used to create a complete data set in which all missing values are replaced by maximum likelihood (ML) values that are asymptotically unbiased. The process starts by replacing each missing value with an estimate calculated from a regression equation in which all the other variables are predictors. Once all the missing values have been replaced, a variance-covariance matrix and a vector of means are calculated from the completed data. New regression equations are then formed to predict a new set of estimates for the missing values. This process is repeated until the variances, covariances and means converge, thus producing ML estimates of the parameters.
The complete data set generated from this process is then used for model development and the identification of interactions. In our application, convergence was achieved in 36 iterations.
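The EM scheme described above can be sketched for the multivariate normal case. This is a minimal illustration only, not the software used in the study: the regression of the missing variables on the observed ones is expressed through the conditional mean of a multivariate normal, and the data, starting values and tolerance are arbitrary choices.

```python
import numpy as np

def em_impute(X, max_iter=200, tol=1e-6):
    """EM for a multivariate normal sample with missing values (NaN).

    Returns ML mean/covariance estimates and a single completed data
    set in which each missing entry is replaced by its conditional
    expectation given the observed entries in its row.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    # Start from the means/covariances of the available data.
    mu = np.nanmean(X, axis=0)
    Xf = np.where(miss, mu, X)
    sigma = np.cov(Xf, rowvar=False)
    for _ in range(max_iter):
        correction = np.zeros((p, p))
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            # Conditional mean of missing given observed: the
            # "regression equation" of the text.
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            beta = S_mo @ np.linalg.pinv(S_oo)
            Xf[i, m] = mu[m] + beta @ (Xf[i, o] - mu[o])
            # Conditional covariance enters the M-step sums.
            correction[np.ix_(m, m)] += sigma[np.ix_(m, m)] - beta @ S_mo.T
        mu_new = Xf.mean(axis=0)
        sigma_new = (Xf - mu_new).T @ (Xf - mu_new) / n + correction / n
        converged = (np.max(np.abs(mu_new - mu)) < tol and
                     np.max(np.abs(sigma_new - sigma)) < tol)
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma, Xf
```

The completed matrix `Xf` plays the role of the single EM-generated data set used below for model development.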
Multiple imputation
The imputation of multiple data sets was carried out using two different algorithms – multivariate normal imputation (MVNI) and fully conditional specification (FCS).
MVNI – This imputation algorithm, adopted by the NORM software [19], assumes the complete data (observed and missing values) follows a multivariate normal distribution. NORM uses a data augmentation (DA) procedure to impute multiple sets of data.
This two-step process uses the ML estimates from EM as starting values for the parameters. In the first step, DA randomly imputes the missing data using the assumed values of the parameters. In the second step, new parameter estimates are drawn from a Bayesian posterior distribution based on the observed and imputed data. The repetition of these two steps results in a Markov chain, and DA converges when the distribution of the parameter estimates stabilizes. Research has shown that DA nearly always converges in fewer cycles than does EM [8]. This enables one to estimate the cycle length, k, of DA as any number at least as large as the number of iterations needed for EM to converge.
In order to impute m sets of data, DA is run for N = mk iterations and the data set at the end of every kth cycle is saved.
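As a deliberately simplified illustration of the N = mk bookkeeping, the following toy sampler runs a data-augmentation chain for a univariate normal sample and saves the completed data at the end of every k-th cycle. The I-step and P-step here are stand-ins for NORM's multivariate procedure, not a reproduction of it.

```python
import numpy as np

def da_impute(y, m=5, k=10, seed=0):
    """Toy data-augmentation chain for a univariate normal sample
    with missing values (NaN), saving a completed data set at the
    end of every k-th of the N = m*k iterations.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    obs = y[~np.isnan(y)]
    n_mis = np.isnan(y).sum()
    n = y.size
    # Start from estimates based on the observed data (the EM analogue).
    mu, var = obs.mean(), obs.var()
    saved = []
    for it in range(1, m * k + 1):
        # I-step: impute the missing values given current parameters.
        y_mis = rng.normal(mu, np.sqrt(var), size=n_mis)
        y_full = np.concatenate([obs, y_mis])
        # P-step: draw new parameters from their posterior given the
        # completed data (vague prior; scaled inverse-chi-square for var).
        var = y_full.var() * n / rng.chisquare(n - 1)
        mu = rng.normal(y_full.mean(), np.sqrt(var / n))
        if it % k == 0:           # save every k-th cycle: m sets in total
            saved.append(y_full.copy())
    return saved
```

Spacing the saved data sets k cycles apart makes them approximately independent draws from the chain, which is the rationale for choosing k at least as large as the EM iteration count.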
Because the data contained categorical variables, some adjustments were necessary both before and after imputation. Before imputation, dummy coding was applied to all the categorical variables and to interaction product terms with more than two categories. After imputation, sensible rounding [20] was applied to these variables to prepare the data for analysis.
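The pre- and post-imputation handling of a categorical variable might look as follows; the variable name and the fractional “imputed” values are purely illustrative stand-ins for what a multivariate normal imputation could produce.

```python
import numpy as np
import pandas as pd

# Hypothetical three-category variable (names are illustrative).
df = pd.DataFrame({"area": ["north", "south", "central", "south"]})

# Before imputation: p categories -> p-1 dummy indicators.
dummies = pd.get_dummies(df["area"], prefix="area", drop_first=True).astype(float)

# ...multivariate normal imputation would fill any gaps here, and can
# produce fractional or out-of-range values for a dummy variable...
imputed = dummies.copy()
imputed.iloc[0] = [0.83, -0.07]   # stand-in for one imputed row

# After imputation: "sensible rounding" maps each imputed dummy back
# to 0/1 (clipping first so negative values round to 0, not -0).
rounded = imputed.clip(0, 1).round().astype(int)
```

The rounded indicators can then be mapped back to the original categories for the analysis stage.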
FCS – FCS, also termed “chained equations”, is the multiple imputation algorithm adopted by SPSS [21]. This is a more flexible approach to imputation in that it is designed to handle different types of variables (continuous, binary, categorical, ordinal) and does not assume multivariate normality of the data [6].
In practice, FCS involves running a series of regression models such that each variable with missing data is regressed on the other variables in the data set according to its distribution. So, for example, categorical variables will be modelled using logistic regression and continuous variables will be modelled using linear regression.
Imputation by FCS, as applied in SPSS, is also an iterative process that starts by imputing every missing value with random draws from the distribution of the non-missing values. Continuous variables are replaced with draws from a normal distribution and categorical variables are replaced with draws from a multinomial distribution. Azur et al. [22] refer to these replacements as “place holders”.
Each iteration involves the following steps:
- Set the “place holders” of one variable that suffers from missing values back to missing.
- Set up a regression equation, according to the distribution of the variable, with the observed values as the dependent variable and the other variables as independent variables.
- Replace the missing values of this variable with predictions from the regression equation.
- Repeat these steps for each variable that has missing values.
This forms one iteration of the process, and at each iteration the imputed values are updated. The process is repeated for a specified number of iterations, n, after which the data set is retained as one complete imputed data set. The number of iterations, n, is chosen so that the parameters of the regression models have stabilized, and is generally about ten [23]. This entire process is repeated until the required number, m, of imputed data sets has been generated.
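The iteration described above can be sketched for continuous variables. This hypothetical minimal version uses linear regression throughout, whereas a full FCS implementation such as that in SPSS would switch to logistic or multinomial regression for categorical columns.

```python
import numpy as np

def fcs_impute(X, n_iter=10, seed=0):
    """Minimal chained-equations sketch for continuous variables
    only (NaN marks missing). One call returns one completed data
    set; repeating with different seeds yields the m imputations.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    n, p = X.shape
    # "Place holders": random draws from each column's observed values.
    for j in range(p):
        obs = X[~miss[:, j], j]
        X[miss[:, j], j] = rng.choice(obs, size=miss[:, j].sum())
    for _ in range(n_iter):
        for j in range(p):
            if not miss[:, j].any():
                continue
            others = np.delete(np.arange(p), j)
            # Regress the observed part of column j on the other columns.
            A = np.column_stack([np.ones(n), X[:, others]])
            rows = ~miss[:, j]
            beta, *_ = np.linalg.lstsq(A[rows], X[rows, j], rcond=None)
            resid_sd = np.std(X[rows, j] - A[rows] @ beta)
            # Replace the missing entries with predictions plus noise,
            # so imputed values carry the residual variability.
            pred = A[miss[:, j]] @ beta
            X[miss[:, j], j] = pred + rng.normal(0, resid_sd, pred.size)
    return X
```

The added residual noise is what distinguishes this from deterministic regression imputation and preserves the variability of the imputed values.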
Each of the m data sets was analysed with ordinal regression – the chosen method of analysis – and the results were combined using Rubin’s rules [4]. Although it was widely thought in the past that as few as 3 imputed data sets are enough to obtain good results and inferences, newer studies have shown that this may, in fact, not be the case [24]. Studies have shown that there can be an important reduction in statistical power if m is small [9]. Graham et al. [24] completed a simulation study on the number of imputations needed to attain maximum power. Their recommendations for the number of imputations, m, as a function of the fraction of missing information are summarized in Table 1. On the basis of the percentage of data missing in this study (5.3%), 20 sets of data were imputed.
Table 1 Recommended number of imputations needed for varying fractions of missing data (Graham [9])
Fraction of missing data | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
Number of imputations | 20 | 20 | 40 | 100 | >100 |
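Rubin’s rules for pooling a single coefficient across the m analyses combine the average within-imputation variance with the between-imputation variance. A straightforward implementation is sketched below; for a vector of coefficients the same formulae apply element-wise.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Combine per-data-set estimates and their squared standard
    errors using Rubin's rules. Returns the pooled estimate, its
    total standard error, and the small-sample degrees of freedom.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    qbar = q.mean()                   # pooled point estimate
    w = u.mean()                      # within-imputation variance
    b = q.var(ddof=1)                 # between-imputation variance
    t = w + (1 + 1 / m) * b           # total variance
    r = (1 + 1 / m) * b / w           # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2   # Rubin's degrees of freedom
    return qbar, np.sqrt(t), df
```

The degrees of freedom df returned here is the diagnostic referred to later in the Discussion: large values indicate that the pooled estimates have stabilized.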
Model development
In order to develop the best model given the large number of variables available, the following three-stage process was followed. Firstly, all variables were purposefully selected as main effects. Secondly, in developing the full model, interactions were chosen one at a time in a stepwise manner, such that the interaction that made the biggest significant improvement to the fit was added to the model. For this process a cut-off p-value of 0.05 was used. Thirdly, when no further improvement to the fit was possible, backward elimination was carried out to find the smallest model that fitted as well as the full model. Here a p-value of 0.10 was used as the stopping criterion.
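The three-stage process can be expressed as a small selection skeleton. The fitting itself is abstracted into two user-supplied p-value callables, since the actual tests (here, for ordinal regression) depend on the software used; the cut-offs default to the 0.05 entry and 0.10 removal values given above.

```python
def build_model(main_effects, candidate_interactions, p_add, p_drop,
                enter=0.05, stay=0.10):
    """Skeleton of the three-stage model development process.

    p_add(model, term) / p_drop(model, term) return the p-value for
    adding a term to / dropping a term from the current model (e.g.
    from a likelihood-ratio test); they stand in for the actual
    fitting routine, which is not reproduced here.
    """
    # Stage 1: all variables enter as main effects.
    model = list(main_effects)
    # Stage 2: forward, stepwise selection of interactions.
    remaining = list(candidate_interactions)
    while remaining:
        pvals = {t: p_add(model, t) for t in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= enter:      # no significant improvement left
            break
        model.append(best)
        remaining.remove(best)
    # Stage 3: backward elimination from the full model.
    while len(model) > 1:
        pvals = {t: p_drop(model, t) for t in model}
        worst = max(pvals, key=pvals.get)
        if pvals[worst] < stay:       # every remaining term is needed
            break
        model.remove(worst)
    return model
```

In practice the two callables would wrap the ordinal regression fits; the skeleton only fixes the order of operations.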
Model development with multiple imputation
In the setting of the multiple imputation process, we propose two possible strategies that can be applied to carry out the model development.
Strategy 1
All three stages of the model development process - the selection of main effects, the identification of interactions, and the backward elimination - are performed on the initial data set generated from the EM parameter estimates. The variables and interactions identified by this process are incorporated into the imputation model. Interactions are treated differently, depending on which imputation method is used.
For MVNI as implemented in the NORM software, interactions with p categories are treated as categorical variables and coded into p-1 dummy variables before being added to the raw incomplete data. By way of an example: an interaction between gender (male/female) and smoking (yes/no) is broken down into separate categories – male/yes, male/no, female/yes and female/no – and binary coding (present/absent) is applied to the first three categories.
For FCS, the interaction is coded according to the possible categories. So, in the example above, male/yes = 1, male/no = 2, female/yes = 3 and female/no = 4.
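The two codings can be made concrete with the gender-by-smoking example; the column names and data are illustrative only.

```python
import pandas as pd

# Hypothetical binary variables (values are illustrative).
df = pd.DataFrame({"gender": ["male", "male", "female"],
                   "smoking": ["yes", "no", "yes"]})

# The gender x smoking interaction as a single four-level variable.
combo = df["gender"] + "/" + df["smoking"]

# FCS coding: one categorical column, levels numbered 1..4.
levels = ["male/yes", "male/no", "female/yes", "female/no"]
fcs_code = combo.map({lev: i + 1 for i, lev in enumerate(levels)})

# MVNI coding: p-1 = 3 binary present/absent dummies, with the last
# category (female/no) as the omitted reference.
mvni_code = pd.DataFrame({lev: (combo == lev).astype(int)
                          for lev in levels[:-1]})
```

Either coding adds the interaction to the data as ordinary columns, which is what allows it to enter the imputation model directly.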
The interactions, as coded in the two scenarios above, are simply treated as additional variables. This has been referred to as the ‘transform-then-impute’ method of dealing with interactions and, in a regression model that includes interactions, has been shown to yield good regression estimates, even though the imputed values may be inconsistent with one another. In contrast to this is the ‘impute-then-transform’ method, also known as passive imputation, which yields plausible-looking imputed values but biased regression estimates [25].
This imputation model is then used to produce the m sets of imputed data. These are analysed individually and the results are combined using Rubin’s rules [4].
Strategy 2
Using the initial EM-generated data set, the first two stages of the model development process - the selection of main effects and the identification of interactions - are completed. These are then incorporated into the imputation model as before and m sets of imputed data are produced. Analysis, followed by the third stage of model development (backward elimination), is then applied to each of these data sets. The final selection of variables for the model comprises those selected in at least 50% of the individual data sets. In the event that no variables satisfy this criterion, the condition can be relaxed to a lower percentage. Once these variables are established, analysis is carried out on each data set and the results are combined.
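The 50% retention rule of Strategy 2 amounts to a simple vote count across the m backward-eliminated models. A sketch, including the suggested relaxation when no term qualifies, might look like this:

```python
from collections import Counter

def pool_selection(selected_per_dataset, threshold=0.5):
    """Keep the terms retained by backward elimination in at least
    `threshold` (default 50%) of the m imputed data sets; halve the
    threshold when no term qualifies, as suggested in the text.
    """
    m = len(selected_per_dataset)
    # Count in how many data sets each term survived elimination.
    counts = Counter(t for sel in selected_per_dataset for t in set(sel))
    if not counts:
        return []
    kept = sorted(t for t, c in counts.items() if c / m >= threshold)
    if not kept:
        # Relax the criterion and try again at a lower percentage.
        return pool_selection(selected_per_dataset, threshold / 2)
    return kept
```

The chosen halving is one possible relaxation scheme; the text only requires that the percentage be lowered until some variables qualify.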
Analysis
Analyses were carried out using the Statistical Package for Social Sciences (SPSS v17). Given that the outcome variable, asthma severity, is an ordinal measure, the chosen method of analysis for this data was ordinal regression. The three categories of the outcome variable are ‘none/mild intermittent asthma’; ‘mild persistent asthma’ and ‘moderate/severe asthma’. For all the analyses, logit was the chosen link function.
In addition to the analysis of the imputed data, a complete case analysis was carried out for comparative purposes. All main effects and interactions that were defined in stages 1 and 2 of the model building process were used with the complete case analysis and then backward elimination was applied to reduce the model.
Discussion
In this study investigating methods for addressing missing data, specifically when interactions are to be included in the analysis, we found support for building the model on an EM-generated set of data and then applying multiple imputation as a robust way to address this common shortcoming of epidemiological studies.
Epidemiological studies frequently suffer from missing data. Many researchers avoid this problem by dropping all cases with data missing on any variable and carrying out what is known as a complete case analysis. An advantage of this type of analysis is that it is computationally easy to apply and can be done with any reputable commercial software package. However, unless the data is MCAR, the estimated coefficients produced by this analysis may be biased. Moreover, when the missingness is a function not only of the covariate(s) but also of the outcome variable, the bias from a complete case analysis is heightened [27]. Although complete case analysis and other ad hoc methods, like mean substitution and the missing-indicator method, are still widely used, researchers are becoming more aware of the perils of applying such methods and many are now employing multiple imputation to address the missingness in their data. While results from multiple imputation will be unbiased when the data is MAR, it has been suggested that even when it is MNAR, adequately dealing with as much of the missingness mechanism as possible will usually produce sound results [8-11]. This is achieved by including in the imputation model the auxiliary variables (variables related to the missingness but not necessarily included in the analysis), the interactions, and the outcome variable.
While much has been published on the application of multiple imputation to epidemiological studies, there is limited literature that deals with model building in the presence of missing data, and more specifically model building including interactions. The aim of this paper was to demonstrate a simple and easily applied strategy to build interactions, which are not known up front, into a model while at the same time imputing the missing data.
The dilemma that we faced was a practical one. It is possible for the interactions to be added after imputation; this is termed passive imputation or ‘impute-then-transform’. However, it has been shown that including interactions, as product terms, before imputation produces superior results compared with imputing first and adding the interactions at the analysis stage [25]. For the best results, the identified interactions should be included in the imputation model along with the predictor variables, the auxiliary variables and the outcome variable. How, though, can the interactions be identified, and the best model built, when the data is incomplete?
Two strategies for model building, S1 and S2, were explored – both utilizing a single imputed data set generated from the ML parameter estimates produced from the EM algorithm for covariance matrices.
Imputation was carried out with both multivariate normal imputation (MVNI) and the more flexible fully conditional specification (FCS). The same set of 17 predictor variables and 10 interactions for the best model was identified when applying MVNI with both strategies S1 and S2, as well as when applying FCS with strategy S1. FCS with strategy S2 failed to include one of these predictors and an associated interaction in its best model. Since the dropped variables did not alter the interpretation of the results, it would seem that both strategies for model building are equally effective. The advantage of S1 over S2 is that it is easier and less time-consuming to execute, which probably makes it the preferred choice.
In comparison to the model variables selected from the imputed data, fewer variables were selected for the model on the complete case data. This is most likely caused by the enormous reduction in cases and the subsequent loss of power.
Missing items accounted for 5.3% of the data used for this analysis, spread across 81.8% of the variables and affecting 43.5% of the cases. Examination of the missingness revealed that the mechanism present in this data is possibly a combination of MCAR, MAR and MNAR. Analysis of the relationships between the missingness indicators and the variables themselves confirmed that significant relationships exist between each variable and at least one other variable in the set; furthermore, the missingness of all but three of the variables is significantly related to at least one other variable in the set.
For reliable and unbiased results to be obtained from a complete case analysis, the data is required to be MCAR, which is clearly not the case here. Furthermore, although this means of dealing with missing data is acceptable when the lost cases amount to no more than 5%, this data set is reduced by over 40%, which will inevitably have a negative effect on the outcome of the analysis.
On the other hand, multiple imputation, if applied correctly, is able to produce sound results when the data is MAR, and it has been shown that even when the data is MNAR, the effects of this mechanism are often surprisingly minimal [11]. In order to ensure that the imputation model was general enough to encompass the subsequent analysis, the outcome variable, the interactions, and variables related to the incomplete variables, their missingness, or both were included in the imputation model. By including variables that are correlated with each incomplete variable but not with its missingness, we expect the additional information to decrease the standard errors and hence increase efficiency and statistical power [10]. If there is an element of MNAR present in the data, the inclusion of these variables in the imputation model should lessen the bias and make the assumption of MAR more plausible.
It is unclear how many variables and interactions, given the sample size available, can be reliably assessed with multiple imputation applications. This seems to depend to some extent on the software being used. In some cases, convergence of large models is a problem in that it can make the imputation process unacceptably slow [16]. Graham and Schafer [28], in a study using NORM to perform the imputations, found that results were quite acceptable “even with sample sizes as low as 50, even with as much as 50% missing from most variables, and even with relatively large and complex models”. In a study on the imputation of categorical data [29] it was found that, while problems exist when many variables are imputed using a variant of NORM designed for categorical data, the same limitations are not problematic for NORM itself. In another study [30] on the inclusion of continuous auxiliary variables in the imputation model, the authors suggest that the ratio of cases with complete data to variables should be at least 3:1. Against these guidelines, we found that convergence for both imputation methods was achieved quickly and reliably. Furthermore, even with the dummy coding of all the categorical variables and the interactions, the ratio of complete cases to variables far exceeds 3:1. We are therefore confident that our results are reliable.
Diagnostic tests on the distributions of the imputed data showed that data imputed with both MVNI and FCS were not significantly different from either the measured data or the complete case data. These results confirm findings that multiple imputation with MVNI incorporating sensible rounding should work in most situations [14], even in the presence of binary and ordinal variables [6].
The diagnostic measure, df, also indicated that the estimates obtained from both multiple imputation methods have stabilized and are therefore trustworthy.
Analysis of the two sets of imputed data yielded very similar results. This is consistent with the findings of a study comparing the two imputation approaches [6], in which it was concluded that “similar results can be expected from FCS and MVNI in a standard regression analysis involving variously scaled variables”. The magnitude of the standard errors and the magnitude and direction of the estimated coefficients were consistent across both imputation types and both model building strategies. While there were some inconsistencies in the significance of predictors, these did not affect the overall interpretation of the associations between asthma severity and the factors included in the models.
A comparison of the results for the complete case analysis and the analyses of the imputed data showed that the standard errors of the estimated coefficients from the imputed data were, in all but one case, considerably smaller than those from the complete case analysis. These smaller standard errors resulted in greater accuracy of the estimated coefficients, and this increased precision indicates the superior efficiency and statistical power obtained from the analysis of the imputed data. The inconsistencies in the signs of the estimates and in the significance of the predictors could result from the non-random fashion in which cases are dropped for the complete case analysis, which may distort the joint distribution of the variables. The resulting bias in the point estimates could lead to the misidentification of significant predictors [31]. Another important factor that would negatively affect the results of the complete case analysis is that the missingness mechanism present in the data is not confined to MCAR. While multiple imputation produces unbiased parameter estimates when the missingness is MAR, this is not the case for complete case analysis. This could also have contributed to the large difference in the magnitude of the standard errors between the complete case analysis and the imputed data analysis, a difference that, some would argue, cannot be explained by sample size alone.
These results are consistent with expectations: the substantial reduction in cases for the complete case analysis, together with the missingness mechanism present in the data, would almost certainly result in a loss of power and the introduction of bias into the estimates.
Given the rigorous processes followed in the imputation of the data and the subsequent analyses, we suggest that the results from the imputed data can be considered reliable. The results from the complete case analysis, on the other hand, should be treated with caution.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
GH analysed and interpreted the data, drafted the original article and made suggested revisions. RN was involved in the collection of the data, supervised the study, critically reviewed the article and approved the final version to be published. GM was involved in the collection of the data and critically reviewed the article. DN and TZ supervised the study and reviewed the original article. All authors read and approved the final manuscript.