Background
The aim of a prognostic study is to develop a classification model from an available data set and to estimate the performance it would have in future independent data, i.e., its predictive performance. This cannot be achieved by fitting the model on the whole data set and evaluating performance in the same data set, since a model generally performs better for the data used to fit the model than for new data (“overfitting”) and performance would thus be overestimated. This can be observed already in low-dimensional situations and is especially pronounced in relatively small data sets [1, 2]. Instead, the available data have to be split in order to allow performance assessment in a part of the data that has not been involved in model fitting [3, 4]. For efficient sample usage, this is often achieved by internal validation strategies such as bootstrapping (BS), subsampling (SS) or cross-validation (CV).
The task of assessing predictive performance is made even more complicated when the data set is incomplete. Missing values occur frequently in epidemiological and clinical studies, for reasons such as incomplete questionnaire response, lack of biological samples, or resource-based selection of samples for expensive laboratory measurements. The majority of statistical methods, including logistic regression models, assume a complete data matrix, so that some action is required prior to or during data analysis to allow usage of incomplete data. Since ad hoc strategies such as complete-case analysis and single imputation often provide inefficient or invalid results, and model-based strategies often require sophisticated problem-specific implementation, multiple imputation (MI) is becoming increasingly popular among researchers in different fields [5, 6]. It is a flexible strategy that typically assumes missing at random (MAR) missingness, that is, missingness depending on observed but not unobserved data, which is often, at least approximately, given in practice [5]. MI involves three steps [7]: (i) missing values are imputed multiple (M) times, i.e., missing values are replaced by plausible values, for instance derived as predicted values from a sequence of regression models including other variables; (ii) statistical analysis is performed on each of the resulting completed data sets; and (iii) the M obtained parameter estimates and their variances are pooled, taking into account the uncertainty about the imputed values [8].
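As a concrete illustration of step (iii), the minimal sketch below implements Rubin's pooling rules in Python (the input values are hypothetical): the pooled point estimate is the mean of the M per-imputation estimates, and the total variance adds the between-imputation variance, inflated by the factor 1 + 1/M, to the average within-imputation variance.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M per-imputation estimates and their variances (Rubin's rules)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    theta_bar = estimates.mean()          # pooled point estimate
    w_bar = variances.mean()              # average within-imputation variance
    b = estimates.var(ddof=1)             # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b   # Rubin's total variance
    return theta_bar, total_var

# Hypothetical example: M = 5 log-odds ratios with their squared standard errors
theta, t = pool_rubin([0.42, 0.47, 0.39, 0.45, 0.44],
                      [0.012, 0.011, 0.013, 0.012, 0.012])
print(theta, t ** 0.5)  # pooled estimate and its standard error
```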
When the estimate of interest is a measure of predictive performance of a classification model, or a measure of incremental predictive performance of an extended model as compared to a baseline model, the application of MI is not straightforward. Specifically, it is unclear how internal validation and MI should be combined in order to obtain unbiased estimates of predictive performance.
Previous studies combining internal validation with MI mostly focused on application, without aiming to compare the chosen strategy against others or to assess its validity [9–11]. Musoro et al. [12] studied the combination of BS and MI in the situation of a nearly continuous outcome using LASSO regression, essentially reporting that the strategy of conducting MI first followed by BS on the imputed data yielded overoptimistic mean squared errors, whereas conducting BS first on the incomplete data followed by MI yielded slightly pessimistic results in the studied settings. Wood et al. [13] presented a number of strategies for performance assessment in multiply imputed data, leaving, however, the necessity of validating the model in independent data to future studies. Hornung et al. [14] examined the consequence of conducting a single imputation on the whole data set, as compared to the training data set only, on the cross-validated performance of classification methods, observing a negligible influence. Their investigation was restricted to one type of imputation that did not include the outcome in the imputation process.
In this paper, we present results of a comprehensive simulation study and of a real-data-based simulation study comparing various strategies of combining internal validation with MI, with and without including the outcome in the imputation models. Our study extends previous work in several respects: (1) we consider different internal validation strategies and different ways to correct for optimism; (2) we study measures of discrimination, calibration and overall performance as well as incremental performance of an extended model; (3) we closely examine the sensitivity of the results to characteristics of the data set, including sample size, number of covariates, true effect size, and degree and mechanism of missingness; (4) we elaborate on the number of imputations and resamples to be used; (5) we provide an approach for the construction of confidence intervals for predictive performance estimates; and (6) we translate our results into recommendations for practice, considering the applicability of the proposed methods for epidemiologists with limited analytical and computational resources.
Discussion
Using simulated and real data we have compared strategies of combining internal validation with multiple imputation in order to obtain unbiased estimates of various (added) predictive performance measures. Our investigation covered a wide range of data set characteristics, validation strategies and performance measures, and also dealt with practical questions such as the numbers of imputations and bootstrap samples to be chosen in a given data set, the handling of incomplete future patient data, and the construction of confidence intervals for performance estimates.
Throughout the investigated simulation settings, we observed an optimistic bias for apparent performance estimates, which was insufficiently corrected by ordinary optimism correction and the BS (and SS) 0.632 estimate, whereas the OOB estimate tended to be pessimistic and the 0.632+ estimate tended to provide unbiased estimates. CV estimates were more variable than BS estimates (although this comparison might not be completely fair, since the total number of training/test set pairs was not always the same in BS/SS as in CV or CVrep). These trends were similarly observed for complete and incomplete data and are consistent with previous observations for complete data. For instance, Wehberg and Schumacher [41] reported the 0.632+ method to outperform ordinary optimism correction and 0.632, while the OOB estimate was pessimistic. Also, Smith et al. [1] and Braga-Neto et al. [42] observed insufficient optimism correction for the ordinary method and the 0.632 estimate, respectively, and both reported increased variability of CV estimates. Another publication focused on AUC estimation and found the BS 0.632+ estimate to be the least biased and least variable among the BS estimates [43].
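For reference, a minimal sketch of the 0.632+ combination of apparent and OOB error is given below, in the spirit of Efron and Tibshirani's definition. It is written for error-type measures (smaller is better), so AUC-type measures would be passed as 1 − AUC; the no-information error rate gamma must be supplied, e.g. estimated by decoupling predictions from outcomes.

```python
def err_632_plus(err_app, err_oob, gamma):
    """0.632+ estimate from apparent error, out-of-bag (OOB) error and
    the no-information error rate gamma (all on an error scale where
    smaller is better)."""
    err_oob = min(err_oob, gamma)                    # truncate OOB error at gamma
    if gamma > err_app and err_oob > err_app:
        r = (err_oob - err_app) / (gamma - err_app)  # relative overfitting rate
    else:
        r = 0.0
    w = 0.632 / (1 - 0.368 * r)                      # weight grows from 0.632 to 1
    return (1 - w) * err_app + w * err_oob

# The plain 0.632 estimate uses the fixed weight 0.632 instead:
# 0.368 * err_app + 0.632 * err_oob.
print(err_632_plus(0.10, 0.25, 0.50))                # example values
```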
When we investigated strategies of combining validation with imputation, we observed an optimistic bias for the strategy of imputing first and then resampling on the imputed data (MI-Val), whereas imputing training and test sets separately (Val-MI) provided largely unbiased and sometimes pessimistic results. The question of the order in which bootstrapping and imputation should be combined has been studied before from a theoretical [44] and an empirical [12] perspective. In MI-Val, all observations, which are later on repeatedly separated into training (BS) and test (OOB) sets, are imputed in one imputation process. Since values are imputed using predictions based on multivariate models including all observations, future test observations do not remain completely blind to future training observations. Still, the severity of the expected optimism of the MI-Val approach under different data characteristics, validation strategies and performance estimates has not been intensively studied. In practice, both MI-Val and Val-MI have been applied before [9, 10, 45].
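The structural difference between the two orderings can be sketched as follows. This is a simplified illustration rather than our exact implementation: scikit-learn's IterativeImputer with posterior draws stands in for the MI procedure, and a logistic model evaluated by the AUC stands in for the prediction model and performance measure.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def val_mi_oob_auc(X, y, B=50, M=5, seed=0):
    """Val-MI: resample the incomplete data FIRST, then impute the
    training (BS) and test (OOB) parts in separate imputation runs.
    Returns the OOB AUC (to be combined with the apparent AUC, e.g.
    via the 0.632+ rule)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    oob_aucs = []
    for b in range(B):
        idx = rng.integers(0, n, n)              # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)    # out-of-bag rows
        aucs = []
        for m in range(M):
            rs = seed + 2 * (b * M + m)
            # separate imputation processes keep the test data
            # blind to the training data
            X_tr = IterativeImputer(sample_posterior=True,
                                    random_state=rs).fit_transform(X[idx])
            X_te = IterativeImputer(sample_posterior=True,
                                    random_state=rs + 1).fit_transform(X[oob])
            fit = LogisticRegression(max_iter=1000).fit(X_tr, y[idx])
            aucs.append(roc_auc_score(y[oob], fit.predict_proba(X_te)[:, 1]))
        oob_aucs.append(np.mean(aucs))           # pool over the M imputations
    return float(np.mean(oob_aucs))

# MI-Val would instead impute the full X once before the loop, so OOB rows
# inform the imputation of the training rows; this is the source of its optimism.
```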
Val-MI tended to be pessimistically biased in the presence of a true underlying effect in our and others’ [12] work. Specifically, when the sample size is low and the number of covariates large, the model overfits the training (BS) part of the data set, resulting in a worse fit to the test (OOB) data. In the presence of missing values, training and test data are imputed separately. It can be assumed that overfitting also occurs at the stage of imputation (where imputation models might become overfitted to the observed data in both the training and the test set). This may result in a more severe difference in the observed covariate-outcome relationships between training and test data, and consequently a worse fit of the model fitted to the training data when applied to the test data, yielding an underestimation of predictive performance that apparently cannot be fully corrected using the 0.632+ estimate.
MI(-y)-Val produced mostly pessimistic results in the presence of an underlying true effect, largely independent of sample size and number of covariates. In the general MI literature, it is not recommended to omit the outcome from the imputation models [26, 46]. Omitting the outcome equals making the assumption that it is not related to the covariates, as stated by von Hippel [26]. This assumption is wrong in the case of a true underlying effect, resulting in misspecified imputation models and, in turn, in an underestimation of effect estimates [46]. Of note, the same study reported no difference between the MI and MI(-y) methods as far as inference is concerned. To our knowledge, the issue has not been investigated in the context of predictive performance estimation. In their study of ‘incomplete’ CV, Hornung et al. [14] investigated the effect of – amongst other preprocessing steps – imputing the whole data set prior to CV, as compared to basing the imputation on the training data only. They used a single imputation method that omitted the outcome, and found only little impact on CV error estimation.
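In code, the distinction between MI and MI(-y) amounts to whether the outcome column is part of the data matrix handed to the imputation procedure, as the minimal illustration below shows (synthetic data; IterativeImputer again stands in for the MI procedure):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(float)   # complete outcome
X[rng.random(X.shape) < 0.2] = np.nan                    # 20% missing covariates

# MI: the outcome enters the imputation models as a predictor
Xy = np.column_stack([X, y])
X_mi = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(Xy)[:, :-1]

# MI(-y): the imputation models see the covariates only, implicitly
# assuming that the outcome is unrelated to the covariates
X_miy = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(X)
```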
For measures of added predictive performance, we observed that even in complete data, estimates were sometimes biased in the absence of a true effect. For instance, ΔAUC and the categorical NRI were pessimistically and optimistically biased, respectively. The optimistic bias of the NRI has led to critical discussion [47]. It is not unexpected that such bias is not eliminated when the respective validation method is combined with imputation.
Our study focused on treating missing values and deriving reasonable estimates for predictive performance measures in the presence of incomplete data in the research stage, i.e., in the situation where data sets with complete outcome data are available from studies/cohorts and one aims to develop a prediction model for use in future patient data (as opposed to the application stage, where the model is applied to predict patients’ outcomes). Thus, when we evaluated estimates, they were compared against average performance in large complete data sets. An important question is how missing values in future patient data impair the performance of a developed prediction model, and whether such impairment would have to be considered already when developing the model. It has been suggested that data in the research stage should be imputed omitting the outcome from the imputation process, at least in the test sets, to get close to the situation in future real-world clinical data, where no outcome would be available for imputation either [13]. According to this suggestion, the strategy Val-MI should be avoided. However, how closely a predictive performance estimate obtained through any strategy on the research data approximates the actual performance in future clinical data depends strongly on the similarity in the proportion (and, putatively, in the pattern) of missing values in the two situations. Our and others’ [48] results suggest that – regardless of how missing values in future clinical data are treated – accuracy is lost with increasing missingness in future data at a given proportion of missingness in the research data. We expect the proportion of missing values in future patient data to be lower than that in study data in many cases. Specifically, epidemiological study data are subject to additional missingness attributable to design, sample availability and questionnaire response. Since the precise missingness patterns in both study data and future patient data in clinical practice may vary between studies and the outcome of interest, no general rule can be developed for estimating predictive performance of a model when future patient data are expected to contain missing values.
We propose a simple integrated approach for the construction of confidence intervals for performance estimates. The resulting intervals kept the nominal type 1 error rate for both Val-MI and MI(-y)-Val, although a severe loss in power as compared to complete data could be observed. The chosen approach relies on the numerical finding that prediction error estimates have the same variability as apparent error estimates, so that a bootstrap interval for the apparent error can be centered at the prediction error estimate [34]. The strategy has a major computational advantage over alternative strategies of constructing confidence intervals for estimates of prediction error/performance measures that use resampling in order to estimate the distribution of, e.g., CV errors [49]. The latter require nesting the whole validation (and imputation) procedure within an outer resampling loop. Other alternatives that do not require a double resampling loop might rely on tests applied to the test data. An example is the median P rule suggested by van de Wiel et al. [50], where a nonparametric test is conducted on the test parts of a subsampling scheme, resulting in a collection of P values of which the median is a valid summary that controls the type 1 error under fairly general conditions. The methodology could be generalized to other (parametric or nonparametric) tests conducted on the test observations, such as DeLong’s test for (Δ)AUC, and extension to incomplete data is possible with the help of Rubin’s combination rules. However, this strategy might lack power, because tests are conducted on the small test sets.
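A minimal sketch of this construction for the AUC is given below (the names are illustrative: p_hat denotes the fitted probabilities of the model on the full data, and auc_pred the optimism-corrected prediction performance estimate, e.g. the 0.632+ estimate obtained from Val-MI or MI(-y)-Val):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def shifted_ci(y, p_hat, auc_pred, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the APPARENT AUC, re-centered
    at the prediction performance estimate auc_pred."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boot = []
    while len(boot) < B:
        idx = rng.integers(0, n, n)
        if len(np.unique(y[idx])) < 2:       # an AUC needs both classes
            continue
        boot.append(roc_auc_score(y[idx], p_hat[idx]))
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    shift = auc_pred - roc_auc_score(y, p_hat)   # move the centre from the
    return lo + shift, hi + shift                # apparent to the prediction estimate
```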
Together, our findings allow the careful formulation of recommendations for practice. First, if one aims to assess the predictive performance of a model, validation is of utmost importance to avoid overoptimism. As for complete data, bootstrap with the 0.632+ estimate turned out to be a preferable validation strategy also in the case of incomplete data. When combining internal validation and MI, one should not impute the full data set with the outcome included in the imputation models and then resample (strategy MI-Val), due to its optimistic bias. Instead, we can recommend nesting the MI within the resampling (Val-MI) or performing MI first, but without including the outcome variable (MI(-y)-Val). The number of resamples (B) and imputations (M) should be maximized in Val-MI and MI(-y)-Val, respectively. The choice of the exact number of resamples and imputations for a given data set can be guided by the variability data we provide. In many situations and for many performance criteria, Val-MI might be preferable, although this choice may also depend on computational capacity: the demand is lower for MI(-y)-Val, where the variability of the 0.632+ estimate is lower at the same number of resamples and only half the number of imputation runs is required. One should also be aware of (complete-data) biases of specific performance criteria, which may be augmented in the presence of missing values. Finally, one possible way of constructing valid confidence intervals for predictive performance estimates may be to center the bootstrap interval of the apparent performance estimate at the predictive performance estimate. This strategy can be easily embedded in the Val-MI and MI(-y)-Val strategies.
Strengths of this study include its comprehensiveness with regard to different data characteristics, validation strategies and performance measures, and the use of both simulated and real data. Our investigation may be extended in several respects. For instance, we did not vary effect strengths between the covariates. The relationship between effect strengths and missingness in covariates may influence the extent of potential bias in, e.g., Val-MI. Furthermore, it will be interesting to extend the study on confidence intervals by adopting alternative approaches to incomplete data, with a focus on searching for a strategy that improves power. In addition, one might explore the role of the obtained findings in a higher-dimensional situation, where variable selection and parameter tuning often require an inner validation loop. Of note, while in our study results were very similar for BS and SS, in an extended situation involving model selection, or hypothesis tests following [50], SS should be preferred due to known flaws of the BS methodology [51].
Acknowledgements
We thank all MONICA/KORA study participants and all members of the field staff in Augsburg who planned and conducted the study. We thank Annette Peters, head of the KORA platform, for providing the data, and Andrea Schneider for excellent technical support. We thank the involved cooperation partners Wolfgang Koenig (University of Ulm Medical Center, Ulm, Germany) and Christian Herder (German Diabetes Center, Düsseldorf, Germany) for permission to use the MONICA/KORA subcohort data for the present analyses.