The online version of this article (doi:10.1186/1471-2288-14-116) contains supplementary material, which is available to authorized users.
The authors declare that they have no competing interests.
Authors JM, AZ and RG devised the statistical methods. Authors MP and GtR were responsible for the design and data collection of the study. Author JM performed the statistical analysis and wrote the paper. All authors read and corrected the draft versions of the manuscript, and approved the final manuscript.
In prognostic studies, the lasso technique is attractive because it improves predictive accuracy by shrinking regression coefficients, compared with predictions from a model fitted via unpenalized maximum likelihood. Since some coefficients are set exactly to zero, parsimony is achieved as well. It is unclear whether the performance of a model fitted using the lasso still shows some optimism. Bootstrap methods have been advocated to quantify optimism and to generalize model performance to new subjects. It is also unclear how resampling should be performed in the presence of multiply imputed data.
The data were based on a cohort of Chronic Obstructive Pulmonary Disease patients. We constructed models to predict Chronic Respiratory Questionnaire dyspnea 6 months ahead. Optimism of the lasso model was investigated by comparing four approaches to handling multiply imputed data in the bootstrap procedure, using both the study data and simulated data sets. In the first three approaches, data sets completed via multiple imputation (MI) were resampled, whereas the fourth approach resampled the incomplete data set and then performed MI.
The discriminative performance of the lasso model was optimistic, and calibration was suboptimal owing to over-shrinkage. The estimate of optimism was sensitive to how the imputed data were handled in the bootstrap resampling procedure. Resampling the completed data sets underestimated optimism, especially if, within a bootstrap step, the selected individuals differed across the imputed data sets. Incorporating the MI procedure in the validation yielded estimates of optimism closer to the true value, albeit slightly too large.
Performance of prognostic models constructed using the lasso technique can be optimistic as well. Results of the internal validation are sensitive to how bootstrap resampling is performed.
Additional file 1: R function to perform resampling with the caret package in the presence of multiply imputed data. The “validate.train” function estimates optimism in predictive value via the bootstrap resampling procedures described as approaches 1 and 4 in the manuscript. In approach 1, the data sets completed via MI are resampled; the same subjects are selected across the imputed data sets, so the bootstrapped imputed data sets differ only in their imputed values. In approach 4, the incomplete data set is resampled and MI is then performed using the mice package. The function can be used to estimate optimism in the predictive value of a linear regression model constructed within caret using the train() function with method = “glmnet”. For consistency with the output from caret, the response variable is assumed to be in the last column of every data set. (ZIP 2 KB)
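The R code in Additional file 1 is not reproduced here, but the structure of approach 4 (resample the incomplete data, then impute within each bootstrap replicate) can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: mean-plus-noise imputation stands in for mice's chained equations, and an ordinary least-squares fit stands in for the lasso; all function names are hypothetical.

```python
import numpy as np

def impute(X, rng):
    """Single stochastic imputation: replace each missing value with the
    observed column mean plus Gaussian noise scaled by the observed column SD
    (a crude stand-in for chained-equations imputation as done by mice)."""
    X = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if miss.any():
            obs = X[~miss, j]
            X[miss, j] = obs.mean() + obs.std() * rng.standard_normal(miss.sum())
    return X

def fit_predict(X, y, X_new):
    """Least-squares fit with intercept (stand-in for the lasso), returning
    predictions for X_new."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ beta

def bootstrap_optimism(X_incomplete, y, n_boot=20, n_imp=3, seed=1):
    """Approach 4: resample the incomplete rows, impute *within* each bootstrap
    replicate, and estimate optimism as the average excess of test MSE (model
    applied to an imputed copy of the original data) over apparent MSE (model
    applied to its own bootstrap sample), averaged over imputations."""
    rng = np.random.default_rng(seed)
    n = len(y)
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # bootstrap the incomplete data
        Xb, yb = X_incomplete[idx], y[idx]
        diffs = []
        for _ in range(n_imp):               # MI inside the bootstrap replicate
            Xb_imp = impute(Xb, rng)
            Xo_imp = impute(X_incomplete, rng)
            mse_app = np.mean((yb - fit_predict(Xb_imp, yb, Xb_imp)) ** 2)
            mse_test = np.mean((y - fit_predict(Xb_imp, yb, Xo_imp)) ** 2)
            diffs.append(mse_test - mse_app)
        optimism.append(np.mean(diffs))
    return float(np.mean(optimism))
```

The optimism-corrected performance is then the apparent performance of the model fitted on the (imputed) original data minus this estimate; approaches 1–3 differ only in that the imputation step would sit outside the bootstrap loop.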
Additional file 2: Table S1: Simulation study results. The table presents means of all estimates along with their corresponding 2.5th and 97.5th percentile values within parentheses. These are based on 1000 simulated data sets for both n = 250 and 1000. (PDF 32 KB)
Additional file 3: Figure S1: Distribution of the estimated expected optimism values from the simulation study. These are based on 1000 simulated data sets (n = 250) for both the setting without (NM) and with (WM) missing data. (ZIP 6 KB)
- Validation of prediction models based on lasso regression with multiply imputed data
Jammbe Z Musoro
Aeilko H Zwinderman
Milo A Puhan
Gerben ter Riet
Ronald B Geskus
- BioMed Central