Background
“Big data” [1] in information science refers to the collection and management of large and complex datasets. Big data is steadily growing in biomedicine with the development of electronic medical records, increased use of high-throughput technologies, and facilitated access to large environmental databases [2-6]. In epidemiology, the collection of hundreds to thousands of covariates is common in large-scale cohort studies and poses new challenges for the discovery of associations between individual or collective exposures and a health outcome. The use of specific methods to explore these associations, without any pre-specified hypothesis, therefore becomes essential.
In hypothesis-driven epidemiology, the search for associations involves statistical modeling and testing of the relationships between one or several covariates and the outcome. Logistic regression is the most widely used model when the outcome follows a binomial distribution. The usual epidemiologic analytic framework consists of testing the association between each covariate and the outcome through univariate logistic models; a subset of those covariates is then selected for multivariate logistic models based on some quantile of the test statistic for the covariate coefficient under the null hypothesis, i.e. the P value. This framework is the reference method in epidemiology for variable selection, and the use of alternative approaches remains uncommon [7, 8]. With large datasets, the number of covariates selected in the univariate analyses can be high. As multivariate logistic regression can handle only a limited number of covariates simultaneously [9], it might therefore be poorly adapted to large epidemiologic datasets for identifying independent associations.
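The univariate screening step of this framework can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `wald_p_value`, the toy data, and the covariate names `exposure_a` and `exposure_b` are hypothetical, and the 0.20 threshold is used only as an example. The multivariate step that follows the screening is not shown.

```python
import math
from statistics import NormalDist


def wald_p_value(x, y, n_iter=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson.

    Returns (b1, two-sided Wald P value for b1).
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + I^{-1} * score (2x2 inverse in closed form)
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error from the information matrix at convergence
    se_b1 = math.sqrt(h00 / det)
    z = b1 / se_b1
    return b1, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical toy data: 80 subjects, binary outcome and two binary covariates.
outcome = [0] * 30 + [1] * 10 + [0] * 10 + [1] * 30
covariates = {
    "exposure_a": [0] * 40 + [1] * 40,  # associated with the outcome (odds ratio 9)
    "exposure_b": [0, 1] * 40,          # unrelated to the outcome
}

# Screening: keep covariates whose univariate P value is below the threshold.
threshold = 0.20  # illustrative threshold
selected = [name for name, x in covariates.items()
            if wald_p_value(x, outcome)[1] < threshold]
print(selected)
```

With hundreds of covariates, the list returned by such a screening step can itself be long, which is the situation described above.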
“Data mining”, a term which appeared in the early 1990s [10], describes data-driven analysis without any a priori hypothesis about the structure or the potential relationships that could exist in the data. Data mining applications are broad, ranging from consumption analysis to fraud detection in high-dimensional databases [11]. Data mining methods are non-parametric, more flexible than statistical regression methods, and are able to deal with a large number of covariates. Several studies have compared the performance of logistic regression and data mining methods for predicting a health outcome, without clear conclusions about the superiority of any one of these methods over the others [12-17]. Most studies explored classification and regression trees, artificial neural networks or linear discriminant analysis, but only a few focused on more recently developed “ensemble-based” methods such as random forests or boosted regression trees [13, 16, 17].
Shrinkage methods, such as the Least Absolute Shrinkage and Selection Operator (LASSO) [18], have been developed to overcome the limitations of usual regression models when the number of covariates is high. However, LASSO logistic regression remains unfamiliar to epidemiologists and few applications of this method have been reported [19, 20].
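For reference, the LASSO logistic regression estimator is usually written as a penalized likelihood problem. The notation below is an assumption introduced here for illustration: a binary outcome y_i, a covariate vector x_i for subject i = 1, ..., n, an intercept β_0, coefficients β_1, ..., β_p, and a penalty parameter λ ≥ 0.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
\left\{
  -\sum_{i=1}^{n} \left[ y_i \left( \beta_0 + x_i^{\top}\beta \right)
  - \log\!\left( 1 + e^{\beta_0 + x_i^{\top}\beta} \right) \right]
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\right\}
```

The first term is the negative log-likelihood of the logistic model. The second term shrinks the coefficients and, for large enough λ, sets some of them exactly to zero, which is what makes the method usable for variable selection.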
We compared two data mining methods, random forests and boosted regression trees, with the conventional multivariate logistic regression and with the LASSO logistic regression for identifying independent associations in a large epidemiologic dataset including hundreds of covariates. Random forests and boosted regression trees were chosen among data mining methods for their ability to provide quantitative information about the strength of association between covariates and the outcome. The methods were used to detect covariates associated with H1N1 pandemic (pdm) influenza infections. We also assessed the performance of these methods in detecting associations through simulations.
Discussion
Without any pre-specified hypotheses, random forests, boosted regression trees and LASSO models identified 8 to 24 covariates independently associated with influenza infection, among which 23 were not detected by the “univariate followed by multivariate logistic regression” framework. On the other hand, when a P value threshold of 0.20 was applied in the univariate logistic models to select covariates for multivariate logistic regression, a substantial number of spurious independent associations were detected that were not retrieved by any other method. Simulations showed that RF, BRT and LASSO outperformed the conventional logistic framework in detecting independent associations, while the false positive detection rates remained at the nominal significance level (RF, BRT and LASSO-se) or moderately increased above it (LASSO-max).
When covariates not associated with the outcome were correlated with covariates associated with the outcome, the false positive rate was high, particularly with RF. For this method, this finding was explained by the sensitivity of the Gini impurity criterion to between-covariate correlation [38]. More strikingly, increasing the number of correlated covariates also affected the true positive rate, which decreased with almost all methods (see Additional file 1). This finding may be attributed to a decrease in each covariate's strength of association when a large number of covariates are correlated and, consequently, a lower probability of being detected by any of the methods, as has been shown for RF and LASSO [39].
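The effect of correlation on a Gini-based criterion can be illustrated on a toy example. The sketch below is not the random forest importance measure itself; it only computes the impurity decrease of a single binary split. The helper names and the three covariates (`causal`, `proxy`, `noise`) are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def gini_decrease(split, labels):
    """Impurity decrease obtained by splitting `labels` on a binary covariate."""
    left = [y for s, y in zip(split, labels) if s == 0]
    right = [y for s, y in zip(split, labels) if s == 1]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


y = [0, 0, 0, 0, 1, 1, 1, 1]
causal = [0, 0, 0, 0, 1, 1, 1, 1]  # determines the outcome
proxy = [0, 0, 0, 1, 0, 1, 1, 1]   # correlated with `causal`, no effect of its own
noise = [0, 1, 0, 1, 0, 1, 0, 1]   # unrelated to both

for name, x in [("causal", causal), ("proxy", proxy), ("noise", noise)]:
    print(name, gini_decrease(x, y))
```

In this toy setting the proxy obtains a positive impurity decrease only because it is correlated with the causal covariate, whereas the unrelated covariate obtains none.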
In this work, we used an exploratory approach to analyze a large epidemiologic dataset, i.e. we aimed to detect associations between numerous covariates and an outcome, without pre-specified hypothesis. Despite the high number of covariates under study, the multiple testing issue was not considered. It is common to distinguish between two types of error rates: the comparisonwise error rate (CER), which corresponds to the probability, for an individual test, of rejecting the null hypothesis when it is actually true; and the experimentwise error rate (EER, also known as the familywise error rate), which corresponds to the probability of rejecting at least one true null hypothesis among the multiple tests performed [40]. In the simulations performed, the false positive rates associated with the permutation tests were close to the expected CER level (5%) with almost all methods. Because we worked at the individual covariate level, no adjustment was necessary. Adjusting P values would have been required if the EER had to be controlled, e.g. in order to build a predictive model or to confirm the detected associations [41]. It is nevertheless essential to keep in mind that the significant results correspond to exploratory results, which require further confirmation.
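The difference between the two error rates can be made concrete with a short calculation. The sketch below assumes independent tests with all null hypotheses true, which is a simplification that does not hold for correlated covariates. The Bonferroni threshold is shown only as one common adjustment and was not used in this study; the function names are hypothetical.

```python
def experimentwise_error_rate(alpha, n_tests):
    """Probability of at least one false rejection among n_tests independent
    tests of true null hypotheses, each performed at comparisonwise level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test level that keeps the experimentwise error rate at or below alpha."""
    return alpha / n_tests


for m in (1, 10, 100, 500):
    eer = experimentwise_error_rate(0.05, m)
    print(f"{m:>3} tests at CER 0.05: EER = {eer:.3f}, "
          f"Bonferroni per-test level = {bonferroni_threshold(0.05, m):.5f}")
```

Under these assumptions the EER approaches 1 as the number of tests grows, while the CER of each individual test stays at 5%.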
To our knowledge, our study is the first to compare random forest and boosted regression tree importance measures with the LASSO and the widely used analytic framework for detecting independent associations in the simultaneous analysis of hundreds to thousands of covariates, a growing issue in epidemiology. Although such datasets pose analytic challenges, they are hardly comparable to the datasets explored in omic-based approaches, in which the number of covariates (up to millions) is far higher than the number of samples, and for which the use of dedicated approaches, e.g. the elastic net penalty [42], would have been unavoidable.
Some associations with influenza infection detected with RF, BRT or LASSO-se methods were expected: HAI titers are well-known correlates of protection against influenza infection [43], young age is a known risk factor for H1N1pdm influenza infection [44], non-pharmaceutical preventive measures such as handwashing have been found to be determinants of H1N1pdm infection [45], and asthma was also reported as a specific risk factor [46]. Having a professional activity involving contact with ill people is a plausible risk factor, and several reports have shown that hospital staff were at increased risk of infection [47]. For other associations, e.g. “Always or often covers mouth while coughing or sneezing”, we found no consistent evidence in the literature, and it is hard to hypothesize how the detected covariates could be linked with the risk of H1N1pdm influenza infection. However, “Professional activity involves contact with ill people” and age were correlated with this covariate (ρ = 0.10, P value = 0.020 and ρ = 0.29, P value < 0.001, respectively); based on our simulation findings, we suspect that this association, like many others (e.g. “Presence of a dishwasher in the kitchen”), is likely to be a false positive.
Having no prior knowledge about the covariates truly associated with influenza infections, we performed a simulation study to assess the performance of the different methods at detecting true and false associations in similarly sized data, with a similar number of positive outcomes and covariates. Although we did not perform an extensive analysis exploring varying proportions of associated covariates or interactions between covariates, our simulations clearly demonstrated that UFMLR, with or without backward selection, was inefficient. We developed permutation tests to assess the significance of each covariate's association with the outcome in RF, BRT and LASSO; when applied to UFMLR, their results were comparable to those of the Wald test in terms of nominal coverage (see Additional file 1). Although permutation tests exhibited slightly less power than the Wald test, this did not modify our general findings.
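The general logic of a permutation test on an importance measure can be sketched as follows. This is an illustration under stated assumptions, not the procedure implemented in the study: the outcome is shuffled to generate the null distribution, the number of permutations is arbitrary, and a simple stand-in statistic (absolute difference in covariate means between cases and non-cases) replaces the RF, BRT or LASSO importance measure. All function names are hypothetical.

```python
import random


def mean_difference(x, y):
    """Stand-in importance measure: absolute difference in the mean of x
    between subjects with y == 1 and subjects with y == 0."""
    cases = [xi for xi, yi in zip(x, y) if yi == 1]
    others = [xi for xi, yi in zip(x, y) if yi == 0]
    return abs(sum(cases) / len(cases) - sum(others) / len(others))


def permutation_p_value(x, y, importance=mean_difference,
                        n_permutations=999, seed=0):
    """P value for the association between x and y: the proportion of
    permuted datasets whose importance is at least the observed one."""
    rng = random.Random(seed)
    observed = importance(x, y)
    shuffled = list(y)
    n_at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between x and the outcome
        if importance(x, shuffled) >= observed:
            n_at_least += 1
    # The observed dataset counts as one permutation, so P is never zero.
    return (n_at_least + 1) / (n_permutations + 1)


y = [0] * 20 + [1] * 20
associated = [0.0] * 20 + [1.0] * 20  # perfectly aligned with the outcome
constant = [1.0] * 40                 # carries no information

print(permutation_p_value(associated, y))
print(permutation_p_value(constant, y))
```

Any importance measure that can be recomputed on a permuted dataset can be passed as `importance`, which is what makes the approach applicable to methods that do not provide a parametric test.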
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
YM and FC conceived and designed the experiments. YM performed the experiments and analyzed the data. YM and FC wrote the paper. All authors read and approved the final manuscript.