Background
NHANES Data
Machine Learning Models
- Logistic Regression is a statistical model that estimates the coefficients of a best-fitting linear model relating the logit transformation of a binary dependent variable to one or more independent variables. This simple approach to prediction provides baseline accuracy scores for comparison with the non-parametric machine learning models [17].
- Support Vector Machines (SVM) classify data by separating the classes with a decision boundary, i.e., a line or higher-dimensional hyperplane. Optimization finds the boundary with the widest margin between the classes. While SVM often outperforms logistic regression, the computational complexity of the model results in long training durations [18].
- Ensemble models combine the results of multiple learning algorithms to obtain better performance than any individual algorithm. Used correctly, they decrease variance and bias and improve predictions. The three ensemble models used in our study were random forests, gradient boosting, and a weighted ensemble model.
- Random Forest Classifier (RFC) is an ensemble model that builds multiple randomized decision trees through a bagging method [19]. Each tree is an analysis diagram that depicts possible outcomes, and the prediction averaged across all trees is used for the overall classification, which mitigates the large variance of individual decision trees. Decision splits are made based on impurity and information gain [20].
- Gradient Boosted Trees (GBT) [21] is also an ensemble prediction model based on decision trees. In contrast to random forest, this model builds decision trees successively, using gradient descent to minimize a loss function. A final prediction is made through a weighted majority vote of all the decision trees. We use XGBoost [22], an implementation of gradient boosting optimized for speed and performance.
- A Weighted Ensemble Model (WEM) that combines the results of all the aforementioned models was also used in our analysis. The model averages the predictions of the disparate models, with weights based on each individual model's performance. The intuition is that the weighted ensemble can draw on the strengths of multiple models to produce more accurate results (a combined code sketch follows this list).
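To make the comparison concrete, the following is a minimal sketch of how the five classifiers could be set up, assuming scikit-learn and the xgboost Python package. The placeholder data, hyperparameters, and AUC-based ensemble weights are illustrative assumptions, not the exact configuration used in this study.

```python
# A minimal sketch of the five classifiers compared in this study, assuming
# scikit-learn and the xgboost Python package. The placeholder data,
# hyperparameters, and AUC-based ensemble weights are illustrative
# assumptions, not the authors' exact configuration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Placeholder for the preprocessed NHANES feature matrix and binary labels
X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "svm": SVC(kernel="rbf", probability=True),  # probability=True enables predict_proba
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "xgboost": XGBClassifier(n_estimators=500, learning_rate=0.1),
}

# Train each base model and collect its predicted class-1 probabilities
proba = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba[name] = model.predict_proba(X_test)[:, 1]

# Weighted ensemble (WEM): average the base-model probabilities, weighting
# each model by its individual AUC. For brevity the weights are computed on
# the test split; a separate validation split would be used in practice.
weights = np.array([roc_auc_score(y_test, p) for p in proba.values()])
weights /= weights.sum()
wem_proba = np.average(np.column_stack(list(proba.values())), axis=1, weights=weights)
print("WEM AUC:", roc_auc_score(y_test, wem_proba))
```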
Methods
Data Mining and Modeling
Dataset Preprocessing
Subject Exclusion and Label Assignment
| Criteria | | Classification |
|---|---|---|
| Answered “yes” to “Have you been told by a doctor that you have diabetes” γ or had a Plasma Glucose ≥126 mg/dl δ | ⇒ | Diabetic |
| Answered “no”, but had a Plasma Glucose ≥126 mg/dl | ⇒ | Undiagnosed diabetic |
| Had a Plasma Glucose between 100 and 125 mg/dl | ⇒ | Prediabetic |
| Had a Plasma Glucose <100 mg/dl | ⇒ | Not diabetic |
| Classification | Case I | Case II |
|---|---|---|
| Diabetic | 1 | Excluded |
| Undiagnosed diabetic | 1 | 1 |
| Prediabetic | 0 | 1 |
| Not diabetic | 0 | 0 |
| Criteria | | Classification | Label Assignment |
|---|---|---|---|
| Answered “yes” to having had one of the following γ: congestive heart failure, coronary heart disease, heart attack, or stroke | ⇒ | Having heart disease | 1 |
| Answered “no” to all of the above conditions | ⇒ | Not having heart disease | 0 |
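The label-assignment rules in the tables above can be expressed directly in code. The sketch below assumes a pandas-style row of survey responses with illustrative column names (told_has_diabetes, plasma_glucose, etc.); the actual NHANES variable names differ.

```python
# A sketch of the label-assignment rules above. Column names are
# illustrative assumptions, not actual NHANES variable codes.

def diabetes_label(row, case):
    """Return the Case I/Case II label, or None for excluded subjects."""
    diagnosed = row["told_has_diabetes"] == "yes"
    glucose = row["plasma_glucose"]  # fasting plasma glucose, mg/dl
    if case == "I":
        # Diabetic and undiagnosed diabetic -> 1; prediabetic and not diabetic -> 0
        return 1 if (diagnosed or glucose >= 126) else 0
    # Case II: diagnosed diabetics are excluded from the dataset
    if diagnosed:
        return None
    # Undiagnosed diabetic and prediabetic -> 1; not diabetic -> 0
    return 1 if glucose >= 100 else 0

def cardio_label(row):
    """1 if any of the four cardiac conditions was reported, else 0."""
    conditions = ["congestive_heart_failure", "coronary_heart_disease",
                  "heart_attack", "stroke"]
    return int(any(row[c] == "yes" for c in conditions))

# Usage on a hypothetical pandas DataFrame `df` of survey responses:
#   df["case1"] = df.apply(diabetes_label, axis=1, case="I")
#   case2 = df.assign(label=df.apply(diabetes_label, axis=1, case="II"))
#   case2 = case2.dropna(subset=["label"])  # drop excluded diagnosed diabetics
```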
| Year | Case | Observations | Variables | No. of 0s | No. of 1s |
|---|---|---|---|---|---|
| 1999-2014 | Case I | 21,131 | 123 | 15,599 | 5,532 |
| 1999-2014 | Case II | 16,426 | 123 | 9,944 | 6,482 |
| 2003-2014 | Case I | 16,443 | 168 | 11,977 | 4,466 |
| 2003-2014 | Case II | 12,636 | 168 | 7,503 | 5,133 |
| 2007-2014 | Cardio | 8,459 | 131 | 7,012 | 1,447 |
Model Development
The Weighted Ensemble Model
Feature Selection
Performance Metrics
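The four metrics reported in the results tables can be computed as in the following minimal sketch, assuming scikit-learn. Whether the reported precision, recall, and F1 values are class-weighted averages is not specified here, so the sketch shows the plain binary versions.

```python
# A minimal sketch of the four reported metrics, assuming scikit-learn.
# y_true holds the assigned labels; y_proba holds a model's predicted
# class-1 probabilities (e.g., from predict_proba in the sketch above).
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

def report(y_true, y_proba, threshold=0.5):
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    return {
        "AUC": roc_auc_score(y_true, y_proba),         # threshold-free ranking quality
        "Precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
        "Recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
        "F1": f1_score(y_true, y_pred),                # harmonic mean of precision and recall
    }
```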
Results
| Lab | Year & Case | Model | AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| No lab | 1999-2014, Diab. Case I | Logistic Reg. | 0.827 | 0.75 | 0.75 | 0.75 |
| | | SVM | 0.849 | 0.77 | 0.77 | 0.77 |
| | | Random Forest | 0.855 | 0.78 | 0.78 | 0.78 |
| | | XGBoost | 0.862 | 0.78 | 0.78 | 0.78 |
| | | Ensemble | 0.859 | 0.78 | 0.78 | 0.78 |
| | 1999-2014, Diab. Case II | Logistic Reg. | 0.732 | 0.67 | 0.67 | 0.67 |
| | | SVM | 0.734 | 0.68 | 0.68 | 0.68 |
| | | Random Forest | 0.731 | 0.67 | 0.67 | 0.67 |
| | | XGBoost | 0.734 | 0.67 | 0.67 | 0.67 |
| | | Ensemble | 0.737 | 0.68 | 0.68 | 0.68 |
| | 2003-2014, Diab. Case I | Logistic Reg. | 0.800 | 0.72 | 0.72 | 0.72 |
| | | SVM | 0.822 | 0.75 | 0.75 | 0.75 |
| | | Random Forest | 0.841 | 0.77 | 0.76 | 0.76 |
| | | XGBoost | 0.837 | 0.75 | 0.75 | 0.75 |
| | | Ensemble | 0.834 | 0.75 | 0.75 | 0.75 |
| | 2003-2014, Diab. Case II | Logistic Reg. | 0.718 | 0.66 | 0.66 | 0.66 |
| | | SVM | 0.716 | 0.66 | 0.66 | 0.66 |
| | | Random Forest | 0.719 | 0.67 | 0.67 | 0.66 |
| | | XGBoost | 0.725 | 0.67 | 0.67 | 0.67 |
| | | Ensemble | 0.725 | 0.66 | 0.66 | 0.66 |
| With lab | 1999-2014, Diab. Case I | Logistic Reg. | 0.866 | 0.79 | 0.79 | 0.79 |
| | | SVM | 0.887 | 0.81 | 0.81 | 0.81 |
| | | Random Forest | 0.937 | 0.86 | 0.86 | 0.86 |
| | | XGBoost | 0.957 | 0.89 | 0.89 | 0.89 |
| | | Ensemble | 0.944 | 0.87 | 0.87 | 0.87 |
| | 1999-2014, Diab. Case II | Logistic Reg. | 0.724 | 0.67 | 0.67 | 0.67 |
| | | SVM | 0.737 | 0.68 | 0.68 | 0.68 |
| | | Random Forest | 0.738 | 0.68 | 0.68 | 0.68 |
| | | XGBoost | 0.802 | 0.74 | 0.74 | 0.74 |
| | | Ensemble | 0.783 | 0.71 | 0.71 | 0.71 |
| | 2003-2014, Diab. Case I | Logistic Reg. | 0.877 | 0.80 | 0.80 | 0.80 |
| | | SVM | 0.882 | 0.81 | 0.80 | 0.80 |
| | | Random Forest | 0.939 | 0.86 | 0.86 | 0.86 |
| | | XGBoost | 0.962 | 0.89 | 0.89 | 0.89 |
| | | Ensemble | 0.948 | 0.88 | 0.88 | 0.88 |
| | 2003-2014, Diab. Case II | Logistic Reg. | 0.738 | 0.68 | 0.68 | 0.68 |
| | | SVM | 0.737 | 0.68 | 0.68 | 0.68 |
| | | Random Forest | 0.740 | 0.68 | 0.68 | 0.67 |
| | | XGBoost | 0.834 | 0.75 | 0.75 | 0.75 |
| | | Ensemble | 0.798 | 0.72 | 0.72 | 0.72 |
| Lab | Year | Model | AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| No lab | 2007-2014 | Logistic Reg. | 0.822 | 0.74 | 0.74 | 0.74 |
| | | SVM | 0.816 | 0.74 | 0.74 | 0.74 |
| | | Random Forest | 0.829 | 0.75 | 0.74 | 0.74 |
| | | XGBoost | 0.830 | 0.74 | 0.74 | 0.74 |
| | | Ensemble | 0.831 | 0.75 | 0.75 | 0.75 |
| With lab | 2007-2014 | Logistic Reg. | 0.827 | 0.75 | 0.75 | 0.75 |
| | | SVM | 0.825 | 0.75 | 0.75 | 0.75 |
| | | Random Forest | 0.836 | 0.76 | 0.76 | 0.76 |
| | | XGBoost | 0.838 | 0.76 | 0.76 | 0.76 |
| | | Ensemble | 0.839 | 0.76 | 0.76 | 0.76 |