Background
Objectives
Related work
Data mining in the medical domain
Frailty
Methods
Data
Definition of the variables
- Geriatric depression scale (GDS). This scale was created to obtain a reliable rating of depression in the elderly. In the so-called short form, the respondent personally answers 15 questions. Of these, 10 questions indicate the presence of depression when answered positively and the remaining 5 indicate the presence of depression when answered negatively. The test yields a score between 0 and 15, where scores from 0 to 5 mean that no depression is present and scores above 5 indicate the presence of depression [30, 31].
- Activities of daily living (ADL). This assessment also uses a questionnaire that is answered by the patient. The goal is to estimate the patient's satisfaction with daily activities, which comprise hygiene, alimentation and independent access to necessities. Several variants of the ADL test exist, which differ in the number of questions they contain. In this work, the ADL according to Katz [32] was used. The answers to 6 questions provide a score between 0 and 6, where a score of 0 signifies no ability of self-care and a score of 6 signifies complete ability of self-care.
- Instrumental activities of daily living (IADL). This test resembles the ADL test but focuses mainly on instrumental activities. These include the following daily tasks and responsibilities: food preparation, shopping, using the telephone, housekeeping, transportation, responsibility for own medications and the ability to handle finances. For each activity there are 3 to 5 questions, each yielding 0 or 1 point. The maximum for each category is 1 point, which signifies that the person is able to perform that task. Finally, these points are summed. The sum is the IADL score, which ranges between 0 and 8 [33].
- Mini-mental state examination (MMSE). The MMSE is a standardized test of cognitive function, or a measure of impaired thinking. The tested areas of cognitive function are orientation, registration, naming, recall, calculation, writing, attention, repetition, comprehension, reading and drawing. The result ranges from total cognitive absence (0 points) to full cognitive function (30 points) [34, 35].
- Mobility score (MS). The MS questions belong to the Physical Activity Scale for the Elderly (PASE) questionnaire [36]. They provide validated information about the physical activity of the patients. Here, 5 principal questions and their follow-up questions were asked, yielding a score between 0 and 5 that was derived in this work. The maximum score indicates full mobility and 0 signifies extremely limited mobility.
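The scoring rules described above for the GDS short form and the IADL can be sketched as follows. This is a minimal illustration, not the code used in this work: the function names, the boolean answer encoding and the default set of reverse-keyed GDS items (positions 1, 5, 7, 11 and 13, as in the commonly used 15-item form) are assumptions and should be checked against the questionnaire that was actually administered.

```python
from typing import Iterable, Sequence

# Assumption: 1-based positions of the 5 GDS-15 items for which a "no" answer
# indicates depression. Verify against the questionnaire actually used.
ASSUMED_REVERSE_KEYED = frozenset({1, 5, 7, 11, 13})
GDS_CUTOFF = 5  # scores above 5 indicate depression, as stated in the text


def gds_score(answers: Sequence[bool],
              reverse_keyed: Iterable[int] = ASSUMED_REVERSE_KEYED) -> int:
    """Count the depression-indicating answers of the 15-item GDS (True = yes)."""
    if len(answers) != 15:
        raise ValueError("the GDS short form has exactly 15 answers")
    reverse = set(reverse_keyed)
    score = 0
    for position, said_yes in enumerate(answers, start=1):
        # Regular items count on "yes"; reverse-keyed items count on "no".
        indicates = (not said_yes) if position in reverse else said_yes
        score += int(indicates)
    return score


def gds_indicates_depression(score: int) -> bool:
    """Apply the cutoff from the text: 0 to 5 no depression, above 5 depression."""
    return score > GDS_CUTOFF


def iadl_score(category_points: Sequence[int]) -> int:
    """Sum the per-category IADL points; each category contributes 0 or 1."""
    if any(p not in (0, 1) for p in category_points):
        raise ValueError("each IADL category yields 0 or 1 point")
    return sum(category_points)
```

With this encoding, a respondent who answers "no" to every GDS question scores 5 (only the reverse-keyed items count), which falls in the "no depression" range.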
Data exploration and quality assessment
Data preparation
Imputation of missing data
Feature | Percentage of missing data | Reason for missingness | Imputation possible |
---|---|---|---|
Times stopped smoking | 75.11 | MNAR (follow-up question) | No |
Daily wine consumption | 91.14 | MNAR (follow-up question) | No |
Daily beer consumption | 98.73 | MNAR (follow-up question) | No |
Daily spirits consumption | 98.95 | MNAR (follow-up question) | No |
Duration of alcohol consumption | 82.91 | MNAR (follow-up question) | No |
Earlier alcohol consumption | 19.20 | MAR | Yes |
Kind of drinker (earlier) | 86.29 | MNAR (follow-up question) | No |
Starting age alcohol consumption | 86.50 | MNAR (follow-up question) | No |
Ending age alcohol consumption | 86.92 | MNAR (follow-up question) | No |
D-dimer [μg/L] | 17.72 | MAR | Yes |
High-sensitivity C-reactive protein (hs-CRP) [mg/L] | 14.98 | MAR | Yes |
Number of IADL abilities | 6.33 | MAR | Yes |
Total MMSE score | 15.82 | MAR | Yes |
Total GDS | 9.49 | MAR | Yes |
Depression | 9.49 | related to {gdstotal} | No |
Insulin [U/mL] | 11.60 | MAR | Yes |
HDL | 9.07 | MAR | Yes |
LDL | 9.07 | MAR | Yes |
Total testosterone [ng/dL] | 37.97 | MAR | Yes |
Free testosterone [ng/dL] | 37.97 | MAR | Yes |
Mobility scale question 5 | 8.44 | MNAR (follow-up question) | No |
Mobility scale question 6 | 8.44 | MNAR (follow-up question) | No |
Mobility scale question 8 | 14.35 | MNAR (follow-up question) | No |
Mobility scale question 9 | 13.92 | MNAR (follow-up question) | No |
Mobility scale question 11 | 7.81 | MNAR (follow-up question) | No |
Mobility scale question 12 | 7.59 | MNAR (follow-up question) | No |
Mobility scale question 14 | 25.95 | MNAR (follow-up question) | No |
Mobility scale question 15 | 26.16 | MNAR (follow-up question) | No |
MMSE temporal domain 1 | 17.93 | MAR | Yes |
MMSE temporal domain 2 | 18.78 | MAR | Yes |
MMSE temporal domain 3 | 18.14 | MAR | Yes |
MMSE temporal domain 4 | 22.57 | MAR | Yes |
MMSE temporal domain 5 | 12.87 | MAR | Yes |
MMSE spatial domain 1 | 13.08 | MAR | Yes |
MMSE spatial domain 2 | 13.29 | MAR | Yes |
MMSE spatial domain 3 | 13.29 | MAR | Yes |
MMSE spatial domain 4 | 13.29 | MAR | Yes |
MMSE spatial domain 5 | 13.29 | MAR | Yes |
MMSE remembering 1 | 18.99 | MAR | Yes |
MMSE remembering 2 | 19.41 | MAR | Yes |
MMSE backward counting | 51.05 | MAR | Yes |
MMSE spell the word | 61.60 | MAR | Yes |
MMSE object naming | 13.92 | MAR | Yes |
MMSE repeat phrase | 13.08 | MAR | Yes |
MMSE left right | 13.50 | MAR | Yes |
MMSE following written order | 13.29 | MAR | Yes |
MMSE write sentence | 13.92 | MAR | Yes |
MMSE copying design | 13.50 | MAR | Yes |
Cognitive impairment | 17.09 | MAR | Yes |
Individual income | 8.44 | MAR | Yes |
Household income | 13.29 | MAR | Yes |
Number of persons in the family | 18.78 | MAR | Yes |
Insulin-like growth factor 1 (IGF1) [ng/mL] | 27.00 | MAR | Yes |
Dementia type | 98.73 | MNAR (follow-up question) | No |
Overall income | 13.71 | MAR | Yes |
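The two quantities tabulated above, the share of missing values per feature and whether a feature is a candidate for imputation, can be sketched as follows. This is a minimal illustration under stated assumptions: records are plain dictionaries, a missing value is encoded as `None` or an absent key, and the helper names and example column names are hypothetical. Following the table, only features whose missingness is labeled MAR are treated as imputable.

```python
from typing import Dict, List, Optional


def percent_missing(records: List[Dict[str, Optional[object]]]) -> Dict[str, float]:
    """Return the percentage of missing values per feature, rounded to 2 decimals."""
    if not records:
        return {}
    features = sorted({key for row in records for key in row})
    n = len(records)
    result = {}
    for feature in features:
        # A value counts as missing if it is None or the key is absent.
        missing = sum(1 for row in records if row.get(feature) is None)
        result[feature] = round(100.0 * missing / n, 2)
    return result


def imputation_candidates(reasons: Dict[str, str]) -> List[str]:
    """Keep only features whose reason for missingness is labeled 'MAR'."""
    return sorted(f for f, reason in reasons.items() if reason == "MAR")


# Hypothetical toy data; not taken from the study.
toy = [
    {"hdl": 1.2, "dementia_type": None},
    {"hdl": None, "dementia_type": None},
    {"hdl": 1.5, "dementia_type": "vascular"},
    {"hdl": 1.1},
]
print(percent_missing(toy))
```

Follow-up questions are excluded because their values are absent by design whenever the parent question was answered negatively, so there is no underlying value to estimate.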
- For continuous features: rfcont for numeric random forest (RF) imputations
- For binary, ordered and unordered categorical features: rfcat for categorical RF imputations (factor, ≥ 2 levels)
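The assignment rule in the list above can be sketched as a simple lookup from feature kind to imputation method name. This is only an illustration of the selection logic, not of the imputation itself: the kind labels, the function name and the example feature names are assumptions, and only the method names rfcont and rfcat come from the text.

```python
from typing import Dict

# Assumed labels for the feature kinds named in the text.
CONTINUOUS = "continuous"
CATEGORICAL_KINDS = {"binary", "ordered", "unordered"}


def assign_imputation_methods(feature_kinds: Dict[str, str]) -> Dict[str, str]:
    """Map each feature to 'rfcont' (continuous) or 'rfcat' (categorical)."""
    methods = {}
    for feature, kind in feature_kinds.items():
        if kind == CONTINUOUS:
            methods[feature] = "rfcont"
        elif kind in CATEGORICAL_KINDS:
            methods[feature] = "rfcat"
        else:
            raise ValueError(f"unknown feature kind for {feature!r}: {kind!r}")
    return methods


# Hypothetical feature names and kinds, for illustration only.
example = assign_imputation_methods({
    "hdl": "continuous",
    "cognitive_impairment": "binary",
    "household_income": "ordered",
})
print(example)
```

Raising an error for an unrecognized kind, rather than falling back to a default method, keeps a mistyped feature type from being silently imputed with the wrong model.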
Feature selection
Modeling and evaluation
Classification model settings
NB
CART
Bagging CART
C5.0
RF
SVM
LDA
Optimization of algorithm input
Min-Max Normalization
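Min-max normalization rescales a numeric feature with the standard formula x' = (x - min) / (max - min), so that the values used to determine the range fall into [0, 1]. The sketch below is a minimal illustration, not the code used in this work: the function names are hypothetical, and taking the minimum and maximum from the training data only and reusing them for unseen data is an assumption about how the step would be applied.

```python
from typing import List, Tuple


def fit_min_max(values: List[float]) -> Tuple[float, float]:
    """Return the (min, max) of the values used to define the scaling range."""
    if not values:
        raise ValueError("cannot fit min-max scaling on an empty list")
    return min(values), max(values)


def min_max_scale(values: List[float], lo: float, hi: float) -> List[float]:
    """Apply x' = (x - lo) / (hi - lo); a constant feature maps to 0.0."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical heights in cm; the range is fitted on the "training" values.
train_heights = [150.0, 160.0, 170.0]
lo, hi = fit_min_max(train_heights)
print(min_max_scale(train_heights, lo, hi))
# Unseen values outside the fitted range fall outside [0, 1].
print(min_max_scale([180.0], lo, hi))
```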
Modeling and validation schema
Performance measures
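The threshold-based measures reported in the results (accuracy, sensitivity, specificity, precision and F1-score) follow from the four cells of a binary confusion matrix using their standard definitions. The sketch below is a minimal illustration with hypothetical names and counts. It assumes that the positive class is the one of interest and returns 0.0 where a denominator is zero. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
from typing import Dict


def _safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard binary classification measures from confusion counts."""
    sensitivity = _safe_div(tp, tp + fn)   # true positive rate (recall)
    specificity = _safe_div(tn, tn + fp)   # true negative rate
    precision = _safe_div(tp, tp + fp)     # positive predictive value
    f1 = _safe_div(2 * precision * sensitivity, precision + sensitivity)
    accuracy = _safe_div(tp + tn, tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
print(binary_metrics(tp=8, fp=2, tn=6, fn=4))
```

Note that a mean F1-score averaged over validation folds generally differs from the F1-score computed from the mean precision and mean sensitivity, so the tabulated columns need not satisfy the formula exactly.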
Results
Selected features
Description | Type |
---|---|
Height (cm) | Numeric |
Presence of cognitive impairment | Binary |
Presence of depression | Binary |
Mobility Scale follow-up question (tiredness when going out) | Binary |
Mobility Scale question (stair-climbing ability) | Binary |
Mobility Scale follow-up question (tiredness when walking outside) | Binary |
Mobility Scale question (walking outside ability) | Binary |
MMSE follow-up question (remembering objects ability) | Categorical |
Total GDS | Binary |
Age in years | Numeric |
ADL question (difficulty washing) | Categorical |
Number of ADL abilities | Numeric |
Number of IADL abilities | Numeric |
IADL question (difficulty using telephone) | Categorical |
IADL question (difficulty shopping) | Categorical |
IADL question (difficulty cooking) | Categorical |
IADL question (difficulty doing light housework) | Categorical |
IADL question (difficulty doing heavy housework) | Categorical |
IADL question (difficulty using public transportation) | Categorical |
Total MMSE score | Numeric |
Sum of mobility score main features (em1, em2, em3, em4, em5) | Numeric |
Number of drugs (drug intake) | Numeric |
Alkaline phosphatase [U/L] | Numeric |
Presence of polypharmacy | Binary |
Self-reported health status | Categorical |
Self-reported health status compared to people the same age | Categorical |
Capacity of dealing with problems | Categorical |
Capacity of dealing with tasks | Categorical |
GDS question (dropped activity of interests) | Binary |
GDS question (boredom) | Binary |
Presence of joint inflammation (more than 4 weeks in a row) | Categorical |
Model performance
Prediction method | Accuracy | AUC | Sensitivity | Specificity | Precision | F1-Score |
---|---|---|---|---|---|---|
Imputation 1 | | | | | | |
Naive Bayes | 73.20 ± 5.97% | 0.756 ± 0.052 | 0.656 ± 0.102 | 0.856 ± 0.079 | 0.885 ± 0.054 | 0.749 ± 0.067 |
CART | 72.77 ± 5.20% | 0.710 ± 0.061 | 0.782 ± 0.108 | 0.639 ± 0.168 | 0.789 ± 0.065 | 0.778 ± 0.049 |
Bagging CART | 75.51 ± 7.16% | 0.731 ± 0.070 | 0.830 ± 0.086 | 0.633 ± 0.084 | 0.786 ± 0.048 | 0.806 ± 0.060 |
C5.0 | 77.83 ± 7.13% | 0.752 ± 0.086 | 0.860 ± 0.056 | 0.644 ± 0.164 | 0.804 ± 0.075 | 0.829 ± 0.051 |
Random forest | 77.64 ± 5.62% | 0.755 ± 0.053 | 0.844 ± 0.089 | 0.667 ± 0.087 | 0.806 ± 0.041 | 0.823 ± 0.050 |
Support vector machines (RBF) | 77.64 ± 6.55% | 0.762 ± 0.065 | 0.824 ± 0.09 | 0.700 ± 0.099 | 0.819 ± 0.053 | 0.819 ± 0.057 |
Linear discriminant analysis | 75.11 ± 5.34% | 0.739 ± 0.042 | 0.789 ± 0.096 | 0.689 ± 0.047 | 0.805 ± 0.023 | 0.795 ± 0.055 |
Imputation 2 | | | | | | |
Naive Bayes | 72.78 ± 6.47% | 0.750 ± 0.059 | 0.656 ± 0.109 | 0.844 ± 0.094 | 0.878 ± 0.063 | 0.745 ± 0.072 |
CART | 70.89 ± 5.94% | 0.699 ± 0.057 | 0.741 ± 0.098 | 0.656 ± 0.104 | 0.781 ± 0.047 | 0.757 ± 0.058 |
Bagging CART | 75.11 ± 6.59% | 0.729 ± 0.072 | 0.820 ± 0.089 | 0.639 ± 0.134 | 0.792 ± 0.066 | 0.802 ± 0.054 |
C5.0 | 77.39 ± 7.35% | 0.745 ± 0.093 | 0.867 ± 0.057 | 0.622 ± 0.192 | 0.797 ± 0.082 | 0.828 ± 0.050 |
Random forest | 77.01 ± 6.65% | 0.752 ± 0.064 | 0.827 ± 0.101 | 0.678 ± 0.101 | 0.809 ± 0.052 | 0.815 ± 0.060 |
Support vector machines (RBF) | 77.63 ± 7.01% | 0.761 ± 0.071 | 0.827 ± 0.085 | 0.694 ± 0.102 | 0.816 ± 0.057 | 0.820 ± 0.060 |
Linear discriminant analysis | 76.14 ± 5.15% | 0.752 ± 0.046 | 0.792 ± 0.081 | 0.711 ± 0.057 | 0.817 ± 0.032 | 0.803 ± 0.050 |
Imputation 3 | | | | | | |
Naive Bayes | 73.41 ± 5.64% | 0.757 ± 0.057 | 0.664 ± 0.083 | 0.849 ± 0.102 | 0.885 ± 0.069 | 0.755 ± 0.056 |
CART | 73.21 ± 5.75% | 0.728 ± 0.07 | 0.746 ± 0.064 | 0.709 ± 0.14 | 0.815 ± 0.067 | 0.776 ± 0.045 |
Bagging CART | 78.28 ± 3.92% | 0.764 ± 0.057 | 0.841 ± 0.058 | 0.688 ± 0.148 | 0.823 ± 0.062 | 0.828 ± 0.026 |
C5.0 | 74.06 ± 7.12% | 0.709 ± 0.089 | 0.837 ± 0.057 | 0.581 ± 0.181 | 0.774 ± 0.073 | 0.802 ± 0.048 |
Random forest | 77.62 ± 6.65% | 0.762 ± 0.076 | 0.820 ± 0.068 | 0.704 ± 0.134 | 0.824 ± 0.068 | 0.820 ± 0.052 |
Support vector machines (RBF) | 79.32 ± 5.00% | 0.779 ± 0.056 | 0.838 ± 0.049 | 0.720 ± 0.09 | 0.833 ± 0.048 | 0.834 ± 0.040 |
Linear discriminant analysis | 78.47 ± 4.77% | 0.773 ± 0.051 | 0.821 ± 0.059 | 0.726 ± 0.085 | 0.833 ± 0.045 | 0.825 ± 0.040 |
Imputation 4 | | | | | | |
Naive Bayes | 72.78 ± 5.89% | 0.750 ± 0.061 | 0.657 ± 0.083 | 0.843 ± 0.111 | 0.881 ± 0.075 | 0.749 ± 0.057 |
CART | 71.26 ± 5.83% | 0.697 ± 0.053 | 0.762 ± 0.095 | 0.631 ± 0.083 | 0.774 ± 0.043 | 0.765 ± 0.058 |
Bagging CART | 76.38 ± 5.77% | 0.747 ± 0.069 | 0.817 ± 0.076 | 0.676 ± 0.147 | 0.812 ± 0.065 | 0.811 ± 0.046 |
C5.0 | 74.25 ± 7.13% | 0.712 ± 0.085 | 0.837 ± 0.057 | 0.587 ± 0.157 | 0.774 ± 0.07 | 0.803 ± 0.052 |
Random forest | 76.99 ± 5.90% | 0.755 ± 0.069 | 0.817 ± 0.069 | 0.693 ± 0.136 | 0.819 ± 0.067 | 0.815 ± 0.046 |
Support vector machines (RBF) | 78.47 ± 5.14% | 0.771 ± 0.057 | 0.827 ± 0.053 | 0.714 ± 0.092 | 0.829 ± 0.049 | 0.827 ± 0.041 |
Linear discriminant analysis | 78.06 ± 5.39% | 0.772 ± 0.057 | 0.807 ± 0.061 | 0.737 ± 0.091 | 0.837 ± 0.049 | 0.820 ± 0.045 |
Imputation 5 | | | | | | |
Naive Bayes | 73.41 ± 5.45% | 0.756 ± 0.053 | 0.664 ± 0.088 | 0.849 ± 0.098 | 0.885 ± 0.066 | 0.754 ± 0.057 |
CART | 71.67 ± 7.79% | 0.702 ± 0.087 | 0.762 ± 0.100 | 0.642 ± 0.166 | 0.786 ± 0.089 | 0.769 ± 0.066 |
Bagging CART | 76.79 ± 4.69% | 0.749 ± 0.053 | 0.827 ± 0.071 | 0.671 ± 0.115 | 0.809 ± 0.049 | 0.815 ± 0.039 |
C5.0 | 75.31 ± 4.08% | 0.726 ± 0.055 | 0.837 ± 0.065 | 0.615 ± 0.138 | 0.787 ± 0.055 | 0.808 ± 0.030 |
Random forest | 78.03 ± 5.10% | 0.764 ± 0.060 | 0.830 ± 0.073 | 0.698 ± 0.129 | 0.824 ± 0.061 | 0.824 ± 0.041 |
Support vector machines (RBF) | 78.47 ± 5.39% | 0.771 ± 0.059 | 0.827 ± 0.055 | 0.714 ± 0.092 | 0.828 ± 0.049 | 0.827 ± 0.043 |
Linear discriminant analysis | 77.62 ± 5.35% | 0.769 ± 0.058 | 0.800 ± 0.063 | 0.737 ± 0.102 | 0.836 ± 0.054 | 0.816 ± 0.045 |