In this section, we describe our data source, formally define the stroke prediction problem, and describe the machine learning models we implemented to solve it.
Data source
Our data was supplied by Tsuyama Jifukai Tsuyama Chuo Hospital in Okayama Prefecture, Japan.
Patient cohort
The patient cohort was divided into two groups: case patients and control patients. Case patients are patients who were eventually diagnosed with stroke. The criteria for determining case patients were as follows. The initial set of case patients had a first diagnosis of stroke between January 1, 2001 and January 1, 2015. This date range was chosen to account for latent variables such as access to newer drugs, changes in data recording policies, and socioeconomic changes. The cohort was further filtered to include only patients who received their first diagnosis of stroke between the ages of 45 and 95; this range was chosen because there were at least 10 incidents of stroke at each age within it. Case patients were also required to have at least one medical diagnosis and at least one exam result recorded at least one day before the stroke diagnosis.
To maximize the number of both case and control patients, at most two control patients were selected per case patient. To be selected as a control patient, the candidate had to have the same birth year and gender as the case patient, at least one medical diagnosis and one exam result before the date the case patient was diagnosed with stroke, and at least one medical diagnosis and one exam result after that date. Control patients also had no diagnosis of stroke anywhere in their EHR. Case patients without two matched control patients were excluded from the final set of case patients.
After applying these criteria, 2725 case patients and 5450 control patients were identified. These patients formed the dataset used for training and evaluating our models.
Patient diagnoses
Each diagnosis was recorded in the EHR as a tuple comprising the diagnosis date and the ICD-10 code [18] representation of the diagnosis. For this study, codes longer than four characters were truncated to four characters. Codes longer than four characters accounted for around 5% of diagnoses across all patients; five-character codes were the longest codes in the hospital's EHR system. The truncation posed no problem because ICD-10 codes form a prefix hierarchy. Note that the ICD-10 codes were derived through a mapping from Japanese insurance codes to the 2013 edition of ICD-10. Our study supported a total of 6905 ICD-10 codes. Stroke cases were identified by the range of codes spanning I60 to I69.
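Because ICD-10 codes form a prefix hierarchy, the truncation described above amounts to a simple prefix cut: the truncated code is a valid, coarser-grained ancestor of the original. A minimal sketch (the example code `I6390` is illustrative, not taken from the dataset):

```python
def truncate_icd10(code: str, max_len: int = 4) -> str:
    """Truncate an ICD-10 code to at most max_len characters.

    ICD-10 codes form a prefix hierarchy, so the truncated code is a
    valid, coarser-grained ancestor of the original code.
    """
    return code[:max_len]

# Illustrative example: a five-character code collapses to its
# four-character parent; shorter codes pass through unchanged.
truncated = truncate_icd10("I6390")  # -> "I639"
```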
Patient exam results
We were supplied with blood and biochemical exam results for each patient. Each exam was comprised of one or more measured metrics, but an exam did not necessarily contain the same set of measured metrics. For example, a blood exam could contain results for red blood cell (RBC) count but not every blood exam was guaranteed to contain results for RBC count. There can also be two measurements per metric. The blood exam metrics were: (a) ALP (b) creatinine (c) Hb (d) hematocrit (e) MCH (f) MCHC (g) MCV (h) platelet count (i) RBC and (j) WBC. The biochemical exam metrics were: (a) GGT (b) GOT (c) GPT (d) HbA1c JDS (e) HbA1c NGSP (f) HDL-C (g) LDL-C (h) triglyceride (i) uric acid and (j) blood glucose.
We experimented with several strategies to account for missing or optional data. We first tried adding a categorical value per metric to indicate whether the measurement was absent, as a means of differentiating zero values from unrecorded values; our models had no predictive ability with this strategy. We then tried a roll-forward strategy, where each missing value was replaced with the last known recorded value; our models also failed to learn with this strategy. We then augmented the roll-forward strategy by replacing remaining missing values with the mean measured value for the patient's gender, computed from the training set, and our models began to learn. Finally, to address the differences in the number of exam results between patients, exam metrics were summarised by age, i.e. only the newest recorded value for each metric at a given age was kept. Our models attained their best results with this strategy.
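The winning imputation strategy above can be sketched as follows. This is a minimal illustration of the two steps (roll-forward with a gender-mean fallback, then summarising by age); the function names and data layout are our own, not the paper's implementation:

```python
def impute_metric(values, gender_mean):
    """Roll-forward imputation with a gender-mean fallback.

    `values` is a chronologically ordered list of measurements where
    missing entries are None; `gender_mean` is the mean value of the
    metric for the patient's gender, computed on the training set only.
    """
    imputed, last_seen = [], None
    for v in values:
        if v is not None:
            last_seen = v
        # Fall back to the training-set gender mean until a real
        # measurement has been observed.
        imputed.append(last_seen if last_seen is not None else gender_mean)
    return imputed


def newest_per_age(exams):
    """Summarise exam results by age: keep only the newest value per age.

    `exams` is a list of (age, value) pairs in chronological order, so
    later entries at the same age overwrite earlier ones.
    """
    by_age = {}
    for age, value in exams:
        by_age[age] = value
    return sorted(by_age.items())
```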
Problem description
The purpose of this study was to investigate the possibility of predicting whether a patient will suffer a stroke within a year of the last set of exam results or the last set of medical diagnoses. A period of one year was chosen because annual health exams are mandatory for full-time employees in Japan [19]. Being able to predict future illness from a regular exam would enable proactive healthcare measures to be taken.
We formulated the prediction task as a binary classification task. The inputs available to this task were a patient's historical exam results \(\boldsymbol{X}_{exam}\) and medical diagnoses \(\boldsymbol{X}_{diagnosis}\). The output \(\hat{\boldsymbol{y}}\) was a yes/no decision indicating whether the patient would suffer from stroke in the next 365 days. The following models were implemented and evaluated:
$$\begin{array}{*{20}l} \hat{\boldsymbol{y}} &= f_{d}\left(\boldsymbol{X}_{diagnosis}^{(i)}\right) \end{array} $$
(1)
$$\begin{array}{*{20}l} \hat{\boldsymbol{y}} &= f_{e}\left(\boldsymbol{X}_{exam}^{(j)}\right) \end{array} $$
(2)
$$\begin{array}{*{20}l} \hat{\boldsymbol{y}} &= f_{ed}\left(\boldsymbol{X}_{diagnosis}^{(i)}, \boldsymbol{X}_{exam}^{(j)}\right) \end{array} $$
(3)
We wanted our models to learn for themselves which input features are useful predictors for the stroke prediction task. To do this, we used all features available in the dataset and encoded them as follows.
The input matrix \(\boldsymbol{X}_{diagnosis}^{(i)}\) represented a patient's \(h\)th up to \(i\)th historical diagnoses:
$$ \boldsymbol{X}_{diagnosis}^{(i)} = \left(\boldsymbol{x}_{diagnosis}^{(h)}, \dots, \boldsymbol{x}_{diagnosis}^{(i)}\right) \mid h = \text{max}(1, i - 34), h \ne i $$
(4)
The maximum number of diagnoses in \(\boldsymbol{X}_{diagnosis}\) was limited to 35 because this covered 99% of patients in the dataset; there was a long tail of patients with more than 35 diagnoses. At least half of the patients had 5 or more diagnoses.
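The windowing of Eq. 4 keeps only the most recent 35 diagnoses. A minimal sketch (the function name is ours):

```python
def diagnosis_window(diagnoses, max_len=35):
    """Keep at most the last `max_len` historical diagnoses.

    Mirrors Eq. 4: for the i-th diagnosis the window starts at
    h = max(1, i - 34), i.e. the most recent 35 entries; patients
    with fewer diagnoses keep their full history.
    """
    return diagnoses[-max_len:]
```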
A patient's \(i\)th diagnosis was represented by the vector \(\boldsymbol{x}_{diagnosis}^{(i)}\). The patient's gender was encoded as \(g \in \{0,1\}\). The patient's age at the time of diagnosis was included and denoted as \(a_{d}^{(i)} \in \mathbb{R}\). Each diagnosis ICD-10 code was encoded into a dense vector \(\boldsymbol{d}^{(i)} \in \mathbb{R}^{32}\); details of this encoding process are described in the next section. The number of days \(n\) since the \((i-1)\)th diagnosis was converted into a categorical value \(n_{c}\) and then encoded into a one-hot label vector \(\boldsymbol{c}_{d}^{(i)} \in \{0,1\}^{|n_{c}|}\). The categories were:
$$ n_{c} =\left\{ \begin{array}{ll} 0 & \text{if } n \text{ is unknown, is the first record, or } 0 \leq n < 7\\ 1 & 7 \leq n < 30\\ 2 & 30 \leq n < 90\\ 3 & 90 \leq n < 180\\ 4 & 180 \leq n \leq 365\\ 5 & 365 < n\\ \end{array}\right. $$
(5)
The complete vector representation of the ith diagnosis was then formulated as \(\boldsymbol {x}_{diagnosis}^{(i)} = \left ((g), \left (a_{d}^{(i)}\right), \boldsymbol {c}_{d}^{(i)}, \boldsymbol {d}^{(i)}\right)\).
The input matrix \(\boldsymbol{X}_{exam}^{(j)}\) represented the patient's first to \(j\)th summarized exam results:
$$ \boldsymbol{X}_{exam}^{(j)} = \left(\boldsymbol{x}_{exam}^{(1)}, \boldsymbol{x}_{exam}^{(2)}, \dots, \boldsymbol{x}_{exam}^{(j)}\right) \mid 0 < j \leq 20 $$
(6)
No patient in the dataset had more than 20 years of exam results, so \(\boldsymbol{X}_{exam}^{(j)}\) contained at most 20 results. At least half of the patients had 5 or more years of exam results.
The patient's \(j\)th exam results were represented by the vector \(\boldsymbol{x}_{exam}^{(j)}\). The patient's gender was encoded as \(g \in \{0,1\}\). The patient's age when the exam was conducted was included and denoted as \(a_{e}^{(j)} \in \mathbb{R}\). The number of days \(n\) since the \((j-1)\)th exam was converted into a categorical value \(n_{c}\) as described in Eq. 5 and then encoded into a one-hot label vector \(\boldsymbol{c}_{e}^{(j)} \in \{0,1\}^{|n_{c}|}\). Each exam metric's measurement \(m\) was represented as \(e_{m}^{(j)} \in \mathbb{R}\). All exam metrics were combined to form the vector \(\boldsymbol{e}_{results}^{(j)} = \left(e_{1}^{(j)}, \dots, e_{m}^{(j)}\right)\). The complete vector representation of the \(j\)th exam was formulated as \(\boldsymbol{x}_{exam}^{(j)} = \left((g), \left(a_{e}^{(j)}\right), \boldsymbol{c}_{e}^{(j)}, \boldsymbol{e}_{results}^{(j)}\right)\).
ICD-10 encoding
We encoded the ICD-10 codes into the dense vector \(\boldsymbol{d}\). Generally, \(N\) categorical values can be encoded as a one-hot label vector \(\{0,1\}^{1 \times N}\), where a value of 1 in the \(i\)th column and zero in every other column represents the \(i\)th category. This encoding scheme works when the value of \(N\) does not cause memory usage issues during an algorithm's runtime. For large \(N\), it may be necessary to apply dimensionality reduction techniques to project the input space into a lower dimensional space. We used a single hidden layer autoencoder network to project the sparse input vector \(\boldsymbol{x}_{ICD} \in \{0,1\}^{6905}\) into a dense vector \(\boldsymbol{d} \in \mathbb{R}^{32}\). Our autoencoder network was defined as
$$\begin{array}{*{20}l} \boldsymbol{d} &= \text{tanh}\left(\boldsymbol{x}_{ICD}^{T}\boldsymbol{W}_{input} + \boldsymbol{b}_{input}\right) \end{array} $$
(7)
$$\begin{array}{*{20}l} \boldsymbol{y}_{decoding} &= \text{softmax}\left(\boldsymbol{d}^{T}\boldsymbol{W}_{output} + \boldsymbol{b}_{output}\right) \end{array} $$
(8)
The autoencoder network was then trained to learn \(\boldsymbol{W}_{input}\), \(\boldsymbol{b}_{input}\), \(\boldsymbol{W}_{output}\), and \(\boldsymbol{b}_{output}\) such that the categorical cross-entropy loss between \(\boldsymbol{y}_{decoding}\) and \(\boldsymbol{x}_{ICD}\) was minimized. Equation 7 can then be applied to yield our dense representation \(\boldsymbol{d}\). Our autoencoder network learned a perfect reconstruction.
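The forward pass of Eqs. 7 and 8 can be sketched in NumPy as follows. This is an illustration of the computation only: the weights here are random placeholders, whereas the paper's weights are learned by minimising the categorical cross-entropy between \(\boldsymbol{y}_{decoding}\) and \(\boldsymbol{x}_{ICD}\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6905, 32  # number of ICD-10 codes, dense dimension

# Hypothetical, untrained parameters standing in for the learned
# W_input, b_input, W_output, b_output of Eqs. 7-8.
W_in, b_in = rng.normal(0, 0.01, (N, D)), np.zeros(D)
W_out, b_out = rng.normal(0, 0.01, (D, N)), np.zeros(N)

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def encode(x_icd):
    # Eq. 7: dense representation of the one-hot ICD-10 vector
    return np.tanh(x_icd @ W_in + b_in)

def decode(d):
    # Eq. 8: reconstruction distribution over all 6905 codes
    return softmax(d @ W_out + b_out)

x = np.zeros(N)
x[42] = 1.0          # one-hot vector for an arbitrary code index
d = encode(x)        # 32-dimensional dense code
y = decode(d)        # probability distribution over the 6905 codes
```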
Model implementations
All neural network models presented in this paper were selected through experimentation, by evaluating the F\(_{1}\) score achieved on the validation set. The experimentation (which included architecture searches and meta-parameter selections) was non-exhaustive; we make no claim that the models presented here are globally optimal with respect to predictive power. Our experiments were conducted on a desktop-class machine with a single Nvidia GTX 1080 video card. We spent around 3 months conducting experiments, with the runtime of each experiment ranging from a day up to a week. We present our model descriptions as Keras 2.0.9 [20] code in the interest of being precise.
Model 1 was implemented as a recurrent neural network (RNN) using a Gated Recurrent Unit (GRU) module for recurrent connections. Text translation tasks, such as those described in [21], have demonstrated the capability of RNNs to learn from variable-length sequences and to model dependencies within the input sequence. GRU modules were chosen over Long Short-Term Memory modules because empirical evidence has shown them to be faster to train while achieving comparable performance [22]. Listing 1 shows the model's code definition.
Model 2 was implemented using a fully connected architecture. We experimented with CNN and RNN architectures but did not achieve better performance. We handled temporal input using Keras’ TimeDistributed wrapper. The code for this model is in Listing 2.
Finally, Model 3 was implemented using a dual-input architecture to handle both diagnoses and exam results as input data. This model was a fusion of Models 1 and 2. Each type of input was processed by a separate branch in the network, and the branch outputs were then concatenated and processed by the rest of the network. To speed up the training of this model, we transferred weight values from the original models for layers sharing the same configuration. The code for this model is in Listing 3. For clarity, we have omitted the code for overriding the initial weights with transferred weights.
Our dataset was split into training, validation, and test sets at a ratio of 70/15/15. The split was performed by first dividing case patients into the aforementioned subsets. To ensure that each split had the same ratio of case to control patients, each control patient was assigned to the same subset as their associated case patient. Keeping the ratios consistent across the splits was crucial when evaluating model performance.
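The group-aware split described above can be sketched as follows (a minimal illustration; function and variable names are ours, and for brevity the cases are split in list order rather than shuffled):

```python
def split_by_case(case_ids, controls_by_case, ratios=(0.7, 0.15, 0.15)):
    """Split case patients 70/15/15, then attach each case's matched
    controls to the same subset so the case:control ratio is preserved
    in every split.

    `controls_by_case` maps a case patient id to its (up to two)
    matched control ids.
    """
    n = len(case_ids)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    case_subsets = {
        "train": case_ids[:n_train],
        "val": case_ids[n_train:n_train + n_val],
        "test": case_ids[n_train + n_val:],
    }
    # Every control follows its matched case into the same subset.
    return {
        name: [pid for cid in cases
               for pid in [cid, *controls_by_case.get(cid, [])]]
        for name, cases in case_subsets.items()
    }
```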
All exam result values were scaled by subtracting the mean and then dividing by the standard deviation, with both statistics calculated from the training set. The prevalence of stroke diagnosis was 1 out of 8 \(\boldsymbol{X}_{diagnosis}\) inputs and 1 out of 9 \(\boldsymbol{X}_{exam}\) inputs.
All models were trained to minimize a customised categorical cross-entropy loss using the Adam optimizer [23] with the default parameters supplied by Keras. The random seed was fixed for all models. We used mini-batches of size 32; Models 1 and 2 were trained for 500 epochs, and Model 3 for 100 epochs.
Our loss function enhanced the standard categorical cross-entropy loss by incorporating penalties for false positive and false negative predictions. Let \(Y^{+}\) be the positive labels and \(Y^{-}\) the negative labels. The total numbers of false positives \(fp\) and false negatives \(fn\) can be stated in terms of the zero-one loss
$$ fp(f) = \sum_{i \in Y^{-}} \ell_{01}(f, x_{i}) \qquad \qquad fn(f) = \sum_{i \in Y^{+}} \ell_{01}(f, x_{i}) $$
(9)
Then, a suitable surrogate for the zero-one loss can be used to place an upper bound on these quantities and to make them optimizable through stochastic gradient descent methods [24]. We used log loss as the surrogate since we wanted to incorporate these quantities into the standard categorical cross-entropy loss. These quantities satisfied the following inequalities
$$ \begin{aligned} fp^{u}(f) &\triangleq \sum_{i \in Y^{-}} \ell_{ll}(f, x_{i}) \geq fp(f) \\ fn^{u}(f) &\triangleq \sum_{i \in Y^{+}} \ell_{ll}(f, x_{i}) \geq fn(f) \end{aligned} $$
(10)
where \(\ell_{ll}(f, x) = -p(x) \log f(x)\). These quantities were incorporated into the standard cross-entropy loss as
$$ \ell(f) = \alpha\left(\sum_{i \in Y} \ell_{ll}(f, x_{i}) \right) + \beta fp^{u}(f) + \gamma fn^{u}(f) $$
(11)
The terms \(\alpha\), \(\beta\), and \(\gamma\) were used to change the influence of each loss term. Our training used \(\alpha = 0.2\), \(\beta = 5\), and \(\gamma = 5\).
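For the binary case, the penalised loss of Eq. 11 can be sketched in NumPy as follows. This is an illustration of the mathematics, not the paper's Keras implementation; the function name and array layout are ours:

```python
import numpy as np

def custom_loss(p_pos, y_true, alpha=0.2, beta=5.0, gamma=5.0):
    """Penalised cross-entropy of Eq. 11 (NumPy sketch).

    `p_pos` holds predicted probabilities of the positive (stroke)
    class and `y_true` holds 0/1 labels. The extra terms are the
    log-loss upper bounds of Eq. 10 on the false positive and false
    negative counts.
    """
    p_pos = np.clip(p_pos, 1e-7, 1 - 1e-7)  # avoid log(0)
    pos, neg = y_true == 1, y_true == 0
    # Standard cross-entropy over all examples (the first term of Eq. 11).
    ce = -(np.sum(np.log(p_pos[pos])) + np.sum(np.log(1 - p_pos[neg])))
    # fp^u: log-loss over the negatives; fn^u: log-loss over the positives.
    fp_u = -np.sum(np.log(1 - p_pos[neg]))
    fn_u = -np.sum(np.log(p_pos[pos]))
    return alpha * ce + beta * fp_u + gamma * fn_u
```

Because \(\beta\) and \(\gamma\) exceed \(\alpha\), confident misclassifications in either direction are penalised much more heavily than under plain cross-entropy.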