Background
Temporal abstractions of EHRs
SAX
Approaches in this direction first convert each time series into a symbolic sequence, then choose a representative subsequence for each feature, and finally compute the distance of all features to the representative. This process results in single-valued features. Unfortunately, the approach by Zhao et al. [32] suffers from two main weaknesses: (1) the chosen distance function, i.e., the Levenshtein distance, is highly dependent on the sequence length, and (2) it cannot effectively handle and exploit the high degree of sparsity in the feature space.
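To make the length dependence concrete, consider the following toy computation (our own illustration, not from the original evaluation): two short sequences that disagree in one out of four positions have edit distance 1, while two ten-times-longer sequences with the same per-symbol disagreement rate end up ten times farther apart.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Same 25% per-symbol disagreement, very different absolute distances:
print(levenshtein("abcd", "abce"))            # 1
print(levenshtein("abcd" * 10, "abce" * 10))  # 10
```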
ADE knowledge extraction from EHRs

Contributions
Our framework builds on the SAX time series summarization technique as well as on the concept of s-shapelets, which correspond to class-distinctive discrete event subsequences. Moreover, we propose three strategies for handling empty sequence records: length encoding or plain (which is an extension of Zhao et al. [32]), most-common encoding or mc, and left-right optimized encoding or lr.

Methods
Overview
- Phase A: each feature in the training set is first transformed into a discrete symbolic sequence representation.
- Phase B: a set of candidate subsequences is generated from the discretized features; these are then evaluated based on their class-distinctive power (referred to as utility), and the set of subsequences with the highest utility, called s-shapelets, is selected. An important component of this phase concerns the way empty data records, i.e., records with empty sequences, are processed. Towards this end, we propose three strategies for handling and exploiting empty sequence features.
- Phase C: using the s-shapelets extracted from the training set, each multi-variate data feature (of the training and test sets) is converted to a real-valued feature.
Definitions and problem formulation
Phase A: Multi-variate feature discretization
Normalization
Summarization
Symbolic representation
Using SAX, the initial multi-variate feature space \(\mathcal{A}\) is converted to its SAX representation, which defines the symbolic multi-variate feature space \(\mathcal{\hat{A}} = \left\{\hat{A}_{1}, \ldots, \hat{A}_{m}\right\}\), comprising symbolic sequence representations of variable lengths.
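For readers unfamiliar with SAX, the following minimal sketch (our own, assuming the standard pipeline of z-normalization, piecewise aggregate approximation, and equiprobable Gaussian breakpoints; parameter names are illustrative, not the paper's implementation) shows how a numeric series is turned into a symbolic word:

```python
import numpy as np
from scipy.stats import norm

def sax(series, n_segments=4, alphabet_size=4):
    """Discretize a numeric series into a symbolic SAX word."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-9)                 # z-normalize
    # Piecewise Aggregate Approximation: mean value of each segment.
    paa = np.array([seg.mean() for seg in np.array_split(x, n_segments)])
    # Breakpoints splitting the standard normal into equiprobable bins.
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    # Map each segment mean to the index of its bin, then to a letter.
    return "".join(chr(ord("a") + s) for s in np.searchsorted(breakpoints, paa))

print(sax([1, 2, 3, 4, 10, 12, 11, 9]))  # -> 'aadd'
```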
Alphabet calibration

Phase B: Subsequence enumeration
Subsequence generation
Subsequence evaluation

Subsequence selection
Phase C: Data transformation
Exploiting sparsity
We propose three strategies for encoding empty sequence records: length encoding (or plain), most-common encoding (or mc), and left-right optimized encoding (or lr).

Strategy I: Length encoding (plain)
plain treats empty multi-variate feature records as regular entries, and replaces them with the distance between their symbolic representation and the optimum s-shapelet, that is, it replaces them with \(\left|s^{*}\left(\mathcal{\hat{A}}_{j}\right)\right|\) for each feature \(\mathcal{\hat{A}}_{j}\). This approach is a modified and improved version of the random dynamic subsequence representation used by Zhao et al. [32]. In particular, our method corrects for the bias introduced by random dynamic subsequences, i.e., the tendency to favor longer subsequences, using our re-defined subsequence distance measure. As a consequence, in our experimental evaluation we use plain as a baseline for methods that consider temporal information, similar to how sl acts as a baseline for methods that do not consider temporal information.

Strategy II: Most-common encoding (mc)
The second strategy is mc. When building the single-valued features at training and prediction time, mc replaces ∅s with the distance value Dist(s∗,·) that occurs most frequently within the training set. When dealing with very sparse feature spaces, this choice can be interpreted as a way of considering missing entries as “frequent”.

Strategy III: Left-right optimized encoding (lr)
The third strategy is lr. When evaluating a distance threshold for a given subsequence and building the resulting split \(\left\{\mathcal{\hat{L}}_{1},\mathcal{\hat{L}}_{2}\right\}\) (lines (4) and (5) of Algorithm 1), lr tries to assign all of the ∅s either to \(\mathcal{\hat{L}}_{1}\) (left) or to \(\mathcal{\hat{L}}_{2}\) (right), and selects the option yielding the highest information gain; the distance of a candidate s-shapelet to ∅ is thus determined by the side of the threshold to which the empty records are assigned. Based on the resulting gains, lr selects the best s-shapelet. The strategy also keeps track of the assignments that yield the overall maximum gain, that is, \(Dist(s^{*}(\mathcal{A}_{j}), \emptyset)\), \(j=1,\ldots,m\), and uses this value to replace missing entries at both training and prediction time.
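To make the three encodings concrete, here is a minimal, self-contained sketch (our own illustration: helper names are hypothetical, the distances Dist(s∗,·) are assumed precomputed, empty records are represented as NaN, and a single candidate threshold is evaluated, whereas Algorithm 1 repeats this for every candidate and keeps the globally best assignment):

```python
import numpy as np
from collections import Counter

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def info_gain(y, left):
    # Gain of splitting the records into {left, right} by a distance threshold.
    if left.all() or not left.any():
        return 0.0
    return (entropy(y) - left.mean() * entropy(y[left])
            - (~left).mean() * entropy(y[~left]))

def encode_empty(dist, y, shapelet_len, threshold, strategy):
    """Value used to replace empty records (NaN entries of `dist`)."""
    empty = np.isnan(dist)
    if strategy == "plain":   # Strategy I: |s*|, the s-shapelet length
        return float(shapelet_len)
    if strategy == "mc":      # Strategy II: most frequent observed distance
        return Counter(dist[~empty].tolist()).most_common(1)[0][0]
    if strategy == "lr":      # Strategy III: send all NaNs left or right,
        # keeping whichever assignment yields the higher information gain.
        as_left = np.where(empty, threshold, dist) <= threshold
        as_right = np.where(empty, threshold + 1, dist) <= threshold
        if info_gain(y, as_left) >= info_gain(y, as_right):
            return threshold       # any value at/below the threshold
        return threshold + 1       # any value above the threshold

# Toy example: emptiness correlates with the positive class, so lr
# pushes the empty records to the right of the threshold.
dist = np.array([0.0, 1.0, np.nan, 4.0, np.nan, np.nan])
y = np.array([0, 0, 1, 0, 1, 1])
print(encode_empty(dist, y, shapelet_len=3, threshold=2.0, strategy="lr"))  # 3.0
```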
In the “Utility of s-shapelets selected by lr vs. plain” section, we elaborate on the lr strategy in terms of interpretability and show that a dynamic encoding of missing data can (also) help understand the s-shapelet that has been selected for classifying a multi-variate feature dataset.

Data source
Empirical evaluation
Datasets
| Adverse drug event | Positive | Negative |
|---|---|---|
| D61.1: Aplastic anaemia | 593 | 105 |
| Average age | 50.3 | 55.6 |
| Gender distribution (% female) | 42.7 | 49.5 |
| E27.3: Adrenocortical insufficiency | 70 | 259 |
| Average age | 61.8 | 56 |
| Gender distribution (% female) | 58.6 | 58.6 |
| G62.0: Polyneuropathy | 96 | 783 |
| Average age | 59.8 | 72.7 |
| Gender distribution (% female) | 47.9 | 40.1 |
| I95.2: Hypotension | 115 | 1287 |
| Average age | 79.3 | 74.7 |
| Gender distribution (% female) | 40.9 | 49.6 |
| L27.0: Generalized skin eruption | 182 | 468 |
| Average age | 60.1 | 48.7 |
| Gender distribution (% female) | 55.5 | 48.4 |
| L27.1: Localized skin eruption | 151 | 498 |
| Average age | 59.8 | 55.9 |
| Gender distribution (% female) | 50.3 | 54.5 |
| M80.4: Osteoporosis | 52 | 1170 |
| Average age | 65.8 | 70.9 |
| Gender distribution (% female) | 71.15 | 81 |
| O35.5: Damage to fetus by drugs | 146 | 260 |
| Average age | 38.5 | 38.9 |
| Gender distribution (% female) | 100 | 100 |
| T78.2: Anaphylactic shock | 131 | 856 |
| Average age | 50.9 | 45.46 |
| Gender distribution (% female) | 50.4 | 60.7 |
| T78.3: Angioneurotic oedema | 283 | 720 |
| Average age | 56.4 | 42.35 |
| Gender distribution (% female) | 59 | 59.9 |
| T78.4: Allergy | 574 | 415 |
| Average age | 41.2 | 52.5 |
| Gender distribution (% female) | 65.2 | 51.2 |
| T80.1: Vascular complications | 66 | 609 |
| Average age | 66.2 | 63.2 |
| Gender distribution (% female) | 48.5 | 64.7 |
| T80.8: Infusion complications | 538 | 138 |
| Average age | 64.3 | 60.4 |
| Gender distribution (% female) | 65.8 | 52.2 |
| T88.6: Anaphylactic shock | 89 | 1506 |
| Average age | 56.9 | 58.5 |
| Gender distribution (% female) | 51.7 | 57.6 |
| T88.7: Unspecified adverse effect | 1047 | 550 |
| Average age | 60.9 | 53.6 |
| Gender distribution (% female) | 60.2 | 51.3 |
| ADE | 0.2 | 0.3 | 0.5 | 0.7 | 0.9 | 0.95 | 1.0 |
|---|---|---|---|---|---|---|---|
| D61.1 | 16 | 21 | 23 | 34 | 72 | 90 | 186 |
| E27.3 | 11 | 12 | 14 | 19 | 42 | 88 | 137 |
| G62.0 | 4 | 11 | 16 | 19 | 40 | 62 | 151 |
| I95.2 | 11 | 13 | 14 | 20 | 30 | 56 | 180 |
| L27.0 | 4 | 12 | 18 | 25 | 33 | 54 | 162 |
| L27.1 | 6 | 11 | 17 | 24 | 35 | 62 | 169 |
| M80.4 | 9 | 11 | 14 | 19 | 42 | 62 | 170 |
| O35.5 | 1 | 2 | 4 | 15 | 24 | 38 | 73 |
| T78.2 | 8 | 9 | 12 | 17 | 29 | 50 | 168 |
| T78.3 | 8 | 9 | 12 | 17 | 27 | 43 | 131 |
| T78.4 | 8 | 9 | 13 | 17 | 29 | 51 | 194 |
| T80.1 | 11 | 13 | 19 | 25 | 33 | 40 | 131 |
| T80.8 | 11 | 14 | 19 | 25 | 33 | 43 | 128 |
| T88.6 | 11 | 12 | 15 | 21 | 33 | 59 | 202 |
| T88.7 | 11 | 12 | 16 | 21 | 33 | 62 | 217 |
Benchmarked methods
Our framework, under each of the three proposed strategies, i.e., plain, mc, and lr, has been evaluated using the Random Forest algorithm (RF, [51]). Since our proposed framework is model agnostic, we note that alternative predictive models can be used. The hyper-parameters for RF have been configured as follows: (i) we set the number of trees to 100, (ii) information gain was used as the split criterion (consistent with the way random subsequences are evaluated and an s-shapelet is selected), and (iii) the number of features to consider at each decision split was set to the default value \(\sqrt{m}\), where m is the number of features in the dataset.
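As a reference, this configuration can be reproduced with scikit-learn (our choice of library; the paper does not prescribe an implementation):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,     # (i) 100 trees
    criterion="entropy",  # (ii) information gain as the split criterion
    max_features="sqrt",  # (iii) sqrt(m) features per decision split
    random_state=0,       # for reproducibility (our addition, not from the paper)
)
```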
As a baseline, we use sl [20]. This representation has been shown to be the best single-valued representation for clinical laboratory measurement features in the context of detecting ADEs in EHRs, compared to several other multi-variate feature representation techniques that do not take into account the temporal order of the measurements [20].

Alphabet tuning
Sparsity tolerance
Given a sparsity threshold τsp, only features whose fraction of empty sequence records does not exceed τsp are retained and passed to RF. Clearly, when τsp=1.0, all available features are taken into account, regardless of the percentage of empty sequence records.
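A minimal sketch of this filtering step, assuming features are columns of a pandas DataFrame with empty sequences stored as missing values (our own illustration, not the paper's code):

```python
import pandas as pd

def filter_by_sparsity(df: pd.DataFrame, tau_sp: float) -> pd.DataFrame:
    """Keep features whose fraction of empty records is at most tau_sp."""
    empty_fraction = df.isna().mean()          # per-column fraction of ∅s
    return df.loc[:, empty_fraction <= tau_sp]

# With tau_sp=1.0 every feature is kept, matching the text above.
```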
Evaluation metrics

Results
Sparsity tolerance of baseline
We first examine how the baseline, sl, can be affected in terms of predictive performance in the presence of different levels of sparse multi-variate features. Since sl uses the length of the multi-variate feature value (i.e., the length of the sequence assigned to a feature record) as its single-valued representation, empty sequences will be replaced by ∅. Figure 4a depicts the AUC obtained by RF on a selection of 5 ADE datasets, while Fig. 4b shows the average AUC on all 15 datasets, as the value of τsp increases. We observe that predictive performance in terms of AUC increases as sparser features are included in the learning process. Note that although we only show five datasets in Fig. 4a, the performance is similar for the remaining datasets, as can be confirmed in Table 5.
Sparsity tolerance of plain, mc, and lr

We train RF using the three proposed feature representations, namely plain, mc, and lr, and measure their respective AUCs for each ADE dataset and for different sparsity levels.

Overall comparison
Comparing the AUCs obtained by the three proposed strategies and the baseline (sl) at each sparsity level – the highest AUC reached for a given threshold is shown in bold in Table 5 – we can see that plain and lr are the most effective feature representations, followed by sl. In particular, lr outperforms the other strategies (in terms of number of best results obtained on all of the 15 ADE datasets) for τsp equal to 0.2, 0.7, 0.95, and 1.0, while plain is the best method for τsp=0.3. For the other two sparsity levels the two methods perform equally well. When considering the best overall result on each ADE dataset, lr reaches the highest AUC in 8 out of 15 cases.

lr is the strategy that best takes into account the information provided by empty multi-variate feature records. To further support this claim, we compare the performance of lr against that of plain and sl. Table 3 shows the outcome of comparing lr, plain, and sl over all datasets and sparsity levels. For each value of the sparsity threshold, we report the method achieving the highest AUC on most ADEs; the number of ADEs on which the method performs best is indicated in brackets. The last column of Table 3 refers to the number of best AUCs obtained by the methods on all of the ADE datasets regardless of the sparsity threshold. Inspecting Table 3, we observe that lr is the most accurate feature representation strategy for almost every sparsity threshold.
Table 3 Comparison between lr and plain (first row) and between lr and sl (second row), for different values of the sparsity threshold

| | 0.2 | 0.3 | 0.5 | 0.7 | 0.9 | 0.95 | 1.0 | Overall |
|---|---|---|---|---|---|---|---|---|
| lr vs plain | lr (10) | plain (8) | plain (8) | lr (11)* | lr (11) | lr (9) | lr (10)* | lr (9) |
| lr vs sl | lr (13)* | lr (14)* | lr (13)* | lr (10)* | lr (13)* | lr (10) | lr (9) | lr (12)* |
Statistical significance
lr performs statistically better than plain when τsp=0.7 and τsp=1.0, with p<0.05 in both cases. Conversely, the two cases in which plain outperforms lr are not statistically significant. Concerning the comparison between lr and sl, the null hypothesis is rejected for τsp=0.2, 0.3, 0.5, and 0.9 (with p<0.01), and for τsp=0.7 (with p<0.05). Finally, lr proves to be statistically better (p<0.05) than sl also when considering the overall best performance on each ADE dataset, as can be seen in the last column of Table 3.
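As an illustration of this comparison, the following sketch applies a Wilcoxon signed-rank test to the paired per-ADE AUCs of lr and plain at τsp=0.7, taken from the corresponding column of Table 5 (the choice of test is our assumption; the paper reports only the resulting significance levels):

```python
from scipy.stats import wilcoxon

# AUCs at tau_sp = 0.7 for the 15 ADE datasets (from Table 5).
auc_lr = [79.675, 61.179, 77.169, 53.184, 67.020, 66.264, 59.760,
          76.971, 59.569, 56.952, 57.019, 83.739, 85.210, 63.256, 61.543]
auc_plain = [80.909, 59.870, 77.247, 52.257, 67.815, 64.835, 59.029,
             73.161, 56.538, 54.575, 56.976, 83.097, 83.704, 62.830, 62.239]

stat, p = wilcoxon(auc_lr, auc_plain)
print(f"p = {p:.3f}")  # below 0.05, consistent with the reported significance
```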
Utility of s-shapelets selected by lr vs. plain
We compare the utility of the s-shapelets selected by the strategy that dynamically encodes missing data, i.e., lr, and by the strategy that does not, i.e., plain. Figure 5 shows the utility, according to information gain, of the s-shapelets that have been generated by using either lr or plain, for all datasets. The horizontal axes of the subfigures show the information gain for the s-shapelets as computed by lr, while the vertical axes report the information gain computed by plain. The color intensity of the point representing each s-shapelet indicates the sparsity of the multi-variate feature from which it was selected.
We observe that lr is consistently able to select s-shapelets with a higher information gain compared to plain. Interestingly, we can also identify ADE cases where the above holds for extremely sparse features, such as T78.3, T80.8, T88.7, T80.1, L27.1, and G62.0. As confirmed by our results, this difference in information gain between the two strategies results in a model with higher classification performance.

Investigation of three ADEs
Table 4 lists the top-5 features selected by lr and plain for three ADEs: E27.3, L27.1, and G62.0. These adverse effects were chosen since they are relatively frequent and the information gain between the baseline and our proposed method differed the most, i.e., there is a conflicting explanation between the two models.

For E27.3 (adrenocortical insufficiency), dehydroepiandrosterone sulfate is the top-ranked feature identified by lr. More importantly, it is well known in clinical pharmacology that the lack of aldosterone can cause persistently low or uncontrolled levels of sodium, potassium, and cortisol in the blood [58]. This has also been confirmed by the clinical pharmacologists involved in our study. Interestingly, both sodium and potassium are included in the list of top-5 features identified by lr. On the other hand, plain manages to identify rather obvious features, such as reduced levels of cortisol and hemoglobin, which are typically present in the occurrence of adrenal hemorrhage [58].
Table 4 Top-5 features selected by lr and plain for the three investigated ADEs

| | Adrenocortical insufficiency | | Localized skin eruption | | Polyneuropathy | |
|---|---|---|---|---|---|---|
| | lr | plain | lr | plain | lr | plain |
| 1 | Dehydroepiandrosterone sulfate | Lymphocytes | Erythrocytes | MCHC | Erythrocytes | MCHC |
| 2 | Potassium ion | Cortisol | MCHC | Erythrocytes | MCHC | Erythrocytes |
| 3 | Neutrophilocytes | Hemoglobin | Calcium | Calcium | MCH | MCH |
| 4 | Lymphocytes | Sedimentation reaction | Bilirubins | Creatininium | Bilirubins | Calcium |
| 5 | Sodium ion | Erythrocytes | Creatininium | Carbamide | Calcium | Creatininium |
For G62.0 (polyneuropathy), calcium appears among the top-5 features identified by both plain and lr. This agrees with the findings of Fernyhough and Calcutt [60] that reduced or irregular levels of calcium can indicate peripheral neuropathy. More importantly, patients with peripheral neuropathy typically exhibit an elevated erythrocyte sedimentation rate in their blood [61], which is also consistent with our findings.

In summary, both plain and lr manage to identify important features that are connected with the corresponding ADEs. Based on the existing literature and our consultation with our collaborating clinical pharmacologists, the findings presented in Table 4 are not novel, as they are already known to clinical pharmacologists. Nonetheless, they demonstrate that our proposed method works in practice, while lr is shown to be highly robust to sparse temporal features without compromising predictive performance.

Discussion
A surprising finding concerns the strong performance of the baseline sl, which simply employs the number of recordings (length) per clinical laboratory measurement as a feature for the ADE prediction task. This suggests that the presence or absence, as well as the number of times a clinical laboratory measurement has been taken for a patient, can constitute a promising ADE predictor.

A related consideration concerns the static encoding of missing data employed by sl. According to this strategy, ∅s are replaced with 0, regardless of the class distribution. The same holds for plain and mc, which map ∅ to |s| and to the most common observed distance value, respectively. More generally, a static encoding of missing data, ignoring the information given by the class distribution, affects the quality of an s-shapelet. Conversely, by employing a strategy such as lr, the choice of the encoding of empty sequences dynamically fits the class balance.

Conclusions
Table 5 AUC obtained by RF on 15 ADE datasets: 4 methods are compared, respectively sl, plain, mc, and lr. The highest AUC for each dataset and sparsity threshold is shown in bold
| ADE | Method | 0.2 | 0.3 | 0.5 | 0.7 | 0.9 | 0.95 | 1.0 |
|---|---|---|---|---|---|---|---|---|
| D61.1 | plain | **78.151** | **80.842** | **80.817** | **80.909** | **84.107** | **84.801** | 82.484 |
| | mc | 76.876 | 78.132 | 80.163 | 78.854 | 81.259 | 82.946 | 81.720 |
| | lr | 77.978 | 80.720 | 79.961 | 79.675 | 83.624 | 83.610 | 81.510 |
| | sl | 73.846 | 78.653 | 78.928 | 80.756 | 83.016 | 82.412 | **82.817** |
| E27.3 | plain | 51.916 | 58.345 | 56.302 | 59.870 | 58.882 | 62.518 | 66.854 |
| | mc | 48.953 | 50.052 | 46.695 | 50.126 | **61.773** | 59.444 | 58.854 |
| | lr | 58.445 | 58.637 | 56.816 | **61.179** | 59.170 | 64.162 | **67.266** |
| | sl | **60.763** | **62.054** | **59.670** | 58.718 | 58.035 | 64.686 | 66.500 |
| G62.0 | plain | 64.443 | 71.325 | 75.452 | 77.247 | 78.839 | 79.074 | 80.279 |
| | mc | **66.228** | **72.700** | 71.376 | 76.974 | 74.769 | 74.146 | 74.713 |
| | lr | 64.713 | 70.756 | 74.787 | 77.169 | 78.992 | 79.357 | 79.418 |
| | sl | 61.473 | 69.429 | **76.335** | **78.379** | **81.980** | **81.509** | **82.222** |
| I95.2 | plain | 56.157 | 57.338 | 54.871 | 52.257 | 54.046 | **59.855** | 56.369 |
| | mc | 57.486 | **57.390** | 54.750 | 49.333 | 53.480 | 51.742 | **57.391** |
| | lr | **58.052** | 54.797 | **58.165** | **53.184** | **56.145** | 57.145 | 56.362 |
| | sl | 47.671 | 49.289 | 51.080 | 51.509 | 53.065 | 54.267 | 53.943 |
| L27.0 | plain | **60.374** | **68.063** | 66.689 | **67.815** | 65.273 | 67.416 | 65.322 |
| | mc | 56.065 | 62.683 | **66.766** | 65.527 | **67.431** | 66.443 | 64.500 |
| | lr | 56.585 | 66.422 | 66.256 | 67.020 | 66.277 | **68.424** | 66.110 |
| | sl | 55.400 | 62.981 | 61.919 | 64.648 | 64.597 | 65.136 | **68.271** |
| L27.1 | plain | 58.130 | 61.600 | 63.471 | 64.835 | 63.012 | 62.454 | 61.798 |
| | mc | 56.093 | 55.029 | **67.179** | 61.970 | 64.556 | 61.222 | 62.455 |
| | lr | **58.636** | **64.100** | 65.983 | 66.264 | **67.196** | 63.884 | 64.012 |
| | sl | 53.709 | 61.418 | 62.955 | **66.461** | 66.554 | **67.397** | **67.316** |
| M80.4 | plain | 55.646 | 54.745 | **59.375** | 59.029 | 68.713 | 67.004 | 65.396 |
| | mc | 55.729 | 55.770 | 52.641 | 58.996 | 66.863 | 63.000 | 63.507 |
| | lr | **55.753** | **58.056** | 58.849 | **59.760** | **68.912** | **69.119** | **67.756** |
| | sl | 47.933 | 57.149 | 52.678 | 58.414 | 67.945 | 65.199 | 64.845 |
| O35.5 | plain | 51.650 | 52.557 | 64.963 | 73.161 | 77.383 | 78.590 | 80.298 |
| | mc | 52.741 | 52.962 | 55.361 | 68.843 | 70.114 | 72.079 | 70.104 |
| | lr | **52.756** | **53.551** | **68.318** | **76.971** | **78.332** | **80.322** | **81.324** |
| | sl | 52.401 | 52.188 | 67.201 | 73.696 | 75.025 | 78.140 | 79.142 |
| T78.2 | plain | 53.028 | 51.573 | **56.977** | 56.538 | 57.153 | **58.709** | 59.229 |
| | mc | **55.811** | **54.349** | 53.146 | 55.771 | 53.260 | 54.417 | 53.092 |
| | lr | 52.924 | 52.369 | 55.286 | **59.569** | **63.066** | 57.775 | 59.254 |
| | sl | 53.113 | 50.471 | 50.492 | 57.561 | 56.850 | 58.646 | **59.421** |
| T78.3 | plain | 51.177 | **54.073** | 52.614 | 54.575 | 58.973 | 59.449 | 63.647 |
| | mc | 48.308 | 50.692 | 51.378 | 53.746 | 56.715 | 55.334 | 58.111 |
| | lr | **51.762** | 53.238 | **55.155** | **56.952** | **59.213** | **61.924** | **66.209** |
| | sl | 50.514 | 51.851 | 54.273 | 53.531 | 56.337 | 59.759 | 64.728 |
| T78.4 | plain | 51.937 | 54.675 | **58.151** | 56.976 | **57.124** | 56.506 | **59.162** |
| | mc | 52.176 | 55.459 | 55.644 | 54.720 | 53.475 | 55.145 | 56.218 |
| | lr | **54.434** | **58.122** | 54.655 | **57.019** | 56.627 | **58.496** | 58.549 |
| | sl | 49.903 | 49.876 | 47.081 | 50.235 | 54.326 | 56.986 | 58.444 |
| T80.1 | plain | **78.962** | 77.270 | **84.589** | 83.097 | **86.832** | **85.994** | 84.189 |
| | mc | 73.207 | 72.912 | 73.492 | 78.945 | 74.963 | 79.334 | 74.592 |
| | lr | 78.930 | **80.478** | 83.233 | **83.739** | 84.677 | 84.879 | **85.803** |
| | sl | 75.125 | 76.542 | 79.956 | 81.759 | 81.413 | 83.918 | 82.726 |
| T80.8 | plain | 80.744 | **81.769** | 83.486 | 83.704 | 85.648 | 86.136 | 86.751 |
| | mc | 71.968 | 71.717 | 76.488 | 77.570 | 78.308 | 78.182 | 80.725 |
| | lr | **81.119** | 81.211 | **85.431** | 85.210 | 86.319 | 86.715 | 86.627 |
| | sl | 80.537 | 80.973 | 84.689 | **85.569** | **86.332** | **86.875** | **87.351** |
| T88.6 | plain | **61.998** | **60.731** | 62.299 | 62.830 | **65.113** | **63.499** | 62.249 |
| | mc | 58.342 | 58.993 | **63.476** | 57.263 | 59.652 | 59.568 | 61.812 |
| | lr | 60.283 | 60.159 | 62.019 | **63.256** | 60.864 | 61.744 | **64.120** |
| | sl | 57.623 | 57.248 | 57.450 | 57.417 | 58.190 | 57.672 | 60.403 |
| T88.7 | plain | 58.971 | **60.742** | 62.399 | **62.239** | 62.552 | 63.803 | 65.352 |
| | mc | 56.875 | 57.931 | 62.472 | 60.892 | **63.013** | 62.435 | 64.209 |
| | lr | **60.580** | 60.462 | **62.803** | 61.543 | 62.588 | **64.137** | **66.276** |
| | sl | 60.297 | 59.862 | 61.498 | 61.811 | 62.259 | 64.001 | 65.139 |