Dataset description
MIMIC-III Dataset: The MIMIC-III Critical Care Database [33] is a publicly available database containing de-identified health records of 46,518 patients who stayed in the Beth Israel Deaconess Medical Center's intensive care units between 2001 and 2012. Each visit in the dataset contains both structured health record data and free-text clinical notes.
We used EHR data from all patients in the dataset. The total number of patient visits in MIMIC-III is 58,597. On average, each patient had 1.26 visits: 38,991 patients had a single visit, 5151 had two visits, and 2376 had three or more visits. The average number of recorded ICD-9 diagnosis codes per visit is 11, and the average number of words in the clinical notes of a visit is 7898. For each patient visit, we extracted all diagnosis codes and all clinical notes.
Preprocessing: For each EHR in the dataset, we focus only on the clinical notes and the ICD-9 diagnosis codes. Each clinical note was preprocessed as follows. All digits and stop words were removed. Typos were filtered out using the standard English vocabulary of PyEnchant, a Python library for spell checking. For representation learning, rare words were also filtered out, since they do not appear often enough to learn good-quality representations; therefore, all words occurring fewer than 50 times were removed. The resulting number of unique words was 14,302. The total number of unique ICD-9 diagnosis codes in MIMIC-III is 6984. Codes occurring fewer than 5 times were removed, which reduced the number of codes to 3874. Since some codes were still too rare for learning meaningful representations, we exploited the hierarchical tree structure of ICD-9 codes and grouped them by their first three digits. For example, ICD-9 codes “2901” (presenile dementia), “2902” (senile dementia with delusional or depressive features) and “2903” (senile dementia with delirium) were grouped into the single code “290” (dementias). The size of the final code vocabulary was 752.
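The preprocessing steps above can be sketched as follows. The stop-word list and English dictionary here are tiny placeholder sets standing in for a full stop-word list and the PyEnchant en_US dictionary, and the helper names are ours, not the paper's:

```python
import re

# Placeholder sets: stand-ins for a full stop-word list and the
# PyEnchant English dictionary used in the paper.
STOP_WORDS = {"the", "and", "of", "with", "was"}
ENGLISH_VOCAB = {"liver", "failure", "hepatic", "cirrhosis", "lactate"}

def preprocess_note(text, min_count_vocab=None):
    """Clean one clinical note: drop digits, stop words, and typos
    (tokens not found in the English dictionary)."""
    text = re.sub(r"\d+", " ", text.lower())          # remove all digits
    tokens = [w for w in re.findall(r"[a-z]+", text)
              if w not in STOP_WORDS and w in ENGLISH_VOCAB]
    if min_count_vocab is not None:
        # Rare-word filter: keep only words seen at least 50 times corpus-wide.
        tokens = [w for w in tokens if w in min_count_vocab]
    return tokens

def group_icd9(code):
    """Group an ICD-9 code by its first three characters,
    e.g. "2901", "2902", "2903" all map to "290"."""
    return code[:3]
```

In practice the rare-word vocabulary would be built in a first pass over the whole corpus and passed in as `min_count_vocab`.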
Training and Test Patients: We randomly split the patients into training and test sets. All 38,991 patients with a single visit were placed in the training set. Of the 7527 patients with two or more visits, we randomly assigned 80% (6015 patients) to the training set and 20% (1512 patients) to the test set. The whole training set was used for learning vector representations. Patients with only a single visit were excluded from the next-visit prediction task because that task requires at least two visits per patient.
Training JointSkip-gram model
EHRs of patients from the training set were used to learn our JointSkip-gram model. For each visit we created a (D, N) pair; there were 54,965 such pairs in the training data. The size T of the vectors representing codes and words was set to 200. The stochastic gradient algorithm with negative sampling maximizing (11) and (14) looped through all the training data 40 times, which we empirically observed to be sufficient for convergence. The number of negative samples was set to 5, and the size of the window for word context in the clinical notes was set to 5. As a result, each of the 14,302 words and 752 ICD-9 codes was represented as a 200-dimensional vector in a joint vector space. Before applying the JointSkip-gram model, we used a small fraction (∼10%) of the clinical notes to pretrain the vector representations of words only, as we observed that this improves the final representations.
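As a rough illustration of the optimization, the sketch below performs one skip-gram-with-negative-sampling gradient step for a (center, context) pair over a table holding both word and code vectors. It is a generic SGNS update, not the paper's exact objectives (11) and (14), and all names and the row layout are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200                                  # embedding size used in the paper
N_WORDS, N_CODES = 14302, 752            # vocabulary sizes after preprocessing

# Words and codes live in one joint space: rows 0..N_WORDS-1 are words,
# the remaining rows are ICD-9 codes. "Input" and "output" (context)
# tables, as in standard skip-gram.
W_in = rng.normal(0.0, 0.01, (N_WORDS + N_CODES, T))
W_out = rng.normal(0.0, 0.01, (N_WORDS + N_CODES, T))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.025):
    """One stochastic-gradient update with negative sampling (the paper
    used 5 negatives): pushes the (center, context) pair together and
    the (center, negative) pairs apart."""
    v = W_in[center].copy()
    grad_v = np.zeros(T)
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = lr * (label - sigmoid(W_out[idx] @ v))   # log-sigmoid gradient
        grad_v += g * W_out[idx]
        W_out[idx] += g * v
    W_in[center] += grad_v
```

Looping such updates over all word-word and code-word pairs for 40 epochs corresponds to the training schedule described above.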
To evaluate the quality of vector representations, we performed two types of experiments: (1) phenotype and treatment discovery by evaluating associations between codes and words in the vector space, (2) testing the predictive power of the vector representations on the task of predicting medical codes of the next visit.
Phenotype discovery
Text-based phenotype discovery can be viewed as finding words representative of medical codes. For a given ICD-9 diagnosis code, we retrieved its nearest 15 words in the vector space. If successful, the neighboring words should be clinically relevant to the ICD-9 code.
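A minimal sketch of this retrieval step, assuming cosine similarity as the distance measure in the joint space (the text does not state the measure, so this is an assumption):

```python
import numpy as np

def nearest_words(code_vec, word_vecs, vocab, k=15):
    """Return the k words whose vectors are closest (by cosine
    similarity) to a given ICD-9 code vector in the joint space."""
    wv = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    cv = code_vec / np.linalg.norm(code_vec)
    sims = wv @ cv                        # cosine similarity to every word
    top = np.argsort(-sims)[:k]           # indices of the k most similar
    return [vocab[i] for i in top]
```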
As an alternative to JointSkip-gram, we used labeled latent Dirichlet allocation (LLDA) [28], a supervised version of LDA [29]. In LLDA, there is a one-to-one correspondence between topics and labels. LLDA assumes there are multiple labels associated with each document and assigns each word a probability that it corresponds to each label. LLDA can be naturally adapted to our case by treating medical codes as labels and clinical notes as documents. For a given ICD-9 diagnosis code, we retrieved the 15 words with the highest probabilities and compared them with the 15 words obtained by JointSkip-gram.
We consulted domain experts about the quality of the extracted phenotypes. First, we selected 6 diverse ICD-9 codes from MIMIC-III that cover both acute and chronic diseases and both common and less common conditions. The 6 ICD-9 codes are listed in Table 1, together with their descriptions and frequencies in the training set. Table 1 also shows the 15 closest words to each of the 6 ICD-9 codes according to both methods. For each ICD-9 diagnosis code, we presented the two lists in a random order to a medical expert and asked two questions: (1) which list better represents the diagnosis code, and (2) which words in each list are not highly related to the given diagnosis code. We recruited four physicians from the Fox Chase Cancer Center as medical experts for the evaluation.
Table 1
Most important 15 words (ranked by importance) for ICD-9 codes “570”, “174”, “295”, “348”, “311”, “042”
| 570 (Acute liver failure, 1067) | | 174 (Female breast cancer, 139) | |
| JointSkip-gram | LLDA | JointSkip-gram | LLDA |
| Liver | Arrest | Metastatic | Breast |
| Hepatic | Pea | Mets | Pres |
| Cirrhosis | Cooling | Cancer | Mastectomy |
| Rising | Sun | Breast | Flap |
| Markedly | Arctic | Metastases | Mets |
| Shock | Rewarmed | Malignant | Ca |
| Lactate | Cooled | Metastasis | Cancer |
| Encephalopathy | Atropine | Oncologist | Metastatic |
| Amps | Dopamine | Oncology | Chemotherapy |
| Picture | Rewarming | Chemotherapy | Malignant |
| Rise | Cardiac | Infiltrating | Oncologist |
| Elevated | Coded | Palliative | Polymorphic |
| Cirrhotic | Continue | Tumor | Reversible |
| Bicarb | Prognosis | Melanoma | Mastectomies |
| Alcoholic | Ems | Mastectomy | Crisis |

| 295 (Schizophrenic disorders, 691) | | 348 (Conditions of brain, 3781) | |
| JointSkip-gram | LLDA | JointSkip-gram | LLDA |
| Schizophrenia | Schizophrenia | Hemorrhagic | Arrest |
| Psych | Paranoid | Herniation | Herniation |
| Bipolar | Psych | Temporal | Unresponsive |
| Suicide | Psychiatric | Cerebral | Corneal |
| Psychiatry | Disorders | Brain | Pupils |
| Kill | Personality | Hemorrhage | Brain |
| Paranoid | Hiss | Parietal | Cooling |
| Ideation | Guardian | Ganglia | Posturing |
| Psychiatrist | Psychiatry | Occipital | Head |
| Hallucinations | Hypothyroidism | Extension | Hemorrhage |
| Psychosis | Home | Surrounding | Noxious |
| Personality | Aloe | Head | Family |
| Sitter | Arrest | Effacement | Prognosis |
| Disorder | Pt | Ataxia | Pea |
| Abuse | Unresponsive | Burr | Gag |

| 311 (Depressive disorder, 3431) | | 042 (HIV, 538) | |
| JointSkip-gram | LLDA | JointSkip-gram | LLDA |
| Patient | Depression | Aids | Aids |
| Abuse | Tablet | Viral | Immunodeficiency |
| Hallucinations | Blood | Fungal | Virus |
| Withdrawal | Daily | Opportunistic | Human |
| Ingestion | Campus | Bacterial | Viral |
| Questionable | Mg | Disseminated | Load |
| Thiamine | Garage | Immuno-deficiency | Cooling |
| Remote | Capsule | Tuberculosis | Partner |
| Alcohol | Building | Organisms | Acyclovir |
| Significant | Parking | Herpes | Thrush |
| Overdose | One | Undetectable | Fevers |
| Prior | Discharge | Acyclovir | Induced |
| Apparent | Normal | Detectable | Antigen |
| Depression | East | Chlamydia | Pneumonia |
| Although | Coherent | Syphilis | Blanket |
The evaluation results are summarized in Table 2. As can be seen, all 4 experts agreed that the JointSkip-gram words better represent ICD-9 codes 570, 348, and 311. For the remaining 3 codes (174, 295, 042) the experts were split, but in no case did a majority prefer the LLDA words. Judging by the average number of words deemed unrelated by the experts, JointSkip-gram was superior to LLDA for all 6 ICD-9 diagnosis codes.
Table 2
Evaluation results by clinical experts
# of experts who think the method is better than the other

| ICD-9 code | 570 | 174 | 295 | 348 | 311 | 042 |
| JointSkip-gram | 4 | 2 | 3 | 4 | 4 | 2 |
| LLDA | 0 | 2 | 1 | 0 | 0 | 2 |

Average # of unrelated words across experts

| ICD-9 code | 570 | 174 | 295 | 348 | 311 | 042 |
| JointSkip-gram | 2.25 | 0.75 | 0.75 | 1.25 | 3.25 | 0.75 |
| LLDA | 9.25 | 1.75 | 3 | 3.75 | 6.5 | 2.75 |
For ICD-9 code “570” (acute liver failure), JointSkip-gram finds “liver”, “hepatic”, and “cirrhosis”, which are directly related to acute liver failure. The remaining words in the JointSkip-gram list are mostly indirectly related to liver failure, such as “alcoholic”, which points to one of the primary causes of liver damage. On the other hand, LLDA captured only a few related words, as evidenced by an average of 9.25 words per expert deemed unrelated, among them “cooling”, “sun”, “arctic”, “rewarmed”, “cooled”, “rewarming”, “coded”, “continue”, and “prognosis”.
For ICD-9 codes “174” (female breast cancer), “295” (schizophrenic disorders) and “042” (HIV), both JointSkip-gram and LLDA find highly related words. One of our experts commented that several words found by JointSkip-gram are diseases that are likely to co-occur with the given disease. For example, JointSkip-gram finds “melanoma” for female breast cancer and “herpes”, “chlamydia”, and “syphilis” for HIV. This suggests that JointSkip-gram captures hidden relationships between diseases, which could make it suitable for understanding comorbidities.
For code “311” (depressive disorder), both JointSkip-gram and LLDA had difficulty finding related words. According to feedback from one of our experts, “abuse”, “hallucinations”, “alcohol”, “overdose”, “depression” and “thiamine” (note: depression is a common symptom of thiamine deficiency) found by JointSkip-gram are related to the disease, while only “depression”, “tablet”, and “capsule” found by LLDA are recognizably related to depression. We hypothesize that physicians rarely discuss common diseases (e.g., depression and hypertension) in clinical notes when they are not the primary diagnosis or a major factor in deciding the treatment of the main condition. Thus, it is difficult for any algorithm to discover words related to such diagnoses from clinical notes.
Treatment discovery
In our preliminary study [21], we used the PyEnchant standard English vocabulary to filter out typos in clinical notes. However, many nonstandard English terms are used in medical notes to describe treatments, medicines, and diagnoses. These nonstandard words are not part of the PyEnchant standard English vocabulary used for preprocessing, yet they can carry important meaning. Hence, we repeated our experiments including all words occurring more than 50 times. The resulting vocabulary grew to 33,336 unique words.
After running our JointSkip-gram model on the new dataset, we looked at the representative words for each diagnosis code. Tables 3 and 4 show the 15 nearest clinical-note words in the vector space to ICD-9 codes “570” and “174”, respectively. We can observe that many retrieved words differ from those in Table 1 for codes “570” and “174”. The words that also appear in Table 1 are marked with italic font in Tables 3 and 4.
Table 3
Most important 15 words (including nonstandard English words, ranked by importance) for ICD-9 code “570”

| Word | Description |
| *Liver* | An organ that produces biochemicals necessary for digestion |
| Renal | Relating to the kidneys |
| Hepatorenal | Hepatorenal syndrome is a life-threatening condition consisting of rapid deterioration in kidney function |
| Crrt | Continuous renal replacement therapy, a dialysis modality used to treat critically ill, hospitalized patients |
| Vasopressin | A hormone synthesized in the hypothalamus |
| *Shock* | Shock liver is a condition defined as an acute liver injury |
| Failure | Liver failure can occur gradually (chronic) or rapidly (acute) |
| Levophed | Brand name of norepinephrine, a vasopressor given by injection |
| Ascites | The abnormal buildup of fluid in the abdomen |
| Oliguric | Producing an abnormally low urine output |
| Pigtail | Pigtail drainage is used for liver abscess |
| Transplant | Liver transplant is a surgical procedure |
| Rifaximin | An antibiotic |
| *Cirrhosis* | A late stage of scarring (fibrosis) of the liver |
| *Hepatic* | Relating to the liver |
Table 4
Most important 15 words (including nonstandard English words, ranked by importance) for ICD-9 code “174”

| Word | Description |
| Xeloda | A prescription medicine used to treat people with cancer |
| Tamoxifen | A medication used to prevent and treat breast cancer |
| *Metastatic* | Relating to a pathogenic agent’s spread from a primary site to a different site |
| *Chemotherapy* | Treatment by the use of chemical substances |
| *Cancer* | A disease in which abnormal cells divide uncontrollably and destroy body tissue |
| Carboplatin | A chemotherapy drug used to treat ovarian cancer |
| Onc | Abbreviation of oncologist |
| *Oncologist* | A doctor who treats cancer |
| Taxol | A drug belonging to a class of chemotherapy drugs called taxanes |
| Chemo | Short form of chemotherapy |
| Gemcitabine | An anti-cancer (chemotherapy) drug |
| *Mets* | Abbreviation of metastases |
| Compazine | A medication used to treat severe nausea |
| *Palliative* | Medical care aimed at relieving pain |
| *Metastases* | The development of secondary malignant growths |
A close look at Tables 3 and 4 reveals that most neighbors are specific medical terms describing drugs or treatments related to the diagnosis. For example, the words “crrt”, “levophed”, “rifaximin”, and “transplant” in Table 3 are related to the treatment of acute liver failure. Similarly, the words “xeloda”, “tamoxifen”, “carboplatin”, “taxol”, and “compazine” in Table 4 are related to cancer treatment. Therefore, including nonstandard words in our vocabulary enabled us to connect specialized medical terms with particular ICD-9 diagnosis codes.
Predictive evaluation
In another group of experiments, we constructed patient representations and evaluated the quality of the vector representations of words and medical codes through predictive modeling. We adopted the evaluation approach of [34], which predicts the medical codes of the next visit given the information from the current visit. Specifically, given two consecutive visits of a patient, we used the information of the first visit (i.e., medical codes and clinical notes) to predict the medical codes assigned during the second visit. In previous work on this topic, the authors of [23, 34, 35] used medical codes as features for prediction. In our evaluation, we used both medical codes and clinical notes to create predictive features. To generate a feature vector for the first visit, we computed the average JointSkip-gram vector representation of its diagnosis codes and the average JointSkip-gram vector representation of the words in its clinical notes, and concatenated those two averaged vectors. We call this method Concatenation-JointSG and compare it with the following five baselines:
Concatenation-One: The one-hot vector of medical codes and the one-hot vector of clinical note words for a given visit were concatenated. In the one-hot vector of each visit, words and codes occurring in the visit were encoded as 1, and all others as 0.
SVD: Singular value decomposition (SVD) was applied to the Concatenation-One representations to generate dense representations of visits.
LDA: Using latent Dirichlet allocation (LDA) [29], each document was represented as a topic-probability vector, which was used as the visit representation. To apply LDA, for each visit we created a document consisting of the concatenation of its list of medical diagnosis codes and its clinical notes. We note that LLDA is not suitable for this task since its topics contain only words.
Codes-JointSG: To evaluate the predictive power of medical codes, we created features for a visit as the average JointSkip-gram vector representation of the diagnosis codes.
Words-JointSG: To evaluate the predictive power of clinical notes, we created features for a visit as the average JointSkip-gram vector representation of the words in its clinical notes.
To compare vector representations obtained by JointSkip-gram and Skip-gram, we also trained Skip-gram on clinical notes and on medical codes separately. The resulting vector representations are not in the same vector space. We used Skip-gram representations to construct 3 more groups of features:
Codes-SG: The features for a visit were the average Skip-gram vector representation of the diagnosis codes.
Words-SG: The features for a visit were the average Skip-gram vector representation of the words in clinical notes.
Concatenation-SG: We concatenated the features from Codes-SG and Words-SG.
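The averaging-and-concatenation feature construction used by Concatenation-JointSG (and, analogously, Concatenation-SG) can be sketched as follows; `code_emb` and `word_emb` are assumed lookup tables holding the learned 200-dimensional vectors:

```python
import numpy as np

def visit_features(code_ids, word_ids, code_emb, word_emb):
    """Concatenation-JointSG features for one visit: the average of the
    visit's code embeddings concatenated with the average of its word
    embeddings (2*T = 400 dimensions for T = 200)."""
    code_avg = code_emb[code_ids].mean(axis=0)
    word_avg = word_emb[word_ids].mean(axis=0)
    return np.concatenate([code_avg, word_avg])
```

Dropping one of the two halves yields the Codes-JointSG and Words-JointSG feature sets, respectively.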
Given a set of features describing the first visit, we used a softmax model to predict the medical codes of the second visit. Let us assume the feature vector of the first visit is \(x_{t}\), the size of the code vocabulary is \(|C|\), and \(Z\in \mathbb {R}^{|C| \times |x_{t}|}\) is the weight matrix of the softmax function. The probability that the next visit \(y_{t+1}\) contains medical code \(c_{i}\) is calculated as
$$p(y_{t+1}(c_{i})=1)=\frac{e^{Z_{i} \cdot x_{t}}}{\sum_{c_{k} \in C}e^{Z_{k} \cdot x_{t}}} $$
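The probability above can be computed directly as a softmax over the per-code scores; the max-subtraction below is a standard numerical-stability trick, not part of the formulation itself:

```python
import numpy as np

def next_visit_code_probs(x_t, Z):
    """Softmax over the code vocabulary: p(y_{t+1}(c_i) = 1) for every
    code c_i, given first-visit features x_t and weight matrix Z of
    shape (|C|, |x_t|)."""
    scores = Z @ x_t                      # Z_i . x_t for each code c_i
    scores -= scores.max()                # stability: exp of shifted scores
    e = np.exp(scores)
    return e / e.sum()
```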
We use Top-k recall [34] to measure predictive performance, because it mimics the behavior of doctors who list the most probable diagnoses upon observation of a patient. For each visit, the softmax model recommends the k codes with the highest probabilities, and Top-k recall is calculated as
$$\text{Top-k recall}=\frac{\text{the number of true positives in } {k} \text{ codes}}{\text{the number of all positives}} $$
In the experiment, we tested Top-k recall when k=20, k=30, and k=40.
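The metric defined above is a direct computation over the predicted probabilities; a minimal sketch:

```python
import numpy as np

def top_k_recall(probs, true_codes, k):
    """Fraction of the next visit's true codes found among the k codes
    with the highest predicted probabilities."""
    top_k = set(np.argsort(-np.asarray(probs))[:k])   # indices of top-k codes
    true = set(true_codes)
    return len(true & top_k) / len(true)              # true positives / all positives
```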
Training details: To create features for all proposed models (Skip-gram, JointSkip-gram, LDA, SVD), we used the training set. To train the Skip-gram model, we used 40 iterations, 5 negative samples, and a window size of 5 (the same as for JointSkip-gram). For SVD and LDA, we set the maximum number of iterations to 1000 to guarantee convergence. For JointSkip-gram, Skip-gram, SVD, and LDA, we set the dimensionality of the feature vectors to 200.
To train the softmax model, we created a labeled set using only patients with 2 or more visits. We sorted all visits of each such patient by admission time and, for every pair of consecutive visits, used the former to create the features and the latter to create the labels. As a result, the labeled set used to train the softmax model had 9955 examples and the test set had 2489 examples. The softmax model was trained for 100 epochs using a stochastic gradient algorithm minimizing the categorical cross-entropy loss.
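The construction of the labeled (feature-visit, label-codes) pairs can be sketched as follows; the `admit_time` and `codes` field names are hypothetical:

```python
def make_labeled_pairs(patient_visits):
    """Pair each visit with its successor in admission-time order:
    the earlier visit supplies the features, the later one the labels.

    patient_visits: dict mapping patient id -> list of visit dicts,
    each with (hypothetical) keys "admit_time" and "codes".
    """
    pairs = []
    for visits in patient_visits.values():
        ordered = sorted(visits, key=lambda v: v["admit_time"])
        for prev, nxt in zip(ordered, ordered[1:]):
            pairs.append((prev, nxt["codes"]))
    return pairs
```

A patient with v visits thus contributes v - 1 labeled examples, consistent with 7527 multi-visit patients yielding 9955 + 2489 examples in total.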
Table 5 shows the performance of softmax models using the different sets of features. The model using Concatenation-JointSG features outperformed all baselines on all three Top-k measures.
Table 5
Performance of predicting medical codes of the next visit
| Method | Top-20 recall | Top-30 recall | Top-40 recall |
| Concatenation-One | 0.489 ±0.004 | 0.590 ±0.004 | 0.661 ±0.004 |
| SVD | 0.478 ±0.004 | 0.588 ±0.004 | 0.652 ±0.004 |
| LDA | 0.431 ±0.004 | 0.530 ±0.004 | 0.605 ±0.004 |
| Codes-JointSG | 0.499 ±0.003 | 0.592 ±0.003 | 0.662 ±0.003 |
| Words-JointSG | 0.437 ±0.004 | 0.536 ±0.004 | 0.609 ±0.004 |
| Concatenation-JointSG | 0.506 ±0.003 | 0.599 ±0.003 | 0.670 ±0.003 |