Background
Over several decades, the vision of automatic systems assisting and supporting clinical decisions produced a plethora of clinical decision support systems [1-4], including diagnostic decision support systems for inferring patient diagnosis. These methods typically focus on a single patient and apply manually or automatically constructed decision rules to produce a diagnosis [2,5,6]. At the same time, health care is undergoing tremendous changes as medical information is digitized and archived in a structured fashion. Electronic health records (EHRs) promise to revolutionize the processes by which patients are administered, hospitalized and discharged [7], improve safety [8] and allow the conduct of post-hospitalization outcome research [9]. This large corpus of population-based records is increasingly used in the context of clinical decision making for the individual patient [10]. Nevertheless, there still seems to be no consistent association between EHRs and clinical decision support systems (CDSS) and better quality of care [11].
Recently, several methods have been released for predicting certain patient outcomes using large cohorts of patients. Two such examples are the detection of heart failure more than six months before the actual date of clinical diagnosis [12] and inference of patient prognosis based on patient similarities [13]. These methods, however, use the patient diagnosis for the learning task.
In this paper, we address a different, fundamental challenge: can we leverage the corpus of EHR patient data, even with well-documented quality issues [14], to infer the discharge diagnosis of patients using minimal medical data upon hospitalization? We introduce an automated method that exploits patient records for inferring an individual patient discharge diagnosis. For this task, we use basic patient-specific information gathered at admission, including medical history, blood tests, electrocardiography (ECG) results and demographics, to identify similar patients and subsequently predict patient outcomes. We test our method on two diverse sets of patients admitted to internal medicine departments in large medical centers in the United States and Israel, obtaining high precision and recall. This suggests that such systems may eventually be useful in assisting physicians with medical decisions, hospital planning and short-term resource allocation.
Methods
Data description
We obtained two EHR datasets from two hospitals: (i) 9,974 patients with 15,498 admissions, admitted in several wards belonging to internal medicine (for example, cardiology, oncology) or neurology over the course of two years from the Stanford Medical Center, CA, USA (USA dataset); and (ii) 5,513 patients with 7,070 admissions in internal medicine wards at the Rabin Medical Center, Israel between May 2010 and February 2012 (660 days; ISR dataset). Each dataset includes patient demographics (gender and age), medical history (International Classification of Diseases, Clinical Modification codes (ICD-9-CM) from past in- and out-patient encounters) and hospitalization specific information including blood test results and discharge diagnoses, coded as ICD-9 codes. A subset of the patients in the USA dataset includes ECG measurements, while the ISR dataset (7,261 patients) also contains ICD codes assigned upon admission. The USA dataset includes 86 commonly administered blood tests (after filtering, see below) and the ISR dataset includes 19 blood tests. Both patient cohorts include only urgent (non-elective) admissions and a roughly equal number of females and males. Both datasets cover the entire adult age spectrum (USA patients range between 15 and 90 years and ISR patients between 20 and 110), but the ISR cohort is skewed towards older patients (USA median age is 63 and ISR is 73, where 82% of ISR patients are above 60 while only 55% of the USA patients are).
In addition, we obtained records of the Healthcare Cost and Utilization Project (HCUP) of the Nationwide Inpatient Sample (NIS) of 2009 which contains more than 55 million associations between 5.8 million patients and 1,125 third level discharge ICD codes. The latter data were used to enhance the computation of ICD similarities, as described below.
The ICD codes in the EHR data included 469 (USA) and 396 (ISR) third level ICD codes (diagnostic and procedural codes). We excluded supplementary classification codes (codes starting with E or V) and several first level categories including complications of pregnancy (630 to 679) and codes in the range 740 to 999 for being uninformative (for example, general symptoms), a known condition (for example, congenital anomalies) or incidental conditions (for example, injuries or poisoning). We retained supplementary classification codes V40 to V49 (‘persons with a condition influencing their health status’) for being indicative of procedures a patient underwent.
As a sanity check, we extracted the ICD codes that were enriched in patients with extreme blood test values relative to other patients (hypergeometric test, false discovery rate (FDR) = 0.01) and verified that these corresponded to common knowledge associations, for example, various ICDs coding for cancer are enriched within patients with high lactic dehydrogenase values [15] or the troponin-t test is indicative of acute myocardial infarction [16] [See Additional file 1: Table S1 for the full association list].
The patients were de-identified by using a randomly generated patient id. The study was approved by the Institutional Review Board of Stanford and by the Helsinki Committee of the Rabin Medical Center.
Similarity measure construction
In order to infer patient diagnosis, we computed a set of ten patient similarities. We computed two ICD similarity measures (1–2) and eight similarity measures between hospitalizations (3–10). All similarity measures were normalized to the range [0, 1]. We used the following ICD code similarities:
(1) ICD code similarity: We used the levels of the ICD codes in the ICD coding hierarchy to measure the similarity between ICD codes $c_i$ and $c_j$ as

$$S(c_i, c_j) = \frac{\mathrm{NCA}(c_i, c_j)}{\#\mathrm{levels}},$$

where NCA is the level of the nearest common ancestor and #levels is the number of levels in the ICD hierarchy (five levels) (see [17] for similar measures). When using third level codes, the number of levels equals three (the third, fourth and fifth levels).
(2) Empirical co-occurrence frequency: We used the HCUP data to compute empirical co-occurrences between ICD codes. Computing the number of co-occurrences of an ICD pair across all patients, we first computed the Jaccard score [18] between each pair. In order to transform the Jaccard score to a similarity measure, we randomly shuffled the associations of ICD codes to patients, keeping the overall ICD distribution as well as the per-patient ICD counts fixed. We then computed the similarity as the percentage of times the co-occurrence score was higher than the random shuffles.
We used the following inter-patient similarity measures:
(3–4) Medical history: Each patient may possess medical history from three sources: (i) past encounters with local health providers (digitally connected to the medical center); (ii) discharge codes of past hospitalizations; and (iii) personal history ICD codes provided in the current hospitalization (ICD codes V01 to V15, V40 to V49 and V87). The union of these three sources constitutes the patient medical history profile. To compute the similarity of two such profiles, we form a bipartite graph over the member ICD codes, connecting two codes in the two profiles by an edge whose weight is the similarity between the codes. Our similarity score is the value of a maximal matching in this graph normalized by the smaller history set size. We performed the maximal matching computation using either of the two ICD similarity measures, resulting in two similarity measures.
(5–6) Blood test similarity: We used only the chronologically first blood test of each type, performed upon admission for each hospitalization, retaining only blood test results obtained during the first three days of hospitalization. We filtered out blood tests that were performed in less than 5% of the hospitalizations and those for which the difference in distribution between patients with the same diagnosis and patients without a shared diagnosis was not statistically significant (Wilcoxon rank-sum test, FDR <0.01). This left us with 86 blood tests for the USA dataset and 19 blood tests for the ISR dataset. Each blood test was then normalized by converting it to a z-score, with the mean and standard deviation measured across the initial blood tests of all patients. Most of the patients had undergone only a partial set of the tests. We removed patients having fewer than three available blood tests and computed the similarity between a pair of hospitalizations based on the values of the blood tests common to the two hospitalizations, where patients sharing fewer than three blood tests between them received the minimal similarity score of zero. We formed two types of similarities: (i) using the entire set of blood tests common to any two hospitalizations, we computed the Euclidean distance between the z-score vectors, normalized by their length; and (ii) the average of differences in absolute values between the blood tests with the highest z-score for each patient. The distance D_ij between patients i and j was converted to a similarity value by linear transformation.
(7–8) ECG similarity: The ECG values included eight interval values as well as the heart rate. Similarly to the blood tests, we used only the chronologically first measurement, performed upon admission for each hospitalization, obtained during the first three days of hospitalization. Each ECG measurement had undergone the same normalization and similarity construction as the blood tests.
(9) Age similarity: In order to give precedence to age differences in younger age, we computed the similarity between two patients $p_i$ and $p_j$ as
(10) Gender similarity: defined as 1 if the two patients have the same gender and 0 otherwise.
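Two of the inter-patient measures above, the medical history matching (3–4) and the first blood test similarity (5), can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: the matching is solved by brute force over permutations (practical only for small profiles), 'normalized by their length' is read as division by the number of shared tests, and the unspecified linear transformation is assumed to be 1 - D / max_distance. All names are hypothetical.

```python
import math
from itertools import permutations


def history_similarity(profile_a, profile_b, code_similarity):
    """Best one-to-one matching weight, divided by the smaller profile size.

    Because all edge weights are non-negative, an optimal matching can
    always use every code of the smaller profile, so it suffices to try
    every injective assignment of the smaller profile into the larger one.
    """
    small, large = sorted((list(profile_a), list(profile_b)), key=len)
    if not small:
        return 0.0
    best = max(
        sum(code_similarity(a, b) for a, b in zip(small, chosen))
        for chosen in permutations(large, len(small))
    )
    return best / len(small)


def zscores(values_by_test, stats):
    """Convert raw results to z-scores; stats maps test -> (mean, std)."""
    return {
        test: (value - stats[test][0]) / stats[test][1]
        for test, value in values_by_test.items()
    }


def blood_test_similarity(z_a, z_b, max_distance, min_common=3):
    """Length-normalized Euclidean distance mapped linearly to [0, 1].

    Hospitalizations sharing fewer than `min_common` tests score zero.
    `max_distance` (assumed) is the distance that maps to similarity zero.
    """
    common = set(z_a) & set(z_b)
    if len(common) < min_common:
        return 0.0
    distance = math.sqrt(sum((z_a[t] - z_b[t]) ** 2 for t in common))
    distance /= len(common)
    return max(0.0, 1.0 - distance / max_distance)
```

Passing either of the two ICD code similarities as `code_similarity` yields the two medical history measures. For realistic profile sizes, the brute-force search would be replaced by a polynomial-time assignment algorithm.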
Combining similarity measures to classification features
The framework we used scores a hypothetical association according to its maximal similarity to a known, gold-standard, set of associations. In our case, we scored associations between hospitalization records and ICD codes based on the highest similarity to the known discharge codes in the background corpus of previously hospitalized patients (disregarding similarities to previous hospitalizations of the same patient). Specifically, the features used to classify hospitalization-primary discharge ICD code pairs were constructed from scores computed for each combination of an ICD-similarity measure and a similarity measure between patient hospitalizations (see previous section for details), resulting in 16 features overall (12 without the ECG similarities). For each such pair of similarity measures, the score of a potential discharge code I for a given hospitalization H is computed by considering the similarity to known discharge codes associated with other hospitalizations (excluding other hospitalizations of the same patient) (I’ and H’). The computation is done as follows: First, for each known association (H’,I’) we compute the inter-hospitalization similarity S(H,H’) and the ICD code similarity S(I,I’). Next, we follow the method of [19] to combine the two similarities to a single score by computing their geometric mean. Thus:

$$\mathrm{Score}(H, I) = \max_{(H', I')} \sqrt{S(H, H') \cdot S(I, I')} \qquad (1)$$
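The scoring step can be sketched in Python as follows: the score of a candidate pair is the largest geometric mean of hospitalization similarity and ICD code similarity over the known associations, skipping hospitalizations of the same patient. The data layout (a list of known pairs, a hospitalization-to-patient mapping and similarity callables) and all names are assumptions made for illustration.

```python
import math


def association_score(hosp, code, known, hosp_sim, code_sim, patient_of):
    """Maximal geometric-mean similarity of (hosp, code) to known pairs.

    known      : iterable of (hospitalization, discharge code) pairs
    hosp_sim   : function (H, H') -> similarity in [0, 1]
    code_sim   : function (I, I') -> similarity in [0, 1]
    patient_of : dict mapping each hospitalization to its patient
    """
    best = 0.0
    for other_hosp, other_code in known:
        if patient_of[other_hosp] == patient_of[hosp]:
            continue  # mask hospitalizations of the same patient
        score = math.sqrt(
            hosp_sim(hosp, other_hosp) * code_sim(code, other_code)
        )
        best = max(best, score)
    return best


def feature_vector(hosp, code, known, hosp_sims, code_sims, patient_of):
    """One feature per (ICD similarity, hospitalization similarity) pair."""
    return [
        association_score(hosp, code, known, hs, cs, patient_of)
        for cs in code_sims
        for hs in hosp_sims
    ]
```

With two ICD code similarities and eight hospitalization similarities, `feature_vector` returns the 16 features described above (12 when the two ECG similarities are left out).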
We used the MATLAB implementation of the logistic regression classifier (glmfit function with binomial distribution and logit link) for the prediction task. We used a 10-fold cross validation scheme to evaluate the precision of our prediction algorithm. The training set used for the cross validation included 41,036 USA associations between hospitalizations and discharge codes and 14,506 ISR associations. We considered two types of negative sets, the same size as the positive set in each training set: (i) randomly sampling for each patient a diagnosis from the 469 (USA) or 396 (ISR) third level ICD codes (excluding true diagnoses for that patient), termed ‘pre-admission’; and (ii) randomly sampling a set of potential release codes for each hospitalization, termed ‘post-admission.’ Specifically for the second negative set scenario, we inspected the available admission diagnoses reported upon hospitalization (absent from the USA dataset) and included the set of discharge diagnoses of all the patients who shared the same admission diagnosis (excluding the true discharge diagnosis for that hospitalization). As an example, the potential negative set for a patient admitted with chest pain includes the discharge diagnoses of all other patients admitted with chest pain, excluding the true final diagnoses of that patient. Additionally, we removed self-similarities of patients (that is, similarities between hospitalizations of the same patient) to avoid bias for patients with recurrent admissions. To obtain robust area under the curve (AUC) score estimates, we performed 10 independent cross validation runs, selecting a different negative set and a different random partition of the training set to 10 parts in each; we then averaged the resulting AUC scores. As expected, taking a negative set of size five, ten or twenty times the size of the positive set had a negligible effect on the resulting AUC score (AUC difference less than 0.002).
In order to apply our method in a scenario that mimics the admission of new patients, we split the hospitalizations into training and validation subsets. For the ISR data, we used the available admission date to select hospitalizations that spanned the first year of our data (July 2010 to June 2011) as our training set and validated on hospitalizations occurring in the subsequent 211 days, totaling 999 hospitalizations. For the USA data, we split the data into train and test sets (two thirds and a third, respectively) using the available sequential ordering of their admission dates. As with the cross-validation scheme, we masked similarities between hospitalizations of the same patient. We computed the precision of our predictions by counting the number of patients for which the top predicted discharge code was the same as one of its true diagnoses. Similarly, we also computed the performance when testing whether the true discharge code of a patient appeared in the top two predictions, top three and up to the top ten predictions per patient.
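The validation procedure can be sketched in Python as a chronological split followed by a top-k precision count. The record layout (dictionaries with an `admitted` field), the ranked prediction lists and the function names are hypothetical; the two-thirds split corresponds to the USA setting described above.

```python
def chronological_split(hospitalizations):
    """Split records into the earliest two thirds and the latest third.

    Each record is assumed to be a dict with a sortable 'admitted' field.
    """
    ordered = sorted(hospitalizations, key=lambda h: h["admitted"])
    cut = len(ordered) * 2 // 3
    return ordered[:cut], ordered[cut:]


def top_k_precision(ranked_predictions, true_codes, k):
    """Fraction of hospitalizations with a true code among the top k.

    ranked_predictions : dict hospitalization -> codes sorted by score
    true_codes         : dict hospitalization -> set of true discharge codes
    """
    hits = sum(
        1
        for hosp, ranked in ranked_predictions.items()
        if set(ranked[:k]) & true_codes[hosp]
    )
    return hits / len(ranked_predictions)
```

Evaluating `top_k_precision` for k = 1 to 10 gives the top-one through top-ten performance described in the text.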
In order to identify ICD codes that are significantly correctly predicted, we compared the number of correct predictions for each ICD code against a background of 10^5 randomly shuffled patient-diagnosis association sets.
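A permutation test of this kind can be sketched in Python as follows. The sketch assumes that a shuffle permutes the true diagnosis sets among hospitalizations and that significance is reported as an empirical p-value with a pseudo-count of one; neither detail is given in the text, and all names are hypothetical.

```python
import random


def correct_count(top_prediction, true_codes, code):
    """Hospitalizations where `code` is the top prediction and is true."""
    return sum(
        1
        for hosp, predicted in top_prediction.items()
        if predicted == code and code in true_codes[hosp]
    )


def empirical_p_value(top_prediction, true_codes, code,
                      n_shuffles=1000, seed=0):
    """Share of shuffles with at least as many correct predictions.

    Assumption: each shuffle reassigns the true diagnosis sets to
    hospitalizations uniformly at random.
    """
    rng = random.Random(seed)
    observed = correct_count(top_prediction, true_codes, code)
    hosps = list(true_codes)
    diagnoses = [true_codes[h] for h in hosps]
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(diagnoses)
        shuffled = dict(zip(hosps, diagnoses))
        if correct_count(top_prediction, shuffled, code) >= observed:
            at_least += 1
    return (at_least + 1) / (n_shuffles + 1)
```

The default of 1,000 shuffles keeps the sketch fast; the background described above would correspond to `n_shuffles=10**5`.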
Discussion
We used patient cohorts from two different hospitals. However, we trained and provided predictions for each dataset independently. This was done for three reasons: (i) combining the two datasets ignores information available in only one dataset (for example ECG data or blood tests that appear in only one set); (ii) the ICD codes, primarily used for billing purposes, are often biased due to the health system used in each country; and (iii) different sources of medical history (that is, outpatient versus inpatient facilities) display lower agreement between patients from different health systems. Indeed, we observed that merging the two datasets degraded the performance to that of the worse performing dataset (ISR, see Table 1).
In order to assess the potential benefits to a clinician, we looked at predictions that could be considered surprising with regard to the admission diagnoses (available in the ISR dataset). We found multiple examples in which the admission diagnosis contained only general symptoms and our method correctly predicted the true discharge diagnosis. We describe here two such examples: (i) a female patient who was admitted with an unspecified anemia (ICD code 285.9) was correctly predicted for cardiac dysrhythmias (427). Irregular heartbeat is one of the many symptoms of anemia but not a predictive one [21]; and (ii) a female patient was admitted with fever (780.6) and was correctly predicted for acute myocardial infarction (410). Notably, fever is not a common symptom for acute myocardial infarction [22].
Finally, analyzing our performance, we note that while our method provided high quality predictions in cross validation, it is likely to display lower performance in predicting conditions that evolve substantially over time and conditions that are rare in the population. We observe that high level ICD categories that achieve relatively high precision are typically abundant in our data (above 6% (USA) and 4% (ISR) of the patients), including diseases related to endocrine, circulatory, respiratory and genitourinary systems (Figure 3). In contrast, lower precision is obtained for high level ICD categories which generally have a low representation in our data and are typically complex (for example, neoplasms). Larger and richer EHR data could enhance our prediction precision in these cases also. Specifically, a very large corpus of patients might introduce more of the currently rare cases, and a larger temporal range within the corpus would allow for richer representations of the medical history. This assumption is strengthened by the fact that the USA dataset is obtained from a tertiary care facility and, thus, harbors more ‘hard’ cases. Yet this dataset obtained better performance due to a larger corpus of patients and more information on each patient than the ISR dataset, which is from a primary and secondary care facility. One reason may be that only a small subset of the blood tests was available for each patient in the ISR dataset, limiting the computation of similarity between patients and the ability to account for rarer test types. A fuller set of tests allows the computation of more accurate patient similarities.
Conclusions
Our results demonstrate that a large corpus of patient data can be exploited to predict the likely discharge diagnoses for a new patient. We introduced a general method for performing such an inference using information from past hospitalizations. Our method computes patient similarity measures and requires a minimal set of such measures, including medical history, blood tests performed upon admission and demographics. It is readily extensible to use the results of other admission information, such as ECG tests, as shown for the USA dataset and potentially, in the future, medical images and patient genomic information (for example, gene expression measurements or single nucleotide polymorphism data).
Our method is a stepping stone for the full exploitation of large population-based data sets. We recognize that the introduction of new decision support modalities requires careful analysis of physician and health-care system workflows and introduction of the information at the most pertinent decision points. However, it is clear that the emerging infrastructure of electronic patient information will provide not only better information about quality of care and guidance for policy but will be able to improve the care of the individual, benefitting from the aggregated information of previous patients.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AG and RS conceived the paper; AG performed the analysis and wrote the draft; GS obtained the data, and aided in pre-processing; GS, ER, RA and RS participated in the writing of the paper. All authors read and approved the final manuscript.