Background
Since the passage of the Health Information Technology for Economic and Clinical Health (HITECH) Act in 2009[1, 2], there has been an increase in the rate of electronic health record (EHR) adoption. As of 2012, the rate of EHR adoption with at least basic functionality was 44.4% in non-federal acute care hospitals[3] and 39.6% in office-based physician practices[4].
The transition to EHRs has created new opportunities for research[5-7]. The secondary use of EHR data provides a more efficient and less expensive alternative to clinical trials, the current gold standard of medical research[8, 9]. This is especially important in the current fiscal climate, where federal funding of medical research is becoming increasingly limited.
There are, however, potential caveats to the secondary use of EHR data[10]. EHRs suffer from data quality problems[11-13], which may affect the internal validity of retrospective studies. One of these data quality problems is insufficient data. Sufficiency can be conceptualized as a type of completeness, which is one of several categories of data quality that are relevant to EHR data reuse[14]. When EHR data are complete according to the requirements of a given task, those data can be considered to be sufficient for that task. Required data may be missing for different reasons: a data point was observed but not documented[12] or it was never observed in the first place, either because the observation was not clinically necessary or because it could not be performed. Regardless of the reason, missing data is very common in today’s EHR databases, leading to datasets that may not be sufficient for work relying on the secondary use of EHR data. Although it has been pointed out that the missing data may cause records to be “visually complete but intellectually insufficient,”[15] the causal effect of health status on data sufficiency is not the focus of this study. Instead, we focus on the correlation between the sufficiency of electronic health record data for clinical research and the underlying patient health status.
In a clinical trial, a study sample is chosen based on predefined eligibility criteria. The data necessary to answer the research question is then prospectively collected for every participant. This approach ensures that all required data are present and trustworthy, but may come at the expense of limited external generalizability due to the non-representativeness of the sample[16]. In contrast, studies relying on the use of EHR data are thought to have greater external validity, having drawn their participants from actual patients receiving regular care in actual health care settings. In such studies, however, participants must be chosen based not only on the eligibility criteria but also on the availability of sufficient data for extraction[17-19]. Example sufficiency requirements include “a sub-population who have sufficient health record data at institution {X} frequenting the {X} hospital system for routine care” and “total number of individuals that have male gender and serum creatinine 1.5 mg/dL or female gender and serum creatinine 1.3 mg/dL. The patients need to have at least 2 values over the threshold.” In a study by Green et al., of 122,270 patients satisfying eligibility criteria, only 59.7% had sufficient data[19]. Patients without the data necessary to determine eligibility or perform the analyses of interest cannot, by definition, be included in the study sample. The addition of this frequently overlooked sufficiency requirement has the potential to lead to bias in the selection of patients for inclusion in EHR-based studies, which may limit their external validity.
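A sufficiency requirement such as the serum creatinine example above can be read as a filter over patient records. The following is a minimal sketch, not the procedure used in any of the cited studies; the record layout, the function name `meets_creatinine_requirement`, and the reading of "over the threshold" as strictly greater than the quoted value are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Thresholds quoted in the example requirement (mg/dL), keyed by sex.
CREATININE_THRESHOLD = {"male": 1.5, "female": 1.3}
MIN_VALUES_OVER_THRESHOLD = 2


@dataclass
class PatientRecord:
    # Hypothetical record layout, for illustration only.
    patient_id: str
    sex: str
    creatinine_mg_dl: List[float] = field(default_factory=list)


def meets_creatinine_requirement(record: PatientRecord) -> bool:
    """Return True if the record has at least two values over the threshold."""
    threshold = CREATININE_THRESHOLD.get(record.sex)
    if threshold is None:
        return False  # sex not recorded: the requirement cannot be evaluated
    # Assumption: "over the threshold" means strictly greater than.
    over = [v for v in record.creatinine_mg_dl if v > threshold]
    return len(over) >= MIN_VALUES_OVER_THRESHOLD


records = [
    PatientRecord("a", "male", [1.6, 1.8, 1.2]),
    PatientRecord("b", "female", [1.4]),
    PatientRecord("c", "female", []),
    PatientRecord("d", "male", [1.7, 1.9]),
]
selected = [r.patient_id for r in records if meets_creatinine_requirement(r)]
print(selected)
```

In this toy data, patient b has a single elevated value and patient c has no measurements at all. Both are excluded, which is the sense in which such a requirement selects on data availability as well as on clinical status.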
The proportion of patients in a given population with sufficient data varies from study to study, as it depends on the research question and the necessary kinds of data required for answering that question[14, 20]. We have previously demonstrated the contextual nature of EHR sufficiency, as well as the high variability of sufficient patient records in a large-scale analysis of the NewYork-Presbyterian Hospital Clinical Data Warehouse[14]. This variability is not always random; it is more likely that the pattern of data quantity is related to one or more of the variables of interest[21]. Our preliminary work indicates that the patient records containing sufficient data, i.e., those best-suited for secondary use in research, tend to belong to the sickest patients[22].
As Lee et al. point out, many studies assume that the addition of a requirement for a visit in a given time frame (visit-based sampling) produces a sample that is representative of the population from which it is derived. However, as their work demonstrates, this assumption is wrong, and the imposition of just this one sufficiency requirement biases the population towards sicker and older patients[23]. Sufficient visit data is one common way patients are selected for inclusion in EHR-based research. Another common sufficiency requirement is based on laboratory and/or medication data. Some studies require just the presence of a specific laboratory value or medication order, while others also impose a minimum threshold for the number of each.
This paper reports an in-depth exploration of the relationship between patient illness severity and quantity of available data, as well as the potential clinical confounders of this relationship. We demonstrate that, because of the data sufficiency requirements for sampling, the cohorts being identified for research may not be representative of the broader patient population, thus compromising the external validity of research conducted using EHR data.
We hypothesized that the health records of sicker patients would be more likely to have sufficient data for research, and that this relationship would hold true when controlling for possible covariates. We also hypothesized that other patient- and procedure-related factors, such as age, sex, admission status, and the emergent nature of the procedure, would independently affect EHR data quantity.
Results
The mean age of patients in our sample was 45.0 years (SD = 23.9), with a range of 1 to 102 years. Sixty-one percent of our cohort was female. Most cases were non-emergent (88.8%) with more outpatients (41.6%) than same-day admissions (32.7%) or inpatients (25.6%). The most frequently occurring diagnostic categories in our dataset were Complications of pregnancy, childbirth, and puerperium (19.3%), Diseases of the digestive system (12.3%), and Neoplasms (11.1%). The most common procedure categories were Anesthesia procedures, which includes procedures for analgesia during labor and delivery (17.5%), Operations on the digestive system (16.3%), and Operations on the musculoskeletal system (13.2%). Table 2 presents descriptive statistics of counts of days with laboratory results and medication orders within subcategories.
Table 2
Model inputs with counts of days with laboratory results and medication orders (n = 10,000)
| Variable | n (%) | Days with laboratory results: max | Days with laboratory results: mean (SD) | Days with medication orders: max | Days with medication orders: mean (SD) |
| --- | --- | --- | --- | --- | --- |
| ASA class | | | | | |
| 1 | 2263(22.6) | 20 | 2.9(3.4) | 21 | 1.3(2.2) |
| 2 | 4779(47.8) | 85 | 3.0(4.4) | 62 | 1.6(3.6) |
| 3 | 2499(25.0) | 107 | 5.7(9.4) | 102 | 4.1(8.5) |
| 4 | 459(4.6) | 99 | 9.4(13.2) | 91 | 7.5(11.5) |
| Sex | | | | | |
| Male | 3943(39.4) | 99 | 3.4(7.4) | 91 | 2.3(6.3) |
| Female | 6057(60.6) | 107 | 4.3(6.2) | 102 | 2.5(5.5) |
| Age (years) | | | | | |
| 1-10 | 911(9.1) | 52 | 1.4(4.0) | 62 | 1.6(4.8) |
| 11-20 | 837(8.4) | 49 | 2.7(4.5) | 91 | 1.9(4.9) |
| 21-30 | 1379(13.8) | 78 | 5.3(6.1) | 71 | 2.8(5.0) |
| 31-40 | 1461(14.6) | 107 | 5.0(6.5) | 102 | 2.1(5.5) |
| 41-50 | 981(9.8) | 85 | 3.4(6.3) | 43 | 2.0(5.0) |
| 51-60 | 1182(11.8) | 99 | 4.1(8.8) | 76 | 2.9(7.7) |
| 61-70 | 1484(14.8) | 99 | 3.7(7.2) | 76 | 2.5(6.2) |
| 71-80 | 1143(11.4) | 60 | 3.9(6.7) | 63 | 2.6(6.0) |
| 81-90 | 545(5.5) | 60 | 4.7(7.3) | 55 | 3.2(6.1) |
| 91-102 | 77(0.8) | 59 | 6.9(9.1) | 44 | 5.1(7.3) |
| Emergency status | | | | | |
| Non-emergent | 8883(88.8) | 107 | 3.7(6.3) | 102 | 2.2(5.5) |
| Emergent | 1117(11.2) | 99 | 5.5(9.2) | 81 | 4.0(7.9) |
| Admission status | | | | | |
| Outpatient | 4162(41.6) | 85 | 2.1(4.4) | 56 | 1.3(3.8) |
| Same day | 3274(32.7) | 69 | 3.7(4.8) | 76 | 1.7(4.0) |
| Inpatient | 2564(25.6) | 107 | 7.2(9.9) | 102 | 5.1(8.8) |
| ICD-9 category name (number) | | | | | |
| Neoplasms (2) | 1111(11.1) | 53 | 3.3(5.4) | 62 | 1.8(4.8) |
| Endocrine, nutritional, and metabolic & immunity disorders (3) | 211(2.1) | 59 | 3.0(6.0) | 44 | 1.3(4.4) |
| Dz. of the nervous system and the sense organs (6) | 942(9.4) | 99 | 1.9(6.0) | 70 | 1.6(4.9) |
| Dz. of the circulatory system (7) | 1005(10.0) | 92 | 5.2(8.5) | 81 | 3.9(7.4) |
| Dz. of the respiratory system (8) | 368(3.7) | 60 | 3.7(8.2) | 91 | 3.2(9.0) |
| Dz. of the digestive system (9) | 1232(12.3) | 107 | 3.7(8.2) | 102 | 2.7(7.3) |
| Dz. of the genitourinary system (10) | 887(8.9) | 92 | 3.9(7.1) | 65 | 2.3(6.4) |
| Complications of pregnancy, childbirth and the puerperium (11) | 1931(19.3) | 54 | 6.3(4.4) | 28 | 2.6(3.3) |
| Dz. of the musculoskeletal system and connective tissue (13) | 767(7.7) | 59 | 1.8(4.3) | 60 | 1.3(4.3) |
| *Congenital anomalies (14) | 248(2.5) | 15 | 1.4(2.1) | 34 | 1.1(3.2) |
| *Symptoms, signs and ill-defined conditions (16) | 602(6.0) | 99 | 5.1(9.0) | 62 | 3.4(7.6) |
| Injury and poisoning (17) | 696(7.0) | 77 | 2.4(6.2) | 76 | 2.3(5.7) |
| CPT category name (number) | | | | | |
| Operations on the nervous system (1) | 460(4.6) | 78 | 2.4(5.8) | 70 | 1.9(6.1) |
| Operations on the endocrine system (2) | 198(2.0) | 21 | 1.7(3.2) | 29 | 0.9(3.6) |
| Operations on the eye (3) | 664(6.6) | 99 | 1.8(5.9) | 52 | 1.4(4.5) |
| *Operations on the nose, mouth, and pharynx (5) | 430(4.3) | 44 | 1.1(3.1) | 36 | 1.1(3.1) |
| Operations on the respiratory system (6) | 198(2.0) | 60 | 6.5(10.0) | 91 | 5.6(12.2) |
| Operations on the cardiovascular system (7) | 1105(11.1) | 92 | 5.9(9.2) | 76 | 4.5(8.1) |
| Operations on the digestive system (9) | 1625(16.3) | 107 | 4.0(8.5) | 102 | 2.7(7.1) |
| Operations on the urinary system (10) | 533(5.3) | 57 | 3.7(5.4) | 65 | 1.5(4.9) |
| Operations on the male genital organs (11) | 345(3.5) | 49 | 2.5(3.8) | 33 | 1.1(3.4) |
| *Operations on the female genital organs (12) | 603(6.0) | 28 | 3.0(2.9) | 17 | 1.2(2.5) |
| Operations on the musculoskeletal system (14) | 1321(13.2) | 77 | 2.0(4.9) | 67 | 1.7(4.8) |
| *Operations on the integumentary system (15) | 442(4.4) | 54 | 3.5(6.5) | 53 | 2.4(5.6) |
| Miscellaneous diagnostic and therapeutic procedures (16) | 321(3.2) | 92 | 4.2(9.5) | 81 | 3.1(7.8) |
| *Anesthesia procedures (18) | 1755(17.6) | 54 | 6.7(4.5) | 52 | 2.8(3.5) |
Table 3 shows the effect of each variable as a whole and of each level of the primary variable of interest (ASA class) on the estimated number of days with laboratory results and medication orders based on the parameter estimates from the negative binomial model. Effects, standard errors and 95% confidence intervals for individual variable levels are expressed as ratios comparing that level to the reference level for the variable, such that an effect of 2.0 for ASA 3 indicates that ASA 3 is estimated to have 2 times as many days as ASA 1. These ratios were obtained by exponentiating the model regression coefficients. ASA class, subject sex, age, admission status, ICD-9 category, and CPT category were significantly associated with the counts of days with laboratory results and medication orders, while emergency status was associated only with laboratory results.
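The ratio interpretation follows from the log link that negative binomial count models conventionally use. As a sketch, let $Y_i$ be the count of days for patient $i$, $x_{ik}$ an indicator for level $k$ of a categorical variable, and $\beta_k$ its coefficient. This notation is introduced here for illustration only, and the log link is an assumption inferred from the exponentiation step described above.

```latex
\log \mathbb{E}[Y_i] = \beta_0 + \sum_{k} \beta_k x_{ik}
\quad\Longrightarrow\quad
\frac{\mathbb{E}[Y_i \mid x_{ik} = 1]}{\mathbb{E}[Y_i \mid x_{ik} = 0]} = e^{\beta_k}
```

The ratio on the right holds with all other covariates fixed, and the reference level corresponds to $\beta_k = 0$, that is, a ratio of 1.00.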
Table 3
Negative binomial regression
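| Variable | Laboratory results: p (variable) | Laboratory results: effect (SE) | Laboratory results: 95% CI | Laboratory results: p (level) | Medication orders: p (variable) | Medication orders: effect (SE) | Medication orders: 95% CI | Medication orders: p (level) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ASA class | <.001† | | | | <.001† | | | |
| 1 | | 1.00 | | | | 1.00 | | |
| 2 | | 1.47(1.03) | 1.38 – 1.57 | <.001† | | 1.74(1.05) | 1.56 – 1.94 | <.001† |
| 3 | | 3.38(1.04) | 3.11 – 3.67 | <.001† | | 4.78(1.07) | 4.18 – 5.48 | <.001† |
| 4 | | 5.05(1.07) | 4.41 – 5.77 | <.001† | | 6.85(1.12) | 4.49 – 8.54 | <.001† |
| Sex | <.001† | | | | <.001† | | | |
| Age | <.001† | | | | <.001† | | | |
| Emergency status | <.001† | | | | 0.84 | | | |
| Admission status | <.001† | | | | <.001† | | | |
| ICD-9 category | <.001† | | | | <.001† | | | |
| CPT category | <.001† | | | | <.001† | | | |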
Our primary variable of interest, ASA class, had a significant association with the counts of days with laboratory results and medication orders. Controlling for all other variables, the estimated count of days with laboratory results for ASA 2 was 1.47 times, for ASA 3 was 3.38 times, and for ASA 4 was 5.05 times the count of days with laboratory results for ASA 1. The pairwise differences for counts of days with laboratory results between all four ASA classes were statistically significant. Similarly, the estimated count of days with medication orders for ASA 2 was 1.74 times, for ASA 3 was 4.78 times, and for ASA 4 was 6.85 times the count of days with medication orders for ASA 1. All pairwise comparisons between the four ASA classes for counts of days with medication orders were statistically significant.
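The interval bounds in Table 3 can be approximately recovered from the reported effects. The sketch below assumes that the value in parentheses is the exponentiated standard error, so that a 95% interval is formed on the log scale and then exponentiated. That reading, the normal quantile of 1.96, and the function name `ratio_ci` are assumptions, not details stated in the text.

```python
import math


def ratio_ci(ratio: float, exp_se: float, z: float = 1.96):
    """Confidence interval for a ratio, computed on the log scale.

    ratio  -- exponentiated regression coefficient
    exp_se -- exponentiated standard error (assumed reading of the table)
    """
    beta = math.log(ratio)
    se = math.log(exp_se)
    return math.exp(beta - z * se), math.exp(beta + z * se)


# ASA 2 versus ASA 1, days with laboratory results: 1.47 (1.03).
lower, upper = ratio_ci(1.47, 1.03)
print(f"{lower:.2f} to {upper:.2f}")
```

Under these assumptions the computed interval lies close to the reported 1.38 to 1.57. The small remaining gap is plausibly due to rounding of the published inputs.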
Discussion
The results of the negative binomial regression model demonstrate the relationship between patient health status and EHR data sufficiency. The less healthy the patient, as measured by ASA status, the more data that patient is likely to have, as represented by counts of days with laboratory results and medication orders, and the more likely they are to satisfy sufficiency requirements. This relationship holds true even when controlling for a number of likely confounders, including sex, age, emergent status, patient type, diagnosis, and procedure, which suggests that even within specific, well-defined cohorts, sicker patients are likely to have more data than healthier patients.
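To make the mechanism concrete, the sketch below applies a minimum-count sufficiency requirement to a small invented cohort and compares the ASA class mix before and after. The records, the threshold of 3 days, and the helper name `asa_share` are hypothetical and are not drawn from the study data.

```python
from collections import Counter

# (ASA class, days with laboratory results); invented values for illustration.
cohort = [
    (1, 0), (1, 1), (1, 2), (1, 4),
    (2, 1), (2, 2), (2, 3), (2, 5),
    (3, 2), (3, 4), (3, 6), (3, 9),
    (4, 5), (4, 8), (4, 12), (4, 15),
]
MIN_LAB_DAYS = 3  # hypothetical sufficiency threshold


def asa_share(patients):
    """Fraction of patients in each ASA class."""
    counts = Counter(asa for asa, _ in patients)
    total = len(patients)
    return {asa: counts[asa] / total for asa in sorted(counts)}


# Keep only patients whose record meets the sufficiency requirement.
sufficient = [p for p in cohort if p[1] >= MIN_LAB_DAYS]

before = asa_share(cohort)
after = asa_share(sufficient)
sick_before = before[3] + before[4]
sick_after = after[3] + after[4]
print(f"ASA 3-4 share: {sick_before:.2f} before, {sick_after:.2f} after")
```

In this toy cohort, ASA 3 and 4 patients make up half of the full cohort but 7 of the 10 patients who meet the requirement, even though the requirement itself never mentions health status.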
These findings highlight an important but usually overlooked problem inherent to studies using EHR data: the selection of records with sufficient data, as measured by human-imposed sufficiency requirements, for research may bias the sample towards patients who are sicker than the population from which the sample is drawn. The findings from this study are consistent with previous work exploring the complex relationships among data quality, bias, and health status. In one example, Wennberg et al. used insurance claims data to demonstrate bias in comorbidity measurement by showing that Charlson Comorbidity Index scores are associated with the frequency of physician visits[25], suggesting that data quality is compromised by differences in healthcare utilization. Similarly, Collins et al. identified a relationship between patient mortality and increased rates of nursing documentation, suggesting that more acutely ill patients are likely to have more thoroughly documented records[36, 37]. In a study of a pneumonia severity index using EHR data, Hripcsak et al. found that the addition of cohort selection criteria that required the presence of sufficient data to make a reliable diagnosis substantially limited the sample size and significantly altered the mortality rates[38]. They note that the addition of simple sample restraints, while beneficial in their case, has the potential to significantly narrow the sample, leading to the possibility of bias.
We observed a direct correlation between severity of illness and data sufficiency in spite of the presence of sub-populations in our study sample in which this correlation should not exist: living organ donors and pregnant women with uncomplicated pregnancies presenting for management of labor and delivery. These patients tend to be healthy, but have more data in their records, resulting from laboratory testing performed as part of routine prenatal care or organ donor evaluation. Our 10,000-patient sample contained 1,802 (18.0%) such patients, of whom 1,746 (96.9%) were classified as ASA 1 or 2 (relatively healthy). The average number of days with laboratory results for patients in this group (6.5) is nearly double that of all other patients in the study (3.4). Despite the presence of such a large number of healthy patients with a high degree of EHR sufficiency, our original hypothesis, that sicker patients have better EHR data sufficiency for research, was confirmed. (See Additional file 1: Table S1 for results of the negative binomial model with pregnant patients and living organ donors excluded.)
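The subgroup figures quoted above can be cross-checked against Table 2 with a weighted mean. This is only an arithmetic sketch using the rounded values printed in the text and table. The variable names are illustrative, and the reading of the Table 2 columns as n and mean days with laboratory results is an assumption.

```python
def pooled_mean(groups):
    """Weighted mean of group means; groups is a list of (n, mean) pairs."""
    total_n = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total_n


TOTAL = 10_000
special_n = 1_802  # pregnant patients and living organ donors

# Mean days with laboratory results: 6.5 in the subgroup, 3.4 in all others.
from_text = pooled_mean([(special_n, 6.5), (TOTAL - special_n, 3.4)])

# Same quantity from the Table 2 rows for sex: male, then female.
from_table = pooled_mean([(3_943, 3.4), (6_057, 4.3)])

healthy_share = 1_746 / special_n  # ASA 1 or 2 within the subgroup
print(f"{from_text:.2f} {from_table:.2f} {healthy_share:.3f}")
```

Both routes give an overall mean of a little under 4 days, which suggests that the quoted subgroup means are consistent with the table up to rounding.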
In addition to confirming our primary hypothesis, we discovered that many other variables are independently associated with data sufficiency. These include admission status at time of assessment, age, emergency classification of the procedure, procedure type (CPT category) and primary diagnosis type (ICD-9 category). Potential biases in these other characteristics of the study population should be considered when populations are selected for study based on sufficiency requirements.
Limitations and future directions
This study was performed primarily in a tertiary care academic medical center (though one of the included hospitals is a primary care facility) in a major metropolitan area. Consequently, many of the patients included in our analysis were likely referred from other facilities. Data might differ in a more rural, primary practice setting or in a health system where patients receive the majority of their care within that one system. A follow-up study should be performed to determine whether our results could be replicated in other clinical settings.
Since our findings are based only on data primarily collected for documentation of clinical care, we cannot definitively conclude that this same bias would exist for secondary use of data primarily collected for other purposes, such as regulatory oversight and billing. Further analysis should be performed on other data sources.
As a result of our decision to use ASA class as a measure of health status, our sample was limited to patients who had received anesthetic services. Though anesthetic services are generally provided to a wide range of patients, and one might therefore expect the relationship between record sufficiency and patient health to hold true more broadly, the generalizability of our results to other populations may be limited. A novel measure of health status that is independent of data quality but available for all patients in the EHR would provide a means to evaluate the correlation between health status and data sufficiency. Alternatively, a study that prospectively evaluates a representative sample of all patients in the EHR for health status could determine if the correlation exists in a more general population, though such a study would be costly. As in any retrospective study, it is possible that there exist covariates not controlled for in our model that account for the observed differences.
Acknowledgments
We would like to thank Dr. Mathew L. Maciejewski for his advice and help in the preparation of this manuscript. We would also like to thank the reviewers for their valuable input.
Funding
This work was supported by grants R01LM009886, R01LM010815, and 5T15LM007079 from the National Library of Medicine, grant UL1 TR000040 from the National Center for Advancing Translational Sciences (NCATS), and grant 5T32GM008464 from the National Institutes of Health.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AR and NGW carried out data extraction and analysis and wrote the manuscript together. SW performed the statistical analyses. CW identified the research question and directed the experiment. All authors read and approved the final manuscript.