Background
Given the unprecedented global spread and impact of COVID-19, researchers are urgently conducting research to understand the safety and effectiveness of treatment options [1, 2] and to understand the short- and long-term sequelae of SARS-CoV-2 infection [3]. Due to time exigencies, financial constraints, ethical considerations, and the suitability of certain lines of inquiry, randomized clinical trials are not always feasible or necessary, and research using existing secondary data can offer valuable hypothesis-generating insights or reliable evidence as medical professionals respond to the pandemic [4]. In addition, including data from a wide array of institutions increases sample size, enabling researchers to study less common conditions or treatments, and can enhance the generalizability of results. Electronic health records (EHR) are generated at the time of healthcare delivery as a component of clinical care. Structured data in EHR typically include detailed information on clinical encounters, including procedures, diagnoses, ordered and administered medications, demographic data, vitals, and lab orders and results. Because the United States does not have a universal healthcare system, EHR data are maintained by individual health systems, each with different standards and protocols for data collection and storage, leading to a high degree of variability in the availability and quality of data across systems. Even when two health systems use the same EHR platform, differences in implementation are common. This lack of centralized or standardized reporting makes it difficult to use EHR at a large scale to conduct nationally representative research.
To enable COVID-19 research driven by data acquired across the United States, the National Center for Advancing Translational Sciences (NCATS) supported the creation of the National COVID Cohort Collaborative (N3C), a centralized repository of EHR-sourced data currently including over 9 million patients from 69 sites representing 49 out of 50 states that can be leveraged to study potential treatments and evaluate standards of care and best practices for COVID-19 in a real-world setting [5]. Compared to census data, N3C data have been shown to be more racially diverse, though biased towards urban as opposed to rural areas [6]. N3C aggregates and harmonizes EHR data across clinical organizations in the United States and supports data from both harmonized and unharmonized common data models (CDMs) including ACT, OMOP, PCORnet, and TriNetX, with OMOP version 5.3.1 being the target data model into which the others are converted. Both automated and manual data ingestion and harmonization protocols are in place which ensure source CDM conformance to specific requirements and fitness for use [7].
The establishment of N3C coincides with a growing interest in the use of non-trials-based real-world data (RWD) to inform public health policy, formulate testable hypotheses for designing randomized clinical trials, and assist in clinical decision making. Concomitantly, concerns have also been raised over published findings using RWD that can, at times, seem contradictory [8, 9]. Model and data harmonization efforts [7] in centralized EHR repositories are the first step towards answering many research questions. However, even harmonized records may require further cleaning and processing depending on a provider's data capture practices and source data model. Furthermore, high-quality real-world evidence (RWE) study design requires high-quality data, which involves a close examination of all data streams and a deep understanding of their limitations and of the sources and mechanisms behind data quality issues, such as missingness [10, 11]. Only then can these data be used to develop studies that support public health, generate viable hypotheses, and aid in clinical decision making. In light of this, our objectives were to highlight several important areas to be examined to ensure high data quality and to present potential solutions and risk-mitigation strategies based on our experience.
Methods
The N3C data enclave systematically aggregates EHR data from partnering health systems, known as data partners, for patients who have tested positive for COVID-19 or have equivalent diagnosis codes according to the N3C phenotype. Negative controls with a non-positive SARS-CoV-2 lab result are also included at a 1:2 ratio (cases:controls). The specifics of the N3C phenotype are detailed on the N3C GitHub [12]. The final pooled data set includes information on hospital admissions, procedures, diagnoses, medications, lab test results, demographics, and basic vitals. This research was possible because of the patients whose information is included within the data and the organizations (https://ncats.nih.gov/n3c/resources/data-contribution/data-transfer-agreement-signatories) and scientists who have contributed to the on-going development of this community resource. The N3C data transfer to NCATS is performed under a Johns Hopkins University Reliance Protocol #IRB00249128 or individual site agreements with the NIH. The N3C Data Enclave is managed under the authority of the NIH; information can be found at https://ncats.nih.gov/n3c/resources. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the N3C program. Use of N3C data for this study does not involve human subjects (45 CFR 46.102) as determined by the NIH Office of IRB Operations.
Our focus was to conduct an in-depth data quality investigation to inform best practices for using these data for public health research, specifically focusing on in-hospital drug effectiveness studies. Medications received during an inpatient stay cannot be identified using insurance claims data due to the bundling of facility charges, in which a flat fee is charged for nursing, medications, supplies, etc. during each day of hospitalization. Thus, EHR data are potentially valuable for evaluations of drug utilization, safety, and effectiveness in the inpatient setting [13, 14]. We identify considerations relevant to a hypothetical study evaluating the efficacy of remdesivir treatment in hospitalized patients with COVID-19. We discuss how to appropriately define concepts of interest, highlight data quality considerations, and offer suggestions for researchers using centralized multi-institution EHR-sourced data repositories.
Study population
As a base cohort for our motivating example, we identified adult patients (≥ 18 years) who were hospitalized with COVID-19 between March 1, 2020 and September 1, 2021. The index date (initial observation period) was the earlier of either a laboratory-confirmed SARS-CoV-2 test or the presence of at least one of several “strong positive” COVID-19-related diagnoses, as defined by the N3C version 3.3 phenotype [12]. Visits were defined according to the macrovisit aggregation algorithm available in the N3C enclave, which combines individual OMOP visit records that appear to be part of the same care experience [15]. This is crucial, as clinical encounter data are highly heterogeneous at both the CDM and institutional level. We included only the first hospitalization visit for patients meeting the inclusion requirements. All overlapping inpatient and emergency department visits for a single patient were merged to reconstruct complete hospital stays. We excluded patients with missing age or sex information and excluded visits shorter than 2 days. We also excluded patients who had positive COVID-19 test results predating 1/1/2020, as earlier positive results are implausible. Systematic missingness and other quality concerns in patient data, outlined below, further excluded all but 12 data partners from our in-hospital treatment effectiveness study. Since data suitability can vary considerably depending on the research question of interest, we discuss the specific criteria for their inclusion in our analysis in the results below.
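The visit-merging step described above can be sketched in a few lines. This is an illustrative simplification, not the N3C macrovisit implementation, and assumes each visit has been reduced to a (start, end) date pair for one patient:

```python
from datetime import date

def merge_visits(visits):
    """Merge overlapping (start, end) visit intervals into single
    hospital stays, in the spirit of a macrovisit-style rollup."""
    stays = []
    for start, end in sorted(visits):
        if stays and start <= stays[-1][1]:  # overlaps the previous stay
            stays[-1] = (stays[-1][0], max(stays[-1][1], end))
        else:
            stays.append((start, end))
    return stays

# An ED visit that overlaps the subsequent inpatient admission
visits = [(date(2020, 4, 1), date(2020, 4, 2)),
          (date(2020, 4, 2), date(2020, 4, 10)),
          (date(2020, 5, 20), date(2020, 5, 21))]
stays = merge_visits(visits)

# Keep only the first stay lasting at least 2 days, per the exclusion criteria
first_eligible = next(s for s in stays if (s[1] - s[0]).days >= 2)
```

Here the overlapping ED and inpatient records collapse into a single April 1–10 stay, which is the stay retained for analysis.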
Covariate definition
A key component of observational studies using EHR data is the operationalization of definitions of baseline health status and severity of illness. Characterizing these allows us to 1) compare different treatment options while accounting for patient characteristics, and 2) examine whether treatment effects vary across different types of patient populations. We use the term “covariate” to define a variable that describes a patient and any relevant concepts of interest (age, sex, presence of health comorbidities). We focus on “baseline” covariates, meaning we collect information on relevant characteristics using data from before the exposure of interest is measured [16]. Previous investigation has already provided evidence that these baseline characteristics, many of which are only available in EHR data, are highly predictive of overall disease course severity [17].

For this illustrative example, our list of covariates includes concepts that may be related to receipt of treatment for COVID-19 or risk of COVID-19-related outcomes, including patient demographics (age, sex, race, ethnicity), smoking status, patient body mass index (BMI), chronic comorbid conditions included in the Charlson comorbidity index (hypertension, diabetes, coronary artery disease, congestive heart failure, chronic obstructive pulmonary disease, cerebrovascular disease, chronic kidney disease, cardiac arrhythmia, malignancy), and prior medication use (angiotensin-converting enzyme or ACE inhibitors, angiotensin receptor blockers or ARBs, statins) [18]. In addition to characterizing chronic health conditions, the N3C data include lab measurements offering more proximal indicators of illness severity or prognosis. We collect data on creatinine, bilirubin, partial arterial oxygen pressure (PaO2), fraction of inspired oxygen (FiO2), body temperature, white blood cell count (WBC), ferritin, C-reactive protein (CRP), interleukin-6 (IL-6), oxygen saturation (SpO2), and respiration rate within 2 days of the date of admission.
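As an illustration of the baseline-lab window, the sketch below keeps the measurement taken closest to admission within a fixed window. The choice to look on both sides of admission and the function and field names are assumptions for the example, not the study's exact definition:

```python
from datetime import date

def baseline_lab(measurements, admit_date, window_days=2):
    """Return the value measured closest to admission within
    +/- window_days, or None if nothing falls in the window."""
    in_window = [(abs((day - admit_date).days), value)
                 for day, value in measurements
                 if abs((day - admit_date).days) <= window_days]
    return min(in_window)[1] if in_window else None

admit = date(2021, 1, 10)
crp = [(date(2021, 1, 5), 40.0),    # day -5: outside the window
       (date(2021, 1, 11), 95.0)]   # day +1: qualifies
crp_baseline = baseline_lab(crp, admit)
```

A patient with no qualifying measurement yields None, which then enters the missingness assessment described below.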
Data quality assessment
In the following sections, we describe our data quality findings. We first assess the proportion of missing data across covariates of interest by institution. A visual examination of the distribution of non-missing values is also carried out. Observations exceeding two standard deviations from the global mean are identified. We then explore drug exposures and duration of treatment, focusing on evidence of treatment with remdesivir or dexamethasone initiated within the first 2 days of admission for COVID-19. Within OMOP, drug exposures are individual records corresponding to when a drug was ordered or administered to a patient. The specifics of how these data are recorded depend on the source data model and provider practice. We discuss the nuances of these data and how they relate to operationalizing definitions of exposure to medications during hospitalization. We then assess EHR continuity and its importance in capturing longitudinal care and ensuring adequate history for baseline health status of the patients included in our analyses. Subsequently, we examine multiple clinical outcomes of interest relating to COVID-19 hospitalizations including mortality, invasive mechanical ventilation, acute inpatient events, and composite outcomes. We discuss considerations in defining these events, and implications they may have on research.
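The two screening steps described here, per-site missingness proportions and flagging values more than two standard deviations from the global mean, can be sketched as follows (site names and BMI values are invented for illustration):

```python
from statistics import mean, stdev

# None marks a missing BMI value; sites and values are invented
records = [("site_A", 27.1), ("site_A", None), ("site_A", 30.4), ("site_A", 29.0),
           ("site_B", 28.2), ("site_B", 26.5), ("site_B", 31.0), ("site_B", 88.0),
           ("site_B", None), ("site_B", None)]

def missingness_by_site(rows):
    """Proportion of missing values contributed by each site."""
    tallies = {}
    for site, value in rows:
        n, miss = tallies.get(site, (0, 0))
        tallies[site] = (n + 1, miss + (value is None))
    return {site: miss / n for site, (n, miss) in tallies.items()}

rates = missingness_by_site(records)

# Flag non-missing observations beyond two SDs of the global mean
values = [v for _, v in records if v is not None]
mu, sd = mean(values), stdev(values)
outliers = [v for v in values if abs(v - mu) > 2 * sd]
```

With these toy data, site_B both contributes more missingness and supplies the implausible BMI of 88, the kind of pattern that motivates the site-level exclusions discussed in the results.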
Discussion
N3C offers a significant step forward in providing access to integrated national-level EHR data, which serves as a source of RWE that can help guide public health policy, future prospective controlled clinical trials, and clinical decision making. It is part of a broader trend in which sources of RWD are consolidated into larger repositories for research purposes. This enables large-scale validation of findings across multiple health systems, sufficient statistical power when investigating treatments and conditions with low prevalence, and improved equity and representation of underserved hospitals and patient demographics in treatment effect studies. In addition, large data sets provide an opportunity for the deployment of machine learning methods while mitigating the risk of overfitting. As a centralized repository, N3C aggregates and harmonizes data in a single location. While this comes with its own challenges, it allows the data to be queried at the row level and enables detailed centralized investigations into data quality such as those presented here. This is in contrast to federated data models, which allow data to be queried in aggregate but rely on data curation at the local level [7]. Nevertheless, operationalizing definitions for key clinical events can be complicated by limitations in site reporting and a loss of granularity through the use of a common data model to harmonize data across partners. The advantages of harmonization, however, outweigh the loss of granularity when addressing many research questions. The efficiency of centralization is also a significant advantage and was witnessed over the course of this study. For example, a number of sites using the ACT CDM initially did not have the ability to report vitals, but these data were subsequently made available due to continuous feedback during the review process. Sites are provided with data quality metrics and furnished with support to help address gaps or deficiencies, when possible.
The promise of data enclaves such as N3C is clear, yet it remains imperative to consider the limitations of the data and ensure that they are fit for purpose. Data missingness, for example, remains a significant analytical challenge. It may be reasonable to suspect that many of the lab measurements investigated in this study are informatively missing, in that they are clinically indicated (and thus ordered) only in the most severe patient cases. Yet this may be confounded by individual data-contributing health systems’ respective capacities to obtain these measures at a given point in time (e.g., clinicians faced with an overwhelming caseload during pandemic ‘peaks’ may have inferred a level of clinical severity in some patients without ordering or recording such measures, having seen the same presentation repeatedly in clinically similar cases). This suspected mixture of missing at random (MAR) and missing not at random (MNAR) mechanisms, likely varying over time within each data partner health system, presents a challenge for data analysts: in trying to delineate between informatively and inadvertently missing values, they cannot directly access subject-matter experts (such as clinical care providers experienced in each health system during each time interval under study) to form the reasonably defensible assumptions required, assumptions not verifiable from EHR data alone [31]. With many severity-of-illness indicators, it is common to create derived factors and missing data indicators. For example, body temperature may be used to create a three-level factor: “Normal temperature (<= 38 deg C)”, “Fever (> 38 deg C)”, and “Missing.” However, this approach has been shown to exhibit severe bias even with MAR data [32]. The danger of this approach becomes even more evident when examining data missingness as a function of data partner and time, shown previously in Fig. 2. A more principled variation on this approach is to form distinct additional indicator variables recording the missing-value status of each variable in a given analysis; conditioning on these indicators in downstream analyses makes explicit the assumptions about how patients with missing values may plausibly differ from those without, and informs analysts how to elicit plausible assumptions from domain experts by considering these ‘full’ data.
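The contrast between the two approaches can be made concrete. The function and field names below are illustrative:

```python
def temperature_factor(temp_c):
    """Three-level derived factor, as described in the text: the
    'Missing' level silently absorbs absent measurements."""
    if temp_c is None:
        return "Missing"
    return "Fever (> 38 deg C)" if temp_c > 38.0 else "Normal temperature (<= 38 deg C)"

def with_missingness_indicator(temp_c):
    """More explicit alternative: carry the raw value together with a
    dedicated indicator that downstream models can condition on."""
    return {"temp_c": temp_c, "temp_missing": temp_c is None}

collapsed = temperature_factor(None)          # missingness folded into the factor
explicit = with_missingness_indicator(None)   # missingness kept as its own variable
```

The second representation forces any assumption about how patients with missing temperatures differ from the rest to be stated when the indicator is modeled, rather than hidden inside a catch-all factor level.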
Excluding sites that are major contributors to data missingness does not eliminate all missing data, and outstanding issues still need to be addressed. Here, considering the different possible mechanisms responsible for missingness is necessary. One contribution to overall missingness involves mechanisms reasonably assumed to be missing completely at random (MCAR): an effectively random sample of individuals from included sites still lack values expected to have been recorded. Another likely contribution is disease severity, a mechanism closely related to what the statistical literature terms outcome-dependent observation processes [33–35]. Labs are usually ordered in sets, with the complete blood count (CBC) and basic metabolic panel (BMP) being the most common, followed by the complete metabolic panel (CMP). For COVID-19-positive patients, some providers may add CRP, d-dimer, and ferritin to CBC and BMP panels. Importantly, all of these orders vary depending on patient condition, provider practice, and standards of care, all of which may vary throughout the course of the pandemic; thus, more tenable assumptions can be adopted (at least approximately, to mitigate bias) when analyses incorporate care settings (from the healthcare-system level down to the clinic/provider level) as well as calendar time as proxy measures for such systematic differences. Similar MNAR mechanisms have been identified in end-of-life care studies where questionnaire missingness is related to poorer health status [36].
In isolation, the labs described above are MNAR, since the decisions to order specific labs are often based on existing clinical information. If the probability of observing a covariate can be assumed not to depend on its value after conditioning on other observables, then that covariate is considered MAR. This is often difficult to assume in practice, let alone empirically verify from available data, without supplemental auxiliary data; however, a detailed look at key conditional distributions combined with domain expertise can justify the adoption of MAR assumptions. In that case, a number of techniques, including multiple imputation and inverse probability weighting, exist to handle the missingness under specific assumptions, though special attention needs to be paid to both the model specification and the algorithm [37]. Finally, although complete-case analysis is used by some practitioners in these settings, omitting records with any missing data among variables associated with exposures, confounders, or outcomes is known to bias effect estimates and, at best, reduce precision [32, 38].
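As a toy illustration of inverse probability weighting under a MAR-given-covariates assumption, the sketch below estimates the probability of a lab being observed within strata of a fully observed severity variable and upweights complete cases accordingly (the data, strata, and lab values are invented):

```python
from collections import defaultdict

# (severity_stratum, lab_value) pairs; None marks a missing lab
rows = [("mild", 1.0), ("mild", 1.2), ("mild", None), ("mild", 0.9),
        ("severe", 2.1), ("severe", None), ("severe", None), ("severe", 2.4)]

# Under MAR given the stratum, P(observed | stratum) is estimable from the data
counts = defaultdict(lambda: [0, 0])            # stratum -> [n, n_observed]
for stratum, value in rows:
    counts[stratum][0] += 1
    counts[stratum][1] += value is not None
p_obs = {s: obs / n for s, (n, obs) in counts.items()}

# Upweight complete cases by 1 / P(observed | stratum)
weighted = [(v, 1.0 / p_obs[s]) for s, v in rows if v is not None]
ipw_mean = sum(v * w for v, w in weighted) / sum(w for _, w in weighted)

# Naive complete-case mean ignores that severe cases are missing more often
naive_mean = sum(v for v, _ in weighted) / len(weighted)
```

Because severe patients are missing the lab more often, the complete-case mean understates the cohort average, while the weighted estimate restores the intended balance between strata; in practice the observation model would be fitted (e.g., by logistic regression) rather than tabulated.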
Complete drug exposures, along with associated details such as dose and route of administration, are not always available, constraining the possible study designs or requiring a tradeoff between sensitivity and specificity in defining treatment. In our analysis, for example, we were only concerned with adjusting for dexamethasone treatment, as it is commonly administered and was shown in the RECOVERY trial to lower mortality in hospitalized patients receiving respiratory support [39]. However, a dedicated investigation into the effectiveness of dexamethasone may involve further subsetting of the data based on the availability of dosage and route of administration. Alternative explanations for observed drug exposure patterns should also receive consideration. The observed distribution of drug eras for remdesivir in our study may be consistent with an artifactual ‘coarsening’ mechanism in how actual drug exposures are recorded: depending on coding and CDM-mapping practices that vary by data partner, recorded durations may tend toward ‘rounder’ numbers such as 1, 5, and 10 days [40]. Also worth reiterating is the possibility that treatment with remdesivir was terminated early, and therefore had unexpected durations, due to drug reactions and side effects that outweighed the benefits of treatment.
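The rollup of individual exposure records into drug eras can be sketched as below, assuming each record is a (start, end) date pair. The one-day persistence gap is an illustrative choice; OMOP's standard drug-era derivation tolerates a larger configurable gap:

```python
from datetime import date, timedelta

def build_drug_eras(exposures, gap_days=1):
    """Collapse (start, end) exposure records into contiguous drug eras,
    joining records separated by at most gap_days."""
    eras = []
    for start, end in sorted(exposures):
        if eras and start <= eras[-1][1] + timedelta(days=gap_days):
            eras[-1] = (eras[-1][0], max(eras[-1][1], end))
        else:
            eras.append((start, end))
    return eras

# Five consecutive daily remdesivir administrations, each a one-day record
admins = [(date(2021, 2, 1) + timedelta(days=i),) * 2 for i in range(5)]
eras = build_drug_eras(admins)
era_days = (eras[0][1] - eras[0][0]).days + 1   # inclusive duration
```

Five daily administration records collapse into a single five-day era, the unit on which treatment duration analyses (and the 'rounding' artifacts noted above) are observed.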
Looking more broadly, being limited to EHR data means there is no enrollment information demarcating a specific time period during which records are known to be complete; there are no guarantees of patient-level completeness. Therefore, estimating EHR continuity, as we have outlined above, is a crucial part of mitigating the resulting bias. It cannot be assumed that the absence of a record of a given baseline comorbidity or therapy is evidence of its absence, particularly for chronic conditions and medications, unless EHR continuity can be established. This is also true for COVID-19 vaccination, an important exposure for COVID-19-related studies. There is no explicit indicator of non-vaccination, and the widespread availability of vaccines can further contribute to data fragmentation. It does appear, however, that some institutions may synchronize vaccination records with their state’s vaccine registry, which provides one strategy for assessing the completeness of an institution’s vaccine records [41]. Given the significant potential for information bias due to EHR discontinuity, some have proposed using predictive modeling to identify patients with high EHR continuity [21]. Furthermore, carefully designed methodological frameworks are needed to handle the selection bias (the sickest patients often have the most complete records) that can arise when enforcing data completeness for EHR data [42]. There are ongoing efforts within N3C to link to CMS claims, which will address many of these concerns for at least a specific patient population. Additionally, mining unstructured data such as clinical notes may provide more comprehensive information on baseline comorbidities, particularly for patients who lack a prior history with the admitting health system.
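A crude continuity proxy, far simpler than the predictive-modeling approach cited above, is to require evidence of prior contact with the health system; the one-year lookback and function names below are illustrative assumptions:

```python
from datetime import date, timedelta

def has_baseline_contact(encounter_dates, admit_date, lookback_days=365):
    """Crude EHR-continuity proxy: any encounter in the lookback window
    strictly before admission suggests baseline history was capturable."""
    window_start = admit_date - timedelta(days=lookback_days)
    return any(window_start <= d < admit_date for d in encounter_dates)

admit = date(2021, 3, 1)
known_to_system = has_baseline_contact([date(2020, 11, 15)], admit)
new_to_system = has_baseline_contact([date(2021, 3, 1)], admit)
```

A patient whose only encounter is the index admission (new_to_system) would have an empty, not merely healthy, baseline history, and absence of comorbidity codes for such patients should not be read as absence of disease.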
Suitable outcomes for evaluation are generally limited by the availability of data in EHRs and, more specifically, by both the OMOP data model and data partner reporting. ICU admissions, for example, cannot be resolved from the visit-level information available, notwithstanding the possibility that some sites repurposed non-ICUs to serve as ICUs during surges in COVID-19 patients. Additionally, despite the popularity of composite outcomes, such as death or discharge to hospice, we find that most data partners do not provide discharge disposition. Perhaps most important with regard to outcomes is the current lack of availability of mortality data outside the EHR, which underestimates overall mortality and restricts investigations to in-hospital mortality alone. Although survival methods may not be the most appropriate choice for patients who are hospitalized with critical illness [43], they are nonetheless widely used. For survival analysis, being limited to in-hospital mortality alone has important implications. Patients who are discharged from the hospital are typically discharged either due to recovery or to a different care facility due to disease severity. In either case, the risk of death among discharged patients is not the same as among patients who remain hospitalized. Censoring patients at discharge therefore introduces a differential risk of death between censored and non-censored observations, violating the non-informative censoring assumption necessary for common survival models such as Cox proportional hazards. This violation can be addressed by censoring all patients after a fixed time period, known as the “best-case” or “best-outcome” approach, which assumes discharged patients survived until the end of the observation period [30]. Alternatively, one can treat discharge as a competing outcome and rely on the subdistribution hazard function or other methods for competing-risk analysis [32]. The most direct remedy would be to link patient records through privacy-preserving record linkage (PPRL) to ancillary sources of mortality data to capture deaths post-discharge. This may be particularly important in view of published data suggesting an increased risk of mortality for many months following hospitalization for COVID-19 [44]. A summary of the issues presented is shown in Table 7.
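The fixed-time “best-case” censoring scheme can be sketched as follows; the 28-day horizon and record layout are illustrative assumptions:

```python
FOLLOW_UP_DAYS = 28   # illustrative fixed observation window

def best_case_record(outcome, day, follow_up=FOLLOW_UP_DAYS):
    """Map an in-hospital outcome to a (time, event) pair under the
    'best-case' scheme: discharged (and still-hospitalized) patients
    are assumed alive through the end of follow-up."""
    if outcome == "died" and day <= follow_up:
        return (day, 1)       # observed in-hospital death
    return (follow_up, 0)     # censored at the fixed horizon

cohort = [("died", 10), ("discharged", 7), ("in_hospital", 28)]
records = [best_case_record(outcome, day) for outcome, day in cohort]
```

Note that the discharged patient is censored at day 28, not at the day-7 discharge, which is exactly what removes the informative-censoring problem at the cost of an optimistic survival assumption; the competing-risk alternative would instead keep discharge as a second event type.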
Table 7
Summary of challenges presented along with possible solutions

Challenge | Possible solutions
Source-specific variability in data availability | • Cluster data sources based on relevant study variables and eliminate those with insufficient data. • Investigate possible temporal missingness patterns and evidence of MNAR data. • Potentially leverage relevant techniques such as multiple imputation and inverse probability weighting to handle remaining missing data.
Unreconciled drug exposure intervals | • Aggregate contiguous drug exposure intervals into single drug eras. • Residual open-ended intervals may not allow for time-varying analysis and may only be suitable for analysis as point exposures.
Absence of baseline medical history | • Perform a sensitivity analysis to understand the impact of EHR continuity on the estimand. • Consider incorporating prognostic factors proximal to the outcome into the model.
Limited availability of out-of-hospital mortality data | • Consider a sensitivity analysis on censoring time for discharged patients. • Employ competing-risk analysis with discharge and in-hospital mortality as competing risks.
Previous medical history carried forward in EHR data | • Calculate the number of events recorded per day throughout the visit for an outcome of interest. • Determine if treatment preceded the outcome or if it is an artifact.
These pervasive issues have been noted across a number of multi-site EHR repositories [45, 46]. In the Optum De-identified COVID-19 EHR data set, Chawla et al. note that missing data are MNAR due to the urgency of the pandemic and can affect measured outcomes [47]. Dependence on diagnostic and procedural codes may result in underreporting of events, and mortality rates can also be underestimated. Another analysis using the COVID-19 Research Database explains that only associations, rather than causality, can be determined using available medical record data, as unmeasured confounders can mask true links between outcomes [48].
With the proper strategies and data quality considerations, N3C is particularly well suited to investigating treatment effectiveness in hospitalized patients. It contains rich and detailed clinical data such as laboratory results, vital signs, and other measurements. These observations can serve as proximal measures to account for differences in severity of illness across patients, enable patient phenotyping and confounding adjustment, and may be more relevant than many chronic comorbidities, which can be more difficult to measure. Additionally, data in N3C are routinely updated with little to no time lag, which is critical when pandemic conditions are changing rapidly and new variants of the SARS-CoV-2 virus are emerging. With appropriate treatment of the data quality issues outlined in this paper, in addition to robust study design, N3C has demonstrated its central importance as a source of RWE for COVID-19.
Acknowledgments
We gratefully acknowledge the following core contributors to N3C: Anita Walden, Leonie Misquitta, Joni L. Rutter, Kenneth R. Gersing, Penny Wung Burgoon, Samuel Bozzette, Mariam Deacy, Christopher Dillon, Rebecca Erwin-Cohen, Nicole Garbarini, Valery Gordon, Michael G. Kurilla, Emily Carlson Marti, Sam G. Michael, Lili Portilla, Clare Schmitt, Meredith Temple-O’Connor, David A. Eichmann, Warren A. Kibbe, Hongfang Liu, Philip R.O. Payne, Peter N. Robinson, Joel H. Saltz, Heidi Spratt, Justin Starren, Christine Suver, Adam B. Wilcox, Andrew E. Williams, Chunlei Wu, Davera Gabriel, Stephanie S. Hong, Kristin Kostka, Harold P. Lehmann, Michele Morris, Matvey B. Palchuk, Xiaohan Tanner Zhang, Richard L. Zhu, Benjamin Amor, Mark M. Bissell, Marshall Clark, Stephanie S. Hong, Kristin Kostka, Adam M. Lee, Robert T. Miller, Michele Morris, Matvey B. Palchuk, Kellie M. Walters, Will Cooper, Patricia A. Francis, Rafael Fuentes, Alexis Graves, Julie A. McMurry, Shawn T. O’Neil, Usman Sheikh, Elizabeth Zampino, Katie Rebecca Bradwell, Amin Manna, Nabeel Qureshi, Richard Moffitt, Christine Suver, Julie A. McMurry, Carolyn Bramante, Jeremy Richard Harper, Wenndy Hernandez, Farrukh M Koraishy, Amit Saha, Satyanarayana Vedula, Johanna Loomba, Andrea Zhou, Steve Johnson, Evan French, Alfred (Jerrod) Anzalone, Umit Topaloglu, Amy Olex. Details of contributions available at covid.cd2h.org/acknowledgements
The N3C Consortium
G Caleb Alexander10; Benjamin Bates5; Christopher G Chute11; Jayme L. Dahlin1, 2; Ken Gersing1; Melissa A Haendel12; Hemalkumar B Mehta11; Emily R. Pfaff13; David Sahner1, 2.
10. Johns Hopkins School of Medicine, Baltimore, MD, USA
11. Johns Hopkins University, Baltimore, MD, USA
12. University of Colorado Anschutz Medical Campus, Aurora, CO, USA
13. UNC Chapel Hill School of Medicine, Chapel Hill, NC, USA
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.