Introduction
Timely and reliable surveillance data are critical to guide efforts to reduce cancer morbidity and mortality, particularly among underserved populations that experience disparities in cancer risk factors, health care services, and outcomes [1-3]. State cancer prevention and control programs rely on surveillance data to understand trends, evaluate the effectiveness of interventions, measure health equity, and distribute resources to the populations at highest risk. While cancer surveillance within the U.S. is relatively robust compared to surveillance for many other chronic conditions, there are several key limitations to current practices. First, prevalence estimates for risk factors and cancer screenings often come from population-based health surveys [3-7], which can be both cost- and time-intensive and can have low validity for self-reported health information [8-13]. Second, although cancer registries provide reliable cancer prevalence and incidence rates, they often lack comprehensive information on the full cascade of engagement with the healthcare system, from screening to timely initiation of treatment, often referred to as the 'cascade of care.' Cancer registries also cannot assess prevention and screening efforts within wider patient populations (e.g., population-level cancer screening rates) [14, 15].
Widespread adoption of electronic health records (EHRs) presents a strategic opportunity to address these limitations in cancer surveillance. EHR data contain a wealth of clinical information, including diagnoses, procedures, lab results, medications, and vitals, which can be accessed in real time for large convenience samples of in-care patients. In addition, the expansion of health information exchanges and research networks allows EHR data to be linked across contributing healthcare institutions. These data systems, which often use common data models to standardize data formats and link patient records across disparate institutions, can provide a more complete representation of care received and wider geographic coverage than EHR data from a single healthcare institution. EHR-based systems therefore have the potential to produce timely estimates of cancer surveillance indicators along the continuum of the cascade of care. Yet the extent to which EHR networks can be used to generate accurate cancer surveillance metrics remains unknown.
EHR data are routinely used to explore and explain patterns in cancer care among patient populations. This includes reporting on clinical quality measures for health systems, such as adherence to cancer-related preventative and screening services [16], as well as epidemiologic and clinical research, such as assessing determinants of non-adherence to cancer guidelines, developing interventions to increase screening and immunization efforts, and identifying delays in care [17-25]. In addition, there has been growing support for and adoption of using EHRs to automate and standardize reporting to state central cancer registries [26, 27]. However, utilization of EHR data to inform cancer prevention and control programs has been more limited. To our knowledge, there have been no comprehensive scoping reviews assessing the use of EHR data for cancer surveillance, nor have efforts been made to comprehensively design and test EHR-based cancer surveillance metrics.
To fill these gaps, our study aimed to (1) review the current state of the literature on how EHRs have been used to inform cancer surveillance to date; (2) propose potential surveillance indicators that can be constructed using common data model EHR variables along the spectrum of cancer surveillance: risk factors, screening and immunization, quality of care, and incidence or prevalence; and (3) perform an initial validity test of these proposed indicators. The overall goal of the project was to identify EHR-based cancer surveillance indicators that state public health programs could use to set objectives for improving cancer prevention and control, plan public health interventions, and evaluate state-level progress towards achieving those objectives. Here, we define cancer surveillance indicators as measures related to primary prevention (i.e., reducing the incidence of cancer) or secondary prevention (i.e., early diagnosis or prompt treatment of cancer).
Discussion
Our scoping review identified a substantial number of studies spanning a diversity of topics along the cancer cascade of care. To our knowledge, this is the first scoping review to provide a comprehensive overview of the EHR-based cancer literature starting from the recent maturation of EHR networks. Importantly, this literature base was critical for informing our indicator development work by providing our stakeholders with an understanding of the feasibility and acceptability of using EHRs to measure these types of indicators and by providing variable definitions that we could attempt to replicate or improve upon in this work.
However, we identified a number of gaps in the current literature regarding the use of EHRs to inform cancer prevention and control programs. Although we identified many articles that measured variables related to the cancer cascade of care using EHRs, few had an explicit purpose of informing cancer surveillance efforts. Articles that focused on quality improvement or epidemiologic research generally did not address issues related to selection biases or the representativeness of patient samples, a key challenge for using these data for surveillance efforts. Those that did examine EHR-based measures from the lens of public health surveillance were more likely to incorporate methods or validation approaches to address issues of population representativeness in their samples, but these studies were predominantly focused on cancer risk factors [49-54, 68-70]. In addition, while most studies provided clear conceptual definitions of their EHR-based variables (e.g., receipt of a screening mammogram within the prior two years), a considerable proportion did not include practical definitions, such as use of a common data model, specific clinical sources within the EHR, or standardized codes/terminology. This lack of information could limit the replicability of these studies.
In our development of the proposed indicators for public health surveillance of cancer prevention and control, we attempted to fill these gaps by assessing measures along the cascade of care, from cancer risk factors to cancer incidence and prevalence, and by providing clear definitions that are directly transportable to PCORnet research networks and adaptable to other EHR data sources (Additional file 2). We also tested post-stratification methods to account for the demographic differences between patient samples and target populations and assessed the external validity of these measures. Importantly, we found that the validity of the PCORnet common data model-based cancer surveillance indicators varied substantially. Among the domains of surveillance indicators, estimates for cancer risk factors generally showed the best performance, likely due to measurement of obesity and smoking status at the majority of medical encounters. Estimates were comparable between the PCORnet common data model and a raw EHR and were similar or only slightly higher than estimates from external surveillance sources. These findings align with previous studies identified in the scoping review, which demonstrated that EHR-based obesity and smoking indicators were comparable to estimates from established surveillance data systems after weighting or adjusting for demographic differences between the patient and target populations [66-70].
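The post-stratification approach described above can be sketched in a few lines. The strata, counts, and target shares below are hypothetical illustrative values, not figures from this study:

```python
# Post-stratification sketch: reweight a prevalence estimate from a patient
# sample so its age/sex mix matches a target population. All counts and
# shares below are hypothetical illustrative values.

# Stratum -> (patients in sample, patients with the indicator)
sample = {
    ("18-44", "F"): (4000, 800),
    ("18-44", "M"): (2000, 360),
    ("45-64", "F"): (3000, 900),
    ("45-64", "M"): (1000, 260),
}

# Target population share of each stratum (e.g., from census data; sums to 1)
target_share = {
    ("18-44", "F"): 0.24,
    ("18-44", "M"): 0.26,
    ("45-64", "F"): 0.26,
    ("45-64", "M"): 0.24,
}

def poststratified_prevalence(sample, target_share):
    """Weight each stratum's prevalence by its target-population share."""
    return sum(
        (cases / n) * target_share[stratum]
        for stratum, (n, cases) in sample.items()
    )

n_total = sum(n for n, _ in sample.values())
crude = sum(cases for _, cases in sample.values()) / n_total
print(round(crude, 4), round(poststratified_prevalence(sample, target_share), 4))
```

Each stratum's prevalence is weighted by the target population's share of that stratum, correcting demographic imbalance in the convenience sample; as discussed above, this cannot remove selection on unmeasured factors.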
In contrast to cancer risk factors, the unweighted and weighted estimates generated for the screening and immunization indicators demonstrated poor performance, with substantial underestimation as compared to estimates generated using a raw EHR. Within the NYU Langone sample, we saw that many patients had documentation of their screenings and vaccinations within the health maintenance or immunization modules but not within standardized diagnosis or procedure codes. This lower prevalence within the common data model estimates may be largely attributable to the exclusion of certain components of the EHR from the PCORnet common data model, which largely relies on structured variables and standardized codes [
71]. Further, this highlights the importance of directly specifying the sources of clinical data from within the EHR, as information may not be consistently captured or recorded throughout the system. More importantly, EHR-derived estimates for screening and immunization (from either raw EHR or PCORnet common data model) were much lower than estimates from traditional surveillance data sources, indicating that controlling for demographic differences alone was insufficient to address the substantial underestimation of these indicators within EHR data. This underestimation may be reflective of patients receiving preventative services at outpatient practices that are not affiliated with large hospital systems or clinical research networks.
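To make the reliance on standardized codes concrete, the following is a minimal sketch of a screening indicator computed from structured procedure codes alone, in the style of a common data model query. The flat table layout and the single-code set are illustrative assumptions, not the study's actual specification; any screening documented only in a health maintenance module would be invisible to this logic:

```python
from datetime import date

# Sketch: measure screening uptake from standardized procedure codes only,
# as a PCORnet-style common data model would. The table layout and code set
# below are illustrative assumptions, not the study's specification.
SCREENING_MAMMOGRAPHY_CODES = {"77067"}  # example CPT code set

procedures = [  # (patient_id, procedure_code, service_date) -- toy rows
    ("p1", "77067", date(2023, 5, 1)),
    ("p2", "99213", date(2023, 6, 1)),   # office visit, not a screening
    ("p3", "77067", date(2020, 1, 1)),   # outside the 2-year lookback
]

def screened_fraction(procedures, patient_ids, index_date, years=2):
    """Fraction of eligible patients with a screening code in the window."""
    cutoff = date(index_date.year - years, index_date.month, index_date.day)
    screened = {
        pid for pid, code, d in procedures
        if code in SCREENING_MAMMOGRAPHY_CODES and cutoff <= d <= index_date
    }
    return len(screened & set(patient_ids)) / len(patient_ids)

print(screened_fraction(procedures, ["p1", "p2", "p3"], date(2024, 1, 1)))
```

In this toy cohort only one of three patients counts as screened, even if the others were screened but documented outside the coded procedure table, which is the mechanism of underestimation described above.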
We found that common data model estimates were actually higher than those calculated using a raw EHR for many of the quality of care indicators. These results demonstrate the potential benefits of using health information exchanges or research networks, which can increase capture if patients receive care across multiple institutions. This benefit may be more apparent for quality of care indicators than preventative care indicators since these services may be more likely to occur in hospital-based settings. We were unable to externally validate these indicators, as there is no established NYC-based surveillance system that tracks timely diagnostic testing and diagnosis after abnormal cancer screenings. Using EHRs to monitor quality of cancer care represents a unique opportunity to fill this gap, and the few studies that we identified as related to this goal demonstrated that EHR data could provide valuable insights into trends and patterns in cancer care [72-74].
Cancer incidence and prevalence within the INSIGHT population was variably higher or lower than incidence and prevalence within the NYU Langone population. This may reflect underlying differences in patterns of care, where NYU Langone may provide care for a disproportionate share of breast cancer patients while other organizations within the INSIGHT network may provide care for a disproportionate share of cervical cancer patients in NYC. In our external validation of these indicators, the weighted estimates for the incidence and prevalence indicators were substantially overestimated as compared to rates reported in the NYS Cancer Registry. However, prior studies that validated EHR-based cancer cases by directly linking these data to cancer registries demonstrated reduced sensitivity of EHRs as compared to registries [55, 57-59]. This overestimation in our weighted estimates is therefore likely related to the calculation of incidence and prevalence rates within sicker patient populations, which presented a selection bias that could not be remedied by controlling for demographic differences alone.
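A toy calculation illustrates this denominator-driven selection bias: the same case count produces a much higher apparent prevalence when the denominator is an in-care convenience sample rather than the full catchment population. All numbers below are invented for illustration only:

```python
# Toy illustration of denominator-driven overestimation. The same case
# count yields an inflated "prevalence" when computed over an in-care
# (sicker) patient sample instead of the resident population.
# All values below are invented for illustration only.
cases = 500                       # cancer cases captured in the EHR network
in_care_patients = 100_000        # patients with any encounter in the network
resident_population = 400_000     # census-based catchment population

ehr_prevalence = cases / in_care_patients
population_prevalence = cases / resident_population

print(ehr_prevalence / population_prevalence)  # four-fold apparent inflation
```

Because the inflation comes from who enters the denominator rather than from the demographic mix of the sample, reweighting on age or sex alone cannot correct it.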
Limitations to our scoping review include the use of a single reviewer during the data extraction portion of this study, which limited our ability to assess potential inconsistencies in how reviewers extracted the data. However, 87% agreement between two reviewers on a large sample of articles mitigated this concern. We also did not publish the scoping review protocol, and we excluded articles published from 2009 to 2011 during data extraction because articles became less relevant and useful the further back in time we went. Limitations to the indicator development include our inability to use race/ethnicity in our weighting approach due to the high proportion of INSIGHT patients with an unknown or other race/ethnicity. Numerous articles identified through the scoping review demonstrated patterns in these cancer indicators by race/ethnicity [69, 70, 75-77], so our weighted estimates likely contain residual biases due to the racial/ethnic distribution of this patient population. Our weighting approach also incorporated only demographic variables, which likely cannot fully account for the systematic differences between patient populations and the general population. In addition, while we provide initial validation results, we did not formally evaluate the internal validity of our indicators through manual chart review, and many of our indicators lack a true gold standard against which to externally validate these measures. Our data were also restricted to version 5.1 of the PCORnet common data model. More recent versions of this common data model include provider specialty and qualitative lab results, which would likely improve the estimation of preventative services, like cancer screenings, using these data.
Conclusions
Our review of the current literature suggests that future research on the use of EHRs for cancer surveillance will benefit from careful reporting of key information, including EHR variable definitions, standardized codes and common data model correlates, and descriptions of data quality and bias correction measures. Efforts could also be made to improve the PCORnet common data model for surveillance purposes, such as through improved reporting of race/ethnicity and the inclusion of additional sources of preventative health services information from raw EHRs. Future studies could consider limiting patient cohorts to those seen by primary care providers and incorporating additional variables, like insurance status, into weighting or adjustment strategies to better address the biases we observed. Local, state, territorial, and national public health agencies have a strategic opportunity to use timely and geographically granular EHR data for select indicators, such as cancer risk factors, to plan and evaluate interventions for cancer prevention and control. These data can also potentially provide more accessible and richer information on the cascade of cancer care than routine surveillance data systems. However, EHR data currently cannot be used to monitor screening and immunization or cancer incidence and prevalence because of the biases we observed for these indicators. Further research is needed to address the population representativeness of these convenience samples.