In this analysis of more than 266,000 individuals with follow-up of almost 5 years, we found that hospital diagnoses of colorectal cancer can be used to reliably identify and/or rule out incident cases in the absence of cancer registry data. Hospital diagnoses of lung cancer were not as comprehensive or timely as those for colorectal cancer, but still provide a reasonable indicator for incident lung cancers. However, ascertainment of lung cancer diagnosis can be improved via the use of lung cancer death records.
Our results for lung and colorectal cancer broadly concur with a previous study of breast cancer in 45 and Up Study participants. Kemp et al. reported that APDC diagnosis codes identified incident breast cancers with 86% sensitivity and 86% PPV in 2004–2008 [4]. In our analysis of breast cancer, PPV was 86% and sensitivity was 90%. Restricting to female cases diagnosed in 2004–2008 made no material difference to our results, which are comparable to the findings of Kemp et al. The improved sensitivity in our analysis can be explained by differences in method: Kemp et al. used a 3-month window for identification of true positives and only included the principal diagnosis code at each hospital admission, whereas we used a 12-month window and included all diagnoses recorded at each admission. The higher sensitivity and unchanged PPV suggest that the latter is the better of the two algorithms. In our study, we have built on the work of Kemp et al. to show that other common cancer types (colorectal, lung) are also amenable to the use of surrogate outcome markers of diagnosis in routinely collected hospital data. However, we have also shown in exploratory analysis that not all cancer types are necessarily amenable to this approach, especially if treatment does not routinely occur on an in-patient basis.
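The effect of the matching window on sensitivity and PPV can be illustrated with a minimal sketch. The function name `validate`, the toy records and the symmetric "within N days of the registry date" rule are assumptions made for illustration only; they are not the study's actual code or data, and the study's exact matching rule is not restated here.

```python
from datetime import date


def validate(registry, hospital, window_days):
    """Return (sensitivity, PPV) of hospital-flagged cases against the registry.

    registry: dict of person id -> registry date of diagnosis (reference standard)
    hospital: dict of person id -> earliest hospital record carrying the cancer code
    A flagged person counts as a true positive only if the hospital date falls
    within window_days of the registry date (assumed symmetric window).
    """
    true_pos = {
        pid
        for pid, reg_date in registry.items()
        if pid in hospital and abs((hospital[pid] - reg_date).days) <= window_days
    }
    sensitivity = len(true_pos) / len(registry)
    ppv = len(true_pos) / len(hospital)
    return sensitivity, ppv


# Hypothetical toy data: four registry cases, three people flagged in hospital data.
registry = {
    1: date(2008, 3, 1),
    2: date(2009, 6, 15),
    3: date(2010, 1, 10),
    4: date(2007, 11, 5),
}
hospital = {
    1: date(2008, 3, 20),   # 19 days after the registry date
    2: date(2009, 12, 1),   # 169 days after the registry date
    5: date(2009, 2, 2),    # flagged in hospital data but not in the registry
}

wide = validate(registry, hospital, window_days=365)    # roughly a 12-month window
narrow = validate(registry, hospital, window_days=90)   # roughly a 3-month window
print(wide, narrow)
```

In this toy example, person 2 is counted as a true positive only under the wider window, so both measures are higher with the wider window. With real data, the direction and size of any change in PPV depends on how true positives and flagged cases are defined.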
Lung cancers
We found that ascertainment of lung cancer diagnosis can be improved via the use of lung cancer death records for this low-survival disease. It should, however, be borne in mind that such a strategy will preferentially identify fatal lung cancers over non-fatal cancers, so this may not be an appropriate ‘surrogate’ diagnostic marker for lung cancer for all analyses. The inclusion of death records improved the sensitivity with which lung cancer was identified, but led to a slight reduction in PPV. The death records often identified people who died shortly after being diagnosed, who did not have a long period of time to use health services and therefore had less chance of being identified in the hospital data. The reduction in PPV caused by the inclusion of death records was partly due to the introduction of people who died from cancer but who were diagnosed in the NSWCR before the study period. It was also due to the inclusion of people who most likely had secondary lung cancer metastasised from another primary site, or who had other lung disease, but who were classified as having primary lung cancer on their death certificate.
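The trade-off described above, where adding a second data source raises sensitivity but can lower PPV, can be sketched with simple set arithmetic. The person identifiers and the function name `sensitivity_ppv` below are invented for illustration and do not correspond to study data.

```python
def sensitivity_ppv(flagged, registry_cases):
    """Sensitivity and PPV of a set of flagged people against registry cases."""
    true_pos = flagged & registry_cases
    return len(true_pos) / len(registry_cases), len(true_pos) / len(flagged)


# Hypothetical identifiers: ten registry-confirmed cases.
registry_cases = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
# Hospital diagnoses find seven of them, plus one person not in the registry.
hospital_flagged = {1, 2, 3, 4, 5, 6, 7, 11}
# Death records add two cases missed by hospital data, plus two non-registry people
# (for example, cases diagnosed before the study period or misclassified deaths).
death_flagged = {7, 8, 9, 12, 13}

hospital_only = sensitivity_ppv(hospital_flagged, registry_cases)
combined = sensitivity_ppv(hospital_flagged | death_flagged, registry_cases)
print(hospital_only)  # (0.7, 0.875)
print(combined)       # (0.9, 0.75)
```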
Limitations
One limitation of our analysis is that although it is population-based, the 45 and Up Study is not strictly representative of the general population [9], with those in marginalised groups less likely to participate in studies of this type. For that reason, hospital data might not be as sensitive for population-wide identification of lung cancer as for other cancer types, given that socio-economic differences exist in smoking rates and in patterns of lung cancer care [6]. However, using an available dataset containing all NSWCR records for 2001–2009 and their hospital records for 2000–2011, we ran the same analysis using hospital diagnosis records for the whole NSW population and obtained similar estimates for sensitivity for colorectal and lung cancers (data not shown). Also, the results reflect the data for the study period and might not be representative of later time periods. The results also included cancers diagnosed prior to entry into the 45 and Up Study and so might not be representative of future incident cancers, but the results were very similar when only post-baseline cancers were included. The narrow confidence intervals for sensitivity, PPV and specificity, which result from the very large cohort and numbers of cases, are a strength of the study, but they reflect the precision of the estimates and not necessarily their accuracy, as other sources of uncertainty cannot be excluded.
There are also some limitations to the suggested surrogate markers for incident cancers. The hospital data do not include information about disease stage or the actual date of diagnosis, which are often important data items required for cancer-related studies such as those assessing the appropriateness or timeliness of treatment. In NSW, pathology is performed through a mix of private and public hospital laboratories, with no one pathology database covering the entire population, so detailed individual-level pathology data beyond those included in cancer registry data were not available. We also carried out a preliminary investigation of the cancer site/location recorded in the hospital records (data not shown); these often varied and differed from those recorded in the cancer registry data, for example with rectosigmoid cancer recorded as rectal cancer or vice versa. However, hospital data can be used to identify important health-related information that is not available from cancer registries, such as the presence of various comorbid conditions over time.
The surrogate indicators for cancer, particularly for lung cancer, tended to lag behind the actual diagnosis date, although around three-quarters were identified up to 3 months before or after the NSWCR recorded date of diagnosis. Using these sources would result in a small dip in the number of cases in the months after the cancer registry data end, because the surrogate indicators identify some cases already covered by the final cancer registry data, but the number would then return to around the expected level. The time lag might be important when assessing a relatively short time-related factor such as the timeliness of treatment after diagnosis or short-term survival, but for overall incidence it is a reasonable measure. Some cases were not identified using the hospital records, in particular for lung cancer. Using hospital records alone will miss a small proportion of new cases, and these tend to be people with less health system contact, such as those with unknown disease stage or from non-metropolitan areas. This also suggests that if hospital data are used to calculate incidence, they will give a slight underestimate and could attenuate differences between cancer cases and non-cases in analyses of risk factors for cancer. However, this relates to a relatively small proportion of the overall number of cases. The criteria used to assess the validity of the surrogate indicator algorithms are not perfect, particularly for people diagnosed at the start or end of the study period. For example, someone diagnosed in the NSWCR in December 2010 and in the APDC in January 2011 would be considered a ‘false negative’ in the APDC due to the study period date cut-off. However, we believe that overall the criteria used provide a strong and objective measure of validity for comparisons.
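The cut-off effect in the December 2010 example can be sketched as follows. The function `classify`, the `STUDY_END` date of 31 December 2010 (inferred from the example above) and the simplified rule that ignores any matching window are all assumptions for illustration, not the study's validation criteria.

```python
from datetime import date

# Assumed study cut-off, inferred from the December 2010 / January 2011 example.
STUDY_END = date(2010, 12, 31)


def classify(registry_date, hospital_date, study_end=STUDY_END):
    """Classify one person using only records dated on or before study_end.

    Either date may be None when the person has no record in that source.
    """
    in_registry = registry_date is not None and registry_date <= study_end
    in_hospital = hospital_date is not None and hospital_date <= study_end
    if in_registry and in_hospital:
        return "true positive"
    if in_registry:
        return "false negative"
    if in_hospital:
        return "false positive"
    return "true negative"


# Registry diagnosis just before the cut-off, first hospital record just after it:
# the hospital record is censored, so the person counts as a false negative.
edge_case = classify(date(2010, 12, 10), date(2011, 1, 15))
# The same one-month lag well inside the study period is a true positive.
mid_period = classify(date(2010, 6, 10), date(2010, 7, 15))
print(edge_case, "|", mid_period)
```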
Furthermore, there are some limitations relating to the study we have undertaken. The primary purposes of the non-cancer registry data sources do not include cancer identification or recording, so they should be used for this purpose with caution. It is also possible that the collection of hospital data might change in future and this could impact upon their validity for identifying incident cancers. There is a small chance of false negative or false positive linkage, which can have an impact when there is a relatively small proportion of cases. Finally, the NSWCR data for 2011 and COD-URF data for 2013 became available as we were completing this study (in 2016), but we have not yet gained access to these data to allow for further analysis.
In this analysis of specific cancer types, we found that an algorithm based on hospital records, rather than emergency department records or Medicare claims, was the most accessible, practical and valid method for ascertaining cancer diagnosis. The EDDC is a rich and useful dataset in its own right, but it does not appear to contribute to the identification of cancer cases. The EDDC data custodian warns against the use of diagnosis fields in the EDDC for analytical purposes, as only one diagnosis is recorded per presentation and it is not coded consistently across all EDs in the state [10]. Furthermore, the EDDC did not capture all EDs in the state throughout the study period. It covered around 80% of ED presentations in 2007, with coverage having increased steadily since 2005 [10]. Despite these limitations, the EDDC still provides powerful information about an important part of patient care. Similarly, a great deal of information was gained from the claims records in the MBS and PBS. The data identified many thousands of people who had cancer treatment and provide an excellent insight into patient care, but the recorded items may not be specific to cancer types (e.g. chemotherapy medicines such as docetaxel can be used for several different cancer types), so by themselves they may not be useful as surrogate indicators for these specific cancers. Future work, however, will explore methods for overcoming such issues via the use of probabilistic algorithmic approaches using the rich information in all of the available datasets.
Implications
For ongoing cohort studies there is great benefit in having cancer incidence data that are as current as possible, allowing for more timely and relevant examination of cancer-related outcomes, as well as a greater number of cases to increase the power to detect associations. Furthermore, for countries that lack centralised cancer registries, being able to estimate cancer incidence through hospital and other medical records is of benefit for research, surveillance and planning purposes.
APDC diagnosis records for colorectal or lung cancer were adequate for identifying new cases of these cancer types in this prospective cohort study. Using the APDC to identify new cases of colorectal and lung cancer provides more up-to-date cancer incidence data and permits investigation of a range of topics with greater follow-up time from entry into the study and higher statistical power. Using the APDC diagnosis data to the end of the follow-up period (June 2014) more than doubled the number of incident cancers since entry into the study compared to using cancer registry data alone, increasing the median follow-up time from 2.5 to 6 years and providing greater power to, for example, detect associations between risk factors and cancer incidence. Using hospital records, the vast majority of cases in the cohort were picked up and those who were identified as having cancer are highly likely to be true cases. The extremely large cohort provided large numbers of cases and precise estimates and the use of population-based datasets allows for excellent coverage of the cohort and the conditions of interest.