Background
Analyses of real-world evidence (RWE) are becoming increasingly popular in all therapy areas, including respiratory diseases [1, 2]. This is largely because data regarding the effectiveness and safety of treatments are critical to guide treatment decisions by physicians and decision makers/payors, as they are more generalizable to daily care than data from randomized clinical trials with strict inclusion/exclusion criteria [2, 3]. However, issues concerning the appropriateness of RWE collection methods, analysis methods and data reliability have been previously highlighted [4, 5].
Many study designs and data sources are available for RWE generation, which mainly involve primary data collection or secondary data use [5, 6]. Primary data studies document data from patients explicitly included for study purposes, and may collect data either retrospectively, via patient charts/other data sources, or prospectively, via documentation of data by study physicians/patients. Secondary data studies retrospectively use data that were previously recorded for reasons other than the intended study objectives. Such data may, for example, be obtained from administrative claims databases, existing patient registries, or electronic medical record databases [7].
Most RWE studies are based on a single data type; however, each data type is associated with its own strengths and weaknesses [6] and the appropriate choice of data type and study design is dependent on the scientific question asked. In addition, practical factors such as data availability, cost, generalizability of the data, project timelines and necessary approval processes including ethical approval may impact the choice of study design and data source [8, 9]. Data enrichment via linkage of different types of RWE could address some of the weaknesses associated with single source data capture and improve the scientific quality of the studies; however, data linkage may also be associated with limitations such as selection bias (e.g. when linking study populations with different inclusion/exclusion criteria), potential linking errors [10], or a potential loss of power due to smaller sample size.
To gain insight into the value of linking primary and secondary data sources, this analysis assessed the degree of agreement between such data sources using the example of chronic obstructive pulmonary disease (COPD) in Germany.
Discussion
This study evaluated agreement between primary and secondary data collection (prospective observational data/retrospective chart review and retrospective claims data, respectively). We found discrepancies between primary data and claims data for linked patients for most assessed variables including comorbidities, drug prescriptions, and exacerbations, with a lower number of comorbid patients, COPD therapy prescriptions and exacerbations reported in primary versus claims data.
Agreement between primary and claims datasets with regard to sociodemographic data was very good. The gender information was identical between datasets and the information on age differed only in a small number of patients (4.3%, mean difference 9.4 years). This difference could be associated with recording errors or missing data in the primary or claims datasets; however, as the mean age for linked patients was similar between the primary and claims data, it is unlikely that this reflects any major discrepancies between the datasets.
Kappa values for the prevalence rates of observed comorbidities indicated agreement ranging from none to strong between the primary and claims datasets, based on a previously published interpretation in which values ≤0.20 indicate no agreement, 0.21–<0.40 minimal agreement, 0.40–<0.60 weak agreement, 0.60–<0.80 moderate agreement, 0.80–0.90 strong agreement and >0.90 almost perfect agreement [12]. We initially expected to observe a higher percentage of comorbid patients in the primary data compared with the claims data, as physicians were asked to document all comorbidities that they were aware of in the primary data collection, whereas the claims data analysis only considered comorbidities documented by a respective ICD-10 code between January 1, 2010 and the patient’s linked dataset index date. Instead, the percentage of comorbid patients was higher using claims data compared with primary data; this difference was even more pronounced in patients included by pneumologists in the primary dataset, as indicated by the lower kappa values within this subset. Specific limitations of the data sources may explain these deviations. One limitation of using primary data collection to document comorbidities is that study physicians, particularly disease specialists such as pneumologists, may not be aware of every comorbidity a patient suffers from. This may explain why GPs, who consider the overall health of the patient, generally documented more comorbidities than pneumologists in the primary data. It is also reflected in the better agreement between datasets for comorbidity data collected by GPs versus those collected by pneumologists, as indicated by the higher kappa values. On the other hand, claims data collection is also limited, as comorbidities may be more frequently reported because the prescription behavior of German physicians is evaluated by payers based on documented background diagnoses. However, since most patients who were classified as comorbid in claims data only were also given respective drug therapies, this limitation can only partially explain the differences.
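The agreement measure discussed above can be illustrated with a minimal sketch of Cohen's kappa for one binary comorbidity flag per linked patient, together with the interpretation bands cited from [12]. The function names and the example vectors are hypothetical and are not taken from the study data; the sketch also assumes that values falling between 0.20 and 0.21, which the cited bands leave unassigned, are treated as minimal agreement.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of patients with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance-expected agreement from the marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / n**2
    if expected == 1.0:
        # Degenerate case: both sources use a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Map kappa to the agreement bands quoted in the text (reference [12])."""
    if kappa <= 0.20:
        return "none"
    if kappa < 0.40:  # assumption: the gap between 0.20 and 0.21 counts as minimal
        return "minimal"
    if kappa < 0.60:
        return "weak"
    if kappa < 0.80:
        return "moderate"
    if kappa <= 0.90:
        return "strong"
    return "almost perfect"


# Hypothetical flags (1 = comorbidity documented) for ten linked patients.
primary = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
claims = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
kappa = cohen_kappa(primary, claims)
# Observed agreement is 0.80 and chance-expected agreement is 0.50,
# so kappa is approximately 0.60.
print(round(kappa, 2), interpret_kappa(kappa))
```

In this invented example the claims source flags two additional patients, which lowers kappa even though no patient is flagged in the primary data alone.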
This study also showed that only approximately a quarter (24.5%) of prescriptions for COPD therapies were recorded in both datasets, with a lower number of prescriptions generally observed in primary data. These findings could be due to incomplete data collection in the primary dataset. Incomplete data collection may arise when physicians other than the primary study physician prescribe COPD medications, thereby leading to lower reporting of prescription data in primary data collection in the absence of electronic medical records. Additionally, prescriptions found in primary data but not in claims data may be explained by unfilled prescriptions, indicating patients’ primary non-adherence. Therefore, study conclusions drawn from the analysis of primary data only could miss a substantial number of patient prescriptions, while studies based on claims data only cannot capture patient primary non-adherence, highlighting the potential benefits of using a linked dataset to more fully describe treatments received by study patients.
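As a rough illustration of how such an overlap share could be derived, the sketch below compares two sets of prescription records and reports the proportion found in both sources, in primary data only, and in claims data only. The matching key (patient identifier plus drug class) and all example records are hypothetical; the matching rules actually used in the study are not described in this section.

```python
def overlap_shares(primary_records, claims_records):
    """Share of distinct prescription records found in both or only one source."""
    primary_set, claims_set = set(primary_records), set(claims_records)
    all_records = primary_set | claims_set
    if not all_records:
        raise ValueError("at least one prescription record is required")
    n = len(all_records)
    return {
        "both": len(primary_set & claims_set) / n,
        # Candidates for primary non-adherence (prescribed but never dispensed).
        "primary_only": len(primary_set - claims_set) / n,
        # Candidates for prescriptions issued by non-study physicians.
        "claims_only": len(claims_set - primary_set) / n,
    }


# Hypothetical records keyed by (patient_id, drug_class).
primary_rx = [("p1", "LAMA"), ("p2", "LABA")]
claims_rx = [("p1", "LAMA"), ("p1", "ICS"), ("p2", "LABA/ICS"), ("p3", "LAMA")]
shares = overlap_shares(primary_rx, claims_rx)
print(shares)
```

The result depends strongly on how a match is defined: a stricter key (for example, adding the prescription date) would lower the share recorded in both sources.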
We also observed substantial differences in documented exacerbations (moderate and severe) between the two datasets. The higher number of documented exacerbations in the claims data may be due to specific features of the German outpatient coding system. In this system, physicians can “keep” a diagnosis in their practice software and re-document it at the next visit; on the other hand, the system only allows researchers to record a specific ICD-10 code once per quarter for every attending physician, which can potentially lead to lower reporting as the maximum number of events reported by quarter is one. As kappa coefficients were not calculated for the exacerbation results, this potential under-reporting should be stressed. However, specifics of the coding system cannot explain the substantial differences in the number of severe exacerbations (leading to hospitalizations) documented in the datasets. The lower number of moderate or severe exacerbations observed in the primary dataset compared with the claims dataset could be due to study physicians not being fully apprised of the patient’s pathways, treatments and hospitalizations that they did not themselves initiate or steer. In this regard, it is interesting to note the slightly higher agreement between datasets for exacerbations documented by pneumologists compared with those documented by GPs, possibly due to the higher awareness of disease-specific events among specialists. Overall, our findings indicate that data linkage between primary and secondary data sources could increase the validity of data when describing exacerbation events in patients with COPD.
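The effect of the once-per-quarter limit can be sketched as follows, under the assumption that outpatient diagnoses are visible to the analyst only as distinct combinations of patient, attending physician, ICD-10 code and calendar quarter. The record layout, the identifiers and the use of code J44.1 are illustrative assumptions, not a description of the actual claims database.

```python
from datetime import date


def calendar_quarter(d):
    """Return (year, quarter) for a date, with quarters numbered 1 to 4."""
    return (d.year, (d.month - 1) // 3 + 1)


def count_quarter_capped(events):
    """Count events when each (patient, physician, code) is visible once per quarter.

    events: iterable of (patient_id, physician_id, icd10_code, event_date).
    """
    visible = {
        (patient, physician, code, calendar_quarter(event_date))
        for patient, physician, code, event_date in events
    }
    return len(visible)


# Hypothetical outpatient exacerbation diagnoses for one patient.
events = [
    ("p1", "dr1", "J44.1", date(2015, 1, 10)),
    ("p1", "dr1", "J44.1", date(2015, 3, 20)),  # same physician, same quarter
    ("p1", "dr1", "J44.1", date(2015, 4, 2)),   # same physician, next quarter
    ("p1", "dr2", "J44.1", date(2015, 3, 25)),  # different physician
]
capped = count_quarter_capped(events)
print(len(events), capped)
```

Here four documented events collapse to three visible ones, showing how two exacerbations treated by the same physician within one quarter would be counted only once.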
Results from this study highlight some benefits and limitations of the data sources considered. One of the main advantages of primary data collection is that it allows the selection of data that are of direct interest to the researcher; however, it may be limited by a risk of lower reporting and incomplete data collection. On the other hand, claims data collection can give access to a greater amount of data, but may be limited by the selection of variables available [13]. The limitations of primary data collection are particularly apparent for information related to patient treatment provided by physicians or institutions that did not participate in the study. For example, the risk of lower reporting is likely to be higher for variables that can be measured and influenced by multiple healthcare professionals (e.g., GPs, different specialists and hospitals) such as drug treatment and exacerbations, whereas disease-related variables that require specific equipment or knowledge for physician assessment are less likely to be affected. This is also relevant to the interpretation of data stemming from registries, which are typically designed as primary prospective observational studies with specialists documenting registry data. Furthermore, patients who agree to participate in prospective data collection may not be representative of the wider patient population, and in this regard claims databases may provide higher external validity compared with primary datasets. However, claims data may themselves be limited by the availability of recorded information or the risk of higher reporting of specific items that are associated with positive reimbursement decisions (as noted above). For example, in German claims data, only data associated with the reimbursement of services (hospitalizations, prescriptions and outpatient treatments) are generally available [14], whereas disease-specific laboratory values and COPD-specific outcomes such as lung function, COPD symptoms or GOLD group are not captured. Therefore, using COPD as an example, a claims data-based study may describe patient characteristics, prescribed medications and outpatient/inpatient treatments, but may be unable to report important disease characteristics such as lung function, COPD Assessment Test (CAT) score, modified Medical Research Council (mMRC) score or laboratory values, unless these are available in the claims database considered.
This analysis showed that data linkage of primary and claims datasets can lead to data enrichment in certain situations, such as the analysis of drug prescriptions and the reporting of exacerbation events. For example, in this study, the claims dataset captured a higher number of recorded prescriptions compared with the primary dataset, as illustrated in Fig. 2. The primary dataset contains data recorded by only the study physicians, whereas the claims dataset also gives access to data recorded by physicians other than the primary study physician, thereby providing additional details which would not be captured in the primary data. The results show that linking primary and claims data for the recording of prescriptions can yield a more complete description of the data. The value of data linkage has also been demonstrated in other disease areas, for example, in a comparison between cancer registries and GP electronic health records in England [15]. Other studies have also highlighted the added value of data linkage with regard to improving disease identification [10]. However, data linkage could potentially introduce selection bias [10]. Further analysis of data from our study may provide insight into whether data linkage introduced such a bias in this example.
This study is not without limitations. Our conclusions are based on German data and therefore influenced by data capture methodology specific to Germany; other databases in other countries may identify additional benefits and limitations not covered by our observations. Additionally, the German healthcare system is characterized by a widespread network of outpatient specialists operating outside of hospital care, increasing the probability that patients visit specialists independently, unbeknown to their regular physicians (GPs or other specialists). Furthermore, inclusion of patients was based on slightly different criteria in each dataset. The main reasons for this were the need to also address further research questions such as general prevalence of COPD in Germany (not presented here), and unavailability of applied criteria in one or the other dataset. Nevertheless, as the content of this publication focuses on linked patients only, we do not expect this to have any impact on the presented analysis. In addition, not all variables collected in the primary data were documented in the claims dataset, for example spirometry measures, CAT score, mMRC score, and sociodemographic characteristics such as educational level and professional activity. Agreement between datasets with regard to these variables could therefore not be assessed.
Moreover, the observation periods for documenting comorbidities were different in the primary (any comorbidities known to the physician at linked dataset index date) versus the claims datasets (January 2010 to linked dataset index date). However, as the comorbidities analyzed here were chronic diseases, this discrepancy in reporting periods may not have overly influenced the differences observed between datasets. The study did not consider potential linking errors, including potential errors in recording of patients’ insurance numbers, which could have contributed to the differences observed between the datasets. Finally, the inherent weaknesses associated with primary data collection may have contributed to the observed lower reporting of comorbidities, prescriptions and exacerbations relative to claims data collection, despite the primary study being performed in accordance with all known guidelines for observational research, and including on-site visits and extensive data validity control.
In conclusion, this study highlights discrepancies between primary and claims data capture for this population of German patients with COPD. Primary data collection may be appropriate for studies primarily assessing information that is completely available at one study site (e.g., diagnostic data, especially when required equipment is only available at these sites) or when the risk of patient and study site selection bias is minimized by random or consecutive sampling. An analysis based on claims data may be effective in observational COPD research in situations where only the variables that are well covered in the claims data are of interest (such as costs, hospitalizations and outpatient prescriptions). In other situations, linking primary and secondary data sources for the same patient population could enrich data and may be a preferred choice to fully describe COPD endpoints.