Background
Emergency departments (EDs) are challenging research environments. Acutely and life-threateningly ill patients, high patient traffic, relatively short lengths of stay, round-the-clock operation, and symptom-based emergency care require specific adaptations to the patient recruitment process for health services research projects [1, 2]. Research based on primary data (i.e., data that are collected for a research-specific purpose) is cost-intensive and prone to biases during data collection that might impair the generalizability of results, but it is indispensable for certain research questions that rely on valid real-life data from health care settings [3, 4]. Secondary data (i.e., data that are produced by third parties for their own specific purpose) from hospital information systems (HIS) provide an easily accessible data source that includes an entire ED population [5]. However, these data are collected primarily for medical documentation and reimbursement and not for research purposes. Additionally, there is no uniform documentation standard and no standardized set of variables across the HIS data of different hospitals, at least in Germany [6].
The use of primary and secondary data is often discussed in terms of the advantages and disadvantages of both types [5, 7–10]. Primary data bear the risk of bias, of face validity problems in participants’ responses, and of non-response. For longitudinal studies, Roos et al. argue that poor representativeness is due to the complexity of the participant recruitment process, the circumstances of the ‘initial contact’, which often lead to non-participation, and the difficulty of maintaining contact for follow-up interviews [11]. Obtaining patient consent for studies linking primary and secondary data is challenging and affects the representativeness of studies [12, 13]. Other studies using primary data demonstrate the effort and complexity of the recruitment process and describe strategies for achieving representative samples [14–17]. Secondary data, on the other hand, have a lower risk of bias and often cover a high number of cases. Direct contact with participants for data collection is not necessary, and non-response as well as loss to follow-up are not as prevalent as in primary data collection [11]. However, secondary data bear the risk of deficiencies in data validity and quality, as these depend on complex coding processes and documentation discipline within the institution producing the data.
Whereas the advantages and disadvantages of both data types for research are well known and have been critically discussed, comparative analyses of corresponding primary and secondary data in the ED setting are rare. A review by Prada-Ramallal et al. of studies on drug effects using primary and secondary data showed that differences between data types were almost never addressed as a (possible) cause of heterogeneous study results [18]. However, a few studies showed differences in frequency distributions of specific outcomes when comparing primary and secondary data, e.g., concerning diagnoses and associated comorbidities [19] and cost estimates for primary care utilization [20].
Goals of this investigation
For this analysis, we compared three primary data study populations with the respective secondary data HIS populations regarding socio-structural characteristics (age, gender) and health- and care-related characteristics (triage category, transportation to ED, case and discharge type, multi-morbidity). Our research question was: What are the potential implications of using primary and secondary data for analyzing care within EDs? In addition, our analysis aimed to show which insights can be gained from comparing both data types and to derive methodological and practical suggestions from the experiences of this investigation.
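A comparison of this kind can be illustrated with a minimal sketch. It assumes a chi-square test of homogeneity for one categorical characteristic and adds Cramér's V as an effect size; neither the test nor the counts below are taken from this study, and the function name `compare_distributions` is hypothetical.

```python
from math import sqrt


def compare_distributions(primary, secondary):
    """Compare category counts of a primary and a secondary data sample.

    Both arguments map category labels to counts (hypothetical input format).
    Returns the chi-square statistic, its degrees of freedom, and Cramér's V.
    """
    # Keep only categories that occur in at least one of the two samples.
    categories = sorted(
        c for c in set(primary) | set(secondary)
        if primary.get(c, 0) + secondary.get(c, 0) > 0
    )
    rows = [
        [primary.get(c, 0) for c in categories],
        [secondary.get(c, 0) for c in categories],
    ]
    row_totals = [sum(r) for r in rows]
    if min(row_totals) == 0:
        raise ValueError("both samples must contain at least one case")
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            # Expected count if both samples shared the same distribution.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    dof = len(categories) - 1
    # For a 2 x k table, min(rows - 1, columns - 1) equals 1.
    cramers_v = sqrt(chi2 / n)
    return chi2, dof, cramers_v


# Made-up counts for case type, for illustration only.
primary_counts = {"ambulatory": 300, "inpatient": 200}
secondary_counts = {"ambulatory": 5000, "inpatient": 5000}
chi2, dof, v = compare_distributions(primary_counts, secondary_counts)
print(f"chi2={chi2:.2f}, dof={dof}, Cramér's V={v:.3f}")
```

With these made-up counts, the statistic exceeds 3.84, the 5% critical value for one degree of freedom, while Cramér's V stays below 0.1. This mirrors the situation in which a difference is statistically significant in a large sample but small in magnitude.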
Discussion
This contribution’s novelty lies in the comparison of primary and secondary data in the context of emergency medicine health services research and in its inclusion of three different patient populations and respective indications from eight EDs, which is unique so far. Mostly minor, although statistically significant, differences in the distributions of patient characteristics between primary and secondary data samples were found in most variables and for all three sub-studies. Age and gender distributions in study participants mostly reflected those of the secondary data sample, as was also reported in similar trial studies [31].
Differences in patient and case characteristics can be attributed to recruitment conditions, study-specific inclusion criteria, and the modus operandi of documentation in hospitals. One of the central reasons for the observed differences in data samples is the specific recruitment situation and process of the three sub-studies, as has also been argued by Roos et al. in the context of longitudinal studies [11]. The effects of recruitment practices on the composition of study populations in health care research are discussed broadly. Some trial studies investigated barriers to patient recruitment, such as migration background, language barriers, cognitive characteristics that make informed consent difficult, or the perceived lack of benefits for the patients [32–37]. The issue of recruitment barriers and their effects on the composition of study populations is certainly important when studies aim to achieve certain case numbers and response rates. Further factors identified as influencing recruitment were, e.g., certain communication channels (telephone vs. mail) [38], specific time points of recruitment [39], and the recruitment experience of study nurses [40].
Our analysis focused on patient characteristics and specific features of the recruitment situation. Generally, time restrictions in EMACROSS and EMASPOT did not generate meaningful differences in the distribution of characteristics between primary and secondary data samples. In EMAAGE, in contrast, patients were included retrospectively, so that the time of presentation to the ED was irrelevant in this sample. In cardiac patients, the prolonged stay on the Chest Pain Unit (CPU) of the ED may have helped to include, during working hours, patients who had initially presented at night. Given the very similar distributions of patient characteristics between samples, we conclude that restricting study recruitment to specific times of the day does not hamper the inclusion of a patient population similar to the target population. This finding appears to be a novel aspect of our contribution. Whether the feasibility of such comparisons and the seemingly negligible impact of time restrictions on recruitment can be generalized to other clinical settings is beyond the scope of this article. As the available literature suggests, recruitment and data collection depend heavily on the properties of the specific setting [41–43].
The effect of the recruitment process is particularly noticeable for participants with respiratory diseases in EMACROSS, whose characteristics differed more profoundly from those of the secondary data sample with respiratory diseases. Concerning the distribution of patients’ triage categories, severely and acutely ill patients in category 1 were less often included in the primary study sample. Recruiting patients who need immediate treatment for interviews of 30 to 60 minutes is not feasible for medical and ethical reasons. Differences between populations in triage categories can therefore be regarded as unavoidable. In EMACROSS, recruitment might have been additionally hampered by patients’ physical inability to complete an interview due to shortness of breath or respiratory therapy in the ED. We observed that older patients with respiratory complaints were more likely to be non-responders, which might have influenced the age distribution in the recruited population. As studies on hospice patients [31, 44, 45] or patients in stressful situations [46] have similarly argued, primary data collection might be inappropriate, or at least come with a higher share of non-response, if patients suffer from certain illnesses. The same reasoning generally applies to studying diseases that affect patients’ communication skills. Thus, if particularly severely ill patients are relevant to the research question, recourse to HIS data may be more appropriate.
Inclusion criteria and changes to them during the recruitment process are of particular relevance for the overall composition of a population. Participants in EMAAGE and EMASPOT reproduced the distribution of ambulatory and inpatient stays in the secondary data sample with respective diagnoses. The overrepresentation of ambulatory participants (and thus the associated surplus of younger and healthier patients) in EMACROSS is explained by the project’s initial inclusion criteria, which focused on outpatient ED patients. Participants of all sub-studies differed slightly from the secondary data samples with regard to discharge types. The difference in the number of deceased patients between EMACROSS participants and respiratory ED patients in general might be explained by the fact that this sub-study recruited mostly younger patients with ambulatory health care needs and rather average to low acuity as measured by triage categories [47].
In EMAAGE, we observed an overrepresentation of patients who were transferred to other health care facilities in the primary data sample. This might be due to the focus of the study personnel on patients’ final care arrangements documented in electronic patient files, while HIS data only capture the most immediate discharge type after hospital treatment, e.g., discharge home. This indicates the relevance of documentation routines and data production practices in patient surveys and routine data. This was even more obvious in the case of multi-morbidity, which is a generally complex variable [48]. Multi-morbidity was the variable with the most pronounced differences between the data sets in EMASPOT and EMAAGE, with higher rates of multi-morbid patients in the primary data sample. This might be explained by two aspects. First, the primary data set was tailored to detect certain comorbidities that are not systematically documented in ED diagnoses. Second, especially for ambulatory ED patients in the secondary data sample, only diagnoses relevant for ED treatment are documented in the HIS. Thus, comorbid conditions might not have been systematically documented by healthcare personnel, as other studies also pointed out [29]. The definition of multi-morbidity applied in this data sample depends on thorough ICD coding [29, 48, 49]. Thus, using ED diagnoses from HIS to determine patient multi-morbidity is potentially less suitable, since relevant diagnoses might be lacking and comorbidities are often recorded in the form of free text. Therefore, the prevalence of multi-morbidity in ambulatory ED patients might be underestimated. In primary data collection for research purposes, study personnel can not only ask patients themselves about relevant diagnoses, but also search the electronic patient file in the HIS for past hospital stays, physicians’ letters, and other sources of information. This argument is in line with reviews that have examined the construction of the variable multi-morbidity [50]. We complement the point made by Stirland et al. [50] by saying that the choice of data source is relevant and needs to be critically reflected on when doing research on complex variables like multi-morbidity. However, the failure to diagnose and document specific conditions in the ED, e.g., mental health disorders, is another relevant point when considering the reliability of both primary and secondary data concerning completeness of diagnoses [23, 49].
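The dependence of a code-based multi-morbidity variable on complete ICD coding can be illustrated with a minimal sketch. The mapping of ICD-10 three-character categories to condition groups and the threshold of two conditions are illustrative assumptions, not the definition applied in this study; the names `CHRONIC_GROUPS`, `chronic_conditions`, and `is_multimorbid` are hypothetical.

```python
# Illustrative mapping of ICD-10 three-character categories to chronic
# condition groups; a real definition would use a much longer, validated list.
CHRONIC_GROUPS = {
    "I10": "hypertension",
    "E11": "type 2 diabetes",
    "J44": "COPD",
    "I50": "heart failure",
    "F32": "depression",
}


def chronic_conditions(icd_codes):
    """Return the set of chronic condition groups found in a list of ICD-10 codes."""
    groups = set()
    for code in icd_codes:
        prefix = code.strip().upper()[:3]  # e.g. "J44.1" -> "J44"
        if prefix in CHRONIC_GROUPS:
            groups.add(CHRONIC_GROUPS[prefix])
    return groups


def is_multimorbid(icd_codes, threshold=2):
    """Flag a case as multi-morbid if it has at least `threshold` condition groups."""
    return len(chronic_conditions(icd_codes)) >= threshold


# Hypothetical ambulatory case: the HIS holds only the ED-relevant diagnosis,
# while study personnel also recorded comorbidities from the patient file.
his_codes = ["J44.1"]
primary_codes = his_codes + ["I10", "E11.9"]

print(is_multimorbid(his_codes))      # HIS view of the case
print(is_multimorbid(primary_codes))  # primary data view of the same case
```

In this sketch, the same patient is classified differently depending on the data source, because comorbidities that are missing from the coded ED diagnoses (or recorded only as free text) never reach the count.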
Finally, with regard to patients’ transport to the ED, inconclusive differences were observed in all transport categories between primary and secondary data populations. No content-related explanation for these differences was found, thus indicating that recruitment in our study failed to reproduce the actual pattern of patients’ transport to the ED.
Limitations
Our research combines comprehensive data of two types on three ED patient populations with common model diseases. Nevertheless, our analyses are subject to limitations. Firstly, samples of primary and secondary data were drawn from two different time periods for research-practical reasons and because of the availability of secondary data. This might have influenced absolute numbers in sample composition. However, no changes in the relative distribution of patient- and care-related characteristics in participating EDs should have occurred in the rather short period between 2016 and 2019, as there were no major changes in the prevalence of the studied (mostly chronic) diseases, in medical guidelines for ED treatment of these diseases, or in the structure or processes of ED care in Germany. Secondly, some variable categories in primary and secondary data sets originally differed and were thus harmonized retrospectively for comparative analyses, which might have introduced minor distortions of results. Thirdly, three variables (triage category, transportation to ED, and discharge type) showed high proportions of missing values, especially in secondary data, which might have influenced results. The high proportions of missing values in secondary data samples were mainly due to the fact that some EDs did not transmit data on certain variables, e.g., because this information was not collected in the respective ED at all (no mandatory field in documentation forms) or was not collected systematically for every ED patient. If the amount of missing secondary data from specific study EDs were systematically associated with the above-mentioned ED factors, the distribution of the respective variables in our analysis of secondary data might be biased. However, descriptive analysis of ED features and of the pattern of missing values per ED revealed no systematic bias in this direction, so that we can reasonably assume that missing values in our datasets occurred completely at random. The lower proportions of missing values in primary data for the same variables might point to the advantage of primary data collection by trained study nurses, who closely observed the care process of study participants during recruitment and manually retrieved information that was not readily available from all electronic documentation in the patient file. Fourthly, cases were identified differently in primary and secondary data (leading symptoms in primary recruitment and diagnoses in secondary data), which might have affected the comparability of populations. Lastly, the secondary data sample consists of cases from eight EDs, and the number of patients treated differs vastly between EDs. As documentation of variables was not harmonized prior to data extraction, systematic differences may occur.
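A descriptive check of missing values per ED, as mentioned above, could look like the following minimal sketch. The record layout (one dictionary per case with an `ed` key and `None` for missing values), the variable names, and the function name `missing_share_by_ed` are assumptions made for illustration; the sketch only describes the pattern of missing values and does not test formally whether they occurred completely at random.

```python
def missing_share_by_ed(records, variables):
    """Return, per ED, the share of cases with a missing value for each variable."""
    totals = {}   # number of cases per ED
    missing = {}  # number of missing values per ED and variable
    for rec in records:
        ed = rec["ed"]
        totals[ed] = totals.get(ed, 0) + 1
        per_ed = missing.setdefault(ed, {v: 0 for v in variables})
        for v in variables:
            # A variable counts as missing if it is absent or set to None.
            if rec.get(v) is None:
                per_ed[v] += 1
    return {
        ed: {v: missing[ed][v] / totals[ed] for v in variables}
        for ed in totals
    }


# Made-up records: ED "B" did not transmit triage categories at all.
records = [
    {"ed": "A", "triage": 2, "transport": "ambulance"},
    {"ed": "A", "triage": 3, "transport": None},
    {"ed": "B", "triage": None, "transport": "walk-in"},
    {"ed": "B", "triage": None, "transport": "ambulance"},
]
for ed, shares in missing_share_by_ed(records, ["triage", "transport"]).items():
    print(ed, shares)
```

Comparing such per-ED shares with ED features (e.g., size or documentation forms) is one way to see whether missing values cluster in particular types of EDs.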
Conclusions
Overall, this article shows that the comparison of patient populations in primary and secondary data samples can provide insights into the advantages and shortcomings of both data types for health services research in emergency medicine. Complete secondary data thus make it possible to assess and verify the composition of primary data samples, provided that the same study inclusion criteria are applied to both samples and the data sets are adjusted for comparison. Overall, differences between primary and secondary data samples are evident in our patient populations but comparatively small. Observed differences in patient characteristics of the primary data sample might have been influenced by recruitment practices (e.g., non-response, length and type of survey administration), project-specific inclusion criteria (e.g., language and cognitive requirements for study participation, focus on specific case types), and differing documentation rationales. Nevertheless, primary data allow a comprehensive and detailed collection of information on specific patient groups. The higher workload of patient recruitment and the resulting lower sample sizes in primary datasets may be disadvantages. In contrast, the secondary data sample depicts the full population of ED patients with respective diagnoses in a specific time frame, although this data type bears the risk of incomplete information due to missing values or non-usable data formats in HIS documentation.
The aim of health services research studies is to depict real-life conditions of health care provision in certain patient groups or settings. Future research studies with primary data collection should thus additionally establish close concomitant monitoring practices during patient recruitment in order to detect potential deviations from targeted sample characteristics in a timely manner. Additionally, our analysis has shown the need for systematic, harmonized, and complete secondary data documentation in hospital information systems for health services research purposes.
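Concomitant monitoring of recruitment could, for example, compare the running composition of the recruited sample with target shares derived from secondary data. The following minimal sketch assumes a simple absolute tolerance on the difference in shares; the tolerance, the target shares, the counts, and the function name `recruitment_deviations` are hypothetical and not part of this study.

```python
def recruitment_deviations(sample_counts, target_shares, tolerance=0.05):
    """Return categories whose share in the recruited sample deviates from
    the target share by more than `tolerance` (absolute difference)."""
    total = sum(sample_counts.values())
    if total == 0:
        return {}  # nothing recruited yet, nothing to compare
    flagged = {}
    for category, target in target_shares.items():
        observed = sample_counts.get(category, 0) / total
        difference = observed - target
        if abs(difference) > tolerance:
            # Negative values mean the category is underrepresented.
            flagged[category] = round(difference, 3)
    return flagged


# Made-up interim counts and target shares for triage categories.
recruited = {"triage 1": 2, "triage 2": 36, "triage 3": 62}
targets = {"triage 1": 0.10, "triage 2": 0.35, "triage 3": 0.55}
print(recruitment_deviations(recruited, targets))
```

Run at regular intervals during recruitment, such a check would flag, in this made-up example, the underrepresentation of category 1 patients early enough to decide whether it is avoidable or, as argued above for severely ill patients, inherent to the recruitment situation.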
Acknowledgements
The authors would like to thank the participating hospitals in the EMANET research network, namely Charité – Universitätsmedizin Berlin (Campus Charité Mitte and Campus Virchow-Klinikum), St. Hedwig Hospital (Alexianer Berlin St. Hedwig-Krankenhaus), Elisabeth Hospital (Evangelische Elisabeth Klinik der Paul-Gerhardt Diakonie), Franziskus Hospital (Franziskus-Krankenhaus Berlin), German Armed Forces Hospital Berlin (Bundeswehr Krankenhaus Berlin), German Red Cross Hospital Berlin-Mitte (DRK Kliniken Berlin-Mitte), and Jewish Hospital (Jüdisches Krankenhaus Berlin). The authors would like to thank all study participants as well as all EMANET researchers and study personnel responsible for study planning, patient recruitment, and data management.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.