Background
A note on information governance, data protection, and de-identified data
Main text
Overview: Use of self-reported and routinely collected care data
Self-reported methods
Electronic data sources of routinely collected data
Raw data extraction
NHS Digital and commissioning datasets
Name of database/ service software | Service category | Comments about the data | Data dictionary (Yes/No) |
---|---|---|---|
Primary care databases | |||
Clinical Practice Research Datalink (CPRD) | Primary care | Collects data from Vision (historically), EMIS (more recently), and potentially SystmOne (being piloted at time of writing) GP practice software systems. Reportedly covers over 11.3 million patients (4.4 million active patients) from 674 practices in the UK b – this was the figure reported for just Vision practices. | Yes (Read code based) |
The Health Improvement Network (THIN) database | Primary care | Collects data from Vision GP practice software systems. Reportedly covers 11.1 million patients (3.7 million active patients) from 562 general practices in the UK b | Yes (Read code based) |
ResearchOne | Primary care (and other contributing organisations – see ‘comments’) | Collects data from SystmOne GP practice software systems. Also reportedly collects data from other services using SystmOne. As of 2013, ResearchOne reportedly includes 5 million health records from 400 contributing organisations across 10 organisation types (ranging from hospitals to end-of-life organisations)b | Yes (Read code based) |
QResearch | Primary care | Collects data from EMIS GP practice software systems. Based on current publications associated with this dataset, it is unclear what resource-use data are available, as the dataset has not been used for economic studies. As of 2015, the database reportedly obtains data from a sample of approximately 1000 practices covering a population of 18 million peopleb | Yes (Read code based) |
NHS Digital databases | |||
Secondary Uses Service (SUS) | Healthcare data | Designed to provide anonymous patient-based data for purposes other than direct clinical care, such as healthcare planning, commissioning, public health, clinical audit and governance, benchmarking, performance improvement, medical research and national policy development. SUS will only provide data for the region of interest to the commissioners if obtained through NHS commissioners. SUS is updated once a month. | Yes (note, includes a variety of commissioning data)d |
Hospital Episode Statistics (HES) | Secondary care | Hospital care data (inpatient, outpatient, A&E, and critical care). Once a month and at pre-arranged dates during the year, SUS takes an extract from their database and sends it to HES – it is this data which populates the HES database. | Yes (online)e |
General Practice Extraction Service (GPES) | Primary care | GPES is part of the GP collection service alongside the Calculating Quality Reporting Service (CQRS), which is used to record practice participation and to process and display information. GPES collects primary care information from GP IT systems and then presents it at a national level. Used to inform GP payments. Collects both anonymised and person-identifiable data (PID; when permitted). Main focus is clinical data (i.e. Quality and Outcomes Framework [QOF] data), not resource-use data. | General overview of data that can be viewed in GPES is listed onlinef |
Diagnostic Imaging Dataset (DIDS) | Diagnostic | NHS-funded diagnostic imaging tests | Yes (online)g |
Improving Access to Psychological Therapies (IAPT) | Mental health | Adults in receipt of NHS-funded IAPT services (see data dictionary). | Yes (online)h |
Mental Health Services Data Set (MHSDS)b | Mental health | Record-level data on care of children, young people and adults who are in contact with mental health, learning disability or autism spectrum disorder services. | Yes (online)i |
Raw data extraction based on study by M Franklin, V Berdunov, et al (2014)a | |||
Patient Administration System (PAS) | Hospital | Hospitals collect data into PAS (in Sheffield this is the Lorenzo system, which is a well-established system in England). This dataset includes basic hospital activity (i.e. inpatient, outpatient, and A&E); other detailed clinical information may be held on other hospital systems. | See HES data dictionary for a general overview |
SystmOne, EMIS, Vision | Primary care | SystmOne, EMIS and Vision are currently the dominant primary care software systems in England. Each collects and records data slightly differently, but the underlying data are coded using Read Codes (most if not all should be using SNOMED CT by April 2018). Note that the methods used by Franklin et al (2014) did not rely on Read Codes but on front-end report outputs, which were processed using Visual Basic for Applications (VBA) scripts. | See ‘Read Codes and SNOMED CT’ |
Intermediate care, mental health trust, ambulance services, and social care systems | Various services | The study by M Franklin, V Berdunov et al (2014) collected raw electronic data from all these systems. It is possible to collect these data after discussion with the service and consent agreements from the patients of interest. | No – data based on discussion with services |
Technology for future consideration | |||
Read Codes and SNOMED CT | Primary care | Read codes can be obtained from the Technology Reference data Update Distribution (TRUD) website. SNOMED CT is a more unified coding base than current Read codes. Software systems have been developed to export information from primary care systems in a more usable manner, such as the Apollo software system. | Yes – a Read Browser and Read codes are required j |
GP Connect and Data Commissioning Flows | Primary care (initially) | Work on GP Connect and Data Commissioning Flows is in its early stages; it is difficult to gauge the possible benefits these plans will bring from a researcher perspective. | N/A |
Bespoke linked dataset examples | |||
NorthWest EHealth linked database | Linked datasets | Information on medications, symptoms and use of healthcare facilities. | Contact provider |
CALIBER dataset | Linked datasets | Linked data for primary care (CPRD), coded hospital records (HES), social deprivation information and cause-specific mortality data (ONS). | Contact provider |
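The linked datasets listed in the table combine records from separate sources through a shared person-level identifier. The sketch below illustrates that idea only in outline. It assumes hypothetical pseudonymised IDs, made-up records and placeholder unit costs; none of the names or values come from CPRD, HES, CALIBER or any published tariff, and real linkage is carried out by the data holders under the relevant governance arrangements.

```python
from collections import defaultdict

# Hypothetical primary care contacts: (pseudonymised ID, contact type)
primary_care = [
    ("P001", "gp_consultation"),
    ("P001", "gp_consultation"),
    ("P002", "nurse_consultation"),
]

# Hypothetical hospital episodes: (pseudonymised ID, episode type)
hospital = [
    ("P001", "outpatient_attendance"),
    ("P003", "inpatient_admission"),
]

# Placeholder unit costs in GBP (illustrative only, not real tariffs)
UNIT_COSTS = {
    "gp_consultation": 40.0,
    "nurse_consultation": 15.0,
    "outpatient_attendance": 150.0,
    "inpatient_admission": 2000.0,
}


def link_and_cost(*sources):
    """Sum the cost of every record per person across all sources."""
    totals = defaultdict(float)
    for source in sources:
        for person_id, item in source:
            totals[person_id] += UNIT_COSTS[item]
    return dict(totals)


costs = link_and_cost(primary_care, hospital)
for person_id in sorted(costs):
    print(person_id, costs[person_id])
```

A person who appears in only one source (P002 or P003 here) still receives a total, which is why a common identifier across services matters more than the completeness of any single dataset.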
Other large observational datasets: Primary care
Linked datasets
Efficient study designs using large observational datasets
Discussion
Aspects to consider | Self-reported | Electronic database |
---|---|---|
Access to person-level or record-level data | Data reported by the patient themselves (or a proxy on their behalf) are patient-level by definition. | Currently a major issue for electronic datasets. To those without advanced knowledge of large datasets, it is unclear whether person-level data can be obtained, and the IG aspects of obtaining these data are challenging for researchers. There may also be a restricted data flow of person-level data depending on the current stance of the data holders (e.g. NHS Digital) on what constitutes appropriate data protection policies. |
Service for which data are required | Essential for services with no electronic records; for example, travel, childcare, over-the-counter medications | All care services should operate an electronic administrative system from which data could be obtained; each system will only hold data on the care service it provides unless it is linked to another service (e.g. CPRD linked to HES; SystmOne central database). |
Practicality and cost | Pragmatic and cheap method which is well understood and largely under the control of the researcher | Large datasets often incur a cost and the researcher is bound by the time for data approval and extraction by the data holders. Raw data extraction can be time consuming and relatively costly compared with self-reported methods. |
Number of patients | Administratively burdensome for large numbers of patients | If a large dataset exists and contains some person-level identifier code (e.g. NHS number), then obtaining data for large patient numbers is possible. For raw data extraction, less practical for large numbers of patients unless a systematic method for data extraction is available (e.g. software system for data extraction). |
Validity of data | Known issues with validity of self-reported data, particularly problematic if differential between arms. Can be tested in a pilot phase. | Large databases have been known to validate their data; however, the extent to which these data are validated is not transparent, and validity for costing purposes may not have been tested. Raw data are complicated to validate. |
Time horizon for analysis | Loss to follow-up may be higher with a lengthy time horizon. Self-reported methods may work better for shorter time horizons (i.e. one questionnaire per 3-month period of interest). | Depends on time horizon of the database. Loss to follow-up can occur in large datasets and raw data depending on the database or service (e.g. a GP practice may change system, restricting its eligibility to provide data to particular primary care datasets). |
Patient group being analysed | Care may be needed with particular patient groups, for example those who lack capacity. | Different patient groups may use different services from which data may need to be obtained. Type of patient (e.g. cognitive ability) is not generally a concern. |
Type of costing exercise (e.g. top-down or micro-costing) | Can be tailored exactly to the type of costing exercise required but depends on knowledge of patient to provide the detail of care consumed. More time consuming collecting detailed information for micro-costing exercises. | Raw and large datasets can offer aggregated or very detailed information based on the level of data recording. Some data offered may still not be reliable for micro-costing (e.g. time with patient recorded in large databases such as CPRD). |
Recall bias | Problematic if differential recall errors exist systematically between arms of a trial | Recall bias is not an issue, but bias can still arise if data are not recorded accurately at the service level. |
Missing data | A known problem with self-report; can be minimised by following good practice | Missing data are not a ‘known’ issue: if data are missing, this is not easy to assess (i.e. it would be assumed there was no resource-use). There is some evidence of data missing from HES, but it would be difficult to assess the extent in a trial. |
Regional or national study | Data can be collected consistently across geographical areas | More detailed datasets are available regionally than nationally. National datasets depend on service uptake to provide electronic data. Raw data may be difficult to obtain electronically if there is no remote access to the software system (e.g. remote access is possible with SystmOne). |
International studies (outside of England) | Self-reported data are still necessary for many countries and in circumstances where electronic systems are not available or cannot provide the data required. | More countries are using electronic data provided by care services, commissioners, and insurance companies (to name a few sources). This is important to note when comparing analysis in England with other international studies. By comparison, this may limit the ability of studies based in England to perform the best possible analysis, which is desirable as part of research studies. |
All-cause or disease-specific assessment | Patients may struggle to correctly identify whether an event is related to their condition or not | A variety of codes (e.g. ICD-10 and OPCS-4 for in-hospital codes) and free text can be used to specify whether resource-use is associated with a condition. Primary care data have Read or SNOMED CT codes for specific conditions and diseases, although these codes are not always used appropriately. Free text is difficult to use. HES outpatient diagnosis codes are poorly completed. |
Baseline measurements | Additional burden on patient and very rarely collected. | Not an issue if the data are available for the baseline period of interest. |
Experience and familiarity | Relatively easy for a researcher to get up to speed with. Design for a clinical study may require knowledge of the clinical area to accurately collect the resource-use cost drivers. | For large datasets, requires a data requisition form to be completed, which is not always easily understood. For commissioning data, requires a contact with access to the data and a data requisition form to be completed. For raw data, requires knowledge of the service or identifying a person who can extract the data (i.e. trained researcher or practice nurse). |
Information Governance | Managed through standard ethics application methods. | IG is a major concern when using electronic data. This process can be navigated with expert guidance, although the developing world of electronic data will always be a concern for researchers. |
Social care data | Social care data could be self-reported and the exact type of social care data of interest could be specified within the questionnaire. | Routinely collected social care data are not discussed in this paper, but are an important aspect for future consideration. Healthcare systems are more usable for obtaining data relative to social care systems because of aspects such as the inclusion of unique identifiers (NHS number or other pseudo codes), relatively more standardised coded data, established national data dictionaries, and national software and system requirements. |
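The ‘Missing data’ row notes that an absent record in routinely collected data tends to be read as no resource-use. One way to make that assumption explicit is sketched below. The sketch assumes hypothetical participant IDs, contact counts and a made-up coverage flag (for example, whether the participant's practice still contributes to the dataset); these are illustrative names only and do not describe any particular database.

```python
# Hypothetical trial participants and their recorded contact counts
participants = ["P001", "P002", "P003"]
recorded_contacts = {"P001": 4}  # only P001 appears in the extract

# Hypothetical flag: is the participant's service still covered by the dataset?
covered_by_source = {"P001": True, "P002": True, "P003": False}


def contacts_for_analysis(ids, recorded, covered):
    """Separate assumed zero use from unknown use for each participant."""
    result = {}
    for pid in ids:
        if pid in recorded:
            result[pid] = recorded[pid]
        elif covered.get(pid, False):
            result[pid] = 0  # covered but no record: assumed zero use
        else:
            result[pid] = None  # not covered: use is unknown, not zero
    return result


analysis_counts = contacts_for_analysis(
    participants, recorded_contacts, covered_by_source
)
print(analysis_counts)
```

Keeping the unknown cases distinct from the zeros lets an analyst report how much of the apparent non-use rests on the assumption rather than on an observed record.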