Background
UK Biobank (UKB) is a prospective population-based cohort study with extensive phenotypic and genotypic information on >500,000 participants (www.ukbiobank.ac.uk). It is an open access resource, established to facilitate research into the determinants of a wide range of health outcomes [1]. Disease outcomes are ascertained primarily via linkages to routinely collected coded national administrative health datasets [2], enabling the identification of a broad range of disease phenotypes with sufficient accuracy for many research studies [2–4]. However, these coded data are often incomplete and less accurate when it comes to identifying specific disease subtypes [2, 3]. For example, up to 40% of participants with a stroke code in hospital, death record or primary care administrative data in UKB do not have a code specifying their stroke subtype, even though review of the full text medical records shows that a stroke subtype was known in over 99% of cases [2]. Further, among subtype specific codes, hemorrhagic stroke codes may have precision as low as 42% [2]. This will be a limitation for many researchers since stroke is a heterogeneous disease, and genetic and environmental risk factors to date have been found to be very subtype specific. Indeed, the International Stroke Genetics Consortium has already identified stroke subtyping as a top research priority [3]. Similarly, while coded data can be used to identify all-cause dementia, accuracy in identifying dementia subtypes, in particular vascular dementia, is much lower [4, 5]. This may be a limitation for researchers studying genetic and environmental associations specific to disease subtypes, and hence automated, scalable methods are urgently needed to improve disease subtyping.
Possible solutions to enhance the accuracy of coded data and improve the ability to deep-phenotype (e.g. subtype) all participants at scale include linkage to national disease-specific audit and registry datasets and/or the development of automated tools to extract data from participants’ detailed electronic medical records (EMR). While linkage to disease-specific datasets is promising, these data are limited to select diseases, may not cover all regions or nations of the UK, may cover limited time periods, do not always capture primary care and outpatient as well as inpatient encounters, and may have unknown accuracy. On the other hand, approaches relying on mining the complete EMR are limited by the challenges of data anonymization and of accessing the many different systems used by hospitals and other regional healthcare providers across the UK. For diseases that are diagnosed based on imaging, an alternative approach would be to access not the complete EMR but only the participants’ relevant clinical radiology reports. Since imaging reports are stored in more accessible and unified data repositories (including, in the UK, national Picture Archiving and Communications Systems in Scotland [6] and Wales [7] and seven regional imaging networks in England [8]) and contain far less text than the entire EMR, both research access and anonymization of these data are likely to be much less challenging.
Inferring disease subtypes from free text is challenging for computers, as it is usually beyond the scope of named entity recognition tasks. For example, inferring that the combination of the two entities “bleeding” and “intracerebral” signifies intracerebral hemorrhage (ICH) requires clinical knowledge. While deep learning methods have great potential to learn such associations, large datasets would be required to train them. At the same time, many disease subtypes are rare by nature, which is a limitation for supervised learning.
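To illustrate the kind of inference described above, the following is a minimal Python sketch in which a subtype is assigned only when a required combination of entities co-occurs in a report. The rule table, the entity names and the function `infer_subtype` are hypothetical and chosen purely for illustration; they are not the inference rules used in this study, and the sketch assumes that entities have already been extracted and that negated mentions have been removed by an upstream NLP step.

```python
from typing import Iterable, Optional

# Hypothetical rule table (illustrative only, not the study's rules).
# Each rule pairs a set of required entities with the subtype it implies.
# Rules are checked in order, so more specific rules should come first.
RULES = [
    ({"bleeding", "subarachnoid"}, "SAH"),
    ({"bleeding", "intracerebral"}, "ICH"),
    ({"infarct"}, "IS"),
]


def infer_subtype(entities: Iterable[str]) -> Optional[str]:
    """Return the first subtype whose required entities are all present.

    Assumes `entities` are affirmed (non-negated) mentions extracted
    from a single report by an upstream NLP component.
    """
    found = {entity.lower() for entity in entities}
    for required, subtype in RULES:
        if required <= found:  # every required entity was mentioned
            return subtype
    return None  # no rule fired: subtype cannot be inferred


if __name__ == "__main__":
    print(infer_subtype(["bleeding", "intracerebral"]))
    print(infer_subtype(["bleeding"]))
```

In this sketch, "bleeding" alone does not trigger any rule, which mirrors the point that neither entity on its own is sufficient to infer the subtype.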
By combining natural language processing (NLP) with clinical knowledge inference, this work aimed to investigate the feasibility and added value of automated methods applied to clinical radiology reports in ascertaining accurate disease subtype information for participants with any stroke code in a regional UKB subpopulation. We used stroke as an exemplar disease, specifically looking to improve hemorrhagic stroke identification. Stroke patients always require brain imaging to exclude alternative diagnoses and determine the stroke subtype, although ischemic stroke is not always visible on imaging done very soon after symptom onset [9].
Discussion
Our results demonstrate the feasibility and potential added value of using automated methods on clinical brain scan reports to improve stroke subtyping in UKB. While the automated method assigned a correct stroke subtype diagnosis to only 55% of cases overall, its main benefit came from markedly improving the precision of hemorrhagic stroke codes. As expected, ischemic stroke code accuracy remained similar. This approach of combining NLP and clinical knowledge inference is potentially scalable across the UK and may also scale well in other settings. It may also be relevant to subtyping other conditions in which information from images is important for diagnosing disease subtypes. Furthermore, the SemEHR tool used in this project can be easily adapted for research into other phenotypes by adopting transfer learning technologies [15].
Compared to coded data alone, for hemorrhagic stroke, the automated method improved precision at the expense of slightly poorer recall. Depending on the study design, more importance can be attributed to either estimate; however, the trade-off we achieved is likely to be preferable for many research studies. For ischemic stroke, the effect was the opposite: recall improved at the expense of lower precision. An additional caveat is that the true-positive ischemic stroke cases identified by the automated method are likely to differ from the true-positive cases it misses. This is because the cases identified will have had a stroke resulting in a visible lesion on the scan, and hence are likely to be more severely affected clinically. Therefore, for ischemic stroke, unless the automated method can achieve near-perfect recall, many research studies are likely to prefer using coded data to avoid this bias.
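For readers less familiar with these two estimates, the sketch below shows how per-subtype precision and recall can be computed by comparing assigned labels against reference diagnoses. The function name `precision_recall` and the toy labels are assumptions made for illustration; the data are invented and do not reproduce any result from this study.

```python
from typing import Sequence, Tuple


def precision_recall(
    predicted: Sequence[str], truth: Sequence[str], subtype: str
) -> Tuple[float, float]:
    """Compute precision and recall for one subtype.

    precision = TP / (TP + FP): of the cases labelled `subtype`, how many were correct.
    recall    = TP / (TP + FN): of the true `subtype` cases, how many were found.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    pairs = list(zip(predicted, truth))
    tp = sum(p == subtype and t == subtype for p, t in pairs)
    fp = sum(p == subtype and t != subtype for p, t in pairs)
    fn = sum(p != subtype and t == subtype for p, t in pairs)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


if __name__ == "__main__":
    # Invented toy labels, for illustration only.
    predicted = ["ICH", "ICH", "IS", "SAH"]
    truth = ["ICH", "IS", "IS", "ICH"]
    print(precision_recall(predicted, truth, "ICH"))
```

A method that labels fewer cases as hemorrhagic but is usually right when it does so will score high on precision and lower on recall, which is the direction of the trade-off described above for hemorrhagic stroke.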
As a substantial proportion of clinical features are only available in free text [16], NLP has been extensively studied and applied to extract clinical features from medical records [17–29]. Methodologies used range from rule-based approaches [23, 27] to machine learning approaches [17, 22, 24–26] to deep learning methods [18]. However, to date most of the work has focused on named entity recognition tasks, such as semantics in domain terminologies (e.g. ontology-driven inferences) [10, 28] and identification of contextual mentions (e.g. negation, temporality and the person to whom the information refers) [15, 18]. Very few studies [17] have investigated methods to help derive disease sub-phenotypes from free text, where the relevant information exists in the text but additional clinical knowledge is needed to derive the sub-phenotype. Our work addressed this gap by combining NLP with clinical knowledge inference. Advantages of this approach are that it does not require the very large datasets required to train machine learning methods, along with the potential both to transfer knowledge to new datasets from external sources and to apply the approach in other languages. This is currently a relatively understudied area, with very few sharable resources available.
Previous studies applying NLP and machine learning to classify stroke into subtypes have focused on automating ischemic stroke subtyping into specific sub-categories using the EMR [30, 31] or a selection of available features [32]. Others, such as the Edinburgh Information Extraction for Radiology reports (EdIE-R) [33], have shown good performance of text mining systems in subtyping already expert-validated stroke cases into the three main subtypes (IS, ICH and SAH) based on radiology scan reports. Our study differs from these in two main ways. Firstly, it is nested in a population-based cohort study rather than a disease specific cohort. We combine existing information from national administrative health datasets with automated methods by identifying participants with a high prior probability of having had a true-positive stroke diagnosis (represented by them having a stroke code in the administrative data) followed by the application of automated methods to subtype stroke into the three main types (IS, ICH and SAH). This approach means that the results are applicable to other population-based studies and large biobanks using administrative data for disease identification (e.g. the UK-based Generation Scotland [34] study and SAIL Dataset [35]). Secondly, we use expert stroke physician adjudications based on the complete EMR to derive ground truth diagnoses. This step is important because, while in a large number of cases the expert can reach the correct hemorrhagic stroke subtype diagnosis based only on the brain scan report, in a proportion of cases additional information from the complete EMR is required. One example is a case in which a patient’s brain scan report describes a brain hemorrhage that could be secondary to head injury (i.e. a traumatic hemorrhage, not a stroke) or could be a primary hemorrhage (i.e. a stroke); any mention in the EMR of a relevant traumatic event prior to symptom onset will help make the correct final diagnosis. We are not aware of any previous studies combining these two features in order to automate stroke subtyping.
Our results show that in large population-based cohorts, the ascertainment of cases via codes indicating stroke combined with subsequent automated methods applied to the free text of brain scan reports is a feasible and potentially scalable approach for enhancing the accuracy of stroke subtyping. Our primary approach was to first identify participants with a high prior probability of having had a stroke of any subtype (defined as participants with any stroke code in administrative datasets) and then apply automated methods to enhance the accuracy of specific subtype diagnosis (IS vs ICH vs SAH). We also explored the benefit of identifying participants with a high prior probability of having had a hemorrhagic stroke subtype (defined as participants with a hemorrhagic stroke specific code in administrative datasets) before applying automated methods. This improved the precision of SAH subtyping slightly, but would need to be validated in larger datasets.
The strengths of our study include the application and testing of existing methods on a real-world dataset. In addition, we tested the performance of the methods against robust ground-truth diagnoses made by specialist physicians based on the complete EMR. To maximize the reusability of our work, we deliberately decoupled the NLP component from the clinical knowledge inference component in our pipeline, so that the latter can be reused in different settings. We have also made the model and inference rules publicly available [13] to facilitate future similar studies by others. The imaging reports, however, are currently only available for UKB sub-cohorts via individual data linkage projects.
Our study also has some limitations. The relatively wide confidence intervals for precision and recall suggest high variability in these estimates, which could be due to the small sample size and heterogeneity of the sample, particularly for the SAH and ICH cases. Also, this work did not have a replication or external validation cohort for evaluating the pipeline. Furthermore, in our study, we included only participants’ first-ever stroke events, and their relevant clinical brain scans were selected manually by experts, whereas this step would also need to be automated to make the approach scalable in large datasets. We envisage this may involve including all brain scans within a certain timeframe from the stroke code. Finally, we did not apply a rule-based approach to tackle the issue of some participants having multiple brain scans with competing disease subtypes. Developing methods to address this may improve the performance of automated methods further.
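The link between small sample size and wide confidence intervals can be made concrete with a short sketch. The example below uses the Wilson score interval for a binomial proportion; this choice of interval is an assumption made for illustration and is not necessarily the method used to compute the intervals in this study. The function name `wilson_interval` and the counts are likewise hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Same observed proportion (0.5), different sample sizes (invented counts).
    print(wilson_interval(5, 10))
    print(wilson_interval(50, 100))
```

With the same observed proportion, the interval based on 10 cases is much wider than the one based on 100 cases, which is why estimates for the less common subtypes are expected to be the least stable.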
Further work to build on these results is now needed and should focus on: (1) validating our automated methods in further datasets, which could include additional UK Biobank sub-cohorts as well as data from other population-based cohorts; (2) investigating the time interval between the code(s) and clinical scan reports to enable inclusion of the most relevant data; (3) investigating the usefulness of the automated methods in identifying recurrent stroke events; (4) developing rules for disease subtype adjudication based on multiple reports per participant; and (5) expanding this work to investigate disease subtyping of other conditions beyond stroke.