In this paper, we describe an NLP system that extracts measurements and their descriptors from radiology reports in a structured format. All of this information is necessary to track a lesion over time: a measurement alone is ambiguous, but a measurement together with its descriptors is sufficiently unique to be distinguished from the other measurements in a report, enabling lesion tracking. The recall and precision of our system for measurement extraction were 100% and 96.43%, respectively, which are reasonably good results for MR and CT reports. Among the 784 (97%) correctly extracted measurements, only 465 (58%) were fully matched with all of their related descriptors, owing to the very diverse and unstructured way lesions are referenced in text. However, to identify a measurement at any follow-up encounter, it must be distinguishable from the others based on all of its descriptors. As opposed to previous studies, in which measurements were defined as quantitative descriptions of other entities [23], in this study we treated measurements as core concepts and defined the other related entities as their descriptors. The errors we observed during the evaluation phase were primarily due to sentences that express several measurements together with their prior measurements and that describe the features of each measurement insufficiently, as in the example below.
Example Sentence
“…reduced size of scattered enhancing nodules on series 11, for example 4 mm left frontal nodule (image 127, previously 7 mm), 4 mm right frontal nodule (image 124, previously 9 mm), 3 mm right frontal nodule (image 114; previously 7 mm), 2 mm lateral right frontal nodule (image 118, previously 5 mm), 2 mm left cerebellar nodule (image 53, previously 6 mm)….”
In this sentence, there are 10 different measurements (5 current and 5 prior) and their descriptors (imaging observations, RadLex descriptors, and image numbers). Our system finds each measurement correctly with its temporality, image/series numbers, and laterality. On the other hand, it detected “scattered|RadLex_Descriptor” and “enhancing|Imaging_Observation” only for the first measurement, since that measurement is the closest and these modifiers are not repeated in the other sub-clauses. Therefore, we counted 9 of the 10 cases as “partial match,” which decreased our system’s performance. Problems of this kind arise because each measurement is not described separately; they might be addressed with specific rules, but such rules could also increase the number of false positives.
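The failure mode above can be illustrated with a minimal sketch of a nearest-measurement association rule. The function, entity layout, and positions below are hypothetical simplifications for illustration, not our system’s actual implementation.

```python
# Hypothetical sketch: attaching each modifier to the nearest measurement.
# When one sentence lists several measurements, shared modifiers such as
# "scattered" and "enhancing" appear only once, so a nearest-neighbor rule
# attaches them solely to the first measurement and the rest get none.

def associate_modifiers(entities):
    """Attach each modifier to the measurement closest to it by token position."""
    measurements = [e for e in entities if e["type"] == "Measurement"]
    modifiers = [e for e in entities if e["type"] != "Measurement"]
    links = {m["text"]: [] for m in measurements}
    for mod in modifiers:
        nearest = min(measurements, key=lambda m: abs(m["pos"] - mod["pos"]))
        links[nearest["text"]].append(mod["text"])
    return links

# Toy entity list mimicking the example sentence (positions are illustrative).
entities = [
    {"text": "scattered", "type": "RadLex_Descriptor", "pos": 3},
    {"text": "enhancing", "type": "Imaging_Observation", "pos": 4},
    {"text": "4 mm (1st nodule)", "type": "Measurement", "pos": 8},
    {"text": "4 mm (2nd nodule)", "type": "Measurement", "pos": 20},
]

links = associate_modifiers(entities)
# Both modifiers attach to the first measurement; the second gets none,
# which is exactly the "partial match" situation described above.
```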
One important goal of any information extraction task is to reveal the relations between concepts. However, relationship extraction is a granular task: a target measurement may have several related modifiers, and the task requires detailed relationship labels, whether for training a machine learning pipeline or for rule development. Moreover, as a unique challenge of radiology report parsing, the relationships between measurements and their characteristics are not explicitly signaled by adverbs or by relational and qualitative adjectives for any entity in our corpus except the “Measure_of” anatomical entity. Therefore, we tried to learn entity sequencing with CRF models in order to gain insight for associating modifiers with measurements via rules, rather than using the model directly for relationship extraction.
CRF is one of the most popular supervised machine learning algorithms for named entity tagging tasks. As a statistical machine learning method, CRF analyzes the data to infer rules and patterns and uses sequence labeling to model the relationships between neighboring tokens [24, 25]. In this study, we trained a CRF model to label the named entities of interest automatically. We also aimed to mine relationships between measurements and their descriptors using the model’s estimated transition probabilities, for example, that a measurement is most likely to be followed by an imaging observation or an anatomical entity. However, we observed that it is difficult to draw reliable conclusions about entity ordering from such a small training set. Therefore, we used the CRF model’s output only for the named entity tagging phase.
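As an illustration of the kind of signal we inspected, the sketch below estimates label transition probabilities directly from toy tagged sequences; a trained CRF exposes an analogous transition matrix. The label sequences and counts here are illustrative only, not drawn from our corpus.

```python
from collections import Counter, defaultdict

# Illustrative sketch (not the paper's actual model): estimating which
# entity label tends to follow another, the quantity a CRF's learned
# transition weights encode.

def transition_probs(sequences):
    """Return P(next label | previous label) estimated from label sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for prev, nxts in counts.items()
    }

# Toy tagged sentences: label sequences only (tokens omitted).
seqs = [
    ["Measurement", "Imaging_Observation", "Anatomical_Entity"],
    ["Measurement", "Anatomical_Entity"],
    ["Measurement", "Imaging_Observation"],
]
probs = transition_probs(seqs)
# Here a Measurement is followed by an Imaging_Observation 2/3 of the time,
# the kind of ordering evidence that is unstable with little training data.
```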
For the generalizability evaluation, we tested our system on 25 mammography reports, and 96% of measurements were extracted correctly with their modifiers. Although this performance was very high, it should be noted that in those reports no single sentence contains more than one measurement, and our system performs best on sentences with only one measurement; this pattern appears to be common in mammography reports. As future work, we plan to evaluate the pipeline on reports from other modalities.
The main limitation of this study was the small dataset from a single institution; in future experiments, we plan to increase the training and test set sizes with reports from multiple institutions, thereby increasing the generalizability of our system. Similarly, due to limited resources, we performed a “light annotation” [22] of the training set (1100 reports) for the CRF model by a single expert. On the other hand, the annotation of the test set was a completely manual effort, which we consider a valuable resource that we will use in future work for developing appropriate lesion tracking models. In the future, we also intend to adopt an attention-based convolutional neural network model for extracting relations between the entities. It should be noted that all of these methodologies require larger training sets, and manual annotation of training data is a very labor-intensive task.
Extracting measurements and their descriptors as a structured summary of the lesions in unstructured radiology reports could be quite valuable for lesion tracking. That information can be used to disambiguate lesions across studies and to identify the baseline and follow-up measurements of the same lesion. For example, if a lesion in the fifth segment of the liver is identified in the baseline study and identified again in the follow-up study, the anatomical entity and the segment number can be used to associate the two measurements as measurements of the same lesion. Moreover, historical references can be used to bind a measurement to the measurement of the same lesion in a prior study. This can help in generating automatic lesion tracking and tumor burden reports. In addition, another impact of automated text annotation such as ours is large-scale data labeling to train models that automate image interpretation.
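As a sketch of how the extracted descriptors could drive such disambiguation, the function below links two measurements when their key descriptors agree. The field names and the matching criteria are hypothetical simplifications, not our system’s actual schema or matching logic.

```python
# Hypothetical sketch: linking a follow-up measurement to its baseline by
# comparing extracted descriptors. Real matching would need to handle
# missing descriptors, synonyms, and multiple candidate lesions.

def same_lesion(baseline, followup):
    """Treat two measurements as the same lesion when key descriptors agree."""
    keys = ("anatomy", "laterality", "segment")
    return all(baseline.get(k) == followup.get(k) for k in keys)

# The liver example from the text: same anatomy and segment across studies.
baseline = {"size_mm": 12, "anatomy": "liver", "laterality": None, "segment": "V"}
followup = {"size_mm": 9, "anatomy": "liver", "laterality": None, "segment": "V"}
other = {"size_mm": 9, "anatomy": "liver", "laterality": None, "segment": "VII"}

matched = same_lesion(baseline, followup)   # descriptors agree -> same lesion
mismatched = same_lesion(baseline, other)   # different segment -> different lesion
```

Once a follow-up measurement is bound to its baseline in this way, the size change (here 12 mm to 9 mm) can feed an automatic lesion tracking or tumor burden report.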