Introduction
Materials and Methods
Ethics Approval
Sample Selection
Evaluating Information in Reports by Manual Classification
Status indicator (regression, stable, progression)
This indicated the final overall tumor status compared to prior studies. Tumor size was the primary determinant of status, unless another feature was highlighted as indicating a status change despite stable tumor size. In the absence of a specific reference to size, surrogate indicators (e.g. general statements on status, enhancement, signal intensity changes, mass effect, and presence of new lesions) were used to determine status.
Only changes from the most recent comparison study were considered. If a mix of ‘stable’ and another status (‘progress’ or ‘regress’) was present, then the net status change (‘progress’ or ‘regress’) was taken as the final status. If a mix of ‘progress’ and ‘regress’ statuses was present, the final status was classified as the worse (i.e. ‘progress’) status.
Magnitude indicator (mild, moderate, marked)
This indicated the qualitative extent of change, if any. The magnitude of change was classified as:
Mild if ‘mild’, ‘slight’, ‘minimal’, ‘somewhat’, ‘small amount’, ‘subtle’, ‘appears to be some’, ‘tiny’, ‘partial’, ‘slow growth’ or equivalent was used.
Moderate if ‘moderate’, ‘modest’, ‘some’ or equivalent was used. This was also the default classification if there was no specific mention of magnitude.
Marked if ‘marked’, ‘significant’, ‘resolved’, ‘resolution’, ‘clearly’, ‘considerable’, ‘substantially’, ‘pronounced’, ‘large amount’ or equivalent was used.
If several lesions with different change magnitudes were present, the greatest magnitude was chosen as the final magnitude.
Significance indicator (uncertain, possible, probable)
This indicated the subjective clinical significance of change, if any. The clinical significance was classified as:
Uncertain if ‘uncertain’, ‘slight’, ‘subtle’, ‘unclear’, ‘not entirely typical of’, ‘indeterminate’, ‘non-specific findings’, ‘cannot be excluded’ or equivalent was used. In the absence of specific significance indicators, a mild magnitude of change was tagged to an ‘uncertain’ significance.
Possible if ‘possible’, ‘suggestive of’, ‘somewhat’, ‘benign rather than malignant’, ‘more consistent with post-therapy changes rather than neoplasm’, ‘could reflect’, ‘may represent’, ‘continued observation to assess’, ‘follow-up imaging to evaluate’ or equivalent was used. This was also the default classification if there was no specific mention of significance and the magnitude of change was neither mild nor marked.
Probable if ‘probably represents’, ‘worrisome for’, ‘concern that this represents’, ‘suspicious’, ‘concerning for’, ‘consistent with’, ‘compatible with’, ‘findings suggest’, ‘presumably indicating’, ‘findings indicate’, ‘findings likely reflect’ or equivalent was used. In the absence of other specific significance indicators, a marked magnitude of change was tagged to a ‘probable’ significance.
If several lesions with similar status but differing significance were present, the greatest significance was chosen as the final significance.
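
The tie-breaking and default rules in the criteria above can be sketched in code. This is only an illustration, not the annotators' actual procedure or the NLP tool: the function names are hypothetical, the inputs are assumed to be phrases and per-finding statuses that were already extracted from a report, and matching is exact, so the "or equivalent" wording in the criteria is not covered.

```python
from typing import Dict, Iterable, List

# Phrase lists copied from the criteria above; the lookup approach is an assumption.
MAGNITUDE_TERMS: Dict[str, List[str]] = {
    "mild": ["mild", "slight", "minimal", "somewhat", "small amount", "subtle",
             "appears to be some", "tiny", "partial", "slow growth"],
    "moderate": ["moderate", "modest", "some"],
    "marked": ["marked", "significant", "resolved", "resolution", "clearly",
               "considerable", "substantially", "pronounced", "large amount"],
}
SIGNIFICANCE_TERMS: Dict[str, List[str]] = {
    "uncertain": ["uncertain", "slight", "subtle", "unclear",
                  "not entirely typical of", "indeterminate",
                  "non-specific findings", "cannot be excluded"],
    "possible": ["possible", "suggestive of", "somewhat",
                 "benign rather than malignant",
                 "more consistent with post-therapy changes rather than neoplasm",
                 "could reflect", "may represent",
                 "continued observation to assess", "follow-up imaging to evaluate"],
    "probable": ["probably represents", "worrisome for", "concern that this represents",
                 "suspicious", "concerning for", "consistent with", "compatible with",
                 "findings suggest", "presumably indicating", "findings indicate",
                 "findings likely reflect"],
}
MAGNITUDE_ORDER = ["mild", "moderate", "marked"]
SIGNIFICANCE_ORDER = ["uncertain", "possible", "probable"]
# Default significance when no significance phrase is found, keyed by magnitude.
DEFAULT_SIGNIFICANCE = {"mild": "uncertain", "moderate": "possible", "marked": "probable"}


def _invert(terms: Dict[str, List[str]]) -> Dict[str, str]:
    """Turn {level: [phrases]} into {phrase: level}."""
    return {phrase: level for level, phrases in terms.items() for phrase in phrases}


_MAGNITUDE_OF = _invert(MAGNITUDE_TERMS)
_SIGNIFICANCE_OF = _invert(SIGNIFICANCE_TERMS)


def final_status(statuses: Iterable[str]) -> str:
    """Combine per-finding statuses: any change beats 'stable', 'progress' beats 'regress'."""
    found = set(statuses)
    if not found or found - {"regress", "stable", "progress"}:
        raise ValueError(f"cannot combine statuses: {sorted(found)}")
    if "progress" in found:
        return "progress"
    if "regress" in found:
        return "regress"
    return "stable"


def final_magnitude(phrases: Iterable[str]) -> str:
    """Greatest magnitude among matched phrases; 'moderate' if none match."""
    levels = [_MAGNITUDE_OF[p] for p in phrases if p in _MAGNITUDE_OF]
    if not levels:
        return "moderate"
    return max(levels, key=MAGNITUDE_ORDER.index)


def final_significance(phrases: Iterable[str], magnitude: str) -> str:
    """Greatest significance among matched phrases; otherwise the magnitude-based default."""
    levels = [_SIGNIFICANCE_OF[p] for p in phrases if p in _SIGNIFICANCE_OF]
    if not levels:
        return DEFAULT_SIGNIFICANCE[magnitude]
    return max(levels, key=SIGNIFICANCE_ORDER.index)
```

For example, `final_status(["stable", "regress"])` gives `"regress"`, and `final_significance([], "marked")` falls back to `"probable"`. The sketch assumes magnitude and significance are only requested for reports that describe a change.
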
Developing an NLP-Based Data Extraction Tool
Discovering Tumor Status
Discovering Magnitude and Significance
Comparing Human and NLP Classification Outcomes
Results
Report characteristic | Reports: overall (N = 778) | Manual classification: classifiable (N = 772) | Manual classification: unclassifiable (N = 6) | Manual classification: p value | NLP tool development: training group (N = 541) | NLP tool development: testing group (N = 231) | NLP tool development: p value |
---|---|---|---|---|---|---|---|
Mean length (word count) | 109 | 109 | 122 | 0.579 | 110 | 107 | 0.509 |
Incidental findings | 49.4% | 49.5% | 33.3% | 0.686 | 50.6% | 46.8% | 0.322 |
Spelling errors | 11.6% | 11.5% | 16.7% | 0.523 | 11.3% | 12.1% | 0.736 |
Fusion words<sup>a</sup> | 14.8% | 14.8% | 16.7% | 1.000 | 15.7% | 12.6% | 0.258 |
Information in Unstructured Reports (Manual Classification)
Status indicator | Regression (N = 105) | Stable (N = 432) | Progression (N = 235) | Overall (N = 772) |
---|---|---|---|---|
Tumor size | 55 (52.4%) | 75 (17.4%)* | 112 (47.7%)** | 242 (31.3%) |
Surrogate indicator<sup>a</sup> | 50 (47.6%) | 360 (83.3%)* | 147 (62.6%)*,** | 557 (72.2%) |
Enhancement | 37 (35.2%) | 142 (32.9%) | 95 (40.4%) | 274 (35.5%) |
T1 signal change<sup>b</sup> | 4 (3.8%) | 21 (4.9%) | 13 (5.5%) | 38 (4.9%) |
T2 signal change | 16 (15.2%) | 85 (19.7%) | 54 (23.0%) | 155 (20.1%) |
Mass effect | 9 (8.6%) | 8 (1.9%)* | 12 (5.1%)** | 29 (3.8%) |
New lesion(s)<sup>c</sup> | 4 (3.8%) | 28 (6.5%) | 53 (22.6%)*,** | 85 (11.0%) |
Recurrent neoplasm<sup>c</sup> | 1 (1.0%) | 127 (29.4%)* | 4 (1.7%)** | 132 (17.1%) |
Residual neoplasm<sup>c</sup> | 2 (1.9%) | 132 (30.6%)* | 1 (0.4%)** | 135 (17.5%) |
General statement<sup>d</sup> | 5 (4.8%) | 273 (63.2%)* | 9 (3.8%)** | 287 (37.1%) |
Classification category | Kappa | Weighted kappa | Bowker’s test of symmetry |
---|---|---|---|
Intra-annotator agreement for human classification | |||
Status | 0.98 | 0.96 | P = 0.80 |
Magnitude<sup>a</sup> | 0.86 | 0.88 | P = 0.80 |
Significance<sup>b</sup> | 0.82 | 0.87 | P = 0.97 |
Agreement between NLP and human classification | |||
Status | 0.75 | 0.75 | P = 0.58 |
Magnitude<sup>a</sup> | 0.68 | 0.71 | P = 0.85 |
Significance<sup>b</sup> | 0.56 | 0.63 | P = 0.41 |
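
The kappa and weighted kappa statistics in the table above can be illustrated with a short calculation from a square agreement matrix. This is a minimal sketch, not the authors' analysis code: the function name and the example counts are hypothetical, and linear weights are assumed because the weighting scheme is not stated here.

```python
from typing import List


def cohen_kappa(matrix: List[List[int]], weighted: bool = False) -> float:
    """Cohen's kappa for a k x k agreement matrix (rows: rater A, columns: rater B).

    With weighted=True, linear agreement weights 1 - |i - j| / (k - 1) are used,
    so disagreements between adjacent ordinal categories are penalised less.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    def weight(i: int, j: int) -> float:
        if weighted:
            return 1.0 - abs(i - j) / (k - 1)
        return 1.0 if i == j else 0.0

    # Observed (weighted) agreement.
    observed = sum(weight(i, j) * matrix[i][j]
                   for i in range(k) for j in range(k)) / n
    # Agreement expected by chance from the marginal totals.
    expected = sum(weight(i, j) * row_totals[i] * col_totals[j]
                   for i in range(k) for j in range(k)) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical counts for an ordinal scale such as mild / moderate / marked.
example = [[10, 2, 0],
           [2, 10, 2],
           [0, 2, 10]]
print(round(cohen_kappa(example), 2), round(cohen_kappa(example, weighted=True), 2))
```

When all disagreements fall between adjacent categories, as in the hypothetical matrix, the weighted value is higher than the unweighted one. The magnitude and significance rows of the table show the same direction of difference.
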
Comparison of NLP and Human Classification Outcomes
Category | Sensitivity (95% CI) | Specificity (95% CI) | PPV (95% CI) | NPV (95% CI) |
---|---|---|---|---|
Status | ||||
Regress | 64.5 (46.9–78.9) | 96.5 (93.0–98.3) | 74.1 (55.3–86.8) | 94.6 (90.6–97.0) |
Stable | 89.9 (83.5–94.0) | 85.3 (77.1–90.9) | 88.6 (82.0–92.9) | 87.0 (79.0–92.2) |
Progress | 87.3 (77.6–93.2) | 93.1 (88.1–96.1) | 84.9 (75.0–91.4) | 94.3 (89.5–97.0) |
Mean | 80.6 | 91.6 | 82.4 | 92.0 |
Magnitude<sup>a</sup> | ||||
Mild | 85.7 (68.5–94.3) | 90.7 (80.1–96.0) | 82.8 (65.5–92.4) | 92.5 (82.1–97.0) |
Moderate | 80.9 (67.5–89.6) | 82.9 (67.3–91.9) | 86.4 (73.3–93.6) | 76.3 (60.8–87.0) |
Marked | 71.4 (35.9–91.8) | 94.7 (87.1–97.9) | 55.6 (44.4–73.3) | 97.3 (90.5–99.2) |
Mean | 79.3 | 89.4 | 74.9 | 88.7 |
Significance<sup>b</sup> | ||||
Uncertain | 53.8 (29.1–76.8) | 85.2 (73.4–92.3) | 46.7 (24.8–69.9) | 88.5 (77.0–94.6) |
Possible | 70.4 (51.5–84.1) | 75.0 (59.8–85.8) | 65.5 (47.3–80.1) | 78.9 (63.7–88.9) |
Probable | 81.5 (63.3–91.8) | 97.5 (87.1–99.6) | 95.7 (79.0–99.2) | 88.6 (76.0–95.0) |
Mean | 68.6 | 85.9 | 69.3 | 85.3 |
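
Per-category sensitivity, specificity, PPV and NPV for a three-level classification are typically obtained by treating each category in turn as "positive" and the other two as "negative". The sketch below shows that one-vs-rest calculation. It assumes that rows hold the human (reference) labels and columns hold the NLP labels, that every category occurs at least once in both, and the function names and counts are hypothetical rather than taken from the study.

```python
from typing import Dict, List


def one_vs_rest_metrics(matrix: List[List[int]],
                        labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Sensitivity, specificity, PPV and NPV per category from a confusion matrix.

    matrix[i][j] = number of reports with reference label i and predicted label j.
    """
    total = sum(sum(row) for row in matrix)
    results: Dict[str, Dict[str, float]] = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                  # reference k, predicted otherwise
        fp = sum(row[k] for row in matrix) - tp   # predicted k, reference otherwise
        tn = total - tp - fn - fp
        results[label] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp),
            "npv": tn / (tn + fn),
        }
    return results


def macro_mean(results: Dict[str, Dict[str, float]], metric: str) -> float:
    """Unweighted mean of one metric across categories."""
    return sum(r[metric] for r in results.values()) / len(results)


# Hypothetical counts, ordered regress / stable / progress.
counts = [[8, 2, 0],
          [1, 17, 2],
          [0, 1, 9]]
metrics = one_vs_rest_metrics(counts, ["regress", "stable", "progress"])
print(metrics["regress"]["sensitivity"], macro_mean(metrics, "sensitivity"))
```

The "Mean" rows in the table are consistent with an unweighted average of the three per-category values. For status sensitivity, (64.5 + 89.9 + 87.3) / 3 ≈ 80.6.
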
Report feature (by NLP classification outcome) | Status: + (N = 198) | Status: − (N = 33) | Status: p value | Magnitude: + (N = 67) | Magnitude: − (N = 48) | Magnitude: p value | Significance: + (N = 48) | Significance: − (N = 67) | Significance: p value |
---|---|---|---|---|---|---|---|---|---|
Average report length | 104 | 125 | 0.029* | 127 | 130 | 0.794 | 125 | 131 | 0.599 |
Incidental findings | 44.4% | 60.6% | 0.085 | 46.3% | 52.1% | 0.538 | 45.8% | 50.8% | 0.603 |
Spelling errors | 11.1% | 18.2% | 0.249 | 10.5% | 14.6% | 0.504 | 6.3% | 16.4% | 0.148 |
Fusion words<sup>a</sup> | 12.6% | 12.1% | 1.000 | 11.9% | 12.5% | 0.928 | 14.6% | 10.5% | 0.504 |
Tumor size | 30.3% | 30.3% | 1.000 | 53.7% | 35.4% | 0.052 | 58.3% | 37.3% | 0.0257* |
Surrogate indicator<sup>b</sup> | 71.7% | 72.7% | 0.905 | 49.3% | 70.8% | 0.021* | 45.8% | 67.2% | 0.0222* |
Enhancement | 32.8% | 48.5% | 0.081 | 28.4% | 52.1% | 0.0098* | 25.0% | 47.8% | 0.0133* |
T1 signal change<sup>c</sup> | 4.6% | 6.1% | 0.660 | 4.5% | 4.2% | 1.000 | 6.3% | 3.0% | 0.648 |
T2 signal change | 21.7% | 27.3% | 0.479 | 22.4% | 22.9% | 0.947 | 22.9% | 22.4% | 0.947 |
Mass effect | 3.5% | 6.1% | 0.620 | 6.0% | 4.2% | 1.000 | 6.3% | 4.5% | 0.693 |
New lesion(s)<sup>d</sup> | 13.1% | 3.0% | 0.141 | 20.9% | 12.5% | 0.242 | 20.8% | 14.9% | 0.410 |
Recurrent neoplasm<sup>d</sup> | 15.7% | 12.1% | 0.794 | 0.0% | 8.3% | 0.0281* | 0.0% | 6.0% | 0.139 |
Residual neoplasm<sup>d</sup> | 17.7% | 15.2% | 1.000 | 0.0% | 10.4% | 0.0112* | 0.0% | 7.5% | 0.074 |
General statement<sup>e</sup> | 41.9% | 9.1% | 0.0002* | 1.5% | 8.3% | 0.159 | 2.1% | 6.0% | 0.399 |