Background
Prior work
Year | Author | NLP method (tool) | NLP category | Setting | Dataset | Performance |
---|---|---|---|---|---|---|
Current study | Seong et al. | Bi-LSTM-CRF, BioBERT | Deep learning-based NLP | Samsung Medical Center | 280,668 colonoscopy reports; Training and Test: 1,000–5,000; Embedding: 280,668 | F1 score: 0.9564–0.9862 |
2022 | Bae et al. [13] | SmartTA | Rule-based NLP (Commercial software) | Seoul National University Hospital | 54,562 colonoscopy reports and pathology reports; Training: 2,000; Test: 1,000 | Accuracy: 0.99–1.0 |
2021 | Vadyala et al. [41] | Bio-Bi-LSTM-CRF | Deep learning-based NLP | Veterans Affairs Medical Centers (VA) | 4,000 colonoscopy reports and pathology reports; Training: 3,200; Test: 400; Validation: 400 | F1 score: 0.85–0.964 |
2020 | Fevrier et al. [40] | SAS PERL regular expression | Rule-based NLP (Commercial software) | Kaiser Permanente Northern California (KPNC) | 401,566 colonoscopy reports and pathology reports; Training: 1,000; Validation: 3,000; Test: 397,566 | Cohen's κ: 0.93–0.99 |
2020 | Karwa et al. [12] | Prolog | Rule-based NLP (Logic programming language) | Cleveland Clinic | 2,439 colonoscopy reports; Validation: 263 | Accuracy: 1.0 |
2019 | Lee et al. [11] | Linguamatics I2E [42] | Rule-based NLP (Commercial software) | Kaiser Permanente Northern California (KPNC) | 500 colonoscopy reports; Validation: 300 | Accuracy: 0.893–1.0 |
2017 | Hong et al. [10] | SAS ECC [43] | Rule-based NLP (Commercial software) | Samsung Medical Center (SMC) | 49,450 colonoscopy reports and pathology reports | Precision: 0.9927; Recall: 0.9983 |
2017 | Carrell et al. [44] | HITEX [45] | Statistical NLP (Clinical NLP framework) | University of Pittsburgh Medical Center (UPMC) | 3,178 colonoscopy reports and 1,799 pathology reports; Training: 1,051; Validation: 2,127 | F-measure: 0.57–0.99 |
2015 | Raju et al. [46] | CAADRR | Rule-based NLP | MD Anderson | 12,748 colonoscopy reports and pathology reports; Validation: 343 | Positive predictive value: 0.913 |
2014 | Gawron et al. [47] | UIMA [48] | Statistical NLP (NLP framework) | Northwestern University | 34,998 colonoscopy reports and 10,186 pathology reports; Validation: 200 | F1 score: 0.81–0.95 |
2013–2015 | | cTAKES [50] | Statistical NLP (Clinical NLP framework) | Veterans Administration medical center | 42,569 colonoscopy reports and pathology reports; Training: 250; Test: 500 | Accuracy: 0.87–0.998 |
2011 | Harkema et al. [51] | GATE [52] | Statistical NLP (NLP framework) | University of Pittsburgh Medical Center (UPMC) | 453 colonoscopy reports and 226 pathology reports | Accuracy: 0.89 (0.62–1.0); F-measure: 0.74 (0.49–0.89); Cohen's κ: 0.62 (0.09–0.86) |
Objective
Methods
Data collection and text annotation
Year | Colonoscopy reports | Annotated reports |
---|---|---|
2000 | 2,620 | – |
2001 | 3,521 | – |
2002 | 4,196 | – |
2003 | 4,890 | – |
2004 | 5,299 | – |
2005 | 7,780 | – |
2006 | 9,525 | – |
2007 | 10,926 | – |
2008 | 17,108 | – |
2009 | 26,617 | – |
2010 | 30,387 | – |
2011 | 34,446 | 1,000 |
2012 | 32,441 | 1,000 |
2013 | 32,103 | 1,000 |
2014 | 34,156 | 1,000 |
2015 | 24,653 | 1,000 |
Total | 280,668 | 5,000 |
Data | For pre-trained word embedding | For training and test |
---|---|---|
Year | 2000–2015 | 2011–2015 |
Number of documents | 280,668 | 5,000 |
Number of sentences | 4,193,814 | 81,666 |
Number of types of words | 41,563 | 4,478 |
Items | Labels a |
---|---|
1. Patient information | |
1.1 Brief history (disease, family, etc.) | |
1.2 Indication/reason for endoscopy | |
2. Procedures | |
2.1 Sedation and other drugs | |
2.1.1 Sedation | SEDATION |
2.1.1.1 Level of sedation | SEDATIONLEVEL |
2.1.1.2 Medication | MEDICATION |
2.1.1.3 Dosage | DOSAGE |
2.1.2 Antispasmodics | ANTISPASMODICS |
2.2 Equipment (endoscope) used | DEVICE |
2.2.1 Extent of examination | EXTENT |
2.3 Quality of cleansing/visualization | PREPARATION |
2.4 Procedural time | |
2.4.1 Time-to-cecum | |
2.4.2 Withdrawal time | |
2.5 Digital rectal examination | DRE |
3. Colonoscopic findings | |
3.1 Lesions and their attributes | |
3.1.1 Lesion | LESION, NEGATION b |
3.1.2 Anatomical site | LOCATION |
3.1.3 Shape | SHAPE |
3.1.4 Color | COLOR |
3.1.5 Size | SIZE |
3.1.6 Number | NUMBER |
3.2 Sampling (type of sample) | BIOPSY |
3.3 Adverse intraprocedural events | |
4. Conclusion | |
Clinical information
Past (medical) Hx: AGC s/p STG B-II
Antithrombotics: No
Indication: Checkup
Procedure Note
Sedation: Yes [SEDATION]: midazolam [MEDICATION] 3 mg [DOSAGE] pethidine [MEDICATION] 50 mg [DOSAGE]
Level of sedation: moderate [SEDATIONLEVEL] (paradoxical response: no)
Antispasmodics (cimetropium 5 mg): Yes [ANTISPASMODICS]
Digital rectal examination was normal [DRE]
Bowel preparation was fair [PREPARATION]
The CF 260AI [DEVICE] was inserted up to the terminal ileum [EXTENT]
Colonoscopic finding
On the terminal ileum [LOCATION], several [NUMBER] erosions [LESION] and shallow [SHAPE] ulcer [LESION] were noticed
There were several [NUMBER] outpouching lesions [LESION] on the ascending colon [LOCATION].
On the distal descending colon [LOCATION], about 0.5 cm [SIZE] sized Is [SHAPE] polyp [LESION] was noticed. It was removed by cold biopsy.
On the rectum [LOCATION], AV 10 cm [LOCATION] about 0.3 cm [SIZE] sized Is [SHAPE] polyp [LESION] was noticed. It was removed by cold biopsy
biopsy + [BIOPSY]
Conclusion
1. Colon polyp, removed
2. Rectal polyp, removed
3. A-colon diverticulum
Comment
No immediate complication
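The bracketed tags in the example report above can be tallied mechanically. The following is a minimal sketch, assuming that every tag is an upper-case label name in square brackets, as in the example. The names `LABEL_PATTERN` and `count_labels` are hypothetical and not part of the study's pipeline. The sketch only counts tags; it does not recover span boundaries, because the example does not show where each annotated span begins.

```python
import re
from collections import Counter

# Assumption: a tag is an upper-case label name in square brackets, e.g. [DOSAGE].
LABEL_PATTERN = re.compile(r"\[([A-Z]+)\]")


def count_labels(annotated_text: str) -> Counter:
    """Count how often each bracketed label occurs in an annotated report."""
    return Counter(LABEL_PATTERN.findall(annotated_text))


# One line taken from the example report.
sample = (
    "Sedation: Yes [SEDATION]: midazolam [MEDICATION] 3 mg [DOSAGE] "
    "pethidine [MEDICATION] 50 mg [DOSAGE]"
)
counts = count_labels(sample)

for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```

Run over a whole annotated corpus, the same tally would give per-label counts of the kind reported in the label table, provided the tags are stored in this inline form.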
Dataset a | D1 | D2 | D3 | D4 | D5 |
---|---|---|---|---|---|
Number of documents | 1,000 | 2,000 | 3,000 | 4,000 | 5,000 |
Number of sentences | 16,417 | 32,821 | 49,048 | 65,279 | 81,668 |
Number of words | 92,315 | 184,928 | 277,266 | 369,063 | 461,713 |
Number of types of words | 2,001 | 2,771 | 3,410 | 3,922 | 4,478 |
Labels | D1 | D2 | D3 | D4 | D5 |
---|---|---|---|---|---|
PROCEDURE NOTE | |||||
SEDATION | 860 | 1,735 | 2,586 | 3,443 | 4,312 |
SEDATIONLEVEL | 679 | 1,361 | 2,027 | 2,706 | 3,404 |
MEDICATION | 871 | 1,778 | 2,659 | 3,566 | 4,500 |
DOSAGE | 872 | 1,781 | 2,663 | 3,576 | 4,515 |
ANTISPASMODICS | 799 | 1,620 | 2,408 | 3,215 | 4,032 |
DRE | 995 | 1,990 | 2,986 | 3,982 | 4,979 |
PREPARATION | 996 | 1,993 | 2,985 | 3,977 | 4,971 |
DEVICE | 999 | 2,000 | 2,997 | 3,994 | 4,992 |
EXTENT | 1,000 | 2,000 | 2,998 | 3,995 | 4,992 |
COLONOSCOPIC FINDINGS | |||||
LESION | 1,043 | 2,053 | 3,201 | 4,237 | 5,336 |
LOCATION | 1,118 | 2,269 | 3,481 | 4,599 | 5,757 |
SHAPE | 719 | 1,513 | 2,296 | 3,024 | 3,795 |
COLOR | 197 | 373 | 589 | 789 | 983 |
SIZE | 726 | 1,530 | 2,318 | 3,037 | 3,831 |
NUMBER | 219 | 416 | 639 | 853 | 1,052 |
BIOPSY | 995 | 1,993 | 2,991 | 3,987 | 4,984 |
NEGATION | 651 | 1,300 | 1,929 | 2,609 | 3,240 |
Total | 13,739 | 27,705 | 41,753 | 55,589 | 69,675 |
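As a quick consistency check, the per-label counts can be summed and compared with the Total row. The sketch below does this for the D1 and D5 columns only; the dictionary `label_counts` is a hypothetical name, and its values are copied from the table above.

```python
# Per-label counts copied from the table: (D1, D5).
label_counts = {
    "SEDATION": (860, 4312),
    "SEDATIONLEVEL": (679, 3404),
    "MEDICATION": (871, 4500),
    "DOSAGE": (872, 4515),
    "ANTISPASMODICS": (799, 4032),
    "DRE": (995, 4979),
    "PREPARATION": (996, 4971),
    "DEVICE": (999, 4992),
    "EXTENT": (1000, 4992),
    "LESION": (1043, 5336),
    "LOCATION": (1118, 5757),
    "SHAPE": (719, 3795),
    "COLOR": (197, 983),
    "SIZE": (726, 3831),
    "NUMBER": (219, 1052),
    "BIOPSY": (995, 4984),
    "NEGATION": (651, 3240),
}

d1_total = sum(d1 for d1, _ in label_counts.values())
d5_total = sum(d5 for _, d5 in label_counts.values())

# Totals reported in the last row of the table.
print(d1_total == 13739, d5_total == 69675)
```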
Model
Input and embedding layer
Bidirectional LSTM layer
BioBERT layer
CRF layer
Experiment
Comparison of LSTM and BioBERT variants
Applying pre-trained word embedding
Model | Loss function a & optimizer b | Precision c | Recall c | F1 score c |
---|---|---|---|---|
LSTM | CCE + ADAM | 0.5267 | 0.5297 | 0.5282 |
LSTM | CCE + NADAM | 0.5258 | 0.5285 | 0.5271 |
LSTM | CCE + RMSProp | 0.5266 | 0.5297 | 0.5281 |
LSTM | KL + ADAM | 0.5255 | 0.5286 | 0.5270 |
LSTM | KL + NADAM | 0.5258 | 0.5287 | 0.5273 |
LSTM | KL + RMSProp | 0.5260 | 0.5278 | 0.5269 |
LSTM | POISSON + ADAM | 0.5255 | 0.5274 | 0.5264 |
LSTM | POISSON + NADAM | 0.5245 | 0.5267 | 0.5256 |
LSTM | POISSON + RMSProp | 0.5229 | 0.5258 | 0.5244 |
Bi-LSTM | CCE + ADAM | 0.5880 | 0.6761 | 0.6290 |
Bi-LSTM | CCE + NADAM | 0.5971 | 0.7056 | 0.6460 |
Bi-LSTM | CCE + RMSProp | 0.5884 | 0.6763 | 0.6293 |
Bi-LSTM | KL + ADAM | 0.5881 | 0.6768 | 0.6294 |
Bi-LSTM | KL + NADAM | 0.5957 | 0.7039 | 0.6445 |
Bi-LSTM | KL + RMSProp | 0.5884 | 0.6767 | 0.6295 |
Bi-LSTM | POISSON + ADAM | 0.5873 | 0.6756 | 0.6284 |
Bi-LSTM | POISSON + NADAM | 0.5949 | 0.7021 | 0.6433 |
Bi-LSTM | POISSON + RMSProp | 0.5869 | 0.6758 | 0.6282 |
Bi-LSTM-CRF | CRF + ADAM | 0.9828 | 0.9842 | 0.9835 |
Bi-LSTM-CRF | CRF + NADAM | 0.9825 | 0.9851 | 0.9838 |
Bi-LSTM-CRF | CRF + RMSProp | 0.9844 | 0.9853 | 0.9848 |
BioBERT | CCE + ADAM | 0.9824 | 0.9821 | 0.9822 |
BioBERT-CRF | CRF + ADAM | 0.9810 | 0.9815 | 0.9812 |
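The F1 column can be cross-checked against the precision and recall columns. The sketch below assumes that F1 is the harmonic mean of precision and recall (the standard definition, which the table itself does not state) and recomputes it for three rows. The function name `f1_score` and the tolerance of 1e-4 are illustrative choices; the tolerance allows for the four-decimal rounding of the reported values.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "Bi-LSTM-CRF, CRF + ADAM": (0.9828, 0.9842, 0.9835),
    "Bi-LSTM-CRF, CRF + RMSProp": (0.9844, 0.9853, 0.9848),
    "BioBERT, CCE + ADAM": (0.9824, 0.9821, 0.9822),
}

# Absolute difference between the recomputed and the reported F1 score.
deviations = {
    name: abs(f1_score(p, r) - reported)
    for name, (p, r, reported) in rows.items()
}

for name, dev in deviations.items():
    status = "consistent" if dev < 1e-4 else "check"
    print(f"{name}: {status}")
```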
Comparison by the amount of data
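Labels | Bi-LSTM-CRF + one-hot encoding: Precision | Bi-LSTM-CRF + one-hot encoding: Recall | Bi-LSTM-CRF + one-hot encoding: F1 score | Bi-LSTM-CRF + pre-trained word embedding: Precision | Bi-LSTM-CRF + pre-trained word embedding: Recall | Bi-LSTM-CRF + pre-trained word embedding: F1 score |
---|---|---|---|---|---|---|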
PROCEDURE NOTE a | ||||||
SEDATION | 0.9881 | 0.9953 | 0.9916 | 0.9888 | 0.9950 | 0.9918 |
SEDATIONLEVEL | 0.9987 | 0.9938 | 0.9962 | 0.9985 | 0.9958 | 0.9971 |
MEDICATION | 0.9991 | 0.9954 | 0.9972 | 1 | 0.9959 | 0.9980 |
DOSAGE | 0.9929 | 0.9897 | 0.9913 | 0.9959 | 0.9920 | 0.9939 |
ANTISPASMODICS | 0.9962 | 1 | 0.9981 | 0.9978 | 1 | 0.9989 |
DRE | 0.9967 | 0.9990 | 0.9978 | 0.9958 | 0.9989 | 0.9973 |
PREPARATION | 0.9892 | 0.9914 | 0.9903 | 0.9879 | 0.9928 | 0.9904 |
DEVICE | 0.9991 | 0.9991 | 0.9991 | 0.9980 | 0.9979 | 0.9979 |
EXTENT | 0.9883 | 0.9951 | 0.9916 | 0.9960 | 0.9967 | 0.9963 |
COLONOSCOPIC FINDINGS b | ||||||
LESION | 0.9881 | 0.9953 | 0.9916 | 0.9888 | 0.9950 | 0.9918 |
LOCATION | 0.9987 | 0.9938 | 0.9962 | 0.9985 | 0.9958 | 0.9971 |
SHAPE | 0.9991 | 0.9954 | 0.9972 | 1 | 0.9959 | 0.9980 |
COLOR | 0.9929 | 0.9897 | 0.9913 | 0.9959 | 0.9920 | 0.9939 |
SIZE | 0.9962 | 1 | 0.9981 | 0.9978 | 1 | 0.9989 |
NUMBER | 0.9967 | 0.9990 | 0.9978 | 0.9958 | 0.9989 | 0.9973 |
BIOPSY | 0.9892 | 0.9914 | 0.9903 | 0.9879 | 0.9928 | 0.9904 |
NEGATION | 0.9991 | 0.9991 | 0.9991 | 0.9980 | 0.9979 | 0.9979 |
MICROAVG | 0.9883 | 0.9951 | 0.9916 | 0.9960 | 0.9967 | 0.9963 |
Results
Comparison of LSTM and BioBERT variants
Applying word embedding
Comparison by the amount of data
Labels | D1 | D2 | D3 | D4 | D5 |
---|---|---|---|---|---|
COLONOSCOPIC FINDINGS a | |||||
LESION | 0.9366 | 0.9453 | 0.9530 | 0.9530 | 0.9564 |
LOCATION | 0.9545 | 0.9627 | 0.9681 | 0.9711 | 0.9722 |
SHAPE | 0.9739 | 0.9782 | 0.9772 | 0.9797 | 0.9809 |
COLOR | 0.9653 | 0.9736 | 0.9749 | 0.9636 | 0.9720 |
SIZE | 0.9879 | 0.9874 | 0.9867 | 0.9875 | 0.9862 |
NUMBER | 0.9480 | 0.9653 | 0.9791 | 0.9713 | 0.9717 |
BIOPSY | 0.9975 | 0.9985 | 0.9988 | 0.9989 | 0.9992 |
NEGATION | 0.9772 | 0.9784 | 0.9845 | 0.9815 | 0.9858 |
MICROAVG | 0.9892 | 0.9912 | 0.9921 | 0.9921 | 0.9924 |