Background
-
To retrieve the Chinese terms of clinical findings (CTCFs) used in daily practice from physician notes in the CIS.
-
To map the CTCFs to the corresponding SNOMED CT (SCT) concepts, which supports the enrichment of SNOMED CT with Chinese synonyms (SCCSE).
-
To recognize the CTCFs in Chinese physician notes in CIS.
Methods
Enriching the SNOMED CT with Chinese synonyms
Initial CTCF dictionary creation
Key word group | Function | Examples |
---|---|---|
Time Period | Matches the time-duration phrase. | 年(year),月(month),周(week),天(day),小时(hour)… |
Number | Matches the numbers used to describe a time duration. | 1,2,3,4,5,6,7,8,9,0,半(half),一(one),二(two),三(three),+(more than)… |
Modifier in CTCF | Modifiers embedded within a CTCF that can be ignored in the recognition task. | 持续(constantly),逐渐(gradually),明显(obviously),稍显(slightly),反复(recurrently)… |
Exception with context | Phrases that, although matched, are invalid in context. | 抗乙肝药物(anti-HBV drugs),多尿期(the polyuria stage),最高血压(the highest blood pressure)… |
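As an illustration of how the Time Period and Number groups could work together, the following minimal sketch builds a regular expression from the example entries listed in the table. Only the listed entries are used (the elided ones are not filled in), and the names `DURATION_RE` and `find_durations` are hypothetical; the original work may have matched durations differently.

```python
import re

# Time Period group: only the units listed in the table.
TIME_UNITS = ["小时", "年", "月", "周", "天"]

# Number group: digits plus the listed characters 半, 一, 二, 三 and "+".
NUMBER_CLASS = "[0-9半一二三+]"

# One or more number characters followed by a time unit, e.g. "3天" or "半年".
DURATION_RE = re.compile(NUMBER_CLASS + "+(?:" + "|".join(TIME_UNITS) + ")")


def find_durations(clause):
    """Return every time-duration phrase found in a short clause."""
    return DURATION_RE.findall(clause)


print(find_durations("咳嗽3天"))
print(find_durations("发热2+月"))
```

Keeping the groups as plain lists makes it straightforward for an expert to extend them without touching the matching logic.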
CTCF categorization and concept mapping
-
The Chinese SNOMED III. It contains 145,856 descriptions, each of which carries a concept ID that links to SCT. The descriptions in the hierarchies F (Function) and D (Diseases/Diagnoses), 44,862 in total, were considered rCTCFs.
-
The Chinese ICD-10. It contains 28,668 descriptions, which can be cross-mapped to SCT. Besides the classification of diseases, Chapter R covers symptoms, signs, and abnormal clinical and laboratory findings. All descriptions in the Chinese ICD-10 were considered rCTCFs.
CTCF recognition task in clinical text
The rule- and terminology-based approach for CTCF recognition
-
Each sentence in the HPI was segmented into short clauses at every type of punctuation except the caesura sign (、). Within each short clause, RMM was used to recognize CTCFs. Because the longest CTCF contains 16 characters, a matching window of the same length (16 characters) was adopted.
-
Some CTCFs, although matched by RMM, were invalid in context. For example, the CTCF ‘多尿’ (polyuria) is a substring of the phrase ‘多尿期’ (the polyuria stage) and should not be recognized as a valid CTCF in that context. A hand-built key word group, ‘Exception with context’ (Table 1), was created to solve this problem.
-
To keep the language concise, some CTCFs are written in aggregated form. An aggregated CTCF combines several independent CTCFs that share the same prefix or suffix, with the shared part written only once. For example, the phrase ‘颈肩部麻木’ (the neck and shoulder are numb) is an aggregated CTCF that actually consists of two individual CTCFs, ‘颈部麻木’ (the neck is numb) and ‘肩部麻木’ (the shoulder is numb). A sub-procedure in RTBA detected these aggregations: once a CTCF was matched by RMM, the sub-procedure scanned the neighboring characters for other potential prefixes or suffixes. The sub-procedure was:
a) During CTCF dictionary creation, each CTCF was given a corresponding group of ‘potential prefixes’ and ‘potential suffixes’. For example, if two CTCFs shared a prefix (or suffix), the remaining part of each was added to the other's group of potential suffixes (or prefixes).
b) Once a CTCF was matched by RMM, each of its potential prefixes (or suffixes) was tried against the neighboring characters. On success, the occurrence count of that potential prefix (or suffix) was increased by 1.
c) Finally, the potential prefixes (or suffixes) that occurred frequently with each particular CTCF were reviewed by experts, who judged whether the aggregations were acceptable.
-
Judging whether a CTCF expresses a negative meaning is handled in a separate experiment. CTCFs with a negative meaning are mostly signaled by a limited group of Chinese key words such as ‘不伴,无,否认,未,消失,不明显’, which broadly mean ‘without’.
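The matching step and the ‘Exception with context’ check described above can be sketched as follows. This is a minimal illustration that assumes RMM denotes reverse maximum matching (scanning each clause from its end and preferring the longest dictionary entry); the function names, the toy dictionary, and the way exceptions are applied (discarding a match that lies inside an exception phrase) are assumptions, not the authors' implementation. Aggregation detection and negation are left out.

```python
MAX_LEN = 16  # length of the longest CTCF, as stated in the text


def exception_spans(clause, exceptions):
    """Locate every occurrence of an exception phrase as a (start, end) span."""
    spans = []
    for phrase in exceptions:
        start = clause.find(phrase)
        while start != -1:
            spans.append((start, start + len(phrase)))
            start = clause.find(phrase, start + 1)
    return spans


def rmm_recognize(clause, dictionary, exceptions=()):
    """Reverse maximum matching over one short clause."""
    blocked = exception_spans(clause, exceptions)
    found = []
    end = len(clause)
    while end > 0:
        match_start = None
        # Try the longest candidate ending at `end`, then shrink it from the left.
        for start in range(max(0, end - MAX_LEN), end):
            if clause[start:end] in dictionary:
                match_start = start
                break
        if match_start is None:
            end -= 1  # no entry ends here; move one character to the left
            continue
        # Discard a match that lies entirely inside an exception phrase.
        if not any(s <= match_start and end <= e for s, e in blocked):
            found.append(clause[match_start:end])
        end = match_start
    found.reverse()  # restore left-to-right order
    return found


# Toy dictionary and exception list (hypothetical, for illustration only).
ctcf_dictionary = {"多尿", "口干", "麻木", "颈部麻木"}
context_exceptions = {"多尿期"}

print(rmm_recognize("口干多尿", ctcf_dictionary, context_exceptions))
print(rmm_recognize("进入多尿期", ctcf_dictionary, context_exceptions))
```

Storing the dictionary as a set keeps each lookup constant-time, so the cost per clause is bounded by the clause length times the 16-character window.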
Conditional random fields for CTCF recognition
-
Context feature: The characters around the target character are important for the CRF. The neighboring characters may provide useful information and help in predicting the correct tag for each token. The context window size (CWS) is the number of characters around the target character, in addition to the character itself, that are used as features for predicting the output tags.
-
N-gram feature: Chinese character unigrams (U), bigrams (B), and trigrams (T) cover most of the meaningful words in clinical text. Because Chinese has many long fixed expressions, the quadgram (Q) was also taken into account.
-
Stop character feature: Some characters that occur frequently in clinical text never appear in any valid CTCF.
-
Grammatical feature: ICTCLAS was used for tokenization and for producing the grammatical features. The POS tag of each word may help determine the CTCF boundaries.
-
Associative strength feature: It is common practice to classify words by their co-occurrence with other words [26]. We used the chi-square statistic to test the association between two adjacent words, based on the clinical corpus of 218 million characters. In this research, the significance level was 0.05 and the critical value was 3.841. Two adjacent words were considered combined only if χ² > 3.841.
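The chi-square test for two adjacent words can be illustrated with a 2×2 contingency table of bigram counts: n11 counts the pair (w1, w2), n12 counts w1 followed by another word, n21 counts another word followed by w2, and n22 counts all remaining bigrams. The sketch below uses the standard formula for a 2×2 table; the function names and the toy corpus are hypothetical, and the paper does not state exactly how its counts were collected.

```python
from collections import Counter

CRITICAL_VALUE = 3.841  # chi-square critical value at alpha = 0.05, 1 degree of freedom


def chi_square(n11, n12, n21, n22):
    """Chi-square statistic of a 2x2 contingency table."""
    n = n11 + n12 + n21 + n22
    denom = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    if denom == 0:
        return 0.0  # a zero marginal total leaves the statistic undefined
    return n * (n11 * n22 - n12 * n21) ** 2 / denom


def adjacent_association(sentences, w1, w2):
    """Chi-square for w1 immediately followed by w2 over tokenized sentences."""
    bigrams = Counter()
    for words in sentences:
        bigrams.update(zip(words, words[1:]))
    total = sum(bigrams.values())
    n11 = bigrams[(w1, w2)]
    n12 = sum(c for (a, _), c in bigrams.items() if a == w1) - n11
    n21 = sum(c for (_, b), c in bigrams.items() if b == w2) - n11
    n22 = total - n11 - n12 - n21
    return chi_square(n11, n12, n21, n22)


def should_combine(sentences, w1, w2):
    """Apply the threshold quoted in the text."""
    return adjacent_association(sentences, w1, w2) > CRITICAL_VALUE


# Toy tokenized corpus (hypothetical).
corpus = [["胸", "痛"], ["胸", "痛"], ["腹", "泻"], ["腹", "泻"]]
print(adjacent_association(corpus, "胸", "痛"))
print(should_combine(corpus, "胸", "痛"))
```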
Data and experimental settings
-
520 thousand cases of CCC and HPI with discharge diagnoses, from January 2011 to October 2014.
-
22.5 million outpatient cases of CID, from March 2008 to October 2014.
Results
General picture of CTCF retrieved and concept mapping
-
One-to-one: This is the most common relationship.
-
One-to-many: Some aggregated CTCFs, such as ‘主动脉瓣双病变’ (double lesions of the aortic valve), should be mapped to more than one concept, for example ‘aortic stenosis’ and ‘aortic insufficiency’.
-
Many-to-one: The average count of synonyms for each mapped rCTCF is 4.4.
-
One-to-zero: As mentioned above, 70% of the CTCFs still failed to map to any rCTCF. The reasons were: 1) The coverage of the rCCNs was limited. 2) Both rCCNs were translated versions with formal, precise, and academic wording, in contrast to the more complex everyday expressions used in the real world. 3) Some CTCFs from medical imaging contained detailed anatomic sites, which made concept mapping difficult, such as ‘股骨髁上髁间开放粉碎骨折’ (open comminuted fracture of supracondylar and intercondylar femur). Introducing Body Structure, another top-level hierarchy of SCT, might help. 4) The similarity metrics deserve further research, because some synonyms might have been missed owing to a low HS score.
CTCF | The 1st rCTCF candidate | The 2nd rCTCF candidate | The 3rd rCTCF candidate | The 4th rCTCF candidate | … |
---|---|---|---|---|---|
主动脉夹层 (Aortic dissection) | 主动脉夹层动脉瘤:0.91 (Aortic dissection aneurysm) | 主动脉扩张:0.75 (Aortic dilation) | 腹主动脉动脉瘤:0.75 (Abdominal aorta aneurysm) | 主动脉瓣关闭不全:0.67 (Aortic valve insufficiency) | … |
共同性外斜 (Concomitant exotropia) | 共同性外斜视:0.95 (Concomitant exotropia) | 共同性内斜视:0.89 (Concomitant esotropia) | 共同性斜视:0.81 (Concomitant strabismus) | 间歇性外斜视:0.70 (Intermittent exotropia) | … |
Evaluation of CRF for CTCF recognition
Models | TP | FP | FN | P | R | F | Total |
---|---|---|---|---|---|---|---|
CWS1 + U | 9427 | 1598 | 1791 | 0.855 | 0.840 | 0.848 | 11,218 |
CWS1 + UB | 9650 | 1210 | 1568 | 0.889 | 0.860 | 0.874 | 11,218 |
CWS1 + UBT | 9616 | 1104 | 1602 | 0.897 | 0.857 | 0.877 | 11,218 |
CWS1 + UBTQ | 9510 | 1012 | 1708 | 0.903 | 0.848 | 0.875 | 11,218 |
CWS2 + U | 9367 | 1676 | 1851 | 0.848 | 0.835 | 0.842 | 11,218 |
CWS2 + UB | 9646 | 1250 | 1572 | 0.885 | 0.860 | 0.872 | 11,218 |
CWS2 + UBT | 9635 | 1111 | 1583 | 0.897 | 0.859 | 0.877 | 11,218 |
CWS2 + UBTQ | 9571 | 1042 | 1647 | 0.902 | 0.853 | 0.877 | 11,218 |
CWS3 + U | 9282 | 1745 | 1936 | 0.842 | 0.827 | 0.835 | 11,218 |
CWS3 + UB | 9614 | 1296 | 1604 | 0.881 | 0.857 | 0.869 | 11,218 |
CWS3 + UBT | 9637 | 1145 | 1581 | 0.894 | 0.859 | 0.876 | 11,218 |
CWS3 + UBTQ | 9559 | 1057 | 1659 | 0.900 | 0.852 | 0.876 | 11,218 |
CWS4 + U | 9184 | 1749 | 2034 | 0.840 | 0.819 | 0.829 | 11,218 |
CWS4 + UB | 9553 | 1331 | 1665 | 0.878 | 0.852 | 0.864 | 11,218 |
CWS4 + UBT | 9599 | 1156 | 1619 | 0.893 | 0.856 | 0.874 | 11,218 |
CWS4 + UBTQ | 9542 | 1083 | 1676 | 0.898 | 0.851 | 0.874 | 11,218 |
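The P, R, and F columns in the table above are consistent with the usual definitions of precision, recall, and balanced F-score computed from TP, FP, and FN, and the Total column equals TP + FN. The short sketch below recomputes the first row (CWS1 + U) as a check; the function name is hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and balanced F-score from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Counts taken from the "CWS1 + U" row.
p, r, f = precision_recall_f(tp=9427, fp=1598, fn=1791)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # prints "0.855 0.840 0.848"
```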
Round | Models | TP | FP | FN | P | R | F | Total |
---|---|---|---|---|---|---|---|---|
- | M0 (baseline) | 9635 | 1111 | 1583 | 0.897 | 0.859 | 0.877 | 11,218 |
1st | M0 + F1 | 9654 | 1135 | 1564 | 0.895 | 0.861 | 0.877 | 11,218 |
1st | M0 + F2 | 9671 | 1127 | 1547 | 0.896 | 0.862 | 0.879 | 11,218 |
1st | M0 + F3 | 9664 | 1135 | 1554 | 0.895 | 0.862 | 0.878 | 11,218 |
1st | M0 + F4 | 9678 | 1101 | 1540 | 0.898 | 0.863 | 0.880 | 11,218 |
1st | M0 + F5 (M1) | 9711 | 1057 | 1507 | 0.902 | 0.866 | 0.883 | 11,218 |
2nd | M1 + F1 | 9717 | 1066 | 1501 | 0.901 | 0.866 | 0.883 | 11,218 |
2nd | M1 + F2 | 9725 | 1083 | 1493 | 0.900 | 0.867 | 0.883 | 11,218 |
2nd | M1 + F3 | 9732 | 1082 | 1486 | 0.900 | 0.868 | 0.883 | 11,218 |
2nd | M1 + F4 (M2) | 9725 | 1071 | 1493 | 0.901 | 0.867 | 0.884 | 11,218 |
3rd | M2 + F1 (M3) | 9754 | 1062 | 1464 | 0.901 | 0.870 | 0.885 | 11,218 |
3rd | M2 + F2 | 9752 | 1080 | 1466 | 0.900 | 0.869 | 0.885 | 11,218 |
3rd | M2 + F3 | 9758 | 1094 | 1460 | 0.899 | 0.870 | 0.884 | 11,218 |
4th | M3 + F2 (M4) | 9785 | 1069 | 1433 | 0.902 | 0.872 | 0.887 | 11,218 |
4th | M3 + F3 | 9765 | 1097 | 1453 | 0.899 | 0.871 | 0.885 | 11,218 |
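The round structure of the table above (each round adds one feature to the best model of the previous round) reads like a greedy forward selection. Assuming that reading, the procedure can be sketched as below. The `evaluate` callback, the stopping rule (stop when no candidate improves the score), and the toy scores are all assumptions for illustration; they are not taken from the experiments.

```python
def forward_select(baseline, candidates, evaluate):
    """Greedy forward selection.

    baseline   -- tuple of features already in the model
    candidates -- feature names that may still be added
    evaluate   -- callable mapping a tuple of features to a score (e.g. F)
    Returns the list of (features, score) pairs kept after each round.
    """
    selected = tuple(baseline)
    remaining = list(candidates)
    best_score = evaluate(selected)
    history = [(selected, best_score)]
    while remaining:
        # Score every model obtained by adding exactly one remaining feature.
        scored = [(evaluate(selected + (name,)), name) for name in remaining]
        round_score, round_feature = max(scored)
        if round_score <= best_score:
            break  # assumed stopping rule: no candidate improves the score
        selected += (round_feature,)
        remaining.remove(round_feature)
        best_score = round_score
        history.append((selected, best_score))
    return history


# Toy scores (hypothetical), keyed by the feature tuple.
toy_scores = {
    ("M0",): 0.50,
    ("M0", "A"): 0.60,
    ("M0", "B"): 0.55,
    ("M0", "A", "B"): 0.58,
}
for features, score in forward_select(("M0",), ["A", "B"], toy_scores.__getitem__):
    print(features, score)
```

In practice each call to `evaluate` would train and test a CRF model, so the number of rounds and candidates determines the total training cost.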
Models | TP | FP | FN | P | R | F | Total |
---|---|---|---|---|---|---|---|
The best CRF model (Baseline) | 9785 | 1069 | 1433 | 0.902 | 0.872 | 0.887 | 11,218 |
RTBA with original SCCSE (R1) | 10,165 | 743 | 1053 | 0.932 | 0.906 | 0.919 | 11,218 |