Background
Language | Imaging report content | Pathologic report content |
---|---|---|
English | The solid hypoechoic area in the subcutaneous tissue of the maxillofacial region measures 14.6 mm × 10.4 mm and has a capsule. The boundary is clear and the shape is regular. | The specimen for pathological examination contains one mass. The mass measures 1.2 × 1 × 1 cm, is gray-red, and has a complete capsule. (Parotid gland) favor a diagnosis of pleomorphic adenoma. The lesion is rich in cells and is not clearly demarcated from the surrounding tissue. |
Chinese | 颌面部所指处皮下见实质性低回声区14.6 mm × 10.4 mm, 边界清, 有包膜, 形态规则。 | 肿块一枚, 大小1.2*1*1 cm, 灰红色, 包膜完整。(腮腺)多形性腺瘤, 细胞丰富, 与周围组织分界不清。 |
Methods
Technical workflow
Data description
CNN model for text similarity detection
Medical concept vectors using ontology-based graph embedding
Model evaluation
- Keyword mapping. We used the vocabulary from CMeSH as a medical dictionary to filter the original text. All words outside the dictionary were discarded, and the Jaccard similarity coefficient was calculated on the keywords that remained in the two report texts.
- Latent Semantic Analysis (LSA). For this approach we collected all reports and constructed a bag-of-words representation vector for each of them. Singular value decomposition was then performed on the matrix formed by concatenating all bag-of-words vectors, which reduced the dimensionality of the vector representations. Cosine similarity was measured on the vectors in the reduced-dimension space.
- Latent Dirichlet Allocation (LDA). This approach constructed bag-of-words representations for the reports. It assumed that each report was a mixture of a set of "topics" and that each topic was a mixture of the words in the vocabulary. Cosine similarity was measured on the topic composition vectors of the reports.
- Doc2Vec. Doc2Vec is an extension to the Word2Vec model [26], in which a document vector is trained together with the word vectors in the continuous bag-of-words model. Cosine similarity was measured on the learned document vectors.
- Siamese long short-term memory (LSTM). Siamese LSTM is often used in text similarity systems. It uses two LSTM networks to encode the two sentences separately, then calculates the Manhattan distance between the encoded hidden vectors to decide whether the two sentences are similar. The training process is supervised.
- Named Entity Recognition (NER). We used another annotated Chinese clinical EMR corpus from Shanghai Tongren Hospital. This corpus contains 46,665 sentences and 89,231 entities of four types: symptoms, diseases, lab tests and body structures. We trained a DNN-based NER model with randomly initialized word embeddings [23] and then applied this model to identify all entities in the original report texts. We kept only these entity words and constructed a bag-of-words representation vector for each report. Cosine similarity was measured on the entity representation vectors.
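Two similarity measures recur across the baselines above: the Jaccard coefficient on dictionary-filtered keywords, and cosine similarity on bag-of-words vectors. The following is a minimal sketch of both, not the implementation used in the study. The token lists and `toy_dictionary` are hypothetical stand-ins for word-segmented reports and the CMeSH vocabulary.

```python
from collections import Counter
from math import sqrt


def jaccard_similarity(tokens_a, tokens_b, dictionary):
    """Jaccard coefficient over the dictionary terms kept from each report."""
    kept_a = set(tokens_a) & dictionary
    kept_b = set(tokens_b) & dictionary
    union = kept_a | kept_b
    if not union:
        return 0.0  # neither report contains a dictionary term
    return len(kept_a & kept_b) / len(union)


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between raw bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: a four-term dictionary and two segmented reports.
toy_dictionary = {"甲状腺", "结节", "右叶", "包膜"}
imaging_tokens = ["甲状腺", "右叶", "包膜", "回声"]
pathology_tokens = ["甲状腺", "右叶", "结节", "切面"]

jaccard = jaccard_similarity(imaging_tokens, pathology_tokens, toy_dictionary)
cosine = cosine_similarity(imaging_tokens, pathology_tokens)
```

In the LSA, LDA, Doc2Vec and NER baselines the cosine step would be applied to the reduced, topic, document or entity vectors rather than to raw counts as done here.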
Model interpretability
Results
Model performance
All metrics are reported as mean ± std.
Model | Macro average precision | Macro average recall | Macro average F1-score | Positive class precision | Positive class recall | Positive class F1-score | Negative class precision | Negative class recall | Negative class F1-score | AUC |
---|---|---|---|---|---|---|---|---|---|---|
Zero-r | 0.73 ± 0.0 | 0.85 ± 0.0 | 0.78 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.85 ± 0.0 | 1.0 ± 0.0 | 0.92 ± 0.0 | 0.0 ± 0.0 |
Keyword Mapping | 0.827 ± 0.006 | 0.842 ± 0.005 | 0.833 ± 0.005 | 0.464 ± 0.023 | 0.358 ± 0.018 | 0.404 ± 0.018 | 0.891 ± 0.004 | 0.927 ± 0.004 | 0.909 ± 0.003 | 0.840 ± 0.004 |
LSA | 0.892 ± 0.005 | 0.862 ± 0.007 | 0.873 ± 0.006 | 0.512 ± 0.022 | 0.758 ± 0.019 | 0.611 ± 0.019 | 0.956 ± 0.004 | 0.879 ± 0.008 | 0.916 ± 0.005 | 0.894 ± 0.006 |
LDA | 0.872 ± 0.006 | 0.852 ± 0.006 | 0.860 ± 0.006 | 0.514 ± 0.021 | 0.669 ± 0.023 | 0.581 ± 0.019 | 0.936 ± 0.005 | 0.884 ± 0.005 | 0.910 ± 0.004 | 0.879 ± 0.004 |
Doc2Vec | 0.882 ± 0.007 | 0.862 ± 0.007 | 0.869 ± 0.007 | 0.514 ± 0.019 | 0.682 ± 0.023 | 0.586 ± 0.018 | 0.943 ± 0.005 | 0.892 ± 0.006 | 0.917 ± 0.004 | 0.871 ± 0.005 |
NER-based | 0.835 ± 0.006 | 0.849 ± 0.005 | 0.842 ± 0.006 | 0.473 ± 0.022 | 0.501 ± 0.020 | 0.482 ± 0.020 | 0.904 ± 0.006 | 0.923 ± 0.005 | 0.912 ± 0.005 | 0.853 ± 0.004 |
Siamese LSTM | 0.920 ± 0.006 | 0.891 ± 0.005 | 0.904 ± 0.006 | 0.582 ± 0.020 | 0.843 ± 0.021 | 0.698 ± 0.020 | 0.964 ± 0.006 | 0.901 ± 0.007 | 0.932 ± 0.006 | 0.916 ± 0.006 |
CNN + random vector | 0.916 ± 0.005 | 0.931 ± 0.006 | 0.923 ± 0.005 | 0.631 ± 0.022 | 0.833 ± 0.019 | 0.712 ± 0.019 | 0.972 ± 0.007 | 0.917 ± 0.005 | 0.941 ± 0.005 | 0.942 ± 0.003 |
CNN + pretrain vector | 0.912 ± 0.006 | 0.927 ± 0.006 | 0.920 ± 0.006 | 0.637 ± 0.021 | 0.811 ± 0.019 | 0.701 ± 0.020 | 0.965 ± 0.006 | 0.920 ± 0.005 | 0.937 ± 0.006 | 0.936 ± 0.004 |
CNN + concept vector | 0.931 ± 0.006 | 0.938 ± 0.007 | 0.935 ± 0.006 | 0.682 ± 0.023 | 0.771 ± 0.020 | 0.734 ± 0.021 | 0.969 ± 0.004 | 0.938 ± 0.008 | 0.954 ± 0.007 | 0.951 ± 0.003 |
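To make the columns of the table above concrete, the sketch below computes per-class precision, recall and F1-score for a majority-class (Zero-r style) predictor, together with an unweighted and a support-weighted average. The 15/85 class split is a hypothetical choice, not the study's data. With that split, the support-weighted averages (about 0.72 / 0.85 / 0.78) fall closer to the Zero-r row than the unweighted ones do, although this section does not state which averaging was used.

```python
def class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating `label` as the target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_metrics(y_true, y_pred, labels, weighted=False):
    """Average per-class metrics, unweighted or weighted by class support."""
    total = len(y_true)
    sums = [0.0, 0.0, 0.0]
    for label in labels:
        weight = y_true.count(label) / total if weighted else 1 / len(labels)
        for i, value in enumerate(class_metrics(y_true, y_pred, label)):
            sums[i] += weight * value
    return tuple(sums)


# Hypothetical label distribution: 15 positive pairs, 85 negative pairs.
y_true = [1] * 15 + [0] * 85
y_pred = [0] * 100  # a majority-class predictor always outputs the negative class

positive = class_metrics(y_true, y_pred, 1)
negative = class_metrics(y_true, y_pred, 0)
unweighted = average_metrics(y_true, y_pred, [0, 1])
support_weighted = average_metrics(y_true, y_pred, [0, 1], weighted=True)
```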
LIME experiment
Sample pair No. | Imaging report content (Chinese) | Imaging report content (English) | Pathologic report content (Chinese) | Pathologic report content (English) |
---|---|---|---|---|
1 | 宫内见1个胎儿, 胎位头位, 胎方位LOP。双顶径81, 枕额径101, 腹前后径92, 腹左右径83, 股骨长60, 肱骨长52。胎心胎动见, 胎心133次/分, 胎心律齐。胎盘位于后壁, 厚度35, 分级II, 胎盘下缘距宫颈内口> 54。羊水指数31 + 31 + 36 + 39。胎儿脐血流指数:PI = 0.93, RI = 0.63, S/D = 2.71 单胎头位。胎儿迟发畸形的检查受多因素影响, 超声无法检出所有胎儿异常。此检查仅限于胎儿生长监测。 | One fetus is observed in the uterus. The fetus is in cephalic presentation, the position is LOP, the biparietal diameter is 81, the occipitofrontal diameter is 101, the anteroposterior abdominal diameter is 92, the transverse abdominal diameter is 83, the femur length is 60, and the humerus length is 52. Fetal heartbeat and fetal movement are observed. The fetal heart rate is 133 beats per minute and the heart rhythm is regular. The placenta is located on the posterior wall, with a thickness of 35, grade II. The distance between the lower placental margin and the internal cervical os is > 54. The amniotic fluid index is 31 + 31 + 36 + 39. Fetal umbilical artery flow indices: PI = 0.93, RI = 0.63, S/D = 2.71. Singleton, cephalic presentation. The detection of late-onset fetal malformations is affected by many factors, and ultrasound cannot detect all fetal abnormalities. This examination is limited to fetal growth monitoring. | 胎盘组织重600g, 大小21*17*3 cm, 胎膜完整, 切面灰红色, 母面小叶完整, 子面光滑, 相连脐带长35cm, 直径1.2 cm, 血管三根。(胎盘)孕晚期胎盘一个, 绒毛发育良好, 脐带及胎膜未见明显异常。 | The placental tissue weighs 600 g and measures 21 × 17 × 3 cm. The fetal membranes are intact, the cut surface is gray-red, the lobules of the maternal surface are intact, and the fetal surface is smooth. The attached umbilical cord is 35 cm long and 1.2 cm in diameter, with three blood vessels. (Placenta) One late-pregnancy placenta; the villi are well developed, and no obvious abnormality is seen in the umbilical cord or fetal membranes. |
2 | 甲状腺大小正常, 包膜清晰完整, 内部回声分布均匀, CDFI:腺体内部血流信号未见明显异常。甲状腺右叶内可见数个低回声区, 大者大小23.5*13.2 mm, 形态规则, 边界清晰, 内部回声不均匀。 | The thyroid gland is of normal size, the capsule is clear and intact, and the internal echogenicity is homogeneous. CDFI: there is no obvious abnormality of the blood flow signal in the gland. Several hypoechoic areas are seen in the right lobe of the thyroid. The largest measures 23.5 × 13.2 mm; its shape is regular, its boundary is clear, and its internal echogenicity is inhomogeneous. | 甲状腺组织, 大小4.5*2.5*1.5 cm, 切面见结节两枚, 直径1-2 cm, 灰红色, 质软。(甲状腺右叶)结节性甲状腺肿伴滤泡性腺瘤形成。 | The specimen for pathological examination contains thyroid tissue measuring 4.5 × 2.5 × 1.5 cm. Two nodules are seen on the cut surface. The nodules are 1 to 2 cm in diameter, gray-red, and soft. (Right lobe of the thyroid) favor a diagnosis of nodular goiter with follicular adenoma formation. |
Sample pair No. | Pathologic report word (Chinese) | Pathologic report word (English) | Feature importance of word | Imaging report word (Chinese) | Imaging report word (English) | Feature importance of word |
---|---|---|---|---|---|---|
1 (Prediction probability = 0.77) | 胎膜 | Fetal membranes | 0.15 | 胎儿 | Fetus | 0.12 |
1 | 脐带 | Umbilical cord | 0.14 | 胎心 | Fetal heart | 0.03 |
1 | 胎盘 | Placenta | 0.06 | 羊水 | Amniotic fluid | 0.03 |
1 | 毛发 | Hair | 0.04 | 头位 | Head position | 0.02 |
1 | 小叶 | Lobule | 0.02 | 股骨 | Femur | 0.01 |
1 | 面灰 | Face ash | 0.01 | 单胎 | Single fetus | 0.01 |
2 (Prediction probability = 0.83) | 甲状腺 | Thyroid | 0.19 | 甲状腺 | Thyroid | 0.16 |
2 | 结节 | Nodule | 0.15 | 腺体 | Gland | 0.14 |
2 | 滤泡 | Follicular | 0.08 | 右叶 | Right lobe | 0.07 |
2 | 右叶 | Right lobe | 0.03 | 包膜 | Capsule | 0.03 |
2 | 腺瘤 | Adenoma | 0.01 | 回声 | Echoes | 0.01 |
2 | 切面 | Section | 0.01 | 血流 | Blood flow | 0.01 |
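The table above assigns each word a feature importance for the predicted label. As a rough illustration of the idea only, the sketch below scores words by leave-one-word-out perturbation: remove a word, rescore the text, and record the drop. This is a deliberately simpler technique than LIME, which fits a local linear model over many random perturbations. `toy_score` and `TOY_WEIGHTS` are hypothetical stand-ins for the trained classifier.

```python
def leave_one_out_importance(tokens, score_fn):
    """Importance of each distinct word = drop in score when it is removed."""
    base = score_fn(tokens)
    importances = {}
    for word in dict.fromkeys(tokens):  # distinct words, original order
        reduced = [t for t in tokens if t != word]
        importances[word] = base - score_fn(reduced)
    return importances


# Hypothetical stand-in for a classifier's predicted probability.
TOY_WEIGHTS = {"甲状腺": 0.5, "结节": 0.3}


def toy_score(tokens):
    """Sum fixed weights of the distinct words present, capped at 1.0."""
    return min(1.0, sum(TOY_WEIGHTS.get(t, 0.0) for t in set(tokens)))


report_tokens = ["甲状腺", "结节", "切面"]
word_importance = leave_one_out_importance(report_tokens, toy_score)
ranked_words = sorted(word_importance, key=word_importance.get, reverse=True)
```

Words that the scorer ignores receive an importance of zero, which mirrors the low-weight rows at the bottom of each sample pair in the table.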
Discussion
Sample pair No. | Imaging report content (Chinese) | Imaging report content (English) | Pathologic report content (Chinese) | Pathologic report content (English) | True label | Predicted label |
---|---|---|---|---|---|---|
3 | 于左肾下极腹侧可见多个囊性为主的混合性回声, 相互融合, 较大之一约17.1 × 17.0 mm(局部凸向肾外), 靠近肾盏之一大小约14.2 × 14.6 mm, 形态欠规则, 表面光整, 境界欠清, 囊内无回声透声尚可, 分布欠均, 可见分隔样回声, 间隔及囊壁未见明显增粗, 囊内及囊壁可见点状、带状强回声, 团块后方回声无明显改变, CFI示未见明显血流信号。 | Multiple predominantly cystic mixed-echo lesions are seen on the ventral side of the lower pole of the left kidney, fused with one another. The largest is about 17.1 × 17.0 mm (locally protruding beyond the kidney), and one near the renal calyx is about 14.2 × 14.6 mm. The shape is slightly irregular, the surface is smooth, and the boundary is not well defined. The anechoic cyst contents transmit sound reasonably well but are unevenly distributed, and septation-like echoes can be seen. Neither the septa nor the cyst walls are obviously thickened. Punctate and band-like strong echoes are seen within the cysts and on the cyst walls. The echo behind the mass shows no obvious change, and CFI shows no obvious blood flow signal. | 肿物两枚, 直径1cm, 暗黄色, 质中。另见肾上腺组织, 大小2.5*1.5*1.5 cm, 暗红色, 质中。(左肾上腺)倾向皮质结节状增生。 | The specimen for pathological examination contains two masses and adrenal tissue. The masses are 1 cm in diameter, dark yellow, and of medium texture. The adrenal tissue measures 2.5 × 1.5 × 1.5 cm and is dark red and of medium texture. (Left adrenal gland) favor a diagnosis of nodular adrenal cortical hyperplasia. | True | False |
4 | 左侧腋下见数个淋巴结样回声区, 大者11mm*5 mm, 边界清, 有包膜, 形态规则, 内部结构清晰, 未见明显血流信号。左侧腋下可见多个淋巴结。 | Several lymph node-like echoic areas are seen in the left axilla; the largest is 11 mm × 5 mm, with a clear boundary, a capsule, a regular shape, and a clear internal structure. There is no obvious blood flow signal. Multiple lymph nodes can be seen in the left axilla. | 脂肪组织, 大小3.5*3*1 cm, 找见淋巴结两枚。(右腋下淋巴结)淋巴结(0/1)未见癌转移。免疫组化:(右腋下淋巴结)淋巴结(0/1)未见癌转移。 | The specimen for pathological examination contains fat tissue measuring 3.5 × 3 × 1 cm, in which two lymph nodes are found. (Right axillary lymph node) Lymph node (0/1) shows no cancer metastasis. Immunohistochemistry: (right axillary lymph node) lymph node (0/1) shows no cancer metastasis. | False | True |