Introduction
Consider the pattern "[PHARMPRODUCT|SUBSTANCE] + allergy", that is, a pharmacological product or a substance followed by "allergy", as in "urokinase allergy", "cortisone allergy" or "phentolamine allergy". Our hypothesis rests on the fact that we already have the translations of some chunks within the complex term. In this step the translation application should generate the Basque equivalences using the already translated components and some generation rules.

Pattern found | Quantity |
---|---|
[PHARMPRODUCT\|SUBSTANCE]+allergy | 1,498 |
[PHARMPRODUCT\|SUBSTANCE]+adverse+reaction | 1,488 |
[PHARMPRODUCT\|SUBSTANCE]+poisoning | 847 |
[PHARMPRODUCT\|SUBSTANCE]+overdose | 567 |
[PHARMPRODUCT\|SUBSTANCE]+poisoning+of+undetermined+intent | 432 |
intentional+[PHARMPRODUCT\|SUBSTANCE]+poisoning | 429 |
accidental+[PHARMPRODUCT\|SUBSTANCE]+poisoning | 428 |
... | ... |
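How such patterns might be detected can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `abstract_pattern`, the toy substance set and the term "accidental cortisone poisoning" are hypothetical, and only single-token substance names are handled.

```python
from collections import Counter

PLACEHOLDER = "[PHARMPRODUCT|SUBSTANCE]"

def abstract_pattern(term, substances):
    """Replace known substance tokens by the placeholder; None if none is found."""
    tokens = term.lower().split()
    abstracted = [PLACEHOLDER if tok in substances else tok for tok in tokens]
    if PLACEHOLDER not in abstracted:
        return None
    return "+".join(abstracted)

# Toy data: substance names taken from the examples in the text (assumption:
# in a real setting they would come from the Substance and Product hierarchies).
substances = {"urokinase", "cortisone", "phentolamine"}
terms = [
    "urokinase allergy",
    "cortisone allergy",
    "phentolamine allergy",
    "accidental cortisone poisoning",  # made-up instance of a listed pattern
]

# Count how often each abstracted pattern occurs.
pattern_counts = Counter(
    p for p in (abstract_pattern(t, substances) for t in terms) if p is not None
)
```

Counting the abstracted patterns over a whole hierarchy would give a frequency list of the same shape as the table above.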
Background
Methods
Analysis to choose the source language in SNOMED CT
Description | Type |
---|---|
Obstruction of pelviureteric junction (disorder) | FSN |
Obstruction of pelviureteric junction | Preferred Term |
PUJ - Pelviureteric obstruction | Synonym |
PUO - Pelviureteric obstruction | Synonym |
Pelviureteric obstruction | Synonym |
UPJ - Ureteropelvic obstruction | Synonym |
Ureteropelvic obstruction | Synonym |
Hierarchy | English Semantic Tag (ST) | English # FSN | Spanish Semantic Tag (ST) | Spanish # FSN |
---|---|---|---|---|
Clinical finding/disorder | disorder | 66,239 | trastorno | 66,199 |
Clinical finding/disorder | finding | 33,573 | hallazgo | 33,613 |
Procedure/intervention | procedure | 51,149 | procedimiento | 51,149 |
Procedure/intervention | regime/therapy | 2,480 | régimen/terapia | 2,480 |
Organism | organism | 33,157 | organismo | 33,157 |
Body structure | body structure | 24,950 | estructura corporal | 24,953 |
Body structure | morphologic abnormality | 4,509 | anomalía morfológica | 4,509 |
Body structure | cell | 626 | célula | 626 |
Body structure | cell structure | 504 | estructura celular | 501 |
Substance | substance | 23,845 | sustancia | 23,845 |
Pharmaceutical/biologic product | product | 16,759 | producto | 16,759 |
Qualifier value | qualifier value | 8,944 | calificador | 8,944 |
Observable entity | observable entity | 8,278 | entidad observable | 8,278 |
Event | event | 3,671 | evento | 3,670 |
Situation with explicit context | situation | 3,561 | situación | 3,561 |
Social context | occupation | 3,852 | ocupación | 3,852 |
Social context | person | 425 | persona | 425 |
Social context | ethnic group | 262 | grupo étnico | 262 |
Social context | religion/philosophy | 203 | religión/filosofía | 203 |
Social context | life style | 21 | estilo de vida | 21 |
Social context | social concept | 23 | contexto social | 23 |
Social context | racial group | 19 | grupo racial | 19 |
Physical object | physical object | 4,513 | objeto físico | 4,513 |
Specimen | specimen | 1,440 | espécimen | 1,440 |
Environment or geographical location | environment | 1,094 | medio ambiente | 1,094 |
Environment or geographical location | geographic location | 617 | localización geográfica | 617 |
Staging and scales | assessment scale | 1,077 | escala de evaluación | 1,077 |
Staging and scales | tumor staging | 214 | estadificación tumoral | 214 |
Staging and scales | staging scale | 16 | escala de estadificación | 16 |
Special concept | navigational concept | 640 | concepto para navegación | 640 |
Special concept | namespace concept | 169 | espacio de nombres | 169 |
Special concept | special concept | 1 | concepto especial | 1 |
Record artifact | record artifact | 224 | elemento de registro | 224 |
Physical force | physical force | 171 | fuerza física | 171 |
Metadata | foundation metadata | 169 | metadato fundacional | 169 |
Metadata | core metadata concept | 31 | metadato del núcleo | 32 |
Phase 1: lexical resources
- ZT Dictionary[25]: a specialized dictionary of science and technology that covers areas included in SNOMED CT, such as medicine, biochemistry and biology. It contains 10,626 English-Basque equivalences and 10,971 Spanish-Basque equivalences.
- Nursing Dictionary[26]: a small dictionary of the nursing domain that has 4,155 entries in the English-Basque chapter and 4,671 entries in the Spanish-Basque one.
- Glossary of Anatomy: anatomical terminology used by university experts in their lectures. It is still under development and currently has 2,818 entries for the English-Basque pair and 3,940 entries for the Spanish-Basque pair.
- ICD-10[27]: the 10th version of the International Classification of Diseases was translated into Basque in 1996. We combined it with the Spanish and English versions and obtained a dictionary of 6,936 equivalences between English and Basque and 8,842 equivalences between Spanish and Basque.
- EuskalTerm[28]: the biggest multilingual terminology bank available for Basque, with 75,860 entries. In the domain of biomedicine, the bank contains 32,301 term equivalences. All of them are available for the Spanish-Basque pair, and 10,506 of them for the English-Basque pair.
- Dictionary of Sanitary Administration[31]: a small dictionary that contains 1,799 entries for the Spanish-Basque pair, covering the administration of the sanitary domain.
Phase 2: finite state transducers and biomedical affixes
Baseline translation process
read lexc prefixes.lex
define PREFALL
define PREF PREFALL.u ;
read lexc suffixes.lex
define SUFALL
define SUFF SUFALL.u ;
regex [[[PREF 0:+] (o 0:+)]* SUFF] ;
- photo+dermat+itis: 3
- photo+derm+at+itis: 4
- phot+o+dermat+itis: 4
- phot+o+derm+at+itis: 5
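The four candidate segmentations listed above can be reproduced with a small Python sketch that mimics the shape of the transducer (any number of prefixes, each optionally followed by a linking "o", then one suffix). The affix lists below are toy lists inferred from this single example, not the actual lexicons, and choosing the candidate with the fewest morphemes is an assumption of this sketch.

```python
def segmentations(term, prefixes, suffixes):
    """Enumerate analyses of the form [PREF (o)]* SUFF as lists of morphemes."""
    results = []

    def walk(pos, parts):
        rest = term[pos:]
        if rest in suffixes:
            results.append(parts + [rest])
        for pref in prefixes:
            if rest.startswith(pref):
                nxt = pos + len(pref)
                walk(nxt, parts + [pref])
                # Optional linking vowel "o" after a prefix.
                if term.startswith("o", nxt):
                    walk(nxt + 1, parts + [pref, "o"])

    walk(0, [])
    return results

# Toy affix lists (hypothetical, for illustration only).
PREFIXES = ("photo", "phot", "dermat", "derm", "at")
SUFFIXES = ("itis",)

candidates = segmentations("photodermatitis", PREFIXES, SUFFIXES)
# Assumption: prefer the analysis with the fewest morphemes.
best = min(candidates, key=len)
```

With these toy lists, `"+".join(best)` yields `photo+dermat+itis`, the candidate with a count of 3.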
read lexc prefixes.lex
define TRANSPRE
read lexc suffixes.lex
define TRANSSUF
define MORPHO ...
define TRANS (ˆ) [[[TRANSPRE +] (o:o +)]* TRANSSUF] ;
regex TRANS .o. MORPHO ;
English terms | Basque terms |
---|---|
echoencephalogram | ekoentzefalograma |
encephalitis | entzefalitis |
encephalomyelitis | entzefalomielitis |
leukoencephalitis | leukoentzefalitis |
... | ... |
First approach
...
define IDEN1 [[[PREF 0:+] (o 0:+)]* SUFF] ;
define IDEN2 [(? + 0:# +) [PREF 0:+]]* (? + 0:# +) SUFF ;
regex IDEN1 .P. IDEN2 ;
- diverticul#+itis: 10 + 2 = 12
- divertic#+ul+itis: 8 + 3 = 11
- di+verticul#+itis: 8 + 3 = 11
- di+vertic#+ul+itis: 6 + 4 = 10
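The figures in the list above are consistent with a score that adds the length of the unidentified chunk (the one marked with "#") to the number of morphemes in the segmentation. The following Python sketch computes that score; the function name `score` is hypothetical and the reading of the two addends is inferred from the four examples only.

```python
def score(segmentation):
    """Length of the chunk(s) marked with '#' plus the number of morphemes."""
    parts = segmentation.split("+")
    # The trailing '#' is a marker, so it is not counted as a character.
    unknown_length = sum(len(p) - 1 for p in parts if p.endswith("#"))
    return unknown_length + len(parts)

scores = {
    seg: score(seg)
    for seg in (
        "diverticul#+itis",
        "divertic#+ul+itis",
        "di+verticul#+itis",
        "di+vertic#+ul+itis",
    )
}
```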
...
define C c -> k || [ noC ] _ [ a | o | u | noHC | # ] ,,
         c -> z || [ noC ] _ [ e | i | y ] ;
define V v -> b ;
define Vow [ a | e | i | o | u | y ] ;
define Sib [ s | z | x ] ;
define PAL n -> n t || _ Sib Vow ,,
           l -> l t || _ Sib Vow ,,
           r -> r t || _ Sib Vow ,,
           m -> n t || _ Sib Vow ;
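To make the effect of these rewrite rules more tangible, the sketch below approximates them in Python. It is a simplified reading, not a port of the transducer: the left context `noC` is ignored, "ch" is left untouched, the end of the string stands in for the `#` marker, and the order in which the three rules are applied is an assumption. The names `adapt_c` and `adapt` are hypothetical.

```python
import re

# n, l, r gain a "t" and m becomes "nt" before a sibilant followed by a vowel.
PAL_MAP = {"n": "nt", "l": "lt", "r": "rt", "m": "nt"}

def adapt_c(word):
    """c -> z before e, i, y; c -> k elsewhere, except before h."""
    out = []
    for i, ch in enumerate(word):
        if ch != "c":
            out.append(ch)
            continue
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if nxt and nxt in "eiy":
            out.append("z")
        elif nxt == "h":
            out.append("c")  # "ch" is not handled in this sketch
        else:
            out.append("k")  # before a, o, u, another consonant, or at the end
    return "".join(out)

def adapt(word):
    """Apply the c, v and sibilant-context rules in sequence (assumed order)."""
    word = adapt_c(word)
    word = word.replace("v", "b")
    return re.sub(r"[nlrm](?=[szx][aeiouy])", lambda m: PAL_MAP[m.group()], word)
```

For instance, `adapt("cervical")` gives `zerbikal` and `adapt("insulin")` gives `intsulin` under these simplifications.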
...
Second approach
...
read lexc prefixes Reduced.lex
define PREFREDUCED
define PREFRED PREFREDUCED.u ;
read lexc suffixes Reduced.lex
define SUFFREDUCED
define SUFFRED SUFFREDUCED.u ;
define IDEN1 [[[PREF 0:%+] (o 0:%+)]* SUFF] ;
define IDEN2 [(? + 0:# +) [PREF 0:+ ]]* (? + 0:# +) SUFFRED ;
define IDEN3 [(? + 0:# +) [PREFRED 0:+ ]] + (? + 0:# +) SUFF ;
regex IDEN1 .P. IDEN2 .P. IDEN3 ;
Results
Phase 1 results
Hierarchy | English #Syn. | English #Concepts | Spanish #Syn. | Spanish #Concepts | Total #Syn. | Total #Concepts |
---|---|---|---|---|---|---|
Disorder | 3,975 | 3,063 | 2,231 | 1,602 | 4,362 | 3,275 |
Finding | 1,690 | 857 | 1,866 | 759 | 2,855 | 1,018 |
Body Structure | 5,554 | 2,747 | 5,076 | 2,616 | 7,077 | 3,295 |
Procedure | 557 | 405 | 536 | 377 | 775 | 501 |
Phase 2 results
- True Positives: the term should be translated, it is translated and the translation is correct. That is, at least one of the Basque terms generated matches at least one synonym from the Gold Standard.
- False Negatives: the term should be translated and it is not translated.
- False Positives: the term should not be translated and it is translated, or the term should be translated and the Basque term generated is not correct.
- True Negatives: the term should not be translated and it is not translated.
Hierarchy | Approach | TP | FN | FP | TN | Total | Prec. | Recall | F-M |
---|---|---|---|---|---|---|---|---|---|
Disorder | Baseline | 289 | 451 | 31 | 77 | 848 | 0.903 | 0.391 | 0.545 |
Disorder | 1st approach | 615 | 67 | 108 | 58 | 848 | 0.851 | 0.902 | 0.875 |
Disorder | 2nd approach | 577 | 104 | 102 | 65 | 848 | 0.850 | 0.847 | 0.849 |
Finding | Baseline | 79 | 171 | 9 | 116 | 375 | 0.898 | 0.316 | 0.467 |
Finding | 1st approach | 213 | 29 | 41 | 92 | 375 | 0.839 | 0.880 | 0.859 |
Finding | 2nd approach | 178 | 63 | 32 | 102 | 375 | 0.848 | 0.739 | 0.789 |
Body Structure | Baseline | 121 | 425 | 23 | 205 | 774 | 0.840 | 0.222 | 0.351 |
Body Structure | 1st approach | 322 | 174 | 100 | 178 | 774 | 0.763 | 0.649 | 0.702 |
Body Structure | 2nd approach | 284 | 212 | 91 | 187 | 774 | 0.757 | 0.573 | 0.652 |
Procedure | Baseline | 98 | 77 | 9 | 64 | 248 | 0.916 | 0.560 | 0.695 |
Procedure | 1st approach | 144 | 16 | 49 | 39 | 248 | 0.746 | 0.900 | 0.816 |
Procedure | 2nd approach | 154 | 5 | 50 | 39 | 248 | 0.755 | 0.969 | 0.848 |
Total | Baseline | 587 | 1,124 | 72 | 462 | 2,245 | 0.891 | 0.343 | 0.495 |
Total | 1st approach | 1,295 | 286 | 297 | 367 | 2,245 | 0.813 | 0.826 | 0.820 |
Total | 2nd approach | 1,304 | 275 | 299 | 367 | 2,245 | 0.813 | 0.747 | 0.779 |
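The precision, recall and F-measure columns can be recomputed from the TP, FN and FP counts. The Python sketch below assumes the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F-measure as their harmonic mean) and checks them against the Disorder baseline row; the function name is hypothetical.

```python
def precision_recall_f(tp, fn, fp):
    """Standard precision, recall and balanced F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Disorder, baseline: TP = 289, FN = 451, FP = 31.
p, r, f = precision_recall_f(tp=289, fn=451, fp=31)
```

Rounded to three decimals, the values agree with the 0.903, 0.391 and 0.545 reported for that row.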
Overall results
Hierarchy | Phase 0 ICD-10 mapping #Syn. | Phase 0 ICD-10 mapping #Match | Phase 1 Lexical resources #Syn. | Phase 1 Lexical resources #Match | Phase 2 Morphosemantics #Syn. | Phase 2 Morphosemantics #Match | Total #Syn. | Total #Match |
---|---|---|---|---|---|---|---|---|
Disorder | 11,224 | 11,224 | 4,362 | 5,029 | 2,699 | 2,417 | 17,912 | 18,670 |
Finding | 1,871 | 1,871 | 2,855 | 1,771 | 897 | 655 | 5,508 | 4,297 |
Body Structure | 0 | 0 | 7,077 | 5,843 | 1,026 | 861 | 8,036 | 6,704 |
Procedure | 0 | 0 | 536 | 835 | 1,780 | 1,427 | 2,490 | 2,262 |
Hierarchy | Terms | 1 token | 2 tokens | 3 tokens | 4 tokens | >4 tokens | Total |
---|---|---|---|---|---|---|---|
Disorder | Translated | 3,388 | 1,098 | 533 | 275 | 419 | 5,713 |
Disorder | Total | 3,962 | 21,830 | 24,054 | 20,357 | 39,501 | 109,704 |
Disorder | Percentage | 85.51% | 5.03% | 2.22% | 1.35% | 1.06% | 5.21% |
Finding | Translated | 1,290 | 161 | 39 | 19 | 56 | 1,565 |
Finding | Total | 1,821 | 8,850 | 11,126 | 10,092 | 19,689 | 51,578 |
Finding | Percentage | 70.84% | 1.82% | 0.35% | 0.19% | 0.28% | 3.03% |
Body Structure | Translated | 1,931 | 1,460 | 381 | 72 | 15 | 3,859 |
Body Structure | Total | 2,612 | 11,287 | 12,443 | 10,793 | 21,515 | 58,650 |
Body Structure | Percentage | 73.93% | 12.94% | 3.06% | 0.67% | 0.07% | 6.58% |
Procedure | Translated | 1,741 | 80 | 11 | 2 | 1 | 1,835 |
Procedure | Total | 1,982 | 9,966 | 15,848 | 16,578 | 37,695 | 82,069 |
Procedure | Percentage | 87.84% | 0.80% | 0.07% | 0.01% | 0.003% | 2.24% |
Sub-hierarchy | Disorder | Finding | Body Structure | Procedure |
---|---|---|---|---|
Translated Concepts | 14,181 | 2,953 | 3,845 | 1,607 |
Concepts in total | 66,239 | 33,573 | 30,589 | 53,629 |
Percentage | 21.41% | 8.80% | 12.57% | 3.00% |
- Regarding the Disorder sub-hierarchy, we obtained the translation of 21.41% of the concepts (see Table 9). Considering that our work so far has focused mainly on simple terms, we regard this as a very good result. The ICD-10 mapping makes the largest contribution, producing 11,224 synonyms. In any case, the strength of the morphosemantics phase is noticeable in Table 8, which shows that 85.51% of the simple terms are translated.
- The Finding sub-hierarchy is the most balanced one, as no single method stands out. In this case, we achieved the translation of 8.80% of the concepts.
- In the Body Structure hierarchy, 12.57% of the concepts get a Basque equivalent, with outstanding results for complex terms (12.94% of two-token terms).
- For the Procedure hierarchy the dictionaries are of hardly any use (536 Basque terms, as seen in Table 7). In contrast, after applying the morphosemantics phase, 87.84% of the simple terms are translated (see Table 8). In any case, we only obtain translations for 3.00% of the concepts, and this must be improved in the following phases.
- In general, even if the overall numbers seem low (22,586 concepts translated out of 184,030), this is a solid base on which to implement the following two phases in an incremental strategy.
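The coverage figures quoted above follow directly from the per-hierarchy concept counts. As a quick arithmetic check, this short Python sketch recomputes the per-hierarchy percentages and the overall totals; the variable names are ours.

```python
# Translated concepts and total concepts per sub-hierarchy, as reported above.
translated = {"Disorder": 14181, "Finding": 2953, "Body Structure": 3845, "Procedure": 1607}
total = {"Disorder": 66239, "Finding": 33573, "Body Structure": 30589, "Procedure": 53629}

# Percentage of translated concepts in each sub-hierarchy.
coverage = {h: 100 * translated[h] / total[h] for h in translated}

# Overall figures across the four sub-hierarchies.
translated_sum = sum(translated.values())
total_sum = sum(total.values())
overall = 100 * translated_sum / total_sum
```

The sums give 22,586 translated concepts out of 184,030, that is, roughly 12.27% overall.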