Data
Most major medical data come from diagnosis and treatment at medical institutions: massive amounts of clinical and laboratory data are produced every day in hospitals at all levels.
A gold-standard annotated corpus, marked up with temporal expressions, events and the relations between them, is needed to evaluate our methods. The THYME corpus, which has been in use since 2011 [39], is one suitable corpus; it consists of clinical and pathology notes of colon cancer and brain cancer patients from the Mayo Clinic. Unlike other datasets, the events in this corpus are all single words, which suits our system well. The notes are manually annotated by the THYME project (
thyme.healthnlp.org) using an extension of ISOTimeML for the annotation of temporal expressions, events and temporal relations [
40]. 50% of the corpus is used for training, 25% for development and 25% for testing. The development set is used to optimize learning parameters; it is then combined with the training set to build the system used to report results. Table
1 shows the distribution of the THYME corpus. The colon cancer data are used for training, and the resulting models are tested on both colon cancer and brain cancer data to demonstrate their effectiveness with and without domain adaptation, showing that our approach is not limited to a particular domain. In the evaluation, all methods have access to the same training and test data.
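As a rough sketch, the 50/25/25 protocol above can be expressed as follows; note that THYME ships with a fixed official split, so the random shuffle here is purely illustrative:

```python
import random

def split_documents(docs, seed=0):
    """Split a document list 50/25/25 into train/dev/test.
    Illustrative only: the THYME corpus uses a fixed official split."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n = len(docs)
    return docs[: n // 2], docs[n // 2 : 3 * n // 4], docs[3 * n // 4 :]

train, dev, test = split_documents(range(100))
print(len(train), len(dev), len(test))  # 50 25 25
```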
Table 1
The distribution of the THYME corpus. The table shows the counts of the different annotation types in the corpus
| | Colon cancer (Train / Dev / Test) | Brain cancer (Dev / Test) |
Documents | 293 / 143 / 141 | 30 / 148 |
Temporal expressions | 3833 / 2078 / 1952 | 350 / 1552 |
Event expressions | 38,890 / 20,974 / 18,990 | 2557 / 11,510 |
ER | 11,150 / 6163 / 5894 | 624 / 1759 |
Results
The method has been evaluated on both colon cancer data and brain cancer data to demonstrate its effectiveness with and without domain adaptation. For a thorough comparison, several methods are used for entity extraction.
Six methods are used for temporal expression extraction: 1) a rule-based method; 2) a CRF-based system; 3) a general RNN without any attention mechanism or context words (RNN); 4) an RNN with a simple attention mechanism but without context words (RNN-att); 5) our proposed method, an RNN with both an attention mechanism and context words (ARNN); 6) a system combining the CRF and the RNN network (CRF-ARNN). All results are compared (Part 1 in Tables
2 and
3). For the rule-based method, we first find all prepositions and, based on our experience and experimental statistics, take the five tokens after each preposition. Careful observation of the data showed that many temporal expressions appear after a preposition, so we then judge whether any of those five tokens relates to a temporal expression. We define a time dictionary listing words that can be part of a temporal expression, such as “month”, “week”, “day”, “hour”, “May”, “Monday”, “morning” and “once”. Next, we compare the five tokens against the time dictionary to determine whether they can represent a date or a precise time. Finally, we extract all contiguous tokens judged to relate to a temporal expression (together with a preceding definite article, if present). Some expressions do not follow a preposition and contain only one word; most of these share a prefix such as “pre”, “post” or “peri”, so we use a prefix rule to find the remaining expressions. The main features used to train the CRF and SVM classifiers are simple lexical features (word embeddings, part-of-speech tag, numeric type, capitalization, lower-cased form). BluLab: run 1–3 and the GUIR system are the previously best systems mentioned in Section 2. These results are shown in Tables
2 and
3 (Part 2).
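The rule-based procedure described above can be sketched as follows; the preposition list and time dictionary are small illustrative samples, not the full resources used in the experiments:

```python
# Minimal sketch of the rule-based temporal expression extractor:
# scan the five tokens after each preposition for time-dictionary hits,
# extend over contiguous time tokens, and apply the prefix rule.
PREPOSITIONS = {"in", "on", "at", "for", "since", "after", "before", "during", "over"}
TIME_WORDS = {"month", "months", "week", "weeks", "day", "days", "hour",
              "hours", "may", "monday", "morning", "once", "today"}
TIME_PREFIXES = ("pre", "post", "peri")

def is_time_token(tok):
    return tok.lower() in TIME_WORDS or tok.isdigit()

def extract_timex(tokens):
    spans = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in PREPOSITIONS:
            continue
        # scan the five tokens after the preposition for a time word
        j = i + 1
        limit = min(i + 6, len(tokens))
        while j < limit and not is_time_token(tokens[j]):
            j += 1
        if j < limit:
            start = end = j
            # extend over contiguous time-related tokens
            while end + 1 < len(tokens) and is_time_token(tokens[end + 1]):
                end += 1
            # include a preceding definite article, if any
            if start > 0 and tokens[start - 1].lower() == "the":
                start -= 1
            spans.append(tokens[start : end + 1])
    # prefix rule for one-word expressions like "postoperative"
    for tok in tokens:
        if tok.lower().startswith(TIME_PREFIXES) and len(tok) > 4:
            spans.append([tok])
    return spans

print(extract_timex("seen in 3 days for postoperative review".split()))
```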
Table 2
The temporal expression extraction results on colon cancer. Part 1 shows the results of the six methods we used for temporal expression extraction; Part 2 shows the result of the previously best system
| Method | P | R | F1 |
Part 1 | Rule-based | 0.412 | 0.594 | 0.486 |
CRF | 0.813 | 0.592 | 0.685 |
RNN | 0.662 | 0.629 | 0.645 |
RNN-att | 0.677 | 0.669 | 0.663 |
ARNN | 0.691 | 0.675 | 0.683 |
CRF-ARNN | 0.754 | 0.725 | 0.739 |
Part 2 | BluLab: run 1–3 | 0.797 | 0.664 | 0.725 |
Table 3
The temporal expression extraction results on brain cancer. We used six different methods for the task; their results are shown in Part 1. The result of the previously best system is shown in Part 2
| Method | P | R | F1 |
Part 1 | Rule-based | 0.33 | 0.52 | 0.41 |
CRF | 0.72 | 0.55 | 0.62 |
RNN | 0.63 | 0.57 | 0.60 |
RNN-att | 0.65 | 0.57 | 0.61 |
ARNN | 0.66 | 0.60 | 0.63 |
CRF-ARNN | 0.69 | 0.65 | 0.67 |
Part 2 | GUIR | 0.51 | 0.67 | 0.58 |
In both Tables
2 and
3, the rule-based method achieves the lowest results. Its recall is higher than its precision thanks to the well-defined dictionary. Error analysis shows that some tokens with the prefixes “pre”, “post” and “peri” are treated as temporal expressions when they should not be, and that the rule-based method often merges two adjacent but independent expressions into one. In Table
2, the RNN system's performance is lower than that of BluLab: run 1–3 (a ClearTK SVM pipeline using mainly simple lexical features along with information from rule-based systems). The rule-based information is effective but limited: rules can be tailored to the characteristics of the data, and we add no such rules to the RNN system. Error analysis shows that without an attention mechanism and context words, the RNN handles combinations of numbers and letters (e.g. 20 h, 3 days) poorly: the corresponding word vectors are initialized randomly, and temporal expressions contain many tokens of this type, so the model cannot learn their characteristics and fails to extract them correctly. After adding the attention mechanism and context words, the ARNN system achieves relatively good results. Given the strong CRF results, we combine the CRF with the ARNN and achieve the best result. From Table
3, we can see that the RNN outperforms the GUIR system, the current best system (an ensemble of CRF and decision-tree classifiers with lexical, syntactic, semantic, distributional and rule-based features). The GUIR system cannot extract previously unseen or atypical date formats well, which suggests its rules are not comprehensive enough. The plain RNN shares this problem, but once the attention mechanism is added it can extract more new and previously unseen formats. The ARNN and CRF-ARNN systems achieve the best results. In this part we have two test sets, one for colon cancer and one for brain cancer. We trained all models on the same training data and tested them on the two test sets; apart from the test data, the parameters are exactly the same. The experimental results show that our model remains effective on other test data.
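A minimal sketch of an RNN that combines attention over its hidden states with a context-word window, in the spirit of the ARNN described above; the weights, dimensions and window-averaging scheme are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def arnn_encode(emb, Wh, Wx, w_att, ctx=2):
    """Elman-style RNN whose step-t input mixes the word with a window
    of `ctx` context words; attention then pools the hidden states.
    All weights and dimensions are illustrative assumptions."""
    T = emb.shape[0]
    h = np.zeros(Wh.shape[0])
    states = []
    for t in range(T):
        lo, hi = max(0, t - ctx), min(T, t + ctx + 1)
        x = emb[lo:hi].mean(axis=0)     # simple average of the context window
        h = np.tanh(Wh @ h + Wx @ x)
        states.append(h)
    H = np.stack(states)                # (T, hidden)
    alpha = softmax(H @ w_att)          # one attention weight per token
    return alpha @ H                    # attention-weighted sentence vector

d, hidden, T = 4, 3, 6
emb = rng.normal(size=(T, d))
vec = arnn_encode(emb, rng.normal(size=(hidden, hidden)),
                  rng.normal(size=(hidden, d)), rng.normal(size=hidden))
print(vec.shape)  # (3,)
```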
Similarly, five methods are used for event extraction: 1) an SVM-based method; 2) a CRF-based system; 3) a general RNN without any attention mechanism or context words (RNN); 4) an RNN with a simple attention mechanism but without context words (RNN-att); 5) our proposed method, an RNN with both an attention mechanism and context words (ARNN). All results are evaluated (Part 1 in Tables
4 and
5). For event extraction, the SVM and CRF models obtain relatively good results on the colon cancer data but perform poorly on the brain cancer data compared with the best system (LIMSI). The RNN, however, achieves good results on both test sets, even surpassing the best system (LIMSI). As shown in both Tables
4 and
5, adding the attention mechanism and context words improves the results further.
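The SVM and CRF baselines above use the simple lexical features listed earlier; a per-token feature function might be assembled roughly as follows (word embeddings are omitted for brevity):

```python
# Sketch of the per-token lexical features for the SVM/CRF classifiers:
# the token itself, its POS tag, numeric type, capitalization, and
# lower-cased form.  Word-embedding features are omitted here.
def lexical_features(tokens, pos_tags, i):
    tok = tokens[i]
    return {
        "word": tok,
        "pos": pos_tags[i],
        "is_numeric": tok.isdigit(),
        "is_capitalized": tok[:1].isupper(),
        "lower": tok.lower(),
    }

feats = lexical_features(["The", "MRI", "showed"], ["DT", "NN", "VBD"], 1)
print(feats["pos"], feats["is_capitalized"])  # NN True
```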
Table 4
The event extraction results on colon cancer. Five different methods are used; all their results are shown in Part 1. Part 2 shows the result of the previously best system
| Method | P | R | F1 |
Part 1 | SVM | 0.860 | 0.843 | 0.851 |
CRF | 0.896 | 0.874 | 0.885 |
RNN | 0.893 | 0.897 | 0.893 |
RNN-att | 0.903 | 0.899 | 0.901 |
ARNN | 0.922 | 0.908 | 0.915 |
Part 2 | BluLab: run 1–3 | 0.887 | 0.864 | 0.875 |
Table 5
The event extraction results on brain cancer. We used five methods for the task; their results are compared in Part 1. The result of the best system is shown in Part 2
| Method | P | R | F1 |
Part 1 | SVM | 0.55 | 0.69 | 0.61 |
CRF | 0.68 | 0.80 | 0.73 |
RNN | 0.75 | 0.83 | 0.77 |
RNN-att | 0.77 | 0.79 | 0.78 |
ARNN | 0.82 | 0.78 | 0.80 |
Part 2 | LIMSI | 0.69 | 0.85 | 0.76 |
ER extraction is the key point of this paper. First, we compare our proposed model with the following methods: 1) a general RNN system without any attention mechanism or piecewise representation, using the sequence between the entity pair as input (RNN); 2) the same RNN using the whole sentence as input (RNN-whole). The results of RNN are better than those of RNN-whole, which means that sentence length affects system performance; we therefore use the sequence between the entity pair as input for the remaining systems. 3) a general RNN with an attention mechanism but without the piecewise representation (RNN-att); 4) a general RNN without an attention mechanism but with the piecewise representation (RNN-pie); 5) our proposed system using only word embeddings trained on Wikipedia (APRNN-wiki); 6) our proposed system using only word embeddings trained on BioASQ (APRNN-BioASQ); 7) our full proposed system, a recurrent neural network combining the attention mechanism and the piecewise representation (APRNN). All these results are evaluated (Part 1 in Tables
6 and
7). Except for models 5) and 6), all models use word embeddings trained on both Wikipedia and BioASQ. The results show that both the attention mechanism and the piecewise representation are useful; each improves the results to some extent. The effect of attention can be read directly from two pairs of experiments, 1) vs 3) and 4) vs 7), while 3) vs 7) and 1) vs 4) directly show the performance with and without the piecewise representation: model 3) differs from 7) only in lacking the piecewise representation, and model 4) differs from 7) only in lacking the attention mechanism. In each case the piecewise representation improves the result. Experiments 5) and 6) examine the impact of word embeddings: the two embedding sets lead to different results, and combining the two corpora (Wikipedia and BioASQ) improves the results slightly (APRNN). The experiments thus verify the factors that affect the results, namely the piecewise representation, the attention mechanism and the word embeddings, all of which help the model make better use of contextual information.
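The piecewise representation can be illustrated as follows: the sequence of hidden states is split at the two entity positions and each piece is pooled separately. The use of max-pooling and the dimensions here are illustrative assumptions, not necessarily the paper's exact configuration:

```python
import numpy as np

def piecewise_pool(H, e1, e2):
    """Split hidden states at the two entity positions (e1 < e2) and
    max-pool each piece; illustrative sketch of a piecewise
    representation for relation classification."""
    pieces = [H[: e1 + 1], H[e1 + 1 : e2 + 1], H[e2 + 1 :]]
    pooled = [p.max(axis=0) if len(p) else np.zeros(H.shape[1]) for p in pieces]
    return np.concatenate(pooled)       # (3 * hidden,)

H = np.arange(24, dtype=float).reshape(6, 4)   # 6 tokens, hidden size 4
vec = piecewise_pool(H, e1=1, e2=4)
print(vec.shape)  # (12,)
```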
Table 6
The ER classification results on colon cancer. Part 1 shows the results of the methods we used; Part 2 shows other related systems that achieved very good results
| Method | P | R | F1 |
Part 1 | RNN | 0.697 | 0.721 | 0.709 |
RNN-whole | 0.668 | 0.680 | 0.674 |
RNN-att | 0.719 | 0.715 | 0.717 |
RNN-pie | 0.717 | 0.709 | 0.713 |
APRNN-wiki | 0.727 | 0.717 | 0.722 |
APRNN-BioASQ | 0.731 | 0.723 | 0.727 |
APRNN | 0.733 | 0.711 | 0.729 |
Part 2 | BluLab: run 1–3 | 0.712 | 0.693 | 0.702 |
SVM | 0.678 | 0.658 | 0.668 |
Att-BLSTM | 0.721 | 0.715 | 0.718 |
Table 7
The ER classification results on brain cancer. The results of our proposed methods are shown in Part 1; Part 2 shows the results of other related work
| Method | P | R | F1 |
Part 1 | RNN | 0.61 | 0.59 | 0.60 |
RNN-whole | 0.59 | 0.55 | 0.57 |
RNN-att | 0.61 | 0.61 | 0.61 |
RNN-pie | 0.62 | 0.60 | 0.61 |
APRNN-wiki | 0.63 | 0.61 | 0.62 |
APRNN-BioASQ | 0.64 | 0.62 | 0.63 |
APRNN | 0.65 | 0.59 | 0.63 |
Part 2 | LIMSI | 0.53 | 0.66 | 0.59 |
SVM | 0.57 | 0.53 | 0.55 |
Att-BLSTM | 0.63 | 0.61 | 0.62 |
We compare our work with other related work. The LIMSI system achieved the best score on the ER task of SemEval-2017 Task 12. Li, Rao and Zhang (2016) proposed LitWay, a hybrid system that combines a LibSVM classifier with a rule-based method for relation extraction [
41]. It achieved the best score in the SeeDev task of BioNLP-ST 2016, so we use it as a benchmark for our system. The attention-based bidirectional LSTM proposed by Zhou et al. [21] was chosen as another benchmark model (Att-BLSTM); it outperforms most existing methods. It applies a word-level attention mechanism over a bidirectional LSTM to extract features from the sentence, using word vectors and position indicators as input features. For fairness, we also feed it the sequence between the entity pair as input, without the piecewise representation. The results are shown in Part 2 of Tables
6 and
7. The reported results come from our reimplementation of all of these approaches. The SVM system cannot capture the full information in the input sentence. Att-BLSTM achieves good results; however, since the input is only the sequence between the entity pair, Att-BLSTM cannot access the context information. The ER results show that APRNN performs better than the other systems. Both Tables
6 and
7 show that our proposed model (APRNN) can effectively extract biomedical entity relations. The APRNN model makes better use of contextual information, which is extremely important for extracting biomedical entity relations.
To further verify our approach, we also validate our system on the TimeBank-Dense data provided by the TempEval-3 (TE3) workshop [
42]. The TimeBank-Dense corpus contains 12,715 temporal relations over 36 documents taken from TimeBank 1.2 (22 training documents, 5 development documents and 9 test documents). It was created to address the sparsity issue in existing TimeML corpora: all pairs of events and time expressions are labeled, and some entity pairs do not lie in the same sentence. We still use the sequence between the entity pair as the input sentence, with the sequences before and after the two entities serving as context words. We select several systems for comparative experiments. Bethard proposed the ClearTK system [
43], the winner of TempEval-3. TempEval-3 uses TimeBank documents but removes a small portion of their events. Chambers et al. [
44] proposed the CAEVO system (a sieve-based architecture) on the TimeBank-Dense corpus; they achieved the state-of-the-art result, exceeding other systems by a large margin. All results are shown in Table
8. CAEVO makes specific settings (e.g. rule-based classifiers) for the data. Table
8 shows that our proposed model (APRNN) performs better than the comparative models on the TimeBank-Dense corpus.
Table 8
The temporal relation classification results on the TimeBank-Dense corpus
| Method | P | R | F1 |
ClearTK | 0.397 | 0.091 | 0.147 |
CAEVO | 0.508 | 0.506 | 0.507 |
APRNN | 0.511 | 0.507 | 0.509 |
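The three score columns in the tables above are precision, recall and F1; for reference, a minimal helper computing them from true positive, false positive and false negative counts:

```python
# Standard precision/recall/F1 from TP, FP and FN counts, as reported
# in the result tables above.
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(precision_recall_f1(40, 40, 10))  # P=0.5, R=0.8
```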
Experimental results show that the RNN model achieves good results in information extraction, while APRNN obtains the highest scores both with and without domain adaptation. The experiments show that our system has a degree of generality: it is not limited to specific data but is also suitable for other datasets.