Background
Named Entity Recognition (NER) is an important step in natural language processing (NLP). It has many applications in the general language domain, such as identifying person names, locations, and organizations, and it is crucial for biomedical literature mining as well [1,2]; many studies have focused on biomedical entities such as gene/protein names. There are two main types of approaches to identifying biomedical entities: rule-based approaches and supervised machine learning based approaches. While rule-based approaches use existing biomedical knowledge/resources, supervised machine learning based approaches rely heavily on annotated training data and domain dictionaries. The advantage of rule-based approaches is that they are easily customized to new vocabularies and author styles, while supervised machine learning approaches often report better results when the task domain does not change and training data is plentiful. Among supervised machine learning methods, Support Vector Machines (SVM) and Conditional Random Fields (CRF) are the two most common algorithms that have been successfully used for NER in general and in the biomedical domain in particular [2-7].
One way to harness the advantages of both approaches is to combine them into an ensemble classifier [4,6,8]. Zhou et al. [8] combined three classifiers, one SVM and two discriminative Hidden Markov Models, into an ensemble classifier with a simple majority voting strategy, and reported the best result for the protein/gene name recognition task in BioCreAtIvE task 1A (gene mention identification). Smith et al. [4] showed that most of the top NER systems in the BioCreAtIvE II challenge for gene mention tagging combined results from multiple classifiers using simple heuristic rules. In a similar way, Torii et al. [6] used a majority voting scheme to combine the recognition results of four systems into a single system, called BioTagger-GM, and reported a higher F-score than the first-place system in the BioCreAtIvE II challenge.
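To make the voting idea concrete, the sketch below shows token-level majority voting over the BIO-style tag sequences of three classifiers. The tag names and the tie-breaking rule are illustrative assumptions, not the exact schemes used by the systems cited above.

```python
from collections import Counter

def majority_vote(tag_sequences):
    """Combine the BIO tag sequences of several classifiers by picking,
    at each token position, the tag most classifiers agree on; ties are
    broken in favor of the first classifier's prediction (an assumption)."""
    combined = []
    for position_tags in zip(*tag_sequences):
        counts = Counter(position_tags)
        top_tag, top_count = counts.most_common(1)[0]
        if list(counts.values()).count(top_count) > 1:  # no clear winner
            top_tag = position_tags[0]
        combined.append(top_tag)
    return combined

# Three systems tagging the (hypothetical) phrase "Lasix 40 mg orally":
medex = ["B-m", "B-do", "I-do", "B-mo"]
crf   = ["B-m", "B-do", "I-do", "O"]
svm   = ["B-m", "O",    "I-do", "B-mo"]
print(majority_vote([medex, crf, svm]))  # ['B-m', 'B-do', 'I-do', 'B-mo']
```

Token-level voting is the simplest variant; more elaborate schemes vote on whole entities or weight classifiers by confidence.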
The 2009 i2b2 NLP challenge aimed to extract medication information (i.e., medication name, dosage, mode, frequency, duration, and reason) from de-identified hospital discharge summaries [9]. The different types of information were called fields and are described in Table 1. Note that fields may include phrases that are not noun phrases, such as "as long as needed" or "until the symptom disappears". Participating systems were asked to extract the text corresponding to each field for each medication mention. Among the top ten systems in the 2009 i2b2 challenge, two were machine learning based: the Sydney team ranked first and the Wisconsin team ranked tenth in the final evaluation [9]. Both systems used a similar volume of training data: 145 notes for the Sydney team and 147 notes for the Wisconsin team. The difference was that the Sydney team chose the 145 longest notes, while the Wisconsin team randomly selected 147 notes from the training data [10]. The second best system, from the Vanderbilt team, was a rule-based system that extended their MedEx system [11]. More recently, the i2b2 organizers applied a maximum entropy (ME) model to the same training data as the Sydney team and reported results comparable to the top systems in the challenge [12,13].
Table 1
Numbers of each field in the i2b2 2009 dataset, with examples and descriptions
Field | Count | Examples | Description |
Medication | 12773 | “Lasix,” “Caltrate plus D,” “fluocinonide 0.5% cream,” “TYLENOL (ACETAMINOPHEN)” | Prescription substances, biological substances, over-the-counter drugs, excluding diet, allergy, lab/test, alcohol. |
Dosage | 4791 | “1 TAB,” “One tablet,” “0.4 mg,” “0.5 m.g.,” “100 MG,” “100 mg x 2 tablets” | The amount of a single medication used in each administration. |
Mode | 3552 | “Orally,” “Intravenous,” “Topical,” “Sublingual” | Describes the method for administering the medication. |
Frequency | 4342 | “Prn,” “As needed,” “Three times a day as needed,” “As needed three times a day,” “x3 before meal,” “x3 a day after meal as needed” | Terms, phrases, or abbreviations that describe how often each dose of the medication should be taken. |
Duration | 597 | “x10 days,” “10-day course,” “For ten days,” “For a month,” “During spring break,” “Until the symptom disappears,” “As long as needed” | Expressions that indicate for how long the medication is to be administered. |
Reason | 1534 | “Dizziness,” “Dizzy,” “Fever,” “Diabetes,” “frequent PVCs,” “rare angina” | The medical reason for which the medication is stated to be given. |
From the perspective of supervised machine learning, the medication extraction task in the 2009 i2b2 challenge can be divided into two steps: 1) identifying fields, and 2) determining the relationships between the detected medication names and the other fields. The first step was treated as a sequence labeling task in which fields were considered named entities [10,13]. In this paper, for convenience, we use the terms "named entities" or "entities" to refer to fields, following the usage in [10,13]. For this NER step, both the Sydney and the Wisconsin teams used CRFs to detect fields, while Halgrim et al. [13] used maximum entropy based classifiers. Using the test data from the challenge, Doan and Xu [14] investigated using the output of the rule-based MedEx system as features for an SVM and showed that those features could substantially improve an SVM-based NER system. The combination of multiple classifiers into an ensemble presents another opportunity to improve NER performance on the i2b2 task, but it has not been investigated yet. To the best of our knowledge, this is the first study to investigate ensemble classifiers for recognizing medication-relevant entities in clinical text.
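As an illustration of the sequence labeling formulation, the snippet below encodes a short, hypothetical medication mention with BIO tags. The tag inventory (B-/I- prefixes plus the field abbreviations used later in Table 5) is our assumption of a typical encoding, not necessarily the exact scheme used by the cited systems.

```python
# A hypothetical discharge-summary fragment with BIO labels:
# B-/I- mark the beginning/inside of a field, "O" is outside any field
# ("m" = medication, "do" = dosage, "f" = frequency, "r" = reason).
tokens = ["Lasix", "40",   "mg",   "twice", "a",   "day", "for", "edema"]
labels = ["B-m",   "B-do", "I-do", "B-f",   "I-f", "I-f", "O",   "B-r"]
assert len(tokens) == len(labels)
```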
In this study, we revisit the NER problem for clinical text from the 2009 i2b2 challenge and examine the combination of three classifiers: a rule-based system (MedEx, the core of the second-ranked system in the 2009 i2b2 challenge), an SVM-based NER system, and a CRF-based NER system. Ensemble classifiers are built using different combination methods and are evaluated on the challenge data set. Our study provides valuable insights into the NER task for medical entities in clinical text. Throughout the paper, we compare our results against the top-ranked, state-of-the-art system in the i2b2 challenge task, developed by the Sydney group.
Results and discussion
We measured Precision, Recall, and F-score using the standard CoNLL evaluation script (the Perl program conlleval.pl) [28]. Precision is the ratio of the number of NEs correctly identified by the system to the total number of NEs found by the system; Recall is the ratio of the number of NEs correctly identified by the system to the number of NEs in the gold standard; and F-score is the harmonic mean of Precision and Recall. For the whole system, we report micro-averaged Precision, Recall, and F-score: the micro-average pools counts over all NE types, so each type is weighted by its frequency among all NEs.
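The sketch below shows how the micro-averaged scores are computed from per-field entity counts; the counts here are made-up placeholders (only the gold totals echo Table 1), and conlleval.pl computes the same quantities from token-level output.

```python
def prf(correct, predicted, gold):
    """Precision, recall, and F-score from entity counts."""
    p = correct / predicted if predicted else 0.0
    r = correct / gold if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical per-field counts: (correctly identified, system, gold).
counts = {
    "medication": (11500, 12300, 12773),
    "dosage": (4300, 4600, 4791),
}

# Micro-average: pool the counts over all fields, then compute the
# metrics once, so each field is weighted by its frequency.
correct = sum(c[0] for c in counts.values())
predicted = sum(c[1] for c in counts.values())
gold = sum(c[2] for c in counts.values())
p, r, f = prf(correct, predicted, gold)
print("micro P/R/F = %.2f/%.2f/%.2f" % (100 * p, 100 * r, 100 * f))
```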
Experiments were run on a Linux machine with 4 GB RAM and a 4-core Intel Xeon 2.0 GHz CPU.
Results for the first setting: 10-fold cross-validation
First, we evaluated the effectiveness of the individual features as well as their combinations. Table 4 shows the Precision, Recall, and F-score of the SVM-based NER system when different combinations of feature sets were used. In all tables, "ALL" denotes the set of all named entities, with values corresponding to micro-averaged Precision/Recall/F-score. Table 4 shows that the best F-score, 90.54%, was achieved when all features were used in the SVM-based NER system. Semantic tag features from MedEx contributed the most to performance (89.47% F-score), compared with the remaining features such as history (83.81% F-score) or orthography (86.15% F-score). The difference between using and not using semantic tags was significant according to the approximate randomization test described above. Similar results (not shown) were observed for the CRF-based system.
Table 4
Performance of the SVM-based system for different feature combinations in 10-fold cross-validation. Each row marks with "*" which of the six feature types were included (the first feature is present in every run; the second, fifth, and sixth columns correspond to the history, orthography, and semantic tag features cited in the text); the final column gives micro-averaged Precision/Recall/F-score (%)
* | | | | | | 87.09/77.05/81.76 |
* | * | | | | | 90.34/78.17/83.81 |
* | | * | | | | 91.81/80.74/85.91 |
* | | | * | | | 89.92/78.54/83.84 |
* | | | | * | | 91.62/81.32/86.15 |
* | | | | | * | 92.38/86.73/89.47 |
* | * | * | | | | 91.72/81.08/86.06 |
* | * | * | * | | | 91.81/81.06/86.10 |
* | * | * | * | * | | 91.78/81.29/86.22 |
* | * | * | * | * | * | 93.75/87.55/90.54 |
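As a rough illustration of the feature families in Table 4, the sketch below builds a feature map for one token, of the kind that could be fed to an SVM or CRF. The exact feature templates, and the medex_tags input (the semantic tag MedEx assigns to each token, precomputed here), are our assumptions rather than the paper's precise definitions.

```python
import re

def token_features(tokens, i, medex_tags, prev_labels):
    """Illustrative feature map for token i, covering several of the
    feature families of Table 4: the word and its neighbors, simple
    orthographic clues, the previous label (history), and the semantic
    tag that MedEx assigned to the token."""
    w = tokens[i]
    shape = re.sub(r"\d", "9", re.sub(r"[a-z]", "x", re.sub(r"[A-Z]", "X", w)))
    return {
        "word": w.lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<EOS>",
        "is_digit": w.isdigit(),        # orthography
        "is_upper": w.isupper(),        # orthography
        "shape": shape,                 # e.g. "Xxxxx" for "Lasix"
        "history": prev_labels[i - 1] if i > 0 else "<START>",
        "medex_tag": medex_tags[i],     # hypothetical tags, e.g. "DRUG"
    }

# Example usage on a short fragment:
tokens = ["Lasix", "40", "mg"]
medex = ["DRUG", "DOSE", "DOSE"]
print(token_features(tokens, 1, medex, ["B-m"]))
```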
Second, we compared the performance of the MedEx, CRF-based, and SVM-based NER systems (the CRF-based and SVM-based systems used all six types of features). The results for each field and overall (micro-averaged scores) are given in Table 5, and the results of significance tests for the differences in performance between the methods are shown in Table 6. As shown, when all features are used, CRF and SVM give significantly higher F-scores (according to randomization tests) than the customized MedEx system, which served as the baseline in this experiment. This can be explained by the fact that CRF and SVM harness the advantages of both the rule-based system (i.e., features from MedEx) and the machine learning algorithms. We also note that overall (column "ALL"), the performance of the CRF-based and SVM-based systems is comparable: CRF achieved a 90.48% F-score and SVM a 90.54% F-score. Considering each field separately, significant differences between the two systems were observed for the dosage, frequency, and duration fields: the SVM-based system performed better for dosage and frequency, while the CRF-based system performed better for duration. For the remaining three fields, the SVM-based system achieved higher F-scores for medication and mode, and the CRF-based system scored higher for reason; however, these differences were not statistically significant.
Table 5
Results from the customized MedEx system, the CRF (all features) and SVM (all features) systems, and Simple Majority, Local CRF-based, and Local SVM-based voting in 10-fold cross-validation: "m" stands for medication, "do" for dosage, "mo" for mode, "f" for frequency, "du" for duration, "r" for reason
| | ALL | m | do | mo | f | du | r |
Customized MedEx
| Pre | 89.57 | 90.33 | 95.01 | 96.26 | 92.09 | 51.19 | 62.10 |
| Re | 84.01 | 89.10 | 82.88 | 86.95 | 88.50 | 58.82 | 47.93 |
| F-score | 86.67 | 89.68 | 88.50 | 91.32 | 90.16 | 54.20 | 53.78 |
CRF
| Pre | 94.38 | 93.99 | 96.47 | 97.63 | 95.61 | 77.40 | 79.34 |
| Re | 86.92 | 90.38 | 89.42 | 92.11 | 91.38 | 62.13 | 43.41 |
| F-score | 90.48 | 92.13 | 92.79 | 94.77 | 93.42 | 68.64 | 55.74 |
SVM
| Pre | 93.75 | 93.84 | 95.40 | 97.13 | 95.68 | 70.42 | 74.46 |
| Re | 87.55 | 90.76 | 91.46 | 93.27 | 92.69 | 48.21 | 44.50 |
| F-score | 90.54 | 92.26 | 93.38 | 95.14 | 94.14 | 56.89 | 55.50 |
Simple Majority Voting
| Pre | 93.99 | 93.62 | 96.39 | 97.27 | 95.44 | 73.11 | 77.91 |
| Re | 87.17 | 90.72 | 89.71 | 92.45 | 91.96 | 53.97 | 45.82 |
| F-score | 90.43 | 92.12 | 92.91 | 94.78 | 93.63 | 61.65 | 57.37 |
Local CRF-Based Voting
| Pre | 94.11 | 93.86 | 95.43 | 97.16 | 95.65 | 70.58 | 85.81 |
| Re | 87.81 | 90.79 | 91.49 | 93.27 | 92.64 | 65.76 | 40.87 |
| F-score | 90.84 | 92.28 | 93.40 | 95.16 | 94.11 | 67.78 | 55.01 |
Local SVM-Based Voting
| Pre | 93.32 | 93.88 | 95.40 | 97.16 | 95.65 | 70.58 | 70.27 |
| Re | 88.19 | 90.79 | 91.44 | 93.24 | 92.64 | 65.76 | 46.99 |
| F-score | 90.67 | 92.30 | 93.37 | 95.14 | 94.11 | 67.78 | 56.08 |
Table 6
Statistical significance tests for differences in performance using approximate randomization in 10-fold cross-validation. Each cell lists the entity types ("all" plus the field abbreviations of Table 5) for which the difference between the row system and the column system was significant; "NS" denotes no significant difference
| CRF | SVM | Simple Majority Voting | Local CRF-Based Voting | Local SVM-Based Voting |
Customized MedEx
| all, m, mo, do, f, du, r | all, m, mo, do, f, du | all, m, mo, do, f, du, r | all, m, mo, do, f, du | all, m, mo, do, f, du |
CRF
| | do, f, du | du | all, do, du | do, f |
SVM
| | | du | du | du |
Simple Majority Voting
| | | | all, do, du | du |
Local CRF-Based Voting
| | | | | NS |
Among the six NE fields, duration and reason proved the most difficult: no tested method exceeded a 69% F-score for duration or a 58% F-score for reason. Two factors may explain the lower scores for these fields: 1) their training data was smaller than for the other fields; there are only 597 duration fields and 1,534 reason fields, compared to 12,773 medication fields; and 2) the definitions of the frequency and duration fields may confuse the classifiers, e.g., "as needed" is defined as frequency but "as long as needed" is defined as duration.
Among all tested methods, local CRF-based voting achieved the highest overall F-score (90.84%), followed by local SVM-based voting (90.67%); both exceeded the scores of the two single classifiers, CRF and SVM. According to the randomization tests, the local CRF-based voting system performed significantly better than the single CRF system overall and for dosage, but was significantly less accurate than the single CRF system in recognizing duration. A significant difference between local CRF-based voting and the single SVM was observed only for duration, for which the two systems achieved F-scores of 67.78% and 56.89%, respectively.
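For reference, the sketch below implements a paired approximate randomization test of the kind used for the comparisons in Table 6. It assumes per-document scores for the two systems; the actual test may instead shuffle whole system outputs and recompute F-scores, so this is a simplified variant under that assumption.

```python
import random

def approx_randomization(scores_a, scores_b, trials=9999, seed=0):
    """Paired approximate randomization test: swap the two systems'
    per-document scores with probability 0.5 and count how often the
    shuffled mean difference is at least as large as the observed one."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            hits += 1
    # Add-one smoothing gives a conservative p-value estimate.
    return (hits + 1) / (trials + 1)

# Example with illustrative per-note F-scores of two systems;
# a small p (e.g., below 0.05) indicates a significant difference.
sys_a = [0.91, 0.88, 0.93, 0.90, 0.87]
sys_b = [0.89, 0.86, 0.92, 0.88, 0.85]
print("p =", approx_randomization(sys_a, sys_b))
```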
Results for the second setting: the 2009 i2b2 challenge with Sydney training data
To compare our results with the first-ranked system (from the Sydney team in the 2009 i2b2 challenge), we trained on the same dataset as their system and applied the resulting models to the standard test set. To evaluate the NER performance of the Sydney team's system, we took their three challenge submissions and chose the best one (submission 3). For comparison, the results from the Sydney team, the customized MedEx system, the single SVM-based and CRF-based classifiers, and the three combination methods are shown in Table 7. The results of statistical significance tests for the differences in performance between the methods are presented in Table 8.
Table 7
Results from the first-ranked system (Sydney), the customized MedEx system, the CRF (all features) and SVM (all features) systems, and Simple Majority, Local CRF-based, and Local SVM-based voting on the test set from the 2009 i2b2 challenge: "m" stands for medication, "do" for dosage, "mo" for mode, "f" for frequency, "du" for duration, "r" for reason
| | ALL | m | do | mo | f | du | r |
Sydney
| Pre | 93.78 | 93.51 | 94.78 | 96.45 | 96.59 | 69.71 | 76.04 |
| Re | 85.03 | 88.31 | 88.91 | 91.28 | 91.25 | 40.93 | 38.83 |
| F-score | 89.19 | 90.84 | 91.75 | 93.80 | 93.85 | 51.58 | 51.41 |
Customized MedEx
| Pre | 89.51 | 89.97 | 94.95 | 96.23 | 92.18 | 50.94 | 62.31 |
| Re | 84.95 | 89.68 | 84.04 | 88.14 | 89.80 | 60.62 | 47.87 |
| F-score | 87.17 | 89.83 | 89.16 | 92.01 | 90.97 | 55.36 | 54.14 |
CRF
| Pre | 94.11 | 93.71 | 95.92 | 97.26 | 95.60 | 71.88 | 77.52 |
| Re | 84.89 | 89.19 | 87.37 | 90.19 | 90.73 | 38.86 | 40.97 |
| F-score | 89.26 | 91.39 | 91.44 | 93.59 | 93.10 | 50.45 | 53.63 |
SVM
| Pre | 93.35 | 93.98 | 94.79 | 96.56 | 95.37 | 68.73 | 68.54 |
| Re | 85.42 | 89.18 | 88.73 | 91.01 | 91.71 | 38.34 | 40.75 |
| F-score | 89.21 | 91.51 | 91.66 | 93.71 | 93.50 | 49.22 | 51.12 |
Simple Majority Voting
| Pre | 93.91 | 93.62 | 95.86 | 97.23 | 95.58 | 72.73 | 75.90 |
| Re | 85.76 | 90.19 | 87.62 | 90.44 | 91.20 | 44.21 | 43.67 |
| F-score | 89.65 | 91.87 | 91.55 | 93.71 | 93.34 | 54.99 | 55.44 |
Local CRF-Based Voting
| Pre | 94.20 | 93.96 | 94.84 | 96.56 | 95.27 | 74.07 | 83.39 |
| Re | 85.11 | 89.18 | 88.80 | 91.01 | 91.71 | 34.54 | 37.13 |
| F-score | 89.42 | 91.51 | 91.72 | 93.71 | 93.46 | 47.11 | 51.38 |
Local SVM-Based Voting
| Pre | 93.03 | 94.06 | 94.77 | 96.56 | 95.37 | 66.94 | 65.83 |
| Re | 85.76 | 89.18 | 88.71 | 90.98 | 91.66 | 42.66 | 44.52 |
| F-score | 89.24 | 91.55 | 91.64 | 93.69 | 93.48 | 52.11 | 53.12 |
Table 8
Statistical significance tests for differences in performance using approximate randomization on the test set from the 2009 i2b2 challenge. Each cell lists the entity types for which the difference between the row system and the column system was significant; "NS" denotes no significant difference
| Customized MedEx | CRF | SVM | Simple Majority Voting | Local CRF-Based Voting | Local SVM-Based Voting |
Sydney
| all, m, mo, do, f | m | m | all, m, du, r | m,du | m |
Customized MedEx
| | all, m, mo, do, f, du | all, m, mo, do, f, du, r | all, m, mo, do, f | all, m, mo, do, f, du | all, m, mo, do, f, du |
CRF
| | | NS | all, m, du | du | du |
SVM
| | | | all, du, r | NS | NS |
Simple Majority Voting
| | | | | du, r | all, du, r |
Local CRF-Based Voting
| | | | | | du |
As in the first experimental setting, the customized MedEx system lagged behind the machine learning and ensemble systems, with the lowest overall F-score of 87.17% (column "ALL" in Table 7). Tables 7 and 8 show that, when used individually, the SVM-based and CRF-based methods, with semantic tags from the customized MedEx system as one of the input features, were comparable to the Sydney team's system: the SVM-based and CRF-based systems and the Sydney team's system achieved overall F-scores of 89.21%, 89.26%, and 89.19%, respectively. At the same time, the highest and second-highest overall F-scores came from two ensemble methods: simple majority voting achieved an 89.65% F-score and local CRF-based voting an 89.42% F-score. The score of majority voting was significantly higher than that of any single method, including the Sydney team's, as determined by the statistical tests (Table 8). Considering each field separately, majority voting consistently outperformed the single methods in recognizing duration and reason. For medication, the most numerous field, majority voting also performed significantly better than the Sydney team's method, the customized MedEx system, and the CRF-based system. The single CRF-based and SVM-based systems, as well as the local CRF-based and SVM-based voting systems, are not significantly different from one another, except for some variation in F-scores for the duration field.
Both experimental settings provide empirical evidence that the ensemble classifiers outperform single classifiers. The best system in 10-fold cross-validation was the local CRF-based voting system, while the best system on the held-out test set was simple majority voting. Since the training and test data differ between the two settings, it is difficult to single out one voting method as best for all data.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SD, XH, and TMP developed the idea for this study. The project was supervised by XH, TMP, and NC. SD designed and carried out the experiments. Data analysis was conducted by SD, PHD, and TMP. The manuscript was prepared by SD with additional contributions by all authors. All authors read and approved the final manuscript.