Background
The growing volume of medical literature and textual data in online repositories has led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews (SRs). The use of text mining (TM) tools and machine learning techniques (MLTs) to aid citation screening is an increasingly popular approach to reduce the human workload and increase the efficiency of completing SRs [1-6].
With its 28 million citations, PubMed is the most prominent free online source of biomedical literature, continuously updated and organized in a hierarchical structure that facilitates article identification [7]. When searching PubMed with keyword queries, researchers usually retrieve a small number of papers relevant to the review question and a much larger number of irrelevant ones. Under such class imbalance, the most common machine learning classifiers, used to differentiate relevant from irrelevant texts without human assistance, are biased towards the majority class and perform poorly on the minority one [8, 9]. Three main sets of approaches can be applied to deal with imbalance [9]. The first is data pre-processing: either majority-class samples are removed (undersampling) or minority-class samples are added (oversampling) to make the data more balanced before an MLT is applied [8, 10]. The second is the set of algorithmic approaches, which rely on cost-sensitive classification: a penalty is attached to cases misclassified in the minority class, with the aim of balancing the weight of false positive and false negative errors on the overall accuracy [11]. The third is the set of ensemble methods, which apply both resampling techniques and penalties for minority-class misclassification to boosting and bagging classifiers [12, 13].
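The two pre-processing families above can be sketched as follows; this is a minimal, library-free illustration (the function names and the minority-share parameterization are ours, not taken from the study):

```python
import random

def rus(minority, majority, minority_share=0.5, seed=0):
    """Random undersampling (RUS): drop majority-class samples until
    the minority class makes up `minority_share` of the dataset."""
    rng = random.Random(seed)
    n_major = round(len(minority) * (1 - minority_share) / minority_share)
    return minority, rng.sample(majority, min(n_major, len(majority)))

def ros(minority, majority, minority_share=0.5, seed=0):
    """Random oversampling (ROS): duplicate minority-class samples until
    the minority class makes up `minority_share` of the dataset."""
    rng = random.Random(seed)
    n_minor = round(len(majority) * minority_share / (1 - minority_share))
    extra = [rng.choice(minority) for _ in range(max(0, n_minor - len(minority)))]
    return minority + extra, majority

# e.g. 10 relevant vs. 1000 irrelevant abstracts, rebalanced to 35:65
relevant, irrelevant = rus(list(range(10)), list(range(1000)), minority_share=0.35)
```

In practice, the imbalanced-learn Python package offers equivalent, production-grade implementations (`RandomUnderSampler`, `RandomOverSampler`).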
This study examines to what extent class imbalance challenges the performance of four traditional MLTs for automatic binary text classification (i.e., relevant vs irrelevant to a review question) of PubMed abstracts. It also investigates whether the balancing techniques considered can be recommended to increase MLT accuracy in the presence of class imbalance.
Results
Table 2 reports cross-validated AUC-ROC values for each strategy, stratified by SR; the strategies combine no balancing with random oversampling (ROS) and random undersampling (RUS), each at 50:50 and 35:65 class ratios. In general, all the strategies achieved very high cross-validated performance. Among the methods for handling class imbalance, ROS-50:50 and RUS-35:65 gave the best results. For k-NN, high performance was obtained only when no balancing technique was applied: the application of any method for class imbalance dramatically hampered its performance. A gain was observed for GLMNet and RF when coupled with a balancing technique, whereas no gain was observed for SVM.
Table 2 AUC-ROC values by combination of MLTs, balancing techniques and balancing ratios across 14 systematic reviews (None = no balancing technique; ROS = random oversampling; RUS = random undersampling; ratios are positive:negative class proportions)

MLT | Systematic review | None | ROS-35:65 | ROS-50:50 | RUS-35:65 | RUS-50:50 |
GLMNet | Cavender et al. 2014 [26] | 0.9667 | 1 | 1 | 0.9988 | 1 |
 | Chatterjee et al. 2014 [27] | 0.9738 | 0.9667 | 0.9667 | 0.9875 | 0.9963 |
 | Douxfils et al. 2014 [23] | 0.9667 | 0.9988 | 0.9988 | 1 | 0.9988 |
 | Funakoshi et al. 2014 [28] | 0.8851 | 0.9602 | 0.9799 | 0.9794 | 0.9885 |
 | Kourbeti et al. 2014 [24] | 0.9518 | 0.9921 | 0.9991 | 0.9918 | 0.9991 |
 | | 0.9 | 1 | 1 | 0.9975 | 0.97 |
 | | 0.8975 | 0.8975 | 0.9475 | 0.99 | 0.9375 |
 | | 0.915 | 0.98 | 1 | 0.9983 | 0.9975 |
 | | 1 | 1 | 1 | 0.9963 | 0.9963 |
 | | 1 | 1 | 1 | 1 | 0.9875 |
 | | 0.9667 | 1 | 0.9988 | 0.995 | 0.9863 |
 | | 0.9667 | 1 | 1 | 0.9988 | 0.9988 |
 | | 0.975 | 0.975 | 1 | 1 | 1 |
 | | 1 | 1 | 1 | 1 | 0.98 |
k-nearest neighbors | Cavender et al. 2014 [26] | 1 | 0.5113 | 0.5063 | 0.5013 | 0.5792 |
 | Chatterjee et al. 2014 [27] | 0.9988 | 0.5388 | 0.5363 | 0.5063 | 0.6333 |
 | Douxfils et al. 2014 [23] | 0.9667 | 0.5213 | 0.5113 | 0.5075 | 0.5625 |
 | Funakoshi et al. 2014 [28] | 0.9955 | 0.5005 | 0.5 | 0.5 | 0.5885 |
 | Kourbeti et al. 2014 [24] | NA | NA | NA | 0.5 | 0.5661 |
 | | 0.9775 | 0.63 | 0.6125 | 0.5125 | 0.7775 |
 | | 0.7975 | 0.685 | 0.59 | 0.5675 | 0.71 |
 | | 0.9975 | 0.5017 | 0.5017 | 0.5 | 0.5983 |
 | | 1 | 0.5075 | 0.505 | 0.5025 | 0.6996 |
 | | 0.9875 | 0.59 | 0.57 | 0.515 | 0.71 |
 | | 0.9283 | 0.51 | 0.5063 | 0.5 | 0.5625 |
 | | 1 | 0.5056 | 0.5056 | 0.5 | 0.5237 |
 | | 0.9404 | 0.5288 | 0.52 | 0.5025 | 0.6333 |
 | | 1 | 0.675 | 0.6425 | 0.54 | 0.71 |
Random forest | Cavender et al. 2014 [26] | 1 | 1 | 1 | 1 | 1 |
 | Chatterjee et al. 2014 [27] | 0.9167 | 0.975 | 0.975 | 0.9963 | 1 |
 | Douxfils et al. 2014 [23] | 1 | 1 | 1 | 1 | 1 |
 | Funakoshi et al. 2014 [28] | 0.9184 | 0.9517 | 0.9299 | 0.9895 | 0.9895 |
 | Kourbeti et al. 2014 [24] | 0.9918 | 0.9854 | 0.9854 | 0.9988 | 0.9984 |
 | | 0.95 | 1 | 1 | 1 | 1 |
 | | 0.8 | 0.9 | 0.9 | 0.9 | 0.9475 |
 | | 0.98 | 0.9992 | 0.9783 | 0.9992 | 0.9992 |
 | | 1 | 1 | 1 | 0.9988 | 0.9988 |
 | | 0.95 | 0.95 | 0.95 | 1 | 1 |
 | | 0.9988 | 0.9988 | 0.9988 | 0.9975 | 0.9963 |
 | | 0.9815 | 0.9821 | 0.9827 | 0.9994 | 0.9975 |
 | | 0.95 | 0.975 | 0.95 | 0.9083 | 0.9046 |
 | | 1 | 1 | 1 | 1 | 0.995 |
Support vector machines | Cavender et al. 2014 [26] | 1 | 1 | 1 | 1 | 0.825 |
 | Chatterjee et al. 2014 [27] | 1 | 1 | 0.9988 | 1 | 0.9263 |
 | Douxfils et al. 2014 [23] | 1 | 1 | 1 | 0.9963 | 0.8338 |
 | Funakoshi et al. 2014 [28] | 0.999 | 0.999 | 0.9985 | 0.9945 | 0.975 |
 | Kourbeti et al. 2014 [24] | 0.9927 | 0.9927 | 0.9991 | 0.9988 | 0.9875 |
 | | 1 | 0.9975 | 0.9975 | 0.9325 | 0.5625 |
 | | 0.85 | 0.9 | 0.9925 | 0.98 | 0.6775 |
 | | 1 | 1 | 1 | 0.9992 | 0.96 |
 | | 1 | 1 | 1 | 0.9988 | 0.785 |
 | | 1 | 1 | 1 | 0.99 | 0.62 |
 | | 0.9333 | 0.9333 | 1 | 0.995 | 0.8013 |
 | | 1 | 0.9857 | 1 | 0.9988 | 0.9681 |
 | | 0.975 | 0.9417 | 0.9654 | 0.995 | 0.8825 |
 | | 1 | 1 | 1 | 1 | 0.7425 |
Meta-analytic analyses (see Fig. 3) show a significant improvement for the GLMNet classifier under any strategy to manage the imbalance (minimum delta AUC of + 0.04 with [+ 0.02, + 0.06] 95% CI, reached using ROS-35:65). For k-NN, all strategies drastically and significantly hampered the performance of the classifier in comparison with k-NN alone (maximum delta AUC of − 0.38 with [− 0.39, − 0.36] 95% CI, reached using RUS-50:50). For the RF classifier, the worst performance was reached using ROS-50:50, the only case in which RF did not show a significant improvement (delta AUC + 0.01 with [− 0.01, + 0.03] 95% CI); in all the other cases, the improvements were significant. Last, the use of an SVM in combination with strategies to manage the imbalance shows no clear pattern in performance: with RUS-50:50, performance decreased significantly (delta AUC − 0.13 with [− 0.15, − 0.11] 95% CI); ROS-35:65 did not seem to have any effect (delta AUC 0.00 with [− 0.02, + 0.02] 95% CI); and for both ROS-50:50 and RUS-35:65, performance improved in the same way (delta AUC + 0.01 with [− 0.01, + 0.03] 95% CI), though not significantly.
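The pooling behind such delta-AUC summaries can be illustrated with a standard DerSimonian-Laird random-effects estimator; the per-review estimates and variances below are hypothetical inputs, not the study's data:

```python
import math

def dl_pool(deltas, variances):
    """DerSimonian-Laird random-effects pooling of per-review
    delta-AUC estimates; returns (pooled delta, 95% CI)."""
    w = [1 / v for v in variances]                        # fixed-effect weights
    fixed = sum(wi * d for wi, d in zip(w, deltas)) / sum(w)
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, deltas))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(deltas) - 1)) / c)          # between-review variance
    w_re = [1 / (v + tau2) for v in variances]            # random-effects weights
    pooled = sum(wi * d for wi, d in zip(w_re, deltas)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)
```

With one delta-AUC estimate and variance per systematic review, `dl_pool` yields a pooled delta and confidence interval of the kind reported above.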
Discussion
The application of MLTs in TM has shown potential for automating literature searches from online databases [1-5]. Although it is difficult to establish overall conclusions about the best approaches, it is clear that efficiencies and reductions in workload are potentially achievable [6].
This study compares different combinations of MLTs and pre-processing approaches to deal with imbalance in text classification as part of the screening stage of an SR. The aim of the proposed approach is to enable researchers to conduct comprehensive SRs by extending existing literature searches from PubMed to other repositories such as ClinicalTrials.gov, where documents with a comparable word characterisation could be accurately identified by a classifier trained on PubMed, as illustrated in [14]. Thus, for real-world applications, researchers first run the search string on citation databases and select the studies to include in the SR; they then add a negation operator to the same search string to retrieve the negative citations. Next, they can use the information retrieved from the selected studies to train an ML classifier to apply to the corpus of trials retrieved from ClinicalTrials.gov.
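This train-on-PubMed, score-ClinicalTrials.gov workflow can be sketched as follows, assuming scikit-learn; all documents and labels here are invented toy examples, and the TF-IDF/SVM choice is one of the techniques studied, not a prescribed pipeline:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical labeled PubMed abstracts: 1 = relevant, 0 = irrelevant
pubmed_abstracts = [
    "randomized trial of drug X in heart failure",
    "drug X pharmacokinetics in patients with HF",
    "survey of nursing staff satisfaction",
    "hospital catering cost analysis",
]
labels = [1, 1, 0, 0]

# Bag-of-words TF-IDF features feeding a linear SVM classifier
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(pubmed_abstracts, labels)

# Score unlabeled ClinicalTrials.gov records with the trained model
trials = ["phase III trial of drug X for heart failure",
          "staff satisfaction questionnaire validation"]
pred = clf.predict(trials)
```

In a real application, the training corpus would come from the SR search string and its negation, and the trained model would rank the full ClinicalTrials.gov retrieval for screening.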
Regardless of the balancing technique applied, all the MLTs considered in the present work showed potential for use in literature searches of online databases, with AUC-ROCs across the MLTs (excluding k-NN) mostly above 0.90.
Among the study findings, the resampling pre-processing approaches showed a slight improvement in the performance of the MLTs, with ROS-50:50 and RUS-35:65 giving the best results in general. Consistent with the literature, the use of k-NN does not seem to require any approach for imbalance [23]. On the other hand, for straightforward computational reasons directly related to the reduction in the sample size of the original dataset, the use of RUS-35:65 may be preferred. Moreover, k-NN showed unstable results whichever balancing technique was applied. It is also worth noting that k-NN-based algorithms returned an error, with no results, in three of the 70 applications, while no other combination of MLT and pre-processing method encountered any error. The problem occurred only in the SR of Kourbeti [24], the one with the highest number of records (75 positives and 1600 negatives), and only in combination with one of the two ROS techniques or when no technique was applied to handle the imbalanced data, i.e., when the dimensionality does not decrease. The issue is known when using the caret R interface to MLT algorithms (see, for instance, the discussion at https://github.com/topepo/caret/issues/582), and manual tuning of the neighborhood size could be a remedy [25].
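As an illustration of why the neighborhood size matters (this is a Python/scikit-learn sketch on synthetic data, not the caret code used in the study), one remedy is simply to cap the candidate values of k at the training-fold size before tuning:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a small, vectorized citation dataset
X = np.random.RandomState(0).rand(40, 5)
y = np.array([0, 1] * 20)

# With 5-fold CV, each training split holds 4/5 of the samples;
# any candidate k larger than that would make fitting fail.
max_k = len(X) * 4 // 5
grid = {"n_neighbors": [k for k in (1, 5, 15, 101) if k <= max_k]}

search = GridSearchCV(KNeighborsClassifier(), grid, cv=5).fit(X, y)
```

The oversized candidate (k = 101 here) is filtered out before tuning, so the grid search completes instead of erroring out on the larger, resampled datasets.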
According to the literature, the performance of various MLTs is sensitive to the application of approaches for imbalanced data [11, 26]. For example, SVM with different kernels (linear, radial, polynomial, and sigmoid) was analysed on a genomics biomedical text corpus using resampling techniques, and the normalized linear and sigmoid kernels combined with the RUS technique outperformed the other tested approaches [27]. SVM and k-NN were also found to be sensitive to class imbalance in supervised sentiment classification [26]. The addition of cost-sensitive learning and threshold control has been reported to intensify the training process of models such as SVM and artificial neural networks, and it may provide some gains in validation performance that are not confirmed in the test results [28].
However, the generally high performance of the MLTs, including when no balancing techniques were applied, is not in contrast with the literature. The main reason could be that each classifier already performs well without the application of methods to handle imbalanced data, leaving little scope for improvement. A possible explanation for such good performance lies in the type of training set and features, where positives and negatives are well separated by design, being based on search strings performing word comparison within the metadata of the documents [14]. Nevertheless, the observed small relative gain in performance (around 1%) may translate into a substantial absolute improvement depending on the intended use of the classifier: in an application to textual repositories with millions of entries, even a 1% gain can affect tens of thousands of records.
Study findings suggest that there is no single outperforming strategy to recommend as a convenient standard. However, the combination of SVM and RUS-35:65 may be suggested when the preference is for a fast algorithm with stable results and the low computational burden that comes with the reduction in sample size.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.