SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering
Introduction
Several real-world classification problems in fields such as text categorization [49], medicine [52], bankruptcy prediction [38] and intrusion detection [31] are characterized by a highly imbalanced distribution of examples among the classes. In these problems, one class (known as the minority or positive class) contains a much smaller number of examples than the other classes (the majority or negative classes). The minority class is often the most interesting one from the application point of view [11], [5]. Class imbalance constitutes a difficulty for most learning algorithms, which assume an approximately balanced class distribution and are therefore biased toward the learning and recognition of the majority class. As a result, minority class examples tend to be misclassified.
The problem of learning from imbalanced data has been intensively researched in the last decade and several methods have been proposed to address it – for a review see, e.g., [24]. Re-sampling methods [9], [8], [33], [47] are a classifier-independent type of techniques that modify the data distribution taking into account local characteristics of examples to change the balance between classes. There are numerous works discussing their advantages [4], [10]. Among these methods, the Synthetic Minority Over-sampling Technique (SMOTE) [9] is one of the most well-known; it generates new artificial minority class examples by interpolating among several minority class examples that lie together.
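The interpolation mechanism of SMOTE can be illustrated with a minimal sketch (the function and parameter names below are ours, not those of any reference implementation):

```python
import random

def smote(minority, k=5, n_new=None, seed=0):
    """Minimal SMOTE sketch: interpolate between a randomly chosen minority
    example and one of its k nearest minority-class neighbours (squared
    Euclidean distance)."""
    rng = random.Random(seed)
    n_new = len(minority) if n_new is None else n_new
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, k=2, n_new=4)
```

Each synthetic example lies on the segment between a minority example and one of its nearest minority neighbours, which is why SMOTE only takes the closeness among positive examples into account and ignores the surrounding majority-class distribution.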
However, some researchers have shown that the class imbalance ratio is not a problem in itself. Although the low classification performance observed on some concrete imbalanced problems may be influenced by the validation scheme used to estimate classifier performance [35], performance degradation is usually linked to other factors related to the data distribution [28], [22], [40]. Among them, [40] experimentally studies the influence of noisy and borderline examples on classification performance in imbalanced datasets. Borderline examples are defined as examples located either very close to the decision boundary between the minority and majority classes or in the area surrounding the class boundaries, where the classes overlap. The authors of [33], [40] refer to noisy examples as those from one class located deep inside the region of the other class. Furthermore, this paper considers noisy examples in the wider sense of [57], [43], in which they are treated as examples corrupted either in their attribute values or in their class label.
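The distinction between safe, borderline and noisy examples can be operationalized, as in [40], by inspecting the class composition of each example's neighbourhood; the following sketch uses illustrative thresholds of our own choosing:

```python
def example_type(i, X, y, k=5):
    """Tag example i as 'safe', 'borderline' or 'noisy' from the class
    composition of its k nearest neighbours (thresholds are illustrative)."""
    neighbours = sorted(
        (j for j in range(len(X)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
    )[:k]
    same = sum(y[j] == y[i] for j in neighbours)
    if same >= k - 1:
        return "safe"          # neighbourhood dominated by its own class
    if same >= 2:
        return "borderline"    # mixed neighbourhood near the class boundary
    return "noisy"             # surrounded by the other class

# a minority cluster, a majority cluster, and one minority example
# located deep inside the majority region
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (5.2, 5.2), (5.05, 5.15),
     (5.05, 5.05)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
```

An example surrounded by its own class is tagged safe, while the last minority example, whose whole neighbourhood belongs to the majority class, is tagged noisy.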
Even though SMOTE achieves a better balance between the numbers of examples in each class, when used in isolation it may obtain suboptimal results or even be counterproductive in many cases. This is because SMOTE performs blind over-sampling: the creation of new positive (minority) examples only takes into account the closeness among positive examples and the number of examples of each class, while other characteristics of the data, such as the distribution of examples from the majority classes, are ignored. These drawbacks, which can further aggravate the difficulties caused by noisy and borderline examples in the learning process, include: (i) the creation of too many examples around unnecessary positive examples, which does not facilitate the learning of the minority class; (ii) the introduction of noisy positive examples in areas belonging to the majority class; and (iii) the disruption of the boundaries between the classes and, therefore, an increase in the overlap between them. In order to overcome these problems, two different approaches are followed in the literature:
- 1.
Modifications of SMOTE (hereafter called change-direction methods). These guide the creation of positive examples performed by SMOTE towards specific parts of the input space, taking into account specific characteristics of the data. Within this group, the Safe-Levels-SMOTE (SL-SMOTE) [8], the Borderline-SMOTE (B1-SMOTE and B2-SMOTE) [23] or LN-SMOTE [37] methods are found, which try to create positive examples close to areas with a high concentration of positive examples or only inside the boundaries of the positive class.
- 2.
Extensions of SMOTE by integrating it with additional techniques (these extensions will be referred to as filtering-based methods since SMOTE is integrated with either special cleaning or filtering methods). In the standard classification tasks, noise filters are often used in order to detect and eliminate noisy examples from training datasets and also to clean up and to create more regular class boundaries [55], [53]. Experimental studies, such as [4], confirm the usefulness of integrating such filters – e.g., Edited Nearest Neighbor Rule (ENN) or Tomek Links (TL) [53] – as a post-processing step after using SMOTE.
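As an illustration of such a cleaning step, a minimal sketch of the ENN rule follows (our own simplified version, using squared Euclidean distance and a majority vote over k = 3 neighbours):

```python
def enn_filter(X, y, k=3):
    """Edited Nearest Neighbor rule: drop every example whose label
    disagrees with the majority label among its k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
        )[:k]
        votes = [y[j] for j in neighbours]
        if votes.count(y[i]) * 2 >= len(votes):  # label agrees with majority
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# two clean clusters plus one mislabeled point deep inside the other class
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
     (0.05, 0.05)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]   # the last label is noise
X_clean, y_clean = enn_filter(X, y)
```

Applied after SMOTE, such a filter removes synthetic (and original) examples that end up in regions dominated by the other class, which is the role ENN plays in SMOTE-ENN [4].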
Although this paper proposes a new extension of SMOTE, the ability of methods from both approaches to deal with imbalanced datasets containing noisy and borderline examples will also be studied in the experimental section. Existing extensions of SMOTE are very simple, since they are based on a single learning algorithm or on simple measures, such as the k-Nearest Neighbors (k-NN) [39] paradigm inside ENN [55], as used in SMOTE-ENN.
Some works highlight the good behavior of ensembles for classification in noisy environments, showing that the combined use of several classifiers is a better alternative in these scenarios than the employment of single classifiers [42], [43]. In the same way, some authors also propose the use of ensembles for filtering [7], [17], [18], [54]. However, all these works only consider the standard classification setting and the overall classification accuracy. Thus, ensembles are used for filtering in [7] under the assumption that some examples have been mislabeled and that the label errors are independent of the particular classifiers learned from the data. In this scenario, the authors claim that collecting predictions from different classifiers provides a better estimation of mislabeled examples than collecting information from a single classifier only. To the best of our knowledge, these ensemble-based filters have not yet been used in the context of learning from imbalanced data. Our analysis of such filters focuses attention on the Iterative-Partitioning Filter (IPF) [32]. Its characteristics differentiate it from most other filters and make it particularly suitable for overcoming the problems produced by noisy and borderline examples specific to the dataset, plus the additional ones that SMOTE may introduce.
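The partitioning-and-voting idea behind IPF can be sketched as follows; note that the paper builds C4.5 trees on each partition, whereas this illustration substitutes a trivial nearest-centroid classifier and simplifies the stopping criterion:

```python
import random

def centroid_fit(X, y):
    """Tiny stand-in for C4.5: one centroid per class."""
    model = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        model[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return model

def centroid_predict(model, x):
    return min(model, key=lambda l: sum((a - b) ** 2 for a, b in zip(x, model[l])))

def ipf(X, y, n_parts=3, stop_frac=0.01, max_iter=10, seed=0):
    """Iterative-Partitioning Filter sketch: split the data into n_parts,
    train one classifier per partition and remove every example that a
    majority of those classifiers misclassifies; iterate until fewer than
    stop_frac of the current examples would be removed."""
    rng = random.Random(seed)
    X, y = list(X), list(y)
    for _ in range(max_iter):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        parts = [idx[p::n_parts] for p in range(n_parts)]
        models = [centroid_fit([X[i] for i in part], [y[i] for i in part])
                  for part in parts]
        noisy = {i for i in range(len(X))
                 if sum(centroid_predict(m, X[i]) != y[i] for m in models) * 2 > n_parts}
        if len(noisy) <= stop_frac * len(X):
            break
        X = [x for i, x in enumerate(X) if i not in noisy]
        y = [l for i, l in enumerate(y) if i not in noisy]
    return X, y

# two well-separated clusters plus one mislabeled point
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.0),
     (0.0, 0.2), (0.2, 0.2), (0.1, 0.2), (0.2, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (5.2, 5.0),
     (5.0, 5.2), (5.2, 5.2), (5.1, 5.2), (5.2, 5.1),
     (0.15, 0.15)]
y = [0] * 9 + [1] * 9 + [1]        # the last label is noise
X_f, y_f = ipf(X, y)
```

Because the noise judgment is a vote across classifiers trained on disjoint partitions, a single unreliable model cannot remove an example on its own.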
The main aim of this paper is to propose and examine a new extension of SMOTE in which the IPF noise filter is applied as a post-processing step, resulting in SMOTE–IPF; its implementation can be found in KEEL [2]. Its suitability for handling noisy and borderline examples in imbalanced data will be a particular focus of evaluation, as these are one of the main sources of difficulty for learning algorithms. Differences between this approach and other re-sampling methods also based on generalizations of SMOTE will be discussed and studied. This proposal should not be treated as a simple combination of two methods: we study in depth the conditions for its appropriate use in dealing with different types of noise in imbalanced data, which have not been considered yet, and we discuss its properties in comparison to other previous, related generalizations of SMOTE.
The other contribution of this paper is a comprehensive experimental comparison of SMOTE–IPF with these generalizations, in which different data factors are considered. The first part of the study is carried out with special synthetic datasets containing different shapes of the minority class boundaries and different levels of borderline examples, as considered in related studies [22], [28], [29], [40]. Additionally, a set of real-world datasets known to be affected by noisy and borderline examples is considered; all of these were used in [40] and are available in the KEEL-dataset repository [1]. Yet another contribution of this paper is to introduce additional class or attribute noise into these real-world datasets and to study its impact on the compared SMOTE generalizations. After preprocessing these datasets, the performances of the classifiers built with C4.5 [41] will be evaluated and contrasted using appropriate statistical tests, as recommended in the specialized literature [14], [19], [25]. The characteristics of IPF that differentiate it from other filters, along with a discussion of the strengths and weaknesses of IPF in dealing with imbalanced datasets with noisy and borderline examples, will be presented in Section 6.
In addition, experiments with many other classification algorithms on the preprocessed datasets will be carried out in order to show the behavior of the preprocessing techniques with different classifiers. These are k-NN [39], a Support Vector Machine (SVM) [13], [51], Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [12] and PART [16]. Due to length restrictions, their results are only included on the web-page associated with this paper, available at http://sci2s.ugr.es/noisebor-imbalanced. This web-page also includes the basic information of this paper, the datasets used and the parameter setup for all the classification algorithms.
The rest of this paper is organized as follows. Section 2 presents the imbalanced dataset problem. Section 3 is devoted to the motivations behind our extension of SMOTE. Next, Section 4 describes the experimental framework. Section 5 includes the analysis of the experimental results, and Section 6 outlines the results and the suitability of IPF for the problem treated. Finally, in Section 7, some concluding remarks are presented.
Section snippets
Classification for imbalanced datasets
In this section, first the problem of imbalanced datasets is introduced in Section 2.1. Some additional problems related to class imbalance that may harm classifier performance are described in Section 2.2.
SMOTE along with noise filters to tackle noisy and borderline examples
In this section, first, the main details of the proposed extension of SMOTE are given in Section 3.1. Next, its two implicated parts are described in depth: the SMOTE algorithm in Section 3.2 and noise filters in Section 3.3.
Experimental framework
In this section, the details of the experimental study developed in this paper are presented. First, in Section 4.1, we describe how the synthetic imbalanced datasets with borderline examples were built. Then, the real-world datasets and the noise introduction processes are presented in Section 4.2. In Section 4.3 the preprocessing techniques considered in this work are briefly described. Finally, in Section 4.4, the methodology of the analysis carried out is described.
Evaluation of re-sampling methods with noisy and borderline examples
In this section, the performance of C4.5 using the different preprocessing techniques over the imbalanced datasets with noisy and borderline examples is analyzed. In Section 5.1, the results on synthetic datasets are analyzed, whereas Sections 5.2 and 5.3 are devoted to the results on the real-world datasets and on the noisy modified real-world datasets, respectively.
SMOTE–IPF: suitability of the approach, strengths and weaknesses
This section summarizes the main conclusions obtained in the experimental section. Section 6.1 outlines the results obtained with the different types of datasets. Then, Section 6.2 describes the characteristics of IPF that make it suitable for this type of problem when preprocessed with SMOTE. Section 6.3 analyzes the main drawback of SMOTE–IPF, its parametrization. Finally, Section 6.4 establishes a hypothesis in order to explain its good behavior in the different scenarios considered.
Concluding remarks
This paper has focused on the presence of noisy and borderline examples, which is an important and contemporary research issue for learning classifiers from imbalanced data. It has been proposed to extend SMOTE with a new element, the IPF noise filter, to control the noise introduced by the balancing between classes produced by SMOTE and to make the class boundaries more regular. The suitability of the approach in this scenario has been analyzed.
Acknowledgment
Supported by the National Project TIN2011-28488, and also by the Regional Projects P10-TIC-06858 and P11-TIC-9704. José A. Sáez holds an FPU scholarship from the Spanish Ministry of Education and Science. This paper is also supported by the Polish National Science Center under Grant No. DEC-2013/11/B/ST6/00963.
José A. Sáez received his M.Sc. in Computer Science from the University of Granada (Granada, Spain) in 2009. He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence in the University of Granada. His main research interests include noisy data in classification, discretization methods and imbalanced learning.
References (57)
- Strategies for learning in class imbalance problems, Pattern Recogn. (2003)
- The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recogn. (1997)
- Fast effective rule induction
- On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets, Inf. Sci. (2010)
- Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Inf. Sci. (2010)
- Diversity in multiple classifier systems, Inf. Fus. (2005)
- On the importance of the validation technique for classification with imbalanced datasets: addressing covariate shift when data is skewed, Inf. Sci. (2014)
- Addressing imbalanced classification with instance generation techniques: IPADE-ID, Neurocomputing (2014)
- Tackling the problem of classification with noisy data using multiple classifier systems: analysis of the performance and robustness, Inf. Sci. (2013)
- Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification, Pattern Recogn. (2013)
- On strategies for imbalanced text classification using SVM: a comparative study, Decis. Support Syst.
- Cost-sensitive boosting for classification of imbalanced data, Pattern Recogn.
- Parasite detection and identification for automated thin blood film malaria diagnosis, Comput. Vis. Image Understand.
- KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Multiple-Valued Logic Soft Comput.
- KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Comput. – Fus. Found. Methodol. Appl.
- A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newslett.
- Developing new fitness functions in genetic programming for classification with unbalanced data, IEEE Trans. Syst. Man Cybern., Part B: Cybern.
- Identifying mislabeled training data, J. Artif. Intell. Res.
- Safe-level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem
- SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res.
- Automatically countering imbalance and its empirical relationship to cost, Data Min. Knowl. Discov.
- Editorial: special issue on learning from imbalanced data sets, SIGKDD Explor.
- Support vector networks, Mach. Learn.
- Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res.
- Generating accurate rule sets without global optimization
- Experiments with noise filtering in a medical domain
- Noise detection and elimination in data preprocessing: experiments in medical domains, Appl. Artif. Intell.
- Combined effects of class imbalance and class overlap on instance-based classification
Cited by (483)
- FCM-CSMOTE: Fuzzy C-Means Center-SMOTE, Expert Systems with Applications (2024)
- WRND: A weighted oversampling framework with relative neighborhood density for imbalanced noisy classification, Expert Systems with Applications (2024)
- A software defect prediction method based on learnable three-line hybrid feature fusion, Expert Systems with Applications (2024)
- ASE: Anomaly scoring based ensemble learning for highly imbalanced datasets, Expert Systems with Applications (2024)
- SMOTE-kTLNN: A hybrid re-sampling method based on SMOTE and a two-layer nearest neighbor classifier, Expert Systems with Applications (2024)
Julián Luengo received the M.S. degree in computer science and the Ph.D. degree from the University of Granada, Granada, Spain, in 2006 and 2011 respectively. His research interests include machine learning and data mining, data preparation in knowledge discovery and data mining, missing values, data complexity and fuzzy systems.
Jerzy Stefanowski received the Ph.D. and Habilitation degrees in computer science from Poznan University of Technology, Poland, in 1994 and 2001, respectively. He is currently an Associate Professor in Institute of Computing Science, Poznan University of Technology (specialization in Machine Learning and Knowledge Discovery from Data). His research interests include: machine learning, data mining and intelligent decision support – in particular rule induction, multiple classifiers, data pre-processing; document clustering, mining sequence patterns and handling uncertainty in data. His work has also led to applications in medicine, technical diagnostics and finance. He served as a reviewer for many international journals and conferences. His publication list covers over 120 journal and conference papers.
Francisco Herrera received his M.Sc. in Mathematics in 1988 and Ph.D. in Mathematics in 1991, both from the University of Granada, Spain.
He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has been the supervisor of 30 Ph.D. students. He has published more than 260 papers in international journals. He is coauthor of the book “Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases” (World Scientific, 2001).
He currently acts as Editor in Chief of the international journals “Information Fusion” (Elsevier) and “Progress in Artificial Intelligence” (Springer). He acts as an area editor of the International Journal of Computational Intelligence Systems and as associate editor of the journals: IEEE Transactions on Fuzzy Systems, Information Sciences, Knowledge and Information Systems, Advances in Fuzzy Systems, and International Journal of Applied Metaheuristics Computing; and he serves as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Information Fusion, Knowledge-Based Systems, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, and Swarm and Evolutionary Computation.
He received the following honors and awards: ECCAI Fellow 2009, IFSA Fellow 2013, 2010 Spanish National Award on Computer Science ARITMEL to the “Spanish Engineer on Computer Science”, International Cajastur “Mamdani” Prize for Soft Computing (Fourth Edition, 2010), IEEE Transactions on Fuzzy System Outstanding 2008 Paper Award (bestowed in 2011), 2011 Lotfi A. Zadeh Prize Best paper Award of the International Fuzzy Systems Association, and 2013 AEPIA Award to a scientific career in Artificial Intelligence (September 2013).
His current research interests include computing with words and decision making, bibliometrics, data mining, data preparation, instance selection and generation, imperfect data, fuzzy rule based systems, genetic fuzzy systems, imbalanced classification, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms, biometrics, cloud computing and big data.