SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering
Introduction
Several real-world classification problems in fields such as text categorization [49], medicine [52], bankruptcy prediction [38] and intrusion detection [31] are characterized by a highly imbalanced distribution of examples among the classes. In these problems, one class (known as the minority or positive class) contains a much smaller number of examples than the other classes (the majority or negative classes). The minority class is often the most interesting one from the application point of view [11], [5]. Class imbalance constitutes a difficulty for most learning algorithms, which assume an approximately balanced class distribution and are therefore biased toward the learning and recognition of the majority class. As a result, minority class examples tend to be misclassified.
The problem of learning from imbalanced data has been intensively researched in the last decade and several methods have been proposed to address it – for a review see, e.g., [24]. Re-sampling methods [9], [8], [33], [47] are a classifier-independent type of techniques that modify the data distribution taking into account local characteristics of examples to change the balance between classes. There are numerous works discussing their advantages [4], [10]. Among these methods, the Synthetic Minority Over-sampling Technique (SMOTE) [9] is one of the most well-known; it generates new artificial minority class examples by interpolating among several minority class examples that lie together.
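The interpolation mechanism of SMOTE can be illustrated with a minimal sketch (the function and parameter names below are ours, not those of any reference implementation):

```python
import random

def smote(minority, k=5, n_new=None, seed=0):
    """Minimal SMOTE sketch: interpolate between a randomly chosen minority
    example and one of its k nearest minority-class neighbours (squared
    Euclidean distance)."""
    rng = random.Random(seed)
    n_new = len(minority) if n_new is None else n_new
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, k=2, n_new=4)
```

Each synthetic example lies on the segment between a minority example and one of its nearest minority neighbours, which is why SMOTE only takes the closeness among positive examples into account and ignores the surrounding majority-class distribution.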
However, some researchers have shown that the class imbalance ratio is not a problem in itself. Although the low classification performance observed on some concrete imbalanced problems may be influenced by the validation scheme used to estimate classifier performance [35], performance degradation is usually linked to other factors related to the data distribution [28], [22], [40]. Among them, [40] experimentally studies the influence of noisy and borderline examples on classification performance in imbalanced datasets. Borderline examples are defined as examples located either very close to the decision boundary between the minority and majority classes or in the area surrounding the class boundaries, where the classes overlap. The authors of [33], [40] refer to noisy examples as those from one class located deep inside the region of the other class. Furthermore, this paper considers noisy examples in the wider sense of [57], [43], in which they are treated as examples corrupted either in their attribute values or in their class label.
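The distinction between safe, borderline and noisy examples can be operationalized, as in [40], by inspecting the class composition of each example's neighbourhood; the following sketch uses illustrative thresholds of our own choosing:

```python
def example_type(i, X, y, k=5):
    """Tag example i as 'safe', 'borderline' or 'noisy' from the class
    composition of its k nearest neighbours (thresholds are illustrative)."""
    neighbours = sorted(
        (j for j in range(len(X)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
    )[:k]
    same = sum(y[j] == y[i] for j in neighbours)
    if same >= k - 1:
        return "safe"          # neighbourhood dominated by its own class
    if same >= 2:
        return "borderline"    # mixed neighbourhood near the class boundary
    return "noisy"             # surrounded by the other class

# a minority cluster, a majority cluster, and one minority example
# located deep inside the majority region
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (5.2, 5.2), (5.05, 5.15),
     (5.05, 5.05)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
```

An example surrounded by its own class is tagged safe, while the last minority example, whose whole neighbourhood belongs to the majority class, is tagged noisy.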
Even though SMOTE achieves a better balance between the numbers of examples in each class, when used in isolation it may obtain suboptimal results or even be counterproductive in many cases. This is because SMOTE performs blind over-sampling: the creation of new positive (minority) examples only takes into account the closeness among positive examples and the number of examples of each class, while other characteristics of the data, such as the distribution of examples from the majority classes, are ignored. These drawbacks, which can further aggravate the difficulties caused by noisy and borderline examples in the learning process, include: (i) the creation of too many examples around unnecessary positive examples, which does not facilitate the learning of the minority class; (ii) the introduction of noisy positive examples in areas belonging to the majority class; and (iii) the disruption of the boundaries between the classes and, therefore, an increase in the overlap between them. In order to overcome these problems, two different approaches are followed in the literature:
- 1.
Modifications of SMOTE (hereafter called change-direction methods). These guide the creation of positive examples performed by SMOTE towards specific parts of the input space, taking into account specific characteristics of the data. Within this group, the Safe-Levels-SMOTE (SL-SMOTE) [8], the Borderline-SMOTE (B1-SMOTE and B2-SMOTE) [23] or LN-SMOTE [37] methods are found, which try to create positive examples close to areas with a high concentration of positive examples or only inside the boundaries of the positive class.
- 2.
Extensions of SMOTE by integrating it with additional techniques (these extensions will be referred to as filtering-based methods since SMOTE is integrated with either special cleaning or filtering methods). In the standard classification tasks, noise filters are often used in order to detect and eliminate noisy examples from training datasets and also to clean up and to create more regular class boundaries [55], [53]. Experimental studies, such as [4], confirm the usefulness of integrating such filters – e.g., Edited Nearest Neighbor Rule (ENN) or Tomek Links (TL) [53] – as a post-processing step after using SMOTE.
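As an illustration of such a cleaning step, a minimal sketch of the ENN rule follows (our own simplified version, using squared Euclidean distance and a majority vote over k = 3 neighbours):

```python
def enn_filter(X, y, k=3):
    """Edited Nearest Neighbor rule: drop every example whose label
    disagrees with the majority label among its k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
        )[:k]
        votes = [y[j] for j in neighbours]
        if votes.count(y[i]) * 2 >= len(votes):  # label agrees with majority
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# two clean clusters plus one mislabeled point deep inside the other class
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
     (0.05, 0.05)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]   # the last label is noise
X_clean, y_clean = enn_filter(X, y)
```

Applied after SMOTE, such a filter removes synthetic (and original) examples that end up in regions dominated by the other class, which is the role ENN plays in SMOTE-ENN [4].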
Although this paper proposes a new extension of SMOTE, the ability of methods from both approaches to deal with imbalanced datasets containing noisy and borderline examples will also be studied in the experimental section. Existing extensions of SMOTE are very simple, since they are based on a single learning algorithm or on simple measures, such as the k-Nearest Neighbors (k-NN) [39] paradigm inside ENN [55], as used in SMOTE-ENN.
Some works highlight the good behavior of ensembles for classification in noisy environments, showing that the combined use of several classifiers is a better alternative in these scenarios than the employment of single classifiers [42], [43]. In the same way, some authors also propose the use of ensembles for filtering [7], [17], [18], [54]. However, all these works only consider the standard classification setting and the overall classification accuracy. Thus, ensembles are used for filtering in [7] under the assumption that some examples have been mislabeled and that the label errors are independent of the particular classifiers learned from the data. In this scenario, the authors claim that collecting predictions from different classifiers provides a better estimation of mislabeled examples than collecting information from a single classifier only. To the best of our knowledge, these ensemble-based filters have not yet been used in the context of learning from imbalanced data. Our analysis of such filters focuses attention on the Iterative-Partitioning Filter (IPF) [32]. Its characteristics differentiate it from most other filters and make it particularly suitable for overcoming the problems produced by noisy and borderline examples specific to the dataset, plus the additional ones that SMOTE may introduce.
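The partitioning-and-voting idea behind IPF can be sketched as follows; note that the paper builds C4.5 trees on each partition, whereas this illustration substitutes a trivial nearest-centroid classifier and simplifies the stopping criterion:

```python
import random

def centroid_fit(X, y):
    """Tiny stand-in for C4.5: one centroid per class."""
    model = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        model[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return model

def centroid_predict(model, x):
    return min(model, key=lambda l: sum((a - b) ** 2 for a, b in zip(x, model[l])))

def ipf(X, y, n_parts=3, stop_frac=0.01, max_iter=10, seed=0):
    """Iterative-Partitioning Filter sketch: split the data into n_parts,
    train one classifier per partition and remove every example that a
    majority of those classifiers misclassifies; iterate until fewer than
    stop_frac of the current examples would be removed."""
    rng = random.Random(seed)
    X, y = list(X), list(y)
    for _ in range(max_iter):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        parts = [idx[p::n_parts] for p in range(n_parts)]
        models = [centroid_fit([X[i] for i in part], [y[i] for i in part])
                  for part in parts]
        noisy = {i for i in range(len(X))
                 if sum(centroid_predict(m, X[i]) != y[i] for m in models) * 2 > n_parts}
        if len(noisy) <= stop_frac * len(X):
            break
        X = [x for i, x in enumerate(X) if i not in noisy]
        y = [l for i, l in enumerate(y) if i not in noisy]
    return X, y

# two well-separated clusters plus one mislabeled point
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.0),
     (0.0, 0.2), (0.2, 0.2), (0.1, 0.2), (0.2, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (5.2, 5.0),
     (5.0, 5.2), (5.2, 5.2), (5.1, 5.2), (5.2, 5.1),
     (0.15, 0.15)]
y = [0] * 9 + [1] * 9 + [1]        # the last label is noise
X_f, y_f = ipf(X, y)
```

Because the noise judgment is a vote across classifiers trained on disjoint partitions, a single unreliable model cannot remove an example on its own.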
The main aim of this paper is to propose and examine a new extension of SMOTE in which the IPF noise filter is applied as a post-processing step, resulting in SMOTE–IPF; its implementation can be found in KEEL [2]. Its suitability for handling noisy and borderline examples in imbalanced data will be a particular focus of evaluation, as these are one of the main sources of difficulty for learning algorithms. Differences between this approach and other re-sampling methods also based on generalizations of SMOTE will be discussed and studied. This proposal should not be treated as a simple combination of two methods: we study in depth the conditions for its appropriate use in dealing with different types of noise in imbalanced data, which have not been considered yet, and we discuss its properties in comparison to other previous, related generalizations of SMOTE.
The other contribution of this paper is a comprehensive experimental comparison of SMOTE–IPF with these generalizations, in which different data factors are considered. The first part of the study is carried out with special synthetic datasets containing different shapes of the minority class boundaries and different levels of borderline examples, as considered in related studies [22], [28], [29], [40]. Additionally, a set of real-world datasets known to be affected by noisy and borderline examples is considered; all of these were used in [40] and are available in the KEEL-dataset repository [1]. Yet another contribution of this paper is to introduce additional class or attribute noise into these real-world datasets and to study its impact on the compared SMOTE generalizations. After preprocessing these datasets, the performances of the classifiers built with C4.5 [41] will be evaluated and contrasted using appropriate statistical tests, as recommended in the specialized literature [14], [19], [25]. The characteristics of IPF that differentiate it from other filters, along with a discussion of the strengths and weaknesses of IPF in dealing with imbalanced datasets with noisy and borderline examples, will be presented in Section 6.
In addition, experiments with many other classification algorithms on the preprocessed datasets will be carried out in order to show the behavior of the preprocessing techniques with different classifiers. These are k-NN [39], a Support Vector Machine (SVM) [13], [51], Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [12] and PART [16]. Due to length restrictions, their results are only included on the web-page associated with this paper, available at http://sci2s.ugr.es/noisebor-imbalanced. This web-page also includes the basic information of this paper, the datasets used and the parameter setup for all the classification algorithms.
The rest of this paper is organized as follows. Section 2 presents the imbalanced dataset problem. Section 3 is devoted to the motivations behind our extension of SMOTE. Next, Section 4 describes the experimental framework. Section 5 includes the analysis of the experimental results, and Section 6 outlines the results and the suitability of IPF for the problem treated. Finally, in Section 7, some concluding remarks are presented.
Section snippets
Classification for imbalanced datasets
In this section, first the problem of imbalanced datasets is introduced in Section 2.1. Some additional problems related to class imbalance that may harm classifier performance are described in Section 2.2.
SMOTE along with noise filters to tackle noisy and borderline examples
In this section, first, the main details of the proposed extension of SMOTE are given in Section 3.1. Next, its two implicated parts are described in depth: the SMOTE algorithm in Section 3.2 and noise filters in Section 3.3.
Experimental framework
In this section, the details of the experimental study developed in this paper are presented. First, in Section 4.1, we describe how the synthetic imbalanced datasets with borderline examples were built. Then, the real-world datasets and the noise introduction processes are presented in Section 4.2. In Section 4.3 the preprocessing techniques considered in this work are briefly described. Finally, in Section 4.4, the methodology of the analysis carried out is described.
Evaluation of re-sampling methods with noisy and borderline examples
In this section, the performance of C4.5 using the different preprocessing techniques over the imbalanced datasets with noisy and borderline examples is analyzed. In Section 5.1, the results on synthetic datasets are analyzed, whereas Sections 5.2 and 5.3 are devoted to the results on the real-world datasets and on the noisy modified real-world datasets, respectively.
SMOTE–IPF: suitability of the approach, strengths and weaknesses
This section summarizes the main conclusions obtained in the experimental section. Section 6.1 outlines the results obtained with the different types of datasets. Then, Section 6.2 describes the characteristics of IPF that make it suitable for this type of problem when preprocessed with SMOTE. Section 6.3 analyzes the main drawback of SMOTE–IPF, its parametrization. Finally, Section 6.4 establishes a hypothesis in order to explain its good behavior in the different scenarios considered.
Concluding remarks
This paper has focused on the presence of noisy and borderline examples, which is an important and contemporary research issue for learning classifiers from imbalanced data. It has been proposed to extend SMOTE with a new element, the IPF noise filter, to control the noise introduced by the balancing between classes produced by SMOTE and to make the class boundaries more regular. The suitability of the approach in this scenario has been analyzed.
Acknowledgment
Supported by the National Project TIN2011-28488, and also by the Regional Projects P10-TIC-06858 and P11-TIC-9704. José A. Sáez holds an FPU scholarship from the Spanish Ministry of Education and Science. This paper is also supported by the Polish National Science Center under Grant No. DEC-2013/11/B/ST6/00963.
José A. Sáez received his M.Sc. in Computer Science from the University of Granada (Granada, Spain) in 2009. He is currently a Ph.D. student in the Department of Computer Science and Artificial Intelligence in the University of Granada. His main research interests include noisy data in classification, discretization methods and imbalanced learning.
References (57)
- Strategies for learning in class imbalance problems, Pattern Recogn. (2003)
- The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recogn. (1997)
- Fast effective rule induction
- On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets, Inf. Sci. (2010)
- Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Inf. Sci. (2010)
- Diversity in multiple classifier systems, Inf. Fus. (2005)
- On the importance of the validation technique for classification with imbalanced datasets: addressing covariate shift when data is skewed, Inf. Sci. (2014)
- Addressing imbalanced classification with instance generation techniques: IPADE-ID, Neurocomputing (2014)
- Tackling the problem of classification with noisy data using multiple classifier systems: analysis of the performance and robustness, Inf. Sci. (2013)
- Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification, Pattern Recogn. (2013)
- On strategies for imbalanced text classification using SVM: a comparative study, Decis. Support Syst.
- Cost-sensitive boosting for classification of imbalanced data, Pattern Recogn.
- Parasite detection and identification for automated thin blood film malaria diagnosis, Comput. Vis. Image Understand.
- KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Multiple-Valued Logic Soft Comput.
- KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Comput. – Fus. Found. Methodol. Appl.
- A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explor. Newslett.
- Developing new fitness functions in genetic programming for classification with unbalanced data, IEEE Trans. Syst. Man Cybern., Part B: Cybern.
- Identifying mislabeled training data, J. Artif. Intell. Res.
- Safe-level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem
- SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res.
- Automatically countering imbalance and its empirical relationship to cost, Data Min. Knowl. Discov.
- Editorial: special issue on learning from imbalanced data sets, SIGKDD Explor.
- Support vector networks, Mach. Learn.
- Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res.
- Generating accurate rule sets without global optimization
- Experiments with noise filtering in a medical domain
- Noise detection and elimination in data preprocessing: experiments in medical domains, Appl. Artif. Intell.
- Combined effects of class imbalance and class overlap on instance-based classification
Cited by (483)
- FCM-CSMOTE: Fuzzy C-Means Center-SMOTE, Expert Systems with Applications (2024)
- WRND: A weighted oversampling framework with relative neighborhood density for imbalanced noisy classification, Expert Systems with Applications (2024)
- A software defect prediction method based on learnable three-line hybrid feature fusion, Expert Systems with Applications (2024)
- ASE: Anomaly scoring based ensemble learning for highly imbalanced datasets, Expert Systems with Applications (2024)
- SMOTE-kTLNN: A hybrid re-sampling method based on SMOTE and a two-layer nearest neighbor classifier, Expert Systems with Applications (2024)
Julián Luengo received the M.S. degree in computer science and the Ph.D. degree from the University of Granada, Granada, Spain, in 2006 and 2011 respectively. His research interests include machine learning and data mining, data preparation in knowledge discovery and data mining, missing values, data complexity and fuzzy systems.
Jerzy Stefanowski received the Ph.D. and Habilitation degrees in computer science from Poznan University of Technology, Poland, in 1994 and 2001, respectively. He is currently an Associate Professor in Institute of Computing Science, Poznan University of Technology (specialization in Machine Learning and Knowledge Discovery from Data). His research interests include: machine learning, data mining and intelligent decision support – in particular rule induction, multiple classifiers, data pre-processing; document clustering, mining sequence patterns and handling uncertainty in data. His work has also led to applications in medicine, technical diagnostics and finance. He served as a reviewer for many international journals and conferences. His publication list covers over 120 journal and conference papers.
Francisco Herrera received his M.Sc. in Mathematics in 1988 and Ph.D. in Mathematics in 1991, both from the University of Granada, Spain.
He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has been the supervisor of 30 Ph.D. students. He has published more than 260 papers in international journals. He is coauthor of the book “Genetic Fuzzy Systems: Evolutionary Tuning and Learning of Fuzzy Knowledge Bases” (World Scientific, 2001).
He currently acts as Editor in Chief of the international journals “Information Fusion” (Elsevier) and “Progress in Artificial Intelligence” (Springer). He acts as an area editor of the International Journal of Computational Intelligence Systems and as associate editor of the journals: IEEE Transactions on Fuzzy Systems, Information Sciences, Knowledge and Information Systems, Advances in Fuzzy Systems, and International Journal of Applied Metaheuristics Computing; and he serves as a member of several journal editorial boards, among others: Fuzzy Sets and Systems, Applied Intelligence, Information Fusion, Knowledge-Based Systems, Evolutionary Intelligence, International Journal of Hybrid Intelligent Systems, Memetic Computation, and Swarm and Evolutionary Computation.
He received the following honors and awards: ECCAI Fellow 2009, IFSA Fellow 2013, 2010 Spanish National Award on Computer Science ARITMEL to the “Spanish Engineer on Computer Science”, International Cajastur “Mamdani” Prize for Soft Computing (Fourth Edition, 2010), IEEE Transactions on Fuzzy System Outstanding 2008 Paper Award (bestowed in 2011), 2011 Lotfi A. Zadeh Prize Best paper Award of the International Fuzzy Systems Association, and 2013 AEPIA Award to a scientific career in Artificial Intelligence (September 2013).
His current research interests include computing with words and decision making, bibliometrics, data mining, data preparation, instance selection and generation, imperfect data, fuzzy rule based systems, genetic fuzzy systems, imbalanced classification, knowledge extraction based on evolutionary algorithms, memetic algorithms and genetic algorithms, biometrics, cloud computing and big data.