Abstract
Learning from imbalanced data sets, in which the number of examples of one (majority) class greatly exceeds that of the others, presents an important challenge to the machine learning community. Traditional machine learning algorithms may be biased towards the majority class and thus produce poor predictive accuracy on the minority class. In this paper, we describe a new approach that combines boosting, an ensemble-based learning algorithm, with data generation to improve the predictive power of classifiers on two-class imbalanced data sets. In the DataBoost-IM method, hard examples from both the majority and minority classes are identified during execution of the boosting algorithm. The hard examples are then used to separately generate synthetic examples for the majority and minority classes. The synthetic data are added to the original training set, and the class distribution and the total weights of the different classes in the new training set are rebalanced. The DataBoost-IM method was evaluated, in terms of F-measure, G-mean and overall accuracy, on seventeen highly and moderately imbalanced data sets, using decision trees as base classifiers. Our results are promising: the DataBoost-IM method compares favorably with a base classifier, a standard benchmark boosting algorithm and three advanced boosting-based algorithms for imbalanced data sets. The results indicate that our approach does not sacrifice one class in favor of the other, but achieves high predictive accuracy on both the minority and majority classes.
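The abstract evaluates classifiers by F-measure, G-mean and overall accuracy rather than accuracy alone. As a minimal sketch (not the authors' code), the snippet below computes all three from a binary confusion matrix, treating the minority class as the positive class; the example counts are invented for illustration.

```python
def evaluate(tp, fp, fn, tn, beta=1.0):
    """Return (f_measure, g_mean, accuracy) for a binary confusion matrix,
    with the minority class taken as the positive class."""
    recall = tp / (tp + fn)           # accuracy on the minority class
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)      # accuracy on the majority class

    # F-measure: harmonic mean of precision and recall (beta weights recall).
    f_measure = ((1 + beta**2) * precision * recall
                 / (beta**2 * precision + recall))
    # G-mean: geometric mean of the per-class accuracies, so sacrificing
    # either class drags the score toward zero.
    g_mean = (recall * specificity) ** 0.5
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f_measure, g_mean, accuracy

# Hypothetical 1:9 imbalanced test set: accuracy looks decent (0.85),
# while F-measure (0.4) and G-mean (~0.67) expose the weak minority recall.
f, g, acc = evaluate(tp=5, fp=10, fn=5, tn=80)
```

This illustrates why the paper reports F-measure and G-mean alongside accuracy: a classifier can reach high overall accuracy on imbalanced data while performing poorly on the minority class.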