
Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach

Published: 01 June 2004

Abstract

Learning from imbalanced data sets, where the number of examples of one (majority) class is much higher than that of the others, presents an important challenge to the machine learning community. Traditional machine learning algorithms may be biased towards the majority class and thus produce poor predictive accuracy on the minority class. In this paper, we describe a new approach that combines boosting, an ensemble-based learning algorithm, with data generation to improve the predictive power of classifiers on two-class imbalanced data sets. In the DataBoost-IM method, hard examples from both the majority and minority classes are identified during execution of the boosting algorithm. These hard examples are then used to separately generate synthetic examples for the majority and minority classes. The synthetic data are added to the original training set, and the class distribution and the total weights of the different classes in the new training set are rebalanced. The DataBoost-IM method was evaluated, in terms of the F-measure, G-mean and overall accuracy, on seventeen highly and moderately imbalanced data sets, using decision trees as base classifiers. Our results are promising: DataBoost-IM compares well against a base classifier, a standard benchmark boosting algorithm and three advanced boosting-based algorithms for imbalanced data sets. The results also indicate that our approach does not sacrifice one class in favor of the other, but produces high predictive performance on both the minority and majority classes.
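To make the reported evaluation measures concrete, the sketch below (illustrative Python, not the authors' implementation; the toy class labels and predictions are invented for the example) computes the per-class F-measure, the G-mean and overall accuracy for a two-class problem from a confusion matrix. The G-mean is the geometric mean of sensitivity and specificity, which is why it penalises a classifier that sacrifices one class in favour of the other.

# Minimal sketch of the evaluation measures named in the abstract,
# under the assumption of a two-class (minority vs. majority) problem.

def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, tn, fn), treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn

def f_measure(tp, fp, fn, beta=1.0):
    """F-measure = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

def g_mean(tp, fp, tn, fn):
    """G-mean = sqrt(sensitivity * specificity)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # accuracy on the positive (e.g. minority) class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # accuracy on the other (e.g. majority) class
    return (sensitivity * specificity) ** 0.5

if __name__ == "__main__":
    # Toy imbalanced test set: 'min' is the minority class, 'maj' the majority class.
    y_true = ["min", "min", "maj", "maj", "maj", "maj", "maj", "maj", "maj", "maj"]
    y_pred = ["min", "maj", "maj", "maj", "maj", "maj", "min", "maj", "maj", "maj"]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive="min")
    accuracy = (tp + tn) / len(y_true)
    print(f"minority F-measure: {f_measure(tp, fp, fn):.3f}")
    print(f"G-mean:             {g_mean(tp, fp, tn, fn):.3f}")
    print(f"overall accuracy:   {accuracy:.3f}")

On such a toy set, overall accuracy stays high even when minority recall is poor, whereas the G-mean and the minority-class F-measure drop sharply; this is the rationale for reporting all three measures rather than accuracy alone.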

