
Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach

Published: 01 June 2004

Abstract

Learning from imbalanced data sets, where the number of examples of one (majority) class is much higher than that of the others, presents an important challenge to the machine learning community. Traditional machine learning algorithms may be biased towards the majority class and thus produce poor predictive accuracy on the minority class. In this paper, we describe a new approach that combines boosting, an ensemble-based learning algorithm, with data generation to improve the predictive power of classifiers on two-class imbalanced data sets. In the DataBoost-IM method, hard examples from both the majority and minority classes are identified during execution of the boosting algorithm. These hard examples are then used to separately generate synthetic examples for the majority and minority classes. The synthetic data are added to the original training set, and the class distribution and the total weights of the different classes in the new training set are rebalanced. The DataBoost-IM method was evaluated, in terms of the F-measure, G-mean and overall accuracy, on seventeen highly and moderately imbalanced data sets, using decision trees as base classifiers. Our results are promising: DataBoost-IM compares well against a base classifier, a standard benchmark boosting algorithm and three advanced boosting-based algorithms for imbalanced data sets. The results also indicate that our approach does not sacrifice one class in favor of the other, but produces high predictive performance on both the minority and majority classes.
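To make the reported evaluation measures concrete, the sketch below (illustrative Python, not the authors' implementation; the toy class labels and predictions are invented for the example) computes the per-class F-measure, the G-mean and overall accuracy for a two-class problem from a confusion matrix. The G-mean is the geometric mean of sensitivity and specificity, which is why it penalises a classifier that sacrifices one class in favour of the other.

# Minimal sketch of the evaluation measures named in the abstract,
# under the assumption of a two-class (minority vs. majority) problem.

def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, tn, fn), treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn

def f_measure(tp, fp, fn, beta=1.0):
    """F-measure = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

def g_mean(tp, fp, tn, fn):
    """G-mean = sqrt(sensitivity * specificity)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # accuracy on the positive (e.g. minority) class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # accuracy on the other (e.g. majority) class
    return (sensitivity * specificity) ** 0.5

if __name__ == "__main__":
    # Toy imbalanced test set: 'min' is the minority class, 'maj' the majority class.
    y_true = ["min", "min", "maj", "maj", "maj", "maj", "maj", "maj", "maj", "maj"]
    y_pred = ["min", "maj", "maj", "maj", "maj", "maj", "min", "maj", "maj", "maj"]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive="min")
    accuracy = (tp + tn) / len(y_true)
    print(f"minority F-measure: {f_measure(tp, fp, fn):.3f}")
    print(f"G-mean:             {g_mean(tp, fp, tn, fn):.3f}")
    print(f"overall accuracy:   {accuracy:.3f}")

On such a toy set, overall accuracy stays high even when minority recall is poor, whereas the G-mean and the minority-class F-measure drop sharply; this is the rationale for reporting all three measures rather than accuracy alone.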

