
Data Mining for Imbalanced Datasets: An Overview


Abstract

A dataset is imbalanced if its classification categories are not approximately equally represented. Recent years have brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular measure of classifier performance, may be inappropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this chapter, we discuss some of the sampling techniques used for balancing datasets, and the performance measures better suited to mining imbalanced datasets.
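The following minimal sketch (not from the chapter; all function names and data are illustrative) shows why plain accuracy can reward a useless classifier on imbalanced data, and implements naive random oversampling, one of the simplest sampling techniques of the kind the chapter surveys.

```python
# A hypothetical illustration, not the chapter's code: accuracy vs. per-class
# recall on a 99:1 imbalanced dataset, plus a naive random-oversampling baseline.
import random
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_per_class(y_true, y_pred, cls):
    """Fraction of true `cls` examples the classifier actually recovers."""
    idx = [i for i, t in enumerate(y_true) if t == cls]
    return sum(y_pred[i] == cls for i in idx) / len(idx)

# 990 majority ("neg") examples vs. 10 minority ("pos") examples.
y_true = ["neg"] * 990 + ["pos"] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = ["neg"] * 1000

print(accuracy(y_true, y_pred))                 # 0.99 -- looks excellent
print(recall_per_class(y_true, y_pred, "pos"))  # 0.00 -- misses every minority case

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples at random until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == cls]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(cls)
    return X_out, y_out
```

Under measures of the kind the chapter favors for imbalanced data, such as per-class recall or ROC analysis, the degenerate majority-class predictor scores poorly, even though raw accuracy rates it at 99%.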




Copyright information

© 2005 Springer Science+Business Media, Inc.

About this chapter

Cite this chapter

Chawla, N.V. (2005). Data Mining for Imbalanced Datasets: An Overview. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/0-387-25465-X_40


  • DOI: https://doi.org/10.1007/0-387-25465-X_40

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-24435-8

  • Online ISBN: 978-0-387-25465-4

  • eBook Packages: Computer Science (R0)
