
Data Mining for Imbalanced Datasets: An Overview


Abstract

A dataset is imbalanced if its classification categories are not approximately equally represented. Recent years have brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular measure of classifier performance, may be inappropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this chapter, we discuss some of the sampling techniques used for balancing datasets, and the performance measures better suited to mining imbalanced datasets.
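The following minimal sketch (not from the chapter; all function names and data are illustrative) shows why plain accuracy can reward a useless classifier on imbalanced data, and implements naive random oversampling, one of the simplest sampling techniques of the kind the chapter surveys.

```python
# A hypothetical illustration, not the chapter's code: accuracy vs. per-class
# recall on a 99:1 imbalanced dataset, plus a naive random-oversampling baseline.
import random
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_per_class(y_true, y_pred, cls):
    """Fraction of true `cls` examples the classifier actually recovers."""
    idx = [i for i, t in enumerate(y_true) if t == cls]
    return sum(y_pred[i] == cls for i in idx) / len(idx)

# 990 majority ("neg") examples vs. 10 minority ("pos") examples.
y_true = ["neg"] * 990 + ["pos"] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = ["neg"] * 1000

print(accuracy(y_true, y_pred))                 # 0.99 -- looks excellent
print(recall_per_class(y_true, y_pred, "pos"))  # 0.00 -- misses every minority case

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples at random until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == cls]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(cls)
    return X_out, y_out
```

Under measures of the kind the chapter favors for imbalanced data, such as per-class recall or ROC analysis, the degenerate majority-class predictor scores poorly, even though raw accuracy rates it at 99%.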




Copyright information

© 2005 Springer Science+Business Media, Inc.

About this chapter

Cite this chapter

Chawla, N.V. (2005). Data Mining for Imbalanced Datasets: An Overview. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/0-387-25465-X_40


  • DOI: https://doi.org/10.1007/0-387-25465-X_40

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-24435-8

  • Online ISBN: 978-0-387-25465-4

  • eBook Packages: Computer Science (R0)
