
Editorial: special issue on learning from imbalanced data sets

Authors:
Nitesh V. Chawla
Retail Risk Management, CIBC, Toronto, ON, Canada

Nathalie Japkowicz
University of Ottawa, ON, Canada

Aleksander Kotcz
AOL, Inc., Dulles, VA

ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1, June 2004, pp. 1–6. https://doi.org/10.1145/1007730.1007733

Published: 01 June 2004

References

In N. Japkowicz, editor, Proceedings of the AAAI'2000 Workshop on Learning from Imbalanced Data Sets, AAAI Tech Report WS-00-05. AAAI, 2000.
In T. Dietterich, D. Margineantu, F. Provost, and P. Turney, editors, Proceedings of the ICML'2000 Workshop on Cost-sensitive Learning. 2000.
In N. V. Chawla, N. Japkowicz, and A. Kotcz, editors, Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Data Sets. 2003.
In C. Ferri, P. Flach, J. Orallo, and N. Lachiche, editors, ECAI'2004 First Workshop on ROC Analysis in AI. ECAI, 2004.
N. Abe. Invited talk: Sampling approaches to learning from imbalanced datasets: active learning, cost sensitive learning and beyond. http://www.site.uottawa.ca/~nat/Workshop2003/ICML03Workshop_Abe.ppt, 2003.
G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations, 6(1):20--29, 2004.
M. Castillo and J. Serrano. A multistrategy approach for digital text categorization from imbalanced documents. SIGKDD Explorations, 6(1):70--79, 2004.
P. K. Chan and S. J. Stolfo. Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In Proceedings of Knowledge Discovery and Data Mining, pages 164--168, 1998.
N. V. Chawla. C4.5 and imbalanced datasets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In Proceedings of the ICML'03 Workshop on Class Imbalances, 2003.
N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. SMOTE: Synthetic Minority Oversampling Technique. Journal of Artificial Intelligence Research, 16:321--357, 2002.
N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer. SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the Seventh European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 107--119, Dubrovnik, Croatia, 2003.
P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155--164, San Diego, CA, 1999. ACM Press.
C. Drummond and R. Holte. Explicitly representing expected cost: An alternative to ROC representation. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 198--207, 2001.
C. Drummond and R. Holte. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets, 2003.
C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pages 973--978, 2001.
C. Elkan. Invited talk: The real challenges in data mining: A contrarian view. http://www.site.uottawa.ca/~nat/Workshop2003/realchallenges2.ppt, 2003.
W. Fan, S. Stolfo, J. Zhang, and P. Chan. AdaCost: Misclassification cost-sensitive boosting. In Proceedings of Sixteenth International Conference on Machine Learning, pages 983--990, Slovenia, 1999.
T. Fawcett. ROC graphs: Notes and practical considerations for researchers. http://www.hpl.hp.com/personal/Tom_Fawcett/papers/index.html, 2003.
G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289--1305, 2003.
J. Furnkranz and P. Flach. An analysis of rule evaluation metrics. In Proceedings of the Twentieth International Conference on Machine Learning, pages 202--209, 2003.
H. Guo and H. L. Viktor. Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explorations, 6(1):30--39, 2004.
I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157--1182, 2003.
R. Hickey. Learning rare class footprints: the reflex algorithm. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets, 2003.
R. Holte. Summary of the workshop. http://www.site.uottawa.ca/~nat/Workshop2003/workshop2003.html, 2003.
N. Japkowicz. Concept-learning in the presence of between-class and within-class imbalances. In Proceedings of the Fourteenth Conference of the Canadian Society for Computational Studies of Intelligence, pages 67--77, 2001.
N. Japkowicz. Supervised versus unsupervised binary-learning by feedforward neural networks. Machine Learning, 42(1/2):97--122, 2001.
N. Japkowicz. Class imbalance: Are we focusing on the right issue? In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets, 2003.
N. Japkowicz and R. Holte. Workshop report: AAAI-2000 workshop on learning from imbalanced data sets. AI Magazine, 22(1), 2001.
N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):203--231, 2002.
T. Jo and N. Japkowicz. Class imbalances versus small disjuncts. SIGKDD Explorations, 6(1):40--49, 2004.
M. Joshi, V. Kumar, and R. Agarwal. Evaluating boosting algorithms to classify rare classes: Comparison and improvements. In Proceedings of the First IEEE International Conference on Data Mining, pages 257--264, San Jose, CA, 2001.
P. Juszczak and R. P. W. Duin. Uncertainty sampling methods for one-class classifiers. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets, 2003.
A. Kolcz and J. Alspector. Asymmetric missing-data problems: overcoming the lack of negative data in preference ranking. Information Retrieval, 5(1):5--40, 2002.
A. Kotcz, A. Chowdhury, and J. Alspector. Data duplication: An imbalance problem? In Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, 2003.
M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 179--186, Nashville, Tennessee, 1997. Morgan Kaufmann.
B. Liu, Y. Dai, X. Li, W. S. Lee, and P. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the Third IEEE International Conference on Data Mining, pages 19--22, 2003.
M. Maloof. Learning when data sets are imbalanced and when costs are unequal and unknown. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets, 2003.
L. M. Manevitz and M. Yousef. One-class SVMs for document classification. Journal of Machine Learning Research, 2:139--154, 2001.
D. Mladenic and M. Grobelnik. Feature selection for unbalanced class distribution and naive bayes. In Proceedings of the 16th International Conference on Machine Learning, pages 258--267, 1999.
K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39:103--134, 2000.
R. Pearson, G. Goney, and J. Shwaber. Imbalanced clustering for microarray time-series. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets, 2003.
C. Phua and D. Alahakoon. Minority report in fraud detection: Classification of skewed data. SIGKDD Explorations, 6(1):50--59, 2004.
F. Provost. Invited talk: Choosing a marginal class distribution for classifier induction. http://www.site.uottawa.ca/~nat/Workshop2003/provost.html, 2003.
F. Provost and T. Fawcett. Robust classification for imprecise environments. Machine Learning, 42:203--231, 2001.
J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
P. Radivojac, N. V. Chawla, K. Dunker, and Z. Obradovic. Classification and knowledge discovery in protein databases. Journal of Biomedical Informatics, 2004. Accepted.
B. Raskutti and A. Kowalczyk. Extreme re-balancing for SVM's: a case study. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets, 2003.
B. Raskutti and A. Kowalczyk. Extreme rebalancing for SVMs: a case study. SIGKDD Explorations, 6(1):60--69, 2004.
B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443--1472, 2001.
D. Tax. One-class classification. PhD thesis, Delft University of Technology, 2001.
K. M. Ting. A comparative study of cost-sensitive boosting algorithms. In Proceedings of Seventeenth International Conference on Machine Learning, pages 983--990, Stanford, CA, 2000.
K. M. Ting. An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, 14:659--665, 2002.
P. Turney. Types of cost in inductive concept learning. In Proceedings of the ICML'2000 Workshop on Cost-Sensitive Learning, pages 15--21, 2000.
S. Visa and A. Ralescu. Learning imbalanced and overlapping classes using fuzzy sets. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets, 2003.
G. Weiss. Mining with rarity: A unifying framework. SIGKDD Explorations, 6(1):7--19, 2004.
G. Weiss and F. Provost. Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19:315--354, 2003.
G. Wu and E. Y. Chang. Class-boundary alignment for imbalanced dataset learning. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets, 2003.
B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 204--213, 2001.
B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the Third IEEE International Conference on Data Mining, pages 435--442, Melbourne, FL, 2003.
J. Zhang and I. Mani. kNN approach to unbalanced data distributions: A case study involving information extraction. In Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Datasets, 2003.
Z. Zheng and R. Srihari. Optimally combining positive and negative features for text categorization. In Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets, 2003.
Z. Zheng, X. Wu, and R. Srihari. Feature selection for text categorization on imbalanced data. SIGKDD Explorations, 6(1):80--89, 2004.

Index Terms

Editorial: special issue on learning from imbalanced data sets
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees

Index terms have been assigned to the content through auto-classification.


Published in

ACM SIGKDD Explorations Newsletter Volume 6, Issue 1
Special issue on learning from imbalanced datasets
June 2004
117 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/1007730

Copyright © 2004 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- article
