ABSTRACT
In many data mining domains, misclassification costs are different for different examples, in the same way that class membership probabilities are example-dependent. In these domains, both costs and probabilities are unknown for test examples, so both cost estimators and probability estimators must be learned. After discussing how to make optimal decisions given cost and probability estimates, we present decision tree and naive Bayesian learning methods for obtaining well-calibrated probability estimates. We then explain how to obtain unbiased estimators for example-dependent costs, taking into account the difficulty that in general, probabilities and costs are not independent random variables, and the training examples for which costs are known are not representative of all examples. The latter problem is called sample selection bias in econometrics. Our solution to it is based on Nobel prize-winning work due to the economist James Heckman. We show that the methods we propose perform better than MetaCost and all other known methods, in a comprehensive experimental comparison that uses the well-known, large, and challenging dataset from the KDD'98 data mining contest.
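To make the expected-cost decision rule the abstract refers to concrete, the sketch below pairs a probability estimator with an example-dependent benefit estimator and acts on an example exactly when estimated probability times estimated benefit exceeds the cost of acting. This is a minimal illustration under assumed names and data, not the paper's method: `prob_model`, `cost_model`, `action_cost`, the synthetic data, and the plain scikit-learn models are placeholders, and the calibration and Heckman-style selection-bias corrections described in the abstract are omitted.

```python
# Minimal sketch of expected-cost decision-making when both class probabilities
# and example-dependent costs/benefits must be estimated from training data.
# All model choices and the synthetic data below are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic training data: features, binary labels, and a benefit that is
# observed only for positive (e.g. responding) examples.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)
benefit = np.where(y == 1, 10.0 + 3.0 * X[:, 1] + rng.normal(size=1000), 0.0)

# Step 1: estimate P(y=1 | x).  The paper uses calibrated decision-tree and
# naive Bayes probability estimates; an uncalibrated tree stands in here.
prob_model = DecisionTreeClassifier(min_samples_leaf=50, random_state=0)
prob_model.fit(X, y)

# Step 2: estimate the example-dependent benefit c(x), trained only on the
# examples whose benefit is observed.  The paper additionally corrects this
# regression for sample selection bias (Heckman's method); that step is omitted.
cost_model = LinearRegression()
cost_model.fit(X[y == 1], benefit[y == 1])

# Step 3: act on x iff estimated probability times estimated benefit exceeds
# the cost of acting.  The 0.68 default echoes the $0.68 mailing cost of the
# KDD'98 task, used here only as an illustrative threshold.
def decide(X_test, action_cost=0.68):
    p_hat = prob_model.predict_proba(X_test)[:, 1]
    c_hat = cost_model.predict(X_test)
    return (p_hat * c_hat > action_cost).astype(int)

print(decide(rng.normal(size=(5, 5))))
```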
REFERENCES
- 1. E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36:105-139, 1999.
- 2. S. D. Bay. UCI KDD archive. Department of Information and Computer Sciences, University of California, Irvine, 2000. http://kdd.ics.uci.edu/.
- 3. P. N. Bennett. Assessing the calibration of naive Bayes' posterior estimates. Technical Report CMU-CS-00-155, School of Computer Science, Carnegie Mellon University, 2000.
- 4. J. Bradford, C. Kunz, R. Kohavi, C. Brunk, and C. Brodley. Pruning decision trees with misclassification costs. In Proceedings of the European Conference on Machine Learning, pages 131-136, 1998.
- 5. L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.
- 6. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- 7. J. Cussens. Bayes and pseudo-Bayes estimates of conditional probabilities and their reliability. In Proceedings of the European Conference on Machine Learning, pages 136-152. Springer-Verlag, 1993.
- 8. P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pages 155-164. ACM Press, 1999.
- 9. P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 105-112. Morgan Kaufmann Publishers, Inc., 1996.
- 10. C. Elkan. Cost-sensitive learning and decision-making when costs are unknown. In Workshop Notes, Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, 2000.
- 11. C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, August 2001.
- 12. F. Esposito, D. Malerba, and G. Semeraro. A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):476-491, May 1997.
- 13. J. Heckman. Sample selection bias as a specification error. Econometrica, 47:153-161, 1979.
- 14. E. C. Malthouse. Assessing the performance of direct marketing scoring models. Journal of Interactive Marketing, 15(1):49-62, 2001.
- 15. D. Margineantu. On class probability estimates and cost-sensitive evaluation of classifiers. In Workshop Notes, Workshop on Cost-Sensitive Learning at the International Conference on Machine Learning, June 2000.
- 16. F. Provost and P. Domingos. Well-trained PETs: Improving probability estimation trees. CeDER Working Paper IS-00-04, Stern School of Business, New York University, New York, NY 10012, 2000.
- 17. F. Provost and T. Fawcett. Robust classification for imprecise environments. Machine Learning, 42:203-231, 2001.
- 18. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
- 19. P. Smyth, A. Gray, and U. Fayyad. Retrofitting decision tree classifiers using kernel density estimation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 506-514. Morgan Kaufmann Publishers, Inc., 1995.
- 20. J. R. Sobehart, R. M. Stein, V. Mikityanskaya, and L. Li. Moody's public firm risk model: A hybrid approach to modeling short term default risk. Technical report, Moody's Investors Service, Global Credit Research, 2000. Available at http://www.moodysqra.com/research/crm/53853.asp.
- 21. K. Tumer and J. Ghosh. Theoretical foundations of linear and order statistics combiners for neural pattern classifiers. Technical Report TR-95-02-98, Computer and Vision Research Center, The University of Texas at Austin, 1995.
- 22. P. Turney. Cost-sensitive learning bibliography. Institute for Information Technology, National Research Council, Ottawa, Canada, 2000. http://extractor.iit.nrc.ca/bibliographies/cost-sensitive.html.
- 23. B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001. To appear.