
Minority report in fraud detection: classification of skewed data

Published: 01 June 2004

Abstract

This paper proposes an innovative fraud detection method, built upon existing fraud detection research and Minority Report, to deal with the data mining problem of skewed data distributions. The method uses backpropagation (BP), together with naive Bayesian (NB) and C4.5 algorithms, on data partitions derived from minority oversampling with replacement. Its originality lies in the use of a single meta-classifier (stacking) to choose the best base classifiers, and then in combining these base classifiers' predictions (bagging) to improve cost savings (stacking-bagging). Results from a publicly available automobile insurance fraud detection data set demonstrate that stacking-bagging performs slightly better, in terms of cost savings, than the best-performing bagged algorithm, C4.5, and its best classifier, C4.5 (2). Stacking-bagging also outperforms the technique commonly used in industry (BP without either sampling or partitioning). Subsequently, this paper compares the new fraud detection method (meta-learning approach) against C4.5 trained with undersampling, oversampling, and SMOTEing but without partitioning (sampling approach). Results show that, given a fixed decision threshold and cost matrix, the partitioning and multiple-algorithms approach achieves marginally higher cost savings than varying the class distribution of the entire training data set. The most interesting finding is that the combination of classifiers that yields the best cost savings draws contributions from all three algorithms.
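The stacking-bagging procedure outlined above can be pictured as: oversample the minority (fraud) class with replacement, partition the rebalanced training data, train BP, NB, and C4.5-style base classifiers on each partition, let a single meta-classifier learn from the base classifiers' predictions and identify the best of them, and finally bag (vote) the predictions of the selected bases. The sketch below is only an illustration of that idea under stated assumptions, not the authors' implementation: it uses scikit-learn stand-ins (MLPClassifier for backpropagation, GaussianNB for naive Bayes, DecisionTreeClassifier for C4.5), synthetic skewed data in place of the insurance data set, a logistic-regression meta-learner whose coefficient magnitudes pick the strongest base classifiers, and arbitrary choices for the partition count and number of bases kept.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Skewed two-class data standing in for the automobile insurance set
# (roughly 6% positives).
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.94],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
# Hold out part of the training data for the meta-learner (stacking level).
X_base, X_meta, y_base, y_meta = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0)

# Minority oversampling with replacement, then split into partitions so
# each base classifier sees a more balanced class mix (4 partitions is an
# illustrative choice, not the paper's).
X_min, y_min = resample(X_base[y_base == 1], y_base[y_base == 1],
                        replace=True, n_samples=int((y_base == 0).sum()),
                        random_state=0)
X_bal = np.vstack([X_base[y_base == 0], X_min])
y_bal = np.concatenate([y_base[y_base == 0], y_min])
order = np.random.RandomState(0).permutation(len(y_bal))

# One BP, NB, and C4.5-style learner per partition.
bases = []
for idx in np.array_split(order, 4):
    for clf in (MLPClassifier(max_iter=300, random_state=0),
                GaussianNB(),
                DecisionTreeClassifier(random_state=0)):
        bases.append(clf.fit(X_bal[idx], y_bal[idx]))

# Stacking: the meta-learner is trained on the base classifiers' predictions.
meta_X = np.column_stack([c.predict(X_meta) for c in bases])
meta = LogisticRegression(max_iter=1000).fit(meta_X, y_meta)

# Keep the base classifiers the meta-learner weights most, then bag
# (majority-vote) their predictions on the test set.
best = np.argsort(np.abs(meta.coef_[0]))[-5:]   # top 5 is an assumption
votes = np.column_stack([bases[i].predict(X_test) for i in best])
y_pred = (votes.mean(axis=1) >= 0.5).astype(int)
print("fraud recall of the stacked-and-bagged ensemble:",
      round(float((y_pred[y_test == 1] == 1).mean()), 3))

Note that the paper's objective is cost savings under a fixed cost matrix, so a faithful implementation would select and evaluate base classifiers by cost savings rather than by coefficient magnitude or plain accuracy as in this sketch.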

References

  1. Berry M. and Linoff G. Mastering Data Mining: The Art and Science of Customer Relationship Management, John Wiley and Sons, New York, USA, 2000.
  2. Bigus J. Data Mining with Neural Networks, McGraw Hill, New York, USA, 1996.
  3. Breiman L. Heuristics of Instability in Model Selection, Technical Report, Department of Statistics, University of California at Berkeley, USA, 1994.
  4. Brockett P., Xia X. and Derrig R. "Using Kohonen's Self Organising Feature Map to Uncover Automobile Bodily Injury Claims Fraud", Journal of Risk and Insurance, USA, 1998.
  5. Cahill M., Lambert D., Pinheiro J. and Sun D. "Detecting Fraud In The Real World", in The Handbook of Massive Data Sets, Kluwer, pp. 911-930, 2002.
  6. Chan P. and Stolfo S. "A Comparative Evaluation of Voting and Meta-learning on Partitioned Data", in Proceedings of the 12th International Conference on Machine Learning, California, USA, pp. 90-98, 1995.
  7. Chan P. and Stolfo S. "Toward Scalable Learning with Nonuniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection", in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, New York, USA, pp. 164-168, 1998.
  8. Chan P., Fan W., Prodromidis A. and Stolfo S. "Distributed Data Mining in Credit Card Fraud Detection", IEEE Intelligent Systems, 14, pp. 67-74, 1999.
  9. Chawla N., Bowyer K., Hall L. and Kegelmeyer W. "SMOTE: Synthetic Minority Over-sampling Technique", Journal of Artificial Intelligence Research, Morgan Kaufmann Publishers, 16, pp. 321-357, 2002.
  10. Chawla N. "C4.5 and Imbalanced Data Sets: Investigating the Effect of Sampling Method, Probabilistic Estimate, and Decision Tree Structure", in Workshop on Learning from Imbalanced Data Sets II, ICML, Washington DC, USA, 2003.
  11. CoIL Challenge 2000. The Insurance Company Case, Technical Report 2000-09, Leiden Institute of Advanced Computer Science, Netherlands, 2000.
  12. Dick P. K. Minority Report, Orion Publishing Group, London, Great Britain, 1956.
  13. Domingos P. and Pazzani M. "Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier", in Proceedings of the 13th Conference on Machine Learning, Bari, Italy, pp. 105-112, 1996.
  14. Domingos P. "MetaCost: A General Method for Making Classifiers Cost-sensitive", in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 155-164, 1999.
  15. Drummond C. and Holte R. "C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling", in Workshop on Learning from Imbalanced Data Sets II, ICML, Washington DC, USA, 2003.
  16. Dzeroski S. and Zenko B. "Is Combining Classifiers with Stacking Better than Selecting the Best One?", Machine Learning, Kluwer, 54, pp. 255-273, 2004.
  17. Elkan C. Naive Bayesian Learning, Technical Report CS97-557, Department of Computer Science and Engineering, University of California, San Diego, USA, 1997.
  18. Elkan C. Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000, Department of Computer Science and Engineering, University of California, San Diego, USA, 2001.
  19. Fawcett T. and Provost F. "Combining Data Mining and Machine Learning for Effective User Profiling", in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Oregon, USA, 1996.
  20. Fawcett T. and Provost F. "Adaptive Fraud Detection", Data Mining and Knowledge Discovery, Kluwer, 1, pp. 291-316, 1997.
  21. Han J. and Kamber M. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, USA, 2001.
  22. He H., Wang J., Graco W. and Hawkins S. "Application of Neural Networks to Detection of Medical Fraud", Expert Systems with Applications, 13, pp. 329-336, 1997.
  23. Insurance Information Institute. Facts and Statistics on Auto Insurance, NY, USA, 2003.
  24. Japkowicz N. and Stephen S. "The Class Imbalance Problem: A Systematic Study", Intelligent Data Analysis, 6, 2002.
  25. Kalousis A., Gama J. and Hilario M. "On Data and Algorithms: Understanding Inductive Performance", Machine Learning, Kluwer, 54, pp. 275-312, 2004.
  26. Kuncheva L. and Whitaker C. "Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy", Machine Learning, Kluwer, 51, pp. 181-207, 2003.
  27. Maes S., Tuyls K., Vanschoenwinkel B. and Manderick B. "Credit Card Fraud Detection Using Bayesian and Neural Networks", in Proceedings of the 1st International NAISO Congress on Neuro Fuzzy Technologies, Havana, Cuba, 2002.
  28. Maloof M. "Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown", in Workshop on Learning from Imbalanced Data Sets II, ICML, Washington DC, USA, 2003.
  29. Ormerod T., Morley N., Ball L., Langley C. and Spenser C. "Using Ethnography To Design a Mass Detection Tool (MDT) For The Early Discovery of Insurance Fraud", in Proceedings of the ACM CHI Conference, Florida, USA, 2003.
  30. Payscale. http://www.payscale.com/salary-survey/aid-8483/raname-HOURLYRATE/fid-6886, last updated 2003, accessed 28 April 2004.
  31. Prodromidis A. Management of Intelligent Learning Agents in Distributed Data Mining Systems, Unpublished PhD thesis, Columbia University, USA, 1999.
  32. Provost F. "Machine Learning from Imbalanced Data Sets 101", invited paper, in Workshop on Learning from Imbalanced Data Sets, AAAI, Texas, USA, 2000.
  33. Provost F. and Fawcett T. "Robust Classification Systems for Imprecise Environments", Machine Learning, Kluwer, 42, pp. 203-231, 2001.
  34. Pyle D. Data Preparation for Data Mining, Morgan Kaufmann Publishers, San Francisco, USA, 1999.
  35. Salzberg S. L. "On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach", Data Mining and Knowledge Discovery, Kluwer, 1, pp. 317-327, 1997.
  36. Weatherford M. "Mining for Fraud", IEEE Intelligent Systems, July/August issue, pp. 4-6, 2002.
  37. Williams G. and Huang Z. "Mining the Knowledge Mine: The Hot Spots Methodology for Mining Large Real World Databases", in Proceedings of the 10th Australian Joint Conference on Artificial Intelligence, Perth, Australia, 1997.
  38. Williams G. "Evolutionary Hot Spots Data Mining: An Architecture for Exploring for Interesting Discoveries", in Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, Beijing, China, 1999.
  39. Witten I. and Frank E. Data Mining: Practical Machine Learning Tools and Techniques with Java, Morgan Kaufmann Publishers, California, USA, 1999.
  40. Wolpert D. "Stacked Generalization", Neural Networks, 5, pp. 241-259, 1992.
  41. Wolpert D. and Macready W. "No Free Lunch Theorems for Optimization", IEEE Transactions on Evolutionary Computation, 1, pp. 67-82, 1997.

Published in

ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1: Special issue on learning from imbalanced datasets, June 2004, 117 pages.
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1007730
Copyright © 2004 Authors
Publisher: Association for Computing Machinery, New York, NY, United States
Published: 1 June 2004