Abstract
This paper proposes a fraud detection method, built upon existing fraud detection research and Minority Report, to deal with the data mining problem of skewed data distributions. The method applies backpropagation (BP), naive Bayesian (NB), and C4.5 algorithms to data partitions derived from minority oversampling with replacement. Its originality lies in using a single meta-classifier (stacking) to choose the best base classifiers and then combining these base classifiers' predictions (bagging) to improve cost savings (stacking-bagging). Results on a publicly available automobile insurance fraud detection data set show that stacking-bagging performs slightly better, in terms of cost savings, than the best performing bagged algorithm, C4.5, and its best classifier, C4.5 (2). Stacking-bagging also outperforms the technique commonly used in industry (BP with neither sampling nor partitioning). The paper then compares the new fraud detection method (meta-learning approach) against C4.5 trained using undersampling, oversampling, and SMOTEing without partitioning (sampling approach). Results show that, given a fixed decision threshold and cost matrix, the approach based on partitioning and multiple algorithms achieves marginally higher cost savings than varying the class distribution of the entire training data set. The most interesting finding is that the combination of classifiers producing the best cost savings draws on all three algorithms.
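The flow described in the abstract (partition the data, choose the best base classifiers, then combine their predictions) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `make_partitions`, `select_best`, and `bagged_vote` are hypothetical, the partitioning scheme (disjoint majority chunks, each paired with the full minority set) is one plausible reading of "minority oversampling with replacement", and ranking by a validation score is a simplified stand-in for the meta-classifier used in the stacking step.

```python
import random
from collections import Counter


def make_partitions(majority, minority, n_partitions, seed=0):
    """Split the majority class into disjoint chunks and pair each chunk
    with a full copy of the minority class, so the minority examples are
    reused across partitions (assumed partitioning scheme)."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    chunks = [shuffled[i::n_partitions] for i in range(n_partitions)]
    return [chunk + minority for chunk in chunks]


def select_best(classifiers, score, k):
    """Simplified stand-in for the stacking step: keep the k classifiers
    with the highest score (for example, cost savings on held-out data)."""
    return sorted(classifiers, key=score, reverse=True)[:k]


def bagged_vote(classifiers, x):
    """Bagging step: majority vote over the selected classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

In this sketch a classifier is any callable that maps an example to a label, so models trained with BP, NB, or C4.5 on each partition could be wrapped and passed in unchanged. Using an odd number of selected classifiers avoids tied votes in the two-class case.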
- Berry M. and Linoff G. Mastering Data Mining: The Art and Science of Customer Relationship Management, John Wiley and Sons, New York, USA, 2000.
- Bigus J. Data Mining with Neural Networks, McGraw Hill, New York, USA, 1996.
- Breiman L. Heuristics of Instability in Model Selection, Technical Report, Department of Statistics, University of California at Berkeley, USA, 1994.
- Brockett P., Xia X. and Derrig R. "Using Kohonen's Self Organising Feature Map to Uncover Automobile Bodily Injury Claims Fraud", Journal of Risk and Insurance, USA, 1998.
- Cahill M., Lambert D., Pinheiro J. and Sun D. "Detecting Fraud In The Real World", in The Handbook of Massive Data Sets, Kluwer, pp911--930, 2002.
- Chan P. and Stolfo S. "A Comparative Evaluation of Voting and Meta-learning on Partitioned Data", in Proceedings of 12th International Conference on Machine Learning, California, USA, pp90--98, 1995.
- Chan P. and Stolfo S. "Toward Scalable Learning with Nonuniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection", in Proceedings of 4th International Conference on Knowledge Discovery and Data Mining, New York, USA, pp164--168, 1998.
- Chan P., Fan W., Prodromidis A. and Stolfo S. "Distributed Data Mining in Credit Card Fraud Detection", IEEE Intelligent Systems, 14, pp67--74, 1999.
- Chawla N., Bowyer K., Hall L., and Kegelmeyer W. "SMOTE: Synthetic Minority Over-sampling TEchnique", Journal of Artificial Intelligence Research, Morgan Kaufmann Publishers, 16, pp321--357, 2002.
- Chawla N. "C4.5 and Imbalanced Data sets: Investigating the Effect of Sampling Method, Probabilistic Estimate, and Decision Tree Structure", in Workshop on Learning from Imbalanced Data Sets II, ICML, Washington DC, USA, 2003.
- CoIL Challenge 2000. The Insurance Company Case, Technical Report 2000-09, Leiden Institute of Advanced Computer Science, Netherlands, 2000.
- Dick P. K. Minority Report, Orion Publishing Group, London, Great Britain, 1956.
- Domingos P. and Pazzani M. "Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier", in Proceedings of the 13th Conference on Machine Learning, Bari, Italy, pp105--112, 1996.
- Domingos P. "Metacost: A General Method for Making Classifiers Cost-sensitive", in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp155--64, 1999.
- Drummond C. and Holte R. "C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling", in Workshop on Learning from Imbalanced Data Sets II, ICML, Washington DC, USA, 2003.
- Dzeroski S. and Zenko B. "Is Combining Classifiers with Stacking Better than Selecting the Best One?", Machine Learning, Kluwer, 54, pp255--273, 2004.
- Elkan C. Naive Bayesian Learning, Technical Report CS97-557, Department of Computer Science and Engineering, University of California, San Diego, USA, 1997.
- Elkan C. Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000, Department of Computer Science and Engineering, University of California, San Diego, USA, 2001.
- Fawcett T. and Provost F. "Combining Data Mining and Machine Learning for Effective User Profiling", in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Oregon, USA, 1996.
- Fawcett T. and Provost F. "Adaptive fraud detection", Data Mining and Knowledge Discovery, Kluwer, 1, pp291--316, 1997.
- Han J. and Kamber M. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, USA, 2001.
- He H., Wang J., Graco W. and Hawkins S. "Application of Neural Networks to Detection of Medical Fraud", Expert Systems with Applications, 13, pp329--336, 1997.
- Insurance Information Institute. Facts and Statistics on Auto Insurance, NY, USA, 2003.
- Japkowicz N. and Stephen S. "The Class Imbalance Problem: A Systematic Study", Intelligent Data Analysis, 6, 2002.
- Kalousis A., Joao G and Hilario M. "On Data and Algorithms: Understanding Inductive Performance", Machine Learning, Kluwer, 54, pp275--312, 2004.
- Kuncheva L. and Whitaker C. "Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy", Machine Learning, Kluwer, 51, pp181--207, 2003.
- Maes S., Tuyls K., Vanschoenwinkel B. and Manderick B. "Credit Card Fraud Detection Using Bayesian and Neural Networks", in Proceedings of the 1st International NAISO Congress on Neuro Fuzzy Technologies, Havana, Cuba, 2002.
- Maloof M. "Learning When Data Sets are imbalanced and When Costs are Unequal and Unknown", in Workshop on Learning from Imbalanced Data Sets II, ICML, Washington DC, USA, 2003.
- Ormerod T., Morley N., Ball L., Langley C. and Spenser C. "Using Ethnography To Design a Mass Detection Tool (MDT) For The Early Discovery of Insurance Fraud", in Proceedings of ACM CHI Conference, Florida, USA, 2003.
- Payscale. http://www.payscale.com/salary-survey/aid-8483/raname-HOURLYRATE/fid-6886, date last updated: 2003, date accessed: 28th April 2004, 2004.
- Prodromidis A. Management of Intelligent Learning Agents in Distributed Data Mining Systems, Unpublished PhD thesis, Columbia University, USA, 1999.
- Provost F. "Machine Learning from Imbalanced Data Sets 101", Invited paper, in Workshop on Learning from Imbalanced Data Sets, AAAI, Texas, USA, 2000.
- Provost F. and Fawcett T. "Robust Classification Systems for Imprecise Environments", Machine Learning, Kluwer, 42, pp203--231, 2001.
- Pyle D. Data Preparation for Data Mining, Morgan Kaufmann Publishers, San Francisco, USA, 1999.
- Salzberg S. L. "On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach", Data Mining and Knowledge Discovery, Kluwer, 1, pp317--327, 1997.
- Weatherford M. "Mining for Fraud", IEEE Intelligent Systems, July/August Issue, pp4--6, 2002.
- Williams G. and Huang Z. "Mining the Knowledge Mine: The Hot Spots Methodology for Mining Large Real World Databases", in Proceedings of the 10th Australian Joint Conference on Artificial Intelligence, Perth, Australia, 1997.
- Williams G. "Evolutionary Hot Spots Data Mining: An Architecture for Exploring for Interesting Discoveries", in Proceedings of the 3rd Pacific-Asia Conference in Knowledge Discovery and Data Mining, Beijing, China, 1999.
- Witten I. and Frank E. Data Mining: Practical Machine Learning Tools and Techniques with Java, Morgan Kaufmann Publishers, California, USA, 1999.
- Wolpert D. "Stacked Generalization", Neural Networks, 5, pp241--259, 1992.
- Wolpert D. and Macready W. "No Free Lunch Theorems for Optimization", IEEE Transactions on Evolutionary Computation, 1, pp67--82, 1997.
Index Terms
- Minority report in fraud detection: classification of skewed data