
Minority report in fraud detection: classification of skewed data

Published: 01 June 2004

Abstract

This paper proposes an innovative fraud detection method, built upon existing fraud detection research and Minority Report, to deal with the data mining problem of skewed data distributions. The method uses backpropagation (BP), together with naive Bayesian (NB) and C4.5 algorithms, on data partitions derived from minority oversampling with replacement. Its originality lies in the use of a single meta-classifier (stacking) to choose the best base classifiers, and then in combining these base classifiers' predictions (bagging) to improve cost savings (stacking-bagging). Results from a publicly available automobile insurance fraud detection data set demonstrate that stacking-bagging performs slightly better, in terms of cost savings, than the best-performing bagged algorithm, C4.5, and its best classifier, C4.5 (2). Stacking-bagging also outperforms the technique commonly used in industry (BP without either sampling or partitioning). Subsequently, this paper compares the new fraud detection method (meta-learning approach) against C4.5 trained with undersampling, oversampling, and SMOTEing but without partitioning (sampling approach). Results show that, given a fixed decision threshold and cost matrix, the partitioning and multiple-algorithms approach achieves marginally higher cost savings than varying the class distribution of the entire training data set. The most interesting finding is that the combination of classifiers that yields the best cost savings draws contributions from all three algorithms.
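The stacking-bagging procedure outlined above can be pictured as: oversample the minority (fraud) class with replacement, partition the rebalanced training data, train BP, NB, and C4.5-style base classifiers on each partition, let a single meta-classifier learn from the base classifiers' predictions and identify the best of them, and finally bag (vote) the predictions of the selected bases. The sketch below is only an illustration of that idea under stated assumptions, not the authors' implementation: it uses scikit-learn stand-ins (MLPClassifier for backpropagation, GaussianNB for naive Bayes, DecisionTreeClassifier for C4.5), synthetic skewed data in place of the insurance data set, a logistic-regression meta-learner whose coefficient magnitudes pick the strongest base classifiers, and arbitrary choices for the partition count and number of bases kept.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Skewed two-class data standing in for the automobile insurance set
# (roughly 6% positives).
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.94],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
# Hold out part of the training data for the meta-learner (stacking level).
X_base, X_meta, y_base, y_meta = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0)

# Minority oversampling with replacement, then split into partitions so
# each base classifier sees a more balanced class mix (4 partitions is an
# illustrative choice, not the paper's).
X_min, y_min = resample(X_base[y_base == 1], y_base[y_base == 1],
                        replace=True, n_samples=int((y_base == 0).sum()),
                        random_state=0)
X_bal = np.vstack([X_base[y_base == 0], X_min])
y_bal = np.concatenate([y_base[y_base == 0], y_min])
order = np.random.RandomState(0).permutation(len(y_bal))

# One BP, NB, and C4.5-style learner per partition.
bases = []
for idx in np.array_split(order, 4):
    for clf in (MLPClassifier(max_iter=300, random_state=0),
                GaussianNB(),
                DecisionTreeClassifier(random_state=0)):
        bases.append(clf.fit(X_bal[idx], y_bal[idx]))

# Stacking: the meta-learner is trained on the base classifiers' predictions.
meta_X = np.column_stack([c.predict(X_meta) for c in bases])
meta = LogisticRegression(max_iter=1000).fit(meta_X, y_meta)

# Keep the base classifiers the meta-learner weights most, then bag
# (majority-vote) their predictions on the test set.
best = np.argsort(np.abs(meta.coef_[0]))[-5:]   # top 5 is an assumption
votes = np.column_stack([bases[i].predict(X_test) for i in best])
y_pred = (votes.mean(axis=1) >= 0.5).astype(int)
print("fraud recall of the stacked-and-bagged ensemble:",
      round(float((y_pred[y_test == 1] == 1).mean()), 3))

Note that the paper's objective is cost savings under a fixed cost matrix, so a faithful implementation would select and evaluate base classifiers by cost savings rather than by coefficient magnitude or plain accuracy as in this sketch.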

References

  1. Berry M. and Linoff G. Mastering Data Mining: The Art and Science of Customer Relationship Management, John Wiley and Sons, New York, USA, 2000.
  2. Bigus J. Data Mining with Neural Networks, McGraw Hill, New York, USA, 1996.
  3. Breiman L. Heuristics of Instability in Model Selection, Technical Report, Department of Statistics, University of California at Berkeley, USA, 1994.
  4. Brockett P., Xia X. and Derrig R. "Using Kohonen's Self Organising Feature Map to Uncover Automobile Bodily Injury Claims Fraud", Journal of Risk and Insurance, USA, 1998.
  5. Cahill M., Lambert D., Pinheiro J. and Sun D. "Detecting Fraud In The Real World", in The Handbook of Massive Data Sets, Kluwer, pp. 911-930, 2002.
  6. Chan P. and Stolfo S. "A Comparative Evaluation of Voting and Meta-learning on Partitioned Data", in Proceedings of the 12th International Conference on Machine Learning, California, USA, pp. 90-98, 1995.
  7. Chan P. and Stolfo S. "Toward Scalable Learning with Nonuniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection", in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, New York, USA, pp. 164-168, 1998.
  8. Chan P., Fan W., Prodromidis A. and Stolfo S. "Distributed Data Mining in Credit Card Fraud Detection", IEEE Intelligent Systems, 14, pp. 67-74, 1999.
  9. Chawla N., Bowyer K., Hall L. and Kegelmeyer W. "SMOTE: Synthetic Minority Over-sampling Technique", Journal of Artificial Intelligence Research, Morgan Kaufmann Publishers, 16, pp. 321-357, 2002.
  10. Chawla N. "C4.5 and Imbalanced Data Sets: Investigating the Effect of Sampling Method, Probabilistic Estimate, and Decision Tree Structure", in Workshop on Learning from Imbalanced Data Sets II, ICML, Washington DC, USA, 2003.
  11. CoIL Challenge 2000. The Insurance Company Case, Technical Report 2000-09, Leiden Institute of Advanced Computer Science, Netherlands, 2000.
  12. Dick P. K. Minority Report, Orion Publishing Group, London, Great Britain, 1956.
  13. Domingos P. and Pazzani M. "Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier", in Proceedings of the 13th Conference on Machine Learning, Bari, Italy, pp. 105-112, 1996.
  14. Domingos P. "MetaCost: A General Method for Making Classifiers Cost-sensitive", in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 155-164, 1999.
  15. Drummond C. and Holte R. "C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling", in Workshop on Learning from Imbalanced Data Sets II, ICML, Washington DC, USA, 2003.
  16. Dzeroski S. and Zenko B. "Is Combining Classifiers with Stacking Better than Selecting the Best One?", Machine Learning, Kluwer, 54, pp. 255-273, 2004.
  17. Elkan C. Naive Bayesian Learning, Technical Report CS97-557, Department of Computer Science and Engineering, University of California, San Diego, USA, 1997.
  18. Elkan C. Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000, Department of Computer Science and Engineering, University of California, San Diego, USA, 2001.
  19. Fawcett T. and Provost F. "Combining Data Mining and Machine Learning for Effective User Profiling", in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Oregon, USA, 1996.
  20. Fawcett T. and Provost F. "Adaptive Fraud Detection", Data Mining and Knowledge Discovery, Kluwer, 1, pp. 291-316, 1997.
  21. Han J. and Kamber M. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, USA, 2001.
  22. He H., Wang J., Graco W. and Hawkins S. "Application of Neural Networks to Detection of Medical Fraud", Expert Systems with Applications, 13, pp. 329-336, 1997.
  23. Insurance Information Institute. Facts and Statistics on Auto Insurance, NY, USA, 2003.
  24. Japkowicz N. and Stephen S. "The Class Imbalance Problem: A Systematic Study", Intelligent Data Analysis, 6, 2002.
  25. Kalousis A., Gama J. and Hilario M. "On Data and Algorithms: Understanding Inductive Performance", Machine Learning, Kluwer, 54, pp. 275-312, 2004.
  26. Kuncheva L. and Whitaker C. "Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy", Machine Learning, Kluwer, 51, pp. 181-207, 2003.
  27. Maes S., Tuyls K., Vanschoenwinkel B. and Manderick B. "Credit Card Fraud Detection Using Bayesian and Neural Networks", in Proceedings of the 1st International NAISO Congress on Neuro Fuzzy Technologies, Havana, Cuba, 2002.
  28. Maloof M. "Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown", in Workshop on Learning from Imbalanced Data Sets II, ICML, Washington DC, USA, 2003.
  29. Ormerod T., Morley N., Ball L., Langley C. and Spenser C. "Using Ethnography To Design a Mass Detection Tool (MDT) For The Early Discovery of Insurance Fraud", in Proceedings of the ACM CHI Conference, Florida, USA, 2003.
  30. Payscale. http://www.payscale.com/salary-survey/aid-8483/raname-HOURLYRATE/fid-6886, last updated 2003, accessed 28 April 2004.
  31. Prodromidis A. Management of Intelligent Learning Agents in Distributed Data Mining Systems, Unpublished PhD thesis, Columbia University, USA, 1999.
  32. Provost F. "Machine Learning from Imbalanced Data Sets 101", invited paper, in Workshop on Learning from Imbalanced Data Sets, AAAI, Texas, USA, 2000.
  33. Provost F. and Fawcett T. "Robust Classification Systems for Imprecise Environments", Machine Learning, Kluwer, 42, pp. 203-231, 2001.
  34. Pyle D. Data Preparation for Data Mining, Morgan Kaufmann Publishers, San Francisco, USA, 1999.
  35. Salzberg S. L. "On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach", Data Mining and Knowledge Discovery, Kluwer, 1, pp. 317-327, 1997.
  36. Weatherford M. "Mining for Fraud", IEEE Intelligent Systems, July/August issue, pp. 4-6, 2002.
  37. Williams G. and Huang Z. "Mining the Knowledge Mine: The Hot Spots Methodology for Mining Large Real World Databases", in Proceedings of the 10th Australian Joint Conference on Artificial Intelligence, Perth, Australia, 1997.
  38. Williams G. "Evolutionary Hot Spots Data Mining: An Architecture for Exploring for Interesting Discoveries", in Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, Beijing, China, 1999.
  39. Witten I. and Frank E. Data Mining: Practical Machine Learning Tools and Techniques with Java, Morgan Kaufmann Publishers, California, USA, 1999.
  40. Wolpert D. "Stacked Generalization", Neural Networks, 5, pp. 241-259, 1992.
  41. Wolpert D. and Macready W. "No Free Lunch Theorems for Optimization", IEEE Transactions on Evolutionary Computation, 1, pp. 67-82, 1997.

Published in

ACM SIGKDD Explorations Newsletter, Volume 6, Issue 1: Special issue on learning from imbalanced datasets, June 2004, 117 pages.
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1007730
Copyright © 2004 Authors
Publisher: Association for Computing Machinery, New York, NY, United States
Published: 1 June 2004