Data Mining at the Interface of Computer Science and Statistics

  • Chapter
Data Mining for Scientific and Engineering Applications

Part of the book series: Massive Computing (MACO, volume 2)

Abstract

This chapter is written for computer scientists, engineers, mathematicians, and scientists who wish to gain a better understanding of the role of statistical thinking in modern data mining. In recent years data mining has attracted considerable attention in both the research and commercial arenas, and it involves applying a variety of techniques from both computer science and statistics. The chapter discusses how computer scientists and statisticians approach data from different but complementary viewpoints and highlights the fundamental differences between statistical and computational views of data mining. In doing so, we review the historical importance of statistical contributions to machine learning and data mining, including neural networks, graphical models, and flexible predictive modeling. The primary conclusion is that closer integration of computational methods with statistical thinking is likely to become increasingly important in data mining applications.

Copyright information

© 2001 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Smyth, P. (2001). Data Mining at the Interface of Computer Science and Statistics. In: Grossman, R.L., Kamath, C., Kegelmeyer, P., Kumar, V., Namburu, R.R. (eds) Data Mining for Scientific and Engineering Applications. Massive Computing, vol 2. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-1733-7_3

  • DOI: https://doi.org/10.1007/978-1-4615-1733-7_3

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4020-0114-7

  • Online ISBN: 978-1-4615-1733-7

  • eBook Packages: Springer Book Archive
