Data Mining at the Interface of Computer Science and Statistics

Smyth, Padhraic

doi:10.1007/978-1-4615-1733-7_3

Part of the book series: Massive Computing (MACO, volume 2)

Abstract

This chapter is written for computer scientists, engineers, mathematicians, and scientists who wish to gain a better understanding of the role of statistical thinking in modern data mining. Data mining has attracted considerable attention in both the research and commercial arenas in recent years, and it involves the application of a variety of techniques from both computer science and statistics. The chapter discusses how computer scientists and statisticians approach data from different but complementary viewpoints, and highlights the fundamental differences between statistical and computational views of data mining. In doing so, we review the historical importance of statistical contributions to machine learning and data mining, including neural networks, graphical models, and flexible predictive modeling. The primary conclusion is that closer integration of computational methods with statistical thinking is likely to become increasingly important in data mining applications.

Editor information

Editors and Affiliations

University of Illinois, Chicago, USA
Robert L. Grossman
Lawrence Livermore National Laboratory, Livermore, USA
Chandrika Kamath
Sandia National Laboratories, Livermore, USA
Philip Kegelmeyer
Army High Performance Computing Research Center (AHPCRC), Minneapolis, USA
Vipin Kumar
Army Research Laboratory, Aberdeen Proving Ground, USA
Raju R. Namburu

Cite this chapter

Smyth, P. (2001). Data Mining at the Interface of Computer Science and Statistics. In: Grossman, R.L., Kamath, C., Kegelmeyer, P., Kumar, V., Namburu, R.R. (eds) Data Mining for Scientific and Engineering Applications. Massive Computing, vol 2. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-1733-7_3

DOI: https://doi.org/10.1007/978-1-4615-1733-7_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-0114-7
Online ISBN: 978-1-4615-1733-7
eBook Packages: Springer Book Archive
