Abstract
This chapter is written for computer scientists, engineers, mathematicians, and scientists who wish to gain a better understanding of the role of statistical thinking in modern data mining. Data mining has attracted considerable attention in both the research and commercial arenas in recent years, and it draws on a variety of techniques from both computer science and statistics. The chapter discusses how computer scientists and statisticians approach data from different but complementary viewpoints and highlights the fundamental differences between the statistical and computational views of data mining. In doing so, we review the historical importance of statistical contributions to machine learning and data mining, including neural networks, graphical models, and flexible predictive modeling. The primary conclusion is that closer integration of computational methods with statistical thinking is likely to become increasingly important in data mining applications.
Copyright information
© 2001 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Smyth, P. (2001). Data Mining at the Interface of Computer Science and Statistics. In: Grossman, R.L., Kamath, C., Kegelmeyer, P., Kumar, V., Namburu, R.R. (eds) Data Mining for Scientific and Engineering Applications. Massive Computing, vol 2. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-1733-7_3
DOI: https://doi.org/10.1007/978-1-4615-1733-7_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-0114-7
Online ISBN: 978-1-4615-1733-7
eBook Packages: Springer Book Archive