DOI: 10.1145/956750.956799

Critical event prediction for proactive management in large-scale computer clusters

Published: 24 August 2003

ABSTRACT

As the complexity of distributed computing systems increases, systems management tasks require significantly higher levels of automation; examples include diagnosis and prediction based on real-time streams of computer events, setting alarms, and performing continuous monitoring. The core of autonomic computing, a recently proposed initiative towards next-generation IT systems capable of 'self-healing', is the ability to analyze data in real time and to predict potential problems. The goal is to avoid catastrophic failures through prompt execution of remedial actions.

This paper describes an attempt to build a proactive prediction and control system for large clusters. We collected event logs containing various system reliability, availability and serviceability (RAS) events, and system activity reports (SARs) from a 350-node cluster system for a period of one year. The 'raw' system health measurements contain a great deal of redundant event data, which is either repetitive in nature or misaligned with respect to time. We applied a filtering technique and modeled the data into a set of primary and derived variables. These variables were used in probabilistic networks to establish event correlations through prediction algorithms. We also evaluated the role of time-series methods, rule-based classification algorithms and Bayesian network models in event prediction.

Based on historical data, our results suggest that it is feasible to predict system performance parameters (SARs) with a high degree of accuracy using time-series models. Rule-based classification techniques can be used to extract machine-event signatures to predict critical events with up to 70% accuracy.
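The abstract does not say which filtering technique was applied to the raw event logs. As a rough illustration of what removing repetitive events can look like, the sketch below collapses bursts of identical events from the same node into a single record. The `Event` fields, the `filter_repeats` function, and the 300-second window are all hypothetical choices made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical log record; field names are assumptions, not the paper's schema."""
    timestamp: float  # seconds, on any consistent clock
    node: str         # identifier of the reporting node
    kind: str         # event type, e.g. "disk_error"


def filter_repeats(events, window=300.0):
    """Keep an event only if the same (node, kind) pair has not been
    seen within the previous `window` seconds.

    The last-seen time is refreshed on every occurrence, so a continuous
    burst of repeats collapses to its first event.
    """
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.node, ev.kind)
        prev = last_seen.get(key)
        if prev is None or ev.timestamp - prev > window:
            kept.append(ev)
        last_seen[key] = ev.timestamp
    return kept


raw = [
    Event(0.0, "n1", "disk_error"),
    Event(10.0, "n1", "disk_error"),    # repeat inside the window, dropped
    Event(15.0, "n2", "disk_error"),    # different node, kept
    Event(20.0, "n1", "disk_error"),    # repeat inside the window, dropped
    Event(1000.0, "n1", "disk_error"),  # well after the burst, kept
]
filtered = filter_repeats(raw)
```

Sorting by timestamp first means the filter does not depend on the order in which log files were merged. Correcting the time misalignment that the abstract also mentions would need a separate step.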


Published in

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003
736 pages
ISBN: 1581137370
DOI: 10.1145/956750

Copyright © 2003 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

KDD '03 paper acceptance rate: 46 of 298 submissions, 15%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
