ABSTRACT
As the complexity of distributed computing systems increases, systems management tasks require significantly higher levels of automation; examples include diagnosis and prediction based on real-time streams of computer events, setting alarms, and performing continuous monitoring. The core of autonomic computing, a recently proposed initiative towards next-generation IT systems capable of 'self-healing', is the ability to analyze data in real time and to predict potential problems. The goal is to avoid catastrophic failures through prompt execution of remedial actions.

This paper describes an attempt to build a proactive prediction and control system for large clusters. We collected event logs containing various system reliability, availability and serviceability (RAS) events, and system activity reports (SARs), from a 350-node cluster system for a period of one year. The 'raw' system health measurements contain a great deal of redundant event data, which is either repetitive in nature or misaligned with respect to time. We applied a filtering technique and modeled the data as a set of primary and derived variables. These variables were used in probabilistic networks to establish event correlations through prediction algorithms. We also evaluated the role of time-series methods, rule-based classification algorithms and Bayesian network models in event prediction.

Based on historical data, our results suggest that it is feasible to predict system performance parameters (SARs) with a high degree of accuracy using time-series models. Rule-based classification techniques can be used to extract machine-event signatures to predict critical events with up to 70% accuracy.
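The abstract does not say how the filtering step works. As a rough illustration only, the sketch below shows one common way to collapse repetitive events: an event is dropped if the same node reported the same kind of event within a recent time window. The `Event` fields, the `filter_repeats` name and the 300-second window are all hypothetical choices made for this example, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical RAS event record; field names are assumptions."""
    timestamp: float  # seconds since an arbitrary epoch
    node: str         # identifier of the reporting node
    kind: str         # event type, e.g. "ECC"


def filter_repeats(events, window=300.0):
    """Collapse bursts of identical (node, kind) events.

    An event is kept only if the previous event with the same
    (node, kind) key occurred more than `window` seconds earlier.
    The last-seen time is refreshed on every occurrence, so a long
    burst is reduced to its first event.
    """
    last_seen = {}
    kept = []
    # Sort first so out-of-order log entries are handled consistently.
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.node, ev.kind)
        prev = last_seen.get(key)
        if prev is None or ev.timestamp - prev > window:
            kept.append(ev)
        last_seen[key] = ev.timestamp
    return kept
```

For example, three "ECC" events from the same node at 0, 10 and 20 seconds would be reduced to the first one, while a fourth at 400 seconds would be kept as the start of a new burst. Refreshing the last-seen time on every occurrence is a design choice: it treats a continuous stream of repeats as a single incident, however long it lasts.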
Index Terms
- Critical event prediction for proactive management in large-scale computer clusters