Critical event prediction for proactive management in large-scale computer clusters

Authors:
R. K. Sahoo, IBM T. J. Watson Research Center, Yorktown Heights, NY
A. J. Oliner, IBM T. J. Watson Research Center, Yorktown Heights, NY
I. Rish, IBM T. J. Watson Research Center, Yorktown Heights, NY
M. Gupta, IBM T. J. Watson Research Center, Yorktown Heights, NY
J. E. Moreira, IBM T. J. Watson Research Center, Yorktown Heights, NY
S. Ma, IBM T. J. Watson Research Center, Yorktown Heights, NY
R. Vilalta, University of Houston, Houston, TX
A. Sivasubramaniam, Penn. State University, College Park, PA

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, August 2003, Pages 426–435
https://doi.org/10.1145/956750.956799

Published: 24 August 2003

ABSTRACT

As the complexity of distributed computing systems increases, systems management tasks require significantly higher levels of automation; examples include diagnosis and prediction based on real-time streams of computer events, setting alarms, and performing continuous monitoring. The core of autonomic computing, a recently proposed initiative towards next-generation IT systems capable of 'self-healing', is the ability to analyze data in real time and to predict potential problems. The goal is to avoid catastrophic failures through prompt execution of remedial actions.

This paper describes an attempt to build a proactive prediction and control system for large clusters. We collected event logs containing various system reliability, availability and serviceability (RAS) events, and system activity reports (SARs) from a 350-node cluster system for a period of one year. The 'raw' system health measurements contain a great deal of redundant event data, which is either repetitive in nature or misaligned with respect to time. We applied a filtering technique and modeled the data into a set of primary and derived variables. These variables were used in probabilistic networks to establish event correlations through prediction algorithms. We also evaluated the role of time-series methods, rule-based classification algorithms and Bayesian network models in event prediction.

Based on historical data, our results suggest that it is feasible to predict system performance parameters (SARs) with a high degree of accuracy using time-series models. Rule-based classification techniques can be used to extract machine-event signatures to predict critical events with up to 70% accuracy.
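
The abstract does not say how the filtering technique works. As a rough illustration of what removing repetitive event data can look like, the sketch below collapses bursts of identical events from the same node into a single record. The `Event` fields, the `filter_repeats` name, and the 300-second window are all assumptions made for this example; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since an arbitrary epoch (assumed unit)
    node: str         # hypothetical node identifier
    kind: str         # hypothetical event type label


def filter_repeats(events, window=300.0):
    """Keep an event only if the same (node, kind) pair was not seen
    within the previous `window` seconds."""
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.node, ev.kind)
        previous = last_seen.get(key)
        if previous is None or ev.timestamp - previous > window:
            kept.append(ev)
        # Refresh the timestamp even for dropped events, so that a long
        # burst of repeats collapses into its first occurrence.
        last_seen[key] = ev.timestamp
    return kept


log = [
    Event(0.0, "node-1", "disk_error"),
    Event(10.0, "node-1", "disk_error"),   # repeat inside the window
    Event(20.0, "node-2", "disk_error"),   # different node, kept
    Event(1000.0, "node-1", "disk_error"), # outside the window, kept
]
filtered = filter_repeats(log)
```

Sorting by timestamp before filtering is one simple way to cope with records that arrive out of order; a real pipeline would likely need a more careful alignment step.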

Index Terms

1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications

Published in
KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003
736 pages
ISBN: 1581137370
DOI: 10.1145/956750
Conference Chair: Lise Getoor, University of Maryland, College Park
General Chair: Ted Senator, DARPA
Program Chairs: Pedro Domingos, University of Washington; Christos Faloutsos, Carnegie Mellon University
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
critical event prediction
large-scale clusters
system event log
Qualifiers
- Article
Conference

Acceptance Rates
KDD '03 paper acceptance rate: 46 of 298 submissions, 15%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%