Clairvoyant : a log-based transformer-decoder for failure prediction in large-scale systems

WRAP-clairvoyant-log-based-transformer-decoder-failure-prediction-large-scale-systems-Jhumka-2022.pdf - Accepted Version

Abstract

System failures are expected to be frequent in the exascale era, as they are in current petascale systems. The health of such systems is usually determined through challenging analysis of large amounts of unstructured and redundant log data. In this paper, we leverage log data and propose Clairvoyant, a novel self-supervised (i.e., no labels needed) model that predicts node failures in HPC systems, based on a recent deep learning approach, the transformer-decoder, and its self-attention mechanism. Clairvoyant predicts node failures by (i) predicting a sequence of log events and then (ii) identifying whether a failure is part of that sequence. We carefully evaluate Clairvoyant and another state-of-the-art failure prediction approach, Desh, on two real-world system log datasets. Experiments show that Clairvoyant is significantly better: e.g., it predicts node failures with average BLEU, ROUGE, and MCC scores of 0.90, 0.78, and 0.65 respectively, while Desh scores only 0.58, 0.58, and 0.25. More importantly, this improvement is achieved with faster training and prediction, with Clairvoyant being about 25× and 15× faster than Desh respectively.
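
The second step of the approach, checking whether a predicted log-event sequence contains a failure, and the MCC metric quoted above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the event names, the `FAILURE_EVENTS` set, and the helper functions are hypothetical, and the predicted sequences are assumed to come from some sequence model (step (i)), which is not reproduced here.

```python
import math

# Hypothetical failure event names; the abstract does not list the real vocabulary.
FAILURE_EVENTS = {"node_down", "kernel_panic"}


def contains_failure(predicted_events, failure_events=FAILURE_EVENTS):
    """Step (ii): report whether any predicted log event is a failure event."""
    return any(event in failure_events for event in predicted_events)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: return 0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


def evaluate(predicted_sequences, actual_failures):
    """Score failure predictions over sequences against ground-truth flags."""
    tp = tn = fp = fn = 0
    for sequence, actual in zip(predicted_sequences, actual_failures):
        predicted = contains_failure(sequence)
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return mcc(tp, tn, fp, fn)


# Illustrative input only: two predicted sequences and whether a failure really occurred.
sequences = [["boot", "link_error", "node_down"], ["boot", "job_start"]]
score = evaluate(sequences, [True, False])
```

MCC ranges from -1 to 1 and accounts for all four confusion-matrix cells, which makes it a reasonable choice when failures are rare relative to normal events.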

Item Type: Conference Item (Paper)
Subjects: Q Science > Q Science (General)
Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: Faculty of Science, Engineering and Medicine > Science > Computer Science
Library of Congress Subject Headings (LCSH): Deep learning (Machine learning), High performance computing, Computer system failures, Software failures
Journal or Publication Title: ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing
Publisher: ACM
ISBN: 9781450392815
Official Date: 28 June 2022
Dates: 28 June 2022 (Available); 14 April 2022 (Accepted)
Page Range: pp. 1-14
Article Number: 35
DOI: 10.1145/3524059.3532374
Status: Peer Reviewed
Publication Status: Published
Re-use Statement: © ACM 2022. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing, http://dx.doi.org/10.1145/3524059.3532374
Access rights to Published version: Restricted or Subscription Access
Description: Free access
Date of first compliant deposit: 20 May 2022
Date of first compliant Open Access: 20 May 2022
Conference Paper Type: Paper
Title of Event: ACM International Conference on Supercomputing
Type of Event: Conference
Location of Event: Virtual
Date(s) of Event: 27-30 Jun 2022
Related URLs:
URI: https://wrap.warwick.ac.uk/165622/