
Clairvoyant : a log-based transformer-decoder for failure prediction in large-scale systems
Alharthi, Khalid, Jhumka, Arshad, Sheng, Di and Cappello, Franck (2022) Clairvoyant : a log-based transformer-decoder for failure prediction in large-scale systems. In: ACM International Conference on Supercomputing, Virtual, 27-30 Jun 2022. Published in: ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing pp. 1-14. ISBN 9781450392815. doi:10.1145/3524059.3532374
PDF: WRAP-clairvoyant-log-based-transformer-decoder-failure-prediction-large-scale-systems-Jhumka-2022.pdf - Accepted Version (1409Kb)
Official URL: https://doi.org/10.1145/3524059.3532374
Abstract
System failures are expected to be frequent in the exascale era, as they already are in current petascale systems. The health of such systems is usually determined through challenging analysis of large amounts of unstructured and redundant log data. In this paper, we leverage log data and propose Clairvoyant, a novel self-supervised (i.e., no labels needed) model that predicts node failures in HPC systems, based on a recent deep learning approach called the transformer-decoder and its self-attention mechanism. Clairvoyant predicts node failures by (i) predicting a sequence of log events and then (ii) identifying whether a failure is part of that sequence. We carefully evaluate Clairvoyant and another state-of-the-art failure prediction approach, Desh, on two real-world system log datasets. Experiments show that Clairvoyant is significantly better: e.g., it predicts node failures with average BLEU, ROUGE, and MCC scores of 0.90, 0.78, and 0.65 respectively, while Desh scores only 0.58, 0.58, and 0.25. More importantly, this improvement is achieved with faster training and prediction, with Clairvoyant being about 25× and 15× faster than Desh respectively.
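The second step of the approach described in the abstract (checking whether a failure event appears in a predicted log sequence) and the MCC score used for evaluation can be illustrated with a minimal sketch. This is not the authors' implementation: the transformer-decoder that generates the sequences is omitted entirely, and the function names, event names, and the idea of a fixed set of failure event identifiers are all assumptions made for illustration.

```python
import math
from typing import Sequence, Set


def sequence_contains_failure(predicted_events: Sequence[str],
                              failure_events: Set[str]) -> bool:
    """Step (ii): flag a node if any predicted log event is a known failure event.

    `failure_events` is a hypothetical set of event identifiers treated as failures.
    """
    return any(event in failure_events for event in predicted_events)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def evaluate(predicted_sequences: Sequence[Sequence[str]],
             actual_failures: Sequence[bool],
             failure_events: Set[str]) -> float:
    """Score failure predictions derived from predicted sequences against ground truth."""
    tp = tn = fp = fn = 0
    for seq, actual in zip(predicted_sequences, actual_failures):
        predicted = sequence_contains_failure(seq, failure_events)
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return mcc(tp, tn, fp, fn)


if __name__ == "__main__":
    # Made-up event names and outcomes, purely for illustration.
    failures = {"mce_error", "node_down"}
    sequences = [
        ["boot", "mce_error"],
        ["boot", "ok"],
        ["net_up", "ok"],
        ["boot", "node_down"],
    ]
    truth = [True, False, False, False]
    print(f"MCC = {evaluate(sequences, truth, failures):.3f}")
```

MCC ranges from -1 to 1 and accounts for all four confusion-matrix cells, which makes it a reasonable choice when failures are rare compared with normal operation.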
Item Type: | Conference Item (Paper) |
---|---|
Subjects: | Q Science > Q Science (General); Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software |
Divisions: | Faculty of Science, Engineering and Medicine > Science > Computer Science |
Library of Congress Subject Headings (LCSH): | Deep learning (Machine learning), High performance computing, Computer system failures, Software failures |
Journal or Publication Title: | ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing |
Publisher: | ACM |
ISBN: | 9781450392815 |
Official Date: | 28 June 2022 |
Page Range: | pp. 1-14 |
Article Number: | 35 |
DOI: | 10.1145/3524059.3532374 |
Status: | Peer Reviewed |
Publication Status: | Published |
Reuse Statement (publisher, data, author rights): | © ACM. 2022 This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing, http://dx.doi.org/10.1145/3524059.3532374 |
Access rights to Published version: | Restricted or Subscription Access |
Description: | Free access |
Date of first compliant deposit: | 20 May 2022 |
Date of first compliant Open Access: | 20 May 2022 |
Conference Paper Type: | Paper |
Title of Event: | ACM International Conference on Supercomputing |
Type of Event: | Conference |
Location of Event: | Virtual |
Date(s) of Event: | 27-30 Jun 2022 |