
Time machine : generative real-time model for failure (and lead time) prediction in HPC systems
Alharthi, Khalid Ayed, Jhumka, Arshad, Di, Sheng, Gui, Lin, Cappello, Franck and McIntosh-Smith, Simon (2023) Time machine : generative real-time model for failure (and lead time) prediction in HPC systems. In: 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (IEEE IFIP DSN 2023), Porto, Portugal, 27-30 Jun 2023. Published in: Proceedings of the DSN 2023 (In Press)
PDF: WRAP-time-machine-generative-real-time-model-failure-(and-lead-time)-prediction-HPC-systems-Jhumka-2023.pdf - Accepted Version (974Kb)
Abstract
High Performance Computing (HPC) systems generate a large volume of unstructured, alphanumeric log messages that capture the health state of their components. Due to their design complexity, HPC systems often undergo failures that halt the execution of applications (e.g., weather prediction, aerodynamics simulation). However, existing failure prediction methods, which typically seek to extract information-theoretic features, fail to scale in terms of both accuracy and prediction speed, limiting their adoption in real-time production systems. In this paper, unlike existing work and inspired by current transformer-based neural networks, which have revolutionized sequential learning in NLP tasks, we propose a novel, scalable, log-based, self-supervised model (i.e., no need for manual labels), called Time Machine, that predicts (i) forthcoming log events, (ii) the upcoming failure and its location, and (iii) the expected lead time to failure. Time Machine is designed by combining two stacks of transformer decoders, each employing the self-attention mechanism. The first stack addresses the failure location by predicting the sequence of log events and then identifying whether a failure event is part of that sequence. The second stack addresses the lead time to the predicted failure. We evaluate Time Machine on four real-world HPC log datasets and compare it against three state-of-the-art failure prediction approaches. Results show that Time Machine significantly outperforms the related work on BLEU, ROUGE, MCC, and F1-score in predicting forthcoming events, failure location, and failure lead time, with higher prediction speed.
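The abstract names MCC (Matthews correlation coefficient) and F1-score among its evaluation metrics. As a minimal sketch of how these two metrics are commonly computed from binary confusion-matrix counts (failure vs. no failure), the Python below uses their standard definitions. The function names and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for illustration only (not results from the paper):
# 8 failures correctly predicted, 2 false alarms, 2 missed failures,
# 88 correctly predicted non-failures.
example_f1 = f1_score(tp=8, fp=2, fn=2)
example_mcc = mcc(tp=8, tn=88, fp=2, fn=2)
```

Unlike F1, MCC also takes true negatives into account, which is one reason it is often reported alongside F1 when failures are rare relative to normal events.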
Item Type: | Conference Item (Paper)
---|---
Subjects: | Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: | Faculty of Science, Engineering and Medicine > Science > Computer Science
Library of Congress Subject Headings (LCSH): | High performance computing, Supercomputers, Computer engineering, Electronic data processing -- Distributed processing, Soft errors (Computer science), Fault-tolerant computing
Journal or Publication Title: | Proceedings of the DSN 2023
Publisher: | IEEE
Official Date: | 2023
Status: | Peer Reviewed
Publication Status: | In Press
Reuse Statement (publisher, data, author rights): | © 2023 Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Access rights to Published version: | Restricted or Subscription Access
Date of first compliant deposit: | 21 April 2023
Date of first compliant Open Access: | 21 April 2023
Conference Paper Type: | Paper
Title of Event: | 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (IEEE IFIP DSN 2023)
Type of Event: | Conference
Location of Event: | Porto, Portugal
Date(s) of Event: | 27-30 Jun 2023