Failure diagnosis for cluster systems using partial correlations
Chuah, Edward, Jhumka, Arshad, Alt, Samantha, Evans, R. Todd and Suri, Neeraj (2021) Failure diagnosis for cluster systems using partial correlations. In: 2021 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SocialCom/SustainCom), New York City, NY, USA, 30 Sept - 03 Oct 2021. Published in: Proceedings of the 2021 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SocialCom/SustainCom) pp. 1091-1101. ISBN 9781665435741. doi:10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151
Research output not available from this repository.
Official URL: http://dx.doi.org/10.1109/ISPA-BDCloud-SocialCom-S...
Abstract
Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effective diagnosis of system failures is desired to help improve system reliability from both a remedial and preventive perspective. As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft-advocated basis for failure diagnosis. However, the high levels of concurrency that exist in HPC systems cause system events to frequently interleave in time and, as such, certain interactions appear or become indirect, and these are missed by current failure diagnostics techniques. To help uncover such indirect interactions, in this paper, we develop a novel approach that leverages the concept of partial correlation. The novel failure diagnostics workflow - called IFADE - extracts partial correlation of resource use counters and partial correlation of system errors. As part of our contributions, we (a) compare our diagnostics approach with current ones, (b) identify two previously unknown causes of system failures, validated by system designers and (c) provide insights into Lustre I/O and segmentation faults. IFADE has been released into the public domain to support system administrators in failure diagnosis.
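As a rough illustration of the statistical idea the abstract names, the sketch below computes a first-order partial correlation: the correlation between two series after the linear influence of a third has been removed. It is not IFADE's implementation, which this record does not describe; the function names, the synthetic data, and the variable names (`shared_load`, `counter_a`, `counter_b`) are assumptions made purely for illustration.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def partial_correlation(x, y, z):
    """Correlation of x and y with the linear effect of z removed.

    Uses the standard first-order formula:
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data (hypothetical): one shared factor drives two counters
# that have no direct link to each other.
rng = random.Random(0)
shared_load = [rng.gauss(0, 1) for _ in range(2000)]
counter_a = [v + 0.5 * rng.gauss(0, 1) for v in shared_load]
counter_b = [v + 0.5 * rng.gauss(0, 1) for v in shared_load]

plain = pearson(counter_a, counter_b)
partial = partial_correlation(counter_a, counter_b, shared_load)
print(f"plain correlation:   {plain:.3f}")
print(f"partial correlation: {partial:.3f}")
```

In this synthetic case the plain correlation is high while the partial correlation is close to zero, showing how controlling for a third variable changes the picture of which quantities are directly related.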
Item Type: | Conference Item (Paper)
---|---
Divisions: | Faculty of Science, Engineering and Medicine > Science > Computer Science
Journal or Publication Title: | Proceedings of the 2021 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SocialCom/SustainCom)
Publisher: | IEEE
ISBN: | 9781665435741
Book Title: | 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)
Official Date: | 2021
Page Range: | pp. 1091-1101
DOI: | 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00151
Status: | Not Peer Reviewed
Publication Status: | Published
Reuse Statement (publisher, data, author rights): | © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Access rights to Published version: | Restricted or Subscription Access
Copyright Holders: | IEEE
Conference Paper Type: | Paper
Title of Event: | 2021 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SocialCom/SustainCom)
Type of Event: | Conference
Location of Event: | New York City, NY, USA
Date(s) of Event: | 30 Sept - 03 Oct 2021