Enabling dependability-driven resource use and message log-analysis for cluster system diagnosis
Chuah, Edward, Jhumka, Arshad, Alt, Samantha, Damoulas, Theodoros, Gurumdimma, Nentawe, Sawley, Marie-Christine, Barth, William L., Minyard, Tommy and Browne, James C. (2018) Enabling dependability-driven resource use and message log-analysis for cluster system diagnosis. In: 24th IEEE International Conference on High Performance Computing, Data, and Analytics, Jaipur, India, 18-21 Dec 2017. Published in: 2017 IEEE 24th International Conference on High Performance Computing (HiPC) ISBN 9781538622940. doi:10.1109/HiPC.2017.00044
PDF: WRAP-enabling-dependability-driven-resource-use-log-analysis-Chuah-2017.pdf - Accepted Version (1361Kb)
Official URL: https://doi.org/10.1109/HiPC.2017.00044
Abstract
Recent work has used failure logs and resource use data, separately and together, to detect system failure-inducing errors and to diagnose system failures. System failure occurs as a result of error propagation and the (unsuccessful) execution of error recovery mechanisms. Knowledge of error propagation patterns and unsuccessful error recovery is important for more accurate and detailed failure diagnosis, and knowledge of recovery protocol deployment is important for improving system reliability. This paper presents the CORRMEXT framework, which carries failure diagnosis another significant step forward by analyzing and reporting error propagation patterns and the degrees of success and failure of error recovery protocols. CORRMEXT uses both error messages and resource use data in its analyses. Application of CORRMEXT to data from the Ranger supercomputer has produced new insights. CORRMEXT has: (i) identified correlations between resource use counters that capture recovery attempts after an error, (ii) identified correlations between error events to capture error propagation patterns within the system, (iii) identified error propagation and recovery paths during system execution to explain system behaviour, and (iv) shown that the earliest times of change in system behaviour can only be identified by analyzing both the correlated resource use counters and the correlated errors. CORRMEXT will be installed on the HPC clusters at the Texas Advanced Computing Center in Autumn 2017.
Item Type: | Conference Item (Paper)
---|---
Subjects: | Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: | Faculty of Science, Engineering and Medicine > Science > Computer Science
Library of Congress Subject Headings (LCSH): | Computer system failures, Computer networks -- Reliability
Journal or Publication Title: | 2017 IEEE 24th International Conference on High Performance Computing (HiPC)
Publisher: | IEEE
ISBN: | 9781538622940
Official Date: | 8 February 2018
DOI: | 10.1109/HiPC.2017.00044
Status: | Peer Reviewed
Publication Status: | Published
Reuse Statement (publisher, data, author rights): | © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Access rights to Published version: | Restricted or Subscription Access
Date of first compliant deposit: | 11 September 2017
Date of first compliant Open Access: | 29 January 2019
Grant number: | EP/N510129/1, 0622780, 1203604
Conference Paper Type: | Paper
Title of Event: | 24th IEEE International Conference on High Performance Computing, Data, and Analytics
Type of Event: | Conference
Location of Event: | Jaipur, India
Date(s) of Event: | 18-21 Dec 2017