University of Warwick
Publications service & WRAP

Using message logs and resource use data for cluster failure diagnosis

Chuah, Edward, Jhumka, Arshad, Browne, James C., Gurumdimma, Nentawe, Narasimhamurthy, Sai and Barth, Bill (2017) Using message logs and resource use data for cluster failure diagnosis. In: 23rd annual IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2016), Hyderabad, India, 19-22 Dec 2016 ISBN 9781509054114.

WRAP-using-message-logs-resource-use-data-cluster-failure-diagnosis-Jhumka-2017.pdf - Accepted Version - Requires a PDF viewer.

Official URL: http://dx.doi.org/10.1109/HiPC.2016.035


Abstract

Failure diagnosis for large compute clusters using only message logs is known to be incomplete. The recent availability of resource use data provides another potentially useful source of data for failure detection and diagnosis. Early work combining message logs and resource use data for failure diagnosis has shown promising results. This paper describes the CRUMEL framework, which implements a new approach to combining rationalized message logs and resource use data for failure diagnosis. CRUMEL identifies patterns of errors and resource use and correlates these patterns by time with system failures. Application of CRUMEL to data from the Ranger supercomputer has yielded improved diagnoses over previous research. CRUMEL has: (i) shown that more events correlated with system failures can only be identified by applying different correlation algorithms, (ii) confirmed six groups of errors, (iii) identified Lustre I/O resource use counters that are correlated with the occurrence of Lustre faults and are potential flags for online detection of failures, (iv) matched the dates of correlated error events and correlated resource use with the dates of compute node hang-ups, and (v) identified two more error groups associated with compute node hang-ups. The pre-processed data will be placed in the public domain in September 2016.
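
The idea of correlating patterns "by time" with system failures can be illustrated with a minimal sketch. This is not the CRUMEL implementation, and the abstract does not specify which correlation algorithms CRUMEL applies. The sketch assumes a simple approach: count events in fixed-width time bins and compute a Pearson correlation between the binned series. The function names, the 60-second bin width, and all timestamps are hypothetical and chosen only for illustration.

```python
from math import sqrt


def bin_counts(timestamps, bin_size, n_bins):
    """Count events falling in each fixed-width time bin (illustrative helper)."""
    counts = [0] * n_bins
    for t in timestamps:
        idx = int(t // bin_size)
        if 0 <= idx < n_bins:  # ignore events outside the observed window
            counts[idx] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("series must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical data: seconds since the start of the observation window.
error_times = [5, 12, 14, 61, 65, 67, 68, 130]   # made-up error log events
failure_times = [20, 70, 75]                      # made-up failure events

BIN_SIZE = 60   # assumed bin width in seconds
N_BINS = 3      # assumed number of bins in the window

errors_per_bin = bin_counts(error_times, BIN_SIZE, N_BINS)
failures_per_bin = bin_counts(failure_times, BIN_SIZE, N_BINS)
score = pearson(errors_per_bin, failures_per_bin)
print(errors_per_bin, failures_per_bin, round(score, 3))
```

A resource use counter sampled once per bin could be passed to `pearson` in place of `errors_per_bin` in the same way. A high score only indicates that two series rise and fall together within the chosen bins; it does not by itself identify a cause.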

Item Type: Conference Item (Paper)
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: Faculty of Science, Engineering and Medicine > Science > Computer Science
Library of Congress Subject Headings (LCSH): Computer system failures
ISBN: 9781509054114
Official Date: 2 February 2017
Dates:
  • 2 February 2017: Available
  • 27 September 2016: Accepted
Status: Peer Reviewed
Publication Status: Published
Date of first compliant deposit: 4 November 2016
Date of first compliant Open Access: 4 November 2016
Funder: National Science Foundation (U.S.) (NSF), University of Texas
Grant number: OCI awards #0622780 and #1203604
Conference Paper Type: Paper
Title of Event: 23rd annual IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC 2016)
Type of Event: Conference
Location of Event: Hyderabad, India
Date(s) of Event: 19-22 Dec 2016
