A survey of log-correlation tools for failure diagnosis and prediction in cluster systems
Chuah, Edward, Jhumka, Arshad, Malek, Miroslaw and Suri, Neeraj (2022) A survey of log-correlation tools for failure diagnosis and prediction in cluster systems. IEEE Access, 10. pp. 133487-133503. doi:10.1109/access.2022.3231454. ISSN 2169-3536.
PDF: A_Survey_of_Log-Correlation_Tools_for_Failure_Diagnosis_and_Prediction_in_Cluster_Systems.pdf - Published Version. Available under License Creative Commons Attribution 4.0.
Official URL: https://doi.org/10.1109/access.2022.3231454
Abstract
System logs are the first source of information available to system designers to analyze and troubleshoot their cluster systems. For example, High-Performance Computing (HPC) systems generate a large volume of heterogeneous data from multiple sub-systems, so the idea of using a single source of data to achieve a given goal, such as identification of failures, is losing its validity. System log-analysis tools help system designers make sense of large volumes of system logs. They enable system designers to perform various analyses (e.g., diagnosing or predicting node failures). Current system log-analysis tools vary significantly in their function and design. We conduct a systematic review of the literature on system log-analysis tools and select 46 representative articles out of 3,758 initial articles. To the best of our knowledge, no prior work has studied the characteristics of log-correlation tools (LogCTs) with respect to four quality attributes: (a) spurious correlations, (b) correlation threshold settings, (c) outliers in the data and (d) missing data. In this paper, we (a) propose a quality model to evaluate LogCTs and (b) use this quality model to evaluate and recommend current LogCTs. Through our review, we (a) identify papers on LogCTs, (b) build a quality model consisting of the four quality attributes and (c) discuss several open challenges for future research. Our study highlights the advantages and limitations of existing LogCTs and identifies research opportunities that could facilitate better failure handling in large cluster systems.
Item Type: | Journal Article
---|---
Subjects: | Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: | Faculty of Science, Engineering and Medicine > Science > Computer Science
SWORD Depositor: | Library Publications Router
Library of Congress Subject Headings (LCSH): | Data logging, Debugging in computer science -- Computer programs, High performance computing
Journal or Publication Title: | IEEE Access
Publisher: | Institute of Electrical and Electronics Engineers (IEEE)
ISSN: | 2169-3536
Official Date: | 29 December 2022
Volume: | 10
Page Range: | pp. 133487-133503
DOI: | 10.1109/access.2022.3231454
Status: | Peer Reviewed
Publication Status: | Published
Access rights to Published version: | Open Access (Creative Commons)
Date of first compliant deposit: | 8 March 2023
Date of first compliant Open Access: | 8 March 2023