Improving resilience of scientific software through a domain-specific approach
Reguly, Istvan Z., Mudalige, Gihan R., Giles, Mike B. and Maheswaran, Satheesh (2019) Improving resilience of scientific software through a domain-specific approach. Journal of Parallel and Distributed Computing, 128, pp. 99-114. doi:10.1016/j.jpdc.2019.01.015 ISSN 0743-7315.
PDF: WRAP-improving-reslilience-scientific-software-domain-specific-approach-Mudalige-2019.pdf - Accepted Version (1761Kb)
Official URL: https://doi.org/10.1016/j.jpdc.2019.01.015
Abstract
In this paper we present research on improving the resilience of the execution of scientific software, an increasingly important concern in High Performance Computing (HPC). We build on an existing high-level abstraction framework, the Oxford Parallel library for Structured meshes (OPS), developed for the solution of multi-block structured mesh-based applications, and implement an algorithm in the library to carry out checkpointing automatically, without the intervention of the user. The target applications are a hydrodynamics benchmark application from the Mantevo Suite, CloverLeaf 3D, the sparse linear solver proxy application TeaLeaf, and the OpenSBLI compressible Navier-Stokes direct numerical simulation (DNS) solver. We present (1) the basic algorithm that OPS relies on to determine the optimal checkpoint in terms of size and location, (2) improvements that supply additional information to improve the decision, (3) techniques that reduce the cost of writing the checkpoints to non-volatile storage, and (4) a performance analysis of the developed techniques on a single workstation and on several supercomputers, including ORNL’s Titan. Our results demonstrate the utility of the high-level abstractions approach in automating the checkpointing process and show that performance is comparable to, or better than, the reference in all cases.
| Item Type: | Journal Article |
|---|---|
| Subjects: | Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software |
| Divisions: | Faculty of Science, Engineering and Medicine > Science > Computer Science |
| Library of Congress Subject Headings (LCSH): | Domain-specific programming languages, High performance computing |
| Journal or Publication Title: | Journal of Parallel and Distributed Computing |
| Publisher: | Elsevier Science BV |
| ISSN: | 0743-7315 |
| Official Date: | June 2019 |
| Dates: | |
| Volume: | 128 |
| Page Range: | pp. 99-114 |
| DOI: | 10.1016/j.jpdc.2019.01.015 |
| Status: | Peer Reviewed |
| Publication Status: | Published |
| Access rights to Published version: | Restricted or Subscription Access |
| Date of first compliant deposit: | 14 March 2019 |
| Date of first compliant Open Access: | 22 February 2020 |
| RIOXX Funder/Project Grant: | |
| Related URLs: | |