Waterwave: a GPU memory flow engine for concurrent DNN training
Shi, Xuanhua, Peng, Xuan, He, Ligang, Zhao, Yunfei and Jin, Hai (2023) Waterwave: a GPU memory flow engine for concurrent DNN training. IEEE Transactions on Computers, 72 (10), pp. 2938-2950. doi:10.1109/tc.2023.3278530. ISSN 1557-9956.
PDF: WRAP-waterwave-He-2023.pdf - Accepted Version (3270Kb)
Official URL: https://doi.org/10.1109/tc.2023.3278530
Abstract
Training Deep Neural Networks (DNNs) concurrently is becoming increasingly important for deep learning practitioners, e.g., in hyperparameter optimization (HPO) and neural architecture search (NAS). GPU memory capacity is the impediment that prevents multiple DNNs from being trained on the same GPU, due to the large memory usage during training. In this paper, we propose Waterwave, a GPU memory flow engine for concurrent deep learning training. Firstly, to address the memory explosion caused by the long time lag between memory allocation and deallocation, we develop an allocator tailored for multiple streams. By making the allocator aware of stream information, allocation is prioritized based on each chunk's synchronization attributes, allowing us to provide usable memory after scheduling rather than waiting for it to actually be released after GPU computation. Secondly, Waterwave partitions the compute graph into a set of contiguous node groups and then performs finer-grained scheduling, NodeGroup pipeline execution, to guarantee a proper ordering of memory requests. Waterwave can achieve up to 96.8% of the maximum batch size of solo training. Additionally, in scenarios with high memory demand, Waterwave outperforms existing spatial sharing and temporal sharing by up to 12x and 1.49x, respectively.
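The core allocator idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; it is a toy model, with all names hypothetical, of a pool that tags each freed chunk with the stream that last used it: a request on the same stream can reuse the chunk immediately (stream ordering guarantees the prior kernel finishes first), while cross-stream reuse is deferred until the chunk is marked synchronized.

```python
class StreamAwareAllocator:
    """Toy model of stream-aware chunk reuse (hypothetical, for illustration)."""

    def __init__(self):
        # Each entry is [size, last_stream, synced]; a chunk may sit here
        # before the GPU has actually finished writing to it.
        self.free_chunks = []

    def free(self, size, stream):
        # Return a chunk to the pool immediately, recording its owning stream.
        self.free_chunks.append([size, stream, False])

    def synchronize(self, stream):
        # Model an event/stream sync: chunks last used by `stream` become
        # safe for any other stream to reuse.
        for chunk in self.free_chunks:
            if chunk[1] == stream:
                chunk[2] = True

    def alloc(self, size, stream):
        # Prioritize chunks whose synchronization attributes make them
        # immediately reusable: same stream, or already synchronized.
        for i, (sz, last, synced) in enumerate(self.free_chunks):
            if sz >= size and (last == stream or synced):
                self.free_chunks.pop(i)
                return f"chunk({sz})"
        return None  # would fall back to a fresh allocation or waiting
```

In this sketch, a same-stream request succeeds right after `free`, whereas a request from another stream returns `None` until `synchronize` runs; that gap is what lets usable memory be handed out "after scheduling" rather than after the GPU actually releases it.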
Item Type: Journal Article
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: Faculty of Science, Engineering and Medicine > Science > Computer Science
SWORD Depositor: Library Publications Router
Library of Congress Subject Headings (LCSH): Graphics processing units, Memory management (Computer science), Deep learning (Machine learning)
Journal or Publication Title: IEEE Transactions on Computers
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
ISSN: 1557-9956
Official Date: October 2023
Volume: 72
Number: 10
Page Range: pp. 2938-2950
DOI: 10.1109/tc.2023.3278530
Status: Peer Reviewed
Publication Status: Published
Re-use Statement: © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Access rights to Published version: Restricted or Subscription Access
Date of first compliant deposit: 5 July 2023
Date of first compliant Open Access: 6 July 2023