The Library

An elastic, parallel and distributed computing architecture for machine learning

Li, Anthony Zhenyu (2019) An elastic, parallel and distributed computing architecture for machine learning. PhD thesis, University of Warwick.

WRAP_Theses_Li_2019.pdf - Submitted Version - Requires a PDF viewer.

Official URL: http://webcat.warwick.ac.uk/record=b3489106~S15

Abstract

Machine learning is a powerful tool that allows us to make better and faster data-driven decisions based on training data. Neural networks are especially popular in supervised learning because of their ability to approximate arbitrary functions. However, building these models is typically computationally intensive and can take a significant amount of time on a conventional CPU-based computer. Such long turnaround times make these models impractical for many business and research uses. This research seeks to accelerate the training process through parallel and distributed computing using High-Performance Computing (HPC) resources.

To understand machine learning on HPC platforms, the theoretical performance analysis in this thesis identifies four key factors for data-parallel machine learning: convergence, batch size, computational efficiency and communication efficiency. The analysis shows that, for a fixed experimental setup, there is a maximum computational speed-up achievable through parallel and distributed computing.
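The existence of a maximum speed-up can be illustrated with a toy cost model. The sketch below is not the thesis's analysis: it assumes a fixed amount of computation per iteration that is split evenly across workers, plus a ring-style all-reduce cost that grows with the worker count. The function names and all parameter values (compute time, message size, bandwidth, latency) are hypothetical.

```python
def iteration_time(workers, t_compute=1.0, msg_bytes=100e6,
                   bandwidth=10e9, latency=50e-6):
    """Time for one training iteration under a simple, assumed cost model."""
    if workers == 1:
        return t_compute  # a single worker needs no communication
    # Assumed ring-style all-reduce cost: a bandwidth term plus a latency term.
    comm = (2 * (workers - 1) / workers * msg_bytes / bandwidth
            + 2 * (workers - 1) * latency)
    # The fixed computation is divided evenly; communication is added on top.
    return t_compute / workers + comm


def speedup(workers, **params):
    """Speed-up relative to a single worker for the same fixed workload."""
    return iteration_time(1, **params) / iteration_time(workers, **params)


def best_worker_count(max_workers=1024, **params):
    """Worker count with the highest speed-up in 1..max_workers."""
    return max(range(1, max_workers + 1), key=lambda p: speedup(p, **params))


if __name__ == "__main__":
    p = best_worker_count()
    print(f"best worker count: {p}, speed-up: {speedup(p):.1f}")
    print(f"speed-up at 1024 workers: {speedup(1024):.1f}")
```

Under these assumptions the speed-up rises, peaks at an intermediate worker count, and then falls as the latency term dominates, which is the qualitative behaviour the abstract describes for a fixed setup.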

The primary focus of this thesis is convolutional neural network applications on the Apache Spark platform. The work directly addresses the computational and communication inefficiencies associated with Spark through improvements to the Resilient Distributed Dataset (RDD) and the introduction of an elastic non-blocking all-reduce. In addition to implementation optimisations, computational performance is further improved by overlapping computation with communication and by using large batch sizes through fine-grained control. The impact of these improvements becomes more prominent with the rise of massively parallel processors and high-speed networks.
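As a rough illustration of overlapping computation with communication, the sketch below hands each layer's gradients to a background thread for reduction while the main thread continues with the next layer. It is a single-process stand-in, not the thesis's Spark implementation: `all_reduce_mean`, `backward_pass_with_overlap` and the `compute_layer_grads` callback are hypothetical names, and the "all-reduce" is simply an element-wise mean over in-memory lists.

```python
from concurrent.futures import ThreadPoolExecutor


def all_reduce_mean(per_worker_grads):
    """Stand-in for an all-reduce: element-wise mean across workers."""
    n = len(per_worker_grads)
    return [sum(column) / n for column in zip(*per_worker_grads)]


def backward_pass_with_overlap(compute_layer_grads, num_layers):
    """Reduce each layer's gradients in the background while computing the next.

    compute_layer_grads(layer) is a hypothetical callback that returns one
    gradient list per worker for the given layer.
    """
    pending = {}
    with ThreadPoolExecutor(max_workers=1) as comm:
        # Back-propagation visits the last layer first.
        for layer in reversed(range(num_layers)):
            grads = compute_layer_grads(layer)
            # submit() returns immediately, so the loop moves on to the
            # next layer while this reduction runs (non-blocking).
            pending[layer] = comm.submit(all_reduce_mean, grads)
        # Block only once, after all gradient computation has been issued.
        return {layer: future.result() for layer, future in pending.items()}


if __name__ == "__main__":
    # Two workers, two values per layer; purely illustrative numbers.
    fake_grads = lambda layer: [[layer + 1.0, 2.0], [layer + 3.0, 4.0]]
    print(backward_pass_with_overlap(fake_grads, num_layers=2))
```

The design point is that waiting is deferred to the end of the backward pass, so communication for earlier-finished layers can hide behind the computation of later ones.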

With all the techniques combined, it is predicted that training the ResNet50 model on the ImageNet dataset for 100 epochs at an effective batch size of 16K will take under 20 minutes on an NVIDIA Tesla P100 cluster, in contrast to 26 months on a single Intel Xeon E5-2660 v3 2.6 GHz processor.
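The figures in this prediction can be put in perspective with a little arithmetic. The sketch below is only a back-of-the-envelope check and rests on assumptions not stated in the abstract: a 30-day month, "16K" meaning 16 × 1024 samples, and the ILSVRC-2012 training set of 1,281,167 images. Because the abstract says "under 20 minutes", the results are a lower bound on the speed-up and an upper bound on the time per iteration.

```python
import math

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def implied_speedup(cpu_months=26, cluster_minutes=20):
    """Ratio of single-CPU training time to the predicted cluster time."""
    return cpu_months * MINUTES_PER_MONTH / cluster_minutes


def seconds_per_iteration(train_images=1_281_167, batch_size=16 * 1024,
                          epochs=100, cluster_minutes=20):
    """Average time budget per iteration implied by the predicted total."""
    # Assumption: the final, partially filled batch of each epoch still
    # counts as one iteration.
    iterations = epochs * math.ceil(train_images / batch_size)
    return cluster_minutes * 60 / iterations


if __name__ == "__main__":
    print(f"implied speed-up: at least {implied_speedup():,.0f}x")
    print(f"time per iteration: at most {seconds_per_iteration():.3f} s")
```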

Owing to its similarities with scientific computing, the computing model resulting from this thesis serves as an exemplar of integrating high-performance computing and elastic computing with dynamic workloads. It lays the foundation for future research in emerging computational steering applications, such as interactive physics simulations and data assimilation in weather forecasting and research.

Item Type:

Thesis (PhD)

Subjects:

Q Science > Q Science (General)
Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software

Library of Congress Subject Headings (LCSH):

Machine learning, High performance computing, Neural networks (Computer science), Neural computers

Official Date:

September 2019

Dates:

Date	Event
September 2019	UNSPECIFIED

Institution:

University of Warwick

Theses Department:

Department of Computer Science

Thesis Type:

PhD

Publication Status:

Unpublished

Supervisor(s)/Advisor:

Jarvis, Stephen

Sponsors:

Atos IT Services UK Limited ; Engineering and Physical Sciences Research Council

Format of File:

pdf

Extent:

xiv, 131 pages : illustrations (some color)

Language:

eng

University of Warwick
Publications service & WRAP
