University of Warwick
Publications service & WRAP

Optimizing machine learning on Apache Spark in HPC environments

Li, Zhenyu, Davis, James A. and Jarvis, Stephen A. (2018) Optimizing machine learning on Apache Spark in HPC environments. In: 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC), Dallas, TX, 11-16 Nov 2018. Published in: 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC) ISBN 9781728101811. doi:10.1109/MLHPC.2018.8638643

WRAP-optimizing-machine-learning-Apache-Spark-HPC-environments-Li-2018.pdf - Accepted Version - Requires a PDF viewer.

Official URL: https://doi.org/10.1109/MLHPC.2018.8638643

Abstract

Machine learning has established itself as a powerful tool for constructing decision-making models and algorithms by applying statistical techniques to training data. However, a significant impediment to its progress is the time spent training these models and improving their accuracy: this is a data- and compute-intensive process that can take days, weeks or even months to complete. A common way to accelerate this process is to use multiple machines simultaneously, a trait shared with the field of High Performance Computing (HPC) and its clusters. However, existing distributed frameworks for data analytics and machine learning are designed for commodity servers; they do not realize the full potential of an HPC cluster and thus deny the effective use of a readily available and potentially useful resource. In this work we adapt Apache Spark, a distributed data-flow framework, to support machine learning in HPC environments. Using Spark in this context poses inherent challenges: memory management, communication costs and synchronization overheads all limit its efficiency. To this end we introduce: (i) the application of MapRDD, a fine-grained distributed data representation; (ii) a task-based all-reduce implementation; and (iii) a new asynchronous Stochastic Gradient Descent (SGD) algorithm using non-blocking all-reduce. We demonstrate up to a 2.6x overall speedup (or an 11.2x theoretical speedup with an Nvidia K80 graphics card), an 82-91% compute ratio, and an 80% reduction in memory usage when training the GoogLeNet model to classify 10% of the ImageNet dataset on a 32-node cluster. We also demonstrate that the new asynchronous SGD achieves a convergence rate comparable to that of the synchronous method.
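
As an illustration of the all-reduce collective named in the abstract, the following is a minimal single-process sketch of a ring all-reduce (a reduce-scatter phase followed by an all-gather phase). It is not the task-based Spark implementation from the paper; the function name `ring_allreduce` and the list-of-lists representation of worker buffers are assumptions made for this example only.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a ring all-reduce (sum) over equally sized worker buffers.

    Hypothetical helper for illustration: each inner list stands in for
    the gradient buffer held by one worker.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all worker buffers must have the same length")

    data = [list(b) for b in buffers]  # work on copies
    # Split each buffer into n near-equal chunks.
    bounds = [(length * k) // n for k in range(n + 1)]

    def chunk(k: int) -> range:
        return range(bounds[k], bounds[k + 1])

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the
    # complete sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing messages so all sends in a step are simultaneous.
        msgs = []
        for i in range(n):
            k = (i - step) % n
            msgs.append((k, [data[i][j] for j in chunk(k)]))
        for i in range(n):
            dst = (i + 1) % n
            k, payload = msgs[i]
            for j, v in zip(chunk(k), payload):
                data[dst][j] += v

    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            k = (i + 1 - step) % n
            msgs.append((k, [data[i][j] for j in chunk(k)]))
        for i in range(n):
            dst = (i + 1) % n
            k, payload = msgs[i]
            for j, v in zip(chunk(k), payload):
                data[dst][j] = v

    return data


if __name__ == "__main__":
    workers = [[1.0, 2.0, 3.0, 4.0],
               [10.0, 20.0, 30.0, 40.0],
               [100.0, 200.0, 300.0, 400.0]]
    for reduced in ring_allreduce(workers):
        print(reduced)
```

In this scheme each worker only exchanges one chunk per step with its ring neighbour, which is why ring-style all-reduce is a common choice when communication cost matters.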
With the increasing use of accelerator cards, larger cluster computers and deeper neural network models, we predict that a further 2x speedup (i.e. a 22.4x accumulated speedup) is obtainable with the new asynchronous SGD algorithm on heterogeneous clusters.
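
The idea of overlapping gradient reduction with computation can be sketched as follows. This is only a toy illustration under stated assumptions, not the algorithm from the paper: the one-step-stale update rule, the quadratic per-worker loss, and the names `local_gradient`, `allreduce_mean`, `sync_sgd` and `delayed_sgd` are all hypothetical, and a background thread stands in for a non-blocking all-reduce.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional


def local_gradient(w: float, target: float) -> float:
    """Gradient of the toy per-worker loss 0.5 * (w - target) ** 2."""
    return w - target


def allreduce_mean(values: List[float]) -> float:
    """Stand-in for an all-reduce that averages one value per worker."""
    return sum(values) / len(values)


def sync_sgd(targets: List[float], lr: float = 0.1, steps: int = 200,
             w0: float = 0.0) -> float:
    """Synchronous baseline: wait for the reduction before every update."""
    w = w0
    for _ in range(steps):
        grads = [local_gradient(w, t) for t in targets]
        w -= lr * allreduce_mean(grads)
    return w


def delayed_sgd(targets: List[float], lr: float = 0.1, steps: int = 200,
                w0: float = 0.0) -> float:
    """Assumed scheme: apply the reduced gradient from the previous step
    while the current step's reduction runs in the background."""
    w = w0
    pending: Optional[Future] = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(steps):
            # Compute this step's gradients; the previous reduction
            # (if any) may still be running during this computation.
            grads = [local_gradient(w, t) for t in targets]
            # Launch the reduction without waiting for it.
            launched = pool.submit(allreduce_mean, grads)
            # Apply the (one-step-stale) result of the previous reduction.
            if pending is not None:
                w -= lr * pending.result()
            pending = launched
        # Drain the final outstanding reduction.
        if pending is not None:
            w -= lr * pending.result()
    return w


if __name__ == "__main__":
    targets = [1.0, 2.0, 6.0]  # the minimiser is their mean, 3.0
    print(sync_sgd(targets), delayed_sgd(targets))
```

On this toy problem both variants approach the same minimiser, which mirrors the kind of comparison the abstract describes between asynchronous and synchronous SGD; it says nothing about the speedups reported there.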

Item Type: Conference Item (Paper)
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Divisions: Faculty of Science, Engineering and Medicine > Science > Computer Science
Library of Congress Subject Headings (LCSH): Machine learning, High performance computing, Apache Spark
Journal or Publication Title: 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
Publisher: IEEE
ISBN: 9781728101811
Official Date: 11 February 2018
Dates:
Date | Event
11 February 2018 | Published
18 September 2018 | Accepted
DOI: 10.1109/MLHPC.2018.8638643
Status: Peer Reviewed
Publication Status: Published
Reuse Statement (publisher, data, author rights): © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Access rights to Published version: Restricted or Subscription Access
Date of first compliant deposit: 15 November 2018
Date of first compliant Open Access: 16 November 2018
RIOXX Funder/Project Grant:
Project/Grant ID | RIOXX Funder Name | Funder ID
EP/L016400/1 | [EPSRC] Engineering and Physical Sciences Research Council | http://dx.doi.org/10.13039/501100000266
Conference Paper Type: Paper
Title of Event: 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
Type of Event: Conference
Location of Event: Dallas, TX
Date(s) of Event: 11-16 Nov 2018
