Machine Learning Parallelism Could Be Adaptive, Composable and Automated
Abstract
In recent years, the pace of innovation in machine learning (ML) has accelerated, and researchers in SysML have created algorithms and systems that parallelize ML training over multiple devices or compute nodes. As ML models become more structurally complex, many systems have struggled to provide all-round performance on a variety of models. In particular, the knowledge and time required to map an appropriate distribution strategy onto a model are often underestimated when scaling up ML. Applying parallel training systems to complex models adds nontrivial development overhead on top of model prototyping, and often results in lower-than-expected performance. This thesis identifies and addresses research challenges in both the usability and the performance of parallel ML techniques and system implementations.
The first part of this thesis presents a simple design principle, adaptive parallelism, which applies suitable parallelization techniques to model building blocks (e.g., layers) according to their specific ML properties. Following this principle, we derive a series of optimizations and implementations that address different aspects of ML parallelization. We examine them and show that they boost the efficiency or scalability of ML training on clusters by 2-10x in their applicable scenarios.
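To make the principle concrete, the following minimal Python sketch assigns a parallelization scheme to each layer based on simple ML properties such as parameter count and access sparsity. The layer descriptions, thresholds, and strategy names are illustrative assumptions, not the thesis's actual API.

    # Conceptual sketch of adaptive parallelism: pick a parallelization
    # technique per layer from its ML properties. All names and thresholds
    # here are hypothetical illustrations.
    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str
        num_params: int
        sparse_access: bool  # e.g., embedding lookups touch only a few rows per step

    def choose_strategy(layer: Layer) -> str:
        """Select a per-layer synchronization/parallelization scheme."""
        if layer.sparse_access:
            # Sparse layers: push/pull only the touched rows via a parameter server.
            return "parameter_server(sparse_push_pull)"
        if layer.num_params > 10_000_000:
            # Large dense layers: ring all-reduce uses bandwidth efficiently.
            return "allreduce(ring)"
        # Small dense layers: simple broadcast/aggregate is sufficient.
        return "broadcast_aggregate"

    model = [
        Layer("word_embedding", 200_000_000, sparse_access=True),
        Layer("fully_connected", 50_000_000, sparse_access=False),
        Layer("output_head", 1_000_000, sparse_access=False),
    ]
    for layer in model:
        print(f"{layer.name:>16} -> {choose_strategy(layer)}")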
Generalizing this methodology, the second part of this thesis formulates ML parallelization as an end-to-end optimization problem and seeks to solve it automatically, for two broad paradigms of ML parallelization tasks: single-node dynamic batching and distributed ML parallelism. We present principled representations to express these two classes of ML parallelism, along with composable system architectures, Cavs and AutoDist, respectively. They enable rapid composition of parallelization strategies for unseen models, improve parallelization performance, and simplify parallel ML programming.
Building on these components, the third part of this thesis presents an automatic parallelization framework, AutoSync, which optimizes synchronization strategies in data-parallel distributed training. AutoSync achieves high performance “out of the box”: it navigates the space spanned by the proposed representation and automatically identifies synchronization strategies that achieve 1.2-1.6x speedups over existing hand-optimized systems, lowering the technical barrier of distributed ML and helping make it accessible to a larger community of users.
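The sketch below illustrates the flavor of such automatic strategy optimization: enumerate per-variable synchronization choices and keep the candidate with the lowest estimated step time. The strategy space, mock cost table, and function names are assumptions made for illustration and do not reflect AutoSync's actual search algorithm or cost model.

    # Illustrative sketch of searching a synchronization-strategy space.
    # The candidate choices and mock cost table are invented for this example.
    import itertools

    VARIABLES = ["embedding", "dense_fc", "output_head"]
    CHOICES = ["ps_sharded", "allreduce_ring", "allreduce_hierarchical"]

    # Mock per-(variable, choice) communication costs in seconds per step;
    # a real system would measure these or predict them with a learned model.
    COST = {
        ("embedding", "ps_sharded"): 0.12,
        ("embedding", "allreduce_ring"): 0.45,
        ("embedding", "allreduce_hierarchical"): 0.40,
        ("dense_fc", "ps_sharded"): 0.20,
        ("dense_fc", "allreduce_ring"): 0.08,
        ("dense_fc", "allreduce_hierarchical"): 0.10,
        ("output_head", "ps_sharded"): 0.03,
        ("output_head", "allreduce_ring"): 0.02,
        ("output_head", "allreduce_hierarchical"): 0.04,
    }

    def step_time(strategy):
        """Estimated per-step synchronization time for a full assignment."""
        return sum(COST[(var, choice)] for var, choice in strategy.items())

    best, best_time = None, float("inf")
    for combo in itertools.product(CHOICES, repeat=len(VARIABLES)):
        candidate = dict(zip(VARIABLES, combo))
        t = step_time(candidate)
        if t < best_time:
            best, best_time = candidate, t

    print(f"best strategy ({best_time:.2f}s/step):", best)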
Collectively, the techniques and systems developed in this thesis constitute a proof of concept and a prototype implementation of an end-to-end compiler system for large-scale ML training in distributed environments.
BibTeX
@phdthesis{Zhang-2020-124839,
  author   = {Hao Zhang},
  title    = {Machine Learning Parallelism Could Be Adaptive, Composable and Automated},
  year     = {2020},
  month    = {October},
  school   = {Carnegie Mellon University},
  address  = {Pittsburgh, PA},
  number   = {CMU-RI-TR-20-54},
  keywords = {Scalable Machine Learning, Parallelization, Machine Learning Parallelism, Distributed Machine Learning, Machine Learning System, Compiler, Automatic Parallelization, SysML, AutoML, Parameter Server, Composability},
}