Large-scale deep neural network training has been widely deployed on dense-GPU public-cloud clusters. The intensive communication and synchronization cost of exchanging gradients and parameters is becoming the bottleneck of distributed deep learning training. Horovod is one of the most popular distributed communication frameworks for addressing the scale-out problem of deep learning training on GPU clusters. Existing public-cloud GPU datacenters, such as Amazon EC2 and Alibaba GPU cloud, are usually equipped with commodity high-speed Ethernet and TCP networking. In vanilla Horovod, however, we observe that each GPU device is associated with at most one proxy communication process, which handles all the all-reduce communication operations for one or more GPUs. Such a configuration limits the goodput of the TCP-based communication interface and incurs training performance penalties. In this paper, we make the first attempt to improve Horovod's message-passing interface and to address the mismatch between computation and communication capability when deploying Horovod on TCP-based public-cloud GPU clusters. We propose FastHorovod, which exploits additional cost-efficient auxiliary communication processes on the CPU to expedite the parallel message-passing schedule for each GPU. We conduct extensive experiments comparing FastHorovod against state-of-the-art Horovod. The experimental results show that our design significantly accelerates distributed training communication on TCP-based public-cloud GPU clusters: FastHorovod improves the training speed of the AlexNet and VGG16 models by 64.5% and 72.6%, respectively.
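To make the core idea concrete: a single TCP stream often cannot saturate a high-speed Ethernet link, so splitting each gradient message across several parallel connections can raise aggregate goodput. The Python sketch below is our own minimal illustration of that principle, not the authors' implementation; the names `parallel_send`, `CHUNK_WORKERS`, and the peer address are hypothetical, and framing/reassembly metadata that a real transport would need is omitted for brevity.

```python
import socket
import threading

CHUNK_WORKERS = 4  # number of auxiliary sender streams (assumed tunable)


def _send_chunk(peer, chunk):
    """Open a dedicated TCP connection and push one gradient chunk."""
    with socket.create_connection(peer) as sock:
        sock.sendall(chunk)


def parallel_send(grad_bytes, peer):
    """Split a serialized gradient tensor across several TCP streams.

    Each chunk travels over its own socket in its own thread, so the
    sender is no longer limited by a single connection's goodput.
    (A real implementation would also tag chunks with offsets so the
    receiver can reassemble them in order.)
    """
    step = (len(grad_bytes) + CHUNK_WORKERS - 1) // CHUNK_WORKERS
    threads = []
    for i in range(CHUNK_WORKERS):
        chunk = grad_bytes[i * step:(i + 1) * step]
        if not chunk:
            break
        t = threading.Thread(target=_send_chunk, args=(peer, chunk))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()


# Hypothetical usage: ship one worker's serialized gradients to a peer.
# parallel_send(tensor.numpy().tobytes(), ("10.0.0.2", 50051))
```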