Datacenter-Scale Analysis and Optimization of GPU Machine Learning Workloads

Resource Type: Periodical
Authors: Wesolowski, L.; Acun, B.; Andrei, V.; Aziz, A.; Dankel, G.; Gregg, C.; Meng, X.; Meurillon, C.; Sheahan, D.; Tian, L.; Yang, J.; Yu, P.; Hazelwood, K.
Source: IEEE Micro, 41(5):101-112, Jan 2021
Subject: Computing and Processing; Graphics processing units; Measurement; Telemetry; Data centers; Social networking (online); Machine learning; Training data; Language
ISSN: 0272-1732; 1937-4143

Abstract

In this article, we present a system to collectively optimize efficiency in a very large scale deployment of GPU servers for machine learning workloads at Facebook. Our system 1) measures and stores system-wide efficiency metrics for every executed workflow; 2) aggregates data from across the execution stack to identify optimization opportunities that maximize fleet-wide efficiency improvements; 3) provides periodic and on-demand whole-system profiling for workflows; and 4) automatically analyzes traces for common antipatterns. We present each component of the stack and show case studies demonstrating the use of the tools to significantly improve performance. To our knowledge, our system is the most complete and effective solution for identifying and addressing efficiency problems in datacenter-scale GPU deployments.
