In this paper, we present OpenEmbedding, a distributed parameter server system for deep learning recommendation model (DLRM) workloads. To support the rapid growth in the number of features and in model size (terabyte-scale models are common) of DLRM workloads, OpenEmbedding leverages emerging persistent memory (PMem) to address scalability and reliability issues in training DLRMs. Compared to DRAM, PMem offers much lower per-GB cost, higher density, and non-volatility, albeit with slightly lower access performance. OpenEmbedding uses DRAM as a cache and PMem as storage for the sparse features, and develops a simple but effective pipelined processing approach to optimize the access latency of sparse features in PMem. For reliability, we develop a lightweight synchronous checkpointing scheme that is co-designed with the pipelined cache to reduce the run-time overhead of checkpointing. Our evaluations on a real-world industry workload with billions of parameters demonstrate 1) the effectiveness of our PMem-aware optimizations, 2) a checkpointing mechanism with near-zero run-time overhead on training performance, and 3) fast recovery with up to 3.97× speedup over the state of the art. OpenEmbedding has been deployed in hundreds of industrial scenarios within 4Paradigm, and is open-sourced.