
LUTein: Dense-Sparse Bit-Slice Architecture With Radix-4 LUT-Based Slice-Tensor Processing Units

Resource Type: Conference
Authors: Im, Dongseok; Yoo, Hoi-Jun
Source: 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 747-759, Mar. 2024
Subject: Computing and Processing
Energy consumption
Power demand
Limiting
Instruction sets
Computer architecture
Hardware
Software
Bit-scalable architecture
deep neural network
Modified Booth algorithm
bit-slice
sparsity
LUT-based computing
slice-tensor
dense-sparse core architecture
Language
ISSN: 2378-203X

Online Access

Full Text (IEEE)

Abstract

Bit-slice architectures have been developed to support the various bit precisions and data sparsity of deep neural networks (DNNs). However, because bit-slice computations are low in bit precision and span a wide range of sparsity, bit-slice architectures face challenges at the multiplier, processing element (PE), core, and software (SW) design levels. First, data sparsity causes a power trade-off between the Radix numbers of a multiplier, and previous multipliers cannot take advantage of all sparsity ranges. Second, a bit-slice PE that integrates massive low-bit multiply-and-accumulate (MAC) units incurs large data transactions compared to a fixed bit-width PE. Third, bit-slice core architectures focus on either dense or sparse data computations only, limiting the overall performance of bit-slice computations with a wide sparsity range. Lastly, low-bit bit-slice computations cause massive repetitive instruction fetches across hardware units. To solve these challenges, LUTein is proposed. It exploits a new lookup table (LUT)-based computing method to support the Radix-4 Modified Booth algorithm, achieving low power consumption in all sparsity ranges. Moreover, the slice-tensor PE efficiently processes slice-tensor data by sharing hardware units across the Radix-4 LUT-based MAC units. In addition, the LUTein architecture adopts a systolic datapath with a multi-port buffer to exploit both inter-PE data reuse and slice-level sparsity. Lastly, LUTein's instruction set architecture (ISA) and hierarchical instruction decoder are introduced to alleviate repetitive instruction fetches. As a result, LUTein outperforms the state-of-the-art bit-slice architecture, Sibia, with over 1.34× higher energy efficiency and 1.78× higher area efficiency.
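
For readers unfamiliar with the Radix-4 Modified Booth algorithm named in the abstract, the following is a minimal software sketch of the textbook recoding combined with a small lookup table of partial products. It is not LUTein's hardware design, which the abstract does not detail. The function names `booth_radix4_digits` and `lut_multiply`, the 8-bit default width, and the dictionary-based table are all illustrative assumptions.

```python
def booth_radix4_digits(y: int, bits: int = 8) -> list[int]:
    """Recode a signed `bits`-wide integer into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2}; digits are returned LSB first,
    so that y == sum(d * 4**i for i, d in enumerate(digits)).
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even")
    if not -(1 << (bits - 1)) <= y < (1 << (bits - 1)):
        raise ValueError("y does not fit in the given width")
    u = y & ((1 << bits) - 1)  # two's-complement bit pattern of y
    digits = []
    prev = 0  # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        # Standard recoding: d = -2*y[2i+1] + y[2i] + y[2i-1]
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def lut_multiply(x: int, y: int, bits: int = 8) -> int:
    """Multiply x by y using Booth digits of y and a 5-entry table (illustrative)."""
    # Hypothetical table: the five possible partial products of x,
    # computed once and reused for every digit of y.
    lut = {d: d * x for d in (-2, -1, 0, 1, 2)}
    acc = 0
    for i, d in enumerate(booth_radix4_digits(y, bits)):
        if d == 0:
            continue  # a zero digit contributes nothing, so it is skipped
        acc += lut[d] << (2 * i)  # weight the partial product by 4**i
    return acc
```

In this sketch, a zero digit is skipped entirely, which gives a rough software analogue of how sparsity can reduce work; the actual power behavior the abstract describes depends on circuit details not given in this record.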
