Hardware-efficient Softmax Approximation for Self-Attention Networks

Resource Type: Conference
Authors: Koca, Nazim Altar; Do, Anh Tuan; Chang, Chip-Hong
Source: 2023 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-5, May 2023
Subject: Components, Circuits, Devices and Systems
Power, Energy and Industry Applications
Signal Processing and Analysis
Transformers
Throughput
Natural language processing
Hardware
Table lookup
Registers
Task analysis
Language
ISSN: 2158-1525

Abstract

Self-attention networks such as the Transformer have become state-of-the-art models for natural language processing (NLP) problems. The Softmax function, which serves as a normalizer to produce attention scores, turns out to be a severe throughput and latency bottleneck of a Transformer network. The Softmax datapath consists of data-dependent, sequential, nonlinear exponentiation and division operations, which are not amenable to pipelining and parallelism, nor can they be directly linearized for pretrained models without a substantial accuracy drop. In this paper, we propose a hardware-efficient Softmax approximation that can be used as a direct plug-in substitute in a pretrained Transformer network to accelerate NLP tasks without compromising its accuracy. Experimental results on an FPGA implementation show that our design outperforms a vanilla Softmax built from Xilinx IPs, with 15x fewer LUTs, 55x fewer registers, and 23x lower latency at a similar clock frequency and less than 1% accuracy drop on the main language benchmark tasks. We also propose a pruning method to reduce the input entropy of Softmax for NLP problems with a large number of inputs. It was validated on the CoLA task, achieving a further 25% reduction in latency.
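
The dependency chain the abstract describes (exponentiate every score, finish the sum, then divide each element) can be illustrated with a short Python sketch. The function names softmax_reference and softmax_pow2_sketch are hypothetical. The second function is a generic base-2, piecewise-linear approximation of the exponential, shown only to suggest what a shift-friendly substitute might look like; it is an assumption for illustration and not the design proposed in the paper, which the abstract does not specify.

```python
import math

LOG2_E = math.log2(math.e)  # converts e^x into 2^(x * log2(e))

def softmax_reference(scores):
    """Numerically stable softmax: max, exponentiate, sum, divide."""
    m = max(scores)                           # needs the whole input first
    exps = [math.exp(s - m) for s in scores]  # nonlinear exponentiation
    total = sum(exps)                         # reduction must finish before dividing
    return [e / total for e in exps]          # one division per element

def softmax_pow2_sketch(scores):
    """Illustrative approximation (an assumption, not the paper's method).

    e^x is rewritten as 2^t with t = x * log2(e). t is split into an
    integer part i and a fraction f in [0, 1), and 2^f is approximated
    by the straight line 1 + f, so 2^t ~ (1 + f) * 2^i, i.e. a shift.
    """
    m = max(scores)
    approx = []
    for s in scores:
        t = (s - m) * LOG2_E                   # t <= 0 after max subtraction
        i = math.floor(t)                      # integer part -> shift amount
        f = t - i                              # fractional part in [0, 1)
        approx.append(math.ldexp(1.0 + f, i))  # (1 + f) * 2**i
    total = sum(approx)
    return [a / total for a in approx]         # the division is still present here

if __name__ == "__main__":
    scores = [2.0, 1.0, 0.1, -1.5]  # made-up attention scores
    print(softmax_reference(scores))
    print(softmax_pow2_sketch(scores))
```

Note that the sketch removes only the exponential; the final division remains, so it does not address the full bottleneck the abstract describes.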
