This paper presents high-performance batched lower-upper (LU) factorization routines for small matrices on graphics processing units (GPUs). LU factorization is an effective method for solving systems of linear equations, and pivoting, at least partial pivoting, is essential to ensure its numerical stability. However, the row interchanges that pivoting requires are costly on GPUs because they incur non-coalesced memory accesses. We analyze the global memory access of different variants of the LU factorization algorithm and design our batched LU factorization routines based on the blocked left-looking algorithm, which requires the minimum amount of global memory access. To reduce the penalty of non-coalesced memory access, we propose an approach that swaps rows group by group and a method that delays row interchanges. In addition, kernel fusion and data prefetching techniques are used to improve overall performance. Numerical results on an NVIDIA Tesla V100 GPU show that our batched LU routine outperforms the highly optimized vendor library cuBLAS by up to $6.5\times$ and achieves up to a $2.2\times$ speedup over the well-known open-source package MAGMA.
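To make the cost concrete, the following minimal NumPy sketch (not the paper's CUDA implementation) shows a standard unblocked LU factorization with partial pivoting; the explicit row swap at each step is exactly the operation that becomes a non-coalesced, strided memory access when the matrix is stored in GPU global memory:

```python
import numpy as np

def lu_partial_pivot(A):
    """Unblocked LU factorization with partial pivoting: P @ A = L @ U.

    Illustrative only: each pivot step performs a full row interchange,
    which on a GPU touches non-contiguous (non-coalesced) memory.
    """
    A = np.asarray(A, dtype=float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        # Pivot: largest magnitude in column k on or below the diagonal.
        p = k + np.argmax(np.abs(A[k:, k]))
        if p != k:
            A[[k, p]] = A[[p, k]]        # row interchange (the costly swap)
            piv[[k, p]] = piv[[p, k]]
        # Compute multipliers and update the trailing submatrix.
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    P = np.eye(n)[piv]                   # permutation matrix from pivot record
    return P, L, U
```

The paper's "group by group" swapping and delayed row interchanges can be read as reorganizing when and how these `A[[k, p]] = A[[p, k]]` exchanges are applied, so that memory traffic is batched into coalesced accesses.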