Understanding Time-Varying Vulnerability Across GPU Program Lifetime
- Resource Type
- Conference
- Authors
- Qiu, Hao; Olowogemo, Semiu A.; Lin, Bor-Tyng; Robinson, William H.; Limbrick, Daniel B.
- Source
- 2022 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pp. 1-6, Oct. 2022
- Subject
- Aerospace
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
General Topics for Engineers
Fault tolerance
Fault tolerant systems
Discrete Fourier transforms
Graphics processing units
Very large scale integration
Parallel processing
Behavioral sciences
Fault Injection
GPU
Reliability
Vulnerability Characterization
- Language
- ISSN
- 2765-933X
Time-varying behaviors of GPU program vulnerability could be exploited to reduce the overheads of fault-tolerant designs. However, the inherent parallelism of GPU programs and the performance overhead of massive fault injection (FI) have hindered such assessments using FI. NVBitFI, a GPU FI tool featuring high performance and good compatibility, allows time-varying vulnerability to be evaluated using FI within a reasonable time. We extended NVBitFI to control FI tests along the temporal dimension. A scalable workflow that characterizes the time-varying vulnerability of GPU programs at two granularities is presented. A convenient approach for profiling vulnerability against actual GPU time is also proposed. Results obtained from 60K fault injections demonstrate the feasibility of the proposed methodologies. A case study exploring improved instruction-level grouping is presented. More than 340K faults were injected into the vectorAdd kernel to show that it is possible to generalize the time-varying behavior of smaller inputs to realistic workloads with large inputs.
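
The idea of a time-varying vulnerability profile can be illustrated with a minimal sketch. It is not the authors' workflow and does not use the NVBitFI interface. It assumes each fault injection run has already been reduced to a hypothetical pair of a normalized injection position (0 at program start, approaching 1 at the end) and an outcome label. The labels "masked", "sdc", and "due" are assumed here as common FI outcome classes, and the function name and record layout are illustrative only.

```python
from collections import Counter


def vulnerability_by_window(records, num_windows):
    """Return the fraction of non-masked outcomes per time window.

    records: iterable of (position, outcome) pairs, where position is a
    hypothetical normalized injection time in [0, 1] and outcome is one
    of "masked", "sdc", or "due" (assumed labels).
    Windows that received no injections are reported as None.
    """
    totals = Counter()
    failures = Counter()
    for position, outcome in records:
        # Map the normalized position to a window index, clamping 1.0
        # into the last window.
        window = min(int(position * num_windows), num_windows - 1)
        totals[window] += 1
        if outcome != "masked":
            failures[window] += 1
    return [
        failures[w] / totals[w] if totals[w] else None
        for w in range(num_windows)
    ]


# Made-up records for illustration only; not data from the paper.
records = [
    (0.05, "masked"),
    (0.10, "sdc"),
    (0.40, "masked"),
    (0.45, "masked"),
    (0.80, "sdc"),
    (0.95, "due"),
]
profile = vulnerability_by_window(records, num_windows=2)
print(profile)
```

With finer windows and enough injections per window, a profile of this kind would indicate which phases of execution are more likely to turn a fault into a visible error, which is the sort of information a selective fault-tolerance scheme could use.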