High-Throughput Computing on High-Performance Platforms: A Case Study
- Resource Type
- Conference
- Authors
- Oleynik, Danila; Panitkin, Sergey; Turilli, Matteo; Angius, Alessio; Oral, Sarp; De, Kaushik; Klimentov, Alexei; Wells, Jack C.; Jha, Shantenu
- Source
- 2017 IEEE 13th International Conference on e-Science (e-Science), pp. 295-304, Oct. 2017
- Subject
- Bioengineering
- Communication, Networking and Broadcast Technologies
- Computing and Processing
- Geoscience
- Signal Processing and Analysis
- Detectors
- Servers
- Large Hadron Collider
- Supercomputers
- Computational modeling
- high-performance and throughput computing
- Language
- English
- Abstract
- The computing systems used by LHC experiments have historically consisted of a federation of hundreds to thousands of distributed resources, ranging from small to mid-size. In spite of the impressive scale of the existing distributed computing solutions, the federation of small to mid-size resources will be insufficient to meet projected future demands. This paper is a case study of how the ATLAS experiment has embraced Titan, a DOE leadership computing facility, in conjunction with traditional distributed high-throughput computing to reach sustained production scales of approximately 52M core-hours a year. The three main contributions of this paper are: (i) a critical evaluation of the design and operational considerations needed to support the sustained, scalable, and production usage of Titan; (ii) a preliminary characterization of a next-generation executor for PanDA to support new workloads and advanced execution modes; and (iii) early lessons on how current and future experimental and observational systems can be integrated with production supercomputers and other platforms in a general and extensible manner.