RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management
- Resource Type
- Conference
- Authors
- Martinasso, Maxime; Gila, Miguel; Bianco, Mauro; Alam, Sadaf R.; McMurtrie, Colin; Schulthess, Thomas C.
- Source
- SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 320-332, Nov. 2018
- Subject
- Computing and Processing
Clocks
Tools
Resource management
Software
Containers
Production systems
resource manager
production workload
Slurm
- Language
Leading hybrid and heterogeneous supercomputing systems process hundreds of thousands of jobs using complex scheduling algorithms and parameters. The centers operating these systems aim to achieve higher levels of resource utilization while complying with policy constraints. There is a critical need for a high-fidelity, high-performance tool with familiar interfaces that not only allows tuning and optimization of the operational job scheduler but also enables exploration of new resource management algorithms. We propose a new methodology and a tool called RM-Replay, which is not a simulator but a fast replay engine for production workloads. Slurm is used as a platform to demonstrate the capabilities of our replay engine. We discuss the tool's accuracy, and our investigation shows that scheduling metric values vary when better job runtime estimates are provided or topology-aware allocation is used. The presented methodology for creating fast replay engines can be extended to other complex systems.