Performance Evaluation of Adaptive Routing on Dragonfly-based Production Systems
- Resource Type
- Conference
- Authors
- Chunduri, Sudheer; Harms, Kevin; Groves, Taylor; Mendygral, Peter; Zarins, Justs; Weiland, Michele; Ghadar, Yasaman
- Source
- 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 340-349, May 2021
- Subject
- Bioengineering; Communication, Networking and Broadcast Technologies; Components, Circuits, Devices and Systems; Computing and Processing; Performance evaluation; Production systems; Distributed processing; Adaptive systems; Sensitivity; Runtime; System performance
- Language
- ISSN
- 1530-2075
Application performance in production environments can be sensitive to network congestion. The Cray Aries interconnect routes each network packet adaptively and independently, based on the load or congestion the packet encounters as it traverses the network. Software can select a different routing policy for each posted message, adjusting the bias between minimal and non-minimal routes. We extensively evaluated how sensitive application performance and whole-system performance are to the choice of routing bias, under both production and controlled conditions. We show that the default routing bias used in Aries-based systems is often sub-optimal, and that a higher bias towards minimal routes not only reduces the effects of congestion on the application but also decreases overall congestion on the network. This routing scheme not only improves the mean performance of most production applications (by up to 12%) but also reduces run-to-run variability. Our study prompted two supercomputing facilities (ALCF and NERSC) to change the default routing mode on their Aries-based systems. We present the substantial improvement in overall congestion management and interconnect performance measured in production after this change.