Training deep neural networks is a costly procedure, often performed via sophisticated deep learning frameworks on clusters of computers. As faster processor technologies are integrated into these cluster facilities (e.g., NVIDIA's graphics accelerators or Google's tensor processing units), the communication component of the training process rapidly becomes a performance bottleneck. In this paper, we offer a complete analysis of the key collective communication primitive for the distributed data-parallel training of convolutional neural networks (CNNs), focused on three relevant instances of the Message Passing Interface (MPI): MPICH, OpenMPI, and IntelMPI. In addition, our experimental evaluation is extended to expose the practical impact of this collective primitive when the training is performed using TensorFlow+Horovod on a 16-node cluster. Finally, the theoretical analysis is further refined to cover a number of accelerated cluster configurations, which are emulated by adjusting the communication-to-arithmetic ratio of the training process.
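In Horovod-style data-parallel training, the collective at the heart of each training step is an Allreduce that aggregates the per-worker gradients. As a point of reference only (not code from the paper), the following minimal C/MPI sketch illustrates that operation; the buffer name `grad` and the parameter count `NPARAMS` are illustrative assumptions.

```c
/* Minimal sketch of gradient averaging in data-parallel training.
 * Each rank holds a local gradient buffer; MPI_Allreduce sums the
 * buffers across all ranks, and every rank then divides by the
 * number of workers to obtain the averaged gradient. */
#include <mpi.h>
#include <stdlib.h>

#define NPARAMS 1000000  /* hypothetical model size (number of weights) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    float *grad = malloc(NPARAMS * sizeof(float));
    /* ... local forward/backward pass fills grad ... */

    /* Sum the local gradients of all workers; every rank receives
     * the result (in place, so no second buffer is needed). */
    MPI_Allreduce(MPI_IN_PLACE, grad, NPARAMS, MPI_FLOAT,
                  MPI_SUM, MPI_COMM_WORLD);

    /* Average: divide the global sum by the number of workers. */
    for (long i = 0; i < NPARAMS; ++i)
        grad[i] /= (float) nranks;

    /* ... weight update with the averaged gradient ... */

    free(grad);
    MPI_Finalize();
    return 0;
}
```

Because the volume of data exchanged per step is fixed by the model size while the arithmetic per step shrinks as processors get faster, this single collective increasingly dominates the step time, which is the communication-to-arithmetic trade-off the analysis above emulates.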