Large-scale deep neural network training has been widely deployed on dense-GPU public-cloud clusters. The intensive communication and synchronization cost of exchanging gradients and parameters is becoming the bottleneck of distributed deep learning training. Horovod is one of the most popular distributed communication frameworks for addressing the scale-out problem of deep learning training on GPU clusters. Existing public-cloud GPU datacenters, such as Amazon EC2 and Alibaba GPU cloud, are usually equipped with commodity high-speed Ethernet and TCP networking. In vanilla Horovod, however, we observe that each GPU device is associated with at most one proxy communication process, which handles all the all-reduce communication operations for one or more GPUs. Such a configuration limits the goodput of the TCP-based communication interface and incurs training performance penalties. In this paper, we make the first attempt to improve Horovod's message-passing interface and to address the mismatch between computation and communication capability when deploying Horovod on TCP-based public-cloud GPU clusters. We propose FastHorovod, which exploits additional cost-efficient auxiliary communication processes on the CPU to expedite the parallel message-passing schedule for each GPU. We conduct extensive experiments comparing FastHorovod against state-of-the-art Horovod. The experimental results show that our design significantly accelerates distributed training communication on TCP-based public-cloud GPU clusters: FastHorovod improves the training speed of the AlexNet and VGG16 models by 64.5% and 72.6%, respectively.
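To make the core idea concrete: a single TCP stream often cannot saturate a high-speed Ethernet link, so splitting each gradient message across several parallel connections can raise aggregate goodput. The Python sketch below is our own minimal illustration of that principle, not the authors' implementation; the names `parallel_send`, `CHUNK_WORKERS`, and the peer address are hypothetical, and framing/reassembly metadata that a real transport would need is omitted for brevity.

```python
import socket
import threading

CHUNK_WORKERS = 4  # number of auxiliary sender streams (assumed tunable)


def _send_chunk(peer, chunk):
    """Open a dedicated TCP connection and push one gradient chunk."""
    with socket.create_connection(peer) as sock:
        sock.sendall(chunk)


def parallel_send(grad_bytes, peer):
    """Split a serialized gradient tensor across several TCP streams.

    Each chunk travels over its own socket in its own thread, so the
    sender is no longer limited by a single connection's goodput.
    (A real implementation would also tag chunks with offsets so the
    receiver can reassemble them in order.)
    """
    step = (len(grad_bytes) + CHUNK_WORKERS - 1) // CHUNK_WORKERS
    threads = []
    for i in range(CHUNK_WORKERS):
        chunk = grad_bytes[i * step:(i + 1) * step]
        if not chunk:
            break
        t = threading.Thread(target=_send_chunk, args=(peer, chunk))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()


# Hypothetical usage: ship one worker's serialized gradients to a peer.
# parallel_send(tensor.numpy().tobytes(), ("10.0.0.2", 50051))
```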