In the last decade, enormous and renewed attention to Artificial Intelligence has emerged thanks to Deep Neural Networks (DNNs), which achieve high performance on specific tasks at the cost of high computational complexity. GPUs are commonly used to accelerate DNNs, but they generally incur very high power consumption and poor time predictability. For these reasons, GPUs are becoming less attractive for resource-constrained, real-time systems, while demand is growing for specialized hardware accelerators that better fit the requirements of embedded systems. Following this trend, this paper focuses on hardware acceleration of the DNNs used by Baidu Apollo, an open-source autonomous driving framework. As an experience report on performing R&D with industrial technologies, we discuss the challenges faced in shifting from GPU-based to FPGA-based DNN acceleration using the Xilinx DPU core deployed on an UltraScale+ SoC FPGA platform. Furthermore, we highlight the pros and cons of today's hardware acceleration tools. Experimental evaluations were conducted to assess the performance of FPGA-accelerated DNNs in terms of accuracy, throughput, and power consumption, in comparison with that achieved on embedded GPUs.