Deep neural networks (DNNs), as a popular machine learning (ML) technique, have been deployed on a wide range of devices for Internet of Things (IoT) applications, data mining in cloud computing, and web search engines, and ML has had an impressive effect on IoT edge-level nodes. Deploying DNN-based applications leads to memory-access problems, including communication delay, energy efficiency, and bandwidth requirements. We propose a bus-scheduling method for data placement on distributed local buffers in a deep learning accelerator (DLA). The contributions of this paper include (1) a method for data-flow mapping between off-chip DRAM and distributed local buffers, together with a flow-mapping approach for data transfer between the distributed local buffers and processing elements (PEs); (2) the use of distributed local buffers in four directions to distribute traffic over a mesh according to the memory-access mechanism; and (3) bus scheduling for data placement on the distributed local buffers. Simulated experiments based on typical DNN workloads (i.e., AlexNet, VGG-16, and GoogLeNet) demonstrate the effectiveness of the design: (1) the scheduling and mapping methods improve total runtime and bandwidth requirement by approximately 42.29% and 88.95%, respectively, compared with the TPU; and (2) our methods reduce the total runtime of the row-column stationary plus data-flow by approximately 99% compared with the weight-stationary data-flow in CONV1 and CONV11 of VGG-16. This work reports simulation results based on distributing the traffic of AlexNet, VGG-16, and GoogLeNet as popular CNN and DNN models; the efficiency of our method for other trained models remains to be investigated.