Object detection is a fundamental task in computer vision. In recent years it has been addressed primarily with convolutional neural networks, most notably the YOLO family, which has drawn substantial attention from the research community. Transformer-based models were subsequently introduced to improve the efficiency and accuracy of many detectors. However, real-time object detection still suffers from slow inference. To address this, we propose a novel approach that fuses a CNN with transformer-based detection, improving both accuracy and inference speed. In our experiments, the model achieves a notable 51.8 mAP, a 0.9% improvement, with 36 million parameters, 5 million fewer than comparable transformer-based models. Its computational cost is 144 GFLOPs for 800×1333-pixel inputs (compared with 74 GFLOPs at 640×640 pixels), and all results are obtained with a 40-epoch training schedule. In summary, the proposed model reduces the parameter count and computational cost by restructuring the original detection architecture, and it significantly enhances inference speed compared to prior transformer-based models.
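To make the fusion idea concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of a hybrid detector in which a CNN backbone extracts feature maps that are then processed by a transformer encoder-decoder detection head, in the spirit of DETR-style models. All names and hyperparameters (HybridDetector, num_queries, hidden_dim) are assumptions for illustration only; positional encodings and matching losses are omitted for brevity.

```python
# Hedged sketch of a CNN + transformer detection head (not the authors' code).
import torch
import torch.nn as nn
from torchvision.models import resnet50


class HybridDetector(nn.Module):
    def __init__(self, num_classes=80, num_queries=100, hidden_dim=256):
        super().__init__()
        # CNN backbone: ResNet-50 with its pooling and classification head removed.
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 conv projects the 2048-channel CNN features to the transformer width.
        self.input_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)
        # Standard transformer encoder-decoder operating on flattened feature tokens.
        self.transformer = nn.Transformer(
            d_model=hidden_dim, nhead=8,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        # Learned object queries, one per candidate detection.
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        # Prediction heads: class logits (+1 for "no object") and normalized boxes.
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        self.box_head = nn.Linear(hidden_dim, 4)

    def forward(self, images):
        feats = self.backbone(images)                             # (B, 2048, H/32, W/32)
        src = self.input_proj(feats).flatten(2).transpose(1, 2)   # (B, HW, hidden_dim)
        queries = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, queries)                       # (B, num_queries, hidden_dim)
        return self.class_head(hs), self.box_head(hs).sigmoid()


if __name__ == "__main__":
    model = HybridDetector()
    logits, boxes = model(torch.randn(1, 3, 640, 640))
    print(logits.shape, boxes.shape)  # torch.Size([1, 100, 81]) torch.Size([1, 100, 4])
```

In such a design, the CNN supplies strong local features cheaply, while the transformer head models global context across the image, which is the general trade-off the proposed fusion targets.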