Spatiotemporal action detection is an active research topic in computer vision with a wide range of potential applications. Analyzing existing action tube connection algorithms, we find that a large disparity between the linking scores of regions in consecutive frames degrades detection performance. To address this problem, we propose a framework consisting of a 2D ConvNet branch, a 3D ConvNet branch, a channel attention fusion block, bounding box regression, and a smooth action tube connection block. In particular, the smooth action tube connection block reduces the negative impact that large differences in detection scores have on the detection results. Extensive experiments on the UCF101-24 dataset show that the proposed method outperforms the baseline methods and achieves the best results when video-mAP at an IoU threshold of 0.5 is used as the evaluation metric.
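For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how the overlap between a detected action tube and a ground-truth tube is commonly measured before the 0.5 threshold is applied. It assumes the widely used convention that tube IoU is the temporal IoU multiplied by the mean spatial IoU over the shared frames; the abstract does not state which convention this work follows. The function names `box_iou` and `tube_iou` and the dictionary-based tube representation are illustrative choices, not taken from the paper.

```python
def box_iou(a, b):
    """Spatial IoU of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_iou(tube_a, tube_b):
    """Spatiotemporal IoU of two tubes.

    Each tube is a dict mapping frame index -> (x1, y1, x2, y2).
    Assumed convention: temporal IoU times mean spatial IoU over shared frames.
    """
    shared = set(tube_a) & set(tube_b)
    if not shared:
        return 0.0
    all_frames = set(tube_a) | set(tube_b)
    temporal = len(shared) / len(all_frames)
    spatial = sum(box_iou(tube_a[f], tube_b[f]) for f in shared) / len(shared)
    return temporal * spatial


def is_true_positive(detected, ground_truth, threshold=0.5):
    """A detection counts as correct if its tube IoU reaches the threshold."""
    return tube_iou(detected, ground_truth) >= threshold
```

Under this convention, a detected tube must overlap the ground truth well both in time and in space to count as a true positive, which is why inconsistent per-frame scores during tube connection can lower video-mAP.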