Submanifold convolution is widely used in 3-D detection. However, because Light Detection and Ranging (LiDAR) point clouds are nonuniformly distributed, it gives voxels receptive fields of different effective sizes, which degrades feature extraction for distant voxels and, in turn, detector performance. As a solution, we propose the adaptive receptive field aggregation (ARFA) network, an end-to-end two-stage LiDAR 3-D object detection architecture. ARFA searches the top-$K$ nearest neighbors (KNN) to adaptively adjust the receptive field of each sparse voxel, and then applies a self-attention aggregation (SA) module with density feature embedding (DE) to aggregate the semantic information within that receptive field. To further strengthen detection of small objects, we also propose an upsampling bird's-eye-view (U-BEV) backbone and an Intersection-over-Union (IoU)-aware head, which improve proposal quality and rectify the confidence of the predicted bounding boxes. ARFA outperforms state-of-the-art methods on the Waymo Open Dataset and achieves competitive results on the popular KITTI dataset.
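The core idea of the KNN-based receptive field can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows, under the assumption that voxels are represented by their center coordinates, how querying a fixed number of nearest neighbors (here via SciPy's `cKDTree`) makes the effective neighborhood radius grow automatically in sparse, distant regions instead of staying fixed as in a standard submanifold convolution window. The function name `knn_receptive_field` and the toy voxel layout are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_receptive_field(voxel_centers, k):
    """For each voxel center, find its k nearest voxel centers
    (including itself). In sparse regions the k-th neighbor lies
    farther away, so the effective receptive field adapts to density."""
    tree = cKDTree(voxel_centers)
    dists, idx = tree.query(voxel_centers, k=k)
    return dists, idx

# Toy example: a dense cluster near the origin and one isolated voxel,
# mimicking near vs. distant LiDAR returns.
centers = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0],   # isolated "distant" voxel
])
dists, idx = knn_receptive_field(centers, k=3)

# The isolated voxel's neighborhood spans a much larger radius than
# those of the dense voxels, yet both gather the same number (k) of
# neighbors for the subsequent aggregation step.
print(dists[3].max(), dists[0].max())
```

In a full pipeline, the gathered neighbor features would then be fed to the attention-based aggregation; the sketch covers only the adaptive neighborhood selection.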