Cross-modality pose estimation and localization is a critical challenge for multi-sensor perception systems, with applications spanning vehicle localization and online calibration. In this letter, we introduce a hybrid approach to estimating the pose of a camera with respect to a co-visible point cloud. The approach adopts a coarse-to-fine scheme: a learning-based pose estimator first produces a coarse pose, which an optimization-based method then refines through precise geometric alignment. Specifically, we propose a neural network with multi-scale cross-attention to predict the points co-visible between the camera and the point cloud, while simultaneously estimating a coarse pose from tightly coupled dual-modal features. Initialized with this coarse pose, an optimization-based method minimizes the reprojection error of the co-visible points onto the image plane. The proposed approach is thoroughly evaluated on the KITTI and nuScenes datasets and achieves superior results compared with state-of-the-art methods.
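The refinement stage described above, minimizing the reprojection error of co-visible 3D points onto the image plane starting from a coarse initial pose, is a standard nonlinear least-squares problem. The sketch below is a minimal, generic illustration of this idea, not the paper's implementation: the synthetic data, the axis-angle pose parameterization, and the use of SciPy's `least_squares` are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reprojection_residuals(pose, pts3d, pts2d, K):
    # pose = [rx, ry, rz, tx, ty, tz]: axis-angle rotation + translation
    R = Rotation.from_rotvec(pose[:3]).as_matrix()
    cam = pts3d @ R.T + pose[3:]        # co-visible points in the camera frame
    proj = cam @ K.T                    # apply pinhole intrinsics
    proj = proj[:, :2] / proj[:, 2:3]   # perspective division to pixels
    return (proj - pts2d).ravel()       # per-point reprojection error

# Synthetic stand-in for co-visible points and their image observations
rng = np.random.default_rng(0)
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
pts3d = rng.uniform([-1., -1., 4.], [1., 1., 8.], size=(50, 3))
true_pose = np.array([0.05, -0.02, 0.01, 0.1, -0.05, 0.2])
cam = pts3d @ Rotation.from_rotvec(true_pose[:3]).as_matrix().T + true_pose[3:]
pts2d = (cam @ K.T)[:, :2] / (cam @ K.T)[:, 2:3]

# A perturbed pose plays the role of the network's coarse estimate
coarse_pose = true_pose + 0.02
result = least_squares(reprojection_residuals, coarse_pose,
                       args=(pts3d, pts2d, K))
refined_pose = result.x
```

In practice such a refinement would also handle outliers in the predicted co-visible set (e.g. with a robust loss), but the sketch shows the core geometric alignment: the coarse pose seeds the solver, and the optimizer drives the reprojection residuals toward zero.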