Knowledge distillation improves the performance of a student model by transferring knowledge from a teacher model. Recently, attention mechanisms have been introduced to let each student layer learn from all teacher layers, which yields considerable gains. However, features from different layers, such as shallow and deep ones, can have a large semantic gap, and forcibly aligning one student layer to all teacher layers may mislead the learning process. To tackle this problem, we present an effective framework called Semantic Stage-Wise learning for Knowledge Distillation (SSWKD). We divide all layers into a shallow stage and a deep stage, and allow feature alignment only within the same stage to alleviate semantic mismatch. In addition, observing that the performance of deep networks relies more on a few key features than uniformly on all of them, we propose a crucial feature enhancement method for SSWKD based on KL divergence, which forces the student to pay more attention to the teacher's critical features. Extensive experiments and visualizations show that SSWKD outperforms other distillation methods on the CIFAR-100 and COCO2017 datasets for image classification, object detection, and instance segmentation tasks.
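To make the two ideas concrete, below is a minimal PyTorch sketch of stage-wise, attention-weighted feature alignment plus a KL-divergence term over channel importance. Everything here is an illustrative assumption rather than the paper's exact formulation: the stage split point, the pooled-descriptor attention, the channel-softmax importance, and all function names (`stage_align_loss`, `critical_feature_kl`) are hypothetical.

```python
# Minimal sketch, assuming: features are (B, C, H, W) tensors with matching
# shapes within a stage; attention over teacher layers is computed from
# pooled descriptors; "critical features" are channels with large mean
# activation magnitude. None of this is the authors' released code.
import torch
import torch.nn.functional as F


def stage_align_loss(student_feats, teacher_feats):
    """Align each student layer to the teacher layers of the SAME stage,
    weighted by a softmax attention over student-teacher layer similarity."""
    loss = 0.0
    for s in student_feats:
        # Pool to (B, C) descriptors to score similarity with each teacher layer.
        s_vec = F.adaptive_avg_pool2d(s, 1).flatten(1)
        sims = torch.stack([
            F.cosine_similarity(
                s_vec, F.adaptive_avg_pool2d(t, 1).flatten(1), dim=1
            ).mean()
            for t in teacher_feats
        ])
        attn = F.softmax(sims, dim=0)  # attention over same-stage teacher layers
        for w, t in zip(attn, teacher_feats):
            loss = loss + w * F.mse_loss(s, t.detach())
    return loss


def critical_feature_kl(s_feat, t_feat, tau=1.0):
    """KL term pushing the student's channel-importance distribution toward
    the teacher's, emphasizing the teacher's most salient channels."""
    s_imp = F.log_softmax(s_feat.abs().mean(dim=(2, 3)) / tau, dim=1)
    t_imp = F.softmax(t_feat.abs().mean(dim=(2, 3)) / tau, dim=1)
    return F.kl_div(s_imp, t_imp.detach(), reduction="batchmean")


# Usage: split backbone feature maps into shallow and deep stages and
# align only within each stage, so shallow student layers never chase
# semantically distant deep teacher layers.
student_layers = [torch.randn(2, 64, 32, 32) for _ in range(4)]
teacher_layers = [torch.randn(2, 64, 32, 32) for _ in range(4)]
shallow, deep = slice(0, 2), slice(2, 4)
loss = (stage_align_loss(student_layers[shallow], teacher_layers[shallow])
        + stage_align_loss(student_layers[deep], teacher_layers[deep])
        + critical_feature_kl(student_layers[-1], teacher_layers[-1]))
```

The stage restriction is what distinguishes this from all-to-all attention distillation: the softmax is taken only over same-stage teacher layers, so the semantic gap between shallow and deep features never enters the alignment.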