The extraction of buildings from remote sensing (RS) imagery can support tasks such as urban planning, geodatabase updating, and postdisaster emergency management. In recent years, deep-learning-based methods for extracting buildings from RS images have become a trending topic. Currently, most building extraction methods are fully supervised and therefore require extensive manual annotation of large numbers of RS images. In addition, existing methods struggle with adjacent buildings sticking together, broken building boundaries, and the incomplete extraction of large buildings. Therefore, we propose an attention-UNet (AUNet) model based on a cross-pseudosupervised (CPS) semisupervised framework to address these problems. First, we use a channel transformer to process the multiscale feature maps produced by the multilayer encoder; these feature maps are fully exploited and then fused with the upsampled features in the decoder, which improves the completeness of building segmentation. In addition, we add a convolutional attention module to each encoder layer to mitigate the loss of shallow detail features caused by successive downsampling; the preserved details help alleviate the sticking and breaking problems. Finally, extensive experiments on two public datasets demonstrate the effectiveness of AUNet under the semisupervised CPS framework.
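For readers unfamiliar with cross pseudo supervision, the following minimal PyTorch-style sketch illustrates the general idea of a CPS training step: two independently initialized segmentation networks are trained jointly, each supervised on labeled data in the usual way and additionally supervised on unlabeled data by the other network's hard pseudo labels. The function, the network handles `net_a`/`net_b`, and the weight `cps_weight` are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch.nn.functional as F

def cps_training_step(net_a, net_b, labeled_imgs, labels, unlabeled_imgs, cps_weight=1.5):
    """One cross-pseudosupervision step with two segmentation networks.

    Hedged sketch: net_a/net_b stand in for two AUNet instances with different
    initializations; cps_weight is an illustrative trade-off weight.
    """
    # Standard supervised loss on the labeled batch for both networks.
    sup_loss = (
        F.cross_entropy(net_a(labeled_imgs), labels)
        + F.cross_entropy(net_b(labeled_imgs), labels)
    )

    # Each network produces hard pseudo labels for the unlabeled batch ...
    logits_a = net_a(unlabeled_imgs)
    logits_b = net_b(unlabeled_imgs)
    pseudo_a = logits_a.argmax(dim=1).detach()  # no gradient through pseudo labels
    pseudo_b = logits_b.argmax(dim=1).detach()

    # ... and is supervised by the other network's pseudo labels (cross supervision).
    cps_loss = (
        F.cross_entropy(logits_a, pseudo_b)
        + F.cross_entropy(logits_b, pseudo_a)
    )

    return sup_loss + cps_weight * cps_loss
```

In this scheme, the loss returned above would be backpropagated through both networks at every iteration, so unlabeled RS images contribute to training without any manual annotation.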