Classic black-box adversarial samples all target convolutional neural network (CNN) models, but they do not transfer well to newer recognition networks based on the Transformer. In this paper, we propose an adversarial sample generation algorithm built on the self-attention mechanism and patch partition of Vision Transformers (ViTs). Observing that attention is unevenly distributed across the blocks of a ViT, we first generate a patch-based attention map and apply threshold segmentation to it; the segmented map serves as a mask for patch-wise data augmentation that treats high-weight and low-weight patches differently, and information is then exchanged between patches to generate the adversarial samples. Simulated black-box attack experiments show that the adversarial samples generated by our algorithm achieve a high attack success rate against a variety of Transformer-based models and also perform well against CNNs.
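The pipeline described above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the attention map is simulated with random values (in practice it would come from the ViT's self-attention), and the per-patch perturbation budgets are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a ViT patch attention map (e.g. CLS-token attention averaged
# over heads and blocks); random here purely for illustration.
num_patches = 14 * 14            # ViT-B/16 grid on a 224x224 input
attn = rng.random(num_patches)

# Threshold segmentation: patches above the mean weight form the "high" mask,
# the rest form the "low" mask.
threshold = attn.mean()
high_mask = attn >= threshold
low_mask = ~high_mask

# Patch-wise augmentation: perturb high-attention patches more strongly than
# low-attention ones (epsilon values are illustrative, not from the paper).
eps = np.where(high_mask, 8 / 255, 2 / 255)

# Flattened image patches (196 patches of 16x16x3 pixels), random for the demo.
image_patches = rng.random((num_patches, 16 * 16 * 3))
noise = rng.uniform(-1.0, 1.0, image_patches.shape)
adv_patches = np.clip(image_patches + eps[:, None] * noise, 0.0, 1.0)

print(adv_patches.shape)
```

The information-exchange step between patches (the final stage of the algorithm) would operate on `adv_patches` and is omitted here, since the abstract does not specify its exact form.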