Multi-label image classification aims to identify all object labels present in an image. Small objects and visually similar objects remain the primary difficulty for prior convolution-based models, owing to the limited representational capacity of convolutional kernels. Recent vision transformer networks employ the attention mechanism to extract spatial features, which express richer local semantic information but are still insufficient for high performance in fine-grained or small-object scenarios. To overcome these drawbacks, this paper proposes using the textual definition of each label as a text embedding and modifying the decoder stage so the model can exploit this additional information. The resulting framework is simple, efficient, and consistently outperforms preceding methods on the FathomNet, FAIR1M, and DOTA datasets.
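The core idea of feeding label text embeddings into the decoder can be sketched as a single cross-attention step in which each label embedding acts as a query over the image's patch features. The sketch below is a minimal NumPy illustration of that mechanism, not the paper's actual architecture; all names (`label_query_decoder`, the feature dimensions, the final scoring rule) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def label_query_decoder(patch_feats, label_embs):
    """One cross-attention step: label text embeddings query image patches.

    patch_feats: (P, d) patch features from a vision encoder (assumed given)
    label_embs:  (L, d) text embeddings of the label definitions
    Returns per-label logits of shape (L,).
    """
    d = patch_feats.shape[1]
    # Scaled dot-product attention: each label attends over all patches.
    attn = softmax(label_embs @ patch_feats.T / np.sqrt(d), axis=-1)  # (L, P)
    pooled = attn @ patch_feats                                       # (L, d)
    # Illustrative scoring: similarity of pooled visual feature to its label.
    logits = np.sum(pooled * label_embs, axis=1)                      # (L,)
    return logits

rng = np.random.default_rng(0)
patches = rng.standard_normal((49, 16))   # e.g. a 7x7 patch grid, d=16
labels = rng.standard_normal((5, 16))     # 5 candidate labels
scores = label_query_decoder(patches, labels)
print(scores.shape)  # (5,)
```

In a full model each logit would be passed through a sigmoid and thresholded independently, which is what distinguishes multi-label classification from the softmax-based single-label setting.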