Large volumes of unlabeled production data hinder the large-scale application of advanced supervised learning techniques in modern industry. Metal 3D printing generates large amounts of in-situ data that are closely related to the forming quality of parts. To avoid the labor cost of re-labeling datasets whenever printing materials or process parameters change, we design a forming quality recognition model based on deep clustering, which makes forming quality recognition for metal 3D printing more flexible. Inspired by the success of the Vision Transformer, we introduce convolutional neural networks into the Vision Transformer structure to model the inductive bias of images while learning global representations. Our approach achieves state-of-the-art accuracy compared with other Vision Transformer-based models. In addition, the proposed framework is a good candidate for specific industrial vision tasks where annotations are scarce.