In recent years, Internet data has grown exponentially, but because most of it is unlabeled, the amount of usable data remains relatively small. Research on weak supervision has emerged to address this problem. However, existing weakly supervised methods typically target either single-modal or multi-modal data and cannot handle both types at the same time. Motivated by this observation, we propose a unified cross-modal weakly supervised classification ensemble framework (UCWSC). Specifically, the proposed framework is based on high-order feature information from different modalities. First, we introduce a feature fusion method based on high-order features to increase the amount of acquired information. Then, we propose a modified Feature MixMatch algorithm that learns from feature representations. Finally, we propose feature fusion and decision fusion methods for weakly supervised classification of multi-modal data, with voting and weighting mechanisms, respectively, serving as discriminators to obtain the final classification results. We demonstrate the compatibility of these techniques: our classification accuracy reaches around 99% on the Wikipedia dataset and 78% on the MVSA-Multiple dataset.
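
As a rough illustration of how voting and weighting mechanisms can act as discriminators over per-modality classifier outputs, the following minimal sketch combines predictions in two ways: a majority vote over hard labels and a weighted average over class probabilities. It is not the UCWSC implementation. The function names `majority_vote` and `weighted_fusion`, the weights, and the probability values are all hypothetical and chosen only for illustration.

```python
def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    best_label, best_count = None, -1
    for label, count in counts.items():  # dicts preserve insertion order
        if count > best_count:
            best_label, best_count = label, count
    return best_label


def weighted_fusion(prob_vectors, weights):
    """Fuse per-classifier class probabilities with a normalized weighted average.

    prob_vectors: one probability list per classifier (e.g. per modality).
    weights: one non-negative weight per classifier (hypothetical values).
    Returns (predicted_class_index, fused_probabilities).
    """
    if len(prob_vectors) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(prob_vectors[0])
    fused = [
        sum(w * probs[c] for probs, w in zip(prob_vectors, weights)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: fused[c])
    return predicted, fused


if __name__ == "__main__":
    # Hard labels from three hypothetical classifiers.
    print(majority_vote([1, 0, 1]))

    # Soft outputs from a hypothetical image branch and text branch.
    image_probs = [0.6, 0.4]
    text_probs = [0.2, 0.8]
    print(weighted_fusion([image_probs, text_probs], weights=[0.3, 0.7]))
```

In this sketch the text branch is given the larger weight, so its preference for the second class dominates the fused decision. In practice such weights would have to be set or learned from validation data.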