Speech emotion recognition (SER) has attracted much attention in recent years, especially in multilingual settings, owing to its potential for understanding human psychology and advancing human-computer interaction. However, recent work on SER has mainly focused on developing elaborate architectures to improve performance on monolingual datasets; little attention has been paid to improving transfer performance across multilingual datasets. In this paper, we propose a multilingual SER framework that uses a pre-trained model as the upstream component to learn high-level speech representations and a hierarchical grained and feature model (HGFM) as the downstream classifier. The framework extracts speech representations with a cross-lingual speech representations (XLSR) model and feeds them to the HGFM for classification. We validate the framework on a multilingual collection of datasets comprising IEMOCAP (English), EmoDB (German), TESS (English), SAVEE (English), EMA (English), and EMOVO (Italian). Experimental results show that features extracted by the upstream model achieve an average weighted accuracy (WA) of 70.6% and unweighted accuracy (UA) of 73.4% on the downstream task, outperforming both hand-crafted features and other upstream structures. We also compare our results with state-of-the-art and alternative methods to validate our framework, and we evaluate the performance of the structure in terms of F1-score.
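The upstream/downstream pipeline described above can be sketched as follows. This is a minimal, illustrative sketch only: the random-projection "upstream" stands in for the pre-trained XLSR model, and the pooling-plus-linear "downstream" stands in for the HGFM classifier, so the sketch stays self-contained; all function names and dimensions here are assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Upstream (stand-in for XLSR): map raw audio to frame-level
# representations. In the actual framework this would be a pre-trained
# cross-lingual wav2vec 2.0 (XLSR) model; here a fixed random projection
# of 25 ms frames (hop 10 ms at 16 kHz) keeps the sketch dependency-free.
def upstream_features(waveform, frame_len=400, hop=160, dim=512):
    frames = np.stack([waveform[i:i + frame_len]
                       for i in range(0, len(waveform) - frame_len + 1, hop)])
    proj = rng.standard_normal((frame_len, dim)) * 0.01
    return frames @ proj                      # shape: (num_frames, dim)

# --- Downstream (stand-in for HGFM): pool frame-level features into a
# single utterance-level vector, then apply a linear classifier. The real
# HGFM uses recurrent frame- and utterance-level stages; mean+std pooling
# preserves the same coarse-grained/fine-grained idea in a few lines.
def classify(features, n_classes=4):
    utterance = np.concatenate([features.mean(axis=0), features.std(axis=0)])
    w = rng.standard_normal((utterance.shape[0], n_classes)) * 0.01
    return int(np.argmax(utterance @ w))      # predicted emotion index

audio = rng.standard_normal(16000)            # 1 s of fake audio at 16 kHz
feats = upstream_features(audio)              # frame-level representations
label = classify(feats)                       # utterance-level emotion label
```

In the real framework the upstream weights come from self-supervised pre-training on multilingual speech, which is what allows a single feature extractor to transfer across the English, German, and Italian corpora listed above.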