Automatic objective assessment of dysarthria is valuable and crucial. Most previous studies focus on using audio-only data, ignoring the complementary of other modal data. In addition, traditional methods ignore the relationship between the pre-defined features and different pronunciations, reducing the performance of the automatic assessment system. To address these issues, this paper proposes a joint feature-sample selection (JFSS) based dysarthria severity level regression model using audio-visual data. In the proposed framework, relevant pronunciation samples and features are simultaneously obtained and unreliable noisy samples are discarded by the JFSS method. On the Mandarin Subacute Stroke Dysarthria Multimodal (MSDM) Database, the proposed regression model outperformed several baseline models. By using acoustic-visual features, the root mean square error (RMSE) of 13.78 and fitting coefficient R-square of 0.77 computed between the automatically predicted and perceptual evaluation metrics (i.e. Frenchay Dysarthria Assessment) were obtained, which confirmed the capacity of the proposed JFSS-based regression method in predicting dysarthria severity level.