In this paper, we construct a multimodal classification model to predict pineapple quality. We compile a dataset of 500 pineapples covering two modalities: tapping each pineapple to record the resulting sound, and photographing it with a camera. Three classification models, an audio model, a visual model, and an audio-visual model, are built on deep learning architectures over the corresponding feature representations. The experimental evaluation demonstrates that the audio-visual model, which combines both audio and visual representations, achieves higher accuracy than either unimodal model.
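To make the audio-visual design concrete, the following is a minimal sketch of one common way to combine the two modalities: each modality is passed through its own encoder and the resulting embeddings are concatenated before a shared classification head. The encoder structure, feature dimensions, and class count here are illustrative assumptions, not the architecture actually used in this work.

```python
import torch
import torch.nn as nn

class AudioVisualClassifier(nn.Module):
    """Sketch of a feature-level (concatenation) fusion model for
    pineapple quality classification. All dimensions are hypothetical."""

    def __init__(self, audio_dim=128, visual_dim=512, num_classes=2):
        super().__init__()
        # Hypothetical unimodal encoders; in practice these would be
        # deep backbones over tapping-sound spectrograms and images.
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU())
        self.visual_encoder = nn.Sequential(nn.Linear(visual_dim, 64), nn.ReLU())
        # Shared head classifies the fused (concatenated) embedding.
        self.classifier = nn.Linear(64 + 64, num_classes)

    def forward(self, audio_feat, visual_feat):
        a = self.audio_encoder(audio_feat)
        v = self.visual_encoder(visual_feat)
        fused = torch.cat([a, v], dim=1)  # simple concatenation fusion
        return self.classifier(fused)

# Usage with random tensors standing in for real audio/visual features.
model = AudioVisualClassifier()
audio = torch.randn(8, 128)    # batch of 8 audio feature vectors
visual = torch.randn(8, 512)   # batch of 8 visual feature vectors
logits = model(audio, visual)  # shape: (8, 2)
```

Dropping either encoder from this sketch recovers a unimodal audio or visual model, which mirrors the three-model comparison described above.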