Interest in multimodal dialogue systems has grown in recent years. Unlike traditional unimodal dialogue systems, a multimodal dialogue system must understand context that spans multiple modalities before responding to users’ utterances. In this paper, we present a detailed survey of recent advances in multimodal dialogue systems. In particular, we categorize these systems by two basic tasks: outputting textual responses and outputting visual responses. Both tasks face two main challenges, the heterogeneity gap and the semantic gap, and we analyze the key techniques used to address them. We also comprehensively review benchmark datasets and popular evaluation metrics. Finally, we outline promising directions for future work.