We developed a modality-attention motion generation model based on multi-modality prediction. This model provides interpretability about modality usage and demonstrates robustness against disturbances. We used a hierarchical model consisting of low-level recurrent neural networks (RNNs) that process each modality individually and a high-level RNN that integrates the modalities. This integration is achieved by gating the modality features and feeding them to the high-level RNN. We verified the interpretability and robustness on a furniture-part insertion task, which consists of an “approach” phase, in which a wooden dowel is brought close to the hole, and an “insertion” phase. While the proposed model achieves the same task success rate as the conventional model, it reveals that it attends to vision during the “approach” phase and to force during the “insertion” phase, providing interpretability regarding modality use. Furthermore, in contrast to the model without modality attention, whose task success rate drops significantly under disturbance, the proposed model is robust against disturbances to modalities it does not attend to during the task, maintaining a consistently high success rate ($\simeq\! 90\%$).
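The hierarchical gating described above can be sketched as follows; this is a minimal illustration assuming PyTorch, where the module names, dimensions, and the softmax-based gating are our own assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of a hierarchical modality-attention model (assumed design):
# per-modality low-level RNNs, softmax gating, and a high-level RNN.
import torch
import torch.nn as nn

class ModalityAttentionRNN(nn.Module):
    def __init__(self, modality_dims, low_hidden=64, high_hidden=128):
        super().__init__()
        # One low-level RNN per modality (e.g., vision features, force).
        self.low_rnns = nn.ModuleList(
            [nn.LSTMCell(d, low_hidden) for d in modality_dims]
        )
        # Attention head: scores each modality's low-level hidden state.
        self.attn = nn.Linear(low_hidden, 1)
        # High-level RNN integrates the gated modality features.
        self.high_rnn = nn.LSTMCell(low_hidden, high_hidden)

    def forward(self, inputs, low_states, high_state):
        # inputs: list of (batch, modality_dim) tensors, one per modality.
        # low_states: list of (h, c) tuples; high_state: (h, c) tuple.
        new_low, hiddens = [], []
        for x, cell, state in zip(inputs, self.low_rnns, low_states):
            h, c = cell(x, state)
            new_low.append((h, c))
            hiddens.append(h)
        hs = torch.stack(hiddens, dim=1)  # (batch, n_modalities, low_hidden)
        # Softmax over modalities yields interpretable attention weights.
        weights = torch.softmax(self.attn(hs).squeeze(-1), dim=1)
        # Gate: weighted sum of modality features fed to the high-level RNN.
        gated = (weights.unsqueeze(-1) * hs).sum(dim=1)
        high_state = self.high_rnn(gated, high_state)
        return high_state, new_low, weights  # weights expose modality use
```

Inspecting `weights` over time is what would make the modality usage interpretable: under this sketch, the weight on vision should dominate during the “approach” phase and the weight on force during the “insertion” phase.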