Visual Commonsense Reasoning (VCR) is a challenging multimodal task involving several research fields such as vision, cognition, and reasoning, which combines images and natural language for reasoning. Existing VCR methods focus on global attention or use pre-training models, but these methods lack attention to local features of visual and language. In this paper, a multi-layer attention network is proposed for the VCR task, including an intra-modal attention module and an inter-modal attention module. The intra-modal attention module complements important features of visual and language modalities with fine-grained visual attention to improve the relevance of visual and language. The inter-modal attention module captures the internal dependencies between visual and language. Finally, the two modules are integrated into an end-to-end reasoning framework. Experiments on the VCR large-scale dataset show that the proposed method exhibits a decent improvement in the VCR task and illustrates the effectiveness of the method on three subtasks.