Automatically recovering the original structure of tables from unstructured images is a challenging task that combines techniques from computer vision (CV) and natural language processing (NLP). Unfortunately, common feature extraction methods, naive fusion strategies, and rigid inductive biases have hindered further improvement of previous approaches. Unlike other data representations, tables consist of many dispersed yet interdependent cells. In this paper, we therefore propose a novel approach to table structure recognition that exploits these special properties of tables. The method first fuses the visual and textual features extracted by a two-stream network using an adaptive fusion strategy. In the second stage, layout features are integrated through a Kronecker-based strategy. Table elements carrying these multimodal features are then modeled according to their spatial relationships, and interactions among them are established by a hybrid contextual aggregator that allows message passing at both local and global levels. Finally, table structure recognition is achieved by predicting the relationships between elements. We evaluate the proposed approach on several public datasets, including ICDAR2013, UNLV, WTW, SciTSR, and SciTSR-COMP, as well as a more challenging private dataset, and it achieves strong results on all of them.
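To make the four stages of the pipeline concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the abstract does not specify the exact fusion gate, Kronecker formulation, neighbourhood size, or relation taxonomy, so every module name, dimension, and design choice below (gated channel-wise fusion, an outer-product reading of the Kronecker strategy, kNN local aggregation plus full self-attention, and a three-way pairwise relation head) is an assumption chosen for illustration.

```python
# A minimal sketch (assumed design, not the paper's code) of the described pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Gated fusion of per-cell visual and textual features (assumed form)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, vis, txt):
        # Learn a per-channel gate deciding how much each modality contributes.
        g = torch.sigmoid(self.gate(torch.cat([vis, txt], dim=-1)))
        return g * vis + (1 - g) * txt

class KroneckerLayoutMixer(nn.Module):
    """One plausible 'Kronecker-based' integration: an outer product of the
    fused feature with a layout embedding, flattened and projected back."""
    def __init__(self, dim, layout_dim):
        super().__init__()
        self.proj = nn.Linear(dim * layout_dim, dim)

    def forward(self, feat, layout):
        # Per-element outer product: (N, dim, 1) * (N, 1, layout_dim).
        kron = feat.unsqueeze(-1) * layout.unsqueeze(-2)
        return self.proj(kron.flatten(-2))

class HybridAggregator(nn.Module):
    """Local + global message passing over table elements (assumed design)."""
    def __init__(self, dim, k=8, heads=4):
        super().__init__()
        self.k = k
        self.local = nn.Linear(2 * dim, dim)
        self.glob = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat, centers):
        # Local level: average the k spatially nearest neighbours of each cell.
        d = torch.cdist(centers, centers)                        # (N, N)
        idx = d.topk(self.k + 1, largest=False).indices[:, 1:]   # drop self
        neigh = feat[idx].mean(dim=1)                            # (N, dim)
        local = self.local(torch.cat([feat, neigh], dim=-1))
        # Global level: full self-attention across all table elements.
        x = feat.unsqueeze(0)
        glob, _ = self.glob(x, x, x)
        return F.relu(local + glob.squeeze(0))

class RelationHead(nn.Module):
    """Classify each cell pair, e.g. same-row / same-column / no relation
    (a common taxonomy; the paper's classes are not specified here)."""
    def __init__(self, dim, n_rel=3):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, n_rel))

    def forward(self, feat):
        n = feat.size(0)
        pairs = torch.cat([feat.unsqueeze(1).expand(n, n, -1),
                           feat.unsqueeze(0).expand(n, n, -1)], dim=-1)
        return self.mlp(pairs)                                   # (N, N, n_rel)

# Toy usage: 12 cells with 64-d visual/textual features and 16-d layout codes.
N, D, L = 12, 64, 16
vis, txt = torch.randn(N, D), torch.randn(N, D)
layout, centers = torch.randn(N, L), torch.rand(N, 2)
fused = AdaptiveFusion(D)(vis, txt)                 # stage 1: adaptive fusion
mixed = KroneckerLayoutMixer(D, L)(fused, layout)   # stage 2: layout integration
ctx = HybridAggregator(D)(mixed, centers)           # stage 3: local/global context
logits = RelationHead(D)(ctx)                       # stage 4: pairwise relations
```

Under these assumptions, the structure of the table is recovered from the pairwise relation logits, e.g. by grouping cells connected by predicted same-row and same-column links.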