Video question answering (Video Q&A) is a challenging task, as it requires a thorough understanding of both the video and the question. A video is composed of a frame sequence that contains multi-scale temporal relationships and the corresponding contextual information. A model that competently tackles the Video Q&A task needs to be able to: 1) construct long-term and neighborhood dependencies in frame sequences, extracting global and local contextual features that reflect multi-scale temporal dependencies, so as to deduce temporal-aware refined features; and 2) identify static and dynamic features from the pertinent moments of a video while filtering away question-irrelevant dependencies in the feature sequences, so as to yield the most precise and reasonable temporal-aware overall contextual features. In response to these requirements, we propose a novel Video Q&A mechanism consisting of a Bidirectional Complementary Attention (BCA) module and an Adaptive Temporal-aware (ATA) module. The BCA module stacks a multi-head self-attention layer and a convolutional layer in different orders to form two kinds of attention units, which enables bidirectional multi-step reasoning over complete global information and accurate local information to obtain temporal-aware refined features. The ATA module filters away question-irrelevant dependencies in the feature sequence to yield the most precise and reasonable temporal-aware overall contextual features. Comprehensive comparative experiments are conducted on publicly available benchmark datasets, and an extended ablation study further demonstrates the contribution of each module to the model's Q&A capabilities.
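To make the two ideas concrete, the following is a minimal NumPy sketch, not the paper's implementation: single-head attention stands in for the multi-head self-attention layer, a plain 1-D temporal convolution stands in for the convolutional layer, and the function names (`attn_then_conv`, `conv_then_attn`, `bca`, `ata`), the additive fusion of the two units, and the question-guided softmax gate are all illustrative assumptions. It shows how stacking attention and convolution in the two orders yields complementary global-first and local-first units, and how a question vector can softly filter the resulting feature sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (T, d) frame features. Scaled dot-product self-attention:
    # every frame attends to every other frame, capturing long-term
    # (global) dependencies across the whole sequence.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)          # (T, T) pairwise frame affinities
    return softmax(scores, axis=-1) @ x    # (T, d) globally contextualized

def temporal_conv(x, k=3):
    # 1-D convolution over the time axis with "same" padding, capturing
    # neighborhood (local) dependencies between adjacent frames.
    # Weights are random here; a trained model would learn them.
    T, d = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    w = np.random.default_rng(0).normal(scale=1.0 / np.sqrt(k * d),
                                        size=(k, d, d))
    out = np.zeros_like(x)
    for t in range(T):
        for i in range(k):
            out[t] += xp[t + i] @ w[i]
    return out

def attn_then_conv(x):
    # Global-first unit: build global context, then refine it locally.
    return temporal_conv(self_attention(x))

def conv_then_attn(x):
    # Local-first unit: sharpen local context, then relate it globally.
    return self_attention(temporal_conv(x))

def bca(x):
    # Bidirectional complementary combination of the two units
    # (additive fusion is an assumption made for this sketch).
    return attn_then_conv(x) + conv_then_attn(x)

def ata(h, q):
    # Adaptive temporal-aware filtering: score each time step against the
    # question vector q (d,) and softly down-weight question-irrelevant
    # steps, pooling to one overall contextual feature.
    gate = softmax(h @ q / np.sqrt(h.shape[-1]), axis=0)  # (T,)
    return (gate[:, None] * h).sum(axis=0)                # (d,)
```

Under this reading, `bca` produces the temporal-aware refined feature sequence and `ata` reduces it to a single question-conditioned representation; both operations preserve the feature dimension, so they compose with standard answer-decoding heads.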