Intuitively, distinguishing fine-grained actions in videos requires recursively capturing subtle visual cues and learning abstract features. However, existing deep neural network-based methods are counter-intuitive in that their network layers do not explicitly model recursive feature abstraction. We are therefore motivated to propose an Adaptive Recursive Circle (ARC) framework that equips common neural network layers with recursive attention and recursive fusion. An ARC layer inherits the same operators and parameters as the original layer but, most critically, treats the layer input as an evolving state, thus explicitly achieving recursive feature abstraction by alternating between state update and feature generation. Specifically, at each recursive step, the input state is first updated via both recursive attention and recursive fusion from the previously generated features, and then feature abstraction is performed with the newly updated input state. Significant improvements are observed on multiple datasets. For example, an ARC-equipped TSM-ResNet-18 outperforms TSM-ResNet-50 on the Something-Something V1 and Diving48 datasets with only half the overhead. Code will be available at: https://github.com/0HaNC/ARC-ActionRecog.
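The alternation between state update and feature generation described above can be sketched minimally as follows. This is a toy illustration, not the paper's implementation: `layer` stands in for the reused original layer, and `recursive_update` collapses recursive attention and recursive fusion into a single hypothetical gated blend of the evolving state with the previously generated features.

```python
# Toy sketch of the ARC recursion: the same layer is applied repeatedly,
# with the input treated as an evolving state. All functions here are
# simplified stand-ins for illustration only.

def layer(state):
    # Stand-in for the original layer's feature generation (e.g., a conv).
    return [2.0 * x + 1.0 for x in state]

def recursive_update(state, feats, gate=0.5):
    # Recursive attention + fusion collapsed into one gated blend:
    # the input state moves toward the previously generated features.
    return [(1 - gate) * s + gate * f for s, f in zip(state, feats)]

def arc_forward(x, steps=3):
    state = list(x)
    feats = layer(state)                         # initial feature generation
    for _ in range(steps - 1):
        state = recursive_update(state, feats)   # update the evolving state
        feats = layer(state)                     # regenerate features
    return feats

print(arc_forward([0.0, 1.0]))  # → [3.5, 8.0]
```

Because `layer` keeps the same operators and parameters at every step, the recursion adds depth of abstraction without adding new weights, which is consistent with the efficiency claim above.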