Vision Transformers inherently capture data-dependent, long-range dependencies, achieving a number of remarkable results against their contemporary competitors, CNNs. To alleviate the excessive computational burden, previous methods apply local operations (e.g., convolution, local attention) in the high-resolution stages. Although these designs are efficient for learning local relations, especially in the highly redundant early stages, they inevitably sacrifice non-locality and are constrained by a limited receptive field. In this paper, we present an effective hybrid-style vision backbone, dubbed C2SFormer, which is explicitly built with dynamic convolution and self-attention to undertake local and global interactions, respectively. We adopt two homogeneous modules whose structure follows that of typical Transformer blocks. For local relation learning, we take a parallel multi-scale design with additive aggregation as a simple but effective idea, named MS-SCDC. For global context modeling, we leverage the efficient factorized self-attention mechanism proposed in CoaT and apply it with MS-SCDC in a cross-stacking manner over the high-resolution stages. We further introduce a general approach for multi-scale learning in Transformer-based modules, named MS-MHSA. Experiments conducted on a variety of general-purpose vision tasks demonstrate the superiority of the proposed model.
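
To illustrate the parallel multi-scale, additively aggregated local design mentioned above, the following PyTorch sketch shows one plausible form of such a block. It is an assumption-laden illustration, not the paper's implementation: the kernel sizes, the substitution of plain depthwise convolutions for the dynamic convolution, and names such as `MultiScaleLocalBlock` are all hypothetical.

```python
# Minimal sketch of a parallel multi-scale local block with additive aggregation,
# in the spirit of MS-SCDC. Plain depthwise convolutions stand in for the paper's
# dynamic convolution; kernel sizes and names are illustrative assumptions.
import torch
import torch.nn as nn


class MultiScaleLocalBlock(nn.Module):
    """Parallel depthwise convolutions at several scales, summed additively."""

    def __init__(self, dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)  # one depthwise conv per scale
            for k in kernel_sizes
        )
        self.norm = nn.BatchNorm2d(dim)
        self.proj = nn.Conv2d(dim, dim, 1)  # pointwise fusion after aggregation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Additive aggregation of the parallel multi-scale branches.
        local = sum(branch(x) for branch in self.branches)
        return x + self.proj(self.norm(local))  # residual connection, Transformer-style


if __name__ == "__main__":
    block = MultiScaleLocalBlock(dim=64)
    out = block(torch.randn(2, 64, 56, 56))
    print(out.shape)  # torch.Size([2, 64, 56, 56])
```

In this sketch the branches share the input resolution and channel width, so additive aggregation is a parameter-free fusion; a pointwise projection then mixes channels before the residual addition.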