Referring video object segmentation (RVOS) aims to segment the object in a video that is referred to by a text description. The core of RVOS lies in the cross-modal alignment between vision and text. To improve performance, most previous RVOS methods are devoted to exploring more sophisticated visual cues. However, these methods fail to fully exploit the inherent structured properties of text and rely only on coarse text features for vision-text interaction. In this paper, we propose FTVR, a Fine-grained Text-Video approach for Referring video object segmentation that exploits fine-grained text information. We introduce an LLM to split the text sentence into distinct functional phrases and propose three novel modules to enhance cross-modal alignment. Concretely, we design a dynamic-aware perception module to handle motion-related phrases, followed by a global-aware attention module that fuses the resulting motion information. To handle entity-related phrases, FTVR also introduces an entity-aware augmentation module to highlight entity information. Extensive experiments on four popular benchmarks demonstrate the effectiveness of our method.