By employing the low-cost inertial measurement units (IMUs) found in off-the-shelf mobile devices, inertial odometry techniques can provide environment-independent position information, giving them great research and commercial value. However, the high noise level of low-cost inertial sensor readings makes this task challenging. To address this, we propose a deep inertial odometry method that exploits hierarchical temporal features of IMU sequences. Specifically, the proposed method transforms the inertial odometry problem into a sequence-to-sequence (seq2seq) translation task by segmenting the overall inertial sequence into subsequences, which we refer to as raw sentences. An attention-based hierarchical structure is then designed to extract and fuse multi-level temporal features, generating feature sequences with rich contextual information, which we refer to as source sentences. Finally, we employ the state-of-the-art Transformer as a translator to estimate the corresponding pose-change sequences, which we refer to as target sentences, and integrate these estimates into the trajectory. We have conducted extensive experiments on two public datasets: the small-scale OxIOD and the large-scale IDOL. The experimental results demonstrate that, compared to competing schemes, our method reduces the mean absolute trajectory error and the relative trajectory error by at least 16.8% and 16.7%, respectively, on the OxIOD dataset, and by 48.4% and 58.1%, respectively, on the IDOL dataset.