Video anomaly detection is crucial for behavior analysis and has witnessed continuous progress in recent years within the auto-encoder based reconstruction framework. However, abnormal frames may sometimes also be reconstructed well due to the strong representation ability of deep networks, which increases missed detections. To mitigate this issue, existing methods usually adopt a memory bank, which records normal patterns so that abnormal frames are reconstructed toward normal ones and thus receive high reconstruction errors. In this paper, to better exploit the semantic information of normal videos recorded in the memory module, we introduce the Memory-Token Transformer (MTT) to boost the reconstruction performance on normal frames. We assume that the anomalies in a video mainly concentrate on the regions containing people and relevant objects. Therefore, during the decoding stage, we first extract the semantic concepts of a feature map and generate the corresponding semantic tokens. These tokens are then combined with the proposed memory module. Finally, we introduce a transformer to model the complex relationships among different tokens, and we use 3D convolution with pooling operators in our encoder to enhance spatio-temporal feature extraction compared with 2D models. Experimental results on various benchmarks demonstrate the effectiveness of the proposed method.
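The memory-read step mentioned above can be sketched as follows. This is a minimal, generic softmax-addressed memory lookup (a common design in memory-based anomaly detection), written with hypothetical names and NumPy for illustration; the abstract does not specify the paper's exact addressing or update rules.

```python
import numpy as np

def memory_read(tokens, memory):
    """Reconstruct each query token as a convex combination of memory items.

    tokens: (T, D) array of semantic tokens from the decoder.
    memory: (M, D) array of learned normal-pattern prototypes.
    Returns a (T, D) array of memory-reconstructed tokens.
    """
    # Cosine-similarity addressing between tokens and memory items.
    t = tokens / np.linalg.norm(tokens, axis=-1, keepdims=True)
    m = memory / np.linalg.norm(memory, axis=-1, keepdims=True)
    sim = t @ m.T                                   # (T, M) similarities

    # Softmax over memory items yields addressing weights that sum to 1,
    # so each output token lies in the convex hull of normal prototypes.
    e = np.exp(sim - sim.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)           # (T, M) weights
    return w @ memory                               # (T, D) reconstruction
```

Because the output is constrained to combinations of recorded normal patterns, tokens from abnormal regions are pulled toward normal appearance, which is what produces the large reconstruction error used for detection.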