Cross-modal hashing has attracted great interest in recent decades. Because traditional hashing retrieval models rely on hand-crafted feature extraction, end-to-end deep hashing models have been studied extensively. However, training such deep models is complex and requires a large number of samples. To address these issues, we propose CLIP-Hash, a lightweight hashing retrieval network built on the pre-trained CLIP model. It derives strong hash features from the pre-trained CLIP model and provides a lightweight fine-tuning module that is easy to train and requires only a few training samples. First, we obtain powerful feature representations from the visual and textual encoders. Then, we propose a simple hash encoding module, a lightweight network with only a few trainable parameters. With the proposed hash encoding module, we obtain efficient binary hash codes that align images and texts. We conduct experiments on two benchmark datasets, MIRFlickr and NUS-WIDE. The results show that the proposed simple method outperforms state-of-the-art hashing methods.
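To make the architecture concrete, below is a minimal sketch of how such a lightweight hash encoding module could look, assuming frozen CLIP encoders and a single linear projection with a tanh relaxation that is binarized by sign at retrieval time. The names `HashEncoder`, `feat_dim`, and `n_bits` are illustrative choices, not taken from the paper, and the random tensors stand in for actual CLIP embeddings.

```python
import torch
import torch.nn as nn


class HashEncoder(nn.Module):
    """Illustrative lightweight hash head mapping CLIP features to K-bit codes."""

    def __init__(self, feat_dim: int = 512, n_bits: int = 64):
        super().__init__()
        # A single fully connected layer: very few trainable parameters,
        # consistent with the "lightweight fine-tuning" idea in the abstract.
        self.fc = nn.Linear(feat_dim, n_bits)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # tanh gives relaxed codes in (-1, 1), differentiable for training.
        return torch.tanh(self.fc(features))

    @torch.no_grad()
    def binarize(self, features: torch.Tensor) -> torch.Tensor:
        # sign() produces the final binary codes in {-1, +1} for retrieval.
        return torch.sign(self.forward(features))


# Usage sketch: a shared head on top of frozen CLIP encoders.
# Random tensors are placeholders for clip_model.encode_image(...) /
# clip_model.encode_text(...) outputs (hypothetical calls, not from the paper).
img_feats = torch.randn(8, 512)
txt_feats = torch.randn(8, 512)
head = HashEncoder(feat_dim=512, n_bits=64)
img_codes = head.binarize(img_feats)
txt_codes = head.binarize(txt_feats)
```

Under this reading, only the small projection head is trained while the CLIP backbones stay fixed, which is one plausible way the model could remain easy to train with few samples.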