Sentence embeddings learned from text are widely used for tasks such as semantic textual similarity and automatic evaluation of text generation. SimCSE, a sentence embedding method based on contrastive learning, achieves high accuracy on semantic textual similarity tasks. VisualCSE and AudioCSE, both derived from SimCSE, supplement text-based training with training on image and audio data, respectively, and have been shown to further improve accuracy in English. However, these methods, which rely on non-linguistic data, have not been validated in Japanese. This study examines the effectiveness of VisualCSE in Japanese. In our experiments, VisualCSE in Japanese did not show the significant improvement in accuracy observed in English. We also analyze how sentence embedding learning is affected when noise data is used in place of image data.