In content-based remote sensing (RS) image retrieval, convolutional neural networks (CNNs) have demonstrated clear superiority over other methods in terms of performance. CNNs are typically trained in a supervised manner and therefore require a large number of labeled samples, yet labeled images are scarce in the RS community. Moreover, CNN-based approaches suffer from further drawbacks such as cumbersome networks and high-dimensional features. To address these issues, we apply unsupervised transfer learning to CNN training: we transform similarity learning into deep ordinal classification with the help of several CNN experts pretrained on large-scale labeled everyday image sets, which jointly determine image similarities and provide pseudolabels for classification. The proposed method yields a lightweight model, called the similarity-based Siamese CNN (SBS-CNN), that can be trained from scratch with completely unlabeled RS images and produces compact features. Furthermore, existing CNNs are generally trained with the cross-entropy loss, which entirely ignores the interclass semantic relationship. To overcome this shortcoming, we construct a novel loss function, the weighted Wasserstein ordinal loss, which takes into account the ordinal relationship among categories and thus guides parameter updates more effectively during training. Extensive experiments on publicly available RS data sets show that our SBS-CNN outperforms existing CNN-based approaches.
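The abstract does not give the exact formula for the weighted Wasserstein ordinal loss, but the key idea it describes (penalizing predictions according to how far they fall along the ordered label axis, unlike cross-entropy) can be illustrated with a common Wasserstein-style formulation: the weighted sum of absolute differences between the cumulative distributions of the prediction and a one-hot target. The function name, the one-hot target assumption, and the uniform default weights below are illustrative, not the authors' definition.

```python
import numpy as np

def wasserstein_ordinal_loss(probs, target, weights=None):
    """Sketch of a 1-D Wasserstein (earth mover's) loss over ORDERED classes.

    probs   : predicted class probabilities, shape (K,), summing to 1
    target  : index of the true ordinal class
    weights : optional per-step ground-metric weights, shape (K-1,)

    Unlike cross-entropy, this loss grows with the ordinal distance
    between the predicted mass and the target class.
    """
    probs = np.asarray(probs, dtype=float)
    K = probs.shape[-1]
    target_dist = np.zeros(K)
    target_dist[target] = 1.0
    # Absolute difference of the two CDFs; the last entry is always 0
    # because both distributions sum to 1, so only K-1 terms matter.
    cdf_diff = np.abs(np.cumsum(probs) - np.cumsum(target_dist))[:-1]
    if weights is None:
        weights = np.ones(K - 1)  # uniform ground metric (illustrative default)
    return float(np.sum(weights * cdf_diff))
```

With a perfect prediction the loss is zero, and a prediction two ordinal steps away from the target costs twice as much as an adjacent-class mistake, which is exactly the interclass semantic structure that plain cross-entropy cannot express.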