Crack segmentation is a crucial task in several domains, particularly infrastructure inspection, civil engineering, and road maintenance. To accurately detect cracks in input RGB images, two families of artificial intelligence (AI) approaches have been proposed for segmentation: convolutional neural networks (CNNs) and transformer-based techniques, namely vision transformers (ViTs). In this study, we present a comparative analysis that evaluates the robustness of CNNs against ViTs. CNNs are built from a series of convolutional and pooling layers designed to capture local patterns and features in an image. ViTs, on the other hand, use self-attention mechanisms to capture global relationships within a sequence of patches extracted from the input image. In addition to the quantitative evaluation, we derive qualitative explainable saliency heat maps for visual comparison. We use two crack datasets for evaluation: Crack500 and DeepCrack. We compare the results of six explainable AI (XAI) models, three CNN-based and three ViT-based. The segmentation results provide comparative measurements across the CNN and ViT models. Such a comprehensive comparison could help researchers in the domain select the most suitable model.