Facial recognition systems are increasingly being adopted throughout the world for their ease of use and noninvasive nature; however, this convenience comes with significant risks in the form of spoofing (presentation) attacks. Researchers have proposed numerous classical and deep learning techniques to detect presentation attacks, using data from RGB, infrared (IR), or depth sensors. In this paper, we propose a novel multimodal face anti-spoofing (FAS) transformer (MFAST) approach that employs two separate transformer modules for the RGB and thermal (long-wave IR) modalities. The features from these two transformer modules are fused to detect spoofing attacks. The proposed method achieves better anti-spoofing performance than existing FAS architectures. Because training FAS models requires diverse data, researchers have released numerous datasets covering several types of presentation attacks in different modalities. Inspecting these datasets, we observe that little attention has been paid to the ethnicity of the subjects involved: most of the data pertain to Caucasian, East Asian, or African ethnicities, with only a small amount covering subjects of South Asian descent. To address this gap in the available datasets, we have collected a new dataset that provides a large number of South Asian subjects in both the RGB and thermal modalities. Our contributions in this paper are therefore twofold: we propose a transformer-based presentation attack detection model, and we provide a new FAS dataset for South Asian ethnicity.