In this paper, we propose a refined approach to visual grounding, the task of locating the object in an image that is most relevant to a natural language query. The model must comprehend the query, identify the salient objects in the image, and then localise the target object by predicting its bounding box. Our model adopts an end-to-end modulated detector that uses a transformer-based architecture for early fusion of the text and image modalities. State-of-the-art models for vision-language problems commonly suffer from large numbers of trainable parameters, long training times, and high computational complexity. Our work tackles these problems by reducing the parameter count and computational complexity of the model while also shortening training. Using our techniques, we reduce training time by about 12% on the Flickr30k and MSCOCO datasets with only a minor loss in accuracy. This result demonstrates that accuracy and computational efficiency can be balanced effectively in vision-language tasks. Our approach extends readily to related problems such as visual question answering and image captioning.
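
To make the early-fusion idea concrete, the following is a minimal PyTorch sketch in which projected image features and embedded query tokens are concatenated and processed jointly by a shared transformer encoder. The `EarlyFusionEncoder` class, its dimensions, layer counts, and the toy box-decoding rule are illustrative assumptions, not the configuration reported in this paper.

```python
# Minimal sketch of transformer-based early fusion for visual grounding.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Hypothetical module: fuses image and text features in one encoder."""

    def __init__(self, d_model=256, nhead=8, num_layers=6,
                 img_feat_dim=2048, vocab_size=30522):
        super().__init__()
        # Project CNN backbone features and embed query tokens
        # into a shared d_model-dimensional space.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.txt_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Bounding-box head: (cx, cy, w, h), normalised to [0, 1].
        self.bbox_head = nn.Linear(d_model, 4)

    def forward(self, img_feats, txt_tokens):
        # img_feats:  (B, N_img, img_feat_dim) flattened backbone features
        # txt_tokens: (B, N_txt) token ids of the language query
        img = self.img_proj(img_feats)
        txt = self.txt_embed(txt_tokens)
        # Early fusion: concatenate both modalities *before* encoding,
        # so every layer attends jointly over image and text positions.
        fused = self.encoder(torch.cat([img, txt], dim=1))
        # Toy decoding rule: pool the fused image positions into one box.
        pooled = fused[:, :img.size(1)].mean(dim=1)
        return self.bbox_head(pooled).sigmoid()

model = EarlyFusionEncoder()
boxes = model(torch.randn(2, 49, 2048), torch.randint(0, 30522, (2, 12)))
print(boxes.shape)  # torch.Size([2, 4])
```

Because fusion happens at the encoder input rather than through late cross-attention between separate streams, every self-attention layer can condition image features on the query and vice versa, which is what allows a single shared stack to stay comparatively small.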