Integrating augmented reality (AR) with externally hosted computer vision (CV) models can provide enhanced AR experiences. For instance, by utilising an advanced object detection model, an AR system can recognise a range of predefined objects within the user’s immediate surroundings. However, existing AR-CV workflows rarely incorporate user-defined contextual information, which often comes in the form of multi-modal queries that blend natural language and body language. Interpreting these intricate user queries, processing them via a sequence of deep learning models, and then visualising the outcomes effectively remains a formidable challenge. In this paper, we describe Situated Imaging (SI), an extensible array of techniques for in-situ interactive visual computing. We delineate the architecture of the Situated Imaging framework, which enhances the conventional AR-CV workflow by incorporating a range of advanced interactive and generative computer vision techniques. We also describe a demonstration implementation that illustrates the pipeline’s capabilities, enabling users to engage in activities such as labelling, highlighting, or generating content within a user-defined context. Furthermore, we provide initial guidance for tailoring this framework to example use cases and identify avenues for future research. Our model-agnostic Situated Imaging pipeline acts as a starting point for both academic scholars and industry practitioners interested in enhancing the AR experience by incorporating computationally intensive AI models.