In recent years, Transformer models have revolutionized machine learning. While this has led to impressive results in Natural Language Processing, Computer Vision quickly ran into computation and memory problems due to the high resolution and dimensionality of the input data. This is particularly true for video, where the number of tokens grows cubically with the spatial and temporal resolutions. A first approach to this problem was Vision Transformers, which partition the input into embedded grid cells, lowering the effective resolution. More recently, Swin Transformers introduced a hierarchical scheme that brought the concepts of pooling and locality to Transformers in exchange for much lower computational and memory costs. This work proposes a reformulation of the latter that views Swin Transformers as regular Transformers applied over a quadtree representation of the input, intrinsically providing a wider range of design choices for the attention mechanism. Compared to similar approaches such as Swin and MaxViT, our method operates on the full range of scales using a single attention mechanism, allowing it to simultaneously capture both dense short-range and sparse long-range dependencies with low computational overhead and without introducing additional sequential operations, thus making full use of GPU parallelism.