Sample-efficient reinforcement learning (RL) methods that can learn directly from raw sensory data will open up real-world applications in robotics and control. Recent breakthroughs in visual RL have shown that learning a latent representation alongside traditional RL techniques bridges the gap between state-based and image-based training. In this paper, we conduct an empirical investigation of visual RL algorithms that are trained end-to-end directly from image pixels, applied to 3D continuous control problems. To this end, we evaluate three recent visual RL algorithms (CURL, SAC+AE, and DrQ-v2) with respect to sample efficiency and task performance on two 3D locomotion tasks (‘quadruped-walk’ and ‘quadruped-run’) from the DeepMind Control Suite. We find that data augmentation, rather than contrastive learning or an auto-encoder, plays an important role in improving sample efficiency and task performance in image-based training.