In this work, we present a method that learns to model arbitrary dynamic 3D scenes purely from 2D visual observations. Our approach uses a keypoint-conditioned Neural Radiance Field (KP-NeRF) to model these scenes, with the overarching goal of supporting image-based robot manipulation. Unlike previous methods, which typically condition the model on generic embedding vectors, our implicit neural radiance function is conditioned on a set of keypoints inferred by a learned encoder from image observations. This implicitly factors the visual model into object appearance and object pose configuration. This inductive bias, built into the architecture, encourages the discovered keypoints to capture state transitions in the robot’s environment across time and space. We then learn an action-conditioned forward prediction model over the keypoint representation space and perform model predictive control (MPC) for challenging manipulation tasks, including block pushing and door closing. We evaluate our method on novel scene view synthesis, action-conditioned forward prediction, and robot manipulation.
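To make the conditioning scheme concrete, the following is a minimal PyTorch sketch of a keypoint-conditioned radiance field. It assumes a simple MLP field, a standard NeRF-style sinusoidal positional encoding, K = 8 keypoints, and conditioning by concatenating the flattened keypoints to the encoded query point; the layer sizes and the view-independent color head are illustrative assumptions, not the exact architecture of this work.

```python
import torch
import torch.nn as nn

class KeypointConditionedNeRF(nn.Module):
    """Sketch of an implicit radiance field F(x, k) -> (rgb, sigma)
    conditioned on K inferred 3D keypoints (hypothetical layout)."""

    def __init__(self, num_keypoints=8, pos_freqs=10, hidden=256):
        super().__init__()
        self.pos_freqs = pos_freqs
        # Encoded 3D point (3 * 2 * F) plus flattened keypoints (3 * K).
        in_dim = 3 * 2 * pos_freqs + 3 * num_keypoints
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)  # volume density
        self.rgb_head = nn.Sequential(          # view-independent color, for brevity
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def positional_encoding(self, x):
        # Standard NeRF-style sinusoidal encoding of 3D points.
        freqs = 2.0 ** torch.arange(self.pos_freqs, device=x.device)
        angles = x[..., None] * freqs                            # (N, 3, F)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1)    # (N, 3, 2F)
        return enc.flatten(start_dim=-2)                         # (N, 3*2*F)

    def forward(self, points, keypoints):
        # points:    (N, 3) query locations sampled along camera rays
        # keypoints: (K, 3) scene state inferred by the image encoder
        kp = keypoints.flatten().expand(points.shape[0], -1)
        h = self.trunk(torch.cat([self.positional_encoding(points), kp], dim=-1))
        return self.rgb_head(h), self.sigma_head(h)
```

Because appearance is carried by the field weights while scene state enters only through the keypoints, moving an object is expressed by changing the keypoint inputs rather than the network itself, which is the appearance/pose separation described above.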
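The planning stage can be sketched in the same spirit. The text specifies a learned forward model over keypoints and MPC, but not the particular dynamics parameterization or planner; the residual update k_{t+1} = k_t + f(k_t, a_t), the action dimension, and the random-shooting planner below are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class KeypointDynamics(nn.Module):
    """Sketch of an action-conditioned forward model over keypoints,
    using an assumed residual parameterization k_{t+1} = k_t + f(k_t, a_t)."""

    def __init__(self, num_keypoints=8, action_dim=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * num_keypoints + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 * num_keypoints),
        )

    def forward(self, kp, action):
        # kp: (B, 3K) flattened keypoints; action: (B, A)
        return kp + self.net(torch.cat([kp, action], dim=-1))


def mpc_random_shooting(model, kp0, goal_kp, horizon=5, num_samples=256, action_dim=4):
    """Random-shooting MPC (an assumed planner): sample action sequences,
    roll them out through the keypoint dynamics, and return the first
    action of the sequence ending closest to the goal keypoints."""
    actions = torch.randn(num_samples, horizon, action_dim)
    kp = kp0.expand(num_samples, -1)  # kp0: (1, 3K)
    with torch.no_grad():
        for t in range(horizon):
            kp = model(kp, actions[:, t])
        costs = ((kp - goal_kp) ** 2).sum(dim=-1)
    return actions[costs.argmin(), 0]
```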