Recent advances in representation learning and video prediction have shown that accurately anticipating future states can improve manipulation and control strategies across a range of applications. However, the complex dynamics of real-world data make such representations difficult to learn. Autoregressive models, which feed each generated frame back as input for the next prediction, suffer from compounding errors, memory overload, and long training times, since the state must be reconstructed from the latent vector at every step. To address these limitations, recent studies have introduced State Space Models (SSMs) that forecast directly in the latent space, enabling prediction of distant future states. However, these methods remain limited in their ability to extract object-centric representations. More recent object-centric approaches focus on features closely tied to the input data, yet their capacity to capture higher-level representations remains constrained. In this paper, we propose integrating a perceptual network into the slot attention mechanism to extract and separate high-level representations. Leveraging a pre-trained perceptual network, we derive higher-level object-centric representations at each perceptual layer and align them with the corresponding slots. These representations, rich in object-centric information, can improve understanding of the current state and provide valuable guidance for accurate prediction of future states.
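The slot-attention step described above can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the "perceptual layers" are stand-in random feature maps of assumed widths, projected to a shared slot dimension and pooled into one token set before slot attention assigns them to slots.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    """Minimal slot attention: iterative competitive attention over input tokens."""
    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        # Learned Gaussian for slot initialization.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_log_sigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        b, _, d = inputs.shape
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_log_sigma.exp() * torch.randn(b, self.num_slots, d)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: slots compete to explain each token.
            attn = F.softmax(torch.einsum('bsd,bnd->bsn', q, k) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean over tokens
            updates = torch.einsum('bsn,bnd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, d),
                             slots.reshape(-1, d)).reshape(b, self.num_slots, d)
        return slots

# Stand-in for multi-layer perceptual features (hypothetical shapes): two
# layers of different channel widths, each projected to the slot dimension
# and concatenated along the token axis.
slot_dim, num_slots = 32, 4
proj_a, proj_b = nn.Linear(64, slot_dim), nn.Linear(128, slot_dim)
feats_a = torch.randn(2, 16, 64)   # shallow perceptual layer: 16 tokens, 64-d
feats_b = torch.randn(2, 8, 128)   # deeper perceptual layer: 8 tokens, 128-d
tokens = torch.cat([proj_a(feats_a), proj_b(feats_b)], dim=1)  # (2, 24, 32)
slots = SlotAttention(num_slots, slot_dim)(tokens)             # (2, 4, 32)
```

In an actual model, `feats_a` and `feats_b` would come from frozen layers of the pre-trained perceptual network, so each slot can bind to features at more than one level of abstraction.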