We propose a real-time network for joint semantic segmentation and object detection on edge devices. The architecture builds on the latest YOLO-series network and adds lightweight segmentation sub-networks for multi-task learning. Specifically, we use layers two through four of the YOLO network, which carry substantial semantic information at different resolutions, to segment objects of diverse sizes. We introduce the Parallel Aggregation Pyramid Pooling Module (PAPPM), which efficiently generates buffered semantic segmentation feature maps using single-point addition and residual learning. This design reduces computational complexity and memory usage without compromising accuracy. We also propose a Progressively Iterative Learning (PIL) approach that learns the weights of the backbone, neck, and multi-task heads in turn, without catastrophic forgetting. The resulting network achieves state-of-the-art performance on benchmark datasets, demonstrating the effectiveness of the proposed techniques.
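
To make the idea of stage-wise training concrete, the following is a minimal sketch of one way a progressive schedule could freeze and unfreeze parameter groups. It is not the PIL procedure itself, whose details the abstract does not give. The group names (`det_head`, `seg_head`), the three-stage schedule in `STAGES`, and the scalar "weights" are all hypothetical and chosen only for illustration.

```python
# Hypothetical parameter groups, loosely following the abstract's
# "backbone, neck, and multi-task heads". Real groups hold tensors;
# a single float per group is used here to keep the sketch tiny.
GROUPS = ("backbone", "neck", "det_head", "seg_head")

# Assumed schedule: each stage lists the groups that receive updates.
# This ordering is an illustration, not the schedule used by PIL.
STAGES = [
    {"backbone", "neck", "det_head"},  # stage 0: learn detection first
    {"seg_head"},                      # stage 1: fit segmentation on frozen shared features
    {"neck", "det_head", "seg_head"},  # stage 2: refine neck and heads, backbone frozen
]


def trainable_mask(stage):
    """Return {group: bool} marking which groups are updated in this stage."""
    active = STAGES[stage]
    return {group: group in active for group in GROUPS}


def sgd_step(params, grads, mask, lr=0.5):
    """Apply one plain SGD update, leaving frozen groups untouched."""
    return {
        group: (weight - lr * grads[group]) if mask[group] else weight
        for group, weight in params.items()
    }


if __name__ == "__main__":
    params = {group: 1.0 for group in GROUPS}
    grads = {group: 1.0 for group in GROUPS}
    for stage in range(len(STAGES)):
        params = sgd_step(params, grads, trainable_mask(stage))
        print(stage, params)
```

Under this assumed schedule, weights that earlier tasks rely on stay fixed while a new head is fitted, which is one common way to limit catastrophic forgetting. In a deep-learning framework the same mask would typically be applied by toggling whether each parameter receives gradients.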