The video-based separation of foreground (FG) and background (BG) has been widely studied due to its vital role in many applications, including intelligent transportation and video surveillance. Most of the existing algorithms are based on traditional computer vision techniques that perform pixel-level processing assuming that FG and BG possess distinct visual characteristics. Recently, state-of-the-art solutions exploit deep learning models targeted originally for image classification. Major drawbacks of such a strategy are the lacking delineation of FG regions due to missing temporal information as they segment the FG based on a single frame object detection strategy. To grapple with this issue, we excogitate a 3D convolutional neural network (3D CNN) with long short-term memory (LSTM) pipelines that harness seminal ideas, viz., fully convolutional networking, 3D transpose convolution, and residual feature flows. Thence, an FG-BG segmenter is implemented in an encoder-decoder fashion and trained on representative FG-BG segments. The model devises a strategy called double encoding and slow decoding, which fuses the learned spatio-temporal cues with appropriate feature maps both in the down-sampling and up-sampling paths for achieving well generalized FG object representation. Finally, from the Sigmoid confidence map generated by the 3D CNN-LSTM model, the FG is identified automatically by using Nobuyuki Otsu’s method and an empirical global threshold. The analysis of experimental results via standard quantitative metrics on 16 benchmark datasets including both indoor and outdoor scenes validates that the proposed 3D CNN-LSTM achieves competitive performance in terms of figure of merit evaluated against prior and state-of-the-art methods. Besides, a failure analysis is conducted on 20 video sequences from the DAVIS 2016 dataset.