Deep learning (DL) has recently been proposed as a novel approach for 21cm foreground removal. Before applying DL to real observations, it is essential to assess its consistency with established methods, its performance across various simulation models and its robustness against instrumental systematics. This study implements a commonly used U-Net architecture and evaluates its performance for post-reionisation foreground removal across three distinct sky simulation models, based on pure Gaussian realisations, Lagrangian perturbation theory, and the Planck sky model. The network performs stably across the models, provided that the training and testing data are drawn from the same model. On average, the residual foreground in the U-Net reconstructed data is $\sim$10% of the signal across angular scales over the considered redshift range. Comparable results are found with traditional approaches. However, blindly applying a network trained on one model to data from another model yields inaccurate reconstructions, emphasising the need for consistent training data. The study then introduces frequency-dependent Gaussian beams and gain drifts to the test data. Without prior information, the network struggles to denoise data affected by these "unexpected" systematics. However, after re-training on data contaminated with the same systematics, the network recovers its reconstruction accuracy. This highlights the importance of incorporating prior knowledge of systematics during training for successful denoising. Our work provides critical guidelines for using DL for 21cm foreground removal, tailored to specific data attributes. Notably, this is the first time that DL has been applied to the Planck sky model, which provides the most realistic foregrounds currently available.
Comment: 19 pages, 13 figures, submitted to MNRAS
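As an illustration of the two systematics named in the abstract, the following is a minimal sketch of how a frequency-dependent Gaussian beam and a gain drift might be applied to a simulated data cube. It is not the pipeline used in this work. Everything here is an assumption made for illustration: the flat-sky patch with periodic edges, the beam width scaling as the inverse of frequency, the gain drift modelled as an independent random multiplicative factor per frequency channel, and all function names and default values.

```python
import numpy as np

def gaussian_beam_fwhm(freqs_mhz, fwhm_ref_deg=1.0, freq_ref_mhz=1000.0):
    """Beam FWHM per channel, assuming FWHM scales as 1/frequency (hypothetical defaults)."""
    return fwhm_ref_deg * freq_ref_mhz / np.asarray(freqs_mhz, dtype=float)

def smooth_flat_sky(image, fwhm_deg, pixel_deg):
    """Convolve a 2D flat-sky patch with a Gaussian beam via FFT (periodic edges)."""
    sigma_pix = fwhm_deg / pixel_deg / np.sqrt(8.0 * np.log(2.0))
    ky = np.fft.fftfreq(image.shape[0])  # spatial frequency in cycles per pixel
    kx = np.fft.fftfreq(image.shape[1])
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2
    # Fourier transform of a unit-area Gaussian with standard deviation sigma_pix
    kernel = np.exp(-2.0 * np.pi ** 2 * sigma_pix ** 2 * k2)
    return np.fft.ifft2(np.fft.fft2(image) * kernel).real

def add_systematics(cube, freqs_mhz, pixel_deg, gain_rms=0.01, rng=None):
    """Apply a per-channel beam and an assumed random multiplicative gain to a cube.

    cube has shape (n_freq, n_y, n_x); returns the contaminated cube and the gains.
    """
    rng = np.random.default_rng(rng)
    fwhm = gaussian_beam_fwhm(freqs_mhz)
    gains = 1.0 + gain_rms * rng.standard_normal(len(freqs_mhz))
    out = np.empty_like(cube, dtype=float)
    for i, (f, g) in enumerate(zip(fwhm, gains)):
        out[i] = g * smooth_flat_sky(cube[i], f, pixel_deg)
    return out, gains
```

In the spirit of the abstract, such a function could be applied to the test set alone to mimic "unexpected" systematics, or to both training and test sets to mimic re-training with prior knowledge of the systematics.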