There is a need to improve the synthesis quality of HiFi-GAN-based real-time neural speech waveform generative models on CPUs while preserving the controllability of the fundamental frequency ($f_{\mathrm{o}}$) and speech rate (SR). To this end, we propose Harmonic-Net and Harmonic-Net+, which introduce two extensions into the HiFi-GAN generator. The first is a downsampling network, named the excitation signal network, that hierarchically receives multi-channel excitation signals corresponding to $f_{\mathrm{o}}$. The second is the layerwise pitch-dependent dilated convolutional network (LW-PDCNN), which flexibly changes its receptive field depending on the input $f_{\mathrm{o}}$, allowing the upsampling-based HiFi-GAN generator to handle large fluctuations in $f_{\mathrm{o}}$. The proposed explicit input of $f_{\mathrm{o}}$-dependent excitation signals and the LW-PDCNNs are expected to realize high-quality synthesis under the normal, $f_{\mathrm{o}}$-conversion, and SR-conversion conditions. Experiments on unseen-speaker synthesis, full-band singing voice synthesis, and text-to-speech synthesis show that the proposed method with harmonic waves corresponding to $f_{\mathrm{o}}$ achieves higher synthesis quality than conventional methods under all (i.e., normal, $f_{\mathrm{o}}$-conversion, and SR-conversion) conditions.