Advanced deep-learning techniques can generate natural-sounding synthetic voices that closely resemble a particular person's voice. However, the misuse of such technologies raises serious concerns. Researchers have therefore focused on detecting these malicious synthetic voices, known as "deepfake speech." Although many feature-extraction and classification methods have been proposed, deepfake detection accuracy remains unreliable. In addition, most current features are computed in the frequency domain. To this end, we conducted experiments to investigate the contribution of two acoustic features to the detection of deepfake speech signals. These acoustic features are timbre and shimmer, which represent auditory perception in the time domain. We show that eight timbre components and four shimmer components contribute significantly to discriminating deepfake speech from genuine speech. We also propose a method for detecting deepfake speech based on these timbre and shimmer features. The method was evaluated on a dataset from the Audio Deep Synthesis Detection Challenge (ADD 2022). The results suggest that combining these eight timbre components and four shimmer components with a simple multilayer perceptron classifier can detect deepfake speech effectively.
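
The pipeline described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the feature values are random placeholders standing in for the 8 timbre and 4 shimmer components (which in practice would be extracted from audio with an acoustic-analysis tool), and the MLP hyperparameters are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 utterances, each described by a
# 12-dimensional vector (8 timbre + 4 shimmer components).
# Real features would come from acoustic analysis of the waveforms.
n_genuine, n_fake = 100, 100
X_genuine = rng.normal(loc=0.0, scale=1.0, size=(n_genuine, 12))
X_fake = rng.normal(loc=1.0, scale=1.0, size=(n_fake, 12))
X = np.vstack([X_genuine, X_fake])
y = np.array([0] * n_genuine + [1] * n_fake)  # 0 = genuine, 1 = deepfake

# A simple multilayer perceptron classifier, as the abstract describes.
# Hidden-layer size and iteration count are assumed, not from the paper.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
)
clf.fit(X, y)
print(f"training accuracy: {clf.score(X, y):.2f}")
```

On real data, evaluation would of course use a held-out test split (e.g., the ADD 2022 evaluation set) rather than training accuracy.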