The voice assistants benefit from the ability to localize users, especially, we can analyze the user’s habits from the historical trajectory to provide better services. However, current voice localization method requires the user to actively issue voice commands, which makes voice assistants unable to track silent users most of the time. This paper presents WSTrack, a Wi-Fi and Sound fusion tracking system for device-free human. In particular, current voice assistants naturally support both Wi-Fi and acoustic functions. Accordingly, we are able to build up the multi-modal prototype with just voice assistants and Wi-Fi routers. To track the movement of silent users, our insights are as follows: (1) the voice assistants can hear the sound of the user’s pace, and then extract which direction the user is in; (2) we can extract the user’s velocity from the Wi-Fi signal. By fusing multi-modal information, we are able to track users with a single voice assistant and Wi-Fi router. Our implementation and evaluation on commodity devices demonstrate that WSTrack achieves better performance than current systems, where the median tracking error is 0.37m.