This study focuses on environments that combine surround audio with visual information. We aim to enhance the practicality of highly realistic media content by investigating how temporal mismatch between audio and visual signals affects their interaction. In the left-right localization experiments, we presented temporally mismatched audio and visual signals and investigated the relationship between the delay and the sense of oddity. The results showed that the presentation is perceived as more natural when the sound lags behind the image. In addition, the optimal delay depended on the visual distance to the object. In the front-back localization experiments, we found that localization accuracy degrades when the listener is not familiar with the sound. We also found that visual localization interferes with auditory localization, and that synchrony between audio and visual stimuli degrades localization accuracy. These findings provide guidelines for creating highly realistic content.