The ideal goal of artificial intelligence (AI) is to provide a natural interaction environment between humans and machines by mimicking human acoustic and visual perception abilities. The most significant elements for achieving natural human-robot interaction (HRI) are an interface for spoken dialogue and awareness of the surrounding scene. From the standpoint of machine intelligence, such acoustic and visual information can be treated as multimedia data comprising speech, acoustic scenes, visual scenes, and visual motion. This dissertation therefore aims to enable robust HRI by processing multimedia data with deep architecture-based techniques. Such data can be analyzed through local descriptors and modeled into various scenes by deep learning and statistical learning methods. In addition, an optimized information fusion technique is proposed to combine the acoustic/visual metadata according to diverse environmental characteristics. Accordingly, three main components are presented in this dissertation.

The first component is an environmentally robust spoken dialogue interface (SDI) based on speech enhancement methods. In noisy environments, the quality of speech is usually degraded, which hinders the performance of acoustic and speech perception. This dissertation tackles such environments with a speech separation method. The conventional approach [1] is based on an auditory perceptual beamformer that clarifies the speech by suppressing background noise before it is transferred to the SDI module, whereas the proposed single-channel approach [2, 3] adopts an acoustic source separation method. This dissertation thus proposes a reconstructive NMF (RNMF) algorithm that reduces the complexity of conventional speech enhancement methods (a simplified NMF separation sketch is given below).

The second component, multimedia scene understanding (MSU), enables an HRI device to understand the surrounding scene both acoustically and visually. To cope with multimedia inputs consisting of heterogeneous data, the proposed methods [4, 5, 6] comprise acoustic scene analysis (ASA), automatic speech recognition (ASR), visual motion analysis (VMA), visual scene analysis (VSA), and character information analysis (CIA). Based on the extracted metadata, scene model generation (SMG) and scene understanding (SU) are carried out to establish the scene model and to interpret the scene, respectively. For ASA, the i-vector paradigm is applied to extract robust acoustic metadata: using mel-frequency cepstral coefficient (MFCC) features, a Gaussian mixture model-universal background model (GMM-UBM) is trained to generate the acoustic scene model (see the GMM-UBM sketch below). For VMA, dense trajectories (DT), which consist of HOG, HOF, MBHx, MBHy, and TRAJ descriptors, are employed; the individual DT sub-features extract interest points in a frame-by-frame manner and are encoded with Fisher vectors (FV) for dimensionality reduction (see the FV sketch below). For VSA, deep convolutional neural network (DCNN)-based metadata is adopted; three different DCNN structures are employed that explore both global and more detailed features from the feature maps of deep layers (see the feature extraction sketch below).

The third component addresses metadata learning and fusion. Because the local descriptors yield different metadata types, which is one of the significant factors for a robust scene modeling process, the proposed metadata learning method adopts a hybrid approach that combines statistical learning and deep learning. In addition, a fusion process that dynamically combines these metadata is a necessary part of MSU, since different scenes require different combinations of metadata according to their environmental characteristics (a weighted fusion sketch is given below).
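To make the speech separation component concrete, the following is a minimal sketch of standard supervised NMF separation, not the RNMF update rules themselves, which are given in the body of the dissertation. It assumes pretrained speech and noise basis matrices (the names W_speech and W_noise are hypothetical) and applies a Wiener-style mask built from the speech part of the reconstruction:

```python
import numpy as np

def nmf_separate(V, W_speech, W_noise, n_iter=100, eps=1e-10):
    """Supervised NMF separation of a magnitude spectrogram V (freq x frames).

    Stand-in for the proposed RNMF algorithm: the bases are pretrained and
    held fixed, and only the activations H are updated.
    """
    W = np.hstack([W_speech, W_noise])               # fixed joint basis
    H = np.abs(np.random.rand(W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        # Multiplicative activation update (KL-divergence NMF)
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
    k = W_speech.shape[1]
    # Wiener-style mask from the speech components of the reconstruction
    mask = (W_speech @ H[:k]) / (W @ H + eps)
    return mask * V                                   # enhanced speech magnitude
```

Keeping the bases fixed and updating only the activations is what makes the single-channel setting tractable; the enhanced magnitude is then recombined with the noisy phase before being passed to the SDI module.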
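The ASA stage can be illustrated with MFCC features and a GMM-UBM. The sketch below covers only this stage (not the subsequent i-vector extraction) and uses warm-starting from the UBM parameters as a simple stand-in for MAP adaptation; the file lists and test clip name are placeholders for an actual scene corpus:

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(path, n_mfcc=20):
    """Frame-level MFCC features, shaped (frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Train a UBM on pooled background audio, then one GMM per scene class.
ubm_feats = np.vstack([extract_mfcc(f) for f in background_files])
ubm = GaussianMixture(n_components=64, covariance_type='diag').fit(ubm_feats)

scene_models = {}
for scene, files in scene_files.items():
    feats = np.vstack([extract_mfcc(f) for f in files])
    gmm = GaussianMixture(n_components=64, covariance_type='diag',
                          weights_init=ubm.weights_, means_init=ubm.means_,
                          precisions_init=ubm.precisions_)
    scene_models[scene] = gmm.fit(feats)

# Classify a test clip by average log-likelihood ratio against the UBM
test = extract_mfcc("test_clip.wav")
scores = {s: m.score(test) - ubm.score(test) for s, m in scene_models.items()}
print(max(scores, key=scores.get))
```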
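The FV encoding of the DT sub-features follows the standard improved Fisher vector formulation (gradients with respect to the Gaussian means and variances, with power and L2 normalization). A minimal sketch, assuming a diagonal-covariance GMM such as the one fitted above:

```python
import numpy as np

def fisher_vector(X, gmm):
    """Improved Fisher vector of local descriptors X (n x d) under a
    diagonal-covariance GMM, yielding a fixed 2*K*d-dimensional encoding."""
    gamma = gmm.predict_proba(X)                    # (n, K) posteriors
    w, mu = gmm.weights_, gmm.means_                # (K,), (K, d)
    sigma = np.sqrt(gmm.covariances_)               # (K, d) std devs
    n = X.shape[0]
    fv = []
    for k in range(len(w)):
        diff = (X - mu[k]) / sigma[k]               # normalized residuals
        g_mu = gamma[:, k, None] * diff             # gradient wrt mean
        g_sig = gamma[:, k, None] * (diff**2 - 1)   # gradient wrt variance
        fv.append(g_mu.sum(0) / (n * np.sqrt(w[k])))
        fv.append(g_sig.sum(0) / (n * np.sqrt(2 * w[k])))
    fv = np.concatenate(fv)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))          # power normalization
    return fv / (np.linalg.norm(fv) + 1e-10)        # L2 normalization
```

Applied per sub-feature (HOG, HOF, MBHx, MBHy, TRAJ), this turns a variable-length set of frame-level descriptors into one fixed-length vector per video.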
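For VSA, the three DCNN structures used in the dissertation are not reproduced in this summary; a standard pretrained ResNet-50 stands in below to illustrate extracting a global scene descriptor from a deep feature map:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Illustrative feature extractor: ImageNet-pretrained ResNet-50 with the
# final classifier removed, leaving the pooled penultimate feature map.
net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
extractor = torch.nn.Sequential(*list(net.children())[:-1])
extractor.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def vsa_metadata(image_path):
    """2048-d global scene descriptor from the penultimate layer."""
    x = preprocess(Image.open(image_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        return extractor(x).flatten(1).squeeze(0)   # shape (2048,)
```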
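Finally, the dynamic fusion of acoustic and visual metadata can be pictured as environment-dependent, score-level weighting. The dissertation's optimized fusion rule is not reproduced here; the following sketch assumes hypothetical per-modality class scores and reliability weights (for instance, down-weighting ASA in loud noise):

```python
import numpy as np

def dynamic_fusion(scores, reliabilities):
    """Weighted late fusion: combine per-modality class-score vectors
    using environment-dependent reliability weights."""
    total = sum(reliabilities.values())
    fused = sum((reliabilities[m] / total) * np.asarray(s)
                for m, s in scores.items())
    return int(np.argmax(fused))                    # index of fused best scene

# Hypothetical usage: three modalities scoring four candidate scenes
scores = {'ASA': [0.1, 0.5, 0.2, 0.2],
          'VSA': [0.6, 0.1, 0.2, 0.1],
          'VMA': [0.3, 0.2, 0.4, 0.1]}
weights = {'ASA': 0.2, 'VSA': 0.5, 'VMA': 0.3}      # e.g. a noisy acoustic scene
print(dynamic_fusion(scores, weights))
```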
Experiments were conducted for each method using appropriate evaluation databases and environments. The results confirmed that the proposed methods outperformed the conventional approaches.