Vision-based robotic grasping is a fundamental task in robotic control. Dexterous and precise grasp control of a robotic arm is challenging, yet it is a critical technique for manufacturing and the emerging robot service industry. Current state-of-the-art methods adopt RGB-D images or point clouds in an attempt to obtain an accurate, robust, and real-time policy. However, most of these methods either use only single-modal data or ignore the uncertainty of the sampled data, especially the depth information. Even when they leverage multi-modal data, they seldom fuse features at different scales. All of these shortcomings inevitably lead to unreliable grasp predictions. In this paper, we propose a novel multi-modal neural network to predict grasps in real time. The key idea is to fuse RGB and depth information hierarchically and to quantify the uncertainty of the raw depth data in order to re-weight the depth features. For higher grasping performance, a background extraction module and a depth re-estimation module are used to reduce the influence of the incompleteness and low quality of the raw data. We evaluate the performance on the Cornell Grasp Dataset and provide a series of extensive experiments to demonstrate the advantages of our method on a real robot. The results indicate the superiority of our proposed method, which outperforms state-of-the-art methods significantly on all metrics.