Aiming at the problems of forgetting and insufficient utilization of image text information in image captioning methods, a Scene Graph-aware Cross-modal interaction Network (SGC-Net) was proposed. Firstly, a scene graph was used as the visual feature of the image, and a Graph Convolutional Network (GCN) was used for feature fusion, so that the visual and textual features of the image lay in the same feature space. Secondly, the text sequences generated by the model were stored, and the corresponding position information was added to them as the textual features of the image, thereby solving the problem of text feature loss caused by the single-layer Long Short-Term Memory (LSTM) network. Finally, a self-attention mechanism was used to extract the important image information and text information and fuse them, thereby addressing the over-reliance on image information and the insufficient use of text information. Experimental results on the Flickr30K and MS-COCO (MicroSoft Common Objects in COntext) datasets show that, compared with Sub-GC, SGC-Net improves BLEU1 (BiLingual Evaluation Understudy with 1-gram), BLEU4 (BiLingual Evaluation Understudy with 4-grams), METEOR (Metric for Evaluation of Translation with Explicit ORdering), ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and SPICE (Semantic Propositional Image Caption Evaluation) by 1.1, 0.9, 0.3, 0.7 and 0.4 on Flickr30K, and by 0.3, 0.1, 0.3, 0.5 and 0.6 on MS-COCO, respectively. These results indicate that SGC-Net can effectively improve image captioning performance and the fluency of the generated descriptions.
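The three stages summarized above (GCN fusion over the scene graph, position-encoded textual features, and self-attention fusion of the two modalities) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the feature dimension, the random scene-graph adjacency, the single-head attention with Q = K = V, and the sinusoidal positional encoding are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # feature dimension (illustrative assumption)

def gcn_layer(A, X, W):
    # One GCN layer: ReLU(D^-1/2 (A+I) D^-1/2 X W),
    # i.e. symmetric-normalized adjacency with self-loops.
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(deg ** -0.5)
    return np.maximum(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ W, 0.0)

def positional_encoding(n, d):
    # Sinusoidal position information for the generated token sequence.
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(X):
    # Single-head scaled dot-product self-attention (Q = K = V = X),
    # with a numerically stable row-wise softmax.
    scores = X @ X.T / np.sqrt(X.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ X

# Scene graph with 5 nodes (objects/relations) and a random
# symmetric adjacency; node features play the role of visual features.
A = (rng.random((5, 5)) > 0.5).astype(float)
A = np.maximum(A, A.T)
X_vis = rng.standard_normal((5, d))
W = rng.standard_normal((d, d)) * 0.1
vis_feats = gcn_layer(A, X_vis, W)

# Previously generated caption tokens (3 decoding steps) stored as
# textual features, with position information added.
X_txt = rng.standard_normal((3, d)) + positional_encoding(3, d)

# Cross-modal fusion: self-attention over the concatenated
# visual and textual features.
fused = self_attention(np.concatenate([vis_feats, X_txt], axis=0))
print(fused.shape)  # one fused vector per visual node and text token
```

The sketch only shows the data flow; in the actual model the attention and GCN weights would be learned jointly with the LSTM decoder rather than drawn at random.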