The Effectiveness Verification and Analysis of Additional Images for Sentence Semantic Understanding and Representation
Cite this article: ZHANG Kun, LV Guang-Yi, WU Le, LIU Qi, CHEN En-Hong. The Effectiveness Verification and Analysis of Additional Images for Sentence Semantic Understanding and Representation[J]. Chinese Journal of Computers, 2021, 44(3): 476-490.
Authors: ZHANG Kun  LV Guang-Yi  WU Le  LIU Qi  CHEN En-Hong
Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601; School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027
Funding: Supported by the Fundamental Research Funds for the Central Universities, the National Science Fund for Distinguished Young Scholars, and the National Natural Science Foundation of China.
Abstract: In recent years, image-text modeling has become an important research direction in natural language processing, and images are often used to enhance the semantic understanding and representation of sentences. However, some researchers have questioned whether image information is actually necessary for sentence semantic understanding, arguing that the text itself already provides a strong prior that allows models to achieve very good results, and that in some cases correct answers can be produced without using images at all. Research on image-text modeling therefore needs to answer one question first: do images help with the understanding and representation of sentence semantics? To this end, this paper selects a typical natural language semantic understanding task that does not involve images, natural language inference (NLI), and introduces image information into this task to verify its effectiveness. Because NLI is a pure natural language task whose data annotation process did not consider image information, choosing this task allows a more objective analysis of the influence of images on sentence semantic understanding and representation. Specifically, this paper proposes a general plug-and-play framework for integrating image information. Based on this framework, five state-of-the-art NLI models are selected, and their performance is compared before and after incorporating image information, as well as under different image processing models and different image settings. Finally, extensive experiments on a large public dataset confirm that images, as additional knowledge, do help the understanding and representation of sentence semantics, and that different image processing models and usage methods affect overall model performance in different ways.
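As a rough illustration of the coarse-grained integration idea described above (not the paper's exact architecture), a global image feature can be projected into the sentence space and concatenated with the sentence representation. All dimensions, the random projection `W`, and the `tanh` nonlinearity here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 300-d sentence encoding and a 2048-d
# global image feature (e.g. the pooled output of a CNN backbone).
d_sent, d_img = 300, 2048
sent_vec = rng.standard_normal(d_sent)   # sentence representation
img_vec = rng.standard_normal(d_img)     # global image feature

# Coarse-grained enhancement: project the image feature into the
# sentence space, then concatenate it with the sentence vector.
W = rng.standard_normal((d_sent, d_img)) * 0.01
img_proj = np.tanh(W @ img_vec)          # (300,)

fused = np.concatenate([sent_vec, img_proj])
print(fused.shape)                       # (600,)
```

The fused vector would then feed the downstream prediction layer; in practice `W` would be learned jointly with the rest of the model rather than fixed.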

Keywords: image-text modeling  sentence semantic understanding and representation  image information  plug-and-play framework  natural language inference

The Effectiveness Verification and Analysis of Additional Images for Sentence Semantic Understanding and Representation
ZHANG Kun, LV Guang-Yi, WU Le, LIU Qi, CHEN En-Hong. The Effectiveness Verification and Analysis of Additional Images for Sentence Semantic Understanding and Representation[J]. Chinese Journal of Computers, 2021, 44(3): 476-490.
Authors:ZHANG Kun  LV Guang-Yi  WU Le  LIU Qi  CHEN En-Hong
Affiliation: (School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601; School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027)
Abstract: Recently, the Visual-to-Language (V2L) problem has attracted increasing attention and has become an important research topic in natural language processing. By utilizing Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and attention mechanisms, researchers have made full use of images and achieved considerable progress on V2L problems, especially in natural language semantic understanding. In fact, images are often treated as important auxiliary information to enhance sentence semantic understanding. However, some researchers have questioned the necessity of using images for such enhancement. They argue that textual information already provides a very strong prior that ensures the good performance of most semantic understanding models, which in some scenarios are even capable of generating correct answers without considering images. Thus, the first crucial problem to be addressed in V2L research is whether image information is really necessary and helpful for sentence semantic understanding and representation. To this end, in this paper we focus on a typical sentence semantic understanding task without images, Natural Language Inference (NLI), which requires an agent to determine the semantic relation between two sentences. We then incorporate images as auxiliary information for the sentence pair to verify their effect. Since NLI is originally a pure natural language task, and images were not used during data annotation or sentence semantic modeling, choosing it for evaluation helps assess the influence of image information on sentence semantic understanding and representation more objectively. To be specific, we first design a general plug-and-play framework for image utilization and integration, which consists of four general layers, i.e., an Input Embedding Layer, a Contextual Encoding Layer, an Interaction Layer, and a Label Prediction Layer, and two plug-and-play layers, i.e., a Fine-Grained Context-Enhanced Layer and a Coarse-Grained Context-Enhanced Layer. Based on this plug-and-play framework, we then reproduce five state-of-the-art NLI models, i.e., the Hierarchical BiLSTM Max Pooling model, the Enhanced Sequential Inference Model, the Multiway Attention Network, the Stochastic Answer Network, and the Generalized Pooling method, within the same deep learning framework. Next, we evaluate their performance with and without images on the large annotated Stanford Natural Language Inference dataset. To better verify the role of images, we also compare the performance of the models with different image processing methods (VGG19 and ResNet50) and different image utilization methods (fine-grained and coarse-grained). Finally, extensive experimental results reveal that images, as external knowledge, are indeed helpful for sentence semantic understanding. Furthermore, we draw two additional conclusions: (1) the fine-grained image utilization method provides more useful information and has a greater influence on the models' sentence semantic understanding and representation; (2) as a more advanced method, ResNet50 extracts important information from images more precisely than VGG19, and thus provides more comprehensive auxiliary information for sentence semantic understanding and representation models.
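To make the fine-grained utilization method concrete, a minimal sketch (under assumed shapes, not the paper's exact layer) is token-level attention over image region features, e.g. the 7x7 = 49 spatial cells of a ResNet50 feature map, with the attended summary concatenated to each token state:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: 10 token states of width 64, and 49 image
# region features (a flattened 7x7 CNN feature map, projected to 64-d).
n_tok, n_reg, d = 10, 49, 64
tokens = rng.standard_normal((n_tok, d))
regions = rng.standard_normal((n_reg, d))

# Fine-grained enhancement: each token attends over image regions
# via dot-product scores and a row-wise softmax.
scores = tokens @ regions.T                               # (10, 49)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)             # rows sum to 1
attended = weights @ regions                              # (10, 64)

# Concatenate the attended image context onto each token state.
enhanced = np.concatenate([tokens, attended], axis=1)
print(enhanced.shape)                                     # (10, 128)
```

The enhanced token states would then pass through the interaction and prediction layers unchanged, which is what makes the layer "plug and play": the surrounding NLI model needs no architectural modification.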
Keywords: visual-to-language  sentence semantic understanding and representation  image information  plug and play framework  natural language inference
This article has been indexed by VIP, Wanfang Data, and other databases.