
Multimodal Feature Translating Embedding Based Scene Graph Generation
Cite this article: ZHANG Ruonan, AN Gaoyun. Multimodal Feature Translating Embedding Based Scene Graph Generation [J]. Journal of Signal Processing, 2023, 39(1): 51-60. DOI: 10.16798/j.issn.1003-0530.2023.01.006
Authors: ZHANG Ruonan, AN Gaoyun
Affiliation: 1. Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China; 2. Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China
Funding: the National Key Research and Development Program of China (2019YFB2204200); the National Natural Science Foundation of China (62006015)
Abstract: Scene graph generation is an active research direction in computer vision that bridges low-level and high-level visual tasks. A scene graph is composed of triplets of the form <subject-predicate-object>, and the model must encode the global visual information of the whole image to assist scene understanding. However, current models still struggle with special visual relations such as one-to-many, many-to-one, and symmetric relations. Exploiting the similarity between knowledge graphs and scene graphs, we migrate translating embedding models from the knowledge-graph field to scene graph generation. To better encode such visual relations, this paper proposes a scene graph generation framework based on multimodal feature translating embedding: the extracted multimodal features (e.g., visual and semantic features) are remapped, and the remapped features are then used to predict predicate categories, building better relation representations without significantly increasing model complexity. The framework encapsulates and complements almost all existing translating embedding models for scene graph generation; four such models (TransE, TransH, TransR, TransD) are applied to the scene graph generation task, and the kinds of visual relations each model suits are elaborated. The framework also extends the traditional mode of application: besides serving as a standalone model, it is designed as a plug-and-play sub-module that can be inserted into other network models. Experiments on Visual Genome, a large-scale semantic understanding dataset, fully validate the effectiveness of the proposed framework; moreover, the richer predicate category distribution obtained demonstrates that the proposed method helps mitigate the long-tail bias in the dataset.
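For reference, the four translating-embedding score functions named in the abstract are standard in the knowledge-graph literature; the summary below is background we add here, not reproduced from the paper. Writing h, r, t for the head, relation, and tail embeddings (which the framework would instantiate with subject, predicate, and object features):

\begin{aligned}
\text{TransE:}\ & f(h,r,t) = -\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert \\
\text{TransH:}\ & f(h,r,t) = -\lVert \mathbf{h}_{\perp} + \mathbf{d}_r - \mathbf{t}_{\perp} \rVert, \quad \mathbf{h}_{\perp} = \mathbf{h} - \mathbf{w}_r^{\top}\mathbf{h}\,\mathbf{w}_r \ \ (\mathbf{t}_{\perp}\ \text{analogous}) \\
\text{TransR:}\ & f(h,r,t) = -\lVert \mathbf{M}_r\mathbf{h} + \mathbf{r} - \mathbf{M}_r\mathbf{t} \rVert \\
\text{TransD:}\ & f(h,r,t) = -\lVert \mathbf{M}_{rh}\mathbf{h} + \mathbf{r} - \mathbf{M}_{rt}\mathbf{t} \rVert, \quad \mathbf{M}_{rh} = \mathbf{r}_p\mathbf{h}_p^{\top} + \mathbf{I}, \ \mathbf{M}_{rt} = \mathbf{r}_p\mathbf{t}_p^{\top} + \mathbf{I}
\end{aligned}

In the knowledge-graph literature, TransE is known to fit one-to-one relations well, while the relation-specific projections of TransH, TransR, and TransD were introduced precisely to handle one-to-many, many-to-one, and symmetric relations, which is consistent with the abstract's pairing of model choice with visual relation type.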

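A minimal PyTorch-style sketch of the plug-and-play idea described in the abstract. All names here (TransEPredicateHead, the dimensions) are illustrative assumptions, not the authors' released code; it shows how a TransE-style head could remap subject and object features and score predicate classes.

import torch
import torch.nn as nn

class TransEPredicateHead(nn.Module):
    """Illustrative TransE-style predicate classifier (hypothetical sketch).

    Remaps subject/object features into a shared embedding space and scores
    each predicate r by -||proj(s) + r - proj(o)||, i.e. subject plus
    predicate should translate to object, following the TransE intuition.
    """

    def __init__(self, feat_dim: int, embed_dim: int, num_predicates: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)          # remap multimodal features
        self.rel = nn.Embedding(num_predicates, embed_dim)  # one translation vector per predicate

    def forward(self, subj_feat: torch.Tensor, obj_feat: torch.Tensor) -> torch.Tensor:
        s = self.proj(subj_feat)                            # (B, D)
        o = self.proj(obj_feat)                             # (B, D)
        # Residual s + r - o for every predicate class: (B, R, D)
        diff = s.unsqueeze(1) + self.rel.weight.unsqueeze(0) - o.unsqueeze(1)
        return -diff.norm(p=2, dim=-1)                      # (B, R); higher = better fit

# Usage: score 50 predicate classes for a batch of 4 subject-object feature pairs.
head = TransEPredicateHead(feat_dim=2048, embed_dim=256, num_predicates=50)
scores = head(torch.randn(4, 2048), torch.randn(4, 2048))   # shape (4, 50)

Because such a head only consumes per-pair features and returns predicate scores, it could replace or augment the predicate classifier of an existing scene graph model, matching the plug-and-play usage the abstract describes.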
Keywords: scene graph generation; knowledge graph; translating embedding model; image semantics; scene understanding
Received: 2022-10-19
