
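A Cross-Lingual Distant Supervision Method for Uyghur and Chinese
Cite this article: YANG Zhenyu, WANG Lei, MA Bo, YANG Yating, DONG Rui, Azmat Anwar, WANG Zhen. A Cross-Lingual Distant Supervision Method for Uyghur and Chinese[J]. Computer Engineering, 2023, 49(2): 271-278.
Authors: YANG Zhenyu  WANG Lei  MA Bo  YANG Yating  DONG Rui  Azmat Anwar  WANG Zhen
Affiliations: 1. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
Funding: National Natural Science Foundation of China, Special Program for Training Local Young Talents (U2003303); National Key Research and Development Program of China (2018YFC0823002); Tianshan Innovation Project of the Xinjiang Uygur Autonomous Region (2020D14045); Outstanding Young Scientific and Technological Talents Project of the "Tianshan Youth" Program (2019Q031); Youth Innovation Promotion Association of the Chinese Academy of Sciences (科发人函字[2019]26号); Western Young Scholars Program of the Chinese Academy of Sciences, Category B (2019-XBQNXZ-B-008).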
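Abstract: Distant supervision is an important corpus expansion technique in relation extraction that can quickly generate pseudo-labeled data from a small amount of annotated data. Traditional distant supervision methods, however, are mainly applied to monolingual text, so low-resource languages such as Uyghur cannot use them to obtain pseudo-labeled corpora. To address this, a cross-lingual distant supervision method for Uyghur and Chinese is proposed, which uses existing Chinese corpora to automatically expand a Uyghur corpus when no Uyghur corpus is available. Distant supervision is treated as a text semantic similarity problem rather than simple text lookup: whether a Uyghur-Chinese sentence pair contains the same relation is judged at two levels, entity semantics and sentence semantics, and if it does, the existing Chinese annotation is transferred to the Uyghur sentence, so that the Uyghur corpus is expanded automatically from scratch. In addition, to effectively capture the context and hidden semantic information of entities, an interactive matching method with a gating mechanism is proposed, in which gate units control information transfer between the encoding layer and the attention layer. A total of 3,500 Uyghur sentences and 600 Chinese sentences were manually labeled to simulate the distant supervision process and verify model performance. Experimental results show that the method achieves an F1 score of 73.05% and that a pseudo-labeled relation extraction dataset containing 97,949 Uyghur sentences was successfully constructed.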

Keywords: relation extraction  semantic similarity  semantic encoding  distant supervision  cross-lingual
Received: 2022-02-24
Revised: 2022-03-28
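
The label-transfer idea in the abstract (copy a Chinese relation label to a Uyghur sentence only when both the entity-level and the sentence-level semantics match) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' model: the vectors stand in for the output of some cross-lingual encoder, cosine similarity replaces the learned matching network, and the names `LabeledExample`, `transfer_label`, and the `threshold` value are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabeledExample:
    """A labeled Chinese sentence, already encoded (the encoder is assumed)."""
    entity_vec: List[float]
    sentence_vec: List[float]
    relation: str


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def transfer_label(entity_vec: List[float],
                   sentence_vec: List[float],
                   chinese_examples: List[LabeledExample],
                   threshold: float = 0.8) -> Optional[str]:
    """Return the relation of the best-matching Chinese example, or None.

    A pair only counts as a match when BOTH similarity levels clear the
    threshold, so the pair score is the weaker of the two similarities.
    """
    best_relation, best_score = None, threshold
    for ex in chinese_examples:
        score = min(cosine(entity_vec, ex.entity_vec),
                    cosine(sentence_vec, ex.sentence_vec))
        if score >= best_score:
            best_relation, best_score = ex.relation, score
    return best_relation


# Toy 2-dimensional "embeddings"; real encoders produce far larger vectors.
examples = [
    LabeledExample([1.0, 0.0], [1.0, 0.0], "born_in"),
    LabeledExample([0.0, 1.0], [0.0, 1.0], "works_for"),
]
pseudo_label = transfer_label([0.9, 0.1], [1.0, 0.05], examples)
no_label = transfer_label([1.0, 1.0], [1.0, -1.0], examples)
```

Taking the minimum of the two scores is one simple way to require agreement at both levels; a Uyghur sentence that matches no Chinese example above the threshold stays unlabeled rather than receiving a noisy label.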

A Cross-Lingual Distant Supervision Method for Uyghur and Chinese
YANG Zhenyu, WANG Lei, MA Bo, YANG Yating, DONG Rui, Azmat Anwar, WANG Zhen. A Cross-Lingual Distant Supervision Method for Uyghur and Chinese[J]. Computer Engineering, 2023, 49(2): 271-278.
Authors:YANG Zhenyu  WANG Lei  MA Bo  YANG Yating  DONG Rui  Azmat Anwar  WANG Zhen
Affiliation:1. The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China;2. University of Chinese Academy of Sciences, Beijing 100049, China;3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
Abstract: Distant supervision is an important corpus expansion technique in relation extraction. It can quickly generate a pseudo-labeled corpus from a small amount of annotated data. However, traditional distant supervision is mainly applied to monolingual text, so low-resource languages such as Uyghur cannot use it to obtain pseudo-labeled corpora. To address this problem, this paper proposes a cross-lingual distant supervision method for Uyghur and Chinese that uses an existing Chinese corpus to automatically expand a Uyghur corpus when no Uyghur corpus is available. The method treats distant supervision as a sentence semantic similarity problem rather than simple text lookup, and judges whether a Uyghur-Chinese sentence pair expresses the same relation at two levels: entity semantics and sentence semantics. If the relations are the same, the existing Chinese label is transferred to the Uyghur sentence, so the Uyghur corpus is built automatically from scratch. In addition, to capture the context and hidden semantic information of entities, the paper proposes an interactive matching method with a gate mechanism, in which gate units control the flow of information between the encoding layer and the attention layer. To simulate the distant supervision process and verify the model's performance, the authors manually labeled 3,500 Uyghur sentences and 600 Chinese sentences. Experimental results show that the method reaches an F1 score of 73.05% and that a pseudo-labeled relation extraction dataset containing 97,949 Uyghur sentences was successfully constructed.
Keywords:relation extraction  semantic similarity  semantic encoding  distant supervision  cross-lingual  
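
The abstract states only that gate units control the information passed between the encoding layer and the attention layer; it does not give the formula. As a rough illustration, the sketch below uses a common highway-style gate, which is an assumption and not necessarily the authors' formulation: a sigmoid gate `g` is computed per dimension and the output is `g * encoded + (1 - g) * attended`. The function name `gated_fusion` and the per-dimension weights are hypothetical; a trained model would learn full weight matrices.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(encoded: List[float],
                 attended: List[float],
                 w_enc: List[float],
                 w_att: List[float],
                 bias: List[float]) -> List[float]:
    """Blend encoder and attention outputs element-wise with a sigmoid gate.

    For each dimension: g = sigmoid(w_enc*h + w_att*a + bias), then
    output = g*h + (1 - g)*a. A gate near 1 keeps the encoder value,
    a gate near 0 keeps the attention value.
    """
    sizes = {len(encoded), len(attended), len(w_enc), len(w_att), len(bias)}
    if len(sizes) != 1:
        raise ValueError("all inputs must have the same length")
    fused = []
    for h, a, wh, wa, b in zip(encoded, attended, w_enc, w_att, bias):
        g = sigmoid(wh * h + wa * a + b)
        fused.append(g * h + (1.0 - g) * a)
    return fused


# With zero weights and zero bias the gate is exactly 0.5: a plain average.
averaged = gated_fusion([1.0, 2.0], [3.0, 4.0],
                        [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])

# A large positive bias saturates the gate, so the encoder output passes through.
encoder_only = gated_fusion([1.0, 2.0], [3.0, 4.0],
                            [0.0, 0.0], [0.0, 0.0], [100.0, 100.0])
```

The two calls show the extremes the gate interpolates between: an even mix of both layers, and passing one layer through almost unchanged.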