
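Construction of a Chinese Relation Extraction Dataset Based on Weakly Supervised and Semi-automatic Methods (基于弱监督和半自动方法的中文关系抽取数据集构建)
Cite this article: MA Chaoyi, XU Weiran. Construction of a Chinese Relation Extraction Dataset Based on Weakly Supervised and Semi-automatic Methods [J]. Journal of Chinese Information Processing (中文信息学报), 2017, 31(5): 114-119.
Authors: MA Chaoyi, XU Weiran
Affiliation: School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876
Funding: Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20130005110004)
Abstract: Relation extraction is a fundamental task in information extraction and is of great importance to information retrieval, question answering systems, and knowledge graphs. Existing relation extraction datasets cover too few relation categories, are difficult to annotate at the sentence level, and are hard to extend; moreover, they exist only for English, so they cannot adequately support Chinese relation extraction. This paper uses a weakly supervised, semi-automatic method to construct a Chinese relation extraction dataset that addresses these shortcomings. First, a rich set of relation pairs is extracted with the help of Wikipedia, and sentences containing the entity pairs are extracted from Baidu search results and the Sogou News corpus, which completes the weakly supervised sentence extraction step. The sentences are then scored by an RNN relation extraction system, those with high annotation value are selected and submitted for manual annotation, and the annotation results are post-processed to yield the final Chinese relation extraction dataset.

Keywords: relation extraction; dataset; weakly supervised; semi-automatic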

Semi-automatic Construction of Chinese Relation Extraction Data Set Based on a Weakly Supervised Method
MA Chaoyi, XU Weiran. Semi-automatic Construction of Chinese Relation Extraction Data Set Based on a Weakly Supervised Method [J]. Journal of Chinese Information Processing, 2017, 31(5): 114-119.
Authors: MA Chaoyi, XU Weiran
Affiliation: School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract: Relation extraction is a fundamental task in information extraction, with practical significance for information retrieval, question answering systems, and knowledge graphs. Existing relation extraction data sets are available only for English, cover very few relation categories, are difficult to annotate at the sentence level, and are hard to extend. This paper constructs a Chinese relation extraction data set using a weakly supervised and semi-automatic method. It first extracts a large number of relation pairs from Wikipedia, then extracts sentences that contain the entity pairs from Baidu search results and the Sogou News corpus, which completes the weakly supervised sentence extraction. These sentences are then scored by an RNN-based relation extraction system, and the sentences with the highest annotation value are selected for manual annotation. Finally, the annotation results are processed to produce the Chinese relation extraction data set.
Keywords: relation extraction; data set; weakly supervised; semi-automatic
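
The two automatic stages described in the abstract (weakly supervised sentence extraction, then score-based selection for manual annotation) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `RelationPair`, `Candidate`, `weakly_label`, and `select_for_annotation` are hypothetical, entity matching is plain substring search, and the `score` argument is only a stand-in for the RNN relation extraction system, whose details the abstract does not give.

```python
from typing import Callable, Iterable, List, NamedTuple


class RelationPair(NamedTuple):
    """A (head entity, tail entity, relation) triple, e.g. taken from Wikipedia."""
    head: str
    tail: str
    relation: str


class Candidate(NamedTuple):
    """A sentence weakly labelled with the relation pair it mentions."""
    sentence: str
    pair: RelationPair


def weakly_label(sentences: Iterable[str],
                 pairs: Iterable[RelationPair]) -> List[Candidate]:
    """Keep every sentence that mentions both entities of a known pair.

    Assumption: naive substring matching. This mislabels sentences when one
    entity name is contained in the other, so a real system would need proper
    entity recognition.
    """
    pairs = list(pairs)
    candidates = []
    for sentence in sentences:
        for pair in pairs:
            if pair.head in sentence and pair.tail in sentence:
                candidates.append(Candidate(sentence, pair))
    return candidates


def select_for_annotation(candidates: List[Candidate],
                          score: Callable[[Candidate], float],
                          top_k: int) -> List[Candidate]:
    """Rank candidates by a scoring function and keep the top_k for annotators.

    `score` is a placeholder for the model-based scorer described in the
    abstract; any callable returning a number works here.
    """
    return sorted(candidates, key=score, reverse=True)[:top_k]


if __name__ == "__main__":
    pairs = [RelationPair("鲁迅", "绍兴", "birthplace")]
    sentences = ["鲁迅出生于绍兴。", "鲁迅是著名作家。", "绍兴是鲁迅的故乡。"]
    candidates = weakly_label(sentences, pairs)
    # Toy scorer (shorter sentences rank higher), purely for illustration.
    for c in select_for_annotation(candidates, lambda c: -len(c.sentence), top_k=1):
        print(c.sentence, c.pair.relation)
```

Note that the weak label is only a guess: the third sentence mentions both entities but states "hometown" rather than "birthplace", which is the kind of noise the manual annotation step in the paper is meant to resolve.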