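Entity Attributes Extraction Based on Text Simplification
Cite this article: WU Cheng, WANG Chaokun, WANG Muxian. Entity Attributes Extraction Based on Text Simplification[J]. Computer Engineering and Applications, 2020, 56(21): 115-122.
Authors: WU Cheng, WANG Chaokun, WANG Muxian
Affiliations: 1. School of Software, Tsinghua University, Beijing 100084; 2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001
Funding: National Natural Science Foundation of China; National Key Research and Development Program of China
Abstract: This work studies entity attribute extraction from unstructured Chinese text. Text simplification is introduced as a preprocessing step for extraction, to address the poor results that traditional information extraction methods obtain because of long, complex sentences and the diversity of natural language expression. Text simplification is modeled as a sequence-to-sequence (seq2seq) translation process and implemented with the seq2seq-RNN model from the machine translation field. To improve the simplification quality, the model is optimized at several levels: using pre-trained word vectors, collecting a vocabulary of common words, introducing part-of-speech tagging, and designing a simplification scoring function. These optimizations let the model concentrate on learning the syntactic transformations involved in simplification. For the simplified text, a method based on concise rules is designed to extract information tuples and entity attributes. Experiments show that the improvements to seq2seq-RNN raise the quality of text simplification, that more information is extracted from the simplified text than from the original text, and that the extracted information is also fairly accurate.

Keywords: text simplification; information extraction; entity attributes; natural language processing; neural network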

Entity Attributes Extraction Based on Text Simplification
WU Cheng, WANG Chaokun, WANG Muxian. Entity Attributes Extraction Based on Text Simplification[J]. Computer Engineering and Applications, 2020, 56(21): 115-122.
Authors: WU Cheng, WANG Chaokun, WANG Muxian
Affiliation: 1. School of Software, Tsinghua University, Beijing 100084, China; 2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Abstract: This paper studies entity attribute extraction from unstructured Chinese text. Text simplification (TS) is introduced as a preprocessing step for extraction, addressing the poor performance of traditional information extraction methods on long, complex sentences and on diverse natural language expressions. TS is modeled as a sequence-to-sequence (seq2seq) translation process and implemented with the seq2seq-RNN model from the machine translation field. To improve the model, several strategies are introduced, including pre-trained word vectors, a common-word vocabulary, POS tagging, and a simplification scoring function, so that the model focuses on learning syntactic transformations during TS. For the simplified text, a simple rule-based method extracts information tuples, and entity attributes are then extracted from those tuples. Experimental results show that the improvements to seq2seq-RNN yield better text simplification, and that more information is extracted from the simplified text than from the original text, with higher accuracy.
Keywords: text simplification; information extraction; entity attributes; natural language processing; neural network
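
The abstract does not give the extraction rules themselves, so the following is only a minimal sketch of what rule-based tuple and attribute extraction on already simplified, POS-tagged sentences might look like. The tag names (`"n"` for noun, `"v"` for verb), the function names `extract_tuple` and `collect_attributes`, and the single subject-verb-object rule are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict

# Assumed tag set for this sketch: "n" = noun, "v" = verb. Other tags are ignored.

def extract_tuple(tagged):
    """Apply one illustrative rule to a simplified sentence.

    `tagged` is a list of (word, pos_tag) pairs. The rule takes the first
    verb as the predicate, the first noun before it as the subject, and
    the last noun after it as the object. Returns None if the rule fails.
    """
    verb_idx = next((i for i, (_, tag) in enumerate(tagged) if tag == "v"), None)
    if verb_idx is None:
        return None
    subjects = [word for word, tag in tagged[:verb_idx] if tag == "n"]
    objects = [word for word, tag in tagged[verb_idx + 1:] if tag == "n"]
    if not subjects or not objects:
        return None
    return (subjects[0], tagged[verb_idx][0], objects[-1])


def collect_attributes(sentences):
    """Group extracted tuples by entity: {entity: {attribute: [values]}}."""
    attributes = defaultdict(dict)
    for tagged in sentences:
        triple = extract_tuple(tagged)
        if triple is None:
            continue  # sentence did not match the rule
        entity, attribute, value = triple
        attributes[entity].setdefault(attribute, []).append(value)
    return dict(attributes)


# Hypothetical simplified sentences, already segmented and tagged.
sentences = [
    [("Tsinghua University", "n"), ("is located in", "v"), ("Beijing", "n")],
    [("Tsinghua University", "n"), ("has", "v"), ("School of Software", "n")],
    [("quickly", "d")],  # no verb, so the rule yields nothing
]
result = collect_attributes(sentences)
```

A rule this simple is only plausible because the input is assumed to be short, single-clause sentences; on long or nested sentences it would pick the wrong subject or object, which is the motivation the abstract gives for simplifying the text first.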
This article is indexed in Wanfang Data and other databases.