Entity Attributes Extraction Based on Text Simplification


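Cite this article:	WU Cheng, WANG Chaokun, WANG Muxian. Entity Attributes Extraction Based on Text Simplification[J]. Computer Engineering and Applications, 2020, 56(21): 115-122.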

Author names:	WU Cheng (吴呈), WANG Chaokun (王朝坤), WANG Muxian (王沐贤)

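Author affiliations:	1. School of Software, Tsinghua University, Beijing 100084, China; 2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China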

Funding:	National Natural Science Foundation of China; National Key Research and Development Program of China

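Abstract (translated from the Chinese):	This paper studies entity attribute extraction from unstructured Chinese text. Text simplification is introduced as a preprocessing step for extraction, addressing the poor performance of traditional information extraction methods caused by long, complex sentences and the diversity of natural language expression. Text simplification is modeled as a sequence-to-sequence (seq2seq) translation process and implemented with the seq2seq-RNN model from the machine translation field. To improve simplification quality, the model is optimized at several levels: using pre-trained word vectors, collecting a common vocabulary list, introducing part-of-speech tagging, and designing a simplification scoring function. These optimizations let the model concentrate on learning the syntactic transformations involved in simplification. For the simplified text, a method based on concise rules is designed to extract information tuples and entity attributes. Experiments show that the improvements to seq2seq-RNN yield better text simplification, and that more information is extracted from the simplified text than from the original text, with fairly accurate results.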
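Keywords (translated from the Chinese):	text simplification; information extraction; entity attributes; natural language processing; neural network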
Entity Attributes Extraction Based on Text Simplification

WU Cheng, WANG Chaokun, WANG Muxian. Entity Attributes Extraction Based on Text Simplification[J]. Computer Engineering and Applications, 2020, 56(21): 115-122.

Authors:	WU Cheng, WANG Chaokun, WANG Muxian

Affiliation:	1. School of Software, Tsinghua University, Beijing 100084, China; 2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

Abstract:	This paper studies a method for extracting entity attributes from unstructured Chinese text. Text simplification (TS) is introduced as a preprocessing step for extraction, to address the problem that traditional information extraction methods perform poorly on long, difficult sentences and on the diversity of natural language expressions. TS is modeled as a sequence-to-sequence (seq2seq) procedure and implemented with the seq2seq-RNN model from the machine translation field. To improve the model, several strategies are introduced, including pre-trained word vectors, a common vocabulary, POS tagging, and a simplification scoring function, so that the model focuses more on syntactic transformation during TS. For the simplified text, a simple rule-based method is used to extract information tuples, and entity attributes are then extracted from those tuples. The experimental results show that the improvements to seq2seq-RNN achieve better performance on text simplification, and that more information is extracted from the simplified text than from the original text, while the extracted information is also more accurate.

Keywords:	text simplification; information extraction; entity attributes; natural language processing; neural network
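
The abstract describes a two-stage pipeline: sentences are first simplified, then a simple rule-based method extracts information tuples, from which entity attributes are collected. The second stage can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the rule list `RULES`, the attribute names, and the helpers `extract_tuples` and `group_attributes` are hypothetical and are not the rules or code from the paper. The sketch also assumes its input has already been simplified, and it uses English sentences for readability although the paper targets Chinese text.

```python
import re
from collections import defaultdict

# Hypothetical rules (not from the paper): each pairs a surface pattern that
# a *simplified* sentence might follow with an illustrative attribute name.
RULES = [
    (re.compile(r"^(?P<entity>.+?) was born in (?P<value>.+?)\.?$"), "birthplace"),
    (re.compile(r"^(?P<entity>.+?) is located in (?P<value>.+?)\.?$"), "location"),
    (re.compile(r"^(?P<entity>.+?) was founded in (?P<value>.+?)\.?$"), "founded"),
]


def extract_tuples(sentences):
    """Return (entity, attribute, value) tuples for sentences matching a rule."""
    tuples = []
    for sentence in sentences:
        text = sentence.strip()
        for pattern, attribute in RULES:
            match = pattern.match(text)
            if match:
                tuples.append(
                    (match.group("entity"), attribute, match.group("value"))
                )
                break  # the first matching rule wins
    return tuples


def group_attributes(tuples):
    """Collect the extracted tuples into one attribute dictionary per entity."""
    entities = defaultdict(dict)
    for entity, attribute, value in tuples:
        entities[entity][attribute] = value
    return dict(entities)


# Assumed output of the simplification stage: short, single-clause sentences.
simplified = [
    "Tsinghua University is located in Beijing.",
    "Tsinghua University was founded in 1911.",
    "The campus is large.",  # matches no rule, so it yields no tuple
]
attributes = group_attributes(extract_tuples(simplified))
print(attributes)
```

The sketch shows why simplification helps a rule-based extractor: once each sentence carries a single fact in a predictable shape, a handful of short patterns is enough, whereas the same patterns would miss facts buried in long, multi-clause sentences.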
This article is indexed in Wanfang Data and other databases.