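A Chinese Short-Text Keyword Extraction Model Based on the Attention Mechanism (一种基于注意力机制的中文短文本关键词提取模型)
Cite this article: 杨丹浩,吴岳辛,范春晓.一种基于注意力机制的中文短文本关键词提取模型[J].计算机科学,2020,47(1):193-198.
Authors: 杨丹浩 (YANG Dan-hao)  吴岳辛 (WU Yue-xin)  范春晓 (FAN Chun-xiao)
Affiliation: School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 100089 (all three authors)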
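Abstract: Keyword extraction is an active research topic in natural language processing. Among current keyword extraction algorithms, deep learning methods rarely take the characteristics of Chinese into account and make insufficient use of character-level information, so keyword extraction from Chinese short texts still has considerable room for improvement. To improve keyword extraction from short texts, and targeting the task of automatically extracting keywords from paper abstracts, this paper proposes a sequence-tagging keyword extraction model that combines a bidirectional long short-term memory network (Bidirectional Long Short-Term Memory, BiLSTM) with an attention mechanism; the model is named BAST (Bidirectional Long Short-term Memory and Attention Mechanism Based on Sequence Tagging). First, word vectors at word granularity and character vectors at character granularity are each used to represent the input text. Next, the BAST model is trained: BiLSTM and the attention mechanism extract text features, and the label of each word is predicted by classification. Finally, the character-vector model is used to correct the keyword extraction results of the word-vector model. Experiments on 8159 paper abstracts show that the BAST model reaches an F1 score of 66.93%, which is 2.08% higher than the BiLSTM-CRF (Bidirectional Long Short-Term Memory and Conditional Random Field) algorithm and also better than other traditional keyword extraction algorithms. The novelty of the model lies in combining the extraction results of the character-vector and word-vector models, which makes full use of the features of Chinese text and extracts keywords from short texts effectively.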

Keywords: Attention mechanism  Word vector  Character vector  Keyword extraction  LSTM
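
The abstract frames keyword extraction as sequence tagging and reports results as an F1 score. The following is a minimal sketch of those two ideas only, not the authors' implementation: it assumes a B/I/O tag set (the abstract does not name the tag scheme), and the function names `decode_keyphrases` and `f1_score`, the example tokens, and the gold keywords are all hypothetical.

```python
from typing import List, Set


def decode_keyphrases(tokens: List[str], tags: List[str]) -> List[str]:
    """Join each run of B/I-tagged tokens into one keyphrase.

    Assumes a BIO scheme: "B" opens a phrase, "I" continues it,
    "O" marks a token outside any keyphrase.
    """
    phrases: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:  # a new phrase starts right after another one
                phrases.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:  # "O", or a stray "I" with no open phrase
            if current:
                phrases.append("".join(current))
            current = []
    if current:
        phrases.append("".join(current))
    return phrases


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    """Harmonic mean of precision and recall over exact keyphrase matches."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical word-segmented input with one predicted tag per word.
tokens = ["基于", "注意力", "机制", "的", "关键词", "抽取", "模型"]
tags = ["O", "B", "I", "O", "B", "I", "O"]
predicted = decode_keyphrases(tokens, tags)
print(predicted)  # ['注意力机制', '关键词抽取']

gold = {"注意力机制", "关键词抽取", "LSTM"}  # made-up gold keywords
print(round(f1_score(set(predicted), gold), 2))  # 0.8
```

Here both predicted phrases are correct (precision 1.0) but one of the three gold keywords is missed (recall 2/3), giving F1 = 0.8. Tokens are joined without spaces because the example text is Chinese.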

Chinese Short Text Keyphrase Extraction Model Based on Attention
YANG Dan-hao,WU Yue-xin,FAN Chun-xiao.Chinese Short Text Keyphrase Extraction Model Based on Attention[J].Computer Science,2020,47(1):193-198.
Authors:YANG Dan-hao  WU Yue-xin  FAN Chun-xiao
Affiliation:(School of Electronic Engineering,Beijing University of Posts and Telecommunications,Beijing 100089,China)
Abstract:Keyphrase extraction is an active research topic in natural language processing. Among current keyphrase extraction algorithms, deep learning methods seldom take the characteristics of Chinese into account and do not fully use character-level information, so keyphrase extraction from Chinese short texts still has considerable room for improvement. To improve keyphrase extraction for short texts, a model for automatic keyphrase extraction from paper abstracts is proposed, namely the BAST model, a sequence tagging model that combines bidirectional long short-term memory (BiLSTM) with an attention mechanism. First, word vectors at word granularity and character vectors at character granularity are used to represent the input text. Second, the BAST model is trained: text features are extracted with BiLSTM and the attention mechanism, and the label of each word is classified. Finally, the character vector model is used to correct the extraction results of the word vector model. Experimental results show that the F1-measure of the BAST model reaches 66.93% on 8159 abstracts, which is 2.08% higher than that of the BiLSTM-CRF (Bidirectional Long Short-Term Memory and Conditional Random Field) algorithm and also improves on other traditional keyphrase extraction algorithms. The innovation of the model lies in combining the extraction results of the word vector and character vector models, which makes full use of the characteristics of Chinese text and extracts keyphrases from short texts effectively.
Keywords:Attention mechanism  Word embedding  Character embedding  Keyphrase extraction  LSTM
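
The abstract says that text features are extracted with BiLSTM and an attention mechanism but does not specify the form of attention. As a rough illustration only, the sketch below applies generic dot-product self-attention to per-token hidden vectors such as a BiLSTM might produce. The scoring function, the function names `softmax`, `dot`, and `attend`, and the toy vectors are assumptions, not the paper's method.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Normalize scores into weights that sum to 1 (max-shifted for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(hidden_states: List[Vector]) -> List[Vector]:
    """Return one context vector per token.

    Each token's hidden state acts as the query; the context is the
    attention-weighted average of all hidden states (dot-product scoring,
    assumed here for illustration).
    """
    contexts: List[Vector] = []
    for query in hidden_states:
        weights = softmax([dot(query, key) for key in hidden_states])
        context = [
            sum(w * state[d] for w, state in zip(weights, hidden_states))
            for d in range(len(query))
        ]
        contexts.append(context)
    return contexts


# Toy stand-ins for BiLSTM outputs: three tokens, two dimensions each.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for context in attend(hidden):
    print([round(value, 3) for value in context])
```

In a tagger of the kind the abstract describes, each context vector would typically be passed, alone or together with the token's own hidden state, to a classifier that predicts the token's tag. That wiring is likewise an assumption here.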
This article is indexed in databases including 维普 (VIP) and 万方数据 (Wanfang Data).