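Academic Abstract Clustering Method Based on LDA Model and Doc2vec (基于LDA模型和Doc2vec的学术摘要聚类方法)
Cite this article: ZHANG Weiwei, HU Yaqi, ZHAI Guangyu, LIU Zhipeng. Academic Abstract Clustering Method Based on LDA Model and Doc2vec[J]. Computer Engineering and Applications (计算机工程与应用), 2020, 56(6): 180-185.
Authors: ZHANG Weiwei (张卫卫)  HU Yaqi (胡亚琦)  ZHAI Guangyu (翟广宇)  LIU Zhipeng (刘志鹏)
Affiliations: 1. School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070  2. School of Economics and Management, Lanzhou University of Technology, Lanzhou 730050
Funding: China Postdoctoral Science Foundation; National Natural Science Foundation of China; Major Project of Philosophy and Social Sciences Research, Ministry of Education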
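Abstract: Clustering short texts for specific tasks has become an important problem in text data mining. Because academic abstracts are sparse, clustering them suffers from low accuracy and a semantic gap; their narrow domain also leads to a large number of inconsequential word overlaps, which makes it hard to distinguish topics and fine-grained clusters. To address this, a new clustering model is proposed, the topic sentence vector model (Doc2vec-LDA, Doc-LDA). By fusing the LDA (Latent Dirichlet Allocation) topic model with the sentence vector model (Doc2vec), the model uses the information of the whole corpus during training and also uses the local semantic space information of Paragraph Vector to refine the latent semantic information of LDA. The experiments use abstracts crawled from CNKI as the data set and apply the K-Means clustering algorithm to compare the clustering quality of each model on the abstract texts. The results show that clustering based on the Doc-LDA model outperforms the LDA, Word2vec, and LDA+Word2vec models.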

Keywords: short text clustering  LDA model  Doc2vec model  academic abstract

Academic Abstract Clustering Method Based on LDA Model and Doc2vec
ZHANG Weiwei, HU Yaqi, ZHAI Guangyu, LIU Zhipeng. Academic Abstract Clustering Method Based on LDA Model and Doc2vec[J]. Computer Engineering and Applications, 2020, 56(6): 180-185.
Authors:ZHANG Weiwei  HU Yaqi  ZHAI Guangyu  LIU Zhipeng
Affiliation: 1. School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China  2. School of Economics and Management, Lanzhou University of Technology, Lanzhou 730050, China
Abstract: Short text clustering for specific topics has become an important task in text data mining. Because academic abstracts are sparse, clustering them yields unstable results and suffers from a semantic gap. Their narrow domain also produces a large number of inconsequential word overlaps, which makes it hard to distinguish topics and fine-grained clusters. In view of this, this paper proposes a novel clustering model called the Topic Paragraph Vector model (Doc2vec-LDA, Doc-LDA). By merging the LDA (Latent Dirichlet Allocation) topic model with the Paragraph Vector model (Doc2vec), the model uses the information of the entire corpus during training and also uses the local semantic space information of Paragraph Vector to improve the implicit semantic information of LDA. Academic abstracts crawled from CNKI serve as the experimental data set, and the K-Means clustering algorithm is used to compare the clustering results obtained from each model's representation of the abstracts. The experimental results show that the Doc-LDA model clusters better than the LDA, Word2vec, and LDA+Word2vec models.
Keywords: short text clustering  Latent Dirichlet Allocation (LDA) model  Doc2vec model  academic abstract
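
The abstract does not say how the LDA and Doc2vec representations are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each abstract already has a topic distribution (for example from a trained LDA model) and a paragraph vector (for example from a trained Doc2vec model), fuses the two by concatenating their L2-normalized forms, and clusters the fused vectors with a plain K-Means loop. The toy vectors, the `fuse` helper, and the farthest-first initialization are all hypothetical choices made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(topic_vec, doc_vec):
    """ASSUMPTION: fusion is concatenation of the two normalized vectors.
    The abstract only says the models are 'merged'; this is one simple option."""
    return l2_normalize(topic_vec) + l2_normalize(doc_vec)

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_centers(points, k):
    """Deterministic initialization: start from the first point, then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers

def kmeans(points, k, max_iters=50):
    """Plain K-Means (Lloyd's algorithm); returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy inputs standing in for real model outputs:
# per-abstract topic distributions (3 topics) and paragraph vectors (4 dims).
topic_vecs = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
doc_vecs = [[0.9, 0.1, -0.2, 0.0], [0.8, 0.2, -0.1, 0.1],
            [-0.7, 0.3, 0.9, 0.1], [-0.8, 0.2, 0.8, 0.0]]

fused = [fuse(t, d) for t, d in zip(topic_vecs, doc_vecs)]
labels = kmeans(fused, k=2)
print(labels)
```

Normalizing each part before concatenation keeps either representation from dominating the Euclidean distance purely because of its scale; a weighted combination would be an equally plausible alternative.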
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).