Multi-source text topic mining model based on Dirichlet multinomial allocation model


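Cite this article:	XU Liyang, HUANG Ruizhang, CHEN Yanping, QIAN Zhisen, LI Wanying. Multi-source text topic mining model based on Dirichlet multinomial allocation model[J]. Journal of Computer Applications, 2018, 38(11): 3094-3099. DOI: 10.11772/j.issn.1001-9081.2018041359

Authors:	XU Liyang, HUANG Ruizhang, CHEN Yanping, QIAN Zhisen, LI Wanying

Affiliations:	1. College of Computer Science and Technology, Guizhou University, Guiyang 550025; 2. Guizhou Provincial Key Laboratory of Public Big Data (Guizhou University), Guiyang 550025; 3. State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210093

Funding:	National Natural Science Foundation of China (61462011); Major Research Plan of the National Natural Science Foundation of China (91746116); Major Applied Basic Research Project of Guizhou Province (黔科合JZ字[2014]2001); Science and Technology Major Special Project of Guizhou Province (黔科合重大专项字[2017]3002); Natural Science Foundation of Guizhou Province (黔科合基础[2018]1035).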

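Abstract:	As the sources of text data grow more diverse, topic mining over multi-source text data has become a research focus in text mining. Traditional topic models mainly target single-source text data, so applying them directly to multi-source text data is subject to many limitations. To address this problem, a multi-source text topic mining model based on the Dirichlet Multinomial Allocation (DMA) model, the Multi-Source Dirichlet Multinomial Allocation (MSDMA) model, is proposed. By accounting for how the word distribution of a topic differs across data sources, and by exploiting the nonparametric clustering property of the DMA model, the model addresses three problems: 1) it learns the source-specific word distribution of the same topic in each data source; 2) by sharing the topic space and the term space across data sources, it lets the sources complement one another's topic knowledge, which improves topic discovery on noisy, low-information sources; 3) it learns the number of topics within each data source automatically, with no need to specify the number in advance. Experiments on simulated and real datasets show that the proposed model mines topic information from multi-source data more effectively than traditional topic models.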
Keywords:	multi-source text data; topic model; Gibbs sampling; Dirichlet multinomial allocation model; text mining
Received:	2018-05-29
Revised:	2018-06-15
Multi-source text topic mining model based on Dirichlet multinomial allocation model

XU Liyang, HUANG Ruizhang, CHEN Yanping, QIAN Zhisen, LI Wanying. Multi-source text topic mining model based on Dirichlet multinomial allocation model[J]. Journal of Computer Applications, 2018, 38(11): 3094-3099. DOI: 10.11772/j.issn.1001-9081.2018041359

Authors:	XU Liyang, HUANG Ruizhang, CHEN Yanping, QIAN Zhisen, LI Wanying

Affiliation:	1. College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China; 2. Guizhou Provincial Key Laboratory of Public Big Data (Guizhou University), Guiyang, Guizhou 550025, China; 3. State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing, Jiangsu 210093, China

Abstract:	With the rapid growth in the number of text data sources, topic mining for multi-source text data has become a research focus of text mining. Because traditional topic models are mainly designed for single-source data, applying them directly to multi-source data has many limitations. Therefore, a multi-source topic model based on the Dirichlet Multinomial Allocation (DMA) model, named MSDMA (Multi-Source Dirichlet Multinomial Allocation), was proposed; it considers the differences between sources in the word distribution of a topic and exploits the nonparametric clustering property of DMA. The main contributions of the proposed model are as follows: 1) it takes the characteristics of each source into account when modeling topics, and can learn source-specific word distributions for each topic k; 2) it can improve topic discovery for high-noise, low-information sources through knowledge sharing between sources; 3) it can automatically learn the number of topics within each source, without requiring the number to be specified in advance. Experimental results on a simulated dataset and two real datasets indicate that the proposed model extracts topic information more effectively and efficiently than state-of-the-art topic models.
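
The abstract does not give the sampling equations, so the following is only a minimal sketch of the kind of clustering a DMA-style model performs, not the authors' MSDMA. It assumes a single source, a finite mixture with a symmetric Dirichlet prior `alpha / K` on topic proportions, one topic per document, and a collapsed Gibbs sampler (the keywords mention blocked Gibbs sampling; collapsed Gibbs is swapped in here for brevity). The function name `gibbs_dmm` and all parameters are hypothetical.

```python
import math
import random
from collections import Counter


def gibbs_dmm(docs, vocab_size, K=10, alpha=1.0, beta=0.1, iters=50, seed=0):
    """Collapsed Gibbs sampling for a finite Dirichlet multinomial mixture.

    docs: list of documents, each a list of integer word ids.
    Returns one topic index per document. K is an upper bound on the
    number of topics; topics that end up with no documents are unused.
    """
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in docs]      # random initial assignment
    m = [0] * K                               # documents per topic
    n = [0] * K                               # tokens per topic
    nw = [Counter() for _ in range(K)]        # word counts per topic
    for d, k in zip(docs, z):
        m[k] += 1
        n[k] += len(d)
        nw[k].update(d)

    for _ in range(iters):
        for i, d in enumerate(docs):
            # Remove document i from its current topic.
            k_old = z[i]
            m[k_old] -= 1
            n[k_old] -= len(d)
            nw[k_old].subtract(d)

            # Log-probability of each topic given all other assignments.
            counts = Counter(d)
            logp = []
            for k in range(K):
                lp = math.log(m[k] + alpha / K)
                for w, c in counts.items():
                    for j in range(c):
                        lp += math.log(nw[k][w] + beta + j)
                for j in range(len(d)):
                    lp -= math.log(n[k] + vocab_size * beta + j)
                logp.append(lp)

            # Sample a new topic from the normalized weights.
            mx = max(logp)
            weights = [math.exp(lp - mx) for lp in logp]
            k_new = rng.choices(range(K), weights=weights)[0]
            z[i] = k_new
            m[k_new] += 1
            n[k_new] += len(d)
            nw[k_new].update(d)
    return z
```

Counting the distinct values in the returned list gives the number of topics actually in use, which is how a truncated mixture of this kind can settle on fewer than `K` topics. Extending the sketch to several sources with shared topics and source-specific word distributions would require the model details from the full paper.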

Keywords:	multi-source text data; topic model; blocked-Gibbs sampling; Dirichlet Multinomial Allocation (DMA); text mining
