
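Sliding-window Based Topic Modeling
Cite this article: CHANG Dong-ya, YAN Jian-feng, YANG Lu, LIU Xiao-sheng. Sliding-window Based Topic Modeling[J]. Computer Science, 2016, 43(12): 101-107.
Authors: CHANG Dong-ya  YAN Jian-feng  YANG Lu  LIU Xiao-sheng
Affiliations: School of Computer Science and Technology, Soochow University, Suzhou 215006, School of Computer Science and Technology, Soochow University, Suzhou 215006; School of Creative Media, City University of Hong Kong, Hong Kong 999077, School of Computer Science and Technology, Soochow University, Suzhou 215006, School of Computer Science and Technology, Soochow University, Suzhou 215006
Funding: This work was supported by the National Natural Science Foundation of China (61373092, 61572339, 61272449) and the Key Project of the Jiangsu Province Science and Technology Support Program (BE2014005).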
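Abstract: LDA (Latent Dirichlet Allocation) is a hierarchical probabilistic topic model that is now widely used in text mining. The model considers neither the order of documents relative to one another nor the order of words within a single document. This simplifies the problem, but it also leaves room to improve the model. To address this, a sliding-window based topic model is proposed. Its basic idea is that the more closely the topic of a word in a document is related to the topics of several nearby words, the more strongly it is influenced by the topics of those nearby words. According to the window size and the sliding step, each document is cut into finer-grained segments. In addition, an online sliding-window topic model is proposed for large datasets and data streams. Experiments on four datasets show that models trained with the sliding-window based topic model achieve better generalization performance and accuracy on those datasets.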

Keywords: Latent Dirichlet allocation  Topic model  Sliding window
Received: 2015/11/26
Revised: 3/8/2016

Sliding-window Based Topic Modeling
CHANG Dong-ya, YAN Jian-feng, YANG Lu and LIU Xiao-sheng. Sliding-window Based Topic Modeling[J]. Computer Science, 2016, 43(12): 101-107.
Authors: CHANG Dong-ya  YAN Jian-feng  YANG Lu and LIU Xiao-sheng
Affiliation: School of Computer Science and Technology, Soochow University, Suzhou 215006, China, School of Computer Science and Technology, Soochow University, Suzhou 215006, China; School of Creative Media, City University of Hong Kong, Hong Kong 999077, China, School of Computer Science and Technology, Soochow University, Suzhou 215006, China and School of Computer Science and Technology, Soochow University, Suzhou 215006, China
Abstract: LDA (Latent Dirichlet Allocation) is an important hierarchical Bayesian model for probabilistic topic modeling, with many applications in text mining. The model takes into account neither the order of documents nor the order of words within a document, which simplifies the problem but also leaves room for improvement. To address this, a sliding-window based topic model is proposed. Its fundamental idea is that the topic of a word in a document is strongly related to the topics of the words near it and is mainly affected by them. According to the window size and the sliding step, each document is cut into smaller segments. In addition, an online sliding-window topic model is proposed for large datasets and data streams. Experiments on four common datasets show that the sliding-window based topic model achieves better generalization performance and accuracy.
Keywords: Latent Dirichlet allocation  Topic model  Sliding window
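
The abstract states that each document is cut into smaller segments according to a window size and a sliding step, but it does not give the exact procedure. The following is only a minimal sketch of that segmentation step under stated assumptions: the function name `sliding_windows`, the whitespace tokenization, and the handling of a short final segment are all hypothetical and are not taken from the paper. It does not implement the topic model itself.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[str], window: int, step: int) -> List[List[str]]:
    """Cut a token sequence into segments of at most `window` tokens.

    Consecutive segments start `step` tokens apart, so they overlap
    whenever step < window. Assumption (not from the paper): the last
    segment is kept even if it is shorter than `window`.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(tokens) == 0:
        return []

    segments: List[List[str]] = []
    start = 0
    while True:
        segments.append(list(tokens[start:start + window]))
        # Stop once the current window has reached the end of the document.
        if start + window >= len(tokens):
            break
        start += step
    return segments


# Hypothetical toy document, tokenized by whitespace.
doc = "a b c d e f g".split()
segments = sliding_windows(doc, window=3, step=2)
print(segments)  # [['a', 'b', 'c'], ['c', 'd', 'e'], ['e', 'f', 'g']]
```

In this sketch every token is covered as long as `step` does not exceed `window`; a larger step would skip tokens between segments. How the paper's model treats each segment during inference is not described in the abstract and is left out here.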