
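Multilingual Text Clustering Algorithm Based on Parallel Information Bottleneck
Cite this article: YAN Xiaoqiang, LU Yaoen, LOU Zhengzheng, YE Yangdong. Multilingual Text Clustering Algorithm Based on Parallel Information Bottleneck[J]. Pattern Recognition and Artificial Intelligence, 2017, 30(6): 559-568.
Authors: YAN Xiaoqiang, LU Yaoen, LOU Zhengzheng, YE Yangdong
Affiliation: School of Information Engineering, Zhengzhou University, Zhengzhou 450052
Funding: Supported by the National Natural Science Foundation of China (No. 61502434, 61502432, 61170223)
Abstract: When extracting pattern structures from text data, clustering algorithms ignore the potential complementarity among the information in multiple languages, so the resulting pattern structures cannot fully reflect the intrinsic information of the data. To address this problem, this paper proposes a multilingual text clustering algorithm based on the parallel information bottleneck. First, the bag-of-words model is used to construct a relevant variable for each language of the text data. Then, the multiple relevant variables are introduced into the parallel information bottleneck method; by maximally preserving the information between the pattern structure and the multiple relevant variables, the resulting pattern structure reflects the information in all languages of the data. Finally, a draw-and-merge method based on information theory is proposed to optimize the objective function of the algorithm and to guarantee convergence to a local optimum. Experiments show that the algorithm handles the multilingual information in text data effectively and outperforms single-language clustering algorithms as well as two existing classes of clustering algorithms that can handle multilingual text information.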

Keywords: parallel information bottleneck; multilingual; text clustering; information maximization
Received: 2016-09-26

Multilingual Documents Clustering Algorithm Based on Parallel Information Bottleneck
YAN Xiaoqiang, LU Yaoen, LOU Zhengzheng, YE Yangdong. Multilingual Documents Clustering Algorithm Based on Parallel Information Bottleneck[J]. Pattern Recognition and Artificial Intelligence, 2017, 30(6): 559-568.
Authors: YAN Xiaoqiang, LU Yaoen, LOU Zhengzheng, YE Yangdong
Affiliation: School of Information Engineering, Zhengzhou University, Zhengzhou 450052
Abstract: Traditional clustering algorithms ignore the potential complementarity between different languages when discovering the hidden structures in a document collection. As a result, the obtained patterns cannot fully reflect the latent information in the collection. To address this problem, a multilingual document clustering algorithm based on the parallel information bottleneck (ML-IB) is proposed. Firstly, a relevant variable is constructed for each language according to the bag-of-words model. Then, the multiple relevant variables are incorporated into the parallel information bottleneck, and the relevant information between the data patterns and the multiple relevant variables is maximally preserved. Finally, to optimize the objective function of ML-IB, a draw-and-merge method based on information theory is proposed, which guarantees the convergence of ML-IB to a locally optimal solution. Extensive experimental results on multilingual document datasets show that the proposed algorithm significantly outperforms state-of-the-art single-language and multilingual clustering methods.
Keywords: Parallel Information Bottleneck; Multilingual; Document Clustering; Information Maximization
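
The three steps named in the abstract (bag-of-words relevant variables, a multi-variable information objective, and a draw-and-merge optimizer) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact ML-IB objective, so the code assumes an equal-weight sum of the mutual information I(T; Y_i) between the cluster variable T and each language's word variable Y_i, and it recomputes the objective by brute force instead of using an incremental merge cost. The names `mutual_information`, `objective`, and `draw_and_merge` are hypothetical.

```python
import numpy as np


def mutual_information(joint):
    """I(T;Y) in nats for a joint distribution p(t, y) given as a 2-D array."""
    pt = joint.sum(axis=1, keepdims=True)   # marginal p(t), shape (k, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, m)
    mask = joint > 0                        # skip zero cells (0 * log 0 = 0)
    ratio = joint[mask] / (pt @ py)[mask]
    return float((joint[mask] * np.log(ratio)).sum())


def objective(labels, joints, k):
    """Assumed objective: equal-weight sum over languages of I(T; Y_i)."""
    total = 0.0
    for pxy in joints:
        pty = np.zeros((k, pxy.shape[1]))
        np.add.at(pty, labels, pxy)         # sum document rows into cluster rows
        total += mutual_information(pty)
    return total


def draw_and_merge(counts_per_language, k, n_sweeps=10, seed=0):
    """Sequential optimizer: draw one document, merge it into the best cluster.

    counts_per_language: list of (n_docs, vocab_size_i) bag-of-words count
    matrices, one per language, with rows aligned across languages.
    """
    rng = np.random.default_rng(seed)
    n = counts_per_language[0].shape[0]
    # Normalise each count matrix into a joint distribution p(x, y_i).
    joints = [c / c.sum() for c in counts_per_language]
    labels = rng.permutation(n) % k         # balanced random initial partition
    for _ in range(n_sweeps):
        changed = False
        for x in rng.permutation(n):
            old = labels[x]
            best_t, best_val = old, objective(labels, joints, k)
            for t in range(k):              # try merging x into every cluster
                if t == old:
                    continue
                labels[x] = t
                val = objective(labels, joints, k)
                if val > best_val + 1e-12:  # accept only strict improvements
                    best_t, best_val = t, val
            labels[x] = best_t
            changed = changed or best_t != old
        if not changed:                     # no document moved: local optimum
            break
    return labels
```

Because a document is reassigned only when the objective strictly increases, the objective never decreases across sweeps, which is the sense in which a sequential procedure of this kind stops at a local optimum. Calling `draw_and_merge([en_counts, fr_counts], k=2)` with two aligned count matrices (hypothetical variable names) returns one cluster label per document.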