
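Term and Semantic Difference Metric Based Document Clustering Algorithm
Cite this article: WEI Lin-jing, LIAN Zhi-chao, WANG Lian-guo, HOU Zhen-xing. Term and Semantic Difference Metric Based Document Clustering Algorithm[J]. Computer Science, 2016, 43(12): 229-233, 259.
Authors: WEI Lin-jing  LIAN Zhi-chao  WANG Lian-guo  HOU Zhen-xing
Affiliations: School of Information Science and Technology, Gansu Agriculture University, Lanzhou 730070; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094; School of Information Science and Technology, Gansu Agriculture University, Lanzhou 730070; School of Information Management, Nanjing University, Nanjing 210093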
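Funding: Supported by the National Natural Science Foundation of China (034031122, 61063028), the Natural Science Foundation of Jiangsu Province Youth Fund (BK20150784), and the China Postdoctoral Science Foundation General Program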
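Abstract: Most existing text clustering algorithms rely on generic similarity measures and ignore semantic content. To address this, a text clustering algorithm based on maximizing the discrimination information of documents is proposed. First, the discrimination information that each term provides for its own cluster and for the other clusters is analyzed separately, and the data set is transformed from the input space into the difference-score matrix space. Next, a greedy algorithm is designed to filter out the low-scoring terms in each row of the matrix. Finally, maximum likelihood estimation is used to smooth the document difference information. Simulation results show that the proposed method achieves better document clustering quality than other hierarchical and flat clustering algorithms, and that it has good interpretability and convergence.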

Keywords: Document clustering  Semantic analysis  Greedy algorithm  Convergence  Interpretability
Received: 2016-03-03
Revised: 2016-03-23

Term and Semantic Difference Metric Based Document Clustering Algorithm
WEI Lin-jing, LIAN Zhi-chao, WANG Lian-guo and HOU Zhen-xing. Term and Semantic Difference Metric Based Document Clustering Algorithm[J]. Computer Science, 2016, 43(12): 229-233, 259.
Authors: WEI Lin-jing  LIAN Zhi-chao  WANG Lian-guo  HOU Zhen-xing
Affiliation: School of Information Science and Technology, Gansu Agriculture University, Lanzhou 730070, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; School of Information Science and Technology, Gansu Agriculture University, Lanzhou 730070, China; School of Information Management, Nanjing University, Nanjing 210093, China
Abstract: Most existing document clustering algorithms rely on common similarity measures and ignore semantics. To address this, a document clustering algorithm based on maximizing the sum of the discrimination information provided by documents is proposed. First, the discrimination information that each term provides for its own cluster and for the other clusters is analyzed separately, and the data set is transformed from the input space to the difference-score matrix space. Then, a greedy algorithm is designed to filter the low-scoring terms out of each row of the matrix. Finally, maximum likelihood estimation is used to smooth the document difference information. Simulation results show that the proposed method achieves better cluster quality than flat and hierarchical clustering algorithms, and offers good interpretability and convergence.
Keywords:Document clustering  Semantic analysis  Greedy algorithm  Convergence  Interpretability
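
The abstract gives only an outline of the method, so the following Python sketch is one possible reading of its three steps, not the authors' implementation. Everything specific in it is an assumption made for illustration: the score `p_in * log(p_in / p_out)`, the additive smoothing constant `alpha` applied to the maximum likelihood estimates, the top-`m` filter standing in for the greedy step, and all function names.

```python
import math
from collections import Counter


def score_matrix(docs, labels, k, alpha=1.0):
    """Return one {term: score} row per cluster (assumed scoring rule)."""
    vocab = sorted({t for d in docs for t in d})
    counts = [Counter() for _ in range(k)]
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    total = Counter()
    for cnt in counts:
        total.update(cnt)
    n_all = sum(total.values())
    rows = []
    for c in range(k):
        n_in = sum(counts[c].values())
        n_out = n_all - n_in
        row = {}
        for t in vocab:
            # Additively smoothed maximum likelihood estimates of P(term | cluster)
            # and P(term | all other clusters); alpha is a hypothetical parameter.
            p_in = (counts[c][t] + alpha) / (n_in + alpha * len(vocab))
            p_out = (total[t] - counts[c][t] + alpha) / (n_out + alpha * len(vocab))
            # Assumed discrimination score: positive when the term favors cluster c.
            row[t] = p_in * math.log(p_in / p_out)
        rows.append(row)
    return rows


def keep_top_terms(row, m):
    """Greedy filter (assumed form): keep the m highest-scoring terms of a row."""
    best = sorted(row, key=row.get, reverse=True)[:m]
    return {t: row[t] for t in best}


def cluster(docs, labels, k, m=2, max_iter=20):
    """Reassign documents to the best-scoring cluster until labels stop changing."""
    labels = list(labels)
    for _ in range(max_iter):
        rows = [keep_top_terms(r, m) for r in score_matrix(docs, labels, k)]
        new = [
            max(range(k), key=lambda c: sum(rows[c].get(t, 0.0) for t in d))
            for d in docs
        ]
        if new == labels:
            break
        labels = new
    return labels


docs = [["cat", "dog"]] * 3 + [["stock", "bond"]] * 3
result = cluster(docs, [0, 0, 1, 1, 1, 0], k=2)
```

In this toy run the two deliberately misplaced documents move to the cluster whose retained terms they share, and the labels stop changing on the next pass. The retained terms in each row also act as a readable description of the cluster, which is one way to interpret the interpretability claim in the abstract.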