
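A Similarity-Based Word Clustering Algorithm
Cite this article: YUAN Li-chi, ZHONG Yi-xin. A Similarity-Based Word Clustering Algorithm [J]. Microelectronics & Computer (微电子学与计算机), 2005, 22(8): 93-95.
Authors: YUAN Li-chi (袁里驰), ZHONG Yi-xin (钟义信)
Affiliation: School of Information Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
Funding: National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program)
Abstract: Class-based statistical language models are an important way to address the data sparsity problem of statistical models. Traditional statistical methods follow a greedy principle and usually take the likelihood function or the perplexity of the corpus as the evaluation criterion. The main drawbacks of traditional clustering methods are that clustering is slow, the initial values strongly affect the result, and the search easily falls into a local optimum. This paper proposes a definition of word similarity, a definition of word-set similarity, and a bottom-up hierarchical clustering algorithm. The method not only improves clustering quality but also allows a different similarity definition to be chosen for each model, thereby improving the effectiveness of the clusters in use.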

Keywords: word similarity; word clustering; statistical language model
Article ID: 1000-7180(2005)08-01
Received: 2005-01-17
Revised: 2005-01-17

Word Clustering Based on Similarity
YUAN Li-chi,ZHONG Yi-xin.Word Clustering Based on Similarity[J].Microelectronics & Computer,2005,22(8):93-95.
Authors:YUAN Li-chi  ZHONG Yi-xin
Abstract: Class-based statistical language models are an important way to address the sparse-data problem. Conventional statistical clustering methods are usually based on a greedy principle, and the common metric for evaluating a clustering algorithm is the likelihood function or the perplexity of the corpus. Conventional clustering algorithms often converge to a local optimum, so a global optimum is not guaranteed, and the initial choices can influence the final result. To address these problems, this paper presents a novel definition of word similarity, derives from it a definition of word-set similarity, and proposes a bottom-up hierarchical clustering algorithm based on similarity. The method not only improves clustering quality but also allows a different similarity definition to be chosen for each cluster-based model, such as predictive clustering, conditional clustering, and combined clustering, thereby improving the effectiveness of the clusters in use. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance.
Keywords:Word similarity  Word clustering  Statistical language model
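
The abstract does not give the paper's actual similarity definitions, so the following is only a minimal sketch of the general shape of a bottom-up, similarity-driven word clustering. It assumes cosine similarity over successor-word counts as the word similarity and the average pairwise word similarity as the word-set similarity; both are stand-ins chosen for illustration, not the authors' definitions. The function names and the toy counts are hypothetical.

```python
from itertools import combinations
from math import sqrt


def word_similarity(u, v):
    """Cosine similarity of two sparse context-count dicts (an assumed stand-in)."""
    dot = sum(count * v.get(word, 0) for word, count in u.items())
    norm_u = sqrt(sum(count * count for count in u.values()))
    norm_v = sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def set_similarity(a, b, contexts):
    """Average pairwise word similarity between two word sets (an assumed stand-in)."""
    total = sum(word_similarity(contexts[x], contexts[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster_words(contexts, num_clusters):
    """Start with one cluster per word; repeatedly merge the two most similar clusters."""
    clusters = [frozenset([word]) for word in sorted(contexts)]
    while len(clusters) > num_clusters:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: set_similarity(pair[0], pair[1], contexts),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Hypothetical toy data: how often each context word follows the key word.
toy_contexts = {
    "monday": {"morning": 4, "night": 2},
    "tuesday": {"morning": 3, "night": 2},
    "eat": {"rice": 5, "bread": 1},
    "cook": {"rice": 3, "bread": 2},
}
result = cluster_words(toy_contexts, 2)
```

Swapping `word_similarity` for a different function (for example, one built on predecessor counts instead of successor counts) is one way to mirror the abstract's point that the similarity definition can be chosen per model. This naive version recomputes every pairwise set similarity at each merge, so it is meant for reading rather than for large vocabularies.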
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.