
Word Clustering Based on Similarity (基于相似度的词聚类算法)

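Cite this article:	YUAN Li-chi, ZHONG Yi-xin. Word Clustering Based on Similarity (基于相似度的词聚类算法) [J]. Microelectronics & Computer (微电子学与计算机), 2005, 22(8): 93-95.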

Authors:	YUAN Li-chi (袁里驰), ZHONG Yi-xin (钟义信)

Affiliation:	School of Information Engineering, Beijing University of Posts and Telecommunications, Beijing 100876

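Funding:	National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program)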

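Abstract:	Class-based statistical language models are an important way to address the data sparseness problem in statistical models. Conventional statistical clustering methods follow a greedy principle and typically use the likelihood function or the perplexity of the corpus as the evaluation criterion. Their main drawbacks are slow clustering, strong sensitivity to initial values, and a tendency to get trapped in local optima. This paper proposes a definition of word similarity, a definition of word-set similarity, and a bottom-up hierarchical clustering algorithm. The method not only improves clustering quality but also allows a different similarity definition to be chosen for each model, which improves the effectiveness of using the clusters.
Keywords:	word similarity; word clustering; statistical language model
Article ID:	1000-7180(2005)08-01
Received:	2005-01-17
Revised:	2005-01-17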
Word Clustering Based on Similarity

YUAN Li-chi, ZHONG Yi-xin. Word Clustering Based on Similarity [J]. Microelectronics & Computer, 2005, 22(8): 93-95.

Authors:	YUAN Li-chi, ZHONG Yi-xin

Abstract:	Cluster-based statistical language models are an important way to address the data sparseness problem. Conventional statistical clustering methods are usually based on a greedy principle, and the common metric for evaluating a clustering algorithm is the likelihood function or the perplexity of the corpus. Such algorithms often converge to a local optimum, so a global optimum is not guaranteed, and the initial choices can influence the final result. This paper addresses these problems. It presents a novel definition of word similarity, derives from it a definition of word-set similarity, and proposes a bottom-up hierarchical clustering algorithm based on similarity. The method not only improves clustering quality but also allows a different similarity definition to be chosen for each cluster-based model, such as predictive clustering, conditional clustering, and combined clustering, which improves the effectiveness of using the clusters. Experiments show that the similarity-based word clustering algorithm outperforms the conventional greedy clustering method in both speed and performance.

Keywords:	Word similarity; Word clustering; Statistical language model
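
The abstract names three ingredients: a word similarity, a word-set similarity built on it, and a bottom-up hierarchical clustering that merges the most similar sets. The record does not give the paper's actual definitions, so the following is only a minimal sketch under stated assumptions: word similarity is taken to be the cosine of left/right neighbour-count vectors, and word-set similarity is taken to be the average pairwise word similarity. All function names (`context_vectors`, `word_similarity`, `set_similarity`, `cluster_words`) are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def context_vectors(sentences):
    """Count the left and right neighbours of every word (assumed representation)."""
    vectors = {}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            vec = vectors.setdefault(word, Counter())
            if i > 0:
                vec[("L", sentence[i - 1])] += 1
            if i + 1 < len(sentence):
                vec[("R", sentence[i + 1])] += 1
    return vectors


def word_similarity(u, v):
    """Cosine similarity of two neighbour-count vectors (assumed definition)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def set_similarity(a, b, vectors):
    """Average pairwise word similarity between two word sets (assumed definition)."""
    total = sum(word_similarity(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster_words(sentences, num_clusters):
    """Bottom-up clustering: start from singletons, repeatedly merge the closest pair."""
    vectors = context_vectors(sentences)
    clusters = [frozenset([word]) for word in sorted(vectors)]
    while len(clusters) > num_clusters:
        # Pick the two clusters with the highest set similarity.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: set_similarity(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters
```

On a toy corpus of the four sentences "the cat sleeps", "the dog sleeps", "a cat runs", and "a dog runs", merging down to three clusters groups `cat` with `dog`, `a` with `the`, and `runs` with `sleeps`, because each pair shares identical neighbours. Swapping in a different `word_similarity` is the hook that corresponds to choosing a different similarity definition per model. This naive version recomputes every pairwise set similarity at each merge and makes no claim about the speed reported in the paper.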
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
