A Novel Text Clustering Algorithm Based on Inner Product Space Model of Semantic

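Cite this article:	PENG Jing, YANG Dong-Qing, TANG Shi-Wei, FU Yan, JIANG Han-Kui. A Novel Text Clustering Algorithm Based on Inner Product Space Model of Semantic[J]. Chinese Journal of Computers, 2007, 30(8): 1354-1363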

Authors:	PENG Jing, YANG Dong-Qing, TANG Shi-Wei, FU Yan, JIANG Han-Kui

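Affiliations:	1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; 2. Information and Communication Department, Chengdu Public Security Bureau, Chengdu 610017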

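Funding:	National Natural Science Foundation of China; China Postdoctoral Science Foundation; Sichuan Youth Science and Technology Foundation; National High Technology Research and Development Program of China (863 Program); Beijing Natural Science Foundation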

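Abstract:	When existing data clustering methods are applied to text data, especially short text, they give unsatisfactory results because they do not account for latent similarity between words. Targeting the high dimensionality and sparsity of text data, this paper proposes a text clustering algorithm based on a semantic inner product space model. The algorithm first uses the definition of an inner product space to build similarity measures for Chinese concepts, words, and texts, and then analyzes these measures theoretically. Finally, a two-stage process, top-down splitting followed by bottom-up merging, completes the clustering of the text data. The method has been successfully applied to clustering Chinese short-text data. Experiments show that it achieves better clustering quality than traditional methods.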
Keywords:	inner product space; text clustering; concept similarity; similarity computation; data mining
Revised:	2007-03-06
A Novel Text Clustering Algorithm Based on Inner Product Space Model of Semantic

PENG Jing, YANG Dong-Qing, TANG Shi-Wei, FU Yan, JIANG Han-Kui. A Novel Text Clustering Algorithm Based on Inner Product Space Model of Semantic[J]. Chinese Journal of Computers, 2007, 30(8): 1354-1363

Authors:	PENG Jing, YANG Dong-Qing, TANG Shi-Wei, FU Yan, JIANG Han-Kui

Affiliation:	1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871; 2. Information and Communication Department, Chengdu Public Security Bureau, Chengdu 610017

Abstract:	Existing clustering algorithms perform poorly on text data, especially short text, because they ignore the latent similarity among words. Addressing the high dimensionality and sparsity of text data, this paper proposes a novel text clustering algorithm based on a semantic inner product space model. The paper first uses the definition of an inner product space to establish similarity measures for Chinese concepts, words, and texts, and then analyzes the algorithm theoretically. Clustering is completed through a two-phase process: a top-down "divide" phase followed by a bottom-up "merge" phase. The method has been applied to the clustering of Chinese short documents. Experiments show that it yields better clustering quality than traditional algorithms.

Keywords:	inner product space; text clustering; concept similarity; similarity computing; data mining
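
The abstract names two ingredients: a text similarity built on an inner product that accounts for word-to-word similarity, and a two-phase divide-then-merge clustering. The sketch below illustrates that general idea only; it is not the paper's definitions or algorithm. Everything in it is an assumption made for illustration: the `WORD_SIM` table and its values, the English example words, the seed-based splitting rule, the average-link merging rule, and both thresholds. The hand-made similarity table is chosen so that the bilinear form is positive semidefinite and behaves as an inner product.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical word-to-word similarities in [0, 1]. Unlisted pairs count as 0,
# and every word has similarity 1 with itself.
WORD_SIM = {
    frozenset(["car", "automobile"]): 0.9,
    frozenset(["theft", "stolen"]): 0.8,
    frozenset(["fire", "blaze"]): 0.9,
}

def word_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return WORD_SIM.get(frozenset([w1, w2]), 0.0)

def inner(a, b):
    """Bilinear form <a, b> = sum_i sum_j a_i * s(w_i, w_j) * b_j over two bags of words."""
    return sum(ca * cb * word_sim(wa, wb)
               for wa, ca in a.items() for wb, cb in b.items())

def text_sim(a, b):
    """Cosine-style similarity induced by the inner product above."""
    denom = sqrt(inner(a, a) * inner(b, b))
    return inner(a, b) / denom if denom else 0.0

def avg_link(c1, c2, docs):
    """Average pairwise similarity between two clusters of document indices."""
    total = sum(text_sim(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def divide(docs, split_below):
    """Top-down phase: split a cluster while its least similar pair is below split_below."""
    pending, done = [list(range(len(docs)))], []
    while pending:
        cluster = pending.pop()
        pairs = list(combinations(cluster, 2))
        if not pairs:
            done.append(cluster)
            continue
        s1, s2 = min(pairs, key=lambda p: text_sim(docs[p[0]], docs[p[1]]))
        if text_sim(docs[s1], docs[s2]) >= split_below:
            done.append(cluster)
            continue
        # Use the least similar pair as seeds and assign the rest to the closer seed.
        left, right = [s1], [s2]
        for i in cluster:
            if i in (s1, s2):
                continue
            closer_to_s1 = text_sim(docs[i], docs[s1]) >= text_sim(docs[i], docs[s2])
            (left if closer_to_s1 else right).append(i)
        pending += [left, right]
    return done

def merge(clusters, docs, merge_at):
    """Bottom-up phase: merge the most similar pair of clusters while it reaches merge_at."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_link(clusters[p[0]], clusters[p[1]], docs))
        if avg_link(clusters[i], clusters[j], docs) < merge_at:
            break
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

# Four short texts: two about a car theft, two about a warehouse fire.
docs = [Counter(t.split()) for t in (
    "car theft downtown",
    "automobile stolen downtown",
    "warehouse fire",
    "warehouse blaze",
)]

fragments = divide(docs, split_below=0.92)            # deliberately strict, so it over-splits
clusters = merge(fragments, docs, merge_at=0.5)       # merging repairs the over-split
print(sorted(sorted(c) for c in clusters))
```

With a plain bag-of-words cosine, the first two texts share only the word "downtown"; under the word-similarity-aware inner product they are close, which is the effect the abstract attributes to considering latent similarity between words. In this toy run the strict divide threshold separates the two theft reports, and the merge phase joins them again.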
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
