Study on Massive Short Documents Clustering Technology (海量短语信息文本聚类技术研究)

Cite this article (original Chinese citation):	王永恒, 贾焰, 杨树强. 海量短语信息文本聚类技术研究[J]. 计算机工程, 2007, 33(14): 38-40

Authors:	WANG Yongheng (王永恒), JIA Yan (贾焰), YANG Shuqiang (杨树强)

Affiliation:	Institute of Network, Computer School, National University of Defense Technology, Changsha 410073 (all three authors)

Funding:	National High Technology Research and Development Program of China (863 Program)

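Abstract (translated from the Chinese):	The development of information technology has led to the accumulation of large volumes of text data, much of which consists of short texts. Text clustering is important for automatically extracting knowledge from massive collections of short texts. Existing general-purpose text mining methods have difficulty handling massive data at the TB scale. Because keywords occur only a few times in a short text, the accuracy of text mining is also hard to guarantee. This paper proposes a parallel clustering algorithm that is based on frequent term sets and incorporates semantic information to address the clustering of massive short texts. Experiments show that the method achieves good performance and accuracy when processing massive short texts.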
Keywords (translated from the Chinese):	text mining; massive; short document; parallel
Article ID:	1000-3428(2007)14-0038-03
Revision date:	2006-07-30
Study on Massive Short Documents Clustering Technology

WANG Yongheng,JIA Yan,YANG Shuqiang. Study on Massive Short Documents Clustering Technology[J]. Computer Engineering, 2007, 33(14): 38-40

Authors:	WANG Yongheng, JIA Yan, YANG Shuqiang

Affiliation:	Institute of Network, Computer School, National University of Defense Technology, Changsha 410073

Abstract:	With the rapid development of information technology, huge amounts of data have accumulated, and a large share of this data takes the form of short documents. Clustering such short documents is valuable for acquiring knowledge automatically. However, most current clustering algorithms cannot handle massive data at the TB level. It is also difficult to achieve acceptable clustering accuracy because keywords occur only a few times in short documents. This paper proposes a frequent-term-based parallel clustering algorithm for clustering massive short documents. Semantic information is also used to improve clustering accuracy. The experimental study shows that the algorithm is accurate and efficient.

Keywords:	text mining; massive; short document; parallel
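
Illustrative sketch (not taken from the paper): the abstract names frequent-term-based clustering but gives no algorithmic detail, so the following is only a generic, single-machine illustration of that general idea. The function names `frequent_term_sets` and `cluster`, the `min_support` and `max_size` parameters, and the tie-breaking rule are all assumptions made for this example; the paper's parallelization and its use of semantic information are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each term set (up to max_size terms) to the documents containing it,
    keeping only sets that occur in at least min_support documents.
    Brute-force enumeration; only reasonable because short documents have few terms."""
    cover = defaultdict(set)
    for i, terms in enumerate(docs):
        unique = sorted(set(terms))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                cover[frozenset(combo)].add(i)
    return {ts: ids for ts, ids in cover.items() if len(ids) >= min_support}


def cluster(docs, min_support=2, max_size=2):
    """Assign each document to one frequent term set it contains.
    Assumed rule: prefer the largest term set, then the highest support.
    Documents containing no frequent term set go to the frozenset() bucket."""
    frequent = frequent_term_sets(docs, min_support, max_size)
    clusters = defaultdict(list)
    for i in range(len(docs)):
        candidates = [ts for ts, ids in frequent.items() if i in ids]
        if not candidates:
            clusters[frozenset()].append(i)
            continue
        best = max(candidates,
                   key=lambda ts: (len(ts), len(frequent[ts]), sorted(ts)))
        clusters[best].append(i)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical tokenized short documents, invented for this example.
    docs = [
        ["meeting", "tomorrow", "office"],
        ["meeting", "office", "late"],
        ["football", "match", "tonight"],
        ["football", "match", "ticket"],
        ["hello"],
    ]
    for term_set, members in cluster(docs).items():
        print(sorted(term_set), members)
```

In this sketch the support counting is a simple per-document accumulation, so one plausible way to scale it would be to count over document partitions separately and merge the per-partition covers; that is an assumption about how such a scheme could be parallelized, not a description of the authors' design.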
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.