Parallel Text Categorization of Massive Text Based on Hadoop

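Cite this article:	XIANG Xiao-jun, GAO Yang, SHANG Lin, YANG Yu-bin. Parallel Text Categorization of Massive Text Based on Hadoop[J]. Computer Science, 2011, 38(10): 184-188.

Authors:	XIANG Xiao-jun, GAO Yang, SHANG Lin, YANG Yu-bin

Affiliation:	Department of Computer Science and Technology, Nanjing University, Nanjing 210093

Funding:	Supported by the National Natural Science Foundation of China (61035003, 60875011); the International Science and Technology Cooperation Program of the Ministry of Science and Technology (2010DFA11030); and the Natural Science Foundation of Jiangsu Province (BK2010054).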

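Abstract:	Text categorization is a research hotspot and a core technique in information retrieval and data mining, and it has received wide attention and developed rapidly in recent years. With text data growing exponentially, managing it effectively requires processing it with efficient algorithms in a distributed environment. We implement a simple and effective text categorization algorithm on the Hadoop distributed platform: the TFIDF classification algorithm, a classifier based on the vector space model that obtains its classification results from cosine similarity. Experiments on two datasets show that the parallelized algorithm is effective on large datasets and can be applied well in practice.
Keywords:	Text categorization, parallelization, massive data, Hadoop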
Parallel Text Categorization of Massive Text Based on Hadoop

XIANG Xiao-jun, GAO Yang, SHANG Lin, YANG Yu-bin. Parallel Text Categorization of Massive Text Based on Hadoop[J]. Computer Science, 2011, 38(10): 184-188.

Authors:	XIANG Xiao-jun, GAO Yang, SHANG Lin, YANG Yu-bin

Affiliation:	Department of Computer Science and Technology, Nanjing University, Nanjing 210093, China

Abstract:	Automatic text categorization is one of the hotspots and key techniques in information retrieval and data mining, and it has seen extensive study and rapid progress in recent years. As text data grows exponentially, managing such large volumes effectively requires efficient algorithms that run in a distributed environment. In this paper, we implement a simple and effective text categorization algorithm on Hadoop: the TFIDF classifier, an algorithm based on the vector space model that uses cosine similarity as its metric. Experiments on two datasets show that the parallel algorithm is effective on large volumes of data and can be applied in practical settings.

Keywords:	Text categorization, Parallelization, Massive data, Hadoop
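
The abstract names the technique (TFIDF vectors compared by cosine similarity) but not the job design, so the following is only a minimal single-machine sketch of that idea, not the authors' Hadoop implementation. The function names, the whitespace tokenizer, the smoothed IDF formula, and the choice of one summed prototype vector per class are all assumptions made for illustration. The map and reduce helpers imitate the key/value pattern a MapReduce job could use for term counting.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the abstract gives no preprocessing details.
    return text.lower().split()


def map_term_counts(label, text):
    # Map-style step: emit ((label, term), count) pairs for one training document.
    for term, count in Counter(tokenize(text)).items():
        yield (label, term), count


def reduce_sum(pairs):
    # Reduce-style step: sum the counts that share the same (label, term) key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs):
    # docs is a list of (label, text) pairs.
    pairs = [kv for label, text in docs for kv in map_term_counts(label, text)]
    class_tf = defaultdict(dict)
    for (label, term), count in reduce_sum(pairs).items():
        class_tf[label][term] = count

    # Document frequency: number of training documents containing each term.
    df = Counter()
    for _, text in docs:
        df.update(set(tokenize(text)))

    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    # One TF-IDF prototype vector per class (hypothetical design choice).
    prototypes = {
        label: {t: c * idf[t] for t, c in tf.items()}
        for label, tf in class_tf.items()
    }
    return prototypes, idf


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(text, prototypes, idf):
    # Terms unseen in training are ignored because they have no IDF weight.
    vec = {t: c * idf[t] for t, c in Counter(tokenize(text)).items() if t in idf}
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))
```

Calling `train` on a list of labeled texts and then `classify` on a new text returns the label whose prototype has the highest cosine similarity to the new text's TF-IDF vector. In a distributed setting the counting steps would run as separate jobs over partitioned input; how the paper actually splits the work is not stated in the abstract.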
This article is indexed in CNKI, Wanfang Data, and other databases.