
Multi View Text Categorization Based on Random Forests (一种基于随机森林的多视角文本分类方法)

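Cite as:	TIAN Baoming (田宝明), DAI Xinyu (戴新宇), CHEN Jiajun (陈家骏). Multi View Text Categorization Based on Random Forests [J]. Journal of Chinese Information Processing (中文信息学报), 2009, 23(4): 48-55.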

Authors:	TIAN Baoming (田宝明), DAI Xinyu (戴新宇), CHEN Jiajun (陈家骏)

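Affiliation:	State Key Laboratory for Novel Software Technology, Nanjing University; Department of Computer Science and Technology, Nanjing University, Nanjing, Jiangsu 210093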

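Funding:	National High Technology Research and Development Program (863 Program); National Natural Science Foundation; National Social Science Fund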

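Abstract:	The word-based vector space model is the traditional way to represent documents in text categorization. One drawback of this representation is that it ignores the relations between words. Recently, methods that represent text by latent topics, such as Latent Dirichlet Allocation (LDA), have attracted attention, because such representations can capture relations between words. However, using only a latent-topic representation may lose word-level information. We use a modified random forests method to combine two text representations, one based on words and one based on LDA topics. A random forest is built separately for each of the two feature types, and the final classification is decided by a voting mechanism. Experimental results on standard datasets show that, compared with methods that use only one type of text feature, our method combines the two feature types effectively and improves text categorization performance.
Keywords:	computer application; Chinese information processing; text categorization; vector space model; latent Dirichlet allocation; ensemble classification; random forests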
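
The voting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the modified random forests weight or pool their votes, so this sketch assumes the simplest reading: every tree in the word-based forest and every tree in the LDA-topic forest casts one equally weighted vote, and the label with the most pooled votes wins. The function name `combine_forest_votes` and the tie-breaking rule are hypothetical, not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def combine_forest_votes(
    word_tree_votes: Sequence[str],
    topic_tree_votes: Sequence[str],
) -> str:
    """Pool per-tree votes from two forests and return the majority label.

    Assumption: each tree, from either forest, has equal weight.
    """
    tally = Counter(word_tree_votes)
    tally.update(topic_tree_votes)
    if not tally:
        raise ValueError("at least one vote is required")
    # Break ties by label order so the result does not depend on vote order
    # (an arbitrary choice made for this sketch).
    best = max(tally.values())
    return min(label for label, count in tally.items() if count == best)


# Hypothetical votes for one document: three trees per forest.
word_votes = ["sports", "sports", "politics"]      # forest on word features
topic_votes = ["politics", "politics", "politics"]  # forest on LDA topic features
predicted = combine_forest_votes(word_votes, topic_votes)
```

Here the word-based forest alone would predict "sports", but the pooled tally favors "politics" by four votes to two, which shows how the second view can change the final decision.
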
Multi View Text Categorization Based on Random Forests

TIAN Baoming, DAI Xinyu, CHEN Jiajun. Multi View Text Categorization Based on Random Forests [J]. Journal of Chinese Information Processing, 2009, 23(4): 48-55.

Authors:	TIAN Baoming, DAI Xinyu, CHEN Jiajun

Affiliation:	State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing, Jiangsu 210093, China

Abstract:	The term-based Vector Space Model (VSM) is a traditional approach to representing documents, but it neglects the relations between terms. To capture these relations, latent topic-based document representations such as LDA (Latent Dirichlet Allocation) have attracted much attention recently. However, purely latent topic-based text representations may lose information carried by the terms. In this paper, we use a modified random forests method to combine the term based and the L...

Keywords:	computer application; Chinese information processing; text categorization; VSM; latent Dirichlet allocation; ensemble classification; random forests
This article is indexed in CNKI, Wanfang Data (万方数据), and other databases.