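Research on Using Unlabeled Documents to Improve the Performance of Centroid-based Classification
Cite this article: HE Yao, ZHANG Shun-miao. Research on using unlabeled documents to improve the performance of centroid-based classification [J]. Digital Community & Smart Home (数字社区&智能家居), 2007, 3(16): 1125-1126.
Authors: HE Yao (何尧)  ZHANG Shun-miao (张顺淼)
Affiliation (both authors): Department of Computer and Information Science, Fujian University of Technology (福建工程学院), Fuzhou, Fujian 350014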
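Abstract (translated from the Chinese): Centroid-based classification is efficient, but it needs a large number of training documents (labeled documents) to train the classifier and ensure classification accuracy. Training documents are limited in number because labeling them takes considerable human and material resources, whereas many unlabeled documents are available on the web. To address this, the centroid-based method is improved and the ONUC and OFFUC algorithms are proposed, to compensate for the sharp drop in the performance of centroid-based classification when training documents are insufficient. Because centroid-based classification is easily affected by outliers, an edge-trimming step (去边处理) is applied. Experiments show that, compared with the ordinary centroid-based classifier and other classic semi-supervised algorithms, the proposed algorithms achieve better performance when very few training documents are available.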

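Keywords (translated from the Chinese): centroid-based classification  text classification  unlabeled documents  labeled documents
Article ID: 1009-3044(2007)16-31125-02
Revised manuscript received: August 2, 2007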

Research on Using Unlabeled Text to Improve the Performance of Centroid-based Classification Algorithms
HE Yao, ZHANG Shun-miao. Research on Using Unlabeled Text to Improve the Performance of Centroid-based Classification Algorithms[J]. Digital Community & Smart Home, 2007, 3(16): 1125-1126.
Authors:HE Yao  ZHANG Shun-miao
Abstract: Centroid-based classification algorithms are a highly efficient class of algorithms for text categorization. However, they require a large number of labeled documents to obtain a good classification model. In practical applications, labeled documents are often scarce because manually labeling data is tedious and costly, while unlabeled documents are often abundant. We therefore propose the OFFUC and ONUC algorithms to remedy the dramatic degradation of centroid-based classification algorithms when training data is scarce. Because training documents that lie far from the center of their category reduce classification accuracy, we exclude them from consideration. Experimental results show that the proposed OFFUC and ONUC algorithms improve the performance of centroid-based classification algorithms and outperform generic semi-supervised methods when the number of labeled documents is very small.
Keywords: Centroid-based Classification Algorithms  Text Categorization  Unlabeled Document  Labeled Document
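
The abstract names three ingredients (a centroid classifier, exclusion of training documents far from their category center, and the use of unlabeled documents) but does not give the details of OFFUC or ONUC. The following is therefore only a generic sketch of those ingredients, not the authors' algorithms: it assumes documents are already term-weight vectors, uses cosine similarity, and folds unlabeled documents in with a single round of self-training. All function names and the `keep_ratio` and `threshold` parameters are hypothetical.

```python
import math

def normalize(v):
    # Scale a vector to unit length; leave an all-zero vector unchanged.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def cosine(a, b):
    # Cosine similarity between two term-weight vectors.
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def centroid(vectors):
    # Component-wise mean of a non-empty list of equal-length vectors.
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def trimmed_centroid(vectors, keep_ratio=0.8):
    # Assumed trimming rule: rank documents by similarity to the plain
    # centroid, keep the closest keep_ratio fraction, and recompute.
    c = centroid(vectors)
    ranked = sorted(vectors, key=lambda v: cosine(v, c), reverse=True)
    keep = max(1, math.ceil(keep_ratio * len(ranked)))
    return centroid(ranked[:keep])

def train(labeled, keep_ratio=0.8):
    # labeled: dict mapping class label -> list of document vectors.
    return {label: trimmed_centroid(vs, keep_ratio) for label, vs in labeled.items()}

def classify(centroids, v):
    # Return (label, similarity) for the most similar class centroid.
    scored = ((label, cosine(v, c)) for label, c in centroids.items())
    return max(scored, key=lambda pair: pair[1])

def self_train(labeled, unlabeled, threshold=0.9, keep_ratio=0.8):
    # One self-training round (an assumption, not OFFUC/ONUC): add each
    # unlabeled vector whose best similarity reaches threshold to the
    # predicted class, then retrain the centroids on the enlarged set.
    centroids = train(labeled, keep_ratio)
    augmented = {label: list(vs) for label, vs in labeled.items()}
    for v in unlabeled:
        label, sim = classify(centroids, v)
        if sim >= threshold:
            augmented[label].append(v)
    return train(augmented, keep_ratio)

if __name__ == "__main__":
    labeled = {"sports": [[1, 0], [0.9, 0.1]], "tech": [[0, 1], [0.1, 0.9]]}
    unlabeled = [[0.8, 0.2], [0.2, 0.8], [0.5, 0.5]]
    model = self_train(labeled, unlabeled)
    print(classify(model, [0.7, 0.3]))
```

In this toy setup the ambiguous vector `[0.5, 0.5]` falls below the threshold and is left out, which is the usual reason for a confidence cutoff in self-training: a wrongly absorbed document would pull the centroid away from its class.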
This article is indexed in CNKI, VIP (维普), and other databases.