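Graph Based Co-Training Algorithm for Web Page Classification
Cite this article: HOU Cui-qin, JIAO Li-cheng. Graph Based Co-Training Algorithm for Web Page Classification[J]. Acta Electronica Sinica, 2009, 37(10): 2173-2180.
Authors: HOU Cui-qin, JIAO Li-cheng
Affiliation: Institute of Intelligent Information Processing and Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi'an, Shaanxi 710071, China
Funding: National Natural Science Foundation of China; Key Project of the Ministry of Education (No. 108115); National Basic Research Program of China (973 Program); National High Technology Research and Development Program of China (863 Program); national ministry-level science and technology projects; Program for Changjiang Scholars and Innovative Research Team of the Ministry of Education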
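Abstract: Making full use of both the hyperlink relations and the text of web pages, this paper proposes an inductive semi-supervised learning algorithm for web page classification, the Graph based Co-training algorithm for web page classification (GCo-training for short), and proves the effectiveness of the algorithm theoretically. Within the Co-training framework, GCo-training iteratively learns a semi-supervised classifier based on a graph constructed from hyperlink information and a Bayes classifier based on text features. The graph-based semi-supervised classifier uses only a small amount of labeled data, yet it can reach fairly high prediction accuracy by mining the large amount of relational information among the data, and so it can supply the Bayes classifier with a large amount of label information. Conversely, once the Bayes classifier has learned from this label information, it can provide useful information to the graph-based classifier. During the iterations the two classifiers help each other and steadily improve their respective performance, after which the Bayes classifier can be used to predict the classes of large numbers of unseen examples. Experimental results on the WebKB dataset show that GCo-training outperforms both a Co-training algorithm that uses text features and anchor-text features and an EM-based Bayes algorithm.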

Keywords: semi-supervised; Co-training; inductive; web page classification
Received: 2008-05-19

Graph Based Co-Training Algorithm for Web Page Classification
HOU Cui-qin, JIAO Li-cheng. Graph Based Co-Training Algorithm for Web Page Classification[J]. Acta Electronica Sinica, 2009, 37(10): 2173-2180.
Authors: HOU Cui-qin, JIAO Li-cheng
Affiliation: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Institute of Intelligent Information Processing, Xidian University, Xi'an, Shaanxi 710071, China
Abstract: This paper proposes GCo-training, a novel inductive semi-supervised algorithm for web page classification that exploits both the text in web pages and the hyperlinks among them. Under the Co-training framework, GCo-training iteratively trains two classifiers: a graph-based semi-supervised classifier built on the hyperlinks among web pages and a Bayes classifier built on the text in web pages. On the one hand, the graph-based semi-supervised classifier achieves high accuracy from a small set of labeled examples by exploiting the links among web pages, and it can augment the labeled examples available to the Bayes classifier. On the other hand, after learning from the labeled set augmented by the graph-based classifier, the Bayes classifier can in turn provide labeled examples to the graph-based classifier. The two classifiers thus help each other and improve their respective performance during training. Finally, the Bayes classifier can classify a large number of unseen examples. We compare GCo-training with a Co-training algorithm based on the words occurring in web pages and the words occurring in hyperlinks, and with an EM-based Bayes algorithm, on the WebKB dataset. Experimental results show that GCo-training performs much better than the other algorithms.
Keywords:Co-training
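
The iterative scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the graph classifier, the confidence rule, or how labels are exchanged, so the sketch assumes simple label propagation over the hyperlink graph, a multinomial naive Bayes text classifier with add-one smoothing, and a single shared pool of labels to which each classifier adds its most confident prediction per round. The function names (`propagate`, `train_nb`, `gco_training_sketch`) and the six-page toy corpus are hypothetical.

```python
import math
from collections import Counter

def propagate(neighbors, labels, classes, n_iter=20):
    """Assumed graph view: label propagation with labeled nodes clamped."""
    f = {v: {c: 1.0 if labels.get(v) == c else 0.0 for c in classes}
         for v in neighbors}
    for _ in range(n_iter):
        new_f = {}
        for v, nbrs in neighbors.items():
            if v in labels or not nbrs:
                new_f[v] = f[v]
            else:  # unlabeled node: average the scores of linked pages
                new_f[v] = {c: sum(f[u][c] for u in nbrs) / len(nbrs)
                            for c in classes}
        f = new_f
    out = {}
    for v in neighbors:
        if v not in labels:
            best = max(classes, key=lambda c: f[v][c])
            out[v] = (best, f[v][best])
    return out

def train_nb(docs, labels, classes, vocab):
    """Assumed text view: multinomial naive Bayes; returns a predict function."""
    word_counts = {c: Counter() for c in classes}
    class_counts = Counter()
    for v, c in labels.items():
        class_counts[c] += 1
        word_counts[c].update(docs[v])
    total = sum(class_counts.values())

    def predict(words):
        log_p = {}
        for c in classes:
            n_c = sum(word_counts[c].values())
            s = math.log((class_counts[c] + 1) / (total + len(classes)))
            for w in words:
                if w in vocab:  # words outside the vocabulary are ignored
                    s += math.log((word_counts[c][w] + 1) / (n_c + len(vocab)))
            log_p[c] = s
        m = max(log_p.values())
        p = {c: math.exp(s - m) for c, s in log_p.items()}
        best = max(classes, key=lambda c: p[c])
        return best, p[best] / sum(p.values())
    return predict

def gco_training_sketch(docs, neighbors, seeds, classes, rounds=3, per_round=1):
    """Alternate the two views; each adds its most confident labels to the pool."""
    labels = dict(seeds)
    vocab = {w for d in docs.values() for w in d}
    for _ in range(rounds):
        graph_out = propagate(neighbors, labels, classes)
        ranked = sorted(graph_out.items(), key=lambda kv: -kv[1][1])
        for v, (c, conf) in ranked[:per_round]:
            if conf > 0.5:  # assumed confidence threshold
                labels[v] = c
        nb = train_nb(docs, labels, classes, vocab)
        text_out = {v: nb(docs[v]) for v in docs if v not in labels}
        ranked = sorted(text_out.items(), key=lambda kv: -kv[1][1])
        for v, (c, _) in ranked[:per_round]:
            labels[v] = c
    return train_nb(docs, labels, classes, vocab), labels

# Hypothetical toy corpus: page id -> words, plus undirected hyperlinks.
docs = {0: ["homework", "lecture", "exam"], 1: ["lecture", "syllabus", "exam"],
        2: ["homework", "syllabus"], 3: ["professor", "research", "publications"],
        4: ["research", "publications", "students"], 5: ["professor", "students"]}
neighbors = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3, 5], 5: [4]}
classes = ["course", "faculty"]
nb, final_labels = gco_training_sketch(docs, neighbors,
                                       {0: "course", 3: "faculty"}, classes)
print(final_labels)
print(nb(["syllabus", "homework"]))  # classify an unseen page from text alone
```

Only the text classifier is returned at the end, which mirrors the inductive aspect of the abstract: an unseen page needs no position in the hyperlink graph to be classified.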
This article is indexed in Wanfang Data and other databases.