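A Chinese Web Page Classifier Combining Support Vector Machines with Unsupervised Clustering
Cite this article: LI Xiao-Li, LIU Ji-Min, SHI Zhong-Zhi. A Chinese Web Page Classifier Based on Support Vector Machine and Unsupervised Clustering[J]. Chinese Journal of Computers (计算机学报), 2001, 24(1): 62-68.
Authors: LI Xiao-Li (李晓黎), LIU Ji-Min (刘继敏), SHI Zhong-Zhi (史忠植)
Affiliation: Institute of Computing Technology, Chinese Academy of Sciences
Funding: Supported by the National Natural Science Foundation of China (69803010) and the National "863" High-Tech Research and Development Program of China (863-511-946-010)
Abstract: This paper proposes a new classification algorithm that combines support vector machines (SVM) with unsupervised clustering, together with a new representation of web pages, and applies both to web page classification. The algorithm first uses unsupervised clustering to cluster the positive and the negative examples of the training set separately, then selects some of the examples to train an SVM and obtain an SVM classifier. Any web page is then classified either by unsupervised clustering or by the SVM classifier, depending on its distances to the cluster centers. The algorithm makes full use of the high accuracy of SVM and the speed of unsupervised clustering. Experiments show that it achieves both high training efficiency and high precision.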

Keywords: support vector machine; unsupervised clustering; Chinese web page classifier; Internet; machine learning
Revised: November 17, 1999
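
The first stage described in the abstract, clustering the positive and the negative training examples separately to obtain per-class cluster centers, can be sketched roughly as follows. This is only an illustration under stated assumptions: the record does not specify the paper's clustering algorithm, so plain k-means with Euclidean distance is swapped in as a stand-in, and the names `kmeans`, `cluster_by_class`, and the parameter `k` are hypothetical.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means (assumed stand-in for the paper's clustering step)."""
    # Deterministic initialisation: use the first k points as starting centers.
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the previous center when a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


def cluster_by_class(examples, labels, k):
    """Cluster positive (+1) and negative (-1) examples separately.

    Returns (positive_centers, negative_centers).
    """
    positives = [x for x, y in zip(examples, labels) if y == 1]
    negatives = [x for x, y in zip(examples, labels) if y == -1]
    return kmeans(positives, k), kmeans(negatives, k)
```

Calling `cluster_by_class(vectors, labels, k)` on labelled page vectors yields two lists of centers. How the paper then selects the examples passed to the SVM is not described in this record and is deliberately left out of the sketch.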

A Chinese Web Page Classifier Based on Support Vector Machine and Unsupervised Clustering
LI Xiao-Li, LIU Ji-Min, SHI Zhong-Zhi. A Chinese Web Page Classifier Based on Support Vector Machine and Unsupervised Clustering[J]. Chinese Journal of Computers, 2001, 24(1): 62-68.
Authors: LI Xiao-Li, LIU Ji-Min, SHI Zhong-Zhi
Abstract: This paper presents a new algorithm that combines the Support Vector Machine (SVM) with unsupervised clustering. After analyzing the characteristics of web pages, it proposes a new vector representation of web pages and applies it to web page classification. Given a training set, the algorithm clusters the positive and the negative examples separately with an unsupervised clustering (UC) algorithm, which produces a number of positive and negative cluster centers. It then selects only some of the examples as input to the SVM, according to the ISUC algorithm, and finally constructs a classifier through SVM learning. Any text can then be classified either by comparing its distances to the cluster centers or by the SVM. If the text is near a cluster center of one category and far from all the cluster centers of the other categories, UC classifies it correctly with high probability; otherwise, the SVM is used to decide which category it belongs to. The algorithm thus combines the strengths of SVM and unsupervised clustering. Experiments show that it not only improves training efficiency but also achieves good precision.
Keywords: support vector machine; clustering; text classification
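
The decision rule in the abstract (use the clustering result when a text is near a center of one category and far from all centers of the other, otherwise fall back to the SVM) can be sketched as below. This is a minimal illustration, not the authors' implementation: Euclidean distance, the thresholds `near` and `far`, and the callable `svm_predict` standing in for a trained SVM are all assumptions that do not come from the record.

```python
import math


def nearest_distance(x, centers):
    """Distance from x to the closest center in a non-empty list."""
    return min(math.dist(x, c) for c in centers)


def classify(x, pos_centers, neg_centers, svm_predict, near=1.0, far=3.0):
    """Route x to the clustering rule or to the SVM.

    Returns (label, method): label is +1 or -1, method is "UC" or "SVM".
    `near` and `far` are hypothetical thresholds; `svm_predict` is any
    callable mapping a vector to +1 or -1.
    """
    d_pos = nearest_distance(x, pos_centers)
    d_neg = nearest_distance(x, neg_centers)
    # Near a positive center and far from every negative center.
    if d_pos <= near and d_neg >= far:
        return 1, "UC"
    # Near a negative center and far from every positive center.
    if d_neg <= near and d_pos >= far:
        return -1, "UC"
    # Ambiguous region: defer to the SVM classifier.
    return svm_predict(x), "SVM"
```

Only the ambiguous examples reach the SVM in this sketch, which reflects the division of labor the abstract describes between the fast clustering rule and the more accurate SVM.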
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.