Research on Focused Crawler Based on Naive Bayes Algorithm
Cite as: PI Jing, SHAO Xiongkai, XIAO Yafu. Research on Focused Crawler Based on Naive Bayes Algorithm[J]. Computer and Digital Engineering (计算机与数字工程), 2012, 40(6): 76-78, 123.
Authors: PI Jing (皮靖), SHAO Xiongkai (邵雄凯), XIAO Yafu (肖雅夫)
Affiliation: School of Computer Science, Hubei University of Technology (湖北工业大学计算机学院), Wuhan 430068
Abstract: A focused crawler is a key component of a topic-specific search engine. This paper proposes using the Naive Bayes algorithm for topic identification and describes the key parts of implementing a focused crawler: generating the seed URL set, page analysis and feature extraction, and topic identification. The Naive Bayes-based focused crawler is compared with a focused crawler based on link analysis and with one based on a topic thesaurus. The experiments show that the Naive Bayes-based crawler achieves better accuracy in this comparison, which demonstrates the feasibility of the method and lays a good foundation for collecting topic-specific information.

Keywords: Naive Bayes algorithm; focused crawler; topic relevance; information collection
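
The abstract does not give the paper's feature set, training corpus, or decision threshold, so the following is only a minimal sketch of the general technique it names: a multinomial Naive Bayes classifier with add-one smoothing that scores how relevant a page is to a topic. The class name `NaiveBayesTopicFilter`, the `tokenize` helper, the toy training pages, and the `THRESHOLD` value are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters (a stand-in for real page analysis)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTopicFilter:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing. Hypothetical sketch."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training pages
        self.vocab = set()           # all tokens seen during training

    def train(self, labeled_pages):
        for text, label in labeled_pages:
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def log_scores(self, text):
        """Return log P(label) + sum of log P(token | label) for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for label, counts in self.word_counts.items():
            label_total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / (label_total + vocab_size))
            scores[label] = score
        return scores

    def relevance(self, text, topic_label="on_topic"):
        """Posterior probability of topic_label, normalised over all labels."""
        scores = self.log_scores(text)
        top = max(scores.values())
        weights = {k: math.exp(s - top) for k, s in scores.items()}
        return weights[topic_label] / sum(weights.values())


# Toy training data, invented for illustration only.
TRAINING_PAGES = [
    ("football match goal league striker", "on_topic"),
    ("team wins league cup final goal", "on_topic"),
    ("stock market shares fall bank", "off_topic"),
    ("bank raises interest rates market", "off_topic"),
]

THRESHOLD = 0.5  # assumed cut-off; the abstract does not state one

clf = NaiveBayesTopicFilter()
clf.train(TRAINING_PAGES)

page_text = "late goal decides the league final"
score = clf.relevance(page_text)
should_keep = score >= THRESHOLD  # a crawler could use this to keep or discard the page
print(f"relevance={score:.3f} keep={should_keep}")
```

In a crawler, a score like this would typically decide whether a fetched page is stored and whether its outgoing links are queued. How the paper itself wires the classifier into the crawl loop is not described in the abstract.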
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.