
Improved crawler algorithm technique for P2P specific information
Cite this article: DING Junping, CAI Wandong. Improved crawler algorithm technique for P2P specific information[J]. Computer Engineering and Applications, 2011, 47(29): 23-26.
Authors: DING Junping, CAI Wandong
Affiliation: College of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
Funding: National High-Tech Research and Development Plan of China (863 Program), Grant No. 2009AA01Z424
Abstract: Existing topic crawlers fetch a large number of irrelevant web pages when collecting "meta-information", so this paper improves the topic crawler by adding a URL classification step. Based on supplied URL samples, the classification method generates multiple keyword sets for irrelevant URLs as well as keyword sets for "meta-information" URLs, assigns a weight to each keyword in a set, and sets a classification threshold for each set. Each URL is represented as a feature vector, its distance to each keyword set is computed, and the URL is classified accordingly. The performance of the algorithm is analyzed in detail. Experimental results show that, compared with a traditional topic crawler, the improved technique substantially raises the efficiency of "meta-information" collection: in the same amount of time, the number of "meta-information" items obtained increases by 96.21%, which fully meets the performance requirements that the active monitoring model places on the crawler.
Keywords: "meta-information" acquisition; topic crawler technique; URL classification algorithm; feature vector representation; active monitoring model
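
The classification step summarized in the abstract (weighted keyword sets, a per-set threshold, a feature-vector representation of each URL, and a distance to each set) can be illustrated with a minimal Python sketch. The abstract does not give the tokenization rule, the distance formula, or any concrete keywords, weights, or thresholds, so everything below is an assumption made for illustration: the names `KeywordSet`, `tokenize`, `feature_vector`, `distance`, `classify`, and `EXAMPLE_SETS` are hypothetical, the feature vector is taken to be binary, and the distance is taken to be the weighted fraction of a set's keywords that are missing from the URL. It is not the authors' algorithm.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class KeywordSet:
    """A labelled keyword set with per-keyword weights and a decision threshold."""
    label: str
    weights: Dict[str, float]
    threshold: float  # assumed rule: a URL may join the set only if distance <= threshold


def tokenize(url: str) -> List[str]:
    """Split a URL into lowercase alphanumeric tokens (assumed tokenization)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def feature_vector(url: str, kset: KeywordSet) -> List[int]:
    """Binary vector: 1 if the set's keyword occurs as a token of the URL, else 0."""
    tokens = set(tokenize(url))
    return [1 if keyword in tokens else 0 for keyword in kset.weights]


def distance(url: str, kset: KeywordSet) -> float:
    """Assumed distance: weighted share of the set's keywords absent from the URL.

    0.0 means every keyword is present; 1.0 means none is.
    """
    vector = feature_vector(url, kset)
    total = sum(kset.weights.values())
    matched = sum(w * x for w, x in zip(kset.weights.values(), vector))
    return 1.0 - matched / total


def classify(url: str, ksets: List[KeywordSet]) -> str:
    """Return the label of the closest set within its threshold, else 'unknown'."""
    candidates = []
    for kset in ksets:
        d = distance(url, kset)
        if d <= kset.threshold:
            candidates.append((d, kset.label))
    return min(candidates)[1] if candidates else "unknown"


# Hypothetical keyword sets; real sets would be generated from URL samples.
EXAMPLE_SETS = [
    KeywordSet("meta", {"torrent": 3.0, "download": 1.0, "seed": 1.0}, threshold=0.5),
    KeywordSet("irrelevant", {"ads": 2.0, "login": 2.0, "forum": 1.0}, threshold=0.7),
]

if __name__ == "__main__":
    for u in (
        "http://example.com/download/file123.torrent",
        "http://example.com/forum/login?next=ads",
        "http://example.com/about",
    ):
        print(u, "->", classify(u, EXAMPLE_SETS))
```

In a crawler, a filter of this kind would sit in front of the fetch queue so that URLs classified as irrelevant are dropped before any page is downloaded. How the original system treats URLs that match no set is not stated in the abstract, so the "unknown" label here is only a placeholder.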
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.