Semantic Focused Crawler Method Combining Text Density
Cite this article: Lin Zhenxian, Yuan Zhu, Li Xiaoping. Semantic Focused Crawler Method Combining Text Density[J]. Computer Applications and Software, 2019(9): 270-275.
Authors: Lin Zhenxian, Yuan Zhu, Li Xiaoping
Affiliations: 1. School of Science, Xi’an University of Posts and Telecommunications, Xi’an 710121, Shaanxi, China; 2. School of Communication and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, Shaanxi, China
Funding: Special Scientific Research Fund Project of the Shaanxi Provincial Department of Education (18JK0699)
Abstract: Focused crawlers suffer from low crawling accuracy and efficiency because algorithms for extracting the core content of web pages are inaccurate and similarity models take insufficient account of semantic information. To address these problems, we propose a semantic focused crawler method that incorporates text density. A core content extraction algorithm uses the page title together with the LCS algorithm to locate the start and end positions of the core content text, and then extracts the core content of the web page. A Word2vec-based topic relevance algorithm computes the topic relevance of the core content, and an improved PageRank algorithm computes the topic importance of each link. Link priority is computed by combining topic relevance and topic importance. In addition, to improve the global search performance of the focused crawler, topic words are submitted to a search engine to expand the link set. Compared with a general-purpose crawler and several other focused crawlers, the proposed method achieves higher crawling accuracy and efficiency.
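The abstract does not say how the title and the LCS algorithm are combined, so the following Python is only a minimal sketch under stated assumptions: LCS is taken to mean longest common subsequence, the page is already split into text lines, and the first line whose LCS with the title covers most of the title is treated as the start of the core content. The function names and the `threshold` parameter are hypothetical and do not come from the paper.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def find_title_line(title: str, lines: list[str], threshold: float = 0.8) -> int:
    """Index of the first line whose LCS with the title covers at least
    `threshold` of the title's characters, or -1 if no line qualifies.
    The threshold is an illustrative assumption, not a value from the paper."""
    if not title:
        return -1
    for i, line in enumerate(lines):
        if lcs_length(title, line) / len(title) >= threshold:
            return i
    return -1
```

A crawler built this way might call `find_title_line(page_title, text_lines)` and start extracting body text from the returned index. Locating the end position would need a further rule (for example, a drop in text density), which the abstract does not specify.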
Keywords:Focused crawler  Core content  LCS  Word2vec  Link priority
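As a rough illustration of the link-priority step named in the abstract and keywords, the Python sketch below scores relevance as the cosine similarity between two vectors (standing in for Word2vec-derived vectors), mixes it with an importance score, and serves links from a priority queue. The linear weighting with `alpha`, the `Frontier` class, and all other names are assumptions made for illustration; the abstract states only that the two scores are combined, not how.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_priority(relevance, importance, alpha=0.5):
    """Hypothetical linear mix of topic relevance and topic importance."""
    return alpha * relevance + (1 - alpha) * importance


class Frontier:
    """Crawl frontier that returns the highest-priority URL first."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities pop in insertion order

    def push(self, url, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In a real crawler the vectors would come from a trained Word2vec model and the importance score from the modified PageRank computation; both are replaced by plain numbers here so the sketch stays self-contained.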
This article is indexed in the VIP (维普) database, among others.