Empirical evaluation of the link and content-based focused Treasure-Crawler
Affiliation:1. Department of Computer Science, The George Washington University, Washington DC, United States;2. Computer Networks and Security Laboratory (LARCES), State University of Ceará (UECE), Fortaleza, Ceará, Brazil;3. Faculty of Science, Engineering and Computing, Kingston University, United Kingdom;1. Electrical Engineering Department, Federal University of Espírito Santo—UFES, Av. Fernando Ferrari, 514, Goiabeiras, 29075-910 Vitória, ES, Brazil;2. Computer Science Department, Federal University of Espírito Santo—UFES, Av. Fernando Ferrari, 514, Goiabeiras, 29075-910 Vitória, ES, Brazil;3. Federal Institute of Espírito Santo – IFES, Rodovia ES-010, Km 6,5, Manguinhos, 29173-087 Serra, ES, Brazil
Abstract: Indexing the Web is becoming a laborious task for search engines as the Web grows exponentially in size and distribution. At present, the most effective known approach to this problem is the use of focused crawlers. A focused crawler employs a dedicated algorithm to detect the pages on the Web that relate to its topic of interest. For this purpose we proposed a custom method that uses specific HTML elements of the current page to predict the topical focus of every page reachable from it through an unvisited link. The pages recognized as on-topic are then sorted by their relevance to the crawler's main topic, and this order determines which are actually downloaded. In the Treasure-Crawler, we use a hierarchical structure called T-Graph, which serves as an exemplary guide for assigning an appropriate priority score to each unvisited link. The corresponding URLs are later downloaded in order of this priority. This paper presents the implementation, test results and performance evaluation of the Treasure-Crawler system. The Treasure-Crawler is evaluated in terms of standard information retrieval criteria such as recall and precision, both with values close to 50%. This outcome supports the significance of the proposed approach.
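The two mechanisms the abstract mentions, a frontier of unvisited links ordered by priority score and evaluation by precision and recall, can be illustrated with a minimal Python sketch. It is not the Treasure-Crawler implementation: the `Frontier` class, the `precision_recall` function and the numeric scores are hypothetical, and the T-Graph scoring itself is not reproduced here; scores are assumed to be supplied by some external scorer.

```python
import heapq
from typing import Iterable, List, Set, Tuple


class Frontier:
    """Hypothetical crawl frontier: unvisited URLs, highest score popped first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._seen: Set[str] = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url: str, score: float) -> None:
        # Ignore URLs that were already queued, so each is fetched at most once.
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def precision_recall(
    retrieved: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant (0.0 if empty)."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    frontier = Frontier()
    # Scores below are made-up placeholders, not T-Graph output.
    frontier.push("http://example.org/a", 0.2)
    frontier.push("http://example.org/b", 0.9)
    frontier.push("http://example.org/c", 0.5)
    while len(frontier):
        print(frontier.pop())

    crawled = ["u1", "u2", "u3", "u4"]
    on_topic = ["u1", "u2", "u5", "u6"]
    print(precision_recall(crawled, on_topic))
```

In this sketch the crawler would download `b`, then `c`, then `a`. With four crawled pages of which two are on-topic, and four on-topic pages in total, both precision and recall come out at 0.5, which is the order of magnitude the abstract reports.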
Keywords:
This article is indexed in ScienceDirect and other databases.