Deep Web adaptive crawling based on minimum executable pattern |
| |
Authors: | Jun Liu, Lu Jiang, Zhaohui Wu, Qinghua Zheng |
| |
Affiliation: | (1) MOE KLINNS Lab and SKLMS Lab, Xi’an Jiaotong University, Xi’an, 710049, People’s Republic of China |
| |
Abstract: | The key to Deep Web crawling is to submit valid input values to a query form and retrieve Deep Web content efficiently. In
the literature, related work focuses only on generic text boxes or entire query forms, causing the problem of “data islands”
or inferior validity of query submission. This paper proposes the concept of Minimum Executable Pattern (MEP), a minimal combination
of elements in a query form that can conduct a successful query, and then presents a MEP generation method and a MEP-based
Deep Web adaptive crawling method. The query form is parsed and partitioned into a MEP set, and then locally optimal queries are
generated by choosing a MEP in the MEP set and a keyword vector of the MEP. Furthermore, the crawler can decide when to
terminate, balancing high coverage of the content against resource consumption. The adoption of MEP
is expected to improve the validity of query submission, and adaptive selection among multiple MEPs proves effective in overcoming
the problem of “data islands”. We present a set of experiments to validate the effectiveness of the proposed method. Experimental
results show that our method outperforms state-of-the-art methods in terms of query capability and applicability, and on average,
it achieves good coverage by issuing only a few hundred queries. |
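The crawling loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the toy records, the `submit` simulator, the frequency-based query choice, and the `window`/`min_new` stopping rule are all assumptions made for illustration. It only shows the general idea that each query is a (MEP, keyword vector) pair, that new candidate queries are drawn from records already retrieved, and that the crawler stops when recent queries return too little new content.

```python
from collections import Counter

# Toy "Deep Web" source: hypothetical records hidden behind a query form.
RECORDS = [
    {"id": 1, "title": "web crawling", "author": "liu", "year": "2009"},
    {"id": 2, "title": "deep web", "author": "liu", "year": "2010"},
    {"id": 3, "title": "query forms", "author": "wu", "year": "2010"},
    {"id": 4, "title": "data islands", "author": "zheng", "year": "2011"},
    {"id": 5, "title": "form parsing", "author": "zheng", "year": "2009"},
]

# Hypothetical MEP set: each MEP is a minimal group of form fields
# that can be submitted on its own.
MEPS = [("author",), ("year",)]


def submit(mep, values):
    """Simulate submitting one MEP with a keyword vector (exact match)."""
    return [r for r in RECORDS if all(r[f] == v for f, v in zip(mep, values))]


def crawl(seed, meps=MEPS, max_queries=50, window=2, min_new=1):
    """Greedy crawl (assumed heuristic): issue the unissued (MEP, keyword
    vector) pair seen most often in the records retrieved so far, and stop
    once the last `window` queries return fewer than `min_new` new records."""
    seen = {}                        # record id -> record
    issued = set()                   # queries already submitted
    candidates = Counter({seed: 1})  # candidate query -> occurrence count
    gains = []                       # new records per issued query
    while len(issued) < max_queries:
        pending = [(n, q) for q, n in candidates.items() if q not in issued]
        if not pending:
            break
        _, query = max(pending)      # highest count, deterministic tie-break
        issued.add(query)
        new = [r for r in submit(*query) if r["id"] not in seen]
        for r in new:
            seen[r["id"]] = r
            # Every MEP contributes new candidate keyword vectors.
            for mep in meps:
                candidates[(mep, tuple(r[f] for f in mep))] += 1
        gains.append(len(new))
        if len(gains) >= window and sum(gains[-window:]) < min_new:
            break
    return seen, issued


seed = (("author",), ("liu",))
all_meps, _ = crawl(seed)
one_mep, _ = crawl(seed, meps=[("author",)])
print(sorted(all_meps), sorted(one_mep))
```

In this toy source, crawling with only the `author` MEP never leaves the records written by `liu`, which is a small instance of the “data islands” problem; allowing the crawler to switch to the `year` MEP reaches the remaining records.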
| |
Keywords: | |
This article is indexed in SpringerLink and other databases. |
|