
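Focused Web Entity Search Using the Linked-Path Prediction Model
Cite this article: Huang Jianbin, Sun Heli. Focused Web Entity Search Using the Linked-Path Prediction Model[J]. Journal of Computer Research and Development, 2010, 47(12).
Authors: Huang Jianbin  Sun Heli
Affiliations: 1. (National Exemplary School of Software, Xidian University, Xi'an 710071) (Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049) (jbhuang@xidian.edu.cn)
Funding: Natural Science Basic Research Plan of Shaanxi Province
Abstract: Entity search is a promising research area because it can give users more detailed Web information. A key problem in entity search is collecting the Web pages that contain entities of a specific domain quickly and completely. To address this problem, a website is modeled as a graph of interconnected states, and a link-path prediction learning algorithm, LPC, is proposed. The model learns the optimal paths from the home page of a large website to its target pages, and thereby guides the crawler to quickly locate the target pages that contain Web entities. LPC has two stages. First, it uses the probabilistic undirected graphical model CRF to learn a model of the link paths from the site's home page to the target pages; the CRF model can combine a variety of features of hyperlinks and Web pages, including state features and transition features. Second, it combines reinforcement learning with the trained CRF model to assign priority scores to the hyperlinks in the crawl frontier queue. A discounted-reward method from reinforcement learning computes each link's reward value using the CRF model learned in the path classification stage. Experiments on a large amount of real data from multiple domains show that LPC, the CRF-guided link-path prediction crawling algorithm, clearly outperforms other focused crawling algorithms.

Keywords: entity search  focused crawling  link-path prediction  conditional random field  reinforcement learning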

Focused Web Entity Search Using the Linked-Path Prediction Model
Huang Jianbin, Sun Heli. Focused Web Entity Search Using the Linked-Path Prediction Model[J]. Journal of Computer Research and Development, 2010, 47(12).
Authors:Huang Jianbin  Sun Heli
Abstract: Entity search is a promising research topic because it provides users with more detailed Web information. A key problem in entity search is collecting, quickly and completely, the Web pages that contain relevant entities in a specific domain. To deal with this issue, a website is modeled as a graph over a set of connected important states. A novel algorithm named LPC is then proposed to learn the optimal link sequences leading to the goal pages in which entities are embedded. The LPC algorithm uses a two-stage strategy. In the first stage, it uses CRF, an undirected graphical learning model, to capture sequential link patterns leading to goal pages. The conditional exponential models of CRF can exploit a variety of features, including state and transition features extracted around hyperlinks and HTML pages. In the second stage, the links in the crawling frontier queue are prioritized based on reinforcement learning and the trained CRF model. A discounted-reward approach from reinforcement learning is employed to compute the reward score using the CRF model learned during the path classification phase. Experimental results on a large amount of real data show that the optimal prediction ability of CRF helps LPC outperform other focused crawlers.
Keywords:entity search  focused Web crawling  linked-path prediction  conditional random field  reinforcement learning
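
The second stage described in the abstract, scoring frontier links with a discounted reward, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formula, so the score used here (a sum of per-hop goal probabilities weighted by powers of a discount factor `gamma`) is an assumption, and the probabilities that a trained CRF would supply are replaced by hand-written numbers. The names `discounted_score`, `Frontier`, and `goal_probs_by_link` are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def discounted_score(goal_probs: List[float], gamma: float = 0.5) -> float:
    """Assumed scoring rule: sum of gamma**k * p_k.

    p_k stands in for a model's estimate that a goal page is reached
    k hops after following the link (k = 0 is the link's own target).
    """
    return sum((gamma ** k) * p for k, p in enumerate(goal_probs))


class Frontier:
    """Crawl frontier that always pops the highest-scoring unseen link."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []
        self._seen = set()

    def push(self, url: str, score: float) -> None:
        if url in self._seen:  # ignore links that were already queued
            return
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url))

    def pop(self) -> Tuple[str, float]:
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self) -> int:
        return len(self._heap)


# Hypothetical per-link probabilities; a trained CRF would produce these.
goal_probs_by_link: Dict[str, List[float]] = {
    "/products": [0.1, 0.8],  # goal pages likely one hop further on
    "/about": [0.0, 0.2],
    "/item/42": [0.9, 0.0],   # likely a goal page itself
}

frontier = Frontier()
for link, probs in goal_probs_by_link.items():
    frontier.push(link, discounted_score(probs, gamma=0.5))

while len(frontier) > 0:
    url, score = frontier.pop()
    print(f"{url}\t{score:.2f}")
```

With a discount factor below 1, a link that probably leads directly to a goal page outranks one whose goal pages lie further along the path, which is the behavior a focused crawler wants when its crawl budget is limited.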
This article is indexed in Wanfang Data and other databases.