
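An Approach to Crawling the Deep Web Based on Domain Knowledge Sampling (基于领域知识抽样的深网资源采集方法)
Cite this article: LIN Hailun, XIONG Jinhua, WANG Bo, CHENG Xueqi. An Approach to Crawling the Deep Web Based on Domain Knowledge Sampling[J]. Journal of Chinese Information Processing (中文信息学报), 2016, 30(2): 175-181
Authors: LIN Hailun  XIONG Jinhua  WANG Bo  CHENG Xueqi
Affiliations: 1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190;
2. University of Chinese Academy of Sciences, Beijing 100049;
3. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029
Funding: National Key Technology R&D Program of China (2011BAH11B02, 2012BAH46B04); National 242 Special Project (2013G129); National Natural Science Foundation of China (61300206)
Abstract: Deep web resources are Web database resources hidden behind HTML forms, and they are accessed mainly through form queries. Current web crawling techniques collect resources by following page hyperlinks, so they cannot cover these resources effectively. To address this, the paper proposes a deep web crawling method based on domain knowledge sampling. The method first builds a set of domain attributes using open-source directory services, then assigns values to the attributes based on a confidence function, then uses the domain attribute set to select query interfaces and generate a set of query interface assignments, and finally applies a greedy selection strategy to pick the query interface assignment with the highest confidence and generate query instances for deep web crawling. Experiments show that the method can crawl deep web resources effectively.

Keywords: deep web  confidence  sampling  domain knowledge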

An Approach to Crawling the Deep Web Based on Domain Knowledge Sampling
LIN Hailun,XIONG Jinhua,WANG Bo,CHENG Xueqi. An Approach to Crawling the Deep Web Based on Domain Knowledge Sampling[J]. Journal of Chinese Information Processing, 2016, 30(2): 175-181
Authors:LIN Hailun  XIONG Jinhua  WANG Bo  CHENG Xueqi
Affiliation:1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China;
3. CNCERT/CC, Beijing 100029, China
Abstract:The Deep Web refers to Web database content hidden behind HTML forms, which is accessed mainly by submitting form queries. Current web crawling technologies follow hyperlinks only and therefore cannot cover these resources effectively. To address this, this paper proposes an approach to crawling the deep web based on domain knowledge sampling. First, it creates a set of domain attributes using open-source directory services and assigns values to the attributes based on a confidence function. Second, it uses the domain attribute set to select query interfaces and generate a set of query interface assignments. Finally, using a greedy selection strategy, it picks the assignment with the highest confidence to generate a query instance for deep web crawling. Experiments show that the approach can collect deep web resources effectively.
Keywords:deep web  confidence  sampling  domain knowledge  
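
The pipeline in the abstract (select query interfaces by domain attributes, generate assignments, greedily take the most confident ones) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define the confidence function, so the per-value scores in `DOMAIN_ATTRIBUTES`, the product used to combine them, the form layout, and all function names are hypothetical.

```python
from itertools import product

# Hypothetical domain attribute set: attribute -> {candidate value: confidence}.
# The paper derives these from directory services and a confidence function;
# the numbers here are made up for illustration.
DOMAIN_ATTRIBUTES = {
    "title": {"database": 0.9, "network": 0.6},
    "author": {"smith": 0.7, "li": 0.4},
}


def select_interfaces(forms, domain_attributes):
    """Keep only forms that expose at least one known domain attribute."""
    return [
        form for form in forms
        if any(field in domain_attributes for field in form["fields"])
    ]


def generate_assignments(form, domain_attributes):
    """Enumerate (confidence, assignment) pairs for the form's known fields."""
    matched = [f for f in form["fields"] if f in domain_attributes]
    if not matched:
        return []
    candidates = []
    for combo in product(*(domain_attributes[f].items() for f in matched)):
        assignment = {field: value for field, (value, _) in zip(matched, combo)}
        confidence = 1.0
        for _, score in combo:
            confidence *= score  # assumption: combine scores by product
        candidates.append((confidence, assignment))
    return candidates


def greedy_queries(form, domain_attributes, budget):
    """Greedily pick the `budget` assignments with the highest confidence."""
    ranked = sorted(
        generate_assignments(form, domain_attributes),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [assignment for _, assignment in ranked[:budget]]


if __name__ == "__main__":
    forms = [
        {"url": "http://example.com/search", "fields": ["title", "author", "year"]},
        {"url": "http://example.com/login", "fields": ["user", "password"]},
    ]
    for form in select_interfaces(forms, DOMAIN_ATTRIBUTES):
        for query in greedy_queries(form, DOMAIN_ATTRIBUTES, budget=2):
            print(form["url"], query)
```

Each returned assignment would then be submitted through the form as one query instance; the submission and result-page extraction steps are omitted here.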