Stratified sampling for data mining on the deep web

Authors:	Tantan Liu, Fan Wang, Gagan Agrawal

Affiliation:	Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA

Abstract:	In recent years, the deep web has become extremely popular. As with any other data source, mining deep web data can yield important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly; mining must therefore be performed on samples of the datasets, and these samples can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems, association mining and differential rule mining. These tasks are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs compared with a stratified sampling method that considers only estimation error. In addition, compared with simple random sampling, our algorithm achieves higher sampling accuracy at lower sampling cost.
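To illustrate the general idea of allocating samples with both estimation error and sampling cost in mind, the following is a minimal sketch of the classical textbook cost-aware allocation for stratified sampling, in which the sample size of stratum h is proportional to N_h * S_h / sqrt(c_h). It is not the greedy stratification or the optimized allocation method developed in the paper, whose details the abstract does not give. The function names, the strata sizes, the standard deviations, and the per-query costs are all hypothetical values chosen for illustration.

```python
from math import sqrt


def cost_aware_allocation(sizes, stds, costs, budget):
    """Classical cost-aware allocation (an assumption, not the paper's method).

    sizes  -- stratum population sizes N_h
    stds   -- within-stratum standard deviations S_h
    costs  -- cost of drawing one sample from each stratum c_h
    budget -- total sampling budget C

    Returns fractional sample sizes n_h proportional to N_h * S_h / sqrt(c_h),
    scaled so that sum(c_h * n_h) equals the budget.
    """
    denom = sum(N * S * sqrt(c) for N, S, c in zip(sizes, stds, costs))
    if denom == 0:
        raise ValueError("all strata have zero variance or zero size")
    return [budget * (N * S / sqrt(c)) / denom
            for N, S, c in zip(sizes, stds, costs)]


def stratified_mean(sizes, sample_means):
    """Stratified estimate of the population mean: sum of W_h * mean_h."""
    total = sum(sizes)
    return sum(N / total * m for N, m in zip(sizes, sample_means))


# Hypothetical example: two equally sized strata with equal variance,
# where a query against the second stratum costs four times as much.
sizes = [1000, 1000]
stds = [1.0, 1.0]
costs = [1.0, 4.0]
alloc = cost_aware_allocation(sizes, stds, costs, budget=60.0)
total_cost = sum(c * n for c, n in zip(costs, alloc))
```

In this hypothetical setting the cheaper stratum receives twice as many samples as the expensive one, and the whole budget is spent. In practice the fractional sizes would be rounded, and the standard deviations would have to be estimated from a pilot sample.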

Keywords:	deep web; association rule mining; stratified sampling
Journal:	Frontiers of Computer Science
Indexed in:	SpringerLink and other databases
