Similar Documents
20 similar documents retrieved.
1.
Deep web resources are web database resources hidden behind HTML forms and are accessed mainly through form queries. Because current web crawling techniques collect resources by following page hyperlinks, they cannot effectively cover these resources. To address this, this paper proposes a deep web resource harvesting method based on domain-knowledge sampling. The method first builds a domain attribute set from an open directory service, then assigns values to the attributes using a confidence function, next uses the domain attribute set to select query interfaces and generate a set of query-interface assignments, and finally applies a greedy selection strategy to pick the highest-confidence assignments and generate query instances for deep web harvesting. Experiments show that the method harvests deep web resources effectively.
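A minimal sketch of the greedy, confidence-driven assignment step described in this abstract; the attribute names, confidence scores, and form-submission layer are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of greedy query-instance generation from a domain
# attribute set with confidence-weighted values. Names and scores are assumed.

def generate_query_instances(domain_attributes, form_fields, top_k=3):
    """Pick the top-k (field, value) assignments by confidence and build queries."""
    # Keep only domain attributes that the query form actually exposes.
    candidates = [
        (field, value, conf)
        for field, values in domain_attributes.items() if field in form_fields
        for value, conf in values
    ]
    # Greedy choice: highest-confidence assignments first.
    candidates.sort(key=lambda c: c[2], reverse=True)
    return [{field: value} for field, value, _ in candidates[:top_k]]


# Example domain attribute set built offline (assumed data).
domain_attributes = {
    "genre":  [("fiction", 0.9), ("history", 0.7)],
    "format": [("paperback", 0.6)],
}
form_fields = {"genre", "format", "keyword"}

for query in generate_query_instances(domain_attributes, form_fields):
    print("would submit form with:", query)
```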

2.
The escalation of deep web databases has been phenomenal over the last decade, spawning a growing interest in automated discovery of interesting relationships among available deep web databases. Unlike the “surface” web of static pages, these deep web databases provide data through a web-based query interface and account for a huge portion of all web content. This paper presents a novel source-biased approach to efficiently discover interesting relationships among web-enabled databases on the deep web. Our approach supports a relationship-centric view over a collection of deep web databases through source-biased database analysis and exploration. Our source-biased approach has three unique features: First, we develop source-biased probing techniques, which allow us to determine in very few interactions whether a target database is relevant to the source database by probing the target with very precise probes. Second, we introduce source-biased relevance metrics to evaluate the relevance of deep web databases discovered, to identify interesting types of source-biased relationships for a collection of deep web databases, and to rank them accordingly. The source-biased relationships discovered not only present value-added metadata for each deep web database but can also provide direct support for personalized relationship-centric queries. Third, but not least, we also develop a performance optimization using source-biased probing with focal terms to further improve the effectiveness of the basic source-biased model. A prototype system is designed for crawling, probing, and supporting relationship-centric queries over deep web databases using the source-biased approach. Our experiments evaluate the effectiveness of the proposed source-biased analysis and discovery model, showing that the source-biased approach outperforms query-biased probing and unbiased probing.
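A toy illustration of the probing idea: terms drawn from the source database are used as probes against a target, and relevance is scored by how often the target returns matches. The scoring formula below is an assumption for illustration, not the paper's exact source-biased relevance metric.

```python
# Mock source-biased probing over in-memory "databases"; target_query_fn would
# wrap a real query interface in practice.

from collections import Counter

def top_source_terms(source_docs, k=5):
    """Use the source's most frequent terms as biased probes."""
    counts = Counter(term for doc in source_docs for term in doc.lower().split())
    return [term for term, _ in counts.most_common(k)]

def source_biased_relevance(source_docs, target_query_fn, k=5):
    """Fraction of source-biased probes for which the target returns results."""
    probes = top_source_terms(source_docs, k)
    hits = sum(1 for term in probes if target_query_fn(term))
    return hits / len(probes) if probes else 0.0

source = ["deep web database probing", "database relevance probing"]
target = ["web database indexing", "relevance feedback in databases"]
query_target = lambda term: any(term in doc for doc in target)

print(source_biased_relevance(source, query_target))
```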

3.
The web is flooded with data. While the crawler is responsible for accessing web pages and handing them to the indexer so that they become available to search engine users, the rate at which these pages change forces the crawler to employ refresh strategies in order to serve updated content. Furthermore, the deep web is the part of the web that holds abundant, high-quality data (compared with the surface web) but is not technically accessible to a search engine's crawler. Existing deep web crawl methods access deep web data from result pages generated by filling forms with a set of queries and querying the web databases through them; however, these methods cannot maintain the freshness of the local databases. Both the surface web and the deep web therefore need an incremental crawl attached to the normal crawl architecture. Crawling the deep web requires selecting an appropriate set of queries that covers almost all records in the data source with little overlap among records, so that network utilization stays low; since an incremental crawl increases network utilization with every increment, such a reduced query set should be used to minimize it. Our contributions are: a probabilistic incremental crawler that handles dynamic changes in surface web pages; an adaptation of this method to handle dynamic changes in deep web databases; a new evaluation measure, the 'Crawl-hit rate', which measures how often a crawl predicted to be necessary actually is; and a semantic weighted set covering algorithm that reduces the query set so that the network cost of every crawl increment drops without any loss in the number of records retrieved. The evaluation shows a clear improvement in database freshness and a good Crawl-hit rate (83% for web pages and 81% for deep web databases) with lower overhead than the baseline.
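A hedged sketch of the query-reduction idea: a greedy (weighted) set-cover pass that prefers queries covering many not-yet-covered records per unit cost. The cost model and data are assumptions; the paper's algorithm additionally weights queries by semantic relevance.

```python
# Greedy weighted set cover over candidate queries and the record ids they return.

def greedy_query_cover(query_results, all_records, cost=None):
    """query_results: {query: set of record ids it returns}."""
    cost = cost or {q: 1.0 for q in query_results}
    uncovered, chosen = set(all_records), []
    while uncovered:
        # Pick the query with the best (new records covered) / cost ratio.
        best = max(query_results, key=lambda q: len(query_results[q] & uncovered) / cost[q])
        gain = query_results[best] & uncovered
        if not gain:
            break  # remaining records are unreachable with the candidate queries
        chosen.append(best)
        uncovered -= gain
    return chosen

query_results = {
    "q1": {1, 2, 3, 4},
    "q2": {3, 4, 5},
    "q3": {5, 6},
}
print(greedy_query_cover(query_results, all_records=range(1, 7)))  # ['q1', 'q3']
```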

4.
When extracting knowledge (or patterns) from multiple databases, the data from different databases might be too voluminous to merge into one database for centralized mining on a single computer, the local information sources might be hidden from a global decision maker due to privacy concerns, and different local databases may contribute differently to the global pattern. Dealing with multiple databases is essentially different from mining a single database: in multi-database mining, global patterns must be obtained by carefully analyzing the local patterns from the individual databases. In this paper, we propose a nonlinear method named KEMGP (kernel estimation for mining global patterns), which adopts kernel estimation to synthesize local patterns into global patterns. We also divide the data in the different databases according to attribute dimensionality, which reduces the total space complexity. We test our algorithm on a customer management system where the goal is to obtain all globally interesting patterns by analyzing the individual databases. The experimental results show that our method is efficient.
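An illustrative sketch of synthesizing local pattern statistics into a global estimate with a Gaussian kernel, in the spirit of kernel estimation over local patterns. The kernel, bandwidth, and data are assumptions and do not reproduce the KEMGP formulation in the paper.

```python
import math

def kernel_density(x, local_supports, bandwidth=0.05):
    """Gaussian kernel density estimate over per-database support values."""
    n = len(local_supports)
    return sum(
        math.exp(-((x - s) / bandwidth) ** 2 / 2) / (bandwidth * math.sqrt(2 * math.pi))
        for s in local_supports
    ) / n

# Support of one pattern reported by four local databases (assumed values).
local_supports = [0.42, 0.45, 0.40, 0.47]

# Evaluate the density on a grid and take the mode as the synthesized estimate.
grid = [i / 1000 for i in range(1001)]
global_support = max(grid, key=lambda x: kernel_density(x, local_supports))
print(f"synthesized global support ≈ {global_support:.3f}")
```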

5.
Managing large-scale time series databases has attracted significant attention in the database community recently. Related fundamental problems such as dimensionality reduction, transformation, pattern mining, and similarity search have been studied extensively. Although time series data are dynamic by nature, as in data streams, current solutions to these fundamental problems have mostly targeted static time series databases. In this paper, we first propose a framework for online summary generation for large-scale, dynamic time series data such as data streams. We then propose online transform-based summarization techniques over data streams that can be updated in constant time and space. We present both exact and approximate versions of the proposed techniques and provide error bounds for the approximate case. One of our main contributions is an extensive performance analysis: our experiments carefully evaluate the quality of the online summaries for point, range, and k-NN queries using real-life dynamic data sets of substantial size.
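A small sketch of one transform-based summary with constant-time updates: the first few DFT coefficients of a sliding window, maintained with the standard sliding-DFT recurrence. This is one illustration of the idea; the paper's specific summaries and error bounds are not reproduced here.

```python
import cmath
from collections import deque

class SlidingDFTSummary:
    def __init__(self, window, num_coeffs=4):
        self.N = window
        self.buf = deque([0.0] * window, maxlen=window)
        self.coeffs = [0j] * num_coeffs  # low-frequency DFT coefficients

    def update(self, x_new):
        """O(num_coeffs) update when a new point arrives and the oldest leaves."""
        x_old = self.buf[0]
        self.buf.append(x_new)
        for k in range(len(self.coeffs)):
            twiddle = cmath.exp(2j * cmath.pi * k / self.N)
            self.coeffs[k] = (self.coeffs[k] - x_old + x_new) * twiddle

summary = SlidingDFTSummary(window=8, num_coeffs=3)
for value in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]:
    summary.update(value)
print([abs(c) for c in summary.coeffs])
```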

6.
We propose a general method for error estimation that displays low variance and generally low bias as well. This method is based on “bolstering” the original empirical distribution of the data. It has a direct geometric interpretation and can be easily applied to any classification rule and any number of classes. This method can be used to improve the performance of any error-counting estimation method, such as resubstitution and all cross-validation estimators, particularly in small-sample settings. We point out some similarities shared by our method with a previously proposed technique, known as smoothed error estimation. In some important cases, such as a linear classification rule with a Gaussian bolstering kernel, the integrals in the bolstered error estimate can be computed exactly. In the general case, the bolstered error estimate may be computed by Monte-Carlo sampling; however, our experiments show that a very small number of Monte-Carlo samples is needed. This results in a fast error estimator, which is in contrast to other resampling techniques, such as the bootstrap. We provide an extensive simulation study comparing the proposed method with resubstitution, cross-validation, and bootstrap error estimation, for three popular classification rules (linear discriminant analysis, k-nearest-neighbor, and decision trees), using several sample sizes, from small to moderate. The results indicate the proposed method vastly improves on resubstitution and cross-validation, especially for small samples, in terms of bias and variance. In that respect, it is competitive with, and on many occasions superior to, bootstrap error estimation, while being tens to hundreds of times faster. We provide a companion web site, which contains: (1) the complete set of tables and plots regarding the simulation study, and (2) C source code used to implement the bolstered error estimators proposed in this paper, as part of a larger library for classification and error estimation, with full documentation and examples. The companion web site can be accessed at the URL http://ee.tamu.edu/~edward/bolster.
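A Monte-Carlo sketch of bolstered resubstitution: each training point is replaced by a small Gaussian cloud and the classifier's error is averaged over the cloud. The classifier (nearest centroid), kernel width, and data below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_fit(X, y):
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def nearest_centroid_predict(model, X):
    labels = list(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in labels], axis=1)
    return np.array(labels)[dists.argmin(axis=1)]

def bolstered_resubstitution(X, y, sigma=0.3, mc_samples=25):
    """Average misclassification rate over Gaussian 'bolstering' clouds."""
    model = nearest_centroid_fit(X, y)
    errors = 0.0
    for x_i, y_i in zip(X, y):
        cloud = x_i + sigma * rng.standard_normal((mc_samples, X.shape[1]))
        errors += np.mean(nearest_centroid_predict(model, cloud) != y_i)
    return errors / len(X)

X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print("bolstered resubstitution error:", bolstered_resubstitution(X, y))
```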

7.
A basic assumption in data mining is that the training and test samples follow the same data distribution, but as data volumes grow it becomes increasingly difficult to find representative data within massive datasets. A study of existing data selection methods shows that traditional approaches such as simple random sampling and progressive sampling are not coupled with the data mining tools: their results are incidental and uncertain, the sampled data can hardly guarantee the basic assumption of data mining, and the resulting models have large generalization error. To address class imbalance during sampling, a structured data sampling method based on two decision trees is proposed. First, a decision tree is generated with the C4.5 algorithm and used to select suitable data and sampling points from the data source; at the same time, a second decision tree evaluates the quality of the selected dataset, achieving efficient, high-quality sampling. Experiments show that, compared with simple random sampling, models trained on the newly sampled data achieve clearly higher accuracy.
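A rough sketch of the dual-tree idea: one decision tree partitions the data pool into leaves that act as sampling strata, and a second tree trained on the drawn sample is checked against held-out data as a quality gate. scikit-learn's CART tree stands in for C4.5 here; the thresholds and data are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Tree 1: defines strata (leaves) over the full pool; sample evenly per leaf.
stratifier = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)
leaves = stratifier.apply(X)
sample_idx = np.concatenate([
    rng.choice(np.flatnonzero(leaves == leaf), size=25, replace=True)
    for leaf in np.unique(leaves)
])

# Tree 2: trained on the sample, evaluated on the rest as a quality check.
rest = np.setdiff1d(np.arange(len(X)), sample_idx)
evaluator = DecisionTreeClassifier(random_state=0).fit(X[sample_idx], y[sample_idx])
quality = evaluator.score(X[rest], y[rest])
print(f"sample accepted: {quality >= 0.9} (held-out accuracy {quality:.2f})")
```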

8.
Compared with structured data sources that are usually stored and analyzed in spreadsheets, relational databases, and single data tables, unstructured construction data sources such as text documents, site images, web pages, and project schedules have been less intensively studied due to additional challenges in data preparation, representation, and analysis. In this paper, our vision for data management and mining addressing such challenges is presented, together with related research results from previous work, as well as our recent developments in data mining on text-based, web-based, image-based, and network-based construction databases.

9.
高明  黄哲学 《集成技术》2012,1(3):47-54
With the rapid growth in the number and scale of Deep Web sources, issuing queries against them to obtain relevant information stored in their back-end databases has increasingly become the main way users acquire information. To help users exploit Deep Web information effectively, more and more researchers are working in this area, with data integration over Deep Web back-end databases as one of the focal points. Since these back-end databases mainly store textual information, research that approaches querying and retrieving Deep Web content from a text-processing perspective has very broad application prospects. This paper analyzes the current state of Deep Web research in some detail and discusses directions for future work.

10.
邓松  万常选 《软件学报》2017,28(12):3241-3256
In deep web data integration, users want high-quality results while querying only a small number of data sources, which makes data source selection a core technique. To satisfy integrated retrieval requirements based on both relevance and diversity, a deep web data source selection method suited to summaries built from small document samples is proposed. During selection, the method first measures the relevance of each data source to the user query and then considers the diversity of the data offered by the candidate sources. To improve the accuracy of relevance judgments, a hierarchical-topic-based data source summary is constructed, into which a topic-content relevance bias probability model is introduced; a method for building this bias model from human feedback and a probabilistic relevance measure for data sources are given. To increase the diversity of the selection results, directed diversity links are added to the hierarchical-topic summaries, and an evaluation method for data source diversity is provided. Finally, the relevance-and-diversity-based source selection problem is cast as a combinatorial optimization problem, and a selection strategy based on an optimization function is proposed. Experimental results show that the method achieves high selection accuracy when sources are selected from a small number of sampled documents.
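A simplified sketch of picking data sources by combining relevance to the query with diversity against already-selected sources, as a greedy trade-off. The scores, the lambda weight, and the similarity function are assumptions for illustration, not the paper's hierarchical-topic model.

```python
def select_sources(relevance, similarity, k=2, lam=0.7):
    """relevance: {source: score}; similarity(a, b) -> overlap in [0, 1]."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def marginal_gain(s):
            redundancy = max((similarity(s, t) for t in selected), default=0.0)
            return lam * relevance[s] - (1 - lam) * redundancy
        best = max(candidates, key=marginal_gain)
        selected.append(best)
        candidates.remove(best)
    return selected

relevance = {"srcA": 0.9, "srcB": 0.85, "srcC": 0.6}
overlap = {("srcA", "srcB"): 0.9, ("srcA", "srcC"): 0.1, ("srcB", "srcC"): 0.2}
sim = lambda a, b: overlap.get((a, b), overlap.get((b, a), 0.0))

print(select_sources(relevance, sim))  # srcA first, then the less redundant srcC
```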

11.
A critical problem in planning sampling paths for autonomous underwater vehicles (AUVs) is correctly balancing two issues: first, obtaining an accurate estimate of the scalar field, and second, efficiently using the vehicle's stored energy capacity. Adaptive sampling approaches can only provide solutions when real-time and a priori environmental data are available. In this paper we present an analysis of adaptive sampling methodologies for AUVs. In particular, we analyze various sampling path strategies, including systematic and stratified random patterns, across a wide range of sampling densities, and their impact on the vehicle's energy consumption through a cost-evaluation function. Our study demonstrates that a systematic spiral sampling path strategy is optimal for high-variance scalar fields at all sampling densities and for low-variance scalar fields when sampling is sparse. In addition, our results show that the random spiral sampling path strategy is optimal for low-variance scalar fields when sampling is dense.

12.
In this work we consider a fuzzy set based approach to the issue of discovery in databases (database mining). The concept of linguistic summaries is described and shown to be a user friendly way to present information contained in a database. We discuss methods for measuring the amount of information provided by a linguistic summary. The issue of conjecturing, how to decide on which summaries may be informative, is discussed. We suggest two approaches to help us focus on relevant summaries. The first method, called the template method, makes use of linguistic concepts related to the domain of the attributes involved in the summaries. The second approach uses the mountain clustering method to help focus our summaries.
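A compact illustration of scoring a linguistic summary of the form "MOST records have HIGH salary" with fuzzy membership functions, in the spirit of such linguistic summaries. The membership functions and data are assumptions chosen for the example.

```python
def mu_high_salary(salary, low=40_000, high=80_000):
    """Membership in 'high salary': 0 below `low`, 1 above `high`, linear between."""
    return min(1.0, max(0.0, (salary - low) / (high - low)))

def mu_most(proportion):
    """Fuzzy quantifier 'most': 0 below 0.3, 1 above 0.8, linear between."""
    return min(1.0, max(0.0, (proportion - 0.3) / 0.5))

def summary_truth(records, mu_summarizer, mu_quantifier):
    """Truth of 'Q records are S' = mu_Q(average membership in S)."""
    avg_membership = sum(mu_summarizer(r) for r in records) / len(records)
    return mu_quantifier(avg_membership)

salaries = [35_000, 52_000, 61_000, 75_000, 90_000, 88_000]
print("T(most employees have high salary) =",
      round(summary_truth(salaries, mu_high_salary, mu_most), 2))
```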

13.
The popularity of many-light rendering, which converts complex global illumination computations into a simple sum of the illumination from virtual point lights (VPLs), for predictive rendering has increased in recent years. A huge number of VPLs are usually required for predictive rendering at the cost of extensive computational time. While previous methods can achieve significant speedup by clustering VPLs, none of these previous methods can estimate the total errors due to clustering. This drawback imposes on users tedious trial and error processes to obtain rendered images with reliable accuracy. In this paper, we propose an error estimation framework for many-light rendering. Our method transforms VPL clustering into stratified sampling combined with confidence intervals, which enables the user to estimate the error due to clustering without the costly computing required to sum the illumination from all the VPLs. Our estimation framework is capable of handling arbitrary BRDFs and is accelerated by using visibility caching, both of which make our method more practical. The experimental results demonstrate that our method can estimate the error much more accurately than the previous clustering method.
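A small numerical analogue of the idea: treat VPL clusters as strata, sample a few light contributions per stratum, and derive a confidence interval for the total illumination without summing every VPL. The contributions below are random placeholders; the actual method estimates per-shading-point error with arbitrary BRDFs and visibility caching.

```python
import numpy as np

rng = np.random.default_rng(7)

# Three "clusters" of virtual point lights with different contribution scales.
clusters = [rng.gamma(2.0, scale, size=n) for scale, n in [(0.1, 5000), (0.5, 800), (2.0, 50)]]

total_est, var_est = 0.0, 0.0
for vpls in clusters:
    n, m = len(vpls), min(16, len(vpls))       # m sampled lights per stratum
    sample = rng.choice(vpls, size=m, replace=False)
    total_est += n * sample.mean()             # stratum total from the sample mean
    var_est += (n ** 2) * sample.var(ddof=1) / m

ci = 1.96 * np.sqrt(var_est)                   # ~95% normal-approximation interval
true_total = sum(v.sum() for v in clusters)
print(f"estimate {total_est:.1f} ± {ci:.1f}, true total {true_total:.1f}")
```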

14.
Frequent itemset mining allows us to find hidden, important information from large databases. Moreover, processing incremental databases in the itemset mining area has become more essential because a huge amount of data has been accumulated continually in a variety of application fields and users want to obtain mining results from such incremental data in more efficient ways. One of the major problems in incremental itemset mining is that the corresponding mining results can be very large-scale according to threshold settings and data volumes. In addition, it is considerably hard to analyze all of them and find meaningful information. Furthermore, not all of the mining results become actually important information. In this paper, to solve these problems, we propose an algorithm for mining weighted maximal frequent itemsets from incremental databases. By scanning a given incremental database only once, the proposed algorithm can not only conduct its mining operations suitable for the incremental environment but also extract a smaller number of important itemsets compared to previous approaches. The proposed method also has an effect on expert and intelligent systems since it can automatically provide more meaningful pattern results reflecting characteristics of given incremental databases and threshold settings, which can help users analyze the given data more easily. Our comprehensive experimental results show that the proposed algorithm is more efficient and scalable than previous state-of-the-art algorithms.
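A brute-force sketch of what "weighted maximal frequent itemsets" means: weighted support as frequency times mean item weight, with an itemset being maximal if no frequent superset exists. This enumeration is only for intuition; the paper's contribution is a single-scan incremental algorithm, not this exhaustive search.

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
weights = {"a": 0.9, "b": 0.6, "c": 0.4}   # assumed per-item weights
min_wsup = 0.3

def weighted_support(itemset):
    freq = sum(itemset <= t for t in transactions) / len(transactions)
    return freq * (sum(weights[i] for i in itemset) / len(itemset))

items = sorted(weights)
frequent = [
    frozenset(c)
    for r in range(1, len(items) + 1)
    for c in combinations(items, r)
    if weighted_support(frozenset(c)) >= min_wsup
]
maximal = [s for s in frequent if not any(s < t for t in frequent)]
print([set(s) for s in maximal])
```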

15.
Knowledge discovery in databases is used to discover useful and understandable knowledge from large databases. A process of knowledge discovery consists of two steps: the data mining step and the evaluation step. In this paper, evaluating and ranking the interestingness of summaries generated from databases, which is a part of the second step, is studied using diversity measures. Sixteen previously analyzed diversity measures of interestingness are used along with three not previously considered ones, brought from different well-known areas. The latter three measures are evaluated theoretically according to five principles that a measure must satisfy to be qualified as acceptable for ranking summaries. A theoretical correlation study between the eight measures that satisfy all five principles is presented based on mathematical proofs. An empirical evaluation is conducted using three real databases. Then, a classification of the eight measures is deduced. The resulting classification is used to reduce the number of measures to only two, which are the best over all criteria and produce non-similar results. This helps the user interpret the most important discovered knowledge in the decision-making process.
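Two classic diversity measures (variance of the probability distribution and Shannon entropy) computed over the tuple-count distribution of a summary and used here to rank toy summaries. These are common examples from this literature, not necessarily the two measures the paper finally recommends.

```python
import math

def variance_measure(counts):
    """Variance of the summary's probability distribution around uniform."""
    n = sum(counts)
    p = [c / n for c in counts]
    q = 1 / len(counts)                        # uniform reference distribution
    return sum((pi - q) ** 2 for pi in p) / (len(counts) - 1)

def shannon_entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

summaries = {
    "summary_by_region":  [120, 115, 130, 118],   # nearly uniform -> less interesting
    "summary_by_product": [430, 20, 15, 18],      # skewed -> more interesting
}
ranked = sorted(summaries, key=lambda s: variance_measure(summaries[s]), reverse=True)
for name in ranked:
    print(name, round(variance_measure(summaries[name]), 4),
          round(shannon_entropy(summaries[name]), 3))
```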

16.
Existing methods for quantifying and quickly estimating sampling error in association rule mining have shortcomings. To address this, a new triple model for quantifying sampling error is proposed, and, based on experimental observation and theoretical analysis, a fast sampling-error estimation algorithm, called main-error interval estimation, is given. Theoretical analysis and experimental results both show that the model can accurately and effectively measure the difference between the frequent-pattern information contained in a sample and that in the original dataset, and that main-error interval estimation can estimate the sampling error accurately and quickly and can be flexibly embedded in the various sampling methods used for association rule mining; its core idea can also be used to improve the efficiency of distributed and parallel association rule mining.
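A naive, direct way to measure the kind of sampling error this abstract addresses: compare single-item supports in a random sample against the full dataset. This brute-force comparison is the baseline a fast estimator is meant to avoid; the triple model and main-error interval estimator themselves are not reproduced here.

```python
import random

random.seed(3)
items = list("abcdef")
full_db = [{i for i in items if random.random() < 0.3} for _ in range(10_000)]
sample = random.sample(full_db, 500)

def support(db, item):
    """Fraction of transactions containing the item."""
    return sum(item in t for t in db) / len(db)

errors = [abs(support(full_db, i) - support(sample, i)) for i in items]
print("max absolute support error:", round(max(errors), 4))
print("mean absolute support error:", round(sum(errors) / len(errors), 4))
```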

17.
Bayesian estimation is a major and robust estimator for many advanced statistical models. Being able to incorporate prior knowledge in statistical inference, Bayesian methods have been successfully applied in many different fields such as business, computer science, economics, epidemiology, genetics, imaging, and political science. However, due to its high computational complexity, Bayesian estimation has been deemed difficult, if not impractical, for large-scale databases, stream data, data warehouses, and data in the cloud. In this paper, we propose a novel compression and aggregation scheme (C&A) that enables distributed, parallel, or incremental computation of Bayesian estimates. Assuming partitioning of a large dataset, the C&A scheme compresses each partition into a synopsis and aggregates the synopses into an overall Bayesian estimate without accessing the raw data. Such a C&A scheme can find applications in OLAP for data cubes, stream data mining, and cloud computing. It saves tremendous computing time since it processes each partition only once, enabling fast incremental update, and allows parallel processing. We prove that the compression is asymptotically lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed C&A scheme can make feasible OLAP of Bayesian estimates in a data cube. Further, it supports real-time Bayesian analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically save time and space costs with almost no degradation of the modeling accuracy.
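One concrete instance of the compress-and-aggregate idea: for a Gaussian likelihood with known variance and a normal prior on the mean, each partition only needs (count, sum) as its synopsis, and the synopses combine into the exact posterior without revisiting the raw data. The paper's scheme is more general; this sketch just illustrates the lossless-aggregation point under assumed parameters.

```python
import numpy as np

rng = np.random.default_rng(42)
partitions = [rng.normal(5.0, 2.0, size=n) for n in (1_000, 5_000, 2_500)]

# Compression step: one tiny synopsis per partition.
synopses = [(len(p), p.sum()) for p in partitions]

# Aggregation step: combine synopses into the posterior N(mu_n, tau_n^2)
# for the mean, with prior N(mu0, tau0^2) and known data variance sigma^2.
mu0, tau0_sq, sigma_sq = 0.0, 100.0, 4.0
n_total = sum(n for n, _ in synopses)
sum_total = sum(s for _, s in synopses)
post_var = 1.0 / (1.0 / tau0_sq + n_total / sigma_sq)
post_mean = post_var * (mu0 / tau0_sq + sum_total / sigma_sq)
print(f"posterior mean {post_mean:.3f}, posterior sd {post_var ** 0.5:.4f}")
```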

18.
Deep web sources in chemistry and chemical engineering carry a large volume of highly specialized information but are hard to search. This paper studies deep web information mining techniques and the characteristics of chemistry and chemical engineering deep web sources, and applies them in a mining system for such resources. The system automatically fills in and submits deep web entry forms by extracting form tags and combining them with seeds from a chemical property dictionary to synthesize absolute URLs, and it provides an initial implementation of extracting chemical and chemical engineering data from the returned result pages by combining the XPath document-addressing language with XSLT data-processing logic.
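A self-contained sketch of the two steps described above: pull the query form's fields out of an entry page with XPath, synthesize a GET URL from a dictionary seed, and extract result-table cells with XPath. The HTML, field names, and seed term are made-up stand-ins for a real property database; the paper additionally uses XSLT for the extraction logic.

```python
from urllib.parse import urlencode, urljoin
from lxml import html

entry_page = """
<html><body>
<form action="/search" method="get">
  <input name="compound" type="text"/>
  <select name="property"><option value="bp">boiling point</option></select>
</form>
</body></html>"""

doc = html.fromstring(entry_page)
form = doc.xpath("//form")[0]
fields = doc.xpath("//form//input/@name | //form//select/@name")

seed = {"compound": "ethanol", "property": "bp"}       # from a dictionary of seeds
query = {name: seed.get(name, "") for name in fields}
url = urljoin("http://example.org/db/", form.get("action")) + "?" + urlencode(query)
print("synthesized query URL:", url)

result_page = ("<html><body><table><tr><td>ethanol</td><td>78.37 °C</td></tr>"
               "</table></body></html>")
cells = html.fromstring(result_page).xpath("//tr/td/text()")
print("extracted record:", dict(zip(["compound", "boiling point"], cells)))
```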

19.
Many deep web query applications on the Web, such as multimedia data search and the aggregation of group-buying site information, must query a large number of data sources to obtain enough data, and their success depends on the efficiency and effectiveness of querying multiple sources. Current research emphasizes the relevance between queries and data sources while ignoring the overlap among sources, so identical results are retrieved repeatedly from different sources, which increases query cost and the workload on the sources. To improve deep web query efficiency, a tuple-level stratified sampling method is proposed to estimate and exploit query statistics over data sources and to select highly relevant, low-overlap sources. The method has two stages: in the offline stage, data sources are stratified and sampled at the tuple level to obtain sample data; in the online stage, the coverage and overlap of the query over the data sources are estimated iteratively from the samples, and a heuristic strategy is used to discover low-overlap sources efficiently. Experimental results show that the method significantly improves the precision and efficiency of selecting overlapping data sources.
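A toy version of the online stage: estimate each source's coverage of a query and its overlap with already-selected sources from small tuple samples, then greedily pick high-coverage, low-overlap sources. The sample data and the simple frequency estimators are assumptions for illustration.

```python
def coverage(sample, query_term):
    """Estimated fraction of a source's tuples matching the query."""
    return sum(query_term in t for t in sample) / len(sample)

def overlap(sample, retrieved_tuples):
    """Estimated fraction of a source's tuples already retrieved elsewhere."""
    return sum(t in retrieved_tuples for t in sample) / len(sample)

samples = {                       # tuple-level samples drawn offline per source
    "s1": ["red shoes", "red hat", "blue hat", "red scarf"],
    "s2": ["red shoes", "red hat", "red scarf", "red boots"],
    "s3": ["green coat", "red gloves", "blue coat", "red coat"],
}

query, selected, retrieved = "red", [], set()
for _ in range(2):
    best = max((s for s in samples if s not in selected),
               key=lambda s: coverage(samples[s], query) - overlap(samples[s], retrieved))
    selected.append(best)
    retrieved.update(t for t in samples[best] if query in t)
print("selected sources:", selected)   # second pick avoids the high-overlap source
```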

20.
Incompleteness due to missing attribute values (aka “null values”) is very common in autonomous web databases, on which user accesses are usually supported through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical missing attributes, even if they wind up being relevant to a user query. Ideally we would like the mediator to retrieve such possible answers and gauge their relevance by assessing their likelihood of being pertinent answers to the query. The autonomous nature of web databases poses several challenges in realizing this objective. Such challenges include the restricted access privileges imposed on the data, the limited support for query patterns, and the bounded pool of database and network resources in the web environment. We introduce a novel query rewriting and optimization framework QPIAD that tackles these challenges. Our technique involves reformulating the user query based on mined correlations among the database attributes. The reformulated queries are aimed at retrieving the relevant possible answers in addition to the certain answers. QPIAD is able to gauge the relevance of such queries allowing tradeoffs in reducing the costs of database query processing and answer transmission. To support this framework, we develop methods for mining attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in the form of Naïve Bayes Classifiers), and selectivity estimates. We present empirical studies to demonstrate that our approach is able to effectively retrieve relevant possible answers with high precision, high recall, and manageable cost.
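A miniature illustration of this style of rewriting: when a query constrains an attribute that may be null, use a mined correlation (here, a hypothetical Model → Body dependency) to issue rewritten queries on the correlated attribute and rank retrieved tuples by the estimated probability that the missing value matches. The data, attributes, and simple conditional-probability estimator are assumptions; the paper uses AFDs together with Naive Bayes classifiers and selectivity estimates.

```python
from collections import Counter, defaultdict

complete = [  # tuples with both attributes present (training data for the mediator)
    {"model": "CR-V", "body": "SUV"}, {"model": "CR-V", "body": "SUV"},
    {"model": "Civic", "body": "Sedan"}, {"model": "Civic", "body": "Sedan"},
    {"model": "Civic", "body": "Coupe"},
]
incomplete = [{"model": "CR-V", "body": None}, {"model": "Civic", "body": None}]

# Estimate P(body | model) from the complete tuples.
counts = defaultdict(Counter)
for t in complete:
    counts[t["model"]][t["body"]] += 1
p_body = {m: {b: c / sum(cnt.values()) for b, c in cnt.items()} for m, cnt in counts.items()}

# Original query: body = 'SUV'. Rewritten queries target correlated model values,
# and possible answers are ranked by the estimated probability of matching.
query_body = "SUV"
possible = [(t, p_body.get(t["model"], {}).get(query_body, 0.0)) for t in incomplete]
for tup, relevance in sorted(possible, key=lambda x: x[1], reverse=True):
    print(tup, f"P(body={query_body} | model) = {relevance:.2f}")
```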
