Similar Documents
20 similar documents found (search time: 15 ms)
1.
2.
A large number of web pages contain data structured in the form of "lists". Many such lists can be further split into multi-column tables, which can then be used in more semantically meaningful tasks. However, harvesting relational tables from such lists can be a challenging task. The lists are manually generated and hence need not have well-defined templates: they have inconsistent delimiters (if any) and often have missing information. We propose a novel technique for extracting tables from lists. The technique is domain independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments. In particular, we exploit a corpus of HTML tables, also extracted from the web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table's quality. We conducted an extensive experimental study using both real web lists and lists derived from tables on the web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique on a large sample of about 100,000 lists crawled from the web. The analysis of the extracted tables has led us to believe that there are likely to be tens of millions of useful and query-able relational tables extractable from lists on the web.
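The line-splitting step lends itself to a compact illustration. The sketch below is a deliberately naive, hypothetical rendering (all names invented): it picks the delimiter whose splits agree on a single column count across the most lines and pads short rows. The actual technique additionally scores candidate fields and alignments against an HTML-table corpus.

```python
from collections import Counter

def split_lines_into_table(lines, delims=("|", ";", ",", "\t")):
    """Toy unsupervised list-to-table splitting: choose the delimiter whose
    splits agree on one column count across the most lines, then pad every
    line to that width."""
    best = None  # (agreement votes, delimiter, column count)
    for d in delims:
        widths = Counter(len(line.split(d)) for line in lines)
        width, votes = widths.most_common(1)[0]
        if width > 1 and (best is None or votes > best[0]):
            best = (votes, d, width)
    if best is None:
        return [[line] for line in lines]       # no usable delimiter: one column
    _, d, width = best
    table = []
    for line in lines:
        fields = [f.strip() for f in line.split(d)]
        fields += [""] * (width - len(fields))  # pad missing fields
        table.append(fields[:width])            # truncate over-splits
    return table

rows = split_lines_into_table([
    "Alice | 1975 | London",
    "Bob | 1980",              # missing city is padded with ""
    "Carol | 1969 | Paris",
])
```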

3.
Most scientific databases consist of datasets (or sources) which in turn include samples (or files) with an identical structure (or schema). In many cases, samples are associated with rich metadata describing the process that leads to building them (e.g., the experimental conditions used during sample generation). Metadata are typically used in scientific computations just for the initial data selection; at most, metadata about query results is recovered after executing the query and associated with its results by post-processing. In this way, a large body of information that could be relevant for interpreting query results goes unused during query processing.

In this paper, we present ScQL, a new algebraic relational language whose operations apply to objects consisting of data-metadata pairs, preserving such one-to-one correspondence throughout the computation. We formally define each operation and we describe an optimization, called meta-first, that may significantly reduce the query processing overhead by anticipating the use of metadata for selectively loading into the execution environment only those input samples that contribute to the result samples.

In ScQL, metadata have the same relevance as data and contribute to building query results; in this way, the resulting samples are systematically associated with metadata about either the specific input samples involved or about query processing, thereby yielding a new form of metadata provenance. We present many examples of the use of ScQL, relative to several application domains, and we demonstrate the effectiveness of the meta-first optimization.
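The meta-first intuition can be sketched in a few lines. This is not ScQL's algebra, only the load-pruning idea; the catalog layout and field names (path, antibody) are invented for illustration.

```python
def meta_first_select(catalog, meta_pred, load_sample):
    """Sketch of the meta-first idea: decide from metadata alone which
    samples can contribute to the result, load only those, and keep each
    loaded sample paired with its metadata throughout the query."""
    pairs = []
    for meta in catalog:                       # catalog: list of metadata dicts
        if meta_pred(meta):                    # selection over metadata only
            data = load_sample(meta["path"])   # lazy load of the sample data
            pairs.append((data, meta))         # the data-metadata pair
    return pairs

# Hypothetical usage: only matching samples are ever read from storage.
# pairs = meta_first_select(catalog,
#                           lambda m: m.get("antibody") == "CTCF",
#                           read_sample_file)
```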

4.
In recent years, data quality issues have attracted wide attention. Data quality problems are mainly caused by dirty data. Many methods for dirty data management have been proposed; one of them is the entity-based relational database, in which one tuple represents an entity. Traditional query optimizations are not suitable for this new entity-based model, so new query optimization techniques need to be developed. In this paper, we propose a new histogram-based query selectivity estimation strategy and focus on correcting the overestimation that traditional methods produce. We prove that our approaches are unbiased. Experimental results on both real and synthetic data sets show that our approaches give good estimates with low error.
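For orientation, the sketch below shows the textbook equi-width histogram selectivity estimate that such strategies refine; it is a standard baseline, not the authors' entity-aware estimator.

```python
def build_equi_width_histogram(values, n_buckets=10):
    """Equi-width histogram: bucket counts over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0      # guard against a constant column
    counts = [0] * n_buckets
    for v in values:
        counts[min(int((v - lo) / width), n_buckets - 1)] += 1
    return lo, width, counts

def estimate_selectivity(hist, a, b):
    """Estimated fraction of values in [a, b], assuming each bucket's
    values are spread uniformly across the bucket's range."""
    lo, width, counts = hist
    est = 0.0
    for i, c in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        est += c * overlap / width
    return est / sum(counts)

hist = build_equi_width_histogram([1, 2, 2, 3, 8, 9, 9, 10])
sel = estimate_selectivity(hist, 2, 4)        # fraction of values in [2, 4]
```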

5.
In this paper, we present a mechanism for detecting and representing changes, given the old and new versions of a set of interlinked Web documents retrieved in response to a user's query. In particular, we show how to detect and represent Web deltas, i.e., changes in the Web documents that are relevant to a user's query, in the context of our Web warehousing system called WHOWEDA (Warehouse of Web Data). In WHOWEDA, Web information is materialized as views stored in Web tables in the form of Web tuples. These Web tuples, represented as directed graphs, can be manipulated using a set of Web algebraic operators. In this paper, we present a mechanism to detect relevant Web deltas using Web algebraic operators such as the Web join and the outer Web join. Web join is used to detect identical documents residing in two Web tables, whereas outer Web join, a derivative of Web join, is used to identify dangling Web tuples. We show how to represent these changes using delta Web tables. We develop formal algorithms for the generation of delta Web tables, identifying Web documents that have been added, deleted, or modified since the last query.
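Flattening Web tuples to plain records loses the graph structure the operators really work on, but the join intuition survives in miniature. A hedged sketch with invented field names (url, checksum):

```python
def web_join(old_table, new_table,
             key=lambda doc: (doc["url"], doc["checksum"])):
    """Pair documents that are identical in both versions (unchanged pages)."""
    new_index = {key(d): d for d in new_table}
    return [(d, new_index[key(d)]) for d in old_table if key(d) in new_index]

def outer_web_join(old_table, new_table, key=lambda doc: doc["url"]):
    """Dangling tuples: URLs present in only one table (deleted / added)."""
    old_urls = {key(d) for d in old_table}
    new_urls = {key(d) for d in new_table}
    deleted = [d for d in old_table if key(d) not in new_urls]
    added = [d for d in new_table if key(d) not in old_urls]
    return deleted, added
```

URLs present in both tables but with differing checksums would correspond to modified documents; the delta Web tables then collect the added, deleted, and modified tuples.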

6.
Although popular database systems perform well on query optimization, they still produce poor query execution plans when the join operations across multiple tables are complex. Bad execution plans usually result from bad cardinality estimates. The cardinality estimation models in traditional databases cannot provide high-quality estimates because they are not capable of capturing the correlations between multiple tables in an effective fashion. Recently, state-of-the-art learning-based cardinality estimators have been shown to work better than the traditional empirical methods; they use deep neural networks to model the relationships and correlations between tables. In this paper, we propose a vertical scanning convolutional neural network (abbreviated as VSCNN) that captures the relationships between words in a word vector in order to generate a feature map. The proposed learning-based cardinality estimator converts a Structured Query Language (SQL) query from a sentence into a word vector; table names are encoded with one-hot encoding and samples as bitmaps, and the two are then merged to obtain sufficient semantic information from the data samples. In particular, the feature map obtained by VSCNN contains semantic information about the tables, joins, and predicates of SQL queries. To further improve the accuracy of cardinality estimation, we propose a negative sampling method for training the word vector by gradient descent from the base table and compressing it into a bitmap. Extensive experiments show that the q-error of the proposed vertical scanning convolutional neural network based model is at least 14.6% lower than that of the estimators in traditional databases.
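The query encoding described above (one-hot table names merged with sample bitmaps) is easy to sketch; the CNN itself is omitted here, and the table names and bitmap are hypothetical.

```python
import numpy as np

def encode_query(query_tables, sample_hits, all_tables):
    """Hypothetical encoding stage of a learned cardinality estimator:
    one-hot vector over table names, concatenated with a bitmap that
    records which materialized sample rows satisfy the query predicates."""
    one_hot = np.array([t in query_tables for t in all_tables], dtype=np.float32)
    bitmap = np.array(sample_hits, dtype=np.float32)
    return np.concatenate([one_hot, bitmap])   # feature vector fed to the CNN

feat = encode_query({"title", "cast_info"},       # tables in the query
                    [1, 0, 1, 1, 0],              # 5-row sample bitmap
                    all_tables=["title", "cast_info", "movie_info"])
# feat -> [1. 1. 0. 1. 0. 1. 1. 0.]
```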

7.
This paper is a sequel to our previous paper on a relational similarity-based model of data and its fundamental query systems. The present paper elaborates on the dependency theory in the similarity-based model, focusing mainly on similarity-based functional dependencies, their semantic entailment, model-theoretic properties, complete axiomatizations, characterization of nonredundant bases, computational issues, and related algorithms. The paper shows that various aspects of dependencies in ranked data tables over domains with similarities can be properly formalized using complete residuated lattices as structures for similarities and ranks. In addition to their theoretical importance, the results can be directly applied in the areas of similarity-based constraints, query result analysis, and knowledge discovery from relational data involving similarity-based reasoning. We assume that readers are acquainted with the prequel to this paper.
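To make the notion concrete over the simplest structure, the sketch below computes the degree to which a similarity-based functional dependency A => B holds in a table, using the unit interval with the Gödel residuum. The paper itself works over arbitrary complete residuated lattices and ranked tables, so this is only the core intuition; the similarity functions are invented.

```python
def godel_residuum(a, b):
    """Residuum of the Goedel t-norm on [0, 1]: a -> b."""
    return 1.0 if a <= b else b

def fd_degree(tuples, sim_A, sim_B):
    """Degree to which the similarity-based functional dependency A => B
    holds: the infimum, over all tuple pairs, of sim_A(t, u) -> sim_B(t, u).
    Degree 1.0 recovers the classical notion of functional dependency."""
    degree = 1.0
    for i, t in enumerate(tuples):
        for u in tuples[i + 1:]:
            degree = min(degree, godel_residuum(sim_A(t, u), sim_B(t, u)))
    return degree

# Hypothetical usage: rows are (age, price) pairs; similar ages should
# imply similar prices.
close = lambda x, y, scale: max(0.0, 1.0 - abs(x - y) / scale)
rows = [(30, 100.0), (31, 105.0), (60, 300.0)]
d = fd_degree(rows,
              sim_A=lambda t, u: close(t[0], u[0], 50),
              sim_B=lambda t, u: close(t[1], u[1], 250))
```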

8.
This paper presents an approach to query decomposition in a multidatabase environment. The unique aspect of this approach is that it is based on performing transformations over an object algebra that can be used as the basis for a global query language. In the paper, we first present our multidatabase environment and semantic framework, where a global conceptual schema based on the Object Data Management Group standard encompasses the information from heterogeneous data sources that include relational databases as well as object-oriented databases and flat file sources. The metadata about the global schema is enhanced with information about virtual classes as well as virtual relationships and inheritance hierarchies that exist between multiple sources. The AQUA object algebra is used as the formal foundation for manipulation of the query expression over the multidatabase. AQUA is enhanced with distribution operators for dealing with data distribution issues. During query decomposition, we perform an extensive analysis of traversals for path expressions that involve virtual relationships and hierarchies for access to several heterogeneous sources. The distribution operators, defined in algebraic terms, enhance the global algebra expression with semantic information about the structure, distribution, and localization of the data sources relevant to the solution of the query. By using an object algebra as the basis for query processing, we are able to define algebraic transformations and exploit rewriting techniques during the decomposition phase. Our use of an object algebra also provides a formal and uniform representation for dealing with an object-oriented approach to multidatabase query processing. As part of our query processing discussion, we include an overview of a global object identification approach for relating semantically equivalent objects from diverse data sources, illustrating how knowledge about global object identity is used in the decomposition and assembly processes.

9.
Problems associated with defining normal forms of relational tables relevant to statistical processing are discussed. The concepts of derived identifier, class identifier, derived class-counts, count domains, compact domains, and uniform domains for statistical relational tables are introduced. The structures of the first and second statistical-normal forms, and the relational decompositions needed to achieve them, are also discussed. It is shown that the statistical-normal form can be an important method for determining whether the usual statistical analysis techniques are valid. Some suggestions are presented for extending structured query language (SQL) statements to achieve these operations on statistical relational tables. Some results linking Codd's normal forms with statistical normal forms are discussed, as are relational statistical abnormalities, called outliers.

10.
In classical database theory, relational calculus has long been used to express query formulae and integrity constraints. In fact, relational calculus formulae are much easier to deal with than first-order formulae when evaluating queries and validating database updates in the database environment. In deductive databases, however, first-order calculus is preferred because it is convenient when proof procedures are involved. Since both situations must coexist in advanced information systems, it is very desirable to devise a conversion procedure between relational calculus and first-order calculus. In this paper, the interpretation of first-order formulae in the database environment is discussed first; then tuple calculus, an extension of relational calculus, is presented. This extension enables us to describe the query formulae and general rules necessary in advanced information systems, in particular those dealing with complex objects. Finally, a conversion algorithm from first-order formulae into tuple calculus formulae is presented. Several application issues are also included.

11.
For relational databases in the oil and gas well engineering domain, a semantic-view-based SPARQL-to-SQL query translation method is proposed. The method uses a domain ontology to describe the relational schemas of the data sources and, in the form of RDF triples, defines relational tables as semantic query views over the domain ontology, thereby establishing mappings between the data sources and the ontology. Based on this semantic mapping information, submitted SPARQL statements are parsed and rewritten into SQL statements targeting the relational data sources. The feasibility and effectiveness of the method are validated through the implementation of a virtual data center for oil and gas wells, with good results in practice.
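The rewriting step can be sketched for a single triple pattern. The mapping table and all property, table, and column names below are invented for illustration; the paper's mappings are expressed as RDF-based semantic views rather than a Python dict.

```python
# Hypothetical mapping: each ontology property is defined as a semantic
# view over one relational table (subject column, object column).
MAPPINGS = {
    "well:hasDepth":    ("well_basic", "well_id", "depth"),
    "well:hasOperator": ("well_basic", "well_id", "operator"),
}

def triple_pattern_to_sql(predicate, subject=None, obj=None):
    """Rewrite a single SPARQL triple pattern into SQL via the view mapping."""
    table, s_col, o_col = MAPPINGS[predicate]
    conditions = []
    if subject is not None:
        conditions.append(f"{s_col} = '{subject}'")   # bound subject
    if obj is not None:
        conditions.append(f"{o_col} = '{obj}'")       # bound object
    sql = f"SELECT {s_col}, {o_col} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

print(triple_pattern_to_sql("well:hasDepth", subject="W-101"))
# SELECT well_id, depth FROM well_basic WHERE well_id = 'W-101'
```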

12.
廖祝华, 张国清, 杨景, 傅川, 张国强. 《软件学报》, 2012, 23(10): 2760-2771
To address the problem of quickly locating and aggregating semantically related media content on the Internet, a new network-oriented relation routing scheme is proposed. Building on named media and semantic relations, the scheme routes relation requests through the network and then rapidly returns the related content. The paper first introduces the semantic media model and the network-based relation query model, and designs the data structures and algorithms of the relation routing protocol, with emphasis on the relation matching algorithm, the routing process, and a method for avoiding incomplete returns of relation requests. Key issues of relation routing, such as media naming, query preferences, and applications, are then discussed and analyzed. Finally, the relation routing model and algorithms are evaluated in a real-world environment. The results show that relation routing can quickly retrieve semantically related content and provides an effective approach to distributed, dynamic semantic aggregation of media.

13.
Concurrency control, distribution design, and query processing are some of the important issues in the design of distributed databases. In this paper, we have studied these issues with respect to a relational database on a local computer system connected by a multiaccess broadcast bus. A broadcast bus allows information to be distributed efficiently and hence simplifies the solutions to some of these issues. A transaction model that integrates the control strategies of concurrency control and query processing is proposed. In concurrency control, the lock, unlock, and update of data are achieved with a few broadcasts. A dynamic strategy is used in query processing, since less data is transferred than with a static strategy. The status information needed in dynamic query processing can be conveniently obtained by broadcasting. Lastly, some NP-hard file placement problems are found to be solvable in polynomial time when updates are broadcast.

14.
Joseph Fong, Herbert Shiu, Davy Cheung. 《Software》, 2008, 38(11): 1183-1213
Integrating information from multiple data sources is becoming increasingly important for enterprises that partner with other companies for e-commerce. However, companies have their internal business applications deployed on diverse platforms, and no standard solution for integrating information from these sources exists. To support business intelligence query activities, it is useful to build a data warehouse on top of middleware that aggregates the data obtained from various heterogeneous database systems. Online analytical processing (OLAP) can then be used to provide fast access to materialized views from the data warehouse. Since extensible markup language (XML) documents are a common data representation standard on the Internet and relational tables are commonly used for production data, OLAP must handle both relational and XML data. SQL and XQuery can be used to process the materialized relational and XML data cubes created from the aggregated data. This paper shows how to handle the two kinds of data cubes from a relational-XML data warehouse using extract, transformation and loading (ETL). Copyright © 2008 John Wiley & Sons, Ltd.
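A toy single-process sketch of the ETL idea: an XML source is flattened into the same staging schema as a relational source, and a small cube is then materialized over the merged data. The schemas and element names are invented; the paper's middleware spans multiple heterogeneous systems.

```python
import sqlite3
import xml.etree.ElementTree as ET

XML_ORDERS = """<orders>
  <order region="EU" product="disk" amount="120"/>
  <order region="US" product="disk" amount="200"/>
</orders>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INT)")
conn.execute("INSERT INTO sales VALUES ('US', 'cpu', 150)")   # relational source

# Extract + transform: flatten the XML source into the same staging schema.
for order in ET.fromstring(XML_ORDERS):
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)",
                 (order.get("region"), order.get("product"),
                  int(order.get("amount"))))

# Load: materialize a small data cube over the merged data.
cube = conn.execute("SELECT region, product, SUM(amount) "
                    "FROM sales GROUP BY region, product").fetchall()
```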

15.
Aggregate keyword search on large relational databases
Keyword search has been recently extended to relational databases to retrieve information from text-rich attributes. However, all the existing methods focus on finding individual tuples matching a set of query keywords from one table or the join of multiple tables. In this paper, we motivate a novel problem of aggregate keyword search: finding minimal group-bys covering a set of query keywords well, which is useful in many applications. We develop two interesting approaches to tackle the problem. We further extend our methods to allow partial matches and matches using a keyword ontology. An extensive empirical evaluation using both real data sets and synthetic data sets is reported to verify the effectiveness of aggregate keyword search and the efficiency of our methods.
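A brute-force sketch of the problem statement, not the authors' algorithms: it approximates "minimal" as "fewest grouping attributes", whereas the paper defines minimality over the group-by cells themselves. All attribute names are hypothetical.

```python
from itertools import combinations

def aggregate_keyword_search(rows, dims, text_attr, keywords):
    """Find group-by cells whose pooled text covers every query keyword,
    reporting cells with the fewest grouping attributes first."""
    keywords = {k.lower() for k in keywords}
    for size in range(1, len(dims) + 1):
        answers = []
        for attrs in combinations(dims, size):
            groups = {}
            for r in rows:
                cell = tuple(r[a] for a in attrs)
                groups.setdefault(cell, set()).update(
                    r[text_attr].lower().split())
            for cell, words in groups.items():
                if keywords <= words:
                    answers.append((attrs, cell))
        if answers:
            return answers   # stop at the smallest covering attribute set
    return []

hits = aggregate_keyword_search(
    [{"city": "Paris", "year": 2007, "notes": "database tutorial"},
     {"city": "Paris", "year": 2007, "notes": "keyword search demo"}],
    dims=["city", "year"], text_attr="notes",
    keywords=["database", "keyword"])
```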

16.
17.
In this article we propose a method for information retrieval based on relational Multi-Criteria Decision Making. We assume that a user cannot define precise search criteria, so these criteria must be found based on the user's assessment of several sample alternatives ('alternatives' here are database records, e.g. images). This situation is common in Content-based Image Retrieval, where it is easier for a user to indicate relevant images than to describe a proper query, especially in a formal language. The proposed algorithm for the elicitation of criteria is based on ELECTRE III, a method originally designed for ranking a set of alternatives according to defined criteria. In our algorithm, however, the direction of reasoning is reversed: we start with several sample alternatives that have been assigned a rank by the user, and then we select criteria that are compatible (in the sense of the ELECTRE methodology) with the user's preferences expressed on the sample set. Then, having determined the user's criteria, we apply classical ELECTRE III to retrieve the relevant solutions from the database. We implemented the method in Matlab and tested it on the Microsoft Cambridge Image Database.
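For readers unfamiliar with ELECTRE III, its concordance index (the building block of the outranking relation) is shown below; this illustrates the standard method, not the paper's reversed elicitation algorithm, and uses Python rather than the authors' Matlab.

```python
def concordance(g_a, g_b, q, p):
    """Per-criterion concordance index c(a, b) of ELECTRE III: the degree
    to which this criterion supports 'a outranks b', with indifference
    threshold q and preference threshold p (q <= p, higher scores better)."""
    diff = g_b - g_a               # how much b beats a on this criterion
    if diff <= q:
        return 1.0                 # a at least as good as b, within q
    if diff >= p:
        return 0.0                 # b strictly preferred: no support
    return (p - diff) / (p - q)    # linear interpolation in between

def global_concordance(a, b, weights, q, p):
    """Weighted aggregation over criteria; a and b are score vectors."""
    return sum(w * concordance(ga, gb, q, p)
               for w, ga, gb in zip(weights, a, b)) / sum(weights)

c = global_concordance(a=[0.8, 0.6], b=[0.7, 0.9],
                       weights=[2.0, 1.0], q=0.05, p=0.2)
```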

18.
A relational ranking query uses a scoring function to limit the results of a conventional query to a small number of the most relevant answers. The increasing popularity of this query paradigm has led to the introduction of specialized rank join operators that integrate the selection of top tuples with join processing. These operators access just “enough” of the input in order to generate just “enough” output and can offer significant speed-ups for query evaluation. The number of input tuples that an operator accesses is called the input depth of the operator, and this is the driving cost factor in rank join processing. This introduces the important problem of depth estimation, which is crucial for the costing of rank join operators during query compilation and thus for their integration in optimized physical plans. We introduce an estimation methodology, termed deep, for approximating the input depths of rank join operators in a physical execution plan. At the core of deep lies a general, principled framework that formalizes depth computation in terms of the joint distribution of scores in the base tables. This framework results in a systematic estimation methodology that takes the characteristics of the data directly into account and thus enables more accurate estimates. We develop novel estimation algorithms that provide an efficient realization of the formal deep framework, and describe their integration on top of the statistics module of an existing query optimizer. We validate the performance of deep with an extensive experimental study on data sets of varying characteristics. The results verify the effectiveness of deep as an estimation method and demonstrate its advantages over previously proposed techniques.
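A very rough Monte Carlo sketch of what "input depth" measures: how far down two sorted inputs a rank join must read before k results appear, assuming candidate pairs join independently with a fixed probability. The real framework derives depths analytically from the joint score distribution and honors the operator's stopping bound, both of which this toy version ignores; all parameters are hypothetical.

```python
import random

def estimate_depth(n_left, n_right, k, join_prob, trials=200, seed=7):
    """Average depth at which the first k join results are produced when
    one more tuple is read from each (descending-sorted) input per round."""
    rng = random.Random(seed)
    total_depth, max_depth = 0, min(n_left, n_right)
    for _ in range(trials):
        produced, depth = 0, 0
        while produced < k and depth < max_depth:
            depth += 1
            new_pairs = 2 * depth - 1   # pairs newly covered by both prefixes
            produced += sum(rng.random() < join_prob for _ in range(new_pairs))
        total_depth += depth
    return total_depth / trials

# E.g., with a 1% join rate, roughly sqrt(k / 0.01) tuples per input:
# estimate_depth(10_000, 10_000, k=10, join_prob=0.01) ~ 30
```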

19.
We present an original approach to motion-based video retrieval involving partial queries. More precisely, we propose a unified statistical framework that allows us to simultaneously extract entities of interest in video shots and supply the associated content-based characterization, which can be used to satisfy partial queries. It relies on the analysis of motion activity in video sequences based on a non-parametric probabilistic modeling of motion information. Areas comprising relevant types of motion activity are extracted from a Markovian region-level labeling applied to the adjacency graph of an initial block-based partition of the image. As a consequence, given a set of videos, we are able to construct a structured base of samples of entities of interest represented by their associated statistical models of motion activity. The retrieval operation is then formulated as a Bayesian inference issue using the MAP criterion. We report various results of the extraction of entities of interest in video sequences and examples of retrieval operations performed on a base composed of one hundred video samples.

20.
李润洲, 方明. 《计算机工程》, 2007, 33(17): 111-113
Motivated by the need to integrate multiple distributed, heterogeneous relational databases within an enterprise, a metadata dictionary schema is designed that supports an integrated access interface upward and, downward, describes the network locations, data schemas, and data contents of the databases. A generic query request representation is given for the integration environment and the individual heterogeneous databases. Based on the metadata dictionary, a dynamic query statement construction algorithm is proposed that accommodates dynamic changes to the relevant database tables caused by changing access requirements, and the algorithm is validated.
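A minimal sketch of driving SQL construction from such a dictionary: logical fields are mapped, per source, to physical tables and columns, so a schema change only touches the dictionary. The dictionary contents, database names, and columns below are all invented for illustration.

```python
# Hypothetical metadata dictionary: per source database, the network
# location plus the mapping from logical fields to physical columns.
META_DICT = {
    "db_hr":  {"host": "10.0.0.1", "table": "employee",
               "columns": {"name": "emp_name", "salary": "sal"}},
    "db_erp": {"host": "10.0.0.2", "table": "staff",
               "columns": {"name": "full_name", "salary": "pay"}},
}

def build_queries(logical_fields, predicate=""):
    """Construct one SQL statement per source from the metadata dictionary,
    skipping sources that expose none of the requested logical fields."""
    queries = {}
    for db, meta in META_DICT.items():
        cols = [meta["columns"][f] for f in logical_fields
                if f in meta["columns"]]
        if not cols:
            continue
        sql = f"SELECT {', '.join(cols)} FROM {meta['table']}"
        if predicate:
            sql += f" WHERE {predicate}"
        queries[db] = sql
    return queries

print(build_queries(["name", "salary"]))
# {'db_hr': 'SELECT emp_name, sal FROM employee',
#  'db_erp': 'SELECT full_name, pay FROM staff'}
```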
