期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Metadata management for scientific databases

《Information Systems》2019

Most scientific databases consist of datasets (or sources) which in turn include samples (or files) with an identical structure (or schema). In many cases, samples are associated with rich metadata, describing the process that leads to building them (e.g.: the experimental conditions used during sample generation). Metadata are typically used in scientific computations just for the initial data selection; at most, metadata about query results is recovered after executing the query, and associated with its results by post-processing. In this way, a large body of information that could be relevant for interpreting query results goes unused during query processing.In this paper, we present ScQL, a new algebraic relational language, whose operations apply to objects consisting of data–metadatapairs, by preserving such one-to-one correspondence throughout the computation. We formally define each operation and we describe an optimization, called meta-first, that may significantly reduce the query processing overhead by anticipating the use of metadata for selectively loading into the execution environment only those input samples that contribute to the result samples.In ScQL, metadata have the same relevance as data, and contribute to building query results; in this way, the resulting samples are systematically associated with metadata about either the specific input samples involved or about query processing, thereby yielding a new form of metadata provenance. We present many examples of use of ScQL, relative to several application domains, and we demonstrate the effectiveness of the meta-first optimization. 相似文献

2.

An algebraic transformation framework for multidatabase queries

Ee-Peng Lim Jaideep Srivastava San-Yih Hwang 《Distributed and Parallel Databases》1995,3(3):273-307

Existence of semantic conflicts between component databases severely impacts query processing in a multidatabase system. In this paper, we describe two types of semantic conflicts that have to be dealt with in the integration of databases modeling information about related sets of real-world entities. These are the entityidentification problem and theattribute value conflict problem. While thetwo-way outerjoin operation has been commonly used for resolving entity identification problem between two component relations, outerjoins using regular equality comparisons between component relation keys is shown to produce counter-intuitive entity identification result. We remedy this by defining a newkey-equality comparator in place of regular equality comparator, for outerjoins. For the attribute value conflict problem, we define aGeneralized Attribute Derivation (GAD) operation which allows user-defined attribute derivation functions to be used to compute new attributes from the component relations' attributes. By adding two-way outerjoin andGAD to the set of relational operations, the traditional algebraic transformation framework for relational queries is no longer adequate for multidatabase query processing and optimization. As a result, we introduceconstrained query tree as the multidatabase query representation. We show that some knowledge about query predicates and attribute derivation functions can be used to simplify queries. Such knowledge is modeled as an outerjoin graph attached to every outerjoin operation in the query tree. Based on this, we further extend the traditional algebraic transformation framework to include two-way outerjoins andGAD operations. Our framework demonstrates that properties of selection/join predicates and attribute derivation functions can be used to provide interesting transformation alternatives. This framework also serves as a formal ground for developing optimization strategies for multidatabase queries. Recommended by: Clement Yu 相似文献

3.

On Similarity Measures for Multimedia Database Applications 总被引：1，自引：1，他引：0

K. Selçuk Candan Wen-Syan Li 《Knowledge and Information Systems》2001,3(1):30-51

A multimedia database query consists of a set of fuzzy and boolean (or crisp) predicates, constants, variables, and conjunction, disjunction, and negation operators. The fuzzy predicates are evaluated based on different media criteria, such as color, shape, layout, keyword. Since media-based evaluation yields similarity values, results to such a query is defined as an ordered set. Since many multimedia applications require partial matches, query results also include tuples which do not satisfy all predicates. Hence, any fuzzy semantics which extends the boolean semantics of conjunction in a straight forward manner may not be desirable for multimedia databases. In this paper, we focus on the problem of ‘given a multimedia query which consists of multiple fuzzy and crisp predicates, how to provide the user with a meaningful overall ranking.’ More specifically, we study the problem of merging similarity values in queries with multiple fuzzy predicates. We describe the essential multimedia retrieval semantics, compare these with the known approaches, and propose a semantics which captures the retrieval requirements in multimedia databases. Received 13 August 1999 / Revised 13 May 2000 / Accepted in revised form 26 July 2000 相似文献

4.

The dynamic predicate: integrating access control with query processing in XML databases

Jae-Gil Lee Kyu-Young Whang Wook-Shin Han Il-Yeol Song 《The VLDB Journal The International Journal on Very Large Data Bases》2007,16(3):371-387

Recently, access control on XML data has become an important research topic. Previous research on access control mechanisms for XML data has focused on increasing the efficiency of access control itself, but has not addressed the issue of integrating access control with query processing. In this paper, we propose an efficient access control mechanism tightly integrated with query processing for XML databases. We present the novel concept of the dynamic predicate (DP), which represents a dynamically constructed condition during query execution. A DP is derived from instance-level authorizations and constrains accessibility of the elements. The DP allows us to effectively integrate authorization checking into the query plan so that unauthorized elements are excluded in the process of query execution. Experimental results show that the proposed access control mechanism improves query processing time significantly over the state-of-the-art access control mechanisms. We conclude that the DP is highly effective in efficiently checking instance-level authorizations in databases with hierarchical structures. 相似文献

5.

Answering constraint-based mining queries on itemsets using previous materialized results

Roberto Esposito Rosa Meo Marco Botta 《Journal of Intelligent Information Systems》2006,26(1):95-111

In recent years, researchers have begun to study inductive databases, a new generation of databases for leveraging decision support applications. In this context, the user interacts with the DBMS using advanced, constraint-based languages for data mining where constraints have been specifically introduced to increase the relevance of the results and, at the same time, to reduce its volume. In this paper we study the problem of mining frequent itemsets using an inductive database. We propose a technique for query answering which consists in rewriting the query in terms of union and intersection of the result sets of other queries, previously executed and materialized. Unfortunately, the exploitation of past queries is not always applicable. We then present sufficient conditions for the optimization to apply and show that these conditions are strictly connected with the presence of functional dependencies between the attributes involved in the queries. We show some experiments on an initial prototype of an optimizer which demonstrates that this approach to query answering is viable and in many practical cases it drastically reduces the query execution time. 相似文献

6.

Hybrid query execution engine for large attributed graphs

《Information Systems》2014

Graphs are widely used for modeling complicated data such as social networks, bibliographical networks and knowledge bases. The growing sizes of graph databases motivate the crucial need for developing powerful and scalable graph-based query engines. We propose a SPARQL-like language, G-SPARQL, for querying attributed graphs. The language enables the expression of different types of graph queries that are of large interest in the databases that are modeled as large graph such as pattern matching, reachability and shortest path queries. Each query can combine both structural predicates and value-based predicates (on the attributes of the graph nodes/edges). We describe an algebraic compilation mechanism for our proposed query language which is extended from the relational algebra and based on the basic construct of building SPARQL queries, the Triple Pattern. We describe an efficient hybrid Memory/Disk representation of large attributed graphs where only the topology of the graph is maintained in memory while the data of the graph are stored in a relational database. The execution engine of our proposed query language splits parts of the query plan to be pushed inside the relational database (using SQL) while the execution of other parts of the query plan is processed using memory-based algorithms, as necessary. Experimental results on real and synthetic datasets demonstrate the efficiency and the scalability of our approach and show that our approach outperforms native graph databases by several factors. 相似文献

7.

Query optimization via contention space partitioning and cost error controlling for dynamic multidatabase systems

Qiang Zhu Jaidev Haridas Wen-Chi Hou 《Distributed and Parallel Databases》2008,23(2):151-188

A multidatabase system (MDBS) integrates information from multiple autonomous local databases. Performing global query optimization to achieve efficient query processing in such a system is challenging due to local autonomy of the data sources. Dynamic factors in the environment make the problem even more difficult. In this paper, we present two techniques, i.e., contention space partitioning and cost error controlling, to perform global query optimization in a dynamic MDBS. Both techniques generate an execution plan with multiple versions for a query in a dynamic MDBS, utilizing the multistate cost models built for the dynamic environment via our previous multistate query sampling method. The first technique partitions the contention space of a dynamic multidatabase environment into a given number of subspaces and chooses a good query execution plan version for each subspace, while the second technique selects a set of execution plan versions by using a given error tolerance to control query execution costs. Experiments demonstrate that the proposed techniques are quite promising for performing global query optimization in a dynamic MDBS. Compared with related work on dynamic query optimization, our approach has an advantage of avoiding the high overhead for modifying or re-generating an execution plan for a query based on dynamic runtime information. Research was supported by the US National Science Foundation under Grant # IIS-9811980 and The University of Michigan. 相似文献

8.

Query Optimization in Multidatabase Systems

D.K. Subramanian K. Subramanian 《Distributed and Parallel Databases》1998,6(2):183-210

Global query execution in a multidatabase system can be done parallelly, as all the local databases are independent. In this paper, a cost model that considers parallel execution of subqueries for a global query is developed. In order to obtain maximum parallelism in query execution, it is required to find a query execution plan that is represented in the form of a bushy tree and this query tree should be balanced to the maximal possible extent with respect to execution time. A new bottom up approach called Agglomerative Approach (AA) is proposed to construct balanced bushy trees with respect to execution time. By the deterministic nature of this approach, it generates local optimal solutions. This local minima problem will be severe in the case of graph queries, i.e., queries that are represented with a graph structure. A Simulated annealing Approach (SA) is employed to obtain a (near) optimal solution. These approaches (AA and SA) are suitable for handling on-line and off-line queries respectively. A Hybrid Approach (HA), that is an integration of AA and SA, is proposed to optimize queries for which the estimated time to be spent on optimization is known a priori. Results obtained with AA and SA on both tree and graph structured queries are presented. 相似文献

9.

Multi-dimensional top-k dominating queries 总被引：1，自引：0，他引：1

Man Lung Yiu Nikos Mamoulis 《The VLDB Journal The International Journal on Very Large Data Bases》2009,18(3):695-718

The top-k dominating query returns k data objects which dominate the highest number of objects in a dataset. This query is an important tool for decision support since it provides data analysts an intuitive way for finding significant objects. In addition, it combines the advantages of top-k and skyline queries without sharing their disadvantages: (i) the output size can be controlled, (ii) no ranking functions need to be specified by users, and (iii) the result is independent of the scales at different dimensions. Despite their importance, top-k dominating queries have not received adequate attention from the research community. This paper is an extensive study on the evaluation of top-k dominating queries. First, we propose a set of algorithms that apply on indexed multi-dimensional data. Second, we investigate query evaluation on data that are not indexed. Finally, we study a relaxed variant of the query which considers dominance in dimensional subspaces. Experiments using synthetic and real datasets demonstrate that our algorithms significantly outperform a previous skyline-based approach. We also illustrate the applicability of this multi-dimensional analysis query by studying the meaningfulness of its results on real data. 相似文献

10.

Supporting top-<Emphasis Type="Italic">k</Emphasis><Emphasis Type="Italic">join</Emphasis> queries in relational databases

Ihab?F.?Ilyas Email author Walid?G.?Aref Ahmed?K.?Elmagarmid 《The VLDB Journal The International Journal on Very Large Data Bases》2004,13(3):207-221

Ranking queries, also known as top-k queries, produce results that are ordered on some computed score. Typically, these queries involve joins, where users are usually interested only in the top-k join results. Top-k queries are dominant in many emerging applications, e.g., multimedia retrieval by content, Web databases, data mining, middlewares, and most information retrieval applications. Current relational query processors do not handle ranking queries efficiently, especially when joins are involved. In this paper, we address supporting top-k join queries in relational query processors. We introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. The idea is to rank the join results progressively during the join operation. We introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. The operators are nonblocking and can be integrated into pipelined execution plans. We also propose an efficient heuristic designed to optimize a top-k join query by choosing the best join order. We address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. We implement the new operators inside a prototype database engine based on PREDATOR. The experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance.Received: 23 December 2003, Accepted: 31 March 2004, Published online: 12 August 2004Edited by: S. AbiteboulExtended version of the paper published in the Proceedings of the 29th International Conference on Very Large Databases, VLDB 2003, Berlin, Germany, pp 754-765 相似文献

11.

Cardinality Estimator: Processing SQL with a Vertical Scanning Convolutional Neural Network

下载免费PDF全文

Shao-Jie Qiao Guo-Ping Yang Nan Han Hao Chen Fa-Liang Huang Kun Yue Yu-Gen Yi Chang-An Yuan 《计算机科学技术学报》2021,36(4):762-777

Although the popular database systems perform well on query optimization, they still face poor query execution plans when the join operations across multiple tables are complex. Bad execution planning usually results in bad cardinality estimations. The cardinality estimation models in traditional databases cannot provide high-quality estimation, because they are not capable of capturing the correlation between multiple tables in an effective fashion. Recently, the state-of-the-art learning-based cardinality estimation is estimated to work better than the traditional empirical methods. Basically, they used deep neural networks to compute the relationships and correlations of tables. In this paper, we propose a vertical scanning convolutional neural network (abbreviated as VSCNN) to capture the relationships between words in the word vector in order to generate a feature map. The proposed learning-based cardinality estimator converts Structured Query Language (SQL) queries from a sentence to a word vector and we encode table names in the one-hot encoding method and the samples into bitmaps, separately, and then merge them to obtain enough semantic information from data samples. In particular, the feature map obtained by VSCNN contains semantic information including tables, joins, and predicates about SQL queries. Importantly, in order to improve the accuracy of cardinality estimation, we propose the negative sampling method for training the word vector by gradient descent from the base table and compress it into a bitmap. Extensive experiments are conducted and the results show that the estimation quality of q-error of the proposed vertical scanning convolutional neural network based model is reduced by at least 14.6％ when compared with the estimators in traditional databases. 相似文献

12.

A general procedure to check conjunctive query containment

Miguel R. Penabad Nieves R. Brisaboa Héctor J. Hernández José R. Paramá 《Acta Informatica》2002,38(7):489-529

In this paper, we present a general procedure to test conjunctive query containment. We divide the containment problem into four categories, taking into account the underlying semantics (set or bag theoretic) and the presence or absence of built-in predicates in the queries. After a brief review of previous work on conjunctive query containment, we present a new procedure, called QCC (Query Containment Checker), which we show to be a general and uniform procedure to check the containment among conjunctive queries under the four categories mentioned above. We briefly describe the use of QCC to check bag containment of conjunctive queries, and explain in detail how to use QCC to check set containment of conjunctive queries with built-in predicates. In our conclusions, we point out some uses of QCC for other types of containment. Received: 21 January 2000 / 19 November 2001 相似文献

13.

Dynamic Query Operator Scheduling for Wide-Area Remote Access

Laurent Amsaleg Michael J. Franklin Anthony Tomasic 《Distributed and Parallel Databases》1998,6(3):217-246

Distributed databases operating over wide-area networks such as the Internet, must deal with the unpredictable nature of the performance of communication. The response times of accessing remote sources can vary widely due to network congestion, link failure, and other problems. In such an unpredictable environment, the traditional iterator-based query execution model performs poorly. We have developed a class of methods, called query scrambling, for dealing explicitly with the problem of unpredictable response times. Query scrambling dynamically modifies query execution plans on-the-fly in reaction to unexpected delays in data access. In this paper we focus on the dynamic scheduling of query operators in the context of query scrambling. We explore various choices for dynamic scheduling and examine, through a detailed simulation, the effects of these choices. Our experimental environment considers pipelined and non-pipelined join processing in a client with multiple remote data sources and delayed or possibly bursty arrivals of data. Our performance results show that scrambling rescheduling is effective in hiding the impact of delays on query response time for a number of different delay scenarios. 相似文献

14.

Query Size Estimation for Joins Using Systematic Sampling

A.H.H. Ngu B. Harangsri J. Shepherd 《Distributed and Parallel Databases》2004,15(3):237-275

We propose a new approach to the estimation of query result sizes for join queries. The technique, which we have called systematic sampling—SYSSMP, is a novel variant of the sampling-based approach. A key novelty of the systematic sampling is that it exploits the sortedness of data; the result of this is that the sample relation obtained well represents the underlying frequency distribution of the join attribute in the original relation.We first develop a theoretical foundation for systematic sampling which suggests that the method gives a more representative sample than the traditional simple random sampling. Subsequent experimental analysis on a range of synthetic relations confirms that the quality of sample relations yielded by systematic sampling is higher than those produced by the traditional simple random sampling.To ensure that sample relations produced by systematic sampling indeed assist in computing more accurate query result sizes, we compare systematic sampling with the most efficient simple random sampling called t_cross using a variety of relation configurations. The results obtained validate that systematic sampling uses the same amount of sampling but still provides more accurate query result sizes than t_cross. Furthermore, the extra sampling cost incurred by the use of systematic sampling pays off in a cheaper query execution cost at run-time. 相似文献

15.

Multidatabase Query Optimization

Cem Evrendilek Asuman Dogac Sena Nural Fatma Ozcan 《Distributed and Parallel Databases》1997,5(1):77-114

A multidatabase system (MDBS) allows the users to simultaneously access heterogeneous,and autonomous databases using an integrated schema and a single global query language. The query optimization problem in MDBSs is quite different from the query optimization problem in distributed homogeneous databases due to schema heterogeneity and autonomy of local database systems. In this work, we consider the optimization of query distribution in case of data replication and the optimization of intersite joins, that is, the join of the results returned by the local sitesin response to the global subqueries. The algorithms presented for the optimization of intersite joins try to maximize the parallelism in execution and take the federated nature of the problem into account. It has also been shown through a comparativeperformance study that the proposed intersite join optimization algorithms are efficient.The approach presented can easily be generalized to any operation required for intersite query processing. The query optimization scheme presentedin this paper is being implemented within the scopeof a multidatabase system which is based on OMG‘sobject management architecture. 相似文献

16.

Query processing techniques for arrays

Arunprasad P. Marathe Kenneth Salem 《The VLDB Journal The International Journal on Very Large Data Bases》2002,11(1):68-91

Arrays are a common and important class of data. At present, database systems do not provide adequate array support: arrays can neither be easily defined nor conveniently manipulated. Further, array manipulations are not optimized. This paper describes a language called the Array Manipulation Language (AML), for expressing array manipulations, and a collection of optimization techniques for AML expressions. In the AML framework for array manipulation, arbitrary externally-defined functions can be applied to arrays in a structured manner. AML can be adapted to different application domains by choosing appropriate external function definitions. This paper concentrates on arrays occurring in databases of digital images such as satellite or medical images. AML queries can be treated declaratively and subjected to rewrite optimizations. Rewriting minimizes the number of applications of potentially costly external functions required to compute a query result. AML queries can also be optimized for space. Query results are generated a piece at a time by pipelined execution plans, and the amount of memory required by a plan depends on the order in which pieces are generated. An optimizer can consider generating the pieces of the query result in a variety of orders, and can efficiently choose orders that require less space. An AML-based prototype array database system called ArrayDB has been built, and it is used to show the effectiveness of these optimization techniques. Edited by M. Carey. Received: 10 August 2001 / Accepted: 11 December 2001 Published online: 24 May 2002 相似文献

17.

Efficient query evaluation on probabilistic databases

Nilesh Dalvi Dan Suciu 《The VLDB Journal The International Journal on Very Large Data Bases》2007,16(4):523-544

We describe a framework for supporting arbitrarily complex SQL queries with “uncertain” predicates. The query semantics is based on a probabilistic model and the results are ranked, much like in Information Retrieval. Our main focus is query evaluation. We describe an optimization algorithm that can compute efficiently most queries. We show, however, that the data complexity of some queries is #P-complete, which implies that these queries do not admit any efficient evaluation methods. For these queries we describe both an approximation algorithm and a Monte-Carlo simulation algorithm. 相似文献

18.

Consistent selectivity estimation via maximum entropy

V. Markl P. J. Haas M. Kutsch N. Megiddo U. Srivastava T. M. Tran 《The VLDB Journal The International Journal on Very Large Data Bases》2007,16(1):55-76

Cost-based query optimizers need to estimate the selectivity of conjunctive predicates when comparing alternative query execution plans. To this end, advanced optimizers use multivariate statistics to improve information about the joint distribution of attribute values in a table. The joint distribution for all columns is almost always too large to store completely, and the resulting use of partial distribution information raises the possibility that multiple, non-equivalent selectivity estimates may be available for a given predicate. Current optimizers use cumbersome ad hoc methods to ensure that selectivities are estimated in a consistent manner. These methods ignore valuable information and tend to bias the optimizer toward query plans for which the least information is available, often yielding poor results. In this paper we present a novel method for consistent selectivity estimation based on the principle of maximum entropy (ME). Our method exploits all available information and avoids the bias problem. In the absence of detailed knowledge, the ME approach reduces to standard uniformity and independence assumptions. Experiments with our prototype implementation in DB2 UDB show that use of the ME approach can improve the optimizer’s cardinality estimates by orders of magnitude, resulting in better plan quality and significantly reduced query execution times. For almost all queries, these improvements are obtained while adding only tens of milliseconds to the overall time required for query optimization. 相似文献

19.

BWDSP10x上地址和数据谓词执行的编译优化

樊永朝郑启龙耿锐王向前王昊《计算机系统应用》2016,25(12):92-99

传统的谓词优化技术是在冯·诺伊曼体系结构计算机上实施的,仅对数据流进行优化,并没有考虑哈佛体系结构下指令和数据分开的情况.BWDSP10x是指令和数据分开的哈佛体系结构,它支持超长指令字,不仅提供了对数据谓词执行的支持也提供了对地址谓词执行的支持.特此提出了一种在区域上对两种谓词模式优化支持的方法,在进行两种比较之前,通过判断比较操作的两个操作数类型来分别实施两种模式的谓词优化,使得对地址的比较不用传输到通用寄存器中.实验结果表明该优化方法能显著地节省CPU的时间和带宽,大大减少了分支指令,使程序性能提高了28.4%. 相似文献

20.

Optimizing the execution of XSLT stylesheets for querying transformed XML data

Sven Groppe Jinghua Groppe Stefan Böttcher Thomas Wycisk Le Gruenwald 《Knowledge and Information Systems》2009,18(3):331-391

We have to deal with different data formats whenever data formats evolve or data must be integrated from heterogeneous systems. These data when implemented in XML for data exchange cannot be shared freely among applications without data transformation. A common approach to solve this problem is to convert the entire XML data from their source format to the applications’ target formats using the transformations rules specified in XSLT stylesheets. However, in many cases, not all XML data are required to be transformed except for a smaller part described by a user’s query (application). In this paper, we present an approach that optimizes the execution time of an XSLT stylesheet for answering a given XPath query by modifying the XSLT stylesheet in such a way that it would (a) capture only the parts in the XML data that are relevant to the query and (b) process only those XSLT instructions that are relevant to the query. We prove the correctness of our optimization approach, analyze its complexity and present experimental results. The experimental results show that our approach performs the best in terms of execution time, especially when many cost-intensive XSLT instructions can be excluded in the XSLT stylesheet. 相似文献