Similar literature
 Found 20 similar documents; search took 31 ms
1.
In this paper we propose a technique for compressing bitmap indexes for application in data warehouses. This technique, called run-length Huffman (RLH), is based on run-length encoding and on Huffman encoding. Additionally, we present a variant of RLH, called RLH-N. In RLH-N a bitmap is divided into N-bit words that are compressed by RLH. RLH and RLH-N were implemented and experimentally compared to the well-known word aligned hybrid (WAH) bitmap compression technique, which has been reported to provide the shortest query execution time. The experiments discussed in this paper show that: (1) RLH-compressed bitmaps are smaller than the corresponding WAH-compressed bitmaps, regardless of the cardinality of an indexed attribute, (2) RLH-N-compressed bitmaps are smaller than the corresponding WAH-compressed bitmaps for a certain range of cardinalities of an indexed attribute, (3) RLH- and RLH-N-compressed bitmaps offer shorter query response times than WAH-compressed bitmaps for a certain range of cardinalities of an indexed attribute, and (4) RLH-N ensures shorter update times for compressed bitmaps than RLH.
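As a rough illustration of the idea behind RLH, the sketch below run-length encodes a bitmap and then Huffman-codes the run lengths. It is not the authors' implementation: the function names (`run_lengths`, `huffman_code`, `compress`, `decompress`) and the choice to treat each run length as one Huffman symbol are assumptions made for this example, and the N-bit word splitting of RLH-N is not shown.

```python
import heapq
from collections import Counter
from itertools import groupby


def run_lengths(bits):
    """Return the value of the first bit and the lengths of consecutive runs."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    return (bits[0] if bits else 0), runs


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def compress(bits):
    """Run-length encode the bitmap, then Huffman-code the run lengths."""
    first, runs = run_lengths(bits)
    code = huffman_code(runs)
    return first, code, "".join(code[r] for r in runs)


def decompress(first, code, payload):
    """Invert compress(): decode run lengths and alternate the bit value."""
    decode = {v: k for k, v in code.items()}
    bits, value, buf = [], first, ""
    for ch in payload:
        buf += ch
        if buf in decode:
            bits.extend([value] * decode[buf])
            value ^= 1
            buf = ""
    return bits
```

Sparse bitmaps, where a few run lengths dominate, are the case in which a frequency-based code over run lengths pays off most; a real format would also have to store the code table compactly.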

2.
Bitmap indexes are commonly used in databases and search engines. By exploiting bit-level parallelism, they can significantly accelerate queries. However, they can use much memory, and thus, we might prefer compressed bitmap indexes. Following Oracle's lead, bitmaps are often compressed using run-length encoding (RLE). Building on prior work, we introduce the Roaring compressed bitmap format: it uses packed arrays for compression instead of RLE. We compare it to two high-performance RLE-based bitmap encoding techniques: Word Aligned Hybrid compression scheme and Compressed ‘n’ Composable Integer Set. On synthetic and real data, we find that Roaring bitmaps (1) often compress significantly better (e.g., 2×) and (2) are faster than the compressed alternatives (up to 900× faster for intersections). Our results challenge the view that RLE-based bitmap compression is best. Copyright © 2015 John Wiley & Sons, Ltd.
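To make the "packed arrays instead of RLE" idea concrete, here is a minimal sketch of a two-level integer set: values are grouped by their high 16 bits, and each group keeps a sorted packed array of the low 16 bits. The class name `ChunkedIntSet` and its methods are hypothetical, and the sketch is a simplification: it keeps only array containers, whereas the published Roaring format also switches dense chunks to bitmap containers.

```python
from array import array
from bisect import bisect_left


class ChunkedIntSet:
    """Toy two-level set: high 16 bits -> sorted packed array of low 16 bits."""

    def __init__(self, values=()):
        self.chunks = {}
        for v in sorted(set(values)):
            # "H" is an unsigned 16-bit element type, so each value costs 2 bytes.
            self.chunks.setdefault(v >> 16, array("H")).append(v & 0xFFFF)

    def __contains__(self, v):
        lows = self.chunks.get(v >> 16)
        if lows is None:
            return False
        i = bisect_left(lows, v & 0xFFFF)
        return i < len(lows) and lows[i] == (v & 0xFFFF)

    def intersection(self, other):
        """Intersect chunk by chunk; chunks present on one side only are skipped."""
        result = ChunkedIntSet()
        for key in self.chunks.keys() & other.chunks.keys():
            a, b = self.chunks[key], other.chunks[key]
            i = j = 0
            common = array("H")
            while i < len(a) and j < len(b):  # linear merge of two sorted arrays
                if a[i] == b[j]:
                    common.append(a[i])
                    i += 1
                    j += 1
                elif a[i] < b[j]:
                    i += 1
                else:
                    j += 1
            if common:
                result.chunks[key] = common
        return result

    def __iter__(self):
        for key in sorted(self.chunks):
            for low in self.chunks[key]:
                yield (key << 16) | low
```

Because whole chunks can be skipped when their key is missing on either side, intersections touch only the parts of the two sets that could overlap.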

3.
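In full-text information retrieval systems, storing the text and the index structures over its keywords requires a large amount of space. Bitmap indexes cannot support queries based on information content, and inverted files require comparatively large space. This paper proposes a compressed storage method for the frequency vector index structure, and designs and implements a storage structure based on this method. Theoretical analysis shows that the compression method and storage structure achieve a high compression ratio. The paper also discusses query processing techniques over compressed frequency vectors; experimental results show that the compressed index structure guarantees the completeness of query results and effectively improves the storage and query efficiency of frequency vectors.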

4.
The gap between storing data in relational databases and transferring data in the form of XML has been closed, e.g., by SQL/XML queries that generate XML data out of relational data sources. However, only a few relational database systems support the evaluation of SQL/XML queries, and even in those systems, the evaluation of such queries is quite slow compared to the evaluation of SQL queries. In this paper, we present S2CX, an approach that allows SQL/XML queries to be evaluated efficiently on any relational database system, no matter whether it supports SQL/XML or not. As the result of an SQL/XML query, S2CX supports different output formats ranging from plain XML to different compressed XML representations, including a succinct encoding of XML data, schema-aware compressed XML, and grammar-compressed XML. In many cases, S2CX produces compressed XML as the result of an SQL/XML query even faster than Oracle 11g and DB2 evaluate SQL/XML queries into non-compressed XML. Furthermore, our approach to query evaluation scales better, i.e., the larger the dataset, the faster our approach is compared to SQL/XML query evaluation in Oracle 11g and in DB2.

5.
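Materialized view selection based on the QC model in Web data integration systems   Cited by: 2 (self-citations: 0, citations by others: 2)
In Web data integration systems, materialized views can effectively reduce network transfer cost and improve query efficiency. How to select queries for materialization so that the selected queries satisfy the space constraint of the integration layer while obtaining the maximum materialization benefit has become an urgent problem in integration systems. Traditional methods do not consider the containment relationships among massive numbers of XML queries, so the materialized views they select may contain redundant information. To address these problems, this paper proposes (1) a QC (query containment) model for massive query sets in Web data integration systems, which captures the most common containment relationships between queries; and (2) a QC-model-based materialized view selection algorithm that considers the main factors relevant to view selection, including query submission frequency, space cost, query rewriting capability, and completeness of query results, and that introduces a query-bitmap organization for materialized views, thereby obtaining a more reasonable selection scheme. Experimental results demonstrate the effectiveness of the method.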

6.
NoSQL document stores are well-tailored to efficiently load and manage massive collections of heterogeneous documents without any prior structural validation. However, this flexibility becomes a serious challenge when querying heterogeneous documents, and hence the user has to build complex queries or reformulate existing queries whenever new schemas are introduced in a collection. In this paper we propose a novel approach, based on formal foundations, for building schema-independent queries which are designed to query multi-structured documents. We present a query enrichment mechanism that consults a pre-constructed dictionary. This dictionary binds each possible path in the documents to all its corresponding absolute paths in all the documents. We automate the process of query reformulation via a set of rules that reformulate most document store operators, such as select, project, unnest, aggregate and lookup. We then produce queries across multi-structured documents which are compatible with the native query engine of the underlying document store. To evaluate our approach, we conducted experiments on synthetic datasets. Our results show that the induced overhead can be acceptable when compared to the efforts needed to restructure the data or the time required to execute several queries corresponding to the different schemas inside the collection.

7.
Dataspaces have recently been proposed to manage heterogeneous data that is partially unstructured, high-dimensional, and extremely sparse. The inverted index has previously been extended to retrieve dataspaces. To achieve more efficient access to dataspaces, in this paper we first present our survey of data features in real dataspaces. Based on the features observed in our study, several partitioning-based index approaches are proposed to accelerate query processing in dataspaces. Specifically, the vertical partitioning index utilizes the partitions on tokens to merge and compress data. We can both reduce the number of I/O reads and avoid aggregation of data inside a compressed list. The horizontal partitioning index supports pruning partitions of tuples in the top-k query. Thus, we can reduce the computation overhead of candidate tuples that are irrelevant to the query. Finally, we also propose a hybrid index with both vertical and horizontal partitioning. Extensive experimental results on real data sets demonstrate that our approaches outperform the previous techniques and scale well with large data sizes.

8.
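An aggregation algorithm for OLAP query processing   Cited by: 10 (self-citations: 2, citations by others: 10)
Online analytical processing (OLAP) queries are ad hoc, complex queries that involve large volumes of data. From the perspective of SQL (structured query language), these queries usually contain multi-table joins and group-by aggregation operations. From the standpoint of OLAP query processing, this paper proposes a new sort-based aggregation query algorithm, MuSA (sort-based aggregation with multi-table join). The method takes full account of the characteristics of the data warehouse star schema and combines the aggregation operation with a new multi-table join algorithm, MJoin; for sorting, it uses …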

9.
State-of-the-art distributed RDF systems partition data across multiple computer nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation. Others try to minimize inter-node communication, which requires an expensive data preprocessing phase, leading to a high startup cost. A priori knowledge of the query workload has also been used to create partitions, which, however, are static and do not adapt to workload changes. In this paper, we propose AdPart, a distributed RDF system, which addresses the shortcomings of previous work. First, AdPart applies lightweight partitioning on the initial data, which distributes triples by hashing on their subjects; this renders its startup overhead low. At the same time, the locality-aware query optimizer of AdPart takes full advantage of the partitioning to (1) support the fully parallel processing of join patterns on subjects and (2) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdPart monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdPart implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdPart: (1) starts faster than all existing systems; (2) processes thousands of queries before other systems become online; and (3) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-second time.
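The subject-hash partitioning described above can be sketched in a few lines. This is only an illustration of why joins on a shared subject need no communication: the helper names (`worker_for`, `partition`, `subject_star_join`), the CRC32 hash, and the in-memory "workers" are assumptions for this example and say nothing about how AdPart itself is implemented.

```python
import zlib
from collections import defaultdict


def worker_for(subject, num_workers):
    """Map a subject to a worker with a deterministic hash.

    zlib.crc32 is used because Python's built-in hash() is salted per process
    for strings, which would make the placement unstable across runs.
    """
    return zlib.crc32(subject.encode("utf-8")) % num_workers


def partition(triples, num_workers):
    """Distribute (subject, predicate, object) triples by hashing the subject."""
    workers = defaultdict(list)
    for s, p, o in triples:
        workers[worker_for(s, num_workers)].append((s, p, o))
    return workers


def subject_star_join(workers, predicates):
    """Find subjects that have every predicate in the set `predicates`.

    All triples of a subject live on one worker, so each worker can answer
    independently and the partial results are simply unioned.
    """
    matches = set()
    for triples in workers.values():
        seen = defaultdict(set)
        for s, p, _ in triples:
            seen[s].add(p)
        matches |= {s for s, ps in seen.items() if predicates <= ps}
    return matches
```

Joins that connect a subject to an object stored elsewhere are the ones that require shipping intermediate results, which is where the abstract's hash distribution and adaptive replication come in.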

10.
In this paper we address the problem of building a compressed self-index that, given a distribution for the pattern queries and a bound on the space occupancy, minimizes the expected query time within that index space bound. We solve this problem by exploiting a reduction to the problem of finding a minimum weight K-link path in a properly designed Directed Acyclic Graph. Interestingly enough, our solution can be used with any compressed index based on the Burrows-Wheeler transform. Our experiments compare this optimal strategy with several other known approaches, showing its effectiveness in practice.
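For readers unfamiliar with the target problem of the reduction, the following is a generic dynamic-programming sketch of a minimum weight K-link path in a DAG. It assumes nodes are numbered 0..n-1 in topological order, that the path runs from node 0 to node n-1, and that "K-link" means exactly K edges; the function name and edge representation are hypothetical, and the paper's specific graph construction and any faster algorithm it uses are not reproduced here.

```python
import math


def min_weight_k_link_path(n, edges, k):
    """Cheapest path from node 0 to node n-1 that uses exactly k edges.

    edges: dict {(u, v): weight} with u < v, so node order is topological.
    Returns (cost, path); cost is math.inf and path is [] if no such path exists.
    """
    incoming = [[] for _ in range(n)]
    for (u, v), w in edges.items():
        incoming[v].append((u, w))

    # best[links][v]: minimum weight of a path 0 -> v with exactly `links` edges.
    best = [[math.inf] * n for _ in range(k + 1)]
    prev = [[None] * n for _ in range(k + 1)]
    best[0][0] = 0
    for links in range(1, k + 1):
        for v in range(n):
            for u, w in incoming[v]:
                cand = best[links - 1][u] + w
                if cand < best[links][v]:
                    best[links][v] = cand
                    prev[links][v] = u

    if best[k][n - 1] == math.inf:
        return math.inf, []
    # Walk the predecessor table backwards to recover the path.
    path, v = [n - 1], n - 1
    for links in range(k, 0, -1):
        v = prev[links][v]
        path.append(v)
    return best[k][n - 1], path[::-1]
```

This plain table-filling version runs in time proportional to K times the number of edges, which is enough to see how a budget on the number of links interacts with the path weight.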

11.
Our research extends the bit-sliced signature organization by introducing a partial evaluation approach for queries. The partial evaluation approach minimizes the response time by using a subset of the on-bits of the query signature. A new signature file optimization method, Partially evaluated Bit-Sliced Signature File (P-BSSF), for multi-term query environments using the partial evaluation approach is introduced. The analysis shows that, with 14% increase in space overhead, P-BSSF provides a query processing time improvement of more than 85% for multi-term query environments with respect to the best performance of the bit-sliced signature file (BSSF) method. Under the sequentiality assumption of disk blocks, P-BSSF provides a desirable response time of 1 second for a database size of one million records with a 28% space overhead. Due to partial evaluation, the desirable response time is guaranteed for queries with several terms.
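The sketch below illustrates bit-sliced signatures and what evaluating only a subset of the query's on-bits means. It assumes superimposed coding with a 64-bit signature and three bits per term, stores each bit slice as a Python integer, and simply takes the first few on-bits when `max_slices` is given; all of these choices, and the function names, are assumptions for illustration, not the parameters or the slice-selection rule of P-BSSF.

```python
import hashlib

F = 64             # signature width in bits (illustrative value)
BITS_PER_TERM = 3  # hash positions derived per term (illustrative value)


def term_bits(term):
    """Bit positions a term sets in a signature (superimposed coding)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return {digest[i] % F for i in range(BITS_PER_TERM)}


def build_slices(records):
    """Bit-sliced layout: slices[b] has bit r set if record r's signature has bit b."""
    slices = [0] * F
    for rid, terms in enumerate(records):
        for term in terms:
            for b in term_bits(term):
                slices[b] |= 1 << rid
    return slices


def candidates(slices, num_records, query_terms, max_slices=None):
    """AND together the slices for the query's on-bits.

    With max_slices set, only a subset of the on-bits is evaluated, so fewer
    slices are read and the result can only grow (more false drops).
    """
    on_bits = sorted(set().union(*(term_bits(t) for t in query_terms)))
    if max_slices is not None:
        on_bits = on_bits[:max_slices]
    result = (1 << num_records) - 1
    for b in on_bits:
        result &= slices[b]
    return {rid for rid in range(num_records) if (result >> rid) & 1}
```

Every candidate still has to be checked against the actual record, since signature matching can produce false drops; partial evaluation trades a few more of those checks for fewer slice reads.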

12.
Decision support queries typically involve several joins, a grouping with aggregation, and/or sorting of the result tuples. We propose two new classes of query evaluation algorithms that can be used to speed up the execution of such queries. The algorithms are based on (1) early sorting and (2) early partitioning, or a combination of both. The idea is to push the sorting and/or the partitioning to the leaves, i.e., the base relations, of the query evaluation plans (QEPs) and thereby avoid sorting or partitioning large intermediate results generated by the joins. Both early sorting and early partitioning are used in combination with hash-based algorithms for evaluating the join(s) and the grouping. To enable early sorting, the sort order generated at an early stage of the QEP is retained through an arbitrary number of so-called order-preserving hash joins. To make early partitioning applicable to a large class of decision support queries, we generalize the so-called hash teams proposed by Graefe et al. [GBC98]. Hash teams make it possible to perform several hash-based operations (join and grouping) on the same attribute in one pass without repartitioning intermediate results. Our generalization consists of indirectly partitioning the input data. Indirect partitioning means partitioning the input data on an attribute that is not directly needed for the next hash-based operation, and it involves the construction of bitmaps to approximate the partitioning for the attribute that is needed in the next hash-based operation. Our performance experiments show that such QEPs based on early sorting, early partitioning, or both in combination perform significantly better than conventional strategies for many common classes of decision support queries. Received April 4, 2000 / Accepted June 23, 2000

13.
XML is rapidly emerging as a standard for data representation and exchange over the World Wide Web, and an increasing amount of sensitive business data is processed in XML format. Therefore, it is critical to have control mechanisms that restrict a user to accessing only the parts of XML documents that she is authorized to access. In this paper, we propose the first DTD-based access control model that employs graph matching to analyze whether an input query is fully acceptable, fully rejectable, or partially acceptable. In this way, there will be no further security overhead for the processing of fully acceptable and rejectable queries. For partially acceptable queries, we propose a graph-matching-based authorization model for an optimized rewriting procedure in which a recursive query (a query with the descendant axis ‘//’) will be rewritten into an equivalent recursive one if possible and into a non-recursive one only if necessary, resulting in queries that can take full advantage of structural-join-based query optimization techniques. Moreover, we propose an index structure for XML element types to speed up the query rewriting procedure, a facility that is potentially useful for applications with large DTDs. Our performance study showed that our algorithms, armed with rewriting indexes, are promising.

14.
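An OLAP aggregate query algorithm based on dimension hierarchy encoding   Cited by: 8 (self-citations: 2, citations by others: 8)
Online analytical processing (OLAP) queries often require ad hoc, complex group-by aggregation over massive data, and their SQL statements usually contain multi-table joins and group-by aggregation operations. Reducing multi-table joins, compressing keys, and performing group-by aggregation on the queried data efficiently have therefore become key problems in ROLAP query processing. This paper proposes DHEPGA, a new pre-grouping aggregation algorithm based on dimension hierarchy encoding. DHEPGA makes full use of short dimension hierarchy codes and their prefixes to quickly retrieve the codes that match the query keys and to determine the query ranges of dimension hierarchy attributes, which reduces I/O cost and improves OLAP query efficiency. Theoretical analysis and experimental results show that DHEPGA is highly effective.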

15.
Compact data structures are storage structures that combine a compressed representation of the data and the access mechanisms for retrieving individual data without the need of decompressing from the beginning. The target is to be able to keep the data always compressed, even in main memory, given that the data can be processed directly in that form. With this approach, we obtain several benefits: we can load larger datasets in main memory, we can make better use of the memory hierarchy, and we can obtain bandwidth savings in a distributed computational scenario, without wasting time in compressing and decompressing data during data exchanges. In this work, we follow a compact data structure approach to design a storage structure for raster data, which is commonly used to represent attributes of the space (temperatures, pressure, elevation measures, etc.) in geographical information systems. As is common in compact data structures, our new technique is not only able to store and directly access compressed data, but also indexes its content, thereby accelerating the execution of queries. Previous compact data structures designed to store raster data work well when the raster dataset has few different values. Nevertheless, when the number of different values in the raster increases, their space consumption and search performance degrade. Our experiments show that our storage structure improves on previous approaches in all aspects, especially when the number of different values is large, which is critical when applied to real datasets. Compared with classical methods for storing rasters, namely netCDF, our method competes in space and excels in access and query times.

16.
Incremental computation of time-varying query expressions   Cited by: 1 (self-citations: 0, citations by others: 1)
We present and analyze algorithms for the incremental computation of time-varying queries in which selection predicates refer to the state of a clock. Such queries occur naturally in many situations where temporal data are processed. Incremental techniques for query computation have proven to be more efficient than other techniques in many situations. However, all existing incremental techniques for query computation assume that old query results remain valid if no intermediate changes are made to the underlying database. Unfortunately, this assumption does not hold for time-varying queries, whose results may change just because time passes. In order to solve this problem, we introduce the notion of a superview, which contains all current tuples that will eventually satisfy the selection predicate of a time-varying selection. Based on the notion of superview, we develop efficient algorithms for the incremental computation of time-varying selections. Our algorithms, combined with existing incremental algorithms, allow complex time-varying queries to benefit from the proven efficiency of incremental techniques. It is important to notice that without our algorithms, the existing algorithms for incremental computation would be useless for any time-varying query expression.

17.
Semantic Web search is a new application of recent advances in information retrieval (IR), natural language processing, artificial intelligence, and other fields. The Powerset group in Microsoft develops a semantic search engine that aims to answer queries not only by matching keywords, but by actually matching meaning in queries to meaning in Web documents. Compared to typical keyword search, semantic search can pose additional engineering challenges for the back-end and infrastructure designs. Of these, the main challenge addressed in this paper is how to lower query latencies to acceptable, interactive levels. Index-based semantic search requires more data processing, such as numerous synonyms, hypernyms, multiple linguistic readings, and other semantic information, both on queries and in the index. In addition, some of the algorithms can be super-linear, such as matching co-references across a document. Consequently, many semantic queries can run significantly slower than the same keyword query. Users, however, have grown to expect Web search engines to provide near-instantaneous results, and a slow search engine could be deemed unusable even if it provides highly relevant results. It is therefore imperative for any search engine to meet its users’ interactivity expectations, or risk losing them. Our approach to tackling this challenge is to exploit data parallelism in slow search queries to reduce their latency in multi-core systems. Although all search engines are designed to exploit parallelism, at the single-node level this usually translates to throughput-oriented task parallelism. This paper focuses on the engineering of two latency-oriented approaches (coarse- and fine-grained) and compares them to the task-parallel approach. We use Powerset’s deployed search engine to evaluate the various factors that affect parallel performance: workload, overhead, load balancing, and resource contention. We also discuss heuristics to selectively control the degree of parallelism and consequent overhead on a query-by-query level. Our experimental results show that using fine-grained parallelism with these dynamic heuristics can significantly reduce query latencies compared to fixed, coarse-granularity parallelization schemes. Although these results were obtained on, and optimized for, Powerset’s semantic search, they can be readily generalized to a wide class of inverted-index search engines.

18.
By combining an unstructured protocol with a DHT-based index, hybrid Peer-to-Peer (P2P) search improves search efficiency in terms of query recall and response time. The key challenge in hybrid search is to estimate the number of peers that can answer a given query. Existing approaches assume that such a number can be directly obtained by computing item popularity. In this work, we show that such an assumption is not always valid, and that previous designs cannot distinguish whether items related to a query are distributed across many peers or held by only a few. To address this issue, we propose QRank, a difficulty-aware hybrid search, which ranks queries by weighting keywords based on term frequency. Using rank values, QRank selects proper search strategies for queries. We conduct comprehensive trace-driven simulations to evaluate this design. Results show that QRank significantly improves search quality and reduces system traffic cost compared with existing approaches.

19.
In this paper we describe a distributed system designed to efficiently store, query and update multidimensional data organized into concept hierarchies and dispersed over a network. Our system employs an adaptive scheme that automatically adjusts the level of indexing according to the granularity of the incoming queries, without assuming any prior knowledge of the workload. Efficient roll-up and drill-down operations take place in order to maximize the performance by minimizing query flooding. Updates are performed on-line, with minimal communication overhead, depending on the level of consistency needed. Extensive experimental evaluation shows that, on top of the advantages that a distributed storage offers, our method answers the vast majority of incoming queries, both point and aggregate ones, without flooding the network and without causing significant storage or load imbalance. Our scheme proves to be especially efficient in cases of skewed workloads, even when these change dynamically with time. At the same time, it manages to preserve the hierarchical nature of data. To the best of our knowledge, this is the first attempt towards the support of concept hierarchies in DHTs.

20.
XML is an ordered data model and XQuery expressions return results that have a well-defined order. However, little work on how order is supported in XML query processing has been done to date. In this paper we study the issues related to handling order in the XML context, namely challenges imposed by the XML data model, the variety of order requirements of the XQuery language, and the need to maintain order in the presence of updates to the XML data. We propose an efficient solution that addresses all these issues. Our solution is based on a key encoding for XML nodes that serves as node identity and at the same time encodes order. We design rules for encoding order of processed XML nodes based on the XML algebraic query execution model and the node key encoding. These rules do not require any actual sorting for intermediate results during execution. Our approach enables efficient order-sensitive incremental view maintenance as it makes most XML algebra operators distributive with respect to bag union. We prove the correctness of our order encoding approach. Our approach is implemented and integrated with Rainbow, an XML data management system developed at WPI. We have tested the efficiency of our approach using queries that have different order requirements. We have also measured the relative cost of different components related to our order solution in different types of queries. In general the overhead of maintaining order in our approach is very small relative to the query processing time.
