1.

Exploring spatial datasets with histograms (total citations: 2; self-citations: 0; citations by others: 2)

Chengyu Sun Nagender Bandi Divyakant Agrawal Amr El Abbadi 《Distributed and Parallel Databases》2006,20(1):57-88

As online spatial datasets grow both in number and sophistication, it becomes increasingly difficult for users to decide whether a dataset is suitable for their tasks, especially when they do not have prior knowledge of the dataset. In this paper, we propose browsing as an effective and efficient way to explore the content of a spatial dataset. Browsing allows users to view the size of a result set before evaluating the query at the database, thereby avoiding zero-hit/mega-hit queries and saving time and resources. Although the underlying technique supporting browsing is similar to range query aggregation and selectivity estimation, spatial dataset browsing poses some unique challenges. In this paper, we identify a set of spatial relations that need to be supported in browsing applications, namely, the contains, contained and overlap relations. We prove a lower bound on the storage required to answer queries about the contains relation accurately at a given resolution. We then present three storage-efficient approximation algorithms which we believe to be the first to estimate query results about these spatial relations. We evaluate these algorithms with both synthetic and real-world datasets and show that they provide highly accurate estimates for datasets with various characteristics. Recommended by: Sunil Prabhakar. Work supported by NSF grants IIS 02-23022 and CNF 04-23336. An earlier version of this paper appeared in the 17th International Conference on Data Engineering (ICDE 2001).
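To make the three relations concrete, the sketch below counts them exactly by brute force for axis-aligned rectangles. It is only a reference for what a browsing estimate would approximate, not one of the paper's histogram algorithms. The reading of the relation names (an object is "contains" when the query window contains it, "contained" when it contains the query window) and the function names `relation` and `browse_counts` are assumptions made for illustration.

```python
from collections import Counter

# A rectangle is (xmin, ymin, xmax, ymax). Relation names follow an assumed
# reading of the abstract: they describe the query window relative to the object.

def relation(obj, query):
    """Classify one object rectangle against the query window."""
    ox0, oy0, ox1, oy1 = obj
    qx0, qy0, qx1, qy1 = query
    if ox1 < qx0 or qx1 < ox0 or oy1 < qy0 or qy1 < oy0:
        return "disjoint"
    if qx0 <= ox0 and ox1 <= qx1 and qy0 <= oy0 and oy1 <= qy1:
        return "contains"    # the query window contains the object
    if ox0 <= qx0 and qx1 <= ox1 and oy0 <= qy0 and qy1 <= oy1:
        return "contained"   # the query window is contained in the object
    return "overlap"         # partial overlap only

def browse_counts(objects, query):
    """Exact result-set sizes per relation; an estimator would approximate these."""
    return Counter(relation(o, query) for o in objects)
```

A browsing interface would display such counts before the query is sent to the database; the point of the estimators in the paper is to obtain them without scanning every object as this sketch does.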

2.

Fast computation of spatial selections and joins using graphics hardware

Nagender Bandi Chengyu Sun Divyakant Agrawal Amr El Abbadi 《Information Systems》2007

Spatial database operations are typically performed in two steps. In the filtering step, indexes and the minimum bounding rectangles (MBRs) of the objects are used to quickly determine a set of candidate objects. In the refinement step, the actual geometries of the objects are retrieved and compared to the query geometry or each other. Because of the complexity of the computational geometry algorithms involved, the CPU cost of the refinement step is usually the dominant cost of the operation for complex geometries such as polygons. Although many run-time and pre-processing-based heuristics have been proposed to alleviate this problem, the CPU cost still remains the bottleneck. In this paper, we propose a novel approach to address this problem using the efficient rendering and searching capabilities of modern graphics hardware. This approach does not require expensive pre-processing of the data or changes to existing storage and index structures, and is applicable to both intersection and distance predicates. We evaluate this approach by comparing the performance with leading software solutions. The results show that by combining hardware and software methods, the overall computational cost can be reduced substantially for both spatial selections and joins. We integrated this hardware/software co-processing technique into a popular database to evaluate its performance in the presence of indexes, pre-processing and other proprietary optimizations. Extensive experimentation with real-world data sets shows that the hardware-accelerated technique not only outperforms the run-time software solutions but also performs as well as, if not better than, pre-processing-assisted techniques.
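The two-step evaluation described in this abstract can be sketched in plain software as follows. This is a CPU-only illustration of filtering and refinement for an intersection predicate on simple polygons; it does not reproduce the graphics-hardware technique the paper proposes. All function names are hypothetical, and the filter scans every MBR rather than using an index.

```python
def mbr(poly):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a vertex list."""
    xs, ys = [p[0] for p in poly], [p[1] for p in poly]
    return min(xs), min(ys), max(xs), max(ys)

def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def _orient(p, q, r):
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def _on_segment(p, q, r):
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))

def segments_intersect(p1, p2, p3, p4):
    d1, d2 = _orient(p3, p4, p1), _orient(p3, p4, p2)
    d3, d4 = _orient(p1, p2, p3), _orient(p1, p2, p4)
    if d1 * d2 < 0 and d3 * d4 < 0:
        return True  # proper crossing
    # Collinear endpoints touching the other segment.
    return ((d1 == 0 and _on_segment(p3, p4, p1)) or (d2 == 0 and _on_segment(p3, p4, p2))
            or (d3 == 0 and _on_segment(p1, p2, p3)) or (d4 == 0 and _on_segment(p1, p2, p4)))

def point_in_polygon(pt, poly):
    """Ray-casting test; points exactly on the boundary are not handled specially."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def polygons_intersect(a, b):
    """Refinement predicate: some edges cross, or one polygon lies inside the other."""
    for i in range(len(a)):
        for j in range(len(b)):
            if segments_intersect(a[i], a[(i + 1) % len(a)], b[j], b[(j + 1) % len(b)]):
                return True
    return point_in_polygon(a[0], b) or point_in_polygon(b[0], a)

def spatial_select(query, objects):
    """objects maps a name to a polygon. Returns (candidate names, result names)."""
    qbox = mbr(query)
    candidates = [n for n, poly in objects.items() if boxes_overlap(mbr(poly), qbox)]  # filter
    results = [n for n in candidates if polygons_intersect(objects[n], query)]         # refine
    return candidates, results
```

The nested edge loop in `polygons_intersect` is quadratic in the number of vertices, which illustrates why refinement dominates the cost for complex polygons.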

3.

Distributed optimistic concurrency control with reduced rollback

Divyakant Agrawal Arthur J. Bernstein Pankaj Gupta Soumitra Sengupta 《Distributed Computing》1987,2(1):45-59

Concurrency control algorithms have traditionally been based on locking and timestamp ordering mechanisms. Recently, optimistic schemes have been proposed. In this paper a distributed, multi-version, optimistic concurrency control scheme is described which is particularly advantageous in a query-dominant environment. The drawbacks of the original optimistic concurrency control scheme, namely that inconsistent views may be seen by transactions (potentially causing unpredictable behavior) and that read-only transactions must be validated and may be rolled back, have been eliminated in the proposed algorithm. Read-only transactions execute in a completely asynchronous fashion and are therefore processed with very little overhead. Furthermore, the probability that read-write transactions are rolled back has been reduced by generalizing the validation algorithm. The effects of global transactions on local transaction processing are minimized. The algorithm is also free from deadlock and cascading rollback problems. Divyakant Agrawal is currently a graduate student in the Department of Computer Science at the State University of New York at Stony Brook. He received his B.E. degree from Birla Institute of Technology and Science, Pilani, India in 1980. He worked with Tata Burroughs Limited from 1980 to 1982. He completed his M.S. degree in Computer Science from SUNY at Stony Brook in 1984. His research interests include design of algorithms for concurrent systems, optimistic protocols and distributed systems. Arthur Bernstein is a Professor of Computer Science at the State University of New York at Stony Brook. His research is concerned with the design and verification of algorithms involving asynchronous activity and with languages for expressing such algorithms. Pankaj Gupta is currently a graduate student in the Department of Computer Science at the State University of New York at Stony Brook. He received an M.S. degree in Electrical Engineering from SUNY at Stony Brook in 1982 and an M.S. degree in Computer Science from SUNY at Stony Brook in 1985. His research interests include distributed systems, concurrency control, and databases. Soumitra Sengupta is currently a graduate student in the Department of Computer Science at the State University of New York at Stony Brook. He received his B.E. degree from Birla Institute of Technology and Science, Pilani, India in 1980. He worked with Tata Consultancy Services from 1980 to 1982. He completed his M.S. degree in Computer Science from SUNY at Stony Brook in 1984. His research interests include distributed algorithms, logic databases and concurrency control. This work was supported by the National Science Foundation under grant DCR-8502161 and the Air Force Office of Scientific Research under grant AFOSR 810197.

4.

Towards practical private processing of database queries over public data

Shiyuan Wang Divyakant Agrawal Amr El Abbadi 《Distributed and Parallel Databases》2014,32(1):65-89

Privacy is a major concern when users query public online data services. The privacy of millions of people has been jeopardized in numerous user data leakage incidents in many popular online applications. To address the critical problem of personal data leakage through queries, we enable private querying on public data services so that the contents of user queries and any user data are hidden and therefore not revealed to the online service providers. We propose two protocols for private processing of database queries, namely BHE and HHE. The two protocols provide strong query privacy by using Paillier’s homomorphic encryption, and support common database queries such as range and join queries by relying on the bucketization of public data. In contrast to traditional Private Information Retrieval proposals, BHE and HHE only incur one round of client-server communication for processing a single query. BHE is a basic private query processing protocol that provides complete query privacy but still incurs expensive computation and communication costs. Built upon BHE, HHE is a hybrid protocol that applies ciphertext computation and communication on a subset of the data, such that this subset not only covers the actual requested data but also resembles some frequent query patterns of common users, thus achieving practical query performance while ensuring adequate privacy levels. By using frequent query patterns and data-specific privacy protection, HHE is not vulnerable to the traditional attacks on k-Anonymity that exploit data similarity and skewness. Moreover, HHE consistently protects user query privacy for a sequence of queries in a single query session.
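The additive homomorphism of Paillier encryption that both protocols rely on can be illustrated with a toy implementation. The sketch below shows one generic way to select a bucket privately with an encrypted 0/1 selector vector in a single round. The abstract does not spell out the actual steps of BHE or HHE, so this is an assumption-laden illustration rather than either protocol. The primes are far too small for real use, and all function names are hypothetical.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier key generation with generator g = n + 1 (p and q are small primes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, Python 3.8+
    return n, (lam, mu)

def encrypt(n, m):
    """Enc(m) = (n + 1)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def private_select(n, enc_selector, bucket_values):
    """Server side: computes Enc(sum(b_i * v_i)) without learning the selector bits b_i."""
    n2 = n * n
    acc = 1
    for c, v in zip(enc_selector, bucket_values):
        acc = (acc * pow(c, v, n2)) % n2  # Enc(b)^v = Enc(b * v); products add plaintexts
    return acc

# Client wants bucket 2; the server sees only ciphertexts.
n, priv = keygen(293, 433)
buckets = [11, 22, 33, 44]
selector = [encrypt(n, 1 if i == 2 else 0) for i in range(len(buckets))]
reply = private_select(n, selector, buckets)
selected = decrypt(n, priv, reply)
```

The server touches every bucket for each query, which is the kind of computation and communication cost the abstract attributes to BHE and that HHE reduces by working on a subset of the data.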

5.

High dimensional nearest neighbor searching

Hakan Ferhatosmanoglu Ertem Tuncel Divyakant Agrawal Amr El Abbadi 《Information Systems》2006

As databases increasingly integrate different types of information such as time-series, multimedia and scientific data, it becomes necessary to support efficient retrieval of multi-dimensional data. Both the dimensionality and the amount of data that needs to be processed are increasing rapidly. As a result of the scale and high dimensional nature, the traditional techniques have proven inadequate. In this paper, we propose search techniques that are effective especially for large high dimensional data sets. We first propose the VA⁺-file technique, which is based on scalar quantization of the data. The VA⁺-file is especially useful for searching exact nearest neighbors (NN) in non-uniform high dimensional data sets. We then discuss how to improve the search and make it progressive by allowing some approximations in the query result. We develop a general framework for approximate NN queries, discuss various approaches for progressive processing of similarity queries, and develop a metric for evaluation of such techniques. Finally, a new technique based on clustering is proposed, which merges the benefits of various approaches for progressive similarity searching. Extensive experimental evaluation is performed on several real-life data sets. The evaluation establishes the superiority of the proposed techniques over the existing techniques for high dimensional similarity searching. The techniques proposed in this paper are effective for real-life data sets, which are typically non-uniform, and they are scalable with respect to both dimensionality and size of the data set.
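How scalar quantization supports exact nearest-neighbor search can be sketched as follows. Each vector is approximated by the grid cell it falls in, a lower bound on its distance to the query is computed from the cell alone, and full vectors are fetched only while the bound can still beat the best distance found. The sketch uses a uniform grid with the same number of bits in every dimension, which is a simplification and not the VA⁺-file itself; all names and parameters are hypothetical.

```python
import math

def quantize(vec, lo, hi, bits):
    """Map each coordinate to the index of its uniform cell in [lo, hi]."""
    cells = 1 << bits
    approx = []
    for v, l, h in zip(vec, lo, hi):
        width = (h - l) / cells
        approx.append(min(cells - 1, max(0, int((v - l) / width))))
    return approx

def lower_bound(query, approx, lo, hi, bits):
    """Smallest possible Euclidean distance from the query to any point in the cell."""
    cells = 1 << bits
    total = 0.0
    for qv, idx, l, h in zip(query, approx, lo, hi):
        width = (h - l) / cells
        cell_lo = l + idx * width
        cell_hi = cell_lo + width
        d = cell_lo - qv if qv < cell_lo else qv - cell_hi if qv > cell_hi else 0.0
        total += d * d
    return math.sqrt(total)

def nearest_neighbor(query, data, lo, hi, bits=2):
    """Exact NN: scan approximations first, then visit full vectors in bound order."""
    bounds = [lower_bound(query, quantize(v, lo, hi, bits), lo, hi, bits) for v in data]
    best_i, best_d, visited = None, math.inf, 0
    for i in sorted(range(len(data)), key=bounds.__getitem__):
        if bounds[i] >= best_d:
            break  # no remaining vector can be closer than the current best
        visited += 1  # stands in for an expensive fetch of the full vector
        d = math.dist(query, data[i])
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d, visited
```

The `visited` counter is what a good quantization scheme tries to keep small: tighter cells give tighter bounds, so fewer full vectors need to be read.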

6.

Optimal Data-Space Partitioning of Spatial Data for Parallel I/O

Hakan Ferhatosmanoğlu Divyakant Agrawal Ömer Eğecioğlu Amr El Abbadi 《Journal of Materials Science: Materials in Electronics》2004,15(11):75-101

It is desirable to design partitioning methods that minimize the I/O time incurred during query execution in spatial databases. This paper explores optimal partitioning for two-dimensional data for a class of queries and develops multi-disk allocation techniques that maximize the degree of I/O parallelism obtained in each case. We show that hexagonal partitioning has optimal I/O performance for circular queries among all partitioning methods that use convex non-overlapping regions. An analysis and extension of this result to all possible partitioning techniques is also given. For rectangular queries, we show that hexagonal partitioning has overall better I/O performance for a general class of range queries, except for rectilinear queries, in which case rectangular grid partitioning is superior. By using current algorithms for rectangular grid partitioning, parallel storage and retrieval algorithms for hexagonal partitioning can be constructed. Some of these results carry over to circular partitioning of the data, which is an example of a non-convex region.

7.

Data space mapping for efficient I/O in large multi-dimensional databases

Hakan Ferhatosmanoglu Aravind Ramachandran Divyakant Agrawal Amr El Abbadi 《Information Systems》2007

In this paper, we propose data space mapping techniques for storage and retrieval in multi-dimensional databases on multi-disk architectures. We identify the important factors for efficient multi-disk searching of multi-dimensional data and develop secondary storage organization and retrieval techniques that directly address these factors. We especially focus on high dimensional data, where none of the current approaches are effective. In contrast to the current declustering techniques, storage techniques in this paper consider both inter- and intra-disk organization of the data. The data space is first partitioned into buckets, then the buckets are declustered to multiple disks while they are clustered in each disk. The queries are executed through bucket identification techniques that locate the pages. One of the partitioning techniques we discuss is especially practical for high dimensional data, and our disk and page allocation techniques are optimal with respect to number of I/O accesses and seek times. We provide experimental results that support our claims on two real high dimensional datasets.

8.

Supporting web query expansion efficiently using multi-granularity indexing and query processing (total citations: 3; self-citations: 0; citations by others: 3)

Wen-Syan Divyakant 《Data & Knowledge Engineering》2000,35(3):239-257

The problem of word mismatch in information retrieval (IR) occurs because users often use different words to describe concepts in their queries than authors use to describe the same concepts in their documents. Query expansion is used to deal with the mismatch between author and user vocabularies. To support query expansion, indices on words related by lexical semantics and syntactical co-occurrence need to be maintained. Two issues become paramount in supporting query expansion: the size of index tables and the query processing overhead. In this paper, we propose to use the notion of multi-granularity for more efficient indexing and query processing while the same degrees of precision and recall are maintained. We also describe extensions of this technique to handle (1) query relaxation for words with multiple senses and with other semantic relationships; (2) progressive processing of queries with top N results; and (3) progressive processing of queries with specification of the importance of each keyword.

9.

$\mathcal{MD}$-HBase: design and implementation of an elastic data infrastructure for cloud-scale location services

Shoji Nishimura Sudipto Das Divyakant Agrawal Amr El Abbadi 《Distributed and Parallel Databases》2013,31(2):289-319

The ubiquity of location-enabled devices has resulted in a wide proliferation of location-based applications and services. To handle the growing scale, database management systems driving such location-based services (LBS) must cope with high insert rates for location updates of millions of devices, while supporting efficient real-time analysis on the latest location. Traditional DBMSs, equipped with multi-dimensional index structures, can efficiently handle spatio-temporal data. However, popular open-source relational database systems are overwhelmed by the high insertion rates, real-time querying requirements, and terabytes of data that these systems must handle. On the other hand, key-value stores can effectively support large-scale operation, but do not natively provide the multi-attribute accesses needed to support the rich querying functionality essential for LBSs. We present the design and implementation of $\mathcal{MD}$-HBase, a scalable data management infrastructure for LBSs that bridges this gap between scale and functionality. Our approach leverages a multi-dimensional index structure layered over a key-value store. The underlying key-value store allows the system to sustain high insert throughput and large data volumes, while ensuring fault tolerance and high availability. On the other hand, the index layer allows efficient multi-dimensional query processing. Our optimized query processing technique accesses only the index and storage level entries that intersect with the query region, thus ensuring efficient query processing. We present the design of $\mathcal{MD}$-HBase that demonstrates how two standard index structures, the K-d tree and the Quad tree, can be layered over a range-partitioned key-value store to provide scalable multi-dimensional data infrastructure. Our prototype implementation using HBase, a standard open-source key-value store, can handle hundreds of thousands of inserts per second using a modest 16-node cluster, while efficiently processing multi-dimensional range queries and nearest neighbor queries in real time with response times as low as a few hundred milliseconds.
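One way to layer a quad-tree-style index over a sorted, range-partitioned key space is to linearize points by bit interleaving (Z-order), so that every quad-tree cell corresponds to one contiguous key range. The sketch below assumes that linearization; the abstract itself does not say how keys are formed. A sorted Python list with `bisect` stands in for the key-value store's range scan, and all names are hypothetical.

```python
import bisect

BITS = 4  # coordinates are integers in [0, 2**BITS)

def z_key(x, y, bits=BITS):
    """Interleave the bits of x and y, most significant first."""
    key = 0
    for i in range(bits - 1, -1, -1):
        key = (key << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1)
    return key

def build_store(points):
    """Sorted (key, value) layout, standing in for a range-partitioned store."""
    items = sorted((z_key(x, y), (x, y)) for x, y in points)
    return [k for k, _ in items], [v for _, v in items]

def _cell_origin(prefix, depth):
    """Decode a quad-tree cell prefix (2 bits per level) into cell coordinates."""
    cx = cy = 0
    for i in range(depth):
        pair = (prefix >> (2 * (depth - 1 - i))) & 3
        cx, cy = (cx << 1) | (pair >> 1), (cy << 1) | (pair & 1)
    return cx, cy

def range_query(keys, vals, qx0, qy0, qx1, qy1, bits=BITS):
    """Return stored points inside the inclusive query rectangle."""
    results = []

    def visit(prefix, depth):
        side = 1 << (bits - depth)
        cx, cy = _cell_origin(prefix, depth)
        x0, y0 = cx * side, cy * side
        x1, y1 = x0 + side - 1, y0 + side - 1
        if x1 < qx0 or x0 > qx1 or y1 < qy0 or y0 > qy1:
            return  # cell does not intersect the query region: skip its key range
        if qx0 <= x0 and x1 <= qx1 and qy0 <= y0 and y1 <= qy1:
            lo = prefix << (2 * (bits - depth))
            hi = lo + (1 << (2 * (bits - depth))) - 1
            i, j = bisect.bisect_left(keys, lo), bisect.bisect_right(keys, hi)
            results.extend(vals[i:j])  # one contiguous range scan
            return
        for child in range(4):
            visit((prefix << 2) | child, depth + 1)

    visit(0, 0)
    return results
```

Only cells that intersect the query rectangle are descended into, which mirrors the abstract's point that query processing touches only the index and storage entries overlapping the query region.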

10.

Exploiting sequential access when declustering data over disks and MEMS-based storage

Hailing Yu Divyakant Agrawal Amr El Abbadi 《Distributed and Parallel Databases》2006,19(2-3):147-168

Due to the large difference between seek time and transfer time in current disk technology, it is advantageous to perform large I/O using a single sequential access rather than multiple small random I/O accesses. However, prior optimal cost and data placement approaches for processing range queries over two-dimensional datasets do not consider this property. In particular, these techniques do not consider the issue of sequential data placement when multiple I/O blocks need to be retrieved from a single device. In this paper, we reevaluate the optimal cost of range queries by declustering two-dimensional datasets over multiple devices, and prove that, in general, it is impossible to achieve the new optimal cost. This is because disks cannot facilitate two-dimensional sequential access, which is required by the new optimal cost. Then we revisit the existing data allocation schemes under the new optimal cost, and show that none of them can achieve the new optimal cost. Fortunately, MEMS-based storage is being developed to reduce I/O cost. We first show that the two-dimensional sequential access requirement cannot be satisfied by simply modeling MEMS-based storage as conventional disks. Then we propose a new placement scheme that exploits the physical properties of MEMS-based storage to solve this problem. Our theoretical analysis and experimental results show that the new scheme achieves almost optimal I/O costs. Recommended by: Sunil Prabhakar. This research is supported by NSF grants IIS-0220152 and CNF-0423336.