首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
We outline the processes of intelligent creation and utilization of granulated data summaries in the engine aimed at fast approximate execution of analytical SQL statements. We discuss how to use the introduced engine for the purposes of ad-hoc data exploration over large and quickly increasing data collected in a heterogeneous or distributed fashion. We focus on mechanisms that transform input data summaries into result sets representing query outcomes. We also illustrate how our computational principles can be put together with other paradigms of scaling and harnessing data analytics.  相似文献   

2.
Source code terms such as method names and variable types are often different from conceptual words mentioned in a search query. This vocabulary mismatch problem can make code search inefficient. In this paper, we present COde voCABUlary (CoCaBu), an approach to resolving the vocabulary mismatch problem when dealing with free-form code search queries. Our approach leverages common developer questions and the associated expert answers to augment user queries with the relevant, but missing, structural code entities in order to improve the performance of matching relevant code examples within large code repositories. To instantiate this approach, we build GitSearch, a code search engine, on top of GitHub and Stack Overflow Q&A data. We evaluate GitSearch in several dimensions to demonstrate that (1) its code search results are correct with respect to user-accepted answers; (2) the results are qualitatively better than those of existing Internet-scale code search engines; (3) our engine is competitive against web search engines, such as Google, in helping users solve programming tasks; and (4) GitSearch provides code examples that are acceptable or interesting to the community as answers for Stack Overflow questions.  相似文献   

3.
Conventional data warehouses employ the query-at-a-time model, which maps each query to a distinct physical plan. When several queries execute concurrently, this model introduces contention and thrashing, because the physical plans??unaware of each other??compete for access to the underlying I/O and computation resources. As a result, while modern systems can efficiently optimize and evaluate a single complex data analysis query, their performance suffers significantly and can be highly erratic when multiple complex queries run at the same time. We present in this paper Cjoin, a new design that substantially improves throughput in large-scale data analytics systems processing many concurrent join queries. In contrast to the conventional query-at-a-time model our approach employs a single physical plan that shares I/O, computation, and tuple storage across all in-flight join queries. We use an ??always on?? pipeline of non-blocking operators, managed by a controller that continuously examines the current query mix and optimizes the pipeline on the fly. Our design enables data analytics engines to scale gracefully to large data sets, provide predictable execution times, and reduce contention. We implemented Cjoin as an extension to the PostgreSQL DBMS. This prototype outperforms conventional commercial systems by an order of magnitude for tens to hundreds of concurrent queries.  相似文献   

4.
We study the problem of answering k -hop reachability queries in a directed graph, i.e., whether there exists a directed path of length $k$ , from a source query vertex to a target query vertex in the input graph. The problem of $k$ -hop reachability is a general problem of the classic reachability (where $k=\infty $ ). Existing indexes for processing classic reachability queries, as well as for processing shortest path distance queries, are not applicable or not efficient for processing $k$ -hop reachability queries. We propose an efficient index for processing $k$ -hop reachability queries. Our experimental results on a wide range of real datasets show that our method is efficient and scalable in terms of both index construction and query processing.  相似文献   

5.
Aggregate similarity search, also known as aggregate nearest-neighbor (Ann) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves from a database the objects most similar to Q, where the similarity is an aggregation (e.g., \({{\mathrm{sum}}}\), \(\max \)) of the distances between each retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggregation over the distances between p and any subset of \(\phi M\) objects in Q for some support \(0< \phi \le 1\). We call this new definition flexible aggregate similarity search and accordingly refer to a query as a flexible aggregate nearest-neighbor ( Fann ) query. We present algorithms for answering Fann queries exactly and approximately. Our approximation algorithms are especially appealing, which are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimensions. Extensive experiments on large real and synthetic datasets from 2 to 74 dimensions have demonstrated their superior efficiency and high quality.  相似文献   

6.
A mode of a multiset S is an element aS of maximum multiplicity; that is, a occurs at least as frequently as any other element in S. Given an array A[1:n] of n elements, we consider a basic problem: constructing a static data structure that efficiently answers range mode queries on A. Each query consists of an input pair of indices (i,j) for which a mode of A[i:j] must be returned. The best previous data structure with linear space, by Krizanc, Morin, and Smid (Proceedings of the International Symposium on Algorithms and Computation (ISAAC), pp. 517–526, 2003), requires \(\varTheta (\sqrt{n}\log\log n)\) query time in the worst case. We improve their result and present an O(n)-space data structure that supports range mode queries in \(O(\sqrt{n/\log n})\) worst-case time. In the external memory model, we give a linear-space data structure that requires \(O(\sqrt{n/B})\) I/Os per query, where B denotes the block size. Furthermore, we present strong evidence that a query time significantly below \(\sqrt{n}\) cannot be achieved by purely combinatorial techniques; we show that boolean matrix multiplication of two \(\sqrt{n} \times \sqrt{n}\) matrices reduces to n range mode queries in an array of size O(n). Additionally, we give linear-space data structures for the dynamic problem (queries and updates in near O(n 3/4) time), for orthogonal range mode in d dimensions (queries in near O(n 1?1/2d ) time) and for half-space range mode in d dimensions (queries in \(O(n^{1-1/d^{2}})\) time). Finally, we complement our dynamic data structure with a reduction from the multiphase problem, again supporting that we cannot hope for much more efficient data structures.  相似文献   

7.
In this paper we consider integration of SMT solvers with the filtering algorithms for the finite domain alldifferent constraint. Such integration makes SMT solvers suitable for solving constraint satisfaction problems with the alldifferent constraint involved. First, we present a novel algorithm for explaining inconsistencies and propagations in the alldifferent constraint. We compare it to Katsirelos’ algorithm and flow-based algorithms that are commonly used for that purpose. Then we describe our DPLL(T)-compliant SMT theory solver for constraint satisfaction problems that include alldifferent constraints. We also provide an experimental evaluation of our approach.  相似文献   

8.
Despite a large body of work on XPath query processing in relational environment, systematic study of queries containing not-predicates have received little attention in the literature. Particularly, several xml supports of industrial-strength commercial rdbms fail to efficiently evaluate such queries. In this paper, we present an efficient and novel strategy to evaluate not -twig queries in a tree-unaware relational environment. not -twig queries are XPath queries with ancestor–descendant and parent–child axis and contain one or more not-predicates. We propose a novel Dewey-based encoding scheme called Andes (ANcestor Dewey-based Encoding Scheme), which enables us to efficiently filter out elements satisfying a not-predicate by comparing their ancestor group identifiers. In this approach, a set of elements under the same common ancestor at a specific level in the xml tree is assigned same ancestor group identifier. Based on this scheme, we propose a novel sql translation algorithm for not-twig query evaluation. Experiments carried out confirm that our proposed approach built on top of an off-the-shelf commercial rdbms significantly outperforms state-of-the-art relational and native approaches. We also explore the query plans selected by a commercial relational optimizer to evaluate our translated queries in different input cardinality. Such exploration further validates the performance benefits of Andes.  相似文献   

9.
SNEE: a query processor for wireless sensor networks   总被引:1,自引:0,他引:1  
A wireless sensor network (WSN) can be construed as an intelligent, large-scale device for observing and measuring properties of the physical world. In recent years, the database research community has championed the view that if we construe a WSN as a database (i.e., if a significant aspect of its intelligent behavior is that it can execute declaratively-expressed queries), then one can achieve a significant reduction in the cost of engineering the software that implements a data collection program for the WSN while still achieving, through query optimization, very favorable cost:benefit ratios. This paper describes a query processing framework for WSNs that meets many desiderata associated with the view of WSN as databases. The framework is presented in the form of compiler/optimizer, called SNEE, for a continuous declarative query language over sensed data streams, called SNEEql. SNEEql can be shown to meet the expressiveness requirements of a large class of applications. SNEE can be shown to generate effective and efficient query evaluation plans. More specifically, the paper describes the following contributions: (1) a user-level syntax and physical algebra for SNEEql, an expressive continuous query language over WSNs; (2) example concrete algorithms for physical algebraic operators defined in such a way that the task of deriving memory, time and energy analytical cost-estimation models (CEMs) for them becomes straightforward by reduction to a structural traversal of the pseudocode; (3) CEMs for the concrete algorithms alluded to; (4) an architecture for the optimization of SNEEql queries, called SNEE, building on well-established distributed query processing components where possible, but making enhancements or refinements where necessary to accommodate the WSN context; (5) algorithms that instantiate the components in the SNEE architecture, thereby supporting integrated query planning that includes routing, placement and timing; and (6) an empirical performance evaluation of the resulting framework.  相似文献   

10.
Search engine users often encounter the difficulty of phrasing the precise query that could lead to satisfactory search results. Query recommendation is considered an effective assistant in enhancing keyword-based queries in search engines and Web search software. In this paper, we present a Query-URL Bipartite based query reCommendation approach, called QUBiC. It utilizes the connectivity of a query-URL bipartite graph to recommend related queries and can significantly improve the accuracy and effectiveness of personalized query recommendation systems comparing with the conventional pairwise similarity based approach. The main contribution of the QUBiC approach is its three-phase framework for personalized query recommendations. The first phase is the preparation of queries and their search results returned by a search engine, which generates a historical query-URL bipartite collection. The second phase is the discovery of similar queries by extracting a query affinity graph from the bipartite graph, instead of operating on the original bipartite graph directly using biclique-based approach or graph clustering. The query affinity graph consists of only queries as its vertices and its edges are weighted according to a query-URL vector based similarity (dissimilarity) measure. The third phase is the ranking of similar queries. We devise a novel rank mechanism for ordering the related queries based on the merging distances of a hierarchical agglomerative clustering (HAC). By utilizing the query affinity graph and the HAC-based ranking, we are able to capture the propagation of similarity from query to query by inducing an implicit topical relatedness between queries. Furthermore, the flexibility of the HAC strategy makes it possible for users to interactively participate in the query recommendation process, and helps to bridge the gap between the determinacy of actual similarity values and the indeterminacy of users’ information needs, allowing the lists of related queries to be changed from user to user and query to query, thus adaptively recommending related queries on demand. Our experimental evaluation results show that the QUBiC approach is highly efficient and more effective compared to the conventional query recommendation systems, yielding about 13.3 % as the most improvement in terms of precision.  相似文献   

11.
Continuous visible nearest neighbor query processing in spatial databases   总被引:1,自引:0,他引:1  
In this paper, we identify and solve a new type of spatial queries, called continuous visible nearest neighbor (CVNN) search. Given a data set P, an obstacle set O, and a query line segment q in a two-dimensional space, a CVNN query returns a set of \({\langle p, R\rangle}\) tuples such that \({p \in P}\) is the nearest neighbor to every point r along the interval \({R \subseteq q}\) as well as p is visible to r. Note that p may be NULL, meaning that all points in P are invisible to all points in R due to the obstruction of some obstacles in O. In contrast to existing continuous nearest neighbor query, CVNN retrieval considers the impact of obstacles on visibility between objects, which is ignored by most of spatial queries. We formulate the problem, analyze its unique characteristics, and develop efficient algorithms for exact CVNN query processing. Our methods (1) utilize conventional data-partitioning indices (e.g., R-trees) on both P and O, (2) tackle the CVNN search by performing a single query for the entire query line segment, and (3) only access the data points and obstacles relevant to the final query result by employing a suite of effective pruning heuristics. In addition, several interesting variations of CVNN queries have been introduced, and they can be supported by our techniques, which further demonstrates the flexibility of the proposed algorithms. A comprehensive experimental evaluation using both real and synthetic data sets has been conducted to verify the effectiveness of our proposed pruning heuristics and the performance of our proposed algorithms.  相似文献   

12.
Wireless sensor networks (WSN) is a key enabling technique for achieving the vision of the Internet of Things. In many applications of WSN such as environmental monitoring and vehicle tracking, they may require to launch spatial queries for collecting and gathering sensory data for achieving certain goals. One such query is the \(K\) nearest neighbor (KNN) query, which aims to collect sensory data from \(k\) sensor nodes nearest to a certain query location. Techniques, namely the itinerary-based KNN query algorithms, are recently developed for facilitating KNN queries. Generally, these techniques propagate queries and collect data along a predetermined itinerary. However, query accuracy and boundary expansion are two challenges that are not well addressed. To mitigate these issues, in this paper, we propose a novel KNN query algorithm based on grid division routing in the setting of skewness distribution, where the itinerary is formed based on the connectivity of adjacent grid cells centers. This technique can achieve better query accuracy and cause less energy consumption by executing the query concurrently in subregions. Besides, the void region problem is well addressed based on the proximity of neighbor grid cells. Experiment result shows that our technique performs better in several aspects including query accuracy, data redundancy, and energy efficiency.  相似文献   

13.
Even with its significant impacts on the database area, the R-tree is often criticized by its lack of good worst-case guarantees. For example, in range search (where we want to report all the data points in a query rectangle), it is known that on adversely designed datasets and queries, an R-tree can be as slow as a sequential scan that simply reads all the data points. Nevertheless, R-trees work so well on real data that they have been widely implemented in commercial systems. This stark contrast has caused long-term controversy between practitioners and theoreticians as to whether this structure deserves its fame. This paper provides theoretical evidence that, somewhat surprisingly, R-trees are efficient in the worst case for range search on many real datasets. Given any integer \(K\) , we explain how to obtain an upper bound on the cost of answering all (i.e., infinitely many) range queries retrieving at most \(K\) objects. On practical data, the upper bound is only a fraction of the overhead of sequential scan (unless, apparently, \(K\) is at the same order as the dataset size). Our upper bounds are tight up to a constant factor, namely they cannot be lowered by more than \(O(1)\) times while still capturing the most expensive queries. Our upper bounds can be calculated in constant time by remembering only three integers. These integers, in turn, are generated from only the leaf MBRs of an R-tree, but not the leaf nodes themselves. In practice, the internal nodes are often buffered in memory, so that the integers aforementioned can be efficiently maintained along with the data updates and made available to a query optimizer at any time. Furthermore, our analytical framework introduces instance-level query bound as a new technique for evaluating the efficiency of heuristic structures in a theory-flavored manner (previously, experimentation was the dominant assessment method).  相似文献   

14.
This paper extends Common2, the family of objects that implement and are wait-free implementable from 2 consensus objects, in two ways: First, the stack object is shown to be in the family, refuting a conjecture to the contrary [6]. Second, Common2 is investigated in the unbounded concurrency model, whereas until now it was considered only in an n-process model. We show that the fetch-and-add, test-and-set , and stack objects are in Common2 even with respect to this stronger notion of wait-free implementation. Our constructions rely on a wait-free implementation of immediate snapshots in the unbounded concurrency model, which was previously not known to be possible. The introduction of unbounded concurrency to the study of Common2 opens several directions of research: are there objects that have n-process implementations but are not unbounded concurrency implementable? We conjecture that swap is such an object. Additionally, the hope is that a queue impossibility proof, which eludes us in the n-process model, will be easier to establish in the unbounded concurrency model.  相似文献   

15.
This paper studies the problem of probabilistic range query over uncertain data. Although existing solutions could support such query, it still has space for improvement. In this paper, we firstly propose a novel index called S-MRST for indexing uncertain data. For one thing, via using an irregular shape for bounding uncertain data, it has a stronger space pruning ability. For another, by taking the gradient of probability density function into consideration, S-MRST is also powerful in terms of probability pruning ability. More important, S-MRST is a general index which could support multiple types of probabilistic queries. Theoretical analysis and extensive experimental results demonstrate the effectiveness and efficiency of the proposed index.  相似文献   

16.
We consider the problem of finding one or more desired items out of an unsorted database. Patel has shown that if the database permits quantum queries, then mere digitization is sufficient for efficient search for one desired item. The algorithm, called factorized quantum search algorithm, presented by him can locate the desired item in an unsorted database using O( $log_4N$ ) queries to factorized oracles. But the algorithm requires that all the attribute values must be distinct from each other. In this paper, we discuss how to make a database satisfy the requirements, and present a quantum search engine based on the algorithm. Our goal is achieved by introducing auxiliary files for the attribute values that are not distinct, and converting every complex query request into a sequence of calls to factorized quantum search algorithm. The query complexity of our algorithm is O( $log_4N$ ) for most cases.  相似文献   

17.
Uncertain data streams, where data are incomplete and imprecise, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the claro system that supports stream processing for uncertain data naturally captured using continuous random variables. claro employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for relational operators by exploring statistical theory and approximation. We also consider query planning for complex queries given an accuracy requirement. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements and outperform state-of-the-art sampling methods.  相似文献   

18.
Existing classical post-processing (CPP) schemes for quantum key distribution (QKD)-based quantum private queries (QPQs) including the \(kN\rightarrow N\), \(N\rightarrow N\), and \(rM\rightarrow N\) ones have been found imperfect in terms of communication efficiency and security. In this paper, we propose a novel CPP scheme for QKD-based QPQs. The proposed CPP scheme reduces the communication complexity and improves the security of QKD-based QPQ protocols largely. Furthermore, the proposed CPP scheme can provide a multi-bit query efficiently.  相似文献   

19.
In this paper, we introduce a new problem termed query reverse engineering (QRE). Given a database \(D\) and a result table \(T\) —the output of some known or unknown query \(Q\) on \(D\) —the goal of QRE is to reverse-engineer a query \(Q'\) such that the output of query \(Q'\) on database \(D\) (denoted by \(Q'(D)\) ) is equal to \(T\) (i.e., \(Q(D)\) ). The QRE problem has useful applications in database usability, data analysis, and data security. In this work, we propose a data-driven approach, TALOS for Tree-based classifier with At Least One Semantics, that is based on a novel dynamic data classification formulation and extend the approach to efficiently support the three key dimensions of the QRE problem: whether the input query is known/unknown, supporting different query fragments, and supporting multiple database versions.  相似文献   

20.
Twitter (http://twitter.com) is one of the most popular social networking platforms. Twitter users can easily broadcast disaster-specific information, which, if effectively mined, can assist in relief operations. However, the brevity and informal nature of tweets pose a challenge to Information Retrieval (IR) researchers. In this paper, we successfully use word embedding techniques to improve ranking for ad-hoc queries on microblog data. Our experiments with the ‘Social Media for Emergency Relief and Preparedness’ (SMERP) dataset provided at an ECIR 2017 workshop show that these techniques outperform conventional term-matching based IR models. In addition, we show that, for the SMERP task, our word embedding based method is more effective if the embeddings are generated from the disaster specific SMERP data, than when they are trained on the large social media collection provided for the TREC (http://trec.nist.gov/) 2011 Microblog track dataset.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号