首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
The problem of kNN (k Nearest Neighbor) queries has received considerable attention in the database and information retrieval communities. Given a dataset D and a kNN query q, the k nearest neighbor algorithm finds the closest k data points to q. The applications of kNN queries are board, not only in spatio-temporal databases but also in many areas. For example, they can be used in multimedia databases, data mining, scientific databases and video retrieval. The past studies of kNN query processing did not consider the case that the server may receive multiple kNN queries at one time. Their algorithms process queries independently. Thus, the server will be busy with continuously reaccessing the database to obtain the data that have already been acquired. This results in wasting I/O costs and degrading the performance of the whole system. In this paper, we focus on this problem and propose an algorithm named COrrelated kNN query Evaluation (COKE). The main idea of COKE is an “information sharing” strategy whereby the server reuses the query results of previously executed queries for efficiently processing subsequent queries. We conduct a comprehensive set of experiments to analyze the performance of COKE and compare it with the Best-First Search (BFS) algorithm. Empirical studies indicate that COKE outperforms BFS, and achieves lower I/O costs and less running time.  相似文献   

2.
An important query for spatio-temporal databases is to find nearest trajectories of moving objects. Existing work on this topic focuses on the closest trajectories in the whole data space. In this paper, we introduce and solve constrained k-nearest neighbor (CkNN) queries and historical continuous CkNN (HCCkNN) queries on R-tree-like structures storing historical information about moving object trajectories. Given a trajectory set D, a query object (point or trajectory) q, a temporal extent T, and a constrained region CR, (i) a CkNN query over trajectories retrieves from D within T, the k (≥ 1) trajectories that lie closest to q and intersect (or are enclosed by) CR; and (ii) an HCCkNN query on trajectories retrieves the constrained k nearest neighbors (CkNNs) of q at any time instance of T. We propose a suite of algorithms for processing CkNN queries and HCCkNN queries respectively, with different properties and advantages. In particular, we thoroughly investigate two types of CkNN queries, i.e., CkNNP and CkNNT, which are defined with respect to stationary query points and moving query trajectories, respectively; and two types of HCCkNN queries, namely, HCCkNNP and HCCkNNT, which are continuous counterparts of CkNNP and CkNNT, respectively. Our methods utilize an existing data-partitioning index for trajectory data (i.e., TB-tree) to achieve low I/O and CPU cost. Extensive experiments with both real and synthetic datasets demonstrate the performance of the proposed algorithms in terms of efficiency and scalability.  相似文献   

3.
Finding k nearest neighbor objects in spatial databases is a fundamental problem in many geospatial systems and the direction is one of the key features of a spatial object. Moreover, the recent tremendous growth of sensor technologies in mobile devices produces an enormous amount of spatio-directional (i.e., spatially and directionally encoded) objects such as photos. Therefore, an efficient and proper utilization of the direction feature is a new challenge. Inspired by this issue and the traditional k nearest neighbor search problem, we devise a new type of query, called the direction-constrained k nearest neighbor (DCkNN) query. The DCkNN query finds k nearest neighbors from the location of the query such that the direction of each neighbor is in a certain range from the direction of the query. We develop a new index structure called MULTI, to efficiently answer the DCkNN query with two novel index access algorithms based on the cost analysis. Furthermore, our problem and solution can be generalized to deal with spatio-circulant dimensional (such as a direction and circulant periods of time such as an hour, a day, and a week) objects. Experimental results show that our proposed index structure and access algorithms outperform two adapted algorithms from existing kNN algorithms.  相似文献   

4.
5.
The k nearest neighbors (k-NN) classification technique has a worldly wide fame due to its simplicity, effectiveness, and robustness. As a lazy learner, k-NN is a versatile algorithm and is used in many fields. In this classifier, the k parameter is generally chosen by the user, and the optimal k value is found by experiments. The chosen constant k value is used during the whole classification phase. The same k value used for each test sample can decrease the overall prediction performance. The optimal k value for each test sample should vary from others in order to have more accurate predictions. In this study, a dynamic k value selection method for each instance is proposed. This improved classification method employs a simple clustering procedure. In the experiments, more accurate results are found. The reasons of success have also been understood and presented.  相似文献   

6.
Bootstrap aggregation, or bagging, is a method of reducing the prediction error of a statistical learner. The goal of bagging is to construct a new learner which is the expectation of the original learner with respect to the empirical distribution function. In nearly all cases, the expectation cannot be computed analytically, and bootstrap sampling is used to produce an approximation. The k-nearest neighbor learners are exceptions to this generalization, and exact bagging of many k-nearest neighbor learners is straightforward. This article presents computationally simple and fast formulae for exact bagging of k-nearest neighbor learners and extends exact bagging methods from the conventional bootstrap sampling (sampling n observations with replacement from a set of n observations) to bootstrap sub-sampling schemes (with and without replacement). In addition, a partially exact k-nearest neighbor regression learner is developed. The article also compares the prediction error associated with elementary and exact bagging k-nearest neighbor learners, and several other ensemble methods using a suite of publicly available data sets.  相似文献   

7.
Ranking queries, also known as top-k queries, produce results that are ordered on some computed score. Typically, these queries involve joins, where users are usually interested only in the top-k join results. Top-k queries are dominant in many emerging applications, e.g., multimedia retrieval by content, Web databases, data mining, middlewares, and most information retrieval applications. Current relational query processors do not handle ranking queries efficiently, especially when joins are involved. In this paper, we address supporting top-k join queries in relational query processors. We introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. The idea is to rank the join results progressively during the join operation. We introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. The operators are nonblocking and can be integrated into pipelined execution plans. We also propose an efficient heuristic designed to optimize a top-k join query by choosing the best join order. We address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. We implement the new operators inside a prototype database engine based on PREDATOR. The experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance.Received: 23 December 2003, Accepted: 31 March 2004, Published online: 12 August 2004Edited by: S. AbiteboulExtended version of the paper published in the Proceedings of the 29th International Conference on Very Large Databases, VLDB 2003, Berlin, Germany, pp 754-765  相似文献   

8.
We consider a distributed system where each node keeps a local count for items (similar to elections where nodes are ballot boxes and items are candidates). A top-k query in such a system asks which are the k items whose global count, across all nodes in the system, is the largest. In this paper, we present a Monte Carlo algorithm that outputs, with high probability, a set of k candidates which approximates the top-k items. The algorithm is motivated by sensor networks in that it focuses on reducing the individual communication complexity. In contrast to previous algorithms, the communication complexity depends only on the global scores and not on the partition of scores among nodes. If the number of nodes is large, our algorithm dramatically reduces the communication complexity when compared with deterministic algorithms. We show that the complexity of our algorithm is close to a lower bound on the cell-probe complexity of any non-interactive top-k approximation algorithm. We show that for some natural global distributions (such as the Geometric or Zipf distributions), our algorithm needs only polylogarithmic number of communication bits per node. An extended abstract of this paper appeared in Proc. 13th Int. Colloquium on Structural Information and Communication Complexity, SIROCCO 2006, Lecture Notes in Computer Science 4056, pp. 319–333.  相似文献   

9.
Given a graph with a source and a sink node, the NP-hard maximum k-splittable s,t-flow (M k SF) problem is to find a flow of maximum value from s to t with a flow decomposition using at most k paths. The multicommodity variant of this problem is a natural generalization of disjoint paths and unsplittable flow problems. Constructing a k-splittable flow requires two interdepending decisions. One has to decide on k paths (routing) and on the flow values for the paths (packing). We give efficient algorithms for computing exact and approximate solutions by decoupling the two decisions into a first packing step and a second routing step. Usually the routing is considered before the packing. Our main contributions are as follows: (i) We show that for constant k a polynomial number of packing alternatives containing at least one packing used by an optimal M k SF solution can be constructed in polynomial time. If k is part of the input, we obtain a slightly weaker result. In this case we can guarantee that, for any fixed ε>0, the computed set of alternatives contains a packing used by a (1−ε)-approximate solution. The latter result is based on the observation that (1−ε)-approximate flows only require constantly many different flow values. We believe that this observation is of interest in its own right. (ii) Based on (i), we prove that, for constant k, the M k SF problem can be solved in polynomial time on graphs of bounded treewidth. If k is part of the input, this problem is still NP-hard and we present a polynomial time approximation scheme for it.  相似文献   

10.
This paper presents a clustering (Clustering partitions record into clusters such that records within a cluster are similar to each other, while records in different clusters are most distinct from one another.) based k-anonymization technique to minimize the information loss while at the same time assuring data quality. Privacy preservation of individuals has drawn considerable interests in data mining research. The k-anonymity model proposed by Samarati and Sweeney is a practical approach for data privacy preservation and has been studied extensively for the last few years. Anonymization methods via generalization or suppression are able to protect private information, but lose valued information. The challenge is how to minimize the information loss during the anonymization process. We refer to the challenge as a systematic clustering problem for k-anonymization which is analysed in this paper. The proposed technique adopts group-similar data together and then anonymizes each group individually. The structure of systematic clustering problem is defined and investigated through paradigm and properties. An algorithm of the proposed problem is developed and shown that the time complexity is in O(\fracn2k){O(\frac{n^{2}}{k})}, where n is the total number of records containing individuals concerning their privacy. Experimental results show that our method attains a reasonable dominance with respect to both information loss and execution time. Finally the algorithm illustrates the usability for incremental datasets.  相似文献   

11.
K-anonymisation is an approach to protecting individuals from being identified from data.Good k-anonymisations should retain data utility and preserve privacy,but few methods have considered these two conflicting requirements together. In this paper,we extend our previous work on a clustering-based method for balancing data utility and privacy protection, and propose a set of heuristics to improve its effectiveness.We introduce new clustering criteria that treat utility and privacy on equal terms and propose sampling-based techniques to optimally set up its parameters.Extensive experiments show that the extended method achieves good accuracy in query answering and is able to prevent linking attacks effectively.  相似文献   

12.
Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially for distributed settings, when the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments, with three different real-life datasets and using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.  相似文献   

13.
In the fuzzy k-modes clustering, there is just one membership degree of interest by class for each individual which cannot be sufficient to model ambiguity of data precisely. It is known that the essence of a multivariate thinking allows to expose the inherent structure and meaning revealed within a set of variables classified. In this paper, a multivariate approach for membership degrees is presented to better handle ambiguous data that share properties of different clusters. This method is compared with other fuzzy k-modes methods of the literature based on a multivariate internal index that is also proposed in this paper. Synthetic and real categorical data sets are considered in this study.  相似文献   

14.
This paper proposes a strengthening of the author’s core-accessibility theorem for balanced TU-cooperative games. The obtained strengthening relaxes the influence of the nontransitivity of classical domination αv on the quality of the sequential improvement of dominated imputations in a game v. More specifically, we establish the k-accessibility of the core C v ) of any balanced TU-cooperative game v for all natural numbers k: for each dominated imputation x, there exists a converging sequence of imputations x0, x1,..., such that x0 = x, lim x r C v ) and xr?m is dominated by any successive imputation x r with m ∈ [1, k] and rm. For showing that the TU-property is essential to provide the k-accessibility of the core, we give an example of an NTU-cooperative game G with a ”black hole” representing a nonempty closed subset B ? G(N) of dominated imputations that contains all the α G -monotonic sequential improvement trajectories originating at any point xB.  相似文献   

15.
With the popularization of wireless networks and mobile intelligent terminals, mobile crowd sensing is becoming a promising sensing paradigm. Tasks are assigned to users with mobile devices, which then collect and submit ambient information to the server. The composition of participants greatly determines the quality and cost of the collected information. This paper aims to select fewest participants to achieve the quality required by a sensing task. The requirement namely “t-sweep k-coverage” means for a target location, every t time interval should at least k participants sense. The participant selection problem for “t-sweep k-coverage” crowd sensing tasks is NP-hard. Through delicate matrix stacking, linear programming can be adopted to solve the problem when it is in small size. We further propose a participant selection method based on greedy strategy. The two methods are evaluated through simulated experiments using users’ call detail records. The results show that for small problems, both the two methods can find a participant set meeting the requirement. The number of participants picked by the greedy based method is roughly twice of the linear programming based method. However, when problems become larger, the linear programming based method performs unstably, while the greedy based method can still output a reasonable solution.  相似文献   

16.
Choosing the best location for starting a business or expanding an existing enterprize is an important issue. A number of location selection problems have been discussed in the literature. They often apply the Reverse Nearest Neighbor as the criterion for finding suitable locations. In this paper, we apply the Average Distance as the criterion and propose the so-called k-most suitable locations (k-MSL) selection problem. Given a positive integer k and three datasets: a set of customers, a set of existing facilities, and a set of potential locations. The k-MSL selection problem outputs k locations from the potential location set, such that the average distance between a customer and his nearest facility is minimized. In this paper, we formally define the k-MSL selection problem and show that it is NP-hard. We first propose a greedy algorithm which can quickly find an approximate result for users. Two exact algorithms are then proposed to find the optimal result. Several pruning rules are applied to increase computational efficiency. We evaluate the algorithms’ performance using both synthetic and real datasets. The results show that our algorithms are able to deal with the k-MSL selection problem efficiently.  相似文献   

17.
The theory of finite pseudo-random binary sequences was built by C. Mauduit and A. Sárközy and later extended to sequences of k symbols (or k-ary sequences). Certain constructions of pseudo-random sequences of k symbols were presented over finite fields in the literature. In this paper, two families of sequences of k symbols are constructed by using the integers modulo pq for distinct odd primes p and q. The upper bounds on the well-distribution measure and the correlation measure of the families sequences are presented in terms of certain character sums over modulo pq residue class rings. And low bounds on the linear complexity profile are also estimated.  相似文献   

18.
The minimum k-terminal cut problem is of considerable theoretical interest and arises in several applied areas such as parallel and distributed computing, VLSI circuit design, and networking. In this paper we present two new approximation and exact algorithms for this problem on an n-vertex undirected weighted planar graph G. For the case when the k terminals are covered by the boundaries of m > 1 faces of G, we give a min{O(n 2 log n logm), O(m 2 n 1.5 log2 n + k n)} time algorithm with a (2–2/k)-approximation ratio (clearly, m \le k). For the case when all k terminals are covered by the boundary of one face of G, we give an O(n k3 + (n log n)k 2) time exact algorithm, or a linear time exact algorithm if k = 3, for computing an optimal k-terminal cut. Our algorithms are based on interesting observations and improve the previous algorithms when they are applied to planar graphs. To our best knowledge, no previous approximation algorithms specifically for solving the k-terminal cut problem on planar graphs were known before. The (2–2/k)-approximation algorithm of Dahlhaus et al. (for general graphs) takes O(k n 2 log n) time when applied to planar graphs. Our approximation algorithm for planar graphs runs faster than that of Dahlhaus et al. by at least an O(k/logm) factor (m \le k).  相似文献   

19.
Bichromatic reverse nearest neighbor (BRNN) has been extensively studied in spatial database literature. In this paper, we study a related problem called MaxBRNN: find an optimal region that maximizes the size of BRNNs for L p -norm in two- and three- dimensional spaces. Such a problem has many real-life applications, including the problem of finding a new server point that attracts as many customers as possible by proximity. A straightforward approach is to determine the BRNNs for all possible points that are not feasible since there are a large (or infinite) number of possible points. To the best of our knowledge, there are no existing algorithms which solve MaxBRNN for any L p -norm space of two- and three-dimensionality. Based on some interesting properties of the problem, we come up with an efficient algorithm called MaxOverlap for to solve this problem. Extensive experiments are conducted to show that our algorithm is efficient.  相似文献   

20.
Providing top-k typical relevant keyword queries would benefit the users who cannot formulate appropriate queries to express their imprecise query intentions. By extracting the semantic relationships both between keywords and keyword queries, this paper proposes a new keyword query suggestion approach which can provide typical and semantically related queries to the given query. Firstly, a keyword coupling relationship measure, which considers both intra- and inter-couplings between each pair of keywords, is proposed. Then, the semantic similarity of different keyword queries can be measured by using a semantic matrix, in which the coupling relationships between keywords in queries are reserved. Based on the query semantic similarities, we next propose an approximation algorithm to find the most typical queries from query history by using the probability density estimation method. Lastly, a threshold-based top-k query selection method is proposed to expeditiously evaluate the top-k typical relevant queries. We demonstrate that our keyword coupling relationship and query semantic similarity measures can capture the coupling relationships between keywords and semantic similarities between keyword queries accurately. The efficiency of query typicality analysis and top-k query selection algorithm is also demonstrated.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号