首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 172 毫秒
1.
Given a graph with a source and a sink node, the NP-hard maximum k-splittable s,t-flow (M k SF) problem is to find a flow of maximum value from s to t with a flow decomposition using at most k paths. The multicommodity variant of this problem is a natural generalization of disjoint paths and unsplittable flow problems. Constructing a k-splittable flow requires two interdepending decisions. One has to decide on k paths (routing) and on the flow values for the paths (packing). We give efficient algorithms for computing exact and approximate solutions by decoupling the two decisions into a first packing step and a second routing step. Usually the routing is considered before the packing. Our main contributions are as follows: (i) We show that for constant k a polynomial number of packing alternatives containing at least one packing used by an optimal M k SF solution can be constructed in polynomial time. If k is part of the input, we obtain a slightly weaker result. In this case we can guarantee that, for any fixed ε>0, the computed set of alternatives contains a packing used by a (1−ε)-approximate solution. The latter result is based on the observation that (1−ε)-approximate flows only require constantly many different flow values. We believe that this observation is of interest in its own right. (ii) Based on (i), we prove that, for constant k, the M k SF problem can be solved in polynomial time on graphs of bounded treewidth. If k is part of the input, this problem is still NP-hard and we present a polynomial time approximation scheme for it.  相似文献   

2.
Ranking queries, also known as top-k queries, produce results that are ordered on some computed score. Typically, these queries involve joins, where users are usually interested only in the top-k join results. Top-k queries are dominant in many emerging applications, e.g., multimedia retrieval by content, Web databases, data mining, middlewares, and most information retrieval applications. Current relational query processors do not handle ranking queries efficiently, especially when joins are involved. In this paper, we address supporting top-k join queries in relational query processors. We introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. The idea is to rank the join results progressively during the join operation. We introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. The operators are nonblocking and can be integrated into pipelined execution plans. We also propose an efficient heuristic designed to optimize a top-k join query by choosing the best join order. We address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. We implement the new operators inside a prototype database engine based on PREDATOR. The experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance.Received: 23 December 2003, Accepted: 31 March 2004, Published online: 12 August 2004Edited by: S. AbiteboulExtended version of the paper published in the Proceedings of the 29th International Conference on Very Large Databases, VLDB 2003, Berlin, Germany, pp 754-765  相似文献   

3.
Although k-anonymity is a good way of publishing microdata for research purposes, it cannot resist several common attacks, such as attribute disclosure and the similarity attack. To resist these attacks, many refinements of kanonymity have been proposed with t-closeness being one of the strictest privacy models. While most existing t-closeness models address the case in which the original data have only one single sensitive attribute, data with multiple sensitive attributes are more common in practice. In this paper, we cover this gap with two proposed algorithms for multiple sensitive attributes and make the published data satisfy t-closeness. Based on the observation that the values of the sensitive attributes in any equivalence class must be as spread as possible over the entire data to make the published data satisfy t-closeness, both of the algorithms use different methods to partition records into groups in terms of sensitive attributes. One uses a clustering method, while the other leverages the principal component analysis. Then, according to the similarity of quasiidentifier attributes, records are selected from different groups to construct an equivalence class, which will reduce the loss of information as much as possible during anonymization. Our proposed algorithms are evaluated using a real dataset. The results show that the average speed of the first proposed algorithm is slower than that of the second proposed algorithm but the former can preserve more original information. In addition, compared with related approaches, both proposed algorithms can achieve stronger protection of privacy and reduce less.  相似文献   

4.
This paper presents a clustering (Clustering partitions record into clusters such that records within a cluster are similar to each other, while records in different clusters are most distinct from one another.) based k-anonymization technique to minimize the information loss while at the same time assuring data quality. Privacy preservation of individuals has drawn considerable interests in data mining research. The k-anonymity model proposed by Samarati and Sweeney is a practical approach for data privacy preservation and has been studied extensively for the last few years. Anonymization methods via generalization or suppression are able to protect private information, but lose valued information. The challenge is how to minimize the information loss during the anonymization process. We refer to the challenge as a systematic clustering problem for k-anonymization which is analysed in this paper. The proposed technique adopts group-similar data together and then anonymizes each group individually. The structure of systematic clustering problem is defined and investigated through paradigm and properties. An algorithm of the proposed problem is developed and shown that the time complexity is in O(\fracn2k){O(\frac{n^{2}}{k})}, where n is the total number of records containing individuals concerning their privacy. Experimental results show that our method attains a reasonable dominance with respect to both information loss and execution time. Finally the algorithm illustrates the usability for incremental datasets.  相似文献   

5.
k-Anonymity is a useful concept to solve the tension between data utility and respondent privacy in individual data (microdata) protection. However, the generalization and suppression approach proposed in the literature to achieve k-anonymity is not equally suited for all types of attributes: (i) generalization/suppression is one of the few possibilities for nominal categorical attributes; (ii) it is just one possibility for ordinal categorical attributes which does not always preserve ordinality; (iii) and it is completely unsuitable for continuous attributes, as it causes them to lose their numerical meaning. Since attributes leading to disclosure (and thus needing k-anonymization) may be nominal, ordinal and also continuous, it is important to devise k-anonymization procedures which preserve the semantics of each attribute type as much as possible. We propose in this paper to use categorical microaggregation as an alternative to generalization/suppression for nominal and ordinal k-anonymization; we also propose continuous microaggregation as the method for continuous k-anonymization. Editor: Geoff Webb  相似文献   

6.
Bootstrap aggregation, or bagging, is a method of reducing the prediction error of a statistical learner. The goal of bagging is to construct a new learner which is the expectation of the original learner with respect to the empirical distribution function. In nearly all cases, the expectation cannot be computed analytically, and bootstrap sampling is used to produce an approximation. The k-nearest neighbor learners are exceptions to this generalization, and exact bagging of many k-nearest neighbor learners is straightforward. This article presents computationally simple and fast formulae for exact bagging of k-nearest neighbor learners and extends exact bagging methods from the conventional bootstrap sampling (sampling n observations with replacement from a set of n observations) to bootstrap sub-sampling schemes (with and without replacement). In addition, a partially exact k-nearest neighbor regression learner is developed. The article also compares the prediction error associated with elementary and exact bagging k-nearest neighbor learners, and several other ensemble methods using a suite of publicly available data sets.  相似文献   

7.
The polynomial-time solvable k-hurdle problem is a natural generalization of the classical s-t minimum cut problem where we must select a minimum-cost subset S of the edges of a graph such that |pS|≥k for every s-t path p. In this paper, we describe a set of approximation algorithms for “k-hurdle” variants of the NP-hard multiway cut and multicut problems. For the k-hurdle multiway cut problem with r terminals, we give two results, the first being a pseudo-approximation algorithm that outputs a (k−1)-hurdle solution whose cost is at most that of an optimal solution for k hurdles. Secondly, we provide a 2(1-\frac1r)2(1-\frac{1}{r})-approximation algorithm based on rounding the solution of a linear program, for which we give a simple randomized half-integrality proof that works for both edge and vertex k-hurdle multiway cuts that generalizes the half-integrality results of Garg et al. for the vertex multiway cut problem. We also describe an approximation-preserving reduction from vertex cover as evidence that it may be difficult to achieve a better approximation ratio than 2(1-\frac1r)2(1-\frac{1}{r}). For the k-hurdle multicut problem in an n-vertex graph, we provide an algorithm that, for any constant ε>0, outputs a ⌈(1−ε)k⌉-hurdle solution of cost at most O(log n) times that of an optimal k-hurdle solution, and we obtain a 2-approximation algorithm for trees.  相似文献   

8.
In this paper we present extended definitions of k-anonymity and use them to prove that a given data mining model does not violate the k-anonymity of the individuals represented in the learning examples. Our extension provides a tool that measures the amount of anonymity retained during data mining. We show that our model can be applied to various data mining problems, such as classification, association rule mining and clustering. We describe two data mining algorithms which exploit our extension to guarantee they will generate only k-anonymous output, and provide experimental results for one of them. Finally, we show that our method contributes new and efficient ways to anonymize data and preserve patterns during anonymization.  相似文献   

9.
An important query for spatio-temporal databases is to find nearest trajectories of moving objects. Existing work on this topic focuses on the closest trajectories in the whole data space. In this paper, we introduce and solve constrained k-nearest neighbor (CkNN) queries and historical continuous CkNN (HCCkNN) queries on R-tree-like structures storing historical information about moving object trajectories. Given a trajectory set D, a query object (point or trajectory) q, a temporal extent T, and a constrained region CR, (i) a CkNN query over trajectories retrieves from D within T, the k (≥ 1) trajectories that lie closest to q and intersect (or are enclosed by) CR; and (ii) an HCCkNN query on trajectories retrieves the constrained k nearest neighbors (CkNNs) of q at any time instance of T. We propose a suite of algorithms for processing CkNN queries and HCCkNN queries respectively, with different properties and advantages. In particular, we thoroughly investigate two types of CkNN queries, i.e., CkNNP and CkNNT, which are defined with respect to stationary query points and moving query trajectories, respectively; and two types of HCCkNN queries, namely, HCCkNNP and HCCkNNT, which are continuous counterparts of CkNNP and CkNNT, respectively. Our methods utilize an existing data-partitioning index for trajectory data (i.e., TB-tree) to achieve low I/O and CPU cost. Extensive experiments with both real and synthetic datasets demonstrate the performance of the proposed algorithms in terms of efficiency and scalability.  相似文献   

10.
We consider the problem of efficiently computing distributed geographical k-NN queries in an unstructured peer-to-peer (P2P) system, in which each peer is managed by an individual organization and can only communicate with its logical neighboring peers. Such queries are based on local filter query statistics, and require as less communication cost as possible which makes it more difficult than the existing distributed k-NN queries. Especially, we hope to reduce candidate peers and degrade communication cost. In this paper, we propose an efficient pruning technique to minimize the number of candidate peers to be processed to answer the k-NN queries. Our approach is especially suitable for continuous k-NN queries when updating peers, including changing ranges of peers, dynamically leaving or joining peers, and updating data in a peer. In addition, simulation results show that the proposed approach outperforms the existing Minimum Bounding Rectangle (MBR)-based query approaches, especially for continuous queries.  相似文献   

11.
Finding k nearest neighbor objects in spatial databases is a fundamental problem in many geospatial systems and the direction is one of the key features of a spatial object. Moreover, the recent tremendous growth of sensor technologies in mobile devices produces an enormous amount of spatio-directional (i.e., spatially and directionally encoded) objects such as photos. Therefore, an efficient and proper utilization of the direction feature is a new challenge. Inspired by this issue and the traditional k nearest neighbor search problem, we devise a new type of query, called the direction-constrained k nearest neighbor (DCkNN) query. The DCkNN query finds k nearest neighbors from the location of the query such that the direction of each neighbor is in a certain range from the direction of the query. We develop a new index structure called MULTI, to efficiently answer the DCkNN query with two novel index access algorithms based on the cost analysis. Furthermore, our problem and solution can be generalized to deal with spatio-circulant dimensional (such as a direction and circulant periods of time such as an hour, a day, and a week) objects. Experimental results show that our proposed index structure and access algorithms outperform two adapted algorithms from existing kNN algorithms.  相似文献   

12.
We study two related network design problems with two cost functions. In the buy-at-bulk k-Steiner tree problem we are given a graph G(V,E) with a set of terminals TV including a particular vertex s called the root, and an integer k≤|T|. There are two cost functions on the edges of G, a buy cost b:E→ℝ+ and a distance cost r:E→ℝ+. The goal is to find a subtree H of G rooted at s with at least k terminals so that the cost ∑ eH b(e)+∑ tTs dist(t,s) is minimized, where dist(t,s) is the distance from t to s in H with respect to the r cost. We present an O(log 4 n)-approximation algorithm for the buy-at-bulk k-Steiner tree problem. The second and closely related one is bicriteria approximation algorithm for Shallow-light k-Steiner trees. In the shallow-light k-Steiner tree problem we are given a graph G with edge costs b(e) and distance costs r(e), and an integer k. Our goal is to find a minimum cost (under b-cost) k-Steiner tree such that the diameter under r-cost is at most some given bound D. We develop an (O(log n),O(log 3 n))-approximation algorithm for a relaxed version of Shallow-light k-Steiner tree where the solution has at least terminals. Using this we obtain an (O(log 2 n),O(log 4 n))-approximation algorithm for the shallow-light k-Steiner tree and an O(log 4 n)-approximation algorithm for the buy-at-bulk k-Steiner tree problem. Our results are recently used to give the first polylogarithmic approximation algorithm for the non-uniform multicommodity buy-at-bulk problem (Chekuri, C., et al. in Proceedings of 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pp. 677–686, 2006). A preliminary version of this paper appeared in the Proceedings of 9th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX) 2006, LNCS 4110, pp. 153–163, 2006. M.T. Hajiaghayi supported in part by IPM under grant number CS1383-2-02. M.R. Salavatipour supported by NSERC grant No. G121210990, and a faculty start-up grant from University of Alberta.  相似文献   

13.
Nowadays, location-based services (LBS) are facilitating people in daily life through answering LBS queries. However, privacy issues including location privacy and query privacy arise at the same time. Existing works for protecting query privacy either work on trusted servers or fail to provide sufficient privacy guarantee. This paper combines the concepts of differential privacy and k-anonymity to propose the notion of differentially private k-anonymity (DPkA) for query privacy in LBS. We recognize the sufficient and necessary condition for the availability of 0-DPkA and present how to achieve it. For cases where 0-DPkA is not achievable, we propose an algorithm to achieve ??-DPkA with minimized ??. Extensive simulations are conducted to validate the proposed mechanisms based on real-life datasets and synthetic data distributions.  相似文献   

14.
The minimum k-terminal cut problem is of considerable theoretical interest and arises in several applied areas such as parallel and distributed computing, VLSI circuit design, and networking. In this paper we present two new approximation and exact algorithms for this problem on an n-vertex undirected weighted planar graph G. For the case when the k terminals are covered by the boundaries of m > 1 faces of G, we give a min{O(n 2 log n logm), O(m 2 n 1.5 log2 n + k n)} time algorithm with a (2–2/k)-approximation ratio (clearly, m \le k). For the case when all k terminals are covered by the boundary of one face of G, we give an O(n k3 + (n log n)k 2) time exact algorithm, or a linear time exact algorithm if k = 3, for computing an optimal k-terminal cut. Our algorithms are based on interesting observations and improve the previous algorithms when they are applied to planar graphs. To our best knowledge, no previous approximation algorithms specifically for solving the k-terminal cut problem on planar graphs were known before. The (2–2/k)-approximation algorithm of Dahlhaus et al. (for general graphs) takes O(k n 2 log n) time when applied to planar graphs. Our approximation algorithm for planar graphs runs faster than that of Dahlhaus et al. by at least an O(k/logm) factor (m \le k).  相似文献   

15.
Query processing in the uncertain database has played an important role in many real-world applications due to the wide existence of uncertain data. Although many previous techniques can correctly handle precise data, they are not directly applicable to the uncertain scenario. In this article, we investigate and propose a novel query, namely probabilistic top-k star (PTkS) query, which aims to retrieve k objects in an uncertain database that are “closest” to a static/dynamic query point, considering both distance and probability aspects. In order to efficiently answer PTkS queries with a static/moving query point, we propose effective pruning methods to reduce the PTkS search space, which can be seamlessly integrated into an efficient query procedure. Finally, extensive experiments have demonstrated the efficiency and effectiveness of our proposed PTkS approaches on both real and synthetic data sets, under various parameter settings.  相似文献   

16.
We present an approach of limiting the confidence of inferring sensitive properties to protect against the threats caused by data mining abilities. The problem has dual goals: preserve the information for a wanted data analysis request and limit the usefulness of unwanted sensitive inferences that may be derived from the release of data. Sensitive inferences are specified by a set of “privacy templates". Each template specifies the sensitive property to be protected, the attributes identifying a group of individuals, and a maximum threshold for the confidence of inferring the sensitive property given the identifying attributes. We show that suppressing the domain values monotonically decreases the maximum confidence of such sensitive inferences. Hence, we propose a data transformation that minimally suppresses the domain values in the data to satisfy the set of privacy templates. The transformed data is free of sensitive inferences even in the presence of data mining algorithms. The prior k-anonymization k has been italicized consistently throughout this article. focuses on personal identities. This work focuses on the association between personal identities and sensitive properties. Ke Wang received Ph.D. from Georgia Institute of Technology. He is currently a professor at School of Computing Science, Simon Fraser University. Before joining Simon Fraser, he was an associate professor at National University of Singapore. He has taught in the areas of database and data mining. Dr. Wang’s research interests include database technology, data mining and knowledge discovery, machine learning, and emerging applications, with recent interests focusing on the end use of data mining. This includes explicitly modeling the business goal (such as profit mining, bio-mining and web mining) and exploiting user prior knowledge (such as extracting unexpected patterns and actionable knowledge). He is interested in combining the strengths of various fields such as database, statistics, machine learning and optimization to provide actionable solutions to real-life problems. He is an associate editor of the IEEE TKDE journal and has served program committees for international conferences. Benjamin C. M. Fung received B.Sc. and M.Sc. degrees in computing science from Simon Fraser University. Received the postgraduate scholarship doctoral award from the Natural Sciences and Engineering Research Council of Canada (NSERC), Mr. Fung is currently a Ph.D. candidate at Simon Fraser. His recent research interests include privacy-preserving data mining, secure distributed computing, and text mining. Before pursuing his Ph.D., he worked in the R&D Department at Business Objects and designed reporting systems for various Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM) systems, including BaaN, Siebel, and PeopleSoft. Mr. Fung has published in data engineering, data mining, and security conferences, journals, and books, including IEEE ICDE, IEEE ICDM, IEEE ISI, SDM, KAIS, and the Encyclopedia of Data Warehousing and Mining. Philip S. Yu received B.S. degree in E.E. from National Taiwan University, M.S. and Ph.D. degrees in E.E. from Stanford University, and M.B.A. degree from New York University. He is with IBM T.J. Watson Research Center and currently manager of the Software Tools and Techniques group. Dr. Yu has published more than 450 papers in refereed journals and conferences. He holds or has applied for more than 250 US patents. Dr. Yu is a Fellow of the ACM and the IEEE. He has received several IBM honors including two IBM Outstanding Innovation Awards, an Outstanding Technical Achievement Award, two Research Division Awards and the 85th plateau of Invention Achievement Awards. He received a Research Contributions Award from IEEE International Conference on Data Mining in 2003 and also an IEEE Region 1 Award for “promoting and perpetuating numerous new electrical engineering concepts” in 1999. Dr. Yu is an IBM Master Inventor.  相似文献   

17.
K-anonymisation is an approach to protecting individuals from being identified from data.Good k-anonymisations should retain data utility and preserve privacy,but few methods have considered these two conflicting requirements together. In this paper,we extend our previous work on a clustering-based method for balancing data utility and privacy protection, and propose a set of heuristics to improve its effectiveness.We introduce new clustering criteria that treat utility and privacy on equal terms and propose sampling-based techniques to optimally set up its parameters.Extensive experiments show that the extended method achieves good accuracy in query answering and is able to prevent linking attacks effectively.  相似文献   

18.
A spatial k-NN query returns k nearest points in a point dataset to a given query point. To measure the distance between two points, most of the literature focuses on the Euclidean distance or the network distance. For many applications, such as wildlife movement, it is necessary to consider the surface distance, which is computed from the shortest path along a terrain surface. In this paper, we investigate the problem of efficient surface k-NN (sk-NN) query processing. This is an important yet highly challenging problem because the underlying environment data can be very large and the computational cost of finding the shortest path on a surface can be very high. To minimize the amount of surface data to be used and the cost of surface distance computation, a multi-resolution surface distance model is proposed in this paper to take advantage of monotonic distance changes when the distances are computed at different resolution levels. Based on this innovative model, sk-NN queries can be processed efficiently by accessing and processing surface data at a just-enough resolution level within a just-enough search region. Our extensive performance evaluations using real world datasets confirm the efficiency of our proposed model.  相似文献   

19.
Given a multidimensional point q, a reverse k nearest neighbor (RkNN) query retrieves all the data points that have q as one of their k nearest neighbors. Existing methods for processing such queries have at least one of the following deficiencies: they (i) do not support arbitrary values of k, (ii) cannot deal efficiently with database updates, (iii) are applicable only to 2D data but not to higher dimensionality, and (iv) retrieve only approximate results. Motivated by these shortcomings, we develop algorithms for exact RkNN processing with arbitrary values of k on dynamic, multidimensional datasets. Our methods utilize a conventional data-partitioning index on the dataset and do not require any pre-computation. As a second step, we extend the proposed techniques to continuous RkNN search, which returns the RkNN results for every point on a line segment. We evaluate the effectiveness of our algorithms with extensive experiments using both real and synthetic datasets.  相似文献   

20.
An f-sensitivity distance oracle for a weighted undirected graph G(V,E) is a data structure capable of answering restricted distance queries between vertex pairs, i.e., calculating distances on a subgraph avoiding some forbidden edges. This paper presents an efficiently constructible f-sensitivity distance oracle that given a triplet (s,t,F), where s and t are vertices and F is a set of forbidden edges such that |F|≤f, returns an estimate of the distance between s and t in G(V,EF). For an integer parameter k≥1, the size of the data structure is O(fkn 1+1/k log (nW)), where W is the heaviest edge in G, the stretch (approximation ratio) of the returned distance is (8k−2)(f+1), and the query time is O(|F|⋅log 2 n⋅log log n⋅log log d), where d is the distance between s and t in G(V,EF).  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号