期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Top-<Emphasis Type="Italic">k</Emphasis> overlapping densest subgraphs

Esther Galbrun Aristides Gionis Nikolaj Tatti 《Data mining and knowledge discovery》2016,30(5):1134-1165

Finding dense subgraphs is an important problem in graph mining and has many practical applications. At the same time, while large real-world networks are known to have many communities that are not well-separated, the majority of the existing work focuses on the problem of finding a single densest subgraph. Hence, it is natural to consider the question of finding the top-k densest subgraphs. One major challenge in addressing this question is how to handle overlaps: eliminating overlaps completely is one option, but this may lead to extracting subgraphs not as dense as it would be possible by allowing a limited amount of overlap. Furthermore, overlaps are desirable as in most real-world graphs there are vertices that belong to more than one community, and thus, to more than one densest subgraph. In this paper we study the problem of finding top-k overlapping densest subgraphs, and we present a new approach that improves over the existing techniques, both in theory and practice. First, we reformulate the problem definition in a way that we are able to obtain an algorithm with constant-factor approximation guarantee. Our approach relies on using techniques for solving the max-sum diversification problem, which however, we need to extend in order to make them applicable to our setting. Second, we evaluate our algorithm on a collection of benchmark datasets and show that it convincingly outperforms the previous methods, both in terms of quality and efficiency. 相似文献

2.

Top-<Emphasis Type="Italic">k</Emphasis> temporal keyword search over social media data

Fan?Xia Email author View author&#;s OrcID profile Chengcheng?Yu Linhao?Xu Weining?Qian Aoying?Zhou 《World Wide Web》2017,20(5):1049-1069

Social media services have already become main sources for monitoring emerging topics and sensing real-life events. A social media platform manages social stream consisting of a huge volume of timestamped user generated data, including original data and repost data. However, previous research on keyword search over social media data mainly emphasizes on the recency of information. In this paper, we first propose a problem of top-k most significant temporal keyword query to enable more complex query analysis. It returns top-k most popular social items that contain the keywords in the given query time window. Then, we design a temporal inverted index with two-tiers posting list to index social time series and a segment store to compute the exact social significance of social items. Next, we implement a basic query algorithm based on our proposed index structure and give a detailed performance analysis on the query algorithm. From the analysis result, we further refine our query algorithm with a piecewise maximum approximation (PMA) sketch. Finally, extensive empirical studies on a real-life microblog dataset demonstrate the combination of two-tiers posting list and PMA sketch achieves remarkable performance improvement under different query settings. 相似文献

3.

Providing <Emphasis Type="Italic">k</Emphasis>-anonymity in data mining

Arik Friedman Ran Wolff Assaf Schuster 《The VLDB Journal The International Journal on Very Large Data Bases》2008,17(4):789-804

In this paper we present extended definitions of k-anonymity and use them to prove that a given data mining model does not violate the k-anonymity of the individuals represented in the learning examples. Our extension provides a tool that measures the amount of anonymity retained during data mining. We show that our model can be applied to various data mining problems, such as classification, association rule mining and clustering. We describe two data mining algorithms which exploit our extension to guarantee they will generate only k-anonymous output, and provide experimental results for one of them. Finally, we show that our method contributes new and efficient ways to anonymize data and preserve patterns during anonymization. 相似文献

4.

Approximate Continuous Top-<Emphasis Type="Italic">k</Emphasis> Query over Sliding Window

下载免费PDF全文

Rui Zhu Bin Wang Shi-Ying Luo Xiao-Chun Yang Guo-Ren Wang 《计算机科学技术学报》2017,32(1):93-109

Continuous top-k query over sliding window is a fundamental problem in database, which retrieves k objects with the highest scores when the window slides. Existing studies mainly adopt exact algorithms to tackle this type of queries, whose key idea is to maintain a subset of objects in the window, and try to retrieve answers from it. However, all the existing algorithms are sensitive to query parameters and data distribution. In addition, they suffer from expensive overhead for incremental maintenance, and thus cannot satisfy real-time requirement. In this paper, we define a novel query named (ε, δ)-approximate continuous top-k query, which returns approximate answers for top-k query. In order to efficiently support this query, we propose an efficient framework, named PABF (Probabilistic Approximate Based Framework), to support approximate top-k query over sliding window. We firstly maintain a self-adaptive pruning value, which could filter out newly arrived objects who have a probability less than 1 ? δ of being a query result. For those objects that are not filtered, we combine them together, if the score difference among them is less than a threshold. To efficiently maintain these combined results, the framework PABF also proposes a multi-phase merging algorithm. Theoretical analysis indicates that even in the worst case, we require only logarithmic complexity for maintaining each candidate. 相似文献

5.

Continuously monitoring top-<Emphasis Type="Italic">k</Emphasis> uncertain data streams: a probabilistic threshold method

Ming Hua Jian Pei 《Distributed and Parallel Databases》2009,26(1):29-65

Recently, uncertain data processing has become more and more important. Although a significant amount of previous research explores various continuous queries on data streams, continuous queries on uncertain data streams have seldom been investigated. In this paper, we formulate a novel and challenging problem of continuously monitoring top-k uncertain data streams, and propose a probabilistic threshold method. We develop four algorithms systematically: a deterministic exact algorithm, a randomized method, and their space-efficient versions using quantile summaries. An extensive empirical study using real data sets and synthetic data sets is reported to verify the effectiveness and the efficiency of our methods. We are grateful to the anonymous reviewers and Dr. Ihab F. Ilyas, the guest editor, for their constructive comments which help to improve the quality of the paper substantially. The research of Ming Hua and Jian Pei is supported in part by an NSERC Discovery grant and an NSERC Discovery Accelerator Supplements grant. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies. 相似文献

6.

Top-<Emphasis Type="Italic">k</Emphasis> coupled keyword recommendation for relational keyword queries

Xiangfu Meng Longbing Cao Xiaoyan Zhang Jingyu Shao 《Knowledge and Information Systems》2017,50(3):883-916

Providing top-k typical relevant keyword queries would benefit the users who cannot formulate appropriate queries to express their imprecise query intentions. By extracting the semantic relationships both between keywords and keyword queries, this paper proposes a new keyword query suggestion approach which can provide typical and semantically related queries to the given query. Firstly, a keyword coupling relationship measure, which considers both intra- and inter-couplings between each pair of keywords, is proposed. Then, the semantic similarity of different keyword queries can be measured by using a semantic matrix, in which the coupling relationships between keywords in queries are reserved. Based on the query semantic similarities, we next propose an approximation algorithm to find the most typical queries from query history by using the probability density estimation method. Lastly, a threshold-based top-k query selection method is proposed to expeditiously evaluate the top-k typical relevant queries. We demonstrate that our keyword coupling relationship and query semantic similarity measures can capture the coupling relationships between keywords and semantic similarities between keyword queries accurately. The efficiency of query typicality analysis and top-k query selection algorithm is also demonstrated. 相似文献

7.

Shooting top-<Emphasis Type="Italic">k</Emphasis> stars in uncertain databases

Xiang Lian Lei Chen 《The VLDB Journal The International Journal on Very Large Data Bases》2011,20(6):819-840

Query processing in the uncertain database has played an important role in many real-world applications due to the wide existence of uncertain data. Although many previous techniques can correctly handle precise data, they are not directly applicable to the uncertain scenario. In this article, we investigate and propose a novel query, namely probabilistic top-k star (PTkS) query, which aims to retrieve k objects in an uncertain database that are “closest” to a static/dynamic query point, considering both distance and probability aspects. In order to efficiently answer PTkS queries with a static/moving query point, we propose effective pruning methods to reduce the PTkS search space, which can be seamlessly integrated into an efficient query procedure. Finally, extensive experiments have demonstrated the efficiency and effectiveness of our proposed PTkS approaches on both real and synthetic data sets, under various parameter settings. 相似文献

8.

Maximum <Emphasis Type="Italic">k</Emphasis>-Splittable <Emphasis Type="Italic">s</Emphasis>,<Emphasis Type="Italic">t</Emphasis>-Flows

Ronald Koch Martin Skutella Ines Spenke 《Theory of Computing Systems》2008,43(1):56-66

Given a graph with a source and a sink node, the NP-hard maximum k-splittable s,t-flow (M k SF) problem is to find a flow of maximum value from s to t with a flow decomposition using at most k paths. The multicommodity variant of this problem is a natural generalization of disjoint paths and unsplittable flow problems. Constructing a k-splittable flow requires two interdepending decisions. One has to decide on k paths (routing) and on the flow values for the paths (packing). We give efficient algorithms for computing exact and approximate solutions by decoupling the two decisions into a first packing step and a second routing step. Usually the routing is considered before the packing. Our main contributions are as follows: (i) We show that for constant k a polynomial number of packing alternatives containing at least one packing used by an optimal M k SF solution can be constructed in polynomial time. If k is part of the input, we obtain a slightly weaker result. In this case we can guarantee that, for any fixed ε>0, the computed set of alternatives contains a packing used by a (1−ε)-approximate solution. The latter result is based on the observation that (1−ε)-approximate flows only require constantly many different flow values. We believe that this observation is of interest in its own right. (ii) Based on (i), we prove that, for constant k, the M k SF problem can be solved in polynomial time on graphs of bounded treewidth. If k is part of the input, this problem is still NP-hard and we present a polynomial time approximation scheme for it. 相似文献

9.

Nonblocking <Emphasis Type="Italic">k</Emphasis>-Compare-Single-Swap

Victor Luchangco Mark Moir Nir Shavit 《Theory of Computing Systems》2009,44(1):39-66

The current literature offers two extremes of nonblocking software synchronization support for concurrent data structure design: intricate designs of specific structures based on single-location operations such as compare-and-swap (CAS), and general-purpose multilocation transactional memory implementations. While the former are sometimes efficient, they are invariably hard to extend and generalize. The latter are flexible and general, but costly. This paper aims at a middle ground: reasonably efficient multilocation operations that are general enough to reduce the design difficulties of algorithms based on CAS alone. We present an obstruction-free implementation of an atomic k -location-compare single-location-swap (KCSS) operation. KCSS allows for simple nonblocking manipulation of linked data structures by overcoming the key algorithmic difficulty in their design: making sure that while a pointer is being manipulated, neighboring parts of the data structure remain unchanged. Our algorithm is efficient in the common uncontended case: A successful k-location KCSS operation requires only two CAS operations, two stores, and 2k noncached loads when there is no contention. We therefore believe our results lend themselves to efficient and flexible nonblocking manipulation of list-based data structures in today’s architectures. A preliminary version of this paper appeared in the Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 314–323, San Diego, California, USA, 2003. 相似文献

10.

Top-<Emphasis Type="Italic">k</Emphasis> query processing in mobile-P2P networks using economic incentive schemes

Nilesh Padhariya Anirban Mondal Sanjay Kumar Madria 《Peer-to-Peer Networking and Applications》2016,9(4):731-751

This work proposes the E-Top system for the efficient processing of top-k queries in mobile ad hoc peer to peer (M-P2P) networks using economic incentive schemes. In E-Top, brokers facilitate top-k query processing in lieu of a commission. E-Top issues economic rewards to the mobile peers, which send relevant data items (i.e., those that contribute to the top-k query result), and penalizes peers otherwise, thereby optimizing the communication traffic. Peers use the payoffs (rewards/penalties) as a means of feedback to re-evaluate the scores of their items for re-ranking purposes. The main contributions of E-Top are three-fold. First, it proposes two economic incentive schemes, namely ETK and ETK+, in which peers act individually towards top-k query processing. Second, it extends ETK and ETK+ to propose a peer group-based economic incentive scheme ETG. Third, our performance evaluation shows that our schemes are indeed effective in improving the performance of top-k queries in terms of query response times and accuracy at reasonable communication traffic cost. 相似文献

11.

Supporting top-<Emphasis Type="Italic">k</Emphasis><Emphasis Type="Italic">join</Emphasis> queries in relational databases

Ihab?F.?Ilyas Email author Walid?G.?Aref Ahmed?K.?Elmagarmid 《The VLDB Journal The International Journal on Very Large Data Bases》2004,13(3):207-221

Ranking queries, also known as top-k queries, produce results that are ordered on some computed score. Typically, these queries involve joins, where users are usually interested only in the top-k join results. Top-k queries are dominant in many emerging applications, e.g., multimedia retrieval by content, Web databases, data mining, middlewares, and most information retrieval applications. Current relational query processors do not handle ranking queries efficiently, especially when joins are involved. In this paper, we address supporting top-k join queries in relational query processors. We introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. The idea is to rank the join results progressively during the join operation. We introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. The operators are nonblocking and can be integrated into pipelined execution plans. We also propose an efficient heuristic designed to optimize a top-k join query by choosing the best join order. We address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. We implement the new operators inside a prototype database engine based on PREDATOR. The experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance.Received: 23 December 2003, Accepted: 31 March 2004, Published online: 12 August 2004Edited by: S. AbiteboulExtended version of the paper published in the Proceedings of the 29th International Conference on Very Large Databases, VLDB 2003, Berlin, Germany, pp 754-765 相似文献

12.

Approximate distributed top-<Emphasis Type="Italic">k</Emphasis> queries

Boaz Patt-Shamir Allon Shafrir 《Distributed Computing》2008,21(1):1-22

We consider a distributed system where each node keeps a local count for items (similar to elections where nodes are ballot boxes and items are candidates). A top-k query in such a system asks which are the k items whose global count, across all nodes in the system, is the largest. In this paper, we present a Monte Carlo algorithm that outputs, with high probability, a set of k candidates which approximates the top-k items. The algorithm is motivated by sensor networks in that it focuses on reducing the individual communication complexity. In contrast to previous algorithms, the communication complexity depends only on the global scores and not on the partition of scores among nodes. If the number of nodes is large, our algorithm dramatically reduces the communication complexity when compared with deterministic algorithms. We show that the complexity of our algorithm is close to a lower bound on the cell-probe complexity of any non-interactive top-k approximation algorithm. We show that for some natural global distributions (such as the Geometric or Zipf distributions), our algorithm needs only polylogarithmic number of communication bits per node. An extended abstract of this paper appeared in Proc. 13th Int. Colloquium on Structural Information and Communication Complexity, SIROCCO 2006, Lecture Notes in Computer Science 4056, pp. 319–333. 相似文献

13.

Diversified top-<Emphasis Type="Italic">k</Emphasis> clique search

Long Yuan Lu Qin Xuemin Lin Lijun Chang Wenjie Zhang 《The VLDB Journal The International Journal on Very Large Data Bases》2016,25(2):171-196

Maximal clique enumeration is a fundamental problem in graph theory and has been extensively studied. However, maximal clique enumeration is time-consuming in large graphs and always returns enormous cliques with large overlaps. Motivated by this, in this paper, we study the diversified top-k clique search problem which is to find top-k cliques that can cover most number of nodes in the graph. Diversified top-k clique search can be widely used in a lot of applications including community search, motif discovery, and anomaly detection in large graphs. A naive solution for diversified top-k clique search is to keep all maximal cliques in memory and then find k of them that cover most nodes in the graph by using the approximate greedy max k-cover algorithm. However, such a solution is impractical when the graph is large. In this paper, instead of keeping all maximal cliques in memory, we devise an algorithm to maintain k candidates in the process of maximal clique enumeration. Our algorithm has limited memory footprint and can achieve a guaranteed approximation ratio. We also introduce a novel light-weight \(\mathsf {PNP}\)-\(\mathsf {Index}\), based on which we design an optimal maximal clique maintenance algorithm. We further explore three optimization strategies to avoid enumerating all maximal cliques and thus largely reduce the computational cost. Besides, for the massive input graph, we develop an I/O efficient algorithm to tackle the problem when the input graph cannot fit in main memory. We conduct extensive performance studies on real graphs and synthetic graphs. One of the real graphs contains 1.02 billion edges. The results demonstrate the high efficiency and effectiveness of our approach. 相似文献

14.

Multivariate fuzzy <Emphasis Type="Italic">k</Emphasis>-modes algorithm

Diêgo B. M. Maciel Getulio J. A. Amaral Renata M. C. R. de Souza Bruno A. Pimentel 《Pattern Analysis & Applications》2017,20(1):59-71

In the fuzzy k-modes clustering, there is just one membership degree of interest by class for each individual which cannot be sufficient to model ambiguity of data precisely. It is known that the essence of a multivariate thinking allows to expose the inherent structure and meaning revealed within a set of variables classified. In this paper, a multivariate approach for membership degrees is presented to better handle ambiguous data that share properties of different clusters. This method is compared with other fuzzy k-modes methods of the literature based on a multivariate internal index that is also proposed in this paper. Synthetic and real categorical data sets are considered in this study. 相似文献

15.

<Emphasis Type="Italic">k</Emphasis>-most suitable locations selection

Yu-Chi Chung I-Fang Su Chiang Lee 《GeoInformatica》2018,22(4):661-692

Choosing the best location for starting a business or expanding an existing enterprize is an important issue. A number of location selection problems have been discussed in the literature. They often apply the Reverse Nearest Neighbor as the criterion for finding suitable locations. In this paper, we apply the Average Distance as the criterion and propose the so-called k-most suitable locations (k-MSL) selection problem. Given a positive integer k and three datasets: a set of customers, a set of existing facilities, and a set of potential locations. The k-MSL selection problem outputs k locations from the potential location set, such that the average distance between a customer and his nearest facility is minimized. In this paper, we formally define the k-MSL selection problem and show that it is NP-hard. We first propose a greedy algorithm which can quickly find an approximate result for users. Two exact algorithms are then proposed to find the optimal result. Several pruning rules are applied to increase computational efficiency. We evaluate the algorithms’ performance using both synthetic and real datasets. The results show that our algorithms are able to deal with the k-MSL selection problem efficiently. 相似文献

16.

Categorical top-<Emphasis Type="Italic">k</Emphasis> spatial influence query

Jianye Yang Wenjie Zhang Ying Zhang Xiaoyang Wang Xuemin Lin 《World Wide Web》2017,20(2):175-203

The influence of a spatial facility object depicts the importance of the object in the whole data space. In this paper, we present a novel definition of object influence in applications where objects are of different categories. We study the problem of Spatial Influence Query which considers the contribution of an object in forming functional units consisting of a given set of objects with different categories designated by users. We first show that the problem of spatial influence query is NP-hard with respect to the number of object categories in the functional unit. To tackle the computational hardness, we develop an efficient framework following two main steps, possible participants finding and optimal functional unit computation. Based on this framework, for the first step, novel and efficient pruning techniques are developed based on the nearest neighbor set (NNS) approach. To find the optimal functional unit efficiently, we propose two algorithms, an exact algorithm and an efficient approximate algorithm with performance guarantee. Comprehensive experiments on both real and synthetic datasets demonstrate the effectiveness and efficiency of our techniques. 相似文献

17.

Multiple <Emphasis Type="Italic">k</Emphasis> nearest neighbor search

Yu-Chi Chung I-Fang Su Chiang Lee Pei-Chi Liu 《World Wide Web》2017,20(2):371-398

The problem of kNN (k Nearest Neighbor) queries has received considerable attention in the database and information retrieval communities. Given a dataset D and a kNN query q, the k nearest neighbor algorithm finds the closest k data points to q. The applications of kNN queries are board, not only in spatio-temporal databases but also in many areas. For example, they can be used in multimedia databases, data mining, scientific databases and video retrieval. The past studies of kNN query processing did not consider the case that the server may receive multiple kNN queries at one time. Their algorithms process queries independently. Thus, the server will be busy with continuously reaccessing the database to obtain the data that have already been acquired. This results in wasting I/O costs and degrading the performance of the whole system. In this paper, we focus on this problem and propose an algorithm named COrrelated kNN query Evaluation (COKE). The main idea of COKE is an “information sharing” strategy whereby the server reuses the query results of previously executed queries for efficiently processing subsequent queries. We conduct a comprehensive set of experiments to analyze the performance of COKE and compare it with the Best-First Search (BFS) algorithm. Empirical studies indicate that COKE outperforms BFS, and achieves lower I/O costs and less running time. 相似文献

18.

Maintenance of top-<Emphasis Type="Italic">k</Emphasis> materialized views

Eftychia Baikousi Panos Vassiliadis 《Distributed and Parallel Databases》2010,27(2):95-137

In this paper we present results on the problem of maintaining materialized top-k views and provide results in two directions. The first problem we tackle concerns the maintenance of top-k views in the presence of high deletion rates. We provide a principled method that complements the inefficiency of the state of the art independently of the statistical properties of the data and the characteristics of the update streams. The second problem we have been concerned with has to do with the efficient maintenance of multiple top-k views in the presence of updates to their base relation. To this end, we provide theoretical guarantees for the nucleation (practically, inclusion) of a view with respect to another view and the reflection of this property to the management of updates. We also provide algorithmic results towards the maintenance of a large number of views, via their appropriate structuring in hierarchies of views. 相似文献

19.

Online <Emphasis Type="Italic">k</Emphasis>-Server Routing Problems

Vincenzo Bonifaci Leen Stougie 《Theory of Computing Systems》2009,45(3):470-485

In an online k-server routing problem, a crew of k servers has to visit points in a metric space as they arrive in real time. Possible objective functions include minimizing the makespan (k-Traveling Salesman Problem) and minimizing the sum of completion times (k-Traveling Repairman Problem). We give competitive algorithms, resource augmentation results and lower bounds for k-server routing problems in a wide class of metric spaces. In some cases the competitive ratio is dramatically better than that of the corresponding single server problem. Namely, we give a 1+O((log k)/k)-competitive algorithm for the k-Traveling Salesman Problem and the k-Traveling Repairman Problem when the underlying metric space is the real line. We also prove that a similar result cannot hold for the Euclidean plane. An extended abstract of this work has appeared in the proceedings of the 4th Workshop on Approximation and Online Algorithms, September 2006. Research of V. Bonifaci partly supported by the Dutch Ministry of Education, Culture and Science through a Huygens scholarship. Research of L. Stougie partly supported by MRT Network ADONET of the European Community (MRTN-CT-2003-504438) and the Dutch BSIK/BRICKS project. 相似文献

20.

Participant selection for <Emphasis Type="Italic">t</Emphasis>-sweep <Emphasis Type="Italic">k</Emphasis>-coverage crowd sensing tasks

Zhiyong?Yu Jie?Zhou Wenzhong?Guo Email author Longkun?Guo Zhiwen?Yu 《World Wide Web》2018,21(3):741-758

With the popularization of wireless networks and mobile intelligent terminals, mobile crowd sensing is becoming a promising sensing paradigm. Tasks are assigned to users with mobile devices, which then collect and submit ambient information to the server. The composition of participants greatly determines the quality and cost of the collected information. This paper aims to select fewest participants to achieve the quality required by a sensing task. The requirement namely “t-sweep k-coverage” means for a target location, every t time interval should at least k participants sense. The participant selection problem for “t-sweep k-coverage” crowd sensing tasks is NP-hard. Through delicate matrix stacking, linear programming can be adopted to solve the problem when it is in small size. We further propose a participant selection method based on greedy strategy. The two methods are evaluated through simulated experiments using users’ call detail records. The results show that for small problems, both the two methods can find a participant set meeting the requirement. The number of participants picked by the greedy based method is roughly twice of the linear programming based method. However, when problems become larger, the linear programming based method performs unstably, while the greedy based method can still output a reasonable solution. 相似文献