首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Shortest distance queries are essential not only in graph analysis and graph mining tasks but also in database applications, when a large graph needs to be dealt with. Such shortest distance queries are frequently issued by end-users or requested as a subroutine in real applications. For intensive queries on large graphs, it is impractical to compute shortest distances on-line from scratch, and impractical to materialize all-pairs shortest distances. In the literature, 2-hop distance labeling is proposed to index the all-pairs shortest distances. It assigns distance labels to vertices in a large graph in a pre-computing step off-line and then answers shortest distance queries on-line by making use of such distance labels, which avoids exhaustively traversing the large graph when answering queries. However, the existing algorithms to generate 2-hop distance labels are not scalable to large graphs. Finding an optimal 2-hop distance labeling is NP-hard, and heuristic algorithms may generate large size distance labels while still needing to pre-compute all-pairs shortest paths. In this paper, we propose a multi-hop distance labeling approach, which generates a subset of the 2-hop distance labels as index off-line. We can compute the multi-hop distance labels efficiently by avoiding pre-computing all-pairs shortest paths. In addition, our multi-hop distance labeling is small in size to be stored. To answer a shortest distance query between two vertices, we first generate the query-specific small set of 2-hop distance labels for the two vertices based on our multi-hop distance labels stored and compute the shortest distance between the two vertices based on the 2-hop distance labels generated on-line. We conducted extensive performance studies on large real graphs and confirmed the efficiency of our multi-hop distance labeling scheme.  相似文献   

2.
The problem of enumerating the maximal cliques of a graph is a computationally expensive problem with applications in a number of different domains. Sometimes the benefit of knowing the maximal clique enumeration (MCE) of a single graph is worth investing the initial computation time. However, when graphs are abstractions of noisy or uncertain data, the MCE of several closely related graphs may need to be found, and the computational cost of doing so becomes prohibitively expensive.Here, we present a method by which the cost of enumerating the set of maximal cliques for related graphs can be reduced. By using the MCE for some baseline graph, the MCE for a modified, or perturbed, graph may be obtained by enumerating only the maximal cliques that are created or destroyed by the perturbation. When the baseline and perturbed graphs are relatively similar, the difference set between the two MCEs can be overshadowed by the maximal cliques common to both. Thus, by enumerating only the difference set between the baseline and perturbed graphs’ MCEs, the computational cost of enumerating the maximal cliques of the perturbed graph can be reduced.We present necessary and sufficient conditions for enumerating difference sets when the perturbed graph is formed by several different types of perturbations. We also present results of an algorithm based on these conditions that demonstrate a speedup over traditional calculations of the MCE of perturbed, real biological networks.  相似文献   

3.
Maximal clique enumeration is a fundamental problem in graph theory and has been extensively studied. However, maximal clique enumeration is time-consuming in large graphs and always returns enormous cliques with large overlaps. Motivated by this, in this paper, we study the diversified top-k clique search problem which is to find top-k cliques that can cover most number of nodes in the graph. Diversified top-k clique search can be widely used in a lot of applications including community search, motif discovery, and anomaly detection in large graphs. A naive solution for diversified top-k clique search is to keep all maximal cliques in memory and then find k of them that cover most nodes in the graph by using the approximate greedy max k-cover algorithm. However, such a solution is impractical when the graph is large. In this paper, instead of keeping all maximal cliques in memory, we devise an algorithm to maintain k candidates in the process of maximal clique enumeration. Our algorithm has limited memory footprint and can achieve a guaranteed approximation ratio. We also introduce a novel light-weight \(\mathsf {PNP}\)-\(\mathsf {Index}\), based on which we design an optimal maximal clique maintenance algorithm. We further explore three optimization strategies to avoid enumerating all maximal cliques and thus largely reduce the computational cost. Besides, for the massive input graph, we develop an I/O efficient algorithm to tackle the problem when the input graph cannot fit in main memory. We conduct extensive performance studies on real graphs and synthetic graphs. One of the real graphs contains 1.02 billion edges. The results demonstrate the high efficiency and effectiveness of our approach.  相似文献   

4.
Most graph visualization techniques focus on the structure of graphs and do not offer support for dealing with node attributes and edge labels. To enable users to detect relations and patterns in terms of data associated with nodes and edges, we present a technique where this data plays a more central role. Nodes and edges are clustered based on associated data. Via direct manipulation users can interactively inspect and query the graph. Questions that can be answered include, “which edge types are activated by specific node attributes?” and, “how and from where can I reach specific types of nodes?” To validate our approach we contrast it with current practice. We also provide several examples where our method was used to study transition graphs that model real‐world systems.  相似文献   

5.
Several graph libraries have been developed in the past few decades, and they were basically designed to work with a few graphs. However, there are many problems in which we have to consider all subgraphs satisfying certain constraints on a given graph. Since the number of subgraphs can increase exponentially with the graph size, explicitly representing these sets is infeasible. Hence, libraries concerned with efficiently representing a single graph instance are not suitable for such problems. In this paper, we develop Graphillion, a software library for very large sets of (vertex-)labeled graphs, based on zero-suppressed binary decision diagrams. Graphillion is not based on a traditional representation of graphs. Instead, a graph set is simply regarded as a “set of edge sets” ignoring vertices, which allows us to employ powerful tools of a “family of sets” (a set of sets) and permits large graph sets to be handled efficiently. We also utilize advanced graph enumeration algorithms, which enable the simple family tools to understand the graph structure. Graphillion is implemented as a Python library to encourage easy development of its applications, without introducing significant performance overheads. In experiments, we consider two case studies, a puzzle solver and a power network optimizer, in which several operations and heavy optimization have to be performed over very large sets of constrained graphs (i.e., cycles or forests with complicated conditions). The results show that Graphillion allows us to manage a huge number of graphs with very low development effort.  相似文献   

6.
本文提出了一种无向图视觉清晰化显示算法,使一般的无向关系图经过该算法重新确定顶点位置后,能得到清晰美观的输出结果。该算法首先将无向关系图去除孤立点,分离连通分支,并通过识别割边将每个连通分支分解成一系列的团,每个团内无割边,这些团以树型结构连接;然后通过识别割点和虚连线将每个团分解成子团,每个子团内无割点;最后将子团内顶点均匀分布在一个圆环上。该算法的优点在于实现方便,方法简单,运行高效,输出结果美观,并易于并行化。  相似文献   

7.
Graph-based modeling has emerged as a powerful abstraction capable of capturing in a single and unified framework many of the relational, spatial, topological, and other characteristics that are present in a variety of datasets and application areas. Computationally efficient algorithms that find patterns corresponding to frequently occurring subgraphs play an important role in developing data mining-driven methodologies for analyzing the graphs resulting from such datasets. This paper presents two algorithms, based on the horizontal and vertical pattern discovery paradigms, that find the connected subgraphs that have a sufficient number of edge-disjoint embeddings in a single large undirected labeled sparse graph. These algorithms use three different methods for determining the number of edge-disjoint embeddings of a subgraph and employ novel algorithms for candidate generation and frequency counting, which allow them to operate on datasets with different characteristics and to quickly prune unpromising subgraphs. Experimental evaluation on real datasets from various domains show that both algorithms achieve good performance, scale well to sparse input graphs with more than 120,000 vertices or 110,000 edges, and significantly outperform previously developed algorithms.  相似文献   

8.
Coloring large graphs based on independent set extraction   总被引:1,自引:0,他引:1  
This paper presents an effective approach (EXTRACOL) to coloring large graphs. The proposed approach uses a preprocessing method to extract large independent sets from the graph and a memetic algorithm to color the residual graph. Each preprocessing application identifies, with a dedicated tabu search algorithm, a number of pairwise disjoint independent sets of a given size in order to maximize the vertices removed from the graph. We evaluate EXTRACOL on the 11 largest graphs (with 1000 to 4000 vertices) of the DIMACS challenge benchmarks and show improved results for four very difficult graphs (DSJC1000.9, C2000.5, C2000.9, C4000.5). The behavior of the proposed algorithm is also analyzed.  相似文献   

9.
With the increasing size and complexity of available databases, existing machine learning and data mining algorithms are facing a scalability challenge. In many applications, the number of features describing the data could be extremely high. This hinders or even could make any further exploration infeasible. In fact, many of these features are redundant or simply irrelevant. Hence, feature selection plays a key role in helping to overcome the problem of information overload especially in big data applications. Since many complex datasets could be modeled by graphs of interconnected labeled elements, in this work, we are particularly interested in feature selection for subgraph patterns. In this paper, we propose MR-SimLab, a MapReduce-based approach for subgraph selection from large input subgraph sets. In many applications, it is easy to compute pairwise similarities between labels of the graph nodes. Our approach leverages such rich information to measure an approximate subgraph matching by aggregating the elementary label similarities between the matched nodes. Based on the aggregated similarity scores, our approach selects a small subset of informative representative subgraphs. We provide a distributed implementation of our algorithm on top of the MapReduce framework that optimizes the computational efficiency of our approach for big data applications. We experimentally evaluate MR-SimLab on real datasets. The obtained results show that our approach is scalable and that the selected subgraphs are informative.  相似文献   

10.
An abstraction resilient to common malware obfuscation techniques is the call-graph. A call-graph is the representation of an executable file as a directed graph with labeled vertices, where the vertices correspond to functions and the edges to function calls. Unfortunately, most of the interesting graph comparison problems, including full-graph comparison and computing the largest common subgraph, belong to the \(NP\) -hard class. This makes the study and use of graphs in large scale systems difficult. Existing work has focused only on offline clustering and has not addressed the issue of clustering streams of graphs. In this paper we present Classy, a scalable distributed system that clusters streams of large call-graphs for purposes including automated malware classification and facilitating malware analysts. Since algorithms aimed at clustering sets are not suitable for clustering streams of objects, we propose the use of a clustering algorithm that relies on the notion of candidate clusters and reference samples therein. We demonstrate via thorough experimentation that this approach yields results very close to the offline optimal. Graph similarity is determined by computing a graph edit distance (GED) of pairs of graphs using an adapted version of simulated annealing. Furthermore, we present a novel lower bound for the GED. We also study the problem of approximating statistics of clusters of graphs when the distances of only a fraction of all possible pairs have been computed. Finally, we present results and statistics from a real production-side system that has clustered and contains more than 0.8 million graphs.  相似文献   

11.
A bipartite graph is a powerful abstraction for modeling relationships between two collections. Visualizations of bipartite graphs allow users to understand the mutual relationships between the elements in the two collections, e.g., by identifying clusters of similarly connected elements. However, commonly‐used visual representations do not scale for the analysis of large bipartite graphs containing tens of millions of vertices, often resorting to an a‐priori clustering of the sets. To address this issue, we present the Who's‐Active‐On‐What‐Visualization (WAOW‐Vis) that allows for multiscale exploration of a bipartite social‐network without imposing an a‐priori clustering. To this end, we propose to treat a bipartite graph as a high‐dimensional space and we create the WAOW‐Vis adapting the multiscale dimensionality‐reduction technique HSNE. The application of HSNE for bipartite graph requires several modifications that form the contributions of this work. Given the nature of the problem, a set‐based similarity is proposed. For efficient and scalable computations, we use compressed bitmaps to represent sets and we present a novel space partitioning tree to efficiently compute similarities; the Sets Intersection Tree. Finally, we validate WAOW‐Vis on several datasets connecting Twitter‐users and ‐streams in different domains: news, computer science and politics. We show how WAOW‐Vis is particularly effective in identifying hierarchies of communities among social‐media users.  相似文献   

12.
葛尧  陈松灿 《软件学报》2020,31(4):1101-1112
图卷积网络是一种针对图信号的深度学习模型,由于具有强大的特征表征能力得到了广泛应用.推荐系统可视为图信号的链接预测问题,因此近年来提出了使用图卷积网络解决推荐问题的方法.推荐系统中存在用户与商品间的异质顶点交互和用户(或商品)内部的同质顶点交互,然而,现有方法中的图卷积操作要么仅在异质顶点间进行,要么仅在同质顶点间进行,留下了提升此类推荐系统性能的空间.考虑到这一问题,提出了一种新的基于图卷积网络的推荐算法,使用两组图卷积操作同时利用两种不同的交互信息,其中异质顶点卷积用于挖掘交互图谱域中存在的连接信息,同质顶点卷积用于使相似顶点具有相近表示.实验结果表明,该算法比现有算法具有更优的精度.  相似文献   

13.
许多来自工业应用的优化问题都是NP难问题。确定参数可解FPT作为处理这类问题的另外一种思路,在最近的10多年中受到了广泛的关注。支配集问题是图论中最重要的NP完全的组合优化问题之一,即使对于FPT体系而言,一般图中的支配集问题属于W[2]完全的,意味着不可能设计出复杂度为f(k)no(1)的算法。在本文中,我们考虑在给定的平面图G=(V,E)中参数化支配集问题,给定参数k,看是否存在大小为k的顶点集合支配图中的其他顶点,当把问题限定在平面图上,这个问题属于确定参数可解。本文给出了基于两组归约规则的搜索树算法,通过使用规约技术化简实例,构造搜索树,得到了复杂度为O(8kn)的算法,同时通过相关实验结果显示了归约规则对算法的作用。  相似文献   

14.
Almost all drift detection mechanisms designed for classification problems work reactively: after receiving the complete data set (input patterns and class labels) they apply a sequence of procedures to identify some change in the class-conditional distribution – a concept drift. However, detecting changes after its occurrence can be in some situations harmful to the process under analysis. This paper proposes a proactive approach for abrupt drift detection, called DetectA (Detect Abrupt Drift). Briefly, this method is composed of three steps: (i) label the patterns from the test set (an unlabelled data block), using an unsupervised method; (ii) compute some statistics from the train and test sets, conditioned to the given class labels for train set; and (iii) compare the training and testing statistics using a multivariate hypothesis test. Based on the results of the hypothesis tests, we attempt to detect the drift on the test set, before the real labels are obtained. A procedure for creating datasets with abrupt drift has been proposed to perform a sensitivity analysis of the DetectA model. The result of the sensitivity analysis suggests that the detector is efficient and suitable for datasets of high-dimensionality, blocks with any proportion of drifts, and datasets with class imbalance. The performance of the DetectA method, with different configurations, was also evaluated on real and artificial datasets, using an MLP as a classifier. The best results were obtained using one of the detection methods, being the proactive manner a top contender regarding improving the underlying base classifier accuracy.  相似文献   

15.
A significant number of applications require effective and efficient manipulation of relational graphs, towards discovering important patterns. Some example applications are: (i) analysis of microarray data in bioinformatics, (ii) pattern discovery in a large graph representing a social network, (iii) analysis of transportation networks, (iv) community discovery in Web data. The basic approach followed by existing methods is to apply mining techniques on graph data to discover important patterns, such as subgraphs that are likely to be useful. However, in some cases the number of mined patterns is large, posing difficulties in selecting the most important ones. For example, applying frequent subgraph mining on a set of graphs the system returns all connected subgraphs whose frequency is above a specified (usually user-defined) threshold. The number of discovered patterns may be large, and this number depends on the data characteristics and the frequency threshold specified. It would be more convenient for the user if “goodness” criteria could be set to evaluate the usefulness of these patterns, and if the user could provide preferences to the system regarding the characteristics of the discovered patterns. In this paper, we propose a methodology to support such preferences by applying subgraph discovery in relational graphs towards retrieving important connected subgraphs. The importance of a subgraph is determined by: (i) the order of the subgraph (the number of vertices) and (ii) the subgraph edge connectivity. The performance of the proposed technique is evaluated by using real-life as well as synthetically generated data sets.  相似文献   

16.
We consider the community detection problem from a partially observable network structure where some edges are not observable. Previous community detection methods are often based solely on the observed connectivity relation and the above situation is not explicitly considered. Even when the connectivity relation is partially observable, if some profile data about the vertices in the network is available, it can be exploited as auxiliary or additional information. We propose to utilize a graph structure (called a profile graph) which is constructed via the profile data, and propose a simple model to utilize both the observed connectivity relation and the profile graph. Furthermore, instead of a hierarchical approach, based on the modularity matrix of the network structure, we propose an embedding approach which utilizes the regularization via the profile graph. Various experiments are conducted over two social network datasets and comparison with several state-of-the-art methods is reported. The results are encouraging and indicate that it is promising to pursue this line of research.  相似文献   

17.
Applications like identifying different customers from their unique buying behaviours, determining ratingsof a product given by users based on different sets of features, etc. require classification using class-specific subsets of features. Most of the existing state-of-the-art classifiers for multivariate data use complete feature set for classification regardless of the different class labels. Decision tree classifier can produce class-wise subsets of features. However, none of these classifiers model the relationship between features which may enhance classification accuracy. We call the class-specific subsets of features and the features’ interrelationships as class signatures. In this work, we propose to map the original input space of multivariate data to the feature space characterized by connected graphs as graphs can easily model entities, their attributes, and relationships among attributes. Mostly, entities are modeled using graphs, where graphs occur naturally, for example, chemical compounds. However, graphs do not occur naturally in multivariate data. Thus, extracting class signatures from multivariate data is a challenging task. We propose some feature selection heuristics to obtain class-specific prominent subgraph signatures. We also propose two variants of class signatures based classifier namely: 1) maximum matching signature (gMM), and 2) score and size of matched signatures (gSM). The effectiveness of the proposed approach on real-world and synthetic datasets has been studied and compared with other established classifiers. Experimental results confirm the ascendancy of the proposed class signatures based classifier on most of the datasets.  相似文献   

18.
We represent a new method for finding all connected maximal common subgraphs in two graphs which is based on the transformation of the problem into the clique problem. We have developed new algorithms for enumerating all cliques that represent connected maximal common subgraphs. These algorithms are based on variants of the Bron–Kerbosch algorithm. In this paper we explain the transformation of the maximal common subgraph problem into the clique problem. We give a short summary of the variants of the Bron–Kerbosch algorithm in order to explain the modification of that algorithm such that the detected cliques represent connected maximal common subgraphs. After introducing and proving several variants of the modified algorithm we discuss the runtimes for all variants by means of random graphs. The results show the drastical reduction of the runtimes for the new algorithms.  相似文献   

19.
What does a “normal” computer (or social) network look like? How can we spot “abnormal” sub-networks in the Internet, or web graph? The answer to such questions is vital for outlier detection (terrorist networks, or illegal money-laundering rings), forecasting, and simulations (“how will a computer virus spread?”).The heart of the problem is finding the properties of real graphs that seem to persist over multiple disciplines. We list such patterns and “laws”, including the “min-cut plots” discovered by us. This is the first part of our NetMine package: given any large graph, it provides visual feedback about these patterns; any significant deviations from the expected patterns can thus be immediately flagged by the user as abnormalities in the graph. The second part of NetMine is the A-plots tool for visualizing the adjacency matrix of the graph in innovative new ways, again to find outliers. Third, NetMine contains the R-MAT (Recursive MATrix) graph generator, which can successfully model many of the patterns found in real-world graphs and quickly generate realistic graphs, capturing the essence of each graph in only a few parameters. We present results on multiple, large real graphs, where we show the effectiveness of our approach.  相似文献   

20.
Maximal biclique (also known as complete bipartite) subgraphs can model many applications in Web mining, business, and bioinformatics. Enumerating maximal biclique subgraphs from a graph is a computationally challenging problem, as the size of the output can become exponentially large with respect to the vertex number when the graph grows. In this paper, we efficiently enumerate them through the use of closed patterns of the adjacency matrix of the graph. For an undirected graph G without self-loops, we prove that 1) the number of closed patterns in the adjacency matrix of G is even, 2) the number of the closed patterns is precisely double the number of maximal biclique subgraphs of G, and 3) for every maximal biclique subgraph, there always exists a unique pair of closed patterns that matches the two vertex sets of the subgraph. Therefore, the problem of enumerating maximal bicliques can be solved by using efficient algorithms for mining closed patterns, which are algorithms extensively studied in the data mining field. However, this direct use of existing algorithms causes a duplicated enumeration. To achieve high efficiency, we propose an O(mn) time delay algorithm for a nonduplicated enumeration, in particular, for enumerating those maximal bicliques with a large size, where m and n. are the number of edges and vertices of the graph, respectively. We evaluate the high efficiency of our algorithm by comparing it to state- of-the-art algorithms on three categories of graphs: randomly generated graphs, benchmarks, and a real-life protein interaction network. In this paper, we also prove that if self-loops are allowed in a graph, then the number of closed patterns in the adjacency matrix is not necessarily even, but the maximal bicliques are exactly the same as those of the graph after removing all the self-loops.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号