首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
With the increasing size and complexity of available databases, existing machine learning and data mining algorithms are facing a scalability challenge. In many applications, the number of features describing the data could be extremely high. This hinders or even could make any further exploration infeasible. In fact, many of these features are redundant or simply irrelevant. Hence, feature selection plays a key role in helping to overcome the problem of information overload especially in big data applications. Since many complex datasets could be modeled by graphs of interconnected labeled elements, in this work, we are particularly interested in feature selection for subgraph patterns. In this paper, we propose MR-SimLab, a MapReduce-based approach for subgraph selection from large input subgraph sets. In many applications, it is easy to compute pairwise similarities between labels of the graph nodes. Our approach leverages such rich information to measure an approximate subgraph matching by aggregating the elementary label similarities between the matched nodes. Based on the aggregated similarity scores, our approach selects a small subset of informative representative subgraphs. We provide a distributed implementation of our algorithm on top of the MapReduce framework that optimizes the computational efficiency of our approach for big data applications. We experimentally evaluate MR-SimLab on real datasets. The obtained results show that our approach is scalable and that the selected subgraphs are informative.  相似文献   

2.
Frequent subgraphs proved to be powerful features for graph classification and prediction tasks. Their practical use is, however, limited by the computational intractability of pattern enumeration and that of graph embedding into frequent subgraph feature spaces. We propose a simple probabilistic technique that resolves both limitations. In particular, we restrict the pattern language to trees and relax the demand on the completeness of the mining algorithm, as well as on the correctness of the pattern matching operator by replacing transaction and query graphs with small random samples of their spanning trees. In this way we consider only a random subset of frequent subtrees, called probabilistic frequent subtrees, that can be enumerated efficiently. Our extensive empirical evaluation on artificial and benchmark molecular graph datasets shows that probabilistic frequent subtrees can be listed in practically feasible time and that their predictive and retrieval performance is very close even to those of complete sets of frequent subgraphs. We also present different fast techniques for computing the embedding of unseen graphs into (probabilistic frequent) subtree feature spaces. These algorithms utilize the partial order on tree patterns induced by subgraph isomorphism and, as we show empirically, require much less evaluations of subtree isomorphism than the standard brute-force algorithm. We also consider partial embeddings, i.e., when only a part of the feature vector has to be calculated. In particular, we propose a highly effective practical algorithm that significantly reduces the number of pattern matching evaluations required by the classical min-hashing algorithm approximating Jaccard-similarities.  相似文献   

3.
Graph-based data mining approaches have been mainly proposed to the task popularly known as frequent subgraph mining subject to a single user preference, like frequency, size, etc. In this work, we propose to deal with the frequent subgraph mining problem from multiobjective optimization viewpoint, where a subgraph (or solution) is defined by several user-defined preferences (or objectives), which are conflicting in nature. For example, mined subgraphs with high frequency are often of small size, and vice-versa. Use of such objectives in the multiobjective subgraph mining process generates Pareto-optimal subgraphs, where no subgraph is better than another subgraph in all objectives. We have applied a Pareto dominance approach for the evaluation and search subgraphs regarding to both proximity and diversity in multiobjective sense, which has incorporated in the framework of Subdue algorithm for subgraph mining. The method is called multiobjective subgraph mining by Subdue (MOSubdue) and has several advantages: (i) generation of Pareto-optimal subgraphs in a single run (ii) selection of subgraph-seeds from the candidate subgraphs based on all objectives (iii) search in the multiobjective subgraphs lattice space, and (iv) capability to deal with different multiobjective frequent subgraph mining tasks by customizing the tackled objectives. The good performance of MOSubdue is shown by performing multiobjective subgraph mining defined by two and three objectives on two real-life datasets.  相似文献   

4.
近年来,脑网络被广泛应用在脑部疾病的诊断和分类中。考虑到大脑的不确定性特征,先前的研究将不确定图应用在脑网络建模中。在不确定脑网络研究中,传统的特征提取方法多采用均值、方差、极差等进行子图特征提取,但是它们存在泛化性能差以及子图间差异无法直接衡量的问题,进而影响分类准确率。因此,相对极差被提出作为新的特征提取方法。这一方法的优点在于既考虑到子图模式间的最大差异,又考虑到子图模式间的组间差异,可以有效避免传统方法的弊端。结果表明,相对极差与其他特征提取方法相比,其分类性能显著高于传统方法。同时,在不同的特征选择方法下相对极差表现出较好的分类性能,具有很强的泛化性。该研究为不确定脑网络特征提取方法提供了重要的参考意义。  相似文献   

5.
In this paper, we present a method to extract user-specific features from common features. This contrasts with other approaches which work directly off B-rep geometric models. Here, user-specific features are called high-level features which are a set of common features combined in a user-specific manner. A feature relationship graph is used to organize common features in a part and to define high-level feature patterns. The research presented in this paper focuses mainly on feature relationship graph construction and high-level feature recognition using subgraph isomorphic techniques.  相似文献   

6.
王桂娟  印鉴  詹卫许 《计算机科学》2011,38(8):169-170,175
选择频繁的特征子图在基于频繁子图的图数据分类中起着非常重要的作用.提出了一种基于类别信息的特征子图选择策略,即从候选的频繁子图中选出独有频繁子图和显著频繁子图作为特征子图.实验结果显示,在对化合物数据分类时,该选择策略在分类性能上优于SVM方法特征选择策略和CEP方法的特征选择策略.  相似文献   

7.
The existing methods for graph-based data mining (GBDM) follow the basic approach of applying a single-objective search with a user-defined threshold to discover interesting subgraphs. This obliges the user to deal with simple thresholds and impedes her/him from evaluating the mined subgraphs by defining different “goodness” (i.e., multiobjective) criteria regarding the characteristics of the subgraphs. In previous papers, we defined a multiobjective GBDM framework to perform bi-objective graph mining in terms of subgraph support and size maximization. Two different search methods were considered with this aim, a multiobjective beam search and a multiobjective evolutionary programming (MOEP). In this contribution, we extend the latter formulation to a three-objective framework by incorporating another classical graph mining objective, the subgraph diameter. The proposed MOEP method for multiobjective GBDM is tested on five synthetic and real-world datasets and its performance is compared against single and multiobjective subgraph mining approaches based on the classical Subdue technique in GBDM. The results highlight the application of multiobjective subgraph mining allows us to discover more diversified subgraphs in the objective space.  相似文献   

8.
A significant number of applications require effective and efficient manipulation of relational graphs, towards discovering important patterns. Some example applications are: (i) analysis of microarray data in bioinformatics, (ii) pattern discovery in a large graph representing a social network, (iii) analysis of transportation networks, (iv) community discovery in Web data. The basic approach followed by existing methods is to apply mining techniques on graph data to discover important patterns, such as subgraphs that are likely to be useful. However, in some cases the number of mined patterns is large, posing difficulties in selecting the most important ones. For example, applying frequent subgraph mining on a set of graphs the system returns all connected subgraphs whose frequency is above a specified (usually user-defined) threshold. The number of discovered patterns may be large, and this number depends on the data characteristics and the frequency threshold specified. It would be more convenient for the user if “goodness” criteria could be set to evaluate the usefulness of these patterns, and if the user could provide preferences to the system regarding the characteristics of the discovered patterns. In this paper, we propose a methodology to support such preferences by applying subgraph discovery in relational graphs towards retrieving important connected subgraphs. The importance of a subgraph is determined by: (i) the order of the subgraph (the number of vertices) and (ii) the subgraph edge connectivity. The performance of the proposed technique is evaluated by using real-life as well as synthetically generated data sets.  相似文献   

9.
A graph-based approach to document classification is described in this paper. The graph representation offers the advantage that it allows for a much more expressive document encoding than the more standard bag of words/phrases approach, and consequently gives an improved classification accuracy. Document sets are represented as graph sets to which a weighted graph mining algorithm is applied to extract frequent subgraphs, which are then further processed to produce feature vectors (one per document) for classification. Weighted subgraph mining is used to ensure classification effectiveness and computational efficiency; only the most significant subgraphs are extracted. The approach is validated and evaluated using several popular classification algorithms together with a real world textual data set. The results demonstrate that the approach can outperform existing text classification algorithms on some dataset. When the size of dataset increased, further processing on extracted frequent features is essential.  相似文献   

10.
一种基于Apriori思想的频繁子图发现算法   总被引:1,自引:0,他引:1       下载免费PDF全文
如今,关联规则技术应用在许多非传统领域,许多已有的频繁项集搜索方法已经不适用了。一种解决的方法就是用图的形式表示这些领域的事务,然后利用基于图论的数据挖掘技术发现频繁子图。本文提出了一种基于Aproiri思想的频繁子图发现算法SLAGM,它可以有效地挖掘简单图中的频繁子图。实验证明,该算法在性能上优于另一种子图挖掘算法AGM。  相似文献   

11.
Periodic subgraph mining in dynamic networks   总被引:2,自引:1,他引:1  
In systems of interacting entities such as social networks, interactions that occur regularly typically correspond to significant, yet often infrequent and hard to detect, interaction patterns. To identify such regular behavior in streams of dynamic interaction data, we propose a new mining problem of finding a minimal set of periodically recurring subgraphs to capture all periodic behavior in a dynamic network. We analyze the computational complexity of the problem and show that it is polynomial, unlike many related subgraph or itemset mining problems. We propose an efficient and scalable algorithm to mine all periodic subgraphs in a dynamic network. The algorithm makes a single pass over the data and is also capable of accommodating imperfect periodicity. We demonstrate the applicability of our approach on several real-world networks and extract interesting and insightful periodic interaction patterns. We also show that periodic subgraphs can be an effective way to uncover and characterize the natural periodicities in a system.  相似文献   

12.
图挖掘是数据挖掘的一个重要研究方向,而图挖掘主要集中在图数据集内频繁子图的挖掘。频繁子图挖掘技术的关键是建立有效机制减少冗余候选子图,以便高效计算和处理所需的频繁子图。提出了一种基于路径的频繁子图挖掘算法,该算法首先找出所有频繁边从而挖掘出频繁单路径,然后通过组合、双射和操作扩展出较多的频繁路径,再通过连接操作产生所有频繁子图候选集。通过定理证明了该算法的正确性和完整性,从理论上分析了该算法时间复杂度低于现有的算法,最后进行了2个图数据集实验,在候选集产生的数量和时间性能2方面验证了算法的优越性。  相似文献   

13.
Frequent subgraph mining from a tremendous amount of small graphs is a primitive operation for many data mining applications. Existing approaches mainly focus on centralized systems and suffer from the scalability issue. Consider the increasing volume of graph data and mining frequent subgraphs is a memory-intensive task, it is difficult to tackle this problem on a centralized machine efficiently. In this paper, we therefore propose an efficient and scalable solution, called MRFSE, using MapReduce. MRFSE adopts the breadth-first search strategy to iteratively extract frequent subgraphs, i.e., all frequent subgraphs with \(i+1\) edges are generated based on frequent subgraphs with i edges at the ith iteration. In our design, existing frequent subgraph mining techniques in centralized systems can be easily extended and integrated. More importantly, new frequent subgraphs are generated without performing any isomorphism test which is costly and imperative in existing frequent subgraph mining techniques. Besides, various optimization techniques are proposed to further reduce the communication and I/O cost. Extensive experiments conducted on our in-house clusters demonstrate the superiority of our proposed solution in terms of both scalability and efficiency.  相似文献   

14.
15.
gMLC: a multi-label feature selection framework for graph classification   总被引:1,自引:1,他引:0  
Graph classification has been showing critical importance in a wide variety of applications, e.g. drug activity predictions and toxicology analysis. Current research on graph classification focuses on single-label settings. However, in many applications, each graph data can be assigned with a set of multiple labels simultaneously. Extracting good features using multiple labels of the graphs becomes an important step before graph classification. In this paper, we study the problem of multi-label feature selection for graph classification and propose a novel solution, called gMLC, to efficiently search for optimal subgraph features for graph objects with multiple labels. Different from existing feature selection methods in vector spaces that assume the feature set is given, we perform multi-label feature selection for graph data in a progressive way together with the subgraph feature mining process. We derive an evaluation criterion to estimate the dependence between subgraph features and multiple labels of graphs. Then, a branch-and-bound algorithm is proposed to efficiently search for optimal subgraph features by judiciously pruning the subgraph search space using multiple labels. Empirical studies demonstrate that our feature selection approach can effectively boost multi-label graph classification performances and is more efficient by pruning the subgraph search space using multiple labels.  相似文献   

16.
17.
网状数据结构通常获取的网络数据不完整,存在缺失节点.对此,文中提出基于图卷积神经网络的网络节点补全算法.首先对可观测网络进行成对采样,构造目标节点对的封闭子图和特征矩阵.然后利用图卷积神经网络提取子图及特征矩阵的表征向量,用于推断子图中的目标节点对之间是否存在缺失节点,同时判断不同目标节点对间的缺失节点是否为同一节点.最后,在真实网络数据集及人工生成的网络数据集上的实验表明,文中算法可较好解决网络补全问题,在缺失节点比例较大时仍能有效补全网络.  相似文献   

18.
An efficient algorithm for discovering frequent subgraphs   总被引:8,自引:0,他引:8  
Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are being increasingly applied to nontraditional domains, existing frequent pattern discovery approaches cannot be used. This is because the transaction framework that is assumed by these algorithms cannot be used to effectively model the data sets in these domains. An alternate way of modeling the objects in these data sets is to represent them using graphs. Within that model, one way of formulating the frequent pattern discovery problem is that of discovering subgraphs that occur frequently over the entire set of graphs. We present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph data sets. We experimentally evaluate the performance of FSG using a variety of real and synthetic data sets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in data sets containing more than 200,000 graph transactions and scales linearly with respect to the size of the data set.  相似文献   

19.
Graph-based modeling has emerged as a powerful abstraction capable of capturing in a single and unified framework many of the relational, spatial, topological, and other characteristics that are present in a variety of datasets and application areas. Computationally efficient algorithms that find patterns corresponding to frequently occurring subgraphs play an important role in developing data mining-driven methodologies for analyzing the graphs resulting from such datasets. This paper presents two algorithms, based on the horizontal and vertical pattern discovery paradigms, that find the connected subgraphs that have a sufficient number of edge-disjoint embeddings in a single large undirected labeled sparse graph. These algorithms use three different methods for determining the number of edge-disjoint embeddings of a subgraph and employ novel algorithms for candidate generation and frequency counting, which allow them to operate on datasets with different characteristics and to quickly prune unpromising subgraphs. Experimental evaluation on real datasets from various domains show that both algorithms achieve good performance, scale well to sparse input graphs with more than 120,000 vertices or 110,000 edges, and significantly outperform previously developed algorithms.  相似文献   

20.
近年来图神经网络(GNN)发展迅速,相关模型在知识图谱链接预测任务上的性能显著提升。为解释性能提升的原因,研究人员需要提取GNN学习到的子图模式。然而现有GNN解释器在知识图谱这类典型多关系(multi-relation)图数据场景下的解释准确性尚未被验证,且相关工具尚未实现,导致解释子图提取困难。针对该问题,提出一种将多关系的知识图谱转换为单关系(uni-relational)图的知识图谱链接预测模型,该模型通过将知识图谱中的实体组合为新的节点,并将关系作为新节点的特征,生成只有单一关系的新图,并在新图上训练去噪自编码器使其获得链接预测能力,最后使用GNN解释器生成子图解释。在三个基准数据集上的实验表明,与不进行转换的GraIL相比,所提基于单关系转换的链接预测模型的相对AUC指标提升显著。最后,该模型选取FB15K-237数据集进行解释子图提取实验,验证了模型在直接提取链接预测解释方面的有效性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号