Similar Documents
20 similar documents found (search time: 31 ms)
1.
王桂娟  印鉴  詹卫许 《计算机科学》2011,38(8):169-170,175
Selecting frequent feature subgraphs plays a very important role in frequent-subgraph-based graph data classification. This paper proposes a feature subgraph selection strategy based on class information: unique frequent subgraphs and significant frequent subgraphs are selected from the candidate frequent subgraphs as feature subgraphs. Experimental results show that, when classifying chemical compound data, this selection strategy outperforms the feature selection strategies of the SVM method and of the CEP method in classification performance.
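A minimal sketch of this kind of class-informed selection (the threshold value and the exact definitions of "unique" and "significant" are assumptions, not the paper's):

```python
def select_feature_subgraphs(freq_pos, freq_neg, min_gap=0.3):
    """Pick 'unique' subgraphs (frequent in only one class) and
    'significant' subgraphs (large support gap between classes).
    freq_pos/freq_neg map subgraph id -> support in that class."""
    unique = ({g for g in freq_pos if g not in freq_neg}
              | {g for g in freq_neg if g not in freq_pos})
    shared = set(freq_pos) & set(freq_neg)
    significant = {g for g in shared
                   if abs(freq_pos[g] - freq_neg[g]) >= min_gap}
    return unique | significant

# hypothetical supports of candidate frequent subgraphs per class
pos = {"g1": 0.8, "g2": 0.5, "g3": 0.45}
neg = {"g2": 0.1, "g3": 0.4}
print(sorted(select_feature_subgraphs(pos, neg)))  # ['g1', 'g2']
```

Here `g1` is unique to the positive class and `g2` has a large frequency gap, while `g3` is frequent in both classes with nearly equal support and is dropped.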

2.
gMLC: a multi-label feature selection framework for graph classification
Graph classification has been shown to be critically important in a wide variety of applications, e.g. drug activity prediction and toxicology analysis. Current research on graph classification focuses on single-label settings. However, in many applications, each graph can be assigned a set of multiple labels simultaneously. Extracting good features using the multiple labels of the graphs becomes an important step before graph classification. In this paper, we study the problem of multi-label feature selection for graph classification and propose a novel solution, called gMLC, to efficiently search for optimal subgraph features for graph objects with multiple labels. Unlike existing feature selection methods in vector spaces, which assume the feature set is given, we perform multi-label feature selection for graph data progressively, together with the subgraph feature mining process. We derive an evaluation criterion to estimate the dependence between subgraph features and the multiple labels of graphs. Then, a branch-and-bound algorithm is proposed to efficiently search for optimal subgraph features by judiciously pruning the subgraph search space using the multiple labels. Empirical studies demonstrate that our feature selection approach can effectively boost multi-label graph classification performance and is more efficient thanks to pruning the subgraph search space using multiple labels.

3.
The standard approach to feature construction and predictive learning in molecular datasets is to employ computationally expensive graph mining techniques and to bias the feature search exploration using frequency or correlation measures. These features are then typically employed in predictive models that can be constructed using, for example, SVMs or decision trees. We take a different approach: rather than mining for all optimal local patterns, we extract features from the set of pairwise maximum common subgraphs. The maximum common subgraphs are computed under the block-and-bridge-preserving subgraph isomorphism from the outerplanar examples in polynomial time. We empirically observe a significant increase in predictive performance when using maximum common subgraph features instead of correlated local patterns on 60 benchmark datasets from NCI. Moreover, we show that when we randomly sample the pairs of graphs from which to extract the maximum common subgraphs, we obtain a smaller set of features that still allows the same predictive performance as methods that exhaustively enumerate all possible patterns. The sampling strategy turns out to be a very good compromise between a slight decrease in predictive performance (though still comparable with state-of-the-art methods) and a significant runtime reduction (two orders of magnitude on a popular medium-size chemoinformatics dataset). This suggests that maximum common subgraphs are interesting and meaningful features.

4.
Network data collected in practice is often incomplete and contains missing nodes. To address this, this paper proposes a network node completion algorithm based on graph convolutional neural networks. First, the observable network is sampled pairwise to construct enclosing subgraphs and feature matrices for target node pairs. Then, a graph convolutional neural network extracts representation vectors of the subgraphs and feature matrices, which are used to infer whether a missing node exists between a target node pair, and to judge whether the missing nodes inferred for different target node pairs are the same node. Finally, experiments on real network datasets and synthetically generated network datasets show that the proposed algorithm solves the network completion problem well and can still effectively complete the network when the proportion of missing nodes is large.

5.
Feature selection is one of the most important machine learning procedures and has been successfully applied as a preprocessing step before classification and clustering. High-dimensional features often appear in big data, and their characteristics hinder data processing, so spectral feature selection algorithms have received increasing attention from researchers. However, most feature selection methods treat the task as two separate steps: they first learn a similarity matrix from the original feature space (which may include redundancy across all features), and then conduct data clustering. Due to these limitations, they do not achieve good performance on classification and clustering tasks in big-data processing applications. To address this problem, we propose an unsupervised feature selection method within a graph learning framework, which reduces the influence of redundant features while simultaneously imposing a low-rank constraint on the weight matrix. More importantly, we design a new objective function to handle this problem. We evaluate our approach on six benchmark datasets, and all empirical classification results show that our new approach outperforms state-of-the-art feature selection approaches.

6.
With the rapid development and wide application of big data technology, unauthorized user access has become one of the main obstacles to the secure sharing and controlled access of big data resources. The relation-based access control (ReBAC) model formulates access control rules from the relations between entities, strengthening the logical expressiveness of policies and enabling dynamic access control, but it still faces problems such as missing entity-relation data and complex relation paths in rules. To overcome these problems, this paper proposes LPMDLG, an edge prediction model based on dual-source GNN learning, which transforms the big-data entity-relation prediction problem into an edge prediction problem on directed multigraphs. A topology learning method based on directed enclosing subgraphs and a directed double-radius node labeling algorithm are proposed: through directed enclosing subgraph extraction, subgraph node label computation, and topological feature learning, the topological features of nodes and subgraphs are learned from the entity-relation graph. A node embedding learning method based on directed neighbor subgraphs is proposed, incorporating elements such as attention coefficients and relation types; node embeddings are learned through directed neighbor subgraph extraction and embedding feature learning. A dual-source fusion scoring network is designed, which jointly scores edges from the topological and node embedding features to obtain edge predictions for the entity-relation graph. Edge prediction experiments show that, compared with baseline models such as R-GCN, SEAL, GraIL, and TACT, the proposed model achieves better results on AUC-PR, MRR, and Hits@... metrics.
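The double-radius labeling idea, which tags each node of an enclosing subgraph with its pair of distances to the two target nodes, can be sketched as follows (distances follow edge direction; the paper's exact directed labeling scheme and the adjacency-list representation `adj` are assumptions):

```python
from collections import deque

def bfs_dist(adj, src):
    """Shortest-path distances from src, following directed edges."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def double_radius_labels(adj, x, y):
    """Label every node by its (distance-to-x, distance-to-y) pair;
    unreachable nodes get None in that slot."""
    dx, dy = bfs_dist(adj, x), bfs_dist(adj, y)
    nodes = set(adj) | {w for vs in adj.values() for w in vs}
    return {v: (dx.get(v), dy.get(v)) for v in nodes}

# toy directed graph: x -> a -> y -> x
adj = {"x": ["a"], "a": ["y"], "y": ["x"]}
print(double_radius_labels(adj, "x", "y"))
```

Nodes with identical distance pairs receive the same structural label, which is what lets a GNN generalize across target node pairs.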

7.
Graph-based data mining approaches have mainly been proposed for the task popularly known as frequent subgraph mining, subject to a single user preference such as frequency or size. In this work, we propose to treat the frequent subgraph mining problem from a multiobjective optimization viewpoint, where a subgraph (or solution) is defined by several user-defined preferences (or objectives), which are conflicting in nature. For example, mined subgraphs with high frequency are often of small size, and vice versa. Using such objectives in the multiobjective subgraph mining process generates Pareto-optimal subgraphs, where no subgraph is better than another subgraph in all objectives. We apply a Pareto dominance approach for the evaluation and search of subgraphs with regard to both proximity and diversity in the multiobjective sense, incorporated into the framework of the Subdue algorithm for subgraph mining. The method, called multiobjective subgraph mining by Subdue (MOSubdue), has several advantages: (i) generation of Pareto-optimal subgraphs in a single run; (ii) selection of subgraph seeds from the candidate subgraphs based on all objectives; (iii) search in the multiobjective subgraph lattice space; and (iv) the capability to deal with different multiobjective frequent subgraph mining tasks by customizing the tackled objectives. The good performance of MOSubdue is shown by performing multiobjective subgraph mining defined by two and three objectives on two real-life datasets.
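The Pareto dominance filter at the heart of such methods can be sketched as follows (maximizing two hypothetical objectives, frequency and size, for mined subgraphs):

```python
def dominates(a, b):
    """True if solution a is no worse than b in every objective and
    strictly better in at least one (all objectives maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(solutions):
    """Keep only the non-dominated (Pareto-optimal) solutions."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# (frequency, size) pairs for hypothetical mined subgraphs
subs = [(0.9, 2), (0.5, 5), (0.4, 4), (0.9, 1)]
print(pareto_front(subs))  # [(0.9, 2), (0.5, 5)]
```

Note that (0.4, 4) is dominated by (0.5, 5) and (0.9, 1) by (0.9, 2), illustrating the frequency-versus-size trade-off the abstract describes.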

8.
This paper presents a novel feature selection approach for backpropagation neural networks (NNs). Previously, a feature selection technique known as the wrapper model was shown to be effective for decision tree induction. However, it is prohibitively expensive when applied to real-world neural net training, which is characterized by large volumes of data and many feature choices. Our approach incorporates a weight-analysis-based heuristic called artificial neural net input gain measurement approximation (ANNIGMA) to direct the search in the wrapper model, making effective feature selection feasible for neural net applications. Experimental results on standard datasets show that this approach can efficiently reduce the number of features while maintaining or even improving accuracy. We also report two successful applications of our approach in helicopter maintenance.
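The ANNIGMA heuristic ranks inputs by an approximate gain derived from the trained weights; a rough sketch for a one-hidden-layer network (the weight values are hypothetical and the exact gain formula and normalization in the paper may differ):

```python
def annigma_gains(w_ih, w_ho):
    """Approximate input gains for a one-hidden-layer network:
    gain_i = sum_j |w_ih[i][j] * w_ho[j]|, rescaled so the strongest
    input scores 100.  w_ih[i][j]: input i -> hidden j;
    w_ho[j]: hidden j -> output."""
    raw = [sum(abs(w_ih[i][j] * w_ho[j]) for j in range(len(w_ho)))
           for i in range(len(w_ih))]
    top = max(raw)
    return [100.0 * g / top for g in raw]

# hypothetical weights: 3 inputs, 2 hidden units, 1 output
w_ih = [[0.9, -0.8], [0.1, 0.05], [0.4, 0.3]]
w_ho = [1.0, -0.5]
print(annigma_gains(w_ih, w_ho))
```

In a wrapper search, low-gain inputs (here the second one) would be the first candidates for removal before retraining.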

9.
Brain network classification has attracted wide attention from researchers in fields such as brain science and brain disease diagnosis. Most existing studies on brain network classification use single brain regions or pairwise correlations between brain regions as classification features, which cannot reflect the topological information among multiple brain regions. To overcome this drawback, this paper proposes a brain network classification method based on subgraph selection and graph kernel dimensionality reduction. Specifically: (1) multiple frequent subgraphs are extracted from the positive and negative training samples separately, and the most discriminative subgraph set is selected using a frequency-difference-based subgraph selection algorithm; (2) based on the selected subgraph set, graph-kernel-based principal component analysis (GK-PCA) is used to extract features from the graph data; (3) a support vector machine (SVM) performs classification on the extracted features. The method is validated on a real mild cognitive impairment (MCI) brain network dataset, and the experimental results demonstrate its effectiveness.

10.
Frequent subgraph mining from a tremendous number of small graphs is a primitive operation for many data mining applications. Existing approaches mainly focus on centralized systems and suffer from scalability issues. Considering the increasing volume of graph data and the fact that mining frequent subgraphs is a memory-intensive task, it is difficult to tackle this problem efficiently on a centralized machine. In this paper, we therefore propose an efficient and scalable solution, called MRFSE, using MapReduce. MRFSE adopts the breadth-first search strategy to iteratively extract frequent subgraphs, i.e., all frequent subgraphs with i+1 edges are generated based on the frequent subgraphs with i edges at the ith iteration. In our design, existing frequent subgraph mining techniques for centralized systems can easily be extended and integrated. More importantly, new frequent subgraphs are generated without performing any isomorphism tests, which are costly and imperative in existing frequent subgraph mining techniques. Besides, various optimization techniques are proposed to further reduce the communication and I/O cost. Extensive experiments conducted on our in-house clusters demonstrate the superiority of our proposed solution in terms of both scalability and efficiency.

11.
Today, feature selection is an active research area in machine learning. The main idea of feature selection is to choose a subset of the available features by eliminating features with little or no predictive information, as well as redundant features that are strongly correlated. There are many approaches to feature selection, but most of them can only work with crisp data, and until now there have not been many approaches that can directly work with both crisp and low-quality (imprecise and uncertain) data. That is why we propose a new feature selection method which can handle both crisp and low-quality data. The proposed approach is based on a Fuzzy Random Forest and integrates filter and wrapper methods into a sequential search procedure with improved classification accuracy of the selected features. This approach consists of the following main steps: (1) scaling and discretization of the feature set, and feature pre-selection using the discretization process (filter); (2) ranking of the pre-selected features using the Fuzzy Decision Trees of a Fuzzy Random Forest ensemble; and (3) wrapper feature selection using a Fuzzy Random Forest ensemble based on cross-validation. The efficiency and effectiveness of this approach are demonstrated through several experiments using both high-dimensional and low-quality datasets. The approach shows good performance (not only in classification accuracy, but also with respect to the number of features selected) and good behavior both with high-dimensional datasets (microarray datasets) and with low-quality datasets.

12.
Rough set theory is one of the effective methods for feature selection, as it can preserve the meaning of the features. The essence of the rough set approach to feature selection is to find a subset of the original features. Since finding a minimal subset of the features is an NP-hard problem, it is necessary to investigate effective and efficient heuristic algorithms. Ant colony optimization (ACO) has been successfully applied to many difficult combinatorial problems such as quadratic assignment, traveling salesman, and scheduling. It is particularly attractive for feature selection since there is no heuristic information that can guide the search to the optimal minimal subset every time; however, ants can discover the best feature combinations as they traverse the graph. In this paper, we propose a new rough set approach to feature selection based on ACO, which adopts mutual-information-based feature significance as heuristic information, and we give a novel feature selection algorithm. Jensen and Shen proposed an ACO-based feature selection approach which starts from a random feature; our approach starts from the feature core, which reduces the complete graph to a smaller one. To verify the efficiency of our algorithm, experiments are carried out on some standard UCI datasets. The results demonstrate that our algorithm can provide an efficient solution for finding a minimal subset of the features.

13.
A significant number of applications require effective and efficient manipulation of relational graphs, towards discovering important patterns. Some example applications are: (i) analysis of microarray data in bioinformatics, (ii) pattern discovery in a large graph representing a social network, (iii) analysis of transportation networks, and (iv) community discovery in Web data. The basic approach followed by existing methods is to apply mining techniques on graph data to discover important patterns, such as subgraphs that are likely to be useful. However, in some cases the number of mined patterns is large, posing difficulties in selecting the most important ones. For example, when applying frequent subgraph mining on a set of graphs, the system returns all connected subgraphs whose frequency is above a specified (usually user-defined) threshold. The number of discovered patterns may be large, and this number depends on the data characteristics and the frequency threshold specified. It would be more convenient for the user if "goodness" criteria could be set to evaluate the usefulness of these patterns, and if the user could provide preferences to the system regarding the characteristics of the discovered patterns. In this paper, we propose a methodology to support such preferences by applying subgraph discovery in relational graphs towards retrieving important connected subgraphs. The importance of a subgraph is determined by: (i) the order of the subgraph (the number of vertices) and (ii) the subgraph edge connectivity. The performance of the proposed technique is evaluated using real-life as well as synthetically generated datasets.

14.
By revealing potential relationships between users, link prediction has long been considered a fundamental research issue in signed social networks. The key to link prediction is measuring the similarity between users. Existing works use connections between target users or their common neighbors to measure user similarity. Rich information available for link prediction is thus missed, since user similarity is widely influenced by many users via social connections. We therefore propose a novel graph-kernel-based link prediction method, which predicts links by comparing user similarity via the signed social network's structural information: we first generate a set of subgraphs with different strengths of social relations for each user, then calculate the graph kernel similarities between subgraphs, in which the Bhattacharyya kernel is used to measure the similarity of the k-dimensional Gaussian distributions related to each k-order Krylov subspace generated for each subgraph, and finally train an SVM classifier with the user similarity information to predict links. Experiments on real application datasets show that our proposed method performs well on both positive and negative link prediction, with significantly higher link prediction accuracy and F1-score than existing works.
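For intuition, the Bhattacharyya kernel between two Gaussians has a closed form; the sketch below handles only the one-dimensional case (the paper works with k-dimensional Gaussians derived from Krylov subspaces, so this scalar version is a simplification):

```python
import math

def bhattacharyya_gauss(mu1, s1, mu2, s2):
    """Bhattacharyya kernel between N(mu1, s1^2) and N(mu2, s2^2),
    computed as exp(-D_B) from the Bhattacharyya distance D_B.
    Equals 1 exactly when the two distributions coincide."""
    var_sum = s1 ** 2 + s2 ** 2
    d_b = ((mu1 - mu2) ** 2 / (4 * var_sum)
           + 0.5 * math.log(var_sum / (2 * s1 * s2)))
    return math.exp(-d_b)

print(bhattacharyya_gauss(0.0, 1.0, 0.0, 1.0))            # 1.0
print(round(bhattacharyya_gauss(0.0, 1.0, 2.0, 1.0), 4))  # 0.6065
```

The kernel value shrinks smoothly toward zero as the two distributions (and hence the two users' subgraph structures they summarize) drift apart.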

15.
Real-world networks, such as social networks, cryptocurrency networks, and e-commerce networks, always record the occurrence time of interactions between nodes. Such networks are typically modeled as temporal graphs. Mining cohesive subgraphs from temporal graphs is practical and essential in numerous data mining applications, since it gives insight into the time-varying nature of temporal graphs. However, existing studies on mining cohesive subgraphs, such as Densest-Exact and k-truss, are mainly tailored to static graphs (whose edges carry no temporal information). Therefore, those cohesive subgraph models cannot capture both the temporal and the structural characteristics of subgraphs. To this end, we explore a model of cohesive temporal subgraphs that incorporates both the evolving and the structural characteristics of temporal subgraphs. Unfortunately, the number of time intervals in a temporal network is quadratic, so the time complexity of mining temporal cohesive subgraphs is high. To address the problem efficiently, we first mine the temporal density distribution of the temporal graph. Guided by this distribution, we can safely prune many unqualified time intervals at linear time cost. Then, the remaining time intervals, in which cohesive temporal subgraphs fall, are examined using greedy search. The results of experiments on nine real-world temporal graphs indicate that our model outperforms state-of-the-art solutions in both efficiency and quality. Specifically, our model takes less than two minutes on a million-vertex DBLP graph and has the highest overall average ranking on the EDB and TC metrics.

16.
The existing methods for graph-based data mining (GBDM) follow the basic approach of applying a single-objective search with a user-defined threshold to discover interesting subgraphs. This obliges the user to deal with simple thresholds and prevents her/him from evaluating the mined subgraphs by defining different "goodness" (i.e., multiobjective) criteria regarding the characteristics of the subgraphs. In previous papers, we defined a multiobjective GBDM framework to perform bi-objective graph mining in terms of subgraph support and size maximization. Two different search methods were considered with this aim: a multiobjective beam search and multiobjective evolutionary programming (MOEP). In this contribution, we extend the latter formulation to a three-objective framework by incorporating another classical graph mining objective, the subgraph diameter. The proposed MOEP method for multiobjective GBDM is tested on five synthetic and real-world datasets, and its performance is compared against single- and multiobjective subgraph mining approaches based on the classical Subdue technique in GBDM. The results highlight that applying multiobjective subgraph mining allows us to discover more diversified subgraphs in the objective space.

17.
Feature selection is an important step in large-scale image data analysis, which has proved to be difficult due to the large size in both dimensions and samples. Feature selection first eliminates redundant and irrelevant features and then chooses a subset of features that performs as efficiently as the complete set. Generally, supervised feature selection yields better performance than unsupervised feature selection because of its utilization of label information. However, labeled data samples are always expensive to obtain, which constrains the performance of supervised feature selection, especially for large web image datasets. In this paper, we propose a semi-supervised feature selection algorithm based on a hierarchical regression model. Our contributions can be highlighted as follows: (1) our algorithm utilizes a statistical approach to exploit both labeled and unlabeled data, which preserves the manifold structure of each feature type; (2) the predicted label matrix of the training data and the feature selection matrix are learned simultaneously, so that the two aspects mutually benefit each other. Extensive experiments are performed on three large-scale image datasets. The experimental results demonstrate the better performance of our algorithm compared with state-of-the-art algorithms.

18.
Feature selection is used to choose a subset of relevant features for effective classification of data. In high-dimensional data classification, the performance of a classifier often depends on the feature subset used for classification. In this paper, we introduce a greedy feature selection method using mutual information. This method combines both feature–feature mutual information and feature–class mutual information to find an optimal subset of features that minimizes redundancy and maximizes relevance among features. The effectiveness of the selected feature subset is evaluated using multiple classifiers on multiple datasets. The performance of our method, in terms of both classification accuracy and execution time, has been found to be significantly better on twelve real-life datasets of varied dimensionality and number of instances when compared with several competing feature selection techniques.
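A minimal sketch of such a greedy criterion, scoring each candidate by feature–class relevance minus average feature–feature redundancy (in the spirit of mRMR; the paper's exact scoring may differ, and the toy features below are hypothetical):

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """Discrete mutual information I(X; Y) in nats."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def greedy_select(features, labels, k):
    """Greedily pick k features maximizing I(f; class) minus the
    average mutual information with already-selected features."""
    chosen = []
    while len(chosen) < k:
        best = max(
            (f for f in features if f not in chosen),
            key=lambda f: mutual_info(features[f], labels)
            - (sum(mutual_info(features[f], features[g]) for g in chosen)
               / len(chosen) if chosen else 0.0))
        chosen.append(best)
    return chosen

labels = [0, 0, 1, 1, 0, 1]
features = {"f1": [0, 0, 1, 1, 0, 0],   # strongly relevant
            "f2": [0, 0, 1, 1, 0, 0],   # exact copy of f1 (redundant)
            "f3": [0, 1, 0, 1, 0, 1]}   # independent of f1, mildly relevant
print(greedy_select(features, labels, 2))  # ['f1', 'f3']
```

The redundancy penalty is what pushes the second pick past the copy `f2` toward the complementary `f3`, even though `f2` has higher relevance on its own.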

19.
Graph anomaly detection abstracts communication relations between entities into a complex network representation and aims to use structural features to identify anomalous behaviors and entities in the network; its advantages are that the relations objectively exist and the detected anomalies are highly interpretable. Current methods of this kind mainly extract features from undirected network structures and focus on edge-level anomalous structures; research on identifying anomalous subgraphs formed by collective anomalous behavior is still limited, and analysis of anomalous directional collaboration in behavior is lacking. Traditional methods construct a feature space from node neighborhood structures and detect outliers by the distances between the points to which those neighborhoods are mapped. Although they can find anomalous subgraphs with clearly different structures, they ignore the actual physical connections between nodes and the fact that relations between individuals are non-reciprocal because the subjects and objects of behaviors differ. To address this problem, this paper proposes an anomalous subgraph identification algorithm based on non-reciprocal relations in directed networks: directional behavior features between nodes are extracted from edge directions, the strength of non-reciprocity between nodes is measured and then expressed in subgraph density form, and anomalies are mined with a density-based anomaly detection method, preserving the actual physical connections. Experiments on four synthetic datasets with different anomaly types and on real datasets containing actual anomalies verify the method's high detection accuracy and robustness.

20.
Data mining techniques are widely used in many fields. One application of data mining in the field of bioinformatics is the classification of tissue samples. In the present work, a wavelet-power-spectrum-based approach is presented for feature selection and successful classification of multi-class datasets. The proposed method was applied to the SRBCT and breast cancer datasets, which are multi-class cancer datasets. The selected features largely coincide with those selected in previous works. The method was able to produce almost 100% accurate classification results; it is very simple, robust to noise, and requires no extensive preprocessing. Classification was performed with considerably fewer features than those used in the original works. No information is lost through the initial pruning of the data usually performed with a threshold in other methods. The method exploits the inherent nature of the data in performing its various tasks, so it can be used for a wide range of data.
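As an illustration, a per-level power spectrum under the Haar wavelet can be computed as below (the specific wavelet, normalization, and scoring used in the paper are not given here, so this is only a sketch on a hypothetical expression profile):

```python
def haar_power_spectrum(signal):
    """Full Haar decomposition of a length-2^n signal; returns the
    power (sum of squared detail coefficients) at each level,
    from finest to coarsest."""
    powers = []
    s = list(signal)
    while len(s) > 1:
        approx = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        detail = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        powers.append(sum(d * d for d in detail))
        s = approx
    return powers

# hypothetical expression profile (length must be a power of two)
print(haar_power_spectrum([1.0, 1.0, 4.0, 4.0]))  # [0.0, 2.25]
```

Here all the power sits at the coarse level, reflecting a smooth class-wide shift rather than sample-to-sample noise; features could then be ranked by how their power concentrates across levels.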
