Found 20 similar documents (search time: 15 ms)
Feature subset selection (FSS) is a key problem in the data-mining classification task that helps to obtain more compact and understandable models without degrading (or even improving) their performance. In this work we focus on FSS in high-dimensional datasets, that is, datasets with a very large number of predictive attributes. In this case, standard sophisticated wrapper algorithms cannot be applied because of their complexity, and computationally lighter filter-wrapper algorithms have recently been proposed. We propose a stochastic algorithm based on the GRASP meta-heuristic, with the main goal of speeding up the feature subset selection process, basically by reducing the number of wrapper evaluations to carry out. GRASP is a multi-start constructive method which constructs a solution in its first stage, and then runs an improving stage over that solution. Several instances of the proposed GRASP method are experimentally tested and compared with state-of-the-art algorithms over 12 high-dimensional datasets. The statistical analysis of the results shows that our proposal is comparable in accuracy and cardinality of the selected subset to previous algorithms, but requires significantly fewer evaluations.
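The two-stage structure described above (a randomized greedy construction followed by an improvement stage, repeated over several starts) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the names `grasp_fss`, `evaluate` and `rcl_size` are hypothetical, `evaluate` stands in for one wrapper evaluation, and the sketch makes no attempt to reproduce the evaluation-saving design the abstract reports.

```python
import random


def grasp_fss(n_features, evaluate, iterations=10, rcl_size=3, seed=0):
    """GRASP-style feature subset search (illustrative sketch only).

    evaluate(subset) returns a score where higher is better; each call
    stands in for one wrapper evaluation such as cross-validated accuracy.
    """
    rng = random.Random(seed)
    best, best_score = frozenset(), evaluate(frozenset())
    for _ in range(iterations):
        # Construction stage: greedy randomized additions.
        current, score = frozenset(), evaluate(frozenset())
        while True:
            candidates = [(evaluate(current | {f}), f)
                          for f in range(n_features) if f not in current]
            improving = sorted((c for c in candidates if c[0] > score),
                               reverse=True)
            if not improving:
                break
            # Restricted candidate list: choose at random among the top few.
            score, f = rng.choice(improving[:rcl_size])
            current = current | {f}
        # Improvement stage: drop any feature whose removal does not hurt.
        for f in sorted(current):
            trial = current - {f}
            trial_score = evaluate(trial)
            if trial_score >= score:
                current, score = trial, trial_score
        # Keep the best solution over all starts (smaller subset on ties).
        if score > best_score or (score == best_score and len(current) < len(best)):
            best, best_score = current, score
    return best, best_score


# Toy stand-in for a wrapper: features 1 and 4 help, every other one costs 0.1.
RELEVANT = {1, 4}


def toy_evaluate(subset):
    return len(subset & RELEVANT) - 0.1 * len(subset - RELEVANT)


subset, score = grasp_fss(8, toy_evaluate)
```

In a real setting `evaluate` would train and validate a classifier on the candidate subset, so the number of calls to it is the cost that the abstract aims to reduce.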

This paper deals with the problem of supervised wrapper-based feature subset selection in datasets with a very large number of attributes. Recently the literature has contained numerous references to the use of hybrid selection algorithms: based on a filter ranking, they perform an incremental wrapper selection over that ranking. Although these methods work well, they still have two problems: (1) depending on the complexity of the wrapper search method, the number of wrapper evaluations can still be too large; and (2) they rely on a univariate ranking that does not take into account interaction between the variables already included in the selected subset and the remaining ones. Here we propose a new approach whose main goal is to drastically reduce the number of wrapper evaluations while maintaining good performance (e.g. accuracy and size of the obtained subset). To do this we propose an algorithm that iteratively alternates between filter ranking construction and wrapper feature subset selection (FSS). Thus, the FSS only uses the first block of ranked attributes and the ranking method uses the current selected subset in order to build a new ranking where this knowledge is considered. The algorithm terminates when no new attribute is selected in the last call to the FSS algorithm. The main advantage of this approach is that only a few blocks of variables are analyzed, and so the number of wrapper evaluations decreases drastically. The proposed method is tested over eleven high-dimensional datasets (2,400 to 46,000 variables) using different classifiers. The results show an impressive reduction in the number of wrapper evaluations without degrading the quality of the obtained subset.
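The alternation described in this abstract (re-rank the remaining attributes given the current subset, run the wrapper over the first block only, stop when a pass adds nothing) can be sketched as below. The function name `rerank_fss`, the callables `rank` and `evaluate`, and the parameter `block_size` are hypothetical; the abstract does not specify the ranking measure or the wrapper search, so both are passed in as assumptions.

```python
def rerank_fss(features, rank, evaluate, block_size=5):
    """Alternate filter re-ranking and incremental wrapper selection.

    rank(remaining, selected) returns `remaining` ordered best first, and may
    use `selected` to account for interactions with the current subset.
    evaluate(subset) returns a score (higher is better); one wrapper call.
    """
    selected = []
    best = evaluate(selected)
    while True:
        remaining = [f for f in features if f not in selected]
        # Only the first block of the fresh ranking is passed to the wrapper.
        block = rank(remaining, selected)[:block_size]
        added = False
        for f in block:
            score = evaluate(selected + [f])
            if score > best:
                selected.append(f)
                best = score
                added = True
        # Terminate when the last wrapper pass selected no new attribute.
        if not added:
            return selected, best


# Toy stand-ins: features 2 and 7 are useful, every other one costs 0.1.
USEFUL = {2, 7}


def toy_rank(remaining, selected):
    # A real ranking would be conditioned on `selected`; this one is static.
    return sorted(remaining, key=lambda f: f not in USEFUL)


def toy_evaluate(subset):
    s = set(subset)
    return len(s & USEFUL) - 0.1 * len(s - USEFUL)


chosen, score = rerank_fss(list(range(10)), toy_rank, toy_evaluate, block_size=3)
```

Because each pass evaluates at most `block_size` candidates, the number of wrapper calls grows with the number of passes rather than with the total number of attributes.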

In this research, we propose a novel method to find the relevant feature subset by using ant colony optimisation with minimum-redundancy-maximum-relevance. The proposed approach considers the significance of each feature while reducing the dimensionality. The performance of the proposed algorithm has been compared with existing biologically inspired feature subset selection algorithms. Eight datasets have been selected from the UCI Machine Learning Repository for experimentation. The experimental results indicate that the presented algorithm outperforms the other algorithms in terms of classification accuracy and feature reduction.

Past work on object detection has emphasized the issues of feature extraction and classification; however, relatively less attention has been given to the critical issue of feature selection. The main trend in feature extraction has been representing the data in a lower dimensional space, for example, using principal component analysis (PCA). Without using an effective scheme to select an appropriate set of features in this space, however, these methods rely mostly on powerful classification algorithms to deal with redundant and irrelevant features. In this paper, we argue that feature selection is an important problem in object detection and demonstrate that genetic algorithms (GAs) provide a simple, general, and powerful framework for selecting good subsets of features, leading to improved detection rates. As a case study, we have considered PCA for feature extraction and support vector machines (SVMs) for classification. The goal is to search the PCA space using GAs to select a subset of eigenvectors encoding important information about the target concept of interest. This is in contrast to traditional methods, which select some percentage of the top eigenvectors to represent the target concept, independently of the classification task. We have tested the proposed framework on two challenging applications: vehicle detection and face detection. Our experimental results illustrate significant performance improvements in both cases.

With the proliferation of extremely high-dimensional data, feature selection algorithms have become indispensable components of the learning process. Strangely, despite extensive work on the stability of learning algorithms, the stability of feature selection algorithms has been relatively neglected. This study is an attempt to fill that gap by quantifying the sensitivity of feature selection algorithms to variations in the training set. We assess the stability of feature selection algorithms based on the stability of the feature preferences that they express in the form of weights (scores), ranks, or a selected feature subset. We examine a number of measures to quantify the stability of feature preferences and propose an empirical way to estimate them. We perform a series of experiments with several feature selection algorithms on a set of proteomics datasets. The experiments allow us to explore the merits of each stability measure and create stability profiles of the feature selection algorithms. Finally, we show how stability profiles can support the choice of a feature selection algorithm.
Alexandros Kalousis received the B.Sc. degree in computer science, in 1994, and the M.Sc. degree in advanced information systems, in 1997, both from the University of Athens, Greece. He received the Ph.D. degree in meta-learning for classification algorithm selection from the University of Geneva, Department of Computer Science, Geneva, in 2002. Since then he has been a Senior Researcher at the same university. His research interests include relational learning with kernels and distances, stability of feature selection algorithms, and feature extraction from spectral data. Julien Prados is a Ph.D. student at the University of Geneva, Switzerland. In 1999 and 2001, he received the B.Sc. and M.Sc. degrees in computer science from the University Joseph Fourier (Grenoble, France). After a year of work in industry, he joined the Geneva Artificial Intelligence Laboratory, where he is working on bioinformatics and data-mining tools for mass spectrometry data analysis. Melanie Hilario has a Ph.D. in computer science from the University of Paris VI and currently works at the University of Geneva's Artificial Intelligence Laboratory. She has initiated and participated in several European research projects on neuro-symbolic integration, meta-learning, and biological text mining. She has served on the program committees of many conferences and workshops in machine learning, data mining, and artificial intelligence. She is currently an Associate Editor of the International Journal on Artificial Intelligence Tools and a member of the Editorial Board of the Intelligent Data Analysis journal.
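For the case where a feature selector outputs a subset, one simple way to quantify stability across training-set variations is the average pairwise set similarity of the subsets obtained on different resamples. The sketch below uses the Jaccard (Tanimoto) index; this particular measure and the names `jaccard` and `subset_stability` are illustrative assumptions, not necessarily the measures or the estimation procedure examined in the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard (Tanimoto) similarity of two feature subsets, in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty subsets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def subset_stability(subsets):
    """Mean pairwise Jaccard similarity over subsets from different resamples."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Subsets a selector might return on three resamples of the training set.
runs = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3}]
stability = subset_stability(runs)
```

A value of 1 means the selector returned the same subset on every resample; values near 0 mean the subsets barely overlap. Weight and rank outputs would need a different similarity, for example a correlation between weight or rank vectors.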

A genetic algorithm-based method for feature subset selection (total citations: 5; self-citations: 2; citations by others: 3)
As a commonly used technique in data preprocessing, feature selection selects a subset of informative attributes or variables to build models describing data. By removing redundant, irrelevant or noisy features, feature selection can improve the predictive accuracy and the comprehensibility of the predictors or classifiers. Many feature selection algorithms with different selection criteria have been introduced by researchers. However, no single criterion has been found to be best for all applications. In this paper, we propose a framework based on a genetic algorithm (GA) for feature subset selection that combines various existing feature selection methods. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the particular inductive learning algorithm used to build the classifier. We conducted experiments using three data sets and three existing feature selection methods. The experimental results demonstrate that our approach is a robust and effective way to find subsets of features with higher classification accuracy and/or smaller size compared to each individual feature selection algorithm.
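As background for the GA-based entries in this list, a generic genetic algorithm over binary feature masks can be sketched as follows. It is a plain textbook GA with tournament selection, one-point crossover, bit-flip mutation and elitism, not the framework proposed in the paper above; `ga_select`, `fitness` and all parameter names and defaults are hypothetical.

```python
import random


def ga_select(n_features, fitness, pop_size=20, generations=30,
              crossover_rate=0.9, mutation_rate=0.05, seed=0):
    """Search for a feature mask (tuple of 0/1 bits) that maximizes `fitness`."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_features))
           for _ in range(pop_size)]

    def tournament():
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    best = max(pop, key=fitness)
    for _ in range(generations):
        children = [best]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1
            # Bit-flip mutation, applied independently to every bit.
            child = tuple(bit ^ (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best, fitness(best)


# Toy fitness: features 1 and 4 help, every other selected feature costs 0.1.
def toy_fitness(mask):
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & {1, 4}) - 0.1 * len(chosen - {1, 4})


mask, score = ga_select(8, toy_fitness)
```

In a framework that combines several selection criteria, `fitness` is the place where classifier accuracy, subset size, or scores from existing feature selection methods could be mixed; how the paper does this is not stated in the abstract.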

Feature selection has always been a critical step in pattern recognition, in which evolutionary algorithms, such as the genetic algorithm (GA), are most commonly used. However, the individual encoding scheme used in various GAs either imposes a bias on the solution or requires a pre-specified number of features, and hence may lead to less accurate results. In this paper, a tribe competition-based genetic algorithm (TCbGA) is proposed for feature selection in pattern classification. The population of individuals is divided into multiple tribes, and the initialization and evolutionary operations are modified to ensure that the number of selected features in each tribe follows a Gaussian distribution. Thus each tribe focuses on exploring a specific part of the solution space. Meanwhile, tribe competition is introduced to the evolution process, which allows the winning tribes, which produce better individuals, to enlarge their sizes, i.e. to have more individuals to search their parts of the solution space. This algorithm, therefore, avoids the bias on solutions and the requirement of a pre-specified number of features. We have evaluated our algorithm against several state-of-the-art feature selection approaches on 20 benchmark datasets. Our results suggest that the proposed TCbGA algorithm can identify the optimal feature subset more effectively and produce more accurate pattern classification.

Searching for an optimal feature subset from a high dimensional feature space is known to be an NP-complete problem. We present a hybrid algorithm, SAGA, for this task. SAGA combines the ability to avoid being trapped in a local minimum of simulated annealing with the very high rate of convergence of the crossover operator of genetic algorithms, the strong local search ability of greedy algorithms and the high computational efficiency of generalized regression neural networks. We compare the performance over time of SAGA and well-known algorithms on synthetic and real datasets. The results show that SAGA outperforms existing algorithms.

The high dimensionality of microarray datasets creates various difficulties for the task of multiclass tissue classification, the main challenge being the selection of features deemed relevant and non-redundant to form the predictor set for classifier training. Varying the emphasis on relevance and redundancy during the search for the predictor set, through the degree of differential prioritization (DDP), is also important. Furthermore, there are several types of decomposition technique for the feature selection (FS) problem: all-classes-at-once, one-vs.-all (OVA) or pairwise (PW). Also, in multiclass problems, there is the need to consider the type of classifier aggregation used: non-aggregated (a single machine) or aggregated (OVA or PW). We first propose a systematic approach to combining the distinct problems of FS and classification. Then, using eight well-known multiclass microarray datasets, we empirically demonstrate the effectiveness of the DDP in various combinations of FS decomposition types and classifier aggregation methods. Aided by the variable DDP, feature selection leads to classification performance that is better than that of rank-based or equal-priorities scoring methods, and to accuracies higher than previously reported for benchmark datasets with a large number of classes. Finally, based on several criteria, we make general recommendations on the optimal choice of the combination of FS decomposition type and classifier aggregation method for multiclass microarray datasets.

A new local search based hybrid genetic algorithm for feature selection (total citations: 2; self-citations: 0; citations by others: 2)
This paper presents a new hybrid genetic algorithm (HGA) for feature selection (FS), called HGAFS. The vital aspect of this algorithm is the selection of a salient feature subset of reduced size. HGAFS incorporates a new local search operation that is devised and embedded in the HGA to fine-tune the search in the FS process. The local search technique works on the basis of the distinct and informative nature of input features, which is computed from their correlation information. The aim is to guide the search process so that newly generated offspring can be adjusted using the less correlated (distinct) features, which comprise general and special characteristics of a given dataset. Thus, the proposed HGAFS reduces the redundancy of information among the selected features. In addition, HGAFS emphasizes selecting a subset of salient features of reduced size by using a subset size determination scheme. We have tested HGAFS on 11 real-world classification datasets with dimensions varying from 8 to 7129. The performance of HGAFS has been compared with the results of ten other well-known FS algorithms. HGAFS is found to perform consistently better at selecting subsets of salient features, which results in better classification accuracies.

Algorithms for feature selection in predictive data mining for classification problems attempt to select those features that are relevant, and are not redundant for the classification task. A relevant feature is defined as one which is highly correlated with the target function. One problem with the definition of feature relevance is that there is no universally accepted definition of what it means for a feature to be ‘highly correlated with the target function or highly correlated with the other features’. A new feature selection algorithm which incorporates domain specific definitions of high, medium and low correlations is proposed in this paper. The proposed algorithm conducts a heuristic search for the most relevant features for the prediction task.

The evaluation of feature selection methods for text classification with small sample datasets must consider classification performance, stability, and efficiency. It is, thus, a multiple criteria decision-making (MCDM) problem. Yet there has been little research on evaluating feature selection with MCDM methods that consider multiple criteria. Therefore, we use MCDM-based methods for evaluating feature selection methods for text classification with small sample datasets. An experimental study is designed to compare five MCDM methods to validate the proposed approach with 10 feature selection methods, nine evaluation measures for binary classification, seven evaluation measures for multi-class classification, and three classifiers with 10 small datasets. Based on the ranked results of the five MCDM methods, we make recommendations concerning feature selection methods. The results demonstrate the effectiveness of the MCDM-based method used in evaluating feature selection methods.

Protein function prediction is an important problem in functional genomics. Typically, protein sequences are represented by feature vectors. A major problem of protein datasets that increases the complexity of classification models is their large number of features. Feature selection (FS) techniques are used to deal with this high dimensional space of features. In this paper, we propose a novel feature selection algorithm that combines genetic algorithms (GA) and ant colony optimization (ACO) for faster and better search capability. The hybrid algorithm makes use of the advantages of both ACO and GA methods. The proposed algorithm is easy to implement and, because it uses a simple classifier, its computational complexity is very low. The performance of the proposed algorithm is compared to the performance of two prominent population-based algorithms, ACO and genetic algorithms. Experimentation is carried out using two challenging biological datasets, involving the hierarchical functional classification of GPCRs and enzymes. The criteria used for comparison are maximizing predictive accuracy and finding the smallest subset of features. The results of the experiments indicate the superiority of the proposed algorithm.

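High-dimensional biomedical data contain a large number of irrelevant or weakly relevant features, which reduces the efficiency of disease diagnosis. To address this, a feature selection method for high-dimensional biomedical data based on an improved shuffled frog leaping algorithm is proposed. The method introduces a chaotic memory weight factor and a balanced grouping strategy into the basic shuffled frog leaping algorithm. This strengthens the diversity of the algorithm while maintaining the balance between global and local search, reduces the likelihood of becoming trapped in a local optimum, and further improves the ability of the shuffled frog leaping feature selection method to explore the feature space. Experimental results show that, compared with feature selection methods based on an improved genetic algorithm and on particle swarm optimization, the improved shuffled frog leaping feature selection method achieves better results in feature subset identification and classification accuracy on high-dimensional biomedical data.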

In this work, the Synthetic Minority Over-sampling Technique (SMOTE) approach is adapted for high-dimensional binary settings. A novel distance metric is proposed for the computation of the neighborhood for each minority sample, which takes into account only a subset of the available attributes that are relevant for the task. Three variants of the distance metric are explored (Euclidean, Manhattan, and Chebyshev distances), along with four different ranking strategies (Fisher Score, Mutual Information, Eigenvector Centrality, and Correlation Score). Our proposal was compared with various oversampling techniques on low- and high-dimensional datasets in the presence of class imbalance, including a case study on Natural Language Processing (NLP). The proposed oversampling strategy showed superior results on average when compared with SMOTE and other variants, demonstrating the importance of selecting the right attributes when defining the neighborhood in SMOTE-based oversampling methods.
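The core idea above, computing SMOTE neighborhoods with a distance restricted to the relevant attributes, can be sketched as follows. The function `smote_subset` and its parameters are hypothetical, the list of relevant attribute indices is assumed to come from some ranking step that is not shown, and only the Euclidean variant is illustrated; whether the paper interpolates all attributes, as this sketch does, is also an assumption.

```python
import random


def smote_subset(minority, relevant, k=2, n_new=4, seed=0):
    """SMOTE-style oversampling with neighbors found on a feature subset.

    minority: list of minority-class samples (lists of floats), at least two.
    relevant: indices of the attributes used in the distance computation.
    """
    rng = random.Random(seed)

    def dist(a, b):
        # Euclidean distance restricted to the relevant attributes.
        return sum((a[j] - b[j]) ** 2 for j in relevant) ** 0.5

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbors of x, excluding x itself.
        neighbors = sorted((m for idx, m in enumerate(minority) if idx != i),
                           key=lambda m: dist(x, m))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()
        # Interpolate every attribute between x and the chosen neighbor.
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


# Attribute 2 is treated as irrelevant and ignored when finding neighbors.
minority = [[0.0, 0.0, 5.0], [1.0, 0.0, 9.0], [0.0, 1.0, 1.0], [1.0, 1.0, 3.0]]
new_points = smote_subset(minority, relevant=[0, 1])
```

Swapping `dist` for a Manhattan or Chebyshev distance over the same indices would give the other two variants named in the abstract.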

In this paper, we present a new method for dealing with feature subset selection based on fuzzy entropy measures for handling classification problems. First, we discretize numeric features to construct the membership function of each fuzzy set of a feature. Then, we select the feature subset based on the proposed fuzzy entropy measure focusing on boundary samples. The proposed method can select relevant features to get higher average classification accuracy rates than the ones selected by the MIFS method (Battiti, R. in IEEE Trans. Neural Netw. 5(4):537–550, 1994), the FQI method (De, R.K., et al. in Neural Netw. 12(10):1429–1455, 1999), the OFEI method, Dong-and-Kothari’s method (Dong, M., Kothari, R. in Pattern Recognit. Lett. 24(9):1215–1225, 2003) and the OFFSS method (Tsang, E.C.C., et al. in IEEE Trans. Fuzzy Syst. 11(2):202–213, 2003).
Shyi-Ming Chen

In this paper, we present the MIFS-C variant of the mutual information feature-selection algorithms. We present an algorithm to find the optimal value of the redundancy parameter, which is a key parameter in the MIFS-type algorithms. Furthermore, we present an algorithm that speeds up the execution of all the MIFS variants. Overall, the presented MIFS-C has classification accuracy comparable to (and in some cases better than) that of other MIFS algorithms, while running faster. We compared this feature selector with other feature selectors, and found that it performs better in most cases. The MIFS-C performed especially well for the breakeven and F-measure because the algorithm can be tuned to optimise these evaluation measures.
Jan Bakus received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1996 and 1998, respectively, and the Ph.D. degree in systems design engineering in 2005. He is currently working at Maplesoft, Waterloo, ON, Canada as an applications engineer, where he is responsible for the development of application specific toolboxes for the Maple scientific computing software. His research interests are in the area of feature selection for text classification, text classification, text clustering, and information retrieval. He is the recipient of the Carl Pollock Fellowship award from the University of Waterloo and the Datatel Scholars Foundation scholarship from Datatel. Mohamed S. Kamel holds a Ph.D. in computer science from the University of Toronto, Canada. He is at present Professor and Director of the Pattern Analysis and Machine Intelligence Laboratory in the Department of Electrical and Computer Engineering, University of Waterloo, Canada. Professor Kamel holds a Canada Research Chair in Cooperative Intelligent Systems. Dr. Kamel's research interests are in machine intelligence, neural networks and pattern recognition with applications in robotics and manufacturing. He has authored and coauthored over 200 papers in journals and conference proceedings, 2 patents and numerous technical and industrial project reports. Under his supervision, 53 Ph.D. and M.A.Sc. students have completed their degrees. Dr. Kamel is a member of ACM, AAAI, CIPS and APEO and has been named a Fellow of IEEE (2005). He is the editor-in-chief of the International Journal of Robotics and Automation, Associate Editor of the IEEE SMC, Part A, the International Journal of Image and Graphics, and Pattern Recognition Letters, and is a member of the editorial board of Intelligent Automation and Soft Computing. He has served as a consultant to many companies, including NCR, IBM, Nortel, VRP and CSA. He is a member of the board of directors and cofounder of Virtek Vision International in Waterloo.
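To show the role of the redundancy parameter mentioned in the MIFS-C abstract, the sketch below implements the basic MIFS greedy criterion (Battiti, 1994): at each step pick the feature f maximizing I(C; f) - beta * sum over selected s of I(f; s). It is the baseline criterion only, not MIFS-C; the procedure for tuning beta and the speed-up described in the abstract are not reproduced, and the function names and the discrete toy data are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mifs(columns, labels, k, beta=0.5):
    """Greedy MIFS: relevance to the class minus beta times redundancy."""
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < k:
        def score(f):
            redundancy = sum(mutual_information(columns[f], columns[s])
                             for s in selected)
            return mutual_information(columns[f], labels) - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the class is determined by bits a and b; column 1 duplicates column 0.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
labels = [2 * ai + bi for ai, bi in zip(a, b)]
columns = [a, list(a), b]

with_penalty = mifs(columns, labels, k=2, beta=1.0)
without_penalty = mifs(columns, labels, k=2, beta=0.0)
```

With `beta=1.0` the duplicate column is penalized and the complementary column `b` is chosen second; with `beta=0.0` redundancy is ignored and the duplicate is picked, which is why the value of the redundancy parameter matters.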

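Feature selection plays an important role in many fields. A feature selection method based on a hybrid adaptive gravitational search algorithm is proposed, which selects the smallest feature subset from the data samples while maximizing classification accuracy. The algorithm designs two solution-update strategies for combined search and introduces a population reduction method to effectively balance global search and local convergence. It also proposes adaptive control parameters to reduce the influence of parameter settings on the algorithm's performance. Experimental results on seven real datasets show that, in terms of classification accuracy, feature subset size and running time, the proposed method outperforms the original algorithm and existing similar algorithms, has good overall performance, and is an effective feature selection method.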

This paper addresses the dynamic recognition of basic facial expressions in videos using feature subset selection. Feature selection has already been used by some static classifiers where the facial expression is recognized from one single image. Past work on dynamic facial expression recognition has emphasized the issues of feature extraction and classification; however, less attention has been given to the critical issue of feature selection in the dynamic scenario. The main contributions of the paper are as follows. First, we show that dynamic facial expression recognition can be cast as a classical classification problem. Second, we combine a facial dynamics extractor algorithm with a feature selection scheme for generic classifiers. We show that the paradigm of feature subset selection with a wrapper technique can improve the dynamic recognition of facial expressions. We provide evaluations of performance on real video sequences using five standard machine learning approaches: Support Vector Machines, K Nearest Neighbor, Naive Bayes, Bayesian Networks, and Classification Trees.

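A genetic-algorithm-based feature selection algorithm for big data is proposed. The algorithm first evaluates the features in each dimension and adjusts the weight of each feature according to how much it differs on the nearest same-class neighbor and on the nearest different-class neighbor. The feature weights then guide the search of the genetic algorithm, improving its search capability and the accuracy of the selected features. Next, the fitness of the features is computed from the feature weights and used as the evaluation criterion, and the genetic algorithm is run to obtain the optimal feature subset, achieving efficient and accurate feature selection for big data. Experimental analysis shows that the algorithm effectively reduces the number of features used for classification and improves classification accuracy.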
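The weighting step in this abstract, adjusting each feature's weight by its difference on the nearest same-class and nearest different-class neighbors, resembles the classic Relief scheme. The sketch below is a minimal Relief-style update under that assumption; the abstract does not give the exact formula, so `relief_weights`, the Manhattan distance and the assumption that features are scaled to [0, 1] are all illustrative.

```python
def relief_weights(X, y):
    """Relief-style feature weights from nearest hit and nearest miss.

    X: list of samples (lists of floats, assumed scaled to [0, 1]).
    y: class labels; every class needs at least two samples.
    """
    n, d = len(X), len(X[0])

    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    w = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        h = min(hits, key=lambda j: dist(X[i], X[j]))    # nearest same-class sample
        m = min(misses, key=lambda j: dist(X[i], X[j]))  # nearest other-class sample
        for f in range(d):
            # Reward separation from the miss, penalize separation from the hit.
            w[f] += (abs(X[i][f] - X[m][f]) - abs(X[i][f] - X[h][f])) / n
    return w


# Feature 0 determines the class; feature 1 is noise.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
```

Weights like these could then bias a genetic algorithm, for example by initializing or mutating masks in favor of highly weighted features; how the paper couples the two is not detailed in the abstract.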
