Feature subset selection is a substantial problem in data classification. Its purpose is to find an efficient subset of the original features that increases both efficiency and accuracy and reduces the cost of classification. High-dimensional datasets, with a very large number of predictive attributes but few instances, require techniques for selecting an optimal feature subset. In this paper, a hybrid method is proposed for efficient subset selection in high-dimensional datasets. The proposed algorithm runs filter and wrapper algorithms in two phases. In the filter phase, the symmetrical uncertainty (SU) criterion is used to weight features by their ability to discriminate the classes. In the wrapper phase, both FICA (fuzzy imperialist competitive algorithm) and IWSSr (Incremental Wrapper Subset Selection with replacement) are executed in the weighted feature space to find relevant attributes. The new scheme is successfully applied to 10 standard high-dimensional datasets, especially from biosciences and medicine, where the number of features is large compared to the number of samples, inducing a severe curse of dimensionality. Comparison with other algorithms confirms that our method achieves the highest accuracy and also yields an efficient, compact subset.
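
The filter phase described above weights features by symmetrical uncertainty. As a minimal sketch, assuming discrete (or already discretized) features and the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), the weighting could look as follows. The function names and the `columns` dictionary layout are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); lies in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    joint = entropy(list(zip(x, y)))
    mutual_info = hx + hy - joint
    return 2.0 * mutual_info / (hx + hy)


def rank_features(columns, labels):
    """Return (name, SU with the class) pairs, highest SU first.

    `columns` maps a hypothetical feature name to its list of values.
    """
    scores = {name: symmetrical_uncertainty(col, labels)
              for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A feature identical to the class label gets SU = 1, and a feature independent of it gets SU = 0, so the ranking puts the most class-discriminating features first.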

Searching for an optimal feature subset from a high dimensional feature space is known to be an NP-complete problem. We present a hybrid algorithm, SAGA, for this task. SAGA combines the ability to avoid being trapped in a local minimum of simulated annealing with the very high rate of convergence of the crossover operator of genetic algorithms, the strong local search ability of greedy algorithms and the high computational efficiency of generalized regression neural networks. We compare the performance over time of SAGA and well-known algorithms on synthetic and real datasets. The results show that SAGA outperforms existing algorithms.

A note on genetic algorithms for large-scale feature selection   Total citations: 7, self-citations: 0, citations by others: 7
We introduce the use of genetic algorithms (GA) for the selection of features in the design of automatic pattern classifiers. Our preliminary results suggest that GA is a powerful means of reducing the time for finding near-optimal subsets of features from large sets.

A new improved forward floating selection (IFFS) algorithm for selecting a subset of features is presented. Our proposed algorithm improves on the state-of-the-art sequential forward floating selection algorithm. It adds a search step called "replacing the weak feature", which checks at each sequential step whether removing any feature from the currently selected subset and adding a new one can improve that subset. Our method provides optimal or quasi-optimal (close to optimal) solutions for many selected subsets and requires significantly less computation than optimal feature selection algorithms. Our experimental results on four different databases demonstrate that our algorithm consistently selects better subsets than other suboptimal feature selection algorithms do, especially when the original number of features in the database is large.
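
The "replacing the weak feature" step can be illustrated with a short sketch. It assumes a user-supplied `criterion` that scores a feature subset (higher is better); that function, the exhaustive swap loop, and all names here are assumptions for illustration and are not the authors' implementation.

```python
def replace_weak_feature(selected, all_features, criterion):
    """Try swapping each selected feature for each unselected one.

    Returns the best swapped subset if it scores strictly higher than the
    current subset, otherwise the current subset unchanged. `criterion`
    (hypothetical) maps a frozenset of features to a score.
    """
    current = frozenset(selected)
    best_subset, best_score = current, criterion(current)
    unselected = [f for f in all_features if f not in current]
    for weak in current:
        reduced = current - {weak}  # drop one candidate "weak" feature
        for new in unselected:
            candidate = reduced | {new}  # and add one replacement
            score = criterion(candidate)
            if score > best_score:
                best_subset, best_score = candidate, score
    return best_subset
```

The subset size stays fixed during this step, which is what distinguishes it from the usual forward (add) and backward (remove) moves of floating selection.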

Graph based pattern representation offers a versatile alternative to vectorial data structures. Therefore, a growing interest in graphs can be observed in various fields. However, a serious limitation in the use of graphs is the lack of elementary mathematical operations in the graph domain, actually required in many pattern recognition algorithms. In order to overcome this limitation, the present paper proposes an embedding of a given graph population in a vector space $\mathbb{R}^n$. The key idea of this embedding approach is to interpret the distances of a graph g to a number of prototype graphs as numerical features of g. In previous works, the prototypes were selected beforehand with heuristic selection algorithms. In the present paper we take a more fundamental approach and regard the problem of prototype selection as a feature selection or dimensionality reduction problem, for which many methods are available. With several experiments we show the feasibility of graph embedding based on prototypes obtained from such feature selection algorithms and demonstrate their potential to outperform previous approaches.
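
The embedding idea, using the distances from a graph to n prototype graphs as an n-dimensional feature vector, can be sketched in a few lines. The distance used here is a toy stand-in (the number of edges present in exactly one of the two graphs); the abstract does not name its graph distance, so both `edge_set_distance` and the edge-set representation are assumptions.

```python
def embed(graph, prototypes, distance):
    """Map a graph to the vector of its distances to each prototype graph."""
    return [distance(graph, p) for p in prototypes]


def edge_set_distance(g1, g2):
    """Toy dissimilarity (assumption): edges present in exactly one graph.

    Graphs are represented as frozensets of (node, node) edge tuples.
    """
    return len(g1 ^ g2)
```

For example, with `g = frozenset({(1, 2), (2, 3)})` and prototypes `frozenset({(1, 2)})` and `frozenset({(2, 3), (3, 4)})`, `embed(g, prototypes, edge_set_distance)` gives `[1, 2]`. Choosing which prototypes to keep is then the same as choosing which coordinates of this vector to keep, which is why prototype selection can be treated as feature selection.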

Traditional multivariate tests such as Hotelling's test or Wilks' test are designed for classical problems, where the number of observations is much larger than the dimension of the variables. For high-dimensional data, however, this assumption can no longer be met. In this article, we consider testing problems in high-dimensional MANOVA where the number of variables exceeds the sample size. To overcome the challenges of high dimensionality, we propose a new approach called a shrinkage-based regularization test, which is suitable for a variety of data structures including the one-sample problem and one-way MANOVA. Our approach uses a ridge regularization to overcome the singularity of the sample covariance matrix and applies a soft-thresholding technique to reduce random noise and improve the testing power. An appealing property of this approach is its ability to select relevant variables that provide evidence against the hypothesis. We compare the performance of our approach with some competing approaches via real microarray data and simulation studies. The results illustrate that the proposed statistic maintains relatively high power in detecting a wide family of alternatives.

Given a large set of potential features, it is usually necessary to find a small subset with which to classify. The task of finding an optimal feature set is inherently combinatoric and therefore suboptimal algorithms are typically used to find feature sets. If feature selection is based directly on classification error, then a feature-selection algorithm must base its decision on error estimates. This paper addresses the impact of error estimation on feature selection using two performance measures: comparison of the true error of the optimal feature set with the true error of the feature set found by a feature-selection algorithm, and the number of features among the truly optimal feature set that appear in the feature set found by the algorithm. The study considers seven error estimators applied to three standard suboptimal feature-selection algorithms and exhaustive search, and it considers three different feature-label model distributions. It draws two conclusions for the cases considered: (1) depending on the sample size and the classification rule, feature-selection algorithms can produce feature sets whose corresponding classifiers possess errors far in excess of the classifier corresponding to the optimal feature set; and (2) for small samples, differences in performances among the feature-selection algorithms are less significant than performance differences among the error estimators used to implement the algorithms. Moreover, keeping in mind that results depend on the particular classifier-distribution pair, for the error estimators considered in this study, bootstrap and bolstered resubstitution usually outperform cross-validation, and bolstered resubstitution usually performs as well as or better than bootstrap.

Feature subset selection is a key problem in the data-mining classification task that helps to obtain more compact and understandable models without degrading (or even improving) their performance. In this work we focus on FSS in high-dimensional datasets, that is, with a very large number of predictive attributes. In this case, standard sophisticated wrapper algorithms cannot be applied because of their complexity, and computationally lighter filter-wrapper algorithms have recently been proposed. In this work we propose a stochastic algorithm based on the GRASP meta-heuristic, with the main goal of speeding up the feature subset selection process, basically by reducing the number of wrapper evaluations to carry out. GRASP is a multi-start constructive method which constructs a solution in its first stage, and then runs an improving stage over that solution. Several instances of the proposed GRASP method are experimentally tested and compared with state-of-the-art algorithms over 12 high-dimensional datasets. The statistical analysis of the results shows that our proposal is comparable in accuracy and cardinality of the selected subset to previous algorithms, but requires significantly fewer evaluations.

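A survey of feature selection methods   Total citations: 13, self-citations: 0, citations by others: 13
Feature selection is one of the key problems in pattern recognition; the quality of the feature selection result directly affects a classifier's accuracy and generalization performance. This paper first analyzes the framework of feature selection methods; it then analyzes and summarizes feature selection methods from two perspectives, search strategy and evaluation criterion; finally, it analyzes the factors that influence feature selection and points out problems that need to be solved in practical applications.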

This paper deals with the problem of supervised wrapper-based feature subset selection in datasets with a very large number of attributes. Recently the literature has contained numerous references to the use of hybrid selection algorithms: based on a filter ranking, they perform an incremental wrapper selection over that ranking. Though they work well, these methods still have problems: (1) depending on the complexity of the wrapper search method, the number of wrapper evaluations can still be too large; and (2) they rely on a univariate ranking that does not take into account interaction between the variables already included in the selected subset and the remaining ones. Here we propose a new approach whose main goal is to drastically reduce the number of wrapper evaluations while maintaining good performance (e.g. accuracy and size of the obtained subset). To do this we propose an algorithm that iteratively alternates between filter ranking construction and wrapper feature subset selection (FSS). Thus, the FSS only uses the first block of ranked attributes and the ranking method uses the current selected subset in order to build a new ranking where this knowledge is considered. The algorithm terminates when no new attribute is selected in the last call to the FSS algorithm. The main advantage of this approach is that only a few blocks of variables are analyzed, and so the number of wrapper evaluations decreases drastically. The proposed method is tested over eleven high-dimensional datasets (2400-46,000 variables) using different classifiers. The results show an impressive reduction in the number of wrapper evaluations without degrading the quality of the obtained subset.

Feature selection is a process aimed at filtering out unrepresentative features from a given dataset, usually allowing the later data mining and analysis steps to produce better results. However, different feature selection algorithms use different criteria to select representative features, making it difficult to find the best algorithm for different domain datasets. The limitations of single feature selection methods can be overcome by the application of ensemble methods, combining multiple feature selection results. In the literature, feature selection algorithms are classified as filter, wrapper, or embedded techniques. However, to the best of our knowledge, there has been no study focusing on combining these three types of techniques to produce ensemble feature selection. Therefore, the aim here is to answer the question as to which combination of different types of feature selection algorithms offers the best performance for different types of medical data including categorical, numerical, and mixed data types. The experimental results show that a combination of filter (i.e., principal component analysis) and wrapper (i.e., genetic algorithms) techniques by the union method is a better choice, providing relatively high classification accuracy and a reasonably good feature reduction rate.

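An equal-partition Lasso feature selection method for high-dimensional data   Total citations: 1, self-citations: 0, citations by others: 1
Lasso is a feature selection method based on the L1 norm. Compared with existing feature selection methods, Lasso not only accurately selects the variables strongly correlated with the class label but also offers stable feature selection, and has therefore become a research focus. However, like other feature selection methods, Lasso is prone to excessive computational cost or overfitting when selecting features on high-dimensional massive datasets or high-dimensional small-sample datasets. To address this, an improved Lasso method is proposed: equal-partition Lasso. The method splits the feature set evenly into K parts, performs feature selection on each feature subset, merges the features selected from each part, and then performs feature selection once more. Experiments show that equal-partition Lasso handles feature selection on high-dimensional massive or high-dimensional small-sample datasets well and is an effective feature selection method.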
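
The partition, select, merge, reselect flow described above can be sketched independently of the underlying selector. In this minimal sketch `select` is a hypothetical callable standing in for a Lasso-based selector (it takes a list of feature names and returns the ones it keeps), and the round-robin split is only one assumed way of dividing the features into K near-equal parts; the paper's exact partitioning is not given in the abstract.

```python
def partitioned_selection(features, k, select):
    """Split features into k near-equal parts, select within each part,
    merge the survivors, then run the selector once more on the merged set.

    `select` is a placeholder for the real selector (Lasso in the paper).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    features = list(features)
    parts = [features[i::k] for i in range(k)]  # round-robin split
    merged = []
    for part in parts:
        if part:  # skip empty parts when k exceeds the feature count
            merged.extend(select(part))
    return select(merged)


def make_top_n_selector(scores, n):
    """Toy selector (assumption): keep the n highest-scoring features."""
    def select(candidates):
        return sorted(candidates, key=lambda f: scores[f], reverse=True)[:n]
    return select
```

Each call to `select` sees at most roughly 1/K of the features (or the merged survivors), which is the source of the reduced computational cost the abstract claims.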

Selecting relevant features for support vector machine (SVM) classifiers is important for a variety of reasons such as generalization performance, computational efficiency, and feature interpretability. Traditional SVM approaches to feature selection typically extract features and learn SVM parameters independently. Independently performing these two steps might result in a loss of information related to the classification process. This paper proposes a convex energy-based framework to jointly perform feature selection and SVM parameter learning for linear and non-linear kernels. Experiments on various databases show significant reduction of features used while maintaining classification performance.

In this paper, we present the MIFS-C variant of the mutual information feature-selection algorithms. We present an algorithm to find the optimal value of the redundancy parameter, which is a key parameter in the MIFS-type algorithms. Furthermore, we present an algorithm that speeds up the execution of all the MIFS variants. Overall, the presented MIFS-C has classification accuracy comparable to (in some cases even better than) that of other MIFS algorithms, while it runs faster. We compared this feature selector with other feature selectors, and found that it performs better in most cases. MIFS-C performed especially well for the breakeven and F-measure because the algorithm can be tuned to optimise these evaluation measures.
Jan Bakus received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1996 and 1998, respectively, and the Ph.D. degree in systems design engineering in 2005. He is currently working at Maplesoft, Waterloo, ON, Canada as an applications engineer, where he is responsible for the development of application-specific toolboxes for the Maple scientific computing software. His research interests are in the area of feature selection for text classification, text classification, text clustering, and information retrieval. He is the recipient of the Carl Pollock Fellowship award from the University of Waterloo and the Datatel Scholars Foundation scholarship from Datatel.
Mohamed S. Kamel holds a Ph.D. in computer science from the University of Toronto, Canada. He is at present Professor and Director of the Pattern Analysis and Machine Intelligence Laboratory in the Department of Electrical and Computing Engineering, University of Waterloo, Canada. Professor Kamel holds a Canada Research Chair in Cooperative Intelligent Systems. Dr. Kamel's research interests are in machine intelligence, neural networks and pattern recognition with applications in robotics and manufacturing. He has authored and coauthored over 200 papers in journals and conference proceedings, 2 patents and numerous technical and industrial project reports. Under his supervision, 53 Ph.D. and M.A.Sc. students have completed their degrees. Dr. Kamel is a member of ACM, AAAI, CIPS and APEO and has been named a Fellow of IEEE (2005). He is the editor-in-chief of the International Journal of Robotics and Automation, Associate Editor of the IEEE SMC, Part A, the International Journal of Image and Graphics, Pattern Recognition Letters and is a member of the editorial board of the Intelligent Automation and Soft Computing. He has served as a consultant to many companies, including NCR, IBM, Nortel, VRP and CSA. He is a member of the board of directors and cofounder of Virtek Vision International in Waterloo.

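Wang Xiang (王翔), Hu Xuegang (胡学钢), Journal of Computer Applications (《计算机应用》), 2017, 37(9): 2433-2438
With the development of bioinformatics, gene expression microarrays, image recognition and related technologies, classification of high-dimensional small-sample data has become a challenging task in data mining (including machine learning and pattern recognition), and it easily leads to the "curse of dimensionality" and overfitting. Feature selection can effectively avoid the curse of dimensionality and improve the generalization ability of classification models, so it has become a research focus, and a survey of the main domestic and international research on feature selection for high-dimensional small-sample data is needed. The paper first analyzes the nature of the high-dimensional small-sample feature selection problem; second, it classifies, analyzes, and compares feature selection methods for high-dimensional small-sample data according to the essential differences between the algorithms; finally, it discusses the challenges facing this research and future research directions.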

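Databases usually contain many redundant features; finding the important ones is called feature extraction. This paper proposes a heuristic feature selection algorithm based on attribute significance. The algorithm uses attribute significance as its iteration criterion to obtain a minimal reduct of the attribute set.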

The use of feature selection can improve accuracy, efficiency, applicability and understandability of a learning process. For this reason, many methods of automatic feature selection have been developed. Some of these methods are based on searching for the features that allow the data set to be considered consistent. In a search problem we usually evaluate the search states; in the case of feature selection we measure the possible feature sets. This paper reviews the state of the art of consistency based feature selection methods, identifying the measures used for feature sets. An in-depth study of these measures is conducted, including the definition of a new measure necessary for completeness. After that, we perform an empirical evaluation of the measures, comparing them with the highly reputed wrapper approach. Consistency measures achieve similar results to those of the wrapper approach with much better efficiency.
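
As an illustration of what a consistency measure for a feature set can look like, the sketch below computes the commonly used inconsistency rate: instances that agree on the selected features are grouped, and every instance outside its group's majority class counts as inconsistent. This is an assumed example of the family of measures the abstract reviews, not necessarily one of the specific measures studied in the paper, and the function and argument names are illustrative.

```python
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels, subset):
    """Fraction of instances not in the majority class of their pattern.

    `rows` is a list of feature tuples, `labels` the class labels, and
    `subset` the indices of the features being evaluated.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        pattern = tuple(row[i] for i in subset)  # projection onto the subset
        groups[pattern][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values())
                       for c in groups.values())
    return inconsistent / len(rows)
```

A rate of 0 means the subset separates the classes perfectly on the given data. Because the measure needs only one pass over the data and no classifier training, it is far cheaper to evaluate than a wrapper criterion.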

In feature selection problems, strongly relevant features may be misjudged as redundant by the approximate Markov blanket. To avoid this, a new concept called the strong approximate Markov blanket is proposed. It is theoretically proved that no strongly relevant feature will be misjudged as redundant by the proposed concept. To reduce computation time, we propose the concept of the modified strong approximate Markov blanket, which still performs better than the approximate Markov blanket in avoiding misjudgment of strongly relevant features. A new filter-based feature selection method that is applicable to high-dimensional datasets is further developed. It first groups features to remove redundant features, and then uses a sequential forward selection method to remove irrelevant features. Numerical results on four benchmark and seven real datasets suggest that it is a competitive feature selection method with high classification accuracy, a moderate number of selected features, and above-average robustness.

It is argued that machine algorithms based on feature detection promise the greatest chance for success in the recognition of isolated, unconstrained handprinted characters. In order to match human performance, the features used cannot be chosen in an arbitrary manner; they must have some psychological significance. A theory of characters based on functional attributes is reviewed, and three psychophysical tests are described for determining the psychological validity of any postulated attribute. The first test indicates if a particular attribute is involved in a particular letter, and the second and third tests investigate the commonality of an attribute among different letters.
