Similar Documents
20 similar documents found.
1.
Mutual information (MI) is used in feature selection to evaluate two key properties of optimal features: the relevance of a feature to the class variable and the redundancy among similar features. Conditional mutual information (CMI), i.e., the MI between a candidate feature and the class variable conditioned on the features already selected, is a natural extension of MI, but it has not been applied so far because of the complications of estimating it for high-dimensional distributions. We propose a nearest neighbor estimate of CMI, appropriate for high-dimensional variables, and build an iterative scheme for sequential feature selection with a termination criterion, called CMINN. We show that CMINN is equivalent to MI-based feature selection filters, such as mRMR and MaxiMin, in the presence of solely single-feature effects, and is more appropriate for combined feature effects. We compare CMINN to mRMR and MaxiMin on simulated datasets involving combined effects and confirm the superiority of CMINN in selecting the correct features (indicated also by the termination criterion) and giving the best classification accuracy. The application to ten benchmark databases shows that CMINN obtains the same or higher classification accuracy than mRMR and MaxiMin with a smaller cardinality of the selected feature subset.
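The abstract does not spell out the estimator, but the general idea can be sketched: for a discrete class variable, I(f; c | S) = I([f, S]; c) - I(S; c), and each MI term can be built from Kozachenko-Leonenko (k-NN) entropy estimates. The sketch below is illustrative only; the function names, the choice of k, and the termination threshold are assumptions, not the authors' CMINN code, and it assumes continuous features with more than k samples per class.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def kl_entropy(X, k=3):
    """Kozachenko-Leonenko differential entropy estimate (Chebyshev norm)."""
    n, d = X.shape
    X = X + 1e-10 * np.random.randn(n, d)   # tiny jitter to break ties
    eps = cKDTree(X).query(X, k=k + 1, p=np.inf)[0][:, k]  # dist to k-th neighbour
    return digamma(n) - digamma(k) + d * np.log(2) + d * np.mean(np.log(eps))

def mi_with_class(X, y, k=3):
    """I(X; y) = H(X) - sum_c p(c) H(X | y = c) for a discrete class y."""
    h_cond = sum(np.mean(y == c) * kl_entropy(X[y == c], k) for c in np.unique(y))
    return kl_entropy(X, k) - h_cond

def cmi(xf, Xz, y, k=3):
    """I(f; y | Z) = I([f, Z]; y) - I(Z; y)."""
    return mi_with_class(np.hstack([xf, Xz]), y, k) - mi_with_class(Xz, y, k)

def cmi_forward_selection(X, y, max_feats=10, thresh=1e-3):
    """Greedy selection; stop when no candidate adds conditional information."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_feats:
        if selected:
            scores = [cmi(X[:, [j]], X[:, selected], y) for j in remaining]
        else:
            scores = [mi_with_class(X[:, [j]], y) for j in remaining]
        best = int(np.argmax(scores))
        if scores[best] < thresh:   # termination criterion
            break
        selected.append(remaining.pop(best))
    return selected
```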

2.
Classification accuracy is the most important performance measure of a classifier, and feature subset selection is an effective way to improve it. Existing feature subset selection methods mainly target static classifiers; research on feature subset selection for dynamic classifiers is lacking. This paper first presents a dynamic naive Bayesian network classifier with continuous attributes and an evaluation criterion for dynamic classification accuracy, then builds a feature subset selection method for the dynamic naive Bayesian network classifier on that basis, and conducts experiments and analysis on real macroeconomic time series data.

3.
In pattern classification, data often contain irrelevant or redundant features that degrade classification accuracy. To obtain the best classification performance from the fewest features, a hybrid feature selection algorithm is proposed that combines the Shapley value with particle swarm optimization (PSO). The game-theoretic Shapley value is introduced into the local search of PSO: for each particle (feature subset), the contribution of every feature to the classification result (its Shapley value) is computed, and the features with the lowest Shapley values are removed step by step to refine the subset and update the particle, which also strengthens the algorithm's global search ability. The improved PSO is then applied to feature selection, with the classification performance of a support vector machine (SVM) classifier and the number of selected features serving as the criteria for evaluating feature subsets. Classification experiments on 17 medical datasets of varying dimensionality, drawn from UCI machine learning datasets and gene expression datasets, show that the proposed algorithm effectively removes more than 55% of the irrelevant or redundant features in a dataset (more than 80% for medium and large datasets), while the selected feature subsets retain good discriminative ability, raising classification accuracy by 2 to 23 percentage points.
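The exact Shapley value is exponential in the number of features, so in practice it is approximated. Below is a minimal Monte-Carlo sketch of the per-feature contribution estimate that this kind of hybrid relies on; the SVM scorer, the 3-fold CV, the permutation count, and the convention v(empty set) = 0 are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def subset_score(X, y, feats):
    """Value of a feature coalition = cross-validated SVM accuracy."""
    if not feats:
        return 0.0                      # convention: empty subset has value 0
    return cross_val_score(SVC(), X[:, feats], y, cv=3).mean()

def shapley_values(X, y, n_perm=20, rng=np.random.default_rng(0)):
    """Monte-Carlo Shapley estimate; expensive: one CV fit per prefix."""
    n_feats = X.shape[1]
    phi = np.zeros(n_feats)
    for _ in range(n_perm):
        order = rng.permutation(n_feats)
        prefix, prev = [], 0.0
        for j in order:
            cur = subset_score(X, y, prefix + [int(j)])
            phi[j] += cur - prev        # marginal contribution of feature j
            prefix.append(int(j))
            prev = cur
    return phi / n_perm
```

Features with the lowest estimated contributions are the candidates the local search would prune.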

4.
In this article, a feature selection algorithm for hyperspectral data based on a recursive support vector machine (R-SVM) is proposed. The new algorithm follows the scheme of a state-of-the-art feature selection algorithm, SVM recursive feature elimination (SVM-RFE), and uses a new ranking criterion derived from the R-SVM. Multiple SVMs are used to address the multiclass problem. The algorithm is applied to Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data to select the most informative bands, and the resulting band subsets are compared with those of SVM-RFE, using classification accuracy to evaluate the effectiveness of the feature selection. The experimental results for an agricultural case study indicate that the feature subset generated by the newly proposed algorithm is generally competitive with SVM-RFE in terms of classification accuracy and is more robust in the presence of noise.
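The SVM-RFE baseline the paper compares against is available directly in scikit-learn. A minimal sketch on synthetic stand-in data follows; the R-SVM ranking criterion itself is not part of scikit-learn and is not shown.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X = np.random.randn(200, 220)          # stand-in for 220 AVIRIS bands
y = np.random.randint(0, 5, 200)       # stand-in crop classes

svc = LinearSVC(dual=False, max_iter=5000)
rfe = RFE(estimator=svc, n_features_to_select=30, step=5)  # drop 5 bands per round
rfe.fit(X, y)
selected_bands = rfe.get_support(indices=True)             # indices of kept bands
```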

5.
Fast response to user queries is essential in large-scale search systems, and strict back-end latency constraints must be respected when computing the feature relevance of candidate documents. Feature selection improves the efficiency of the machine-learned ranker. Noting that fast feature selection methods for learning to rank usually start from the single feature with the best ranking performance, this paper first proposes an algorithm that generates the starting point for feature selection by hierarchical clustering and applies it to two existing fast feature selection methods. In addition, a new feature selection method that makes full use of the clustered features is proposed. Experiments on two benchmark datasets show that the algorithm can obtain a smaller feature subset without sacrificing accuracy and achieves the best ranking accuracy with a medium-sized subset.
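One plausible reading of the clustering step, sketched under assumptions: features are clustered by the similarity of their values (here, absolute correlation), and one representative per cluster seeds the selection. The names, the linkage method, and the cluster count are illustrative, not the paper's exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_representatives(X, n_clusters=10):
    """Cluster features by |correlation|; return one representative per cluster."""
    corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-feature similarity
    np.fill_diagonal(corr, 1.0)
    Z = linkage(squareform(1.0 - corr, checks=False), method='average')
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    reps = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # representative = member most similar on average to its own cluster
        reps.append(members[np.argmax(corr[np.ix_(members, members)].mean(axis=1))])
    return reps
```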

6.
Feature subset selection and feature ranking for multivariate time series
Feature subset selection (FSS) is a known technique for preprocessing data before performing any data mining tasks, e.g., classification and clustering. FSS provides both cost-effective predictors and a better understanding of the underlying process that generated the data. We propose a family of novel unsupervised methods for feature subset selection from multivariate time series (MTS) based on common principal component analysis, termed CLeVer. Traditional FSS techniques, such as recursive feature elimination (RFE) and the Fisher criterion (FC), have been applied to MTS data sets, e.g., brain-computer interface (BCI) data sets. However, these techniques may lose the correlation information among features, whereas our proposed techniques utilize the properties of principal component analysis to retain that information. To evaluate the effectiveness of the selected subset of features, we employ classification as the target data mining task. Our exhaustive experiments show that CLeVer outperforms RFE, FC, and random selection by up to a factor of two in classification accuracy, while taking up to two orders of magnitude less processing time than RFE and FC.
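CLeVer works on common principal components computed across many MTS items; as a simplified single-matrix stand-in, the sketch below ranks features by the magnitude of their loadings on the leading principal components. It illustrates the loading-based ranking idea only, not the common-PCA aggregation.

```python
import numpy as np

def pca_loading_ranking(X, n_components=3):
    """Rank features by their total loading on the top principal components."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = PC axes
    loadings = Vt[:n_components].T * s[:n_components]  # (feature, component)
    score = np.abs(loadings).sum(axis=1)               # contribution per feature
    return np.argsort(score)[::-1]                     # best features first
```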

7.
Feature selection for high-dimensional data based on correlation analysis and a genetic algorithm
Feature selection is one of the important problems in pattern recognition, data mining, and related fields. For high-dimensional data, feature selection can improve classification accuracy and efficiency, and it can also identify information-rich feature subsets. This paper proposes a feature selection method that combines the filter and wrapper models: features are first screened by analyzing the correlation between each feature and the class label, keeping only features strongly correlated with the label; a genetic algorithm then performs a stochastic search over the reduced feature subset, with the classification error rate of a perceptron model as the evaluation criterion. Experimental results show that the algorithm effectively finds feature subsets with good linear separability, achieving dimensionality reduction and improving classification accuracy.
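The filter stage can be illustrated with plain Pearson correlation against the class label; the cut-off value below is an assumption, not the paper's setting.

```python
import numpy as np

def correlation_filter(X, y, cutoff=0.1):
    """Keep features whose |Pearson correlation| with the label exceeds cutoff."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return np.where(np.abs(r) >= cutoff)[0]   # indices passed on to the GA stage
```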

8.
A feature selection algorithm based on information gain and a genetic algorithm
Feature selection is one of the important problems in pattern recognition, data mining, and related fields. For high-dimensional data, feature selection can improve classification accuracy and efficiency, and it can also identify information-rich feature subsets. This paper proposes a feature selection method that combines the filter and wrapper models: features are first grouped and screened according to the information gain between features, and a genetic algorithm then performs a stochastic search over the reduced feature subset, with the classification error rate of a perceptron model as the evaluation criterion. Experimental results show that the algorithm effectively finds feature subsets with good linear separability, achieving dimensionality reduction and improving classification accuracy.
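For reference, the standard information-gain computation for discrete variables is shown below; the paper's grouping procedure is not shown.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(labels) in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(f, y):
    """IG(f) = H(y) - H(y | f) for a discrete feature f and labels y."""
    h_cond = sum(np.mean(f == v) * entropy(y[f == v]) for v in np.unique(f))
    return entropy(y) - h_cond
```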

9.
In high-dimensional data classification, multicollinearity, redundant features, and noise tend to lower a classifier's recognition accuracy and inflate its time and space costs. To address this, a feature selection method is proposed that combines supervised feature extraction by partial least squares (PLS) with false nearest neighbors (FNN). First, PLS extracts principal components from the high-dimensional data, eliminating multicollinearity among the features and yielding an independent component space that carries supervisory information. Then, by computing the correlation of this space before and after each feature is removed, an FNN-based feature similarity measure is built, ranking the original features by the strength of their explanatory power for the class variable. Finally, the features with weak explanatory power are eliminated one by one, classification models are constructed at each step, and, with the recognition rate of a support vector machine (SVM) as the model evaluation criterion, the model with the highest recognition rate and the fewest features is found; the features it contains form the optimal feature subset. Simulation results on three datasets all show that the optimal subsets selected in this way coincide with the intrinsic classification features of each dataset, demonstrating that the method has good feature selection ability and offers a new approach to feature selection for data classification.
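The PLS step can be sketched with scikit-learn's PLSRegression, ranking the original features by the magnitude of their PLS regression coefficients. Treating the class codes as a numeric target and skipping the FNN similarity measure are simplifications of the paper's method.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def pls_feature_ranking(X, y, n_components=5):
    """Rank original features by |PLS coefficient| (y = numeric class codes)."""
    pls = PLSRegression(n_components=n_components)
    pls.fit(X, y)
    weight = np.abs(pls.coef_).ravel()   # one weight per original feature
    return np.argsort(weight)[::-1]      # strongest explanatory power first
```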

10.
A genetic algorithm-based method for feature subset selection
As a commonly used technique in data preprocessing, feature selection selects a subset of informative attributes or variables with which to build models describing the data. By removing redundant, irrelevant, or noisy features, feature selection can improve the predictive accuracy and the comprehensibility of the predictors or classifiers. Many feature selection algorithms with different selection criteria have been introduced, but no single criterion has proved best for all applications. In this paper, we propose a framework based on a genetic algorithm (GA) for feature subset selection that combines various existing feature selection methods. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for a particular inductive learning algorithm of interest for building the classifier. We conducted experiments using three data sets and three existing feature selection methods. The experimental results demonstrate that our approach is robust and effective in finding subsets of features with higher classification accuracy and/or smaller size than each individual feature selection algorithm yields.
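A compact GA wrapper over binary feature masks, in the spirit of this framework; the k-NN scorer, the size penalty, and all GA parameters are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, alpha=0.01):
    """CV accuracy minus a small penalty on subset size (empty mask scores 0)."""
    if not mask.any():
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()
    return acc - alpha * mask.mean()

def ga_select(X, y, pop=20, gens=10, p_mut=0.05, rng=np.random.default_rng(0)):
    d = X.shape[1]
    popu = rng.random((pop, d)) < 0.5                  # random initial masks
    for _ in range(gens):
        fit = np.array([fitness(m, X, y) for m in popu])
        # binary-tournament selection
        idx = [max(rng.choice(pop, 2, replace=False), key=lambda i: fit[i])
               for _ in range(pop)]
        parents = popu[idx]
        children = parents.copy()
        for i in range(0, pop - 1, 2):                 # single-point crossover
            cut = rng.integers(1, d)
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        children ^= rng.random((pop, d)) < p_mut       # bit-flip mutation
        popu = children
    fit = np.array([fitness(m, X, y) for m in popu])
    return np.where(popu[np.argmax(fit)])[0]           # best mask's feature indices
```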

11.
To address the curse of dimensionality and the overfitting that arise in feature selection on high-dimensional, small-sample data, a hybrid filter-wrapper feature selection method (ReFS-AGA) is proposed. The method combines the ReliefF algorithm with normalized mutual information to assess feature relevance and quickly screen out the important features. An improved adaptive genetic algorithm with an elitist strategy is then used to balance feature diversity; taking minimization of the number of features and maximization of classification accuracy as objectives, a new fitness function is designed with the feature count as a regulating term, so that the optimal feature subset is obtained efficiently during the evolutionary iterations. The reduced feature subsets were classified with different classification algorithms on gene expression data. Experimental results show that the method effectively removes irrelevant features and improves the efficiency of feature selection; compared with ReliefF and the two-stage feature selection algorithm mRMR-GA, it attains the smallest feature subset dimensionality while raising average classification accuracy by 11.18 and 4.04 percentage points, respectively.
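For orientation, here is a bare-bones version of the ReliefF weighting that ReFS-AGA starts from; the normalized-MI screening and the adaptive GA stage are not shown, and the sketch assumes more than k samples per class.

```python
import numpy as np
from scipy.spatial.distance import cdist

def relieff(X, y, k=5, m=100, rng=np.random.default_rng(0)):
    """ReliefF weights: reward features that separate classes near each sample."""
    n, d = X.shape
    m = min(m, n)
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # per-feature normalisation
    w = np.zeros(d)
    for i in rng.choice(n, size=m, replace=False):
        dist = cdist(X[i:i + 1], X).ravel()
        dist[i] = np.inf                            # exclude the sample itself
        hits = np.argsort(np.where(y == y[i], dist, np.inf))[:k]
        w -= (np.abs(X[i] - X[hits]) / span).sum(axis=0) / (m * k)
        for cls in np.unique(y):
            if cls == y[i]:
                continue
            prior = np.mean(y == cls) / (1.0 - np.mean(y == y[i]))
            misses = np.argsort(np.where(y == cls, dist, np.inf))[:k]
            w += prior * (np.abs(X[i] - X[misses]) / span).sum(axis=0) / (m * k)
    return w
```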

12.
Web page classification has become a challenging task due to the exponential growth of the World Wide Web. Uniform Resource Locator (URL)-based web page classification systems play an important role, but high accuracy may not be achievable because a URL contains minimal information. Nevertheless, a URL-based classifier combined with a rejection framework can be used as a first-level filter in a multistage classifier, with costlier feature extraction from page contents done in later stages. However, the noisy and irrelevant features present in URLs demand feature selection methods for URL classification. We therefore propose a supervised feature selection method in which relevant URL features are identified using statistical methods. We propose a new feature weighting method for a Naive Bayes classifier by embedding the term goodness obtained from the feature selection method. We also propose a rejection framework for the Naive Bayes classifier that uses the posterior probability to determine a confidence score. The proposed method is evaluated on the Open Directory Project and WebKB data sets. Experimental results show that our method can be an effective first-level filter. McNemar tests confirm that our approach significantly improves the performance.
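The rejection idea reduces to thresholding the posterior. A sketch, assuming a fitted scikit-learn Naive Bayes model; the threshold and the toy count features standing in for URL n-grams are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def classify_with_reject(clf, X, threshold=0.9):
    """Predict, but flag low-confidence rows for the costlier second stage."""
    proba = clf.predict_proba(X)
    labels = proba.argmax(axis=1)
    accepted = proba.max(axis=1) >= threshold   # confidence = max posterior
    return labels, accepted

# toy usage: term counts standing in for URL n-gram counts
X = np.random.randint(0, 5, size=(100, 50))
y = np.random.randint(0, 3, 100)
clf = MultinomialNB().fit(X, y)
labels, accepted = classify_with_reject(clf, X, threshold=0.8)
# rows with accepted == False are deferred to content-based classification
```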

13.
Finding an optimal subset of features that maximizes classification accuracy is still an open problem. In this paper, we exploit the speed of the Harmony Search algorithm and the Optimum-Path Forest classifier to propose a new fast and accurate approach to feature selection. Comparisons with several other pattern recognition and feature selection techniques showed that the proposed hybrid feature selection algorithm outperformed them. The experiments were carried out in the context of identifying non-technical losses in power distribution systems.

14.
15.
Traditional mutual-information-based feature selection methods rarely consider the interactions among features, and their complexity grows excessively as the number of features increases. A new mutual-information-based evaluation function for feature subsets is therefore proposed. It fully accounts for how features cooperate, selects a better feature subset, and improves classification accuracy at a limited computational cost. Experimental results show that, compared with the traditional MIFS method, classification accuracy improves by 3% to 5%, and the error reduction rate improves by 25% to 30%.
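For context, this is the classic MIFS-style criterion that such work refines: J(f) = I(f; c) - beta * sum of I(f; s) over already-selected features s. A sketch using scikit-learn's k-NN MI estimators; the value of beta is illustrative.

```python
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mifs_score(X, y, candidate, selected, beta=0.5):
    """MIFS criterion: relevance to the class minus redundancy with selected."""
    relevance = mutual_info_classif(X[:, [candidate]], y)[0]
    redundancy = sum(mutual_info_regression(X[:, [s]], X[:, candidate])[0]
                     for s in selected)
    return relevance - beta * redundancy
```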

16.
Given the important role of feature selection in network traffic classification, an improved genetic algorithm with CFS as its fitness function (GA-CFS) is used to determine the optimal feature subset: it extracts the principal attributes from a space of 249 network traffic attributes and finally selects a combination of 18 features as the optimal subset. The AdaBoost algorithm then boosts a series of weak classifiers into a strong classifier for an in-depth study of traffic classification. Experimental results show that the combined GA-CFS and AdaBoost classification method achieves higher classification accuracy than the weak classifiers alone.
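The boosting stage maps directly onto scikit-learn (version 1.2+; earlier versions use base_estimator instead of estimator). The data here are synthetic stand-ins for the 18 selected flow attributes.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X_sel = np.random.randn(500, 18)       # stand-in for the 18 selected attributes
y = np.random.randint(0, 4, 500)       # stand-in traffic classes

# boost decision stumps (weak learners) into a strong classifier
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100)
clf.fit(X_sel, y)
```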

17.
The problem of selecting a subset of relevant features is a classic one, found in many branches of science, including pattern recognition. In this paper, we propose a new feature selection criterion based on low-loss nearest neighbor classification and a novel feature selection algorithm that optimizes the margin of nearest neighbor classification by minimizing its loss function. A theoretical analysis based on an energy-based model is also presented, and experiments are conducted on several benchmark real-world data sets and on facial data sets for gender classification, showing that the proposed feature selection method outperforms other classic ones.

18.
Linear discriminant analysis (LDA) is a commonly used classification method that provides important weight information for constructing a classification model. However, real-world data sets generally have many features, not all of which benefit the classification results. If no feature selection algorithm is employed, unsatisfactory classification will result, owing to the high correlation between features and the presence of noise. This study shows by example that feature selection influences LDA. The methods traditionally used with LDA to determine a beneficial feature subset are not straightforward, and they cannot guarantee the best results when problems have a large number of features. Particle swarm optimization (PSO) is a powerful meta-heuristic technique in the artificial intelligence field; this study therefore proposes a PSO-based approach, called PSOLDA, to identify beneficial features and enhance the classification accuracy rate of LDA. To measure the performance of PSOLDA, many public datasets are used to measure the classification accuracy rate. Where exhaustive enumeration is feasible, PSOLDA obtains the same optimal result. Because exhaustive enumeration requires too much time when problems have a large number of features, heuristic approaches such as forward feature selection, backward feature selection, and PCA-based feature selection are used instead. This study shows that the classification accuracy rates of PSOLDA were higher than those of these approaches on many public data sets.
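A bare-bones binary PSO over feature masks with cross-validated LDA accuracy as fitness, in the spirit of PSOLDA; every hyper-parameter below is an illustrative assumption, not the paper's setting.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def lda_fitness(mask, X, y):
    """Cross-validated LDA accuracy on the masked feature set."""
    idx = mask.astype(bool)
    if not idx.any():
        return 0.0
    return cross_val_score(LinearDiscriminantAnalysis(), X[:, idx], y, cv=5).mean()

def pso_lda(X, y, n_particles=20, iters=30, w=0.7, c1=1.5, c2=1.5,
            rng=np.random.default_rng(0)):
    d = X.shape[1]
    masks = (rng.random((n_particles, d)) < 0.5).astype(float)  # 0/1 masks
    vel = np.zeros((n_particles, d))
    fit = np.array([lda_fitness(m, X, y) for m in masks])
    pbest, pbest_fit = masks.copy(), fit.copy()
    gbest = pbest[np.argmax(pbest_fit)].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, d))
        vel = w * vel + c1 * r1 * (pbest - masks) + c2 * r2 * (gbest - masks)
        keep_prob = 1.0 / (1.0 + np.exp(-vel))        # sigmoid transfer function
        masks = (rng.random((n_particles, d)) < keep_prob).astype(float)
        fit = np.array([lda_fitness(m, X, y) for m in masks])
        better = fit > pbest_fit
        pbest[better], pbest_fit[better] = masks[better], fit[better]
        gbest = pbest[np.argmax(pbest_fit)].copy()
    return np.where(gbest > 0)[0]                     # indices of kept features
```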

19.
Although the random forest (RF) algorithm is widely used and highly accurate, its classification performance is severely weakened on high-dimensional, imbalanced data. High-dimensional data usually contain many irrelevant and redundant features. To address this, an improved random forest algorithm, RW_RF (ReliefF & wrapper random forest), is proposed by combining weight ranking with recursive feature elimination. The ReliefF algorithm first assigns each feature a weight according to its ability to separate the positive and negative classes; redundant low-weight features are then removed recursively, and the feature subset with the best classification performance is used to build the random forest. The sampling scheme of ReliefF is also modified to reduce the impact of imbalanced data on the classification model. Experimental results show that, on datasets with many features, the improved algorithm outperforms the original on every evaluation metric, demonstrating that RW_RF effectively trims the feature subset, reduces the effect of redundant features on classification accuracy, and also helps when handling imbalanced data.

20.
葛倩  张光斌  张小凤 《计算机应用》2022,42(10):3046-3053
To address the poor stability of the ReliefF feature selection algorithm when nearest-neighbor samples are chosen by Euclidean distance, and the low classification accuracy of the resulting feature subsets, a MICReliefF algorithm is proposed that uses the maximal information coefficient (MIC) as the criterion for selecting nearest-neighbor samples. At the same time, taking the classification accuracy of a support vector machine (SVM) model as the evaluation index and optimizing repeatedly, the optimal feature subset is determined automatically, realizing an interactive optimization of the MICReliefF algorithm and the classification model, namely the MICReliefF-SVM automatic feature selection algorithm. The performance of the MICReliefF-SVM algorithm was verified on several UCI public datasets. Experimental results show that MICReliefF-SVM not only screens out more redundant features but also selects feature subsets with good stability and generalization ability. Compared with classic feature selection algorithms such as random forest (RF), max-relevance min-redundancy (mRMR), and correlation-based feature selection (CFS), the MICReliefF-SVM algorithm achieves higher classification accuracy.
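The interactive outer loop can be sketched as rank-then-sweep; here scikit-learn's k-NN MI estimator stands in for the MIC-based ReliefF weights, which are not shown.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def auto_subset(X, y):
    """Rank features, then sweep subset sizes and keep the best SVM CV accuracy."""
    order = np.argsort(mutual_info_classif(X, y))[::-1]   # strongest first
    best_k, best_acc = 1, 0.0
    for k in range(1, X.shape[1] + 1):
        acc = cross_val_score(SVC(), X[:, order[:k]], y, cv=5).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc
    return order[:best_k], best_acc
```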
