Similar Literature
20 similar records found
1.
With the advent of technology in various scientific fields, high-dimensional data are becoming abundant. A general approach to the resulting challenges is to reduce data dimensionality through feature selection. Traditional feature selection approaches concentrate on selecting relevant features and discarding irrelevant or redundant ones, but most of them neglect feature interactions. Moreover, some datasets have imbalanced classes, which can bias learning towards the majority class. The main goal of this paper is to propose a novel feature selection method based on interaction information (II) that provides higher-level interaction analysis and improves the search procedure in the feature space. An evolutionary feature subset selection algorithm based on interaction information is proposed, consisting of three stages. In the first stage, candidate features and candidate feature pairs are identified using traditional feature-weighting approaches such as symmetric uncertainty (SU) and bivariate interaction information. In the second stage, candidate feature subsets are formed and evaluated using multivariate interaction information. Finally, the best candidate feature subsets are selected using dominant/dominated relationships. The proposed algorithm is compared with other feature selection algorithms, including mRMR, WJMI, IWFS, IGFS, DCSF, K_OFSD, WFLNS, Information Gain, and ReliefF, in terms of the number of selected features, classification accuracy, F-measure, and algorithm stability, using three different classifiers: KNN, NB, and CART. The results demonstrate the improved classification accuracy and robustness of the proposed method in comparison with the other approaches.
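The symmetric uncertainty (SU) weighting used in the first stage has a compact standard form; the sketch below is a minimal plain-Python illustration of that formula, not the paper's implementation:

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (bits) of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for paired discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def symmetric_uncertainty(xs, ys):
    """SU(X,Y) = 2*I(X;Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    return 2 * mutual_information(xs, ys) / (hx + hy)
```

The normalization by H(X) + H(Y) is what makes SU preferable to raw mutual information for weighting: it keeps features of different cardinalities comparable on a common [0, 1] scale.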

2.
Feature selection plays an important role in data mining and pattern recognition, especially for large-scale data. Over the past years, various metrics have been proposed to measure the relevance between different features. Because mutual information is nonlinear and can effectively represent dependencies among features, it is one of the most widely used measures in feature selection, and many promising feature selection algorithms based on mutual information with different parameters have been developed. This paper first introduces a general criterion function for mutual-information-based feature selectors, which unifies most of the information measures used in previous algorithms. In traditional selectors, mutual information is estimated on the whole sample space; this, however, cannot exactly represent the relevance among features. To cope with this problem, the second contribution of this paper is a new feature selection algorithm based on dynamic mutual information, which is estimated only on unlabeled instances. To verify the effectiveness of the method, experiments are carried out on sixteen UCI datasets using four typical classifiers. The experimental results indicate that the algorithm achieves better results than the other methods in most cases.
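The idea of re-estimating mutual information only on the instances not yet covered by earlier picks can be sketched as follows. This is a simplified plain-Python illustration; the recognition rule used here (dropping instances whose value of the chosen feature co-occurs with a single class only) is an assumption standing in for the paper's exact procedure:

```python
import math
from collections import Counter, defaultdict

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mi(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def dynamic_mi_selection(X, y, k):
    """Greedy forward selection where each MI score is re-estimated
    only on the still-unrecognized instances."""
    remaining = set(range(len(X[0])))
    active = list(range(len(X)))          # still-unrecognized instances
    selected = []
    while remaining and active and len(selected) < k:
        ylab = [y[i] for i in active]
        best = max(remaining,
                   key=lambda f: mi([X[i][f] for i in active], ylab))
        selected.append(best)
        remaining.discard(best)
        classes = defaultdict(set)        # feature value -> classes seen
        for i in active:
            classes[X[i][best]].add(y[i])
        # an instance counts as recognized once its value of the chosen
        # feature occurs with one class only (illustrative assumption)
        active = [i for i in active if len(classes[X[i][best]]) > 1]
    return selected
```

Shrinking the estimation set as features are chosen is what distinguishes this scheme from static selectors, where every MI score is computed once on the full sample.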

3.
Feature Selection in Text Classification   Cited by: 21 (self-citations: 0, by others: 21)
刘丽珍  宋瀚涛 《计算机工程》2004,30(4):14-15,175
This paper studies feature selection in text classification learning, focusing on evaluation functions for large-scale dimensionality reduction, since high-dimensional feature sets are not necessarily all important or useful for classification learning. Several classification methods and their characteristics are also introduced.

4.
黄铉 《计算机科学》2018,45(Z6):16-21, 53
The quality of data features directly affects model accuracy, and feature dimensionality reduction has long drawn the attention of pattern recognition researchers. With the arrival of the big-data era, data volumes have surged and dimensionality keeps rising; traditional data mining methods degrade or even fail on high-dimensional data. Practice shows that reducing feature dimensionality before analysis is an effective way to avoid the "curse of dimensionality", and dimensionality reduction is widely applied across fields. This paper reviews the two families of reduction methods, feature extraction and feature selection, and compares their characteristics. The most representative feature selection algorithms are summarized and analyzed along two key dimensions: subset search strategy and evaluation criterion. Finally, promising research directions for feature dimensionality reduction are discussed from the perspective of practical applications.

5.
Feature selection improves the performance of learning algorithms by removing irrelevant and redundant features. Building on the strong performance of evolutionary algorithms on optimization problems, this paper proposes the FSSAC feature selection method. A new initialization strategy and evaluation function allow SAC to treat feature selection as a discrete-space search problem, using the accuracy of feature subsets to guide SAC's sampling stage. In the experiments, FSSAC is combined with SVM, J48, and KNN classifiers, validated on UCI datasets, and compared with FSFOA, HGAFS, PSO, and other algorithms. The results show that FSSAC improves classification accuracy and generalizes well. The degree of feature-space dimensionality reduction achieved by FSSAC and the other algorithms is also compared.
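An evolutionary search over feature subsets of this kind can be sketched generically as a population of binary masks evolved against a wrapper fitness. The crossover/mutation operators and the toy fitness below are illustrative assumptions; the paper's SAC operators and its classifier-accuracy fitness differ:

```python
import random

def evolve_subset(n_features, fitness, pop_size=20, gens=30, seed=0):
    """Evolve binary feature masks toward higher `fitness(mask)`.
    In a wrapper setting, `fitness` would be the cross-validated
    accuracy of a classifier trained on the masked features."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(gens):
        nxt = [best[:]]                      # elitism: keep the incumbent
        while len(nxt) < pop_size:
            a, b = rng.sample(pop, 2)        # two parents
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]        # one-point crossover
            if rng.random() < 0.2:           # occasional bit-flip mutation
                j = rng.randrange(n_features)
                child[j] ^= 1
            nxt.append(child)
        pop = nxt
        cand = max(pop, key=fitness)
        if fitness(cand) > fitness(best):
            best = cand
    return best
```

Elitism guarantees the best mask found so far is never lost, so the reported fitness is monotone non-decreasing over generations.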

6.
The results of experiments with a novel criterion for absolute non-parametric feature selection are reported. The basic idea of the new technique is to use computer graphics and the human pattern-recognition ability to interactively choose a number of features (not necessarily fixed in advance) from a larger set of measurements. The triangulation method, recently proposed in the cluster analysis literature for mapping points from l-space to 2-space, yields a simple and efficient algorithm for feature selection by interactive clustering. It is shown that a subset of features can thus be chosen which allows a significant reduction in storage and time while keeping the probability of classification error within reasonable bounds.

7.
黄源  李茂  吕建成 《计算机科学》2015,42(5):54-56, 77
The chi-square test is a widely used feature selection method in text classification. It considers only the relationship between a term and a class, ignoring associations among terms, so the selected feature set tends to be highly redundant. This paper defines the notion of a term's "residual mutual information" and proposes a method for refining the chi-square selection results, yielding a feature set that is both strongly representative and highly independent. Experiments show that the method performs well.
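The two-step idea, chi-square ranking followed by a mutual-information redundancy filter, can be sketched as follows. This is a minimal plain-Python illustration: the fixed `mi_cap` threshold stands in for the paper's "residual mutual information" criterion, whose exact form is not reproduced here:

```python
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def chi2(A, B, C, D):
    """Chi-square from a 2x2 term/class table: A = docs containing the
    term in the class, B = containing it outside the class, C/D the same
    for docs without the term."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def select_terms(term_vecs, labels, m, mi_cap=0.5):
    """Rank binary term vectors by chi2 against binary labels, then keep
    a term only if its MI with every kept term stays below mi_cap."""
    def table(vec):
        A = sum(1 for v, c in zip(vec, labels) if v and c)
        B = sum(1 for v, c in zip(vec, labels) if v and not c)
        C = sum(1 for v, c in zip(vec, labels) if not v and c)
        D = sum(1 for v, c in zip(vec, labels) if not v and not c)
        return A, B, C, D
    ranked = sorted(range(len(term_vecs)),
                    key=lambda t: chi2(*table(term_vecs[t])), reverse=True)
    kept = []
    for t in ranked:
        if len(kept) == m:
            break
        if all(mutual_information(term_vecs[t], term_vecs[g]) < mi_cap
               for g in kept):
            kept.append(t)
    return kept
```

A duplicate of an already-kept term has maximal MI with it and is rejected, which is exactly the redundancy the plain chi-square ranking cannot see.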

8.
Feature Selection Methods in Web Text Mining   Cited by: 11 (self-citations: 0, by others: 11)
和亚丽  陈立潮 《计算机工程》2005,31(5):181-182,190
This paper studies high-dimensional feature selection in Web text mining. Common feature selection and dimensionality reduction algorithms, including evaluation-function methods, principal component analysis, and simulated annealing, are analyzed theoretically and compared in performance, and their strengths, weaknesses, and applicability are discussed through experiments. The aim is to address text mining in high-dimensional spaces through dimensionality reduction.

9.
Searching for an optimal feature subset in a high-dimensional feature space is known to be an NP-complete problem. We present a hybrid algorithm, SAGA, for this task. SAGA combines the ability of simulated annealing to avoid becoming trapped in local minima with the very high convergence rate of the genetic-algorithm crossover operator, the strong local search ability of greedy algorithms, and the high computational efficiency of generalized regression neural networks. We compare the performance over time of SAGA and well-known algorithms on synthetic and real datasets. The results show that SAGA outperforms the existing algorithms.

10.
Data from many real-world applications can be high-dimensional, and the features of such data are usually highly redundant. Identifying informative features has become an important step in data mining, both to circumvent the curse of dimensionality and to reduce the amount of data to process. In this paper, we propose a novel feature selection method based on a bee colony algorithm and gradient boosting decision trees, aimed at improving both the efficiency of selection and the informative quality of the selected features. The method achieves global optimization of the decision tree's inputs by using the bee colony algorithm to identify informative features: it initializes the feature space spanned by the dataset and then suppresses less relevant features, according to the information they contribute to decision making, using an artificial bee colony algorithm. Experiments are conducted on two breast cancer datasets and six datasets from a public data repository. The results demonstrate that the proposed method effectively reduces the dimensionality of the dataset and achieves superior classification accuracy with the selected features.

11.
Vanessa  Michel  Jérôme 《Neurocomputing》2009,72(16-18):3580
The classification of functional or high-dimensional data requires selecting a reduced subset of features from the initial set, both to fight the curse of dimensionality and to help interpret the problem and the model. The mutual information criterion may be used in this context, but it suffers from the difficulty of estimation from a finite set of samples, and efficient estimators are not designed specifically for classification, which brings further drawbacks. This paper presents an estimator of mutual information that is specifically designed for classification tasks, including multi-class ones. It is combined with a recently published stopping criterion in a traditional forward feature selection procedure. Experiments on traditional benchmarks and on an industrial functional classification problem show the added value of this estimator.

12.
To address the problem that many irrelevant and redundant features can degrade classifier performance, a feature selection algorithm based on approximate Markov blankets and dynamic mutual information is proposed. The algorithm uses mutual information as the measure of feature relevance, dynamically re-estimating it on as-yet-unrecognized samples, and applies the approximate-Markov-blanket principle to remove redundant features accurately, yielding a feature subset far smaller than the original feature set. Its effectiveness is demonstrated through simulation experiments: with an SVM classifier on public UCI datasets, the algorithm is compared with DMIFS and ReliefF. The results show that the selected subset, though far smaller than the original feature set, achieves classification performance better than or close to that of the full feature set.
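The approximate-Markov-blanket test is closely related to the FCBF criterion; a sketch of that style of redundancy removal, using symmetric uncertainty as the relevance measure (a common choice; the dynamic re-estimation on unrecognized samples described above is omitted here for brevity):

```python
import math
from collections import Counter

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def su(xs, ys):
    """Symmetric uncertainty SU(X,Y) = 2*I(X;Y) / (H(X) + H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    ixy = hx + hy - entropy(list(zip(xs, ys)))
    return 2 * ixy / (hx + hy)

def fcbf_like(X, y, delta=0.0):
    """Approximate-Markov-blanket filtering: feature f is dropped when a
    stronger kept feature g satisfies SU(g, f) >= SU(f, class), i.e. g
    subsumes whatever f says about the class."""
    n = len(X[0])
    cols = [[row[f] for row in X] for f in range(n)]
    su_c = [su(cols[f], y) for f in range(n)]
    order = sorted((f for f in range(n) if su_c[f] > delta),
                   key=lambda f: -su_c[f])
    kept = []
    for f in order:
        if not any(su(cols[g], cols[f]) >= su_c[f] for g in kept):
            kept.append(f)
    return kept
```

Because features are visited in decreasing order of class relevance, each redundancy check only needs to look at the already-kept (stronger) features, which is what keeps the filter fast.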

13.
宋哲理  王超  王振飞 《计算机科学》2018,45(Z11):468-473, 479
Feature selection is a key step in text classification; classification accuracy largely depends on the quality of the selected feature terms. This paper proposes a multi-stage feature selection mechanism based on MapReduce. An improved CHI feature selection algorithm performs the initial screening, after which a mutual information method filters noise terms from the initial results and promotes high-quality feature terms. The mechanism is implemented in the MapReduce model to reduce the time cost of applying multi-stage feature selection to massive data. Experiments show that the mechanism can process large-scale data in a short time while also improving text classification accuracy.

14.
A Feature Selection Method for Spam Filtering Based on Differential Contribution   Cited by: 7 (self-citations: 0, by others: 7)
Spam filtering is essentially a two-class text classification problem, and feature selection is an important component of it. Targeting the particularities of spam filtering, this paper improves two traditional feature selection methods, document frequency and mutual information, based on the idea of "differential contribution", and designs new feature selection methods for spam filtering. Experimental results show that the differential-contribution feature selection methods effectively improve spam filtering precision.
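One plausible rendering of the "differential contribution" idea applied to document frequency is sketched below. This is a hypothetical illustration only, since the paper's exact formula is not given here: the assumption is that a term scores high when its document frequency differs sharply between the spam and ham classes:

```python
def diff_contribution_df(df_spam, df_ham, n_spam, n_ham):
    """Hypothetical 'differential contribution' score: the gap between a
    term's per-class normalized document frequencies."""
    return abs(df_spam / n_spam - df_ham / n_ham)

def rank_terms(stats, n_spam, n_ham, k):
    """stats maps term -> (df in spam, df in ham); return the k
    highest-scoring terms."""
    return sorted(stats,
                  key=lambda t: diff_contribution_df(*stats[t],
                                                     n_spam, n_ham),
                  reverse=True)[:k]
```

Note how a term that is frequent in both classes (e.g. a stop word) scores near zero, whereas plain document frequency would rank it highest; this two-class asymmetry is the point of the differential variant.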

15.
Computed tomographic (CT) colonography is a promising alternative to the traditional invasive colonoscopic methods used to detect and remove cancerous growths, or polyps, in the colon. Existing computer-aided diagnosis (CAD) algorithms for CT colonography typically employ a classifier to discriminate between true and false positives generated by a polyp-candidate detection system, based on a set of features extracted from the candidates. These classifiers, however, often suffer from a phenomenon termed the curse of dimensionality, whereby classifier performance degrades markedly as the number of features grows; a larger feature set also increases computational complexity and storage demands. This paper investigates the benefits of feature selection on a polyp-candidate database, with the aim of increasing specificity while preserving sensitivity. Two new mutual-information methods for feature selection are proposed in order to select a feature subset with optimum performance. Initial results show that the widely used support vector machine (SVM) classifier indeed performs better with a small set of features, with area under the receiver operating characteristic curve (AUC) measures reaching 0.78-0.88.

16.
Feature selection is one of the fundamental problems in pattern recognition and data mining. A popular and effective approach is based on information theory, namely on the mutual information between features and the class variable. In this paper we compare eight different mutual-information-based feature selection methods. Based on the analysis of the comparison results, we propose a new mutual-information-based method which, by taking into account both the class-dependent and class-independent correlation among features, selects a less redundant and more informative set of features. The advantage of the proposed method over the others is demonstrated by experiments on UCI datasets (Asuncion and Newman, 2010 [1]) and on object recognition.

17.
Comparison and Improvement of Feature Extraction Methods for Text Classification   Cited by: 12 (self-citations: 0, by others: 12)
Feature extraction is an important step in the text classification process, and its quality directly affects classification accuracy. This paper introduces several text feature extraction methods, including the chi-square statistic (CHI), term-class mutual information (MI), information gain (IG), and expected cross-entropy (CE), and improves their term selection strategies. To compare these methods systematically, classification experiments were conducted on the Yomiuri Shimbun text database with three representative classifiers. The results show that the chi-square statistic achieves the best accuracy, and that each of the improved feature extraction methods raises text classification accuracy.
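Of the metrics compared, information gain has the most direct implementation; the sketch below is the standard IG formula for a binary term, not the paper's improved term selection strategy:

```python
import math
from collections import Counter

def entropy_of(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(term_present, labels):
    """IG(t) = H(C) - P(t)H(C|t) - P(~t)H(C|~t) for a binary term
    indicator paired with class labels."""
    with_t = [y for p, y in zip(term_present, labels) if p]
    without = [y for p, y in zip(term_present, labels) if not p]
    n = len(labels)
    cond = 0.0
    for part in (with_t, without):
        if part:
            cond += len(part) / n * entropy_of(part)
    return entropy_of(labels) - cond
```

A term whose presence perfectly splits the classes attains the full class entropy as its gain, while a term independent of the class scores zero, which is why IG ranks discriminative terms to the top.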

18.
Using an ensemble of classifiers instead of a single classifier has been shown to improve generalization performance in many pattern recognition problems. However, the extent of the improvement depends greatly on the amount of correlation among the errors of the base classifiers; reducing those correlations while keeping the classifiers' individual performance high is therefore an important area of research. In this article, we explore Input Decimation (ID), a method that selects feature subsets for their ability to discriminate among the classes and uses these subsets to decouple the base classifiers. We provide a summary of the theoretical benefits of correlation reduction, along with results of our method on two underwater sonar datasets, three benchmarks from the Proben1/UCI repositories, and two synthetic datasets. The results indicate that, on a wide range of domains, input-decimated ensembles outperform ensembles whose base classifiers use all the input features, randomly selected feature subsets, or features created using principal components analysis.

19.
A new improved forward floating selection (IFFS) algorithm for selecting a subset of features is presented. The proposed algorithm improves on the state-of-the-art sequential forward floating selection (SFFS) algorithm by adding a search step called "replacing the weak feature": at each sequential step, it checks whether removing any feature in the currently selected subset and adding a new one in its place improves the current subset. The method finds optimal or quasi-optimal (close to optimal) solutions for many selected subsets while requiring significantly less computation than optimal feature selection algorithms. Experimental results on four different databases demonstrate that the algorithm consistently selects better subsets than other suboptimal feature selection algorithms, especially when the original number of features in the database is large.
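The added "replacing the weak feature" step amounts to a single pass over all one-for-one swaps. A plain-Python sketch, where `J` stands for whatever subset criterion the selection procedure uses (higher is better) and only a strictly improving swap is applied:

```python
def replace_weak_feature(selected, pool, J):
    """Try every swap of one selected feature for one outside feature;
    return the best subset found (the original set if no swap improves
    the criterion J)."""
    best_set = set(selected)
    best_score = J(best_set)
    for out in selected:
        for inc in pool:
            if inc in selected:
                continue
            cand = (set(selected) - {out}) | {inc}
            score = J(cand)
            if score > best_score:         # strict improvement only
                best_set, best_score = cand, score
    return best_set
```

Because the swap keeps the subset size fixed, this step escapes the situations where SFFS's separate add and remove moves each look unprofitable on their own but a combined exchange would help.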

20.
Machine learning tasks in open, dynamic environments face feature spaces that are both high-dimensional and time-varying. Existing online streaming feature selection algorithms generally consider only feature relevance and redundancy, ignoring feature interaction. Interacting features are those that appear irrelevant or only weakly relevant to the label when measured individually, yet become strongly relevant when combined with other features. This paper proposes an online streaming feature selection algorithm based on neighborhood information interaction. The algorithm has two stages, online interaction feature selection and online redundant feature elimination: it directly measures the strength of interaction between a newly arriving feature and the entire selected feature subset, and removes redundant features through a pairwise comparison mechanism. Experiments on ten datasets demonstrate the effectiveness of the proposed algorithm.
