共查询到20条相似文献,搜索用时 0 毫秒
1.
With the advent of technology in various scientific fields, high dimensional data are becoming abundant. A general approach to tackle the resulting challenges is to reduce data dimensionality through feature selection. Traditional feature selection approaches concentrate on selecting relevant features and ignoring irrelevant or redundant ones. However, most of these approaches neglect feature interactions. On the other hand, some datasets have imbalanced classes, which may result in biases towards the majority class. The main goal of this paper is to propose a novel feature selection method based on the interaction information (II) to provide higher level interaction analysis and improve the search procedure in the feature space. In this regard, an evolutionary feature subset selection algorithm based on interaction information is proposed, which consists of three stages. At the first stage, candidate features and candidate feature pairs are identified using traditional feature weighting approaches such as symmetric uncertainty (SU) and bivariate interaction information. In the second phase, candidate feature subsets are formed and evaluated using multivariate interaction information. Finally, the best candidate feature subsets are selected using dominant/dominated relationships. The proposed algorithm is compared with some other feature selection algorithms including mRMR, WJMI, IWFS, IGFS, DCSF, IWFS, K_OFSD, WFLNS, Information Gain and ReliefF in terms of the number of selected features, classification accuracy, F-measure and algorithm stability using three different classifiers, namely KNN, NB, and CART. The results justify the improvement of classification accuracy and the robustness of the proposed method in comparison with the other approaches. 相似文献
2.
Huawen Liu Author Vitae Lei Liu Author Vitae Huijie Zhang Author Vitae 《Pattern recognition》2009,42(7):1330-2058
Feature selection plays an important role in data mining and pattern recognition, especially for large scale data. During past years, various metrics have been proposed to measure the relevance between different features. Since mutual information is nonlinear and can effectively represent the dependencies of features, it is one of widely used measurements in feature selection. Just owing to these, many promising feature selection algorithms based on mutual information with different parameters have been developed. In this paper, at first a general criterion function about mutual information in feature selector is introduced, which can bring most information measurements in previous algorithms together. In traditional selectors, mutual information is estimated on the whole sampling space. This, however, cannot exactly represent the relevance among features. To cope with this problem, the second purpose of this paper is to propose a new feature selection algorithm based on dynamic mutual information, which is only estimated on unlabeled instances. To verify the effectiveness of our method, several experiments are carried out on sixteen UCI datasets using four typical classifiers. The experimental results indicate that our algorithm achieved better results than other methods in most cases. 相似文献
3.
文本分类中的特征选取 总被引:21,自引:0,他引:21
研究了文本分类学习中的特征选取,主要集中在大幅度降维的评估函数,因为高维的特征集对分类学习未必全是重要的和有用的。还介绍了分类的一些方法及其特点。 相似文献
4.
数据特征的质量会直接影响模型的准确度。在模式识别领域,特征降维技术一直受到研究者们的关注。随着大数据时代的到来,数据量巨增,数据维度不断升高。在处理高维数据时,传统的数据挖掘方法的性能降低甚至失效。实践表明,在数据分析前先对其特征进行降维是避免“维数灾难”的有效手段。降维技术在各领域被广泛应用,文中详细介绍了特征提取和特征选择两类不同的降维方法,并对其特点进行了比较。通过子集搜索策略和评价准则两个关键过程对特征选择中最具代表性的算法进行了总结和分析。最后从实际应用出发,探讨了特征降维技术值得关注的研究方向。 相似文献
5.
特征选择通过移除不相关和冗余的特征来提高学习算法的性能。基于进化算法在求解优化问题时表现出的优越性能,提出FSSAC特征选择方法。新的初始化策略和评估函数使得SAC能将特征选择作为离散空间搜索问题来解决,利用特征子集的准确率指导SAC的采样阶段。在实验阶段,FSSAC结合SVM,J48和KNN分类器,通过UCI数据集完成验证,并与FSFOA,HGAFS,PSO等算法进行了比较。实验结果表明,FSSAC可以提高分类器的分类准确率,且具有良好的泛化性能。除此之外,对FSSAC和其他算法在特征空间维度缩减情况方面做了对比。 相似文献
6.
The results of experiments with a novel criterion for absolute non-parametric feature selection are reported. The basic idea of the new technique involves the use of computer graphics and the human pattern recognition ability to interactively choose a number of features, this number not being necessarily determined in advance, from a larger set of measurements. The triangulation method, recently proposed in the cluster analysis literature for mapping points from l-space to 2-space, is used to yield a simple and efficient algorithm for feature selection by interactive clustering. It is shown that a subset of features can thus be chosen which allows a significant reduction in storage and time while still keeping the probability of error in classification within reasonable bounds. 相似文献
7.
8.
Web文本挖掘中的特征选取方法研究 总被引:11,自引:0,他引:11
研究了Web文本挖掘中的高维特征选取问题,对常见的评估函数法、主成分分析法、模拟退火法等特征选取和降维算法进行了理论分析与性能比较,通过实验对各种算法的优劣性及适用性进行了讨论。旨在通过降维处理来解决高维空间的文本挖掘问题。 相似文献
9.
Searching for an optimal feature subset from a high dimensional feature space is known to be an NP-complete problem. We present a hybrid algorithm, SAGA, for this task. SAGA combines the ability to avoid being trapped in a local minimum of simulated annealing with the very high rate of convergence of the crossover operator of genetic algorithms, the strong local search ability of greedy algorithms and the high computational efficiency of generalized regression neural networks. We compare the performance over time of SAGA and well-known algorithms on synthetic and real datasets. The results show that SAGA outperforms existing algorithms. 相似文献
10.
Data from many real-world applications can be high dimensional and features of such data are usually highly redundant. Identifying informative features has become an important step for data mining to not only circumvent the curse of dimensionality but to reduce the amount of data for processing. In this paper, we propose a novel feature selection method based on bee colony and gradient boosting decision tree aiming at addressing problems such as efficiency and informative quality of the selected features. Our method achieves global optimization of the inputs of the decision tree using the bee colony algorithm to identify the informative features. The method initializes the feature space spanned by the dataset. Less relevant features are suppressed according to the information they contribute to the decision making using an artificial bee colony algorithm. Experiments are conducted with two breast cancer datasets and six datasets from the public data repository. Experimental results demonstrate that the proposed method effectively reduces the dimensions of the dataset and achieves superior classification accuracy using the selected features. 相似文献
11.
The classification of functional or high-dimensional data requires to select a reduced subset of features among the initial set, both to help fighting the curse of dimensionality and to help interpreting the problem and the model. The mutual information criterion may be used in that context, but it suffers from the difficulty of its estimation through a finite set of samples. Efficient estimators are not designed specifically to be applied in a classification context, and thus suffer from further drawbacks and difficulties. This paper presents an estimator of mutual information that is specifically designed for classification tasks, including multi-class ones. It is combined to a recently published stopping criterion in a traditional forward feature selection procedure. Experiments on both traditional benchmarks and on an industrial functional classification problem show the added value of this estimator. 相似文献
12.
针对大量无关和冗余特征的存在可能降低分类器性能的问题,提出了一种基于近似Markov Blanket和动态互信息的特征选择算法。该算法利用互信息作为特征相关性的度量准则,并在未识别的样本上对互信息进行动态估值,利用近似Markov Blanket原理准确地去除冗余特征,从而获得远小于原始特征规模的特征子集。通过仿真试验证明了该算法的有效性。以支持向量机为分类器,在公共数据集UCI上进行了试验,并与DMIFS和ReliefF算法进行了对比。试验结果证明,该算法选取的特征子集与原始特征子集相比,以远小于原始特征规模的特征子集获得了高于或接近于原始特征集合的分类结果。 相似文献
13.
14.
15.
Computed tomographic (CT) colonography is a promising alternative to traditional invasive colonoscopic methods used in the detection and removal of cancerous growths, or polyps in the colon. Existing computer-aided diagnosis (CAD) algorithms used in CT colonography typically employ the use of a classifier to discriminate between true and false positives generated by a polyp candidate detection system based on a set of features extracted from the candidates. However, these classifiers often suffer from a phenomenon termed the curse of dimensionality, whereby there is a marked degradation in the performance of a classifier as the number of features used in the classifier is increased. In addition, an increase in the number of features used also contributes to an increase in computational complexity and demands on storage space.This paper investigates the benefits of feature selection on a polyp candidate database, with the aim of increasing specificity while preserving sensitivity. Two new mutual information methods for feature selection are proposed in order to select a subset of features for optimum performance. Initial results show that the performance of the widely used support vector machine (SVM) classifier is indeed better with the use of a small set of features, with receiver operating characteristic curve (AUC) measures reaching 0.78-0.88. 相似文献
16.
Feature selection is one of the fundamental problems in pattern recognition and data mining. A popular and effective approach to feature selection is based on information theory, namely the mutual information of features and class variable. In this paper we compare eight different mutual information-based feature selection methods. Based on the analysis of the comparison results, we propose a new mutual information-based feature selection method. By taking into account both the class-dependent and class-independent correlation among features, the proposed method selects a less redundant and more informative set of features. The advantage of the proposed method over other methods is demonstrated by the results of experiments on UCI datasets (Asuncion and Newman, 2010 [1]) and object recognition. 相似文献
17.
18.
Using an ensemble of classifiers instead of a single classifier has been shown to improve generalization performance in many
pattern recognition problems. However, the extent of such improvement depends greatly on the amount of correlation among the
errors of the base classifiers. Therefore, reducing those correlations while keeping the classifiers’ performance levels high
is an important area of research. In this article, we explore Input Decimation (ID), a method which selects feature subsets
for their ability to discriminate among the classes and uses these subsets to decouple the base classifiers. We provide a
summary of the theoretical benefits of correlation reduction, along with results of our method on two underwater sonar data
sets, three benchmarks from the Probenl/UCI repositories, and two synthetic data sets. The results indicate that input decimated
ensembles outperform ensembles whose base classifiers use all the input features; randomly selected subsets of features; and
features created using principal components analysis, on a wide range of domains.
ID="A1"Correspondance and offprint requests to: Kagan Tumer, NASA Ames Research Center, Moffett Field, CA, USA 相似文献
19.
Songyot Nakariyakul Author Vitae David P. Casasent Author Vitae 《Pattern recognition》2009,42(9):1932-1940
A new improved forward floating selection (IFFS) algorithm for selecting a subset of features is presented. Our proposed algorithm improves the state-of-the-art sequential forward floating selection algorithm. The improvement is to add an additional search step called “replacing the weak feature” to check whether removing any feature in the currently selected feature subset and adding a new one at each sequential step can improve the current feature subset. Our method provides the optimal or quasi-optimal (close to optimal) solutions for many selected subsets and requires significantly less computational load than optimal feature selection algorithms. Our experimental results for four different databases demonstrate that our algorithm consistently selects better subsets than other suboptimal feature selection algorithms do, especially when the original number of features of the database is large. 相似文献
20.
开放动态环境下的机器学习任务面临着数据特征空间的高维性和动态性。目前已有在线流特征选择算法基本仅考虑特征的重要性和冗余性,忽略了特征的交互性。特征交互是指那些本身与标签单独统计时呈现无关或弱相关,但与其他特征结合时却能与标签呈强相关的特征。基于此,提出一种基于邻域信息交互的在线流特征选择算法,该算法分为在线交互特征选择和在线冗余特征剔除两个阶段,即直接计算新到特征与整个已选特征子集的交互强弱程度,以及利用成对比较机制剔除冗余特征。在10个数据集上的实验结果表明了所提算法的有效性。 相似文献