Similar Literature
1.
葛倩  张光斌  张小凤 《计算机应用》2022,42(10):3046-3053
To address the poor stability of the ReliefF feature selection algorithm when Euclidean distance is used to pick nearest-neighbor samples, and the low classification accuracy of the feature subsets it selects, an MICReliefF algorithm that uses the Maximal Information Coefficient (MIC) as the nearest-neighbor selection criterion was proposed. In addition, the classification accuracy of a Support Vector Machine (SVM) model was used as the evaluation index and optimization was repeated to automatically determine the optimal feature subset, realizing interactive optimization of the MICReliefF algorithm and the classification model, namely the MICReliefF-SVM automatic feature selection algorithm. The performance of MICReliefF-SVM was verified on several public UCI datasets. Experimental results show that MICReliefF-SVM not only filters out more redundant features but also selects feature subsets with good stability and generalization ability. Compared with classical feature selection algorithms such as Random Forest (RF), max-relevance min-redundancy (mRMR) and Correlation-based Feature Selection (CFS), MICReliefF-SVM achieves higher classification accuracy.
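As a rough illustration of the interactive optimization described above, the Python sketch below assumes a precomputed array of feature weights standing in for the MIC-based ReliefF scores (which are not reproduced here); it sweeps the subset size and keeps the one maximizing SVM cross-validation accuracy. It is a minimal sketch, not the authors' implementation.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def select_subset_by_svm(X, y, weights, cv=5):
        # Rank features by the given (e.g. MIC-ReliefF-style) weights, then
        # sweep the subset size and keep the best SVM cross-validation score.
        order = np.argsort(weights)[::-1]
        best_k, best_acc = 1, -np.inf
        for k in range(1, X.shape[1] + 1):
            acc = cross_val_score(SVC(), X[:, order[:k]], y, cv=cv).mean()
            if acc > best_acc:
                best_k, best_acc = k, acc
        return order[:best_k], best_acc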

2.
Feature selection is an important problem for pattern classification systems. We study how to select good features according to the maximal statistical dependency criterion based on mutual information. Because of the difficulty in directly implementing the maximal dependency condition, we first derive an equivalent form, called the minimal-redundancy-maximal-relevance (mRMR) criterion, for first-order incremental feature selection. Then, we present a two-stage feature selection algorithm by combining mRMR and other more sophisticated feature selectors (e.g., wrappers). This allows us to select a compact set of superior features at very low cost. We perform an extensive experimental comparison of our algorithm and other methods using three different classifiers (naive Bayes, support vector machine, and linear discriminant analysis) and four different data sets (handwritten digits, arrhythmia, NCI cancer cell lines, and lymphoma tissues). The results confirm that mRMR leads to promising improvements in feature selection and classification accuracy.
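For reference, the first-order incremental step of the mRMR criterion described above is commonly written as (notation ours):

    \max_{x_j \in X \setminus S_{m-1}} \Big[ I(x_j; c) - \frac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j; x_i) \Big]

where I(.;.) is mutual information, c the class variable and S_{m-1} the set of m-1 already selected features: the next feature maximizes its relevance to the class minus its average redundancy with the selected set.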

3.
A novel feature selection method based on normalized mutual information
In this paper, a novel feature selection method based on the normalization of the well-known mutual information measurement is presented. Our method is derived from an existing approach, the max-relevance and min-redundancy (mRMR) approach. We propose, however, to normalize the mutual information used in the method so that neither the relevance term nor the redundancy term can dominate. We use several commonly used recognition models, including Support Vector Machine (SVM), k-Nearest-Neighbor (kNN), and Linear Discriminant Analysis (LDA), to compare our algorithm with the original mRMR and with a recently improved version of mRMR, the Normalized Mutual Information Feature Selection (NMIFS) algorithm. To avoid data-specific statements, we conduct our classification experiments using various datasets from the UCI machine learning repository. The results confirm that our feature selection method is more robust than the others with regard to classification accuracy.
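For illustration, one common way to normalize the mutual information between variables x and y (the exact normalization used in this paper may differ) is

    \hat{I}(x; y) = \frac{I(x; y)}{\min\{H(x), H(y)\}}

which bounds both the relevance and redundancy terms to [0, 1] so that neither can dominate the selection criterion; this is, for example, the normalization used by the NMIFS method mentioned above.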

4.
Mutual information (MI) is used in feature selection to evaluate two key properties of optimal features: the relevance of a feature to the class variable and the redundancy among similar features. Conditional mutual information (CMI), i.e., the MI of a candidate feature with the class variable conditioned on the features already selected, is a natural extension of MI but has so far not been applied due to estimation complications for high-dimensional distributions. We propose a nearest-neighbor estimate of CMI, appropriate for high-dimensional variables, and build an iterative scheme for sequential feature selection with a termination criterion, called CMINN. We show that CMINN is equivalent to MI filter feature selection methods, such as mRMR and MaxiMin, in the presence of solely single-feature effects, and is more appropriate for combined feature effects. We compare CMINN to mRMR and MaxiMin on simulated datasets involving combined effects and confirm the superiority of CMINN in selecting the correct features (indicated also by the termination criterion) and in giving the best classification accuracy. The application to ten benchmark databases shows that CMINN obtains the same or higher classification accuracy than mRMR and MaxiMin with a smaller cardinality of the selected feature subset.
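For reference, the conditional mutual information used as the selection score can be written as (notation ours):

    I(X; C \mid S) = H(X \mid S) - H(X \mid C, S)

where X is the candidate feature, C the class variable and S the set of already selected features. A feature scores highly only if it carries information about the class beyond what S already provides, which is what the nearest-neighbor estimator in CMINN approximates in high dimensions.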

5.
Because gene expression data are high-dimensional, noisy and have small sample sizes, gene selection has long been a major challenge in tumor classification. To improve tumor classification accuracy while keeping gene selection efficient, an adaptive particle swarm optimization (APSO) algorithm combining Relief-F and CART decision trees (R-C-APSO) is proposed. The method first uses Relief-F to quickly filter out large numbers of irrelevant genes and noise, narrowing the gene selection range; it then uses a CART decision tree as the fitness function and performs the final gene search with APSO. Experiments on six datasets show that R-C-APSO achieves high classification accuracy and fast gene selection, with good stability.
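A minimal sketch of the wrapper stage described above, assuming the Relief-F prefiltering has already reduced X to the candidate genes (the APSO update rules themselves are omitted): the fitness of one binary particle is the cross-validated accuracy of a CART tree on the genes the particle switches on.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def cart_fitness(mask, X, y, cv=5):
        # Fitness of one particle: CV accuracy of a CART tree trained on
        # the subset of genes encoded by the binary mask.
        selected = np.flatnonzero(mask)
        if selected.size == 0:
            return 0.0
        return cross_val_score(DecisionTreeClassifier(), X[:, selected], y, cv=cv).mean()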

6.
This paper proposes an object-oriented change detection method for bi-temporal high-resolution remote sensing imagery that combines neighborhood correlation images (NCI) with max-relevance min-redundancy (mRMR) feature selection. To verify the effectiveness of the method, three comparison experiments were designed: (1) mRMR feature selection alone versus no mRMR feature selection; (2) NCI combined with mRMR feature selection versus NCI alone; (3) NCI combined with mRMR feature selection versus mRMR feature selection alone. The results show that combining NCI with mRMR feature selection produces better change detection than using either NCI or mRMR alone, and better still than using neither. Keywords: remote sensing imagery, high resolution, object-oriented, change detection, neighborhood correlation image, feature selection

7.
Churn prediction in telecom has recently gained substantial interest from stakeholders because of the associated revenue losses. Predicting telecom churners is a challenging problem due to the enormous size of telecom datasets. In this regard, we propose an intelligent churn prediction system for telecom by employing an efficient feature extraction technique and ensemble methods. We use Random Forest, Rotation Forest, RotBoost and DECORATE ensembles in combination with minimum redundancy and maximum relevance (mRMR), Fisher's ratio and F-score methods to model the telecom churn prediction problem. We observe that the mRMR method returns the most explanatory features compared to Fisher's ratio and F-score, which significantly reduces the computation and helps the ensembles attain improved performance. In comparison to Random Forest, Rotation Forest and DECORATE, RotBoost in combination with mRMR features attains better prediction performance on the standard telecom datasets. The better performance of the RotBoost ensemble is largely attributed to the rotation of the feature space, which enables the base classifier to learn different aspects of the churners and non-churners. Moreover, the Adaboosting process in RotBoost also contributes to higher prediction accuracy by handling hard instances. The performance evaluation is conducted on standard telecom datasets using AUC-, sensitivity- and specificity-based measures. Simulation results reveal that the proposed approach based on RotBoost in combination with mRMR features (CP-MRB) is effective in handling the high dimensionality of telecom datasets. CP-MRB offers higher accuracy in predicting churners and is thus quite promising for modeling the challenging problem of customer churn prediction in telecom.

8.
Choosing suitable features is crucial for successful land-cover classification. For the problem of macro-scale land-cover classification with MODIS data, three typical feature selection methods were compared. The results show that the branch-and-bound (BB) method is best suited to this land-cover classification problem, while the accuracies of the ReliefF and mRMR methods are very close in the target application. The results also show that feature selection is necessary: it greatly reduces computational complexity while keeping classification accuracy unchanged or even improving it.

9.
This paper presents an effective mutual information-based feature selection approach for the EMG-based motion classification task. The wavelet packet transform (WPT) is used to decompose the four-class motion EMG signals into successive, non-overlapping sub-bands, and the energy of each sub-band is used to construct the initial full feature set. To reduce computational complexity, mutual information (MI) theory is used to obtain a reduced feature set without compromising classification accuracy. Comparison experiments with widely used feature reduction methods such as principal component analysis (PCA), sequential forward selection (SFS) and backward elimination (BE) demonstrate its superiority in terms of computation time and classification accuracy. The proposed feature extraction and reduction strategy is a filter-based algorithm that is independent of the classifier design. Since classification performance varies with the classifier, we compare fuzzy least squares support vector machines (LS-SVMs) with a conventional, widely used neural network classifier. Further experiments show that the combination of MI-based feature selection and SVM techniques outperforms other commonly used combinations such as PCA with a neural network, and that diverse motions can be identified with high accuracy.

Compared with the combination of PCA-based feature selection and a classical neural network classifier, the superior performance of the proposed scheme illustrates the potential of SVM techniques combined with WPT and MI in EMG motion classification.
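A minimal sketch of the sub-band energy feature construction described above, using PyWavelets; the wavelet ("db4") and decomposition depth are placeholder choices, and the MI-based reduction step is not shown.

    import numpy as np
    import pywt

    def wpt_energy_features(signal, wavelet="db4", level=4):
        # Decompose one EMG channel with the wavelet packet transform and
        # return the energy of each terminal sub-band as the feature vector.
        wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
        nodes = wp.get_level(level, order="freq")
        return np.array([np.sum(np.asarray(node.data) ** 2) for node in nodes])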


10.
The evaluation of feature selection methods for text classification with small sample datasets must consider classification performance, stability, and efficiency; it is thus a multiple criteria decision-making (MCDM) problem. Yet there has been little research on evaluating feature selection with MCDM methods that consider multiple criteria. Therefore, we use MCDM-based methods to evaluate feature selection methods for text classification with small sample datasets. An experimental study is designed to validate the proposed approach, comparing five MCDM methods across 10 feature selection methods, nine evaluation measures for binary classification, seven evaluation measures for multi-class classification, and three classifiers on 10 small datasets. Based on the ranked results of the five MCDM methods, we make recommendations concerning feature selection methods. The results demonstrate the effectiveness of the MCDM-based methods in evaluating feature selection.

11.
Credit risk assessment has been a crucial issue as it forecasts whether an individual will default on a loan. Classifying an applicant as a good or bad debtor helps the lender make a wise decision. Modern data mining and machine learning techniques have proved very useful and accurate for credit risk prediction and decision making. Classification is one of the most widely used techniques in machine learning. To increase the prediction accuracy of standalone classifiers while keeping overall cost to a minimum, feature selection techniques have been utilized, as feature selection removes redundant and irrelevant attributes from the dataset. This paper first introduces Bolasso (Bootstrap-Lasso), which selects consistent and relevant features from a pool of features; consistent feature selection is defined as the robustness of the selected features with respect to changes in the dataset. The Bolasso-generated shortlisted features are then fed to various classification algorithms, namely Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB) and K-Nearest Neighbors (K-NN), to test their predictive accuracy. It is observed that the Bolasso-enabled Random Forest algorithm (BS-RF) provides the best results for credit risk evaluation. The classifiers are built on a 70:30 training/test partition of three datasets (Lending Club's peer-to-peer dataset, Kaggle's Bank loan status dataset and the German credit dataset from UCI). The performance of the Bolasso-enabled classification algorithms is then compared with that of baseline feature selection methods such as Chi Square, Gain Ratio and ReliefF, and with standalone classifiers (no feature selection applied). The experimental results show that Bolasso provides phenomenal feature stability compared with the other algorithms, as assessed by the Jaccard Stability Measure (JSM). Moreover, BS-RF has good classification accuracy and outperforms the other methods in terms of AUC and accuracy, effectively improving the decision-making process of lenders.
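A minimal sketch of the Bolasso idea described above. Bach's original formulation uses the plain Lasso; an L1-penalized logistic regression is substituted here because the task is classification (binary target assumed), and the number of bootstraps, regularization strength and frequency threshold are placeholder choices.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils import resample

    def bolasso_support(X, y, n_bootstraps=32, C=0.1, threshold=1.0):
        # Fit an L1-penalized model on bootstrap resamples and keep the
        # features selected in (almost) every run.
        counts = np.zeros(X.shape[1])
        for seed in range(n_bootstraps):
            Xb, yb = resample(X, y, random_state=seed)
            clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
            counts += clf.fit(Xb, yb).coef_.ravel() != 0
        return np.flatnonzero(counts / n_bootstraps >= threshold)

The selected columns can then be passed to a Random Forest or any of the other classifiers mentioned above.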

12.
Feature selection is a common way to reduce the dimensionality of high-dimensional big data, but the multiple mutually conflicting objectives used to evaluate feature subsets are difficult to balance. To trade off these subset evaluation criteria and optimize subset performance, a feature selection framework based on multi-objective optimization of subset evaluation is proposed, with emphasis on applying multi-objective particle swarm optimization (MOPSO) to feature subset evaluation. The framework designs multi-objective functions based on subset sparsity, classification ability and information loss, then searches for feature weight vectors with a multi-objective optimization algorithm, determines the optimal vector by picking the knee point of the Pareto set of weight vectors, and finally selects features by ranking them according to this weight vector. Experiments comparing the MOPSO-based feature selection (FS_MOPSO) with four classical methods on multiple datasets show that FS_MOPSO achieves higher classification accuracy in lower-dimensional spaces with less information loss.
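Schematically, the framework searches a feature weight vector w under three simultaneous objectives; this is only the general shape of the problem, the concrete objective definitions being the paper's:

    \min_{\mathbf{w}} \; \big( f_1(\mathbf{w}),\; f_2(\mathbf{w}),\; f_3(\mathbf{w}) \big)

where f_1 measures the sparsity of the induced subset, f_2 its classification error and f_3 its information loss; the knee point of the resulting Pareto set then gives the final weight vector used for ranking.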

13.
Network Intrusion Detection Systems (NIDS) are often used to classify network traffic in an attempt to protect computer systems from various network attacks. A major component of building an efficient intrusion detection system is the preprocessing of network traffic and the identification of the essential features needed to build a robust classifier. In this study, a NIDS based on a deep learning model optimized with rule-based hybrid feature selection is proposed. The architecture is divided into three phases: hybrid feature selection, rule evaluation, and detection. Several search methods and attribute evaluators were combined for feature selection to enhance experimentation and comparison. The results showed that the number of selected features does not affect the detection accuracy of the feature selection algorithms but is directly related to the performance of the base classifier. The performance comparison showed that the proposed method outperforms other related methods, with a false alarm rate of 1.2%, an accuracy of 98.8%, and training and testing times of 7.17 s and 3.11 s, respectively. Finally, simulation experiments on standard evaluation metrics showed that the proposed method is suitable for attack classification in NIDS.

14.
雍菊亚  周忠眉 《计算机应用》2020,40(12):3478-3484
To address the complexity of removing redundancy when many features are selected, and the fact that some features are only strongly relevant to the label when combined with other features, a multi-level feature selection algorithm based on mutual information (MI_MLFS) is proposed. First, features are divided into strongly relevant, sub-strongly relevant and other features according to their relevance to the label. Second, after the strongly relevant features are selected, features with low redundancy are chosen from the sub-strongly relevant group. Finally, features that increase the relevance between the selected feature set and the label are added. MI_MLFS was compared with ReliefF, max-relevance min-redundancy (mRMR), joint mutual information (JMI), conditional mutual information maximization (CMIM) and double input symmetrical relevance (DISR) on 15 datasets; it achieved the highest classification accuracy on 13 and 11 of them with support vector machine (SVM) and classification and regression tree (CART) classifiers, respectively. Compared with these classical feature selection methods, MI_MLFS delivers better classification performance.
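A minimal sketch of the first stage described above (only the relevance grouping; the redundancy filtering among sub-strongly relevant features and the joint-relevance stage are omitted, and the two thresholds are invented for illustration):

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def group_by_relevance(X, y, strong=0.2, sub_strong=0.05):
        # Split features into strongly relevant, sub-strongly relevant and
        # remaining groups by their estimated MI with the label.
        mi = mutual_info_classif(X, y)
        strong_idx = np.flatnonzero(mi >= strong)
        sub_idx = np.flatnonzero((mi >= sub_strong) & (mi < strong))
        rest_idx = np.flatnonzero(mi < sub_strong)
        return strong_idx, sub_idx, rest_idx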

15.
With the advent of technology in various scientific fields, high-dimensional data are becoming abundant. A general approach to tackling the resulting challenges is to reduce data dimensionality through feature selection. Traditional feature selection approaches concentrate on selecting relevant features and discarding irrelevant or redundant ones; however, most of them neglect feature interactions. Furthermore, some datasets have imbalanced classes, which may bias results towards the majority class. The main goal of this paper is to propose a novel feature selection method based on interaction information (II) to provide higher-level interaction analysis and improve the search procedure in the feature space. An evolutionary feature subset selection algorithm based on interaction information is proposed, which consists of three stages. In the first stage, candidate features and candidate feature pairs are identified using traditional feature weighting approaches such as symmetric uncertainty (SU) and bivariate interaction information. In the second stage, candidate feature subsets are formed and evaluated using multivariate interaction information. Finally, the best candidate feature subsets are selected using dominant/dominated relationships. The proposed algorithm is compared with other feature selection algorithms, including mRMR, WJMI, IWFS, IGFS, DCSF, K_OFSD, WFLNS, Information Gain and ReliefF, in terms of the number of selected features, classification accuracy, F-measure and algorithm stability, using three different classifiers (KNN, NB, and CART). The results confirm the improvement in classification accuracy and the robustness of the proposed method in comparison with the other approaches.
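For reference, the bivariate interaction information between two features X_i, X_j and the class C can be written as (sign conventions vary in the literature; notation ours):

    II(X_i; X_j; C) = I(X_i, X_j; C) - I(X_i; C) - I(X_j; C) = I(X_i; X_j \mid C) - I(X_i; X_j)

A positive value indicates a synergy, i.e. the pair is more informative about the class than the sum of its parts, which is the kind of combined effect the proposed method is designed to capture.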

16.
Feature selection is the process of choosing a relevant subset of features from a high-dimensional dataset to enhance classifier performance, and much research has been devoted to it. Algorithms such as Naïve Bayes (NB), decision trees, and genetic algorithms are applied to high-dimensional datasets to select relevant features and to increase computational speed. The proposed model presents a solution for feature selection using ensemble classifier algorithms. The proposed algorithm is a combination of minimum redundancy and maximum relevance (mRMR) and the forest optimization algorithm (FOA). An ensemble of support vector machine (SVM), K-nearest neighbor (KNN), and NB classifiers is further used to enhance performance. mRMR-FOA is used to select relevant features from the various datasets, and a 21% to 24% improvement is recorded in feature selection. The ensemble classifier algorithms further improve performance and provide an accuracy of 96%.

17.
This article proposes two novel feature selection methods for dimension reduction based on max-min associated indices derived from Cramer's V-test coefficient. The proposed methods incrementally select features that simultaneously satisfy the criteria of a statistically maximal association (A) between target labels and features and a minimal association (R) with the already selected features, with respect to the Cramer's V-test value. Two indices are developed from different combinations of the A and R conditions: one maximizes A/R and the other maximizes A − λR, referred to as the MMAIQ and MMAIS methods, respectively. Since the proposed feature selection algorithms are filter methods, determining the best number of features is another important issue; this article adopts an information loss criterion that measures the variation between χ2 and β statistics to optimize the number of selected features in conjunction with the Gaussian maximum likelihood classifier (GMLC). To validate the proposed methods, experiments are conducted on both a hyperspectral image dataset and a high spatial resolution image dataset. The results demonstrate that the proposed methods provide an effective tool for feature selection and significantly improve classification accuracy. Furthermore, the proposed methods are evaluated against well-known feature selection methods, namely the mutual information-based max-dependency criterion (mRMR) and sequential forward selection (SFS), and offer better results in terms of the kappa coefficient and overall classification accuracy.
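For reference, Cramer's V between two discrete variables with an n-sample contingency table of r by c categories is

    V = \sqrt{ \frac{\chi^2}{n \, \min(r-1,\, c-1)} }

So, with A the V-value between a candidate feature and the labels and R a measure of its association with the already selected features (the precise aggregation is the paper's), MMAIQ picks the feature maximizing A/R while MMAIS picks the one maximizing A − λR.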

18.
Over the last few years, the dimensionality of datasets involved in data mining applications has increased dramatically. In this situation, feature selection becomes indispensable as it allows for dimensionality reduction and relevance detection. The research proposed in this paper broadens the scope of feature selection by taking into consideration not only the relevance of the features but also their associated costs. A new general framework is proposed, which consists of adding a new term to the evaluation function of a filter feature selection method so that the cost is taken into account. Although the proposed methodology could be applied to any feature selection filter, in this paper the approach is applied to two representative filter methods: Correlation-based Feature Selection (CFS) and Minimal-Redundancy-Maximal-Relevance (mRMR), as an example of use. The behavior of the proposed framework is tested on 17 heterogeneous classification datasets, employing a Support Vector Machine (SVM) as a classifier. The results of the experimental study show that the approach is sound and that it allows the user to reduce the cost without compromising the classification error.
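Schematically, the framework replaces the filter's merit J(x_j) for a candidate feature x_j by a cost-aware score of the form (one natural way to write it; the paper's exact formulation may differ):

    J_{\text{cost}}(x_j) = J(x_j) - \lambda \, c(x_j)

where c(x_j) is the cost of acquiring or computing the feature and λ trades classification merit against cost; setting λ = 0 recovers the original CFS or mRMR criterion.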

19.
When analyzing high-dimensional data such as images, gene expression data and text, redundant features greatly increase the complexity of the analysis, so removing them before analysis is particularly important. Feature selection methods based on mutual information (MI) can effectively reduce data dimensionality and improve the accuracy of analysis results; however, existing methods judge whether a feature is redundant by a single criterion, cannot reasonably exclude redundant features, and ultimately degrade the results. To address this, a feature selection method based on maximum joint conditional mutual information (MCJMI) is proposed. MCJMI considers two factors when selecting features, the overall joint mutual information and the conditional mutual information, and fuses them to strengthen the selection constraint. In average prediction accuracy, MCJMI improves on information gain (IG) and min-redundancy max-relevance (mRMR) feature selection by 6 percentage points, on joint mutual information (JMI) and joint mutual information maximization (JMIM) by 2 percentage points, and on the LW sequential forward search method (SFS-LW) by 1 percentage point. In stability, MCJMI reaches 0.92, better than JMI, JMIM and SFS-LW. The experimental results show that MCJMI effectively improves the accuracy and stability of feature selection.
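For reference, the two quantities being fused are, for a candidate feature x_j, a selected feature x_i and the class c (notation ours; the abstract does not give the exact combination rule):

    joint MI:        I(x_j, x_i; c)
    conditional MI:  I(x_j; c \mid x_i)

JMI-style criteria sum the first quantity over the selected set, while CMIM-style criteria take the minimum of the second; MCJMI constrains the selection with both.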
