Similar literature
 Found 20 similar documents; search took 302 ms
1.
Searching for an optimal feature subset in a high-dimensional feature space is an NP-complete problem; hence, traditional optimization algorithms are inefficient when solving large-scale feature selection problems. Meta-heuristic algorithms are therefore widely adopted to solve such problems efficiently. This study proposes a regression-based particle swarm optimization for the feature selection problem. The proposed algorithm increases population diversity and avoids becoming trapped in local optima by improving the jump ability of flying particles. Data sets collected from the UCI machine learning databases are used to evaluate the effectiveness of the proposed approach, with classification accuracy as the criterion for classifier performance. Results show that the proposed approach outperforms both genetic algorithms and sequential search algorithms.

2.
In text classification based on a vector space model, the high dimension of the feature space may pose problems. These problems arise not only for computational reasons but also because of overfitting. Feature selection is an important preprocessing step in text classification applications: it reduces the vector space size, controls the computational time, and maintains or improves performance. In this study, we use an embedded approach to feature selection in which the Chi-square (CHI) feature selector serves as the filter step. In this step, the less discriminative features are discarded. In the wrapper step, a novel algorithm is proposed that combines the fast global search ability of the genetic algorithm (GA) with the positive feedback mechanism of ant colony optimization (ACO). To validate our approach, we carried out a series of experiments on the Reuters-21578 corpus and compared the results with those of other well-known techniques. The evaluation results show that our method outperformed the other methods in the majority of cases.
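
The CHI filter step described in this abstract can be illustrated with a small sketch. It assumes the textbook chi-square statistic over a 2x2 term/class contingency table of document counts. The abstract does not give the authors' exact scoring or threshold, so the function names `chi_square` and `chi_filter` and the `keep` parameter are hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for one term and one class (2x2 table of document counts).

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    # A zero marginal means the term or class never varies; score it as uninformative.
    return numerator / denominator if denominator else 0.0


def chi_filter(counts, keep):
    """Return the `keep` terms with the highest chi-square scores (illustrative filter step).

    counts maps each term to its (n11, n10, n01, n00) tuple.
    """
    scores = {term: chi_square(*table) for term, table in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]


if __name__ == "__main__":
    toy_counts = {
        "a": (20, 0, 0, 20),    # perfectly aligned with the class
        "b": (10, 10, 10, 10),  # independent of the class
        "c": (15, 5, 5, 15),    # partially aligned with the class
    }
    print(chi_filter(toy_counts, keep=2))
```

In a filter-then-wrapper pipeline of this kind, only the terms that survive the filter would be passed on to the more expensive wrapper search.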

3.
Feature selection is the process of choosing a relevant subset of features from a high-dimensional dataset to enhance the performance of the classifier. Much research has been carried out on feature selection. Algorithms such as Naïve Bayes (NB), decision tree, and genetic algorithm are applied to high-dimensional datasets to select the relevant features and to increase computational speed. The proposed model presents a solution for feature selection using ensemble classifier algorithms. The proposed algorithm combines minimum redundancy and maximum relevance (mRMR) with the forest optimization algorithm (FOA). Ensemble-based algorithms such as support vector machine (SVM), K-nearest neighbor (KNN), and NB are further used to enhance the performance of the classifier. The mRMR-FOA is used to select the relevant features from the various datasets, and a 21% to 24% improvement in feature selection is recorded. The ensemble classifier algorithms further improve the performance of the algorithm and achieve an accuracy of 96%.

4.
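To address the curse of dimensionality that high-dimensional data cause in classification, dimensionality reduction is a major step in data preprocessing. Feature selection based on sparse learning is a current research hotspot. To handle the many problems in practice that are not linearly separable, the kernel trick is used to map such data samples into a kernel space, which addresses nonlinear similarity among features. The samples in the kernel space are then sparsely reconstructed to obtain a concise sparse representation of the original data in that space, and a corresponding scoring mechanism is built to select the optimal subset. Benefiting from the natural discriminative power of sparse learning, the algorithm can select "good" features that preserve the structural properties of the original data, thereby reducing the computational complexity of the learning model and improving classification accuracy. Experimental results on standard UCI datasets show that its performance is about 5% better on average than that of comparable algorithms.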

5.
张乐园  李佳烨  李鹏清 《计算机应用》2018,38(12):3444-3449
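High-dimensional data often exhibit nonlinearity, low-rank structure, and redundant attributes. To address these problems, a kernel-based unsupervised attribute selection algorithm with attribute self-representation is proposed: the low-rank constrained nonlinear attribute selection algorithm (LRNFS). First, each attribute dimension is mapped into a high-dimensional kernel space, so that linear attribute selection in the kernel space realizes nonlinear attribute selection in the low-dimensional space. Then, a bias term is introduced into the self-representation form, and low-rank and sparse processing is applied to the coefficient matrix. Finally, a sparse regularization factor on the coefficient vector of the kernel matrices is introduced to perform attribute selection. In the proposed algorithm, the kernel matrices capture the nonlinear relationships, the low-rank constraint takes the global information of the data into account for subspace learning, and the self-representation form determines the importance of each attribute. Experimental results show that, compared with the semi-supervised feature selection algorithm based on rescaled linear square regression (RLSR), the classification accuracy after attribute selection with the proposed algorithm improves by 2.34%. The proposed algorithm solves the problem that data are not linearly separable in the low-dimensional feature space and improves the accuracy of attribute selection.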

6.
A hyperplane based indexing technique for high-dimensional data   Total citations: 1, self-citations: 0, citations by others: 1
In this paper, we propose a novel hyperplane based indexing method to support efficient processing of similarity search queries in high-dimensional spaces. The main idea of the proposed index is to improve data partitioning efficiency in a high-dimensional space by using a hyperplane, which further partitions a subspace and can also take advantage of the twin node concept used in the key dimension based index. Compared with the key dimension concept, the hyperplane is more effective in data filtering. High space utilization is achieved by dynamically performing data reallocation between twin nodes. In addition, a post-processing step is used after index building to ensure effective filtration. Extensive experiments on two types of real data sets are conducted, and the results show significantly improved filtering efficiency. Because of the nature of the hyperplane, the proposed indexing method is suitable only for Euclidean spaces.

7.
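Solving large-scale feature selection problems usually faces two major challenges: first, true labels are scarce, which makes it hard to guide the algorithm in selecting features; second, the search space is large, which makes it hard to find satisfactory high-quality solutions. To this end, a novel self-supervised data-driven particle swarm optimization algorithm for large-scale feature selection is proposed. First, a new algorithmic framework for self-supervised data-driven feature selection is proposed, which can select features without relying on true labels. Second, a search strategy based on discrete region encoding is proposed to help the algorithm find better solutions in a large-scale search space. Third, building on this framework and strategy, a self-supervised data-driven particle swarm optimization algorithm is proposed to solve the problem. Experimental results on large-scale feature datasets show that the proposed algorithm performs comparably to mainstream supervised algorithms and achieves higher feature selection efficiency than state-of-the-art unsupervised algorithms.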

8.
Protein function prediction is an important problem in functional genomics. Typically, protein sequences are represented by feature vectors. A major problem of protein datasets that increases the complexity of classification models is their large number of features. Feature selection (FS) techniques are used to deal with this high-dimensional feature space. In this paper, we propose a novel feature selection algorithm that combines genetic algorithms (GA) and ant colony optimization (ACO) for faster and better search capability. The hybrid algorithm exploits the advantages of both the ACO and GA methods. The proposed algorithm is easy to implement and, because it uses a simple classifier, its computational complexity is very low. Its performance is compared with that of two prominent population-based algorithms, ACO and genetic algorithms. Experiments are carried out on two challenging biological datasets, involving the hierarchical functional classification of GPCRs and enzymes. The comparison criteria are maximizing predictive accuracy and finding the smallest subset of features. The experimental results indicate the superiority of the proposed algorithm.

9.
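The feature space of data often changes dynamically over time while the number of training samples stays fixed, and an ultra-high-dimensional feature space is usually accompanied by class imbalance in the decision space. To address this, an online streaming feature selection algorithm for high-dimensional class-imbalanced data based on the maximum decision boundary is proposed. Using the neighborhood rough set model and taking full account of the influence of boundary samples, an adaptive neighborhood relation is defined and a formula for rough dependency based on the maximum decision boundary is designed. In addition, three online feature subset evaluation criteria are proposed for selecting features with strong discriminative power between the majority and minority classes. Experiments on 11 high-dimensional class-imbalanced datasets show that, under the same experimental environment and number of features, the proposed algorithm achieves better overall performance.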

10.
Most feature selection algorithms based on information-theoretic learning (ITL) adopt a ranking process or greedy search as their search strategy. The former selects features individually and therefore ignores feature interactions and dependencies. The latter relies heavily on the search path, as only one path is explored with no possibility of backtracking. In addition, both strategies typically lead to heuristic algorithms. To cope with these problems, this article proposes a novel feature selection framework based on correntropy in ITL, namely correntropy based feature selection using binary projection (BPFS). Our framework selects features by projecting the original high-dimensional data to a low-dimensional space through a special binary projection matrix. The formulated objective function aims at maximizing the correntropy between the selected features and the class labels. This function can be efficiently optimized via standard mathematical tools. We apply the half-quadratic method to optimize the objective function iteratively, where each iteration reduces to an assignment subproblem that can be solved highly efficiently with off-the-shelf toolboxes. Comparative experiments on six real-world datasets indicate that our framework is effective and efficient.
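
For readers unfamiliar with correntropy, here is a minimal sketch of its usual empirical estimator: the sample mean of a Gaussian kernel applied to the pairwise differences of two equal-length sequences. The abstract does not state which kernel or normalization BPFS uses, so the unnormalized Gaussian kernel, the function name `correntropy`, and the bandwidth parameter `sigma` are assumptions made for illustration.

```python
import math


def correntropy(x, y, sigma=1.0):
    """Empirical correntropy of two equal-length sequences.

    Uses an unnormalized Gaussian kernel (an assumption of this sketch), so the
    value lies in (0, 1] and equals 1.0 when the sequences are identical.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    total = sum(math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for a, b in zip(x, y))
    return total / len(x)


if __name__ == "__main__":
    labels = [0.0, 1.0, 1.0, 0.0]
    close_feature = [0.1, 0.9, 1.1, 0.0]
    far_feature = [1.0, 0.0, 0.0, 1.0]
    # A feature that tracks the labels scores higher than one that opposes them.
    print(correntropy(close_feature, labels), correntropy(far_feature, labels))
```

Because the kernel saturates for large differences, a few outlying samples change the score only slightly, which is one common reason for preferring correntropy to squared error.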

11.
Feature selection is a process that supports model extraction by identifying necessary or related features and improves generalization. The Artificial Bee Colony (ABC) algorithm is one of the most popular optimization algorithms inspired by swarm intelligence, developed by simulating the search behavior of honey bees. Artificial Bee Colony Programming (ABCP) is a recently proposed high-level automatic programming technique for Symbolic Regression (SR) problems based on the ABC algorithm. In this paper, a new feature selection method based on ABCP, Multi Hive ABCP (MHABCP), is proposed for high-dimensional SR problems. The learning ability and generalization performance of the proposed MHABCP are investigated using synthetic and real high-dimensional SR datasets and compared with the basic ABCP and GP automatic programming methods. Experimental results show that MHABCP performs better than the other methods both at choosing relevant features in high-dimensional SR problems and at generalization.

12.
Linear discriminant analysis (LDA) is a dimension reduction method which finds an optimal linear transformation that maximizes the class separability. However, in undersampled problems where the number of data samples is smaller than the dimension of the data space, it is difficult to apply LDA due to the singularity of scatter matrices caused by high dimensionality. In order to make LDA applicable, several generalizations of LDA have been proposed recently. In this paper, we present theoretical and algorithmic relationships among several generalized LDA algorithms and compare their computational complexity and performance in text classification and face recognition. Towards a practical dimension reduction method for high-dimensional data, an efficient algorithm is proposed, which greatly reduces the computational complexity while achieving competitive prediction accuracy. We also present nonlinear extensions of these LDA algorithms based on kernel methods. It is shown that a generalized eigenvalue problem can be formulated in the kernel-based feature space, and generalized LDA algorithms are applied to solve the generalized eigenvalue problem, resulting in nonlinear discriminant analysis. The performance of these linear and nonlinear discriminant analysis algorithms is compared extensively.

13.
The selection of informative and non-redundant features has become a prominent step in pattern classification. However, despite intensive research, identifying valuable feature subsets remains an open issue, especially in high-dimensional feature spaces. This paper proposes a wrapper feature selection method in the context of support vector machines (SVMs), named Wr-SVM-FuzCoC. Our method effectively combines the advantages of the wrapper and filter approaches, achieving three goals simultaneously: classification performance, dimensionality reduction, and computational efficiency. In the filter part, a forward feature search methodology is developed, driven by a fuzzy complementary criterion, whereby at each iteration the feature selected is the one that exhibits the maximum additional contribution with regard to the previously selected subset. The quality of single features or feature subsets is assessed via a fuzzy local evaluation criterion with respect to patterns. This is achieved by the so-called fuzzy partition vector (FPV), comprising the fuzzy membership grades of every pattern in their target classes. Derivation of the feature FPVs is accomplished by incorporating a fuzzy output kernel-based support vector machine. The proposed method compares favorably with existing SVM-based wrapper methods in terms of performance capability and computational speed. Experimental investigation is carried out using a diverse pool of real datasets, including moderate and high-dimensional feature spaces.

14.
高茂庭  陆鹏 《计算机应用》2008,28(6):1411-1413
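Using a genetic algorithm to optimize the projection direction, the projection pursuit model projects high-dimensional text feature data onto a low-dimensional (two- or three-dimensional) visualization space and uses the projected feature values of the high-dimensional data in this low-dimensional space to reflect their linear and nonlinear structure or features, thereby reducing dimensionality and enabling visualization of text data features. This not only greatly reduces the computational complexity of the text mining process, but also helps determine the number of initial centers in the K-means clustering algorithm, which improves the accuracy of the algorithm. Experiments verify the effectiveness of this method for dimensionality reduction of text features.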

15.
Feature selection is an important method of data preprocessing in data mining. In this paper, a novel feature selection method based on multi-fractal dimension and the harmony search algorithm is proposed. Multi-fractal dimension is adopted as the evaluation criterion for feature subsets, which can determine the number of selected features. An improved harmony search algorithm is used as the search strategy to improve the efficiency of feature selection. The performance of the proposed method is compared with that of other feature selection algorithms on UCI datasets. In addition, the proposed method is used to predict the daily average concentration of PM2.5 in China. Experimental results show that the proposed method obtains competitive results in terms of both prediction accuracy and the number of selected features.

16.

Feature selection is one of the significant steps in classification tasks. It is a pre-processing step that selects a small subset of significant features that contribute the most to the classification process. Many metaheuristic optimization algorithms have been successfully applied to feature selection. The genetic algorithm (GA), as a fundamental optimization tool, has been widely used in feature selection tasks. However, GA suffers from hyperparameter setting, high computational complexity, and the randomness of the selection operation. Therefore, we propose a new rival genetic algorithm, as well as a fast version of it, to enhance the performance of GA in feature selection. The proposed approaches use a competition strategy that combines new selection and crossover schemes, which aim to improve the global search capability. Moreover, a dynamic mutation rate is proposed to enhance the search behaviour of the algorithm in the mutation process. The proposed approaches are validated on 23 benchmark datasets collected from the UCI machine learning repository and Arizona State University. Compared with its competitors, the proposed approach provides highly competitive results and outperforms other algorithms in feature selection.


17.
Feature selection in high-dimensional data is one of the active areas of research in pattern recognition. Most algorithms in this area try to select a subset of features that maximizes classification accuracy, regardless of the number of selected features, which affects classification time. In this article, a new feature selection method for high-dimensional data is proposed that can control the trade-off between accuracy and classification time. This method is based on a greedy metaheuristic algorithm called greedy randomized adaptive search procedure (GRASP). It uses an extended version of a simulated annealing (SA) algorithm for local search. In this version of SA, new parameters are embedded that allow the algorithm to control the trade-off between accuracy and classification time. Experimental results show the superiority of the proposed method over previous versions of GRASP for feature selection. They also show how the trade-off between accuracy and classification time can be controlled by the parameters introduced in the proposed method.

18.
IDR/QR: an incremental dimension reduction algorithm via QR decomposition   Total citations: 1, self-citations: 0, citations by others: 1
Dimension reduction is a critical data preprocessing step for many database and data mining applications, such as efficient storage and retrieval of high-dimensional data. In the literature, a well-known dimension reduction algorithm is linear discriminant analysis (LDA). The common aspect of previously proposed LDA-based algorithms is the use of singular value decomposition (SVD). Due to the difficulty of designing an incremental solution for the eigenvalue problem on the product of scatter matrices in LDA, there has been little work on designing incremental LDA algorithms that can efficiently incorporate new data items as they become available. In this paper, we propose an LDA-based incremental dimension reduction algorithm, called IDR/QR, which applies QR decomposition rather than SVD. Unlike other LDA-based algorithms, this algorithm does not require the whole data matrix in main memory. This is desirable for large data sets. More importantly, with the insertion of new data items, the IDR/QR algorithm can constrain the computational cost by applying efficient QR-updating techniques. Finally, we evaluate the effectiveness of the IDR/QR algorithm in terms of classification error rate on the reduced dimensional space. Our experiments on several real-world data sets reveal that the classification error rate achieved by the IDR/QR algorithm is very close to the best possible one achieved by other LDA-based algorithms. However, the IDR/QR algorithm has much less computational cost, especially when new data items are inserted dynamically.

19.
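Because traditional attribute selection algorithms cannot capture the relationships between attributes, a nonlinear attribute selection method is proposed. By introducing a kernel function, the method projects the original dataset into a high-dimensional kernel space; because computation is carried out in the kernel space, the relationships between data attributes can be taken into account. Owing to the advantages of kernel functions, the computational complexity can be kept small even when a Gaussian kernel projects the data into an infinite-dimensional space. For the regularization term, two norms are used as a double constraint, which not only improves the accuracy of the algorithm but also yields a variance of only 0.74 in the experimental results, far smaller than that of other comparable algorithms, so the algorithm is more stable. The proposed algorithm is compared with 6 comparable algorithms on 8 commonly used datasets, with an SVM classifier used to measure classification accuracy; it achieves an improvement of at least 1.84%, at most 3.27%, and 2.75% on average.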
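
The remark about the Gaussian kernel can be made concrete with a short sketch. The kernel value for two samples is computed directly from their squared distance, so the infinite-dimensional mapping is never formed and only an n-by-n matrix is stored for n samples. The function name `gaussian_kernel_matrix` and the width parameter `gamma` are assumptions for illustration, and the abstract does not say whether the authors apply the kernel to samples or to attribute columns.

```python
import math


def gaussian_kernel_matrix(rows, gamma=1.0):
    """Gaussian (RBF) kernel matrix K[i][j] = exp(-gamma * ||rows[i] - rows[j]||^2).

    rows is a list of equal-length numeric vectors. The cost is O(n^2 * d) for
    n vectors of dimension d, regardless of the dimension of the implicit
    feature space.
    """
    def kernel(u, v):
        squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
        return math.exp(-gamma * squared_distance)

    return [[kernel(u, v) for v in rows] for u in rows]


if __name__ == "__main__":
    samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
    for row in gaussian_kernel_matrix(samples, gamma=0.5):
        print(["%.4f" % value for value in row])
```

To apply the same idea per attribute rather than per sample, one would pass the columns of the data matrix as `rows`.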

20.
Genetic algorithms (GAs) have been used as conventional methods for classifiers to adaptively evolve solutions for classification problems. Feature selection plays an important role in finding relevant features in classification. In this paper, feature selection is explored with modular GA-based classification. A new feature selection technique, relative importance factor (RIF), is proposed to find less relevant features in the input domain of each class module. Removing these features is intended to reduce both the classification error and the dimensionality of classification problems. Benchmark classification data sets are used to evaluate the proposed approach. The experimental results show that RIF can be used to find less relevant features and helps achieve lower classification error with a reduced feature space dimension.
