Similar articles
Found 20 similar articles; search time: 15 ms
1.
Reducing the dimensionality of the data has been a challenging task in data mining and machine learning applications. In these applications, the existence of irrelevant and redundant features negatively affects the efficiency and effectiveness of different learning algorithms. Feature selection is one of the dimension reduction techniques, which has been used to allow a better understanding of data and improve the performance of other learning tasks. Although the selection of relevant features has been extensively studied in supervised learning, feature selection in the absence of class labels is still a challenging task. This paper proposes a novel method for unsupervised feature selection, which efficiently selects features in a greedy manner. The paper first defines an effective criterion for unsupervised feature selection that measures the reconstruction error of the data matrix based on the selected subset of features. The paper then presents a novel algorithm for greedily minimizing the reconstruction error based on the features selected so far. The greedy algorithm is based on an efficient recursive formula for calculating the reconstruction error. Experiments on real data sets demonstrate the effectiveness of the proposed algorithm in comparison with the state-of-the-art methods for unsupervised feature selection.
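
The greedy criterion described in this abstract can be illustrated with a small sketch. It is not the paper's algorithm or its recursive formula: it is a plain deflation-based variant written for this listing, and the names `dot` and `greedy_select` are hypothetical. At each step it adds the column that most reduces the squared Frobenius reconstruction error of the data matrix, then removes that column's direction from the residual.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def greedy_select(A, k):
    """Greedily pick up to k column indices of A (a list of rows).

    Returns (selected, error), where error is the squared Frobenius norm
    of A minus its projection onto the span of the selected columns.
    """
    n, d = len(A), len(A[0])
    # Residual matrix stored column-wise: E[j] is residual column j.
    E = [[A[i][j] for i in range(n)] for j in range(d)]
    selected = []
    for _ in range(k):
        best_j, best_gain = None, 0.0
        for j in range(d):
            norm2 = dot(E[j], E[j])
            if j in selected or norm2 < 1e-12:
                continue  # already chosen, or already fully reconstructed
            # Drop in reconstruction error if residual column j is added.
            gain = sum(dot(E[j], E[c]) ** 2 for c in range(d)) / norm2
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break  # nothing left to explain
        # Deflate: remove the chosen direction from every residual column.
        e = E[best_j][:]
        norm2 = dot(e, e)
        for c in range(d):
            coef = dot(e, E[c]) / norm2
            E[c] = [x - coef * y for x, y in zip(E[c], e)]
        selected.append(best_j)
    error = sum(dot(col, col) for col in E)
    return selected, error


# Toy data (an assumption for illustration): column 1 is twice column 0,
# column 2 is independent of both.
A = [[1.0, 2.0, 0.0],
     [2.0, 4.0, 0.0],
     [0.0, 0.0, 1.0]]
selected, error = greedy_select(A, 2)
```

On this toy matrix the sketch selects columns 0 and 2 and skips the redundant column 1, since column 1 adds nothing once column 0 has been chosen.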

2.
Feature selection is an important preprocessing step for dealing with high-dimensional data. In this paper, we propose a novel unsupervised feature selection method by embedding a subspace learning regularization (i.e., principal component analysis (PCA)) into the sparse feature selection framework. Specifically, we select informative features via the sparse learning framework and preserve the principal components (i.e., the maximal variance) of the data at the same time, thereby improving the interpretability of the feature selection model. Furthermore, we propose an effective optimization algorithm to solve the proposed objective function, which achieves a stable optimal result with fast convergence. Compared with five state-of-the-art unsupervised feature selection methods on six benchmark and real-world datasets, our proposed method achieved the best classification performance.

3.
Dimensionality reduction is an important and challenging task in machine learning and data mining. Feature selection and feature extraction are two commonly used techniques for decreasing the dimensionality of the data and increasing the efficiency of learning algorithms. Specifically, feature selection realized in the absence of class labels, namely unsupervised feature selection, is challenging and interesting. In this paper, we propose a new unsupervised feature selection criterion developed from the viewpoint of subspace learning, which is treated as a matrix factorization problem. The advantages of this work are four-fold. First, building on the technique of matrix factorization, a unified framework is established for feature selection, feature extraction and clustering. Second, an iterative update algorithm is provided via matrix factorization, which is an efficient technique to deal with high-dimensional data. Third, an effective method for feature selection with numeric data is put forward, instead of relying on a discretization process. Fourth, this new criterion provides a sound foundation for embedding kernel tricks into feature selection. In this regard, an algorithm based on kernel methods is also proposed. The algorithms are compared with four state-of-the-art feature selection methods using six publicly available datasets. Experimental results demonstrate that, in terms of clustering results, the two proposed algorithms perform better than the others on almost all of the datasets tested.

4.
Exploratory data analysis methods are essential for getting insight into data. Identifying the most important variables and detecting quasi-homogeneous groups of data are problems of interest in this context. Solving such problems is a difficult task, mainly due to the unsupervised nature of the underlying learning process. Unsupervised feature selection and unsupervised clustering can be successfully approached as optimization problems by means of global optimization heuristics if an appropriate objective function is considered. This paper introduces an objective function capable of efficiently guiding the search for significant features and simultaneously for the respective optimal partitions. Experiments conducted on complex synthetic data suggest that the function we propose is unbiased with respect to both the number of clusters and the number of features.

5.
This paper proposes a novel unsupervised feature selection method that combines self-representation and subspace learning. In this method, we adopt the idea of self-representation and use all the features to represent each feature. A Frobenius norm regularization is used for feature selection since it can overcome the over-fitting problem. The Locality Preserving Projection (LPP) is used as a regularization term, as it can maintain the local adjacency relations between data points when performing feature space transformation. Further, a low-rank constraint is introduced to find the effective low-dimensional structures of the data, which can reduce the redundancy. Experimental results on real-world datasets verify that the proposed method can select the most discriminative features and outperforms the state-of-the-art unsupervised feature selection methods in terms of classification accuracy, standard deviation, and coefficient of variation.

6.
Qian Youcheng, Yin Xueyan, Gao Wei. Multimedia Tools and Applications, 2019, 78(23): 33593-33615
Multimedia Tools and Applications - Feature selection aims to select the optimal feature subset which can reduce time complexity, save storage space and improve the performances of various tasks....

7.
Many learning problems require handling high dimensional datasets with a relatively small number of instances. Learning algorithms are thus confronted with the curse of dimensionality, and need to address it in order to be effective. Examples of these types of data include the bag-of-words representation in text classification problems and gene expression data for tumor detection/classification. Usually, among the high number of features characterizing the instances, many may be irrelevant (or even detrimental) for the learning tasks. It is thus clear that there is a need for adequate techniques for feature representation, reduction, and selection, to improve both the classification accuracy and the memory requirements. In this paper, we propose combined unsupervised feature discretization and feature selection techniques, suitable for medium and high-dimensional datasets. The experimental results on several standard datasets, with both sparse and dense features, show the efficiency of the proposed techniques as well as improvements over previous related techniques.

8.
9.

In hyperspectral image (HSI) analysis, high-dimensional data may contain noisy, irrelevant and redundant information. To mitigate the negative effect of this information, feature selection is one of the useful solutions. Unsupervised feature selection is a data preprocessing technique for dimensionality reduction, which selects a subset of informative features without using any label information. Unlike linear models, the autoencoder is formulated to select informative features nonlinearly. The adjacency matrix of the HSI can be constructed to extract the underlying relationships between data points, where the latent representation of the original data can be obtained via matrix factorization. In addition, a new feature representation can also be learnt from the autoencoder. For the same data matrix, different feature representations should consistently share the underlying information. Motivated by these observations, in this paper we propose a latent representation learning based autoencoder feature selection (LRLAFS) model, where the latent representation learning is used to steer feature selection for the autoencoder. To solve the proposed model, we develop an alternating optimization algorithm. Experimental results on three HSI datasets confirm the effectiveness of the proposed model.


10.
A new efficient unsupervised feature selection method is proposed to handle nominal data without data transformation. The proposed feature selection method introduces a new data distribution factor to select appropriate clusters. The proposed method combines the compactness and separation together with a newly introduced concept of singleton item. This new feature selection method considers all features globally. It is computationally inexpensive and able to deliver very promising results. Eight datasets from the University of California Irvine (UCI) machine learning repository and a high-dimensional cDNA dataset are used in this paper. The obtained results show that the proposed method is very efficient and able to deliver very reliable results.

11.
Intrusion detection is a very serious issue these days because the prevention of intrusions depends on detection. Accurate intrusion detection is therefore essential to secure information in the computer and network systems of any organization, whether private, public, or governmental. Several intrusion detection approaches are available, but the main problem is their performance, which can be enhanced by increasing detection rates and reducing false positives. This issue with existing techniques is the focus of this paper. The poor performance of such techniques is due to raw datasets, which confuse the classifier and result in inaccurate detection because of redundant features. Recent approaches have used principal component analysis (PCA) for feature subset selection based on the highest eigenvalues, but the features corresponding to the highest eigenvalues may not have the optimal sensitivity for the classifier, because many sensitive features are ignored. Instead of using the traditional approach of selecting the features with the highest eigenvalues, as in PCA, this research applies a genetic algorithm to search for the genetic principal components that offer a subset of features with optimal sensitivity and the highest discriminatory power. A support vector machine (SVM) is used for classification. This research uses the Knowledge Discovery and Data Mining (KDD) Cup dataset for experimentation. The performance of this approach was analyzed and compared with existing approaches. The results show that the proposed method enhances SVM performance in intrusion detection, outperforms the existing approaches, and is able to minimize the number of features and maximize the detection rates.

12.
Koch I, Naito K. Neural Computation, 2007, 19(2): 513-545
This letter is concerned with the problem of selecting the best or most informative dimension for dimension reduction and feature extraction in high-dimensional data. The dimension of the data is reduced by principal component analysis; subsequent application of independent component analysis to the principal component scores determines the most nongaussian directions in the lower-dimensional space. A criterion for choosing the optimal dimension based on bias-adjusted skewness and kurtosis is proposed. This new dimension selector is applied to real data sets and compared to existing methods. Simulation studies for a range of densities show that the proposed method performs well and is more appropriate for nongaussian data than existing methods.
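
The skewness and kurtosis mentioned in this abstract are the usual moment-based measures of nongaussianity. The sketch below computes the plain sample skewness and excess kurtosis of a one-dimensional list of scores. It does not reproduce the bias adjustment or the dimension selector of the letter, whose details are not given here, and the function name `skewness_kurtosis` is hypothetical.

```python
def skewness_kurtosis(xs):
    """Return (skewness, excess kurtosis) of a list of numbers.

    Both use population (1/n) central moments with no bias correction.
    For Gaussian data both values are close to zero.
    """
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # variance
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return skew, excess_kurt


# Symmetric toy sample: skewness is zero, tails are lighter than Gaussian.
skew, kurt = skewness_kurtosis([-2.0, -1.0, 0.0, 1.0, 2.0])
```

A direction whose scores give values far from zero in either measure is a candidate nongaussian direction in this sense.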

13.
With the development of condition-based maintenance techniques and the consequent requirement for good machine learning methods, new challenges arise in unsupervised learning. In real-world situations, because the relevant features that could reveal the real machine condition are often unknown a priori, condition monitoring systems based on unimportant features, e.g. noise, might suffer high false-alarm rates, especially when the characteristics of failures are costly or difficult to learn. Therefore, it is important to select the most representative features for unsupervised learning in fault diagnostics. In this paper, a hybrid feature selection scheme (HFS) for unsupervised learning is proposed to improve the robustness and the accuracy of fault diagnostics. It provides a general framework for feature selection based on significance evaluation and similarity measurement with respect to multiple clustering solutions. The effectiveness of the proposed HFS method is demonstrated by a bearing fault diagnostics application and by comparison with other feature selection methods.

14.
Neural Computing and Applications - The “curse of dimensionality” issue caused by high-dimensional datasets not only imposes high memory and computational costs but also deteriorates...

15.
This paper addresses the dimension reduction problem in Fisherface for face recognition. When the number of training samples is less than the image dimension (total number of pixels), the within-class scatter matrix (Sw) in Linear Discriminant Analysis (LDA) is singular, and Principal Component Analysis (PCA) is employed in Fisherface to reduce the dimension so that Sw becomes nonsingular. The popular method is to select the largest nonzero eigenvalues and the corresponding eigenvectors for LDA. To attenuate the illumination effect, some researchers have suggested removing the three eigenvectors with the largest eigenvalues, which improves performance. However, as far as we know, there is no systematic way to determine which eigenvalues should be used. Along this line, this paper proposes a theorem to interpret why PCA can be used in LDA, and an automatic and systematic method to select the eigenvectors to be used in LDA using a Genetic Algorithm (GA). A GA-PCA is then developed. It is found that some eigenvectors with small eigenvalues should also be used as part of the basis for dimension reduction. Using the GA-PCA to reduce the dimension, a GA-Fisher method is designed and developed. Compared with the traditional Fisherface method, the proposed GA-Fisher offers two additional advantages. First, optimal bases for dimensionality reduction are derived from GA-PCA. Second, the computational efficiency of LDA is improved by adding a whitening procedure after dimension reduction. The Face Recognition Technology (FERET) and Carnegie Mellon University Pose, Illumination, and Expression (CMU PIE) databases are used for evaluation. Experimental results show that an improvement of almost 5% over Fisherface can be obtained, and the results are encouraging.
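
The singularity of Sw noted in this abstract follows from a standard rank argument. It is sketched below in notation assumed for this listing, not taken from the paper: N training images, c classes, image dimension d, and class means \mu_k.

```latex
S_w = \sum_{k=1}^{c} \sum_{x_i \in \mathcal{C}_k} (x_i - \mu_k)(x_i - \mu_k)^{\top},
\qquad
\operatorname{rank}(S_w) \le N - c < d \quad \text{whenever } N < d .
```

Each class with n_k samples contributes at most n_k - 1 linearly independent centered vectors, so the d-by-d matrix Sw cannot be full rank when N < d. Projecting the data with PCA onto at most N - c dimensions first is what allows Sw to become nonsingular.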

16.
Liu Zhiyu, Gao Xin, Jia Xin, Xue Bing, Fu Shiyuan, Li Kangsheng, Huang Xu, Huang Zijian. Applied Intelligence, 2022, 52(13): 15074-15090
Applied Intelligence - The anomaly detection problem has been extensively studied in a variety of application domains, where the data tags are difficult to obtain. Most unsupervised algorithms rely on...

17.
In this paper, new appearance-based neural network (NN) algorithms are presented for face recognition. Face recognition is subdivided into two main stages: feature extraction and classification. The suggested NN algorithms are the unsupervised Sanger principal component neural network (Sanger PCNN) and the self-organizing feature map (SOFM), which are applied to extract features from the frontal view of a face image. It is of interest to compare the unsupervised networks with the traditional Eigenfaces technique. This paper presents an experimental comparison of the statistical Eigenfaces method for feature extraction and the unsupervised neural networks, using classification accuracy as the comparison criterion. Classification is performed by a multilayer perceptron (MLP) neural network. Overcoming the problem of the finite number of training samples per person is discussed. Experiments are conducted on the Olivetti Research Laboratory database, which contains variability in expression, pose, and facial details. The results show that the proposed SOFM/MLP neural network is more efficient and robust than the Sanger PCNN/MLP and the Eigenfaces/MLP when only a few training samples per person are used. As a result, the SOFM/MLP NN is more suitable for achieving a higher level of accuracy within a recognition system.

18.
Slow feature analysis: unsupervised learning of invariances   Cited by: 8 (self-citations: 0, citations by others: 8)
Invariant features of temporally varying signals are useful for analysis and classification. Slow feature analysis (SFA) is a new method for learning invariant or slowly varying features from a vectorial input signal. It is based on a nonlinear expansion of the input signal and application of principal component analysis to this expanded signal and its time derivative. It is guaranteed to find the optimal solution within a family of functions directly and can learn to extract a large number of decorrelated features, which are ordered by their degree of invariance. SFA can be applied hierarchically to process high-dimensional input signals and extract complex features. SFA is applied first to complex cell tuning properties based on simple cell output, including disparity and motion. Then more complicated input-output functions are learned by repeated application of SFA. Finally, a hierarchical network of SFA modules is presented as a simple model of the visual system. The same unstructured network can learn translation, size, rotation, contrast, or, to a lesser degree, illumination invariance for one-dimensional objects, depending on only the training stimulus. Surprisingly, only a few training objects suffice to achieve good generalization to new objects. The generated representation is suitable for object recognition. Performance degrades if the network is trained to learn multiple invariances simultaneously.
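
The notion of a "slowly varying" feature in this abstract can be made concrete with a small sketch of a slowness measure. It assumes a finite-difference approximation of the time derivative and normalization to zero mean and unit variance. It shows only the quantity that SFA seeks to make small, not the nonlinear expansion or the PCA steps, and the function name `slowness` is hypothetical.

```python
import math


def slowness(signal):
    """Mean squared first difference of a signal after scaling it to
    zero mean and unit variance. Smaller values mean slower variation."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((s - mean) ** 2 for s in signal) / n
    z = [(s - mean) / math.sqrt(var) for s in signal]
    return sum((z[t + 1] - z[t]) ** 2 for t in range(n - 1)) / (n - 1)


# Two toy signals sampled on the same time grid (assumed for illustration).
grid = [2 * math.pi * i / 500 for i in range(500)]
slow = [math.sin(x) for x in grid]        # one cycle over the window
fast = [math.sin(11 * x) for x in grid]   # eleven cycles over the window
```

Here `slowness(slow)` is smaller than `slowness(fast)`. Because of the normalization, rescaling or shifting a signal does not change its score, which is why the measure can rank features by their degree of invariance.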

19.
A new scheme, incorporating dimensionality reduction and clustering, suitable for classification of a large volume of remotely sensed data using a small amount of memory is proposed. The scheme involves transforming the data from multidimensional n-space to a 3-dimensional primary color space of blue, green and red coordinates. The dimensionality reduction is followed by data reduction, which involves assigning 3-dimensional samples to a 2-dimensional array. Finally, a multi-stage ISODATA technique incorporating a novel seedpoint picking method is used to obtain the desired number of clusters.

The storage requirements are reduced to a low value by making five passes through the data and storing necessary information during each pass. The first three passes are used to find the minimum and maximum values of some of the variables. The data reduction is done and a classification table is formed during the fourth pass. The classification map is obtained during the fifth pass. The computer memory required is about 2K machine words.

The efficacy of the algorithm is justified by simulation studies using multispectral LANDSAT data.


20.