首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
The focus of this paper is on joint feature re-extraction and classification in cases when the training data set is small. An iterative semi-supervised support vector machine (SVM) algorithm is proposed, where each iteration consists both feature re-extraction and classification, and the feature re-extraction is based on the classification results from the previous iteration. Feature extraction is first discussed in the framework of Rayleigh coefficient maximization. The effectiveness of common spatial pattern (CSP) feature, which is commonly used in Electroencephalogram (EEG) data analysis and EEG-based brain computer interfaces (BCIs), can be explained by Rayleigh coefficient maximization. Two other features are also defined using the Rayleigh coefficient. These features are effective for discriminating two classes with different means or different variances. If we extract features based on Rayleigh coefficient maximization, a large training data set with labels is required in general; otherwise, the extracted features are not reliable. Thus we present an iterative semi-supervised SVM algorithm embedded with feature re-extraction. This iterative algorithm can be used to extract these three features reliably and perform classification simultaneously in cases where the training data set is small. Each iteration is composed of two main steps: (i) the training data set is updated/augmented using unlabeled test data with their predicted labels; features are re-extracted based on the augmented training data set. (ii) The re-extracted features are classified by a standard SVM. Regarding parameter setting and model selection of our algorithm, we also propose a semi-supervised learning-based method using the Rayleigh coefficient, in which both training data and test data are used. This method is suitable when cross-validation model selection may not work for small training data set. Finally, the results of data analysis are presented to demonstrate the validity of our approach. Editor: Olivier Chapelle.  相似文献   

2.
Given a large set of potential features, it is usually necessary to find a small subset with which to classify. The task of finding an optimal feature set is inherently combinatoric and therefore suboptimal algorithms are typically used to find feature sets. If feature selection is based directly on classification error, then a feature-selection algorithm must base its decision on error estimates. This paper addresses the impact of error estimation on feature selection using two performance measures: comparison of the true error of the optimal feature set with the true error of the feature set found by a feature-selection algorithm, and the number of features among the truly optimal feature set that appear in the feature set found by the algorithm. The study considers seven error estimators applied to three standard suboptimal feature-selection algorithms and exhaustive search, and it considers three different feature-label model distributions. It draws two conclusions for the cases considered: (1) depending on the sample size and the classification rule, feature-selection algorithms can produce feature sets whose corresponding classifiers possess errors far in excess of the classifier corresponding to the optimal feature set; and (2) for small samples, differences in performances among the feature-selection algorithms are less significant than performance differences among the error estimators used to implement the algorithms. Moreover, keeping in mind that results depend on the particular classifier-distribution pair, for the error estimators considered in this study, bootstrap and bolstered resubstitution usually outperform cross-validation, and bolstered resubstitution usually performs as well as or better than bootstrap.  相似文献   

3.
针对行人重识别特征提取过程中特征图分辨率不断下降,丢失大量空间信息和细节信息,导致特征鲁棒性较低的问题,提出一种基于高分辨率特征提取网络的行人重识别方法.采取变换背景的方法对训练数据集进行数据扩充,提高数据样本的多样性;通过构建高分辨率特征提取网络,使得在整个特征提取过程中网络里始终拥有高分辨特征;结合三元损失函数和改...  相似文献   

4.
Today, feature selection is an active research in machine learning. The main idea of feature selection is to choose a subset of available features, by eliminating features with little or no predictive information, as well as redundant features that are strongly correlated. There are a lot of approaches for feature selection, but most of them can only work with crisp data. Until now there have not been many different approaches which can directly work with both crisp and low quality (imprecise and uncertain) data. That is why, we propose a new method of feature selection which can handle both crisp and low quality data. The proposed approach is based on a Fuzzy Random Forest and it integrates filter and wrapper methods into a sequential search procedure with improved classification accuracy of the features selected. This approach consists of the following main steps: (1) scaling and discretization process of the feature set; and feature pre-selection using the discretization process (filter); (2) ranking process of the feature pre-selection using the Fuzzy Decision Trees of a Fuzzy Random Forest ensemble; and (3) wrapper feature selection using a Fuzzy Random Forest ensemble based on cross-validation. The efficiency and effectiveness of this approach is proved through several experiments using both high dimensional and low quality datasets. The approach shows a good performance (not only classification accuracy, but also with respect to the number of features selected) and good behavior both with high dimensional datasets (microarray datasets) and with low quality datasets.  相似文献   

5.
Feature selection is often required as a preliminary step for many pattern recognition problems. However, most of the existing algorithms only work in a centralized fashion, i.e. using the whole dataset at once. In this research a new method for distributing the feature selection process is proposed. It distributes the data by features, i.e. according to a vertical distribution, and then performs a merging procedure which updates the feature subset according to improvements in the classification accuracy. The effectiveness of our proposal is tested on microarray data, which has brought a difficult challenge for researchers due to the high number of gene expression contained and the small samples size. The results on eight microarray datasets show that the execution time is considerably shortened whereas the performance is maintained or even improved compared to the standard algorithms applied to the non-partitioned datasets.  相似文献   

6.
小样本图片分类的目标是根据极少数带有标注的样本去识别该类别, 其中两个关键问题是带标注的数据量过少和不可见类别(训练类别和测试类别的不一致). 针对这两个问题, 我们提出了一个新的小样本分类模型: 融合扩充-双重特征提取模型. 首先, 我们引入了一个融合扩充机制(FE), 这个机制利用可见类别样本中同一类别不同样本之间的变化规则, 对支持集的样本进行扩充, 从而增加支持集中的样本数量, 使提取的特征更具鲁棒性. 其次, 我们提出了一种双重特征提取机制(DF), 该机制首先利用基类的大量数据训练两个不同的特征提取器: 局部特征提取器和整体特征提取器, 利用两个不同的特征提取器对样本特征进行提取, 使提取的特征更加全面, 然后根据局部和整体特征对比, 突出对分类影响最大的特征, 从而提高分类准确性. 在Mini-ImageNet和Tiered-ImageNet数据集上, 我们的模型都取得了较好的效果.  相似文献   

7.
In classification, feature selection is an important data pre-processing technique, but it is a difficult problem due mainly to the large search space. Particle swarm optimisation (PSO) is an efficient evolutionary computation technique. However, the traditional personal best and global best updating mechanism in PSO limits its performance for feature selection and the potential of PSO for feature selection has not been fully investigated. This paper proposes three new initialisation strategies and three new personal best and global best updating mechanisms in PSO to develop novel feature selection approaches with the goals of maximising the classification performance, minimising the number of features and reducing the computational time. The proposed initialisation strategies and updating mechanisms are compared with the traditional initialisation and the traditional updating mechanism. Meanwhile, the most promising initialisation strategy and updating mechanism are combined to form a new approach (PSO(4-2)) to address feature selection problems and it is compared with two traditional feature selection methods and two PSO based methods. Experiments on twenty benchmark datasets show that PSO with the new initialisation strategies and/or the new updating mechanisms can automatically evolve a feature subset with a smaller number of features and higher classification performance than using all features. PSO(4-2) outperforms the two traditional methods and two PSO based algorithm in terms of the computational time, the number of features and the classification performance. The superior performance of this algorithm is due mainly to both the proposed initialisation strategy, which aims to take the advantages of both the forward selection and backward selection to decrease the number of features and the computational time, and the new updating mechanism, which can overcome the limitations of traditional updating mechanisms by taking the number of features into account, which reduces the number of features and the computational time.  相似文献   

8.
9.
针对大量无关和冗余特征的存在可能降低分类器性能的问题,提出了一种基于近似Markov Blanket和动态互信息的特征选择算法。该算法利用互信息作为特征相关性的度量准则,并在未识别的样本上对互信息进行动态估值,利用近似Markov Blanket原理准确地去除冗余特征,从而获得远小于原始特征规模的特征子集。通过仿真试验证明了该算法的有效性。以支持向量机为分类器,在公共数据集UCI上进行了试验,并与DMIFS和ReliefF算法进行了对比。试验结果证明,该算法选取的特征子集与原始特征子集相比,以远小于原始特征规模的特征子集获得了高于或接近于原始特征集合的分类结果。  相似文献   

10.
Algorithms for feature selection in predictive data mining for classification problems attempt to select those features that are relevant, and are not redundant for the classification task. A relevant feature is defined as one which is highly correlated with the target function. One problem with the definition of feature relevance is that there is no universally accepted definition of what it means for a feature to be ‘highly correlated with the target function or highly correlated with the other features’. A new feature selection algorithm which incorporates domain specific definitions of high, medium and low correlations is proposed in this paper. The proposed algorithm conducts a heuristic search for the most relevant features for the prediction task.  相似文献   

11.
针对合成孔径雷达图像目标在背景复杂、场景较大、干扰杂波较多情况下检测困难的问题,设计一种层数较少的卷积神经网络,在完备数据集验证其特征提取效果后,作为基础特征提取网络使用。在训练数据集中补充复杂的大场景下目标训练样本。同时设计一种多层次卷积特征融合网络,增强对大场景下小目标的检测能力。通过对候选区域网络和目标检测网络近似联合训练后,得到一个完整的可用于不同的复杂大场景下SAR图像目标检测的模型。实验结果表明,该方法在SAR图像目标检测方面具有较好的效果,在测试数据集中具有0.86的AP值。  相似文献   

12.
We address the problems of feature selection and error estimation when the number of possible feature candidates is large and the number of training samples is limited. A Monte Carlo study has been performed to illustrate the problems when using stepwise feature selection and discriminant analysis. The simulations demonstrate that in order to find the correct features, the necessary ratio of number of training samples to feature candidates is not a constant. It depends on the number of feature candidates, training samples and the Mahalanobis distance between the classes. Moreover, the leave-one-out error estimate may be a highly biased error estimate when feature selection is performed on the same data as the error estimation. It may even indicate complete separation of the classes, while no real difference between the classes exists. However, if feature selection and leave-one-out error estimation are performed in one process, an unbiased error estimate is achieved, but with high variance. The holdout error estimate gives a reliable estimate with low variance, depending on the size of the test set.  相似文献   

13.
The regularized random forest (RRF) was recently proposed for feature selection by building only one ensemble. In RRF the features are evaluated on a part of the training data at each tree node. We derive an upper bound for the number of distinct Gini information gain values in a node, and show that many features can share the same information gain at a node with a small number of instances and a large number of features. Therefore, in a node with a small number of instances, RRF is likely to select a feature not strongly relevant.Here an enhanced RRF, referred to as the guided RRF (GRRF), is proposed. In GRRF, the importance scores from an ordinary random forest (RF) are used to guide the feature selection process in RRF. Experiments on 10 gene data sets show that the accuracy performance of GRRF is, in general, more robust than RRF when their parameters change. GRRF is computationally efficient, can select compact feature subsets, and has competitive accuracy performance, compared to RRF, varSelRF and LASSO logistic regression (with evaluations from an RF classifier). Also, RF applied to the features selected by RRF with the minimal regularization outperforms RF applied to all the features for most of the data sets considered here. Therefore, if accuracy is considered more important than the size of the feature subset, RRF with the minimal regularization may be considered. We use the accuracy performance of RF, a strong classifier, to evaluate feature selection methods, and illustrate that weak classifiers are less capable of capturing the information contained in a feature subset. Both RRF and GRRF were implemented in the “RRF” R package available at CRAN, the official R package archive.  相似文献   

14.
A technique for the selection of the best set of test features for checkout or go-no-go test of a complex electro-hydraulic servo system from input-output measurements is presented. The first step is to establish checkout tolerance bands on the system response. Then the measurement set based on gain and phase which would best discriminate between the ‘healthy’ and ‘sick’ systems is selected from an initially large set of sampled frequencies via an optimization procedure in which the feature efficiency vector is introduced to add or discard features until a satisfactory set is obtained. Finally the selected feature sets are assessed for effectiveness via a ‘goodness’ criterion.  相似文献   

15.
A novel algorithm is developed for feature selection and parameter tuning in quality monitoring of manufacturing processes using cross-validation. Due to the recent development in sensing technology, many on-line signals are collected for manufacturing process monitoring and feature extraction is then performed to extract critical features related to product/process quality. However, lack of precise process knowledge may result in many irrelevant or redundant features. Therefore, a systematic procedure is needed to select a parsimonious set of features which provide sufficient information for process monitoring. In this study, a new method for selecting features and tuning SPC limits is proposed by applying k-fold cross-validation to simultaneously select important features and set the monitoring limits using Type I and Type II errors obtained from cross-validation. The monitoring performance for production data collected from ultrasonic metal welding of batteries demonstrates that the proposed algorithm is able to select the most efficient features and control limits and thus leading to satisfactory monitoring performance.  相似文献   

16.
分类问题普遍存在于现代工业生产中。在进行分类任务之前,利用特征选择筛选有用的信息,能够有效地提高分类效率和分类精度。最小冗余最大相关算法(mRMR)考虑最大化特征与类别的相关性和最小化特征之间的冗余性,能够有效地选择特征子集;但该算法存在中后期特征重要度偏差大以及无法直接给出特征子集的问题。针对该问题,文中提出了结合邻域粗糙集差别矩阵和mRMR原理的特征选择算法。根据最大相关性和最小冗余性原则,利用邻域熵和邻域互信息定义了特征的重要度,以更好地处理混合数据类型。基于差别矩阵定义了动态差别集,利用差别集的动态演化有效去除冗余属性,缩小搜索范围,优化特征子集,并根据差别矩阵判定迭代截止条件。实验选取SVM,J48,KNN和MLP作为分类器来评价该特征选择算法的性能。在公共数据集上的实验结果表明,与已有算法相比,所提算法的平均分类精度提升了2%左右,同时在特征较多的数据集上能够有效地缩短特征选择时间。所提算法继承了差别矩阵和mRMR的优点,能够有效地处理特征选择问题。  相似文献   

17.
Extracting significant features from high-dimension and small sample size biological data is a challenging problem. Recently, Micha? Draminski proposed the Monte Carlo feature selection (MC) algorithm, which was able to search over large feature spaces and achieved better classification accuracies. However in MC the information of feature rank variations is not utilized and the ranks of features are not dynamically updated. Here, we propose a novel feature selection algorithm which integrates the ideas of the professional tennis players ranking, such as seed players and dynamic ranking, into Monte Carlo simulation. Seed players make the feature selection game more competitive and selective. The strategy of dynamic ranking ensures that it is always the current best players to take part in each competition. The proposed algorithm is tested on 8 biological datasets. Results demonstrate that the proposed method is computationally efficient, stable and has favorable performance in classification.  相似文献   

18.
命名实体识别的目的是识别文本中的实体指称的边界和类别。在进行命名实体识别模型训练的过程中,通常需要大量的标注样本。本文通过实现有效的选择算法,从大量样本中选择适合模型更新的样本,减少对样本的标注工作。通过5组对比实验,验证使用有效的选择算法能够获得更好的样本集,实现具有针对性的标注样本。通过设计在微博网络数据集上的实验,验证本文提出的基于流的主动学习算法可以针对大量互联网文本数据选择出更合适的样本集,能够有效减少人工标注的成本。本文通过2个模型分别实现实体的边界提取和类别区分。序列标注模型提取出实体在序列中的位置,实体分类模型实现对标注结果的分类,并利用主动学习的方法实现在无标注数据集上的训练。使用本文的训练方法在2个数据集上进行实验。在Weibo数据集上的实验展示算法能从无标签数据集中学习到文本特征。在MSRA数据集上的实验结果显示,在预训练数据集的比例达到40%以上时,模型在测试数据集上的F1值稳定在90%左右,与使用全部数据集的结果接近,说明模型在无标签数据集上具有一定的特征提取能力。  相似文献   

19.
目的 为了解决基于卷积神经网络的算法对高光谱图像小样本分类精度较低、模型结构复杂和计算量大的问题,提出了一种变维卷积神经网络。方法 变维卷积神经网络对高光谱分类过程可根据内部特征图维度的变化分为空—谱信息融合、降维、混合特征提取与空—谱联合分类的过程。这种变维结构通过改变特征映射的维度,简化了网络结构并减少了计算量,并通过对空—谱信息的充分提取提高了卷积神经网络对小样本高光谱图像分类的精度。结果 实验分为变维卷积神经网络的性能分析实验与分类性能对比实验,所用的数据集为Indian Pines和Pavia University Scene数据集。通过实验可知,变维卷积神经网络对高光谱小样本可取得较高的分类精度,在Indian Pines和Pavia University Scene数据集上的总体分类精度分别为87.87%和98.18%,与其他分类算法对比有较明显的性能优势。结论 实验结果表明,合理的参数优化可有效提高变维卷积神经网络的分类精度,这种变维模型可较大程度提高对高光谱图像中小样本数据的分类性能,并可进一步推广到其他与高光谱图像相关的深度学习分类模型中。  相似文献   

20.
Contemporary biological technologies produce extremely high-dimensional data sets from which to design classifiers, with 20,000 or more potential features being common place. In addition, sample sizes tend to be small. In such settings, feature selection is an inevitable part of classifier design. Heretofore, there have been a number of comparative studies for feature selection, but they have either considered settings with much smaller dimensionality than those occurring in current bioinformatics applications or constrained their study to a few real data sets. This study compares some basic feature-selection methods in settings involving thousands of features, using both model-based synthetic data and real data. It defines distribution models involving different numbers of markers (useful features) versus non-markers (useless features) and different kinds of relations among the features. Under this framework, it evaluates the performances of feature-selection algorithms for different distribution models and classifiers. Both classification error and the number of discovered markers are computed. Although the results clearly show that none of the considered feature-selection methods performs best across all scenarios, there are some general trends relative to sample size and relations among the features. For instance, the classifier-independent univariate filter methods have similar trends. Filter methods such as the t-test have better or similar performance with wrapper methods for harder problems. This improved performance is usually accompanied with significant peaking. Wrapper methods have better performance when the sample size is sufficiently large. ReliefF, the classifier-independent multivariate filter method, has worse performance than univariate filter methods in most cases; however, ReliefF-based wrapper methods show performance similar to their t-test-based counterparts.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号