20 similar documents found; search took 15 ms.
1.
Matthew Browne 《Pattern recognition》2012,45(4):1531-1539
Locally adaptive density estimation presents challenges for parametric or non-parametric estimators. Several useful properties of tessellation density estimators (TDEs), such as low bias, scale invariance, and sensitivity to local data morphology, make them an attractive alternative to standard kernel techniques. However, simple TDEs are discontinuous and produce highly unstable estimates owing to their susceptibility to sampling noise. To address these concerns, we propose applying TDEs within a bootstrap aggregation algorithm and incorporating model selection with complexity penalization. We implement complexity reduction of the TDE via sub-sampling and use information-theoretic criteria for model selection, which leads to an automatic and approximately ideal bias/variance compromise. The procedure yields a stabilized estimator that automatically adapts to the complexity of the generating distribution and the quantity of information at hand, while retaining the highly desirable properties of the TDE. The simulation studies presented suggest that a high degree of stability and sensitivity can be obtained with this approach.
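The bagging-plus-sub-sampling idea above can be sketched in one dimension, where a point's Voronoi cell is simply the interval between the midpoints to its neighbours. Everything below (including the edge-cell handling and parameter choices) is an illustrative assumption, not the authors' implementation:

```python
import random

def tde_1d(sample, x):
    """1-D tessellation density estimate: each point's cell runs between the
    midpoints to its neighbours; the estimated density on a cell of length L
    containing one of the n points is 1 / (n * L)."""
    pts = sorted(sample)
    n = len(pts)
    mids = [(pts[i] + pts[i + 1]) / 2 for i in range(n - 1)]
    # extend the first and last cells symmetrically past the extreme points
    bounds = [2 * pts[0] - mids[0]] + mids + [2 * pts[-1] - mids[-1]]
    for i in range(n):
        if bounds[i] <= x < bounds[i + 1]:
            return 1.0 / (n * (bounds[i + 1] - bounds[i]))
    return 0.0  # x lies outside the support of the tessellation

def bagged_tde(sample, x, B=50, frac=0.5, rng=random.Random(0)):
    """Stabilise the discontinuous TDE by averaging B estimates, each built
    on a random sub-sample (sub-sampling doubles as complexity reduction)."""
    m = max(2, int(frac * len(sample)))
    return sum(tde_1d(rng.sample(sample, m), x) for _ in range(B)) / B
```

Averaging over sub-sampled tessellations smooths the discontinuities of any single estimate, which is the stabilising effect the abstract describes.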
2.
This article proposes a weighted bootstrap procedure, an efficient bootstrap technique for neural model selection. To reduce computational effort, instead of resampling uniformly from the original sample (as in the original bootstrap procedure), we modify the resampling distribution so as to obtain variance reduction. The performance of the weighted bootstrap is demonstrated on two artificial data sets and one real data set. Experimental results show that the weighted bootstrap procedure permits an approximately two-to-one reduction in replication size.
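A minimal sketch of the non-uniform resampling step; the choice of weights is the crux of the method and is not given in the abstract, so only the mechanism is shown here (uniform weights recover the classical bootstrap):

```python
import random

def weighted_bootstrap(data, stat, weights, B=200, rng=random.Random(1)):
    """Draw B bootstrap samples from a *modified* (non-uniform) distribution
    over the observations, instead of uniformly, and return the replicated
    values of the statistic."""
    n = len(data)
    return [stat(rng.choices(data, weights=weights, k=n)) for _ in range(B)]

data = [1.2, 0.7, 3.1, 2.4, 1.9, 0.5, 2.2]
mean = lambda s: sum(s) / len(s)
reps = weighted_bootstrap(data, mean, [1.0] * len(data))  # uniform case
```

In a variance-reduction setting the weights would be tuned so that fewer replications B achieve the same Monte Carlo accuracy.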
3.
Xiaoming Wang 《Computational statistics & data analysis》2010,54(10):2230-2243
We propose a new penalized least squares approach to high-dimensional statistical analysis. The proposed procedure can outperform the SCAD penalty technique (Fan and Li, 2001) when the number of predictors p is much larger than the number of observations n, and/or when the correlation among predictors is high. It shares several properties of the smoothly clipped absolute deviation (SCAD) penalty method, including sparsity and continuity, and is asymptotically equivalent to an oracle estimator. We show how the approach can be used to analyze high-dimensional data, e.g., microarray data, to construct a classification rule while automatically selecting significant genes. A simulation study and real data examples demonstrate the practical aspects of the new method.
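For reference, the SCAD penalty that this work compares against has a known closed form, sketched below with the conventional a = 3.7 (the paper's own penalty is not reproduced here):

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001): behaves like the lasso (lam*|t|)
    near zero, tapers quadratically in the middle band, and is constant for
    large coefficients, so they are not over-shrunk."""
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return (a + 1) * lam * lam / 2
```

Continuity at the two breakpoints (|t| = lam and |t| = a*lam) is what gives SCAD-type estimators their stability relative to hard thresholding.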
4.
Wen-Liang Hung, E. Stanley Lee, Shun-Chin Chuang 《Computers & Mathematics with Applications》2011,62(12):4576-4581
Uniform resampling is the easiest plan to apply and is a general recipe for all problems, but it may require a large replication size B. To save computational effort, balanced bootstrap resampling has been proposed as an alternative resampling plan. This plan is effective for approximating the center of the bootstrap distribution, and this paper therefore applies it to neural model selection. Numerical experiments indicate that the replication size B can be reduced considerably. The efficiency of balanced bootstrap resampling is also discussed.
5.
Yiu-ming Cheung 《IEEE Transactions on Knowledge and Data Engineering》2005,17(11):1583-1588
The existing rival penalized competitive learning (RPCL) algorithm and its variants provide an attractive way to perform data clustering without knowing the exact number of clusters. However, their performance is sensitive to the preselection of the rival delearning rate. In this paper, we further investigate RPCL and present a mechanism to control the strength of rival penalization dynamically. Consequently, we propose the rival penalization controlled competitive learning (RPCCL) algorithm and its stochastic version, in each of which the selection of the delearning rate is circumvented using a novel technique. We compare the performance of RPCCL with RPCL on Gaussian mixture clustering and color image segmentation. The experiments have produced promising results.
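A one-dimensional caricature of the basic RPCL update: the winning center moves toward the input while the runner-up (rival) is pushed away. The fixed rates here are an assumption for illustration; RPCCL's contribution is precisely to set the rival's penalization strength dynamically, which this sketch does not do:

```python
def rpcl_step(centers, x, alpha_c=0.05, alpha_r=0.002):
    """One rival penalized competitive learning update on 1-D centers:
    rank centers by squared distance to x, pull the winner toward x,
    and push the rival away from x (delearning)."""
    order = sorted(range(len(centers)), key=lambda j: (centers[j] - x) ** 2)
    w, r = order[0], order[1]
    centers[w] += alpha_c * (x - centers[w])   # winner learns
    centers[r] -= alpha_r * (x - centers[r])   # rival delearns
    return centers
```

Repeated over a data stream, superfluous centers are driven away from the data, which is how RPCL avoids fixing the cluster count in advance.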
6.
Matias Salibian-Barrera 《Computational statistics & data analysis》2008,52(12):5121-5135
Robust model selection procedures control the undue influence that outliers can have on the selection criteria by using both robust point estimators and a bounded loss function when measuring either the goodness-of-fit or the expected prediction error of each model. Furthermore, to avoid favoring over-fitting models, these two measures can be combined with a penalty term for the size of the model. The expected prediction error conditional on the observed data may be estimated using the bootstrap. However, bootstrapping robust estimators becomes extremely time consuming on moderate- to high-dimensional data sets. It is shown that the expected prediction error can be estimated using a very fast and robust bootstrap method, and that this approach yields a consistent model selection method that is computationally feasible even for a relatively large number of covariates. Moreover, as opposed to other bootstrap methods, this proposal avoids the numerical problems associated with the small bootstrap samples required to obtain consistent model selection criteria. The finite-sample performance of the fast and robust bootstrap model selection method is investigated through a simulation study, while its feasibility and good performance on moderately large regression models are illustrated on several real data examples.
7.
Cancer diagnosis is an important emerging clinical application of microarray data. Accurate prediction of tumor type or size relies on adopting powerful and reliable classification models, so that patients can be provided with better treatment or better response to therapy. However, the high dimensionality of microarray data brings disadvantages to traditional classification models, such as over-fitting, poor performance, and low efficiency. Thus, one of the challenging tasks in cancer diagnosis is identifying, among the thousands of genes in microarray data, the salient expression genes that directly contribute to the phenotype or symptoms of a disease. In this paper, we propose a new ensemble gene selection method (EGS) to choose multiple gene subsets for classification, where the significance of a gene is measured by conditional mutual information or its normalized form. After different gene subsets have been obtained by setting different starting points of the search procedure, they are used to train multiple base classifiers, which are then aggregated into a consensus classifier by majority voting. The proposed method is compared with five popular gene selection methods on six public microarray datasets, and the comparison results show that our method works well.
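The aggregation step of such an ensemble can be sketched as plain majority voting over the base classifiers' per-sample predictions (the gene-subset search and the base classifiers themselves are omitted):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions[i] is the label sequence produced by base classifier i
    (each trained on its own gene subset); vote column-wise, i.e. per
    sample, across the classifiers."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]
```

For example, three base classifiers predicting `[0,1,1]`, `[0,1,0]`, and `[1,1,0]` yield the consensus `[0,1,0]`.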
8.
Input feature selection for classification problems
Feature selection plays an important role in classifying systems such as neural networks (NNs). The attributes of a dataset may be relevant, irrelevant, or redundant, and since datasets can be huge, reducing the number of attributes by selecting only the relevant ones is desirable; in doing so, higher performance with lower computational effort can be expected. In this paper, we propose two feature selection algorithms. The limitation of the mutual information feature selector (MIFS) is analyzed and a method to overcome this limitation is studied. The first proposed algorithm makes more considered use of the mutual information between input attributes and output classes than MIFS does, and we demonstrate that it can match the performance of the ideal greedy selection algorithm when information is distributed uniformly, at nearly the same computational load as MIFS. In addition, a second feature selection algorithm using the Taguchi method is proposed, addressing the question of how to identify good features with as few experiments as possible. The proposed algorithms are applied to several classification problems and compared with MIFS. The two algorithms can also be combined to complement each other's limitations; the combined algorithm performed well in several experiments and should prove to be a useful method for selecting features in classification problems.
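The MIFS criterion that this line of work builds on can be sketched for discrete features as follows: greedily pick the feature maximising I(f;C) minus beta times its redundancy with already-selected features (Battiti's criterion; the improvements proposed in the paper are not reproduced here):

```python
from math import log
from collections import Counter

def mutual_info(xs, ys):
    """I(X;Y) in nats for two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log((c / n) / (px[a] / n * py[b] / n))
               for (a, b), c in pxy.items())

def mifs(features, labels, k, beta=0.5):
    """Greedy MIFS: `features` maps name -> value sequence. At each step,
    select the feature with the largest I(f;C) - beta * sum_s I(f;s),
    where s ranges over already-selected features."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f:
                   mutual_info(features[f], labels)
                   - beta * sum(mutual_info(features[f], features[s])
                                for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The redundancy parameter `beta` trades relevance against redundancy, which is exactly the knob that MIFS variants try to tune or replace.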
9.
Genetic algorithms (GAs) have been used as conventional methods for classifiers to adaptively evolve solutions for classification problems. Feature selection plays an important role in finding relevant features for classification. In this paper, feature selection is explored with modular GA-based classification. A new feature selection technique, the relative importance factor (RIF), is proposed to find less relevant features in the input domain of each class module. By removing these features, the aim is to reduce both the classification error and the dimensionality of classification problems. Benchmark classification data sets are used to evaluate the proposed approach. The experimental results show that RIF can be used to find less relevant features and helps achieve lower classification error with a reduced feature space dimension.
10.
Domenec Puig, Miguel Angel Garcia 《Pattern recognition》2010,43(10):3282-3297
Recent developments in texture classification have shown that the proper integration of texture methods from different families leads to significant improvements in classification rate compared to the use of a single family of texture methods. In order to reduce the computational burden of that integration process, a selection stage is necessary. A large number of feature selection techniques have been proposed, but texture feature selection must typically be performed anew for each particular set of texture patterns to be classified. This paper describes a new texture feature selection algorithm that is independent of specific classification problems/applications and thus needs to be run only once for a given set of available texture methods. The proposed application-independent selection scheme has been evaluated and compared with previous proposals on both Brodatz compositions and complex real images.
11.
This paper presents an approach to selecting the optimal reference subset (ORS) for the nearest neighbor classifier. The ORS, which has minimum sample size and satisfies a certain resubstitution error rate threshold, is obtained through a tabu search (TS) algorithm. When the error rate threshold is set to zero, the algorithm obtains a near-minimal consistent subset of a given training set; when the threshold is set to an appropriately small value, the obtained reference subset may have reasonably good generalization capacity. A neighborhood exploration method and an aspiration criterion are proposed to improve the efficiency of TS. Experimental results on a number of typical data sets are presented and analyzed to illustrate the benefits of the proposed method, and the performance of the resulting consistent and non-consistent reference subsets is evaluated.
12.
In this paper, we present the MIFS-C variant of the mutual information feature-selection algorithms. We present an algorithm to find the optimal value of the redundancy parameter, which is a key parameter in the MIFS-type algorithms. Furthermore, we present an algorithm that speeds up the execution time of all the MIFS variants. Overall, the presented MIFS-C has comparable (in some cases even better) classification accuracy compared with other MIFS algorithms, while its running time is shorter. We compared this feature selector with other feature selectors and found that it performs better in most cases. MIFS-C performed especially well on the breakeven and F-measure because the algorithm can be tuned to optimise these evaluation measures.
Jan Bakus received the B.A.Sc. and M.A.Sc. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1996 and 1998, respectively, and the Ph.D. degree in systems design engineering in 2005. He is currently working at Maplesoft, Waterloo, ON, Canada as an applications engineer, where he is responsible for the development of application-specific toolboxes for the Maple scientific computing software. His research interests are in the areas of feature selection for text classification, text classification, text clustering, and information retrieval. He is the recipient of the Carl Pollock Fellowship award from the University of Waterloo and the Datatel Scholars Foundation scholarship from Datatel.
Mohamed S. Kamel holds a Ph.D. in computer science from the University of Toronto, Canada. He is at present Professor and Director of the Pattern Analysis and Machine Intelligence Laboratory in the Department of Electrical and Computer Engineering, University of Waterloo, Canada. Professor Kamel holds a Canada Research Chair in Cooperative Intelligent Systems. Dr. Kamel's research interests are in machine intelligence, neural networks, and pattern recognition with applications in robotics and manufacturing. He has authored and coauthored over 200 papers in journals and conference proceedings, 2 patents, and numerous technical and industrial project reports. Under his supervision, 53 Ph.D. and M.A.Sc. students have completed their degrees. Dr. Kamel is a member of ACM, AAAI, CIPS, and APEO and has been named a Fellow of the IEEE (2005). He is the editor-in-chief of the International Journal of Robotics and Automation, Associate Editor of the IEEE SMC, Part A, the International Journal of Image and Graphics, and Pattern Recognition Letters, and is a member of the editorial board of Intelligent Automation and Soft Computing. He has served as a consultant to many companies, including NCR, IBM, Nortel, VRP, and CSA. He is a member of the board of directors and cofounder of Virtek Vision International in Waterloo.
13.
The classification of functional or high-dimensional data requires selecting a reduced subset of features from the initial set, both to fight the curse of dimensionality and to aid in interpreting the problem and the model. The mutual information criterion may be used in that context, but it suffers from the difficulty of its estimation from a finite set of samples. Efficient estimators are not designed specifically to be applied in a classification context and thus suffer from further drawbacks and difficulties. This paper presents an estimator of mutual information that is specifically designed for classification tasks, including multi-class ones. It is combined with a recently published stopping criterion in a traditional forward feature selection procedure. Experiments on both traditional benchmarks and on an industrial functional classification problem show the added value of this estimator.
14.
Feature selection for multi-label naive Bayes classification
In multi-label learning, the training set is made up of instances each associated with a set of labels, and the task is to predict the label sets of unseen instances. In this paper, this learning problem is addressed by a method called Mlnb, which adapts the traditional naive Bayes classifier to deal with multi-label instances. Feature selection mechanisms are incorporated into Mlnb to improve its performance. First, feature extraction techniques based on principal component analysis are applied to remove irrelevant and redundant features. After that, feature subset selection techniques based on genetic algorithms are used to choose the most appropriate subset of features for prediction. Experiments on synthetic and real-world data show that Mlnb achieves performance comparable to other well-established multi-label learning algorithms.
15.
To address the problem of imbalanced data sets in Uyghur text classification, an improved chi-square feature selection method is proposed. Texts are preprocessed using the linguistic characteristics of Uyghur to reduce the dimensionality of the feature space; features are then selected by a method combining the chi-square statistic with inverse document frequency, further reducing the dimensionality; finally, a naive Bayes classifier is used for classification. Experiments on an imbalanced Uyghur corpus show that the proposed feature selection method outperforms the chi-square and information gain feature selection methods on imbalanced data sets.
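A sketch of the combined criterion: the term-class chi-square statistic from a 2x2 contingency table, weighted by an inverse-document-frequency factor. The exact combination rule is not given in the abstract, so multiplying the two scores is an assumption:

```python
from math import log

def chi_square(A, B, C, D):
    """Term-class chi-square from a 2x2 contingency table:
    A = docs in the class containing the term, B = docs outside the class
    containing it, C = docs in the class without it, D = docs outside the
    class without it."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def chi_idf_score(A, B, C, D, n_docs, doc_freq):
    """Hypothetical chi-square/IDF combination: damp terms that occur in
    almost every document by an inverse-document-frequency factor."""
    return chi_square(A, B, C, D) * log(n_docs / (1 + doc_freq))
```

The IDF factor counteracts chi-square's known tendency to favour frequent terms, which matters most on imbalanced corpora.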
16.
Rafael B. Pereira, Alexandre Plastino, Bianca Zadrozny, Luiz H. C. Merschmann 《Artificial Intelligence Review》2018,49(1):57-78
In many important application domains such as text categorization, biomolecular analysis, scene classification and medical diagnosis, examples are naturally associated with more than one class label, giving rise to multi-label classification problems. This fact has led, in recent years, to a substantial amount of research on feature selection methods that allow the identification of relevant and informative features for multi-label classification. However, the methods proposed for this task are scattered in the literature, with no common framework to describe them and to allow an objective comparison. Here, we revisit a categorization of existing multi-label classification methods and, as our main contribution, we provide a comprehensive survey and novel categorization of the feature selection techniques that have been created for the multi-label classification setting. We conclude this work with concrete suggestions for future research in multi-label feature selection which have been derived from our categorization and analysis.
17.
S. K. Maxwell, R. M. Hoffer, P. L. Chapman 《International journal of remote sensing》2013,34(23):5061-5073
Mapping land cover of large regions often requires processing satellite images collected over several time periods at many spectral wavelength channels. However, manipulating and processing large amounts of image data increases the complexity and time, and hence the cost, of producing a land cover map. Very few studies have evaluated the importance of individual Advanced Very High Resolution Radiometer (AVHRR) channels for discriminating cover types, especially the thermal channels (channels 3, 4 and 5), and studies rarely perform a multi-year analysis to determine the impact of inter-annual variability on the classification results. We evaluated 5 years of AVHRR data using combinations of the original AVHRR spectral channels (1-5) to determine which channels are most important for cover type discrimination while stabilizing inter-annual variability. Particular attention was placed on the channels in the thermal portion of the spectrum. Fourteen cover types over the entire state of Colorado were evaluated using a supervised classification approach on all two-, three-, four- and five-channel combinations for seven AVHRR biweekly composite datasets covering the entire growing season for each of 5 years. Results show that all three of the major portions of the electromagnetic spectrum represented by the AVHRR sensor are required to discriminate cover types effectively and stabilize inter-annual variability. Of the two-channel combinations, channels 1 (red visible) and 2 (near-infrared) had, by far, the highest average overall accuracy (72.2%), yet the inter-annual classification accuracies were highly variable. Including a thermal channel (channel 4) significantly increased the average overall classification accuracy by 5.5% and stabilized inter-annual variability.
Each of the thermal channels gave similar classification accuracies; however, because of the problems in consistently interpreting channel 3 data, either channel 4 or 5 was found to be a more appropriate choice. Substituting the thermal channel with a single elevation layer resulted in equivalent classification accuracies and inter-annual variability.
18.
This paper proposes a novel feature selection method that combines a self-representation loss function, a graph regularization term, and an \({l_{2,1}}\)-norm regularization term. Unlike the traditional least squares loss function, which focuses on minimizing the regression error between the class labels and their predictions, the proposed self-representation loss function represents each feature as a linear combination of its relevant features, aiming to select representative features effectively and to ensure robustness to outliers. The graph regularization term encodes two kinds of inherent information: the relationship between samples (the sample–sample relation for short) and the relationship between features (the feature–feature relation for short). The feature–feature relation reflects the similarity between two features, and the sample–sample relation the similarity between two samples; both relations are preserved in the coefficient matrix. The \({l_{2,1}}\)-norm regularization term is used to conduct the feature selection itself, selecting features that satisfy the characteristics mentioned above. Furthermore, we put forward a new optimization method to solve our objective function. Finally, we feed the reduced data into a support vector machine (SVM) to conduct classification on real datasets. The experimental results show that the proposed method outperforms state-of-the-art methods such as k-nearest neighbor, ridge regression, and SVM.
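The \({l_{2,1}}\)-norm regularizer at the heart of the selection step is simply the sum of row-wise Euclidean norms of the coefficient matrix; penalising it drives whole rows to zero, which deselects the corresponding features:

```python
from math import sqrt

def l21_norm(W):
    """l_{2,1} norm of a matrix given as a list of rows: the sum over rows
    of each row's Euclidean (l_2) norm. Minimising it induces row-sparsity,
    i.e. whole features are zeroed out rather than individual entries."""
    return sum(sqrt(sum(w * w for w in row)) for row in W)
```

For instance, the matrix [[3,4],[0,0]] has norm 5: one active row of norm 5 plus one fully deselected row.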
19.
Felipe Alonso-Atienza, José Luis Rojo-Álvarez, Alfredo Rosado-Muñoz, Juan J. Vinagre, Arcadi García-Alberola, Gustavo Camps-Valls 《Expert systems with applications》2012,39(2):1956-1967
Early detection of ventricular fibrillation (VF) is crucial for the success of defibrillation therapy in automatic devices. A large number of detectors have been proposed based on temporal, spectral, and time-frequency parameters extracted from the surface electrocardiogram (ECG), but always with limited performance. Combining ECG parameters from different domains (time, frequency, and time-frequency) using machine learning algorithms has been used to improve detection efficiency; however, the potential utilization of a wide number of parameters in machine learning schemes has raised the need for efficient feature selection (FS) procedures. In this study, we propose a novel FS algorithm based on support vector machine (SVM) classifiers and bootstrap resampling (BR) techniques. We define a backward FS procedure that relies on evaluating changes in SVM performance when removing features from the input space, an evaluation made according to a nonparametric statistic based on BR. After simulation studies, we benchmark the performance of our FS algorithm on the AHA and MIT-BIH ECG databases. Our results show that the proposed FS algorithm outperforms the recursive feature elimination method on synthetic examples, and that the VF detector performance improves with the reduced feature set.
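The backward procedure can be sketched generically. The `score` callback standing in for SVM accuracy on a bootstrap resample is an assumption; the paper's nonparametric bootstrap test for deciding when a removal is significant is replaced here by a plain bootstrap average and a fixed stopping size:

```python
import random

def backward_fs(n_features, score, min_keep=1, B=25, rng=random.Random(0)):
    """Backward elimination skeleton: repeatedly drop the feature whose
    removal hurts the bootstrap-averaged score least. `score(subset, u)`
    should evaluate the classifier on a bootstrap resample driven by the
    random draw u."""
    subset = set(range(n_features))
    while len(subset) > min_keep:
        def avg_without(f):
            reduced = subset - {f}
            return sum(score(reduced, rng.random()) for _ in range(B)) / B
        # the feature whose removal leaves the highest score is least needed
        subset.discard(max(subset, key=avg_without))
    return subset
```

In the full method, the stopping rule would come from the bootstrap statistic (stop when every removal significantly degrades performance) rather than from `min_keep`.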
20.
In supervised classification, we often encounter real-world problems in which the data are not equitably distributed among the different classes of the problem; in such cases, we are dealing with so-called imbalanced data sets. One of the most widely used techniques to deal with this problem consists of preprocessing the data prior to the learning process. This paper proposes a method belonging to the family of nested generalized exemplar methods, which accomplishes learning by storing objects in Euclidean n-space. Classification of new data is performed by computing their distance to the nearest generalized exemplar. The method is optimized by selecting the most suitable generalized exemplars using evolutionary algorithms. An experimental analysis is carried out over a wide range of highly imbalanced data sets, using the statistical tests suggested in the specialized literature. The results show that our evolutionary proposal outperforms other classic and recent models in accuracy and requires storing fewer generalized exemplars.