Similar Documents
20 similar documents found (search time: 15 ms)
1.
Locally adaptive density estimation presents challenges for parametric or non-parametric estimators. Several useful properties of tessellation density estimators (TDEs), such as low bias, scale invariance and sensitivity to local data morphology, make them an attractive alternative to standard kernel techniques. However, simple TDEs are discontinuous and produce highly unstable estimates due to their susceptibility to sampling noise. To address these concerns, we propose applying TDEs within a bootstrap aggregation algorithm, incorporating model selection with complexity penalization. We implement complexity reduction of the TDE via sub-sampling, and use information-theoretic criteria for model selection, which leads to an automatic and approximately ideal bias/variance compromise. The procedure yields a stabilized estimator that automatically adapts to the complexity of the generating distribution and the quantity of information at hand, while retaining the highly desirable properties of the TDE. The simulation studies presented suggest that a high degree of stability and sensitivity can be obtained using this approach.
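A minimal sketch of the sub-sampled bootstrap aggregation described above, for a one-dimensional sample. A k-nearest-neighbour density estimate stands in for the tessellation step (a true TDE would use Voronoi/Delaunay cell volumes), so this illustrates only the aggregation and sub-sampling, not the TDE itself; all names and parameters are illustrative assumptions.

    import numpy as np

    def knn_density(train, query, k=5):
        # 1-D density estimate: k / (n * width of the interval reaching the k-th neighbour)
        d = np.abs(query[:, None] - train[None, :])
        rk = np.sort(d, axis=1)[:, k - 1]
        return k / (len(train) * 2.0 * rk)

    def bagged_density(data, query, n_boot=50, subsample=0.6, k=5, seed=None):
        rng = np.random.default_rng(seed)
        m = int(subsample * len(data))           # sub-sampling reduces complexity
        estimates = [knn_density(rng.choice(data, size=m, replace=True), query, k)
                     for _ in range(n_boot)]
        return np.mean(estimates, axis=0)        # aggregation stabilizes the estimate

    x = np.random.default_rng(0).normal(size=500)
    grid = np.linspace(-3, 3, 7)
    print(bagged_density(x, grid))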

2.
This article proposes a weighted bootstrap procedure, an efficient bootstrap technique for neural model selection. To reduce computational effort, rather than resampling uniformly from the original sample (as in the original bootstrap procedure), the resampling distribution is modified to obtain variance reduction. The performance of the weighted bootstrap is demonstrated on two artificial data sets and one real data set. Experimental results show that the weighted bootstrap procedure permits an approximately 2-to-1 reduction in replication size.
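A minimal sketch of non-uniform ("weighted") bootstrap resampling with importance reweighting. The weight function here is a hypothetical stand-in, not the paper's scheme; the point is only that resampling indices are drawn from a modified distribution p=w rather than uniformly.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    w = np.exp(-0.5 * (x - x.mean())**2)          # hypothetical importance weights
    w /= w.sum()

    def weighted_bootstrap_stat(x, w, stat, B=200):
        vals = []
        for _ in range(B):
            idx = rng.choice(len(x), size=len(x), p=w)   # non-uniform resampling
            # importance-reweight so the statistic still targets the original law
            iw = 1.0 / (len(x) * w[idx])
            vals.append(stat(x[idx], iw / iw.sum()))
        return np.array(vals)

    boot_means = weighted_bootstrap_stat(x, w, lambda s, q: np.sum(q * s))
    print(boot_means.std())                       # bootstrap standard error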

3.
We propose a new penalized least squares approach to handling high-dimensional statistical analysis problems. Our proposed procedure can outperform the SCAD penalty technique (Fan and Li, 2001) when the number of predictors p is much larger than the number of observations n, and/or when the correlation among predictors is high. The proposed procedure has some of the properties of the smoothly clipped absolute deviation (SCAD) penalty method, including sparsity and continuity, and is asymptotically equivalent to an oracle estimator. We show how the approach can be used to analyze high-dimensional data, e.g., microarray data, to construct a classification rule and at the same time automatically select significant genes. A simulation study and real data examples demonstrate the practical aspects of the new method.
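For reference, the SCAD penalty of Fan and Li (2001) mentioned above, transcribed directly from the published piecewise formula (a > 2, conventionally a = 3.7); the abstract's new penalty is not specified here.

    import numpy as np

    def scad_penalty(theta, lam, a=3.7):
        t = np.abs(theta)
        p1 = lam * t                                               # |t| <= lam
        p2 = -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1))   # lam < |t| <= a*lam
        p3 = (a + 1) * lam**2 / 2                                  # |t| > a*lam
        return np.where(t <= lam, p1, np.where(t <= a * lam, p2, p3))

    print(scad_penalty(np.array([0.1, 1.0, 5.0]), lam=0.5))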

4.
Uniform resampling is the easiest bootstrap plan to apply and is a general recipe for all problems, but it may require a large replication size B. To save computational effort, balanced bootstrap resampling is proposed as an alternative resampling plan. This plan is effective for approximating the center of the bootstrap distribution, and this paper therefore applies it to neural model selection. Numerical experiments indicate that the replication size B can be reduced considerably. The efficiency of balanced bootstrap resampling is also discussed.
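A minimal sketch of the classic balanced bootstrap construction: each original observation appears exactly B times across the B resamples, which centres the bootstrap distribution with fewer replications than uniform resampling.

    import numpy as np

    def balanced_bootstrap_indices(n, B, seed=None):
        rng = np.random.default_rng(seed)
        pool = np.tile(np.arange(n), B)      # every index exactly B times
        rng.shuffle(pool)
        return pool.reshape(B, n)            # row b = indices of the b-th resample

    idx = balanced_bootstrap_indices(n=10, B=4, seed=0)
    assert np.all(np.bincount(idx.ravel(), minlength=10) == 4)
    print(idx[0])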

5.
The existing rival penalized competitive learning (RPCL) algorithm and its variants provide an attractive way to perform data clustering without knowing the exact number of clusters. However, their performance is sensitive to the preselection of the rival delearning rate. In this paper, we further investigate RPCL and present a mechanism to control the strength of rival penalization dynamically. Consequently, we propose the rival penalization controlled competitive learning (RPCCL) algorithm and its stochastic version. In each of these algorithms, the selection of the delearning rate is circumvented using a novel technique. We compare the performance of RPCCL with RPCL in Gaussian mixture clustering and color image segmentation. The experiments produce promising results.
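A minimal sketch of the baseline RPCL update: the winning centre moves toward the input while the runner-up (rival) is pushed away. The fixed delearning rate below is exactly the sensitivity RPCCL removes; its dynamic penalization mechanism is not reproduced here.

    import numpy as np

    def rpcl_step(centers, x, lr=0.05, delearn=0.005):
        d = np.linalg.norm(centers - x, axis=1)
        win, rival = np.argsort(d)[:2]
        centers[win] += lr * (x - centers[win])           # attract the winner
        centers[rival] -= delearn * (x - centers[rival])  # penalize the rival
        return centers

    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in (0.0, 3.0)])
    centers = rng.normal(1.5, 1.0, size=(4, 2))           # deliberately too many
    for x in rng.permutation(data):
        centers = rpcl_step(centers, x)
    print(centers)                                        # extra centres drift away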

6.
The performance of model-based bootstrap methods for constructing pointwise confidence intervals around the survival function with interval-censored data is investigated. It is shown that bootstrapping from the nonparametric maximum likelihood estimator of the survival function is inconsistent for the current status model. A model-based smoothed bootstrap procedure is proposed and proved to be consistent. In fact, a general framework for proving the consistency of any model-based bootstrap scheme in the current status model is established. In addition, simulation studies are conducted to illustrate the (in)consistency of different bootstrap methods in mixed case interval censoring. The conclusions in the interval censoring model would extend more generally to estimators in regression models that exhibit non-standard rates of convergence.

7.
Robust model selection procedures control the undue influence that outliers can have on the selection criteria by using both robust point estimators and a bounded loss function when measuring either the goodness-of-fit or the expected prediction error of each model. Furthermore, to avoid favoring over-fitted models, these two measures can be combined with a penalty term for the size of the model. The expected prediction error conditional on the observed data may be estimated using the bootstrap. However, bootstrapping robust estimators becomes extremely time consuming on moderate- to high-dimensional data sets. It is shown that the expected prediction error can be estimated using a very fast and robust bootstrap method, and that this approach yields a consistent model selection method that is computationally feasible even for a relatively large number of covariates. Moreover, as opposed to other bootstrap methods, this proposal avoids the numerical problems associated with the small bootstrap samples required to obtain consistent model selection criteria. The finite-sample performance of the fast and robust bootstrap model selection method is investigated through a simulation study, while its feasibility and good performance on moderately large regression models are illustrated on several real data examples.
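A minimal sketch of the quantity being estimated: bootstrap expected prediction error under a bounded loss with a robust point estimator. Assumptions: scikit-learn's HuberRegressor as the robust fit and squared loss truncated at 1.0 as the bounded loss; the fast and robust bootstrap's recomputation shortcut is not reproduced, each resample is simply refitted in full.

    import numpy as np
    from sklearn.linear_model import HuberRegressor

    def bootstrap_prediction_error(X, y, B=50, seed=None):
        rng = np.random.default_rng(seed)
        n = len(y)
        errs = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)
            oob = np.setdiff1d(np.arange(n), idx)        # out-of-bootstrap points
            if len(oob) == 0:
                continue
            fit = HuberRegressor().fit(X[idx], y[idx])   # robust point estimator
            r = y[oob] - fit.predict(X[oob])
            errs.append(np.mean(np.minimum(r**2, 1.0)))  # bounded loss caps outliers
        return np.mean(errs)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 3))
    y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=80)
    print(bootstrap_prediction_error(X, y))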

8.
Cluster-based instance selection for machine classification
Instance selection in supervised machine learning, often referred to as data reduction, aims at deciding which instances from the training set should be retained for further use during the learning process. Instance selection can increase the capability and generalization properties of the learning model, shorten the learning process, or help in scaling up to large data sources. The paper proposes a cluster-based instance selection approach, with the learning process executed by a team of agents, and discusses its four variants. The basic assumption is that instance selection is carried out after the training data have been grouped into clusters. To validate the proposed approach and to investigate the influence of the clustering method on the quality of the classification, a computational experiment has been carried out.
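A minimal sketch of selecting instances after clustering, assuming per-class k-means and keeping the point nearest each centroid as its representative; the paper's agent-team variants are not modelled.

    import numpy as np
    from sklearn.cluster import KMeans

    def select_instances(X, y, per_class_clusters=20, seed=0):
        keep_X, keep_y = [], []
        for c in np.unique(y):                     # cluster within each class
            Xc = X[y == c]
            k = min(per_class_clusters, len(Xc))
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
            for centre in km.cluster_centers_:
                keep_X.append(Xc[np.argmin(np.linalg.norm(Xc - centre, axis=1))])
                keep_y.append(c)
        return np.array(keep_X), np.array(keep_y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4)); y = (X[:, 0] > 0).astype(int)
    Xs, ys = select_instances(X, y)
    print(Xs.shape)                                # reduced training set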

9.
Genetic algorithms (GAs) have been used as conventional methods to adaptively evolve classifier solutions for classification problems. Feature selection plays an important role in finding relevant features for classification. In this paper, feature selection is explored with modular GA-based classification. A new feature selection technique, the relative importance factor (RIF), is proposed to find less relevant features in the input domain of each class module. By removing these features, the aim is to reduce both the classification error and the dimensionality of classification problems. Benchmark classification data sets are used to evaluate the proposed approach. The experimental results show that RIF can be used to find less relevant features and helps achieve lower classification error with a reduced feature space dimension.

10.
Input feature selection for classification problems
Feature selection plays an important role in classifying systems such as neural networks (NNs). Given a set of attributes that may be relevant, irrelevant or redundant, and a dataset that can be huge, reducing the number of attributes by selecting only the relevant ones is desirable; in doing so, higher performance with lower computational effort is expected. In this paper, we propose two feature selection algorithms. The limitation of the mutual information feature selector (MIFS) is analyzed and a method to overcome this limitation is studied. One of the proposed algorithms makes more considered use of the mutual information between input attributes and output classes than MIFS, and we demonstrate that it can provide the performance of the ideal greedy selection algorithm when information is distributed uniformly; its computational load is nearly the same as that of MIFS. In addition, another feature selection algorithm using the Taguchi method is proposed, addressing the question of how to identify good features with as few experiments as possible. The proposed algorithms are applied to several classification problems and compared with MIFS. The two algorithms can be combined to complement each other's limitations; the combined algorithm performed well in several experiments and should prove to be a useful method for selecting features in classification problems.
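A minimal sketch of the baseline MIFS criterion discussed above: greedily pick the feature maximising I(f; C) - beta * sum of I(f; s) over already-selected features s. Mutual information terms are estimated with scikit-learn, and features are discretized for the feature-feature terms; both are implementation assumptions.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.metrics import mutual_info_score

    def mifs(X, y, n_select=5, beta=0.5, bins=10):
        # discretize each feature so pairwise MI can be estimated
        Xd = np.stack([np.digitize(c, np.histogram_bin_edges(c, bins)[1:-1])
                       for c in X.T])
        relevance = mutual_info_classif(X, y, random_state=0)   # I(f; C)
        selected, remaining = [], list(range(X.shape[1]))
        while len(selected) < n_select:
            scores = [relevance[f] - beta * sum(mutual_info_score(Xd[f], Xd[s])
                                                for s in selected)
                      for f in remaining]
            selected.append(remaining.pop(int(np.argmax(scores))))
        return selected

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 8)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
    print(mifs(X, y))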

11.
Text classification is usually based on constructing a model through learning from training examples to automatically classify text documents. However, as the size of text document repositories grows rapidly, the storage requirement and computational cost of model learning become higher. Instance selection is one solution to these limitations; its aim is to reduce the data size by filtering out noisy data from a given training dataset. In this paper, we introduce a novel algorithm for this task, a biological-based genetic algorithm (BGA). BGA fits "biological evolution" into the evolutionary process: after long-term evolution, organisms find the most efficient way to allocate resources and evolve, and the most streamlined process also complies with reasonable rules. By closely simulating this natural evolution, the algorithm is made both efficient and effective. The experimental results based on the TechTC-100 and Reuters-21578 datasets show that BGA outperforms five state-of-the-art algorithms. In particular, using BGA to select text documents not only results in the largest dataset reduction rate, but also requires the least computational time. Moreover, BGA enables the k-NN and SVM classifiers to provide similar or slightly better classification accuracy than a standard GA.
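A minimal sketch of the plain-GA baseline the abstract compares against: a binary chromosome marks retained training instances, and fitness trades validation accuracy against retained-set size. BGA's biological operators are not reproduced; the penalty weight and operator rates are assumptions.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5)); y = (X[:, 0] > 0).astype(int)
    Xv, yv = X[:100], y[:100]; Xt, yt = X[100:], y[100:]

    def fitness(mask):
        if mask.sum() < 3:
            return -1.0
        acc = KNeighborsClassifier(3).fit(Xt[mask], yt[mask]).score(Xv, yv)
        return acc - 0.1 * mask.mean()               # penalize large retained sets

    pop = rng.random((30, len(yt))) < 0.5            # random initial population
    for _ in range(40):
        fit = np.array([fitness(m) for m in pop])
        parents = pop[np.argsort(fit)[-15:]]         # truncation selection
        cut = rng.integers(1, len(yt), size=15)
        kids = np.array([np.r_[parents[i][:c], parents[(i + 1) % 15][c:]]
                         for i, c in enumerate(cut)])    # one-point crossover
        kids ^= rng.random(kids.shape) < 0.01            # bit-flip mutation
        pop = np.vstack([parents, kids])
    best = pop[np.argmax([fitness(m) for m in pop])]
    print(best.sum(), "instances retained")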

12.
Cancer diagnosis is an important emerging clinical application of microarray data. Accurate prediction of the type or size of tumors relies on adopting powerful and reliable classification models, so that patients can be provided with better treatment or response to therapy. However, the high dimensionality of microarray data may bring disadvantages, such as over-fitting, poor performance and low efficiency, to traditional classification models. Thus, one of the challenging tasks in cancer diagnosis is to identify, among the thousands of genes in microarray data, the salient expression genes that directly contribute to the phenotype or symptom of disease. In this paper, we propose a new ensemble gene selection method (EGS) to choose multiple gene subsets for classification, where the significance of a gene is measured by conditional mutual information or its normalized form. After different gene subsets have been obtained by setting different starting points of the search procedure, they are used to train multiple base classifiers, which are then aggregated into a consensus classifier by majority voting. The proposed method is compared with five popular gene selection methods on six public microarray datasets, and the comparison results show that our method works well.
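A minimal sketch of the ensemble idea: several gene subsets, obtained here from different random starts of a simple ranking step, each train a base classifier whose predictions are combined by majority vote. Scikit-learn's mutual information estimate and GaussianNB stand in for the paper's conditional mutual information scoring and base classifiers.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.naive_bayes import GaussianNB

    def egs_predict(Xtr, ytr, Xte, n_subsets=5, subset_size=10):
        votes = []
        for seed in range(n_subsets):
            mi = mutual_info_classif(Xtr, ytr, random_state=seed)  # varies per start
            genes = np.argsort(mi)[-subset_size:]
            votes.append(GaussianNB().fit(Xtr[:, genes], ytr).predict(Xte[:, genes]))
        votes = np.array(votes)
        return (votes.mean(axis=0) > 0.5).astype(int)   # majority vote (binary case)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 200)); y = (X[:, :3].sum(axis=1) > 0).astype(int)
    print(egs_predict(X[:80], y[:80], X[80:]).shape)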

13.
Recent developments in texture classification have shown that proper integration of texture methods from different families leads to significant improvements in classification rate compared to the use of a single family of texture methods. To reduce the computational burden of that integration process, a selection stage is necessary. A large number of feature selection techniques have been proposed, but a specific texture feature selection procedure must typically be applied for each particular set of texture patterns to be classified. This paper describes a new texture feature selection algorithm that is independent of specific classification problems/applications and thus only needs to be run once for a given set of available texture methods. The proposed application-independent selection scheme has been evaluated and compared with previous proposals on both Brodatz compositions and complex real images.

14.
This paper presents an approach for selecting the optimal reference subset (ORS) for the nearest neighbor classifier. The ORS, which has minimum sample size and satisfies a certain resubstitution error rate threshold, is obtained through a tabu search (TS) algorithm. When the error rate threshold is set to zero, the algorithm obtains a near-minimal consistent subset of a given training set. When the threshold is set to a small appropriate value, the obtained reference subset may have reasonably good generalization capacity. A neighborhood exploration method and an aspiration criterion are proposed to improve the efficiency of TS. Experimental results based on a number of typical data sets are presented and analyzed to illustrate the benefits of the proposed method, and the performance of the resulting consistent and non-consistent reference subsets is evaluated.
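A minimal sketch of tabu search over reference subsets for a 1-NN classifier: moves toggle one instance in or out, recently toggled instances are tabu, and a subset is feasible when its resubstitution error stays below a threshold. The paper's neighborhood exploration method and aspiration criterion are simplified away; tenure, penalty weight and candidate-list size are assumptions.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 4)); y = (X[:, 0] + X[:, 1] > 0).astype(int)

    def resub_error(mask):
        if mask.sum() < 1:
            return 1.0
        return 1.0 - KNeighborsClassifier(1).fit(X[mask], y[mask]).score(X, y)

    def cost(mask, threshold=0.05):
        # subset size plus a heavy penalty for violating the error threshold
        return mask.sum() + 1000 * max(0.0, resub_error(mask) - threshold)

    mask = np.ones(len(y), dtype=bool)
    tabu, best = {}, mask.copy()
    for it in range(300):
        cands = [i for i in rng.choice(len(y), 30, replace=False)
                 if tabu.get(i, -1) < it]            # skip tabu-active moves
        if not cands:
            continue
        moves = []
        for i in cands:
            m = mask.copy(); m[i] = ~m[i]
            moves.append((cost(m), i, m))
        c, i, mask = min(moves)                      # best admissible neighbour
        tabu[i] = it + 10                            # tabu tenure of 10 iterations
        if cost(mask) < cost(best):
            best = mask.copy()
    print(best.sum(), "reference instances")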

15.
Dynamic data streams are characterized by large data volumes, rapid change, high random-access cost, and the difficulty of storing detailed data, so mining them places very heavy demands on both computation and storage. Addressing these characteristics, a Bayesian classification algorithm for dynamic data streams based on bootstrap sampling is designed. The algorithm analyzes the stream with a sliding-window model, treating each window's data as the basic unit of processing and analysis. Bootstrap sampling is used to prune and optimize the attributes of the data to be classified, resolving multicollinearity among attributes. Exploiting the characteristics of the Bayesian algorithm, a dynamic incremental storage tree is adopted to store the dynamic sample stream, achieving static, finite storage of an unbounded dynamic data stream without information loss and thereby solving the hardest problem in dynamic stream mining: data storage. The optimized data are classified with an all-Bayes classifier and a k-Bayes classifier, both updated in real time according to the characteristics of the stream. The algorithm effectively overcomes the attribute-independence constraint of Bayesian classification and the limitation of traditional Bayesian methods to static data, as well as the storage problem of dynamic data streams. Experiments show that bootstrap-based Bayesian classification achieves high timeliness and accuracy.

16.
Early detection of ventricular fibrillation (VF) is crucial for the success of defibrillation therapy in automatic devices. A large number of detectors have been proposed based on temporal, spectral, and time-frequency parameters extracted from the surface electrocardiogram (ECG), but they have always shown limited performance. Combining ECG parameters from different domains (time, frequency, and time-frequency) using machine learning algorithms has been used to improve detection efficiency. However, the potential use of a wide number of parameters in machine learning schemes has raised the need for efficient feature selection (FS) procedures. In this study, we propose a novel FS algorithm based on support vector machine (SVM) classifiers and bootstrap resampling (BR) techniques. We define a backward FS procedure that relies on evaluating changes in SVM performance when removing features from the input space; this evaluation uses a nonparametric statistic based on BR. After simulation studies, we benchmark the performance of our FS algorithm on the AHA and MIT-BIH ECG databases. Our results show that the proposed FS algorithm outperforms the recursive feature elimination method on synthetic examples, and that VF detector performance improves with the reduced feature set.
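A minimal sketch of the backward scheme described above: at each step, bootstrap resampling estimates out-of-bootstrap SVM accuracy with each candidate feature dropped, and the feature whose removal degrades performance least is eliminated. The paper's nonparametric statistic on the bootstrap replicates is reduced here to comparing mean accuracies, which is an assumption.

    import numpy as np
    from sklearn.svm import SVC

    def boot_acc(X, y, feats, B=20, seed=1):
        rng = np.random.default_rng(seed)
        accs = []
        for _ in range(B):
            idx = rng.integers(0, len(y), size=len(y))
            oob = np.setdiff1d(np.arange(len(y)), idx)   # out-of-bootstrap test set
            clf = SVC().fit(X[np.ix_(idx, feats)], y[idx])
            accs.append(clf.score(X[np.ix_(oob, feats)], y[oob]))
        return np.mean(accs)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6)); y = (X[:, 0] - X[:, 1] > 0).astype(int)
    feats = list(range(X.shape[1]))
    while len(feats) > 2:
        drop_scores = [boot_acc(X, y, [f for f in feats if f != g]) for g in feats]
        feats.pop(int(np.argmax(drop_scores)))   # remove the least useful feature
    print("retained features:", feats)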

17.
Mapping land cover of large regions often requires processing satellite images collected over several time periods at many spectral wavelength channels. However, manipulating and processing large amounts of image data increases the complexity and time, and hence the cost, of producing a land cover map. Very few studies have evaluated the importance of individual Advanced Very High Resolution Radiometer (AVHRR) channels for discriminating cover types, especially the thermal channels (channels 3, 4 and 5), and studies rarely perform a multi-year analysis to determine the impact of inter-annual variability on the classification results. We evaluated 5 years of AVHRR data using combinations of the original AVHRR spectral channels (1-5) to determine which channels are most important for cover type discrimination while stabilizing inter-annual variability. Particular attention was placed on the channels in the thermal portion of the spectrum. Fourteen cover types over the entire state of Colorado were evaluated using a supervised classification approach on all two-, three-, four- and five-channel combinations for seven AVHRR biweekly composite datasets covering the entire growing season in each of the 5 years. Results show that all three major portions of the electromagnetic spectrum represented by the AVHRR sensor are required to discriminate cover types effectively and stabilize inter-annual variability. Of the two-channel combinations, channels 1 (red visible) and 2 (near-infrared) had by far the highest average overall accuracy (72.2%), yet the inter-annual classification accuracies were highly variable. Including a thermal channel (channel 4) significantly increased the average overall classification accuracy by 5.5% and stabilized inter-annual variability. Each of the thermal channels gave similar classification accuracies; however, because of the problems in consistently interpreting channel 3 data, either channel 4 or 5 was found to be a more appropriate choice. Substituting the thermal channel with a single elevation layer resulted in equivalent classification accuracies and inter-annual variability.
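A minimal sketch of the exhaustive channel-combination evaluation described above: every subset of the five channels is scored with a supervised classifier under cross-validation. The random data, labels and the LDA classifier are placeholders, not the study's imagery or its maximum-likelihood setup.

    import itertools
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 5))                 # hypothetical 5-channel pixels
    y = rng.integers(0, 14, size=600)             # 14 cover types, random here

    results = {}
    for r in range(2, 6):                         # all 2- to 5-channel combinations
        for combo in itertools.combinations(range(5), r):
            acc = cross_val_score(LinearDiscriminantAnalysis(),
                                  X[:, combo], y, cv=3).mean()
            results[combo] = acc
    best = max(results, key=results.get)
    print("best channel combination:", best, "accuracy:", round(results[best], 3))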

18.
Feature selection for multi-label naive Bayes classification
In multi-label learning, the training set is made up of instances each associated with a set of labels, and the task is to predict the label sets of unseen instances. In this paper, this learning problem is addressed using a method called Mlnb, which adapts the traditional naive Bayes classifier to deal with multi-label instances. Feature selection mechanisms are incorporated into Mlnb to improve its performance. First, feature extraction techniques based on principal component analysis are applied to remove irrelevant and redundant features. After that, feature subset selection techniques based on genetic algorithms are used to choose the most appropriate subset of features for prediction. Experiments on synthetic and real-world data show that Mlnb achieves performance comparable to other well-established multi-label learning algorithms.
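A minimal sketch of the pipeline's first stage: PCA removes redundant directions, then a naive Bayes model is fitted per label. Binary relevance with GaussianNB stands in for the paper's multi-label adaptation, and the genetic-algorithm subset search is omitted.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 30))
    Y = np.stack([(X[:, 0] > 0), (X[:, 1] + X[:, 2] > 0)], axis=1).astype(int)

    Z = PCA(n_components=10).fit_transform(X)        # feature extraction stage
    models = [GaussianNB().fit(Z[:100], Y[:100, j]) for j in range(Y.shape[1])]
    pred = np.stack([m.predict(Z[100:]) for m in models], axis=1)
    print(pred[:5])                                  # predicted label sets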

19.
To address the problem of imbalanced data sets in Uyghur text classification, an improved chi-square feature selection method is proposed. The text is first preprocessed according to the linguistic characteristics of Uyghur to reduce the dimensionality of the feature space; feature selection combining the chi-square statistic with inverse document frequency then reduces the dimensionality further; finally, a naive Bayes classifier is used for classification. Experiments on an imbalanced Uyghur corpus show that the proposed feature selection method outperforms chi-square and information-gain feature selection on imbalanced data sets.
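A minimal sketch of the combined scoring idea: rank terms by the product of the chi-square statistic and inverse document frequency, so frequent but uninformative terms are demoted on imbalanced corpora. The random document-term matrix is a stand-in for a real Uyghur corpus, and the exact combination rule is an assumption.

    import numpy as np
    from sklearn.feature_selection import chi2

    rng = np.random.default_rng(0)
    X = (rng.random((400, 1000)) < 0.05).astype(int)   # doc-term incidence matrix
    y = (rng.random(400) < 0.1).astype(int)            # imbalanced class labels

    chi, _ = chi2(X, y)                                # chi-square per term
    df = X.sum(axis=0)
    idf = np.log(len(y) / (1.0 + df))                  # inverse document frequency
    score = chi * idf                                  # combined selection score
    top_terms = np.argsort(score)[-50:]
    print(top_terms[:10])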

20.
In many important application domains, such as text categorization, biomolecular analysis, scene classification and medical diagnosis, examples are naturally associated with more than one class label, giving rise to multi-label classification problems. This fact has led, in recent years, to a substantial amount of research on feature selection methods that allow the identification of relevant and informative features for multi-label classification. However, the methods proposed for this task are scattered in the literature, with no common framework to describe them and to allow an objective comparison. Here, we revisit a categorization of existing multi-label classification methods and, as our main contribution, provide a comprehensive survey and novel categorization of the feature selection techniques that have been created for the multi-label classification setting. We conclude with concrete suggestions for future research in multi-label feature selection, derived from our categorization and analysis.
