Similar Literature
20 similar documents found.
1.
Contemporary biological technologies produce extremely high-dimensional data sets from which to design classifiers, with 20,000 or more potential features being commonplace. In addition, sample sizes tend to be small. In such settings, feature selection is an inevitable part of classifier design. Heretofore, there have been a number of comparative studies of feature selection, but they have either considered settings with much smaller dimensionality than those occurring in current bioinformatics applications or constrained their study to a few real data sets. This study compares some basic feature-selection methods in settings involving thousands of features, using both model-based synthetic data and real data. It defines distribution models involving different numbers of markers (useful features) versus non-markers (useless features) and different kinds of relations among the features. Under this framework, it evaluates the performances of feature-selection algorithms for different distribution models and classifiers. Both classification error and the number of discovered markers are computed. Although the results clearly show that none of the considered feature-selection methods performs best across all scenarios, there are some general trends relative to sample size and relations among the features. For instance, the classifier-independent univariate filter methods have similar trends. Filter methods such as the t-test have better or similar performance to wrapper methods for harder problems. This improved performance is usually accompanied by significant peaking. Wrapper methods have better performance when the sample size is sufficiently large. ReliefF, the classifier-independent multivariate filter method, has worse performance than univariate filter methods in most cases; however, ReliefF-based wrapper methods show performance similar to their t-test-based counterparts.
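A minimal sketch of the kind of univariate t-test filter compared in this study: each feature is scored by a two-sample t statistic and the top-ranked features are kept. The toy dataset, the use of Welch's statistic and all names here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy import stats

def t_test_filter(X, y, k):
    """Rank features by |two-sample t statistic| and return the top-k indices."""
    X0, X1 = X[y == 0], X[y == 1]
    t, _ = stats.ttest_ind(X0, X1, axis=0, equal_var=False)  # Welch's t per feature
    return np.argsort(-np.abs(t))[:k]

# Toy usage: 100 samples, 2000 features, 20 true markers shifted in class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)
X[y == 1, :20] += 1.0
print(t_test_filter(X, y, 10))
```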

2.
In classification, feature selection is an important data pre-processing technique, but it is a difficult problem due mainly to the large search space. Particle swarm optimisation (PSO) is an efficient evolutionary computation technique. However, the traditional personal-best and global-best updating mechanism in PSO limits its performance for feature selection, and the potential of PSO for feature selection has not been fully investigated. This paper proposes three new initialisation strategies and three new personal-best and global-best updating mechanisms in PSO to develop novel feature selection approaches with the goals of maximising the classification performance, minimising the number of features and reducing the computational time. The proposed initialisation strategies and updating mechanisms are compared with the traditional initialisation and the traditional updating mechanism. Meanwhile, the most promising initialisation strategy and updating mechanism are combined to form a new approach (PSO(4-2)) to address feature selection problems, and it is compared with two traditional feature selection methods and two PSO-based methods. Experiments on twenty benchmark datasets show that PSO with the new initialisation strategies and/or the new updating mechanisms can automatically evolve a feature subset with a smaller number of features and higher classification performance than using all features. PSO(4-2) outperforms the two traditional methods and the two PSO-based algorithms in terms of computational time, the number of features and the classification performance. The superior performance of this algorithm is due mainly to the proposed initialisation strategy, which takes advantage of both forward selection and backward selection to decrease the number of features and the computational time, and to the new updating mechanism, which overcomes the limitations of traditional updating mechanisms by taking the number of features into account, further reducing the number of features and the computational time.
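A minimal sketch of binary PSO for feature selection in the same spirit: each particle is a bit mask over features, and fitness trades classification accuracy against subset size. The XOR-based velocity update, the weighting alpha and the kNN evaluator are illustrative assumptions; the paper's specific initialisation strategies and pbest/gbest updating mechanisms are not reproduced here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, alpha=0.9):
    """Weighted trade-off between accuracy and subset size (assumed form)."""
    if not mask.any():
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(5), X[:, mask], y, cv=3).mean()
    return alpha * acc + (1 - alpha) * (1 - mask.mean())

def binary_pso(X, y, n_particles=20, iters=30, rng=np.random.default_rng(0)):
    d = X.shape[1]
    pos = rng.random((n_particles, d)) < 0.5            # random bit-mask init
    vel = rng.normal(scale=0.1, size=(n_particles, d))
    pbest = pos.copy()
    pfit = np.array([fitness(p, X, y) for p in pos])
    gbest = pbest[pfit.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, d))
        # XOR marks bits that differ from pbest/gbest and should be pulled over
        vel = 0.7 * vel + 1.5 * r1 * (pbest ^ pos) + 1.5 * r2 * (gbest ^ pos)
        pos = rng.random((n_particles, d)) < 1 / (1 + np.exp(-vel))  # sigmoid transfer
        fit = np.array([fitness(p, X, y) for p in pos])
        better = fit > pfit
        pbest[better], pfit[better] = pos[better], fit[better]
        gbest = pbest[pfit.argmax()].copy()
    return gbest  # boolean mask of selected features
```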

3.
ReliefF has proved to be a successful feature selector, but when handling a large dataset it is computationally expensive. We present an optimization using Supervised Model Construction which improves starter selection. Effectiveness has been evaluated using 12 UCI datasets and a clinical diabetes database. Experiments indicate that, compared with ReliefF, the proposed method improved computational efficiency whilst maintaining classification accuracy. On the clinical dataset (20,000 records with 47 features), feature selection via Supervised Model Construction (FSSMC) reduced the processing time by 80% compared to ReliefF and maintained accuracy for the Naive Bayes, IB1 and C4.5 classifiers.
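For context, a minimal sketch of the core Relief-style weight update that FSSMC accelerates, shown for a binary-class problem with one nearest hit and one nearest miss per iteration (the basic Relief scheme rather than full ReliefF):

```python
import numpy as np

def relief(X, y, n_iter=100, rng=np.random.default_rng(0)):
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # per-feature normalisation
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        diff = np.abs(X - X[i]) / span             # normalised distance per feature
        dist = diff.sum(axis=1)
        dist[i] = np.inf                           # exclude the sampled instance
        same, other = (y == y[i]), (y != y[i])
        hit = np.where(same)[0][dist[same].argmin()]    # nearest same-class
        miss = np.where(other)[0][dist[other].argmin()] # nearest other-class
        w += (diff[miss] - diff[hit]) / n_iter
    return w  # higher weight = more relevant feature
```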

4.
An algorithm is proposed for calculating correlation measures based on entropy. The proposed algorithm allows exhaustive exploration of variable subsets on real data. Its time efficiency is demonstrated by comparison against three other entropy-based variable selection methods, using 8 data sets from various domains as well as simulated data. The method is applicable to discrete data with a limited number of values, making it suitable for medical diagnostic support, DNA sequence analysis, psychometry and other domains.
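The paper's exact measure is not reproduced here, but a minimal sketch of a typical entropy-based correlation for discrete variables, symmetrical uncertainty SU(X,Y) = 2 I(X;Y) / (H(X) + H(Y)), illustrates the kind of quantity involved:

```python
import numpy as np

def entropy(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def symmetrical_uncertainty(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y); the joint is coded as value pairs
    hx, hy = entropy(x), entropy(y)
    hxy = entropy([f"{a},{b}" for a, b in zip(x, y)])
    mi = hx + hy - hxy
    return 2 * mi / (hx + hy) if hx + hy > 0 else 0.0
```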

5.
This paper presents a technique that gives a minimal window W for the estimation of a W-operator from training data. The idea is to choose a subset of variables W that maximizes the information observed in a training set. The task is formalized as a combinatorial optimization problem, where the search space is the powerset of the candidate variables and the measure to be minimized is the mean entropy of the estimated conditional probabilities. As a full exploration of the search space requires prohibitive computational effort, some heuristics from the feature selection literature are applied. The proposed technique is mathematically sound, and experimental results, including binary image filtering and gray-scale texture recognition, show its successful performance in practice.
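A minimal sketch of the selection criterion described above, under the assumption of binary observations: estimate the conditional label distribution for each pattern seen through a candidate window and score the window by the frequency-weighted mean entropy of those distributions (lower is better). All names are illustrative.

```python
import numpy as np
from collections import Counter, defaultdict

def mean_conditional_entropy(X, y, window):
    """X: (n, d) binary observations; window: tuple of column indices."""
    counts = defaultdict(Counter)
    for row, label in zip(X, y):
        counts[tuple(row[list(window)])][label] += 1
    n = len(y)
    score = 0.0
    for pattern, labels in counts.items():
        m = sum(labels.values())
        p = np.array(list(labels.values())) / m
        h = -(p * np.log2(p)).sum()   # H(Y | pattern)
        score += (m / n) * h          # weight by pattern frequency
    return score  # lower = more informative window
```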

6.
To give the classification-reasoning method for inverse analysis of system structure in space fault tree theory a rigorous mathematical definition, so that it can handle a broad range of inverse structural analysis problems, the factor logic of factor space theory is introduced to reconstruct the method. The purpose of the reconstruction is to cast the original definitions and steps in strict mathematical form and to endow the method with factor-logic reasoning capability, thereby raising its mathematical level and widening its applicability. The paper gives the basic procedure and description of the classification-reasoning method in space fault trees, redefines the method using factor logic, and presents the minimal disjunctive normal form steps, i.e., the reconstructed classification-reasoning procedure. Both the original and the reconstructed methods are applied to an example; the system structures obtained by the two methods are identical, while the latter has a higher logical-mathematical level and broader applicability.

7.
A novel method to classify multi-class biomedical objects is presented. The method is based on a hybrid approach which combines pairwise comparison, Bayesian regression and the k-nearest neighbor technique. It can be applied in a fully automatic way or in a relevance feedback framework. In the latter case, the information obtained from both an expert and the automatic classification is used iteratively to improve the results until a certain accuracy level is achieved; the learning process then finishes and new classifications can be performed automatically. The method has been applied in two biomedical contexts, following the same cross-validation schemes as in the original studies. The first concerns cancer diagnosis, where it achieves an accuracy of 77.35% versus the 66.37% obtained originally. The second considers the diagnosis of pathologies of the vertebral column. The original method achieves accuracies ranging from 76.5% to 96.7% and from 82.3% to 97.1% in two different cross-validation schemes. Even with no supervision, the proposed method reaches 96.71% and 97.32% in these two cases. Using a supervised framework, the achieved accuracy is 97.74%. Furthermore, all abnormal cases were correctly classified.

8.
Identification of relevant genes from microarray data is a clear need in many applications. For such identification, different ranking techniques with different evaluation criteria are used, which usually assign different ranks to the same gene. As a result, different techniques identify different gene subsets, which may not be the set of significant genes. To overcome such problems, this study suggests pipelining the ranking techniques. In each stage of the pipeline, a few of the lower-ranked features are eliminated, and at the end a relatively good subset of features is preserved. However, the order in which the ranking techniques are used in the pipeline is important to ensure that the significant genes are preserved in the final subset. For this experimental study, twenty-four unique pipeline models are generated out of four gene ranking strategies. These pipelines are tested with seven different microarray databases to find a suitable pipeline for the task. Further, the gene subset obtained is tested with four classifiers, and four performance metrics are evaluated. No single pipeline dominates the others in performance; therefore, a grading system is applied to the results of these pipelines to find a consistent model. The grading system's finding that one pipeline model is significant is also confirmed by the Nemenyi post-hoc hypothesis test. The performance of this pipeline model is compared with the four ranking techniques; although it is not always superior, it yields better results the majority of the time and can be suggested as a consistent model. However, it requires more computational time than single ranking techniques.
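A minimal sketch of the pipelining idea: apply several rankers in sequence, each stage dropping a fraction of the lowest-ranked genes, so that the final subset has survived every criterion. The two rankers used here are illustrative stand-ins for the paper's four strategies.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

def pipeline_select(X, y, rankers, keep=0.5):
    """Run rankers in sequence; each stage keeps the top `keep` fraction."""
    idx = np.arange(X.shape[1])
    for rank in rankers:
        out = rank(X[:, idx], y)
        scores = out[0] if isinstance(out, tuple) else out  # f_classif returns (F, p)
        top = np.argsort(-scores)[: max(1, int(keep * len(idx)))]
        idx = idx[top]  # survivors move on to the next stage
    return idx

# e.g. pipeline_select(X, y, [f_classif, mutual_info_classif], keep=0.5)
```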

9.
For parameter estimation of mixed models by maximum likelihood and restricted maximum likelihood, a practical storage scheme is designed and implemented for the covariance matrix of the random effects, the covariance matrix of the observation errors, and their derivative information.

10.
Feature test-retest reliability is proposed as a useful criterion for the selection/exclusion of features in time series classification tasks. Three sets of physiological time series are examined: EEG and ECG recordings, together with measurements of neck movement. Comparisons of reliability estimates from test-retest studies with measures of feature importance from classification tasks suggest that low reliability can be used to exclude irrelevant features prior to classifier training. By removing features with low reliability, an unnecessary degradation of classifier accuracy may be avoided.
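A minimal sketch of reliability-based screening, assuming features have been extracted from a test and a retest session for the same subjects: estimate per-feature reliability (here simply Pearson r between sessions; the paper's estimator may differ) and keep only features above a threshold before classifier training.

```python
import numpy as np

def reliable_features(F_test, F_retest, threshold=0.7):
    """F_test, F_retest: (n_subjects, n_features) matrices from the two sessions."""
    a = F_test - F_test.mean(axis=0)
    b = F_retest - F_retest.mean(axis=0)
    r = (a * b).sum(axis=0) / np.sqrt(
        (a**2).sum(axis=0) * (b**2).sum(axis=0) + 1e-12)
    return np.where(r >= threshold)[0]  # indices of features to keep
```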

11.
Nearest neighbor classification is one of the most used and well-known methods in data mining. Its simplest version has several drawbacks, such as low efficiency, high storage requirements and sensitivity to noise. Data reduction techniques have been used to alleviate these shortcomings. Among them, prototype selection and generation techniques have been shown to be very effective, and positioning adjustment of prototypes is a successful trend within the prototype generation methodology. Evolutionary algorithms are adaptive methods based on natural evolution that may be used for searching and optimization. Positioning adjustment of prototypes can be viewed as an optimization problem, so it can be solved using evolutionary algorithms. This paper proposes a differential evolution based approach for optimizing the positioning of prototypes. Specifically, we provide a complete study of the performance of four recent advances in differential evolution. Furthermore, we show the good synergy obtained by combining a prototype selection stage with an optimization of the positioning of prototypes prior to nearest neighbor classification. The results are contrasted with non-parametric statistical tests and show that our proposals outperform previously proposed methods.
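A minimal DE/rand/1/bin sketch of positioning adjustment: the genome is a flattened set of prototypes and fitness is 1-NN training accuracy with those prototypes. The parameters (F, CR, population size) and the basic DE variant are illustrative assumptions; the four advanced DE variants studied in the paper are not reproduced.

```python
import numpy as np

def nn_accuracy(prototypes, proto_labels, X, y):
    """1-NN accuracy of (X, y) against the given prototypes."""
    d = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return (proto_labels[d.argmin(1)] == y).mean()

def de_prototypes(X, y, n_per_class=2, pop=20, iters=50, F=0.5, CR=0.9,
                  rng=np.random.default_rng(0)):
    labels = np.repeat(np.unique(y), n_per_class)
    dim = len(labels) * X.shape[1]
    P = rng.uniform(X.min(), X.max(), size=(pop, dim))
    fit = np.array([nn_accuracy(p.reshape(len(labels), -1), labels, X, y)
                    for p in P])
    for _ in range(iters):
        for i in range(pop):
            a, b, c = P[rng.choice(pop, 3, replace=False)]
            mutant = a + F * (b - c)                     # rand/1 mutation
            cross = rng.random(dim) < CR                 # binomial crossover
            trial = np.where(cross, mutant, P[i])
            f = nn_accuracy(trial.reshape(len(labels), -1), labels, X, y)
            if f >= fit[i]:                              # greedy replacement
                P[i], fit[i] = trial, f
    best = P[fit.argmax()].reshape(len(labels), -1)
    return best, labels
```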

12.
Support vector machines (SVMs) are among the most popular classification tools and show great potential for addressing under-sampled noisy data (a large number of features and a relatively small number of samples). However, the computational cost is too expensive, even for modern-scale samples, and the performance depends largely on the proper setting of parameters. As the data scale increases, the improvement in speed becomes increasingly challenging. As the dimension (feature count) grows while the sample size remains small, avoiding overfitting becomes a significant challenge. In this study, we propose a two-phase sequential minimal optimization (TSMO) to largely reduce the training cost for large-scale data (tested with 3,186–70,000-sample datasets) and a two-phased-in differential-learning particle swarm optimization (tDPSO) to ensure accuracy for under-sampled data (tested with 2,000–24,481-feature datasets). Because the purpose of training SVMs is to identify the support vectors that define a hyperplane, TSMO is developed to quickly select support vector candidates from the entire dataset and then identify the support vectors among those candidates. In this manner, the computational burden is largely reduced (a 29.4%–65.3% reduction rate). The proposed tDPSO uses topology variation and differential learning to solve PSO's premature convergence issue. Population diversity is ensured through dynamic topology until a ring connection is achieved (topology-variation phases). Further, particles initiate chemo-type simulated-annealing operations, and the global-best particle takes a two-turn diversion in response to stagnation (event-induced phases). The proposed tDPSO-embedded SVMs were tested with several under-sampled noisy cancer datasets and showed superior performance over various methods, even methods with feature selection for data preprocessing.

13.
The G-mode central method is described. This method allows a large number of samples to be classified based on several variables. The classification method works without a priori knowledge of the taxonomic homogeneous units forming the statistical sample.

14.
Classification trees are a popular tool in applied statistics because their heuristic search approach based on impurity reduction is easy to understand and the interpretation of the output is straightforward. However, all standard algorithms suffer from a major problem: variable selection based on standard impurity measures such as the Gini index is biased. The bias is such that, e.g., splitting variables with a high number of missing values (even if missing completely at random, MCAR) are artificially preferred. A new split selection criterion that avoids variable selection bias is introduced. The exact distribution of the maximally selected Gini gain is derived by means of a combinatorial approach, and the resulting p-value is suggested as an unbiased split selection criterion in recursive partitioning algorithms. The efficiency of the method is demonstrated in simulation studies and a real data study from veterinary gynecology, in the context of binary classification and continuous predictor variables with different numbers of missing values. The proposed method is extensible to categorical and ordinal predictor variables and to other split selection criteria such as the cross-entropy.
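A minimal sketch of the quantity whose maximally selected distribution the paper derives: the Gini gain of a binary split on a continuous predictor, maximised over all cutpoints. The exact combinatorial p-value computation is not shown.

```python
import numpy as np

def gini(y):
    """Gini impurity; y holds non-negative integer class labels."""
    p = np.bincount(y) / len(y)
    return 1.0 - (p ** 2).sum()

def best_gini_gain(x, y):
    """Maximally selected Gini gain over all cutpoints of predictor x."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = 0.0
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no valid cutpoint between tied values
        left, right = ys[:i], ys[i:]
        gain = gini(ys) - (len(left) * gini(left)
                           + len(right) * gini(right)) / len(ys)
        best = max(best, gain)
    return best  # this raw maximum is what exhibits the selection bias
```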

15.
This paper presents a novel method for user classification in adaptive systems based on rough classification. Adaptive systems can be used in many areas, for example in user interface construction or in e-learning environments for learning strategy selection. In this paper, the adaptation of a web-based system's user interface is presented. The goal of rough user classification is to select the most essential attributes and attribute values that group together users who are very much alike with respect to the system logic. In order to group users, we exploit their usage data taken from the user model of the adaptive web-based system's user interface. We present three basic problems for attribute selection, generating the partition that is included in, the partition that includes, and the partition that is closest to the given partition.

Ngoc Thanh Nguyen, Ph.D., D.Sc.: He currently works as an associate professor at the Faculty of Computer Science and Management, Wroclaw University of Technology in Poland. He received his diplomas of M.Sc., Ph.D. and D.Sc. in Computer Science in 1986, 1989 and 2002, respectively. He is currently working on intelligent technologies for conflict resolution and inconsistent knowledge processing, and on e-learning methods. His teaching interests consist of database systems and distributed systems. He is a co-editor of 4 special issues in international journals, author of 3 monographs, editor of one book and about 110 other publications (book chapters, journal and refereed conference papers). He is an associate editor of the following journals: "International Journal of Computer Science & Application"; "Journal of Information Knowledge System Management"; and "International Journal of Knowledge-Based & Intelligent Engineering Systems". He is a member of the following societies: ACM, IFIP WG 7.2, ISAI, KES International, and WIC.

Janusz Sobecki, Ph.D.: He is an Assistant Professor in the Institute of Applied Informatics (IAI) at Wroclaw University of Technology (WUT). He received his M.Sc. in Computer Science from the Faculty of Computer Science and Management at WUT in 1986 and his Ph.D. in Computer Science from the Faculty of Electronics at WUT in 1994. From 1986 to 1996 he was an Assistant at the Department of Information Systems (DIS) at WUT, and from 1988 to 1996 he was also head of a laboratory at DIS. From 1996 to 2004 he was an Assistant Professor in DIS, and since fall of 2004 at IAI, both at WUT. His research interests include information retrieval, multimedia information systems, system usability and recommender systems. He is on the editorial board of New Generation Computing and was a co-editor of two journal special issues. He is a member of American Association of Machinery.

16.
AdaBoost is a highly effective ensemble learning method that combines several weak learners to produce a strong committee with higher accuracy. However, similar to other ensemble methods, AdaBoost uses a large number of base learners to produce the final outcome when addressing high-dimensional data. Thus, it poses a critical challenge in the form of high memory-space consumption. Feature selection methods can significantly reduce dimensionality in regression and have been established to be applicable in ensemble pruning. By pruning the ensemble, it is possible to generate a simpler ensemble with fewer base learners but higher accuracy. In this article, we propose using the minimax concave penalty (MCP) to prune an AdaBoost ensemble, simplifying the model and improving its accuracy simultaneously. The MCP penalty is compared with LASSO and SCAD in terms of performance in pruning the ensemble. Experiments performed on real datasets demonstrate that MCP pruning outperforms the other two methods: it reduces the ensemble size effectively and generates marginally more accurate predictions than the unpruned AdaBoost model.
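A minimal sketch of the MCP function itself, which for a coefficient t, regularisation level lambda and concavity gamma > 1 equals lambda|t| - t^2/(2 gamma) when |t| <= gamma*lambda and gamma*lambda^2/2 otherwise; pruning would apply this penalty to the base learners' combination weights, but that optimisation is not reproduced here.

```python
import numpy as np

def mcp(t, lam=1.0, gamma=3.0):
    """Minimax concave penalty, elementwise over coefficients t."""
    t = np.abs(np.asarray(t, dtype=float))
    inside = t <= gamma * lam
    return np.where(inside,
                    lam * t - t**2 / (2 * gamma),  # concave region near zero
                    0.5 * gamma * lam**2)          # constant beyond gamma*lam

# e.g. mcp(np.linspace(-3, 3, 7))
```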

17.
This paper presents a hybrid filter-wrapper feature subset selection algorithm based on particle swarm optimization (PSO) for support vector machine (SVM) classification. The filter model is based on mutual information and is a composite measure of feature relevance and redundancy with respect to the feature subset selected. The wrapper model is a modified discrete PSO algorithm. This hybrid algorithm, called maximum relevance minimum redundancy PSO (mr2PSO), is novel in the sense that it uses the mutual information available from the filter model to weight the bit-selection probabilities in the discrete PSO. Hence, mr2PSO uniquely brings together the efficiency of filters and the greater accuracy of wrappers. The proposed algorithm is tested on several well-known benchmark datasets. Its performance is also compared with a recent hybrid filter-wrapper algorithm based on a genetic algorithm and a wrapper algorithm based on PSO. The results show that the mr2PSO algorithm is competitive in terms of both classification accuracy and computational performance.
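A minimal sketch of the filter side of such a hybrid: a maximum-relevance minimum-redundancy score per candidate feature computed from mutual information, which a wrapper could use to weight bit-selection probabilities. The estimators and the assumption of discretised features are illustrative; this is not mr2PSO's exact measure.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr_score(X, y, selected, candidate):
    """Relevance minus mean redundancy; X columns assumed discretised."""
    relevance = mutual_info_classif(X[:, [candidate]], y)[0]
    if not selected:
        return relevance
    redundancy = np.mean([mutual_info_score(X[:, j], X[:, candidate])
                          for j in selected])
    return relevance - redundancy  # high = relevant but non-redundant
```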

18.
19.
Analysing the shortcomings of the traditional feature selection methods chi-square statistics and information gain in text classification, we conclude that the key to feature selection for text classification is to choose terms that are concentrated in a particular class, evenly distributed within that class, and frequently occurring. Accordingly, taking into account document frequency and term frequency together with inter-class concentration and intra-class dispersion, an evaluation function for feature selection is proposed based on intra-class and inter-class document-frequency and term-frequency statistics. For each class in the training set, a fixed proportion of terms is selected with this function to form the class's feature vocabulary, and the training set's vocabulary is the union of the class vocabularies. SVM-based Chinese text classification experiments show that, compared with the traditional chi-square statistics and information gain, the method improves classification performance to some extent.
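A minimal sketch of the kind of evaluation function described: score a term by combining its concentration in the target class (from per-class document frequencies) with its dispersion inside that class and its overall frequency. The exact combination below is an illustrative assumption, not the paper's formula.

```python
import numpy as np

def term_score(df_per_class, tf_in_class_docs):
    """df_per_class: document frequency of the term in each class;
    tf_in_class_docs: term counts per document of the target class."""
    target = np.argmax(df_per_class)
    concentration = df_per_class[target] / (df_per_class.sum() + 1e-12)
    tf = np.asarray(tf_in_class_docs, dtype=float)
    dispersion = (tf > 0).mean()        # fraction of class docs containing it
    frequency = tf.sum()
    return concentration * dispersion * np.log1p(frequency)
```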

20.
It is important to build a risk classification model of cervical cancer for evaluating high-risk populations. The data were divided into two subsets, one for model building and one for model testing. Using a back-propagation (BP) artificial neural network (ANN), a risk classification model was set up with the following parameters: the data were normalised; the network had 3 layers, with 5 neurons in the hidden layer; the transfer function between the input and hidden layers was logsig, and between the hidden and output layers was purelin; training used Levenberg–Marquardt optimisation with error goal eg = 0.09 and maximum epochs me = 8000. Model quality was good (sensitivity = 98%, specificity = 97%), and the back-calculated fit was excellent. Prediction on 10 unseen cases was also good: the correct rate was 100% for the control group and 80% for the case group. Because ANNs are self-organising, self-learning and self-adapting, the ANN risk classification model is suitable for screening local high-risk populations for cervical cancer, for cervical cancer risk evaluation, and for evaluating the effect of prevention measures, after retraining the model with new data from a given area.
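A minimal numpy sketch of the described architecture: one hidden layer of 5 logsig (logistic sigmoid) units and a purelin (linear) output, following MATLAB toolbox naming. Levenberg–Marquardt training is not reproduced, and the weights below are random placeholders.

```python
import numpy as np

def logsig(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_inputs, n_hidden = 10, 5                      # input size is an assumption
W1, b1 = rng.normal(size=(n_hidden, n_inputs)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(1, n_hidden)), np.zeros(1)

def forward(x):                      # x: normalised input vector
    h = logsig(W1 @ x + b1)          # hidden layer: logsig transfer
    return W2 @ h + b2               # output layer: purelin (linear) transfer

print(forward(rng.random(n_inputs)))
```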
