Similar literature
Found 20 similar documents (search time: 31 ms)
1.
Machine learning algorithms such as genetic programming (GP) can evolve biased classifiers when data sets are unbalanced. Data sets are unbalanced when at least one class is represented by only a small number of training examples (called the minority class) while other classes make up the majority. In this scenario, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence that the larger majority class has on traditional training criteria in the fitness function. This paper aims to both highlight the limitations of the current GP approaches in this area and develop several new fitness functions for binary classification with unbalanced data. Using a range of real-world classification problems with class imbalance, we empirically show that these new fitness functions evolve classifiers with good performance on both the minority and majority classes. Our approaches use the original unbalanced training data in the GP learning process, without the need to artificially balance the training examples from the two classes (e.g., via sampling).
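
The effect described in the abstract above can be illustrated with a small sketch. The abstract does not specify its new fitness functions, so `balanced_fitness` below is a hypothetical stand-in (the average of the per-class accuracies), and the 95:5 data set is invented for illustration.

```python
def overall_accuracy(predictions, labels):
    """Traditional criterion: fraction of all examples classified correctly."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def balanced_fitness(predictions, labels, minority=1):
    """Hypothetical fitness: mean of minority-class and majority-class accuracy."""
    n_min = sum(1 for y in labels if y == minority)
    n_maj = len(labels) - n_min
    if n_min == 0 or n_maj == 0:
        raise ValueError("both classes must be present")
    # Correct predictions counted separately for each class.
    hit_min = sum(1 for p, y in zip(predictions, labels) if y == minority and p == y)
    hit_maj = sum(1 for p, y in zip(predictions, labels) if y != minority and p == y)
    return 0.5 * (hit_min / n_min + hit_maj / n_maj)


# Invented data: 95 majority examples (label 0) and 5 minority examples (label 1).
labels = [0] * 95 + [1] * 5
always_majority = [0] * 100  # a biased classifier that never predicts the minority class

print(overall_accuracy(always_majority, labels))   # 0.95
print(balanced_fitness(always_majority, labels))   # 0.5
```

Under overall accuracy the biased classifier looks strong, while the class-averaged score exposes that it never recognizes the minority class.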

2.
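A Classification Approach Based on Multiple Evolutionary Neural Networks   Cited by: 9 (self-citations: 0, by others: 9)
Shang Lin (商琳), Wang Jingen (王金根), Yao Wangshu (姚望舒), Chen Shifu (陈世福). Journal of Software (《软件学报》), 2005, 16(9): 1577-1583
Classification is an important topic in data mining and machine learning. This paper proposes CABEN (classification approach based on evolutionary neural networks), a classification method based on multiple evolutionary neural networks. An improved evolution strategy and the Levenberg-Marquardt method are used to train several three-layer feedforward neural networks simultaneously. Once the individual classification models are trained, the data to be recognized are fed to each of them, and the final class is decided by absolute majority voting. Experimental results show that the method classifies data well and, compared with traditional neural network methods, Bayesian methods and decision tree methods, ...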
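
The final voting step of this entry can be sketched as follows. This is a minimal illustration, assuming "absolute majority" means that a label must receive more than half of the votes; the function name and the choice to return `None` when no label qualifies are assumptions, not details taken from the paper.

```python
from collections import Counter


def absolute_majority_vote(votes):
    """Return the label chosen by more than half of the classifiers, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Each entry is the class predicted by one trained network for the same input.
print(absolute_majority_vote(["A", "A", "B"]))  # A
print(absolute_majority_vote(["A", "B", "C"]))  # None
```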

3.
Class imbalance limits the performance of most learning algorithms since they cannot cope with large differences between the number of samples in each class, resulting in low predictive accuracy on the minority class. Several papers have therefore proposed algorithms that aim for more balanced performance. However, balancing the recognition accuracies across classes very often harms the global accuracy: the accuracy on the minority class increases while the accuracy on the majority class decreases. This paper proposes an approach to overcome this limitation: for each sample to be classified, it chooses between the output of a classifier trained on the original skewed distribution and the output of a classifier trained with a learning method that addresses the curse of imbalanced data. This choice is driven by a parameter whose value maximizes, on a validation set, two objective functions, i.e. the global accuracy and the accuracies for each class. A series of experiments on ten public datasets with different proportions between the majority and minority classes shows that the proposed approach provides more balanced recognition accuracies than classifiers trained with traditional learning methods for imbalanced data, as well as higher global accuracy than classifiers trained on the original skewed distribution.

4.
In this paper we consider induction of rule-based classifiers from imbalanced data, where one class (a minority class) is under-represented in comparison to the remaining majority classes. The minority class is usually of primary interest. However, most rule-based classifiers are biased towards the majority classes and have difficulty recognizing the minority class correctly. In this paper we discuss sources of these difficulties related to data characteristics or to the algorithm itself. Among the problems related to the data distribution we focus on the role of small disjuncts, overlapping of classes and the presence of noisy examples. Then, we show that standard techniques for induction of rule-based classifiers, such as sequential covering, top-down induction of rules or classification strategies, were created with the assumption of a balanced data distribution, and we explain why they are biased towards the majority classes. Some modifications of rule-based classifiers have already been introduced, but they usually concentrate on individual problems. Therefore, we propose a novel algorithm, BRACID, which addresses the issues associated with imbalanced data more comprehensively. Its main characteristics include a hybrid representation of rules and single examples, bottom-up learning of rules and a local classification strategy using nearest rules. The usefulness of BRACID has been evaluated in experiments on several imbalanced datasets. The results show that BRACID significantly outperforms the well-known rule-based classifiers C4.5rules, RIPPER, PART, CN2 and MODLEM, as well as other related classifiers such as RISE or K-NN. Moreover, it is comparable to or better than the studied approaches specialized for imbalanced data, such as generalizations of rule algorithms or combinations of SMOTE + ENN preprocessing with PART. Finally, it improves the support of minority class rules, leading to better recognition of the minority class examples.

5.
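Data produced in many real-world domains are typically multi-class and imbalanced. In multi-class imbalanced classification, problems such as class overlap, noise and multiple minority classes degrade classifier performance, and solving the multi-class imbalance problem effectively has become an important research topic in machine learning and data mining. Based on recent literature on multi-class imbalanced classification methods, this survey analyzes and summarizes them from two perspectives, data preprocessing and algorithm-level classification methods, and analyzes all algorithms in detail in terms of their strengths, weaknesses and datasets. For data preprocessing, it introduces oversampling, undersampling, hybrid sampling and feature selection methods, and compares the performance of algorithms evaluated on the same datasets. Algorithm-level classification methods are introduced and analyzed in three groups: base-classifier optimization, ensemble learning and multi-class decomposition techniques. Finally, future research directions for multi-class imbalanced data classification are summarized.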

6.
The ability to predict a student's performance could be useful in many ways in university-level distance learning. Students' marks in a few written assignments can constitute the training set for a supervised machine learning algorithm. Along with the explosive increase of data and information, incremental learning ability has become more and more important for machine learning approaches. Online algorithms try to forget irrelevant information instead of synthesizing all available information (as opposed to classic batch learning algorithms). Combining classifiers has been proposed as a new direction for improving classification accuracy. However, most ensemble algorithms operate in batch mode. Therefore, a better proposal is an online ensemble of classifiers that combines an incremental version of Naive Bayes, the 1-NN and the WINNOW algorithms using the voting methodology. Among other significant conclusions, it was found that the proposed algorithm is the most appropriate for the construction of a software support tool.

7.
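Support vector machine (SVM) is a machine learning method based on structural risk minimization that can effectively solve classification problems. As research problems grow more complex, however, real-world classification problems are often multi-class, whereas SVM can only handle binary classification tasks. To address this, the multiple birth support vector machine (MBSVM), which uses a one-versus-all strategy, achieves multi-class classification at low complexity, but its classification accuracy is low. This paper improves MBSVM and proposes a new SVM multi-class algorithm: MBSVM based on hyperspheres and the fruit fly optimization algorithm with adaptive step size reduction (ASSRFOA), abbreviated HA-MBSVM. Using information obtained by fitting hyperspheres, the classes are first partitioned and the classifiers are then built; a constraint-distance adjustment factor is introduced to moderately increase the diversity among classifiers; and ASSRFOA is used to solve the quadratic programming problems. In this way HA-MBSVM solves multi-class problems better. We evaluate HA-MBSVM on six datasets, and the experimental results show that its overall performance is better than that of each compared algorithm.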

8.

Dementia is one of the leading causes of severe cognitive decline; it induces memory loss and impairs the daily life of millions of people worldwide. In this work, we consider the classification of dementia using magnetic resonance (MR) imaging and clinical data with machine learning models. We adapt univariate feature selection in the MR data pre-processing step as a filter-based feature selection. Bagged decision trees are also implemented to estimate the important features for achieving good classification accuracy. Several ensemble learning-based machine learning approaches, namely gradient boosting (GB), extreme gradient boost (XGB), voting-based, and random forest (RF) classifiers, are considered for the diagnosis of dementia. Moreover, we propose voting-based classifiers that train on an ensemble of numerous basic machine learning models, such as the extra trees classifier, RF, GB, and XGB. The implementation of a voting-based approach is one of the important contributions, and the performance of the different classifiers is evaluated in terms of precision, accuracy, recall, and F1 score. Moreover, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) are used as metrics for comparing these classifiers. Experimental results show that the voting-based classifiers often perform better than RF, GB, and XGB in terms of precision, recall, and accuracy, thereby indicating the promise of differentiating dementia from imaging and clinical data.


9.
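Imbalanced classification is a common challenge in practical classification problems based on traditional models. Because traditional classifiers struggle to learn the intrinsic structure of the minority-class data, they lean toward the majority class, so minority-class samples are misclassified as majority-class samples. Redundant and noisy data in the sample set also trouble the classifier. To handle these problems effectively, a new imbalanced classification framework, SSIC, is proposed. The framework takes the statistical characteristics of the data fully into account, adaptively selects valuable samples from the majority and minority classes, and combines them with cost-sensitive learning to build a classifier for imbalanced data. First, SSIC constructs several balanced data subsets by combining part of the majority-class instances with all minority-class instances. On each subset, SSIC exploits the characteristics of the data to extract discriminative high-level features and adaptively select important samples, thereby removing redundant and noisy data. Second, SSIC introduces a cost-sensitive support vector machine (SVM) that automatically assigns an appropriate weight to each sample, so that the minority class is treated as equal to the majority class.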
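
The first SSIC step, building balanced subsets from part of the majority class and all of the minority class, can be sketched roughly as below. The abstract does not say how the majority instances are chosen, so uniform random sampling without replacement is an assumption here, as are the function name and parameters.

```python
import random


def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Build n_subsets balanced subsets: a random part of the majority class
    plus every minority instance (the sampling scheme is an assumption)."""
    rng = random.Random(seed)
    k = min(len(minority), len(majority))  # draw as many majority samples as minority ones
    subsets = []
    for _ in range(n_subsets):
        sampled_majority = rng.sample(majority, k)
        subsets.append(sampled_majority + list(minority))
    return subsets


# Invented data: 20 majority instances and 4 minority instances.
majority = [("maj", i) for i in range(20)]
minority = [("min", i) for i in range(4)]
subsets = balanced_subsets(majority, minority, n_subsets=3)
print([len(s) for s in subsets])  # [8, 8, 8]
```

Each subset could then be passed to its own feature-extraction and sample-selection stage; those stages are not sketched because the abstract gives no detail on them.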

10.
To enable an unbiased comparison among machine learning classifiers built on proprietary databases, and to guarantee the validity and robustness of these classifiers, a Performance Evaluation Indicator (PEI) and the corresponding failure criterion are proposed in this study. Three types of machine learning classifiers, namely the strictly binary classifier, the normal multiclass classifier and the misclassification cost-sensitive classifier, are trained on four datasets recorded from a water drainage TBM project. The results indicate that: (1) the PEI successfully compares the competence of classifiers under different scenarios by isolating the effects of different degrees of overlap among rockmass classes, and (2) the cost-sensitive algorithm is warranted for classifying rockmasses when the ratio between class sizes is more than 8:1. The contributions of this research are to fill the gap in performance evaluation of classifiers for imbalanced training data, and to identify the best situation in which to apply the cost-sensitive classifier.

11.
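An Improved K-Means-Boosting-BP Model for Imbalanced Police Incident Data   Cited by: 1 (self-citations: 0, by others: 1)
Objective: Understanding the spatio-temporal distribution of police incidents, building spatio-temporal prediction models with machine learning algorithms, and devising scientific policing and prevention plans that effectively curb crime are the focus of crime geography research. Existing studies show that police incidents concentrate in central urban areas or densely populated residential areas, so the data are imbalanced in space and time; this imbalance usually makes models trained on the data weak learners with low prediction accuracy. To solve the regression problem on such imbalanced data, a Boosting algorithm based on K-means clustering is proposed. Method: The algorithm builds on the Boosting ensemble learning algorithm, uses a GA-BP neural network to generate the base classifiers, and uses K-means clustering to integrate the base classifiers, thereby boosting weak learners into a strong learner. Results: Compared with the Synthetic Minority Oversampling Technique Boosting (SMOTEBoosting) algorithm, which is commonly used for regression on imbalanced data, the algorithm has two advantages: 1) it lowers the mean squared error of the minority class while also lowering the overall mean squared error; the overall mean squared error is 2.14E-04 for SMOTEBoosting and 9.85E-05 for KMeans-Boosting; 2) it better balances precision and recall when identifying minority-class samples: the recall of KMeans-Boosting is about 52% against about 91% for SMOTEBoosting, but the precision of KMeans-Boosting is 85%, far higher than the 19% of SMOTEBoosting. Conclusion: KMeans-Boosting significantly reduces the overall mean squared error on imbalanced data and improves the precision and recall of minority-class identification; it is an effective algorithm for regression and classification on imbalanced data and can be extended to other fields that need to handle imbalanced data.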
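
The precision/recall trade-off reported in this entry can be made concrete with the F1 score, the harmonic mean of the two. The abstract itself does not report F1; the values below are derived only from its stated figures (85% and 52% for KMeans-Boosting, 19% and 91% for SMOTEBoosting), reading the first figure of each pair as precision.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures taken from the abstract; F1 itself is computed here, not reported there.
kmeans_f1 = f1_score(0.85, 0.52)
smote_f1 = f1_score(0.19, 0.91)
print(round(kmeans_f1, 3))  # 0.645
print(round(smote_f1, 3))   # 0.314
```

On this derived measure KMeans-Boosting comes out ahead despite its lower recall, which is one way to read the "better balance" claimed in the results.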

12.
Multiple classifier systems (MCS) are attracting increasing interest in the field of pattern recognition and machine learning. Recently, MCS have also been introduced in the remote sensing field, where the importance of classifier diversity for image classification problems has not been examined. In this article, Satellite Pour l'Observation de la Terre (SPOT) IV panchromatic and multispectral satellite images are classified into six land cover classes using five base classifiers: contextual classifier, k-nearest neighbour classifier, Mahalanobis classifier, maximum likelihood classifier and minimum distance classifier. The five base classifiers are trained with the same feature sets throughout the experiments, and the a posteriori probability, derived from the confusion matrix of these base classifiers, is applied to five Bayesian decision rules (product rule, sum rule, maximum rule, minimum rule and median rule) for constructing different combinations of classifier ensembles. The performance of these classifier ensembles is evaluated for overall accuracy and kappa statistics. Three statistical tests, McNemar's test, Cochran's Q test and Looney's F-test, are used to examine the diversity of the classification results of the base classifiers compared to the results of the classifier ensembles. The experimental comparison reveals that (a) significant diversity amongst the base classifiers cannot enhance the performance of classifier ensembles; (b) accuracy improvement of classifier ensembles can only be found by using base classifiers with similar and low accuracy; (c) increasing the number of base classifiers cannot improve the overall accuracy of the MCS and (d) none of the Bayesian decision rules outperforms the others.
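
The five decision rules named in this entry (product, sum, maximum, minimum and median) can be sketched as follows. This is a minimal illustration: the posterior values and class names are invented, whereas the article derives its posteriors from the confusion matrices of the base classifiers, and the `combine` helper is a hypothetical name.

```python
import math
from statistics import median

# Each rule reduces the per-classifier posteriors of one class to a single score.
RULES = {
    "product": math.prod,
    "sum": sum,
    "maximum": max,
    "minimum": min,
    "median": median,
}


def combine(posteriors, rule):
    """posteriors: one dict per base classifier, mapping class -> probability.
    Returns the class with the highest fused score under the given rule."""
    fuse = RULES[rule]
    classes = posteriors[0].keys()
    scores = {c: fuse([p[c] for p in posteriors]) for c in classes}
    return max(scores, key=scores.get)


# Invented posteriors from three base classifiers for one pixel.
posteriors = [
    {"water": 0.60, "forest": 0.40},
    {"water": 0.55, "forest": 0.45},
    {"water": 0.05, "forest": 0.95},
]
for rule in RULES:
    print(rule, combine(posteriors, rule))
```

With these invented values the median rule picks "water" while the other four rules pick "forest", which shows that the rules can disagree on a single sample even if, as the article reports, none outperforms the others overall.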

13.
This paper presents a novel application of advanced machine learning techniques for Mars terrain image classification. Fuzzy-rough feature selection (FRFS) is adapted and then employed in conjunction with Support Vector Machines (SVMs) to construct image classifiers. These techniques are integrated to address problems in space engineering where the images span many classes, are large in scale, and have diverse representational properties. The use of the adapted FRFS allows the induction of low-dimensionality feature sets from feature patterns of a much higher dimensionality. To evaluate the proposed work, K-Nearest Neighbours (KNNs) and decision trees (DTREEs) based image classifiers as well as information gain rank (IGR) based feature selection are also investigated here, as possible alternatives to the underlying machine learning techniques adopted. The results of systematic comparative studies demonstrate that, in general, feature selection improves the performance of classifiers that are intended for use in high-dimensional domains. In particular, the proposed approach helps to increase the classification accuracy, while enhancing classification efficiency by requiring considerably fewer features. This is evident in that the resultant SVM-based classifiers which utilise FRFS-selected features generally outperform KNN and DTREE based classifiers and those which use IGR-returned features. The work is therefore shown to be of great potential for on-board or ground-based image classification in future Mars rover missions.

14.
Improving the accuracy of machine learning algorithms is vital in designing high-performance computer-aided diagnosis (CADx) systems. Research has shown that the performance of a base classifier might be enhanced by ensemble classification strategies. In this study, we construct rotation forest (RF) ensemble classifiers of 30 machine learning algorithms to evaluate their classification performance using Parkinson's, diabetes and heart disease datasets from the literature. In the experiments, the feature dimension of the three datasets is first reduced using the correlation based feature selection (CFS) algorithm. Second, the classification performance of 30 machine learning algorithms is calculated for the three datasets. Third, 30 classifier ensembles are constructed based on the RF algorithm to assess the performance of the respective classifiers with the same disease data. All the experiments are carried out with a leave-one-out validation strategy, and the performance of the 60 algorithms is evaluated using three metrics: classification accuracy (ACC), kappa error (KE) and area under the receiver operating characteristic (ROC) curve (AUC). Base classifiers achieved average accuracies of 72.15%, 77.52% and 84.43% for the diabetes, heart and Parkinson's datasets, respectively. The RF classifier ensembles produced average accuracies of 74.47%, 80.49% and 87.13% for the respective diseases. RF, a newly proposed classifier ensemble algorithm, might be used to improve the accuracy of miscellaneous machine learning algorithms to design advanced CADx systems.

15.
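This paper analyzes the confusion-class phenomenon in text classification and focuses on techniques for discriminating confusion classes in order to improve text classification performance. First, a confusion-class identification technique based on the distribution of classification errors is proposed to identify the set of confusion classes among the predefined categories. To discriminate confusion classes effectively, a feature selection technique based on discriminative ability is proposed, which selects features by evaluating how well each feature discriminates between categories. Finally, a two-stage classifier design framework integrates the initial classifier with the confusion-class classifier and combines the results of the two stages as the final output. The confusion-class classifier is activated when the initial classifier labels a test document with a confusion-class category, in which case the confusion-class classifier re-classifies it. Comparative experiments used the Newsgroup corpus and the 863 Chinese evaluation corpus with single-label, multi-class classifiers. The results show that the technique effectively improves classification performance.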

16.
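To address the limited effectiveness of a traditional single classifier on imbalanced data, a new classification method for two-class imbalanced datasets, the generative adversarial network-adaptive boosting-decision tree (GAN-AdaBoost-DT) algorithm, is proposed based on generative adversarial networks (GAN) and ensemble learning. First, a GAN is trained to obtain a generative model, which generates minority-class samples to reduce the imbalance of the data. Second, the generated minority-class samples are fed into the adaptive boosting (AdaBoost) framework and the weights are modified to improve the AdaBoost model, raising the classification performance of AdaBoost with decision trees (DT) as base classifiers. With the area under the receiver operating characteristic curve (AUC) as the evaluation metric, experiments on a credit card fraud dataset show that, compared with synthetic minority oversampling ensemble learning, the algorithm improves accuracy by 4.5% and AUC by 6.5%; compared with improved synthetic minority oversampling ensemble learning, it improves accuracy by 4.9% and AUC by 5.9%; and compared with random undersampling ensemble learning, it improves accuracy by 4.5% and AUC by 5.4%. Experiments on other datasets from UCI and KEEL show that the algorithm improves overall accuracy and classifier performance on imbalanced binary classification problems.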

17.
Support vector machine (SVM) is a supervised machine learning approach that has been recognized as a statistical learning apotheosis for small-sample databases. SVM has shown excellent learning and generalization ability and has been extensively employed in many areas. This paper presents a performance analysis of six types of SVMs for the diagnosis of the classical Wisconsin breast cancer problem from a statistical point of view. The classification performance of standard SVM (St-SVM) is analyzed and compared with those of other modified classifiers: proximal support vector machine (PSVM), Lagrangian support vector machine (LSVM), finite Newton method for Lagrangian support vector machine (NSVM), linear programming support vector machine (LPSVM), and smooth support vector machine (SSVM). The experimental results reveal that these SVM classifiers achieve very fast, simple, and efficient breast cancer diagnosis. The training results indicated that LSVM has the lowest accuracy of 95.6107 %, while St-SVM performed better than the other methods for all performance indices (accuracy = 97.71 %) and is closely followed by LPSVM (accuracy = 97.3282 %). However, in the validation phase, the overall accuracy of LPSVM reached 97.1429 %, which was superior to LSVM (95.4286 %), SSVM (96.5714 %), PSVM (96 %), NSVM (96.5714 %), and St-SVM (94.86 %). The ROC and MCC values for LPSVM were 0.9938 and 0.9369, respectively, which outperformed the other classifiers. The results strongly suggest that LPSVM can aid in the diagnosis of breast cancer.

18.
Database mining: a performance perspective   Cited by: 12 (self-citations: 0, by others: 12)
The authors' perspective of database mining as the confluence of machine learning techniques and the performance emphasis of database technology is presented. Three classes of database mining problems involving classification, associations, and sequences are described. It is argued that these problems can be uniformly viewed as requiring discovery of rules embedded in massive amounts of data. A model and some basic operations for the process of rule discovery are described. It is shown how the database mining problems considered map to this model, and how they can be solved by using the basic operations proposed. An example is given of an algorithm for classification obtained by combining the basic rule discovery operations. This algorithm is efficient in discovering classification rules and has accuracy comparable to ID3, one of the best current classifiers.

19.
Along with the increase of data and information, incremental learning ability turns out to be more and more important for machine learning approaches. Online algorithms try to forget irrelevant information instead of synthesizing all available information (as opposed to classic batch learning algorithms). Combining classifiers has been proposed as a new route to improving classification accuracy. However, most ensemble algorithms operate in batch mode. For this reason, we propose an incremental ensemble that combines five classifiers that can operate incrementally: the Naive Bayes, the Averaged One-Dependence Estimators (AODE), the 3-Nearest Neighbors, the Non-Nested Generalised Exemplars (NNGE) and the Kstar algorithms, using the voting methodology. We performed a large-scale comparison of the proposed ensemble with other state-of-the-art algorithms on several datasets, and the proposed method produces better accuracy in most cases.

20.
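Qin Feng (秦锋), Luo Hui (罗慧), Cheng Zekai (程泽凯), Ren Shiliu (任诗流), Chen Li (陈莉). Computer Engineering and Design (《计算机工程与设计》), 2007, 28(24): 5919-5920, 5972
Classifiers are generally evaluated by accuracy. It has been shown theoretically that AUC-based evaluation of classifiers is superior to accuracy-based evaluation, but the AUC method is limited to two-class problems. A new method is proposed that extends it from two-class to multi-class problems: error-correcting output codes are used to obtain a conversion matrix, the multi-class problem is converted into two-class problems through this matrix, and the average over the two-class problems is computed to evaluate the classifier. The new method is implemented on the MBNC experimental platform and used to evaluate Bayesian classifiers; the experimental results show that the method is effective.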
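
The idea in this entry, averaging two-class AUC values over the columns of a code matrix, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the code matrix used is a plain one-versus-rest matrix (a real error-correcting code would have more columns), the scores are invented, and `binary_auc` and `multiclass_auc` are hypothetical names.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a positive sample outranks a negative one
    (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("both positive and negative samples are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def multiclass_auc(code_matrix, column_scores, labels):
    """code_matrix: class -> list of +1/-1 codes, one per two-class problem.
    column_scores[j][i]: score of sample i for the positive side of problem j."""
    n_columns = len(next(iter(code_matrix.values())))
    aucs = []
    for j in range(n_columns):
        # Relabel every sample as positive (1) or negative (0) for column j.
        binary_labels = [1 if code_matrix[y][j] == 1 else 0 for y in labels]
        aucs.append(binary_auc(column_scores[j], binary_labels))
    return sum(aucs) / n_columns


# Assumed one-versus-rest code matrix for three classes, with invented scores.
code_matrix = {"a": [1, -1, -1], "b": [-1, 1, -1], "c": [-1, -1, 1]}
labels = ["a", "b", "c", "a"]
column_scores = [
    [0.9, 0.2, 0.1, 0.8],  # problem 0: "a" versus the rest
    [0.1, 0.7, 0.3, 0.2],  # problem 1: "b" versus the rest
    [0.2, 0.4, 0.6, 0.7],  # problem 2: "c" versus the rest
]
result = multiclass_auc(code_matrix, column_scores, labels)
print(round(result, 4))  # 0.8889
```

The first two problems are ranked perfectly (AUC 1.0) and the third has one misordered pair out of three (AUC 2/3), so the average is 8/9.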
