首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
数据流中的概念漂移和类别不平衡问题会严重影响数据流分类算法的性能和稳定性.针对二分类数据流中概念漂移和类别不平衡的问题,在基于数据块的集成分类方法上引入成员分类器权重的在线更新机制,结合重采样和自适应滑动窗口技术,提出了一种基于G-mean加权的不平衡数据流在线分类方法(online G-mean update ensemble for imbalance learning, OGUEIL).该方法基于集成学习框架,利用时间衰减因子增量计算成员分类器最近若干实例上的G-mean性能,并确定成员分类器权重,每到达一个新实例,在线更新所有成员分类器及其权重,并对少类实例进行随机过采样.同时,OGUEIL会周期性地根据当前数据构造类别平衡数据集训练新的候选分类器,并选择性地添加至集成框架中.在真实和人工数据集上的结果表明,所提方法的综合性能优于其他同类方法.  相似文献   

2.
针对传统单个分类器在不平衡数据上分类效果有限的问题,基于对抗生成网络(GAN)和集成学习方法,提出一种新的针对二类不平衡数据集的分类方法——对抗生成网络-自适应增强-决策树(GAN-AdaBoost-DT)算法。首先,利用GAN训练得到生成模型,生成模型生成少数类样本,降低数据的不平衡性;其次,将生成的少数类样本代入自适应增强(AdaBoost)模型框架,更改权重,改进AdaBoost模型,提升以决策树(DT)为基分类器的AdaBoost模型的分类性能。使用受测者工作特征曲线下面积(AUC)作为分类评价指标,在信用卡诈骗数据集上的实验分析表明,该算法与合成少数类样本集成学习相比,准确率提高了4.5%,受测者工作特征曲线下面积提高了6.5%;对比改进的合成少数类样本集成学习,准确率提高了4.9%,AUC值提高了5.9%;对比随机欠采样集成学习,准确率提高了4.5%,受测者工作特征曲线下面积提高了5.4%。在UCI和KEEL的其他数据集上的实验结果表明,该算法在不平衡二分类问题上能提高总体的准确率,优化分类器性能。  相似文献   

3.
为解决不均衡多分类问题,提出了一种基于采样和特征选择的不均衡数据集成分类算法(IDESF).基分类器的多样性会影响集成算法的分类性能,所以IDESF算法对数据集进行有放回采样+SMOTE的两阶段采样.两阶段采样在保证所得数据集中样本合理性的基础上,增加数据集间的差异性以此隐式地提高基分类器的多样性.两阶段采样同样可以平...  相似文献   

4.
数据流中的不平衡问题会严重影响算法的分类性能,其中概念漂移更是流数据挖掘研究领域的一个难点问题。为了提高此类问题下的分类性能,提出了一种新的基于Hellinger距离的不平衡漂移数据流Boosting分类BCA-HD算法。该算法创新性地采用实例级和分类器级的权重组合方式来动态更新分类器,以适应概念漂移的发生,在底层采用集成算法SMOTEBoost作为基分类器,该分类器内部使用重采样技术处理数据的不平衡。在16个突变型和渐变型的数据集上将所提算法与9种不同算法进行比较,实验结果表明,所提算法的G-mean和AUC的平均值和平均排名均为第1名。因此,该算法能更好地适应概念漂移和不平衡现象的同时发生,有助于提高分类性能。  相似文献   

5.
To illustrate an unprejudiced comparison among machine learning classifiers established on proprietary databases, and to guarantee the validity and robustness of these classifiers, a Performance Evaluation Indicator (PEI) and the corresponding failure criterion are proposed in this study. Three types of machine learning classifiers, including the strictly binary classifier, the normal multiclass classifier and the misclassification cost-sensitive classifier, are trained on four datasets recorded from a water drainage TBM project. The results indicate that: (1) the PEI successfully compares the competence of classifiers under different scenarios by isolating the effects of different overlapping-degree of rockmass classes, and (2) the cost-sensitive algorithm is warranted to classify rockmasses when the ratio of inter-class classes is more than 8:1. The contributions of this research are to fill the gap in performance evaluations of a classifier for imbalanced training data, and to identify the best situation to apply this classifier.  相似文献   

6.
Identifying the temporal variations in mental workload level (MWL) is crucial for enhancing the safety of human–machine system operations, especially when there is cognitive overload or inattention of human operator. This paper proposed a cost-sensitive majority weighted minority oversampling strategy to address the imbalanced MWL data classification problem. Both the inter-class and intra-class imbalance problems are considered. For the former, imbalance ratio is defined to determine the number of the synthetic samples in the minority class. The latter problem is addressed by assigning different weights to borderline samples in the minority class based on the distance and density meaures of the sample distribution. Furthermore, multi-label classifier is designed based on an ensemble of binary classifiers. The results of analyzing 21 imbalanced UCI multi-class datasets showed that the proposed approach can effectively cope with the imbalanced classification problem in terms of several performance metrics including geometric mean (G-mean) and average accuracy (ACC). Moreover, the proposed approach was applied to the analysis of the EEG data of eight experimental participants subject to fluctuating levels of mental workload. The comparative results showed that the proposed method provides a competing alternative to several existing imbalanced learning algorithms and significantly outperforms the basic/referential method that ignores the imbalance nature of the dataset.  相似文献   

7.
现实生活中存在大量的非平衡数据,大多数传统的分类算法假定类分布平衡或者样本的错分代价相同,因此在对这些非平衡数据进行分类时会出现少数类样本错分的问题。针对上述问题,在代价敏感的理论基础上,提出了一种新的基于代价敏感集成学习的非平衡数据分类算法--NIBoost(New Imbalanced Boost)。首先,在每次迭代过程中利用过采样算法新增一定数目的少数类样本来对数据集进行平衡,在该新数据集上训练分类器;其次,使用该分类器对数据集进行分类,并得到各样本的预测类标及该分类器的分类错误率;最后,根据分类错误率和预测的类标计算该分类器的权重系数及各样本新的权重。实验采用决策树、朴素贝叶斯作为弱分类器算法,在UCI数据集上的实验结果表明,当以决策树作为基分类器时,与RareBoost算法相比,F-value最高提高了5.91个百分点、G-mean最高提高了7.44个百分点、AUC最高提高了4.38个百分点;故该新算法在处理非平衡数据分类问题上具有一定的优势。  相似文献   

8.
In this paper, we propose two risk-sensitive loss functions to solve the multi-category classification problems where the number of training samples is small and/or there is a high imbalance in the number of samples per class. Such problems are common in the bio-informatics/medical diagnosis areas. The most commonly used loss functions in the literature do not perform well in these problems as they minimize only the approximation error and neglect the estimation error due to imbalance in the training set. The proposed risk-sensitive loss functions minimize both the approximation and estimation error. We present an error analysis for the risk-sensitive loss functions along with other well known loss functions. Using a neural architecture, classifiers incorporating these risk-sensitive loss functions have been developed and their performance evaluated for two real world multi-class classification problems, viz., a satellite image classification problem and a micro-array gene expression based cancer classification problem. To study the effectiveness of the proposed loss functions, we have deliberately imbalanced the training samples in the satellite image problem and compared the performance of our neural classifiers with those developed using other well-known loss functions. The results indicate the superior performance of the neural classifier using the proposed loss functions both in terms of the overall and per class classification accuracy. Performance comparisons have also been carried out on a number of benchmark problems where the data is normal i.e., not sparse or imbalanced. Results indicate similar or better performance of the proposed loss functions compared to the well-known loss functions.  相似文献   

9.
卷积神经网络具有高效的特征提取能力和较少的参数量,被广泛应用于图像处理、目标跟踪、自然语言等领域。针对传统分类模型对于结构化非平衡数据分类效果较差的问题,提出一种基于卷积神经网络的二分类结构化非平衡数据分类算法。设计结构化数据处理算法Data-Shuffle,将原始非平衡一维结构化数据转换为三维数组形式的多通道非平衡数据,为卷积神经网络提供更多的特征值,通过改进的VGG网络构建适合非平衡数据的网络结构卷积组,以提取不同的特征。在此基础上,提出更新权重加权采样算法UWSCNN,在每个迭代次数之后,根据模型的训练结果对易错样本进行重新加权,以优化训练结果。在adult、shoppers和diabetes数据集上的实验结果表明,相比逻辑回归、随机森林等传统机器学习模型,所提的Data-Shuffle算法的F1值提升了1%~19%,G-mean提升了2%~24%,相比SMOTECNN、BSMOTECNN、SMOTECNN+CS等采样算法,所提的UWSCNN算法对非平衡数据的分类效果提升了1%~13%,有效提高模型对非平衡数据的分类性能。  相似文献   

10.
师彦文  王宏杰 《计算机科学》2017,44(Z11):98-101
针对不平衡数据集的有效分类问题,提出一种结合代价敏感学习和随机森林算法的分类器。首先提出了一种新型不纯度度量,该度量不仅考虑了决策树的总代价,还考虑了同一节点对于不同样本的代价差异;其次,执行随机森林算法,对数据集作K次抽样,构建K个基础分类器;然后,基于提出的不纯度度量,通过分类回归树(CART)算法来构建决策树,从而形成决策树森林;最后,随机森林通过投票机制做出数据分类决策。在UCI数据库上进行实验,与传统随机森林和现有的代价敏感随机森林分类器相比,该分类器在分类精度、AUC面积和Kappa系数这3种性能度量上都具有良好的表现。  相似文献   

11.
实际的分类数据往往是分布不均衡的.传统的分类器大都会倾向多数类而忽略少数类,导致分类性能恶化.针对该问题提出一种基于变分贝叶斯推断最优高斯混合模型(varition Bayesian-optimized optimal Gaussian mixture model, VBoGMM)的自适应不均衡数据综合采样法. VBoGMM可自动衰减到真实的高斯成分数,实现任意数据的最优分布估计;进而基于所获得的分布特性对少数类样本进行自适应综合过采样,并采用Tomek-link对准则对采样数据进行清洗以获得相对均衡的数据集用于后续的分类模型学习.在多个公共不均衡数据集上进行大量的验证和对比实验,结果表明:所提方法能在实现样本均衡化的同时,维持多数类与少数类样本空间分布特性,因而能有效提升传统分类模型在不均衡数据集上的分类性能.  相似文献   

12.
As soil heavy metal pollution is increasing year by year, the risk assessment of soil heavy metal pollution is gradually gaining attention. Soil heavy metal datasets are usually imbalanced datasets in which most of the samples are safe samples that are not contaminated with heavy metals. Random Forest (RF) has strong generalization ability and is not easy to overfit. In this paper, we improve the Bagging algorithm and simple voting method of RF. A W-RF algorithm based on adaptive Bagging and weighted voting is proposed to improve the classification performance of RF on imbalanced datasets. Adaptive Bagging enables trees in RF to learn information from the positive samples, and weighted voting method enables trees with superior performance to have higher voting weights. Experiments were conducted using G-mean, recall and F1-score to set weights, and the results obtained were better than RF. Risk assessment experiments were conducted using W-RF on the heavy metal dataset from agricultural fields around Wuhan. The experimental results show that the RW-RF algorithm, which use recall to calculate the classifier weights, has the best classification performance. At the end of this paper, we optimized the hyperparameters of the RW-RF algorithm by a Bayesian optimization algorithm. We use G-mean as the objective function to obtain the optimal hyperparameter combination within the number of iterations.  相似文献   

13.
针对传统模型在解决不平衡数据分类问题时存在精度低、稳定性差、泛化能力弱等问题,提出基于序贯三支决策多粒度集成分类算法M GE-S3WD.采用二元关系实现粒层动态划分;根据代价矩阵计算阈值并构建多层次粒结构,将各粒层数据划分为正域、边界域和负域;将各粒层上的划分,按照正域与负域、正域与边界域、负域与边界域重新组合形成新的...  相似文献   

14.
数据不平衡现象在现实生活中普遍存在。在处理不平衡数据时,传统的机器学习算法难以达到令人满意的效果。少数类样本合成上采样技术(Synthetic Minority Oversampling Technique,SMOTE)是一种有效的方法,但在多类不平衡数据中,边界点分布错乱和类别分布不连续变得更加复杂,导致合成的样本点会侵入其他类别区域,造成数据过泛化。鉴于基于海林格距离的决策树已被证明对不平衡数据具有不敏感性,文中结合海林格距离和SMOTE,提出了一种基于海林格距离和SMOTE的上采样算法(Based on Hellinger Distance and SMOTE Oversampling Algorithm,HDSMOTE)。首先,建立基于海林格距离的采样方向选择策略,通过比较少数类样本点的局部近邻域内的海林格距离的大小,来引导合成样本点的方向。其次,设计了基于海林格距离的采样质量评估策略,以免合成的样本点侵入其他类别的区域,降低过泛化的风险。最后,采用7种代表性的上采样算法和HDSMOTE算法对15个多类不平衡数据集进行预处理,使用决策树的分类器进行分类,以Precision,Recall,F-measure,G-mean和MAUC作为评价标准对各算法的性能进行评价。实验结果表明,相比于对比算法,HDSMOTE算法在以上评价标准上均有所提升:在Precision上最高提升了17.07%,在Recall上最高提升了21.74%,在F-measure上最高提升了19.63%,在G-mean上最高提升了16.37%,在MAUC上最高提升了8.51%。HDSMOTE相对于7种代表性的上采样方法,在处理多类不平衡数据时有更好的分类效果。  相似文献   

15.
为改进SVM对不均衡数据的分类性能,提出一种基于拆分集成的不均衡数据分类算法,该算法对多数类样本依据类别之间的比例通过聚类划分为多个子集,各子集分别与少数类合并成多个训练子集,通过对各训练子集进行学习获得多个分类器,利用WE集成分类器方法对多个分类器进行集成,获得最终分类器,以此改进在不均衡数据下的分类性能.在UCI数据集上的实验结果表明,该算法的有效性,特别是对少数类样本的分类性能.  相似文献   

16.
丁要军 《计算机应用》2015,35(12):3348-3351
针对不平衡网络流量分类精度不高的问题,在旋转森林算法的基础上结合Bagging算法的Bootstrap抽样和基于分类精度排序的基分类器选择算法,提出一种改进的旋转森林算法。首先,对原始训练集按特征进行子集划分并分别使用Bagging进行样本抽样,通过主成分分析(PCA)生成主成分系数矩阵;然后,在原始训练集和主成分系数矩阵的基础上进行特征转换,生成新的训练子集,再次使用Bagging对子集进行抽样,提升训练集的差异性,并使用训练子集训练C4.5基分类器;最后,使用测试集评价基分类器,依据总体分类精度进行排序筛选,保留分类精度较高的分类器并生成一致分类结果。在不平衡网络流量数据集上进行测试实验,依据准确率和召回率两个标准对C4.5、Bagging、旋转森林和改进的旋转森林四种算法评价,依据模型训练时间和测试时间评价四种算法的时间效率。实验结果表明改进的旋转森林算法对万维网(WWW)协议、Mail协议、Attack协议、对等网(P2P)协议的分类准确度达到99.5%以上,召回率也高于旋转森林、Bagging、C4.5三种算法,可用于网络入侵取证、维护网络安全、提升网络服务质量。  相似文献   

17.
Brain–Computer Interfaces (BCIs) based on Electroencephalograms (EEG) monitor mental activity with the ultimate objective of allowing people to communicate with computers only via their thoughts. Users must create precise cerebral activity patterns that the system uses as control signals to do this. A common activity used to elicit such signals is Motor Imagery (MI), in which certain signals are created in the sensorimotor cortex while imagining the movements. The three phases of the traditional EEG–BCI processing pipeline are preprocessing, feature extraction, and classification. We provide categorization advances and track performance gains in 4-class MI-based BCIs. In this study, 4-class MI events are produced via an illusory elevation of the left hand, right hand, feet, and tongue. Finally, a two-phase classification technique is provided with ANN classifiers being used in the first phase to discriminate between different pair-wise MI tasks. Secondly, an adaptive SVM classifier is used to assess the user's end task based on the weighted outputs of the classifiers. An adaptive classifier is one technique to maintain consistency in performance, reduce training time, and eliminate non-stationaries, all of which are required for efficient BCI performance. The suggested approach outperformed conventional two-stage classification algorithms on MI data, according to experimental findings. The average classification accuracy of this technique is 96% for datasets BCI competition IV 2a. This is a 4% improvement over the comparison approach.  相似文献   

18.
在标签均衡分布且标注样本足够多的数据集上,监督式分类算法通常可以取得比较好的分类效果。然而,在实际应用中样本的标签分布通常是不均衡的,分类算法的分类性能就变得比较差。为此,结合SLDA(Supervised LDA)有监督主题模型,提出一种不均衡文本分类新算法ITC-SLDA(Imbalanced Text Categorization based on Supervised LDA)。基于SLDA主题模型,建立主题与稀少类别之间的精确映射,以提高少数类的分类精度。利用SLDA模型对未标注样本进行标注,提出一种新的未标注样本的置信度计算方法,以及类别约束的采样策略,旨在有效采样未标注样本,最终降低不均衡文本的倾斜度,提升不均衡文本的分类性能。实验结果表明,所提方法能明显提高不均衡文本分类任务中的Macro-F1和G-mean值。  相似文献   

19.
现实中许多领域产生的数据通常具有多个类别并且是不平衡的。在多类不平衡分类中,类重叠、噪声和多个少数类等问题降低了分类器的能力,而有效解决多类不平衡问题已经成为机器学习与数据挖掘领域中重要的研究课题。根据近年来的多类不平衡分类方法的文献,从数据预处理和算法级分类方法两方面进行了分析与总结,并从优缺点和数据集等方面对所有算法进行了详细的分析。在数据预处理方法中,介绍了过采样、欠采样、混合采样和特征选择方法,对使用相同数据集算法的性能进行了比较。从基分类器优化、集成学习和多类分解技术三个方面对算法级分类方法展开介绍和分析。最后对多类不平衡数据分类研究领域的未来发展方向进行总结归纳。  相似文献   

20.
: Cardiotocography (CTG) represents the fetus’s health inside the womb during labor. However, assessment of its readings can be a highly subjective process depending on the expertise of the obstetrician. Digital signals from fetal monitors acquire parameters (i.e., fetal heart rate, contractions, acceleration). Objective:: This paper aims to classify the CTG readings containing imbalanced healthy, suspected, and pathological fetus readings. Method:: We perform two sets of experiments. Firstly, we employ five classifiers: Random Forest (RF), Adaptive Boosting (AdaBoost), Categorical Boosting (CatBoost), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LGBM) without over-sampling to classify CTG readings into three categories: healthy, suspected, and pathological. Secondly, we employ an ensemble of the above-described classifiers with the over-sampling method. We use a random over-sampling technique to balance CTG records to train the ensemble models. We use 3602 CTG readings to train the ensemble classifiers and 1201 records to evaluate them. The outcomes of these classifiers are then fed into the soft voting classifier to obtain the most accurate results. Results:: Each classifier evaluates accuracy, Precision, Recall, F1-scores, and Area Under the Receiver Operating Curve (AUROC) values. Results reveal that the XGBoost, LGBM, and CatBoost classifiers yielded 99% accuracy. Conclusion:: Using ensemble classifiers over a balanced CTG dataset improves the detection accuracy compared to the previous studies and our first experiment. A soft voting classifier then eliminates the weakness of one individual classifier to yield superior performance of the overall model.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号