Similar Articles
1.
To address the noise that the Synthetic Minority Over-Sampling Technique (SMOTE) introduces when synthesizing new minority-class samples, an improved imbalanced-data classification algorithm based on denoising autoencoder networks (SMOTE-SDAE) is proposed. The algorithm first balances the original data set by synthesizing new minority-class samples with SMOTE; since this synthesis introduces noise, the layer-wise unsupervised denoising learning and supervised fine-tuning of a denoising autoencoder network are then used to denoise and classify the oversampled data set. Experiments on imbalanced UCI data sets show that, compared with a traditional SVM, the algorithm significantly improves the classification accuracy of the minority class.
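The core of the SMOTE step referenced in this abstract is linear interpolation between a minority sample and one of its nearest minority neighbours. A minimal stdlib-only sketch of that step (not the authors' SMOTE-SDAE code; function and parameter names are illustrative):

```python
import random

def smote_samples(minority, n_new, k=2, seed=0):
    """Generate synthetic minority samples by linear interpolation
    between a point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        # k nearest minority neighbours of p (excluding p itself)
        neighbors = sorted((q for q in minority if q is not p),
                           key=lambda q: dist(p, q))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(p, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
new_pts = smote_samples(minority, n_new=4)
print(len(new_pts))  # 4 synthetic points inside the minority region
```

Each synthetic point lies on the segment between two real minority points, which is exactly why the abstract notes that SMOTE can introduce noise when those segments cross into majority territory.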

2.
For the problem of classifying imbalanced noisy data streams, this paper combines the average-probability ensemble classifier AP with sampling techniques and proposes an ensemble classifier model (IMDAP) for imbalanced noisy data streams. Experimental results show that this ensemble classifier is better suited to mining and classifying imbalanced data streams with concept drift and noise: its overall classification performance is superior to the AP ensemble model, it markedly improves the classification accuracy of the minority class, and its time complexity is close to that of AP.

3.
Machine learning algorithms such as genetic programming (GP) can evolve biased classifiers when data sets are unbalanced. Data sets are unbalanced when at least one class is represented by only a small number of training examples (called the minority class) while other classes make up the majority. In this scenario, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence that the larger majority class has on traditional training criteria in the fitness function. This paper aims to both highlight the limitations of the current GP approaches in this area and develop several new fitness functions for binary classification with unbalanced data. Using a range of real-world classification problems with class imbalance, we empirically show that these new fitness functions evolve classifiers with good performance on both the minority and majority classes. Our approaches use the original unbalanced training data in the GP learning process, without the need to artificially balance the training examples from the two classes (e.g., via sampling).
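One common way to build a fitness function that resists majority-class dominance is to weight each class's accuracy equally. A minimal sketch of such a balanced-accuracy fitness (an illustration of the general idea, not the paper's exact functions):

```python
def balanced_fitness(y_true, y_pred):
    """Fitness = mean of per-class accuracies, so the majority class
    cannot dominate the score on unbalanced data."""
    classes = set(y_true)
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)

# 9 majority (class 0) examples + 1 minority (class 1) example;
# a degenerate classifier predicts class 0 everywhere.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(balanced_fitness(y_true, y_pred))  # 0.5, although raw accuracy is 0.9
```

Under plain accuracy the degenerate classifier scores 0.9 and looks strong; under the balanced measure it scores only 0.5, so evolution is pushed toward classifiers that also get the minority class right.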

4.
To address the imbalanced class distributions found in practical applications, an oversampling method that fuses feature-boundary information is proposed. After removing noise points from the data set, the method uses the multi-class nearest-neighbor sets of the minority-class samples, fused with the geometric distribution of the feature boundary, to obtain the minority-class points most useful for defining an optimal nonlinear classification boundary, and generates new samples by combining these points with the clusters they belong to. Imbalanced data sets are processed with several oversampling techniques and then classified with a support vector machine; comparative experiments show that the proposed method effectively improves classification accuracy on imbalanced data, verifying its effectiveness.

5.
To address the low prediction accuracy of the minority class when traditional classifiers are applied to imbalanced data sets, an imbalanced-data classification algorithm based on undersampling and cost sensitivity, USCBoost, is proposed. Before each base classifier is trained in an AdaBoost iteration, the majority-class samples are sorted by weight in descending order, and as many majority-class samples as there are minority-class samples are selected according to their weights; the weights of the sampled majority-class samples are then normalized and, together with the minority-class samples, they form a …
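The undersampling step described above (keep as many of the highest-weight majority samples as there are minority samples, then renormalize their weights) can be sketched as follows; the names are illustrative, and this is not the authors' implementation:

```python
def undersample_by_weight(majority, weights, n_minority):
    """Keep the n_minority highest-weight majority samples and
    renormalize their weights so they sum to 1."""
    ranked = sorted(zip(majority, weights), key=lambda t: t[1], reverse=True)
    kept = ranked[:n_minority]
    total = sum(w for _, w in kept)
    return [(x, w / total) for x, w in kept]

majority = ["a", "b", "c", "d", "e"]
weights = [0.05, 0.30, 0.10, 0.40, 0.15]
result = undersample_by_weight(majority, weights, n_minority=2)
print(result)  # the two highest-weight samples "d" and "b", renormalized
```

Because AdaBoost raises the weights of hard examples, sampling by weight keeps the majority-class examples that are currently hardest to classify, rather than a random subset.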

6.
In this paper, we present an empirical analysis on transfer learning using the Fuzzy Min–Max (FMM) neural network with an online learning strategy. Three transfer learning benchmark data sets, i.e., 20 Newsgroups, WiFi Time, and Botswana, are used for evaluation. In addition, the data samples are corrupted with white Gaussian noise up to 50 %, in order to assess the robustness of the online FMM network in handling noisy transfer learning tasks. The results are analyzed and compared with those from other methods. The outcomes indicate that the online FMM network is effective for undertaking transfer learning tasks in noisy environments.

7.
As a recently proposed machine learning method, active learning of Gaussian processes can effectively use a small number of labeled examples to train a classifier, which in turn is used to select the most informative examples from unlabeled data for manual labeling. However, in the process of example selection, active learning usually needs to consider all the unlabeled data without exploiting the structural space connectivity among them. This decreases the classification accuracy to some extent, since the selected points may not be the most informative. To overcome this shortcoming, in this paper, we present a method which applies the manifold-preserving graph reduction (MPGR) algorithm to the traditional active learning method of Gaussian processes. MPGR is a simple and efficient example sparsification algorithm which can construct a subset to represent the global structure and simultaneously eliminate the influence of noisy points and outliers. Thereby, when actively selecting examples to label, we choose only from the subset constructed by MPGR instead of from the whole unlabeled data. We report experimental results on multiple data sets which demonstrate that our method obtains better classification performance compared with the original active learning method of Gaussian processes.
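The key idea — selecting the most informative example only from the MPGR-reduced subset rather than from the whole pool — can be illustrated with a toy uncertainty score. This sketch omits MPGR itself and the Gaussian-process model; the uncertainty function below is a stand-in:

```python
def select_from_subset(subset, uncertainty):
    """Active-learning selection restricted to a sparsified subset:
    pick the subset point with the highest uncertainty score."""
    return max(subset, key=uncertainty)

# Toy uncertainty: closeness of a 1-D point to a decision boundary at 0.
uncertainty = lambda x: -abs(x)
pool_subset = [-2.0, -0.3, 1.5, 0.1]  # e.g., output of a graph-reduction step
print(select_from_subset(pool_subset, uncertainty))  # 0.1, closest to boundary
```

Restricting the search to the reduced subset keeps noisy points and outliers, which MPGR has already removed, from ever being selected for labeling.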

8.
To address the imbalance and noisy samples present in cancer data sets, a survival-prediction algorithm for cancer patients, RENN-SMOTE-SVM, based on the RENN and SMOTE algorithms, is proposed. Based on the nearest-neighbor rule, the RENN algorithm reduces the number of noisy samples in the majority class, and the SMOTE algorithm increases the number of samples by linear interpolation between minority-class samples, yielding a balanced data set. Survival prediction for cancer patients is carried out on an imbalanced breast-cancer patient data set from the U.S. cancer database; experimental results show that, compared with five commonly used algorithms including SVM and Tomeklinks-SVM, the proposed algorithm classifies and predicts better, with accuracy, F1-score, and G-means values of 0.883, 0.904, and 0.779, respectively.
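The RENN step referenced above repeatedly applies the edited-nearest-neighbours rule: a sample is dropped when the majority of its k nearest neighbours carry a different label, and the pass repeats until nothing is removed. A stdlib-only sketch (illustrative, not the paper's code):

```python
from collections import Counter

def edited_nn(points, labels, k=3):
    """One ENN pass: keep a point only if the majority of its k nearest
    neighbours agree with its label; return kept indices."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    keep = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        others = [(dist(p, q), l)
                  for j, (q, l) in enumerate(zip(points, labels)) if j != i]
        nn_labels = [l for _, l in sorted(others)[:k]]
        if Counter(nn_labels).most_common(1)[0][0] == lab:
            keep.append(i)
    return keep

def renn(points, labels, k=3):
    """RENN: repeat ENN until no point is removed; return surviving indices."""
    idx = list(range(len(points)))
    while True:
        kept = edited_nn([points[i] for i in idx], [labels[i] for i in idx], k)
        if len(kept) == len(idx):
            return idx
        idx = [idx[i] for i in kept]

# A minority-labelled point (index 4) sitting deep inside the majority cluster:
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5), (5, 6), (6, 5)]
labs = [0, 0, 0, 0, 1, 1, 1, 1]
print(renn(pts, labs))  # index 4 (the likely-noisy point) is removed
```

Note that RENN only ever removes points, which is why the pipeline above pairs it with SMOTE: cleaning shrinks the majority class while interpolation grows the minority class.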

9.
This paper presents a framework for incremental neural learning (INL) that allows a base neural learning system to incrementally learn new knowledge from only new data without forgetting the existing knowledge. Upon subsequent encounters of new data examples, INL utilizes prior knowledge to direct its incremental learning. A number of critical issues are addressed including when to make the system learn new knowledge, how to learn new knowledge without forgetting existing knowledge, how to perform inference using both the existing and the newly learnt knowledge, and how to detect and deal with aged learnt systems. To validate the proposed INL framework, we use backpropagation (BP) as a base learner and a multi-layer neural network as a base intelligent system. INL has several advantages over existing incremental algorithms: it can be applied to a broad range of neural network systems beyond the BP trained neural networks; it retains the existing neural network structures and weights even during incremental learning; the neural network committees generated by INL do not interact with one another and each sees the same inputs and error signals at the same time; this limited communication makes the INL architecture attractive for parallel implementation. We have applied INL to two vehicle fault diagnostics problems: end-of-line test in auto assembly plants and onboard vehicle misfire detection. These experimental results demonstrate that the INL framework has the capability to successfully perform incremental learning from unbalanced and noisy data. In order to show the general capabilities of INL, we also applied INL to three general machine learning benchmark data sets. The INL systems showed good generalization capabilities in comparison with other well known machine learning algorithms.

10.
An improved K-Means-Boosting-BP model for imbalanced police-incident data
Objective: Understanding the spatio-temporal distribution of police incidents, building spatio-temporal prediction models with machine-learning algorithms, and formulating scientific policing plans that effectively suppress crime are central topics in crime geography. Existing studies show that police incidents concentrate in central urban districts or densely populated residential areas and are therefore spatio-temporally imbalanced; models trained on such data tend to become weak learners with low prediction accuracy. To solve this imbalanced regression problem, a Boosting algorithm based on KMeans clustering is proposed. Method: Built on the Boosting ensemble-learning framework, the algorithm generates base classifiers with GA-BP neural networks and integrates them with the KMeans clustering algorithm, thereby promoting weak learners to a strong learner. Results: Compared with SMOTEBoosting (Synthetic Minority Oversampling Technique Boosting), a common approach to imbalanced regression, the algorithm has two advantages: 1) it reduces both the minority-class mean squared error and the overall mean squared error, reaching an overall MSE of 9.85E-05 versus 2.14E-04 for SMOTEBoosting; 2) it better balances precision and recall for the minority class: KMeans-Boosting attains a recall of about 52% versus about 91% for SMOTEBoosting, but a precision of 85%, far above SMOTEBoosting's 19%. Conclusion: The KMeans-Boosting algorithm markedly reduces the overall mean squared error on imbalanced data and improves minority-class precision and recall; it is an effective algorithm for imbalanced regression and classification problems and can be extended to other fields that must handle imbalanced data.

11.
In recent years, with the successful application of word embeddings and various neural network models to natural language processing, neural-network-based text classification has become the mainstream approach. When the training data of different classes are imbalanced, however, the trained network is dominated by the majority classes and its predictions skew toward them, severely degrading classification performance. To address this, class-label weights are introduced into the loss function during convolutional neural network training, strengthening the influence of minority classes on the model parameters. Tests on the Fudan University text classification data set show that the proposed method improves macro-averaged F1 by 4.49% over the baseline system, effectively mitigating the class-imbalance problem.
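Introducing class-label weights into the loss, as this abstract describes, typically means scaling each example's cross-entropy term by a weight tied to its true class. A minimal sketch (the weight values here are illustrative, not the paper's):

```python
import math

def weighted_cross_entropy(probs, label, class_weights):
    """Cross-entropy with a per-class weight on the true label,
    amplifying the loss contribution of minority-class examples."""
    return -class_weights[label] * math.log(probs[label])

probs = [0.7, 0.3]  # model confidence for classes 0 and 1
w = [1.0, 3.0]      # minority class 1 weighted 3x (illustrative choice)
loss_majority = weighted_cross_entropy(probs, 0, w)
loss_minority = weighted_cross_entropy(probs, 1, w)
print(loss_minority > loss_majority)  # True: minority errors cost more
```

Because the gradient scales with the loss, the weight directly amplifies how strongly minority-class errors update the model parameters.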

12.
李延超, 肖甫, 陈志, 李博. 《软件学报》 2020, 31(12): 3808-3822
Active learning selects samples from a large pool of unlabeled data for expert labeling. Existing batch-mode active learning algorithms suffer from three main limitations: (1) some methods rely on a single selection criterion or on assumptions about the data or the model, making it difficult to find unlabeled samples that are both uncertain and representative; (2) their performance depends heavily on the accuracy of inter-sample similarity measures, such as predefined functions or diversity measures; (3) noisy labels persistently degrade their performance. A batch-mode active learning method based on deep learning is proposed. A deep neural network generates learned representations of labeled and unlabeled samples, and a label-cycling scheme links labeled samples to unlabeled samples and back to labeled samples of the same class, so that sample uncertainty and representativeness are considered jointly and the algorithm is robust to noisy labels. In the proposed method, a submodular function ensures the diversity of the selected sample set, and adaptive parameter optimization lets the algorithm automatically balance uncertainty and representativeness. The proposed method is applied to semi-supervised classification and semi-supervised clustering; experimental results show that it outperforms several existing state-of-the-art methods.

13.
Classification is a major research topic in pattern recognition. Most classical classifiers assume by default that data sets are evenly distributed, yet real-world data sets often suffer from class imbalance: the number of samples in the normal/majority class greatly exceeds the number in the abnormal/minority class. Without preprocessing, classifiers tend to ignore the minority class and favor the majority class, degrading classification results. To address imbalanced distributions, this paper proposes a hybrid sampling algorithm that incorporates spectral clustering. Spectral clustering first analyzes the distribution of the minority-class samples in the imbalanced data set; guided by this distribution information, the minority class is oversampled to obtain relatively balanced samples for training the classification model. Extensive experiments on multiple imbalanced data sets show that the proposed method effectively alleviates the imbalance problem and improves the classifier's accuracy on minority-class samples.

14.
董一鸿. 《计算机工程》 2003, 29(19): 136-138
A new learning algorithm based on competitive neural networks is proposed, combining the characteristics of competitive neural networks and hierarchical clustering. The competitive neural network performs a preliminary classification of the objects, and the hidden layer applies the Hebb learning rule to learn associations among the subclasses. The algorithm learns quickly, classifies well, can cluster clusters of arbitrary shape and size, and is unaffected by noise, making it a fast and efficient classification algorithm.

15.
Data produced in many real-world domains are multi-class and imbalanced. In multi-class imbalanced classification, problems such as class overlap, noise, and multiple minority classes degrade classifier performance, and solving multi-class imbalance effectively has become an important research topic in machine learning and data mining. Based on recent literature on multi-class imbalanced classification methods, this paper analyzes and summarizes the field from two perspectives, data preprocessing and algorithm-level classification methods, and analyzes all algorithms in detail in terms of their strengths, weaknesses, and data sets. Among data preprocessing methods, oversampling, undersampling, hybrid sampling, and feature selection are introduced, and the performance of algorithms using the same data sets is compared. Algorithm-level classification methods are presented and analyzed from three angles: base-classifier optimization, ensemble learning, and multi-class decomposition techniques. Finally, future research directions for multi-class imbalanced data classification are summarized.

16.
A classification algorithm for imbalanced data sets based on a hybrid resampling strategy
Class imbalance is a common problem in classification: when one class has far more instances than another, the classes are imbalanced. Many real-world classification problems exhibit class imbalance and have attracted wide attention from experts and scholars; imbalanced-data classification has become a new research focus in data mining and pattern recognition and poses a major challenge to traditional classification algorithms. This paper proposes a new resampling algorithm: an improved SMOTE algorithm oversamples the minority class, generating new minority-class samples until the classes are roughly balanced; then, exploiting the characteristics of the SMO algorithm, a clustering-based undersampling method is used to remove redundant or noisy data. After this oversampling and cleaning, useful samples are retained, the data set shrinks, and the efficiency of support vector machine training improves. Experimental results show that the method effectively improves minority-class accuracy while maintaining overall classification performance.

17.
In classification, noise may deteriorate the system performance and increase the complexity of the models built. In order to mitigate its consequences, several approaches have been proposed in the literature. Among them, noise filtering, which removes noisy examples from the training data, is one of the most used techniques. This paper proposes a new noise filtering method that combines several filtering strategies in order to increase the accuracy of the classification algorithms used after the filtering process. The filtering is based on the fusion of the predictions of several classifiers used to detect the presence of noise. We translate the idea behind multiple classifier systems, where the information gathered from different models is combined, to noise filtering. In this way, we consider the combination of classifiers instead of using only one to detect noise. Additionally, the proposed method follows an iterative noise filtering scheme that allows us to avoid the usage of detected noisy examples in each new iteration of the filtering process. Finally, we introduce a noisy score to control the filtering sensitivity, in such a way that the amount of noisy examples removed in each iteration can be adapted to the necessities of the practitioner. The first two strategies (use of multiple classifiers and iterative filtering) are used to improve the filtering accuracy, whereas the last one (the noisy score) controls the level of conservation of the filter removing potentially noisy examples. The validity of the proposed method is studied in an exhaustive experimental study. We compare the new filtering method against several state-of-the-art methods to deal with datasets with class noise and study their efficacy in three classifiers with different sensitivity to noise.
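The fusion-and-score idea can be sketched as follows: an example is flagged as noisy when the fraction of classifiers that misclassify it exceeds a noise-score threshold. The stand-in classifiers below are trivial; this illustrates only the voting scheme, not the paper's full method (which also iterates the filtering):

```python
def filter_noise(examples, labels, classifiers, score_threshold=0.5):
    """Flag an example as noisy when the fraction of classifiers that
    misclassify it exceeds score_threshold; return indices of kept examples."""
    kept = []
    for i, (x, y) in enumerate(zip(examples, labels)):
        wrong = sum(1 for clf in classifiers if clf(x) != y)
        if wrong / len(classifiers) <= score_threshold:
            kept.append(i)
    return kept

# Three toy "classifiers" that all threshold a 1-D feature at 0.
classifiers = [lambda x: int(x > 0)] * 3
xs = [-2.0, -1.0, 1.0, 2.0, -0.5]
ys = [0, 0, 1, 1, 1]  # the last label contradicts every classifier
print(filter_noise(xs, ys, classifiers))  # [0, 1, 2, 3]: index 4 filtered out
```

Raising `score_threshold` makes the filter more conservative (fewer removals), which is the role the paper assigns to its noisy score.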

18.
陶新民, 童智靖, 刘玉, 付丹丹. 《控制与决策》 2011, 26(10): 1535-1541
To remedy the unsatisfactory classification performance of the traditional support vector machine (SVM) algorithm on imbalanced data, a new progressively optimized decreasing undersampling algorithm is proposed to improve SVM classification on imbalanced data sets. The algorithm removes heavily overlapping redundant and noisy samples, reducing the data while retaining more useful information, and is combined with a borderline synthetic minority oversampling algorithm to balance the training data set. Experiments show that the algorithm not only effectively improves the SVM's performance on the minority class of imbalanced data but also raises overall classification performance.

19.
董元方, 李雄飞, 李军. 《计算机工程》 2010, 36(24): 161-163
For the problem of learning from imbalanced data, a classification algorithm with a progressive learning scheme is proposed. Guided by the distribution of attribute values, synthetic minority-class examples are added step by step, and synthetic examples misclassified by the stage classifier are promptly removed. When the data reach the desired degree of balance, the learning algorithm is trained on the original and synthetic data to obtain the final classifier. Experimental results show that the algorithm outperforms C4.5 and, on most data sets, outperforms SMOTEBoost and DataBoost-IM.

20.
Imbalanced data sets are applied in ever more domains, with ever higher demands. To improve classification accuracy on the overall data set, an autoencoder-network method for mining imbalanced data, premised on spectral-clustering undersampling, is constructed. The clustering problem is converted into a multi-way partitioning problem on an undirected graph, and spectral clustering is completed via the undirected graph and normalization; the majority-class data are selectively undersampled to obtain the classification-boundary offset; an autoencoder network, whose learning process is unsupervised, raises and lowers the data dimensionality to extract the hidden features of each dimension, achieving efficient representation learning of the data at every level; the autoencoder network is adjusted according to the comparison between the maximum mean discrepancy and a preset threshold; and imbalanced-data mining is completed based on the resulting classification interface. UCI data sets with different practical application backgrounds are selected, and ten groups of data drawn from them serve as test sets; after spectral-clustering undersampling and simulation experiments, the proposed method is found to substantially improve minority-class classification accuracy and overall mining performance, showing good applicability and feasibility.
