首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 78 毫秒
1.
宽度学习系统(broad learning system,BLS)作为深度神经网络的替代框架,具有快速自适应模型结构选择和在线增量学习能力,被认为是知识发现和数据工程领域中一种极具前途的技术.传统的BLS主要应用于数据分 布均衡且误分类代价相同的模式分类任务,但大多数实际应用的数据是非均衡分布的,如网络入侵监测、医疗诊断、信用卡欺诈检测等.基于此,提出一种基于数据分布特性的代价敏感BLS(data distribution-based cost-sensitive-BLS,DDbCs-BLS),解决数据分布不均、误分代价不同的模式分类任务.DDbCs-BLS在充分考虑数据统计分布特性的基础上寻找代价敏感型BLS分类器的最佳分类边界,保证少数类样本信息不被丢失,从而提高BLS在各类数据集上的模式分类性能.在多种公共数据集(包括均衡和不均衡数据集)上进行大量的验证性和对比性实验,结果表明DDbCs-BLS能有效确定分类边界线的最佳位置,无论是在均衡数据集还是在不均衡数据集上均能获得更好的分类性能.  相似文献   

2.
In real-world classification problems, different types of misclassification errors often have asymmetric costs, thus demanding cost-sensitive learning methods that attempt to minimize average misclassification cost rather than plain error rate. Instance weighting and post hoc threshold adjusting are two major approaches to cost-sensitive classifier learning. This paper compares the effects of these two approaches on several standard, off-the-shelf classification methods. The comparison indicates that the two approaches lead to similar results for some classification methods, such as Naïve Bayes, logistic regression, and backpropagation neural network, but very different results for other methods, such as decision tree, decision table, and decision rule learners. The findings from this research have important implications on the selection of the cost-sensitive classifier learning approach as well as on the interpretation of a recently published finding about the relative performance of Naïve Bayes and decision trees.  相似文献   

3.
不平衡数据分类研究综述   总被引:2,自引:1,他引:1  
赵楠  张小芳  张利军 《计算机科学》2018,45(Z6):22-27, 57
在很多应用领域中,数据的类别分布不平衡,如何对其正确分类是数据挖掘和机器学习领域中的研究热点。经典的数据分类算法未考虑数据类别的不平衡性,认为类别之间的误分类代价相同,导致不平衡数据分类的效果不理想。针对数据分类的各个步骤,相继提出了不同的不平衡数据分类处理方法。对多年来的相关研究成果进行归类分析,从特征选择、数据分布调整、分类算法、分类结果评估等几个方面系统地介绍了相关方法,并探讨了进一步的探索方向。  相似文献   

4.
We propose a hybrid radial basis function network-data envelopment analysis (RBFN-DEA) neural network for classification problems. The procedure uses the radial basis function to map low dimensional input data from input space to a high dimensional + feature space where DEA can be used to learn the classification function. Using simulated datasets for a non-linearly separable binary classification problem, we illustrate how the RBFN-DEA neural network can be used to solve it. We also show how asymmetric misclassification costs can be incorporated in the hybrid RBFN-DEA model. Our preliminary experiments comparing the RBFN-DEA with feed forward and probabilistic neural networks show that the RBFN-DEA fares very well.  相似文献   

5.
大多数非均衡数据集的研究集中于纯重构数据集或者纯代价敏感学习,本文针对数据集类分布非均衡和不相等误分类代价往往同时发生这一事实,提出了一种以最小误分类代价为目标的基于混合重取样的代价敏感学习算法。该算法将两种不同类型解决方案有机地融合在一起,先用样本类空间重构的方法使原始数据集的两类数据达到基本均衡,然后再引入代价敏感学习算法进行分类,能提高少数类分类精度,同时有效降低总的误分类代价。实验结果验证了该算法在处理非均衡类问题时比传统算法要优越。  相似文献   

6.
Recently, the problem of imbalanced data classification has drawn a significant amount of interest from academia, industry and government funding agencies. The fundamental issue with imbalanced data classification is the imbalanced data has posed a significant drawback of the performance of most standard learning algorithms, which assume or expect balanced class distribution or equal misclassification costs. Boosting is a meta-technique that is applicable to most learning algorithms. This paper gives a review of boosting methods for imbalanced data classification, denoted as IDBoosting (Imbalanced-data-boosting), where conventional learning algorithms can be integrated without further modifications. The main focus is on the intrinsic mechanisms without considering implementation detail. Existing methods are catalogued and each class is displayed in detail in terms of design criteria, typical algorithms and performance analysis. The essence of two IDBoosting methods is discovered followed by experimental evidence and useful reference point for future research are also given.  相似文献   

7.
不平衡数据分类是机器学习研究领域中的一个热点问题。针对传统分类算法处理不平衡数据的少数类识别率过低问题,文章提出了一种基于聚类的改进AdaBoost分类算法。算法首先进行基于聚类的欠采样,在多数类样本上进行K均值聚类,之后提取聚类质心,与少数类样本数目一致的聚类质心和所有少数类样本组成新的平衡训练集。为了避免少数类样本数量过少而使训练集过小导致分类精度下降,采用少数过采样技术过采样结合聚类欠采样。然后,借鉴代价敏感学习思想,对AdaBoost算法的基分类器分类误差函数进行改进,赋予不同类别样本非对称错分损失。实验结果表明,算法使模型训练样本具有较高的代表性,在保证总体分类性能的同时提高了少数类的分类精度。  相似文献   

8.
Associative memories have emerged as a powerful computational neural network model for several pattern classification problems. Like most traditional classifiers, these models assume that the classes share similar prior probabilities. However, in many real-life applications the ratios of prior probabilities between classes are extremely skewed. Although the literature has provided numerous studies that examine the performance degradation of renowned classifiers on different imbalanced scenarios, so far this effect has not been supported by a thorough empirical study in the context of associative memories. In this paper, we fix our attention on the applicability of the associative neural networks to the classification of imbalanced data. The key questions here addressed are whether these models perform better, the same or worse than other popular classifiers, how the level of imbalance affects their performance, and whether distinct resampling strategies produce a different impact on the associative memories. In order to answer these questions and gain further insight into the feasibility and efficiency of the associative memories, a large-scale experimental evaluation with 31 databases, seven classification models and four resampling algorithms is carried out here, along with a non-parametric statistical test to discover any significant differences between each pair of classifiers.  相似文献   

9.
不平衡入侵检测数据的代价敏感分类策略*   总被引:1,自引:0,他引:1  
提出一种新的预处理算法AdaP,不仅有效避免了数据过度拟合,且可独立使用。针对不平衡的入侵检测数据集,引入代价敏感机制,基于权值矩阵最小化误分类代价的思想,去除部分训练密集区域、拓展稀疏区域的同时再过滤噪声,最终实现了AdaP算法与AdaCost算法相结合的策略。实验证明此策略充分体现了提升算法有效提升前端弱分类算法分类精度和预处理算法平衡稀有类数据的优势,且可有效提高不平衡入侵检测数据的分类性能。  相似文献   

10.
为提高不平衡数据的分类性能,提出了基于度量指标优化的不平衡数据Boosting算法。该算法结合不平衡数据分类性能度量标准和Boosting算法,使用不平衡数据分类性能度量指标代替原有误分率指标,分别采用带有权重的正类和负类召回率、F-measure和G-means指标对Boosting算法进行优化,按照不同的度量指标计算Alpha 值进行迭代,得到带有加权值的弱学习器组合,最后使用Boosting算法进行优化。经过实验验证,与带有权重的Boosting算法进行比较,该算法对一定数据集的AUC分类性能指标有一定提高,错误率有所下降,对F-measure和G-mean性能指标有一定的改善,说明该算法侧重提高正类分类性能,改善不平衡数据的整体分类性能。  相似文献   

11.
Multi-class pattern classification has many applications including text document classification, speech recognition, object recognition, etc. Multi-class pattern classification using neural networks is not a trivial extension from two-class neural networks. This paper presents a comprehensive and competitive study in multi-class neural learning with focuses on issues including neural network architecture, encoding schemes, training methodology and training time complexity. Our study includes multi-class pattern classification using either a system of multiple neural networks or a single neural network, and modeling pattern classes using one-against-all, one-against-one, one-against-higher-order, and P-against-Q. We also discuss implementations of these approaches and analyze training time complexity associated with each approach. We evaluate six different neural network system architectures for multi-class pattern classification along the dimensions of imbalanced data, large number of pattern classes, large vs. small training data through experiments conducted on well-known benchmark data.  相似文献   

12.
入侵检测系统的防御性能经常受到类不平衡数据的影响,为了自动提取稀缺类别的数据特征,提高入侵检测系统识别未知网络攻击的精度,提出一种代价约束算法.首先,基于栈式自动编码器构建深度神经网络,在隐藏层的神经元上添加稀疏约束;其次,通过生成代价矩阵优化代价目标函数,对类不平衡数据特征分配代价;最后,利用反向传播微调神经网络模型...  相似文献   

13.
Classification of data with imbalanced class distribution has posed a significant drawback of the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. The significant difficulty and frequent occurrence of the class imbalance problem indicate the need for extra research efforts. The objective of this paper is to investigate meta-techniques applicable to most classifier learning algorithms, with the aim to advance the classification of imbalanced data. The AdaBoost algorithm is reported as a successful meta-technique for improving classification accuracy. The insight gained from a comprehensive analysis of the AdaBoost algorithm in terms of its advantages and shortcomings in tacking the class imbalance problem leads to the exploration of three cost-sensitive boosting algorithms, which are developed by introducing cost items into the learning framework of AdaBoost. Further analysis shows that one of the proposed algorithms tallies with the stagewise additive modelling in statistics to minimize the cost exponential loss. These boosting algorithms are also studied with respect to their weighting strategies towards different types of samples, and their effectiveness in identifying rare cases through experiments on several real world medical data sets, where the class imbalance problem prevails.  相似文献   

14.
类别不平衡问题广泛存在于现实生活中,多数传统分类器假定类分布平衡或误分类代价相等,因此类别不平衡数据严重影响了传统分类器的分类性能。针对不平衡数据集的分类问题,提出了一种处理不平衡数据的概率阈值Bagging分类方法-PT Bagging。将阈值移动技术与Bagging集成算法结合起来,在训练阶段使用原始分布的训练集进行训练,在预测阶段引入决策阈值移动方法,利用校准的后验概率估计得到对不平衡数据分类的最大化性能测量。实验结果表明,PT Bagging算法具有更好的处理不平衡数据的分类优势。  相似文献   

15.
在集成算法中嵌入代价敏感和重采样方法是一种有效的不平衡数据分类混合策略。针对现有混合方法中误分代价计算和欠采样过程较少考虑样本的类内与类间分布的问题,提出了一种密度峰值优化的球簇划分欠采样不平衡数据分类算法DPBCPUSBoost。首先,利用密度峰值信息定义多数类样本的抽样权重,将存在“近邻簇”的多数类球簇划分为“易误分区域”和“难误分区域”,并提高“易误分区域”内样本的抽样权重;其次,在初次迭代过程中按照抽样权重对多数类样本进行欠采样,之后每轮迭代中按样本分布权重对多数类样本进行欠采样,并把欠采样后的多数类样本与少数类样本组成临时训练集并训练弱分类器;最后,结合样本的密度峰值信息与类别分布为所有样本定义不同的误分代价,并通过代价调整函数增加高误分代价样本的权重。在10个KEEL数据集上的实验结果表明,与现有自适应增强(AdaBoost)、代价敏感自适应增强(AdaCost)、随机欠采样增强(RUSBoost)和代价敏感欠采样自适应增强(USCBoost)等不平衡数据分类算法相比,DPBCPUSBoost在准确率(Accuracy)、F1分数(F1-Score)、几何均值(G-mean)和受试者工作特征(ROC)曲线下的面积(AUC)指标上获得最高性能的数据集数量均多于对比算法。实验结果验证了DPBCPUSBoost中样本误分代价和抽样权重定义的有效性。  相似文献   

16.
Automatic in silico synthesis of metabolic pathway can practically reduce the cost of wet laboratories. To achieve this, predicting whether or not two metabolites are transformable is the first essential step. The problems of predicting the possibility of transforming one metabolite into another and how to computationally synthesize a metabolic pathway were studied. These two problems were transformed to the problem of classifying features of metabolite pairs into transformable or non-transformable classes. The following two main issues were contributed in this study: (1) two new feature schemes, i.e. the projected features on their first principal component and the average features, for representing transform-ability of each metabolite pair using 2D and 3D compound structural features and (2) a method of modified imbalanced data handling by adding synthetic boundary data of different classes to balance data. Based on the E. coli reference pathways, the results of proposed features with feature selection and our imbalanced data handling approach show the better performance than the results from other methods when evaluated by several metrics. Our significant feature group possibly achieves high classification correctness of computational pathway synthesis. In pathway recovery results by a group of neural network models, 19 pathways were significantly recovered by our feature group at each recovery ratio of at least 0.5, whereas the other compared feature group gave only four significantly recovered pathways.  相似文献   

17.
卷积神经网络具有高效的特征提取能力和较少的参数量,被广泛应用于图像处理、目标跟踪、自然语言等领域。针对传统分类模型对于结构化非平衡数据分类效果较差的问题,提出一种基于卷积神经网络的二分类结构化非平衡数据分类算法。设计结构化数据处理算法Data-Shuffle,将原始非平衡一维结构化数据转换为三维数组形式的多通道非平衡数据,为卷积神经网络提供更多的特征值,通过改进的VGG网络构建适合非平衡数据的网络结构卷积组,以提取不同的特征。在此基础上,提出更新权重加权采样算法UWSCNN,在每个迭代次数之后,根据模型的训练结果对易错样本进行重新加权,以优化训练结果。在adult、shoppers和diabetes数据集上的实验结果表明,相比逻辑回归、随机森林等传统机器学习模型,所提的Data-Shuffle算法的F1值提升了1%~19%,G-mean提升了2%~24%,相比SMOTECNN、BSMOTECNN、SMOTECNN+CS等采样算法,所提的UWSCNN算法对非平衡数据的分类效果提升了1%~13%,有效提高模型对非平衡数据的分类性能。  相似文献   

18.
一种新的不平衡数据学习算法PCBoost   总被引:8,自引:0,他引:8  
现实世界中广泛存在不平衡数据,其分类问题是机器学习研究中的一个热点.多数传统分类算法假定类分布平衡或误分类代价均衡,在处理不平衡数据时,效果不够理想.文中提出一种不平衡数据分类算法-PCBoost.算法以信息增益率为分裂准则构建决策树,作为弱分类器.在每次迭代初始,利用数据合成方法添加合成的少数类样例,平衡训练信息;在子分类器形成后,修正“扰动”,删除未被正确分类的合成样例.文中讨论了数据合成方法,给出了训练误差界的理论分析,并分析了集成学习参数的选择.实验结果表明,PCBoost算法具有处理不平衡数据分类问题的优势.  相似文献   

19.
It is well-known that software defect prediction is one of the most important tasks for software quality improvement. The use of defect predictors allows test engineers to focus on defective modules. Thereby testing resources can be allocated effectively and the quality assurance costs can be reduced. For within-project defect prediction (WPDP), there should be sufficient data within a company to train any prediction model. Without such local data, cross-project defect prediction (CPDP) is feasible since it uses data collected from similar projects in other companies. Software defect datasets have the class imbalance problem increasing the difficulty for the learner to predict defects. In addition, the impact of imbalanced data on the real performance of models can be hidden by the performance measures chosen. We investigate if the class imbalance learning can be beneficial for CPDP. In our approach, the asymmetric misclassification cost and the similarity weights obtained from distributional characteristics are closely associated to guide the appropriate resampling mechanism. We performed the effect size A-statistics test to evaluate the magnitude of the improvement. For the statistical significant test, we used Wilcoxon rank-sum test. The experimental results show that our approach can provide higher prediction performance than both the existing CPDP technique and the existing class imbalance technique.  相似文献   

20.
代价敏感概率神经网络及其在故障诊断中的应用   总被引:3,自引:1,他引:2  
针对传统的分类算法人多以误分率最小化为目标,忽略了误分类型之间的差别和数据集的非平衡性的问题,提出代价敏感概率神经网络算法.该算法将代价敏感机制引入概率神经网络,用期望代价取代误分率,以期望代价最小化为目标,基于期望代价最小的贝叶斯决策规则预测新样本类别.采用工业现场数据和数据集German Credit验证了该算法的有效性.实验结果表明,该算法具有故障识别率高、泛化能力强、建模时间短等特点.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号