Similar Documents
20 similar documents found (search time: 171 ms)
1.
High-dimensional sparse features and class imbalance in credit data tend to degrade model classification performance, so a novel framework for building credit scoring models is proposed. First, the high-dimensional sparse features are handled by computing feature similarity; second, to address class imbalance, an improved SMOTE method based on feature clustering (FC-SMOTE) is proposed to balance the dataset and thereby improve classification performance; finally, XGBoost is used as the base classifier to build the credit scoring model. Experiments on publicly available real-world credit data and on credit datasets from the UCI repository, compared against the traditional oversampling methods SMOTE, Borderline-SMOTE and ADASYN, show that the proposed FC-SMOTE method gives the XGBoost-based credit scoring model higher predictive accuracy.
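The abstract does not spell out FC-SMOTE itself, but the oversample-then-boost pipeline it builds on is straightforward. A minimal sketch, assuming plain SMOTE from imbalanced-learn as a stand-in for the feature-clustering variant and synthetic data in place of the credit datasets:

```python
# Sketch of the oversample-then-boost pipeline; plain SMOTE stands in
# for the paper's FC-SMOTE, which is not reproduced here.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Synthetic imbalanced "credit" data: 5% positives (e.g. defaults).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training split so nothing leaks into the test set.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
clf.fit(X_bal, y_bal)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```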

2.
To solve imbalanced multi-class classification problems, an ensemble classification algorithm for imbalanced data based on sampling and feature selection (IDESF) is proposed. Because base-classifier diversity affects an ensemble's classification performance, IDESF applies a two-stage sampling scheme: sampling with replacement followed by SMOTE. While keeping the resulting samples reasonable, two-stage sampling increases the differences between the datasets and thus implicitly improves base-classifier diversity; it also balances the class distribution, preventing the classifiers from being biased toward the majority class. On top of the two-stage sampling, IDESF introduces data cleaning and feature selection to further improve classification performance. Comparative experiments against other imbalanced classification algorithms on 5 imbalanced datasets show that the algorithm achieves higher AUCarea and G-Mean values and delivers excellent classification results.
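A minimal sketch of the two-stage sampling idea (bootstrap with replacement, then SMOTE), which is what drives both diversity and balance. The decision-tree base learner and ensemble size are assumptions, and the paper's data-cleaning and feature-selection steps are omitted:

```python
# Two-stage sampling ensemble: each base learner sees a different
# bootstrap replicate, which is then balanced with SMOTE.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

def two_stage_ensemble(X, y, n_estimators=10, seed=0):
    rng = np.random.RandomState(seed)
    models = []
    for i in range(n_estimators):
        # Stage 1: bootstrap sample (with replacement) for diversity.
        Xb, yb = resample(X, y, random_state=rng.randint(1 << 30))
        # Stage 2: SMOTE to balance the replicate (assumes each
        # bootstrap keeps at least k_neighbors+1 minority samples).
        Xs, ys = SMOTE(random_state=i).fit_resample(Xb, yb)
        models.append(DecisionTreeClassifier(random_state=i).fit(Xs, ys))
    return models

def ensemble_predict(models, X):
    # Average the per-model class-1 probabilities and threshold at 0.5.
    p = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (p >= 0.5).astype(int)
```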

3.
《软件》2019,(8):79-83
To address the low classification accuracy on imbalanced data in software defect prediction, a new ensemble classification and prediction algorithm based on LogitBoost is proposed: the original dataset is balanced with SMOTE, random forests are then used as the weak classifiers, and finally LogitBoost combines the weak classifiers' results in a weighted manner. Validation on the NASA MDP benchmark datasets shows that the proposed algorithm is 0.1%-11% more accurate than on the datasets before balancing, and that after balancing it outperforms KNN by 0.9%, SVM by 0.4%, and random forest by 0.1% on average.

4.
《微型机与应用》2016,(13):51-54
To address the high dimensionality of telecom customer-churn datasets and the weak predictive performance of single classifiers, a churn prediction model is proposed that combines a two-step feature selection method, based on the Fisher ratio and a prediction-risk criterion, with a classifier ensemble, exploiting the complementary strengths of filter and wrapper feature selection and the higher predictive power of ensembles. First, features with strong discriminative power are extracted from the original feature set using the Fisher ratio; next, the prediction-risk criterion further selects the features that most affect the model's predictions; finally, ensembles based on averaged and weighted probability outputs are built to further improve churn prediction. Experimental results show that, compared with single-step feature extraction and single-classifier models, the method improves churn prediction performance.
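The Fisher-ratio filter step is standard and easy to sketch. A minimal version for binary labels, with the top-k cutoff as an illustrative assumption (the prediction-risk wrapper step that follows it in the paper is not shown):

```python
# Fisher-ratio filter: score each feature by between-class separation
# over within-class spread, then keep the k highest-scoring features.
import numpy as np

def fisher_ratio(X, y):
    """Per-feature Fisher ratio for a binary label y in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X1.mean(axis=0) - X0.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X0.var(axis=0) + 1e-12  # guard against /0
    return num / den

def select_top_k(X, y, k=10):
    scores = fisher_ratio(X, y)
    idx = np.argsort(scores)[::-1][:k]  # indices of the k best features
    return idx, X[:, idx]
```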

5.
Web spam refers to pages that cheat search engines through content spamming and link spamming between pages in order to boost their own search ranking, degrading the accuracy and relevance of search results. A Web spam detection method based on the Co-Training model is proposed: two mutually independent groups of page features, content-based statistical features and link features from the web graph, are used to build two independent base classifiers, and the Co-Training semi-supervised learning algorithm improves classifier quality with the help of large amounts of unlabeled data. Experiments on the WEBSPAM-UK2007 dataset show that the algorithm improves the performance of the SVM classifier.
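A minimal co-training loop in the spirit described: two views (here X1 for content features, X2 for link features), each view labeling the unlabeled samples it is most confident about. The logistic-regression base learner, round count and per-round batch size are assumptions; the paper works with SVM classifiers:

```python
# Co-training over two independent feature views. y holds labels in
# {0, 1} and -1 for samples that are still unlabeled.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(X1, X2, y, n_rounds=10, per_round=5):
    y = y.copy()
    for _ in range(n_rounds):
        labeled, unlabeled = y != -1, np.where(y == -1)[0]
        if len(unlabeled) == 0:
            break
        for Xv in (X1, X2):
            clf = LogisticRegression(max_iter=1000).fit(Xv[labeled], y[labeled])
            conf = clf.predict_proba(Xv[unlabeled]).max(axis=1)
            # Each view labels the samples it is most confident about.
            pick = unlabeled[np.argsort(conf)[::-1][:per_round]]
            y[pick] = clf.predict(Xv[pick])
            labeled, unlabeled = y != -1, np.where(y == -1)[0]
            if len(unlabeled) == 0:
                break
    return y
```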

6.
The synthetic minority oversampling technique (SMOTE) is a typical oversampling preprocessing method that effectively balances imbalanced data, but it introduces problems such as noise that hurt classification accuracy. To address this, ALSMOTE, an imbalanced-data classification method based on active-learning SMOTE, is proposed, drawing on the classification performance of active-learning SVMs. Because the active-learning SVM adopts a distance-based strategy of actively selecting the most informative samples, it can actively pick the valuable majority-class samples in the imbalanced data and discard the less valuable ones, improving efficiency and mitigating the problems introduced by SMOTE. First, SMOTE balances a small portion of the samples to obtain an initial classifier; then the active-learning strategy refines the classifier's accuracy. Experimental results show that the method effectively improves classification accuracy on imbalanced data.
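A sketch of the distance-based active selection the method combines with SMOTE: train an SVM on a small, SMOTE-balanced seed set, then repeatedly add the pool samples closest to the decision boundary. Batch size, round count and the RBF kernel are assumptions, not the paper's settings:

```python
# Active selection by distance to the SVM decision boundary.
# X_seed/y_seed: small SMOTE-balanced seed set containing both classes;
# X_pool/y_pool: remaining labeled training data to pick from.
import numpy as np
from sklearn.svm import SVC

def active_svm(X_seed, y_seed, X_pool, y_pool, n_rounds=5, batch=20):
    X, y = X_seed.copy(), y_seed.copy()
    pool = np.arange(len(X_pool))
    for _ in range(n_rounds):
        clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
        # Small |f(x)| = close to the hyperplane = most informative.
        dist = np.abs(clf.decision_function(X_pool[pool]))
        pick = pool[np.argsort(dist)[:batch]]
        X = np.vstack([X, X_pool[pick]])
        y = np.concatenate([y, y_pool[pick]])
        pool = np.setdiff1d(pool, pick)
        if len(pool) == 0:
            break
    return SVC(kernel="rbf", gamma="scale").fit(X, y)
```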

7.
田臣  周丽娟 《计算机应用》2019,39(6):1707-1712
To address the highly common problem of imbalanced datasets in credit assessment and the limited performance of single classifiers on imbalanced data, a credit assessment method combining majority-weighted minority oversampling (MWMOTE) with random forests (MWMOTE-RF) is proposed. First, during preprocessing, MWMOTE increases the number of minority-class samples; then the random forest algorithm, a supervised machine learning method, is applied to the more balanced preprocessed dataset for classification. Using the area under the receiver operating characteristic curve (AUC) as the evaluation metric, simulations on the German credit dataset from the UCI machine learning repository and on a company's auto-loan default dataset show that, on the same data, MWMOTE-RF improves AUC by 18% and 20% over plain random forest and naive Bayes, respectively. Compared with random forest combined with the synthetic minority oversampling technique (SMOTE) and with adaptive synthetic oversampling (ADASYN), MWMOTE-RF improves AUC by 1.47% and 2.34%, respectively, validating the effectiveness of the proposed method and its optimization of classifier performance.

8.
An Ensemble Learning Method for Imbalanced Data Based on Sample Weight Updating
Imbalanced data are ubiquitous across big-data and machine-learning applications such as medical diagnosis and anomaly detection. Researchers have proposed or adopted a variety of approaches, such as data sampling (e.g. SMOTE) or ensemble learning (e.g. EasyEnsemble). Oversampling may suffer from overfitting or low accuracy on boundary samples, while undersampling can lead to underfitting. This paper fuses the basic ideas of SMOTE, Bagging and Boosting and proposes the Rotation SMOTE algorithm, which indirectly increases the weight of minority-class samples by applying SMOTE to them according to the base classifier's predictions during the Boosting process. Borrowing the idea of Focal Loss, it further proposes FocalBoost, which directly optimizes AdaBoost's weight-update strategy from the base classifier's predictions. Experiments on multiple evaluation metrics over 11 imbalanced datasets from different application domains show that, compared with other imbalanced-data algorithms (including SMOTEBoost and EasyEnsemble), Rotation SMOTE achieves the highest recall on all datasets and the best or second-best G-mean and F1 score on most of them, while FocalBoost outperforms the original AdaBoost on 9 of the datasets.

9.
SMOTE can expand the minority class and improve its classifiability in imbalanced datasets, but its choice of boundary samples and of random interpolation values is blind. To address this, the traditional SMOTE oversampling algorithm is improved into SDRSMOTE, which considers the distribution of all samples in the imbalanced dataset and guides the synthesis of minority-class samples by fusing a support degree sd and an influence factor posFac. On the WEKA platform, SMOTE and SDRSMOTE were used to preprocess six imbalanced datasets, which were then classified with decision trees, AdaBoost, Bagging and naive Bayes, using F-value, G-mean and AUC as evaluation metrics. The experiments show that the datasets preprocessed with SDRSMOTE are classified better, demonstrating the algorithm's effectiveness.

10.
To solve the class-imbalance problem in software defect prediction, an improved SMOTE algorithm based on clustering the minority class is proposed. After K-means clustering of the minority-class training samples, the number of synthetic samples for each sample is computed from a key-feature weight and a distance-to-centroid weight, and oversampling is performed with the improved SMOTE. With CART decision trees as base classifiers, AdaBoost is trained on the balanced dataset to obtain the classification model CSMOTE-AdaBoost. Experiments on 7 NASA datasets verify the effectiveness of the key-feature and distance-to-centroid weights in the model; it outperforms traditional classification algorithms with better classification results.
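A sketch of the per-sample budget allocation: K-means over the minority class, with each sample's synthetic-sample count proportional to a distance-to-centroid weight. The paper's key-feature weight is omitted, and the direction of the weighting (farther from the centroid gets more synthetics) is an assumption:

```python
# Allocate a synthetic-sample budget across minority samples from
# their distance to their K-means cluster centroid.
import numpy as np
from sklearn.cluster import KMeans

def synthetic_budget(X_min, n_new, n_clusters=3, seed=0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_min)
    d = np.linalg.norm(X_min - km.cluster_centers_[km.labels_], axis=1)
    w = d / (d.sum() + 1e-12)            # farther from centroid -> more weight
    counts = np.floor(w * n_new).astype(int)
    counts[np.argmax(w)] += n_new - counts.sum()  # hand out the remainder
    return counts  # counts[i] = synthetic samples to create around X_min[i]
```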

11.
Most widely used pattern classification algorithms, such as Support Vector Machines (SVM), are sensitive to the presence of irrelevant or redundant features in the training data. Automatic feature selection algorithms aim to select a subset of the features present in a given dataset so that the accuracy of the subsequent classifier is maximized. Feature selection algorithms fall into two broad categories: those that do not take the subsequent classifier into account (filter approaches) and those that evaluate the subsequent classifier for each considered feature subset (wrapper approaches). Filter approaches are typically faster, but wrapper approaches deliver higher performance. In this paper, we present Predictive Forward Selection, an algorithm based on the widely used forward-selection wrapper approach. Using ideas from meta-learning, the number of required evaluations of the target classifier is reduced by exploiting experience gained during past feature selection runs on other datasets. We evaluated our approach on 59 real-world datasets with a focus on SVM as the target classifier, comparing it with state-of-the-art wrapper and filter approaches as well as one embedded method for SVM in terms of accuracy and run-time. The results show that the presented method reaches the accuracy of traditional wrapper approaches while requiring significantly fewer evaluations of the target algorithm, and achieves statistically significantly better results than the filter approaches and the embedded method.
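Plain greedy forward selection with a cross-validated SVM as the wrapper criterion is the baseline this paper accelerates; the meta-learning shortcut itself is not reproduced here. A minimal sketch:

```python
# Greedy forward selection: repeatedly add the single feature that
# most improves cross-validated SVM accuracy; stop when nothing helps.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_selection(X, y, max_features=None):
    n = X.shape[1]
    selected, best_score = [], -np.inf
    while len(selected) < (max_features or n):
        scores = {}
        for j in set(range(n)) - set(selected):
            cols = selected + [j]
            scores[j] = cross_val_score(SVC(), X[:, cols], y, cv=5).mean()
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:  # no candidate improves the score
            break
        selected.append(j_best)
        best_score = scores[j_best]
    return selected, best_score
```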

12.
Besides optimizing classifier predictive performance and addressing the curse of dimensionality, feature selection techniques help keep a classification model as simple as possible. In this paper, we present a wrapper feature selection approach based on the Bat Algorithm (BA) and the Optimum-Path Forest (OPF) classifier, in which feature selection is modeled as a binary optimization problem guided by BA, using OPF accuracy over a validation set as the fitness function to be maximized. Moreover, we present a methodology to better estimate the quality of the reduced feature set. Experiments conducted over six public datasets demonstrate that the proposed approach yields statistically significantly more compact feature sets and, in some cases, can indeed improve classification effectiveness.

13.
This paper deals with the problem of supervised wrapper-based feature subset selection in datasets with a very large number of attributes. The recent literature contains numerous references to hybrid selection algorithms: based on a filter ranking, they perform an incremental wrapper selection over that ranking. Though they work well, these methods still have problems: (1) depending on the complexity of the wrapper search method, the number of wrapper evaluations can still be too large; and (2) they rely on a univariate ranking that does not account for interactions between the variables already included in the selected subset and the remaining ones. Here we propose a new approach whose main goal is to drastically reduce the number of wrapper evaluations while maintaining good performance (e.g. accuracy and size of the obtained subset). To do this we propose an algorithm that iteratively alternates between filter ranking construction and wrapper feature subset selection (FSS): the FSS only uses the first block of ranked attributes, and the ranking method uses the currently selected subset to build a new ranking in which this knowledge is taken into account. The algorithm terminates when no new attribute is selected in the last call to the FSS algorithm. The main advantage of this approach is that only a few blocks of variables are analyzed, so the number of wrapper evaluations decreases drastically. The proposed method is tested over eleven high-dimensional datasets (2,400-46,000 variables) using different classifiers. The results show an impressive reduction in the number of wrapper evaluations without degrading the quality of the obtained subset.

14.
Feature selection is an active research area in machine learning today. Its main idea is to choose a subset of the available features by eliminating features with little or no predictive information, as well as redundant features that are strongly correlated. Many feature selection approaches exist, but most can only work with crisp data, and so far few can directly handle both crisp and low-quality (imprecise and uncertain) data. We therefore propose a new feature selection method that can handle both crisp and low-quality data. The proposed approach is based on a Fuzzy Random Forest and integrates filter and wrapper methods into a sequential search procedure with improved classification accuracy for the selected features. It consists of the following main steps: (1) scaling and discretization of the feature set, with feature pre-selection based on the discretization process (filter); (2) ranking of the pre-selected features using the fuzzy decision trees of a Fuzzy Random Forest ensemble; and (3) wrapper feature selection using a Fuzzy Random Forest ensemble based on cross-validation. The efficiency and effectiveness of this approach is demonstrated through several experiments on both high-dimensional and low-quality datasets. The approach performs well, not only in classification accuracy but also in the number of features selected, and behaves well both with high-dimensional datasets (microarray datasets) and with low-quality datasets.

15.
To improve the classification of the minority class by traditional support vector machines (SVM) on imbalanced datasets, an oversampling method based on an improved grey wolf optimizer (IGWO), called IGWOSMOTE, is proposed. First, the generation of the initial grey wolf population is modified so that each wolf encodes the SVM penalty factor and kernel parameter, a feature vector, and the minority-class sampling rate; the grey wolf optimization process then intelligently searches for the optimal combination of parameters and sampling rate, after which the data are resampled for the classifier to learn from and predict on. Classification experiments on 6 UCI datasets show that IGWOSMOTE+SVM improves minority-class accuracy by 6.3 percentage points and overall accuracy by 2.1 percentage points over the traditional SMOTE+SVM, making IGWOSMOTE a viable new oversampling classification method.

16.
A new approach to the design of a fuzzy-rule-based classifier capable of selecting informative features is discussed. Three basic stages of classifier construction are identified: feature selection, generation of the fuzzy rule base, and optimization of the rule-antecedent parameters. In the first stage, several feature subsets are generated with discrete harmonic search using the wrapper scheme. The classifier structure is then formed by a rule-base generation algorithm that uses extreme feature values. The optimal parameters of the fuzzy classifier are extracted from the training data using continuous harmonic search, and the Akaike information criterion is deployed to identify the best-performing classifiers. The classifier was tested on real-world KEEL and KDD Cup 1999 datasets, and the proposed algorithms were compared with other fuzzy classifiers tested on the same datasets. Experimental results show the efficiency of the proposed approach and demonstrate that highly accurate classifiers can be constructed using relatively few features.

17.
For feature selection in supervised classification, a wrapper feature selection method based on a quantum evolutionary algorithm is proposed. The paper first analyzes the tendency of existing subset evaluation methods to over-favor classification accuracy, and then proposes two subset evaluation methods based on a fixed threshold and on statistical tests. It next improves the evolutionary strategy of the quantum evolutionary algorithm by splitting evolution into two phases that use the individual best and the global best, respectively, as the population's evolution target. On this basis, a feature selection algorithm is designed following the general wrapper framework. Finally, 15 UCI datasets are used to verify the effectiveness of the subset evaluation methods and the evolutionary strategy, as well as the superiority of the new method over 6 other feature selection methods. The results show that the new method achieves similar or better classification accuracy on more than 80% of the datasets and selects smaller feature subsets on 86.67% of them.

18.
Occupancy information is essential to facilitate demand-driven operation of air-conditioning and mechanical ventilation (ACMV) systems. Environmental sensors are increasingly being explored as a cost-effective and non-intrusive means of obtaining this information, which requires the extraction and selection of useful features from the sensor data. In past works, feature selection has generally been implemented with filter-based approaches. In this work, we introduce wrapper and hybrid feature selection for better occupancy estimation. To achieve a fast computation time, our algorithms use a ranking-based incremental search, which is more efficient than the exhaustive search used in past works. For wrapper feature selection, we propose WRANK-ELM, which searches an ordered list of features using the extreme learning machine (ELM) classifier. For hybrid feature selection, we propose RIG-ELM, a filter-wrapper hybrid that uses the relative information gain (RIG) criterion for feature ranking and the ELM for the incremental search. We present experimental results from an office space with a multi-sensor network to validate the proposed algorithms.

19.
Feature selection can improve the accuracy, efficiency, applicability and understandability of a learning process, and for this reason many automatic feature selection methods have been developed. Some of them are based on searching for the features that allow the dataset to be considered consistent. In a search problem we usually evaluate the search states; in feature selection we measure candidate feature sets. This paper reviews the state of the art of consistency-based feature selection methods, identifying the measures used for feature sets. An in-depth study of these measures is conducted, including the definition of a new measure needed for completeness. We then empirically evaluate the measures, comparing them with the highly regarded wrapper approach. Consistency measures achieve results similar to those of the wrapper approach with much better efficiency.
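One widely used consistency measure, the inconsistency rate, illustrates what these methods compute; it is a representative example, not necessarily the paper's new measure. A minimal sketch for discrete-valued features:

```python
# Inconsistency rate of a feature subset: samples that agree on the
# selected features but disagree on the class are inconsistent; all
# but the majority class within each identical pattern are counted.
import numpy as np
from collections import Counter, defaultdict

def inconsistency_rate(X, y, subset):
    """X must be discrete-valued for pattern matching to make sense."""
    groups = defaultdict(list)
    for row, label in zip(X[:, subset], y):
        groups[tuple(row)].append(label)
    incons = sum(len(lbls) - Counter(lbls).most_common(1)[0][1]
                 for lbls in groups.values())
    return incons / len(y)  # lower is better; 0 = fully consistent
```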

20.
As bio-inspired evolutionary algorithms mature, feature selection methods based on evolutionary techniques and their hybrids keep emerging. For the feature selection problem on high-dimensional, small-sample security data, a wrapper feature selection method (MA-LSSVM) is designed that combines the memetic algorithm (MA) with the least squares support vector machine (LS-SVM). The method exploits the easy solvability of the LS-SVM to build the classifier and uses classification accuracy as the main component of the fitness function in the memetic search. Experiments show that MA-LSSVM can efficiently and stably find the features that contribute most to classification, reducing data dimensionality and improving classification efficiency.
