Similar literature (20 results)
1.
Cluster ensemble is a powerful method for improving both the robustness and the stability of unsupervised classification solutions. This paper introduced the group method of data handling (GMDH) into cluster ensembles and proposed a new cluster ensemble framework, named the cluster ensemble framework based on the group method of data handling (CE-GMDH). CE-GMDH consists of three components: an initial solution, a transfer function and an external criterion. Several CE-GMDH models can be built according to different types of transfer functions and external criteria. In this study, three novel models were proposed based on different transfer functions: a least squares approach, the cluster-based similarity partitioning algorithm, and semidefinite programming. The performance of CE-GMDH was compared across the different transfer functions, and against several state-of-the-art cluster ensemble algorithms and cluster ensemble frameworks on synthetic and real datasets. Experimental results demonstrate that CE-GMDH can improve the performance of the cluster ensemble algorithms used as its transfer functions through its unique modelling process. They also indicate that CE-GMDH achieves better or comparable results relative to the other cluster ensemble algorithms and frameworks.

2.
The adaptive random forest classifier equips each base classifier with its own warning detector and drift detector. During training, an instance often triggers several warning detectors at once, so multiple background trees are trained simultaneously, which makes the algorithm memory-hungry and slow. To address this problem, an improved adaptive random forest ensemble classification algorithm is proposed: the concept-drift detector is placed at the ensemble level, the drift detectors on the individual base trees are removed, and the number of background trees to train is determined from the ensemble's prediction accuracy. When the improved algorithm is used to classify relatively balanced data streams, it preserves classification performance while reducing running time and memory consumption compared with the original algorithm, and it adapts more quickly to concept drift appearing in the stream.
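A minimal sketch of the ensemble-level drift idea, not the paper's implementation: the stream is simulated as chunks with an artificial label flip, plain decision trees stand in for the Hoeffding trees of an adaptive random forest, and the accuracy-drop threshold, chunk size and the rule for deciding how many background trees to retrain are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# One fixed concept, with labels flipped halfway through to simulate drift.
X_all, y_all = make_classification(n_samples=6000, n_features=10, n_informative=5,
                                   random_state=42)
CHUNK, N_TREES, DRIFT_DROP = 300, 10, 0.10            # assumed hyper-parameters
rng = np.random.RandomState(0)

ensemble, acc_history = [], []
for t in range(len(X_all) // CHUNK):
    X = X_all[t * CHUNK:(t + 1) * CHUNK]
    y = y_all[t * CHUNK:(t + 1) * CHUNK]
    if t >= (len(X_all) // CHUNK) // 2:                # concept drift: swap the classes
        y = 1 - y
    if ensemble:                                       # test-then-train on each chunk
        votes = np.mean([m.predict(X) for m in ensemble], axis=0)
        acc = np.mean((votes >= 0.5).astype(int) == y)
        # Single drift check at the ensemble level instead of one detector per tree.
        if acc_history and max(acc_history) - acc > DRIFT_DROP:
            # Number of background (replacement) trees grows with the accuracy drop.
            n_bg = min(N_TREES, int(np.ceil((max(acc_history) - acc) / DRIFT_DROP)))
            for _ in range(n_bg):
                idx = rng.choice(len(X), size=len(X), replace=True)
                ensemble[rng.randint(len(ensemble))] = DecisionTreeClassifier(
                    max_depth=5, random_state=t).fit(X[idx], y[idx])
            print(f"chunk {t}: drift suspected, retrained {n_bg} background trees")
        acc_history.append(acc)
    while len(ensemble) < N_TREES:                     # grow the forest on the first chunks
        idx = rng.choice(len(X), size=len(X), replace=True)
        ensemble.append(DecisionTreeClassifier(
            max_depth=5, random_state=len(ensemble)).fit(X[idx], y[idx]))
```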

3.
To address the problem that a single classifier, unlike an ensemble, cannot achieve relatively high and stable accuracy, a classification model is proposed. The model integrates multiple random forests and combines them with thresholded majority voting; its implementation consists of three stages: building the ensemble classification model, preliminary prediction for each instance, and combination analysis. The model, implemented with MapReduce, is applied to P2P traffic identification and compared with a single random forest and with ensembles of other algorithms. Experiments show that the proposed model obtains better overall classification performance for P2P traffic identification, and it also offers a feasible reference approach for binary classification.
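A rough single-machine illustration of the three stages described above (building the ensemble of random forests, preliminary prediction, thresholded majority voting) with scikit-learn; the synthetic data, the number of forests and the 0.6 threshold are assumptions, and the MapReduce distribution is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for P2P traffic features: a binary classification problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 1: build the ensemble -- several random forests on bootstrap replicates.
rng = np.random.RandomState(0)
forests = []
for i in range(5):
    idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)
    forests.append(RandomForestClassifier(n_estimators=50,
                                          random_state=i).fit(X_tr[idx], y_tr[idx]))

# Stage 2: preliminary prediction -- collect each forest's vote.
votes = np.array([f.predict(X_te) for f in forests])   # shape (n_forests, n_samples)

# Stage 3: combination analysis -- thresholded majority vote. A sample is labelled
# positive only if the fraction of positive votes exceeds the threshold (assumed value).
THRESHOLD = 0.6
y_pred = (votes.mean(axis=0) >= THRESHOLD).astype(int)
print("accuracy:", np.mean(y_pred == y_te))
```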

4.
Information Fusion, 2005, 6(2): 143-151
Categorical data clustering (CDC) and cluster ensemble (CE) have long been considered separate research and application areas. The main focus of this paper is to investigate the commonalities between these two problems and to use these commonalities to create new clustering algorithms for categorical data through cross-fertilization between the two disjoint research fields. More precisely, we formally define the CDC problem as an optimization problem from the viewpoint of CE, and apply the CE approach to clustering categorical data. Experimental results on real datasets show that the CE-based clustering method is competitive with existing CDC algorithms with respect to clustering accuracy.

5.
Because prior information about the data distribution, the parameters and the class labels is unavailable, the correctness of some base clusterings cannot be guaranteed, which degrades the performance of the clustering ensemble; moreover, different base clustering decisions contribute differently to the ensemble, and treating them equally limits the improvement of the consensus result. To solve this problem, a selective K-means clustering ensemble algorithm based on random sampling (RS-KMCE) is proposed. The random sampling strategy keeps the selection of base clustering decisions from getting stuck in local minima, and a composite evaluation value defined from diversity and correctness helps the algorithm converge quickly to a better subset of base clusterings, improving ensemble performance. Experiments on two synthetic datasets and four UCI datasets show that the clustering performance of RS-KMCE is superior to the K-means algorithm, the K-means clustering ensemble (KMCE) and the Bagging-based selective K-means clustering ensemble (BA-KMCE).
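A hedged sketch of the selective k-means ensemble idea: base k-means clusterings built on random samples, a composite score mixing a correctness proxy (silhouette) with NMI-based diversity, selection of the top-scoring members, and a co-association consensus cut. The scoring weights, sample ratio and all other settings below are assumptions, not RS-KMCE's actual definitions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score, silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=0)
K, N_BASE, N_SELECT = 4, 12, 5                      # assumed settings

# 1. Base clusterings: k-means on random samples, extended to all points.
rng = np.random.RandomState(0)
base = []
for i in range(N_BASE):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    km = KMeans(n_clusters=K, n_init=5, random_state=i).fit(X[idx])
    base.append(km.predict(X))

# 2. Composite evaluation: correctness proxy (silhouette) plus diversity
#    (mean 1 - NMI against the other members); equal weights are an assumption.
quality = np.array([silhouette_score(X, p) for p in base])
diversity = np.array([np.mean([1 - normalized_mutual_info_score(p, q)
                               for j, q in enumerate(base) if j != i])
                      for i, p in enumerate(base)])
score = 0.5 * quality + 0.5 * diversity
selected = [base[i] for i in np.argsort(score)[-N_SELECT:]]

# 3. Consensus: average co-association matrix of the selected members,
#    cut by average-linkage hierarchical clustering into K groups.
co = np.mean([(p[:, None] == p[None, :]).astype(float) for p in selected], axis=0)
dist = 1.0 - co
np.fill_diagonal(dist, 0.0)
labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                  t=K, criterion="maxclust")
print(np.bincount(labels))
```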

6.
To improve the performance of clustering ensembles on categorical data, a new correlated random subspace clustering ensemble model is proposed. The model uses rough set theory to decompose the categorical attributes into correlated and uncorrelated subsets, randomly generates multiple correlated subspaces over the correlated attribute subset and clusters the categorical data in each of them, and obtains the final partition by combining several high-quality and diverse clustering results. In addition, the rough set reduct concept is applied to determine the number of attributes in each correlated subspace, which effectively avoids the influence of this parameter on the clustering results. Experiments on UCI datasets show that the new model outperforms existing models, demonstrating its effectiveness.

7.
Cluster ensemble first generates a large library of different clustering solutions and then combines them into a more accurate consensus clustering. It is commonly accepted that, for a cluster ensemble to work well, the member partitions should be different from each other while the quality of each partition remains at an acceptable level. Many different strategies have been used to generate the base partitions of a cluster ensemble. Similar to ensemble classification, many studies have focused on generating different partitions of the original dataset, i.e., clustering different subsets (e.g., obtained by random sampling) or clustering in different feature spaces (e.g., obtained by random projection). However, little attention has been paid to the diversity and quality of the partitions generated by these two approaches. In this paper, we propose a novel cluster generation method based on random sampling, which uses the nearest neighbor method to fill in the cluster labels of the samples missing from each random sample (abbreviated as RS-NN). We evaluate its performance in comparison with a k-means ensemble, a typical random projection method (Random Feature Subset, abbreviated as FS), and another random sampling method (Random Sampling based on Nearest Centroid, abbreviated as RS-NC). Experimental results indicate that the FS method always generates more diverse partitions, while the RS-NC method generates high-quality partitions. Our proposed method, RS-NN, generates base partitions with a good balance between quality and diversity and achieves significant improvement over the alternative methods. Furthermore, to introduce more diversity, we propose a dual random sampling method which combines the RS-NN and FS methods. The proposed method achieves higher diversity with good quality on most datasets.
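A minimal sketch of the RS-NN-style generation step with scikit-learn: cluster a random sample with k-means, then give the left-out points the label of their nearest sampled neighbour. The sampling ratio, k and the number of base partitions are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

def rs_nn_partition(X, k, sample_ratio=0.7, seed=0):
    """One RS-NN-style base partition: cluster a random sample, then give the
    left-out points the label of their nearest sampled neighbour."""
    rng = np.random.RandomState(seed)
    n = len(X)
    in_sample = rng.choice(n, size=int(sample_ratio * n), replace=False)
    out_sample = np.setdiff1d(np.arange(n), in_sample)
    labels = np.empty(n, dtype=int)
    labels[in_sample] = KMeans(n_clusters=k, n_init=5,
                               random_state=seed).fit_predict(X[in_sample])
    # Nearest-neighbour filling of the "missing" (unsampled) points.
    nn = KNeighborsClassifier(n_neighbors=1).fit(X[in_sample], labels[in_sample])
    labels[out_sample] = nn.predict(X[out_sample])
    return labels

library = [rs_nn_partition(X, k=3, seed=s) for s in range(10)]   # base-partition library
print(len(library), "partitions of", len(X), "objects")
```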

8.
This paper investigates the effects of confidence transformation in combining multiple classifiers using various combination rules. The combination methods were tested in handwritten digit recognition by combining varying classifier sets. The classifier outputs are transformed into confidence measures by combining three scaling functions (global normalization, Gaussian density modeling, and logistic regression) and three confidence types (linear, sigmoid, and evidence). The combination rules include fixed rules (sum rule, product rule, median rule, etc.) and trained rules (linear discriminants and weighted combination with various parameter estimation techniques). The experimental results show that confidence transformation benefits the combination performance of both fixed rules and trained rules. Trained rules mostly outperform fixed rules, especially when the classifier set contains weak classifiers. Among the trained rules, the support vector machine with a linear kernel (linear SVM) performs best, while weighted combination with optimized weights performs comparably well. I also attempted the joint optimization of confidence parameters and combination weights, but its performance was inferior to that of the cascaded confidence transformation-combination. This supports the cascaded strategy as an appropriate way to combine multiple classifiers.
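A small sketch of the cascaded "transform, then combine" idea: per-class decision scores of two linear SVMs are mapped to confidences by a logistic (sigmoid) scaling fitted on a held-out calibration split, and the transformed confidences are combined by the sum rule. The dataset, the classifier pool and the flattened one-vs-rest calibration are my own simplifications, not the paper's exact setup.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(digits.data, digits.target,
                                          test_size=0.3, random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X_tr, y_tr, test_size=0.3, random_state=0)

classifiers = [LinearSVC(C=c, max_iter=5000).fit(X_fit, y_fit) for c in (0.01, 1.0)]

def sigmoid_confidences(clf, X_cal, y_cal, X):
    """Map raw per-class decision scores to [0, 1] confidences with a logistic
    (sigmoid) scaling fitted on held-out calibration scores."""
    cal_raw = clf.decision_function(X_cal)                     # (n_cal, n_classes)
    is_target = (y_cal[:, None] == np.arange(cal_raw.shape[1])).astype(int)
    scaler = LogisticRegression().fit(cal_raw.ravel()[:, None], is_target.ravel())
    raw = clf.decision_function(X)
    return scaler.predict_proba(raw.ravel()[:, None])[:, 1].reshape(raw.shape)

# Sum-rule combination of the transformed confidences.
conf_sum = sum(sigmoid_confidences(c, X_cal, y_cal, X_te) for c in classifiers)
y_pred = conf_sum.argmax(axis=1)
print("sum-rule accuracy:", np.mean(y_pred == y_te))
```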

9.
徐森, 皋军, 徐秀芳, 花小朋, 徐静, 安晶. 《控制与决策》(Control and Decision), 2018, 33(12): 2208-2212
A bipartite graph model is introduced into the cluster ensemble problem: the graph models the object set and the hyperedge (cluster) set simultaneously, fully exploiting both the similarity information hidden among the objects and the attribute information provided by the hyperedges. A regularized spectral clustering algorithm is designed to solve the bipartite graph partitioning problem, and the K-means++ algorithm is run in the low-dimensional embedding space to partition the object set and obtain the final clustering result. Experiments on several benchmark datasets show that the proposed method not only obtains superior results but also runs efficiently.
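A hedged sketch of the bipartite-graph route (in the spirit of hybrid bipartite graph consensus, not necessarily the authors' exact algorithm): the base clusters become hyperedges, the object-by-hyperedge incidence matrix is degree-normalised with a small regularisation term, its leading singular vectors give a low-dimensional object embedding, and k-means++ (scikit-learn's default initialisation) partitions the embedded objects. The regularisation constant and the way the base partitions are generated are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
K = 3

# Base partitions (here simply k-means with different seeds).
partitions = [KMeans(n_clusters=K, n_init=5, random_state=s).fit_predict(X)
              for s in range(8)]

# Bipartite (object x hyperedge) incidence matrix: one column per base cluster.
blocks = [(p[:, None] == np.arange(p.max() + 1)[None, :]).astype(float)
          for p in partitions]
B = np.hstack(blocks)                                  # shape (n_objects, n_hyperedges)

# Normalised spectral embedding of the bipartite graph (regularisation tau is assumed).
tau = 1.0
d1 = B.sum(axis=1) + tau                               # object degrees
d2 = B.sum(axis=0) + tau                               # hyperedge degrees
B_norm = (B / np.sqrt(d1)[:, None]) / np.sqrt(d2)[None, :]
U, s, Vt = np.linalg.svd(B_norm, full_matrices=False)
embedding = U[:, :K]                                   # low-dimensional object embedding

# K-means++ on the embedded objects gives the consensus labels.
consensus = KMeans(n_clusters=K, init="k-means++", n_init=10,
                   random_state=0).fit_predict(embedding)
print(np.bincount(consensus))
```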

10.
Traditional similarity-based outlier detection algorithms perform poorly on high-dimensional, imbalanced datasets. To address this, a novel ensemble learning and random projection-based outlier detection (EROD) framework is proposed. The algorithm first combines several random projections to reduce the dimensionality of the high-dimensional data and increase data diversity; it then combines several different traditional outlier detectors into a heterogeneous ensemble to increase robustness; finally, the heterogeneous ensemble is trained on the projected data, the trained models are combined through two rounds of optimization to reduce generalization error, and the final outlier score of each object is output, with high-scoring objects judged to be outliers. Comparative experiments on four high-dimensional, imbalanced real-world datasets from different domains show that, compared with traditional outlier detection algorithms and ensemble-based outlier detection algorithms, the proposed algorithm improves AUC and precision@n by 3.6% and 14.45% on average, demonstrating EROD's advantage in handling anomalies in high-dimensional imbalanced data.
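A rough sketch of an EROD-style pipeline with scikit-learn: several Gaussian random projections, a heterogeneous pair of detectors (Isolation Forest and LOF) on each projected view, and rank-normalised scores averaged into a final outlier value. The two-stage optimised combination described in the abstract is replaced here by simple averaging, and every parameter below is an assumption.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.random_projection import GaussianRandomProjection

# Imbalanced, high-dimensional toy data: the rare class plays the role of the outliers.
X, y = make_classification(n_samples=1500, n_features=100, n_informative=20,
                           weights=[0.97, 0.03], random_state=0)

def outlier_scores(X_proj, seed):
    """Heterogeneous detectors on one projected view; higher score = more outlying."""
    iso = -IsolationForest(random_state=seed).fit(X_proj).score_samples(X_proj)
    lof = -LocalOutlierFactor(n_neighbors=20).fit(X_proj).negative_outlier_factor_
    # Rank-normalise so the two score scales become comparable before averaging.
    return (rankdata(iso) + rankdata(lof)) / (2.0 * len(X_proj))

# Ensemble over several random projections and the two detectors.
scores = np.mean([outlier_scores(GaussianRandomProjection(
                      n_components=15, random_state=s).fit_transform(X), s)
                  for s in range(5)], axis=0)

top = np.argsort(scores)[::-1][:50]                    # flag the 50 highest scores
print("fraction of true rare-class points among the flagged:", y[top].mean())
```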

11.
To deal with the imbalanced classification and "curse of dimensionality" problems in web spam detection, a binary classifier algorithm based on random forests (RF) and an ensemble of undersampled subsets is proposed. First, undersampling is used to split the majority class of the training set into several subsets, each of which is merged with the minority class to form multiple balanced training subsets; a random forest classifier is then trained on each balanced subset; finally, the random forest classifiers jointly classify the test set, and the final class of each test sample is decided by voting. Experiments on the WEBSPAM UK-2006 dataset show that this ensemble classifier outperforms the random forest algorithm and its Bagging and AdaBoost ensembles for web spam detection, improving accuracy, the F1 measure and the area under the ROC curve (AUC) by at least 14%, 13% and 11%, respectively. Compared with the winning team of the Web Spam Challenge 2007, the ensemble improves the F1 measure by at least 1% and achieves the best AUC result.
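A minimal sketch of the undersampling-plus-random-forest ensemble on synthetic imbalanced data: the majority class is split into minority-sized chunks, each chunk plus the full minority class trains one forest, and a plain majority vote gives the final label. The chunking rule, forest sizes and the data itself are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data standing in for the web-spam features.
X, y = make_classification(n_samples=5000, n_features=30, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)

minority_idx = np.where(y_tr == 1)[0]
majority_idx = np.where(y_tr == 0)[0]
rng = np.random.RandomState(0)
rng.shuffle(majority_idx)

# Undersample the majority class into chunks of roughly minority size; each chunk
# plus the whole minority class gives one balanced training subset and one forest.
forests = []
n_chunks = max(1, len(majority_idx) // len(minority_idx))
for chunk in np.array_split(majority_idx, n_chunks):
    idx = np.concatenate([chunk, minority_idx])
    forests.append(RandomForestClassifier(n_estimators=50,
                                          random_state=len(forests)).fit(X_tr[idx], y_tr[idx]))

# Plain majority vote over the forests decides the final label of each test sample.
votes = np.mean([f.predict(X_te) for f in forests], axis=0)
print("F1:", f1_score(y_te, (votes >= 0.5).astype(int)))
```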

12.
For entity recognition on high-dimensional data, a randomly combined ensemble classifier is proposed in order to exploit the rich information carried by high-dimensional features and to improve discriminative performance. A classification performance index is defined for the base classifiers, with classification accuracy and the size of the feature subset as the two objectives of base classifier design; an aggregation function converts them into a single-objective optimization problem. Ant colony optimization is used to solve for the base classifier model, with the maximal information coefficient, which measures feature relevance, serving as the heuristic information of the ant colony. The Tanimoto (谷元) distance is used to measure diversity and to select the base classifiers with the largest diversity to form the ensemble, whose decision function outputs the result by voting. Validation and comparison on standard datasets demonstrate the effectiveness of the method.

13.
For imbalanced classification on heterogeneous datasets, a classification method, the HVDM-Adaboost-KNN algorithm (heterogeneous value difference metric-Adaboost-KNN), is proposed from three angles: dataset resampling, the ensemble learning algorithm, and the construction of weak classifiers. The algorithm first balances the dataset with a clustering algorithm to obtain several balanced data subsets and builds multiple sub-classifiers; the heterogeneous value difference metric is used to compute the distance between two samples in a heterogeneous dataset, improving the classification performance of the KNN algorithm; the AdaBoost algorithm then iterates to obtain the final classifier. Eight UCI datasets are used to evaluate the algorithm's classification performance on imbalanced data. Experimental results show that, compared with AdaBoost and other algorithms, indicators such as the F1 value, AUC and G-mean are correspondingly improved on heterogeneous imbalanced datasets.

14.
Accurate credit risk assessment reduces the risk borne by financial institutions. To further improve the predictive accuracy of credit risk assessment models, an SVM-based ensemble learning model is applied to the credit risk assessment problem, and a hybrid ensemble strategy called RSA is proposed. RSA combines two popular strategies, the random subset method and AdaBoost, and increases the diversity of the member classifiers, thereby improving the predictive accuracy of the ensemble model. The model is applied to two public credit datasets, and the experimental results show that the RSA-based SVM ensemble can serve as an effective model for credit risk assessment.
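A hedged sketch of an RSA-style hybrid, written as a hand-rolled AdaBoost loop in which each boosting round also draws a random feature subset for its linear-SVM member. The number of rounds, the subset size and the toy data are assumptions, and this is not claimed to match the paper's exact formulation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy stand-in for a credit dataset (binary: good / bad applicant).
X, y = make_classification(n_samples=1000, n_features=24, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.RandomState(0)
n, d = X_tr.shape
w = np.ones(n) / n                                    # AdaBoost sample weights
members, alphas, subspaces = [], [], []

for t in range(10):                                   # 10 boosting rounds (assumed)
    feats = rng.choice(d, size=d // 2, replace=False) # random feature subset for this member
    clf = SVC(kernel="linear", C=1.0).fit(X_tr[:, feats], y_tr, sample_weight=w)
    pred = clf.predict(X_tr[:, feats])
    err = max(np.sum(w * (pred != y_tr)) / np.sum(w), 1e-10)
    if err >= 0.5:                                    # skip members no better than chance
        continue
    alpha = 0.5 * np.log((1 - err) / err)
    w *= np.exp(alpha * np.where(pred != y_tr, 1.0, -1.0))
    w /= w.sum()
    members.append(clf); alphas.append(alpha); subspaces.append(feats)

# Weighted vote of the subspace SVMs (labels mapped to {-1, +1} for the sign trick).
agg = sum(a * (2 * m.predict(X_te[:, f]) - 1) for m, a, f in zip(members, alphas, subspaces))
y_pred = (agg > 0).astype(int)
print("accuracy:", np.mean(y_pred == y_te))
```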

15.
The preprocessing stage of knowledge discovery projects is costly, normally taking between 50% and 80% of the total project time. It is in this stage that data in a relational database are transformed for applying a data mining technique. This stage is a complex task that demands from database designers strong interaction with experts who have broad knowledge of the application domain. Frameworks aiming to systematize this stage have significant limitations when applied to Credit Behavioral Scoring solutions. This paper proposes a framework based on the Model Driven Development approach to systematize this stage. The work has three main contributions: 1) improving the discriminant power of data mining techniques by constructing new input variables that embed temporal knowledge for the technique; 2) reducing the time of data transformation through automatic code generation; and 3) allowing artificial intelligence and statistics modelers to perform the data transformation without the help of database experts. To validate the proposed framework, two comparative studies were conducted. Experiments showed that the proposed framework delivers performance equivalent or superior to that of existing frameworks and reduces the time of data transformation with a confidence level of 95%.

16.
To detect corners effectively and speed up detection, a corner detection method based on the randomized Hough transform is proposed. The randomized Hough transform is used to estimate the line parameters; according to the characteristics of corners in Hough space, the peaks in the parameter space are mapped back via the inverse Hough transform to locate the intersections of lines in image space; to avoid false corners, intersections with no edge in their neighborhood are removed, yielding the correct corners. Experimental results show that, compared with the Harris and SUSAN algorithms, the method is more accurate, robust and stable, and its real-time performance is also somewhat improved.

17.
Training set resampling based ensemble design techniques are successfully used to reduce the classification errors of the base classifiers. Boosting is one such technique, in which each training set is obtained by drawing samples with replacement from the available training set according to a weighted distribution that is modified for each new classifier to be included in the ensemble. The weighted resampling results in a set of classifiers, each accurate in different parts of the input space determined mainly by the sample weights. In this study, a dynamic integration of boosting-based ensembles is proposed to take into account the heterogeneity of the input sets. An evidence-theoretic framework is developed for this purpose, taking into account the weights and distances of the neighboring training samples when both training and testing boosting-based ensembles. The effectiveness of the proposed technique is compared to the AdaBoost algorithm using three different base classifiers.

18.
To address the low accuracy of current computer-aided models for classifying pulmonary nodules as benign or malignant, an ensemble random forest method based on CT images is proposed. First, the pulmonary nodule region is segmented and its radiological feature vector is extracted and fed into several base classifiers; then the confidence of each base classifier is used to construct the classification loss function of the ensemble model and to compute each base classifier's weight; finally, the class probabilities output by the base classifiers are combined by weighted summation, and the class with the largest probability is taken as the predicted class. To validate the proposed model, three experimental schemes are tested, achieving accuracies of 96.41%, 91.36% and 95.82%, respectively. Comparison with existing benign/malignant classification models for pulmonary nodules shows that the ensemble random forest model effectively improves the accuracy of distinguishing benign from malignant nodules.
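A minimal sketch of the confidence-weighted combination step, using held-out accuracy as a simple stand-in for the confidence-based loss described above; the public breast-cancer dataset replaces the CT radiological features, and the number of base forests is an assumption.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Public binary medical dataset as a stand-in for the nodule feature vectors.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=0)

# Several random-forest base classifiers trained on bootstrap replicates.
rng = np.random.RandomState(0)
bases = []
for i in range(5):
    idx = rng.choice(len(X_fit), size=len(X_fit), replace=True)
    bases.append(RandomForestClassifier(n_estimators=100,
                                        random_state=i).fit(X_fit[idx], y_fit[idx]))

# Weight each base classifier by its held-out accuracy (a simple proxy for the
# confidence-based loss in the paper), then normalise the weights.
weights = np.array([b.score(X_val, y_val) for b in bases])
weights /= weights.sum()

# Weighted sum of class-probability outputs; the arg-max class is the decision.
proba = sum(w * b.predict_proba(X_te) for w, b in zip(weights, bases))
print("accuracy:", np.mean(proba.argmax(axis=1) == y_te))
```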

19.
Li Xiao, Li Kewen. Applied Intelligence, 2022, 52(6): 6477-6502
The Adaptive Boosting (AdaBoost) algorithm is a widely used ensemble learning algorithm, and it can effectively improve the classification performance of ordinary datasets when...

20.
In this paper, a measure of competence based on random classification (MCR) for classifier ensembles is presented. The measure dynamically selects (i.e., for each test example) a subset of classifiers from the ensemble that perform better than a random classifier. Therefore, weak (incompetent) classifiers that would adversely affect the performance of the classification system are eliminated. When all classifiers in the ensemble are evaluated as incompetent, the classification accuracy of the system can be increased by using the random classifier instead. Theoretical justification for using the measure with the majority voting rule is given. Two MCR-based systems were developed and their performance was compared against six multiple classifier systems using datasets taken from the UCI Machine Learning Repository and the Ludmila Kuncheva Collection. The developed systems typically had the highest classification accuracies regardless of the ensemble type used (homogeneous or heterogeneous).
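A hedged sketch of MCR-style dynamic selection: each pool member's competence for a test example is estimated as its accuracy on the example's nearest validation neighbours, members no better than a random classifier (accuracy 1/number of classes) are dropped, and a random guess stands in for the random classifier when no member is competent. The pool, the neighbourhood size and the dataset are assumptions, not the paper's definition of the measure.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, test_size=0.3, random_state=0)

pool = [DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_fit, y_fit),
        GaussianNB().fit(X_fit, y_fit),
        LogisticRegression(max_iter=2000).fit(X_fit, y_fit)]
val_preds = [c.predict(X_val) for c in pool]

n_classes = len(np.unique(y_tr))
p_random = 1.0 / n_classes                        # accuracy of a random classifier
nn = NearestNeighbors(n_neighbors=15).fit(X_val)  # local competence region (k assumed)

y_pred, rng = [], np.random.RandomState(0)
for x in X_te:
    _, neigh = nn.kneighbors(x.reshape(1, -1))
    neigh = neigh[0]
    # Local accuracy of each pool member in the neighbourhood of the test example.
    competence = [np.mean(vp[neigh] == y_val[neigh]) for vp in val_preds]
    chosen = [c for c, comp in zip(pool, competence) if comp > p_random]
    if not chosen:                                # everyone worse than random: random guess
        y_pred.append(rng.randint(n_classes))
        continue
    votes = [c.predict(x.reshape(1, -1))[0] for c in chosen]
    y_pred.append(np.bincount(votes, minlength=n_classes).argmax())

print("accuracy:", np.mean(np.array(y_pred) == y_te))
```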
