Similar literature
20 similar documents found (search time: 156 ms)
1.
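To address multi-label classification with insufficient labeled data, a new semi-supervised Boosting algorithm is proposed: a framework for semi-supervised Boosting multi-label classification is derived using functional gradient descent, and the conditional entropy of the unlabeled data is introduced into the classification model as a regularization term. Experimental results show that, for multi-label classification, the performance of the new semi-supervised Boosting algorithm improves significantly as the amount of unlabeled data increases, and that it outperforms the conventional supervised Boosting algorithm in all respects.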

2.
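Li Qiujie (李秋洁), Mao Yaobin (茅耀斌). Acta Automatica Sinica (《自动化学报》), 2013, 39(9): 1467-1475
The area under the receiver operating characteristic (ROC) curve (AUC) is commonly used to measure a classifier's overall performance across the whole range of class prior distributions. The original Boosting algorithm optimizes classification accuracy, which is not optimal under the AUC measure. An improved, AUC-optimizing Boosting algorithm is proposed: by introducing a data rebalancing operation into the original Boosting iterations, the optimization objective of the weak learning algorithm is shifted from accuracy to AUC. Experimental results show that the new algorithm achieves better performance than the original Boosting algorithm under the AUC measure.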
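
To make the AUC measure discussed in this abstract concrete, here is a minimal sketch of how AUC can be computed as a pairwise ranking statistic: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The function name `auc_score` and the tie-handling convention (a tie counts as half a correct pair) are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence


def auc_score(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Labels are 1 (positive) or 0 (negative). Ties count as half a correct
    pair; this is a common convention, assumed here for illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # tie: half credit
    return correct / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auc_score([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Because this quantity depends only on how positives rank against negatives, it does not change when the class proportions change, which is one reason a learner tuned for accuracy is not necessarily the best one under AUC.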

3.
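Zheng Yan (郑燕), Wang Yang (王杨), Hao Qingfeng (郝青峰), Gan Zhentao (甘振韬). Journal of Computer Applications (《计算机应用》), 2014, 34(5): 1336-1340
Conventional hypernetwork models are strongly biased when classifying imbalanced data: the recognition rate for the positive class is far higher than that for the negative class. To address this, a cost-sensitive hypernetwork Boosting ensemble algorithm is proposed. First, cost-sensitive learning is introduced into the hypernetwork model, yielding a cost-sensitive hypernetwork model. Then, so that the algorithm can adapt to the misclassification cost of the positive class, Boosting is used to build an ensemble of cost-sensitive hypernetworks. The cost-sensitive hypernetwork corrects the conventional hypernetwork's excessive bias toward the positive class on imbalanced data and improves classification accuracy for the negative class. Experimental results show that the cost-sensitive hypernetwork Boosting ensemble algorithm has an advantage in classifying imbalanced data.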

4.
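Strict online forms of the Boosting algorithms that use the exponential loss and the 0-1 loss are derived, and it is proved that these two online Boosting algorithms maximize the expectation of the sample margin and minimize its variance. By incrementally estimating the expectation and variance of the sample margin, Boosting can be applied to online learning problems without loss of classification accuracy. Experiments on UCI data sets show that the classification accuracy of exponential-loss online Boosting is close to that of batch adaptive Boosting (AdaBoost) and far better than that of conventional online Boosting. The 0-1-loss online Boosting algorithm minimizes the errors on positive and negative samples separately, is suited to imbalanced data problems, and has more stable classification performance on noisy data.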
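
The abstract relies on estimating the mean and variance of the sample margin incrementally, one sample at a time. The paper's own estimator is not given here, so the sketch below uses Welford's online algorithm as a stand-in. The class name `RunningMoments` is hypothetical, and the values fed to it are assumed to be margins computed elsewhere, for example y * f(x).

```python
class RunningMoments:
    """Incremental mean and population variance (Welford's algorithm).

    Assumption: each call to update() receives one margin value computed
    elsewhere; this class only tracks the running statistics.
    """

    def __init__(self) -> None:
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; defined as 0.0 before any value is seen.
        return self._m2 / self.n if self.n else 0.0


if __name__ == "__main__":
    moments = RunningMoments()
    for margin in [1.0, 2.0, 3.0, 4.0]:
        moments.update(margin)
    print(moments.mean, moments.variance)
```

Each update takes constant time and memory, so no past samples need to be stored, which is the property an online learner requires.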

5.
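Luo Jun (罗军), Kuang Hang (况夯). Journal of Computer Applications (《计算机应用》), 2008, 28(9): 2386-2388
A novel text classification method based on Boosting fuzzy classification is proposed. First, latent semantic indexing (LSI) is used to select text features. Then a Boosting algorithm is used to learn an ensemble of fuzzy classifiers: in each training iteration, the algorithm adjusts the distribution of the training samples and uses a genetic algorithm to generate classification rules. The weights of samples that a classification rule classifies correctly are reduced, so that newly generated rules focus on samples that are hard to classify. Experimental results show that the text classification algorithm achieves good classification performance.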
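
The reweighting step described in this abstract (lowering the weights of correctly classified samples so that later rules concentrate on hard ones) can be sketched with the standard AdaBoost update. This is an assumption: the abstract does not state the exact update used with the fuzzy rules, and the function name `reweight` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def reweight(weights: Sequence[float], correct: Sequence[bool]) -> Tuple[List[float], float]:
    """One AdaBoost-style reweighting step (standard form, assumed here).

    weights: current sample weights (any positive scale).
    correct: True where the current rule classified the sample correctly.
    Returns the normalized new weights and the rule's coefficient alpha.
    """
    total = sum(weights)
    err = sum(w for w, c in zip(weights, correct) if not c) / total
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")

    alpha = 0.5 * math.log((1.0 - err) / err)
    # Shrink weights of correctly classified samples, grow the others.
    raw = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    norm = sum(raw)
    return [w / norm for w in raw], alpha


if __name__ == "__main__":
    new_weights, alpha = reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
    print(new_weights, alpha)
```

With this update the misclassified samples end up carrying half of the total weight, so the next rule cannot do well without handling them.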

6.
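A Boosting-based TAN ensemble classifier   Total citations: 8, self-citations: 1, citations by others: 8
Boosting is an effective method for combining classifiers. It can improve the classification performance of unstable learning algorithms, but has little effect on stable ones. TAN (tree-augmented naive Bayes) is a tree-structured Bayesian network. The TAN classifier produced by the standard TAN learning algorithm is stable, so Boosting can hardly improve its performance. A new algorithm for constructing TANs, GTAN, is proposed, and the multiple TAN classifiers generated by GTAN are combined using the ensemble method Boosting-MultiTAN. Finally, the TAN ensemble classifier is compared experimentally with the standard TAN classifier. The results show that the Boosting-MultiTAN classifier achieves higher classification accuracy on most of the experimental data sets.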

7.
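Design of an automatic text classifier based on the Boosting algorithm   Total citations: 2, self-citations: 0, citations by others: 2
Boosting is currently a popular machine learning algorithm. Using an improved Boosting algorithm, Adaboost.MHKR, as the classification algorithm, an automatic text classifier is designed, and the evaluation method and results are presented. The evaluation shows that the classifier achieves good classification accuracy.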

8.
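To improve classification performance on imbalanced data, a Boosting algorithm for imbalanced data based on the optimization of evaluation metrics is proposed. The algorithm combines performance measures for imbalanced classification with Boosting, replacing the original misclassification rate with those measures: Boosting is optimized using the weighted positive-class and negative-class recall, the F-measure, and the G-mean, respectively. The Alpha value is computed from the chosen metric at each iteration, giving a weighted combination of weak learners, which is finally optimized with the Boosting algorithm. Experiments comparing it with weighted Boosting show that, on certain data sets, the algorithm improves AUC to some extent, lowers the error rate, and improves the F-measure and G-mean. This indicates that it emphasizes positive-class performance and improves the overall classification of imbalanced data.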
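
As a reference for the metrics named in this abstract, the sketch below computes per-class recall, the F-measure, and the G-mean from binary predictions, using the usual textbook definitions (F-measure as the harmonic mean of precision and positive-class recall, G-mean as the geometric mean of the two recalls). How the paper weights these metrics or converts them into the Alpha value is not specified here and is not reproduced. The function name `imbalance_metrics` is hypothetical.

```python
import math
from typing import Dict, Sequence


def imbalance_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Recall per class, F-measure and G-mean for labels in {0, 1}.

    Label 1 is treated as the positive class. Ratios with a zero
    denominator are reported as 0.0 (a convention assumed here).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

    recall_pos = tp / (tp + fn) if tp + fn else 0.0
    recall_neg = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0

    denom = precision + recall_pos
    f_measure = 2.0 * precision * recall_pos / denom if denom else 0.0
    g_mean = math.sqrt(recall_pos * recall_neg)

    return {
        "recall_pos": recall_pos,
        "recall_neg": recall_neg,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }


if __name__ == "__main__":
    print(imbalance_metrics([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

Unlike plain accuracy, both the F-measure and the G-mean drop sharply when one class is mostly misclassified, which is why they are preferred on imbalanced data.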

9.
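The classification of imbalanced data sets is a research hotspot in machine learning. To address the difficulty of classifying imbalanced data sets, in particular the poor recognition of the minority class caused by the imbalanced distribution, an improved algorithm, AdaBoost-SVM-OBMS, is proposed. It combines Boosting with an oversampling technique that generates new samples from misclassified samples. A support vector machine serves as the base classifier. In each Boosting iteration the misclassified samples are marked, and a certain number of new samples with the same class as the misclassified sample are generated at random between each misclassified sample and its nearest neighbors. The new samples are added to the original training set and the classifier is retrained, improving the recognition of hard-to-classify samples. AdaBoost-SVM-OBMS was compared with AdaBoost-SVM and APLSC on eight benchmark data sets under three evaluation metrics (AUC, F-value, and G-mean). The results demonstrate the effectiveness of AdaBoost-SVM-OBMS for classifying imbalanced data sets.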
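
The oversampling step in this abstract generates new points at random between a misclassified sample and its neighbors. One plausible reading is linear interpolation, as in SMOTE-style oversampling. That reading is an assumption, since the abstract does not give the exact generation rule. In the sketch below, the function name `synthesize_near` is hypothetical and the neighbors are assumed to have been found already.

```python
import random
from typing import List, Sequence


def synthesize_near(
    point: Sequence[float],
    neighbors: Sequence[Sequence[float]],
    n_new: int,
    rng: random.Random,
) -> List[List[float]]:
    """Generate n_new synthetic samples between `point` and its neighbors.

    Each synthetic sample lies on the segment joining `point` to one
    randomly chosen neighbor (linear interpolation, assumed here). The
    caller assigns the new samples the class label of `point`.
    """
    if not neighbors:
        raise ValueError("at least one neighbor is required")

    samples = []
    for _ in range(n_new):
        neighbor = rng.choice(neighbors)
        t = rng.random()  # position along the segment, in [0, 1)
        samples.append([a + t * (b - a) for a, b in zip(point, neighbor)])
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    for sample in synthesize_near([0.0, 0.0], [[1.0, 1.0], [2.0, 0.0]], 3, rng):
        print(sample)
```

Passing an explicit `random.Random` instance keeps the generation reproducible across Boosting iterations.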

10.
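Gu Ping (古平), Zhu Qingsheng (朱庆生). Computer Science (《计算机科学》), 2006, 33(4): 159-161
Both Boosting and Bagging need to cache large amounts of data when learning a classifier ensemble from a continuous sample stream, which is infeasible for large-volume sample sets. This paper proposes BEPOL, an online learning algorithm based on a Bayesian ensemble. While preserving the weighted-sampling idea of Boosting, it needs only a single scan of the sample set to update the Bayesian ensemble online. To address the long training time and poor member correlation of serial training, the algorithm adopts parallel learning: each Bayesian component is mapped onto a parallel computing structure, improving the efficiency of ensemble learning. Experiments on UCI data sets show that BEPOL achieves classification performance close to that of batch learning algorithms with lower time overhead, which makes it particularly effective for applications with time and space constraints, such as large or continuous data sets.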

11.
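A single-pass frequent pattern tree structure   Total citations: 1, self-citations: 0, citations by others: 1
Tan Jun (谭军), Bu Yingyong (卜英勇), Yang Bo (杨勃). Computer Engineering (《计算机工程》), 2010, 36(14): 32-33
Because the frequent pattern growth algorithm cannot cope with the unbounded, flowing nature of data streams, a novel variant of the FP-tree, the SP-tree, is proposed. It can hold all of the database information after a single scan. To give the SP-tree compression performance as good as that of the FP-tree, an effective method for dynamically restructuring the tree, called the width sorting method, is presented. It restructures the tree dynamically, branch by branch, during mining and eventually yields a prefix tree ordered by descending frequency. Experimental results show that the SP-tree compresses better than other single-pass prefix tree structures.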

12.

Cancer classification is one of the main steps in the patient healing process, which drives modern clinical researchers to use advanced bioinformatics methods for it. Cancer classification is usually performed using gene expression data obtained from microarray experiments together with advanced machine learning methods. A microarray experiment generates a huge amount of data, and processing it with machine learning methods is a major challenge. In this study, a two-step classification paradigm that merges genetic algorithm feature selection with machine learning classifiers is used. The genetic algorithm is built in the spirit of MapReduce programming, which makes it highly scalable on a Hadoop cluster. To improve its performance, the proposed algorithm is extended into a parallel algorithm that processes microarray data in a distributed manner using the Hadoop MapReduce framework. The algorithm was tested on eleven GEMS data sets (9 tumors, 11 tumors, 14 tumors, brain tumor 1, lung cancer, brain tumor 2, leukemia 1, DLBCL, leukemia 2, SRBCT, and prostate tumor), and its accuracy reached 100% with fewer than 25 selected features. The proposed cloud computing-based MapReduce parallel genetic algorithm performed well on gene expression data. In addition, the scalability of the suggested algorithm is unlimited because of the underlying Hadoop MapReduce platform. The presented results indicate that the proposed method can be implemented effectively for real-world microarray data in a cloud environment, and that the Hadoop MapReduce framework yields a substantial decrease in computation time.


13.
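Jia Heming (贾鹤鸣), Li Yao (李瑶), Sun Kangjian (孙康健). Acta Automatica Sinica (《自动化学报》), 2022, 48(6): 1601-1615
To address the low classification accuracy of conventional support vector machine methods in data classification, support vector machine classification is combined with simultaneous feature selection, and an intelligent optimization algorithm is used to optimize the algorithm's parameters. First, the genetic algorithm (GA) and the sooty tern optimization algorithm (STOA) are hybridized. The average fitness value is evaluated: when an individual's fitness value is below the average, the genetic algorithm is applied to strengthen its local search, and otherwise the standard sooty tern optimization process is carried out. The support vector machine kernel function and the feature selection objective are optimized jointly, and the improved STOA-GA is used to find the best-fitting solution and obtain the classification result for the selected features. Second, data classification is studied on 16 classic UCI data sets and a real breast cancer data set, with comparisons in terms of best fitness value, number of selected features, specificity, sensitivity, and running time. The experimental results show that the algorithm processes data more accurately, avoids interference from redundant features, and has broader prospects for engineering applications in data mining.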

14.
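An efficient algorithm for mining frequent patterns in offline data streams   Total citations: 1, self-citations: 0, citations by others: 1
Frequent pattern mining over data streams is one of the current research hotspots in data mining. The continuous, unordered, unbounded, and real-time nature of data streams places higher demands on the time and space performance of mining algorithms. Oscillation of pattern frequencies in a data stream forces existing algorithms to maintain their synopsis data structures frequently, which considerably harms both their time and space efficiency. A synopsis data structure with good space performance, the SP-tree, is constructed, and an oscillation factor χ is defined to quantify oscillation information. An efficient offline algorithm for mining frequent patterns in data streams, SPDS, is proposed, which effectively reduces the impact of data oscillation on performance. When processing newly arrived data, the algorithm adopts a divide-and-conquer separate-and-map strategy that further improves time efficiency. It also improves the counting accuracy of some patterns in query results.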

15.
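Cluster analysis of cancer gene expression data can provide a basis for early cancer diagnosis and accurate cancer subtyping. Based on the characteristics of cancer gene expression data, a biclustering algorithm called OMB (Override Matrix Bicluster) is proposed. OMB searches the rows and the columns of the gene expression matrix for those below a threshold and uses a deletion-and-addition algorithm to produce a submatrix. It then builds an override matrix of the same size as the gene expression matrix to mark the position of the submatrix produced in the previous iteration. Within the marked matrix, a greedy iterative search is repeated to find K clustering results. Experiments in Matlab show that OMB clusters cancer gene expression data with overlapping structure more effectively.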

16.
Due to recent interest in the analysis of DNA microarray data, new methods have been considered and developed in the area of statistical classification. In particular, the goal is to classify a sample into the relevant diagnostic category according to the gene expression profile of existing data. However, when classifying outcomes into certain cancer types, it is often the case that some genes are not important, while some genes are more important than others. A novel algorithm is presented for selecting such relevant genes, referred to as marker genes, for cancer classification. This algorithm is based on the Support Vector Machine (SVM) and Supervised Weighted Kernel Clustering (SWKC). To investigate its performance, the method was applied to a simulated data set and several real data sets. For comparison, other well-known methods such as Prediction Analysis of Microarrays (PAM), Support Vector Machine-Recursive Feature Elimination (SVM-RFE), and a Structured Polychotomous Machine (SPM) were considered. The experimental results indicate that the proposed SWKC/SVM algorithm is conceptually much simpler and performs more efficiently than other existing methods for identifying marker genes for cancer classification. Furthermore, the SWKC/SVM algorithm has the advantage of requiring much less computing time than the other existing methods.

17.
Accurate diagnosis of Lung Cancer Disease (LCD) is essential for providing timely treatment to lung cancer patients. The Artificial Neural Network (ANN) is a recently proposed Machine Learning (ML) algorithm that is used on both large-scale and small datasets. In this paper, an ensemble of Weight Optimized Neural Network with Maximum Likelihood Boosting (WONN-MLB) for LCD in big data is analyzed. The proposed method is split into two stages: feature selection and ensemble classification. In the first stage, the essential attributes are selected with an integrated Newton-Raphson Maximum Likelihood and Minimum Redundancy (MLMR) preprocessing model to minimize the classification time. In the second stage, the Boosted Weighted Optimized Neural Network Ensemble Classification algorithm is applied to classify patients using the selected attributes, which improves the accuracy of cancer diagnosis and also minimizes the false positive rate. Experimental results demonstrate that the proposed approach achieves a better false positive rate, higher prediction accuracy, and reduced delay in comparison to conventional techniques.

18.
Gene expression data are expected to be of significant help in the development of efficient cancer diagnosis and classification platforms. One problem arising from these data is how to select a small subset of genes from thousands of genes and only a few samples, which are inherently noisy. This research aims to select a small subset of informative genes from the gene expression data that will maximize the classification accuracy. A model for gene selection and classification has been developed using a filter approach and an improved hybrid of the genetic algorithm and a support vector machine classifier. We show that the proposed model achieves useful classification accuracy for cancer classification on one widely used gene expression benchmark data set.

19.
The artificial immune recognition system (AIRS) has been shown to be an efficient approach to a variety of problems, such as machine learning benchmark problems and medical classification problems. In this study, the resource allocation mechanism of AIRS was replaced with a new one based on fuzzy logic. The new system, named Fuzzy-AIRS, was used as a classifier for three well-known medical data sets: the Wisconsin breast cancer data set (WBCD), the Pima Indians diabetes data set, and the ECG arrhythmia data set. The performance of the Fuzzy-AIRS algorithm was evaluated in terms of classification accuracy, sensitivity and specificity, the confusion matrix, computation time, and receiver operating characteristic curves. The AIRS and Fuzzy-AIRS algorithms were also compared with respect to the amount of resources required to execute them. The highest classification accuracy obtained by AIRS and Fuzzy-AIRS using 10-fold cross-validation was, respectively, 98.53% and 99.00% on WBCD; 79.22% and 84.42% on the Pima Indians diabetes data set; and 100% and 92.86% on the ECG arrhythmia data set. These results show that Fuzzy-AIRS can be used as an effective classifier for medical problems.

20.
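Xiang Wei (向伟), Wang Xinwei (王新维). Computer Science (《计算机科学》), 2020, 47(5): 103-109
Imbalanced data classification is an important data classification problem. Conventional classification algorithms perform poorly on the smaller classes in imbalanced data. To address this, an imbalanced data classification algorithm based on a multi-class neighborhood three-way decision model is proposed. First, conventional three-way decisions are generalized to mixed data and multiple classes, yielding a multi-class neighborhood three-way decision model for mixed data. Then, a method for setting an adaptive cost function in this model is given, and the imbalanced data classification algorithm is built on it. Simulation results show that the proposed algorithm achieves better classification performance on imbalanced data.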
