首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
现有的概念漂移检测方法大多集中于单标签数据流,难以满足多标签数据流概念漂移检测的需要,因此文中提出基于分层校验的多标签数据流概念漂移检测算法.算法包括检验层和校验层,检验层通过检测数据分布变化判断是否发生概念漂移,校验层通过判断标签混淆矩阵的变化程度验证是否真正发生概念漂移.在真实多标签数据集和合成多标签数据集上的实验表明,文中算法表现更优,可以有效检测概念漂移,提升分类性能.  相似文献   

2.
由于传统的概念漂移检测研究主要针对单标签数据流,对现实中常见的多标签数据流却缺乏足够的关注,多标签数据流概念漂移检测问题有待进一步的研究。因此,通过分析多标签数据流中存在的特殊依赖关系,提出了一种基于概率相关性的多标签数据流概念漂移检测算法。其基本思想是从概念漂移的产生原因出发,利用概率相关性近似描述数据分布来监测新旧数据分布变化,判断概念漂移是否发生。实验结果表明,提出的算法能够比较快速、准确地检测到概念漂移,并在多标签概念漂移数据流分类问题上取得了预期的学习效果。  相似文献   

3.
Most data-mining algorithms assume static behavior of the incoming data. In the real world, the situation is different and most continuously collected data streams are generated by dynamic processes, which may change over time, in some cases even drastically. The change in the underlying concept, also known as concept drift, causes the data-mining model generated from past examples to become less accurate and relevant for classifying the current data. Most online learning algorithms deal with concept drift by generating a new model every time a concept drift is detected. On one hand, this solution ensures accurate and relevant models at all times, thus implying an increase in the classification accuracy. On the other hand, this approach suffers from a major drawback, which is the high computational cost of generating new models. The problem is getting worse when a concept drift is detected more frequently and, hence, a compromise in terms of computational effort and accuracy is needed. This work describes a series of incremental algorithms that are shown empirically to produce more accurate classification models than the batch algorithms in the presence of a concept drift while being computationally cheaper than existing incremental methods. The proposed incremental algorithms are based on an advanced decision-tree learning methodology called “Info-Fuzzy Network” (IFN), which is capable to induce compact and accurate classification models. The algorithms are evaluated on real-world streams of traffic and intrusion-detection data.  相似文献   

4.
由于在信用卡欺诈分析等领域的广泛应用,学者们开始关注概念漂移数据流分类问题.现有算法通常假设数据一旦分类后类标已知,利用所有待分类实例的真实类别来检测数据流是否发生概念漂移以及调整分类模型.然而,由于标记实例需要耗费大量的时间和精力,该解决方案在实际应用中无法实现.据此,提出一种基于KNNModel和增量贝叶斯的概念漂移检测算法KnnM-IB.新算法在具有KNNModel算法分类被模型簇覆盖的实例分类精度高、速度快优点的同时,利用增量贝叶斯算法对难处理样本进行分类,从而保证了分类效果.算法同时利用可变滑动窗口大小的变化以及主动学习标记的少量样本进行概念漂移检测.当数据流稳定时,半监督学习被用于扩大标记实例的数量以对模型进行更新,因而更符合实际应用的要求.实验结果表明,该方法能够在对数据流进行有效分类的同时检测数据流概念漂移及相应地更新模型.  相似文献   

5.

In multi-label classification problems, every instance is associated with multiple labels at the same time. Binary classification, multi-class classification and ordinal regression problems can be seen as unique cases of multi-label classification where each instance is assigned only one label. Text classification is the main application area of multi-label classification techniques. However, relevant works are found in areas like bioinformatics, medical diagnosis, scene classification and music categorization. There are two approaches to do multi-label classification: The first is an algorithm-independent approach or problem transformation in which multi-label problem is dealt by transforming the original problem into a set of single-label problems, and the second approach is algorithm adaptation, where specific algorithms have been proposed to solve multi-label classification problem. Through our work, we not only investigate various research works that have been conducted under algorithm adaptation for multi-label classification but also perform comparative study of two proposed algorithms. The first proposed algorithm is named as fuzzy PSO-based ML-RBF, which is the hybridization of fuzzy PSO and ML-RBF. The second proposed algorithm is named as FSVD-MLRBF that hybridizes fuzzy c-means clustering along with singular value decomposition. Both the proposed algorithms are applied to real-world datasets, i.e., yeast and scene dataset. The experimental results show that both the proposed algorithms meet or beat ML-RBF and ML-KNN when applied on the test datasets.

  相似文献   

6.
目前数据流分类算法大多是基于类分布这一理想状态,然而在真实数据流环境中数据分布往往是不均衡的,并且数据流中往往伴随着概念漂移。针对数据流中的不均衡问题和概念漂移问题,提出了一种新的基于集成学习的不均衡数据流分类算法。首先为了解决数据流的不均衡问题,在训练模型前加入混合采样方法平衡数据集,然后采用基分类器加权和淘汰策略处理概念漂移问题,从而提高分类器的分类性能。最后与经典数据流分类算法在人工数据集和真实数据集上进行对比实验,实验结果表明,本文提出的算法在含有概念漂移和不均衡的数据流环境中,其整体分类性能优于其他算法的。  相似文献   

7.
IKnnM-DHecoc:一种解决概念漂移问题的方法   总被引:2,自引:0,他引:2  
随着数据流挖掘的应用日趋广泛,带概念漂移的数据流分类问题已成为一项重要且充满挑战的工作.根据带概念漂移的数据流的特点,一个有效的学习器必须能跟踪并快速适应这种变化.一种基于增量KnnModel的动态层次编码算法被提出用于解决数据流的概念漂移问题.在将数据流划分为数据块后,根据增量KnnModel算法对每块的预学习结果构建并更新类别层次树、层次编码,用可增量学习的分类算法对照编码划分进行学习,并生成备选分类器集.最后依据活跃度对结点进行剪枝处理以减少计算代价.在预测阶段,利用增量KnnModel算法和动态层次纠错输出编码算法的各自优势进行联合预测.实验结果表明:基于增量KnnModel算法的动态层次纠错输出编码算法不但能够提高模型学习的动态性和分类的正确性,而且还能够快速适应概念漂移的情况.  相似文献   

8.
现有概念漂移处理算法在检测到概念漂移发生后,通常需要在新到概念上重新训练分类器,同时“遗忘”以往训练的分类器。在概念漂移发生初期,由于能够获取到的属于新到概念的样本较少,导致新建的分类器在短时间内无法得到充分训练,分类性能通常较差。进一步,现有的基于在线迁移学习的数据流分类算法仅能使用单个分类器的知识辅助新到概念进行学习,在历史概念与新到概念相似性较差时,分类模型的分类准确率不理想。针对以上问题,文中提出一种能够利用多个历史分类器知识的数据流分类算法——CMOL。CMOL算法采取分类器权重动态调节机制,根据分类器的权重对分类器池进行更新,使得分类器池能够尽可能地包含更多的概念。实验表明,相较于其他相关算法,CMOL算法能够在概念漂移发生时更快地适应新到概念,显示出更高的分类准确率。  相似文献   

9.
Mining data streams is the process of extracting information from non-stopping, rapidly flowing data records to provide knowledge that is reliable and timely. Streaming data algorithms need to be one pass and operate under strict limitations of memory and response time. In addition, the classification of streaming data requires learning in an environment where the data characteristics might change constantly. Many of the classification algorithms presented in literature assume a 100 % labeling rate, which is impractical and expensive when data records are rapidly flowing in. In this paper, a new incremental grid density based learning framework, the GC3 framework, is proposed to perform classification of streaming data with concept drift and limited labeling. The proposed framework uses grid density clustering to detect changes in the input data space. It maintains an evolving ensemble of classifiers to learn and adapt to the model changes over time. The framework also uses a uniform grid density sampling mechanism to obtain a uniform subset of samples for better classification performance with a lower labeling rate. The entire framework is designed to be one-pass, incremental and work with limited memory to perform any-time classification on demand. Experimental comparison with state of the art concept drift handling systems demonstrate the GC3 frameworks ability to provide high classification performance, using fewer models in the ensemble and with only 4-6 % of the samples labeled. The results show that the GC3 framework is effective and attractive for use in real world data stream classification applications.  相似文献   

10.
多标记数据有很多的冗余特征和数据,为了解决多标记数据中冗余和无关特征,提高多标记学习算法的泛化能力。提出一个基于模拟退火的卷积式特征选择方法——SAML(simulated annealing based feature selection for multi-label data),已有的算法只是使用了遗传算法来进行优化,新算法采用模拟退火来寻找最优子集,其效果在已有的工作中表现出比前者遗传算法更好的效果。在用于公开评测的Yahoo网页分类数据集上的实验结果表明,SAML算法的性能优于新近提出的一些流行的多标记特征选择方法。  相似文献   

11.
针对实际应用中数据的批量到达,以及系统的存储压力和学习效率低等问题,提出一种基于信念修正思想的SVR增量学习算法。首先从历史样本信息中提取信念集,根据信念集和新增数据的特点选择相应的信念集建立支持向量回归模型并进行预测;然后对信念集进行修正,调整当前认知状态,使该算法对在线和批处理增量学习都有很好的适应性。在标准数据集上的测试验证了算法的良好性能;在某机场噪声实测数据上的对比实验也表明,该算法的性能明显优于传统学习算法和一般增量学习算法。  相似文献   

12.
Incremental learning has been used extensively for data stream classification. Most attention on the data stream classification paid on non-evolutionary methods. In this paper, we introduce new incremental learning algorithms based on harmony search. We first propose a new classification algorithm for the classification of batch data called harmony-based classifier and then give its incremental version for classification of data streams called incremental harmony-based classifier. Finally, we improve it to reduce its computational overhead in absence of drifts and increase its robustness in presence of noise. This improved version is called improved incremental harmony-based classifier. The proposed methods are evaluated on some real world and synthetic data sets. Experimental results show that the proposed batch classifier outperforms some batch classifiers and also the proposed incremental methods can effectively address the issues usually encountered in the data stream environments. Improved incremental harmony-based classifier has significantly better speed and accuracy on capturing concept drifts than the non-incremental harmony based method and its accuracy is comparable to non-evolutionary algorithms. The experimental results also show the robustness of improved incremental harmony-based classifier.  相似文献   

13.
袁泉  郭江帆 《计算机应用》2018,38(6):1591-1595
针对数据流中概念漂移和噪声问题,提出一种新型的增量式学习的数据流集成分类算法。首先,引入噪声过滤机制过滤噪声;然后,引入假设检验方法对概念漂移进行检测,以增量式C4.5决策树为基分类器构建加权集成模型;最后,实现增量式学习实例并随之动态更新分类模型。实验结果表明,该集成分类器对概念漂移的检测精度达到95%~97%,对数据流抗噪性保持在90%以上。该算法分类精度较高,且在检测概念漂移的准确性和抗噪性方面有较好的表现。  相似文献   

14.
Yan  Zhang  Hongle  Du  Gang  Ke  Lin  Zhang  Chen  Yeh-Cheng 《The Journal of supercomputing》2022,78(4):5394-5419

Data stream mining is one of the hot topics in data mining. Most existing algorithms assume that data stream with concept drift is balanced. However, in real-world, the data streams are imbalanced with concept drift. The learning algorithm will be more complex for the imbalanced data stream with concept drift. In online learning algorithm, the oversampling method is used to select a small number of samples from the previous data block through a certain strategy and add them into the current data block to amplify the current minority class. However, in this method, the number of stored samples, the method of oversampling and the weight calculation of base-classifier all affect the classification performance of ensemble classifier. This paper proposes a dynamic weighted selective ensemble (DWSE) learning algorithm for imbalanced data stream with concept drift. On the one hand, through resampling the minority samples in previous data block, the minority samples of the current data block can be amplified, and the information in the previous data block can be absorbed into building a classifier to reduce the impact of concept drift. The calculation method of information content of every sample is defined, and the resampling method and updating method of the minority samples are given in this paper. On the other hand, because of concept drift, the performance of the base-classifier will be degraded, and the decay factor is usually used to describe the performance degradation of base-classifier. However, the static decay factor cannot accurately describe the performance degradation of the base-classifier with the concept drift. The calculation method of dynamic decay factor of the base-classifier is defined in DWSE algorithm to select sub-classifiers to eliminate according to the attenuation situation, which makes the algorithm better deal with concept drift. Compared with other algorithms, the results show that the DWSE algorithm has better classification performance for majority class samples and minority samples.

  相似文献   

15.
互联网环境日新月异,使得网络数据流中存在概念漂移,对数据流的分类也由传统的静态分类变为动态分类,而如何对概念漂移进行检测是动态分类的关键.本文提出一种基于概念漂移检测的网络数据流自适应分类算法,通过比较滑动窗口中数据与历史数据的分布差异来检测概念漂移,然后将窗口中数据过采样来减少样本间的不均衡性,最后将处理后的数据集输...  相似文献   

16.
基于自适应快速决策树的不确定数据流概念漂移分类算法   总被引:1,自引:0,他引:1  

由于不确定数据流中一般隐藏着概念漂移问题, 对其进行有效分类存在着很多困难. 为此, 提出一种基于自适应快速决策树的算法. 该算法基于一般决策树算法的原理, 以自适应学习规则计算信息增益, 以无标记情景学习拆分原理检测不确定数据流中的不确定数值属性, 通过自适应快速决策树节点的拆分方法将不确定数值属性转化为不确定分类属性, 以实现对不确定数据流的有效分类, 进而有效检测到其中隐含的概念漂移现象. 仿真结果验证了所提出方法的可靠性.

  相似文献   

17.
多标签代价敏感分类集成学习算法   总被引:12,自引:2,他引:10  
付忠良 《自动化学报》2014,40(6):1075-1085
尽管多标签分类问题可以转换成一般多分类问题解决,但多标签代价敏感分类问题却很难转换成多类代价敏感分类问题.通过对多分类代价敏感学习算法扩展为多标签代价敏感学习算法时遇到的一些问题进行分析,提出了一种多标签代价敏感分类集成学习算法.算法的平均错分代价为误检标签代价和漏检标签代价之和,算法的流程类似于自适应提升(Adaptive boosting,AdaBoost)算法,其可以自动学习多个弱分类器来组合成强分类器,强分类器的平均错分代价将随着弱分类器增加而逐渐降低.详细分析了多标签代价敏感分类集成学习算法和多类代价敏感AdaBoost算法的区别,包括输出标签的依据和错分代价的含义.不同于通常的多类代价敏感分类问题,多标签代价敏感分类问题的错分代价要受到一定的限制,详细分析并给出了具体的限制条件.简化该算法得到了一种多标签AdaBoost算法和一种多类代价敏感AdaBoost算法.理论分析和实验结果均表明提出的多标签代价敏感分类集成学习算法是有效的,该算法能实现平均错分代价的最小化.特别地,对于不同类错分代价相差较大的多分类问题,该算法的效果明显好于已有的多类代价敏感AdaBoost算法.  相似文献   

18.
传统基于k近邻的多标签学习算法,在寻找近邻度量样本间的距离时,对所有特征给予同等的重要度.这些算法大多采用分解策略,对单个标签独立预测,忽略了标签间的相关性.多标签学习算法的分类效果跟输入的特征有很大的关系,不同的特征含有的标签分类信息不同,故不同特征的重要度也不同.互信息是常用的度量2个变量间关联度的重要方法之一,能够有效度量特征含有标签分类的知识量.因此,根据特征含有标签分类知识量的大小,赋予相应的权重系数,提出一种基于互信息的粒化特征加权多标签学习k近邻算法(granular feature weighted k-nearest neighbors algorithm for multi-label learning, GFWML-kNN),该算法将标签空间粒化成多个标签粒,对每个标签粒计算特征的权重系数,以解决上述问题和标签组合爆炸问题.在计算特征权重时,考虑到了标签间可能的组合,把标签间的相关性融合进特征的权重系数.实验表明:相较于若干经典的多标签学习算法,所提算法GFWML-kNN整体上能取得较好的效果.  相似文献   

19.
目前大部分已经存在的多标记学习算法在模型训练过程中所采用的共同策略是基于相同的标记属性特征集合预测所有标记类别.但这种思路并未对每个标记所独有的标记特征进行考虑.在标记空间中,这种标记特定的属性特征对于区分其它类别标记和描述自身特性是非常有帮助的信息.针对这一问题,本文提出了基于标记特定特征和相关性的ML-KNN改进算法MLF-KNN.不同于之前的多标记算法直接在原始训练数据集上进行操作,而是首先对训练数据集进行预处理,为每一种标记类别构造其特征属性,在得到的标记属性空间上进一步构造L1-范数并进行优化从而引入标记之间的相关性,最后使用改进后的ML-KNN算法进行预测分类.实验结果表明,在公开数据集image和yeast上,本文提出的算法MLF-KNN分类性能优于ML-KNN,同时与其它另外3种多标记学习算法相比也表现出一定的优越性.  相似文献   

20.
社交网络平台产生海量的短文本数据流,具有快速、海量、概念漂移、文本长度短小、类标签大量缺失等特点.为此,文中提出基于向量表示和标签传播的半监督短文本数据流分类算法,可对仅含少量有标记数据的数据集进行有效分类.同时,为了适应概念漂移,提出基于聚类簇的概念漂移检测算法.在实际短文本数据流上的实验表明,相比半监督分类算法和半监督数据流分类算法,文中算法不仅提高分类精度和宏平均,还能快速适应数据流中的概念漂移.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号