Similar literature
 20 similar documents found (search time: 46 ms)
1.
This paper proposes a classification framework based on simple classifiers organized in a tree-like structure. Simple classifiers, even those with a high error rate, are observed to find similarities among classes in the problem domain. The authors propose to exploit this property by identifying the classes that are confused with one another and constructing overlapping subproblems from them. The subproblems are then solved by other classifiers, which can themselves be very simple, yielding a hierarchical classifier (HC). The paper shows that HC, together with the proposed training algorithm and evaluation methods, performs well as a classification framework. It also proves that such a construct achieves better accuracy than the root classifier on which it is built.
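The idea of turning a root classifier's mistakes into overlapping subproblems can be sketched as follows. This is only an illustration under stated assumptions: the function name `overlapping_subproblems`, the 10% confusion threshold, and the toy confusion matrix are hypothetical, and the abstract does not specify the paper's actual construction.

```python
from typing import Dict, List, Set


def overlapping_subproblems(confusion: List[List[int]],
                            threshold: float = 0.1) -> Dict[int, Set[int]]:
    """Group classes that the root classifier tends to confuse.

    confusion[t][p] counts examples of true class t predicted as class p.
    For every predicted class p, the subproblem contains p itself plus each
    true class t that is sent to p at a rate of at least `threshold`.
    A class may therefore appear in several subproblems (overlap).
    """
    n = len(confusion)
    groups: Dict[int, Set[int]] = {}
    for p in range(n):
        members = {p}
        for t in range(n):
            row_total = sum(confusion[t])
            if row_total and confusion[t][p] / row_total >= threshold:
                members.add(t)
        groups[p] = members
    return groups


# Hypothetical confusion matrix of a weak root classifier on 4 classes.
confusion = [
    [80, 15, 5, 0],
    [20, 70, 5, 5],
    [0, 2, 90, 8],
    [1, 1, 30, 68],
]
groups = overlapping_subproblems(confusion, threshold=0.1)
for predicted, members in groups.items():
    print(predicted, sorted(members))
```

In this toy case class 3 lands in two subproblems, so a child classifier trained for root output 2 still gets a chance to recover examples of class 3.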

2.
Ensemble methods, and the hierarchical approach in particular, are one solution to the classification problem. The hierarchical approach dynamically splits the original problem during training into smaller subproblems that should be easier to learn, and then combines the answers to obtain the final classification. The main difficulty is how to divide (cluster) the original problem so as to obtain the best possible accuracy, expressed as the value of a risk function. The exact value for a given clustering is known only after the whole training process. In this paper we propose a risk estimation method based on analysis of the root classifier, which makes it possible to evaluate the risks of all subproblems without training any child classifiers. Together with earlier theoretical results on the hierarchical approach, we show how to use the proposed method to evaluate the risk of the whole ensemble. A variant that uses a genetic algorithm (GA) is also proposed, and we compare it with an earlier method based on Bayes' law. We show that the estimated subproblem risk is highly correlated with the true risk, and that the Bayes and GA approaches yield hierarchical classifiers that are superior to single classifiers. Our method works with any classifier that returns a class probability vector for a given example.

3.
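A traffic classification framework based on ensemble clustering   Total citations: 1 (self-citations: 0, citations by others: 1)
LU Gang, YU Xiangzhan, ZHANG Hongli, GUO Ronghua. Journal of Software (《软件学报》), 2016, 27(11): 2870-2883
Traffic classification is the foundation of, and key to, optimizing network quality of service. Machine learning algorithms classify traffic using statistical features of data flows, which is important for identifying encrypted and proprietary-protocol traffic. However, feature bias and class imbalance are the two major challenges facing machine-learning-based traffic classification. Feature bias means that some statistical flow features raise the recognition accuracy for some applications while lowering it for others. Class imbalance means that machine learning traffic classifiers achieve lower recognition accuracy for applications with few samples. To address these problems, a traffic classification framework based on ensemble clustering (TCFEC) is proposed. TCFEC consists of multiple base classifiers, each built by clustering in a different feature subspace, and an optimal decision component, and it can improve traffic classification accuracy. Specifically, compared with traditional machine learning traffic classifiers, TCFEC improves average flow accuracy by up to 5% and byte accuracy by up to 6%.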

4.
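To address the low accuracy, poor stability, and weak generalization of traditional models on imbalanced data classification, a multi-granularity ensemble classification algorithm based on sequential three-way decisions, MGE-S3WD, is proposed. Binary relations are used to partition the granular levels dynamically. Thresholds are computed from a cost matrix and a multi-level granular structure is built, dividing the data at each granular level into a positive region, a boundary region, and a negative region. The partitions at each level are then recombined pairwise (positive with negative, positive with boundary, and negative with boundary) to form new data subsets, and a base classifier is built on each subset to achieve ensemble classification of imbalanced data. Simulation results show that the algorithm effectively reduces the imbalance ratio of the data subsets and increases the diversity of the base classifiers in the ensemble. Under the two evaluation metrics G-mean and F-measure, its classification performance is better than, or partly better than, that of other ensemble classification algorithms. It effectively improves the accuracy and stability of the classification model and offers a new line of research for ensemble learning on imbalanced datasets.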
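The step "compute thresholds from a cost matrix, then split into positive, boundary, and negative regions" can be illustrated with the standard decision-theoretic rough set formulas. This is a sketch under assumptions: the abstract does not state which formulas MGE-S3WD uses, and the loss values, probabilities, and function names below are hypothetical.

```python
def thresholds(l_pp, l_bp, l_np, l_pn, l_bn, l_nn):
    """Three-way decision thresholds from a 3x2 cost matrix.

    l_xp: loss of taking action x (P=accept, B=defer, N=reject) when the
          object really belongs to the positive class.
    l_xn: loss of taking action x when the object does not belong to it.
    Uses the usual decision-theoretic rough set expressions (an assumption
    here, not taken from the abstract).
    """
    alpha = (l_pn - l_bn) / ((l_pn - l_bn) + (l_bp - l_pp))
    beta = (l_bn - l_nn) / ((l_bn - l_nn) + (l_np - l_bp))
    return alpha, beta


def three_way_split(probs, alpha, beta):
    """Assign sample indices to positive, boundary, and negative regions."""
    pos, bnd, neg = [], [], []
    for i, p in enumerate(probs):
        if p >= alpha:
            pos.append(i)
        elif p <= beta:
            neg.append(i)
        else:
            bnd.append(i)
    return pos, bnd, neg


# Hypothetical losses: correct decisions cost 0, deferring costs 2,
# a false acceptance costs 8, a false rejection costs 6.
alpha, beta = thresholds(l_pp=0, l_bp=2, l_np=6, l_pn=8, l_bn=2, l_nn=0)
probs = [0.9, 0.75, 0.5, 0.2, 0.4]  # hypothetical P(positive | x)
pos, bnd, neg = three_way_split(probs, alpha, beta)
print(alpha, round(beta, 3), pos, bnd, neg)
```

Pairs of regions (for example positive with boundary) would then be merged into the new training subsets that the abstract describes.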

5.
In this paper, we introduce a new adaptive rule-based classifier for multi-class classification of biological data that addresses several problems in classifying such data: overfitting, noisy instances, and class imbalance. Rules are well known to be an attractive way of representing data in a human-interpretable form. The proposed rule-based classifier combines the random subspace and boosting approaches with an ensemble of decision trees to construct a set of classification rules without involving global optimisation. The classifier uses the random subspace approach to avoid overfitting, the boosting approach to classify noisy instances, and an ensemble of decision trees to deal with the class-imbalance problem. It relies on two popular classification techniques: decision trees and the k-nearest-neighbor algorithm. Decision trees are used to evolve classification rules from the training data, while k-nearest-neighbor is used to analyse the misclassified instances and to remove vagueness between contradictory rules. The classifier runs a series of k iterations to develop a set of classification rules from the training data and, in a boosting-like manner, pays more attention in each iteration to the instances misclassified in the previous one. This paper focuses in particular on producing an optimal ensemble classifier that helps improve prediction accuracy for the DNA variant identification and classification task. The performance of the proposed classifier is compared with that of well-established machine learning and data mining algorithms on genomic data (148 Exome data sets) of Brugada syndrome and on 10 real benchmark life sciences data sets from the UCI (University of California, Irvine) machine learning repository. The experimental results indicate that the proposed classifier has exemplary classification accuracy on different types of biological data.
Overall, the proposed classifier offers good prediction accuracy for the classification of new DNA variants, where noisy and misclassified variants are optimised to increase test performance.

6.
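Based on the idea that multiple models can improve estimation accuracy and generalization, a multi-model soft-sensor modelling method based on a rough classifier is proposed. The method groups the training data by combining clustering with classification, which to some extent eliminates the effect that contradictory sample points may have on model accuracy. A regression sub-model is built for each group of samples with support vector regression, giving a multi-model soft-sensor system. In addition, similarity is introduced into rough sets as an index of the similarity between samples, which solves the problem that traditional rough sets cannot recognize patterns absent from the training set. By introducing a probability measure and using a probability formula as the decision rule for rough-set classification, the algorithm is simplified. The rough classifier built on this theory effectively improves classification accuracy and ensures the estimation accuracy of each sub-model. The method is applied to soft-sensor modelling of a quality index in the bisphenol A production process, and simulation results demonstrate its effectiveness.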

7.
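Application of an AdaBoost-based combined classifier to remote sensing image classification   Total citations: 2 (self-citations: 0, citations by others: 2)
AdaBoost, the classic combined-classifier algorithm, is used to combine the outputs of multiple weak classifiers (neural network classifiers), and a hybrid discriminant multi-classifier combination rule is introduced, which effectively improves the classification accuracy of difficult classes and thereby the overall accuracy. Taking ASTER imagery of the Tianjin area as an example, the AdaBoost-based combined classification algorithm is described and then used to produce a land-use classification of the Tianjin area. The results show that the combined classifier effectively improves on the accuracy of a single classifier, raising overall classification accuracy from 81.13% to 93.32%. The experiments indicate that AdaBoost-based combined classification is a new and effective method for remote sensing image classification.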
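How AdaBoost combines weak classifiers can be shown with a small two-class sketch. The paper uses neural networks as weak learners and a multi-class combination rule; the version below swaps in one-dimensional decision stumps and binary labels purely for illustration, and all names and data are hypothetical.

```python
import math


def train_stump(xs, ys, w):
    """Pick the threshold/polarity stump with the lowest weighted error.

    Labels are +1/-1. A stump predicts `sign` when x >= thr, else -sign.
    """
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x >= thr else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best


def adaboost(xs, ys, rounds=3):
    """Discrete AdaBoost: reweight samples so later stumps focus on errors."""
    n = len(xs)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        a = 0.5 * math.log((1 - err) / err)    # weight of this weak learner
        model.append((a, thr, sign))
        preds = [sign if x >= thr else -sign for x in xs]
        # Misclassified samples (y * p == -1) gain weight, others lose it.
        w = [wi * math.exp(-a * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, x):
    """Sign of the weighted vote of all stumps."""
    score = sum(a * (sign if x >= thr else -sign) for a, thr, sign in model)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

No single stump gets more than four of the six points right here, while the three-round weighted vote classifies all six, which mirrors the abstract's claim that the combination beats a single classifier.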

8.
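To address the small number of training samples and the low classification accuracy encountered in intrusion detection, a multi-level classification mechanism based on fuzzy support vector machines is proposed. The mechanism trains a fuzzy SVM model to coarsely divide the data into normal and attack classes, and uses the DBSCAN algorithm to generate a fine-grained model that automatically clusters the attack subset, subdividing the data into specific attack categories. In designing the mechanism, the computation of the membership function is optimized, data standardization and normalization steps are designed, and an efficient classifier is trained. Experiments show that on network traffic datasets typical of intrusion detection, with outlier interference, heavy noise, and a large proportion of negative samples, the new algorithm keeps classification accuracy high while requiring little computation time for classification.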

9.
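Imbalanced classification problems arise widely in real-world applications. Because most resampling algorithms focus on balance between classes and pay little attention to imbalanced data distribution within a class, a clustering-based hybrid sampling algorithm is proposed. The original dataset is first clustered; the imbalance ratio of each cluster is then computed and the cluster is processed according to that ratio; finally, the balanced dataset is used to train a GBDT classifier. Experiments show that the algorithm achieves higher F1-value and AUC than several traditional algorithms, and better classification performance.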
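The per-cluster processing can be sketched as below. The abstract does not say how a cluster is treated for a given imbalance ratio, so the rule used here (undersample the majority and duplicate minority samples until both reach the mean class size) is an assumption, as are the function names and the toy data; the clustering step itself is taken as given.

```python
import random
from collections import Counter, defaultdict


def cluster_imbalance_ratios(cluster_ids, labels, minority=1):
    """Imbalance ratio (majority count / minority count) for each cluster."""
    counts = defaultdict(Counter)
    for c, y in zip(cluster_ids, labels):
        counts[c][y] += 1
    ratios = {}
    for c, cnt in counts.items():
        n_min = cnt[minority]
        n_maj = sum(cnt.values()) - n_min
        ratios[c] = float("inf") if n_min == 0 else n_maj / n_min
    return ratios


def rebalance(indices, labels, minority=1, rng=None):
    """Hybrid sampling inside one cluster (hypothetical rule).

    Majority samples are randomly dropped and minority samples randomly
    duplicated until both classes reach the mean of the two class sizes.
    Clusters containing a single class are returned unchanged.
    """
    rng = rng or random.Random(0)
    mino = [i for i in indices if labels[i] == minority]
    majo = [i for i in indices if labels[i] != minority]
    if not mino or not majo:
        return list(indices)
    target = (len(mino) + len(majo)) // 2
    if len(majo) > target:
        majo = rng.sample(majo, target)  # undersample the majority
    if len(mino) < target:
        extra = [rng.choice(mino) for _ in range(target - len(mino))]
        mino = mino + extra              # oversample by duplication
    return mino + majo


# Toy data: cluster 0 is heavily imbalanced, cluster 1 is already balanced.
cluster_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
labels = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]
ratios = cluster_imbalance_ratios(cluster_ids, labels)
balanced = rebalance([0, 1, 2, 3, 4, 5], labels)
print(ratios, Counter(labels[i] for i in balanced))
```

The rebalanced indices from every cluster would then be concatenated and passed to the downstream classifier (GBDT in the abstract).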

10.
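To address the low classification accuracy caused by the high dimensionality and small sample size of hyperspectral remote sensing images, a hyperspectral image ensemble classification algorithm based on DS clustering (DSCEA) is proposed. First, in line with the characteristics of hyperspectral data, a number of bands are randomly selected from the full set of bands to form different training samples. Second, the spatial-spectral information of the image is analysed, an undirected weighted graph is constructed, and the dominant set (DS) clustering method is used to obtain the band subsets with the greatest feature difference. Finally, support vector machines are trained on the different samples to obtain diverse individual classifiers, which are combined by majority voting into the final classifier that classifies the hyperspectral remote sensing image. DSCEA reaches a classification accuracy of up to 84.61% on the Indian Pines dataset and up to 91.89% on the Pavia University dataset, and the experimental results show that DSCEA can effectively solve the hyperspectral classification problem.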
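Two of the steps above, random band selection and majority voting, are simple enough to sketch directly. The dominant set clustering and SVM training are omitted; the function names, the band count of 200, and the prediction lists are hypothetical values chosen for illustration.

```python
import random
from collections import Counter


def random_band_subsets(n_bands, subset_size, n_subsets, seed=0):
    """Draw several random band subsets (without replacement inside each)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_bands), subset_size))
            for _ in range(n_subsets)]


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per pixel.

    predictions[c][i] is the label classifier c assigns to pixel i.
    """
    return [Counter(column).most_common(1)[0][0]
            for column in zip(*predictions)]


# Each subset would feed one base classifier (an SVM in the abstract).
subsets = random_band_subsets(n_bands=200, subset_size=20, n_subsets=5)

# Hypothetical outputs of three base classifiers on three pixels.
votes = majority_vote([[1, 2, 3],
                       [1, 2, 2],
                       [3, 2, 2]])
print(votes)
```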

11.
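A method is proposed for classifying unlabeled text documents when no training set is available. Class-associated words are words or phrases that are related to a class topic and reflect it. The prior information provided by class-associated words is used to form prior probabilities for document classification; a naive Bayes classifier is then combined with the EM iterative algorithm, with classification constraints added to the semi-supervised learning process, so that the class-associated words supervise the construction of a classifier for completely unlabeled documents. Experimental results show that the method achieves relatively high accuracy for text classification without a training set, and that classification accuracy under the class-associated word constraints is higher than without them.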

12.
Related work on applying keystroke dynamics (KD) to free-text identification indicates that KD can improve the accuracy of personal authentication on free text. This paper therefore proposes a new biometric, the keystroke clusters map (KC-Map), obtained by clustering users' keystrokes, in order to enhance the accuracy of personal authentication on free text. Because KC-Map is produced by clustering, it is not suitable for traditional classifiers. To tackle this problem, the paper further proposes a keystroke clusters map similarity classifier (KCMS classifier). Experimental results show that the proposed KC-Map and KCMS classifier can improve the accuracy of personal authentication on free text by up to 1.27 times. In addition, one major disadvantage of current approaches to free-text identification is that users generally have to train for several months, and long training times make free-text identification impractical. A further motivation of this paper is to explore whether the training time can be shortened to an acceptable range. Experimental results show that, to achieve relatively fair identification accuracy, users need only about 20 min of training.

13.
Grouping images into semantically meaningful categories using low-level visual features is a challenging and important problem in content-based image retrieval. Based on these groupings, effective indices can be built for an image database. In this paper, we show how a specific high-level classification problem (city images vs landscapes) can be solved from relatively simple low-level features geared for the particular classes. We have developed a procedure to qualitatively measure the saliency of a feature for a classification problem based on the plot of the intra-class and inter-class distance distributions. We use this approach to determine the discriminative power of the following features: color histogram, color coherence vector, DCT coefficient, edge direction histogram, and edge direction coherence vector. We determine that the edge direction-based features have the most discriminative power for the classification problem of interest here. A weighted k-NN classifier is used for the classification, which results in an accuracy of 93.9% when evaluated on an image database of 2716 images using the leave-one-out method. This approach has been extended to further classify 528 landscape images into forest, mountain, and sunset/sunrise classes. First, the input images are classified as sunset/sunrise images vs forest & mountain images (94.5% accuracy), and then the forest & mountain images are classified as forest images or mountain images (91.7% accuracy). We are currently identifying further semantic classes to assign to images as well as extracting low-level features that are salient for these classes. Our final goal is to combine multiple 2-class classifiers into a single hierarchical classifier.
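A weighted k-NN classifier evaluated with leave-one-out can be sketched in a few lines. The abstract does not state the weighting scheme, so inverse-distance weighting is an assumption here, and the two-dimensional feature vectors stand in for the real edge-direction features.

```python
import math
from collections import defaultdict


def weighted_knn(train, query, k=3):
    """Classify `query` from its k nearest neighbours in `train`.

    train is a list of (vector, label) pairs. Each neighbour votes with
    weight 1 / distance (an assumed weighting, not taken from the paper).
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for vec, label in nearest:
        votes[label] += 1.0 / (math.dist(vec, query) + 1e-9)
    return max(votes, key=votes.get)


def leave_one_out_accuracy(data, k=3):
    """Hold out each sample in turn and classify it from all the others."""
    hits = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += weighted_knn(rest, vec, k) == label
    return hits / len(data)


# Hypothetical 2-D feature vectors for the two classes.
data = [
    ((0.90, 0.80), "city"), ((0.80, 0.90), "city"),
    ((0.85, 0.85), "city"), ((0.95, 0.90), "city"),
    ((0.10, 0.20), "landscape"), ((0.20, 0.10), "landscape"),
    ((0.15, 0.15), "landscape"), ((0.05, 0.10), "landscape"),
]
accuracy = leave_one_out_accuracy(data, k=3)
print(accuracy)
```

Chaining two such classifiers (sunset/sunrise vs the rest, then forest vs mountain) gives the kind of two-level hierarchy the abstract describes.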

14.
On using partial supervision for text categorization   Total citations: 1 (self-citations: 0, citations by others: 1)
We discuss the merits of building text categorization systems by using supervised clustering techniques. Traditional approaches for document classification on a predefined set of classes are often unable to provide sufficient accuracy because of the difficulty of fitting a manually categorized collection of documents in a given classification model. This is especially the case for heterogeneous collections of Web documents which have varying styles, vocabulary, and authorship. Hence, we investigate the use of clustering in order to create the set of categories and its use for classification of documents. Completely unsupervised clustering has the disadvantage that it has difficulty in isolating sufficiently fine-grained classes of documents relating to a coherent subject matter. We use the information from a preexisting taxonomy in order to supervise the creation of a set of related clusters, though with some freedom in defining and creating the classes. We show that the advantage of using partially supervised clustering is that it is possible to have some control over the range of subjects that one would like the categorization system to address, but with a precise mathematical definition of how each category is defined. An extremely effective way to categorize documents is then to use this a priori knowledge of the definition of each category. We also discuss a new technique to help the classifier distinguish better among closely related clusters.

15.
The Bayesian classifier is a fundamental classification technique. In this work, we focus on programming Bayesian classifiers in SQL. We introduce two classifiers: Naive Bayes and a classifier based on class decomposition using K-means clustering. We consider two complementary tasks: model computation and scoring a data set. We study several layouts for tables and several indexing alternatives. We analyze how to transform equations into efficient SQL queries and introduce several query optimizations. We conduct experiments with real and synthetic data sets to evaluate classification accuracy, query optimizations, and scalability. Our Bayesian classifier is more accurate than Naive Bayes and decision trees. Distance computation is significantly accelerated with horizontal layout for tables, denormalization, and pivoting. We also compare Naive Bayes implementations in SQL and C++: SQL is about four times slower. Our Bayesian classifier in SQL achieves high classification accuracy, can efficiently analyze large data sets, and has linear scalability.
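The "model computation" task for Naive Bayes maps naturally onto `GROUP BY` aggregates. The sketch below uses Python's built-in `sqlite3` module with a vertical (id, attribute, value) table; the schema, the toy rows, the Laplace smoothing, and the choice to do the final scoring in Python rather than in SQL are all assumptions made for illustration and are not the paper's queries.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical vertical layout: one row per (example, attribute).
conn.execute("CREATE TABLE train (id INTEGER, attr TEXT, val TEXT, cls TEXT)")
rows = [
    (1, "outlook", "sunny", "no"), (1, "windy", "yes", "no"),
    (2, "outlook", "sunny", "no"), (2, "windy", "no", "no"),
    (3, "outlook", "rainy", "yes"), (3, "windy", "no", "yes"),
    (4, "outlook", "rainy", "yes"), (4, "windy", "no", "yes"),
    (5, "outlook", "sunny", "yes"), (5, "windy", "no", "yes"),
]
conn.executemany("INSERT INTO train VALUES (?, ?, ?, ?)", rows)

# Model computation: the sufficient statistics are plain aggregates.
class_counts = dict(conn.execute(
    "SELECT cls, COUNT(DISTINCT id) FROM train GROUP BY cls"))
value_counts = {
    (cls, attr, val): n
    for cls, attr, val, n in conn.execute(
        "SELECT cls, attr, val, COUNT(*) FROM train GROUP BY cls, attr, val")
}
n_values = dict(conn.execute(
    "SELECT attr, COUNT(DISTINCT val) FROM train GROUP BY attr"))


def score(example):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    total = sum(class_counts.values())
    best_cls, best_logp = None, -math.inf
    for cls, n_cls in class_counts.items():
        logp = math.log(n_cls / total)
        for attr, val in example.items():
            n = value_counts.get((cls, attr, val), 0)
            logp += math.log((n + 1) / (n_cls + n_values[attr]))
        if logp > best_logp:
            best_cls, best_logp = cls, logp
    return best_cls


print(score({"outlook": "sunny", "windy": "yes"}))
print(score({"outlook": "rainy", "windy": "no"}))
```

Scoring could also be pushed into SQL by joining the example table against a stored table of log probabilities and summing per class, which is closer in spirit to what the abstract describes.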

16.
In this paper, we propose a method to predict the presence or absence of correct classification results in classification problems that have many classes and in which the output of the classifier is provided in the form of a ranking list. This problem differs from the "traditional" classification tasks encountered in pattern recognition. While the original problem of forming a ranking of the most likely classes can be solved by running several classification methods, the analysis presented here goes one step further. The main objective is to analyse (classify) the provided rankings (ordered lists of a fixed length) and decide whether the "true" class is present on the list. To this end, a two-class classification problem is formulated in which the underlying feature space is built through a characterization of the ranking lists. Experimental results obtained for synthetic data as well as real-world face identification data are presented.

17.
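To address the slow training of multi-class classification algorithms based on traditional support vector machines (SVM) on large-scale data, a multi-class classification algorithm based on twin support vector machines (TWSVM) is proposed. The algorithm adopts the binary-tree SVM approach to multi-class classification and performs classification by constructing a TWSVM-based classifier at each node of the binary tree. To reduce error accumulation in the binary-tree SVM, a clustering algorithm is first used to obtain a cluster center for each class, and the distances between cluster centers are compared to measure how much the classes differ and thus to decide the order in which classes are separated at the tree nodes. Finally, the algorithm is applied to network intrusion detection. Experimental results show that the algorithm maintains high detection accuracy and also has an advantage in training speed, which is more pronounced on somewhat larger datasets, where it trains nearly twice as fast as the traditional binary-tree SVM multi-class algorithm. It provides a useful reference for large-scale data processing in intrusion detection.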
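One plausible way to turn cluster-center distances into a separation order is to peel off, at each tree node, the class whose center is most isolated from the remaining ones. The abstract does not give the exact rule, so this criterion, the use of the class mean as the cluster center, and the toy points and names below are assumptions for illustration only.

```python
import math


def centroid(points):
    """Mean vector of a list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def separation_order(class_points):
    """Order in which classes are split off along the binary tree.

    At each node the class whose center has the largest total distance to
    the other remaining centers is separated first (assumed criterion), so
    the easiest splits happen near the root and errors accumulate less.
    """
    centers = {c: centroid(pts) for c, pts in class_points.items()}
    remaining = sorted(centers)
    order = []
    while len(remaining) > 1:
        def isolation(c):
            return sum(math.dist(centers[c], centers[o])
                       for o in remaining if o != c)
        farthest = max(remaining, key=isolation)
        order.append(farthest)
        remaining.remove(farthest)
    order.append(remaining[0])
    return order


# Hypothetical 2-D traffic features for three classes.
class_points = {
    "normal": [(0, 0), (0, 2)],
    "probe": [(1, 1), (3, 1)],
    "dos": [(10, 10), (12, 10)],
}
order = separation_order(class_points)
print(order)
```

With these points the "dos" class is far from the other two and is separated at the root; a binary classifier (TWSVM in the abstract) would then be trained at each node in this order.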

18.
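Hierarchically classified probabilistic parsing   Total citations: 3 (self-citations: 0, citations by others: 3)
Existing methods for introducing knowledge into syntactic parsing are surveyed and analysed, and it is argued that many parsing methods can be viewed as classification based on feature labels; the resulting under-classification and over-classification problems are then analysed. On this basis, a hierarchically classified phrase structure grammar and a hierarchically classified probabilistic parsing method (hierarchically classified probabilistic context-free grammar) are proposed, together with a method that removes classification ambiguity in syntactic rules by clustering instances. Hierarchical classification is further extended to probabilistic context-sensitive parsing, where hierarchical classification of contextual dependencies is used to address the data sparsity that arises when context is introduced. Together, these methods effectively overcome the conflict between over-classification and under-classification.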

19.
Fingerprint classification is still a challenging problem due to large intra-class variability, small inter-class variability, and the presence of noise. To deal with these difficulties, we propose a regularized orientation diffusion model for fingerprint orientation extraction and a hierarchical classifier for fingerprint classification. The proposed classification algorithm is composed of five cascading stages. The first stage rapidly distinguishes a majority of Arch fingerprints by using complex filter responses. The second stage distinguishes a majority of Whorl fingerprints by using core points and a ridge line flow classifier. In the third stage, a K-NN classifier finds the top two categories by using the orientation field and complex filter responses. In the fourth stage, a ridge line flow classifier is used to distinguish Loop from the other classes except Whorl. An SVM is adopted to make the final classification in the last stage. The regularized orientation diffusion model has been evaluated on the web-based automated evaluation system FVC-onGoing, and a promising result is obtained. The classification method has been evaluated on NIST SD 4, where it achieved a classification accuracy of 95.9% for five-class classification and 97.2% for four-class classification without rejection.

20.
This paper presents the modelling possibilities of kernel-based approaches to a complex real-world problem, i.e. corporate and municipal credit rating classification. Based on a model design that includes data pre-processing, the labelling of individual parameter vectors using expert knowledge, the design of various support vector machines with supervised learning as well as kernel-based approaches with semi-supervised learning, this modelling is undertaken in order to classify objects into rating classes. The results show that the rating classes assigned to bond issuers can be classified with high classification accuracy using a limited subset of input variables. This holds true for kernel-based approaches with both supervised and semi-supervised learning.
