Similar Literature
A total of 20 similar documents were found (search time: 614 ms).
1.
A Distributed Assistant Associative Classification Algorithm for Big Data Environments
张明卫  朱志良  刘莹  张斌 《软件学报》2015,26(11):2795-2810
In many real-world classification applications, the class label of a new data item is ultimately determined by a domain expert, and the classifier's output serves only as an aid. Moreover, as the value hidden in big data attracts growing attention, classifier training will gradually shift from single data sets to spatially distributed data sets, and assisted classification in big data environments will become an important branch of future classification applications. Existing classification research, however, pays little attention to this kind of application. Assisted classification in big data environments faces three problems: 1) the training set is a distributed big data set; 2) spatially, the class distributions of the local data sources that make up the training set differ from one another; 3) temporally, the training set changes dynamically and exhibits class drift. With these problems in mind, this paper proposes a distributed assistant associative classification method for big data environments. The method first gives an algorithm for building a distributed associative classifier in a big data environment; the algorithm handles the spatial differences in class distribution across the data sets through horizontal weighting, and introduces an "antecedent-space support / correlation coefficient" measurement framework to remedy the weakness of associative classification on imbalanced data. It then gives a dynamic adjustment method for the assistant associative classifier based on adaptation factors, which makes full use of the feedback returned in real time by domain experts while the classifier is in use to adjust it dynamically, improving its performance on dynamic data sets, slowing classifier degradation, and reducing the frequency of retraining. Experimental results show that the method can quickly train associative classifiers with high classification accuracy on distributed data sets and can improve classification performance as the data set keeps growing and changing, making it an effective approach to assisted classification in big data environments.

2.
It sometimes happens (for instance in case control studies) that a classifier is trained on a data set that does not reflect the true a priori probabilities of the target classes on real-world data. This may have a negative effect on the classification accuracy obtained on the real-world data set, especially when the classifier's decisions are based on the a posteriori probabilities of class membership. Indeed, in this case, the trained classifier provides estimates of the a posteriori probabilities that are not valid for this real-world data set (they rely on the a priori probabilities of the training set). Applying the classifier as is (without correcting its outputs with respect to these new conditions) on this new data set may thus be suboptimal. In this note, we present a simple iterative procedure for adjusting the outputs of the trained classifier with respect to these new a priori probabilities without having to refit the model, even when these probabilities are not known in advance. As a by-product, estimates of the new a priori probabilities are also obtained. This iterative algorithm is a straightforward instance of the expectation-maximization (EM) algorithm and is shown to maximize the likelihood of the new data. Thereafter, we discuss a statistical test that can be applied to decide if the a priori class probabilities have changed from the training set to the real-world data. The procedure is illustrated on different classification problems involving a multilayer neural network, and comparisons with a standard procedure for a priori probability estimation are provided. Our original method, based on the EM algorithm, is shown to be superior to the standard one for a priori probability estimation. Experimental results also indicate that the classifier with adjusted outputs always performs better than the original one in terms of classification accuracy, when the a priori probability conditions differ from the training set to the real-world data. The gain in classification accuracy can be significant.
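The iterative adjustment described in this abstract can be written compactly. Below is a minimal Python (NumPy) sketch of an EM-style correction of the kind described: posteriors from the trained classifier are reweighted by the ratio of the current prior estimate to the training prior, renormalized, and the priors are re-estimated from the adjusted posteriors until convergence. Function and variable names are our own; this is an illustration, not the authors' code.

```python
import numpy as np

def adjust_to_new_priors(train_priors, test_probs, n_iter=100, tol=1e-6):
    """EM-style correction of posterior estimates when the class priors of the
    new data set differ from those of the training set and are unknown.

    train_priors : (K,) class priors observed on the training set
    test_probs   : (N, K) posteriors P(c|x) output by the trained classifier
    Returns (estimated_new_priors, adjusted_posteriors).
    """
    new_priors = train_priors.copy()
    adjusted = test_probs
    for _ in range(n_iter):
        # E-step: reweight posteriors by the ratio of current to training priors
        w = test_probs * (new_priors / train_priors)
        adjusted = w / w.sum(axis=1, keepdims=True)
        # M-step: re-estimate the new priors as the mean adjusted posterior
        updated = adjusted.mean(axis=0)
        if np.max(np.abs(updated - new_priors)) < tol:
            new_priors = updated
            break
        new_priors = updated
    return new_priors, adjusted
```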

3.
The Domain Adaptation problem in machine learning occurs when the distribution generating the test data differs from the one that generates the training data. A common approach to this issue is to train a standard learner for the learning task with the available training sample (generated by a distribution that is different from the test distribution). One can view such learning as learning from a not-perfectly-representative training sample. The question we focus on is under which circumstances large sizes of such training samples can guarantee that the learned classifier performs just as well as one learned from target generated samples. In other words, are there circumstances in which quantity can compensate for quality (of the training data)? We give a positive answer, showing that this is possible when using a Nearest Neighbor algorithm. We show this under some assumptions about the relationship between the training and the target data distributions (the assumptions of covariate shift as well as a bound on the ratio of certain probability weights between the source (training) and target (test) distribution). We further show that in a slightly different learning model, when one imposes restrictions on the nature of the learned classifier, these assumptions are not always sufficient to allow such a replacement of the training sample: For proper learning, where the output classifier has to come from a predefined class, we prove that any learner needs access to data generated from the target distribution.

4.

In class-imbalanced data, both between-class and within-class imbalance are important factors that degrade classification performance. To improve the performance of classification algorithms on imbalanced data sets, a hybrid sampling algorithm based on probability distribution estimation is proposed. The algorithm samples each sub-class separately according to the data's probability distribution to ensure within-class balance; it also enlarges the potential decision region of the minority class and reduces redundant information in the majority class, thereby improving data balance from both a global and a local perspective (see the sketch below). Experimental results show that the algorithm improves the classification performance of traditional classification algorithms on imbalanced data.
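The abstract above is high level; as one possible illustration of sampling each sub-class to restore within-class balance, the hedged sketch below finds sub-classes by clustering inside each class and then resamples every sub-cluster to a common size. The clustering step, the choice of KMeans, and the target sizes are our assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def balance_within_and_between(X, y, n_subclusters=3, random_state=0):
    """Illustrative hybrid resampling: cluster each class into sub-classes,
    then sample every sub-cluster up to a common size so that both
    between-class and within-class counts are even."""
    rng = np.random.default_rng(random_state)
    parts_X, parts_y = [], []
    classes, counts = np.unique(y, return_counts=True)
    # target size per sub-cluster: largest class spread over its sub-clusters
    target = counts.max() // n_subclusters + 1
    for c in classes:
        Xc = X[y == c]
        k = min(n_subclusters, len(Xc))
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(Xc)
        for j in range(k):
            Xj = Xc[labels == j]
            idx = rng.choice(len(Xj), size=target, replace=True)
            parts_X.append(Xj[idx])
            parts_y.append(np.full(target, c))
    return np.vstack(parts_X), np.concatenate(parts_y)
```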


5.
One of the serious challenges in computer vision and image classification is learning an accurate classifier for a new unlabeled image dataset, considering that there is no available labeled training data. Transfer learning and domain adaptation are two outstanding solutions that tackle this challenge by employing available datasets, even with significant difference in distribution and properties, and transfer the knowledge from a related domain to the target domain. The main difference between these two solutions is their primary assumption about change in marginal and conditional distributions: transfer learning emphasizes problems with the same marginal distribution and different conditional distributions, and domain adaptation deals with the opposite conditions. Most prior works have exploited these two learning strategies separately for the domain shift problem where training and test sets are drawn from different distributions. In this paper, we exploit joint transfer learning and domain adaptation to cope with the domain shift problem in which the distribution difference is significantly large, particularly for vision datasets. We therefore put forward a novel transfer learning and domain adaptation approach, referred to as visual domain adaptation (VDA). Specifically, VDA reduces the joint marginal and conditional distribution differences across domains in an unsupervised manner where no label is available in the test set. Moreover, VDA constructs condensed domain-invariant clusters in the embedding representation to separate the various classes alongside the domain transfer. In this work, we employ pseudo target label refinement to iteratively converge to the final solution. Employing an iterative procedure along with a novel optimization problem creates a robust and effective representation for adaptation across domains. Extensive experiments on 16 real vision datasets with different difficulties verify that VDA can significantly outperform state-of-the-art methods in the image classification problem.

6.
In the area of classification, C4.5 is a well-known algorithm widely used to design decision trees. In this algorithm, a pruning process is carried out to solve the problem of over-fitting. A modification of C4.5, called Credal-C4.5, is presented in this paper. This new procedure uses a mathematical theory based on imprecise probabilities and uncertainty measures. In this way, Credal-C4.5 estimates the probabilities of the features and the class variable by using imprecise probabilities. Besides, it uses a new split criterion, called Imprecise Information Gain Ratio, applying uncertainty measures on convex sets of probability distributions (credal sets). In this manner, Credal-C4.5 builds trees for solving classification problems assuming that the training set is not fully reliable. We carried out several experimental studies comparing this new procedure with other ones and we obtain the following principal conclusion: in domains with class noise, Credal-C4.5 obtains smaller trees and better performance than classic C4.5.

7.
This paper addresses the problem of classifying text documents of unlabeled classes when no training set is available. Class-associated words are words or phrases that are related to, and reflective of, the subject of a class. The prior information provided by class-associated words is used to form prior probabilities for document classification; a naive Bayes classifier is then combined with the iterative EM algorithm, classification constraints are added to the semi-supervised learning process, and a classifier is constructed under the supervision of the class-associated words, thereby classifying documents whose classes are completely unlabeled. Experimental results show that this method can classify text with fairly high accuracy in the absence of a training set, and that the classification accuracy under class-associated-word constraints is higher than without such constraints.
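As a rough illustration of the scheme in this abstract (keyword-seeded priors, then naive Bayes refined with EM over unlabeled documents), the Python sketch below seeds soft labels from counts of class-associated words and alternates multinomial naive Bayes parameter estimation with posterior re-estimation. The seeding rule, smoothing constant, and dense-matrix representation are our simplifications, not the paper's exact constraints.

```python
import numpy as np

def keyword_seeded_nb_em(X, class_keyword_cols, n_iter=10, alpha=1.0):
    """X: (N, V) document-term count matrix (dense NumPy array).
    class_keyword_cols: one list of column indices per class, pointing at
    that class's associated words. Returns soft labels P(c|d)."""
    N, V = X.shape
    K = len(class_keyword_cols)
    # Seed: responsibility proportional to counts of class-associated words
    resp = np.ones((N, K))
    for c, cols in enumerate(class_keyword_cols):
        resp[:, c] += X[:, cols].sum(axis=1)
    resp /= resp.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: multinomial naive Bayes parameters from soft counts
        priors = resp.mean(axis=0)                              # (K,)
        word_counts = resp.T @ X + alpha                        # (K, V)
        word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
        # E-step: recompute responsibilities for every document
        log_post = np.log(priors) + X @ np.log(word_probs).T    # (N, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp
```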

8.
Research on a Learning-Sequence Algorithm for Incremental Learning of the Naive Bayes Classifier
This paper first introduces an incremental naive Bayes classification model, and then proposes a new learning-sequence algorithm to remedy shortcomings in its learning sequence: the prior knowledge in the training instances is not fully exploited, and the effect of the completeness of test instances on classification is not reflected in the learning process. The algorithm introduces a classification-loss weight coefficient λ for measuring the classification loss. The coefficient serves to optimize the classifier by making full use of prior knowledge; by choosing a reasonable learning sequence, it strengthens the positive influence of relatively complete data on classification and weakens the negative influence of noisy data, thereby improving classification accuracy; and it compensates to some extent for the limitations of the independence assumption in practical problems.

9.
Image semantic annotation can be viewed as a multi-class classification problem, which maps image features to semantic class labels through the procedures of image modeling and image semantic mapping. A Bayesian classifier is usually adopted for image semantic annotation, classifying image features into class labels. In order to improve the accuracy and efficiency of the classifier in image annotation, we propose a combined optimization method which incorporates the affinity propagation algorithm, a training-data optimization algorithm, and prior-distribution modeling with a Gaussian mixture model to build the Bayesian classifier. The experimental results illustrate that the classifier performance for image semantic annotation is improved with the proposed method.

10.
The naive Bayes classifier is known to obtain good results with a simple procedure. The method is based on the independence of the attribute variables given the variable to be classified. In real databases, where this hypothesis is not verified, this classifier continues to give good results. In order to improve the accuracy of the method, various works have been carried out in an attempt to reconstruct the set of attributes and to join them so that there is independence between the new sets although the elements within each set are dependent. These methods are included in the ones known as semi-naive Bayes classifiers. In this article, we present an application of uncertainty measures on closed and convex sets of probability distributions, also called credal sets, in classification. We represent the information obtained from a database by a set of probability intervals (a credal set) via the imprecise Dirichlet model and we use uncertainty measures on credal sets in order to reconstruct the set of attributes, such as those mentioned, which shall enable us to improve the result of the naive Bayes classifier in a satisfactory way.
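For reference, the probability intervals that the imprecise Dirichlet model (IDM) assigns to a class value observed n_c times out of N samples, with IDM parameter s, take a simple closed form, and the uncertainty measure applied to the resulting credal set K is typically the maximum entropy attained over it (stating maximum entropy as the measure is an assumption about the usual practice in this line of work, not a claim about this particular article):

$$ p(c) \in \left[ \frac{n_c}{N+s},\; \frac{n_c+s}{N+s} \right], \qquad H^{*}(\mathcal{K}) = \max_{p \in \mathcal{K}} \Big( -\sum_{c} p(c)\,\log_2 p(c) \Big) $$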

11.
Classifier Design Based on a Doubly Fuzzy K-Means Algorithm
李泰  沈祥红 《计算机测量与控制》2008,16(9):1325-1326,1334
To improve the classification rate, the idea of fuzziness is introduced into the K-means algorithm a second time, yielding a classifier based on a doubly fuzzy K-means algorithm; the difference is that the fuzzification is applied to the classification rules. Representing the classification system with such fuzzy rules builds, more effectively, a fuzzy classifier that classifies the training samples fairly accurately, and classifiers designed in this way noticeably improve the classification rate. Finally, simulation tests on the Iris data show that the classification rate reaches about 98%; the method requires no predefined parameters, has a short training time, and is simple.

12.
Quantifying counts and costs via classification
Many business applications track changes over time, for example, measuring the monthly prevalence of influenza incidents. In situations where a classifier is needed to identify the relevant incidents, imperfect classification accuracy can cause substantial bias in estimating class prevalence. The paper defines two research challenges for machine learning. The ‘quantification’ task is to accurately estimate the number of positive cases (or class distribution) in a test set, using a training set that may have a substantially different distribution. The ‘cost quantification’ variant estimates the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the expense to resolve the case. Quantification has a very different utility model from traditional classification research. For both forms of quantification, the paper describes a variety of methods and evaluates them with a suitable methodology, revealing which methods give reliable estimates when training data is scarce, the testing class distribution differs widely from training, and the positive class is rare, e.g., 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor.
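One simple family of quantification methods in this line of work is "classify, count, and adjust": the raw positive prediction rate on the test set is corrected with the classifier's true- and false-positive rates estimated, for example, by cross-validation on the training set. The minimal sketch below illustrates that correction; the names are ours, and whether this exact variant matches one of the paper's evaluated methods is an assumption.

```python
def adjusted_count(pred_positive_rate, tpr, fpr):
    """Adjusted-count quantification: correct the observed positive prediction
    rate using the classifier's true/false positive rates, then clip to [0, 1]."""
    if tpr == fpr:
        raise ValueError("tpr and fpr must differ for the correction to be defined")
    estimate = (pred_positive_rate - fpr) / (tpr - fpr)
    return min(max(estimate, 0.0), 1.0)

# Example: classifier flags 8% of cases, with tpr=0.70 and fpr=0.05 from CV
print(adjusted_count(0.08, 0.70, 0.05))  # ~0.046 estimated true prevalence
```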

13.
Research on the SVM-KNN Classification Algorithm
The SVM-KNN classification algorithm is a new classification method that combines support vector machine (SVM) classification with nearest-neighbor (NN) classification. To address problems of the traditional SVM classifier, the algorithm trains on the data set with the sequential minimal optimization (SMO) training algorithm for SVMs, and samples whose distance difference is smaller than a given threshold are passed to a K-nearest-neighbor classifier that uses all support vectors of each class as representative points. Experimental results on UCI data sets show that this classifier achieves higher classification accuracy than an SVM classifier alone, is to some extent insensitive to the choice of kernel parameters, and is fairly robust.
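A hedged scikit-learn sketch of the SVM-KNN combination described above: the SVM (trained by SMO inside libsvm) labels points that lie far from the decision boundary, while points whose decision value falls below a threshold are re-classified by a k-NN classifier fitted on the support vectors. The threshold, k, and the use of scikit-learn are our assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

class SVMKNN:
    """Binary SVM-KNN: the SVM decides confident points, k-NN over the
    support vectors decides points near the margin."""
    def __init__(self, threshold=0.5, k=5, **svc_kwargs):
        self.threshold = threshold
        self.svm = SVC(**svc_kwargs)          # libsvm/SMO-based training
        self.knn = KNeighborsClassifier(n_neighbors=k)

    def fit(self, X, y):
        self.svm.fit(X, y)
        # Use the support vectors of each class as k-NN representative points
        sv_idx = self.svm.support_
        self.knn.fit(X[sv_idx], y[sv_idx])
        return self

    def predict(self, X):
        scores = self.svm.decision_function(X)
        pred = self.svm.predict(X)
        near_margin = np.abs(scores) < self.threshold
        if near_margin.any():
            pred[near_margin] = self.knn.predict(X[near_margin])
        return pred
```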

14.
陈刚  吴振家 《控制与决策》2020,35(3):763-768
Classification of imbalanced data is an important research topic in machine learning. In an imbalanced data set, the minority class has markedly fewer training samples than the majority class, so classification results tend to be biased toward the majority class. For the imbalanced classification problem, a probability-enhancement algorithm based on a Gaussian mixture model with expectation-maximization (GMM-EM) is proposed. First, the probability density function of the minority-class data is estimated with a Gaussian mixture model (GMM) fitted by the EM algorithm. Second, exploiting the property that samples in high-density regions are better at generating new samples than those in low-density regions, an oversampling algorithm based on the minority-class density function is constructed; it keeps the probability distribution of the minority class consistent before and after balancing, so the minority class is balanced in a way that respects the statistical properties of the data set. Finally, a decision-tree classifier is applied to the balanced data set, and evaluation metrics are used to judge the classification results. Classification experiments on eight data sets selected from the UCI and KEEL repositories show that the proposed algorithm is more effective than existing algorithms.
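A hedged sketch of the oversampling step described above: a Gaussian mixture model is fitted to the minority class by EM, new minority samples are drawn from the fitted density until the classes are balanced, and a decision tree is then trained. The number of mixture components and the use of scikit-learn are our assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeClassifier

def gmm_oversample(X, y, minority_label, n_components=3, random_state=0):
    """Fit a GMM (trained by EM) to the minority class and sample new points
    from it so that the minority class matches the rest of the data in size."""
    X_min = X[y == minority_label]
    n_needed = (y != minority_label).sum() - len(X_min)
    if n_needed <= 0:
        return X, y
    gmm = GaussianMixture(n_components=n_components, random_state=random_state).fit(X_min)
    X_new, _ = gmm.sample(n_needed)
    y_new = np.full(n_needed, minority_label)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# After balancing, train the decision tree used as the final classifier:
# X_bal, y_bal = gmm_oversample(X, y, minority_label=1)
# clf = DecisionTreeClassifier().fit(X_bal, y_bal)
```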

15.
A more generally formulated multi-class fuzzy-compensation support vector machine (M-FSVM) algorithm is proposed to address the imbalance in per-class classification error of the classical support vector machine. The new algorithm is implemented on top of the open-source LibSVM code and applied to network intrusion detection. Experimental results show that the classification accuracy of classes with few training samples is improved.

16.
Classification is used to solve countless problems. Many real-world computer vision problems, such as visual surveillance, contain uninteresting but common classes alongside interesting but rare classes. The rare classes are often unknown, and need to be discovered whilst training a classifier. Given a data set, active learning selects the members within it to be labelled for the purpose of constructing a classifier, optimising the choice to get the best classifier for the least amount of effort. We propose an active learning method for scenarios with unknown, rare classes, where the problems of classification and rare class discovery need to be tackled jointly. By assuming a non-parametric prior on the data, the goals of new class discovery and classification refinement are automatically balanced, without any tunable parameters. The ability to work with any specific classifier is maintained, so it may be used with the technique most appropriate for the problem at hand. Results are provided for a large variety of problems, demonstrating superior performance.

17.
陈筱倩  王宏远 《计算机科学》2009,36(12):183-186
For non-stationary digital modulation signals, new higher-order cross-cumulant features are constructed; the learning mechanism of neural networks is used to build a nonlinear dynamic model of an adaptive fuzzy-inference modulation recognizer; a cascaded structure with hierarchical decisions improves the fit between the features and the recognizer and minimizes redundancy in membership functions and fuzzy rules; a cascaded fuzzy neural network system embodying initial experience is built from the rough distribution of the feature samples, making the knowledge-inference structure explicit and controllable; and adaptive adjustment and optimization of the structural parameters is achieved through sample training, completing the approximation and refinement. Simulation experiments show that the system is more robust when environmental parameters such as the signal-to-noise ratio vary widely, and that its recognition rate and efficiency are clearly higher than those of neural-network and fuzzy recognizers.

18.
Supervised learning methods such as Maximum Likelihood (ML) are often used in land cover (thematic) classification of remote sensing imagery. The ML classifier relies exclusively on spectral characteristics of thematic classes whose statistical distributions (class conditional probability densities) are often overlapping. The spectral response distributions of thematic classes are dependent on many factors including elevation, soil types, and ecological zones. A second problem with statistical classifiers is the requirement of a large number of accurate training samples (10 to 30 × |dimensions|), which are often costly and time consuming to acquire over large geographic regions. With the increasing availability of geospatial databases, it is possible to exploit the knowledge derived from these ancillary datasets to improve classification accuracies even when the class distributions are highly overlapping. Likewise, newer semi-supervised techniques can be adopted to improve the parameter estimates of the statistical model by utilizing a large number of easily available unlabeled training samples. Unfortunately, there is no convenient multivariate statistical model that can be employed for multisource geospatial databases. In this paper we present a hybrid semi-supervised learning algorithm that effectively exploits freely available unlabeled training samples from multispectral remote sensing images and also incorporates ancillary geospatial databases. We have conducted several experiments on Landsat satellite image datasets, and our new hybrid approach shows over 24% to 36% improvement in overall classification accuracy over conventional classification schemes.

19.
Bayesian networks are models for uncertain reasoning which are achieving a growing importance also for the data mining task of classification. Credal networks extend Bayesian nets to sets of distributions, or credal sets. This paper extends a state-of-the-art Bayesian net for classification, called tree-augmented naive Bayes classifier, to credal sets originated from probability intervals. This extension is a basis to address the fundamental problem of prior ignorance about the distribution that generates the data, which is a commonplace in data mining applications. This issue is often neglected, but addressing it properly is a key to ultimately draw reliable conclusions from the inferred models. In this paper we formalize the new model, develop an exact linear-time classification algorithm, and evaluate the credal net-based classifier on a number of real data sets. The empirical analysis shows that the new classifier is good and reliable, and raises a problem of excessive caution that is discussed in the paper. Overall, given the favorable trade-off between expressiveness and efficient computation, the newly proposed classifier appears to be a good candidate for the wide-scale application of reliable classifiers based on credal networks, to real and complex tasks.

20.
王轩  张林  高磊  蒋昊坤 《计算机应用》2018,38(10):2772-2777
To cope with the effects of uneven sampling, a leave-one-out ensemble learning classification algorithm (LOOELCA) for symbolic data classification is proposed on top of representative-based classification. First, the leave-one-out method is used to obtain n small training sets, where n is the size of the initial training set. Then each training set is used to build an independent representative-based classifier, and the classifiers that misclassify, together with the misclassified objects, are marked. Finally, the marked classifiers and the original classifier form a committee that classifies the objects in the test set. If the committee vote is unanimous, the test object is labeled directly; otherwise, the test object is classified with the k-nearest-neighbor (kNN) algorithm using the marked objects. Experimental results on standard UCI data sets show that LOOELCA improves accuracy by 0.35 to 2.76 percentage points on average compared with the representative-based rough-set covering classification algorithm (RBC-CBNRS), and also achieves higher classification accuracy than ID3, J48, Naïve Bayes, and OneR.
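A hedged sketch of the committee logic summarized above, with an ordinary decision tree standing in for the paper's representative-based base classifier (that substitution, and the assumption that the symbolic attributes are already numerically encoded as NumPy arrays, are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def looelca_like_predict(X_train, y_train, X_test, k=3):
    """Leave-one-out committee: one base classifier per held-out training object,
    keeping only those that misclassify their held-out object, plus the full
    model. A unanimous committee labels a test object directly; otherwise the
    object falls back to k-NN over the flagged (misclassified) training objects."""
    n = len(X_train)
    full_model = DecisionTreeClassifier().fit(X_train, y_train)
    committee, flagged_idx = [full_model], []
    for i in range(n):
        mask = np.arange(n) != i
        model = DecisionTreeClassifier().fit(X_train[mask], y_train[mask])
        if model.predict(X_train[i:i + 1])[0] != y_train[i]:
            committee.append(model)          # classifier that erred on object i
            flagged_idx.append(i)            # the object it got wrong
    votes = np.array([m.predict(X_test) for m in committee])   # (n_models, n_test)
    preds = votes[0].copy()
    disagree = np.array([len(set(votes[:, j])) > 1 for j in range(votes.shape[1])])
    if disagree.any() and flagged_idx:
        knn = KNeighborsClassifier(n_neighbors=min(k, len(flagged_idx)))
        knn.fit(X_train[flagged_idx], y_train[flagged_idx])
        preds[disagree] = knn.predict(X_test[disagree])
    return preds
```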
