首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 640 毫秒
1.
Data stream classification is a hot topic in data mining research. The great challenge is that the class priors may evolve along the data sequence. Algorithms have been proposed to estimate the dynamic class priors and adjust the classifier accordingly. However, the existing algorithms do not perform well on prior estimation due to the lack of samples from the target distribution. Sample size has great effects in parameter estimation and small-sample effects greatly contaminate the estimation performance. In this paper, we propose a novel parameter estimation method called transfer estimation. Transfer estimation makes use of samples not only from the target distribution but also from similar distributions. We apply this new estimation method to the existing algorithms and obtain an improved algorithm. Experiments on both synthetic and real data sets show that the improved algorithm outperforms the existing algorithms on both class prior estimation and classification.  相似文献   

2.
A Bayesian approach to joint feature selection and classifier design   总被引:5,自引:0,他引:5  
This paper adopts a Bayesian approach to simultaneously learn both an optimal nonlinear classifier and a subset of predictor variables (or features) that are most relevant to the classification task. The approach uses heavy-tailed priors to promote sparsity in the utilization of both basis functions and features; these priors act as regularizers for the likelihood function that rewards good classification on the training data. We derive an expectation- maximization (EM) algorithm to efficiently compute a maximum a posteriori (MAP) point estimate of the various parameters. The algorithm is an extension of recent state-of-the-art sparse Bayesian classifiers, which in turn can be seen as Bayesian counterparts of support vector machines. Experimental comparisons using kernel classifiers demonstrate both parsimonious feature selection and excellent classification accuracy on a range of synthetic and benchmark data sets.  相似文献   

3.
互联网环境日新月异,使得网络数据流中存在概念漂移,对数据流的分类也由传统的静态分类变为动态分类,而如何对概念漂移进行检测是动态分类的关键。本文提出一种基于概念漂移检测的网络数据流自适应分类算法,通过比较滑动窗口中数据与历史数据的分布差异来检测概念漂移,然后将窗口中数据过采样来减少样本间的不均衡性,最后将处理后的数据集输入到OS-ELM分类器中进行在线学习,从而更新分类器使其应对数据流中的概念漂移。本文在MOA实验平台中使用合成数据集和真实数据集对提出的算法进行验证,结果表明,该算法较集成学习算法在分类准确率和稳定性上有一定的提升,并且随着数据流量的增加,时间性能上的优势开始体现,适合复杂多变的网络环境。  相似文献   

4.
Imbalance classification techniques have been frequently applied in many machine learning application domains where the number of the majority (or positive) class of a dataset is much larger than that of the minority (or negative) class. Meanwhile, feature selection (FS) is one of the key techniques for the high-dimensional classification task in a manner which greatly improves the classification performance and the computational efficiency. However, most studies of feature selection and imbalance classification are restricted to off-line batch learning, which is not well adapted to some practical scenarios. In this paper, we aim to solve high-dimensional imbalanced classification problem accurately and efficiently with only a small number of active features in an online fashion, and we propose two novel online learning algorithms for this purpose. In our approach, a classifier which involves only a small and fixed number of features is constructed to classify a sequence of imbalanced data received in an online manner. We formulate the construction of such online learner into an optimization problem and use an iterative approach to solve the problem based on the passive-aggressive (PA) algorithm as well as a truncated gradient (TG) method. We evaluate the performance of the proposed algorithms based on several real-world datasets, and our experimental results have demonstrated the effectiveness of the proposed algorithms in comparison with the baselines.  相似文献   

5.
In this paper, a new procedure to continuously adjust weights in a multi-layered neural network is proposed. The network is initially trained by using a traditional backpropagation algorithm. After this first step, a non-linear programming technique is used to properly calculate the new weights sets online. This methodology is tailored to be used in time varying (non-stationary) models, eliminating the necessity for retraining. Numerical results for a controlled experiment and for real data are presented.  相似文献   

6.
蔡崇超  王士同 《计算机应用》2007,27(5):1235-1237
在Bernoulli混合模型和期望最大化(EM)算法的基础上给出了一种基于不完整数据的改进方法。首先在已标记数据的基础上通过Bernoulli混合模型和朴素贝叶斯算法得到似然函数参数估计初始值, 然后利用含有权值的EM算法对分类器的先验概率模型进行参数估计,得到最终的分类器。实验结果表明,该方法在准确率和查全率方面要优于朴素贝叶斯文本分类。  相似文献   

7.
In this paper, we propose a novel supervised dimension reduction algorithm based on K-nearest neighbor (KNN) classifier. The proposed algorithm reduces the dimension of data in order to improve the accuracy of the KNN classification. This heuristic algorithm proposes independent dimensions which decrease Euclidean distance of a sample data and its K-nearest within-class neighbors and increase Euclidean distance of that sample and its M-nearest between-class neighbors. This algorithm is a linear dimension reduction algorithm which produces a mapping matrix for projecting data into low dimension. The dimension reduction step is followed by a KNN classifier. Therefore, it is applicable for high-dimensional multiclass classification. Experiments with artificial data such as Helix and Twin-peaks show ability of the algorithm for data visualization. This algorithm is compared with state-of-the-art algorithms in classification of eight different multiclass data sets from UCI collection. Simulation results have shown that the proposed algorithm outperforms the existing algorithms. Visual place classification is an important problem for intelligent mobile robots which not only deals with high-dimensional data but also has to solve a multiclass classification problem. A proper dimension reduction method is usually needed to decrease computation and memory complexity of algorithms in large environments. Therefore, our method is very well suited for this problem. We extract color histogram of omnidirectional camera images as primary features, reduce the features into a low-dimensional space and apply a KNN classifier. Results of experiments on five real data sets showed superiority of the proposed algorithm against others.  相似文献   

8.
It sometimes happens (for instance in case control studies) that a classifier is trained on a data set that does not reflect the true a priori probabilities of the target classes on real-world data. This may have a negative effect on the classification accuracy obtained on the real-world data set, especially when the classifier's decisions are based on the a posteriori probabilities of class membership. Indeed, in this case, the trained classifier provides estimates of the a posteriori probabilities that are not valid for this real-world data set (they rely on the a priori probabilities of the training set). Applying the classifier as is (without correcting its outputs with respect to these new conditions) on this new data set may thus be suboptimal. In this note, we present a simple iterative procedure for adjusting the outputs of the trained classifier with respect to these new a priori probabilities without having to refit the model, even when these probabilities are not known in advance. As a by-product, estimates of the new a priori probabilities are also obtained. This iterative algorithm is a straightforward instance of the expectation-maximization (EM) algorithm and is shown to maximize the likelihood of the new data. Thereafter, we discuss a statistical test that can be applied to decide if the a priori class probabilities have changed from the training set to the real-world data. The procedure is illustrated on different classification problems involving a multilayer neural network, and comparisons with a standard procedure for a priori probability estimation are provided. Our original method, based on the EM algorithm, is shown to be superior to the standard one for a priori probability estimation. Experimental results also indicate that the classifier with adjusted outputs always performs better than the original one in terms of classification accuracy, when the a priori probability conditions differ from the training set to the real-world data. The gain in classification accuracy can be significant.  相似文献   

9.
It is an actual and challenging issue to learn cost-sensitive models from those datasets that are with few labeled data and plentiful unlabeled data, because some time labeled data are very difficult, time consuming and/or expensive to obtain. To solve this issue, in this paper we proposed two classification strategies to learn cost-sensitive classifier from training datasets with both labeled and unlabeled data, based on Expectation Maximization (EM). The first method, Direct-EM, uses EM to build a semi-supervised classifier, then directly computes the optimal class label for each test example using the class probability produced by the learning model. The second method, CS-EM, modifies EM by incorporating misclassification cost into the probability estimation process. We conducted extensive experiments to evaluate the efficiency, and results show that when using only a small number of labeled training examples, the CS-EM outperforms the other competing methods on majority of the selected UCI data sets across different cost ratios, especially when cost ratio is high.  相似文献   

10.
基于单类分类器的半监督学习   总被引:1,自引:0,他引:1  
提出一种结合单类学习器和集成学习优点的Ensemble one-class半监督学习算法.该算法首先为少量有标识数据中的两类数据分别建立两个单类分类器.然后用建立好的两个单类分类器共同对无标识样本进行识别,利用已识别的无标识样本对已建立的两个分类面进行调整、优化.最终被识别出来的无标识数据和有标识数据集合在一起训练一个基分类器,多个基分类器集成在一起对测试样本的测试结果进行投票.在5个UCI数据集上进行实验表明,该算法与tri-training算法相比平均识别精度提高4.5%,与仅采用纯有标识数据的单类分类器相比,平均识别精度提高8.9%.从实验结果可以看出,该算法在解决半监督问题上是有效的.  相似文献   

11.
In this paper, we present a novel spatially constrained generative model and an expectation-maximization (EM) algorithm for model-based image segmentation. The generative model assumes that the unobserved class labels of neighboring pixels in the image are generated by prior distributions with similar parameters, where similarity is defined by entropic quantities relating to the neighboring priors. In order to estimate model parameters from observations, we derive a spatially constrained EM algorithm that iteratively maximizes a lower bound on the data log-likelihood, where the penalty term is data-dependent. Our algorithm is very easy to implement and is similar to the standard EM algorithm for Gaussian mixtures with the main difference that the labels posteriors are "smoothed" over pixels between each E- and M-step by a standard image filter. Experiments on synthetic and real images show that our algorithm achieves competitive segmentation results compared to other Markov-based methods, and is in general faster  相似文献   

12.
It is estimated that 70% or more of broadband bandwidth is consumed by transmitting music, games, video and other content through Peer-to-Peer (P2P) clients. In order to detect, identify, and manage P2P traffic, some port, payload and transport layer feature based methods were proposed. Most of them were applied to offline traffic classification mainly due to the performance reason. In this paper, a network processors (NPs) based online hybrid traffic classifier is proposed. The designed hardware classifier is able to classify P2P traffic based on the static characteristic namely on line speed, and the Flexible Neural Tree(FNT) based software classifier helps learning and selecting P2P traffic attributes from the statistical characteristics of the P2P traffic. Experiment results illustrate that the hybrid classifier performs well for online classification of P2P traffic from gigabit network. The proposed framework also depicts good expansion capabilities to add new P2P features and to adapt to new P2P applications online.  相似文献   

13.
Li Y  Guan C 《Neural computation》2006,18(11):2730-2761
For many electroencephalogram (EEG)-based brain-computer interfaces (BCIs), a tedious and time-consuming training process is needed to set parameters. In BCI Competition 2005, reducing the training process was explicitly proposed as a task. Furthermore, an effective BCI system needs to be adaptive to dynamic variations of brain signals; that is, its parameters need to be adjusted online. In this article, we introduce an extended expectation maximization (EM) algorithm, where the extraction and classification of common spatial pattern (CSP) features are performed jointly and iteratively. In each iteration, the training data set is updated using all or part of the test data and the labels predicted in the previous iteration. Based on the updated training data set, the CSP features are reextracted and classified using a standard EM algorithm. Since the training data set is updated frequently, the initial training data set can be small (semi-supervised case) or null (unsupervised case). During the above iterations, the parameters of the Bayes classifier and the CSP transformation matrix are also updated concurrently. In online situations, we can still run the training process to adjust the system parameters using unlabeled data while a subject is using the BCI system. The effectiveness of the algorithm depends on the robustness of CSP feature to noise and iteration convergence, which are discussed in this article. Our proposed approach has been applied to data set IVa of BCI Competition 2005. The data analysis results show that we can obtain satisfying prediction accuracy using our algorithm in the semisupervised and unsupervised cases. The convergence of the algorithm and robustness of CSP feature are also demonstrated in our data analysis.  相似文献   

14.
目前数据流分类算法大多是基于类分布这一理想状态,然而在真实数据流环境中数据分布往往是不均衡的,并且数据流中往往伴随着概念漂移。针对数据流中的不均衡问题和概念漂移问题,提出了一种新的基于集成学习的不均衡数据流分类算法。首先为了解决数据流的不均衡问题,在训练模型前加入混合采样方法平衡数据集,然后采用基分类器加权和淘汰策略处理概念漂移问题,从而提高分类器的分类性能。最后与经典数据流分类算法在人工数据集和真实数据集上进行对比实验,实验结果表明,本文提出的算法在含有概念漂移和不均衡的数据流环境中,其整体分类性能优于其他算法的。  相似文献   

15.
针对现有学习算法难以有效提高不均衡在线贯序数据中少类样本分类精度的问题,提出一种基于不均衡样本重构的加权在线贯序极限学习机。该算法从提取在线贯序数据的分布特性入手,主要包括离线和在线两个阶段:离线阶段主要采用主曲线构建少类样本的可信区域,并通过对该区域内样本进行过采样,来构建符合样本分布趋势的均衡样本集,进而建立初始模型;而在线阶段则对贯序到达的数据根据训练误差赋予各样本相应权重,同时动态更新网络权值。采用UCI标准数据集和澳门实测气象数据进行实验对比,结果表明,与现有在线贯序极限学习机(OS-ELM)、极限学习机(ELM)和元认知在线贯序极限学习机(MCOS-ELM)相比,所提算法对少类样本的识别能力更高,且所提算法的模型训练时间与其他三种算法相差不大。结果表明在不影响算法复杂度的情况下,所提算法能有效提高少类样本的分类精度。  相似文献   

16.
In this paper, we study the problem of learning from multiple model data for the purpose of document classification. In this problem, each document is composed of two different models of data, i.e., an image and a text. We propose to represent the data of two models by projecting them to a shared data space by using cross-model factor analysis formula and classify them in the shared space by using a linear class label predictor, named cross-model classifier. The parameters of both cross-model classifier and cross-model factor analysis are learned jointly, so that they can regularize the learning of each other. We construct a unified objective function for this learning problem. With this objective function, we minimize the distance between the projections of image and text of the same document, and the classification error of the projections measured by hinge loss function. The objective function is optimized by an alternate optimization strategy in an iterative algorithm. Experiments in two different multiple model document data sets show the advantage of the proposed algorithm over state-of-the-art multimedia data classification methods.  相似文献   

17.
Vector quantization is a useful approach for multi-dimensional data compression and pattern classification. One of the most popular techniques for vector quantization design is the LBG (Linde, Buzo, Gray) algorithm. To address the problem of producing poor estimate of vector centroids which are subjected to biased data in vector quantization; we propose a fuzzy declustering strategy for the LBG algorithm. The proposed technique calculates appropriate declustering weights to adjust the global data distribution. Using the result of fuzzy declustering-based vector quantization design, we incorporate the notion of fuzzy partition entropy into the distortion measures that can be useful for classification of spectral features. Experimental results obtained from simulated and real data sets demonstrate the effective performance of the proposed approach.  相似文献   

18.
针对海量网页在线自动高效获取网页分类系统设计中如何更有效地平衡准确度与资源开销之间的矛盾问题,提出一种基于级联式分类器的网页分类方法。该方法利用级联策略,将在线与离线网页分类方法结合,各取所长。级联分类系统的一级分类采用在线分类方法,仅利用锚文本中网页标题包含的特征预测其分类,同时计算分类结果的置信度,分类结果的置信度由分类后验概率分布的信息熵度量。若置信度高于阈值(该阈值采用多目标粒子群优化算法预先计算取得),则触发二级分类器。二级分类器从下载的网页正文中提取特征,利用预先基于网页正文特征训练的分类器进行离线分类。结果表明,相对于单独的在线法和离线法,级联分类系统的F1值分别提升了10.85%和4.57%,并且级联分类系统的效率比在线法未降低很多(30%左右),而比离线法的效率提升了约70%。级联式分类系统不仅具有更高的分类能力,而且显著地减少了分类的计算开销与带宽消耗。  相似文献   

19.
Incremental learning has been used extensively for data stream classification. Most attention on the data stream classification paid on non-evolutionary methods. In this paper, we introduce new incremental learning algorithms based on harmony search. We first propose a new classification algorithm for the classification of batch data called harmony-based classifier and then give its incremental version for classification of data streams called incremental harmony-based classifier. Finally, we improve it to reduce its computational overhead in absence of drifts and increase its robustness in presence of noise. This improved version is called improved incremental harmony-based classifier. The proposed methods are evaluated on some real world and synthetic data sets. Experimental results show that the proposed batch classifier outperforms some batch classifiers and also the proposed incremental methods can effectively address the issues usually encountered in the data stream environments. Improved incremental harmony-based classifier has significantly better speed and accuracy on capturing concept drifts than the non-incremental harmony based method and its accuracy is comparable to non-evolutionary algorithms. The experimental results also show the robustness of improved incremental harmony-based classifier.  相似文献   

20.
杨丹  陈默  王刚  孙良旭 《计算机科学》2017,44(3):215-219
已有的传统的实体识别技术大多是以线下、非实时的方式,在静态数据集上进行,对于大数据集的执行通常需要大量的时间和系统资源。对于异构信息空间中具有时间信息、不断演化的异构实体来说,时间感知的查询时实体识别与数据融合越来越成为一种保证数据质量和满足用户需求的发展趋势。针对异构信息空间中使用时间上下文的关键字查询进行的实体搜索,提出一种时间感知的查询时实体识别与数据融合方法TQ-ER,以给用户提供准确的实体概貌(entity profile);提出一种迭代式时间感知的实体候选集生成算法。TQ-ER充分利用查询的时间上下文和实体的时间信息给正确的回答一个给定查询所需要的、最少的实体数据,以进行识别与数据融合。在真实数据集上的大量实验结果表明了TQ-ER的有效性和正确性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号