首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
In this paper we introduce a framework for making statistical inference on the asymptotic prediction of parallel classification ensembles. The validity of the analysis is fairly general. It only requires that the individual classifiers are generated in independent executions of some randomized learning algorithm, and that the final ensemble prediction is made via majority voting. Given an unlabeled test instance, the predictions of the classifiers in the ensemble are obtained sequentially. As the individual predictions become known, Bayes' theorem is used to update an estimate of the probability that the class predicted by the current ensemble coincides with the classification of the corresponding ensemble of infinite size. Using this estimate, the voting process can be halted when the confidence on the asymptotic prediction is sufficiently high. An empirical investigation in several benchmark classification problems shows that most of the test instances require querying only a small number of classifiers to converge to the infinite ensemble prediction with a high degree of confidence. For these instances, the difference between the generalization error of the finite ensemble and the infinite ensemble limit is very small, often negligible.  相似文献   

2.
基于集成学习的自训练算法是一种半监督算法,不少学者通过集成分类器类别投票或平均置信度的方法选择可靠样本。基于置信度的投票策略倾向选择置信度高的样本或置信度低但投票却一致的样本进行标记,后者这种情形可能会误标记靠近决策边界的样本,而采用异构集成分类器也可能会导致各基分类器对高置信度样本的类别标记不同,从而无法将其有效加入到有标记样本集。提出了结合主动学习与置信度投票策略的集成自训练算法用来解决上述问题。该算法合理调整了投票策略,选择置信度高且投票一致的无标记样本加以标注,同时利用主动学习对投票不一致而置信度较低的样本进行人工标注,以弥补集成自训练学习只关注置信度高的样本,而忽略了置信度低的样本的有用信息的缺陷。在UCI数据集上的对比实验验证了该算法的有效性。  相似文献   

3.
Ensemble learning has attracted considerable attention owing to its good generalization performance. The main issues in constructing a powerful ensemble include training a set of diverse and accurate base classifiers, and effectively combining them. Ensemble margin, computed as the difference of the vote numbers received by the correct class and the another class received with the most votes, is widely used to explain the success of ensemble learning. This definition of the ensemble margin does not consider the classification confidence of base classifiers. In this work, we explore the influence of the classification confidence of the base classifiers in ensemble learning and obtain some interesting conclusions. First, we extend the definition of ensemble margin based on the classification confidence of the base classifiers. Then, an optimization objective is designed to compute the weights of the base classifiers by minimizing the margin induced classification loss. Several strategies are tried to utilize the classification confidences and the weights. It is observed that weighted voting based on classification confidence is better than simple voting if all the base classifiers are used. In addition, ensemble pruning can further improve the performance of a weighted voting ensemble. We also compare the proposed fusion technique with some classical algorithms. The experimental results also show the effectiveness of weighted voting with classification confidence.  相似文献   

4.
Boosting Algorithms for Parallel and Distributed Learning   总被引:1,自引:0,他引:1  
The growing amount of available information and its distributed and heterogeneous nature has a major impact on the field of data mining. In this paper, we propose a framework for parallel and distributed boosting algorithms intended for efficient integrating specialized classifiers learned over very large, distributed and possibly heterogeneous databases that cannot fit into main computer memory. Boosting is a popular technique for constructing highly accurate classifier ensembles, where the classifiers are trained serially, with the weights on the training instances adaptively set according to the performance of previous classifiers. Our parallel boosting algorithm is designed for tightly coupled shared memory systems with a small number of processors, with an objective of achieving the maximal prediction accuracy in fewer iterations than boosting on a single processor. After all processors learn classifiers in parallel at each boosting round, they are combined according to the confidence of their prediction. Our distributed boosting algorithm is proposed primarily for learning from several disjoint data sites when the data cannot be merged together, although it can also be used for parallel learning where a massive data set is partitioned into several disjoint subsets for a more efficient analysis. At each boosting round, the proposed method combines classifiers from all sites and creates a classifier ensemble on each site. The final classifier is constructed as an ensemble of all classifier ensembles built on disjoint data sets. The new proposed methods applied to several data sets have shown that parallel boosting can achieve the same or even better prediction accuracy considerably faster than the standard sequential boosting. Results from the experiments also indicate that distributed boosting has comparable or slightly improved classification accuracy over standard boosting, while requiring much less memory and computational time since it uses smaller data sets.  相似文献   

5.
Many techniques have been proposed for credit risk assessment, from statistical models to artificial intelligence methods. During the last few years, different approaches to classifier ensembles have successfully been applied to credit scoring problems, demonstrating to be more accurate than single prediction models. However, it is still a question what base classifiers should be employed in each ensemble in order to achieve the highest performance. Accordingly, the present paper evaluates the performance of seven individual prediction techniques when used as members of five different ensemble methods. The ultimate aim of this study is to suggest appropriate classifiers for each ensemble approach in the context of credit scoring. The experimental results and statistical tests show that the C4.5 decision tree constitutes the best solution for most ensemble methods, closely followed by the multilayer perceptron neural network and logistic regression, whereas the nearest neighbour and the naive Bayes classifiers appear to be significantly the worst.  相似文献   

6.
The availability of a large amount of medical data leads to the need of intelligent disease prediction and analysis tools to extract hidden information. A large number of data mining and statistical analysis tools are used for disease prediction. Single data‐mining techniques show acceptable level of accuracy for heart disease diagnosis. This article focuses on prediction and analysis of heart disease using weighted vote‐based classifier ensemble technique. The proposed ensemble model overcomes the limitations of conventional data‐mining techniques by employing the ensemble of five heterogeneous classifiers: naive Bayes, decision tree based on Gini index, decision tree based on information gain, instance‐based learner, and support vector machines. We have used five benchmark heart disease data sets taken from UCI repository. Each data set contains different set of feature space that ultimately leads to the prediction of heart disease. The effectiveness of proposed ensemble classifier is investigated by comparing the performance with different researchers' techniques. Tenfold cross‐validation is used to handle the class imbalance problem. Moreover, confusion matrices and analysis of variance statistics are used to show the prediction results of all classifiers. The experimental results verify that the proposed ensemble classifier can deal with all types of attributes and it has achieved the high diagnosis accuracy of 87.37%, sensitivity of 93.75%, specificity of 92.86%, and F‐measure of 82.17%. The F‐ratio higher than the F‐critical and p‐value less than 0.01 for a 95% confidence interval indicate that the results are statistically significant for all the data sets.  相似文献   

7.
In this paper, we propose a hybrid deep neural network model for recognizing human actions in videos. A hybrid deep neural network model is designed by the fusion of homogeneous convolutional neural network (CNN) classifiers. The ensemble of classifiers is built by diversifying the input features and varying the initialization of the weights of the neural network. The convolutional neural network classifiers are trained to output a value of one, for the predicted class and a zero, for all the other classes. The outputs of the trained classifiers are considered as confidence value for prediction so that the predicted class will have a confidence value of approximately 1 and the rest of the classes will have a confidence value of approximately 0. The fusion function is computed as the maximum value of the outputs across all classifiers, to pick the correct class label during fusion. The effectiveness of the proposed approach is demonstrated on UCF50 dataset resulting in a high recognition accuracy of 99.68%.  相似文献   

8.
To build a successful customer churn prediction model, a classification algorithm should be chosen that fulfills two requirements: strong classification performance and a high level of model interpretability. In recent literature, ensemble classifiers have demonstrated superior performance in a multitude of applications and data mining contests. However, due to an increased complexity they result in models that are often difficult to interpret. In this study, GAMensPlus, an ensemble classifier based upon generalized additive models (GAMs), in which both performance and interpretability are reconciled, is presented and evaluated in a context of churn prediction modeling. The recently proposed GAMens, based upon Bagging, the Random Subspace Method and semi-parametric GAMs as constituent classifiers, is extended to include two instruments for model interpretability: generalized feature importance scores, and bootstrap confidence bands for smoothing splines. In an experimental comparison on data sets of six real-life churn prediction projects, the competitive performance of the proposed algorithm over a set of well-known benchmark algorithms is demonstrated in terms of four evaluation metrics. Further, the ability of the technique to deliver valuable insight into the drivers of customer churn is illustrated in a case study on data from a European bank. Firstly, it is shown how the generalized feature importance scores allow the analyst to identify the relative importance of churn predictors in function of the criterion that is used to measure the quality of the model predictions. Secondly, the ability of GAMensPlus to identify nonlinear relationships between predictors and churn probabilities is demonstrated.  相似文献   

9.
Previous studies about ensembles of classifiers for bankruptcy prediction and credit scoring have been presented. In these studies, different ensemble schemes for complex classifiers were applied, and the best results were obtained using the Random Subspace method. The Bagging scheme was one of the ensemble methods used in the comparison. However, it was not correctly used. It is very important to use this ensemble scheme on weak and unstable classifiers for producing diversity in the combination. In order to improve the comparison, Bagging scheme on several decision trees models is applied to bankruptcy prediction and credit scoring. Decision trees encourage diversity for the combination of classifiers. Finally, an experimental study shows that Bagging scheme on decision trees present the best results for bankruptcy prediction and credit scoring.  相似文献   

10.
The overproduce-and-choose strategy, which is divided into the overproduction and selection phases, has traditionally focused on finding the most accurate subset of classifiers at the selection phase, and using it to predict the class of all the samples in the test data set. It is therefore, a static classifier ensemble selection strategy. In this paper, we propose a dynamic overproduce-and-choose strategy which combines optimization and dynamic selection in a two-level selection phase to allow the selection of the most confident subset of classifiers to label each test sample individually. The optimization level is intended to generate a population of highly accurate candidate classifier ensembles, while the dynamic selection level applies measures of confidence to reveal the candidate ensemble with the highest degree of confidence in the current decision. Experimental results conducted to compare the proposed method to a static overproduce-and-choose strategy and a classical dynamic classifier selection approach demonstrate that our method outperforms both these selection-based methods, and is also more efficient in terms of performance than combining the decisions of all classifiers in the initial pool.  相似文献   

11.
The automatic detection of construction materials in images acquired on a construction site has been regarded as a critical topic. Recently, several data mining techniques have been used as a way to solve the problem of detecting construction materials. These studies have applied single classifiers to detect construction materials—and distinguish them from the background—by using color as a feature. Recent studies suggest that combining multiple classifiers (into what is called a heterogeneous ensemble classifier) would show better performance than using a single classifier. However, the performance of ensemble classifiers in construction material detection is not fully understood. In this study, we investigated the performance of six single classifiers and potential ensemble classifiers on three data sets: one each for concrete, steel, and wood. A heterogeneous voting-based ensemble classifier was created by selecting base classifiers which are diverse and accurate; their prediction probabilities for each target class were averaged to yield a final decision for that class. In comparison with the single classifiers, the ensemble classifiers performed better in the three data sets overall. This suggests that it is better to use an ensemble classifier to enhance the detection of construction materials in images acquired on a construction site.  相似文献   

12.
Financial distress prediction (FDP) is of great importance to both inner and outside parts of companies. Though lots of literatures have given comprehensive analysis on single classifier FDP method, ensemble method for FDP just emerged in recent years and needs to be further studied. Support vector machine (SVM) shows promising performance in FDP when compared with other single classifier methods. The contribution of this paper is to propose a new FDP method based on SVM ensemble, whose candidate single classifiers are trained by SVM algorithms with different kernel functions on different feature subsets of one initial dataset. SVM kernels such as linear, polynomial, RBF and sigmoid, and the filter feature selection/extraction methods of stepwise multi discriminant analysis (MDA), stepwise logistic regression (logit), and principal component analysis (PCA) are applied. The algorithm for selecting SVM ensemble's base classifiers from candidate ones is designed by considering both individual performance and diversity analysis. Weighted majority voting based on base classifiers’ cross validation accuracy on training dataset is used as the combination mechanism. Experimental results indicate that SVM ensemble is significantly superior to individual SVM classifier when the number of base classifiers in SVM ensemble is properly set. Besides, it also shows that RBF SVM based on features selected by stepwise MDA is a good choice for FDP when individual SVM classifier is applied.  相似文献   

13.
基于FP-Tree 的快速选择性集成算法   总被引:3,自引:1,他引:2  
赵强利  蒋艳凰  徐明 《软件学报》2011,22(4):709-721
选择性集成通过选择部分基分类器参与集成,从而提高集成分类器的泛化能力,降低预测开销.但已有的选择性集成算法普遍耗时较长,将数据挖掘的技术应用于选择性集成,提出一种基于FP-Tree(frequent pattern tree)的快速选择性集成算法:CPM-EP(coverage based pattern mining for ensemble pruning).该算法将基分类器对校验样本集的分类结果组织成一个事务数据库,从而使选择性集成问题可转化为对事务数据集的处理问题.针对所有可能的集成分类器大小,CPM-EP算法首先得到一个精简的事务数据库,并创建一棵FP-Tree树保存其内容;然后,基于该FP-Tree获得相应大小的集成分类器.在获得的所有集成分类器中,对校验样本集预测精度最高的集成分类器即为算法的输出.实验结果表明,CPM-EP算法以很低的计算开销获得优越的泛化能力,其分类器选择时间约为GASEN的1/19以及Forward-Selection的1/8,其泛化能力显著优于参与比较的其他方法,而且产生的集成分类器具有较少的基分类器.  相似文献   

14.
The problem addressed in this study concerns mining data streams with concept drift. The goal of the article is to propose and validate a new approach to mining data streams with concept-drift using the ensemble classifier constructed from the one-class base classifiers. It is assumed that base classifiers of the proposed ensemble are induced from incoming chunks of the data stream. Each chunk consists of prototypes and information about whether the class prediction of these instances, carried-out at earlier steps, has been correct. Each data chunk can be updated by using the instance selection technique when new data arrive. When a new data chunk is formed, the ensemble model is also updated on the basis of weights assigned to each one-class classifier. In this article, two well-known instance-based learning algorithms—the CNN and the ENN—have been adopted to solve the one-class classification problems and, consequently, update the proposed classifier ensemble. The proposed approaches have been validated experimentally, and the computational experiment results are shown and discussed. The experiment results prove that the proposed approach using the ensemble classifier constructed from the one-class base classifiers with instance selection for chunk updating can outperform well-known approaches for data streams with concept drift.  相似文献   

15.
Ensemble pruning deals with the reduction of base classifiers prior to combination in order to improve generalization and prediction efficiency. Existing ensemble pruning algorithms require much pruning time. This paper presents a fast pruning approach: pattern mining based ensemble pruning (PMEP). In this algorithm, the prediction results of all base classifiers are organized as a transaction database, and FP-Tree structure is used to compact the prediction results. Then a greedy pattern mining method is explored to find the ensemble of size k. After obtaining the ensembles of all possible sizes, the one with the best accuracy is outputted. Compared with Bagging, GASEN, and Forward Selection, experimental results show that PMEP achieves the best prediction accuracy and keeps the size of the final ensemble small, more importantly, its pruning time is much less than other ensemble pruning algorithms.  相似文献   

16.
Predicting future stock index price movement has always been a fascinating research area both for the investors who wish to yield a profit by trading stocks and for the researchers who attempt to expose the buried information from the complex stock market time series data. This prediction problem can be addressed as a binary classification problem with two class labels, one for the increasing movement and other for the decreasing movement. In literature, a wide range of classifiers has been tested for this application. As the performance of individual classifier varies for a diverse dataset with respect to different performance measures, it is impractical to acknowledge a specific classifier to be the best one. Hence, designing an efficient classifier ensemble instead of an individual classifier is fetching increasing attention from many researchers. Again selection of base classifiers and deciding their preferences in ensemble with respect to a variety of performance criteria can be considered as a Multi Criteria Decision Making (MCDM) problem. In this paper, an integrated TOPSIS Crow Search based weighted voting classifier ensemble is proposed for stock index price movement prediction. Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), one of the popular MCDM techniques, is suggested for ranking and selecting a set of base classifiers for the ensemble whereas the weights of the classifiers used in the ensemble are tuned by the Crow Search method. The proposed ensemble model is validated for prediction of stock index price over the historical prices of BSE SENSEX, S&P500 and NIFTY 50 stock indices. The model has shown better performance compared to individual classifiers and other ensemble models such as majority voting, weighted voting, differential evolution and particle swarm optimization based classifier ensemble.  相似文献   

17.
18.
在文本分类研究中,集成学习是一种提高分类器性能的有效方法.Bagging算法是目前流行的一种集成学习算法.针对Bagging算法弱分类器具有相同权重问题,提出一种改进的Bagging算法.该方法通过对弱分类器分类结果进行可信度计算得到投票权重,应用于Attribute Bagging算法设计了一个中文文本自动分类器.采用kNN作为弱分类器基本模型对Sogou实验室提供的新闻集进行分类.实验表明该算法比Attribute Bagging有更好的分类精度.  相似文献   

19.
Financial distress prediction is very important to financial institutions who must be able to make critical decisions regarding customer loans. Bankruptcy prediction and credit scoring are the two main aspects considered in financial distress prediction. To assist in this determination, thereby lowering the risk borne by the financial institution, it is necessary to develop effective prediction models for prediction of the likelihood of bankruptcy and estimation of credit risk. A number of financial distress prediction models have been constructed, which utilize various machine learning techniques, such as single classifiers and classifier ensembles, but improving the prediction accuracy is the major research issue. In addition, aside from improving the prediction accuracy, there have been very few studies that specifically consider lowering the Type I error. In practice, Type I errors need to receive careful consideration during model construction because they can affect the cost to the financial institution. In this study, we introduce a classifier ensemble approach designed to reduce the misclassification cost. The outputs produced by multiple classifiers are combined by utilizing the unanimous voting (UV) method to find the final prediction result. Experimental results obtained based on four relevant datasets show that our UV ensemble approach outperforms the baseline single classifiers and classifier ensembles. Specifically, the UV ensemble not only provides relatively good prediction accuracy and minimizes Type I/II errors, but also produces the smallest misclassification cost.  相似文献   

20.
Ensemble tracking   总被引:4,自引:0,他引:4  
We consider tracking as a binary classification problem, where an ensemble of weak classifiers is trained online to distinguish between the object and the background. The ensemble of weak classifiers is combined into a strong classifier using AdaBoost. The strong classifier is then used to label pixels in the next frame as either belonging to the object or the background, giving a confidence map. The peak of the map and, hence, the new position of the object, is found using mean shift. Temporal coherence is maintained by updating the ensemble with new weak classifiers that are trained online during tracking. We show a realization of this method and demonstrate it on several video sequences  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号