Similar Articles
20 similar articles found.
1.
We propose a new ensemble algorithm called the Convex Hull Ensemble Machine (CHEM). CHEM is first developed in Hilbert space and then adapted to regression and classification problems. We prove that the ensemble model converges to the optimal model in Hilbert space under regularity conditions. Empirical studies reveal that, for classification problems, CHEM attains prediction accuracy similar to that of boosting, but is much more robust to output noise and never overfits, even on datasets where boosting does. For regression problems, CHEM is competitive with other ensemble methods such as gradient boosting and bagging.
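To make the convex-combination idea concrete, here is a toy regression sketch in the spirit of CHEM (the Hilbert-space construction and the paper's exact update rule are simplified away; the stump learner, bootstrap sampling, and line-search grid are all illustrative assumptions). The ensemble prediction is kept inside the convex hull of weak-learner predictions by mixing F <- (1 - a) * F + a * h with a in [0, 1]:

```python
import random

random.seed(0)

def fit_stump(xs, ys):
    """Weak learner: best single-threshold rule on a 1-D input."""
    best = None
    for t in xs:
        left = [y for x, y in zip(xs, ys) if x <= t] or [0.0]
        right = [y for x, y in zip(xs, ys) if x > t] or [0.0]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def chem_like_fit(xs, ys, rounds=20):
    preds = [0.0] * len(xs)   # current ensemble prediction F(x_i)
    models = []               # list of (weight, stump); weights stay convex
    for _ in range(rounds):
        # fit the weak learner on a bootstrap sample to inject diversity
        idx = [random.randrange(len(xs)) for _ in xs]
        h = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        # line search over mixing weights a in (0, 1]
        best_a, best_err = 0.0, sum((y - p) ** 2 for y, p in zip(ys, preds))
        for a in [i / 20 for i in range(1, 21)]:
            err = sum((y - ((1 - a) * p + a * h(x))) ** 2
                      for x, y, p in zip(xs, ys, preds))
            if err < best_err:
                best_a, best_err = a, err
        if best_a > 0:
            preds = [(1 - best_a) * p + best_a * h(x) for x, p in zip(xs, preds)]
            models = [(w * (1 - best_a), m) for w, m in models] + [(best_a, h)]
    return models

def chem_like_predict(models, x):
    return sum(w * m(x) for w, m in models)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.0, 0.2, 1.0, 1.1, 0.9]   # roughly a step function
models = chem_like_fit(xs, ys)
fit_err = sum((y - chem_like_predict(models, x)) ** 2 for x, y in zip(xs, ys))
```

Because every update is a convex mixture, the accumulated weights remain non-negative and sum to at most one, which is the property behind CHEM's robustness to output noise.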

2.
Ensembles that combine the decisions of classifiers generated by using perturbed versions of the training set where the classes of the training examples are randomly switched can produce a significant error reduction, provided that large numbers of units and high class switching rates are used. The classifiers generated by this procedure have statistically uncorrelated errors in the training set. Hence, the ensembles they form exhibit a similar dependence of the training error on ensemble size, independently of the classification problem. In particular, for binary classification problems, the classification performance of the ensemble on the training data can be analysed in terms of a Bernoulli process. Experiments on several UCI datasets demonstrate the improvements in classification accuracy that can be obtained using these class-switching ensembles.
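A minimal sketch of the class-switching scheme described above, under simplifying assumptions (a hypothetical 1-D toy problem and a 1-nearest-neighbour base learner standing in for the paper's units): each member is trained on a copy of the data in which a fraction p of the binary labels has been flipped, and the ensemble decides by majority vote.

```python
import random
from collections import Counter

random.seed(1)

def switch_labels(ys, p):
    """Flip each binary label with probability p (the class-switching rate)."""
    return [1 - y if random.random() < p else y for y in ys]

def fit_1nn(xs, ys):
    """Base learner: 1-nearest neighbour on a 1-D input."""
    def predict(x):
        i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return ys[i]
    return predict

def class_switch_ensemble(xs, ys, n_members=51, p=0.2):
    return [fit_1nn(xs, switch_labels(ys, p)) for _ in range(n_members)]

def vote(members, x):
    return Counter(m(x) for m in members).most_common(1)[0][0]

# toy 1-D binary problem: class 1 for x >= 5
xs = list(range(10))
ys = [0] * 5 + [1] * 5
members = class_switch_ensemble(xs, ys)
train_acc = sum(vote(members, x) == y for x, y in zip(xs, ys)) / len(xs)
```

With p = 0.2, each training point keeps its true label in roughly 80% of the members, so the majority vote recovers the clean labels once the ensemble is large enough, which is the Bernoulli-process behaviour the abstract refers to.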

3.
This paper proposes random feature weights (RFW), a method for constructing ensembles of decision trees. Like Random Forest, it introduces randomness into the construction of the decision trees, but whereas Random Forest considers only a random subset of attributes at each node, RFW considers all of them. Its source of randomness is a weight associated with each attribute. All the nodes in a tree use the same set of random weights, which differs from the set used in other trees; the importance given to the attributes therefore varies from tree to tree, differentiating their construction. The method is compared with Bagging, Random Forest, Random Subspaces, AdaBoost and MultiBoost, with favourable results for RFW, especially on noisy data sets. RFW can also be combined with these methods; generally, combining RFW with another method produces better results than the methods being combined. Kappa-error diagrams and Kappa-error movement diagrams are used to analyse the relationship between the accuracies of the base classifiers and their diversity.
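The RFW mechanism can be sketched as follows (a hedged illustration, not the paper's exact algorithm: the "tree" is reduced to a single decision stump, the split criterion is Gini gain, and the weight distribution is a plain uniform draw raised to an illustrative exponent). One random weight is drawn per attribute and fixed for the whole tree, and the split criterion is multiplied by that weight, so all attributes remain candidates but their influence differs across trees:

```python
import random

random.seed(2)

def gini_gain(rows, labels, f, t):
    def gini(ls):
        if not ls:
            return 0.0
        p = sum(ls) / len(ls)
        return 2 * p * (1 - p)
    left = [y for r, y in zip(rows, labels) if r[f] <= t]
    right = [y for r, y in zip(rows, labels) if r[f] > t]
    n = len(labels)
    return gini(labels) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

def fit_rfw_stump(rows, labels, power=1.0):
    n_features = len(rows[0])
    # one random weight per attribute, fixed for the whole tree
    w = [random.random() ** power for _ in range(n_features)]
    best = None
    for f in range(n_features):
        for t in sorted({r[f] for r in rows}):
            score = w[f] * gini_gain(rows, labels, f, t)  # weighted criterion
            if best is None or score > best[0]:
                best = (score, f, t)
    _, f, t = best
    left = [y for r, y in zip(rows, labels) if r[f] <= t]
    right = [y for r, y in zip(rows, labels) if r[f] > t]
    lmaj = round(sum(left) / len(left)) if left else 0
    rmaj = round(sum(right) / len(right)) if right else 0
    return lambda r: lmaj if r[f] <= t else rmaj

rows = [(0, 0), (1, 0), (0, 1), (1, 1)]
labels = [0, 0, 1, 1]          # the label equals the second feature
stumps = [fit_rfw_stump(rows, labels) for _ in range(11)]
```

Here the informative feature still wins every split because the weights only rescale a non-zero gain; on realistic data with several partially informative attributes, the per-tree weights change which attribute wins, which is the source of diversity RFW exploits.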

4.
Combining Different Methods and Numbers of Weak Decision Trees   (total citations: 1; self-citations: 0; citations by others: 1)
Several ways of manipulating a training set have shown that combining weakened classifiers can improve prediction accuracy. In the present paper, we focus on learning-set sampling (Breiman's Bagging) and random feature-subset selection (Ho's Random Subspaces). We present a combination scheme labelled 'Bagfs', in which new learning sets are generated on the basis of both bootstrap replicates and random subspaces. The performance of the three methods (Bagging, Random Subspaces and Bagfs) is compared with the standard AdaBoost algorithm. All four methods are assessed by means of a decision-tree inducer (C4.5). In addition, we study whether the number of classifiers and the way in which they are created have a significant influence on the performance of their combination. To answer these two questions, we applied the McNemar test of significance and the Kappa degree of agreement. The results, obtained on 23 conventional databases, show that on average, Bagfs exhibits the best agreement between prediction and supervision. Received: 17 November 2000, Received in revised form: 30 October 2001, Accepted: 13 December 2001
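The Bagfs learning-set generation step can be sketched directly (a hedged illustration: the example sizes, feature counts, and subspace size below are arbitrary assumptions, and the downstream C4.5 inducer is not reproduced). Each new learning set pairs a bootstrap replicate of the example indices with a random subspace of the feature indices:

```python
import random

random.seed(3)

def bagfs_learning_sets(n_examples, n_features, n_sets, subspace_size):
    """Generate (bootstrap-indices, feature-subset) pairs, Bagfs-style."""
    sets = []
    for _ in range(n_sets):
        # Bagging-style perturbation: sample examples with replacement
        boot = [random.randrange(n_examples) for _ in range(n_examples)]
        # Random-Subspaces-style perturbation: sample features without replacement
        feats = sorted(random.sample(range(n_features), subspace_size))
        sets.append((boot, feats))
    return sets

sets = bagfs_learning_sets(n_examples=100, n_features=10, n_sets=5,
                           subspace_size=4)
```

Each `(boot, feats)` pair would then be handed to the tree inducer, restricted to the chosen features, so every committee member sees a doubly perturbed view of the data.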

5.
Enterprise credit risk assessment has long been regarded as a critical topic, and many statistical and intelligent methods have been explored for it; however, there is no consistent conclusion on which methods are better. Recent research suggests that combining multiple classifiers, i.e., ensemble learning, may yield better performance. In this paper, we propose a new hybrid ensemble approach, called RSB-SVM, which is based on two popular ensemble strategies, bagging and random subspace, and uses the Support Vector Machine (SVM) as the base learner. Because two different factors, bootstrap selection of instances and random selection of features, encourage diversity in RSB-SVM, it can be expected to achieve better performance. An enterprise credit risk dataset, which includes the financial records of 239 companies and was collected by the Industrial and Commercial Bank of China, is used to demonstrate the effectiveness and feasibility of the proposed method. Experimental results reveal that RSB-SVM can serve as an alternative method for enterprise credit risk assessment.

6.
A Bagging-Based Combined k-NN Prediction Model and Method   (total citations: 1; self-citations: 0; citations by others: 1)
The k-nearest-neighbour method predicts with a single value of k, so it cannot accommodate the feature differences that may exist between instances, and overall prediction accuracy is hard to guarantee. To address this problem, a Bagging-based combined k-NN prediction model is proposed, and on this basis a Bgk-NN prediction method with attribute selection is implemented. The method builds a collection of individualized prediction models through training; each model independently generates a prediction for an unknown instance, and the median of these predictions is taken as the combined result. Bgk-NN prediction is applicable to data sets containing both discrete-valued and continuous-valued attributes. Experiments on standard data sets show that the prediction accuracy of Bgk-NN is clearly higher than that of the traditional k-NN method.
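A toy sketch of the Bgk-NN combination step (the data, k value, and model count are illustrative assumptions, and the attribute-selection stage is omitted for brevity): several k-NN regressors are trained on bootstrap samples and their outputs are combined by the median.

```python
import random
from statistics import median

random.seed(4)

def knn_predict(train, x, k=3):
    """Average the targets of the k nearest training points (1-D inputs)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def bgknn_fit(train, n_models=9):
    """Each 'model' is simply a bootstrap replicate of the training set."""
    return [[random.choice(train) for _ in train] for _ in range(n_models)]

def bgknn_predict(models, x, k=3):
    # median combination makes the ensemble robust to outlying members
    return median(knn_predict(boot, x, k) for boot in models)

train = [(float(i), 2.0 * i) for i in range(20)]   # y = 2x
models = bgknn_fit(train)
pred = bgknn_predict(models, 10.0)
```

The median (rather than the mean) is what the abstract prescribes for combining the individual predictions, since it discards the occasional bootstrap model whose local neighbourhood is badly sampled.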

7.
Protein subcellular localization plays a vital role in understanding protein behaviour under different circumstances. The effectiveness of various drugs can be assessed through the successful prediction of protein locations, so it is important to develop a prediction system that is sufficiently reliable and accurate. The main obstacle to developing such a reliable, high-throughput system, however, is imbalanced data, which greatly degrades the performance of a predictor. To remedy this problem, we applied oversampling via the Synthetic Minority Oversampling TEchnique (SMOTE). In addition, different feature extraction strategies and ensemble classification techniques are assessed for their contribution towards solving the challenging subcellular localization problem. After applying the SMOTE data-balancing technique, a remarkable improvement is observed in the performance of the random forest and rotation forest ensemble classifiers on the CHOM, CHOA and VeroA datasets. We anticipate that the proposed model will be helpful to the research community in functional and structural proteomics as well as in drug discovery.
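A minimal SMOTE-style oversampling sketch in pure Python (the minority points below are hypothetical, and the downstream random forest / rotation forest classifiers from the abstract are not reproduced): each synthetic sample interpolates between a minority point and one of its k nearest minority neighbours.

```python
import random

random.seed(5)

def smote(minority, n_new, k=3):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    synthetic = []
    for _ in range(n_new):
        p = random.choice(minority)
        # k nearest minority neighbours of p (excluding p itself)
        neigh = sorted((q for q in minority if q is not p),
                       key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))[:k]
        q = random.choice(neigh)
        gap = random.random()   # position along the segment from p to q
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority points, the oversampled class stays inside its own region of feature space instead of being duplicated verbatim, which is what distinguishes SMOTE from naive random oversampling.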

8.
A cluster ensemble first generates a large library of different clustering solutions and then combines them into a more accurate consensus clustering. It is commonly accepted that for a cluster ensemble to work well, the member partitions should differ from each other while the quality of each partition remains at an acceptable level. Many strategies have been used to generate the base partitions. As in ensemble classification, most studies focus on producing different partitions of the original dataset, i.e., clustering different subsets (e.g., obtained by random sampling) or clustering in different feature spaces (e.g., obtained by random projection); however, little attention has been paid to the diversity and quality of the partitions these two approaches produce. In this paper, we propose a novel cluster generation method based on random sampling that uses the nearest-neighbour method to fill in the cluster labels of the samples missing from each subset (abbreviated RS-NN). We evaluate its performance against a k-means ensemble, a typical random projection method (Random Feature Subset, abbreviated FS), and another random sampling method (Random Sampling based on Nearest Centroid, abbreviated RS-NC). Experimental results indicate that the FS method always generates more diverse partitions while the RS-NC method generates higher-quality ones. Our proposed method, RS-NN, generates base partitions with a good balance between quality and diversity and achieves significant improvement over the alternatives. Furthermore, to introduce more diversity, we propose a dual random sampling method that combines the RS-NN and FS methods; it achieves higher diversity with good quality on most datasets.

9.
Random Forests receive much attention from researchers because of their excellent performance. As Breiman suggested, the performance of Random Forests depends on the strength of the weak learners in the forest and the diversity among them. In the literature, however, many researchers have considered only pre-processing of the data or post-processing of the Random Forests models. In this paper, we propose a new method to increase the diversity of each tree in the forest and thereby improve overall accuracy. During the training of each individual tree, different rotation spaces are concatenated into a higher-dimensional space at the root node, and the best split is exhaustively searched within this space. The location of the best split determines which rotation method is used for all subsequent nodes. The proposed method is evaluated on 42 benchmark data sets from various research fields and compared with standard Random Forests. The results show that it improves the performance of Random Forests in most cases.

10.
Ensemble learning has attracted considerable attention owing to its good generalization performance. The main issues in constructing a powerful ensemble are training a set of diverse and accurate base classifiers and combining them effectively. The ensemble margin, computed as the difference between the number of votes received by the correct class and the highest number of votes received by any other class, is widely used to explain the success of ensemble learning; this definition, however, ignores the classification confidence of the base classifiers. In this work, we explore the influence of base-classifier confidence in ensemble learning and obtain some interesting conclusions. First, we extend the definition of the ensemble margin to incorporate the classification confidence of the base classifiers. Then, an optimization objective is designed to compute the weights of the base classifiers by minimizing the margin-induced classification loss. Several strategies for utilizing the classification confidences and the weights are tried. We observe that weighted voting based on classification confidence is better than simple voting when all the base classifiers are used, and that ensemble pruning can further improve the performance of a weighted-voting ensemble. We also compare the proposed fusion technique with some classical algorithms; the experimental results confirm the effectiveness of weighted voting with classification confidence.
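The contrast between simple voting, confidence-weighted voting, and a confidence-based margin can be shown in a few lines (the classifier outputs and confidences below are hypothetical, and the margin formula is a simple normalised illustration rather than the paper's exact extended definition):

```python
from collections import Counter

def simple_vote(outputs):
    """outputs: list of (predicted_class, confidence); ignore confidence."""
    return Counter(c for c, _ in outputs).most_common(1)[0][0]

def weighted_vote(outputs):
    """Sum confidences per class instead of counting raw votes."""
    scores = Counter()
    for c, conf in outputs:
        scores[c] += conf
    return scores.most_common(1)[0][0]

def confidence_margin(outputs, true_class):
    """Support for the true class minus the strongest support for any other
    class, normalised by total confidence (positive = correctly classified)."""
    scores = Counter()
    for c, conf in outputs:
        scores[c] += conf
    total = sum(scores.values())
    others = [v for c, v in scores.items() if c != true_class] or [0.0]
    return (scores.get(true_class, 0.0) - max(others)) / total

# two confident correct voters vs three barely confident wrong voters
outputs = [("A", 0.9), ("A", 0.8), ("B", 0.4), ("B", 0.3), ("B", 0.2)]
```

On this example, simple voting follows the three low-confidence voters and picks "B", while confidence weighting picks "A" (support 1.7 vs 0.9), illustrating why the abstract finds weighted voting superior when all base classifiers are kept.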

11.
Ensemble pruning reduces the set of base classifiers prior to combination in order to improve generalization and prediction efficiency, but existing algorithms require considerable pruning time. This paper presents a fast approach: pattern-mining-based ensemble pruning (PMEP). In this algorithm, the prediction results of all base classifiers are organized as a transaction database, and an FP-Tree structure is used to compact them. A greedy pattern mining method then finds the best ensemble of each size k, and after obtaining the ensembles of all possible sizes, the one with the best accuracy is output. Experimental results show that, compared with Bagging, GASEN, and Forward Selection, PMEP achieves the best prediction accuracy and keeps the final ensemble small; more importantly, its pruning time is much shorter than that of the other ensemble pruning algorithms.

12.
Constructing support vector machine ensemble   (total citations: 30; self-citations: 0; citations by others: 30)
Hyun-Chul  Shaoning  Hong-Mo  Daijin  Sung 《Pattern recognition》2003,36(12):2757-2767
Although the support vector machine (SVM) has been shown to provide good generalization performance, the classification result of a practically implemented SVM often falls short of the theoretically expected level, because implementations rely on approximate algorithms to cope with high time and space complexity. To improve the limited classification performance of a real SVM, we propose using an SVM ensemble with bagging (bootstrap aggregating) or boosting. In bagging, each individual SVM is trained independently on training samples chosen randomly via the bootstrap technique. In boosting, each individual SVM is trained on samples drawn from a probability distribution that is updated in proportion to each sample's error. In both cases, the trained individual SVMs are aggregated into a collective decision in several ways, such as majority voting, least-squares estimation-based weighting, and double-layer hierarchical combining. Simulation results on IRIS data classification, handwritten digit recognition, and fraud detection show that the proposed SVM ensemble with bagging or boosting greatly outperforms a single SVM in classification accuracy.

13.
Detection of malware using data mining techniques has been explored extensively. Techniques that detect malware from structural features rely on identifying anomalies in the structure of executable files. The structural attributes that can be extracted from an executable include byte n-grams, Portable Executable (PE) features, API call sequences and strings. After a thorough analysis, we extracted various features from executable files and applied them to an ensemble of classifiers to detect malware efficiently. Ensemble methods combine several individual pattern classifiers to achieve better classification. The challenge is to choose the minimal number of classifiers that achieves the best performance: an ensemble with too many members may incur large storage requirements and even reduce classification performance. The goal of ensemble pruning is therefore to identify a subset of ensemble members that performs at least as well as the original ensemble and to discard the rest.

14.
To address the low accuracy of object detection in complex scenes, this paper proposes, on the basis of the random forest algorithm, a detection method that adapts to appearance changes caused by pose, viewpoint and shape, while also predicting the optimal detection-box size so that the box overlaps the true object region closely. First, a tree-node splitting function based on multi-dimensional features of image patches is proposed. Then, trees are grown layer by layer with a Boosting algorithm, so that misclassified samples receive more attention at each split. Finally, the input and output spaces of the random forest are extended so that, while classifying, the forest also predicts the optimal aspect ratio of the detection box. Experimental results show that the method improves detection accuracy without additional time overhead; the improved tree-growing algorithm raises classification performance, and the extended output space yields a higher overlap ratio between the detection box and the true object region.

15.
The investigation of the accuracy of methods employed to forecast agricultural commodity prices is an important area of study, and the development of effective models for this purpose is necessary. Regression ensembles can be used here: an ensemble is a set of combined models that act together to forecast a response variable with lower error. The general contribution of this work is to explore the predictive capability of regression ensembles in the agribusiness area by comparing ensembles among themselves, as well as with approaches based on a single model (reference models), for forecasting prices one month ahead. Monthly time series of the price paid to producers in the state of Parana, Brazil for a 60 kg bag of soybean (case study 1) and wheat (case study 2) are used. The ensembles bagging (random forests, RF), boosting (gradient boosting machine, GBM, and extreme gradient boosting machine, XGB), and stacking (STACK) are adopted; the support vector machine for regression (SVR), multilayer perceptron neural network (MLP) and K-nearest neighbors (KNN) serve as reference models. Performance measures such as mean absolute percentage error (MAPE), root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE) are used for model comparison, and Friedman and Wilcoxon signed-rank tests are applied to the models' absolute percentage errors (APE). On the test sets, MAPE below 1% is observed for the best ensemble approaches: the XGB/STACK (Least Absolute Shrinkage and Selection Operator-KNN-XGB-SVR) and RF models performed best for short-term forecasting in case studies 1 and 2, respectively, with statistically smaller APE than the reference models. Approaches based on boosting are consistent, providing good results in both case studies; the overall ranking by performance is XGB, GBM, RF, STACK, MLP, SVR and KNN. It can be concluded that the ensemble approach yields statistically significant gains, reducing prediction errors for the price series studied. The use of ensembles is recommended for forecasting agricultural commodity prices one month ahead, since their more assertive performance increases the accuracy of the constructed model and reduces decision-making risk.
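The error measures used for the model comparison above are straightforward to define; a short sketch follows (the price values below are hypothetical forecasts, not the soybean or wheat data of the study):

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error."""
    return (sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)) ** 0.5

def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

actual = [100.0, 110.0, 105.0, 120.0]     # hypothetical observed prices
forecast = [101.0, 108.0, 106.0, 118.0]   # hypothetical one-month-ahead forecasts
```

On these toy numbers the MAPE is about 1.36%, so a model meeting the "MAPE below 1%" bar reported above would need to forecast these prices even more tightly.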

16.
Effective fault diagnostics on rolling bearings is vital to ensuring the safe and reliable operation of industrial equipment. In recent years, enabled by Machine Learning (ML) algorithms, data-based fault diagnostics approaches have steadily developed into promising solutions for industry. However, each ML algorithm exhibits shortcomings that limit its applicability in practice. To tackle this issue, in this paper Deep Learning (DL) and Ensemble Learning (EL) algorithms are integrated into a novel Deep Ensemble Learning (DEL) approach, in which the training requirements of the DL algorithm are alleviated and the accuracy of fault condition classification is enhanced by the EL algorithm. The DEL approach comprises the following critical steps: (i) Convolutional Neural Networks (CNNs) are constructed to pre-process vibration signals of rolling bearings and efficiently extract fault-related preliminary features; (ii) decision trees are designed to optimise the extracted features by quantifying their importance to rolling-bearing faults; (iii) the EL algorithm, enabled by a Gradient Boosting Decision Tree (GBDT) algorithm and a Non-equivalent Cost Logistic Regression (NCLR) algorithm, performs fault condition classification with optimised non-equivalent costs assigned to different fault severities. Case studies demonstrate that the DEL approach is superior to several comparative ML approaches, and its industrial applicability is showcased via the case studies and analyses.

17.
Diversity among individual classifiers is widely recognized as a key factor in successful ensemble selection, while the ultimate goal of ensemble pruning is to improve predictive accuracy. Diversity and accuracy are two important properties of an ensemble, yet existing pruning methods always consider them separately, even though the two closely interrelate and should be considered simultaneously. Accordingly, three new measures, Simultaneous Diversity & Accuracy, Diversity-Focused-Two and Accuracy-Reinforcement, are developed for pruning an ensemble with a greedy algorithm. The motivation for Simultaneous Diversity & Accuracy is to consider both the difference between the subensemble and the candidate classifier and the accuracy of each, so that difficult samples are not given up on and the generalization performance of the ensemble is further improved. The inspiration for Diversity-Focused-Two stems from the insight that ensemble diversity attaches more importance to the differences among the classifiers in an ensemble. Finally, Accuracy-Reinforcement reinforces the concern for ensemble accuracy. Extensive experiments verified the effectiveness and efficiency of the three proposed pruning measures. This work finds that by considering diversity and accuracy simultaneously, ensemble pruning can yield well-performing selective ensembles with superior generalization capability, which is the scientific value of this paper.
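The "diversity and accuracy simultaneously" idea can be sketched as a greedy forward selection whose score blends the two (a hedged illustration: the linear blend with weight `alpha` and the pairwise-disagreement diversity term are simple stand-ins, not the paper's Simultaneous Diversity & Accuracy measure):

```python
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def disagreement(p1, p2):
    """Fraction of samples on which two classifiers differ."""
    return sum(a != b for a, b in zip(p1, p2)) / len(p1)

def greedy_prune(candidates, labels, size, alpha=0.7):
    """candidates: one per-sample prediction vector per classifier.
    At each step, add the candidate maximising alpha*accuracy + (1-alpha)*diversity."""
    chosen = []
    remaining = list(range(len(candidates)))
    while len(chosen) < size:
        def score(i):
            acc = accuracy(candidates[i], labels)
            if not chosen:
                return acc
            div = sum(disagreement(candidates[i], candidates[j])
                      for j in chosen) / len(chosen)
            return alpha * acc + (1 - alpha) * div
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

labels = [0, 1, 0, 1, 1, 0]
candidates = [
    [0, 1, 0, 1, 1, 0],   # perfect classifier
    [0, 1, 0, 1, 1, 1],   # accurate, similar to the first
    [1, 0, 0, 1, 1, 0],   # less accurate, more different
    [1, 0, 1, 0, 0, 1],   # always wrong (maximally diverse!)
]
chosen = greedy_prune(candidates, labels, size=2)
```

Note that with a diversity-only criterion the always-wrong classifier would look attractive; blending in accuracy, as the three measures above do in their own ways, keeps such members out of the pruned ensemble.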

18.
Handwritten text recognition is one of the most difficult problems in the field of pattern recognition. Recently, a number of classifier creation and combination methods, known as ensemble methods, have been proposed in machine learning and have shown improved recognition performance over single classifiers. In this paper, the application of some of these ensemble methods to offline cursive handwritten word recognition is described; the basic word recognizers are hidden Markov models (HMMs). Experiments demonstrate that ensemble methods have the potential to improve recognition accuracy in the domain of handwriting recognition as well. Received: 23 November 2001, Accepted: 19 September 2002, Published online: 6 June 2003

19.
The purpose of this study is to find the determinants of profit for Development and Investment Banks (IaDB) in Turkey. In the Turkish banking system, the main funding source of banks is deposits, which constitute almost 60% of the balance sheet. As a sub-group of the banking system, IaDB are not allowed to accept deposits in Turkey, which changes the structure of their profitability compared with other banks. To date, no research has concentrated on the profit structure of IaDB, either in Turkey or in other countries; this study addresses that disregarded yet important gap. Quarterly financial data (10 balance-sheet ratios) of 13 banks over the period 2002Q4-2014Q3 were utilized. Among the available profit measures, Return on Equity was chosen as the dependent variable, since it is the measure most commonly used and preferred by other researchers. This study investigates bagging (Bag), one of the most popular ensemble learning methods, for building ensemble models to predict the determinants of Turkish IaDB profitability. Three well-known tree-based machine learning (ML) models (Decision Stump (DStump), Random Tree (RTree), and Reduced Error Pruning Tree (REPTree)) are deployed as base learners. This empirical study indicates that the bagging ensemble models (Bag-DStump, Bag-RTree, Bag-MLP and Bag-REPTree) are superior to their base learners and improve the prediction accuracy of the individual ML models (DStump, RTree, REPTree).

20.
《Pattern recognition》2014,47(2):854-864
In this work, a new one-class classification ensemble strategy called the approximate polytope ensemble is presented. The contribution of the paper is threefold. First, the geometrical concept of the convex hull is used to define the boundary of the target class that defines the problem, and expansions and contractions of this geometrical structure are introduced to avoid over-fitting. Second, the decision whether a point belongs to the convex-hull model in high-dimensional spaces is approximated by means of random projections and an ensemble decision process. Finally, a tiling strategy is proposed to model non-convex structures. Experimental results show that the proposed strategy is significantly better than state-of-the-art one-class classification methods on over 200 datasets.
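The random-projection approximation of convex-hull membership can be sketched with 1-D projections, where the projected hull is simply an interval (a hedged toy version: the number of projections, the 5% expansion factor, the Gaussian directions, and the "accept only if inside every interval" rule are illustrative choices, not the paper's exact formulation):

```python
import random

random.seed(6)

def random_direction(dim):
    return [random.gauss(0, 1) for _ in range(dim)]

def project(point, direction):
    return sum(a * b for a, b in zip(point, direction))

def fit_ape(target, n_proj=50, expand=0.05):
    """For each random direction, store the (expanded) interval spanned by the
    projections of the target class; together they approximate its hull."""
    model = []
    dim = len(target[0])
    for _ in range(n_proj):
        d = random_direction(dim)
        vals = [project(p, d) for p in target]
        lo, hi = min(vals), max(vals)
        margin = expand * (hi - lo)   # expansion to avoid over-fitting
        model.append((d, lo - margin, hi + margin))
    return model

def in_hull(model, point):
    # a point outside the hull is outside at least one projected interval
    return all(lo <= project(point, d) <= hi for d, lo, hi in model)

# target class: points scattered in a square around the origin
target = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
model = fit_ape(target)
```

Each projection can only widen the accepted region (a point inside the hull is inside every projected interval), so adding projections monotonically tightens the approximation, which is why an ensemble of cheap 1-D tests can stand in for an intractable high-dimensional hull computation.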
