首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
Bauer  Eric  Kohavi  Ron 《Machine Learning》1999,36(1-2):105-139
Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three variants) and a Naive-Bayes inducer. The purpose of the study is to improve our understanding of why and when these algorithms, which use perturbation, reweighting, and combination techniques, affect classification error. We provide a bias and variance decomposition of the error to show how different methods and variants influence these two terms. This allowed us to determine that Bagging reduced variance of unstable methods, while boosting methods (AdaBoost and Arc-x4) reduced both the bias and variance of unstable methods but increased the variance for Naive-Bayes, which was very stable. We observed that Arc-x4 behaves differently than AdaBoost if reweighting is used instead of resampling, indicating a fundamental difference. Voting variants, some of which are introduced in this paper, include: pruning versus no pruning, use of probabilistic estimates, weight perturbations (Wagging), and backfitting of data. We found that Bagging improves when probabilistic estimates in conjunction with no-pruning are used, as well as when the data was backfit. We measure tree sizes and show an interesting positive correlation between the increase in the average tree size in AdaBoost trials and its success in reducing the error. We compare the mean-squared error of voting methods to non-voting methods and show that the voting methods lead to large and significant reductions in the mean-squared errors. Practical problems that arise in implementing boosting algorithms are explored, including numerical instabilities and underflows. We use scatterplots that graphically show how AdaBoost reweights instances, emphasizing not only hard areas but also outliers and noise.  相似文献   

2.
BoosTexter: A Boosting-based System for Text Categorization   总被引:60,自引:4,他引:60  
Schapire  Robert E.  Singer  Yoram 《Machine Learning》2000,39(2-3):135-168
This work focuses on algorithms which learn from examples to perform multiclass text and speech categorization tasks. Our approach is based on a new and improved family of boosting algorithms. We describe in detail an implementation, called BoosTexter, of the new boosting algorithms for text categorization tasks. We present results comparing the performance of BoosTexter and a number of other text-categorization algorithms on a variety of tasks. We conclude by describing the application of our system to automatic call-type identification from unconstrained spoken customer responses.  相似文献   

3.
Boosting Algorithms for Parallel and Distributed Learning   总被引:1,自引:0,他引:1  
The growing amount of available information and its distributed and heterogeneous nature has a major impact on the field of data mining. In this paper, we propose a framework for parallel and distributed boosting algorithms intended for efficient integrating specialized classifiers learned over very large, distributed and possibly heterogeneous databases that cannot fit into main computer memory. Boosting is a popular technique for constructing highly accurate classifier ensembles, where the classifiers are trained serially, with the weights on the training instances adaptively set according to the performance of previous classifiers. Our parallel boosting algorithm is designed for tightly coupled shared memory systems with a small number of processors, with an objective of achieving the maximal prediction accuracy in fewer iterations than boosting on a single processor. After all processors learn classifiers in parallel at each boosting round, they are combined according to the confidence of their prediction. Our distributed boosting algorithm is proposed primarily for learning from several disjoint data sites when the data cannot be merged together, although it can also be used for parallel learning where a massive data set is partitioned into several disjoint subsets for a more efficient analysis. At each boosting round, the proposed method combines classifiers from all sites and creates a classifier ensemble on each site. The final classifier is constructed as an ensemble of all classifier ensembles built on disjoint data sets. The new proposed methods applied to several data sets have shown that parallel boosting can achieve the same or even better prediction accuracy considerably faster than the standard sequential boosting. Results from the experiments also indicate that distributed boosting has comparable or slightly improved classification accuracy over standard boosting, while requiring much less memory and computational time since it uses smaller data sets.  相似文献   

4.
Using Model Trees for Classification   总被引:1,自引:0,他引:1  
Frank  Eibe  Wang  Yong  Inglis  Stuart  Holmes  Geoffrey  Witten  Ian H. 《Machine Learning》1998,32(1):63-76
Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducer M5, based on Quinlan's M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.  相似文献   

5.
On the Learnability and Design of Output Codes for Multiclass Problems   总被引:4,自引:0,他引:4  
Crammer  Koby  Singer  Yoram 《Machine Learning》2002,47(2-3):201-233
Output coding is a general framework for solving multiclass categorization problems. Previous research on output codes has focused on building multiclass machines given predefined output codes. In this paper we discuss for the first time the problem of designing output codes for multiclass problems. For the design problem of discrete codes, which have been used extensively in previous works, we present mostly negative results. We then introduce the notion of continuous codes and cast the design problem of continuous codes as a constrained optimization problem. We describe three optimization problems corresponding to three different norms of the code matrix. Interestingly, for the l 2 norm our formalism results in a quadratic program whose dual does not depend on the length of the code. A special case of our formalism provides a multiclass scheme for building support vector machines which can be solved efficiently. We give a time and space efficient algorithm for solving the quadratic program. We describe preliminary experiments with synthetic data show that our algorithm is often two orders of magnitude faster than standard quadratic programming packages. We conclude with the generalization properties of the algorithm.  相似文献   

6.
Boosting算法是近年来在机器学习领域中一种流行的用来提高学习精度的算法。文中以AdaBoost算法为例来介绍Boosting的基本原理。  相似文献   

7.
随着部队实战化训练的深入,传统的训练成绩分析方法已不能适应科学组训的需要。该文引入射击训练实例,应用人工智能中机器学习的ID3归纳学习算法,对射击情况进行分类分析,得出影响军事训练成绩的内部原因以及其他一些结论。  相似文献   

8.
Parallel Formulations of Decision-Tree Classification Algorithms   总被引:5,自引:0,他引:5  
Classification decision tree algorithms are used extensively for data mining in many domains such as retail target marketing, fraud detection, etc. Highly parallel algorithms for constructing classification decision trees are desirable for dealing with large data sets in reasonable amount of time. Algorithms for building classification decision trees have a natural concurrency, but are difficult to parallelize due to the inherent dynamic nature of the computation. In this paper, we present parallel formulations of classification decision tree learning algorithm based on induction. We describe two basic parallel formulations. One is based on Synchronous Tree Construction Approach and the other is based on Partitioned Tree Construction Approach. We discuss the advantages and disadvantages of using these methods and propose a hybrid method that employs the good features of these methods. We also provide the analysis of the cost of computation and communication of the proposed hybrid method. Moreover, experimental results on an IBM SP-2 demonstrate excellent speedups and scalability.  相似文献   

9.
Bagging and boosting are methods that generate a diverse ensemble of classifiers by manipulating the training data given to a base learning algorithm. Breiman has pointed out that they rely for their effectiveness on the instability of the base learning algorithm. An alternative approach to generating an ensemble is to randomize the internal decisions made by the base algorithm. This general approach has been studied previously by Ali and Pazzani and by Dietterich and Kong. This paper compares the effectiveness of randomization, bagging, and boosting for improving the performance of the decision-tree algorithm C4.5. The experiments show that in situations with little or no classification noise, randomization is competitive with (and perhaps slightly superior to) bagging but not as accurate as boosting. In situations with substantial classification noise, bagging is much better than boosting, and sometimes better than randomization.  相似文献   

10.
随着部队实战化训练的深入,传统的训练成绩分析方法已不能适应科学组训的需要。该文引入射击训练实例,应用人工智能中机器学习的ID3归纳学习算法,对射击情况进行分类分析,得出影响军事训练成绩的内部原因以及其他一些结论。  相似文献   

11.
MultiBoosting: A Technique for Combining Boosting and Wagging   总被引:12,自引:0,他引:12  
Webb  Geoffrey I. 《Machine Learning》2000,40(2):159-196
MultiBoosting is an extension to the highly successful AdaBoost technique for forming decision committees. MultiBoosting can be viewed as combining AdaBoost with wagging. It is able to harness both AdaBoost's high bias and variance reduction with wagging's superior variance reduction. Using C4.5 as the base learning algorithm, MultiBoosting is demonstrated to produce decision committees with lower error than either AdaBoost or wagging significantly more often than the reverse over a large representative cross-section of UCI data sets. It offers the further advantage over AdaBoost of suiting parallel execution.  相似文献   

12.
Decision trees that are based on information-theory are useful paradigms for learning from examples. However, in some real-world applications, known information-theoretic methods frequently generate non-monotonic decision trees, in which objects with better attribute values are sometimes classified to lower classes than objects with inferior values. This property is undesirable for problem solving in many application domains, such as credit scoring and insurance premium determination, where monotonicity of subsequent classifications is important. An attribute-selection metric is proposed here that takes both the error as well as monotonicity into account while building decision trees. The metric is empirically shown capable of significantly reducing the degree of non-monotonicity of decision trees without sacrificing their inductive accuracy.  相似文献   

13.
遗传算法的改进   总被引:1,自引:1,他引:1  
谷峰  吴勇  唐俊 《微机发展》2003,13(6):80-81,85
简要介绍了一般遗传算法的基本原理,由此提出了一个新的改进算法,它导向以适应度比较高的模式为祖先的染色体“家族”方向。文中给出了两个典型求最大值的例子,从结果中看,改进后的遗传算法大大提高了算法的速度,精度也有所提高。  相似文献   

14.
We have found one reason why AdaBoost tends not to perform well on gene expression data, and identified simple modifications that improve its ability to find accurate class prediction rules. These modifications appear especially to be needed when there is a strong association between expression profiles and class designations. Cross-validation analysis of six microarray datasets with different characteristics suggests that, suitably modified, boosting provides competitive classification accuracy in general.Sometimes the goal in a microarray analysis is to find a class prediction rule that is not only accurate, but that depends on the level of expression of few genes. Because boosting makes an effort to find genes that are complementary sources of evidence of the correct classification of a tissue sample, it appears especially useful for such gene-efficient class prediction. This appears particularly to be true when there is a strong association between expression profiles and class designations, which is often the case for example when comparing tumor and normal samples.  相似文献   

15.
Approximation Algorithms for Connected Dominating Sets   总被引:38,自引:0,他引:38  
S. Guha  S. Khuller 《Algorithmica》1998,20(4):374-387
The dominating set problem in graphs asks for a minimum size subset of vertices with the following property: each vertex is required to be either in the dominating set, or adjacent to some vertex in the dominating set. We focus on the related question of finding a connected dominating set of minimum size, where the graph induced by vertices in the dominating set is required to be connected as well. This problem arises in network testing, as well as in wireless communication. Two polynomial time algorithms that achieve approximation factors of 2H(Δ)+2 and H(Δ)+2 are presented, where Δ is the maximum degree and H is the harmonic function. This question also arises in relation to the traveling tourist problem, where one is looking for the shortest tour such that each vertex is either visited or has at least one of its neighbors visited. We also consider a generalization of the problem to the weighted case, and give an algorithm with an approximation factor of (c n +1) \ln n where c n ln k is the approximation factor for the node weighted Steiner tree problem (currently c n = 1.6103 ). We also consider the more general problem of finding a connected dominating set of a specified subset of vertices and provide a polynomial time algorithm with a (c+1) H(Δ) +c-1 approximation factor, where c is the Steiner approximation ratio for graphs (currently c = 1.644 ). Received June 22, 1996; revised February 28, 1997.  相似文献   

16.
A set of classification rules can be considered as a disjunction of rules, where each rule is a disjunct. A small disjunct is a rule covering a small number of examples. Small disjuncts are a serious problem for effective classification, because the small number of examples satisfying these rules makes their prediction unreliable and error-prone. This paper offers two main contributions to the research on small disjuncts. First, it investigates six candidate solutions (algorithms) for the problem of small disjuncts. Second, it reports the results of a meta-learning experiment, which produced meta-rules predicting which algorithm will tend to perform best for a given data set. The algorithms investigated in this paper belong to different machine learning paradigms and their hybrid combinations, as follows: two versions of a decision-tree (DT) induction algorithm; two versions of a hybrid DT/genetic algorithm (GA) method; one GA; one hybrid DT/instance-based learning (IBL) algorithm. Experiments with 22 data sets evaluated both the predictive accuracy and the simplicity of the discovered rule sets, with the following conclusions. If one wants to maximize predictive accuracy only, then the hybrid DT/IBL seems to be the best choice. On the other hand, if one wants to maximize both predictive accuracy and rule set simplicity -- which is important in the context of data mining -- then a hybrid DT/GA seems to be the best choice.  相似文献   

17.
Using Nondeterminism to Design Efficient Deterministic Algorithms   总被引:5,自引:0,他引:5  
In this paper we illustrate how nondeterminism can be used conveniently and effectively in designing efficient deterministic algorithms. In particular, our method gives a parameterized algorithm of running time O((5.7 k)k n) for the 3-D matching problem, which significantly improves the previous algorithm by Downey et al. The algorithm can be generalized to yield an improved algorithm for the r-D matching problem for any positive integer r. The method can also be employed in designing deterministic algorithms for other optimization problems as well.  相似文献   

18.
In Web classification, web pages are assigned to pre-defined categories mainly according to their content (content mining). However, the structure of the web site might provide extra information about their category (structure mining). Traditionally, both approaches have been applied separately, or are dealt with techniques that do not generate a model, such as Bayesian techniques. Unfortunately, in some classification contexts, a comprehensible model becomes crucial. Thus, it would be interesting to apply rule-based techniques (rule learning, decision tree learning) for the web categorisation task. In this paper we outline how our general-purpose learning algorithm, the so called distance based decision tree learning algorithm (DBDT), could be used in web categorisation scenarios. This algorithm differs from traditional ones in the sense that the splitting criterion is defined by means of metric conditions (“is nearer than”). This change allows decision trees to handle structured attributes (lists, graphs, sets, etc.) along with the well-known nominal and numerical attributes. Generally speaking, these structured attributes will be employed to represent the content and the structure of the web-site.  相似文献   

19.
The binary representation of each classification from a subset of a space of admissible classifications is considered. A metric in a unit cube is introduced, and a correct algebra of classification algorithms is constructed. The correctness and completeness of a model of classification algorithms are proved. An example of construction of a complete model for a classification problem is considered.  相似文献   

20.
旅行商问题(TSP)的一种改进遗传算法   总被引:16,自引:1,他引:16  
马欣  朱双东  杨斐 《计算机仿真》2003,20(4):36-37,15
传统的序号编码遗传算法(GA)使用PMX、CX和OX等特殊的交叉算子,这些算子实施起来很麻烦。针对TSP问题的求解,提出了一种新的改进遗传算法:单亲进化遗传算法(PEGA),PEGA是利用父体所提供的有效边的信息,使用保留最小边的方法进行个体的进化。与传统的遗传算法相比,PEGA算法弥补了它们的不足之处,简化了遗传算法。给出了PEGA算法的数值算例,仿真实验表明了该算法对于对称的TSP和非对称的TSP问题,都具有收敛速度快的特点,证明了该算法的有效性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号