Similar Articles
20 similar articles found.
1.
Partitioning Nominal Attributes in Decision Trees
To find the optimal branching of a nominal attribute at a node in an L-ary decision tree, one is often forced to search over all possible L-ary partitions for the one that yields the minimum impurity measure. For binary trees (L = 2) when there are just two classes, a short-cut search is possible that is linear in n, the number of distinct values of the attribute. For the general case in which the number of classes, k, may be greater than two, Burshtein et al. have shown that the optimal partition satisfies a condition that involves the existence of 2^L hyperplanes in the class probability space. We derive a property of the optimal partition for concave impurity measures (including in particular the Gini and entropy impurity measures) in terms of the existence of L vectors in the dual of the class probability space, which implies the earlier condition. Unfortunately, these insights still do not offer a practical search method when n and k are large, even for binary trees. We therefore present a new heuristic search algorithm to find a good partition. It is based on ordering the attribute's values according to their principal component scores in the class probability space, and is linear in n. We demonstrate the effectiveness of the new method through Monte Carlo simulation experiments and compare its performance against other heuristic methods.
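A minimal sketch of the ordering heuristic described above, assuming binary splits (L = 2) and the Gini impurity as the criterion; the function name and argument layout are illustrative, not from the paper:

```python
import numpy as np

def best_binary_split(counts):
    """counts[v, c] = number of training examples with attribute value v
    and class c.  Returns the cheapest split found by the PCA-ordering
    heuristic: weighted Gini impurity plus the value indices sent left."""
    probs = counts / counts.sum(axis=1, keepdims=True)   # class-probability vector per value
    centered = probs - probs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    order = np.argsort(centered @ vt[0])                 # sort values by first PC score

    def weighted_gini(c):
        n = c.sum()
        return 0.0 if n == 0 else n * (1.0 - ((c / n) ** 2).sum())

    total = counts.sum(axis=0)
    left = np.zeros_like(total, dtype=float)
    best_cost, best_set = np.inf, None
    for i in range(len(order) - 1):                      # linear scan: n-1 candidate cuts
        left += counts[order[i]]
        cost = weighted_gini(left) + weighted_gini(total - left)
        if cost < best_cost:
            best_cost, best_set = cost, set(order[: i + 1].tolist())
    return best_cost, best_set
```

Only n - 1 cut points along the principal-component ordering are examined, which is what makes the search linear in n instead of exponential.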

2.
A Distance-Based Attribute Selection Measure for Decision Tree Induction
This note introduces a new attribute selection measure for ID3-like inductive algorithms. This measure is based on a distance between partitions such that the selected attribute in a node induces the partition which is closest to the correct partition of the subset of training examples corresponding to this node. The relationship of this measure with Quinlan's information gain is also established. It is also formally proved that our distance is not biased towards attributes with large numbers of values. Experimental studies with this distance confirm previously reported results showing that the predictive accuracy of induced decision trees is not sensitive to the goodness of the attribute selection measure. However, this distance produces smaller trees than the gain ratio measure of Quinlan, especially in the case of data whose attributes have significantly different numbers of values.
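The abstract does not reproduce the formula; the sketch below uses the normalized partition distance d = 1 - I(X;C)/H(X,C), which has the stated properties (zero only when the partitions coincide, no bias toward many-valued attributes). Treat the exact correspondence to the paper's measure as an assumption.

```python
import numpy as np

def partition_distance(x, c):
    """d(X, C) = 1 - I(X;C) / H(X,C) for integer-coded attribute
    values x and class labels c; 0 iff the two partitions coincide."""
    x, c = np.asarray(x), np.asarray(c)
    joint = np.zeros((x.max() + 1, c.max() + 1))
    np.add.at(joint, (x, c), 1)                  # contingency table
    p = joint / joint.sum()
    h = lambda q: -(q[q > 0] * np.log2(q[q > 0])).sum()
    h_joint = h(p)
    mi = h(p.sum(axis=1)) + h(p.sum(axis=0)) - h_joint
    return 1.0 - mi / h_joint if h_joint > 0 else 0.0
```

At each node, the attribute whose partition minimizes this distance to the class partition would be selected.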

3.
We explore an algorithm for learning a classification procedure to minimize the cost of misclassified examples. The described approach is based on the generation of oblique decision trees. The various misclassification costs are defined by a cost matrix. A special splitting criterion is defined to determine the next node for splitting. Clustering techniques are used to perform the splitting. The splitting criterion is based on cost histograms that count the misclassification costs per class. To avoid overfitting, cross-validation techniques are integrated directly into the training cycle to terminate the splitting process. Several successful tests with different data sets suggest that this method is very promising.
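One concrete way to realize the cost-matrix idea, as a hedged sketch (the paper's histogram bookkeeping and clustering steps are omitted; names and the example matrix are illustrative):

```python
import numpy as np

def node_cost(class_counts, cost_matrix):
    """Expected misclassification cost of labeling a node with its
    cheapest class.  cost_matrix[i, j] = cost of predicting class j
    when the true class is i (zero diagonal)."""
    costs = class_counts @ cost_matrix     # counts-weighted cost of each candidate label
    return costs.min()

counts = np.array([40, 10])                # examples of class 0 / class 1 at the node
cost = np.array([[0, 1],                   # misclassifying class 0 is cheap
                 [5, 0]])                  # misclassifying class 1 is expensive
print(node_cost(counts, cost))             # 40*0+10*5=50 vs 40*1+10*0=40 -> 40
```

A candidate split would then be scored by the sum of its children's costs, and splitting stops when cross-validated cost no longer improves.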

4.
Motivated by the desire to construct compact (in terms of expected length to be traversed to reach a decision) decision trees, we propose a new node splitting measure for decision tree construction. We show that the proposed measure is convex and cumulative and utilize this in the construction of decision trees for classification. Results obtained from several datasets from the UCI repository show that the proposed measure results in decision trees that are more compact with classification accuracy that is comparable to that obtained using popular node splitting measures such as Gain Ratio and the Gini Index.

5.
6.
Combining Enhanced Stumps with Boosting for Text Classification
To improve the accuracy of text classification, Schapire and Singer tried a method that uses Boosting to combine simple one-split decision trees (stumps). The split made by each base learner is determined by whether a particular term appears in the document to be classified. Such base learners are clearly too weak, so the accuracy of the final boosted classifier is unsatisfactory, and a very large number of iterations is needed, making the method inefficient. To address this problem, this paper proposes strengthening the base learners by letting all terms in a document determine the split: a base learner splits according to whether the similarity between the document, represented in the vector space model (VSM), and a class representative vector exceeds a given threshold. In addition, to speed up convergence, the weights that Boosting assigns to the training examples are dynamically incorporated into the computation of the class representative vectors. Experimental results show that this method improves both the accuracy and the efficiency of boosted stump classifiers for text classification, and the larger the problem, the more pronounced the gain.
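A rough sketch of the strengthened base learner inside AdaBoost, under several assumptions the abstract does not fix: two classes coded as +1/-1, row-normalized TF-IDF vectors, and a crude median threshold in place of a proper threshold search.

```python
import numpy as np

def boost_similarity_stumps(X, y, rounds=20):
    """AdaBoost whose base learners threshold the cosine similarity
    between a document and a weight-averaged class centroid."""
    n = len(y)                                  # y in {-1, +1}; X: row-normalized TF-IDF
    w = np.full(n, 1.0 / n)
    learners = []
    for _ in range(rounds):
        centroid = w[y == 1] @ X[y == 1]        # boosting weights enter the centroid
        centroid /= np.linalg.norm(centroid) + 1e-12
        sims = X @ centroid
        theta = np.median(sims)                 # crude threshold choice (illustrative)
        pred = np.where(sims >= theta, 1, -1)
        err = np.clip(w[pred != y].sum(), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)          # standard AdaBoost reweighting
        w /= w.sum()
        learners.append((centroid, theta, alpha))
    return learners

def predict(learners, X):
    score = sum(a * np.where(X @ c >= t, 1, -1) for c, t, a in learners)
    return np.sign(score)
```

Because every term contributes to the similarity, each base learner is far stronger than a single-word stump, which is the source of the claimed speedup.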

7.
Information extraction from remote sensing imagery includes both general classification and the extraction of specific target classes, and the accuracy of target-class extraction varies with how the feature space is partitioned into classes. This paper studies the influence of the class partitioning on target-class extraction. Based on a Bayesian classifier, the influence is analyzed theoretically, and experiments on target-class extraction under different class partitionings show that the extraction accuracy indeed differs across partitionings. To determine a suitable partitioning, a selection method based on scatter-matrix between-class separability is proposed and validated experimentally.
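The abstract names a scatter-matrix separability criterion without giving it; a standard choice with this shape is J = tr(Sw^-1 Sb), sketched below (the ridge term is only for numerical safety; treat the specific criterion as an assumption):

```python
import numpy as np

def separability(X, y):
    """J = trace(Sw^-1 Sb): between-class scatter relative to
    within-class scatter; a larger J means the chosen class
    partitioning is easier to separate."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    sw, sb = np.zeros((d, d)), np.zeros((d, d))
    for cls in np.unique(y):
        Xc = X[y == cls]
        mc = Xc.mean(axis=0)
        sw += (Xc - mc).T @ (Xc - mc)          # within-class scatter
        diff = (mc - mean)[:, None]
        sb += len(Xc) * (diff @ diff.T)        # between-class scatter
    sw += 1e-8 * np.eye(d)                     # ridge for numerical safety
    return float(np.trace(np.linalg.solve(sw, sb)))
```

Candidate partitionings of the classes can then be ranked by J before running the full extraction.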

8.
Insertion schemes for various classes of multiway search trees have been implemented in PASCAL and experimentally studied. While the original B-tree insertion scheme does not consider brothers, the dense m-ary-tree insertion scheme considers all brothers before splitting an overflowing node. There are many possible schemes in between these two extremes. We study the influence of the number of brothers considered and of the splitting of an overflowing node on storage utilization and on the number of input/output operations per insertion.

9.
In some classification problems the feature space is heterogeneous in that the best features on which to base the classification are different in different parts of the feature space. In some other problems the classes can be divided into subsets such that distinguishing one subset of classes from another and classifying examples within the subsets require very different decision rules, involving different sets of features. In such heterogeneous problems, many modeling techniques (including decision trees, rules, and neural networks) evaluate the performance of alternative decision rules by averaging over the entire problem space, and are prone to generating a model that is suboptimal in any of the regions or subproblems. Better overall models can be obtained by splitting the problem appropriately and modeling each subproblem separately. This paper presents a new measure to determine the degree of dissimilarity between the decision surfaces of two given problems, and suggests a way to search for a strategic splitting of the feature space that identifies regions with different characteristics. We illustrate the concept using a multiplexor problem, and apply the method to a DNA classification problem.
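The abstract does not define the dissimilarity measure; as a stand-in, one can estimate how dissimilar two regions' decision surfaces are by training a model per region and scoring their disagreement on the pooled data. Model class and depth here are arbitrary choices, not the paper's.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def surface_dissimilarity(Xa, ya, Xb, yb):
    """Proxy for decision-surface dissimilarity between two regions:
    fit one model per region, then measure how often the two models
    disagree over the combined sample."""
    ma = DecisionTreeClassifier(max_depth=5).fit(Xa, ya)
    mb = DecisionTreeClassifier(max_depth=5).fit(Xb, yb)
    X = np.vstack([Xa, Xb])
    return float(np.mean(ma.predict(X) != mb.predict(X)))
```

A candidate split of the feature space is worth keeping when this score is high, i.e., when the two regions genuinely call for different decision rules.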

10.
Hakan, Pattern Recognition, 2007, 40(12): 3540-3551
Decision trees recursively partition the instance space by generating nodes that implement a decision function belonging to an a priori specified model class. Each decision may be univariate, linear or nonlinear. Alternatively, in omnivariate decision trees, one of the model types is dynamically selected by taking into account the complexity of the problem defined by the samples reaching that node. The selection is based on statistical tests where the most appropriate model type is selected as the one providing significantly better accuracy than others. In this study, we propose the use of model ensemble-based nodes where a multitude of models are considered for making decisions at each node. The ensemble members are generated by perturbing the model parameters and input attributes. Experiments conducted on several datasets and three model types indicate that the proposed approach achieves better classification accuracies compared to individual nodes, even in cases when only one model class is used in generating ensemble members.
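A toy version of an ensemble-based node, with assumed perturbations (bootstrap resampling and random feature subsets) and an assumed member model (a linear perceptron); the paper considers several model types and combines them differently in detail.

```python
import numpy as np
from sklearn.linear_model import Perceptron

def ensemble_node(X, y, n_members=7, seed=0):
    """Build one node's ensemble: each member sees a perturbed view of
    the data (bootstrap rows, random half of the attributes)."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        feats = rng.choice(X.shape[1], max(1, X.shape[1] // 2), replace=False)
        rows = rng.choice(len(X), len(X), replace=True)
        clf = Perceptron().fit(X[rows][:, feats], y[rows])
        members.append((feats, clf))
    return members

def node_decision(members, X):
    """Majority vote of the members (labels assumed to be small ints)."""
    votes = np.stack([clf.predict(X[:, feats]) for feats, clf in members])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

The instances reaching the node are then routed by this vote rather than by a single fitted model.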

11.
Decision trees are a common classification method in data mining. To handle the inconsistent, noise-induced data that arise in analyzing university graduates' employment, this paper proposes a decision tree model based on variable precision rough sets and applies it to student employment data. The method uses the variable-precision rough set quality of classification as the information function for selecting condition attributes as tree nodes, splitting the data set top-down until a termination condition is met. It fully accounts for dependency and redundancy among attributes, and allows a certain degree of class inconsistency among the instances assigned to the positive region while the tree is being built. Experiments show that the algorithm handles inconsistent data sets effectively and classifies the employment data correctly and reasonably, yielding several valuable conclusions for decision analysis. The algorithm greatly improves the generalization ability of the decision rules and simplifies the tree structure.
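A small sketch of the information function described above, the variable-precision quality of classification (the fraction of objects in the beta-positive region); function and parameter names are illustrative:

```python
import numpy as np
from collections import defaultdict

def vprs_quality(cond, dec, beta=0.8):
    """Fraction of objects whose condition-attribute equivalence class
    can be assigned to one decision class with majority ratio >= beta
    (the beta-positive region).  Higher quality = better attribute."""
    cond, dec = np.asarray(cond), np.asarray(dec)
    blocks = defaultdict(list)
    for i, row in enumerate(map(tuple, cond)):
        blocks[row].append(i)                  # equivalence classes of the condition attrs
    positive = 0
    for idx in blocks.values():
        _, counts = np.unique(dec[idx], return_counts=True)
        if counts.max() / len(idx) >= beta:    # consistent "enough" under beta
            positive += len(idx)
    return positive / len(dec)
```

Setting beta below 1 is exactly what lets mildly inconsistent blocks enter the positive region instead of forcing further splits.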

12.
Look-ahead based fuzzy decision tree induction
Decision tree induction is typically based on a top-down greedy algorithm that makes locally optimal decisions at each node. Due to the greedy and local nature of the decisions made at each node, there is considerable possibility of instances at the node being split along branches such that instances along some or all of the branches require a large number of additional nodes for classification. In this paper, we present a computationally efficient way of incorporating look-ahead into fuzzy decision tree induction. Our algorithm is based on establishing the decision at each internal node by jointly optimizing the node splitting criterion (information gain or gain ratio) and the classifiability of instances along each branch of the node. Simulation results confirm that the use of the proposed look-ahead method leads to smaller decision trees and, as a consequence, better test performance.

13.
This paper presents a new architecture of fuzzy decision tree based on fuzzy rules, the fuzzy rule based decision tree (FRDT), and provides a learning algorithm. In contrast with "traditional" axis-parallel decision trees, in which only a single feature (variable) is taken into account at each node, each node of the proposed tree involves a fuzzy rule over multiple features. Fuzzy rules are employed to produce leaves of high purity. Using multiple features per node helps minimize the size of the trees. The FRDT grows by expanding an additional node composed of a mixture of data coming from different classes, which is the only non-leaf node of each layer. This gives rise to a new geometric structure endowed with linguistic terms, quite different from "traditional" oblique decision trees endowed with hyperplanes as decision functions. A series of numeric studies is reported using data from UCI machine learning data sets. The comparison is carried out with "traditional" decision trees such as C4.5, LADtree, BFTree, SimpleCart, and NBTree. The results of statistical tests show that the proposed FRDT exhibits the best performance in terms of both accuracy and the size of the produced trees.

14.
Classifiability-based omnivariate decision trees
Top-down induction of decision trees is a simple and powerful method of pattern classification. In a decision tree, each node partitions the available patterns into two or more sets. New nodes are created to handle each of the resulting partitions and the process continues. A node is considered terminal if it satisfies some stopping criteria (for example, purity, i.e., all patterns at the node are from a single class). Decision trees may be univariate, linear multivariate, or nonlinear multivariate depending on whether a single attribute, a linear function of all the attributes, or a nonlinear function of all the attributes is used for the partitioning at each node of the decision tree. Though nonlinear multivariate decision trees are the most powerful, they are more susceptible to the risks of overfitting. In this paper, we propose to perform model selection at each decision node to build omnivariate decision trees. The model selection is done using a novel classifiability measure that captures the possible sources of misclassification with relative ease and is able to accurately reflect the complexity of the subproblem at each node. The proposed approach is fast and does not suffer from as high a computational burden as that incurred by typical model selection algorithms. Empirical results over 26 data sets indicate that our approach is faster and achieves better classification accuracy compared to statistical model selection algorithms.

15.
To address the complex construction and limited classification accuracy of C4.5 decision trees, an improved tree-construction algorithm based on variable precision rough sets is proposed. The algorithm uses the approximate classification quality as the heuristic function for selecting the attribute at each node; compared with the information gain ratio, this criterion more accurately characterizes an attribute's overall contribution to classification and also provides some robustness to noise. In addition, for the special case in which two or more attributes have equal approximate classification quality, the paper shows how to select the optimal splitting attribute...

16.
Fuzzy relational classifier trained by fuzzy clustering
A novel approach to nonlinear classification is presented. In the training phase of the classifier, the training data are first clustered in an unsupervised way by fuzzy c-means or a similar algorithm; the class labels are not used in this step. Then, a fuzzy relation between the clusters and the class identifiers is computed. This approach allows the number of prototypes to be independent of the number of actual classes. To classify unseen patterns, the membership degrees of the feature vector in the clusters are first computed using the distance measure of the clustering algorithm. Then, the output fuzzy set is obtained by relational composition. This fuzzy set contains the membership degrees of the pattern in the given classes. A crisp decision is obtained by defuzzification, which gives either a single class or a "reject" decision when a unique class cannot be selected based on the available information. The principle of the proposed method is demonstrated on an artificial data set, and the applicability of the method is shown on the identification of livestock from recorded sound sequences. The obtained results are compared with two other classifiers.
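A compact sketch of this pipeline under stated assumptions: the standard fuzzy c-means membership formula, a sup-min relation between clusters and classes (the paper's composition operator may differ in detail), and max defuzzification with a reject margin.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """Membership of each sample in each cluster, standard FCM formula
    with fuzzifier m.  Returns an (n_samples, n_clusters) array."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def train_relation(U, labels, n_classes):
    """Sup-min fuzzy relation between clusters and classes:
    R[i, j] = max over samples of min(membership in cluster i,
    crisp indicator of class j)."""
    Y = np.eye(n_classes)[labels]                     # (n_samples, n_classes)
    return np.max(np.minimum(U[:, :, None], Y[:, None, :]), axis=0)

def classify(u, R, reject_margin=0.1):
    """Relational composition, then defuzzification with reject."""
    out = np.max(np.minimum(u[:, None], R), axis=0)   # class membership degrees
    top = np.argsort(out)
    if out[top[-1]] - out[top[-2]] < reject_margin:
        return None                                   # "reject": no unique class
    return int(top[-1])
```

Note how the number of cluster prototypes is free to differ from the number of classes; the relation R is what ties them together.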

17.
Decision Tree Ensembles Based on Representative Data
To obtain better decision tree ensembles, and building on a theoretical analysis, this paper proposes an ensemble method based on representative data, approached from the data side. The method uses the Partitioning Around Medoids (PAM) algorithm to extract a representative training set from the original one, trains multiple decision tree classifiers from that representative set, and builds the ensemble model from them. The method selects as little representative data as possible while training as good an ensemble as possible. Experimental results show that, using less representative data, the method achieves higher ensemble accuracy than Bagging and Boosting.
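A toy rendition under two assumptions the abstract leaves open: a small alternating-update stand-in for full PAM, and bootstrap resampling of the medoid set to obtain the multiple trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def medoids(X, k, iters=10, seed=0):
    """Tiny PAM-style medoid search: alternate assignment to the
    nearest medoid and medoid update within each cluster."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    med = rng.choice(len(X), k, replace=False)
    for _ in range(iters):
        assign = D[:, med].argmin(axis=1)
        for j in range(k):
            members = np.where(assign == j)[0]
            if len(members):
                med[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
    return med

def representative_ensemble(X, y, k, n_trees=10, seed=0):
    """Train several trees on bootstrap samples of the medoid set."""
    rng = np.random.default_rng(seed)
    idx = medoids(X, k, seed=seed)
    Xr, yr = X[idx], y[idx]
    trees = []
    for _ in range(n_trees):
        b = rng.choice(len(Xr), len(Xr), replace=True)
        trees.append(DecisionTreeClassifier().fit(Xr[b], yr[b]))
    return trees
```

The medoid set plays the role of the "representative training set": each tree sees only k well-spread examples rather than the full data.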

18.
A standard approach to determining decision trees is to learn them from examples. A disadvantage of this approach is that once a decision tree is learned, it is difficult to modify it to suit different decision making situations. Such problems arise, for example, when an attribute assigned to some node cannot be measured, or there is a significant change in the costs of measuring attributes or in the frequency distribution of events from different decision classes. An attractive approach to resolving this problem is to learn and store knowledge in the form of decision rules, and to generate from them, whenever needed, a decision tree that is most suitable in a given situation. An additional advantage of such an approach is that it facilitates building compact decision trees, which can be much simpler than the logically equivalent conventional decision trees (by compact trees are meant decision trees that may contain branches assigned a set of values, and nodes assigned derived attributes, i.e., attributes that are logical or mathematical functions of the original ones). The paper describes an efficient method, AQDT-1, that takes decision rules generated by an AQ-type learning system (AQ15 or AQ17), and builds from them a decision tree optimizing a given optimality criterion. The method can work in two modes: the standard mode, which produces conventional decision trees, and the compact mode, which produces compact decision trees. The preliminary experiments with AQDT-1 have shown that the decision trees generated by it from decision rules (conventional and compact) have outperformed those generated from examples by the well-known C4.5 program both in terms of their simplicity and their predictive accuracy.

19.
A New Decision Tree Algorithm Based on the Rough Set Model
In decision tree induction algorithms based on the rough set model, the insistence on exact classification often makes the algorithm partition the instances too finely; a few atypical instances cannot be prevented from distorting the tree, so the resulting tree becomes too large to understand and its ability to classify and predict future data deteriorates. To address this, the paper presents a new rough-set-based tree induction algorithm that introduces a suppression factor. For a node about to be expanded, one more termination condition is added to the usual ones: if the samples' suppression factor exceeds a given threshold, the node is not expanded further. This effectively avoids over-fine partitioning and overly large trees, making the result easier for users to understand.

20.
An algorithm that computes the best matching of two trees is described. The degree of mismatch, i.e., the distance, is measured in terms of the number of node splitting and merging operations required. The proposed tree distance is a more appropriate measurement of structural deformation than the tree distance measured by the number of insertions, deletions, and substitutions of tree nodes, as defined in previous studies. An algorithm that uses a divide-and-conquer strategy is presented. The analysis shows that the time complexity is O(NM^2), where N and M are the numbers of nodes of the two trees, respectively. The algorithm has been implemented on a VAX 11/780.
