Similar Documents
20 similar documents found (search time: 192 ms)
1.
White, Allan P.; Liu, Wei Zhong. Machine Learning, 1994, 15(3): 321-329
A fresh look is taken at the problem of bias in information-based attribute selection measures, used in the induction of decision trees. The approach uses statistical simulation techniques to demonstrate that the usual measures such as information gain, gain ratio, and a new measure recently proposed by Lopez de Mantaras (1991) are all biased in favour of attributes with large numbers of values. It is concluded that approaches which utilise the chi-square distribution are preferable because they compensate automatically for differences between attributes in the number of levels they take.
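The bias the authors simulate can be reproduced in a few lines. The sketch below (not their code) draws class labels independent of the attributes, then averages the information gain of a 2-valued versus a 20-valued random attribute over repeated trials; the many-valued attribute consistently looks more informative even though both are pure noise.

```python
import math
import random

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(xs, ys):
    """Entropy reduction from partitioning the labels ys by attribute values xs."""
    n = len(ys)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(ys) - cond

random.seed(0)
n = 200
ys = [random.randint(0, 1) for _ in range(n)]  # labels independent of any attribute
gain2 = sum(information_gain([random.randint(0, 1) for _ in range(n)], ys)
            for _ in range(500)) / 500          # 2-valued random attribute
gain20 = sum(information_gain([random.randint(0, 19) for _ in range(n)], ys)
             for _ in range(500)) / 500         # 20-valued random attribute
print(gain20 > gain2)  # the many-valued attribute looks better despite being noise
```

A chi-square test of the same contingency tables would instead account for the differing degrees of freedom, which is the correction the abstract argues for.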

2.
房立, 黄泽宇. 《微机发展》, 2006, 16(8): 106-109
Selecting the splitting attribute is the key step in building a decision tree classifier. By analyzing three criteria for selecting splitting attributes (information gain and gain ratio, the Gini index, and an index based on the Goodman-Kruskal association), this paper proposes a method for improving the classic C4.5 decision tree classifier: a decision tree model that selects the splitting attribute competitively. It combines the three criteria and chooses the best splitting attribute through a competition mechanism. Experimental results show that in most cases it makes it possible to obtain a smaller decision tree without sacrificing classification accuracy.

3.
Classifiability-based omnivariate decision trees
Top-down induction of decision trees is a simple and powerful method of pattern classification. In a decision tree, each node partitions the available patterns into two or more sets. New nodes are created to handle each of the resulting partitions and the process continues. A node is considered terminal if it satisfies some stopping criteria (for example, purity, i.e., all patterns at the node are from a single class). Decision trees may be univariate, linear multivariate, or nonlinear multivariate depending on whether a single attribute, a linear function of all the attributes, or a nonlinear function of all the attributes is used for the partitioning at each node of the decision tree. Though nonlinear multivariate decision trees are the most powerful, they are more susceptible to the risks of overfitting. In this paper, we propose to perform model selection at each decision node to build omnivariate decision trees. The model selection is done using a novel classifiability measure that captures the possible sources of misclassification with relative ease and is able to accurately reflect the complexity of the subproblem at each node. The proposed approach is fast and does not suffer from as high a computational burden as that incurred by typical model selection algorithms. Empirical results over 26 data sets indicate that our approach is faster and achieves better classification accuracy compared to statistical model selection algorithms.

4.
Application of the variable precision rough set model to decision tree construction
To address the complexity and low classification efficiency of decision trees constructed by the ID3 algorithm, this paper proposes a new decision tree construction algorithm based on the variable precision rough set model. The algorithm uses weighted classification roughness as the heuristic function for selecting the splitting attribute at each node. Compared with information gain, this criterion more comprehensively characterizes an attribute's overall contribution to classification, is simple to compute, and eliminates the influence of noisy data on attribute selection and leaf-node generation. Experimental results show that the decision trees constructed by this algorithm outperform those of ID3 in both size and classification efficiency.
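The variable-precision idea behind such heuristics can be sketched as follows. This is a simplified illustration only: a plain β-positive-region fraction with a majority threshold, not the paper's exact weighted classification roughness.

```python
from collections import defaultdict

def beta_positive_fraction(xs, ys, beta=0.8):
    """Fraction of objects in the beta-positive region: an equivalence class
    (objects sharing an attribute value) counts as positively classified when
    the proportion of its majority decision class reaches beta. Lowering beta
    below 1.0 tolerates a controlled amount of noise/inconsistency."""
    blocks = defaultdict(list)
    for x, y in zip(xs, ys):
        blocks[x].append(y)
    covered = 0
    for block in blocks.values():
        best = max(block.count(c) for c in set(block)) / len(block)
        if best >= beta:
            covered += len(block)
    return covered / len(ys)

xs = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
ys = [ 1,   1,   1,   0,   0,   0,   0,   1 ]
print(beta_positive_fraction(xs, ys, beta=0.75))  # 1.0: both blocks reach 75% purity
print(beta_positive_fraction(xs, ys, beta=0.80))  # 0.0: neither block is pure enough
```

With the classical rough set model (β = 1.0), a single noisy label destroys the positive region; the variable precision model keeps the heuristic stable, which is the noise-robustness the abstract claims.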

5.
Relief is a measure of attribute quality which is often used for feature subset selection. Its use in induction of classification trees and rules, discretization, and other methods has however been hindered by its inability to suggest subsets of values of discrete attributes and thresholds for splitting continuous attributes into intervals. We present efficient algorithms for both tasks.
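The core Relief update the abstract builds on can be sketched as follows: a minimal single-nearest-neighbor Relief for binary classification. The value-subset and threshold extensions that the paper contributes are not reproduced here.

```python
import random

def relief(data, labels, n_iter=100, seed=0):
    """Basic Relief weight estimation. For each sampled instance, every
    attribute's weight is decreased by its difference to the nearest hit
    (same class) and increased by its difference to the nearest miss
    (other class), so class-relevant attributes accumulate positive weight."""
    rng = random.Random(seed)
    n_attr = len(data[0])
    w = [0.0] * n_attr

    def dist(a, b):  # squared Euclidean distance, used only to pick neighbors
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(len(data))
        hits = [j for j in range(len(data)) if j != i and labels[j] == labels[i]]
        misses = [j for j in range(len(data)) if labels[j] != labels[i]]
        h = min(hits, key=lambda j: dist(data[i], data[j]))
        m = min(misses, key=lambda j: dist(data[i], data[j]))
        for a in range(n_attr):
            w[a] += abs(data[i][a] - data[m][a]) - abs(data[i][a] - data[h][a])
    return [x / n_iter for x in w]

# attribute 0 determines the class; attribute 1 is noise
data = [(0.0, 0.3), (0.1, 0.9), (0.9, 0.2), (1.0, 0.8)]
labels = [0, 0, 1, 1]
w = relief(data, labels)
print(w[0] > w[1])  # the informative attribute gets the larger weight
```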

6.
Genetic-algorithm-based optimization of multi-attribute fuzzy decision trees
Decision trees are an efficient method in data mining, but when the training data has many attributes, the size of the constructed decision tree grows exponentially with the number of attributes, which in turn produces a huge number of rules. To address this problem, an optimization method based on a genetic algorithm is proposed. First, several groups of attributes are selected via roulette-wheel selection according to information gain, and multiple decision trees are constructed. A genetic algorithm then combines the multiple trees and finally forms a rule set. Experimental results are given that demonstrate the feasibility and effectiveness of the method.

7.
For fuzzy multi-attribute decision making problems, a multi-attribute decision model based on exponential fuzzy numbers is given. On one hand, the expectation of an exponential fuzzy number is defined in order to defuzzify the attribute weight vector; on the other hand, based on the theory of ternary interval numbers and the cut-set information of exponential fuzzy numbers, a new distance measure on exponential fuzzy numbers is defined to compute the distance between each alternative and the positive and negative ideal solutions. Following the fuzzy ideal point idea, and based on these definitions of expectation and distance, a TOPSIS multi-attribute decision method on exponential fuzzy numbers is given. The model is applied to a concrete example, and the results confirm the effectiveness of the method.
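The final TOPSIS ranking step shared by such methods can be sketched with crisp placeholder distances; the paper derives these distances from exponential fuzzy numbers, which is not reproduced here, so the numbers below are purely hypothetical.

```python
def topsis_closeness(d_pos, d_neg):
    """Relative closeness of an alternative, given its distance to the positive
    ideal solution (d_pos) and to the negative ideal solution (d_neg).
    Larger is better: close to the positive ideal, far from the negative one."""
    return d_neg / (d_pos + d_neg)

# hypothetical (d_pos, d_neg) pairs for three alternatives
dists = [(0.2, 0.8), (0.5, 0.5), (0.7, 0.1)]
scores = [topsis_closeness(dp, dn) for dp, dn in dists]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2]: alternative 0 dominates under these distances
```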

8.
In this paper we describe a sound, but not complete, analysis to prove termination of higher-order attribute grammar evaluation, which can otherwise fail to terminate by creating an unbounded number of (finite) trees as local tree-valued attributes, which are then themselves decorated with attributes. The analysis extracts a set of term-rewriting rules from the grammar that model creation of new syntax trees during the evaluation of higher-order attributes. If this term rewriting system terminates, then only a finite number of trees will be created during attribute grammar evaluation. The analysis places an ordering on nonterminals to handle the cases in which higher-order inherited attributes are used to ensure that a finite number of trees are created using such attributes. When paired with the traditional completeness and circularity analyses for attribute grammars and the assumption that each attribute equation defines a terminating computation, this analysis can be used to show that attribute grammar evaluation will terminate normally. This analysis can be applied to a wide range of common attribute grammar idioms and has been used to show that evaluation of our specification of Java 1.4 terminates. We also describe a modular version of the analysis that is performed on independently developed language extension grammars and the host language being extended. If the extensions individually pass the modular analysis then their composition is also guaranteed to terminate.

9.
Mining optimized gain rules for numeric attributes
Association rules are useful for determining correlations between attributes of a relation and have applications in the marketing, financial, and retail sectors. Furthermore, optimized association rules are an effective way to focus on the most interesting characteristics involving certain attributes. Optimized association rules are permitted to contain uninstantiated attributes and the problem is to determine instantiations such that either the support, confidence, or gain of the rule is maximized. In this paper, we generalize the optimized gain association rule problem by permitting rules to contain disjunctions over uninstantiated numeric attributes. Our generalized association rules enable us to extract more useful information about seasonal and local patterns involving the uninstantiated attribute. For rules containing a single numeric attribute, we present an algorithm with linear complexity for computing optimized gain rules. Furthermore, we propose a bucketing technique that can result in a significant reduction in input size by coalescing contiguous values without sacrificing optimality. We also present an approximation algorithm based on dynamic programming for two numeric attributes. Using recent results on binary space partitioning trees, we show that the approximations are within a constant factor of the optimal optimized gain rules. Our experimental results with synthetic data sets for a single numeric attribute demonstrate that our algorithm scales up linearly with the attribute's domain size as well as the number of disjunctions. In addition, we show that applying our optimized rule framework to a population survey real-life data set enables us to discover interesting underlying correlations among the attributes.

10.
Knowledge-based automatic model selection strategy
戴超凡, 冯旸赫. 《计算机工程》, 2010, 36(11): 170-172
Automatic model selection is an inevitable requirement of the intelligent development of decision support systems. Given that few practical algorithms currently exist, an automatic model selection strategy is proposed. Models are described within a knowledge frame; rules are extracted from the fact base and the knowledge base to generate an inference tree, and automatic model selection is realized by combining experience with domain knowledge. Experimental results show that the strategy achieves a high hit rate.

11.
Partitioning Nominal Attributes in Decision Trees
To find the optimal branching of a nominal attribute at a node in an L-ary decision tree, one is often forced to search over all possible L-ary partitions for the one that yields the minimum impurity measure. For binary trees (L = 2) when there are just two classes a short-cut search is possible that is linear in n, the number of distinct values of the attribute. For the general case in which the number of classes, k, may be greater than two, Burshtein et al. have shown that the optimal partition satisfies a condition that involves the existence of 2L hyperplanes in the class probability space. We derive a property of the optimal partition for concave impurity measures (including in particular the Gini and entropy impurity measures) in terms of the existence of L vectors in the dual of the class probability space, which implies the earlier condition. Unfortunately, these insights still do not offer a practical search method when n and k are large, even for binary trees. We therefore present a new heuristic search algorithm to find a good partition. It is based on ordering the attribute's values according to their principal component scores in the class probability space, and is linear in n. We demonstrate the effectiveness of the new method through Monte Carlo simulation experiments and compare its performance against other heuristic methods.
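The proposed heuristic, ordering a nominal attribute's values by their first principal component score in the class probability space and then scanning the ordered cuts, can be sketched as follows. This is an illustrative pure-Python reconstruction, not the authors' implementation.

```python
import math
from collections import defaultdict

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def pc1_scores(rows):
    """First-principal-component scores of the rows (class probability
    vectors), via power iteration on the covariance matrix."""
    k = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(k)]
    centered = [[r[j] - mean[j] for j in range(k)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) for j in range(k)] for i in range(k)]
    v = [1.0] + [0.0] * (k - 1)
    for _ in range(100):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(k)) for r in centered]

def best_binary_split(xs, ys):
    """Order the attribute's values by PC1 score of their class
    distributions, then scan the ordered cuts for the lowest weighted
    Gini impurity (linear in the number of distinct values)."""
    classes = sorted(set(ys))
    counts = defaultdict(lambda: [0] * len(classes))
    for x, y in zip(xs, ys):
        counts[x][classes.index(y)] += 1
    values = list(counts)
    probs = [[c / sum(counts[v]) for c in counts[v]] for v in values]
    order = [v for _, v in sorted(zip(pc1_scores(probs), values))]
    best = None
    for cut in range(1, len(order)):
        left_vals = set(order[:cut])
        l = [y for x, y in zip(xs, ys) if x in left_vals]
        r = [y for x, y in zip(xs, ys) if x not in left_vals]
        score = (len(l) * gini(l) + len(r) * gini(r)) / len(ys)
        if best is None or score < best[0]:
            best = (score, left_vals)
    return best

xs = ['a', 'a', 'b', 'b', 'c', 'c']
ys = [ 0,   0,   0,   0,   1,   1 ]
best = best_binary_split(xs, ys)
print(best)  # (0.0, {'c'}): sending value 'c' to one side separates the classes
```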

12.
A decision tree construction algorithm based on rough sets
To address the complexity and low classification efficiency of decision trees constructed by the ID3 algorithm, a decision tree construction algorithm based on rough set theory is proposed. The algorithm uses weighted classification roughness as the heuristic function for selecting the splitting attribute at each node; compared with information gain, it characterizes an attribute's overall contribution to classification more comprehensively and is simple to compute. To eliminate the influence of noise on attribute selection and leaf-node generation, the algorithm is optimized with the variable precision rough set model. Experimental results show that the decision trees constructed by this algorithm outperform those of ID3 in both size and classification efficiency.

13.
For fuzzy multi-attribute decision making problems over exponential fuzzy numbers, two multi-attribute TOPSIS decision methods are given, following the idea of the fuzzy ideal point method. The expected value of an exponential fuzzy number is defined in order to defuzzify the attribute weight vector, and a distance measure between exponential fuzzy numbers is defined to compute the distance between each alternative and the ideal solution. Based on these definitions of expected value and distance measure, two fuzzy multi-attribute TOPSIS decision methods are given from two different perspectives. An example verifies the feasibility and effectiveness of both methods, which are then compared and analyzed.

14.
We present a general technique for dynamizing a class of problems whose underlying structure is a computation graph embedded in a tree. We introduce three fully dynamic data structures, called path attribute systems, tree attribute systems, and linear attribute grammars, which extend and generalize the dynamic trees of Sleator and Tarjan. More specifically, we associate values, called attributes, with the nodes and paths of a rooted tree. Path attributes form a path attribute system if they can be maintained in constant time under path concatenation. Node attributes form a tree attribute system if the tree attributes of the tail of a path Π can be determined in constant time from the path attributes of Π. A linear attribute grammar is a tree-based linear expression such that the values of a node μ are calculated from the values at the parent, siblings, and/or children of μ. We provide a framework for maintaining path attribute systems, tree attribute systems, and linear attribute grammars in a fully dynamic environment using linear space and logarithmic time per operation. Also, we demonstrate the applicability of our techniques by showing examples of graph and geometric problems that can be efficiently dynamized, including biconnectivity and triconnectivity queries, planarity testing, drawing trees and series-parallel digraphs, slicing floorplan compaction, point location, and many optimization problems on bounded tree-width graphs. Received May 13, 1994; revised October 12, 1995.

15.
It is important to use a better criterion in selection and discretization of attributes for the generation of decision trees to construct a better classifier in the area of pattern recognition, in order to intelligently access huge amounts of data efficiently. Two well-known criteria are gain and gain ratio, both based on the entropy of partitions. We propose in this paper a new criterion based also on entropy, and use both theoretical analysis and computer simulation to demonstrate that it works better than gain or gain ratio in a wide variety of situations. We use the usual entropy calculation where the base of the logarithm is not two but the number of successors to the node. Our theoretical analysis leads to some specific situations in which the new criterion works always better than gain or gain ratio, and the simulation result may implicitly cover all the other situations not covered by the analysis.
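The entropy variant mentioned, taking the logarithm base equal to the number of a node's successors, can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code.

```python
import math

def split_entropy(partitions):
    """Weighted average entropy of a candidate split, with the logarithm
    taken to base b = number of branches rather than base 2, so splits of
    different arities are measured on a comparable scale."""
    b = len(partitions)  # number of successors of the node
    n = sum(len(p) for p in partitions)
    total = 0.0
    for p in partitions:
        probs = [p.count(c) / len(p) for c in set(p)]
        h = -sum(q * math.log(q, b) for q in probs if q > 0)
        total += len(p) / n * h
    return total

pure = [[0, 0, 0], [1, 1, 1]]       # 2-way split, both branches pure
mixed = [[0, 1], [0, 1], [0, 1]]    # 3-way split, every branch maximally mixed
print(split_entropy(pure), split_entropy(mixed))  # 0.0 versus log_3(2) ≈ 0.631
```

With a base-2 logarithm, wider splits would be evaluated on an inflated scale; normalizing by the branching factor removes that arity effect, which is the intuition behind the proposed criterion.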

16.
In this paper, a method based on prospect theory is proposed to solve the multiple attribute decision making (MADM) problem considering aspiration-levels of attributes, where attribute values and aspiration-levels are represented in two different formats: crisp numbers and interval numbers. According to the idea of prospect theory, aspiration-levels are firstly regarded as the reference points, and the four possible types for comparing an attribute value with an aspiration-level are described. Then, for all possible cases of the four types, the calculation formulae of gains and losses of alternatives concerning attributes are given. By calculating gain and loss of each alternative, a gain matrix and a loss matrix are constructed, respectively. Further, using the value function proposed in prospect theory and the simple additive weighting method, the overall prospect value of each alternative is calculated. Based on the obtained overall prospect values, a ranking of alternatives can be determined. Finally, a numerical example is used to illustrate the use of the proposed method.
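The valuation step can be sketched with the standard prospect-theory value function and simple additive weighting for the crisp-number case only; the interval-number cases the paper also covers are omitted, and the parameter defaults below (from Tversky and Kahneman) are illustrative assumptions.

```python
def prospect_value(x, alpha=0.88, beta_=0.88, lam=2.25):
    """Prospect-theory value function: concave for gains, convex and steeper
    (loss aversion factor lam) for losses, relative to the reference point."""
    return x ** alpha if x >= 0 else -lam * (-x) ** beta_

def overall_prospect(attr_values, aspirations, weights):
    """Simple additive weighting of per-attribute prospect values, with each
    aspiration level acting as the reference point for its attribute."""
    return sum(w * prospect_value(v - a)
               for v, a, w in zip(attr_values, aspirations, weights))

aspirations = [0.6, 0.5]
weights = [0.5, 0.5]
a1 = overall_prospect([0.8, 0.7], aspirations, weights)  # exceeds both aspirations
a2 = overall_prospect([0.5, 0.6], aspirations, weights)  # falls short on the first
print(a1 > a2)  # loss aversion drags a2 below a1 despite similar averages
```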

17.
Nonlinear integrals play an important role in information fusion. So far, all existing nonlinear integrals of a function with respect to a set function are defined on a subset of a space. In many of the problems with information fusion, such as decision tree generation in inductive learning, we often need to deal with the function defined on a partition of the space. Motivated by minimizing the classification information entropy of a partition while generating decision trees, this paper proposes a nonlinear integral of a function with respect to a nonnegative set function on a partition, and provides the conclusion that the sum of the weighted entropy of the union of several subsets is not less than the sum of the weighted entropy of a single subset. It is shown that selecting the entropy of a single attribute is better than selecting the entropy of the union of several attributes in generating rules by decision trees.

18.
Decision trees are a common classification method in data mining. For the inconsistent data caused by noise in university student employment records, this paper proposes a decision tree model based on variable precision rough sets and applies it to the analysis of student employment data. The method uses the classification quality measure of variable precision rough sets as the information function for selecting condition attributes as tree nodes, splitting the data set top-down until a termination condition is met. It fully accounts for the dependency and redundancy among attributes, and allows a certain degree of inconsistency in the class labels of instances assigned to the positive region during tree construction. Experiments show that the algorithm handles inconsistent data sets effectively and classifies the employment data correctly and reasonably, ultimately yielding several valuable conclusions for decision analysis. The algorithm greatly improves the generalization ability of the decision rules and simplifies the tree structure.

19.
An ID3 algorithm based on modified information gain
ID3 is one of the most influential decision tree algorithms; it selects the test attribute of the decision tree by information gain. This criterion has a shortcoming: when choosing a test attribute, it is biased toward attributes with many values, which in practice are not necessarily important. To address this deficiency, this paper proposes an ID3 algorithm with a corrected gain, offering an effective way to alleviate ID3's multi-value bias. Theoretical analysis and experiments show that the algorithm handles the multi-value bias well.
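The multi-value bias, and one classic correction for it, can be demonstrated in a few lines. The correction shown is C4.5's gain ratio, used here purely for illustration; the paper proposes its own modified gain, which is not reproduced.

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain(xs, ys):
    """Plain ID3 information gain of attribute xs with respect to class ys."""
    n = len(ys)
    g = entropy(ys)
    for v in set(xs):
        sub = [y for x, y in zip(xs, ys) if x == v]
        g -= len(sub) / n * entropy(sub)
    return g

def gain_ratio(xs, ys):
    """C4.5 gain ratio: split information penalizes many-valued attributes."""
    iv = entropy(xs)
    return gain(xs, ys) / iv if iv else 0.0

ys      = [0, 0, 1, 1, 0, 0, 1, 1]
useful  = ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']  # perfectly predicts the class
id_like = list(range(8))                            # unique value per record
print(gain(id_like, ys) >= gain(useful, ys))        # plain gain cannot tell them apart
print(gain_ratio(useful, ys) > gain_ratio(id_like, ys))  # the correction can
```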

20.
Several distance measures for hesitant fuzzy sets based on the Lance (Canberra) distance are first proposed. Then, for the problem that two hesitant fuzzy numbers may contain different numbers of membership degrees, a new dimension-reduction scheme for hesitant fuzzy numbers is proposed. The scheme does not require repeatedly padding the hesitant fuzzy number with the maximum or minimum membership value, so it preserves the original information well while reducing the computation needed for distance calculation. For the case where the attribute weights are completely unknown, a hesitant fuzzy exponential entropy is constructed from the actual data, and the attribute weights are obtained by the entropy minimization principle. Finally, the exponential-entropy-weighted, dimension-reduced hesitant fuzzy Lance distance measure is applied in a case study on real medical diagnosis data. The results show that the proposed measure gives consistent diagnoses for different values of $\lambda$, reduces computation, and improves diagnostic efficiency, and thus has practical value for real-time, effective medical diagnosis.
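The Lance (Canberra) distance at the heart of the proposal can be sketched as follows. The dimension-reduction helper below is a naive averaging stand-in for the paper's scheme, included only so the example runs on hesitant fuzzy elements of unequal length.

```python
def lance_distance(h1, h2):
    """Canberra (Lance) distance between two hesitant fuzzy elements of equal
    length: mean of |a - b| / (a + b) over sorted, paired membership degrees."""
    assert len(h1) == len(h2)
    total = 0.0
    for a, b in zip(sorted(h1), sorted(h2)):
        total += abs(a - b) / (a + b) if a + b else 0.0
    return total / len(h1)

def reduce_dim(h, m):
    """Hypothetical dimension reduction to m membership degrees by averaging
    equal slices, avoiding the usual padding with max/min values. This is an
    illustrative stand-in, NOT the paper's scheme."""
    h = sorted(h)
    step = len(h) / m
    return [sum(h[int(i * step):int((i + 1) * step)]) /
            len(h[int(i * step):int((i + 1) * step)]) for i in range(m)]

h1 = [0.2, 0.4, 0.6, 0.8]  # hesitant fuzzy element with 4 membership degrees
h2 = [0.3, 0.5]            # hesitant fuzzy element with 2 membership degrees
d = lance_distance(reduce_dim(h1, 2), reduce_dim(h2, 2))
print(round(d, 3))  # 0.083
```

Because each term is normalized by a + b, the Canberra form weights differences between small membership degrees more heavily than the plain Euclidean or Hamming distances commonly used for hesitant fuzzy sets.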
