Similar Literature
20 similar documents retrieved.
1.
We consider multisplitting of numerical value ranges, a task that is encountered as a discretization step preceding induction and is also embedded into learning algorithms. We are interested in finding the partition that optimizes the value of a given attribute evaluation function. For the most commonly used evaluation functions this task takes time quadratic in the number of potential cut points in the numerical range. Hence, it is a potential bottleneck in data mining algorithms. We present two techniques that speed up the optimal multisplitting task. The first aims at discarding cut point candidates in a quick linear-time preprocessing scan before embarking on the actual search. We generalize the definition of boundary points by Fayyad and Irani so that adjacent example blocks with the same relative class distribution can be merged. We prove for several commonly used evaluation functions that this preprocessing removes only suboptimal cut points; hence, the algorithm does not lose optimality. Our second technique tackles the quadratic-time dynamic programming algorithm, which is the best known scheme for optimizing many well-known evaluation functions. We present a technique that dynamically, i.e., during the search, prunes partitions of prefixes of the sorted data from the algorithm's search space. The method works for all convex and cumulative evaluation functions. Together these two techniques speed up the multisplitting process considerably. Compared to the baseline dynamic programming algorithm the speed-up is around 50 percent on average and up to 90 percent in some cases. We conclude that optimal multisplitting is fully feasible on all benchmark data sets we have encountered.
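A minimal sketch of the linear-time preprocessing idea from this abstract: sort the examples, merge adjacent blocks of equal attribute value, and keep only the boundary points between blocks whose relative class distributions differ. The function name and data layout are illustrative, not the authors' implementation.

```python
from collections import Counter

def boundary_cut_points(values, labels):
    """Return cut-point candidates that survive the linear-time scan:
    positions between adjacent equal-value blocks whose relative
    class distributions differ."""
    pairs = sorted(zip(values, labels))
    blocks = []  # one (value, class-count) entry per distinct attribute value
    for v, y in pairs:
        if blocks and blocks[-1][0] == v:
            blocks[-1][1][y] += 1
        else:
            blocks.append((v, Counter({y: 1})))

    def relative(dist):
        n = sum(dist.values())
        return {c: k / n for c, k in dist.items()}

    cuts = []
    for (v1, d1), (v2, d2) in zip(blocks, blocks[1:]):
        # Merging identically distributed neighbours loses no optimal cut,
        # so a cut is kept only where the distributions change.
        if relative(d1) != relative(d2):
            cuts.append((v1 + v2) / 2)
    return cuts

print(boundary_cut_points([1, 2, 3, 4, 5, 6], ['a', 'a', 'b', 'b', 'a', 'a']))
# -> [2.5, 4.5]: the cuts between classes survive, the rest are discarded
```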

2.
魏俐华  陈刚 《计算机应用研究》2021,38(10):2954-2960
For decision problems in which attribute values are Pythagorean fuzzy linguistic terms, attribute weights are unknown, and both group consistency and the uncertainty of decision makers' attribute preferences must be considered, this paper explores a two-stage interactive multi-attribute group decision-making method that incorporates trust degrees. First, taking decision makers' preferences into account, the attribute set is divided into a mandatory attribute set and an optional attribute set. Second, a two-stage interaction mechanism is constructed to ensure that group consistency on the mandatory attribute set reaches a threshold: the first stage of interaction aims to raise the level of group consensus, and the second stage aims to reduce the level of conflict. Third, the cumulative stability and cumulative influence factor of the interactions are jointly considered to determine decision-maker weights for the mandatory attribute set, while trust degrees determine decision-maker weights for the optional attribute set. Finally, distance entropy determines the attribute weights, and the VIKOR method selects the optimal alternative. A numerical example shows that the method handles multi-attribute group decision problems that account for both consensus and conflict levels well.
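The final ranking step above uses VIKOR. Below is a minimal crisp VIKOR sketch over a numeric decision matrix; the paper applies it to Pythagorean fuzzy linguistic values, whose score functions and the trust-based weighting are not reproduced here, and the sample data is invented.

```python
import numpy as np

def vikor(F, w, v=0.5):
    """Minimal crisp VIKOR ranking. F: m x n matrix (rows = alternatives,
    columns = benefit criteria), w: criterion weights summing to 1,
    v: weight of the 'group utility' strategy."""
    f_best, f_worst = F.max(axis=0), F.min(axis=0)
    norm = (f_best - F) / (f_best - f_worst)   # per-criterion regret
    S = (w * norm).sum(axis=1)                 # group utility
    R = (w * norm).max(axis=1)                 # individual regret
    Q = v * (S - S.min()) / (S.max() - S.min()) + \
        (1 - v) * (R - R.min()) / (R.max() - R.min())
    return np.argsort(Q)                       # lowest Q (best) first

F = np.array([[0.7, 0.5, 0.8],
              [0.6, 0.9, 0.6],
              [0.8, 0.6, 0.7]])
w = np.array([0.4, 0.3, 0.3])
print(vikor(F, w))  # ranking of the three alternatives
```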

3.
Classifiability-based omnivariate decision trees
Top-down induction of decision trees is a simple and powerful method of pattern classification. In a decision tree, each node partitions the available patterns into two or more sets. New nodes are created to handle each of the resulting partitions and the process continues. A node is considered terminal if it satisfies some stopping criteria (for example, purity, i.e., all patterns at the node are from a single class). Decision trees may be univariate, linear multivariate, or nonlinear multivariate depending on whether a single attribute, a linear function of all the attributes, or a nonlinear function of all the attributes is used for the partitioning at each node of the decision tree. Though nonlinear multivariate decision trees are the most powerful, they are more susceptible to the risks of overfitting. In this paper, we propose to perform model selection at each decision node to build omnivariate decision trees. The model selection is done using a novel classifiability measure that captures the possible sources of misclassification with relative ease and is able to accurately reflect the complexity of the subproblem at each node. The proposed approach is fast and does not suffer from as high a computational burden as that incurred by typical model selection algorithms. Empirical results over 26 data sets indicate that our approach is faster and achieves better classification accuracy compared to statistical model selection algorithms.
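As a baseline for the per-node model selection idea, a hedged sketch that picks among univariate, linear, and nonlinear node models by cross-validated accuracy; the paper's contribution is precisely to replace this expensive validation loop with a cheap classifiability measure, which is not reproduced here. Candidate models and names are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def select_node_model(X, y):
    """Generic per-node model selection for an omnivariate tree: choose
    the split model with the best cross-validated accuracy on the data
    reaching this node."""
    candidates = {
        'univariate': DecisionTreeClassifier(max_depth=1),  # axis-parallel
        'linear': LogisticRegression(max_iter=1000),        # oblique
        'nonlinear': SVC(kernel='rbf'),                     # nonlinear
    }
    scores = {name: cross_val_score(m, X, y, cv=3).mean()
              for name, m in candidates.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a linearly separable subproblem
print(select_node_model(X, y))
```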

4.
We consider the problem of exploiting a taxonomy of propositionalized attributes in order to learn compact and robust classifiers. We introduce the propositionalized attribute taxonomy guided decision tree learner (PAT-DTL), an inductive learning algorithm that exploits a taxonomy of propositionalized attributes as prior knowledge to generate compact decision trees. Since taxonomies are unavailable in most domains, we also introduce the propositionalized attribute taxonomy learner (PAT-Learner), which automatically constructs a taxonomy from data. PAT-DTL uses top-down and bottom-up search to find a locally optimal cut that corresponds to the literals of decision rules from the data and the propositionalized attribute taxonomy. PAT-Learner propositionalizes attributes and hierarchically clusters the propositionalized attributes based on the distribution of class labels that co-occur with them to generate a taxonomy. Our experimental results on UCI repository data sets show that the proposed algorithms can generate decision trees that are generally more compact than, and sometimes comparable in accuracy to, those produced by standard decision tree learners.
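A hedged sketch of the taxonomy-construction step: hierarchically cluster propositionalized attributes by the similarity of the class distributions that co-occur with them. The Jensen-Shannon distance and the sample distributions below are assumptions for illustration, not the paper's exact choices.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon, squareform
from scipy.cluster.hierarchy import linkage

# Rows: propositionalized (boolean) attributes; columns: class labels.
# Each row is the class distribution of examples where the attribute holds.
class_dist = np.array([
    [0.90, 0.10],  # hypothetical attribute "color=red"
    [0.85, 0.15],  # hypothetical attribute "color=crimson"
    [0.20, 0.80],  # hypothetical attribute "size=large"
])

n = len(class_dist)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = jensenshannon(class_dist[i], class_dist[j])

# Agglomerative clustering over the pairwise distances yields a taxonomy:
# attributes with near-identical class profiles merge first.
taxonomy = linkage(squareform(D), method='average')
print(taxonomy)
```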

5.
Cut selection based on heuristic information is one of the most fundamental issues in the induction of decision trees with continuous-valued attributes. This paper connects the selection of optimal cuts with a class of heuristic information functions. It shows statistically that both training and testing accuracies in decision tree learning depend strongly on the choice of heuristic. A clear relationship between the second-order derivative of the heuristic information function and the locations of optimal cuts is derived mathematically and confirmed experimentally. Incorporating this relationship into the process of building decision trees significantly reduces the number of detected cuts and improves the generalization of the decision tree.

6.
A Rough-Set-Based Decision Model for Deterministic Control Rules
In rough set theory, changing the number of partitions of an attribute affects both the attribute's significance and its support for the decision attribute. This paper analyzes data dependencies in a knowledge representation system and, with the system's generalization ability in mind, proposes a decision model that generates deterministic control rules. It gives criteria and algorithmic implementations for determining the number of attribute partitions and for producing relative attribute reducts. Experimental results show that the algorithm is simple and effective, confirming the accuracy and practicality of the decision model.
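For the rough-set notions this abstract relies on, a minimal sketch of the dependency degree gamma(C, D) and the derived attribute significance used when computing relative reducts; the toy data layout is illustrative.

```python
from collections import defaultdict

def dependency(U, cond, dec):
    """Rough-set dependency degree gamma(C, D): the fraction of objects
    whose condition-attribute equivalence class is consistent on the
    decision attribute. U is a list of dicts, cond a list of condition
    attribute names, dec the decision attribute name."""
    classes = defaultdict(set)
    for x in U:
        classes[tuple(x[a] for a in cond)].add(x[dec])
    pos = sum(
        sum(1 for x in U if tuple(x[a] for a in cond) == key)
        for key, decisions in classes.items() if len(decisions) == 1
    )
    return pos / len(U)

U = [{'a': 0, 'b': 0, 'd': 'y'}, {'a': 0, 'b': 1, 'd': 'n'},
     {'a': 1, 'b': 1, 'd': 'n'}, {'a': 1, 'b': 1, 'd': 'n'}]
# Significance of attribute b: the drop in dependency when b is removed.
print(dependency(U, ['a', 'b'], 'd') - dependency(U, ['a'], 'd'))  # 0.5
```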

7.
A Distance-Based Attribute Selection Measure for Decision Tree Induction
This note introduces a new attribute selection measure for ID3-like inductive algorithms. This measure is based on a distance between partitions such that the selected attribute in a node induces the partition which is closest to the correct partition of the subset of training examples corresponding to this node. The relationship of this measure with Quinlan's information gain is also established. It is also formally proved that our distance is not biased towards attributes with large numbers of values. Experimental studies with this distance confirm previously reported results showing that the predictive accuracy of induced decision trees is not sensitive to the goodness of the attribute selection measure. However, this distance produces smaller trees than the gain ratio measure of Quinlan, especially in the case of data whose attributes have significantly different numbers of values.
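A sketch of one normalized distance between partitions with the stated properties, computed from the joint class/attribute counts. The exact normalization used here, d = (H(A|B) + H(B|A)) / H(A,B), is one standard choice and should be checked against the paper.

```python
import math
from collections import Counter

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def partition_distance(a, b):
    """Normalized distance between two partitions of the same example set:
    d = (H(A|B) + H(B|A)) / H(A,B), which lies in [0, 1] and is a metric."""
    joint = Counter(zip(a, b))
    H_ab = entropy(list(joint.values()))
    H_a = entropy(list(Counter(a).values()))
    H_b = entropy(list(Counter(b).values()))
    # H(A|B) = H(A,B) - H(B) and H(B|A) = H(A,B) - H(A)
    return (2 * H_ab - H_a - H_b) / H_ab if H_ab else 0.0

# Partition induced by an attribute vs. the correct class partition:
attr_partition  = ['v1', 'v1', 'v2', 'v2', 'v3']
class_partition = ['c1', 'c1', 'c2', 'c2', 'c2']
print(partition_distance(attr_partition, class_partition))  # ~0.36
```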

8.
Data categorization using decision trellises
We introduce a probabilistic graphical model for supervised learning on databases with categorical attributes. The proposed belief network contains hidden variables that play a role similar to nodes in decision trees and each of their states either corresponds to a class label or to a single attribute test. As a major difference with respect to decision trees, the selection of the attribute to be tested is probabilistic. Thus, the model can be used to assess the probability that a tuple belongs to some class, given the predictive attributes. Unfolding the network along the hidden states dimension yields a trellis structure having a signal flow similar to second order connectionist networks. The network encodes context specific probabilistic independencies to reduce parametric complexity. We present a custom tailored inference algorithm and derive a learning procedure based on the expectation-maximization algorithm. We propose decision trellises as an alternative to decision trees in the context of tuple categorization in databases, which is an important step for building data mining systems. Preliminary experiments on standard machine learning databases are reported, comparing the classification accuracy of decision trellises and decision trees induced by C4.5. In particular, we show that the proposed model can offer significant advantages for sparse databases in which many predictive attributes are missing.

9.
Selecting an emergency response plan when an unexpected event may evolve into several possible states is a research topic of both theoretical interest and practical value. This paper presents a method for selecting emergency response plans when attribute values take the form of random variables. First, for each attribute whose values are random variables, a dominance matrix is constructed for every possible state. Next, a dominance-degree matrix is built for each such attribute and normalized. Then, for each attribute with numerical values, a normalized decision matrix is constructed. The normalized dominance-degree matrices and normalized decision matrices are aggregated into a comprehensive decision matrix, from which a comprehensive evaluation value is computed for each candidate plan; plans are ranked by these values to determine the emergency response plan. Finally, a numerical example illustrates the applicability of the proposed method.
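A hedged sketch of the dominance-matrix idea for random-variable attribute values, using the pairwise probability P(X_i > X_j) for independent normal variables as the dominance degree; the distributional assumption, the sample plans, and the row-sum aggregation are illustrative, not the paper's exact construction.

```python
import numpy as np
from scipy.stats import norm

# Attribute values of three candidate plans as normal random variables
# (mean, std) under one possible state -- invented for illustration.
plans = [(7.0, 1.0), (6.0, 0.5), (8.0, 2.0)]

n = len(plans)
P = np.zeros((n, n))
for i, (mi, si) in enumerate(plans):
    for j, (mj, sj) in enumerate(plans):
        if i != j:
            # For independent normals, X_i - X_j is normal with mean
            # mi - mj and variance si^2 + sj^2, so P(X_i > X_j) is:
            P[i, j] = norm.sf(0, loc=mi - mj, scale=(si**2 + sj**2) ** 0.5)

# Row sums give a simple aggregate dominance score per plan.
print(P.round(3))
print(P.sum(axis=1).round(3))
```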

10.
Univariate decision trees are classifiers currently used in many data mining applications. This classifier discovers partitions in the input space via hyperplanes that are orthogonal to the axes of attributes, producing a model that can be understood by human experts. One disadvantage of univariate decision trees is that they produce complex and inaccurate models when decision boundaries are not orthogonal to axes. In this paper we introduce Fisher's Tree, a classifier that exploits the dimensionality reduction of Fisher's linear discriminant and uses the decomposition strategy of decision trees to produce an oblique decision tree. Our proposal generates an artificial attribute that is used to split the data in a recursive way. The Fisher's decision tree induces oblique trees whose accuracy, size, number of leaves and training time are competitive with respect to other decision trees reported in the literature. We use more than ten publicly available data sets to demonstrate the effectiveness of our method.
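A minimal sketch of the node-level computation: derive Fisher's discriminant direction for the two-class case, project the data onto it to obtain the artificial attribute, and split on a threshold. The midpoint threshold is an assumption; a full tree would recurse on the two sides of the split.

```python
import numpy as np

def fisher_split(X, y):
    """One node of an oblique tree: project onto Fisher's linear
    discriminant direction (two-class case) and split at the midpoint
    of the projected class means, the 'artificial attribute' idea."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) + \
         np.cov(X1, rowvar=False) * (len(X1) - 1)   # within-class scatter
    w = np.linalg.solve(Sw, m1 - m0)                # Fisher direction
    z = X @ w                                       # artificial attribute
    threshold = (z[y == 0].mean() + z[y == 1].mean()) / 2
    return w, threshold, z > threshold              # boolean split mask

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w, t, mask = fisher_split(X, y)
print(w, t, mask.sum())
```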

11.
Research on a Highly Comprehensible Binary Decision Tree Generation Algorithm
蒋艳凰  杨学军  赵强利 《软件学报》2003,14(12):1996-2005
Binary discretization is the most common way to handle continuous attributes in decision tree generation, but for problems with many continuous attributes it yields large trees whose knowledge representation is hard to understand. For two-class classification problems, this paper proposes a multi-interval discretization method based on attribute transformation, RCAT. The method first transforms a continuous attribute into the probability attribute of one class; a binary split of this probability attribute corresponds to a multi-interval partition of the original continuous attribute. The interval boundaries are then optimized to obtain the information entropy gain of the original attribute. Finally, pessimistic pruning and lossless-merge pruning are used to simplify the RCAT decision tree. Experiments on data sets from several domains show that, compared with binary discretization, the RCAT algorithm runs efficiently and produces trees that preserve classification accuracy while being smaller and easier to understand.
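A hedged sketch of the attribute transformation at the core of RCAT: estimate P(class = 1 | value) for a continuous attribute (the equal-width binning estimator below is an assumption), so that a single binary cut on the probability attribute induces a multi-interval partition of the original range.

```python
import numpy as np

def probability_attribute(values, labels, n_bins=10):
    """Transform a continuous attribute into P(class = 1 | value),
    estimated by equal-width binning; one binary cut on this probability
    then induces a multi-interval split of the original range."""
    values, labels = np.asarray(values, float), np.asarray(labels)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    bin_id = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    p = np.array([
        labels[bin_id == b].mean() if (bin_id == b).any() else 0.5
        for b in range(n_bins)
    ])
    return p[bin_id]

# Class 1 occupies the middle of the range: one cut on the probability
# attribute (p > 0.5) recovers a three-interval partition of the range.
v = np.arange(30, dtype=float)
y = ((v >= 10) & (v < 20)).astype(int)
p = probability_attribute(v, y, n_bins=6)
print(np.unique(v[p > 0.5]))  # the middle interval, values 10..19
```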

12.
Induction of multiple fuzzy decision trees based on rough set technique
The integration of fuzzy sets and rough sets can lead to a hybrid soft-computing technique which has been applied successfully to many fields such as machine learning, pattern recognition and image processing. The key to this soft-computing technique is how to set up and make use of the fuzzy attribute reduct in fuzzy rough set theory. Given a fuzzy information system, we may find many fuzzy attribute reducts, each of which can make a different contribution to decision-making. If only one of the fuzzy attribute reducts, perhaps the most important one, is selected to induce decision rules, useful information hidden in the other reducts will unavoidably be lost. To make full use of the information provided by every individual fuzzy attribute reduct in a fuzzy information system, this paper presents a novel induction of multiple fuzzy decision trees based on the rough set technique. The induction consists of three stages. First, several fuzzy attribute reducts are found by a similarity-based approach; then a fuzzy decision tree is generated for each fuzzy attribute reduct according to the fuzzy ID3 algorithm. The fuzzy integral is finally used as a fusion tool to integrate the generated decision trees: it combines the outputs of the multiple fuzzy decision trees and forms the final decision result. An illustration is given to show the proposed fusion scheme. A numerical experiment on real data indicates that the proposed multiple-tree induction is superior to single-tree induction based on an individual reduct or on the entire feature set for learning problems with many attributes.

13.
Learning from data streams is a challenging task that demands a learning algorithm with several high-quality features. In addition to the space and speed requirements needed to process the huge volume of data arriving at high speed, the learning algorithm must strike a good balance between stability and plasticity. This paper presents a new approach to inducing incremental decision trees on streaming data. In this approach, the internal nodes contain trainable split tests. In contrast with traditional decision trees, in which a single attribute is selected as the split test, each internal node of the proposed approach contains a trainable function based on multiple attributes, which not only provides the flexibility needed in the stream context but also improves stability. Based on this approach, we propose the evolving fuzzy min–max decision tree (EFMMDT) learning algorithm, in which each internal node of the decision tree contains an evolving fuzzy min–max neural network. EFMMDT splits the instance space non-linearly based on multiple attributes, which results in much smaller and shallower decision trees. Extensive experiments reveal that the proposed algorithm achieves much better precision than state-of-the-art decision tree learning algorithms on the benchmark data streams, especially in the presence of concept drift.

14.
Decision theory shows that the optimal decision is a function of the posterior class probabilities. More specifically, in binary classification, the optimal decision is based on the comparison of the posterior probabilities with some threshold. Therefore, the most accurate estimates of the posterior probabilities are required near these decision thresholds. This paper discusses the design of objective functions that provide more accurate estimates of the probability values, taking into account the characteristics of each decision problem. We propose learning algorithms based on the stochastic gradient minimization of these loss functions. We show that the performance of the classifier is improved when these algorithms behave like sample selectors: samples near the decision boundary are the most relevant during learning.
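A worked illustration of the threshold rule described above: with misclassification costs c_fp and c_fn, the Bayes-optimal decision predicts the positive class whenever the posterior exceeds c_fp / (c_fp + c_fn), so accurate posterior estimates matter most near that threshold. The cost values are illustrative.

```python
# Bayes-optimal binary decision: predict class 1 when the expected cost
# of predicting 0 exceeds that of predicting 1, i.e. when
# p * c_fn > (1 - p) * c_fp, equivalently p > c_fp / (c_fp + c_fn).
c_fp, c_fn = 1.0, 5.0             # cost of false positive / false negative
threshold = c_fp / (c_fp + c_fn)  # = 1/6: costly misses lower the bar

posteriors = [0.10, 0.20, 0.55, 0.90]
decisions = [int(p > threshold) for p in posteriors]
print(round(threshold, 4), decisions)  # 0.1667 [0, 1, 1, 1]
```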

15.
Only a subset of the boundary points, the segment borders, has to be taken into account when searching for the optimal multisplit of a numerical value range with respect to the most commonly used attribute evaluation functions of classification learning algorithms. Segments and their borders can be found efficiently in a linear-time preprocessing step. In this paper we expand the applicability of segment borders by showing that inspecting them alone suffices in optimizing any convex evaluation function. For strictly convex evaluation functions, inspecting all segment borders is also necessary. These results are derived directly from Jensen's inequality. We also study the evaluation function Training Set Error, which is not strictly convex. With that function the data can be preprocessed into an even smaller number of cut point candidates, called alternations, when striving for an optimal partition. Examining all alternations also seems necessary since, analogously to strictly convex functions, the placement of neighboring cut points affects the optimality of an alternation. We test empirically, on real-world data, the reduction in the number of cut point candidates that can be obtained for Training Set Error.
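A sketch of extracting the reduced cut-point candidates for Training Set Error. Reading "alternations" as boundaries where the set of majority classes changes between adjacent value blocks is an interpretation of the abstract, not the paper's exact definition.

```python
from collections import Counter

def alternation_points(values, labels):
    """Cut-point candidates for Training Set Error: boundaries between
    adjacent equal-value blocks whose sets of majority classes differ.
    (An interpretation of 'alternations', not the paper's definition.)"""
    pairs = sorted(zip(values, labels))
    blocks = []
    for v, y in pairs:
        if blocks and blocks[-1][0] == v:
            blocks[-1][1][y] += 1
        else:
            blocks.append((v, Counter({y: 1})))

    def majorities(dist):
        top = max(dist.values())
        return frozenset(c for c, k in dist.items() if k == top)

    return [(v1 + v2) / 2
            for (v1, d1), (v2, d2) in zip(blocks, blocks[1:])
            if majorities(d1) != majorities(d2)]

print(alternation_points([1, 2, 3, 4], ['a', 'a', 'b', 'b']))  # [2.5]
```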

16.
An RST-Based Method for Decision Tree Generation and Pruning
When building a decision tree on rough set theory, one computes the boundary of each condition attribute relative to a given classification and selects the attribute with the smallest boundary as the node for the current branch; however, this method cannot be applied directly to multi-valued classifications. To address this, this paper uses the concept of the explicit region as the attribute selection criterion: among the candidate condition attributes, the one with the largest explicit region relative to the entire decision attribute is selected as the node for the current branch. Based on the same concept, a new method for pruning the decision tree is also proposed, and an example shows that the pruning method is simple and effective.

17.
Lazy Learning of Bayesian Rules
The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. A number of approaches have sought to alleviate this problem. A Bayesian tree learning algorithm builds a decision tree, and generates a local naive Bayesian classifier at each leaf. The tests leading to a leaf can alleviate attribute inter-dependencies for the local naive Bayesian classifier. However, Bayesian tree learning still suffers from the small disjunct problem of tree learning. While inferred Bayesian trees demonstrate low average prediction error rates, there is reason to believe that error rates will be higher for those leaves with few training examples. This paper proposes the application of lazy learning techniques to Bayesian tree induction and presents the resulting lazy Bayesian rule learning algorithm, called LBR. This algorithm can be justified by a variant of Bayes theorem which supports a weaker conditional attribute independence assumption than is required by naive Bayes. For each test example, it builds a most appropriate rule with a local naive Bayesian classifier as its consequent. It is demonstrated that the computational requirements of LBR are reasonable in a wide cross-section of natural domains. Experiments with these domains show that, on average, this new algorithm obtains lower error rates significantly more often than the reverse in comparison to a naive Bayesian classifier, C4.5, a Bayesian tree learning algorithm, a constructive Bayesian classifier that eliminates attributes and constructs new attributes using Cartesian products of existing nominal attributes, and a lazy decision tree learning algorithm. It also outperforms, although the result is not statistically significant, a selective naive Bayesian classifier.

18.
A standard approach to determining decision trees is to learn them from examples. A disadvantage of this approach is that once a decision tree is learned, it is difficult to modify it to suit different decision making situations. Such problems arise, for example, when an attribute assigned to some node cannot be measured, or there is a significant change in the costs of measuring attributes or in the frequency distribution of events from different decision classes. An attractive approach to resolving this problem is to learn and store knowledge in the form of decision rules, and to generate from them, whenever needed, a decision tree that is most suitable in a given situation. An additional advantage of such an approach is that it facilitates building compact decision trees, which can be much simpler than the logically equivalent conventional decision trees (by compact trees are meant decision trees that may contain branches assigned a set of values, and nodes assigned derived attributes, i.e., attributes that are logical or mathematical functions of the original ones). The paper describes an efficient method, AQDT-1, that takes decision rules generated by an AQ-type learning system (AQ15 or AQ17), and builds from them a decision tree optimizing a given optimality criterion. The method can work in two modes: the standard mode, which produces conventional decision trees, and the compact mode, which produces compact decision trees. The preliminary experiments with AQDT-1 have shown that the decision trees generated by it from decision rules (conventional and compact) have outperformed those generated from examples by the well-known C4.5 program both in terms of their simplicity and their predictive accuracy.

19.
A New Decision Tree Construction Method
Decision tree algorithms are an active research area in data mining and a method for in-depth analysis of classification problems, but constructing an optimal decision tree is NP-hard. This paper first reviews the basic idea of the ID3 algorithm and then, to address its shortcomings, introduces the concept of a generalized correlation function. It proposes a decision tree construction method that uses the generalized correlation function between condition attributes and the decision attribute as the attribute selection criterion, and compares it experimentally with ID3. Experiments show that the method not only improves the decision tree model but also clearly improves the predictive accuracy of the constructed trees.

20.
To address the high time complexity of the C4.5 decision tree algorithm when handling continuous-valued attributes, a new decision tree construction method is proposed. It uses the Pearson correlation coefficient between attributes to reduce the attributes of the data set; combined with the attributes' information gain ratios, it retains an optimal subset for the decision attribute, ensuring the subset contains no redundant attributes; and it uses boundary-point detection to improve the threshold-splitting step of continuous-attribute discretization, correcting the computation of the information gain ratio. Comparative experiments on UCI data sets, run on the PyCharm platform, show that the improved C4.5 algorithm speeds up tree generation by about 50% and raises accuracy by about 2%, effectively alleviating the original C4.5 algorithm's bias toward selecting continuous-valued attributes.
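A hedged sketch of the Pearson-based attribute reduction step: rank attributes by correlation with the class and drop any attribute highly correlated with one already kept. The threshold, relevance measure, and sample data are illustrative; the paper combines this filter with the information gain ratio, which is not reproduced here.

```python
import numpy as np

def pearson_reduct(X, y, redundancy_threshold=0.9):
    """Attribute pre-filter: rank attributes by |corr(attribute, class)|
    and keep an attribute only if it is not highly correlated with an
    already-kept one. Threshold is illustrative."""
    n_attrs = X.shape[1]
    relevance = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_attrs)]
    kept = []
    for j in sorted(range(n_attrs), key=lambda j: -relevance[j]):
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < redundancy_threshold
               for k in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     x1 + 0.01 * rng.normal(size=200),  # near-duplicate
                     rng.normal(size=200)])              # irrelevant noise
y = (x1 > 0).astype(float)
print(pearson_reduct(X, y))  # keeps one of the duplicate pair plus the rest
```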
