Similar Articles
20 similar articles found.
1.
Motivated by the desire to construct compact decision trees (compact in terms of the expected path length traversed to reach a decision), we propose a new node splitting measure for decision tree construction. We show that the proposed measure is convex and cumulative, and we exploit these properties when building decision trees for classification. Results on several datasets from the UCI repository show that the proposed measure yields more compact decision trees whose classification accuracy is comparable to that obtained with popular node splitting measures such as Gain Ratio and the Gini Index.
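The abstract does not state the new measure's formula, but the baselines it is compared against are standard; below is a minimal Python sketch of how a candidate binary split would be scored with the Gini index and Gain Ratio. The function names and toy labels are illustrative only.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def score_split(left, right):
    """Weighted Gini index and Gain Ratio of a binary split (lower Gini / higher Gain Ratio is better)."""
    n = len(left) + len(right)
    w_l, w_r = len(left) / n, len(right) / n
    gini_split = w_l * gini(left) + w_r * gini(right)
    info_gain = entropy(left + right) - (w_l * entropy(left) + w_r * entropy(right))
    split_info = -(w_l * log2(w_l) + w_r * log2(w_r))  # penalizes very unbalanced splits
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    return gini_split, gain_ratio

print(score_split(["a", "a", "b"], ["b", "b", "b"]))
```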

2.
This paper proposes a method for constructing ensembles of decision trees: random feature weights (RFW). The method is similar to Random Forest in that both introduce randomness into the construction of the individual decision trees. In Random Forest only a random subset of attributes is considered at each node, whereas RFW considers all of them. Its source of randomness is a weight associated with each attribute. All the nodes in a tree use the same set of random weights, but a different set from the weights used in other trees. Hence the importance given to the attributes differs from tree to tree, which differentiates their construction. The method is compared to Bagging, Random Forest, Random-Subspaces, AdaBoost and MultiBoost, obtaining favourable results for the proposed method, especially on noisy data sets. RFW can also be combined with these methods; generally, the combination of RFW with another method produces better results than either method alone. Kappa-error diagrams and Kappa-error movement diagrams are used to analyse the relationship between the accuracy of the base classifiers and their diversity.
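A minimal sketch of the random-feature-weights idea as described above: each tree draws one random weight per attribute, keeps those weights fixed at every node, and multiplies each attribute's split score by its weight, so all attributes remain candidates but their relative importance differs from tree to tree. The information-gain scorer, weight distribution, and toy data are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_gain(x_col, y):
    """Information gain of splitting on a discrete attribute column against integer labels."""
    def h(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    n = len(y)
    total = h(np.bincount(y) / n)
    cond = 0.0
    for v in np.unique(x_col):
        mask = x_col == v
        cond += mask.mean() * h(np.bincount(y[mask]) / mask.sum())
    return total - cond

def rfw_best_attribute(X, y, weights):
    """Pick the splitting attribute after multiplying each attribute's gain by its tree-specific weight."""
    scores = [weights[j] * info_gain(X[:, j], y) for j in range(X.shape[1])]
    return int(np.argmax(scores))

# toy data: 6 samples, 3 binary attributes; each tree draws its own weight vector
X = np.array([[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 0, 0]])
y = np.array([0, 1, 0, 1, 0, 1])
for t in range(3):
    w = rng.random(X.shape[1])   # same weights at every node of tree t, different across trees
    print("tree", t, "root split on attribute", rfw_best_attribute(X, y, w))
```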

3.
We propose a hybrid SVM-based decision tree to speed up SVMs in the testing phase of binary classification tasks. While most existing methods for this task aim at reducing the number of support vectors, we focus on reducing the number of test data points that need the SVM's help to be classified. The central idea is to approximate the decision boundary of the SVM using a decision tree. The resulting tree is a hybrid in the sense that it has both univariate and multivariate (SVM) nodes. The hybrid tree takes the SVM's help only in classifying crucial data points lying near the decision boundary; the remaining, less crucial data points are classified by fast univariate nodes. The classification accuracy of the hybrid tree is guaranteed by tuning a threshold parameter. Extensive computational comparisons on 19 publicly available datasets indicate that the proposed method achieves significant speedup over SVMs without any compromise in classification accuracy.
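A minimal sketch of the routing idea, assuming scikit-learn: a decision tree is trained to mimic the SVM, and only test points for which the tree is not confident (those likely to lie near the decision boundary) are passed to the SVM. The confidence threshold and models are illustrative; the paper's hybrid embeds multivariate SVM nodes inside the tree itself rather than routing between two separate models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
# the tree is trained to mimic the SVM's labels, so its impure leaves mark the boundary region
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, svm.predict(X_tr))

threshold = 0.9   # confidence below which a point is considered "crucial" and sent to the SVM
confidence = tree.predict_proba(X_te).max(axis=1)
crucial = confidence < threshold

pred = tree.predict(X_te)
if crucial.any():
    pred[crucial] = svm.predict(X_te[crucial])   # SVM only sees points near its decision boundary
print(f"{crucial.mean():.0%} of test points needed the SVM; accuracy {np.mean(pred == y_te):.3f}")
```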

4.
Learning decision tree for ranking
The decision tree is one of the most effective and widely used methods for classification. However, many real-world applications require instances to be ranked by the probability of class membership. The area under the receiver operating characteristic curve (AUC) has recently been used as a measure of the ranking performance of learning algorithms. In this paper, we present two novel class probability estimation algorithms to improve the ranking performance of decision trees. Instead of estimating the probability of class membership by simple voting at the leaf into which a test instance falls, our algorithms use similarity-weighted voting and naive Bayes. We design empirical experiments to verify that our new algorithms significantly outperform the recent decision tree ranking algorithm C4.4 in terms of AUC.
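A minimal sketch, assuming scikit-learn, of the baseline these algorithms improve upon: class membership probabilities estimated from the frequencies (with a Laplace correction) at the leaf a test instance falls into, then evaluated with AUC. The similarity-weighted voting and naive Bayes variants proposed in the paper replace this per-leaf frequency estimate.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Laplace-corrected probability of the positive class at the leaf each test instance falls into
leaf_te = tree.apply(X_te)
leaf_tr = tree.apply(X_tr)
probs = np.empty(len(X_te))
for i, leaf in enumerate(leaf_te):
    members = y_tr[leaf_tr == leaf]
    probs[i] = (members.sum() + 1) / (len(members) + 2)   # (n_pos + 1) / (n + k), k = 2 classes

print("AUC of leaf-frequency ranking:", round(roc_auc_score(y_te, probs), 3))
```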

5.
6.
We propose the use of Vapnik's vicinal risk minimization (VRM) for training decision trees so as to approximately maximize decision margins. We implement VRM by propagating uncertainties in the input attributes into the labeling decisions, thereby performing a global regularization over the decision tree structure. During training, a decision tree is constructed to minimize the total probability of misclassifying the labeled training examples, a process which approximately maximizes the margins of the resulting classifier. We perform the necessary minimization using an appropriate meta-heuristic (genetic programming) and present results over a range of synthetic and benchmark real datasets. We demonstrate the statistical superiority of VRM training over conventional empirical risk minimization (ERM) and the well-known C4.5 algorithm, while finding no statistical difference between trees trained by ERM and those produced by C4.5. Training with VRM is also shown to be more stable and repeatable than training with ERM.
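A minimal sketch of the uncertainty-propagation step for a single axis-parallel test, assuming Gaussian noise on the tested attribute: the probability mass falling on each side of the threshold is computed, so an example's contribution to the misclassification probability becomes a smooth function of its distance from the split. The noise model, branch labels, and the way branch errors are combined are illustrative assumptions, not the paper's formulation.

```python
from math import erf, sqrt

def prob_right(x, threshold, sigma):
    """P(x + noise > threshold) for Gaussian noise with standard deviation sigma on attribute x."""
    return 0.5 * (1.0 + erf((x - threshold) / (sigma * sqrt(2.0))))

def misclassification_prob(x, y, threshold, sigma, label_left, label_right):
    """Expected error of one threshold test for example (x, y) under input uncertainty."""
    p_r = prob_right(x, threshold, sigma)
    p_l = 1.0 - p_r
    return p_l * (label_left != y) + p_r * (label_right != y)

# an example sitting close to the split contributes a graded error, not a hard 0/1
for x in (4.0, 4.9, 5.1, 6.0):
    print(x, round(misclassification_prob(x, y=1, threshold=5.0, sigma=0.3,
                                          label_left=0, label_right=1), 3))
```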

7.
Decision tree classifiers have received much recent attention, particularly with regard to land cover classification at continental to global scales. Despite their many benefits and general flexibility, the use of decision trees with high spatial resolution data has not yet been fully explored. In support of the National Park Service (NPS) Vegetation Mapping Program (VMP), we have examined the feasibility of using a commercially available decision tree classifier with multitemporal satellite data from the Enhanced Thematic Mapper-Plus (ETM+) instrument to map 11 land cover types at the Delaware Water Gap National Recreation Area near Milford, PA. Ensemble techniques such as boosting and consensus filtering of the training data were used to improve both the quality of the input training data and the final products. Using land cover classes as specified by the National Vegetation Classification Standard at the Formation level, the final land cover map has an overall accuracy of 82% (κ=0.80) when tested against a validation data set acquired on the ground (n=195). The accuracy rises to 99.5% when considering only forest vs. non-forest classes. Using ETM+ scenes acquired on multiple dates improves the accuracy over the use of a single date, particularly for the different forest types. These results demonstrate the potential applicability of this approach to the entire National Park system, and to high spatial resolution land cover and forest mapping applications in general.
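The accuracy figures quoted above are the usual overall accuracy and Cohen's kappa derived from an error matrix; a minimal sketch with an invented forest/non-forest confusion matrix (not the paper's validation data) is shown below.

```python
import numpy as np

def overall_accuracy_and_kappa(cm):
    """Overall accuracy and Cohen's kappa from a confusion matrix (rows = reference, cols = map)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                                    # observed agreement
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2    # chance agreement
    return po, (po - pe) / (1.0 - pe)

# hypothetical validation counts, for illustration only
cm = [[150, 5],
      [ 10, 30]]
acc, kappa = overall_accuracy_and_kappa(cm)
print(f"overall accuracy = {acc:.3f}, kappa = {kappa:.3f}")
```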

8.
By exploiting the unattended nature of wireless sensor networks, an attacker can physically capture and compromise sensor nodes and then launch a variety of attacks. The attacker can additionally create many replicas of a few compromised nodes and spread these replicas over the network, launching further attacks with their help. In order to minimize the damage caused by compromised and replicated nodes, it is very important to detect such malicious nodes as quickly as possible. In this review article, we synthesize our previous work on node compromise detection in sensor networks and provide an extended analysis in terms of performance comparison with related work. More specifically, we use the methodology of sequential analysis to detect static and mobile compromised nodes, as well as mobile replicated nodes, in sensor networks. With the help of analytical and simulation results, we also demonstrate that our schemes provide robust and efficient node compromise detection.
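The sequential analysis referred to above is commonly instantiated as Wald's sequential probability ratio test; a minimal, generic sketch follows, assuming each observation is a binary "suspicious or not" sample with hypothesized rates p0 for benign and p1 for compromised nodes. The thresholds use Wald's approximation; the actual observables and parameters for static, mobile, and replicated nodes in the schemes above will differ.

```python
from math import log

def sprt(observations, p0=0.1, p1=0.6, alpha=0.01, beta=0.01):
    """Wald's SPRT over binary observations; returns 'compromised', 'benign', or 'undecided'."""
    lower, upper = log(beta / (1 - alpha)), log((1 - beta) / alpha)
    llr = 0.0
    for x in observations:
        llr += log(p1 / p0) if x else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "compromised"
        if llr <= lower:
            return "benign"
    return "undecided"

print(sprt([1, 1, 0, 1, 1, 1]))   # mostly suspicious samples -> compromised
print(sprt([0, 0, 0, 0, 0, 0]))   # clean samples -> benign
```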

9.
This study introduces a change detection model based on Neighborhood Correlation Image (NCI) logic. It relies on the fact that the same geographic area (e.g., a 3 × 3 pixel window) on two dates of imagery will tend to be highly correlated if little change has occurred, and uncorrelated when change occurs. Computing the piecewise correlation between two data sets provides valuable information on the location and magnitude of change, derived from contextual information within the specified neighborhood. Various neighborhood configurations (i.e., multi-level NCIs) were explored in the study using high spatial resolution multispectral imagery: smaller neighborhood sizes provided detailed change information (such as a new patio added to an existing building) at the cost of introducing some noise (such as changes in shadows), while larger neighborhood sizes were useful for removing this noise but introduced some inaccurate change information (such as removing some linear feature changes). When combined with image classification using a machine learning decision tree (C5.0), classifications based on multi-level NCIs yielded superior results (e.g., a neighborhood with a 3-pixel circular radius gave a Kappa of 0.94) compared to the classification that did not incorporate NCIs (Kappa = 0.86).
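A minimal sketch of the neighborhood correlation computation, assuming two co-registered single-band images held as NumPy arrays: for each pixel, the Pearson correlation between the two dates' 3 × 3 neighborhoods is computed, so unchanged areas score near 1 and changed areas score low. Band handling, the change metrics actually derived, and edge treatment in the full NCI model may differ.

```python
import numpy as np

def neighborhood_correlation(img1, img2, radius=1):
    """Per-pixel Pearson correlation between date-1 and date-2 neighborhoods."""
    h, w = img1.shape
    nci = np.full((h, w), np.nan)
    for i in range(radius, h - radius):
        for j in range(radius, w - radius):
            a = img1[i - radius:i + radius + 1, j - radius:j + radius + 1].ravel()
            b = img2[i - radius:i + radius + 1, j - radius:j + radius + 1].ravel()
            if a.std() > 0 and b.std() > 0:
                nci[i, j] = np.corrcoef(a, b)[0, 1]
    return nci

rng = np.random.default_rng(1)
date1 = rng.random((8, 8))
date2 = date1.copy()
date2[2:5, 2:5] = rng.random((3, 3))   # simulate change in a small patch
print(np.round(neighborhood_correlation(date1, date2), 2))
```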

10.
Decision tree (DT) induction is among the more popular data mining techniques. An important component of DT induction algorithms is the splitting method, with the most commonly used methods being based on the Conditional Entropy (CE) family. However, it is well known that no single splitting method gives the best performance for all problem instances. In this paper we explore the relative performance of the Conditional Entropy family and another family based on the Class-Attribute Mutual Information (CAMI) measure. Our results suggest that while some datasets are insensitive to the choice of splitting method, others are very sensitive to it. For example, some of the CAMI family methods may be more appropriate than the popular Gain Ratio (GR) method for datasets with nominal predictor attributes, and are competitive with the GR method for datasets where all predictor attributes are numeric. Given that it is never known beforehand which splitting method will lead to the best DT for a given dataset, and given the relatively good performance of the CAMI methods, it seems appropriate to suggest that splitting methods from the CAMI family be included in data mining toolsets.

Kweku-Mauta Osei-Bryson is Professor of Information Systems at Virginia Commonwealth University, where he also served as the Coordinator of the Ph.D. program in Information Systems during 2001–2003. Previously he was Professor of Information Systems and Decision Analysis in the School of Business at Howard University, Washington, DC, U.S.A. He has also worked as an Information Systems practitioner in both industry and government. He holds a Ph.D. in Applied Mathematics (Management Science & Information Systems) from the University of Maryland at College Park, an M.S. in Systems Engineering from Howard University, and a B.Sc. in Natural Sciences from the University of the West Indies at Mona. He currently does research in various areas including: Data Mining, Expert Systems, Decision Support Systems, Group Support Systems, Information Systems Outsourcing, and Multi-Criteria Decision Analysis. His papers have been published in various journals including: Information & Management, Information Systems Journal, Information Systems Frontiers, Business Process Management Journal, International Journal of Intelligent Systems, IEEE Transactions on Knowledge & Data Engineering, Data & Knowledge Engineering, Information & Software Technology, Decision Support Systems, Information Processing and Management, Computers & Operations Research, European Journal of Operational Research, Journal of the Operational Research Society, Journal of the Association for Information Systems, Journal of Multi-Criteria Decision Analysis, and Applications of Management Science. He currently serves as an Associate Editor of the INFORMS Journal on Computing, and is a member of the Editorial Board of Computers & Operations Research.

Kendall E. Giles received the BS degree in Electrical Engineering from Virginia Tech in 1991, the MS degree in Electrical Engineering from Purdue University in 1993, the MS degree in Information Systems from Virginia Commonwealth University in 2002, and the MS degree in Computer Science from Johns Hopkins University in 2004. Currently he is a PhD student (ABD) in Computer Science at Johns Hopkins, and is a Research Assistant in the Applied Mathematics and Statistics department. He has over 15 years of work experience in industry, government, and academic institutions. His research interests can be partially summarized by the following keywords: network security, mathematical modeling, pattern classification, and high dimensional data analysis.
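A minimal sketch contrasting the two splitting-measure families discussed above on a single nominal attribute: both an information-gain (conditional entropy) score and a class-attribute mutual-information score are functions of the same class-attribute contingency table. The normalization shown for the CAMI-style score is one common choice and is not necessarily the exact definition used in the paper.

```python
import numpy as np

def _entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def split_scores(attr_values, labels):
    """Return (information gain, mutual information normalized by the joint entropy)."""
    classes, attrs = np.unique(labels), np.unique(attr_values)
    joint = np.array([[np.mean((labels == c) & (attr_values == a)) for a in attrs] for c in classes])
    h_class = _entropy(joint.sum(axis=1))
    h_attr = _entropy(joint.sum(axis=0))
    h_joint = _entropy(joint.ravel())
    mi = h_class + h_attr - h_joint      # I(C; A) = H(C) - H(C|A), i.e. the information gain
    return mi, mi / h_joint              # one possible CAMI-style normalization

attr = np.array(["sunny", "sunny", "rain", "rain", "rain", "overcast"])
y = np.array(["no", "no", "yes", "yes", "no", "yes"])
print(split_scores(attr, y))
```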

11.
Lung transplantation has a vital role among organ transplant procedures, since it is the only accepted treatment for end-stage pulmonary failure. There have been several research attempts to model the performance of lung transplants, yet these earlier studies either lack predictive capability by relying on strong statistical assumptions, or provide adequate predictive capability but are less interpretable to medical professionals. The method proposed in this paper aims to overcome these limitations by providing a structural equation modeling-based decision tree construction procedure for lung transplant performance evaluation. Specifically, a partial least squares-based path modeling algorithm is used for the structural equation modeling part. The proposed method is validated on a US nationwide dataset obtained from the United Network for Organ Sharing (UNOS). The results are promising in terms of both prediction and interpretation capabilities, and are superior to existing techniques. Hence, we assert that a decision support system based on the proposed method can bridge the knowledge gap between the large amount of available data and the improvement of lung transplantation procedures.
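A minimal sketch of one way the two ingredients named above could be chained, assuming scikit-learn: partial least squares extracts latent scores from the predictors and a decision tree is then grown on those scores. This is only an illustration of the combination; the paper derives the tree from a full PLS-based structural equation (path) model rather than from raw PLS scores.

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=12, n_informative=5, random_state=0)

pls = PLSRegression(n_components=3).fit(X, y)   # latent constructs summarizing the predictors
scores = pls.transform(X)                       # projection of each case onto the constructs
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(scores, y)

print(export_text(tree, feature_names=[f"LV{i+1}" for i in range(scores.shape[1])]))
```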

12.
We present a new approach for approximating node deletion problems by combining the local ratio and the greedy multicovering algorithms. For a function f defined on the nodes, our approach allows one to design a (2 + max_{v ∈ V(G)} log f(v))-approximation algorithm for the problem of deleting a minimum number of nodes so that the degree of each node v in the remaining graph is at most f(v). This approximation ratio is shown to be asymptotically optimal. The new method is also used to design a (1 + (log 2)(k − 1))-approximation algorithm for the problem of deleting a minimum number of nodes so that the remaining graph contains no k-bicliques.

13.
With developments in information technology, fraud is spreading all over the world, resulting in huge financial losses. Though fraud prevention mechanisms such as CHIP&PIN have been developed for credit card systems, these mechanisms do not prevent the most common fraud types, such as fraudulent credit card usage over virtual POS (Point Of Sale) terminals or mail orders, the so-called online credit card fraud. As a result, fraud detection becomes the essential tool and probably the best way to stop such fraud. In this study, a new cost-sensitive decision tree approach is developed which minimizes the sum of misclassification costs while selecting the splitting attribute at each non-terminal node, and its performance is compared with well-known traditional classification models on a real-world credit card data set. In this approach, misclassification costs are taken to be varying. The results show that this cost-sensitive decision tree algorithm outperforms the existing well-known methods on the given problem set not only with respect to standard performance metrics such as accuracy and true positive rate, but also with respect to a newly defined cost-sensitive metric specific to the credit card fraud detection domain. Accordingly, financial losses due to fraudulent transactions can be further decreased by implementing this approach in fraud detection systems.
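A minimal sketch of the attribute-selection step described above: instead of an impurity measure, each candidate attribute is scored by the total misclassification cost incurred when every child node is labeled with its cost-minimizing class. The cost matrix and toy data are invented; the paper additionally allows the misclassification cost to vary per transaction.

```python
import numpy as np

def node_cost(labels, cost):
    """Cost of labeling a node with its cheapest class; cost[i][j] = cost of predicting j when truth is i."""
    counts = np.bincount(labels, minlength=cost.shape[0])
    return min((counts * cost[:, j]).sum() for j in range(cost.shape[1]))

def best_attribute(X, y, cost):
    """Pick the binary attribute whose split minimizes the summed misclassification cost of its children."""
    scores = []
    for j in range(X.shape[1]):
        mask = X[:, j] == 1
        scores.append(node_cost(y[mask], cost) + node_cost(y[~mask], cost))
    return int(np.argmin(scores)), scores

# classes: 0 = legitimate, 1 = fraud; missing a fraud is far more expensive than a false alarm
cost = np.array([[0.0, 1.0],
                 [20.0, 0.0]])
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([1, 1, 0, 0, 1, 0])
print(best_attribute(X, y, cost))
```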

14.
Storing spatial data and processing spatial queries are important tasks for modern databases. The execution efficiency of a spatial query depends on the underlying index structure. The R-tree is a well-known spatial index structure. There currently exist various versions of the R-tree, and one of the most common differences between them is the node splitting algorithm. The problem of node splitting in a one-dimensional R-tree may seem too trivial to be considered separately: one-dimensional intervals can be split on the basis of their sorting. Some node splitting algorithms for R-trees with two or more dimensions include a one-dimensional split as a component. However, on closer examination, existing algorithms for one-dimensional splits do not perform ideally in some complicated cases. This paper introduces a novel one-dimensional node splitting algorithm based on two sortings that handles such complicated cases better, as well as a node splitting algorithm for R-trees with two or more dimensions that builds on the one-dimensional algorithm. Tests show significantly better behavior of the proposed algorithms in the case of highly overlapping data.
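A minimal sketch of a sort-based one-dimensional split used as a baseline: intervals are ordered by their lower bound, every prefix/suffix partition respecting a minimum fill is tried, and the partition minimizing the overlap of the two groups' bounding intervals is kept. The paper's algorithm improves on this by combining two sortings (by lower and by upper bound); that refinement is not reproduced here.

```python
def split_intervals(intervals, min_fill=2):
    """Split [(lo, hi), ...] into two groups with minimal overlap of their bounding intervals."""
    items = sorted(intervals, key=lambda iv: iv[0])          # single sorting by lower bound
    best = None
    for k in range(min_fill, len(items) - min_fill + 1):
        left, right = items[:k], items[k:]
        left_hi = max(hi for _, hi in left)
        right_lo = min(lo for lo, _ in right)
        overlap = max(0.0, left_hi - right_lo)
        if best is None or overlap < best[0]:
            best = (overlap, left, right)
    return best

print(split_intervals([(0, 2), (1, 3), (8, 9), (7, 10), (2, 4), (6, 8)]))
```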

15.
We study the possibility of constructing decision trees with evolutionary algorithms in order to increase their predictive accuracy. We present a self-adapting evolutionary algorithm for the induction of decision trees and describe the principle of decision making based on multiple evolutionarily induced decision trees, i.e. a decision forest. The developed model is used as a fault prediction approach to foresee dangerous software modules, whose identification can greatly enhance the reliability of software.

16.
An algorithm is developed for the design of an efficient decision tree, with application to pattern recognition problems involving discrete variables. The problem of evaluating an extremely large number of trees in search of a minimum cost decision tree is tackled by defining a criterion that estimates the minimum expected cost of a tree in terms of the weights of its terminal nodes and the costs of the measurements, which is then used to establish the search procedure for the efficient decision tree. The concept of prime events is used to obtain the number of modes and the corresponding weights in the design samples. An application of the proposed algorithm is presented for the design of an efficient decision tree for classifying Devanagri numerals.
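A minimal sketch of the quantity guiding the search: the expected cost of a tree expressed as a sum, over terminal nodes, of the node's weight times the total cost of the measurements taken along the path to it. The nested-dict tree encoding and the numbers are illustrative, not the paper's representation.

```python
def expected_cost(tree, measurement_cost, path_cost=0.0):
    """Sum over leaves of (leaf weight) x (cost of measurements taken along the path to that leaf)."""
    if "leaf_weight" in tree:                       # terminal node
        return tree["leaf_weight"] * path_cost
    cost_here = path_cost + measurement_cost[tree["measure"]]
    return sum(expected_cost(child, measurement_cost, cost_here) for child in tree["children"])

# hypothetical two-measurement tree: measure m1 first, then m2 only on one branch
tree = {"measure": "m1", "children": [
    {"leaf_weight": 0.5},
    {"measure": "m2", "children": [{"leaf_weight": 0.3}, {"leaf_weight": 0.2}]},
]}
print(expected_cost(tree, {"m1": 1.0, "m2": 4.0}))   # 0.5*1 + 0.3*5 + 0.2*5 = 3.0
```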

17.
To improve the recognition accuracy of intelligent models and enhance their generalization ability, class-label anomalies in the datasets used for intelligent modeling need to be detected and corrected. Building on a formal description of the dataset and the decision tree, the Gini index gain ratio is taken as the criterion for determining the optimal binary split of continuous condition attributes, and a recursive algorithm generates a binary decision tree whose leaf nodes each contain objects of a single class. Information entropy is then used to evaluate the effect of pruning on the class distribution of the objects in the leaf nodes, so that class anomalies in the dataset can be corrected. In essence, tree generation and pruning partition and merge the data space of the continuous condition attributes on the basis of the Gini index and information entropy, correcting class labels in the process. Experiments and practical applications verify that decision tree generation and pruning are an effective method for optimizing the class labels of a dataset.
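A minimal sketch of the optimal binary split of one continuous condition attribute, scored here by plain Gini gain (the gain-ratio normalization described above is omitted): every midpoint between consecutive sorted values is tried and the threshold with the largest impurity reduction is returned. Variable names and data are illustrative.

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_binary_split(x, y):
    """Best threshold on continuous attribute x by Gini gain."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_gain, best_t = -1.0, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        t = (x[i] + x[i - 1]) / 2.0
        left, right = y[:i], y[i:]
        gain = gini(y) - (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

x = np.array([1.2, 2.5, 2.7, 3.1, 5.0, 5.5])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_binary_split(x, y))   # threshold near 2.9 separates the two classes
```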

18.
19.
20.
Change detection on spatial data is important in many applications, such as environmental monitoring. Given a set of snapshots of spatial objects at various temporal instants, a user may want to derive the changing regions between any two snapshots. Most existing methods have to use at least one of the original data sets to detect changing regions. However, in some important applications, due to data access constraints such as privacy concerns and limited online availability, the original data may not be available for change analysis. In this paper, we tackle the problem by proposing a simple yet effective model-based approach. In the model construction phase, data snapshots are summarized using novel cluster-embedded decision trees as concise models. Once the models are built, the original data snapshots are not accessed anymore. In the change detection phase, to mine changing regions between any two instants, we compare the two corresponding cluster-embedded decision trees. Our systematic experimental results on both real and synthetic data sets show that our approach can detect changes accurately and effectively.

Irene Pekerskaya's and Jian Pei's research is supported partly by the Natural Sciences and Engineering Research Council of Canada and the National Science Foundation of the US, and by a President's Research Grant and an Endowed Research Fellowship Award at Simon Fraser University. Ke Wang's research is supported partly by the Natural Sciences and Engineering Research Council of Canada. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.
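A minimal sketch of the two phases, assuming scikit-learn and using ordinary decision trees in place of the paper's cluster-embedded trees: each snapshot is summarized by a tree, the raw points are then discarded, and changing regions are reported wherever the two trees disagree on a grid of spatial locations.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def snapshot(shift):
    """Toy spatial snapshot: points labeled by which side of a (moving) boundary they fall on."""
    pts = rng.uniform(0, 10, size=(500, 2))
    labels = (pts[:, 0] + pts[:, 1] > 10 + shift).astype(int)
    return pts, labels

# model construction phase: one concise tree per snapshot, raw points not kept afterwards
model_t1 = DecisionTreeClassifier(max_depth=4).fit(*snapshot(0.0))
model_t2 = DecisionTreeClassifier(max_depth=4).fit(*snapshot(2.0))

# change detection phase: compare the two models on a grid of locations
xs, ys = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
grid = np.column_stack([xs.ravel(), ys.ravel()])
changed = model_t1.predict(grid) != model_t2.predict(grid)
print(f"{changed.mean():.0%} of the area flagged as changed")
```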
