Similar literature
Found 20 similar documents; search took 15 ms
1.
Hybrid decision tree   Total citations: 6, self-citations: 0, citations by others: 6

2.
Decision tree (DT) induction is among the more popular of the data mining techniques. An important component of DT induction algorithms is the splitting method, with the most commonly used method being based on the Conditional Entropy (CE) family. However, it is well known that there is no single splitting method that will give the best performance for all problem instances. In this paper we explore the relative performance of the Conditional Entropy family and another family that is based on the Class-Attribute Mutual Information (CAMI) measure. Our results suggest that while some datasets are insensitive to the choice of splitting methods, other datasets are very sensitive to the choice of splitting methods. For example, some of the CAMI family methods may be more appropriate than the popular Gain Ratio (GR) method for datasets which have nominal predictor attributes, and are competitive with the GR method for those datasets where all predictor attributes are numeric. Given that it is never known beforehand which splitting method will lead to the best DT for a given dataset, and given the relatively good performance of the CAMI methods, it seems appropriate to suggest that splitting methods from the CAMI family should be included in data mining toolsets. Kweku-Mauta Osei-Bryson is Professor of Information Systems at Virginia Commonwealth University, where he also served as the Coordinator of the Ph.D. program in Information Systems during 2001–2003. Previously he was Professor of Information Systems and Decision Analysis in the School of Business at Howard University, Washington, DC, U.S.A. He has also worked as an Information Systems practitioner in both industry and government. He holds a Ph.D. in Applied Mathematics (Management Science & Information Systems) from the University of Maryland at College Park, a M.S. in Systems Engineering from Howard University, and a B.Sc. in Natural Sciences from the University of the West Indies at Mona. 
His current research areas include Data Mining, Expert Systems, Decision Support Systems, Group Support Systems, Information Systems Outsourcing, and Multi-Criteria Decision Analysis. His papers have been published in journals including Information & Management, Information Systems Journal, Information Systems Frontiers, Business Process Management Journal, International Journal of Intelligent Systems, IEEE Transactions on Knowledge & Data Engineering, Data & Knowledge Engineering, Information & Software Technology, Decision Support Systems, Information Processing and Management, Computers & Operations Research, European Journal of Operational Research, Journal of the Operational Research Society, Journal of the Association for Information Systems, Journal of Multi-Criteria Decision Analysis, and Applications of Management Science. He currently serves as an Associate Editor of the INFORMS Journal on Computing and is a member of the Editorial Board of Computers & Operations Research. Kendall E. Giles received the BS degree in Electrical Engineering from Virginia Tech in 1991, the MS degree in Electrical Engineering from Purdue University in 1993, the MS degree in Information Systems from Virginia Commonwealth University in 2002, and the MS degree in Computer Science from Johns Hopkins University in 2004. He is currently a PhD student (ABD) in Computer Science at Johns Hopkins and a Research Assistant in the Applied Mathematics and Statistics department. He has over 15 years of work experience in industry, government, and academic institutions. His research interests include network security, mathematical modeling, pattern classification, and high dimensional data analysis.
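The abstract of entry 2 compares splitting methods such as Gain Ratio without showing how such a measure is computed. The following is a minimal sketch of the standard entropy-based gain ratio for a nominal attribute; the function names `entropy` and `gain_ratio` are illustrative, and the CAMI family studied in the paper is not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of a nominal attribute divided by its split information."""
    n = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Conditional entropy of the class given the attribute.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    # Split information is the entropy of the attribute values themselves.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


perfect = gain_ratio(["a", "a", "b", "b"], [0, 0, 1, 1])
useless = gain_ratio(["a", "b", "a", "b"], [0, 0, 1, 1])
```

In this toy data the first attribute separates the classes completely and the second carries no class information, so the two calls give the extremes of the measure.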

3.
Costs are often an important part of the classification process. Cost factors have been taken into consideration in many previous studies regarding decision tree models. In this study, we also consider a cost-sensitive decision tree construction problem. We assume that there are test costs that must be paid to obtain the values of the decision attribute and that a record must be classified without exceeding the spending cost threshold. Unlike previous studies, however, in which records were classified with only a single condition attribute, in this study, we are able to simultaneously classify records with multiple condition attributes. An algorithm is developed to build a cost-constrained decision tree, which allows us to simultaneously classify multiple condition attributes. The experimental results show that our algorithm satisfactorily handles data with multiple condition attributes under different cost constraints.

4.
Distance-based tree models for ranking data   Total citations: 1, self-citations: 0, citations by others: 1
Ranking data has applications in different fields of study, such as marketing, psychology and politics. Over the years, many models for ranking data have been developed. Among them, distance-based ranking models, which originate from the classical rank correlations, postulate that the probability of observing a ranking of items depends on the distance between the observed ranking and a modal ranking. The closer a ranking is to the modal ranking, the higher its probability. However, such a model basically assumes a homogeneous population and does not incorporate the presence of covariates. To overcome these limitations, we combine the strength of a tree model and the existing distance-based models to build a model that can handle more complexity and improve prediction accuracy. We will introduce a recursive partitioning algorithm for building a tree model with a distance-based ranking model fitted at each leaf. We will also consider new weighted distance measures which allow different weights for different ranks in formulating more flexible distance-based tree models. Finally, we will apply the proposed methodology to analyze a ranking dataset of Inglehart’s items collected in the 1999 European Values Studies.
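The distance-based idea in entry 4 can be sketched in a few lines. The example below assumes a Mallows-type form, with probability proportional to exp(-lam * d), and uses the Kendall distance for d. Both choices, as well as the names `kendall_distance`, `ranking_probability` and `lam`, are assumptions for illustration; the abstract does not say which distance or parameterisation the authors fit at each leaf.

```python
import math
from itertools import combinations, permutations


def kendall_distance(r1, r2):
    """Number of item pairs that the two rankings order differently.

    Each ranking is a sequence of items from most to least preferred.
    """
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )


def ranking_probability(ranking, modal, lam):
    """Probability proportional to exp(-lam * distance to the modal ranking).

    The normalising constant is computed by brute-force enumeration,
    which is only feasible for a handful of items.
    """
    z = sum(
        math.exp(-lam * kendall_distance(p, modal)) for p in permutations(modal)
    )
    return math.exp(-lam * kendall_distance(ranking, modal)) / z


modal = ("a", "b", "c")
p_modal = ranking_probability(("a", "b", "c"), modal, lam=1.0)
p_reversed = ranking_probability(("c", "b", "a"), modal, lam=1.0)
total = sum(ranking_probability(p, modal, lam=1.0) for p in permutations(modal))
```

With a positive `lam`, the modal ranking receives the largest probability and the fully reversed ranking the smallest, which is the behaviour the abstract describes.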

5.
Ying, Dengsheng, Guojun. Pattern Recognition, 2008, 41(8): 2554-2570
Semantic-based image retrieval has attracted great interest in recent years. This paper proposes a region-based image retrieval system with high-level semantic learning. The key features of the system are: (1) it supports both query by keyword and query by region of interest. The system segments an image into different regions and extracts low-level features of each region. From these features, high-level concepts are obtained using a proposed decision tree-based learning algorithm named DT-ST. During retrieval, a set of images whose semantic concept matches the query is returned. Experiments on a standard real-world image database confirm that the proposed system significantly improves the retrieval performance, compared with a conventional content-based image retrieval system. (2) The proposed decision tree induction method DT-ST for image semantic learning is different from other decision tree induction algorithms in that it makes use of the semantic templates to discretize continuous-valued region features and avoids the difficult image feature discretization problem. Furthermore, it introduces a hybrid tree simplification method to handle the noise and tree fragmentation problems, thereby improving the classification performance of the tree. Experimental results indicate that DT-ST outperforms two well-established decision tree induction algorithms, ID3 and C4.5, in image semantic learning.

6.
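A binary decision mechanism is incorporated into a fuzzy support vector machine classification system to classify images at the level of emotional semantics. The difficulty lies in establishing the mapping from low-level image features to high-level emotional semantics and in choosing reasonable parameters. Combining the classifier with a decision tree method achieves multi-class classification. Experimental results show that the system is simple, fast, and efficient for image emotion classification.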

7.
Real-life datasets are often imbalanced, that is, there are significantly more training samples available for some classes than for others, and consequently the conventional aim of maximizing overall classification accuracy is not appropriate when dealing with such problems. Various approaches have been introduced in the literature to deal with imbalanced datasets, and are typically based on oversampling, undersampling or cost-sensitive classification. In this paper, we introduce an effective ensemble of cost-sensitive decision trees for imbalanced classification. Base classifiers are constructed according to a given cost matrix, but are trained on random feature subspaces to ensure sufficient diversity of the ensemble members. We employ an evolutionary algorithm for simultaneous classifier selection and assignment of committee member weights for the fusion process. Our proposed algorithm is evaluated on a variety of benchmark datasets, and is confirmed to lead to improved recognition of the minority class, to be capable of outperforming other state-of-the-art algorithms, and hence to represent a useful and effective approach for dealing with imbalanced datasets.

8.
Learning decision tree for ranking   Total citations: 4, self-citations: 3, citations by others: 1
The decision tree is one of the most effective and widely used methods for classification. However, many real-world applications require instances to be ranked by the probability of class membership. The area under the receiver operating characteristic curve, or simply AUC, has recently been used as a measure of the ranking performance of learning algorithms. In this paper, we present two novel class probability estimation algorithms to improve the ranking performance of decision trees. Instead of estimating the probability of class membership using simple voting at the leaf into which the test instance falls, our algorithms use similarity-weighted voting and naive Bayes. We design empirical experiments to verify that our new algorithms significantly outperform the recent decision tree ranking algorithm C4.4 in terms of AUC.
Liangxiao Jiang
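Entry 8 evaluates ranking quality by AUC and contrasts its methods with simple voting at a leaf. The sketch below shows the pairwise definition of AUC and the simple-voting estimate only. The names `auc` and `leaf_vote` are illustrative, and the paper's similarity-weighted voting and naive Bayes estimators are not reproduced, because the abstract does not define them.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def leaf_vote(leaf_labels):
    """Simple voting: fraction of positive training instances at a leaf."""
    return sum(leaf_labels) / len(leaf_labels)


# Every test instance that falls into the same leaf gets the same score,
# so simple voting produces many ties when the tree has few leaves.
leaf_score = leaf_vote([1, 1, 0, 0])
tied = auc([leaf_score, leaf_score], [1, 0])
separated = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
mixed = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The tied case illustrates why coarse leaf frequencies can limit ranking performance: instances sharing a leaf cannot be ordered relative to one another.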

9.
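A new decision tree construction method   Total citations: 1, self-citations: 0, citations by others: 1
Decision trees are an important data mining tool, but constructing an optimal decision tree is an NP-complete problem. This paper proposes a decision tree construction method based on association rule mining. It first defines high-confidence approximately exact rules and gives an algorithm for mining them. It then generates new attributes from these rules and discusses how to evaluate the generated attributes. Finally, it builds the decision tree from the generated attributes together with the original attributes of the data. Experimental results show that the new construction method achieves relatively high accuracy.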

10.
In recent years, classification learning for data streams has become an important and active research topic. A major challenge posed by data streams is that their underlying concepts can change over time, which requires current classifiers to be revised accordingly and timely. To detect concept change, a common methodology is to observe the online classification accuracy. If accuracy drops below some threshold value, a concept change is deemed to have taken place. An implicit assumption behind this methodology is that any drop in classification accuracy can be interpreted as a symptom of concept change. Unfortunately however, this assumption is often violated in the real world where data streams carry noise that can also introduce a significant reduction in classification accuracy. To compound this problem, traditional noise cleansing methods are incompetent for data streams. Those methods normally need to scan data multiple times whereas learning for data streams can only afford one-pass scan because of data’s high speed and huge volume. Another open problem in data stream classification is how to deal with missing values. When new instances containing missing values arrive, how a learning model classifies them and how the learning model updates itself according to them is an issue whose solution is far from being explored. To solve these problems, this paper proposes a novel classification algorithm, flexible decision tree (FlexDT), which extends fuzzy logic to data stream classification. The advantages are three-fold. First, FlexDT offers a flexible structure to effectively and efficiently handle concept change. Second, FlexDT is robust to noise. Hence it can prevent noise from interfering with classification accuracy, and accuracy drop can be safely attributed to concept change. Third, it deals with missing values in an elegant way. 
Extensive evaluations are conducted to compare FlexDT with representative existing data stream classification algorithms using a large suite of data streams and various statistical tests. Experimental results suggest that FlexDT offers a significant benefit to data stream classification in real-world scenarios where concept change, noise and missing values coexist.
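The accuracy-threshold methodology that entry 10 describes as the common baseline can be sketched as a sliding-window monitor. This is not FlexDT; the class name `AccuracyMonitor` and the `window` and `threshold` parameters are hypothetical, chosen only to make the baseline concrete.

```python
from collections import deque


class AccuracyMonitor:
    """Flags a possible concept change when windowed accuracy drops too low."""

    def __init__(self, window=100, threshold=0.7):
        # Only the most recent `window` outcomes are kept (one-pass, bounded memory).
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def accuracy(self):
        """Fraction of correct predictions among the stored outcomes."""
        return sum(self.window) / len(self.window)

    def update(self, predicted, actual):
        """Record one outcome; return True if a change is flagged."""
        self.window.append(predicted == actual)
        # Wait until the window is full before judging accuracy.
        full = len(self.window) == self.window.maxlen
        return full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=4, threshold=0.75)
flags = [monitor.update(p, a) for p, a in [(1, 1), (0, 0), (1, 1), (0, 0), (1, 0), (1, 0)]]
```

A monitor of this kind cannot tell a noisy burst from a real concept change, since both lower the windowed accuracy; that is the weakness the abstract attributes to the baseline.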

11.
Most of the research on machine learning-based real-time scheduling (RTS) systems has been aimed at constant product mix environments. However, in a manufacturing environment with a varying product mix, the scheduling knowledge base (KB) is dynamic; therefore, it would be useful to develop a procedure that automatically modifies the scheduling knowledge when important changes occur in the manufacturing system. All of the machine learning-based RTS systems (including a KB refinement mechanism) proposed in earlier studies periodically require the addition of new training samples and regeneration of new KBs. Hence, previous approaches to machine learning-based RTS systems have been confronted with the training data overflow problem and an increase in the scheduling KB building time, which make them unsuitable for RTS control. The objective of this paper is to develop a KB class selection mechanism that can be supported in environments with various product mix ratios. Hence, the RTS KB is developed by a two-level decision tree (DT) learning approach. First, a suitable scheduling KB class is selected. Then, for each KB class, the best (proper) dispatching rule is selected for the next scheduling period. The proposed two-level DT RTS system comprises five key components: (1) a training sample generation mechanism, (2) a GA/DT-based feature selection mechanism, (3) building a KB class label by a two-level self-organizing map, (4) a DT-based KB class selection module, and (5) a DT-based dynamic dispatching rule selection module. The proposed two-level DT-based KB RTS system yields better system performance than a one-level DT-based RTS system and heuristic individual dispatching rules in a flexible manufacturing system under various performance criteria over a long period.

12.
Algorithms for the analysis of graph sequences are proposed in this paper. In particular, we study the problem of recovering missing information and predicting the occurrence of nodes and edges in time series of graphs. Two different recovery schemes are developed. The first scheme uses reference patterns that are extracted from a training set of graph sequences, while the second method is based on decision tree induction. Our work is motivated by applications in computer network analysis. However, the proposed recovery and prediction schemes are generic and can be applied in other domains as well.

13.
As opposed to trees that use a single type of decision node, an omnivariate decision tree contains nodes of different types. We propose to use Structural Risk Minimization (SRM) to choose between node types in omnivariate decision tree construction to match the complexity of a node to the complexity of the data reaching that node. In order to apply SRM for model selection, one needs the VC-dimension of the candidate models. In this paper, we first derive the VC-dimension of the univariate model, and estimate the VC-dimension of all three models (univariate, linear multivariate or quadratic multivariate) experimentally. Second, we compare SRM with other model selection techniques including Akaike’s Information Criterion (AIC), Bayesian Information Criterion (BIC) and cross-validation (CV) on standard datasets from the UCI and Delve repositories. We see that SRM induces omnivariate trees that have a small percentage of multivariate nodes close to the root, and that these trees generalize better than, or at least as accurately as, those constructed using other model selection techniques.

14.
Several investigations indicate that the Bidirectional Reflectance Distribution Function (BRDF) contains information that can be used to complement spectral information for improved land cover classification accuracies. Prior studies on the addition of BRDF information to improve land cover classifications have been conducted primarily at local or regional scales. Thus, the potential benefits of adding BRDF information to improve global to continental scale land cover classification have not yet been explored. Here we examine the impact of multidirectional global scale data from the first Polarization and Directionality of Earth Reflectances (POLDER) spacecraft instrument flown on the Advanced Earth Observing Satellite (ADEOS-1) platform on overall classification accuracy and per-class accuracies for 15 land cover categories specified by the International Geosphere Biosphere Programme (IGBP).

A set of 36,648 global training pixels (7 × 6 km spatial resolution) was used with a decision tree classifier to evaluate the performance of classifying POLDER data with and without the inclusion of BRDF information. BRDF ‘metrics’ for the eight-month POLDER on ADEOS-1 archive (10/1996–06/1997) were developed that describe the temporal evolution of the BRDF as captured by a semi-empirical BRDF model. The concept of BRDF ‘feature space’ is introduced and used to explore and exploit the bidirectional information content. The C5.0 decision tree classifier was applied with a boosting option, with the temporal metrics for spectral albedo as input for a first test, and with spectral albedo and BRDF metrics for a second test. Results were evaluated against 20 random subsets of the training data.

Examination of the BRDF feature space indicates that coarse scale BRDF coefficients from POLDER provide information on land cover that is different from the spectral and temporal information of the imagery. The contribution of BRDF information to reducing classification errors is also demonstrated: the addition of BRDF metrics reduces the mean overall classification error rate by 3.15 percentage points (from 18.1% to 14.95%), with larger improvements in producer's accuracies for individual classes such as Grasslands (+8.71%), Urban areas (+8.02%), and Wetlands (+7.82%). User's accuracies for the Urban (+7.42%) and Evergreen Broadleaf Forest (+6.70%) classes also increase. The methodology and results are widely applicable to current multidirectional satellite data from the Multi-angle Imaging Spectroradiometer (MISR), and to the next generation of POLDER-like multi-directional instruments.


15.
Classifying online network traffic is becoming critical in network management and security. Recently, new classification methods based on analysis of statistical features of transport layer traffic have been proposed. While these new methods address the limitations of port-based and payload-based traffic classification, the current software-based solutions are not fast enough to deal with the traffic of today’s high-speed networks. In this paper, we propose an online statistical traffic classifier using the C4.5 machine learning algorithm running on the NetFPGA platform. Our NetFPGA classifier is constructed by adding three main modules to the NetFPGA reference switch design: a Netflow module, a feature extractor module, and a C4.5 search tree classifier. The proposed classifier is able to classify the input traffic at the maximum line speed of the NetFPGA platform, i.e. 8 Gbps, without any packet loss. Our method is based on the statistical features of the first few packets of a flow. The flow is classified just a few microseconds after the desired number of packets has been received.

16.
Although numerous approaches for comparing and ranking fuzzy numbers have been proposed in recent years, most of them suffer from several shortcomings. In particular, they produce counter-intuitive ranking orders in certain cases, produce inconsistent ranking orders for the fuzzy numbers’ images, and lack the discrimination power to rank similar and symmetric fuzzy numbers. This study's goal is to propose a new epsilon-deviation degree approach based on the left and right areas of a fuzzy number and the concept of a centroid point to overcome these drawbacks. The proposed approach defines an epsilon-transfer coefficient to avoid illogicality when ranking fuzzy numbers with identical centroid points and develops two innovative ranking indices to consistently distinguish similar or symmetric fuzzy numbers by considering the decision maker's attitude. The advantages of the proposed method are illustrated through several numerical examples and comparisons with the existing approaches. The results demonstrate that this approach is effective for ranking generalized fuzzy numbers and overcomes the shortcomings in recent studies.

17.
A Direct Sum Theorem holds in a model of computation when, for every problem, solving k input instances together is k times as expensive as solving one. We show that Direct Sum Theorems hold in the models of deterministic and randomized decision trees for all relations. We also note that a near-optimal Direct Sum Theorem holds for quantum decision trees for boolean functions.

18.
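Zhang Yan (张棪), Cao Jian (曹健). Computer Science (《计算机科学》), 2016, 43(Z6): 374-379, 383
As a predictive model in machine learning, the decision tree is widely used in many fields because its output is easy to understand and interpret, and it has become an active topic of academic research. With the sharp increase in the rate at which data is generated, conventional decision tree algorithms cannot process large datasets because of limits on memory capacity and processor speed, so implementations of decision tree algorithms need to be adapted accordingly. This paper first describes the basic decision tree algorithms and optimization methods. On this basis, and in light of the challenges posed by big data, it classifies and compares the strengths and weaknesses of the various adapted algorithms and introduces the platforms that support running them. Finally, it discusses future directions for decision tree algorithms for big data.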

19.
The software in modern systems has become too complex to allow accurate predictions about its performance under different configurations. Real-time or even responsiveness requirements cannot be met because it is not possible to perform admission control for new or changing tasks if we cannot tell how their execution affects the other tasks already running. Previously, we proposed a resource-allocation middleware that manages the execution of tasks in a complex distributed system with real-time requirements. The middleware behavior can be modeled depending on the configuration of the tasks running, so that the performance of any given configuration can be calculated. This makes it possible to have admission control in such a system, but the model requires knowledge of run-time parameters. We propose using machine-learning algorithms to obtain the model parameters and to predict the system performance under any configuration, so that we can provide a full admission control mechanism for complex software systems. In this paper, we present such an admission control mechanism, we measure its accuracy in estimating the parameters of the model, and we evaluate its performance to determine its suitability for a real-time or responsive system.

20.
This paper proposes an expert system called VIBEX (VIBration EXpert) to aid plant operators in diagnosing the cause of abnormal vibration in rotating machinery. To automate the diagnosis, a decision table based on the cause-symptom matrix is used as a probabilistic method for diagnosing abnormal vibration. In addition, a decision tree is introduced to acquire structured knowledge in the form of concepts and to build the knowledge base that is indispensable for vibration expert systems. The decision tree is a technique for building knowledge-based systems by inductive inference from examples, and it also serves as a vibration diagnostic tool in its own right. The proposed system has been implemented in the Microsoft Windows environment and is written in Microsoft Visual Basic and Visual C++. To validate system performance, the diagnostic system was tested on some examples using the two diagnostic methods.
