
1.

Splitting methods for decision tree induction: An exploration of the relative performance of two entropy-based families

Kweku-Muata Osei-Bryson Kendall Giles 《Information Systems Frontiers》2006,8(3):195-209

Decision tree (DT) induction is among the more popular of the data mining techniques. An important component of DT induction algorithms is the splitting method, with the most commonly used method being based on the Conditional Entropy (CE) family. However, it is well known that there is no single splitting method that will give the best performance for all problem instances. In this paper we explore the relative performance of the Conditional Entropy family and another family that is based on the Class-Attribute Mutual Information (CAMI) measure. Our results suggest that while some datasets are insensitive to the choice of splitting methods, other datasets are very sensitive to the choice of splitting methods. For example, some of the CAMI family methods may be more appropriate than the popular Gain Ratio (GR) method for datasets which have nominal predictor attributes, and are competitive with the GR method for those datasets where all predictor attributes are numeric. Given that it is never known beforehand which splitting method will lead to the best DT for a given dataset, and given the relatively good performance of the CAMI methods, it seems appropriate to suggest that splitting methods from the CAMI family should be included in data mining toolsets.

Kweku-Muata Osei-Bryson is Professor of Information Systems at Virginia Commonwealth University, where he also served as the Coordinator of the Ph.D. program in Information Systems during 2001–2003. Previously he was Professor of Information Systems and Decision Analysis in the School of Business at Howard University, Washington, DC, U.S.A. He has also worked as an Information Systems practitioner in both industry and government. He holds a Ph.D. in Applied Mathematics (Management Science & Information Systems) from the University of Maryland at College Park, an M.S. in Systems Engineering from Howard University, and a B.Sc. in Natural Sciences from the University of the West Indies at Mona.
He currently does research in various areas including: Data Mining, Expert Systems, Decision Support Systems, Group Support Systems, Information Systems Outsourcing, Multi-Criteria Decision Analysis. His papers have been published in various journals including: Information & Management, Information Systems Journal, Information Systems Frontiers, Business Process Management Journal, International Journal of Intelligent Systems, IEEE Transactions on Knowledge & Data Engineering, Data & Knowledge Engineering, Information & Software Technology, Decision Support Systems, Information Processing and Management, Computers & Operations Research, European Journal of Operational Research, Journal of the Operational Research Society, Journal of the Association for Information Systems, Journal of Multi-Criteria Decision Analysis, Applications of Management Science. Currently he serves as an Associate Editor of the INFORMS Journal on Computing, and is a member of the Editorial Board of the Computers & Operations Research journal.

Kendall E. Giles received the BS degree in Electrical Engineering from Virginia Tech in 1991, the MS degree in Electrical Engineering from Purdue University in 1993, the MS degree in Information Systems from Virginia Commonwealth University in 2002, and the MS degree in Computer Science from Johns Hopkins University in 2004. Currently he is a PhD student (ABD) in Computer Science at Johns Hopkins, and is a Research Assistant in the Applied Mathematics and Statistics department. He has over 15 years of work experience in industry, government, and academic institutions. His research interests can be partially summarized by the following keywords: network security, mathematical modeling, pattern classification, and high dimensional data analysis.
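As a rough illustration of the entropy-based splitting criteria named in the abstract for entry 1, the sketch below computes information gain and Gain Ratio for a nominal attribute using the familiar textbook definitions. The function names and the toy data are hypothetical, and the CAMI family is not shown because the abstract does not define it.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split_scores(attribute_values, labels):
    """Return (information_gain, gain_ratio) for splitting on a nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy of the class given the attribute value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    # Split information penalises attributes with many distinct values.
    split_info = entropy(attribute_values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


# Toy data (hypothetical): the attribute separates the classes perfectly.
outlook = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]
gain, ratio = split_scores(outlook, play)
```

An attribute that carries no class information, such as `["a", "b", "a", "b"]` against the same labels, scores zero on both measures.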

2.

Building a cost-constrained decision tree with multiple condition attributes

Yen-Liang Chen Chia-Chi Wu 《Information Sciences》2009,179(7):967-979

Costs are often an important part of the classification process. Cost factors have been taken into consideration in many previous studies regarding decision tree models. In this study, we also consider a cost-sensitive decision tree construction problem. We assume that there are test costs that must be paid to obtain the values of the decision attribute and that a record must be classified without exceeding the spending cost threshold. Unlike previous studies, however, in which records were classified with only a single condition attribute, in this study, we are able to simultaneously classify records with multiple condition attributes. An algorithm is developed to build a cost-constrained decision tree, which allows us to simultaneously classify multiple condition attributes. The experimental results show that our algorithm satisfactorily handles data with multiple condition attributes under different cost constraints.

3.

Distance-based tree models for ranking data (Cited by: 1; self-citations: 0; other citations: 1)

Paul H. Lee Philip L.H. Yu 《Computational statistics & data analysis》2010,54(6):1672-1682

Ranking data has applications in different fields of study, like marketing, psychology and politics. Over the years, many models for ranking data have been developed. Among them, distance-based ranking models, which originate from the classical rank correlations, postulate that the probability of observing a ranking of items depends on the distance between the observed ranking and a modal ranking. The closer to the modal ranking, the higher the ranking probability is. However, such a model basically assumes a homogeneous population and does not incorporate the presence of covariates. To overcome these limitations, we combine the strength of a tree model and the existing distance-based models to build a model that can handle more complexity and improve prediction accuracy. We will introduce a recursive partitioning algorithm for building a tree model with a distance-based ranking model fitted at each leaf. We will also consider new weighted distance measures which allow different weights for different ranks in formulating more flexible distance-based tree models. Finally, we will apply the proposed methodology to analyze a ranking dataset of Inglehart’s items collected in the 1999 European Values Studies.
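To make the distance-based idea in entry 3 concrete, here is a minimal sketch that assumes the Kendall distance and an exponential (Mallows-style) form, where a ranking's probability is proportional to exp(-dispersion × distance to the modal ranking). The abstract does not say which distance or functional form the authors use, so both choices, along with all names below, are assumptions for illustration.

```python
from itertools import combinations, permutations
from math import exp


def kendall_distance(ranking_a, ranking_b):
    """Count the item pairs that the two rankings order differently."""
    position_b = {item: i for i, item in enumerate(ranking_b)}
    discordant = 0
    for x, y in combinations(ranking_a, 2):  # x precedes y in ranking_a
        if position_b[x] > position_b[y]:
            discordant += 1
    return discordant


def ranking_probabilities(modal, dispersion):
    """Probability of every ranking, proportional to exp(-dispersion * distance)."""
    weights = {
        perm: exp(-dispersion * kendall_distance(perm, modal))
        for perm in permutations(modal)
    }
    normaliser = sum(weights.values())
    return {perm: w / normaliser for perm, w in weights.items()}


# Hypothetical modal ranking of three items.
modal_ranking = ("A", "B", "C")
probs = ranking_probabilities(modal_ranking, dispersion=1.0)
```

With a positive dispersion the modal ranking receives the highest probability and the fully reversed ranking the lowest, which matches the "closer to the modal ranking, higher probability" statement in the abstract.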

4.

Region-based image retrieval with high-level semantics using decision tree learning

Ying Liu Dengsheng Zhang Guojun Lu 《Pattern recognition》2008,41(8):2554-2570

Semantic-based image retrieval has attracted great interest in recent years. This paper proposes a region-based image retrieval system with high-level semantic learning. The key features of the system are: (1) it supports both query by keyword and query by region of interest. The system segments an image into different regions and extracts low-level features of each region. From these features, high-level concepts are obtained using a proposed decision tree-based learning algorithm named DT-ST. During retrieval, a set of images whose semantic concept matches the query is returned. Experiments on a standard real-world image database confirm that the proposed system significantly improves the retrieval performance, compared with a conventional content-based image retrieval system. (2) The proposed decision tree induction method DT-ST for image semantic learning is different from other decision tree induction algorithms in that it makes use of the semantic templates to discretize continuous-valued region features and avoids the difficult image feature discretization problem. Furthermore, it introduces a hybrid tree simplification method to handle the noise and tree fragmentation problems, thereby improving the classification performance of the tree. Experimental results indicate that DT-ST outperforms two well-established decision tree induction algorithms ID3 and C4.5 in image semantic learning.

5.

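Research on image emotion classification based on decision trees and fuzzy SVM

吴立棣 秦亮曦 陈永生 《微计算机信息》 (Microcomputer Information) 2011,(8)

A binary decision mechanism is incorporated into a fuzzy support vector machine classification system to classify images at the level of emotional semantics. The difficulty lies in establishing the mapping from low-level image features to high-level emotional semantics, and in choosing reasonable parameters. Multi-class classification is achieved by combining the classifier with a decision tree method. Experimental results show that the system is simple, fast, and efficient for image emotion classification.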

6.

Learning decision tree for ranking (Cited by: 1; self-citations: 3; other citations: 1)

Liangxiao Jiang Chaoqun Li Zhihua Cai 《Knowledge and Information Systems》2009,20(1):123-135

Decision tree is one of the most effective and widely used methods for classification. However, many real-world applications require instances to be ranked by the probability of class membership. The area under the receiver operating characteristics curve, simply AUC, has been recently used as a measure for ranking performance of learning algorithms. In this paper, we present two novel class probability estimation algorithms to improve the ranking performance of decision tree. Instead of estimating the probability of class membership using simple voting at the leaf where the test instance falls into, our algorithms use similarity-weighted voting and naive Bayes. We design empirical experiments to verify that our new algorithms significantly outperform the recent decision tree ranking algorithm C4.4 in terms of AUC.
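The AUC measure mentioned in entry 6 can be sketched as the fraction of (positive, negative) instance pairs that a scorer ranks correctly, with ties counted as half. This is the standard pairwise definition, not the authors' evaluation code; the function name and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy class-probability estimates (hypothetical); one positive ties with one negative.
scores = [0.9, 0.8, 0.8, 0.3]
labels = [1, 1, 0, 0]
result = auc(scores, labels)
```

Ties of this kind are common when probabilities come from raw vote counts at a leaf, because every instance reaching the same leaf gets the same score.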


7.

Cost-sensitive decision tree ensembles for effective imbalanced classification

《Applied Soft Computing》2014

Real-life datasets are often imbalanced, that is, there are significantly more training samples available for some classes than for others, and consequently the conventional aim of maximizing overall classification accuracy is not appropriate when dealing with such problems. Various approaches have been introduced in the literature to deal with imbalanced datasets, and are typically based on oversampling, undersampling or cost-sensitive classification. In this paper, we introduce an effective ensemble of cost-sensitive decision trees for imbalanced classification. Base classifiers are constructed according to a given cost matrix, but are trained on random feature subspaces to ensure sufficient diversity of the ensemble members. We employ an evolutionary algorithm for simultaneous classifier selection and assignment of committee member weights for the fusion process. Our proposed algorithm is evaluated on a variety of benchmark datasets, and is confirmed to lead to improved recognition of the minority class, to be capable of outperforming other state-of-the-art algorithms, and hence to represent a useful and effective approach for dealing with imbalanced datasets.
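One common way a cost matrix enters classification, sketched here for entry 7, is the minimum-expected-cost decision rule: predict the class whose expected misclassification cost is lowest given the estimated class probabilities. This shows only that generic rule under a hypothetical cost matrix; it is not the ensemble, the feature-subspace training, or the evolutionary weighting described in the abstract.

```python
def min_expected_cost_class(class_probabilities, cost_matrix):
    """Pick the prediction with the lowest expected misclassification cost.

    cost_matrix[true_class][predicted_class] is the cost of that outcome.
    """
    n_classes = len(class_probabilities)
    expected = [
        sum(class_probabilities[t] * cost_matrix[t][p] for t in range(n_classes))
        for p in range(n_classes)
    ]
    best = min(range(n_classes), key=expected.__getitem__)
    return best, expected


# Hypothetical costs: class 0 is the majority, class 1 the minority;
# missing a minority case is assumed to cost ten times a false alarm.
costs = [[0.0, 1.0],
         [10.0, 0.0]]
prediction, expected_costs = min_expected_cost_class([0.8, 0.2], costs)
```

Here the minority class is predicted even though its estimated probability is only 0.2, because the expected cost of predicting the majority class (2.0) exceeds that of predicting the minority class (0.8).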

8.

Flexible decision tree for data stream classification in the presence of concept change, noise and missing values (Cited by: 1; self-citations: 0; other citations: 1)

Sattar Hashemi Ying Yang 《Data mining and knowledge discovery》2009,19(1):95-131

In recent years, classification learning for data streams has become an important and active research topic. A major challenge posed by data streams is that their underlying concepts can change over time, which requires current classifiers to be revised accordingly and timely. To detect concept change, a common methodology is to observe the online classification accuracy. If accuracy drops below some threshold value, a concept change is deemed to have taken place. An implicit assumption behind this methodology is that any drop in classification accuracy can be interpreted as a symptom of concept change. Unfortunately however, this assumption is often violated in the real world where data streams carry noise that can also introduce a significant reduction in classification accuracy. To compound this problem, traditional noise cleansing methods are incompetent for data streams. Those methods normally need to scan data multiple times whereas learning for data streams can only afford one-pass scan because of data’s high speed and huge volume. Another open problem in data stream classification is how to deal with missing values. When new instances containing missing values arrive, how a learning model classifies them and how the learning model updates itself according to them is an issue whose solution is far from being explored. To solve these problems, this paper proposes a novel classification algorithm, flexible decision tree (FlexDT), which extends fuzzy logic to data stream classification. The advantages are three-fold. First, FlexDT offers a flexible structure to effectively and efficiently handle concept change. Second, FlexDT is robust to noise. Hence it can prevent noise from interfering with classification accuracy, and accuracy drop can be safely attributed to concept change. Third, it deals with missing values in an elegant way. 
Extensive evaluations are conducted to compare FlexDT with representative existing data stream classification algorithms using a large suite of data streams and various statistical tests. Experimental results suggest that FlexDT offers a significant benefit to data stream classification in real-world scenarios where concept change, noise and missing values coexist.
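The common accuracy-threshold methodology that the abstract for entry 8 describes (and criticises) can be sketched as a sliding-window monitor. This is not FlexDT; the class name, window size, and threshold are hypothetical choices for illustration.

```python
from collections import deque


class AccuracyDriftMonitor:
    """Flag a possible concept change when windowed accuracy falls below a threshold."""

    def __init__(self, window_size=100, threshold=0.7):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, predicted, actual):
        """Record one prediction; return True if accuracy over a full window is too low."""
        self.window.append(predicted == actual)
        if len(self.window) < self.window.maxlen:
            return False  # not enough evidence yet
        accuracy = sum(self.window) / len(self.window)
        return accuracy < self.threshold


# Toy stream of (predicted, actual) pairs: correct at first, then consistently wrong.
monitor = AccuracyDriftMonitor(window_size=4, threshold=0.5)
stream = [(1, 1), (0, 0), (1, 1), (1, 1), (1, 0), (1, 0), (1, 0)]
alarms = [monitor.update(pred, actual) for pred, actual in stream]
```

The monitor cannot tell whether the errors come from a genuine concept change or from noisy labels, which is exactly the weakness the abstract attributes to this methodology.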

9.

Development of two-level decision tree-based real-time scheduling system under product mix variety environment (Cited by: 1; self-citations: 0; other citations: 1)

Yeou-Ren Shiue 《Robotics and Computer》2009,25(4-5):709-720

Most of the research on machine learning-based real-time scheduling (RTS) systems has been aimed toward product constant mix environments. However, in a product mix variety manufacturing environment, the scheduling knowledge base (KB) is dynamic; therefore, it would be interesting to develop a procedure that would automatically modify the scheduling knowledge when important changes occur in the manufacturing system. All of the machine learning-based RTS systems (including a KB refinement mechanism) proposed in earlier studies periodically require the addition of new training samples and regeneration of new KBs. Hence, previous approaches investigating machine learning-based RTS systems have been confronted with the training data overflow problem and an increase in the scheduling KB building time, which are unsuitable for RTS control. The objective of this paper is to develop a KB class selection mechanism that can be supported in various product mix ratio environments. Hence, the RTS KB is developed by a two-level decision tree (DT) learning approach. First, a suitable scheduling KB class is selected. Then, for each KB class, the best (proper) dispatching rule is selected for the next scheduling period. Here, the proposed two-level DT RTS system comprises five key components: (1) training samples generation mechanism, (2) GA/DT-based feature selection mechanism, (3) building a KB class label by a two-level self-organizing map, (4) DT-based KB class selection module, and (5) DT-based dynamic dispatching rule selection module. The proposed two-level DT-based KB RTS system yields better system performance than that by a one-level DT-based RTS system and heuristic individual dispatching rules in a flexible manufacturing system under various performance criteria over a long period.

10.

Recovery of missing information in graph sequences by means of reference pattern matching and decision tree learning

Horst Bunke Peter Dickinson 《Pattern recognition》2006,39(4):573-586

Algorithms for the analysis of graph sequences are proposed in this paper. In particular, we study the problem of recovering missing information and predicting the occurrence of nodes and edges in time series of graphs. Two different recovery schemes are developed. The first scheme uses reference patterns that are extracted from a training set of graph sequences, while the second method is based on decision tree induction. Our work is motivated by applications in computer network analysis. However, the proposed recovery and prediction schemes are generic and can be applied in other domains as well.

11.

Model selection in omnivariate decision trees using Structural Risk Minimization

Olcay Taner Yıldız 《Information Sciences》2011,181(23):5214-5226

As opposed to trees that use a single type of decision node, an omnivariate decision tree contains nodes of different types. We propose to use Structural Risk Minimization (SRM) to choose between node types in omnivariate decision tree construction to match the complexity of a node to the complexity of the data reaching that node. In order to apply SRM for model selection, one needs the VC-dimension of the candidate models. In this paper, we first derive the VC-dimension of the univariate model, and estimate the VC-dimension of all three models (univariate, linear multivariate or quadratic multivariate) experimentally. Second, we compare SRM with other model selection techniques including Akaike’s Information Criterion (AIC), Bayesian Information Criterion (BIC) and cross-validation (CV) on standard datasets from the UCI and Delve repositories. We see that SRM induces omnivariate trees that have a small percentage of multivariate nodes close to the root, and that these trees generalize better than, or at least as accurately as, those constructed using other model selection techniques.
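Two of the baseline criteria named in entry 11, AIC and BIC, have standard closed forms, sketched below for choosing among candidate node types. The log-likelihoods and parameter counts are invented purely for illustration, and SRM itself is not shown because its bound depends on VC-dimension estimates that the abstract does not give.

```python
from math import log


def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * log(n_samples) - 2 * log_likelihood


# Hypothetical candidates for one decision node: (log-likelihood, parameter count).
candidates = {
    "univariate": (-60.0, 2),
    "linear": (-55.0, 5),
    "quadratic": (-54.0, 15),
}
n_samples = 100
best_by_aic = min(candidates, key=lambda name: aic(*candidates[name]))
best_by_bic = min(candidates, key=lambda name: bic(*candidates[name], n_samples))
```

With these made-up numbers AIC prefers the linear node while BIC, which penalises parameters more heavily at this sample size, prefers the univariate node.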

12.

Online NetFPGA decision tree statistical traffic classifier

Alireza Monemi Roozbeh Zarei Muhammad N. Marsono 《Computer Communications》2013

Classifying online network traffic is becoming critical in network management and security. Recently, new classification methods based on analysis of statistical features of transport layer traffic have been proposed. While these new methods address the limitations of port-based and payload-based traffic classification, the current software-based solutions are not fast enough to deal with the traffic of today’s high-speed networks. In this paper, we propose an online statistical traffic classifier using the C4.5 machine learning algorithm running on the NetFPGA platform. Our NetFPGA classifier is constructed by adding three main modules to the NetFPGA reference switch design: a Netflow module, a feature extractor module, and a C4.5 search tree classifier. The proposed classifier is able to classify the input traffic at the maximum line speed of the NetFPGA platform, i.e. 8 Gbps, without any packet loss. Our method is based on the statistical features of the first few packets of a flow. The flow is classified just a few microseconds after receiving the desired number of packets.

13.

Improving global scale land cover classifications with multi-directional POLDER data and a decision tree classifier

Eric C. Brown de Colstoun Charles L. Walthall 《Remote sensing of environment》2006,100(4):474-485

Several investigations indicate that the Bidirectional Reflectance Distribution Function (BRDF) contains information that can be used to complement spectral information for improved land cover classification accuracies. Prior studies on the addition of BRDF information to improve land cover classifications have been conducted primarily at local or regional scales. Thus, the potential benefits of adding BRDF information to improve global to continental scale land cover classification have not yet been explored. Here we examine the impact of multidirectional global scale data from the first Polarization and Directionality of Earth Reflectances (POLDER) spacecraft instrument flown on the Advanced Earth Observing Satellite (ADEOS-1) platform on overall classification accuracy and per-class accuracies for 15 land cover categories specified by the International Geosphere Biosphere Programme (IGBP).

A set of 36,648 global training pixels (7 × 6 km spatial resolution) was used with a decision tree classifier to evaluate the performance of classifying POLDER data with and without the inclusion of BRDF information. BRDF ‘metrics’ for the eight-month POLDER on ADEOS-1 archive (10/1996–06/1997) were developed that describe the temporal evolution of the BRDF as captured by a semi-empirical BRDF model. The concept of BRDF ‘feature space’ is introduced and used to explore and exploit the bidirectional information content. The C5.0 decision tree classifier was applied with a boosting option, with the temporal metrics for spectral albedo as input for a first test, and with spectral albedo and BRDF metrics for a second test. Results were evaluated against 20 random subsets of the training data.

Examination of the BRDF feature space indicates that coarse scale BRDF coefficients from POLDER provide information on land cover that is different from the spectral and temporal information of the imagery. The contribution of BRDF information to reducing classification errors is also demonstrated: the addition of BRDF metrics reduces the mean, overall classification error rates by 3.15% (from 18.1% to 14.95% error) with larger improvements for producer's accuracies of individual classes such as Grasslands (+8.71%), Urban areas (+8.02%), and Wetlands (+7.82%). User's accuracies for the Urban (+7.42%) and Evergreen Broadleaf Forest (+6.70%) classes are also increased. The methodology and results are widely applicable to current multidirectional satellite data from the Multi-angle Imaging Spectroradiometer (MISR), and to the next generation of POLDER-like multi-directional instruments.

14.

Ranking fuzzy numbers based on epsilon-deviation degree

Vincent F. Yu Ha Thi Xuan Chi Chien-wen Shen 《Applied Soft Computing》2013,13(8):3621-3627

Although numerous research studies in recent years have been proposed for comparing and ranking fuzzy numbers, most of the existing approaches suffer from plenty of shortcomings. In particular, they have produced counter-intuitive ranking orders under certain cases, inconsistent ranking orders of the fuzzy numbers’ images, and lack of discrimination power to rank similar and symmetric fuzzy numbers. This study's goal is to propose a new epsilon-deviation degree approach based on the left and right areas of a fuzzy number and the concept of a centroid point to overcome previous drawbacks. The proposed approach defines an epsilon-transfer coefficient to avoid illogicality when ranking fuzzy numbers with identical centroid points and develops two innovative ranking indices to consistently distinguish similar or symmetric fuzzy numbers by considering the decision maker's attitude. The advantages of the proposed method are illustrated through several numerical examples and comparisons with the existing approaches. The results demonstrate that this approach is effective for ranking generalized fuzzy numbers and overcomes the shortcomings in recent studies.

15.

Optimal direct sum results for deterministic and randomized decision tree complexity

Rahul Jain Hartmut Klauck 《Information Processing Letters》2010,110(20):893-897

A Direct Sum Theorem holds in a model of computation when, for every problem, solving some k input instances together is k times as expensive as solving one. We show that Direct Sum Theorems hold in the models of deterministic and randomized decision trees for all relations. We also note that a near optimal Direct Sum Theorem holds for quantum decision trees for boolean functions.

16.

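Decision tree algorithms for big data analysis

张棪 曹健 《计算机科学》 (Computer Science) 2016,43(Z6):374-379, 383

As a predictive model in machine learning, the decision tree is widely used across many fields because its output is easy to understand and interpret, and it has become a hot topic of academic research. As the rate of data generation has risen sharply, conventional decision tree algorithms can no longer process large datasets because of limits such as memory capacity and processor speed, so implementations of decision tree algorithms need to be adapted accordingly. The paper first describes the basic decision tree algorithms and their optimization methods. On this basis, and in light of the challenges posed by big data, it classifies and compares the strengths and weaknesses of the various adapted algorithms and introduces the platforms that support running them. Finally, it discusses future directions for decision tree algorithms aimed at big data.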

17.

Entropy lower bounds for quantum decision tree complexity

Yaoyun Shi 《Information Processing Letters》2002,81(1):23-27


18.

Efficient decision tree design for discrete variable pattern recognition problems

I.K. Sethi B. Chatterjee 《Pattern recognition》1977,9(4):197-206

An algorithm is developed for the design of an efficient decision tree with application to the pattern recognition problems involving discrete variables. The problem of evaluating an extremely large number of trees in search of a minimum cost decision tree is tackled by defining a criterion to estimate the minimum expected cost of a tree in terms of the weights of its terminal nodes and costs of the measurements, which then is used to establish the search procedure for the efficient decision tree. The concept of prime events is used to obtain the number of modes and the corresponding weights in the design samples. An application of the proposed algorithm is presented for the design of an efficient decision tree for classifying Devanagri numerals.

19.

Admission control for a responsive distributed middleware using decision trees to model run-time parameters

Luis Garcés-Erice 《Parallel Computing》2011,37(8):379-391

The software in modern systems has become too complex to make accurate predictions about their performance under different configurations. Real-time or even responsiveness requirements cannot be met because it is not possible to perform admission control for new or changing tasks if we cannot tell how their execution affects the other tasks already running. Previously, we proposed a resource-allocation middleware that manages the execution of tasks in a complex distributed system with real-time requirements. The middleware behavior can be modeled depending on the configuration of the tasks running, so that the performance of any given configuration can be calculated. This makes it possible to have admission control in such a system, but the model requires knowledge of run-time parameters. We propose the utilization of machine-learning algorithms to obtain the model parameters, and be able to predict the system performance under any configuration, so that we can provide a full admission control mechanism for complex software systems. In this paper, we present such an admission control mechanism, we measure its accuracy in estimating the parameters of the model, and we evaluate its performance to determine its suitability for a real-time or responsive system.

20.

An assessment of the effectiveness of decision tree methods for land cover classification (Cited by: 11; self-citations: 0; other citations: 11)

Mahesh Pal Paul M Mather 《Remote sensing of environment》2003,86(4):554-565

Choice of a classification algorithm is generally based upon a number of factors, among which are availability of software, ease of use, and performance, measured here by overall classification accuracy. The maximum likelihood (ML) procedure is, for many users, the algorithm of choice because of its ready availability and the fact that it does not require an extended training process. Artificial neural networks (ANNs) are now widely used by researchers, but their operational applications are hindered by the need for the user to specify the configuration of the network architecture and to provide values for a number of parameters, both of which affect performance. The ANN also requires an extended training phase.

In the past few years, the use of decision trees (DTs) to classify remotely sensed data has increased. Proponents of the method claim that it has a number of advantages over the ML and ANN algorithms. The DT is computationally fast, makes no statistical assumptions, and can handle data that are represented on different measurement scales. Software to implement DTs is readily available over the Internet. Pruning of DTs can make them smaller and more easily interpretable, while the use of boosting techniques can improve performance.

In this study, separate test and training data sets from two different geographical areas and two different sensors (multispectral Landsat ETM+ and hyperspectral DAIS) are used to evaluate the performance of univariate and multivariate DTs for land cover classification. Factors considered are: the effects of variations in training data set size and of the dimensionality of the feature space, together with the impact of boosting, attribute selection measures, and pruning. The level of classification accuracy achieved by the DT is compared to results from back-propagating ANN and the ML classifiers.

Our results indicate that the performance of the univariate DT is acceptably good in comparison with that of other classifiers, except with high-dimensional data. Classification accuracy increases linearly with training data set size to a limit of 300 pixels per class in this case. Multivariate DTs do not appear to perform better than univariate DTs. While boosting produces an increase in classification accuracy of between 3% and 6%, the use of attribute selection methods does not appear to be justified in terms of accuracy increases. However, neither the univariate DT nor the multivariate DT performed as well as the ANN or ML classifiers with high-dimensional data.