Similar Documents
20 similar documents found (search time: 15 ms)
1.
Costs are often an important part of the classification process, and cost factors have been taken into consideration in many previous studies of decision tree models. In this study, we also consider a cost-sensitive decision tree construction problem. We assume that test costs must be paid to obtain the values of condition attributes and that a record must be classified without exceeding a spending-cost threshold. Unlike previous studies, however, in which records were classified using only a single condition attribute, in this study we classify records using multiple condition attributes simultaneously. An algorithm is developed to build such a cost-constrained decision tree. The experimental results show that our algorithm satisfactorily handles data with multiple condition attributes under different cost constraints.
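The cost-threshold idea can be illustrated with a minimal sketch (not the authors' algorithm): greedily choose attribute tests by gain-per-cost ratio until the spending threshold would be exceeded. All attribute names, gains, and costs below are hypothetical.

```python
def select_tests(gains, costs, budget):
    """Greedy knapsack-style selection: pick attribute tests by
    gain-per-cost ratio until the cost budget is exhausted."""
    ranked = sorted(gains, key=lambda a: gains[a] / costs[a], reverse=True)
    chosen, spent = [], 0.0
    for attr in ranked:
        if spent + costs[attr] <= budget:
            chosen.append(attr)
            spent += costs[attr]
    return chosen, spent

# Hypothetical per-attribute information gains and test costs.
gains = {"blood_test": 0.40, "x_ray": 0.35, "biopsy": 0.55}
costs = {"blood_test": 10.0, "x_ray": 40.0, "biopsy": 200.0}
print(select_tests(gains, costs, budget=60.0))
```

The greedy ratio rule is only a heuristic; the point is that a hard budget can rule out the individually most informative test (here the hypothetical biopsy) in favour of several cheaper ones.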

2.
This paper proposes a complete framework to assess the overall performance of classification models from a user perspective in terms of accuracy, comprehensibility, and justifiability. A review is provided of accuracy and comprehensibility measures, and a novel metric is introduced that allows one to measure the justifiability of classification models. Furthermore, a taxonomy of domain constraints is introduced, and an overview is presented of existing approaches to imposing constraints and including domain knowledge in data mining techniques. Finally, the justifiability metric is applied to a credit scoring case and a customer churn prediction case.

3.
Extracting decision trees from trained neural networks   (cited by: 4; self-citations: 0; other citations: 4)
In this paper we present a methodology for extracting decision trees from input data generated by querying trained neural networks, rather than directly from the original data. A genetic algorithm is used to query the trained network and extract prototypes. A prototype selection mechanism is then used to select a subset of the prototypes. Finally, a standard induction method such as ID3 or C5.0 is used to extract the decision tree. The extracted decision trees can be used to understand the workings of the neural network in addition to performing classification. This method is able to extract different decision trees of high accuracy and comprehensibility from the trained neural network.
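A stripped-down version of this pipeline can be sketched as follows: treat any black-box labeling function as the trained network, replace the genetic-algorithm querying with a simple enumeration of prototypes, and induce a tree with a tiny ID3. The oracle and attribute names are invented for illustration; this is a sketch of the idea, not the paper's method.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attrs):
    """Tiny ID3: returns a nested dict tree over categorical attributes."""
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    def gain(a):
        total = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            total -= len(sub) / len(labels) * entropy(sub)
        return total
    best = max(attrs, key=gain)
    tree = {}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        tree[(best, v)] = id3(sub_rows, sub_labels,
                              [a for a in attrs if a != best])
    return tree

# Stand-in "trained network": any black-box labeling function works here.
oracle = lambda r: "yes" if r["temp"] == "high" and r["humid"] == "low" else "no"

# Query the oracle on generated prototypes, then induce a tree from its answers.
prototypes = [{"temp": t, "humid": h} for t in ("high", "low")
              for h in ("high", "low")]
answers = [oracle(r) for r in prototypes]
tree = id3(prototypes, answers, ["temp", "humid"])
print(tree)
```

The key property the paper exploits is that the tree is fit to the network's answers, not to the raw training data, so it approximates what the network actually learned.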

4.
We previously proposed a decision tree classifier named MMC (multi-valued and multi-labeled classifier), known for its capability of classifying large multi-valued and multi-labeled data sets. Aiming to improve the accuracy of MMC, this paper develops another classifier named MMDT (multi-valued and multi-labeled decision tree). MMDT differs from MMC mainly in attribute selection. MMC attempts to split a node into child nodes whose records approach the same multiple labels; it measures the average similarity of the labels in each child node to determine the goodness of each splitting attribute. MMDT, in contrast, uses a measuring strategy that considers not only the average similarity of the labels in each child node but also their average appropriateness. The new strategy takes a scoring approach to obtain a look-ahead measure of each splitting attribute's contribution to accuracy. The experimental results show that MMDT improves on the accuracy of MMC.

5.
This paper deals with improvements to rule induction algorithms that resolve ties arising in special cases during the rule generation procedure for specific training data sets; the improvements are demonstrated by experimental results on various data sets. A tie occurs in a decision tree induction algorithm when the class prediction at a leaf node cannot be determined by majority voting. When there is such a conflict at a leaf node, we need to find the source of the problem and a solution to it. In this paper, we propose calculating an Influence factor for each attribute, and we suggest an update procedure for the decision tree, together with subsequent rectification steps, to deal with the problem.
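The leaf-node tie the paper addresses is easy to reproduce. The sketch below detects a tied majority vote and falls back to the parent node's majority class, a deliberately simpler rectification than the Influence-factor procedure proposed in the paper.

```python
from collections import Counter

def leaf_prediction(labels, parent_majority=None):
    """Majority vote at a leaf; when the top classes tie, the vote is
    undecidable, so fall back to the parent node's majority class.
    Returns (predicted_class, tie_detected)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return parent_majority, True   # tie: majority voting fails here
    return counts[0][0], False

print(leaf_prediction(["A", "A", "B"]))                    # clear majority
print(leaf_prediction(["A", "B"], parent_majority="A"))    # tied vote
```

In a real inducer the `tie_detected` flag would trigger the paper's rectification step instead of the parent fallback used here.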

6.
This paper contributes to the conceptualisation and analysis of double-sided matching problems, taking the land-use planning problem as an example. It does so by introducing functional classification theory at the knowledge level, the symbol level, and the system level of a DSS. This theory explicitly expresses the methodological viewpoint of relational realism. At the knowledge level, this implies defining knowledge on the basis of matching the intension and extension of concepts. At the symbol level, it deals with knowledge representation; here, decision tables are advanced and formally introduced. At the system level, the formalism used at the symbol level is implemented to develop a relational matching DSS.

7.
Culverts are important components of a roadway and should be properly maintained to ensure adequate road surface drainage and public safety. Culvert maintenance relies heavily on culvert inspection, which is time consuming and requires a large number of skilled labor hours. Currently, State Departments of Transportation use rigid methods for scheduling culvert inspection based on one or two factors, such as culvert size and/or condition. The objective of the research described in this paper is to develop a more intelligent scheduling system for culvert inspection that improves the utilization of limited resources. The proposed intelligent system first predicts the conditions of the culverts due for inspection in a given year and then, based on the prediction results, schedules inspections only for those predicted to be in poor condition. The prediction models use a decision tree algorithm together with the Synthetic Minority Over-sampling Technique (SMOTE) to deal with the highly imbalanced data in the culvert inventory database. The case study presented in the paper used 12,400 culvert records from the Ohio Department of Transportation to train and test the prediction models. The developed prediction models achieved accuracies of over 80% on the training set and 75% on the testing set, with satisfactory areas under the curve of about 0.8. The case study concluded that implementing the proposed intelligent culvert inspection scheduling system reduces the number of culverts needing inspection by 44%. Implementation of the proposed system could assist state and local agencies in prioritizing inspection of culverts needing attention while maximizing the use of limited resources. While this study is applied to culverts in Ohio, the proposed framework can be used on any similarly available culvert data set worldwide. The paper ends by providing suggestions to improve the quality of the data in culvert inventory databases.
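SMOTE itself is straightforward to sketch: generate synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbours. The toy implementation below is a minimal illustration of that idea, not the implementation used in the study; the points are invented.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Minimal SMOTE sketch: create synthetic minority points by
    interpolating a sampled point toward one of its k nearest
    minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        # k nearest minority neighbours of p (excluding p itself)
        neighbours = sorted(
            (q for q in minority if q is not p),
            key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))[:k]
        q = rng.choice(neighbours)
        t = rng.random()
        # Synthetic point lies on the segment between p and q.
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
new_points = smote(minority, n_new=4)
print(new_points)
```

Because every synthetic point is a convex combination of two real minority points, the oversampled region stays inside the minority neighbourhood rather than duplicating records verbatim.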

8.
Shuyu, Zhongying. 《Knowledge》, 2006, 19(8): 675-680
This paper proposes an improved decision tree method for web information retrieval with self-map attributes. Our self-map tree holds a value of a self-map attribute in each internal node, with information based on the dissimilarity between a pair of map sequences. Our method selects the self-maps that exist between data items by exhaustive search based on relation and attribute information. Experimental results confirm that our improved method constructs comprehensible and accurate decision trees. Moreover, an example shows that our self-map decision tree is promising for data mining and knowledge discovery.

9.
In recent years, classification learning for data streams has become an important and active research topic. A major challenge posed by data streams is that their underlying concepts can change over time, which requires current classifiers to be revised accordingly and in a timely manner. To detect concept change, a common methodology is to observe the online classification accuracy: if accuracy drops below some threshold value, a concept change is deemed to have taken place. An implicit assumption behind this methodology is that any drop in classification accuracy can be interpreted as a symptom of concept change. Unfortunately, however, this assumption is often violated in the real world, where data streams carry noise that can also significantly reduce classification accuracy. To compound the problem, traditional noise-cleansing methods are not suited to data streams: they normally need to scan the data multiple times, whereas learning from data streams can afford only a one-pass scan because of the data's high speed and huge volume. Another open problem in data stream classification is how to deal with missing values. When new instances containing missing values arrive, how a learning model classifies them and how it updates itself according to them are issues whose solutions are far from explored. To solve these problems, this paper proposes a novel classification algorithm, flexible decision tree (FlexDT), which extends fuzzy logic to data stream classification. The advantages are three-fold. First, FlexDT offers a flexible structure to handle concept change effectively and efficiently. Second, FlexDT is robust to noise; hence it can prevent noise from interfering with classification accuracy, and an accuracy drop can be safely attributed to concept change. Third, it deals with missing values in an elegant way.
Extensive evaluations are conducted to compare FlexDT with representative existing data stream classification algorithms, using a large suite of data streams and various statistical tests. Experimental results suggest that FlexDT offers a significant benefit to data stream classification in real-world scenarios where concept change, noise, and missing values coexist.
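The classical accuracy-drop detector that FlexDT improves on can be sketched in a few lines: monitor accuracy over a sliding window and flag a possible concept change when it falls below a threshold. Window size and threshold below are illustrative; as the abstract notes, noise trips this detector just as easily as genuine drift.

```python
from collections import deque

class DriftMonitor:
    """Sliding-window accuracy monitor: flags a possible concept change
    when windowed accuracy falls below a threshold (the classical
    detector, without FlexDT's noise robustness)."""
    def __init__(self, window=50, threshold=0.7):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def update(self, correct):
        """Record one prediction outcome; return True if drift is flagged."""
        self.results.append(1 if correct else 0)
        acc = sum(self.results) / len(self.results)
        return len(self.results) == self.results.maxlen and acc < self.threshold

m = DriftMonitor(window=10, threshold=0.7)
# Simulated stream: predictions are correct until step 12, then all wrong.
flags = [m.update(i < 12) for i in range(20)]
print(flags.index(True))
```

Note the detection lag: the flag fires only once enough post-change errors have filled the window, which is the timeliness/stability trade-off inherent in window-based monitoring.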

10.
Choice of a classification algorithm is generally based on a number of factors, among which are availability of software, ease of use, and performance, measured here by overall classification accuracy. The maximum likelihood (ML) procedure is, for many users, the algorithm of choice because of its ready availability and the fact that it does not require an extended training process. Artificial neural networks (ANNs) are now widely used by researchers, but their operational applications are hindered by the need for the user to specify the configuration of the network architecture and to provide values for a number of parameters, both of which affect performance. The ANN also requires an extended training phase.
In the past few years, the use of decision trees (DTs) to classify remotely sensed data has increased. Proponents of the method claim that it has a number of advantages over the ML and ANN algorithms: the DT is computationally fast, makes no statistical assumptions, and can handle data represented on different measurement scales. Software to implement DTs is readily available over the Internet. Pruning can make DTs smaller and more easily interpretable, while the use of boosting techniques can improve performance.
In this study, separate test and training data sets from two different geographical areas and two different sensors—multispectral Landsat ETM+ and hyperspectral DAIS—are used to evaluate the performance of univariate and multivariate DTs for land cover classification. Factors considered are the effects of variations in training data set size and of the dimensionality of the feature space, together with the impact of boosting, attribute selection measures, and pruning. The level of classification accuracy achieved by the DT is compared to results from back-propagating ANN and ML classifiers.
Our results indicate that the performance of the univariate DT is acceptably good in comparison with that of the other classifiers, except with high-dimensional data. Classification accuracy increases linearly with training data set size up to a limit of 300 pixels per class in this case. Multivariate DTs do not appear to perform better than univariate DTs. While boosting produces an increase in classification accuracy of between 3% and 6%, the use of attribute selection methods does not appear to be justified in terms of accuracy increases. However, neither the univariate DT nor the multivariate DT performed as well as the ANN or ML classifiers with high-dimensional data.

11.
As two classical measures, approximation accuracy and consistency degree can be employed to evaluate the decision performance of a decision table. However, these two measures cannot give an elaborate depiction of the certainty and consistency of a decision table when their values are equal to zero. To overcome this shortcoming, we first classify decision tables in rough set theory into three types according to their consistency and introduce three new measures for evaluating the decision performance of a decision-rule set extracted from a decision table. We then analyze how each of these three measures depends on the condition granulation and decision granulation of each of the three types of decision tables. Experimental analyses on three practical data sets show that the three new measures are well suited for evaluating the decision performance of a decision-rule set and are much better than the two classical measures.
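The classical approximation-accuracy measure criticized here can be computed directly from a decision table. The sketch below implements Pawlak's accuracy of approximation of a classification (total lower-approximation size over total upper-approximation size across the decision classes); the toy table and attribute names are invented for illustration.

```python
from collections import defaultdict

def approximation_accuracy(rows, condition_attrs, decision_attr):
    """Pawlak's approximation accuracy for a decision table given as a
    list of dicts: sum |lower(X)| / sum |upper(X)| over decision classes."""
    # Equivalence classes (indiscernibility blocks) induced by the
    # condition attributes.
    blocks = defaultdict(list)
    for r in rows:
        blocks[tuple(r[a] for a in condition_attrs)].append(r)
    lower = upper = 0
    for d in set(r[decision_attr] for r in rows):
        for block in blocks.values():
            hits = sum(1 for r in block if r[decision_attr] == d)
            if hits == len(block):
                lower += len(block)   # block entirely inside class d
            if hits > 0:
                upper += len(block)   # block intersects class d
    return lower / upper

table = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "no",  "play": "no"},   # conflicts with row 1
    {"outlook": "rain",  "windy": "yes", "play": "no"},
]
print(approximation_accuracy(table, ["outlook", "windy"], "play"))
```

The conflicting pair of rows shows the weakness the paper targets: inconsistency drags the measure down as a single number, without saying anything finer about where or why the table is uncertain.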

12.
An effective incident information management system needs to deal with several challenges. It must support heterogeneous distributed incident data, allow decision makers (DMs) to detect anomalies and extract useful knowledge, assist DMs in evaluating risks and selecting an appropriate alternative during an incident, and provide differentiated services to satisfy the requirements of different incident management phases. To address these challenges, this paper proposes an incident information management framework that consists of three major components. The first component is a high-level data integration module in which heterogeneous data sources are integrated and presented in a uniform format. The second component is a data mining module that uses data mining methods to identify useful patterns and presents a process for providing differentiated services for pre-incident and post-incident information management. The third component is a multi-criteria decision-making (MCDM) module that utilizes MCDM methods to assess the current situation, find satisfactory solutions, and take appropriate responses in a timely manner. To validate the proposed framework, this paper conducts a case study on agrometeorological disasters that occurred in China between 1997 and 2001. The case study demonstrates that the combination of data mining and MCDM methods can provide objective and comprehensive assessments of incident risks.

13.
The main objective of the present paper is to characterize smoking behavior among older adults by assessing psychological distress, physical health status, alcohol use, and demographic variables in relation to current smoking. We targeted 466 American smokers 65 years of age or older from the 2006 National Survey on Drug Use and Health (NSDUH, 2006). We employed a decision tree algorithm to conduct a classification analysis relating these variables to the average number of cigarettes used per day. The results showed that the most important explanatory variable for predicting the average number of cigarettes used per day is the age at which the respondent first started smoking cigarettes every day, followed by education level and psychological distress. These results suggest that social workers need to provide more customized and individualized interventions to older adults.

14.
Extracting classification rules from data is an important task of data mining that has gained considerably more attention in recent years. In this paper, a new meta-heuristic algorithm called TACO-miner is proposed for rule extraction from artificial neural networks (ANNs). The proposed rule extraction algorithm works on trained ANNs to discover the hidden knowledge available in the form of connection weights within the ANN structure. The algorithm is based on a meta-heuristic known as touring ant colony optimization (TACO) and consists of a two-step hierarchical structure. It is experimentally evaluated on six binary and n-ary classification benchmark data sets. Results of the comparative study show that TACO-miner is able to discover accurate and concise classification rules.

15.
Decision tree mining techniques and development trends   (cited by: 18; self-citations: 0; other citations: 18)
This paper introduces the main content and latest applications of decision tree mining techniques, compares decision tree growing and pruning algorithms, and points out research directions for decision tree mining.

16.
This paper proposes an expert system called VIBEX (VIBration EXpert) to aid plant operators in diagnosing the causes of abnormal vibration in rotating machinery. To automate the diagnosis, a decision table based on a cause-symptom matrix is used as a probabilistic method for diagnosing abnormal vibration. A decision tree is also used to acquire structured knowledge in the form of concepts and to build the knowledge base that is indispensable for a vibration expert system. The decision tree is a technique for building knowledge-based systems by inductive inference from examples, and it serves as a vibration diagnostic tool in its own right. The proposed system has been successfully implemented in the Microsoft Windows environment and is written in Microsoft Visual Basic and Visual C++. To validate the system's performance, the diagnostic system was tested on examples using the two diagnostic methods.

17.
Decision tree (DT) induction is among the more popular data mining techniques. An important component of DT induction algorithms is the splitting method, with the most commonly used methods being based on the Conditional Entropy (CE) family. However, it is well known that no single splitting method gives the best performance for all problem instances. In this paper we explore the relative performance of the Conditional Entropy family and another family based on the Class-Attribute Mutual Information (CAMI) measure. Our results suggest that while some datasets are insensitive to the choice of splitting method, other datasets are very sensitive to it. For example, some of the CAMI family methods may be more appropriate than the popular Gain Ratio (GR) method for datasets with nominal predictor attributes, and are competitive with the GR method for datasets where all predictor attributes are numeric. Given that it is never known beforehand which splitting method will lead to the best DT for a given dataset, and given the relatively good performance of the CAMI methods, it seems appropriate to suggest that splitting methods from the CAMI family be included in data mining toolsets.
Kweku-Muata Osei-Bryson is Professor of Information Systems at Virginia Commonwealth University, where he also served as Coordinator of the Ph.D. program in Information Systems during 2001-2003. Previously he was Professor of Information Systems and Decision Analysis in the School of Business at Howard University, Washington, DC, U.S.A. He has also worked as an Information Systems practitioner in both industry and government. He holds a Ph.D. in Applied Mathematics (Management Science & Information Systems) from the University of Maryland at College Park, an M.S. in Systems Engineering from Howard University, and a B.Sc. in Natural Sciences from the University of the West Indies at Mona.
He currently does research in various areas including Data Mining, Expert Systems, Decision Support Systems, Group Support Systems, Information Systems Outsourcing, and Multi-Criteria Decision Analysis. His papers have been published in various journals including Information & Management, Information Systems Journal, Information Systems Frontiers, Business Process Management Journal, International Journal of Intelligent Systems, IEEE Transactions on Knowledge & Data Engineering, Data & Knowledge Engineering, Information & Software Technology, Decision Support Systems, Information Processing and Management, Computers & Operations Research, European Journal of Operational Research, Journal of the Operational Research Society, Journal of the Association for Information Systems, Journal of Multi-Criteria Decision Analysis, and Applications of Management Science. He currently serves as an Associate Editor of the INFORMS Journal on Computing and is a member of the Editorial Board of the Computers & Operations Research journal. Kendall E. Giles received the BS degree in Electrical Engineering from Virginia Tech in 1991, the MS degree in Electrical Engineering from Purdue University in 1993, the MS degree in Information Systems from Virginia Commonwealth University in 2002, and the MS degree in Computer Science from Johns Hopkins University in 2004. He is currently a PhD student (ABD) in Computer Science at Johns Hopkins and a Research Assistant in the Applied Mathematics and Statistics department. He has over 15 years of work experience in industry, government, and academic institutions. His research interests can be partially summarized by the following keywords: network security, mathematical modeling, pattern classification, and high-dimensional data analysis.
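The CE-family baseline the paper compares against can be made concrete. The snippet below computes information gain and C4.5's gain ratio for a toy decision table; the CAMI measures themselves are not reproduced here, and the data and attribute names are invented. The unique `id` attribute illustrates why gain ratio exists: plain information gain rewards many-valued attributes, while the split-information denominator penalizes them.

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def info_gain(rows, labels, attr):
    """Conditional-entropy information gain of splitting on attr."""
    return entropy(labels) - sum(
        len(sub) / len(labels) * entropy(sub)
        for v in set(r[attr] for r in rows)
        for sub in [[l for r, l in zip(rows, labels) if r[attr] == v]])

def gain_ratio(rows, labels, attr):
    """C4.5's gain ratio: information gain normalized by split information."""
    split_info = entropy([r[attr] for r in rows])
    return info_gain(rows, labels, attr) / split_info if split_info else 0.0

rows = [{"id": i, "size": s} for i, s in enumerate("SSLL")]
labels = ["neg", "neg", "pos", "pos"]
# "id" is unique per row: maximal gain, but heavily penalised by gain ratio.
print(info_gain(rows, labels, "id"), gain_ratio(rows, labels, "id"))
print(info_gain(rows, labels, "size"), gain_ratio(rows, labels, "size"))
```

Both attributes achieve the same information gain on this table, yet gain ratio prefers `size`, which is exactly the kind of measure-dependent divergence the paper studies.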

18.
Classification in imbalanced domains is a recent challenge in data mining. We refer to imbalanced classification when the data present many examples from one class and few from the other, and the less represented class is the one of greater interest from the point of view of the learning task. One of the most widely used techniques to tackle this problem consists in preprocessing the data prior to the learning process. This preprocessing can be done through under-sampling, which removes examples mainly belonging to the majority class, or through over-sampling, which replicates or generates new minority examples. In this paper, we propose an under-sampling procedure guided by evolutionary algorithms that performs training set selection to enhance the decision trees obtained by the C4.5 algorithm and the rule sets obtained by the PART rule induction algorithm. The proposal has been compared with other under-sampling and over-sampling techniques, and the results indicate that the new approach is very competitive in terms of accuracy with over-sampling and outperforms standard under-sampling. Moreover, the obtained models are smaller in terms of the number of leaves or rules generated and can be considered more interpretable. The results have been contrasted through non-parametric statistical tests over multiple data sets.
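The standard under-sampling baseline that the evolutionary approach improves on is easy to sketch: keep every minority example and a random, equal-sized subset of the majority class. The paper replaces this blind random choice with an evolutionary search over majority subsets; the sketch below shows only the baseline, with invented data.

```python
import random
from collections import Counter

def undersample(rows, labels, seed=0):
    """Random under-sampling sketch: keep all minority examples and an
    equal-sized random subset of the majority class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    keep = [(r, l) for r, l in zip(rows, labels) if l == minority]
    majority_pool = [(r, l) for r, l in zip(rows, labels) if l != minority]
    kept = keep + rng.sample(majority_pool, len(keep))
    rng.shuffle(kept)
    return [r for r, _ in kept], [l for _, l in kept]

rows = list(range(100))
labels = ["maj"] * 90 + ["min"] * 10
x, y = undersample(rows, labels)
print(Counter(y))
```

The obvious weakness, and the motivation for a guided search, is that a random majority subset may discard exactly the examples that define the decision boundary.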

19.
A comparative study of Lazy and Eager classification algorithms   (cited by: 1; self-citations: 1; other citations: 0)
The two high-level goals of data mining are prediction and description, and classification algorithms are used very widely in this process. In machine learning, classification algorithms can be divided into Lazy and Eager types, each with its own characteristics. Based on experiments, this paper analyzes these two types of classification algorithms and summarizes the conditions under which each type is suitable, aiming to offer practical, experience-based conclusions for algorithm selection.

20.
A preliminary data-warehouse-based decision support system framework for enterprise financial management   (cited by: 2; self-citations: 0; other citations: 2)
Based on the Internet environment and applying data warehouse and data mining technology, this paper constructs a decision support system framework for enterprise financial management and proposes the system's architecture.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号