Similar literature
20 similar documents found (search time: 0 ms)
1.
Decision tree regression for soft classification of remote sensing data   Total citations: 1, self-citations: 0, citations by others: 1
In recent years, decision tree classifiers have been successfully used for land cover classification from remote sensing data. Their implementation as a per-pixel classifier to produce hard or crisp classification has been reported in the literature. Remote sensing images, particularly at coarse spatial resolutions, are contaminated with mixed pixels that contain more than one class on the ground. The per-pixel approach may result in erroneous classification of images dominated by mixed pixels. Therefore, soft classification approaches that decompose the pixel into its class constituents in the form of class proportions have been advocated. In this paper, we employ a decision tree regression approach to determine class proportions within a pixel so as to produce soft classification from remote sensing data. The classification accuracy achieved by decision tree regression is compared with that achieved by the most widely used maximum likelihood classifier, implemented in the soft mode, and by a supervised version of the fuzzy c-means classifier. Root Mean Square Error (RMSE) and fuzzy error matrix based measures have been used for accuracy assessment of soft classification.
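As an illustration of the RMSE measure named in this abstract, the following Python sketch compares predicted and reference class proportions. The abstract does not state the exact formula, so pooling the squared errors over all pixels and classes is an assumption here, and the function name `soft_rmse` and the sample proportions are hypothetical.

```python
import math

def soft_rmse(predicted, reference):
    """RMSE between predicted and reference class proportions.

    Both arguments are lists of per-pixel proportion lists with the same
    shape. Errors are pooled over all pixels and classes (an assumption).
    """
    squared_error = 0.0
    count = 0
    for pred_pixel, ref_pixel in zip(predicted, reference):
        for p, r in zip(pred_pixel, ref_pixel):
            squared_error += (p - r) ** 2
            count += 1
    return math.sqrt(squared_error / count)

# Hypothetical two-class example: the first pixel is predicted exactly,
# the second pixel is assigned entirely to the wrong class.
predicted = [[0.5, 0.5], [1.0, 0.0]]
reference = [[0.5, 0.5], [0.0, 1.0]]
rmse = soft_rmse(predicted, reference)
print(round(rmse, 4))
```

A per-class RMSE, if that is what the paper reports, would apply the same computation to one column of proportions at a time.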

2.
Costs are often an important part of the classification process. Cost factors have been taken into consideration in many previous studies regarding decision tree models. In this study, we also consider a cost-sensitive decision tree construction problem. We assume that there are test costs that must be paid to obtain the values of the decision attribute and that a record must be classified without exceeding the spending cost threshold. Unlike previous studies, however, in which records were classified with only a single condition attribute, in this study, we are able to simultaneously classify records with multiple condition attributes. An algorithm is developed to build a cost-constrained decision tree, which allows us to simultaneously classify multiple condition attributes. The experimental results show that our algorithm satisfactorily handles data with multiple condition attributes under different cost constraints.

3.
Choice of a classification algorithm is generally based upon a number of factors, among which are availability of software, ease of use, and performance, measured here by overall classification accuracy. The maximum likelihood (ML) procedure is, for many users, the algorithm of choice because of its ready availability and the fact that it does not require an extended training process. Artificial neural networks (ANNs) are now widely used by researchers, but their operational applications are hindered by the need for the user to specify the configuration of the network architecture and to provide values for a number of parameters, both of which affect performance. The ANN also requires an extended training phase.
In the past few years, the use of decision trees (DTs) to classify remotely sensed data has increased. Proponents of the method claim that it has a number of advantages over the ML and ANN algorithms. The DT is computationally fast, makes no statistical assumptions, and can handle data that are represented on different measurement scales. Software to implement DTs is readily available over the Internet. Pruning of DTs can make them smaller and more easily interpretable, while the use of boosting techniques can improve performance.
In this study, separate test and training data sets from two different geographical areas and two different sensors (multispectral Landsat ETM+ and hyperspectral DAIS) are used to evaluate the performance of univariate and multivariate DTs for land cover classification. Factors considered are: the effects of variations in training data set size and of the dimensionality of the feature space, together with the impact of boosting, attribute selection measures, and pruning. The level of classification accuracy achieved by the DT is compared to results from back-propagating ANN and the ML classifiers.
Our results indicate that the performance of the univariate DT is acceptably good in comparison with that of other classifiers, except with high-dimensional data. Classification accuracy increases linearly with training data set size to a limit of 300 pixels per class in this case. Multivariate DTs do not appear to perform better than univariate DTs. While boosting produces an increase in classification accuracy of between 3% and 6%, the use of attribute selection methods does not appear to be justified in terms of accuracy increases. However, neither the univariate DT nor the multivariate DT performed as well as the ANN or ML classifiers with high-dimensional data.

4.
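A data stream classification method that adapts to concept drift   Total citations: 1, self-citations: 0, citations by others: 1
Most current data stream classification methods assume that the data distribution is stationary, ignoring the fact that real data may undergo latent concept changes over time, which can reduce the predictive accuracy of the classification model. Based on the characteristics of data streams, this paper proposes an online classification algorithm that can detect and adapt to concept drift. Experiments show that it automatically adjusts the training window and the number of new samples used during model rebuilding according to the current state of concept drift.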

5.
Database classification suffers from two well-known difficulties: high dimensionality and non-stationary variations within large historic data. This paper presents a hybrid classification model that integrates a case-based reasoning technique, a fuzzy decision tree (FDT), and genetic algorithms (GAs) to construct a decision-making system for data classification in various database applications. The model is mainly based on the idea that the historic database can be transformed into a smaller case base together with a group of fuzzy decision rules. As a result, the model can respond more accurately to the data currently being classified, using the inductions from these smaller case-based fuzzy decision trees. Hit rate is used as the performance measure, and the effectiveness of the proposed model is demonstrated experimentally against other approaches on different database classification applications. The proposed model achieves the highest average hit rate among the approaches compared.

6.
Real-life datasets are often imbalanced, that is, there are significantly more training samples available for some classes than for others, and consequently the conventional aim of reducing overall classification error is not appropriate when dealing with such problems. Various approaches have been introduced in the literature to deal with imbalanced datasets, and are typically based on oversampling, undersampling or cost-sensitive classification. In this paper, we introduce an effective ensemble of cost-sensitive decision trees for imbalanced classification. Base classifiers are constructed according to a given cost matrix, but are trained on random feature subspaces to ensure sufficient diversity of the ensemble members. We employ an evolutionary algorithm for simultaneous classifier selection and assignment of committee member weights for the fusion process. Our proposed algorithm is evaluated on a variety of benchmark datasets, and is confirmed to lead to improved recognition of the minority class, to be capable of outperforming other state-of-the-art algorithms, and hence to represent a useful and effective approach for dealing with imbalanced datasets.

7.
This paper presents improvements to rule induction algorithms that resolve the ties that appear in special cases during rule generation for specific training data sets. These improvements are demonstrated by experimental results on various data sets. A tie occurs in a decision tree induction algorithm when the class prediction at a leaf node cannot be determined by majority voting. When there is a conflict in the leaf node, we need to find the source of the problem and its solution. In this paper, we propose calculating an influence factor for each attribute, and we suggest a procedure for updating the decision tree that deals with the problem and provides subsequent rectification steps.

8.
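A binary decision mechanism is incorporated into a fuzzy support vector machine classification system to classify images at the level of emotional semantics. The main difficulties are establishing the mapping from low-level image features to high-level emotional semantics, and choosing parameters appropriately. Multi-class classification is achieved by combining the classifier with a decision tree approach. Experimental results show that the system is simple, fast, and efficient for image emotion classification.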

9.
The application of machine learning techniques to functional Magnetic Resonance Imaging (fMRI) data has recently become an active field of research. One area, however, has not received due attention in the literature: the preparation of fMRI data for subsequent modelling. In this study we focus on the issue of synchronizing the stream of fMRI snapshots with the mental states of the subject, which is a form of smart filtering of the input data, performed prior to building a predictive model. We demonstrate, investigate and thoroughly discuss the negative effects of a lack of alignment between the two streams and propose an original data-driven approach to efficiently address this problem. Our solution involves casting the issue as a constrained optimization problem in combination with an alternative classification accuracy assessment scheme, which is applicable to both batch and on-line scenarios and is able to capture information distributed across a number of input samples, thereby lifting the common simplifying i.i.d. assumption. The proposed method is tested using real fMRI data and experimentally compared to the state-of-the-art ensemble models reported in the literature, outperforming them by a wide margin.

10.
Multivariable stream data is becoming increasingly common as diverse types of sensor devices and networks are deployed. Building accurate classification models for such data has attracted a lot of attention from the research community. Most of the previous works, however, relied on features extracted from individual streams, and did not take into account the dependency relations among the features within and across the streams. In this work, we propose new classification models that exploit temporal relations among features. We showed that consideration of such dependencies does significantly improve the classification accuracy. Another benefit of employing temporal relations is the improved interpretability of the resulting classification models, as the set of temporal relations can be easily translated to a rule using a sequence of inter-dependent events characterizing the class. We evaluated the proposed scheme using different classification models including the Naive Bayesian, TFIDF, and vector distance models. We showed that the proposed model can be a useful addition to the set of existing stream classification algorithms.

11.
An important objective of data mining is the development of predictive models. Based on a number of observations, a model is constructed that allows the analysts to provide classifications or predictions for new observations. Currently, most research focuses on improving the accuracy or precision of these models and comparatively little research has been undertaken to increase their comprehensibility to the analyst or end-user. This is mainly due to the subjective nature of ‘comprehensibility’, which depends on many factors outside the model, such as the user's experience and his/her prior knowledge. Despite this influence of the observer, some representation formats are generally considered to be more easily interpretable than others. In this paper, an empirical study is presented which investigates the suitability of a number of alternative representation formats for classification when interpretability is a key requirement. The formats under consideration are decision tables, (binary) decision trees, propositional rules, and oblique rules. An end-user experiment was designed to test the accuracy, response time, and answer confidence for a set of problem-solving tasks involving the former representations. Analysis of the results reveals that decision tables perform significantly better on all three criteria, while post-test voting also reveals a clear preference of users for decision tables in terms of ease of use.

12.
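王旅, 彭宏, 胡劲松. 《计算机工程与设计》 (Computer Engineering and Design), 2006, 27(11): 1929-1931, 1963
Data classification is a very important research topic in data mining, and decision tree induction is one of the most commonly used and widely applied data classification techniques. Engineering construction requires that engineering soils be classified and named. In geotechnical testing, classifying and naming soils is quite tedious, and the applicable classification standard differs with the soil's use or engineering properties, so manual soil classification and naming is error-prone. This paper applies decision tree induction to soil classification and naming, introduces the decision tree induction algorithm, builds a predictive model for soil classification by selecting the attribute with the highest information gain, and gives a concrete data classification example.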
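The attribute selection step mentioned in this abstract can be sketched in Python as an ID3-style choice of the attribute with the highest information gain. This is a minimal sketch and not the paper's implementation: the attribute names (`grain`, `plasticity`), the soil labels, and the four sample records are all hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting the records on one attribute."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attrs):
    """Attribute with the highest information gain (the split to make next)."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))

# Hypothetical soil samples; real attributes and classes would come from
# the applicable soil classification standard.
rows = [
    {"grain": "coarse", "plasticity": "low"},
    {"grain": "coarse", "plasticity": "high"},
    {"grain": "fine", "plasticity": "low"},
    {"grain": "fine", "plasticity": "high"},
]
labels = ["sand", "sand", "silt", "clay"]

gain_grain = information_gain(rows, labels, "grain")
gain_plasticity = information_gain(rows, labels, "plasticity")
best = best_attribute(rows, labels, ["grain", "plasticity"])
print(best, gain_grain, gain_plasticity)
```

In this toy data, splitting on `grain` isolates all the sand samples, so it gains more than splitting on `plasticity`; a full tree would repeat the same choice on each resulting subset.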

13.
14.
Numerous industrial and research databases include missing values. It is not uncommon to encounter databases that have up to a half of the entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data. A common way of dealing with this problem is to impute (fill-in) the missing values. This paper evaluates how the choice of different imputation methods affects the performance of classifiers that are subsequently used with the imputed data. The experiments here focus on discrete data. This paper studies the effect of missing data imputation using five single imputation methods (a mean method, a hot deck method, a Naïve-Bayes method, and the latter two methods with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on classification accuracy for six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naïve-Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy when compared to classification without imputation. Although the results show that there is no universally best imputation method, Naïve-Bayes imputation is shown to give the best results for the RIPPER classifier for datasets with a high amount (i.e., 40% and 50%) of missing data, polytomous regression imputation is shown to be the best for the support vector machine classifier with polynomial kernel, and the application of the imputation framework is shown to be superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of the imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for the mean imputation, reduce classification error for data with more than 10% of missing data.
Finally, some classifiers such as C4.5 and Naïve-Bayes were found to be missing data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers such as K-nearest-neighbor, SVMs and RIPPER benefit from the imputation.

15.
This paper reports the development of a decision tree algorithm to classify the surface soil freeze/thaw states. The algorithm uses SSM/I brightness temperatures recorded in the early morning. Three critical indices are used as classification criteria: the scattering index (SI), the 37 GHz vertical polarization brightness temperature (T37V), and the 19 GHz polarization difference (PD19). The thresholds of these criteria were obtained from samples of frozen soil, thawed soil, desert, and snow. The algorithm is capable of distinguishing between frozen soil, thawed soil, desert, and precipitation. In-situ 4-cm deep soil temperatures on the Qinghai-Tibetan Plateau were used to validate the classification results, and the average classification accuracy was found to be 87%. Regarding the misclassified pixels, about 40% and 73% of them appeared when the surface soil temperature ranged from −0.5 °C to 0.5 °C and from −2.0 °C to 2.0 °C, respectively, which means that most misclassifications occurred near the soil freezing point. In addition, misclassifications mainly occurred from April to May and September to October, the transition periods between warm and cold seasons. A grid-to-grid Kappa analysis was also conducted to evaluate the consistency between the map of the actual number of frozen days obtained using the decision tree classification algorithm and the reference map of geocryological regionalization and classification in China. The overall classification accuracy was 91.7%, and the Kappa index was 80.5%. The boundary between the frozen and thawed soil was consistent with the southern limit of seasonally frozen ground from the reference map. The statistics show that the maximum area of frozen soil is about 6.82 × 10^6 km^2 in late January, accounting for 69% of total Chinese land area.
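The overall accuracy and Kappa index reported in this abstract are both derived from a confusion matrix. The Python sketch below shows the standard Cohen's kappa computation; the 2×2 matrix is invented for illustration and is not the paper's data, and the function name is hypothetical.

```python
def overall_accuracy_and_kappa(matrix):
    """Overall accuracy and Cohen's kappa from a square confusion matrix.

    matrix[i][j] is the number of grid cells with reference class i that
    were assigned to class j.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Made-up counts: rows are reference classes (frozen, thawed),
# columns are the classes assigned by the classifier.
matrix = [[40, 10],
          [10, 40]]
accuracy, kappa = overall_accuracy_and_kappa(matrix)
print(round(accuracy, 3), round(kappa, 3))
```

Kappa is lower than the overall accuracy because it discounts the agreement that the marginal class frequencies alone would produce by chance.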

16.
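Research on parallel decision tree classification based on the SPRINT method   Total citations: 9, self-citations: 0, citations by others: 9
One of the biggest problems with decision tree techniques is that their computational complexity is proportional to the size of the training data, so building a decision tree on a large data set takes too long. Constructing decision trees in parallel is an effective way to address this problem. Based on the idea of synchronous tree construction, this paper analyzes the parallelism of the SPRINT method in detail and proposes directions for further research.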

17.
In this paper we introduce a method called CL.E.D.M. (CLassification through ELECTRE and Data Mining), that employs aspects of the methodological framework of the ELECTRE I outranking method, and aims at increasing the accuracy of existing data mining classification algorithms. In particular, the method chooses the best decision rules extracted from the training process of the data mining classification algorithms, and then it assigns the classes that correspond to these rules, to the objects that must be classified. Three well known data mining classification algorithms are tested in five different widely used databases to verify the robustness of the proposed method.

18.
Algorithms for the analysis of graph sequences are proposed in this paper. In particular, we study the problem of recovering missing information and predicting the occurrence of nodes and edges in time series of graphs. Two different recovery schemes are developed. The first scheme uses reference patterns that are extracted from a training set of graph sequences, while the second method is based on decision tree induction. Our work is motivated by applications in computer network analysis. However, the proposed recovery and prediction schemes are generic and can be applied in other domains as well.

19.
One type of hierarchical fuzzy-operator-based network implementation is investigated. In this approach, we generalize the Dombi operator as an effective component for decision analysis and decision making. This methodology provides several advantages because the input to each node is the evidence supplied by the degree of satisfaction of sub-criteria and the output is the aggregated evidence. Thus, the decision making process consists of aggregating and propagating the evidence information through such a hierarchical network. This trainable network is able to perceive and interpret complex decisions by using these transparent fuzzy models. This study examines the behavior of the fuzzy additive operator in more detail, and the results show that the proposed framework yields reliable decisions in the pattern classification domain.
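To make the aggregation step at a single node concrete, the Python sketch below implements the standard Dombi conjunctive operator (t-norm), 1 / (1 + (Σ ((1 − x_i) / x_i)^λ)^(1/λ)). The abstract does not specify the paper's generalized operator, so this is the textbook form only; the function name `dombi_and` and the parameter name `lam` are hypothetical.

```python
def dombi_and(values, lam=1.0):
    """Standard Dombi conjunction of membership degrees in [0, 1].

    lam > 0 controls how strict the conjunction is. This is the textbook
    operator, not the generalized form studied in the paper.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    if any(v < 0.0 or v > 1.0 for v in values):
        raise ValueError("membership degrees must lie in [0, 1]")
    if any(v == 0.0 for v in values):
        return 0.0  # one fully unsatisfied criterion forces the result to 0
    total = sum(((1.0 - v) / v) ** lam for v in values)
    return 1.0 / (1.0 + total ** (1.0 / lam))

# Evidence from two sub-criteria, each half satisfied.
half_half = dombi_and([0.5, 0.5])
# A fully satisfied criterion (1.0) acts as the neutral element.
with_neutral = dombi_and([1.0, 0.7])
print(half_half, with_neutral)
```

In a hierarchical network of the kind described, the output of one such node would serve as an input to a node in the next layer, and `lam` would be among the trainable parameters.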

20.
Decision tree (DT) induction is among the more popular of the data mining techniques. An important component of DT induction algorithms is the splitting method, with the most commonly used method being based on the Conditional Entropy (CE) family. However, it is well known that there is no single splitting method that will give the best performance for all problem instances. In this paper we explore the relative performance of the Conditional Entropy family and another family that is based on the Class-Attribute Mutual Information (CAMI) measure. Our results suggest that while some datasets are insensitive to the choice of splitting methods, other datasets are very sensitive to the choice of splitting methods. For example, some of the CAMI family methods may be more appropriate than the popular Gain Ratio (GR) method for datasets which have nominal predictor attributes, and are competitive with the GR method for those datasets where all predictor attributes are numeric. Given that it is never known beforehand which splitting method will lead to the best DT for a given dataset, and given the relatively good performance of the CAMI methods, it seems appropriate to suggest that splitting methods from the CAMI family should be included in data mining toolsets.
Kweku-Mauta Osei-Bryson is Professor of Information Systems at Virginia Commonwealth University, where he also served as the Coordinator of the Ph.D. program in Information Systems during 2001–2003. Previously he was Professor of Information Systems and Decision Analysis in the School of Business at Howard University, Washington, DC, U.S.A. He has also worked as an Information Systems practitioner in both industry and government. He holds a Ph.D. in Applied Mathematics (Management Science & Information Systems) from the University of Maryland at College Park, an M.S. in Systems Engineering from Howard University, and a B.Sc. in Natural Sciences from the University of the West Indies at Mona. He currently does research in various areas including: Data Mining, Expert Systems, Decision Support Systems, Group Support Systems, Information Systems Outsourcing, Multi-Criteria Decision Analysis. His papers have been published in various journals including: Information & Management, Information Systems Journal, Information Systems Frontiers, Business Process Management Journal, International Journal of Intelligent Systems, IEEE Transactions on Knowledge & Data Engineering, Data & Knowledge Engineering, Information & Software Technology, Decision Support Systems, Information Processing and Management, Computers & Operations Research, European Journal of Operational Research, Journal of the Operational Research Society, Journal of the Association for Information Systems, Journal of Multi-Criteria Decision Analysis, Applications of Management Science. Currently he serves as an Associate Editor of the INFORMS Journal on Computing, and is a member of the Editorial Board of the Computers & Operations Research journal.
Kendall E. Giles received the BS degree in Electrical Engineering from Virginia Tech in 1991, the MS degree in Electrical Engineering from Purdue University in 1993, the MS degree in Information Systems from Virginia Commonwealth University in 2002, and the MS degree in Computer Science from Johns Hopkins University in 2004. Currently he is a PhD student (ABD) in Computer Science at Johns Hopkins, and is a Research Assistant in the Applied Mathematics and Statistics department. He has over 15 years of work experience in industry, government, and academic institutions. His research interests can be partially summarized by the following keywords: network security, mathematical modeling, pattern classification, and high dimensional data analysis.
