Similar literature
20 similar documents found
1.
The software in modern systems has become too complex to make accurate predictions about its performance under different configurations. Real-time or even responsiveness requirements cannot be met because it is not possible to perform admission control for new or changing tasks if we cannot tell how their execution affects the other tasks already running. Previously, we proposed a resource-allocation middleware that manages the execution of tasks in a complex distributed system with real-time requirements. The middleware behavior can be modeled depending on the configuration of the tasks running, so that the performance of any given configuration can be calculated. This makes it possible to have admission control in such a system, but the model requires knowledge of run-time parameters. We propose using machine-learning algorithms to obtain the model parameters and to predict the system performance under any configuration, so that we can provide a full admission control mechanism for complex software systems. In this paper, we present such an admission control mechanism, measure its accuracy in estimating the parameters of the model, and evaluate its performance to determine its suitability for a real-time or responsive system.

2.
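Decision tree construction method based on dispersion degree   Cited by: 1 (self-citations: 0, citations by others: 1)
When constructing a decision tree, attribute selection affects the tree's classification accuracy. This paper discusses the limitations of the information-entropy-based method and the WMR method, and introduces the concept of the dispersion degree of the condition attribute set in an information system. Using this concept to select the splitting attribute during tree construction, it designs DSD, a dispersion-based decision tree construction algorithm. The DSD algorithm overcomes the limitations of the WMR method in practical applications. Experiments on UCI data sets show that the trees built by this method achieve accuracy close to that of the information-entropy-based method, with better time complexity.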
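
The information-entropy baseline that this abstract compares against can be illustrated with a minimal sketch. This is the textbook information-gain criterion, not the DSD algorithm (whose dispersion measure is not defined in the abstract); the function names and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning the rows on rows[i][attr]."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attrs):
    """Pick the splitting attribute with the highest information gain."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))

# Hypothetical toy data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
chosen = best_attribute(rows, labels, ["outlook", "windy"])
```

A dispersion-based method would replace `information_gain` with its own scoring function while keeping the same selection loop.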

3.
Decision trees have been widely used in data mining and machine learning as a comprehensible knowledge representation. While ant colony optimization (ACO) algorithms have been successfully applied to extract classification rules, decision tree induction with ACO algorithms remains an almost unexplored research area. In this paper we propose a novel ACO algorithm to induce decision trees, combining commonly used strategies from both traditional decision tree induction algorithms and ACO. The proposed algorithm is compared against three decision tree induction algorithms, namely C4.5, CART and cACDT, in 22 publicly available data sets. The results show that the predictive accuracy of the proposed algorithm is statistically significantly higher than the accuracy of both C4.5 and CART, which are well-known conventional algorithms for decision tree induction, and the accuracy of the ACO-based cACDT decision tree algorithm.

4.
Ranking problems have recently become an important research topic in the joint field of machine learning and information retrieval. This paper presents a new splitting rule that introduces a metric, i.e., an impurity measure, to construct decision trees for ranking tasks. We provide a theoretical basis and some intuitive explanations for the splitting rule. Our approach is also meaningful to collaborative filtering in the sense of dealing with categorical data and selecting relevant features. Experiments illustrate our ranking approach, and the results show that our algorithm outperforms both perceptron-based ranking and classification tree algorithms in terms of accuracy as well as speed.
Fen Xia

5.
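Improving the performance of decision tree algorithms with the repeated edited nearest neighbor method   Cited by: 4 (self-citations: 0, citations by others: 4)
Decision tree algorithms are easily affected by noise and overlapping regions in the training set. The repeated edited nearest neighbor method can eliminate noise in the sample set that satisfies certain preconditions, remove from overlapping regions the samples of classes with smaller posterior probability, and form boundaries between the classes that conform to the Bayes classification criterion. Using it to filter a suitable training set markedly reduces the size of the decision tree without harming classification accuracy, which helps make the tree more comprehensible and usable and thus improves its performance.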
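
The filtering step described here can be sketched as follows. This is a minimal sketch of one common variant (Wilson-style editing with a k-nearest-neighbor vote, repeated until no sample is removed); the paper's exact procedure and parameters are not given in the abstract, so `k`, `max_rounds`, and the toy data are assumptions.

```python
from collections import Counter

def knn_label(point, others, k):
    """Majority label among the k nearest neighbours (squared Euclidean distance)."""
    nearest = sorted(
        others, key=lambda s: sum((a - b) ** 2 for a, b in zip(point, s[0]))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def repeated_edit(samples, k=3, max_rounds=10):
    """Drop samples whose label disagrees with their k-NN vote; repeat until stable."""
    current = list(samples)
    for _ in range(max_rounds):
        if len(current) <= k:  # too few samples left to hold a meaningful vote
            break
        kept = [
            s for i, s in enumerate(current)
            if knn_label(s[0], current[:i] + current[i + 1:], k) == s[1]
        ]
        if len(kept) == len(current):  # nothing removed: the set is stable
            break
        current = kept
    return current

# Hypothetical toy data: two clean clusters plus one mislabeled point inside cluster "a".
samples = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"), ((6, 6), "b"),
    ((0.5, 0.5), "b"),
]
cleaned = repeated_edit(samples)
```

The cleaned set would then be passed to any decision tree learner in place of the raw training set.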

6.
The Transportation Security Administration provides airline security in the United States using a variety of measures including a computer-based passenger prescreening system. This paper develops Bayesian decision models of two prescreening systems: one that places ticketed passengers into two classifications (fly and no-fly), and a three-classification system that includes potential flight. Using a parameterized cost structure and the expected monetary value decision criterion, this paper develops optimal levels of undesirable personal characteristics that should place people into the various categories. The models are explored from both the government's perspective and the passenger's perspective.

7.
Costs are often an important part of the classification process. Cost factors have been taken into consideration in many previous studies regarding decision tree models. In this study, we also consider a cost-sensitive decision tree construction problem. We assume that there are test costs that must be paid to obtain the values of the decision attribute and that a record must be classified without exceeding the spending cost threshold. Unlike previous studies, however, in which records were classified with only a single condition attribute, in this study, we are able to simultaneously classify records with multiple condition attributes. An algorithm is developed to build a cost-constrained decision tree, which allows us to simultaneously classify multiple condition attributes. The experimental results show that our algorithm satisfactorily handles data with multiple condition attributes under different cost constraints.

8.
In recent years, classification learning for data streams has become an important and active research topic. A major challenge posed by data streams is that their underlying concepts can change over time, which requires current classifiers to be revised accordingly and in a timely manner. To detect concept change, a common methodology is to observe the online classification accuracy. If accuracy drops below some threshold value, a concept change is deemed to have taken place. An implicit assumption behind this methodology is that any drop in classification accuracy can be interpreted as a symptom of concept change. Unfortunately, this assumption is often violated in the real world, where data streams carry noise that can also introduce a significant reduction in classification accuracy. To compound this problem, traditional noise cleansing methods are ill-suited to data streams. Those methods normally need to scan data multiple times, whereas learning for data streams can only afford a one-pass scan because of the data's high speed and huge volume. Another open problem in data stream classification is how to deal with missing values. When new instances containing missing values arrive, how a learning model classifies them and how it updates itself according to them is an issue whose solution is far from being explored. To solve these problems, this paper proposes a novel classification algorithm, flexible decision tree (FlexDT), which extends fuzzy logic to data stream classification. The advantages are three-fold. First, FlexDT offers a flexible structure to effectively and efficiently handle concept change. Second, FlexDT is robust to noise. Hence it can prevent noise from interfering with classification accuracy, and an accuracy drop can be safely attributed to concept change. Third, it deals with missing values in an elegant way.
Extensive evaluations are conducted to compare FlexDT with representative existing data stream classification algorithms using a large suite of data streams and various statistical tests. Experimental results suggest that FlexDT offers a significant benefit to data stream classification in real-world scenarios where concept change, noise and missing values coexist.
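
The common detection methodology that this abstract criticizes (flagging concept change when online accuracy drops below a threshold) can be sketched as follows. This is not FlexDT; the class name, window size, and threshold are illustrative assumptions.

```python
from collections import deque

class AccuracyDriftMonitor:
    """Flags a possible concept change when windowed accuracy falls below a threshold."""

    def __init__(self, window=50, threshold=0.7):
        self.outcomes = deque(maxlen=window)  # most recent hit/miss outcomes only
        self.threshold = threshold

    def update(self, predicted, actual):
        """Record one prediction; return True if accuracy over a full window is too low."""
        self.outcomes.append(predicted == actual)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return sum(self.outcomes) / len(self.outcomes) < self.threshold

# Hypothetical stream: ten correct predictions followed by four wrong ones.
monitor = AccuracyDriftMonitor(window=10, threshold=0.7)
warmup_flags = [monitor.update("x", "x") for _ in range(10)]
error_flags = [monitor.update("x", "y") for _ in range(4)]
```

As the abstract points out, a monitor like this cannot distinguish a genuine concept change from a burst of label noise, since both lower the windowed accuracy.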

9.
Hybrid decision tree   Cited by: 6 (self-citations: 0, citations by others: 6)

10.
Decision tree (DT) induction is among the more popular of the data mining techniques. An important component of DT induction algorithms is the splitting method, with the most commonly used method being based on the Conditional Entropy (CE) family. However, it is well known that there is no single splitting method that will give the best performance for all problem instances. In this paper we explore the relative performance of the Conditional Entropy family and another family that is based on the Class-Attribute Mutual Information (CAMI) measure. Our results suggest that while some datasets are insensitive to the choice of splitting methods, other datasets are very sensitive to the choice of splitting methods. For example, some of the CAMI family methods may be more appropriate than the popular Gain Ratio (GR) method for datasets which have nominal predictor attributes, and are competitive with the GR method for those datasets where all predictor attributes are numeric. Given that it is never known beforehand which splitting method will lead to the best DT for a given dataset, and given the relatively good performance of the CAMI methods, it seems appropriate to suggest that splitting methods from the CAMI family should be included in data mining toolsets.
Kweku-Mauta Osei-Bryson is Professor of Information Systems at Virginia Commonwealth University, where he also served as the Coordinator of the Ph.D. program in Information Systems during 2001–2003. Previously he was Professor of Information Systems and Decision Analysis in the School of Business at Howard University, Washington, DC, U.S.A. He has also worked as an Information Systems practitioner in both industry and government. He holds a Ph.D. in Applied Mathematics (Management Science & Information Systems) from the University of Maryland at College Park, an M.S. in Systems Engineering from Howard University, and a B.Sc. in Natural Sciences from the University of the West Indies at Mona. He currently does research in various areas including: Data Mining, Expert Systems, Decision Support Systems, Group Support Systems, Information Systems Outsourcing, Multi-Criteria Decision Analysis. His papers have been published in various journals including: Information & Management, Information Systems Journal, Information Systems Frontiers, Business Process Management Journal, International Journal of Intelligent Systems, IEEE Transactions on Knowledge & Data Engineering, Data & Knowledge Engineering, Information & Software Technology, Decision Support Systems, Information Processing and Management, Computers & Operations Research, European Journal of Operational Research, Journal of the Operational Research Society, Journal of the Association for Information Systems, Journal of Multi-Criteria Decision Analysis, Applications of Management Science. Currently he serves as an Associate Editor of the INFORMS Journal on Computing, and is a member of the Editorial Board of the Computers & Operations Research journal.
Kendall E. Giles received the BS degree in Electrical Engineering from Virginia Tech in 1991, the MS degree in Electrical Engineering from Purdue University in 1993, the MS degree in Information Systems from Virginia Commonwealth University in 2002, and the MS degree in Computer Science from Johns Hopkins University in 2004. Currently he is a PhD student (ABD) in Computer Science at Johns Hopkins, and is a Research Assistant in the Applied Mathematics and Statistics department. He has over 15 years of work experience in industry, government, and academic institutions. His research interests can be partially summarized by the following keywords: network security, mathematical modeling, pattern classification, and high dimensional data analysis.

11.
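Research on feature selection in Web page recognition   Cited by: 26 (self-citations: 0, citations by others: 26)
This paper examines in depth two important problems concerning feature selection in Web page recognition. It proposes a new method for selecting descriptive features and compares it experimentally with three existing descriptive-feature methods, confirming its effectiveness. It also evaluates and compares the practical performance in Web page recognition of five representative discriminative-feature selection methods from text categorization, and finds that information gain and the statistical method select discriminative features best.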

12.
Data from many real-world applications can be high dimensional and features of such data are usually highly redundant. Identifying informative features has become an important step for data mining to not only circumvent the curse of dimensionality but to reduce the amount of data for processing. In this paper, we propose a novel feature selection method based on bee colony and gradient boosting decision tree aiming at addressing problems such as efficiency and informative quality of the selected features. Our method achieves global optimization of the inputs of the decision tree using the bee colony algorithm to identify the informative features. The method initializes the feature space spanned by the dataset. Less relevant features are suppressed according to the information they contribute to the decision making using an artificial bee colony algorithm. Experiments are conducted with two breast cancer datasets and six datasets from the public data repository. Experimental results demonstrate that the proposed method effectively reduces the dimensions of the dataset and achieves superior classification accuracy using the selected features.

13.
We evaluate the performance of two decision tree procedures and four Bayesian network classifiers as potential decision support systems in the cytodiagnosis of breast cancer. In order to test their performance thoroughly, we use two real-world databases containing 692 cases and 322 cases collected by a single observer and 19 observers, respectively. The results show that, in general, there are considerable differences in all tests (accuracy, sensitivity, specificity, PV+, PV− and ROC) when a specific classifier uses the single-observer dataset compared to those when this same classifier uses the multiple-observer dataset. These results suggest that different observers see different things: a problem known as interobserver variability. We graphically unveil such a problem by presenting the structures of the decision trees and Bayesian networks resultant from running both databases.

14.
Most of the research on machine learning-based real-time scheduling (RTS) systems has been aimed toward product constant mix environments. However, in a product mix variety manufacturing environment, the scheduling knowledge base (KB) is dynamic; therefore, it would be interesting to develop a procedure that would automatically modify the scheduling knowledge when important changes occur in the manufacturing system. All of the machine learning-based RTS systems (including a KB refinement mechanism) proposed in earlier studies periodically require the addition of new training samples and regeneration of new KBs. Hence, previous approaches investigating machine learning-based RTS systems have been confronted with the training data overflow problem and an increase in the scheduling KB building time, which are unsuitable for RTS control. The objective of this paper is to develop a KB class selection mechanism that can be supported in various product mix ratio environments. Hence, the RTS KB is developed by a two-level decision tree (DT) learning approach. First, a suitable scheduling KB class is selected. Then, for each KB class, the best (proper) dispatching rule is selected for the next scheduling period. Here, the proposed two-level DT RTS system comprises five key components: (1) training samples generation mechanism, (2) GA/DT-based feature selection mechanism, (3) building a KB class label by a two-level self-organizing map, (4) DT-based KB class selection module, and (5) DT-based dynamic dispatching rule selection module. The proposed two-level DT-based KB RTS system yields better system performance than that by a one-level DT-based RTS system and heuristic individual dispatching rules in a flexible manufacturing system under various performance criteria over a long period.

15.
Hakan. Pattern Recognition, 2007, 40(12): 3540-3551
Decision trees recursively partition the instance space by generating nodes that implement a decision function belonging to an a priori specified model class. Each decision may be univariate, linear or nonlinear. Alternatively, in omnivariate decision trees, one of the model types is dynamically selected by taking into account the complexity of the problem defined by the samples reaching that node. The selection is based on statistical tests where the most appropriate model type is selected as the one providing significantly better accuracy than others. In this study, we propose the use of model ensemble-based nodes where a multitude of models are considered for making decisions at each node. The ensemble members are generated by perturbing the model parameters and input attributes. Experiments conducted on several datasets and three model types indicate that the proposed approach achieves better classification accuracies compared to individual nodes, even in cases when only one model class is used in generating ensemble members.

16.
GA-based learning bias selection mechanism for real-time scheduling systems   Cited by: 1 (self-citations: 0, citations by others: 1)
The use of machine learning technologies to develop knowledge bases (KBs) for real-time scheduling (RTS) problems has produced encouraging results in recent research. However, few studies focus on how to select proper learning biases in the early development stage of the RTS system to enhance the generalization ability of the resulting KBs. The selected learning bias usually assumes a set of proper system features that are known in advance. Moreover, the machine learning algorithm for developing scheduling KBs is predetermined. The purpose of this study is to develop a genetic algorithm (GA)-based learning bias selection mechanism to determine an appropriate learning bias that includes the machine learning algorithm, feature subset, and learning parameters. Three machine learning algorithms are considered: the back propagation neural network (BPNN), C4.5 decision tree (DT) learning, and support vector machines (SVMs). The proposed GA-based learning bias selection mechanism can search for the best machine learning algorithm and simultaneously determine the optimal subset of features and the learning parameters used to build the RTS system KBs. In terms of the accuracy of prediction of unseen data under various performance criteria, it also offers better generalization ability as compared to the case where the learning bias selection mechanism is not used. Furthermore, the proposed approach to build RTS system KBs can improve the system performance as compared to other classifier KBs under various performance criteria over a long period.

17.
Real-life datasets are often imbalanced, that is, there are significantly more training samples available for some classes than for others, and consequently the conventional aim of maximizing overall classification accuracy is not appropriate when dealing with such problems. Various approaches have been introduced in the literature to deal with imbalanced datasets, and are typically based on oversampling, undersampling or cost-sensitive classification. In this paper, we introduce an effective ensemble of cost-sensitive decision trees for imbalanced classification. Base classifiers are constructed according to a given cost matrix, but are trained on random feature subspaces to ensure sufficient diversity of the ensemble members. We employ an evolutionary algorithm for simultaneous classifier selection and assignment of committee member weights for the fusion process. Our proposed algorithm is evaluated on a variety of benchmark datasets, and is confirmed to lead to improved recognition of the minority class, to be capable of outperforming other state-of-the-art algorithms, and hence to represent a useful and effective approach for dealing with imbalanced datasets.
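
The role of a cost matrix in cost-sensitive classification can be illustrated with a minimal sketch of the standard minimum-expected-cost decision rule. This is not the paper's ensemble or its evolutionary weighting; the function name, class names, probabilities, and costs are assumptions chosen for illustration.

```python
def min_expected_cost_class(probabilities, cost_matrix):
    """Return the prediction with the lowest expected misclassification cost.

    probabilities: {true_class: estimated P(true_class | x)}
    cost_matrix[true_class][predicted_class]: cost of that outcome
    """
    def expected_cost(predicted):
        return sum(p * cost_matrix[true][predicted] for true, p in probabilities.items())

    return min(probabilities, key=expected_cost)

# Hypothetical example: the classifier leans towards the majority class.
probabilities = {"majority": 0.8, "minority": 0.2}

# Uniform costs reproduce the usual "most probable class" rule.
uniform_costs = {
    "majority": {"majority": 0, "minority": 1},
    "minority": {"majority": 1, "minority": 0},
}
# Missing a minority sample is assumed to be ten times as costly as a false alarm.
skewed_costs = {
    "majority": {"majority": 0, "minority": 1},
    "minority": {"majority": 10, "minority": 0},
}

plain_choice = min_expected_cost_class(probabilities, uniform_costs)
cost_sensitive_choice = min_expected_cost_class(probabilities, skewed_costs)
```

With the skewed costs, predicting "majority" has expected cost 0.2 × 10 = 2.0 while predicting "minority" costs 0.8 × 1 = 0.8, so the decision shifts towards the minority class even though it is less probable.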

18.
Ying, Dengsheng, Guojun. Pattern Recognition, 2008, 41(8): 2554-2570
Semantic-based image retrieval has attracted great interest in recent years. This paper proposes a region-based image retrieval system with high-level semantic learning. The key features of the system are: (1) it supports both query by keyword and query by region of interest. The system segments an image into different regions and extracts low-level features of each region. From these features, high-level concepts are obtained using a proposed decision tree-based learning algorithm named DT-ST. During retrieval, a set of images whose semantic concept matches the query is returned. Experiments on a standard real-world image database confirm that the proposed system significantly improves the retrieval performance, compared with a conventional content-based image retrieval system. (2) The proposed decision tree induction method DT-ST for image semantic learning is different from other decision tree induction algorithms in that it makes use of the semantic templates to discretize continuous-valued region features and avoids the difficult image feature discretization problem. Furthermore, it introduces a hybrid tree simplification method to handle the noise and tree fragmentation problems, thereby improving the classification performance of the tree. Experimental results indicate that DT-ST outperforms two well-established decision tree induction algorithms ID3 and C4.5 in image semantic learning.

19.
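SVM-based risk assessment model for software requirements analysis   Cited by: 1 (self-citations: 0, citations by others: 1)
潘梅森, 熊齐. 《计算机工程》 (Computer Engineering), 2007, 33(12): 78-81
Requirements analysis risk is an important part of software project risk management. Based on 13 kinds of risk, this paper establishes a new risk assessment model for software project requirements analysis. The 13 requirements analysis risks of each past software project are treated as a 1×13 row vector and used as a training vector for the SVM. The projects are divided into three classes (low, medium, and high risk), and the requirements analysis risk level of a project is predicted.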

20.
Unscheduled maintenance of aircraft can cause significant costs. The machine needs to be repaired before it can operate again. Thus it is desirable to have concepts and methods to prevent unscheduled maintenance. This paper proposes a method for forecasting the condition of an aircraft air conditioning system based on observed past data. Forecasting is done point by point, by iterating the algorithm. The proposed method uses decision trees to find and learn patterns in past data and uses these patterns to select the best forecasting method for future data points. Forecasting a data point is based on selecting the best applicable approximation method. The selection is done by calculating different features/attributes of the time series and then evaluating the decision tree. A genetic algorithm is used to find the best feature set for the given problem to increase the forecasting performance. The experiments show a good forecasting ability even when the function is disturbed by noise.
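
The point-by-point iteration mentioned in this abstract can be sketched as follows. In this minimal sketch, a single fixed one-step method (linear extrapolation) stands in for the decision-tree-based method selection, which the abstract does not specify in detail; all names and values are illustrative assumptions.

```python
def iterate_forecast(history, one_step, horizon):
    """Forecast `horizon` points by feeding each prediction back into the series."""
    series = list(history)
    for _ in range(horizon):
        series.append(one_step(series))  # the next point depends on earlier forecasts
    return series[len(history):]

def linear_extrapolation(series):
    """One-step forecast that continues the slope of the last two points."""
    return 2 * series[-1] - series[-2]

# Hypothetical sensor readings with a steady upward trend.
forecast = iterate_forecast([1.0, 2.0, 3.0], linear_extrapolation, horizon=3)
```

In the approach the abstract describes, `one_step` would instead be chosen for each point by evaluating a decision tree on features of the recent time series.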
