Similar Documents
20 similar documents found.
1.
Costs are often an important part of the classification process, and cost factors have been taken into consideration in many previous studies of decision tree models. In this study, we consider a cost-sensitive decision tree construction problem. We assume that test costs must be paid to obtain the values of attributes and that a record must be classified without exceeding a spending cost threshold. Unlike previous studies, in which records were classified with only a single condition attribute, in this study we are able to classify records with multiple condition attributes simultaneously. An algorithm is developed to build such a cost-constrained decision tree. The experimental results show that our algorithm satisfactorily handles data with multiple condition attributes under different cost constraints.
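To make the budget constraint concrete, here is a minimal sketch of an affordable-split selection step, assuming an information-gain heuristic and per-attribute test costs; the function names and the gain measure are illustrative and not taken from the paper:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition_labels(records, labels, a):
    # group the labels by each record's value on attribute a
    parts = defaultdict(list)
    for r, y in zip(records, labels):
        parts[r[a]].append(y)
    return parts

def choose_split(records, labels, attrs, test_cost, budget):
    # pick the affordable attribute with the highest information gain;
    # attributes whose test cost exceeds the remaining budget are skipped
    base = entropy(labels)
    best, best_gain = None, 0.0
    for a in attrs:
        if test_cost[a] > budget:
            continue
        gain = base - sum(len(p) / len(labels) * entropy(p)
                          for p in partition_labels(records, labels, a).values())
        if gain > best_gain:
            best, best_gain = a, gain
    return best  # None means no affordable attribute improves purity
```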

2.
Cost-sensitive learning algorithms are typically designed to minimize the total cost when multiple costs are taken into account. Like other learning algorithms, cost-sensitive learning algorithms face a significant challenge: over-fitting. That is, they can produce good results on training data but normally fail to produce an optimal model when applied to unseen data in real-world applications. This paper deals with over-fitting by designing three simple and efficient strategies, feature selection, smoothing and threshold pruning, for the TCSDT (test cost-sensitive decision tree) method. Feature selection is used to pre-process the data set before applying the TCSDT algorithm, while smoothing and threshold pruning are applied within the TCSDT algorithm when calculating the class probability estimate for each decision tree leaf. To evaluate our approaches, we conduct extensive experiments on selected UCI data sets across different cost ratios, and on a real-world data set, KDD-98, with real misclassification costs. The experimental results show that our algorithms outperform both the original TCSDT and other competing algorithms in reducing over-fitting.
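The abstract names smoothing without giving a formula; the Laplace correction is a common choice for smoothing class probability estimates at decision tree leaves, sketched here:

```python
def laplace_leaf_estimate(class_counts, c):
    # Laplace-smoothed probability of class c at a leaf: add one virtual
    # example per class, pulling small-sample estimates toward uniform
    k = len(class_counts)
    n = sum(class_counts.values())
    return (class_counts.get(c, 0) + 1) / (n + k)

# A leaf with 2 fraud / 8 normal examples: the raw estimate 0.20 becomes
# 3/12 = 0.25, which over-commits less on sparsely populated leaves.
print(laplace_leaf_estimate({"fraud": 2, "normal": 8}, "fraud"))
```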

3.
A new approach to the problem of graph and subgraph isomorphism detection from an input graph to a database of model graphs is proposed in this paper. It is based on a preprocessing step in which the model graphs are used to create a decision tree. At run time, subgraph isomorphisms are detected by means of decision tree traversal. If we neglect the time needed for preprocessing, the computational complexity of the new graph algorithm is only polynomial in the number of input graph vertices. In particular, it is independent of the number of model graphs and the number of edges in any of the graphs. However, the decision tree is of exponential size. Several pruning techniques which aim at reducing the size of the decision tree are presented. A computational complexity analysis of the new method is given and its behavior is studied in a number of practical experiments with randomly generated graphs.
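A toy sketch of the preprocessing/traversal idea, restricted to whole-graph isomorphism over adjacency matrices; the paper's subgraph matching and pruning techniques are omitted, and the dictionary-based tree below is an illustrative simplification:

```python
from itertools import permutations

def row_key(A, order, k):
    # adjacency of the k-th placed vertex to itself and all earlier ones
    return tuple(A[order[k]][order[j]] for j in range(k + 1))

def build_decision_tree(models):
    # preprocessing: index every vertex permutation of every model graph,
    # which is where the exponential size of the tree comes from
    root = {}
    for gid, A in models.items():
        for perm in permutations(range(len(A))):
            node = root
            for k in range(len(A)):
                node = node.setdefault(row_key(A, perm, k), {})
            node.setdefault("models", set()).add(gid)
    return root

def find_isomorphic(tree, B):
    # run time: a single traversal guided by the input graph's own rows,
    # independent of how many model graphs were indexed
    node, order = tree, list(range(len(B)))
    for k in range(len(B)):
        node = node.get(row_key(B, order, k))
        if node is None:
            return set()
    return node.get("models", set())
```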

4.
Fan Min  Qihe Liu 《Information Sciences》2009,179(14):2442-2452
Cost-sensitive learning is an important issue in both data mining and machine learning, in that it deals with the problem of learning from decision systems relative to a variety of costs. In this paper, we introduce a hierarchy of cost-sensitive decision systems from a test cost perspective. Two major issues are addressed with regard to test cost dependency. The first concerns the common test cost, where a group of tests share a common cost; the other relates to the sequence-dependent test cost, where the order of the test sequence influences the total cost. Theoretical aspects of each of the six models in our hierarchy are investigated and illustrated via examples. The proposed models are shown to be useful for exploring cost-related information in a variety of applications.
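A minimal sketch of the two dependency ideas, under assumed cost structures: a flat per-group fee for the common-cost case and a uniform discount after the first test for the sequence-dependent case. Neither structure is taken from the paper's six models:

```python
def common_cost(tests, base_cost, group_of, group_fee):
    # common-test-cost model: tests in the same group (say, tests that
    # share one blood draw) pay that group's shared fee only once
    groups = {group_of[t] for t in tests if t in group_of}
    return sum(base_cost[t] for t in tests) + sum(group_fee[g] for g in groups)

def sequence_dependent_cost(sequence, base_cost, discount=0.8):
    # sequence-dependent model: every test after the first is discounted,
    # so the total depends on which test is performed first
    return sum(base_cost[t] * (1.0 if i == 0 else discount)
               for i, t in enumerate(sequence))
```

Under these assumptions the shared fee is charged once however many group members are tested, and putting the cheapest test first minimizes the sequence-dependent total, illustrating how order can change the total cost.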

5.
Methods for the classification of ultrasound thyroid images are presented. These methods allow us to classify examined patients as either sick or healthy. Decision tree induction and a multilayer perceptron neural network were used to build the classification models. Test results showed that the proposed methods can provide a starting point for building a support system for medical diagnosis. Better classifier accuracy was achieved for the normalized images, and, under the adopted assumptions, the results obtained for them were statistically significant in contrast to the original images. The proposed methods also allow us to separate a fairly large group of incorrectly classified cases; according to the authors, this group may contain features of the early stage of Hashimoto's disease.

6.
Database classification suffers from two well-known difficulties: high dimensionality and non-stationary variations within large historic data. This paper presents a hybrid classification model that integrates a case-based reasoning technique, a fuzzy decision tree (FDT), and genetic algorithms (GAs) to construct a decision-making system for data classification in various database applications. The model is based mainly on the idea that a historic database can be transformed into a smaller case base together with a group of fuzzy decision rules, so that the model can respond more accurately to the data currently being classified using inductions from these smaller case-based fuzzy decision trees. Hit rate is applied as a performance measure, and the effectiveness of the proposed model is demonstrated experimentally in comparison with other approaches on different database classification applications. The average hit rate of the proposed model is the highest among all compared approaches.

7.
Choice of a classification algorithm is generally based upon a number of factors, among which are availability of software, ease of use, and performance, measured here by overall classification accuracy. The maximum likelihood (ML) procedure is, for many users, the algorithm of choice because of its ready availability and the fact that it does not require an extended training process. Artificial neural networks (ANNs) are now widely used by researchers, but their operational applications are hindered by the need for the user to specify the configuration of the network architecture and to provide values for a number of parameters, both of which affect performance. The ANN also requires an extended training phase.

In the past few years, the use of decision trees (DTs) to classify remotely sensed data has increased. Proponents of the method claim that it has a number of advantages over the ML and ANN algorithms. The DT is computationally fast, makes no statistical assumptions, and can handle data represented on different measurement scales. Software to implement DTs is readily available over the Internet. Pruning can make DTs smaller and more easily interpretable, while the use of boosting techniques can improve performance.

In this study, separate test and training data sets from two different geographical areas and two different sensors (multispectral Landsat ETM+ and hyperspectral DAIS) are used to evaluate the performance of univariate and multivariate DTs for land cover classification. Factors considered are the effects of variations in training data set size and in the dimensionality of the feature space, together with the impact of boosting, attribute selection measures, and pruning. The level of classification accuracy achieved by the DT is compared to results from the back-propagation ANN and ML classifiers. Our results indicate that the performance of the univariate DT is acceptably good in comparison with that of other classifiers, except with high-dimensional data. Classification accuracy increases linearly with training data set size up to a limit of 300 pixels per class in this case. Multivariate DTs do not appear to perform better than univariate DTs. While boosting produces an increase in classification accuracy of between 3% and 6%, the use of attribute selection methods does not appear to be justified in terms of accuracy increases. Neither the univariate DT nor the multivariate DT performed as well as the ANN or ML classifiers with high-dimensional data.

8.
Xiao Liu  Xiaoguo Wang 《Journal of Computer Applications》2019,39(4):1214-1219
Banks currently have few labeled records of telecom fraud, and manual labeling is expensive, so supervised learning methods for telecom fraud detection lack sufficient labeled data. To address this problem, an unsupervised learning method based on dense subgraphs is proposed for telecom fraud detection. First, fraudulent accounts are identified by searching for highly suspicious subgraphs in an account-resource network (IP addresses and MAC addresses are collectively referred to as resources); then, a subgraph suspiciousness metric tailored to the characteristics of telecom fraud is designed; finally, a disk-resident suspicious-subgraph search algorithm with linear memory consumption and theoretical guarantees is proposed. On two simulated datasets, the F1-scores of the proposed method reach 0.921 and 0.861, higher than those of the CrossSpot, fBox and EvilCohort algorithms and close to the 0.899 and 0.898 of the M-Zoom algorithm, while its average running time and peak memory consumption are both lower than those of M-Zoom; on a real-world dataset, the proposed method achieves an F1-score of 0.550, higher than fBox and EvilCohort and close to M-Zoom's 0.529. The experimental results show that the proposed method can be applied well to banks' current anti-telecom-fraud operations and is well suited to the large-scale datasets found in practice.
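The paper's own suspiciousness metric and disk-resident search are not reproduced here; the following is a minimal in-memory sketch of the greedy peeling idea used by M-Zoom-style dense-subgraph detectors, with average degree as the density measure:

```python
import heapq

def densest_subgraph(adj):
    # greedy peeling for the densest subgraph under average-degree density:
    # repeatedly remove the minimum-degree vertex, remember the best prefix
    deg = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    edges = sum(deg.values()) // 2
    best, best_density = set(alive), edges / max(len(alive), 1)
    while alive:
        d, v = heapq.heappop(heap)
        if v not in alive or d != deg[v]:
            continue  # stale heap entry for an updated or removed vertex
        alive.remove(v)
        edges -= deg[v]
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
        if alive and edges / len(alive) > best_density:
            best_density = edges / len(alive)
            best = set(alive)
    return best, best_density
```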

9.
Over the past few years, investigators in Brazil have been uncovering numerous corruption and money laundering schemes at all levels of government and in the country's largest corporations. It is estimated that between 2% and 5% of global GDP is lost annually because of such practices, not only directly impacting public services and private sector development but also strengthening organized crime. However, most law enforcement agencies do not have the capability to carry out systematic corruption risk assessment leveraging the availability of public procurement data. The currently prevailing approach employed by Brazilian law enforcement agencies to detect companies involved in potential cases of fraud consists of receiving circumstantial evidence or complaints from whistleblowers. As a result, a large number of companies involved in fraud remain undetected and unprosecuted. The decision support system (DSS) described in this work addresses these limitations by providing a tool for systematic analysis of public procurement, allowing law enforcement agencies to establish priorities concerning the companies to be investigated. The DSS incorporates data mining algorithms for quantifying dozens of corruption risk patterns for all public contractors inside a specific jurisdiction, leading to improvements in the quality of public spending and to the identification of more cases of fraud. These algorithms combine operations research tools such as graph theory, clustering, and regression analysis with advanced data science methods to identify the main risk patterns, such as collusion between bidders, conflicts of interest (e.g., a politician who owns a company contracted by the same government body where he or she was elected), and companies owned by a potential straw person used to disguise the real owner (e.g., beneficiaries of conditional cash transfer programs). The DSS has already led to detailed analysis of large public procurement datasets, which add up to more than 50 billion dollars, and has provided strategic inputs to investigations conducted by federal and state agencies.

10.
In business applications such as direct marketing, decision-makers are required to choose the action that maximizes a utility function. Cost-sensitive learning methods can help them achieve this goal. In this paper, we introduce Pessimistic Active Learning (PAL). PAL employs a novel pessimistic measure, which relies on confidence intervals and is used to balance the exploration/exploitation trade-off. In order to acquire an initial sample of labeled data, PAL applies orthogonal arrays of fractional factorial design. PAL was tested on ten datasets using a decision tree inducer, and a comparison of these results to those of other methods indicates PAL's superiority.
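The abstract does not spell out PAL's pessimistic measure; a common confidence-interval-based pessimistic score, shown here purely for illustration, is the lower end of the Wilson score interval:

```python
import math

def wilson_lower_bound(successes, trials, z=1.96):
    # pessimistic estimate of a success probability: the lower end of the
    # Wilson score interval (z = 1.96 corresponds to a 95% interval)
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2))
    return (centre - margin) / denom

# 3/4 successes scores lower than 30/40: small samples are penalized,
# which is the exploration/exploitation balance a pessimistic measure buys.
print(wilson_lower_bound(3, 4), wilson_lower_bound(30, 40))
```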

11.
Hybrid decision tree

12.
Real-life datasets are often imbalanced, that is, there are significantly more training samples available for some classes than for others, and consequently the conventional aim of maximizing overall classification accuracy is not appropriate when dealing with such problems. Various approaches have been introduced in the literature to deal with imbalanced datasets, typically based on oversampling, undersampling or cost-sensitive classification. In this paper, we introduce an effective ensemble of cost-sensitive decision trees for imbalanced classification. Base classifiers are constructed according to a given cost matrix, but are trained on random feature subspaces to ensure sufficient diversity of the ensemble members. We employ an evolutionary algorithm for simultaneous classifier selection and assignment of committee member weights for the fusion process. Our proposed algorithm is evaluated on a variety of benchmark datasets and is confirmed to lead to improved recognition of the minority class, to be capable of outperforming other state-of-the-art algorithms, and hence to represent a useful and effective approach for dealing with imbalanced datasets.
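A brief sketch of the base-classifier construction, assuming scikit-learn and binary labels 0/1; the cost ratio stands in for the paper's cost matrix, and the evolutionary selection and weight tuning are replaced by uniform weights for brevity:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_subspace_ensemble(X, y, n_trees=25, subspace_frac=0.5,
                          cost_ratio=10.0, seed=0):
    # each tree is cost-sensitive via class_weight (misclassifying the
    # minority class 1 costs cost_ratio times more -- an assumed ratio)
    # and is trained on its own random feature subspace for diversity
    rng = np.random.default_rng(seed)
    k = max(1, int(subspace_frac * X.shape[1]))
    ensemble = []
    for _ in range(n_trees):
        feats = rng.choice(X.shape[1], size=k, replace=False)
        tree = DecisionTreeClassifier(class_weight={0: 1.0, 1: cost_ratio},
                                      random_state=int(rng.integers(1 << 30)))
        tree.fit(X[:, feats], y)
        ensemble.append((feats, tree))
    return ensemble

def predict(ensemble, X, weights=None):
    # weighted probability vote over the committee members
    weights = weights if weights is not None else [1.0] * len(ensemble)
    votes = sum(w * tree.predict_proba(X[:, feats])
                for w, (feats, tree) in zip(weights, ensemble))
    return votes.argmax(axis=1)  # assumes labels are 0..n_classes-1
```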

13.
This paper deals with improvements to rule induction algorithms in order to resolve ties that appear in special cases during the rule generation procedure for specific training data sets. These improvements are demonstrated by experimental results on various data sets. A tie occurs in a decision tree induction algorithm when the class prediction at a leaf node cannot be determined by majority voting. When there is a conflict at the leaf node, we need to find the source of the problem and a solution to it. In this paper, we propose calculating an Influence factor for each attribute, and an update procedure for the decision tree is suggested to deal with the problem and provide subsequent rectification steps.

14.
We have previously proposed a decision tree classifier named MMC (multi-valued and multi-labeled classifier). MMC is known for its capability of classifying large multi-valued and multi-labeled data. Aiming to improve the accuracy of MMC, this paper develops another classifier named MMDT (multi-valued and multi-labeled decision tree). MMDT differs from MMC mainly in attribute selection. MMC attempts to split a node into child nodes whose records approach the same multiple labels; it basically measures the average similarity of the labels in each child node to determine the goodness of each splitting attribute. MMDT, in contrast, uses a measuring strategy that considers not only the average similarity of the labels in each child node but also their average appropriateness. The new strategy takes a scoring approach to obtain a look-ahead measure of the accuracy contribution of each attribute's split. The experimental results show that MMDT improves on the accuracy of MMC.
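The abstract does not define its similarity measure; the sketch below uses Jaccard similarity between label sets as an assumed stand-in, to show how the average label similarity of child nodes could score a candidate split:

```python
from itertools import combinations

def label_similarity(a, b):
    # Jaccard similarity between two label sets (an assumed stand-in for
    # the similarity MMC/MMDT applies to multi-labeled records)
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def avg_label_similarity(label_sets):
    # average pairwise similarity of the label sets inside one child node
    pairs = list(combinations(label_sets, 2))
    if not pairs:
        return 1.0
    return sum(label_similarity(a, b) for a, b in pairs) / len(pairs)

def split_score(children):
    # goodness of a candidate split: size-weighted mean child similarity
    total = sum(len(c) for c in children)
    return sum(len(c) / total * avg_label_similarity(c) for c in children)
```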

15.
As the telecom services marketing paradigm evolves, it becomes ever more important to retain high-value customers. Traditional customer segmentation methods based on experience or ARPU (Average Revenue per User) consider neither customers' future revenue nor the cost of servicing customers of different types, which makes it very difficult to effectively identify high-value customers. In this paper, we propose a novel customer segmentation method based on the customer lifecycle, which includes five decision models: current value, historic value, prediction of long-term value, credit and loyalty. Because long-term value, credit and loyalty are difficult to compute quantitatively, a decision tree method is used to extract important parameters related to them. A judgment matrix formulated on the basis of the characteristics of the data and the experience of business experts is then presented. Finally, a simple and practical customer value evaluation system is built. The model has been applied to telecom operators in a province in China, and good accuracy is achieved.

16.
Some biological phenomena offer clues to solving real-life, complex problems. Researchers have been studying techniques such as neural networks and genetic algorithms for computational intelligence and their applications to such complex problems. The problem of security management is one of the major concerns in the development of eBusiness services and networks. Recent incidents have shown that the perpetrators of cybercrimes are using increasingly sophisticated methods. Hence, it is necessary to investigate non-traditional mechanisms, such as biological techniques, to manage the security of evolving eBusiness networks and services. Towards this end, this paper investigates the use of an Artificial Immune System (AIS), which emulates the mechanism of human immune systems that save human bodies from complex natural biological attacks. The paper discusses the use of AIS for one aspect of security management, viz. the detection of credit card fraud. The solution is illustrated with a case study on the management of fraud in credit card transactions, although the technique may be used in a range of security management applications in eBusiness.

17.
Intrusion detection based on decision trees and protocol analysis
Most current intrusion detection products use simple rule-based pattern matching, which suffers from heavy resource consumption, high false-alarm rates, and packet loss as network speeds increase. To address these problems, a decision tree algorithm is used to implement an intrusion detection method based on protocol analysis. Experimental results show that the method achieves a higher detection speed and lower false-positive and false-negative rates.

18.
Xiaofeng Zhao  Zhen Ye 《Journal of Computer Applications》2007,27(5):1041-1043
Traditional decision tree classification methods such as ID3 and C4.5 are effective for relatively small datasets, but when these algorithms are applied to very large data such as intrusion detection data, their effectiveness proves insufficient. A decision tree algorithm based on a random model is adopted, which reduces the consumption of system resources while maintaining classification accuracy, and a distributed intrusion detection model based on this algorithm is designed. Finally, comparative experiments show that the model performs outstandingly in classifying computer intrusion data.

19.
The objective of this paper is to construct a lightweight Intrusion Detection System (IDS) aimed at detecting anomalies in networks. The crucial part of building a lightweight IDS lies in the preprocessing of network data, the identification of important features, and the design of an efficient learning algorithm that classifies normal and anomalous patterns. In this work, the design of the IDS is therefore investigated from these three perspectives. The goals of this paper are (i) removing redundant instances so that the learning algorithm is not biased, (ii) identifying a suitable subset of features by employing a wrapper-based feature selection algorithm, and (iii) realizing the proposed IDS with a neurotree to achieve better detection accuracy. The lightweight IDS has been developed by using a wrapper-based feature selection algorithm that maximizes the specificity and sensitivity of the IDS, together with a neural ensemble decision tree iterative procedure to evolve optimal features. An extensive experimental evaluation of the proposed approach against a family of six decision tree classifiers, namely Decision Stump, C4.5, Naive Bayes Tree, Random Forest, Random Tree and the Representative Tree model, is presented for the detection of anomalous network patterns.
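A hedged sketch of goal (ii), wrapper-based feature selection, written as a greedy forward search wrapped around a decision tree; the paper optimizes sensitivity and specificity, while plain cross-validated accuracy is used here to keep the sketch short:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_forward_selection(X, y, max_features=10, cv=5):
    # greedy forward wrapper: repeatedly add the feature that most
    # improves the cross-validated score of the wrapped classifier
    selected, remaining = [], list(range(X.shape[1]))
    best_score = 0.0
    while remaining and len(selected) < max_features:
        scores = {}
        for f in remaining:
            cols = selected + [f]
            clf = DecisionTreeClassifier(random_state=0)
            scores[f] = cross_val_score(clf, X[:, cols], y, cv=cv).mean()
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:
            break  # no remaining feature improves the score; stop
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_score
```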

20.
In recent years, classification learning for data streams has become an important and active research topic. A major challenge posed by data streams is that their underlying concepts can change over time, which requires current classifiers to be revised accordingly and in a timely fashion. To detect concept change, a common methodology is to observe the online classification accuracy: if accuracy drops below some threshold value, a concept change is deemed to have taken place. An implicit assumption behind this methodology is that any drop in classification accuracy can be interpreted as a symptom of concept change. Unfortunately, however, this assumption is often violated in the real world, where data streams carry noise that can also introduce a significant reduction in classification accuracy. To compound this problem, traditional noise cleansing methods are not competent for data streams: they normally need to scan data multiple times, whereas learning from data streams can only afford a one-pass scan because of the data's high speed and huge volume. Another open problem in data stream classification is how to deal with missing values. When new instances containing missing values arrive, how a learning model classifies them and how it updates itself according to them is an issue whose solution is far from explored. To solve these problems, this paper proposes a novel classification algorithm, the flexible decision tree (FlexDT), which extends fuzzy logic to data stream classification. The advantages are three-fold. First, FlexDT offers a flexible structure to handle concept change effectively and efficiently. Second, FlexDT is robust to noise, so it can prevent noise from interfering with classification accuracy, and an accuracy drop can be safely attributed to concept change. Third, it deals with missing values in an elegant way. Extensive evaluations are conducted to compare FlexDT with representative existing data stream classification algorithms using a large suite of data streams and various statistical tests. Experimental results suggest that FlexDT offers a significant benefit to data stream classification in real-world scenarios where concept change, noise and missing values coexist.
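The accuracy-drop methodology the abstract critiques can be sketched as a sliding-window monitor; the window size and threshold below are arbitrary choices, and FlexDT's contribution is precisely to keep noise from tripping this kind of signal:

```python
from collections import deque

class AccuracyDriftMonitor:
    # flag a possible concept change when windowed online accuracy falls
    # below a threshold -- the common heuristic described in the abstract
    def __init__(self, window=200, threshold=0.7):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def update(self, predicted, actual):
        self.window.append(predicted == actual)
        if len(self.window) < self.window.maxlen:
            return False  # not enough evidence yet
        accuracy = sum(self.window) / len(self.window)
        return accuracy < self.threshold  # True => suspected concept change
```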
