Similar Documents
20 similar documents found.
1.
Decision trees have three main disadvantages: reduced performance when the training set is small; rigid decision criteria; and the fact that a single "uncharacteristic" attribute might "derail" the classification process. In this paper we present ConfDTree (Confidence-Based Decision Tree) -- a post-processing method that enables decision trees to better classify outlier instances. This method, which can be applied to any decision tree algorithm, uses easy-to-implement statistical methods (confidence intervals and two-proportion tests) in order to identify hard-to-classify instances and to propose alternative routes. The experimental study indicates that the proposed post-processing method consistently and significantly improves the predictive performance of decision trees, particularly for small, imbalanced or multi-class datasets, for which an average improvement of 5%-9% in AUC is reported.
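The statistical tools the method relies on, a confidence interval for a leaf's class proportion and a two-proportion z-test between candidate routes, are easy to sketch. The following Python is illustrative only; the function names and the 18/20 vs. 9/20 leaf counts are invented for the example, not taken from the paper:

```python
import math

def proportion_ci(k, n, z=1.96):
    """Normal-approximation confidence interval for a leaf's class proportion."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def two_proportion_z(k1, n1, k2, n2):
    """z statistic comparing the class proportions of two candidate leaves."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# A leaf predicting its majority class from 18/20 training instances is
# significantly more confident than an alternative route with 9/20:
z = two_proportion_z(18, 20, 9, 20)   # |z| > 1.96 => significant at ~95%
```

An instance whose leaf interval is wide, or whose leaf is not significantly better than an alternative route, is the kind of hard-to-classify case such a post-processing step would re-route.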

2.
Mining with streaming data is a hot topic in data mining. When performing classification on data streams, traditional classification algorithms based on decision trees, such as ID3 and C4.5, have a relatively poor efficiency in both time and space due to the characteristics of streaming data. There are some advantages in time and space when using random decision trees. An incremental algorithm for mining data streams, SRMTDS (Semi-Random Multiple decision Trees for Data Streams), based on random decision trees is proposed in this paper. SRMTDS uses the inequality of Hoeffding bounds to choose the minimum number of split-examples, a heuristic method to compute the information gain for obtaining the split thresholds of numerical attributes, and a Naive Bayes classifier to estimate the class labels of tree leaves. Our extensive experimental study shows that SRMTDS has an improved performance in time, space, accuracy and the anti-noise capability in comparison with VFDTc, a state-of-the-art decision-tree algorithm for classifying data streams.
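The Hoeffding bound used to choose the minimum number of split-examples can be sketched directly. For an observed quantity with range R (e.g. R = 1 for binary-class information gain), confidence 1 - δ, and tolerated estimation error ε, the bound gives ε(n) = sqrt(R² ln(1/δ) / 2n), which can be inverted for n. This is a generic sketch of the bound, not SRMTDS's actual code, and the parameter values below are illustrative:

```python
import math

def hoeffding_epsilon(R, delta, n):
    """Hoeffding error bound after n examples, holding with probability 1 - delta."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def min_split_examples(R, delta, eps):
    """Smallest n for which the Hoeffding bound drops below eps."""
    return math.ceil(R * R * math.log(1.0 / delta) / (2.0 * eps * eps))

# With R = 1, delta = 1e-7 and eps = 0.05, a node may be split only after
# min_split_examples(1.0, 1e-7, 0.05) examples have been observed at it.
```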

3.
Classification can be regarded as dividing the data space into decision regions separated by decision boundaries. In this paper we analyze decision tree algorithms and the NBTree algorithm from this perspective. Thus, a decision tree can be regarded as a classifier tree, in which each classifier on a non-root node is trained in decision regions of the classifier on the parent node. Meanwhile, the NBTree algorithm, which generates a classifier tree with the C4.5 algorithm and the naive Bayes classifier as the root and leaf classifiers respectively, can also be regarded as training naive Bayes classifiers in decision regions of the C4.5 algorithm. We propose a second division (SD) algorithm and three soft second division (SD-soft) algorithms to train classifiers in decision regions of the naive Bayes classifier. These four novel algorithms all generate two-level classifier trees with the naive Bayes classifier as the root classifier. The SD and three SD-soft algorithms can make good use of both the information contained in instances near decision boundaries and the information that may be ignored by the naive Bayes classifier. Finally, we conduct experiments on 30 data sets from the UC Irvine (UCI) repository. Experiment results show that the SD algorithm can obtain better generalization abilities than the NBTree and the averaged one-dependence estimators (AODE) algorithms when using the C4.5 algorithm and support vector machine (SVM) as leaf classifiers. Further experiments indicate that our three SD-soft algorithms can achieve better generalization abilities than the SD algorithm when argument values are selected appropriately.

4.
This paper uses geometric methods to describe Lie group Machine Learning (LML) within the theoretical framework of LML, and gives geometric algorithms for Dynkin diagrams in LML. It covers the basic concepts of Dynkin diagrams in LML, their classification theorems, the classification algorithm for Dynkin diagrams in LML, and the verification of the classification algorithm with experimental results.

5.
In this paper, the problem of increasing information transfer authenticity is formulated. To address it, control methods and algorithms based on statistical and structural information redundancy are presented. It is assumed that the controllable information is submitted as text element images and contains redundancy caused by statistical relations and the non-uniform probability distribution of the transmitted data. The use of statistical redundancy allows us to develop adaptive rules for authenticity control which take into account the non-stationarity of image data during information transfer. The structural redundancy peculiar to the image container in a data transfer package is used to develop new rules for controlling information authenticity on the basis of pattern recognition mechanisms. The techniques offered in this work are used to estimate authenticity within the structure of data transfer packages. The results of a comparative analysis of the developed methods and algorithms show improved efficiency in terms of the probability of undetected errors, labour input and implementation cost.

6.
Not only does business performance serve as a major indicator for investors' decisions, but it also has a great deal to do with employees' livelihoods. Generally speaking, when predicting or analyzing business performance classification, most researchers adopt corporate financial early-warning or credit rating models, which rely largely on historical data and facts. This paper therefore brings about an alternative method to discriminate between excellent and poor business management, so as to take preventive measures prior to business crisis or bankruptcy. We collect the financial reports and financial ratios of listed firms in mainland China and Taiwan as our samples to build four kinds of forecasting models for business performance. The empirical results show that the GA-ANFIS model provides better classification forecasting capability than the other models do, i.e., an ANFIS model adjusted by a genetic algorithm can effectively enhance classification forecasting capability.

7.
Predicting the response variables of the target dataset is one of the main problems in machine learning. Predictive models are desired to perform satisfactorily in a broad range of target domains. However, that may not be plausible if there is a mismatch between the source and target domain distributions. The goal of domain adaptation algorithms is to solve this issue and deploy a model across different target domains. We propose a method based on kernel distribution embedding and Hilbert-Schmidt independence criterion (HSIC) to address this problem. The proposed method embeds both source and target data into a new feature space with two properties: 1) the distributions of the source and the target datasets are as close as possible in the new feature space, and 2) the important structural information of the data is preserved. The embedded data can be in lower dimensional space while preserving the aforementioned properties and therefore the method can be considered as a dimensionality reduction method as well. Our proposed method has a closed-form solution and the experimental results show that it works well in practice.
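The HSIC mentioned above has a simple biased empirical estimator, trace(HKH·HLH)/(n-1)², where K and L are kernel matrices over the two samples and H is the centering matrix. The pure-Python sketch below (function names and the Gaussian kernel choice are illustrative, not from the paper) computes it with explicit loops for small n:

```python
import math

def _gauss(x, y, sigma=1.0):
    """Gaussian kernel between two vectors (lists of floats)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC estimate: trace(HKH * HLH) / (n-1)^2."""
    n = len(X)
    K = [[_gauss(X[i], X[j], sigma) for j in range(n)] for i in range(n)]
    L = [[_gauss(Y[i], Y[j], sigma) for j in range(n)] for i in range(n)]

    def center(M):
        # HMH with H = I - (1/n) * ones: subtract row/column means, add grand mean.
        row = [sum(r) / n for r in M]
        col = [sum(M[i][j] for i in range(n)) / n for j in range(n)]
        tot = sum(row) / n
        return [[M[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]

    Kc, Lc = center(K), center(L)
    # Both centered matrices are symmetric, so trace(Kc Lc) is an elementwise sum.
    tr = sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n))
    return tr / ((n - 1) ** 2)
```

A constant sample yields HSIC exactly 0 (its centered kernel matrix vanishes), while a sample compared with itself yields a strictly positive value.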

8.
This paper proposes one method of feature selection by using Bayes' theorem. The purpose of the proposed method is to reduce the computational complexity and increase the classification accuracy of the selected feature subsets. The dependence between two attributes (binary) is determined based on the probabilities of their joint values that contribute to positive and negative classification decisions. If opposing sets of attribute values do not lead to opposing classification decisions (zero probability), then the two attributes are considered independent of each other, otherwise dependent, and one of them can be removed, thus reducing the number of attributes. The process must be repeated on all combinations of attributes. The paper also evaluates the approach by comparing it with existing feature selection algorithms over 8 datasets from the University of California, Irvine (UCI) machine learning databases. The proposed method shows better results in terms of number of selected features, classification accuracy, and running time than most existing algorithms.
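The general idea, dropping one of two mutually dependent attributes, can be illustrated with a standard mutual-information check. Note that MI here is a swapped-in, generic dependence measure; it is not the paper's exact Bayes-theorem criterion, and the threshold value is illustrative:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two attribute columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        pj = c / n
        mi += pj * math.log2(pj / ((px[x] / n) * (py[y] / n)))
    return mi

def drop_redundant(columns, threshold=0.9):
    """Greedily keep a column only if its MI with every kept column is low."""
    kept = []
    for col in columns:
        if all(mutual_information(col, k) < threshold for k in kept):
            kept.append(col)
    return kept
```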

9.
Vision-based road detection is an important research topic in different areas of computer vision such as the autonomous navigation of mobile robots. In outdoor unstructured environments such as villages and deserts, the roads are usually not well-paved and have variant colors or texture distributions. Traditional region- or edge-based approaches, however, are effective only in specific environments, and most of them have weak adaptability to varying road types and appearances. In this paper we describe a novel top-down based hybrid algorithm which properly combines both region and edge cues from the images. The main difference between our proposed algorithm and previous ones is that, before road detection, an off-line scene classifier is efficiently learned by both low- and high-level image cues to predict the unstructured road model. This scene classification can be considered a decision process which guides the selection of the optimal solution from region- or edge-based approaches to detect the road. Moreover, a temporal smoothing mechanism is incorporated, which further makes both model prediction and region classification more stable. Experimental results demonstrate that compared with traditional region- and edge-based algorithms, our algorithm is more robust in detecting the road areas with diverse road types and varying appearances in unstructured conditions.

10.
The hierarchical identification model with multiple detectors is an innovative approach to biometric systems design which improves identification accuracy while reducing computational complexity. This complexity reduction provides additional advantages in terms of execution time and recognition accuracy. The model differs from existing solutions for biometric data classification because it essentially uses a special kind of classifiers (detectors), and the identification decision is issued in a hierarchical way according to the users' importance; this makes it suitable for applications with various security requirements (users with different authorization levels). The model includes a local feature-level fusion for each of the integrated biometrics. The paper defines and explains the multi-detector security architecture with its basic functions. The achieved experimental results are discussed to reveal the proposed method's advantages and further potential enhancements for particular use cases.

11.
In this paper, a new effective method is proposed to find class association rules (CARs), to obtain useful class association rules (UCARs) by removing spurious class association rules (SCARs), and to generate exception class association rules (ECARs) for each UCAR. CAR mining, which integrates the techniques of classification and association, has attracted great interest recently. However, it has two drawbacks: one is that a large portion of CARs are spurious and may be misleading to users; the other is that some important ECARs are difficult to find using traditional data mining techniques. The method introduced in this paper aims to overcome these flaws. With our approach, a user can retrieve correct information from UCARs and learn the influence of different conditions by checking the corresponding ECARs. Experimental results demonstrate the effectiveness of our proposed approach.

12.
In this paper, we report our success in building efficient scalable classifiers by exploring the capabilities of modern relational database management systems (RDBMS). In addition to high classification accuracy, the unique features of the approach include its high training speed, linear scalability, and simplicity of implementation. More importantly, the major computation required by the approach can be implemented using standard functions provided by a modern relational DBMS. Moreover, with an effective rule pruning strategy, the algorithm proposed in this paper can produce a compact set of classification rules. The results of experiments conducted for performance evaluation and analysis are presented.

13.
Efficient Incremental Maintenance of Frequent Patterns with FP-Tree   (cited 3 times: 0 self-citations, 3 by others)
Mining frequent patterns has been widely studied in the data mining area. However, little work has been done on maintaining mined patterns when the database constantly receives an influx of fresh data. In these dynamic scenarios, efficient maintenance of the discovered patterns is crucial. Most existing methods need to scan the entire database repeatedly, which is an obvious disadvantage. In this paper, an efficient incremental mining algorithm, Incremental-Mining (IM), is proposed for maintaining the frequent patterns when new incremental data arrive. Based on the frequent pattern tree (FP-tree) structure, IM reuses as much as possible from the previous mining process, and requires scanning the original data at most once. Furthermore, IM can directly identify the differential set of frequent patterns, which may be more informative to users. Moreover, IM can deal with changing thresholds as well as changing data, thus providing a full maintenance scheme. IM has been implemented, and the performance study shows that it outperforms three other incremental algorithms: FUP, DB-tree, and re-running frequent pattern growth (FP-growth).
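The key idea, reusing previously accumulated counts so that old data need not be rescanned for each new batch, can be shown with a toy incremental counter. This sketch counts itemsets only up to size 2 and is not the IM algorithm itself; class and method names are invented for the example:

```python
from collections import Counter
from itertools import combinations

class IncrementalMiner:
    """Toy incremental frequent-itemset counter (1- and 2-itemsets only)."""
    def __init__(self, min_support):
        self.min_support = min_support   # relative support threshold
        self.counts = Counter()          # itemset -> count, kept across batches
        self.n = 0                       # transactions seen so far
    def add_batch(self, transactions):
        # Only the new batch is scanned; counts from earlier batches are reused.
        for t in transactions:
            items = sorted(set(t))
            self.n += 1
            for i in items:
                self.counts[(i,)] += 1
            for pair in combinations(items, 2):
                self.counts[pair] += 1
    def frequent(self):
        thresh = self.min_support * self.n
        return {s: c for s, c in self.counts.items() if c >= thresh}
```

A real FP-tree-based method keeps a compressed prefix-tree structure instead of explicit per-itemset counts, which is what makes it scale beyond pairs.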

14.
With the growing popularity of the World Wide Web, large volumes of user access data have been gathered automatically by Web servers and stored in Web logs. Discovering and understanding user behavior patterns from log files can support personalized Web recommendation services. In this paper, a novel clustering method for log files, called Clustering large Weblog based on Key Path Model (CWKPM), is presented; it is based on a user browsing key path model and derives user behavior profiles. Compared with the previous Boolean model, the key path model considers the major features of users' accesses to the Web: order, contiguity and duplication. Moreover, for clustering, it has fewer dimensions. The analysis and experiments show that CWKPM is an efficient and effective approach for clustering large, high-dimensional Web logs.

15.
Mining frequent patterns from datasets is one of the key successes of data mining research. Currently, most studies focus on data sets in which the elements are independent, such as the items in a market basket. However, objects in the real world often have close relationships with each other. How to extract frequent patterns from these relations is the objective of this paper. The authors use graphs to model the relations, and select a simple type for analysis. Combining graph theory with algorithms for generating frequent patterns, a new algorithm called Topology, which can mine these graphs efficiently, has been proposed. The performance of the algorithm is evaluated in experiments with synthetic datasets and real data. The experimental results show that Topology does the job well. At the end of the paper, potential improvements are mentioned.
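As a minimal illustration of mining relations modeled as graphs (a first step only; the Topology algorithm itself mines richer patterns than single edges), consider counting the edges that occur in at least min_support graphs of a collection. The function name and data layout are assumptions for the sketch:

```python
from collections import Counter

def frequent_edges(graphs, min_support):
    """Return edges appearing in at least min_support graphs of the collection.

    Each graph is a list of (u, v) edges; edges are treated as undirected.
    """
    counts = Counter()
    for g in graphs:
        for edge in {frozenset(e) for e in g}:   # count each edge once per graph
            counts[edge] += 1
    return {tuple(sorted(e)): c for e, c in counts.items() if c >= min_support}
```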

16.
In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1 - ε)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches.

17.
Tracking clusters in evolving data streams over sliding windows   (cited 6 times: 4 self-citations, 2 by others)
Mining data streams poses great challenges due to the limited memory availability and real-time query response requirement. Clustering an evolving data stream is especially interesting because it captures not only the changing distribution of clusters but also the evolving behaviors of individual clusters. In this paper, we present a novel method for tracking the evolution of clusters over sliding windows. In our SWClustering algorithm, we combine the exponential histogram with temporal cluster features and propose a novel data structure, the Exponential Histogram of Cluster Features (EHCF). The exponential histogram is used to handle the in-cluster evolution, and the temporal cluster features represent the change of the cluster distribution. Our approach has several advantages over existing methods: (1) the quality of the clusters is improved because the EHCF captures the distribution of recent records precisely; (2) compared with previous methods, the mechanism employed to adaptively maintain the in-cluster synopsis can track the cluster evolution better, while consuming much less memory; (3) the EHCF provides a flexible framework for analyzing the cluster evolution and tracking a specific cluster efficiently without interfering with other clusters, thus reducing the consumption of computing resources for data stream clustering. Both the theoretical analysis and extensive experiments show the effectiveness and efficiency of the proposed method. Aoying Zhou is currently a Professor in Computer Science at Fudan University, Shanghai, P.R. China. He won his Bachelor and Master degrees in Computer Science from Sichuan University in Chengdu, Sichuan, P.R. China in 1985 and 1988, respectively, and his Ph.D. degree from Fudan University in 1993. He has served as a member or chair of the program committees of many international conferences, such as WWW, SIGMOD, VLDB, EDBT, ICDCS, ER, DASFAA, PAKDD, and WAIM.
His papers have been published in ACM SIGMOD, VLDB, ICDE, and several other international journals. His research interests include Data mining and knowledge discovery, XML data management, Web mining and searching, data stream analysis and processing, peer-to-peer computing. Feng Cao is currently an R&D engineer in IBM China Research Laboratories. He received a B.E. degree from Xi'an Jiao Tong University, Xi'an, P.R. China, in 2000 and an M.E. degree from Huazhong University of Science and Technology, Wuhan, P.R. China, in 2003. From October 2004 to March 2005, he worked in Fudan-NUS Competency Center for Peer-to-Peer Computing, Singapore. In 2006, he received his Ph.D. degree from Fudan University, Shanghai, P.R. China. His current research interests include data mining and data stream. Weining Qian is currently an Assistant Professor in computer science at Fudan University, Shanghai, P.R. China. He received his M.S. and Ph.D. degree in computer science from Fudan University in 2001 and 2004, respectively. He is supported by Shanghai Rising-Star Program under Grant No. 04QMX1404 and National Natural Science Foundation of China (NSFC) under Grant No. 60673134. He served as the program committee member of several international conferences, including DASFAA 2006, 2007 and 2008, APWeb/WAIM 2007, INFOSCALE 2007, and ECDM 2007. His papers have been published in ICDE, SIAM DM, and CIKM. His research interests include data stream query processing and mining, and large-scale distributed computing for database applications. Cheqing Jin is currently an Assistant Professor in Computer Science at East China University of Science and Technology. He received his Bachelor and Master degrees in Computer Science from Zhejiang University in Hangzhou, P.R. China in 1999 and 2002, respectively, and the Ph.D. degree from Fudan University, Shanghai, P.R. China. He worked as a Research Assistant at E-business Technology Institute, the Hong Kong University from December 2003 to May 2004. 
His current research interests include data mining and data streams.
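The exponential histogram underlying the EHCF can be sketched in isolation: buckets hold counts that are powers of two, and whenever more than k buckets share a size, the two oldest are merged. This is a generic EH over arrival counts, without the cluster-feature payload the paper attaches to each bucket:

```python
class ExponentialHistogram:
    """Toy exponential histogram: at most k buckets per size, sizes are powers of 2."""
    def __init__(self, k=2):
        self.k = k
        self.buckets = []            # bucket sizes, newest first
    def add(self):
        self.buckets.insert(0, 1)    # each arrival starts as a size-1 bucket
        size = 1
        while self.buckets.count(size) > self.k:
            idxs = [i for i, b in enumerate(self.buckets) if b == size]
            i, j = idxs[-2], idxs[-1]        # the two oldest buckets of this size
            self.buckets[i] = 2 * size       # merge them into one larger bucket
            del self.buckets[j]
            size *= 2
    def total(self):
        return sum(self.buckets)     # exact here; approximate once buckets expire
```

In a sliding-window setting, the oldest bucket is dropped when it falls outside the window, which is where the bounded approximation error of the EH comes from.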

18.
Many supervised machine learning tasks can be cast as multi-class classification problems. Support vector machines (SVMs) excel at binary classification problems, but the elegant theory behind large-margin hyperplanes cannot be easily extended to their multi-class counterparts. On the other hand, it was shown that the decision hyperplanes for binary classification obtained by SVMs are equivalent to the solutions obtained by Fisher's linear discriminant on the set of support vectors. Discriminant analysis approaches are well known to learn discriminative feature transformations in the statistical pattern recognition literature and can be easily extended to multi-class cases. The use of discriminant analysis, however, has not been fully explored in the data mining literature. In this paper, we explore the use of discriminant analysis for multi-class classification problems. We evaluate the performance of discriminant analysis on a large collection of benchmark datasets and investigate its usage in text categorization. Our experiments suggest that discriminant analysis provides a fast, efficient yet accurate alternative for general multi-class classification problems. Tao Li is currently an assistant professor in the School of Computer Science at Florida International University. He received his Ph.D. degree in Computer Science from the University of Rochester in 2004. His primary research interests are: data mining, machine learning, bioinformatics, and music information retrieval. Shenghuo Zhu is currently a researcher at NEC Laboratories America, Inc. He received his B.E. from Zhejiang University in 1994, B.E. from Tsinghua University in 1997, and Ph.D. degree in Computer Science from the University of Rochester in 2003. His primary research interests include information retrieval, machine learning, and data mining. Mitsunori Ogihara received a Ph.D. in Information Sciences at Tokyo Institute of Technology in 1993.
He is currently Professor and Chair of the Department of Computer Science at the University of Rochester. His primary research interests are data mining, computational complexity, and molecular computation.
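The Fisher criterion mentioned above is concrete enough to compute by hand: for two classes, the discriminant direction is w = S_w⁻¹(m₁ - m₂), with S_w the within-class scatter matrix. The following self-contained 2-D sketch is generic LDA, not the authors' multi-class code, and the sample data in the test are invented:

```python
def fisher_direction(X1, X2):
    """Fisher discriminant direction w = Sw^-1 (m1 - m2) for 2-D two-class data."""
    def mean(X):
        n = len(X)
        return [sum(x[0] for x in X) / n, sum(x[1] for x in X) / n]
    m1, m2 = mean(X1), mean(X2)
    # Within-class scatter: Sw = sum over classes and points of (x - m)(x - m)^T.
    Sw = [[0.0, 0.0], [0.0, 0.0]]
    for X, m in ((X1, m1), (X2, m2)):
        for x in X:
            d = [x[0] - m[0], x[1] - m[1]]
            Sw[0][0] += d[0] * d[0]; Sw[0][1] += d[0] * d[1]
            Sw[1][0] += d[1] * d[0]; Sw[1][1] += d[1] * d[1]
    # Invert the 2x2 scatter matrix explicitly.
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
    inv = [[Sw[1][1] / det, -Sw[0][1] / det],
           [-Sw[1][0] / det, Sw[0][0] / det]]
    dm = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]
```

For two classes separated along the x-axis with isotropic scatter, the returned direction lies along the x-axis, as expected.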

19.
ARMiner: A Data Mining Tool Based on Association Rules   (cited 3 times: 0 self-citations, 3 by others)
In this paper, ARMiner, a data mining tool based on association rules, is introduced. Beginning with the system architecture, its characteristics and functions are discussed in detail, including data transfer, concept hierarchy generalization, mining rules with negative items, and re-development of the system. An example of the tool's application is also shown. Finally, some issues for future research are presented.

20.
The study of database technologies, or more generally, the technologies of data and information management, is an important and active research field. Recently, many exciting results have been reported. In this fast-growing field, Chinese researchers play more and more active roles. Research papers from Chinese scholars, both in China and abroad, appear in prestigious academic forums. In this paper, we, nine young Chinese researchers working in the United States, present concise surveys and report our recent progress on the selected fields that we are working on. Although the paper covers only a small number of topics and the selection of topics is far from balanced, we hope that this effort will attract more and more researchers, especially those in China, to the frontiers of database research and promote collaboration. For the obvious reason, the authors are listed alphabetically, while the sections are arranged in the order of the author list.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号