Similar Documents
1.
An automated method is presented for the design of linear tree classifiers, i.e. tree classifiers in which a decision based on a linear sum of features is carried out at each node. The method exploits the discriminability of Tomek links joining opposed pairs of data points in multidimensional feature space to produce a hierarchically structured piecewise linear decision function. The corresponding decision surface is optimized by a gradient descent that maximizes the number of Tomek links cut by each linear segment of the decision surface, followed by training each node's linear decision segment on the data associated with that node. Experiments on real data obtained from ship images and character images suggest that the accuracy of the tree classifier designed by this scheme is comparable to that of the k-NN classifier while providing much greater decision speed, and that the trade-off between speed and accuracy in pattern classification can be controlled by bounding the number of features used at each node of the tree. Further experiments comparing the classification errors of our tree classifier with those of the tree classifier produced by the Mui/Fu technique indicate that our tree classifier is no less accurate and often much faster than the Mui/Fu classifier.
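As an illustration of the Tomek-link primitive the method is built on, the sketch below finds all Tomek links in a labeled point set: pairs of opposite-class points that are each other's nearest neighbors. This is a minimal numpy rendering of the standard definition, not the authors' code; the gradient-descent optimization of the cutting hyperplanes is omitted.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) forming Tomek links: i and j are
    mutual nearest neighbors and carry opposite class labels."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    np.fill_diagonal(d, np.inf)                         # mask self-distances
    nn = d.argmin(axis=1)                               # nearest neighbor of each point
    return [(i, j) for i, j in enumerate(nn)
            if nn[j] == i and y[i] != y[j] and i < j]   # mutual NN, opposed labels

# Toy two-class data: the links concentrate along the class boundary,
# which is where the linear decision segments are placed.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(tomek_links(X, y))
```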

2.
With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this work, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface. We have experimented with apriori and fp-tree-based association mining, k-means clustering, the k-nearest neighbor classifier, and decision tree construction. The main results from our experiments are as follows: 1) Among full replication, optimized full locking, and cache-sensitive locking, there is no clear winner; each of these three techniques can outperform the others depending upon machine and dataset parameters, and all three perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be crucial for achieving high performance.
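The full-replication technique is the easiest of the five to picture: every thread updates a private copy of the reduction object, and the copies are merged sequentially at the end, so the parallel phase needs no locks at all. Below is a minimal Python sketch of that structure for a toy item-counting reduction; it illustrates the pattern only, since CPython's global interpreter lock prevents real shared-memory speedups for pure-Python code.

```python
from collections import Counter
from threading import Thread

def count_items(transactions, n_threads=4):
    """Full replication: each thread owns a private Counter (its copy of
    the reduction object), so no locking is needed during the scan."""
    replicas = [Counter() for _ in range(n_threads)]

    def worker(tid):
        for t in transactions[tid::n_threads]:   # cyclic split of the work
            for item in t:
                replicas[tid][item] += 1         # update the private copy only

    threads = [Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

    merged = Counter()                           # sequential merge phase
    for c in replicas:
        merged.update(c)
    return merged

print(count_items([["a", "b"], ["b", "c"], ["a", "c"], ["a"]]))
```

The locking variants trade this merge cost for synchronization cost: instead of private copies, threads share one reduction object and guard its elements with locks at varying granularities.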

3.
Lazy Learning of Bayesian Rules
The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. A number of approaches have sought to alleviate this problem. A Bayesian tree learning algorithm builds a decision tree and generates a local naive Bayesian classifier at each leaf. The tests leading to a leaf can alleviate attribute inter-dependencies for the local naive Bayesian classifier. However, Bayesian tree learning still suffers from the small disjunct problem of tree learning. While inferred Bayesian trees demonstrate low average prediction error rates, there is reason to believe that error rates will be higher for those leaves with few training examples. This paper proposes the application of lazy learning techniques to Bayesian tree induction and presents the resulting lazy Bayesian rule learning algorithm, called LBR. This algorithm can be justified by a variant of Bayes' theorem which supports a weaker conditional attribute independence assumption than is required by naive Bayes. For each test example, it builds a most appropriate rule with a local naive Bayesian classifier as its consequent. It is demonstrated that the computational requirements of LBR are reasonable in a wide cross-section of natural domains. Experiments with these domains show that, on average, this new algorithm obtains lower error rates significantly more often than the reverse when compared to a naive Bayesian classifier, C4.5, a Bayesian tree learning algorithm, a constructive Bayesian classifier that eliminates attributes and constructs new attributes using Cartesian products of existing nominal attributes, and a lazy decision tree learning algorithm. It also outperforms, although the result is not statistically significant, a selective naive Bayesian classifier.

4.
Recently, many data mining techniques have been applied to analyze and interpret the huge volume of data collected from wireless sensor networks. Such techniques, especially classification and clustering, have been used to relate raw data and assign a class label (a useful interpretation) to the set of attribute values received from the sensor nodes. However, building a classifier such as a decision tree is a costly process in terms of energy consumption, due to the large size of the resulting tree. In this article, we propose a model-checking-based classification method that cuts off parts of the decision tree while keeping its performance fixed. The pruning process aims to reduce the size of the tree and thus the amount of energy needed to maintain the classifier. Results show an energy reduction of 10-15% compared with a non-pruned decision tree.
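The article's model-checking criterion is not reproduced here, but the overall pruning loop can be sketched with scikit-learn's cost-complexity pruning as a stand-in: keep shrinking the tree for as long as held-out accuracy stays within a small tolerance of the unpruned tree. The dataset and the 1% tolerance are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
Xtr, Xva, ytr, yva = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
baseline = full.score(Xva, yva)
alphas = full.cost_complexity_pruning_path(Xtr, ytr).ccp_alphas

best = full
for a in alphas:                     # larger alpha => more aggressive pruning
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(Xtr, ytr)
    if (t.score(Xva, yva) >= baseline - 0.01      # performance kept "fixed"
            and t.tree_.node_count < best.tree_.node_count):
        best = t

print(f"nodes: {full.tree_.node_count} -> {best.tree_.node_count}, "
      f"accuracy: {baseline:.3f} -> {best.score(Xva, yva):.3f}")
```

A smaller tree means fewer attribute tests per classification, which is where the sensor-node energy saving comes from.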

5.
A successful application of data mining to bioinformatics is protein classification. A number of techniques have been developed to classify proteins according to important features in their sequences, secondary structures, or three-dimensional structures. In this paper, we introduce a novel approach to protein classification based on significant patterns discovered on the surface of a protein. We define a notion called the α-surface. We discuss the geometric properties of the α-surface and present an algorithm that calculates the α-surface from a finite set of points in R³. We apply the algorithm to extracting the α-surface of a protein and use a pattern discovery algorithm to discover frequently occurring patterns on the surfaces. The pattern discovery algorithm utilizes a new index structure called the ΔB⁺-tree. We use these patterns to classify the proteins. While most existing techniques focus on the binary classification problem, we apply our approach to classifying three families of proteins. Experimental results show the good performance of the proposed approach.

6.
沈思倩, 毛宇光, 江冠儒. Computer Science (计算机科学), 2017, 44(6): 139-143, 149
This paper studies how to incorporate differential privacy protection when performing decision tree analysis on incomplete data sets. It first briefly reviews the differentially private ID3 algorithm and the differentially private random forest decision tree algorithm, then addresses the defects and shortcomings of these algorithms and proposes a differentially private random forest decision tree algorithm based on the exponential mechanism. Finally, for incomplete data sets, it proposes a new missing-value handling method called WP (Weight Partition), which allows the decision tree analysis to satisfy differential privacy while achieving higher prediction accuracy and adaptability, without requiring imputation. Experiments show that the proposed method applies to both the Laplace and the exponential mechanism, and to both the ID3 algorithm and the random forest decision tree algorithm.
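As a hedged illustration of the building block both mechanisms rely on, the sketch below applies the Laplace mechanism to a leaf's class histogram: adding or removing one record changes each count by at most one (L1 sensitivity 1), so noise drawn from Lap(1/ε) makes the released counts ε-differentially private. The counts and ε values are made up for the demonstration; the paper's exponential-mechanism attribute selection and WP missing-value handling are not reproduced.

```python
import numpy as np

def dp_leaf_label(class_counts, epsilon, rng):
    """Laplace mechanism on a leaf's class histogram (sensitivity 1):
    release noisy counts and the majority label derived from them."""
    noisy = class_counts + rng.laplace(scale=1.0 / epsilon, size=len(class_counts))
    return int(np.argmax(noisy)), noisy

rng = np.random.default_rng(0)
counts = np.array([30.0, 5.0, 2.0])     # records per class reaching this leaf
for eps in (0.1, 1.0, 10.0):            # smaller eps => stronger privacy
    label, noisy = dp_leaf_label(counts, eps, rng)
    print(f"eps={eps:>4}: label={label}, noisy counts={np.round(noisy, 1)}")
```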

7.
Robust automated vortex detection algorithms are needed to facilitate the exploration of large-scale turbulent fluid flow simulations. Unfortunately, robust non-local vortex detection algorithms are computationally intractable for large data sets, while local algorithms, though computationally tractable, lack robustness. We argue that the deficiencies inherent to the local definitions occur because of two fundamental issues: the lack of a rigorous definition of a vortex and the fact that a vortex is an intrinsically non-local phenomenon. As a first step towards addressing this problem, we demonstrate the use of machine learning techniques to enhance the robustness of local vortex detection algorithms. We motivate the presence of an expert-in-the-loop using empirical results based on machine learning techniques. We employ adaptive boosting to combine a suite of widely used, local vortex detection algorithms, which we term weak classifiers, into a robust compound classifier. Fundamentally, the training phase of the algorithm, in which an expert manually labels small, spatially contiguous regions of the data, incorporates non-local information into the resulting compound classifier. We demonstrate the efficacy of our approach by applying the compound classifier to two data sets obtained from computational fluid dynamics simulations. Our results demonstrate that the compound classifier has a reduced misclassification rate relative to the component classifiers.
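A hedged sketch of the compound-classifier construction: treat each weak, local detector's verdict at a grid point as one feature, and boost decision stumps over those features against the expert's labels. The synthetic "detectors" below are random stand-ins for criteria such as thresholded vorticity; only the boosting structure matches the paper.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
truth = rng.integers(0, 2, 1000)                    # expert's vortex labels
# Five weak detectors, each right ~65% of the time (synthetic stand-ins).
weak = np.stack([np.where(rng.random(1000) < 0.35, 1 - truth, truth)
                 for _ in range(5)], axis=1)

Xtr, Xte, ytr, yte = train_test_split(weak, truth, random_state=0)
compound = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=0).fit(Xtr, ytr)

print("weak detector accuracies:", (Xte == yte[:, None]).mean(axis=0).round(2))
print("compound accuracy:       ", round(compound.score(Xte, yte), 2))
```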

8.
A novel pruning approach using expert knowledge for data-specific pruning
Classification is an important data mining task that discovers hidden knowledge from labeled datasets. Most approaches to pruning assume that all datasets are equally uniform and equally important, so they apply equal pruning to all of them. However, in real-world classification problems the datasets are not equal, and applying an equal pruning rate tends to generate a decision tree with large size and a high misclassification rate. We approach the problem by first investigating the properties of each dataset and then deriving a data-specific pruning value using expert knowledge, which is used to design pruning techniques that prune decision trees close to perfection. An efficient pruning algorithm dubbed EKBP is proposed; it is very general, as we are free to use any learning algorithm as the base classifier. We have implemented our proposed solution and experimentally verified its effectiveness on forty real-world benchmark datasets from the UCI machine learning repository. In all these experiments, the proposed approach shows it can dramatically reduce the tree size while enhancing or retaining the level of accuracy.

9.
Univariate decision trees are classifiers currently used in many data mining applications. This classifier discovers partitions in the input space via hyperplanes that are orthogonal to the axes of attributes, producing a model that can be understood by human experts. One disadvantage of univariate decision trees is that they produce complex and inaccurate models when decision boundaries are not orthogonal to axes. In this paper we introduce Fisher's Tree, a classifier that takes advantage of the dimensionality reduction of Fisher's linear discriminant and uses the decomposition strategy of decision trees to come up with an oblique decision tree. Our proposal generates an artificial attribute that is used to split the data in a recursive way. The Fisher's decision tree induces oblique trees whose accuracy, size, number of leaves and training time are competitive with respect to other decision trees reported in the literature. We use more than ten publicly available data sets to demonstrate the effectiveness of our method.
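The node-splitting step can be sketched directly: compute Fisher's direction w = Sw⁻¹(μ₁ − μ₀), project the data onto w to obtain the artificial attribute, and split at a threshold. Below is a minimal numpy version of a single node; the full method applies this recursively, and a proper implementation would search the threshold by impurity rather than take the midpoint of the projected class means.

```python
import numpy as np

def fisher_split(X, y):
    """One oblique-tree node: Fisher direction w = Sw^{-1}(mu1 - mu0),
    threshold at the midpoint of the projected class means."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), mu1 - mu0)
    z = X @ w                                    # the artificial attribute
    thr = (z[y == 0].mean() + z[y == 1].mean()) / 2.0
    return w, thr

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([2, 2], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
w, thr = fisher_split(X, y)
print("single-split accuracy:", ((X @ w > thr).astype(int) == y).mean())
```

Because w is a full linear combination of the attributes, the resulting hyperplane need not be axis-orthogonal, which is exactly the univariate-tree limitation the method removes.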

10.
An early warning system can help to identify at-risk students, or predict student learning performance, by analyzing learning portfolios recorded in a learning management system (LMS). Although previous studies have shown the applicability of determining learner behaviors from an LMS, most investigated datasets are not assembled from online learning courses or from the whole set of learning activities undertaken on courses that can be analyzed to evaluate students' academic achievement. Previous studies generally focus on the construction of predictors for learner performance evaluation after a course has ended, and neglect the practical value of an "early warning" system that predicts at-risk students while a course is in progress. We collected the complete learning activities of an online undergraduate course and applied data-mining techniques to develop an early warning system. Our results showed that time-dependent variables extracted from the LMS are critical factors for online learning. After students have used an LMS for a period of time, our early warning system effectively characterizes their current learning performance. Data-mining techniques are useful in the construction of early warning systems; based on our experimental results, a classification and regression tree (CART) supplemented by AdaBoost is the best classifier for the evaluation of learning performance investigated by this study.
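A hedged sketch of the winning configuration, CART boosted with AdaBoost, using scikit-learn; the features below are random placeholders standing in for the time-dependent LMS variables (logins, forum activity, assignment submissions over time), since the course data is not public.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder for per-student LMS activity features; class 1 = at-risk.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           weights=[0.7, 0.3], random_state=0)

cart = DecisionTreeClassifier(max_depth=3, random_state=0)
boosted = AdaBoostClassifier(cart, n_estimators=100, random_state=0)

for name, clf in [("CART alone", cart), ("CART + AdaBoost", boosted)]:
    print(f"{name:16s} 5-fold accuracy: "
          f"{cross_val_score(clf, X, y, cv=5).mean():.3f}")
```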

11.
Basak J, Kothari R. Neural Computation, 2004, 16(7): 1525-1544
In general, pattern classification algorithms assume that all the features are available during the construction of a classifier and its subsequent use. In many practical situations, data are recorded on different servers that are geographically apart, and each server observes features of local interest. The underlying infrastructure and other logistics (such as access control) in many cases do not permit continual synchronization. Each server thus has a partial view of the data, in the sense that feature subsets (not necessarily disjoint) are available at each server. In this article, we present a classification algorithm for this distributed, vertically partitioned data. We assume that local classifiers can be constructed based on the local partial views of the data available at each server. These local classifiers can be any one of the many standard classifiers (e.g., neural networks, decision trees, k-nearest neighbor). Often these local classifiers are constructed to support decision making at each location, and our focus is not on these individual local classifiers. Rather, our focus is on constructing a classifier that can use these local classifiers to achieve an error rate that is as close as possible to that of a classifier having access to the entire feature set. We empirically demonstrate the efficacy of the proposed algorithm and also provide theoretical results quantifying the loss incurred compared to the situation where the entire feature set is available to any single classifier.
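The scheme lends itself to a short sketch: train one classifier per feature subset (one per "server"), then train a combiner on the local classifiers' class-probability outputs. The logistic-regression combiner and the three-way split below are illustrative assumptions, not the article's specific combination rule, and a careful implementation would build the combiner's training set from out-of-fold predictions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

parts = [slice(0, 10), slice(10, 20), slice(20, 30)]   # one view per server
local = [DecisionTreeClassifier(max_depth=4, random_state=0)
         .fit(Xtr[:, p], ytr) for p in parts]

def meta(Xs):   # combiner features: each local model's positive-class prob.
    return np.column_stack([m.predict_proba(Xs[:, p])[:, 1]
                            for m, p in zip(local, parts)])

combiner = LogisticRegression().fit(meta(Xtr), ytr)
for i, (m, p) in enumerate(zip(local, parts)):
    print(f"server {i}: local accuracy {m.score(Xte[:, p], yte):.3f}")
print(f"combined: accuracy {combiner.score(meta(Xte), yte):.3f}")
```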

12.
Context: Building defect prediction models in large organizations has many challenges due to limited resources and tight schedules in the software development lifecycle. It is not easy to collect data, utilize any type of algorithm and build a permanent model at once. We have conducted a study in a large telecommunications company in Turkey to employ a software measurement program and to predict pre-release defects. Based on our prior publication, we have shared our experience in terms of the project steps (i.e. challenges and opportunities). We have further introduced new techniques that improve our earlier results.
Objective: In our previous work, we built similar predictors using data representative of US software development. Our task here was to check whether those predictors were specific solely to US organizations or applied to a broader class of software.
Method: We have presented our approach and results in the form of an experience report. Specifically, we have made use of different techniques for improving the information content of the software data and the performance of a Naïve Bayes classifier in the prediction model that is locally tuned for the company. We have increased the information content of the software data by using module dependency data and improved the performance by adjusting the hyper-parameter (decision threshold) of the Naïve Bayes classifier. We have reported and discussed our results in terms of defect detection rates and false alarms. We also carried out a cost-benefit analysis to show that our approach can be efficiently put into practice.
Results: Our general result is that defect predictors that generalize across a wide range of software (in both US and Turkish organizations) do exist. Our specific results indicate that, for the organization subject to this study, the use of version history information along with code metrics decreased false alarms by 22%, the use of dependencies between modules further reduced false alarms by 8%, and the decision threshold optimization for the Naïve Bayes classifier using code metrics and version history information further improved false alarms by 30% in comparison to a prediction using only code metrics and a default decision threshold.
Conclusion: Implementing statistical techniques and machine learning in a real-life scenario is a difficult yet possible task. Using simple statistical and algorithmic techniques produces an average detection rate of 88%. Although using dependency data improves our results, it is difficult to collect and analyze such data in general. Therefore, we would recommend optimizing the hyper-parameter of the proposed technique, Naïve Bayes, to calibrate the defect prediction model rather than employing more complex classifiers. We also recommend that researchers who explore statistical and algorithmic methods for defect prediction should spend less time on their algorithms and more time on studying the pragmatic considerations of large organizations.
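The decision-threshold adjustment is easy to make concrete: score modules with a Naïve Bayes classifier and sweep the threshold on the defect-class posterior, trading detection rate (pd) against false alarms (pf). The sketch below uses synthetic stand-ins for module-level code metrics; the company data and the paper's tuned threshold are not reproduced.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for module-level code metrics; class 1 = defective.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.85, 0.15],
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

proba = GaussianNB().fit(Xtr, ytr).predict_proba(Xte)[:, 1]

for thr in (0.3, 0.5, 0.7):          # 0.5 is the default decision threshold
    pred = proba >= thr
    pd = pred[yte == 1].mean()       # defect detection rate
    pf = pred[yte == 0].mean()       # false alarm rate
    print(f"threshold {thr:.1f}: pd={pd:.2f}  pf={pf:.2f}")
```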

13.
Time-series classification (TSC) problems present a specific challenge for classification algorithms: how to measure similarity between series. A shapelet is a time-series subsequence that allows for TSC based on local, phase-independent similarity in shape. Shapelet-based classification uses the similarity between a shapelet and a series as a discriminatory feature. One benefit of the shapelet approach is that shapelets are comprehensible and can offer insight into the problem domain. The original shapelet-based classifier embeds the shapelet-discovery algorithm in a decision tree, and uses information gain to assess the quality of candidates, finding a new shapelet at each node of the tree through an enumerative search. Subsequent research has focused mainly on techniques to speed up the search. We examine how best to use the shapelet primitive to construct classifiers. We propose a single-scan shapelet algorithm that finds the best k shapelets, which are used to produce a transformed dataset in which each of the k features represents the distance between a time series and a shapelet. The primary advantages over the embedded approach are that the transformed data can be used in conjunction with any classifier, and that there is no recursive search for shapelets. We demonstrate that the transformed data, in conjunction with more complex classifiers, gives greater accuracy than the embedded shapelet tree. We also evaluate three similarity measures that produce results equivalent to information gain in less time. Finally, we show that by conducting post-transform clustering of shapelets, we can enhance the interpretability of the transformed data. We conduct our experiments on 29 datasets: 17 from the UCR repository, and 12 we provide ourselves.
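The transform's core is a single distance primitive: the distance between a series and a shapelet is the minimum Euclidean distance over all subsequences of the shapelet's length, and stacking these distances for the k chosen shapelets yields the transformed dataset. A minimal numpy sketch follows, with a hand-picked shapelet standing in for the paper's information-gain search.

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance from the shapelet to every
    subsequence of the series with the same length."""
    windows = np.lib.stride_tricks.sliding_window_view(series, len(shapelet))
    return np.sqrt(((windows - shapelet) ** 2).sum(axis=1)).min()

def shapelet_transform(dataset, shapelets):
    """Row i, column j of the result: distance from series i to
    shapelet j -- features usable by any downstream classifier."""
    return np.array([[shapelet_distance(s, sh) for sh in shapelets]
                     for s in dataset])

rng = np.random.default_rng(0)
bump = np.concatenate([np.ones(5), np.zeros(15)])   # a square-bump motif
data = np.vstack(
    [np.roll(bump, rng.integers(15)) + rng.normal(0, 0.1, 20)   # motif, any phase
     for _ in range(5)]
    + [rng.normal(0, 0.1, 20) for _ in range(5)])               # no motif
shapelets = [np.ones(5)]                 # one hand-picked candidate shapelet
print(shapelet_transform(data, shapelets).round(2))
```

The first five rows (series containing the motif at any phase) get small distances while the last five do not, which is the phase-independent similarity the abstract describes.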

14.
The management of atmospheric pollution using video is not yet widespread. However, it is an efficient way to evaluate the polluting emissions coming from large industrial facilities when traditional sensors are not usable. This paper presents a comparison of different classifiers for a monitoring system of polluting smokes. The data used in this work come from a system of video analysis and signal processing. The database includes the pollution level of puffs of smoke as defined by an expert. Six machine learning techniques are tested and compared to classify the puffs of smoke: k-nearest neighbor, naïve Bayes classifier, artificial neural network, decision tree, support vector machine and a fuzzy model. The parameters of each type of classifier are split into three categories: learned parameters, parameters determined by a first step of the experimentation, and parameters set by the programmer. We compare the results of the best classifier of each type depending on the size of the learning set. A part of the discussion concerns the robustness of the classifiers when classes of interest are under-represented, such as the high pollution level in our data.
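A hedged sketch of the comparison protocol on a synthetic, imbalanced stand-in for the smoke data (the real database is not public, and scikit-learn has no direct counterpart for the fuzzy model, so five of the six families are shown). F1 on the rare class is reported because the abstract stresses under-represented classes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Imbalanced stand-in: class 1 (high pollution) is the rare class.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.9, 0.1],
                           random_state=0)

classifiers = {
    "k-NN": KNeighborsClassifier(5),
    "naive Bayes": GaussianNB(),
    "neural network": MLPClassifier(max_iter=2000, random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}
for name, clf in classifiers.items():
    pipe = make_pipeline(StandardScaler(), clf)   # scaling for k-NN/SVM/MLP
    f1 = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()
    print(f"{name:14s} F1 on the rare class: {f1:.3f}")
```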

15.
16.
Cross validation (CV) has been widely used for choosing and evaluating statistical models. The main purpose of this study is to explore the behavior of CV in tree-based models. We achieve this goal through an experimental approach, which compares a cross-validated tree classifier with the Bayes classifier that is ideal for the underlying distribution. The main observation of this study is that the difference between the testing and training errors of a cross-validated tree classifier and of the Bayes classifier empirically follows a linear regression relation. The slope and the coefficient of determination of the regression model can serve as performance measures of a cross-validated tree classifier. Moreover, simulation reveals that the performance of a cross-validated tree classifier depends on the geometry and parameters of the underlying distributions and on the sample sizes. Our study can explain, evaluate, and justify the use of CV in tree-based models when the sample size is relatively small.
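The experimental design can be sketched in a few lines: draw data from a distribution whose Bayes rule is known in closed form (here, two unit-variance Gaussians, so the Bayes rule is the sign of x and the Bayes error is Φ(−1) ≈ 0.159), then compare the cross-validated tree's training and testing errors against it. The depth and fold count are illustrative choices, not the study's settings.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
# Class 0 ~ N(-1, 1), class 1 ~ N(+1, 1): the Bayes rule is 1{x > 0}.
X = np.concatenate([rng.normal(-1, 1, n), rng.normal(1, 1, n)])[:, None]
y = np.array([0] * n + [1] * n)

bayes_err = ((X[:, 0] > 0).astype(int) != y).mean()   # empirical Bayes error

cv = cross_validate(DecisionTreeClassifier(max_depth=3, random_state=0),
                    X, y, cv=10, return_train_score=True)
train_err = 1 - cv["train_score"].mean()
test_err = 1 - cv["test_score"].mean()
print(f"Bayes error ~ {bayes_err:.3f}; CV tree: train {train_err:.3f}, "
      f"test {test_err:.3f}, gap {test_err - train_err:.3f}")
```

Repeating this over sample sizes and distribution parameters produces the (training, testing) error pairs whose regression relation against the Bayes-classifier baseline the study analyzes.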

17.
Nowadays, credit scoring is one of the most important topics in the banking sector. Credit scoring models have been widely used to facilitate the process of credit assessment. In this paper, the locally linear model tree algorithm (LOLIMOT) was applied to predict customers' credit status and to evaluate its performance. The algorithm is adapted to the credit scoring domain by means of data fusion and feature selection techniques. Two real-world credit data sets – Australian and German – from the UCI machine learning repository were selected to demonstrate the performance of our new classifier. The analytical results indicate that the improved LOLIMOT significantly increases prediction accuracy.

18.
Group decision making is a multi-criteria decision-making method applied in many fields. However, the use of group decision-making techniques in multi-class classification problems and rule generation has not been explored widely. This investigation developed a group decision classifier with particle swarm optimization (PSO) and a decision tree (GDCPSODT) for analyzing students' mathematics and science achievement, which is a multi-class classification problem involving rule generation. The PSO technique is employed to determine the weights of condition attributes; the decision tree is used to generate rules. To demonstrate the performance of the developed GDCPSODT model, other classifiers such as the Bayesian classifier, the k-nearest neighbor (KNN) classifier, the back-propagation neural network classifier with particle swarm optimization (BPNNPSO) and the radial basis function neural network classifier with PSO (RBFNNPSO) are applied to the same data. Experimental results indicate that the testing accuracy of GDCPSODT is higher than that of the other four classifiers. Furthermore, rules and some directions for improving academic achievement are provided by the GDCPSODT model. Therefore, the GDCPSODT model is a feasible and promising alternative for analyzing student-related mathematics and science achievement data.

19.
We present a new method for preprocessing and organizing discrete scalar volume data of any dimension on external storage, and describe our implementation of a visual navigation system using the method. The techniques have important applications for out-of-core visualization of volume data sets and image understanding. The applications include extracting isosurfaces in a manner that helps reduce both I/O and disk seek time, a priori topologically correct isosurface simplification (prior to extraction), and producing a visual atlas of all topologically distinct objects in the data set. The preprocessing algorithm computes regions of space that we call topological zone components, so that any isosurface component (contour) is completely contained in a zone component and all contours contained in a zone component are topologically equivalent. The algorithm also constructs a criticality tree that is related to the recently studied contour tree. However, unlike the contour tree, the zones and the criticality tree hierarchically organize the data set. We demonstrate that the techniques work on both irregularly and regularly gridded data, and can be extended to data sets with nonunique values by the mathematical analysis we call Digital Morse Theory (DMT), so that perturbation of the data set is not required. We present the results of our initial experiments with three-dimensional volume data (CT) and describe future extensions of our DMT organizing technology.

20.
LogitBoost is a popular Boosting variant that can be applied to either binary or multi-class classification. From a statistical viewpoint, LogitBoost can be seen as additive tree regression that minimizes the logistic loss. In this setting, it is still non-trivial to devise a sound multi-class LogitBoost compared with its binary counterpart. The difficulties are due to two important factors arising in the multi-class logistic loss. The first is the invariance property implied by the logistic loss, which makes the optimal classifier output non-unique: adding a constant to each component of the output vector does not change the loss value. The second is the density of the Hessian matrices that arise when computing tree node split gain and node value fittings. Oversimplification of this learning problem can lead to degraded performance. For example, the original LogitBoost algorithm is outperformed by ABC-LogitBoost thanks to the latter's more careful treatment of the above two factors. In this paper we propose new techniques to address the two main difficulties in the multi-class LogitBoost setting: (1) we adopt a vector tree model (i.e. each node value is a vector) in which a unique classifier output is guaranteed by adding a sum-to-zero constraint, and (2) we use an adaptive block coordinate descent that exploits the dense Hessian when computing tree split gain and node values. Higher classification accuracy and faster convergence rates are observed for a range of public data sets when compared to both the original and the ABC-LogitBoost implementations. We also discuss another possibility for coping with LogitBoost's dense Hessian matrix: we derive a loss similar to the multi-class logistic loss that guarantees a diagonal Hessian matrix. While this makes the optimization (by Newton descent) easier, we unfortunately observe degraded performance for this modification. We argue that working with the dense Hessian is likely unavoidable, making techniques like those proposed in this paper necessary for efficient implementations.
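The two difficulties can be stated concretely. In standard notation (not necessarily the paper's exact formulation), the multi-class logistic loss, its shift invariance, and the dense Hessian are:

```latex
% Multi-class logistic loss for scores F = (F_1,\dots,F_K), true class c:
\[
  L(F) = -F_c + \log \sum_{k=1}^{K} e^{F_k},
  \qquad p_k = \frac{e^{F_k}}{\sum_{j=1}^{K} e^{F_j}}.
\]
% Shift invariance (difficulty 1): for any constant a,
\[
  L(F + a\mathbf{1}) = -(F_c + a) + \log\Big(e^{a}\sum_{k=1}^{K} e^{F_k}\Big) = L(F),
\]
% so node values are pinned down only by the sum-to-zero constraint
% \sum_{k=1}^{K} F_k = 0. Dense Hessian (difficulty 2):
\[
  \frac{\partial^2 L}{\partial F_k \, \partial F_l} = p_k(\delta_{kl} - p_l),
\]
% whose off-diagonal entries -p_k p_l are non-zero whenever two classes
% both carry non-zero probability.
```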
