Found 20 similar documents; search time: 15 ms
1.
《Concurrency and Computation》2017,29(8)
In supervised classification, large training data sets are very common, and decision trees are widely used. However, because of bottlenecks such as memory restrictions, time complexity, or data complexity, many supervised classifiers, including the classical C4.5 tree, cannot directly handle big data. One solution to this problem is to design a highly parallelized learning algorithm. Motivated by this, we propose a parallelized C4.5 decision tree algorithm based on MapReduce (MR‐C4.5‐Tree) with 2 parallelized methods to build the tree nodes. First, an information entropy‐based parallelized attribute selection method (MR‐A‐S) on several subsets is proposed for MR‐C4.5‐Tree to determine the best splitting attribute and the cut points. Then, a parallel data splitting method (MR‐D‐S) is presented to partition the training data into subsets. Finally, we introduce the MR‐C4.5‐Tree learning algorithm, which grows the tree in a top‐down recursive way. In addition, the depth of the constructed decision tree, the number of samples, and the maximal class probability in each tree node are used as termination conditions to avoid the over‐partitioning problem. Experimental studies show the feasibility and good performance of the proposed parallelized MR‐C4.5‐Tree algorithm.
2.
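A Comparative Analysis of Decision Trees and Artificial Neural Networks  Total citations: 2, self-citations: 0, citations by others: 2
Decision trees and artificial neural networks are two important techniques for classification tasks in data mining. Each has its own characteristics, and the algorithm should be chosen to suit the type of data at hand. To illustrate these characteristics in depth, this paper reviews the principle and workflow of the C4.5 decision tree algorithm and the principle and classification workflow of the BP (backpropagation) network model, then applies both to a concrete example. The comparative analysis identifies and verifies several performance differences between the two techniques in classification.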
3.
4.
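Design and Implementation of Classification Mining in an Intelligent Evaluation System for University Students  Total citations: 5, self-citations: 0, citations by others: 5
This paper introduces C4.5, a fast, efficient, and relatively mature decision tree algorithm, and describes the design and implementation of this algorithm in the classification mining submodule of an intelligent evaluation system for university students.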
5.
《Concurrency and Computation》2018,30(10)
To address the time‐consuming determination of splitting attributes and splitting points in classic rank mutual information based decision trees, this paper establishes a fast rank mutual information based decision tree (FRMIDT) for classification problems. First, the proposed FRMIDT algorithm improves speed by using a max‐relevance and min‐redundancy criterion to remove redundant attributes when building each tree node. Then, the fuzzy c‐means algorithm is employed to determine the splitting points for further acceleration. Meanwhile, a parallel implementation is developed in the Map‐Reduce framework (MR‐FRMIDT) for medium or large‐scale data classification. Several comparative studies are conducted on UCI benchmark data sets. Compared with the classic rank mutual information based decision tree on 12 data sets, the proposed FRMIDT model effectively reduces computational time while maintaining testing accuracy. Furthermore, FRMIDT is comparable to other traditional decision tree classifiers, including BFT, C4.5, LAD, NBT, and SC. A comparison with monotonic decision trees based on 7 popular splitting measures on several data sets illustrates the effectiveness of FRMIDT in monotonic classification. Finally, experimental analysis on 6 other data sets shows that the proposed MR‐FRMIDT is feasible and has good parallel performance in reducing execution time and avoiding memory restrictions.
6.
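This paper describes the characteristics of intelligent tutoring systems and explains the principle of the C4.5 decision tree algorithm. C4.5 is used to build a model for evaluating students' online learning outcomes. The classification rules obtained from the model are then used for prediction, producing an accuracy evaluation table that verifies the flexibility and computational efficiency of the decision tree algorithm.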
7.
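Application of the C4.5 Algorithm to Insurance Customer Churn Analysis  Total citations: 11, self-citations: 0, citations by others: 11
Retaining and attracting customers is key to an insurance company's competitiveness, yet insurers currently analyze customer churn only roughly or judge it from experience. This paper applies attribute-oriented induction and the C4.5 decision tree algorithm to basic information about insurance customers, identifies the characteristics of customer churn, and helps insurers improve customer relationships in a targeted way.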
8.
Parallel Formulations of Decision-Tree Classification Algorithms  Total citations: 5, self-citations: 0, citations by others: 5
Anurag Srivastava Eui-Hong Han Vipin Kumar Vineet Singh 《Data mining and knowledge discovery》1999,3(3):237-261
Classification decision tree algorithms are used extensively for data mining in many domains such as retail target marketing, fraud detection, etc. Highly parallel algorithms for constructing classification decision trees are desirable for dealing with large data sets in a reasonable amount of time. Algorithms for building classification decision trees have a natural concurrency, but are difficult to parallelize due to the inherent dynamic nature of the computation. In this paper, we present parallel formulations of a classification decision tree learning algorithm based on induction. We describe two basic parallel formulations. One is based on the Synchronous Tree Construction Approach and the other is based on the Partitioned Tree Construction Approach. We discuss the advantages and disadvantages of using these methods and propose a hybrid method that employs the good features of these methods. We also provide an analysis of the cost of computation and communication of the proposed hybrid method. Moreover, experimental results on an IBM SP-2 demonstrate excellent speedups and scalability.
9.
An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization  Total citations: 43, self-citations: 0, citations by others: 43
Bagging and boosting are methods that generate a diverse ensemble of classifiers by manipulating the training data given to a base learning algorithm. Breiman has pointed out that they rely for their effectiveness on the instability of the base learning algorithm. An alternative approach to generating an ensemble is to randomize the internal decisions made by the base algorithm. This general approach has been studied previously by Ali and Pazzani and by Dietterich and Kong. This paper compares the effectiveness of randomization, bagging, and boosting for improving the performance of the decision-tree algorithm C4.5. The experiments show that in situations with little or no classification noise, randomization is competitive with (and perhaps slightly superior to) bagging but not as accurate as boosting. In situations with substantial classification noise, bagging is much better than boosting, and sometimes better than randomization.
10.
Using Model Trees for Classification  Total citations: 1, self-citations: 0, citations by others: 1
Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducer M5′, based on Quinlan's M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.
11.
Guangping Tang Wangdong Yang Kenli Li Yu Ye Guoqing Xiao Keqin Li 《Concurrency and Computation》2015,27(17):5076-5095
An optimized parallel algorithm is proposed to address the complicated backward substitution of cyclic reduction when solving tridiagonal linear systems. Adopting a hybrid parallel model, this algorithm combines the cyclic reduction method and the partition method. Compared with the cyclic reduction method, this hybrid algorithm has a simple backward substitution on parallel computers. In this paper, the operation count and execution time are obtained to evaluate and compare these methods. On the basis of these measurements, the hybrid algorithm with a multi‐threading implementation achieves better efficiency than the other parallel methods, that is, the cyclic reduction and the partition methods. In particular, the approach has the lowest scalar operation count and the shortest execution time on a multi‐core computer when the size of the system of equations reaches some dimension threshold. The hybrid parallel algorithm improves on the performance of the cyclic reduction and partition methods by 19.2% and 13.2%, respectively. In addition, a comparison of the single‐iteration and multi‐iteration hybrid parallel algorithms shows that increasing the number of iteration steps of the cyclic reduction method has little effect on the performance of the hybrid parallel algorithm. Copyright © 2015 John Wiley & Sons, Ltd.
12.
13.
14.
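In network traffic classification based on the C4.5 algorithm, the massive volume of traffic data and the diversity of its features make decision tree construction speed and classification speed important criteria for evaluating a traffic classifier. Building on the original C4.5 algorithm, this paper proposes an improved method for computing information entropy that speeds up decision tree construction by reducing the complexity of the function being computed. Experiments show that a classifier based on the improved algorithm matches the original classification accuracy while greatly shortening decision tree construction time.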
15.
Game developers are often faced with very demanding requirements on huge numbers of agents moving naturally through increasingly large and detailed virtual worlds. With the advent of multi‐core architectures, new approaches to accelerate expensive pathfinding operations are worth investigating. Traditional single‐processor pathfinding strategies, such as A* and its derivatives, have long been praised for their flexibility. We implemented several parallel versions of such algorithms to analyze their intrinsic behavior, concluding that they have a large overhead, yield far from optimal paths, do not scale up to many cores or are cache unfriendly. In this article, we propose Parallel Ripple Search, a novel parallel pathfinding algorithm that largely solves these limitations. It utilizes a high‐level graph to assign local search areas to CPU cores at “equidistant” intervals. These cores then use A* flooding behavior to expand towards each other, yielding good “guesstimate points” at border touch on. The process does not rely on expensive parallel programming synchronization locks but instead relies on the opportunistic use of node collisions among cooperating cores, exploiting the multi‐core's shared memory architecture. As a result, all cores effectively run at full speed until enough way‐points are found. We show that this approach is a fast, practical and scalable solution and that it flexibly handles dynamic obstacles in a natural way. Copyright © 2012 John Wiley & Sons, Ltd.
16.
《Expert systems with applications》2014,41(10):4625-4637
In the area of classification, C4.5 is a well-known algorithm widely used to design decision trees. In this algorithm, a pruning process is carried out to solve the problem of over-fitting. A modification of C4.5, called Credal-C4.5, is presented in this paper. This new procedure uses a mathematical theory based on imprecise probabilities and uncertainty measures. In this way, Credal-C4.5 estimates the probabilities of the features and the class variable by using imprecise probabilities. In addition, it uses a new split criterion, called Imprecise Information Gain Ratio, applying uncertainty measures on convex sets of probability distributions (credal sets). In this manner, Credal-C4.5 builds trees for solving classification problems assuming that the training set is not fully reliable. We carried out several experimental studies comparing this new procedure with other ones and obtained the following principal conclusion: in domains of class noise, Credal-C4.5 obtains smaller trees and better performance than classic C4.5.
17.
18.
We propose a new parallel learning algorithm of latent local support vector machines (SVM), called latent‐lSVM, for effectively classifying very high‐dimensional and large‐scale multi‐class datasets. The common framework of text/image classification tasks using the Bag‐Of‐(visual)‐Words model for the data representation leads to a hard classification problem with thousands of dimensions and hundreds of classes. Our latent‐lSVM algorithm performs these complex tasks in two main steps. The first one is to use latent Dirichlet allocation for assigning the datapoint (text/image) to some topics (clusters) with the corresponding probabilities. This aims at reducing the number of classes and the number of datapoints in the cluster compared to the full dataset, followed by the second one: to learn in a parallel way nonlinear SVM models to classify data clusters locally. The numerical test results on nine real datasets show that the latent‐lSVM algorithm achieves very high accuracy compared to state‐of‐the‐art algorithms. An example of its effectiveness is given with an accuracy of 70.14% obtained in the classification of Book dataset having 100 000 individuals in 89 821 dimensional input space and 661 classes in 11.2 minutes using a PC Intel(R) Core i7‐4790 CPU, 3.6 GHz, 4 cores.
19.
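When mining large databases, data extraction and the design of the interface between the database and the mining algorithm become very important. By defining an SQL data mining extractor, this paper designs a framework for the interface between data mining algorithms and the database management system, and uses the common data mining algorithm C4.5 to demonstrate the applicability of this standard SQL data mining extractor.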
20.
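Train track fault detection requires analyzing large amounts of data to determine the detection result, and decision trees are a common tool for data mining and classification analysis. This paper discusses how to apply the C4.5 algorithm to construct a decision tree for train track fault detection and how to use the generated tree to decide whether a track fault is present.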
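
Many of the entries in this list build on C4.5, which chooses each split by gain ratio: the information gain of an attribute divided by the split information of that attribute. As a rough illustration only, and not code taken from any of the listed papers, a minimal Python sketch for a single categorical attribute might look like the following. The function names `entropy` and `gain_ratio` and the toy data are hypothetical, and continuous attributes, missing values, and pruning are all left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of hashable labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on the categorical `values`.

    Hypothetical helper for illustration: `values[i]` is the attribute value
    of sample i and `labels[i]` is its class label.
    """
    total = len(labels)

    # Group the class labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)

    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder

    # Split information penalizes attributes with many distinct values.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data, invented for this example.
labels = ["p", "p", "n", "n"]
perfect = ["x", "x", "y", "y"]   # separates the two classes exactly
useless = ["x", "y", "x", "y"]   # carries no information about the class

print(gain_ratio(perfect, labels))
print(gain_ratio(useless, labels))
```

On this toy data, the attribute that separates the classes perfectly scores 1.0 and the attribute that carries no class information scores 0.0. An attribute with a single distinct value has zero split information, so the sketch returns 0.0 for it rather than dividing by zero.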