Similar Documents
20 similar documents found.
1.
A Decision Tree Algorithm Hybridized with Neural Networks
Neural networks usually achieve higher accuracy than decision trees and regression algorithms because they can fit more complex models. Decision trees typically split on a single variable at each node, so their decision regions are axis-parallel hyperrectangles, which are simpler and coarser than what neural networks can represent. However, neural networks require relatively long training times, and their models are less interpretable than those of decision trees or Naive Bayes. After comparing the two kinds of algorithms on complex models, this paper proposes a new algorithm, NNTree, a hybrid of decision trees and neural networks: internal nodes contain single-variable splits as in an ordinary decision tree, while leaf nodes contain neural network classifiers. The method retains the efficiency of decision trees on large data sets and their interpretability, improves the training behavior of neural networks, and yields accuracy that substantially exceeds either algorithm alone, especially on larger data sets with complex models.
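A minimal Python sketch of the hybrid structure described above, assuming hypothetical names: a shallow univariate decision tree routes samples to leaves, and each impure leaf holds a small neural network classifier. This is an illustration of the idea, not the paper's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

class NNTreeSketch:
    def __init__(self, tree_depth=3):
        self.tree = DecisionTreeClassifier(max_depth=tree_depth)
        self.leaf_models = {}

    def fit(self, X, y):
        self.tree.fit(X, y)
        leaves = self.tree.apply(X)          # leaf id of each training sample
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            if len(np.unique(y[mask])) > 1:  # train an MLP only on mixed leaves
                m = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500)
                self.leaf_models[leaf] = m.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        leaves = self.tree.apply(X)
        out = self.tree.predict(X)           # pure leaves keep the tree's label
        for leaf, m in self.leaf_models.items():
            mask = leaves == leaf
            if mask.any():
                out[mask] = m.predict(X[mask])
        return out

X, y = make_classification(n_samples=600, random_state=0)
model = NNTreeSketch().fit(X, y)
print(model.predict(X[:5]))
```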

2.
Bagging schemes on the presence of class noise in classification
In this paper, we study one application of Bagging credal decision trees, i.e. decision trees built using imprecise probabilities and uncertainty measures, on data sets with class noise (data sets with wrong assignments of the class label). To this end, we first extend an existing method for building credal decision trees so that it handles continuous features and missing data. Through an experimental study, we show that Bagging credal decision trees outperforms more complex Bagging approaches on data sets with class noise. Finally, using a bias-variance decomposition of the error, we justify the performance of Bagging credal decision trees, showing that it achieves a stronger reduction of the variance component of the error.
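A brief sketch of the experimental setting, assuming a synthetic data set and an arbitrary 10% noise rate: plain bagging of decision trees evaluated under injected class noise. The credal splitting criterion itself is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
noisy = rng.random(len(y_tr)) < 0.10            # flip 10% of training labels
y_tr_noisy = np.where(noisy, 1 - y_tr, y_tr)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        random_state=0)
bag.fit(X_tr, y_tr_noisy)
print("test accuracy under 10% class noise:", bag.score(X_te, y_te))
```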

3.
Cost-sensitive learning algorithms are typically designed to minimize the total cost when multiple costs are taken into account. Like other learning algorithms, cost-sensitive learning algorithms face a significant challenge in practice: over-fitting. Specifically, they can produce good results on training data yet fail to yield an optimal model on unseen data in real-world applications. This paper addresses over-fitting by designing three simple and efficient strategies, feature selection, smoothing and threshold pruning, for the TCSDT (test cost-sensitive decision tree) method. Feature selection pre-processes the data set before the TCSDT algorithm is applied; smoothing and threshold pruning are applied within the TCSDT algorithm before the class probability estimate is calculated for each decision tree leaf. To evaluate our approaches, we conduct extensive experiments on selected UCI data sets across different cost ratios, and on a real-world data set, KDD-98, with real misclassification costs. The experimental results show that our algorithms outperform both the original TCSDT and other competing algorithms in reducing over-fitting.
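As one standard instantiation of the smoothing strategy mentioned above (whether TCSDT uses exactly this form is an assumption), the following sketch applies the Laplace correction to a leaf's class-probability estimate, which guards against over-confident estimates from leaves with few training samples.

```python
# Laplace-smoothed class probability at a decision-tree leaf:
# P(c | leaf) = (n_c + 1) / (N + k), with k classes and N samples in the leaf.
def laplace_estimate(class_counts, n_classes):
    total = sum(class_counts)
    return [(c + 1) / (total + n_classes) for c in class_counts]

# A leaf with 3 positives and 0 negatives: the raw estimate would be 1.0,
# while the smoothed estimate is a more conservative 0.8.
print(laplace_estimate([3, 0], n_classes=2))   # [0.8, 0.2]
```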

4.
Data preprocessing is key to improving the accuracy and performance of the mining process. Based on an analysis of decision tree algorithms and the characteristics of attribute values in landslide data, this paper uses clustering to partition continuous attribute values into intervals and proposes a discretization method tailored to the continuous attributes of landslide data. Experiments show that decision trees built with the new method achieve higher classification accuracy and produce fewer redundant rules than the original algorithm.
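A rough sketch of clustering-based discretization in this spirit, assuming a toy attribute and interval count: one-dimensional k-means groups a continuous attribute, and midpoints between adjacent cluster centers become the cut points.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cut_points(values, n_intervals=4):
    km = KMeans(n_clusters=n_intervals, n_init=10, random_state=0)
    km.fit(values.reshape(-1, 1))
    centers = np.sort(km.cluster_centers_.ravel())
    return (centers[:-1] + centers[1:]) / 2     # boundaries between clusters

rainfall = np.random.default_rng(0).gamma(2.0, 30.0, size=500)  # toy attribute
cuts = kmeans_cut_points(rainfall)
discrete = np.digitize(rainfall, cuts)          # interval index per sample
print(np.round(cuts, 1), np.bincount(discrete))
```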

5.
Model trees are a particular case of decision trees employed to solve regression problems. They have the advantage of presenting an interpretable output, helping end users gain confidence in the prediction and giving them a basis for new insight about the data, confirming or rejecting hypotheses previously formed. Moreover, model trees present an acceptable level of predictive performance in comparison to most techniques used for solving regression problems. Since generating the optimal model tree is an NP-complete problem, traditional model tree induction algorithms use a greedy top-down divide-and-conquer strategy, which may not converge to the globally optimal solution. In this paper, we propose a novel algorithm based on the evolutionary algorithms paradigm as an alternative heuristic for generating model trees, in order to improve convergence toward globally near-optimal solutions. We call our new approach evolutionary model tree induction (E-Motion). We test its predictive performance on public UCI data sets and compare the results to traditional greedy regression/model tree induction algorithms, as well as to other evolutionary approaches. Results show that our method presents a good trade-off between predictive performance and model comprehensibility, which may be crucial in many machine learning applications.
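The sketch below illustrates the evolutionary-search idea in a much-reduced form: instead of evolving full model trees as E-Motion does, it evolves only tree hyperparameters, with a fitness that trades validation error against tree size. Everything here is an illustrative assumption, not the paper's algorithm.

```python
import random
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

def fitness(depth, min_leaf):
    t = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=min_leaf,
                              random_state=0).fit(X_tr, y_tr)
    err = mean_squared_error(y_va, t.predict(X_va))
    return err + 5.0 * t.get_n_leaves()        # penalize large trees

random.seed(0)
pop = [(random.randint(2, 12), random.randint(1, 20)) for _ in range(20)]
for gen in range(15):
    pop.sort(key=lambda ind: fitness(*ind))    # selection
    parents = pop[:10]
    children = [(random.choice(parents)[0] + random.randint(-1, 1),
                 random.choice(parents)[1] + random.randint(-2, 2))
                for _ in range(10)]            # crossover-like mixing + mutation
    pop = parents + [(max(2, d), max(1, m)) for d, m in children]
print("best (depth, min_leaf):", min(pop, key=lambda ind: fitness(*ind)))
```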

6.
Decision tree induction has been widely studied and applied. In safety applications, such as determining whether a chemical process is safe or whether a person has a medical condition, the cost of misclassification in one of the classes is significantly higher than in the other. Several authors have tackled this problem by developing cost-sensitive decision tree learning algorithms or by changing the distribution of training examples to bias the decision tree learning process so as to take account of costs. A prerequisite for applying such algorithms is the availability of misclassification costs. Although this may be possible for some applications, obtaining reasonable cost estimates is not easy in the area of safety. This paper presents a new algorithm for applications where the cost of misclassification cannot be quantified, although the cost of misclassification in one class is known to be significantly higher than in another. The algorithm utilizes linear discriminant analysis to identify oblique relationships between continuous attributes and then carries out an appropriate modification to ensure that the resulting tree errs on the side of safety. The algorithm is evaluated against one of the best-known cost-sensitive algorithms (ICET), a well-known oblique decision tree algorithm (OC1) and an algorithm that utilizes robust linear programming.
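A hedged sketch of the oblique-split idea: linear discriminant analysis supplies a linear combination of continuous attributes, and the decision threshold is then shifted so the classifier errs on the side of safety. The margin value is an illustrative assumption, not the paper's modification rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=500, n_informative=4, random_state=1)
lda = LinearDiscriminantAnalysis().fit(X, y)   # oblique direction w.x + b
scores = lda.decision_function(X)

margin = 0.5          # assumed safety margin shifting the boundary
pred_safe = scores > margin   # label "safe" only with a comfortable margin
print("flagged unsafe:", int((~pred_safe).sum()), "of", len(y))
```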

7.
This paper presents the frontier tree exploration algorithm, a novel approach to autonomously exploring unknown and unstructured areas. The focus of this work is the exploration of domestic environments with many arbitrary obstacles, for example furnished apartments. Existing and well-studied approaches such as greedy algorithms perform worse when obstacles are present and the range of distance sensors is limited. The basis of this research is the frontier tree, a data structure that offers two main features. First, it serves as a memory of past poses during exploration, with boundaries between known and unknown space inserted into the tree. Second, the data structure is used to decide between future navigation goals. The approach is compared to the class of nearest-neighbor exploration algorithms and to a decision-theoretic approach. The algorithm is tested in a simulation study with furnished ground maps and in a real-life domestic environment. The paper shows that frontier trees achieve superior performance in distance traveled and number of exploration steps compared to a nearest-neighbor algorithm.
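A minimal sketch of a frontier tree: past poses form the nodes, observed frontiers are attached where they were seen, and the next goal is picked from the tree. The goal-selection rule used here (the shallowest node that still has frontiers) is an assumption for illustration, not the paper's policy.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    pose: tuple                   # (x, y) where the robot stood
    depth: int = 0
    frontiers: list = field(default_factory=list)  # boundary cells seen here
    children: list = field(default_factory=list)

    def add_child(self, pose, frontiers):
        child = Node(pose, self.depth + 1, list(frontiers))
        self.children.append(child)
        return child

def next_goal(root):
    """Pick a frontier from the shallowest node that still has one."""
    best, stack = None, [root]
    while stack:
        n = stack.pop()
        if n.frontiers and (best is None or n.depth < best.depth):
            best = n
        stack.extend(n.children)
    return best.frontiers[0] if best else None

root = Node((0, 0), frontiers=[(1, 0)])
room = root.add_child((1, 0), frontiers=[(2, 0), (1, 1)])
print(next_goal(root))   # (1, 0): revisit the shallow frontier first
```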

8.
Classical nonlinear expectile regression has two shortcomings: it is difficult to choose a suitable nonlinear function, and interaction effects among explanatory variables are not considered. We therefore combine the random forest model with expectile regression to propose a new nonparametric expectile regression model, the expectile regression forest (ERF). The main novelty of the ERF model is to build multiple decision trees with the bagging method, calculate the conditional expectile at each leaf node of each decision tree, and derive the final result by aggregating the per-tree results through simple averaging. At the same time, to compensate for the black-box nature of the ERF model, measures of explanatory-variable importance and partial dependence are defined to evaluate the magnitude and direction of each explanatory variable's influence on the response variable. The advantage of the ERF model is illustrated by Monte Carlo simulation studies, whose results show that its estimation and prediction ability is significantly better than that of alternative approaches. We also apply the ERF model to real data; the nonparametric expectile regression analysis of these data sets yields several conclusions consistent with the numerical simulation results.
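The leaf-level computation described above can be sketched as follows: the τ-expectile of a sample of values via the standard fixed-point iteration for the asymmetric squared loss. A forest prediction would simply average this quantity across trees; the forest machinery is omitted here.

```python
import numpy as np

def expectile(values, tau=0.8, tol=1e-8):
    v = np.asarray(values, dtype=float)
    m = v.mean()                               # tau = 0.5 recovers the mean
    for _ in range(100):
        w = np.where(v > m, tau, 1.0 - tau)    # asymmetric weights
        m_new = np.average(v, weights=w)       # first-order condition
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m

leaf = [1.0, 2.0, 2.5, 3.0, 10.0]
print(expectile(leaf, tau=0.5), expectile(leaf, tau=0.9))  # mean vs upper tail
```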

9.
CAIM discretization algorithm
The task of extracting knowledge from databases is quite often performed by machine learning algorithms. The majority of these algorithms can be applied only to data described by discrete numerical or nominal attributes (features); in the case of continuous attributes, a discretization algorithm is needed to transform them into discrete ones. We describe such an algorithm, called CAIM (class-attribute interdependence maximization), which is designed to work with supervised data. The goal of the CAIM algorithm is to maximize the class-attribute interdependence and to generate a (possibly) minimal number of discrete intervals. The algorithm does not require the user to predefine the number of intervals, as opposed to some other discretization algorithms. Tests comparing CAIM with six other state-of-the-art discretization algorithms show that discrete attributes generated by CAIM almost always have the lowest number of intervals and the highest class-attribute interdependency. Two machine learning algorithms, the CLIP4 rule algorithm and a decision tree algorithm, were used to generate classification rules from the discretized data; for both, the rules generated from data discretized with CAIM are more accurate and fewer in number than those obtained with the six other discretization algorithms.
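The CAIM criterion itself is compact enough to show directly: for each interval, the squared count of the dominant class is divided by the interval's total count, summed over intervals and divided by the number of intervals. The greedy CAIM search adds the cut point that maximizes this value.

```python
import numpy as np

def caim(quanta):
    """quanta: 2-D array, rows = classes, columns = discretization intervals."""
    q = np.asarray(quanta, dtype=float)
    max_r = q.max(axis=0)          # dominant class count per interval
    M_r = q.sum(axis=0)            # total samples per interval
    return np.sum(max_r ** 2 / M_r) / q.shape[1]

# Two candidate 3-interval schemes over the same data: purer intervals score
# higher (about 14.4 vs 2.9), which is exactly what CAIM maximizes.
print(caim([[20,  2,  1], [3, 15,  2], [1,  3, 18]]))
print(caim([[ 8,  8,  8], [7,  6,  7], [8,  7,  7]]))
```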

10.
Imbalanced data are common in real life, yet most traditional classification algorithms assume balanced class distributions or equal misclassification costs, so minority-class samples tend to be misclassified. To address this, building on cost-sensitive theory, a new imbalanced-data classification algorithm based on cost-sensitive ensemble learning, NIBoost (New Imbalanced Boost), is proposed. First, in each iteration an oversampling algorithm adds a certain number of minority-class samples to balance the data set, and a classifier is trained on this new set; second, the classifier labels the data set, yielding each sample's predicted label and the classifier's error rate; finally, the classifier's weight coefficient and new sample weights are computed from the error rate and the predicted labels. Using decision trees and Naive Bayes as weak classifiers, experiments on UCI data sets show that, with decision trees as the base classifier, the new algorithm improves F-value by up to 5.91 percentage points, G-mean by up to 7.44 percentage points, and AUC by up to 4.38 percentage points over the RareBoost algorithm, indicating an advantage in handling imbalanced-data classification.
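An illustrative AdaBoost-style loop with per-round random oversampling of the minority class, mirroring the three steps described above. This is a generic reconstruction under assumed formulas, not NIBoost's exact weight updates; it assumes integer class labels 0..k-1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def boost_oversample(X, y, rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.full(len(y), 1 / len(y))                        # sample weights
    minority = np.flatnonzero(y == np.argmin(np.bincount(y)))
    models, alphas = [], []
    for _ in range(rounds):
        extra = rng.choice(minority, size=len(minority))   # step 1: oversample
        idx = np.concatenate([np.arange(len(y)), extra])
        clf = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
        pred = clf.predict(X)                              # step 2: classify
        err = w[pred != y].sum() / w.sum()
        if err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(alpha * np.where(pred == y, -1.0, 1.0))  # step 3: reweight
        w /= w.sum()
        models.append(clf)
        alphas.append(alpha)
    return models, alphas

X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
models, alphas = boost_oversample(X, y)
print(len(models), "boosting rounds kept")
```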

11.
The influence of clustering-based data preprocessing on fuzzy decision trees
In fuzzy decision tree induction, data are usually fuzzified with triangular membership functions, whose center-point parameters determine the quality of fuzzification and thus affect the efficiency, accuracy and size of the resulting fuzzy decision tree. Kohonen's feature-maps clustering algorithm can be used to select the center points for continuous attribute values. Experiments show that the centers selected by this algorithm make the overlap between fuzzy subsets unequal, which represents the overlap between fuzzy concepts more reasonably. Comparison with other algorithms demonstrates that it yields fuzzy decision trees with higher classification accuracy.
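A short sketch of triangular fuzzification around chosen center points, as in the preprocessing step above. In the paper the centers come from Kohonen clustering; here they are fixed toy values.

```python
import numpy as np

def triangular_memberships(x, centers):
    """Membership of scalar x in each fuzzy set centered at `centers`."""
    c = np.sort(np.asarray(centers, dtype=float))
    mu = np.zeros(len(c))
    for i, ci in enumerate(c):
        left = c[i - 1] if i > 0 else ci         # neighbors define the support
        right = c[i + 1] if i < len(c) - 1 else ci
        if x <= ci:
            mu[i] = 1.0 if ci == left else max(0.0, (x - left) / (ci - left))
        else:
            mu[i] = 1.0 if ci == right else max(0.0, (right - x) / (right - ci))
    return mu

# x = 4 belongs partly to the set at 2 and partly to the set at 5.
print(triangular_memberships(4.0, centers=[2.0, 5.0, 8.0]))
```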

12.
Existing decision tree models do not consider decision makers' preference for particular outcomes during classification and therefore predict poorly on problems with clear preference tendencies. To address this, a preference-sensitive decision tree (PSDT) classification algorithm is proposed. The algorithm introduces the concepts of preference degree and preference cost and, by jointly considering attribute information and effective preference, builds a new attribute-selection factor and a leaf-labeling criterion based on effective preference. By adaptively adjusting the preference degree, an optimal preference-sensitive decision tree can be generated. Experimental results show that the algorithm achieves high-precision prediction for the preferred class while maintaining good overall accuracy, and that it is effective and practical for decision problems in preference-sensitive settings.

13.
This paper presents a novel host-based combinatorial method based on the k-Means clustering and ID3 decision tree learning algorithms for unsupervised classification of anomalous and normal activities in computer network ARP traffic. The k-Means clustering method is first applied to the normal training instances to partition them into k clusters using Euclidean distance similarity, and an ID3 decision tree is constructed on each cluster. Anomaly scores from the k-Means clustering algorithm and decisions of the ID3 decision trees are extracted, a special algorithm combines the results of the two algorithms to obtain final anomaly scores, and a threshold rule decides whether a test instance is normal. Experiments are performed on captured network ARP traffic; anomaly criteria have been defined and applied to the captured traffic to generate normal training instances. Performance of the proposed approach is evaluated using five defined measures and compared empirically with the individual k-Means clustering and ID3 decision tree algorithms and with other proposed approaches based on Markov chains and stochastic learning automata. Experimental results show that the proposed approach achieves specificity and positive predictive value as high as 96% and 98%, respectively.
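A heavily simplified sketch of the two-stage combination above: k-means partitions the normal traffic, a decision tree is trained per cluster, and the distance-based and tree-based signals are merged. The scoring and combination rules here are placeholders under assumed names, not the paper's formulas.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_detector(X_normal, X_labeled, y_labeled, k=3):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_normal)
    assign = km.predict(X_labeled)
    trees = {}
    for c in range(k):
        m = assign == c
        if m.any() and len(np.unique(y_labeled[m])) > 1:
            trees[c] = DecisionTreeClassifier(max_depth=4).fit(
                X_labeled[m], y_labeled[m])
    return km, trees

def anomaly_score(km, trees, x):
    c = km.predict(x.reshape(1, -1))[0]
    dist = np.linalg.norm(x - km.cluster_centers_[c])      # k-means signal
    p = trees[c].predict_proba(x.reshape(1, -1))[0, -1] if c in trees else 0.5
    return 0.5 * dist + 0.5 * p                            # naive combination

X, y = make_classification(n_samples=400, random_state=0)  # y=1: "anomalous"
km, trees = fit_detector(X[y == 0], X, y)
print(round(anomaly_score(km, trees, X[0]), 3))
```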

14.
徐盈盈  钟才明 《计算机应用》2014,34(8):2184-2187
Some pattern recognition and machine learning algorithms can handle only discrete attribute values, while much real-world data has continuous attributes; an unsupervised discretization method is therefore proposed. First, K-means partitions the data set to obtain class information; then a supervised discretization method is applied to the partitioned data, and this process is repeated to obtain multiple discretization results, which are then combined as an ensemble. Finally, the resulting minimal subintervals are merged, with the dimension and adjacent intervals to merge first chosen according to the neighbor relations among data points; the number of subintervals is determined automatically from those neighbor relations, preserving the data's intrinsic structure as far as possible. The discretized data were fed to clustering algorithms such as spectral clustering, and the clustering quality was evaluated. Experimental results show the algorithm improves clustering accuracy by about 33% on average over four other methods, demonstrating its feasibility and effectiveness. The discretized data produced by the algorithm can also be used by data mining algorithms such as the ID3 decision tree algorithm.
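A sketch of the first two stages, under assumed parameters: k-means supplies pseudo-labels, then a supervised discretizer, here the split thresholds of a depth-limited decision tree trained per feature against those labels, yields cut points. The ensemble and interval-merging stages are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def pseudo_supervised_cuts(X, n_clusters=3, max_depth=2):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X)      # stage 1: pseudo-labels
    cuts = []
    for j in range(X.shape[1]):
        t = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        t.fit(X[:, [j]], labels)                        # stage 2: supervised cuts
        thr = t.tree_.threshold[t.tree_.feature == 0]   # used split thresholds
        cuts.append(np.sort(thr))
    return cuts

X = np.random.default_rng(0).normal(size=(300, 2))
for j, c in enumerate(pseudo_supervised_cuts(X)):
    print(f"feature {j} cut points:", np.round(c, 2))
```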

15.
Knowledge inference systems are built to identify hidden and logical patterns in huge data sets. Decision trees play a vital role in knowledge discovery, but crisp decision tree algorithms suffer from sharp decision boundaries, which may not be appropriate for all knowledge inference systems. A fuzzy decision tree algorithm overcomes this drawback: fuzzy decision trees are implemented through fuzzification of the decision boundaries without disturbing the attribute values. Data reduction also plays a crucial role in many classification problems. This article presents an approach combining principal component analysis (PCA) with a modified Gini-index-based fuzzy SLIQ decision tree algorithm: PCA is used for dimensionality reduction, and the modified Gini-index fuzzy SLIQ decision tree algorithm constructs the decision rules. Finally, the method is validated in simulation experiments in MATLAB on the PID data set.
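A minimal sketch of the data-reduction stage only (the fuzzy SLIQ tree is not reproduced): PCA keeps enough components for 95% of the variance, an illustrative threshold, on a stand-in data set; Python replaces the paper's MATLAB here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=300, n_features=20, random_state=0)
X_reduced = PCA(n_components=0.95).fit_transform(X)   # keep 95% of variance
print(X.shape, "->", X_reduced.shape)
```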

16.
胡淼  王开军 《计算机应用》2019,39(4):956-962
To address the limited performance of existing random-forest-based anomaly detection algorithms, a random forest algorithm combining dual features with relaxed boundaries is proposed for outlier detection. First, while each binary decision tree of the forest is built from normal-class data only, the value ranges of two features (one value interval per feature) are recorded at every node, and these dual-feature ranges serve as the basis for anomaly judgment. During detection, a sample that violates the dual-feature ranges at a tree node is marked as a candidate anomaly; otherwise it descends to the next level of the tree for further range comparisons, and a sample that reaches the bottom without violation is marked as candidate normal. Finally, the forest's decision mechanism determines the sample's class. Outlier-detection experiments on five UCI data sets show that the proposed method outperforms existing random-forest-based anomaly detection algorithms, performs comparably to or better than isolation forest (iForest) and one-class SVM (OCSVM), and remains stable at a high level.
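A sketch of the per-node range test described above: a node stores the training value intervals of two features, and a test sample falling outside either (relaxed) interval is flagged as a candidate anomaly at that node. Tree construction, descent and forest voting are omitted, and the slack fraction is an assumed value.

```python
import numpy as np

def node_ranges(X_node, feat_a, feat_b):
    """Record [min, max] of two features over the node's training samples."""
    return {feat_a: (X_node[:, feat_a].min(), X_node[:, feat_a].max()),
            feat_b: (X_node[:, feat_b].min(), X_node[:, feat_b].max())}

def violates(x, ranges, slack=0.05):
    # "relaxed boundary": widen each interval by a small slack fraction
    for f, (lo, hi) in ranges.items():
        pad = slack * (hi - lo)
        if not (lo - pad <= x[f] <= hi + pad):
            return True
    return False

X_normal = np.random.default_rng(0).normal(size=(200, 4))
r = node_ranges(X_normal, feat_a=0, feat_b=2)
print(violates(np.array([0.1, 0.0, 0.2, 0.0]), r))   # inside -> False
print(violates(np.array([9.0, 0.0, 0.2, 0.0]), r))   # outlier -> True
```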

17.
As living standards rise, the number of tumor cases keeps growing, and lung cancer is one of the major diseases threatening human health in the 21st century. This paper proposes a decision-tree method for lung cancer diagnosis based on electronic medical records. It first analyses the characteristics of lung-cancer electronic medical records and the structural instability and over-fitting of ordinary decision trees, then builds an optimized decision tree model combining principal component analysis (PCA) with the C5.0 algorithm. Features are reduced in two ways: keeping principal components with eigenvalues greater than 1, and keeping components with a cumulative contribution rate above 85%. A C5.0 decision tree is then built and pruned, and the data preprocessing steps, the model's execution flow and the test results are given. Analysis of the experimental results shows that the improved algorithm achieves good accuracy and scalability, confirming its value for assisting clinical work on lung cancer.
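The two component-selection rules mentioned above are easy to show concretely: keep components with eigenvalue greater than 1 (the Kaiser rule), or keep enough components to reach 85% cumulative explained variance. The data set here is a stand-in, not the paper's medical records.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)
pca = PCA().fit(X)

eig = pca.explained_variance_                     # eigenvalues of the PCs
k_kaiser = int(np.sum(eig > 1.0))                 # rule 1: eigenvalue > 1
cum = np.cumsum(pca.explained_variance_ratio_)
k_85 = int(np.searchsorted(cum, 0.85) + 1)        # rule 2: 85% cumulative
print("eigenvalue > 1 keeps", k_kaiser, "components;",
      "85% variance keeps", k_85)
```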

18.
Food safety decision making is an important topic in food safety research. To analyse food safety status, a new method of constructing decision trees that incorporates rule confidence is proposed, based on the variable-precision rough set model. The method improves the traditional weighted decision tree induction algorithm: the new algorithm uses the weighted mean variable-precision roughness as the attribute-selection criterion and replaces approximation accuracy with variable-precision approximation accuracy, which removes noisy and redundant data from the database and tolerates partially contradictory records, so that partially conflicting decision rules can be accommodated while the tree is built. The algorithm simplifies tree construction, broadens its range of application and helps interpret the generated rules. Validation results show the algorithm is effective and feasible.
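A small sketch of the variable-precision rough set notion behind the attribute-selection criterion: an equivalence class belongs to the β-lower approximation of a concept when the fraction of its members inside the concept is at least 1 − β, which is what lets partially conflicting records be tolerated. The β value is an illustrative assumption.

```python
def beta_lower_approximation(eq_classes, concept, beta=0.2):
    """Variable-precision lower approximation: classical rough sets require
    the inclusion fraction to be exactly 1.0; VPRS relaxes it to 1 - beta."""
    concept = set(concept)
    keep = []
    for block in eq_classes:
        inside = sum(1 for x in block if x in concept) / len(block)
        if inside >= 1 - beta:
            keep.append(block)
    return keep

blocks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(beta_lower_approximation(blocks, concept={1, 2, 3, 4, 6}, beta=0.2))
```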

19.
Decision tree regression for soft classification of remote sensing data
In recent years, decision tree classifiers have been used successfully for land cover classification from remote sensing data, typically as per-pixel classifiers producing hard or crisp classifications. Remote sensing images, particularly at coarse spatial resolutions, are contaminated with mixed pixels that contain more than one class on the ground, and the per-pixel approach may classify images dominated by mixed pixels erroneously. Therefore, soft classification approaches that decompose each pixel into its class constituents, in the form of class proportions, have been advocated. In this paper, we employ a decision tree regression approach to determine class proportions within a pixel and thus produce a soft classification from remote sensing data. The classification accuracy achieved by decision tree regression is compared with that of the widely used maximum likelihood classifier, implemented in soft mode, and with a supervised version of the fuzzy c-means classifier. Root mean square error (RMSE) and fuzzy-error-matrix-based measures are used for accuracy assessment of the soft classification.
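A sketch of soft classification via regression trees in this spirit: a multi-output regression tree predicts per-class proportions for each pixel, and RMSE against reference proportions assesses accuracy. The data are synthetic stand-ins for spectral features, not remote sensing imagery.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                      # pixel spectral features
raw = rng.random((500, 3))
Y = raw / raw.sum(axis=1, keepdims=True)           # reference class fractions

tree = DecisionTreeRegressor(max_depth=6).fit(X[:400], Y[:400])
P = tree.predict(X[400:])                          # predicted proportions
P = np.clip(P, 0, None)
P = P / P.sum(axis=1, keepdims=True)               # renormalize to sum to 1
print("RMSE:", mean_squared_error(Y[400:], P) ** 0.5)
```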

20.
沈思倩  毛宇光  江冠儒 《计算机科学》2017,44(6):139-143, 149
This work studies how to incorporate differential privacy into decision tree analysis of incomplete data sets. It first reviews the differentially private ID3 algorithm and the differentially private random decision forest algorithm, then addresses their shortcomings by proposing a differentially private random decision forest algorithm based on the exponential mechanism. Finally, a new missing-value handling method for incomplete data sets, WP (Weight Partition), is proposed, which allows the decision tree analysis to satisfy differential privacy while achieving higher prediction accuracy and adaptability, without imputation. Experiments show that the proposed method works with both the Laplace and the exponential mechanism, and with both the ID3 and the random decision forest algorithm.
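A generic sketch of the exponential mechanism as it is typically used for split selection in differentially private trees: a candidate attribute is sampled with probability proportional to exp(ε·quality / (2·sensitivity)). This is the textbook mechanism, not the paper's full algorithm or its WP method.

```python
import numpy as np

def exponential_mechanism(qualities, eps=1.0, sensitivity=1.0, seed=0):
    rng = np.random.default_rng(seed)
    q = np.asarray(qualities, dtype=float)
    scores = eps * q / (2 * sensitivity)
    p = np.exp(scores - scores.max())    # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(len(q), p=p)       # index of the chosen candidate

gains = [0.9, 0.5, 0.1]   # e.g. information gain of candidate attributes
picks = [exponential_mechanism(gains, eps=1.0, seed=i) for i in range(10)]
print(picks)              # mostly, but not always, the best attribute
```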
