Similar literature
20 similar documents found.
1.
Shapelets are discriminative subsequences of time series, usually embedded in shapelet-based decision trees. The enumeration of time series shapelets is, however, computationally costly, which in addition to the inherent difficulty of the decision tree learning algorithm to effectively handle high-dimensional data, severely limits the applicability of shapelet-based decision tree learning from large (multivariate) time series databases. This paper introduces a novel tree-based ensemble method for univariate and multivariate time series classification using shapelets, called the generalized random shapelet forest algorithm. The algorithm generates a set of shapelet-based decision trees, where both the choice of instances used for building a tree and the choice of shapelets are randomized. For univariate time series, it is demonstrated through an extensive empirical investigation that the proposed algorithm yields predictive performance comparable to the current state-of-the-art and significantly outperforms several alternative algorithms, while being at least an order of magnitude faster. Similarly for multivariate time series, it is shown that the algorithm is significantly less computationally costly and more accurate than the current state-of-the-art.
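As a rough illustration of the randomization described above (not the authors' implementation), the Python sketch below draws random shapelets per tree, transforms each bootstrap sample into shapelet-distance features, and fits an ordinary decision tree on them; all names and parameters are illustrative.

```python
# Minimal sketch of a random-shapelet-forest-style learner: each tree sees a
# bootstrap sample and a small set of randomly drawn shapelets, and splits on the
# minimum distance between a series and a shapelet.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def min_dist(series, shapelet):
    """Smallest Euclidean distance between a shapelet and any window of the series."""
    L = len(shapelet)
    windows = np.lib.stride_tricks.sliding_window_view(series, L)
    return np.sqrt(((windows - shapelet) ** 2).sum(axis=1)).min()

def sample_shapelets(X, n_shapelets=10, min_len=5, max_len=20):
    """Draw random subsequences from randomly chosen training series."""
    shapelets = []
    for _ in range(n_shapelets):
        s = X[rng.integers(len(X))]
        L = rng.integers(min_len, max_len + 1)
        start = rng.integers(0, len(s) - L + 1)
        shapelets.append(s[start:start + L])
    return shapelets

def fit_random_shapelet_forest(X, y, n_trees=50):
    """X: (n_series, series_length) array of univariate time series."""
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))           # bootstrap of instances
        shapelets = sample_shapelets(X[idx])            # random shapelets per tree
        Z = np.array([[min_dist(s, sh) for sh in shapelets] for s in X[idx]])
        tree = DecisionTreeClassifier().fit(Z, y[idx])
        forest.append((tree, shapelets))
    return forest
```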

2.
One-class classification is a binary classification task for which only one class of samples is available for learning. In preliminary work, we proposed One-Class Random Forests (OCRF), a method based on a random forest algorithm and an original outlier generation procedure that makes use of classifier ensemble randomization principles. In this paper, we propose an extensive study of the behavior of OCRF, which includes experiments on various UCI public datasets and comparisons, with tests of statistical significance, to reference one-class methods: Gaussian density models, Parzen estimators, Gaussian mixture models, and One-Class SVMs. Our aim is to show that the randomization principles embedded in a random forest algorithm make the outlier generation process more efficient and, in particular, allow it to break the curse of dimensionality. One-Class Random Forests are shown to perform well in comparison to other methods and, in particular, to maintain stable performance in higher dimensions, where the other algorithms may fail.
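The outlier-generation idea can be sketched as follows. This is a simplification: artificial outliers are drawn uniformly in a slightly enlarged bounding box of the target class and a standard random forest separates the two, whereas OCRF uses a more refined, randomized generation scheme.

```python
# Hedged sketch of one-class learning via artificial outlier generation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_ocrf_like(X_target, n_outliers=None, margin=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_outliers = n_outliers or len(X_target)
    lo, hi = X_target.min(axis=0), X_target.max(axis=0)
    span = hi - lo
    # uniform artificial outliers in an enlarged bounding box (assumption, not OCRF's scheme)
    outliers = rng.uniform(lo - margin * span, hi + margin * span,
                           size=(n_outliers, X_target.shape[1]))
    X = np.vstack([X_target, outliers])
    y = np.r_[np.ones(len(X_target)), np.zeros(n_outliers)]   # 1 = target class
    return RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
```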

3.
The random forests (RF) algorithm, which combines the predictions from an ensemble of random trees, has achieved significant improvements in terms of classification accuracy. In many real-world applications, however, ranking is often required in order to make optimal decisions. Thus, we focus our attention on the ranking performance of RF in this paper. Our experimental results, based on all 36 UC Irvine Machine Learning Repository (UCI) data sets published on the main website of the Weka platform, show that RF does not perform well in ranking and is even about the same as a single C4.4 tree. This fact raises the question of whether several improvements to RF can scale up its ranking performance. To answer this question, we present an improved random forests (IRF) algorithm. Instead of the information gain measure and the maximum-likelihood estimate, the average gain measure and the similarity-weighted estimate are used in IRF. Our experiments show that IRF significantly outperforms all the other algorithms used for comparison in terms of ranking, while maintaining the high classification accuracy that characterizes RF.
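For illustration, a similarity-weighted leaf estimate could look like the sketch below, in contrast to the raw maximum-likelihood frequency; the exact weighting used by IRF may differ.

```python
# Hedged sketch: class probabilities at a leaf weighted by similarity to the query,
# rather than the plain frequency (maximum-likelihood) estimate.
import numpy as np

def similarity_weighted_proba(x, leaf_X, leaf_y, n_classes):
    """leaf_X, leaf_y: training instances/labels that fell into the same leaf as x."""
    d = np.linalg.norm(leaf_X - x, axis=1)
    w = 1.0 / (1.0 + d)                       # simple similarity weight (assumption)
    proba = np.array([w[leaf_y == c].sum() for c in range(n_classes)])
    return proba / proba.sum()
```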

4.
Cios, K. J.; Shin, I.; Goodenday, L. S. Computer, 1991, 24(3): 57-63
The use of fuzzy sets to represent perfusion defects and to generate expert rules to help in diagnosis is reported. Retrospective data collected from 91 patients who underwent both stress thallium-201 myocardial scintigraphy and coronary arteriography were used. Of the total, 64 scans were chosen at random for training, and the remaining 27 scans were used for testing. It was found that 17 rules generated by fuzzy set theory performed as well as 68 rules specified by cardiologists in diagnosing coronary artery stenosis.

5.

Successful use of probabilistic classification requires well-calibrated probability estimates, i.e., the predicted class probabilities must correspond to the true probabilities. In addition, a probabilistic classifier must, of course, also be as accurate as possible. In this paper, Venn predictors, and their special case Venn-Abers predictors, are evaluated for probabilistic classification, using random forests as the underlying models. Venn predictors output multiple probabilities for each label, i.e., the predicted label is associated with a probability interval. Since all Venn predictors are valid in the long run, the size of the probability intervals is very important, with tighter intervals being more informative. The standard solution when calibrating a classifier is to employ an additional step, transforming the outputs from a classifier into probability estimates, using a labeled data set not employed for training of the models. For random forests, and other bagged ensembles, it is, however, possible to use the out-of-bag instances for calibration, making all training data available for both model learning and calibration. This procedure has previously been successfully applied to conformal prediction, but was here evaluated for the first time for Venn predictors. The empirical investigation, using 22 publicly available data sets, showed that all four versions of the Venn predictors were better calibrated than both the raw estimates from the random forest, and the standard techniques Platt scaling and isotonic regression. Regarding both informativeness and accuracy, the standard Venn predictor calibrated on out-of-bag instances was the best setup evaluated. Most importantly, calibrating on out-of-bag instances, instead of using a separate calibration set, resulted in tighter intervals and more accurate models on every data set, for both the Venn predictors and the Venn-Abers predictors.

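The out-of-bag calibration idea can be sketched with a simplified Venn-Abers predictor built on isotonic regression. This is not the paper's exact procedure; `X_train`, `y_train` (binary 0/1 labels) and a single test row `x_test` of shape (1, n_features) are assumed to be given.

```python
# Hedged sketch: Venn-Abers-style probability intervals using out-of-bag random
# forest scores as the calibration set, so no data is held out from training.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression

def venn_abers_interval(cal_scores, cal_labels, test_score):
    """Return (p0, p1): isotonic estimates with the test point labeled 0, then 1."""
    interval = []
    for hypothetical_label in (0, 1):
        s = np.append(cal_scores, test_score)
        y = np.append(cal_labels, hypothetical_label)
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(s, y)
        interval.append(float(iso.predict([test_score])[0]))
    return tuple(interval)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train, y_train)                              # X_train, y_train assumed given
cal_scores = rf.oob_decision_function_[:, 1]          # OOB P(class 1) per training instance
mask = ~np.isnan(cal_scores)                          # drop instances never out-of-bag
test_score = rf.predict_proba(x_test)[0, 1]           # x_test assumed shape (1, n_features)
p0, p1 = venn_abers_interval(cal_scores[mask], y_train[mask], test_score)
```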

6.
Lorenzen, Stephan S.; Igel, Christian; Seldin, Yevgeny. Machine Learning, 2019, 108(8-9): 1503-1522
Machine Learning - Existing guarantees in terms of rigorous upper bounds on the generalization error for the original random forest algorithm, one of the most frequently used machine learning...

7.
Regression conformal prediction produces prediction intervals that are valid, i.e., the probability of excluding the correct target value is bounded by a predefined confidence level. The most important criterion when comparing conformal regressors is efficiency; the prediction intervals should be as tight (informative) as possible. In this study, the use of random forests as the underlying model for regression conformal prediction is investigated and compared to existing state-of-the-art techniques, which are based on neural networks and k-nearest neighbors. In addition to their robust predictive performance, random forests allow for determining the size of the prediction intervals by using out-of-bag estimates instead of requiring a separate calibration set. An extensive empirical investigation, using 33 publicly available data sets, was undertaken to compare the use of random forests to existing state-of-the-art conformal predictors. The results show that the suggested approach, on almost all confidence levels and using both standard and normalized nonconformity functions, produced significantly more efficient conformal predictors than the existing alternatives.
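A minimal sketch of the out-of-bag variant follows, assuming the simple (non-normalized) absolute-error nonconformity score and ignoring finite-sample corrections.

```python
# Hedged sketch: conformal regression with a random forest, where out-of-bag
# residuals act as calibration (nonconformity) scores.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_conformal_interval(X_train, y_train, X_new, confidence=0.95, seed=0):
    rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=seed)
    rf.fit(X_train, y_train)
    residuals = np.abs(y_train - rf.oob_prediction_)   # OOB nonconformity scores
    q = np.quantile(residuals, confidence)              # simple (uncorrected) quantile
    preds = rf.predict(X_new)
    return preds - q, preds + q                          # lower and upper interval bounds
```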

8.
To address the problem that a single classifier, compared with ensemble learning, cannot achieve relatively high and stable accuracy, a classification model is proposed. The model integrates multiple random forests and combines them with a thresholded majority vote; its implementation consists of three stages: building the ensemble classification model, preliminary prediction on instances, and combination analysis. The model, implemented with MapReduce, is applied to P2P traffic identification and compared with a single random forest and with ensembles of other algorithms. Experiments show that the proposed model achieves better overall classification performance for P2P traffic identification, and it also provides a feasible reference method for binary classification.
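The thresholded majority vote over several forests can be sketched as follows. This is a single-machine simplification; a MapReduce version would train each forest on a different data partition, and the threshold value is illustrative.

```python
# Hedged sketch: a committee of random forests combined by a thresholded majority vote
# for a binary task (1 = positive class).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_rf_committee(X, y, n_forests=5, seed=0):
    return [RandomForestClassifier(n_estimators=100, random_state=seed + i).fit(X, y)
            for i in range(n_forests)]

def predict_threshold_vote(forests, X_new, threshold=0.6):
    """Predict positive only if at least `threshold` of the forests vote positive."""
    votes = np.mean([f.predict(X_new) for f in forests], axis=0)
    return (votes >= threshold).astype(int)
```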

9.
10.
Data Mining and Knowledge Discovery - In critical situations involving discrimination, gender inequality, economic damage, and even the possibility of casualties, machine learning models must be...

11.
Autism Spectrum Disorder (ASD) comprises a group of heterogeneous neurodevelopmental conditions, typically characterized by a triad of symptoms consisting of (1) impaired communication, (2) restricted interests, and (3) repetitive and stereotypical behavior patterns. An accurate and early diagnosis of autism can provide the basis for an appropriate educational and treatment program. In this work, we propose a computational model using a Multilayer Fuzzy Cognitive Map (hereafter referred to as MFCM) based on standardized behavioral assessments diagnosing ASD (MFCM-ASD). The two standards used in the model are the Autism Diagnostic Observation Schedule, Second Edition (ADOS2), and the Autism Diagnostic Interview Revised (ADIR). MFCMs are a soft computing technique whose robust properties make them effective for medical decision support systems. For the evaluation of the MFCM-ASD model, we have used real datasets of diagnosed cases, so as to compare against other approaches. Initial experiments demonstrated that the proposed model outperforms conventional Fuzzy Cognitive Maps (FCMs) for ASD diagnosis. Our MFCM-ASD model serves as a diagnostic tool that supports medical decisions when determining the correct diagnosis of autism in children with different cognitive characteristics.
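For illustration only, a generic fuzzy cognitive map inference step (not the MFCM-ASD model itself) looks roughly like the sketch below: concept activations are repeatedly propagated through the weight matrix and squashed until the state converges.

```python
# Hedged sketch of plain FCM inference; the multilayer structure and the clinical
# concepts of MFCM-ASD are not reproduced here.
import numpy as np

def fcm_infer(weights, activations, n_iter=50, lam=1.0):
    """weights[i, j]: causal influence of concept i on concept j, in [-1, 1]."""
    a = activations.astype(float)
    for _ in range(n_iter):
        a_new = 1.0 / (1.0 + np.exp(-lam * (a + a @ weights)))   # sigmoid squashing
        if np.allclose(a_new, a, atol=1e-6):
            break
        a = a_new
    return a
```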

12.
An improved random forest and its application to remote-sensing images
To address the difficulty of obtaining training samples for remote-sensing images, the random forest algorithm, which is suited to small-sample classification, is introduced. So that the random forest achieves better classification performance and higher stability under small-sample conditions, a more random feature-combination method is proposed at the decision-tree level; it lowers the correlation between decision trees and thus reduces the generalization error of the forest. An artificial immune algorithm is then introduced to compress and optimize the improved random forest, balancing the trade-off between forest size and classification stability and accuracy. Experiments on UCI data sets demonstrate the effectiveness of the improved random forest and the feasibility of the optimized model: the optimized forest is smaller and achieves higher classification accuracy. The method is also compared with traditional approaches on remote-sensing images.
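As a hedged sketch of the "more random feature combination" idea, the code below builds each tree on random signed combinations of a few original features; the paper's exact combination scheme and the artificial-immune compression step are not reproduced, and all parameters are illustrative.

```python
# Hedged sketch: per-tree random feature combinations to decorrelate trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def fit_combo_forest(X, y, n_trees=50, n_combos=10, k=3):
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))                  # bootstrap sample
        # each derived feature is a random signed sum of k original features (assumption)
        feats = rng.integers(0, X.shape[1], size=(n_combos, k))
        signs = rng.choice([-1.0, 1.0], size=(n_combos, k))
        Z = np.stack([(X[idx][:, f] * s).sum(axis=1) for f, s in zip(feats, signs)], axis=1)
        tree = DecisionTreeClassifier(max_features="sqrt").fit(Z, y[idx])
        forest.append((tree, feats, signs))
    return forest
```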

13.
Gaussian Processes are powerful tools in machine learning which offer wide applicability in regression and classification problems due to their non-parametric and non-linear behavior. However, one of their main drawbacks is the training time complexity which scales cubically with the number of samples. Our work addresses this issue by combining Gaussian Processes with Randomized Decision Forests to enable fast learning. An important advantage of our method is its simplicity and the ability to directly control the trade-off between classification performance and computation speed. Experiments on an indoor place recognition task show that our method can handle large training sets in reasonable time while retaining a good classification accuracy.

14.
We propose a new differentially-private decision forest algorithm that minimizes both the number of queries required, and the sensitivity of those queries. To do so, we build an ensemble of random decision trees that avoids querying the private data except to find the majority class label in the leaf nodes. Rather than using a count query to return the class counts like the current state-of-the-art, we use the Exponential Mechanism to only output the class label itself. This drastically reduces the sensitivity of the query – often by several orders of magnitude – which in turn reduces the amount of noise that must be added to preserve privacy. Our improved sensitivity is achieved by using “smooth sensitivity”, which takes into account the specific data used in the query rather than assuming the worst-case scenario. We also extend work done on the optimal depth of random decision trees to handle continuous features, not just discrete features. This, along with several other improvements, allows us to create a differentially private decision forest with substantially higher predictive power than the current state-of-the-art.
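The leaf-label release via the Exponential Mechanism can be sketched as follows; global sensitivity 1 is assumed for the counting utility, and the paper's smooth-sensitivity refinement is omitted.

```python
# Hedged sketch: release a leaf's majority class label with the exponential mechanism,
# with selection probability proportional to exp(eps * count / (2 * sensitivity)).
import numpy as np

def private_leaf_label(class_counts, epsilon, sensitivity=1.0, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.asarray(class_counts, dtype=float)
    scores = epsilon * counts / (2.0 * sensitivity)
    probs = np.exp(scores - scores.max())        # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(counts), p=probs))

# e.g. a leaf with counts [30, 2] almost always releases label 0 at epsilon = 1
label = private_leaf_label([30, 2], epsilon=1.0)
```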

15.
The random subspace method for constructing decision forests
Much of previous attention on decision trees focuses on the splitting criteria and optimization of tree sizes. The dilemma between overfitting and achieving maximum accuracy is seldom resolved. A method to construct a decision tree based classifier is proposed that maintains highest accuracy on training data and improves on generalization accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. The subspace method is compared to single-tree classifiers and other forest construction methods by experiments on publicly available datasets, where the method's superiority is demonstrated. We also discuss independence between trees in a forest and relate that to the combined classification accuracy.
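A minimal sketch of the random subspace construction is shown below, assuming class labels encoded as non-negative integers; the subspace fraction is illustrative.

```python
# Hedged sketch: each tree is trained on the full data but sees only a
# pseudorandomly chosen subset of the feature components.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_subspace_forest(X, y, n_trees=100, subspace_frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    k = max(1, int(subspace_frac * d))
    forest = []
    for _ in range(n_trees):
        feats = rng.choice(d, size=k, replace=False)
        forest.append((DecisionTreeClassifier().fit(X[:, feats], y), feats))
    return forest

def predict_majority(forest, X_new):
    votes = np.array([tree.predict(X_new[:, feats]) for tree, feats in forest])
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```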

16.
With the accelerating aging of China's population, the weakening of family-based elderly care, and an immature social elderly-care service system, elderly care faces many challenges. To better advise older adults on living arrangements and to provide precise decision support for elderly-care administrators, data on nearly 20,000 older adults from the CHARLS questionnaire were analyzed to identify the main factors influencing their living-arrangement preferences. Big-data and data-mining methods were also used to predict these preferences at the individual level, and the random forest feature-selection algorithm was improved for the class-imbalanced setting. The results show that older adults' living-arrangement preferences can be predicted well from their characteristic data, providing a basis for precise decision-making in elderly-care services.
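One hedged way to approximate class-imbalance-aware feature selection is a class-weighted random forest ranked by impurity importance; this is not the paper's modified algorithm, and `feature_names` is an assumed list of CHARLS-style variable names.

```python
# Hedged sketch: rank predictive features with a class-weighted random forest
# under class imbalance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_features(X, y, feature_names, n_top=10, seed=0):
    rf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                                random_state=seed).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1][:n_top]
    return [(feature_names[i], float(rf.feature_importances_[i])) for i in order]
```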

17.
Gaussian processes are powerful modeling tools in machine learning which offer wide applicability for regression and classification tasks due to their non-parametric and non-linear behavior. However, one of their main drawbacks is the training time complexity which scales cubically with the number of examples. Our work addresses this issue by combining Gaussian processes with random decision forests to enable fast learning. An important advantage of our method is its simplicity and the ability to directly control the tradeoff between classification performance and computational speed. Experiments on an indoor place recognition task and on standard machine learning benchmarks show that our method can handle large training sets of up to three million examples in reasonable time while retaining good classification accuracy.
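As a hedged sketch of the divide-and-conquer idea, a forest can route a query to small, local training subsets (its leaves) and a Gaussian process is fit only there, avoiding the cubic cost of a full GP; binary 0/1 labels are assumed, and the details differ from the paper's method.

```python
# Hedged sketch: forest-localized Gaussian process classification.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier

def forest_gp_predict(X_train, y_train, x_query, n_trees=10, seed=0):
    rf = RandomForestClassifier(n_estimators=n_trees, min_samples_leaf=50,
                                random_state=seed).fit(X_train, y_train)
    train_leaves = rf.apply(X_train)                    # leaf index per tree per instance
    query_leaves = rf.apply(x_query.reshape(1, -1))[0]
    probs = []
    for t in range(n_trees):
        mask = train_leaves[:, t] == query_leaves[t]    # training points sharing the leaf
        if len(np.unique(y_train[mask])) < 2:           # pure leaf: use its class fraction
            probs.append(float(y_train[mask].mean()))
            continue
        gp = GaussianProcessClassifier().fit(X_train[mask], y_train[mask])
        probs.append(float(gp.predict_proba(x_query.reshape(1, -1))[0, 1]))
    return float(np.mean(probs))
```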

18.
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts of replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
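The test-then-train (prequential) protocol mentioned above can be sketched generically; `model` is assumed to expose incremental `predict_one`/`learn_one` methods, an assumed interface rather than ARF's actual API.

```python
# Hedged sketch of prequential (test-then-train) accuracy over a labelled stream.
def prequential_accuracy(model, stream):
    correct, tested = 0, 0
    for i, (x, y) in enumerate(stream):          # stream yields one labelled instance at a time
        if i > 0:                                # first test on the unseen instance...
            correct += int(model.predict_one(x) == y)
            tested += 1
        model.learn_one(x, y)                    # ...then train on it
    return correct / tested if tested else 0.0
```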

19.
Seasonality effects and empirical regularities in financial data have been well documented in the financial economics literature for over seven decades. This paper proposes an expert system that uses novel machine learning techniques to predict the price return over these seasonal events, and then uses these predictions to develop a profitable trading strategy. While simple approaches to trading these regularities can prove profitable, such trading can lead to potentially large drawdowns (peak-to-trough decline of an investment measured as a percentage between the peak and the trough) in profit. In this paper, we introduce an automated trading system based on performance weighted ensembles of random forests that improves the profitability and stability of trading seasonality events. An analysis of various regression techniques is performed as well as an exploration of the merits of various techniques for expert weighting. The performance of the models is analysed using a large sample of stocks from the DAX. The results show that recency-weighted ensembles of random forests produce superior results in terms of both profitability and prediction accuracy compared with other ensemble techniques. It is also found that using seasonality effects produces superior results compared with not modelling them explicitly.
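A hedged sketch of recency-weighted expert combination is shown below; the paper's exact weighting scheme may differ, and the half-life decay is an assumption.

```python
# Hedged sketch: combine expert predictions (e.g. from random forests trained on
# earlier windows) weighted by their recent errors, with newer errors counting more.
import numpy as np

def recency_weighted_prediction(expert_preds, expert_recent_errors, half_life=5):
    """expert_preds: (n_experts,) predictions for one event;
    expert_recent_errors: (n_experts, n_past) past errors, most recent last."""
    n_past = expert_recent_errors.shape[1]
    decay = 0.5 ** (np.arange(n_past)[::-1] / half_life)      # newer errors weigh more
    mean_err = (expert_recent_errors * decay).sum(axis=1) / decay.sum()
    w = 1.0 / (mean_err + 1e-8)                               # better experts get larger weight
    w /= w.sum()
    return float(w @ expert_preds)
```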

20.
Cardiovascular disease accompanied by arrhythmia shortens an individual's lifespan and harms their health, and long-term ECG monitoring generates large amounts of data. Fortunately, arrhythmia classification assisted by computer science would greatly improve the efficiency of doctors' diagnoses. However, due to individual differences, noise affecting the signal, the great variety of arrhythmias, and the heavy computing workload, it is difficult to implement these advanced techniques for clinical context analysis. Thus, this paper proposes a comprehensive approach based on discrete wavelet and random forest techniques for arrhythmia classification. Specifically, discrete wavelet transformation is used to remove high-frequency noise and baseline drift, while discrete wavelet transformation, autocorrelation, principal component analysis, variances and other mathematical methods are used to extract frequency-domain features, time-domain features and morphology features. Furthermore, an arrhythmia classification system is developed and its validity is verified, indicating that the proposed scheme can serve as guidance and a reference for automatic arrhythmia classification in clinical settings.
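A minimal sketch of the wavelet-denoising step feeding a random forest is given below; the wavelet, threshold rule, and feature set are assumptions, not the paper's values.

```python
# Hedged sketch: discrete-wavelet denoising / baseline removal of an ECG signal,
# followed by random forest classification of fixed-length beat segments.
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def wavelet_denoise(signal, wavelet="db4", level=6):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    coeffs[0] = np.zeros_like(coeffs[0])                 # drop approximation ~ baseline drift
    # universal threshold estimated from the finest detail coefficients (assumption)
    thr = np.median(np.abs(coeffs[-1])) / 0.6745 * np.sqrt(2 * np.log(len(signal)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

# beats: list of equal-length denoised heartbeat segments; labels: arrhythmia classes
# clf = RandomForestClassifier(n_estimators=300).fit(np.vstack(beats), labels)
```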
