首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
One class classification is a binary classification task for which only one class of samples is available for learning. In some preliminary works, we have proposed One Class Random Forests (OCRF), a method based on a random forest algorithm and an original outlier generation procedure that makes use of classifier ensemble randomization principles. In this paper, we propose an extensive study of the behavior of OCRF, that includes experiments on various UCI public datasets and comparison to reference one class namely, Gaussian density models, Parzen estimators, Gaussian mixture models and One Class SVMs—with statistical significance. Our aim is to show that the randomization principles embedded in a random forest algorithm make the outlier generation process more efficient, and allow in particular to break the curse of dimensionality. One Class Random Forests are shown to perform well in comparison to other methods, and in particular to maintain stable performance in higher dimension, while the other algorithms may fail.  相似文献   

2.
The random forests (RF) algorithm, which combines the predictions from an ensemble of random trees, has achieved significant improvements in terms of classification accuracy. In many real-world applications, however, ranking is often required in order to make optimal decisions. Thus, we focus our attention on the ranking performance of RF in this paper. Our experimental results based on the entire 36 UC Irvine Machine Learning Repository (UCI) data sets published on the main website of Weka platform show that RF doesn’t perform well in ranking, and is even about the same as a single C4.4 tree. This fact raises the question of whether several improvements to RF can scale up its ranking performance. To answer this question, we single out an improved random forests (IRF) algorithm. Instead of the information gain measure and the maximum-likelihood estimate, the average gain measure and the similarity-weighted estimate are used in IRF. Our experiments show that IRF significantly outperforms all the other algorithms used to compare in terms of ranking while maintains the high classification accuracy characterizing RF.  相似文献   

3.

Successful use of probabilistic classification requires well-calibrated probability estimates, i.e., the predicted class probabilities must correspond to the true probabilities. In addition, a probabilistic classifier must, of course, also be as accurate as possible. In this paper, Venn predictors, and its special case Venn-Abers predictors, are evaluated for probabilistic classification, using random forests as the underlying models. Venn predictors output multiple probabilities for each label, i.e., the predicted label is associated with a probability interval. Since all Venn predictors are valid in the long run, the size of the probability intervals is very important, with tighter intervals being more informative. The standard solution when calibrating a classifier is to employ an additional step, transforming the outputs from a classifier into probability estimates, using a labeled data set not employed for training of the models. For random forests, and other bagged ensembles, it is, however, possible to use the out-of-bag instances for calibration, making all training data available for both model learning and calibration. This procedure has previously been successfully applied to conformal prediction, but was here evaluated for the first time for Venn predictors. The empirical investigation, using 22 publicly available data sets, showed that all four versions of the Venn predictors were better calibrated than both the raw estimates from the random forest, and the standard techniques Platt scaling and isotonic regression. Regarding both informativeness and accuracy, the standard Venn predictor calibrated on out-of-bag instances was the best setup evaluated. Most importantly, calibrating on out-of-bag instances, instead of using a separate calibration set, resulted in tighter intervals and more accurate models on every data set, for both the Venn predictors and the Venn-Abers predictors.

  相似文献   

4.
Lorenzen  Stephan S.  Igel  Christian  Seldin  Yevgeny 《Machine Learning》2019,108(8-9):1503-1522
Machine Learning - Existing guarantees in terms of rigorous upper bounds on the generalization error for the original random forest algorithm, one of the most frequently used machine learning...  相似文献   

5.
Regression conformal prediction produces prediction intervals that are valid, i.e., the probability of excluding the correct target value is bounded by a predefined confidence level. The most important criterion when comparing conformal regressors is efficiency; the prediction intervals should be as tight (informative) as possible. In this study, the use of random forests as the underlying model for regression conformal prediction is investigated and compared to existing state-of-the-art techniques, which are based on neural networks and k-nearest neighbors. In addition to their robust predictive performance, random forests allow for determining the size of the prediction intervals by using out-of-bag estimates instead of requiring a separate calibration set. An extensive empirical investigation, using 33 publicly available data sets, was undertaken to compare the use of random forests to existing state-of-the-art conformal predictors. The results show that the suggested approach, on almost all confidence levels and using both standard and normalized nonconformity functions, produced significantly more efficient conformal predictors than the existing alternatives.  相似文献   

6.
Time-series classification (TSC) problems present a specific challenge for classification algorithms: how to measure similarity between series. A shapelet is a time-series subsequence that allows for TSC based on local, phase-independent similarity in shape. Shapelet-based classification uses the similarity between a shapelet and a series as a discriminatory feature. One benefit of the shapelet approach is that shapelets are comprehensible, and can offer insight into the problem domain. The original shapelet-based classifier embeds the shapelet-discovery algorithm in a decision tree, and uses information gain to assess the quality of candidates, finding a new shapelet at each node of the tree through an enumerative search. Subsequent research has focused mainly on techniques to speed up the search. We examine how best to use the shapelet primitive to construct classifiers. We propose a single-scan shapelet algorithm that finds the best $k$ shapelets, which are used to produce a transformed dataset, where each of the $k$ features represent the distance between a time series and a shapelet. The primary advantages over the embedded approach are that the transformed data can be used in conjunction with any classifier, and that there is no recursive search for shapelets. We demonstrate that the transformed data, in conjunction with more complex classifiers, gives greater accuracy than the embedded shapelet tree. We also evaluate three similarity measures that produce equivalent results to information gain in less time. Finally, we show that by conducting post-transform clustering of shapelets, we can enhance the interpretability of the transformed data. We conduct our experiments on 29 datasets: 17 from the UCR repository, and 12 we provide ourselves.  相似文献   

7.
Using random forests to diagnose aviation turbulence   总被引:1,自引:0,他引:1  
Atmospheric turbulence poses a significant hazard to aviation, with severe encounters costing airlines millions of dollars per year in compensation, aircraft damage, and delays due to required post-event inspections and repairs. Moreover, attempts to avoid turbulent airspace cause flight delays and en route deviations that increase air traffic controller workload, disrupt schedules of air crews and passengers and use extra fuel. For these reasons, the Federal Aviation Administration and the National Aeronautics and Space Administration have funded the development of automated turbulence detection, diagnosis and forecasting products. This paper describes a methodology for fusing data from diverse sources and producing a real-time diagnosis of turbulence associated with thunderstorms, a significant cause of weather delays and turbulence encounters that is not well-addressed by current turbulence forecasts. The data fusion algorithm is trained using a retrospective dataset that includes objective turbulence reports from commercial aircraft and collocated predictor data. It is evaluated on an independent test set using several performance metrics including receiver operating characteristic curves, which are used for FAA turbulence product evaluations prior to their deployment. A prototype implementation fuses data from Doppler radar, geostationary satellites, a lightning detection network and a numerical weather prediction model to produce deterministic and probabilistic turbulence assessments suitable for use by air traffic managers, dispatchers and pilots. The algorithm is scheduled to be operationally implemented at the National Weather Service’s Aviation Weather Center in 2014.  相似文献   

8.
与集成学习相比,针对单个分类器不能获得相对较高而稳定的准确率的问题,提出一种分类模型.该模型可集成多个随机森林,并以带阈值的多数投票法作为结合方法;模型实现主要分为建立集成分类模型、实例初步预测和结合分析三个层次.MapReduce编程方式实现的分类模型以P2P流量识别为例,分别与单个随机森林和集成其他算法进行对比,实验表明提出模型能获得更好的P2P流量识别综合分类性能,该模型也为二类型分类提供了一种可行的参考方法.  相似文献   

9.
Yi Mao  Guy Lebanon 《Machine Learning》2009,77(2-3):225-248
Conditional random fields are one of the most popular structured prediction models. Nevertheless, the problem of incorporating domain knowledge into the model is poorly understood and remains an open issue. We explore a new approach for incorporating a particular form of domain knowledge through generalized isotonic constraints on the model parameters. The resulting approach has a clear probabilistic interpretation and efficient training procedures. We demonstrate the applicability of our framework with an experimental study on sentiment prediction and information extraction tasks.  相似文献   

10.
11.
Data Mining and Knowledge Discovery - In critical situations involving discrimination, gender inequality, economic damage, and even the possibility of casualties, machine learning models must be...  相似文献   

12.
Gaussian Processes are powerful tools in machine learning which offer wide applicability in regression and classification problems due to their non-parametric and non-linear behavior. However, one of their main drawbacks is the training time complexity which scales cubically with the number of samples. Our work addresses this issue by combining Gaussian Processes with Randomized Decision Forests to enable fast learning. An important advantage of our method is its simplicity and the ability to directly control the trade-off between classification performance and computation speed. Experiments on an indoor place recognition task show that our method can handle large training sets in reasonable time while retaining a good classification accuracy.  相似文献   

13.
The random subspace method for constructing decision forests   总被引:28,自引:0,他引:28  
Much of previous attention on decision trees focuses on the splitting criteria and optimization of tree sizes. The dilemma between overfitting and achieving maximum accuracy is seldom resolved. A method to construct a decision tree based classifier is proposed that maintains highest accuracy on training data and improves on generalization accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. The subspace method is compared to single-tree classifiers and other forest construction methods by experiments on publicly available datasets, where the method's superiority is demonstrated. We also discuss independence between trees in a forest and relate that to the combined classification accuracy  相似文献   

14.
We propose a new differentially-private decision forest algorithm that minimizes both the number of queries required, and the sensitivity of those queries. To do so, we build an ensemble of random decision trees that avoids querying the private data except to find the majority class label in the leaf nodes. Rather than using a count query to return the class counts like the current state-of-the-art, we use the Exponential Mechanism to only output the class label itself. This drastically reduces the sensitivity of the query – often by several orders of magnitude – which in turn reduces the amount of noise that must be added to preserve privacy. Our improved sensitivity is achieved by using “smooth sensitivity”, which takes into account the specific data used in the query rather than assuming the worst-case scenario. We also extend work done on the optimal depth of random decision trees to handle continuous features, not just discrete features. This, along with several other improvements, allows us to create a differentially private decision forest with substantially higher predictive power than the current state-of-the-art.  相似文献   

15.
Gaussian processes are powerful modeling tools in machine learning which offer wide applicability for regression and classification tasks due to their non-parametric and non-linear behavior. However, one of their main drawbacks is the training time complexity which scales cubically with the number of examples. Our work addresses this issue by combining Gaussian processes with random decision forests to enable fast learning. An important advantage of our method is its simplicity and the ability to directly control the tradeoff between classification performance and computational speed. Experiments on an indoor place recognition task and on standard machine learning benchmarks show that our method can handle large training sets of up to three million examples in reasonable time while retaining good classification accuracy.  相似文献   

16.
随着我国老龄化和高龄化趋势的加速,以及家庭养老功能弱化、社会养老服务体系不健全等问题,养老事业面临诸多挑战。为了更好地为老年人提供居住安排建议,同时为养老事业管理部门提供精准的决策支持,对CHARLS问卷中将近2万名老年人的数据进行了分析,力图发现影响老年人居住偏好的主要因素。同时,也尝试利用大数据和数据挖掘方法,从个人层面对老年人居住偏好进行预测,并针对类不平衡的情况下随机森林特征选择算法进行了改进。研究结果表明:基于老年人的特征数据可以很好地预测其居住偏好,为养老事业的精准化决策提供一种依据。  相似文献   

17.
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts of replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drifts without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF which has no degradation in terms of classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.  相似文献   

18.
Cardiovascular disease accompanied by arrhythmia reduces an individual’s lifespan and health, and long term ECG monitoring would generate large amounts of data. Fortunately, arrhythmia classification assisted by computer science would greatly improve the efficiency of doctors’ diagnoses. However, due to individual differences, noise affecting the signal, the great variety of arrhythmias, and heavy computing workload, it is difficult to implement these advanced techniques for clinical context analysis. Thus, this paper proposes a comprehensive approach based on discrete wavelet and random forest techniques for arrhythmia classification. Specifically, discrete wavelet transformation is used to remove high-frequency noise and baseline drift, while discrete wavelet transformation, autocorrelation, principal component analysis, variances and other mathematical methods are used to extract frequency-domain features, time-domain features and morphology features. Furthermore, an arrhythmia classification system is developed, and its availability is verified that the proposed scheme can significantly be used for guidance and reference in clinical arrhythmia automatic classification.  相似文献   

19.
Seasonality effects and empirical regularities in financial data have been well documented in the financial economics literature for over seven decades. This paper proposes an expert system that uses novel machine learning techniques to predict the price return over these seasonal events, and then uses these predictions to develop a profitable trading strategy. While simple approaches to trading these regularities can prove profitable, such trading leads to potential large drawdowns (peak-to-trough decline of an investment measured as a percentage between the peak and the trough) in profit. In this paper, we introduce an automated trading system based on performance weighted ensembles of random forests that improves the profitability and stability of trading seasonality events. An analysis of various regression techniques is performed as well as an exploration of the merits of various techniques for expert weighting. The performance of the models is analysed using a large sample of stocks from the DAX. The results show that recency-weighted ensembles of random forests produce superior results in terms of both profitability and prediction accuracy compared with other ensemble techniques. It is also found that using seasonality effects produces superior results than not having them modelled explicitly.  相似文献   

20.
刘颖  李旭  吕政  赵珺  王伟 《控制与决策》2024,39(7):2315-2324
时间序列数据广泛存在于工业、医疗等应用领域,由于其时序相关性强、特征空间维度大,使得传统的时间序列分类方法普遍存在精度不足和需要复杂特征工程等问题.充分考虑深度神经网络在处理复杂时序数据上的优越性以及决策树方法拟合数据能力强的优势,提出一种基于残差网络和概率决策树的端到端统一深度学习模型.该模型利用残差网络从原始时间序列中提取高级特征,为了更好地建立时序数据特征与类别标签间的映射关系,将概率决策树融入至残差网络的分类层.同时,设计随机子空间的集成策略,缓解由于残差网络的深层结构产生的过度拟合现象,并给出联合优化模型分裂参数和预测参数的迭代优化方案.在大量的基准数据集和工业案例上进行实验和分析,实验结果表明,所提出模型的分类性能优于传统方法与其他深度学习方法,且可有效提高残差网络的泛化能力.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号