Similar literature
 20 similar documents found (search time: 0 ms)
1.
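The adaptive random forest classifier places a warning detector and a drift detector on each base classifier. During training, instances often trigger several warning detectors at once, so multiple background trees are trained simultaneously, which makes the run demanding in both memory and time. To address this problem, an improved adaptive random forest ensemble classification algorithm is proposed: the concept drift detector is placed at the ensemble level, the drift detectors on the individual base trees are removed, and the number of background trees to train is determined from the prediction accuracy of the ensemble. When the improved algorithm is used to classify relatively balanced data streams, it runs faster and consumes less memory than the original algorithm while maintaining classification performance, and it adapts more quickly to concept drift in the data stream.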

2.
3.
Data Mining and Knowledge Discovery - In recent years, data stream mining and learning from imbalanced data have been active research areas. Even though solutions exist to tackle these two problems,...

4.
The support vector machine (SVM) is sensitive to outliers, which reduces its generalization ability. This paper presents a novel support vector regression (SVR) method that combines fuzzification theory, an inconsistency matrix, and a neighbors match operator to address this issue. A fuzzification method is used to assign similarities on the input space and on the output response to each pair of training samples. The inconsistency matrix is used to calculate the weights of the input variables; outliers are then located by a novel neighborhood matching algorithm and eliminated. Finally, the processed data are passed to the original SVR to obtain the prediction results. A simulation example and three real-world applications demonstrate the proposed method on data sets with outliers.

5.
6.
Liang, Shunpan; Pan, Weiwei; You, Dianlong; Liu, Ze; Yin, Ling. Applied Intelligence, 2022, 52(12): 13398-13414

Multi-label learning has attracted much attention. However, data generated continuously by sensors, network access, and similar sources, that is, data streams, bring challenges such as real-time processing, limited memory, and single-pass access. Several learning algorithms have been proposed for offline multi-label classification, but few studies have developed dynamic multi-label incremental learning models based on cascading schemes. Deep forest performs representation learning layer by layer and does not rely on backpropagation. Using this cascading scheme, this paper proposes a multi-label data stream deep forest (VDSDF) learning algorithm based on a cascaded Very Fast Decision Tree (VFDT) forest, which can receive examples successively, learn incrementally, and adapt to concept drift. Experimental results show that the proposed VDSDF algorithm, as an incremental classification algorithm, is more competitive than batch classification algorithms on multiple indicators. Moreover, in dynamic stream scenarios, VDSDF adapts to concept drift better than the comparison algorithm.


7.
It is challenging to use traditional data mining techniques for real-time data stream classification. Existing mining classifiers need to be updated frequently to adapt to changes in data streams. To address this issue, we propose an adaptive ensemble approach for classification and novel class detection in concept-drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in the data stream. For novel class detection, we rely on the idea that data points belonging to the same class should be close to each other and far from data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the proposed stream classification model against existing mining algorithms using real benchmark datasets from the UCI (University of California, Irvine) machine learning repository. The experimental results show that our approach is flexible and robust in novel class detection under concept drift and outperforms traditional classification models in challenging real-life data stream applications.
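
The separation criterion in this abstract can be illustrated with a minimal sketch. It assumes each known class is summarized by its centroid and by the largest distance from that centroid to any member; the `margin` parameter and the function names are hypothetical and are not taken from the paper.

```python
import math


def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def is_novel(point, clusters, margin=1.0):
    """Flag `point` as a novel-class candidate.

    `clusters` maps a class label to the points currently assigned to it.
    The point counts as novel only if it lies outside every cluster, where
    a cluster's extent is its largest member-to-centroid distance scaled
    by `margin` (an assumed tuning knob).
    """
    for members in clusters.values():
        center = centroid(members)
        radius = max(math.dist(center, m) for m in members)
        if math.dist(center, point) <= margin * radius:
            return False  # close enough to an existing class
    return True


clusters = {
    "a": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "b": [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)],
}
print(is_novel((0.5, 0.5), clusters))  # inside cluster "a"
print(is_novel((5.0, 5.0), clusters))  # far from both clusters
```

A streaming system would normally buffer such candidates and declare a new class only once enough of them form their own cluster; that step is omitted here.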

8.
This paper, for the first time, introduces a random forest regression based methodology for integrating an Inertial Navigation System (INS) and the Global Positioning System (GPS) to provide a continuous, accurate and reliable navigation solution. Numerous techniques, such as those based on the Kalman filter (KF) and artificial intelligence approaches, exist to fuse INS and GPS data. The basic idea behind these fusion techniques is to model the INS error while the GPS signal is available. During outages, the developed model provides INS error estimates, thereby maintaining continuity and improving the accuracy of the navigation solution. KF-based approaches have several inadequacies related to the sensor error model, immunity to noise, and computational load. Alternatively, neural network (NN) approaches proposed to overcome the KF limitations perform unsatisfactorily for low-cost INS, as they suffer from poor generalization capability in the presence of a high level of noise. In this study, random forest regression is shown to model the highly non-linear INS error effectively owing to its improved generalization capability. To evaluate the effectiveness of the proposed method in bridging GPS outages, four simulated GPS outages are considered over real field test data. The proposed methodology reduces the positional error by 24–56%.

9.
Although observations concerning the factors that influence siRNA efficacy give clues to the mechanism of RNAi, the quantitative prediction of siRNA efficacy is still a challenging task. In this paper, we introduce a novel non-linear regression method, random forest regression (RFR), to quantitatively estimate siRNA efficacy values. Compared with an alternative machine learning regression algorithm, support vector machine regression (SVR), and four other score-based algorithms [A. Reynolds, D. Leake, Q. Boese, S. Scaringe, W.S. Marshall, A. Khvorova, Rational siRNA design for RNA interference, Nat. Biotechnol. 22 (2004) 326-330; K. Ui-Tei, Y. Naito, F. Takahashi, T. Haraguchi, H. Ohki-Hamazaki, A. Juni, R. Ueda, K. Saigo, Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference, Nucleic Acids Res. 32 (2004) 936-948; A.C. Hsieh, R. Bo, J. Manola, F. Vazquez, O. Bare, A. Khvorova, S. Scaringe, W.R. Sellers, A library of siRNA duplexes targeting the phosphoinositide 3-kinase pathway: determinants of gene silencing for use in cell-based screens, Nucleic Acids Res. 32 (2004) 893-901; M. Amarzguioui, H. Prydz, An algorithm for selection of functional siRNA sequences, Biochem. Biophys. Res. Commun. 316 (2004) 1050-1058], our RFR model achieved the best performance of all. A web server, RFRCDB-siRNA (http://www.bioinf.seu.edu.cn/siRNA/index.htm), has been developed. RFRCDB-siRNA consists of two modules: an siRNA-centric database and an RFR prediction system. RFRCDB-siRNA works as follows: (1) Instead of directly predicting the gene silencing activity of siRNAs, the service takes these siRNAs as queries to search against the siRNA-centric database. The matched sequences that exceed the user-defined functionality value threshold are kept. (2) The mismatched sequences are then passed to the RFR prediction system for further analysis.

10.
When several data owners possess data on different records but the same variables, known as horizontally partitioned data, the owners can improve statistical inferences by sharing their data with each other. Often, however, the owners are unwilling or unable to share because the data are confidential or proprietary. Secure computation protocols enable the owners to compute parameter estimates for some statistical models, including linear regressions, without sharing individual records’ data. A drawback to these techniques is that the model must be specified in advance of initiating the protocol, and the usual exploratory strategies for determining good-fitting models have limited usefulness since the individual records are not shared. In this paper, we present a protocol for secure adaptive regression splines that allows for flexible, semi-automatic regression modeling. This reduces the risk of model mis-specification inherent in secure computation settings. We illustrate the protocol with air pollution data.

12.
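Random forest has proven to be an efficient method for classification and feature selection. Although parameter settings have only a small effect on the results, suitable parameters allow the classifier to achieve the desired performance. Focusing on classification and feature selection for small-sample, imbalanced data in cancer research, this work studies the setting of class weights in random forest. To compare the effect of feature selection under different class weights, the support vector machine (SVM) method is also used. The final results show that the optimal class weight is not fixed. Finally, several rules are summarized to guide researchers in choosing suitable weights so that classification and feature selection improve.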

13.
The integration of spectral, textural, and topographic information using a random forest classifier for land-cover mapping in the rugged Nujiang Grand Canyon was investigated in this study. Only a few land-cover categories were accurately discriminated using spectral information exclusively, with an overall accuracy of 0.56 and a kappa coefficient of 0.51. The inclusion of topographic information as additional bands provided higher overall accuracy (0.69) and kappa coefficient (0.65) than topographic correction (overall accuracy, 0.57–0.58; kappa coefficient range, 0.52–0.53), which failed to markedly improve classification accuracy. In contrast with the exclusive use of spectral bands, most of the included land-cover categories were correctly classified using textural features exclusively (overall accuracy, 0.67–0.88; kappa coefficient, 0.63–0.87). In particular, classification based on geostatistical features led to slightly more accurate results than did grey-level co-occurrence matrix parameters. The window size selected for texture calculation markedly affected the texture-based classification accuracy: larger window size yielded higher classification accuracy. However, no optimal window size exists. The inclusion of the topographic bands in the texture images led to an increase in the overall accuracy of 1.1–9.0%, and to an increase in the kappa coefficient of 0.0–10.9%. Thus, for the Nujiang Grand Canyon, topographic information was more important for the discrimination of some land-cover types than spectral and textural information. Among the Landsat Thematic Mapper (TM) spectral bands, bands 6 and 4 were of greatest importance. The relative importance of textural features generally increased with window size, and a few textural features were of consistently high importance. Although a random forest classifier does not overfit, undertaking feature selection analysis prior to classification may still be valuable.

14.
In a small case study of mixed hardwood Hyrcanian forests of Iran, three non-parametric methods, namely k-nearest neighbour (k-NN), support vector machine regression (SVR) and tree regression based on random forest (RF), were used for plot-level estimation of volume/ha, basal area/ha and stems/ha using field inventory and Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) data. Relevant pre-processing and processing steps were applied to the ASTER data for geometric and atmospheric correction and for enhancing quantitative forest parameters. After collecting terrestrial information on trees in the 101 samples, the volume, basal area and tree number per hectare were calculated for each plot. In the k-NN implementation, cross-validation was used to find the best distance measure and the optimal k. In SVR, the best regularized parameters of four kernel types were obtained using leave-one-out cross-validation. RF was implemented using a bootstrap learning method with regularized parameters for the decision tree model and stopping. The validity of the results was examined on unused test samples using absolute and relative root mean square error (RMSE) and bias metrics. In volume/ha estimation, the results showed that all three algorithms had similar performance. However, SVR and RF produced better results than k-NN, with relative RMSE values of 28.54, 25.86 and 26.86 (m³ ha⁻¹) for the k-NN, SVR and RF algorithms, respectively, and RF could generate unbiased estimates. In basal area/ha and stems/ha estimation, RF was slightly superior in relative RMSE (18.39, 20.64) to SVR (19.35, 22.09) and k-NN (20.20, 21.53), but k-NN could generate unbiased estimates compared with the other two algorithms.

15.
Classical nonlinear expectile regression has two shortcomings: it is difficult to choose a nonlinear function, and it does not consider interaction effects among explanatory variables. Therefore, we combine the random forest model with the expectile regression method to propose a new nonparametric expectile regression model: the expectile regression forest (ERF). The major novelty of the ERF model is using the bagging method to build multiple decision trees, calculating the conditional expectile of each leaf node in each decision tree, and deriving the final result by aggregating the decision tree results via a simple average. At the same time, to compensate for the black box problem in interpreting the ERF model, measures of explanatory variable importance and partial dependence are defined to evaluate the magnitude and direction of the influence of each explanatory variable on the response variable. The advantage of the ERF model is illustrated by Monte Carlo simulation studies. The numerical simulation results show that the estimation and prediction ability of the ERF model is significantly better than that of alternative approaches. We also apply the ERF model to real data. From the nonparametric expectile regression analysis of these data sets, we draw several conclusions that are consistent with the results of the numerical simulation.
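
The two steps named in this abstract, a conditional expectile per leaf followed by a simple average across trees, can be sketched as follows. The fixed-point iteration for the sample expectile is a standard textbook scheme and an assumption here, since the abstract does not say how the leaf expectile is computed; the function names and the toy leaf contents are hypothetical.

```python
def expectile(values, tau, tol=1e-10, max_iter=1000):
    """Sample tau-expectile (0 < tau < 1) by iteratively reweighted averaging.

    The tau-expectile m minimizes sum(w_i * (y_i - m)**2), where w_i = tau
    for observations above m and 1 - tau otherwise. Each pass recomputes
    the weights at the current m and takes the weighted mean.
    """
    m = sum(values) / len(values)  # start from the mean (the 0.5-expectile)
    for _ in range(max_iter):
        weights = [tau if v > m else 1.0 - tau for v in values]
        new_m = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_m - m) < tol:
            return new_m
        m = new_m
    return m


def forest_expectile(leaf_samples_per_tree, tau):
    """Average the leaf-level expectiles over all trees.

    `leaf_samples_per_tree[t]` holds the training responses in the leaf
    that the query point reaches in tree t.
    """
    per_tree = [expectile(leaf, tau) for leaf in leaf_samples_per_tree]
    return sum(per_tree) / len(per_tree)


leaves = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
print(forest_expectile(leaves, 0.5))  # tau = 0.5 reduces to averaging leaf means
print(expectile([1.0, 2.0, 3.0, 4.0], 0.9))
```

With tau = 0.5 every weight is equal, so the expectile is the ordinary mean and the forest prediction matches a standard regression forest; other values of tau shift the estimate toward the upper or lower tail.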

16.
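To address problems in existing under-sampling algorithms, such as unrepresentative samples and poor classification performance, a weighted random forest algorithm based on clustering under-sampling (CUS-WRF) is proposed. The majority-class samples are clustered with the K-means algorithm, and Euclidean distance is introduced as the basis for weighting how many samples are allocated during under-sampling, so that the sampled majority-class samples and the minority-class samples form a balanced sample set. CART decision trees serve as the base classifiers and a weighted random forest as the overall framework; the accuracy on test samples is used as the weight of each tree in the final vote on the result, which effectively improves overall classification performance. Experiments on eight KEEL datasets show that CUS-WRF has better classification performance and stability than four other random-forest-based algorithms for imbalanced data.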
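
The final voting step described in this abstract, in which each tree's vote is weighted by its accuracy, can be sketched as below. Only the vote is shown; the K-means under-sampling and the CART training are omitted, and the function name, labels and weights are hypothetical illustration values.

```python
from collections import defaultdict


def weighted_vote(predictions, tree_weights):
    """Return the label with the largest total weight.

    `predictions[t]` is the label predicted by tree t and `tree_weights[t]`
    is that tree's weight, assumed here to be its measured accuracy.
    Ties go to the label that was seen first.
    """
    scores = defaultdict(float)
    for label, weight in zip(predictions, tree_weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Two weak trees say "neg", one strong tree says "pos".
print(weighted_vote(["pos", "neg", "neg"], [0.9, 0.4, 0.4]))
```

In this toy case an unweighted majority vote would return "neg", while the accuracy-weighted vote returns "pos" because 0.9 exceeds 0.4 + 0.4.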

17.
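To make full use of both the spectral information and the spatial structure information of hyperspectral images, a new random-forest-based classification method for hyperspectral remote sensing images is proposed. First, principal component analysis is used to reduce the dimensionality of the data, and independent component analysis is applied to the principal components to extract spectral features while removing the spatial correlation of the pixels. Morphological analysis is then used to extract the spatial structure features of the pixels. Next, random forests are built separately from the spectral-domain and spatial-domain features of the pixels, and spatial continuity is introduced to constrain and correct the predictions for individual pixels. Finally, a voting mechanism determines the final classification result. Experimental results on AVIRIS and ROSIS hyperspectral images show that the proposed method outperforms traditional hyperspectral image classification methods and achieves higher classification accuracy than methods based on a single feature type.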

18.
The prediction method plays a crucial role in accurate precipitation forecasts. Recently, machine learning has been widely used for forecasting precipitation, and the K-nearest neighbor (KNN) algorithm, one such machine learning technique, has shown good performance. In this paper, we propose an improved KNN algorithm that is robust to different choices of the neighborhood size k, particularly when the class distribution of the precipitation dataset is irregular. Then, based on our improved KNN algorithm, a new precipitation forecast approach is put forward. Extensive experimental results demonstrate the effectiveness of the proposed precipitation forecast approach.

19.
A data proximity matrix is an important information source in random forests (RF) based data mining, including data clustering, visualization, outlier detection, substitution of missing values, and finding mislabeled data samples. A novel approach to estimate proximity is proposed in this work. The approach is based on measuring the distance between two terminal nodes in a decision tree. To assess the consistency (quality) of the data proximity estimate, we suggest using the proximity matrix as a kernel matrix in a support vector machine (SVM), under the assumption that a matrix of higher quality leads to higher classification accuracy. It is experimentally shown that the proposed approach improves the proximity estimate, especially when the RF is made of a small number of trees. It is also demonstrated that, for some tasks, an SVM exploiting the suggested proximity-matrix-based kernel outperforms an SVM based on a standard radial basis function kernel and on the standard proximity-matrix-based kernel.
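
For context, the standard RF proximity that this abstract uses as a baseline is commonly defined as the fraction of trees in which two samples fall into the same terminal node. The sketch below implements only that conventional definition, not the terminal-node distance proposed in the paper; the input layout `leaf_ids[t][i]` and the function name are assumptions made for illustration.

```python
def proximity_matrix(leaf_ids):
    """Standard random forest proximity.

    `leaf_ids[t][i]` is the index of the terminal node that sample i
    reaches in tree t. Entry [i][j] of the result is the fraction of
    trees in which samples i and j share a terminal node.
    """
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    counts = [[0] * n_samples for _ in range(n_samples)]
    for tree in leaf_ids:
        for i in range(n_samples):
            for j in range(n_samples):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


# Two trees, three samples.
leaf_ids = [
    [0, 0, 1],  # tree 0: samples 0 and 1 share a leaf
    [0, 1, 1],  # tree 1: samples 1 and 2 share a leaf
]
for row in proximity_matrix(leaf_ids):
    print(row)
```

With few trees the entries take only a handful of distinct values (here 0, 0.5 and 1), which is consistent with the abstract's remark that a better estimate matters most for small forests.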

20.
Machine learning-based classification techniques provide support for the decision-making process in many areas of health care, including diagnosis, prognosis, screening, etc. Feature selection (FS) is expected to improve classification performance, particularly in situations characterized by the high data dimensionality problem caused by relatively few training examples compared to a large number of measured features. In this paper, a random forest classifier (RFC) approach is proposed to diagnose lymph diseases. Focusing on feature selection, the first stage of the proposed system applies diverse feature selection algorithms, namely genetic algorithm (GA), Principal Component Analysis (PCA), Relief-F, Fisher, Sequential Forward Floating Search (SFFS) and Sequential Backward Floating Search (SBFS), to reduce the dimension of the lymph diseases dataset. In the second stage, which moves from feature selection to model construction, the obtained feature subsets are fed into the RFC for efficient classification. GA-RFC achieved the highest classification accuracy, 92.2%. Using GA, the dimension of the input feature space is reduced from eighteen to six features.
