1.

张馨予 安建成 曹锐 《计算机工程与科学》2020,42(3):543-549

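The adaptive random forest classifier places a warning detector and a drift detector on each base classifier. During training, a single instance often triggers several warning detectors at once, so that multiple background trees are trained simultaneously, which makes the algorithm memory-hungry and slow. To address this problem, an improved adaptive random forest ensemble classification algorithm is proposed: the concept drift detector is placed at the ensemble level, the drift detectors on the individual base trees are removed, and the number of background trees to train is determined from the ensemble's prediction accuracy. When the improved algorithm is used to classify relatively balanced data streams, it preserves classification performance while reducing running time and memory consumption compared with the original algorithm, and it adapts more quickly to concept drift in the data stream.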

2.

Optimizing regression models for data streams with missing values

Indrė Žliobaitė Jaakko Hollmén 《Machine Learning》2015,99(1):47-73

3.

Chebyshev approaches for imbalanced data streams regression models

Aminian Ehsan Ribeiro Rita P. Gama João 《Data mining and knowledge discovery》2021,35(6):2389-2466

In recent years, data stream mining and learning from imbalanced data have been active research areas. Even though solutions exist to tackle these two problems,...

4.

A novel support vector regression for data set with outliers

《Applied Soft Computing》2015

Support vector machine (SVM) is sensitive to outliers, which reduces its generalization ability. This paper presents a novel support vector regression (SVR) together with fuzzification theory, an inconsistency matrix and a neighbors match operator to address this critical issue. The fuzzification method is exploited to assign similarities on the input space and on the output response to each pair of training samples respectively. The inconsistency matrix is used to calculate the weights of input variables, followed by searching for outliers through a novel neighborhood matching algorithm and then eliminating them. Finally, the processed data is sent to the original SVR, and the prediction results are acquired. A simulation example and three real-world applications demonstrate the proposed method on data sets with outliers.

5.

Fast adaptive kernel density estimator for data streams

Arnold P. Boedihardjo Chang-Tien Lu Feng Chen 《Knowledge and Information Systems》2015,42(2):285-317

6.

Incremental deep forest for multi-label data streams learning

Liang Shunpan Pan Weiwei You Dianlong Liu Ze Yin Ling 《Applied Intelligence》2022,52(12):13398-13414

Multi-label learning has attracted much attention. However, continuous data generated by sensors, network access, and similar sources, that is, data streams, bring challenges such as real-time processing, limited memory, and single-pass access. Several learning algorithms have been proposed for offline multi-label classification, but little research has developed dynamic multi-label incremental learning models based on cascading schemes. Deep forest can perform representation learning layer by layer and does not rely on backpropagation. Using this cascading scheme, this paper proposes a multi-label data stream deep forest (VDSDF) learning algorithm based on a cascaded Very Fast Decision Tree (VFDT) forest, which can receive examples successively, perform incremental learning, and adapt to concept drift. Experimental results show that the proposed VDSDF algorithm, as an incremental classification algorithm, is more competitive than batch classification algorithms on multiple indicators. Moreover, in dynamic stream scenarios, VDSDF adapts to concept drift better than the comparison algorithms.

7.

An adaptive ensemble classifier for mining concept drifting data streams

Dewan Md. Farid Li Zhang Alamgir Hossain Chowdhury Mofizur Rahman Rebecca Strachan Graham Sexton Keshav Dahal 《Expert systems with applications》2013,40(15):5895-5906

It is challenging to use traditional data mining techniques to deal with real-time data stream classification. Existing mining classifiers need to be updated frequently to adapt to the changes in data streams. To address this issue, in this paper we propose an adaptive ensemble approach for classification and novel class detection in concept drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in data streams. For novel class detection we consider the idea that data points belonging to the same class should be closer to each other and should be far apart from the data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the performance of this proposed stream classification model against that of existing mining algorithms using real benchmark datasets from the UCI (University of California, Irvine) machine learning repository. The experimental results show that our approach offers great flexibility and robustness in novel class detection under concept drift and outperforms traditional classification models in challenging real-life data stream applications.

8.

A low-cost INS/GPS integration methodology based on random forest regression

Srujana Adusumilli Deepak Bhatt Hong Wang Prabir Bhattacharya Vijay Devabhaktuni 《Expert systems with applications》2013,40(11):4653-4659

This paper, for the first time, introduces a random forest regression based Inertial Navigation System (INS) and Global Positioning System (GPS) integration methodology to provide a continuous, accurate and reliable navigation solution. Numerous techniques, such as those based on the Kalman filter (KF) and artificial intelligence approaches, exist to fuse the INS and GPS data. The basic idea behind these fusion techniques is to model the INS error during GPS signal availability. In the case of outages, the developed model provides INS error estimates, thereby maintaining the continuity and improving the navigation solution accuracy. KF based approaches possess several inadequacies related to sensor error model, immunity to noise, and computational load. Alternatively, neural network (NN) approaches proposed to overcome KF limitations work unsatisfactorily for low-cost INS, as they suffer from poor generalization capability due to the presence of a high amount of noise. In this study, random forest regression is shown to effectively model the highly non-linear INS error due to its improved generalization capability. To evaluate the proposed method's effectiveness in bridging the period of GPS outages, four simulated GPS outages are considered over real field test data. The proposed methodology illustrates a significant reduction in the positional error by 24–56%.

9.

RFRCDB-siRNA: improved design of siRNAs by random forest regression model coupled with database searching (total citations: 1; self-citations: 0; citations by others: 1)

Jiang P Wu H Da Y Sang F Wei J Sun X Lu Z 《Computer methods and programs in biomedicine》2007,87(3):230-238

Although the observations concerning the factors which influence the siRNA efficacy give clues to the mechanism of RNAi, the quantitative prediction of the siRNA efficacy is still a challenging task. In this paper, we introduced a novel non-linear regression method, random forest regression (RFR), to quantitatively estimate siRNA efficacy values. Compared with an alternative machine learning regression algorithm, support vector machine regression (SVR), and four other score-based algorithms [A. Reynolds, D. Leake, Q. Boese, S. Scaringe, W.S. Marshall, A. Khvorova, Rational siRNA design for RNA interference, Nat. Biotechnol. 22 (2004) 326-330; K. Ui-Tei, Y. Naito, F. Takahashi, T. Haraguchi, H. Ohki-Hamazaki, A. Juni, R. Ueda, K. Saigo, Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference, Nucleic Acids Res. 32 (2004) 936-948; A.C. Hsieh, R. Bo, J. Manola, F. Vazquez, O. Bare, A. Khvorova, S. Scaringe, W.R. Sellers, A library of siRNA duplexes targeting the phosphoinositide 3-kinase pathway: determinants of gene silencing for use in cell-based screens, Nucleic Acids Res. 32 (2004) 893-901; M. Amarzguioui, H. Prydz, An algorithm for selection of functional siRNA sequences, Biochem. Biophys. Res. Commun. 316 (2004) 1050-1058], our RFR model achieved the best performance of all. A web-server, RFRCDB-siRNA (http://www.bioinf.seu.edu.cn/siRNA/index.htm), has been developed. RFRCDB-siRNA consists of two modules: a siRNA-centric database and an RFR prediction system. RFRCDB-siRNA works as follows: (1) Instead of directly predicting the gene silencing activity of siRNAs, the service takes these siRNAs as queries to search against the siRNA-centric database. Matched sequences whose functionality value exceeds the user-defined threshold are kept. (2) The mismatched sequences are then passed to the RFR prediction system for further analysis.

10.

Secure computation with horizontally partitioned data using adaptive regression splines

Joyee Ghosh Alan F. Karr 《Computational statistics & data analysis》2007,51(12):5813-5820

When several data owners possess data on different records but the same variables, known as horizontally partitioned data, the owners can improve statistical inferences by sharing their data with each other. Often, however, the owners are unwilling or unable to share because the data are confidential or proprietary. Secure computation protocols enable the owners to compute parameter estimates for some statistical models, including linear regressions, without sharing individual records’ data. A drawback to these techniques is that the model must be specified in advance of initiating the protocol, and the usual exploratory strategies for determining good-fitting models have limited usefulness since the individual records are not shared. In this paper, we present a protocol for secure adaptive regression splines that allows for flexible, semi-automatic regression modeling. This reduces the risk of model mis-specification inherent in secure computation settings. We illustrate the protocol with air pollution data.

11.

Secure computation with horizontally partitioned data using adaptive regression splines

《Computational statistics & data analysis》2008,52(12):5813-5820

When several data owners possess data on different records but the same variables, known as horizontally partitioned data, the owners can improve statistical inferences by sharing their data with each other. Often, however, the owners are unwilling or unable to share because the data are confidential or proprietary. Secure computation protocols enable the owners to compute parameter estimates for some statistical models, including linear regressions, without sharing individual records’ data. A drawback to these techniques is that the model must be specified in advance of initiating the protocol, and the usual exploratory strategies for determining good-fitting models have limited usefulness since the individual records are not shared. In this paper, we present a protocol for secure adaptive regression splines that allows for flexible, semi-automatic regression modeling. This reduces the risk of model mis-specification inherent in secure computation settings. We illustrate the protocol with air pollution data.

12.

Class weight setting in random forests for small-sample data

李建更 高志坤 《计算机工程与应用》2009,45(26):131-134

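Random forest has been shown to be an efficient method for classification and feature selection. Although parameter settings have a relatively small effect on the results, suitable parameters allow the classifier to achieve the desired performance. Focusing on classification and feature selection for small-sample, imbalanced data in cancer research, this paper studies how to set class weights in random forests. To compare the effect of feature selection under different class weights, the support vector machine (SVM) method is also used. The final results show that the optimal class weight is not fixed. Finally, several rules are summarized to guide researchers in choosing suitable weights so as to improve classification and feature selection.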

13.

Land-cover mapping in the Nujiang Grand Canyon: integrating spectral, textural, and topographic data in a random forest classifier

Hui Fan 《International journal of remote sensing》2013,34(21):7545-7567

The integration of spectral, textural, and topographic information using a random forest classifier for land-cover mapping in the rugged Nujiang Grand Canyon was investigated in this study. Only a few land-cover categories were accurately discriminated using spectral information exclusively, with an overall accuracy of 0.56 and a kappa coefficient of 0.51. The inclusion of topographic information as additional bands provided higher overall accuracy (0.69) and kappa coefficient (0.65) than topographic correction (overall accuracy, 0.57–0.58; kappa coefficient range, 0.52–0.53), which failed to markedly improve classification accuracy. In contrast with the exclusive use of spectral bands, most of the included land-cover categories were correctly classified using textural features exclusively (overall accuracy, 0.67–0.88; kappa coefficient, 0.63–0.87). In particular, classification based on geostatistical features led to slightly more accurate results than did grey-level co-occurrence matrix parameters. The window size selected for texture calculation markedly affected the texture-based classification accuracy: larger window size yielded higher classification accuracy. However, no optimal window size exists. The inclusion of the topographic bands in the texture images led to an increase in the overall accuracy of 1.1–9.0%, and to an increase in the kappa coefficient of 0.0–10.9%. Thus, for the Nujiang Grand Canyon, topographic information was more important for the discrimination of some land-cover types than spectral and textural information. Among the Landsat Thematic Mapper (TM) spectral bands, bands 6 and 4 were of greatest importance. The relative importance of textural features generally increased with window size, and a few textural features were of consistently high importance. Although a random forest classifier does not overfit, undertaking feature selection analysis prior to classification may still be valuable.

14.

Forest attribute imputation using machine-learning methods and ASTER data: comparison of k-NN, SVR and random forest regression algorithms

Shaban Shataee Syavash Kalbi Asghar Fallah Dieter Pelz 《International journal of remote sensing》2013,34(19):6254-6280

In a small case study of mixed hardwood Hyrcanian forests of Iran, three non-parametric methods, namely k-nearest neighbour (k-NN), support vector machine regression (SVR) and tree regression based on random forest (RF), were used in plot-level estimation of volume/ha, basal area/ha and stems/ha using field inventory and Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) data. Relevant pre-processing and processing steps were applied to the ASTER data for geometric and atmospheric correction and for enhancing quantitative forest parameters. After collecting terrestrial information on trees in the 101 samples, the volume, basal area and tree number per hectare were calculated in each plot. In the k-NN implementation using different distance measures and k, the cross-validation method was used to find the best distance measure and optimal k. In SVR, the best regularized parameters of four kernel types were obtained using leave-one-out cross-validation. RF was implemented using a bootstrap learning method with regularized parameters for decision tree model and stopping. The validity of performances was examined using unused test samples by absolute and relative root mean square error (RMSE) and bias metrics. In volume/ha estimation, the results showed that all the three algorithms had similar performances. However, SVR and RF produced better results than k-NN with relative RMSE values of 28.54, 25.86 and 26.86 (m³ ha⁻¹), respectively, using k-NN, SVR and RF algorithms, but RF could generate unbiased estimation. In basal area/ha and stems/ha estimation, the implementation results of RF showed that RF was slightly superior in relative RMSE (18.39, 20.64) to SVR (19.35, 22.09) and k-NN (20.20, 21.53), but k-NN could generate unbiased estimation compared with the other two algorithms used.

15.

Expectile regression forest: A new nonparametric expectile regression model

Chao Cai Haotian Dong Xinyi Wang 《Expert Systems》2023,40(1):e13087

Classical nonlinear expectile regression has two shortcomings. It is difficult to choose a nonlinear function, and it does not consider the interaction effects among explanatory variables. Therefore, we combine the random forest model with the expectile regression method to propose a new nonparametric expectile regression model: expectile regression forest (ERF). The major novelty of the ERF model is using the bagging method to build multiple decision trees, calculating the conditional expectile of each leaf node in each decision tree, and deriving final results by aggregating these decision tree results via a simple averaging approach. At the same time, in order to compensate for the black box problem in the interpretation of the ERF model, measures of explanatory-variable importance and partial dependence are defined to evaluate the magnitude and direction of the influence of each explanatory variable on the response variable. The advantage of the ERF model is illustrated by Monte Carlo simulation studies. The numerical simulation results show that the estimation and prediction ability of the ERF model is significantly better than alternative approaches. We also apply the ERF model to analyse real data. From the nonparametric expectile regression analysis of these data sets, we have several conclusions that are consistent with the results of numerical simulation.
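The leaf-level expectile and simple-average aggregation mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `expectile` and `forest_expectile` are hypothetical, the expectile is computed with the usual asymmetric least squares iteration, and each tree is assumed to have already routed the query point to a leaf whose training responses are passed in as a plain list.

```python
from statistics import fmean


def expectile(values, tau=0.5, tol=1e-10, max_iter=100):
    """Sample tau-expectile via iteratively reweighted least squares."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    estimate = fmean(values)  # for tau = 0.5 the expectile is the ordinary mean
    for _ in range(max_iter):
        # Points above the current estimate get weight tau, the rest 1 - tau.
        weights = [tau if v > estimate else 1.0 - tau for v in values]
        updated = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(updated - estimate) < tol:
            return updated
        estimate = updated
    return estimate


def forest_expectile(leaf_samples_per_tree, tau):
    """Average the leaf expectiles of all trees (simple-average aggregation)."""
    return fmean(expectile(leaf, tau) for leaf in leaf_samples_per_tree)


# Hypothetical leaves reached by one query point in two trees.
print(forest_expectile([[0.0, 10.0], [0.0, 20.0]], tau=0.9))
```

With `tau = 0.5` the asymmetric weights are equal, so the sketch reduces to averaging leaf means, which is the usual regression forest prediction.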

16.

Clustering under-sampling weighted random forest algorithm for imbalanced data

王磊 刘雨 刘志中 齐俊艳 《计算机应用研究》2021,38(5):1398-1402

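To address the problems of unrepresentative samples and poor classification performance in existing under-sampling algorithms, a weighted random forest algorithm based on clustering under-sampling (CUS-WRF) is proposed. The K-means algorithm is used to cluster the majority-class samples, and Euclidean distance is introduced as the weighting basis for allocating the number of samples drawn during under-sampling, so that the sampled majority-class samples and the minority-class samples form a balanced sample set. With CART decision trees as base classifiers and a weighted random forest as the overall framework, the accuracy on test samples is used as the weight of each tree in the final vote, which effectively improves overall classification performance. Experiments on eight KEEL data sets show that, compared with four other random-forest-based algorithms for imbalanced data, CUS-WRF has better classification performance and stability.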
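The accuracy-weighted voting step described in this abstract can be sketched in a few lines of Python. This is only an illustration of weighted voting in general, not the CUS-WRF code: the function name `weighted_vote`, the labels, and the per-tree accuracies are all made up, and the clustering under-sampling stage is not shown.

```python
from collections import defaultdict


def weighted_vote(tree_predictions, tree_weights):
    """Return the label with the largest total weight across trees."""
    if len(tree_predictions) != len(tree_weights):
        raise ValueError("need exactly one weight per tree")
    totals = defaultdict(float)
    for label, weight in zip(tree_predictions, tree_weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Five hypothetical trees: three weaker trees vote "majority",
# two stronger trees vote "minority".
predictions = ["majority", "majority", "majority", "minority", "minority"]
accuracies = [0.55, 0.55, 0.55, 0.90, 0.90]  # made-up per-tree accuracies
print(weighted_vote(predictions, accuracies))  # prints "minority"
```

In this toy case an unweighted majority vote would return "majority" (three votes to two), while the accuracy weights let the two more accurate trees decide the outcome.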

17.

Hyperspectral remote sensing image classification based on random forest

李垒 任越美 《计算机工程与应用》2016,52(24):189-193

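To make full use of both the spectral information and the spatial structure information of hyperspectral images, a new random-forest-based classification method for hyperspectral remote sensing images is proposed. First, principal component analysis is used to reduce the dimensionality of the data, and independent component analysis is applied to the principal components to extract spectral features while removing the spatial correlation between pixels. Morphological analysis is then used to extract the spatial structure features of the pixels. Next, random forests are built separately on the spectral-domain and spatial-domain features of the pixels, and spatial continuity is introduced to constrain and correct the predictions for each pixel. Finally, a voting mechanism determines the final classification. Experiments on AVIRIS and ROSIS hyperspectral images show that the proposed method outperforms traditional hyperspectral image classification methods and achieves higher classification accuracy than methods based on a single feature type.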

18.

A novel approach for precipitation forecast via improved K-nearest neighbor algorithm

《Advanced Engineering Informatics》2017

The prediction method plays a crucial role in accurate precipitation forecasts. Recently, machine learning has been widely used for forecasting precipitation, and the K-nearest neighbor (KNN) algorithm, one such machine learning technique, has shown good performance. In this paper, we propose an improved KNN algorithm, which offers robustness against different choices of the neighborhood size k, particularly in the case of the irregular class distribution of the precipitation dataset. Then, based on our improved KNN algorithm, a new precipitation forecast approach is put forward. Extensive experimental results demonstrate the effectiveness of our proposed precipitation forecast approach based on the improved KNN algorithm.

19.

A novel approach to estimate proximity in a random forest: An exploratory study

C. Englund A. Verikas 《Expert systems with applications》2012,39(17):13046-13050

A data proximity matrix is an important information source in random forests (RF) based data mining, including data clustering, visualization, outlier detection, substitution of missing values, and finding mislabeled data samples. A novel approach to estimate proximity is proposed in this work. The approach is based on measuring the distance between two terminal nodes in a decision tree. To assess the consistency (quality) of a data proximity estimate, we suggest using the proximity matrix as a kernel matrix in a support vector machine (SVM), under the assumption that a matrix of higher quality leads to higher classification accuracy. It is experimentally shown that the proposed approach improves the proximity estimate, especially when the RF is made of a small number of trees. It is also demonstrated that, for some tasks, an SVM exploiting the suggested proximity matrix based kernel outperforms an SVM based on a standard radial basis function kernel and the standard proximity matrix based kernel.
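For context, the standard random forest proximity that this abstract uses as its baseline is commonly defined as the fraction of trees in which two samples fall into the same terminal node. The Python sketch below illustrates only that baseline, not the node-distance approach proposed in the paper; the function name `proximity_matrix` and the input layout (one list of terminal-node ids per tree) are assumptions made for the example.

```python
def proximity_matrix(leaf_ids):
    """Standard RF proximity from terminal-node ids.

    leaf_ids[t][i] is the terminal-node id of sample i in tree t.
    Entry [i][j] of the result is the fraction of trees in which
    samples i and j share a terminal node.
    """
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    counts = [[0] * n_samples for _ in range(n_samples)]
    for tree in leaf_ids:
        for i in range(n_samples):
            for j in range(n_samples):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


# Two hypothetical trees and three samples.
leaves = [
    [0, 0, 1],  # tree 0: samples 0 and 1 share a leaf
    [2, 3, 3],  # tree 1: samples 1 and 2 share a leaf
]
for row in proximity_matrix(leaves):
    print(row)
```

With few trees this estimate can take only a handful of distinct values (here 0, 0.5 and 1), which is in line with the abstract's remark that the improvement matters most when the forest has a small number of trees.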

20.

A random forest classifier for lymph diseases

Ahmad Taher Azar Hanaa Ismail Elshazly Aboul Ella Hassanien Abeer Mohamed Elkorany 《Computer methods and programs in biomedicine》2014

Machine learning-based classification techniques provide support for the decision-making process in many areas of health care, including diagnosis, prognosis, screening, etc. Feature selection (FS) is expected to improve classification performance, particularly in situations characterized by the high data dimensionality problem caused by relatively few training examples compared to a large number of measured features. In this paper, a random forest classifier (RFC) approach is proposed to diagnose lymph diseases. Focusing on feature selection, the first stage of the proposed system aims at constructing diverse feature selection algorithms such as genetic algorithm (GA), Principal Component Analysis (PCA), Relief-F, Fisher, Sequential Forward Floating Search (SFFS) and Sequential Backward Floating Search (SBFS) for reducing the dimension of the lymph diseases dataset. Switching from feature selection to model construction, in the second stage, the obtained feature subsets are fed into the RFC for efficient classification. It was observed that GA-RFC achieved the highest classification accuracy of 92.2%. The dimension of the input feature space is reduced from eighteen to six features by using GA.