Similar literature
1.
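Deep belief networks (DBN) have high training time complexity and overfit easily. Inspired by fuzzy theory, this work proposes FE-DBN (ensemble deep belief network with fuzzy partition and fuzzy weighting), an ensemble DBN for classifying large-sample data. The fuzzy clustering algorithm FCM partitions the training data into several subsets, DBNs with different structures are trained in parallel on those subsets, and the results of the individual classifiers are combined by fuzzy weighting. Experiments on artificial datasets and UCI datasets show that FE-DBN is more accurate than DBN and has a shorter running time.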
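The fuzzy partition and fuzzy weighting steps can be illustrated with a minimal sketch. It assumes the standard fuzzy c-means membership formula and a simple membership-weighted sum of classifier outputs; the abstract does not give the exact weighting rule, so `fuzzy_weighted_vote` is a hypothetical helper, and the trained DBNs are stood in for by fixed probability vectors.

```python
import math


def fuzzy_memberships(x, centers, m=2.0):
    """Standard fuzzy c-means membership of point x in each cluster (fuzzifier m > 1)."""
    dists = [math.dist(x, c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


def fuzzy_weighted_vote(memberships, class_probs):
    """Hypothetical combiner: weight each sub-classifier's class probabilities
    by the sample's membership in the cluster that classifier was trained on."""
    n_classes = len(class_probs[0])
    return [
        sum(u * probs[k] for u, probs in zip(memberships, class_probs))
        for k in range(n_classes)
    ]


if __name__ == "__main__":
    u = fuzzy_memberships((0.0, 0.0), [(1.0, 0.0), (3.0, 0.0)])
    # Placeholder outputs of two sub-classifiers (not real DBN predictions).
    print(fuzzy_weighted_vote(u, [[1.0, 0.0], [0.0, 1.0]]))
```

A sample near one cluster center is dominated by the classifier trained on that cluster, while a sample between clusters blends several classifiers.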

2.
Phishing attacks are security attacks that affect not only the websites of individuals and organizations but also Internet of Things (IoT) devices and networks. The IoT environment is exposed to such attacks: attackers may use thingbot software to spread hidden junk emails that users do not notice. Machine learning, deep learning, and other methods have been used to design detection methods for these attacks, but detection accuracy still needs to improve. This study proposes an optimized ensemble classification method for phishing website (PW) detection. A Genetic Algorithm (GA) optimizes the proposed method by tuning the parameters of several ensemble Machine Learning (ML) methods: Random Forest (RF), AdaBoost (AB), XGBoost (XGB), Bagging (BA), GradientBoost (GB), and LightGBM (LGBM). The optimized classifiers are then ranked, and the best ones are picked as the base for the proposed method. The experiments used a PW dataset made up of 4898 PWs and 6157 legitimate websites (LWs). Detection accuracy reached 97.16 percent.

3.
The ensemble learning paradigm has proved relevant to solving most challenging industrial problems. Despite its successful application, especially in bioinformatics, the petroleum industry has not benefited enough from the promise of this machine learning technology. The petroleum industry, with its persistent quest for high-performance predictive models, is in great need of this learning methodology. A marginal improvement in the prediction indices of petroleum reservoir properties could have a huge positive impact on the success of exploration, drilling and the overall reservoir management portfolio. The support vector machine (SVM) is one of the promising machine learning tools that have performed well in most prediction problems. However, its performance depends on a prudent choice of its tuning parameters, especially the regularization parameter, C. Reports have shown that this parameter has a significant impact on the performance of SVM, and, understandably, no specific value has been recommended for it. This paper proposes a stacked generalization ensemble model of SVM that incorporates different expert opinions on the optimal values of this parameter in the prediction of porosity and permeability of petroleum reservoirs, using datasets from diverse geological formations. The performance of the proposed SVM ensemble was compared to that of the conventional SVM technique, another SVM implemented with the bagging method, and the Random Forest technique. The results showed that the proposed ensemble model, in most cases, outperformed the others with the highest correlation coefficient and the lowest mean and absolute errors. The study indicated that there is great potential for ensemble learning in petroleum reservoir characterization to improve the accuracy of reservoir property predictions for more successful exploration and increased production of petroleum resources. The results also confirmed that ensemble models perform better than the conventional SVM implementation.

4.
5.
Minimal Learning Machine (MLM) is a recently proposed supervised learning algorithm with performance comparable to most state-of-the-art machine learning methods. In this work, we propose ensemble methods for classification and regression using MLMs. The goal of ensemble strategies is to produce more robust and accurate models than a single classifier or regression model. Despite its successful application, MLM must solve a computationally intensive optimization problem as part of its test procedure (out-of-sample data estimation). This cost becomes even more noticeable in ensemble learning, where multiple models are used. To provide fast alternatives to the standard MLM, we also propose the Nearest Neighbor Minimal Learning Machine and the Cubic Equation Minimal Learning Machine for classification and single-output regression problems, respectively. The experimental assessment on real-world datasets shows that ensembles of fast MLMs perform comparably to or better than reference machine learning algorithms.

6.
Accurate prediction of sea surface temperature (SST) is extremely important for forecasting oceanic environmental events and for ocean studies. However, existing SST prediction methods do not consider the seasonal periodicity and abnormal fluctuation characteristics of SST or the importance of historical SST data from different times; thus, these methods suffer from low prediction accuracy. To solve this problem, we comprehensively consider the effects of seasonal periodicity and abnormal fluctuation characteristics of SST data, as well as the influence of historical data in different periods, on prediction accuracy. We propose a novel ensemble learning approach that combines the Predictive Recurrent Neural Network (PredRNN) and an attention mechanism for effective SST field prediction. In this approach, the XGBoost model is used to learn the long-period fluctuation law of SST and to extract seasonal periodic features from SST data. The exponential smoothing method is used to mitigate the impact of severely abnormal SST fluctuations and to extract the a priori features of SST data. The outputs of these two models and the original SST data are stacked and used as inputs for the next model, the PredRNN network. PredRNN is a recently developed spatiotemporal deep learning network that models both spatial and temporal representations and can transfer memory across layers and time steps. We therefore use it to extract the spatiotemporal correlations of SST data and predict future SSTs. Finally, an attention mechanism is added to capture the importance of different historical SST data, weight the output of each step of the PredRNN network, and improve the prediction accuracy. The experimental results on two ocean datasets confirm that the proposed approach achieves higher training efficiency and prediction accuracy than existing SST field prediction approaches.
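The exponential smoothing step mentioned above can be sketched as follows. This is simple (single) exponential smoothing on a one-dimensional series; the smoothing factor `alpha` and the toy values are assumptions, and the abstract does not say which smoothing variant or parameters the authors use.

```python
def exponential_smoothing(series, alpha=0.3):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    smoothed = [series[0]]  # initialise with the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed


if __name__ == "__main__":
    # Toy temperature readings with one abnormal spike (illustrative only).
    sst = [20.0, 20.5, 27.0, 20.4, 20.6]
    print(exponential_smoothing(sst, alpha=0.3))
```

A smaller `alpha` damps abnormal spikes more strongly at the cost of reacting more slowly to genuine changes.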

7.
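Class imbalance in clinical rating-scale data easily degrades models, and deep learning frameworks struggle to match traditional machine learning methods on scale data. To address both problems, this work proposes a Transformer network model based on cascaded undersampling (layer by layer Transformer, LLT). LLT uses a cascaded undersampling method to remove majority-class data layer by layer, balancing the classes and reducing the effect of class imbalance on the classifier. It also uses an attention mechanism to assess the relevance of the input features, which performs feature selection, refines feature extraction, and improves model performance. With rheumatoid arthritis (RA) data as test samples, experiments show that, without changing the sample distribution, the proposed cascaded undersampling method increases the recognition rate of the minority class by 6.1%, which is 1.4% and 10.4% higher than the commonly used NEARMISS and ADASYN, respectively. On the RA scale data, LLT reaches an accuracy of 72.6% and an F1-score of 71.5%, with an AUC of 0.89 and an mAP of 0.79, outperforming mainstream classification models for scale data such as RF, XGBoost, and GBDT. Finally, the model's process is visualized and the features that affect RA are analyzed, which provides useful guidance for the clinical diagnosis of RA.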
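The idea of removing majority-class samples layer by layer can be sketched as below. This is only an illustration under stated assumptions: samples are dropped uniformly at random and the majority size shrinks linearly over `n_layers` steps. The abstract does not describe how LLT actually chooses which samples to remove, so `cascade_undersample` is a hypothetical helper.

```python
import random


def cascade_undersample(majority, minority, n_layers=3, seed=0):
    """Shrink the majority class step by step until it matches the minority size.

    Returns the list of majority subsets, one per stage (stage 0 is the full set).
    Random removal is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    current = list(majority)
    target = len(minority)
    stages = [list(current)]
    for layer in range(1, n_layers + 1):
        if len(current) <= target:
            break  # already balanced
        # Linearly interpolate the majority size between its start and the target.
        size = len(majority) - (len(majority) - target) * layer // n_layers
        current = rng.sample(current, size)
        stages.append(list(current))
    return stages


if __name__ == "__main__":
    stages = cascade_undersample(list(range(100)), list(range(10)), n_layers=3)
    print([len(s) for s in stages])
```

Each stage is a subset of the previous one, so a classifier trained per layer sees a progressively more balanced dataset.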

8.
Credit scoring is an effective tool that helps banks make profitable loan-granting decisions. Ensemble methods, which can be divided by structure into parallel and sequential ensembles, have recently been developed in the credit scoring domain and have proven superior at discriminating borrowers accurately. However, among the ensemble models, little consideration has been given to the following: (1) hyper-parameter tuning of the base learner, despite its being critical to well-performing ensemble models; (2) building sequential models (i.e., boosting), as most work has focused on developing the same or different algorithms in parallel; and (3) the comprehensibility of models. This paper proposes a sequential ensemble credit scoring model based on a variant of the gradient boosting machine, extreme gradient boosting (XGBoost). The model comprises three main steps. First, data pre-processing scales the data and handles missing values. Second, a model-based feature selection system based on relative feature importance scores removes redundant variables. Third, the hyper-parameters of XGBoost are adaptively tuned with Bayesian hyper-parameter optimization and used to train the model on the selected feature subset. Several hyper-parameter optimization methods and baseline classifiers are considered as reference points in the experiment. Results demonstrate that Bayesian hyper-parameter optimization performs better than random search, grid search, and manual search. Moreover, the proposed model outperforms baseline models on average over four evaluation measures: accuracy, error rate, the area under the curve (AUC) H measure (AUC-H measure), and Brier score. The proposed model also provides feature importance scores and a decision chart, which enhance the interpretability of the credit scoring model.

9.
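Most deep-learning-based 3D model classification methods target one specific task; they perform poorly across diverse 3D model classification tasks and lack generality. To address this, a general end-to-end deep ensemble learning model, E2E-DEL (end-to-end deep ensemble learning), is proposed. It consists of several primary learners and one ensemble learner and can automatically learn composite feature information from complex 3D models. It uses a hierarchical iterative learning strategy that takes into account the feature learning ability of networks at different levels and balances the sub-feature learning of each primary learner against the ensemble feature learning of the ensemble learner, so that it adapts to diverse 3D model classification tasks. On this basis, a multi-view deep ensemble learning network, MV-DEL (multi-view deep ensemble learning), is designed and applied to three types of 3D model classification tasks: general, fine-grained, and zero-shot. Experiments on several public datasets verify that the method generalizes well and applies broadly.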

10.
We address the issue of small data size for training models for regression problems, which is a significant issue in materials science. Many density estimators that use generative models based on deep neural networks have been proposed. Among generative models, normalizing flows can provide exact density estimation. Using normalizing flows, we address the training data augmentation issue, with a real-valued non-volume preserving model (real-NVP) as the normalizing flow. A generative adversarial net (GAN)-based training method is applied to improve real-NVP training, using real-NVP as the generator. To evaluate the models, generalization performance was measured using kernel ridge regression trained on generated data. Experiments were conducted with seven benchmark datasets and a dataset of ionic conductivity of materials to compare the GAN-based real-NVP to state-of-the-art models, such as real-NVP and masked autoregressive flows. The experimental results demonstrated that the GAN-based real-NVP was comparable to state-of-the-art models and implied that the data sampled by the GAN-based real-NVP were usable as new training data.

11.
Evolving diverse ensembles using genetic programming has recently been proposed for classification problems with unbalanced data. Population diversity is crucial for evolving effective algorithms. Multilevel selection strategies that involve additional colonization and migration operations have shown better performance in some applications. Therefore, in this paper, we analyse the performance of evolving diverse ensembles using genetic programming for software defect prediction with unbalanced data under different selection strategies. We use colonization and migration operators along with three ensemble selection strategies for the multi-objective evolutionary algorithm. We compare the performance of the operators on software defect prediction datasets with varying levels of data imbalance. Moreover, to generalize the results, gain a broader view and understand the underlying effects, we replicated the same experiments on UCI datasets, which are often used in the evolutionary computing community. The multilevel selection strategies provide reliable results with relatively fast convergence and outperform the other evolutionary algorithms that are often used in this research area and investigated in this paper. This paper also presents a promising ensemble strategy based on a simple convex hull approach and raises the question of whether an ensemble strategy based on the whole population should also be investigated.

12.
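In big data settings, traditional machine learning algorithms are mostly trained offline on a single machine and can no longer keep up with continuously growing, large-scale streaming data. To address this problem, a distributed online ensemble learning algorithm based on the Flink platform is proposed. Built on the Flink distributed computing framework, the method first trains an online learning algorithm in a distributed, data-parallel manner. It then uses stochastic gradient descent to assign dynamic weights to the trained sub-models and aggregates their results. Meanwhile, models that train poorly are updated online with their samples. Finally, single-machine and cluster environments are compared experimentally on different datasets. The results show that distributed ensemble training of online learning algorithms on Flink reaches the performance of centralized training while greatly improving training time efficiency.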
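The dynamic weighting of sub-models by stochastic gradient descent can be sketched as a single update step. The sketch assumes a linear combination of sub-model predictions and a squared-error loss; the abstract specifies neither, and the code does not involve Flink itself. `sgd_update` and the toy stream are hypothetical.

```python
def sgd_update(weights, predictions, target, lr=0.1):
    """One SGD step on the squared error of the weighted sum of sub-model predictions."""
    combined = sum(w * p for w, p in zip(weights, predictions))
    error = combined - target
    # Gradient of 0.5 * error**2 with respect to w_i is error * p_i.
    return [w - lr * error * p for w, p in zip(weights, predictions)]


if __name__ == "__main__":
    weights = [0.5, 0.5]
    # Toy stream: (sub-model predictions, true value) pairs arriving one at a time.
    stream = [([1.0, 0.0], 1.0), ([0.9, 0.2], 1.0), ([0.1, 0.8], 0.0)]
    for predictions, target in stream:
        weights = sgd_update(weights, predictions, target)
    print(weights)
```

Because each update needs only the current sample, the weights can be adjusted as the stream arrives, without revisiting earlier data.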

13.
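Most existing methods for imputing incomplete data are limited to a single type of missing variable and are comparatively weak on large-scale data. To handle missing values in mixed-type variables in real big data, this paper proposes a new model, SXGBI (Spark-based eXtreme Gradient Boosting Imputation). It handles incomplete data in which continuous and categorical missing variables coexist, and it can process big data quickly. The method improves the ensemble learning method XGBoost, combines several imputation algorithms into one ensemble learner, and is parallelized on the Spark distributed computing framework so that it runs well on a Spark cluster. Experiments show that, as the missing rate grows, SXGBI achieves better imputation results than the other methods tested on the RMSE, PFC, and F1 evaluation metrics. It can also be applied effectively to large-scale datasets.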

14.
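Ye Zhiyu, Feng Aimin, Gao Hang. 《计算机应用》 (Journal of Computer Applications), 2019, 39(12): 3434-3439
Ensemble learning models such as the light gradient boosting machine (LightGBM) mine the data only once; they cannot automatically refine the granularity of data mining or dig deeper to uncover more of the latent internal correlations in the data. To address this, a deep LightGBM ensemble learning model is proposed, consisting of two parts: a sliding window and deepening. First, the sliding window lets the ensemble learning model automatically refine the mining granularity, so that it mines latent internal correlations more thoroughly, and gives the model some representation learning ability. Then, building on the sliding window, the deepening step further improves the model's representation learning ability. Finally, the dataset is processed with feature engineering. Experiments on the Google Store dataset show that the proposed deep ensemble learning model improves prediction accuracy by 6.16 percentage points over the original ensemble learning model. The proposed method automatically refines the mining granularity and so obtains more of the latent information in the dataset. Compared with traditional deep neural networks, the deep LightGBM ensemble learning model is a non-neural-network deep model with fewer parameters and better interpretability.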

15.
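In ensemble learning, averaging and voting as combination strategies cannot make full use of the useful information in the base classifiers, and setting base classifier weights according to volatility is imprecise and inappropriate. These problems reduce the effectiveness of ensemble learning. To further improve ensemble performance, this work proposes using the evidence reasoning (ER) rule as the combination strategy and setting the base classifier weights with a diversity-based weighting method. First, the basic ensemble structure is built with several deep learning models as base classifiers and the ER rule as the combination strategy. Then, a diversity measure is used to compute how much each base classifier differs from the others. Finally, the differences are normalized to obtain the base classifier weights. Classification experiments on several image datasets show that the proposed method is more accurate and more stable than the other methods selected for comparison, which demonstrates that it makes full use of the useful information in the base classifiers and that diversity-based weighting is more precise.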
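The diversity-based weighting described above (measure how much each base classifier differs from the others, then normalize) can be sketched as follows. The pairwise disagreement rate is used as the diversity measure purely as an assumption; the abstract does not name the measure, and the ER rule combination itself is not implemented here.

```python
def disagreement(preds_a, preds_b):
    """Fraction of samples on which two classifiers predict different labels."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def diversity_weights(all_preds):
    """Weight each base classifier by its mean disagreement with the others,
    normalized so that the weights sum to 1 (requires at least two classifiers)."""
    n = len(all_preds)
    scores = [
        sum(disagreement(all_preds[i], all_preds[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    total = sum(scores)
    if total == 0:
        return [1.0 / n] * n  # identical classifiers: fall back to equal weights
    return [s / total for s in scores]


if __name__ == "__main__":
    # Label predictions of three hypothetical base classifiers on four samples.
    preds = [[0, 1, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
    print(diversity_weights(preds))
```

In this toy case the third classifier disagrees most with the rest and so receives the largest weight.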

16.
We address the task of multi-target regression, where we generate global models that simultaneously predict multiple continuous variables. We use ensembles of generalized decision trees, called predictive clustering trees (PCTs), in particular bagging and random forests (RF) of PCTs and extremely randomized PCTs (extra PCTs). We add another dimension of randomization to these ensemble methods by learning individual base models that consider random subsets of target variables, while leaving the input space randomizations (in RF PCTs and extra PCTs) intact. Moreover, we propose a new ensemble prediction aggregation function, where the final ensemble prediction for a given target is influenced only by those base models that considered it during learning. An extensive experimental evaluation on a range of benchmark datasets compared the extended ensemble methods to the original ensemble methods, individual multi-target regression trees, and ensembles of single-target regression trees in terms of predictive performance, running times and model sizes. The results show that the proposed ensemble extension can yield better predictive performance, reduce learning time or both, without a considerable change in model size. The newly proposed aggregation function gives the best results when used with extremely randomized PCTs. We also include a comparison with three competing methods, namely random linear target combinations and two variants of random projections.
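The aggregation rule described above, where a target's prediction depends only on base models that considered that target during learning, can be sketched as below. Each base model is represented by a dictionary of the targets it covers; plain averaging of the contributing models is an assumption, since the abstract does not give the exact combination formula.

```python
def aggregate(predictions, n_targets):
    """Average, for each target, only the base models that predict that target.

    predictions: one dict per base model, mapping target index -> predicted value.
    Returns None for a target that no base model covers.
    """
    result = []
    for target in range(n_targets):
        votes = [p[target] for p in predictions if target in p]
        result.append(sum(votes) / len(votes) if votes else None)
    return result


if __name__ == "__main__":
    # Three hypothetical base models, each trained on a random subset of 3 targets.
    base_model_outputs = [{0: 1.0, 1: 2.0}, {1: 4.0}, {0: 3.0, 2: 6.0}]
    print(aggregate(base_model_outputs, n_targets=3))
```

A model that never saw a target contributes nothing to it, rather than diluting the average with an uninformed prediction.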

17.
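To address the high time complexity of subtractive clustering on large-scale datasets, this paper proposes a subtractive clustering method based on Nyström approximation of density values. The method is particularly suited to large-scale datasets, where it greatly reduces the time complexity of subtractive clustering. Based on Nyström approximation theory and the way classical subtractive clustering computes sample density values, Nyström theory is applied to approximate the density weight matrix among the unsampled samples, which yields approximate density values for all samples. The density values are then revised as in classical subtractive clustering to complete the clustering process. The algorithm was tested on artificial data, standard color images, and UCI datasets. The experiments show in detail how a small number of sampled points is used to approximate the density weights and density values of the many unsampled points and to carry out subtractive clustering, and they report clustering accuracy, running time, and speedup. The results show that, compared with classical subtractive clustering, the proposed algorithm significantly reduces time complexity on larger datasets without affecting the clustering results, and greatly improves real-time performance.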

18.
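Guo Maozu, Zhang Bin, Zhao Lingling, Zhang Yu. 《计算机应用》 (Journal of Computer Applications), 2020, 40(11): 3159-3165
Earlier work on activity semantic recognition extracted only sequence features and periodic features along the time dimension and did not mine spatial information in depth. To address this, an activity semantic recognition method based on joint features and extreme gradient boosting (XGBoost) is proposed. First, periodic activity features are mined from the temporal information, and latitude and longitude features from the spatial information. Then, the latitude and longitude information is fed to the density-based spatial clustering of applications with noise (DBSCAN) algorithm to extract regional popularity features, and these features are assembled into a feature vector that characterizes user activity semantics. Finally, the XGBoost algorithm, an ensemble learning method, is used to build the activity semantic recognition model. On two public FourSquare check-in datasets, the model based on joint features improves recognition accuracy by 28 percentage points over the model based on temporal features alone. Compared with the context-aware hybrid (CAH) method and the spatial temporal activity preference (STAP) method, the proposed method improves recognition accuracy by 30 and 5 percentage points, respectively. The experimental results show that the proposed method is more accurate and effective for activity semantic recognition than the compared methods.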

20.
In recent years, the multi-label classification task has gained the attention of the scientific community given its ability to solve problems where each instance of the dataset may be associated with several class labels at the same time instead of just one. The main problems in multi-label classification are the imbalance, the relationships among the labels, and the high complexity of the output space. A large number of methods for multi-label classification have been proposed, but although they aimed to deal with one or more of these problems, most of them did not take these characteristics of the data into account in their building phase. In this paper we present an evolutionary algorithm, called Evolutionary Multi-label Ensemble (EME), that automatically generates ensembles of multi-label classifiers while tackling the three problems mentioned above. Each multi-label classifier focuses on a small subset of the labels, still considering the relationships among them but avoiding the high complexity of the output space. Further, the algorithm automatically designs the ensemble by evaluating both its predictive performance and the number of times each label appears in the ensemble, so that infrequent labels are not ignored in imbalanced datasets. For this purpose, we also propose a novel mutation operator that considers the relationships among labels, looking for individuals whose labels are more related. EME was compared to other state-of-the-art algorithms for multi-label classification (MLC) over a set of fourteen multi-label datasets using five evaluation measures. The experimental study was carried out in two parts, first comparing EME to classic multi-label classification methods, and second comparing EME to other ensemble-based methods in multi-label classification. EME performed significantly better than the classic methods in three out of five evaluation measures. In the second experiment, EME performed best in one measure and was the only method that did not perform significantly worse than the control algorithm in any measure. These results showed that EME achieved better and more consistent performance than the other state-of-the-art methods in MLC.
