Similar articles
1.
Ensemble learning is the process of aggregating the decisions of different learners or models. Fundamentally, the performance of an ensemble depends on the accuracy of the individual learners' predictions and on the diversity among the learners. The trade-off between accuracy and diversity within the ensemble must be optimized to obtain the best-performing group of learners. In this optimization theory article, we propose a novel ensemble selection algorithm which, focusing specifically on clustering problems, selects the optimal subset of the ensemble that has both accurate and diverse models. Existing ensemble selection algorithms require the number of best learners in the subset to be fixed before selection, yet the cardinality of the subset affects prediction accuracy. The algorithm proposed in this study determines both how many learners to select and which ones. We compared our prediction results with those of recent ensemble clustering selection algorithms in terms of subset cardinality and best predictions, and obtained better results that approximate the optimal solutions.

2.
Neural networks and genetic algorithms for bankruptcy predictions   Total citations: 9, self-citations: 0, citations by others: 9
We focus on three alternative techniques (linear discriminant analysis, logit analysis and genetic algorithms) that can be used to empirically select predictors for neural networks in failure prediction. The selected techniques all make different assumptions about the relationships between the independent variables. Linear discriminant analysis is based on a linear combination of the independent variables, logit analysis uses the logistic cumulative function, and genetic algorithms are a global search procedure based on the mechanics of natural selection and natural genetics. In an empirical test, all three selection methods chose different bankruptcy prediction variables. The best prediction results were achieved when using genetic algorithms.

3.

Workload prediction is an essential prerequisite for allocating resources efficiently and maintaining service level agreements in cloud computing environments. However, the best solution for a prediction task may not be a single model, because the characteristics of different systems vary. Thus, in this work, we propose an ensemble model, namely ESNemble, based on the echo state network (ESN) for workload time series forecasting. ESNemble consists of four main steps: feature selection using ESN reservoirs, dimensionality reduction using kernel principal component analysis, feature aggregation using matrix concatenation, and regression using the least absolute shrinkage and selection operator for final predictions. In addition, the necessary hyperparameters of ESNemble are optimized using a genetic algorithm. For experimental evaluation, we used ESNemble to combine five different prediction algorithms on three recent logs extracted from real-world web servers. Our experimental results show that ESNemble outperforms all component models in terms of accuracy and resource allocation; we also report the running time of our model to show its feasibility in real-world applications.


4.
Financial distress prediction (FDP) has been a widely and continually studied topic in the field of corporate finance. One of the core problems in FDP is to design effective feature selection algorithms. In contrast to existing approaches, we propose an integrated approach to feature selection for the FDP problem that combines expert knowledge with the wrapper method. The financial features are categorized into seven classes according to their financial semantics, based on experts’ domain knowledge surveyed from the literature. We then apply the wrapper method to search for “good” feature subsets consisting of top candidates from each feature class. For concept verification, we compare several scholars’ models as well as leading feature selection methods with the proposed method. Our empirical experiment indicates that the prediction model based on the feature set selected by the proposed method outperforms models based on traditional feature selection methods in terms of prediction accuracy.

5.
Evolving diverse ensembles using genetic programming has recently been proposed for classification problems with unbalanced data. Population diversity is crucial for evolving effective algorithms. Multilevel selection strategies that involve additional colonization and migration operations have shown better performance in some applications. Therefore, in this paper, we analyse the performance of evolving diverse ensembles using genetic programming for software defect prediction with unbalanced data under different selection strategies. We use colonization and migration operators along with three ensemble selection strategies for the multi-objective evolutionary algorithm. We compare the performance of the operators on software defect prediction datasets with varying levels of data imbalance. Moreover, to generalize the results, gain a broader view and understand the underlying effects, we replicated the same experiments on UCI datasets, which are often used in the evolutionary computing community. The use of multilevel selection strategies provides reliable results with relatively fast convergence and outperforms the other evolutionary algorithms that are often used in this research area and investigated in this paper. This paper also presents a promising ensemble strategy based on a simple convex hull approach, and at the same time raises the question of whether an ensemble strategy based on the whole population should also be investigated.

6.
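To make university career guidance more targeted and to allow students to be trained in a more targeted way, this paper collects information on graduates together with their employment outcomes, builds a classification and prediction algorithm based on HMIGW feature selection and XGBoost, and applies it to graduate employment prediction. Because student records mix discrete and continuous attributes, the paper first proposes a hybrid feature selection algorithm suited to employment prediction, Hybrid feature selection based on Mutual Information and Gain Weight (HMIGW). The method first estimates the relevance of each feature in the student data, then selects features using a strategy of forward feature addition and backward recursive elimination, and finally trains an XGBoost model on the resulting optimal feature subset to produce predictions. A comparison with other algorithms shows that the proposed method performs well on evaluation metrics such as accuracy and running time, and that it can contribute positively to graduate training and career guidance.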

7.
In QoS-based Web service recommendation, predicting quality of service (QoS) for users greatly aids service selection and discovery. Collaborative filtering (CF) is an effective method for Web service selection and recommendation. CF algorithms can be divided into two main categories: memory-based and model-based algorithms. Memory-based CF algorithms are easy to implement and highly effective, but they suffer from a fundamental problem: they do not scale. Model-based CF algorithms, such as clustering CF algorithms, address the scalability problem by seeking users for recommendation within smaller and highly similar clusters, rather than within the entire database. However, they are often time-consuming to build and update. In this paper, we propose a time-aware and location-aware CF algorithm. To validate our algorithm, we conduct a series of large-scale experiments based on a real-world Web service QoS data set. Experimental results show that our approach is capable of addressing three important challenges of recommender systems: high prediction quality, high scalability, and ease of building and updating.

8.
Strategies for selecting informative data points for training prediction algorithms are important, particularly when data points are difficult and costly to obtain. A Query by Committee (QBC) training strategy for selecting new data points uses the disagreement between a committee of different algorithms to suggest the new data points that most rationally complement existing data, that is, the most informative data points. In order to evaluate this QBC approach on a real-world problem, we compared strategies for selecting new data points. We trained neural network algorithms to obtain methods to predict the binding affinity of peptides binding to the MHC class I molecule, HLA-A2. We show that the QBC strategy leads to a higher performance than a baseline strategy where new data points are selected at random from a pool of available data. Most peptides bind HLA-A2 with a low affinity, and as expected, a strategy of selecting peptides that are predicted to have high binding affinities also leads to more accurate predictors than the baseline strategy. The QBC value is shown to correlate with the measured binding affinity. This demonstrates that the different predictors can easily learn if a peptide will fail to bind, but often conflict in predicting if a peptide binds. Using a carefully constructed computational setup, we demonstrate that selecting peptides with a high QBC performs better than selecting low QBC peptides, independently of binding affinity. When predictors are trained on a very limited set of data they cannot be expected to disagree in a meaningful way, and we find a data limit below which the QBC strategy fails. Finally, it should be noted that data selection strategies similar to those used here might be of use in other settings in which generation of more data is a costly process.
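The committee-disagreement idea described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' setup: the function name `qbc_rank`, the three toy predictors, and the use of the population standard deviation as the disagreement score are all hypothetical choices made here.

```python
from statistics import pstdev

def qbc_rank(candidates, committee):
    """Sort candidate inputs by committee disagreement, most contested first.

    Disagreement is measured (by assumption) as the population standard
    deviation of the committee members' predictions for one candidate.
    """
    def disagreement(x):
        return pstdev([model(x) for model in committee])
    return sorted(candidates, key=disagreement, reverse=True)

# Hypothetical committee: three toy predictors standing in for trained models.
committee = [lambda x: x, lambda x: 2 * x, lambda x: 1.0]

# Candidates the committee disagrees on most come first and would be
# the ones sent for measurement in a QBC loop.
ranked = qbc_rank([0.5, 1.0, 3.0], committee)
```

In an active learning loop, the top-ranked candidates would be measured, added to the training set, and the committee retrained before ranking again.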

9.
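To improve the accuracy and reliability of photovoltaic (PV) output power forecasting, this paper proposes a PV power forecasting method based on Stacking model fusion. Historical measurements from a PV power station, including temperature, humidity and irradiance, are taken as the study data. The PV power data are preprocessed with feature crossing, and features are selected with model-based recursive feature elimination. Three machine learning algorithms, XGBoost, LightGBM and RandomForest, then serve as the first-layer base learners of the Stacking ensemble, and LinearRegression serves as the second-layer meta-learner, yielding a Stacking-fused PV power forecasting model that embeds multiple machine learning algorithms. The results show that the method attains an R² of 0.9874 and an MSE of 0.1056, a marked improvement in accuracy over a single machine learning model.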

10.
Accurate modeling of prosody is a prerequisite for the production of high-quality synthetic speech. Phone duration, as one of the key prosodic parameters, plays an important role in generating natural-sounding emotional synthetic speech. In the present work we offer an overview of various phone duration modeling techniques, and then evaluate ten models, based on decision trees, linear regression, lazy-learning algorithms and meta-learning algorithms, which over the past decades have been successfully used in various modeling tasks. Furthermore, we study the opportunity for performance optimization by applying two feature selection techniques, RReliefF and Correlation-based Feature Selection, to a large set of numerical and nominal linguistic features extracted from text (phonetic, phonologic and morphosyntactic features) that have been reported successful on the phone and syllable duration modeling task. We investigate the practical usefulness of these phone duration modeling techniques on a Modern Greek emotional speech database, which consists of five categories of emotional speech: anger, fear, joy, neutral, sadness. The experimental results demonstrated that feature selection significantly improves the accuracy of phone duration prediction regardless of the type of machine learning algorithm used for phone duration modeling. Specifically, in four out of the five categories of emotional speech, feature selection improved phone duration modeling compared to the case without feature selection. The M5p-tree-based phone duration model was observed to achieve the best phone duration prediction accuracy in terms of RMSE and MAE.

11.
Ensemble methods have been shown to be an effective tool for solving multi-label classification tasks. In the RAndom k-labELsets (RAKEL) algorithm, each member of the ensemble is associated with a small randomly-selected subset of k labels. Then, a single label classifier is trained according to each combination of elements in the subset. In this paper we adopt a similar approach; however, instead of randomly choosing subsets, we select the minimum required subsets of k labels that cover all labels and meet additional constraints such as coverage of inter-label correlations. Construction of the cover is achieved by formulating the subset selection as a minimum set covering problem (SCP) and solving it by using approximation algorithms. Every cover needs to be prepared only once by offline algorithms. Once prepared, a cover may be applied to the classification of any given multi-label dataset whose properties conform with those of the cover. The contribution of this paper is two-fold. First, we introduce SCP as a general framework for constructing label covers while allowing the user to incorporate cover construction constraints. We demonstrate the effectiveness of this framework by proposing two construction constraints whose enforcement produces covers that improve on the prediction performance of random selection by achieving better coverage of labels and inter-label correlations. Second, we provide theoretical bounds that quantify the probabilities of random selection producing covers that meet the proposed construction criteria. The experimental results indicate that the proposed methods improve multi-label classification accuracy and stability compared to the RAKEL algorithm and to other state-of-the-art algorithms.
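To make the set-covering formulation in this abstract concrete, the sketch below applies the classic greedy approximation for set cover to k-label subsets. It is an illustration only: the function name `greedy_label_cover` and the example labels are hypothetical, the paper's additional constraints (such as coverage of inter-label correlations) are not modeled, and the abstract does not say which approximation algorithm the authors use. The sketch assumes 1 <= k <= number of labels.

```python
from itertools import combinations

def greedy_label_cover(labels, k):
    """Greedily pick k-label subsets until every label is covered.

    Standard greedy set-cover heuristic: at each step take the subset
    that covers the largest number of still-uncovered labels.
    Assumes 1 <= k <= len(labels).
    """
    uncovered = set(labels)
    candidates = list(combinations(sorted(labels), k))
    cover = []
    while uncovered:
        best = max(candidates, key=lambda s: len(uncovered.intersection(s)))
        cover.append(best)
        uncovered.difference_update(best)
    return cover

# Hypothetical label set; each chosen subset would get its own classifier.
labels = ["a", "b", "c", "d", "e"]
cover = greedy_label_cover(labels, 2)
```

Enumerating all k-subsets is only practical for small label sets; a larger problem would need a restricted candidate pool.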

12.
The equation for response to selection and its use for prediction   Total citations: 13, self-citations: 0, citations by others: 13
The Breeder Genetic Algorithm (BGA) was designed according to the theories and methods used in the science of livestock breeding. The prediction of a breeding experiment is based on the response to selection (RS) equation. This equation relates the change in a population's fitness to the standard deviation of its fitness, as well as to the parameters selection intensity and realized heritability. In this paper the exact RS equation is derived for proportionate selection given an infinite population in linkage equilibrium. In linkage equilibrium the genotype frequencies are the product of the univariate marginal frequencies. The equation contains Fisher's fundamental theorem of natural selection as an approximation. The theorem shows that the response is approximately equal to the quotient of a quantity called additive genetic variance, VA, and the average fitness. We compare Mendelian two-parent recombination with gene-pool recombination, which belongs to a special class of genetic algorithms that we call univariate marginal distribution (UMD) algorithms. UMD algorithms keep the genotypes in linkage equilibrium. For UMD algorithms, an exact RS equation is proven that can be used for long-term prediction. Empirical and theoretical evidence is provided that indicates that Mendelian two-parent recombination is also mainly exploiting the additive genetic variance. We compute an exact RS equation for binary tournament selection. It shows that the two classical methods for estimating realized heritability (the regression heritability and the heritability in the narrow sense) may give poor estimates. Furthermore, realized heritability for binary tournament selection can be very different from that of proportionate selection. The paper ends with a short survey of methods that extend standard genetic algorithms and UMD algorithms by detecting interacting variables in nonlinear fitness functions and using this information to sample new points.
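The "variance divided by average fitness" form mentioned in this abstract can be checked numerically for the selection step alone. Under proportionate selection an individual with fitness f_i is picked as a parent with probability f_i / Σf, so the expected fitness of the selected parents is Σf_i² / Σf_i, and their gain over the population mean equals the fitness variance divided by the mean fitness. The sketch below verifies that identity on a small example; the function name and the fitness values are hypothetical, and the code covers only the selection differential, not heritability or the paper's exact RS equation.

```python
from statistics import fmean, pvariance

def selection_differential(fitness):
    """Expected gain in mean fitness of parents under proportionate selection.

    Each individual is chosen with probability f_i / sum(f), so the
    expected fitness of a selected parent is sum(f_i**2) / sum(f_i).
    """
    total = sum(fitness)
    selected_mean = sum(f * f for f in fitness) / total
    return selected_mean - fmean(fitness)

# Hypothetical population fitness values (must be positive).
fitness = [1.0, 2.0, 3.0, 4.0]
s = selection_differential(fitness)

# The same quantity written as population variance over mean fitness.
ratio = pvariance(fitness) / fmean(fitness)
```

How much of this differential carries over to the offspring generation depends on heritability, which is where the additive genetic variance enters.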

13.
Peptide vaccination for cancer immunotherapy requires identification of peptide epitopes derived from antigenic proteins associated with the tumor. Such peptides can bind to MHC proteins (MHC molecules) on the tumor-cell surface, with the potential to initiate a host immune response against the tumor. Computer prediction of peptide epitopes can be based on known motifs for peptide sequences that bind to a certain MHC molecule, on algorithms using experimental data as a training set, or on structure-based approaches. We have developed an algorithm, which we refer to as PePSSI, for flexible structural prediction of peptide binding to MHC molecules. Here, we have applied this algorithm to identify peptide epitopes (of nine amino acids, the common length) from the sequence of the cancer-testis antigen KU-CT-1, based on the potential of these peptides to bind to the human MHC molecule HLA-A2. We compared the PePSSI predictions with those of other algorithms and found that several peptides predicted to be strong HLA-A2 binders by PePSSI were similarly predicted by another structure-based algorithm, PREDEP. The results show how structure-based prediction can identify potential peptide epitopes without known binding motifs and suggest that side chain orientation in binding peptides may be obtained using PePSSI.

14.
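Guo Na, Liu Cong, Li Caihong, Lu Ting, Wen Lijie, Zeng Qingtian. Journal of Software (软件学报), 2024, 35(3): 1341-1356
Predicting the remaining time of a business process is valuable for preventing and intervening in business anomalies. Existing remaining-time prediction methods achieve higher accuracy through deep learning, but most deep models have complex structures whose predictions are hard to explain, that is, they lack interpretability. In addition, besides the key attribute of activity, remaining-time prediction selects several other attributes as input features based on domain knowledge; no general feature selection method exists, which affects both prediction accuracy and model interpretability. To address these problems, a remaining-time prediction framework based on an explainable feature-based hierarchical model (EFH model) is proposed. Specifically, a feature self-selection strategy is first proposed: priority-based backward feature elimination and forward feature selection based on feature importance values yield the attributes that positively affect the prediction task, which become the model inputs. An explainable feature-based hierarchical architecture is then proposed, in which different features are added layer by layer to obtain a prediction at each layer, explaining the intrinsic relationship between feature values and prediction results. The approach is instantiated with the LightGBM (light gradient boosting machine) and LSTM (long short-term memory) algorithms; the framework is general and not limited to these algorithms. Finally, it is compared with state-of-the-art methods on eight real-life event logs. The experimental results show that the proposed method selects effective features, improves prediction accuracy, and explains the prediction results.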

15.
Previous studies on predicting the box-office performance of a movie using machine learning techniques have shown practical levels of predictive accuracy. These studies are technically and methodologically oriented, focusing mainly on which algorithms are better at predicting movie performance. However, the accuracy of a prediction model can also be raised by taking other perspectives, such as introducing unexplored features that might be related to the outcomes. In this paper, we examine multiple approaches to improve the performance of the prediction model. First, we develop and add a new feature derived from the theory of transmedia storytelling. Such theory-driven feature selection not only increases forecast accuracy, but also enhances the interpretability of a prediction model. Second, we use an ensemble approach, which has rarely been adopted in research on predicting box-office performance. As a result, the proposed model, the Cinema Ensemble Model (CEM), outperforms the prediction models from past studies that use machine learning algorithms. We suggest that CEM can be widely used by industry experts as a powerful tool for improving the decision-making process.

16.
Forgetting Exceptions is Harmful in Language Learning   Total citations: 2, self-citations: 0, citations by others: 2
We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms.

17.
In this paper, we design and implement a variety of parallel algorithms for both sweep spin selection and random spin selection. We analyze our parallel algorithms on LogP, a portable and general parallel machine model. We then obtain rigorous theoretical runtime results on LogP for all the parallel algorithms. Moreover, a guiding equation is derived for choosing data layouts (blocked vs. striped) for sweep spin selection. For random spin selection, we develop parallel algorithms with efficient communication schemes. We introduce two novel schemes, namely the FML scheme and the α-scheme. We analyze the randomness of our schemes using statistical methods and provide comparisons between the different schemes.

18.
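Newly sequenced species lack high-quality gene structures for training ab initio gene prediction software. To address this, this paper proposes a method for building a reliable training gene set (BRTGS) based on RNA-seq assembly from the newly sequenced species itself. The method uses RNA-seq assembly to obtain a large number of initial gene structures, then filters them against protein homology evidence to retain gene structures that are correct and have relatively complete coding regions, and finally determines the start codon and stop codon positions from the combined statistics of the RNA-seq assembled structures and the protein homology evidence, thereby obtaining complete coding structures for the genes. Experimental results show that the method can build high-quality gene training sets for genomes at various assembly levels, and that ab initio prediction software trained on these gene sets achieves good prediction performance.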

19.
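Prediction problems often involve predicting multiple target variables simultaneously from the same input variables. When the target variables are binary, the task is called multi-label classification; when they are real-valued, it is called multi-target prediction. This paper proposes two new multi-target regression methods, Multi-Target Stacking (MTS) and Ensemble of Regressor Chains (ERC), inspired by two popular multi-label classification methods. In the first training stage, both MTS and ERC use single-target prediction models built with the regression-tree-based AdaBoost algorithm (ART) as the baseline method. In the second training stage, both expand the input space by adding the first-stage target predictions as extra input variables, and so build multi-target prediction models. Both methods exploit the relationships among target variables; the difference is that ERC considers the order of the targets in addition to their dependencies. The shortcomings of MTS and ERC are also summarized, and the algorithms are modified to give the improved versions MTS Corrected (MTSC) and ERC Corrected (ERCC). Experimental results show that the modified regressor chain algorithm, ART-ERCC, performs best on multi-target prediction problems.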

20.
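As market competition in the telecommunications industry intensifies, users' expectations of service quality keep rising, and complaint rates climb accordingly. Reducing the complaint rate by accurately predicting user complaint behavior has therefore become a priority for operators. Traditional complaint prediction models are discussed only in terms of classification algorithms and manually surveyed features, and do not make full use of operators' big data. This paper therefore proposes building a user complaint prediction model with a parallel random forest on a Hadoop/Spark big data platform. The model uses not only business support system data but also operations support system data and customer service work order data, and further adds graph features that reflect relationships between users as well as second-order features. Experiments on data from an operator in Shanghai show that a complaint prediction model trained on multi-source, high-dimensional features is markedly more accurate than traditional methods. On that basis, targeted appeasement measures for the identified users can reduce the complaint rate and yield considerable commercial value.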
