Similar Literature
A total of 20 similar documents were retrieved (search time: 31 ms).
1.
Churn prediction in telecom has recently gained substantial interest from stakeholders because of the associated revenue losses. Predicting telecom churners is a challenging problem due to the enormous size of telecom datasets. In this regard, we propose an intelligent churn prediction system for telecom that employs an efficient feature extraction technique and ensemble methods. We use Random Forest, Rotation Forest, RotBoost and DECORATE ensembles in combination with minimum redundancy maximum relevance (mRMR), Fisher's ratio and F-score methods to model the telecom churn prediction problem. We observe that the mRMR method returns more explanatory features than Fisher's ratio and F-score, which significantly reduces the computations and helps the ensembles attain improved performance. Compared with Random Forest, Rotation Forest and DECORATE, RotBoost in combination with mRMR features attains better prediction performance on standard telecom datasets. The better performance of the RotBoost ensemble is largely attributed to the rotation of the feature space, which enables the base classifiers to learn different aspects of churners and non-churners. Moreover, the AdaBoost process in RotBoost also contributes to higher prediction accuracy by handling hard instances. The performance evaluation is conducted on standard telecom datasets using AUC, sensitivity and specificity based measures. Simulation results reveal that the proposed approach based on RotBoost in combination with mRMR features (CP-MRB) is effective in handling the high dimensionality of telecom datasets. CP-MRB offers higher accuracy in predicting churners and is thus quite promising for modeling the challenging problem of customer churn prediction in telecom.
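As a rough illustration of the feature-selection half of this pipeline, the sketch below implements a greedy mRMR-style ranking with scikit-learn's mutual information estimators and feeds the selected features to a boosted tree ensemble. AdaBoost stands in for the boosting half of RotBoost; the rotation step is omitted, and the synthetic data set is an assumption for illustration only, not the paper's telecom data.

```python
# Minimal sketch: greedy mRMR-style feature ranking, then boosted trees on the
# reduced feature set. Data and parameters are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def mrmr_rank(X, y, n_features):
    """Greedily pick features with high relevance and low redundancy."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_features:
        best_score, best_idx = -np.inf, None
        for f in range(X.shape[1]):
            if f in selected:
                continue
            # Redundancy: mean mutual information with already selected features.
            redundancy = mutual_info_regression(
                X[:, selected], X[:, f], random_state=0).mean()
            score = relevance[f] - redundancy
            if score > best_score:
                best_score, best_idx = score, f
        selected.append(best_idx)
    return selected


X, y = make_classification(n_samples=600, n_features=40, n_informative=8,
                           random_state=0)
top = mrmr_rank(X, y, n_features=10)

# Boosted decision stumps on the reduced feature set.
booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=100)
print("AUC:", cross_val_score(booster, X[:, top], y, cv=5,
                              scoring="roc_auc").mean())
```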

2.
Rotation Forest, an effective ensemble classifier generation technique, works by using principal component analysis (PCA) to rotate the original feature axes so that different training sets can be formed for learning the base classifiers. This paper presents a variant of Rotation Forest that can be viewed as a combination of Bagging and Rotation Forest. Bagging is used here to inject more randomness into Rotation Forest in order to increase the diversity among the ensemble members. Experiments conducted on 33 benchmark classification data sets from the UCI repository, with a classification tree adopted as the base learning algorithm, demonstrate that the proposed method generally produces ensemble classifiers with lower error than Bagging, AdaBoost and Rotation Forest. A bias–variance analysis of the error shows that the proposed method improves the prediction error of a single classifier by reducing the variance term much more than the other ensemble procedures considered. Furthermore, results computed on data sets with artificial classification noise indicate that the new method is more robust to noise, and kappa-error diagrams are employed to investigate the diversity-accuracy patterns of the ensemble classifiers.
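A rough sketch of the underlying idea, assuming scikit-learn: each bagged member fits a decision tree on a bootstrap sample that is first rotated with PCA. This is a simplification of the paper's method (it rotates the full feature space rather than per feature subset as in true Rotation Forest), and the data set is illustrative.

```python
# Bagging injects randomness via bootstrap sampling; PCA inside each member
# rotates the feature axes before the tree is grown.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rotated_tree = make_pipeline(PCA(), DecisionTreeClassifier())
ensemble = BaggingClassifier(rotated_tree, n_estimators=50, bootstrap=True,
                             random_state=0)

X, y = load_breast_cancer(return_X_y=True)
print("accuracy:", cross_val_score(ensemble, X, y, cv=10).mean())
```

Because each member's PCA is refitted on its own bootstrap sample, the rotations themselves differ slightly across members, which is the source of the extra diversity the paper aims for.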

3.
To build a successful customer churn prediction model, a classification algorithm should be chosen that fulfills two requirements: strong classification performance and a high level of model interpretability. In recent literature, ensemble classifiers have demonstrated superior performance in a multitude of applications and data mining contests. However, due to their increased complexity, they result in models that are often difficult to interpret. In this study, GAMensPlus, an ensemble classifier based on generalized additive models (GAMs) that reconciles performance and interpretability, is presented and evaluated in the context of churn prediction modeling. The recently proposed GAMens, based on Bagging, the Random Subspace Method and semi-parametric GAMs as constituent classifiers, is extended with two instruments for model interpretability: generalized feature importance scores and bootstrap confidence bands for the smoothing splines. In an experimental comparison on data sets from six real-life churn prediction projects, the competitive performance of the proposed algorithm over a set of well-known benchmark algorithms is demonstrated in terms of four evaluation metrics. Further, the ability of the technique to deliver valuable insight into the drivers of customer churn is illustrated in a case study on data from a European bank. Firstly, it is shown how the generalized feature importance scores allow the analyst to identify the relative importance of churn predictors as a function of the criterion used to measure the quality of the model predictions. Secondly, the ability of GAMensPlus to identify nonlinear relationships between predictors and churn probabilities is demonstrated.

4.
Generalized additive models (GAMs) are a generalization of generalized linear models (GLMs) and constitute a powerful technique that has repeatedly proven its ability to capture nonlinear relationships between explanatory variables and a response variable in many domains. In this paper, GAMs are proposed as base classifiers for ensemble learning. Three alternative ensemble strategies for binary classification using GAMs as base classifiers are proposed: (i) GAMbag, based on Bagging; (ii) GAMrsm, based on the Random Subspace Method (RSM); and (iii) GAMens, a combination of both. In an experimental validation on 12 data sets from the UCI repository, the proposed algorithms are benchmarked against a single GAM and against decision-tree-based ensemble classifiers (i.e. RSM, Bagging, Random Forest, and the recently proposed Rotation Forest). A number of conclusions can be drawn from the results. Firstly, the use of an ensemble of GAMs instead of a single GAM always leads to improved prediction performance. Secondly, GAMrsm and GAMens perform comparably, while both versions outperform GAMbag. Finally, the value of using GAMs rather than standard decision trees as base classifiers in an ensemble is demonstrated: GAMbag performs comparably to ordinary Bagging, while GAMrsm and GAMens outperform RSM and Bagging and perform comparably to Random Forest and Rotation Forest. Sensitivity analyses are included for the number of member classifiers in the ensemble, the number of variables included in a random feature subspace, and the number of degrees of freedom for GAM spline estimation.
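A minimal sketch of the GAMens idea (Bagging plus the Random Subspace Method with GAM base classifiers), assuming the pygam package as a stand-in GAM implementation; the data set, subspace size and ensemble size are illustrative assumptions rather than the paper's settings.

```python
# Each member GAM is fitted on a bootstrap sample (Bagging) restricted to a
# random feature subspace (RSM); member probabilities are averaged.
import numpy as np
from pygam import LogisticGAM
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

n_members, subspace_size = 20, 5
members = []
for _ in range(n_members):
    rows = rng.integers(0, len(X), len(X))                       # bootstrap sample
    cols = rng.choice(X.shape[1], subspace_size, replace=False)  # random subspace
    gam = LogisticGAM().fit(X[rows][:, cols], y[rows])           # default spline term per feature
    members.append((gam, cols))

# Average the member probabilities to score the ensemble.
proba = np.mean([gam.predict_proba(X[:, cols]) for gam, cols in members], axis=0)
print("training accuracy:", ((proba > 0.5) == y).mean())
```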

5.
Rotation forest: A new classifier ensemble method
We propose a method for generating classifier ensembles based on feature extraction. To create the training data for a base classifier, the feature set is randomly split into K subsets (K is a parameter of the algorithm) and Principal Component Analysis (PCA) is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, K axis rotations take place to form the new features for a base classifier. The idea of the rotation approach is to encourage individual accuracy and diversity within the ensemble simultaneously. Diversity is promoted through the feature extraction for each base classifier. Decision trees were chosen here because they are sensitive to rotation of the feature axes, hence the name "forest". Accuracy is sought by keeping all principal components and also by using the whole data set to train each base classifier. Using WEKA, we examined the Rotation Forest ensemble on a random selection of 33 benchmark data sets from the UCI repository and compared it with Bagging, AdaBoost, and Random Forest. The results were favorable to Rotation Forest and prompted an investigation into the diversity-accuracy landscape of the ensemble models. Diversity-error diagrams revealed that Rotation Forest ensembles construct individual classifiers which are more accurate than those in AdaBoost and Random Forest, and more diverse than those in Bagging, sometimes more accurate as well.
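The construction described above can be sketched compactly, assuming scikit-learn and SciPy: for each tree the features are split into K subsets, PCA is fitted on a subsample of each subset with all components kept, the loadings are assembled into a block-diagonal rotation matrix, and the tree is trained on the rotated data. Ensemble size, K and the data set are illustrative choices, not those of the paper.

```python
# Compact Rotation Forest sketch: per-subset PCA loadings form a block-diagonal
# rotation of the full feature space; one decision tree per rotation.
import numpy as np
from scipy.linalg import block_diag
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n_trees, K = 25, 3

forest = []
for _ in range(n_trees):
    perm = rng.permutation(X.shape[1])
    subsets = np.array_split(perm, K)
    blocks = []
    for cols in subsets:
        sample = rng.integers(0, len(X), int(0.75 * len(X)))  # subsample per subset
        pca = PCA().fit(X[sample][:, cols])                   # keep all components
        blocks.append(pca.components_.T)
    # Block-diagonal rotation, rows reordered back to the original feature order.
    R = block_diag(*blocks)
    order = np.concatenate(subsets)
    rotation = np.zeros_like(R)
    rotation[order] = R
    tree = DecisionTreeClassifier(random_state=0).fit(X @ rotation, y)
    forest.append((tree, rotation))

# Majority vote of the rotated trees (evaluated on the training set for brevity).
votes = np.mean([t.predict(X @ R) for t, R in forest], axis=0)
print("training accuracy:", ((votes > 0.5) == y).mean())
```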

6.
The Rotation Forest classifier is a successful ensemble method for a wide variety of data mining applications. However, the way in which Rotation Forest transforms the feature space through PCA, although powerful, penalizes training and prediction times, making it unfeasible for Big Data. In this paper, a MapReduce Rotation Forest and its implementation under the Spark framework are presented. The proposed MapReduce Rotation Forest behaves in the same way as the standard Rotation Forest, training the base classifiers on a rotated space, but it uses a functional implementation of the rotation that enables execution in Big Data frameworks. Experimental results are obtained using different cloud-based cluster configurations. Bayesian tests are used to validate the method against two ensembles for Big Data: the Random Forest and PCARDE classifiers. Our proposal parallelizes both the PCA calculation and the tree training, providing a scalable solution that retains the performance of the original Rotation Forest and achieves a competitive execution time (on average, training more than 3 times faster than other PCA-based alternatives). In addition, extensive experimentation shows that by tuning some parameters of the classifier (i.e., bootstrap sample size, number of trees, and number of rotations), the execution time is reduced with no significant loss of performance when using a small ensemble.
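A simplified PySpark sketch in the spirit of this approach: the feature space is rotated with a distributed PCA and a tree ensemble is trained on the rotated features inside one ML pipeline. This is only a stand-in for the paper's functional per-subset rotation; the input path and column names are hypothetical.

```python
# Distributed rotation-then-trees pipeline with Spark ML (illustrative only).
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import PCA, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rotation-forest-sketch").getOrCreate()

# Hypothetical input: a CSV of numeric features plus a binary "label" column.
df = spark.read.csv("hdfs:///data/telecom_churn.csv", header=True, inferSchema=True)
df = df.withColumn("label", df["label"].cast("double"))

feature_cols = [c for c in df.columns if c != "label"]
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=feature_cols, outputCol="features"),
    # Distributed PCA keeping all components, i.e. a pure rotation of the space.
    PCA(k=len(feature_cols), inputCol="features", outputCol="rotated"),
    RandomForestClassifier(featuresCol="rotated", labelCol="label", numTrees=100),
])
model = pipeline.fit(df)
model.transform(df).select("label", "prediction").show(5)
```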

7.
For customer churn prediction on large-sample data, a prediction model based on spectral-regression feature reduction is proposed from the perspective of effective feature representation. Starting from the original customer features, the model uses spectral-regression-based manifold dimensionality reduction to build a discriminative low-dimensional feature space, on top of which a support vector machine performs binary classification of customer churn. Large-sample experiments on two different data sets, Internet customers and traditional telecom customers, together with comparisons against different classifiers and different feature reduction or selection methods, demonstrate the effectiveness of the method.

8.
陈松峰, 范明. 《计算机科学》, 2010, 37(8): 236-239, 256
PCABoost, a new method for building ensemble classifiers from Bayesian base classifiers, is proposed. When creating training samples, the method randomly partitions the feature set into K subsets, applies PCA to obtain the principal components of each subset to form a new feature space, and maps all of the training data into this new space as a new training set. Different transformations generate different feature spaces and thus several diverse training sets. On each new training set, AdaBoost is used to build a group of progressively boosted Bayesian classifiers (i.e., one classifier group), yielding several diverse classifier groups. Within each group, a prediction is produced by weighted voting, and the group-level predictions are then combined by voting to produce the classification result of the ensemble, giving a combined classifier with two levels of combination. Experiments on 30 data sets randomly selected from the UCI repository show that the algorithm not only significantly improves the classification performance of Bayesian classifiers, but also achieves higher classification accuracy than ensemble methods such as Rotation Forest and AdaBoost on most data sets.
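A minimal sketch of the two-level PCABoost scheme summarized above, assuming scikit-learn: several rotated training sets are built by applying PCA to random feature subsets, an AdaBoost ensemble of naive Bayes classifiers is trained on each, and the group-level predictions are combined by voting. The ensemble sizes and the data set are illustrative, not those of the paper.

```python
# Two-level combination: AdaBoost of GaussianNB inside each rotated group,
# plain voting across groups.
import numpy as np
from scipy.linalg import block_diag
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n_groups, K = 10, 3

groups = []
for _ in range(n_groups):
    perm = rng.permutation(X.shape[1])
    subsets = np.array_split(perm, K)
    # PCA per subset; the loadings form a block-diagonal rotation.
    R = block_diag(*[PCA().fit(X[:, cols]).components_.T for cols in subsets])
    order = np.concatenate(subsets)
    Xr = X[:, order] @ R                     # all training data mapped to the new space
    group = AdaBoostClassifier(GaussianNB(), n_estimators=20).fit(Xr, y)
    groups.append((group, order, R))

# Second-level combination: vote across the boosted groups.
votes = np.mean([g.predict(X[:, o] @ R) for g, o, R in groups], axis=0)
print("training accuracy:", ((votes > 0.5) == y).mean())
```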

9.
Predicting customer churn with the purpose of retaining customers is a hot topic in academia as well as in today's business environment. Targeting the right customers for a specific retention campaign carries a high priority. This study focuses on two aspects in which churn prediction models can be improved: (i) relying on a greater diversity of customer information types and (ii) choosing the best performing classification technique. (i) With the growing interest in new media (e.g. blogs, emails, ...), client/company interactions are facilitated. Consequently, new types of information are available, which create new opportunities to increase the predictive power of a churn model. This study contributes to the literature by finding evidence that adding emotions expressed in client/company emails increases the predictive performance of an extended RFM churn model. As a substantive contribution, an in-depth study of the impact of the emotionality indicators on churn behavior is conducted. (ii) This study compares three classification techniques – Logistic Regression, Support Vector Machines and Random Forests – to distinguish churners from non-churners. The paper shows that Random Forests offer a viable opportunity to improve predictive performance compared to Support Vector Machines and Logistic Regression, which exhibit roughly equal performance.
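A small sketch of the classifier comparison in point (ii), assuming scikit-learn: Logistic Regression, an SVM and Random Forest scored with cross-validated AUC on the same feature table. The synthetic, imbalanced data set stands in for the RFM-plus-emotion features used in the study.

```python
# Compare three churn classifiers on cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=25, weights=[0.9, 0.1],
                           random_state=0)  # churners as the minority class

models = {
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True)),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```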

10.
This paper reviews the definitions of customer churn and customer churn management, the research content and application scenarios of the churn problem, churn prediction algorithms and feature selection methods, and the common techniques and metrics for model evaluation; it then points out the shortcomings of current research and proposes future research directions.

11.
The amounts and types of remote sensing data have increased rapidly, and the classification of these datasets has become more and more overwhelming for a single classifier in practical applications. In this paper, an ensemble algorithm based on Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples (DECORATE) and Rotation Forest is proposed to solve the classification problem for remote sensing images. In this ensemble algorithm, RBF neural networks are employed as base classifiers. Furthermore, an interpolation technique for identical distributions is used to remold the input datasets. These remolded datasets are used to construct new classifiers in addition to the initial classifiers constructed by the Rotation Forest algorithm, and the change in classification error is used to decide whether to add another new classifier. In this way, the diversity among the classifiers is enhanced and the classification accuracy is improved. The adaptability of the proposed algorithm is verified in experiments on standard datasets and an actual remote sensing dataset.

12.
Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain with a high-dimensional feature space challenge; hence, extracting the most important/relevant words about the content of a document and using these keywords as features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most-frequent-measure based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction, and the TextRank algorithm) in combination with classification algorithms and ensemble methods for scientific text document classification (categorization). The study comprehensively compares base learning algorithms (Naïve Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting). To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under curve values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that a Bagging ensemble of Random Forest with the most-frequent based keyword extraction method yields promising results for text classification. For the ACM document collection, the highest average predictive performance (93.80%) is obtained with the most-frequent based keyword extraction method and a Bagging ensemble of the Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.
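A minimal sketch of the best-performing configuration reported above, assuming scikit-learn: a vocabulary capped at the most frequent terms approximates most-frequent keyword extraction, and the resulting term counts feed a Bagging ensemble of Random Forest classifiers. The 20 Newsgroups subset stands in for the ACM collection used in the study.

```python
# Most-frequent-term features + Bagging of Random Forest for text classification.
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

docs = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "comp.graphics", "rec.autos"])

model = make_pipeline(
    CountVectorizer(stop_words="english", max_features=500),  # keep most frequent terms
    BaggingClassifier(RandomForestClassifier(n_estimators=50), n_estimators=5),
)
print("accuracy:", cross_val_score(model, docs.data, docs.target, cv=3).mean())
```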

13.
14.
This paper presents a novel ensemble classifier framework for improved classification of mammographic lesions in Computer-aided Detection (CADe) and Diagnosis (CADx) systems. Compared to previously developed classification techniques in mammography, the main novelty of the proposed method is twofold: (1) the combined use of different feature representations (of the same instance) and data resampling to generate more diverse and accurate base classifiers as ensemble members, and (2) the incorporation of a novel ensemble selection mechanism to further maximize the overall classification performance. In addition, as opposed to conventional ensemble learning, the proposed ensemble framework has the advantage of working well with both weak and strong classifiers, which are extensively used in mammography CADe and/or CADx systems. Extensive experiments have been performed using a benchmark mammogram dataset to test the proposed method on two classification applications: (1) false-positive (FP) reduction using classification between masses and normal tissue, and (2) diagnosis using classification between malignant and benign masses. The results show that the proposed method (area under the ROC curve (AUC) of 0.932 and 0.878, obtained for the two classification applications, respectively) substantially outperforms the most commonly used single neural network (AUC = 0.819 and AUC = 0.754) and support vector machine (AUC = 0.849 and AUC = 0.773) based classification approaches. In addition, the feasibility of the method has been successfully demonstrated through comparisons with other state-of-the-art ensemble classification techniques such as the Gentle AdaBoost and Random Forest learning algorithms.
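The ensemble-selection step can be sketched as a greedy forward search over a pool of diverse base classifiers, as below (assuming scikit-learn). The pool composition, validation split and synthetic data are illustrative assumptions; the paper's mammography features and exact selection criterion are not reproduced.

```python
# Greedy forward ensemble selection: add a member only if it improves
# validation AUC of the averaged probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, n_features=30, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
rng = np.random.default_rng(0)

# Train a diverse pool: different model families on different bootstrap samples.
pool = []
for make_model in (lambda: DecisionTreeClassifier(max_depth=5),
                   lambda: MLPClassifier(max_iter=500),
                   lambda: SVC(probability=True)):
    for _ in range(5):
        rows = rng.integers(0, len(X_tr), len(X_tr))
        pool.append(make_model().fit(X_tr[rows], y_tr[rows]))

val_scores = np.array([m.predict_proba(X_val)[:, 1] for m in pool])

selected, current, best_auc = [], np.zeros(len(y_val)), 0.0
improved = True
while improved:
    improved = False
    for i in range(len(pool)):
        if i in selected:
            continue
        candidate = (current * len(selected) + val_scores[i]) / (len(selected) + 1)
        auc = roc_auc_score(y_val, candidate)
        if auc > best_auc:
            best_auc, best_i, improved = auc, i, True
    if improved:
        selected.append(best_i)
        current = val_scores[selected].mean(axis=0)

print("selected members:", selected, "validation AUC:", round(best_auc, 3))
```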

15.
An Ensemble Co-Training Algorithm Based on Rotation Forest
Ensemble co-training is a semi-supervised learning method that combines ensemble learning with the co-training algorithm, while Rotation Forest is an ensemble learning method that uses feature extraction to create diversity among base classifiers. Building on existing research on ensemble co-training, a Rotation-Forest-based co-training algorithm, ROFCO, is proposed. The method focuses on using unlabeled data to improve both the diversity among base classifiers and the effect of feature extraction, so that the generalization error of the base classifiers remains unchanged or decreases while their diversity is maintained or even increased, thereby improving the ensemble. Experimental results show that the method achieves good performance.
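A simplified sketch of the co-training loop behind this idea, assuming scikit-learn: two classifiers trained on differently rotated feature views repeatedly add their most confident predictions on unlabeled data to the labeled pool. The full Rotation Forest machinery is omitted, and the views, batch size and data set are illustrative.

```python
# Co-training-style loop with two PCA-rotated feature views.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

labeled = list(rng.choice(len(X), 60, replace=False))   # small initial labeled set
unlabeled = [i for i in range(len(X)) if i not in labeled]
pseudo_y = {i: y[i] for i in labeled}

# Two "views": PCA rotations fitted on different random feature subsets.
views = [(cols, PCA().fit(X[:, cols]))
         for cols in (rng.choice(20, 12, replace=False),
                      rng.choice(20, 12, replace=False))]

for _ in range(10):                                      # co-training rounds
    for cols, pca in views:
        Xv = pca.transform(X[:, cols])
        clf = DecisionTreeClassifier(max_depth=4).fit(
            Xv[labeled], [pseudo_y[i] for i in labeled])
        if not unlabeled:
            break
        proba = clf.predict_proba(Xv[unlabeled])
        # Move the 5 most confidently predicted unlabeled points to the pool.
        for j in np.argsort(proba.max(axis=1))[-5:]:
            i = unlabeled[j]
            pseudo_y[i] = int(proba[j].argmax())
            labeled.append(i)
        unlabeled = [i for i in unlabeled if i not in pseudo_y]

print("pseudo-labeled examples added:", len(labeled) - 60)
```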

16.
Early non-invasive diagnosis of coronary heart disease has long been a research focus in medical diagnosis. To improve the accuracy and efficiency of coronary heart disease diagnosis, a novel diagnosis model (LFDA-EKELM) is proposed that combines local Fisher discriminant analysis (LFDA) for feature extraction with an ensemble of kernel extreme learning machines (KELM). First, LFDA is used to remove irrelevant and redundant features and to identify the feature subset that contributes most to the classification result; different training sets are then generated to train particle-swarm-optimized KELM classifiers (PSO-KELM), and an ensemble classifier is built based on Rotation Forest (RF) to realize intelligent diagnosis of coronary heart disease. Experimental results show that, compared with ELM-, SVM- and BPNN-based methods, the proposed approach effectively improves diagnostic accuracy and efficiency, its classification results exceed those of existing and similar methods, and it constitutes an effective model for coronary heart disease diagnosis.

17.
To address the limitations of data mining methods in telecom customer churn prediction, this paper proposes combining information fusion with data mining and building churn prediction models at the data, feature and decision levels. Churn prediction indicators are first determined; customers are then partitioned according to differences in how their samples are distributed in feature space, yielding customer groups with different characteristics; different algorithms are used to build churn prediction models for the different customer groups, the fusion weights of the models are obtained with an artificial ant colony algorithm, and the predictions of the individual models are weighted to produce the final result. Experimental results show that the information-fusion-based churn prediction model indeed outperforms traditional models.

18.
We present an extensive empirical comparison between nineteen prototypical supervised ensemble learning algorithms, including Boosting, Bagging, Random Forests, Rotation Forests, Arc-X4, Class-Switching and their variants, as well as more recent techniques like Random Patches. These algorithms were compared against each other in terms of threshold, ranking/ordering and probability metrics over nineteen UCI benchmark data sets with binary labels. We also examine the influence of two base learners, CART and Extremely Randomized Trees, on the bias–variance decomposition, as well as the effect of calibrating the models via Isotonic Regression on each performance metric. The selected data sets were already used in various empirical studies and cover different application domains. The source code and the detailed results of our study are publicly available.
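The calibration comparison mentioned above can be sketched with scikit-learn's CalibratedClassifierCV: the same bagged-tree ensemble is scored on the Brier score before and after Isotonic Regression calibration. The data set and ensemble choice are illustrative, not those of the study.

```python
# Compare an uncalibrated and an isotonic-calibrated ensemble on the Brier score.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=30, random_state=0)
base = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)

for name, model in [("uncalibrated", base), ("isotonic-calibrated", calibrated)]:
    score = cross_val_score(model, X, y, cv=5, scoring="neg_brier_score").mean()
    print(f"{name}: Brier score = {-score:.4f}")
```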

19.
We propose three model-free feature extraction approaches for solving the multiple class classification problem; we use multi-objective genetic programming (MOGP) to derive (near-)optimal feature extraction stages as a precursor to classification with a simple and fast-to-train classifier. Statistically-founded comparisons are made between our three proposed approaches and seven conventional classifiers over seven datasets from the UCI Machine Learning database. We also make comparisons with other reported evolutionary computation techniques. On almost all the benchmark datasets, the MOGP approaches give better or identical performance to the best of the conventional methods. Of our proposed MOGP-based algorithms, we conclude that hierarchical feature extraction performs best on multi-classification problems.

20.
Xue Yanbing, Geng Huiqiang, Zhang Hua, Xue Zhenshan, Xu Guangping. Multimedia Tools and Applications, 2018, 77(17): 22199-22211

This paper proposes a feed-forward architecture algorithm that fuses features and classifiers for semantic segmentation. The algorithm consists of three phases: first, features from a hierarchical convolutional neural network (CNN) and region-based features are extracted and fused at the superpixel level; second, multiple classifiers (Softmax, XGBoost and Random Forest) are ensembled to compute the per-pixel class probabilities; finally, a fully connected conditional random field is employed to enhance the final performance. The hierarchical features contain more global evidence and the region features contain more local evidence, so fusing these two kinds of features is expected to enhance the feature representation ability. In the classification phase, integrating multiple classifiers aims to improve the generalization ability of the classification algorithms. Experiments conducted on the SIFT Flow dataset show that the proposed method achieves competitive labeling accuracy.

