Similar Articles
20 similar articles found (search time: 15 ms).
1.
Credit scoring aims to assess the risk associated with lending to individual consumers. Ensemble classification methodology has recently become popular in this field. However, most studies use random sampling to generate the training subsets for constructing the base classifiers, so the diversity of those classifiers is not guaranteed, which may degrade overall classification performance. In this paper, we propose an ensemble classification approach based on supervised clustering for credit scoring. In the proposed approach, supervised clustering partitions the data samples of each class into a number of clusters, and clusters from different classes are then pairwise combined to form training subsets; a specific base classifier is constructed on each subset. For a sample whose class label is to be predicted, the outputs of these base classifiers are combined by weighted voting, where the weight of a base classifier is determined by its classification performance in the neighborhood of the sample. In the experimental study, two benchmark credit data sets are adopted for performance evaluation, and an industrial case study is conducted. The results show that, compared to other ensemble classification methods, the proposed approach generates base classifiers with higher diversity and local accuracy and improves the accuracy of credit scoring.
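A minimal sketch of the scheme described above, with plain KMeans standing in for the paper's supervised clustering; binary labels, names, and parameter values are illustrative assumptions, not the authors' exact algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

def fit_ensemble(X, y, n_clusters=3, seed=0):
    """Pair per-class clusters into training subsets, one base learner each.

    Assumes binary labels y in {0, 1}, as in credit scoring.
    """
    idx_pos, idx_neg = np.where(y == 1)[0], np.where(y == 0)[0]
    pos = KMeans(n_clusters, n_init=10, random_state=seed).fit_predict(X[idx_pos])
    neg = KMeans(n_clusters, n_init=10, random_state=seed).fit_predict(X[idx_neg])
    models = []
    for i in range(n_clusters):
        for j in range(n_clusters):
            # one training subset = one positive cluster + one negative cluster
            idx = np.concatenate([idx_pos[pos == i], idx_neg[neg == j]])
            models.append(DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx]))
    return models

def predict_one(models, X_train, y_train, x, k=10):
    """Weighted vote: each model's weight is its accuracy on x's k neighbors."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, neigh = nn.kneighbors(x.reshape(1, -1))
    votes = np.zeros(2)
    for clf in models:
        weight = (clf.predict(X_train[neigh[0]]) == y_train[neigh[0]]).mean()
        votes[int(clf.predict(x.reshape(1, -1))[0])] += weight
    return int(votes.argmax())
```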

2.
《Artificial Intelligence》2001,125(1-2):209-226
Naive Bayes classifiers provide an efficient and scalable approach to supervised classification problems. When some entries in the training set are missing, methods exist to learn these classifiers under certain assumptions about the pattern of missing data. Unfortunately, reliable information about that pattern may not be readily available, and recent experimental results show that enforcing an incorrect assumption about it produces a dramatic decrease in the accuracy of the classifier. This paper introduces a Robust Bayes Classifier (rbc) able to handle incomplete databases with no assumption about the pattern of missing data. To avoid assumptions, the rbc bounds all possible probability estimates within intervals using a specialized estimation method. These intervals are then used to classify new cases by computing intervals on the posterior probability distributions over the classes given a new case and ranking the intervals according to some criterion. We provide two scoring methods to rank intervals and a decision-theoretic approach to trade off the risk of an erroneous classification against the choice of not classifying a case unequivocally. This decision-theoretic approach can also be used to assess whether it is worthwhile to adopt assumptions about the pattern of missing data. The proposed approach is evaluated on twenty publicly available databases.
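The interval idea is easy to illustrate: with m missing entries for an attribute, any probability estimate that depends on them is only known up to the interval obtained by filling the missing entries in the two extreme ways. A toy sketch of that bound, not the rbc's full estimation method:

```python
def probability_interval(count_observed_true, count_observed, count_missing):
    """Bounds on P(A = true) when count_missing entries for A are unknown."""
    n = count_observed + count_missing
    lower = count_observed_true / n                    # all missing entries are false
    upper = (count_observed_true + count_missing) / n  # all missing entries are true
    return lower, upper

# 30 observed true out of 80 observed, 20 missing -> P in [0.3, 0.5]
print(probability_interval(30, 80, 20))
```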

3.
A novel method for evaluating the reliability of a classifier on a pattern is proposed, based on the discernibility of the pattern's class against the other classes from that pattern. Three measures of discernibility are proposed and experimentally compared with each other and with more conventional techniques based on the classification scores for class labels. Classification accuracy can be significantly enhanced through the discernibility measures by using only the most reliable ('elite') patterns, and it can be further boosted by forming an amalgamation of the elites of different classifiers. The improved performance comes at the price of rejecting many patterns. There are situations in which this price is worth paying: when unreliable predictions, however good, would require the manual testing of very cumbersome and complex technical devices, or in the diagnosis of terminal human diseases. Contrary to conventional techniques for estimating reliability, the proposed measures are applicable to small datasets as well as to datasets with complex class structures on which conventional classifiers show low accuracy rates.
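For contrast, the conventional score-based reliability baseline that the discernibility measures are compared against can be sketched in a few lines: keep as 'elite' only the patterns whose top class score clears a threshold, and reject the rest. The threshold and interface are illustrative:

```python
import numpy as np

def classify_with_reject(clf, X, threshold=0.9):
    """Return predicted labels, with -1 marking rejected (non-elite) patterns."""
    proba = clf.predict_proba(X)        # assumes a fitted probabilistic classifier
    labels = proba.argmax(axis=1)
    reliable = proba.max(axis=1) >= threshold
    return np.where(reliable, labels, -1)
```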

4.
An ensemble classification method based on microarray data
To address the low classification accuracy of existing ensemble classification methods for microarray data, a Bagging-PCA-SVM method is proposed. The method first applies bootstrap resampling to the training set to form a large number of training subsets; feature selection and principal component analysis are then performed on each subset to eliminate noisy and redundant genes; finally, support vector machines serve as the base classifiers, and the class of a sample is predicted by majority voting. Tests on three datasets demonstrate the effectiveness and feasibility of the method.
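A hedged sketch of the pipeline just described, using scikit-learn building blocks; the number of estimators, genes kept, and principal components are illustrative, not the paper's settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.utils import resample

def bagging_pca_svm(X, y, n_estimators=25, k_genes=200, n_components=10):
    """Bootstrap replicates -> gene filter -> PCA -> linear SVM, one per replicate."""
    models = []
    for seed in range(n_estimators):
        Xb, yb = resample(X, y, random_state=seed)  # bootstrap replicate
        pipe = make_pipeline(
            SelectKBest(f_classif, k=min(k_genes, X.shape[1])),  # drop noisy genes
            PCA(n_components=n_components),                      # remove redundancy
            SVC(kernel="linear"),
        )
        models.append(pipe.fit(Xb, yb))
    return models

def predict_majority(models, X_new):
    """Majority vote; assumes integer class labels (e.g., 0/1)."""
    votes = np.stack([m.predict(X_new) for m in models])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```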

5.
To address imbalanced classification on heterogeneous datasets, a classification method for heterogeneous imbalanced data, the HVDM-Adaboost-KNN algorithm (heterogeneous value difference metric-Adaboost-KNN), is proposed from three angles: dataset resampling, ensemble learning, and construction of weak classifiers. The algorithm first balances the dataset with a clustering algorithm, obtaining several balanced data subsets and building a sub-classifier on each; the heterogeneous value difference metric is used to compute the distance between two samples in a heterogeneous dataset, improving the classification performance of the KNN algorithm; Adaboost iterations then yield the final classifier. Eight UCI datasets are used to evaluate the algorithm's classification performance on imbalanced data. Experimental results show that, compared with Adaboost and other algorithms, the method improves classification performance on heterogeneous imbalanced datasets in terms of F1, AUC, G-mean, and other metrics.
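A condensed sketch of the HVDM distance referred to above, following Wilson and Martinez's definition (normalized numeric differences plus a class-conditional value difference metric for nominal attributes); it is written for clarity, not speed, and assumes nominal values are numerically coded:

```python
import numpy as np

def hvdm(x, z, X, y, nominal):
    """HVDM distance between rows x and z; `nominal` flags categorical columns."""
    classes = np.unique(y)
    total = 0.0
    for a in range(X.shape[1]):
        if nominal[a]:
            # value difference metric: compare class-conditional value frequencies
            d = 0.0
            for c in classes:
                mask_x = X[:, a] == x[a]
                mask_z = X[:, a] == z[a]
                p_x = (y[mask_x] == c).mean() if mask_x.any() else 0.0
                p_z = (y[mask_z] == c).mean() if mask_z.any() else 0.0
                d += (p_x - p_z) ** 2
        else:
            # numeric attribute: difference scaled by 4 standard deviations
            sd = X[:, a].std()
            d = ((x[a] - z[a]) / (4 * sd)) ** 2 if sd > 0 else 0.0
        total += d
    return np.sqrt(total)
```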

6.
Real-life datasets are often imbalanced: significantly more training samples are available for some classes than for others, and consequently the conventional aim of maximizing overall classification accuracy is not appropriate for such problems. Various approaches introduced in the literature to deal with imbalanced datasets are typically based on oversampling, undersampling, or cost-sensitive classification. In this paper, we introduce an effective ensemble of cost-sensitive decision trees for imbalanced classification. Base classifiers are constructed according to a given cost matrix but are trained on random feature subspaces to ensure sufficient diversity among the ensemble members. We employ an evolutionary algorithm for simultaneous classifier selection and assignment of committee-member weights for the fusion process. The proposed algorithm is evaluated on a variety of benchmark datasets and is confirmed to improve recognition of the minority class, to be capable of outperforming other state-of-the-art algorithms, and hence to be a useful and effective approach for dealing with imbalanced datasets.
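A minimal sketch of the core construction (cost-sensitive trees on random feature subspaces); the evolutionary selection and weighting step is replaced by uniform voting purely for brevity, and the cost-to-class-weight encoding below is one simple choice among several:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_subspace_cost_forest(X, y, cost_matrix, n_trees=30, frac=0.5, seed=0):
    """cost_matrix[i][j] = cost of predicting j when the true class is i (labels 0..K-1)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    k = max(1, int(frac * n_features))
    # simple cost encoding: weight each class by its total misclassification cost
    class_weight = {c: float(cost_matrix[c].sum()) for c in np.unique(y)}
    forest = []
    for _ in range(n_trees):
        feats = rng.choice(n_features, size=k, replace=False)  # random subspace
        tree = DecisionTreeClassifier(class_weight=class_weight, random_state=seed)
        forest.append((feats, tree.fit(X[:, feats], y)))
    return forest

def predict_forest(forest, X_new):
    """Uniform majority vote over the subspace trees (integer labels assumed)."""
    votes = np.stack([t.predict(X_new[:, f]) for f, t in forest])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```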

7.
A pattern classification problem usually involves high-dimensional features that make the classifier complex and difficult to train, and without feature reduction both training accuracy and generalization capability suffer. This paper proposes a novel hybrid filter-wrapper feature subset selection methodology using a localized generalization error model. The localized generalization error model for a radial basis function neural network bounds from above the generalization error for unseen samples located within a neighborhood of the training samples. The feature making the smallest contribution to this error bound is removed iteratively. Moreover, the feature selection method is independent of the sample size and is computationally fast. The experimental results show that the proposed method consistently removes large percentages of features with statistically insignificant loss of testing accuracy on unseen samples. For two of the datasets, classifiers built using feature subsets with 90% of the features removed by our approach yield average testing accuracies higher than those trained using the full feature set. Finally, we corroborate the efficacy of the model by using it to predict corporate bankruptcies in the US.
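The wrapper loop itself is generic backward elimination; the sketch below uses cross-validated error as a stand-in for the paper's localized generalization error bound, which is specific to the RBF network and omitted here:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def backward_eliminate(clf, X, y, n_keep):
    """Iteratively drop the feature whose removal costs the least estimated error."""
    features = list(range(X.shape[1]))
    while len(features) > n_keep:
        scores = [
            cross_val_score(clf, X[:, [g for g in features if g != f]], y, cv=3).mean()
            for f in features
        ]
        # removing features[argmax] keeps the highest accuracy -> cheapest removal
        features.pop(int(np.argmax(scores)))
    return features
```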

8.
This paper proposes a grey-based nearest-neighbor approach to accurately predict missing attribute values. First, grey relational analysis is employed to determine the nearest neighbors of an instance with missing attribute values; the known attribute values of these neighbors are then used to infer the missing ones. Two datasets were used to demonstrate the performance of the proposed method, and experimental results show that it outperforms both multiple imputation and mean substitution. Moreover, the proposed method was evaluated on five classification problems with incomplete data; the results indicate that classification accuracy is maintained or even increased when the proposed method is applied for missing-attribute-value prediction.
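A minimal sketch of the two steps described above: compute the grey relational grade of the incomplete instance against every complete instance over the observed attributes, then average the top-ranked neighbors' values into the missing slots. The distinguishing coefficient rho = 0.5 is the customary default; everything else is illustrative:

```python
import numpy as np

def grey_relational_grade(x, X, rho=0.5):
    """Grade of reference row x against every row of X (NaNs in x are skipped)."""
    obs = ~np.isnan(x)
    diff = np.abs(X[:, obs] - x[obs])                   # absolute difference table
    dmin, dmax = diff.min(), diff.max()
    coeff = (dmin + rho * dmax) / (diff + rho * dmax)   # grey relational coefficient
    return coeff.mean(axis=1)                           # grade = mean coefficient

def grey_knn_impute(x, X_complete, k=5):
    grade = grey_relational_grade(x, X_complete)
    neighbors = np.argsort(grade)[::-1][:k]             # highest grade = nearest
    filled = x.copy()
    missing = np.isnan(x)
    filled[missing] = X_complete[neighbors][:, missing].mean(axis=0)
    return filled
```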

9.
We introduce Learn++.MF, an ensemble-of-classifiers algorithm that employs random subspace selection to address the missing-feature problem in supervised classification. Unlike most established approaches, Learn++.MF does not replace missing values with estimated ones and hence needs no specific assumptions about the underlying data distribution. Instead, it trains an ensemble of classifiers, each on a random subset of the available features. Instances with missing values are classified by majority voting of those classifiers whose training data did not include the missing features. We show that Learn++.MF can accommodate a substantial amount of missing data, with only a gradual decline in performance as the amount of missing data increases. We also analyze the effect of the cardinality of the random feature subsets and of the ensemble size on algorithm performance. Finally, we discuss the conditions under which the proposed approach is most effective.
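The voting rule is simple to sketch: train each base learner on a random feature subset, and let a query with missing values be voted on only by learners whose subsets avoid its missing features. Subset size, ensemble size, and the base learner are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_subspace_ensemble(X, y, n_classifiers=50, subset_size=3, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_classifiers):
        feats = rng.choice(X.shape[1], size=subset_size, replace=False)
        clf = DecisionTreeClassifier(random_state=seed).fit(X[:, feats], y)
        ensemble.append((feats, clf))
    return ensemble

def classify_with_missing(ensemble, x):
    """Majority vote over classifiers whose features are all observed in x."""
    observed = ~np.isnan(x)
    votes = [clf.predict(x[feats].reshape(1, -1))[0]
             for feats, clf in ensemble if observed[feats].all()]
    if not votes:
        raise ValueError("no usable classifier: too many features missing")
    return np.bincount(np.asarray(votes, dtype=int)).argmax()
```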

10.
The random subspace method for constructing decision forests
Much of the previous work on decision trees focuses on splitting criteria and optimization of tree size; the dilemma between overfitting and achieving maximum accuracy is seldom resolved. A method for constructing a decision-tree-based classifier is proposed that maintains the highest accuracy on training data while improving generalization accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. The subspace method is compared with single-tree classifiers and other forest-construction methods in experiments on publicly available datasets, where its superiority is demonstrated. We also discuss independence between trees in a forest and relate it to the combined classification accuracy.
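The construction maps directly onto scikit-learn's BaggingClassifier once sample bootstrapping is switched off and feature subsampling is switched on; a usage sketch under that assumption (parameter values illustrative, `estimator=` in scikit-learn 1.2+, `base_estimator=` in older versions):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

random_subspace_forest = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=1.0, bootstrap=False,            # every tree sees all samples...
    max_features=0.5, bootstrap_features=False,  # ...but only half the features
    random_state=0,
)
# random_subspace_forest.fit(X_train, y_train)
# random_subspace_forest.predict(X_test)
```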

11.
Data with missing values, or incomplete information, pose challenges for classification, as the incompleteness may significantly affect classifier performance. In this paper, we handle missing values in both training and test sets through uncertainty and imprecision reasoning, proposing a new belief combination of classifiers (BCC) method based on evidence theory. The proposed BCC method aims to improve the classification of incomplete data by characterizing the uncertainty and imprecision brought by incompleteness. In BCC, different attributes are regarded as independent sources, and the collection of each attribute is considered a subset. Multiple classifiers are then trained independently, one per subset, so that each observed attribute provides a sub-classification result for the query pattern. Finally, these sub-classification results, with different weights (discounting factors), provide complementary information that jointly determines the final class of the query pattern. The weights have two aspects: a global weight, calculated by an optimization function, represents the reliability of each classifier, and a local weight, obtained by mining attribute distribution characteristics, quantifies the importance of the observed attributes to the pattern classification. Extensive comparative experiments covering seven methods on twelve datasets demonstrate that BCC outperforms all baseline methods in terms of accuracy, precision, recall, and F1 measure, with reasonable computational cost.
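The combination step can be illustrated on a two-class frame: each attribute-wise classifier contributes a mass function, the weights act as classical discounting factors, and the discounted masses are fused with Dempster's rule. A toy sketch of that fusion, not the full BCC with its global/local weight optimization:

```python
import numpy as np

def discount(mass, alpha):
    """Shift (1 - alpha) of the mass onto the ignorance set (last slot)."""
    m = alpha * np.asarray(mass, dtype=float)
    m[-1] += 1 - alpha
    return m

def dempster(m1, m2):
    """Dempster's rule over focal sets {A}, {B}, {A,B} (slots 0, 1, 2)."""
    a = m1[0] * (m2[0] + m2[2]) + m1[2] * m2[0]
    b = m1[1] * (m2[1] + m2[2]) + m1[2] * m2[1]
    theta = m1[2] * m2[2]
    conflict = m1[0] * m2[1] + m1[1] * m2[0]
    return np.array([a, b, theta]) / (1.0 - conflict)  # renormalize away conflict

m1 = discount([0.8, 0.2, 0.0], alpha=0.9)  # source 1, reliability weight 0.9
m2 = discount([0.3, 0.7, 0.0], alpha=0.6)  # source 2, reliability weight 0.6
print(dempster(m1, m2))                    # fused belief over {A}, {B}, {A,B}
```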

12.
Objective: In hyperspectral classification, intra-class spectral variability means that the conventional random sample-selection strategy cannot guarantee that training samples cover the sample space evenly. To address this, a sample-space optimization strategy based on intra-class re-clustering is proposed; to further improve accuracy, a correction strategy based on high-confidence neighborhood information is proposed for low-confidence classification results. Methods: FCM (fuzzy C-means) clustering is applied within each class, and suitable samples are selected from every resulting sub-cluster. Two simple classifiers, SVM (support vector machine) and SRC (sparse representation-based classifier), are used to cross-check the classification results and determine high- and low-confidence regions; for the low-confidence regions, the principal-component image serves as a guide image for filtering the confidence map, so that high-confidence information propagates into low-confidence regions and corrects their classification results. These strategies allow a strong classifier to be trained, and classification accuracy to be improved substantially, even with few training samples. Results: Three datasets were used, with two groups of experiments at different sample proportions comparing against classical and recent algorithms. The proposed algorithm shows large improvements, especially when the sample proportion is small. In the small-proportion experiments (about one tenth of the usual sampling proportion), the OA (overall accuracy) reaches 90.48% on the Indian Pines dataset, 99.68% on Salinas, and 98.54% on PaviaU, exceeding the other algorithms by 4%-6% on all three datasets. Conclusion: The sample-space optimization strategy selects representative, balanced samples, keeping accuracy high even with small sample proportions; the neighborhood high-confidence correction strategy provides effective refinement; and the algorithm adapts to multiple datasets, showing good robustness.
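A hedged sketch of the intra-class re-clustering sampling strategy, with KMeans standing in for the paper's FCM clustering: each class is split into sub-clusters, and training pixels are drawn evenly from every sub-cluster so that the selected samples cover the class's spectral variability. All parameters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def stratified_intraclass_sample(X, y, per_cluster=5, n_sub=4, seed=0):
    """Return indices of training samples drawn evenly from per-class sub-clusters."""
    rng = np.random.default_rng(seed)
    picked = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        sub = KMeans(n_clusters=min(n_sub, len(idx)), n_init=10,
                     random_state=seed).fit_predict(X[idx])
        for s in np.unique(sub):
            pool = idx[sub == s]
            picked.extend(rng.choice(pool, size=min(per_cluster, len(pool)),
                                     replace=False))
    return np.array(picked)
```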

13.
Artificial immune system (AIS)-based pattern classification is a relatively new approach in the field of pattern recognition. This study explores the potential of the paradigm for the prototype selection task, which is primarily effective in improving the classification performance of the nearest-neighbor (NN) classifier and also partially in reducing its storage and computing-time requirements. The clonal selection model of immunology is used to condense the original prototype set, and performance is verified by employing the proposed technique in a practical optical character recognition (OCR) system as well as for training and testing on a set of benchmark databases available in the public domain. The effect of the control parameters is analyzed, and the efficiency of the method is compared with other existing techniques often used for prototype selection. For the OCR system, the empirical study shows that the proposed approach exhibits very good generalization ability in generating a smaller prototype library from a larger one while substantially improving the classification accuracy of the underlying NN classifier; the improvement has been statistically verified. Results on both the OCR data and the public-domain datasets show that the proposed method performs better than, or at least comparably to, several existing techniques.

14.
Anomaly and attack detection in IoT environments is one of the prime challenges in the internet-of-things domain and requires immediate attention. Anomalies and attacks such as scans, malicious operation, denial of service, spying, data-type probing, wrong setup, and malicious control can lead to the failure of an IoT system. Datasets generated in an IoT environment usually contain missing values, whose presence makes standard classifiers unsuitable for the classification task. This article introduces (a) a novel technique for imputing missing data values, (b) a classifier based on feature transformation, and (c) an imputation measure for computing similarity between any two instances, which can also serve as a similarity measure. The performance of the proposed classifier is studied on imputed datasets obtained by applying K-means, F-Kmeans, and the proposed imputation method, and experiments are also conducted applying existing and proposed classifiers to the dataset imputed with the proposed technique. The experimental study uses an open-source dataset, the distributed smart space orchestration system dataset publicly available from Kaggle, and the results are validated with the Wilcoxon non-parametric statistical test. The proposed approach is shown to outperform existing classifiers when imputation is performed with the F-Kmeans and K-means techniques. With the proposed imputation and classification technique, the accuracies for the scan, malicious operation, denial of service, spying, data-type probing, and wrong-setup attack classes are 100%, and 99% for the malicious-control attack class.

15.
DNA microarray experiments inevitably generate gene expression data with missing values, so an important and necessary pre-processing step is to impute them. Existing imputation methods exploit gene correlation across all experimental conditions when estimating missing values; however, related genes are co-expressed only in subsets of experimental conditions. In this paper, we propose using biclusters, which contain similar genes under subsets of conditions, to characterize gene similarity and then estimate the missing values. To further improve estimation accuracy, an iterative framework is developed with a stopping criterion that minimizes uncertainty. Extensive experiments have been conducted on artificial datasets, real microarray datasets, and one non-microarray dataset; the proposed bicluster-based approach reduces the error in missing-value estimation.

16.
Numerous industrial and research databases include missing values; it is not uncommon to encounter databases with up to half of the entries missing, which makes them very difficult to mine with data-analysis methods that require complete data. A common way of dealing with this problem is to impute (fill in) the missing values. This paper evaluates how the choice of imputation method affects the performance of classifiers subsequently trained on the imputed data, focusing on discrete data. It studies the effect of five single imputation methods (a mean method, a hot-deck method, a Naïve Bayes method, and the latter two methods within a recently proposed imputation framework) and one multiple imputation method (based on polytomous regression) on the classification accuracy of six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machines with polynomial and RBF kernels, and Naïve Bayes) on 15 datasets. The study shows that imputation with the tested methods on average improves classification accuracy compared with classification without imputation. Although no imputation method is universally best, Naïve Bayes imputation gives the best results for the RIPPER classifier on datasets with a high amount (40% and 50%) of missing data, polytomous regression imputation is best for the support vector machine with a polynomial kernel, and the imputation framework is superior for the support vector machine with an RBF kernel and for K-nearest-neighbor. An analysis of imputation quality across varying amounts of missing data (between 5% and 50%) shows that all imputation methods except mean imputation improve classification error for data with more than 10% missing values. Finally, some classifiers, such as C4.5 and Naïve Bayes, were found to be missing-data resistant, i.e., they can classify accurately in the presence of missing data, while others, such as K-nearest-neighbor, SVMs, and RIPPER, benefit from imputation.
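This kind of experiment is straightforward to reproduce in outline: mask entries at random, impute with competing strategies, and compare downstream cross-validated accuracy. A compact sketch with illustrative models and a 30% masking rate (not the study's methods or datasets):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
mask = rng.random(X.shape) < 0.3          # knock out 30% of the entries
X_missing = np.where(mask, np.nan, X)

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("kNN", KNNImputer(n_neighbors=5))]:
    pipe = make_pipeline(imputer, GaussianNB())
    acc = cross_val_score(pipe, X_missing, y, cv=5).mean()
    print(f"{name} imputation: {acc:.3f}")
```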

17.
With the widespread use of social networks, forums, and blogs, customer reviews have emerged as a critical factor in customers' purchase decisions. Since the early 2000s, researchers have focused on automatically categorizing these reviews into polarity levels such as positive, negative, and neutral; this research problem is known as sentiment classification. The objective of this study is to investigate the potential benefit of the multiple classifier systems concept for Turkish sentiment classification and to propose a novel classification technique. The Vote algorithm is used in conjunction with three classifiers: Naive Bayes, Support Vector Machine (SVM), and Bagging, with the SVM's parameters optimized when it is used as an individual classifier. Experimental results show that multiple classifier systems increase the performance of individual classifiers on Turkish sentiment classification datasets and that meta-classifiers contribute to the power of these systems. The proposed approach achieved better performance than Naive Bayes, previously reported as the best individual classifier for these datasets, and than Support Vector Machines. Multiple classifier systems (MCS) are thus a good approach for sentiment classification, and parameter optimization of the individual classifiers must be taken into account when developing MCS-based prediction systems.
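A hedged sketch of such a Vote-based system using scikit-learn equivalents: Naive Bayes, a grid-tuned SVM, and a Bagging ensemble combined by majority vote over TF-IDF features. The grid, features, and estimators are illustrative, not the study's exact setup:

```python
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tuned_svm = GridSearchCV(LinearSVC(), {"C": [0.1, 1, 10]}, cv=3)  # optimize C
vote = VotingClassifier(
    estimators=[("nb", MultinomialNB()),
                ("svm", tuned_svm),
                ("bag", BaggingClassifier(n_estimators=25))],
    voting="hard",                       # majority vote, as in the Vote algorithm
)
model = make_pipeline(TfidfVectorizer(), vote)
# model.fit(train_texts, train_labels); model.predict(test_texts)
```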

18.
宋超  徐新  桂容  谢欣芳  徐丰 《计算机应用》2017,37(1):244-250
To fully exploit the ability of the different polarimetric features of polarimetric synthetic aperture radar (SAR) images to characterize different land-cover types, a polarimetric SAR feature analysis and classification method based on multi-layer support vector machines (SVM) is proposed. The method first determines, through feature analysis, the optimal feature subset for each land-cover type; it then classifies layer by layer, following a hierarchical classification tree and using each type's feature subset for the SVM at that layer, to obtain the overall classification result. Classification experiments on RADARSAT-2 polarimetric SAR images show that the proposed method attains about 85% accuracy for each of the four land-cover classes (water, cropland, forest, and urban areas), with an overall accuracy of 86%. The algorithm fully exploits the characteristics of the different land-cover types, improving classification accuracy while reducing time complexity.
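A hedged sketch of such a layered scheme: at each layer one land-cover class is peeled off by a one-vs-rest SVM trained on that class's own feature subset, and the remaining pixels fall through to the next layer. The feature indices are placeholders for the subsets chosen by the paper's feature analysis:

```python
import numpy as np
from sklearn.svm import SVC

def fit_layers(X, y, class_order, feature_subsets):
    """Train one one-vs-rest SVM per layer; the last class is the remainder."""
    layers = []
    mask = np.ones(len(y), dtype=bool)
    for c in class_order[:-1]:
        feats = feature_subsets[c]                  # this class's best features
        clf = SVC().fit(X[np.ix_(mask, feats)], (y[mask] == c).astype(int))
        layers.append((c, feats, clf))
        mask &= (y != c)                            # peel this class off
    return layers

def predict_layers(layers, default_class, X_new):
    pred = np.full(len(X_new), default_class, dtype=object)
    undecided = np.ones(len(X_new), dtype=bool)
    for c, feats, clf in layers:
        hit = undecided & (clf.predict(X_new[:, feats]) == 1)
        pred[hit] = c
        undecided &= ~hit
    return pred
```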

19.
In medical information systems, the data that describe patient health records are often time-stamped and subject to complexities such as missing data, observations at irregular time intervals, and large attribute sets. Because of these complexities, mining clinical time-series data remains a challenging area of research. This paper proposes a bio-statistical mining framework, named statistical tolerance rough set induced decision tree (STRiD), which handles these complexities and builds an effective classification model; the constructed model is used in a clinical decision support system (CDSS) to assist the physician in clinical diagnosis. The STRiD framework provides temporal pre-processing, attribute selection, and classification. In temporal pre-processing, an enhanced fuzzy-inference-based double exponential smoothing method imputes the missing values and derives the temporal patterns for each attribute. In attribute selection, relevant attributes are selected using the tolerance rough set. A classification model is then constructed with the selected attributes using a temporal-pattern-induced decision tree classifier. For experimentation, this work uses clinical time-series datasets of hepatitis and thrombosis patients; the constructed model demonstrates the effectiveness of the framework with classification accuracies of 91.5% for hepatitis and 90.65% for thrombosis.
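The smoothing step can be illustrated with plain Holt (double exponential) smoothing, imputing each missing point with the one-step-ahead forecast; the paper's fuzzy inference tunes the smoothing constants, which are simply fixed here for illustration:

```python
import numpy as np

def holt_impute(series, alpha=0.5, beta=0.3):
    """Fill NaNs with Holt's level+trend forecast from the values seen so far."""
    x = np.asarray(series, dtype=float)
    level, trend = x[~np.isnan(x)][0], 0.0    # initialize from first observation
    out = x.copy()
    for t in range(len(x)):
        forecast = level + trend
        if np.isnan(out[t]):
            out[t] = forecast                 # impute with one-step-ahead forecast
        prev_level = level
        level = alpha * out[t] + (1 - alpha) * forecast
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return out

print(holt_impute([1.0, 2.0, np.nan, 4.0, np.nan, 6.0]))
```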

20.
Text categorization poses unique challenges to traditional classification methods because of the large number of features inherent in datasets from real-world applications and the large number of training samples. In high-dimensional document data, classes are typically characterized only by subsets of features, and these subsets differ across topics. This paper presents a simple but effective classifier for text categorization based on class-dependent projection. By projecting onto a set of individual subspaces, samples belonging to different document classes are separated so that they are easily classified. This is achieved by developing a new supervised feature-weighting algorithm that learns the optimized subspaces for all document classes. Experiments on common benchmark corpora show that the proposed method achieves both higher classification accuracy and lower computational cost than several representative text-categorization classifiers, especially on datasets whose document categories have overlapping topics.
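A hedged sketch of the class-dependent projection idea: learn one weight vector per class (here simply the class's TF-IDF centroid, a stand-in for the paper's optimized subspaces) and assign each document to the class whose projection scores it highest:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def fit_class_projections(texts, labels):
    """One weight vector per class: the (normalized) TF-IDF class centroid."""
    vec = TfidfVectorizer()
    X = normalize(vec.fit_transform(texts))
    centroids = {c: np.asarray(X[np.asarray(labels) == c].mean(axis=0)).ravel()
                 for c in set(labels)}
    return vec, centroids

def predict(vec, centroids, texts):
    """Project each document onto every class vector; pick the best score."""
    X = normalize(vec.transform(texts))
    classes = list(centroids)
    scores = np.stack([X @ centroids[c] for c in classes], axis=1)
    return [classes[i] for i in scores.argmax(axis=1)]
```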
