Similar literature
 20 similar documents found.
1.
A data driven ensemble classifier for credit scoring analysis   Total citations: 2 (self-citations: 0, citations by others: 2)
This study focuses on predicting whether a credit applicant can be categorized as good, bad or borderline from information initially supplied. This is essentially a classification task for credit scoring. Given its importance, many researchers have recently worked on an ensemble of classifiers. However, to the best of our knowledge, unrepresentative samples drastically reduce the accuracy of the deployment classifier. Few have attempted to preprocess the input samples into more homogeneous cluster groups and then fit the ensemble classifier accordingly. For this reason, we introduce the concept of class-wise classification as a preprocessing step in order to obtain an efficient ensemble classifier. This strategy would work better than a direct ensemble of classifiers without the preprocessing step. The proposed ensemble classifier is constructed by incorporating several data mining techniques, mainly involving optimal associate binning to discretize continuous values; neural network, support vector machine, and Bayesian network are used to augment the ensemble classifier. In particular, the Markov blanket concept of Bayesian network allows for a natural form of feature selection, which provides a basis for mining association rules. The learned knowledge is represented in multiple forms, including causal diagram and constrained association rules. The data driven nature of the proposed system distinguishes it from existing hybrid/ensemble credit scoring systems.

2.
Bayesian networks are models for uncertain reasoning that are gaining importance for the data mining task of classification. Credal networks extend Bayesian nets to sets of distributions, or credal sets. This paper extends a state-of-the-art Bayesian net for classification, called tree-augmented naive Bayes classifier, to credal sets originated from probability intervals. This extension is a basis to address the fundamental problem of prior ignorance about the distribution that generates the data, which is commonplace in data mining applications. This issue is often neglected, but addressing it properly is key to ultimately drawing reliable conclusions from the inferred models. In this paper we formalize the new model, develop an exact linear-time classification algorithm, and evaluate the credal net-based classifier on a number of real data sets. The empirical analysis shows that the new classifier is good and reliable, and raises a problem of excessive caution that is discussed in the paper. Overall, given the favorable trade-off between expressiveness and efficient computation, the newly proposed classifier appears to be a good candidate for the wide-scale application of reliable classifiers based on credal networks to real and complex tasks.

3.
Constructing Bayesian classifiers in Matlab   Total citations: 2 (self-citations: 1, citations by others: 2)
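Text classification is the foundation and core of text mining, and constructing the classifier is the key step in text classification; Bayesian networks can be used to build classifiers with good classification performance. In this paper, two classifiers are built in Matlab: a naive Bayes classifier (NBC), and a TANC built using mutual information and conditional mutual information measures. The classifiers are validated on standard datasets downloaded from UCI. The experimental results show that their overall performance is somewhat higher than the figures reported in the literature, which indicates that the classifiers are effective and correct. The authors optimize the constructed classifiers and apply them to text classification.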
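The conditional mutual information measure mentioned in this abstract can be illustrated with a short sketch. This is not the authors' Matlab code; it is a minimal Python estimate of I(X; Y | C) from raw counts, and the function name and the toy data are assumptions made for illustration. In TAN-style classifiers this quantity is typically used to weight candidate edges between attributes.

```python
from collections import Counter
from math import log


def conditional_mutual_information(xs, ys, cs):
    """Estimate I(X; Y | C) in nats from three aligned lists of discrete values."""
    n = len(xs)
    n_xyc = Counter(zip(xs, ys, cs))  # joint counts of (x, y, c)
    n_xc = Counter(zip(xs, cs))       # counts of (x, c)
    n_yc = Counter(zip(ys, cs))       # counts of (y, c)
    n_c = Counter(cs)                 # counts of c
    total = 0.0
    for (x, y, c), count in n_xyc.items():
        # p(x,y,c) * log( p(x,y|c) / (p(x|c) * p(y|c)) ), written with raw counts
        ratio = count * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        total += (count / n) * log(ratio)
    return total


# Toy data (hypothetical): Y copies X inside a single class, so I(X; Y | C) = H(X) = log 2.
dependent = conditional_mutual_information([0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 0, 0])
# Toy data (hypothetical): X and Y are independent given C, so the estimate is 0.
independent = conditional_mutual_information([0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0])
```

The estimate uses plain relative frequencies with no smoothing, so it is only reliable when every (x, y, c) combination of interest has been observed often enough.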

4.
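Bayesian and Bayesian network classification models, which are based on probability estimation, have advantages that other data mining tools lack. Building on an analysis of these models and combining them with the minimum-risk decision criterion, a new credit assessment model is proposed. The model is tested on real datasets using cross-validation. The experimental results show that Bayesian and Bayesian network classification models based on the minimum-risk decision criterion can effectively reduce credit assessment risk.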
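The minimum-risk decision criterion named in this abstract can be sketched as follows. The abstract gives no loss values or posteriors, so the numbers and names below are hypothetical; the sketch only shows the general rule of choosing the action with the smallest expected loss, instead of the most probable class.

```python
def minimum_risk_decision(posteriors, loss):
    """Return (best_action, risks).

    posteriors[j] = P(class j | x)
    loss[i][j]    = cost of taking action i when the true class is j
    """
    risks = [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks


# Hypothetical numbers: class 0 = good applicant, class 1 = bad applicant.
posteriors = [0.7, 0.3]
loss = [
    [0.0, 5.0],  # action 0 (approve): costly if the applicant is actually bad
    [1.0, 0.0],  # action 1 (reject): mild cost if the applicant is actually good
]
decision, risks = minimum_risk_decision(posteriors, loss)
# Expected losses are 1.5 (approve) and 0.7 (reject), so the rule rejects,
# even though "good" is the more probable class.
```

With a symmetric 0/1 loss the same function reduces to picking the class with the highest posterior, which is the minimum-error rule.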

5.
Du Chao, Wang Zhihai, Jiang Jingjing, Sun Yange. Journal of Software (《软件学报》), 2017, 28(11): 2891-2904
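Pattern-based Bayesian classification models are an effective approach to classification problems in data mining. However, most pattern-based Bayesian classifiers consider only the support of a pattern in the target-class dataset and ignore its support in the opposing-class dataset. Moreover, pattern-based Bayesian classifiers designed for static datasets are not applicable to unbounded data streams that change dynamically at high speed. To address these problems, this paper proposes EPDS (Bayesian classifier algorithm based on emerging pattern for data stream), a Bayesian classification model for data streams based on emerging patterns. The model uses a simple hybrid forest structure to maintain the itemsets of transactions in memory, and adopts a fast pattern extraction mechanism to speed up the algorithm. EPDS uses a semi-lazy learning strategy to update emerging patterns continuously, and builds a local classification model under each class for every transaction to be classified. Extensive experimental results show that the algorithm achieves higher accuracy than other data stream classification models.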

6.
Peculiarity rules are a new type of useful knowledge that can be discovered by searching the relevance among peculiar data. A main task in mining such knowledge is peculiarity identification. Previous methods for finding peculiar data focus on attribute values. By extending to record-level peculiarity, this paper investigates relational peculiarity-oriented mining. Peculiarity rules are mined, and more importantly explained, in a relational mining framework. Several experiments are carried out and the results show that relational peculiarity-oriented mining is effective.

7.

An application of a classifier fusion technique is presented to improve the performance of an automated reservoir facies identification system. The algorithm presented in this study uses three well-known classifiers, namely Bayesian, k-nearest neighbor (kNN), and support vector machine (SVM), to automatically identify four defined facies of the Asmari Formation from log-derived amplitude versus offset (AVO) attributes. The fuzzy Sugeno integral (FSI) method is then employed to combine the outputs of the three investigated classifiers and increase the consistency of the reservoir facies identification process. The experimental results obtained from applying the presented algorithm to data from three wells drilled in the Asmari Formation provide evidence of the effectiveness of the proposed algorithm regarding true positive (TP), false positive (FP), and classification accuracy criteria.

8.
Association rule mining is a data mining technique for discovering useful and novel patterns or relationships from databases. These rules are simple to infer and intuitive, and can be easily used for classification in any domain that requires explanation for and investigation into how the classification works. Examples of such areas are medicine, agriculture, education, etc. For such a system to find wide adoption, it should give output that is correct and comprehensible. The amount of data has been growing very fast, and so has the search space of these problems, so traditional methods need to change. This paper discusses a rule mining classifier called DA-AC (dynamic adaptive-associative classifier), which is based on a Dynamic Particle Swarm Optimizer. Due to its seeding method, exemplar selection, adaptive parameters, dynamic reconstruction of regions and velocity update, it avoids premature convergence and provides a better value in every dimension. Quality evaluation is done both for individual rules and for entire rulesets. Experiments were conducted over fifteen benchmark datasets to evaluate the performance of the proposed algorithm in comparison with six other state-of-the-art non-associative classifiers and eight associative classifiers. Results demonstrate competitive performance of the proposed DA-AC when considering predictive accuracy and number of mined patterns as parameters. The method was then applied to predict the life expectancy of post-operative thoracic surgery patients.

9.
Qualitatively, a filter is said to be “robust” if its performance degradation is acceptable for distributions close to the one for which it is optimal, that is, the one for which it has been designed. This paper adapts the signal-processing theory of optimal robust filters to classifiers. The distribution (class conditional distributions) to which the classifier is to be applied is parameterized by a state vector, and the principal issue is to choose a design state that is optimal in comparison to all other states relative to some measure of robustness. A minimax robust classifier is one whose worst performance over all states is better than the worst performances of the other classifiers (defined at the other states). A Bayesian robust classifier is one whose expected performance is better than the expected performances of the other classifiers. The state corresponding to the Bayesian robust classifier is called the maximally robust state. Minimax robust classifiers tend to give too much weight to states for which classification is very difficult, and therefore our effort is focused on Bayesian robust classifiers. Whereas the signal-processing theory of robust filtering concentrates on design with full distributional knowledge and a fixed number of observation variables (features), design via training from sample data and feature selection are so important for classification that robustness optimality must be considered from these perspectives, in particular for small samples. In this context, for a given sample size, we will be concerned with the maximally robust state-feature pair. All definitions are independent of the classification rule; however, applications are only considered for linear and quadratic discriminant analysis, for which there are parametric forms for the optimal discriminants.
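The two robustness definitions in this abstract can be contrasted with a small sketch. It assumes a finite set of states and a table error[d][s] holding the error of the classifier designed at state d when applied at state s; the table, the prior over states, and the function names are all hypothetical and are not taken from the paper.

```python
def minimax_robust_state(error):
    """Design state whose worst-case error over all states is smallest."""
    return min(range(len(error)), key=lambda d: max(error[d]))


def bayesian_robust_state(error, prior):
    """Design state whose expected error under the prior over states is smallest."""
    def expected(d):
        return sum(p * e for p, e in zip(prior, error[d]))
    return min(range(len(error)), key=expected)


# Hypothetical error table: row = design state, column = state where it is applied.
# State 2 is hard to classify no matter which design is used.
error = [
    [0.10, 0.12, 0.40],
    [0.15, 0.11, 0.38],
    [0.25, 0.24, 0.30],
]
prior = [0.45, 0.45, 0.10]  # hypothetical: the hard state is also a rare one

minimax_choice = minimax_robust_state(error)          # worst cases: 0.40, 0.38, 0.30
bayes_choice = bayesian_robust_state(error, prior)    # expectations: 0.139, 0.155, 0.2505
```

In this toy table the minimax rule picks the design tuned to the hard but rare state, while the Bayesian rule picks the design that does best on average, which mirrors the abstract's remark that minimax gives too much weight to difficult states.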

10.
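Although naive Bayes is simple and works well on many datasets, its attribute independence assumption does not always hold in the real world, and when the assumption fails its results are poor. Based on analysis and study, a new algorithm that relaxes this independence assumption is proposed: the lazy-learning double-layer naive Bayes classifier L^2DLNB. The algorithm uses a lazy learning method based on conditional mutual information, and applies different attribute dependency relations when computing the likelihood of each class label, so that the likelihood of each class label can be computed more accurately. Experimental results show that the algorithm achieves better classification accuracy on some datasets.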

11.
Classifier combination falls in the so-called data mining area. Its aim is to combine some paradigms from supervised classification (sometimes with a previous non-supervised data division phase) in order to improve the individual accuracy of the component classifiers. Formation of classifier hierarchies is an alternative among the several methods of classifier combination. In this paper we present a novel method to find good hierarchies of classifiers for given databases. In this new proposal, a search is performed by means of genetic algorithms, returning the best individual according to the classification accuracy over the dataset, estimated through 10-fold cross-validation. Experiments have been carried out over 14 databases from the UCI repository, showing an improvement in performance compared to the single classifiers. Moreover, similar or better results than other approaches, such as decision tree bagging and boosting, have been obtained.

12.
In Bayesian classifier learning, estimating the joint probability distribution p(x, y) or the likelihood p(x|y) directly from training data is considered to be difficult, especially in large multidimensional data sets. To circumvent this difficulty, existing Bayesian classifiers such as Naive Bayes, BayesNet, and AnDE have focused on estimating simplified surrogates of p(x, y) from different forms of one-dimensional likelihoods. Contrary to the perceived difficulty in multidimensional likelihood estimation, we present a simple generic ensemble approach to estimate multidimensional likelihood directly from data. The idea is to aggregate p_i(x|y) estimated from a random subsample of data. This article presents two ways to estimate multidimensional likelihoods using the proposed generic approach and introduces two new Bayesian classifiers called ENNBayes and MassBayes that estimate p_i(x|y) using a nearest-neighbor density estimation and a probability estimation through feature space partitioning, respectively. Unlike the existing Bayesian classifiers, ENNBayes and MassBayes have constant training time and space complexities and they scale better than existing Bayesian classifiers in very large data sets. Our empirical evaluation shows that ENNBayes and MassBayes yield better predictive accuracy than the existing Bayesian classifiers in benchmark data sets.

13.
The availability of a large amount of medical data leads to the need for intelligent disease prediction and analysis tools to extract hidden information. A large number of data mining and statistical analysis tools are used for disease prediction. Single data-mining techniques show an acceptable level of accuracy for heart disease diagnosis. This article focuses on prediction and analysis of heart disease using a weighted vote-based classifier ensemble technique. The proposed ensemble model overcomes the limitations of conventional data-mining techniques by employing an ensemble of five heterogeneous classifiers: naive Bayes, decision tree based on Gini index, decision tree based on information gain, instance-based learner, and support vector machines. We have used five benchmark heart disease data sets taken from the UCI repository. Each data set contains a different feature space that ultimately leads to the prediction of heart disease. The effectiveness of the proposed ensemble classifier is investigated by comparing its performance with different researchers' techniques. Tenfold cross-validation is used to handle the class imbalance problem. Moreover, confusion matrices and analysis of variance statistics are used to show the prediction results of all classifiers. The experimental results verify that the proposed ensemble classifier can deal with all types of attributes and that it has achieved a high diagnosis accuracy of 87.37%, sensitivity of 93.75%, specificity of 92.86%, and F-measure of 82.17%. The F-ratio higher than the F-critical and p-value less than 0.01 for a 95% confidence interval indicate that the results are statistically significant for all the data sets.

14.
Well-log lithology prediction based on a TAN Bayesian network classifier   Total citations: 3 (self-citations: 0, citations by others: 3)
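A Bayesian network is a data analysis and decision support tool founded on probability and statistical theory, and the tree-augmented naive Bayes network classifier built with it is one of the best classifiers currently available. Given the particular characteristics of well-log data in petroleum exploration, a Bayesian network is used to predict the corresponding lithology, and the algorithmic procedure for lithology prediction with this method is described. The classifier is built in Matlab with the BNT software package, and an analysis of the experimental results illustrates its advantages.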

15.
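Accurate classification of financial customers is a prerequisite for providing them with personalized services. To meet the sales needs of a particular financial product, customer sample data are collected through online marketing tests and the samples are labeled according to user feedback. Bayesian classifiers are built in two ways: by constructing probability distribution functions and by discretizing continuous data. The classification algorithms are trained and tested with cross-validation, and the naive Bayes classification algorithm is found to outperform the Gaussian Bayes algorithm and the logistic regression algorithm. In the process of discretizing continuous data, combined with classification...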

16.
Within the framework of Bayesian networks (BNs), most classifiers assume that the variables involved are of a discrete nature, but this assumption rarely holds in real problems. Despite the loss of information discretization entails, it is a direct easy-to-use mechanism that can offer some benefits: sometimes discretization improves the run time for certain algorithms; it provides a reduction in the value set and then a reduction in the noise which might be present in the data; in other cases, there are some Bayesian methods that can only deal with discrete variables. Hence, even though there are many ways to deal with continuous variables other than discretization, it is still commonly used. This paper presents a study of the impact of using different discretization strategies on a set of representative BN classifiers, with a significant sample consisting of 26 datasets. For this comparison, we have chosen Naive Bayes (NB) together with several other semi-Naive Bayes classifiers: Tree-Augmented Naive Bayes (TAN), k-Dependence Bayesian (KDB), Aggregating One-Dependence Estimators (AODE) and Hybrid AODE (HAODE). Also, we have included an augmented Bayesian network created by using a hill climbing algorithm (BNHC). With this comparison we analyse to what extent the type of discretization method affects classifier performance in terms of accuracy and bias-variance decomposition. Our main conclusion is that even if a discretization method produces different results for a particular dataset, it does not really have an effect when classifiers are being compared. That is, given a set of datasets, accuracy values might vary but the classifier ranking is generally maintained. This is a very useful outcome: assuming that the type of discretization applied is not decisive, future experiments can be d times faster, d being the number of discretization methods considered.
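The abstract does not say which discretization methods were compared, so the following sketch only illustrates two common unsupervised strategies, equal-width and equal-frequency binning, on hypothetical data. The function names and the sample values are assumptions made for illustration.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would land in bin k, so it is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly the same number of points."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)  # indices sorted by value
    bins = [0] * n
    for rank, idx in enumerate(order):
        bins[idx] = rank * k // n
    return bins


# Hypothetical attribute with two outliers.
values = [1.0, 2.0, 3.0, 4.0, 100.0, 101.0]
by_width = equal_width_bins(values, 3)          # outliers leave the middle bin empty
by_frequency = equal_frequency_bins(values, 3)  # two points per bin
```

Equal-frequency binning as written can place identical values in different bins when ties straddle a boundary; a production implementation would usually cut at distinct quantile values instead.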

17.
Multiple classifier systems (MCS) are attracting increasing interest in the field of pattern recognition and machine learning. Recently, MCS are also being introduced in the remote sensing field where the importance of classifier diversity for image classification problems has not been examined. In this article, Satellite Pour l'Observation de la Terre (SPOT) IV panchromatic and multispectral satellite images are classified into six land cover classes using five base classifiers: contextual classifier, k-nearest neighbour classifier, Mahalanobis classifier, maximum likelihood classifier and minimum distance classifier. The five base classifiers are trained with the same feature sets throughout the experiments and a posteriori probability, derived from the confusion matrix of these base classifiers, is applied to five Bayesian decision rules (product rule, sum rule, maximum rule, minimum rule and median rule) for constructing different combinations of classifier ensembles. The performance of these classifier ensembles is evaluated for overall accuracy and kappa statistics. Three statistical tests, the McNemar's test, the Cochran's Q test and the Looney's F-test, are used to examine the diversity of the classification results of the base classifiers compared to the results of the classifier ensembles. The experimental comparison reveals that (a) significant diversity amongst the base classifiers cannot enhance the performance of classifier ensembles; (b) accuracy improvement of classifier ensembles can only be found by using base classifiers with similar and low accuracy; (c) increasing the number of base classifiers cannot improve the overall accuracy of the MCS and (d) none of the Bayesian decision rules outperforms the others.
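The five decision rules listed in this abstract (product, sum, maximum, minimum and median) can be sketched in a few lines. The sketch assumes that each base classifier already supplies a vector of posterior probabilities per class; how the article derives those posteriors from confusion matrices is not reproduced here, and the numbers and names are hypothetical.

```python
from math import prod
from statistics import median

# Each rule fuses the per-classifier posteriors of one class into a single score.
RULES = {
    "product": prod,
    "sum": sum,
    "maximum": max,
    "minimum": min,
    "median": median,
}


def combine(posteriors, rule):
    """Return the index of the winning class.

    posteriors[k][c] = P(class c | x) as reported by base classifier k.
    """
    fuse = RULES[rule]
    n_classes = len(posteriors[0])
    scores = [fuse([p[c] for p in posteriors]) for c in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)


# Hypothetical outputs of three base classifiers on a two-class problem.
posteriors = [
    [0.90, 0.10],  # very confident in class 0
    [0.40, 0.60],  # mildly prefers class 1
    [0.45, 0.55],  # mildly prefers class 1
]
decisions = {rule: combine(posteriors, rule) for rule in RULES}
```

On this toy input the product, sum, maximum and minimum rules all follow the single confident classifier and choose class 0, while the median rule sides with the two mild votes and chooses class 1, which shows how the rules can disagree on the same evidence.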

18.
Data stream classification with artificial endocrine system   Total citations: 3 (self-citations: 3, citations by others: 0)
Due to concept drifts, maintaining an up-to-date model is a challenging task for most of the current classification approaches used in data stream mining. Both incremental classifiers and ensemble classifiers spend most of their time updating their temporary models, and at the same time most of them need a big sample buffer for training a classifier. These two drawbacks constrain further application in classifying a data stream. In this paper, we present a hormone-based nearest neighbor classification algorithm for data stream classification, in which the classifier is updated every time a new record arrives. The records can be seen as locations in the feature space, and each location can accommodate only one endocrine cell. The classifier consists of endocrine cells on the boundaries of different classes. Every time a new record arrives, the cell that resides in the most unfit location moves to the newly arrived record. In this way, the changing boundaries between different classes are recorded by the locations where endocrine cells reside. The main advantages of the proposed method are that it dispenses with the sample buffer and improves classification accuracy. This is very important in conditions where hardware resources are very expensive or the main memory is limited. Experiments on synthetic and real-life data sets show that the proposed algorithm is able to classify data streams with less memory space and lower classification error.

19.
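A Bayesian network personal credit assessment model based on the minimum overall risk criterion   Total citations: 1 (self-citations: 0, citations by others: 1)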
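The minimum overall risk criterion (MOR) is combined with a Bayesian network classifier to propose a new credit assessment model. The Bayesian network credit assessment model is tested under MOR on two real datasets using 10-fold cross-validation, and its results are compared with those of a Bayesian network classifier under the minimum probability of error criterion (MPE). The results show that the MOR-based Bayesian network classification model can effectively reduce credit assessment risk.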

20.
Bayesian Network Classifiers   Total citations: 154 (self-citations: 0, citations by others: 154)
Friedman, Nir; Geiger, Dan; Goldszmidt, Moises. Machine Learning, 1997, 29(2-3): 131-163
Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier with strong assumptions of independence among features, called naive Bayes, is competitive with state-of-the-art classifiers such as C4.5. This fact raises the question of whether a classifier with less restrictive assumptions can perform even better. In this paper we evaluate approaches for inducing classifiers from data, based on the theory of learning Bayesian networks. These networks are factored representations of probability distributions that generalize the naive Bayesian classifier and explicitly represent statements about independence. Among these approaches we single out a method we call Tree Augmented Naive Bayes (TAN), which outperforms naive Bayes, yet at the same time maintains the computational simplicity (no search involved) and robustness that characterize naive Bayes. We experimentally tested these approaches, using problems from the University of California at Irvine repository, and compared them to C4.5, naive Bayes, and wrapper methods for feature selection.
