Similar Literature
 20 similar records found; search took 46 ms
1.
Today, feature selection is an active research area in machine learning. The main idea of feature selection is to choose a subset of the available features by eliminating features with little or no predictive information, as well as redundant features that are strongly correlated. Many approaches to feature selection exist, but most of them can only work with crisp data; until now, few approaches have been able to work directly with both crisp and low-quality (imprecise and uncertain) data. We therefore propose a new feature selection method that can handle both crisp and low-quality data. The proposed approach is based on a Fuzzy Random Forest and integrates filter and wrapper methods into a sequential search procedure that improves the classification accuracy of the selected features. It consists of the following main steps: (1) scaling and discretization of the feature set, and feature pre-selection using the discretization process (filter); (2) ranking of the pre-selected features using the Fuzzy Decision Trees of a Fuzzy Random Forest ensemble; and (3) wrapper feature selection using a Fuzzy Random Forest ensemble based on cross-validation. The efficiency and effectiveness of this approach are demonstrated through several experiments using both high-dimensional and low-quality datasets. The approach shows good performance (not only in classification accuracy but also in the number of features selected) and behaves well both with high-dimensional datasets (microarray datasets) and with low-quality datasets.
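The filter-then-wrapper sequence the abstract describes can be sketched for crisp data only, with a plain scikit-learn RandomForest standing in for the paper's Fuzzy Random Forest (an assumption; the synthetic dataset and thresholds are also illustrative):

```python
# Hedged sketch: filter (mutual-information ranking) followed by a wrapper
# (greedy forward selection under cross-validation), on crisp synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Step 1 (filter): rank features by mutual information, pre-select the top half.
mi = mutual_info_classif(X, y, random_state=0)
ranked = np.argsort(mi)[::-1][:10]

# Steps 2-3 (wrapper): greedily add ranked features while CV accuracy improves.
selected, best = [], 0.0
for f in ranked:
    trial = selected + [f]
    score = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                            X[:, trial], y, cv=5).mean()
    if score > best:
        selected, best = trial, score

print(len(selected), round(best, 3))
```

The greedy loop keeps a feature only if it strictly improves cross-validated accuracy, which tends to yield small subsets on redundant data.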

2.
Imperfect information inevitably appears in real situations for a variety of reasons. Although efforts have been made to incorporate imperfect data into learning and inference methods, there are many limitations as to the types of data, uncertainty and imprecision that can be handled. In this paper, we propose a classification and regression technique to handle imperfect information. We incorporate the handling of imperfect information into both the learning phase, by building the model that represents the situation under examination, and the inference phase, by using that model. The model obtained is global and is described by a Gaussian mixture. To show the efficiency of the proposed technique, we perform a comparative study against a broad baseline of techniques available in the literature, tested with several datasets.

3.
Currently, web spamming is a serious problem for search engines. It not only degrades the quality of search results by intentionally boosting undesirable web pages to users, but also causes the search engine to waste a significant amount of computational and storage resources on manipulating useless information. In this paper, we present a novel ensemble classifier for web spam detection, called USCS, which combines the clonal selection algorithm for feature selection with under-sampling for data balancing. The USCS ensemble can automatically sample data and select sub-classifiers. First, the system converts the imbalanced training dataset into several balanced datasets using the under-sampling method. Second, it automatically selects several optimal feature subsets for each sub-classifier using a customized clonal selection algorithm. Third, it builds several C4.5 decision tree sub-classifiers from these balanced datasets based on their selected features. Finally, these sub-classifiers are used to construct an ensemble decision tree classifier, which is applied to classify the examples in the testing data. Experiments on the WEBSPAM-UK2006 web spam dataset show that our proposed approach, the USCS ensemble web spam classifier, achieves significantly better classification performance than several baseline systems and state-of-the-art approaches.
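The under-sampling-plus-voting skeleton of USCS can be illustrated as below. The clonal-selection feature search is replaced here by a plain random feature subset, and an sklearn decision tree stands in for C4.5 (both assumptions for the sake of a runnable sketch):

```python
# Hedged sketch: convert an imbalanced set into several balanced subsets by
# random under-sampling, train one tree per subset on its own feature subset,
# and combine the trees by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]

trees, feats = [], []
for _ in range(5):
    maj = rng.choice(majority, size=len(minority), replace=False)  # balance classes
    idx = np.concatenate([minority, maj])
    f = rng.choice(X.shape[1], size=10, replace=False)  # stand-in feature subset
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[np.ix_(idx, f)], y[idx]))
    feats.append(f)

votes = np.mean([t.predict(X[:, f]) for t, f in zip(trees, feats)], axis=0)
pred = (votes >= 0.5).astype(int)
print(round((pred == y).mean(), 3))
```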

4.
Cancer diagnosis is an important emerging clinical application of microarray data. Accurate prediction of the type or size of tumors relies on adopting powerful and reliable classification models, so that patients can be provided with better treatment or response to therapy. However, the high dimensionality of microarray data may bring some disadvantages, such as over-fitting, poor performance and low efficiency, to traditional classification models. Thus, one of the challenging tasks in cancer diagnosis is identifying salient expression genes, among the thousands of genes in microarray data, that directly contribute to the phenotype or symptoms of a disease. In this paper, we propose a new ensemble gene selection method (EGS) to choose multiple gene subsets for classification, where the significance of a gene is measured by conditional mutual information or its normalized form. After different gene subsets have been obtained by setting different starting points of the search procedure, they are used to train multiple base classifiers, which are then aggregated into a consensus classifier by majority voting. The proposed method is compared with five popular gene selection methods on six public microarray datasets, and the comparison results show that our method works well.
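The relevance measure EGS names, conditional mutual information I(X;Y|Z), can be estimated for discrete variables by simple frequency counting (a minimal sketch; the toy variables below are invented for illustration):

```python
# Hedged sketch: plug-in estimate of I(X;Y|Z) for discrete variables,
# the per-gene relevance measure named in the EGS abstract.
import numpy as np

def cond_mutual_info(x, y, z):
    cmi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            for zv in np.unique(z):
                pxyz = np.mean((x == xv) & (y == yv) & (z == zv))
                pxz = np.mean((x == xv) & (z == zv))
                pyz = np.mean((y == yv) & (z == zv))
                pz = np.mean(z == zv)
                if pxyz > 0:
                    cmi += pxyz * np.log2(pz * pxyz / (pxz * pyz))
    return cmi

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 1000)
flips = rng.integers(0, 2, 1000) * (rng.random(1000) < 0.1)  # rare bit flips
x = z ^ flips        # x mostly copies z, with ~5% flips
y = x.copy()         # y is fully determined by x
cmi_val = cond_mutual_info(x, y, z)
print(round(cmi_val, 3))
```

Since y equals x here, the estimate reduces to the residual entropy H(X|Z), which is small but positive because of the flips.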

5.
Advocating the Use of Imprecisely Observed Data in Genetic Fuzzy Systems   (cited 2 times: 0 self-citations, 2 by others)
In our opinion, and in accordance with the current literature, the precise contribution of genetic fuzzy systems to the corpus of machine learning theory has not yet been clearly stated. In particular, we question the existence of a set of problems for which the use of fuzzy rules, in combination with genetic algorithms, produces more robust models or classifiers that are inherently better than those arising from the Bayesian point of view. We will show that this set of problems actually exists, and comprises interval- and fuzzy-valued datasets, but that it is not being exploited. Current genetic fuzzy classifiers deal with crisp classification problems, where the role of fuzzy sets is reduced to giving a parametric definition of a set of discriminant functions with a convenient linguistic interpretation. Given that the customary use of fuzzy sets in statistics is for vague data, we propose to test genetic fuzzy classifiers on imprecisely measured data and to design experiments well suited to these problems. The same can be said about genetic fuzzy models: the use of a scalar fitness function assumes crisp data, for which fuzzy models do not, a priori, have advantages over statistical regression.

6.
Security administrators need to prioritise which features to focus on amid the various possibilities and avenues of attack, especially via Web Services in e-commerce applications. This study addresses the feature selection problem by proposing a predictive fuzzy associative rule model (FARM). FARM validates inputs by segregating anomalies based on fuzzy associative patterns discovered from five attributes in the intrusion datasets. These associative patterns lead to the discovery of a set of 18 interesting rules at 99% confidence and, subsequently, to categorisation into not only certainly-allow/deny but also probably-deny access decision classes. FARM's classification provides 99% classification accuracy with less than a 1% false alarm rate. Our findings indicate two benefits of using fuzzy datasets. First, fuzziness enables the discovery of fuzzy association patterns and fuzzy association rules, and supports more sensitive classification. In addition, the root mean squared error (RMSE) and classification accuracy for fuzzy and crisp datasets do not differ much when the Random Forest classifier is used; however, when other classifiers are used with an increasing number of instances, the fuzzy datasets perform much better. Future research will involve experimentation on bigger datasets of different data types.

7.
Zhang Yong, Liu Bo, Cai Jing, Zhang Suhua. Neural Computing & Applications, 2016, 28(1): 259-267

The extreme learning machine for single-hidden-layer feedforward neural networks has been extensively applied to imbalanced data learning due to its fast learning capability. An ensemble approach can effectively improve classification performance by combining several weak learners according to a certain rule. In this paper, a novel ensemble approach based on the weighted extreme learning machine is proposed for the imbalanced data classification problem. The weight of each base learner in the ensemble is optimized by a differential evolution algorithm. Experimental results on 12 datasets show that the proposed method achieves better classification performance than the simple vote-based ensemble method and the non-ensemble method.
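The weight-optimization step can be sketched with SciPy's differential evolution. Small scikit-learn classifiers stand in for weighted ELMs, and the dataset is synthetic (both assumptions):

```python
# Hedged sketch: tune ensemble member weights with differential evolution so
# that the weighted soft vote maximizes validation accuracy.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, weights=[0.85, 0.15], random_state=0)
Xtr, Xval, ytr, yval = train_test_split(X, y, random_state=0)
models = [LogisticRegression(max_iter=1000).fit(Xtr, ytr),
          DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xtr, ytr),
          DecisionTreeClassifier(max_depth=6, random_state=1).fit(Xtr, ytr)]
probs = np.array([m.predict_proba(Xval)[:, 1] for m in models])  # (3, n_val)

def neg_acc(w):
    # Weighted average of member probabilities; small epsilon avoids a zero sum.
    p = np.average(probs, axis=0, weights=np.abs(w) + 1e-9)
    return -np.mean((p >= 0.5) == yval)

res = differential_evolution(neg_acc, bounds=[(0, 1)] * 3, seed=0, maxiter=30)
print(round(-res.fun, 3))
```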


8.
This paper focuses on ensemble methods for Fuzzy Rule-Based Classification Systems (FRBCS), in which the decisions of different classifiers are combined to form the final classification model. The proposed methods reduce FRBCS complexity and the number of generated rules. We are particularly interested in ensemble methods that cluster the attributes into subgroups and treat each subgroup separately. Our work extends a previous ensemble method called SIFRA, which uses frequent-itemset mining to deduce groups of related attributes by analyzing their simultaneous appearances in the database. The drawback of that method is that it forms the attribute groups by searching for dependencies between attributes independently of the class information. Moreover, since we deal with supervised learning problems, it is natural to consider the class attribute when forming the attribute subgroups. In this paper, we propose two new supervised attribute-regrouping methods that take into account not only the dependencies between attributes but also the class labels. Results on various benchmark datasets show good accuracy of the built classification model.

9.
Along with the increase of data and information, incremental learning ability is becoming more and more important for machine learning approaches. Online algorithms process examples as they arrive and try not to retain irrelevant information, as opposed to classic batch learning algorithms that synthesize all available information at once. Combining classifiers has been proposed as a new road to improving classification accuracy; however, most ensemble algorithms operate in batch mode. For this reason, we propose an incremental ensemble that combines, using the voting methodology, five classifiers that can operate incrementally: Naive Bayes, Averaged One-Dependence Estimators (AODE), 3-Nearest Neighbors, Non-Nested Generalised Exemplars (NNGE), and KStar. We performed a large-scale comparison of the proposed ensemble with other state-of-the-art algorithms on several datasets, and the proposed method produces better accuracy in most cases.
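An incremental voting ensemble can be sketched with two scikit-learn estimators that support online updates through `partial_fit`; these stand in for the abstract's five learners (an assumption, as most of those are Weka algorithms):

```python
# Hedged sketch: feed data as a stream of mini-batches; each member updates
# incrementally, and the ensemble predicts by voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, random_state=0)
classes = np.unique(y)
nb, sgd = GaussianNB(), SGDClassifier(random_state=0)

# First 500 samples arrive as a stream; the last 100 are held out for testing.
for start in range(0, 500, 100):
    xb, yb = X[start:start + 100], y[start:start + 100]
    nb.partial_fit(xb, yb, classes=classes)
    sgd.partial_fit(xb, yb, classes=classes)

votes = nb.predict(X[500:]) + sgd.predict(X[500:])
pred = (votes >= 1).astype(int)  # with two voters, ties go to the positive class
print(round((pred == y[500:]).mean(), 3))
```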

10.
The decision making trial and evaluation laboratory (DEMATEL) method is a useful tool for analyzing correlations among factors using crisp values. However, crisp values are inadequate for modeling real-life situations because of the fuzziness and uncertainty frequently involved in experts' judgments. The aim of this paper is to extend the DEMATEL method to an uncertain linguistic environment, in which the correlation information among factors provided by experts takes the form of uncertain linguistic terms. A formula is first presented to transform correlation information from uncertain linguistic terms into trapezoidal fuzzy numbers. We then aggregate each expert's transformed correlation information into group information using the operations of trapezoidal fuzzy numbers. The importance and classification of factors are determined via fuzzy matrix operations. Furthermore, a causal diagram is constructed to vividly show the different roles of the factors. Finally, an example is used to illustrate the procedure of the proposed method.
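The crisp DEMATEL core that the paper extends can be sketched in a few lines; the 3-factor direct-relation matrix below is invented for illustration:

```python
# Hedged sketch: crisp DEMATEL. Normalize the direct-relation matrix, compute
# the total-relation matrix T = D (I - D)^-1, then read off each factor's
# prominence (r + c) and net causal role (r - c).
import numpy as np

A = np.array([[0, 3, 2],
              [1, 0, 3],
              [2, 1, 0]], dtype=float)   # invented expert direct-influence scores

D = A / A.sum(axis=1).max()              # normalize by the largest row sum
T = D @ np.linalg.inv(np.eye(3) - D)     # total-relation matrix
r, c = T.sum(axis=1), T.sum(axis=0)
prominence = r + c                       # overall importance of each factor
relation = r - c                         # > 0: net cause, < 0: net effect

print(np.round(prominence, 2), np.round(relation, 2))
```

Because the sum of all row sums of T equals the sum of all column sums, the `relation` values always balance out to zero across factors.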

11.
The amount and variety of remote sensing data have increased rapidly, and classifying these datasets has become more and more overwhelming for a single classifier in practical applications. In this paper, an ensemble algorithm based on Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples (DECORATE) and Rotation Forest is proposed to solve the remote sensing image classification problem. In this ensemble algorithm, RBF neural networks are employed as base classifiers. Furthermore, interpolation for identical distributions is used to remold the input datasets; the remolded datasets are used to construct new classifiers in addition to the initial classifiers constructed by the Rotation Forest algorithm. The change in classification error decides whether another new classifier is added. The diversity among the classifiers is thereby enhanced and the classification accuracy improved. The adaptability of the proposed algorithm is verified in experiments on standard datasets and an actual remote sensing dataset.

12.
Graph-based semi-supervised classification depends on a well-structured graph. However, it is difficult to construct a graph that faithfully reflects the underlying structure of the data distribution, especially for data with a high-dimensional representation. In this paper, we focus on graph construction and propose a novel method called semi-supervised ensemble classification in subspaces, SSEC for short. Unlike traditional methods that perform graph-based semi-supervised classification in the original space, SSEC performs semi-supervised linear classification in subspaces. More specifically, SSEC first divides the original feature space into several disjoint feature subspaces. Then, it constructs a neighborhood graph in each subspace and trains a semi-supervised linear classifier on this graph, which serves as a base classifier in an ensemble. Finally, SSEC combines the obtained base classifiers into an ensemble classifier using the majority-voting rule. Experimental results on facial image classification show that SSEC not only achieves higher classification accuracy than the competing methods, but is also effective over a wide range of input parameter values.
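SSEC's subspace-ensemble step can be sketched as follows. The graph-based semi-supervised training is replaced here by plain supervised logistic regression on synthetic data (both assumptions made to keep the sketch self-contained):

```python
# Hedged sketch: split the feature space into disjoint subspaces, train one
# linear classifier per subspace, and combine them by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_redundant=3, random_state=0)
subspaces = np.array_split(np.arange(12), 3)   # three disjoint feature subspaces
clfs = [LogisticRegression(max_iter=1000).fit(X[:, s], y) for s in subspaces]

votes = np.sum([c.predict(X[:, s]) for c, s in zip(clfs, subspaces)], axis=0)
pred = (votes >= 2).astype(int)                # majority of the three votes
print(round((pred == y).mean(), 3))
```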

13.
A key element in solving real-life data science problems is selecting the types of models to use. Tree ensemble models (such as XGBoost) are usually recommended for classification and regression problems with tabular data. However, several deep learning models for tabular data have recently been proposed, claiming to outperform XGBoost for some use cases. This paper explores whether these deep models should be a recommended option for tabular data by rigorously comparing them to XGBoost on various datasets. In addition to systematically comparing their performance, we consider the tuning and computation they require. Our study shows that XGBoost outperforms these deep models across the datasets, including the datasets used in the papers that proposed the deep models. We also demonstrate that XGBoost requires much less tuning. On the positive side, we show that an ensemble of deep models and XGBoost performs better on these datasets than XGBoost alone.
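The closing observation, that averaging a tree-boosting model with a neural model beats either alone, can be sketched with scikit-learn stand-ins (GradientBoosting for XGBoost, a small MLP for the deep tabular models; both are assumptions to keep the example dependency-free):

```python
# Hedged sketch: soft-average the predicted probabilities of a boosted-tree
# model and a neural model on tabular data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
gbm = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                    random_state=0).fit(Xtr, ytr)

p = (gbm.predict_proba(Xte)[:, 1] + mlp.predict_proba(Xte)[:, 1]) / 2  # soft average
pred = (p >= 0.5).astype(int)
print(round((pred == yte).mean(), 3))
```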

14.
Classification is the most used supervised machine learning method. As each of the many existing classification algorithms can perform poorly on some data, various attempts have been made to improve the original algorithms by combining them. Some of the best-known results are produced by ensemble methods, like bagging or boosting. We developed a new ensemble method called allocation. The allocation method uses an allocator, an algorithm that separates data instances based on anomaly detection and allocates them to one of several micro classifiers, built with existing classification algorithms on a subset of the training data. The outputs of the micro classifiers are then fused into one final classification. Our goal was to improve on the results of the original classifiers with this new allocation method and to compare its classification results with those of existing ensemble methods. The allocation method was tested on 30 benchmark datasets with six well-known basic classification algorithms (J48, NaiveBayes, IBk, SMO, OneR and NBTree). The obtained results were compared with those of the basic classifiers as well as other ensemble methods (bagging, MultiBoost and AdaBoost). The results show that our allocation method is superior to the basic classifiers and to the tested ensembles in classification accuracy and F-score. The statistical analysis, considering all of the used classification algorithms together, confirmed that the allocation method performs significantly better in both classification accuracy and F-score. Although the differences are not significant for each basic classifier alone, the allocation method achieved the biggest improvements on all six basic classification algorithms. The allocation method thus proved to be a competitive ensemble method for classification that can be used with various classification algorithms and may outperform other ensembles on different types of data.
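The allocator idea can be sketched with an anomaly detector routing each instance to one of two micro classifiers. IsolationForest, a decision tree and naive Bayes are stand-ins for the paper's allocator and Weka-based micro classifiers (all assumptions):

```python
# Hedged sketch: an anomaly-detection "allocator" routes instances; each route
# has its own micro classifier trained only on its slice of the data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
alloc = IsolationForest(random_state=0).fit(X)
route = alloc.predict(X)                      # +1 for inliers, -1 for outliers

inlier_clf = DecisionTreeClassifier(random_state=0).fit(X[route == 1], y[route == 1])
outlier_clf = GaussianNB().fit(X[route == -1], y[route == -1])

# Fuse: each instance gets the prediction of the micro classifier it was routed to.
pred = np.where(route == 1, inlier_clf.predict(X), outlier_clf.predict(X))
print(round((pred == y).mean(), 3))
```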

15.
Traditional oversampling methods are among the effective approaches to the imbalanced data classification problem. SMOTE-based oversampling, however, degrades when the dataset exhibits class overlapping and small disjuncts. To address this, an oversampling algorithm based on the local density of samples, MOLAD, is proposed. Building on it, to solve the imbalanced data classification problem, an algorithm called LADBMOTE is proposed that combines MOLAD with Bagging-based ensemble learning in the sampling stage. LADBMOTE first computes the K nearest neighbours of each minority-class sample according to MOLAD, then samples with all K neighbours to generate K balanced datasets, and finally uses Bagging-based ensemble learning to combine the classifiers trained on the K balanced datasets. On 20 public imbalanced datasets from KEEL, LADBMOTE is compared with seven popular algorithms for handling imbalanced data; the experimental results show that LADBMOTE achieves better classification performance and stronger robustness across different classifiers.  相似文献
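The SMOTE-style interpolation step that LADBMOTE builds on can be sketched in a few lines (a minimal sketch on an invented toy minority class; the paper's local-density weighting is not reproduced):

```python
# Hedged sketch: classic SMOTE interpolation. A synthetic minority sample is
# placed on the segment between a minority point and one of its k nearest
# minority-class neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
minority = rng.normal(loc=2.0, size=(20, 2))            # toy minority class

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)  # +1: a point is its own neighbour
_, idx = nn.kneighbors(minority)

synthetic = []
for i in range(len(minority)):
    j = rng.choice(idx[i][1:])                          # a random true neighbour
    gap = rng.random()                                  # position along the segment
    synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
synthetic = np.array(synthetic)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two minority points, it always lies inside the minority class's bounding box.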

16.
Ensemble learning is widely used to improve classification accuracy, and recent studies show that building ensemble classifiers with a multimodal perturbation strategy can further improve classification performance. This paper proposes an ensemble pruning algorithm based on approximate reducts and optimal sampling (EPA_AO). In EPA_AO, a multimodal perturbation strategy is designed to build diverse individual classifiers; the strategy perturbs the attribute space and the training set simultaneously, increasing the diversity of the individual classifiers. The evidential K-nearest-neighbours (KNN) algorithm is used to train the individual classifiers, and EPA_AO is compared with existing algorithms of the same type on multiple UCI datasets. Experimental results show that EPA_AO is an effective ensemble learning method.  相似文献
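The multimodal perturbation idea, perturbing both the training set and the attribute space, can be sketched as bootstrap sampling combined with random attribute subsets. Plain KNN stands in for the evidential KNN of EPA_AO (an assumption):

```python
# Hedged sketch: each ensemble member gets a bootstrap of the training set AND
# a random attribute subset, then the members vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           random_state=0)

members = []
for _ in range(7):
    rows = rng.choice(len(X), size=len(X), replace=True)   # perturb the training set
    cols = rng.choice(10, size=6, replace=False)           # perturb the attribute space
    clf = KNeighborsClassifier().fit(X[np.ix_(rows, cols)], y[rows])
    members.append((clf, cols))

votes = np.mean([clf.predict(X[:, cols]) for clf, cols in members], axis=0)
pred = (votes >= 0.5).astype(int)
print(round((pred == y).mean(), 3))
```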

17.
As a new form of sustainable development, the concept of "smart cities" has expanded considerably in recent years. It represents an urban model referring to alternative approaches to metropolitan ICTs that enhance the quality and performance of urban services and enable better interaction between citizens and government. However, smart cities are based on a distributed, autonomous information infrastructure containing millions of information sources, with more than 50 billion connected devices expected by 2020 through IoT and similar technologies. In information technology, we often need to process and reason with information coming from various sources (sensors, experts, models), and such information is almost always tainted with various kinds of imperfection: imprecision, uncertainty, ambiguity. We therefore need a theoretical framework general enough to allow the representation, propagation and combination of all kinds of imperfect information; the theory of belief functions is one such framework. Real-time data generated from autonomous and distributed sources can contain all sorts of imperfections in data quality, e.g. imprecision, uncertainty, ignorance and/or incompleteness, and any imperfection in smart city data can adversely affect the performance of urban services and decision making. In this context, this article addresses the problem of imperfection in smart city data. The expected outcomes are (1) to handle imperfection during the process of information retrieval and data integration, and (2) to create an evidential database using evidence theory in order to improve the efficiency of the smart city.
As experimentation, we present a special case of modeling imperfect data in the field of healthcare. An evidential database is built that contains both perfect and imperfect data coming from several heterogeneous sources in a smart city context; the imperfect aspects in the evidential database are expressed using the theory of belief functions presented in this paper.
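The combination operation at the heart of the theory of belief functions, Dempster's rule, can be sketched for two mass functions over a two-element frame (the masses below are invented for illustration):

```python
# Hedged sketch: Dempster's rule of combination. Masses are dicts mapping
# frozenset focal elements to mass values; conflicting mass is renormalized away.
from itertools import product

def combine(m1, m2):
    raw, conflict = {}, 0.0
    for (s1, v1), (s2, v2) in product(m1.items(), m2.items()):
        inter = s1 & s2
        if inter:
            raw[inter] = raw.get(inter, 0.0) + v1 * v2
        else:
            conflict += v1 * v2                 # mass assigned to the empty set
    return {s: v / (1 - conflict) for s, v in raw.items()}

a, b, ab = frozenset("a"), frozenset("b"), frozenset("ab")
m1 = {a: 0.6, ab: 0.4}   # one source: mostly believes "a", some ignorance
m2 = {b: 0.3, ab: 0.7}   # another source: mild belief in "b"
m = combine(m1, m2)
print({"".join(sorted(s)): round(v, 3) for s, v in m.items()})
```

After combination the masses again sum to one, and the stronger evidence for "a" dominates the milder evidence for "b".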

18.
The CCDM 2014 data mining competition posed, based on medical diagnosis data, a multi-label problem and a multi-class classification problem, both of which arise widely in real life. Both problems exhibit class imbalance and offer few training samples. To better accomplish the data mining tasks, drawing on the ideas of secondary learning and ensemble learning, a new learning framework, two-stage ensemble learning, is proposed. The framework first obtains a number of high-confidence samples through an initial round of ensemble learning, adds them to the original training set, and then performs a second round of learning on the new training set, yielding a classifier with better generalization. The competition results show that, compared with commonly used ensemble learning, two-stage ensemble learning achieved very good results on both problems.  相似文献
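The two-stage idea can be sketched as follows: a first ensemble labels extra samples it is confident about, and the enlarged training set trains the final model (a minimal sketch on synthetic data; random forests and the 0.9 confidence threshold are assumptions):

```python
# Hedged sketch: stage 1 labels high-confidence pool samples; stage 2 retrains
# on the original labels plus those pseudo-labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, random_state=0)
Xtr, ytr, Xpool = X[:100], y[:100], X[100:500]   # small labeled set + unlabeled pool

stage1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
proba = stage1.predict_proba(Xpool)
confident = proba.max(axis=1) >= 0.9             # keep only high-confidence samples

X2 = np.vstack([Xtr, Xpool[confident]])
y2 = np.concatenate([ytr, proba.argmax(axis=1)[confident]])
stage2 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X2, y2)
print(round((stage2.predict(X[500:]) == y[500:]).mean(), 3))
```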

19.
In this article, we present a semisupervised support vector machine that uses a self-training approach. We then construct an ensemble of semisupervised SVM classifiers to address the problem of pixel classification of remote sensing images. Semisupervised support vector machines (S3VMs) are based on applying the margin-maximization principle to both labeled and unlabeled samples. The ensemble of SVM classifiers exploits the conceptual similarity between component classifiers from the same data source. The effectiveness of the proposed technique is first demonstrated on two numeric remote sensing datasets described in terms of feature vectors, and then by identifying different land cover regions in remote sensing imagery. Experimental results on these datasets show that this learning scheme can increase the accuracy level. The performance of the ensemble is compared with that of its component classifiers and a conventional SVM in terms of accuracy and quantitative cluster validity indices.
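The self-training scheme the abstract describes can be sketched with scikit-learn's `SelfTrainingClassifier` wrapping an SVM (a generic sketch on synthetic data, not the paper's S3VM formulation):

```python
# Hedged sketch: self-training with an SVM. Unlabeled samples are marked -1;
# the wrapper iteratively pseudo-labels the ones the SVM is confident about.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
y_semi = y.copy()
y_semi[50:] = -1                     # only the first 50 labels are visible

model = SelfTrainingClassifier(SVC(probability=True, random_state=0)).fit(X, y_semi)
print(round((model.predict(X[50:]) == y[50:]).mean(), 3))
```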

20.
Cao Peng, Li Bo, Li Wei, Zhao Dazhe. Journal of Computer Applications, 2013, 33(2): 550-553
To address low classification accuracy and degraded efficiency on large-scale data, an adaptive random subspace ensemble classification algorithm combined with X-means clustering is proposed. First, X-means clustering automatically decomposes a complex data space into multiple sample subspaces for divide-and-conquer learning while preserving the original data structure. The adaptive random subspace ensemble classifier then increases the diversity of the base classifiers and automatically determines their number, improving the robustness and classification accuracy of the ensemble. The algorithm was tested on artificial and UCI datasets and compared with traditional single-classifier and ensemble classification algorithms. Experimental results show that, on large-scale datasets, the method achieves better classification accuracy and robustness and improves the overall efficiency of the algorithm.  相似文献
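The cluster-then-classify decomposition can be sketched as below. Plain KMeans with a fixed k stands in for X-means, which chooses k automatically (an assumption, since scikit-learn does not ship X-means):

```python
# Hedged sketch: partition the data by clustering, train one classifier per
# cluster, and route each sample to its cluster's classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

clfs = {c: DecisionTreeClassifier(random_state=0).fit(X[km.labels_ == c],
                                                      y[km.labels_ == c])
        for c in range(3)}

cluster = km.predict(X)                       # route each sample to its cluster
pred = np.array([clfs[c].predict(x.reshape(1, -1))[0]
                 for c, x in zip(cluster, X)])
print(round((pred == y).mean(), 3))
```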


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号