Similar Literature
1.
In supervised classification, we often encounter real-world problems in which the data are not equitably distributed among the classes of the problem. In such cases, we are dealing with so-called imbalanced data sets. One of the most widely used techniques to deal with this problem is to preprocess the data prior to the learning process. This paper proposes a method belonging to the family of nested generalized exemplars that accomplishes learning by storing objects in Euclidean n-space. New data are classified by computing their distance to the nearest generalized exemplar. The method is optimized by selecting the most suitable generalized exemplars with evolutionary algorithms. An experimental analysis is carried out over a wide range of highly imbalanced data sets, using the statistical tests suggested in the specialized literature. The results show that our evolutionary proposal outperforms other classic and recent models in accuracy while storing fewer generalized exemplars.
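For illustration, a minimal sketch of the nearest-generalized-exemplar rule described above, assuming exemplars are axis-aligned hyperrectangles in Euclidean n-space; the toy rectangles and labels are hypothetical, and the evolutionary selection step is omitted:

```python
import numpy as np

def rect_distance(x, lower, upper):
    """Euclidean distance from point x to an axis-aligned hyperrectangle
    [lower, upper]; zero if x lies inside the rectangle."""
    gap = np.maximum(lower - x, 0) + np.maximum(x - upper, 0)
    return np.linalg.norm(gap)

def classify(x, exemplars):
    """Assign x the class of the nearest generalized exemplar.
    `exemplars` is a list of (lower, upper, label) triples."""
    return min(exemplars, key=lambda e: rect_distance(x, e[0], e[1]))[2]

# Hypothetical toy exemplars: two rectangles in 2-D space.
exemplars = [(np.array([0, 0]), np.array([1, 1]), "minority"),
             (np.array([2, 2]), np.array([4, 4]), "majority")]
print(classify(np.array([0.5, 1.5]), exemplars))  # -> "minority"
```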

2.
A program has been developed which derives classification rules from empirical observations and expresses these rules in a knowledge representation format called 'counting criteria'. Decision rules derived in this format are often more comprehensible than rules derived by existing machine learning programs such as AQ11. Use of the program is illustrated by the inference of discrimination criteria for certain types of bacteria based upon their biochemical characteristics. The program may be useful for the conceptual analysis of data and for the automatic generation of prototype knowledge bases for expert systems.
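'Counting criteria' are commonly read as m-of-n rules (fire when at least m of n conditions hold). A minimal sketch under that assumption; the test names are hypothetical, not from the paper:

```python
def counting_rule(findings, criteria, threshold):
    """Fire the rule when at least `threshold` of the listed criteria hold."""
    return sum(findings.get(c, False) for c in criteria) >= threshold

# Hypothetical discrimination rule: at least 2 of 3 biochemical tests positive.
findings = {"oxidase_positive": True, "lactose_fermenting": False, "motile": True}
print(counting_rule(findings,
                    ["oxidase_positive", "lactose_fermenting", "motile"], 2))  # True
```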

3.
霍纬纲, 高小霞. Control and Decision (《控制与决策》), 2012, 27(12): 1833-1838
A fuzzy associative classification method for multi-class imbalanced data is proposed. The method takes as its genetic-optimization objectives the minimization of the weighted classification error of the training samples across the AdaBoost.M1W ensemble-learning iterations, the number of fuzzy associative classification rules in each sub-classifier, and the number of fuzzy terms contained in those rules, thereby achieving a tight integration of AdaBoost.M1W with the fuzzy associative classification modeling process. Comparative experiments on five multi-class imbalanced UCI benchmark data sets, against existing data-preprocessing methods for imbalanced classification, show that the proposed method significantly improves the classification performance of fuzzy associative classification models under multi-class imbalance.
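A sketch of the kind of multi-term genetic fitness described above, combining the weighted boosting error with rule and fuzzy-term counts; the weighting coefficients are illustrative assumptions, not values from the paper:

```python
def fitness(weighted_error, n_rules, n_fuzzy_terms, w1=1.0, w2=0.01, w3=0.001):
    """Lower is better: trade off a boosting round's weighted classification
    error against the complexity of the fuzzy associative sub-classifier.
    The coefficients w1..w3 are illustrative, not from the paper."""
    return w1 * weighted_error + w2 * n_rules + w3 * n_fuzzy_terms
```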

4.
ReliefF has proved to be a successful feature selector, but it is computationally expensive on large data sets. We present an optimization based on Supervised Model Construction that improves starter selection. Effectiveness has been evaluated using 12 UCI data sets and a clinical diabetes database. Experiments indicate that, compared with ReliefF, the proposed method improves computational efficiency while maintaining classification accuracy. On the clinical data set (20,000 records with 47 features), feature selection via Supervised Model Construction (FSSMC) reduced processing time by 80% compared to ReliefF and maintained accuracy for the Naive Bayes, IB1 and C4.5 classifiers.
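For reference, a simplified two-class Relief-style weight update (the full ReliefF uses k nearest hits and misses per class and multi-class weighting); illustrative only:

```python
import numpy as np

def relief_weights(X, y, n_samples=100, rng=np.random.default_rng(0)):
    """Simplified two-class Relief: reward features that differ on the
    nearest miss and penalize those that differ on the nearest hit."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)
        dists[i] = np.inf                      # exclude the instance itself
        same, diff = y == y[i], y != y[i]
        hit = np.where(same)[0][np.argmin(dists[same])]
        miss = np.where(diff)[0][np.argmin(dists[diff])]
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n_samples
    return w
```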

5.
New Developments in Classification Methods: A Survey
Classification is one of the most important tasks in data mining and is widely studied in related fields such as machine learning, pattern recognition, and artificial intelligence. It has broad practical applications, including medical diagnosis, credit scoring, and shopping recommendation. In recent years, with the continuous emergence of new techniques in these fields, classification methods have also seen new developments. This paper surveys these developments in detail and summarizes the trends in the evolution of classification methods.

6.
We propose a data mining-constraint satisfaction optimization problem (DM-CSOP) in which the objective is to maximize the number of correct classifications at the lowest possible information-acquisition cost. We show that the problem can be formulated as a series of binary-variable knapsack optimization problems, which are solved sequentially. We propose a heuristic hybrid simulated annealing and gradient-descent artificial neural network (ANN) procedure to solve the DM-CSOP. Using a real-world heart disease data set, we show that the proposed hybrid procedure provides a low-cost, high-quality solution compared to a traditional ANN classification approach.
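Each stage of such a formulation is a 0/1 knapsack; a standard dynamic-programming sketch for one stage, with hypothetical attribute values (expected accuracy gains) and acquisition costs:

```python
def knapsack(values, costs, budget):
    """0/1 knapsack by dynamic programming: maximize total value
    (e.g., expected correct classifications) under an information-cost budget."""
    best = [0] * (budget + 1)
    for v, c in zip(values, costs):
        for b in range(budget, c - 1, -1):   # iterate backwards: each item used once
            best[b] = max(best[b], best[b - c] + v)
    return best[budget]

# Hypothetical attribute values (accuracy gain) and acquisition costs.
print(knapsack(values=[10, 7, 5], costs=[4, 3, 2], budget=5))  # -> 12
```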

7.
An English TTS (text-to-speech) system must synthesize speech from the pronunciation of every word in the text. Since no dictionary, however large, can cover every word encountered in real text, an algorithm is needed to predict the pronunciation of out-of-vocabulary words. This paper describes an instance-based learning approach and evaluates it on a large-scale English dictionary. The results show a word-level pronunciation accuracy of 70.1%, significantly exceeding other automatic prediction methods reported previously.
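A toy instance-based grapheme-to-phoneme sketch in the spirit of this approach: predict each letter's phoneme from a fixed-width letter window with a nearest-neighbor learner. The tiny aligned lexicon is hypothetical and scikit-learn is assumed; the paper's actual method may differ:

```python
from sklearn.neighbors import KNeighborsClassifier

def windows(word, width=3):
    """One fixed-width letter window per letter, padded with '_'."""
    pad = "_" * (width // 2)
    w = pad + word + pad
    return [list(w[i:i + width]) for i in range(len(word))]

def encode(wins):
    return [[ord(c) for c in win] for win in wins]

# Hypothetical aligned lexicon: one phoneme symbol per letter.
lexicon = {"cat": "k@t", "can": "k@n", "cot": "kQt"}
X = [w for word in lexicon for w in encode(windows(word))]
y = [p for word in lexicon for p in lexicon[word]]
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("".join(knn.predict(encode(windows("cut")))))  # nearest-neighbor guess
```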

8.
Patient no-shows have significant adverse effects on healthcare systems. Therefore, predicting patients' no-shows is necessary to use their appointment slots effectively. In the literature, filter feature selection methods have been prominently used for patient no-show prediction. However, filter methods are less effective than wrapper methods. This paper presents new wrapper methods based on three variants of the proposed algorithm, Opposition-based Self-Adaptive Cohort Intelligence (OSACI). The three variants, referred to here as OSACI-Init, OSACI-Update, and OSACI-Init_Update, are formed by integrating Self-Adaptive Cohort Intelligence (SACI) with three Opposition-based Learning (OBL) strategies, namely OBL initialization, OBL update, and OBL initialization and update, respectively. The performance of the proposed algorithms was examined and compared with that of Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Differential Evolution (DE), and SACI in terms of AUC, sensitivity, specificity, dimensionality reduction, and convergence speed. Patient no-show data from a primary care clinic in upstate New York were used in the numerical experiments. The results showed that the proposed algorithms outperformed the other compared algorithms, achieving higher dimensionality reduction and better convergence speed while attaining comparable AUC, sensitivity, and specificity scores.
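The OBL strategies rest on the standard opposite point x̄ = lb + ub − x. A minimal sketch of opposition-based initialization, with a placeholder population size and fitness function; the SACI update machinery is omitted:

```python
import numpy as np

def obl_initialize(pop_size, lb, ub, fitness, rng=np.random.default_rng(0)):
    """Opposition-based initialization: evaluate each random candidate and
    its opposite lb + ub - x, then keep the fitter half of the union."""
    X = rng.uniform(lb, ub, size=(pop_size, len(lb)))
    union = np.vstack([X, lb + ub - X])          # candidates plus opposites
    scores = np.apply_along_axis(fitness, 1, union)
    return union[np.argsort(scores)[:pop_size]]  # keep best (minimization)

lb, ub = np.zeros(4), np.ones(4)
pop = obl_initialize(6, lb, ub, fitness=lambda x: np.sum(x**2))
```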

9.
Traditional pattern recognition generally involves two tasks: unsupervised clustering and supervised classification. When class information is available, fusing the advantages of both clustering learning and classification learning into a single framework is an important problem worthy of study. To date, most algorithms treat clustering learning and classification learning in a sequential or two-step manner, i.e., first execute clustering learning to explore structures in the data, and then perform classification learning on top of the obtained structural information. However, such sequential algorithms cannot always guarantee simultaneous optimality for both clustering and classification learning. In fact, the clustering learning in these algorithms merely aids the subsequent classification learning and does not benefit from the latter. To overcome this problem, a simultaneous learning framework for clustering and classification (SCC) is presented in this paper. SCC aims to achieve three goals: (1) acquiring robust classification and clustering simultaneously; (2) designing an effective and transparent classification mechanism; (3) revealing the underlying relationship between clusters and classes. To this end, using Bayesian theory and the cluster posterior probabilities of classes, we define a single objective function into which the clustering process is directly embedded. By optimizing this objective function, effective and robust clustering and classification results are achieved simultaneously. Experimental results on both synthetic and real-life data sets show that SCC achieves promising classification and clustering results at the same time.
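The cluster-posterior route from clusters to classes can be written as p(c|x) = Σ_k p(k|x) p(c|k). A small sketch with hypothetical membership and class-distribution values:

```python
import numpy as np

def class_posterior(p_cluster_given_x, p_class_given_cluster):
    """p(c|x) = sum_k p(k|x) * p(c|k): route a point's cluster memberships
    through each cluster's class distribution."""
    return p_cluster_given_x @ p_class_given_cluster

# Hypothetical: 3 clusters, 2 classes.
p_k_x = np.array([0.7, 0.2, 0.1])                  # cluster memberships of x
p_c_k = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
print(class_posterior(p_k_x, p_c_k).argmax())      # predicted class index -> 0
```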

10.
A scalable, incremental learning algorithm for classification problems
In this paper a novel data mining algorithm, Clustering and Classification Algorithm-Supervised (CCA-S), is introduced. CCA-S enables the scalable, incremental learning of a non-hierarchical cluster structure from training data. This cluster structure serves as a function that maps the attribute values of new data to their target class, that is, classifies new data. CCA-S uses both the distance and the target class of training data points to derive the cluster structure. We first present the problems that many existing data mining algorithms for classification, such as decision trees and artificial neural networks, face in scalable and incremental learning. We then describe CCA-S and discuss its advantages in scalable, incremental learning. Testing results of applying CCA-S to several common classification data sets are presented; they show that the classification performance of CCA-S is comparable to other data mining algorithms such as decision trees, artificial neural networks and discriminant analysis.
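Not the published CCA-S, but a minimal illustration of its central idea: grow clusters incrementally using both distance and the target class, then classify by the nearest cluster. The radius threshold is an assumption:

```python
import numpy as np

def incremental_fit(X, y, radius=1.0):
    """Assign each point to the nearest existing cluster of its own class
    within `radius`; otherwise open a new cluster. One pass, incremental."""
    clusters = []  # entries: [centroid, count, label]
    for x, label in zip(X, y):
        same = [c for c in clusters if c[2] == label
                and np.linalg.norm(c[0] - x) <= radius]
        if same:
            c = min(same, key=lambda c: np.linalg.norm(c[0] - x))
            c[0][:] = (c[0] * c[1] + x) / (c[1] + 1)   # update running centroid
            c[1] += 1
        else:
            clusters.append([x.astype(float), 1, label])
    return clusters

def predict(clusters, x):
    """Classify by the label of the nearest cluster centroid."""
    return min(clusters, key=lambda c: np.linalg.norm(c[0] - x))[2]
```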

11.

Background

Biclustering, one of the emerging techniques for analyzing DNA microarray data, searches for subsets of genes and conditions that are coherently expressed. These subgroups provide clues about the main biological processes. Different approaches to this problem have been proposed to date. Most of them use the mean squared residue as the quality measure, but it cannot detect relevant and interesting patterns such as shifting or scaling patterns. Furthermore, recent papers show that new coherence patterns are involved in different kinds of cancer and tumors, such as inverse relationships between genes, which it also fails to capture.

Results

The proposed measure, called the Spearman's biclustering measure (SBM), estimates the quality of a bicluster based on the non-linear correlation among genes and conditions simultaneously. The search for biclusters is performed with an evolutionary technique called estimation of distribution algorithms, which uses SBM as its fitness function. This approach has been examined from different points of view using artificial and real microarrays. The assessment process involved quality indexes, a set of reference bicluster patterns including new patterns, and a set of statistical tests. Performance was also examined on real microarrays, comparing against different algorithmic approaches such as Bimax, CC, OPSM, Plaid and xMotifs.

Conclusions

SBM shows several advantages, such as the ability to recognize more complex coherence patterns (shifting, scaling and inversion) and the capability to selectively marginalize genes and conditions depending on their statistical significance.
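A sketch of a Spearman-based bicluster quality in the spirit of SBM: the mean absolute Spearman correlation over gene pairs (the actual SBM also scores conditions and handles statistical significance); SciPy is assumed:

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def spearman_quality(bicluster):
    """Mean |Spearman rho| over all gene (row) pairs of a bicluster;
    high values indicate coherent, possibly non-linear, co-expression."""
    rows = list(bicluster)
    rhos = [abs(spearmanr(a, b)[0]) for a, b in combinations(rows, 2)]
    return float(np.mean(rhos))

# Shifted, scaled and inverted profiles all score |rho| = 1.
genes = np.array([[1, 2, 3, 4], [2, 4, 6, 8], [5, 4, 3, 2]])
print(spearman_quality(genes))  # -> 1.0
```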

12.
The political apportionment problem has been studied for more than 200 years. In this paper, we introduce a generalized parametric divisor method (GPDM), which includes most of the classical and widely used apportionment methods from the literature as special cases. Moreover, it allows very flexible interpolation between previous methods by appropriately setting two parameters of the GPDM. We identify an inequity measure that the GPDM globally optimizes. We also identify two natural inequity measures for which an apportionment given by the GPDM is locally optimal. These results generalize similar results for classical apportionment methods and justify the use of a large class of new apportionment methods given by the GPDM. From this class, we identify and recommend specific new methods. Our numerical experiments compare the apportionments given by the new methods with those given by existing methods, using real data for the United States, Germany, Canada, Australia, England, and Japan. The explicit definition of the GPDM enabled us to perform computational experiments evaluating its unbiasedness with two standard measures while comparing with other traditional methods. Based on our generalization technique and numerical experiments, we show that the GPDM, with appropriate parameter values, outperforms all the traditional apportionment methods. We therefore conclude that the GPDM is the most "unbiased" and fairest method when its parameters can be agreed ex ante, and that it is applicable to actual electoral voting systems.
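The GPDM's two-parameter divisor is not reproduced here, but the classical highest-averages methods it generalizes follow one pattern; a sketch with a one-parameter divisor d(n) = n + t, where t = 0.5 gives Webster/Sainte-Laguë and t = 1 gives Jefferson/D'Hondt:

```python
import heapq

def divisor_apportion(populations, seats, t=0.5):
    """Highest-averages apportionment with divisor d(n) = n + t:
    repeatedly give the next seat to the state with the largest
    population / (seats_already_held + t) quotient."""
    alloc = [0] * len(populations)
    heap = [(-p / t, i) for i, p in enumerate(populations)]
    heapq.heapify(heap)
    for _ in range(seats):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        heapq.heappush(heap, (-populations[i] / (alloc[i] + t), i))
    return alloc

print(divisor_apportion([53000, 24000, 23000], seats=10, t=0.5))  # -> [6, 2, 2]
```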

13.
In the last few years, there have been several revolutions in the field of deep learning, mainly headlined by the large impact of Generative Adversarial Networks (GANs). GANs not only provide a unique architecture when defining their models, but also generate impressive results which have had a direct impact on society. Because of the significant improvements and new areas of research that GANs have opened up, new work appears so constantly that it is almost impossible to keep up with the state of the art. Our survey aims to provide a general overview of GANs, showing the latest architectures, optimizations of the loss functions, validation metrics and application areas of the most widely recognized variants. The efficiency of the different variants of the model architecture is evaluated, along with their best application areas; as a vital part of the process, the different metrics for evaluating the performance of GANs and the frequently used loss functions are analyzed. The final objective of this survey is to summarize the evolution and performance of the GAN variants that achieve the best results, in order to guide future researchers in the field.

14.
We present a machine learning tool for automatic texton-based joint classification and segmentation of mitochondria in MNT-1 cells imaged using ion-abrasion scanning electron microscopy (IA-SEM). For diagnosing signatures that may be unique to cellular states such as cancer, automatic tools with minimal user intervention need to be developed for analysis and mining of high-throughput data from these large-volume data sets. Challenges for such a tool in 3D electron microscopy arise from the low contrast and signal-to-noise ratios (SNR) inherent to biological imaging. Our approach is based on block-wise classification of images into a trained list of regions. Given manually labeled images, our goal is to learn models that can localize novel instances of the regions in test data sets. Since data sets obtained using electron microscopes are intrinsically noisy, we improve the SNR for automatic segmentation by applying a 2D texture-preserving filter to each slice of the 3D data set. We investigate texton-based region features in this work. Classification is performed by a k-nearest neighbor (k-NN) classifier, support vector machines (SVMs), adaptive boosting (AdaBoost) and histogram matching with a NN classifier. In addition, we study the computational complexity vs. segmentation accuracy trade-off of these classifiers. Segmentation results demonstrate that our approach, using minimal training data, performs close to semi-automatic methods based on the variational level-set method and to manual segmentation carried out by an experienced user. Using our method, which we show to require minimal user intervention while achieving high classification accuracy, we investigate quantitative parameters such as the volume of the cytoplasm occupied by mitochondria, the difference between the surface areas of the inner and outer membranes, and the mean mitochondrial width, quantities potentially relevant to distinguishing cancer cells from normal cells. To test the accuracy of our approach, these quantities are compared against manually computed counterparts. We also demonstrate the extension of these methods to segmenting 3D images obtained using electron tomography.
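A minimal block-wise classification sketch in the spirit of this pipeline: crude per-block texture statistics fed to a k-NN classifier. The real system uses texton features and a texture-preserving filter; scikit-learn is assumed and the usage line is a placeholder:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def block_features(img, size=8):
    """Split a 2-D slice into size x size blocks and describe each block
    with crude texture statistics (mean, std, mean gradient magnitude)."""
    feats = []
    for r in range(0, img.shape[0] - size + 1, size):
        for c in range(0, img.shape[1] - size + 1, size):
            b = img[r:r + size, c:c + size].astype(float)
            gy, gx = np.gradient(b)
            feats.append([b.mean(), b.std(), np.hypot(gx, gy).mean()])
    return np.array(feats)

# Hypothetical usage: labels would come from manually annotated slices.
# clf = KNeighborsClassifier(n_neighbors=3).fit(block_features(slice_), labels)
```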

15.
A local boosting algorithm for solving classification problems
Based on the boosting-by-resampling version of AdaBoost, a local boosting algorithm for classification tasks is proposed in this paper. Its main idea is that, in each iteration, a local error is calculated for every training instance, and a function of this local error is used to update the probability that the instance is selected for the next classifier's training set. When classifying a novel instance, the similarity information between it and each training instance is taken into account. Meanwhile, a parameter is introduced into the process of updating the probabilities assigned to training instances so that the algorithm can be more accurate than AdaBoost. Experimental results on synthetic and several benchmark real-world data sets from the UCI repository show that the proposed method improves the prediction accuracy and robustness to classification noise of AdaBoost. Furthermore, the diversity-accuracy patterns of the ensemble classifiers are investigated with kappa-error diagrams.
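A heavily simplified sketch of the local reweighting idea: drive each instance's sampling weight by the error among its k nearest neighbors rather than the global error. The update function and parameter are assumptions, not the paper's exact formulas:

```python
import numpy as np

def local_errors(X, mispredicted, k=5):
    """For each training instance, the fraction of its k nearest
    neighbors the current classifier got wrong (a 'local error')."""
    n = len(X)
    errs = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the instance itself
        errs[i] = mispredicted[nn].mean()
    return errs

def update_weights(w, errs, beta=1.0):
    """Upweight instances in locally hard regions, then renormalize."""
    w = w * np.exp(beta * errs)
    return w / w.sum()
```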

16.

Objectives

In Taiwan, classifying children's real problems into appropriate occupational therapy categories is a difficult job for the therapist. The complexity of the 127 attribute values to be evaluated in the assessment, possibly misleading diagnoses made by the pediatrician, and the shortage of manpower cause a high workload for the therapist. The design of an easy-to-use and effective classification model is therefore an important issue in children's occupational therapy treatment. This study accordingly applies artificial neural network (ANN) and classification and regression tree (CART) techniques to build an intelligent classification model, providing a comprehensive framework that helps the therapist improve accuracy when categorizing children's problems for occupational therapy. These categories, with critical attributes under the guidelines of the American Occupational Therapy Association (AOTA), are discussed in order to assist the therapist in precise assessment and appropriate treatment. To the best of our knowledge, no research has yet been conducted on the characteristics of problems in children's occupational therapy.

Methods

Based on the advice and assistance of the therapists and the occupational therapy treatment needed, 127 outpatients from a regional hospital in Taiwan between 2007 and 2010 were selected as the data set for classifying children's occupational therapy problems. This study accordingly suggests an intelligent classification model that integrates ANN and CART. The major steps in applying the model are: (1) training a high-performance ANN model; and (2) applying CART to the ANN model trained in the previous step to extract the critical attributes of children's occupational problems.

Results

The results showed that the artificial neural network achieved an accuracy of up to 84% with evenly distributed data sets. Rules were then extracted from the high-performing trained neural network using the classification tree approach of the CART application. Most importantly, this study indicated that some of the rules can correctly identify up to 67% of the children's problems with 100% confidence, which is much better than the evaluations currently in use. Moreover, a tree with a binary age variable and eight predictors was found: gross coordination, upper-left muscle tone, interpersonal skill, proprioceptive and vestibular processing, visual processing, visual stimulus input influencing emotion and movement, swallowing, and dressing. Actual implementation showed that the intelligent classification model is capable of integrating ANN and CART techniques to clarify children's occupational therapy problems with considerable accuracy.

Conclusions

The model could be employed as a supporting system for making decisions regarding the classification and treatment of children's occupational therapy problems. The rules extracted from CART were helpful to therapists in determining which category a child's real problems belonged to. We expect machine learning techniques to play an essential role in future children's occupational therapy applications.
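A compact sketch of the two-step pipeline described above: train an ANN, then fit a CART tree to the ANN's predictions to extract readable rules. scikit-learn is assumed and the data shapes below are hypothetical:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(127, 10))       # hypothetical assessment attributes
y = rng.integers(0, 3, size=127)     # hypothetical problem categories

# Step 1: train a high-performance ANN model.
ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)

# Step 2: apply CART to the trained ANN's outputs to extract rules.
cart = DecisionTreeClassifier(max_depth=3).fit(X, ann.predict(X))
print(export_text(cart))             # human-readable decision rules
```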

17.
Classification of data with an imbalanced class distribution poses a significant challenge to the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. The significant difficulty and frequent occurrence of the class imbalance problem indicate the need for extra research effort. The objective of this paper is to investigate meta-techniques applicable to most classifier learning algorithms, with the aim of advancing the classification of imbalanced data. The AdaBoost algorithm is reported to be a successful meta-technique for improving classification accuracy. The insight gained from a comprehensive analysis of the AdaBoost algorithm's advantages and shortcomings in tackling the class imbalance problem leads to the exploration of three cost-sensitive boosting algorithms, developed by introducing cost items into the learning framework of AdaBoost. Further analysis shows that one of the proposed algorithms coincides with stagewise additive modelling in statistics, minimizing the cost-weighted exponential loss. These boosting algorithms are also studied with respect to their weighting strategies for different types of samples and their effectiveness in identifying rare cases, through experiments on several real-world medical data sets where the class imbalance problem prevails.
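One common way cost items enter the AdaBoost weight update (in the spirit of AdaC2-style schemes); a sketch, not the paper's exact algorithms:

```python
import numpy as np

def cost_sensitive_update(w, correct, cost, alpha):
    """Weight update with a per-instance cost item: costly mistakes
    (e.g., on the minority class) are upweighted more aggressively."""
    w = w * np.exp(alpha * cost * (~correct) - alpha * cost * correct)
    return w / w.sum()

w = np.full(4, 0.25)
correct = np.array([True, False, True, False])
cost = np.array([1.0, 2.0, 1.0, 1.0])   # hypothetical: instance 2 is rare-class
print(cost_sensitive_update(w, correct, cost, alpha=0.5))
```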

18.
19.
Because credit card fraud costs the banking sector billions of dollars every year, reducing the losses incurred from credit card fraud is an important driver for the sector and for end-users. In this paper, we focus on analyzing cardholder spending behavior and propose a novel cardholder behavior model for detecting credit card fraud, called the Cardholder Behavior Model (CBM). Two focus points are proposed and evaluated for CBMs. The first is building the behavior model using single-card transactions versus multi-card transactions. As the second, we introduce holiday seasons as spending periods that differ from the rest of the year. The CBM is fine-tuned using a real credit card transaction data set from a leading bank in Turkey, and credit card fraud detection accuracy is evaluated with respect to these two focus points.
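A toy illustration of period-specific spending profiles: flag a transaction whose amount deviates strongly from the cardholder's history for the matching period (e.g., holiday season). The threshold and fields are hypothetical, not the paper's model:

```python
import numpy as np

def is_suspicious(amount, history, z_threshold=3.0):
    """Flag a transaction that deviates from the cardholder's profile
    for the matching spending period by more than z_threshold sigmas."""
    mu, sigma = np.mean(history), np.std(history) + 1e-9
    return abs(amount - mu) / sigma > z_threshold

holiday_history = [120.0, 150.0, 135.0, 160.0]   # hypothetical past amounts
print(is_suspicious(900.0, holiday_history))      # -> True
```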

20.
Another method for constructing generalized compromise operators is given, namely a construction based on the ordinal sum of totally ordered semigroups. An example shows that if one of the preconditions of the construction fails, the conclusion no longer holds. An erroneous conclusion from earlier work is thereby corrected.
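For reference, the standard ordinal-sum construction on [0, 1] that such results build on, stated here for t-norms; the paper's totally-ordered-semigroup setting generalizes this:

```latex
% Ordinal sum of a family of t-norms (T_k) on subintervals [a_k, b_k] of [0,1]:
T(x,y) =
\begin{cases}
a_k + (b_k - a_k)\, T_k\!\left(\dfrac{x - a_k}{b_k - a_k},\, \dfrac{y - a_k}{b_k - a_k}\right)
  & \text{if } (x,y) \in [a_k, b_k]^2,\\[6pt]
\min(x, y) & \text{otherwise.}
\end{cases}
```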
