期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

SMOTE-RSB *: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory

Enislay Ramentol Yailé Caballero Rafael Bello Francisco Herrera 《Knowledge and Information Systems》2012,33(2):245-265

Imbalanced data is a common problem in classification. This phenomenon is growing in importance since it appears in most real domains. It has special relevance to highly imbalanced data-sets (when the ratio between classes is high). Many techniques have been developed to tackle the problem of imbalanced training sets in supervised learning. Such techniques have been divided into two large groups: those at the algorithm level and those at the data level. Data level groups that have been emphasized are those that try to balance the training sets by reducing the larger class through the elimination of samples or increasing the smaller one by constructing new samples, known as undersampling and oversampling, respectively. This paper proposes a new hybrid method for preprocessing imbalanced data-sets through the construction of new samples, using the Synthetic Minority Oversampling Technique together with the application of an editing technique based on the Rough Set Theory and the lower approximation of a subset. The proposed method has been validated by an experimental study showing good results using C4.5 as the learning algorithm. 相似文献

2.

EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling

Mikel Galar Alberto Fernández Edurne Barrenechea Francisco Herrera 《Pattern recognition》2013,46(12):3460-3471

Classification with imbalanced data-sets has become one of the most challenging problems in Data Mining. Being one class much more represented than the other produces undesirable effects in both the learning and classification processes, mainly regarding the minority class. Such a problem needs accurate tools to be undertaken; lately, ensembles of classifiers have emerged as a possible solution. Among ensemble proposals, the combination of Bagging and Boosting with preprocessing techniques has proved its ability to enhance the classification of the minority class.In this paper, we develop a new ensemble construction algorithm (EUSBoost) based on RUSBoost, one of the simplest and most accurate ensemble, which combines random undersampling with Boosting algorithm. Our methodology aims to improve the existing proposals enhancing the performance of the base classifiers by the usage of the evolutionary undersampling approach. Besides, we promote diversity favoring the usage of different subsets of majority class instances to train each base classifier. Centered on two-class highly imbalanced problems, we will prove, supported by the proper statistical analysis, that EUSBoost is able to outperform the state-of-the-art methods based on ensembles. We will also analyze its advantages using kappa-error diagrams, which we adapt to the imbalanced scenario. 相似文献

3.

面向信贷不平衡数据的高斯混合欠采样算法

韩旭贾宁朱宁《计算机工程与设计》2020,41(1):65-70

为提高分类算法在信贷风险领域不平衡数据的预测性能,提出一种基于高斯混合模型(Gaussian mixture model,GMM)的欠采样算法,将其应用在信贷不平衡数据领域中。采用高斯混合模型对多数类样本进行聚类欠采样(under-sampling),消除样本间的不平衡问题。实验比较该算法与传统的欠采样方法,进行该算法的抗噪鲁棒性分析,实验结果表明,该算法能够有效提升分类器的性能,其对信贷数据集具有较强的鲁棒性。相似文献

4.

Boosting imbalanced data learning with Wiener process oversampling

Qian Li Gang Li Wenjia Niu Yanan Cao Liang Chang Jianlong Tan Li Guo 《Frontiers of Computer Science》2017,11(5):836-851

Learning from imbalanced data is a challenging task in a wide range of applications, which attracts significant research efforts from machine learning and data mining community. As a natural approach to this issue, oversampling balances the training samples through replicating existing samples or synthesizing new samples. In general, synthesization outperforms replication by supplying additional information on the minority class. However, the additional information needs to follow the same normal distribution of the training set, which further constrains the new samples within the predefined range of training set. In this paper, we present the Wiener process oversampling (WPO) technique that brings the physics phenomena into sample synthesization. WPO constructs a robust decision region by expanding the attribute ranges in training set while keeping the same normal distribution. The satisfactory performance of WPO can be achieved with much lower computing complexity. In addition, by integrating WPO with ensemble learning, the WPOBoost algorithm outperformsmany prevalent imbalance learning solutions. 相似文献

5.

LoRAS: an oversampling approach for imbalanced datasets

Bej Saptarshi Davtyan Narek Wolfien Markus Nassar Mariam Wolkenhauer Olaf 《Machine Learning》2021,110(2):279-301

Machine Learning - The Synthetic Minority Oversampling TEchnique (SMOTE) is widely-used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority... 相似文献

6.

Evolutionary rule-based systems for imbalanced data sets 总被引：1，自引：1，他引：1

Albert Orriols-Puig Ester Bernadó-Mansilla 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2009,13(3):213-225

This paper investigates the capabilities of evolutionary on-line rule-based systems, also called learning classifier systems (LCSs), for extracting knowledge from imbalanced data. While some learners may suffer from class imbalances and instances sparsely distributed around the feature space, we show that LCSs are flexible methods that can be adapted to detect such cases and find suitable models. Results on artificial data sets specifically designed for testing the capabilities of LCSs in imbalanced data show that LCSs are able to extract knowledge from highly imbalanced domains. When LCSs are used with real-world problems, they demonstrate to be one of the most robust methods compared with instance-based learners, decision trees, and support vector machines. Moreover, all the learners benefit from re-sampling techniques. Although there is not a re-sampling technique that performs best in all data sets and for all learners, those based in over-sampling seem to perform better on average. The paper adapts and analyzes LCSs for challenging imbalanced data sets and establishes the bases for further studying the combination of re-sampling technique and learner best suited to a specific kind of problem. 相似文献

7.

Boosting support vector machines for imbalanced data sets

Benjamin X. Wang Nathalie Japkowicz 《Knowledge and Information Systems》2010,25(1):1-20

Real world data mining applications must address the issue of learning from imbalanced data sets. The problem occurs when the number of instances in one class greatly outnumbers the number of instances in the other class. Such data sets often cause a default classifier to be built due to skewed vector spaces or lack of information. Common approaches for dealing with the class imbalance problem involve modifying the data distribution or modifying the classifier. In this work, we choose to use a combination of both approaches. We use support vector machines with soft margins as the base classifier to solve the skewed vector spaces problem. We then counter the excessive bias introduced by this approach with a boosting algorithm. We found that this ensemble of SVMs makes an impressive improvement in prediction performance, not only for the majority class, but also for the minority class. 相似文献

8.

不平衡数据加权集成学习算法

《微型机与应用》2015,(23):7-10

针对传统的机器学习算法对不平衡数据集的少类分类准确率不高的问题,基于支持向量机和模糊聚类,提出一种不平衡数据加权集成学习算法。首先提出加权支持向量机模型(Weighted Support Vector Machine,WSVM),该模型根据不同类别数据所占比例的不同,为各类别分配不同的权重,然后将WSVM与模糊聚类结合提出一种新的集成学习算法。将本文提出的算法应用于人造数据集和UCI数据集实验中,实验结果表明,所提出的算法能够有效地解决不平衡数据的分类问题,具有更好的分类性能。相似文献

9.

The effect of imbalanced data sets on LDA: A theoretical and empirical analysis

Jigang Xie Zhengding Qiu 《Pattern recognition》2007,40(2):557-562

This paper demonstrates that the imbalanced data sets have a negative effect on the performance of LDA theoretically. This theoretical analysis is confirmed by the experimental results: using several sampling methods to rebalance the imbalanced data sets, it is found that the performances of LDA on balanced data sets are superior to those of LDA on imbalanced data sets. 相似文献

10.

Evolutionary synthetic oversampling technique and cocktail ensemble model for warfarin dose prediction with imbalanced data

Tao Yanyun Jiang Bin Xue Ling Xie Cheng Zhang Yuzhen 《Neural computing & applications》2021,33(17):11203-11221

Neural Computing and Applications - To improve the accuracy of warfarin daily dose prediction, we develop an evolutionary synthetic oversampling technique (ESMOTE) with a cocktail ensemble model... 相似文献

11.

Designing the rule classification with oversampling approach with high accuracy for imbalanced data in semiconductor production lines

Wang Hsiao-Yu Tsung Chen-Kun Hung Ching-Hua Chen Chen-Huei 《Multimedia Tools and Applications》2022,81(25):36437-36452

The product quality is the major factor for enhancing the production ability and competitiveness. Decreasing the cost and increasing production capacity are common approaches to realize the enhancement of the product quality. The production managers apply various multimedia data to evaluate the product quality. For example, capturing the stamping sound to evaluate the correct cutting and taking the component image to measure the chip positions are common heterogeneous multimedia data that are applied to manufacturing. However, the production managers prefer to minimize the number of defective products, e. g. the secondary operation and fixing the product tolerance in the assembly stage, to fitting the production target. Therefore, contrasting the defective product identification procedure with high accuracy becomes a challenge due to the decrease of the number of the defective products. In this paper, we propose the Rule Classification with Oversampling (RCOS) approach to provide the high accuracy with few defective products. The proposed RCOS includes the oversampling technique and the rule classification approach to emphasize the properties of the defective products and provide the precise classes. Given few defective products, capturing the properties of the failure is difficult. The RCOS considers the revised Synthetic Minority Over-Sampling Technique (SMOTE) to highlight the failure properties, and then the rule model is considered to extract the root cause of the defective products. We implement the proposed RCOS in the semiconductor production line. From the experiment results, the proposed RCOS provide about at most 98% in accuracy, and the comparison shows that the results have been improved in common criteria e. g. the true-positive rate, G mean, F1 score, and False Alarm Rate. Therefore, the proposed RCOS provides high practicality for the implementation consideration.

相似文献

12.

A bi-objective hybrid algorithm for the classification of imbalanced noisy and borderline data sets

Saeed Sana Ong Hong Choon 《Pattern Analysis & Applications》2019,22(3):979-998

Pattern Analysis and Applications - Classification of imbalanced data sets is one of the significant problems of machine learning and data mining. Traditional classifiers usually produced... 相似文献

13.

基于BSMOTE和逆转欠抽样的不均衡数据分类算法

陈睿张亮杨静胡荣贵《计算机应用研究》2014,(11)

针对传统分类器在数据不均衡的情况下分类效果不理想的缺陷,为提高分类器在不均衡数据集下的分类性能,特别是少数类样本的分类能力,提出了一种基于BSMOTE 和逆转欠抽样的不均衡数据分类算法。该算法使用BSMOTE进行过抽样,人工增加少数类样本的数量,然后通过优先去除样本中的冗余和噪声样本,使用逆转欠抽样方法逆转少数类样本和多数类样本的比例。通过多次进行上述抽样形成多个训练集合,使用Bagging方法集成在多个训练集合上获得的分类器来提高有效信息的利用率。实验表明,该算法较几种现有算法不仅能够提高少数类样本的分类性能,而且能够有效提高整体分类准确度。相似文献

14.

Improving SVM classification on imbalanced time series data sets with ghost points

Suzan Köknar-Tezel Longin Jan Latecki 《Knowledge and Information Systems》2011,28(1):1-23

Imbalanced data sets present a particular challenge to the data mining community. Often, it is the rare event that is of interest and the cost of misclassifying the rare event is higher than misclassifying the usual event. When the data is highly skewed toward the usual, it can be very difficult for a learning system to accurately detect the rare event. There have been many approaches in recent years for handling imbalanced data sets, from under-sampling the majority class to adding synthetic points to the minority class in feature space. However, distances between time series are known to be non-Euclidean and non-metric, since comparing time series requires warping in time. This fact makes it impossible to apply standard methods like SMOTE to insert synthetic data points in feature spaces. We present an innovative approach that augments the minority class by adding synthetic points in distance spaces. We then use Support Vector Machines for classification. Our experimental results on standard time series show that our synthetic points significantly improve the classification rate of the rare events, and in most cases also improves the overall accuracy of SVMs. We also show how adding our synthetic points can aid in the visualization of time series data sets. 相似文献

15.

Investigating fitness functions for a hyper-heuristic evolutionary algorithm in the context of balanced and imbalanced data classification

Rodrigo C. Barros Márcio P. Basgalupp André C. P. L. F. de Carvalho 《Genetic Programming and Evolvable Machines》2015,16(3):241-281

相似文献

16.

Multi-objective evolutionary algorithms and phylogenetic inference with multiple data sets

L. Poladian L.S. Jermiin 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2006,10(4):359-368

Evolutionary relationships among species are usually (1) illustrated by means of a phylogenetic tree and (2) inferred by optimising some measure of fitness, such as the total evolutionary distance between species or the likelihood of the tree (given a model of the evolutionary process and a data set). The combinatorial complexity of inferring the topology of the best tree makes phylogenetic inference an ideal candidate for evolutionary algorithms. However, difficulties arise when different data sets provide conflicting information about the inferred `best' tree(s). We apply the techniques of multi-objective optimisation to phylogenetic inference for the first time. We use the simplest model of evolution and a four species problem to illustrate the method. 相似文献

17.

Bi-immune sets for complexity classes

José L. Balcázar Uwe Schöning 《Theory of Computing Systems》1985,18(1):1-10

An infinite and co-infinite setA is bi-immune for a complexity classC if neitherA nor its complement has an infinite subset inC. We prove various equivalent characterizations of this notion. Also, we introduce a stronger version of bi-immunity and show how both notions relate to density and other properties of sets in EXPTIME.This research was performed while the authors were visiting the Department of Mathematics, University of California, Santa Barbara, Ca. 93106, U.S.A., and was supported in part by the U.S.A.-Spanish Joint Committee for Educational and Cultural Affairs, by the Deutsche Forschungsgemeinschaft, and by the National Science Foundation under Grants MCS80-11979 and MCS83-12472. 相似文献

18.

Cost-sensitive boosting for classification of imbalanced data

Yanmin Mohamed S. Andrew K.C. Yang 《Pattern recognition》2007,40(12):3358-3378

Classification of data with imbalanced class distribution has posed a significant drawback of the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. The significant difficulty and frequent occurrence of the class imbalance problem indicate the need for extra research efforts. The objective of this paper is to investigate meta-techniques applicable to most classifier learning algorithms, with the aim to advance the classification of imbalanced data. The AdaBoost algorithm is reported as a successful meta-technique for improving classification accuracy. The insight gained from a comprehensive analysis of the AdaBoost algorithm in terms of its advantages and shortcomings in tacking the class imbalance problem leads to the exploration of three cost-sensitive boosting algorithms, which are developed by introducing cost items into the learning framework of AdaBoost. Further analysis shows that one of the proposed algorithms tallies with the stagewise additive modelling in statistics to minimize the cost exponential loss. These boosting algorithms are also studied with respect to their weighting strategies towards different types of samples, and their effectiveness in identifying rare cases through experiments on several real world medical data sets, where the class imbalance problem prevails. 相似文献

19.

一种用于非平衡数据分类的集成学习模型

焦盛岚杨炳儒翟云赵万里《计算机工程与应用》2012,48(29):119-123,219

针对非平衡数据分类问题,提出了一种改进的SVM-KNN分类算法,在此基础上设计了一种集成学习模型.该模型采用限数采样方法对多数类样本进行分割,将分割后的多数类子簇与少数类样本重新组合,利用改进的SVM-KNN分别训练,得到多个基本分类器,对各个基本分类器进行组合.采用该模型对UCI数据集进行实验,结果显示该模型对于非平衡数据分类有较好的效果. 相似文献

20.

Error back-propagation algorithm for classification of imbalanced data 总被引：1，自引：0，他引：1

Sang-Hoon Oh Author Vitae 《Neurocomputing》2011,74(6):1058-1061

Classification of imbalanced data is pervasive but it is a difficult problem to solve. In order to improve the classification of imbalanced data, this letter proposes a new error function for the error back-propagation algorithm of multilayer perceptrons. The error function intensifies weight-updating for the minority class and weakens weight-updating for the majority class. We verify the effectiveness of the proposed method through simulations on mammography and thyroid data sets. 相似文献