Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
Datasets with an imbalanced class distribution are difficult to handle with standard classification algorithms, and class imbalance remains a challenging research problem in supervised learning. Most machine learning techniques are designed to operate on balanced datasets; therefore, various under-sampling, over-sampling and hybrid strategies have been proposed to deal with imbalanced datasets, but highly skewed datasets still suffer from poor generalization and from noise generated during resampling. To overcome these problems, this paper proposes a majority clustering model for the classification of imbalanced datasets, known as MCBC-SMOTE (Majority Clustering for balanced Classification-SMOTE). The model converts the binary classification problem into a multi-class problem. In the proposed algorithm, the number of clusters for the majority class is calculated using the elbow method, and the minority class is over-sampled as an average of the clustered majority classes to generate a symmetrical class distribution. The proposed technique is cost-effective, reduces noise generation and removes the imbalances present between and within classes. Evaluations on diverse real datasets show better classification results than existing state-of-the-art methods on several performance metrics.
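The balancing arithmetic in this abstract can be sketched roughly as follows. The sketch assumes that over-sampling "as an average of clustered majority classes" means growing the minority class to the mean size of the majority clusters. That reading, the function name `minority_target_size`, and the example cluster labels are illustrative assumptions, not details confirmed by the paper. Clustering and the elbow method are assumed to have been run already.

```python
from collections import Counter


def minority_target_size(majority_cluster_labels, n_minority):
    """Return (target_size, n_synthetic) for the minority class.

    Assumption (not from the paper): the target is the mean size of the
    majority clusters, rounded to the nearest integer.
    """
    cluster_sizes = Counter(majority_cluster_labels)
    target = round(sum(cluster_sizes.values()) / len(cluster_sizes))
    # Never ask for a negative number of synthetic samples.
    n_synthetic = max(0, target - n_minority)
    return target, n_synthetic


# Hypothetical data: 900 majority samples split into 3 clusters, 50 minority samples.
labels = [0] * 400 + [1] * 300 + [2] * 200
target, n_synthetic = minority_target_size(labels, n_minority=50)
print(target, n_synthetic)
```

Under this reading, each majority cluster and the enlarged minority class become classes of comparable size in the resulting multi-class problem.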

2.
Learning from imbalanced data is one of the most challenging problems in binary classification, and it has gained importance in recent years. When the class distribution is imbalanced, classical machine learning algorithms tend to be strongly biased towards the majority class and disregard the minority class. The accuracy may therefore be high while the model fails to recognize minority-class instances, leading to many misclassifications. Different methods have been proposed in the literature to handle the imbalance problem, but most are complicated and tend to generate unnecessary noise. In this paper, we propose a simple oversampling method based on the multivariate Gaussian distribution and K-means clustering, called GK-Means. The new method aims to avoid generating noise and to control imbalances between and within classes. Various experiments have been carried out with six classifiers and four oversampling methods. Experimental results on different imbalanced datasets show that the proposed GK-Means outperforms other oversampling methods and improves classification performance as measured by F1-score and accuracy.

3.
Imbalanced data classification is one of the major problems in machine learning. An imbalanced dataset typically has significant differences in the number of samples between its classes. In most cases, the performance of a machine learning algorithm such as the Support Vector Machine (SVM) is affected when dealing with an imbalanced dataset: classification accuracy is skewed toward the majority class, and minority-class samples are predicted poorly. In this paper, a hybrid approach combining a data pre-processing technique with an SVM algorithm based on improved Simulated Annealing (SA) is proposed. First, a data pre-processing technique is proposed as the resampling strategy for handling imbalanced datasets. In this technique, data are first synthetically generated to equalize the number of samples between classes, followed by a reduction step that removes redundant and duplicated data. Next, an SVM is trained on the balanced dataset. Since this algorithm requires an iterative search for the best penalty parameter during training, an improved SA algorithm is proposed for this task. The improvement introduces a new acceptance criterion for candidate solutions in the SA algorithm to enhance the accuracy of the optimization process. Experiments on ten publicly available imbalanced datasets demonstrate higher classification accuracy with the proposed approach than with the conventional implementation of SVM. An average accuracy of 89.65% for binary classification demonstrates the good performance of the proposed approach.
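For context on the acceptance step mentioned in this abstract, the sketch below shows the conventional Metropolis acceptance rule used in textbook simulated annealing. It is not the new criterion proposed in the paper, which the abstract does not specify. The function name `metropolis_accept` and the example values are illustrative assumptions. `delta` is the change in the cost being minimized, for example the validation error of the SVM at a candidate penalty parameter minus the error at the current one.

```python
import math
import random


def metropolis_accept(delta, temperature, rng):
    """Conventional SA acceptance rule (assumes temperature > 0).

    Improvements (delta <= 0) are always accepted; a worse candidate is
    accepted with probability exp(-delta / temperature).
    """
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical cost changes at a high and at a low temperature.
for temperature in (1.0, 0.01):
    accepted = sum(metropolis_accept(0.05, temperature, rng) for _ in range(1000))
    print(temperature, accepted)
```

As the temperature falls, worse candidates are accepted less often, which is the behaviour any modified acceptance criterion has to trade off against search accuracy.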

4.
Recently, many researchers have concentrated on using neural networks to learn features for Distant Supervised Relation Extraction (DSRE). These approaches generally use a softmax classifier with cross-entropy loss, which inevitably brings the noise of the artificial class NA into the classification process. To address this shortcoming, a classifier with ranking loss is employed for DSRE. Uniformly randomly selecting a relation, or heuristically selecting the highest-scoring relation among all incorrect relations, are the two common methods for generating a negative class in the ranking loss function. However, most of the generated negative classes can be easily discriminated from the positive class and contribute little to training. Inspired by Generative Adversarial Networks (GANs), we use a neural network as the negative class generator to assist the training of our desired model, which acts as the discriminator in GANs. Through the alternating optimization of generator and discriminator, the generator learns to produce more and more discriminable negative classes, and the discriminator has to become better as well. This framework is independent of the concrete form of the generator and discriminator. In this paper, we use a two-layer fully connected neural network as the generator and Piecewise Convolutional Neural Networks (PCNNs) as the discriminator. Experimental results show that our proposed GAN-based method is effective and performs better than state-of-the-art methods.

5.
Generally, defective dies on semiconductor wafer maps tend to form spatial clusters in distinguishable patterns that contain crucial information about specific equipment or process problems, so it is highly important to identify and classify diverse defect patterns accurately. In practice, however, there is a serious class imbalance problem: the number of defective dies on semiconductor wafer maps is usually much smaller than the number of non-defective dies. In various machine learning applications, a typical classification algorithm is developed under the assumption that the number of instances in each class is nearly balanced. If a conventional classification algorithm is applied to a class-imbalanced dataset, it may produce incorrect classification results, degrading the reliability of the classification algorithm. In this research, we consider semiconductor wafer defect bin data combined with wafer warpage information and propose a new hybrid resampling algorithm to improve the performance of classifiers. The experimental analysis shows that the proposed algorithm provides better classification performance than other data preprocessing methods, regardless of the classification model.

6.
Histopathology is considered the gold standard for diagnosing breast cancer. Traditional machine learning (ML) algorithms provide promising performance for cancer diagnosis if the training dataset is balanced. If the training dataset is imbalanced, however, the performance of the ML model is skewed toward the majority class. This may pose a problem for the pathologist, because if a benign sample is misclassified as malignant, the pathologist could misjudge the diagnosis. Only limited investigation has been done in the literature on solving the class imbalance problem in computer-aided diagnosis (CAD) of breast cancer using histopathology. This work proposes a hybrid ML model to solve the class imbalance problem. The proposed model employs a pretrained ResNet50 and the kernelized weighted extreme learning machine for CAD of breast cancer using histopathology. The breast cancer histopathological images are obtained from the publicly available BreakHis and BisQue datasets. The proposed method achieves reasonable performance in classifying both minority and majority class instances. In comparison, the proposed approach outperforms the state-of-the-art ML models implemented in previous studies using the same training-testing folds of the publicly accessible BreakHis dataset.

7.
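To address the inability of the traditional support vector machine (SVM) algorithm to detect faults effectively when data are imbalanced, a new fault detection algorithm combining over-sampling with a cost-sensitive support vector machine is proposed. The algorithm first uses the borderline synthetic minority over-sampling technique (BSMOTE) to balance the training samples. To reduce the noise introduced by the synthetic samples, a cost-sensitive support vector machine (CSSVM) algorithm is constructed using K nearest neighbors, in which a cost function for each sample eliminates the effect of noisy samples on the classification accuracy of the SVM. The algorithm is applied to bearing fault detection and compared with the traditional SVM algorithm, the SVM-C algorithm with different costs per class, and the combination of SVM and SMOTE. Experimental results show that, when samples are imbalanced, the fault detection performance of the proposed algorithm is significantly better than that of the other algorithms.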

8.
Recently, machine learning algorithms have been used in the detection and classification of network attacks. The performance of the algorithms has been evaluated on benchmark network intrusion datasets such as DARPA98, KDD'99, NSL-KDD, UNSW-NB15, and Caida DDoS. However, these datasets pose two major challenges: imbalanced data and high-dimensional data. Achieving high accuracy for every attack type in the dataset is what yields high accuracy on imbalanced datasets. On the other hand, a large number of features increases the runtime load on the algorithms. A novel model is proposed in this paper to overcome these two concerns. The number of features in the model, which has been tested on CICIDS2017, is first optimized using genetic algorithms. This optimal feature set is then used to classify network attacks with six well-known classifiers, selected for high F1-score and G-mean values in minimum time. Afterwards, an ensemble learning approach based on a multi-layer perceptron is applied to improve the models' overall performance. The experimental results show that the suggested model is suitable for feature selection as well as for classifying network attacks in an imbalanced dataset, with a high F1-score (0.91) and G-mean (0.99). Furthermore, it outperforms the base classifier models and voting procedures.
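The two metrics named in this abstract can be computed from a confusion matrix as in the sketch below. It assumes the standard binary definitions: F1 is the harmonic mean of precision and recall, and G-mean is the geometric mean of sensitivity and specificity. The function name `f1_and_gmean` and the example counts are illustrative assumptions, and the paper may use a multi-class variant.

```python
import math


def f1_and_gmean(tp, fp, fn, tn):
    """Binary F1-score and G-mean from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity (TPR)
    specificity = tn / (tn + fp) if tn + fp else 0.0   # TNR
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Hypothetical counts: a classifier that labels everything as the majority class
# on 100 attack and 900 benign samples reaches 90% accuracy, yet both metrics are 0.
print(f1_and_gmean(tp=0, fp=0, fn=100, tn=900))
print(f1_and_gmean(tp=80, fp=20, fn=20, tn=880))
```

Both metrics fall to zero when the minority class is ignored, which is why they are preferred over plain accuracy on imbalanced data.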

9.
Classification of imbalanced data is a well-explored issue in the data mining and machine learning community, in which the representation of one class is overwhelmed by other classes. An imbalanced distribution of data occurs naturally in real-world datasets and needs to be dealt with carefully to obtain important insights. When datasets are imbalanced, traditional classifiers sacrifice performance, which leads to misclassifications. This paper suggests a weighted nearest neighbor approach in a fuzzy manner to deal with this issue. We have adapted the ‘existing algorithm modification solution’ to learn from imbalanced datasets, which classifies data without manipulating the natural distribution of the data, unlike other popular data balancing methods. K nearest neighbor is a non-parametric classification method widely used in machine learning problems. Fuzzy classification with the nearest neighbor clarifies the membership of an instance in the classes, and optimal weights with an improved nearest neighbor concept help to classify imbalanced data correctly. The proposed hybrid approach takes care of the imbalanced nature of the data and reduces the inaccuracies that appear in applications of the original and traditional classifiers. Results show that it performs well compared with existing fuzzy nearest neighbor and weighted neighbor strategies for imbalanced learning.

10.
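Research on mechanical fault diagnosis based on the one-class hypersphere support vector machine   Cited by: 1 (self-citations: 0, citations by others: 1)
To address the difficulty of obtaining fault-class samples and the uneven sample distribution in mechanical fault diagnosis, a fault diagnosis method based on the one-class hypersphere support vector machine (SVM) is proposed, which needs to be trained only on normal-class samples. Experiments analyze the effect of missing abnormal-class samples on the performance of the one-class hypersphere SVM, and a method for optimizing the selection of model parameters is proposed to improve the generalization ability of the classification model. The classification ability of different training results is analyzed, and the classification results of the one-class hypersphere SVM are compared with those of the one-class hyperplane SVM, verifying the correctness and effectiveness of the former.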

11.
Osteosarcoma is one of the most widespread causes of bone cancer globally and has a high mortality rate. Early diagnosis may increase the chances of treatment and survival; however, the process is time-consuming (owing to the reliability and complexity involved in extracting hand-crafted features) and largely depends on the pathologist's experience. The Convolutional Neural Network (CNN), an end-to-end model, is known to be an alternative that overcomes these problems. Therefore, this work proposes a compact CNN architecture that has been rigorously explored on a Small Osteosarcoma histology Image Dataset (a highly class-imbalanced dataset). During training, however, class-imbalanced data can negatively affect the performance of a CNN. Therefore, an oversampling technique is proposed to overcome this issue and improve generalization performance. In this process, a hierarchical CNN model is designed, in which the former model is non-regularized (due to its dense architecture) and the latter is regularized and specifically designed for small histopathology images. Moreover, the regularized model is integrated with the CNN's basic architecture to reduce overfitting. Experimental results demonstrate that oversampling might be an effective way to address the imbalanced class problem during training. The training and testing accuracies of the non-regularized CNN model are 98% and 78% with an imbalanced dataset and 96% and 81% with a balanced dataset, respectively. The training and testing accuracies of the regularized CNN model are 84% and 75% for an imbalanced dataset and 87% and 86% for a balanced dataset.

12.
With the rise of internet facilities, the number of people making online transactions has grown exponentially in recent years, as online transaction systems have eliminated the need to go to the bank physically for every transaction. However, fraud cases have also increased, causing consumers to lose money. Hence, an effective fraud detection system that can detect fraudulent transactions automatically in real time is the need of the hour. Genuine transactions generally far outnumber fraudulent ones, which leads to the class imbalance problem. In this research work, an online transaction fraud detection system using deep learning is proposed that handles the class imbalance problem by applying algorithm-level methods, which modify the learning of the model to focus more on the minority class, i.e., fraudulent transactions. A novel loss function named Weighted Hard-Reduced Focal Loss (WH-RFL) is proposed, which achieves the maximum fraud detection rate, i.e., True Positive Rate (TPR), at the cost of misclassifying a few genuine transactions, since a high TPR is preferred over a high True Negative Rate (TNR) in a fraud detection system; this is demonstrated on three publicly available imbalanced transactional datasets. In addition, thresholding is applied to optimize the decision threshold using cross-validation so as to detect the maximum number of frauds, and the experimental results demonstrate that selecting the right thresholding method with deep learning yields better results.
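The abstract does not define WH-RFL, so the sketch below shows only the standard class-weighted focal loss that such loss functions build on, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), as a stand-in for the general idea of down-weighting easy examples. The function name `focal_loss` and the values α = 0.75 and γ = 2 are illustrative assumptions, not settings from the paper.

```python
import math


def focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-12):
    """Standard binary focal loss for one example (not the paper's WH-RFL).

    p: predicted probability of the positive (fraud) class.
    y: true label, 1 for fraud and 0 for genuine.
    alpha: weight of the positive class; (1 - alpha) weights the negative class.
    gamma: focusing parameter; gamma = 0 gives weighted cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A missed fraud (p = 0.1) is penalized far more than a confidently caught one (p = 0.9).
print(focal_loss(0.1, 1), focal_loss(0.9, 1))
```

Raising `alpha` shifts the loss toward the fraud class, which pushes the model toward a higher TPR at the expense of TNR, the trade-off the abstract describes.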

13.
Massive Open Online Courses (MOOCs) have become a popular way of learning online, used by millions of people across the world. Meanwhile, a vast amount of information has been collected from MOOC learners and institutions. Based on this educational data, much research has been carried out on predicting the MOOC learner's final grade. However, two problems remain in this research field. The first is how to select the most suitable features to improve prediction accuracy, and the second is how to use or modify data mining algorithms for better analysis of MOOC data. To solve these two problems, an improved random forests method is proposed in this paper. First, a hybrid indicator is defined to measure the importance of the features, and a rule is established for feature selection; then, a Clustering-Synthetic Minority Over-sampling Technique (SMOTE) is embedded into the traditional random forests algorithm to solve the class imbalance problem. In the experiments, we verify the performance of the proposed method using the Canvas Network Person-Course (CNPC) dataset. Furthermore, four well-known prediction methods are applied for comparison, demonstrating the superiority of our method.

14.
Traffic accident data sets are usually imbalanced: the number of instances in the killed or severe injuries class (minority) is much lower than the number in the slight injuries class (majority). This poses a challenging problem for classification algorithms and may yield a model that covers the slight injuries instances well while frequently misclassifying the killed or severe injuries instances. Based on traffic accident data collected on urban and suburban roads in Jordan over three years (2009–2011), three data balancing techniques were used: under-sampling, which removes some instances of the majority class; oversampling, which creates new instances of the minority class; and a mixed technique that combines both. In addition, different Bayes classifiers were compared on the imbalanced and balanced data sets (Averaged One-Dependence Estimators, Weightily Averaged One-Dependence Estimators, and Bayesian networks) in order to identify factors that affect the severity of an accident. The results indicated that using the balanced data sets, especially those created with oversampling techniques, with Bayesian networks improved the classification of a traffic accident according to its severity and reduced the misclassification of killed and severe injuries instances. In addition, the following variables were found to contribute to the occurrence of a fatality or a severe injury in a traffic accident: number of vehicles involved, accident pattern, number of directions, accident type, lighting, surface condition, and speed limit. This work, to the knowledge of the authors, is the first to analyze historical data records of traffic accidents occurring in Jordan and the first to apply balancing techniques to analyze the injury severity of traffic accidents.
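The three balancing strategies listed in this abstract can be illustrated with their simplest random variants, sketched below. The abstract does not say which specific under-sampling or oversampling algorithms were used, so random duplication and random removal are assumptions, as are the function names and the example class sizes.

```python
import random


def random_oversample(minority, n_target, rng):
    """Grow the minority class to n_target by duplicating random instances."""
    n_extra = max(0, n_target - len(minority))
    return list(minority) + [rng.choice(minority) for _ in range(n_extra)]


def random_undersample(majority, n_target, rng):
    """Shrink the majority class to n_target by keeping a random subset."""
    return rng.sample(list(majority), min(n_target, len(majority)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
slight = [("slight", i) for i in range(900)]   # hypothetical majority class
severe = [("severe", i) for i in range(100)]   # hypothetical minority class

under = random_undersample(slight, len(severe), rng)   # 100 vs 100
over = random_oversample(severe, len(slight), rng)     # 900 vs 900
mid = (len(slight) + len(severe)) // 2                 # mixed: meet in the middle
mixed = (random_undersample(slight, mid, rng), random_oversample(severe, mid, rng))
print(len(under), len(over), len(mixed[0]), len(mixed[1]))
```

Under-sampling discards majority information, while duplication-based oversampling adds no new information; the mixed variant applies a milder dose of each.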

15.
16.
Data stream classification currently plays a key role in big data analytics due to its enormous growth. Most existing classification methods use ensemble learning, which is trustworthy, but these methods are not effective at learning from imbalanced big data, and they also assume that all data are pre-classified. Another weakness of current methods is the long evaluation time when the target data stream contains a large number of features. The main objective of this research is to develop a new method for incremental learning based on the proposed ant lion fuzzy-generative adversarial network model. The proposed model is implemented on the Spark architecture. For each data stream, the class output is computed at slave nodes by training a generative adversarial network with the back-propagation error based on fuzzy bound computation. This method overcomes the limitations of existing methods, as it can classify data streams that are slightly or completely unlabeled while providing high scalability and efficiency. The results show that the proposed model outperforms state-of-the-art methods in terms of accuracy (0.861), precision (0.9328) and minimal MSE (0.0416).

17.
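In the field of fault diagnosis, to address the inability of the traditional support vector machine (SVM) algorithm to detect faults effectively when data are imbalanced, an SVM fault detection algorithm for imbalanced data based on spectral-clustering under-sampling is proposed. The algorithm performs spectral clustering on the majority class in the kernel space and then selects representative informative points, ultimately balancing the samples. The algorithm is applied to bearing fault detection and compared with other algorithms; experimental results show that the proposed algorithm has stronger fault detection performance than the other algorithms on imbalanced data.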

18.
19.
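A bridge condition monitoring method based on frequency-domain system identification and support vector machines   Cited by: 8 (self-citations: 0, citations by others: 8)
With the growing number of long-span suspension and cable-stayed bridges, ensuring bridge safety and reducing maintenance costs have become concerns for traffic management and government departments. Because damage samples are difficult to obtain in practice, bridge condition monitoring is treated as a "one-class learning" problem in pattern recognition. Obtaining the pattern features of a bridge is an "output-only" system identification problem; since the monitoring system must work online, the CMIF system identification method, which is conceptually intuitive, reliable and easy to automate, is proposed as the tool for obtaining pattern features. To obtain a sufficiently sensitive discriminant function for anomaly alarms, a one-class learning algorithm based on support vector machines is adopted; this method achieves high sensitivity while making it easy to trade off sensitivity against generalization performance. The algorithm is validated with 794 hours of measured data from the Ting Kau Bridge in Hong Kong, which demonstrates its effectiveness and practicality; the results can serve as a reference for the design of similar monitoring systems.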

20.
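A registration state detection method for printed images with imbalanced training sets   Cited by: 1 (self-citations: 1, citations by others: 0)
简川霞 (Jian Chuanxia), 高健 (Gao Jian). 《包装工程》 (Packaging Engineering), 2018, 39(11): 158-164
Objective: To address the low recognition accuracy for misregistered printed images in registration state detection with imbalanced data, a preprocessing method for imbalanced printed-image training sets is studied. Methods: An integrated sampling preprocessing method for imbalanced printed-image training data is proposed. A support vector machine first divides the imbalanced training data into support vectors and non-support vectors; the support vectors of the minority class (misregistered printed images) are then over-sampled and the non-support vectors of the majority class (registered printed images) are under-sampled to balance the training data. Finally, the preprocessed balanced training set is used to train the support vector machine model, and the model parameters are optimized. Results: With the support vector machine model obtained after preprocessing the imbalanced training set using the proposed integrated sampling method, recognition of the registration state of printed images gave a minority-class recognition rate a+ of 0.9375, a geometric mean of recognition accuracy Gmean of 0.9437, and an F-measure of 0.9574. Conclusion: The a+, Gmean and F-measure for misregistered printed images obtained with the proposed method are all better than those of the other methods in the experiment.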
