Similar literature
 20 similar documents found (search time: 15 ms)
1.
Learning from imbalanced data is one of the most challenging problems in binary classification, and it has gained importance in recent years. When the class distribution is imbalanced, classical machine learning algorithms tend to be strongly biased towards the majority class and disregard the minority. Therefore, the accuracy may be high, but the model cannot recognize data instances in the minority class, leading to many misclassifications. Different methods have been proposed in the literature to handle the imbalance problem, but most are complicated and tend to generate unnecessary noise. In this paper, we propose a simple oversampling method based on the multivariate Gaussian distribution and K-means clustering, called GK-Means. The new method aims to avoid generating noise and to control imbalances between and within classes. Various experiments have been carried out with six classifiers and four oversampling methods. Experimental results on different imbalanced datasets show that the proposed GK-Means outperforms other oversampling methods and improves classification performance as measured by F1-score and accuracy.

2.
Datasets with an imbalanced class distribution are difficult to handle with standard classification algorithms. In supervised learning, dealing with class imbalance is still considered a challenging research problem. Many machine learning techniques are designed to operate on balanced datasets; therefore, various state-of-the-art under-sampling, over-sampling and hybrid strategies have been proposed to deal with imbalanced datasets, but highly skewed datasets still pose problems of generalization and noise generation during resampling. To overcome these problems, this paper proposes a majority clustering model for classification of imbalanced datasets known as MCBC-SMOTE (Majority Clustering for Balanced Classification-SMOTE). The model provides a method to convert a binary classification problem into a multi-class problem. In the proposed algorithm, the number of clusters for the majority class is calculated using the elbow method, and the minority class is over-sampled to the average size of the clustered majority classes to generate a symmetrical class distribution. The proposed technique is cost-effective, reduces noise generation and removes the imbalances present between and within classes. Evaluations on diverse real datasets show better classification results than existing state-of-the-art methodologies on several performance metrics.

3.
Recently, machine learning algorithms have been used in the detection and classification of network attacks. The performance of the algorithms has been evaluated by using benchmark network intrusion datasets such as DARPA98, KDD’99, NSL-KDD, UNSW-NB15, and Caida DDoS. However, these datasets pose two major challenges: imbalanced data and high-dimensional data. Obtaining high accuracy for all attack types in the dataset allows for high accuracy in imbalanced datasets. On the other hand, having a large number of features increases the runtime load on the algorithms. A novel model is proposed in this paper to overcome these two concerns. The number of features in the model, which has been tested on CICIDS2017, is initially optimized by using genetic algorithms. This optimum feature set has been used to classify network attacks with six well-known classifiers, targeting a high F1-score and G-mean value in minimum time. Afterwards, a multi-layer perceptron based ensemble learning approach has been applied to improve the models’ overall performance. The experimental results show that the suggested model is acceptable for feature selection as well as for classifying network attacks in an imbalanced dataset, with a high F1-score (0.91) and G-mean (0.99) value. Furthermore, it has outperformed base classifier models and voting procedures.
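
The two metrics reported above can be computed from a confusion matrix. The following is a minimal sketch for the binary case only; the function name `f1_and_gmean` and the counts in the usage lines are hypothetical, and the abstract does not state how the paper aggregates the metrics across attack types.

```python
import math


def f1_and_gmean(tp, fp, fn, tn):
    """Return (F1, G-mean) for a binary confusion matrix.

    F1 is the harmonic mean of precision and recall. G-mean is the
    geometric mean of sensitivity (recall) and specificity.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Hypothetical counts, for illustration only (not taken from the paper).
f1, gmean = f1_and_gmean(tp=80, fp=20, fn=20, tn=880)
print(f"F1 = {f1:.3f}, G-mean = {gmean:.3f}")
```

Unlike accuracy, both metrics drop to zero when a classifier never predicts the minority class, which is why they are commonly preferred on imbalanced data.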

4.
Generally, defective dies on semiconductor wafer maps tend to form spatial clusters in distinguishable patterns that contain crucial information on specific equipment or process problems; it is therefore highly important to identify and classify diverse defect patterns accurately. In practice, however, there is a serious class imbalance problem: the number of defective dies on semiconductor wafer maps is usually much smaller than the number of non-defective dies. In various machine learning applications, a typical classification algorithm is developed under the assumption that the number of instances for each class is nearly balanced. If a conventional classification algorithm is applied to a class-imbalanced dataset, it may produce incorrect classification results and degrade the reliability of the classification algorithm. In this research, we consider semiconductor wafer defect bin data combined with wafer warpage information and propose a new hybrid resampling algorithm to improve the performance of classifiers. From the experimental analysis, we show that the proposed algorithm provides better classification performance than other data preprocessing methods regardless of the classification model.

5.
Recently, many researchers have concentrated on distant supervision relation extraction (DSRE). DSRE has solved the problem of the lack of data for supervised learning; however, the data automatically labeled by DSRE suffers from a serious problem, namely class imbalance. The data from the majority class dominates the dataset. In this case, most neural network classifiers have a strong bias towards the majority class, so they cannot correctly classify the minority class. Studies have shown that the degree of separability between classes greatly determines classification performance on imbalanced data. Therefore, in this paper we propose a novel model that combines class-to-class separability and cost-sensitive learning to adjust the maximum reachable cost of misclassification, thus improving performance on imbalanced datasets under distant supervision. Experiments have shown that our method is more effective for DSRE than baseline methods.

6.
Emotion detection from text is a challenging problem in text analytics. Opinion mining experts are focusing on the development of emotion detection applications, as these have received considerable attention from the online community, including users and business organizations, for collecting and interpreting public emotions. However, most of the existing works on emotion detection used less efficient machine learning classifiers with limited datasets, resulting in performance degradation. To overcome this issue, this work evaluates the performance of different machine learning classifiers on a benchmark emotion dataset. The experimental results show the performance of the classifiers in terms of different evaluation metrics such as precision, recall and F-measure. Finally, the classifier with the best performance is recommended for emotion classification.

7.
Imbalanced data classification is one of the major problems in machine learning. An imbalanced dataset typically has significant differences in the number of data samples between its classes. In most cases, the performance of a machine learning algorithm such as the Support Vector Machine (SVM) is affected when dealing with an imbalanced dataset. The classification accuracy is mostly skewed toward the majority class, and poor results are exhibited in the prediction of minority-class samples. In this paper, a hybrid approach combining a data pre-processing technique and an SVM algorithm based on improved Simulated Annealing (SA) is proposed. Firstly, a data pre-processing technique is proposed that provides the resampling strategy for handling imbalanced datasets. In this technique, data are first synthetically generated to equalize the number of samples between classes, followed by a reduction step to remove redundant and duplicated data. Next, an SVM is trained on the balanced dataset. Since this algorithm requires an iterative process to search for the best penalty parameter during training, an improved SA algorithm is proposed for this task. In this improvement, a new acceptance criterion for solutions in the SA algorithm is introduced to enhance the accuracy of the optimization process. Experiments on ten publicly available imbalanced datasets have demonstrated higher classification accuracy using the proposed approach in comparison with the conventional implementation of SVM. An average accuracy of 89.65% on binary classification demonstrates the good performance of the proposed approach.
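
The search for a penalty parameter by simulated annealing can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the abstract does not describe its new acceptance criterion, so the sketch uses the standard Metropolis criterion, and `toy_score` is a hypothetical stand-in for the cross-validated accuracy of an SVM as a function of log10(C). The name `anneal` and all default values are also assumptions.

```python
import math
import random


def anneal(score, lo, hi, steps=200, t0=1.0, cooling=0.97, seed=0):
    """Maximize score(x) for x in [lo, hi] with basic simulated annealing."""
    rng = random.Random(seed)
    current = rng.uniform(lo, hi)
    current_score = score(current)
    best, best_score = current, current_score
    temp = t0
    for _ in range(steps):
        # Propose a nearby candidate, clipped to the search interval.
        step = rng.gauss(0.0, 0.1 * (hi - lo))
        candidate = min(hi, max(lo, current + step))
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Standard Metropolis rule: always accept improvements, and accept
        # worse candidates with probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
        if current_score > best_score:
            best, best_score = current, current_score
        temp *= cooling  # geometric cooling schedule (an assumption)
    return best, best_score


def toy_score(log_c):
    """Hypothetical objective standing in for validation accuracy."""
    return -(log_c - 1.0) ** 2


best_log_c, best_val = anneal(toy_score, lo=-3.0, hi=3.0)
print(f"best log10(C) = {best_log_c:.3f}, score = {best_val:.4f}")
```

In a real setting, `toy_score` would be replaced by a function that trains and validates the classifier for a given penalty value, which makes each annealing step considerably more expensive.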

8.
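To address the shortcomings of the traditional support vector machine (SVM) algorithm in rolling-bearing fault diagnosis, namely poor performance on imbalanced datasets, sensitivity to noise, and heavy dependence on its own parameters, an oversampling algorithm based on sample characteristics (OABSC) is proposed. The algorithm uses improved agglomerative hierarchical clustering to divide the fault samples into multiple clusters. Within each cluster, it considers both sample distance and neighborhood density to identify and remove "suspected noise points", and ranks the remaining samples by information content. Next, within each cluster, the K*-information nearest neighbor (K*INN) oversampling algorithm synthesizes new samples to balance the dataset. Bearing fault conditions under three different imbalance ratios are simulated, and particle swarm optimization is used to optimize the parameters of the SVM classifier. Experiments show that, compared with existing algorithms, OABSC is better suited to bearing fault diagnosis where the data are imbalanced and distributed in multiple clusters, achieving higher G-mean and AUC values and greater robustness.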

9.
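To address the problems that magnetic memory testing signals are weak and defect regions cannot be effectively identified, an improved fuzzy support vector machine (FSVM) is proposed and applied to defect identification in magnetic memory testing. On the one hand, building on the traditional method of determining the fuzzy membership function, the improved FSVM constructs a k-nearest-neighbor dispersion measure to reduce the influence of isolated points or noisy samples on classification. On the other hand, it weights the sample feature values to reduce the influence of redundant or weak features on identification. Experimental results show that the method can effectively identify defect signals in regions of different danger levels, has good robustness and classification ability, and is an effective method for defect identification in magnetic memory testing.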

10.
Prediction of machine failure is challenging as the dataset is often imbalanced with a low failure rate. The common approach to handle classification involving imbalanced data is to balance the data using a sampling approach such as random undersampling, random oversampling, or Synthetic Minority Oversampling Technique (SMOTE) algorithms. This paper compared the classification performance of three popular classifiers (Logistic Regression, Gaussian Naïve Bayes, and Support Vector Machine) in predicting machine failure in the oil and gas industry. The original machine failure dataset consists of 20,473 hourly records and is imbalanced, with 19,945 (97%) ‘non-failure’ and 528 (3%) ‘failure’ records. The three independent variables used to predict machine failure were pressure indicator, flow indicator, and level indicator. The accuracy of the classifiers was very high and close to 100%, but the sensitivity of all classifiers using the original dataset was close to zero. The performance of the three classifiers was then evaluated for data with different imbalance rates (10% to 50%) generated from the original data using SMOTE, SMOTE-Support Vector Machine (SMOTE-SVM) and SMOTE-Edited Nearest Neighbour (SMOTE-ENN). The classifiers were evaluated based on improvement in sensitivity and F-measure. Results showed that the sensitivity of all classifiers increases as the imbalance rate increases. SVM with a radial basis function (RBF) kernel has the highest sensitivity when data is balanced (50:50) using SMOTE (Sensitivity_test = 0.5686, F_test = 0.6927) compared to Naïve Bayes (Sensitivity_test = 0.4033, F_test = 0.6218) and Logistic Regression (Sensitivity_test = 0.4194, F_test = 0.621). Overall, the Gaussian Naïve Bayes model consistently improves sensitivity and F-measure as the imbalance ratio increases, but the sensitivity is below 50%. The classifiers performed better when data was balanced using SMOTE-SVM compared to SMOTE and SMOTE-ENN.
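
The core idea of basic SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as below. This is a minimal standard-library illustration, not the paper's pipeline; it does not cover the SMOTE-SVM or SMOTE-ENN variants, and the function name `smote`, its parameters, and the sample values are all hypothetical.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority samples.

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 'failure' records: (pressure, flow, level) readings.
failures = [(1.0, 0.2, 3.1), (1.2, 0.1, 2.9), (0.9, 0.3, 3.4), (1.1, 0.2, 3.0)]
new_points = smote(failures, n_new=6, k=2)
print(len(new_points), "synthetic samples generated")
```

Because every synthetic point is a convex combination of two existing minority samples, the generated data never leaves the bounding box of the original minority class.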

11.
Recently, machine learning-based technologies have been developed to automate the classification of wafer map defect patterns during semiconductor manufacturing. The existing approaches to wafer map pattern classification include directly learning the image through a convolutional neural network and applying an ensemble method after extracting image features. This study aims to classify wafer map defects more effectively and to derive algorithms that are robust even for datasets with insufficient defect patterns. First, the number of defects observed during the actual process may be limited. Therefore, data for the insufficient patterns are generated using a convolutional auto-encoder (CAE), and the expanded data are verified using the structural similarity index measure (SSIM). After extracting handcrafted features, a boosted stacking ensemble model that integrates four base-level classifiers with an extreme gradient boosting classifier as the meta-level classifier is designed and trained on the expanded data for final prediction. Since the proposed algorithm performs better than existing ensemble classifiers even for insufficient defect patterns, the results of this study will contribute to improving product quality and yield in the actual semiconductor manufacturing process.

12.
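To address the high dimensionality and imbalance of mechanical fault data, a fault classification method based on multi-cluster feature selection on the Grassmann manifold and iterative nearest-neighbor oversampling is proposed. Time-domain and frequency-domain features are extracted from the collected vibration signals, and multi-cluster feature selection maps the high-dimensional data to a low-dimensional feature set while preserving the local manifold structure. Unlabeled samples are classified by iterative nearest-neighbor oversampling with the goal of restoring maximum balance, and the remaining unlabeled samples are classified by fuzzy classification. Data from rolling bearings in the normal state and with outer-race, inner-race, and rolling-element faults are selected, and the method is compared with the support vector machine and a graph-based semi-supervised learning algorithm. The results show that the proposed method can effectively identify minority-class faults and achieves notably good classification overall.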

13.
Histopathology is considered the gold standard for diagnosing breast cancer. Traditional machine learning (ML) algorithms provide promising performance for cancer diagnosis if the training dataset is balanced. Nevertheless, if the training dataset is imbalanced, the performance of the ML model is skewed toward the majority class. This may pose a problem for the pathologist because, if a benign sample is misclassified as malignant, the pathologist could make a misjudgment about the diagnosis. Only limited investigation has been done in the literature on solving the class imbalance problem in computer‐aided diagnosis (CAD) of breast cancer using histopathology. This work proposes a hybrid ML model to solve the class imbalance problem. The proposed model employs a pretrained ResNet50 and the kernelized weighted extreme learning machine for CAD of breast cancer using histopathology. The breast cancer histopathological images are obtained from the publicly available BreakHis and BisQue datasets. The proposed method achieved reasonable performance for the classification of minority as well as majority class instances. In comparison, the proposed approach outperforms the state‐of‐the‐art ML models implemented in previous studies using the same training‐testing folds of the publicly accessible BreakHis dataset.

14.
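In fault diagnosis, the traditional support vector machine (SVM) algorithm cannot detect faults effectively when the data are imbalanced. To address this shortcoming, an SVM fault detection algorithm for imbalanced data based on spectral-clustering undersampling is proposed. The algorithm performs spectral clustering on the majority class in kernel space and then selects representative, informative points, ultimately balancing the samples. The algorithm is applied to bearing fault detection and compared with other algorithms; the experimental results show that the proposed algorithm has stronger fault detection performance than the other algorithms on imbalanced data.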

15.
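To address the low recognition accuracy of high-dimensional fault feature sets for rotating machinery, a dimensionality-reduction fault diagnosis method is proposed that combines Kernel Supervised Locality Preserving Projection (KSLPP) with a ReliefF Weighted K-Nearest Neighbor (RWKNN) classifier. The method first applies KSLPP to extract the nonlinear information in the fault feature set while making full use of class information during the dimensionality-reducing projection, so that within-class scatter is minimized and between-class separation is maximized after reduction. The resulting low-dimensional sensitive feature set is then fed to RWKNN for pattern recognition. RWKNN highlights the contribution of different features to classification, strengthening sensitive features and weakening irrelevant ones, which improves classification accuracy and robustness. Finally, the effectiveness of the method is verified on the fault feature set of a typical rotor test rig.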
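
The idea behind a feature-weighted k-nearest-neighbor classifier such as RWKNN can be sketched as below. This is a minimal illustration that assumes the feature weights are already given; the ReliefF procedure that would estimate them is not implemented, and the function name `weighted_knn_predict`, the weights, and the sample data are hypothetical.

```python
import math
from collections import Counter


def weighted_knn_predict(train_x, train_y, query, weights, k=3):
    """Classify query by majority vote among its k nearest neighbours,
    using a Euclidean distance in which each feature is scaled by a weight."""

    def dist(a, b):
        return math.sqrt(
            sum(w * (p - q) ** 2 for w, p, q in zip(weights, a, b))
        )

    ranked = sorted(zip(train_x, train_y), key=lambda xy: dist(xy[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical data: feature 0 is informative, feature 1 is irrelevant.
train_x = [(0.0, 0.0), (1.0, 3.0)]
train_y = ["normal", "fault"]
query = (0.9, 0.5)

# Equal weights let the irrelevant feature dominate the distance.
print(weighted_knn_predict(train_x, train_y, query, weights=(1.0, 1.0), k=1))
# Down-weighting the irrelevant feature changes the prediction.
print(weighted_knn_predict(train_x, train_y, query, weights=(1.0, 0.0), k=1))
```

Setting a weight to zero removes that feature from the distance entirely, which is the limiting case of weakening an irrelevant feature.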

16.
In this paper, we propose an offline and online machine health assessment (MHA) methodology composed of feature extraction and selection, segmentation‐based fault severity evaluation, and classification steps. In the offline phase, the best representative feature of degradation is selected by a new filter‐based feature selection approach. The selected feature is further segmented by utilizing bottom‐up time series segmentation to discriminate machine health states, i.e., degradation levels. Then, the health state fault severity is extracted by a proposed segment evaluation approach based on within-segment rate‐of‐change (RoC) and coefficient of variation (CV) statistics. To train supervised classifiers, a priori knowledge about the availability of the labeled data set is needed. To overcome this limitation, the health state fault‐severity information is used to label (e.g., healthy, minor, medium, and severe) unlabeled raw condition monitoring (CM) data. In the online phase, the fault‐severity classification is carried out by a kernel‐based support vector machine (SVM) classifier. In addition to SVM, the k‐nearest neighbor (KNN) classifier is also used in a comparative analysis on the fault severity classification problem. Supervised classifiers are trained in the offline phase and tested in the online phase. Unlike traditional supervised approaches, the proposed method does not require any a priori knowledge about the availability of the labeled data set. The proposed methodology is validated on in-field point machine sliding‐chair degradation data to illustrate its effectiveness and applicability. The results show that the time series segmentation‐based failure severity detection and SVM‐based classification are promising.
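
The two per-segment statistics named above can be computed as in the following sketch. The abstract does not give the exact definitions, so both are assumptions: rate-of-change is taken as the average slope between the first and last points of a segment, and the coefficient of variation as the population standard deviation divided by the absolute mean. The function name `segment_stats` and the sample segments are hypothetical.

```python
import statistics


def segment_stats(segment):
    """Return (rate_of_change, coefficient_of_variation) for one segment.

    Assumes equally spaced samples and at least two points per segment.
    """
    # Average slope per time step across the segment (assumed definition).
    roc = (segment[-1] - segment[0]) / (len(segment) - 1)
    mean = statistics.fmean(segment)
    # Population standard deviation relative to the mean (assumed definition).
    cv = statistics.pstdev(segment) / abs(mean) if mean else float("inf")
    return roc, cv


# Hypothetical segments of a degradation feature, already segmented.
segments = [[1.0, 1.1, 1.0, 1.1], [1.2, 1.6, 2.1, 2.9]]
for seg in segments:
    roc, cv = segment_stats(seg)
    print(f"RoC = {roc:.3f}, CV = {cv:.3f}")
```

A segment with a steeper slope and larger relative spread would then be a candidate for a more severe health state, though the thresholds that map these statistics to labels are not specified in the abstract.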

17.
Melanoma is the deadliest skin cancer. Early diagnosis is a challenge for clinicians. Current algorithms for skin lesion classification focus mostly on segmentation and feature extraction. This article instead puts the emphasis on the learning process, testing the recognition performance of three different classifiers: support vector machine (SVM), artificial neural network and k‐nearest neighbor. Extensive experiments were run on a database of more than 5000 dermoscopy images. The results show that the SVM approach outperforms the other methods, reaching an average recognition rate of 82.5%, comparable with that obtained by skilled clinicians. If confirmed, our data suggest that this method may improve the classification results of computer‐assisted diagnosis of melanoma. © 2010 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 20, 316–322, 2010

18.
Classification of structural brain magnetic resonance (MR) images is a crucial task for many neurological phenotypes, and machine learning tools have increasingly been developed and applied to this problem in recent years. In this study, binary classification of T1‐weighted structural brain MR images is performed using state‐of‐the‐art machine learning algorithms when there is no information about the clinical context or specifics of neuroimaging. Image-derived features and clinical labels provided by the International Conference on Medical Image Computing and Computer‐Assisted Intervention 2014 machine learning challenge are used. These morphological summary features are obtained from four different datasets (each N > 70) with clinically relevant phenotypes and automatically extracted from the MR imaging scans using FreeSurfer, a freely distributed brain MR image processing software package. Widely used machine learning tools, namely back‐propagation neural network, self‐organizing maps, support vector machines and k‐nearest neighbors, are used as classifiers. Clinical prediction accuracy is obtained via cross‐validation on the training data (N = 150), and predictions are made on the test data (N = 100). Classification accuracy (the fraction of cases where the prediction is correct) and area under the ROC curve are used as the performance metrics. Both metrics are used for tuning the training hyperparameters and for evaluating the performance of the classifiers. The experiments revealed that support vector machines perform better than the other methods on clinical predictions using summary morphological features in the absence of any information about the phenotype. Prediction accuracy would increase greatly if contextual information were integrated into the system. © 2017 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 27, 89–97, 2017

19.
Currently, data stream classification plays a key role in big data analytics due to the enormous growth of data. Most existing classification methods use ensemble learning, which is trustworthy, but these methods are not effective at learning from imbalanced big data, and they also assume that all data are pre-classified. Another weakness of current methods is that evaluation takes a long time when the target data stream contains a high number of features. The main objective of this research is to develop a new method for incremental learning based on the proposed ant lion fuzzy-generative adversarial network model. The proposed model is implemented on the Spark architecture. For each data stream, the class output is computed at slave nodes by training a generative adversarial network with the back-propagation error based on fuzzy bound computation. This method overcomes the limitations of existing methods, as it can classify data streams that are partially or completely unlabeled while providing high scalability and efficiency. The results show that the proposed model outperforms state-of-the-art methods, with an accuracy of 0.861, a precision of 0.9328, and a minimal MSE of 0.0416.

20.
The aim of this research is to develop a mechanism to help medical practitioners predict and diagnose liver disease. Several systems have been proposed to help medical experts by diminishing error and increasing accuracy in diagnosing and predicting diseases. Among the many existing methods, only a few have considered the class imbalance issues of liver disorder datasets. Not all samples in liver disorder datasets are useful, and those that are not do not contribute to classifier learning. A few samples might be redundant, which can increase the computational cost and affect the performance of the classifier. In this paper, a model is proposed that combines noise filter, fuzzy sets, and boosting techniques (NFFBTs) for liver disease prediction. Firstly, the noise filter (NF) eliminates the outliers from the minority class and removes the outlier and redundant pairs from the majority class. Secondly, the fuzzy set concept is applied to handle uncertainty in the datasets. Thirdly, the AdaBoost boosting algorithm is trained with several learners, viz., random forest (RF), support vector machine (SVM), logistic regression (LR), and naive Bayes (NB). The proposed NFFBT prediction system was applied to two datasets (ILPD and MPRLPD); AdaBoost with RF yielded accuracies of 90.65% and 98.95% and F1 scores of 92.09% and 99.24% on the ILPD and MPRLPD datasets, respectively.
