Similar Documents
20 similar documents retrieved (search time: 171 ms)
1.
Datasets with an imbalanced class distribution are difficult to handle with standard classification algorithms, and dealing with class imbalance remains a challenging research problem in supervised learning. Most machine learning techniques are designed to operate on balanced datasets; therefore, various under-sampling, over-sampling, and hybrid strategies have been proposed to deal with imbalanced datasets, but highly skewed datasets still suffer from poor generalization and noise generation during resampling. To overcome these problems, this paper proposes a majority-clustering model for the classification of imbalanced datasets known as MCBC-SMOTE (Majority Clustering for Balanced Classification-SMOTE). The model converts the binary classification problem into a multi-class problem: the number of clusters for the majority class is determined with the elbow method, and the minority class is over-sampled to the average size of the majority clusters to generate a symmetrical class distribution. The proposed technique is cost-effective, reduces noise generation, and successfully mitigates the imbalance present both between and within classes. Evaluations on diverse real datasets show better classification results than state-of-the-art methodologies across several performance metrics.
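A minimal sketch of the cluster-then-rebalance idea described above, assuming scikit-learn and imbalanced-learn: the majority class is clustered with K-means, the cluster count is picked with a simple elbow heuristic, each majority cluster is treated as its own class, and the minority class is over-sampled toward the average cluster size. The elbow rule and SMOTE parameters are illustrative, not the paper's exact MCBC-SMOTE procedure.

```python
# Sketch only: not the authors' MCBC-SMOTE code.
import numpy as np
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

def elbow_k(X, k_max=10):
    """Pick the k after which the relative drop in K-means inertia becomes small."""
    inertias = np.array([KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                         for k in range(1, k_max + 1)])
    rel_drop = -np.diff(inertias) / inertias[:-1]   # improvement gained by adding one cluster
    small = np.where(rel_drop < 0.10)[0]            # first step that gains less than 10%
    return int(small[0]) + 1 if small.size else k_max

def mcbc_like_resample(X, y, majority_label=0):
    maj_mask = (y == majority_label)
    X_maj = X[maj_mask]
    k = elbow_k(X_maj)
    # Turn the binary problem into a multi-class one: each majority cluster is its own class.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_maj)
    y_multi = np.zeros(len(y), dtype=int)
    y_multi[maj_mask] = km.labels_ + 1              # majority clusters -> classes 1..k
    y_multi[~maj_mask] = 0                          # minority -> class 0
    target = int(maj_mask.sum() / k)                # average majority-cluster size
    n_min = int((~maj_mask).sum())
    smote = SMOTE(sampling_strategy={0: max(target, n_min)}, random_state=0)
    return smote.fit_resample(X, y_multi)
```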

2.
简川霞  叶荣  林浩  贺鑫  杜美剑 《包装工程》2020,41(21):251-260
Objective: To address the low recognition accuracy for the minority-class register state in printing mark images, caused by the imbalance of the training dataset, an improved SMOTE oversampling method for the training set is proposed to improve the recognition accuracy of the minority class. Methods: Texture features are extracted from the grey-level run-length matrix of the printing mark images to form the multi-dimensional input features of the model. Oversampling parameters for the minority samples are derived from the neighbourhood information of the minority samples, and different oversampling strategies are applied to the minority samples to balance the training set. A support vector machine model is then built on the balanced training set to recognize the printing register state. Results: Experiments show that, on different imbalanced printing datasets, the method achieves an average geometric mean of classification accuracy (Gmean) of 0.8507, a recall (Re) of 0.7192, and an area under the ROC curve (A) of 0.8549. Conclusion: The classification performance of the proposed method on different imbalanced printing register datasets is superior to the SMOTE, IS, and SVM methods used in the experiments.
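A rough sketch of the oversample-then-classify pipeline summarised above, using plain SMOTE from imbalanced-learn as a stand-in for the paper's neighbourhood-adaptive oversampling; the feature matrix X is assumed to already contain the grey-level run-length texture features, and the SVM parameters are illustrative.

```python
# Sketch: plain SMOTE stands in for the neighbourhood-adaptive oversampling described above.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

clf = Pipeline([
    ("smote", SMOTE(k_neighbors=5, random_state=0)),    # balance register / mis-register classes
    ("svm", SVC(kernel="rbf", C=10.0, gamma="scale")),   # register-state classifier
])
# X: (n_samples, n_glrlm_features), y: 0 = registered, 1 = mis-registered (minority class)
# scores = cross_val_score(clf, X, y, cv=5, scoring="recall")
```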

3.
Recently, machine learning algorithms have been used for the detection and classification of network attacks. Their performance has been evaluated on benchmark network intrusion datasets such as DARPA98, KDD’99, NSL-KDD, UNSW-NB15, and Caida DDoS. However, these datasets pose two major challenges: imbalanced data and high dimensionality. Achieving high accuracy for every attack type is what allows high accuracy on imbalanced data, while a large number of features increases the runtime load on the algorithms. A novel model is proposed in this paper to overcome these two concerns. The feature set of the model, which is tested on CICIDS2017, is first optimized with a genetic algorithm. This optimal feature set is then used to classify network attacks with six well-known classifiers, selected for high f1-score and g-mean in minimum time. Afterwards, a multilayer-perceptron-based ensemble learning approach is applied to improve the overall performance. The experimental results show that the suggested model is suitable both for feature selection and for classifying network attacks in an imbalanced dataset, with a high f1-score (0.91) and g-mean (0.99). Furthermore, it outperforms the base classifier models and voting procedures.
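A compact, illustrative genetic-algorithm feature-selection loop in the spirit of the first stage described above (not the paper's implementation): individuals are binary feature masks, the fitness is the cross-validated macro-F1 of a stand-in decision-tree classifier, and the population size, generation count, and mutation rate are arbitrary choices.

```python
# Illustrative GA feature selection; the base classifier and GA settings are assumptions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    if mask.sum() == 0:
        return 0.0
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3, scoring="f1_macro").mean()

def ga_select(X, y, pop_size=20, n_gen=30, p_mut=0.05):
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat))
    for _ in range(n_gen):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]   # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feat)                          # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_feat) < p_mut                      # bit-flip mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.vstack([parents, np.asarray(children)])
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[int(np.argmax(scores))].astype(bool)                # best feature mask
```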

4.
Classification of imbalanced data, in which one class is overwhelmed by the others, is a well-explored issue in the data mining and machine learning community. Imbalanced distributions occur naturally in real-world datasets and must be handled carefully to obtain meaningful insights; when data are imbalanced, traditional classifiers sacrifice performance and produce misclassifications. This paper suggests a fuzzy weighted nearest neighbour approach to deal with this issue. We adapt the "existing algorithm modification" solution for learning from imbalanced datasets, which classifies data without manipulating the natural class distribution, unlike popular data-balancing methods. The k-nearest neighbour method is a non-parametric classifier widely used in machine learning. Fuzzy classification with the nearest neighbour clarifies the degree to which an instance belongs to each class, and optimal weights combined with the improved nearest-neighbour concept help to classify imbalanced data correctly. The proposed hybrid approach accounts for the imbalanced nature of the data and reduces the inaccuracies that arise when original and traditional classifiers are applied. Results show that it outperforms existing fuzzy nearest neighbour and weighted neighbour strategies for imbalanced learning.

5.
In the field of fault diagnosis, the traditional support vector machine (SVM) algorithm cannot perform fault detection effectively on imbalanced data. To address this shortcoming, an SVM fault-detection algorithm based on spectral-clustering under-sampling of imbalanced data is proposed. The algorithm applies spectral clustering to the majority class in kernel space and then selects representative information points, thereby balancing the samples. The algorithm is applied to bearing fault detection and compared with other algorithms; the experimental results show that the proposed algorithm achieves stronger fault-detection performance than the other algorithms on imbalanced data.
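A minimal sketch of the spectral-clustering under-sampling idea, assuming scikit-learn, an RBF affinity as the kernel space, and cluster medoids as the "representative information points"; it is not the authors' exact algorithm.

```python
# Sketch only: RBF affinity and medoid selection are illustrative assumptions.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.svm import SVC

def spectral_undersample(X_major, n_keep, gamma=1.0, random_state=0):
    """Cluster the majority class in kernel space and keep one medoid per cluster."""
    sc = SpectralClustering(n_clusters=n_keep, affinity="rbf", gamma=gamma,
                            random_state=random_state).fit(X_major)
    keep = []
    for c in range(n_keep):
        members = np.where(sc.labels_ == c)[0]
        if members.size == 0:
            continue
        centroid = X_major[members].mean(axis=0)
        keep.append(members[np.argmin(np.linalg.norm(X_major[members] - centroid, axis=1))])
    return X_major[keep]

def fit_balanced_svm(X_major, X_minor, **svm_kwargs):
    X_maj_small = spectral_undersample(X_major, n_keep=len(X_minor))
    X = np.vstack([X_maj_small, X_minor])
    y = np.hstack([np.zeros(len(X_maj_small)), np.ones(len(X_minor))])
    return SVC(kernel="rbf", **svm_kwargs).fit(X, y)
```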

6.
In 2018, 1.76 million people worldwide died of lung cancer. Most of these deaths are due to late diagnosis, and early-stage diagnosis significantly increases the likelihood of successful treatment. Machine learning is a branch of artificial intelligence that allows computers to quickly identify patterns within complex and large datasets by learning from existing data. Machine-learning techniques have been improving rapidly and are increasingly used by medical professionals for the classification and diagnosis of early-stage disease; they are widely used in cancer diagnosis, and in particular in the diagnosis of lung cancer, due to the benefits they offer doctors and patients. In this context, we studied machine-learning techniques to increase the classification accuracy of lung cancer using the 32 × 56 numerical dataset from the Machine Learning Repository of the University of California, Irvine. In this study, the precision of the classification model was increased by the effective use of pre-processing methods rather than the direct use of classification algorithms: nine datasets were derived with pre-processing methods, and six machine-learning classification methods were used to achieve this improvement. The results suggest that the accuracy of the k-nearest neighbors algorithm is superior to random forest, naïve Bayes, logistic regression, decision tree, and support vector machines. The performance of the pre-processing methods was assessed on the lung cancer dataset; the most successful were Z-score (83% accuracy) among normalization methods, principal component analysis (87% accuracy) among dimensionality-reduction methods, and information gain (71% accuracy) among feature-selection methods.
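A rough reconstruction of the pre-processing comparison described above using scikit-learn pipelines; mutual information stands in for information gain, and the component and feature counts are illustrative, so results will differ from the paper's.

```python
# Sketch: compare normalization, dimensionality reduction, and feature selection before kNN.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

pipelines = {
    "z-score + kNN": Pipeline([("scale", StandardScaler()),
                               ("knn", KNeighborsClassifier(n_neighbors=5))]),
    "PCA + kNN":     Pipeline([("scale", StandardScaler()),
                               ("pca", PCA(n_components=0.95)),     # keep 95% of the variance
                               ("knn", KNeighborsClassifier(n_neighbors=5))]),
    "info-gain + kNN": Pipeline([("select", SelectKBest(mutual_info_classif, k=20)),
                                 ("knn", KNeighborsClassifier(n_neighbors=5))]),
}
# X, y = the UCI lung-cancer features and labels
# for name, pipe in pipelines.items():
#     print(name, cross_val_score(pipe, X, y, cv=5).mean())
```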

7.
Diabetes is one of the world’s most common diseases and is caused by continued high blood-sugar levels. The risk can be lowered if diabetes is detected at an early stage, and in recent years several machine learning models have been developed to predict the presence of diabetes early. In this paper, we propose an embedded machine learning model that combines a split-vote method and instance duplication to leverage the imbalanced PIMA Indian dataset and improve the prediction of diabetes. The proposed method uses both over-sampling and under-sampling, together with model weighting, to increase classification performance. Accuracy, Precision, Recall, and F1-score are used to evaluate the model. The accuracies obtained with K-Nearest Neighbor (kNN), Naïve Bayes (NB), Support Vector Machines (SVM), Random Forest (RF), Logistic Regression (LR), and Decision Trees (DT) were 89.32%, 91.44%, 95.78%, 89.3%, 81.76%, and 80.38%, respectively. The SVM model is the most efficient and exceeds existing machine learning-based works by 21.38%.

8.
Stroke and cerebral haemorrhage are the second leading cause of death in the world after ischaemic heart disease. In this work, a dataset containing medical, physiological and environmental tests for stroke was used to evaluate the efficacy of machine learning, deep learning, and a hybrid of the two on a Magnetic Resonance Imaging (MRI) dataset for cerebral haemorrhage. In the first dataset (medical records), two features, diabetes and obesity, were created on the basis of the values of the corresponding features. The t-Distributed Stochastic Neighbour Embedding algorithm was applied to represent the high-dimensional dataset in a low-dimensional space, while the Recursive Feature Elimination (RFE) algorithm was applied to rank the features by priority and correlation with the target feature and to remove the unimportant ones. The features were fed into several classification algorithms: Support Vector Machine (SVM), K Nearest Neighbours (KNN), Decision Tree, Random Forest, and Multilayer Perceptron. All algorithms achieved strong results; the Random Forest algorithm performed best, reaching an overall accuracy of 99% and classifying stroke cases with Precision, Recall and F1-score of 98%, 100% and 99%, respectively. In the second dataset, the MRI images were evaluated using the AlexNet model and an AlexNet + SVM hybrid. The hybrid AlexNet + SVM model performed better than AlexNet alone, reaching accuracy, sensitivity, specificity and Area Under the Curve (AUC) of 99.9%, 100%, 99.80% and 99.86%, respectively.
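A sketch of the tabular-data branch described above (t-SNE for visualisation, RFE for feature ranking, random forest for classification), assuming scikit-learn and an already-encoded feature matrix; the parameter values are illustrative.

```python
# Sketch: t-SNE view, RFE feature ranking, and a random-forest classifier on stroke records.
from sklearn.manifold import TSNE
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def embed_2d(X):
    """Low-dimensional view of the records, as in the t-SNE step described above."""
    return TSNE(n_components=2, random_state=0).fit_transform(X)

def stroke_pipeline(X, y, n_keep=10):
    """Rank features with RFE around a random forest, then train/evaluate on the kept ones."""
    rfe = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
              n_features_to_select=n_keep)
    X_sel = rfe.fit_transform(X, y)
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    return rfe.ranking_, classification_report(y_te, clf.predict(X_te))
```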

9.
The Internet of Things (IoT) defines a network of devices connected to the internet that share massive amounts of data with each other and with a central location. Because these IoT devices are connected to a network, they are prone to attacks. Various management tasks and network operations, such as security, intrusion detection, Quality-of-Service provisioning, performance monitoring, resource provisioning, and traffic engineering, require traffic classification. Because traditional classification schemes, such as port-based and payload-based methods, are ineffective, researchers have proposed machine learning-based traffic classification systems built on shallow neural networks; however, such models tend to misclassify internet traffic due to improper feature selection. In this research, an efficient multilayer deep learning-based classification system is presented to overcome these challenges. To examine the performance of the proposed technique, the Moore dataset is used for training the classifier. The proposed scheme takes the pre-processed data and extracts flow features using a deep neural network (DNN); a maximum entropy classifier is then used to classify the internet traffic. The experimental results show that the proposed hybrid deep learning algorithm is effective and achieves high accuracy for internet traffic classification, i.e., 99.23%, the highest accuracy compared with the support vector machine (SVM)-based and k-nearest neighbours (KNN)-based classification techniques.

10.
To address the shortcomings of the traditional support vector machine (SVM) algorithm in rolling-bearing fault diagnosis, namely poor performance on imbalanced datasets, sensitivity to noise, and strong dependence on its own parameters, an oversampling algorithm based on sample characteristics (OABSC) is proposed. The algorithm uses improved agglomerative hierarchical clustering to divide the fault samples into multiple clusters. Within each cluster, "suspected noise points" are identified and removed by jointly considering sample distance and nearest-neighbourhood density, and the remaining samples are ranked by information content; then a K*-information nearest-neighbourhood (K*INN) oversampling algorithm is applied within each cluster to synthesize new samples and balance the dataset. Bearing fault scenarios under three different imbalance ratios are simulated, and particle swarm optimization is used to tune the parameters of the SVM classifier. Experiments show that, compared with existing algorithms, OABSC is better suited to bearing fault diagnosis where the data are imbalanced and distributed over multiple clusters, achieving higher G-mean and AUC values and stronger robustness.

11.
With the development of artificial intelligence-related technologies such as deep learning, various organizations, including governments, are making efforts to generate and manage big data for use in artificial intelligence. However, big data are difficult to acquire because of social problems and restrictions such as personal-information leakage, and introducing deep learning is problematic in fields that lack sufficient training data. Therefore, this study proposes a mixed contour data augmentation technique, a data augmentation technique based on contour images, to address the lack of data. ResNet, a well-known convolutional neural network (CNN) architecture, and CIFAR-10, a benchmark dataset, are used for the experimental performance evaluation. To show that a large performance improvement can be achieved even with a small training dataset, the training-set ratio was varied over 70%, 50%, and 30% for comparative analysis. Applying the mixed contour data augmentation technique yielded a classification-accuracy improvement of up to 4.64% and high accuracy even with a small dataset. The results on the benchmark dataset suggest that the mixed contour data augmentation technique can be applied in various fields.
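One plausible reading of a "mixed contour" augmentation, sketched with OpenCV: each image is blended with its own edge (contour) map. The Canny thresholds, blend weight, and the blending scheme itself are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative contour-mix augmentation; parameters are assumptions, not the paper's settings.
import cv2
import numpy as np

def mixed_contour(image, alpha=0.5, low=100, high=200):
    """image: HxWx3 uint8 array. Returns the image blended with its own contour map."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)                        # contour / edge map
    edges_bgr = cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR)
    return cv2.addWeighted(image, 1.0 - alpha, edges_bgr, alpha, 0)

def augment_batch(images, alpha=0.5):
    """Double a small training batch by appending contour-mixed copies (labels duplicated likewise)."""
    mixed = np.stack([mixed_contour(img, alpha) for img in images])
    return np.concatenate([images, mixed], axis=0)
```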

12.
Supervised machine learning approaches are effective in text mining, but their success relies heavily on manually annotated corpora. However, annotated biomedical event corpora are limited, and the available datasets contain too few examples for training classifiers; the common cure is to seek large numbers of training samples from unlabeled data, but such data often contain many mislabeled samples, which degrade classifier performance. Therefore, this study proposes a novel error-data detection approach for reducing noise in unlabeled biomedical event data. First, we construct a mislabeled dataset through error-data analysis on the development set. Vector representations of sample pairs are then obtained by means of sequence patterns and a joint model of a convolutional neural network and a long short-term memory recurrent neural network. A sample identification strategy is then proposed that performs error detection on unlabeled data based on the pair representations; the selected samples are added to enrich the training dataset and improve classification performance. On the BioNLP Shared Task GENIA, the experimental results indicate that the proposed approach is competent at extracting biomedical events from the literature: it can effectively filter noisy examples and build a satisfactory prediction model.

13.
In recent years, the enormous volume of medical data generated by smart healthcare applications has required the development of big-data classification methodologies. Medical data classification can be used to visualize hidden patterns and detect the presence of disease. In this article, we present an efficient multi-kernel support vector machine (MKSVM) with a fruit fly optimization algorithm (FFOA) for disease classification. First, FFOA is employed to choose the best features from the available feature set; the selected features are then processed and provided to the MKSVM for classification. The proposed chronic kidney disease (CKD) classification method was simulated in MATLAB and tested on benchmark datasets from the UCI machine learning repository: chronic kidney, Cleveland, Hungarian, and Switzerland. Performance is evaluated by accuracy, sensitivity, specificity, positive predictive value, negative predictive value, false positive rate, and false negative rate. The results show that the proposed CKD classification method achieves maximum classification precision of 98.5% on the chronic kidney dataset, 90.42904% on Cleveland, 89.11565% on Hungarian, and 86.17886% on Switzerland, outperforming the existing hybrid-kernel SVM, fuzzy min-max GSO neural network, and SVM methods.

14.
Learning from imbalanced data is one of the most challenging problems in binary classification, and it has gained importance in recent years. When the class distribution is imbalanced, classical machine learning algorithms tend to favour the majority class and disregard the minority; the accuracy may be high, but the model cannot recognize minority-class instances, leading to many misclassifications. Different methods have been proposed in the literature to handle the imbalance problem, but most are complicated and tend to introduce unnecessary noise. In this paper, we propose a simple oversampling method based on the multivariate Gaussian distribution and K-means clustering, called GK-Means. The new method aims to avoid generating noise and to control imbalance both between and within classes. Experiments were carried out with six classifiers and four oversampling methods. Results on different imbalanced datasets show that the proposed GK-Means outperforms the other oversampling methods and improves classification performance as measured by F1-score and accuracy.
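A minimal sketch of the GK-Means idea as described: cluster the minority class with K-means, fit a multivariate Gaussian per cluster, and draw synthetic minority samples from each. The cluster count, covariance regularisation, and proportional allocation are illustrative assumptions.

```python
# Sketch only: cluster-wise Gaussian oversampling of the minority class.
import numpy as np
from sklearn.cluster import KMeans

def gk_means_oversample(X_min, n_new, k=3, reg=1e-6, random_state=0):
    """Generate n_new synthetic minority samples from per-cluster Gaussian fits."""
    rng = np.random.default_rng(random_state)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_min)
    counts = np.bincount(km.labels_, minlength=k)
    quotas = np.round(n_new * counts / counts.sum()).astype(int)   # proportional allocation
    synthetic = []
    for c, quota in enumerate(quotas):
        members = X_min[km.labels_ == c]
        mean = members.mean(axis=0)
        cov = np.cov(members, rowvar=False) + reg * np.eye(X_min.shape[1])  # regularised covariance
        synthetic.append(rng.multivariate_normal(mean, cov, size=quota))
    return np.vstack(synthetic)
```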

15.
In machine learning and data mining, feature selection (FS) is a traditional and complicated optimization problem; since the run time grows exponentially, FS is treated as an NP-hard problem. The effort to build a new FS solution is motivated by the ongoing need for an efficient FS framework and by the success of swarm-based methods in other optimization scenarios. This paper presents two binary variants of the Hunger Games Search Optimization (HGSO) algorithm based on V- and S-shaped transfer functions within a wrapper FS model for choosing the best features from a large dataset. The proposed technique transforms the continuous HGSO into binary variants using V- and S-shaped transfer functions (BHGSO-V and BHGSO-S). To validate accuracy, 16 well-known UCI datasets are considered and the results are compared with different state-of-the-art metaheuristic binary algorithms. The findings demonstrate that BHGSO-V achieves better performance in terms of the number of selected features, classification accuracy, run time, and fitness values than other state-of-the-art algorithms, showing that it can reduce dimensionality and choose the most helpful features for classification problems. BHGSO-V achieves 95% average classification accuracy on most of the datasets, with a run time of less than 5 s for low- and medium-dimensional datasets and less than 10 s for high-dimensional datasets.
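The S- and V-shaped transfer functions mentioned above, in their common textbook forms (the paper may use slightly different variants), showing how a continuous search-agent position is mapped to a binary feature mask.

```python
# Standard S- and V-shaped transfer functions for binarising a continuous metaheuristic.
import numpy as np

def s_shaped(x):
    """S-shaped transfer: probability that the feature bit is set to 1."""
    return 1.0 / (1.0 + np.exp(-x))

def v_shaped(x):
    """V-shaped transfer: probability that the current feature bit is flipped."""
    return np.abs(np.tanh(x))

def binarise(position, current_bits, rng, kind="v"):
    """Turn a continuous position vector into a binary feature mask."""
    r = rng.random(position.shape)
    if kind == "s":
        return (r < s_shaped(position)).astype(int)       # sample each bit directly
    flip = r < v_shaped(position)                         # flip the current bit with prob. V(x)
    return np.where(flip, 1 - current_bits, current_bits)
```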

16.
The outbreak of the pandemic caused by Coronavirus Disease 2019 (COVID-19) affected the daily activities of people across the globe. During the COVID-19 outbreak and the successive lockdowns, Twitter was heavily used and the number of tweets regarding COVID-19 increased tremendously. Several studies used Sentiment Analysis (SA) to analyze the emotions expressed in tweets about COVID-19. Therefore, in the current study, a new Artificial Bee Colony (ABC) with Machine Learning-driven SA (ABCML-SA) model is developed for sentiment analysis of COVID-19 Twitter data. The prime focus of the presented ABCML-SA model is to recognize the sentiments expressed in tweets about COVID-19. It involves data pre-processing at the initial stage, followed by n-gram-based feature extraction to derive the feature vectors. For identification and classification of the sentiments, the Support Vector Machine (SVM) model is exploited, and the ABC algorithm is applied to fine-tune the SVM parameters. To demonstrate the improved performance of the proposed ABCML-SA model, a sequence of simulations was conducted; the comparative assessment confirmed its effectiveness over other approaches.

17.
Because materials data are characterized by small sample sizes, high dimensionality, and high noise, machine learning models built on them often produce results that are inconsistent with domain experts' understanding. Developing machine learning models that embed materials domain knowledge across the whole machine learning workflow is an effective way to solve this problem, and the accuracy of materials data directly affects the reliability of data-driven materials property prediction. Focusing on the data pre-processing stage of the machine learning workflow, this study proposes a data-accuracy checking method that incorporates materials domain knowledge. The method first builds a materials domain knowledge base from materials experts' understanding. This knowledge base is then combined with data-driven accuracy checking to examine the materials dataset from both the data and domain-knowledge perspectives: single-dimension correctness checking based on descriptor value rules, multi-dimension consistency checking based on descriptor correlation rules, and full-dimension reliability checking based on a multi-dimensional similar-sample identification strategy. Anomalous data identified at each stage are corrected with materials domain knowledge, and domain knowledge is embedded throughout the accuracy-checking process to ensure that the dataset is highly accurate from the start. Experiments on a dataset for predicting the activation energy of NASICON-type solid electrolytes show that the proposed method can effectively identify anomalous data and correct them reasonably. Compared with the original dataset, the prediction accuracy of six machine learning models trained on the corrected dataset improved to varying degrees; on the best model, R2 increased by 33%.

18.
To address the problems of high memory consumption, low classification accuracy, and poor generalization when extreme learning machines process high-dimensional data, a batch hierarchical-encoding extreme learning machine algorithm is proposed. First, the dataset is processed in batches to reduce the data dimensionality and the input complexity; then a multi-layer autoencoder structure performs unsupervised encoding on each batch to extract deep features; finally, manifold regularization is used to build a manifold classifier with an inheritance factor, preserving data integrity and improving generalization. Experimental results show that the method is simple to implement and achieves classification accuracies of 92.16%, 99.35%, and 98.86% on the NORB, MNIST, and USPS datasets, respectively; compared with other extreme learning machine algorithms, it has clear advantages in reducing computational complexity and CPU memory consumption.

19.
M. Naresh  S. Sikdar  J. Pal 《Strain》2023,59(5):e12439
A vibration-data-based machine learning architecture is designed for structural health monitoring (SHM) of a steel plane-frame structure. The architecture uses a Bag-of-Features algorithm that extracts speeded-up robust features (SURF) from time-frequency scalogram images of the registered vibration data. The discriminative image features are quantised into a visual vocabulary using K-means clustering, and a support vector machine (SVM) is trained on these features to distinguish the undamaged case from multiple damage cases of the frame structure. The potential of the architecture is tested on an unseen dataset that was not used in training, as well as on datasets from entirely new damage states close to the existing (i.e., trained) damage classes. The results are compared with those obtained using three other combinations of features and learning algorithms: (i) histogram of oriented gradients (HOG) features with SVM, (ii) SURF features with k-nearest neighbours (KNN), and (iii) HOG features with KNN. To examine the robustness of the approach, the study is further extended to include environmental variability along with the localisation and quantification of damage. The experimental results show that the machine learning architecture can effectively classify the undamaged and different joint-damage classes with high testing accuracy, indicating its SHM potential for such frame structures.
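A sketch of the bag-of-visual-features stage described above, with ORB descriptors as a freely available stand-in for SURF (SURF requires opencv-contrib); the scalogram images are assumed to be pre-computed 8-bit grayscale arrays, and the vocabulary size is illustrative.

```python
# Sketch: visual vocabulary by K-means over local descriptors, histogram encoding, SVM classifier.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_vocabulary(images, n_words=100, random_state=0):
    """Cluster all local descriptors into a visual vocabulary."""
    orb = cv2.ORB_create()
    all_desc = []
    for img in images:
        _, desc = orb.detectAndCompute(img, None)
        if desc is not None:
            all_desc.append(desc)
    all_desc = np.vstack(all_desc).astype(np.float32)
    return KMeans(n_clusters=n_words, n_init=10, random_state=random_state).fit(all_desc)

def bof_histogram(img, vocab):
    """Encode one scalogram image as a normalised visual-word histogram."""
    orb = cv2.ORB_create()
    _, desc = orb.detectAndCompute(img, None)
    if desc is None:
        return np.zeros(vocab.n_clusters)
    words = vocab.predict(desc.astype(np.float32))
    return np.bincount(words, minlength=vocab.n_clusters) / len(words)

# vocab = build_vocabulary(train_scalograms)
# X_train = np.array([bof_histogram(img, vocab) for img in train_scalograms])
# clf = SVC(kernel="rbf").fit(X_train, train_labels)
```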

20.
Quality data in the manufacturing process are characterized by mixed types, uneven distribution, the curse of dimensionality, and data coupling. To apply massive manufacturing quality data effectively to quality analysis in a manufacturing enterprise, a data pre-processing algorithm based on equivalence relations is employed to select features from the hybrid data and preprocess them, and KML-SVM (an optimised kernel-based hybrid manifold learning and support vector machine algorithm) is proposed. KML addresses the curse of dimensionality in manufacturing-process quality data, while SVM classifies and predicts the low-dimensional embedded data; the support vector machine kernel function is optimised to maximise classification accuracy. Actual manufacturing-process data from AVIC Shenyang Liming Aero-Engine Group Corporation Ltd are used to simulate and verify the proposed algorithm.
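A generic manifold-learning-plus-SVM pipeline in the spirit of KML-SVM, sketched with scikit-learn; Isomap stands in for the paper's optimised kernel-based hybrid manifold learning, and the embedding dimension and kernel grid are illustrative assumptions.

```python
# Sketch: low-dimensional embedding followed by an SVM with kernel-parameter optimisation.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import Isomap
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("embed", Isomap(n_neighbors=10, n_components=5)),   # low-dimensional embedding
    ("svm", SVC()),
])
param_grid = {"svm__kernel": ["rbf", "poly"],
              "svm__C": [1, 10, 100],
              "svm__gamma": ["scale", 0.1, 0.01]}         # kernel optimisation via grid search
# X_quality, y_quality = manufacturing-process quality features and labels
# search = GridSearchCV(pipe, param_grid, cv=5).fit(X_quality, y_quality)
```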
