Similar Documents
20 similar documents found (search time: 31 ms)
1.
霍纬纲, 高小霞. 《控制与决策》 (Control and Decision), 2012, 27(12): 1833-1838
A fuzzy associative classification method suited to multi-class imbalanced distributions is proposed. The method takes as its genetic-optimization objectives the minimization of the weighted classification error rate of the training samples during the AdaBoost.M1W ensemble-learning iterations, together with the number of fuzzy associative classification rules in each sub-classifier and the number of fuzzy terms those rules contain, thereby achieving a close integration of AdaBoost.M1W with the fuzzy associative classification modeling process. Comparative experiments on five multi-class imbalanced UCI benchmark datasets against existing data preprocessing methods for imbalanced classification show that the proposed method significantly improves the classification performance of fuzzy associative classification models under multi-class imbalance.

2.
While there is an ample amount of medical information available for data mining, many of the datasets are unfortunately incomplete, missing relevant values needed by many machine learning algorithms. Several approaches have been proposed for the imputation of missing values, using various reasoning steps to provide estimations from the observed data. One of the important steps in data mining is data preprocessing, where unrepresentative data is filtered out of the data to be mined. However, none of the related studies about missing value imputation consider performing a data preprocessing step before imputation. Therefore, the aim of this study is to examine the effect of two preprocessing steps, feature and instance selection, on missing value imputation. Specifically, eight different medical-related datasets are used, containing categorical, numerical and mixed types of data. Our experimental results show that imputation after instance selection can produce better classification performance than imputation alone. In addition, we demonstrate that imputation after feature selection does not have a positive impact on the imputation result.
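The pipeline order studied above (instance selection first, imputation second) can be sketched as follows. The nearest-to-centroid selection rule and the column-mean imputation are illustrative stand-ins (assumptions for this sketch), not the instance-selection or imputation algorithms evaluated in the paper.

```python
import math

def class_centroids(rows, labels):
    """Per-class, per-column means over observed (non-None) values."""
    centroids = {}
    for lab in set(labels):
        cols = []
        for c in range(len(rows[0])):
            vals = [r[c] for r, l in zip(rows, labels) if l == lab and r[c] is not None]
            cols.append(sum(vals) / len(vals) if vals else 0.0)
        centroids[lab] = cols
    return centroids

def select_instances(rows, labels, keep_ratio=0.8):
    """Toy instance selection: keep the rows closest to their own class
    centroid, measured on observed values only."""
    centroids = class_centroids(rows, labels)
    def dist(i):
        pairs = [(x, m) for x, m in zip(rows[i], centroids[labels[i]]) if x is not None]
        return math.sqrt(sum((x - m) ** 2 for x, m in pairs))
    order = sorted(range(len(rows)), key=dist)
    keep = sorted(order[: max(1, int(len(rows) * keep_ratio))])
    return [rows[i] for i in keep], [labels[i] for i in keep]

def mean_impute(rows):
    """Fill missing values (None) with column means of the selected rows."""
    out = [list(r) for r in rows]
    for c in range(len(rows[0])):
        seen = [r[c] for r in rows if r[c] is not None]
        mean = sum(seen) / len(seen)
        for r in out:
            if r[c] is None:
                r[c] = mean
    return out

rows = [[1.0, None], [1.1, 2.0], [0.9, 2.2], [9.0, None], [5.0, 3.0], [5.2, 2.8]]
labels = [0, 0, 0, 0, 1, 1]
sel_rows, sel_labels = select_instances(rows, labels)  # drops the outlier first
imputed = mean_impute(sel_rows)                        # then imputes
```

Because the outlying row is filtered out before the column means are computed, the imputed values are not distorted by it, which is the effect the study measures.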

3.
Data arising in many real-world domains are often multi-class and imbalanced. In multi-class imbalanced classification, problems such as class overlap, noise, and multiple minority classes degrade classifier capability, and effectively solving multi-class imbalance has become an important research topic in machine learning and data mining. Based on the recent literature on multi-class imbalanced classification, this paper analyzes and summarizes methods from two perspectives, data preprocessing and algorithm-level classification, and examines the algorithms in detail in terms of strengths, weaknesses, and datasets used. Among data preprocessing methods, oversampling, undersampling, hybrid sampling, and feature selection are introduced, and the performance of algorithms evaluated on the same datasets is compared. Algorithm-level methods are introduced and analyzed from three aspects: base-classifier optimization, ensemble learning, and multi-class decomposition. Finally, future research directions for multi-class imbalanced data classification are summarized.

4.
Combining the spatial and spectral features of hyperspectral remote sensing images in supervised classification can effectively improve classification time and accuracy. In this study, a spatial-information extraction method, the watershed transform, was combined with the Extreme Learning Machine (ELM) and Support Vector Machine (SVM) methods, and the classification results obtained with and without the spatial features were evaluated and compared. Two hyperspectral datasets, the ROSIS data of Pavia University and the Hyperion data of the Okavango Delta (Botswana), were selected to test the methods. After preprocessing, training samples were selected from the images as reference areas for each class, and the spectral features of each class were analyzed. The two classification methods were applied to the hyperspectral datasets and the corresponding classification results were obtained. Based on validation samples selected from the images, the classification results were evaluated using the confusion matrix and the execution times. The spectral and spatial features were then combined to classify the data. The results show that the ELM is superior to the SVM in classification time and precision, and that introducing spatial features into the classification process effectively improves classification accuracy.

5.

In the fields of pattern recognition and machine learning, the use of data preprocessing algorithms has been increasing in recent years to achieve high classification performance. In particular, applying a data preprocessing method before the classification algorithm has become essential for medical datasets with nonlinear and imbalanced data distributions. In this study, a new data preprocessing method is proposed for the classification of the Parkinson, hepatitis, Pima Indians, single proton emission computed tomography (SPECT) heart, and thoracic surgery medical datasets, all of which exhibit nonlinear and imbalanced data distributions. These datasets were taken from the UCI machine learning repository. The proposed data preprocessing method consists of three steps. In the first step, the cluster centers of each attribute are calculated using the k-means, fuzzy c-means, and mean shift clustering algorithms on each of the five medical datasets. In the second step, the absolute differences between the data in each attribute and the cluster centers are calculated, and the average of these differences is then computed for each attribute. In the final step, the weighting coefficients are obtained from the mean difference to the cluster centers, and weighting is performed by multiplying the obtained coefficients by the attribute values in the dataset. Three attribute weighting methods are proposed: (1) similarity-based attribute weighting in k-means clustering, (2) similarity-based attribute weighting in fuzzy c-means clustering, and (3) similarity-based attribute weighting in mean shift clustering. The aim is to gather the data in each class together through the proposed attribute weighting methods and to reduce the variance within each class. By reducing the within-class variance, the data in each class are pulled together while the discrimination between classes is further increased. To allow comparison with other methods in the literature, random subsampling is used to handle imbalanced dataset classification. After the attribute weighting process, four classification algorithms, namely linear discriminant analysis, the k-nearest neighbor classifier, support vector machine, and random forest classifier, are used to classify the imbalanced medical datasets. To evaluate the performance of the proposed models, classification accuracy, precision, recall, area under the ROC curve, κ value, and F-measure are used. For training and testing the classifier models, three methods are used: the 50-50% train-test holdout, the 60-40% train-test holdout, and tenfold cross-validation. The experimental results show that the proposed attribute weighting methods obtain higher classification performance than the random subsampling method when classifying imbalanced medical datasets.
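One plausible reading of the three weighting steps described above can be sketched with a tiny 1-D k-means. The abstract leaves the exact direction of the final division ambiguous, so the formula below (mean absolute difference divided by the mean cluster center) is an assumption made only for illustration.

```python
def kmeans_1d(values, k=2, iters=20):
    """Minimal 1-D k-means returning k cluster centers."""
    step = max(1, len(values) // k)
    centers = sorted(values)[::step][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            j = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[j].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def attribute_weights(columns):
    """Steps 1-3 under one reading of the abstract: per attribute, find the
    cluster centers, take the mean absolute difference to the nearest center,
    then divide by the mean center. The division direction is an assumption."""
    weights = []
    for col in columns:
        centers = kmeans_1d(col)
        mean_diff = sum(min(abs(v - c) for c in centers) for v in col) / len(col)
        weights.append(mean_diff / (sum(centers) / len(centers)))
    return weights

def apply_weights(columns, weights):
    """Final step: multiply each attribute by its weighting coefficient."""
    return [[v * w for v in col] for col, w in zip(columns, weights)]
```

An attribute whose values sit exactly on its cluster centers gets weight 0, while a more dispersed attribute gets a larger weight, which is the variance-shrinking behavior the abstract describes.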


6.
Feature selection is an important part of data preprocessing when support vector machines are used for pattern classification, and effective feature selection largely determines classifier performance. Based on how the mean and variance of each feature component affect classification, a simple method is proposed that selects features according to classification weights in order to improve SVM performance, and two concrete implementation schemes are devised. Simulation experiments on three commonly used datasets verify the effectiveness of the method.
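The abstract does not state the exact weight formula; a Fisher-score-style weight built from per-class means and variances is one standard mean/variance-based choice and serves here only as a hedged illustration of the idea.

```python
def fisher_weights(X, y):
    """Fisher-style feature weight: between-class spread of the per-class
    means over the pooled within-class variance. A standard mean/variance
    weight; the paper's exact formula is not given in the abstract."""
    classes = sorted(set(y))
    weights = []
    for j in range(len(X[0])):
        overall = sum(row[j] for row in X) / len(X)
        between = within = 0.0
        for c in classes:
            vals = [row[j] for row, lab in zip(X, y) if lab == c]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals)
            between += len(vals) * (mu - overall) ** 2
            within += len(vals) * var
        weights.append(between / within if within > 0 else 0.0)
    return weights

def select_top_k(X, y, k):
    """Keep the indices of the k highest-weight features."""
    w = fisher_weights(X, y)
    return sorted(sorted(range(len(w)), key=lambda j: -w[j])[:k])

# Feature 0 separates the two classes; feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.1], [1.1, 0.9]]
y = [0, 0, 1, 1]
w = fisher_weights(X, y)
```

The selected feature subset would then be fed to the SVM in place of the full feature vector.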

7.
In recent years, huge volumes of healthcare data have been generated in various forms. Advances in medical imaging have made biomedical image acquisition easier and quicker. Owing to such massive generation of big data, the use of new methods based on Big Data Analytics (BDA), Machine Learning (ML), and Artificial Intelligence (AI) has become essential. In this context, the current work develops a new Big Data Analytics with Cat Swarm Optimization based Deep Learning (BDA-CSODL) technique for medical image classification in an Apache Spark environment. The aim of the proposed BDA-CSODL technique is to classify medical images and diagnose disease accurately. The BDA-CSODL technique involves several stages of operation, such as preprocessing, segmentation, feature extraction, and classification. In addition, it follows a multi-level thresholding-based image segmentation approach for the detection of infected regions in medical images. A deep convolutional neural network based on Inception v3 is used as the feature extractor, and the Stochastic Gradient Descent (SGD) model is used for parameter tuning. Furthermore, a CSO with Long Short-Term Memory (CSO-LSTM) model is employed as the classifier to determine the appropriate class labels. Both the SGD and CSO design approaches help improve the overall image classification performance of the proposed BDA-CSODL technique. A wide range of simulations was conducted on benchmark medical image datasets, and the comprehensive comparative results demonstrate the superiority of the proposed BDA-CSODL technique under different measures.

8.
《遥感技术与应用》 (Remote Sensing Technology and Application), 2013, 28(5): 766-772
Glaciers are important natural freshwater resources with great potential, and they play a vital role in the balance and stability of the regional ecological environment. This study acquired airborne hyperspectral data over the Zhongxi-1 Glacier in August 2011. First, data preprocessing, including radiometric calibration, atmospheric correction, and geometric correction, was performed on the hyperspectral data. Second, principal component analysis (PCA) and minimum noise fraction (MNF) transformation were applied for dimensionality reduction, respectively. Third, six classification methods, i.e. maximum likelihood, minimum distance, Mahalanobis distance, spectral angle, binary encoding, and spectral information divergence, were applied to the two reduced datasets, and the results of the different classification methods were compared to determine the optimal dimensionality-reduction method and the optimal classification method. Finally, the hyperspectral data for glacier classification were compared with HJ satellite multispectral data. The results show that the classification accuracy of the PCA-transformed hyperspectral data is higher than that of the MNF-transformed data; for the PCA-transformed dataset, the Mahalanobis distance, maximum likelihood, and minimum distance methods produced better classification results than the others, while for the MNF-transformed dataset, the spectral angle and spectral information divergence methods were better than the others.

9.
Building on conventional static facial expression recognition research, a simple face-cropping method is proposed, after which a shallow convolutional neural network extracts features and performs expression recognition. Using CK+ and JAFFE as experimental datasets, experiments are conducted comparing preprocessing effects, applying data augmentation, recognizing individual expressions, and performing cross-dataset six-class classification. The results show that the proposed expression recognition method is clearly effective and more robust when the amount of data is small.

10.
To accurately predict defaulting borrowers in high-dimensional, mixed-type, imbalanced credit data, both the dimensionality-reduction preprocessing and the classification algorithm are optimized, and a one-class K-Nearest Neighbor (KNN) mean-distance algorithm based on Principal Component Analysis of Mixed Data (PCAmix) preprocessing is proposed. Since traditional Principal Component Analysis (PCA) cannot directly handle qualitative variables, PCAmix is used to reduce the dimensionality of the data. To avoid the poor performance of imbalanced data in binary classification models, the ideas of one-class classification and KNN neighbor computation are adopted, training the model on the majority class only. The Bootstrap method is used to find the optimal decision boundary, separating positive and negative samples as far as possible and ultimately predicting customers' default risk accurately. Validation on the German and Default personal credit-scoring datasets from the UCI repository shows that the algorithm classifies high-dimensional, mixed-type, imbalanced credit data well.
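The one-class KNN mean-distance idea above, training on the majority class only, can be sketched as follows. The leave-one-out quantile threshold is a simple stand-in for the paper's Bootstrap-tuned decision boundary, and the toy data are invented for the example.

```python
import math

def knn_mean_distance(point, train, k=3):
    """One-class score: mean Euclidean distance from `point` to its k
    nearest majority-class training samples."""
    dists = sorted(math.dist(point, t) for t in train)
    return sum(dists[:k]) / k

def fit_threshold(train, k=3, quantile=0.95):
    """Decision boundary from leave-one-out scores on the majority class.
    The paper tunes this boundary with Bootstrap; a quantile of the
    training scores stands in for it here."""
    scores = sorted(
        knn_mean_distance(t, train[:i] + train[i + 1:], k)
        for i, t in enumerate(train)
    )
    return scores[min(len(scores) - 1, int(quantile * len(scores)))]

# Train on the majority (non-defaulting) class only.
majority = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.1, 0.0], [0.0, 0.1]]
thr = fit_threshold(majority, k=2)

def is_default(p):
    """Far from the majority class => predicted minority (defaulter)."""
    return knn_mean_distance(p, majority, k=2) > thr
```

In the paper the scores would be computed on PCAmix-reduced features rather than raw columns.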

11.
Binary classification is one of the most common problems in machine learning: it consists in predicting whether a given element belongs to a particular class. In this paper, a new algorithm for binary classification is proposed using a hypergraph representation. The method is agnostic to data representation, can work with multiple data sources or in non-metric spaces, and accommodates missing values. As a result, it drastically reduces the need for data preprocessing or feature engineering. Each element to be classified is partitioned according to its interactions with the training set. For each class, a seminorm over the training-set partition is learnt to represent the distribution of evidence supporting that class. Empirical validation demonstrates its high potential on a wide range of well-known datasets, and the results are compared to the state of the art. The time complexity is given and empirically validated. Its robustness with regard to hyperparameter sensitivity is studied and compared to standard classification methods. Finally, the limitations of the model space are discussed and some potential solutions are proposed.

12.
杨鹤标, 王健. 《计算机工程》 (Computer Engineering), 2010, 36(20): 52-54
A classification model is proposed for multi-relational, multi-class imbalanced data. In the preprocessing stage, error-correcting output codes (ECOC) for the target classes and virtual joins between the target relation and the background relations are built, attribute aggregation is performed, and the training and validation sets are partitioned. In the training stage, following a one-vs-rest decomposition and using the CrossMine algorithm, multiple sub-classifiers are constructed, and each sub-classifier is evaluated and validated with AUC. In the validation stage, the Hamming distances between the target-class ECOC codewords and the concatenated outputs of the sub-classifiers are compared, and the target class with the minimum Hamming distance is chosen as the final classification. Experiments on synthetic and real data verify the model's effectiveness and classification performance.
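The validation-stage decoding rule, picking the target class whose ECOC codeword has the minimum Hamming distance to the sub-classifiers' concatenated outputs, can be sketched as follows; the codewords below are invented for the example.

```python
def hamming(a, b):
    """Number of positions where two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(codewords, outputs):
    """Choose the class whose ECOC codeword is nearest (in Hamming
    distance) to the concatenated sub-classifier outputs."""
    return min(codewords, key=lambda cls: hamming(codewords[cls], outputs))

# Illustrative 3-class code (the codewords are made up for this example).
codewords = {
    "A": [1, 0, 0, 1, 1],
    "B": [0, 1, 0, 1, 0],
    "C": [0, 0, 1, 0, 1],
}
outputs = [0, 1, 0, 0, 0]  # one sub-classifier flipped a bit
decoded = ecoc_decode(codewords, outputs)
```

Even with one sub-classifier output flipped, the nearest codeword is still class "B", which is the error-correcting property ECOC brings to the model.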

13.
Feature selection is an important data preprocessing technique in machine learning and data mining that aims to maximize classification accuracy while minimizing the size of the optimal feature subset. When particle swarm optimization searches for the optimal subset in high-dimensional datasets, it tends to fall into local optima and incurs high computational cost, degrading classification accuracy. To address this, a feature selection algorithm for high-dimensional data based on a multifactorial particle swarm algorithm is proposed. An evolutionary multitasking framework is introduced, and a strategy for generating a two-task model is proposed; knowledge transfer between the tasks strengthens population communication and increases population diversity, mitigating the tendency to get trapped in local optima. An initialization strategy based on sparse representation is designed, producing sparse initial solutions at the start of the algorithm and reducing the computational cost as the population converges toward the optimal solution set. Experimental results on six public high-dimensional medical datasets show that the proposed algorithm effectively performs the classification task and achieves good accuracy.

14.
From a multi-objective optimization perspective, this paper combines the first objective of LDA (maximizing between-class variance) with the second objective of SVM (minimizing empirical risk) to construct a new Maximum between-class Variance and minimum Empirical risk (MVE) data classification model. Since the model is a non-convex program, it is solved with the Concave-Convex Procedure (CCCP). To validate the proposed model, classification experiments are conducted on synthetic data and real datasets from the UCI machine learning repository. The experimental results demonstrate the effectiveness of the model.
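The combined objective the abstract describes can be written, under one plausible formulation (the paper's exact weighting and constraints may differ), as a difference of convex functions:

```latex
\min_{w,b}\;\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\max\bigl(0,\,1-y_i(w^{\top}x_i+b)\bigr)}_{\text{empirical risk (SVM term, convex)}}
\;-\;\lambda\,\underbrace{w^{\top}S_B\,w}_{\text{between-class variance (LDA term, convex)}},
\qquad
S_B=\sum_{c} n_c(\mu_c-\mu)(\mu_c-\mu)^{\top}
```

Because the objective is (convex) minus (convex), it is non-convex but CCCP applies directly: at each iteration the term \(-\lambda\, w^{\top}S_B w\) is replaced by its linearization \(-2\lambda\, w_t^{\top}S_B w\) at the current iterate \(w_t\), leaving a convex subproblem to solve.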

15.
To address the low recognition rate of minority-class samples caused by data imbalance, an algorithm is proposed that improves both oversampling and random forests through weighting strategies, reducing the impact of imbalance on the classifier from both the data-preprocessing side and the algorithm side. In the data-preprocessing stage, the Synthetic Minority Oversampling Technique (SMOTE) reduces the degree of imbalance, and each minority-class sample is assigned a weight according to its Euclidean distance to the remaining samples, so that each sample synthesizes a different number of new samples. In the algorithm-improvement stage, the Kappa coefficient evaluates the post-training classification quality of each decision tree in the random forest, and each tree is assigned a corresponding weight, giving better-performing trees greater voting power and improving the overall classification performance of the random forest on imbalanced data. Experiments on KEEL datasets show that, compared with the unimproved algorithm, the improved algorithm raises both the classification accuracy on minority-class samples and the overall classification performance.
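The sampling-side weighting can be sketched as follows: each minority sample receives a synthetic-sample quota proportional to its mean Euclidean distance to the remaining samples (one reading of the abstract). Interpolating toward the single nearest minority neighbor is a simplification; classic SMOTE picks among k neighbors.

```python
import math
import random

def weighted_smote(minority, rest, n_new, seed=0):
    """SMOTE with per-sample quotas: isolated minority samples (far from the
    remaining samples) synthesize more new points. An illustrative sketch,
    not the paper's exact weighting."""
    rng = random.Random(seed)
    dists = [sum(math.dist(m, r) for r in rest) / len(rest) for m in minority]
    total = sum(dists)
    quotas = [round(n_new * d / total) for d in dists]
    synthetic = []
    for m, q in zip(minority, quotas):
        # nearest minority neighbor of m (k = 1 for brevity)
        nbr = min((x for x in minority if x is not m), key=lambda x: math.dist(m, x))
        for _ in range(q):
            gap = rng.random()  # random point on the segment m -> nbr
            synthetic.append([a + gap * (b - a) for a, b in zip(m, nbr)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
majority = [[5.0, 5.0], [6.0, 5.0]]
syn = weighted_smote(minority, majority, n_new=10)
```

The balanced set (originals plus `syn`) would then be fed to the Kappa-weighted random forest described on the algorithm side.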

16.
To overcome the inability of traditional support vector machines to classify crossed data, Mangasarian et al. proposed a new classification method, PSVM, which effectively solves two-class problems on crossed data; however, applications of PSVM to multi-class problems have rarely been reported. A PSVM-based multi-class method (M-PSVM) is therefore proposed, and the relationship between the training-sample proportion and classification accuracy is explored. Tests on UCI datasets show that M-PSVM matches the classification performance of traditional SVMs and performs better when the proportion of training samples is small. In addition, preliminary experiments on an intrusion-detection dataset show that M-PSVM effectively improves the classification accuracy of minority classes, offering a new approach to classification under data imbalance; further experimental validation is in progress.

17.
吴锦华, 左开中, 接标, 丁新涛. 《计算机应用》 (Journal of Computer Applications), 2015, 35(10): 2752-2756
As a common means of data preprocessing, feature selection can not only improve classifier performance but also make classification results more interpretable. Because sparse-learning-based feature selection methods sometimes ignore useful discriminative information and thereby hurt classification performance, a new discriminative feature selection method, D-LASSO, is proposed to select more discriminative features. The D-LASSO model first contains an L1-norm regularization term that produces a sparse solution; second, to induce more discriminative features, a new discriminative regularization term is added to the model that preserves the geometric distribution of samples within the same class and between different classes. Experimental results on a series of benchmark datasets show that, compared with existing methods, D-LASSO not only further improves classification accuracy but is also fairly robust to its parameters.
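The abstract does not give the discriminative regularizer explicitly; the pairwise form below is one standard way to encode "same-class samples stay close, different-class samples stay apart" after projection, and is offered only as a hedged sketch of what such a model can look like:

```latex
\min_{w}\;\; \|Xw-y\|_2^2 \;+\; \lambda_1\|w\|_1
\;+\;\lambda_2\Bigl(\sum_{y_i=y_j}\bigl(w^{\top}x_i-w^{\top}x_j\bigr)^2
\;-\;\gamma\sum_{y_i\neq y_j}\bigl(w^{\top}x_i-w^{\top}x_j\bigr)^2\Bigr)
```

The \(\lambda_1\) term yields the sparse solution (feature selection), while the \(\lambda_2\) term penalizes within-class spread and rewards between-class spread of the projected samples, which is the geometric information the method aims to preserve.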

18.
Classification of imbalanced data tends to favor the majority class, and the samples generated by traditional oversampling fail to capture the distributional characteristics of the original dataset. An improved variational autoencoder, combined with data preprocessing, is trained on the minority-class samples, and the variational autoencoder's generator is used to synthesize samples that balance the training set, alleviating the classification overfitting that imbalanced data cause under traditional sampling. Experiments on four commonly used UCI datasets show that the algorithm improves F-measure and G-mean while maintaining accuracy.

19.
Knowledge-based systems such as expert systems are of particular interest in medical applications, as the extracted if-then rules provide interpretable results. Various rule induction algorithms have been proposed to effectively extract knowledge from data, and they can be combined with classification methods to form rule-based classifiers. However, most rule-based classifiers cannot directly handle numerical data such as blood pressure; a data preprocessing step called discretization is required to convert such numerical data into a categorical format. Existing discretization algorithms do not take into account the multimodal class densities of numerical variables in datasets, which may degrade the performance of rule-based classifiers. In this paper, a new Gaussian Mixture Model based Discretization algorithm (GMBD) is proposed that preserves the most frequent patterns of the original dataset by taking into account the multimodal distribution of the numerical variables. The effectiveness of the GMBD algorithm was verified using six publicly available medical datasets. According to the experimental results, the GMBD algorithm outperformed five other static discretization methods in terms of the number of generated rules and the classification accuracy of the associative classification algorithm. Consequently, the proposed approach has the potential to enhance the performance of rule-based classifiers used in clinical expert systems.
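The principle behind GMBD, fitting a Gaussian mixture to a numerical variable and cutting between its modes, can be sketched with a two-component 1-D mixture fitted by plain EM. This is an illustration of the idea, not the paper's algorithm.

```python
import math

def gmm2_em(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    xs = sorted(xs)
    mu = [xs[len(xs) // 4], xs[3 * len(xs) // 4]]  # spread-out initial means
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities of each component for each point
        resp = []
        for x in xs:
            p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: update mixing weights, means, variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(1e-6, sum(r[k] * (x - mu[k]) ** 2
                                   for r, x in zip(resp, xs)) / nk)
    return mu, var, pi

def cut_point(mu, var, pi, grid=1000):
    """Discretization boundary: the point between the two means where the
    posterior switches components."""
    lo, hi = min(mu), max(mu)
    def post0(x):
        p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
             / math.sqrt(var[k]) for k in (0, 1)]  # common 2*pi factor cancels
        return p[0] / (p[0] + p[1])
    sign0 = post0(lo) > 0.5
    for i in range(grid + 1):
        x = lo + (hi - lo) * i / grid
        if (post0(x) > 0.5) != sign0:
            return x
    return (lo + hi) / 2

xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mu, var, pi = gmm2_em(xs)
cut = cut_point(mu, var, pi)
```

For a bimodal variable like `xs`, the boundary lands between the two modes, so the resulting categorical bins respect the multimodal shape rather than cutting through a mode as an equal-width scheme might.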

20.
Social media has become an important medium for publishing and spreading information about sudden disasters, and effectively identifying and using the genuine information it carries matters greatly for disaster emergency management. To address the shortcomings of traditional text classification models, a disaster tweet classification method based on the pretrained BERT model is proposed. After data cleaning, preprocessing, and comparative algorithm analysis, a text classification model based on Long Short-Term Memory and Convolutional Neural Networks (LSTM-CNN) is built on top of the pretrained BERT model. Experiments on a tweet dataset from the Kaggle competition platform show that, compared with the traditional naive Bayes classifier and common fine-tuned models, the model performs well, reaching a recognition rate of 85%, and handles small-sample classification better. This work is significant for accurately identifying genuine disaster information and improving the efficiency of disaster emergency response and communication.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号