Similar Documents
20 similar documents found (search time: 93 ms)
1.

In the fields of pattern recognition and machine learning, data preprocessing algorithms are increasingly used to achieve high classification performance. In particular, data preprocessing prior to classification has become indispensable when classifying medical datasets with nonlinear and imbalanced data distributions. In this study, a new data preprocessing method is proposed for the classification of the Parkinson, hepatitis, Pima Indians, single proton emission computed tomography (SPECT) heart, and thoracic surgery medical datasets, all of which exhibit nonlinear and imbalanced data distributions. These datasets were taken from the UCI machine learning repository. The proposed data preprocessing method consists of three steps. In the first step, the cluster centers of each attribute are calculated using the k-means, fuzzy c-means, and mean shift clustering algorithms. In the second step, the absolute differences between the data in each attribute and the cluster centers are calculated, and the average of these differences is computed for each attribute. In the final step, weighting coefficients are calculated by dividing the mean difference values by the cluster centers, and weighting is performed by multiplying the obtained coefficients by the attribute values in the dataset. Three attribute weighting methods are proposed: (1) similarity-based attribute weighting with k-means clustering, (2) similarity-based attribute weighting with fuzzy c-means clustering, and (3) similarity-based attribute weighting with mean shift clustering. The aim is to gather the data in each class together and to reduce the within-class variance. Reducing the variance within each class pulls its data together and at the same time increases the discrimination between classes. For comparison with other methods in the literature, random subsampling is used to handle imbalanced dataset classification. After the attribute weighting process, four classification algorithms, linear discriminant analysis, the k-nearest neighbor classifier, the support vector machine, and the random forest classifier, are used to classify the imbalanced medical datasets. Performance is evaluated using classification accuracy, precision, recall, area under the ROC curve, the κ value, and the F-measure. For training and testing the classifier models, three schemes are used: a 50–50% train–test holdout, a 60–40% train–test holdout, and tenfold cross-validation. The experimental results show that the proposed attribute weighting methods obtain higher classification performance than the random subsampling method in classifying imbalanced medical datasets.
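The three-step weighting described above can be sketched as follows. For brevity, this illustrative version uses a single center per attribute (the plain mean) in place of the paper's k-means / fuzzy c-means / mean shift cluster centers:

```python
# Hedged sketch of the three-step similarity-based attribute weighting.
# Assumption: one "cluster center" per attribute, taken as the plain mean;
# the paper computes centers with k-means, fuzzy c-means, or mean shift.

def attribute_weights(columns):
    """columns: list of attribute value lists; returns one weight per attribute."""
    weights = []
    for col in columns:
        center = sum(col) / len(col)                                  # step 1: cluster center
        mean_abs_diff = sum(abs(x - center) for x in col) / len(col)  # step 2: mean |x - center|
        weights.append(mean_abs_diff / center)                        # step 3: weight coefficient
    return weights

def weight_dataset(columns):
    """Multiply every attribute value by its attribute's weight coefficient."""
    ws = attribute_weights(columns)
    return [[x * w for x in col] for col, w in zip(columns, ws)]
```

The weighting shrinks high-variance attributes toward their centers, which is what reduces within-class variance before the classifiers are applied.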


2.
For the multi-class imbalance problem, a new method based on the one-versus-one (OVO) decomposition strategy is proposed. First, the multi-class imbalance problem is decomposed into multiple binary classification problems with the OVO strategy; binary classifiers are then built with an algorithm designed for imbalanced binary classification; next, the original dataset is processed with the SMOTE oversampling technique; redundant classifiers are then handled with a distance-based relative competence weighting method; finally, the output is obtained by weighted voting. Extensive experiments on the KEEL imbalanced datasets show that the proposed algorithm has a significant advantage over other classical methods.
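The OVO decomposition and voting steps can be sketched as follows. The SMOTE oversampling and distance-based competence weighting from the paper are omitted here, and `train_binary` is a placeholder for any imbalance-aware binary learner:

```python
# Illustrative sketch of one-versus-one (OVO) decomposition with plain
# majority voting; the paper additionally applies SMOTE and competence
# weighting, which are not shown.
from itertools import combinations

def train_ovo(X, y, train_binary):
    """train_binary(Xp, yp) -> classifier f(x) in {0, 1}; returns one model per class pair."""
    models = {}
    for a, b in combinations(sorted(set(y)), 2):
        pairs = [(x, int(lbl == b)) for x, lbl in zip(X, y) if lbl in (a, b)]
        Xp, yp = zip(*pairs)
        models[(a, b)] = train_binary(list(Xp), list(yp))
    return models

def predict_ovo(models, x):
    """Each pairwise model votes for one of its two classes; the majority wins."""
    votes = {}
    for (a, b), f in models.items():
        winner = b if f(x) == 1 else a
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```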

3.
In cost-sensitive learning, misclassification costs can vary across classes. This paper investigates an approach that reduces multi-class cost-sensitive learning to a standard classification task, based on the data space expansion technique developed by Abe et al., which coincides with Elkan's reduction for binary classification tasks. Using the proposed reduction, a cost-sensitive learning problem can be solved by considering a standard 0/1-loss classification problem on a new distribution determined by the cost matrix. We also propose a new weighting mechanism to solve the reduced standard classification problem, based on a theorem stating that the empirical loss on independently and identically distributed samples from the new distribution is essentially the same as the loss on the expanded weighted training set. Experimental results on several synthetic and benchmark datasets show that our weighting approach is more effective than existing representative approaches for cost-sensitive learning.
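A minimal sketch of the weighting idea: each training example receives a weight derived from the cost of misclassifying it, so a weighted 0/1-loss learner can stand in for the cost-sensitive one. This simplified version averages the off-diagonal costs of the example's true class rather than performing the full data space expansion of Abe et al.:

```python
# Hedged sketch: per-example weights from a cost matrix, a simplification
# of the cost-sensitive-to-weighted-classification reduction. Assumption:
# the weight of an example is the average cost of predicting it wrongly.

def cost_weights(labels, cost_matrix):
    """cost_matrix[true][pred] = misclassification cost; returns one weight per example."""
    weights = []
    for y in labels:
        wrong = [c for pred, c in enumerate(cost_matrix[y]) if pred != y]
        weights.append(sum(wrong) / len(wrong))  # average off-diagonal cost for class y
    return weights
```

For a binary task this recovers Elkan-style weighting, where each example's weight is simply the cost of its single possible error.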

4.
Image classification combining hash coding with a spatial pyramid  (Cited: 1; self-citations: 1, other citations: 0)
Objective: Sparse coding is a widely used image representation method, but sparse coding and its improved variants are computationally complex and time-consuming. To address this, an image classification algorithm combining hash coding with a spatial pyramid is proposed. Method: First, local feature points are extracted from the image to form a set of local descriptors. Second, an autoencoding hash function is learned to represent the local feature points as binary hash codes. Then, k-means clustering is performed on the binary hash codes to generate a binary visual dictionary. Finally, using the spatial pyramid model, the image is represented as a spatial-pyramid histogram vector and applied to image classification. Results: Experiments on the commonly used Caltech-101 and Scene-15 datasets, with comparisons against current sparse-coding-based algorithms, show that the proposed algorithm shortens dictionary learning time by 50%, speeds up online encoding by a factor of 1.3 to 12.4, and raises classification accuracy by 1 to 5 percentage points. Conclusion: An image classification algorithm combining hash coding with a spatial pyramid is proposed, which replaces sparse coding with hash coding for encoding local feature points and combines the spatial pyramid model for image classification. The experimental results show that the algorithm requires less dictionary learning time and encodes faster, making it suitable for online dictionary learning and applications.

5.
Physical activity recognition using wearable sensors has gained significant interest from researchers working in the fields of ambient intelligence and human behavior analysis. Multi-class classification is an important issue in applications that naturally have more than two classes. A well-known strategy for converting a multi-class classification problem into binary sub-problems is the error-correcting output coding (ECOC) method. Because existing methods use a single classifier with ECOC without considering the dependency among multiple classifiers, they often fail to generalize performance and parameters in real-life applications, where different numbers of devices, sensors and sampling rates are used. To address this problem, we propose a hierarchical classification model that combines two base binary classifiers through selective learning of a slacked hierarchy and integrates the training of the binary classifiers into a unified objective function. Our method maps the multi-class classification problem to multi-level classification. A multi-tier voting scheme provides a final classification label at each level of the proposed model. The method is evaluated on two publicly available datasets and compared with independent base classifiers. Furthermore, it has also been tested on real-life sensor readings from three subjects to recognize four activities: walking, standing, jogging and sitting. The presented method uses the same hierarchical levels and parameters to achieve better performance on all three datasets, which have different numbers of devices, sensors and sampling rates. The average accuracies on the publicly available datasets and the real-life sensor readings were 95% and 85%, respectively. The experimental results validate the effectiveness and generality of the proposed method in terms of performance and parameters.

6.
李晨光  张波  赵骞  陈小平  王行甫 《计算机应用》2022,42(11):3603-3609
Progress on text empathy prediction has been slow due to the lack of sufficient training data, whereas the related task of text sentiment polarity classification has abundant labeled training samples. Since the two tasks are strongly correlated, a transfer-learning-based text empathy prediction method is proposed that learns transferable shared features from the sentiment polarity classification task and uses them to assist the empathy prediction task. First, an attention mechanism dynamically weights and fuses the shared and private features of the two tasks; second, to eliminate the domain gap between the two tasks' datasets, an adversarial learning strategy is used to distinguish domain-specific features from domain-shared features; finally, a hinge-loss constraint strategy is proposed so that the shared features are generic across target labels while the private features remain specific to them. Experimental results on two benchmark datasets show that, compared with the transfer-learning baselines, the proposed method achieves a higher Pearson correlation coefficient (PCC) and coefficient of determination (R2) and a lower mean squared error (MSE), fully demonstrating its effectiveness.

7.
Wang  Zhen  Zhang  Long-Bo  Sun  Fu-Zhen  Wang  Lei  Liu  Shu-Shu 《Multimedia Tools and Applications》2019,78(17):24453-24472

Due to its high query speed and low storage cost, binary hashing has been widely used in approximate nearest neighbor (ANN) search. However, the binary bits are generally treated as equal, which causes data points with different codes to share the same Hamming distance to the query sample. To resolve this distance-measure ambiguity, bitwise weighting methods have been proposed. Unfortunately, most existing methods learn the bitwise weights and the binary codes separately in two stages, so their performance cannot be further improved. In this paper, to address these issues, we propose an adaptive mechanism that jointly generates the bitwise weights and the binary codes by preserving different types of similarity relationships. The binary codes are used to obtain initial retrieval results, which are then re-ranked by the weighted Hamming distance. This ANN search mechanism is termed AR-Rank. First, the joint mechanism lets the bitwise weights and the binary codes act as mutual feedback during training, so they are well adapted to one another when the algorithm converges. Furthermore, the bitwise weights are required to preserve relative similarity, which is consistent with the nature of the ANN search task; thus, the data points can be accurately re-sorted by their weighted Hamming distances. Evaluations on three datasets demonstrate that the proposed AR-Rank retrieval system outperforms nine state-of-the-art methods.
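The re-ranking step can be sketched as follows, with illustrative (not learnt) bit weights:

```python
# Sketch of weighted-Hamming re-ranking: candidates shortlisted by plain
# Hamming distance are re-sorted by a per-bit weighted distance. The
# weights here are illustrative constants; AR-Rank learns them jointly
# with the codes.

def weighted_hamming(a, b, weights):
    """Sum the weights of the bit positions where codes a and b disagree."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

def rerank(query_code, candidate_codes, weights):
    """Sort candidates by ascending weighted Hamming distance to the query."""
    return sorted(candidate_codes,
                  key=lambda code: weighted_hamming(query_code, code, weights))
```

Note how two candidates at the same unweighted Hamming distance 1 can be separated once the disagreeing bits carry different weights.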


8.
汪海龙  禹晶  肖创柏 《自动化学报》2021,47(5):1077-1086
Hash learning projects high-dimensional data into a low-dimensional binary space while preserving the semantic similarity between data points, reducing dimensionality for fast retrieval. Traditional supervised hash learning algorithms mainly take hand-crafted features as model input and generate hash codes through classification and quantization. Hand-crafted features lack adaptivity and are independent of the quantization process, which limits retrieval accuracy. This paper proposes a deep non-relaxed hashing algorithm based on pairwise similarity. At the output of a convolutional neural network, a differentiable soft-threshold function replaces the commonly used sign function so that the quasi hash codes nonlinearly approach -1 or 1; the network outputs are used directly to compute the training error, and an $\ell_1$-norm term in the loss constrains each bit of the quasi hash code to approach a binary value. After training, the sign function is applied outside the network to quantize the outputs into low-dimensional binary hash codes, and data storage and retrieval are performed in the low-dimensional binary space. Experiments on public datasets show that the proposed algorithm effectively extracts image features, accurately generates binary hash codes, and outperforms other algorithms in accuracy.
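The soft-threshold idea can be sketched as follows. A scaled tanh is used here as an assumed stand-in for the paper's differentiable surrogate of the sign function, with hard sign quantization applied only after training:

```python
# Hedged sketch: a differentiable activation pushing network outputs toward
# {-1, +1} during training, with sign() applied afterwards for the final
# binary code. Assumption: tanh(beta * v) as the soft threshold; the exact
# function in the paper may differ.
import math

def soft_threshold(v, beta=5.0):
    """Smooth surrogate for sign(): approaches ±1 as |v| grows, but stays differentiable."""
    return math.tanh(beta * v)

def binarize(vec):
    """Post-training quantization with the hard sign function."""
    return [1 if v >= 0 else -1 for v in vec]
```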

9.
For approximate nearest neighbor queries on high-dimensional data, a learning-based, data-dependent c-approximate nearest neighbor query algorithm is proposed under the filter-and-verify framework. It is proved that, after random projection, the data satisfy the entropy-maximization criterion required by semantic hashing. The randomly projected binary data are used as class labels to train a set of classifiers that predict the class label of a query. On this basis, the Hamming distance between the query and the data objects in the dataset is computed, and finally the nearest neighbor of the query is computed on the filtered candidate set. Compared with existing methods, this approach requires less space, uses shorter codes, and is more efficient. Experimental results on synthetic and real datasets show that the method not only improves query efficiency but also makes it convenient to tune the trade-off between query quality and query processing time.
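The filter stage (random-projection binary codes plus Hamming-distance screening) can be sketched as follows; the learned classifiers that predict query codes in the paper are replaced here by directly projecting the query:

```python
# Sketch of the filter step in a filter-and-verify ANN pipeline: sign of
# random Gaussian projections gives each point a binary code, and Hamming
# distance shortlists candidates for exact verification.
import random

def make_projections(dim, n_bits, seed=0):
    """One random Gaussian direction per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def encode(x, projections):
    """Binary code: 1 where the dot product with the projection is non-negative."""
    return [1 if sum(p * v for p, v in zip(proj, x)) >= 0 else 0
            for proj in projections]

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def filter_candidates(query_code, codes, radius):
    """Indices of database codes within the given Hamming radius of the query."""
    return [i for i, c in enumerate(codes) if hamming(query_code, c) <= radius]
```

Because the code depends only on the sign of each projection, positively scaled copies of a vector share the same code, which is what makes Hamming screening a cheap proxy for geometric closeness.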

10.
Although many variants of local binary patterns (LBP) are widely used for face analysis due to their satisfactory classification performance, they have not yet been shown to be compact. We propose an effective code selection method that obtains a compact LBP (CLBP) by maximizing the mutual information (MMI) between features and class labels. The derived CLBP is effective because it provides better classification performance with a smaller number of codes. We demonstrate the effectiveness of the proposed CLBP through several experiments on face recognition and facial expression recognition. Our experimental results show that CLBP outperforms other LBP variants, such as LBP, ULBP, and MCT, with fewer codes and better recognition performance.
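A hedged sketch of MMI-style code selection: keep the codes whose occurrence carries the most mutual information about the class label. The paper's exact criterion and search procedure may differ; this illustrative version scores each code column independently:

```python
# Sketch of compact code selection by mutual information with class labels.
# Assumption: codes are scored one at a time and the top k are kept; the
# paper's MMI optimization may consider codes jointly.
import math
from collections import Counter

def mutual_info(feature, labels):
    """feature, labels: equal-length discrete sequences; returns I(F; Y) in nats."""
    n = len(labels)
    pf, py = Counter(feature), Counter(labels)
    pfy = Counter(zip(feature, labels))
    return sum(c / n * math.log((c / n) / ((pf[f] / n) * (py[y] / n)))
               for (f, y), c in pfy.items())

def select_codes(columns, labels, k):
    """Keep the indices of the k code columns with the highest MI with the labels."""
    scored = sorted(range(len(columns)),
                    key=lambda i: mutual_info(columns[i], labels), reverse=True)
    return sorted(scored[:k])
```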

11.
Error-correcting output coding based on feature-space transformation  (Cited: 1; self-citations: 0, other citations: 1)

To ensure diversity among the base classifiers in ECOC-based multi-class classification, a coding method based on feature-space transformation is proposed. The method introduces the feature space and extends the coding matrix into a three-dimensional matrix; then, based on binary class partitions, different feature subspaces are obtained by feature transformation, so that base classifiers with high diversity can be trained. Experimental results on public datasets show that the method achieves better classification performance than the original coding matrix while increasing the diversity of the base classifiers; it is applicable to any coding matrix and offers a new approach to large-scale data classification.


12.
Crowd density level estimation is one of the core techniques of intelligent crowd monitoring; its main application is estimating the quantized crowd density level within a specified region of surveillance images or video. This paper proposes a crowd density level classification model based on confidence analysis. First, an error-correcting output code based on binary-tree classification is designed to optimally combine multiple binary classifiers. Then, confidence samples are extracted to train the SVM binary classifiers. Finally, decoding is performed with a channel transmission model, and the crowd density level of a sample is obtained by the maximum a posteriori rule. With the same samples and features, the model outperforms traditional classification models in both classification accuracy and generalization, providing an approach to multi-class classification problems such as crowd density estimation.
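For reference, generic ECOC decoding assigns a sample to the class whose codeword is closest to the vector of binary-classifier outputs; the model above instead decodes with a channel model and the maximum a posteriori rule. A minimal Hamming-distance sketch:

```python
# Minimal sketch of generic ECOC decoding. Each row of the coding matrix is
# a class codeword over {-1, +1}; a sample is assigned to the class whose
# codeword is nearest (by Hamming distance) to the binary-classifier outputs.
# The paper's posterior-based decoding is a refinement of this idea.

def ecoc_decode(outputs, coding_matrix):
    """outputs: list of ±1 classifier outputs; coding_matrix: class -> codeword."""
    def hamming(a, b):
        return sum(u != v for u, v in zip(a, b))
    return min(coding_matrix, key=lambda cls: hamming(outputs, coding_matrix[cls]))
```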

13.
In recent years, the multi-label classification task has gained the attention of the scientific community given its ability to solve problems where each instance of the dataset may be associated with several class labels at the same time instead of just one. The main problems to deal with in multi-label classification are the imbalance, the relationships among the labels, and the high complexity of the output space. A large number of methods for multi-label classification have been proposed, but although they aim to deal with one or more of these problems, most do not take these characteristics of the data into account in their building phase. In this paper we present an evolutionary algorithm for the automatic generation of ensembles of multi-label classifiers, called Evolutionary Multi-label Ensemble (EME), that tackles the three previously mentioned problems. Each multi-label classifier focuses on a small subset of the labels, still considering the relationships among them but avoiding the high complexity of the output space. Further, the algorithm automatically designs the ensemble by evaluating both its predictive performance and the number of times each label appears in the ensemble, so that infrequent labels are not ignored in imbalanced datasets. For this purpose, we also propose a novel mutation operator that considers the relationships among labels, looking for individuals where the labels are more related. EME was compared with other state-of-the-art algorithms for multi-label classification on fourteen multi-label datasets using five evaluation measures. The experimental study was carried out in two parts: first comparing EME to classic multi-label classification methods, and second comparing EME to other ensemble-based methods in multi-label classification. EME performed significantly better than the classic methods in three of the five evaluation measures. In the second experiment, EME performed best in one measure and was the only method that did not perform significantly worse than the control algorithm in any measure. These results show that EME achieved better and more consistent performance than the rest of the state-of-the-art MLC methods.

14.
To address the low recognition rate of minority-class samples caused by data imbalance, an algorithm is proposed that improves oversampling and random forest through weighting strategies, reducing the impact of imbalance on the classifier at both the data-preprocessing and algorithm levels. In the preprocessing stage, the Synthetic Minority Oversampling Technique (SMOTE) reduces the degree of imbalance; each minority-class sample is assigned a weight according to its Euclidean distance from the remaining samples, so that each sample generates a different number of synthetic samples. In the algorithm stage, the Kappa coefficient evaluates the classification quality of each trained decision tree in the random forest, and each tree is weighted accordingly, so that trees with better classification ability carry more weight in the voting stage, improving the overall classification performance of random forest on imbalanced data. Experiments on the KEEL datasets show that, compared with the unimproved algorithms, the improved algorithm raises both the classification accuracy on minority-class samples and the overall classification performance.
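The SMOTE interpolation step used in the preprocessing stage can be sketched as follows (the distance-based per-sample weighting and neighbor search are omitted):

```python
# Sketch of the core SMOTE interpolation: a synthetic minority sample is a
# random point on the segment between a minority sample and one of its
# minority-class neighbors. Neighbor selection and the paper's distance-based
# weighting are left out.
import random

def smote_sample(x, neighbor, rng=random):
    """Interpolate between x and a minority-class neighbor with one shared gap factor."""
    gap = rng.random()  # single gap in [0, 1) applied to every dimension
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]
```

Using a single gap factor for all dimensions keeps the synthetic point on the straight segment between the two real samples, which is the standard SMOTE behavior.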

15.
The performance of medical image classification has been enhanced by deep convolutional neural networks (CNNs), which are typically trained with cross-entropy (CE) loss. However, when the label has an intrinsically ordinal property, e.g., the development from benign to malignant tumor, CE loss cannot exploit such ordinal information for better generalization. To improve model generalization with ordinal information, we propose a novel meta ordinal regression forest (MORF) method for medical image classification with ordinal labels, which learns the ordinal relationship through the combination of a convolutional neural network and a differentiable forest in a meta-learning framework. The merits of the proposed MORF come from two components: a tree-wise weighting net (TWW-Net) and a grouped feature selection (GFS) module. First, the TWW-Net assigns each tree in the forest a specific weight mapped from the classification loss of the corresponding tree. Hence, the trees possess varying weights, which helps alleviate the tree-wise prediction variance. Second, the GFS module enables a dynamic forest rather than the fixed one used previously, allowing for random feature perturbation. During training, we alternately optimize the parameters of the CNN backbone and the TWW-Net in the meta-learning framework by calculating the Hessian matrix. Experimental results on two medical image classification datasets with ordinal labels, the LIDC-IDRI and Breast Ultrasound datasets, demonstrate the superior performance of our MORF method over existing state-of-the-art methods.

16.
梁辰  李成海 《计算机科学》2016,43(5):87-90, 121
To address the shortage of training samples that supervised intrusion detection algorithms typically face in real network environments, a semi-supervised multi-class intrusion detection method based on error-correcting output codes is proposed. The method incorporates the semi-supervised idea of the cop-kmeans algorithm, mining the latent relations in unlabeled data to enlarge the amount of labeled normal network data. The algorithm first uses SVDD to compute the separability of the intrusion detection classes, yielding a binary tree composed of different subclasses; then the nodes at each level of the binary tree are encoded to form a hierarchical output code, producing the final classifier. Experiments show that the algorithm achieves a higher detection rate for all attack types and good practicality in real network environments.

17.
Unsupervised deep hashing methods struggle to obtain high-quality hash codes because similarity supervision is unavailable. This paper therefore proposes an end-to-end deep unsupervised hashing model based on pseudo pairwise labels. Image features obtained from a pretrained deep convolutional neural network are first analyzed statistically to construct semantic similarity labels for the data, after which supervised hash learning based on the pairwise labels is performed. Experiments on two common image datasets, CIFAR-10 and NUS-WIDE, show that the hash codes produced by the proposed method achieve strong image retrieval performance.

18.
The fuzzy relational classifier (FRC) is a recently proposed two-step nonlinear classifier. First, unsupervised fuzzy c-means (FCM) clustering is performed to explore the underlying groups of the given dataset. Then, a fuzzy relation matrix indicating the relationship between the formed groups and the given classes is constructed for subsequent classification. FRC has been shown to have two advantages: interpretable classification results and avoidance of overtraining. However, FRC not only lacks robustness, which is very important for a classifier, but also fails on datasets with non-spherical distributions. Moreover, its classification mechanism is sensitive to improper class labels in the training samples, leading to a considerable decline in classification performance. The purpose of this paper is to develop a robust FRC (RFRC) algorithm that overcomes or mitigates all of the above disadvantages of FRC while maintaining its original advantages. In the proposed RFRC algorithm, we replace FCM with our previously proposed robust kernelized FCM (KFCM) to enhance robustness against outliers and suitability for non-spherical data structures. In addition, we incorporate soft class labels into the classification mechanism to improve performance, especially on datasets containing improper class labels. Experimental results on 2 artificial and 11 real-life benchmark datasets demonstrate that the RFRC algorithm consistently outperforms FRC in classification performance.

19.
Error-correcting output coding (ECOC) can effectively solve multi-class classification problems, and data-driven coding is one of the main coding approaches. This paper proposes an adaptive coding method based on subclass partitioning and particle swarm optimization (PSO). The confusion matrix is used to measure the correlation between classes; a rule-based method adaptively merges the classes; binary partitions of the classes are constructed from the merging scheme to form the coding matrix; finally, the PSO algorithm is introduced to search for the optimal threshold, yielding the optimal coding matrix. Experimental results show that the proposed coding method achieves better classification performance.

20.
Traditional multi-label classification algorithms rest on binary label prediction, but binary labels only indicate whether a sample belongs to a class; they carry little semantic information and cannot fully represent label semantics. To fully mine the semantic information of the label space, a multi-label classification algorithm based on non-negative matrix factorization and sparse representation (MLNS) is proposed. The algorithm combines non-negative matrix factorization with sparse representation to convert the binary labels of the data into real-valued labels, enriching label semantics and improving classification. First, the label space is factorized by non-negative matrix factorization to obtain a latent label-semantic space, which is combined with the original feature space to form a new feature space; then, sparse coding over this feature space yields the global similarity relations between samples; finally, these relations are used to reconstruct the binary label vectors, realizing the conversion from binary to real-valued labels. The proposed algorithm was compared with MLBGM, ML2, LIFT and MLRWKNN on five standard multi-label datasets under five evaluation metrics. Experimental results show that MLNS outperforms the compared multi-label classification algorithms, ranking first in 50% of cases, in the top two in 76% of cases, and in the top three in all cases.

