Similar Documents
20 similar documents retrieved.
1.
To address multi-label classification with incomplete label information, a new multi-label algorithm, MCWD, is proposed. By effectively recovering the missing labels in the training data, it produces better classification results. In the training stage, MCWD iteratively updates the weight of each training instance and exploits pairwise label correlations to recover the missing labels in the training data; once label recovery is complete, the resulting training set is used to train a classification model, which is then used to predict the test set. Experimental results show that the algorithm holds an advantage on 14 multi-label datasets.
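The abstract does not spell out MCWD's update rules, so the following is only a rough sketch of the general idea: recover missing labels from pairwise label co-occurrence statistics while maintaining per-instance weights. The correlation estimate, threshold, and weighting scheme below are illustrative assumptions, not the published algorithm.

```python
import numpy as np

def recover_missing_labels(Y, n_iter=10, threshold=0.5):
    """Y: (n_samples, n_labels) with entries 1 (relevant), 0 (irrelevant),
    or np.nan (missing). Returns a fully labeled matrix."""
    Y = Y.astype(float).copy()
    observed = ~np.isnan(Y)
    w = np.ones(Y.shape[0])                              # per-instance weights

    for _ in range(n_iter):
        cur = np.nan_to_num(Y, nan=0.0)                  # current (partially recovered) labels
        pos = (cur == 1).astype(float)
        co = (pos * w[:, None]).T @ pos                  # weighted co-occurrence counts
        corr = co / (np.diag(co)[None, :] + 1e-9)        # rough estimate of P(label j | label k)

        for i in range(Y.shape[0]):
            known_pos = np.where(observed[i] & (Y[i] == 1))[0]
            for j in np.where(~observed[i])[0]:
                # vote for missing label j from the instance's observed positive labels
                score = corr[j, known_pos].mean() if known_pos.size else 0.0
                Y[i, j] = 1.0 if score >= threshold else 0.0
            w[i] = observed[i].mean()                    # instances with fewer missing labels weigh more
    return Y
```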

2.
Semi-supervised clustering of remote sensing images based on the SSKM algorithm
闫利  曹君 《遥感信息》2010,(2):8-11
Semi-supervised clustering is a relatively recent family of clustering methods with good clustering performance, but most such methods require complete prior information, that is, at least one labeled sample for every class. This paper proposes a semi-supervised clustering method for remote sensing images that works with incomplete prior information, the SSKM clustering algorithm, which uses prior information for only a subset of the classes to guide the clustering of remote sensing images. Experiments show that, compared with traditional K-means clustering, the algorithm effectively improves the clustering of remote sensing images.
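SSKM itself is not detailed in the abstract; under its stated assumption that labels exist for only some classes, a seeded K-means sketch, in which the labeled pixels initialize and pin their clusters while everything else is clustered normally, could look like this (the seeding scheme is an assumption, not the published algorithm).

```python
import numpy as np

def seeded_kmeans(X, k, seeds, n_iter=50, rng=None):
    """X: (n, d) pixel features; seeds: dict {cluster_id: list of labeled sample indices},
    covering only SOME of the k clusters (incomplete prior information)."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for c, idx in seeds.items():          # initialize seeded clusters from labeled samples
        centers[c] = X[idx].mean(axis=0)

    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for c, idx in seeds.items():      # keep labeled samples in their known cluster
            assign[idx] = c
        for c in range(k):
            members = X[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return assign, centers
```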

3.
An improved AdaBoost algorithm for imbalanced data classification
Class-imbalanced classification problems are common in the real world. Traditional machine learning algorithms such as AdaBoost focus on the overall performance of the classifier and give no extra attention to the minority class, so learning algorithms for class imbalance are an important research direction in machine learning. AsymBoost, an improved variant of AdaBoost, sacrifices recognition accuracy on the majority class to improve classification performance on the minority class when used for imbalanced learning, but it may still suffer from overfitting caused by excessively large sample weights. A new improved AdaBoost algorithm is therefore proposed: by processing the weights and labels of hard-to-classify majority-class samples, the classifier can achieve good precision and recall at the same time. Experimental results show that the method effectively improves classification performance on imbalanced datasets.
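The paper's exact treatment of the weights and labels of hard majority samples is not given in the abstract; an AsymBoost-flavoured sketch that biases the weight update toward the minority class and caps runaway weights, both of which are illustrative assumptions, is shown below.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def asymmetric_adaboost(X, y, n_rounds=50, minority_boost=1.05, w_cap=0.1):
    """y in {-1, +1} with +1 the minority class. Asymmetric, capped weight update
    (illustrative; not the exact rule from the cited paper)."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)
        w[y == 1] *= minority_boost            # favour the minority class every round
        w = np.minimum(w / w.sum(), w_cap)     # cap weights of hard (mostly majority) samples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * l.predict(Xq) for a, l in zip(alphas, learners)))
```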

4.
Traditional multi-view learning usually assumes that every sample is complete in every view, but data acquisition difficulties, equipment failure, occlusion, and other factors mean this assumption does not always hold, and traditional multi-view methods can hardly handle incomplete multi-view data effectively. Researchers have proposed some incomplete multi-view learning methods, but these do not make full use of sample class information, which hurts the discriminability of the recovered samples. An incomplete multi-view classification method based on discriminative sparse representation (IMVC-DSR) is therefore proposed. Specifically, the method assumes that a missing sample can be sparsely and linearly represented by a small number of observed samples. To make full use of class priors and increase the discriminability of the recovered samples, it encourages samples of the same class to represent each other and suppresses representation between samples of different classes. It also accounts for correlations between views by introducing a selection operator that picks out the same samples across different views and constraining their linear representations to be consistent across views. Finally, the effectiveness of IMVC-DSR is verified on five public datasets.

5.
To handle multi-class network fault recognition under limited samples, an adaptive centroid-distance-projection hierarchical support vector machine is proposed. To counter the error accumulation inherent in hierarchical SVMs, the method measures class separability by projecting samples onto the centroid distance in feature space and uses this separability to optimize the structure of the skewed hierarchical tree; a compensation algorithm based on adaptive penalty factors is also designed to correct the tilt of the separating hyperplane caused by imbalanced data. Experimental results show that the method achieves good recognition performance and efficiency and effectively suppresses error accumulation.
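As an illustration of the separability criterion described above, the following sketch projects two classes onto the line joining their feature-space centroids and scores the pair with a Fisher-style ratio; the exact measure used in the paper may differ, and the choice of ratio here is an assumption.

```python
import numpy as np

def centroid_projection_separability(Xa, Xb):
    """Project both classes onto the line joining their centroids and return a
    Fisher-style ratio: large values mean the pair is easy to split first
    when building the hierarchical SVM tree."""
    ca, cb = Xa.mean(axis=0), Xb.mean(axis=0)
    direction = cb - ca
    direction /= np.linalg.norm(direction) + 1e-12
    pa, pb = Xa @ direction, Xb @ direction          # 1-D projections onto the centroid axis
    between = (pa.mean() - pb.mean()) ** 2
    within = pa.var() + pb.var() + 1e-12
    return between / within
```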

6.
In practical applications, because of the diversity, fuzziness, and complexity of objects in the real world, data mining problems in which the class information of a large number of samples is unknown are frequently encountered. Traditional methods rely on known class labels to mine data effectively, and there is currently no effective way to handle multi-class data whose pattern classes are unknown. For this problem, a pattern class mining model based on active learning (PM_AL) is proposed. The model measures the relationship between the pattern classes obtained so far and the unlabeled samples, and introduces a sample-dissimilarity criterion to extract the most valuable samples; through active learning it quickly mines, at a small labeling cost, the potential pattern classes hidden in the unlabeled samples, which helps turn an unlabeled multi-class problem into a labeled one. Experimental results show that PM_AL can handle pattern class mining without class information at a small labeling cost.
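The abstract only names a sample-dissimilarity criterion; one plausible reading, repeatedly asking an annotator to label the unlabeled sample farthest from every pattern class found so far, can be sketched as below. The distance-based criterion and the `oracle` callable are assumptions introduced for illustration.

```python
import numpy as np

def mine_pattern_classes(X, oracle, budget):
    """X: (n, d) unlabeled samples; oracle(i) -> class label (human annotator).
    Greedily query the sample most dissimilar to all classes discovered so far."""
    classes = {}                                  # label -> list of member indices
    labeled = set()
    first = 0                                     # bootstrap with an arbitrary sample
    classes.setdefault(oracle(first), []).append(first)
    labeled.add(first)

    for _ in range(budget - 1):
        centroids = np.array([X[idx].mean(axis=0) for idx in classes.values()])
        rest = [i for i in range(len(X)) if i not in labeled]
        if not rest:
            break
        # dissimilarity = distance to the nearest known class centroid
        d = np.linalg.norm(X[rest][:, None, :] - centroids[None], axis=2).min(axis=1)
        pick = rest[int(d.argmax())]
        classes.setdefault(oracle(pick), []).append(pick)
        labeled.add(pick)
    return classes
```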

7.
Building on the successful use of sparse representation in face classification, a more effective classification method combining sparse representation-based classification (SRC) with the elastic net is proposed. To strengthen collaborative representation among samples and to better handle strongly correlated variables, a sparse decomposition method combined with the elastic net is built on an iterative dynamic elimination mechanism: a test sample is represented as a linear combination of the training samples, the iterative mechanism removes the classes and samples that contribute little to classification, the Elastic Net algorithm is used to solve for the coefficients so that the samples and classes contributing most to classification are selected, and the test sample is finally classified according to the computed similarity. Extensive experiments on the ORL, FERET, and AR datasets give recognition rates of 98.75%, 86.62%, and 99.72%, respectively, demonstrating the effectiveness of the method. Compared with LASSO and SRC-GS, the proposed algorithm better handles high-dimensional small-sample data and strongly correlated variables during coefficient decomposition, highlights the importance of the sparsity constraint, and achieves higher accuracy and stability, making it better suited to face classification.
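The class-elimination schedule is specific to the paper, but the core step, decomposing a test face over the training set with an elastic net and assigning the class whose samples reconstruct it best, can be sketched with scikit-learn. This is a simplified version without the iterative elimination, and the hyperparameter values are placeholders.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def src_elastic_net(X_train, y_train, x_test, alpha=0.01, l1_ratio=0.7):
    """X_train: (n, d) training faces (one per row), y_train: labels, x_test: (d,).
    Classify by the class with the smallest elastic-net reconstruction residual."""
    # solve x_test ≈ X_train.T @ coef with an L1+L2 penalty on coef
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False, max_iter=5000)
    model.fit(X_train.T, x_test)
    coef = model.coef_

    residuals = {}
    for c in np.unique(y_train):
        mask = (y_train == c)
        recon = X_train[mask].T @ coef[mask]      # reconstruction using class-c samples only
        residuals[c] = np.linalg.norm(x_test - recon)
    return min(residuals, key=residuals.get)
```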

8.
Traditional classification algorithms generally assume a balanced class distribution, whereas real-world data are often imbalanced. Existing data-level and model-level algorithms attack the problem from different angles but face issues such as parameter selection and the extra computation introduced by repeated sampling. To address this, a method that adaptively balances sample losses within each mini-batch is proposed. The algorithm learns the loss function dynamically, adjusting the loss weight of each sample according to the label distribution within the mini-batch so that the total loss of each class within the mini-batch is balanced. Experiments on the caltech101 and ILSVRC2014 datasets show that the algorithm effectively reduces computational cost and improves classification accuracy, while to some extent avoiding the overfitting risk introduced by oversampling.
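The paper's exact reweighting rule is not given in the abstract; a simple sketch of the idea, scaling each sample's loss by the inverse frequency of its class inside the current mini-batch so that every class contributes comparably to the batch loss, might look like this (cross-entropy and inverse-frequency weights are assumptions).

```python
import numpy as np

def balanced_batch_loss(logits, labels, n_classes):
    """logits: (B, C) raw scores, labels: (B,) integer class ids.
    Per-sample cross-entropy reweighted by within-batch class frequency."""
    # softmax cross-entropy per sample
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels]

    # weight = 1 / (count of that class in this batch), normalized to keep loss scale
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    w = 1.0 / counts[labels]
    w *= len(labels) / w.sum()
    return float((w * ce).mean())
```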

9.
Manifold learning algorithms have important applications in pattern recognition. Considering the characteristics of text classification data, a locally linear embedding algorithm with neighborhood selection corrected for text is proposed: a weighted Euclidean distance is used to construct the local neighborhoods of the text data, improving text classification accuracy. In addition, the class information of the text data is exploited by building the classifier with a semi-supervised locally linear embedding algorithm, further improving classification performance. Experiments show that the manifold learning algorithm improved here for text classification can classify text effectively.

10.
Ensemble methods are a simple and effective way to classify datasets containing missing attributes, but existing ensemble classification algorithms for incomplete data weight the sub-classifiers using only the dimensionality and size of the corresponding data subsets. Taking into account how much the missing attributes of an incomplete dataset contribute to the class, information entropy is used to measure the differences between missing attributes, and a new ensemble learning algorithm for incomplete data, the entropy-based ensemble classification algorithm (EECA), is proposed. Experiments on UCI datasets, with an ensemble whose base classifiers are BP neural networks, show that EECA is more effective than simply weighting the sub-classifiers by the number of missing attributes and achieves higher final accuracy.
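The abstract says the sub-classifier weights are derived from the information entropy of the missing attributes rather than from the subset size alone; one way to sketch such a weighting is shown below, where the exact weighting formula is an assumption.

```python
import numpy as np

def entropy(values):
    """Shannon entropy of a discrete attribute (NaN entries ignored)."""
    vals = values[~np.isnan(values)]
    if vals.size == 0:
        return 0.0
    _, counts = np.unique(vals, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def subclassifier_weights(X, missing_masks):
    """X: (n, d) data with np.nan for missing cells; missing_masks[k]: boolean (d,)
    mask of attributes MISSING in sub-classifier k's training subset. Sub-classifiers
    whose missing attributes carry less information (lower entropy) get larger weights."""
    attr_entropy = np.array([entropy(X[:, j]) for j in range(X.shape[1])])
    raw = np.array([1.0 / (1.0 + attr_entropy[m].sum()) for m in missing_masks])
    return raw / raw.sum()
```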

11.
This paper proposes to utilize the information within incomplete instances (instances with missing values) when estimating missing values. Accordingly, a simple and efficient nonparametric iterative imputation algorithm, called the NIIA method, is designed for iteratively imputing missing target values. The NIIA method imputes each missing value several times until the algorithm converges. In the first iteration, all the complete instances are used to estimate missing values; the information within incomplete instances is utilized from the second imputation iteration onward. We conduct experiments to evaluate the efficiency and demonstrate that: (1) utilizing the information within incomplete instances helps capture the distribution of a dataset; and (2) the NIIA method outperforms existing methods in accuracy, and this advantage is clearly highlighted when datasets have a high missing ratio.
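NIIA's nonparametric estimator is not reproduced here, but the iterative scheme the abstract describes, a first pass using complete instances only and subsequent passes that also draw on the now-imputed incomplete instances until the imputations stabilize, can be sketched with a kNN-style estimator standing in for the paper's regressor.

```python
import numpy as np

def niia_like_impute(X, k=5, max_iter=20, tol=1e-4):
    """X: (n, d) with np.nan for missing entries. Illustrative iterative imputation."""
    missing = np.isnan(X)
    complete_rows = ~missing.any(axis=1)
    X_imp = np.where(missing, np.nanmean(X, axis=0), X)   # crude start so distances are defined

    for it in range(max_iter):
        prev = X_imp.copy()
        # iteration 0: donors are complete instances only; later: every instance
        donors = np.where(complete_rows)[0] if it == 0 else np.arange(len(X))
        for i, j in zip(*np.where(missing)):
            cand = donors[donors != i]
            d = np.linalg.norm(X_imp[cand] - X_imp[i], axis=1)
            nn = cand[np.argsort(d)[:k]]
            X_imp[i, j] = X_imp[nn, j].mean()              # kNN estimate of the missing value
        if np.abs(X_imp - prev).max() < tol:               # stop once imputations converge
            break
    return X_imp
```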

12.
While there is an ample amount of medical information available for data mining, many of the datasets are unfortunately incomplete - missing relevant values needed by many machine learning algorithms. Several approaches have been proposed for the imputation of missing values, using various reasoning steps to provide estimations from the observed data. One of the important steps in data mining is data preprocessing, where unrepresentative data is filtered out of the data to be mined. However, none of the related studies about missing value imputation consider performing a data preprocessing step before imputation. Therefore, the aim of this study is to examine the effect of two preprocessing steps, feature and instance selection, on missing value imputation. Specifically, eight different medical-related datasets are used, containing categorical, numerical and mixed types of data. Our experimental results show that imputation after instance selection can produce better classification performance than imputation alone. In addition, we will demonstrate that imputation after feature selection does not have a positive impact on the imputation result.

13.
Data for classification are often incomplete. The multiple-values construction method (MVCM) can be used to include data with missing values for classification. In this study, the MVCM is implemented using fuzzy sets theory in the context of classification with discrete data. By using the fuzzy sets based MVCM, data with missing values can add value to classification, but can also introduce excessive uncertainty. Furthermore, the computational cost of using incomplete data could be prohibitive if the scale of missing values is large. This paper discusses the association between classification performance and the use of incomplete data. It proposes an algorithm for near-optimal use of incomplete classification data. An experiment with real-world data demonstrates the usefulness of the algorithm.

14.
The analysis and imputation of incomplete data has long been a hot research topic in big data processing. Traditional methods cannot cluster incomplete data directly; most first impute the missing values and then cluster the data. These methods generally impute using the entire dataset, so the imputed values are easily disturbed by noise, the imputation results are imprecise, and clustering accuracy suffers. This paper proposes a clustering algorithm for incomplete data that redefines the similarity formula for incomplete information systems, giving a similarity measure between incomplete data objects so that incomplete data can be clustered directly. Based on the clustering result, objects of the same class are assigned to the same cluster, and missing values are imputed from the attribute values of objects in the same class, which shields the imputed values from noise and improves imputation accuracy. Experimental results show that the proposed method can cluster incomplete data and effectively improves the accuracy of missing-value imputation.
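The redefined similarity is not spelled out in the abstract; a common choice it could resemble, a partial distance computed over the attributes both objects have observed and rescaled by the number of shared attributes, is sketched below together with cluster-wise imputation. Both pieces are assumptions standing in for the paper's exact definitions.

```python
import numpy as np

def partial_distance(a, b):
    """Euclidean distance over jointly observed attributes, scaled to full dimension."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.inf
    d2 = ((a[shared] - b[shared]) ** 2).sum()
    return np.sqrt(d2 * a.size / shared.sum())

def impute_within_clusters(X, labels):
    """Fill each missing entry with the mean of that attribute inside the object's cluster."""
    X_imp = X.copy()
    for c in np.unique(labels):
        members = labels == c
        col_means = np.nanmean(X[members], axis=0)
        rows, cols = np.where(np.isnan(X) & members[:, None])
        X_imp[rows, cols] = col_means[cols]
    return X_imp
```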

15.
A corrected imputation method for incomplete data under constructive covering
Handling incomplete data is an important problem in data mining, machine learning, and related fields, and missing-value imputation is the mainstream approach. Most existing imputation methods use techniques from statistics and machine learning to analyze the remaining information in the original data and derive reasonable values to replace the missing parts. Imputation methods can be roughly divided into single imputation and multiple imputation, each with its own advantages in different settings, but few methods go further and use neighborhood information in the sample-space distribution to correct the imputed values. This paper therefore proposes a framework, applicable to many existing imputation methods, that improves their imputation results; it consists of three parts: pre-imputation, mining of spatial neighborhood information, and corrected imputation. Experiments with seven imputation methods on eight UCI datasets verify the effectiveness and robustness of the proposed framework.
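The framework is described as pre-imputation, neighborhood-information mining, and corrected imputation; a stripped-down sketch of that three-step flow is shown below, with a mean pre-fill and a kNN-based correction standing in for the paper's constructive-covering components, and the blending factor as an assumed parameter.

```python
import numpy as np

def corrected_imputation(X, k=5, blend=0.5):
    """X: (n, d) with np.nan marking missing entries.
    1) pre-fill with column means; 2) mine neighborhood information (k nearest
    neighbors in the pre-filled space); 3) correct each filled value by blending
    it with its neighborhood mean."""
    missing = np.isnan(X)
    X_pre = np.where(missing, np.nanmean(X, axis=0), X)       # step 1: pre-imputation

    X_corr = X_pre.copy()
    for i, j in zip(*np.where(missing)):
        d = np.linalg.norm(X_pre - X_pre[i], axis=1)          # step 2: neighborhood mining
        d[i] = np.inf
        nn = np.argsort(d)[:k]
        neigh_mean = X_pre[nn, j].mean()
        X_corr[i, j] = (1 - blend) * X_pre[i, j] + blend * neigh_mean   # step 3: correction
    return X_corr
```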

16.
A missing-data imputation algorithm based on EM and Bayesian networks
Datasets with missing data are common in practice, and handling missing data has become a research focus in classification. Several general-purpose missing-data imputation algorithms are analyzed and compared, and a new imputation algorithm based on EM and Bayesian networks is proposed. The algorithm uses naive Bayes to estimate the initial values for EM, then combines EM with a Bayesian network and iterates to determine the final updater, obtaining the imputed, complete dataset at the same time. Experimental results show that, compared with classical imputation algorithms, the new algorithm achieves higher classification accuracy and saves considerable overhead.
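The combination of EM with a general Bayesian network is beyond a short example, but the flavour of the approach can be sketched with EM under a naive Bayes model (the simplest Bayesian network) over categorical features with known class labels; this is an illustrative stand-in, not the paper's algorithm.

```python
import numpy as np

def em_naive_bayes_impute(X, y, n_values, n_iter=20):
    """X: (n, d) integer-coded categorical features with -1 for missing; y: class labels;
    n_values[j]: number of categories of feature j. EM under a naive Bayes model:
    the M-step re-estimates per-class value probabilities from observed plus soft counts,
    the E-step gives a posterior over each missing value given the instance's class."""
    n, d = X.shape
    classes = np.unique(y)
    theta = {}                                         # (class, feature) -> value probabilities
    resp = {}                                          # (row, feature) -> soft counts for a missing cell

    for _ in range(n_iter):
        # M-step with Laplace smoothing
        for c in classes:
            rows = np.where(y == c)[0]
            for j in range(d):
                counts = np.ones(n_values[j])
                for i in rows:
                    if X[i, j] >= 0:
                        counts[X[i, j]] += 1
                    elif (i, j) in resp:
                        counts += resp[(i, j)]         # expected (soft) count from the E-step
                theta[(c, j)] = counts / counts.sum()
        # E-step: posterior over each missing value, given the class, is theta itself
        for i in range(n):
            for j in range(d):
                if X[i, j] < 0:
                    resp[(i, j)] = theta[(y[i], j)]

    X_imp = X.copy()
    for (i, j), p in resp.items():
        X_imp[i, j] = int(p.argmax())                  # impute with the most probable value
    return X_imp
```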

17.
Data with missing values, or incomplete information, brings some challenges to the development of classification, as the incompleteness may significantly affect the performance of classifiers. In this paper, we handle missing values in both training and test sets with uncertainty and imprecision reasoning by proposing a new belief combination of classifier (BCC) method based on the evidence theory. The proposed BCC method aims to improve the classification performance of incomplete data by characterizing the uncertainty and imprecision brought by incompleteness. In BCC, different attributes are regarded as independent sources, and the collection of each attribute is considered as a subset. Then, multiple classifiers are trained with each subset independently and allow each observed attribute to provide a sub-classification result for the query pattern. Finally, these sub-classification results with different weights (discounting factors) are used to provide supplementary information to jointly determine the final classes of query patterns. The weights consist of two aspects: global and local. The global weight calculated by an optimization function is employed to represent the reliability of each classifier, and the local weight obtained by mining attribute distribution characteristics is used to quantify the importance of observed attributes to the pattern classification. Abundant comparative experiments including seven methods on twelve datasets are executed, demonstrating the out-performance of BCC over all baseline methods in terms of accuracy, precision, recall, F1 measure, with pertinent computational costs.
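The optimization of the global weights is beyond a short example, but the final fusion step, discounting each attribute-wise classifier's output by a reliability factor and combining the resulting mass functions with Dempster's rule, can be sketched as follows. The focal elements are assumed to be singletons plus the full frame, and the discount factors are taken as given.

```python
import numpy as np

def discount(prob, alpha):
    """Turn class probabilities into a BBA: m({c}) = alpha * p(c), m(Theta) = 1 - alpha."""
    return alpha * np.asarray(prob, dtype=float), 1.0 - alpha

def combine_dempster(bbas):
    """Dempster's rule for BBAs whose focal elements are singletons plus the frame Theta."""
    m, m_theta = bbas[0]
    for m2, m2_theta in bbas[1:]:
        new = m * m2 + m * m2_theta + m_theta * m2        # agreement on each singleton class
        new_theta = m_theta * m2_theta
        conflict = m.sum() * m2.sum() - (m * m2).sum()    # mass assigned to disagreeing singletons
        norm = 1.0 - conflict
        m, m_theta = new / norm, new_theta / norm
    return m, m_theta

# example: three attribute-wise classifiers, discounted by assumed reliability weights
outputs = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.6, 0.1, 0.3]]
weights = [0.9, 0.6, 0.8]
m, m_theta = combine_dempster([discount(p, a) for p, a in zip(outputs, weights)])
print("fused singleton masses:", m, "frame mass:", m_theta)
```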

18.
张安珍  李建中  高宏 《软件学报》2020,31(2):406-420
This paper studies aggregate query processing over incomplete data under symbolic semantics. Incomplete data, also known as missing data, contains two types of missing values: imputable and non-imputable. Existing imputation algorithms cannot guarantee the accuracy of query results after imputation, so this paper instead gives interval estimates of aggregate query results over incomplete data. It extends the traditional relational database model with symbolic semantics and proposes a general incomplete database model that can handle both imputable and non-imputable missing values. Under this model, a new semantics for aggregate query results over incomplete data is proposed: reliable results. A reliable result is an interval estimate of the true query result and guarantees, with high probability, that the true result lies within the estimated interval. Linear-time methods are given for computing the reliable results of SUM, COUNT, and AVG queries. Extensive experiments on real and synthetic datasets verify the effectiveness of the proposed methods.
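The paper's symbolic model is richer than this, but the flavour of an interval ("reliable") answer can be illustrated with a simple bound computation in which every missing value is assumed to carry a known [lo, hi] range that contains it with high probability; this is a much-simplified illustration, not the paper's method.

```python
def reliable_aggregates(values, ranges):
    """values: list of floats, with None for missing entries;
    ranges: dict index -> (lo, hi) interval assumed to contain the missing value.
    Returns interval estimates for SUM, COUNT and AVG over all rows."""
    lo_sum = hi_sum = 0.0
    for i, v in enumerate(values):
        if v is not None:
            lo_sum += v
            hi_sum += v
        else:
            lo, hi = ranges[i]        # every missing value must have a bound here
            lo_sum += lo
            hi_sum += hi
    n = len(values)
    return {
        "SUM": (lo_sum, hi_sum),
        "COUNT": (n, n),              # exact when row existence is certain
        "AVG": (lo_sum / n, hi_sum / n),
    }

# reliable_aggregates([3.0, None, 5.0], {1: (1.0, 4.0)})  ->  SUM in [9, 12], AVG in [3, 4]
```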

19.
Numerous industrial and research databases include missing values. It is not uncommon to encounter databases that have up to a half of the entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data. A common way of dealing with this problem is to impute (fill-in) the missing values. This paper evaluates how the choice of different imputation methods affects the performance of classifiers that are subsequently used with the imputed data. The experiments here focus on discrete data. This paper studies the effect of missing data imputation using five single imputation methods (a mean method, a Hot deck method, a Naïve-Bayes method, and the latter two methods with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on classification accuracy for six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naïve-Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy when compared to classification without imputation. Although the results show that there is no universally best imputation method, Naïve-Bayes imputation is shown to give the best results for the RIPPER classifier for datasets with high amount (i.e., 40% and 50%) of missing data, polytomous regression imputation is shown to be the best for support vector machine classifier with polynomial kernel, and the application of the imputation framework is shown to be superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of the imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for the mean imputation, improve classification error for data with more than 10% of missing data. Finally, some classifiers such as C4.5 and Naïve-Bayes were found to be missing data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers such as K-nearest-neighbor, SVMs and RIPPER benefit from the imputation.
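As a concrete baseline for the kind of comparison described above, the two simplest single-imputation strategies, mean and most-frequent (the latter used here as a rough stand-in for hot-deck on discrete data), can be evaluated by downstream classification accuracy with scikit-learn; the classifier, fold count, and strategies chosen here are illustrative, not the study's setup.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def compare_imputers(X, y):
    """X: (n, d) with np.nan for missing entries; y: class labels.
    Compare mean vs. most-frequent single imputation by downstream accuracy,
    imputing inside the cross-validation pipeline to avoid leakage."""
    results = {}
    for strategy in ("mean", "most_frequent"):
        pipe = make_pipeline(SimpleImputer(strategy=strategy), GaussianNB())
        results[strategy] = cross_val_score(pipe, X, y, cv=5).mean()
    return results
```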

20.
The handling of missing values is a topic of growing interest in the software quality modeling domain. Data values may be absent from a dataset for numerous reasons, for example, the inability to measure certain attributes. As software engineering datasets are sometimes small in size, discarding observations (or program modules) with incomplete data is usually not desirable. Deleting data from a dataset can result in a significant loss of potentially valuable information. This is especially true when the missing data is located in an attribute that measures the quality of the program module, such as the number of faults observed in the program module during testing and after release. We present a comprehensive experimental analysis of five commonly used imputation techniques. This work also considers three different mechanisms governing the distribution of missing values in a dataset, and examines the impact of noise on the imputation process. To our knowledge, this is the first study to thoroughly evaluate the relationship between data quality and imputation. Further, our work is unique in that it employs a software engineering expert to oversee the evaluation of all of the procedures and to ensure that the results are not inadvertently influenced by poor quality data. Based on a comprehensive set of carefully controlled experiments, we conclude that Bayesian multiple imputation and regression imputation are the most effective techniques, while mean imputation performs extremely poorly. Although a preliminary evaluation has been conducted using Bayesian multiple imputation in the empirical software engineering domain, this is the first work to provide a thorough and detailed analysis of this technique. Our studies also demonstrate conclusively that the presence of noisy data has a dramatic impact on the effectiveness of imputation techniques.
