Numerous industrial and research databases include missing values. It is not uncommon to encounter databases that have up to a half of the entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data. A common way of dealing with this problem is to impute (fill in) the missing values. This paper evaluates how the choice of different imputation methods affects the performance of classifiers that are subsequently used with the imputed data. The experiments here focus on discrete data. This paper studies the effect of missing data imputation using five single imputation methods (a mean method, a Hot deck method, a Naïve-Bayes method, and the latter two methods with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on classification accuracy for six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naïve-Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy when compared to classification without imputation. Although the results show that there is no universally best imputation method, Naïve-Bayes imputation is shown to give the best results for the RIPPER classifier for datasets with a high amount (i.e., 40% and 50%) of missing data, polytomous regression imputation is shown to be the best for the support vector machine classifier with polynomial kernel, and the application of the imputation framework is shown to be superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of the imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for the mean imputation, improve classification error for data with more than 10% of missing data. Finally, some classifiers such as C4.5 and Naïve-Bayes were found to be missing data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers such as K-nearest-neighbor, SVMs and RIPPER benefit from the imputation.
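As an illustration of the simplest kind of single imputation mentioned in this abstract, the following minimal sketch fills each missing discrete value with the most frequent observed value in its column. The abstract does not define its "mean method" in detail, so treating it as mode imputation for discrete attributes is an assumption, as are the `None` missing-value marker and the function name `impute_mode`.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing entry


def impute_mode(rows):
    """Fill missing entries in each column with that column's most frequent observed value."""
    n_cols = len(rows[0])
    fill = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        # most_common(1) returns [(value, count)] for the most frequent value
        fill.append(Counter(observed).most_common(1)[0][0])
    return [[fill[j] if v is MISSING else v for j, v in enumerate(r)] for r in rows]


# Toy discrete dataset (hypothetical values) with two missing entries
data = [
    ["red", "small"],
    ["red", None],
    [None, "large"],
    ["blue", "large"],
]
completed = impute_mode(data)
```

A classifier that requires complete data could then be trained on `completed` instead of `data`.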

We introduce Learn++.MF, an ensemble-of-classifiers based algorithm that employs random subspace selection to address the missing feature problem in supervised classification. Unlike most established approaches, Learn++.MF does not replace missing values with estimated ones, and hence does not need specific assumptions on the underlying data distribution. Instead, it trains an ensemble of classifiers, each on a random subset of the available features. Instances with missing values are classified by the majority voting of those classifiers whose training data did not include the missing features. We show that Learn++.MF can accommodate a substantial amount of missing data, with only a gradual decline in performance as the amount of missing data increases. We also analyze the effect of the cardinality of the random feature subsets and the ensemble size on algorithm performance. Finally, we discuss the conditions under which the proposed approach is most effective.
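The voting rule described in this abstract can be sketched as follows. This is not the published Learn++.MF implementation: the nearest-centroid base classifier, the uniform feature sampling, the assumption that training data are complete, and all function names are illustrative choices. Only the idea of voting among classifiers that never saw the missing features comes from the abstract.

```python
import random
from collections import Counter


def train_centroids(X, y, features):
    """Nearest-centroid base classifier restricted to the given feature indices."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, f in enumerate(features):
            acc[k] += row[f]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_one(centroids, features, row):
    """Return the label whose centroid is closest on the selected features."""
    def dist(c):
        return sum((row[f] - c[k]) ** 2 for k, f in enumerate(features))
    return min(centroids, key=lambda label: dist(centroids[label]))


def train_ensemble(X, y, n_classifiers, subset_size, seed=0):
    """Train each classifier on a random subset of the feature indices."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for _ in range(n_classifiers):
        features = sorted(rng.sample(range(n_features), subset_size))
        ensemble.append((features, train_centroids(X, y, features)))
    return ensemble


def classify(ensemble, row):
    """Majority vote among classifiers that use none of the missing (None) features."""
    votes = Counter(
        predict_one(centroids, features, row)
        for features, centroids in ensemble
        if all(row[f] is not None for f in features)
    )
    # No usable classifier means the instance cannot be classified
    return votes.most_common(1)[0][0] if votes else None


# Toy data (hypothetical): class "a" lies near 0, class "b" near 10
X = [[0, 1, 0, 1], [1, 0, 1, 0], [10, 9, 10, 9], [9, 10, 9, 10]]
y = ["a", "a", "b", "b"]
ensemble = train_ensemble(X, y, n_classifiers=20, subset_size=2)
label = classify(ensemble, [0.2, None, 0.4, 0.1])  # feature 1 is missing
```

Smaller feature subsets leave more classifiers usable for a given pattern of missing features, at the cost of weaker individual classifiers, which is the trade-off the abstract refers to when it analyzes subset cardinality.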

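To address the low classification accuracy caused by the high dimensionality and small sample size of hyperspectral remote sensing images, a hyperspectral image ensemble classification algorithm based on dominant set (DS) clustering (DSCEA) is proposed. First, in light of the characteristics of hyperspectral data, a certain number of bands are randomly selected from the full set of bands to form different training samples. Second, the spatial-spectral information of the image is analyzed, an undirected weighted graph is constructed, and DS clustering is used to obtain the band subsets with the greatest feature diversity. Finally, for the different samples, support vector machines are used to train diverse individual classifiers, which are combined into the final classifier by majority voting to classify the hyperspectral remote sensing image. DSCEA reaches a classification accuracy of up to 84.61% on the Indian Pines dataset and up to 91.89% on the Pavia University dataset. The experimental results show that DSCEA can effectively solve the hyperspectral classification problem.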

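To address the high computational and time complexity of SVM methods, an adaptive pruning LS-SVM algorithm is proposed. By alternating block incremental learning, a pruning process, and inverse learning, the algorithm greatly reduces the number of support vectors and lowers its computational and time complexity. Experimental results show that, compared with the standard C-SVM algorithm, an intrusion detection model using this algorithm performs well in terms of detection time and detection accuracy.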

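In power systems, using image recognition to read digital meters that have no data transmission interface helps raise the level of system automation and supports safe operation. This paper introduces the image processing pipeline and the method for recognizing the values displayed on digital meters, and explains the basic principle of the support vector machine method. One-versus-rest and one-versus-one strategies are each used to combine multiple binary classifiers to solve the 10-class digit recognition problem, and the two resulting multi-class classifiers are used to recognize meter display values. Finally, the recognition results of the support vector machine method are compared with those of other methods. The experimental results show that the support vector machine method achieves a higher recognition rate.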
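The two strategies for combining binary classifiers named in this abstract differ in how many binary problems they create for the 10 digit classes. The sketch below enumerates both sets of problems and shows one-versus-one majority voting. The `pairwise_winner` function is a hypothetical stand-in for trained binary SVMs and is not taken from the paper.

```python
from collections import Counter
from itertools import combinations

classes = list(range(10))  # digits 0-9

# One-versus-rest: one binary problem per class (that class against all others)
ovr_tasks = [(c, [o for o in classes if o != c]) for c in classes]

# One-versus-one: one binary problem per unordered pair of classes
ovo_tasks = list(combinations(classes, 2))


def ovo_vote(pairwise_winner, classes):
    """Return the class that wins the most pairwise decisions."""
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def pairwise_winner(a, b):
    """Hypothetical stand-in for a trained binary SVM: prefers the digit closer to 7."""
    return a if abs(a - 7) <= abs(b - 7) else b


predicted = ovo_vote(pairwise_winner, classes)
```

With 10 classes, one-versus-rest needs 10 binary classifiers and one-versus-one needs 45, though each one-versus-one classifier is trained on the samples of only two classes.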

The self-regulation ability of the elderly degenerates substantially with age, and the elderly are often exposed to a high risk of heart disorders. In practice, electrocardiography (ECG) is a well-known non-invasive procedure used to record heart rhythms and diagnose unusual heart diseases. In this paper, we propose a healthcare management system, named CardiaGuard, which specializes in monitoring and analyzing heart disorder events in the elderly. The CardiaGuard cloud service is an expert system built on a hybrid classifier that combines support vector machine (SVM) and random tree (RT) classification algorithms. We conduct a comprehensive performance evaluation, which shows that the proposed hybrid classification engine is able to detect six types of cardiac disorders with a higher accuracy rate than the SVM-based classifier alone. CardiaGuard offers a promising solution for enhancing the quality of clinical practice in healthcare management for the elderly in cardiology.

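To address the increasingly serious harm caused by spam in e-mail applications, machine-learning-based spam filtering is becoming one of the research hotspots in Internet applications. Based on an analysis of existing machine-learning-based spam processing methods, and taking into account the characteristics of Chinese information processing, a Chinese spam filtering method based on the support vector machine (SVM) is proposed, designed, and implemented. Experiments show that, with limited samples, the SVM-based Chinese spam filtering method achieves high accuracy and stability.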

Fabien  Grard 《Neurocomputing》2008,71(7-9):1578-1594
For classification, support vector machines (SVMs) have recently been introduced and quickly became the state of the art. Now, the incorporation of prior knowledge into SVMs is the key element that allows performance to be increased in many applications. This paper gives a review of the current state of research regarding the incorporation of two general types of prior knowledge into SVMs for classification. The particular forms of prior knowledge considered here are presented in two main groups: class-invariance and knowledge on the data. The first one includes invariances to transformations, to permutations and in domains of input space, whereas the second one contains knowledge on unlabeled data, the imbalance of the training set or the quality of the data. The methods are then described and classified into the three categories that have been used in the literature: sample methods based on the modification of the training data, kernel methods based on the modification of the kernel, and optimization methods based on the modification of the problem formulation. A recent method, developed for support vector regression, considers prior knowledge on arbitrary regions of the input space. It is presented here as applied to the classification case. A discussion is then conducted to regroup sample and optimization methods under a regularization framework.

A feature representation method for large-scale text classification
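With the rapid development of networks and information technology, text classification has become a key technology for processing and organizing large volumes of document data. The feature representation of text severely limits improvements in text classification performance. Building on the classical vector space model and the tf-idf weighting formula, an improved weighting formula aimed at text classification, the p-idf formula, is proposed. After comparing four typical text classifiers (Bayes, K-nearest-neighbor, neural network, and support vector machine), a text classification experimental system was built using the support vector machine classifier. Systematic experiments compared the effect of the three weighting formulas tf-idf, p-idf, and LTC on classifier performance in the text classification system, confirming the soundness and effectiveness of the proposed p-idf formula.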
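For reference, the classical tf-idf weighting that this abstract takes as its starting point can be sketched as below. The abstract does not give the p-idf or LTC formulas, so they are not reproduced here. The variant shown (raw term count times log(N / df)) is one common form among several, and the function name and toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy tokenized corpus (hypothetical)
docs = [
    ["svm", "text", "classification"],
    ["text", "mining", "text"],
    ["svm", "kernel"],
]
w = tf_idf(docs)
```

Each resulting dictionary is a sparse vector in the vector space model. A term that occurs in every document receives weight 0 under this variant, since log(N / N) = 0.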
