K-means-Based Sample Pruning for Class-Skewed KNN in Text Categorization
Cite this article: LIU Hai-feng, YAO Ze-qing, SU Zhan, ZHANG Xue-ren. K-means-Based Sample Pruning for Class-Skewed KNN in Text Categorization[J]. Microelectronics & Computer, 2012, 29(5): 24-28.
Authors: LIU Hai-feng  YAO Ze-qing  SU Zhan  ZHANG Xue-ren
Affiliation: Institute of Sciences, PLA University of Science and Technology, Nanjing 210007, Jiangsu, China
Abstract: KNN is a classical text categorization algorithm. The number of training samples and the class density are the main bottlenecks affecting its performance, and reasonable sample pruning can improve classifier efficiency. This paper proposes an improved, clustering-based KNN classification model. First, the training set is clustered, and it is then pruned according to the relative position of the test sample with respect to the clusters, which saves computation. Second, samples are weighted according to the sample distribution within each cluster, alleviating the density dominance of large classes. Experimental results show that the proposed sample pruning method improves the classification performance of KNN.

Keywords: K-nearest neighbor  class skew  sample pruning  clustering

A Clustering-Based Method for Reducing the Amount of Sample in KNN Text Categorization on the Category Deflection
LIU Hai-feng, YAO Ze-qing, SU Zhan, ZHANG Xue-ren. A Clustering-Based Method for Reducing the Amount of Sample in KNN Text Categorization on the Category Deflection[J]. Microelectronics & Computer, 2012, 29(5): 24-28.
Authors: LIU Hai-feng  YAO Ze-qing  SU Zhan  ZHANG Xue-ren
Affiliation: Institute of Sciences, PLA University of Science and Technology, Nanjing 210007, China
Abstract: KNN is one of the classical algorithms for text categorization. The number of training samples and the class density are the primary bottlenecks of the algorithm, and a reasonable method for reducing the amount of training data can improve classification efficiency. This paper proposes an improved KNN model based on clustering. First, the training samples are grouped into clusters, and part of the training set is removed according to the distance between the test sample and the clusters in order to save computing cost. Second, taking the category distribution within each cluster into account, a weighting method is introduced to overcome the defect that larger classes dominate the KNN vote. Experimental results show that the improved KNN algorithm achieves better classification performance.
Keywords: K-nearest neighbor  category deflection  sample selection  clustering
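
The abstract describes the two steps only at a high level. As an illustrative sketch, not the paper's actual pruning rule or weighting formula, the following Python code shows one plausible reading: K-means clustering of the training set, pruning to the clusters whose centroids lie nearest the test document, and weighting the remaining samples by the inverse of their within-cluster class frequency before k-NN voting. The use of scikit-learn's KMeans, Euclidean distance, and the names and parameters fit_clusters, sample_weights, predict, n_keep_clusters, and k are all assumptions introduced here.

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans  # assumption: scikit-learn K-means, not the paper's own implementation

def fit_clusters(X_train, n_clusters=10, seed=0):
    """Step 1 (sketch): cluster the training vectors with K-means."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = km.fit_predict(X_train)          # cluster index of each training sample
    return km, cluster_ids

def sample_weights(y_train, cluster_ids):
    """Step 2 (sketch): weight each sample by the inverse of its class
    frequency inside its own cluster, so dense majority classes do not
    dominate the k-NN vote (hypothetical weighting, not the paper's formula)."""
    w = np.ones(len(y_train))
    for c in np.unique(cluster_ids):
        idx = np.where(cluster_ids == c)[0]
        counts = Counter(y_train[i] for i in idx)  # class histogram within cluster c
        for i in idx:
            w[i] = 1.0 / counts[y_train[i]]
    return w

def predict(x, X_train, y_train, km, cluster_ids, w, n_keep_clusters=3, k=5):
    """Prune to the clusters nearest the test vector x, then run weighted
    k-NN voting over the surviving training samples (NumPy arrays assumed)."""
    d_centroid = np.linalg.norm(km.cluster_centers_ - x, axis=1)
    keep = np.argsort(d_centroid)[:n_keep_clusters]    # keep only the nearest clusters
    mask = np.isin(cluster_ids, keep)
    Xs, ys, ws = X_train[mask], y_train[mask], w[mask]
    d = np.linalg.norm(Xs - x, axis=1)
    votes = Counter()
    for i in np.argsort(d)[:k]:                        # k nearest pruned samples
        votes[ys[i]] += ws[i]
    return votes.most_common(1)[0][0]

Usage would follow the abstract's order: km, cids = fit_clusters(X_train); w = sample_weights(y_train, cids); label = predict(x, X_train, y_train, km, cids, w). Pruning to a few clusters is what saves distance computations; the weights only affect the vote, not the neighbor search.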
This article has been indexed in CNKI, Wanfang Data, and other databases.