
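File type identification algorithm based on principal component analysis and K-nearest neighbors
Cite this article: YAN Mengdi, QIN Linlin, WU Gang. File type identification algorithm based on principal component analysis and K-nearest neighbors[J]. Journal of Computer Applications, 2016, 36(11): 3161-3164. DOI: 10.11772/j.issn.1001-9081.2016.11.3161
Authors: YAN Mengdi (鄢梦迪), QIN Linlin (秦琳琳), WU Gang (吴刚)
Affiliation: Institute of Information Science and Technology, University of Science and Technology of China, Hefei 230022, China
Funding: Supported by the Fundamental Research Funds for the Central Universities (WK2100100024).
Abstract: Identifying file types by file extension or by file signature has a high misclassification rate. To address this, a file type identification algorithm that combines principal component analysis (PCA) with K-nearest neighbors (KNN) is proposed, building on content-based file type identification. First, PCA is applied to preprocess the samples and reduce the dimensionality of the sample space. Next, the dimension-reduced training set is clustered so that each file type is represented by cluster centroids. Finally, to counter the classification errors that an unevenly distributed training set may cause, a distance-weighted KNN algorithm is proposed. Experimental results show that, with a large number of samples, the improved algorithm reduces the computational complexity of classification while maintaining a high identification accuracy; moreover, it does not rely on file type signatures, so it is more widely applicable.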
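
The keywords of this record name byte frequency distribution as the content feature. As a minimal sketch of what such a feature could look like, each file can be mapped to a 256-dimensional vector of relative byte frequencies, which is the kind of sample that PCA would then reduce. The function name `byte_frequency` and the normalization by file length are illustrative assumptions; the abstract does not give these details.

```python
def byte_frequency(data):
    """Return a 256-element list: the share of each byte value in `data`.

    Assumption for illustration: counts are normalized by the file length,
    so files of different sizes give comparable vectors.
    """
    counts = [0] * 256
    for b in data:          # iterating over bytes yields ints in 0..255
        counts[b] += 1
    total = len(data) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]


if __name__ == "__main__":
    vec = byte_frequency(b"aab")
    # 'a' is byte 97 and 'b' is byte 98
    print(len(vec), vec[97], vec[98])
```

In practice the input would be the raw bytes of a file, for example `open(path, "rb").read()`, where `path` is a placeholder.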

Keywords: file type identification; byte frequency distribution; principal component analysis; K-nearest neighbors
Received: 2016-04-29
Revised: 2016-06-30

File type detection algorithm based on principal component analysis and K nearest neighbors
YAN Mengdi, QIN Linlin, WU Gang. File type detection algorithm based on principal component analysis and K nearest neighbors[J]. Journal of Computer Applications, 2016, 36(11): 3161-3164. DOI: 10.11772/j.issn.1001-9081.2016.11.3161
Authors: YAN Mengdi, QIN Linlin, WU Gang
Affiliation: Institute of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui 230022, China
Abstract: To address the high misidentification rate of file type detection based on file extensions and file signatures, a new content-based file type detection algorithm was proposed that combines Principal Component Analysis (PCA) and K Nearest Neighbors (KNN). First, PCA was used to reduce the dimension of the sample space. Then, the training samples were clustered so that each file type was represented by cluster centroids. To reduce the error caused by unbalanced training samples, a distance-weighted KNN algorithm was proposed. The experimental results show that, with a large number of training samples, the improved algorithm reduces computational complexity while maintaining a high recognition accuracy. Because the algorithm does not depend on file type signatures, it can be applied more widely.
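
The centroid and distance-weighting steps could be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `centroid` simply averages a group of vectors (the abstract does not name the clustering method or the number of centroids per file type), and each of the k nearest prototypes votes with weight 1/(d + eps), one common form of distance weighting (the abstract does not give the weighting formula). The names `centroid` and `weighted_knn` and the toy 2-D data are hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def weighted_knn(query, prototypes, k=3, eps=1e-9):
    """Classify `query` by a distance-weighted vote among the k nearest prototypes.

    `prototypes` is a list of (vector, label) pairs, e.g. cluster centroids.
    Assumed weighting: each neighbor contributes 1 / (distance + eps).
    """
    nearest = sorted(
        (math.dist(query, vec), label) for vec, label in prototypes
    )[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d + eps)
    return max(votes, key=votes.get)


if __name__ == "__main__":
    # Unbalanced toy set: one "txt" prototype, three "jpg" prototypes.
    prototypes = [
        ([0.0, 0.0], "txt"),
        ([1.0, 1.0], "jpg"),
        ([1.2, 1.0], "jpg"),
        ([1.0, 1.2], "jpg"),
    ]
    # A plain majority vote over the 3 nearest would pick "jpg" (2 of 3);
    # weighting by inverse distance favors the much closer "txt" prototype.
    print(weighted_knn([0.3, 0.3], prototypes, k=3))  # prints "txt"
```

The toy data shows why weighting matters when one class has more prototypes than another: the two "jpg" neighbors are outvoted because they are more than twice as far from the query as the single "txt" neighbor.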
Keywords: file type identification, byte frequency distribution, Principal Component Analysis (PCA), K Nearest Neighbors (KNN)