A Method of Extracting Malware Features Based on Probabilistic Topic Model (一种基于概率主题模型的恶意代码特征提取方法)

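Cite this article:	Liu Yashu (刘亚姝), Wang Zhihai (王志海), Hou Yueran (侯跃然), Yan Hanbing (严寒冰). A Method of Extracting Malware Features Based on Probabilistic Topic Model[J]. Journal of Computer Research and Development (计算机研究与发展), 2019, 56(11): 2339-2348. DOI: 10.7544/issn1000-1239.2019.20190393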

Authors:	Liu Yashu (刘亚姝), Wang Zhihai (王志海), Hou Yueran (侯跃然), Yan Hanbing (严寒冰)

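Affiliations:	1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044; 2. School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044; 3. Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876; 4. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029 (ly_s8020@163.com)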

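Funding:	National Key Research and Development Program of China; National Key Research and Development Program of China; National Natural Science Foundation of China; National Natural Science Foundation of China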

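Abstract:	In today's complex network environment, malicious code spreads rapidly through many channels, intrudes on user terminals and network devices, and illegally steals users' private data, posing a serious security threat to networks and Internet users. Traditional detection methods struggle to detect unknown malicious code, and the diversity and sheer number of malware variants make detecting unknown malicious code a major challenge. This paper proposes an unsupervised malware identification method. It analyzes disassembled PE files to derive standardization rules for assembly instructions, then applies latent Dirichlet allocation (LDA) to obtain the latent “document-topic” and “topic-word” distributions in the assembly instructions. The topic distribution is used to construct the features of each malicious sample, yielding a new malware detection framework. By combining perplexity with a varying step size, the paper gives a method for quickly evaluating and automatically determining the optimal number of topics, which solves the problem that the number of topics in an LDA model must be specified in advance. It also analyzes the semantic interpretability of the “document-topic” and “topic-word” clustering results, showing that the sample features obtained by the method carry latent semantics. Experimental results show that, compared with other methods, the method has comparable or better ability to discriminate malicious code and can accurately identify new malware variants.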
Keywords:	malware detection; Dirichlet distribution; probabilistic topic model; perplexity; Gibbs
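
The pipeline the abstract describes (standardize assembly instructions, fit LDA, use each sample's topic distribution as its feature vector) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `normalize` rules, the register and immediate patterns, the hyperparameters `alpha` and `beta`, and the toy samples are all assumptions, since the abstract does not give the actual standardization rules. LDA is fitted here with a plain collapsed Gibbs sampler.

```python
import random
import re

# Hypothetical patterns; the paper's real standardization rules are not in the abstract.
REGISTER = re.compile(r"e?[abcd]x|[abcd][lh]|e?[sd]i|e?[sb]p")
IMMEDIATE = re.compile(r"0x[0-9a-f]+|[0-9][0-9a-f]*h?")


def normalize(instruction):
    """Reduce one disassembled instruction to its opcode plus operand classes."""
    parts = instruction.strip().lower().split(None, 1)
    token = [parts[0]]
    for operand in (parts[1].split(",") if len(parts) > 1 else []):
        operand = operand.strip()
        if "[" in operand:
            token.append("mem")
        elif REGISTER.fullmatch(operand):
            token.append("reg")
        elif IMMEDIATE.fullmatch(operand):
            token.append("imm")
        else:
            token.append("sym")  # labels, function names, anything else
    return "_".join(token)


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return (theta, phi, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: v for v, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    def bump(d, k, v, delta):
        doc_topic[d][k] += delta
        topic_word[k][v] += delta
        topic_total[k] += delta

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assignments[d]):
            bump(d, k, index[w], +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = index[w]
                bump(d, assignments[d][i], v, -1)  # remove the token's current topic
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                bump(d, k, v, +1)

    # Smoothed "document-topic" and "topic-word" distributions.
    theta = [[(doc_topic[d][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for d, doc in enumerate(docs)]
    phi = [[(topic_word[k][v] + beta) / (topic_total[k] + n_words * beta)
            for v in range(n_words)] for k in range(n_topics)]
    return theta, phi, vocab


# Toy "samples": each is the instruction list of one disassembled file.
samples = [
    ["push ebp", "mov ebp, esp", "mov eax, [ebp+8]", "call sub_401000", "ret"],
    ["xor eax, eax", "mov ecx, 0x10", "add eax, ecx", "jmp loc_401020"],
]
docs = [[normalize(ins) for ins in sample] for sample in samples]
theta, phi, vocab = lda_gibbs(docs, n_topics=2, n_iter=50)
```

Each row `theta[d]` is a probability vector over topics and would serve as the feature vector of sample `d` for a downstream classifier or clustering step.
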
A Method of Extracting Malware Features Based on Probabilistic Topic Model

Liu Yashu, Wang Zhihai, Hou Yueran, Yan Hanbing. A Method of Extracting Malware Features Based on Probabilistic Topic Model[J]. Journal of Computer Research and Development, 2019, 56(11): 2339-2348. DOI: 10.7544/issn1000-1239.2019.20190393

Authors:	Liu Yashu, Wang Zhihai, Hou Yueran, Yan Hanbing

Affiliation:	1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044; 2. School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044; 3. Institute of Network Technology, Beijing University of Posts and Telecommunications, Beijing 100876; 4. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029

Abstract:	In today's complex network environment, malicious code spreads quickly by many routes, intruding on user terminals and network equipment and illegally stealing private data, which poses a serious security threat to networks and Internet users. Traditional detection methods have difficulty detecting unknown malicious code, and the diversity and sheer number of malware variants make the task harder still. We propose an unsupervised malware identification approach that derives standardization rules for assembly instructions by analyzing disassembled PE files. Using latent Dirichlet allocation (LDA), the method extracts the latent “document-topic” and “topic-word” probability distributions from the samples. The topic distributions serve as sample features, a new way of representing malware, and form the basis of a new malware detection framework. The method also removes the need to specify the number of topics in the LDA model beforehand: it combines perplexity with a varying step size to evaluate and determine the best number of topics quickly and automatically. Finally, we analyze the semantic interpretability of the “document-topic” and “topic-word” clustering results over assembly instructions, showing that the features obtained carry latent semantics. Experimental results show that the method discriminates malware as well as or better than other methods and accurately identifies new malware variants.

Keywords:	malware detection; latent Dirichlet allocation (LDA); probabilistic topic model; perplexity; Gibbs
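
The abstract states that the number of topics is chosen automatically from perplexity using a varying step, but it does not spell out the procedure. The sketch below shows one plausible reading: the standard LDA perplexity, plus a coarse-to-fine scan that halves the step around the best candidate. `search_topic_count`, its parameters, and the synthetic perplexity curve in the usage lines are assumptions made for illustration, and the scan only finds the global minimum when the perplexity curve is roughly unimodal.

```python
import math


def perplexity(docs, theta, phi, index):
    """exp(-sum(log p(w|d)) / N), where p(w|d) = sum_k theta[d][k] * phi[k][w]."""
    log_likelihood, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            v = index[w]
            p = sum(theta[d][k] * phi[k][v] for k in range(len(phi)))
            log_likelihood += math.log(p)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)


def search_topic_count(perplexity_of, k_min, k_max, step):
    """Scan a coarse grid of topic counts, then halve the step around the best one.

    perplexity_of(k) is any callable that fits a model with k topics and
    returns its perplexity; results are cached so no k is fitted twice.
    """
    cache = {}

    def score(k):
        if k not in cache:
            cache[k] = perplexity_of(k)
        return cache[k]

    low, high = k_min, k_max
    while True:
        best = min(range(low, high + 1, step), key=score)
        if step == 1:
            return best, cache
        # Narrow the window to one step either side of the best k, refine the grid.
        low, high = max(k_min, best - step), min(k_max, best + step)
        step = max(1, step // 2)


# Synthetic stand-in for "fit LDA with k topics and return its perplexity".
best_k, evaluated = search_topic_count(lambda k: (k - 37) ** 2 + 100, 2, 100, step=16)
```

Starting with a large step keeps the number of model fits small, because only the neighbourhood of the current best candidate is revisited at finer resolution.
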
This article is indexed in Wanfang Data and other databases.