An investigation of byte n-gram features for malware classification
Authors: Edward Raff, Richard Zak, Russell Cox, Jared Sylvester, Paul Yacci, Rebecca Ward, Anna Tracy, Mark McLean, Charles Nicholas
Affiliation: 1. Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, USA; 2. Laboratory for Physical Sciences, College Park, USA
Abstract: Malware classification using machine learning algorithms is a difficult task, in part due to the absence of strong natural features in raw executable binary files. Byte n-grams have previously been used as features, but little work has been done to explain their performance or to understand what concepts are actually being learned. In contrast to other work using n-gram features, in this work we use orders of magnitude more data, and we perform feature selection during model building using Elastic-Net regularized Logistic Regression. We compute a regularization path and analyze novel multi-byte identifiers. Through this process, we discover significant previously unreported issues with byte n-gram features that cause their benefits and practicality to be overestimated. Three primary issues emerged from our work. First, we discovered a flaw in how previous corpora were created that leads to an overestimation of classification accuracy. Second, we discovered that most of the information contained in n-grams stems from string features that could be obtained in simpler ways. Finally, we demonstrate that n-gram features promote overfitting, even with linear models and extreme regularization.
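For readers unfamiliar with the term, a byte n-gram is simply a window of n consecutive bytes taken from a file. The following is a minimal illustrative sketch, not the authors' pipeline: the helper names `byte_ngrams` and `document_frequency` and the toy byte strings are assumptions made up for this example, and the Elastic-Net model from the abstract is not included.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every overlapping window of n consecutive bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def document_frequency(files, n: int) -> Counter:
    """For each n-gram, count the number of files that contain it at least once."""
    df = Counter()
    for data in files:
        # set() so that each file contributes at most 1 per n-gram
        df.update(set(byte_ngrams(data, n)))
    return df


# Toy byte strings standing in for executables (hypothetical data).
samples = [b"MZ\x90\x00kernel32.dll", b"MZ\x90\x00user32.dll"]
df = document_frequency(samples, 4)

# 4-grams that occur in every sample.
shared = sorted(g for g, count in df.items() if count == len(samples))
```

In this toy input, the 4-grams shared by both samples are the header bytes and fragments of the `.dll` suffix, which gives a rough intuition for how n-grams can end up encoding ordinary strings.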
This article is indexed in SpringerLink and other databases.