首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
We present a scalable and multi-level feature extraction technique to detect malicious executables. We propose a novel combination of three different kinds of features at different levels of abstraction. These are binary n-grams, assembly instruction sequences, and Dynamic Link Library (DLL) function calls; extracted from binary executables, disassembled executables, and executable headers, respectively. We also propose an efficient and scalable feature extraction technique, and apply this technique on a large corpus of real benign and malicious executables. The above mentioned features are extracted from the corpus data and a classifier is trained, which achieves high accuracy and low false positive rate in detecting malicious executables. Our approach is knowledge-based because of several reasons. First, we apply the knowledge obtained from the binary n-gram features to extract assembly instruction sequences using our Assembly Feature Retrieval algorithm. Second, we apply the statistical knowledge obtained during feature extraction to select the best features, and to build a classification model. Our model is compared against other feature-based approaches for malicious code detection, and found to be more efficient in terms of detection accuracy and false alarm rate.
Bhavani Thuraisingham (Corresponding author)Email:
  相似文献   

2.
Malicious executables, often spread as email attachments, impose serious security threats to computer systems and associated networks. We investigated the use of byte sequence frequencies as a way to automatically distinguish malicious from benign executables without actually executing them. In a series of experiments, we compared classification accuracies over seven feature selection methods, four classification algorithms, and variable byte sequence lengths. We found that single-byte patterns provided surprisingly reliable features to separate malicious executables from benign. Between classifiers and feature selection methods, the overall performance of the models depended more on the choice of classifier than the method of feature selection. Support vector machine (SVM) classifiers were found to be superior in terms of prediction accuracy, training time, and aversion to overfitting.  相似文献   

3.
The recent growth in network usage has motivated the creation of new malicious code for various purposes. Today’s signature-based antiviruses are very accurate for known malicious code, but can not detect new malicious code. Recently, classification algorithms were used successfully for the detection of unknown malicious code. But, these studies involved a test collection with a limited size and the same malicious: benign file ratio in both the training and test sets, a situation which does not reflect real-life conditions. We present a methodology for the detection of unknown malicious code, which examines concepts from text categorization, based on n-grams extraction from the binary code and feature selection. We performed an extensive evaluation, consisting of a test collection of more than 30,000 files, in which we investigated the class imbalance problem. In real-life scenarios, the malicious file content is expected to be low, about 10% of the total files. For practical purposes, it is unclear as to what the corresponding percentage in the training set should be. Our results indicate that greater than 95% accuracy can be achieved through the use of a training set that has a malicious file content of less than 33.3%.  相似文献   

4.
A serious security threat today are malicious executables, especially new, unseen malicious executables often arriving as email attachments. These new malicious executables are created at the rate of thousands every year1 and pose a serious threat. Current anti-virus systems attempt to detect these new malicious programs with heuristic generated by hand. This approach is costly and often ineffective. In this paper we introduce the Trojan Horse SubSeven, its capabilities and influence over intrusion detection systems. A Honey Pot program is implemented, simulating the SubSeven Server. The Honey Pot Program provides feedback and stores data to and from the SubSeven’s client.  相似文献   

5.
This research synthesizes a taxonomy for classifying detection methods of new malicious code by Machine Learning (ML) methods based on static features extracted from executables. The taxonomy is then operationalized to classify research on this topic and pinpoint critical open research issues in light of emerging threats. The article addresses various facets of the detection challenge, including: file representation and feature selection methods, classification algorithms, weighting ensembles, as well as the imbalance problem, active learning, and chronological evaluation. From the survey we conclude that a framework for detecting new malicious code in executable files can be designed to achieve very high accuracy while maintaining low false positives (i.e. misclassifying benign files as malicious). The framework should include training of multiple classifiers on various types of features (mainly OpCode and byte n-grams and Portable Executable Features), applying weighting algorithm on the classification results of the individual classifiers, as well as an active learning mechanism to maintain high detection accuracy. The training of classifiers should also consider the imbalance problem by generating classifiers that will perform accurately in a real-life situation where the percentage of malicious files among all files is estimated to be approximately 10%.  相似文献   

6.
In the literature, many feature types are proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been done. In particular, evaluations on OCR documents are very rare. In this paper we investigate seven text representations based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique which can even compare to techniques relying on a morphological analysis. This holds for OCR texts as well as for correct ASCII texts. Received February 17, 1998 / Revised April 8, 1998  相似文献   

7.
Detection of rapidly evolving malware requires classification techniques that can effectively and efficiently detect zero-day attacks. Such detection is based on a robust model of benign behavior and deviations from that model are used to detect malicious behavior. In this paper we propose a low-complexity host-based technique that uses deviations in static file attributes to detect malicious executables. We first develop simple statistical models of static file attributes derived from the empirical data of thousands of benign executables. Deviations among the attribute models of benign and malware executables are then quantified using information-theoretic (Kullback-Leibler-based) divergence measures. This quantification reveals distinguishing attributes that are considerably divergent between benign and malware executables and therefore can be used for detection. We use the benign models of divergent attributes in cross-correlation and log-likelihood frameworks to classify malicious executables. Our results, using over 4,000 malicious file samples, indicate that the proposed detector provides reasonably high detection accuracy, while having significantly lower complexity than existing detectors.  相似文献   

8.
Malicious executables are programs designed to infiltrate or damage a computer system without the owner’s consent, which have become a serious threat to the security of computer systems. There is an urgent need for effective techniques to detect polymorphic, metamorphic and previously unseen malicious executables of which detection fails in most of the commercial anti-virus software. In this paper, we develop interpretable string based malware detection system (SBMDS), which is based on interpretable string analysis and uses support vector machine (SVM) ensemble with Bagging to classify the file samples and predict the exact types of the malware. Interpretable strings contain both application programming interface (API) execution calls and important semantic strings reflecting an attacker’s intent and goal. Our SBMDS is carried out with four major steps: (1) first constructing the interpretable strings by developing a feature parser; (2) performing feature selection to select informative strings related to different types of malware; (3) followed by using SVM ensemble with bagging to construct the classifier; (4) and finally conducting the malware detector, which not only can detect whether a program is malicious or not, but also can predict the exact type of the malware. Our case study on the large collection of file samples collected by Kingsoft Anti-virus lab illustrate that: (1) The accuracy and efficiency of our SBMDS outperform several popular anti-virus software; (2) Based on the signatures of interpretable strings, our SBMDS outperforms data mining based detection systems which employ single SVM, Naive Bayes with bagging, Decision Trees with bagging; (3) Compared with the IMDS which utilizes the objective-oriented association (OOA) based classification on API calls, our SBMDS achieves better performance. Our SBMDS system has already been incorporated into the scanning tool of a commercial anti-virus software.  相似文献   

9.
Anomaly detection allows for the identification of unknown and novel attacks in network traffic. However, current approaches for anomaly detection of network packet payloads are limited to the analysis of plain byte sequences. Experiments have shown that application-layer attacks become difficult to detect in the presence of attack obfuscation using payload customization. The ability to incorporate syntactic context into anomaly detection provides valuable information and increases detection accuracy. In this contribution, we address the issue of incorporating protocol context into payload-based anomaly detection. We present a new data representation, called \({c}_n\)-grams, that allows to integrate syntactic and sequential features of payloads in an unified feature space and provides the basis for context-aware detection of network intrusions. We conduct experiments on both text-based and binary application-layer protocols which demonstrate superior accuracy on the detection of various types of attacks over regular anomaly detection methods. Furthermore, we show how \({c}_n\)-grams can be used to interpret detected anomalies and thus, provide explainable decisions in practice.  相似文献   

10.
In this paper, we propose a method for network intrusion detection based on language models. Our method proceeds by extracting language features such as n-grams and words from connection payloads and applying unsupervised anomaly detection—without prior learning phase or presence of labeled data. The essential part of this procedure is linear-time computation of similarity measures between language models of connection payloads. Particular patterns in these models decisive for differentiation of attacks and normal data can be traced back to attack semantics and utilized for automatic generation of attack signatures. Results of experiments conducted on two datasets of network traffic demonstrate the importance of high-order n-grams and variable-length language models for detection of unknown network attacks. An implementation of our system achieved detection accuracy of over 80% with no false positives on instances of recent remote-to-local attacks in HTTP, FTP and SMTP traffic.  相似文献   

11.
Nowadays malware is one of the serious problems in the modern societies. Although the signature based malicious code detection is the standard technique in all commercial antivirus softwares, it can only achieve detection once the virus has already caused damage and it is registered. Therefore, it fails to detect new malwares (unknown malwares). Since most of malwares have similar behavior, a behavior based method can detect unknown malwares. The behavior of a program can be represented by a set of called API's (application programming interface). Therefore, a classifier can be employed to construct a learning model with a set of programs' API calls. Finally, an intelligent malware detection system is developed to detect unknown malwares automatically. On the other hand, we have an appealing representation model to visualize the executable files structure which is control flow graph (CFG). This model represents another semantic aspect of programs. This paper presents a robust semantic based method to detect unknown malwares based on combination of a visualize model (CFG) and called API's. The main contribution of this paper is extracting CFG from programs and combining it with extracted API calls to have more information about executable files. This new representation model is called API-CFG. In addition, to have fast learning and classification process, the control flow graphs are converted to a set of feature vectors by a nice trick. Our approach is capable of classifying unseen benign and malicious code with high accuracy. The results show a statistically significant improvement over n-grams based detection method.  相似文献   

12.
在对恶意代码进行检测和分类时,由于传统的灰度编码方法将特征转换为图像的过程中,会产生特征分裂和精度损失等问题,严重影响了恶意代码的检测性能.同时,传统的恶意代码检测和分类的数据集中只使用了单一的恶意样本,并没有考虑到良性样本.因此,文中采用了一个包含良性样本和恶意样本的数据集,同时提出了一种双字节特征编码方法.首先将待...  相似文献   

13.
Dynamic Analysis of Malicious Code   总被引:2,自引:0,他引:2  
Malware analysis is the process of determining the purpose and functionality of a given malware sample (such as a virus, worm, or Trojan horse). This process is a necessary step to be able to develop effective detection techniques for malicious code. In addition, it is an important prerequisite for the development of removal tools that can thoroughly delete malware from an infected machine. Traditionally, malware analysis has been a manual process that is tedious and time-intensive. Unfortunately, the number of samples that need to be analyzed by security vendors on a daily basis is constantly increasing. This clearly reveals the need for tools that automate and simplify parts of the analysis process. In this paper, we present TTAnalyze, a tool for dynamically analyzing the behavior of Windows executables. To this end, the binary is run in an emulated operating system environment and its (security-relevant) actions are monitored. In particular, we record the Windows native system calls and Windows API functions that the program invokes. One important feature of our system is that it does not modify the program that it executes (e.g., through API call hooking or breakpoints), making it more difficult to detect by malicious code. Also, our tool runs binaries in an unmodified Windows environment, which leads to excellent emulation accuracy. These factors make TTAnalyze an ideal tool for quickly understanding the behavior of an unknown malware.  相似文献   

14.
Malware classification based on call graph clustering   总被引:1,自引:0,他引:1  
Each day, anti-virus companies receive tens of thousands samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away, enabling the detection of structural similarities between samples. The ability to cluster similar samples together will make more generic detection techniques possible, thereby targeting the commonalities of the samples within a cluster. To compare call graphs mutually, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Clustering experiments are conducted on a collection of real malware samples, and the results are evaluated against manual classifications provided by human malware analysts. Experiments show that it is indeed possible to accurately detect malware families via call graph clustering. We anticipate that in the future, call graphs can be used to analyse the emergence of new malware families, and ultimately to automate implementation of generic detection schemes.  相似文献   

15.
16.
基于加权信息增益的恶意代码检测方法   总被引:1,自引:0,他引:1       下载免费PDF全文
采用数据挖掘技术检测恶意代码,提出一种基于加权信息增益的特征选择方法。该方法综合考虑特征频率和信息增益的作用,能够更加准确地选取有效特征,从而提高检测性能。实现一个恶意代码检测系统,采用二进制代码的N-gram和变长N-gram作为特征提取方法,加权信息增益作为特征选择方法,使用多种分类器进行恶意代码检测。实验结果证明,该方法能有效提高恶意代码的检测率和准确率。  相似文献   

17.
Due to its damage to Internet security, malware (e.g., virus, worm, trojan) and its detection has caught the attention of both anti-malware industry and researchers for decades. To protect legitimate users from the attacks, the most significant line of defense against malware is anti-malware software products, which mainly use signature-based method for detection. However, this method fails to recognize new, unseen malicious executables. To solve this problem, in this paper, based on the instruction sequences extracted from the file sample set, we propose an effective sequence mining algorithm to discover malicious sequential patterns, and then All-Nearest-Neighbor (ANN) classifier is constructed for malware detection based on the discovered patterns. The developed data mining framework composed of the proposed sequential pattern mining method and ANN classifier can well characterize the malicious patterns from the collected file sample set to effectively detect newly unseen malware samples. A comprehensive experimental study on a real data collection is performed to evaluate our detection framework. Promising experimental results show that our framework outperforms other alternate data mining based detection methods in identifying new malicious executables.  相似文献   

18.
Justin Zobel  Philip Dart 《Software》1995,25(3):331-345
Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures and phonetic coding. We propose methods for combining these techniques, and show experimentally that these combinations yield good retrieval effectiveness while keeping index size and retrieval time low. Our experiments also suggest that, in contrast to previous claims, phonetic codings are markedly inferior to string distance measures, which are demonstrated to be suitable for both spelling correction and personal name matching.  相似文献   

19.
近些年来,层出不穷的恶意软件对系统安全构成了严重的威胁并造成巨大的经济损失,研究者提出了许多恶意软件检测方案。但恶意软件开发中常利用加壳和多态等混淆技术,这使得传统的静态检测方案如静态特征匹配不足以应对。而传统的应用层动态检测方法也存在易被恶意软件禁用或绕过的缺点。本文提出一种利用底层数据流关系进行恶意软件检测的方法,即在系统底层监视程序运行时的数据传递情况,生成数据流图,提取图的特征形成特征向量,使用特征向量衡量数据流图的相似性,评估程序行为的恶意倾向,以达到快速检测恶意软件的目的。该方法具有低复杂度与高检测效率的特点。实验结果表明本文提出的恶意软件检测方法可达到较高的检测精度以及较低的误报率,分别为98.50%及3.18%。  相似文献   

20.
提出了一个基于35维特征向量的恶意程序检测方法。特征向量的每一维用于表示一种恶意行为事件,每一事件由相应的Win32 API调用及其参数表示。实现了一个自动化行为追踪系统(Argus)用于行为特征的提取。实验数据集从8223个恶意可执行程序和2821个正常可执行程序中获取,并依据程序发生事件数的不同设立事件阈值,建立不同的训练集,分别用于训练贝叶斯分类器。实验表明,当事件阈值为3时,分类器达到最佳检测效果。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号