Found 20 similar documents; search took 15 ms.
1.
《Expert systems with applications》2014,41(13):5843-5857
The formation of new malware every day poses a significant challenge to anti-virus vendors, since anti-virus tools that use manually crafted signatures can only identify known malware instances and their relatively similar variants. To identify new and unknown malware for updating their anti-virus signature repositories, anti-virus vendors must collect new, suspicious files daily; these need to be analyzed manually by information security experts, who then label them as malicious or benign. Analyzing suspect files is a time-consuming task, and it is impossible to analyze all of them manually. Consequently, anti-virus vendors use machine learning algorithms and heuristics to reduce the number of suspect files that must be inspected manually. These techniques, however, lack an essential element: they cannot be updated daily. In this work we introduce a solution for this updatability gap. We present an active learning (AL) framework and introduce two new AL methods that assist anti-virus vendors in focusing their analytical efforts by acquiring those files that are most probably malicious. These new AL methods are designed and oriented towards new malware acquisition. To test the capability of our methods for acquiring new malware from a stream of unknown files, we conducted a series of experiments over a ten-day period. A comparison of our methods to existing high-performance AL methods and to random selection, the naive baseline, indicates that the AL methods outperformed random selection on all performance measures. Our AL methods outperformed the existing AL method in two respects, both related to the number of new malware samples acquired daily, the core measure in this study. First, our best-performing AL method, termed "Exploitation", acquired on the 9th day of the experiment about 2.6 times more malware than the existing AL method and 7.8 times more than random selection.
Second, while the existing AL method showed a decrease in the number of new malware samples acquired over the 10 days, our AL methods showed an increase and a daily improvement in the number acquired. Both results point towards increased efficiency that can assist anti-virus vendors.
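The "Exploitation" strategy, as described, acquires the files the current detector scores as most probably malicious and routes them to a human analyst. A minimal sketch of that acquisition step, under stated assumptions: the scoring model, feature extraction, and daily retraining loop from the paper are not reproduced, and the function name, file ids, and probabilities below are illustrative.

```python
# Exploitation-style acquisition sketch: from a stream of unknown files,
# pick the k files the current classifier scores as most likely malicious.
# Scores are placeholder probabilities, not outputs of the paper's model.

def acquire_most_probable_malware(scored_files, k):
    """scored_files: list of (file_id, p_malicious); returns the k top-scored ids."""
    ranked = sorted(scored_files, key=lambda pair: pair[1], reverse=True)
    return [file_id for file_id, _ in ranked[:k]]

stream = [("f1", 0.12), ("f2", 0.97), ("f3", 0.55), ("f4", 0.88)]
print(acquire_most_probable_malware(stream, 2))  # ['f2', 'f4']
```

In a full AL loop, the analyst's labels for the acquired files would then feed both the signature repository and the next day's training set.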
2.
Phurivit Sangkatsanee, Naruemon Wattanapongsakorn, Chalermpol Charnsripinyo 《Computer Communications》2011,34(18):2227-2235
The growing prevalence of network attacks is a well-known problem that can impact the availability, confidentiality, and integrity of critical information for both individuals and enterprises. In this paper, we propose a real-time intrusion detection approach using a supervised machine learning technique. Our approach is simple and efficient, and can be used with many machine learning techniques. We applied several well-known machine learning techniques to evaluate the performance of our IDS approach. Our experimental results show that the Decision Tree technique outperforms the others. We therefore further developed a real-time intrusion detection system (RT-IDS) using the Decision Tree technique to classify online network data as normal or attack data. We also identified 12 essential features of network data that are relevant to detecting network attacks, using information gain as our feature selection criterion. Our RT-IDS can distinguish normal network activities from the main attack types (Probe and Denial of Service (DoS)) with a detection rate higher than 98% within 2 s. We also developed a new post-processing procedure to reduce the false-alarm rate and increase the reliability and detection accuracy of the intrusion detection system.
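The feature selection criterion named above, information gain, measures how much knowing a feature's value reduces the entropy of the class label. A small self-contained sketch; the 12 network features and the actual IDS dataset are not reproduced, and the toy feature values below are illustrative.

```python
# Information gain: IG(S, A) = H(S) - sum_v (|S_v|/|S|) * H(S_v),
# where H is Shannon entropy and S_v is the subset with feature value v.
import math

def entropy(labels):
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    total = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# A feature that perfectly separates normal from attack traffic
# recovers the full label entropy, here H(S) = 1 bit:
print(information_gain(["low", "high", "low", "high"],
                       ["normal", "attack", "normal", "attack"]))  # 1.0
```

Ranking all candidate features by this score and keeping the top ones is the standard way such a criterion is used for selection.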
3.
A hybrid machine learning approach to network anomaly detection. Cited 3 times in total (0 self-citations, 3 by others)
Zero-day cyber attacks such as worms and spyware are becoming increasingly widespread and dangerous. Existing signature-based intrusion detection mechanisms are often insufficient for detecting these types of attacks, so anomaly intrusion detection methods have been developed to cope with them. Among the variety of anomaly detection approaches, the Support Vector Machine (SVM) is known to be one of the best machine learning algorithms for classifying abnormal behaviors. The soft-margin SVM is one of the well-known basic SVM methods using supervised learning. However, it is not appropriate for detecting novel attacks in Internet traffic, since it requires pre-acquired learning information for the supervised learning procedure: traffic labeled separately as normal or attack. We therefore also apply the one-class SVM approach, which uses unsupervised learning for detecting anomalies and does not require labeled information. However, there is a downside to the one-class SVM: it is difficult to use in the real world due to its high false-positive rate. In this paper, we propose a new SVM approach, named Enhanced SVM, which combines these two methods in order to provide unsupervised learning and a low false-alarm capability, similar to that of a supervised SVM approach. We use the following additional techniques to improve the performance of the proposed approach (referred to as Anomaly Detector using Enhanced SVM). First, we create a profile of normal packets using a Self-Organized Feature Map (SOFM), for SVM learning without pre-existing knowledge. Second, we use a packet filtering scheme based on Passive TCP/IP Fingerprinting (PTF) to reject incomplete network traffic that violates either the TCP/IP standard or the generation policies of well-known platforms.
Third, a feature selection technique using a Genetic Algorithm (GA) is used to extract optimized information from raw Internet packets. Fourth, during data preprocessing we group packets into flows based on temporal relationships, so that the temporal relationships among the inputs are considered in SVM learning. Lastly, we demonstrate the effectiveness of the Enhanced SVM approach with these techniques (SOFM, PTF, and GA) on the MIT Lincoln Lab datasets and on a live dataset captured from a real network. The experimental results are verified by m-fold cross-validation, and the proposed approach is compared with real-world Network Intrusion Detection Systems (NIDS).
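The Enhanced SVM itself combines the soft-margin and one-class SVM formulations and is not reproduced here. The sketch below only illustrates the underlying one-class idea: build a profile from unlabeled "normal" traffic and flag observations that deviate too far from it, using a simple centroid-distance stand-in rather than an SVM. All feature vectors and the threshold are illustrative.

```python
# One-class-style anomaly detection stand-in: the "profile" is the
# per-feature mean of normal traffic vectors; anything whose Euclidean
# distance from that profile exceeds a threshold is flagged as anomalous.

def build_profile(normal_vectors):
    dim = len(normal_vectors[0])
    n = len(normal_vectors)
    return [sum(v[i] for v in normal_vectors) / n for i in range(dim)]

def is_anomalous(vector, profile, threshold):
    dist = sum((a - b) ** 2 for a, b in zip(vector, profile)) ** 0.5
    return dist > threshold

profile = build_profile([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])
print(is_anomalous([5.0, 9.0], profile, threshold=2.0))  # True
print(is_anomalous([1.1, 2.1], profile, threshold=2.0))  # False
```

The high false-positive rate the abstract mentions shows up directly here: the tighter the threshold, the more borderline-normal traffic gets flagged.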
4.
5.
6.
Alexandre Rafael Lenz, Aurora Pozo, Silvia Regina Vergilio 《Engineering Applications of Artificial Intelligence》2013,26(5-6):1631-1640
Software testing techniques and criteria are considered complementary, since they can reveal different kinds of faults and test distinct aspects of the program. Functional criteria, such as Category Partition, are difficult to automate and are usually applied manually. Structural and fault-based criteria generally provide measures to evaluate test sets. The existing supporting tools produce a lot of information, including inputs and produced outputs, structural coverage, mutation score, faults revealed, etc. However, such information is not linked to functional aspects of the software. In this work, we present an approach based on machine learning techniques to link test results from the application of different testing techniques. The approach groups test data into similar functional clusters. After this, according to the tester's goals, it generates classifiers (rules) that have different uses, including selection and prioritization of test cases. The paper also presents results from experimental evaluations and illustrates such uses.
7.
The performance of eight machine learning classifiers was compared on three aphasia-related classification problems. The first problem contained naming data of aphasic and non-aphasic speakers tested with the Philadelphia Naming Test. The second problem included the naming data of patients with Alzheimer's disease or vascular disease tested with the Finnish version of the Boston Naming Test. The third problem included aphasia test data of patients suffering from four different aphasic syndromes tested with the Aachen Aphasia Test. The first two data sets were small; therefore, the data used in the tests were artificially generated from the original confrontation naming data of 23 and 22 subjects, respectively. The third set contained aphasia test data of 146 aphasic speakers and was used as such in the experiments. With the first and third data sets the classifiers could successfully be used for the task, while the results with the second data set were less encouraging. However, no single classifier performed exceptionally well with all data sets, suggesting that the classifier used for classification of aphasic data should be selected based on experiments performed with the data set at hand.
8.
Alok R. Chaturvedi George K. Hutchinson Derek L. Nazareth 《Journal of Intelligent Manufacturing》1992,3(1):43-57
This paper describes a synergistic approach that is applicable to a wide variety of system control problems. The approach utilizes a machine learning technique, goal-directed conceptual aggregation (GDCA), to facilitate dynamic decision-making. The application domain employed is Flexible Manufacturing System (FMS) scheduling and control. Simulation is used for the dual purpose of providing a realistic depiction of FMSs, and serves as an engine for demonstrating the viability of a synergistic system involving incremental learning. The paper briefly describes prior approaches to FMS scheduling and control, and machine learning. It outlines the GDCA approach, provides a generalized architecture for dynamic control problems, and describes the implementation of the system as applied to FMS scheduling and control. The paper concludes with a discussion of the general applicability of this approach.
9.
The size and dynamism of the Web pose challenges for all its stakeholders, which include producers and consumers of content as well as advertisers who want to place advertisements next to relevant content. A critical piece of information for producers/publishers of content as well as advertisers is the demographics of the consumers who are likely to visit a given web site. In this article we explore predictive models that attempt to deduce the demographics of the audience of a web site using cues embedded in the design or the content of its homepage. We find that it is possible to effectively predict different types of demographics of consumers of web sites on the basis of the suggested approach. Through a statistical analysis we observe that several design elements and the content differ significantly among web sites dominated by consumers of different demographic classes. We also suggest the use of an ensemble classifier that combines the content and design cues with the goal of further improving the prediction performance.
10.
Liang Wang, Yaping Huang, Xiaoyue Luo, Zhe Wang, Siwei Luo 《Neurocomputing》2011,74(16):2464-2474
Image deblurring is a basic and important task in image processing. Traditional filtering-based image deblurring methods, e.g. enhancement filters and partial differential equation (PDE) methods, are limited by the hypothesis that natural images and noise consist of low- and high-frequency components, respectively. Noise removal and edge protection have always posed a dilemma for traditional models. In this paper, we study the image deblurring problem from a brand new perspective: classification. We also generalize the traditional PDE model to a more general case, using the theory of the calculus of variations. Furthermore, inspired by the theory of approximation of functions, we transform the operator-learning problem into a coefficient-learning problem by selecting a group of basis functions, and build a filter-learning model. Based on the extreme learning machine (ELM) [1], [2], [3] and [4], an algorithm is designed and a group of filters are learned effectively. A generalized image deblurring model, learned filtering PDE (LF-PDE), is then built. The experiments verify the effectiveness of our models and the corresponding learned filters. It is shown that our model can overcome many drawbacks of the traditional models and achieve much better results.
11.
Recent research revealed that model-assisted parameter tuning can improve the quality of supervised machine learning (ML) models. The tuned models were found, in particular, to generalize better and to be more robust compared to other optimization approaches. However, the advantages of tuning often come with high computation times, a real burden when employing tuning algorithms. While training with a reduced number of patterns can be a solution to this, it is often connected with decreasing model accuracy and increasing instability and noise. Hence, we propose a novel approach defined by a two-criteria optimization task, where both the runtime and the quality of ML models are optimized. Because the budgets for this optimization task are usually very restricted in ML, the surrogate-assisted Efficient Global Optimization (EGO) algorithm is adapted. In order to cope with noisy experiments, we apply two hypervolume-indicator-based EGO algorithms with smoothing and re-interpolation of the surrogate models. The techniques do not need replicates. We find that these EGO techniques can outperform traditional approaches such as Latin hypercube sampling (LHS), as well as EGO variants with replicates.
12.
Learning general concepts in imperfect environments is difficult since training instances often include noisy data, inconclusive data, incomplete data, unknown attributes, unknown attribute values and other barriers to effective learning. It is well known that people can learn effectively in imperfect environments, and can manage to process very large amounts of data. Imitating human learning behavior therefore provides a useful model for machine learning in real-world applications. This paper proposes a new, more effective way to represent imperfect training instances and rules, and based on the new representation, a Human-Like Learning (HULL) algorithm for incrementally learning concepts well in imperfect training environments. Several examples are given to make the algorithm clearer. Finally, experimental results are presented that show the proposed learning algorithm works well in imperfect learning environments.
13.
Mapping land-cover modifications over large areas: A comparison of machine learning algorithms. Cited 3 times in total (0 self-citations, 3 by others)
John Rogan Janet Franklin Doug Stow Jennifer Miller Curtis Woodcock Dar Roberts 《Remote sensing of environment》2008,112(5):2272-2283
Large-area land-cover monitoring scenarios, involving large volumes of data, are becoming more prevalent in remote sensing applications. Thus, there is a pressing need for increased automation in the change-mapping process. The objective of this research is to compare the performance of three machine learning algorithms (MLAs), two classification tree software routines (S-Plus and C4.5) and an artificial neural network (ARTMAP), in the context of mapping land-cover modifications in northern and southern California study sites between 1990/91 and 1996. Comparisons were based on several criteria: overall accuracy, sensitivity to data set size and variation, and noise. ARTMAP produced the most accurate maps overall (about 84%) for the two study areas in southern and northern California, and was most resistant to training data deficiencies. The change map generated using ARTMAP has accuracy similar to that of a human-interpreted map produced by the U.S. Forest Service in the southern study area. ARTMAP appears to be robust and accurate for automated, large-area change monitoring, as it performed equally well across the diverse study areas with minimal human intervention in the classification process.
14.
Systematic literature review of machine learning based software development effort estimation models
Jianfeng Wen, Shixian Li, Zhiyong Lin, Yong Hu, Changqin Huang 《Information and Software Technology》2012,54(1):41-59
Context
Software development effort estimation (SDEE) is the process of predicting the effort required to develop a software system. In order to improve estimation accuracy, many researchers have proposed machine learning (ML) based SDEE models (ML models) since the 1990s. However, there has been no attempt to analyze the empirical evidence on ML models in a systematic way.
Objective
This research aims to systematically analyze ML models from four aspects: type of ML technique, estimation accuracy, model comparison, and estimation context.
Method
We performed a systematic literature review of empirical studies on ML models published in the last two decades (1991-2010).
Results
We identified 84 primary studies relevant to the objective of this research. After investigating these studies, we found that eight types of ML techniques have been employed in SDEE models. Overall, the estimation accuracy of these ML models is close to the acceptable level and is better than that of non-ML models. Furthermore, different ML models have different strengths and weaknesses and thus favor different estimation contexts.
Conclusion
ML models are promising in the field of SDEE. However, the application of ML models in industry is still limited, so more effort and incentives are needed to facilitate their application. To this end, based on the findings of this review, we provide recommendations for researchers as well as guidelines for practitioners.
15.
Matjaž Kukar 《Knowledge and Information Systems》2006,9(3):364-384
Although machine learning algorithms have been successfully used in many problems in the past, their serious practical use is hampered by the fact that they often cannot produce reliable and unbiased assessments of the quality of their predictions. In the last few years, several approaches for estimating the reliability or confidence of individual classifiers have emerged, many of them building upon the algorithmic theory of randomness, such as (in historical order) transduction-based confidence estimation, typicalness-based confidence estimation, and transductive reliability estimation. Unfortunately, they all have weaknesses: either they are tightly bound to particular learning algorithms, or the interpretation of the reliability estimates is not always consistent with statistical confidence levels. In this paper we describe the typicalness and transductive reliability estimation frameworks and propose a joint approach that compensates for the above-mentioned weaknesses by integrating typicalness-based confidence estimation and transductive reliability estimation into a joint confidence machine. The resulting confidence machine produces confidence values in the statistical sense. We perform a series of tests with several different machine learning algorithms in several problem domains. We compare our results with those of a proprietary method as well as with kernel density estimation. We show that the proposed method performs as well as the proprietary methods and significantly outperforms the density estimation methods.
Matjaž Kukar is currently an Assistant Professor in the Faculty of Computer and Information Science at the University of Ljubljana. His research interests include machine learning, data mining and intelligent data analysis, ROC analysis, cost-sensitive learning, reliability estimation, and latent structure analysis, as well as applications of data mining to medical and business problems.
16.
Various machine learning techniques have been applied to different problems in survival analysis in the last decade. They were usually adapted to learning from censored survival data by using the information on observation time. This includes learning from parts of the data or interventions to the learning algorithms. Efficient models were established in various fields of clinical medicine and bioinformatics. In this paper, we propose a pre-processing method for adapting the censored survival data to be used with ordinary machine learning algorithms. This is done by pre-assigning censored instances a positive or negative outcome according to their features and observation time. The proposed procedure calculates the goodness of fit of each censored instance to both the distribution of positives and the spoiled distribution of negatives in the entire dataset and relabels that instance accordingly. We performed a thorough empirical testing of our method in a simulation study and on two real-world medical datasets, using the naive Bayes classifier and decision trees. When compared to one of the popular ML methods dealing with survival, our method provided good results, especially when applied to heavily censored data.
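The pre-processing idea can be caricatured as follows: each censored instance receives a pre-assigned positive or negative label so that ordinary classifiers such as naive Bayes or decision trees can consume the data. The paper scores each censored instance against the distribution of positives and a "spoiled" distribution of negatives; the toy version below uses only observation time against a cutoff, a deliberately crude stand-in, and all names and values are illustrative.

```python
# Crude relabeling sketch for censored survival data. The real procedure
# compares each censored instance's goodness of fit to the positives' and
# negatives' distributions; here a single time cutoff plays that role.

def relabel_censored(instances, cutoff_time):
    """instances: list of (observation_time, event_observed) pairs.
    Observed events stay positive; censored cases followed past the cutoff
    are assumed negative (survived long enough); censored cases lost early
    are, in this crude stand-in, assumed positive."""
    labels = []
    for time, event in instances:
        if event:
            labels.append("positive")
        elif time >= cutoff_time:
            labels.append("negative")
        else:
            labels.append("positive")
    return labels

data = [(3.0, True), (10.0, False), (1.5, False)]
print(relabel_censored(data, cutoff_time=5.0))  # ['positive', 'negative', 'positive']
```

After relabeling, the dataset is a plain binary classification problem, which is exactly what lets the standard classifiers in the study be applied unchanged.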
17.
Rodrigo C. Barros, Duncan D. Ruiz 《Information Sciences》2011,181(5):954-971
Model trees are a particular case of decision trees employed to solve regression problems. They have the advantage of presenting an interpretable output, helping end-users to gain more confidence in the prediction and providing a basis for new insight into the data, confirming or rejecting hypotheses previously formed. Moreover, model trees present an acceptable level of predictive performance in comparison to most techniques used for solving regression problems. Since generating the optimal model tree is an NP-complete problem, traditional model tree induction algorithms use a greedy top-down divide-and-conquer strategy, which may not converge to the globally optimal solution. In this paper, we propose a novel algorithm based on the evolutionary algorithms paradigm as an alternative heuristic for generating model trees, in order to improve convergence to globally near-optimal solutions. We call our new approach evolutionary model tree induction (E-Motion). We test its predictive performance using public UCI data sets, and we compare the results to traditional greedy regression/model tree induction algorithms, as well as to other evolutionary approaches. Results show that our method presents a good trade-off between predictive performance and model comprehensibility, which may be crucial in many machine learning applications.
18.
Applying machine learning to software fault-proneness prediction. Cited 1 time in total (0 self-citations, 1 by others)
Iker Gondra 《Journal of Systems and Software》2008,81(2):186-195
The importance of software testing to quality assurance cannot be overemphasized. The estimation of a module’s fault-proneness is important for minimizing cost and improving the effectiveness of the software testing process. Unfortunately, no general technique for estimating software fault-proneness is available. The observed correlation between some software metrics and fault-proneness has resulted in a variety of predictive models based on multiple metrics. Much work has concentrated on how to select the software metrics that are most likely to indicate fault-proneness. In this paper, we propose the use of machine learning for this purpose. Specifically, given historical data on software metric values and number of reported errors, an Artificial Neural Network (ANN) is trained. Then, in order to determine the importance of each software metric in predicting fault-proneness, a sensitivity analysis is performed on the trained ANN. The software metrics that are deemed to be the most critical are then used as the basis of an ANN-based predictive model of a continuous measure of fault-proneness. We also view fault-proneness prediction as a binary classification task (i.e., a module can either contain errors or be error-free) and use Support Vector Machines (SVM) as a state-of-the-art classification method. We perform a comparative experimental study of the effectiveness of ANNs and SVMs on a data set obtained from NASA’s Metrics Data Program data repository.
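The sensitivity analysis described, determining each software metric's importance by perturbing it and observing the trained model's response, can be sketched as follows. The "model" here is a stand-in linear function rather than a trained ANN, and the metric values are illustrative assumptions.

```python
# Perturbation-based sensitivity analysis sketch: nudge each input metric
# by a small delta and record how strongly the model output responds.
# Larger scores indicate metrics the model treats as more important.

def sensitivity(model, baseline, delta=1e-3):
    base_out = model(baseline)
    scores = []
    for i in range(len(baseline)):
        perturbed = list(baseline)
        perturbed[i] += delta
        scores.append(abs(model(perturbed) - base_out) / delta)
    return scores

# Toy "fault-proneness" model that heavily weights the first metric:
toy_model = lambda x: 0.9 * x[0] + 0.1 * x[1]
print(sensitivity(toy_model, [10.0, 5.0]))  # roughly [0.9, 0.1]
```

In the paper's setting, the metrics with the highest scores would be the ones retained for the final ANN-based predictive model.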
19.
Ioanna Lykourentzou Ioannis Giannoukos Vassilis Nikolopoulos George Mpardis Vassili Loumos 《Computers & Education》2009,53(3):950-965
In this paper, a dropout prediction method for e-learning courses, based on three popular machine learning techniques and detailed student data, is proposed. The machine learning techniques used are feed-forward neural networks, support vector machines and probabilistic ensemble simplified fuzzy ARTMAP. Since a single technique may fail to accurately classify some e-learning students, whereas another may succeed, three decision schemes, which combine in different ways the results of the three machine learning techniques, were also tested. The method was examined in terms of overall accuracy, sensitivity and precision and its results were found to be significantly better than those reported in relevant literature.
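The simplest decision scheme of the kind the abstract describes is a majority vote over the three classifiers' per-student predictions. A minimal sketch; the actual trained models and the paper's three specific combination schemes are not reproduced, and the label strings are illustrative.

```python
# Majority-vote decision scheme over three classifiers' predictions
# for a single student: the label predicted most often wins.

def majority_vote(predictions):
    """predictions: iterable of 'dropout'/'persist' votes, one per classifier."""
    votes = list(predictions)
    return max(set(votes), key=votes.count)

print(majority_vote(["dropout", "persist", "dropout"]))  # dropout
```

With three voters and two labels there is always a strict majority, which is one reason an odd number of base classifiers is a convenient design choice for such schemes.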
20.
José M. Martínez-Martínez Pablo Escandell-Montero Carlo Barbieri Emilio Soria-Olivas Flavio Mari Marcelino Martínez-Sober Claudia Amato Antonio J. Serrano López Marcello Bassi Rafael Magdalena-Benedito Andrea Stopper José D. Martín-Guerrero Emanuele Gatti 《Computer methods and programs in biomedicine》2014
Patients who suffer from chronic renal failure (CRF) tend to suffer from an associated anemia as well. Therefore, it is essential to know the hemoglobin (Hb) levels in these patients. The aim of this paper is to predict the Hb value, using a database of European hemodialysis patients provided by Fresenius Medical Care (FMC), in order to improve the treatment of these patients. For the prediction of Hb, both analytical measurements and medication dosages of patients suffering from CRF are used. Two kinds of models were trained: global and local models. In the case of local models, clustering techniques based on hierarchical approaches and on the adaptive resonance theory (ART) were used as a first step, and then a different predictor was used for each obtained cluster. Several global models were applied to the dataset, such as Linear Models, Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), and Regression Trees, among others. A relevance analysis was also carried out for each predictor model, identifying those features that are most relevant for the given prediction.