Similar Documents
1.
Summarization is an important intermediate step for expediting knowledge discovery tasks such as anomaly detection. In the context of anomaly detection from a data stream, the summary needs to represent both anomalous and normal data. But streaming data has distinct characteristics, such as the one-pass constraint, that make data mining operations difficult, and existing stream summarization techniques are unable to create summaries that represent both normal and anomalous instances. To address this problem, this paper designs and develops a number of hybrid summarization techniques based on the concept of a reservoir, for anomaly detection from network traffic. Experimental results on thirteen benchmark data streams show that the summaries produced from the stream using pairwise distance (PSSR) and template matching (TMSSR) techniques can retain more anomalies than existing stream summarization techniques, and that an anomaly detection technique can identify the anomalies with a high true positive rate and a low false positive rate.
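The PSSR and TMSSR techniques themselves are not detailed in the abstract; as a point of reference, the reservoir concept they build on can be illustrated with classic reservoir sampling (Algorithm R), which maintains a fixed-size uniform summary of a one-pass stream. Function and parameter names here are illustrative, not the paper's.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a one-pass stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # item survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each stream element is inspected exactly once and the summary never exceeds k items, which is what makes reservoir-based summaries attractive under the one-pass constraint.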

2.
Many studies on streaming data classification have been based on a paradigm in which a fully labeled stream is available for learning purposes. However, it is often too labor-intensive and time-consuming to manually label a data stream for training. This difficulty may make conventional supervised learning approaches infeasible in many real-world applications, such as credit fraud detection, intrusion detection, and rare event prediction. In previous work, Li et al. suggested that these applications be treated as a Positive and Unlabeled (PU) learning problem, and proposed a learning algorithm, OcVFDT, as a solution (Li et al. 2009). Their method requires only a set of positive examples and a set of unlabeled examples, both easily obtainable in a streaming environment, making it widely applicable to real-life applications. Here, we enhance Li et al.'s solution by adding three features: an efficient method to estimate the percentage of positive examples in the training stream, the ability to handle numeric attributes, and the use of more appropriate classification methods at tree leaves. Experimental results on synthetic and real-life datasets show that our enhanced solution (called PUVFDT) has very good classification performance and a strong ability to learn from data streams with only positive and unlabeled examples. Furthermore, it reduces the learning time of OcVFDT by about an order of magnitude. Even with 80% of the examples in the training data stream unlabeled, PUVFDT can still achieve classification performance competitive with that of VFDTcNB (Gama et al. 2003), a supervised learning algorithm.

3.
Yang, Li. Computers & Security, 2007, 26(7-8): 459-467
As network attacks have increased in number and severity over the past few years, intrusion detection is becoming an increasingly critical component of secure information systems, and supervised network intrusion detection has been an active and difficult research topic in the field for many years. However, it has not been widely applied in practice due to some inherent issues, the most important being the difficulty of obtaining adequate attack data for supervised classifiers to model attack patterns; this data acquisition task is time-consuming and relies heavily on domain experts. In this paper, we propose a novel supervised network intrusion detection method based on the TCM-KNN (Transductive Confidence Machines for K-Nearest Neighbors) machine learning algorithm and an active-learning-based training data selection method. It can effectively detect anomalies with a high detection rate and low false positives while using far fewer selected data and selected features for training than traditional supervised intrusion detection methods. A series of experimental results on the well-known KDD Cup 1999 data set demonstrate that the proposed method is more robust and effective than state-of-the-art intrusion detection methods, and can be further optimized as discussed in this paper for real applications.
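The transductive confidence idea behind TCM-KNN can be sketched as follows: a point's "strangeness" is the sum of its k nearest-neighbor distances, and its p-value is the fraction of training points at least as strange. This is a simplified sketch using only distance-based strangeness; the paper's actual method additionally uses active learning for training data selection, and all names here are illustrative.

```python
import math

def knn_dists(point, data, k):
    """Distances from `point` to its k nearest neighbors in `data`."""
    ds = sorted(math.dist(point, x) for x in data)
    return ds[:k]

def tcm_knn_pvalue(test_point, normal_data, k=3):
    """p-value of `test_point`: fraction of points (test included) whose
    strangeness (sum of k-NN distances) is at least the test point's."""
    def strangeness(p, pool):
        others = [x for x in pool if x is not p]
        return sum(knn_dists(p, others, k))
    alphas = [strangeness(x, normal_data) for x in normal_data]
    a_test = sum(knn_dists(test_point, normal_data, k))
    return sum(a >= a_test for a in alphas + [a_test]) / (len(alphas) + 1)
```

A small p-value means the point is stranger than almost all normal training data, so it can be flagged as anomalous under a chosen significance level.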

4.
Anomaly detection refers to the identification of patterns in a dataset that do not conform to expected patterns. Such non‐conformant patterns typically correspond to samples of interest and are assigned different labels in different domains, such as outliers, anomalies, exceptions, and malware. A daunting challenge is to detect anomalies in rapid, voluminous streams of data. This paper presents a novel, generic real‐time distributed anomaly detection framework for multi‐source stream data. As a case study, we investigate anomaly detection for a multi‐source VMware‐based cloud data center, which maintains a large number of virtual machines (VMs). This framework continuously monitors VMware performance stream data related to CPU statistics (e.g., load and usage). It collects data simultaneously from all of the VMs connected to the network and notifies the resource manager to reschedule its CPU resources dynamically when it identifies any abnormal behavior in its collected data. A semi‐supervised clustering technique is used to build a model from benign training data only. During testing, if a data instance deviates significantly from the model, then it is flagged as an anomaly. Effective anomaly detection in this case demands a distributed framework with high throughput and low latency. Distributed streaming frameworks like Apache Storm, Apache Spark, S4, and others are designed for a lower data processing time and a higher throughput than standard centralized frameworks. We have experimentally compared the average processing latency of a tuple during clustering and prediction in both Spark and Storm and demonstrated that Spark processes a tuple much more quickly than Storm on average. Copyright © 2016 John Wiley & Sons, Ltd.
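The semi-supervised step (build a model from benign data only, then flag large deviations at test time) can be illustrated with a simple per-feature z-score model. The paper uses a clustering technique rather than this stand-in, so treat the sketch and its names as illustrative assumptions only.

```python
import statistics

def fit_benign_model(benign):
    """Fit per-feature (mean, stdev) from benign-only training data."""
    cols = list(zip(*benign))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]

def is_anomaly(sample, model, z_thresh=3.0):
    """Flag if any feature deviates more than z_thresh stdevs from the benign mean."""
    return any(abs(v - m) / s > z_thresh for v, (m, s) in zip(sample, model))
```

The key property shared with the paper's approach is that no anomalous examples are needed for training; anything sufficiently far from the benign profile is flagged.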

5.
Li Yang, Fang Binxing, Guo Li, Chen You. Journal of Software (软件学报), 2007, 18(10): 2595-2604
Network anomaly detection is a hot and difficult topic in intrusion detection research. Existing approaches still suffer from relatively high false alarm rates, overly demanding requirements on the data used to build detection models, and reduced detection rates in complex network environments due to noise. Based on an improved TCM-KNN (Transductive Confidence Machines for K-Nearest Neighbors) confidence machine learning algorithm, this paper proposes a new network anomaly detection method that can, with high confidence, effectively detect anomalies using only normal training samples. Extensive experiments on the well-known KDD Cup 1999 dataset show that, compared with traditional anomaly detection methods, it effectively reduces the false alarm rate while maintaining a high detection rate. Moreover, it retains good detection performance when the training set contains a small amount of noise, and its performance is not noticeably degraded when trained on small samples or after feature selection performed to avoid the curse of dimensionality.

6.
An Efficient Intrusion Detection Method Based on Boosting Rule Learning
Improving the detection rate of a detection model while reducing its false alarm rate is an important research topic in intrusion detection. Building on an in-depth study of inductive learning theory, this paper applies rule learning algorithms to intrusion detection modeling. To address the drop in detection accuracy that occurs when audit training data are insufficient, an efficient intrusion detection method based on boosting rule learning, EAIDBRL (efficient approach to intrusion detection based on boosting rule learning), is proposed. In EAIDBRL, the weight update of the traditional Boosting algorithm is first confined to each predicted target class to eliminate degradation; the evaluation criterion functions used for rule growing and rule pruning in the traditional rule learning algorithm are then modified; finally, the improved Boosting algorithm is used to strengthen the weak rule learner's classification performance on network audit data. Test results on a standard intrusion detection dataset show that EAIDBRL considerably improves the intrusion detection performance of traditional rule-learning detection models under small-sample conditions.

7.
Diao Shumin, Wang Yongli. Journal of Computer Applications (计算机应用), 2009, 29(6): 1578-1581
Existing ensemble classification methods for combined decision making require common labeled training samples that are valid for all member classifiers. To enable ensemble classification decisions over data streams when no labeled samples are available, a fusion strategy for data stream ensemble classifiers based on constrained learning is proposed. When deciding on a test sample, a method satisfying the constraint measure of each local classifier is designed according to transductive learning theory, which guarantees the feasibility of the constraints and solves the transductive extension of the maximum entropy problem in distributed classification aggregation. Experiments on test datasets show that, compared with existing transductive learning methods, this approach achieves better decision accuracy and can be applied to the fusion of data stream ensemble classification.

8.
Concept drift is a common problem in mining dynamic streaming data, but pseudo concept drift, caused by mixed-in noise or an overly small training sample, produces effects similar to real concept drift, namely unstable fluctuations in the model's online test performance; the two are therefore easily confused, leading to false alarms of concept drift. To address this confusion between real and pseudo concept drift in streaming data, a concept drift detection method based on online performance testing (CDPT) is proposed. The method partitions the most recently acquired data into uniform groups, performs online learning on each group separately while recording the vector of classification accuracies obtained on it, and computes the accuracy drop between adjacent learning time units, obtaining candidate fluctuation points according to a test-accuracy-drop threshold. Cross-validation is then used to consolidate the candidate fluctuation points across groups, eliminating detection interference caused by model instability due to overly small training samples during online learning, and consistent fluctuation points are obtained according to the consistency of the accuracy fluctuations. Finally, by tracking online classification accuracy, the method compares the magnitude and convergence of the test-accuracy drop at reference points in the neighborhood of each consistent fluctuation point, so as to effectively identify the real concept drift points among them. Experimental results show that the method effectively identifies real concept drift occurring during online learning over streaming data, avoids the negative influence of overly small training samples or of noise on the detection results, and improves the generalization performance of the model.
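CDPT's core signal, a drop in online test accuracy beyond a threshold, can be illustrated with a minimal detector; the grouping and cross-validation steps that distinguish real from pseudo drift are omitted, and the class name and thresholds here are illustrative assumptions.

```python
from collections import deque

class AccuracyDropDetector:
    """Signal possible concept drift when windowed online accuracy falls
    below the best accuracy seen so far by more than `drop`."""
    def __init__(self, window=50, drop=0.15):
        self.window = deque(maxlen=window)
        self.best = 0.0
        self.drop = drop

    def update(self, correct):
        """Record one prediction outcome; return True if drift is suspected."""
        self.window.append(1 if correct else 0)
        if len(self.window) < self.window.maxlen:
            return False  # not enough evidence yet
        acc = sum(self.window) / len(self.window)
        self.best = max(self.best, acc)
        return self.best - acc > self.drop
```

As the abstract notes, a detector this simple cannot tell real drift from noise-induced pseudo drift; CDPT's grouping and neighborhood-convergence checks exist precisely to filter such false alarms.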

9.
Anomaly detection involves identifying rare data instances (anomalies) that come from a different class or distribution than the majority (which are simply called "normal" instances). Given a training set of only normal data, the semi-supervised anomaly detection task is to identify anomalies in the future. Good solutions to this task have applications in fraud and intrusion detection. The unsupervised anomaly detection task is different: given unlabeled, mostly-normal data, identify the anomalies among them. Many real-world machine learning tasks, including many fraud and intrusion detection tasks, are unsupervised because it is impractical (or impossible) to verify all of the training data. We recently presented FRaC, a new approach for semi-supervised anomaly detection. FRaC uses normal instances to build an ensemble of feature models, and then identifies instances that disagree with those models as anomalous. In this paper, we investigate the behavior of FRaC experimentally and explain why FRaC is so successful. We also show that FRaC is a superior approach for the unsupervised as well as the semi-supervised anomaly detection task, compared with well-known state-of-the-art anomaly detection methods (LOF and one-class support vector machines) and with an existing feature-modeling approach.

10.
Based on the flipped‐classroom model and the potential motivational and instructional benefits of digital games, we describe a flipped game‐based learning (FGBL) strategy focused on preclass and overall learning outcomes. A secondary goal is to determine the effects, if any, of the classroom aspects of the FGBL strategy on learning efficiency. Our experiments involved 2 commercial games featuring physical motion concepts: Ballance (Newton's law of motion) and Angry Birds (mechanical energy conservation). We randomly assigned 87 8th‐grade students to game instruction (digital game before class and lecture‐based instruction in class), FGBL strategy (digital game before class and cooperative learning in the form of group discussion and practice in class), or lecture‐based instruction groups (no gameplay). Results indicate that the digital games exerted a positive effect on preclass learning outcomes and that FGBL‐strategy students achieved better overall learning outcomes than their lecture‐based peers. Our observation of similar overall outcomes between the cooperative learning and lecture‐based groups suggests a need to provide additional teaching materials or technical support when introducing video games to cooperative classroom learning activities.

11.
In-operation construction vibration monitoring records inevitably contain various anomalies caused by sensor faults, system errors, or environmental influence. An accurate and efficient anomaly detection technique is essential for vibration impact assessment. Identifying anomalies using visualization tools is computationally expensive, time-consuming, and labor-intensive. In this study, an unsupervised approach for detecting anomalies in construction vibration monitoring data is proposed based on a temporal convolutional network and autoencoder. Anomalies are autonomously detected on the basis of the reconstruction errors between the original and reconstructed signals. Considering the false and missed detections caused by the great variability of vibration signals, an adaptive threshold method was applied to achieve the best identification performance; this method uses the log-likelihood of the reconstruction errors to search for an optimal anomaly coefficient. A distributed training strategy was implemented on a cloud platform to speed up training and perform anomaly detection without significant time delay. Construction-induced accelerations measured by a real vibration monitoring system were used to evaluate the proposed method. Experimental results show that the proposed approach can successfully detect anomalies with high accuracy, and that the distributed training substantially reduces training time, thereby enabling anomaly detection for online monitoring systems with massive accumulated data.

12.
Prediction in streaming data is an important activity in modern society. Two major challenges posed by data streams are that (1) the data may grow without limit, making it difficult to retain a long history of raw data; and (2) the underlying concept of the data may change over time. The novelties of this paper are fourfold. First, it uses a measure of conceptual equivalence to organize the data history into a history of concepts. This contrasts with the common practice of keeping only recent raw data; the concept history is compact while still retaining the essential information for learning. Second, it learns concept-transition patterns from the concept history and anticipates what the concept will be in the case of a concept change, proactively preparing a prediction model for the future change. This contrasts with the conventional methodology of passively waiting until the change happens. Third, it incorporates both proactive and reactive predictions: if the anticipation turns out to be correct, a proper prediction model can be launched instantly upon the concept change; if not, it promptly resorts to a reactive mode, adapting a prediction model to the new data. Finally, an efficient and effective system, RePro, is proposed to implement these ideas. It carries out prediction at two levels: a general level of predicting each oncoming concept and a specific level of predicting each instance's class. Experiments compare RePro with representative existing prediction methods on various benchmark data sets representing diversified scenarios of concept change. The empirical evidence offers inspiring insights and demonstrates that the proposed methodology is an advisable solution to prediction in data streams. A preliminary and shorter version of this paper was published in the Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2005), pp. 710–715.
Endnotes: Sometimes there are conflicts in the literature when describing these modes; for example, "concept shift" in some papers means "concept drift" in others, and the definitions here are clarified to the best of the authors' understanding. The value in each cell can be a frequency as well as a probability; the latter can be approximated from the former. If the concept changes so fast that learning cannot catch up with it, the prediction will be inordinate; this also applies to human learning. For example, C4.5rules (Quinlan, 1993) can achieve 100% classification accuracy on the whole data set. If an attribute value has fewer than 500 instances, all instances will be sampled without replacement. If a data set has only nominal attributes, two nominal attributes will be selected; if it has only numeric attributes, two numeric attributes will be selected. One cannot manipulate these degrees in the hyperplane or network intrusion data, for which no results are presented. The sample size is chosen to avoid observation noise caused by high classification variance. These error rates may sometimes be higher than those reported in the original work (Hulten et al., 2001) because the original work used a much larger data size: many more instances arrive after the new classifier becomes stable and are hence classified correctly, and this longer existence of each concept relieves CVFDT's dilemma and lowers its average error rate. Note that for DWCE, the optimal version, whose buffer size equals 10% of its window size, was used on the 3 artificial data streams; however, its prohibitively high time demand makes DWCE intractable when a large number (36) of real-world data streams are tested, so a compromise version whose buffer size is half of its window size is used instead. The results are sufficient to verify that DWCE trades time for accuracy: it can improve prediction accuracy over WCE, but is often too slow to be useful for on-line prediction.

13.
Most existing machine-learning-based intrusion detection methods focus on improving the overall detection rate and reducing the overall false negative rate, while neglecting the detection and false negative rates of minority classes. To address this, an intrusion detection method based on SMOTE (Synthetic Minority Oversampling Technique) and GBDT (Gradient Boosting Decision Tree) is proposed. Its core idea is to use SMOTE in the preprocessing stage to increase the number of minority-class samples and to downsample the majority-class samples, and finally to train a GBDT classifier on the balanced dataset. Experiments on the KDD99 dataset, compared against a classifier trained on the original training set and against the best result of the KDD99 competition, show that the method substantially reduces the false negative rate of the minority classes while maintaining high overall accuracy.
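The oversampling step can be sketched with a minimal pure-Python SMOTE: each synthetic point is interpolated between a minority sample and one of its k nearest minority neighbors. This is an illustrative sketch with made-up names; production code would normally use a library implementation such as imbalanced-learn.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted((m for m in minority if m != base),
                           key=lambda m: dist2(base, m))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic
```

Because new points lie on segments between real minority samples, the classifier sees a denser minority region rather than exact duplicates, which is what distinguishes SMOTE from plain oversampling.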

14.
In addition to classification and regression, outlier detection has emerged as a relevant activity in deep learning. In comparison with previous approaches, where the original features of the examples were used to separate the highly dissimilar examples from the rest, deep learning can automatically extract useful features from raw data, removing the need for most of the feature engineering effort usually required with classical machine learning approaches. This requires training the deep learning algorithm with labels identifying the examples or with numerical values. Although outlier detection in deep learning has usually been undertaken by training the algorithm with categorical labels (as a classifier), it can also be performed by using the algorithm as a regressor. Nowadays, numerous urban areas have deployed networks of sensors for monitoring multiple air quality variables. The measurements of these sensors can be treated individually (as time series) or collectively. Collectively, a variable monitored by a network of sensors can be transformed into a map, and maps can be used as images in machine learning algorithms, including computer vision algorithms, for outlier detection. Identifying anomalous episodes in air quality monitoring networks allows the corresponding time periods to be processed later with finer-grained scientific packages involving fluid dynamics and chemical evolution software, or allows malfunctioning stations to be identified. In this work, a Convolutional Neural Network is trained as a regressor using as input ozone-urban images generated from the Air Quality Monitoring Network of Madrid (Spain). The learned features are processed by the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to identify anomalous maps. Comparisons with other deep learning architectures are undertaken, for instance autoencoders (undercomplete and denoising) for learning salient features of the maps for later use as input to DBSCAN. The proposed approach efficiently finds maps with local anomalies compared with approaches based on raw images or on latent features extracted with autoencoder architectures and DBSCAN.

15.
In recent years, classification learning for data streams has become an important and active research topic. A major challenge posed by data streams is that their underlying concepts can change over time, which requires current classifiers to be revised accordingly and in a timely manner. To detect concept change, a common methodology is to observe the online classification accuracy: if accuracy drops below some threshold value, a concept change is deemed to have taken place. An implicit assumption behind this methodology is that any drop in classification accuracy can be interpreted as a symptom of concept change. Unfortunately, however, this assumption is often violated in the real world, where data streams carry noise that can also introduce a significant reduction in classification accuracy. To compound this problem, traditional noise cleansing methods are incompetent for data streams: those methods normally need to scan data multiple times, whereas learning for data streams can only afford a one-pass scan because of the data's high speed and huge volume. Another open problem in data stream classification is how to deal with missing values. When new instances containing missing values arrive, how a learning model classifies them and how it updates itself according to them is an issue whose solution is far from explored. To solve these problems, this paper proposes a novel classification algorithm, flexible decision tree (FlexDT), which extends fuzzy logic to data stream classification. The advantages are threefold. First, FlexDT offers a flexible structure to effectively and efficiently handle concept change. Second, FlexDT is robust to noise; hence it can prevent noise from interfering with classification accuracy, and an accuracy drop can be safely attributed to concept change. Third, it deals with missing values in an elegant way. Extensive evaluations are conducted to compare FlexDT with representative existing data stream classification algorithms using a large suite of data streams and various statistical tests. Experimental results suggest that FlexDT offers a significant benefit to data stream classification in real-world scenarios where concept change, noise, and missing values coexist.

16.
Rescaling is possibly the most popular approach to cost‐sensitive learning. It works by rebalancing the classes according to their costs, and it can be realized in different ways, for example, re‐weighting or resampling the training examples in proportion to their costs, or moving the decision boundaries of classifiers away from high‐cost classes in proportion to costs. This approach is very effective on two‐class problems, yet some studies have shown that it is often not so helpful on multi‐class problems. In this article, we explore why the rescaling approach is often ineffective on multi‐class problems. Our analysis discloses that the rescaling approach works well when the costs are consistent, while directly applying it to multi‐class problems with inconsistent costs may not be a good choice. Based on this recognition, we advocate examining the consistency of the costs before applying the rescaling approach: if the costs are consistent, rescaling can be conducted directly; otherwise it is better to apply rescaling after decomposing the multi‐class problem into a series of two‐class problems. An empirical study involving 20 multi‐class data sets and seven types of cost‐sensitive learners validates our proposal. Moreover, we show that the proposal is also helpful for class‐imbalance learning.
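The re-weighting form of rescaling described above can be sketched in a few lines: each example's weight is its class's misclassification cost, normalized so the mean weight is 1. Function and variable names are illustrative.

```python
def rescale_weights(labels, costs):
    """Cost-proportional rescaling: weight each example by its class's
    misclassification cost, normalized to mean weight 1."""
    raw = [costs[y] for y in labels]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]
```

The resulting weights can be passed to any learner that accepts per-example weights; the ratio between class weights equals the ratio between their costs, which is exactly the rebalancing the rescaling approach prescribes for the two-class (or decomposed) case.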

17.
With the development of intelligent transportation, moving encrypted highway surveillance video to the cloud is gradually becoming a major trend. Deep mining of traffic data, and pedestrian detection in particular, is one of the problems this trend urgently needs to solve. For pedestrian detection across a variety of road environments, this paper proposes an all-weather pedestrian monitoring solution based on the Kunpeng cloud. First, video streams from surveillance cameras are forwarded to the Kunpeng cloud through a streaming media service; the cloud then decodes the video streams, performs pedestrian detection, and saves pedestrian history; finally, pedestrian events are analyzed and reported. The system uses an embedded neural processing unit (NPU) instead of a traditional graphics processing unit (GPU) platform to accelerate inference of the YOLOv4 pedestrian detection module. On the one hand, it achieves fast detection and can process 22 video streams in real time; on the other hand, it obtains good monitoring results for pedestrians on highways across different road scenarios.

18.
Most existing works on data stream classification assume the streaming data is precise and definite. Such an assumption, however, does not always hold in practice, since data uncertainty is ubiquitous in data stream applications due to imprecise measurement, missing values, privacy protection, etc. The goal of this paper is to learn accurate decision tree models from uncertain data streams for classification analysis. On the basis of very fast decision tree (VFDT) algorithms, we propose an algorithm for constructing an uncertain VFDT tree with classifiers at tree leaves (uVFDTc). The uVFDTc algorithm can exploit uncertain information effectively and efficiently in both the learning and the classification phases. In the learning phase, it uses Hoeffding bound theory to learn from uncertain data streams and yield fast and reasonable decision trees. In the classification phase, it uses uncertain naive Bayes (UNB) classifiers at tree leaves to improve classification performance. Experimental results on both synthetic and real-life datasets demonstrate the strong ability of uVFDTc to classify uncertain data streams. The use of UNB at tree leaves improves the performance of uVFDTc, especially the any-time property, the benefit of exploiting uncertain information, and the robustness against uncertainty.
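The Hoeffding bound that VFDT-style learners (including uVFDTc) rely on to decide when enough examples have been seen to choose a split attribute can be computed directly: with probability 1 - delta, the true mean of n observations of a variable with range R lies within epsilon of the sample mean.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding epsilon: with probability 1 - delta, the true mean of n
    observations (each bounded in a range of width value_range) lies
    within epsilon of the observed sample mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
```

In a VFDT-style learner, a leaf is split once the gain difference between the two best attributes exceeds this epsilon; because epsilon shrinks as n grows, the tree splits only when the evidence is statistically sufficient.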

19.
The deployment of environmental sensors has generated interest in real-time applications of the data they collect. This research develops a real-time anomaly detection method for environmental data streams that can be used to identify data that deviate from historical patterns. The method is based on an autoregressive data-driven model of the data stream and its corresponding prediction interval. It performs fast, incremental evaluation of data as they become available, scales to large quantities of data, and requires no pre-classification of anomalies. Furthermore, this method can be easily deployed on a large heterogeneous sensor network. Sixteen instantiations of this method are compared based on their ability to identify measurement errors in a wind-speed data stream from Corpus Christi, Texas. The results indicate that a multilayer perceptron model of the data stream, coupled with replacement of anomalous data points, performs well at identifying erroneous data in this data stream.
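The autoregressive-model-with-prediction-interval idea can be illustrated with a deliberately tiny AR(1) sketch: fit x_t ≈ a·x_{t-1} by least squares, then flag residuals outside mean ± z·stdev. The paper's multilayer perceptron model and incremental evaluation are not reproduced here; names and thresholds are illustrative.

```python
import statistics

def ar1_anomalies(series, z=3.0):
    """Fit a naive AR(1) model x_t ~ a * x_{t-1} by least squares, then
    return indices whose residual falls outside mean +/- z * stdev."""
    xs, ys = series[:-1], series[1:]
    a = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    resid = [y - a * x for x, y in zip(xs, ys)]
    mu, sd = statistics.mean(resid), statistics.stdev(resid)
    return [i + 1 for i, r in enumerate(resid) if abs(r - mu) > z * sd]
```

The prediction interval plays the role described in the abstract: a point is anomalous not because its value is extreme in isolation, but because it deviates from what the model predicted from the recent past.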

20.
Detecting duplicates in data streams is an important problem with a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios; on the other hand, the elements in data streams are always time-sensitive. This makes it particularly significant to approximately detect duplicates among the newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, the Decaying Bloom Filter (DBF), an extension of the Counting Bloom Filter that effectively removes stale elements as new elements continuously arrive over sliding windows. On the basis of the DBF, we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors, but not false negative errors as in many previous results. We analyze the time complexity and detection accuracy, and give a tight upper bound on the false positive rate. For a given space of G bits and sliding window size W, our algorithm has an amortized time complexity of O((G/W)^(1/2)). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior in both execution time and detection accuracy to previous results.
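The DBF idea (a counting Bloom filter whose cells expire as new elements arrive, so that membership reflects only a recent window) can be sketched as follows. The cell-update policy here is a simplification of the paper's structure, not its actual algorithm, and all names are illustrative.

```python
import hashlib

class DecayingBloomFilter:
    """Simplified sketch of a decaying counting Bloom filter: each insertion
    resets the element's cells to `window` and decays every cell by one, so
    an element is reported as a duplicate only while it is still 'fresh'."""
    def __init__(self, m=1024, k=4, window=100):
        self.m, self.k, self.window = m, k, window
        self.cells = [0] * m

    def _hashes(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add_and_check(self, item):
        """Insert `item`; return True if it was (probably) seen recently."""
        idx = self._hashes(item)
        seen = all(self.cells[i] > 0 for i in idx)
        for i in idx:
            self.cells[i] = self.window   # reset remaining lifetime
        self.cells = [max(0, c - 1) for c in self.cells]  # one decay step per arrival
        return seen
```

Like the paper's DBF, this can report false positives (hash collisions with fresh elements) but never false negatives within the window, since an element's cells stay nonzero for `window` arrivals after insertion.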
