Found 20 similar documents (search time: 31 ms)
We study the problem of maintaining a sketch of recent elements of a data stream. Motivated by applications involving network data, we consider streams that are asynchronous, in which the observed order of data is not the same as the time order in which the data was generated. The notion of recent elements of a stream is modeled by the sliding timestamp window, which is the set of elements with timestamps that are close to the current time. We design algorithms for maintaining sketches of all elements within the sliding timestamp window that can give provably accurate estimates of two basic aggregates, the sum and the median, of a stream of numbers. The space taken by the sketches, the time needed for querying the sketch, and the time for inserting new elements into the sketch are all polylogarithmic with respect to the maximum window size. Our sketches can be easily combined in a lossless and compact way, making them useful for distributed computations over data streams. Previous work on sketching recent elements of a data stream has considered only the more restrictive scenario of synchronous streams, where the observed order of data is the same as the time order in which the data was generated. Our notion of recency of elements is more general than that studied in previous work, and thus our sketches are more robust to network delays and asynchrony. The work of the authors was supported in part through NSF grants CNS 0520102 and CNS 0520009. A preliminary version of this paper appeared in Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC) 2006, pages 82–91. Work done while the third author was at Rensselaer Polytechnic Institute. Authors are listed in reverse alphabetical order.

The CQL continuous query language: semantic foundations and query execution   Total citations: 2 (self-citations: 0, citations by others: 2)
CQL, a continuous query language, is supported by the STREAM prototype data stream management system (DSMS) at Stanford. CQL is an expressive SQL-based declarative language for registering continuous queries against streams and stored relations. We begin by presenting an abstract semantics that relies only on “black-box” mappings among streams and relations. From these mappings we define a precise and general interpretation for continuous queries. CQL is an instantiation of our abstract semantics using SQL to map from relations to relations, window specifications derived from SQL-99 to map from streams to relations, and three new operators to map from relations to streams. Most of the CQL language is operational in the STREAM system. We present the structure of CQL's query execution plans as well as details of the most important components: operators, interoperator queues, synopses, and sharing of components among multiple operators and queries. Examples throughout the paper are drawn from the Linear Road benchmark recently proposed for DSMSs. We also curate a public repository of data stream applications that includes a wide variety of queries expressed in CQL. The relative ease of capturing these applications in CQL is one indicator that the language contains an appropriate set of constructs for data stream processing. Edited by M. Franklin

Hybridization networks are representations of evolutionary histories that allow for the inclusion of reticulate events like recombinations, hybridizations, or lateral gene transfers. The recent growth in the number of hybridization network reconstruction algorithms has led to an increasing interest in the definition of metrics for their comparison that can be used to assess the accuracy or robustness of these methods. In this paper we establish some basic results that make it possible to generalize to tree-child time consistent (TCTC) hybridization networks some of the oldest known metrics for phylogenetic trees: those based on comparing the vectors of path lengths between leaves. More specifically, we associate with each hybridization network a suitably defined vector of ‘splitted’ path lengths between its leaves, and we prove that if two TCTC hybridization networks have the same such vectors, then they must be isomorphic. Thus, comparing these vectors by means of a metric for real-valued vectors defines a metric for TCTC hybridization networks. We also consider the case of fully resolved hybridization networks, where we prove that simpler, ‘non-splitted’ vectors can be used.

Smartphones centralize a great deal of users’ private information and are thus a primary target for cyber-attack. The main goal of the attacker is to access and exfiltrate the private information stored in the smartphone without detection. In situations where explicit information is lacking, these attackers can still be detected in an automated way by analyzing data streams (continuously sampled information such as an application’s CPU consumption, accelerometer readings, etc.). Once the data stream has been clustered, anomaly detection techniques may be applied to it in order to detect attacks in progress. In this paper we utilize an algorithm called pcStream that is well suited for detecting clusters in real-world data streams and propose extensions to the pcStream algorithm designed to detect point, contextual, and collective anomalies. We provide a comprehensive evaluation that addresses mobile security issues on a unique dataset collected from 30 volunteers over eight months. Our evaluations show that the pcStream extensions can be used to effectively detect data leakage (point anomalies) and malicious activities (contextual anomalies) associated with malicious applications. Moreover, the algorithm can be used to detect when a device is being used by an unauthorized user (collective anomaly) within approximately 30 s with 1 false positive every two days.

In this paper, we consider the task of predicting the electric power generated by photovoltaic solar systems for the next day at half‐hourly intervals. We introduce DL, a deep learning approach based on feed‐forward neural networks for big data time series, which decomposes the forecasting problem into several sub‐problems. We conduct a comprehensive evaluation using 2 years of Australian solar data, evaluating accuracy and training time, and comparing the performance of DL with two other advanced methods based on neural networks and pattern sequence similarity. We investigate the use of multiple data sources (solar power and weather data for the previous days, and weather forecast for the next day) and also study the effect of different historical window sizes. The results show that DL produces competitive accuracy and scales well, and is thus a highly suitable method for big data environments.

Performing data mining tasks on streaming data is considered a challenging research direction, due to the continuous data evolution. In this work, we focus on the problem of clustering streaming time series, based on the sliding window paradigm. More specifically, we use the concept of subspace α-clusters. A subspace α-cluster consists of a set of streams whose value difference is less than α over a number of consecutive time instances (dimensions). The clusters can be continuously and incrementally updated as the streaming time series evolve with time. The proposed technique is based on a careful examination of pair-wise stream similarities for a subset of dimensions and is then generalized to more streams per cluster. Additionally, we extend our technique in order to find maximal pClusters in consecutive dimensions that have been used in previously proposed clustering methods. Performance evaluation results, based on real-life and synthetic data sets, show that the proposed method is more efficient than existing techniques. Moreover, it is shown that the proposed pruning criteria are very important for search space reduction, and that incremental cluster monitoring is more computationally efficient than the re-clustering process.
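The pair-wise condition in the abstract above can be illustrated for two streams. The following is only a minimal sketch of that condition, not the paper's clustering algorithm: the function name `alpha_runs` and the `min_len` parameter (the minimum number of consecutive time instances) are hypothetical, introduced here for illustration.

```python
from typing import List, Sequence, Tuple


def alpha_runs(a: Sequence[float], b: Sequence[float],
               alpha: float, min_len: int) -> List[Tuple[int, int]]:
    """Return maximal runs (start, end), end exclusive, of consecutive time
    instances where |a[t] - b[t]| < alpha and the run has >= min_len points.

    Hypothetical helper: it checks only the pair-wise alpha condition.
    """
    runs: List[Tuple[int, int]] = []
    start = None  # start index of the run currently being extended, if any
    n = min(len(a), len(b))
    for t in range(n):
        if abs(a[t] - b[t]) < alpha:
            if start is None:
                start = t
        else:
            # The run (if any) ends just before t.
            if start is not None and t - start >= min_len:
                runs.append((start, t))
            start = None
    # Close a run that reaches the end of the streams.
    if start is not None and n - start >= min_len:
        runs.append((start, n))
    return runs


if __name__ == "__main__":
    s1 = [1.0, 1.1, 1.2, 5.0, 2.0, 2.1, 2.2, 2.3]
    s2 = [1.05, 1.0, 1.3, 1.0, 2.05, 2.0, 2.25, 2.2]
    print(alpha_runs(s1, s2, alpha=0.2, min_len=3))
```

In this toy input the two streams stay within α = 0.2 of each other except at time instance 3, so the pair qualifies over two separate runs of consecutive dimensions.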

In many data stream mining applications, traditional density estimation methods such as kernel density estimation and reduced set density estimation cannot be applied to data streams because of their high computational burden, processing time, and intensive memory requirements. To reduce time and space complexity, we propose Dm-KDE, a novel density estimation method over data streams. It is based on the proposed algorithm m-KDE, which can be used to design a KDE estimator with a fixed number of kernel components for a dataset. In this method, Dm-KDE sequence entries are created by m-KDE instead of from all kernels obtained by other density estimation methods. To further reduce storage space, Dm-KDE sequence entries can be merged by calculating their KL divergences. Finally, the probability density functions over an arbitrary time period or over the entire time span can be estimated through the resulting estimation model. In contrast to the state-of-the-art algorithm SOMKE, the distinctive advantage of Dm-KDE is that it can achieve the same accuracy with a much smaller fixed number of kernel components, which makes it suitable for scenarios that place higher demands on online computation of kernel density estimates over data streams. We compare Dm-KDE with SOMKE and M-kernel in terms of density estimation accuracy and running time for various stationary datasets. We also apply Dm-KDE to evolving data streams. Experimental results illustrate the effectiveness of the proposed method.

Continuously monitoring through time the correlation/distance of multiple data streams is of interest in a variety of applications, including financial analysis, video surveillance, and mining of biological data. However, distance measures commonly adopted for comparing time series, such as Euclidean and Dynamic Time Warping (DTW), either are known to be inaccurate or are too time-consuming to be applied in a streaming environment. In this paper we propose a novel DTW-like distance measure, called Stream-DTW (SDTW), which unlike DTW can be efficiently updated at each time step. We formally and experimentally demonstrate that SDTW speeds up the monitoring process by a factor that grows linearly with the size of the window sliding over the streams. For instance, with a sliding window of 512 samples, SDTW is about 600 times faster than DTW. We also show that SDTW is a tight approximation of DTW, errors never exceeding 10%, and that it consistently outperforms approximations developed for the case of static time series.
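For context on why plain DTW is costly in a streaming setting, the sketch below implements the textbook DTW dynamic program (not SDTW, whose update rule the abstract does not give). It assumes an absolute-difference local cost and no warping-window constraint; the function name `dtw_distance` is illustrative.

```python
from typing import Sequence


def dtw_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Classic dynamic time warping distance with |x_i - y_j| as local cost.

    Runs in O(len(x) * len(y)) time and keeps only two rows of the table.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # prev[j] holds the cost of aligning the first i-1 points of x
    # with the first j points of y; only the empty/empty cell starts at 0.
    prev = [inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


if __name__ == "__main__":
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
    print(dtw_distance([0, 0], [1, 1]))
```

Recomputing this table from scratch each time a window of w samples slides costs O(w²) per step, which is the overhead an incrementally updatable measure aims to avoid.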

A spatio-temporal database manages spatio-temporal objects and supports corresponding query languages. Today, the term moving objects databases is used as a synonym for spatio-temporal databases managing spatial objects with a continuously changing geospatial location and/or extent. Recent advances in wireless communication, miniaturization of spatially enabled devices and global navigation satellite systems (GNSS) services have resulted in a large number of novel application domains. Applications in these novel domains (geo-sensor networks, moving objects tracking, real-time traffic analysis, etc.) process huge volumes of continuous data streams, i.e. data sets that are produced incrementally over time, rather than those available in full before the processing begins. Several data stream management systems (DSMSs) have been developed to manage this data. Since they are mainly based on a relational paradigm, they do not support geospatial data. Therefore, there is an urgent need for geospatial data stream management, ranging from real-time monitoring and alerting to long-term analysis of processed geospatial data. In this paper we present a formal framework consisting of data types and operations needed to support geospatial data in data streams. It can be used as a basis either for implementation of a completely new geospatial DSMS, or for extending available open source products and research prototypes. We leverage the work on abstract data types from spatio-temporal databases, present an implementation based on user-defined aggregate functions and illustrate embedding into an SQL-like language.

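As a new kind of model, the data stream model plays an important role in many applications, and query processing techniques based on it have been studied extensively. To improve the performance of query systems, existing work falls mainly into two categories: scheduling optimization and load shedding. Scheduling optimization improves query performance by changing the order in which tuples are executed. Load shedding improves throughput by reducing the input flow when the load exceeds the system's processing capacity. However, little work has applied both methods together to improve query performance. Combining scheduling optimization and load shedding for shared sliding-window query operators, this paper proposes two strategies for improving query throughput in bursty environments: a uniform load-shedding strategy and a small-window accurate load-shedding strategy. Both theoretical analysis and experimental results show that the two strategies significantly improve system performance.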

Managing large-scale time series databases has attracted significant attention in the database community recently. Related fundamental problems such as dimensionality reduction, transformation, pattern mining, and similarity search have been studied extensively. Although time series data are dynamic by nature, as in data streams, current solutions to these fundamental problems have mostly targeted static time series databases. In this paper, we first propose a framework for online summary generation for large-scale and dynamic time series data, such as data streams. Then, we propose online transform-based summarization techniques over data streams that can be updated in constant time and space. We present both the exact and approximate versions of the proposed techniques and provide error bounds for the approximate case. One of our main contributions in this paper is the extensive performance analysis. Our experiments carefully evaluate the quality of the online summaries for point, range, and knn queries using real-life dynamic data sets of substantial size. Edited by W. Aref

Recently, due to the imprecise nature of the data generated from a variety of streaming applications, such as sensor networks, query processing on uncertain data streams has become an important problem. However, all the existing works on uncertain data streams study unbounded streams. In this paper, we take the first step towards the important and challenging problem of answering sliding-window queries on uncertain data streams, with a focus on one of the most important types of queries: top-k queries. It is nontrivial to find an efficient solution for answering sliding-window top-k queries on uncertain data streams, because challenges not only stem from the strict space and time requirements of processing both arriving and expiring tuples in high-speed streams, but also arise from the exponential blowup in the number of possible worlds induced by the uncertain data model. In this paper, we design a unified framework for processing sliding-window top-k queries on uncertain streams. We show that all the existing top-k definitions in the literature can be plugged into our framework, resulting in several succinct synopses that use space much smaller than the window size, while they are also highly efficient in terms of processing time. We also extend our framework to answering multiple top-k queries. In addition to the theoretical space and time bounds that we prove for these synopses, we present a thorough experimental report to verify their practical efficiency on both synthetic and real data.

Continuous queries applied over nonterminating data streams usually specify windows in order to obtain an evolving, yet restricted, set of tuples and thus provide timely and incremental results. Although sliding windows are frequently employed in many user requests, additional types like partitioned or landmark windows are also available in stream processing engines. In this paper, we set out to study the existence of monotonic-related semantics for a rich set of windowing constructs in order to facilitate a more efficient maintenance of their changing contents. After laying out a formal foundation for expressing windowed queries, we investigate update patterns observed in most common window variants as well as their impact on adaptations of typical operators (like windowed join, union or aggregation), thus offering more insight towards design and implementation of stream processing mechanisms. Furthermore, we identify syntactic equivalences in algebraic expressions involving windows, to the potential benefit of query optimizations. Finally, this framework is validated for several windowed operations against streaming datasets with simulations at diverse arrival rates and window specifications, providing concrete evidence of its significance.

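In distributed data streams, correlation analysis between streams can reveal intrinsic relationships among the monitored objects. This paper proposes a basic-window-based method for computing correlation coefficients. The method first rewrites the correlation coefficient formula in terms of factors that are suitable for aggregation over basic windows, and then aggregates each factor using the basic-window approach. Basic-window aggregation partitions the data items in a window into a series of basic windows and computes over each basic window separately. After the window slides by an arbitrary amount, the aggregate over the items in the new window can partially reuse the results of the previous window's aggregation. Simulation experiments show that, compared with re-aggregating all data in the window each time, the basic-window method effectively reduces the time needed to compute correlation coefficients over data streams.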
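The basic-window idea described above can be sketched for the Pearson correlation coefficient, which can be written in terms of five running sums (Σx, Σy, Σx², Σy², Σxy). The code below is a minimal illustration under stated assumptions, not the paper's implementation: the class name, its parameters, and the choice to slide by whole basic windows are all hypothetical.

```python
import math
from collections import deque
from typing import Optional


class BasicWindowCorrelation:
    """Pearson correlation over the last `num_basic` completed basic windows.

    Assumption for illustration: the sliding window advances one basic
    window (of `basic_size` points) at a time.
    """

    def __init__(self, basic_size: int, num_basic: int) -> None:
        self.basic_size = basic_size
        self.num_basic = num_basic
        self.summaries = deque()      # one 5-sum summary per basic window
        self.total = [0.0] * 5        # sums over all stored basic windows
        self.current = [0.0] * 5      # sums for the basic window being filled
        self.count = 0                # points in the current basic window

    def add(self, x: float, y: float) -> None:
        for i, v in enumerate((x, y, x * x, y * y, x * y)):
            self.current[i] += v
        self.count += 1
        if self.count == self.basic_size:
            if len(self.summaries) == self.num_basic:
                # Expire the oldest basic window by subtracting its sums.
                old = self.summaries.popleft()
                self.total = [t - o for t, o in zip(self.total, old)]
            self.summaries.append(self.current)
            self.total = [t + c for t, c in zip(self.total, self.current)]
            self.current = [0.0] * 5
            self.count = 0

    def correlation(self) -> Optional[float]:
        n = len(self.summaries) * self.basic_size
        if n == 0:
            return None
        sx, sy, sxx, syy, sxy = self.total
        var_x = n * sxx - sx * sx
        var_y = n * syy - sy * sy
        if var_x <= 0 or var_y <= 0:
            return None  # undefined when either stream is constant
        return (n * sxy - sx * sy) / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    bw = BasicWindowCorrelation(basic_size=2, num_basic=2)
    for x, y in zip([1, 2, 3, 4], [1, 3, 2, 4]):
        bw.add(x, y)
    print(bw.correlation())
```

Each slide touches only the expiring and the arriving basic-window summaries, so the cost per slide does not depend on the number of items in the full window.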

In this paper we address the issue of continuous keyword queries on multiple textual streams and explore techniques for extracting useful information from them. The paper represents, to the best of our knowledge, the first approach that performs keyword search on a multiplicity of textual streams. The scenario that we consider is quite intuitive; let’s assume that a research or financial analyst is searching for information on a topic, continuously polling data from multiple (and possibly heterogeneous) text streams, such as RSS feeds, blogs, etc. The topic of interest can be described with the aid of several keywords. Current filtering approaches would just identify single text streams containing some of the keywords. However, it would be more flexible and powerful to search across multiple streams, which may collectively answer the analyst’s question. We present such a model, which takes into consideration the continuous flow of text in streams and uses efficient pipelined algorithms such that results are output as soon as they are available. The proposed model is evaluated analytically and experimentally, using the Enron dataset and a variety of blog datasets for our experiments.

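陈崚 (Chen Ling), 邹凌君 (Zou Lingjun), 屠莉 (Tu Li). 《计算机应用》 (Journal of Computer Applications), 2007, 27(8): 1976-1979
To address the problem that current algorithms for clustering multiple data streams cannot achieve both quality and efficiency, a correlation-coefficient-based algorithm for clustering multiple data streams is proposed, which performs online dynamic clustering over a fixed length. The algorithm introduces a decay coefficient to improve clustering quality, uses the correlation coefficient as the measure of similarity between streams, divides each data stream into several segments, and clusters using the relevant statistics of each stream, yielding a clustering structure in real time. Experimental results show that the algorithm achieves high efficiency, clustering quality, and stability.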

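Traditional anomaly detection algorithms cannot distinguish among types of anomalies in CO2 data streams. To effectively identify anomalies in CO2 data streams caused by leakage, a fuzzy-clustering-based algorithm for detecting spatio-temporal anomaly patterns in CO2 data streams is proposed. The algorithm first uses the 3σ rule to detect outliers with an adaptive threshold. It then extracts the feature value (the mean) of the sliding window under test, builds a spatio-temporal relationship matrix among neighboring nodes within a specified interval, uses fuzzy clustering to analyze the spatio-temporal correlation of the feature values of adjacent nodes and classify them, and determines the probability of a leakage anomaly from the classification result. Finally, real observation data are used to validate the algorithm and to analyze the choice of parameters. Experimental results show that the algorithm effectively identifies event anomalies caused by leakage, with a high detection rate and a low false-alarm rate.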
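The first step of the algorithm above, outlier detection with an adaptive threshold, can be sketched as follows. This assumes the rule named in the abstract is the three-sigma (3σ) rule, whose symbol appears to have been lost in extraction, and that "adaptive" means the mean and standard deviation are recomputed over a sliding window of recent readings. The function name and the `window` and `warmup` parameters are hypothetical, and the fuzzy-clustering stage is not shown.

```python
import statistics
from collections import deque
from typing import List, Sequence


def three_sigma_outliers(values: Sequence[float],
                         window: int = 10,
                         warmup: int = 5) -> List[int]:
    """Return indices of readings more than 3 standard deviations away
    from the mean of the preceding `window` readings.

    Assumptions (not from the source): population standard deviation,
    a warm-up of `warmup` readings before testing starts, and flagged
    readings still entering the window afterwards.
    """
    recent = deque(maxlen=window)  # most recent readings, oldest dropped first
    flagged: List[int] = []
    for i, v in enumerate(values):
        if len(recent) >= warmup:
            mu = statistics.fmean(recent)
            sigma = statistics.pstdev(recent)
            # The threshold adapts because mu and sigma follow the window.
            if sigma > 0 and abs(v - mu) > 3 * sigma:
                flagged.append(i)
        recent.append(v)
    return flagged


if __name__ == "__main__":
    co2_ppm = [400, 401, 400, 399, 400, 401, 399, 400, 401, 400, 900, 400]
    print(three_sigma_outliers(co2_ppm))
```

A design caveat of this simple variant: because a flagged reading is still added to the window, it inflates the standard deviation for the next `window` readings; a production detector might exclude flagged points instead.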

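As simulation systems grow in complexity and scale, simulation runs take longer and produce ever larger volumes of data, so that simulation data take on the characteristics of a data stream and can therefore be processed with data stream mining techniques. This paper surveys the main characteristics of data streams and data stream mining techniques; proposes a simulation application framework based on data stream mining; and designs a general-purpose data stream mining federate so that data stream mining algorithms can be integrated quickly into simulation systems based on the HLA architecture. Using a missile penetration simulation system as an example, it describes the general-purpose data stream association rule mining federate that was designed.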

Learning from continuous streams of data has been receiving increasing attention in recent years. Among the many challenges related to mining data streams, change detection is one topic frequently addressed. Being able to determine whether or not data characteristics are changing over time is a major concern for data stream algorithms, be it in the supervised or the unsupervised scenario. The unsupervised scenario is particularly relevant because many practical applications do not provide target labeling information. In this scenario, most of the strategies induce consecutive models over time and compare them in order to detect data changes. In this situation, model changes are assumed to be a consequence of data modifications. However, there is no guarantee this assumption is true, since those algorithms do not rely on any theoretical background to ensure that model divergences truly indicate data changes. The need for such a theoretical framework has motivated this paper to propose a new stability concept to establish bounds on the learning abilities of unsupervised algorithms designed to detect changes in data streams. This stability concept, based on the surrogate data strategy from time series analysis, provides learning guarantees for online unsupervised algorithms even in case of time dependency among observations. Furthermore, we propose a new change detection algorithm that meets the requirements of this stability concept. Experimental results on different synthetic scenarios illustrate how the stability concept proposed in this paper is applied to detect changes in unsupervised data streams.

In the high-tech and automotive industries, bandwidth considerations and widely accepted standardization are two important reasons why Ethernet is currently being considered as an alternative solution for real-time communication (compared to traditional fieldbuses). Although Ethernet was originally not intended for this purpose, the development of the Ethernet AVB standard enables its use for transporting high-volume data (e.g. from cameras and entertainment applications) with low-latency guarantees. In complex industrial systems, the network is shared by many applications, developed by different parties. To face this complexity, the development of these applications must be kept as independent as possible. In particular, from a network point of view, progress of all communication streams must be guaranteed, and the performance for individual streams should be predictable using only information regarding the stream under study and the general parameters of the communication standard used by the network. Initial methods to guarantee latency for Ethernet AVB networks rely on the traditional busy-period analysis. Typically, these methods are based on knowledge of the inter-arrival patterns of both the stream under study and the interfering streams that also traverse the network. The desired independence is therefore not achieved. In this paper, we present an independent real-time analysis based on so-called eligible intervals, which does not rely on any assumptions on interfering priority classes other than those enforced in the Ethernet AVB standard. We prove this analysis is tight in case there is only a single higher-priority stream, and no additional information on interference is known. In case there are multiple higher-priority streams, we give conditions under which the analysis is still tight. Furthermore, we compare the results of our approach to the two most recent busy-period analyses, point out sources of pessimism in these earlier works, and argue that assuming more information on the sources of interference (e.g. a minimal inter-arrival time between interfering frames) has only limited advantages.
