Similar literature
 20 similar documents found
1.
Distributed data stream processing applications are often characterized by data flow graphs consisting of a large number of built-in and user-defined operators connected via streams. These flow graphs are typically deployed on a large set of nodes. The data processing is carried out on-the-fly, as tuples arrive at possibly very high rates, with minimum latency. It is well known that developing and debugging distributed, multi-threaded, and asynchronous applications, such as stream processing applications, can be challenging. Thus, without domain-specific debugging support, developers struggle when debugging distributed applications. In this paper, we describe tools and language support for debugging distributed stream processing applications. Our key insight is to view debugging of stream processing applications from four different, but related, perspectives. First, debugging the semantics of the application involves verifying the operator-level composition and inspecting the flows at the logical level. Second, debugging the user-defined operators involves traditional source-code debugging, but strongly tied to the stream-level interactions. Third, debugging the deployment details of the application requires understanding the runtime physical layout and configuration of the application. Fourth, debugging the performance of the application requires inspecting various performance metrics (such as communication rates, CPU utilization, etc.) associated with streams, operators, and nodes in the system.
In light of this characterization, we developed several tools: a debugger-aware compiler and an associated stream debugger, composition and deployment visualizers, and performance visualizers. We also developed language support, including configuration knobs for logging and tracing, deployment configurations such as operator-to-process and process-to-node mappings, monitoring directives to inspect streams, and special sink adapters to intercept and dump streaming data to files and sockets. We describe these tools in the context of Spade, a language for creating distributed stream processing applications, and System S, a distributed stream processing middleware under development at the IBM Watson Research Center. Published in 2009 by John Wiley & Sons, Ltd.

2.
Continuous queries applied over nonterminating data streams usually specify windows in order to obtain an evolving, yet restricted, set of tuples and thus provide timely and incremental results. Although sliding windows are frequently employed in many user requests, additional types such as partitioned or landmark windows are also available in stream processing engines. In this paper, we set out to study the existence of monotonic-related semantics for a rich set of windowing constructs in order to facilitate a more efficient maintenance of their changing contents. After laying out a formal foundation for expressing windowed queries, we investigate update patterns observed in the most common window variants as well as their impact on adaptations of typical operators (such as windowed join, union, or aggregation), thus offering more insight into the design and implementation of stream processing mechanisms. Furthermore, we identify syntactic equivalences in algebraic expressions involving windows, to the potential benefit of query optimizations. Finally, this framework is validated for several windowed operations against streaming datasets with simulations at diverse arrival rates and window specifications, providing concrete evidence of its significance.

3.
Identifying similarities in large datasets is an essential operation in several applications such as bioinformatics, pattern recognition, and data integration. To make a relational database management system similarity-aware, the core relational operators have to be extended. While similarity-awareness has been introduced in database engines for relational operators such as joins and group-by, little has been achieved for relational set operators, namely Intersection, Difference, and Union. In this paper, we propose to extend the semantics of relational set operators to take into account the similarity of values. We develop efficient query processing algorithms for evaluating them, and implement these operators inside an open-source database system, namely PostgreSQL. By extending several queries from the TPC-H benchmark to include predicates that involve similarity-based set operators, we perform extensive experiments that demonstrate up to three orders of magnitude speedup in performance over equivalent queries that only employ regular operators.

4.
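In distributed data stream management systems, query operators must be placed on different processing nodes for execution. How to place query operators is therefore a core problem in distributed data stream management research. Peter et al. proposed a query operator placement algorithm based on a latency space and spring relaxation, but that algorithm assumes that the rate of the data streams between query operators is constant and ignores the correlation between stream rates and the query operators themselves. To address this, we analyze the relationship between different data stream query operators and the rates of their output streams, and improve the algorithm of Peter et al. accordingly. Experimental results show that the improved algorithm can be applied effectively in distributed data stream management systems.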
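
The spring relaxation idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the algorithm from either paper: it assumes a two-dimensional latency space, treats every link as a spring whose stiffness equals the data rate on that link, and moves each unpinned operator to the rate-weighted centroid of its neighbors. The function name `relax` and all node names are hypothetical.

```python
def relax(positions, pinned, links, iterations=100):
    """Place unpinned operators by a simple spring relaxation.

    positions: dict mapping node name -> (x, y) in an assumed latency space
    pinned:    set of node names that must not move (e.g. sources and sinks)
    links:     list of (node_a, node_b, rate); rate acts as spring stiffness
    """
    pos = dict(positions)
    for _ in range(iterations):
        for node in pos:
            if node in pinned:
                continue
            total = wx = wy = 0.0
            for a, b, rate in links:
                if node == a:
                    other = b
                elif node == b:
                    other = a
                else:
                    continue
                ox, oy = pos[other]
                wx += rate * ox
                wy += rate * oy
                total += rate
            if total > 0:
                # Equilibrium of springs with stiffness = rate is the
                # rate-weighted centroid of the neighboring nodes.
                pos[node] = (wx / total, wy / total)
    return pos


placement = relax(
    positions={"src": (0.0, 0.0), "op": (5.0, 5.0), "sink": (10.0, 0.0)},
    pinned={"src", "sink"},
    links=[("src", "op", 1.0), ("op", "sink", 3.0)],
)
```

In this toy setup the operator is pulled toward the sink because the link to the sink carries the higher rate. A rate-aware variant of the kind the abstract describes would recompute the `rate` values from the operators' selectivities rather than fixing them in advance.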

5.
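Because data streams are unbounded, most queries in data stream systems are windowed queries. For windowed queries, existing approaches usually let the operators maintain the windows directly, but the types and arrangement of the operators can make the windows hard to maintain and introduce considerable redundancy. We therefore propose a hierarchical window maintenance strategy for query processing that divides windows into stream windows and operator windows. Stream windows take precedence and control the maintenance of operator windows, which keeps the windows in a query consistent, solves the window maintenance problem, and conforms to the semantics of stream query languages. Data in the windows at each level is shared to reduce memory consumption.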

6.
The CQL continuous query language: semantic foundations and query execution   Cited by: 2 (self-citations: 0, citations by others: 2)
CQL, a continuous query language, is supported by the STREAM prototype data stream management system (DSMS) at Stanford. CQL is an expressive SQL-based declarative language for registering continuous queries against streams and stored relations. We begin by presenting an abstract semantics that relies only on “black-box” mappings among streams and relations. From these mappings we define a precise and general interpretation for continuous queries. CQL is an instantiation of our abstract semantics using SQL to map from relations to relations, window specifications derived from SQL-99 to map from streams to relations, and three new operators to map from relations to streams. Most of the CQL language is operational in the STREAM system. We present the structure of CQL's query execution plans as well as details of the most important components: operators, interoperator queues, synopses, and sharing of components among multiple operators and queries. Examples throughout the paper are drawn from the Linear Road benchmark recently proposed for DSMSs. We also curate a public repository of data stream applications that includes a wide variety of queries expressed in CQL. The relative ease of capturing these applications in CQL is one indicator that the language contains an appropriate set of constructs for data stream processing. Edited by M. Franklin
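
The stream-to-relation and relation-to-stream mappings described in this abstract can be sketched in a few lines. This is a simplified illustration, not STREAM's implementation: it assumes integer timestamps, models a relation as a bag (`collections.Counter`), and uses a closed time interval for the window. The function names are hypothetical; `istream` and `dstream` mirror two of CQL's relation-to-stream operators (inserted and deleted tuples).

```python
from collections import Counter


def range_window(stream, now, width):
    """Stream-to-relation: bag of values with timestamp in [now - width, now]."""
    return Counter(value for ts, value in stream if now - width <= ts <= now)


def istream(relation_now, relation_prev):
    """Relation-to-stream: tuples present now but not at the previous instant."""
    # Counter subtraction is a bag difference that drops non-positive counts.
    return relation_now - relation_prev


def dstream(relation_now, relation_prev):
    """Relation-to-stream: tuples present at the previous instant but not now."""
    return relation_prev - relation_now


# Assumed toy input: (timestamp, value) pairs in timestamp order.
stream = [(1, "a"), (2, "b"), (3, "a"), (5, "c")]
r4 = range_window(stream, now=4, width=3)   # timestamps 1..4
r5 = range_window(stream, now=5, width=3)   # timestamps 2..5
inserted = istream(r5, r4)
deleted = dstream(r5, r4)
```

Between the two instants, one copy of `"a"` leaves the window and `"c"` enters it, so `inserted` holds only `"c"` and `deleted` holds only one `"a"`.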

7.
Stream processing systems are designed to analyze data arriving in real time using continuous queries and to respond when a specific event or sequence of events is detected. An important aspect of these systems is Streaming Analytics, which facilitates statistical calculations on continuous data within the stream. These systems must be designed to handle high volumes of data, be scalable, and accommodate a multitude of long-lived, concurrently running analytics. The challenges involved in developing stream processing systems include transforming data streams on-the-fly to match the query needs of users, modeling stream transformations so that overlaps and opportunities for optimization can be detected, and specifying a methodology to deliver those optimizations. In particular, this work focuses on exposing data stream application internals in order to detect reusable parts and then consolidate applications to optimize computational resource usage. The Streaming Data Analytics Model presented in this paper adopts a declarative approach that enables processing and manipulation of data streams in a simple manner while facilitating the powerful optimizations necessary for managing high volumes of streaming data in real time. An evaluation is provided to demonstrate, in both theoretical and quantitative terms, the high performance offered by our approach.

8.
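Research on stream data management systems has become a widely recognized focus of the database field. This paper discusses in detail the basic concepts of stream data management systems, stream data models and query semantics, and stream data query algorithms, and proposes future research directions for many important problems in stream data management systems research.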

9.
Energy efficiency of data analysis systems has become a very important issue in recent times because of the increasing costs of data center operations. Although distributed streaming workloads are increasingly present in modern data centers, energy-efficient scheduling of such applications remains a significant challenge. In this paper, we conduct an energy consumption analysis of data stream processing systems in order to identify their energy consumption patterns. We follow a stream system benchmarking approach to address this issue. Specifically, we implement the Linear Road benchmark on six stream processing environments (S4, Storm, ActiveMQ, Esper, Kafka, and Spark Streaming) and characterize these systems' performance on a real-world data center. We study the energy consumption characteristics of each system with varying numbers of roads as well as with different types of component layouts. We also use a microbenchmark to capture raw energy consumption characteristics. We observed that the S4, Esper, and Spark Streaming environments had the highest average energy consumption efficiencies compared with the other systems. Using a neural network-based technique with the power/performance information gathered from our experiments, we developed a model for the power consumption behavior of a streaming environment. We observed that energy-efficient execution of streaming applications cannot be specifically attributed to the system CPU usage. We observed that communication between compute nodes with moderate tuple sizes and scheduling plans with balanced system overhead produce better power consumption behaviors in the context of data stream processing systems. Copyright © 2016 John Wiley & Sons, Ltd.

10.
This paper presents both a calculus for stream processing, named Brooklet, and its realization as an intermediate language, named River. Because River is based on Brooklet, it has a formal semantics that enables reasoning about the correctness of source translations and optimizations. River builds on Brooklet by addressing the real-world details that the calculus elides. We evaluated our system by implementing front-ends for three streaming languages, three important optimizations, and a back-end for the System S distributed streaming runtime. Overall, we significantly lower the barrier to entry for new stream-processing languages and thus grow the ecosystem of this crucial style of programming. Copyright © 2015 John Wiley & Sons, Ltd.

11.
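Stream processors offer higher performance and efficiency than traditional microprocessors and are widely used in image processing, media processing, and other fields. This paper designs and implements MASA-I, a 32-bit heterogeneous multi-core stream processor, on an Altera EP2S180 FPGA chip and evaluates its hardware cost and performance. The results show that a heterogeneous multi-core system based on stream processing can be implemented well on an FPGA and meets the needs of stream applications.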

12.
Exploiting punctuation semantics in continuous data streams   Cited by: 4 (self-citations: 0, citations by others: 4)
As most current query processing architectures are already pipelined, it seems logical to apply them to data streams. However, two classes of query operators are impractical for processing long or infinite data streams. Unbounded stateful operators maintain state with no upper bound in size and, so, run out of memory. Blocking operators read an entire input before emitting a single output and, so, might never produce a result. We believe that a priori knowledge of a data stream can permit the use of such operators in some cases. We discuss a kind of stream semantics called punctuated streams. Punctuations in a stream mark the end of substreams, allowing us to view an infinite stream as a mixture of finite streams. We introduce three kinds of invariants to specify the proper behavior of operators in the presence of punctuation. Pass invariants define when results can be passed on. Keep invariants define what must be kept in local state to continue successful operation. Propagation invariants define when punctuation can be passed on. We report on our initial implementation and show a strategy for proving that implementations of these invariants are faithful to their relational counterparts.
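
A small sketch can show how punctuation unblocks a normally blocking operator. It is an illustration under stated assumptions, not the paper's implementation: items are `(key, payload)` pairs, and the hypothetical `Punctuation(upto)` marker promises that no later item carries a key less than or equal to `upto`. The comments map each step loosely onto the pass, keep, and propagation invariants named in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Punctuation:
    """Hypothetical marker: no later item will have a key <= upto."""
    upto: int


def punctuated_count(stream):
    """Group-by count that emits a group as soon as punctuation closes it."""
    counts = {}
    for item in stream:
        if isinstance(item, Punctuation):
            # Pass: groups covered by the punctuation are final, so emit them.
            for key in sorted(k for k in counts if k <= item.upto):
                # Keep: popping the group leaves only open groups in state.
                yield key, counts.pop(key)
            # Propagate: downstream operators can rely on the same promise.
            yield item
        else:
            key, _payload = item
            counts[key] = counts.get(key, 0) + 1


# Assumed toy input mixing data items and punctuations.
stream = [
    (1, "x"), (2, "y"), (1, "z"), Punctuation(1),
    (2, "w"), (3, "v"), Punctuation(3),
]
results = list(punctuated_count(stream))
```

Without the punctuations, the operator could never emit a count on an infinite input, and its `counts` dictionary would grow without bound.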

13.
Technological advances in wireless sensor networks (WSNs) enable the development of complex applications including health monitoring, environmental sampling, and disaster area monitoring. WSN applications deploy battery-powered sensors at remote locations for long periods. The development of energy-efficient and complex WSN applications therefore requires in-depth embedded systems programming skills that are normally not found in domain experts. To overcome this challenge, programming environments for WSNs need to offer a high degree of productivity, flexibility, and efficiency at the same time. In this work, we present Curracurrong, a development environment for WSNs that is based on expressing queries with stream programming. A query is represented as a stream graph consisting of stream operators and communication channels. Curracurrong provides an extensible stream operator library that adapts to a wide range of applications. It uses a novel placement algorithm that optimizes the energy consumption on sensor nodes. Through a case study, we demonstrate the productivity and flexibility of our system. We conduct experiments that evaluate the energy efficiency of our optimized operator placement algorithm. Copyright © 2012 John Wiley & Sons, Ltd.

14.
Research on data management systems for stream data   Cited by: 2 (self-citations: 1, citations by others: 1)
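Traditional relational database systems are typically used to store relatively static data with no notion of time. In some new application domains, information is produced as data sequences and must be processed continuously and in real time, which is beyond the capability of traditional systems. A data stream management system is a data management system designed for stream data; it processes incoming stream data efficiently and supports continuous queries. This paper analyzes the overall architecture of data stream management systems and focuses on stream data models and stream queries.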

15.
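Queries over stream data have very broad applications, but standard SQL does not support them, so standard SQL must be extended to meet the needs of stream data query applications. StreamSQL, a query language that supports stream data, adds mechanisms for handling stream data objects on top of standard SQL. By introducing the concept of sliding windows, it supports conversion between stream data and relational tables, and it provides user-defined functions, compensating for the shortcomings of SQL in stream data processing.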

16.
Sliding window-based frequent pattern mining over data streams   Cited by: 2 (self-citations: 0, citations by others: 2)
Finding frequent patterns in a continuous stream of transactions is critical for many applications such as retail market data analysis, network monitoring, web usage mining, and stock market prediction. Even though numerous frequent pattern mining algorithms have been developed over the past decade, new solutions for handling stream data are still required due to the continuous, unbounded, and ordered sequence of data elements generated at a rapid rate in a data stream. Therefore, extracting frequent patterns from more recent data can enhance the analysis of stream data. In this paper, we propose an efficient technique to discover the complete set of recent frequent patterns from a high-speed data stream over a sliding window. We develop a Compact Pattern Stream tree (CPS-tree) to capture the recent stream data content and efficiently remove the obsolete, old stream data content. We also introduce the concept of dynamic tree restructuring in our CPS-tree to produce a highly compact frequency-descending tree structure at runtime. The complete set of recent frequent patterns is obtained from the CPS-tree of the current window using an FP-growth mining technique. Extensive experimental analyses show that our CPS-tree is highly efficient in terms of memory and time complexity when finding recent frequent patterns from a high-speed data stream.

17.
Of late, there has been considerable interest in models, algorithms, and methodologies specifically targeted towards designing hardware and software for streaming applications. Such applications process potentially infinite streams of audio/video data or network packets and are found in a wide range of devices, from mobile phones to set-top boxes. Given a streaming application and an architecture, the timing analysis problem is to determine the timing properties of the processed data stream, given the timing properties of the input stream. This problem arises while determining many common performance metrics related to streaming applications and the mapping of such applications onto hardware architectures. Such metrics include the maximum delay experienced by any data item of the stream and the maximum backlog or the buffer requirement to store the incoming stream. Most of the previous work related to estimating or optimizing these metrics takes a high-level view of the architecture and neglects micro-architectural features such as caches. In this paper, we show that an accurate estimation of these metrics, however, heavily relies on an appropriate modeling of the processor micro-architecture. Towards this, we present a novel framework for cache-aware timing analysis of stream processing applications. Our framework accurately models the evolution of the instruction cache of the underlying processor as a stream is processed, and the fact that the execution time involved in processing any data item depends on all the previous data items occurring in the stream. The main contribution of our method lies in its ability to seamlessly integrate program analysis techniques for micro-architectural modeling with known analytical methods for analyzing streaming applications, which treat the arrival/service of event streams as mathematical functions.
This combination is powerful as it makes it possible to model the code/cache behavior of the streaming application, as well as the manner in which it is triggered by event arrivals. We apply our analysis method to an MPEG-2 encoder application, and our experiments indicate that detailed modeling of the cache behavior is efficient, scalable, and leads to more accurate timing/buffer size estimates.
Lothar Thiele

18.
A data stream manipulation language supporting multiple targets   Cited by: 1 (self-citations: 0, citations by others: 1)
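With the emergence and wide use of data streams across application domains, data stream research has become a new direction in database technology and has attracted growing attention. As the bridge for semantic exchange between users and the data stream management system, the data stream manipulation language largely reflects the characteristics of data stream processing. This paper proposes a data stream manipulation language for data stream management systems that supports multiple targets: it can operate on both data streams and relational tables. In addition, to reflect the characteristics of data streams, the language introduces concepts such as timestamps, time granularity, continuous queries, and approximate queries, and supports the related techniques with a rich and flexible syntax.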

19.
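In large-scale network security monitoring applications, decision makers use stream online analytical processing (Stream OLAP) to build a stream data cube (Stream Cube) over the network security event stream for real-time analysis, in order to understand the current network security situation and dynamically assess the current security posture. Because memory capacity is limited, a Stream Cube keeps only the data within the current time window; expired data outside the window is stored approximately or simply discarded, so queries over time windows larger than the current window are not supported. To address this limitation, we propose HS-Stream Cube, a multidimensional, multilevel framework for real-time analysis of security event streams, which uses a two-level hybrid storage model combining main memory and external storage to answer exact queries over arbitrary time windows. In light of the characteristics of data streams, we then study the model, construction, storage management, and querying of HS-Stream Cube under the two-level hybrid storage model. Finally, experiments verify the usability and efficiency of the system.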

20.
Mining neighbor-based patterns in data streams   Cited by: 1 (self-citations: 0, citations by others: 1)
Discovery of complex patterns such as clusters, outliers, and associations from huge volumes of streaming data has been recognized as critical for many application domains. However, little research effort has been made toward detecting patterns within sliding window semantics as required by real-time monitoring tasks, ranging from real-time traffic monitoring to stock trend analysis. Applying static pattern detection algorithms from scratch to every window is impractical due to their high algorithmic complexity and the real-time responsiveness required by streaming applications. In this work, we develop methods for the incremental detection of neighbor-based patterns, in particular, density-based clusters and distance-based outliers over sliding stream windows. Incremental computation for pattern detection queries is challenging. This is because purging of to-be-expired data from previously formed patterns may cause birth, shrinkage, splitting, or termination of these complex patterns. To overcome this, we exploit the “predictability” property of sliding windows to elegantly discount the effect of expired objects with little maintenance cost. Our solution achieves guaranteed minimal CPU consumption, while keeping the memory utilization linear in the number of objects in the window. To thoroughly analyze the performance of our proposed methods, we develop a cost model characterizing the performance of our proposed neighbor-based pattern mining strategies. We conduct an analysis study to not only identify the key performance factors for each strategy but also show under which conditions each of them is most efficient. Our comprehensive experimental study, using both synthetic and real data from the domains of moving object monitoring and stock trades, demonstrates the superiority of our proposed strategies over alternative methods in both CPU processing resources and memory utilization.
