首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 211 毫秒
1.
Healthcare scientific applications, such as body area network, require of deploying hundreds of interconnected sensors to monitor the health status of a host. One of the biggest challenges is the streaming data collected by all those sensors, which needs to be processed in real time. Follow-up data analysis would normally involve moving the collected big data to a cloud data center for status reporting and record tracking purpose. Therefore, an efficient cloud platform with very elastic scaling capacity is needed to support such kind of real time streaming data applications. The current cloud platform either lacks of such a module to process streaming data, or scales in regard to coarse-grained compute nodes.In this paper, we propose a task-level adaptive MapReduce framework. This framework extends the generic MapReduce architecture by designing each Map and Reduce task as a consistent running loop daemon. The beauty of this new framework is the scaling capability being designed at the Map and Task level, rather than being scaled from the compute-node level. This strategy is capable of not only scaling up and down in real time, but also leading to effective use of compute resources in cloud data center. As a first step towards implementing this framework in real cloud, we developed a simulator that captures workload strength, and provisions the amount of Map and Reduce tasks just in need and in real time.To further enhance the framework, we applied two streaming data workload prediction methods, smoothing and Kalman filter, to estimate the unknown workload characteristics. We see 63.1% performance improvement by using the Kalman filter method to predict the workload. We also use real streaming data workload trace to test the framework. Experimental results show that this framework schedules the Map and Reduce tasks very efficiently, as the streaming data changes its arrival rate.  相似文献   

2.
Online mining of frequent sets in data streams with error guarantee   总被引:7,自引:5,他引:2  
For most data stream applications, the volume of data is too huge to be stored in permanent devices or to be thoroughly scanned more than once. It is hence recognized that approximate answers are usually sufficient, where a good approximation obtained in a timely manner is often better than the exact answer that is delayed beyond the window of opportunity. Unfortunately, this is not the case for mining frequent patterns over data streams where algorithms capable of online processing data streams do not conform strictly to a precise error guarantee. Since the quality of approximate answers is as important as their timely delivery, it is necessary to design algorithms to meet both criteria at the same time. In this paper, we propose an algorithm that allows online processing of streaming data and yet guaranteeing the support error of frequent patterns strictly within a user-specified threshold. Our theoretical and experimental studies show that our algorithm is an effective and reliable method for finding frequent sets in data stream environments when both constraints need to be satisfied.  相似文献   

3.
流媒体技术应用越来越广泛,但数据传输中的延迟、抖动,影响了媒体流播放质量。如何提供保证性能的流媒体服务成为推广流媒体技术的关键。提出了最小延迟算法,可以提供高的信道利用率及高的目的端缓冲区数据吞吐量,提高媒体流播放质量。还提出了最小聚类延迟算法作为改进,进一步优化媒体流整体播放性能。上述两种算法在流媒体技术中有一定的应用推广价值。  相似文献   

4.
Of late, there has been a considerable interest in models, algorithms and methodologies specifically targeted towards designing hardware and software for streaming applications. Such applications process potentially infinite streams of audio/video data or network packets and are found in a wide range of devices, starting from mobile phones to set-top boxes. Given a streaming application and an architecture, the timing analysis problem is to determine the timing properties of the processed data stream, given the timing properties of the input stream. This problem arises while determining many common performance metrics related to streaming applications and the mapping of such applications onto hardware architectures. Such metrics include the maximum delay experienced by any data item of the stream and the maximum backlog or the buffer requirement to store the incoming stream. Most of the previous work related to estimating or optimizing these metrics take a high-level view of the architecture and neglect micro-architectural features such as caches. In this paper, we show that an accurate estimation of these metrics, however, heavily relies on an appropriate modeling of the processor micro-architecture. Towards this, we present a novel framework for cache-aware timing analysis of stream processing applications. Our framework accurately models the evolution of the instruction cache of the underlying processor as a stream is processed, and the fact that the execution time involved in processing any data item depends on all the previous data items occurring in the stream. The main contribution of our method lies in its ability to seamlessly integrate program analysis techniques for micro-architectural modeling with known analytical methods for analyzing streaming applications, which treat the arrival/service of event streams as mathematical functions. This combination is powerful as it allows to model the code/cache-behavior of the streaming application, as well as the manner in which it is triggered by event arrivals. We employ our analysis method to an MPEG-2 encoder application and our experiments indicate that detailed modeling of the cache behavior is efficient, scalable and leads to more accurate timing/buffer size estimates.
Lothar ThieleEmail:
  相似文献   

5.
数据流的流量太大会无法被整个存储,或被多次扫描。为此,在研究已有挖掘算法的基础上,提出一种界标窗口中数据流频繁模式挖掘算法DSMFP_LW。利用扩展前缀模式树存储全局临界频繁模式,实现单遍扫描数据流和数据增量更新。实验结果表明,与Lossy Counting算法相比,DSMFP_LW算法具有更好的时空效率。  相似文献   

6.
随着流媒体技术的兴起,越来越多的人选择从网上获得视频点播、网络电视、远程会议、远程教育等服务.但同时网络上充斥着一些淫秽、非法等有害视频,对社会造成很大危害.现有的基于视频内容的离线检测方法需要解码数据包,且匹配算法复杂度高,不适于数据量庞大的流媒体服务器或网关做实时检测.为此,提出一种基于媒体业务流特征的视频匹配算法,对流媒体服务器的码流进行实时检测和过滤.采用被广泛应用的MP4码流作为测试对象,分析了算法中各个参数对于判定结果的影响.实验结果表明,所提算法简单易行,无论是在拒真率还是取伪率方面都取得较为满意的结果.  相似文献   

7.
Anomaly detection refers to the identification of patterns in a dataset that do not conform to expected patterns. Such non‐conformant patterns typically correspond to samples of interest and are assigned to different labels in different domains, such as outliers, anomalies, exceptions, and malware. A daunting challenge is to detect anomalies in rapid voluminous streams of data. This paper presents a novel, generic real‐time distributed anomaly detection framework for multi‐source stream data. As a case study, we investigate anomaly detection for a multi‐source VMware‐based cloud data center, which maintains a large number of virtual machines (VMs). This framework continuously monitors VMware performance stream data related to CPU statistics (e.g., load and usage). It collects data simultaneously from all of the VMs connected to the network and notifies the resource manager to reschedule its CPU resources dynamically when it identifies any abnormal behavior from its collected data. A semi‐supervised clustering technique is used to build a model from benign training data only. During testing, if a data instance deviates significantly from the model, then it is flagged as an anomaly. Effective anomaly detection in this case demands a distributed framework with high throughput and low latency. Distributed streaming frameworks like Apache Storm, Apache Spark, S4, and others are designed for a lower data processing time and a higher throughput than standard centralized frameworks. We have experimentally compared the average processing latency of a tuple during clustering and prediction in both Spark and Storm and demonstrated that Spark processes a tuple much quicker than storm on average. Copyright © 2016 John Wiley & Sons, Ltd.  相似文献   

8.
In recent years, network streaming becomes a highly popular research topic in computer science due to the fact that a large proportion of network traffic is occupied by multimedia streaming. In this paper we present novel methodologies for enhancing the streaming capabilities of Java RMI. Our streaming support for Java RMI includes the pushing mechanism, which allows the servers to push data in a streaming fashion to the client site, and the aggregation mechanism, which allows the client site to make a single remote invocation to gather data from multiple servers that keep replicas of data streams and aggregate partial data into a complete data stream. In addition, our system also allows the client site to forward local data to other clients. Our framework is implemented by extending the Java RMI stub to allow custom designs for streaming buffers and controls, and by providing a continuous buffer for raw data in the transport layer socket. This enhanced framework allows standard Java RMI services to enjoy streaming capabilities. In addition, we propose aggregation algorithms as scheduling methods in such an environment. Preliminary experiments using our framework demonstrate its promising performance in the provision of streaming services in Java RMI layers.  相似文献   

9.
由于流数据无限增长的特点,系统无法在内存中保存所有扫描过的流数据,因此数据流处理的关键是建立流数据的概要结构,以便随时能根据该结构提供数据流的近似处理结果,将重点讨论数据流的概要生成技术。先利用经验模态分解方法提取流数据的趋势,滤除数据中的噪声,再利用精确抽样方法实现概要的生成。利用提出的概要生成方法,内存中只需保存滑动窗口中多个段的概要信息。由于该方法中概要是基于趋势序列生成的,趋势序列较原序列平滑,序列中具有相同数值的元素增加,可以进一步节省存储空间。  相似文献   

10.
刘建伟  李卫民 《计算机科学》2009,36(11):148-151
传统的数据库管理系统和数据查询算法不能很好地支持对流数据的查询已经被广泛认识,因而需要研究新的流数据模式查询算法.提出了一种基于摘要技术的在线快速混合模型流数据聚类算法,该算法为分阶段混合模型聚类过程.算法首先时最初到达的流数据用多维网格结构进行划分,对划分形成的每一个单元进行数据摘要,提取足够的统计信息.对该摘要运行基于模型的贪心聚类算法,聚类形成的混合模型的摘要信息存储在永久摘要数据库中,从而形成初始聚类混合模型;在聚类模型的维持过程中,当不断有流数据到达时,对到达的数据块用多维网格结构进行划分,对划分形成的每一个单元提取足够的摘要信息.对该摘要运行基于模型的贪心聚类算法形成聚类混合模型.在判断是否可以把新到达的模型合并到现有的混合模型中去时,提出了三种合并标准.实验表明,该算法减少了分类误差,其速度也比传统的基于模型的贪心聚类算法大大加快.  相似文献   

11.
数据流高速、连续无限和动态的特性使得传统的数据分析和挖掘技术无效或需要改进。以数据流分类为重点,分析了数据流分类中的一些关键问题,综述了典型的数据流分类技术;针对现有方法的不足,给出了应用主动学习和半监督学习的新思路。  相似文献   

12.
Edge computing combining with artificial intelligence (AI) has enabled the timely processing and analysis of streaming data produced by IoT intelligent applications. However, it causes privacy risk due to the data exchanges between local devices and untrusted edge servers. The powerful analytical capability of AI further exacerbates the risks because it can even infer private information from insensitive data. In this paper, we propose a privacy-preserving IoT streaming data analytical framework based on edge computing, called PrivStream, to prevent the untrusted edge server from making sensitive inferences from the IoT streaming data. It utilizes a well-designed deep learning model to filter the sensitive information and combines with differential privacy to protect against the untrusted edge server. The noise is also injected into the framework in the training phase to increase the robustness of PrivStream to differential privacy noise. Taking into account the dynamic and real-time characteristics of streaming data, we realize PrivStream with two types of models to process data segment with fixed length and variable length, respectively, and implement it on a distributed streaming platform to achieve real-time streaming data transmission. We theoretically prove that Privstream satisfies ε-differential privacy and experimentally demonstrate that PrivStream has better performance than the state-of-the-art and has acceptable computation and storage overheads.  相似文献   

13.
Jan Wassenberg 《Software》2012,42(9):1095-1106
This report introduces a new lossless asymmetric single instruction multiple data codec designed for extremely efficient decompression of large satellite images. A throughput in excess of 3GB/s allows decompression to proceed in parallel with asynchronous transfers from fast block devices such as disk arrays. This is made possible by a simple and fast single instruction multiple data entropy coder that removes leading null bits. Our main contribution is a new approach for vectorized prediction and encoding. Unlike previous approaches that treat the entropy coder as a black box, we account for its properties in the design of the predictor. The resulting compressed stream is 1.2 to 1.5 times as large as JPEG‐2000, but can be decompressed 100 times as quickly – even faster than copying uncompressed data in memory. Applications include streaming decompression for out of core visualization. To the best of our knowledge, this is the first entirely vectorized algorithm for lossless compression. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

14.
Detection of changes in streaming data is an important mining task, with a wide range of real-life applications. Numerous algorithms have been proposed to efficiently detect changes in streaming data. However, the limitation of existing algorithms is that they assume that data are generated independently. In particular, temporal dependencies of data in a stream are still not thoroughly studied. Motivated by this, in this work we propose a new efficient method to detect changes in streaming data by exploring the temporal dependencies of data in the stream. As part of this, we introduce a new statistical model called the Candidate Change Point (CCP) model, with which the main idea is to compute the probabilities of finding change points in the stream. The computed probabilities are used to generate a distribution, which is, in turn, used in statistical hypothesis tests to determine the candidate changes. We use the CCP model to develop a new algorithm called Candidate Change Point Detector (CCPD), which detects change points in linear time, and is thus applicable for real-time applications. Our extensive experimental evaluation demonstrates the efficiency and the feasibility of our approach.  相似文献   

15.
Streaming data are widely used in today’s world. Data come from different sources in streams and must be processed online and with minimum delay. These data stream can contain confidential data such as customers’ purchase information and need to be mined in order to reveal other useful information like customers’ purchase patterns. Privacy preservation throughout these processes plays a crucial role. K-anonymity is a well-known technique for preserving privacy. The principle issues in k-anonymity are information loss and running time. Although some of the existing k-anonymity techniques are able to generate anonymized data with acceptable information loss, their main drawback is that they are very time-consuming and are not applicable in a streaming context since streaming data are usually very sensitive to delay and need to be processed quite fast. In [32], we proposed a cluster-based k-anonymity algorithm called fast anonymizing algorithm for numerical streaming data (FAANST) which can anonymize numerical streaming data quite fast while providing an admissible information loss. The main drawback of FAANST is that some tuples may remain in the system for a long time and are output when they might be considered to have expired. In this paper, we propose two extensions for FAANST, passive and proactive solutions. These two solutions put a soft deadline, called $delay$ , on the time each tuple can stay in the system, and if a tuple passes this deadline, these algorithms force the tuple to be output. The proactive solution goes even one step further and utilizes a simple heuristic function to predict when a tuple in the system may expire and outputs the tuple if it will expire in the next round of the algorithm’s execution.  相似文献   

16.
Rate-adaptive streaming technologies, such as the Dynamic Adaptive Streaming over HTTP (DASH) standard, provides an efficient and easy solution to stream multimedia in a heterogenous context. However, it reinforces the streaming capacity problem in the core Content Delivery Network (CDN) infrastructure since delivering one video means delivering an aggregation of multiple representations. In particular, for live rate-adaptive streaming, a large set of non-divisible data streams need to be either delivered in whole, or not delivered at all. Previous theoretical models that deal with streaming capacity problems are based on elastic bit rates, and do not capture these emerging features faced by today’s CDNs. In this paper, we identify a new, discretized streaming model, for live rate-adaptive video delivery in CDNs. For this model we formulate a general optimization problem and show that it is NP-complete. Then we study two fundamental scenarios that occur in real CDNs. For each of these scenarios, we present a fast, easy to implement, and near-optimal algorithm with performance approximation ratios that are negligible for large networks. These are the first sets of results for the discretized streaming model, and have both practical and theoretical importance in a topic that has become critical.  相似文献   

17.
We consider the problem of distributed packet selection and scheduling for multiple video streams sharing a communication channel. An optimization framework is proposed, which enables the multiple senders to coordinate their packet transmission schedules, such that the average quality over all video clients is maximized. The framework relies on rate-distortion information that is used to characterize a video packet. This information consists of two quantities: the size of the packet in bits, and its importance for the reconstruction quality of the corresponding stream. A distributed streaming strategy then allows for trading off rate and distortion, not only within a single video stream, but also across different streams. Each of the senders allocates to its own video packets a share of the available bandwidth on the channel in proportion to their importance. We evaluate the performance of the distributed packet scheduling algorithm for two canonical problems in streaming media, namely adaptation to available bandwidth and adaptation to packet loss through prioritized packet retransmissions. Simulation results demonstrate that, for the difficult case of scheduling nonscalably encoded video streams, our framework is very efficient in terms of video quality, both over all streams jointly and also over the individual videos. Compared to a conventional streaming system that does not consider the relative importance of the video packets, the gains in performance range up to 6 dB for the scenario of bandwidth adaptation, and even up to 10 dB for the scenario of random packet loss adaptation.  相似文献   

18.
CAVLC是H.264中熵编码的一种重要实现方式,具有可挖掘的数据级并行特征,但同时具有较强的串行特点。本文分析了CAVLC的程序特征,提出了CAVLC的流式实现方法,并在流处理器STORM-1上进行了实现。实验结果表明本方法能够满足实时高清H.264编码的性能需求。  相似文献   

19.
分布式集群环境使得数据实时计算更为复杂,流式大数据处理系统的正确性难以保障.现有的大数据基准测试框架可以测试流式大数据处理系统的性能表现,但是普遍存在应用场景设计简单、评价指标不充分等不足.针对这一挑战,本文构造了一个面向股票交易场景的流式大数据基准测试框架,通过生成股票高频交易数据,测试系统在高流速场景下的延迟、吞吐量、GC时间、CPU资源等的性能表现.同时,通过横向测试验证流式大数据系统的扩展性.本文以Apache Spark Streaming为待测系统进行测试,实验结果表明,高流速场景下出现延迟增加、GC时间提高等性能下降问题,原因是系统输入速率的提高及并行度的增加.  相似文献   

20.
Mining data streams is the process of extracting information from non-stopping, rapidly flowing data records to provide knowledge that is reliable and timely. Streaming data algorithms need to be one pass and operate under strict limitations of memory and response time. In addition, the classification of streaming data requires learning in an environment where the data characteristics might change constantly. Many of the classification algorithms presented in literature assume a 100 % labeling rate, which is impractical and expensive when data records are rapidly flowing in. In this paper, a new incremental grid density based learning framework, the GC3 framework, is proposed to perform classification of streaming data with concept drift and limited labeling. The proposed framework uses grid density clustering to detect changes in the input data space. It maintains an evolving ensemble of classifiers to learn and adapt to the model changes over time. The framework also uses a uniform grid density sampling mechanism to obtain a uniform subset of samples for better classification performance with a lower labeling rate. The entire framework is designed to be one-pass, incremental and work with limited memory to perform any-time classification on demand. Experimental comparison with state of the art concept drift handling systems demonstrate the GC3 frameworks ability to provide high classification performance, using fewer models in the ensemble and with only 4-6 % of the samples labeled. The results show that the GC3 framework is effective and attractive for use in real world data stream classification applications.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号