首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
带权值数据流滑动窗口随机抽样算法的改进   总被引:3,自引:0,他引:3       下载免费PDF全文
通过改进加权抽样算法,结合基本窗口技术,提出了两种面向带权值数据流上连续更新滑动窗口的随机抽样算法:WRSB算法和IWRSB算法。当新的数据元组到达时,根据数据元组的权值计算出该元组的键值,根据元组键值的大小决定其是否进入样本集以及样本集中被替换的数据元组,同时设置一个系统缓冲区来保存最近到达的键值较大的部分数据元组,作为过期数据元组的后备,使算法能够有效地处理过期数据元组问题。理论分析和实验结果表明,两种算法都能有效地处理带权值数据流上连续更新滑动窗口的随机抽样问题,相比较而言,IWRSB算法具有更好的性能。  相似文献   

2.
《Information Fusion》2008,9(3):412-424
Data processing applications for sensor streams have to deal with multiple continuous data streams with inputs arriving at highly variable and unpredictable rates from various sources. These applications perform various operations (e.g. filter, aggregate, join, etc.) on incoming data streams in real-time according to predefined queries or rules. Since the data rate and data distribution fluctuate over time, an appropriate join tree for processing join queries must be adaptively maintained in response to dynamic changes to prevent rapid degradation of the system performance. In this paper, we address the problem of finding an optimal join tree that maximizes throughput for sliding window based multi-join queries over continuous data streams and prove its NP-Hardness. We present a dynamic programming algorithm, OptDP, which produces the optimal tree but runs in an exponential time in the number of input streams. We then present a polynomial time greedy algorithm, XGreedyJoin. We tested these algorithms in ARES, an adaptively re-optimizing engine for stream queries, which we developed by extending Jess (Jess is a popular RETE-based, forward chaining rule engine written in java). For almost all instances, trees from XGreedyJoin perform close to the optimal trees from OptDP, and significantly better than common heuristics-based XJoin algorithms.  相似文献   

3.
Frequent Itemsets Mining has been applied in many data processing applications with remarkable results. Recently, data streams processing is gaining a lot of attention due to its practical applications. Data in data streams are transmitted at high rates and cannot be stored for offline processing making impractical to use traditional data mining approaches (such as Frequent Itemsets Mining) straightforwardly on data streams. In this paper, two single-pass parallel algorithms based on a tree data structure for Frequent Itemsets Mining on data streams are proposed. The presented algorithms employ Landmark and Sliding Window Models for windows handling. In the presented paper, as in other revised papers, if the number of frequent items on data streams is low then the proposed algorithms perform an exact mining process. On the contrary, if the number of frequent patterns is large the mining process is approximate with no false positives produced. Experiments conducted demonstrate that the presented algorithms outperform the processing time of the hardware architectures reported in the state-of-the-art.  相似文献   

4.
在分布式数据流中,数据流之间相关性分析可以揭示被监测对象之间存在的内在联系。提出了一个基于基窗口的相关系数的计算方法,该方法先将计算相关系数的公式变形为由适合基窗口聚集的因子组成,然后用基于基窗口的方法聚集每个因子。基于基窗口的聚集方法是将窗口中的数据项划分成一系列基窗口并分别对基窗口进行计算。当窗口随机滑动后,新窗口中数据项的聚集可以部分地利用上一次窗口聚集的结果。模拟实验表明,与每次对窗口中所有数据进行聚集相比,基于基窗口的方法可以有效地降低数据流相关系数的计算时间。  相似文献   

5.
Tuple dropping, though commonly used for load shedding in most data stream operations, is generally inadequate for multiway windowed stream joins. The join output rate can be unnecessarily reduced because tuple dropping fails to exploit the time correlations that are likely to exist among interrelated streams. In this paper, we introduce GrubJoin-an adaptive multiway windowed stream join that effectively performs time correlation-aware CPU load shedding. GrubJoin maximizes the output rate by achieving near-optimal window harvesting, which picks only the most profitable segments of individual windows for the join. Due mainly to the combinatorial explosion of possible multiway join sequences involving different window segments, GrubJoin faces unique challenges that do not exist for binary joins, such as determining the optimal window harvesting configuration in a time-efficient manner and learning the time correlations among the streams without introducing overhead. To tackle these challenges, we formalize window harvesting as an optimization problem, develop greedy heuristics to determine near-optimal window harvesting configurations, and use approximation techniques to capture the time correlations. Our experimental results show that GrubJoin is vastly superior to tuple dropping when time correlations exist and is equally effective when time correlations are nonexistent.  相似文献   

6.
多数据流滑动窗口并发连接方法   总被引:10,自引:1,他引:9  
提出一种多数据流滑动窗口连接方法M3Join及其实现架构Roujoin.Roujoin由一个连接路由表和多个连接区组成,其内容根据并发连接请求设置,先将新元组插入缓冲区,然后根据其路由标记查找连接路由表进入合适的连接区执行连接或输出给用户.如果产生连接元组,则更改其路由标记后送回连接路由表,并反复迭代直到没有连接元组.由于共享中间结果,在处理多个并发查询时只需扫描流元组一遍.实验结果表明M3Join具有良好的性能,能够满足并发连接查询处理的需求.  相似文献   

7.
Anytime algorithms have been proposed for many different applications, e.g., in data mining. Their strengths are the ability to first provide a result after a very short initialization and second to improve their result with additional time. Therefore, anytime algorithms have so far been used when the available processing time varies, e.g., on varying data streams. In this paper we propose to employ anytime algorithms on constant data streams, i.e., for tasks with constant time allowance. We introduce two approaches that harness the strengths of anytime algorithms on constant data streams and thereby improve the over all quality of the result with respect to the corresponding budget algorithm. We derive formulas for the expected performance gain and demonstrate the effectiveness of our novel approaches using existing anytime algorithms on benchmark data sets.  相似文献   

8.
This paper introduces a class of join algorithms, termed W-join, for joining multiple infinite data streams. W-join addresses the infinite nature of the data streams by joining stream data items that lie within a sliding window and that match a certain join condition. In addition to its general applicability in stream query processing, W-join can be used to track the motion of a moving object or detect the propagation of clouds of hazardous material or pollution spills over time in a sensor network environment. We describe two new algorithms for W-join and address variations and local/global optimizations related to specifying the nature of the window constraints to fulfill the posed queries. The performance of the proposed algorithms is studied experimentally in a prototype stream database system, using synthetic data streams and real time-series data. Tradeoffs of the proposed algorithms and their advantages and disadvantages are highlighted, given variations in the aggregate arrival rates of the input data streams and the desired response times per query. This is an extended version of the paper published in the Proceedings of the 15th International Conference on Scientific and Statistical Database Management, SSDBM 2003, Boston, U.S.A., pp. 75–84.  相似文献   

9.
Recently a few Continuous Query systems have been developed to cope with applications involving continuous data streams. At the same time, numerous algorithms are proposed for better performance. A recent work on this subject was to define scheduling strategies on shared window joins over data streams from multiple query expressions. In these strategies, a tuple with the highest priority is selected to process from multiple candidates. However, the performance of these static strategies is deeply influenced when data are bursting, because the priority is determined only by static information, such as the query windows, arriving order, etc. In this paper, we propose a novel adaptive strategy where the priority of a tuple is integrated with realtime information. A thorough experimental evaluation has demonstrated that this new strategy can outperform the existing strategies.  相似文献   

10.
To enable efficiency in stream processing, the evaluation of a query is usually performed over bounded parts of (potentially) unbounded streams, i.e., processing windows “slide” over the streams. To avoid inefficient re-evaluations of already evaluated parts of a stream in respect to a query, incremental evaluation strategies are applied, i.e., the query results are obtained incrementally from the result set of the preceding processing state without having to re-evaluate all input buffers. This method is highly efficient but it comes at the cost of having to maintain processing state, which is not trivial, and may defeat performance advantages of the incremental evaluation strategy. In the context of RDF streams the problem is further aggravated by the hard-to-predict evolution of the structure of RDF graphs over time and the application of sub-optimal implementation approaches, e.g., using relational technologies for storing data and processing states which incur significant performance drawbacks for graph-based query patterns. To address these performance problems, this paper proposes a set of novel operator-aware data structures coupled with incremental evaluation algorithms which outperform the counterparts of relational stream processing systems. This claim is demonstrated through extensive experimental results on both simulated and real datasets.  相似文献   

11.
一种基于变尺度滑动窗口的数据流频繁集挖掘算法   总被引:2,自引:0,他引:2  
基干传统滑动窗口机制的数据流频繁集挖掘算法较多地考虑快速且精确的效果,而较少考虑数据流的时变特性,对传统的滑动窗口机制进行改进.同时考虑数据流的海量特性和时变特性,提出一种基于变尺度滑动窗口机制的数据流频繁集挖掘算法V-Stream.该算法采用事务链表组的概要数据结构.能够根据数据流的数据分布变化自适应调整窗口大小.Eclipse上的仿真实验结果表明,V-Stream相比Manku算法提高了挖掘数据流频繁集的时间与空间效率.  相似文献   

12.
Mining utility itemsets from data steams is one of the most interesting research issues in data mining and knowledge discovery. In this paper, two efficient sliding window-based algorithms, MHUI-BIT (Mining High-Utility Itemsets based on BITvector) and MHUI-TID (Mining High-Utility Itemsets based on TIDlist), are proposed for mining high-utility itemsets from data streams. Based on the sliding window-based framework of the proposed approaches, two effective representations of item information, Bitvector and TIDlist, and a lexicographical tree-based summary data structure, LexTree-2HTU, are developed to improve the efficiency of discovering high-utility itemsets with positive profits from data streams. Experimental results show that the proposed algorithms outperform than the existing approaches for discovering high-utility itemsets from data streams over sliding windows. Beside, we also propose the adapted approaches of algorithms MHUI-BIT and MHUI-TID in order to handle the case when we are interested in mining utility itemsets with negative item profits. Experiments show that the variants of algorithms MHUI-BIT and MHUI-TID are efficient approaches for mining high-utility itemsets with negative item profits over stream transaction-sensitive sliding windows.  相似文献   

13.
This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing site for query execution. This typically introduces high communication overhead. Our observation is that semijoin, effective in reducing communication overhead in distributed database query processing, can be also effective in distributed stream query processing. The challenge, however, lies in the streaming nature of the tuples, as it requires continuous and incremental processing of an unbounded sequence of tuples instead of one-time processing of a set of stored tuples. This paper describes our comprehensive work done to address the challenge. Specifically, we first propose a distributed stream join processing model that handles the issue of network delays introduced from the shipment of data streams, and allows for efficient batch processing. Then, based on the model, we propose join algorithms in a multi-way join case: first, one-way join algorithms for different combinations of join placement and join method and, then, multi-way join algorithms assuming linear join ordering. Regarding the join method, two distributed join methods are introduced: (1) simple join, in which full tuples are forwarded to the query processing site and (2) semijoin-based join, in which partial tuples are forwarded. A semijoin-based join can be executed with different possible semijoin strategies which incur different communication overheads. We present a complete set of join algorithms considering all possible semijoin strategies, and propose an optimization algorithm. The join algorithms are executed continuously in an incremental manner as tuples arrive, and never ship tuples redundantly. The optimization algorithm constructs an efficient multi-way join plan by using a greedy heuristic which adds to the plan one stream with the minimum join execution cost in each step. Through extensive experiments, we conduct comparative studies of the performance among the proposed one-way join algorithms and the efficiency of the generated plan between the optimization algorithm based on the greedy heuristic and the exhaustive search, respectively.  相似文献   

14.
We address the problem of load shedding for continuous multi-way join queries over multiple data streams. When the arrival rates of tuples from data streams exceed the system capacity, a load shedding algorithm drops some subset of input tuples to avoid system overloads. To decide which tuples to drop among the input tuples, most existing load shedding algorithms determine the priority of each input tuple based on the frequency or some historical statistics of its join attribute value, and then drop tuples with the lowest priority. However, those value-based algorithms cannot determine the priorities of tuples properly in environments where join attribute values are unique and each join attribute value occurs at most once in each data stream. In this paper, we propose a load shedding algorithm specifically designed for such environments. The proposed load shedding algorithm determines the priority of each tuple based on the order of streams in which its join attribute value appears, rather than its join attribute value itself. Consequently, the priorities of tuples can be determined effectively in environments where join attribute values are unique and do not repeat. The experimental results show that the proposed algorithm outperforms the existing algorithms in such environments in terms of effectiveness and efficiency.  相似文献   

15.
Ed-Sjoin:一种优化的字符串相似连接算法   总被引:1,自引:0,他引:1  
相似连接(similarity join)在数据清洗、生物信息、模式识别等应用领域中有着广泛应用,其中基于编辑距离的字符串相似连接是一种重要的相似连接.尽管当前有一些基于编辑距离的字符串连接算法提出,然而,当前的算法存在着大量的多余计算,影响了算法的效率.为了高效计算基于编辑距离的字符串连接,提出了一种优化的算法Ed-sjoin,分别从优化筛选算法和基于前缀的重复消减策略两方面对算法进行优化,这些优化策略可以实现更加有效的剪枝,并且避免了部分重复计算,从而加速算法的执行.实验结果表明,提出的方法优于现有方法.  相似文献   

16.
Clustering data streams has drawn lots of attention in the last few years due to their ever-growing presence. Data streams put additional challenges on clustering such as limited time and memory and one pass clustering. Furthermore, discovering clusters with arbitrary shapes is very important in data stream applications. Data streams are infinite and evolving over time, and we do not have any knowledge about the number of clusters. In a data stream environment due to various factors, some noise appears occasionally. Density-based method is a remarkable class in clustering data streams, which has the ability to discover arbitrary shape clusters and to detect noise. Furthermore, it does not need the nmnber of clusters in advance. Due to data stream characteristics, the traditional density-based clustering is not applicable. Recently, a lot of density-based clustering algorithms are extended for data streams. The main idea in these algorithms is using density- based methods in the clustering process and at the same time overcoming the constraints, which are put out by data streanFs nature. The purpose of this paper is to shed light on some algorithms in the literature on density-based clustering over data streams. We not only summarize the main density-based clustering algorithms on data streams, discuss their uniqueness and limitations, but also explain how they address the challenges in clustering data streams. Moreover, we investigate the evaluation metrics used in validating cluster quality and measuring algorithms' performance. It is hoped that this survey will serve as a steppingstone for researchers studying data streams clustering, particularly density-based algorithms.  相似文献   

17.
数据流分类是数据挖掘领域的重要研究任务之一,已有的数据流分类算法大多是在有标记数据集上进行训练,而实际应用领域数据流中有标记的数据数量极少。为解决这一问题,可通过人工标注的方式获取标记数据,但人工标注昂贵且耗时。考虑到未标记数据的数量极大且隐含大量信息,因此在保证精度的前提下,为利用这些未标记数据的信息,本文提出了一种基于Tri-training的数据流集成分类算法。该算法采用滑动窗口机制将数据流分块,在前k块含有未标记数据和标记数据的数据集上使用Tri-training训练基分类器,通过迭代的加权投票方式不断更新分类器直到所有未标记数据都被打上标记,并利用k个Tri-training集成模型对第k+1块数据进行预测,丢弃分类错误率高的分类器并在当前数据块上重建新分类器从而更新当前模型。在10个UCI数据集上的实验结果表明:与经典算法相比,本文提出的算法在含80%未标记数据的数据流上的分类精度有显著提高。  相似文献   

18.
Sliding-window multi-stream join (SWMJ) is a fundamental operation for correlating information from different streams. We provide a solution to the problem of assessing significance of the SWMJ result by focusing on the relative frequency of windows satisfying a given equijoin predicate as the most important parameter of the SWMJ result. In particular, we derive a formula for computing the expected relative frequency of windows satisfying a given equijoin predicate that can be evaluated in quadratic time in the window size given a proposed probabilistic model of the multi-stream. In experiments conducted on a daily rainfall data set we demonstrate the remarkable accuracy of our method, which confirms our theoretical analysis.  相似文献   

19.
相似性连接技术在数据清洗、数据集成等领域中具有重要意义,近年来引起了学术界的广泛关注.随着数据量的不断增大、数据处理实时性的要求逐渐提高以及处理器性能提升瓶颈的出现,传统的串行相似性连接方法已经不能满足当前大数据处理的需求.近些年,GPU作为协处理器在机器学习等领域取得了良好的加速效果,因此基于GPU的并行算法开始成为解决各类性能问题的有效解决方案.为此,提出了基于CPU-GPU异构体系的并行相似性连接方法.首先,方法使用GPU构建倒排索引,索引采用SoA(struct of arrays)结构,从而解决了传统索引结构在并行模式下读写效率低的问题.其次,针对串行算法的性能问题,提出基于过滤验证框架的并行双重长度过滤算法,其中利用前缀过滤和构建好的倒排索引提升过滤效果.方法中相似度精确计算验证过程使用CPU计算执行,从而充分利用CPU-GPU的异构计算资源.最后,在多个数据集上进行实验验证性能.通过与串行相似性连接算法进行对比,实验结果表明所提出方法相对于已有方法具有更好的过滤效果和更低的索引生成代价,并在相似性连接上具有更好的性能和良好的加速比.  相似文献   

20.
In a shared-nothing parallel database system, a join operation is split into a set of tasks that are allocated to the nodes in the system to be executed concurrently and independently. While parallel processing could greatly reduce the completion time of a join operation, the system performance may degrade because of load imbalance across the nodes caused by data skewness in the relations. Load-balanced join processing uses various techniques to evenly distribute the load among nodes in a system and hence improves the overall system performance. In this paper, the basic issues in designing load-balanced parallel join algorithms are identified. From the solutions to those issues, a large set of load-balanced join algorithms can be constructed. Performance of four representative algorithms-two dynamic load-balancing algorithms proposed in this paper and two static load-balancing algorithms adapted from similar algorithms in the literature-is studied and compared with that of a parallel join algorithm that does not balance the join load. The results of our study clearly show the benefits of load-balancing. This study also demonstrates that the dynamic load-balancing techniques proposed in this paper not only are feasible but also provide good system performance.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号