期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

基于滑动窗口的数据流压缩技术及连续查询处理方法 总被引：8，自引：0，他引：8

王栩李建中王伟平《计算机研究与发展》2004,41(10):1639-1644

基于滑动窗口的连续查询处理是数据流研究领域的一个热点问题．已有的研究工作均假设滑动窗口内的数据能够全部保存在主存中，若滑动窗口内的数据量超过了可用主存空间，已有的查询处理方法则无法正常工作．提出两种数据流上的滑动窗口压缩技术，有效地降低了滑动窗口的存储空间需求．同时，给出了基于压缩滑动窗口的连续查询处理算法，理论分析和实验结果表明，这些算法具有很好的性能，能够满足数据流连续查询处理的实时性要求．相似文献

2.

基于动态可调衰减滑动窗口的变速数据流聚类算法

《计算机应用与软件》2015,(11)

在数据流聚类算法中,滑动窗口技术可以及时淘汰历史元组、只关注近期元组,从而改善数据流的聚类效果。如果同时数据流流速无规律地随时间动态变化,原来单纯的滑动窗口技术在解决这类问题时存在缺陷,所以,在充分考虑了滑动窗口大小和数据流流速之间关系的前提下,提出了基于动态可调衰减滑动窗口的变速数据流聚类算法。该算法对历史元组和近期元组分别赋予一定的权重进行处理,然后依据数据流流速的不同函数改变窗口的大小,从而实现数据流的聚类。提出了该数据流聚类算法的数据结构——变异数据流聚类的数据结构。通过真实数据和模拟数据来构造动态变速数据流从而作为验证算法的原始数据。实验结果表明,与Clu Stream聚类算法相比,该方法具有较高的聚类质量、较小的内存开销和较少的聚类处理时间。相似文献

3.

n-of-N 数据流模型上高效概率Skyline 计算

杨永滔王意洁《软件学报》2012,23(3):550-564

研究概率数据流上的q-skyline计算问题.与只支持滑动窗口数据流模型的已有方法相比,所提出的方法能够支持更为通用的n-of-N数据流模型.采用将q-skyline查询转换为区间树上刺入查询的方法支持n-of-N数据流模型.提出PnNM算法维护支持n-of-N数据流模型所需的相关数据结构,高效处理了不确定对象候选集合更新和区间更新等维护工作;提出PnNCont算法实现连续查询处理.理论分析和实验结果表明,算法能够有效地支持概率数据流n-of-N模型上的q-skyline查询处理. 相似文献

4.

滑动窗口规模的动态调整算法 总被引：9，自引：0，他引：9

李建中张冬冬《软件学报》2004,15(12):1800-1814

讨论当数据流系统的数据流流速或连续查询发生变化时,滑动窗口规模的动态调整问题.根据可用内存空间大小和连续查询需求,提出了3类动态调整滑动窗口规模的算法,实现了对连续查询3种服务质量级别的支持,提高了连续查询处理的效率和效果.理论分析与实验结果表明,提出的算法可以有效地应用于数据流系统. 相似文献

5.

一种不确定数据流子空间聚类算法

胡德敏余星《计算机应用研究》2014,31(9)

针对不确定数据流上的聚类问题提出一种不确定数据流子空间聚类算法UDSSC.该算法使用滑动窗口机制接收新到达的数据,剔除陈旧的数据;还引入子空间簇生成策略和新型离群点机制;系统建立了三个缓冲区分别存储新到来的元组、要进行聚类的元组和离群点元组,以此获得高质量的聚类结果.实验表明,UDSSC算法与同类型算法相比,具有更好的聚类效果、更低的时间复杂度和更强的扩展性. 相似文献

6.

数据流上的复合滑动窗口聚集算法

钟颖莉李金宝王伟平艾春宇《计算机工程与应用》2006,42(14):187-191

基于滑动窗口的聚集查询是数据流研究领域的一个热点问题。在已有的研究工作中,聚集算法都是针对立即执行的连续查询提出的,这些算法均是当数据流新到一个元组立即计算一次聚集结果。而在实际应用中,连续查询有时采取的是周期执行方式。论文针对周期执行的连续查询提出了复合滑动窗口聚集算法,即数据流新到一个元组,将它插入到基本窗口中,当基本窗口被插满时计算一次聚集结果。给出了非增量式和增量式两种算法。理论分析和实验结果表明增量式算法具有较好的性能。相似文献

7.

数据流滑动窗口聚集查询降载策略研究*

郭景峰贺春亮《计算机应用研究》2009,26(7):2474-2477

滑动窗口聚集查询在数据流管理系统中应用广泛,数据流到达高峰期,必须考虑滑动窗口聚集查询中出现的降载问题。分析了子集模型的特点和已有降载策略的不足,给出了数据流滑动窗口聚集查询降载问题的约束条件,提出了能保证子集结果产生的基于丢弃窗口更新策略的降载算法。理论分析和实验结果表明,该算法对数据流滑动窗口聚集查询降载问题的处理具有较高的有效性和实用性。相似文献

8.

一种基于滑动窗口的流数据聚类算法

蔡妮明王翰虎陈梅《计算机技术与发展》2011,21(1)

在实际应用中,人们往往比较关心最近一段时间内数据流的分布状况.在传统的基于界标模型的聚类算法CluStream中,没有淘汰过期元组,不能准确反映当前数据流的数据分布状况.滑动窗口是数据流中一种关注近期数据的近似方法.为了提高对流数据聚类分析的质量及效率,对算法clustream进行了改进,采用滑动窗口来支持数据处理.为了减少聚类操作中每次迭代的计算次数,算法采用改进的k-means来执行聚类操作.优化后的算法能及时淘汰过期元组,同时对新到达的元组不断进行实时处理,可以获得更准确的分析结果.与聚类算法CluStream相比,优化算法可获得较小的内存开销和快速的数据处理能力,聚类结果更合理清晰. 相似文献

9.

数据流上Ad Hoc查询的自适应处理算法

黄浩杨卫东《计算机工程》2013,(9)

对数据流上的Ad Hoc查询进行自适应处理,需要保证已有查询计划快速在线更新和迁移,但现有方法实现新旧查询计划的更新需要大量的滑动窗口状态转换。为此,提出一种Ad Hoc查询自适应处理算法。该算法基于数据流概要分布特性和自定义评分模型,快速计算出现有查询计划的最佳增量更新,以实现新到达的 Ad Hoc 查询处理,降低新旧查询计划切换时间。在数据流benchmark Linear Road提供的高速公路数据集上进行实验,结果表明,与MS、PT方法相比,该算法可较快完成新旧查询计划的切换。相似文献

10.

基于免疫原理的不确定数据流聚类算法

肖丹萍叶东毅《模式识别与人工智能》2012,25(5):826-834

提出一种基于免疫原理、对不确定数据流进行聚类的算法——IUMicro。IUMicro针对不确定数据流上元组级不确定性问题,引入动态更新以适应数据变化的免疫模型,其中包括一种有效的在线收集数据流统计信息的B细胞特征结构及其更新策略。为兼顾元组存在概率与元组间的距离两方面因素,定义概率识别半径,为每个不断到达的数据元组找到合理的候选簇。离线聚类根据免疫细胞识别区域的空间关系,进行任意形状的无监督聚类。实验结果表明,IUMicro能有效抑制噪声,具有良好的聚类质量和较快的处理速度。相似文献

11.

基于合并-分裂策略的近似等深直方图增量维护

张龙波李战怀王勇《计算机科学》2009,36(8):182-184

直方图在数据库领域有着广泛的应用,是一种常用的概要数据结构生成方法.首先提出了一个基于数据流界标窗口模型的近似等深直方图构建维护算法框架,该算法框架通过桶的合并一分裂实现近似等深直方图的增量维护;然后对三种不同的桶合并一分裂策略进行了比较和讨论;最后对该算法框架和三种不同的桶合并一分裂策略进行了实验分析. 相似文献

12.

面向轨迹数据流的KNN近似查询 总被引：1，自引：0，他引：1

下载免费PDF全文

王考杰郑雪峰宋一丁曲阜平《计算机工程》2011,37(16):17-20

提出一种基于滑动窗口的K-最近邻(KNN)近似查询算法。将滑动窗口内数据通过聚类划分成若干大小不一的基本窗口,针对每个基本窗口给定一个采样率,对窗口内数据进行偏倚采样,形成数据流摘要,并基于该摘要,采用计算几何平面扫描算法执行分布式最近邻查询。仿真实验结果表明该算法有效,且具有较好的可扩展性。相似文献

13.

On evaluating stream learning algorithms

João Gama Raquel Sebastião Pedro Pereira Rodrigues 《Machine Learning》2013,90(3):317-346

Most streaming decision models evolve continuously over time, run in resource-aware environments, and detect and react to changes in the environment generating data. One important issue, not yet convincingly addressed, is the design of experimental work to evaluate and compare decision models that evolve over time. This paper proposes a general framework for assessing predictive stream learning algorithms. We defend the use of prequential error with forgetting mechanisms to provide reliable error estimators. We prove that, in stationary data and for consistent learning algorithms, the holdout estimator, the prequential error and the prequential error estimated over a sliding window or using fading factors, all converge to the Bayes error. The use of prequential error with forgetting mechanisms reveals to be advantageous in assessing performance and in comparing stream learning algorithms. It is also worthwhile to use the proposed methods for hypothesis testing and for change detection. In a set of experiments in drift scenarios, we evaluate the ability of a standard change detection algorithm to detect change using three prequential error estimators. These experiments point out that the use of forgetting mechanisms (sliding windows or fading factors) are required for fast and efficient change detection. In comparison to sliding windows, fading factors are faster and memoryless, both important requirements for streaming applications. Overall, this paper is a contribution to a discussion on best practice for performance assessment when learning is a continuous process, and the decision models are dynamic and evolve over time. 相似文献

14.

Fast and memory efficient mining of high-utility itemsets from data streams: with and without negative item profits 总被引：2，自引：1，他引：1

Hua-Fu Li Hsin-Yun Huang Suh-Yin Lee 《Knowledge and Information Systems》2011,28(3):495-522

Mining utility itemsets from data steams is one of the most interesting research issues in data mining and knowledge discovery. In this paper, two efficient sliding window-based algorithms, MHUI-BIT (Mining High-Utility Itemsets based on BITvector) and MHUI-TID (Mining High-Utility Itemsets based on TIDlist), are proposed for mining high-utility itemsets from data streams. Based on the sliding window-based framework of the proposed approaches, two effective representations of item information, Bitvector and TIDlist, and a lexicographical tree-based summary data structure, LexTree-2HTU, are developed to improve the efficiency of discovering high-utility itemsets with positive profits from data streams. Experimental results show that the proposed algorithms outperform than the existing approaches for discovering high-utility itemsets from data streams over sliding windows. Beside, we also propose the adapted approaches of algorithms MHUI-BIT and MHUI-TID in order to handle the case when we are interested in mining utility itemsets with negative item profits. Experiments show that the variants of algorithms MHUI-BIT and MHUI-TID are efficient approaches for mining high-utility itemsets with negative item profits over stream transaction-sensitive sliding windows. 相似文献

15.

基于局部聚类的轨迹数据流偏倚采样

王考杰郑雪峰宋一丁安丰亮《计算机科学》2011,38(5):135-137

移动对象轨迹数据管理是移动计算领域的研究热点。通过采样技术构造数据流摘要是普通采用的方法之一。传统的均匀采样往往容易丢失某些关键变化数据。利用轨迹数据流的局部连续性特征,提出一种基于滑动窗口的偏倚采样算法。算法将滑动窗口通过聚类划分成若干大小不一的基本窗口,并针对每个基本窗口给定一个采样率,对窗口内数据进行偏倚采样,从而形成数据流摘要。算法利用了轨迹数据的内在特征,因此具有较高的采样质量。最后,基于实际数据对算法进行了实验,结果证明了算法的有效性。相似文献

16.

Optimal workload-based weighted wavelet synopses

Yossi Matias Daniel Urieli 《Theoretical computer science》2007

In recent years wavelets were shown to be effective data synopses. We are concerned with the problem of finding efficiently wavelet synopses for massive data sets, in situations where information about query workload is available. We present linear time, I/O optimal algorithms for building optimal workload-based wavelet synopses for point queries. The synopses are based on a novel construction of weighted inner products and use weighted wavelets that are adapted to those products. The synopses are optimal in the sense that the subset of retained coefficients is the best possible for the bases in use with respect to either the mean-squared absolute or relative errors. For the latter, this is the first optimal wavelet synopsis even for the regular, non-workload-based case. Experimental results demonstrate the advantage obtained by the new optimal wavelet synopses. 相似文献

17.

Mining neighbor-based patterns in data streams 总被引：1，自引：0，他引：1

Di Yang Elke A. Rundensteiner Matthew O. Ward 《Information Systems》2013

Discovery of complex patterns such as clusters, outliers, and associations from huge volumes of streaming data has been recognized as critical for many application domains. However, little research effort has been made toward detecting patterns within sliding window semantics as required by real-time monitoring tasks, ranging from real time traffic monitoring to stock trend analysis. Applying static pattern detection algorithms from scratch to every window is impractical due to their high algorithmic complexity and the real-time responsiveness required by streaming applications. In this work, we develop methods for the incremental detection of neighbor-based patterns, in particular, density-based clusters and distance-based outliers over sliding stream windows. Incremental computation for pattern detection queries is challenging. This is because purging of to-be-expired data from previously formed patterns may cause birth, shrinkage, splitting or termination of these complex patterns. To overcome this, we exploit the “predictability” property of sliding windows to elegantly discount the effect of expired objects with little maintenance cost. Our solution achieves guaranteed minimal CPU consumption, while keeping the memory utilization linear in the number of objects in the window. To thoroughly analyze the performance of our proposed methods, we develop a cost model characterizing the performance of our proposed neighbor-based pattern mining strategies. We conduct an analysis study to not only identify the key performance factors for each strategy but also show under which conditions each of them are most efficient. Our comprehensive experimental study, using both synthetic and real data from domains of moving object monitoring and stock trades, demonstrates superiority of our proposed strategies over alternate methods in both CPU processing resources and in memory utilization. 相似文献

18.

滑动窗口数据流上多极值查询资源共享策略研究

田李王乐李爱平邹鹏贾焰《计算机研究与发展》2008,45(3):548-556

为了提高在同一数据流上同时计算多个连续极值查询(MAX或MIN)时的处理能力,对查询间资源共享技术进行了研究.提出了一种称为"关键点集"的裁剪策略,系统仅需保存少量数据即可满足所有查询的需要.发掘多个查询间的相似性和可共享的计算存储资源,提出了一个多极值查询处理算法MCEQP.采用链表结构实现的该算法,当一个新数据到达时最多需要O(M K)时间即可更新全部K个查询的结果,其中M为关键点集包含数据的个数.MCEQP采用触发器驱动的方式,只在某些特定时刻才需要计算因数据失效引起的查询结果变化,更新K个查询结果所需时间为O(K).理论分析和实验证明,对于滑动窗口数据流上的多个极值查询,MCEQP算法在降低存储开销和提高性能方面均优于现有的通用方法. 相似文献

19.

Sampling streaming data with replacement

Byung-Hoon Park George Ostrouchov Nagiza F. Samatova 《Computational statistics & data analysis》2007,52(2):750-762

Simple random sampling is a widely accepted basis for estimation from a population. When data come as a stream, the total population size continuously grows and only one pass through the data is possible. Reservoir sampling is a method of maintaining a fixed size random sample from streaming data. Reservoir sampling without replacement has been extensively studied and several algorithms with sub-linear time complexity exist. Although reservoir sampling with replacement is previously mentioned by some authors, it has been studied very little and only linear algorithms exist. A with-replacement reservoir sampling algorithm of sub-linear time complexity is introduced. A thorough complexity analysis of several approaches to the with-replacement reservoir sampling problem is also provided. 相似文献