Similar articles
 20 similar articles found (search time: 15 ms)
1.
Reservoir sampling is a well-known technique for maintaining a random sample of fixed size over a data stream of unknown size. While reservoir sampling suits applications that need a sample over the whole data stream, it is not designed for applications in which the input stream is composed of sub-streams with heterogeneous statistical properties. For this class of applications, conventional reservoir sampling can degrade the statistical quality of the sample, because it does not guarantee that the sample includes a statistically sufficient number of tuples from each sub-stream. In this paper, we address this heterogeneity problem by stratifying the reservoir sample among the underlying sub-streams. This stratification poses two challenges. First, a fixed-size reservoir should be allocated to the individual sub-streams optimally, so that the stratified reservoir sample can be used to generate estimates at the level of either the whole data stream or the individual sub-streams. Second, the allocation should be adjusted adaptively when new sub-streams appear in the input stream, when existing sub-streams disappear from it, and as their statistical properties change. We propose a novel adaptive stratified reservoir sampling algorithm designed to meet these challenges. An extensive performance study shows the superiority of the achieved sample quality and demonstrates the adaptivity of the proposed sampling algorithm.
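
For orientation, the conventional technique that the abstract starts from can be sketched as follows. This is the classic single-reservoir scheme (often called Algorithm R), not the adaptive stratified algorithm the paper proposes; the function name `reservoir_sample` and its parameters are illustrative assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

A stratified variant would keep one such reservoir per sub-stream. How to size those reservoirs and resize them over time is the paper's contribution and is not attempted here.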

2.
The data generation rate of a data stream is usually unpredictable, and some data elements cannot be processed in real time if the generation rate exceeds the capacity of the stream processing algorithm. To handle this situation gracefully, a load shedding technique is recommended. This paper proposes a frequency-based load shedding technique over a data stream of tuples. In many data stream processing applications, such as mining frequent patterns, data elements with high frequency can be considered more significant than those with low frequency. Based on this observation, the proposed technique processes only the frequent elements of a data stream in real time, while the others are trimmed. The decision whether to shed load is controlled automatically by the data generation rate of the stream. Consequently, the proposed technique avoids unnecessary load shedding.
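
The idea of keeping frequent elements and trimming the rest under overload might be sketched as below. The abstract does not give the actual decision rule, so the class name `FrequencyLoadShedder`, the relative-frequency threshold `min_support`, and the boolean `overloaded` flag (standing in for the rate-based trigger) are all assumptions made for illustration.

```python
from collections import Counter


class FrequencyLoadShedder:
    """Admit every element when idle; admit only frequent elements when overloaded."""

    def __init__(self, min_support):
        self.min_support = min_support  # hypothetical relative-frequency threshold
        self.counts = Counter()
        self.seen = 0

    def admit(self, element, overloaded):
        # Frequencies are always updated, even for elements that end up shed.
        self.counts[element] += 1
        self.seen += 1
        if not overloaded:
            return True  # no shedding while the processor keeps up
        return self.counts[element] / self.seen >= self.min_support
```

In a real system the `overloaded` flag would be derived from the measured arrival rate and the processing capacity, which is the part the abstract says is controlled automatically.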

3.
This paper introduces a new technique for computer visualization of three-dimensional flow fields. The most powerful feature of this technique is that the streamlines and stream surfaces are generated by mass-conservative interpolation schemes. Interpolation is an important topic in flow visualization because CFD velocity fields are defined only at discrete locations in space. Interpolation errors are more significant than those arising from numerical integration. The main drawback of conventional trilinear interpolation of velocity is that it is not mass conservative. Failure to conserve mass can produce errors that cannot be eliminated by reducing the integration step. A significant feature of the relationship between the velocity field and the stream functions is that it implies conservation of mass. A mass-conservative interpolation scheme is therefore developed using a stream function, which is obtained by solving the partial differential equation in the local cell and approximated by a cluster of stream surfaces. The streamline can then be traced using numerical techniques with mass-conservative interpolation, and the stream surface is calculated directly by slicing the stream function. The result is more accurate because the polygonal tiling of streamlines is replaced by mass-conservative stream surface generation. Results presented here compare the performance of the new method with the trilinear interpolation scheme and demonstrate its effectiveness.

4.
Semi-stream join algorithms join a fast data stream with a disk-based relation. This is important, for example, in real-time data warehousing, where a stream of transactions is joined with master data before being loaded into a data warehouse. In many important scenarios, the stream input has a skewed distribution, which makes certain performance optimizations possible. We propose two such optimization techniques: (1) a caching technique for frequently used master data and (2) a technique for selective load shedding of stream tuples. The caching technique is fine-grained, operating at the tuple level. Furthermore, it is generic in the sense that it can be applied to different semi-stream join algorithms to deal with data skew. We analyze it by combining it with various well-known semi-stream joins, and show that it improves the service rate by more than 40% for typical data with skewed distributions. The load shedding technique sheds the fraction of the stream that is most expensive to join. In contrast to existing approaches, the service rate improves under load shedding. We present experimental data showing significant improvements compared with related approaches and perform a sensitivity analysis for various internal parameters.

5.
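A stream is a vivid concept: when a program needs to read data, it opens a stream to a data source, which may be a file, memory, or a network connection. Similarly, when a program needs to write data, it opens a stream to a destination, and one can picture the data "flowing" through it.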
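
A minimal sketch of this idea in Python, assuming in-memory `io.BytesIO` objects as stand-ins for a file or network connection. The helper name `copy_stream` and its `chunk_size` parameter are illustrative and do not come from the article.

```python
import io


def copy_stream(source, sink, chunk_size=4):
    """Read from a source stream in small chunks and write each chunk to a sink stream."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break  # an empty read means the source is exhausted
        sink.write(chunk)
        total += len(chunk)
    return total


source = io.BytesIO(b"hello, stream")  # could equally be open(path, "rb")
sink = io.BytesIO()                    # could equally be open(path, "wb")
copied = copy_stream(source, sink)
```

The same loop works unchanged for any object that offers `read` and `write`, which is the point of the stream abstraction.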

6.
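A stream is a vivid concept: when a program needs to read data, it opens a stream to a data source, which may be a file, memory, or a network connection. Similarly, when a program needs to write data, it opens a stream to a destination. You can then picture the data "flowing" through it.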

7.
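Research on prefetching in a high-bandwidth remote memory architecture   Total citations: 1 (self-citations: 0, citations by others: 1)
Advances in high-speed circuits and optical interconnects have greatly increased network speed and bandwidth. It has therefore become possible to break away from the traditional structure of high-performance computers, in which CPU and memory are tightly coupled: the coupling between CPU and memory is no longer limited by distance, which is bound to change computer architecture. Reference [1] proposes the DSAG architecture, in which CPU and memory are spatially separated. Each CPU node keeps only a small amount of memory, the bulk of the memory is placed remotely and managed uniformly as a memory server, and CPU nodes and memory servers are connected by a high-speed network. This new architecture offers better sharing and scalability, but it also makes the imbalance between CPU and memory harder to address. To reduce the additional memory access latency introduced by a remote memory architecture such as DSAG, we observe that normal CPU memory accesses do not fully use the high bandwidth of the network, so the spare bandwidth can be used to prefetch remote memory data. This paper records the addresses of local misses (local as opposed to remote memory) while an application executes, analyzes with page alignment the statistical characteristics of the page frame streams (Page Frame Stream) they contain, and argues that a prefetching mechanism based on page frame streams can reduce memory access latency and improve system performance. Finally, we verify the feasibility and correctness of this claim by simulation, propose three prefetching strategies, and compare and analyze the factors that affect prefetching effectiveness.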

8.
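Streaming media data encoded with VBR has a highly variable bit rate and often produces bursty transmission, while traditional flow control techniques require a large amount of client feedback, so bandwidth adjustment lags considerably and is unsuitable for UDP-based VBR streaming media services. To address this problem, an active filtering and shaping model based on the characteristics of the transmission rate of the streaming data itself is proposed. Because the model requires no client feedback, it can use the server's data buffer to filter and shape the data stream. This not only reduces the bursty transmission caused by VBR encoding but also provides stable data stream transmission directly on the server side.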

9.
Sliding window-based frequent pattern mining over data streams   Total citations: 2 (self-citations: 0, citations by others: 2)
Finding frequent patterns in a continuous stream of transactions is critical for many applications such as retail market data analysis, network monitoring, web usage mining, and stock market prediction. Even though numerous frequent pattern mining algorithms have been developed over the past decade, new solutions for handling stream data are still required because a data stream is a continuous, unbounded, and ordered sequence of data elements generated at a rapid rate. In such a setting, extracting frequent patterns from the more recent data can enhance the analysis of stream data. In this paper, we propose an efficient technique to discover the complete set of recent frequent patterns from a high-speed data stream over a sliding window. We develop a Compact Pattern Stream tree (CPS-tree) to capture the recent stream data content and efficiently remove the obsolete, old stream data content. We also introduce the concept of dynamic tree restructuring in our CPS-tree to produce a highly compact frequency-descending tree structure at runtime. The complete set of recent frequent patterns is obtained from the CPS-tree of the current window using an FP-growth mining technique. Extensive experimental analyses show that our CPS-tree is highly efficient in terms of memory and time complexity when finding recent frequent patterns from a high-speed data stream.
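
The sliding-window bookkeeping behind this kind of mining can be illustrated with a deliberately simplified sketch. It counts single items only and uses a plain counter, so it is neither the CPS-tree nor FP-growth; the class name `SlidingWindowCounter` and its methods are assumptions made for illustration.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Track item supports over the most recent window_size transactions."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def add(self, transaction):
        items = set(transaction)  # each item counts once per transaction
        self.window.append(items)
        self.counts.update(items)
        if len(self.window) > self.window_size:
            # Remove the contribution of the transaction that slid out of the window.
            expired = self.window.popleft()
            for item in expired:
                self.counts[item] -= 1
                if self.counts[item] == 0:
                    del self.counts[item]

    def frequent_items(self, min_support):
        return {item for item, count in self.counts.items() if count >= min_support}
```

Extending this from single items to full itemsets is where a compact tree structure such as the one described in the abstract becomes necessary.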

10.
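Wang Hui, Wang Siyuan, Yin Hui, Bi Haiyun, Zhao Yan. Remote Sensing Information (《遥感信息》), 2012, 27(4): 13-15, 33
Ground features are not distinct in DEM data, so control points are hard to select and registration with imagery is difficult. To address this, a method is proposed that generates river information from the DEM, uses river junctions and river bends as control points, and then registers the DEM with remote sensing imagery. The method first determines flow directions from the DEM, then computes the flow accumulation of each grid cell from the flow directions, extracts river information from the flow accumulation, and finally extracts key feature points from the generated river information, thereby registering the DEM with the remote sensing image. A registration experiment on ETM imagery and DEM data of the Miyun area of Beijing demonstrates the feasibility of the method.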
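
The first two steps, flow direction and flow accumulation, might look like the sketch below. The abstract does not say which flow-direction rule is used; the common D8 rule (each cell drains to its steepest downhill neighbour) is assumed here, and the function name `flow_accumulation` is illustrative.

```python
def flow_accumulation(dem):
    """Return, for each DEM cell, the number of cells (itself included) draining through it."""
    rows, cols = len(dem), len(dem[0])

    def downstream(r, c):
        # D8 rule (assumed): pick the neighbour with the steepest positive slope.
        best, target = 0.0, None
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dist = (dr * dr + dc * dc) ** 0.5
                    slope = (dem[r][c] - dem[nr][nc]) / dist
                    if slope > best:
                        best, target = slope, (nr, nc)
        return target  # None for pits and flat cells

    acc = [[1] * cols for _ in range(rows)]
    # Visit cells from highest to lowest so upstream totals are final before being passed on.
    cells = sorted(((dem[r][c], r, c) for r in range(rows) for c in range(cols)), reverse=True)
    for _, r, c in cells:
        target = downstream(r, c)
        if target is not None:
            acc[target[0]][target[1]] += acc[r][c]
    return acc
```

Cells whose accumulation exceeds a chosen threshold would then be marked as river cells. The threshold and the extraction of junctions and bends are not specified in the abstract and are left out.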

11.
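Xing Changzheng, Hu Quanbo. Computer Engineering (《计算机工程》), 2013, (12): 247-250, 259
TDCA, a clustering algorithm for data streams with skewed distributions, falls short in clustering speed and memory utilization, and a data stream environment with a varying flow rate severely affects the quality of its clustering results. To address these problems, a data stream clustering algorithm, GR-Stream, is proposed. It uses grid cells as the aggregated form of data points and an extended data structure based on the R-tree as the index that organizes the grid cells. On this basis, it introduces a pruning strategy and adjusts the way data points enter the tree. Tests on the real data set KDD CUP 99 show that, compared with TDCA, the algorithm improves access speed by 40% during clustering, saves at least half of the memory usage by applying the pruning strategy, and keeps the average purity of the clustering results above 90% in data stream environments with a varying flow rate.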

12.
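Existing research on differentially private statistical publication of data streams considers only one-dimensional data streams, and its methods cannot be applied directly to the privacy leakage that may arise in statistical publication of two-dimensional data streams. To address this problem, a differentially private statistical publication algorithm for fixed-length two-dimensional data streams, PTDSS, is first proposed. With a single linear scan of the data stream, the algorithm computes, at low space cost, the statistical frequency of the two-dimensional data stream tuples that satisfy given conditions, and after a sensitivity analysis adds an appropriate amount of noise so that the result satisfies differential privacy. Then, building on PTDSS and using a sliding window mechanism, a differentially private continuous statistical publication algorithm for two-dimensional data streams of arbitrary length, PTDSS-SW, is designed. Theoretical analysis and experimental results show that the proposed algorithms securely protect privacy in the statistical publication of two-dimensional data streams, with the relative error of the published statistics between 10% and 95%.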
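
The noise-adding step can be illustrated with the standard Laplace mechanism. This is a generic sketch, not PTDSS itself: the abstract does not state the sensitivity it derives, so `sensitivity=1.0` (one tuple changes one count by at most 1) is an assumption, as are the names `laplace_noise` and `private_counts`.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_counts(counts, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to each frequency count."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon  # assumed sensitivity; see the note above
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}
```

A smaller `epsilon` gives stronger privacy and a larger noise scale, which is the trade-off behind the relative error figures an evaluation of this kind reports.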

13.
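Existing time series segmentation methods focus mostly on static data. Based on how time series stream data change, this paper analyzes the states of the data stream and proposes a segmentation method based on a finite automaton. The method analyzes the state of the data in the time series stream to find change points and takes the change points as the two ends of a segment, thereby segmenting the time series. Experiments show that the method can segment high-speed time series streams effectively while ensuring segmentation quality.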

14.
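Building on a study of existing prediction methods for time series data streams, a general sliding-window-based prediction model for time series data streams is given, and a new method, Online-HHT, is proposed that effectively reduces noise, performs multi-scale sliding window analysis, and then makes predictions. It combines the sliding window technique for data streams with the HHT method to achieve online analysis. Experiments with this model confirm that Online-HHT can effectively perform online adaptive trend prediction on time series data streams.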

15.
In this paper, we study the problem of anomaly detection in wireless network streams. We have developed a new technique, called Stream Projected Outlier deTector (SPOT), to deal with the problem of anomaly detection from multi-dimensional or high-dimensional data streams. We conduct a detailed case study of SPOT by deploying it for anomaly detection on a real-life wireless network data stream. Since this wireless network data stream is unlabeled, a validation method is proposed to generate the ground-truth results used for performance evaluation in this case study. Extensive experiments are conducted, and the results demonstrate that SPOT is effective in detecting anomalies from wireless network data streams and outperforms existing anomaly detection methods.

16.
Fast systolic computation and double pipelines were designed to achieve implementations that use fewer processors and execute the algorithm in less time than conventional systolic algorithms. H. T. Kung and C. S. Leiserson in [1-3] proposed systolic algorithms realized on a bidirectional linear array where two data streams flow in opposite directions. The data flow introduced for this solution requires data elements to appear in the data stream at every second time step, which is the only way for them to meet all the elements from the other data stream.

In [4, 5] the authors proposed a linear array where one data stream is double mapped while the elements from the other data stream flow in consecutive time moments. The procedure to obtain such a solution is called a fast systolic design. It was shown in [5] that double pipeline solutions are obtained by separating and grouping techniques in addition to this design.

Several more efficient systolic designs have been proposed for the matrix-vector multiplication algorithm in [4, 5]. Here we apply these techniques to other linear array algorithms, such as a triangular linear system solver, string comparison, convolution, correlation, and MA and AR filters.

17.
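Bian Xiaoyong, Zhang Xiaolong, Yu Hai. Journal of Computer Applications (《计算机应用》), 2012, 32(10): 2935-2939
To address the poor flow of production information and the inability to trace product quality in the production process of a steel enterprise, research was carried out on real-time data stream analysis and whole-process quality monitoring based on plant information (PI). The work focuses on real-time data stream segmentation and process monitoring, proposes a statistical monitoring method based on statistical quality control (SQC) charts and process performance indices, and develops a product technical quality monitoring system. Application results show that PI-based real-time data stream analysis and product quality monitoring enable the enterprise to monitor the quality of its production processes and to identify and improve key production processes.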
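
As a rough illustration of SQC-chart monitoring, the sketch below computes Shewhart-style control limits from an in-control baseline and flags later measurements that fall outside them. The 3-sigma limits and the use of the sample standard deviation are common conventions assumed here; the abstract does not describe the specific chart or indices used, and the function names are illustrative.

```python
import statistics


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) control limits from in-control baseline data."""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation (assumed estimator)
    return center - k * sigma, center, center + k * sigma


def out_of_control(values, limits):
    """Return the indices of measurements that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, value in enumerate(values) if value < lower or value > upper]


limits = control_limits([10.0, 10.2, 9.8, 10.1, 9.9])
flagged = out_of_control([10.0, 10.6, 9.4, 10.3], limits)
```

On a live data stream the baseline would be re-estimated per segment, which is why the abstract pairs monitoring with data stream segmentation.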

18.
The direct sampling (DS) multiple-point statistical technique is proposed as a non-parametric missing data simulator for hydrological flow rate time series. The algorithm uses the patterns contained in a training data set to reproduce the complexity of the missing data. The proposed setup is tested in the reconstruction of a flow rate time series under several missing data scenarios, and in a comparative test against an ARMAX time series model. The results show that DS generates more realistic simulations than ARMAX and better recovers the statistical content of the missing data. The predictive power of both techniques increases considerably when a correlated flow rate time series is used, but DS can also use incomplete auxiliary time series with comparable predictive power. This makes the technique a handy simulation tool for practitioners dealing with incomplete data sets.

19.
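To improve website utilization and optimize the website, a Web data stream mining system was built. The framework of the system is introduced and, taking the campus network of Shangqiu Normal University as the mining target, the workflow of Web data stream mining and the concrete implementation of Web resource services are described. Practice shows that implementing Web resource services with Web data stream mining makes full use of the website's information and network resources and provides users with personalized Web resource services efficiently and in real time.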

20.
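To improve the intelligent data processing capability of information service centers, an adaptive integration method for intelligent data streams based on an information bus service is proposed. An adaptive information processing model for intelligent data streams is constructed. Statistical time series analysis is used for big data processing and signal integration in the information bus service, synchronization and matched filtering are used for filtering detection of the information bus service data streams, and the fast Fourier transform is used for signal filtering in the adaptive integration of intelligent data streams, which improves the purity of the intelligent data streams. The overlap-save method is used to implement frequency mixing of the information bus service data streams, and a correlation matched detector is constructed to perform operations such as branching, zero insertion, and negation on the data streams. Adaptive integration of the intelligent data streams is then carried out according to the results of these operations. Simulation results show that the method achieves good mixing performance when integrating information bus service data streams, improves the detection and information fusion capability of the data streams, and improves the output signal-to-noise ratio.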
