Similar Articles
Found 20 similar articles (search time: 31 ms)
1.
Parallel data stream clustering based on wavelet synopses
Many applications continuously generate large volumes of time-evolving sequential data that form time-series data streams, for example sensor networks, real-time stock quotes, and network and communication monitoring. Clustering is a powerful tool for analyzing such parallel multiple data streams, but their unbounded length, evolution over time, and sheer volume prevent traditional clustering methods from being applied directly. Exploiting the amnesic (fading) property of data streams, we apply the discrete wavelet transform to maintain a hierarchical synopsis of each stream dynamically. Based on this synopsis, approximate distances between streams and cluster centers can be computed quickly, yielding a K-means clustering method suited to parallel multiple data streams. Experiments confirm the effectiveness of the method.
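The abstract gives no code; the sketch below illustrates the core idea under stated assumptions: a Haar-wavelet synopsis keeps only the largest-magnitude coefficients, and because the transform is orthonormal, Euclidean distance between synopses approximates distance between the original streams, which is all K-means needs. Function names and the truncation rule are illustrative, not the paper's exact hierarchical structure.

```python
import numpy as np

def haar_synopsis(x, keep=8):
    """Haar transform of a stream segment (length a power of two in this
    sketch); keep only the `keep` largest-magnitude coefficients."""
    coeffs = np.asarray(x, dtype=float)
    out = []
    while len(coeffs) > 1:
        diff = (coeffs[0::2] - coeffs[1::2]) / np.sqrt(2)
        coeffs = (coeffs[0::2] + coeffs[1::2]) / np.sqrt(2)
        out.extend(diff)
    out.extend(coeffs)
    out = np.array(out)
    out[np.argsort(np.abs(out))[:-keep]] = 0.0   # zero all but the largest
    return out

# Orthonormality: distance in coefficient space approximates distance
# between the raw streams, so K-means can run on the synopses directly.
a, b = np.random.randn(64), np.random.randn(64)
d_approx = np.linalg.norm(haar_synopsis(a) - haar_synopsis(b))
```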

2.
A survey on algorithms for mining frequent itemsets over data streams
The increasing prominence of data streams arising in a wide range of advanced applications such as fraud detection and trend learning has led to the study of online mining of frequent itemsets (FIs). Unlike mining static databases, mining data streams poses many new challenges. In addition to the one-scan nature, the unbounded memory requirement and the high data arrival rate of data streams, the combinatorial explosion of itemsets exacerbates the mining task. The high complexity of the FI mining problem hinders the application of the stream mining techniques. We recognize that a critical review of existing techniques is needed in order to design and develop efficient mining algorithms and data structures that are able to match the processing rate of the mining with the high arrival rate of data streams. Within a unifying set of notations and terminologies, we describe in this paper the efforts and main techniques for mining data streams and present a comprehensive survey of a number of the state-of-the-art algorithms on mining frequent itemsets over data streams. We classify the stream-mining techniques into two categories based on the window model that they adopt in order to provide insights into how and why the techniques are useful. Then, we further analyze the algorithms according to whether they are exact or approximate and, for approximate approaches, whether they are false-positive or false-negative. We also discuss various interesting issues, including the merits and limitations in existing research and substantive areas for future research.

3.
Peer-to-peer (P2P) databases are becoming prevalent on the Internet for distribution and sharing of documents, applications, and other digital media. The problem of answering large-scale ad hoc analysis queries, for example, aggregation queries, on these databases poses unique challenges. Exact solutions can be time consuming and difficult to implement, given the distributed and dynamic nature of P2P databases. In this paper, we present novel sampling-based techniques for approximate answering of ad hoc aggregation queries in such databases. Computing a high-quality random sample of the database efficiently in the P2P environment is complicated by several factors: the data is distributed (usually in uneven quantities) across many peers; within each peer, the data is often highly correlated; and, moreover, even collecting a random sample of the peers is difficult to accomplish. To counter these problems, we have developed an adaptive two-phase sampling approach based on random walks of the P2P graph, as well as block-level sampling techniques. We present extensive experimental evaluations to demonstrate the feasibility of our proposed solution.
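A minimal sketch of the random-walk sampling idea, assuming a toy topology; it ignores the degree bias of the walk's stationary distribution, which the paper's two-phase approach corrects for.

```python
import random

def random_walk_sample(graph, start, walk_len=50, n_samples=100):
    """Sample peers via random walks on the P2P graph.
    `graph` maps peer -> list of neighbor peers; illustrative only."""
    samples = []
    for _ in range(n_samples):
        node = start
        for _ in range(walk_len):               # long walks approach the
            node = random.choice(graph[node])   # walk's stationary distribution
        samples.append(node)
    return samples

# Toy topology; each peer holds some tuples (plain numbers here).
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
data = {0: [10, 12], 1: [8], 2: [15, 14, 13], 3: [9]}
peers = random_walk_sample(graph, start=0)
est_avg = sum(sum(data[p]) / len(data[p]) for p in peers) / len(peers)
```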

4.
We present a pipelining, dynamically tunable reorder operator for providing user control during long-running, data-intensive operations. Users can see partial results and accordingly direct the processing by specifying preferences for various data items; data of interest is prioritized for early processing. The reordering mechanism is efficient and non-blocking and can be used over arbitrary data streams from files and indexes, as well as continuous data feeds. We also investigate several policies for the reordering based on the performance goals of various typical applications. We present performance results for reordering in the context of an online aggregation implementation in Informix and in the context of sorting and scrolling in a large-scale spreadsheet. Our experiments demonstrate that for a variety of data distributions and applications, reordering is responsive to dynamic preference changes, imposes minimal overheads in overall completion time, and provides dramatic improvements in the quality of the feedback over time. Surprisingly, preliminary experiments indicate that online reordering can also be useful in traditional batch query processing, because it can serve as a form of pipelined, approximate sorting.
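A hedged sketch of the reorder idea, not the Informix implementation: a priority buffer delivers the items the user currently prefers first, and the preference function can be retuned between put and get.

```python
import heapq, itertools

class Reorder:
    """Reorder buffer: items with higher user preference come out first.
    Sketch of the general mechanism only."""
    def __init__(self, preference):
        self.pref = preference           # item -> priority (user-tunable)
        self.heap = []
        self.tie = itertools.count()     # stable order among equal priorities
    def put(self, item):
        heapq.heappush(self.heap, (-self.pref(item), next(self.tie), item))
    def get(self):
        return heapq.heappop(self.heap)[-1]

# Prefer rows from region 'EU' while the user is inspecting that group.
prefs = {'EU': 2, 'US': 1}
buf = Reorder(lambda row: prefs.get(row['region'], 0))
for row in [{'region': 'US'}, {'region': 'EU'}, {'region': 'US'}]:
    buf.put(row)
print(buf.get())   # {'region': 'EU'} is delivered first
```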

5.
Scattercast: an adaptable broadcast distribution framework
Internet broadcasting - the simultaneous distribution of live content streams to a large audience - has a number of interesting applications ranging from real-time broadcasts of audio/video streams for online concerts or sporting events to efficient and reliable large-scale software distribution. We identify three fundamental requirements for scalable broadcasting services: an efficient infrastructure for large-scale broadcasting, an ability to adapt the infrastructure to suit the requirements of a wide range of applications, and ease of deployment of the infrastructure. Although solutions such as the network-layer IP multicast approach and a slew of overlay distribution networks exist today, none of these technologies satisfactorily addresses all of the above concerns. In this paper, we argue that an application-customizable hybrid overlay is well suited to meet these challenges. To this end, we propose an architecture called scattercast that relies on a network of strategically located agents called ScatterCast proXies or SCXs. These agents collaboratively provide the broadcast service for a session. Clients locate a nearby SCX and tap into the session via that SCX. Scattercast constructs a hybrid overlay network composed of unicast links between SCXs that interconnect locally scoped multicast regions. Rather than define a single standardized service model for transmitting data on top of the overlay, scattercast builds a customizable transport framework that provides adaptability by leveraging application-defined semantics to drive the distribution of content. We demonstrate the ability of our architecture to provide efficient distribution via a set of simulation experiments. Finally, we present our experience with the adaptability of the framework by describing two applications, a real-time Internet radio and an online slide-presentation tool, both of which we have built on top of a prototype implementation of the architecture.

6.
Process mining aims at gaining insights into business processes by analyzing the event data that is generated and recorded during process execution. The vast majority of existing process mining techniques works offline, i.e. using static, historical data, stored in event logs. Recently, the notion of online process mining has emerged, in which techniques are applied on live event streams, i.e. as the process executions unfold. Analyzing event streams allows us to gain instant insights into business processes. However, most online process mining techniques assume the input stream to be completely free of noise and other anomalous behavior. Hence, applying these techniques to real data leads to results of inferior quality. In this paper, we propose an event processor that enables us to filter out infrequent behavior from live event streams. Our experiments show that we are able to effectively filter out events from the input stream and, as such, improve online process mining results.
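The paper's event processor is not reproduced here; the following is a minimal frequency-based stand-in that forwards an event only when the directly-follows transition it completes has been observed often enough relative to all transitions seen so far.

```python
from collections import defaultdict

class StreamFilter:
    """Forward an event only if the (previous_activity -> activity) transition
    for its case is frequent enough. A sketch of frequency-based stream
    filtering, not the paper's exact processor."""
    def __init__(self, min_rel_freq=0.05):
        self.counts = defaultdict(int)
        self.total = 0
        self.last = {}                  # case id -> last activity seen
        self.min_rel_freq = min_rel_freq
    def process(self, case, activity):
        prev = self.last.get(case)
        self.last[case] = activity
        if prev is None:
            return True                 # first event of a case passes
        self.counts[(prev, activity)] += 1
        self.total += 1
        return self.counts[(prev, activity)] / self.total >= self.min_rel_freq

f = StreamFilter()
print(f.process('case1', 'register'), f.process('case1', 'approve'))
```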

7.
Data cubes have become important components in most data warehouse systems and decision support systems. In such systems, users usually pose very complex queries to the online analytical processing (OLAP) system, and systems usually have to deal with a huge amount of data because of the high dimensionality of the data sets; thus, approximate query processing has emerged as a viable solution. Specifically, applications of cube streams handle multidimensional data sets in a continuous manner, in contrast to traditional cube approximation. Such an application collects data events for cube streams online, generates snapshots with limited resources, and keeps the approximated information in a synopsis memory for further analysis. Compared to OLAP applications, applications of cube streams are subject to many more resource constraints on both processing time and memory, and cannot be handled by existing methods due to the limited resources. In this paper, we propose the DAWA algorithm, a hybrid of the discrete cosine transform (DCT) and the discrete wavelet transform (DWT), to approximate cube streams. Our algorithm combines the advantages of the high compression rate of DWT and the low memory cost of DCT. Consequently, DAWA requires a much smaller working buffer and outperforms both DWT-based and DCT-based methods in execution efficiency. It is also shown that DAWA provides a good solution for approximate query processing of cube streams with a small working buffer and a short execution time. The optimality of the DAWA algorithm is theoretically proved and empirically demonstrated by our experiments.
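A hedged sketch of a DWT-then-DCT hybrid under simplifying assumptions (a single Haar level and naive coefficient truncation); the actual DAWA coefficient selection is more elaborate.

```python
import numpy as np
from scipy.fft import dct, idct

def hybrid_compress(x, keep=8):
    """One Haar level halves the data (DWT step), then a truncated DCT
    stores only `keep` low-frequency coefficients. Illustrative only."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / 2.0       # Haar approximation
    c = dct(approx, norm='ortho')            # DCT of the coarse signal
    c[keep:] = 0.0
    return c

def hybrid_reconstruct(c, n):
    approx = idct(c, norm='ortho')           # invert the DCT
    return np.repeat(approx, 2)[:n]          # crudely undo the averaging

x = np.sin(np.linspace(0, 6, 64)) + 0.1 * np.random.randn(64)
x_hat = hybrid_reconstruct(hybrid_compress(x), len(x))
```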

8.
This work aims to connect two rarely combined research directions, i.e., non-stationary data stream classification and data analysis with skewed class distributions. We propose a novel framework employing stratified bagging for training base classifiers, integrating data preprocessing and dynamic ensemble selection methods for imbalanced data stream classification. The proposed approach has been evaluated in computer experiments carried out on 135 artificially generated data streams with various imbalance ratios, label noise levels, and types of concept drift, as well as on two selected real streams. Four preprocessing techniques and two dynamic selection methods, used at both the bagging-classifier and base-estimator levels, were considered. Experimental results showed that, for highly imbalanced data streams, dynamic ensemble selection coupled with data preprocessing can outperform online and chunk-based state-of-the-art methods.
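A minimal sketch of the stratified-bagging ingredient alone: one bootstrap per class, so minority examples are guaranteed to appear in every bag. The preprocessing and dynamic-selection steps of the framework are omitted.

```python
import numpy as np

def stratified_bootstrap(X, y, rng=None):
    """Draw a bootstrap sample within each class separately."""
    rng = rng or np.random.default_rng()
    idx = []
    for cls in np.unique(y):
        cls_idx = np.where(y == cls)[0]
        idx.extend(rng.choice(cls_idx, size=len(cls_idx), replace=True))
    idx = np.array(idx)
    return X[idx], y[idx]

# One base classifier per stream chunk would be trained on its own bag.
X = np.random.randn(100, 5)
y = np.array([0] * 95 + [1] * 5)      # 5% minority class
Xb, yb = stratified_bootstrap(X, y)   # bag still contains 5 minority draws
```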

9.
According to the database outsourcing model, a data owner delegates database functionality to a third-party service provider, which answers queries received from clients. Authenticated query processing enables the clients to verify the correctness of query results. Despite the abundance of methods for authenticated processing in conventional databases, there is limited work on outsourced data streams. Stream environments pose new challenges such as the need for fast structure updating, support for continuous query processing and authentication, and provision for temporal completeness. Specifically, in addition to the correctness of individual results, the client must be able to verify that there are no missing results in between data updates. This paper presents a comprehensive set of methods covering relational streams. We first describe REF, a technique that achieves correctness and temporal completeness but incurs false transmissions, i.e., the provider has to inform the clients whenever there is a data update, even if their results are not affected. Then, we propose CADS, which minimizes the processing and transmission overhead through an elaborate indexing scheme and a virtual caching mechanism. In addition, we present an analytical study to determine the optimal indexing granularity, and extend CADS for the case that the data distribution changes over time. Finally, we evaluate the effectiveness of our techniques through extensive experiments.
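Correctness verification in such schemes typically rests on a Merkle hash tree; the sketch below shows that building block only. Temporal completeness, as in REF and CADS, needs additional machinery not shown here.

```python
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def merkle_root(leaves):
    """Root of a Merkle hash tree over result tuples. A client holding the
    owner-signed root can verify that returned tuples are genuine."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])     # duplicate last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

tuples = [b"row1", b"row2", b"row3", b"row4"]
signed_root = merkle_root(tuples)          # published by the data owner
assert merkle_root(tuples) == signed_root  # client-side verification
```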

10.
Online mining of frequent sets in data streams with error guarantee
For most data stream applications, the volume of data is too large to be stored on permanent devices or to be thoroughly scanned more than once. It is hence recognized that approximate answers are usually sufficient, where a good approximation obtained in a timely manner is often better than an exact answer that is delayed beyond the window of opportunity. Unfortunately, this is not the case for mining frequent patterns over data streams, where existing algorithms capable of processing data streams online do not conform strictly to a precise error guarantee. Since the quality of approximate answers is as important as their timely delivery, it is necessary to design algorithms that meet both criteria at the same time. In this paper, we propose an algorithm that processes streaming data online while strictly guaranteeing that the support error of frequent patterns stays within a user-specified threshold. Our theoretical and experimental studies show that our algorithm is an effective and reliable method for finding frequent sets in data stream environments when both constraints need to be satisfied.
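The paper's algorithm is not reproduced here; as an illustration of a strict support-error guarantee, this sketch shows classic Lossy Counting (Manku and Motwani) for single items, whose estimated counts undershoot true counts by at most epsilon times the stream length. The paper targets frequent sets, which is harder.

```python
def lossy_count(stream, epsilon):
    """Return item -> (estimated count, max undercount delta)."""
    counts, bucket = {}, 1
    width = int(1 / epsilon)                  # bucket width
    for n, item in enumerate(stream, start=1):
        f, delta = counts.get(item, (0, bucket - 1))
        counts[item] = (f + 1, delta)
        if n % width == 0:                    # end of bucket: prune rare items
            counts = {k: (f, d) for k, (f, d) in counts.items()
                      if f + d > bucket}
            bucket += 1
    return counts

freq = lossy_count("abracadabra" * 100, epsilon=0.01)
```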

11.
Existing studies on episode mining mainly concentrate on the discovery of (global) frequent episodes in sequences. However, frequent episodes are not suited to data streams because they do not capture the dynamic nature of the streams. This paper focuses on detecting dynamic changes in the frequencies of episodes over time-evolving streams. We propose an efficient method for the online detection of abrupt emerging episodes and abrupt submerging episodes over streams. Experimental results on synthetic data show that the proposed method can effectively detect the defined patterns and meet the strict requirements of stream processing, such as one-pass processing, real-time update and delivery of results, and limited time and space consumption. Experimental results on real data demonstrate that the patterns detected by our method are natural and meaningful. The proposed method has wide applications in stream monitoring and analysis, as the discovered patterns indicate dynamic emergences/disappearances of noteworthy events/phenomena hidden in the streams.
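A toy item-level stand-in for the detection idea: flag an item as abruptly emerging (or submerging) when its frequency jumps (or drops) by a large factor between consecutive windows. The paper detects episodes, i.e. ordered event combinations, not single items.

```python
from collections import Counter

def abrupt_changes(stream, window=100, ratio=3.0):
    """Yield ('emerging'|'submerging', item, position) on sharp frequency
    jumps between consecutive windows."""
    prev, curr = Counter(), Counter()
    for i, e in enumerate(stream, 1):
        curr[e] += 1
        if i % window == 0:
            for item in set(curr) | set(prev):
                before, now = prev[item] + 1, curr[item] + 1  # +1 smoothing
                if now / before >= ratio:
                    yield ('emerging', item, i)
                elif before / now >= ratio:
                    yield ('submerging', item, i)
            prev, curr = curr, Counter()

print(list(abrupt_changes("a" * 200 + "b" * 200, window=100)))
```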

12.
We investigate the potential of analyzing noisy non-stationary time series by quantising them into streams of discrete symbols and applying finite-memory symbolic predictors. Careful quantisation can reduce the noise in the time series and make model estimation more tractable. We apply the quantisation strategy in a realistic setting involving financial forecasting and trading. In particular, using historical data, we simulate the trading of straddles on the financial indexes DAX and FTSE 100 on a daily basis, based on predictions of the daily volatility differences in the underlying indexes. We propose a parametric, data-driven quantisation scheme which transforms temporal patterns in the series of daily volatility changes into grammatical and statistical patterns in the corresponding symbolic streams. As symbolic predictors operating on the quantised streams, we use classical fixed-order Markov models, variable memory length Markov models and a novel variation of fractal-based predictors, introduced in its original form in Tiňo and Dorffner [1]. The fractal-based predictors are designed to efficiently use deep memory. We compare the symbolic models with continuous techniques such as time-delay neural networks with continuous and categorical outputs, and GARCH models. Our experiments strongly suggest that the robust information reduction achieved by quantising the real-valued time series is highly beneficial. To deal with non-stationarity in financial daily time series, we propose two techniques that combine ‘sophisticated’ models fitted on the training data with a fixed set of simple-minded symbolic predictors not using older (and potentially misleading) data in the training set. Experimental results show that by quantising the volatility differences and then using symbolic predictive models, market makers can sometimes generate a statistically significant excess profit. We also mention some interesting observations regarding the memory structure in the series of daily volatility differences studied.
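A minimal sketch of the quantise-then-predict pipeline, assuming quantile bins and a fixed-order Markov predictor; the paper's parametric quantisation scheme and fractal-based predictors differ in detail.

```python
from collections import Counter, defaultdict
import numpy as np

def quantise(series, n_symbols=4):
    """Map real-valued changes to symbols via quantile bins (one plausible
    data-driven scheme)."""
    edges = np.quantile(series, np.linspace(0, 1, n_symbols + 1)[1:-1])
    return [int(np.searchsorted(edges, x)) for x in series]

def fit_markov(symbols, order=2):
    model = defaultdict(Counter)
    for i in range(order, len(symbols)):
        model[tuple(symbols[i - order:i])][symbols[i]] += 1
    return model

def predict(model, context):
    dist = model.get(tuple(context))
    return dist.most_common(1)[0][0] if dist else None

vol_diff = np.diff(np.abs(np.random.randn(500)))  # stand-in for volatility changes
syms = quantise(vol_diff)
next_sym = predict(fit_markov(syms), syms[-2:])
```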

13.
We propose a novel approach based on predictive quantization (PQ) for online summarization of multiple time-varying data streams. A synopsis over a sliding window of the most recent entries is computed in one pass and dynamically updated in constant time. The correlation between consecutive data elements is effectively taken into account without the need for preprocessing. We extend PQ to multiple streams and propose structures for real-time summarization and querying of a massive number of streams. Queries on any subsequence of a sliding window over multiple streams are processed in real time. We examine the two components of the proposed approach, prediction and quantization, separately, and investigate the space-accuracy trade-off for synopsis generation. Complementing the theoretical optimality of PQ-based approaches, we show that the proposed technique, even for very short prediction windows, significantly outperforms current techniques for a wide variety of query types on both synthetic and real data sets.
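A hedged single-stream sketch of predictive quantization: predict each value as the previous reconstructed value and quantize only the residual, so encoder and decoder stay in sync, per-element work is constant, and the reconstruction error is bounded by half the quantization step.

```python
import numpy as np

def pq_encode(x, step=0.1):
    codes, recon = [], 0.0
    for v in x:
        err = v - recon                 # prediction residual
        q = int(round(err / step))      # uniform quantizer
        codes.append(q)
        recon += q * step               # decoder-side reconstruction
    return codes

def pq_decode(codes, step=0.1):
    out, recon = [], 0.0
    for q in codes:
        recon += q * step
        out.append(recon)
    return out

x = np.cumsum(np.random.randn(1000) * 0.05)   # correlated series
x_hat = pq_decode(pq_encode(x))
assert max(abs(a - b) for a, b in zip(x, x_hat)) <= 0.05 + 1e-9
```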

14.
Processing changeable data streams in real time is one of the most important issues in the data mining field due to its broad applications such as retail market analysis, wireless sensor networks, and stock market prediction. In addition, it is an interesting and challenging problem to deal with stream data, since not only do the data have unbounded, continuous, and high-speed characteristics, but their environments also have limited resources. High utility pattern mining, meanwhile, is one of the essential research topics in pattern mining, intended to overcome major drawbacks of the traditional frequent pattern mining framework, which considers only binary databases and identical item importance. This approach conducts mining by reflecting characteristics of real-world databases: non-binary quantities and the relative importance of items. Although relevant algorithms have been proposed for finding high utility patterns in stream environments, they suffer from level-wise candidate generation-and-test and from the large number of candidates produced by their overestimation techniques. As a result, they consume a huge amount of execution time, which is a significant performance issue since rapid processing is necessary in stream data analysis. In this paper, we propose an algorithm for mining high utility patterns in resource-limited environments through efficient processing of data streams, in order to solve the problems of the overestimation-based methods. To improve mining performance with fewer candidates and a smaller search space than previous methods, we develop two techniques for reducing overestimated utilities. Moreover, we suggest a tree-based data structure to maintain information on stream data and high utility patterns. The proposed tree is restructured by our updating method, with decreased overestimation utilities, to keep stream information up to date whenever the current window slides. Our approach also has an important effect on expert and intelligent systems in that it can provide users with more meaningful information than traditional analysis methods, by reflecting the characteristics of real-world non-binary databases in stream environments and emphasizing recent data. Comprehensive experimental results show that our algorithm outperforms the existing sliding-window-based one in terms of runtime efficiency and scalability.
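A sketch of the basic utility measure over a sliding window; the paper's actual contribution (tighter overestimates and a tree synopsis) is not reproduced. The utility of an itemset is summed over window transactions that contain all of its items.

```python
from collections import deque

def window_utility(window, itemset, unit_profit):
    """Utility of `itemset` in the current window: quantity * unit profit,
    summed over transactions containing the whole itemset."""
    u = 0
    for tx in window:                        # tx: {item: quantity}
        if all(i in tx for i in itemset):
            u += sum(tx[i] * unit_profit[i] for i in itemset)
    return u

window = deque(maxlen=3)                     # sliding window of 3 transactions
unit_profit = {'a': 5, 'b': 1, 'c': 2}
for tx in [{'a': 2, 'b': 4}, {'b': 1, 'c': 3}, {'a': 1, 'c': 2}, {'a': 3, 'b': 1}]:
    window.append(tx)                        # oldest transaction expires
print(window_utility(window, {'a', 'b'}, unit_profit))
```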

15.
Recent work in dynamic graph algorithms has led to efficient algorithms for dynamic undirected graph problems such as connectivity. However, no efficient deterministic algorithms are known for the dynamic versions of fundamental directed graph problems like strong connectivity and transitive closure, as well as some undirected graph problems such as maximum matchings and cuts. We provide some explanation for this lack of success by presenting quadratic lower bounds on the certificate complexity of the seemingly difficult problems, in contrast to the known linear certificate complexity for the problems which have efficient dynamic algorithms. In many applications of dynamic (di)graph problems, a certain form of lookahead is available. Specifically, we consider the problems of assembly planning in robotics and the maintenance of relations in databases. These give rise to dynamic strong connectivity and dynamic transitive closure problems, respectively. We explain why it is reasonable, and indeed natural and desirable, to assume that lookahead is available in these two applications. Exploiting lookahead to circumvent their inherent complexity, we obtain efficient dynamic algorithms for strong connectivity and transitive closure.

16.
Learning from continuous streams of data has been receiving increasing attention in recent years. Among the many challenges related to mining data streams, change detection is a frequently addressed topic. Being able to determine whether or not data characteristics are changing over time is a major concern for data stream algorithms, in both the supervised and unsupervised scenarios. The unsupervised scenario is particularly relevant because many practical applications do not provide target labeling information. In this scenario, most strategies induce consecutive models over time and compare them in order to detect data changes, assuming that model changes are a consequence of data modifications. However, there is no guarantee this assumption is true, since those algorithms do not rely on any theoretical background ensuring that model divergences truly indicate data changes. The need for such a theoretical framework motivated this paper to propose a new stability concept that establishes bounds on the learning abilities of unsupervised algorithms designed to detect changes in data streams. This stability concept, based on the surrogate data strategy from time series analysis, provides learning guarantees for online unsupervised algorithms even in the case of time dependency among observations. Furthermore, we propose a new change detection algorithm that meets the requirements of this stability concept. Experimental results on different synthetic scenarios illustrate how the proposed stability concept is applied to detect changes in unsupervised data streams.
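As a stand-in for the surrogate-based strategy, the sketch below uses a permutation test, a close relative of surrogate data: under the no-change null hypothesis, the reference and new windows are exchangeable, so the observed mean shift should look like shifts obtained after shuffling the pooled data. This is not the paper's algorithm.

```python
import numpy as np

def permutation_change_test(ref, new, n_perm=500, alpha=0.05, rng=None):
    """Return True when a change is detected between two windows."""
    rng = rng or np.random.default_rng(0)
    ref, new = np.asarray(ref, float), np.asarray(new, float)
    obs = abs(new.mean() - ref.mean())
    pooled = np.concatenate([ref, new])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                   # exchangeable under the null
        diff = abs(pooled[len(ref):].mean() - pooled[:len(ref)].mean())
        count += diff >= obs
    return (count + 1) / (n_perm + 1) < alpha # small p-value -> change

print(permutation_change_test(np.random.randn(300), np.random.randn(300) + 1))
```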

17.
Guo Jiping. 《计算机应用》 (Journal of Computer Applications), 2005, 25(6): 1369-1372
This paper describes a forecasting framework based on synopses of time-series data streams and proposes a method for building wavelet synopses with effective noise reduction, in which the denoising (retention) thresholds are set adaptively per decomposition level according to the background noise. On top of this wavelet synopsis, multi-scale summary analysis and forecasting methods are implemented that can track changes in the trend, turning points, periodicity, and variance of dynamically changing high-frequency data streams, providing real-time annotation for time-series data streams. Simulation experiments on real power-load data show that the method delivers fast and accurate approximate forecasts.
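A hedged sketch of level-adaptive wavelet denoising: estimate the noise level from each detail level's own coefficients and soft-threshold accordingly. The specific threshold rule (MAD estimate plus a universal threshold) is an assumption, not the paper's exact setting.

```python
import numpy as np

def haar_decompose(x):
    """Full Haar decomposition (length a power of two in this sketch);
    returns per-level detail coefficients plus the final approximation."""
    details, approx = [], np.asarray(x, dtype=float)
    while len(approx) > 1:
        d = (approx[0::2] - approx[1::2]) / np.sqrt(2)
        approx = (approx[0::2] + approx[1::2]) / np.sqrt(2)
        details.append(d)
    return details, approx

def adaptive_denoise(details):
    """Soft-threshold each level using its own robust noise estimate."""
    out = []
    for d in details:
        sigma = np.median(np.abs(d)) / 0.6745          # MAD noise estimate
        t = sigma * np.sqrt(2 * np.log(max(len(d), 2)))
        out.append(np.sign(d) * np.maximum(np.abs(d) - t, 0.0))
    return out

x = np.sin(np.linspace(0, 6, 128)) + 0.2 * np.random.randn(128)
details, approx = haar_decompose(x)
details = adaptive_denoise(details)
```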

18.
Slow access to disk-based multimedia data is a major limiting factor in the performance of modern multimedia Web servers connected over broadband networks. The I/O bottleneck becomes even more pronounced for currently evolving systems handling multimedia data, such as audio and video. Retrieval of complex multimedia documents needs to be handled at two levels: I/O bandwidth management for multiple multimedia streams, and interstream and intrastream synchronization for multimedia objects constituting these documents. In this paper, based on the diverse characteristics of multimedia data, we propose efficient techniques for synchronous retrieval and delivery of such data from the storage system to the main memory of the server. We propose methods to quantify user-perceived quality via quality-of-presentation (QoP) parameters. We combine QoP and Object Composition Petri Net (OCPN) multimedia data modeling to develop techniques for efficient synchronous retrieval of multimedia data. Since I/O bandwidth is a precious resource, the proposed techniques have low overhead, expressed in terms of m, the number of logical I/O channels, and n, the total number of frames of multimedia data in a scheduling period. We simulate the relative performance of these techniques under diverse I/O conditions and determine the tradeoffs between system resources, such as memory and bandwidth, and the improvement in QoP for multimedia applications.

19.
Norm negotiation in online multi-player games
In recent years, the proliferation of VOIP data has created a number of applications in which it is desirable to perform quick online classification and recognition of massive voice streams. Typically such applications are encountered in real-time intelligence and surveillance. In many cases, the data streams can be in compressed format, and the rate of data processing can often run at gigabits per second. All known techniques for speaker voice analysis require an offline training phase in which the system is trained with known segments of speech. The state-of-the-art method for text-independent speaker recognition is Gaussian mixture modeling (GMM), which requires an iterative expectation-maximization procedure for training that cannot be implemented in real time. In many real applications (such as surveillance) it is desirable to perform the recognition process online, so that the system can be quickly adapted to new segments of the data. In many cases, it may also be desirable to quickly create databases of training profiles for speakers of interest. In this paper, we discuss the details of such an online voice recognition system. For this purpose, we use our micro-clustering algorithms to design concise signatures of the target speakers. One surprising and insightful observation from our experience with the system is that, although it was originally designed only for efficiency, it also turned out to be more accurate than the widely used GMM. This is because the conciseness of the micro-cluster model makes it less prone to overtraining. This is evidence that it is often possible to get the best of both worlds and do better than complex models from both an efficiency and an accuracy perspective. We present experimental results illustrating the effectiveness and efficiency of the method.
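A concise sketch of the micro-cluster idea: speaker signatures maintained as running-mean centroids updated online, with nearest-centroid prediction. The paper's cluster statistics are richer than a bare centroid and count.

```python
import numpy as np

class MicroClusters:
    """Online nearest-centroid signatures; each update costs O(k) for k
    clusters. Illustrative sketch only."""
    def __init__(self, radius=2.0):
        self.radius = radius
        self.centroids, self.counts, self.labels = [], [], []
    def update(self, x, label):
        x = np.asarray(x, dtype=float)
        for i, c in enumerate(self.centroids):
            if self.labels[i] == label and np.linalg.norm(x - c) < self.radius:
                self.counts[i] += 1
                self.centroids[i] = c + (x - c) / self.counts[i]  # running mean
                return
        self.centroids.append(x); self.counts.append(1); self.labels.append(label)
    def predict(self, x):
        d = [np.linalg.norm(np.asarray(x) - c) for c in self.centroids]
        return self.labels[int(np.argmin(d))]

mc = MicroClusters()
for frame in np.random.randn(200, 13):   # 13-dim MFCC-like feature frames
    mc.update(frame, label='speaker_A')
print(mc.predict(np.zeros(13)))
```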

20.
On clustering massive text and categorical data streams
In this paper, we study the data stream clustering problem in the context of text and categorical data domains. While the clustering problem has been studied recently for numeric data streams, text and categorical data present different challenges because of the large and unordered nature of the corresponding attributes. We therefore propose algorithms for text and categorical data stream clustering. We propose a condensation-based approach to stream clustering which summarizes the stream into a number of fine-grained cluster droplets. These summarized droplets can be used in conjunction with a variety of user queries to construct the clusters for different input parameters. Thus, this provides an online analytical processing approach to stream clustering. We also study the problem of detecting noisy and outlier records in real time. We test the approach on a number of real and synthetic data sets, and show the effectiveness of the method over the baseline OSKM algorithm for stream clustering.
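A minimal sketch of a cluster droplet for text: additive word counts plus a document count are enough to score incoming documents by cosine similarity. Droplet decay over time and outlier handling are omitted.

```python
from collections import Counter
import math

class ClusterDroplet:
    """Condensed text-cluster summary: additive word-frequency counts and a
    document count. Sketch of the droplet idea only."""
    def __init__(self):
        self.words = Counter()
        self.n_docs = 0
    def add(self, doc_words):
        self.words.update(doc_words)
        self.n_docs += 1
    def cosine(self, doc_words):
        dot = sum(self.words[w] * c for w, c in doc_words.items())
        na = math.sqrt(sum(v * v for v in self.words.values()))
        nb = math.sqrt(sum(c * c for c in doc_words.values()))
        return dot / (na * nb) if na and nb else 0.0

# Assign a new document to the most similar droplet (one droplet shown).
d = ClusterDroplet()
d.add(Counter("the cat sat on the mat".split()))
print(d.cosine(Counter("a cat on a mat".split())))
```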
