首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1 - ε)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches.  相似文献   

2.
Clustering data streams has drawn lots of attention in the last few years due to their ever-growing presence. Data streams put additional challenges on clustering such as limited time and memory and one pass clustering. Furthermore, discovering clusters with arbitrary shapes is very important in data stream applications. Data streams are infinite and evolving over time, and we do not have any knowledge about the number of clusters. In a data stream environment due to various factors, some noise appears occasionally. Density-based method is a remarkable class in clustering data streams, which has the ability to discover arbitrary shape clusters and to detect noise. Furthermore, it does not need the nmnber of clusters in advance. Due to data stream characteristics, the traditional density-based clustering is not applicable. Recently, a lot of density-based clustering algorithms are extended for data streams. The main idea in these algorithms is using density- based methods in the clustering process and at the same time overcoming the constraints, which are put out by data streanFs nature. The purpose of this paper is to shed light on some algorithms in the literature on density-based clustering over data streams. We not only summarize the main density-based clustering algorithms on data streams, discuss their uniqueness and limitations, but also explain how they address the challenges in clustering data streams. Moreover, we investigate the evaluation metrics used in validating cluster quality and measuring algorithms' performance. It is hoped that this survey will serve as a steppingstone for researchers studying data streams clustering, particularly density-based algorithms.  相似文献   

3.
Squeezer: An efficient algorithm for clustering categorical data   总被引:25,自引:0,他引:25       下载免费PDF全文
This paper presents a new efficient algorithm for clustering categorical data,Squeezer,which can produce high quality clustering results and at the same time deserve good scalability.The Squeezer algorithm reads each tuple t in sequence,either assigning t to an existing cluster (initially none),or creating t as a new cluster,which is determined by the similarities between t and clusters.Due to its characteristics,the proposed algorithm is extremely suitable for clustering data streams,where given a sequence of points,the objective is to maintain consistently good clustering of the sequence so far,using a small amount of memory and time.Outliers can also be handled efficiently and directly in Squeezer.Experimental results on real-life and synthetic datasets verify the superiority of Squeezer.  相似文献   

4.
Many database applications require efficient processing of data streams with value variations and fluctuant sampling frequency.The variations typically imply fundamental features of the stream and important domain knowledge of underlying objects.In some data streams,successive events seem to recur in a certain time interval,but the data indeed evolves with tiny differences as time elapses.This feature,so called pseudo periodicity,poses a new challenge to stream variation management.This study focuses on the online management for variations over such streams.The idea can be applied to many scenarios such as patient vital signal monitoring in medical applications.This paper proposes a new method named Pattern Growth Graph (PGG) to detect and manage variations over evolving streams with following features:1) adopts the wave-pattern to capture the major information of data evolution and represent them compactly; 2) detects the variations in a single pass over the stream with the help of wave-pattern matching algorithm;3) only stores different segments of the pattern for incoming stream,and hence substantially compresses the data without losing important information;4) distinguishes meaningful data changes from noise and reconstructs the stream with acceptable accuracy. Extensive experiments on real datasets containing millions of data items,as well as a prototype system,are carried out to demonstrate the feasibility and effectiveness of the proposed scheme.  相似文献   

5.
A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Due to this reason, most algorithms for data streams sacrifice the correctness of their results for fast processing time. The processing time is greatly influenced by the amount of information that should be maintained. This issue becomes more serious in finding frequent itemsets or frequency counting over an online transactional data stream since there can be a large number of itemsets to be monitored. We have proposed a method called the estDec method for finding frequent itemsets over an online data stream. In order to reduce the number of monitored itemsets in this method, monitoring the count of an itemset is delayed until its support is large enough to become a frequent itemset in the near future. For this purpose, the count of an itemset should be estimated. Consequently, how to estimate the count of an itemset is a critical issue in minimizing memory usage as well as processing time. In this paper, the effects of various count estimation methods for finding frequent itemsets are analyzed in terms of mining accuracy, memory usage and processing time.  相似文献   

6.
In many data stream mining applications, traditional density estimation methods such as kemel density estimation, reduced set density estimation can not be applied to the density estimation of data streams because of their high computational burden, processing time and intensive memory allocation requirement. In order to reduce the time and space complexity, a novel density estimation method Dm-KDE over data streams based on the proposed algorithm m-KDE which can be used to design a KDE estimator with the fixed number of kernel components for a dataset is proposed. In this method, Dm-KDE sequence entries are created by algorithm m-KDE instead of all kemels obtained from other density estimation methods. In order to further reduce the storage space, Dm-KDE sequence entries can be merged by calculating their KL divergences. Finally, the probability density functions over arbitrary time or entire time can be estimated through the obtained estimation model. In contrast to the state-of-the-art algorithm SOMKE, the distinctive advantage of the proposed algorithm Dm-KDE exists in that it can achieve the same accuracy with much less fixed number of kernel components such that it is suitable for the scenarios where higher on-line computation about the kernel density estimation over data streams is required. We compare Dm-KDE with SOMKE and M-kernel in terms of density estimation accuracy and running time for various stationary datasets. We also apply Dm-KDE to evolving data streams. Experimental results illustrate the effectiveness of the pro- posed method.  相似文献   

7.
Graphics processing units (GPUs) have an SIMD architecture and have been widely used recently as powerful general-purpose co-processors for the CPU. In this paper, we investigate efficient GPU-based data cubing because the most frequent operation in data cube computation is aggregation, which is an expensive operation well suited for SIMD parallel processors. H-tree is a hyper-linked tree structure used in both top-k H-cubing and the stream cube. Fast H-tree construction, update and real-time query response are crucial in many OLAP applications. We design highly efficient GPU-based parallel algorithms for these H-tree based data cube operations. This has been made possible by taking effective methods, such as parallel primitives for segmented data and efficient memory access patterns, to achieve load balance on the GPU while hiding memory access latency. As a result, our GPU algorithms can often achieve more than an order of magnitude speedup when compared with their sequential counterparts on a single CPU. To the best of our knowledge, this is the first attempt to develop parallel data cubing algorithms on graphics processors.  相似文献   

8.
The existing methods for visualizing volumetric data are mostly based on piecewise linear models.And all kinds of analysis based on them have to be substituted by coarse interpolations.So both accuracy and reliability of the traditional framework for visualization and analysis of volumetric data are far from our needs of digging information implied in volumetric data fields.In this paper,we propose a novel framework based on a C2-continuous seven-directional box spline,under which reconstruction is of high accuracy and differential computations relative to analysis based on the reconstruction model are accurate.We introduce a polynomial differential operator to improve the reconstruction accuracy.In order to settle the difficulty of evaluating upon the seven-directional box spline,we convert it into B′ezier form and propose effective theories and algorithms of extracting iso-surfaces,critical points and curvatures.Plentiful of examples are also given in this paper to illustrate that the novel framework is suitable for analysis,the improved reconstruction method has high accuracy,and our algorithms are fast and stable.  相似文献   

9.
Detecting duplicates in data streams is an important problem that has a wide range of applications. In general,precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios,and,on the other hand,the elements in data streams are always time sensitive. These make it particular significant approximately detecting duplicates among newly arrived elements of a data stream within a fixed time frame. In this paper,we present a novel data structure,Decaying Bloom Filter(DBF),as an extension of the Counting Bloom Filter,that effectively removes stale elements as new elements continuously arrive over sliding windows. On the DBF basis we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors,but not false negative errors as in many previous results. We analyze the time complexity and detection accuracy,and give a tight upper bound of false positive rate. For a given space G bits and sliding window size W,our algorithm has an amortized time complexity of O((G/W))~(1/2). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior in both execution time and detection accuracy to the previous results.  相似文献   

10.
The superpoints are the sources (or the destinations) that connect with a great deal of destinations (or sources) during a measurement time interval, so detecting the superpoints in real time is very important to network security and management. Previous algorithms are not able to control the usage of the memory and to deliver the desired accuracy, so it is hard to detect the superpoints on a high speed link in real time. In this paper, we propose an adaptive sampling algorithm to detect the superpoints in real time, which uses a flow sample and hold module to reduce the detection of the non-superpoints and to improve the measurement accuracy of the superpoints. We also design a data stream structure to maintain the flow records, which compensates for the flow Hash collisions statistically. An adaptive process based on different sampling probabilities is used to maintain the recorded IP addresses in the limited memory. This algorithm is compared with the other algorithms by analyzing the real network trace data. Experiment results and mathematic analysis show that this algorithm has the advantages of both the limited memory requirement and high measurement accuracy.  相似文献   

11.
In light of GPUs’ powerful floating-point operation capacity,heterogeneous parallel systems incorporating general purpose CPUs and GPUs have become a highlight in the research field of high performance computing(HPC).However,due to the complexity of programming on GPUs,porting a large number of existing scientific computing applications to the heterogeneous parallel systems remains a big challenge.The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing.To effectively inherit existing OpenMP applications and reduce the transplant cost,we extend OpenMP with a group of compiler directives,which explicitly divide tasks among the CPU and the GPU,and map time-consuming computing fragments to run on the GPU,thus dramatically simplifying the transplantation.We have designed and implemented MPtoStream,a compiler of the extended OpenMP for AMD’s stream processing GPUs.Our experimental results show that programming with the extended directives deviates from programming with OpenMP by less than 11% modification and achieves significant speedup ranging from 3.1 to 17.3 on a heterogeneous system,incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU,over the execution on the Xeon CPU alone.  相似文献   

12.
With the coming shift to cloud computing,cloud database is emerging to provide database service over the Internet.In the cloud-based environment,data are distributed at Internet scale and the system needs to handle a huge number of user queries simultaneously without delay.How data are distributed among the servers has a crucial impact on the query load distribution and the system response time.In this paper,we propose a market-based control method,called MBA,to achieve query load balance via reasonable data distribution.In MBA,database nodes are treated as traders in a market,and certain market rules are used to intelligently decide data allocation and migration.We built a prototype system and conducted extensive experiments.Experimental results show that the MBA method significantly improves system performance in terms of average query response time and fairness.  相似文献   

13.
In many sensor network applications,it is essential to get the data distribution of the attribute value over the network.Such data distribution can be got through clustering,which partitions the network into contiguous regions,each of which contains sensor nodes of a range of similar readings.This paper proposes a method named Distributed,Hierarchical Clustering (DHC) for online data analysis and mining in senior networks.Different from the acquisition and aggregation of raw sensory data,DHC clusters sensor nodes based on their current data values as well as their geographical proximity,and computes a summary for each cluster.Furthermore,these clusters,together with their summaries,are produced in a distributed,bottom-up manner.The resulting hierarchy of clusters and their summaries facilitates interactive data exploration at multiple resolutions.It can also be used to improve the efficiency of data-centric routing and query processing in sensor networks.We also design and evaluate the maintenance mechanisms for DHC to make it be able to work on evolving data.Our simulation results on real world datasets as well as synthetic datasets show the effectiveness and efficiency of our approach.  相似文献   

14.
I/O parallelism is considered to be a promising approach to achieving high performance in parallel data warehousing systems where huge amounts of data and complex analytical queries have to be processed. This paper proposes a parallel secondary data cube storage structure (PHC for short) to efficiently support the processing of range sum queries and dynamic updates on data cube using parallel computing systems. Based on PHC, two parallel algorithms for processing range sum queries and updates are proposed also. Both the algorithms have the same time complexity, O(logdn/P). The analytical and experimental results show that PHC and the parallel algorithms have high performance and achieve optimum speedup.  相似文献   

15.
Scan-based testing methodologies remedy the testability problem of sequential circuits; yet they suffer from prolonged test time and excessive test power due to numerous shift operations. The correlation among test data along with the high density of the unspecified bits in test data enables the utilization of the existing test data in the scan chain for the generation of the subsequent test stimulus, thus reducing both test time and test data volume. We propose a pair of scan approaches in this paper; in the first approach, a test stimulus partially consists of the preceding stimulus, while in the second approach, a test stimulus partially consists of the preceding test response bits. Both proposed scan-based test schemes access only a subset of scan cells for loading the subsequent test stimulus while freezing the remaining scan cells with the preceding test data, thus decreasing scan chain transitions during shift operations. The proposed scan architecture is coupled with test data manipulation techniques which include test stimuli ordering and partitioning algorithms, boosting test time reductions. The experimental results confirm that test time reductions exceeding 97%, and test power reductions exceeding 99% can be achieved by the proposed scan-based testing methodologies on larger ISCAS89 benchmark circuits.  相似文献   

16.
RFID middleware collects and filters RFID streaming data to process applications' requests called continuous queries, because they are executed continuously during tag movement. Several approaches to building an index on queries rather than data records, called a query index, have been proposed to evaluate continuous queries over streaming data. EPCglobal proposed an Event Cycle Specification (ECSpec) model, which is a de facto standard query interface for RFID applications. Continuous queries based on ECSpec consist of a large number of segments that represent the query conditions. The problem when using any of the existing query indexes on these continuous queries is that it takes a long time to build the index, because it is necessary to insert a large number of segments into the index. To solve this problem, we propose a transform method that converts a group of segments into compressed data. We also propose an efficient query index scheme for the transformed space. Comparing with existing query indexes, the performance of proposed index outperforms the others on various datasets.  相似文献   

17.
Mining with streaming data is a hot topic in data mining. When performing classification on data streams, traditional classification algorithms based on decision trees, such as ID3 and C4.5, have a relatively poor efficiency in both time and space due to the characteristics of streaming data. There are some advantages in time and space when using random decision trees. An incremental algorithm for mining data streams, SRMTDS (Semi-Random Multiple decision Trees for Data Streams), based on random decision trees is proposed in this paper. SRMTDS uses the inequality of Hoeffding bounds to choose the minimum number of split-examples, a heuristic method to compute the information gain for obtaining the split thresholds of numerical attributes, and a Naive Bayes classifier to estimate the class labels of tree leaves. Our extensive experimental study shows that SRMTDS has an improved performance in time, space, accuracy and the anti-noise capability in comparison with VFDTc, a state-of-the-art decision-tree algorithm for classifying data streams.  相似文献   

18.
Most of the earlier work on clustering is mainly focused on numerical data the inherent geometric properties of which can be exploited to naturally define distance functions between the data points. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits them from being used for clustering categorical data. This paper shows how to apply the notion of "cluster centers" to a dataset of categorical objects, and a k-means-like algorithm for clustering categorical data is introduced.  相似文献   

19.
Data streams produced by positioning systems such as Global Positioning System (GPS) or RFID readers can be considered as location streams[12]. Location streams are usually generated in a distributed fashion by a large scale distributed system covering a wide range of areas. Computing on distributed location streams is both practically useful and theoretically challenging. The results of computation could be used to schedule the traffic in a metropolis to avoid traffic jam, dispatch taxis to serve the passengers more quickly and display the current position of goods in supply chain management, etc. Since location streams are usually generated with very high rate in uncertain ways over hostile environments, the collected updates of location are probably redundant and inconsistent in a wide positioning system. To process distributed location streams with redundancy and inconsistency, this paper proposes a novel method based on min-wise hash. With this method, redundant updates of distributed location streams can be effiectively filtered out, while the true location could be derived from inconsistent ones. Consequently, globally uniform samples can be obtained. Based on the uniform samples, an algorithm for computing the approximate k-median of huge number of moving objects is presented in this paper. Furthermore, it is demonstrated that sketch-based methods are not necessarily effiective in processing location streams with redundancy and inconsistency. In addition to theoretical analysis, some extensive experiments are conducted to validate the efficiency and effiectiveness of the proposed approach.  相似文献   

20.
The availability of large microarray data has led to a growing interest in biclustering methods in the past decade. Several algorithms have been proposed to identify subsets of genes and conditions according to different similarity measures and under varying constraints. In this paper we focus on the exclusive row biclustering problem (also known as projected clustering) for gene expression, in which each row can only be a member of a single bicluster while columns can participate in multiple clusters. This type of biclustering may be adequate, for example, for clustering groups of cancer patients where each patient (row) is expected to be carrying only a single type of cancer, while each cancer type is associated with multiple (and possibly overlapping) genes (columns). We present a novel method to identify these exclusive row biclusters in the spirit of the optimal set cover problem. We present our algorithmic solution as a combination of existing biclustering algorithms and combinatorial auction techniques. Furthermore, we devise an approach for tuning the threshold of our algorithm based on comparison with a null model, inspired by the Gap statistic approach. We demonstrate our approach on both synthetic and real world gene expression data and show its power in identifying large span non-overlapping rows submatrices, while considering their unique nature.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号