Similar documents
 20 similar documents found (search time: 15 ms)
1.
We propose two fast algorithms for abrupt change detection in streaming data that can operate on arbitrary unknown data distributions before and after the change. The first algorithm, MB-GT, efficiently computes the average Euclidean distance between all pairs of data points before and after the hypothesized change. The second algorithm, MB-CUSUM, computes the log-likelihood ratio statistic for the data distributions before and after the change, similarly to the classical CUSUM algorithm; but unlike that algorithm, MB-CUSUM does not need to know the exact distributions, and uses kernel density estimates instead. Although a straightforward computation of the two change statistics would have computational complexity of O(N^4) with respect to the size N of the streaming data buffer, the proposed algorithms exploit the computational structure of these statistics to achieve a computational complexity of only O(N^2) and a memory requirement of O(N). Furthermore, the algorithms perform surprisingly well on dependent observations generated by underlying dynamical systems, unlike traditional change detection algorithms.
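The MB-GT statistic can be illustrated with a naive sketch for 1-D observations (function names are mine, not the authors'): for each candidate change point, average the distances between all before/after pairs. The direct scan below costs O(N^3) over all candidates; the paper's contribution is an incremental scheme that brings the total to O(N^2).

```python
def mbgt_statistic(buffer, tau):
    """Average distance between every observation before the hypothesized
    change index tau and every observation after it (naive O(N^2) pass)."""
    before, after = buffer[:tau], buffer[tau:]
    total = sum(abs(x - y) for x in before for y in after)
    return total / (len(before) * len(after))

def detect_change(buffer, threshold):
    """Scan all candidate change points; report the maximizing index and
    its statistic if the statistic exceeds the threshold, else None."""
    best_tau, best_stat = None, 0.0
    for tau in range(1, len(buffer)):
        s = mbgt_statistic(buffer, tau)
        if s > best_stat:
            best_tau, best_stat = tau, s
    return (best_tau, best_stat) if best_stat > threshold else None
```

On a buffer that jumps from one level to another, the statistic peaks at the true change index.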

2.
FFTs in external or hierarchical memory
Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those algorithms recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason for this is the fact that most FFT algorithms require at least m complete passes through the data set to compute a 2^m-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the CRAY-2, the CRAY X-MP, and the CRAY Y-MP systems. Using all eight processors on the CRAY Y-MP, this main memory routine runs at nearly two gigaflops. A condensed version of this paper previously appeared in the Proceedings of Supercomputing '89.
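The few-pass structure rests on factoring an n = n1*n2-point FFT into row FFTs, a twiddle-factor multiply, and column FFTs (the classic "four-step" decomposition). A minimal in-memory sketch, not the paper's out-of-core implementation, which would stream each pass and transpose through external storage:

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """Compute an n1*n2-point DFT in two FFT passes over the data:
    column FFTs, twiddle multiply, row FFTs, final transpose."""
    n = n1 * n2
    a = x.reshape(n1, n2)
    a = np.fft.fft(a, axis=0)                 # pass 1: n2 FFTs of size n1
    j1 = np.arange(n1).reshape(n1, 1)
    k2 = np.arange(n2).reshape(1, n2)
    a = a * np.exp(-2j * np.pi * j1 * k2 / n) # twiddle factors
    a = np.fft.fft(a, axis=1)                 # pass 2: n1 FFTs of size n2
    return a.T.reshape(n)                     # transpose restores natural order
```

The result matches a direct length-n FFT; in an external-memory setting, each `np.fft.fft` pass corresponds to one sweep over the data set.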

3.
The synchronization barrier is a point in the program where the processing elements (PEs) wait until all the PEs have arrived at this point. In a reduction computation, given a commutative and associative binary operation op, one needs to reduce values a_0, ..., a_{N-1}, stored in PEs 0, ..., N-1, to a single value a* = a_0 op a_1 op ... op a_{N-1} and then to broadcast the result a* to all PEs. This computation is often followed by a synchronization barrier. Routines to perform these functions are frequently required in parallel programs. Simple and efficient, working C-language routines for the parallel barrier synchronization and reduction computations are presented. The codes are appropriate for a CREW (concurrent-read-exclusive-write) or EREW parallel random access shared memory MIMD computer. They require only shared memory reads and writes; no locks, semaphores, etc. are needed. The running time of each of these routines is O(log N). The amount of shared memory required and the number of shared memory accesses generated are both O(N). These are the asymptotically minimum values for the three parameters. The algorithms employ the obvious computational scheme involving a binary tree. Examples of applications for these routines and results of performance testing on the Sequent Balance 21000 computer are presented. An abstract of this article appeared in Proc. 1989 Int. Conf. Parallel Processing, p. II-175.
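The binary-tree reduction scheme can be sketched sequentially (the real routines run all PEs of a tree level concurrently over shared memory; here each level is played out as a loop):

```python
def tree_reduce(values, op):
    """Binary-tree reduction: at level d, PE i combines its partial
    result with that of PE i + 2^d.  O(log N) parallel steps; the
    shared array stands in for the shared-memory cells."""
    n = len(values)
    shared = list(values)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            shared[i] = op(shared[i], shared[i + stride])  # PE i reads PE i+stride
        stride *= 2
    return shared[0]  # broadcast: every PE reads cell 0
```

The loop body of each level touches disjoint cells, which is why no locks are needed on a CREW/EREW machine.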

4.
We present an efficient algorithm to perform approximate offsetting operations on geometric models using GPUs. Our approach approximates the boundary of an object with point samples and computes the offset by merging the balls centered at these points. The underlying approach uses Layered Depth Images (LDI) to organize the samples into structured points and performs parallel computations using multiple cores. We use spatial hashing to accelerate intersection queries and balance the workload among various cores. Furthermore, the problem of offsetting with a large distance is decomposed into successive offsetting using smaller distances. We derive bounds on the accuracy of offset computation as a function of the sampling rate of LDI and offset distance. In practice, our GPU-based algorithm can accurately compute offsets of models represented using hundreds of thousands of points in a few seconds on a GeForce GTX 580 GPU. We observe more than 100 times speedup over prior serial CPU-based approximate offset computation algorithms.
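The merging-of-balls idea reduces, per query point, to a distance test: a point lies in the approximate offset iff it is within the offset distance of some boundary sample. A 2-D sketch with a linear scan (names mine; the paper organizes the samples with LDIs and spatial hashing instead):

```python
def in_offset(samples, r, q):
    """True iff q lies in the union of radius-r balls centred at the
    boundary samples, i.e. in the approximate offset region."""
    return any((q[0] - x) ** 2 + (q[1] - y) ** 2 <= r * r
               for x, y in samples)
```

Spatial hashing replaces the `any` scan with a lookup of only the grid cells within distance r of q.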

5.
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Blockwise access to data is a central theme in the design of efficient EM algorithms. A similar requirement arises in the transmission of data between processors in high performance parallel computation systems, for which blockwise communication is a crucial issue. We consider multisearch problems, where a large number of queries are to be simultaneously processed and satisfied by navigating through large data structures on parallel computers. Our examples originate as algorithms for parallel machines, and we adapt them to the EM situation, where the queries and data structure are considered to be much larger than the size of the available internal memory. This paper presents techniques to achieve blocking for I/O as well as for communication in multisearch on the BSP and EM-BSP models. We describe improvements to the 1-optimal BSP* multisearch algorithm of [8] which permit larger search trees to be handled. In the area of EM algorithms, new algorithms for multisearch in balanced trees are described. For search trees of size O(n log n), where n is the number of queries, we obtain a work-optimal, parallel, randomized EM multisearch algorithm whose I/O and communication time are smaller, asymptotically, than the computation time. We obtain a deterministic version via a similar technique. These algorithms are obtained via the simulation techniques of [12], [17], [13], [14], and [24]. For larger trees we describe a parallel EM algorithm which is 1-optimal considering computation, communication, and I/O (for suitable parameter constellations), plus I/O-optimal. We give a lower bound on the number of I/O operations required for filtering n queries through a binary or multiway search tree of size m when m ≥ n^{2+ε}, for a constant ε > 0. Online publication February 20, 2001.
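The multisearch problem itself is easy to state sequentially; the difficulty the paper addresses is doing this with blocked I/O and communication when neither the query batch nor the tree fits in internal memory. A sketch of the problem only, with a sorted array standing in for a flattened balanced search tree:

```python
import bisect

def multisearch(sorted_keys, queries):
    """Filter a batch of queries through a sorted key array: for each
    query, return the number of keys <= it (its search position)."""
    return [bisect.bisect_right(sorted_keys, q) for q in queries]
```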

6.
We present a randomized parallel algorithm for constructing the three-dimensional convex hull on a generic p-processor coarse-grained multicomputer with arbitrary interconnection network and n/p local memory per processor, where n/p ≥ p^{2+ε} (for some arbitrarily small ε > 0). For any given set of n points in 3-space, the algorithm computes the three-dimensional convex hull, with high probability, in local computation time and O(1) communication phases, with at most O(n/p) data sent/received by each processor; Γ_{n,p} denotes the time complexity of one communication phase. The assumption n/p ≥ p^{2+ε} implies a coarse-grained, limited-parallelism model which is applicable to most commercially available multiprocessors. In the terminology of the BSP model, our algorithm requires, with high probability, O(1) supersteps and communication cost O((n/p)g). Received October 30, 1995, and in revised form April 15, 1996, and in final form September 17, 1996.

7.
We consider the following partition problem: given a set S of n elements that is organized as k sorted subsets of size n/k each, and given a parameter h with 1/k ≤ h ≤ n/k, partition S into g = O(n/(hk)) subsets D_1, D_2, ..., D_g of size Θ(hk) each, such that, for any two indices i and j with 1 ≤ i < j ≤ g, no element in D_i is bigger than any element in D_j. Note that with various combinations of the values of parameters h and k, several fundamental problems, such as merging, sorting, and finding an approximate median, can be formulated as or be reduced to this partition problem. The partition problem also finds many applications in solving problems of parallel computing and computational geometry. In this paper we present efficient parallel algorithms for solving the partition problem and a number of its applications. Our parallel partition algorithm runs in O(log n) time in the EREW PRAM model. The complexity bounds of our parallel partition algorithm on the respective special cases match those of the optimal EREW PRAM algorithms for merging, sorting, and finding an approximate median. Using our parallel partition algorithm, we are also able to obtain better complexity bounds (even possibly on a weaker parallel model) than the previously best known parallel algorithms for several important problems, including parallel multiselection, parallel multiranking, and parallel sorting of k sorted subsets. Received May 5, 1996; revised July 30, 1998.
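The problem statement can be made concrete with a sequential sketch that simply merges and chunks (the paper's contribution is achieving the same ordered grouping in O(log n) time on an EREW PRAM without a full merge):

```python
import heapq

def partition_sorted(lists, h):
    """Partition the union of k sorted lists into ordered groups of
    size ~h*k: every element of group D_i is <= every element of D_j
    for i < j."""
    k = len(lists)
    merged = list(heapq.merge(*lists))     # the union of the k sorted subsets
    size = h * k                           # target group size Theta(h*k)
    return [merged[i:i + size] for i in range(0, len(merged), size)]
```

With h = n/k this is sorting into one group; with h = 1/k (singleton groups) it is a full merge, matching the special cases named in the abstract.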

8.
This paper presents new algorithms for solving some geometric problems on a shared memory parallel computer, where concurrent reads are allowed but no two processors can simultaneously attempt to write in the same memory location. The algorithms are quite different from known sequential algorithms, and are based on the use of a new parallel divide-and-conquer technique. One of our results is an O(log n) time, O(n) processor algorithm for the convex hull problem. Another result is an O(log n log log n) time, O(n) processor algorithm for the problem of selecting a closest pair of points among n input points.

9.
The paper describes how to modify the two-sided Hari–Zimmermann algorithm for computing the generalized eigenvalues of a matrix pair (A, B), where B is positive definite, into an implicit algorithm that computes the generalized singular values of a pair (F, G). In addition, we present blocking and parallelization techniques for speeding up the computation. For triangular matrix pairs of moderate size, numerical tests show that the double-precision sequential pointwise algorithm is several times faster than the LAPACK DTGSJA algorithm, while the accuracy is slightly better, especially for small generalized singular values. Cache-aware algorithms, implemented either as the block-oriented or as the full block algorithm, are several times faster than the pointwise algorithm. The algorithm is almost perfectly parallelizable, so parallel shared memory versions of the algorithm are perfectly scalable, and their speedup depends almost solely on the number of cores used. A hybrid shared/distributed memory algorithm is intended for huge matrices that do not fit into the shared memory.

10.
In this paper we present optimal processor × time parallel algorithms for term matching and anti-unification of terms represented as trees. Term matching is the special case of unification in which one of the terms is restricted to contain no variables. It has wide applicability to logic programming, term rewriting systems and symbolic pattern matching. Anti-unification is the dual problem of unification, in which one computes the most specific generalization of two terms. It has applications to inductive inference and theorem proving. Our algorithms run in O(log^2 N) time using N/log^2 N processors on a shared-memory model of computation that prohibits simultaneous reads or writes (EREW PRAM). These algorithms are the first polylogarithmic-time EREW algorithms with a processor × time product of the same order as that of their sequential counterparts, thereby permitting optimal speed-ups using any number of processors up to N/log^2 N. We also use the techniques developed in the paper to provide an N/log N-processor, O(log N)-time algorithm for a shared-memory model that allows both simultaneous reads and simultaneous writes (CRCW PRAM). Supported by NSF Grant IRI-88-09324 and NSF/DARPA Grant CCR-8908092.
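The term-matching problem being parallelized is the sequentially simple one below (a sketch with my own term encoding: tuples for function applications, strings beginning with '?' for variables); the paper's point is solving it in O(log^2 N) EREW time with optimal work.

```python
def term_match(pattern, term, binding=None):
    """Match a pattern against a ground term; return a substitution
    dict mapping variables to subterms, or None on failure."""
    if binding is None:
        binding = {}
    if isinstance(pattern, str) and pattern.startswith('?'):
        if pattern in binding:                       # variable seen before:
            return binding if binding[pattern] == term else None
        binding[pattern] = term                      # first occurrence: bind
        return binding
    if isinstance(pattern, tuple) and isinstance(term, tuple):
        if len(pattern) != len(term) or pattern[0] != term[0]:
            return None                              # function symbols differ
        for p, t in zip(pattern[1:], term[1:]):
            if term_match(p, t, binding) is None:
                return None
        return binding
    return binding if pattern == term else None      # constants
```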

11.
A method is given for computing the fourth virial coefficient D(T) for a pairwise additive, spherically symmetric interaction potential. Taking one of the four interacting particles as origin and using the appropriate co-ordinate transformations in the usual way, the ninefold integrals defining D(T) are reduced to sixfold integrals, which are then formally reduced to triple integrals by expanding out those Ursell–Mayer functions in the integrals not involving the origin particle as infinite series in Legendre polynomials P_s(cos θ), where θ is the angle between the radius vectors of the interacting particles. The coefficients in these expansions are integrals of highly oscillatory functions, especially for large s, and are evaluated using the Chebyshev polynomial expansion for the Ursell–Mayer functions, thus making explicit use of the oscillatory behaviour of P_s(cos θ). The triple integrals are evaluated using a nonproduct integration formula of the seventh degree, employed earlier in the computation of the third virial coefficient. The values of D(T) computed by the present method have the same qualitative behaviour as the literature values but appear to be more accurate, particularly at lower temperatures.

12.
Computing an optimal solution to the knapsack problem is known to be NP-hard. Consequently, fast parallel algorithms for finding such a solution without using an exponential number of processors appear unlikely. An attractive alternative is to compute an approximate solution to this problem rapidly using a polynomial number of processors. In this paper, we present an efficient parallel algorithm for finding approximate solutions to the 0–1 knapsack problem. Our algorithm takes an ε, 0 < ε < 1, as a parameter and computes a solution such that the ratio of its deviation from the optimal solution is at most a fraction ε of the optimal solution. For a problem instance having n items, this computation uses O(n^{5/2}/ε^{3/2}) processors and requires O(log^3 n + log^2 n log(1/ε)) time. The upper bound on the processor requirement of our algorithm is established by reducing it to a problem on weighted bipartite graphs. This processor complexity is a significant improvement over that of other known parallel algorithms for this problem.
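The approximation guarantee described — deviation at most a fraction ε of optimal — is the guarantee of the classic profit-scaling scheme for 0–1 knapsack, sketched sequentially below (this is the standard FPTAS idea, not the paper's parallel algorithm): scale profits down by K = ε·Vmax/n, solve exactly by dynamic programming over scaled profits, and lose at most εVmax ≤ εOPT.

```python
def knapsack_fptas(values, weights, capacity, eps):
    """Profit-scaling approximation for 0-1 knapsack: returns a value
    within a (1 - eps) factor of optimal (sequential sketch)."""
    n = len(values)
    K = eps * max(values) / n
    scaled = [int(v / K) for v in values]
    total = sum(scaled)
    INF = float('inf')
    # min_weight[p] = least weight achieving scaled profit exactly p
    min_weight = [0] + [INF] * total
    for v, w in zip(scaled, weights):
        for p in range(total, v - 1, -1):
            if min_weight[p - v] + w < min_weight[p]:
                min_weight[p] = min_weight[p - v] + w
    best = max(p for p in range(total + 1) if min_weight[p] <= capacity)
    return best * K   # lower bound on the value actually achieved
```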

13.
This paper presents the study of an adaptive control which tracks a desired time-based trajectory as closely as possible, for all times, over a wide range of manipulator motions and payloads, in both joint-variable coordinates and Cartesian coordinates. The proposed adaptive control is based on the linearized perturbation equations in the vicinity of a nominal trajectory. The controlled system is characterized by feedforward and feedback components which can be computed separately and simultaneously. The feedforward component computes the nominal torques from the Newton-Euler equations of motion, using either the resolved joint information or the joint information from the trajectory planning program. This computation can be completed in O(n) time. The feedback component, consisting of recursive least-squares identification and one-step optimal control algorithms for the linearized system, computes the perturbation torques in O(n^3) time. Because of the parallel structure, the computations of the adaptive control may be implemented in low-cost microprocessors. A computer simulation study was conducted to evaluate the performance of the adaptive control in joint-variable coordinates for a three-joint robot arm. The feasibility of implementing the adaptive control in Cartesian coordinates using present-day low-cost microprocessors is discussed.

14.
A parallel computation model is an abstraction of the performance characteristics of parallel computers, and should evolve with the development of computational infrastructure. Heterogeneous CPU/Graphics Processing Unit (GPU) systems have been, and will remain, important platforms for scientific computing, which introduces an urgent demand for new parallel computation models targeting this kind of supercomputer. In this research, we propose a parallel computation model called HLognGP to abstract the computation and communication features of heterogeneous platforms like TH-1A. All the substantial parameters of HLognGP are in vector form and deal with the new features in GPU clusters. A simplified version of the proposed model, HLog3GP, is mapped to a specific GPU cluster and verified with two typical benchmarks. Experimental results show that HLog3GP outperforms the other two evaluated models and can model well the new particularities of GPU clusters. Copyright © 2015 John Wiley & Sons, Ltd.

15.
Conclusion. We have considered dynamic models of parallel computations and transformation algorithms for macropipelined programs that increase the internal exchange asynchronism of the parallel components. The models generalize the well-known CSP synchronous exchange paradigm (the rendezvous mechanism) [18] and provide a theoretical justification for the development of a whole range of interconnected semantic models of asynchronous computation with increasing degrees of exchange asynchronism. Some publications ensure asynchronous exchange only by buffering [15, 16]. The dynamic model approach developed in this study combines buffering with analysis of the data dependences of the exchange operators. It not only reduces losses associated with synchronization of communicating parallel processes, but also ensures automatic resolution of some classes of data exchange deadlocks. Experience with dynamic models as a means of increasing the efficiency of macropipelined programs and multilevel memory in multiprocessor systems [12, 17] can be applied in other programming systems and parallel program design systems of both compiling (S1, S2) and interpreting (S3) type. Translated from Kibernetika i Sistemnyi Analiz, No. 6, pp. 45–65, November–December, 1995.

16.
Association discovery from a transaction collection is an important data-mining task. We study a new problem in this area whose solution can provide users with valuable association rules in some relevant collections: association discovery in derivative transaction collections. In this problem, we are given association rules in two transaction collections D_1 and D_2, and aim to find new association rules in the derivative transaction collections D_1 ∪ D_2, D_1 ∩ D_2, D_2 − D_1 and D_1 − D_2. Direct application of existing algorithms can solve this problem, but in an expensive way. We propose an efficient solution that makes full use of already discovered information, takes advantage of the relationships existing among the relevant collections, and avoids unnecessary but expensive support-counting operations that scan the databases. Experiments on well-known synthetic data show that our solution consistently outperforms the naive solution by factors from 2 to 3 in most cases. We also propose an efficient parallelization of our approach, as parallel algorithms are often interesting and necessary in the area of data mining. Received August 1998 / Revised April 1999 / Accepted in revised form September 1999

17.
In this paper we describe optimal processor-time parallel algorithms for set operations on quadtrees, such as union, intersection and comparison. The algorithms presented in this paper run in O(log N) time using N/log N processors on a shared memory model of computation that allows concurrent reads or writes. Consequently, they allow us to achieve optimal speedups using any number of processors up to N/log N. The approach can also be used to derive optimal processor-time parallel algorithms for weaker models of parallel computation.
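The sequential form of these quadtree set operations is a simple structural recursion, sketched below with my own encoding ('B' for a full/black region, 'W' for empty/white, a 4-tuple for an internal node); the paper's contribution is performing the recursion in O(log N) parallel time.

```python
def qt_union(a, b):
    """Union of two quadtree regions, merging uniform results."""
    if a == 'B' or b == 'B':
        return 'B'
    if a == 'W':
        return b
    if b == 'W':
        return a
    kids = tuple(qt_union(x, y) for x, y in zip(a, b))
    return 'B' if all(k == 'B' for k in kids) else kids

def qt_intersect(a, b):
    """Intersection of two quadtree regions, merging uniform results."""
    if a == 'W' or b == 'W':
        return 'W'
    if a == 'B':
        return b
    if b == 'B':
        return a
    kids = tuple(qt_intersect(x, y) for x, y in zip(a, b))
    return 'W' if all(k == 'W' for k in kids) else kids
```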

18.
19.
We give two parallel algorithms for sequence comparison on the Connection Machine 2 (CM-2). The specific comparison measure we compute is the edit distance: given a finite alphabet Σ and two input sequences X ∈ Σ+ and Y ∈ Σ+, the edit distance d(X,Y) is the minimum cost of transforming X into Y via a series of weighted insertions, deletions and substitutions of characters. The edit distance comparison measure is equivalent to or subsumes a broad range of well-known sequence comparison measures. The CM-2 is very fast at performing parallel prefix operations. Our contribution consists of casting the problem in terms of these operations. Our first algorithm computes d(X,Y) using N processors and O(MS) time units, where M = min(|X|, |Y|) + 1, N = max(|X|, |Y|) + 1, and S is the time required for a parallel prefix operation. The second algorithm computes d(X,Y) using NM processors and O((log N log M)(S + R)) time units, where R is the time for a 'router' communication step, one in which each processor is able to read data, in parallel, from the memory of any other processor. Our algorithms can also be applied to several variants of the problem, such as subsequence comparisons, and one-many and many-many comparisons on 'sequence databases'.
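The quantity being computed is the standard weighted edit distance, whose sequential O(|X||Y|) dynamic program is shown below; the paper's contribution is recasting this recurrence as parallel prefix operations for the CM-2.

```python
def edit_distance(x, y, ins=1, dele=1, sub=1):
    """Weighted edit distance d(X,Y): minimum total cost of insertions,
    deletions and substitutions transforming X into Y."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele                 # delete all of X's prefix
    for j in range(1, n + 1):
        d[0][j] = j * ins                  # insert all of Y's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # delete x[i-1]
                          d[i][j - 1] + ins,       # insert y[j-1]
                          d[i - 1][j - 1] + cost)  # match/substitute
    return d[m][n]
```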

20.
In this paper, we present a novel volumetric mesh representation suited for parallel computing on modern GPU architectures. The data structure is based on a compact, ternary sparse matrix storage of boundary operators. Boundary operators correspond to the first-order top-down relations of k-faces to their (k − 1)-face facets. The compact, ternary matrix storage format is based on compressed sparse row matrices with signed indices and allows for efficient parallel computation of indirect and bottom-up relations. This representation is then used in the implementation of several parallel volumetric mesh algorithms, including Laplacian smoothing and volumetric Catmull–Clark subdivision. We compare these algorithms with their counterparts based on OpenVolumeMesh and achieve speedups from 3× to 531×, for sufficiently large meshes, while reducing memory consumption by up to 36%.
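The signed-index idea can be sketched for the lowest-dimensional case, triangles and their edges: each triangle row of the boundary operator lists its three edge indices, with the sign recording whether the edge is traversed with or against its canonical orientation. The encoding below (1-based indices so the sign of index 0 is never ambiguous) is illustrative, not the paper's actual storage format.

```python
def triangle_boundary(triangles):
    """Build a signed, row-compressed boundary operator from triangles
    (2-faces) to their edges (1-faces).  Returns the per-triangle rows
    of signed 1-based edge indices and the edge-index dictionary."""
    edge_index = {}
    rows = []
    for a, b, c in triangles:
        row = []
        for u, v in ((a, b), (b, c), (c, a)):
            key = (min(u, v), max(u, v))           # canonical edge orientation
            idx = edge_index.setdefault(key, len(edge_index))
            sign = 1 if u < v else -1              # with or against canonical
            row.append(sign * (idx + 1))           # 1-based signed index
        rows.append(row)
    return rows, edge_index
```

Transposing this sparse structure yields the bottom-up (edge-to-triangle) relation, which is the kind of indirect query the compact format is designed to answer in parallel.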
