Similar literature
Found 20 similar documents; search took 31 ms
1.
We present a parallel algorithm for computing an optimal sequence alignment in efficient space. The algorithm is intended for a message-passing architecture with one-dimensional-array topology. The algorithm computes an optimal alignment of two sequences of lengths $M$ and $N$ in $O((M+N)^2/P)$ time and $O((M+N)/P)$ space per processor, where the number of processors is $P \ge \max(M, N)$. Thus, when $P = \max(M, N)$ it achieves linear speedup and requires constant space per processor. Some experimental results on an Intel hypercube are provided. This research was supported by NIH Grant LM05110 from the National Library of Medicine.
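
As a point of reference for what "efficient space" means in alignment, the following is a minimal sequential sketch that computes only the optimal global alignment score while keeping two rows of the dynamic-programming table. It is not the parallel algorithm of the abstract; the function name `alignment_score` and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions chosen for illustration.

```python
def alignment_score(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of a and b using O(len(b)) space.

    Assumed scoring scheme: linear gap penalty, fixed match/mismatch scores.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned with a gap
            left = curr[j - 1] + gap  # b[j-1] aligned with a gap
            curr.append(max(diag, up, left))
        prev = curr  # only the previous row is ever needed
    return prev[-1]
```

Recovering the alignment itself in linear space needs an additional divide-and-conquer step, which this sketch deliberately leaves out.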

2.
The problem of routing packets on $n_1 \times \dots \times n_r$ mesh-connected arrays or grids of processors is studied. The focus of this paper is on permutation routing where each processor contains exactly one packet initially and finally. A slight modification of permutation routing called balanced routing is also discussed. For two-dimensional grids a deterministic routing algorithm is given for $n \times n$ meshes where each processor has a buffer of size $f(n) < n$. It needs $2n + O(n/f(n))$ steps on grids without wrap-arounds. Hence, it is asymptotically nearly optimal, and as good as randomized algorithms routing data only with high probability. Furthermore, it is demonstrated that on $r$-dimensional cubes of processors permutation routing can be performed asymptotically by $(2r-2)n$ steps, which is faster than the running times of so-far known randomized algorithms and of deterministic algorithms. Partially supported by Siemens AG, München.

3.
Spatial image warping is useful for image processing and graphics. In this paper, we present concurrent-read-exclusive-write (CREW) and exclusive-read-exclusive-write (EREW) parallel-random-access-machine (PRAM) algorithms that achieve O(1) asymptotic run time. The significant result is the creative processor assignment that results in an EREW PRAM algorithm. The forward algorithm calculates any nonscaling affine transform including arbitrary skewings, translations, and rotations. The EREW algorithm is the most efficient in practice, and the MasPar MP-1 with 16K processors rotates a 4-million-element image in under a second and a 2-million-element volume in one-half of a second. This high performance allows interactive viewing of volumes from arbitrary viewpoints and illustrates linear speedup. This practical efficiency is analyzed and illustrated by using a bridging model of computation. We develop the mixed cost communication machine (MCCM) to quantify the communication costs and correlate these costs to the MasPar MP-1. The forward algorithm has provable N = 1 congestion on the MCCM, while the backward algorithm has congestion N > 1 which varies with the transform. There are also important quality advantages using our direct warping techniques; empirical measurements are given to provide comparisons to multipass warps.

4.
Parallel merge sort is useful for sorting a large quantity of data progressively. The merge sort should be parallelized carefully since the conventional algorithm has poor performance due to the successive reduction of the number of participating processors by half, and down to one in the last merging stage. The proposed load-balanced merge sort utilizes all processors throughout the computation. It evenly distributes data to all processors in each stage. Thus every processor is forced to work in all phases. Significant performance enhancement has been achieved, up to a speedup of $(P-1)/\log P$, where $P$ is the number of processors. Experimental results demonstrate a speedup of 9.6 (upper bound of 10.7) on a 32-processor Cray T3E when sorting 4M 32-bit integers, and a speedup of 2.3 (upper bound of 2.8) on an 8-node PC cluster.

5.
Massively parallel computers are becoming quite common in computational fluid dynamics. In this study, a parallel algorithm for a 3-D primitive-equation coastal ocean circulation model is designed on the hypercube MIMD computer architecture. The grid is partitioned using one-dimensional domain decomposition. The code is tested in a uniform rectangular grid problem for which the model domain in each node is a cube. For the problem where the grain size ($n_y$) is fixed, the speedup is linear and is close to ideal for P 8 processors. The overhead ($F_C$) increases as the number of processors increases. The background overhead is inversely proportional to the size of the grain. The slope of $F_C$ vs. $P$ is a measure of the fraction of non-parallel code. For the problem where the domain is fixed, the speedup is 7.8 using 8 processors and 29.6 using 32 processors. The overhead increases linearly with $P$. The slope of $F_C$ is a measure of the communication cost. The load balancing problem is examined for a model of the Gulf of California whose computational domain is irregular.

6.
Consider $n^2$ processors arranged in an $n \times n$ torus network in which each processor is connected by direct communication channels with its four neighbours. This paper studies the following verification problem on anonymous $n \times n$ torus networks: verify whether the network is oriented; that is, verify whether there is an agreement, among all processors, on a consistent channel labelling. The problem is to be solved by a distributed algorithm executed by the processors themselves. If processors can label their channels arbitrarily, then there are network labellings that are not oriented but, to the processors, are indistinguishable from ones that are oriented. Hence there is no deterministic distributed verification algorithm. However, a verification algorithm does exist if the initial labellings are suitably restricted. We describe the restrictions placed on the initial labellings by subsets of the permutation group $S_4$. We show that the existence of an algorithm for verification is equivalent to the existence of certain tilings of the torus with Wang tiles. Using this equivalence, we have determined the existence of a distributed algorithm for the verification problem for all $n \times n$ torus networks for an important class of restrictions, the subgroups of $S_4$.

7.
Xiaotie Deng, Binhai Zhu. Algorithmica, 1999, 24(3-4): 270-286
We present a randomized algorithm for computing the Voronoi diagram of line segments using coarse-grained parallel machines. Operating on $P$ processors, for any input of $n$ line segments, this algorithm performs $O((n \log n)/P)$ local operations per processor, $O(n/P)$ messages per processor, and $O(1)$ communication phases, with high probability for $n = \Omega(P^{3+\varepsilon})$. Received June 1, 1997; revised March 10, 1998.

8.
In this work, image-space-parallel direct volume rendering (DVR) of unstructured grids is investigated for distributed-memory architectures. A hypergraph-partitioning-based model is proposed for the adaptive screen partitioning problem in this context. The proposed model aims to balance the rendering loads of processors while trying to minimize the amount of data replication. In the parallel DVR framework we adopted, each data primitive is statically owned by its home processor, which is responsible for replicating its primitives on other processors. Two appropriate remapping models are proposed by enhancing the above model for use within this framework. These two remapping models aim to minimize the total volume of communication in data replication while balancing the rendering loads of processors. Based on the proposed models, a parallel DVR algorithm is developed. The experiments conducted on a PC cluster show that the proposed remapping models achieve better speedup values compared to the remapping models previously suggested for image-space-parallel DVR.

9.
Visual and interactive data exploration requires fast and reliable tools for embedding an original data space in 3- (or 2-) dimensional Euclidean space. Multidimensional scaling (MDS) is a good candidate. However, owing to at least $O(M^2)$ memory and time complexity, MDS is computationally demanding for interactive visualization of data sets consisting of on the order of $10^4$ objects on computer systems ranging from a PC with a multicore CPU and a graphics processing unit (GPU) board to midrange MPI clusters. To interactively explore data sets of that size, we have developed novel efficient parallel algorithms for MDS mapping based on virtual particle dynamics. We demonstrate that the performance of our MDS algorithms implemented in the Compute Unified Device Architecture (CUDA) environment on a PC equipped with a modern GPU board (Tesla M2090, GeForce GTX 480) is considerably better than that of their MPI/OpenMP parallel implementation on a modern midrange professional cluster (10 nodes, each equipped with 2x Intel Xeon X5670 CPUs). We also show that the hybridized two-level MPI/CUDA implementation, run on a cluster of GPU nodes, can additionally provide a linear speedup. Copyright 2013 John Wiley & Sons, Ltd.

10.
An optimal parallel algorithm for volume ray casting   Total citations: 3, self-citations: 0, citations by others: 3
Volume rendering by ray casting is computationally expensive. For interactive volume visualization, rendering must be done in real time (30 frames/s). Since the typical size of a 3D dataset is $256^3$, parallel processing is imperative. In this paper, we present an $O(\log n)$ EREW algorithm for volume rendering. We use $O(n^3)$ processors that can be optimized to $O(\log^3 n)$ time with $O(n^3/\log^3 n)$ processors. We have implemented our algorithm on a MasPar MP-1. The implementation results show that a frame of size $256^3$ is generated in 11 s by 4096 processors. This time can be further reduced by the use of a larger number of processors.

11.
We present an optimal parallel algorithm for computing a cycle separator of an $n$-vertex embedded planar undirected graph in $O(\log n)$ time on $n/\log n$ processors. As a consequence, we also obtain an improved parallel algorithm for constructing a depth-first search tree rooted at any given vertex in a connected planar undirected graph in $O(\log^2 n)$ time on $n/\log n$ processors. The best previous algorithms for computing depth-first search trees and cycle separators achieved the same time complexities, but with $n$ processors. Our algorithms run on a parallel random access machine that permits concurrent reads and concurrent writes in its shared memory and allows an arbitrary processor to succeed in case of a write conflict. A preliminary version of this paper appeared as "Improved Parallel Depth-First Search in Undirected Planar Graphs" in the Proceedings of the Third Workshop on Algorithms and Data Structures, 1993, pp. 407–420. Supported in part by NSF Grant CCR-9101385.

12.
We present a fast parallel algorithm for computing the dominators of a directed acyclic graph. The model of computation used is a parallel random access machine that allows simultaneous reads but prohibits simultaneous writes into the same memory location. Let $P_t(n)$ be the processor complexity of computing the transitive closure of an $n$-vertex directed graph on this model. The only known parallel algorithm for dominators requires $O(\log^2 n)$ time and uses $O(n P_t(n))$ processors. Our algorithm for dominators has the same time complexity but uses $O(P_t(n))$ processors, thereby improving the processor complexity of the previously known algorithm by a factor of $n$.

13.
K. Diks, A. Pelc. Algorithmica, 2000, 28(1): 37-50
We consider broadcasting among $n$ processors, $f$ of which can be faulty. A fault-free processor, called the source, holds a piece of information which has to be transmitted to all other fault-free processors. We assume that the fraction $f/n$ of faulty processors is bounded by a constant $\gamma < 1$. Transmissions are fault free. Faults are assumed to be of the crash type: faulty processors do not send or receive messages. We use the whispering model: pairs of processors communicating in one round must form a matching. A fault-free processor sending a message to another processor becomes aware of whether this processor is faulty or fault free and can adapt future transmissions accordingly. The main result of the paper is a broadcasting algorithm working in $O(\log n)$ rounds and using $O(n)$ messages of logarithmic size, in the worst case. This is an improvement of the result from [17] where $O((\log n)^2)$ rounds were used. Our method also gives the first algorithm for adaptive distributed fault diagnosis in $O(\log n)$ rounds. Received May 1997; revised May 1998.
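
To see why a logarithmic number of rounds is the natural target in the whispering model, the sketch below simulates only the simplest fault-free case, in which every informed processor is matched with one uninformed processor per round. It is not the fault-tolerant algorithm of the paper; the function name `doubling_broadcast_rounds` is hypothetical.

```python
def doubling_broadcast_rounds(n: int) -> int:
    """Rounds needed to inform n fault-free processors when, in each round,
    every informed processor sends to one distinct uninformed processor,
    so that the communicating pairs form a matching."""
    if n < 1:
        raise ValueError("n must be at least 1")
    informed = 1  # initially only the source holds the message
    rounds = 0
    while informed < n:
        # The number of informed processors at most doubles per round.
        informed += min(informed, n - informed)
        rounds += 1
    return rounds
```

Because the informed set can at most double in each round, this fault-free count is also a lower bound on the number of rounds for any algorithm in the model.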

14.
Parallelizing the Data Cube   Total citations: 1, self-citations: 0, citations by others: 1
This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting. The bottom-up partitioning strategy balances the number of single attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared disk type parallel machine composed of $p$ processors connected via an interconnection fabric and with access to a shared parallel disk array. We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of $p$.

15.
We consider multimessage multicasting over the $n$ processor complete (or fully connected) static network (MMC). First we present a linear time algorithm that constructs for every degree $d$ problem instance a communication schedule with total communication time at most $d^2$, where $d$ is the maximum number of messages that each processor may send or receive. Then we present degree $d$ problem instances such that all their communication schedules have total communication time at least $d^2$. We observe that our lower bound applies when the fan-out (maximum number of processors receiving any given message) is huge, and thus the number of processors is also huge. Since this environment is not likely to arise in the near future, we turn our attention to the study of important subproblems that are likely to arise in practice. We show that when each message has fan-out $k = 1$ the MMC problem corresponds to the makespan openshop preemptive scheduling problem, which can be solved in polynomial time, and show that for $k \ge 2$ our problem is NP-complete and remains NP-complete even when forwarding is allowed. We present an algorithm to generate a communication schedule with total communication time $2d-1$ for any degree $d$ problem instance with fan-out $k = 2$. Our main result is an $O(q \cdot d \cdot e)$ time algorithm, where $e \le nd$ (the input length), with an approximation bound of $qd + k^{1/q}(d-1)$, for any integer $q$ such that $k > q \ge 2$. Our algorithms are centralized and require all the communication information ahead of time. Applications where all of this information is readily available include iterative algorithms for solving linear equations, and most dynamic programming procedures. The Meiko CS-2 machine and computer systems with processors communicating via dynamic permutation networks whose basic switches can act as data replicators (e.g., an $n$ by $n$ Benes network with 2 by 2 switches that can also act as data replicators) will also benefit from our results at the expense of doubling the number of communication phases.

16.
A systolic algorithm is described for generating all permutations of n elements in lexicographic order. The algorithm is designed to be executed on a linear array of n processors, each having constant size memory, and each being responsible for producing one element of a given permutation. There is a constant delay per permutation, leading to an O(n!) time solution. This is an improvement over the best previously known techniques in two respects: the algorithm runs on the (arguably) weakest model of parallel computation, and is cost optimal (assuming the time to output the permutations is counted). The algorithm is extended to run adaptively, i.e., when the number of available processors is other than n.
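
For readers unfamiliar with lexicographic generation, the following sequential sketch uses the standard next-permutation step to enumerate all permutations of 1..n in lexicographic order. It only illustrates the output order; it is not the systolic algorithm, and the name `lexicographic_permutations` is hypothetical.

```python
from typing import Iterator, List


def lexicographic_permutations(n: int) -> Iterator[List[int]]:
    """Yield every permutation of 1..n in lexicographic order."""
    perm = list(range(1, n + 1))
    while True:
        yield perm.copy()
        # Find the rightmost position i whose element is smaller than its successor.
        i = n - 2
        while i >= 0 and perm[i] >= perm[i + 1]:
            i -= 1
        if i < 0:
            return  # perm is in descending order, so it was the last permutation
        # Find the rightmost element larger than perm[i] and swap the two.
        j = n - 1
        while perm[j] <= perm[i]:
            j -= 1
        perm[i], perm[j] = perm[j], perm[i]
        # Reverse the suffix so that it becomes the smallest possible arrangement.
        perm[i + 1:] = reversed(perm[i + 1:])
```

Each step here can take time proportional to n in the worst case, whereas the abstract describes a constant delay per permutation on n processors.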

17.
In this paper, we develop load balancing strategies for scalable high-performance parallel A* algorithms suitable for distributed-memory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algorithm) grow with the number of processors $P$ used, thus restricting its scalability. To alleviate this effect, we propose a novel parallel startup phase and an efficient dynamic load balancing strategy called the quality equalizing (QE) strategy. Our new parallel startup scheme executes optimally in $\Theta(\log P)$ time and, in addition, achieves good initial load balance. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that enable it to significantly reduce starvation and nonessential work. Consequently, we obtain a highly scalable parallel A* algorithm with an almost-linear speedup. The startup and load balancing schemes were employed in parallel A* algorithms to solve the Traveling Salesman Problem on an nCUBE2 hypercube multicomputer. The QE strategy yields average speedup improvements of about 20-185% and 15-120% at low and intermediate work densities (the ratio of the problem size to $P$), respectively, over three well-known load balancing methods: the round-robin (RR), the random communication (RC), and the neighborhood averaging (NA) strategies. The average speedup observed on 1024 processors is about 985, representing a very high efficiency of 0.96. Finally, we analyze and empirically evaluate the scalability of parallel A* algorithms in terms of the isoefficiency metric. Our analysis gives (1) a $\Theta(P \log P)$ lower bound on the isoefficiency function of any parallel A* algorithm, and (2) a general expression for the upper bound on the isoefficiency function of our parallel A* algorithm using the QE strategy on any topology; for the hypercube and 2-D mesh architectures the upper bounds on the isoefficiency function are found to be $\Theta(P \log^2 P)$ and $\Theta(P$[formula]$)$, respectively. Experimental results validate our analysis, and also show that parallel A* search has better scalability using the QE load balancing strategy than using the RR, RC, or NA strategies.
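
The efficiency figure quoted in this abstract follows from the usual definition, efficiency = speedup / P. A small sketch (the function name `parallel_efficiency` is hypothetical) reproduces the arithmetic:

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Parallel efficiency: the achieved speedup divided by the processor count."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Speedup of about 985 on 1024 processors, as reported in the abstract.
print(round(parallel_efficiency(985, 1024), 2))  # 0.96
```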

18.
A Note on Parallel Selection on Coarse-Grained Multicomputers   Total citations: 1, self-citations: 0, citations by others: 1
Consider the selection problem of determining the $k$th smallest element of a set of $n$ elements. Under the CGM (coarse-grained multicomputer) model with $p$ processors and $O(n/p)$ local memory, we present a deterministic parallel algorithm for the selection problem that requires $O(\log p)$ communication rounds. Besides requiring a low number of communication rounds, the algorithm also attempts to minimize the total amount of data transmitted in each round (only $O(p)$ except in the last round). In addition to showing theoretical complexities, we present very promising experimental results obtained on a parallel machine that show almost linear speedup, indicating the efficiency and scalability of the proposed algorithm. Received June 1, 1997; revised March 10, 1998.
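
For orientation, the selection problem itself can be stated in a few lines of sequential code. The sketch below is a randomized quickselect, which is a different technique from the deterministic CGM algorithm of the abstract; the name `kth_smallest` and the 1-based convention for `k` are assumptions.

```python
import random
from typing import Sequence


def kth_smallest(items: Sequence[int], k: int) -> int:
    """Return the k-th smallest element of items (k is 1-based)."""
    if not 1 <= k <= len(items):
        raise ValueError("k must satisfy 1 <= k <= len(items)")
    candidates = list(items)
    while True:
        pivot = random.choice(candidates)
        smaller = [x for x in candidates if x < pivot]
        equal_count = sum(1 for x in candidates if x == pivot)
        if k <= len(smaller):
            # The answer lies among the elements below the pivot.
            candidates = smaller
        elif k <= len(smaller) + equal_count:
            return pivot
        else:
            # Discard the pivot and everything below it, then adjust the rank.
            k -= len(smaller) + equal_count
            candidates = [x for x in candidates if x > pivot]
```

Each iteration discards at least the pivot, so the loop always terminates; the expected running time is linear in the number of elements.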

19.
A new algorithm for interactive graphics on multicomputers   Total citations: 1, self-citations: 0, citations by others: 1
As nonshared-memory multiple instruction, multiple data (MIMD) systems become more common, it becomes important to develop parallel rendering algorithms for them. These systems, known as multicomputers, can produce data sets so large that it is difficult to visualize the data on conventional graphics systems, especially if the visualization proceeds in tandem with the calculation. Parallel systems must run interactive graphics to allow convenient visualizations of their computations. While few parallel systems currently have a frame buffer that will support interactive rendering, such systems should be more common in the future. This article describes an algorithm suited for interactive polygon rendering, where the model's image on screen generally has frame-to-frame coherence. The algorithm uses this coherence to perform load-balancing calculations in parallel with the other calculations. The algorithm also uses an optimized version of personalized all-to-all communication, where all processors communicate with all other processors.

20.
The problem of computing the empirical cumulative distribution function (ECDF) of $N$ points in $k$-dimensional space has been studied and motivated recently by Bentley [1], whose solution uses recursive multidimensional divide-and-conquer. In this paper, the problem is treated as a generalization of the problem of computing the inversion of a permutation. An algorithm of Knuth [3] is then extended to yield an $O(kN(\log_2 N)^{k-1})$ solution to the ECDF problem, which is comparable to Bentley's solution. Neither solution approaches the $O(kN \log_2 N)$ lower bound, and they are worse than the $O(kN^2)$ "brute force" algorithm for large $k$. The new algorithm, however, has the advantage of being highly parallel so that a fast solution exists with parallel processors.
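
The $O(kN^2)$ brute-force baseline mentioned in this abstract can be sketched directly: compare every pair of points coordinate by coordinate. The convention used here (for each point, count the other points that are less than or equal to it in every coordinate) and the name `ecdf_counts_brute_force` are assumptions for illustration; the paper may normalize or define the counts differently.

```python
from typing import List, Sequence


def ecdf_counts_brute_force(points: Sequence[Sequence[float]]) -> List[int]:
    """For each point, count the other points it dominates, i.e. the points
    that are <= it in every coordinate. Takes O(k * N^2) comparisons."""
    counts = []
    for i, p in enumerate(points):
        dominated = 0
        for j, q in enumerate(points):
            # all(...) performs up to k coordinate comparisons per pair.
            if i != j and all(qc <= pc for qc, pc in zip(q, p)):
                dominated += 1
        counts.append(dominated)
    return counts
```

For example, with the points (1, 1), (2, 2), (3, 1) and (0, 5), the point (2, 2) dominates only (1, 1), and (0, 5) dominates none of the others.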
