Similar Documents (20 results)
1.
We present a randomized parallel list ranking algorithm for distributed memory multiprocessors, using a BSP-type model. We first describe a simple version which requires, with high probability, log(3p) + log ln(n) = Õ(log p + log log n) communication rounds (h-relations with h = Õ(n/p)) and Õ(n/p) local computation. We then outline an improved version that requires, with high probability, only r ≤ (4k + 6) log(2/3 p) + 8 = Õ(k log p) communication rounds, where k = min{i ≥ 0 | ln^(i+1) n ≤ (2/3 p)^(2^(i+1))}. Note that k(n) is an extremely small number. For n and p ≥ 4, the value of k is at most 2. Hence, for a given number of processors p, the number of communication rounds required is, for all practical purposes, independent of n. For n ≤ 1,500,000 and 4 ≤ p ≤ 2048, the number of communication rounds in our algorithm is bounded, with high probability, by 78, but the actual number of communication rounds observed so far is 25 in the worst case. For n ≤ 10^100 and 4 ≤ p ≤ 2048, the number of communication rounds in our algorithm is bounded, with high probability, by 118; and we conjecture that the actual number of communication rounds required will not exceed 50. Our algorithm requires a considerably smaller number of communication rounds than the list ranking algorithm used in Reid-Miller's empirical study of parallel list ranking on the Cray C-90.(1) To our knowledge, Reid-Miller's algorithm(1) was the fastest list ranking implementation so far. Therefore, we expect that our result will have considerable practical relevance.
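
For orientation, here is a minimal sequential sketch of pointer jumping, the classic O(log n)-round idea that BSP/CGM list-ranking algorithms such as this one refine to cut the number of communication rounds; the function name and list representation are illustrative, not the paper's.

```python
def list_rank(succ):
    """succ[i] = index of i's successor (the tail points to itself).
    Returns rank[i] = number of links from i to the tail."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    nxt = list(succ)
    # Each jumping round halves remaining distances: O(log n) rounds.
    # In a BSP/CGM setting each round is one h-relation over p processors.
    for _ in range(n.bit_length()):
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    return rank

# The list 0 -> 1 -> 2 -> 3 (tail): ranks are 3, 2, 1, 0.
print(list_rank([1, 2, 3, 3]))
```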

2.
A Note on Parallel Selection on Coarse-Grained Multicomputers
Consider the selection problem of determining the k-th smallest element of a set of n elements. Under the CGM (coarse-grained multicomputer) model with p processors and O(n/p) local memory, we present a deterministic parallel algorithm for the selection problem that requires O(log p) communication rounds. Besides requiring a low number of communication rounds, the algorithm also attempts to minimize the total amount of data transmitted in each round (only O(p) except in the last round). In addition to showing theoretical complexities, we present very promising experimental results obtained on a parallel machine that show almost linear speedup, indicating the efficiency and scalability of the proposed algorithm. Received June 1, 1997; revised March 10, 1998.
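
As a point of reference, here is a minimal sequential sketch of deterministic selection by median of medians; the CGM algorithm distributes this style of pruning across p processors so that each round exchanges only O(p) data. The names and group size are illustrative assumptions, not the paper's code.

```python
def select(xs, k):
    """Return the k-th smallest element of xs (1-based k)."""
    if len(xs) <= 5:
        return sorted(xs)[k - 1]
    # Median of each group of 5, then the median of those medians.
    medians = [sorted(xs[i:i + 5])[len(xs[i:i + 5]) // 2]
               for i in range(0, len(xs), 5)]
    pivot = select(medians, (len(medians) + 1) // 2)
    lo = [x for x in xs if x < pivot]    # strictly below the pivot
    hi = [x for x in xs if x > pivot]    # strictly above the pivot
    eq = len(xs) - len(lo) - len(hi)     # elements equal to the pivot
    if k <= len(lo):
        return select(lo, k)
    if k <= len(lo) + eq:
        return pivot
    return select(hi, k - len(lo) - eq)

print(select([9, 1, 7, 3, 5, 8, 2, 6, 4], 4))  # 4
```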

3.
The dynamic lot-sizing model (DLS) is one of the most frequently used models in production and inventory systems because lot decisions can greatly affect the performance of the system. The practicality of DLS algorithms is hindered by the huge amount of computer resources required for solving these models, even for a modest problem. This study developed a parallel algorithm to solve the lot-sizing problem efficiently. Given that n is the size of the problem, the complexity of the proposed parallel algorithm is O(n^2/p) with p processors. Numerical experiments are provided to verify the complexity of the proposed algorithm. The empirical results demonstrate that the speedup of this parallel algorithm approaches linearity, which means that the proposed algorithm can take full advantage of the distributed computing power as the size of the problem increases.
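
For context, a compact sketch of the classical O(n^2) Wagner-Whitin dynamic program for dynamic lot sizing, the kind of recurrence a parallel algorithm distributes to reach O(n^2/p); the cost parameters K (setup) and h (unit holding cost per period) are illustrative, not taken from the paper.

```python
def lot_size(demand, K, h):
    """Minimum total setup + holding cost to meet demand[0..n-1]."""
    n = len(demand)
    INF = float("inf")
    f = [0.0] + [INF] * n              # f[t] = min cost for periods < t
    for t in range(1, n + 1):
        for j in range(t):             # last order placed in period j
            # Holding cost: demand[i] ordered in j is held i-j periods.
            # (Prefix sums would make this inner sum O(1), giving O(n^2).)
            hold = sum(h * (i - j) * demand[i] for i in range(j, t))
            f[t] = min(f[t], f[j] + K + hold)
    return f[n]

print(lot_size([20, 50, 10, 50, 50], K=100, h=1))  # cost of this 5-period instance
```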

4.
We consider parallel heap operations on the exclusive-read exclusive-write parallel random-access machine. We first present an O(n/p + log p) time parallel algorithm to construct a heap of n elements using p processors, which is optimal for p = Θ(n/log n). We then propose a parallel root deletion algorithm. In a preparatory step, a data structure for dynamic processor allocation is constructed using O((n/log n)^(1 − 1/k)) processors in O(log k) time for some constant k, 1 ≤ k ≤ ⌈log(n/log n)⌉. A sequence of root deletions can then be performed, each of which takes O((log n)/p + log p + log log n) time using p processors. Finally, we discuss a parallel algorithm running in O((log n)/p + log p) time for inserting an element into a heap, which is optimal for p = Θ((log n)/log log n). Both deletion and insertion algorithms run in O(log log n) time when p = Θ((log n)/log log n).
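
As a sequential baseline, Floyd's bottom-up heap construction does O(n) work, which the paper's O(n/p + log p) parallel construction matches across p processors; this sketch is illustrative only and does not attempt the parallelization.

```python
def build_heap(a):
    """Turn list a into a min-heap in place, O(n) total work."""
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):   # sift down each internal node
        j = i
        while True:
            small = j
            for c in (2 * j + 1, 2 * j + 2):   # children of j
                if c < n and a[c] < a[small]:
                    small = c
            if small == j:
                break
            a[j], a[small] = a[small], a[j]
            j = small
    return a

print(build_heap([9, 4, 7, 1, 2, 6, 3]))  # min-heap: root is 1
```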

5.
Given two strings A and B of lengths n_a and n_b, respectively, the All-substrings Longest Common Subsequence (ALCS) problem obtains, for every substring B' of B, the length of the longest string that is a subsequence of both A and B'. The sequential algorithm for this problem takes O(n_a n_b) time and O(n_b) space. We present a parallel algorithm for the ALCS problem on the Coarse-Grained Multicomputer (BSP/CGM) model with p < √n_a processors that takes O(n_a n_b / p) time, O(log p) communication rounds, and O(n_b √n_a) space per processor. The proposed algorithm also solves the basic Longest Common Subsequence (LCS) problem, which finds the longest string (and not only its length) that is a subsequence of both A and B. To our knowledge, this is the best BSP/CGM algorithm in the literature for the LCS and ALCS problems.
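
For reference, the standard O(n_a n_b)-time, O(n_b)-space dynamic program for the LCS length, which the BSP/CGM algorithm parallelizes (ALCS additionally answers the query for every substring B'); a minimal sketch, not the paper's code.

```python
def lcs_length(A, B):
    """Length of the longest common subsequence of strings A and B."""
    prev = [0] * (len(B) + 1)          # row for the previous prefix of A
    for a in A:
        cur = [0]
        for j, b in enumerate(B, 1):
            cur.append(prev[j - 1] + 1 if a == b
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

print(lcs_length("ABCBDAB", "BDCABA"))  # 4 (e.g. "BCBA")
```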

6.
The communication and synchronization overhead inherent in parallel processing can lead to situations where adding processors to the solution method actually increases execution time. Problem type, problem size, and architecture type all affect the optimal number of processors to employ. In this paper we examine the numerical solution of an elliptic partial differential equation in order to study the relationship between problem size and architecture. The equation's domain is discretized into n^2 grid points which are divided into partitions and mapped onto the individual processor memories. We analytically quantify the relationships among grid size, stencil type, partitioning strategy, processor execution time, and communication network type. In doing so, we determine the optimal number of processors to assign to the solution (and hence the optimal speedup), and identify (i) the smallest grid size which fully benefits from using all available processors, (ii) the leverage on performance given by increasing processor speed or communication network speed, and (iii) the suitability of various architectures for large numerical problems. Finally, we compare the predictions of our analytic model with measurements from a multiprocessor and find that the model accurately predicts performance.
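
A toy version of this kind of analytic model is sketched below: per-iteration time for an n × n grid split across p processors is modeled as local stencil work plus halo exchange plus synchronization, and the optimal p is found numerically. The functional form and the constants a, b, c are illustrative assumptions, not the paper's calibration.

```python
import math

def t_iter(p, n, a=1e-7, b=1e-6, c=1e-5):
    """Modeled time per iteration on p processors (assumed cost model)."""
    compute = a * n * n / p                    # local stencil updates
    boundary = b * n / math.sqrt(p)            # halo exchange, 2-D partitions
    sync = c * math.log2(p) if p > 1 else 0.0  # synchronization overhead
    return compute + boundary + sync

def best_p(n, p_max=4096):
    """Processor count minimizing modeled iteration time."""
    return min(range(1, p_max + 1), key=lambda p: t_iter(p, n))

for n in (256, 1024, 4096):
    p = best_p(n)
    print(f"n={n}: optimal p ~ {p}, speedup ~ {t_iter(1, n) / t_iter(p, n):.1f}")
```

Past the optimum, the communication terms dominate and adding processors slows the solve, which is exactly the effect the abstract describes.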

7.
We study the scheduling situation where n tasks with identical processing times have to be scheduled on m parallel processors. Each task is subject to a release date and requires a fixed number of processors simultaneously. We show that, for each fixed value of m, the problem of minimizing total completion time can be solved in polynomial time. The complexity status of the corresponding problem P_m | r_i, p_i = p, size_i | ΣC_i was unknown before. Scope and purpose: There has been increasing interest in multiprocessor scheduling, i.e., in scheduling models where tasks require several processors (machines) simultaneously. Many scheduling problems fit this model and a large amount of research has been carried out on theoretical multiprocessor scheduling. In this paper we study the situation where tasks, subject to release dates, have identical processing times, and we introduce a dynamic programming algorithm that computes the minimum total completion time. Although this scheduling problem has been open in the literature for several years, our algorithm is simple and easy to understand.

8.
Xin He, Yaacov Yesha. Algorithmica 1990, 5(1-4): 129-145
We develop efficient parallel algorithms for the r-dominating set and the p-center problems on trees. On a concurrent-read exclusive-write PRAM, our algorithm for the r-dominating set problem runs in O(log n log log n) time with n processors. The algorithm for the p-center problem runs in O(log^2 n log log n) time with n processors.

9.
Trees are a useful data type, but they are not routinely included in parallel programming systems, in part because their irregular structure makes partitioning and scheduling difficult. We present a method for algebraically constructing implementations of tree skeletons, high-level homomorphic operations that execute in parallel. Many computations on binary trees can be performed in O(log n) parallel time using n processors, even taking account of communication costs. We extend these results to trees with arbitrary and variable degree. Then we show that it is possible to implement a distributed version of homomorphisms on binary trees, taking O(n/p + log^2 p) parallel time on p < n processors, for trees of any skew and taking full account of communication costs. Under slightly stronger restrictions on the underlying functions, this can be improved to O(n/p + log p). Furthermore, the technique for deriving distributed versions is algebraic, allowing the automatic generation of code for SPMD and data-parallel architectures.
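
To make the notion of a tree homomorphism concrete, here is a minimal sequential sketch of the algebraic interface: a single higher-order traversal that instances such as sum and height all specialize. The paper's contribution is evaluating such homomorphisms in O(n/p + log^2 p) parallel time, which this sketch does not attempt; all names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

@dataclass
class Node:
    val: int
    left: "Optional[Node]" = None
    right: "Optional[Node]" = None

def hom(t: Optional[Node], leaf: T, node: Callable[[T, int, T], T]) -> T:
    """Binary-tree homomorphism: replace each node by `node`, each
    empty subtree by `leaf`."""
    if t is None:
        return leaf
    return node(hom(t.left, leaf, node), t.val, hom(t.right, leaf, node))

t = Node(1, Node(2), Node(3, Node(4)))
print(hom(t, 0, lambda l, v, r: l + v + r))       # sum = 10
print(hom(t, 0, lambda l, v, r: 1 + max(l, r)))   # height = 3
```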

10.
11.
Traditionally, the block-based medial axis transform (BB-MAT) and the chessboard distance transform (CDT) were viewed as two completely different image computation problems, especially in three-dimensional (3D) space. In fact, some equivalent properties exist between them. The relationship between the two is first derived and proved in this paper. One of the significant properties is that the CDT of a 3D binary image V equals the BB-MAT of the image V', where V' denotes the inverse image of V. In a parallel algorithm, cost is defined as the product of the time complexity and the number of processors used. The main contribution of this work is to reduce the costs of the 3D BB-MAT and 3D CDT problems proposed by Wang [65]. Based on the reverse-dominance technique, which is redefined from the dominance concept, we achieve the computation of the 3D CDT problem by first implementing the 3D BB-MAT algorithm. For a 3D binary image of size N^3, our parallel algorithm runs in O(log N) time using N^3 processors on the concurrent-read exclusive-write (CREW) parallel random access machine (PRAM) model to solve the 3D BB-MAT and 3D CDT problems, respectively. The costs obtained are lower than those of Wang's results. To the best of our knowledge, these are the lowest-cost 3D BB-MAT and 3D CDT algorithms known. In parallel algorithms, the running time can be divided into computation time and communication time. The running, communication, and computation times for different problem sizes were measured on an HP Superdome with an SMP/CC-NUMA (symmetric multiprocessor / cache-coherent non-uniform memory access) architecture. We conclude that a parallel computer (i.e., an SMP/CC-NUMA architecture or a cluster system) is well suited to solving problems with large input sizes.
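
For intuition, here is a hedged two-dimensional sketch of the chessboard (L-infinity) distance transform using two raster scans; the paper's algorithms are three-dimensional and parallel, so this is illustrative only.

```python
def cdt(img):
    """Chessboard distance of each pixel to the nearest 0 pixel (2-D)."""
    n, m = len(img), len(img[0])
    INF = n + m
    d = [[0 if img[i][j] == 0 else INF for j in range(m)] for i in range(n)]
    # Forward pass: propagate from top-left neighbours.
    for i in range(n):
        for j in range(m):
            for di, dj in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < m:
                    d[i][j] = min(d[i][j], d[a][b] + 1)
    # Backward pass: propagate from bottom-right neighbours.
    for i in reversed(range(n)):
        for j in reversed(range(m)):
            for di, dj in ((1, 1), (1, 0), (1, -1), (0, 1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < m:
                    d[i][j] = min(d[i][j], d[a][b] + 1)
    return d

print(cdt([[0, 1, 1],
           [1, 1, 1],
           [1, 1, 0]]))  # [[0, 1, 2], [1, 1, 1], [2, 1, 0]]
```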

12.
The paper presents a Coarse-Grained Multicomputer algorithm that solves the problem of detection of repetitions. The algorithm can be implemented in the CGM model with P processors using O(P) communication steps. It is the first CGM algorithm for this problem. We also present experimental results showing that the CGM algorithm is very efficient.

13.
In this paper we present deterministic parallel algorithms for the coarse-grained multicomputer (CGM) and bulk synchronous parallel (BSP) models for solving the following well-known graph problems: (1) list ranking, (2) Euler tour construction in a tree, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition or open ear decomposition, and (7) 2-edge connectivity and biconnectivity (testing and component computation). The algorithms require O(log p) communication rounds with linear sequential work per round (p = number of processors, N = total input size). Each processor creates, during the entire algorithm, messages of total size O(log(p)(N/p)). The algorithms assume that the local memory per processor (i.e., N/p) is larger than p^ε, for some fixed ε > 0. Our results imply BSP algorithms with O(log p) supersteps, O(g log(p)(N/p)) communication time, and O(log(p)(N/p)) local computation time. It is important to observe that the number of communication rounds/supersteps obtained in this paper is independent of the problem size, and grows only logarithmically with respect to p. With growing problem size, only the sizes of the messages grow but the total number of messages remains unchanged. Due to the considerable protocol overhead associated with each message transmission, this is an important property. The result for Problem (1) is a considerable improvement over those previously reported. The algorithms for Problems (2)-(7) are the first practically relevant parallel algorithms for these standard graph problems. Received July 5, 2000; revised April 16, 2001.

14.
This paper proposes a parallel algorithm, called KDOP (K-Dimensional Optimal Parallel algorithm), to solve a general class of recurrence equations efficiently. The KDOP algorithm partitions the computation into a series of subcomputations, each of which is executed in the fashion that all the processors work simultaneously, with each one executing an optimal sequential algorithm to solve a subcomputation task. The algorithm solves the equations in O(N/p) steps in the EREW PRAM model (Exclusive Read Exclusive Write Parallel Random Access Machine model) using p ≤ N^(1−ε) processors, where N is the size of the problem and ε is a given constant. This is an optimal algorithm (its speedup is O(p)) in the case of p ≤ N^(1−ε). Such an optimal speedup for this problem was previously achieved only in the case of p ≤ N^0.5. The algorithm can be implemented on machines with multiple processing elements or on pipelined vector machines with parallel memory systems.
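
A minimal sketch of the block-partitioned idea for one common instance, the first-order linear recurrence x[i] = a[i]*x[i-1] + b[i]: each block composes its affine maps independently, a short prefix pass over the p block summaries stitches them together, and each block then replays locally. The partitioning and names are illustrative assumptions, not KDOP itself.

```python
def affine_compose(f, g):
    """(f after g) where each map is x -> m*x + c, stored as (m, c)."""
    mf, cf = f
    mg, cg = g
    return (mf * mg, mf * cg + cf)

def solve_recurrence(a, b, x0, p=4):
    """x[i] = a[i]*x[i-1] + b[i]; simulates p block scans plus a stitch."""
    n = len(a)
    size = (n + p - 1) // p
    blocks = [range(i, min(i + size, n)) for i in range(0, n, size)]
    # Phase 1 (parallel): each block composes its own maps left to right.
    summaries = []
    for blk in blocks:
        m = (1.0, 0.0)
        for i in blk:
            m = affine_compose((a[i], b[i]), m)
        summaries.append(m)
    # Phase 2 (prefix over block summaries): starting value of each block.
    starts, x = [], x0
    for m in summaries:
        starts.append(x)
        x = m[0] * x + m[1]
    # Phase 3 (parallel): each block replays its recurrence locally.
    out = [0.0] * n
    for blk, x in zip(blocks, starts):
        for i in blk:
            x = a[i] * x + b[i]
            out[i] = x
    return out

a = [2.0, 0.5, 1.0, 3.0, 1.0, 2.0]
b = [1.0, 0.0, 2.0, 1.0, 0.0, 1.0]
print(solve_recurrence(a, b, x0=1.0))  # [3.0, 1.5, 3.5, 11.5, 11.5, 24.0]
```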

15.
An Efficient Minimum Spanning Tree Algorithm on Message-Passing Parallel Machines
王光荣, 顾乃杰. 《软件学报》 2000, 11(7): 889-898
Based on the classical sequential Borůvka minimum spanning tree algorithm, we propose an efficient minimum spanning tree algorithm for message-passing parallel machines. Three techniques are used to improve the algorithm's efficiency: two-pass merging and packed contraction reduce the communication overhead, and balanced data distribution equalizes the computational load across the processors. The computation and communication complexities of the algorithm are O(n^2/p) and O((t_s p + t_w n)n/p), respectively. In practice, on the Dawning-1000 parallel machine, the algorithm achieves a speedup of 12 on 16 nodes for a sparse graph with 10,000 vertices.
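
A sequential Borůvka sketch for reference; the paper's contribution is the message-passing parallelization of this round structure (two-pass merging, packed contraction, balanced data distribution), which this sketch does not include. It assumes a connected graph with distinct edge weights.

```python
def boruvka(n, edges):
    """edges: list of (w, u, v). Returns the total MST weight."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    total, comps = 0, n
    while comps > 1:
        best = [None] * n                   # cheapest outgoing edge per component
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                for r in (ru, rv):
                    if best[r] is None or w < best[r][0]:
                        best[r] = (w, ru, rv)
        # Merge along the selected edges (each merge joins two components).
        for w, ru, rv in set(e for e in best if e is not None):
            if find(ru) != find(rv):
                parent[find(ru)] = find(rv)
                total += w
                comps -= 1
    return total

edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
print(boruvka(4, edges))  # 7 = 1 + 2 + 4
```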

16.
Optimal and near-optimal parallel algorithms for several fundamental problems are proposed for a parallel organization consisting of n processors, each having access to a row and a column of an n × n array of memory modules. Parallel computations are implemented on such an organization by decomposing them into alternating orthogonal processing phases. A number of efficient data movement techniques are developed for the proposed organization which lead to optimal or near-optimal solutions to several communication-intensive problems such as sorting, performing permutations, list ranking (data-dependent parallel prefix), and problems on graphs represented by an unsorted list of n^2 edges. It is also shown that the proposed organization is capable of simulating any fixed-degree network on n^2 processors with O(n) loss in time, which is optimal. Finally, an enhanced organization having p processors, 1 ≤ p ≤ n^2, and O(n^2) memory locations is presented, and is shown to provide optimal speedups for adjacency-matrix-based graph problems for any number of processors in the range [1, n^(3/2)].

17.
We show how computations such as those involved in American- or European-style option price valuations with the explicit finite difference method can be performed in parallel. Towards this we introduce a latency-tolerant parallel algorithm for performing such computations efficiently that achieves optimal theoretical speedup p, where p is the number of processors of the parallel system. An implementation of the parallel algorithm has been undertaken, and an evaluation of its performance is carried out through an experimental study on a high-latency PC cluster and, at a smaller scale, on a multi-core processor using in addition the SWARM parallel computing framework for multi-core processors. Our implementation of the parallel algorithm is not only architecture but also communication-library independent: the same code works under LAM-MPI and Open MPI as well as BSPlib, library frameworks that facilitate parallel programming. The suitability of our approach to multi-core processors is also established.
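
A compact sequential sketch of the explicit finite-difference scheme for a European call under Black-Scholes dynamics; the grid sizes and parameters below are illustrative assumptions, not the paper's setup. Each time step couples only neighbouring grid points, which is what makes partitioning the price grid across processors natural.

```python
import math

def explicit_fd_call(S0, K, r, sigma, T, M=150, N=10000):
    """M asset-price steps, N time steps (N large enough for stability)."""
    S_max = 3.0 * K
    dS, dt = S_max / M, T / N
    V = [max(j * dS - K, 0.0) for j in range(M + 1)]   # payoff at expiry
    for step in range(1, N + 1):
        tau = step * dt                                # time to expiry
        U = V[:]
        for j in range(1, M):                          # interior grid points
            a = 0.5 * dt * (sigma**2 * j**2 - r * j)
            b = 1.0 - dt * (sigma**2 * j**2 + r)
            c = 0.5 * dt * (sigma**2 * j**2 + r * j)
            V[j] = a * U[j - 1] + b * U[j] + c * U[j + 1]
        V[0] = 0.0                                     # S = 0 boundary
        V[M] = S_max - K * math.exp(-r * tau)          # deep-in-the-money boundary
    return V[round(S0 / dS)]

# Black-Scholes reference value is about 10.45 for these parameters.
print(explicit_fd_call(S0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0))
```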

18.
The nonserial polyadic dynamic programming algorithm is one of the most fundamental algorithms for solving discrete optimization problems. Although the loops in the nonserial polyadic dynamic programming algorithm are similar to those in matrix multiplication, available automatic optimization techniques have little effect on this imperfect loop nest because of its nonuniform data dependencies. In this paper, we develop algorithmic optimizations to improve the cache performance of the nonserial polyadic dynamic programming algorithm. Our algorithmic transformation takes advantage of the cache-oblivious method by relaxing some dependencies in the standard iterative version, and an approximate bound on the number of cache misses is derived under the ideal cache model of the cache-oblivious algorithm. We also found that the optimized algorithm with the cache-oblivious approach is more sensitive to conventional optimization techniques such as tiling. Experimental results on several platforms show that the optimized algorithms improve cache performance and achieve speedups of 2-10 times.
Guangming Tan

19.
AVL (Adel'son-Vel'skii and Landis) trees are efficient data structures for implementing dictionaries. We present a parallel dictionary, using AVL trees, on the EREW PRAM by proposing optimal algorithms to perform k operations with p (1 ≤ p ≤ k) processors. An explicit processor scheduling is devised to avoid simultaneous reads in our parallel algorithm for performing k searches, which avoids the need for any additional memory in the parallelization. To perform multiple insertions and deletions, we identify rotations (in addition to AVL tree rotations) required to restore balance, and present parallel algorithms to perform p insertions/deletions in O(log n + log p) time with p processors.
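
For reference, a compact sequential AVL insertion showing the single and double rotations that restore balance; the paper's parallel algorithms perform p such updates at once and identify additional rotation cases, which this sketch does not cover. All names are illustrative.

```python
class N:
    def __init__(self, key):
        self.key, self.left, self.right, self.h = key, None, None, 1

def h(t): return t.h if t else 0
def upd(t): t.h = 1 + max(h(t.left), h(t.right)); return t
def bal(t): return h(t.left) - h(t.right)

def rot_right(y):
    x = y.left
    y.left, x.right = x.right, y
    upd(y)
    return upd(x)

def rot_left(x):
    y = x.right
    x.right, y.left = y.left, x
    upd(x)
    return upd(y)

def insert(t, key):
    if t is None:
        return N(key)
    if key < t.key:
        t.left = insert(t.left, key)
    else:
        t.right = insert(t.right, key)
    upd(t)
    if bal(t) > 1:                      # left-heavy
        if bal(t.left) < 0:             # left-right case: double rotation
            t.left = rot_left(t.left)
        return rot_right(t)
    if bal(t) < -1:                     # right-heavy
        if bal(t.right) > 0:            # right-left case: double rotation
            t.right = rot_right(t.right)
        return rot_left(t)
    return t

root = None
for k in [3, 1, 4, 1.5, 9, 2, 6]:
    root = insert(root, k)
print(root.key, h(root))  # 3 3: tree stays balanced after rotations
```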

20.
We develop constant-time algorithms to compute the Hough transform on a processor array with a reconfigurable bus system (abbreviated to PARBS). The PARBS is a computation model which consists of a processor array and a reconfigurable bus system. It is a very powerful computation model in that many problems can be solved efficiently. In this paper, we introduce the concept of iterative-PARBS, which is similar to the FOR-loop construct in sequential programming languages. The iterative-PARBS is a building block through which the processing data can be routed several times. We can think of it as a "hardware subroutine". Based on this scheme, we are able to explore constant-time Hough transform algorithms on PARBS. The following new results are derived in this study:
  1. The sum of n bits can be computed in O(1) time on a PARBS with O(n^(1+ε)) processors for any fixed ε > 0.
  2. The weights of each simple path of an n×n image can be computed in O(1) time on a 3-D PARBS with O(n^(2+ε)) processors for any fixed ε > 0.
  3. The p-angle Hough transform of an n×n image can be computed in O(1) time on a PARBS with O(p·n^(2+ε)) processors for any fixed ε > 0, with p copies of the image pretiled.
  4. The p-angle Hough transform of an n×n image can be computed in O(1) time on a PARBS with O(p·n^3) processors.
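
A sequential sketch of the p-angle line Hough transform (every pixel votes, for each angle, for the line at signed distance rho = x·cos θ + y·sin θ); the PARBS results compute this in O(1) time on a reconfigurable bus, which this sketch only emulates serially. All names are illustrative.

```python
import math

def hough(points, n, p_angles):
    """Accumulate votes for lines rho = x*cos(theta) + y*sin(theta)."""
    rho_max = int(math.hypot(n, n)) + 1
    acc = [[0] * (2 * rho_max) for _ in range(p_angles)]
    for x, y in points:
        for a in range(p_angles):
            theta = math.pi * a / p_angles
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[a][rho + rho_max] += 1      # shift rho into [0, 2*rho_max)
    return acc

pts = [(i, i) for i in range(8)]            # the diagonal line y = x
acc = hough(pts, 8, p_angles=4)
a, r = max(((a, r) for a in range(4) for r in range(len(acc[0]))),
           key=lambda ar: acc[ar[0]][ar[1]])
print(a, r - 12, acc[a][r])  # angle index 3 (135 deg), rho 0, all 8 votes
```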
