Similar literature
 Found 20 similar articles; search took 15 ms
1.
In this paper we describe a simple parallel algorithm for list ranking. The algorithm is deterministic and runs in O(log n) time on an EREW PRAM with n/log n processors. The algorithm matches the performance of the Cole-Vishkin [CV3] algorithm but is simple and has reasonable constant factors. R. J. Anderson was supported by an NSF Presidential Young Investigator award and G. L. Miller was supported by NSF Grant DCR-85114961.
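For readers unfamiliar with list ranking, the following is a minimal sketch of the problem using classical pointer jumping, simulated sequentially. It is not the algorithm of the paper above: pointer jumping does O(n log n) total work, whereas the abstract describes a work-optimal method. The function name `list_rank` and the convention that the tail points to itself are assumptions made for illustration.

```python
def list_rank(succ):
    """Return, for each node, its distance to the tail of a linked list.

    succ[i] is the successor of node i; the tail is assumed to point to itself.
    Each loop iteration simulates one synchronous parallel round.
    """
    n = len(succ)
    nxt = list(succ)
    # Initially every node except the tail is one hop from its successor.
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # bit_length(n) >= ceil(log2 n) rounds are enough for every pointer to reach the tail.
    for _ in range(max(1, n).bit_length()):
        # Read only the old arrays, write new ones (synchronous semantics).
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank
```

For the list 3 → 0 → 2 → 1, encoded as `succ = [2, 1, 1, 0]`, the call `list_rank(succ)` returns `[2, 0, 1, 3]`.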

2.
This paper considers a variety of geometric pattern recognition problems on input sets of size n using a coarse grained multicomputer model consisting of p processors with Ω(n/p) local memory each (i.e., Ω(n/p) memory cells of Θ(log n) bits apiece), where the processors are connected to an arbitrary interconnection network. It introduces efficient scalable parallel algorithms for a number of geometric problems including the rectangle finding problem, the maximal equally spaced collinear points problem, and the point set pattern matching problem. All of the algorithms presented are scalable in that they are applicable and efficient over a very wide range of ratios of problem size to number of processors. In addition to the practicality imparted by scalability, these algorithms are easy to implement in that all required communications can be achieved by a small number of calls to standard global routing operations.

3.
4.
5.
《Parallel Computing》1997,22(12):1661-1675
This paper presents a mapping scheme for parallel pipelined execution of the Back-propagation Learning Algorithm on distributed memory multiprocessors. The proposed implementation exhibits inter-layer or pipelined parallelism, which is unique to multilayer neural networks. Simple algorithms are presented that allow the data transfer involved in both the recall and learning phases of the back-propagation algorithm to be carried out with a small communication overhead. The effectiveness of the mapping scheme is illustrated by estimating the speedup of the proposed implementation on an array of T-805 transputers.

6.
In this paper we present a decentralized remapping method for data parallel applications on distributed memory multiprocessors. The method uses a generalized dimension exchange (GDE) algorithm periodically during the execution of an application to balance (remap) the system's workload. We implemented this remapping method in parallel WaTor simulations and parallel image thinning applications, and found it to be effective in reducing the computation time. The average performance gain is about 20% in the WaTor simulation of a 256 × 256 ocean grid on 16 processors, and up to 8% in the thinning of a typical image of size 128 × 128 on eight processors. The performance gains due to remapping in the image thinning case are reasonably substantial given the fact that the application by its very nature does not necessarily favor remapping. We also implemented this remapping method, using up to 32 processors, for partitioning and re-partitioning of grids in computational fluid dynamics. It was found that the GDE-based parallel refinement policy, coupled with simple geometric strategies, produces partitions that are comparable in quality to those from the best serial algorithms. © 1997 John Wiley & Sons, Ltd.
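The dimension exchange idea behind the method above can be sketched as follows. This is an illustrative simulation on a hypercube, not the authors' implementation: the function name `dimension_exchange`, the hypercube topology, and the exchange parameter `lam` are assumptions. In each dimension, every node pairs with its neighbour across that dimension and moves a fraction `lam` of the load difference.

```python
def dimension_exchange(load, dims, lam=0.5):
    """Simulate one sweep of dimension exchange on a hypercube with 2**dims nodes.

    load[i] is the workload of node i; lam is the (assumed) exchange parameter,
    i.e. the fraction of the pairwise load difference that is transferred.
    """
    load = list(load)
    for d in range(dims):
        for node in range(len(load)):
            peer = node ^ (1 << d)  # neighbour across dimension d
            if node < peer:         # handle each pair exactly once
                diff = load[peer] - load[node]
                load[node] += lam * diff
                load[peer] -= lam * diff
    return load
```

With `lam=0.5` on a hypercube, a single sweep over all dimensions equalizes the load: `dimension_exchange([8, 0, 0, 0], dims=2)` gives `[2.0, 2.0, 2.0, 2.0]`. Other topologies generally need several sweeps and a different `lam`.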

7.
In this paper, we focus on the compiling implementation of the parallel logic language PARLOG and the functional language ML on distributed memory multiprocessors. Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model (HPGREM) is presented first. Then, based on HPGREM, a parallel abstract machine, PAM/TGR, is described. Furthermore, several optimizing compilation schemes for executing declarative programs on a transputer array are proposed. The performance statistics on the transputer array demonstrate the effectiveness of our model, parallel abstract machine, optimizing compilation strategies and compiler.

8.
We present a new parallel algorithm for computing a maximum cardinality matching in a bipartite graph suitable for distributed memory computers. The presented algorithm is based on the Push-Relabel algorithm, which is known to be one of the fastest algorithms for the bipartite matching problem. Previous attempts at developing parallel implementations of it have focused on shared memory computers using only a limited number of processors. We first present a straightforward adaptation of these shared memory algorithms to distributed memory computers. However, this is not a viable approach as it requires too much communication. We then develop our new algorithm by modifying the previous approach through a sequence of steps with the main goal being to reduce the amount of communication and to increase load balance. The first goal is achieved by changing the algorithm so that many push and relabel operations can be performed locally between communication rounds and also by selecting augmenting paths that cross processor boundaries infrequently. To achieve good load balance, we limit the speed at which global relabelings traverse the graph. In several experiments on a large number of instances, we study weak and strong scalability of our algorithm using up to 128 processors. The algorithm can also be used to find ε-approximate matchings quickly.

9.
We present two new parallel algorithms, QSP1 and QSP2, based on sequential quicksort for sorting data on a mesh multicomputer, and analyze their scalability using the isoefficiency metric. We show that QSP2 matches the lower bound on the isoefficiency function for mesh multicomputers, while QSP1 is fairly close to optimal. Lang et al. (1) and Schnorr et al. (2) have developed parallel sorting algorithms for the mesh architecture that have either optimal (Schnorr) or close to optimal (Lang) run-time complexity for the one-element-per-processor case. Both QSP1 and QSP2 have better scalability than the scaled-down variants of these algorithms (for the case in which there are more elements than processors). We also analyze a different variant of Lang's sort which is as scalable as QSP2. We briefly discuss another metric called resource consumption. According to this metric, both QSP1 and QSP2 are superior to variants of Lang's sort.

10.
We propose a model for describing and predicting the parallel performance of a broad class of parallel numerical software on distributed memory architectures. The purpose of this model is to allow reliable predictions to be made for the performance of the software on large numbers of processors of a given parallel system, by only benchmarking the code on small numbers of processors. Having described the methods used, and emphasized the simplicity of their implementation, the approach is tested on a range of engineering software applications that are built upon the use of multigrid algorithms. Despite their simplicity, the models are demonstrated to provide both accurate and robust predictions across a range of different parallel architectures, partitioning strategies and multigrid codes. In particular, the effectiveness of the predictive methodology is shown for a practical engineering software implementation of an elastohydrodynamic lubrication solver.

11.
In this paper, we study parallel branch and bound on fine grained hypercube multiprocessors. Each processor in a fine grained system has only a very small amount of memory available. Therefore, current parallel branch and bound methods for coarse grained systems ( 1000 nodes) cannot be applied, since all these methods assume that every processor stores the path from the node it is currently processing back to the node where the process was created (the back-up path). Furthermore, the much larger number of processors available in a fine grained system makes it imperative that global information (e.g. the current best solution) is continuously available at every processor; otherwise the amount of unnecessary search would become intolerable. We describe an efficient branch-and-bound algorithm for fine grained hypercube multiprocessors. Our method uses a global scheme where all processors collectively store all back-up paths such that each processor needs to store only a constant amount of information. At each iteration of the algorithm, all current nodes may decide whether they need to create new children, be pruned, or remain unchanged. We describe an algorithm that, based on these decisions, updates the current back-up paths and distributes global information in O(log m) steps, where m is the current number of nodes. This method also includes dynamic allocation of search processes to processors and provides optimal load balancing. Even if very drastic changes in the set of current nodes occur, our load balancing mechanism does not suffer any slow down.

12.
Recently developed block-iterative versions of some row-action algorithms for solving general systems of sparse linear equations allow parallelism in the computations when the underlying problem is appropriately decomposed. However, problems associated with the parallel implementation of these algorithms have to be addressed. In this paper we present an implementation on distributed memory multiprocessors of a block version of the Kaczmarz row-action method. One of the main issues in implementing this method efficiently in a concurrent environment is developing suitable communication schemes that reduce the amount of communication needed at each iteration. We propose two data distribution strategies which lead to different computation and communication schemes. To verify and compare the effectiveness of the proposed strategies, numerical experiments have been carried out on a Symult S2010 and a Meiko Computing Surface. The performance evaluation has been done using a scaled efficiency model.
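As background for the abstract above, here is a minimal sequential sketch of the classical (single-row) Kaczmarz method, which projects the current iterate onto one equation's hyperplane at a time. It is not the block or distributed variant the paper describes; the function name `kaczmarz`, the dense list-of-lists matrix format, and the fixed sweep count are assumptions for illustration.

```python
def kaczmarz(A, b, sweeps=50):
    """Approximately solve A x = b by cyclic Kaczmarz row projections.

    A is a list of rows (lists of floats), b the right-hand side.
    Each row update projects x onto the hyperplane row . x = rhs.
    """
    x = [0.0] * len(A[0])
    for _ in range(sweeps):
        for row, rhs in zip(A, b):
            norm2 = sum(a * a for a in row)
            if norm2 == 0.0:
                continue  # skip an all-zero row; it defines no hyperplane
            residual = rhs - sum(a * xi for a, xi in zip(row, x))
            step = residual / norm2
            x = [xi + step * a for xi, a in zip(x, row)]
    return x
```

For the consistent system 2x + y = 3, x + 3y = 5, calling `kaczmarz([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])` converges towards the exact solution x = 0.8, y = 1.4. A block version would apply such projections to groups of rows, which is where the parallelism mentioned in the abstract comes from.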

13.
We present a parallel algorithm for distributed memory multiprocessors, which is based on generalized marching (GM), one of the fastest methods in the class of fast Poisson solvers. The GM algorithm is not suited for any but very coarse-grain parallel processing. The main difficulty with parallelization is that the number of independent processes and the amount of work in each process change exponentially and in inverse proportion to each other. To improve parallelism, the matrices involved in GM are diagonalized by performing multiple FFTs. In this way, independent processes extending across the whole algorithm are obtained. The parallel GM has been tested on an Ncube/10 and a Symult S2010, running the Express communication system. A performance evaluation has been carried out using a scaled efficiency model and some classical parameters.

14.
Threads provides a mechanism for simulating the execution of parallel algorithms on a simplified model of a shared-memory multiprocessor. The algorithms can be expressed in a high-level block-structured language, which supports multiple threads of execution within a common body of program code. Results show an ability to achieve good speedup for small problems using algorithms derived by simple modifications of sequential algorithms. As well, a sibling thread synchronisation feature provides the basis for the synchronous execution of threads. k-parallel algorithms tailored to the machine size and implemented as synchronously executing iterations, can provide near linear speedup as the problem size is increased. The techniques described in this paper seem to promise an effective synchronous execution mode for shared-memory MIMD architectures.

15.
Multiprocessor system-on-chip (MP-SoC) platforms represent an emerging trend for embedded multimedia applications. To enable MP-SoC platforms, scalable communication-centric interconnect fabrics, such as networks-on-chip (NoCs), have been recently proposed. The shared memory represents one of the key elements in designing MP-SoCs to provide data exchange and synchronization support. This paper focuses on the energy/delay exploration of a distributed shared memory architecture, suitable for low-power on-chip multiprocessors based on NoC. A mechanism is proposed for the data allocation on the distributed shared memory space, dynamically managed by an on-chip hardware memory management unit (HwMMU). Moreover, the exploitation of the HwMMU primitives for the migration, replication, and compaction of shared data is discussed. Experimental results show the impact of different distributed shared memory configurations for a selected set of parallel benchmark applications from the power/performance perspective. Furthermore, a case study for a graph exploration algorithm is discussed, accounting for the effects of the core mapping and the network topology on energy and performance at the system level.

16.
The prevalence of multicore processors has resulted in the wider applicability of parallel programming models such as OpenMP and MapReduce. A common goal of running parallel applications implemented under such models is to guarantee bounded response times while maximizing system utilization. Unfortunately, little previous work has been done that can provide such performance guarantees. In this paper, this problem is addressed by applying soft real-time scheduling analysis techniques. Analysis and conditions are presented for guaranteeing bounded response times for parallel applications under global EDF multiprocessor scheduling.

17.
This paper introduces a model for parallel computation, called the distributed random-access machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages through a communication network. A DRAM explicitly models the congestion of messages across cuts of the network. We introduce the notion of a conservative algorithm as one whose communication requirements at each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM. We give a simple lemma that shows how to shortcut pointers in a data structure so that remote processors can communicate without causing undue congestion. We give O(lg n)-step, linear-processor, linear-space, conservative algorithms for a variety of problems on n-node trees, such as computing treewalk numberings, finding the separator of a tree, and evaluating all subexpressions in an expression tree. We give O(lg² n)-step, linear-processor, linear-space, conservative algorithms for problems on graphs of size n, including finding a minimum-cost spanning forest, computing biconnected components, and constructing an Eulerian cycle. Most of these algorithms use as a subroutine a generalization of the prefix computation to trees. We show that any such treefix computation can be performed in O(lg n) steps using a conservative variant of Miller and Reif's tree-contraction technique. This research was supported in part by the Defense Advanced Research Projects Agency under Contract N00014-80-C-0622 and by the Office of Naval Research under Contract N00014-86-K-0593. Charles Leiserson is supported in part by an NSF Presidential Young Investigator Award with matching funds provided by AT&T Bell Laboratories and Xerox Corporation. Bruce Maggs is supported in part by an NSF Fellowship.

18.
MUPPET is a problem-solving environment for scientific computing with message-based multiprocessors. It consists of four parts: concurrent languages, programming environments, application environments and man-machine interfaces. The programming paradigm of MUPPET is based on parallel abstract machines and transformations between them. This paradigm allows the development of programs which are portable among multiprocessors with different interconnection topologies.

In this paper we discuss the MUPPET programming paradigm. We give an introduction to the language CONCURRENT MODULA-2 and the graphic specification system GONZO. The graphic specification system tries to introduce graphics as a tool for programming. It is also the basis for program generation and transformation.


19.
In this paper we describe a technique for finding efficient parallel algorithms for problems on directed graphs that involve checking the existence of certain kinds of paths in the graph. This technique provides efficient algorithms for finding dominators in flow graphs, performing interval and loop analysis on reducible flow graphs, and finding the feedback vertices of a digraph. Each of these algorithms takes O(log² n) time using the same number of processors needed for fast matrix multiplication. All of these bounds are for an EREW PRAM.
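The link between path existence and matrix multiplication mentioned above can be illustrated with the standard transitive-closure construction: squaring a reflexive Boolean adjacency matrix about log n times yields all reachable pairs. This sketch is an assumption about the general flavour of such reductions, not the specific technique of the paper; the name `reachability` is hypothetical and the code runs sequentially.

```python
def reachability(adj):
    """Return reach[i][j] == True iff there is a directed path from i to j.

    adj is an n x n 0/1 (or bool) adjacency matrix. Every node is treated
    as reaching itself, so the result is the reflexive transitive closure.
    """
    n = len(adj)
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    # After t squarings, reach covers all paths of length <= 2**t;
    # bit_length(n) squarings therefore cover every simple path.
    for _ in range(max(1, n).bit_length()):
        reach = [
            [any(reach[i][k] and reach[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return reach
```

On a parallel machine each Boolean product takes O(log n) time, so the O(log n) squarings give an O(log² n) bound of the same shape as the one stated in the abstract.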

20.