1.

An optimal implementation of broadcasting with selective reduction

Lindon L.F. Akl S.G. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(3):256-269

A model of parallel computation called broadcasting with selective reduction (BSR) can be viewed as a concurrent-read concurrent-write (CRCW) parallel random access machine (PRAM) with one extension. An additional type of concurrent memory access is permitted in BSR, namely the BROADCAST instruction by means of which all N processors may gain access to all M memory locations simultaneously for the purpose of writing. At each memory location, a subset of the incoming broadcast data is selected and reduced to one value finally stored in that location. For several problems, BSR algorithms are known which require fewer steps than the corresponding best-known PRAM algorithms, using the same number of processors. A circuit is introduced to implement the BSR model, and it is shown that, in size and depth, the circuit presented is of the same order as an optimal circuit implementing the PRAM. Thus, if it is reasonable to assume that CRCW PRAM instructions execute in constant time, the assumption of a constant time BROADCAST instruction is no less reasonable.
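The select-then-reduce behaviour of BROADCAST can be sketched sequentially. The sketch below assumes the usual tag/limit formulation of BSR, in which processor i broadcasts a (tag, datum) pair and location j keeps the data whose tags satisfy a comparison against its own limit; the function names, argument layout and the rule that a location with no selected datum stays unchanged are illustrative assumptions, not taken from the abstract.

```python
import operator
from functools import reduce


def bsr_broadcast(tags, data, limits, select, combine, memory):
    """Simulate one BROADCAST step sequentially (hypothetical helper).

    Processor i broadcasts (tags[i], data[i]) to every location.
    Location j selects the data whose tag satisfies select(tag, limits[j])
    and reduces them with `combine`. A location that selects nothing
    keeps its old value (an assumption of this sketch).
    """
    result = list(memory)
    for j, limit in enumerate(limits):
        selected = [d for t, d in zip(tags, data) if select(t, limit)]
        if selected:
            result[j] = reduce(combine, selected)
    return result


def prefix_sums(values):
    """Prefix sums as a single BROADCAST: location j sums data with tag <= j."""
    indices = list(range(len(values)))
    return bsr_broadcast(indices, values, indices,
                         operator.le, operator.add, [0] * len(values))


print(prefix_sums([3, 1, 4, 1, 5]))  # [3, 4, 8, 9, 14]
```

On a real BSR machine the loop over locations would run in one constant-time instruction; the sequential loop here only mirrors the semantics.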

2.

PRAM programming: in theory and in practice

D. S. Lecomber C. J. Siniolakis K. R. Sujithan 《Concurrency and Computation》2000,12(4):211-226

That the influence of the PRAM model is ubiquitous in parallel algorithm design is as clear as the fact that it is technologically infeasible for the foreseeable future. The current generation of parallel hardware prominently features distributed memory and high-performance interconnection networks, very much the antithesis of the shared memory required for the PRAM model. It has been shown that, in spite of communication costs, for some problems very fast parallel algorithms are available for distributed-memory machines, from embarrassingly parallel problems to sorting and numerical analysis. In contrast it is known that for other classes of problem PRAM-style shared-memory simulation on a distributed-memory machine can, in theory, produce solutions of comparable performance to the best possible for such architectures. The Bulk Synchronous Parallel (BSP) model accurately represents most parallel machines, theoretical and actual, in an execution and cost model. We introduce a scalable portable PRAM realization appropriate for BSP computers and a methodology for usage. Our system is fast and built upon the familiar sequential C++ coupled with the new standard BSP library of parallel computation and communication primitives. It is portable to and predictable on a vast number of parallel computers including workstation clusters, a 256-processor Cray T3D, an 8-node IBM SP/2 and a 4-node shared-memory SGI Power Challenge machine. Our approach achieves simplicity of programming over direct-mode BSP programming for reasonable overhead cost. We objectively compare optimized BSP and PRAM algorithms implemented with our C++ PRAM library and provide encouraging experimental results for our new style of programming. Copyright © 2000 John Wiley & Sons, Ltd.

3.

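Design of a hybrid main memory based on SRAM and PRAM

姚英彪 陈越佳 《计算机工程与应用》 (Computer Engineering and Applications) 2016,52(13):69-75

Because of the very high static power consumption of DRAM chips, building large-capacity main memory from DRAM in high-performance computer systems runs into excessive energy consumption, which has motivated research into new large-capacity main memory architectures. To address this problem, a hybrid main memory system based on SRAM and PRAM is designed. The system uses the SRAM as a dedicated write buffer for the PRAM and applies an improved LRFU algorithm to the SRAM write buffer, thereby effectively reducing the energy consumption of the main memory system and extending the usable lifetime of the PRAM with little impact on main memory performance. Simulation results show that the energy-delay product (EDP) of the proposed hybrid memory structure is 40% of that of a pure DRAM memory structure. In addition, compared with a pure PRAM memory structure, the number of PRAM write operations drops by 28.5%, and compared with using the SRAM as a cache, the number of PRAM writes drops by 13%.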
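To make the write-buffer idea concrete, here is a minimal sketch of a buffer managed by the standard LRFU policy, in which each block carries a combined recency and frequency (CRF) score that decays as 0.5 ** (lam * age). The abstract does not describe its improved variant, so this is plain LRFU only; the class name, the `lam` parameter and the convention that an evicted address is the one to be flushed to PRAM are all assumptions of the sketch.

```python
class LRFUBuffer:
    """Toy SRAM write buffer using plain LRFU replacement (hypothetical)."""

    def __init__(self, capacity, lam=0.5):
        self.capacity = capacity
        self.lam = lam          # 0 behaves like LFU, larger values like LRU
        self.clock = 0          # logical time, one tick per write
        self.blocks = {}        # address -> (crf, time of last write)

    def _decay(self, age):
        # Weight of a reference that happened `age` ticks ago.
        return 0.5 ** (self.lam * age)

    def _crf_now(self, addr):
        crf, last = self.blocks[addr]
        return crf * self._decay(self.clock - last)

    def write(self, addr):
        """Buffer a write; return the evicted address to flush, or None."""
        self.clock += 1
        evicted = None
        if addr in self.blocks:
            crf = 1.0 + self._crf_now(addr)
        else:
            if len(self.blocks) >= self.capacity:
                evicted = min(self.blocks, key=self._crf_now)
                del self.blocks[evicted]
            crf = 1.0
        self.blocks[addr] = (crf, self.clock)
        return evicted


buf = LRFUBuffer(capacity=2, lam=0.1)
print([buf.write(a) for a in ["A", "A", "B", "C"]])  # [None, None, None, 'B']
```

In this run a pure LRU buffer would evict A, the least recently written block; LRFU keeps A because it was written twice, which is the kind of behaviour that can absorb repeated writes before they reach the PRAM.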

4.

Efficient PRAM simulation on a distributed memory machine

R. M. Karp M. Luby F. Meyer auf der Heide 《Algorithmica》1996,16(4-5):517-542

We present algorithms for the randomized simulation of a shared memory machine (PRAM) on a Distributed Memory Machine (DMM). In a PRAM, memory conflicts occur only through concurrent access to the same cell, whereas the memory of a DMM is divided into modules, one for each processor, and concurrent accesses to the same module create a conflict. The delay of a simulation is the time needed to simulate a parallel memory access of the PRAM. Any general simulation of an m processor PRAM on an n processor DMM will necessarily have delay at least m/n. A randomized simulation is called time-processor optimal if the delay is O(m/n) with high probability. Using a novel simulation scheme based on hashing we obtain a time-processor optimal simulation with delay O(log log(n) log*(n)). The best previous simulations use a simpler scheme based on hashing and have much larger delay: Θ(log(n)/log log(n)) for the simulation of an n processor PRAM on an n processor DMM, and Θ(log(n)) in the case where the simulation is time-processor optimal. Our simulations use several (two or three) hash functions to distribute the shared memory among the memory modules of the PRAM. The stochastic processes modeling the behavior of our algorithms and their analyses based on powerful classes of universal hash functions may be of independent interest. Research partially supported by NSF/DARPA Grant CCR-9005448. Work was done while at the University of California at Berkeley and the International Computer Science Institute, Berkeley, CA. Research partially supported by National Science Foundation Operating Grant CCR-9016468, National Science Foundation Operating Grant CCR-9304722, United States-Israel Binational Science Foundation Grant No. 89-00312, United States-Israel Binational Science Foundation Grant No. 92-00226, and ESPRIT BR Grant EC-US 030. Part of work was done during a visit at the International Computer Science Institute at Berkeley; supported in part by DFG-Forschergruppe Effiziente Nutzung massiv paralleler Systeme, Teilprojekt 4, and by the Esprit Basic Research Action Nr. 7141 (ALCOM II).
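The distinction between cell conflicts and module conflicts can be illustrated with a small sketch that measures how many distinct cells of one parallel access land in the busiest module. It uses a single linear hash of the form ((a*x + b) mod p) mod n, which is much weaker than the universal classes and the multi-hash scheme the abstract refers to; the constant `PRIME` and both function names are assumptions made for illustration.

```python
from collections import Counter

PRIME = 2_147_483_647  # 2**31 - 1, assumed larger than any address used here


def make_linear_hash(a, b, n_modules):
    """Return a placement function mapping a cell address to a module."""
    def placement(address):
        return ((a * address + b) % PRIME) % n_modules
    return placement


def module_congestion(addresses, placement):
    """Largest number of distinct requested cells that share one module.

    Repeated requests for the same cell are counted once, since on a PRAM
    concurrent access to one cell is not a module conflict by itself.
    """
    loads = Counter(placement(cell) for cell in set(addresses))
    return max(loads.values(), default=0)


strided = [8 * i for i in range(8)]          # 8 processors, stride-8 access
direct = lambda cell: cell % 8               # naive placement: address mod n
print(module_congestion(strided, direct))    # 8: every cell hits module 0
print(module_congestion(strided, make_linear_hash(48271, 11, 8)))
```

The congestion value is a rough proxy for the delay of one simulated step: the busiest module has to serve its requests one after another.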

5.

Efficient Deterministic and Probabilistic Simulations of PRAMs on Linear Arrays with Reconfigurable Pipelined Bus Systems

Li Keqin Pan Yi Zheng Si Qing 《The Journal of supercomputing》2000,15(2):163-181

In this paper, we present deterministic and probabilistic methods for simulating PRAM computations on linear arrays with reconfigurable pipelined bus systems (LARPBS). The following results are established in this paper. (1) Each step of a p-processor PRAM with m=O(p) shared memory cells can be simulated by a p-processor LARPBS in O(log p) time, where the constant in the big-O notation is small. (2) Each step of a p-processor PRAM with m=Ω(p) shared memory cells can be simulated by a p-processor LARPBS in O(log m) time. (3) Each step of a p-processor PRAM can be simulated by a p-processor LARPBS in O(log p) time with probability larger than 1 - 1/p^c for all c>0. (4) As an interesting byproduct, we show that a p-processor LARPBS can sort p items in O(log p) time, with a small constant hidden in the big-O notation. Our results indicate that an LARPBS can simulate a PRAM very efficiently.

6.

A process oriented semantics of the PRAM-language FORK

Gudula Rünger Kurt Sieber 《Computer Languages, Systems and Structures》1994,20(4):253-265

The parallel language FORK [1], based on a scalable shared memory model, is a PASCAL-like language with some additional parallel constructs. A PRAM (Parallel Random Access Machine) algorithm can be expressed on a high level of abstraction as a FORK program which is translated into efficient PRAM code guaranteeing theoretically predicted runtimes.

In this paper, we concentrate on those features of the language FORK related to parallelism, such as the group concept, a shared memory access and synchronous or asynchronous execution. We present a trace-based denotational interleaving semantics where processes describe synchronous computations. Processes are created or deleted dynamically and run asynchronously. Interleaving rules reflect the underlying CRCW (concurrent-read-concurrent-write) PRAM model.

7.

Simulating Shared Memory in Real Time: On the Computation Power of Reconfigurable Architectures

Artur Czumaj Friedhelm Meyer auf der Heide Volker Stemann 《Information and Computation》1997,137(2):103

We consider randomized simulations of shared memory on a distributed memory machine (DMM) where the n processors and the n memory modules of the DMM are connected via a reconfigurable architecture. We first present a randomized simulation of a CRCW PRAM on a reconfigurable DMM having a complete reconfigurable interconnection. It guarantees delay O(log* n), with high probability. Next we study a reconfigurable mesh DMM (RM-DMM). Here the n processors and n modules are connected via an n×n reconfigurable mesh. It was already known that an n×m reconfigurable mesh can simulate in constant time an n-processor CRCW PRAM with shared memory of size m. In this paper we present a randomized step by step simulation of a CRCW PRAM with arbitrarily large shared memory on an RM-DMM. It guarantees constant delay with high probability, i.e., it simulates in real time. Finally we prove a lower bound showing that size Ω(n²) for the reconfigurable mesh is necessary for real time simulations.

8.

Asynchronous PRAMS with Memory Latency

《Journal of Parallel and Distributed Computing》1994,23(1):10-26

We introduce two new asynchronous PRAM models which allow significant latencies for accessing global memory. In both models, accessing global memory takes L time units where L is a fixed parameter, but the models provide two different mechanisms to help hide this latency. The Delay PRAM (D-PRAM) allows reads and writes to be issued before earlier reads and writes are completed; the Block PRAM (B-PRAM) allows a block of L contiguous locations in global memory to be read or written in O(L) time units. For both models we develop work-optimal randomized algorithms that solve a Certified Write-All Problem (CWA) of size n with expected O(n) work using up to (n log L)/(L log n) processors. This is a fundamental problem since it can be used as a synchronization primitive for n parallel instructions. If the D-PRAM has some restrictions on the asynchrony allowed we can use our CWA solution to simulate any n-processor CRCW PRAM program on a restricted D-PRAM with memory latency L, using O(n) expected work per parallel step, and using up to (n log L)/(L log n) D-PRAM processors. We prove a matching lower bound which shows that our CWA solution is optimal in terms of expected work. Our algorithms work both for models where the latency L to access global memory is fixed, and for models where the latency can vary probabilistically.

9.

Locality-preserving hash functions for general purpose parallel computation

A. Chin 《Algorithmica》1994,12(2-3):170-181

Consider the problem of efficiently simulating the shared-memory parallel random access machine (PRAM) model on massively parallel architectures with physically distributed memory. To prevent network congestion and memory bank contention, it may be advantageous to hash the shared memory address space. The decision on whether or not to use hashing depends on (1) the communication latency in the network and (2) the locality of memory accesses in the algorithm. We relate this decision directly to algorithmic issues by studying the complexity of hashing in the Block PRAM model of Aggarwal, Chandra, and Snir, a shared-memory model of parallel computation which accounts for communication locality. For this model, we exhibit a universal family of hash functions having optimal locality. The complexity of applying these hash functions to the shared address space of the Block PRAM (i.e., by permuting data elements) is asymptotically equivalent to the complexity of performing a square matrix transpose, and this result is best possible for all pairwise independent universal hash families. These complexity bounds provide theoretical evidence that hashing and randomized routing need not destroy communication locality, addressing an open question of Valiant. This work was started when the author was a student at Oxford University, supported by a National Science Foundation Graduate Fellowship and a Rhodes Scholarship. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation or the Rhodes Trust.

10.

Support for Efficient Programming on the SB-PRAM

Thomas Grün Thomas Rauber Jochen Röhrig 《International journal of parallel programming》1998,26(3):209-240

The SB-PRAM is a shared-memory parallel computer that has been designed according to the PRAM model from theoretical computer science. The SB-PRAM realizes a concurrent-read, concurrent-write PRAM where each processor can access the global memory in unit time. This article describes the programming environment of the SB-PRAM that enables a programmer to develop efficient and portable programs without dealing with architectural details of the machine. In particular, we discuss compiler and operating system issues and show that the runtime functions of the P4 environment and several parallel data structures can be implemented very efficiently by using special features of the SB-PRAM. In contrast to other parallel machines, the synchronization of processors and the management of concurrent accesses to the global memory only require a few machine instructions independent of the number of processors participating in the operation. This efficient implementation of the runtime system is the basis for good performance of many challenging applications.

11.

Energy efficient task allocation for hybrid main memory architecture

《Journal of Systems Architecture》2016

Compared with the conventional dynamic random access memory (DRAM), emerging non-volatile memory technologies provide better density and energy efficiency. However, current NVM devices typically suffer from high write power, long write latency and low write endurance. In this paper, we study the task allocation problem for the hybrid main memory architecture with both DRAM and PRAM, in order to leverage system performance and the energy consumption of the memory subsystem via assigning different memory devices for each individual task. For an embedded system with a static set of periodical tasks, we design an integer linear programming (ILP) based offline adaptive space allocation (offline-ASA) algorithm to obtain the optimal task allocation. Furthermore, we propose an online adaptive space allocation (online-ASA) algorithm for dynamic task set where arrivals of tasks are not known in advance. Experimental results show that our proposed schemes achieve 27.01% energy saving on average, with additional performance cost of 13.6%.

12.

Simulations among concurrent-write PRAMs (total citations: 1; self-citations: 0; citations by others: 1)

Faith E. Fich Prabhakar Ragde Avi Wigderson 《Algorithmica》1988,3(1):43-51

This paper is concerned with the relative power of the two most popular concurrent-write models of parallel computation, the PRIORITY PRAM [G], and the COMMON PRAM [K]. Improving the trivial and seemingly optimal O(log n) simulation, we show that one step of a PRIORITY machine can be simulated by O(log n/(log log n)) steps of a COMMON machine with the same number of processors (and more memory). We further prove that this is optimal, if processor communication is restricted in a natural way. Support for this research was provided by NSF Grants MCS-8402676 and MCS-8120790, DARPA Contract No. N00039-84-C-0089, an IBM Faculty Development Award, and an NSERC postgraduate scholarship.
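For readers unfamiliar with the two write-conflict rules being compared, the sketch below resolves one step of concurrent writes under each of them, using the common textbook definitions: under PRIORITY the lowest-numbered processor writing to a cell wins, and under COMMON all processors writing to a cell must write the same value. The tuple representation of write requests and the function names are assumptions for illustration; the code shows the rules themselves, not the simulation from the paper.

```python
def priority_write(requests):
    """Resolve one step of writes under the PRIORITY rule.

    requests: iterable of (processor_id, address, value).
    The lowest processor_id writing to an address wins.
    """
    winners = {}
    for pid, addr, value in requests:
        if addr not in winners or pid < winners[addr][0]:
            winners[addr] = (pid, value)
    return {addr: value for addr, (_pid, value) in winners.items()}


def common_write(requests):
    """Resolve one step of writes under the COMMON rule.

    All processors writing to the same address must agree on the value;
    a disagreement is an illegal step and raises ValueError here.
    """
    written = {}
    for _pid, addr, value in requests:
        if addr in written and written[addr] != value:
            raise ValueError(f"conflicting writes to address {addr}")
        written[addr] = value
    return written


print(priority_write([(2, 0, "b"), (1, 0, "a"), (3, 1, "c")]))  # {0: 'a', 1: 'c'}
print(common_write([(0, 5, 1), (1, 5, 1)]))                     # {5: 1}
```

A COMMON machine has no built-in way to pick a winner among disagreeing writers, which is why simulating a single PRIORITY step on it takes several steps.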

13.

An EREW PRAM algorithm for image component labeling

Cypher R. Sanz J.L.C. Snyder L. 《IEEE transactions on pattern analysis and machine intelligence》1989,11(3):258-262

An important midlevel task for computer vision is addressed. The problem consists of labeling connected components in N^(1/2)×N^(1/2) binary images. This task can be solved with parallel computers by using a simple and novel algorithm. The parallel computing model used is a synchronous fine-grained shared-memory model where only one processor can read from or write to the same memory location at a given time. This model is known as the exclusive-read exclusive-write parallel RAM (EREW PRAM). Using this model, the algorithm presented has O(log N) complexity. The algorithm can run on parallel machines other than the EREW PRAM. In particular, it offers an optimal image component labeling algorithm for mesh-connected computers.

14.

Parallel random access machines with bounded memory wordsize

Stephen J. Bellantoni 《Information and Computation》1991,91(2)

The PRAM model of parallel computation is examined with respect to wordsize, the number of bits which can be held in each global memory cell. First, adversary arguments are used to show the incomparability of certain machines which store the same amount of global information but which differ in wordsize. Next, for machines with infinitely many memory cells, a counting argument is used to show a large lower bound and to separate a hierarchy of machine classes based on wordsize. Finally, an efficient simulation by boolean circuits is used to give a simple new proof of the tight Ω((log n)/(log log n)) time bound for on small-wordsize machines. Overall the results suggest that, in some circumstances, the memory wordsize is a more significant resource than the write resolution rule, number of memory cells, or number of processors.

15.

On the Power of Segmenting and Fusing Buses

《Journal of Parallel and Distributed Computing》1996,34(1):82-94

Reconfigurable bus-based models of parallel computation have been shown to be extremely powerful, capable of solving several problems in constant time that require nonconstant time on conventional models such as the PRAM. The primary source of the power of reconfigurable bus-based models is their ability to dynamically alter the connections between processors by manipulating the communication medium. This can be viewed as the models' ability to (i) segment a bus into two or more bus segments and (ii) fuse two or more buses or bus segments together. In this paper, we investigate the contribution of the abilities of a reconfigurable bus-based model to segment and fuse buses. We show that the ability to fuse buses is the more crucial of the two. The ability to segment buses enhances the power of the model under certain circumstances. We also study the roles of concurrent reading and writing in the context of reconfigurable bus-based models. These results establish a hierarchy of powers of the PRAM and reconfigurable bus-based models.

16.

A new scheme for the deterministic simulation of PRAMs in VLSI (total citations: 3; self-citations: 0; citations by others: 3)

F. Luccio A. Pietracaprina G. Pucci 《Algorithmica》1990,5(1):529-544

A deterministic scheme for the simulation of (n, m)-PRAM computation is devised. Each PRAM step is simulated on a bounded degree network consisting of a mesh-of-trees (MT) of side n. The memory is subdivided into n modules, each local to a PRAM processor. The roots of the MT contain these processors and the memory modules, while the other O(n²) nodes have the mere capabilities of packet switchers and one-bit comparators. The simulation algorithm makes crucial use of pipelining on the MT, and attains a time complexity of O(log² n/log log n). The best previous time bound was O(log² n) on a different interconnection network with n processors. While the previous simulation schemes use an intermediate MPC model, which is in turn simulated on a bounded degree network, our method performs the simulation directly with a simple algorithm. This work has been supported in part by Ministero della Pubblica Istruzione of Italy under a research grant.

17.

Simulations among concurrent-write PRAMs

Faith E. Fich Prabhakar Ragde Avi Wigderson 《Algorithmica》1988,3(1-4):43-51

This paper is concerned with the relative power of the two most popular concurrent-write models of parallel computation, the PRIORITY PRAM [G], and the COMMON PRAM [K]. Improving the trivial and seemingly optimal O(log n) simulation, we show that one step of a PRIORITY machine can be simulated by O(log n/(log log n)) steps of a COMMON machine with the same number of processors (and more memory). We further prove that this is optimal, if processor communication is restricted in a natural way.

18.

Merging sorted runs using large main memory

Salzberg Betty 《Acta Informatica》1989,27(3):195-215

Summary. External sorting is usually accomplished by first creating sorted runs, then merging the runs. In the merge phase, writing and calculating can be overlapped by reading if two input buffers are used for each sorted run. If the memory is very large, the input buffers will be large and using two input buffers per sorted run will be more efficient than using only one input buffer per run and risking reduced overlap of reading and writing. In many cases, merging time can be cut in half. We derive a formula for estimating the total time for merging for a given memory size, file size, number of merging passes and for a given disk drive. We present an extreme example where in spite of having two buffers per run, significant non-overlap occurs. However, in realistic problems, we show that making one merge pass with two input buffers per run is near optimal. This contradicts earlier results on merging which do not take large memory into account.

19.

On single-walk parallelization of the job shop problem solving algorithms

Wojciech Bożejko 《Computers & Operations Research》2012,39(9):2258-2264

New parallel methods for determining the objective function of the job shop scheduling problem are proposed in this paper, considering the makespan and the sum of job execution times criteria. The methods proposed can also be applied to other popular objective functions such as job tardiness or flow time. The Parallel Random Access Machine (PRAM) model is applied for the theoretical analysis of algorithm efficiency. The methods need fine-grained parallelization; the approach proposed is therefore especially suited to parallel computing systems with fast shared memory (e.g. GPGPU, General-Purpose computing on Graphics Processing Units).

20.

Fast Generation of Random Permutations Via Networks Simulation

A. Czumaj P. Kanarek M. Kutylowski K. Lorys 《Algorithmica》1998,21(1):2-20

We consider the problem of generating random permutations with uniform distribution. That is, we require that for an arbitrary permutation π of n elements, with probability 1/n! the machine halts with the i-th output cell containing π(i), for 1 ≤ i ≤ n. We study this problem on two models of parallel computations: the CREW PRAM and the EREW PRAM. The main result of the paper is an algorithm for generating random permutations that runs in O(log log n) time and uses O(n^(1+o(1))) processors on the CREW PRAM. This is the first o(log n)-time CREW PRAM algorithm for this problem. On the EREW PRAM we present a simple algorithm that generates a random permutation in time O(log n) using n processors and O(n) space. This algorithm outperforms each of the previously known algorithms for the exclusive write PRAMs. The common and novel feature of both our algorithms is first to design a suitable random switching network generating a permutation and then to simulate this network on the PRAM model in a fast way. Received November 1996; revised March 1997.