期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Portable Parallel implementation of BLAS 3

A. Averbuch D. Amitai R. Friedman E. Gabber 《Concurrency and Computation》1994,6(5):411-459

The use of a massively parallel machine is aimed at the development of applications programs to solve most significant scientific, engineering, industrial and commercial problems. High-performance computing technology has emerged as a powerful and indispensable aid to scientific and engineering research, product and process development, and all aspects of manufacturing. Such computational power can be achieved only by massively parallel computers. It also requires a new and more effective mode of interaction between the computational sciences and applications and those parts of computer science concerned with the development of algorithms and software. We are interested in using parallel processing to handle large numerical tasks such as linear algebra problems. Yet, programming such systems has proven itself to be very complicated, error-prone and architecture-specific. One successful method for alleviating this problem, a method that worked well in the case of the massively pipelined supercomputers, is to use subprogram libraries. These libraries are built to efficiently perform some basic operations, while hiding low-level system specifics from the programmer. Efficiently porting a library to a new hardware, be it a vector machine or a shared memory or message passing based multiprocessor, is a major undertaking. It is a slow process that requires an intimate knowledge of the hardware features and optimization issues. We propose a scheme for the creation of portable implementations of such libraries. We present an implementation of BLAS (basic linear algebra subprograms), which is used as a standard linear algebra library. Our parallel implementation uses the virtual machine for multiprocessors (VMMP) (1990), which is a software package that provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors, both shared memory and message passing. VMMP is intended to simplify parallell program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementating the same services on all target multiprocessors. Software created using this scheme is automatically efficient on both categories of MIMD machines, and on any hardware VMMP has been ported to. An additional level of abstraction is achieved using the programming language C++, an object-oriented language. Eckel, Stroustrup, 1989, 1986). For the programmer who is using BLAS-3, it is hiding both the data structures used to define linear algebra objects, and the parallel nature of the operations performed on these objects. We coded BLAS on top of VMMP. This code was run without any modifications on two shared memory machines-the commercial Sequent Symmetry and the experimental Taunop. (The code should run on any machine the VMMP was ported onto, given the availability of a C++ compiler). Performance results for this implementation are given. The speed-up of the BLAS-3 routines, tested on 22 processors of the Sequent, was in the range of 8.68 to 15.89. Application programs (e.g. Cholesky factorization) using the library routines achieved similar efficiency. 相似文献

2.

VMMP: a practical tool for the development of portable andefficient programs for multiprocessors

Gabber E. 《Parallel and Distributed Systems, IEEE Transactions on》1990,1(3):304-317

The VMMP (virtual machine for multiprocessors) software package is presented. It provides a coherent set of services for parallel application programs running on diverse multiple input multiple data (MIMD) multiprocessors, including shared memory and message passing multiprocessors. The communication, synchronization, and data distribution requirements of parallel algorithms are analyzed. Related languages and tools are described. VMMP services are identified. VMMP implementation, coding and portability are discussed. Some measurements of the performance of VROMP application programs and VMMP overhead are given. Several hints for improving the performance of application programs are described 相似文献

3.

A flexibility coupled hypercube multiprocessor for high level vision

Myung H. Sunwoo J. K. Aggarwal 《Machine Vision and Applications》1992,5(2):127-138

In general, message passing multiprocessors suffer from communication overhead between processors and shared memory multiprocessors suffer from memory contention. Also, in computer vision tasks, data I/O overhead limits performance. In particular, high level vision tasks, which are complex and require nondeterministic communication, are strongly affected by these disadvantages. This paper proposes a flexibly (tightly/loosely) coupled hypercube multiprocessor (FCHM) for high level vision to alleviate these problems. A variable address space memory scheme in which a set of adjacent memory modules can be merged into a shared memory module by a dynamically partitionable hypercube topology is proposed. The architecture is quantitatively analyzed using computational models and simulated on the Intel’s Personal SuperComputer (iPSC/I), a hypercube multiprocessor. A parallel algorithm for exhaustive search is simulated on FCHM using the iPSC/I showing significant performance improvements over that of the iPSC/I. This research was supported in part by IBM corporation. 相似文献

4.

Parallel Algorithms for VLSI Layout Verification

Ky MacPherson Prithviraj Banerjee 《Journal of Parallel and Distributed Computing》1996,36(2):156

相似文献

5.

Distributed hardwired barrier synchronization for scalablemultiprocessor clusters

Shisheng Shang Kai Hwang 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(6):591-605

Conventional multiprocessors mostly use centralized, memory-based barriers to synchronize concurrent processes created in multiple processors. These centralized barriers often become the bottleneck or hot spots in the shared memory. In this paper, we overcome the difficulty by presenting a distributed and hardwired barrier architecture, that is hierarchically constructed for fast synchronization in cluster-structured multiprocessors. The hierarchical architecture enables the scalability of cluster-structured multiprocessors. A special set of synchronization primitives is developed for explicit use of distributed barriers dynamically. To show the application of the hardwired barriers, we demonstrate how to synchronize Doall and Doacross loops using a limited number of hardwired barriers. Timing analysis shows an O(10²) to O(10⁵) reduction in synchronization overhead, compared with the use of software-controlled barriers implemented in a shared memory. The hardwired architecture is effective in implementing any partially ordered set of barriers or fuzzy barriers with extended synchronization regions. The versatility, scalability, programmability, and low overhead make the distributed barrier architecture attractive in constructing fine-grain, massively parallel MIMD systems using multiprocessor clusters with distributed shared memory 相似文献

6.

Issues and experiences in implementing a distributed tuplespace

James B. Fenwick Lori L. Pollock 《Software》1997,27(10):1199-1232

相似文献

7.

PVM: A framework for parallel distributed computing

V. S. Sunderam 《Concurrency and Computation》1990,2(4):315-339

相似文献

8.

Programming language concepts for multiprocessors

Harry F. Jordan 《Parallel Computing》1988,8(1-3):31-40

It is currently possible to build multiprocessor systems which will support the tightly coupled activity of hundreds to thousands of different instruction streams, or processes. This can be done by coupling many monoprocessors, or a smaller number of pipelined multiprocessors, through a high concurrency switching network. The switching network may couple processors to memory modules, resulting in a shared memory multiprocessor system, or it may couple processor/memory pairs, resulting in a distributed memory system.

The need to direct the activity of very many processes simultaneously places qualitatively different demands on a programming language than the direction of a single process. In spite of the different requirements, most languages for multiprocessors have been simple extensions of conventional, single stream programming languages. The extensions are often implemented by way of subroutine calls and have little impact on the basic structure of the language. This paper attempts to examine the underlying conceptual structure of parallel languages for large-scale multiprocessors on the basis of an existing language for shared memory multiprocessors, known as the Force, and to extend the concepts in this language to distributed memory systems. 相似文献

9.

Sorting in mesh connected multiprocessors

Corbett P.F. Scherson I.D. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(5):626-632

A sorting algorithm, dubbed MeshSort, for multidimensional mesh-connected multiprocessors is introduced. Bitonic Sort and ShearSort are shown to be special cases of MeshSort. MeshSort thus provides some insight into the operation of parallel sorting. It requires operations only along orthogonal vectors of processors, simplifying the control of the multiprocessor. This allows MeshSort to be used on any reduced architecture where a multidimensional memory structure is interconnected with a lower dimensional structure of processors. A modified version of MeshSort, called FastMeshSort, is presented. This algorithm applies the same basic principle as MeshSort, and is almost as simple to implement, but achieves much better performance. The modified algorithm is shown to be very efficient for reasonably sized meshes. FastMeshSort is presented as a practical sorting and routing algorithm for real multidimensional mesh-connected multiprocessors. The algorithms can easily be extended to other multiprocessor structures 相似文献

10.

Visualization of Message Passing Parallel Programs with the TOPSYS Parallel Programming Environment

《Journal of Parallel and Distributed Computing》1993,18(2):118-128

Parallel programming is orders of magnitudes more complex than writing sequential programs. This is particularly true for programming distributed memory multiprocessor architectures based on message passing programming models. Apart from understanding the sequential parts of the parallel program, new degrees of freedom lead to additional problems. Understanding the synchronization and communication behavior of parallel programs is the most critical issue in programming distributed memory multiprocessors. The paper describes methods and tools for visualization and animation of the dynamic execution of parallel programs. Based on an evaluation and classification of existing visualization environments, the visualization and animation tool VISTOP (VISualization TOol for Parallel Systems) is presented as part of the integrated tool environment TOPSY S (TOols for Parallel SYStems) for programming distributed memory multiprocessors. VISTOP supports the interactive on-line visualization of message passing programs based on various views; in particular, a process graph based concurrency view for detecting synchronization and communication bugs. 相似文献

11.

Parallel Algorithms for Dynamic Shortest Path Problems

Ismail Chabini & Sridevi Ganugapati 《International Transactions in Operational Research》2002,9(3):279-302

The development of intelligent transportation systems (ITS) and the resulting need for the solution of a variety of dynamic traffic network models and management problems require faster‐than‐real‐time computation of shortest path problems in dynamic networks. Recently, a sequential algorithm was developed to compute shortest paths in discrete time dynamic networks from all nodes and all departure times to one destination node. The algorithm is known as algorithm DOT and has an optimal worst‐case running‐time complexity. This implies that no algorithm with a better worst‐case computational complexity can be discovered. Consequently, in order to derive algorithms to solve all‐to‐one shortest path problems in dynamic networks, one would need to explore avenues other than the design of sequential solution algorithms only. The use of commercially‐available high‐performance computing platforms to develop parallel implementations of sequential algorithms is an example of such avenue. This paper reports on the design, implementation, and computational testing of parallel dynamic shortest path algorithms. We develop two shared‐memory and two message‐passing dynamic shortest path algorithm implementations, which are derived from algorithm DOT using the following parallelization strategies: decomposition by destination and decomposition by transportation network topology. The algorithms are coded using two types of parallel computing environments: a message‐passing environment based on the parallel virtual machine (PVM) library and a multi‐threading environment based on the SUN Microsystems Multi‐Threads (MT) library. We also develop a time‐based parallel version of algorithm DOT for the case of minimum time paths in FIFO networks, and a theoretical parallelization of algorithm DOT on an ‘ideal’ theoretical parallel machine. Performances of the implementations are analyzed and evaluated using large transportation networks, and two types of parallel computing platforms: a distributed network of Unix workstations and a SUN shared‐memory machine containing eight processors. Satisfactory speed‐ups in the running time of sequential algorithms are achieved, in particular for shared‐memory machines. Numerical results indicate that shared‐memory computers constitute the most appropriate type of parallel computing platforms for the computation of dynamic shortest paths for real‐time ITS applications. 相似文献

12.

Efficient message passing on UNIX shared memory multiprocessors

Massimo Bernaschi 《Future Generation Computer Systems》1998,13(6):443-449

We describe a single-copy mechanism which enables an efficient message passing among UNIX processes on shared memory multiprocessors. A special version of PVMe, IBM's AIX implementation of the PVM message passing programming model, has been built based on this approach. Some preliminary results here reported show the clear advantage of the single-copy with respect to more conventional schemes. 相似文献

13.

Grasp: Visualizing the behavior of hierarchical multiprocessor real-time systems

《Journal of Systems Architecture》2013,59(6):307-314

Trace visualization is a viable approach for gaining insight into the behavior of complex distributed real-time systems. Grasp is a versatile trace visualization toolset. Its flexible plugin infrastructure allows for easy extension with custom visualization and analysis techniques for automatic trace verification. This paper presents its visualization capabilities for hierarchical multiprocessor systems, including partitioned and global multiprocessor scheduling with migrating tasks and jobs, communication between jobs via shared memory and message passing, and hierarchical scheduling in combination with multiprocessor scheduling. For tracing distributed systems with asynchronous local clocks Grasp also supports the synchronization of traces from different processors during the visualization and analysis. 相似文献

14.

基于Message Passing的并行编程环境 总被引：3，自引：0，他引：3

刘欣然胡铭曾《计算机工程》1997,23(5):17-20

在分布式并行计算机系统中，由于处理机间无共享内存，因此采用了ＭｅｓｓａｇｅＰａｓｓｉｎｇ方式实现处理间的通信。文中讨论了基于ＭｅｓｓａｇｅＰａｓｓｉｎｇ的并行编程环境所应具备的特点，然后介绍几种被人们普遍接受的并行编程环境。相似文献

15.

天气雷达资料实时并行处理方法 总被引：1，自引：1，他引：0

下载免费PDF全文

王志斌陈波万玉发吴涛罗兵沃伟峰《计算机工程》2009,35(23):255-257

利用共享存储多处理器的集群环境,研究高频度实时多部天气雷达资料处理的并行计算方法,根据单部天气雷达的计算特点和多部雷达混合处理的方法,提出一种粗粒度消息传递接口分布式内存和细粒度OpenMP共享内存混合编程的2级并行方法。实验结果表明,该方法使系统资料处理速度得到较大提高。相似文献

16.

Super－Object模型：在分布存储器多计算机上实现共享存储器程序设计模式

窦勇周兴铭《计算机学报》1995,(7)

Ｓｕｐｅｒ－Ｏｂｊｅｃｔ模型提出了一种新的方法，在分布存储器多计算机上实现语言级虚拟共享存储器以支持共享存储器通信模式．Ｓｕｐｅｒ－Ｏｂｊｅｃｔ模型引入新的概念ｓｕｐｅｒ－ｏｂｊｅｃｔ，不同于其它模型，基于ｓｕｐｅｒ－ｏｂｊｅｃｔ，它提出了新的共享数据定位方法，全局地址标识（ｎａｍｅ，ｏｆｆ－ｓｅｔ）．Ｓｕｐｅｒ－Ｏｂｊｅｃｔ模型与Ｆｏｒｔｒａｎ７７结合，我们实现了一个运行时间系统和库调用，支持程序员使用Ｆｏｒｔｒａｎ语言编写并行程序，最后介绍了系统的实现和取得的性能．相似文献

17.

曙光1000A上消息传递与共享存储的比较 总被引：14，自引：2，他引：12

唐志敏施巍松胡伟武《计算机学报》2000,23(2):134-140

分布式共享存储虽然有易于编程的优点,但往往被认为效率不高、完全由软件实现的分布式共享存储系统（又称为虚拟共享存储系统）更是如此,文中以典型的消息传递系统ＰＶＭ与分布式共享存储系统ＪＩＡＪＩＡ粉列,报这两种并行程序设计环境的特点,并用７个应用程序在曙光１０００Ａ上分别比较了这两个系统的性能,实验３结果表明,ＪＩＡＪＩＡ的与ＰＶ玎当,但基于ＪＩＡＪＩＡ的并行程序设计却比ＰＶＮ简单得多。相似文献

18.

Distributed shared abstractions (DSA) on multiprocessors

Clemencon C. Mukherjee B. Schwan K. 《IEEE transactions on pattern analysis and machine intelligence》1996,22(2):132-152

Any parallel program has abstractions that are shared by the program's multiple processes. Such shared abstractions can considerably affect the performance of parallel programs, on both distributed and shared memory multiprocessors. As a result, their implementation must be efficient, and such efficiency should be achieved without unduly compromising program portability and maintainability. The primary contribution of the DSA library is its representation of shared abstractions as objects that may be internally distributed across different nodes of a parallel machine. Such distributed shared abstractions (DSA) are encapsulated so that their implementations are easily changed while maintaining program portability across parallel architectures. The principal results presented are: a demonstration that the fragmentation of object state across different nodes of a multiprocessor machine can significantly improve program performance; and that such object fragmentation can be achieved without compromising portability by changing object interfaces. These results are demonstrated using implementations of the DSA library on several medium scale multiprocessors, including the BBN Butterfly, Kendall Square Research, and SGI shared memory multiprocessors. The DSA library's evaluation uses synthetic workloads and a parallel implementation of a branch and bound algorithm for solving the traveling salesperson problem (TSP) 相似文献

19.

A shared memory algorithm and proof for the generalized alternative construct in CSP

Richard M. Fujimoto Hwa-chung Feng 《International journal of parallel programming》1987,16(3):215-241

Communicating Sequential Processes (CSP) is a paradigm for communication and synchronization among distributed processes. The alternative construct is a key feature of CSP that allows nondeterministic selection of one among several possible communicants. A generalized version of Hoare's original alternative construct that allows output commands to be included in guards has been proposed. Previous algorithms for this construct assume a message passing architecture and are not appropriate for multiprocessor systems that feature shared memory. This paper describes a distributed algorithm for the generalized alternative construct that exploits the capabilities of a parallel computer with shared memory. A correctness proof of the proposed algorithm is presented to show that the algorithm conforms to somesatefy andliveness criteria. Extensions to allow termination of processes and to ensure fairness in guard selection are also given.This work was supported by ONR Contract Number N00014-87-K-0184. 相似文献

20.

Portable parallelizing Fortran compiler

A. Averbuch R. Dekel E. Gabber 《Concurrency and Computation》1996,8(2):91-123

The Portable Parallelizing Fortran Compiler (PPFC) is an additional component for the portable programming environment developed in Tel-Aviv University for scientific code. This environment supports portable and efficient programming of diverse MIMD multiprocessors, both distributed- and shared-memory. Till now this environment has consisted of two tools: the Virtual Machine for MultiProcessors (VMMP) and the Portable Parallelizing Pascal compiler (P³C). We have added the PPFC which is an automatic parallelizer compiler for the Fortran language. The compiler is fully automatic (does not require additional declarations to assist parallelization), which is characterized by loops operating on regular data structures, and produces efficient and portable code for a variety of multiprocessors from the same serial code. The parallel implementation uses the VMMP, which is a software package that provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. The PPFC parallelized 12 out of the 24 Livermore Loops. It was also applied to parallelize all the 14 Fortran application programs that where parallelized by the P³C and achieved the same speed-ups and efficiencies. In most examples the PPFC achieved high speed-ups and efficiencies on all target multiprocessors. The PPFC emphasizes efficiency and code portability. Although PPFC employs a relatively simple data flow analysis, it produces efficient code for various widely used application programs. 相似文献