Similar Documents (20 results)
1.
Distributed systems that consist of workstations connected by high performance interconnects offer computational power comparable to moderate size parallel machines. Middleware like distributed shared memory (DSM) or distributed shared objects (DSO) attempts to improve the programmability of such hardware by presenting to application programmers interfaces similar to those offered by shared memory machines. This paper presents the portable Indigo data sharing library, which provides a small set of primitives with which arbitrary shared abstractions are easily and efficiently implemented across distributed hardware platforms. Sample shared abstractions implemented with Indigo include DSM as well as fragmented objects, where the object state is split across different machines and where inter-fragment communication may be customized to application-specific consistency needs. The Indigo library's design and implementation are evaluated on two different target platforms: a workstation cluster and an IBM SP2 machine. As part of this evaluation, a novel DSM system and consistency protocol are implemented and evaluated with several high performance applications. Application performance attained with the DSM system is compared to the performance experienced when utilizing the underlying basic message-passing facilities or when employing Indigo to construct customized fragmented objects implementing the application's shared state. This experimentation yields insights concerning the efficient implementation of DSM systems (e.g. how to deal with false sharing). It also leads to the conclusion that Indigo provides a sufficiently rich set of abstractions for efficient implementation of the next generation of parallel programming models for high performance machines. © 1998 John Wiley & Sons, Ltd.
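The abstract does not spell out Indigo's primitives, so the following is only a minimal sketch of the fragmented-object idea it describes, written against plain MPI: each process owns one fragment of a shared counter, updates stay local, and inter-fragment communication happens only when a consistent global view is requested. All names are hypothetical, not Indigo's real API.

```c
/* Hypothetical fragmented-object sketch: each rank owns a fragment of
 * a shared counter; a consistent global view is materialized only on
 * demand (application-specific consistency). Not Indigo's actual API. */
#include <mpi.h>
#include <stdio.h>

typedef struct { long local_count; } CounterFragment;

static void fragment_add(CounterFragment *f, long n) { f->local_count += n; }

/* Inter-fragment communication happens only here. */
static long fragment_combine(const CounterFragment *f, MPI_Comm comm) {
    long global = 0;
    MPI_Allreduce(&f->local_count, &global, 1, MPI_LONG, MPI_SUM, comm);
    return global;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    CounterFragment frag = { 0 };
    fragment_add(&frag, rank + 1);          /* purely local, no messages */
    long total = fragment_combine(&frag, MPI_COMM_WORLD);
    if (rank == 0) printf("global count = %ld\n", total);
    MPI_Finalize();
    return 0;
}
```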

2.
A portable parallelization of the Cooley–Tukey FFT algorithm for MIMD multiprocessors is presented. The implementation uses the virtual machine for multiprocessors (VMMP) and PVM portable software packages. Since VMMP provides the same set of services on all target machines, a single version of the parallel FFT code was used on shared memory (25-processor Sequent Symmetry), shared bus (MOS, a distributed UNIX) and distributed memory multiprocessors (a transputer network and a 64-processor IBM SP2). It is accompanied by a detailed performance analysis of the implementations. The algorithm achieved high efficiencies on all target machines. The analysis indicates that most overheads are caused by the target architecture and not by VMMP or PVM inefficiencies. The portability analysis of the FFT provides several important insights. On the message passing architecture, the parallel FFT algorithm can obtain linearly increasing speedup with respect to the number of processors with only a moderate increase in the problem size. The parallel FFT can be executed by any number of processors, but generally the number of processors is much smaller than the length of the input data. The results indicate that the parallel FFT is portable: it achieves very good speedups either on a shared memory multiprocessor with high memory bandwidth or on a message passing multiprocessor, without any change to the programs. © 1998 John Wiley & Sons, Ltd.
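For context, here is a minimal sequential radix-2 Cooley–Tukey kernel in C. The outer loop over independent butterfly blocks is the natural unit a parallel version distributes across processors; the VMMP/PVM layer the paper uses is omitted here, so this is a sketch of the algorithm, not the paper's code.

```c
/* Minimal iterative radix-2 Cooley-Tukey FFT (sequential kernel).
 * n must be a power of two. The marked loop iterates over independent
 * blocks and is the natural target for parallel decomposition. */
#include <complex.h>
#include <math.h>   /* M_PI; may need -D_USE_MATH_DEFINES on some platforms */

static void fft(double complex *a, int n) {
    /* bit-reversal permutation */
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j |= bit;
        if (i < j) { double complex t = a[i]; a[i] = a[j]; a[j] = t; }
    }
    /* butterfly stages: stage 'len' combines pairs of len/2 blocks */
    for (int len = 2; len <= n; len <<= 1) {
        double complex w = cexp(-2.0 * I * M_PI / len);
        for (int i = 0; i < n; i += len) {   /* independent blocks -> parallelizable */
            double complex wk = 1.0;
            for (int k = 0; k < len / 2; k++) {
                double complex u = a[i + k], v = wk * a[i + k + len / 2];
                a[i + k] = u + v;
                a[i + k + len / 2] = u - v;
                wk *= w;
            }
        }
    }
}
```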

3.
The authors examine the design, implementation, and experimental analysis of parallel priority queues for device and network simulation. They consider: 1) distributed splay trees using MPI; 2) concurrent heaps using shared memory atomic locks; and 3) a new, more general concurrent data structure based on distributed sorted lists, designed to provide dynamically balanced work allocation and efficient use of shared memory resources. We evaluate performance for all three data structures on a Cray-T3E system at KFA-Jülich. Our comparisons are based on simulations of single buffers and a 64×64 packet switch which supports multicasting. In all implementations, PEs monitor traffic at their preassigned input/output ports, while priority queue elements are distributed across the Cray-T3E virtual shared memory. Our experiments with up to 60000 packets and two to 64 PEs indicate that concurrent priority queues perform much better than distributed ones. Both concurrent implementations have comparable performance, while our new data structure uses less memory and has been further optimized. We also consider parallel simulation for symmetric networks by sorting integer conflict functions and implementing a packet indexing scheme. The optimized message passing network simulator can process about 500K packet moves per second, with an efficiency that exceeds 50 percent for a few thousand packets on the Cray-T3E with 32 PEs. All developed data structures form a parallel library. Although our concurrent implementations use the Cray-T3E ShMem library, portability can be derived from OpenMP or MPI-2 standard libraries, which provide support for one-sided communication and shared memory lock mechanisms
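The paper's concurrent heaps rely on Cray-T3E ShMem atomic locks; as a portable stand-in, the sketch below is a lock-protected binary min-heap in C with a pthreads mutex playing the lock's role. Capacity, key type, and names are illustrative only.

```c
/* Portable analogue of a lock-protected concurrent binary min-heap.
 * Initialize 'lock' with PTHREAD_MUTEX_INITIALIZER before use. */
#include <pthread.h>

#define CAP 1024
typedef struct {
    double key[CAP];              /* packet timestamps, for example */
    int    size;
    pthread_mutex_t lock;
} ConcurrentHeap;

void heap_insert(ConcurrentHeap *h, double k) {
    pthread_mutex_lock(&h->lock);
    int i = h->size++;
    h->key[i] = k;
    while (i > 0 && h->key[(i - 1) / 2] > h->key[i]) {        /* sift up */
        double t = h->key[i]; h->key[i] = h->key[(i-1)/2]; h->key[(i-1)/2] = t;
        i = (i - 1) / 2;
    }
    pthread_mutex_unlock(&h->lock);
}

double heap_delete_min(ConcurrentHeap *h) {   /* caller ensures non-empty */
    pthread_mutex_lock(&h->lock);
    double min = h->key[0];
    h->key[0] = h->key[--h->size];
    int i = 0;                                                /* sift down */
    for (;;) {
        int l = 2*i + 1, r = 2*i + 2, s = i;
        if (l < h->size && h->key[l] < h->key[s]) s = l;
        if (r < h->size && h->key[r] < h->key[s]) s = r;
        if (s == i) break;
        double t = h->key[i]; h->key[i] = h->key[s]; h->key[s] = t;
        i = s;
    }
    pthread_mutex_unlock(&h->lock);
    return min;
}
```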

4.
The use of a massively parallel machine is aimed at the development of application programs to solve the most significant scientific, engineering, industrial and commercial problems. High-performance computing technology has emerged as a powerful and indispensable aid to scientific and engineering research, product and process development, and all aspects of manufacturing. Such computational power can be achieved only by massively parallel computers, and it requires a new and more effective mode of interaction between the computational sciences and applications and those parts of computer science concerned with the development of algorithms and software. We are interested in using parallel processing to handle large numerical tasks such as linear algebra problems. Yet programming such systems has proven to be complicated, error-prone and architecture-specific. One successful method for alleviating this problem, a method that worked well in the case of the massively pipelined supercomputers, is to use subprogram libraries. These libraries are built to perform some basic operations efficiently while hiding low-level system specifics from the programmer. Efficiently porting a library to new hardware, be it a vector machine or a shared memory or message passing based multiprocessor, is a major undertaking: a slow process that requires an intimate knowledge of the hardware features and optimization issues. We propose a scheme for the creation of portable implementations of such libraries, and we present an implementation of BLAS (basic linear algebra subprograms), which is used as a standard linear algebra library. Our parallel implementation uses the virtual machine for multiprocessors (VMMP), a software package that provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors, both shared memory and message passing. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming; furthermore, it ensures high portability of application programs by implementing the same services on all target multiprocessors. Software created using this scheme is automatically efficient on both categories of MIMD machines, and on any hardware VMMP has been ported to. An additional level of abstraction is achieved using the object-oriented programming language C++ (Eckel 1989; Stroustrup 1986): for the programmer using BLAS-3, it hides both the data structures used to define linear algebra objects and the parallel nature of the operations performed on these objects. We coded BLAS on top of VMMP. This code ran without any modifications on two shared memory machines: the commercial Sequent Symmetry and the experimental Taunop. (The code should run on any machine VMMP has been ported to, given the availability of a C++ compiler.) Performance results for this implementation are given. The speed-up of the BLAS-3 routines, tested on 22 processors of the Sequent, was in the range of 8.68 to 15.89. Application programs (e.g. Cholesky factorization) using the library routines achieved similar efficiency.
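By way of illustration, here is the row-blocked decomposition typical of a parallel BLAS-3 matrix multiply, with OpenMP standing in for the VMMP dispatch layer the paper actually uses (compile with -fopenmp). This is a sketch of the work partitioning under those assumptions, not the paper's code.

```c
/* Sketch of the work decomposition behind a parallel BLAS-3 GEMM:
 * C = A*B is split into independent row blocks, one per worker.
 * Matrices are row-major, n x n. */
void gemm_rowblocked(int n, const double *A, const double *B, double *C) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {          /* each row of C is independent */
        for (int j = 0; j < n; j++) C[i*n + j] = 0.0;
        for (int k = 0; k < n; k++) {
            double aik = A[i*n + k];       /* hoisted; inner loop streams B, C */
            for (int j = 0; j < n; j++)
                C[i*n + j] += aik * B[k*n + j];
        }
    }
}
```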

5.
The VMMP (virtual machine for multiprocessors) software package is presented. It provides a coherent set of services for parallel application programs running on diverse multiple instruction, multiple data (MIMD) multiprocessors, including shared memory and message passing multiprocessors. The communication, synchronization, and data distribution requirements of parallel algorithms are analyzed. Related languages and tools are described. VMMP services are identified. VMMP implementation, coding and portability are discussed. Some measurements of the performance of VMMP application programs and VMMP overhead are given. Several hints for improving the performance of application programs are described

6.
The tension between software development costs and efficiency is especially high when considering parallel programs intended to run on a variety of architectures. In the domain of shared memory architectures and explicitly parallel programs, the authors have addressed this problem by defining a programming structure that eases the development of effectively portable programs. On each target multiprocessor, an effectively portable program runs almost as efficiently as a program fine-tuned for that machine. Additionally, its software development cost is close to that of a single program that is portable across the targets. Using this model, programs are defined in terms of data structure and partitioning-scheduling abstractions. Low software development cost is attained by writing source programs in terms of abstract interfaces and thereby requiring minimal modification to port; high performance is attained by matching (often dynamically) the interfaces to implementations that are most appropriate to the execution environment. The authors include results of a prototype used to evaluate the benefits and costs of this approach
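As a concrete rendition of "programs defined in terms of abstract interfaces", the C sketch below shows the pattern: application code calls a scheduler interface, and each target machine supplies its own binding. The interface and all names are hypothetical, not the authors' actual API.

```c
/* A minimal rendition of programming against an abstract interface:
 * the application depends only on 'Scheduler'; the binding chosen at
 * startup matches the target machine. Names are illustrative. */
#include <stddef.h>

typedef struct {
    /* run body(i, arg) for i in [lo, hi); implementation decides how */
    void (*parallel_for)(size_t lo, size_t hi,
                         void (*body)(size_t i, void *arg), void *arg);
} Scheduler;

/* One possible binding: a serial fallback for uniprocessors. A
 * multiprocessor port would swap in a thread-based implementation. */
static void serial_for(size_t lo, size_t hi,
                       void (*body)(size_t, void *), void *arg) {
    for (size_t i = lo; i < hi; i++) body(i, arg);
}

static const Scheduler serial_sched = { serial_for };

/* Porting the application then means swapping the Scheduler binding,
 * not rewriting the program. */
```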

7.
The availability of commodity multiprocessors and high-speed networks of workstations offers significant opportunities for addressing the increasing computational requirements of optimization applications. To leverage these potential benefits, however, it is important to make parallel and distributed processing easily accessible to a wide audience of optimization programmers. This paper addresses this challenge by proposing parallel and distributed programming abstractions that keep the distance from sequential local search algorithms as small as possible. The abstractions, including parallel loops, interruptions, thread pools, and shared objects, are compositional and cleanly separate the optimization program from the parallel instructions. They have been evaluated experimentally on a variety of applications, including warehouse location and coloring, for which they provide significant speedups.
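The paper's abstractions are library-level; the pthreads sketch below shows only the underlying pattern of a parallel loop over independent local-search restarts with a shared best-so-far object. The objective function and all names are stand-ins.

```c
/* Pattern sketch: parallel restarts of a local search, combining
 * results through a lock-guarded shared object. */
#include <pthread.h>
#include <stdlib.h>
#include <limits.h>

typedef struct {
    int best_cost;                    /* shared object guarded by lock */
    pthread_mutex_t lock;
} Best;

typedef struct { Best *best; unsigned seed; } WorkerArg;

static int local_search(unsigned *seed) {     /* stand-in objective */
    return (int)(rand_r(seed) % 1000);
}

static void *worker(void *p) {
    WorkerArg *w = p;
    int cost = local_search(&w->seed);
    pthread_mutex_lock(&w->best->lock);
    if (cost < w->best->best_cost) w->best->best_cost = cost;
    pthread_mutex_unlock(&w->best->lock);
    return NULL;
}

int parallel_restarts(int nthreads) {         /* assumes nthreads <= 64 */
    Best best = { INT_MAX, PTHREAD_MUTEX_INITIALIZER };
    pthread_t tid[64];
    WorkerArg arg[64];
    for (int t = 0; t < nthreads; t++) {
        arg[t].best = &best; arg[t].seed = 1234u + (unsigned)t;
        pthread_create(&tid[t], NULL, worker, &arg[t]);
    }
    for (int t = 0; t < nthreads; t++) pthread_join(tid[t], NULL);
    return best.best_cost;
}
```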

8.
BSPlib: The BSP programming library
BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming in two distinct styles: direct remote memory access (DRMA) using put or get operations, and bulk synchronous message passing (BSMP). Currently, implementations of BSPlib exist for a variety of modern architectures, including massively parallel computers with distributed memory, shared memory multiprocessors, and networks of workstations. BSPlib has been used in several scientific and industrial applications; this paper briefly describes applications in benchmarking, Fast Fourier Transforms (FFTs), sorting, and molecular dynamics.
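A minimal DRMA-style superstep using the published BSPlib C operations (bsp_begin, bsp_push_reg, bsp_put, bsp_sync): every process writes its pid into a registered slot on its right neighbour. The header name and initialization conventions can vary between implementations, so treat the scaffolding as approximate.

```c
/* Minimal DRMA superstep in BSPlib: each process bsp_put()s its pid
 * into a registered slot on its right neighbour. Some implementations
 * require bsp_init() before bsp_begin(); omitted here for brevity. */
#include <bsp.h>
#include <stdio.h>

int main(void) {
    bsp_begin(bsp_nprocs());
    int p = bsp_nprocs(), pid = bsp_pid();
    int slot = -1;
    bsp_push_reg(&slot, sizeof(int));   /* make 'slot' remotely writable */
    bsp_sync();                         /* registration takes effect */

    int right = (pid + 1) % p;
    bsp_put(right, &pid, &slot, 0, sizeof(int));
    bsp_sync();                         /* superstep boundary: puts land */

    printf("proc %d received %d from its left neighbour\n", pid, slot);
    bsp_pop_reg(&slot);
    bsp_end();
    return 0;
}
```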

9.
The development of intelligent transportation systems (ITS) and the resulting need to solve a variety of dynamic traffic network models and management problems require faster-than-real-time computation of shortest path problems in dynamic networks. Recently, a sequential algorithm was developed to compute shortest paths in discrete time dynamic networks from all nodes and all departure times to one destination node. The algorithm, known as algorithm DOT, has an optimal worst-case running-time complexity, which implies that no algorithm with a better worst-case computational complexity can be discovered. Consequently, in order to derive faster algorithms for all-to-one shortest path problems in dynamic networks, one must explore avenues other than the design of sequential algorithms alone. The use of commercially available high-performance computing platforms to develop parallel implementations of sequential algorithms is one such avenue. This paper reports on the design, implementation, and computational testing of parallel dynamic shortest path algorithms. We develop two shared-memory and two message-passing dynamic shortest path algorithm implementations, derived from algorithm DOT using the following parallelization strategies: decomposition by destination and decomposition by transportation network topology. The algorithms are coded using two types of parallel computing environments: a message-passing environment based on the parallel virtual machine (PVM) library and a multi-threading environment based on the SUN Microsystems Multi-Threads (MT) library. We also develop a time-based parallel version of algorithm DOT for the case of minimum time paths in FIFO networks, and a theoretical parallelization of algorithm DOT on an 'ideal' theoretical parallel machine. The performance of the implementations is analyzed and evaluated using large transportation networks and two types of parallel computing platforms: a distributed network of Unix workstations and a SUN shared-memory machine containing eight processors. Satisfactory speed-ups over the sequential algorithms are achieved, in particular on shared-memory machines. Numerical results indicate that shared-memory computers constitute the most appropriate type of parallel computing platform for the computation of dynamic shortest paths for real-time ITS applications.
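The abstract names algorithm DOT but not its recursion; the sketch below shows the decreasing-order-of-time dynamic program generally associated with it, under stated simplifications: integral travel times of at least one period, and departure times beyond the horizon treated as unreachable rather than seeded with static shortest paths, as the full algorithm does. Array shapes and names are illustrative.

```c
/* Sketch of a decreasing-order-of-time recursion for all-to-one
 * dynamic shortest paths: labels at departure time t depend only on
 * labels at later times, so one backward sweep over t suffices. */
#include <limits.h>

#define N 4     /* nodes; node 0 is the destination */
#define T 8     /* discrete departure-time periods  */

/* tt[i][j][t] = travel time of arc (i,j) departing at t (>= 1), or -1 */
int tt[N][N][T];
/* label[i][t] = min travel time from i, departing at t, to node 0 */
int label[N][T + 1];

void dot_all_to_one(void) {
    for (int i = 0; i < N; i++)
        for (int t = 0; t <= T; t++)
            label[i][t] = (i == 0) ? 0 : INT_MAX;

    for (int t = T - 1; t >= 0; t--)          /* decreasing order of time */
        for (int i = 1; i < N; i++)
            for (int j = 0; j < N; j++) {
                int d = tt[i][j][t];
                if (d < 0) continue;          /* no arc (i,j) at time t */
                int arr = t + d;
                if (arr > T) continue;        /* beyond horizon: simplified */
                if (label[j][arr] != INT_MAX && d + label[j][arr] < label[i][t])
                    label[i][t] = d + label[j][arr];
            }
}
```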

10.
The Configuration Toolkit (CTK) is a library for constructing configurable object-based abstractions that are part of multiprocessor programs or operating systems. The library is unique in its exploration of runtime configuration for attaining performance improvements: 1) its programming model facilitates the expression and implementation of program configuration; and 2) its efficient runtime support enables performance improvements by the configuration of program components during their execution. Program configuration is attained without compromising the encapsulation or the reuse of software abstractions. CTK programs are configured using attributes associated with object classes, object instances, state variables, operations, and object invocations. At runtime, such attributes are interpreted by policy classes, which may be varied separately from the abstractions with which they are associated. Using policies and attributes, an object's runtime behavior may be varied by: 1) changing its performance or reliability while preserving the implementation of its functional behavior, or 2) changing the implementation of its internal computational strategy. CTK's multiprocessor implementation is layered on a Cthreads-compatible programming library, which results in its portability to a wide variety of uni- and multiprocessor machines, including a Kendall Square KSR-2 supercomputer, SGI machines, various SUN workstations, and as a native kernel on the GP1000 BBN Butterfly multiprocessor. The platforms evaluated in the paper are the KSR and SGI machines

11.
Practical parallel algorithms, based on classical sequential Union-Find algorithms for computing transitive closures of binary relations, are described and implemented for both shared memory and distributed memory parallel computers. By practical algorithms, we mean algorithms that are efficient for parallel systems with bounded numbers of processors, as opposed to algorithms where the number of processors grows with the problem size. Transitive closures are useful for decomposing many application problems into independent subproblems. The implementations were on an ENCORE Multimax shared memory machine and an NCUBE hypercube. Our implementations indicate that transitive closure computations are intrinsically difficult for distributed memory parallel machines because of the need for global information. By contrast, our results for shared memory machines exhibited excellent speedups.
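For reference, here is the classical sequential Union-Find kernel that such parallel versions build on: union by rank with path compression. Each resulting set is one component of the relation's transitive closure. Array sizes are illustrative.

```c
/* Classical sequential Union-Find: path compression + union by rank.
 * Merging (a, b) for every related pair leaves each transitive-closure
 * component as one set. */
#define MAXN 1000000
static int parent[MAXN], rank_[MAXN];

void uf_init(int n) {
    for (int i = 0; i < n; i++) { parent[i] = i; rank_[i] = 0; }
}

int uf_find(int x) {                 /* path halving compresses as it walks */
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

void uf_union(int a, int b) {        /* attach the shorter tree to the taller */
    int ra = uf_find(a), rb = uf_find(b);
    if (ra == rb) return;
    if (rank_[ra] < rank_[rb]) { int t = ra; ra = rb; rb = t; }
    parent[rb] = ra;
    if (rank_[ra] == rank_[rb]) rank_[ra]++;
}
```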

12.
13.
Stunkel, C.B.; Janssens, B.; Fuchs, W.K. Computer, 1991, 24(1): 31-38
Recently implemented parallel system address-tracing methods based on several metrics are surveyed. The issues specific to collection of traces for both shared and distributed memory parallel computers are highlighted. Five general categories of address-trace collection methods are examined: hardware-captured, interrupt-based, simulation-based, altered microcode-based, and instrumented program-based traces. The problems unique to shared memory and distributed memory multiprocessors are examined separately

14.
15.
Coordination of parallel activities on a shared memory machine is a crucial issue for modern software, all the more with the advent of multi-core processors. Unfortunately, traditional concurrency abstractions force programmers to tangle the application logic with the synchronization concern, thereby compromising understandability and reuse, and fall short when fine-grained and expressive strategies are needed. This paper presents a new concurrency abstraction called POM, the parallel object monitor, supporting expressive means for coordination of parallel activities over one or more objects, while allowing a clean separation of the coordination concern from application code. Expressive and reusable strategies for concurrency control can be designed thanks to full access to the queue of pending requests, parallel execution of dispatched requests together with after-actions, and complete control over reentrancy. A small domain-specific aspect language is provided to configure pre-packaged, off-the-shelf synchronizations. Copyright © 2007 John Wiley & Sons, Ltd.
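POM's actual interface is richer than an abstract can show; the pthreads sketch below illustrates only the core pattern it generalizes: requests accumulate in a pending queue, and a dispatcher with full access to that queue decides what runs next (plain FIFO here). All names are stand-ins.

```c
/* Skeleton of a request-queue monitor: submit enqueues work, a
 * dispatcher thread drains it. Richer schedulers could reorder or
 * batch the queue; FIFO is shown. */
#include <pthread.h>

typedef struct Request {
    void (*run)(void *); void *arg;
    struct Request *next;
} Request;

typedef struct {
    Request *head, *tail;            /* queue of pending requests */
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
} Monitor;

void monitor_submit(Monitor *m, Request *r) {
    pthread_mutex_lock(&m->lock);
    r->next = NULL;
    if (m->tail) m->tail->next = r; else m->head = r;
    m->tail = r;
    pthread_cond_signal(&m->nonempty);
    pthread_mutex_unlock(&m->lock);
}

void *monitor_dispatch(void *p) {    /* runs forever as the dispatcher */
    Monitor *m = p;
    for (;;) {
        pthread_mutex_lock(&m->lock);
        while (!m->head) pthread_cond_wait(&m->nonempty, &m->lock);
        Request *r = m->head;
        m->head = r->next;
        if (!m->head) m->tail = NULL;
        pthread_mutex_unlock(&m->lock);
        r->run(r->arg);              /* execute outside the lock */
    }
    return NULL;
}
```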

16.
The Stanford Dash multiprocessor
The overall goals and major features of the directory architecture for shared memory (Dash) are presented. The fundamental premise behind the architecture is that it is possible to build a scalable high-performance machine with a single address space and coherent caches. The Dash architecture is scalable in that it achieves linear or near-linear performance growth as the number of processors increases from a few to a few thousand. This performance results from distributing the memory among processing nodes and using a network with scalable bandwidth to connect the nodes. The architecture allows shared data to be cached, significantly reducing the latency of memory accesses and yielding higher processor utilization and higher overall performance. A distributed directory-based protocol that provides cache coherence without compromising scalability is discussed in detail. The Dash prototype machine and the corresponding software support are described
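The protocol itself is detailed in the paper; the toy C model below shows just the directory-entry idea behind such schemes: a bit vector of sharers per memory block plus a dirty flag, consulted on a read miss. Field names, the fixed 64-node size, and the handler logic are illustrative, not Dash's actual implementation.

```c
/* Toy model of a directory entry for directory-based coherence:
 * full bit vector of sharers per memory block plus a dirty flag.
 * Forwarding, invalidation acks, and races are elided. */
#include <stdint.h>
#include <stdbool.h>

#define NODES 64

typedef struct {
    uint64_t sharers;        /* bit i set => node i caches the block */
    bool     dirty;          /* one sharer holds the block modified  */
} DirEntry;

/* Read miss from 'node': if clean, memory supplies data and the node
 * joins the sharers; if dirty, the owner must supply the data and the
 * block reverts to shared. Returns the owner to forward to, or -1. */
static int handle_read_miss(DirEntry *e, int node) {
    int owner = -1;
    if (e->dirty) {
        for (int i = 0; i < NODES; i++)        /* sole sharer is the owner */
            if (e->sharers & (1ull << i)) { owner = i; break; }
        e->dirty = false;
    }
    e->sharers |= 1ull << node;
    return owner;
}
```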

17.
We address the main issues when porting existing codes from serial to parallel computers and when developing portable parallel software on MIMD multiprocessors (shared memory, virtual shared memory, and distributed memory multiprocessors, as well as networks of computers). We discuss the use of numerical libraries as a way of developing portable and efficient parallel code. We illustrate this with examples from our experience in porting industrial codes and in designing parallel numerical libraries. We report in some detail on the parallelization of scientific applications coming from Centre National d'Etudes Spatiales and from Aérospatiale, and we illustrate how it is possible to develop portable and efficient numerical software by considering the parallel solution of sparse linear systems of equations.

18.
A real-time distributed operating system for modern instruments
To provide a better hardware support environment for multifunctional, multi-parameter, intelligent, networked modern instrument systems, the authors developed a weakly real-time distributed operating system for modern instruments (IOWRTDOS). Its structure is divided, bottom-up, into three layers: a general hardware interface (GHI) layer, a microkernel layer, and a global shared object (GSO) layer. The GHI encapsulates hardware details and presents an idealized machine architecture to the rest of the operating system. The microkernel is the core of IOWRTDOS, mainly providing memory management, multithread management and …

19.
A real-time parallel processing method for weather radar data
Using a cluster of shared-memory multiprocessors, we study parallel computing methods for the high-frequency, real-time processing of data from multiple weather radars. Based on the computational characteristics of a single weather radar and the method used to process multiple radars jointly, a two-level hybrid parallel method is proposed that combines coarse-grained distributed-memory programming with the message passing interface (MPI) and fine-grained shared-memory programming with OpenMP. Experimental results show that the method substantially improves the system's data processing speed.
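A minimal skeleton of the two-level scheme just described, assuming one MPI rank per radar (coarse grain) and OpenMP threads over that radar's rays (fine grain); process_ray() is a hypothetical stand-in for the per-ray work.

```c
/* Two-level hybrid skeleton: MPI across radars, OpenMP within one.
 * Build with an MPI compiler wrapper plus -fopenmp. */
#include <mpi.h>

#define RAYS 3600

static void process_ray(int radar, int ray) {
    (void)radar; (void)ray;                    /* hypothetical per-ray kernel */
}

int main(int argc, char **argv) {
    int provided, rank;
    /* FUNNELED: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* one rank <-> one radar */

    #pragma omp parallel for schedule(dynamic)
    for (int ray = 0; ray < RAYS; ray++)
        process_ray(rank, ray);                /* shared-memory level */

    MPI_Barrier(MPI_COMM_WORLD);               /* all radars finish a scan */
    MPI_Finalize();
    return 0;
}
```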

20.
Because general shared-memory parallel machines lack graphics hardware, the 3D scientific data generated on them cannot be visualized in place with hardware-accelerated parallel volume rendering. To address this, a hybrid parallel rendering model based on a local parallel machine and distributed graphics workstations is presented. The model keeps the source data on the parallel machine; the machine's processors issue remote rendering command streams that drive the workstations' graphics hardware to carry out the rendering, and the final image compositing is executed back on the parallel machine to exploit its shared-memory communication advantage. With load-balancing optimization, the parallel rendering pipeline effectively overlaps rendering, compositing and display. Experimental results show that the method can interactively render large-scale data fields resident on the parallel machine at a resolution of 1024×1024.
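The compositing stage on the parallel machine amounts to blending the workstations' partial images; a minimal front-to-back "over" operator on premultiplied-alpha pixels might look as follows. Transport of the partial images and their depth ordering are assumed handled elsewhere; types and names are illustrative.

```c
/* Front-to-back "over" compositing of premultiplied-alpha images:
 * dst (nearer the viewer) accumulates src scaled by remaining
 * transparency. Applied once per arriving partial image. */
typedef struct { float r, g, b, a; } Pixel;

static void composite_over(Pixel *dst, const Pixel *src, int npix) {
    for (int i = 0; i < npix; i++) {
        float t = 1.0f - dst[i].a;   /* how much of src still shows */
        dst[i].r += t * src[i].r;
        dst[i].g += t * src[i].g;
        dst[i].b += t * src[i].b;
        dst[i].a += t * src[i].a;
    }
}
```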
