Similar literature
 20 similar documents found; search took 62 ms
1.
A portable parallelization of the Cooley–Tukey FFT algorithm for MIMD multiprocessors is presented. The implementation uses the virtual machine for multiprocessors (VMMP) and PVM portable software packages. Since VMMP provides the same set of services on all target machines, a single version of the parallel FFT code was used for shared memory (25-processor Sequent Symmetry), shared bus (MOS-running distributed UNIX) and distributed memory multiprocessors (transputer network and 64-processor IBM SP2). The presentation is accompanied by a detailed performance analysis of the implementations. The algorithm achieved high efficiencies on all target machines. The analysis indicates that most overheads are caused by the target architecture and not by VMMP or PVM inefficiencies. The portability analysis of the FFT provides several important insights. On the message passing architecture, the parallel FFT algorithm can obtain linearly increasing speedup with respect to the number of processors with only a moderate increase in the problem size. The parallel FFT can be executed by any number of processors, but generally the number of processors is much less than the length of the input data. The results indicate that the parallel FFT is portable: it achieves very good speedups on either a shared memory multiprocessor with high memory bandwidth or on a message passing multiprocessor without any change in the programs. © 1998 John Wiley & Sons, Ltd.
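For readers unfamiliar with the algorithm named in this abstract, the following is a minimal sequential sketch of a radix-2 Cooley–Tukey FFT. It is not the VMMP/PVM code from the paper; the function name `fft` and the restriction to power-of-two lengths are assumptions made for illustration. The two recursive calls operate on disjoint halves of the data, which is the independence a parallel version can exploit.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT (illustrative sketch).

    len(x) must be a power of two; returns a new list of complex values.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    # The two half-size transforms are independent of each other,
    # so a parallel implementation could assign them to different processors.
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor combines the two half-size results (butterfly step).
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Calling `fft([1, 0, 0, 0])` transforms a unit impulse, whose spectrum is constant across all four bins.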

2.
We describe the portable scalable implementation of the NRL Layered Ocean Model (NLOM). Scalability is based primarily on the tiled data parallel programming paradigm. This is sufficiently general that the actual technique used on a given machine to obtain scalability can be selected at compile time from: (i) data parallel, (ii) SPMD message passing, (iii) autotasking, or (iv) SPMD message passing between multi-processor autotasked systems. The code is thus portable onto all machine types likely to be used by ocean modelers.

3.
Parallel programming is orders of magnitude more complex than writing sequential programs. This is particularly true for programming distributed memory multiprocessor architectures based on message passing programming models. Apart from understanding the sequential parts of the parallel program, new degrees of freedom lead to additional problems. Understanding the synchronization and communication behavior of parallel programs is the most critical issue in programming distributed memory multiprocessors. The paper describes methods and tools for visualization and animation of the dynamic execution of parallel programs. Based on an evaluation and classification of existing visualization environments, the visualization and animation tool VISTOP (VISualization TOol for Parallel Systems) is presented as part of the integrated tool environment TOPSYS (TOols for Parallel SYStems) for programming distributed memory multiprocessors. VISTOP supports the interactive on-line visualization of message passing programs based on various views; in particular, a process graph based concurrency view for detecting synchronization and communication bugs.

4.
Research on multiprocessor interconnection networks has primarily focused on wormhole switching, virtual channel flow control, and routing algorithms to enhance their performance. The rationale behind this research is that by alleviating the network latency for high network loads, the overall system performance would improve; many studies have used synthetic workloads to support this claim. However, such workloads may not necessarily capture the behavior of real applications. In this paper, we have used parallel applications for a closer examination of the network behavior. In particular, the performance benefit from enhancing a 2D mesh with virtual channels (VCs) and a fully adaptive routing algorithm is examined with a set of shared-memory and message passing applications. Execution time and average message latency of shared memory applications are measured using execution-driven simulation and by varying many architectural attributes that affect the network workload. The communication traces of message passing applications, collected on an IBM-SP2, are used to run a trace-driven simulation of the mesh architecture to obtain message latency. Simulation results show that VCs and adaptive routing can reduce the network latency to varying degrees depending on the application. However, these modest benefits do not translate to significant improvements in the overall execution time because the load on the network is not high enough to exploit the advantages of the network enhancements. Moreover, this benefit may be negated if the architectural enhancements increase the network cycle time. Rather, emphasis should be placed on improving the raw network bandwidth and on faster network interfaces.

5.
Thin liquid film flow over surfaces containing complex multiply connected topography is modelled using lubrication theory. The resulting time dependent nonlinear coupled set of governing equations for film thickness and pressure is solved on different parallel computing platforms using a purpose written portable and scalable parallel multigrid algorithm in order to achieve the fine-scale resolution required to guarantee mesh independent solutions. The robustness of the approach is demonstrated via the solution of three problems: one to establish the convergence characteristics viz. the partitioning and message passing strategies adopted, taking flow over a well-defined trench topography as a benchmark against existing experimental and corresponding numerical predictions; two, flow through a sparsely distributed set of occlusions with computations performed on different parallel architectures; three, free-surface planarisation with respect to flow over complex topography - the first an engineered functional substrate, the second a naturally occurring surface.

6.
Initially, parallel algorithms were designed by parallelising the existing sequential algorithms for frequently occurring problems on available parallel architectures.

More recently, parallel strategies have been identified and utilised resulting in many new parallel algorithms. However, the analysis of such techniques reveals that further strategies can be applied to increase the parallelism. One of these strategies, i.e., increasing the computational work in each processing node, can reduce the memory accesses and hence congestion in a shared memory multiprocessor system. Similarly, when network message passing is minimised in a distributed memory processor system, dramatic improvements in the performance of the algorithm ensue.

A frequently occurring computational problem in digital signal processing (DSP) is the solution of symmetric positive definite Toeplitz linear systems. The Levinson algorithm solves such linear equations by exploiting the Toeplitz matrix property in the elimination process of each element. However, it can be shown that in the Parallel Implicit Elimination (PIE) method, where more than one element is eliminated simultaneously, the Toeplitz structure can again be utilised to advantage. This relatively simple strategy yields a reduction in accesses to shared memory or network message passing, resulting in a significant improvement in the performance of the algorithm [2].
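As background for the Levinson algorithm mentioned in this abstract, here is a minimal sequential sketch of the textbook Levinson recursion for a symmetric positive definite Toeplitz system. It does not reproduce the PIE method, whose details the abstract does not give. The function name `levinson_solve` and the convention that the matrix is passed as its first column `r` are assumptions made for illustration.

```python
def levinson_solve(r, b):
    """Solve T x = b where T[i][j] = r[abs(i - j)] is symmetric
    positive definite Toeplitz, using the Levinson recursion in O(n^2).

    r: first column of T (length >= n); b: right-hand side (length n).
    """
    n = len(b)
    r0 = r[0]
    # Normalise so the diagonal is 1; the solution x is unchanged.
    rho = [v / r0 for v in r[1:n]]      # rho[i] = r[i + 1] / r[0]
    rhs = [v / r0 for v in b]
    x = [rhs[0]]
    if n == 1:
        return x
    y = [-rho[0]]                       # solution of the auxiliary Yule-Walker system
    alpha, beta = -rho[0], 1.0
    for k in range(1, n):
        beta *= 1.0 - alpha * alpha
        # Extend the order-k solution x to order k + 1.
        mu = (rhs[k] - sum(rho[i] * x[k - 1 - i] for i in range(k))) / beta
        x = [x[i] + mu * y[k - 1 - i] for i in range(k)] + [mu]
        if k < n - 1:
            # Extend the auxiliary vector y for the next step.
            alpha = -(rho[k] + sum(rho[i] * y[k - 1 - i] for i in range(k))) / beta
            y = [y[i] + alpha * y[k - 1 - i] for i in range(k)] + [alpha]
    return x
```

Each step reuses the previous solution because the leading submatrices of a Toeplitz matrix share the same entries, which is the structural property the abstract refers to.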

7.
In this paper we present a parallel runtime substrate, the Mobile Object Layer (MOL), that supports data or object mobility and automatic message forwarding in order to ease the implementation of adaptive and irregular applications on distributed memory machines. The MOL implements a global logical name space for message passing and distributed directories to assist in the translation of logical to physical addresses. Our data show that the latency of the MOL primitives is within 10–14% of the latency of the underlying communication substrate. The MOL is a lightweight, portable library designed to minimize maintenance costs for very large-scale parallel adaptive applications.
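The idea of automatic message forwarding to mobile objects can be illustrated with a toy, single-process simulation. This is not the MOL API: the `Node` class and the `migrate` and `send` functions are hypothetical names, and the forwarding-pointer scheme shown is only one plausible way to resolve a logical name to the object's current location.

```python
class Node:
    """One simulated processor (hypothetical, for illustration only)."""
    def __init__(self, rank):
        self.rank = rank
        self.objects = {}   # logical name -> inbox of the object living here
        self.forward = {}   # logical name -> rank the object moved to

def migrate(nodes, name, src, dst):
    """Move an object and leave a forwarding pointer at the old location."""
    nodes[dst].objects[name] = nodes[src].objects.pop(name)
    nodes[src].forward[name] = dst

def send(nodes, name, guess, payload):
    """Deliver payload to the object called `name`, starting at the sender's
    last known location `guess`; returns the number of forwarding hops."""
    rank, hops = guess, 0
    while name not in nodes[rank].objects:
        rank = nodes[rank].forward[name]   # follow the forwarding chain
        hops += 1
    nodes[rank].objects[name].append(payload)
    return hops
```

A sender that still believes an object lives on node 0 after it has moved twice would see its message forwarded over two hops; a real runtime would typically also update the sender's directory entry to shorten later deliveries.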

8.
The Computer Aided Parallelisation Tools (CAPTools) [Ierotheou C, Johnson SP, Cross M, Leggett PF, Computer aided parallelisation tools (CAPTools): conceptual overview and performance on the parallelisation of structured mesh codes, Parallel Computing, 1996;22:163–195] is a set of interactive tools that aim to provide automatic parallelisation of serial FORTRAN Computational Mechanics (CM) programs. CAPTools analyses the user's serial code and then, through stages of array partitioning, mask and communication calculation, generates parallel SPMD (Single Program Multiple Data) message passing FORTRAN. The parallel code generated by CAPTools contains calls to a collection of routines that form the CAPTools communications Library (CAPLib). The library provides a portable layer and user friendly abstraction over the underlying parallel environment. CAPLib contains optimised message passing routines for data exchange between parallel processes and other utility routines for parallel execution control, initialisation and debugging. By compiling and linking with different implementations of the library, the user is able to run on many different parallel environments. Even with today's parallel systems the concept of a single version of a parallel application code is more of an aspiration than a reality. However, for CM codes the data partitioning SPMD paradigm requires a relatively small set of message-passing communication calls. This set can be implemented as an intermediate ‘thin layer’ library of message-passing calls that enables the parallel code (especially that generated automatically by a parallelisation tool such as CAPTools) to be as generic as possible. CAPLib is just such a ‘thin layer’ message passing library that supports parallel CM codes, by mapping generic calls onto machine specific libraries (such as CRAY SHMEM) and portable general purpose libraries (such as PVM and MPI).
This paper describes CAPLib together with its three perceived advantages over other routes:
  • as a high level abstraction, it is both easy to understand (especially when generated automatically by tools) and to implement by hand, for the CM community (who are not generally parallel computing specialists);
  • the one parallel version of the application code is truly generic and portable;
  • the parallel application can readily utilise whatever message passing libraries on a given machine yield optimum performance.

9.
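Design of a multiprocessor MPEG2 parallel decoding system   Total citations: 1, self-citations: 0, citations by others: 1
The MPEG2 moving picture and associated audio compression standard is the core algorithm of many video service applications. Implementing MPEG2 decompression on a software-based multiprocessor parallel system is flexible enough to provide playback for a variety of MPEG2 products and avoids the limitations of decompression in hardware chips. Moreover, as personal computers become more widespread and more powerful, such a system adapter card lets a personal computer offer more MPEG2 services, and it also facilitates research on and testing of updated algorithms for the MPEG2 family of standards. This paper analyzes the requirements that MPEG2 decoding places on the implementing system, in particular the computational load of each stage of decompression and the requirements for data transfer and processing. Based on these data, and using several TMS320C40 parallel processing system boards, experiments and analyses are carried out on data partitioning of the MPEG2 input bitstream, storage control and communication for parallel decoding, and decoding algorithm complexity, yielding the corresponding design choices and data. Finally, a design for an MPEG2 parallel decoding system is proposed.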

10.
Hua Zhang  Joohan Lee  Ratan Guha 《Software》2008,38(10):1049-1071
Clusters, composed of symmetric multiprocessor (SMP) machines and heterogeneous machines, have become increasingly popular for high‐performance computing. Message‐passing libraries, such as message‐passing interface (MPI) and parallel virtual machine (PVM), are de facto parallel programming libraries for clusters that usually consist of homogeneous and uni‐processor machines. For SMP machines, MPI is combined with multithreading libraries like POSIX Thread and OpenMP to take advantage of the architecture. In addition to existing parallel programming libraries that are in C/C++ and FORTRAN programming languages, the Java programming language presents itself as another alternative with its object‐oriented framework, platform neutral byte code, and ever‐increasing performance. This paper presents a new parallel programming model and a library, VCluster, which implements this model. VCluster is based on migrating virtual threads instead of processes to support clusters of SMP machines more efficiently. The implementation uses thread migration, which can be used in dynamic load balancing. VCluster was developed in pure Java, utilizing the portability of Java to support clusters of heterogeneous machines. Several applications are developed to illustrate the use of this library and compare the usability and performance of VCluster with other approaches. Copyright © 2007 John Wiley & Sons, Ltd.

11.
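The design and implementation of traditional parallel software systems suffer from low development efficiency, quality that is hard to guarantee, and poor portability. These problems are addressed by developing a standard parallel library. Drawing on the approach of the High Performance Embedded Computing Software Initiative (HPEC_SI) and building on the message passing mechanism of the Message Passing Interface (MPI), some typical parallel algorithms in image/signal processing are encapsulated as class components, and an object-oriented parallel vector library for image/signal processing is designed and implemented, giving application developers a good development environment. Tests and experiments show that the library implements the corresponding parallel vector and matrix algorithms efficiently and is easy to use, highly reusable, portable, and efficient.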

12.
Trace visualization is a viable approach for gaining insight into the behavior of complex distributed real-time systems. Grasp is a versatile trace visualization toolset. Its flexible plugin infrastructure allows for easy extension with custom visualization and analysis techniques for automatic trace verification. This paper presents its visualization capabilities for hierarchical multiprocessor systems, including partitioned and global multiprocessor scheduling with migrating tasks and jobs, communication between jobs via shared memory and message passing, and hierarchical scheduling in combination with multiprocessor scheduling. For tracing distributed systems with asynchronous local clocks Grasp also supports the synchronization of traces from different processors during the visualization and analysis.

13.
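Design and implementation of a message mechanism for a distributed real-time operating system   Total citations: 1, self-citations: 1, citations by others: 0
With the rapid development of digital signal processing technology, a high-performance distributed real-time operating system that meets user needs, the Tengfei distributed real-time operating system (TF-RTOS), was developed independently for parallel digital signal processing (DSP) applications. The message mechanism, which is used for communication between threads, is an important part of the operating system. During the development of TF-RTOS, a direct message passing mechanism was designed and implemented in terms of four aspects: message command packets, message queues, the message passing process, and message primitives. This mechanism simplifies inter-thread communication, extends system functionality, and improves system performance.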

14.
We present the design and implementation of a loosely coupled multiprocessor built from off-the-shelf parts. Message passing is used as the communication paradigm. Several novel techniques are used to reduce the demands on the kernel from the message passing subsystem. We achieve message passing times of the same order for messages within processors and interprocessor messages, allowing transparent interprocess communication. Because it is possible to achieve these performance results, we conclude that process allocation need not be a critical problem in efficient multiprocessor design, at least for small scale multiprocessors.

15.
Parallel programs present some features such as concurrency, communication and synchronization that make testing a challenging activity. Because of these characteristics, the direct application of traditional testing is not always possible, and adequate testing criteria and tools are necessary. In this paper we investigate the challenges of validating message‐passing parallel programs and present a set of specific testing criteria. We introduce a family of structural testing criteria based on a test model. The model captures control and data flow of the message‐passing programs, by considering their sequential and parallel aspects. The criteria provide a coverage measure that can be used for evaluating the progress of the testing activity and also provide guidelines for the generation of test data. We also describe a tool, called ValiPar, which supports the application of the proposed testing criteria. Currently, ValiPar is configured for parallel virtual machine (PVM) and message‐passing interface (MPI). Results of the application of the proposed criteria to MPI programs are also presented and analyzed. Copyright © 2008 John Wiley & Sons, Ltd.

16.
Integrating Java 3D model and sensor data for remote monitoring and control   Total citations: 1, self-citations: 0, citations by others: 1
This paper presents a novel approach and a framework for web-based systems that can be used in distributed manufacturing environments. A prototype is developed to demonstrate its application to remote monitoring and control of a Tripod, one type of parallel kinematic machine. It utilizes the latest Java technologies (Java 3D and Java Servlets) as enabling technologies for system implementation. Instead of using a camera for monitoring, the Tripod is modeled using Java 3D with behavioral control nodes embedded. Once downloaded from its server, the 3D model behaves on the client side in the same way as its physical counterpart. It remains alive by connecting with the Tripod through message passing, e.g., transmission of sensor signals and control commands. The goal of this research is to eliminate network traffic with Java 3D models, while still providing users with intuitive environments. In the near future, open-architecture devices will be web-ready, having Java virtual machines embedded. This will make the approach more effective for web-based device monitoring and control.

17.
Replacing traditional operating systems communication implementations with customized implementations increases the performance of parallel and distributed applications. This paper describes the design and implementation of customizable message passing systems. The customized message passing systems are generated using application-specific information such as the profile of an application's communication pattern. FFT, Simplex, and Cholesky are used as example parallel applications. The message passing system has also been customized for different types of distributed system services including a distributed scheduling facility. The customized message passing system likewise improves the performance of these facilities and enhances their scalability. As a practical concern, as there are a large number of possible optimizations, object-oriented frameworks are employed to organize the implementations and to facilitate the choice of optimizations.

18.
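A cluster system is a distributed memory system that mainly uses message passing for communication between nodes. MPI (Message Passing Interface), a parallel programming environment based on message passing, is widely used on many kinds of parallel systems, especially distributed memory parallel machines such as clusters. This paper examines the message passing call interface in MPI, proposes several effective methods for passing multidimensional sparse arrays between nodes, and compares them in practice.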
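The abstract does not say which transfer methods the paper proposes. As one common possibility, the sketch below packs only the nonzero entries of a multidimensional array into index and value buffers, which is the kind of compact representation that could then be handed to a message passing call. The helper names `pack_sparse` and `unpack_sparse` and the row-major flat layout are assumptions for illustration, and no MPI call is made here.

```python
from math import prod

def pack_sparse(flat, shape):
    """Pack a row-major flattened array into (shape, indices, values),
    keeping only the nonzero entries (coordinate-style encoding)."""
    if len(flat) != prod(shape):
        raise ValueError("flat length does not match shape")
    indices = [i for i, v in enumerate(flat) if v != 0]
    values = [flat[i] for i in indices]
    return list(shape), indices, values

def unpack_sparse(shape, indices, values):
    """Rebuild the row-major flattened array on the receiving side."""
    flat = [0] * prod(shape)
    for i, v in zip(indices, values):
        flat[i] = v
    return flat
```

For an array with few nonzeros, the two short buffers replace a transfer of the full dense array; the receiver needs only the shape to reconstruct it.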

19.
《Parallel Computing》1997,22(13):1837-1851
The PAPS (Performance Analysis of Parallel Systems) toolset is a testbed for the model based performance prediction of message passing parallel applications executed on private memory multiprocessor computer systems. PAPS allows the execution behavior of the computer hardware and operating system software resources to be described at a very detailed level. This enables very accurate performance prediction of parallel applications even in the case of substantial performance degradation due to contention for shared resources. In this paper the fundamental design principles and implementation methodologies for the development of the PAPS toolset are presented and the PAPS parallel system specification formalisms are described. A simplified performance study of a parallel Gaussian elimination application on the nCUBE 2 multiprocessor system is used to demonstrate the usage of the tool.

20.
A macro package for expressing message passing functions within parallel FORTRAN programs is presented. It makes the user program fully portable among all parallel computers where the macros are implemented. The implementation on the Intel iPSC/2 hypercube is discussed in more detail. New message passing primitives have been added to the iPSC/2 operating system, offering the user a broader functionality at no efficiency loss. The full macro set, using these primitives, works with the same performance as the original Intel primitives.
