Similar Documents
A total of 20 similar documents were found (search time: 265 ms).
1.
An On-Chip Multi-Core MPI Parallel Programming Model for the CBEA Supporting Multiple Memory-Access Techniques
Existing programming models for the CBEA (Cell Broadband Engine Architecture) mostly focus on stream-processing-style "bulk data transfer" applications and deliver poor performance for traditional applications with irregular memory access. Based on the Cell architecture, this paper proposes an MPI parallel programming model that supports both bulk-transfer and irregular-access applications: communication is decomposed onto the PPE (PowerPC Processing Element) to broaden the model's applicability, and, under a unified memory-access interface, runtime memory-access profiling information guides the selection and optimization of memory accesses to improve computational efficiency. Experimental results show that the proposed model supports multiple memory-access patterns, achieves good parallel speedup, and delivers roughly a 30%-50% performance improvement over comparable techniques.
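As background for the distinction the abstract draws, the following is a minimal, generic MPI sketch of a bulk transfer: one large contiguous message between two ranks. It uses only standard MPI calls and is not the paper's Cell/PPE-based model; an irregular-access application would instead touch scattered elements, which is the case the paper's runtime profiling targets.

```cpp
// Generic MPI "bulk transfer": rank 0 ships one large contiguous buffer in a
// single message. Plain MPI, unrelated to the Cell-specific model above.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1 << 20;
    std::vector<double> buf(N);

    if (size >= 2) {
        if (rank == 0) {
            for (int i = 0; i < N; ++i) buf[i] = i;                      // contiguous data
            MPI_Send(buf.data(), N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   // one bulk message
        } else if (rank == 1) {
            MPI_Recv(buf.data(), N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::printf("received %.0f ... %.0f\n", buf.front(), buf.back());
        }
    }
    MPI_Finalize();
    return 0;
}
```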

2.
This paper compares data distribution methodologies for scaling the performance of OpenMP on NUMA architectures. We investigate the performance of automatic page placement algorithms implemented in the operating system, runtime algorithms based on dynamic page migration, runtime algorithms based on loop scheduling transformations and manual data distribution. These techniques present the programmer with trade-offs between performance and programming effort. Automatic page placement algorithms are transparent to the programmer, but may compromise memory access locality. Dynamic page migration algorithms are also transparent, but require careful engineering and tuned implementations to be effective. Manual data distribution requires substantial programming effort and architecture-specific extensions to the API, but may localize memory accesses in a nearly optimal manner. Loop scheduling transformations may or may not require intervention from the programmer, but conform better to an architecture-agnostic programming paradigm like OpenMP. We identify the conditions under which runtime data distribution algorithms can optimize memory access locality in OpenMP. We also present two novel runtime data distribution techniques, one based on memory access traces and another based on affinity scheduling of parallel loops. These techniques can be used to effectively replace manual data distribution in regular applications. The results provide a proof of concept that it is possible to scale a portable shared-memory programming model up to more than 100 processors, without modifying the API and without exposing architectural details to the programmer.
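One of the runtime-free techniques under comparison, first-touch page placement combined with a matching loop schedule, can be sketched in a few lines of OpenMP/C++. This is a generic illustration under the usual first-touch NUMA policy, not the paper's trace-based or affinity-scheduling runtime:

```cpp
// First-touch placement: initialize data with the same schedule(static) used by
// the compute loop, so the thread that first writes a page is the one that will
// later read it, and the page is allocated on (or near) that thread's node.
#include <cstdio>

int main() {
    const long n = 1L << 24;
    // Allocate without initializing so no page is touched yet; under a
    // first-touch policy the first writer determines NUMA placement.
    double* a = new double[n];
    double* b = new double[n];

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) { a[i] = 1.0; b[i] = 2.0; }   // placement happens here

    double sum = 0.0;
    // Same static schedule: each thread reads the pages it initialized.
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < n; ++i) sum += a[i] * b[i];

    std::printf("dot = %f\n", sum);
    delete[] a;
    delete[] b;
    return 0;
}
```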

3.
The PEPPHER (EU FP7 project) component model defines the notion of component, interface and meta-data for homogeneous and heterogeneous parallel systems. In this paper, we describe and evaluate the PEPPHER composition tool, which explores the application’s components and their implementation variants, generates the necessary low-level code that interacts with the runtime system, and coordinates the native compilation and linking of the various code units to compose the overall application code to optimize performance. We discuss the concept of smart containers and its benefits for reducing dispatch overhead, exploiting implicit parallelism across component invocations and runtime optimization of data transfers. In an experimental evaluation with several applications, we demonstrate that the composition tool provides a high-level programming front-end while effectively utilizing the task-based PEPPHER runtime system (StarPU) underneath for different usage scenarios on GPU-based systems.
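The composition idea itself, several implementation variants behind one component interface with a runtime choice between them, can be illustrated in plain C++. All names below are hypothetical, and the size-threshold dispatch is only a stand-in for the performance-model-driven selection that the PEPPHER tool and StarPU perform:

```cpp
// Hypothetical illustration of "implementation variants" behind one interface.
// A real composition tool would choose variants using performance models and
// device availability rather than a fixed size threshold.
#include <vector>
#include <numeric>
#include <cstdio>

double sum_sequential(const std::vector<double>& v) {        // variant 1: sequential CPU
    return std::accumulate(v.begin(), v.end(), 0.0);
}

double sum_parallel(const std::vector<double>& v) {          // variant 2: OpenMP CPU
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < static_cast<long>(v.size()); ++i) s += v[i];
    return s;
}

double component_sum(const std::vector<double>& v) {         // the component interface
    return v.size() < 10000 ? sum_sequential(v) : sum_parallel(v);
}

int main() {
    std::vector<double> v(1 << 20, 1.0);
    std::printf("sum = %f\n", component_sum(v));
    return 0;
}
```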

4.

In the state-of-the-art parallel programming approaches OpenCL and CUDA, so-called host code is required for a program’s execution. Efficiently implementing host code is often a cumbersome task, especially when executing OpenCL and CUDA programs on systems with multiple nodes, each comprising different devices, e.g., multi-core CPU and graphics processing units; the programmer is responsible for explicitly managing each node’s and device’s memory, synchronizing computations with data transfers between devices on potentially different nodes, and optimizing data transfers between devices’ memories and nodes’ main memories, e.g., by using pinned main memory for accelerating data transfers and overlapping the transfers with computations. We develop the distributed OpenCL/CUDA abstraction layer (dOCAL), a novel high-level C++ library that simplifies the development of host code. dOCAL combines major advantages over state-of-the-art high-level approaches: (1) it simplifies implementing both OpenCL and CUDA host code by providing a simple-to-use, high-level abstraction API; (2) it supports executing arbitrary OpenCL and CUDA programs; (3) it allows conveniently targeting the devices of different nodes by automatically managing node-to-node communications; (4) it simplifies implementing data transfer optimizations by providing different, specially allocated memory regions, e.g., pinned main memory for overlapping data transfers with computations; (5) it optimizes memory management by automatically avoiding unnecessary data transfers; (6) it enables interoperability between OpenCL and CUDA host code for systems with devices from different vendors. Our experiments show that dOCAL significantly simplifies the development of host code for heterogeneous and distributed systems, with a low runtime overhead.
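For context, the sketch below shows the kind of low-level CUDA-runtime host code (pinned memory plus an asynchronous copy on a stream) that such a library is meant to hide. It uses only standard CUDA runtime calls and is not dOCAL's API:

```cpp
// Plain CUDA-runtime host code (C++): pinned host memory plus an asynchronous
// copy on a stream, overlapped with CPU work. This is the boilerplate a
// high-level layer like dOCAL abstracts away; it is not dOCAL's own API.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1 << 24;
    float *h = nullptr, *d = nullptr;
    cudaMallocHost(reinterpret_cast<void**>(&h), n * sizeof(float));  // pinned host buffer
    cudaMalloc(reinterpret_cast<void**>(&d), n * sizeof(float));      // device buffer
    for (size_t i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Asynchronous host-to-device copy: returns immediately because h is pinned.
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);

    // ... CPU work placed here overlaps with the transfer ...

    cudaStreamSynchronize(stream);   // wait for the copy to finish
    std::printf("transfer complete\n");

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```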


5.
This paper presents an object-oriented, Java-like core language with primitives for distributed programming and explicit code mobility. We apply our formulation to prove the correctness of several optimisations for distributed programs. Our language captures crucial but often hidden aspects of distributed object-oriented programming, including object serialisation, dynamic class downloading and remote method invocation. It is defined in terms of an operational semantics that concisely models the behaviour of distributed programs using machinery from calculi of mobile processes. Type safety is established using invariant properties for distributed runtime configurations. We argue that primitives for explicit code mobility offer a programmer fine-grained control of type-safe code distribution, which is crucial for improving the performance and safety of distributed object-oriented applications.

6.
Rinard, M.C.; Scales, D.J.; Lam, M.S. Computer, 1993, 26(6): 28-38
Jade, a high-level parallel programming language for managing coarse-grained parallelism, is discussed. Jade simplifies programming by providing sequential-execution and shared-address-space abstractions. It is also platform-independent; the same Jade program runs on uniprocessors, multiprocessors, and heterogeneous networks of machines. An example that illustrates how Jade programmers express irregular, dynamically determined concurrency and how the implementation exploits this source of concurrency is presented. A digital video imaging program that runs on a high-resolution video system and several other examples of Jade applications are described.

7.
Transactional Memory: The Hardware-Software Interface
As multicore chips become ubiquitous, the need to provide architectural support for practical parallel programming is becoming critical. Conventional lock-based concurrency control techniques are difficult to use, requiring the programmer to navigate through the minefield of coarse- versus fine-grained locks, deadlock, livelock, lock convoying, and priority inversion. This explicit management of concurrency is beyond the reach of the average programmer, threatening to waste the additional parallelism available with multicore architectures. The comprehensive architecture presented here supports nested transactions, transactional handlers, and two-phase commit. The result is a seamless integration of transactional memory with modern programming languages and runtime environments.
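To make the lock-based "minefield" concrete, here is a standard C++ sketch of the classic lock-ordering hazard; std::scoped_lock sidesteps the deadlock, whereas a transactional atomic region would remove the concern from the programmer entirely. This is illustrative only and not the hardware TM interface proposed in the paper:

```cpp
// The classic two-lock transfer: locking a then b in one thread and b then a
// in another can deadlock. std::scoped_lock (C++17) acquires both mutexes with
// a deadlock-avoidance algorithm; a transactional atomic region would make the
// ordering question disappear from the programming model.
#include <mutex>
#include <thread>
#include <cstdio>

struct Account {
    std::mutex m;
    long balance = 100;
};

void transfer(Account& from, Account& to, long amount) {
    std::scoped_lock lock(from.m, to.m);   // locks both safely, regardless of order
    from.balance -= amount;
    to.balance   += amount;
}

int main() {
    Account a, b;
    std::thread t1(transfer, std::ref(a), std::ref(b), 10);
    std::thread t2(transfer, std::ref(b), std::ref(a), 5);   // opposite order: still safe
    t1.join(); t2.join();
    std::printf("a=%ld b=%ld\n", a.balance, b.balance);
    return 0;
}
```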

8.
C++ with a Concurrency Class Library
杨延中; 王为; 田籁声. 软件学报 (Journal of Software), 1998, 9(6): 401-404
This paper explores how to introduce concurrency into a sequential object-oriented language through a class library. Taking C++ as an example, a concurrency class library provides concurrent classes and related tools so that the language can support distributed and parallel object-oriented programming. The paper describes the design and implementation of the concurrency class library and the underlying language support system, and concludes with preliminary test results.
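A minimal modern-C++ sketch of the class-library approach, concurrency introduced through classes rather than new language syntax, is shown below. The names are hypothetical, and std::thread post-dates the paper's library, which also targets distributed execution:

```cpp
// Hypothetical sketch of concurrency-via-class-library: a Thread base class
// whose virtual run() supplies the concurrent activity. Not the paper's API.
#include <thread>
#include <cstdio>

class Thread {
public:
    virtual ~Thread() { if (t_.joinable()) t_.join(); }       // safety net; join explicitly
    void start() { t_ = std::thread([this] { run(); }); }     // launch the activity
    void join()  { if (t_.joinable()) t_.join(); }
protected:
    virtual void run() = 0;                                   // user-supplied body
private:
    std::thread t_;
};

class Worker : public Thread {
public:
    explicit Worker(int id) : id_(id) {}
protected:
    void run() override { std::printf("worker %d running\n", id_); }
private:
    int id_;
};

int main() {
    Worker a(1), b(2);
    a.start(); b.start();   // both bodies execute concurrently
    a.join();  b.join();    // join before the objects are destroyed
    return 0;
}
```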

9.
This paper deals with the application of concurrent object-oriented programming with Actors to solve dynamic programming problems in a distributed computing environment. This area of research is often called distributed artificial intelligence. Using a dynamic programming example of chained matrix multiplication, a method of managing dynamic programming searches in a distributed programming environment with Actors is presented. Distributed computations with Actors are visualized by means of Time-Varying Automata (for cases with no intra-actor concurrency) or using a class of high-level nets called Hierarchical Colored Petri Nets (for cases with intra-actor concurrency). Design and implementation features of the specific Actor-based programming environment, using a concurrent extension of C++, are also discussed.
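The chained (matrix-chain) multiplication problem used as the running example has a compact sequential dynamic-programming formulation; the sketch below shows the subproblem table that a distributed, actor-based search would partition. It is plain C++ and unrelated to the paper's Actor environment:

```cpp
// Matrix-chain ordering: m[i][j] = minimum scalar multiplications needed to
// compute A_i ... A_j, where A_k has dimensions p[k-1] x p[k].
#include <vector>
#include <climits>
#include <cstdio>

long long matrixChainCost(const std::vector<int>& p) {
    const int n = static_cast<int>(p.size()) - 1;          // number of matrices
    std::vector<std::vector<long long>> m(n + 1, std::vector<long long>(n + 1, 0));
    for (int len = 2; len <= n; ++len) {                   // chain length
        for (int i = 1; i + len - 1 <= n; ++i) {
            int j = i + len - 1;
            m[i][j] = LLONG_MAX;
            for (int k = i; k < j; ++k) {                  // split point
                long long cost = m[i][k] + m[k + 1][j]
                               + 1LL * p[i - 1] * p[k] * p[j];
                if (cost < m[i][j]) m[i][j] = cost;
            }
        }
    }
    return m[1][n];
}

int main() {
    std::vector<int> p = {30, 35, 15, 5, 10, 20, 25};      // six matrices
    std::printf("minimum multiplications: %lld\n", matrixChainCost(p));  // 15125
    return 0;
}
```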

10.
Concurrency and parallelism have long been viewed as important, but somewhat distinct concepts. While concurrency is extensively used to amortize latency (for example, in web- and database-servers, user interfaces, etc.), parallelism is traditionally used to enhance performance through execution on multiple functional units. Motivated by an evolving application mix and trends in hardware architecture, there has been a push toward integrating traditional programming models for concurrency and parallelism. Use of conventional threads APIs (POSIX, OpenMP) with messaging libraries (MPI), however, leads to significant programmability concerns, owing primarily to their disparate programming models. In this paper, we describe a novel API and associated runtime for concurrent programming, called MPI Threads (MPIT), which provides a portable and reliable abstraction of low-level threading facilities. We describe various design decisions in MPIT, their underlying motivation, and associated semantics. We provide performance measurements for our prototype implementation to quantify overheads associated with various operations. Finally, we discuss two real-world use cases: an asynchronous message queue and a parallel information retrieval system. We demonstrate that MPIT provides a versatile, low overhead programming model that can be leveraged to program large parallel ensembles.
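The asynchronous message queue use case can be sketched with standard C++ threading primitives. This shows only the data structure, not MPIT's API:

```cpp
// A minimal thread-safe message queue: producers push without waiting for the
// consumer; the consumer blocks until a message arrives. Standard C++ only.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <cstdio>

template <typename T>
class AsyncQueue {
public:
    void push(T msg) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(msg)); }
        cv_.notify_one();
    }
    T pop() {                                   // blocks until a message is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T msg = std::move(q_.front());
        q_.pop();
        return msg;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

int main() {
    AsyncQueue<std::string> q;
    std::thread consumer([&] {
        for (int i = 0; i < 3; ++i) std::printf("got: %s\n", q.pop().c_str());
    });
    q.push("a"); q.push("b"); q.push("quit");   // producers never block here
    consumer.join();
    return 0;
}
```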

11.
Using GPUs as general-purpose processors has revolutionized parallel computing by providing, for a large and growing set of algorithms, massive data-parallelization on desktop machines. An obstacle to their widespread adoption, however, is the difficulty of programming them and the low-level control of the hardware required to achieve good performance. This paper proposes a programming approach, SafeGPU, that aims to make GPU data-parallel operations accessible through high-level libraries for object-oriented languages, while maintaining the performance benefits of lower-level code. The approach provides data-parallel operations for collections that can be chained and combined to express compound computations, with data synchronization and device management all handled automatically. It also integrates the design-by-contract methodology, which increases confidence in functional program correctness by embedding executable specifications into the program text. We present a prototype of SafeGPU for Eiffel, and show that it leads to modular and concise code that is accessible for GPGPU non-experts, while still providing performance comparable with that of hand-written CUDA code. We also describe our first steps towards porting it to C#, highlighting some challenges, solutions, and insights for implementing the approach in different managed languages. Finally, we show that runtime contract-checking becomes feasible in SafeGPU, as the contracts can be executed on the GPU.
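The chaining-plus-contracts style can be imitated in plain C++20 ranges with assertion checks, as below. Everything here runs on the CPU, and none of it is SafeGPU's actual (Eiffel) API, which executes both the operations and, optionally, the contracts on the GPU:

```cpp
// Chained collection operations with executable checks, loosely imitating the
// chaining-plus-contracts style in plain C++20 ranges (CPU only).
#include <cassert>
#include <ranges>
#include <vector>
#include <cstdio>

double meanOfSquaresOfPositives(const std::vector<double>& v) {
    assert(!v.empty() && "precondition: collection must not be empty");
    double sum = 0.0;
    int count = 0;
    for (double x : v | std::views::filter([](double y) { return y > 0.0; })     // chained op 1
                      | std::views::transform([](double y) { return y * y; })) { // chained op 2
        sum += x;
        ++count;
    }
    assert(count > 0 && "check: at least one positive element expected");
    return sum / count;
}

int main() {
    std::vector<double> v = {1.0, -2.0, 3.0, -4.0};
    std::printf("%f\n", meanOfSquaresOfPositives(v));   // (1 + 9) / 2 = 5
    return 0;
}
```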

12.
OP2 is a high-level domain-specific library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into multiple parallel implementations for execution on a range of back-end hardware platforms. In this paper we present the design and performance of OP2’s recent developments facilitating code generation and execution on distributed memory heterogeneous systems. OP2 targets the solution of numerical problems based on static unstructured meshes. We discuss the main design issues in parallelizing this class of applications. These include handling data dependencies in accessing indirectly referenced data and design considerations in generating code for execution on a cluster of multi-threaded CPUs and GPUs. Two representative CFD applications, written using the OP2 framework, are utilized to provide a contrasting benchmarking and performance analysis study on a number of heterogeneous systems including a large scale Cray XE6 system and a large GPU cluster. A range of performance metrics are benchmarked including runtime, scalability, achieved compute and bandwidth performance, runtime bottlenecks and systems energy consumption. We demonstrate that an application written once at a high level using OP2 is easily portable across a wide range of contrasting platforms and is capable of achieving near-optimal performance without the intervention of the domain application programmer.
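The central dependency issue, indirectly referenced data, shows up even in a toy edge loop: two edges that share a node race on that node's accumulator. The plain C++/OpenMP sketch below (not the OP2 API) resolves the race with atomic updates, whereas OP2's generated code would typically use colouring or staging instead:

```cpp
// Edge-based accumulation over an unstructured mesh: each edge adds a value to
// its two end nodes through an indirection array. Two edges sharing a node
// would race on that node's entry, so the updates are made atomic here.
#include <vector>
#include <cstdio>

int main() {
    const int nnode = 5, nedge = 4;
    int edge2node[nedge][2] = {{0, 1}, {1, 2}, {2, 3}, {3, 4}};   // indirection map
    std::vector<double> edge_flux = {1.0, 2.0, 3.0, 4.0};
    std::vector<double> node_acc(nnode, 0.0);

    #pragma omp parallel for
    for (int e = 0; e < nedge; ++e) {
        double f = edge_flux[e];
        int n0 = edge2node[e][0], n1 = edge2node[e][1];
        #pragma omp atomic
        node_acc[n0] += f;   // indirectly referenced: may collide with another edge
        #pragma omp atomic
        node_acc[n1] -= f;
    }

    for (int n = 0; n < nnode; ++n) std::printf("node %d: %f\n", n, node_acc[n]);
    return 0;
}
```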

13.
The SB-PRAM is a shared-memory parallel computer that has been designed according to the PRAM model from theoretical computer science. The SB-PRAM realizes a concurrent-read, concurrent-write PRAM where each processor can access the global memory in unit time. This article describes the programming environment of the SB-PRAM that enables a programmer to develop efficient and portable programs without dealing with architectural details of the machine. In particular, we discuss compiler and operating system issues and show that the runtime functions of the P4 environment and several parallel data structures can be implemented very efficiently by using special features of the SB-PRAM. In contrast to other parallel machines, the synchronization of processors and the management of concurrent accesses to the global memory only require a few machine instructions independent of the number of processors participating in the operation. This efficient implementation of the runtime system is the basis for good performance of many challenging applications.

14.
We present lattice parallelism (LPAR), a programming methodology and software development tool for implementing scientific computations on distributed memory MIMD multiprocessors. LPAR supports an efficient portable model for managing physically distributed, dynamic data structures in a shared name space. It enables the programmer to effectively manage locality for achieving high performance without becoming involved with low-level details such as communication. LPAR is intended for applications which locally concentrate computational effort non-uniformly or employ multiple level representations. We present computational results for two such applications, particle dynamics and multigrid, running on the Intel iPSC/860 and the nCUBE/2. Performance achieved with these portable applications is competitive with highly optimized Fortran running on a single processor of the Cray Y-MP.

15.
Fourteen concurrent object-oriented languages are compared in terms of how they deal with communication, synchronization, process management, inheritance, and implementation trade-offs. The ways in which they divide responsibility between the programmer, the compiler, and the operating system are also investigated. It is found that current object-oriented languages that have concurrency features are often compromised in important areas, including inheritance capability, efficiency, ease of use, and degree of parallel activity. Frequently, this is because the concurrency features were added after the language was designed. The languages discussed are Actors, ABCL/1, ABCL/R, Argus, COOL, Concurrent Smalltalk, Eiffel, Emerald, ES-Kit C++, Hybrid, Nexus, Parmacs, POOL-T, and Presto.

16.
Exploiting cache locality of parallel programs at runtime is a complementary approach to compiler optimization. This is particularly important for those applications with dynamic memory access patterns. We propose a memory-layout oriented technique to exploit cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and target-architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop using the memory-access space of its execution. Through effective runtime transformations, our system maximizes the data reuse in each partitioned data region assigned in a cache, and minimizes the data sharing among the partitioned data regions assigned to all caches. The executions of tasks in the partitions are scheduled in an adaptive and locality-preserving way to minimize the execution time of programs by trading off load balance and locality. We have implemented the Cacheminer runtime library on two commercial SMP servers and a SimCS-simulated SMP. Our simulation and measurement results show that our runtime approach can achieve comparable performance with the compiler optimizations for programs with regular computation and memory-access patterns, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our experimental results show that our approach is able to significantly improve the memory performance for the applications with irregular computation and dynamic memory access patterns. These types of programs are usually hard to optimize with static compiler optimizations.
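For comparison, the static compiler transformation mentioned in the abstract, tiling a loop so each block of data is reused while it is still in cache, looks like this in plain C++; Cacheminer performs an analogous partitioning at run time from the observed memory-access space:

```cpp
// Cache blocking (tiling) of a matrix transpose: the loops are split into
// B x B tiles so each tile of the source and destination is reused before
// being evicted. A static analogue of the runtime partitioning described above.
#include <vector>
#include <algorithm>
#include <cstdio>

int main() {
    const int N = 2048, B = 64;                          // matrix size, tile size
    std::vector<double> a(N * N), at(N * N);
    for (int i = 0; i < N * N; ++i) a[i] = i;

    for (int ii = 0; ii < N; ii += B)                    // tile row
        for (int jj = 0; jj < N; jj += B)                // tile column
            for (int i = ii; i < std::min(ii + B, N); ++i)
                for (int j = jj; j < std::min(jj + B, N); ++j)
                    at[j * N + i] = a[i * N + j];        // work stays inside one tile

    std::printf("at[1] = %f (should be %d)\n", at[1], N);   // element (1,0) of a
    return 0;
}
```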

17.
The issues of managing distributed applications are discussed, and a set of tools, the Meta system, that solves some longstanding problems is presented. The Meta model of a distributed application is described. To make the discussion concrete, it is shown how NuMon, a seismological analysis system for monitoring compliance with nuclear test-ban treaties, is managed within the Meta framework. The three steps entailed in using Meta are described. First, the programmer instruments the application and its environment with sensors and actuators. The programmer then describes the application structure using the object-oriented data modeling facilities of the authors' high-level control language, Lomita. Finally, the programmer writes a control program referencing the data model. Meta's performance and real-time behavior are examined.

18.
An environment that lets system applications be expressed as virtual machines, through which architecture-independent multiple-instruction, multiple-data stream (MIMD) programs are written, is described. The virtual machine hides the hardware configuration from the programmer so that the MIMD programming environment always appears the same, regardless of the actual hardware. The data-definition and procedural high-level languages used in the environment and the generation of object code in the environment are discussed. The runtime configuration of the system and an implemented prototype of the system are described.

19.
20.
A major problem for the integration of concurrency in object-oriented languages is the so-called inheritance anomaly, i.e. the conflicts between inheritance and concurrency that often cause the need to redefine inherited methods in order to maintain the integrity of objects. Several solutions have been proposed for resolving these conflicts. However, some of them are incomplete, and do not solve all types of inheritance anomaly; others make the definition of classes complex and/or their implementation inefficient. This paper describes a C++ library for concurrent programming that provides a comprehensive framework particularly suitable for coarse-grained distributed applications. This library copes with the inheritance anomaly problem, presenting a solution that minimizes the redefinition of inherited methods without increasing the complexity of writing them. This solution is based on the use of a special set of methods, called interface methods, composed of a body and two sets of synchronization constraints. These two sets of synchronization constraints are respectively used to enable the execution of their method body and to disable the methods that cannot be executed after their method. © 1998 John Wiley & Sons, Ltd.
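The flavour of separating synchronization constraints from method bodies can be shown with a guarded bounded buffer in standard C++. This only imitates the spirit of the paper's interface methods; the actual library expresses the enabling and disabling constraints declaratively:

```cpp
// A bounded buffer whose methods are guarded by explicit enabling conditions
// (notFull for put, notEmpty for get), kept separate from the method bodies.
// This imitates the idea of synchronization constraints; it is not the paper's
// C++ library.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <cstdio>

class BoundedBuffer {
public:
    explicit BoundedBuffer(std::size_t capacity) : capacity_(capacity) {}

    void put(int x) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return notFull(); });    // enabling constraint for put
        q_.push(x);
        cv_.notify_all();
    }
    int get() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return notEmpty(); });   // enabling constraint for get
        int x = q_.front();
        q_.pop();
        cv_.notify_all();
        return x;
    }
private:
    // Synchronization constraints, separated from the method bodies above; a
    // subclass could refine these without rewriting put()/get().
    bool notFull()  const { return q_.size() < capacity_; }
    bool notEmpty() const { return !q_.empty(); }

    std::size_t capacity_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
};

int main() {
    BoundedBuffer buf(2);
    buf.put(1); buf.put(2);
    std::printf("%d %d\n", buf.get(), buf.get());
    return 0;
}
```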
