Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
2.
Opus is a new programming language designed to assist in coordinating the execution of multiple, independent program modules. With the help of Opus, coarse grained task parallelism between data parallel modules can be expressed in a clean and structured way. In this paper we address the problems of how to build a compilation and runtime support system that can efficiently implement the Opus constructs. Our design considers the often‐conflicting goals of efficiency and modular construction through software re‐use. In particular, we present the system requirements for an efficient Opus implementation, the Opus runtime system, and describe how they work together to provide the underlying services that the Opus compiler needs for a broad class of machines. Copyright © 2000 John Wiley & Sons, Ltd.  相似文献   

3.
Multithreaded architectures provide an opportunity for efficiently executing programs with irregular parallelism and/or irregular locality. This paper presents a strategy that makes use of the multithreaded execution model without exposing multithreading to the programmer. Our approach is to design simple extensions to C, and to provide compiler support that automatically translates high-level C programs into lower-level threaded programs. In this paper we present EARTH-C, our extended C language, which contains simple constructs for specifying control parallelism, data locality, shared variables and atomic operations. Based on EARTH-C, we describe compiler techniques that are used for translating to lower-level Threaded-C programs for the EARTH multithreaded architecture. We demonstrate our approach with six benchmark programs. We show that even naive EARTH-C programs can lead to reasonable performance, and that more advanced EARTH-C programs can give performance very close to hand-coded Threaded-C programs. This work was supported, in part, by NSERC and FCAR.

4.
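A Method for Porting Multithreaded Computing Programs to Clusters   Cited by: 3 (self-citations: 0, citations by others: 3)
Cluster-parallel applications come with a wide variety of user interfaces and programming styles, which often discourages users. This paper describes in detail a porting technique that starts from a program's object code. Based on relocating the PLT entries of ELF-format executables, it exploits the concurrency and synchronization characteristics of multithreaded programs to distribute the computational load of their threads across the nodes of a cluster, giving users a transparent cluster parallelism mechanism. A corresponding Master-Worker (Task-Farming) computation and communication model and its scheduling strategies are proposed and discussed. Finally, the technique is implemented, and the results of running a ported BLAS-based multithreaded matrix multiplication program are analyzed to verify the feasibility and efficiency of the model.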
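
The Master-Worker (Task-Farming) pattern named in this abstract can be illustrated with a minimal Python sketch. It is not the paper's technique (which relocates ELF PLT entries in object code): here a thread pool merely stands in for cluster nodes, and the function names `multiply_row_block` and `master_worker_matmul` and the `block` parameter are hypothetical, chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_row_block(a_rows, b):
    """Worker task: multiply a block of rows of A by the full matrix B."""
    b_cols = list(zip(*b))  # columns of B, computed once per task
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a_rows]


def master_worker_matmul(a, b, workers=4, block=2):
    """Master: split A into row blocks, farm them out, reassemble in order."""
    blocks = [a[i:i + block] for i in range(0, len(a), block)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so rows stay in place.
        parts = list(pool.map(multiply_row_block, blocks, [b] * len(blocks)))
    return [row for part in parts for row in part]
```

For example, `master_worker_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` splits the work into row blocks and returns the ordinary matrix product. A real cluster port would replace the thread pool with inter-node communication.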

5.
Programming for large‐scale, multicore‐based architectures requires adequate tools that offer ease of programming and do not hinder application performance. StarSs is a family of parallel programming models based on automatic function‐level parallelism that targets productivity. StarSs deploys a data‐flow model: it analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible. This paper introduces Cluster Superscalar (ClusterSs), a new StarSs member designed to execute on clusters of SMPs (Symmetric Multiprocessors). ClusterSs tasks are asynchronously created and assigned to the available resources with the support of the IBM APGAS runtime, which provides an efficient and portable communication layer based on one‐sided communication. We present the design of ClusterSs on top of APGAS, as well as the programming model and execution runtime for Java applications. Finally, we evaluate the productivity of ClusterSs, both in terms of programmability and performance and compare it to that of the IBM X10 language. Copyright © 2012 John Wiley & Sons, Ltd.  相似文献   

6.
Svend E. Knudsen 《Software》2011,41(4):393-402
A simple programming abstraction based on the notion of independence is introduced as a means for mapping the independence inherent in an algorithm explicitly into its programmed solution. This enables a compiler and runtime system to exploit the independence and achieve efficient parallelism of execution on multicore processors. The constructs needed to express mutual independence among statements are proposed and their implementation in iOberon, an extension of the Active Oberon programming language, is defined. The programming language extensions, runtime support, and performance measurements are described in detail. We believe that this concept of specifying local disjoint program fragments can be applied to other programming languages. Copyright © 2010 John Wiley & Sons, Ltd.  相似文献   

7.
This paper introduces the JStar parallel programming language, which is a Java-based declarative language aimed at discouraging sequential programming, encouraging massively parallel programming, and giving the compiler and runtime maximum freedom to try alternative parallelisation strategies. We describe the execution semantics and runtime support of the language, several optimisations and parallelism strategies, with some benchmark results.  相似文献   

8.
Fine-grain MPI (FG-MPI) extends the execution model of MPI to allow for interleaved execution of multiple concurrent MPI processes inside an OS-process. It provides a runtime that is integrated into the MPICH2 middleware and uses light-weight coroutines to implement an MPI-aware scheduler. In this paper we describe the FG-MPI runtime system and discuss the main design issues in its implementation. FG-MPI enables expression of function-level parallelism, which along with a runtime scheduler, can be used to simplify MPI programming and achieve performance without adding complexity to the program. As an example, we use FG-MPI to re-structure a typical use of non-blocking communication and show that the integrated scheduler relieves the programmer from scheduling computation and communication inside the application and brings the performance part outside of the program specification into the runtime.  相似文献   

9.
The Block Conjugate Gradient algorithm (Block‐CG) was developed to solve sparse linear systems of equations that have multiple right‐hand sides. We have adapted it for use in heterogeneous, geographically distributed, parallel architectures. Once the main operations of the Block‐CG (Tasks) have been collected into smaller groups (subjobs), each subjob is matched by the middleware MJMS (MPI Jobs Management System) with a suitable resource selected among those which are available. Moreover, within each subjob, concurrency is introduced at two different levels and with two different granularities: coarse‐grained parallelism to perform independent tasks, and fine‐grained parallelism within the execution of a task. We refer to this algorithm as the multi‐grained distributed implementation of the parallel Block‐CG. We compare the performance of a parallel implementation with that of the distributed implementation running on a variety of Grid computing environments. The middleware MJMS, developed by some of the authors and built on top of Globus Toolkit and Condor‐G, was used for co‐allocation, synchronization, scheduling and resource selection. Copyright © 2010 John Wiley & Sons, Ltd.

10.
Multi‐core processors offer a huge potential of parallelism but pose a challenge of program development for achieving high performance in real applications. We compare three popular parallel programming models—POSIX threads (Pthreads), OpenMP, and Threading Building Blocks (TBB)—regarding their use for multi‐core systems. We analyze how these models can be employed for implementing various parallelizations of a real‐world application from the area of medical imaging, and we conduct extensive runtime experiments to measure performance. Our main contribution is a comprehensive comparison of Pthreads, OpenMP, and TBB with respect to the following criteria: program development effort, programming style, level of abstraction, and runtime performance on multi‐cores. Copyright © 2010 John Wiley & Sons, Ltd.  相似文献   

11.
12.
Loops are the richest source of parallelism in scientific applications. A large number of loop scheduling schemes have therefore been devised for loops with and without data dependencies (modeled as dependence distance vectors) on heterogeneous clusters. The loops with data dependencies require synchronization via cross‐node communication. Synchronization requires fine‐tuning to overcome the communication overhead and to yield the best possible overall performance. In this paper, a theoretical model is presented to determine the granularity of synchronization that minimizes the parallel execution time of loops with data dependencies when these are parallelized on heterogeneous systems using dynamic self‐scheduling algorithms. New formulas are proposed for estimating the total number of scheduling steps when a threshold for the minimum work assigned to a processor is assumed. The proposed model uses these formulas to determine the synchronization granularity that minimizes the estimated parallel execution time. The accuracy of the proposed model is verified and validated via extensive experiments on a heterogeneous computing system. The results show that the theoretically optimal synchronization granularity, as determined by the proposed model, is very close to the experimentally observed optimal synchronization granularity, with no deviation in the best case, and within 38.4% in the worst case. Copyright © 2012 John Wiley & Sons, Ltd.  相似文献   
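
To make the notion of "scheduling steps under a minimum-work threshold" concrete, here is a small Python sketch using guided self-scheduling (GSS), one well-known dynamic self-scheduling rule in which each request receives ceil(remaining / processors) iterations. The abstract does not say which self-scheduling algorithms or formulas the paper uses, so the choice of GSS, the function name `gss_chunks`, and the `min_chunk` parameter are assumptions made purely for illustration.

```python
import math


def gss_chunks(total_iterations, processors, min_chunk=1):
    """Return the chunk sizes handed out by guided self-scheduling.

    Each scheduling step assigns ceil(remaining / processors) iterations,
    but never fewer than min_chunk (except for the final leftover chunk).
    """
    chunks = []
    remaining = total_iterations
    while remaining > 0:
        chunk = max(min_chunk, math.ceil(remaining / processors))
        chunk = min(chunk, remaining)  # do not overshoot the loop bound
        chunks.append(chunk)
        remaining -= chunk
    return chunks
```

The number of scheduling steps is simply `len(gss_chunks(...))`. Raising `min_chunk` reduces the number of steps, and with it the number of synchronization points, at the cost of coarser load balancing.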

13.
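CLOS is a standard object-oriented language embedded in Common Lisp. Building on the class partitioning method we proposed, and by introducing synchronous or asynchronous communication protocols and RPC concurrency control, this paper describes in detail a new distributed CLOS system, ParCLOS. Test results show that when the nodes are fairly evenly balanced, the runtime efficiency exceeds 80%.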

14.
This paper surveys concurrency issues of programming languages. The evolution of these issues is analyzed in the context of the evolution of other language concepts, such as data and control abstraction. Specific concurrency concepts discussed in the paper include: granularity of parallelism, degree of parallelism, synchronization and communication, and physical distribution. The review of the problems of synchronization and communication includes semaphores, messages and mailboxes, and monitors. Concurrency aspects of ADA are also presented as a case study of a state-of-the-art programming language.

15.
In this paper, a new parallel logic programming language, HPARLOG, developed by us is described, and a new scheme for implementing AND-parallelism in logic programming languages is proposed. This scheme not only resolves the instantiation conflict on shared variables and thoroughly explores the parallelism of programs with incrementally constructed data structures, but also decreases the dynamic complexity of the programs. In addition, a pseudo-copy based memory management scheme is proposed to enhance the locality of goal processes and lower the overhead of program execution.

16.
This paper begins by describing BSL, a new logic programming language fundamentally different from Prolog. BSL is a nondeterministic Algol-class language whose programs have a natural translation to first order logic; executing a BSL program without free variables amounts to proving the corresponding first order sentence. A new approach is proposed for parallel execution of logic programs coded in BSL, that relies on advanced compilation techniques for extracting fine grain parallelism from sequential code. We describe a new “Very Long Instruction Word” (VLIW) architecture for parallel execution of BSL programs. The architecture, now being designed at the IBM Thomas J. Watson Research Center, avoids the synchronization and communication delays (normally associated with parallel execution of logic programs on multiprocessors), by determining data dependences between operations at compile time, and by coupling the processing elements very tightly, via a single central shared register file. A simulator for the architecture has been implemented and some simulation results are reported in the paper, which are encouraging.  相似文献   

17.
Embedded manycore architectures are often organized as fabrics of tightly-coupled shared memory clusters. A hierarchical interconnection system is used, with a crossbar-like medium inside each cluster and a network-on-chip (NoC) at the global level, which makes memory operations nonuniform (NUMA). Due to NUMA, regular applications typically employed in the embedded domain (e.g., image processing, computer vision, etc.) ultimately behave as irregular workloads if a flat memory system is assumed at the program level. Nested parallelism represents a powerful programming abstraction for these architectures, provided that (i) streamlined middleware support is available, whose overhead does not dominate the run-time of fine-grained applications; (ii) a mechanism to control thread binding at the cluster-level is supported. We present a lightweight runtime layer for nested parallelism on cluster-based embedded manycores, integrating our primitives in the OpenMP runtime system, and implementing a new directive to control NUMA-aware nested parallelism mapping. We explore on a set of real application use cases how NUMA makes regular parallel workloads behave as irregular, and how our approach makes it possible to control such effects and achieve up to 28× speedup versus flat parallelism.

18.
Because multicore CPUs have become the standard with all major hardware manufacturers, it becomes increasingly important for programming languages to provide programming abstractions that can be mapped effectively onto parallel architectures. Stream processing is a programming paradigm where computations are expressed as independent actors that communicate via FIFO data-channels. The coarse-grained parallelism exposed in stream programs facilitates such an efficient mapping of actors onto the underlying multicore hardware. We propose a stream-parallel programming abstraction that extends object-oriented languages with stream-programming facilities. StreamPI consists of a class hierarchy for actor-specification together with a language-independent runtime system that supports the execution of stream programs on multicore architectures. We show that the language-specific part of StreamPI, i.e., the class hierarchy, can be implemented as a library-level programming language extension. A library-level extension has the advantage that an existing programming language implementation need not be touched. Legacy-code can be mixed with a stream-parallel application, and the use of sequential legacy code with actors is supported. Unlike previous approaches, StreamPI allows dynamic creation and subsequent execution of stream programs. StreamPI actors are typed. Type-safety is achieved through type-checks at stream graph creation time. We have implemented StreamPI's language-independent runtime system and language interfaces for Ada 2005 and C++ for Intel multicore architectures. We have evaluated StreamPI for up to 16 cores on a two-CPU 8-core Intel Xeon X7560 server, and we provide a performance comparison with StreamIt (Gordon et al. in International Conference on Architectural Support for Programming Languages and Operating Systems, 2006), which is the de facto standard for stream-parallel programming. Although our approach provides greater programming flexibility than StreamIt, the performance of StreamPI compares favorably to the static compilation model of StreamIt.
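
The stream-processing paradigm described in this abstract (independent actors that communicate via FIFO data channels) can be sketched in a few lines of Python. This is not StreamPI's API, which targets Ada 2005 and C++; the names `run_actor` and `run_pipeline` and the end-of-stream sentinel are hypothetical choices for illustration, with one thread per actor and `queue.Queue` as the FIFO channel.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the stream


def run_actor(fn, inbox, outbox):
    """Actor loop: read items from inbox, apply fn, write results to outbox."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)  # propagate end-of-stream downstream
            return
        outbox.put(fn(item))


def run_pipeline(items, stages):
    """Connect one actor per stage with FIFO channels; return outputs in order."""
    channels = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_actor,
                         args=(fn, channels[i], channels[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        channels[0].put(item)
    channels[0].put(_END)

    results = []
    while True:
        item = channels[-1].get()
        if item is _END:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Because each stage is a single actor reading from a FIFO queue, output order matches input order: `run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])` passes each item through both stages in sequence.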

19.
This paper presents a language based on the logic programming paradigm that supports objects, messages and inheritance. The object-oriented extension is fairly simple: objects are clusters of processes, objects' state is represented by logical variables, message-passing communication between objects is performed via multi-head clauses, and inheritance is mapped into clause union. The language implementation is obtained by translating logic objects into a concurrent logic language based on multi-head clauses, taking advantage of its distributed implementation on a massively parallel architecture. The runtime support realizes some interesting features such as intensional messages and the transparency of object allocation, object migration and parallelism.  相似文献   

20.
The context of this work is a practical, open‐source visualization system, called JIVE, that supports two forms of runtime visualizations of Java programs: object diagrams and sequence diagrams. They capture, respectively, the current execution state and execution history of a Java program. These diagrams are similar to those found in the UML for specifying design‐time decisions. In our work, we construct these diagrams at execution time, thereby ensuring continuity of notation from design to execution. In so doing, a few extensions to the UML notation are proposed in order to better represent runtime behavior. As sequence diagrams can become long and unwieldy, we present techniques for their compact representation. A key result in this paper is a novel labeling scheme based upon regular expressions to compactly represent long sequences and an O(r²) algorithm for computing these labels, where r is the length of the input sequence, based upon the concept of ‘tandem repeats’ in a sequence. Horizontal compaction greatly helps minimize the extent of white space in sequence diagrams by the elimination of object lifelines and also by grouping lifelines together. We propose a novel extension to the sequence diagram to deal with out‐of‐model calls when the lifelines of certain classes of objects are filtered out of the visualization, but method calls may occur between in‐model and out‐of‐model calls. The paper also presents compaction techniques for multi‐threaded Java execution with different forms of synchronization. Finally, we present experimental results from compacting the runtime visualizations of a variety of Java programs and execution trace sizes in order to demonstrate the practicality and efficacy of our techniques. Copyright © 2016 John Wiley & Sons, Ltd.
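
The idea of labeling a long call sequence by its tandem repeats can be illustrated with a naive greedy sketch in Python. This is not the paper's algorithm and makes no claim about its complexity or its exact label syntax; the function name `compact` and the `(unit)^k` output notation are assumptions for illustration. At each position it picks the repeating unit that covers the longest run and emits it once with a repeat count.

```python
def compact(seq):
    """Greedily collapse adjacent repeats in a list of call names.

    At each position, every candidate period p is tried; the one whose
    consecutive repetitions cover the most elements (ties go to the
    shorter period) is emitted as "(unit)^k".
    """
    out = []
    i, n = 0, len(seq)
    while i < n:
        best_p, best_k = 1, 1
        for p in range(1, (n - i) // 2 + 1):
            unit = seq[i:i + p]
            k = 1
            # Count how many times the unit repeats back to back.
            while seq[i + k * p:i + (k + 1) * p] == unit:
                k += 1
            if k > 1 and k * p > best_p * best_k:
                best_p, best_k = p, k
        label = " ".join(seq[i:i + best_p])
        out.append(label if best_k == 1 else f"({label})^{best_k}")
        i += best_p * best_k
    return " ".join(out)
```

For instance, a trace of `a b a b a b` collapses to a single labeled unit repeated three times, which is the kind of shortening that keeps a sequence diagram readable.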
