Similar Documents
20 similar documents found (search time: 15 ms)
1.
The paper presents a new parallel language, mpC, designed specifically for programming high-performance computations on heterogeneous networks of computers, together with its supportive programming environment. The main idea underlying mpC is that an mpC application explicitly defines an abstract network and distributes data, computations, and communications over that network. At run time, the mpC programming environment uses this information, together with information about the actual executing network, to map the application onto the real network in a way that ensures its efficient execution. Experience of using mpC to solve both regular and irregular real-life problems on networks of heterogeneous computers is also presented. Copyright © 2000 John Wiley & Sons, Ltd.

2.
GOP is a graph-oriented programming model which aims at providing high-level abstractions for configuring and programming cooperative parallel processes. With GOP, the programmer can configure the logical structure of a parallel/distributed program by constructing a logical graph to represent the communication and synchronization between the local programs in a distributed processing environment. This paper describes a visual programming environment, called VisualGOP, for the design, coding, and execution of GOP programs. VisualGOP applies visual techniques to provide the programmer with automated and intelligent assistance throughout the program design and construction process. It provides a graphical interface with support for interactive graph drawing and editing, visual programming functions, and automation facilities for program mapping and execution. VisualGOP is a generic programming environment independent of programming languages and platforms. GOP programs constructed under VisualGOP can run in heterogeneous parallel/distributed systems. Copyright © 2005 John Wiley & Sons, Ltd.

3.
ParC is an extension of the C programming language with block-oriented parallel constructs that allow the programmer to express fine-grain parallelism in a shared-memory model. It is suitable for expressing parallel shared-memory algorithms and is also conducive to the parallelization of sequential C programs. In addition, performance-enhancing transformations can be applied within the language, without resorting to low-level programming. The language includes closed constructs to create parallelism, as well as instructions to terminate parallel activities and to enforce synchronization. The parallel constructs are used to define the scope of shared variables and to delimit the sets of activities that are influenced by termination or synchronization instructions. The semantics of parallelism are discussed, especially the discrepancy between the limited number of physical processors and the potentially much larger number of parallel activities in a program.
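To make the block-oriented idea concrete, here is a minimal sketch in plain C with Pthreads (not actual ParC syntax, which provides dedicated constructs): parallelism opens with a block of activities and is closed again by joining them, so sharing and termination are scoped to the block.

```c
/* A C/Pthreads sketch of the *idea* behind closed parallel constructs:
 * parallelism begins and ends with a block, so shared state and
 * synchronization are scoped to it. Illustrative, not ParC syntax. */
#include <pthread.h>
#include <stdio.h>

#define N 4

static int shared_sum[N];           /* shared within the "block" */

static void *activity(void *arg) {
    int id = (int)(long)arg;
    shared_sum[id] = id * id;       /* fine-grain parallel activity */
    return NULL;
}

int main(void) {
    pthread_t t[N];
    /* "open" the parallel block: create N activities */
    for (int i = 0; i < N; i++)
        pthread_create(&t[i], NULL, activity, (void *)(long)i);
    /* "close" the block: all activities terminate here */
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    int total = 0;
    for (int i = 0; i < N; i++)
        total += shared_sum[i];
    printf("total = %d\n", total);  /* 0 + 1 + 4 + 9 = 14 */
    return 0;
}
```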

4.
Ming Hsiang Huang  Wuu Yang 《Software》2020,50(10):1877-1904
OpenACC is a directive-based programming model which allows programmers to write graphics processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices for expressing nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of the memory hierarchy. Parallel loops can be arbitrarily nested or placed inside functions that may be (possibly recursively) called from other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory-allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism: thread blocks dynamically organize loop iterations into batches and execute the batches in depth-first order, and different thread blocks share iterations with one another through an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.
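As a rough CPU-side illustration of the iteration-stealing half of that mechanism, the following C/Pthreads sketch gives each worker a range of iterations and lets an idle worker steal half of a victim's remaining range. All names are illustrative; the real PFACC runtime implements this per CUDA thread block, in software.

```c
/* Simplified sketch of iteration sharing: each worker owns a range of
 * loop iterations; an idle worker steals the upper half of a victim's
 * remaining range. Hypothetical helper names, not the PFACC runtime. */
#include <pthread.h>
#include <stdio.h>

#define WORKERS 4
#define ITERS   1000

typedef struct { int lo, hi; pthread_mutex_t lock; } range_t;
static range_t ranges[WORKERS];

static int steal(int thief) {                /* returns 1 on success */
    for (int v = 0; v < WORKERS; v++) {
        if (v == thief) continue;
        pthread_mutex_lock(&ranges[v].lock);
        int left = ranges[v].hi - ranges[v].lo;
        if (left > 1) {                      /* take the upper half */
            int mid = ranges[v].lo + left / 2;
            pthread_mutex_lock(&ranges[thief].lock);
            ranges[thief].lo = mid;
            ranges[thief].hi = ranges[v].hi;
            pthread_mutex_unlock(&ranges[thief].lock);
            ranges[v].hi = mid;
            pthread_mutex_unlock(&ranges[v].lock);
            return 1;
        }
        pthread_mutex_unlock(&ranges[v].lock);
    }
    return 0;
}

static void *worker(void *arg) {
    int id = (int)(long)arg;
    long done = 0;
    for (;;) {
        pthread_mutex_lock(&ranges[id].lock);
        if (ranges[id].lo >= ranges[id].hi) {
            pthread_mutex_unlock(&ranges[id].lock);
            if (!steal(id)) break;           /* nothing left anywhere */
            continue;
        }
        int i = ranges[id].lo++;             /* grab one iteration */
        pthread_mutex_unlock(&ranges[id].lock);
        (void)i; done++;                     /* ... loop body here ... */
    }
    printf("worker %d ran %ld iterations\n", id, done);
    return NULL;
}

int main(void) {
    pthread_t t[WORKERS];
    int chunk = ITERS / WORKERS;
    for (int w = 0; w < WORKERS; w++) {
        ranges[w].lo = w * chunk;
        ranges[w].hi = (w == WORKERS - 1) ? ITERS : (w + 1) * chunk;
        pthread_mutex_init(&ranges[w].lock, NULL);
    }
    for (int w = 0; w < WORKERS; w++)
        pthread_create(&t[w], NULL, worker, (void *)(long)w);
    for (int w = 0; w < WORKERS; w++)
        pthread_join(t[w], NULL);
    return 0;
}
```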

5.
The current trend of research on multithreading processors is toward chip multithreading (CMT), which exploits thread-level parallelism (TLP) and improves the performance of software built on traditional threading components, e.g., Pthreads. Commercially available processors already support simultaneous multithreading (SMT) on multicore chips, but they are based on the conventional sequential execution model and execute multiple threads in parallel under the control of an OS that handles interruptions. Moreover, few languages or programming techniques exist to utilize multicore processors effectively. We take another approach and develop a multithreading processor dedicated to TLP. Our processor, named Fuce, is based on continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions that executes without interruption; every thread execution is triggered only by an event called a continuation. This paper first introduces the continuation-based multithreaded execution model and its processor architecture, then presents multithreaded programming techniques and the continuation-based multithreading language system CML. Finally, the performance of the Fuce processor is evaluated by means of clock-level software simulation.
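The following toy C sketch (illustrative names only, not the Fuce instruction set) shows the triggering rule: a thread body runs exactly when its synchronization count reaches zero, i.e., when all expected continuations have arrived.

```c
/* Minimal sketch of continuation-based triggering: a "thread" is a
 * run-to-completion code block that becomes ready only when its
 * synchronization count reaches zero. Names are illustrative. */
#include <stdio.h>

typedef struct cthread {
    int sync_count;                  /* continuations still outstanding */
    void (*code)(struct cthread *self);
} cthread;

/* send a continuation to t: decrement its count, fire when it hits zero */
static void continue_to(cthread *t) {
    if (--t->sync_count == 0)
        t->code(t);                  /* executes without interruption */
}

static void consumer_body(cthread *self) {
    (void)self;
    printf("consumer runs: both producers finished\n");
}

int main(void) {
    cthread consumer = { .sync_count = 2, .code = consumer_body };
    printf("producer 1 done\n");
    continue_to(&consumer);          /* count 2 -> 1: not ready yet */
    printf("producer 2 done\n");
    continue_to(&consumer);          /* count 1 -> 0: consumer fires */
    return 0;
}
```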

6.
This paper describes the design and implementation of an Efficient Architecture for Running THreads (EARTH) runtime system for a multi-processor/multi-node cluster. The EARTH model was designed to support the efficient execution of parallel (multi-threaded) programs with irregular fine-grain parallelism on off-the-shelf computers. Implementing an EARTH runtime system requires an explicitly threaded runtime system. For portability, we built this runtime system on top of Pthreads under Linux and used sockets for inter-node communication. Moreover, in order to make the best use of the resources available on a cluster of symmetric multi-processors (SMPs), this implementation enables the overlapping of communication and computation. We used Threaded-C, a language designed to implement the programming model supported by the EARTH architecture. This language allows the expression of various levels of parallelism and provides the primitives needed to manage the required communication and synchronization. The Threaded-C programming language supports irregular fine-grain parallelism through a two-level hierarchy of threads and fibers. It also provides various synchronization and communication constructs that reflect the nature of EARTH's fibers (non-preemptive execution with data-driven scheduling) as well as the extensive use of split-phase transactions on EARTH to execute long-latency operations. Copyright © 2003 John Wiley & Sons, Ltd.

7.
NestStep is a parallel programming language for the BSP (bulk-synchronous parallel) model of parallel computation. Extending the classical BSP model, NestStep supports dynamically nested parallelism through nesting of supersteps and a hierarchical processor-group concept. Furthermore, NestStep adds a virtual shared memory realized in software, where memory consistency is relaxed to superstep boundaries. Distribution of shared arrays is also supported. A prototype for a subset of NestStep has been implemented with Java as the sequential base language; it targets a set of Java Virtual Machines coupled by Java socket communication into a virtual parallel computer.
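A minimal C/Pthreads sketch of the relaxed-consistency superstep idea follows (NestStep's prototype is Java-based; this is only an analogue): each processor computes on a private contribution, and the shared value is updated only at the superstep boundary.

```c
/* Sketch of BSP-style supersteps with memory consistency relaxed to
 * superstep boundaries: local updates stay private until the barrier,
 * then they are combined. Illustrative analogue, not NestStep itself. */
#include <pthread.h>
#include <stdio.h>

#define P 4
static pthread_barrier_t step_end;
static int shared_x = 0;              /* shared variable */
static int contrib[P];                /* per-processor local updates */

static void *proc(void *arg) {
    int id = (int)(long)arg;
    /* --- superstep: compute phase (local only) --- */
    contrib[id] = id + 1;             /* no other processor sees this */
    pthread_barrier_wait(&step_end);  /* superstep boundary */
    /* --- commit phase: one processor combines the updates --- */
    if (id == 0)
        for (int i = 0; i < P; i++) shared_x += contrib[i];
    pthread_barrier_wait(&step_end);  /* consistency restored */
    if (id == 0)
        printf("x after superstep = %d\n", shared_x);  /* 10 */
    return NULL;
}

int main(void) {
    pthread_t t[P];
    pthread_barrier_init(&step_end, NULL, P);
    for (long i = 0; i < P; i++)
        pthread_create(&t[i], NULL, proc, (void *)i);
    for (int i = 0; i < P; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```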

8.
Gopal Gupta  Enrico Pontelli 《Software》2001,31(12):1143-1181
Naive parallel implementation of non-deterministic systems (such as theorem-proving systems) and languages (such as logic, constraint, or concurrent constraint languages) can result in poor performance. We present three optimization schemas, based on flattening of the computation tree, procrastination of overheads, and sequentialization of computations, that can be systematically applied to parallel implementations of non-deterministic systems and languages to reduce the parallel overhead and to obtain improved efficiency of parallel execution. The effectiveness of these schemas is illustrated by applying them to the ACE parallel logic programming system. The performance data presented show that considerable improvement in execution efficiency can be achieved. Copyright © 2001 John Wiley & Sons, Ltd.

9.
New Model and Algorithm for Hardware/Software Partitioning
This paper focuses on the algorithmic aspects of hardware/software (HW/SW) partitioning, which seeks a composition of hardware and software components that not only satisfies the constraint on hardware area but also optimizes execution time. The computational model is extended so that all possible types of communication can be taken into account in the partitioning. A new dynamic programming algorithm is then proposed on the basis of this model, in which the source data of basic scheduling blocks, rather than the speedups used in previous work, are directly utilized to compute the optimal solution. The proposed algorithm runs in O(n·A) time for n code fragments and available hardware area A. Simulation results show that the proposed algorithm solves the HW/SW partitioning problem without an increase in running time compared with the algorithm cited in the literature.
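As a hedged sketch of the flavor of such a dynamic program (this toy version ignores the communication costs that the paper's model handles), a knapsack-style recurrence over fragments and area budget runs in O(n·A):

```c
/* Knapsack-style DP sketch: for each code fragment, choose software
 * (time ts, no area) or hardware (time th, area a), minimizing total
 * time within the hardware-area budget A. Toy data; communication
 * costs between fragments are omitted. */
#include <stdio.h>
#include <string.h>

#define N 4          /* code fragments */
#define A 10         /* available hardware area */

int main(void) {
    int ts[N] = { 8, 6, 9, 5 };   /* software execution times */
    int th[N] = { 2, 1, 3, 2 };   /* hardware execution times */
    int ar[N] = { 4, 3, 5, 2 };   /* hardware area costs */

    int dp[A + 1];                /* dp[a] = min time within area a */
    int next[A + 1];
    memset(dp, 0, sizeof dp);     /* no fragments placed yet */

    for (int i = 0; i < N; i++) {
        for (int a = 0; a <= A; a++) {
            int best = dp[a] + ts[i];             /* fragment in SW */
            if (a >= ar[i] && dp[a - ar[i]] + th[i] < best)
                best = dp[a - ar[i]] + th[i];     /* fragment in HW */
            next[a] = best;
        }
        memcpy(dp, next, sizeof dp);
    }
    printf("minimum execution time within area %d: %d\n", A, dp[A]);
    return 0;
}
```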

10.
Programming for large-scale, multicore-based architectures requires adequate tools that offer ease of programming and do not hinder application performance. StarSs is a family of parallel programming models based on automatic function-level parallelism that targets productivity. StarSs deploys a data-flow model: it analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible. This paper introduces Cluster Superscalar (ClusterSs), a new StarSs member designed to execute on clusters of SMPs (symmetric multiprocessors). ClusterSs tasks are created asynchronously and assigned to the available resources with the support of the IBM APGAS runtime, which provides an efficient and portable communication layer based on one-sided communication. We present the design of ClusterSs on top of APGAS, as well as the programming model and execution runtime for Java applications. Finally, we evaluate the productivity of ClusterSs in terms of both programmability and performance, and compare it with that of the IBM X10 language. Copyright © 2012 John Wiley & Sons, Ltd.
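For a concrete picture of the data-flow task idea, here is a small analogue written with standard OpenMP task dependences in C. It is not the ClusterSs API (which targets Java applications), only an illustration of dependence-driven task execution: the runtime discovers that one task depends on another through the data they touch and runs independent tasks concurrently.

```c
/* Analogue of data-flow tasking using OpenMP 4.0 task dependences:
 * the task reading x waits for the task writing x; the task writing
 * y is independent and may run concurrently with either. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 0, y = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)   /* task A: produces x */
        x = 41;

        #pragma omp task depend(out: y)   /* independent of task A */
        y = 100;

        #pragma omp task depend(in: x)    /* task B: waits for A */
        printf("x + 1 = %d\n", x + 1);    /* prints 42 */

        #pragma omp taskwait
        printf("y = %d\n", y);
    }
    return 0;
}
```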

11.
12.
That the influence of the PRAM model is ubiquitous in parallel algorithm design is as clear as the fact that it is technologically infeasible for the foreseeable future. The current generation of parallel hardware prominently features distributed memory and high-performance interconnection networks, very much the antithesis of the shared memory required by the PRAM model. It has been shown that, in spite of communication costs, very fast parallel algorithms are available for some problems on distributed-memory machines, from embarrassingly parallel problems to sorting and numerical analysis. In contrast, it is known that for other classes of problems, PRAM-style shared-memory simulation on a distributed-memory machine can, in theory, produce solutions of performance comparable to the best possible on such architectures. The Bulk Synchronous Parallel (BSP) model accurately represents most parallel machines, theoretical and actual, in an execution and cost model. We introduce a scalable, portable PRAM realization appropriate for BSP computers and a methodology for its use. Our system is fast and built upon the familiar sequential C++ coupled with the new standard BSP library of parallel computation and communication primitives. It is portable to, and predictable on, a vast number of parallel computers, including workstation clusters, a 256-processor Cray T3D, an 8-node IBM SP/2, and a 4-node shared-memory SGI Power Challenge machine. Our approach achieves simplicity of programming over direct-mode BSP programming for reasonable overhead cost. We objectively compare optimized BSP and PRAM algorithms implemented with our C++ PRAM library and provide encouraging experimental results for our new style of programming. Copyright © 2000 John Wiley & Sons, Ltd.
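A minimal superstep example in C using classic BSPlib primitives gives the flavor of the underlying model (the paper's own contribution is a C++ PRAM layer on top of such primitives; this sketch is not that library):

```c
/* Minimal BSPlib-style superstep: every processor writes its id into
 * the next processor's memory with a one-sided put; the value becomes
 * visible only after bsp_sync(), i.e., at the superstep boundary. */
#include <stdio.h>
#include "bsp.h"

int main(int argc, char *argv[]) {
    bsp_begin(bsp_nprocs());             /* start SPMD section */
    int p = bsp_nprocs(), pid = bsp_pid();

    int left = -1;                       /* will hold neighbor's id */
    bsp_push_reg(&left, sizeof left);    /* make it remotely writable */
    bsp_sync();

    int me = pid;
    bsp_put((pid + 1) % p, &me, &left, 0, sizeof me);
    bsp_sync();                          /* superstep boundary */

    printf("processor %d received %d\n", pid, left);
    bsp_pop_reg(&left);
    bsp_end();
    return 0;
}
```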

13.
Data-Parallel Computing: Concepts, Models, and Systems
1. Introduction. Parallel computing, or parallel processing, refers to the effort, and the research surrounding it, of using multiple components with computing capability to jointly complete a computational task, so as to obtain the result faster than with a single component. This is clearly a natural idea; historically, the idea and practice of parallel processing have existed almost as long as computers themselves. From the late 1980s to the early 1990s, in seeking to address several major … facing humanity

14.
王一拙  陈旭  计卫星  苏岩  王小军  石峰 《软件学报》2016,27(7):1789-1804
Task-parallel programming models have become the mainstream of parallel programming; they improve the performance of parallel computer systems by exploiting task-level parallelism. This paper proposes a fault-tolerant task-parallel programming model that integrates fault-tolerance techniques into the task-parallel model, improving system reliability while preserving performance. The model takes the task as the basic unit of scheduling, execution, error detection, and recovery, and implements fault-tolerance support at the application level. A Buffer-Commit computation model supports the detection of and recovery from transient errors; application-level diskless checkpointing handles recovery from permanent node failures; and a fault-tolerant work-stealing scheduling strategy provides dynamic load balancing. Experimental results show that the model provides fault tolerance against hardware errors at a low performance cost.

15.
When parallel applications run in large-scale distributed environments, such as grids, peer-to-peer (P2P) systems, and clouds, the set of resources used can change dynamically as machines crash, reservations end, and new resources become available. It is vital for applications to respond to these changes, and therefore necessary to keep track of the available resources, a problem known to be notoriously difficult. In this article we argue that resource tracking must be provided as standard functionality in the lower parts of the software stack. We propose a general solution to resource tracking: the Join-Elect-Leave (JEL) model. JEL provides unified resource tracking for parallel and distributed applications across environments. It is a simple yet powerful model based on notifications when resources have Joined or Left the computation. We demonstrate that JEL is suitable for resource tracking in a wide variety of programming models, ranging from the fixed resource sets traditionally used in MPI-1 to flexible grid-oriented programming models. We compare several JEL implementations and show that these perform and scale well in several real-world scenarios involving grids, clouds, and P2P systems applied concurrently, as well as wide-area systems with failing resources. Using JEL, we have won first prize in a number of international distributed computing competitions. Copyright © 2010 John Wiley & Sons, Ltd.
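A hypothetical C rendering of the notification interface suggests how an application might consume such events; the names below are illustrative and are not the actual JEL/Ibis API.

```c
/* Hypothetical callback interface illustrating JEL-style notifications.
 * None of these names come from the actual JEL implementation. */
#include <stdio.h>

typedef struct { int id; const char *host; } resource_t;

typedef struct {
    void (*joined)(const resource_t *r);   /* resource Joined */
    void (*elected)(const resource_t *r);  /* resource Elected coordinator */
    void (*left)(const resource_t *r);     /* resource Left (or crashed) */
} jel_callbacks_t;

/* Stand-in for the tracking layer: a real implementation would drive
 * these callbacks from its membership and failure-detection protocol. */
static void simulate_events(const jel_callbacks_t *cb) {
    resource_t a = { 1, "node-a.example.org" };
    resource_t b = { 2, "node-b.example.org" };
    cb->joined(&a);
    cb->joined(&b);
    cb->elected(&a);   /* e.g., pick a coordinator among members */
    cb->left(&b);      /* reservation ended or machine crashed */
}

static void on_join(const resource_t *r)  { printf("join: %s -> grow worker pool\n", r->host); }
static void on_elect(const resource_t *r) { printf("elect: %s -> acts as coordinator\n", r->host); }
static void on_leave(const resource_t *r) { printf("leave: %s -> reschedule its work\n", r->host); }

int main(void) {
    jel_callbacks_t cb = { on_join, on_elect, on_leave };
    simulate_events(&cb);
    return 0;
}
```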

16.
The restricted synchronization structure of so-called structured parallel programming paradigms has an advantageous effect on programmer productivity, cost modeling, and scheduling complexity. However, imposing these restrictions can lead to a loss of parallelism compared with a programming approach that imposes no synchronization structure. In this paper we study the potential loss of parallelism when expressing parallel computations in a programming model that limits the computation graph (DAG) to a series-parallel topology, which characterizes all well-known structured programming models. We present an analytical model that approximately captures this loss of parallelism in terms of simple parameters related to DAG topology and workload distribution. We validate the model using a wide range of synthetic and real-world parallel computations running on shared- and distributed-memory machines. Although the loss of parallelism is theoretically unbounded, our measurements show that for all of the above applications the performance loss due to choosing a series-parallel structured model is invariably limited to at most 10%. In all cases, the loss of parallelism is predictable provided that the topology and workload variability of the DAG are known.

17.
Parallel implementations of swarm intelligence algorithms such as ant colony optimization (ACO) have been widely used to shorten execution time when solving complex optimization problems. When targeting a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of algorithmic skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g., map, fold, and zip) that are later converted into efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain-specific language Musket can cope with the development of a parallel implementation of ACO, and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high-performance code with execution times similar to those of low-level implementations.
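For intuition, here is a tiny sequential C rendering of the map and fold patterns that skeleton frameworks parameterize with user functions. Musket's contribution is generating efficient parallel CUDA code from such patterns; this sketch makes no attempt at that and is not Musket syntax.

```c
/* Skeletons as reusable patterns parameterized by user functions:
 * map applies f elementwise (trivially parallelizable); fold reduces
 * with an associative operator (parallelizable as a tree reduction). */
#include <stdio.h>

typedef double (*unary_fn)(double);
typedef double (*binary_fn)(double, double);

static void map(double *xs, int n, unary_fn f) {
    for (int i = 0; i < n; i++)
        xs[i] = f(xs[i]);
}

static double fold(const double *xs, int n, double init, binary_fn f) {
    double acc = init;
    for (int i = 0; i < n; i++)
        acc = f(acc, xs[i]);
    return acc;
}

static double square(double x) { return x * x; }
static double add(double a, double b) { return a + b; }

int main(void) {
    double xs[] = { 1, 2, 3, 4 };
    map(xs, 4, square);
    printf("sum of squares = %g\n", fold(xs, 4, 0.0, add));  /* 30 */
    return 0;
}
```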

18.
Graphics processing units (GPUs) have taken an important role in the general-purpose computing market in recent years. At present, the common approach to programming GPUs is to write GPU-specific code…

19.
Heterogeneous many-core architectures offer an extremely high performance-to-power ratio and have become an important direction in supercomputer architecture. However, the more complex parallelism and memory hierarchies of many-core systems pose great challenges for programming and optimization, so research on parallel programming techniques for many-core systems is significant both for lowering the difficulty of writing parallel applications for Chinese domestic many-core systems and for improving program performance. This paper proposes a unified-architecture, multi-mode parallel programming model comprising a heterogeneous fused acceleration model and an autonomous model programmed in a homogeneous style. Based on this model, the Parallel C language was designed; it can effectively describe the heterogeneous parallelism of domestic many-core systems. Compared with the MPI+X style used on other many-core systems, both programming and system optimization take a global view, with distinctive features in multi-level locality description, one-sided messaging, and compatibility with existing multicore applications. A Parallel C compilation system was built on Open64, fully supporting both the acceleration and autonomous models and implementing optimizations such as data layout with automatic DMA, compiler-directed thread proxying, and topology-aware collective communication. Tests with micro-benchmarks and real applications on the Sunway TaihuLight system show that the Parallel C language and its compilation system deliver good performance and scalability and can effectively support large applications.

20.
Cluster Computing for Determining Three-Dimensional Protein Structure
Determining the three-dimensional structure of proteins is crucial to efficient drug design and to understanding biological processes. One successful method for computing a molecule's shape relies on inter-atomic distance bounds provided by Nuclear Magnetic Resonance spectroscopy. The accuracy of the computed structures, as well as the time required to obtain them, is greatly improved if the gaps between the upper and lower distance bounds are reduced. These gaps are reduced most effectively by applying the tetrangle inequality, derived from the Cayley-Menger determinant, to all atom quadruples. However, tetrangle-inequality bound smoothing is an extremely computation-intensive task, requiring O(n^4) time for an n-atom molecule. To reduce the computation time, we propose a novel coarse-grained parallel algorithm intended for a Beowulf-type cluster of PCs. The algorithm employs p <= n/6 processors and requires O(n^4/p) time and O(p^2) communications. The number of communications is at least an order of magnitude lower than in earlier parallelizations. Our implementation utilized processors with at least 59% efficiency (including the communication overhead), an impressive figure for a non-embarrassingly parallel problem on a cluster of workstations.
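For intuition about bound smoothing, the following C sketch applies the simpler triangle inequality (Floyd-Warshall style) to an upper-bound distance matrix. The paper's tetrangle smoothing applies the stronger Cayley-Menger condition to all O(n^4) atom quadruples and is what gets parallelized; this weaker O(n^3) analogue only illustrates how bounds shrink through intermediate atoms.

```c
/* Triangle-inequality bound smoothing: shrink each upper bound
 * ub[i][j] through every intermediate atom k. A weaker analogue of
 * the tetrangle smoothing described above, shown for intuition. */
#include <stdio.h>

#define N 4

int main(void) {
    /* symmetric upper-bound matrix; 99 = effectively unknown */
    double ub[N][N] = {
        {  0,  1.5, 99,  99 },
        { 1.5,  0,  2.0, 99 },
        { 99,  2.0,  0,  1.0 },
        { 99,  99,  1.0,  0 },
    };
    for (int k = 0; k < N; k++)            /* O(n^3) smoothing */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (ub[i][k] + ub[k][j] < ub[i][j])
                    ub[i][j] = ub[i][k] + ub[k][j];
    printf("smoothed ub(0,3) = %.1f\n", ub[0][3]);  /* 1.5+2.0+1.0 = 4.5 */
    return 0;
}
```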
