Similar Documents
20 similar documents found (search time: 46 ms)
1.
2.
A large class of intensive numerical applications shows an irregular structure and exhibits unpredictable runtime behavior. Two kinds of irregularity can be distinguished in these applications: first, irregular control structures, derived from the use of conditional statements on data known only at runtime; second, irregular data structures, derived from computations involving sparse matrices, grids, trees, graphs, etc. Many of these applications exhibit a large amount of parallelism, but the above features usually make exploiting that parallelism a very difficult task. This paper discusses the effective parallelization of irregular numerical codes, focusing on the definition and use of data-parallel extensions to express the parallelism they exhibit. We show that combining data distributions with storage structures makes it possible to obtain efficient parallel codes. Codes dealing with sparse matrices, finite element methods and molecular dynamics (MD) simulations are taken as working examples.
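A minimal sketch of the kind of sparse kernel such data-parallel extensions target: a CSR sparse matrix-vector product whose rows are distributed across threads with OpenMP. The storage format, function name and OpenMP directives are illustrative assumptions, not the specific extensions proposed in the paper.

```cpp
// CSR sparse matrix-vector product with a block-row distribution over
// threads; an illustrative assumption, not the paper's extensions.
#include <vector>
#include <omp.h>

void spmv_csr(const std::vector<int>& row_ptr,   // size n+1
              const std::vector<int>& col_idx,   // column of each nonzero
              const std::vector<double>& val,    // value of each nonzero
              const std::vector<double>& x,
              std::vector<double>& y)
{
    const int n = static_cast<int>(row_ptr.size()) - 1;
    #pragma omp parallel for schedule(static)    // distribute rows over threads
    for (int i = 0; i < n; ++i) {
        double acc = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            acc += val[k] * x[col_idx[k]];       // irregular access through col_idx
        y[i] = acc;
    }
}
```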

3.
When parallelizing irregular applications on ccNUMA machines, several issues should be taken into account in order to achieve high code performance. These factors include locality exploitation and parallelism, as well as careful use of memory resources (memory overhead). A significant number of numerical simulation codes are clear examples of irregular applications. Frequently these kinds of codes include reduction operations in their core, so that an important fraction of the computational time is spent on such operations. Cloth simulation in particular belongs to this class of applications and is a topic of increasing interest in diverse areas, such as the multimedia industry. Moreover, when real-time simulation is the aim, its parallelization becomes an important option. This paper discusses and compares different irregular reduction parallelization techniques on ccNUMA shared-memory machines. Broadly speaking, they can be classified into two groups: privatization-based and data-partitioning-based methods. We describe a framework, based on data affinity, that permits the development of various algorithms within the group of data-partitioning-based techniques. All these techniques and approaches are analyzed and adapted to the computational structure of a real, physically based cloth simulator.
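As background for the two families of techniques compared, below is a minimal sketch of a privatization-based irregular reduction using OpenMP: each thread accumulates into a private copy of the reduction array, and the copies are merged afterwards. Array names and the force/edge interpretation are assumptions for illustration, not code from the cloth simulator.

```cpp
// Privatization-based irregular reduction: race-free accumulation into
// per-thread private copies, followed by a serialized merge.
#include <vector>
#include <omp.h>

void accumulate_forces(std::vector<double>& force,
                       const std::vector<int>& edge,       // target node per interaction
                       const std::vector<double>& contrib) // contribution per interaction
{
    const int n_nodes = static_cast<int>(force.size());
    #pragma omp parallel
    {
        std::vector<double> local(n_nodes, 0.0);   // private copy per thread
        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(edge.size()); ++i)
            local[edge[i]] += contrib[i];          // no races: private array
        #pragma omp critical                       // merge private copies
        for (int k = 0; k < n_nodes; ++k)
            force[k] += local[k];
    }
}
```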

4.
A popular approach to providing nonexperts in parallel computing with an easy-to-use programming model is to design a software library consisting of a set of preparallelized routines and to hide the intricacies of parallelization behind the library's API. However, for regular domain problems (such as simple matrix manipulations or low-level image processing applications, in which all elements in a regular subset of a dense data field are accessed in turn), the speedup obtained with many such library-based parallelization tools is often suboptimal. This is because interoperation optimization (i.e., time optimization of communication steps across library calls) is generally not incorporated in the library implementations. We present a simple, efficient, finite-state-machine-based approach to communication minimization for library-based data-parallel regular domain problems. In this approach, referred to as lazy parallelization, a sequential program is parallelized automatically at runtime by inserting communication primitives and memory management operations whenever necessary. Apart from being simple and cheap, lazy parallelization is guaranteed to generate legal, correct, and efficient parallel programs at all times. The effectiveness of the approach is demonstrated by analyzing the performance characteristics of two typical regular domain problems from the field of low-level image processing. Experimental results show significant performance improvements over nonoptimized parallel applications. Moreover, the obtained communication behavior is found to be optimal with respect to the abstraction level of message-passing programs.
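A heavily simplified sketch of the state-tracking idea behind lazy parallelization: the runtime remembers the current distribution of each data structure and inserts communication only when a library call requires a different one. The states, the redistribute() call and run_op() are hypothetical stand-ins, not the library's actual API.

```cpp
// Lazy parallelization as a tiny state machine: redistribute only on a
// state mismatch. All names here are hypothetical stand-ins.
#include <functional>

enum class Dist { None, Replicated, RowPartitioned };

struct DistImage {
    Dist state = Dist::None;
    // ... pixel data omitted in this sketch ...
};

// Hypothetical communication step; a real library would wrap its
// message-passing primitives (scatter/gather/broadcast) here.
void redistribute(DistImage& img, Dist wanted) { img.state = wanted; }

// Every parallel operation declares the distribution it requires; the
// runtime transitions the data only when the states actually differ.
void run_op(DistImage& img, Dist required, const std::function<void()>& op)
{
    if (img.state != required)      // lazy: skip redundant communication
        redistribute(img, required);
    op();                           // local computation on each node
}
```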

5.
The simulation of fabrics, clothes, and flexible materials is an essential topic in computer animation of realistic virtual humans and dynamic sceneries. Emerging technologies, such as interactive digital TV and multimedia products, make it necessary to develop powerful tools for real-time simulation, and parallelism is one such tool. When analyzing fabric simulations computationally, we found that these codes belong to the complex class of irregular applications. Frequently this kind of code includes reduction operations in its core, so that an important fraction of the computational time is spent on such operations. In fabric simulators these operations appear when evaluating forces, giving rise to the equation system to be solved; for this reason, this paper discusses only this phase of the simulation. The paper analyzes and evaluates different irregular reduction parallelization techniques on ccNUMA shared-memory machines, applied to a real, physically based fabric simulator we have developed. Several issues are taken into account in order to achieve high code performance, such as exploitation of data access locality and parallelism, as well as careful use of memory resources (memory overhead). We use the concept of data affinity to develop various efficient algorithms for reduction parallelization that exploit data locality.
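Complementing the privatization sketch above, here is a minimal sketch of the data-partitioning (owner-computes) alternative that the data-affinity framework builds on: each thread owns a block of the reduction array and applies only the contributions whose target falls in its block. Names are illustrative; real implementations reorder or pre-sort iterations instead of rescanning them on every thread.

```cpp
// Owner-computes irregular reduction: each thread updates only its own
// block of the output array, so no synchronization is needed.
#include <cstddef>
#include <vector>
#include <omp.h>

void accumulate_forces_partitioned(std::vector<double>& force,
                                   const std::vector<int>& node,     // target per contribution
                                   const std::vector<double>& contrib)
{
    const int n_nodes = static_cast<int>(force.size());
    #pragma omp parallel
    {
        const int nt  = omp_get_num_threads();
        const int tid = omp_get_thread_num();
        const int lo  = tid * n_nodes / nt;        // owned block [lo, hi)
        const int hi  = (tid + 1) * n_nodes / nt;
        for (std::size_t i = 0; i < node.size(); ++i)
            if (node[i] >= lo && node[i] < hi)     // owner computes: no races
                force[node[i]] += contrib[i];
    }
}
```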

6.
As multicore processors become the new trend in processor development, applications must be executed in parallel to keep improving program performance. Traditional automatic parallelization techniques can parallelize regular loops in scientific computing applications well, but they cannot yet effectively parallelize irregular programs containing many function calls and pointer references. To address this, this paper proposes a method for identifying possibly parallel regions based on region average execution time and data dependence information, in order to parallelize some irregular programs efficiently. The main contributions are as follows: (1) automatic identification of several kinds of parallelism in a program, including not only the fine-grained parallelism between loop iterations handled by traditional parallelism analysis, but also the coarse-grained parallelism between loop bodies and function call sites that traditional analysis cannot yet handle effectively; among the many sources of parallelism in a program, a profit analysis based on region average execution time selects suitable regions for parallelization; (2) automatic identification of the number and types of data dependences between possibly parallel regions, as well as the program variables that cause them. Based on this analysis, the authors parallelized four benchmarks from SPEC2006 using the behavior-oriented parallelism (BOP) speculative parallelization system. The parallelized programs achieved average speedups of 300% and 260% on Intel and AMD multicore processors, respectively.
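A hedged illustration of the profitability filter described above: a candidate region is kept only if its average execution time amortizes the parallelization overhead and it carries few cross-region dependences. The struct fields, thresholds and function name are assumptions for illustration, not the paper's implementation.

```cpp
// Select regions worth speculating on: large enough average execution time,
// few enough cross-region dependences. Fields and thresholds are assumed.
#include <string>
#include <vector>

struct Region {
    std::string name;
    double avg_exec_ms;   // profiled average execution time of the region
    int    flow_deps;     // cross-region flow dependences found by analysis
};

std::vector<Region> select_parallel_regions(const std::vector<Region>& regions,
                                            double overhead_ms,   // per-spawn cost
                                            int    max_deps)      // speculation limit
{
    std::vector<Region> chosen;
    for (const Region& r : regions)
        if (r.avg_exec_ms > overhead_ms && r.flow_deps <= max_deps)
            chosen.push_back(r);
    return chosen;
}
```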

7.
The efficient parallelization of fast multipole-based algorithms for the N-body problem is one of the most challenging topics in high performance scientific computing. The emergence of non-local, irregular communication patterns generated by these algorithms can easily create an insurmountable bottleneck on supercomputers with hundreds of thousands of cores. To overcome this obstacle we have developed an innovative parallelization strategy for Barnes–Hut tree codes on present and upcoming HPC multicore architectures. This scheme, based on a combined MPI–Pthreads approach, permits an efficient overlap of computation and data exchange. We highlight the capabilities of this method on the full IBM Blue Gene/P system JUGENE at Jülich Supercomputing Centre and demonstrate scaling across 294,912 cores with up to 2,048,000,000 particles. Applying our implementation pepc to laser–plasma interaction and vortex particle methods close to the continuum limit, we demonstrate its potential for ground-breaking advances in large-scale particle simulations.
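A minimal sketch of the computation/communication overlap at the heart of such schemes, using nonblocking MPI: local work proceeds while the particle exchange is in flight. The real pepc code uses a combined MPI–Pthreads design on Blue Gene/P; the buffer layout, single-neighbour handling and helper functions here are illustrative assumptions.

```cpp
// Overlap of local tree-walk work with a nonblocking particle exchange.
#include <mpi.h>
#include <vector>

// Placeholders for the actual force evaluation, defined only so the
// sketch is self-contained.
static void compute_local_interactions() { /* tree walk over local particles */ }
static void compute_remote_interactions(const std::vector<double>&) { /* remote part */ }

void exchange_and_compute(std::vector<double>& send_buf,
                          std::vector<double>& recv_buf,
                          int neighbour, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Irecv(recv_buf.data(), static_cast<int>(recv_buf.size()), MPI_DOUBLE,
              neighbour, 0, comm, &reqs[0]);
    MPI_Isend(send_buf.data(), static_cast<int>(send_buf.size()), MPI_DOUBLE,
              neighbour, 0, comm, &reqs[1]);

    compute_local_interactions();             // overlap: needs no remote data

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_remote_interactions(recv_buf);    // work on the received particles
}
```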

8.
In many scientific applications, arrays containing data are indirectly indexed through indirection arrays. Such scientific applications are called irregular programs and are a distinct class of applications that require special techniques for parallelization. This paper presents a library called CHAOS, which helps users implement irregular programs on distributed-memory message-passing machines, such as the Paragon, Delta, CM-5 and SP-1. The CHAOS library provides efficient runtime primitives for distributing data and computation over processors; it supports efficient index translation mechanisms and provides users with high-level mechanisms for optimizing communication. CHAOS subsumes the previous PARTI library and supports a larger class of applications. In particular, it provides efficient support for parallelization of adaptive irregular programs where indirection arrays are modified during the course of computation. To demonstrate the efficacy of CHAOS, two challenging real-life adaptive applications were parallelized using CHAOS primitives: a molecular dynamics code, CHARMM, and a particle-in-cell code, DSMC. Besides providing runtime support to users, CHAOS can also be used by compilers to automatically parallelize irregular applications. This paper demonstrates how CHAOS can be effectively used in such a framework. By embedding CHAOS primitives in the Syracuse Fortran 90D/HPF compiler, kernels taken from the CHARMM and DSMC codes have been automatically parallelized.
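A generic, single-process illustration of the inspector/executor pattern that runtime systems of this kind implement: an inspector scans the indirection array once to build a gather schedule and an index translation table, and an executor reuses that schedule as long as the indirection array is unchanged. The names and data structures below are not the CHAOS API.

```cpp
// Inspector/executor pattern for indirectly indexed loops (generic sketch).
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Schedule {
    std::vector<int> needed;                 // global indices to gather
    std::unordered_map<int, int> local_of;   // global index -> local slot
};

// Inspector: run once, and rerun only when the indirection array changes.
Schedule inspect(const std::vector<int>& indirection)
{
    Schedule s;
    for (int g : indirection)
        if (s.local_of.find(g) == s.local_of.end()) {
            s.local_of[g] = static_cast<int>(s.needed.size());
            s.needed.push_back(g);
        }
    return s;
}

// Executor: gather the needed elements, then run the irregular loop on the
// gathered buffer using translated (local) indices.
void execute(const Schedule& s, const std::vector<int>& indirection,
             const std::vector<double>& x, std::vector<double>& y)
{
    std::vector<double> gathered(s.needed.size());
    for (std::size_t k = 0; k < s.needed.size(); ++k)       // gather phase
        gathered[k] = x[s.needed[k]];
    for (std::size_t i = 0; i < indirection.size(); ++i)    // compute phase
        y[i] += gathered[s.local_of.at(indirection[i])];
}
```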

9.
We describe the application of pD, a small para-functional language that we developed as a high-level programming interface for the parallel computer algebra package PACLIB. pD provides several facilities to express parallel algorithms in a flexible way on different levels of abstraction. The compiler translates a pD module into statically typed parallel C code with explicit task creation and synchronization constructs. This target code can be linked with the PACLIB kernel, the multi-processor runtime system of the computer algebra library SACLIB. The parallelization of several computer algebra algorithms on a shared memory multi-processor demonstrates the elegance and efficiency of this approach.

10.
G. Bohlender, T. Kersten, R. Trier. Computing, 1994, 53(3-4): 259-276
The majority of numerical algorithms employ floating-point vector and matrix operations. On a parallel computer these algorithms should be executed fast and reliably in order to avoid a time-consuming error analysis. The XSC-languages (high-level language extensions for eXtended Scientific Computation) are well suited for this purpose, since they support the design of numerical algorithms that deliver correct and automatically verified results. This goal is attained by an arithmetic with maximum accuracy (especially for vector and matrix operations), highly accurate standard functions, and exact evaluation of dot product expressions. Within the ESPRIT Parallel Computing Action, one XSC-language, PASCAL-XSC, was implemented on a Supercluster Transputer System under the operating system HELIOS. Parallel algorithms for computationally intensive and maximally accurate matrix operations were implemented and tested on various transputer architectures. We sketch some features of these architectures and present some benchmarks for the algorithms used. These algorithms form a parallel C runtime library of PASCAL-XSC (or any other XSC-language that uses a C runtime library) and are called automatically. This can be considered a basis for implicit parallelization in an XSC-language.
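The exact dot product mentioned above accumulates in a long fixed-point accumulator and returns a correctly rounded result. As a much simpler stand-in that only illustrates the rounding-error problem such arithmetic addresses, the sketch below uses compensated (Kahan) summation; it is not the PASCAL-XSC mechanism.

```cpp
// Compensated (Kahan) dot product: reduces, but does not eliminate,
// accumulation error; shown only to illustrate the accuracy concern.
#include <cstddef>
#include <vector>

double dot_compensated(const std::vector<double>& a, const std::vector<double>& b)
{
    double sum = 0.0, c = 0.0;                 // c carries the lost low-order bits
    for (std::size_t i = 0; i < a.size(); ++i) {
        double y = a[i] * b[i] - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum;
}
```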

11.
Multicore architectures are evolving with the promise of extreme performance for classes of applications that require high performance and high memory bandwidth. Irregular reduction is one of the important computation patterns in many complex scientific applications, and it typically requires both. In this article, we propose region-based parallelization techniques for irregular reductions on multicore architectures with explicitly managed memory hierarchies. Managing the memory hierarchy in software requires a lot of programming effort and tends to be error-prone, and the difficulties are even worse for applications with irregular data access patterns. To relieve programmers of the burden of memory management, we develop abstractions, particularly targeted at irregular reduction, for structuring parallel tasks, mapping the parallel tasks to processing units and scheduling data transfers between the memory hierarchies. Our framework employs iteration reordering based on regions of data along with dynamic scheduling of parallel tasks. We experimentally evaluate the effectiveness of our techniques for irregular reduction kernels on the Cell processor embedded in a Sony PlayStation 3. Experimental results show speedups of 8 to 14 on the six available SPEs.
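A hedged sketch of the region-based reordering idea: iterations of an irregular reduction are bucketed by the block ("region") of the output array they update, so each bucket can run as an independent task with a predictable working set. The real framework additionally schedules the data transfers between memory levels on the SPEs; that part is omitted, and all names are illustrative.

```cpp
// Bucket iterations by the region of the output array they touch; each
// bucket updates a disjoint block and can be scheduled as one task.
#include <vector>

std::vector<std::vector<int>>
bucket_iterations_by_region(const std::vector<int>& target,  // output index per iteration
                            int n_elems, int region_size)
{
    const int n_regions = (n_elems + region_size - 1) / region_size;
    std::vector<std::vector<int>> bucket(n_regions);
    for (int i = 0; i < static_cast<int>(target.size()); ++i)
        bucket[target[i] / region_size].push_back(i);  // iteration i belongs to this region
    return bucket;
}
```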

12.
Agent-based models, an emerging paradigm for the simulation of complex systems, appear very well suited to parallel processing. However, during the parallelization of a simulator of financial markets, we found that some features of these codes highlight non-trivial issues in present hardware/software platforms for parallel processing. Here we present the results of a series of tests, on different platforms, of simplified codes that reproduce such problems and can be used as a starting point in the search for a possible solution.

13.
We tackle the parallelization of Non-Negative Matrix Factorization (NNMF), using the Alternating Least Squares and the Lee and Seung algorithms, motivated by its use in audio source separation. For the first algorithm, a very suitable technique is the use of active-set algorithms for solving several non-negativity-constrained least squares problems. We have addressed NNMF for dense matrices on multicore architectures by organizing these optimization problems over independent columns. Although in the sequential case the method is not as efficient as the block-pivoting variant used by other authors, it is very effective in the parallel case, producing satisfactory results for the type of applications where it is to be used. For the Lee and Seung method, we propose a reorganization of the algorithm steps that increases the convergence speed, together with a parallelization of the solution. The article also includes a theoretical and experimental study of the performance obtained with matrices similar to those that arise in the applications that motivated this work.
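For reference, a common statement of the Lee and Seung multiplicative update rules for V ≈ WH is given below; the small ε in the denominators is a usual numerical safeguard, and the paper's exact formulation and reorganization of the steps may differ.

```latex
H \leftarrow H \circ \frac{W^{\top} V}{W^{\top} W H + \varepsilon},
\qquad
W \leftarrow W \circ \frac{V H^{\top}}{W H H^{\top} + \varepsilon}
```

Here \circ and the fractions denote element-wise multiplication and division; both updates preserve non-negativity when W and H are initialized with non-negative values.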

14.
This paper presents a new parallelization model, called coarse-grained thread pipelining, for exploiting speculative coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the fine-grained thread pipelining model proposed for the superthreaded architecture, allows concurrent execution of loop iterations in a pipelined fashion with runtime data-dependence checking and control speculation. The speculative execution combined with the runtime dependence checking allows the parallelization of a variety of program constructs that cannot be parallelized with existing runtime parallelization algorithms. The pipelined execution of loop iterations in this new technique results in lower parallelization overhead than in other existing techniques. We evaluated the performance of this new model using some real applications and a synthetic benchmark. These experiments show that programs with a sufficiently large grain size compared to the parallelization overhead obtain significant speedup using this model. The results from the synthetic benchmark provide a means for estimating the performance that can be obtained from application programs parallelized with this model. The library routines developed for this thread pipelining model are also useful for evaluating the correctness of the code generated by the superthreaded compiler and for debugging and verifying the simulator for the superthreaded processor.

15.
Fragmentation of numerical algorithms for parallel subroutines library
Fragmentation is a well-known method for the parallelization of numerical algorithms and programs. Algorithm fragmentation allows the creation of fragmented parallel programs that can be executed on parallel computers of different types (multiprocessors and/or multicomputers) and can be dynamically tuned to all the available resources. Fragmentation of frequently used numerical algorithms, their representation for inclusion in the library of parallel numerical subroutines, and the properties of the runtime system are considered.

16.
Today's increased computing speeds allow conventional sequential machines to effectively emulate associative computing techniques. We present a parallel programming paradigm called ASC (ASsociative Computing), designed for a wide range of computing engines. Our paradigm has an efficient associative-based, dynamic memory-allocation mechanism that does not use pointers. It incorporates data parallelism at the base level, so that programmers do not have to specify low-level sequential tasks such as sorting, looping and parallelization. Our paradigm supports all of the standard data-parallel and massively parallel computing algorithms. It combines numerical computation (such as convolution, matrix multiplication, and graphics) with nonnumerical computing (such as compilation, graph algorithms, rule-based systems, and language interpreters). This article focuses on the nonnumerical aspects of ASC.

17.
马春燕, 吕炳旭, 叶许姣, 张雨. 《软件学报》, 2023, 34(7): 3022-3042
With the widespread adoption of multicore processors, automatic parallelization of serial code in embedded legacy systems has become an active research topic. In particular, automatic parallelization of complex nested loops with non-perfectly nested structures and non-affine dependences remains a technical challenge. This paper proposes an automatic parallelization framework for complex nested loops based on an LLVM Pass (CNLPF). First, a representation model for complex nested loops, the loop structure tree, is proposed, and the regular regions of nested loops are automatically converted into this representation. Then, data dependence analysis is performed on the loop structure tree to build intra-loop and inter-loop dependence relations. Finally, parallel loop programs are generated based on the OpenMP shared-memory programming model. For six program cases from SPEC2006 containing nearly 500 complex nested loops, the proportion of complex nested loops was measured and the parallel speedup was tested. The results show that the proposed framework can handle complex nested loops that LLVM Polly cannot optimize, enhancing LLVM's parallelizing compilation capability, and that combining the method with Polly improves the speedup by 9%-43% compared with using Polly alone.
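As an illustration of the target form of such a framework, the sketch below shows a non-perfectly nested loop (a statement sits between the two loop levels) whose outer iterations carry no dependence, rewritten with an OpenMP work-sharing directive. The loop body is made up for the example and is not taken from the paper or from CNLPF-generated code.

```cpp
// A non-perfectly nested loop with independent outer iterations,
// parallelized over the outer loop with OpenMP.
#include <cstddef>
#include <vector>
#include <omp.h>

void normalize_rows(std::vector<std::vector<double>>& a)
{
    #pragma omp parallel for schedule(dynamic)       // rows are independent
    for (int i = 0; i < static_cast<int>(a.size()); ++i) {
        double row_sum = 0.0;                        // statement between loop levels
        for (std::size_t j = 0; j < a[i].size(); ++j)//   -> non-perfect nesting
            row_sum += a[i][j];
        for (std::size_t j = 0; j < a[i].size(); ++j)
            a[i][j] /= (row_sum != 0.0 ? row_sum : 1.0);
    }
}
```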

18.
Parallel implementations of swarm intelligence algorithms such as ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When targeting a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of algorithmic skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that are later converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain-specific language Musket can cope with the development of a parallel implementation of ACO and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high-performance code with execution times similar to those of low-level implementations.
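A generic C++ rendering of the map and fold patterns mentioned above, given only to illustrate the skeleton idea; Musket has its own DSL syntax and generates optimized parallel code, so none of this reflects its actual API.

```cpp
// Generic map and fold skeletons: the programmer supplies only the
// element-wise and combining functions, not the iteration or parallelization.
#include <vector>

template <class T, class F>
std::vector<T> map(const std::vector<T>& xs, F f)      // apply f to every element
{
    std::vector<T> ys;
    ys.reserve(xs.size());
    for (const T& x : xs) ys.push_back(f(x));
    return ys;
}

template <class T, class F>
T fold(const std::vector<T>& xs, T init, F combine)    // reduce with 'combine'
{
    T acc = init;
    for (const T& x : xs) acc = combine(acc, x);
    return acc;
}

// Usage (illustrative): total pheromone after an evaporation step.
// double total = fold(map(tau, [](double t) { return 0.9 * t; }),
//                     0.0, [](double a, double b) { return a + b; });
```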

19.
A library of algorithms developed as algorithmic cyberFilms is presented. Algorithmic cyberFilms are a new type of software components for presentation, specification/programming and automatic code generation of computational algorithms. The algorithmic cyberFilm format is implemented as a set of multimedia frames (and scenes), and each component is represented by frames of algorithmic skeletons representing dynamical features of an algorithm, by frames of integrated view providing static features of the algorithm in a compact format, and by corresponding template codes supporting the program generation. We developed a library which is a collection of basic and advanced algorithms taught at universities, including computation on grids, trees and graphs. In this paper, we present basic constructs of visual languages which are used for representing cyberFilms as well as for demonstrating the library components. We also provide a general overview of the library and its features. In addition, we discuss results of experiments which were conducted to verify the usability of the library components and their usefulness in education.

20.
Skeletal parallel programming is a promising approach to easy parallel programming, in which users build parallel programs simply by combining parts of a given set of ready-made parallel computation patterns called skeletons. This ease comes at the cost of an efficiency problem caused by the compositional style of programming. One solution to this problem is fusion transformation, which optimizes naively composed skeleton programs by eliminating redundant intermediate data structures. Several parallel skeleton libraries have automatic fusion mechanisms. However, no automatic fusion mechanisms have been proposed for variable-length list (VLL) skeletons, even though such skeletons are useful for practical problems. The main difficulty is that previous fusion mechanisms are not applicable to VLL skeletons, so the fusion cannot be completed. In this paper, we propose a novel fusion mechanism for VLL skeletons that achieves both an easy programming interface and complete fusion. The proposed mechanism has been implemented in our skeleton library, SkeTo, using the expression templates technique; experimental results show that it is very effective.
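Written out by hand in plain C++, the transformation that fusion automates looks as follows: the naive composition of two maps materializes an intermediate vector, while the fused form applies the composed function in a single pass. SkeTo performs this rewriting automatically via expression templates; this is only the before/after picture, not its implementation.

```cpp
// Before/after picture of map-map fusion on an ordinary vector.
#include <vector>

std::vector<double> naive(const std::vector<double>& xs)
{
    std::vector<double> tmp;                      // intermediate data structure
    for (double x : xs) tmp.push_back(x * 2.0);   //   map (*2)
    std::vector<double> out;
    for (double t : tmp) out.push_back(t + 1.0);  //   map (+1)
    return out;
}

std::vector<double> fused(const std::vector<double>& xs)
{
    std::vector<double> out;                      // single pass, no intermediate
    for (double x : xs) out.push_back(x * 2.0 + 1.0);
    return out;
}
```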
