Similar Literature
20 similar documents found.
1.
The aim of this paper is to evaluate OpenMP, TBB and Cilk Plus as basic language-based tools for simple and efficient parallelization of recursively defined computational problems and other problems that need both task and data parallelization techniques. We show how to use these models of parallel programming to transform the source code of Adaptive Simpson's Integration into programs that can utilize multiple cores of modern processors. Using the example of the Bellman–Ford algorithm for solving single-source shortest path problems, we show how to improve the performance of data-parallel algorithms by tuning data structures for better utilization of the vector extensions of modern processors. Manual vectorization techniques based on Cilk array notation and intrinsics are presented. We also show how to simplify such optimization using Intel SIMD Data Layout Template containers.
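A minimal sketch of the kind of task-parallel transformation this abstract describes, using OpenMP tasks on a recursive adaptive Simpson routine; the integrand, tolerance and task cut-off depth are illustrative assumptions, not the authors' code.

    // Hypothetical sketch (not the authors' code): recursive adaptive Simpson
    // quadrature parallelized with OpenMP tasks.
    #include <cmath>
    #include <cstdio>
    #include <functional>

    using Fn = std::function<double(double)>;

    static double simpson(const Fn& f, double a, double b) {
        double m = 0.5 * (a + b);
        return (b - a) / 6.0 * (f(a) + 4.0 * f(m) + f(b));
    }

    // Refine recursively; spawn a task for the left half only while the
    // recursion is shallow enough for the task overhead to pay off.
    static double adaptive(const Fn& f, double a, double b,
                           double whole, double eps, int depth) {
        double m = 0.5 * (a + b);
        double left = simpson(f, a, m), right = simpson(f, m, b);
        if (std::fabs(left + right - whole) <= 15.0 * eps || depth > 20)
            return left + right + (left + right - whole) / 15.0;

        double lsum = 0.0;
        #pragma omp task shared(lsum) if(depth < 6)
        lsum = adaptive(f, a, m, left, eps / 2.0, depth + 1);
        double rsum = adaptive(f, m, b, right, eps / 2.0, depth + 1);
        #pragma omp taskwait
        return lsum + rsum;
    }

    int main() {
        const Fn f = [](double x) { return std::sin(x) / x; };
        double result = 0.0;
        #pragma omp parallel
        #pragma omp single   // one thread starts the recursion, tasks spread the work
        result = adaptive(f, 1e-9, 100.0, simpson(f, 1e-9, 100.0), 1e-9, 0);
        std::printf("integral ~= %.10f\n", result);
        return 0;
    }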

2.
The paper describes a parallel implementation of a grand challenge problem: global atmospheric modeling. The novel contributions of our work include (1) a detailed investigation of opportunities for parallelism in atmospheric global modeling based on spectral solution methods, (2) the experimental evaluation of overheads arising from load imbalances and data movement for alternative parallelization methods, and (3) the development of a parallel code that can be monitored and steered interactively based on output data visualizations and animations of program functionality or performance. Code parallelization takes advantage of the relative independence of computations at different levels in the earth's atmosphere, resulting in parallelism of up to 40 processors, each independently performing computations for different atmospheric levels and requiring few communications between different levels across model time steps. Next, additional parallelism is attained within each level by taking advantage of the natural parallelism offered by the spectral computations being performed (e.g. taking advantage of independently computable terms in equations). Performance measurements are performed on a 64-node KSR2 supercomputer. However, the parallel code has been ported to several shared memory parallel machines, including SGI multiprocessors, and has also been ported to distributed memory platforms like the IBM SP-2.

3.
The spectral transform method is a widely used numerical technique for solving partial differential equations on the sphere in global climate modeling. This paper describes the parallelization and performance of the spectral method for solving the non-linear shallow water equations on the surface of a sphere using a 128-node Intel iPSC/860 hypercube. Solving the shallow water equations represents a computational kernel of more complex climate models. This work is part of a research program to develop climate models that are capable of much longer simulations at a significantly finer resolution than current models. Such models are important in understanding the effects of the increasing atmospheric concentrations of greenhouse gases, and the computational requirements are so large that massively parallel multiprocessors will be necessary to run climate model simulations in a reasonable amount of time. The spectral method involves the transformation of data between the physical, Fourier and spectral domains. Each of these domains is two-dimensional. The spectral method performs Fourier transforms in the longitude direction followed by summation in the latitude direction to evaluate the discrete spectral transform. A simple way of parallelizing the spectral code is to decompose the physical problem domain in just the latitude direction. This allows an optimized sequential FFT algorithm to be used in the longitude direction. However, this approach limits the number of processors that can be brought to bear on the problem. Decomposing the problem over both directions allows the parallelism inherent in the problem to be exploited more effectively: the grain size is reduced, so that more processors can be used. Results are presented that show that decomposing over both directions does result in a more rapid solution of the problem. The results show that, for a given problem and number of processors, the optimum decomposition has approximately equal numbers of processors in each direction. Load imbalance also has an impact on the performance of the method. The importance of minimizing communication latency and overlapping communication with calculation is stressed. General methods for doing this, which may be applied to many other problems, are discussed.

4.
We propose a model for describing and predicting the parallel performance of a broad class of parallel numerical software on distributed memory architectures. The purpose of this model is to allow reliable predictions to be made for the performance of the software on large numbers of processors of a given parallel system, by only benchmarking the code on small numbers of processors. Having described the methods used, and emphasized the simplicity of their implementation, the approach is tested on a range of engineering software applications that are built upon the use of multigrid algorithms. Despite their simplicity, the models are demonstrated to provide both accurate and robust predictions across a range of different parallel architectures, partitioning strategies and multigrid codes. In particular, the effectiveness of the predictive methodology is shown for a practical engineering software implementation of an elastohydrodynamic lubrication solver.
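As a rough illustration of the benchmarking idea (not the paper's model, which is tailored to multigrid codes), the sketch below fits a two-term cost model T(p) ≈ a + b/p to timings taken on two small processor counts and extrapolates it to larger machines; the timing numbers are invented.

    // Hypothetical sketch: fit T(p) ~ a + b/p from two small benchmark runs,
    // then predict run times on large processor counts.
    #include <cstdio>

    struct Model { double serial, parallel; };          // T(p) = serial + parallel / p

    Model fit(int p1, double t1, int p2, double t2) {
        double b = (t1 - t2) / (1.0 / p1 - 1.0 / p2);   // perfectly parallel part
        double a = t1 - b / p1;                         // fixed (serial/overhead) part
        return {a, b};
    }

    double predict(const Model& m, int p) { return m.serial + m.parallel / p; }

    int main() {
        // Benchmark the code on small runs only, e.g. 4 and 8 processors.
        Model m = fit(4, 130.0, 8, 70.0);
        const int targets[] = {16, 64, 256};
        for (int p : targets)
            std::printf("predicted T(%3d) = %6.1f s\n", p, predict(m, p));
        return 0;
    }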

5.
This study measures the effects of changes in message latency and bandwidth for production-level codes on a current generation tightly coupled MPP, the Intel Paragon. Messages are sent multiple times to study the application sensitivity to variations in bandwidth and latency. This method preserves the effects of contention on the interconnection network. Two applications are studied: PCTH, a shock physics code developed at Sandia National Laboratories; and PSTSWM, a spectral shallow water code developed at Oak Ridge National Laboratory and Argonne National Laboratory. These codes are significant in that PCTH is a ‘full physics’ application code in production use, while PSTSWM serves as a parallel algorithm test bed and benchmark for production codes used in atmospheric modeling. They are also significant in that the message-passing behavior differs significantly between the two codes, each representing an important class of scientific message-passing applications. © 1998 John Wiley & Sons, Ltd.
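A hedged sketch of the "send each message multiple times" idea: wrapping the point-to-point calls so the payload crosses the network several times preserves contention while emulating a proportionally slower network. The wrapper names and the REPLICATION knob are assumptions, not the study's instrumentation.

    // Hypothetical wrappers: REPLICATION = 1 is the normal run; larger values
    // emulate reduced effective bandwidth. Both communicating ranks must use
    // the wrappers so the repeated sends and receives stay matched.
    #include <mpi.h>

    static const int REPLICATION = 4;   // assumed tuning knob

    int slowed_send(const void* buf, int count, MPI_Datatype type,
                    int dest, int tag, MPI_Comm comm) {
        int rc = MPI_SUCCESS;
        for (int i = 0; i < REPLICATION; ++i)   // payload crosses the wire several times
            rc = MPI_Send(buf, count, type, dest, tag, comm);
        return rc;
    }

    int slowed_recv(void* buf, int count, MPI_Datatype type,
                    int src, int tag, MPI_Comm comm, MPI_Status* status) {
        int rc = MPI_SUCCESS;
        for (int i = 0; i < REPLICATION; ++i)   // only the last copy is kept
            rc = MPI_Recv(buf, count, type, src, tag, comm, status);
        return rc;
    }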

6.
Parallelizing the Data Cube
This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting. The bottom-up partitioning strategy balances the number of single-attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm-specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared-disk type parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array. We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight-processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.
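The balancing goal can be illustrated with a simple greedy heuristic that assigns coarse subcube tasks, weighted by estimated group-by sizes, to the least-loaded processor; this is only a sketch of the objective, not the paper's tree-partitioning algorithm, and the cost values are invented.

    // Hypothetical illustration: largest task to least-loaded processor.
    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    std::vector<int> assign(const std::vector<double>& cost, int p) {
        std::vector<int> owner(cost.size()), order(cost.size());
        for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return cost[a] > cost[b]; });   // largest first
        // min-heap of (current load, processor id)
        std::priority_queue<std::pair<double, int>,
                            std::vector<std::pair<double, int>>,
                            std::greater<>> load;
        for (int q = 0; q < p; ++q) load.push({0.0, q});
        for (int t : order) {
            auto [l, q] = load.top();               // least-loaded processor so far
            load.pop();
            owner[t] = q;
            load.push({l + cost[t], q});
        }
        return owner;
    }

    int main() {
        std::vector<double> groupBySizes = {9.0, 7.5, 6.0, 4.0, 3.5, 2.0, 1.5, 1.0};
        std::vector<int> owner = assign(groupBySizes, 3);
        for (int i = 0; i < (int)owner.size(); ++i)
            std::printf("subcube %d -> processor %d\n", i, owner[i]);
        return 0;
    }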

7.
This paper reports on a parallel implementation of a general 3D multi-block CFD code. The parallelization is achieved by using three strategies. First, it is implemented on dual-processor PC clusters running Windows NT. A multi-thread programming model is adopted for the multi-block code, where one thread corresponds to a block. Shared memory is used for the exchange of inner boundaries between neighboring blocks (threads) on the same node, while WinSockets are employed for those on different nodes. Second, the parallelization is extended to the UNIX operating system. MPI is applied for all the message passing between different processors, including those on the same node. Third, Pthreads (POSIX threads), a standardized application interface for threads, are adopted to take advantage of the shared-memory feature of the SMP nodes, while MPI is only applied for the message passing between processors on different nodes. In all the strategies, a static load-balancing method is employed for equitable distribution of computational work to specified nodes. The parameters of the present code are studied in detail to facilitate the explanation of the speedup results. Two examples are provided to show the speedup and load balancing of the parallel calculation. A detailed comparison is made to evaluate the efficiency of the different strategies.
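A hedged sketch of the third strategy (threads within a node, message passing between nodes): each rank advances its blocks with one thread per block, copying same-node boundaries through shared memory and exchanging off-node boundaries with MPI. The Block layout, tags and halo handling are invented for illustration, and the MPI library would need to be initialized with MPI_THREAD_MULTIPLE for this pattern.

    // Hypothetical halo exchange for a multi-block code, one thread per block.
    #include <mpi.h>
    #include <vector>

    struct Block {
        std::vector<double> halo_out, halo_in;  // boundary strips of this block
        int neighbor_rank;    // -1 if the neighbouring block lives on this rank
        int neighbor_block;   // local index of the neighbour block
    };

    void exchange(std::vector<Block>& blocks) {
        #pragma omp parallel for  // one thread per block, as in the multi-thread model
        for (int b = 0; b < (int)blocks.size(); ++b) {
            Block& blk = blocks[b];
            if (blk.neighbor_rank < 0) {
                // same node: a plain shared-memory copy replaces sockets/MPI
                blk.halo_in = blocks[blk.neighbor_block].halo_out;
            } else {
                // different node: message passing; tags pair the two block indices
                MPI_Sendrecv(blk.halo_out.data(), (int)blk.halo_out.size(), MPI_DOUBLE,
                             blk.neighbor_rank, /*sendtag=*/blk.neighbor_block,
                             blk.halo_in.data(), (int)blk.halo_in.size(), MPI_DOUBLE,
                             blk.neighbor_rank, /*recvtag=*/b,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }
    }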

8.
As massively parallel computers proliferate, there is growing interest in finding ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper, we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), that we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10% relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

9.
The importance of polar-region computation for the design of global numerical weather prediction models shows up mainly in two aspects: the treatment of the polar regions in the model's dynamical core, and the parallel load imbalance introduced by parallel data partitioning of the polar regions. The latter is a performance bottleneck for large-scale parallel computation in global numerical weather prediction models. To address it, we propose a new weighted equal-area partitioning algorithm for spherical data. Taking the number of latitude bands and a weight function as parameters, the algorithm assigns the North and South Poles to separate subdomains, forming polar cap regions, and gradually increases the number of subdomains per latitude band from the poles toward the equator, flexibly producing a high-quality partition of the spherical grid. After a theoretical analysis of the partition quality, the proposed algorithm is validated on PSTSWM, a shallow water model based on spherical harmonic spectral methods, and shown to deliver good parallel partitioning performance and scalability. Its application prospects are discussed in connection with GRAPES, China's independently designed global model.
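A rough sketch of the weighted equal-area idea, under the assumption that the partition is parameterized by the number of latitude bands and a weight function, with single-subdomain polar caps and more subdomains per band toward the equator; the exact formulation in the paper is not reproduced.

    // Hypothetical partition: subdomain counts per latitude band on the sphere.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Counts for bands ordered from the south pole to the north pole.
    std::vector<int> partition(int nbands, int nprocs,
                               double (*weight)(double /*latitude, radians*/)) {
        const double PI = std::acos(-1.0);
        std::vector<double> w(nbands);
        double total = 0.0;
        for (int i = 0; i < nbands; ++i) {
            double lat0 = -PI / 2 + i * PI / nbands, lat1 = lat0 + PI / nbands;
            // band area on the unit sphere times a user-supplied work weight
            w[i] = (std::sin(lat1) - std::sin(lat0)) * weight(0.5 * (lat0 + lat1));
            total += w[i];
        }
        std::vector<int> count(nbands, 0);
        count[0] = count[nbands - 1] = 1;          // polar caps: one subdomain each
        double inner = total - w[0] - w[nbands - 1];
        for (int i = 1; i < nbands - 1; ++i)       // more subdomains toward the equator
            count[i] = std::max(1, (int)std::lround((nprocs - 2) * w[i] / inner));
        // a real implementation would adjust rounding so the counts sum to nprocs
        return count;
    }

    int main() {
        std::vector<int> c = partition(8, 32, [](double) { return 1.0; });  // uniform weight
        for (int i = 0; i < (int)c.size(); ++i)
            std::printf("band %d: %d subdomains\n", i, c[i]);
        return 0;
    }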

10.
Multicore processors are widely used in high-performance computing, yet effectively converting traditional sequential programs into parallel code and reducing the time spent in nested loops remain challenging problems in the field. This paper first analyzes the dependence characteristics of nested loops based on the polyhedral model and performs loop tiling, from which coarse-grained parallel code is generated automatically. Targeting the architectural characteristics of multicore array processors, a genetic algorithm is used to generate communication-optimized tile task sequences, on top of which an effective task scheduling model is built. Finally, the approach is applied to LU decomposition; the results show that, compared with traditional scheduling algorithms, it achieves better data locality and load balance.
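A minimal sketch of tiling plus coarse-grained parallel code for LU decomposition (without pivoting): the row tiles of the trailing-submatrix update are independent and can be handed to different cores. The polyhedral dependence analysis and the GA-based, communication-optimized tile schedule are not reproduced here.

    // Hypothetical tiled, in-place LU factorization (no pivoting).
    #include <algorithm>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    void lu_tiled(Matrix& A, int tile) {
        const int n = (int)A.size();
        for (int k = 0; k < n; ++k) {
            for (int i = k + 1; i < n; ++i)
                A[i][k] /= A[k][k];                    // column of L
            // Trailing submatrix update: row tiles are independent, so they can
            // be distributed across cores as coarse-grained tasks.
            #pragma omp parallel for schedule(dynamic)
            for (int it = k + 1; it < n; it += tile) {
                const int iend = std::min(n, it + tile);
                for (int i = it; i < iend; ++i)
                    for (int j = k + 1; j < n; ++j)
                        A[i][j] -= A[i][k] * A[k][j];
            }
        }
    }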

11.
With the growth of computer capabilities and the appearance of multicore processors, parallel computing has made it possible to reduce the time needed to solve optimization problems. Of current interest are parallel computing methods for genetic algorithms, which use an evolutionary model of development whose main component is a population of species (a set of alternative solutions to the problem). In this case, algorithm efficiency increases because several populations evolve in parallel. A survey of the basic parallelization strategies and the most interesting models of their implementation is presented. Theoretical ideas for improving existing parallelization mechanisms for genetic algorithms are described, and a modified model of a parallel genetic algorithm is developed. Since genetic algorithms are used to solve optimization problems, the proposed model was studied on the optimization of a multicriteria function. The algorithm's ability to escape local optima and the influence of its parameters on the dynamics of the search for a deep extremum were studied. We conclude that dynamic connections between processes are more efficient than static connections. New mechanisms for implementing and analyzing the efficiency of dynamic connections for distributed computing in genetic algorithms are needed.

12.
Design space exploration of a software speculative parallelization scheme
With speculative parallelization, code sections that cannot be fully analyzed by the compiler are optimistically executed in parallel. Hardware schemes are fast but expensive and require modifications to the processors and/or memory system. Software schemes require no changes to the hardware of existing shared-memory systems, but can suffer from significant overheads involved with the speculative execution. In fact, the performance of software schemes is highly dependent on application characteristics, the design and implementation of the scheme, and the system configuration and size. This paper explores the design space of a recently proposed software speculative parallelization scheme. In the process, we gain insight into the most beneficial features of software schemes for speculative parallelization, as well as the most influential application characteristics. For instance, experimental results show that, contrary to intuition, checking for data dependence violations on every speculative store, as opposed to at commit time, leads to little performance degradation in the worst case and to significantly better performance with large configurations. Also, scheduling policies based on windows can perform very close to fully dynamic policies with a fraction of the memory overhead. Finally, experimental results show consistent speedups in the execution of loops that cannot be parallelized at compile time, both with and without RAW data dependences, for 4 to 32 processors.

13.
Two different numerical solutions of the two-component kinetic collection equation were implemented on parallel computers. The parallelization approach included domain decomposition and MPI commands for communications. Four different parallel codes were tested. A dynamic decomposition based on an occupancy function provided the optimum balance between time performance and flexibility for any number of processors. The occupancy function was defined according to the number of calculations required at each grid point in the domain. Speed-up performance depended very much on the parallel code used and in some cases very good results were obtained for up to 32 processors.
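A small sketch of decomposition driven by an occupancy (per-grid-point cost) function: cut the ordered grid points so that each processor receives roughly the same accumulated cost. The cost values and the greedy cut rule are illustrative assumptions; a production version would refine cut placement so that no processor overshoots its share.

    // Hypothetical occupancy-based 1-D domain decomposition.
    #include <cstdio>
    #include <vector>

    // Returns, for each processor, the index one past its last grid point.
    std::vector<int> split_by_cost(const std::vector<double>& cost, int nprocs) {
        double total = 0.0;
        for (double c : cost) total += c;
        std::vector<int> ends;
        const double target = total / nprocs;
        double acc = 0.0;
        for (int i = 0; i < (int)cost.size() && (int)ends.size() < nprocs - 1; ++i) {
            acc += cost[i];
            if (acc >= target * (ends.size() + 1))   // crossed the next cut point
                ends.push_back(i + 1);
        }
        ends.push_back((int)cost.size());            // last processor takes the rest
        return ends;
    }

    int main() {
        // cost per grid point, e.g. the number of calculations required there
        std::vector<double> cost = {2, 3, 4, 6, 6, 4, 3, 2, 2, 2, 3, 3};
        std::vector<int> ends = split_by_cost(cost, 4);
        int start = 0;
        for (int p = 0; p < (int)ends.size(); ++p) {
            std::printf("processor %d: grid points [%d, %d)\n", p, start, ends[p]);
            start = ends[p];
        }
        return 0;
    }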

14.
We report on the design, implementation, and performance of the parallel term-rewriting system PaReDuX. We discuss the parallelization of three term completion procedures: Knuth-Bendix completion, completion modulo AC, and unfailing completion. Our parallelization is strategy-compliant, i.e., the parallel code performs exactly the same work as the sequential code, but the workload is shared by many processors. PaReDuX is designed for shared memory parallel architectures, such as multi-processor workstations, where it shows good performance on a variety of examples.

15.
The widespread use of multicore processors is not a consequence of significant advances in parallel programming. In contrast, multicore processors arise due to the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge due to its need for complex program analysis and the existence of unknowns during compilation. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of domain-independent kernel (e.g., assignment, reduction, recurrence). Such a kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments that evaluate the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000.
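As a small illustration of one of the domain-independent kernels named in the abstract, a sum reduction is shown in sequential form and in the parallel counterpart that can be emitted once the kernel has been recognized; this example is not taken from the paper.

    // Illustrative reduction kernel, sequential and parallel forms.
    #include <cstddef>

    // Sequential form: a plain reduction over a C-style buffer.
    double sum_seq(const double* x, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += x[i];
        return s;
    }

    // Parallel counterpart of the recognised reduction kernel.
    double sum_par(const double* x, std::size_t n) {
        double s = 0.0;
        #pragma omp parallel for reduction(+ : s)
        for (long long i = 0; i < (long long)n; ++i) s += x[i];
        return s;
    }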

16.
Scalability Analysis of Parallel Algorithms
This paper discusses definitions of the scalability of parallel algorithms, the goals of scalability research, and various evaluation criteria, with the aim of helping to understand how parallel algorithms match parallel architectures, to maximize system speedup and efficiency, and to predict the performance of parallel algorithms that currently run on small-scale parallel machines when they are ported to massively parallel computers (MPCs).
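The basic metrics such scalability analyses rest on can be computed directly from measured run times, e.g. speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p; the timings in the sketch below are made up.

    // Minimal illustration of speedup and efficiency from measured run times.
    #include <cstdio>

    int main() {
        const double t1 = 512.0;                                   // sequential time, seconds
        const int    procs[] = {2, 4, 8, 16, 32};
        const double tp[]    = {262.0, 136.0, 72.0, 40.0, 24.0};   // measured parallel times
        for (int i = 0; i < 5; ++i) {
            double speedup    = t1 / tp[i];
            double efficiency = speedup / procs[i];
            std::printf("p=%2d  S=%5.2f  E=%4.2f\n", procs[i], speedup, efficiency);
        }
        return 0;
    }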

17.
With energy consumption becoming one of the first-class optimization parameters in computer system design, compilation techniques that consider performance and energy simultaneously are expected to play a central role. In particular, compiling a given application code under performance and energy constraints is becoming an important problem. In this paper, we focus on an on-chip multiprocessor architecture and present a set of code optimization strategies. We first evaluate an adaptive loop parallelization strategy (i.e., a strategy that allows each loop nest to execute using a different number of processors if doing so is beneficial) and measure the potential energy savings when unused processors during execution of a nested loop are shut down (i.e., placed into a power-down or sleep state). Our results show that shutting down unused processors can lead to as much as 67 percent energy savings at the expense of up to 17 percent performance loss in a set of array-intensive applications. To eliminate this performance penalty, we also discuss and evaluate a processor preactivation strategy based on compile-time analysis of nested loops. Based on our experiments, we conclude that an adaptive loop parallelization strategy combined with idle processor shut down and preactivation can be very effective in reducing energy consumption without increasing execution time. We then generalize our strategy and present an application parallelization strategy based on integer linear programming (ILP). Given an array-intensive application, our optimization strategy determines the number of processors to be used in executing each loop nest based on the objective function and additional compilation constraints provided by the user/programmer. Our initial experience with this constraint-based optimization strategy shows that it is very successful in optimizing array-intensive applications on on-chip multiprocessors under multiple energy and performance constraints.
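A minimal sketch of adaptive loop parallelization with OpenMP, assuming each loop nest is given its own thread count chosen by compile-time analysis or profiling; shutting down or preactivating the idle processors is hardware-specific and only indicated in comments.

    // Hypothetical per-loop-nest processor counts.
    #include <omp.h>
    #include <vector>

    void compute(std::vector<double>& a, const std::vector<double>& b) {
        const int n = (int)a.size();

        // Nest 1: memory-bound copy; analysis says two processors suffice, so the
        // remaining cores could be powered down (or preactivated for nest 2).
        #pragma omp parallel for num_threads(2)
        for (int i = 0; i < n; ++i) a[i] = b[i];

        // Nest 2: compute-bound update; use every available processor.
        #pragma omp parallel for num_threads(omp_get_max_threads())
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < 100; ++k) a[i] = a[i] * 0.5 + b[i];
    }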

18.
As is widely known, multi-core computers are broadly used these days, and automatic parallelization of sequential programs is still a challenge. In this context, we propose a set of code transformations to be applied automatically by a tool in order to transform sequential legacy systems into their parallel version. We implement these transformations by applying a lightweight source code analysis based on a rewritable AST (Abstract Syntax Tree). Since it is not always possible to automatically parallelize the code, we also implemented some specific analyses in order to report possible changes that would allow specific parallelization. Additionally, we present some examples in which these transformations were conducted and the corresponding performance experiments.

19.

In this paper, we present several important details of the process of legacy code parallelization, mostly related to the problem of maintaining the numerical output of a legacy code while obtaining a balanced workload for parallel processing. Since we maintained the non-uniform mesh imposed by the original finite element code, we had to develop a specially designed data distribution among processors so that the data restrictions of the finite element method are met. In particular, we introduce a data distribution method that is first used in shared-memory parallel processing and obtains better performance than the previous parallel program version. Moreover, this method can be extended to other parallel platforms such as distributed-memory parallel computers. We present results covering several problems related to performance profiling on different (development and production) parallel platforms. The use of new and old parallel computing architectures leads to different behavior of the same code, which in all cases performs better on multiprocessor hardware.


20.
The EGS4 code, developed at Stanford Linear Accelerator Center, simulates electron-photon cascading phenomena. The original code is inherently sequential: processing one particle at a time. This paper reports on a series of experiments in parallelizing different versions of EGS4. Our parallel experiments were run on a 30-processor Sequent Balance B21 and a 6-processor Symmetry S27. We have considered the following approaches for parallel execution of this application code:
(1) Original sequential version modified for parallel processing: 1 processor;
(2) Version 1 run multiprocessed: 1 to 29 processors;
(3) Sequential version modified for large-grain parallel processing: 1 processor;
(4) Version 3 run using the Sequent Microtasking Library: 1 to 29 processors.

For each approach, we discuss the relative advantages and disadvantages in the areas of coding effort, understandability and portability, as well as performance, and outline a new parallelization approach we are currently pursuing based on Large-Grain Data Flow techniques.
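A hedged sketch of the parallelization pattern only (independent particle histories distributed across threads, one random-number stream per thread); the "physics" in simulate_history is a placeholder and has nothing to do with EGS4.

    // Hypothetical parallel particle-history loop.
    #include <omp.h>
    #include <cstdio>
    #include <random>

    // Stand-in for one electron/photon cascade history (invented).
    static double simulate_history(std::mt19937& rng) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        double e = 1.0, deposited = 0.0;
        while (e > 1e-3) { double d = e * u(rng); deposited += d; e -= d; }
        return deposited;
    }

    int main() {
        const int nhistories = 100000;
        double total = 0.0;
        #pragma omp parallel reduction(+ : total)
        {
            std::mt19937 rng(12345u + omp_get_thread_num());  // one stream per thread
            #pragma omp for
            for (int h = 0; h < nhistories; ++h) total += simulate_history(rng);
        }
        std::printf("mean energy deposited per history: %f\n", total / nhistories);
        return 0;
    }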

