期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

唐佩佳徐云钟旭阳《计算机系统应用》2020,29(10):82-88

大量遗留的串行代码需要进行并行化改造,而并行程序复杂性及并行计算平台多样性导致改造成本较高.为此,设计了一种基于标记语言的三层并行编程框架,完成了从串行程序层到并行中间代码层、并行中间代码层到目标并行编程语言程序层的二个转换阶段.采用对串行代码进行语言标记的方法来实现并行中间代码层,该代码层实际是共享存储、分布式存储并行平台编程语言的一种抽象.该框架还实现了一种性能标记方法,可用于并行参数自动寻优.用于雷达数据处理的实验结果表明,实现了对应并行代码的生成,且并行加速比与人工实现的并行代码相当. 相似文献

2.

Parallelization of torsion finite element code using compressed stiffness matrix algorithm

Sefidgar Seyed Mohammad Hassan Firoozjaee Ali Rahmani Dehestani Mehdi 《Engineering with Computers》2021,37(3):2439-2455

In the current study, the problems of elastic and elastoplastic torsion were formulated by finite element method. The finite element code was parallelized on both shared and distributed memory architectures. An assembling method with high parallelism ability and consuming minimum memory was proposed to obtain compressed global stiffness matrix directly. Parallel programming principles were expressed in two shared memory and distributed memory approaches; moreover, parallel well-known mathematical libraries were briefly expressed. In this paper, the main focus was on a lucid explanation of parallelization mechanisms in detail on two memory architectures such as some settings of Linux operating system for large-scale problems. To verify the ability of the proposed method and its parallel performance, several benchmark examples were represented with different mesh sizes and were compared with their respective analytical solutions. Considering the obtained results, the proposed sparse assembling algorithm decreased required memory significantly (about 10^3.5 to 10^5.5 times) and the obtained speedup was about 3.4 for the elastoplastic torsion problem in a simple multicore computer.

相似文献

3.

Optimization and Performance of a Fortran 90 MPI-Based Unstructured Code on Large-Scale Parallel Systems

Shires Dale Mohan Ram 《The Journal of supercomputing》2003,25(2):131-141

The message-passing interface (MPI) has become the standard in achieving effective results when using the message passing paradigm of parallelization. Codes written using MPI are extremely portable and are applicable to both clusters and massively parallel computing platforms. Since MPI uses the single program, multiple data (SPMD) approach to parallelism, good performance requires careful tuning of the serial code as well as careful data and control flow analysis to limit communication. We discuss optimization strategies used and their degree of success to increase performance of an MPI-based unstructured finite element simulation code written in Fortran 90. We discuss performance results based on implementations using several modern massively parallel computing platforms including the SGI Origin 3800, IBM Nighthawk 2 SMP, and Cray T3E-1200. 相似文献

4.

Explicit nonlinear dynamic finite element analysis on homogeneous/heterogeneous parallel computing environment

《Advances in Engineering Software》2006,37(11):701-720

This paper presents parallel computational strategies to implement explicit nonlinear finite element analysis code onto distributed memory parallel computers for solving large-scale problems in structural dynamics. Implementation details on both homogeneous and heterogeneous parallel processing environments are considered in detail in this paper. Implementation of an explicit nonlinear finite element dynamic analysis code on homogeneous systems is discussed first and this is later moved onto heterogeneous systems. Domain decomposition with explicit message passing is preferred for parallel implementation. The message passing implementation in the parallel algorithm is based on MPI (Message Passing Interface) libraries. Implementation aspects of overlapped, non-overlapped domain decomposition techniques, Dynamic Task Allocation (DTA) and clustering techniques for DTA and their relative merits are presented. The interprocessor communications are optimised by overlapping with computations to improve the performance of the domain decomposition based explicit dynamic analysis finite element code.The issues related to implementation of finite element code for nonlinear dynamic analysis on heterogeneous parallel computing environment are later presented. A new dynamic load-balancing algorithm is developed for this purpose and it is integrated with the domain decomposition based parallel explicit finite element code to test our algorithms on a coarse grain heterogeneous cluster of workstations. Numerical experiments have been carried out on PARAM-10000, an Indian parallel computer and also on cluster of Unix workstations. 相似文献

5.

Automatic CPU/GPU Generation of Multi-versioned OpenCL Kernels for C++ Scientific Applications

Rafael Sotomayor Luis Miguel Sanchez Javier Garcia Blas Javier Fernandez J. Daniel Garcia 《International journal of parallel programming》2017,45(2):262-282

Parallelism has become one of the most extended paradigms used to improve performance. However, it forces software developers to adapt applications and coding mechanisms to exploit the available computing devices. Legacy source code needs to be re-written to take advantage of multi- core and many-core computing devices. Writing parallel applications in a traditional way is hard, expensive, and time consuming. Furthermore, there is often more than one possible transformation or optimization that can be applied to a single piece of legacy code. Therefore many parallel versions of the same original sequential code need to be considered. In this paper, we describe an automatic parallel source code generation workflow (REWORK) for parallel heterogeneous platforms. REWORK automatically identifies promising kernels on legacy C++ source code and generates multiple specific versions of kernels for improving C++ applications, selecting the most adequate version based on both static source code and target platform characteristics. 相似文献

6.

A real-time multigrid finite hexahedra method for elasticity simulation using CUDA

Christian Dick Joachim Georgii Rüdiger Westermann 《Simulation Modelling Practice and Theory》2011,19(2):801-816

We present a multigrid approach for simulating elastic deformable objects in real time on recent NVIDIA GPU architectures. To accurately simulate large deformations we consider the co-rotated strain formulation. Our method is based on a finite element discretization of the deformable object using hexahedra. It draws upon recent work on multigrid schemes for the efficient numerical solution of partial differential equations on such discretizations. Due to the regular shape of the numerical stencil induced by the hexahedral regime, and since we use matrix-free formulations of all multigrid steps, computations and data layout can be restructured to avoid execution divergence of parallel running threads and to enable coalescing of memory accesses into single memory transactions. This enables to effectively exploit the GPU’s parallel processing units and high memory bandwidth via the CUDA parallel programming API. We demonstrate performance gains of up to a factor of 27 and 4 compared to a highly optimized CPU implementation on a single CPU core and 8 CPU cores, respectively. For hexahedral models consisting of as many as 269,000 elements our approach achieves physics-based simulation at 11 time steps per second. 相似文献

7.

Using high performance Fortran for parallel programming

G. Sarma T. Zacharia D. Miles 《Computers & Mathematics with Applications》1998,35(12):41-57

A finite element code with a polycrystal plasticity model for simulating deformation processing of metals has been developed for parallel computers using High Performance Fortran (HPF). The conversion of the code from an original implementation on the Connection Machine systems using CM Fortran is described. The sections of the code requiring minimal inter-processor communication are easily parallelized, by changing only the syntax for specifying data layout. However, the solver routine based on the conjugate gradient method required additional modifications, which are discussed in detail. The performance of the code on a massively parallel distributed-memory Intel PARAGON supercomputer is evaluated through timing statistics. Published by Elsevier Science Ltd. 相似文献

8.

A hybrid MPI–OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence

Pablo D. Mininni Duane Rosenberg Raghu Reddy Annick Pouquet 《Parallel Computing》2011,37(6-7):316-326

A hybrid scheme that utilizes MPI for distributed memory parallelism and OpenMP for shared memory parallelism is presented. The work is motivated by the desire to achieve exceptionally high Reynolds numbers in pseudospectral computations of fluid turbulence on emerging petascale, high core-count, massively parallel processing systems. The hybrid implementation derives from and augments a well-tested scalable MPI-parallelized pseudospectral code. The hybrid paradigm leads to a new picture for the domain decomposition of the pseudospectral grids, which is helpful in understanding, among other things, the 3D transpose of the global data that is necessary for the parallel fast Fourier transforms that are the central component of the numerical discretizations. Details of the hybrid implementation are provided, and performance tests illustrate the utility of the method. It is shown that the hybrid scheme achieves good scalability up to ～20,000 compute cores with a maximum efficiency of 89%, and a mean of 79%. Data are presented that help guide the choice of the optimal number of MPI tasks and OpenMP threads in order to maximize code performance on two different platforms. 相似文献

9.

Finite element analysis on the BBN butterfly multiprocessor

《Computers & Structures》1987,27(1):13-21

Applications of the finite element method can easily exhaust CPU and memory resources on conventional uniprocessor machines, and are thus prime candidates for multiprocessor application. In order to investigate various parallel processing methodologies and to evaluate their performance, we have developed a finite element program for the Butterfly Parallel Processor. The overall organization of the program and its data structures are described along with parallel implementations of well-known algorithms for transient and static analysis. Results are presented for two implementations of explicit time integration, and for static analysis using a parallel version of Gaussian elimination. For comparison, results from running the same program on a DEC VAX 11/785 are also included. The transient cases yielded near-linear (within 2.5%) speedup in the full range of 1–64 processors; the static case results showed significant nonlinearity over the same range. 相似文献

10.

Concepts and implementation of parallel finite element analysis 总被引：1，自引：0，他引：1

K. N. Chiang R. E. Fulton 《Computers & Structures》1990,36(6):1039-1046

The design of complex engineering systems such as advanced aircraft structures and offshore platforms requires continually increasing levels of detail in supporting analysis. The finite element method is widely used as a computational method with which to model physical systems in various engineering problems. For detailed analyses of complex designs, structural models composed of several thousands of degrees of freedom are no longer uncommon. Such design activities require large order finite element and/or finite difference models and excessive computation demands in both calculation speed and information management. The computer simulation of the nonlinear dynamic response of structures and the implementation of parallel FEM systems on a high speed multiprocessor have received considerable attention in recent years. The driving forces of these activities included the reliable simulation of automotive and aircraft crash phenomena, and the increased performance of computers. Most existing major structural analysis software systems were designed 10–20 years ago and have been optimized for current sequential computers. Such systems often are not well structured to take maximum advantage of the recent and continuing revolution in parallel vector computing capabilities. These parallel vector computer architectures not only occur in the form of large supercomputers, but are now also occurring for minicomputers and even engineering workstations. To benefit from advances in parallel computers, software must be developed which takes maximum advantage of the parallel processing feature. 相似文献

11.

A finite volume method parallelization for the simulation of free surface shallow water flows

A.I. Delis E.N. Mathioudakis 《Mathematics and computers in simulation》2009

We construct a parallel algorithm, suitable for distributed memory architectures, of an explicit shock-capturing finite volume method for solving the two-dimensional shallow water equations. The finite volume method is based on the very popular approximate Riemann solver of Roe and is extended to second order spatial accuracy by an appropriate TVD technique. The parallel code is applied to distributed memory architectures using domain decomposition techniques and we investigate its performance on a grid computer and on a Distributed Shared Memory supercomputer. The effectiveness of the parallel algorithm is considered for specific benchmark test cases. The performance of the realization measured in terms of execution time and speedup factors reveals the efficiency of the implementation. 相似文献

12.

一种适应GPU的混合OLAP查询处理模型

张宇张延松陈红王珊《软件学报》2016,27(5):1246-1265

通用GPU因其强大的并行计算能力成为新兴的高性能计算平台,并逐渐成为近年来学术界在高性能数据库实现技术领域的研究热点.但当前GPU数据库领域的研究沿袭的是ROLAP(relational OLAP)多维分析模型,研究主要集中在关系操作符在GPU平台上的算法实现和性能优化技术,以哈希连接的GPU并行算法研究为中心.GPU拥有数千个并行计算单元,但其逻辑控制单元较少,相对于CPU具有更强的并行计算能力,但逻辑控制和复杂内存管理能力较弱,因此并不适合需要复杂数据结构和复杂内存管理机制的内存数据库查询处理算法直接移植到GPU平台.提出了面向GPU向量计算特性的混合OLAP多维分析模型semi-MOLAP,将MOLAP(multidimensionalOLAP)模型的直接数组访问和计算特性与ROLAP模型的存储效率结合在一起,实现了一个基于完全数组结构的GPU semi-MOLAP多维分析模型,简化了GPU数据管理,降低了GPU semi-MOLAP算法复杂度,提高了GPU semi-MOLAP算法的代码执行率.同时,基于GPU和CPU计算的特点,将semi-MOLAP操作符拆分为CPU和GPU平台的协同计算,提高了CPU和GPU的利用率以及OLAP的查询整体性能. 相似文献

13.

Portability,predictability and performance for parallel computing: BSP in practice

Joy Reed Kevin Parrott Tim Lanfear 《Concurrency and Computation》1996,8(10):799-812

We report on practical experience using the Oxford BSP Library to parallelize a large electromagnetic code, the British Aerospace finite-difference time-domain code EMMA T:FD3D. The Oxford BS Library is one of the first realizations of the Bulk Synchronous Parallel computational model to be targeted at numerically intensive scientific (typically Fortran) computing. The BAe EMMA code is one of the first large-scale applications to be parallelized using this library, and it is an important demonstration of the cost effectiveness of the BSP approach. We illustrate how BSP cost-modelling techniques can be used to predict and optimize performance for single-source programs across different parallel platforms. We provide predicted and observed performance figures for an industrial-strength, single-source parallel code for a variety of real parallel architectures: shared memory multiprocessors, workstation clusters and massively parallel platforms. 相似文献

14.

Parallel Implementation of a Class of Adaptive Signal Processing Applications

M. Lee W. Liu V. K. Prasanna 《Algorithmica》2001,30(4):645-684

Recently, High Performance Computing (HPC) platforms have been employed to realize many computationally demanding applications in signal and image processing. These applications require real-time performance constraints to be met. These constraints include latency as well as throughput. In order to meet these performance requirements, efficient parallel algorithms are needed. These algorithms must be engineered to exploit the computational characteristics of such applications. In this paper we present a methodology for mapping a class of adaptive signal processing applications onto HPC platforms such that the throughput performance is optimized. We first define a new task model using the salient computational characteristics of a class of adaptive signal processing applications. Based on this task model, we propose a new execution model. In the earlier linear pipelined execution model, the task mapping choices were restricted. The new model permits flexible task mapping choices, leading to improved throughput performance compared with the previous model. Using the new model, a three-step task mapping methodology is developed. It consists of (1) a data remapping step, (2) a coarse resource allocation step, and (3) a fine performance tuning step. The methodology is demonstrated by designing parallel algorithms for modern radar and sonar signal processing applications. These are implemented on IBM SP2 and Cray T3E, state-of-the-art HPC platforms, to show the effectiveness of our approach. Experimental results show significant performance improvement over those obtained by previous approaches. Our code is written using C and the Message Passing Interface (MPI). Thus, it is portable across various HPC platforms. Received April 8, 1998; revised February 2, 1999. 相似文献

15.

On dynamic memory allocation in sliding-window parallel patterns for streaming analytics

Torquati M. Mencagli G. Drocco M. Aldinucci M. De Matteis T. Danelutto M. 《The Journal of supercomputing》2019,75(8):4114-4131

This work studies the issues related to dynamic memory management in Data Stream Processing, an emerging paradigm enabling the real-time processing of live data streams. In this paper, we consider two streaming parallel patterns and we discuss different implementation variants related to how dynamic memory is managed. The results show that the standard mechanisms provided by modern C++ are not entirely adequate for maximizing the performance. Instead, the combined use of an efficient general purpose memory allocator, a custom allocator optimized for the pattern considered and a custom variant of the C++ shared pointer mechanism, provides a performance improvement up to 16% on the best case.

相似文献

16.

Efficient data parallel algorithms for multidimensional array operations based on the EKMR scheme for distributed memory multicomputers

Chun-Yuan Lin Yeh-Ching Chung Jen-Shiuh Liu 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(7):625-639

Array operations are useful in a large number of important scientific codes, such as molecular dynamics, finite element methods, climate modeling, atmosphere and ocean sciences, etc. In our previous work, we have proposed a scheme of extended Karnaugh map representation (EKMR) for multidimensional array representation. We have shown that sequential multidimensional array operation algorithms based on the EKMR scheme have better performance than those based on the traditional matrix representation (TMR) scheme. Since parallel multidimensional array operations have been an extensively investigated problem, we present efficient data parallel algorithms for multidimensional array operations based on the EKMR scheme for distributed memory multicomputers. In a data parallel programming paradigm, in general, we distribute array elements to processors based on various distribution schemes, do local computation in each processor, and collect computation results from each processor. Based on the row, column, and 2D mesh distribution schemes, we design data parallel algorithms for matrix-matrix addition and matrix-matrix multiplication array operations in both TMR and EKMR schemes for multidimensional arrays. We also design data parallel algorithms for six Fortran 90 array intrinsic functions: All, Maxval, Merge, Pack, Sum, and Cshift. We compare the time of the data distribution, the local computation, and the result collection phases of these array operations based on the TMR and the EKMR schemes. The experimental results show that algorithms based on the EKMR scheme outperform those based on the TMR scheme for all test cases. 相似文献

17.

On parallel computation of centrality measures of graphs

García Juan F. Carriegos M. V. 《The Journal of supercomputing》2019,75(3):1410-1428

Centrality measures or indicators of centrality identify most relevant nodes of graphs. Although optimized algorithms exist for computing of most of them, they are still time consuming and are even infeasible to apply to big enough graphs like the ones representing social networks or extensive enough computer networks. In this paper, we present a parallel implementation in C language of some optimal algorithms for computing of some indicators of centrality. Our parallel version greatly reduces the execution time of their sequential (non-parallel) counterpart. The proposed solution relies on threading, allowing for a theoretical improvement in performance close to the number of logical processors (cores) of the single computer in which it is running. Our software has been tested in several platforms, including the supercomputer Calendula, in which we achieved execution times close to 18 times faster when running our parallel implementation instead of our sequential one. Our solution is multi-platform and portable, working on any machine with several logical processor which is capable of compiling and running C language code.

相似文献

18.

PARALLEL SOLUTION OF SYMMETRIC POSITIVE DEFINITE TOEPLITZ SYSTEMS

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(4):297-303

Initially, parallel algorithms were designed by parallelising the existing sequential algorithms for frequently occurring problems on available parallel architectures.

More recently, parallel strategies have been identified and utilised resulting in many new parallel algorithms. However, the analysis of such techniques reveals that further strategies can be applied to increase the parallelism. One of these strategies, i.e., increasing the computational work in each processing node, can reduce the memory accesses and hence congestion in a shared memory multiprocessor system. Similarly, when network message passing is minimised in a distributed memory processor system, dramatic improvements in the performance of the algorithm ensue.

A frequently occurring computational problem in digital signal processing (DSP) is the solution of symmetric positive definite Toeplilz linear systems. The Levinson algorithm for solving such linear equations is where the Toeplitz matrix property is utilised in the elimination process of each element to advantage. However, it can be shown that in the Parallel Implicit Elimination (PIE) method where more than one element is eliminated simultaneously, the Toeplitz structure can again be utilised to advantage. This relatively simple strategy yields a reduction in accesses to shared memory or network message passing, resulting in a significant improvement in the performance of the algorithm [2], 相似文献

19.

Optimizations and OpenMP implementation for the direct simulation Monte Carlo method

Da Gao Thomas E. Schwartzentruber 《Computers & Fluids》2011,42(1):73-81

Parallel implementation of a three-dimensional direct simulation Monte Carlo (DSMC) code employing complex data structures and dynamic memory allocation is detailed for shared memory systems using Open Multi-Processing (OpenMP). Several techniques to optimize the serial implementation of the DSMC method are first discussed. Specifically for a 3-level Cartesian grid, a Cartesian-based movement technique including particle indexing is demonstrated to result in a modest decrease in overall simulation expense of 34% compared with a ray-tracing technique combined with stored cell-connectivity. Two strategies for data localization leading to optimal usage of cache memory are demonstrated to speed up certain cell-based functions (such as collision computations) by a factor of 3.38–4.36. The shared-memory parallel implementation using OpenMP is then described in detail. Synchronization points and related critical sections are identified as major factors that impact the OpenMP parallel performance. Techniques to remove all such synchronization points in the OpenMP implementation of the DSMC method are outlined. For dual-core and quad-core systems, speedups of 1.99 and 3.74, respectively, are obtained for a (free-stream flow) test simulation with low granularity. Finally, the parallel performance of identical source code employing OpenMP is shown to be strongly correlated to the underlying computer architecture. Both Symmetric Multiprocessor (SMP) and non-uniform memory access (NUMA) systems are studied in order to achieve a better understanding of their impacts on parallel scalability when using OpenMP. 相似文献

20.

基于Matlab的遗留系统并行化重构方法

樊峰峰张延园林奕《计算机与现代化》2012,(5):23-26

随着CPU多核架构的普及,应用的复杂和数据集的膨胀,基于Matlab的遗留系统中的串行程序代码无法充分发挥系统潜在的性能优势,无力应对当前大型数据集的处理应用需求。Matlab的并行计算模型为数据密集型的处理任务提供了并行支持。本文首先从系统架构扩展和业务代码并行化入手,分析遗留系统并行化重构过程要点和方法,应用案例的并行化重构实验数据表明了系统重构处理大型数据集的性能提升。相似文献