期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Restructuring Fortran legacy applications for parallel computing in multiprocessors

Fernando G. Tinetti Mariano Méndez Armando De Giusti 《The Journal of supercomputing》2013,64(2):638-659

As it is widely known, multi-core computers are broadly used these days, and automatic parallelization of sequential programs is still a challenge. In this context, we propose a set of code transformations to be applied automatically by a tool in order to transform sequential legacy systems into their parallel version. We implement these transformations by applying a lightweight source code analysis based on rewritable AST (Abstract Syntax Tree). Since it is not always possible to automatically parallelize the code, we also implemented some specific analyses in order to report possible changes that would allow specific parallelization. Additionally, we present some examples in which these transformations were conducted and the corresponding performance experiments. 相似文献

2.

Exploring parallelization techniques based on OpenMP in H.264/AVC encoder for embedded multi-core processor

Seongmin Jo Song Hyun Jo Yong Ho Song 《Journal of Systems Architecture》2012,58(9):339-353

Recent advances in semiconductor technologies make it possible to integrate many processor cores in a small device package. The parallel execution capability of such multi-core processors can be exploited to enhance the performance of many traditional sequential applications. There have been numerous research activities to develop parallelization techniques using the OpenMp programming model, in order to speed up sequential applications such as the H.264/AVC codec, but mostly in the PC environment. Therefore, it is difficult to understand which parallelization technique fits well with the H.264/AVC encoder on an embedded multi-core architecture. In this paper, we present parallelization techniques applicable to the H.264/AVC encoder on ARM MPCore using the OpenMP programming model. Further, we propose an analytical model for the performance estimation of the H.264/AVC encoder, and we then verify the model accuracy by performing simulations using hardware/software co-verification tool. Our experimental results show that the parallelization techniques proposed in this paper for the embedded multi-core platform improve the encoder performance by up to 2.36 times, and that the parallelization technique exploiting data-level parallelism outperforms the one using task-level parallelism by 41%. It is also observed that balancing loads among processor cores is a critical parameter in achieving better scalability in the encoder. 相似文献

3.

基于Matlab的遗留系统并行化重构方法

樊峰峰张延园林奕《计算机与现代化》2012,(5):23-26

随着CPU多核架构的普及,应用的复杂和数据集的膨胀,基于Matlab的遗留系统中的串行程序代码无法充分发挥系统潜在的性能优势,无力应对当前大型数据集的处理应用需求。Matlab的并行计算模型为数据密集型的处理任务提供了并行支持。本文首先从系统架构扩展和业务代码并行化入手,分析遗留系统并行化重构过程要点和方法,应用案例的并行化重构实验数据表明了系统重构处理大型数据集的性能提升。相似文献

4.

Emerging Architectures Enable to Boost Massively Parallel Data Mining Using Adaptive Sparse Grids

Alexander Heinecke Dirk Pflüger 《International journal of parallel programming》2013,41(3):357-399

Gaining knowledge out of vast datasets is a main challenge in data-driven applications nowadays. Sparse grids provide a numerical method for both classification and regression in data mining which scales only linearly in the number of data points and is thus well-suited for huge amounts of data. Due to the recursive nature of sparse grid algorithms and their classical random memory access pattern, they impose a challenge for the parallelization on modern hardware architectures such as accelerators. In this paper, we present the parallelization on several current task- and data-parallel platforms, covering multi-core CPUs with vector units, GPUs, and hybrid systems. We demonstrate that a less efficient implementation from an algorithmical point of view can be beneficial if it allows vectorization and a higher degree of parallelism instead. Furthermore, we analyze the suitability of parallel programming languages for the implementation. Considering hardware, we restrict ourselves to the x86 platform with SSE and AVX vector extensions and to NVIDIA’s Fermi architecture for GPUs. We consider both multi-core CPU and GPU architectures independently, as well as hybrid systems with up to 12 cores and 2 Fermi GPUs. With respect to parallel programming, we examine both the open standard OpenCL and Intel Array Building Blocks, a recently introduced high-level programming approach, and comment on their ease of use. As the baseline, we use the best results obtained with classically parallelized sparse grid algorithms and their OpenMP-parallelized intrinsics counterpart (SSE and AVX instructions), reporting both single and double precision measurements. The huge data sets we use are a real-life dataset stemming from astrophysics and artificial ones, all of which exhibit challenging properties. In all settings, we achieve excellent results, obtaining speedups of up to 188 × using single precision on a hybrid system. 相似文献

5.

Unstructured computational aerodynamics on many integrated core architecture

《Parallel Computing》2016

Shared memory parallelization of the flux kernel of PETSc-FUN3D, an unstructured tetrahedral mesh Euler flow code previously studied for distributed memory and multi-core shared memory, is evaluated on up to 61 cores per node and up to 4 threads per core. We explore several thread-level optimizations to improve flux kernel performance on the state-of-the-art many integrated core (MIC) Intel processor Xeon Phi “Knights Corner,” with a focus on strong thread scaling. While the linear algebraic kernel is bottlenecked by memory bandwidth for even modest numbers of cores sharing a common memory, the flux kernel, which arises in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, is compute-intensive and is known to exploit effectively contemporary multi-core hardware. We extend study of the performance of the flux kernel to the Xeon Phi in three thread affinity modes, namely scatter, compact, and balanced, in both offload and native mode, with and without various code optimizations to improve alignment and reduce cache coherency penalties. Relative to baseline “out-of-the-box” optimized compilation, code restructuring optimizations provide about 3.8x speedup using the offload mode and about 5x speedup using the native mode. Even with these gains for the flux kernel, with respect to execution time the MIC simply achieves par with optimized compilation on a contemporary multi-core Intel CPU, the 16-core Sandy Bridge E5 2670. Nevertheless, the optimizations employed to reduce the data motion and cache coherency protocol penalties of the MIC are expected to be of value for CFD and many other unstructured applications as many-core architecture evolves. We explore large-scale distributed-shared memory performance on the Cray XC40 supercomputer, to demonstrate that optimizations employed on Phi hybridize to this context, where each of thousands of nodes are comprised of two sockets of Intel Xeon Haswell CPUs with 32 cores per node. 相似文献

6.

GEMLCA: Running Legacy Code Applications as Grid Services

Thierry Delaitre Tamas Kiss Ariel Goyeneche Gabor Terstyanszky Stephen Winter Peter Kacsuk 《Journal of Grid Computing》2005,3(1-2):75-90

There are many legacy code applications that cannot be run in a Grid environment without significant modification. To avoid re-engineering of legacy code, we developed the Grid Execution Management for Legacy Code Architecture (GEMLCA) that enables deployment of legacy code applications as Grid services. GEMLCA implements a general architecture for deploying legacy applications as Grid services without the need for code re-engineering, or even access to the source files. With GEMLCA, only a user-level understanding is required to run a legacy application from a standard Grid service client. The legacy code runs in its native environment using the GEMLCA resource layer to communicate with the Grid client, thus hiding the legacy nature of the application and presenting it as a Grid service. GEMLCA as a Grid service layer supports submitting jobs, getting their results and status back. The paper introduces the GEMLCA concept, its life cycle, design and implementation. It also presents as an example a legacy simulation code that has been successfully transformed into a Grid service using GEMLCA. 相似文献

7.

Automatic Parallelization of Array-oriented Programs for a Multi-core Machine

Wai-Mee Ching Da Zheng 《International journal of parallel programming》2012,40(5):514-531

We present the work on automatic parallelization of array-oriented programs for multi-core machines. Source programs written in standard APL are translated by a parallelizing APL-to-C compiler into parallelized C code, i.e. C mixed with OpenMP directives. We describe techniques such as virtual operations and data-partitioning used to effectively exploit parallelism structured around array-primitives. We present runtime performance data, showing the speedup of the resulting parallelized code, using different numbers of threads and different problem sizes, on a 4-core machine, for several examples. 相似文献

8.

Legacy code and parallel computing: updating and parallelizing a numerical model

Tinetti Fernando G. Perez Maximiliano J. Fraidenraich Ariel Altenberg Adolfo E. 《The Journal of supercomputing》2020,76(7):5636-5654

In this paper, we present several important details in the process of legacy code parallelization, mostly related to the problem of maintaining numerical output of a legacy code while obtaining a balanced workload for parallel processing. Since we maintained the non-uniform mesh imposed by the original finite element code, we have to develop a specially designed data distribution among processors so that data restrictions are met in the finite element method. In particular, we introduce a data distribution method that is initially used in shared memory parallel processing and obtain better performance than the previous parallel program version. Besides, this method can be extended to other parallel platforms such as distributed memory parallel computers. We present results including several problems related to performance profiling on different (development and production) parallel platforms. The use of new and old parallel computing architectures leads to different behavior of the same code, which in all cases provides better performance in multiprocessor hardware.

相似文献

9.

A Memory and Computation Efficient Sparse Level-Set Method

Wladimir J. van der Laan Andrei C. Jalba Jos B. T. M. Roerdink 《Journal of scientific computing》2011,46(2):243-264

Since its introduction, the level set method has become the favorite technique for capturing and tracking moving interfaces, and found applications in a wide variety of scientific fields. In this paper we present efficient data structures and algorithms for tracking dynamic interfaces through the level set method. Several approaches which address both computational and memory requirements have been very recently introduced. We show that our method is up to 8.5 times faster than these recent approaches. More importantly, our algorithm can greatly benefit from both fine- and coarse-grain parallelization by leveraging SIMD and/or multi-core parallel architectures. 相似文献

10.

光线跟踪程序PBRT的并行化及性能优化

付雄 ;王汝传《微机发展》2008,(10):5-8

随着多核处理器的出现和迅速发展,将以前经典的串行程序并行化,更好地利用多核体系结构提高其性能,成为了当前多核处理器应用研究值得关注的一个问题。以并行化光线跟踪程序PBRT为例,深入研究了串行程序并行化中的并行模型的设计与实现、正确性验证,以及并行化后的性能优化等问题。优化后的并行PBRT取得了4个线程时近3．5倍的加速比,证明了所给出的并行化及性能优化有良好的效果。相似文献

11.

PARFES: A method for solving finite element linear equations on multi-core computers

《Advances in Engineering Software》2010,41(12):1256-1265

The method suggested here is intended for solving sets of linear algebraic equations with symmetric sparse matrices. It is oriented at the usage in finite element analysis software operated on multi-core desktop computers. Algorithms of parallelization have been implemented which speed up the method intensively as more processors are added, both in the core memory mode and when using the hard disk storage. PARFES features a higher performance and speedup ability than the multi-frontal method do, because it requires a minimum amount of data transfers – actions that speed up poorly by parallelization available on multi-core desktop computers. 相似文献

12.

Visual simulation of shockwaves

Jason Sewall Nico Galoppo Georgi Tsankov Ming Lin 《Graphical Models》2009,71(4):126-138

We present an efficient method for visual simulations of shock phenomena in compressible, inviscid fluids. Our algorithm is derived from one class of the finite volume method especially designed for capturing shock propagation, but offers improved efficiency through physically-based simplification and adaptation for graphical rendering. Our technique is capable of handling complex, bidirectional object–shock interactions stably and robustly. We describe its applications to various visual effects, including explosion, sonic booms and turbulent flows. Furthermore, we explore parallelization schemes and demonstrate the scalability of our method on shared-memory, multi-core architectures. 相似文献

13.

The BonaFide C Analyzer: automatic loop-level characterization and coverage measurement

Sergio Aldea Diego R. Llanos Arturo Gonzalez-Escribano 《The Journal of supercomputing》2014,68(3):1378-1401

The advent of multicore technologies has increased the interest in parallelization techniques for existing sequential applications. These techniques include the need of detecting loops that are good candidates for parallelization, and classifying all variables of these loops according to their use, a task surprisingly hard to be carried out manually. In this paper, we introduce the BonaFide C Analyzer, an XML-based framework that combines static analysis of source code with profiling information to generate complete reports regarding all loops in a C application, including loop coverage, loop suitability for parallelization, a classification of all variables inside loops based on their accesses, and other hurdles that restrict the parallelization. This information allows to analyze how particular language constructs are used in real-world applications, and helps the programmer to parallelize the code. To show the features of the framework, we present the results of an in-depth loop characterization of C applications that are part of the SPEC CPU2006 benchmark suite. Our study shows that 47.72 % of loops present in the applications analyzed are potentially parallelizable with existent parallel programming models such as OpenMP, while an additional 37.7 % of loops could be run in parallel with the help of runtime speculative parallelization techniques. 相似文献

14.

Performance Optimization of Video Coding Process on Multi-Core Platform Using Gop Level Parallelism

S. Sankaraiah Lam Hai Shuan C. Eswaran Junaidi Abdullah 《International journal of parallel programming》2014,42(6):931-947

High definition video applications often require heavy computation, high bandwidth and high memory requirements which make their real-time implementation difficult. Multi-core architecture with parallelism provides new solutions to implementing complex multimedia applications in real-time. It is well-known that the speed of the H.264 encoder can be increased on a multi-core architecture using the parallelism concept. Most of the parallelization methods proposed earlier for these purposes suffer from the drawbacks of limited scalability and data dependency. In this paper, we present a result obtained using data-level parallelism at the Group-Of-Pictures (GOP) level for the video encoder. The proposed technique involves each GOP being encoded independently and implemented on JM 18.0 using advanced data structures and OpenMP programming techniques. The performance of the parallelized video encoder is evaluated for various resolutions based on the parameters such as encoding speed, bit rate, memory requirements and PSNR. The results show that with GOP level parallelism, very high speed up values can be achieved without much degradation in the video quality. 相似文献

15.

Automatic CPU/GPU Generation of Multi-versioned OpenCL Kernels for C++ Scientific Applications

Rafael Sotomayor Luis Miguel Sanchez Javier Garcia Blas Javier Fernandez J. Daniel Garcia 《International journal of parallel programming》2017,45(2):262-282

Parallelism has become one of the most extended paradigms used to improve performance. However, it forces software developers to adapt applications and coding mechanisms to exploit the available computing devices. Legacy source code needs to be re-written to take advantage of multi- core and many-core computing devices. Writing parallel applications in a traditional way is hard, expensive, and time consuming. Furthermore, there is often more than one possible transformation or optimization that can be applied to a single piece of legacy code. Therefore many parallel versions of the same original sequential code need to be considered. In this paper, we describe an automatic parallel source code generation workflow (REWORK) for parallel heterogeneous platforms. REWORK automatically identifies promising kernels on legacy C++ source code and generates multiple specific versions of kernels for improving C++ applications, selecting the most adequate version based on both static source code and target platform characteristics. 相似文献

16.

A thread-level parallelization of pairwise additive potential and force calculations suitable for current many-core architectures

Yoshimichi Andoh Soichiro Suzuki Satoshi Ohshima Tatsuya Sakashita Masao Ogino Takahiro Katagiri Noriyuki Yoshii Susumu Okazaki 《The Journal of supercomputing》2018,74(6):2449-2469

In molecular dynamics (MD) simulations, calculations of potentials and their derivatives by coordinate, i.e., forces, in a pairwise additive manner such as the Lennard–Jones interactions and a short-range part of the Coulombic interactions form the main part of arithmetic operations. It is essential to achieve high thread-level parallelization efficiency of these pairwise additive calculations of potentials and forces to use current supercomputers with many-core architectures effectively. In this paper, we propose four new thread-level parallelization algorithms for the pairwise additive potential and force calculations. We implement the four codes in a MD calculation code based on the fast multipole method. Performance benchmarks were taken on the FX100 supercomputer and Intel Xeon Phi coprocessor. The code succeeds in achieving high thread-level parallelization efficiency with 32 threads on the FX100 and up to 60 threads on the Xeon Phi. 相似文献

17.

NDL-v2.0: A new version of the numerical differentiation library for parallel architectures

P.E. Hadjidoukas P. Angelikopoulos C. Voglis D.G. Papageorgiou I.E. Lagaris 《Computer Physics Communications》2014

We present a new version of the numerical differentiation library (NDL) used for the numerical estimation of first and second order partial derivatives of a function by finite differencing. In this version we have restructured the serial implementation of the code so as to achieve optimal task-based parallelization. The pure shared-memory parallelization of the library has been based on the lightweight OpenMP tasking model allowing for the full extraction of the available parallelism and efficient scheduling of multiple concurrent library calls. On multicore clusters, parallelism is exploited by means of TORC, an MPI-based multi-threaded tasking library. The new MPI implementation of NDL provides optimal performance in terms of function calls and, furthermore, supports asynchronous execution of multiple library calls within legacy MPI programs. In addition, a Python interface has been implemented for all cases, exporting the functionality of our library to sequential Python codes. 相似文献

18.

An application-centric evaluation of OpenCL on multi-core CPUs

Jie Shen Jianbin Fang Henk Sips Ana Lucia Varbanescu 《Parallel Computing》2013

Although designed as a cross-platform parallel programming model, OpenCL remains mainly used for GPU programming. Nevertheless, a large amount of applications are parallelized, implemented, and eventually optimized in OpenCL. Thus, in this paper, we focus on the potential that these parallel applications have to exploit the performance of multi-core CPUs. Specifically, we analyze the method to systematically reuse and adapt the OpenCL code from GPUs to CPUs. We claim that this work is a necessary step for enabling inter-platform performance portability in OpenCL. 相似文献

19.

基于多核阵列体系结构的嵌套循环并行优化

杨子煜严明赵鹏《计算机工程与科学》2009,31(Z1)

多核处理器已广泛应用于高性能计算领域,如何有效地将传统串行程序转换为并行代码并减少程序中嵌套循环所占用时间仍是该领域的挑战性问题。本文首先基于多面体模型对嵌套循环进行依赖特征分析并实现瓦片分割,据此自动生成粗粒度并行代码。针对多核阵列处理器的结构特点,采用遗传算法生成通信优化的瓦片任务序列,在此基础上建立了有效的任务调度模型。最后将上述方法应用于LU分解,结果表明该方法与传统调度算法相比,在增加数据局部性、实现负载平衡方面具有更好效果。相似文献

20.

Introducing concurrency in sequential Java via laws

Rafael Duarte Augusto Sampaio 《Information Processing Letters》2011,111(3):129-134

Nowadays multi-core processors can be found everywhere. It is well known that one way of improving performance is by parallelization. In this paper we propose a parallelization strategy for Java using algebraic laws. We perform an experiment with two benchmarks and show that our strategy produces a gain similar to a specialized parallel version provided by the Java Grande Benchmark (JGB). 相似文献