
1.

dOpenCL: Towards uniform programming of distributed heterogeneous multi-/many-core systems

Philipp Kegel Michel Steuwer Sergei Gorlatch 《Journal of Parallel and Distributed Computing》2013

Modern computer systems are becoming increasingly distributed and heterogeneous, comprising multi-core CPUs, GPUs, and other accelerators. Current programming approaches for such systems usually require the application developer to combine several programming models (e.g., MPI with OpenCL or CUDA) in order to exploit the system’s full performance potential. In this paper, we present dOpenCL (distributed OpenCL), a uniform approach to programming distributed heterogeneous systems with accelerators. dOpenCL allows the user to run existing OpenCL applications unmodified in a heterogeneous distributed environment. We describe the challenges of implementing the OpenCL programming model for distributed systems, as well as its extension for running multiple applications concurrently. Using several example applications, we compare the performance of dOpenCL with MPI + OpenCL and standard OpenCL implementations.

2.

Toward formally-based design of message passing programs

Gorlatch S. 《IEEE transactions on pattern analysis and machine intelligence》2000,26(3):276-288

This paper presents a systematic approach to the development of message passing programs. Our programming model is SPMD, with communications restricted to collective operations: scan, reduction, gather, etc. The design process in such an architecture-independent language is based on correctness-preserving transformation rules that are provable in a formal functional framework. We develop a set of design rules for composition and decomposition. For example, scan followed by reduction is replaced by a single reduction, and global reduction is decomposed into two faster operations. The impact of the design rules on the target performance is estimated analytically and tested in machine experiments. As a case study, we design two provably correct, efficient programs using the Message Passing Interface (MPI) for the famous maximum segment sum problem, starting from an intuitive, but inefficient, algorithm specification.
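The composition rule mentioned in this abstract (a scan followed by a reduction becoming a single reduction) can be illustrated with a small sequential sketch. This is not the paper's formulation or its MPI code: the choice of operators (multiplication for the scan, addition for the reduction), the pair encoding, and the function names are all assumptions made for illustration. The sketch relies on multiplication distributing over addition and requires a non-empty input list.

```python
from functools import reduce
from itertools import accumulate
import operator


def scan_then_reduce(xs):
    """Two-step version: prefix products (scan), then their sum (reduction)."""
    prefixes = accumulate(xs, operator.mul)
    return reduce(operator.add, prefixes)


def fused_reduce(xs):
    """Single reduction over pairs (hypothetical encoding).

    Each segment is summarized as (s, r), where s is the sum of the
    segment's prefix products and r is the product of the whole segment.
    """
    def combine(left, right):
        s1, r1 = left
        s2, r2 = right
        # Prefix products of the right segment are all scaled by r1.
        return (s1 + r1 * s2, r1 * r2)

    pairs = [(x, x) for x in xs]
    return reduce(combine, pairs)[0]
```

Because `combine` is associative, the fused version could in principle be evaluated as one tree-shaped collective reduction rather than a scan followed by a reduction; the sketch only checks that both versions agree on the result.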

3.

dOCAL: high-level distributed programming with OpenCL and CUDA

Rasch Ari Bigge Julian Wrodarczyk Martin Schulze Richard Gorlatch Sergei 《The Journal of supercomputing》2020,76(7):5117-5138

In the state-of-the-art parallel programming approaches OpenCL and CUDA, so-called host code is required for a program’s execution. Efficiently implementing host code is often a cumbersome task, especially when executing OpenCL and CUDA programs on systems with multiple nodes, each comprising different devices, e.g., multi-core CPUs and graphics processing units. The programmer is responsible for explicitly managing the nodes’ and devices’ memory, for synchronizing computations with data transfers between devices of potentially different nodes, and for optimizing data transfers between devices’ memories and nodes’ main memories, e.g., by using pinned main memory for accelerating data transfers and overlapping the transfers with computations. We develop the distributed OpenCL/CUDA abstraction layer (dOCAL), a novel high-level C++ library that simplifies the development of host code. dOCAL combines major advantages over the state-of-the-art high-level approaches: (1) it simplifies implementing both OpenCL and CUDA host code by providing a simple-to-use, high-level abstraction API; (2) it supports executing arbitrary OpenCL and CUDA programs; (3) it allows conveniently targeting the devices of different nodes by automatically managing node-to-node communications; (4) it simplifies implementing data transfer optimizations by providing different, specially allocated memory regions, e.g., pinned main memory for overlapping data transfers with computations; (5) it optimizes memory management by automatically avoiding unnecessary data transfers; (6) it enables interoperability between OpenCL and CUDA host code for systems with devices from different vendors. Our experiments show that dOCAL significantly simplifies the development of host code for heterogeneous and distributed systems, with a low runtime overhead.


4.

A formally based parallelization of data mining algorithms for multi-core systems

Kholod Ivan Shorov Andrey Titkov Evgenii Gorlatch Sergei 《The Journal of supercomputing》2019,75(12):7909-7920

We describe a novel, systematic approach to efficiently parallelizing data mining algorithms: starting with the representation of an algorithm as a sequential composition of functions, we formally transform it into a parallel form using higher-order functions for specifying parallelism. We implement the approach as an extension of the industrial-strength Java-based library Xelopes, and we illustrate its use by developing a multi-threaded Java program for the popular naive Bayes classification algorithm. In comparison with the popular MapReduce programming model, our resulting programs enable not only data-parallel, but also task-parallel implementation and a combination of both. Our experiments demonstrate an efficient parallelization and good scalability on multi-core processors.


5.

Parallelization of divide-and-conquer in the Bird-Meertens formalism (Cited by: 1; self-citations: 0; citations by others: 1)

Sergei Gorlatch Christian Lengauer 《Formal Aspects of Computing》1995,7(6):663-682

An SPMD parallel implementation schema for divide-and-conquer specifications is proposed and derived by formal refinement (transformation) of the specification schema. The specification is in the form of a mutually recursive functional definition. In a first phase, a parallel functional program schema is constructed which consists of a communication tree and a functional program that is shared by all nodes of the tree. The fact that this phase proceeds by semantics-preserving transformations in the Bird-Meertens formalism of higher-order functions guarantees the correctness of the resulting functional implementation. A second phase yields an imperative distributed message-passing implementation of this schema. The derivation process is illustrated with an example: a two-dimensional numerical integration algorithm. Parts of this paper were presented at the International Parallel Processing Symposium, Mexico, 1994 [GoL94] and at the World Transputer Congress, Italy, 1994 [Gor94].

6.

The Static Parallelization of Loops and Recursions

Lengauer Christian Gorlatch Sergei Herrmann Christoph 《The Journal of supercomputing》1997,11(4):333-353

We demonstrate approaches to the static parallelization of loops and recursions on the example of the polynomial product. Phrased as a loop nest, the polynomial product can be parallelized automatically by applying a space-time mapping technique based on linear algebra and linear programming. One can choose a parallel program that is optimal with respect to some objective function like the number of execution steps, processors, channels, etc. However, at best, linear execution time complexity can be attained. Through phrasing the polynomial product as a divide-and-conquer recursion, one can obtain a parallel program with sublinear execution time. In this case, the target program is not derived by an automatic search but given as a program skeleton, which can be deduced by a sequence of equational program transformations. We discuss the use of such skeletons, compare and assess the models in which loops and divide-and-conquer recursions are parallelized, and comment on the performance properties of the resulting parallel implementations.
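The two phrasings of the polynomial product contrasted in this abstract can be sketched sequentially as follows. This is only an illustration of the loop-nest and divide-and-conquer forms, not the paper's space-time mapping or its skeleton; the function names are hypothetical, and the recursive version assumes both coefficient lists have the same power-of-two length.

```python
def poly_mul_loops(a, b):
    """Loop-nest form: every pair (i, j) contributes to coefficient i + j."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


def poly_mul_dc(a, b):
    """Divide-and-conquer form (assumes len(a) == len(b) == a power of two).

    Each polynomial is split into a low and a high half; the four
    sub-products are independent of one another, which is the property
    a parallel implementation could exploit.
    """
    n = len(a)
    if n == 1:
        return [a[0] * b[0]]
    h = n // 2
    a_lo, a_hi = a[:h], a[h:]
    b_lo, b_hi = b[:h], b[h:]
    ll = poly_mul_dc(a_lo, b_lo)
    lh = poly_mul_dc(a_lo, b_hi)
    hl = poly_mul_dc(a_hi, b_lo)
    hh = poly_mul_dc(a_hi, b_hi)
    c = [0] * (2 * n - 1)
    for k in range(len(ll)):
        c[k] += ll[k]                  # low * low, offset 0
        c[k + h] += lh[k] + hl[k]      # cross terms, offset h
        c[k + n] += hh[k]              # high * high, offset n
    return c
```

Coefficients are stored lowest degree first, so `[1, 2]` stands for 1 + 2x. Run sequentially, both versions perform a quadratic number of multiplications; the difference that matters for the abstract's argument is the dependence structure, not the operation count.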

7.

Programming with Divide-and-Conquer Skeletons: A Case Study of FFT

Gorlatch Sergei 《The Journal of supercomputing》1998,12(1-2):85-97

We demonstrate an approach to parallel programming, based on skeletons – parameterized program schemas with efficient implementations over diverse architectures. The contribution of the paper is two-fold: (1) we classify divide-and-conquer (DC) algorithms and provide a family of provably correct parallel implementations for a particular DC skeleton, called DH (distributable homomorphism); (2) we adjust the mathematical specification of the Fast Fourier Transform (FFT) to the DH skeleton and, thereby, obtain a generic SPMD program, well suited for implementation under MPI. The generic program includes the efficient FFT solutions used in practice – the binary-exchange and the 2D- and 3D-transpose implementations – as special cases.
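For readers unfamiliar with the divide-and-conquer structure of the FFT that this abstract builds on, a textbook recursive radix-2 version is sketched below. It is not the DH skeleton or the generic SPMD program from the paper; the function names are hypothetical, and the input length is assumed to be a power of two. A direct DFT is included only as a reference to compare against.

```python
import cmath


def fft(xs):
    """Recursive radix-2 FFT (assumes len(xs) is a power of two)."""
    n = len(xs)
    if n == 1:
        return list(xs)
    # Divide: transform even- and odd-indexed samples independently.
    even = fft(xs[0::2])
    odd = fft(xs[1::2])
    # Conquer: combine the two half-size results with twiddle factors.
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def naive_dft(xs):
    """Direct O(n^2) evaluation of the DFT definition, for comparison."""
    n = len(xs)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * j * k / n) for j, x in enumerate(xs))
        for k in range(n)
    ]
```

The two recursive calls do not depend on each other, and the combine step pairs element k with element k + n/2; that regular pattern is what makes the algorithm a natural candidate for a divide-and-conquer skeleton.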

8.

Deductive Verification of Reflex Programs

Anureev I. S. Garanina N. O. Lyakh T. V. Rozov A. S. Zyubin V. E. Gorlatch S. P. 《Programming and Computer Software》2020,46(4):261-272

This paper presents a new two-step verification method for control software. The novelty of the method is that it reduces the verification of the temporal...

9.

Parallelization of the self-organized maps algorithm for federated learning on distributed sources

Kholod Ivan Rukavitsyn Andrey Paznikov Alexey Gorlatch Sergei 《The Journal of supercomputing》2021,77(6):6197-6213

This paper describes a formally based approach for parallelizing the Kohonen algorithm used for the federated learning process in a special kind of neural...

10.

Comparing GPU-parallelized metaheuristics to branch-and-bound for batch plants optimization

Borisenko Andrey Gorlatch Sergei 《The Journal of supercomputing》2019,75(12):7921-7933

We systematically compare two approaches to the optimal design of multiproduct batch plants that are widely used, e.g., in the chemical industry. Deterministic...