
1.

Autonomic Coordination of Skeleton-Based Applications Over CPU/GPU Multi-Core Architectures

Mehdi Goli Horacio González–Vélez 《International journal of parallel programming》2017,45(2):203-224

Widely adumbrated as patterns of parallel computation and communication, algorithmic skeletons introduce a viable solution for efficiently programming modern heterogeneous multi-core architectures equipped not only with traditional multi-core CPUs, but also with one or more programmable Graphics Processing Units (GPUs). By systematically applying algorithmic skeletons to address complex programming tasks, it is arguably possible to separate the coordination from the computation in a parallel program, and therefore subdivide a complex program into building blocks (modules, skids, or components) that can be independently created and then used in different systems to drive multiple functionalities. By exploiting such systematic division, it is feasible to automate coordination by addressing extra-functional and non-functional features such as application performance, portability, and resource utilisation from the component level in heterogeneous multi-core architectures. In this paper, we introduce a novel approach to exploit the inherent features of skeleton-based applications in order to automatically coordinate them over heterogeneous (CPU/GPU) multi-core architectures and improve their performance. Our systematic evaluation demonstrates up to one order of magnitude speed-up on heterogeneous multi-core architectures.

2.

Parallelization of Full Search Motion Estimation Algorithm for Parallel and Distributed Platforms

Eduarda Monteiro Bruno Vizzotto Cláudio Diniz Marilena Maule Bruno Zatt Sergio Bampi 《International journal of parallel programming》2014,42(2):239-264

This work presents an efficient method to map the Full Search algorithm for Motion Estimation (ME) onto General Purpose Graphic Processing Unit (GPGPU) architectures using the Compute Unified Device Architecture (CUDA) programming model. Our method jointly exploits the massive parallelism available in current GPGPU devices and the parallelism potential of the Full Search algorithm. Our main goal is to evaluate the feasibility of implementing video codecs on GPGPUs and the advantages and drawbacks of doing so compared to other platforms. Therefore, for comparison, three solutions were developed using distinct programming paradigms for distinct underlying hardware architectures: (i) a sequential solution for a general-purpose processor (GPP); (ii) a parallel solution for a multi-core GPP using the OpenMP library; (iii) a distributed solution for cluster/grid machines using the Message Passing Interface (MPI) library. The CUDA-based solution for GPGPUs achieves speed-ups consistent with those indicated by the theoretical model for different search areas. Our GPGPU Full Search Motion Estimation provides 2×, 20× and 1664× speed-up when compared to MPI, OpenMP and sequential implementations, respectively. Compared to the state of the art, our solution reaches up to 17× speed-up.
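
As a rough illustration of why Full Search parallelizes so readily, the sketch below tests every displacement in a search window for one block. It is a plain sequential Python sketch, not the CUDA mapping described in the abstract; the sum of absolute differences (SAD) cost, the function names `sad` and `full_search`, and the frame layout (lists of rows) are all assumptions made for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy). SAD is an assumed cost."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, bx, by, n, search_range):
    """Exhaustively test every displacement in the search window and
    return (best_cost, (dx, dy)) for the block at (bx, by)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if x < 0 or y < 0 or x + n > width or y + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            # Strict comparison keeps the first candidate on ties.
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best
```

Every candidate displacement, and every block, is evaluated independently of the others, which is the property a GPU or multi-core mapping can exploit.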

3.

A Memory and Computation Efficient Sparse Level-Set Method

Wladimir J. van der Laan Andrei C. Jalba Jos B. T. M. Roerdink 《Journal of scientific computing》2011,46(2):243-264

Since its introduction, the level set method has become the favorite technique for capturing and tracking moving interfaces, and has found applications in a wide variety of scientific fields. In this paper we present efficient data structures and algorithms for tracking dynamic interfaces through the level set method. Several approaches which address both computational and memory requirements have been very recently introduced. We show that our method is up to 8.5 times faster than these recent approaches. More importantly, our algorithm can greatly benefit from both fine- and coarse-grain parallelization by leveraging SIMD and/or multi-core parallel architectures.

4.

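Acceleration and Optimization of Grouped Convolution Computation on the Sunway Many-Core Architecture

Wang Xin (王鑫) Zhang Ming (张铭) 《Application Research of Computers (计算机应用研究)》2023,40(6):1745-1749

To address the high computational complexity and the large computation and parameter counts of convolution with ordinary convolution structures, a parallel grouped convolution algorithm is proposed for the domestically produced SW26010P many-core processor. The core idea is to use a distinctive data layout and to map the processing onto multiple cores for parallel computation. Experimental results show that, compared with a single-core serial algorithm, the parallel grouped convolution algorithm achieves a maximum speed-up of 79.5 and a maximum effective computing power of 186.7 MFLOPS. After data-parallel optimization of the parallel grouped convolution algorithm with SIMD instructions, a maximum speed-up of 10.2 is obtained over the unoptimized parallel grouped convolution algorithm.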
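
For readers unfamiliar with grouped convolution, the following sequential Python sketch shows the operation itself. It does not reproduce the paper's data layout or its SW26010P core mapping; the function name `grouped_conv2d`, the nested-list tensor layout, and the choice of stride 1 with no padding are assumptions for illustration.

```python
def grouped_conv2d(x, w, groups):
    """Grouped 2-D convolution (stride 1, no padding).

    x: input as [C_in][H][W]; w: weights as [C_out][C_in // groups][K][K].
    Each output channel only reads the input channels of its own group.
    """
    c_in, c_out = len(x), len(w)
    assert c_in % groups == 0 and c_out % groups == 0
    in_per_g, out_per_g = c_in // groups, c_out // groups
    k = len(w[0][0])
    h_out, w_out = len(x[0]) - k + 1, len(x[0][0]) - k + 1
    out = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for g in range(groups):  # groups share no data, so they can run in parallel
        for oc in range(g * out_per_g, (g + 1) * out_per_g):
            for ic in range(in_per_g):
                src = x[g * in_per_g + ic]
                ker = w[oc][ic]
                for i in range(h_out):
                    for j in range(w_out):
                        out[oc][i][j] += sum(
                            ker[a][b] * src[i + a][j + b]
                            for a in range(k)
                            for b in range(k)
                        )
    return out
```

With `groups = g`, the weight tensor holds C_out × (C_in / g) × K² values, a factor of g fewer than an ordinary convolution, and the groups are independent units of work.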

5.

A framework for parallel computational physics algorithms on multi-core: SPH in parallel

David W. Holmes John R. Williams Peter Tilke 《Advances in Engineering Software》2011,42(11):999-1008

In this paper, a simulation framework that enables distributed numerical computing in multi-core shared-memory environments is presented. Using multiple threads allows a single memory image to be shared concurrently across cores but potentially introduces race conditions. Race conditions can be avoided by ensuring each core operates on an isolated memory block. This is usually achieved by running a different operating system process on each core, such as multiple MPI processes. However, we show that in many computational physics problems, memory isolation can also be enforced within a single process by leveraging spatial sub-division of the physical domain. A new spatial sub-division algorithm is presented that ensures threads operate on different memory blocks, allowing for in-place updates of state, with no message passing or creation of local variables during time stepping. Additionally, the developed framework controls task distribution dynamically, ensuring an event-based load balance. Results from fluid mechanics analysis using Smoothed Particle Hydrodynamics (SPH) are presented, demonstrating performance that scales linearly with the number of cores.

6.

A queueing theoretic approach for performance evaluation of low-power multi-core embedded systems

Arslan Munir Ann Gordon-Ross Sanjay Ranka Farinaz Koushanfar 《Journal of Parallel and Distributed Computing》2014

With Moore’s law supplying billions of transistors on-chip, embedded systems are undergoing a transition from single-core to multi-core to exploit this high transistor density for high performance. However, the optimal layout of these multiple cores along with the memory subsystem (caches and main memory) to satisfy power, area, and stringent real-time constraints is a challenging design endeavor. The short time-to-market constraint of embedded systems exacerbates this design challenge and necessitates the architectural modeling of embedded systems to reduce the time-to-market by expediting target applications to device/architecture mapping. In this paper, we present a queueing theoretic approach for modeling multi-core embedded systems that provides a quick and inexpensive performance evaluation both in terms of time and resources as compared to the development of multi-core simulators and running benchmarks on these simulators. We verify our queueing theoretic modeling approach by running SPLASH-2 benchmarks on the SuperESCalar simulator (SESC). Results reveal that our queueing theoretic model qualitatively evaluates multi-core architectures accurately with an average difference of 5.6% as compared to the architectures’ evaluations from the SESC simulator. Our modeling approach can be used for performance per watt and performance per unit area characterizations of multi-core embedded architectures, with varying numbers of processor cores and cache configurations, to provide a comparative analysis.

7.

Parallel cube computation on modern CPUs and GPUs

Guoliang Zhou Hong Chen 《The Journal of supercomputing》2012,61(3):394-417

With the popularity of column-store databases, modern multi-core CPUs, and general-purpose computing on graphics processing units (GPGPUs), there will be radical changes in how processing is done in the online analytical processing (OLAP) and data warehousing fields. Cube computation is a core and time-consuming problem which has been researched extensively. However, most of the algorithms have been proposed without considering the prevalent multi-core architectures and column storage. This paper presents a new parallel cube algorithm that takes advantage of multi-core architectures. We first propose a cache-conscious bottom-up computation (BUC) algorithm called CC-BUC that adopts an integrated bottom-up and breadth-first partitioning order. Each dimension is separately stored and processed. In processing each dimension, breadth-first data scanning and result output reduce memory I/O and enhance cache locality. Cache misses are confined to the scope of a single dimension, and translation lookaside buffer (TLB) misses are reduced. Based on CC-BUC, we give a multi-core architecture-based cube algorithm called MC-Cubing. Multiple partitions are processed simultaneously and multiple threads execute in parallel inside each partition. MC-Cubing is well matched to multi-core architectures and exposes high parallelism. The layout and associated algorithms take advantage of single instruction, multiple data (SIMD) instructions and thread-level parallelism (TLP). We implement and demonstrate the effectiveness of MC-Cubing on two multi-core architectures: multi-core CPUs and GPUs. Experimental results show that the MC-Cubing algorithm runs nearly six times faster than BUC on real datasets.

8.

ERI sorting for emerging processor architectures

Tirath Ramdas Gregory K. Egan 《Computer Physics Communications》2009,180(8):1221-1229

Electron Repulsion Integrals (ERIs) are a common bottleneck in ab initio computational chemistry. It is known that sorted/reordered execution of ERIs results in efficient SIMD/vector processing. This paper shows that reconfigurable computing and heterogeneous processor architectures can also benefit from a deliberate ordering of ERI tasks. However, realizing these benefits as net speedup requires a very rapid sorting mechanism. This paper presents two such mechanisms. Included in this study are analytical, simulation-based, and experimental benchmarking approaches to consider five use cases for ERI sorting, i.e. SIMD processing, reconfigurable computing, limited address spaces, instruction cache exploitation, and data cache exploitation. Specific consideration is given to existing cache-based processors, FPGAs, and the Cell Broadband Engine processor. It is proposed that the analyses conducted in this work should be built upon to aid the development of software autotuners which will produce efficient ab initio computational chemistry codes for a variety of computer architectures.

9.

On the design of high-performance algorithms for aligning multiple protein sequences on mesh-based multiprocessor architectures

Diana H.P. Low Bharadwaj Veeravalli David A. Bader 《Journal of Parallel and Distributed Computing》2007

In this paper, we address the problem of multiple sequence alignment (MSA) for handling a very large number of protein sequences on mesh-based multiprocessor architectures. As the problem has been conclusively shown to be computationally complex, we employ the divisible load paradigm (also referred to as divisible load theory, DLT) to handle such a large number of sequences. We design an efficient computational engine that is capable of conducting MSAs by exploiting the underlying parallelism embedded in the computational steps of multiple sequence algorithms. Specifically, we consider the standard Smith–Waterman (SW) algorithm in our implementation; however, our approach is by no means restricted to the SW class of algorithms alone. The treatment used in this paper is generic to a class of similar dynamic programming problems. Our approach is recursive in the sense that the quality of solutions can be refined continuously until an acceptable level of quality is achieved. After the first phase of computation, we design a heuristic scheme that renders the final solution for MSA. We conduct rigorous simulation experiments using several hundred homologous protein sequences derived from the Rattus norvegicus and Mus musculus databases of olfactory receptors. We quantify the performance based on the speed-up metric. We compare our algorithms to serial or single-machine processing approaches. We validate our findings by comparing with the conventional equal load partitioning (ELP) strategy that is commonly used in the parallel processing literature. Based on our extensive simulation study, we observe that the DLT paradigm offers excellent speed-up characteristics and provides avenues for its use in several other problems related to biological sequence processing. This study is a first attempt at using the DLT paradigm to devise efficient strategies to handle the large-scale multiple protein sequence alignment problem on mesh-based multiprocessor systems.

10.

Graph grammar‐driven parallel partial differential equation solver

Maciej Paszyński Robert Schaefer 《Concurrency and Computation》2010,22(9):1063-1097

The paper presents an extension of the composite programmable graph grammar (CP‐graph grammar) suitable for modeling the parallel direct solver algorithm utilized by the hp finite element method (hp‐FEM). In the proposed graph grammar model, the computational mesh is represented by a CP‐graph. The presented graph grammar models the solver algorithm by a set of graph grammar productions. The graph grammar model makes it possible to examine the concurrency of the algorithm by analyzing the interdependence between the atomic tasks, tasks and super‐tasks. The atomic tasks correspond to the graph grammar productions, representing basic indivisible parts of the algorithms. The level of atomic tasks models the concurrency for the shared memory architectures. On the other hand, the tasks correspond to the groups of atomic tasks with predefined inter‐task communication channels. They constitute the grain for the decomposition of the parallel algorithm for the distributed memory architecture. Finally, the super‐tasks correspond to a group of tasks resulting from the execution of the load balancing algorithm. The solver algorithm is tested on a distributed-memory Linux cluster for up to 192 processors. Copyright © 2009 John Wiley & Sons, Ltd.

11.

A bridging model for multi-core computing

Leslie G. Valiant 《Journal of Computer and System Sciences》2011,77(1):154-166

Writing software for one parallel system is a feasible though arduous task. Reusing the substantial intellectual effort so expended for programming a second system has proved much more challenging. In sequential computing, algorithms textbooks and portable software are resources that enable software systems to be written that are efficiently portable across changing hardware platforms. These resources are currently lacking in the area of multi-core architectures, where a programmer seeking high performance has no comparable opportunity to build on the intellectual efforts of others. In order to address this problem we propose a bridging model aimed at capturing the most basic resource parameters of multi-core architectures. We suggest that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully expended in designing portable algorithms, once and for all, for such a bridging model. Portable algorithms would contain efficient designs for all reasonable combinations of the basic resource parameters and input sizes, and would form the basis for implementation or compilation for particular machines. Our Multi-BSP model is a multi-level model that has explicit parameters for processor numbers, memory/cache sizes, communication costs, and synchronization costs. The lowest level corresponds to shared memory or the PRAM, acknowledging the relevance of that model for whatever limitations on memory and processor numbers it may be efficacious to emulate it. We propose parameter-aware portable algorithms that run efficiently on all relevant architectures with any number of levels and any combination of parameters. For these algorithms we define a parameter-free notion of optimality. We show that for several fundamental problems, including standard matrix multiplication, the Fast Fourier Transform, and comparison sorting, there exist optimal portable algorithms in that sense, for all combinations of machine parameters. Thus some algorithmic generality and elegance can be found in this many-parameter setting.

12.

Thread-Parallel Integrated Test Pattern Generator Utilizing Satisfiability Analysis

Alexander Czutro Ilia Polian Matthew Lewis Piet Engelke Sudhakar M. Reddy Bernd Becker 《International journal of parallel programming》2010,38(3-4):185-202

Efficient utilization of the inherent parallelism of multi-core architectures is a grand challenge in the field of electronic design automation (EDA). One EDA algorithm associated with a high computational cost is automatic test pattern generation (ATPG). We present the ATPG tool TIGUAN based on a thread-parallel SAT solver. Due to a tight integration of the SAT engine into the ATPG algorithm and a carefully chosen mix of various optimization techniques, multi-million-gate industrial circuits are handled without aborts. TIGUAN supports both conventional single-stuck-at faults and sophisticated conditional multiple stuck-at faults, which allows patterns to be generated for non-standard fault models. We demonstrate how TIGUAN can be combined with conventional structural ATPG to extract the full benefit of the intrinsic strengths of both approaches.

13.

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Julien C. Thibault Inanc Senocak 《The Journal of supercomputing》2012,59(2):693-719

Graphics processing units (GPUs), originally designed for graphics rendering, have emerged as massively-parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory-hierarchy-specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.

14.

Efficient generation of large-scale pareto-optimal topologies

Krishnan Suresh 《Structural and Multidisciplinary Optimization》2013,47(1):49-61

The objective of this paper is to introduce an efficient algorithm and implementation for large-scale 3-D topology optimization. The proposed algorithm is an extension of a recently proposed 2-D topological-sensitivity based method that can generate numerous Pareto-optimal topologies up to a desired volume fraction, in a single pass. In this paper, we show how the computational challenges in 3-D can be overcome. In particular, we consider an arbitrary 3-D domain-space that is discretized via hexahedral/brick finite elements. Exploiting congruence between elements, we propose a matrix-free implementation of the finite element method. The latter exploits modern multi-core architectures to efficiently solve topology optimization problems involving millions of degrees of freedom. The proposed methodology is illustrated through numerical experiments; comparisons are made against previously published results.

15.

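Research on a Multi-Application Task Mapping Method for Multi-Core Systems

Zhang Boquan (张伯泉) Fei Ting (费亭) Song Zongfeng (宋宗峰) 《Application Research of Computers (计算机应用研究)》2017,34(2)

In multi-core processor systems, the way multiple computing tasks are mapped onto the processor cores is critical to system throughput. To address this problem, a new algorithm for mapping multi-application tasks onto multiple cores is proposed. The algorithm predicts the relevant performance characteristics of an application before it arrives and uses branch and bound to reserve, in advance, a suitable geometric region of cores for future applications. When the application actually arrives, the mapping is completed according to the reserved region. Experimental results show that, compared with other traditional algorithms, the algorithm performs well in reducing multi-task communication volume and in improving the throughput of the multi-core system.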

16.

A parallel workload balanced and memory efficient lattice-Boltzmann algorithm with single unit BGK relaxation time for laminar Newtonian flows

David Vidal Robert Roy 《Computers & Fluids》2010,39(8):1411-1423

A parallel workload balanced and memory efficient lattice-Boltzmann algorithm for laminar Newtonian fluid flow through large porous media is investigated. It relies on a simplified LBM scheme using a single unit BGK relaxation time, which is implemented by means of a shift algorithm and comprises an even fluid node partitioning domain decomposition strategy based on a vector data structure. It provides perfect parallel workload balance, and its two-nearest-neighbour communication pattern combined with a simple data transfer layout results in 20-55% lower communication cost, 25-60% higher computational parallel performance and 40-90% lower memory usage than previously reported LBM algorithms. Performance tests carried out using scale-up and speed-up case studies of laminar Newtonian fluid flow through hexagonal packings of cylinders and a random packing of polydisperse spheres on two different computer architectures reveal parallel efficiencies with 128 processors as high as 75% for domain sizes comprising more than 5 billion fluid nodes.
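
The simplification can be seen from the standard single-relaxation-time (BGK) lattice-Boltzmann update, written here in generic notation that is not taken from the paper: f_i are the particle populations, c_i the lattice velocities, and τ the relaxation time in lattice units. Reading "single unit relaxation time" as τ = 1 is an assumption.

```latex
% Standard BGK lattice-Boltzmann update (generic notation, not the paper's)
f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\; t + \Delta t)
  = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t) \right]

% With \tau = 1 the old populations cancel:
f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\; t + \Delta t) = f_i^{\mathrm{eq}}(\mathbf{x}, t)
```

Under that reading, each node only needs the macroscopic quantities that define the equilibrium distribution, which would be consistent with the lower memory usage the abstract reports.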

17.

Parallel probabilistic relaxation labelling based on Markov random fields for spectral-spatial hyperspectral image classification

Brajesh Kumar Onkar Dikshit 《International journal of remote sensing》2016,37(18):4356-4379

The large volume of data and computational complexity of algorithms limit the application of hyperspectral image classification to real-time operations. This work addresses the use of different parallel processing techniques to speed up the Markov random field (MRF)-based method to perform spectral-spatial classification of hyperspectral imagery. The Metropolis relaxation labelling approach is modified to take advantage of multi-core central processing units (CPUs) and to adapt it to massively parallel processing systems like graphics processing units (GPUs). The experiments on different hyperspectral data sets revealed that the implementation approach has a huge impact on the execution time of the algorithm. The results demonstrated that the modified MRF algorithm produced classification accuracy similar to conventional methods with greatly improved computational performance. With modern multi-core CPUs, good computational speed-up can be achieved even without additional hardware support. The CPU-GPU hybrid framework rendered the otherwise computationally expensive approach suitable for time-constrained applications.

18.

High performance combinatorial algorithm design on the Cell Broadband Engine processor

《Parallel Computing》2007,33(10-11):720-740

The Sony–Toshiba–IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on Cell/B.E.: we discuss list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library. List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to efficiently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation. Our second case study, zlib, is a data compression/decompression library that is extensively used in both scientific as well as general purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm and Huffman data encoding. We design efficient parallel algorithms for these combinatorial kernels, and exploit concurrency at multiple levels on the Cell/B.E. processor. We also present a Cell/B.E. optimized implementation of gzip, a popular file-compression application based on the zlib library. For our Cell/B.E. implementation of gzip, we achieve an average speedup of 2.9 in compression over current workstations.
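
To make the list ranking problem concrete, here is a minimal Python sketch using classic pointer jumping (Wyllie's algorithm). This is a textbook technique substituted for illustration, not the SM-Threads approach of the article; the function name `list_rank` and the use of `-1` as the tail marker are assumptions.

```python
def list_rank(succ):
    """succ[i] is the index of node i's successor, or -1 at the tail.
    Returns rank[i] = number of links from node i to the tail."""
    n = len(succ)
    nxt = list(succ)
    rank = [0 if s == -1 else 1 for s in succ]
    # Each round doubles the distance every pointer covers, so the loop
    # finishes after about log2(n) rounds.
    while any(s != -1 for s in nxt):
        new_rank, new_nxt = list(rank), list(nxt)
        for i in range(n):  # iterations within a round are independent
            if nxt[i] != -1:
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
        rank, nxt = new_rank, new_nxt
    return rank
```

The reads of `rank[nxt[i]]` and `nxt[nxt[i]]` jump to arbitrary positions in memory, which illustrates the irregular access pattern the abstract identifies as the difficulty.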

19.

Fast anomaly detection in hyperspectral images with RX method on heterogeneous clusters

J. M. Molero A. Paz E. M. Garzón J. A. Martínez A. Plaza I. García 《The Journal of supercomputing》2011,58(3):411-419

Remotely sensed hyperspectral sensors provide image data containing rich information in both the spatial and the spectral domain, and this information can be used to address detection tasks in many applications. One of the most widely used and successful algorithms for anomaly detection in hyperspectral images is the RX algorithm. Despite its wide acceptance and high computational complexity when applied to real hyperspectral scenes, few approaches have been developed for parallel implementation of this algorithm. In this paper, we evaluate the suitability of using a hybrid parallel implementation with a high-dimensional hyperspectral scene. A general strategy to automatically map parallel hybrid anomaly detection algorithms for hyperspectral image analysis has been developed. Parallel RX has been tested on a heterogeneous cluster using this routine. The considered approach is quantitatively evaluated using hyperspectral data collected by NASA’s Airborne Visible Infra-Red Imaging Spectrometer system over the World Trade Center in New York, 5 days after the terrorist attacks. The numerical effectiveness of the algorithms is evaluated by means of their capacity to automatically detect the thermal hot spot of fires (anomalies). The speedups achieved show that a cluster of multi-core nodes can highly accelerate the RX algorithm.
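
For context, the RX (Reed-Xiaoli) detector in its commonly cited global form scores each pixel spectrum by its Mahalanobis distance from the background statistics. The notation below is generic and assumed, not taken from the paper, and the abstract does not say which RX variant was parallelized: x is a pixel spectrum, μ the mean spectrum, and Σ the spectral covariance matrix.

```latex
\delta_{\mathrm{RX}}(\mathbf{x})
  = (\mathbf{x} - \boldsymbol{\mu})^{\top}\, \boldsymbol{\Sigma}^{-1}\, (\mathbf{x} - \boldsymbol{\mu})
```

Once μ and Σ are available, the score of each pixel is independent of every other pixel, so the image can be partitioned across nodes and cores.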

20.

Data compression by volume prototypes for streaming data

Kenji Tabata Mineichi Kudo 《Pattern recognition》2010,43(9):3162-3176

In recent years, we have often had to deal with an enormous amount of data in a large variety of pattern recognition tasks. Such data require a huge amount of memory space and computation time for processing. One approach to coping with these problems is to use prototypes. We propose volume prototypes as an extension of traditional point prototypes. A volume prototype is defined as a geometric configuration that represents the data points inside it. In usage, a volume prototype is akin to a data point rather than to a component of a mixture model. We present a one-pass algorithm for obtaining such prototypes from a data stream, along with an application to classification. An oblivion mechanism is also incorporated to adapt to concept drift.