
1.

Performance and energy effects on task-based parallelized applications

Helena Caminal Diego Caballero Juan M. Cebrián Roger Ferrer Marc Casas Miquel Moretó Xavier Martorell Mateo Valero 《The Journal of supercomputing》2018,74(6):2627-2637

Heterogeneity, parallelization and vectorization are key techniques to improve the performance and energy efficiency of modern computing systems. However, programming and maintaining code for these architectures poses a huge challenge due to the ever-increasing architecture complexity. Task-based environments hide most of this complexity, improving scalability and usage of the available resources. In these environments, while there has been a lot of effort to ease parallelization and improve the usage of heterogeneous resources, vectorization has been considered a secondary objective. Furthermore, there has been a swift and unstoppable burst of vector architectures at all market segments, from embedded to HPC. Vectorization can no longer be ignored, but manual vectorization is tedious, error-prone and not practical for the average programmer. This work evaluates the feasibility of user-directed vectorization in task-based applications. Our evaluation is based on the OmpSs programming model, extended to support user-directed vectorization for different SIMD architectures (i.e., SSE, AVX2, AVX512). Results show that user-directed codes achieve manually optimized code performance and energy efficiency with minimal code modifications, favoring portability across different SIMD architectures.

2.

Multi-thread implementations of the lattice Boltzmann method on non-uniform grids for CPUs and GPUs

M. Schönherr K. Kucher M. Geier M. Stiebler S. Freudiger M. Krafczyk 《Computers & Mathematics with Applications》2011,61(12):3730-3743

Two multi-thread based parallel implementations of the lattice Boltzmann method for non-uniform grids on different hardware platforms are compared in this paper: a multi-core CPU implementation and an implementation on General Purpose Graphics Processing Units (GPGPU). Both codes employ second order accurate compact interpolation at the interfaces, coupling grids of different resolutions. Since the compact interpolation technique is both simple and accurate, it produces almost no computational overhead as compared to the lattice Boltzmann method for uniform grids in terms of node updates per second. To the best of our knowledge, the current paper presents the first study on multi-core parallelization of the lattice Boltzmann method with inhomogeneous grid spacing and nested time stepping for both CPUs and GPUs.

3.

Parallel online spatial and temporal aggregations on multi-core CPUs and many-core GPUs

《Information Systems》2014

With the increasing availability of locating and navigation technologies on portable wireless devices, huge amounts of location data are being captured at ever growing rates. Spatial and temporal aggregations in an Online Analytical Processing (OLAP) setting for the large-scale ubiquitous urban sensing data play an important role in understanding urban dynamics and facilitating decision making. Unfortunately, existing spatial, temporal and spatiotemporal OLAP techniques are mostly based on traditional computing frameworks, i.e., disk-resident systems on uniprocessors based on serial algorithms, which makes them incapable of handling large-scale data on parallel hardware architectures that have already been equipped with commodity computers. In this study, we report our designs, implementations and experiments on developing a data management platform and a set of parallel techniques to support high-performance online spatial and temporal aggregations on multi-core CPUs and many-core Graphics Processing Units (GPUs). Our experiment results show that we are able to spatially associate nearly 170 million taxi pickup location points with their nearest street segments among 147,011 candidates in about 5–25 s on both an Nvidia Quadro 6000 GPU device and dual Intel Xeon E5405 quad-core CPUs when their Vector Processing Units (VPUs) are utilized for computing intensive tasks. After spatially associating points with road segments, spatial, temporal and spatiotemporal aggregations are reduced to relational aggregations and can be processed in the order of a fraction of a second on both GPUs and multi-core CPUs. 
In addition to demonstrating the feasibility of building a high-performance OLAP system for processing large-scale taxi trip data for real-time, interactive data explorations, our work also opens the paths to achieving even higher OLAP query efficiency for large-scale applications through integrating domain-specific data management platforms, novel parallel data structures and algorithm designs, and hardware architecture friendly implementations.

4.

Data-intensive document clustering on graphics processing unit (GPU) clusters

Yongpeng Zhang Frank Mueller Xiaohui Cui 《Journal of Parallel and Distributed Computing》2011,71(2):211-224

Document clustering is a central method to mine massive amounts of data. Due to the explosion of raw documents generated on the Internet and the necessity to analyze them efficiently in various intelligent information systems, clustering techniques have reached their limitations on single processors. Instead of single processors, general-purpose multi-core chips are increasingly deployed in response to diminishing returns in single-processor speedup due to the frequency wall, but multi-core benefits only provide linear speedups while the number of documents on the Internet is growing exponentially. Accelerating hardware devices represent a novel promise for improving the performance for data-intensive problems such as document clustering. They offer more radical designs with a higher level of parallelism but require adaptation to novel programming environments. In this paper, we assess the benefits of exploiting the computational power of graphics processing units (GPUs) to study two fundamental problems in document mining, namely to calculate the term frequency-inverse document frequency (TF-IDF) and cluster a large set of documents. We transform traditional algorithms into accelerated parallel counterparts that can be efficiently executed on many-core GPU architectures. We assess our implementations on various platforms, ranging from stand-alone GPU desktops to Beowulf-like clusters equipped with contemporary GPU cards. We observe at least one order of magnitude speedups over CPU-only desktops and clusters. This demonstrates the potential of exploiting GPU clusters to efficiently solve massive document mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
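The TF-IDF weighting mentioned in this abstract can be illustrated with a minimal sequential sketch. It is not the authors' GPU implementation: the function name `tf_idf`, the whitespace tokenizer and the unsmoothed `log(N / df)` form of the inverse document frequency are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper).

    tf  = term count / document length
    idf = log(N / df), with df the number of documents containing the term
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: each term is counted once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

Each document is weighted independently once the document frequencies are known, which is the property a data-parallel version would rely on.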

5.

Parallel implementation of quorum planted (ℓ, d) motif search on multi-core/many-core platforms

《Microprocessors and Microsystems》2016

Multi-core and many-core architectures are widely adopted by researchers in applied sciences and engineering, owing to their reasonable cost, and ease of access. Moreover, their painless hardware set-up process and rather simple programming paradigm attract more researchers to acquire them and implement their time-expensive computations on these platforms. Planted Motif Search problem is one of the most challenging problems in bioinformatics whose goal is to enumerate all strings of length ℓ that are commonly planted in a given set of DNA sequences. In this paper, we propose an efficient method of thread parallelization to accelerate the latest Quorum Planted Motif Search algorithm (qPMS9) on multi-core and many-core systems. Our contribution towards dynamic scheduling of threads and parallelization of loops in the proposed method outperforms previous sequential and parallel algorithms.

6.

Mapping of option pricing algorithms onto heterogeneous many-core architectures

Shuai Zhang Zhao Wang Ying Peng Bertil Schmidt Weiguo Liu 《The Journal of supercomputing》2017,73(9):3715-3737

The rapid development of technologies and applications in recent years poses high demands and challenges for high-performance computing. Because of their competitive performance/price ratio, heterogeneous many-core architectures are widely used in high-performance computing areas. GPU and Xeon Phi are two popular general-purpose many-core accelerators. In this paper, we demonstrate how heterogeneous many-core architectures, powered by multi-core CPUs, CUDA-enabled GPUs and Xeon Phis can be used as an efficient computational platform to accelerate popular option pricing algorithms. In order to make full use of the compute power of this architecture, we have used a hybrid computing model which consists of two types of data parallelism: worker level and device level. The worker level data parallelism uses a distributed computing infrastructure for task distribution, while the device level data parallelism uses both the multi-core CPUs and many-core accelerators for fast option pricing calculation. Experiments show that our implementations achieve good performance and scalability on this architecture and also outperform other state-of-the-art GPU-based solutions for Monte Carlo European/American option pricing and BSDE European option pricing.
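As a rough illustration of why Monte Carlo European option pricing suits data parallelism, the sketch below prices a European call sequentially and compares it with the closed-form Black–Scholes value. The function names, parameters and the geometric Brownian motion model are assumptions for illustration, not the implementation evaluated in the paper. Every path is independent, so the loop body is the unit of work that could be spread across workers and devices.

```python
import math
import random


def mc_european_call(s0, strike, rate, sigma, maturity, n_paths, seed=0):
    """Monte Carlo estimate of a European call under geometric Brownian motion."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * sigma ** 2) * maturity
    vol = sigma * math.sqrt(maturity)
    payoff_sum = 0.0
    for _ in range(n_paths):  # paths are independent of one another
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp(drift + vol * z)  # terminal asset price
        payoff_sum += max(s_t - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-rate * maturity) * payoff_sum / n_paths


def bs_european_call(s0, strike, rate, sigma, maturity):
    """Closed-form Black-Scholes price, used here only as a reference value."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(s0 / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (sigma * sqrt_t)
    d2 = d1 - sigma * sqrt_t

    def cdf(x):
        # Standard normal cumulative distribution function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)
```

With enough paths, `mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 100000)` should land close to `bs_european_call(100.0, 100.0, 0.05, 0.2, 1.0)`; the statistical error shrinks with the square root of the path count.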

7.

Parallel probabilistic relaxation labelling based on Markov random fields for spectral-spatial hyperspectral image classification

Brajesh Kumar Onkar Dikshit 《International journal of remote sensing》2016,37(18):4356-4379

The large volume of data and computational complexity of algorithms limit the application of hyperspectral image classification to real-time operations. This work addresses the use of different parallel processing techniques to speed up the Markov random field (MRF)-based method to perform spectral-spatial classification of hyperspectral imagery. The Metropolis relaxation labelling approach is modified to take advantage of multi-core central processing units (CPUs) and to adapt it to massively parallel processing systems like graphics processing units (GPUs). The experiments on different hyperspectral data sets revealed that the implementation approach has a huge impact on the execution time of the algorithm. The results demonstrated that the modified MRF algorithm produced classification accuracy similar to conventional methods with greatly improved computational performance. With modern multi-core CPUs, good computational speed-up can be achieved even without additional hardware support. The CPU-GPU hybrid framework rendered the otherwise computationally expensive approach suitable for time-constrained applications.

8.

A new era in scientific computing: Domain decomposition methods in hybrid CPU–GPU architectures

M. Papadrakakis G. Stavroulakis A. Karatarakis 《Computer Methods in Applied Mechanics and Engineering》2011,200(13-16):1490-1508

Recent advances in graphics processing units (GPUs) technology open a new era in high performance computing. Applications of GPUs to scientific computations are attracting a lot of attention due to their low cost in conjunction with their inherently remarkable performance features and the recently enhanced computational precision and improved programming tools. Domain decomposition methods (DDM) constitute today an important category of methods for the solution of highly demanding problems in simulation-based applied science and engineering. Among them, dual domain decomposition methods have been successfully applied in a variety of problems in both sequential as well as in parallel/distributed processing systems. In this work, we demonstrate the implementation of the FETI method to a hybrid CPU–GPU computing environment. Parametric tests on implicit finite element structural mechanics benchmark problems revealed the tremendous potential of this type of hybrid computing environment as a result of the full exploitation of multi-core CPU hardware resources and the intrinsic software and hardware features of the GPUs as well as the numerical properties of the solution method.

9.

Research on parallel online analytical processing algorithms on multi-core processors

Guoliang Zhou Guilan Wang Yongli Zhu 《Journal of Frontiers of Computer Science and Technology (计算机科学与探索)》2013,(2):180-190

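In recent years, computer hardware has advanced considerably, especially in large memories and multi-core processors, yet algorithm efficiency has not improved in step with the hardware. The root causes are insufficient use of the CPU cache and the limitations of single-threaded program design. In online analytical processing (OLAP), data cube computation is an important but time-consuming operation, so improving its efficiency is a difficult research problem in the field. This paper examines parallel cube algorithms that exploit the characteristics of multi-core CPUs and proposes the MT-Multi-Way (multi-threading multi-way) and MT-BUC (multi-threading bottom-up computation) algorithms. Through effective data partitioning and multi-thread cooperation, these algorithms avoid cache contention, ensure load balancing, and achieve near-linear speedup. Building on these algorithms, a multi-core framework for cube algorithms is proposed, covering data partitioning strategies and multi-core processing of recursive algorithms, to guide the parallelization of cube algorithms.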

10.

Embedded GPU and multicore processors for emotional-based mobile robotic agents

《Future Generation Computer Systems》2016

Control architectures based on emotions are becoming promising solutions for the implementation of future robotic systems. The basic controllers of this architecture are the emotional processes that decide which behaviors the robot must activate to fulfill the objectives. The number of emotional processes increases (hundreds of millions/s) with the complexity level of the application, limiting the processing capacity of a main processor to solve the complex problems. Fortunately, the potential parallelism of emotional processes permits their execution in parallel, hence enabling the computing power to tackle the complex dynamic problems. In this paper, Graphics Processing Units (GPUs), multicore processors and single instruction multiple data (SIMD) instructions are used to provide parallelism for the emotional processes. Different GPUs, multicore processors and SIMD instruction sets are evaluated and compared to analyze their suitability to cope with robotic applications. The applications are set up taking into account different environmental conditions, robot dynamics and emotional states. Experimental results show that, despite the fact that GPUs have a bottleneck in the data transmission between the host and the device, the evaluated GTX 670 GPU provides a performance of more than one order of magnitude greater than the initial implementation of the architecture on a single core. Thus, all complex proposed application problems can be solved using the GPU technology in contrast to the first prototype where only 55% of them could be solved. Using AVX SIMD instructions, the performance of the architecture is increased 3.25 times relative to the first implementation. Thus, of the 27 proposed applications, about 88.8% are solved. In the case of the SSE SIMD instructions, the performance is almost doubled and the robot could solve about 74% of the proposed application problems.
The use of AVX and SSE SIMD instructions provides almost the same performance as a quad- and a dual-core, respectively, with the advantage that they do not add any additional hardware cost.

11.

Autonomic Coordination of Skeleton-Based Applications Over CPU/GPU Multi-Core Architectures

Mehdi Goli Horacio González–Vélez 《International journal of parallel programming》2017,45(2):203-224

Widely adumbrated as patterns of parallel computation and communication, algorithmic skeletons introduce a viable solution for efficiently programming modern heterogeneous multi-core architectures equipped not only with traditional multi-core CPUs, but also with one or more programmable Graphics Processing Units (GPUs). By systematically applying algorithmic skeletons to address complex programming tasks, it is arguably possible to separate the coordination from the computation in a parallel program, and therefore subdivide a complex program into building blocks (modules, skids, or components) that can be independently created and then used in different systems to drive multiple functionalities. By exploiting such systematic division, it is feasible to automate coordination by addressing extra-functional and non-functional features such as application performance, portability, and resource utilisation from the component level in heterogeneous multi-core architectures. In this paper, we introduce a novel approach to exploit the inherent features of skeleton-based applications in order to automatically coordinate them over heterogeneous (CPU/GPU) multi-core architectures and improve their performance. Our systematic evaluation demonstrates up to one order of magnitude speed-up on heterogeneous multi-core architectures.

12.

Implementing molecular dynamics on hybrid high performance computers – short range forces

W. Michael Brown Peng Wang Steven J. Plimpton Arnold N. Tharrington 《Computer Physics Communications》2011,182(4):898-911

The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines – (1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, (2) minimizing the amount of code that must be ported for efficient acceleration, (3) utilizing the available processing power from both multi-core CPUs and accelerators, and (4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS; however, the methods can be applied in many molecular dynamics codes. Specifically, we describe algorithms for efficient short range force calculation on hybrid high-performance machines. We describe an approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPUs and 180 CPU cores.

13.

A highly parallel Black–Scholes solver based on adaptive sparse grids

《International Journal of Computer Mathematics》2012,89(9):1212-1238

In this paper, we present a highly efficient approach for numerically solving the Black–Scholes equation in order to price European and American basket options. To this end, hardware features of contemporary high performance computer architectures such as non-uniform memory access and hardware-threading are exploited by a hybrid parallelization using MPI and OpenMP which is able to drastically reduce the computing time. In this way, we achieve very good speed-ups and are able to price baskets with up to six underlyings. Our approach is based on a sparse grid discretization with finite elements and makes use of a sophisticated adaption. The resulting linear system is solved by a conjugate gradient method that uses a parallel operator for applying the system matrix implicitly. Since we exploit all levels of the operator's parallelism, we are able to benefit from the compute power of more than 100 cores. Several numerical examples as well as an analysis of the performance for different computer architectures are provided.

14.

High performance computing using MPI and OpenMP on multi-core parallel systems

Haoqiang Jin Dennis Jespersen Piyush Mehrotra Rupak Biswas Lei Huang Barbara Chapman 《Parallel Computing》2011,37(9):562-575

The rapidly increasing number of cores in modern microprocessors is pushing the current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems - distributed memory across nodes and shared memory with non-uniform memory access within each node - poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems - a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems including an SGI Altix 4700, an IBM p575+ and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.

15.

Parallel cube computation on modern CPUs and GPUs

Guoliang Zhou Hong Chen 《The Journal of supercomputing》2012,61(3):394-417

With the popularity of column-store databases, modern multi-core CPUs, and general-purpose computing on graphics processing units (GPGPUs), there will be radical changes in how processing is done in the online analytical processing (OLAP) and data warehousing fields. Cube computation is a core and time-consuming problem which has been researched extensively. However, most of the algorithms have been proposed without considering the prevalent multi-core architectures and column storage. This paper presents a new parallel cube algorithm that takes advantage of multi-core architectures. We first propose a cache-conscious bottom-up computation (BUC) algorithm called CC-BUC that adopts an integrated bottom-up and breadth-first partitioning order. Each dimension is separately stored and processed. In processing each dimension, breadth-first data scanning and results outputting reduce memory I/O and enhance cache locality. Cache misses are limited in a dimension scope, and translation lookaside buffer (TLB) misses are reduced. Based on CC-BUC, we give a multi-core architecture-based cube algorithm called MC-Cubing. Multiple partitions are processed simultaneously and multiple threads undergo parallel execution inside each partition. MC-Cubing is consistent with multi-core architectures and high parallelism. The layout and associated algorithms take advantage of single instruction, multiple data (SIMD) instructions and thread-level parallelism (TLP). We implement and demonstrate the effectiveness of MC-Cubing on two multi-core architectures: multi-core CPUs and GPUs. Experimental results show that the MC-Cubing algorithm is nearly six times faster than BUC on real datasets.
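For readers unfamiliar with the baseline, the following is a minimal sketch of plain bottom-up computation (BUC) with a row-count measure. It is the sequential, depth-first baseline only, not CC-BUC or MC-Cubing; the function name `buc`, the tuple-based row format and the `"*"` marker for an aggregated dimension are assumptions made for illustration.

```python
from collections import defaultdict


def buc(rows, n_dims, min_support=1):
    """Return {cell: row count}; "*" marks a dimension that is aggregated away."""
    cube = {}

    def recurse(partition, start_dim, cell):
        cube[tuple(cell)] = len(partition)  # aggregate for the current cell
        for dim in range(start_dim, n_dims):
            # Partition the current rows on dimension `dim`.
            groups = defaultdict(list)
            for row in partition:
                groups[row[dim]].append(row)
            for value, group in groups.items():
                if len(group) >= min_support:  # iceberg pruning
                    cell[dim] = value
                    recurse(group, dim + 1, cell)
            cell[dim] = "*"  # restore before moving to the next dimension

    recurse(rows, 0, ["*"] * n_dims)
    return cube
```

The partitions produced at each level are disjoint, which is the property parallel variants exploit when they hand separate partitions to separate threads.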

16.

Parallel Ant Colony Optimization on Graphics Processing Units

Audrey Delévacq Pierre Delisle Marc Gravel Michaël Krajecki 《Journal of Parallel and Distributed Computing》2013

The purpose of this paper is to propose effective parallelization strategies for the Ant Colony Optimization (ACO) metaheuristic on Graphics Processing Units (GPUs). The Max–Min Ant System (MMAS) algorithm augmented with 3-opt local search is used as a framework for the implementation of the parallel ants and multiple ant colonies general parallelization approaches. The four resulting GPU algorithms are extensively evaluated and compared on both speedup and solution quality on a state-of-the-art Fermi GPU architecture. A rigorous effort is made to keep parallel algorithms true to the original MMAS applied to the Traveling Salesman Problem. We report speedups of up to 23.60 with solution quality similar to the original sequential implementation. With the intent of providing a parallelization framework for ACO on GPUs, a comparative experimental study highlights the performance impact of ACO parameters, GPU technical configuration, memory structures and parallelization granularity.
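The defining step of MMAS is its pheromone update: trails evaporate, only the best ant deposits pheromone, and every trail is clamped to a fixed interval. The sketch below shows that step for a symmetric Traveling Salesman Problem. It follows the commonly described formulation of MMAS rather than the paper's GPU code; the function name and all parameter names (`rho`, `tau_min`, `tau_max`) are assumptions, and tour construction and 3-opt local search are omitted.

```python
def mmas_update(pheromone, best_tour, best_length, rho, tau_min, tau_max):
    """Update a square pheromone matrix in place and return it."""
    n = len(pheromone)
    # Evaporation on every edge.
    for i in range(n):
        for j in range(n):
            pheromone[i][j] *= 1.0 - rho
    # Only the best ant deposits pheromone, on the edges of its closed tour.
    deposit = 1.0 / best_length
    for a, b in zip(best_tour, best_tour[1:] + best_tour[:1]):
        pheromone[a][b] += deposit
        pheromone[b][a] += deposit  # symmetric problem
    # Clamp every trail to [tau_min, tau_max].
    for i in range(n):
        for j in range(n):
            pheromone[i][j] = min(tau_max, max(tau_min, pheromone[i][j]))
    return pheromone
```

Evaporation and clamping touch each matrix entry independently, so they map naturally onto one thread per edge; the deposit step is the part that needs care when several colonies share a matrix.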

17.

Revisiting Multiple Pattern Matching Algorithms for Multi-Core Architecture

Guangming Tan Ping Liu Dongbo Bu Yanbing Liu 《Journal of Computer Science and Technology》2011,26(5):866-874

Due to the huge size of patterns to be searched, multiple pattern searching remains a challenge to several newly-arising applications like network intrusion detection. In this paper, we present an attempt to design efficient multiple pattern searching algorithms on multi-core architectures. We observe an important feature which indicates that the multiple pattern matching time mainly depends on the number and minimal length of patterns. The multi-core algorithm proposed in this paper leverages this feature to decompose pattern set so that the parallel execution time is minimized. We formulate the problem as an optimal decomposition and scheduling of a pattern set, then propose a heuristic algorithm, which takes advantage of dynamic programming and greedy algorithmic techniques, to solve the optimization problem. Experimental results suggest that our decomposition approach can increase the searching speed by more than 200% on a 4-core AMD Barcelona system.

18.

A comprehensive comparison of GPU- and FPGA-based acceleration of reflection image reconstruction for 3D ultrasound computer tomography

Matthias Birk Michael Zapf Matthias Balzer Nicole Ruiter Jürgen Becker 《Journal of Real-Time Image Processing》2014,9(1):159-170

As today’s standard screening methods frequently fail to diagnose breast cancer before metastases have developed, earlier breast cancer diagnosis is still a major challenge. Three-dimensional ultrasound computer tomography promises high-quality images of the breast, but is currently limited by a time-consuming image reconstruction. In this work, we investigate the acceleration of the image reconstruction by GPUs and FPGAs. We compare the obtained performance results with a recent multi-core CPU. We show that both architectures are able to accelerate processing, with the GPU reaching the highest performance. Furthermore, we draw conclusions in terms of applicability of the accelerated reconstructions in future clinical application and highlight general principles for speed-up on GPUs and FPGAs.

19.

A Memory and Computation Efficient Sparse Level-Set Method

Wladimir J. van der Laan Andrei C. Jalba Jos B. T. M. Roerdink 《Journal of scientific computing》2011,46(2):243-264

Since its introduction, the level set method has become the favorite technique for capturing and tracking moving interfaces, and found applications in a wide variety of scientific fields. In this paper we present efficient data structures and algorithms for tracking dynamic interfaces through the level set method. Several approaches which address both computational and memory requirements have been very recently introduced. We show that our method is up to 8.5 times faster than these recent approaches. More importantly, our algorithm can greatly benefit from both fine- and coarse-grain parallelization by leveraging SIMD and/or multi-core parallel architectures.

20.

Box-counting algorithm on GPU and multi-core CPU: an OpenCL cross-platform study

Jesús Jiménez Juan Ruiz de Miras 《The Journal of supercomputing》2013,65(3):1327-1352

In this paper, we present the analysis and development of a cross-platform OpenCL implementation of the box-counting algorithm, which is one of the most widely-used methods for estimating the Fractal Dimension. The Fractal Dimension is a relevant image analysis method used in several disciplines, but computing it is in general a time consuming process, especially when working with 3D images. Unlike parallel programming models that strictly depend on the hardware type and manufacturer, like CUDA, OpenCL allows us to provide an implementation suitable for execution on both GPUs and multi-core CPUs, whatever the hardware manufacturer. Sorting is a key part of the fast box-counting algorithm and the final speedup is highly conditioned by the efficiency of the sorting algorithm used. Our study reveals that current OpenCL implementations of sorting algorithms are clearly slower when compared with both CUDA for GPU and specific multi-core CPU implementations. Our OpenCL algorithm has been specifically optimized according to the type of the target device and the results show an average speedup of up to 7.46× and 4×, when executed on the GPU and the multi-core CPU respectively, both compared with the single-threaded (sequential) CPU implementation.
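The box-counting estimate itself is short enough to sketch sequentially. The version below counts occupied boxes with a hash set rather than the sort-based fast algorithm the abstract refers to, so it shows only the estimator, not the paper's method; the function names and the point-list input format are assumptions.

```python
import math


def box_count(points, box_size):
    """Number of grid boxes of side `box_size` containing at least one point."""
    return len({tuple(int(c // box_size) for c in p) for p in points})


def fractal_dimension(points, box_sizes):
    """Least-squares slope of log N(s) against log(1/s) over the given box sizes."""
    xs = [math.log(1.0 / s) for s in box_sizes]
    ys = [math.log(box_count(points, s)) for s in box_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A filled 64 by 64 square of integer points evaluated at box sizes 1, 2, 4 and 8 gives a slope of 2, and a straight line of points gives 1, matching the expected dimensions of those shapes. The same code accepts 3D points, since a box index is computed per coordinate.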