期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Performance analysis of distributed symmetric sparse matrix vector multiplication algorithm for multi‐core architectures

Dossay Oryspayev Hasan Metin Aktulga Masha Sosonkina Pieter Maris James P. Vary 《Concurrency and Computation》2015,27(17):5019-5036

Sparse matrix vector multiply (SpMVM) is an important kernel that frequently arises in high performance computing applications. Due to its low arithmetic intensity, several approaches have been proposed in literature to improve its scalability and efficiency in large scale computations. In this paper, our target systems are high end multi‐core architectures and we use messaging passing interface + open multiprocessing hybrid programming model for parallelism. We analyze the performance of recently proposed implementation of the distributed symmetric SpMVM, originally developed for large sparse symmetric matrices arising in ab initio nuclear structure calculations. We study important features of this implementation and compare with previously reported implementations that do not exploit underlying symmetry. Our SpMVM implementations leverage the hybrid paradigm to efficiently overlap expensive communications with computations. Our main comparison criterion is the ‘CPU core hours’ metric, which is the main measure of resource usage on supercomputers. We analyze the effects of topology‐aware mapping heuristic using simplified network load model. We have tested the different SpMVM implementations on two large clusters with 3D Torus and Dragonfly topology. Our results show that the distributed SpMVM implementation that exploits matrix symmetry and hides communication yields the best value for the ‘CPU core hours’ metric and significantly reduces data movement overheads. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

2.

Parallel performance of Jacobi eigenvalue solution

《Computing Systems in Engineering》1995,6(4-5):377-383

In this paper we focus on Jacobi like resolution of the eigenproblem for a real symmetric matrix from a parallel performance point of view: we try to optimize the algorithm working on the communication intensive part of the code. We discuss several parallel implementations and propose an implementation which overlaps the communications by the computations to reach a better efficiency. We show that the overlapping implementation can lead to significant improvements. We conclude by presenting our future work. 相似文献

3.

Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on Reconfigurable Computing Systems

Ling Zhuo Prasanna V.K. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(4):433-448

The abundant hardware resources on current reconfigurable computing systems provide new opportunities for high-performance parallel implementations of scientific computations. In this paper, we study designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems. We first analyze design trade-offs in implementing this kernel. These trade-offs are caused by the inherent parallelism of matrix multiplication and the resource constraints, including the number of configurable slices, the size of on-chip memory, and the available memory bandwidth. We propose three parameterized algorithms which can be tuned according to the problem size and the available hardware resources. Our algorithms employ linear array architecture with simple control logic. This architecture effectively utilizes the available resources and reduces routing complexity. The processing elements (PEs) used in our algorithms are modular so that it is easy to embed floating-point units into them. Experimental results on a Xilinx Virtex-ll Pro XC2VP100 show that our algorithms achieve good scalability and high sustained GFLOPS performance. We also implement our algorithms on Cray XD1. XD1 is a high-end reconfigurable computing system that employs both general-purpose processors and reconfigurable devices. Our algorithms achieve a sustained performance of 2.06 GFLOPS on a single node of XD1 相似文献

4.

Scalable high-throughput variable block size motion estimation architecture

Stephen Warrington Wai-Yip Chan Subramania Sudharsanan 《Microprocessors and Microsystems》2009,33(4):319-325

Variable block size (VBS) motion compensated prediction (MCP) provides substantial rate-distortion performance gain over conventional fixed-block-size MCP and is a key feature of the H.264/AVC video coding standard. VBS–MCP requires the encoder to perform VBS motion estimation (VBSME), a computationally complex operation. In this paper, we propose a high motion vector throughput full-search VBSME architecture. High performance is achieved by performing parallel computations for multiple pixels within a macroblock, as well as computing several candidate motion vector (MV) positions in parallel. Two implementations of the architecture are examined, a four pixel-parallel implementation, and a higher performance 16 pixel-parallel implementation. A high degree of scalability is achieved by allowing for a variable length processing element array, where more processing elements yields a higher degree of candidate MV parallelism. The proposed architecture achieves a throughput exceeding current full-search VBSME architectures. 相似文献

5.

Window memoization: an efficient hardware architecture for high-performance image processing

Farzad Khalvati Mark D. Aagaard 《Journal of Real-Time Image Processing》2010,5(3):195-212

This work presents a new performance improvement technique, window memoization, for hardware implementations of local image processing algorithms. Window memoization combines the memoization techniques proposed in software and hardware with data redundancy in image processing to improve the efficiency of local image processing algorithms implemented in hardware. It minimizes the number of redundant computations performed on an image by identifying similar neighborhoods of pixels in the image and skipping the redundant computations. We have developed an optimized architecture in hardware that embodies the window memoization technique. Our hardware design for window memoization achieves high speedups with an overhead in hardware area that is significantly less than that of the conventional performance improvement techniques. As case studies in hardware, we have applied window memoization to the Kirsch edge detector and median filter. The typical speedup factor in hardware is 1.58 with 40% less hardware in comparison to conventional optimization techniques. 相似文献

6.

Scalable multicore architectures for long DNA sequence comparison

Friman Snchez Felipe Cabarcas Alex Ramirez Mateo Valero 《Concurrency and Computation》2011,23(17):2205-2219

Biological sequence comparison is one of the most important tasks in Bioinformatics. Owing to the fast growth of databases that contain biological information, sequence comparison represents an important challenge for high‐performance computing, especially when very long sequences are compared, i.e. the complete genome of several organisms. The Smith–Waterman (SW) algorithm is an exact method based on dynamic programming to quantify local similarity between sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP and ILP). Concurrently, there is a paradigm shift towards chip multiprocessors in computer architecture, which offer a huge amount of potential performance that can only be exploited efficiently if applications are effectively mapped and parallelized. In this work, we analyze how large‐scale biology sequence comparison takes advantage of the current and future multicore architectures. Our starting point is the performance analysis of the current multicore IBM Cell B.E. processor; we analyze two different SW implementations on the Cell B.E. Then, using simulation tools, we study the performance scalability when a many‐core architecture is used for performing long DNA sequence comparison. We investigate the efficient memory organization that delivers the maximum bandwidth with the minimum cost. Our results show that a heterogeneous architecture can be an efficient alternative to execute challenging bioinformatic workloads. Copyright © 2011 John Wiley & Sons, Ltd. 相似文献

7.

Minimizing development and maintenance costs in supporting persistently optimized BLAS

R. Clint Whaley Antoine Petitet 《Software》2005,35(2):101-121

The Basic Linear Algebra Subprograms (BLAS) define one of the most heavily used performance‐critical APIs in scientific computing today. It has long been understood that the most important of these routines, the dense Level 3 BLAS, may be written efficiently given a highly optimized general matrix multiply routine. In this paper, however, we show that an even larger set of operations can be efficiently maintained using a much simpler matrix multiply kernel. Indeed, this is how our own project, ATLAS (which provides one of the most widely used BLAS implementations in use today), supports a large variety of performance‐critical routines. Copyright © 2004 John Wiley & Sons, Ltd. 相似文献

8.

A regression‐based performance prediction framework for synchronous iterative algorithms on general purpose graphical processing unit clusters

Vivek K. Pallipuram Melissa C. Smith Nimisha Raut Xiaoyu Ren 《Concurrency and Computation》2014,26(2):532-560

Heterogeneous performance prediction models are valuable tools to accurately predict application runtime, allowing for efficient design space exploration and application mapping. The existing performance models require intricate system architecture knowledge, making the modeling task difficult. In this research, we propose a regression‐based performance prediction framework for general purpose graphical processing unit (GPGPU) clusters that statistically abstracts the system architecture characteristics, enabling performance prediction without detailed system architecture knowledge. The regression‐based framework targets deterministic synchronous iterative algorithms using our synchronous iterative GPGPU execution model and is broken into two components: the computation component that models the GPGPU device and host computations and the communication component that models the network‐level communications. The computation component regression models use algorithm characteristics such as the number of floating‐point operations and total bytes as predictor variables and are trained using several small, instrumented executions of synchronous iterative algorithms that include a range of floating‐point operations‐to‐byte requirements. The regression models for network‐level communications are developed using micro‐benchmarks and employ data transfer size and processor count as predictor variables. Our performance prediction framework achieves prediction accuracy over 90% compared with the actual implementations for several tested GPGPU cluster configurations. The end goal of this research is to offer the scientific computing community, an accurate and easy‐to‐use performance prediction framework that empowers users to optimally utilize the heterogeneous resources. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

9.

Energy-Efficient Computations on FPGAs

Viktor?K.?Prasanna Email author 《The Journal of supercomputing》2005,32(2):139-162

Recently, energy dissipation for computations on FPGAs has become an important performance metric. In this paper, we summarize our recent efforts in developing an algorithm-level design methodology for optimizing the energy performance of FPGA based implementations. For kernels, our design methodology consists of four steps: domain selection, domain-specific energy modeling, domain-space exploration and low-level simulation. To achieve system-level energy-efficiency, we outline a design methodology that integrates the kernel-level design methodology. Both the design methodologies can be used to achieve not only energy-efficiency but also latency, area, and power efficiency. We consider signal processing kernels as illustrative examples and demonstrate energy and time efficient algorithms and implementations for these on FPGAs. Example energy performance optimization through algorithmic optimizations include the 29–51% improvement in energy performance for a matrix multiplication kernel, 57–78% improvement for a FFT kernel and the 10–60% improvement for a floating-point LU decomposition kernel over state-of-the-art implementations. Similarly, an improvement of 41 to 46% in energy performance was achieved by the system-level design approach over a greedy approach for a MVDR adaptive beamforming application. Finally we briefly describe a high-level tool for obtaining parameterized and energy-efficient designs on FPGAs.This work is supported by the National Science Foundation under award No. CCR-0311823. 相似文献

10.

SpMV and BiCG-Stab optimization for a class of hepta-diagonal-sparse matrices on GPU

Mayez A. Al-Mouhamed Ayaz H. Khan 《The Journal of supercomputing》2017,73(9):3761-3795

The abundant data parallelism available in many-core GPUs has been a key interest to improve accuracy in scientific and engineering simulation. In many cases, most of the simulation time is spent in linear solver involving sparse matrix–vector multiply. In forward petroleum oil and gas reservoir simulation, the application of a stencil relationship to structured grid leads to a family of generalized hepta-diagonal solver matrices with some regularity and structural uniqueness. We present a customized storage scheme that takes advantage of generalized hepta-diagonal sparsity pattern and stencil regularity by optimizing both storage and matrix–vector computation. We also present an in-kernel optimization for implementing sparse matrix–vector multiply (SpMV) and biconjugate gradient stabilized (BiCG-Stab) solver. In-kernel is intended to avoid the multiple kernels invocation associated with the use of the numerical library operators. To keep in-kernel, a lock-free inter-block synchronization is used in which completing thread blocks are assigned some independent computations to avoid repeatedly polling the global memory. Other optimizations enable combining reductions and collective write operations to memory. The in-kernel optimization is particularly useful for the iterative structure of BiCG-Stab for preserving vector data locality and to avoid saving vector data back to memory and reloading on each kernel exit and re-entry. Evaluation uses generalized hepta-diagonal matrices that derives from a range of forward reservoir simulation’s structured grids. Results show the profitability of proposed generalized hepta-diagonal custom storage scheme over standard library storage like compressed sparse row, hybrid sparse, and diagonal formats. Using proposed optimizations, SpMV and BiCG-Stab have been noticeably accelerated compared to other implementations using multiple kernel exit–re-entry when the solver is implemented by invoking numerical library operators. 相似文献

11.

Solving path problems on the GPU

Aydın Buluç John R. Gilbert Ceren Budak 《Parallel Computing》2010,36(5-6):241-253

We consider the computation of shortest paths on Graphic Processing Units (GPUs). The blocked recursive elimination strategy we use is applicable to a class of algorithms (such as all-pairs shortest-paths, transitive closure, and LU decomposition without pivoting) having similar data access patterns. Using the all-pairs shortest-paths problem as an example, we uncover potential gains over this class of algorithms. The impressive computational power and memory bandwidth of the GPU make it an attractive platform to run such computationally intensive algorithms. Although improvements over CPU implementations have previously been achieved for those algorithms in terms of raw speed, the utilization of the underlying computational resources was quite low. We implemented a recursively partitioned all-pairs shortest-paths algorithm that harnesses the power of GPUs better than existing implementations. The alternate schedule of path computations allowed us to cast almost all operations into matrix–matrix multiplications on a semiring. Since matrix–matrix multiplication is highly optimized and has a high ratio of computation to communication, our implementation does not suffer from the premature saturation of bandwidth resources as iterative algorithms do. By increasing temporal locality, our implementation runs more than two orders of magnitude faster on an NVIDIA 8800 GPU than on an Opteron. Our work provides evidence that programmers should rethink algorithms instead of directly porting them to GPU. 相似文献

12.

Cache-efficient numerical algorithms using graphics hardware

《Parallel Computing》2007,33(10-11):663-684

We present cache-efficient algorithms for scientific computations using graphics processing units (GPUs). Our approach is based on mapping the nested loops in the numerical algorithms to the texture mapping hardware and efficiently utilizing GPU caches. This mapping exploits the inherent parallelism, pipelining and high memory bandwidth on GPUs. We further improve the performance of numerical algorithms by accounting for the same relative memory address accesses performed at data elements in nested loops. Based on the similarity of memory accesses performed at the data elements in the input array, we decompose the input arrays into sub-arrays with similar memory access patterns and execute on the sub-arrays for faster execution. Our approach achieves high memory performance on GPUs by tiling the computation and thereby improving the cache-efficiency. Overall, our formulation for GPU-based algorithms extends the current graphics runtime APIs without exposing the underlying hardware complexity to the programmer. This makes it possible to achieve portability and higher performance across different GPUs. We use this approach to improve the performance of GPU-based sorting, fast Fourier transform and dense matrix multiplication algorithms. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we observe 2–10× improvement in performance. 相似文献

13.

A new approach for sparse matrix vector product on NVIDIA GPUs

F. Vzquez J. J. Fernndez E. M. Garzn 《Concurrency and Computation》2011,23(8):815-826

The sparse matrix vector product (SpMV) is a key operation in engineering and scientific computing and, hence, it has been subjected to intense research for a long time. The irregular computations involved in SpMV make its optimization challenging. Therefore, enormous effort has been devoted to devise data formats to store the sparse matrix with the ultimate aim of maximizing the performance. Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration factors. SpMV implementations for NVIDIA GPUs have already appeared on the scene. This work proposes and evaluates a new implementation of SpMV for NVIDIA GPUs based on a new format, ELLPACK‐R, that allows storage of the sparse matrix in a regular manner. A comparative evaluation against a variety of storage formats previously proposed has been carried out based on a representative set of test matrices. The results show that, although the performance strongly depends on the specific pattern of the matrix, the implementation based on ELLPACK‐R achieves higher overall performance. Moreover, a comparison with standard state‐of‐the‐art superscalar processors reveals that significant speedup factors are achieved with GPUs. Copyright © 2010 John Wiley & Sons, Ltd. 相似文献

14.

A distributed approach for accelerating sparse matrix arithmetic operations for high-dimensional feature selection

Antonela Tommasel Daniela Godoy Alejandro Zunino Cristian Mateos 《Knowledge and Information Systems》2017,51(2):459-497

Matrix computations are both fundamental and ubiquitous in computational science, and as a result, they are frequently used in numerous disciplines of scientific computing and engineering. Due to the high computational complexity of matrix operations, which makes them critical to the performance of a large number of applications, their efficient execution in distributed environments becomes a crucial issue. This work proposes a novel approach for distributing sparse matrix arithmetic operations on computer clusters aiming at speeding-up the processing of high-dimensional matrices. The approach focuses on how to split such operations into independent parallel tasks by considering the intrinsic characteristics that distinguish each type of operation and the particular matrices involved. The approach was applied to the most commonly used arithmetic operations between matrices. The performance of the presented approach was evaluated considering a high-dimensional text feature selection approach and two real-world datasets. Experimental evaluation showed that the proposed approach helped to significantly reduce the computing times of big-scale matrix operations, when compared to serial and multi-thread implementations as well as several linear algebra software libraries. 相似文献

15.

EGGS: Sparsity-Specific Code Generation

Xuan Tang Teseo Schneider Shoaib Kamil Aurojit Panda Jinyang Li Daniele Panozzo 《Computer Graphics Forum》2020,39(5):209-219

Sparse matrix computations are among the most important computational patterns, commonly used in geometry processing, physical simulation, graph algorithms, and other situations where sparse data arises. In many cases, the structure of a sparse matrix is known a priori, but the values may change or depend on inputs to the algorithm. We propose a new methodology for compile-time specialization of algorithms relying on mixing sparse and dense linear algebra operations, using an extension to the widely-used open source Eigen package. In contrast to library approaches optimizing individual building blocks of a computation (such as sparse matrix product), we generate reusable sparsity-specific implementations for a given algorithm, utilizing vector intrinsics and reducing unnecessary scanning through matrix structures. We demonstrate the effectiveness of our technique on a benchmark of artificial expressions to quantitatively evaluate the benefit of our approach over the state-of-the-art library Intel MKL. To further demonstrate the practical applicability of our technique we show that our technique can improve performance, with minimal code changes, for mesh smoothing, mesh parametrization, volumetric deformation, optical flow, and computation of the Laplace operator. 相似文献

16.

Emmerald: a fast matrix–matrix multiply using Intel's SSE instructions

Douglas Aberdeen Jonathan Baxter 《Concurrency and Computation》2001,13(2):103-119

相似文献

17.

Resource efficiency of the GigaNetIC chip multiprocessor architecture

《Journal of Systems Architecture》2007,53(5-6):285-299

In this article, we present the prototypical implementation of the scalable GigaNetIC chip multiprocessor architecture. We use an FPGA-based rapid prototyping system to verify the functionality of our architecture in a network application scenario before fabricating the ASIC in a modern CMOS standard cell technology. The rapid prototyping environment gives us the opportunity to test our multiprocessor architecture with Ethernet-based data streams in a real network scenario. Our system concept is based on a massively parallel processor structure. Due to its regularity, our architecture can be easily scaled to accommodate a wide range of packet processing applications with various performance and throughput requirements at high reliability. Furthermore, the composition based on predefined building blocks guarantees fast design cycles and simplifies system verification. We present standard cell synthesis results as well as a performance analysis for a firewall application with various couplings of hardware accelerators. Finally, we compare implementations of our architecture with state-of-the-art desktop CPUs. We use simple, general-purpose applications as well as the introduced packet processing tasks to determine the performance capabilities and the resource efficiency of the GigaNetIC architecture. We show that, if supported by the application, parallelism offers more opportunities than increasing clock frequencies. 相似文献

18.

An (almost) direct deployment of the Fast Multipole Method on the Cell processor

Pierre Fortin Jean-Luc Lamotte 《The Journal of supercomputing》2013,65(3):1205-1222

This paper presents the first deployment of the Fast Multipole Method on the Cell processor (PowerXCell 8i). We rely on the matrix formulation with BLAS routines of the FMB code (Fast Multipole with BLAS) in order to directly and efficiently offload the most time consuming operators of both far field and near field computations on the Cell heterogeneous cores. We detail the difficulties that had to be solved first, and we finally obtain a deployment in single and double precisions, which scales linearly on several Cell blades and which is able to handle both uniform and non-uniform distributions of particles. We also present our performance results and comparisons with multicore CPUs, as well as the limitations of our deployment on the Cell processor. 相似文献

19.

Microwave tomography for breast cancer detection on Cell broadband engine processors

Meilian Xu Parimala Thulasiraman Sima Noghanian 《Journal of Parallel and Distributed Computing》2012

Microwave tomography (MT) is a safe screening modality that can be used for breast cancer detection. The technique uses the dielectric property contrasts between different breast tissues at microwave frequencies to determine the existence of abnormalities. Our proposed MT approach is an iterative process that involves two algorithms: Finite-Difference Time-Domain (FDTD) and Genetic Algorithm (GA). It is a compute intensive problem: (i) the number of iterations can be quite large to detect small tumors; (ii) many fine-grained computations and discretizations of the object under screening are required for accuracy. In our earlier work, we developed a parallel algorithm for microwave tomography on CPU-based homogeneous, multi-core, distributed memory machines. The performance improvement was limited due to communication and synchronization latencies inherent in the algorithm. In this paper, we exploit the parallelism of microwave tomography on the Cell BE processor. Since FDTD is a numerical technique with regular memory accesses, intensive floating point operations and SIMD type operations, the algorithm can be efficiently mapped on the Cell processor achieving significant performance. The initial implementation of FDTD on Cell BE with 8 SPEs is 2.9 times faster than an eight node shared memory machine and 1.45 times faster than an eight node distributed memory machine. In this work, we modify the FDTD algorithm by overlapping computations with communications during asynchronous DMA transfers. The modified algorithm also orchestrates the computations to fully use data between DMA transfers to increase the computation-to-communication ratio. We see 54% improvement on 8 SPEs (27.9% on 1 SPE) for the modified FDTD in comparison to our original FDTD algorithm on Cell BE. We further reduce the synchronization latency between GA and FDTD by using mechanisms such as double buffering. We also propose a performance prediction model based on DMA transfers, number of instructions and operations, the processor frequency and DMA bandwidth. We show that the execution time from our prediction model is comparable (within 0.5 s difference) with the execution time of the experimental results on one SPE. 相似文献

20.

Relation-based computations in a monadic BSP model

N. Botta C. Ionescu 《Parallel Computing》2007,33(12):795-821

We propose a Haskell monadic model of bulk synchronous parallel programs and apply it to the analysis of relation-based computations.

Relation-based computations are simple but general patterns found in scientific computing applications. They are easy to implement sequentially, but difficult to parallelize.

We use the model to give high-level specifications of distributed relation-based algorithms and outline how to obtain testable parallel implementations from these specifications via equational reasoning.

We sketch the architecture of a C++ library of components for distributed relation-based computations. We argue that the model can be used to provide a concise and consistent library documentation. 相似文献