Similar Documents
20 similar documents found (search time: 31 ms)
1.
We present a new scheme for evaluating the performance of multithreaded computers and demonstrate its application to the Cray MTA‐2 and XMT supercomputers. Our scheme is based on the concept of clock cycles per element, ${\cal C}$, plotted against both problem size and the number of processors. This scheme clearly shows whether an implementation has achieved its asymptotic efficiency and is more general than (but includes) the commonly used speedup metric. It permits the discovery of imperfections in both the software and the hardware, and is expected to permit a unified comparison of many different parallel architectures. Measurements on a number of well‐known parallel algorithms, ranging from matrix multiply to quicksort, are presented for the MTA‐2 and XMT and highlight some interesting differences between these machines. The performance of sequence alignment using dynamic programming is evaluated on the MTA‐2, XMT, IBM x3755 and SGI Altix 350 and provides a useful comparison of the capabilities of the Cray machines with more conventional shared memory architectures. Copyright © 2009 John Wiley & Sons, Ltd.
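To make the cycles-per-element metric concrete, the sketch below computes ${\cal C}$ from a wall-clock measurement; this is a minimal illustration under our own assumptions (the function name and the example figures are not taken from the paper).

    #include <stdio.h>

    /* Cycles per element: C = (clock rate * elapsed time) / number of elements.
     * An implementation has reached its asymptotic efficiency when C flattens
     * out as the problem size n and the processor count grow. */
    double cycles_per_element(double clock_hz, double seconds, double n)
    {
        return clock_hz * seconds / n;
    }

    int main(void)
    {
        /* hypothetical measurement: 220 MHz clock, 0.5 s, 10^8 elements */
        printf("C = %.2f cycles/element\n", cycles_per_element(220e6, 0.5, 1e8));
        return 0;
    }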

2.
Image segmentation is a very important step in the computerized analysis of digital images. The maxflow–mincut approach has been successfully used to obtain minimum energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of the Goldberg–Tarjan preflow‐push algorithm on the Cray XMT‐2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB and hardware synchronization for each 64‐bit word. It is thus well‐suited to the parallelization of graph theoretic algorithms, such as preflow‐push. We describe the implementation of the preflow‐push code on the XMT‐2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 32000² pixels in size, which is well beyond the largest previously reported in the literature. Copyright © 2013 John Wiley & Sons, Ltd.
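For orientation, the following is a minimal, sequential sketch of the generic Goldberg–Tarjan push/relabel scheme on a dense capacity matrix. It shows only the two basic operations; the FIFO or highest-label orderings, the grid-graph construction for images, and all XMT-2 parallelization are beyond this sketch, and the toy network in main is simply a textbook example.

    #include <stdio.h>

    #define V 6                 /* number of vertices in the toy network */

    static int cap[V][V];       /* edge capacities                        */
    static int flow[V][V];      /* current preflow (net-flow convention)  */
    static int height[V];       /* vertex labels                          */
    static long long excess[V]; /* excess at each vertex                  */

    /* Push as much excess as possible from u to v along a residual edge. */
    static void push(int u, int v)
    {
        long long d = excess[u];
        long long r = cap[u][v] - flow[u][v];
        if (r < d) d = r;
        flow[u][v] += (int)d;
        flow[v][u] -= (int)d;
        excess[u] -= d;
        excess[v] += d;
    }

    /* Raise u just above its lowest neighbour reachable by a residual edge. */
    static void relabel(int u)
    {
        int h = 2 * V;
        for (int v = 0; v < V; ++v)
            if (cap[u][v] - flow[u][v] > 0 && height[v] < h)
                h = height[v];
        height[u] = h + 1;
    }

    static long long max_flow(int s, int t)
    {
        height[s] = V;
        for (int v = 0; v < V; ++v)            /* saturate all edges out of s */
            if (cap[s][v] > 0) {
                flow[s][v] = cap[s][v];
                flow[v][s] = -cap[s][v];
                excess[v] = cap[s][v];
                excess[s] -= cap[s][v];
            }
        for (int active = 1; active; ) {       /* sweep until no vertex has excess */
            active = 0;
            for (int u = 0; u < V; ++u) {
                if (u == s || u == t || excess[u] <= 0) continue;
                active = 1;
                int pushed = 0;
                for (int v = 0; v < V && excess[u] > 0; ++v)
                    if (cap[u][v] - flow[u][v] > 0 && height[u] == height[v] + 1) {
                        push(u, v);
                        pushed = 1;
                    }
                if (!pushed) relabel(u);       /* no admissible edge: relabel u */
            }
        }
        return excess[t];                      /* value of the maximum flow */
    }

    int main(void)
    {
        /* classic 6-vertex example; the maximum flow from 0 to 5 is 23 */
        cap[0][1] = 16; cap[0][2] = 13; cap[1][2] = 10; cap[2][1] = 4;
        cap[1][3] = 12; cap[3][2] = 9;  cap[2][4] = 14; cap[4][3] = 7;
        cap[3][5] = 20; cap[4][5] = 4;
        printf("max flow = %lld\n", max_flow(0, 5));
        return 0;
    }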

3.
We explore the comparative performance of the Cray XMT and XMT‐2 massively multithreaded supercomputers. We use benchmarks to evaluate memory accesses for various types of loops. We also compare the performance of these machines on matrix multiply and on three previously implemented dynamic programming algorithms. It is shown that the relative performance of these machines depends on the size (number of processors) of the configuration, as well as the size of the problem being evaluated. In particular, small configurations of the original XMT can sometimes show slightly better performance than larger configurations of the XMT‐2 for the same problem size. We note that, under heavy memory load, the performance of loops can saturate well before all available processors are used. This suggests that it may not always be useful to use the maximum number of processors for a specific run. We also show that manual restructuring of nested loops, including decreasing the parallelism, can result in major improvements in performance. The results in this paper indicate that careful exploration of the space of problem sizes, number of processors used, and choices of loop parallelization can yield substantial improvements in performance. These improvements can be very significant for production codes that run for extended periods of time. Copyright © 2012 John Wiley & Sons, Ltd.

4.
That the influence of the PRAM model is ubiquitous in parallel algorithm design is as clear as the fact that it is technologically infeasible for the foreseeable future. The current generation of parallel hardware prominently features distributed memory and high‐performance interconnection networks—very much the antithesis of the shared memory required for the PRAM model. It has been shown that, in spite of communication costs, for some problems very fast parallel algorithms are available for distributed‐memory machines—from embarrassingly parallel problems to sorting and numerical analysis. In contrast, it is known that for other classes of problems PRAM‐style shared‐memory simulation on a distributed‐memory machine can, in theory, produce solutions of comparable performance to the best possible for such architectures. The Bulk Synchronous Parallel (BSP) model accurately represents most parallel machines—theoretical and actual—in an execution and cost model. We introduce a scalable, portable PRAM realization appropriate for BSP computers and a methodology for its usage. Our system is fast and built upon the familiar sequential C++ coupled with the new standard BSP library of parallel computation and communication primitives. It is portable to and predictable on a vast number of parallel computers including workstation clusters, a 256‐processor Cray T3D, an 8‐node IBM SP/2 and a 4‐node shared‐memory SGI Power Challenge machine. Our approach achieves simplicity of programming over direct‐mode BSP programming for a reasonable overhead cost. We objectively compare optimized BSP and PRAM algorithms implemented with our C++ PRAM library and provide encouraging experimental results for our new style of programming. Copyright © 2000 John Wiley & Sons, Ltd.

5.
Multicore processors form the basis of most traditional high‐performance parallel processing architectures. Early experiences with these computers showed significant performance problems, both with regard to computation and inter‐process communication. The transition from Purple, an IBM POWER5‐based machine, to Cielo, a Cray XE6, as the main capability computing platform for the United States Department of Energy's Advanced Simulation and Computing campaign provides an opportunity to reexamine these issues after experiences with a few generations of multicore‐based machines. Experiences with Purple identified some important characteristics that led to strong performance of complex scientific application programs at very large scales. Herein, we compare the performance of some Advanced Simulation and Computing mission critical applications at capability scale across this transition to multicore processors. Copyright © 2012 John Wiley & Sons, Ltd.

6.
Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.
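To make the idea of a batched routine concrete, here is a plain CPU reference sketch: a loop over many independent small GEMMs in column-major storage, computing C_i = alpha*A_i*B_i + beta*C_i. It is only a correctness baseline under our own naming; the tuned, architecture-specialized kernels the abstract refers to are organized very differently.

    #include <stddef.h>

    /* Reference batched GEMM for column-major n x n matrices.
     * The matrices of each problem are stored back to back in A, B and C. */
    void gemm_batched(int n, double alpha, const double *A, const double *B,
                      double beta, double *C, size_t batch)
    {
        for (size_t p = 0; p < batch; ++p) {
            const double *a = A + p * (size_t)n * n;
            const double *b = B + p * (size_t)n * n;
            double *c = C + p * (size_t)n * n;
            for (int col = 0; col < n; ++col)
                for (int row = 0; row < n; ++row) {
                    double s = 0.0;
                    for (int k = 0; k < n; ++k)
                        s += a[row + k * n] * b[k + col * n];   /* A(row,k) * B(k,col) */
                    c[row + col * n] = alpha * s + beta * c[row + col * n];
                }
        }
    }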

7.
Progress in adapting molecular dynamics algorithms for systems with short-range interactions to utilize the features of modern supercomputers is described. Efficient utilization of the latest generation of processor architectures requires algorithms that can be both vectorized and parallelized. The approach adopted for vectorization involves combining the layer and neighbor-list methods, while parallelization employs spatial subdivision with explicit communication. The techniques presented here have been used in performance tests on the Cray X1 vector-parallel supercomputer with systems containing over 12 billion atoms.

8.
The subset‐sum problem is a well‐known non‐deterministic polynomial‐time complete (NP‐complete) decision problem. This paper proposes a novel and efficient implementation of a parallel two‐list algorithm for solving the problem on a graphics processing unit (GPU) using Compute Unified Device Architecture (CUDA). The algorithm is composed of a generation stage, a pruning stage, and a search stage. It is not easy to effectively implement the three stages of the algorithm on a GPU. Ways to achieve better performance, reasonable task distribution between CPU and GPU, effective GPU memory management, and CPU–GPU communication cost minimization are discussed. The generation stage of the algorithm adopts a typical recursive divide‐and‐conquer strategy. Because recursion cannot be well supported by current GPUs with compute capability less than 3.5, a new vector‐based iterative implementation mechanism is designed to replace the explicit recursion. Furthermore, to optimize the performance of the GPU implementation, this paper improves the three stages of the algorithm. The experimental results show that the GPU implementation has much better performance than the CPU implementation and can achieve high speedup on different GPU cards. The experimental results also illustrate that the improved algorithm can bring significant performance benefits for the GPU implementation. Copyright © 2014 John Wiley & Sons, Ltd.
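For orientation, the classical two-list method that underlies the generation and search stages can be sketched sequentially: enumerate the subset sums of each half of the input, sort both lists, and scan them with two pointers looking for a pair that reaches the target. The code below is our own sequential illustration (no pruning stage, no CUDA) with hypothetical names; error handling is omitted.

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_ll(const void *a, const void *b)
    {
        long long x = *(const long long *)a, y = *(const long long *)b;
        return (x > y) - (x < y);
    }

    /* Two-list subset sum: split the items in half, build the 2^(n/2) subset
     * sums of each half, sort both lists, and scan one forward while the
     * other is scanned backward. Returns 1 if some subset sums to target. */
    int subset_sum(const long long *w, int n, long long target)
    {
        int n1 = n / 2, n2 = n - n1;
        size_t s1 = (size_t)1 << n1, s2 = (size_t)1 << n2;
        long long *A = malloc(s1 * sizeof *A);
        long long *B = malloc(s2 * sizeof *B);
        for (size_t m = 0; m < s1; ++m) {
            A[m] = 0;
            for (int i = 0; i < n1; ++i)
                if (m & ((size_t)1 << i)) A[m] += w[i];
        }
        for (size_t m = 0; m < s2; ++m) {
            B[m] = 0;
            for (int i = 0; i < n2; ++i)
                if (m & ((size_t)1 << i)) B[m] += w[n1 + i];
        }
        qsort(A, s1, sizeof *A, cmp_ll);
        qsort(B, s2, sizeof *B, cmp_ll);
        int found = 0;
        size_t i = 0;
        for (size_t j = s2; j-- > 0 && !found; ) {       /* B scanned descending */
            while (i < s1 && A[i] + B[j] < target) ++i;  /* A scanned ascending  */
            if (i < s1 && A[i] + B[j] == target) found = 1;
        }
        free(A);
        free(B);
        return found;
    }

    int main(void)
    {
        long long w[] = {3, 34, 4, 12, 5, 2};
        printf("%s\n", subset_sum(w, 6, 9) ? "yes" : "no");   /* 4 + 5 = 9 */
        return 0;
    }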

9.
In this paper we discuss the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers. The computers were chosen so as to encompass a variety of architectures. They are the MPP, an SIMD machine with 16K bit-serial processors; the Flex/32, an MIMD machine with 20 processors; and the Cray-2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the Flex/32 and Cray-2, while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
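As background for the tridiagonal solves mentioned above, the sketch below is the sequential Gaussian elimination (Thomas algorithm) for one tridiagonal system; the cyclic elimination variant used on the MPP trades extra arithmetic for parallelism and is not shown. The interface is our own.

    /* Solve the tridiagonal system
     *   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  i = 0 .. n-1,
     * with a[0] = c[n-1] = 0, by Gaussian elimination without pivoting.
     * The arrays b and d are overwritten. */
    void thomas_solve(int n, const double *a, double *b, const double *c,
                      double *d, double *x)
    {
        for (int i = 1; i < n; ++i) {          /* forward elimination */
            double m = a[i] / b[i - 1];
            b[i] -= m * c[i - 1];
            d[i] -= m * d[i - 1];
        }
        x[n - 1] = d[n - 1] / b[n - 1];
        for (int i = n - 2; i >= 0; --i)       /* back substitution */
            x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    }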

10.
Computers & Fluids, 2006, 35(8–9): 910–919
This report presents a comprehensive survey of the effect of different data layouts on the single-processor performance characteristics of the lattice Boltzmann method, both for commodity “off-the-shelf” (COTS) architectures and tailored HPC systems, such as vector computers. We cover modern 64-bit processors ranging from IA32 compatible (Intel Xeon/Nocona, AMD Opteron), superscalar RISC (IBM Power4), IA64 (Intel Itanium 2) to classical vector (NEC SX6+) and novel vector (Cray X1) architectures. Combining different data layouts with architecture-dependent optimization strategies, we demonstrate that the optimal implementation strongly depends on the architecture used. In particular, the correct choice of data layout can supersede complex cache-blocking techniques in our kernels. Furthermore, our results demonstrate that vector systems can outperform COTS architectures by one order of magnitude.
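The data-layout choice at issue can be pictured with the two standard arrangements of the lattice Boltzmann distribution values, array-of-structures versus structure-of-arrays. The declarations and sweeps below are our own schematic (a D3Q19 lattice is assumed), not code from the survey; which layout wins depends on the architecture, as the abstract notes.

    #define Q 19                        /* D3Q19: 19 distribution values per cell */
    #define NCELLS (128 * 128 * 128)

    /* Array of structures: the Q values of one cell are contiguous,
     * which favours cache lines when a whole cell is updated at once. */
    typedef struct { double f[Q]; } cell_aos;

    /* Structure of arrays: all cells' values of one distribution index are
     * contiguous, which favours long unit-stride vector accesses. */
    typedef struct { double f[Q][NCELLS]; } grid_soa;

    /* Schematic update sweeps showing the resulting access patterns. */
    void sweep_aos(cell_aos *g)
    {
        for (long c = 0; c < NCELLS; ++c)
            for (int q = 0; q < Q; ++q)
                g[c].f[q] *= 0.99;              /* placeholder collision step */
    }

    void sweep_soa(grid_soa *g)
    {
        for (int q = 0; q < Q; ++q)
            for (long c = 0; c < NCELLS; ++c)   /* long unit-stride inner loop */
                g->f[q][c] *= 0.99;
    }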

11.
The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube and IBM SP1 and SP2 parallel computers is documented. Spatially evolving disturbances associated with laminar-to-turbulent transition in boundary-layer flows are computed with the PSDNS code. The feasibility of using the PSDNS to perform transition studies on these computers is examined. The results indicate that the PSDNS approach can be effectively parallelized on a distributed-memory parallel machine by remapping the distributed data structure during the course of the calculation. Scalability information is provided to estimate computational costs to match the actual costs relative to changes in the number of grid points. By increasing the number of processors, slower-than-linear speedups are achieved with optimized (machine-dependent library) routines. This slower-than-linear speedup results because the computational cost is dominated by the FFT routine, which yields less-than-ideal speedups. By using appropriate compile options and optimized library routines on the SP1, the serial code achieves 52–56 Mflops on a single node of the SP1 (45 percent of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a real-world simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP supercomputer. For the same simulation, 32 nodes of the SP1 and SP2 are required to reach the performance of a Cray C-90. A 32-node SP1 (SP2) configuration is 2.9 (4.6) times faster than a Cray Y/MP for this simulation, while the hypercube is roughly 2 times slower than the Y/MP for this application.

12.

In the current study, the problems of elastic and elastoplastic torsion were formulated by the finite element method. The finite element code was parallelized on both shared and distributed memory architectures. An assembly method with high parallelism and minimal memory consumption was proposed to obtain the compressed global stiffness matrix directly. Parallel programming principles were presented for both the shared-memory and distributed-memory approaches; moreover, well-known parallel mathematical libraries were briefly described. In this paper, the main focus was on a clear, detailed explanation of the parallelization mechanisms on the two memory architectures, including some Linux operating system settings for large-scale problems. To verify the ability of the proposed method and its parallel performance, several benchmark examples with different mesh sizes were presented and compared with their respective analytical solutions. Considering the obtained results, the proposed sparse assembly algorithm decreased the required memory significantly (by a factor of roughly 10^3.5 to 10^5.5), and the obtained speedup was about 3.4 for the elastoplastic torsion problem on a simple multicore computer.


13.
Gadget is a simulation application for N‐body and smoothed particle hydrodynamics (SPH) problems in cosmology, and it is widely applied to a range of cosmological problems. N‐body simulation focuses on the motion of N interacting particles, and SPH is a fluid simulation algorithm that models the movement of a fluid through particles. Most previous work has focused on accelerating Gadget on multi‐core CPU or graphics processing unit (GPU) platforms alone; such approaches do not achieve CPU–GPU hybrid computing and leave considerable CPU computing resources idle. In this paper, we propose a CPU–GPU hybrid parallel strategy to accelerate Gadget‐2, a massively parallel structure formation code for cosmological simulations. This strategy uses both the CPU and the GPU to process the calculation of the short‐range force. To keep the CPU and GPU workloads balanced, a dynamic task allocation scheme is proposed that accounts for the difference in computational performance between the CPU and the GPU. Experimental results showed that our CPU–GPU hybrid parallel strategy achieved an overall speedup factor of 18.6 and a partial speedup factor for the short‐range force calculation of 28.35 compared with a single‐core CPU implementation for particle counts in the millions. Moreover, compared with a GPU platform containing 12 CPU cores and one GPU, our hybrid parallel strategy improved the overall and partial speedups by 6% and 20%, respectively. Furthermore, the hybrid strategy scales well: its performance improves as the problem size increases. Its limitation is that the performance gain diminishes as the ratio of CPU cores to GPU cards decreases. Finally, with our hybrid strategy, CPU utilization improved by 17.14% or more. Copyright © 2013 John Wiley & Sons, Ltd.
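One simple way to realize the kind of dynamic task allocation described above is to re-split each batch of short-range force work in proportion to the throughput each device achieved on the previous batch. The helper below is our own illustration of that idea, not code from the Gadget-2 modification.

    /* Re-balance the CPU/GPU split from the last step's measurements:
     * each side's throughput is work done divided by time taken, and the
     * CPU receives a share of the next batch proportional to its throughput. */
    double next_cpu_fraction(double cpu_work, double cpu_time,
                             double gpu_work, double gpu_time)
    {
        double cpu_rate = cpu_work / cpu_time;
        double gpu_rate = gpu_work / gpu_time;
        return cpu_rate / (cpu_rate + gpu_rate);
    }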

14.
This paper presents a new Byzantine agreement protocol that tolerates t processor faults using O(t·log t) processors, t + 1 rounds, O(t² + o·t) total message bits (where o is the number of processors that must decide), and messages of maximum size 1 bit. It is the first Byzantine agreement protocol that is simultaneously optimal in rounds, message bits, and message size. The new Byzantine agreement protocol is actually a protocol for the (slightly) more general Byzantine relay problem—a problem which we formulate in this paper. The Byzantine relay protocol is the result of a general recursive construction. Each step of the construction combines two smaller (in terms of number of faults tolerated) Byzantine relay protocols into one larger Byzantine relay protocol. The base case is a collection of very simple Byzantine relay protocols, each tolerant of a small constant number of processor faults. A key new feature of the protocol construction technique presented in this paper is that it does not add unproductive overhead rounds: given two constituent protocols that are optimal in the number of rounds, the composite protocol that is constructed is also optimal in the number of rounds. The work of Jennifer Welch was supported in part by NSF Grant CCR-9010730 and an IBM Faculty Development Award. This work was done while she was at the University of North Carolina at Chapel Hill.

15.
In this paper we study various implementations of Cholesky factorization on SIMD architectures. A submatrix algorithm is implemented on the MasPar MP-2 using both block and torus-wrap data mappings. Both LL^T and LDL^T (square-root-free) implementations of the algorithm are investigated. The execution times and performance results for the MP-2 are presented. The performance of these algorithms is characterized in terms of the problem size, machine size and other machine-dependent communication and computation parameters. Analysis of the communication and computation complexities for these algorithms is also presented, and models to predict the performance are derived. The torus-wrap mapped implementations outperformed the block approach for all problem sizes. The LDL^T implementation outperformed LL^T for small to medium problem sizes. © 1997 John Wiley & Sons, Ltd.
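For reference, the square-root-free LDL^T variant contrasted with LL^T above can be sketched sequentially as follows; the submatrix formulation and the block and torus-wrap mappings used on the MP-2 are not reproduced here, and the storage convention is our own.

    /* Square-root-free Cholesky: factor a symmetric positive definite
     * n x n matrix A (row-major) as A = L*D*L^T, where L is unit lower
     * triangular and D is diagonal. On exit the strictly lower part of A
     * holds L and the diagonal of A holds D. */
    void ldlt_factor(int n, double *A)
    {
        for (int j = 0; j < n; ++j) {
            double dj = A[j * n + j];
            for (int k = 0; k < j; ++k)               /* D(j) update */
                dj -= A[j * n + k] * A[j * n + k] * A[k * n + k];
            A[j * n + j] = dj;
            for (int i = j + 1; i < n; ++i) {         /* column j of L */
                double s = A[i * n + j];
                for (int k = 0; k < j; ++k)
                    s -= A[i * n + k] * A[j * n + k] * A[k * n + k];
                A[i * n + j] = s / dj;
            }
        }
    }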

16.
In this paper we express the Viterbi algorithm as a matrix–vector reduction in which multiplication is replaced by addition and addition by minimization. The resulting algorithm is then readily parallelized in a form suitable for implementation on a systolic processor array. We describe the algorithm for Bose–Chaudhuri–Hocquenghem (BCH) codes, which have a task graph whose valence is restricted to four inputs and four outputs. The method is also applicable to convolutional codes, but the complexity of the task graph increases with the number of input bits for these codes. Results for BCH codes are given for two general purpose parallel machines, an IBM SP2 and a Meiko CS2. Copyright © 2001 John Wiley & Sons, Ltd.
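The reduction described here, with multiplication replaced by addition and addition by minimization, amounts to a matrix–vector product over the (min, +) semiring. The sketch below is our own scalar rendering of one trellis step; the systolic-array mapping and the BCH-specific task graph are not reproduced.

    #include <float.h>

    /* One Viterbi trellis step written as a (min, +) matrix-vector reduction:
     * new_metric[j] = min over i of (old_metric[i] + branch[i][j]),
     * where branch[i*nstates + j] is the branch metric from state i to state j
     * (DBL_MAX marks a transition that does not exist). */
    void viterbi_step(int nstates, const double *old_metric,
                      const double *branch, double *new_metric)
    {
        for (int j = 0; j < nstates; ++j) {
            double best = DBL_MAX;
            for (int i = 0; i < nstates; ++i) {
                double m = old_metric[i] + branch[i * nstates + j];
                if (m < best) best = m;
            }
            new_metric[j] = best;
        }
    }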

17.
We are witnessing the consolidation of heterogeneous computing in parallel computing, with architectures such as the Cell Broadband Engine (Cell BE) or Graphics Processing Units (GPUs), which are present in a myriad of developments for high-performance computing. These platforms provide a Software Development Kit (SDK) to maximize performance at the expense of dealing with complex and low-level architectural details, which makes software development a daunting task. This paper explores stencil computations in several heterogeneous programming models, namely Cell SDK, CellSs, ALF and CUDA, to optimize the Jacobi method for solving Laplace's differential equation. We describe the programming techniques to extract the maximum performance on the Cell BE and the GPU, and compare their computing paradigms. Experimental results are shown on two Nvidia Teslas and one IBM BladeCenter QS20 blade which incorporates two 3.2 GHz Cell BEs v5.1. The speed-up factor for our set of GPU optimizations reaches 3–4×, and the execution times beat those of the Cell BE by an order of magnitude, also showing great scalability when moving towards newer GPU generations and/or more demanding problem sizes.
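For reference, the numerical kernel being tuned in this comparison is the Jacobi relaxation of Laplace's equation. One sweep over a two-dimensional grid looks like the sketch below (our own plain-C rendering); the Cell- and CUDA-specific tiling, double buffering and memory staging are exactly what it leaves out.

    /* One Jacobi sweep for Laplace's equation on an n x n grid (row-major):
     * every interior point becomes the average of its four neighbours.
     * Boundary values are Dirichlet data and are assumed to be already present
     * in unew. Returns the largest change, usable as a convergence test. */
    double jacobi_sweep(int n, const double *u, double *unew)
    {
        double maxdiff = 0.0;
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < n - 1; ++j) {
                double v = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                   u[i * n + (j - 1)] + u[i * n + (j + 1)]);
                double d = v - u[i * n + j];
                if (d < 0.0) d = -d;
                if (d > maxdiff) maxdiff = d;
                unew[i * n + j] = v;
            }
        return maxdiff;
    }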

18.
Biological sequence comparison is one of the most important tasks in Bioinformatics. Owing to the fast growth of databases that contain biological information, sequence comparison represents an important challenge for high‐performance computing, especially when very long sequences are compared, i.e. the complete genomes of several organisms. The Smith–Waterman (SW) algorithm is an exact method based on dynamic programming to quantify local similarity between sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP and ILP). Concurrently, there is a paradigm shift towards chip multiprocessors in computer architecture, which offer a huge amount of potential performance that can only be exploited efficiently if applications are effectively mapped and parallelized. In this work, we analyze how large‐scale biology sequence comparison takes advantage of the current and future multicore architectures. Our starting point is the performance analysis of the current multicore IBM Cell B.E. processor; we analyze two different SW implementations on the Cell B.E. Then, using simulation tools, we study the performance scalability when a many‐core architecture is used for performing long DNA sequence comparison. We investigate the efficient memory organization that delivers the maximum bandwidth with the minimum cost. Our results show that a heterogeneous architecture can be an efficient alternative to execute challenging bioinformatic workloads. Copyright © 2011 John Wiley & Sons, Ltd.
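The Smith–Waterman recurrence referred to above is compact enough to state directly. The score-only, linear-gap sketch below is our own illustration with hypothetical scoring parameters; its anti-diagonals are what the Cell and many-core implementations evaluate in parallel.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Smith-Waterman local alignment score with a linear gap penalty:
     *   H(i,j) = max(0, H(i-1,j-1) + s(a_i,b_j), H(i-1,j) - gap, H(i,j-1) - gap)
     * and the score is the maximum H over the whole matrix. Two rows suffice. */
    int sw_score(const char *a, const char *b, int match, int mismatch, int gap)
    {
        int la = (int)strlen(a), lb = (int)strlen(b), best = 0;
        int *prev = calloc((size_t)lb + 1, sizeof *prev);
        int *curr = calloc((size_t)lb + 1, sizeof *curr);
        for (int i = 1; i <= la; ++i) {
            curr[0] = 0;
            for (int j = 1; j <= lb; ++j) {
                int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
                int h = prev[j - 1] + s;                          /* (mis)match  */
                if (prev[j] - gap > h) h = prev[j] - gap;         /* gap in b    */
                if (curr[j - 1] - gap > h) h = curr[j - 1] - gap; /* gap in a    */
                if (h < 0) h = 0;                                 /* local floor */
                curr[j] = h;
                if (h > best) best = h;
            }
            memcpy(prev, curr, ((size_t)lb + 1) * sizeof *prev);
        }
        free(prev);
        free(curr);
        return best;
    }

    int main(void)
    {
        /* identical 7-character strings: seven matches at +2 each, score 14 */
        printf("SW score = %d\n", sw_score("GATTACA", "GATTACA", 2, -1, 1));
        return 0;
    }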

19.
Parallel programming is elusive. The relative performance of different parallel implementations varies with machine architecture, system and problem size. How to compare different implementations over a wide range of machine architectures and problem sizes has not been well addressed, owing to its difficulty. Scalability has been proposed in recent years to reveal scaling properties of parallel algorithms and machines. In this paper, the relation between scalability and execution time is carefully studied. The concepts of crossing-point analysis and range comparison are introduced. Crossing-point analysis finds slow/fast performance crossing points of parallel algorithms and machines. Range comparison compares performance over a wide range of ensemble and problem sizes via scalability and crossing-point analysis. Three algorithms from scientific computing are implemented on an Intel Paragon and an IBM SP2 parallel computer. Experimental and theoretical results show how the combination of scalability, crossing-point analysis, and range comparison provides a practical solution for scalable performance evaluation and prediction. While our tests are conducted on homogeneous parallel computers, the proposed methodology applies to heterogeneous and network computing as well.

20.

The analysis of complex biological datasets beyond DNA scenarios is gaining increasing interest in current bioinformatics. In particular, protein sequence data introduce additional complexity layers that impose new challenges from a computational perspective. This work investigates GPU solutions to address these issues in a representative algorithm from the phylogenetics field: Fitch's parsimony. GPU strategies are adopted in accordance with the protein-based formulation of the problem, defining an optimized kernel that takes advantage of data parallelism in the calculations associated with different amino acids. In order to understand the relationship between problem sizes and GPU capabilities, an extensive evaluation on a wide range of GPUs is conducted, covering all the recent NVIDIA GPU architectures—from Kepler to Turing. Experimental results on five real-world datasets point out the benefits of exploiting state-of-the-art GPUs, which represent a fitting approach to addressing the increasing hardness of protein sequence datasets.
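The per-node operation at the heart of Fitch's parsimony is small enough to show here: for each site, an internal node keeps the intersection of its children's state sets when it is non-empty, and otherwise keeps the union and counts one substitution. Below is our own scalar sketch with amino-acid state sets held as 20-bit masks; the paper's kernel evaluates many such sites in parallel on the GPU.

    #include <stdio.h>

    /* One Fitch step: combine the state sets of the two children of an internal
     * node for a single site. State sets are bitmasks over the 20 amino acids. */
    unsigned fitch_combine(unsigned left, unsigned right, unsigned *cost)
    {
        unsigned common = left & right;
        if (common) return common;  /* intersection non-empty: no substitution */
        ++*cost;                    /* disjoint sets: one substitution needed  */
        return left | right;
    }

    int main(void)
    {
        unsigned cost = 0;
        /* leaves with states {Ala} = bit 0 and {Gly} = bit 5 (arbitrary encoding) */
        unsigned parent = fitch_combine(1u << 0, 1u << 5, &cost);
        printf("parent state set 0x%x, cost %u\n", parent, cost);   /* 0x21, 1 */
        return 0;
    }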

