期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Parallel exact inference on the Cell Broadband Engine processor

Yinglong Xia Viktor K. Prasanna 《Journal of Parallel and Distributed Computing》2010

We present the design and implementation of a parallel exact inference algorithm on the Cell Broadband Engine (Cell BE) processor, a heterogeneous multicore architecture. Exact inference is a key problem in exploring probabilistic graphical models, where the computation complexity increases dramatically with the network structure and clique size. In this paper, we exploit parallelism in exact inference at multiple levels. We propose a rerooting method to minimize the critical path for exact inference, and an efficient scheduler to dynamically allocate SPEs. In addition, we explore potential table representation and layout to optimize DMA transfer between local store and main memory. We implemented the proposed method and conducted experiments on the Cell BE processor in the IBM QS20 Blade. We achieved speedup up to 10 × on the Cell, compared to state-of-the-art processors. The methodology proposed in this paper can be used for online scheduling of directed acyclic graph (DAG) structured computations. 相似文献

2.

Monte Carlo implementation of financial simulation on Cell/B.E. multi-core processor

Jonas Larsson 《Mathematics and computers in simulation》2010,81(3):578-587

The processor evolution has reached a critical moment in time where it will soon be impossible to increase the frequency much further. Processor designers such as Motorola, Intel and IBM have all realised that the only way to improve the FLOP/Watt ratio is to develop multi-core devices. One of the most current examples of multi-core processors is the new Sony/Toshiba/IBM Cell/B.E. multi-core processor. For the suitability to run in parallel, Monte Carlo methods are often considered embarrassingly parallel. This paper describes how a common Monte Carlo based financial simulation can be calculated in parallel using the Cell/B.E. multi-core processor. The measured performance with the achieved multi-core speed-up is also presented. With the recent availability of this increasingly available technology, financial simulations can now be performed in a fraction of the time it used to. This can also be achieved with a limited power and volume budget using commercially available technology. The main challenge with multi-core devices is clearly the programmability. The work presented here describes how this challenge could be dealt with.A basic MPI library has been developed to handle the partitioning and communication of data. The thread creation follows a POSIX thread creation model. MPI together with POSIX make the application portable in between various multi-processor systems and multi-core devices. The conclusions made indicate that a function offload MPI implementation on the Cell/B.E. multi-core processor can efficiently be used to speed-up the Monte Carlo solution of financial simulations. The conclusions made herein are also applicable to other situations where an algorithm can be easily parallelized. 相似文献

3.

FMM算法在Cell/B.E.处理器上实现的分析与验证

唐振张倬柴亚辉徐炜民《计算机工程与科学》2011,33(8):79

FMM算法[1]是基于树结构的,用于解决多体问题(N-Body)的经典算法。它将N-Body问题的计算复杂度由O(N2)降为O(N),并且能达到任意精度。通用CPU在计算规模较大的N-Body问题时需要耗费大量的时间。为了加速算法的执行,本文对FMM算法在Cell/B.E.处理器上的实现进行了分析与验证。首先从功能上将FMM算法分解为八个核心过程,在此基础上根据计算特点的不同,对八个核心过程进行归类,最后选取其中有代表性的核心步骤,阐述了其在Cell/B.E.上实现的可行性问题,以及部分核心步骤的设计和实现过程。实验结果表明,选定的FMM算法核心步骤在Cell/B.E.上可以获得相对通用CPU较高的加速比。相似文献

4.

An enhancer of memory and network for applications with large-capacity data and non-continuous data accessing

Noboru Tanabe Hirotaka Hakozaki Hiroshi Ando Yasunori Dohi Zhengzhe Luo Hironori Nakajo 《The Journal of supercomputing》2010,51(3):279-309

The performance of memory and I/O systems is insufficient to catch up with that of COTS (Commercial Off-The-Shelf) CPU. PC clusters using COTS CPU have been employed for HPC. A cache-based processor is far less effective than a vector processor in applications with low spatial locality. Moreover, for HPC, Google-like server farms and database processing, insufficient capacity of main memory poses a serious problem. Power consumption of a Google-like server farm or a high-end HPC PC cluster is huge. In order to overcome these problems, we propose a concept of a memory and network enhancer equipped with scatter and gather vector access functions, high-performance network connectivity, and capacity extensibility. Communication mechanisms named LHS and LHC are also proposed. LHS and LHC are architectures for reducing latency for mixed messages with small controlling data and large data body. Examples of the killer applications of this new type of hardware are presented. This paper presents not only concepts and simulations but also real hardware prototypes named DIMMnet-2 and DIMMnet-3. This paper presents the evaluations concerning memory issues and network issues. We evaluate the module with NAS CG benchmark class C and Wisconsin benchmarks as applications with memory issues. Although evaluation for CG class C is difficult with conventional cycle-accurate simulation methods, we obtained the result for class C with our original method. As a result, we find that the module can improve its maximum performance about 25 times more with Wisconsin benchmarks. However, the results on a cache-based PC show the cache-line flushing degrades acceleration ratio. This shows the high potential of the proposed extended memory module and processors in combination with DMA-based main memory access such as SPU on Cell/B.E. that does not need cache-line flushing. The LHS and LHC communication mechanisms are evaluated in this paper. The evaluations of their effects on latency are shown. 相似文献

5.

The Bottom-Up Implementation of One MILC Lattice QCD Application on the Cell Blade

Guochun Shi Volodymyr Kindratenko Steven Gottlieb 《International journal of parallel programming》2009,37(5):488-507

We report the results of the bottom-up implementation of one MILC lattice quantum chromodynamics (QCD) application on the Cell Broadband Engine™ processor. In our implementation, we preserve MILC’s framework for scaling the application to run on a large number of compute nodes and accelerate computationally intensive kernels on the Cell’s synergistic processor elements. Speedups of 3.4 × for the 8 × 8 × 16 × 16 lattice and 5.7 × for the 16 × 16 × 16 × 16 lattice are obtained when comparing our implementation of the MILC application executed on a 3.2 GHz Cell processor to the standard MILC code executed on a quad-core 2.33 GHz Intel Xeon processor. We provide an empirical model to predict application performance for a given lattice size. We also show that performance of the compute-intensive part of the application on the Cell processor is limited by the bandwidth between main memory and the Cell’s synergistic processor elements, whereas performance of the application’s parallel execution framework is limited by the bandwidth between main memory and the Cell’s power processor element. 相似文献

6.

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Julien C. Thibault Inanc Senocak 《The Journal of supercomputing》2012,59(2):693-719

Graphics processor units (GPU) that are originally designed for graphics rendering have emerged as massively-parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory hierarchy specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially. 相似文献

7.

FastNBL: fast neighbor lists establishment for molecular dynamics simulation based on bitwise operations

Li Kun Li Shigang Huang Shan Chen Yifeng Zhang Yunquan 《The Journal of supercomputing》2020,76(7):5501-5520

In the molecular dynamics simulation, an important step is the establishment of neighbor list for each particle, which involves the distance calculation for each particle pair in the simulation space. However, the distance calculation will cause costly floating-point operations. In this paper, we propose a novel algorithm, called Fast Neighbor List, which establishes the neighbor lists mainly using the bitwise operations. Firstly, we design a data layout, which uses an integer value to represent the three-dimensional coordinates of a particle. Then, a bunch of bitwise operations and two subtraction operations are used to judge whether the distance between a pair of particles is within the cutoff radius. We demonstrate that our algorithm can deal with the periodic boundary seamlessly. We also use single instruction multiple data (SIMD) instructions to further improve the performance. We implement our algorithm on Intel Xeon E5-2670, ARM v8, and Sunway many-core processors, respectively. Compared with the traditional method, our algorithm achieves on average 1.79x speedup on Intel Xeon E5-2670 processor, 3.43x speedup on ARM v8 processor, and 4.03x speedup on Sunway many-core processor. After using SIMD instructions, our algorithm achieves on average 2.64x speedup and 14.43x speedup on Intel Xeon E5-2670 and ARM v8 processors, respectively.

相似文献

8.

基于Cell多核处理器的层次化运行时支持技术

董小社冯国富王旭昊冯景华胡雷钧《计算机研究与发展》2010,47(4):561-570

基于Cell处理器的异构多核架构及软件显式管理的多级存储层次,使其面临编程困难和性能难以有效发挥等问题. 现有基于Cell/B.E.的编程模型多侧重于支持类似于流处理的“批量访存”(bulk data transfer)应用,传统非规则访存应用性能较低.通过扩展Cell/B.E.访存库增强协处理单元的自主作用,以协处理单元为中心建立Cell计算平台上的MPI和弱一致性Pthread分层并行编程运行时支持.分层的运行时支持结构及扩展后的Cell/B.E.访存库使模型具有更好的效率和可扩展性,并且提高了非规则应用的性能;模型中的MPI方便了大量传统并行应用向新架构的移植及开发,而弱一致性Pthread则为MPI提供高效的任务运行时管理支持及为系统级用户提供对架构全面控制的编程接口.实验结果表明,提出的运行时支持技术不仅可适应不同应用的要求,同时借助访存库中的剖分优化机制可有效地挖掘Cell/B.E.架构性能. 相似文献

9.

ERI sorting for emerging processor architectures

Tirath Ramdas Gregory K. Egan 《Computer Physics Communications》2009,180(8):1221-1229

Electron Repulsion Integrals (ERIs) are a common bottleneck in ab initio computational chemistry. It is known that sorted/reordered execution of ERIs results in efficient SIMD/vector processing. This paper shows that reconfigurable computing and heterogeneous processor architectures can also benefit from a deliberate ordering of ERI tasks. However, realizing these benefits as net speedup requires a very rapid sorting mechanism. This paper presents two such mechanisms. Included in this study are analytical, simulation-based, and experimental benchmarking approaches to consider five use cases for ERI sorting, i.e. SIMD processing, reconfigurable computing, limited address spaces, instruction cache exploitation, and data cache exploitation. Specific consideration is given to existing cache-based processors, FPGAs, and the Cell Broadband Engine processor. It is proposed that the analyses conducted in this work should be built upon to aid the development of software autotuners which will produce efficient ab initio computational chemistry codes for a variety of computer architectures. 相似文献

10.

Solving Linear Recurrences with Loop Raking

《Journal of Parallel and Distributed Computing》1995,25(1):91-97

We present a variation of the partition method for solving linear recurrences that is well-suited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities, and reduces the number of memory accesses and temporary memory requirements as compared to the more commonly used version of the partition method. Our variation uses a general loop restructuring technique called loop raking. We describe an implementation of this technique on the CRAY Y-MP C90, and present performance results for first- and second-order linear recurrences. On a single processor of the C90 our implementations are up to 7.3 times faster than the corresponding optimized library routines in SCILIB, an optimized mathematical library supplied by Cray Research. On four processors, we gain an additional speedup of at least 3.7. 相似文献

11.

SIMD技术与向量数学库研究 总被引：2，自引：0，他引：2

解庆春张云泉王可李焱许亚武《计算机科学》2011,38(7):298-301

首先,结合Intel, AMD和IBM处理器,介绍了单指令流多数据流(SIMD)向量化技术及其各自的特点。其次,在3种平台上对各自开发的函数库中的部分向量数学函数进行了测试。结果表明,相对传统的标量计算,向量化技术带来的加速比较高,特别是Celll SDK函数,因其独特的体系结构,多个向量处理单元带来的平均加速比为10。最后,通过测试结果的对比,发现不同数学库中的向量函数之间在性能方面也存在着差异,并对差异原因进行了分析,得出性能差异主要是处理器架构和向量计算单元个数和访存等因素造成的。相似文献

12.

Real-time disparity map computation using the cell broadband engine

Barry McCullagh 《Journal of Real-Time Image Processing》2012,7(2):87-93

In this paper, we present what we believe to be the first implementation of a disparity map computation algorithm, using a local block-based matching technique, on the Cell Broadband Engine, a new processor released by Sony, Toshiba and IBM. The paper discusses the architecture of the Cell Broadband Engine from an image processing perspective. The results of the implementation are also presented and compared to the results obtained using desktop processing devices such as CPUs and GPUs, and shows that the computational power of the Cell Broadband Engine can be used to implement image processing algorithms in real-time. 相似文献

13.

Brain Derived Vision Algorithm on High Performance Architectures

Jayram Moorkanikara Nageswaran Andrew Felch Ashok Chandrasekhar Nikil Dutt Richard Granger Alex Nicolau Alex Veidenbaum 《International journal of parallel programming》2009,37(4):345-369

Even though computing systems have increased the number of transistors, the switching speed, and the number of processors, most programs exhibit limited speedup due to the serial dependencies of existing algorithms. Analysis of intrinsically parallel systems such as brain circuitry have led to the identification of novel architecture designs, and also new algorithms than can exploit the features of modern multiprocessor systems. In this article we describe the details of a brain derived vision (BDV) algorithm that is derived from the anatomical structure, and physiological operating principles of thalamo-cortical brain circuits. We show that many characteristics of the BDV algorithm lend themselves to implementation on IBM CELL architecture, and yield impressive speedups that equal or exceed the performance of specialized solutions such as FPGAs. Mapping this algorithm to the IBM CELL is non-trivial, and we suggest various approaches to deal with parallelism, task granularity, communication, and memory locality. We also show that a cluster of three PS3s (or more) containing IBM CELL processors provides a promising platform for brain derived algorithms, exhibiting speedup of more than 140 × over a desktop PC implementation, and thus enabling real-time object recognition for robotic systems. 相似文献

14.

Cache-based Computer Systems

《Computer》1973,6(3):30-36

A cache-based computer system employs a fast, small memory -the " cache" - interposed between the usual processor and main memory. At any given time the cache contains as much as possible the instructions and data the processor needs; as new information is needed it is brought from main memory to cache, displacing old information. The processor tends to operate with a memory of cache speed but with main memory cost-per-bit. This configuration has analogies with other systems employing memory hierarchies, such as "paging" or "virtual memory" systems. In contrast with these latter, a cache is managed by hardware algorithms, deals with smaller blocks of data (32 bytes, for example, rather than 4096), provides a smaller ratio of memory access times (5:1 rather than 1000: 1), and, because of the last factor, holds the processor idle while blocks of data are being transferred from main memory to cache rather than switching to another task. These are important differences, and may suffice to make the cache-based system cost effective in many situations where paging is not. 相似文献

15.

A microprogrammable list processor for personal computers

Lee K.H. Leung K.S. Cheang S.M. 《Micro, IEEE》1990,10(4):50-61

A low-cost alternative to using a full-scale Lisp machine for list processing is proposed. The ASLP, a PC-based, highly pipelined list processor with two memory modules, supplements a procedural programming language to solve the list-manipulation problems in AI programs. This hardware-assisted processor was designed for use on an IBM PC AT to support list-manipulation functions. A discussion of Lisp and its internal structure and a review of existing Lisp machines are included. An estimation of the theoretical execution time for some typical list-manipulation functions, expressed in terms of machine cycles and memory cycles, is presented 相似文献

16.

Adaptive data-driven parallelization of multi-view video coding on multi-core processor

Yi Pang WeiDong Hu LiFeng Sun ShiQiang Yang 《中国科学F辑(英文版)》2009,52(2):195-205

Multi-view video coding (MVC) comprises rich 3D information and is widely used in new visual media,such as 3DTV and free viewpoint TV (FTV). However,even with mainstream computer manufacturers migrating to multi-core processors,the huge computational requirement of MVC currently prohibits its wide use in consumer markets. In this paper,we demonstrate the design and implementation of the first parallel MVC system on Cell Broadband EngineTM processor which is a state-of-the-art multi-core processor. We propose a task-dispatching algorithm which is adaptive data-driven on the frame level for MVC,and implement a parallel multi-view video decoder with modified H.264/AVC codec on real machine. This approach provides scalable speedup (up to 16 times on sixteen cores) through proper local store management,utilization of code locality and SIMD improvement. Decoding speed,speedup and utilization rate of cores are expressed in experimental results. 相似文献

17.

Scalable multicore architectures for long DNA sequence comparison

Friman Snchez Felipe Cabarcas Alex Ramirez Mateo Valero 《Concurrency and Computation》2011,23(17):2205-2219

Biological sequence comparison is one of the most important tasks in Bioinformatics. Owing to the fast growth of databases that contain biological information, sequence comparison represents an important challenge for high‐performance computing, especially when very long sequences are compared, i.e. the complete genome of several organisms. The Smith–Waterman (SW) algorithm is an exact method based on dynamic programming to quantify local similarity between sequences. The inherent large parallelism of the algorithm makes it ideal for architectures supporting multiple dimensions of parallelism (TLP, DLP and ILP). Concurrently, there is a paradigm shift towards chip multiprocessors in computer architecture, which offer a huge amount of potential performance that can only be exploited efficiently if applications are effectively mapped and parallelized. In this work, we analyze how large‐scale biology sequence comparison takes advantage of the current and future multicore architectures. Our starting point is the performance analysis of the current multicore IBM Cell B.E. processor; we analyze two different SW implementations on the Cell B.E. Then, using simulation tools, we study the performance scalability when a many‐core architecture is used for performing long DNA sequence comparison. We investigate the efficient memory organization that delivers the maximum bandwidth with the minimum cost. Our results show that a heterogeneous architecture can be an efficient alternative to execute challenging bioinformatic workloads. Copyright © 2011 John Wiley & Sons, Ltd. 相似文献

18.

Efficient Breadth-First Search on the Cell/BE Processor

《Parallel and Distributed Systems, IEEE Transactions on》2008,19(10):1381-1395

Multi-core processors are a shift of paradigm in computer architecture that promises a dramatic increase in performance. But they also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges involved in designing a Breadth-First Search (BFS) algorithm for the Cell/B.E. processor. The proposed methodology combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with an implementation that embeds processor-specific optimizations. Using a fine-grained global coordination strategy derived by the Bulk-Synchronous Parallel (BSP) model, we have determined an accurate performance model that has guided the implementation and the optimization of our algorithm. Our experiments on a pre-production Cell/B.E. board running at 3.2 GHz, show almost linear speedups when using multiple synergistic processing elements, and an impressive level of performance when compared to other processors. On graphs which offer sufficient parallelism, the Cell/B.E. is typically an order of magnitude faster than conventional processors, such as the AMD Opteron and the Intel Pentium 4 and Woodcrest, and custom-designed architectures, such as the MTA-2 and BlueGene/L. 相似文献

19.

Parallel Option Price Valuations with the Explicit Finite Difference Method

Alexandros V. Gerbessiotis 《International journal of parallel programming》2010,38(2):159-182

We show how computations such as those involved in American or European-style option price valuations with the explicit finite difference method can be performed in parallel. Towards this we introduce a latency tolerant parallel algorithm for performing such computations efficiently that achieves optimal theoretical speedup p, where p is the number of processor of the parallel system. An implementation of the parallel algorithm has been undertaken, and an evaluation of its performance is carried out by performing an experimental study on a high-latency PC cluster, and at a smaller scale, on a multi-core processor using in addition the SWARM parallel computing framework for multi-core processors. Our implementation of the parallel algorithm is not only architecture but also communication library independent: the same code works under LAM-MPI and Open MPI and also BSPlib, two sets of library frameworks that facilitate parallel programming. The suitability of our approach to multi-core processors is also established. 相似文献

20.

使用分布式De Bruijn图遍历基因拼接并行构建和化简

曾理成杰峰孟金涛涂志兵冯圣中《软件学报》2013,24(S2):140-149

目前基因拼接软件中应用最广泛的技术是基于De Bruijn图的基因拼接算法,需要对长达数十亿BP长度的基因组测序数据进行处理.针对海量的基因测序数据,快速、高效和可扩展的基因拼接算法非常重要.虽然已出现一些并行拼接算法（如YAGA）开始研究这些问题,但是拼接过程中时间、空间消耗较大的构图和单链化简这两大步骤在海量数据的挑战下仍然是最主要的计算瓶颈.这是因为现有工作在处理这几个步骤时通常使用了并行的表排序（list ranking）,而该方法需要多次对De Bruijn图的海量顶点信息进行分布式的排序,产生了大量的计算节点间的通信.单链化简可由1次De Bruijn 图深度优先遍历完成而不再需要表排序,于是提出一种基于分布式海量图遍历方法对单链化简进行优化,极大地减少了处理器间的通信和计算节点之间的数据移动,因而取得较好的扩展性,其算法复杂度为O（g/p）,通信复杂度为O（g）,这里g为参考序列的长度,p为处理器的核数.当对E.coli和Yeast数据集进行测试,处理器的核数从8个增加到512个时,算法可以得到13倍和10倍的加速比;当对C.elegans和人类1号染色体（chr1）数据集进行测试,处理器的核数从32个增加到512个时,算法可以得到7倍和10倍的加速比. 相似文献