期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

High performance combinatorial algorithm design on the Cell Broadband Engine processor

《Parallel Computing》2007,33(10-11):720-740

The Sony–Toshiba–IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on Cell/B.E.: we discuss list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library.List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to efficiently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation.Our second case study, zlib, is a data compression/decompression library that is extensively used in both scientific as well as general purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm and Huffman data encoding. We design efficient parallel algorithms for these combinatorial kernels, and exploit concurrency at multiple levels on the Cell/B.E. processor. We also present a Cell/B.E. optimized implementation of gzip, a popular file-compression application based on the zlib library. For our Cell/B.E. implementation of gzip, we achieve an average speedup of 2.9 in compression over current workstations. 相似文献

2.

Highly optimized implementation of OpenCV for the Cell Broadband Engine

Hiroki Sugano Ryusuke Miyamoto 《Computer Vision and Image Understanding》2010,114(11):1273-1281

Recently, real-time processing of image recognition is required for embedded applications such as automotive applications, robotics, entertainment, and so on. To realize real-time processing of image recognition on such systems we need optimized libraries for embedded processors. OpenCV is one of the most widely used libraries for computer vision applications and has many functions optimized for Intel processors, but no function is optimized for embedded processors. We present a parallel implementation of OpenCV library on the Cell Broadband Engine (Cell), which is one of the most widely used high performance embedded processors. Experimental result shows that most of the functions optimized for the Cell processor are faster than functions optimized for Intel Core 2 Duo E6850 3.00 GHz. 相似文献

3.

CellJoin: a parallel stream join operator for the cell processor

Buğra Gedik Rajesh R. Bordawekar Philip S. Yu 《The VLDB Journal The International Journal on Very Large Data Bases》2009,18(2):501-519

Low-latency and high-throughput processing are key requirements of data stream management systems (DSMSs). Hence, multi-core processors that provide high aggregate processing capacity are ideal matches for executing costly DSMS operators. The recently developed Cell processor is a good example of a heterogeneous multi-core architecture and provides a powerful platform for executing data stream operators with high-performance. On the down side, exploiting the full potential of a multi-core processor like Cell is often challenging, mainly due to the heterogeneous nature of the processing elements, the software managed local memory at the co-processor side, and the unconventional programming model in general. In this paper, we study the problem of scalable execution of windowed stream join operators on multi-core processors, and specifically on the Cell processor. By examining various aspects of join execution flow, we determine the right set of techniques to apply in order to minimize the sequential segments and maximize parallelism. Concretely, we show that basic windows coupled with low-overhead pointer-shifting techniques can be used to achieve efficient join window partitioning, column-oriented join window organization can be used to minimize scattered data transfers, delay-optimized double buffering can be used for effective pipelining, rate-aware batching can be used to balance join throughput and tuple delay, and finally single-instruction multiple-data (SIMD) optimized operator code can be used to exploit data parallelism. Our experimental results show that, following the design guidelines and implementation techniques outlined in this paper, windowed stream joins can achieve high scalability (linear in the number of co-processors) by making efficient use of the extensive hardware parallelism provided by the Cell processor (reaching data processing rates of ≈13 GB/s) and significantly surpass the performance obtained form conventional high-end processors (supporting a combined input stream rate of 2,000 tuples/s using 15 min windows and without dropping any tuples, resulting in ≈8.3 times higher output rate compared to an SSE implementation on dual 3.2 GHz Intel Xeon). 相似文献

4.

The function processor: A data-driven processor array for irregular computations

Jesper Vasell Jonas Vasell 《Future Generation Computer Systems》1992,8(4):321-335

相似文献

5.

Performance of a Lattice Quantum Chromodynamics kernel on the Cell processor

J. Spray A. Trew 《Computer Physics Communications》2008,179(9):642-646

The implementation of a proof-of-concept Lattice Quantum Chromodynamics kernel on the Cell processor is described in detail, illustrating issues encountered in the porting process. The resulting code performs up to 45 GFlop/s per socket (without inter-node parallel communications), indicating that the Cell processor is likely to be a good platform for future Lattice QCD calculations. 相似文献

6.

Implementation of mixed precision in solving systems of linear equations on the Cell processor

Jakub Kurzak Jack Dongarra 《Concurrency and Computation》2007,19(10):1371-1385

This paper describes the design concepts behind implementations of mixed‐precision linear algebra routines targeted for the Cell processor. It describes in detail the implementation of code to solve linear system of equations using Gaussian elimination in single precision with iterative refinement of the solution to the full double‐precision accuracy. By utilizing this approach the algorithm achieves close to an order of magnitude higher performance on the Cell processor than the performance offered by the standard double‐precision algorithm. The code is effectively an implementation of the high‐performance LINPACK benchmark, as it meets all of the requirements concerning the problem being solved and the numerical properties of the solution. Copyright © 2007 John Wiley & Sons, Ltd. 相似文献

7.

The Bottom-Up Implementation of One MILC Lattice QCD Application on the Cell Blade

Guochun Shi Volodymyr Kindratenko Steven Gottlieb 《International journal of parallel programming》2009,37(5):488-507

We report the results of the bottom-up implementation of one MILC lattice quantum chromodynamics (QCD) application on the Cell Broadband Engine™ processor. In our implementation, we preserve MILC’s framework for scaling the application to run on a large number of compute nodes and accelerate computationally intensive kernels on the Cell’s synergistic processor elements. Speedups of 3.4 × for the 8 × 8 × 16 × 16 lattice and 5.7 × for the 16 × 16 × 16 × 16 lattice are obtained when comparing our implementation of the MILC application executed on a 3.2 GHz Cell processor to the standard MILC code executed on a quad-core 2.33 GHz Intel Xeon processor. We provide an empirical model to predict application performance for a given lattice size. We also show that performance of the compute-intensive part of the application on the Cell processor is limited by the bandwidth between main memory and the Cell’s synergistic processor elements, whereas performance of the application’s parallel execution framework is limited by the bandwidth between main memory and the Cell’s power processor element. 相似文献

8.

Monte Carlo implementation of financial simulation on Cell/B.E. multi-core processor

Jonas Larsson 《Mathematics and computers in simulation》2010,81(3):578-587

The processor evolution has reached a critical moment in time where it will soon be impossible to increase the frequency much further. Processor designers such as Motorola, Intel and IBM have all realised that the only way to improve the FLOP/Watt ratio is to develop multi-core devices. One of the most current examples of multi-core processors is the new Sony/Toshiba/IBM Cell/B.E. multi-core processor. For the suitability to run in parallel, Monte Carlo methods are often considered embarrassingly parallel. This paper describes how a common Monte Carlo based financial simulation can be calculated in parallel using the Cell/B.E. multi-core processor. The measured performance with the achieved multi-core speed-up is also presented. With the recent availability of this increasingly available technology, financial simulations can now be performed in a fraction of the time it used to. This can also be achieved with a limited power and volume budget using commercially available technology. The main challenge with multi-core devices is clearly the programmability. The work presented here describes how this challenge could be dealt with.A basic MPI library has been developed to handle the partitioning and communication of data. The thread creation follows a POSIX thread creation model. MPI together with POSIX make the application portable in between various multi-processor systems and multi-core devices. The conclusions made indicate that a function offload MPI implementation on the Cell/B.E. multi-core processor can efficiently be used to speed-up the Monte Carlo solution of financial simulations. The conclusions made herein are also applicable to other situations where an algorithm can be easily parallelized. 相似文献

9.

Automatic Synthesis of FPGA Processor Arrays from Loop Algorithms

Marcus Bednara Jürgen Teich 《The Journal of supercomputing》2003,26(2):149-165

We consider the problem of automatic mapping of computation-intensive loop nests onto FPGA hardware. The regular cell array structure of these chips reflects the parallelism in regular loop-like computations. Furthermore, the flexibility of FPGAs allows the cost-effective implementation of reconfigurable high performance processor arrays. So far, there exists no continuous design flow that allows automated generation of FPGA configuration data from a loop nest specified in a high level language. Here, we present a methodology for automatic generation of synthesizable VHDL code specifying a processor array and optimized for FPGA implementation. 相似文献

10.

Parallel exact inference on the Cell Broadband Engine processor

Yinglong Xia Viktor K. Prasanna 《Journal of Parallel and Distributed Computing》2010

We present the design and implementation of a parallel exact inference algorithm on the Cell Broadband Engine (Cell BE) processor, a heterogeneous multicore architecture. Exact inference is a key problem in exploring probabilistic graphical models, where the computation complexity increases dramatically with the network structure and clique size. In this paper, we exploit parallelism in exact inference at multiple levels. We propose a rerooting method to minimize the critical path for exact inference, and an efficient scheduler to dynamically allocate SPEs. In addition, we explore potential table representation and layout to optimize DMA transfer between local store and main memory. We implemented the proposed method and conducted experiments on the Cell BE processor in the IBM QS20 Blade. We achieved speedup up to 10 × on the Cell, compared to state-of-the-art processors. The methodology proposed in this paper can be used for online scheduling of directed acyclic graph (DAG) structured computations. 相似文献

11.

Real-time disparity map computation using the cell broadband engine

Barry McCullagh 《Journal of Real-Time Image Processing》2012,7(2):87-93

In this paper, we present what we believe to be the first implementation of a disparity map computation algorithm, using a local block-based matching technique, on the Cell Broadband Engine, a new processor released by Sony, Toshiba and IBM. The paper discusses the architecture of the Cell Broadband Engine from an image processing perspective. The results of the implementation are also presented and compared to the results obtained using desktop processing devices such as CPUs and GPUs, and shows that the computational power of the Cell Broadband Engine can be used to implement image processing algorithms in real-time. 相似文献

12.

Multiband image classification with a distributed architecture

Ian J Curington Stephen E Cannon 《Image and vision computing》1985,3(2):80-84

Much research has gone into specialized hardware for remotely sensed imagery analysis applications, particularly in the use of Landsat data for feature classification analysis. The paper outlines a particular multiband classifier and shows expected performance using the FPS-5000 array processor. The advantage of distributed resources are shown in an optimized implementation of the algorithm in a particular processing environment. 相似文献

13.

一种基于AHB总线的高速存储访问机制的设计与实现

晏东佟冬《小型微型计算机系统》2006,27(8):1574-1579

系统芯片中的高速IO设备一般需要直接存储访问（DMA）功能的支持，以减轻处理器的负担，提高系统性能和IO性能．考察了片上系统总线和DMA访问机制的特性，设计实现了一种基于AHB总线的高速存储访问机制，采用专门的接口支持以太网络接口与系统主存之间的数据传输．在总线接口的设计中提出了总线访问的优化策略，并给出了一种确定FIFO设计参数的分析方法．实验结果表明，该访问机制提高了数据传输速率，有效地支持系统芯片的网络应用．相似文献

14.

Real-time blind audio source separation: performance assessment on an advanced digital signal processor

Danilo Pani Alessandro Pani Luigi Raffo 《The Journal of supercomputing》2014,70(3):1555-1576

Many on-line blind audio source separation (BASS) algorithms have been presented so far to the scientific community, but only a few of them have been evaluated in terms of their real-time performance. In this paper we consider a well-established BASS method (oriented to voices separation) evaluating its performance in terms of separation quality allowed by a real-time embedded computing implementation, also considering novel and state-of-the-art improvements to the it. To this aim, the algorithm has been implemented and ported for real-time execution onto an advanced low-power digital signal processor targeted for complex-domain applications. The optimized embedded implementation is able to perform up to five iterations of the gradient for any input frame of data, achieving good separation levels (up to 11.8 dB of signal to interference ratio on custom recording in real environments). The proposed solution doubles the performance of a C-optimized version running on a traditional PC processor, achieving a better separation result with lower power requirements. 相似文献

15.

基于硬件多线程网络处理器功耗可控无线局域网MAC协议实现

王磊张晓彤张艳丽王沁《计算机应用研究》2012,29(7):2624-2628

针对在如何在提高网络吞吐率并满足实时性需求的同时消耗更少的功耗的问题,以硬件多线程网络处理为平台,以IEEE 802.11MAC层协议为例,通过对MAC层数据流的模式、数据流上的操作行为以及时间约束进行建模并测试分析,提出一种多线程化网络协议的软件实现方法;配合动态功耗可控的多线程网络处理器能够根据流量和实时性自适应地调整系统的性能。实验结果证明,异构多线程结构程序在实时性任务时五个软件线程需四个硬件线程支持,而无实时性任务只需两个硬件线程支持。提出的多线程MAC层协议编程模型能够达到根据网络负载特征动态控制处理器性能的目的。相似文献

16.

SMA:前瞻性多线程体系结构 总被引：4，自引：1，他引：3

肖刚周兴铭徐明邓鹍《计算机学报》1999,22(6):582-590

提出了一种新的ＩＬＰ处理器体系结构－前瞻性多线程体系的结构,简称ＳＭＡ．它结合了前瞻性执行机制和多线程执行机制,以整个线程为长步进行前瞻性执行,多个线程并行执行并且共享处理器硬件资源,这样,处理器既通过组合每个线程的指令窗口形成一个大的动态指令窗口,开发出程序中更大的ＩＬＰ,又利用多线程执行机制屏蔽各种长延迟操作,达到较高的资源利用率;介绍了ＳＭＡ执行模型,并讨论了ＳＭＡ处理器的实现和其中的关键技相似文献

17.

Cell处理器访存特征研究

郑义邓林窦勇《计算机工程与科学》2012,34(11):72

Cell处理器是一款异构多核处理器,拥有强大的计算能力。但是,在进行应用并行化时,却受到本地存储器容量、访存带宽和数据传输延时等的限制。DMA传输是隐藏长延时、提高存储带宽利用率的有效方法。本文在分析Cell处理器结构基础上,进行了一系列详细的DMA测试,并利用指数拟合技术得到DMA平均带宽模型,发现参与DMA传输的SPE数量和每次DMA传输规模是影响DMA访存带宽的主要因素。相似文献

18.

并行DSP小波图象压缩的实现研究

刘杰程建平李政《计算机应用研究》2001,18(8):23-24

小波图象压缩逄法是基于多分辨率分析上的一种压缩方法,它能够根据人们的视觉效应对图象信息进行针对性的处理,达到高效压缩图象的目的。小波变换算法的工程实现需要高性能处理器的支持,并行DSP（Digital Signal Processor)处理器TMS320C80就具有强大的图象处理功能,研究了在并行DSPTMS320C80上实现小波变换图象压缩的快速算法、优化程序设计及并行处理方式,并加以实现。相似文献

19.

On parallel scan-conversion algorithms for transputer networks

《Journal of Microcomputer Applications》1990,13(1):43-55

Two fundamental scan line procedures are considered from the point of view of parallel implementation on transputer networks. The paper describes algorithms for the scan conversion of polygons, and the related hidden surface elimination problem for three-dimensional models with polygonal data structures. In each case parallel designs have been implemented and timed in various configurations against a functionally equivalent, but not necessarily optimized, single processor code. The results presented demonstrate that scan line algorithms can be efficiently implemented on suitably configured networks of transputers. 相似文献

20.

Adaptive data-driven parallelization of multi-view video coding on multi-core processor

Yi Pang WeiDong Hu LiFeng Sun ShiQiang Yang 《中国科学F辑(英文版)》2009,52(2):195-205

Multi-view video coding (MVC) comprises rich 3D information and is widely used in new visual media,such as 3DTV and free viewpoint TV (FTV). However,even with mainstream computer manufacturers migrating to multi-core processors,the huge computational requirement of MVC currently prohibits its wide use in consumer markets. In this paper,we demonstrate the design and implementation of the first parallel MVC system on Cell Broadband EngineTM processor which is a state-of-the-art multi-core processor. We propose a task-dispatching algorithm which is adaptive data-driven on the frame level for MVC,and implement a parallel multi-view video decoder with modified H.264/AVC codec on real machine. This approach provides scalable speedup (up to 16 times on sixteen cores) through proper local store management,utilization of code locality and SIMD improvement. Decoding speed,speedup and utilization rate of cores are expressed in experimental results. 相似文献