期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Porting Estimation of Distribution Algorithms to the Cell Broadband Engine

Carlos Pérez-Miguel Jose Miguel-Alonso Alexander Mendiburu 《Parallel Computing》2010,36(10-11):618-634

Current consumer-grade computers and game devices incorporate very powerful processors that can be used to accelerate many classes of scientific codes. In this paper we explore the ability of the Cell Broadband Engine to run two similar Estimation of Distribution Algorithms, one for the discrete domain and the other for the continuous domain. Starting from initial, sequential versions, we develop multi-threaded programs for symmetric multiprocessors that are afterwards reworked to run on a Cell-based system. In most cases, the parallel programs significantly accelerate execution times, compared with the sequential counterparts. Additional acceleration is achieved using vector (instead of scalar) operations, which are supported by all the tested platforms. We describe the process of parallelizing and porting the programs, and analyze the results obtained taking into account the EDAs under study, the problems solved with them, and the platform in which programs run. We conclude that EDAs are not right targets to be ported to the Cell Broadband Engine. 相似文献

2.

Fast SIMDized Kalman filter based track fit

S. Gorbunov U. Kebschull V. Lindenstruth 《Computer Physics Communications》2008,178(5):374-383

Modern high energy physics experiments have to process terabytes of input data produced in particle collisions. The core of many data reconstruction algorithms in high energy physics is the Kalman filter. Therefore, the speed of Kalman filter based algorithms is of crucial importance in on-line data processing. This is especially true for the combinatorial track finding stage where the Kalman filter based track fit is used very intensively. Therefore, developing fast reconstruction algorithms, which use maximum available power of processors, is important, in particular for the initial selection of events which carry signals of interesting physics.One of such powerful feature supported by almost all up-to-date PC processors is a SIMD instruction set, which allows packing several data items in one register and to operate on all of them, thus achieving more operations per clock cycle. The novel Cell processor extends the parallelization further by combining a general-purpose PowerPC processor core with eight streamlined coprocessing elements which greatly accelerate vector processing applications.In the investigation described here, after a significant memory optimization and a comprehensive numerical analysis, the Kalman filter based track fitting algorithm of the CBM experiment has been vectorized using inline operator overloading. Thus the algorithm continues to be flexible with respect to any CPU family used for data reconstruction.Because of all these changes the SIMDized Kalman filter based track fitting algorithm takes 1 μs per track that is 10000 times faster than the initial version. Porting the algorithm to a Cell Blade computer gives another factor of 10 of the speedup.Finally, we compare performance of the tracking algorithm running on three different CPU architectures: Intel Xeon, AMD Opteron and Cell Broadband Engine. 相似文献

3.

Parallelization and performance comparison of the conjugate gradient equation solver on multicore Cell and Xeon computers

Fadi N. Sibai Mohammad Saad Hashir K. Kidwai 《Concurrency and Computation》2011,23(18):2463-2467

Multicore accelerators are used today to supplement traditional superscalar processors in massively parallel computer nodes with extra floating‐point computation power. This paper presents our parallelization and performance enhancement and evaluation of the conjugate gradient (CG) linear equation solver with enhanced matrix multiplication on the Cell Broadband Engine accelerator. The paper also compares the CG performance results on the Cell and two CG implementations on a computer with two quadcore Xeon processors, one with OpenMP and the other with OpenMPI. We also report the enhancements made on the CG code and performance analysis of CG on single and dual Cell Broadband Engine packages with 8 and 16 synergistic processing elements and on Xeon for heptadiagonal matrices, in particular to matrix multiplication and synchronization. We also report the communication and computation time breakdowns and the floating point operations per second ratio. Our parallel CG solver is shown to scale well with data size, grid dimensionality, and number of cores. Copyright © 2011 John Wiley & Sons, Ltd. 相似文献

4.

Real-time disparity map computation using the cell broadband engine

Barry McCullagh 《Journal of Real-Time Image Processing》2012,7(2):87-93

In this paper, we present what we believe to be the first implementation of a disparity map computation algorithm, using a local block-based matching technique, on the Cell Broadband Engine, a new processor released by Sony, Toshiba and IBM. The paper discusses the architecture of the Cell Broadband Engine from an image processing perspective. The results of the implementation are also presented and compared to the results obtained using desktop processing devices such as CPUs and GPUs, and shows that the computational power of the Cell Broadband Engine can be used to implement image processing algorithms in real-time. 相似文献

5.

A high‐performance sorting algorithm for multicore single‐instruction multiple‐data processors

Hiroshi Inoue Takao Moriyama Hideaki Komatsu Toshio Nakatani 《Software》2012,42(6):753-777

Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single‐instruction multiple‐data (SIMD) instructions and thread‐level parallelism. In this paper, we propose a new high‐performance sorting algorithm, called aligned‐access sort (AA‐sort), that exploits both the SIMD instructions and thread‐level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in‐core sorting phase and an out‐of‐core merging phase. The in‐core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out‐of‐core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA‐sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA‐sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32‐bit integers. Also, a parallel version of AA‐sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms. Copyright © 2011 John Wiley & Sons, Ltd. 相似文献

6.

High performance combinatorial algorithm design on the Cell Broadband Engine processor

《Parallel Computing》2007,33(10-11):720-740

The Sony–Toshiba–IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on Cell/B.E.: we discuss list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library.List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to efficiently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation.Our second case study, zlib, is a data compression/decompression library that is extensively used in both scientific as well as general purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm and Huffman data encoding. We design efficient parallel algorithms for these combinatorial kernels, and exploit concurrency at multiple levels on the Cell/B.E. processor. We also present a Cell/B.E. optimized implementation of gzip, a popular file-compression application based on the zlib library. For our Cell/B.E. implementation of gzip, we achieve an average speedup of 2.9 in compression over current workstations. 相似文献

7.

Automatic analysis of DMA races using model checking and <Emphasis Type="Italic">k</Emphasis>-induction

Alastair F. Donaldson Daniel Kroening Philipp Rümmer 《Formal Methods in System Design》2011,39(1):83-113

Modern multicore processors, such as the Cell Broadband Engine, achieve high performance by equipping accelerator cores with small “scratch-pad” memories. The price for increased performance is higher programming complexity – the programmer must manually orchestrate data movement using direct memory access (DMA) operations. Programming using asynchronous DMA operations is error-prone, and DMA races can lead to nondeterministic bugs which are hard to reproduce and fix. We present a method for DMA race analysis in C programs. Our method works by automatically instrumenting a program with assertions modeling the semantics of a memory flow controller. The instrumented program can then be analyzed using state-of-the-art software model checkers. We show that bounded model checking is effective for detecting DMA races in buggy programs. To enable automatic verification of the correctness of instrumented programs, we present a new formulation of k-induction geared towards software, as a proof rule operating on loops. Our techniques are implemented as a tool, Scratch, which we apply to a large set of programs supplied with the IBM Cell SDK, in which we discover a previously unknown bug. Our experimental results indicate that our k-induction method performs extremely well on this problem class. To our knowledge, this marks both the first application of k-induction to software verification, and the first example of software model checking in the context of heterogeneous multicore processors. 相似文献

8.

Efficient Algorithms for Block-Cyclic Redistribution of Arrays

Y. W. Lim P. B. Bhat V. K. Prasanna 《Algorithmica》1999,24(3-4):298-330

The block-cyclic data distribution is commonly used to organize array elements over the processors of a coarse-grained distributed memory parallel computer. In many scientific applications, the data layout must be reorganized at run-time in order to enhance locality and reduce remote memory access overheads. In this paper we present a general framework for developing array redistribution algorithms. Using this framework, we have developed efficient algorithms that redistribute an array from one block-cyclic layout to another. Block-cyclic redistribution consists of index set computation , wherein the destination locations for individual data blocks are calculated, and data communication , wherein these blocks are exchanged between processors. The framework treats both these operations in a uniform and integrated way. We have developed efficient and distributed algorithms for index set computation that do not require any interprocessor communication. To perform data communication in a conflict-free manner, we have developed direct indirect and hybrid algorithms. In the direct algorithm, a data block is transferred directly to its destination processor. In an indirect algorithm, data blocks are moved from source to destination processors through intermediate relay processors. The hybrid algorithm is a combination of the direct and indirect algorithms. Our framework is based on a generalized circulant matrix formalism of the redistribution problem and a general purpose distributed memory model of the parallel machine. Our algorithms sustain excellent performance over a wide range of problem and machine parameters. We have implemented our algorithms using MPI, to allow for easy portability across different HPC platforms. Experimental results on the IBM SP-2 and the Cray T3D show superior performance over previous approaches. When the block size of the cyclic data layout changes by a factor of K , the redistribution can be performed in O( log K) communication steps. This is true even when K is a prime number. In contrast, previous approaches take O(K) communication steps for redistribution. Our framework can be used for developing scalable redistribution libraries, for efficiently implementing parallelizing compiler directives, and for developing parallel algorithms for various applications. Redistribution algorithms are especially useful in signal processing applications, where the data access patterns change significantly between computational phases. They are also necessary in linear algebra programs, to perform matrix transpose operations. Received June 1, 1997; revised March 10, 1998. 相似文献

9.

Railway Computation for Infinite Linear Systems

Paola Favati Grazia Lotti Ornella Menchi Francesco Romani 《International journal of parallel programming》2002,30(6):419-439

The problem of solving an infinite system of linear equations finitely expressed is addressed. Modifications of the Gauss–Seidel method are presented, especially suitable for the implementation on SMP machines with a small number of processors. One of the proposed parallel algorithms, which concentrates the computational efforts where they are most needed, results to be more efficient than the sequential algorithm, even from the point of view of the total number of operations. 相似文献

10.

Parallel exact inference on the Cell Broadband Engine processor

Yinglong Xia Viktor K. Prasanna 《Journal of Parallel and Distributed Computing》2010

We present the design and implementation of a parallel exact inference algorithm on the Cell Broadband Engine (Cell BE) processor, a heterogeneous multicore architecture. Exact inference is a key problem in exploring probabilistic graphical models, where the computation complexity increases dramatically with the network structure and clique size. In this paper, we exploit parallelism in exact inference at multiple levels. We propose a rerooting method to minimize the critical path for exact inference, and an efficient scheduler to dynamically allocate SPEs. In addition, we explore potential table representation and layout to optimize DMA transfer between local store and main memory. We implemented the proposed method and conducted experiments on the Cell BE processor in the IBM QS20 Blade. We achieved speedup up to 10 × on the Cell, compared to state-of-the-art processors. The methodology proposed in this paper can be used for online scheduling of directed acyclic graph (DAG) structured computations. 相似文献

11.

Optimal Parallel Algorithms for Quadtree Problems

《CVGIP: Image Understanding》1994,59(3):281-285

In this paper we describe optimal processor-time parallel algorithms for set operations such as union, intersection, comparison on quadtrees. The algorithms presented in this paper run in O(log N) time using N/log N processors on a shared memory model of computation that allows concurrent reads or writes. Consequently they allow us to achieve optimal speedups using any number of processors up to N/log N. The approach can also be used to derive optimal processor-time parallel algorithms for weaker models of parallel computation. 相似文献

12.

Highly optimized implementation of OpenCV for the Cell Broadband Engine

Hiroki Sugano Ryusuke Miyamoto 《Computer Vision and Image Understanding》2010,114(11):1273-1281

Recently, real-time processing of image recognition is required for embedded applications such as automotive applications, robotics, entertainment, and so on. To realize real-time processing of image recognition on such systems we need optimized libraries for embedded processors. OpenCV is one of the most widely used libraries for computer vision applications and has many functions optimized for Intel processors, but no function is optimized for embedded processors. We present a parallel implementation of OpenCV library on the Cell Broadband Engine (Cell), which is one of the most widely used high performance embedded processors. Experimental result shows that most of the functions optimized for the Cell processor are faster than functions optimized for Intel Core 2 Duo E6850 3.00 GHz. 相似文献

13.

ERI sorting for emerging processor architectures

Tirath Ramdas Gregory K. Egan 《Computer Physics Communications》2009,180(8):1221-1229

Electron Repulsion Integrals (ERIs) are a common bottleneck in ab initio computational chemistry. It is known that sorted/reordered execution of ERIs results in efficient SIMD/vector processing. This paper shows that reconfigurable computing and heterogeneous processor architectures can also benefit from a deliberate ordering of ERI tasks. However, realizing these benefits as net speedup requires a very rapid sorting mechanism. This paper presents two such mechanisms. Included in this study are analytical, simulation-based, and experimental benchmarking approaches to consider five use cases for ERI sorting, i.e. SIMD processing, reconfigurable computing, limited address spaces, instruction cache exploitation, and data cache exploitation. Specific consideration is given to existing cache-based processors, FPGAs, and the Cell Broadband Engine processor. It is proposed that the analyses conducted in this work should be built upon to aid the development of software autotuners which will produce efficient ab initio computational chemistry codes for a variety of computer architectures. 相似文献

14.

Characterizing the basic synchronization and communication operations in Dual Cell-based Blades through CellStats

José L. Abellán Juan Fernández Manuel E. Acacio 《The Journal of supercomputing》2010,53(2):247-268

The Cell Broadband Engine (Cell BE) is a heterogeneous chip-multiprocessor (CMP) architecture to offer very high performance, especially on game and multimedia applications. The singularity of its architecture, nine cores of two different types, along with the variety of synchronization and communication primitives offered to programmers, make the task of developing efficient applications very challenging. This situation gets even worse when dual Cell-based blade platforms are considered, where two separate Cells can be linked together through a dedicated high-speed interface. In this work, we present a characterization of the main synchronization and communication primitives provided to programmers in the context of a dual Cell-based blade under varying workloads through our CellStats tool. In particular, we focus on the DMA transfer mechanism, the mailboxes, the signals, the read-modify-write atomic operations, and the time taken by thread creation. Our performance results expose the bottlenecks and asymmetries of these platforms, which must be taken into account by programmers for choosing the most adequate primitives to improve the efficiency of their applications. 相似文献

15.

几乎最快与渐近最优的并行分枝界限算法

武继刚计永昶陈国良《软件学报》2000,11(12):1572-1580

分枝界限算法是求解组合优化问题的技术之一,它被广泛地应用在埃运筹学与组合数学中.对共享存储的最优优先一般并行分枝界限算法给出了运行时间复杂度下界Ω(m/p+hlogp),其中p为可用处理器数,h为扩展的结点数,m为状态空间中的活结点数.通过将共享存器设计成p个立体堆,提出了PRAM-EREW上一个新的一般并行分枝界限算法,理论上证明了对于h<p^2p,该算法为最快且渐近最优的并行分枝界限算法.最后对0-r背包问题给出了模拟实验结果. 相似文献

16.

Parallel algorithms for bipartite matching problems on distributed memory computers

Johannes Langguth Md. Mostofa Ali PatwaryFredrik Manne 《Parallel Computing》2011,37(12):820-845

We present a new parallel algorithm for computing a maximum cardinality matching in a bipartite graph suitable for distributed memory computers.The presented algorithm is based on the Push-Relabel algorithm which is known to be one of the fastest algorithms for the bipartite matching problem. Previous attempts at developing parallel implementations of it have focused on shared memory computers using only a limited number of processors.We first present a straightforward adaptation of these shared memory algorithms to distributed memory computers. However, this is not a viable approach as it requires too much communication. We then develop our new algorithm by modifying the previous approach through a sequence of steps with the main goal being to reduce the amount of communication and to increase load balance. The first goal is achieved by changing the algorithm so that many push and relabel operations can be performed locally between communication rounds and also by selecting augmenting paths that cross processor boundaries infrequently. To achieve good load balance, we limit the speed at which global relabelings traverse the graph. In several experiments on a large number of instances, we study weak and strong scalability of our algorithm using up to 128 processors.The algorithm can also be used to find ?-approximate matchings quickly. 相似文献

17.

Performance,optimization, and fitness: Connecting applications to architectures

Mohammad A. Bhuiyan Melissa C. Smith Vivek K. Pallipuram 《Concurrency and Computation》2011,23(10):1066-1100

Recent trends involving multicore processors and graphical processing units (GPUs) focus on exploiting task‐ and thread‐level parallelism. In this paper, we have analyzed various aspects of the performance of these architectures including NVIDIA GPUs, and multicore processors such as Intel Xeon, AMD Opteron, IBM's Cell Broadband Engine. The case study used in this paper is a biological spiking neural network (SNN), implemented with the Izhikevich, Wilson, Morris–Lecar, and Hodgkin–Huxley neuron models. The four SNN models have varying requirements for communication and computation making them useful for performance analysis of the hardware platforms. We report and analyze the variation of performance with network (problem size) scaling, available optimization techniques and execution configuration. A Fitness performance model, that predicts the suitability of the architecture for accelerating an application, is proposed and verified with the SNN implementation results. The Roofline model, another existing performance model, has also been utilized to determine the hardware bottleneck(s) and attainable peak performance of the architectures. Significant speedups for the four SNN neuron models utilizing these architectures are reported; the maximum speedup of 574x was observed in our GPU implementation. Our results and analysis show that a proper match of architecture with algorithm complexity provides the best performance. Copyright © 2010 John Wiley & Sons, Ltd. 相似文献

18.

Lock-free deques and doubly linked lists

Håkan Sundell Philippas Tsigas 《Journal of Parallel and Distributed Computing》2008

We present a practical lock-free shared data structure that efficiently implements the operations of a concurrent deque as well as a general doubly linked list. The implementation supports parallelism for disjoint accesses and uses atomic primitives which are available in modern computer systems. Previously known lock-free algorithms of doubly linked lists are either based on non-available atomic synchronization primitives, only implement a subset of the functionality, or are not designed for disjoint accesses. Our algorithm only requires single-word compare-and-swap atomic primitives, supports fully dynamic list sizes, and allows traversal also through deleted nodes and thus avoids unnecessary operation retries. We have performed an empirical study of our new algorithm on two different multiprocessor platforms. Results of the experiments performed under high contention show that the performance of our implementation scales linearly with increasing number of processors. Considering deque implementations and systems with low concurrency, the algorithm by Michael shows the best performance. However, as our algorithm is designed for disjoint accesses, it performs significantly better on systems with high concurrency and non-uniform memory architecture. 相似文献

19.

A Highly Concurrent Priority Queue

《Journal of Parallel and Distributed Computing》1994,22(2):367-373

We present a highly concurrent priority queue algorithm based on the B-link tree, which is a B⁺-tree in which every node has a pointer to its right sibling. The algorithm is built on the concurrent B-link tree algorithms. Since the priority queue is based on highly concurrent search structure algorithms, a large number of insert operations can execute concurrently with little or no interference. We first present an algorithm that executes deletemin operations serially. We extend the serialized-deletemin algorithm to allow both parallel and concurrent deletemin operations. We discuss a decisive operation serializable algorithm that permits concurrent deletemin operations, and an algorithm in which p processors cooperate to perform p deletemin operations in O(log p) time. 相似文献

20.

Parallel VLSI test in a shared‐memory multiprocessor

C. Gil J. Ortega M. G. Montoya 《Concurrency and Computation》2000,12(5):311-326

This paper presents three parallel procedures implemented in a shared‐memory multiprocessor to generate the patterns that allow the testing of digital circuits. The implementation of these procedures in a multiprocessor uses the system memory better than in a distributed‐memory multicomputer, since it is not necessary to store the circuit structure in the local memory of each processor, besides other common structures. The parallel test generation procedures are based on a new sequential algorithm which mixes both the Boolean difference and digital spectral techniques. It is thus different from other methods proposed that deal with the parallelization of test generation algorithms that carry out an implicit enumeration of the input pattern space. The first procedure distributes the set of faults using a backtracking procedure starting from a primary output and allocating a similar number of lines to each processor. The second procedure distributes the set of faults among the processors taking into account the distance from each line to its nearest primary output; it then applies the algorithm to generate the test pattern with some modifications. The third procedure uses a circuit partitioning procedure which allows similar sized parts of the circuit to be assigned to each processor while communications between processors are minimized. The experimental results obtained when the procedures are applied to the usual benchmark circuits (the ISCAS set) show figures for speedup better than in a multicomputer, although fewer processors are used. Copyright © 2000 John Wiley & Sons, Ltd. 相似文献