期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Pipelined data parallel algorithms-II: design

King C.-T. Chou W.-H. Ni L.M. 《Parallel and Distributed Systems, IEEE Transactions on》1990,1(4):486-499

For pt.I see ibid., p.470-85. A methodology for designing pipelined data-parallel algorithms on multicomputers is studied. The design procedure starts with a sequential algorithm which can be expressed as a nested loop with constant loop-carried dependencies. The procedure's main focus is on partitioning the loop by grouping related iterations together. Grouping is necessary to balance the communication overhead with the available parallelism and to produce pipelined execution patterns, which result in pipelined data-parallel computations. The grouping should satisfy dependence relationships among the iterations and also allow the granularity to be controlled. Various properties of grouping are studied, and methods for generating communication-efficient grouping are given. Given a grouping and an assignment of the groups to the processors, an analytic model is combined with the grouping results to describe the behavior and to estimate the performance of the resultant parallel program. Expressions characterizing the performance are derived 相似文献

2.

Hypercube algorithms on mesh connected multicomputers

de Cerio L.D. Valero-Garcia M. Gonzalez A. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(12):1247-1260

A new methodology named CALMANT (CC-cube Algorithms on Meshes and Tori) for mapping a type of algorithm that we call CC-cube algorithm onto multicomputers with hypercube, mesh, or torus interconnection topology is proposed. This methodology is suitable when the initial problem can be expressed as a set of processes that communicate through a hypercube topology (a CC-cube algorithm). There are many important algorithms that fit into the CC-cube type. CALMANT is based on three different techniques: (a) the standard embedding to assign the processes of the algorithm to the nodes of the mesh multicomputer; (b) the communication pipelining technique to increase the level of communication parallelism inherent in the CC-cube algorithms; and (c) optimal message-scheduling algorithms proposed in this work in order to avoid conflicts and minimizing in this way the communication time. Although CALMANT is proposed for multicomputers with different interconnection network topologies, the paper only focuses on the particular case of meshes. 相似文献

3.

Low Communication Overhead Jacobi Algorithms for Eigenvalues Computation on Hypercubes

Royo Dolors González Antonio Valero-García Miguel 《The Journal of supercomputing》1999,14(2):171-193

This paper presents some novel algorithms to implement the Jacobi method in hypercube multicomputers. The algorithms are based on the one-sided Jacobi method and use new Jacobi orderings, which are one of the contributions of this paper. The second contribution of this paper is the use of a systematic algorithm transformation, which is referred to as communication pipelining, and is aimed at reducing the communication overhead by exploiting parallelism in the communication operations. The proposed schemes are evaluated by means of analytical models and compared with previous proposals. The results show a significant reduction in the communication overhead, which in some cases can be by a factor almost as great as the number of dimensions of the hypercube. 相似文献

4.

IA-64中软件流水的寄存器需求研究 总被引：1，自引：0，他引：1

林海波李文龙汤志忠《计算机研究与发展》2004,41(1):22-27

软件流水是开发循环程序指令级并行性的重要方法之一，IA-64是支持软件流水的EPIC体系结构，通过对NAS Benchmarks中可软件流水循环所需的寄存器进行量化分析，提出了一种限制循环展开因子的启发式算法，有效地解决了因可用寄存器不足而导致软件流水失败的问题，并提高了应用程序的执行速度。相似文献

5.

An evaluation of relational join algorithms in a pipelined queryprocessing environment

Mikkilineni K.P. Su S.Y.W. 《IEEE transactions on pattern analysis and machine intelligence》1988,14(6):838-848

A query processing strategy which is based on pipelining and data-flow techniques is presented. Timing equations are developed for calculating the performance of four join algorithms: nested block, hash, sort-merge, and pipelined sort-merge. They are used to execute the join operation in a query in distributed fashion and in pipelined fashion. Based on these equations and similar sets of equations developed for other relational algebraic operations, the performance of query execution was evaluated using the different join algorithms. The effects of varying the values of processing time, I/O time, communication time, buffer size, and join selectively on the performance of the pipelined join algorithms are investigated. The results are compared to the results obtained by employing the same algorithms for executing queries using the distributed processing approach which does not exploit the vertical concurrency of the pipelining approach. These results establish the benefits of pipelining 相似文献

6.

FuPerMod: a software tool for the optimization of data-parallel applications on heterogeneous platforms

David Clarke Ziming Zhong Vladimir Rychkov Alexey Lastovetsky 《The Journal of supercomputing》2014,69(1):61-69

Optimization of data-parallel applications for modern HPC platforms requires partitioning the computations between the heterogeneous computing devices in proportion to their speed. Heterogeneous data partitioning algorithms are based on computation performance models of the executing platforms. Their implementation is not trivial as it requires: accurate and efficient benchmarking of computing devices, which may share resources and/or execute different codes; appropriate interpolation methods to predict performance; and advanced mathematical methods to solve the data partitioning problem. In this paper, we present FuPerMod, a software tool that addresses these implementation issues and automates the development of data partitioning code in data-parallel applications for heterogeneous HPC platforms. 相似文献

7.

Efficient mapping and implementation of matrix algorithms on a hypercube

Vladimir Cherkassky Ross Smith 《The Journal of supercomputing》1988,2(1):7-27

It is well known that parallelism by itself does not lead to higher speeds. This study shows how to put parallelism to best use, that is, how to find an optimal balance between communication and computation overheads for two parallel matrix algorithms. The problem graph for matrix algorithms analyzed in this paper is a two-dimensional grid (toroidal mesh) which is mapped onto a hypercube topology. To perform matrix operations on a hypercube, a matrix is partitioned into several submatrices which are stored and manipulated in the nodes. We seek to find an optimal matrix partitioning to minimize overall execution time. The NCUBE parallel machine is used for experimental performance evaluation. For matrix multiplication, we derive an exact analytical model to determine the optimal partitioning size and perform its experimental verification on the NCUBE parallel processor. For a parallel Gaussian elimination known as the balanced algorithm, we present performance measurements and an approximate analytical model for performance evaluation. Our analyses show that the optimal submatrix size is typically small and does not depend on the original matrix size. 相似文献

8.

Improving Performance of Dynamic Programming via Parallelism and Locality on Multicore Architectures

Guangming Tan Ninghui Sun Gao G.R. 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(2):261-274

Dynamic programming (DP) is a popular technique which is used to solve combinatorial search and optimization problems. This paper focuses on one type of DP, which is called nonserial polyadic dynamic programming (NPDP). Owing to the nonuniform data dependencies of NPDP, it is difficult to exploit either parallelism or locality. Worse still, the emerging multi/many-core architectures with small on-chip memory make these issues more challenging. In this paper, we address the challenges of exploiting the fine grain parallelism and locality of NPDP on multicore architectures. We describe a latency-tolerant model and a percolation technique for programming on multicore architectures. On an algorithmic level, both parallelism and locality do benefit from a specific data dependence transformation of NPDP. Next, we propose a parallel pipelining algorithm by decomposing computation operators and percolating data through a memory hierarchy to create just-in-time locality. In order to predict the execution time, we formulate an analytical performance model of the parallel algorithm. The parallel pipelining algorithm achieves not only high scalability on the 160-core IBM Cyclops64, but portable performance as well, across the 8-core Sun Niagara and quad-cores Intel Clovertown. 相似文献

9.

A Computation+Communication Load Balanced Loop Partitioning Method for Distributed Memory Systems

Santosh Pande Tareq Bali 《Journal of Parallel and Distributed Computing》1999,58(3):251

Due to a significant communication overhead of sending and receiving data, the loop partitioning approaches on distributed memory systems must guarantee not just the computation load balance but computation+communication load balance. The previous approaches in loop partitioning have achieved a communication-free, computation load balanced iteration space partitioning solution for a limited subset of DOALL loops. But a large category of DOALL loops inevitably result in communication and the trade-offs between computation and communication must be carefully analyzed for these loops in order to balance out the combined computation time and communication overheads. In this work, we describe a partitioning approach based on the above motivation for the general cases of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. Our approach first performs partitioning of iteration and data spaces of a loop nest by analyzing communication and parallelism; it then performs architecture-dependent analysis to adjust the granularity of partitions, load balance each partition with respect to total computation+communication, and then performs mapping of partitions onto the available number of processors. This multiphase partitioning method works as follows. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and reused, eliminating a larger communication volume than parallelism. We then perform data space partitioning based on a new larger partition owns rule to minimize the communication overhead for a compute intensive partition by localizing its references relatively more than a smaller noncompute intensive partition. A partition interaction graph is then constructed which is used by the architecture-dependent analysis phase to merge the partitions to achieve granularity adjustment, computation+communication load balance, and mapping on the actual number of available processors. Relevant theory and algorithms are developed along with a performance evaluation on the Cray T3D. 相似文献

10.

Coarse-grained thread pipelining: a speculative parallel executionmodel for shared-memory multiprocessors

Kazi I.H. Lilja D.J. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(9):952-966

This paper presents a new parallelization model, called coarse-grained thread pipelining, for exploiting speculative coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the fine-grained thread pipelining model proposed for the superthreaded architecture, allows concurrent execution of loop iterations in a pipelined fashion with runtime data-dependence checking and control speculation. The speculative execution combined with the runtime dependence checking allows the parallelization of a variety of program constructs that cannot be parallelized with existing runtime parallelization algorithms. The pipelined execution of loop iterations in this new technique results in lower parallelization overhead than in other existing techniques. We evaluated the performance of this new model using some real applications and a synthetic benchmark. These experiments show that programs with a sufficiently large grain size compared to the parallelization overhead obtain significant speedup using this model. The results from the synthetic benchmark provide a means for estimating the performance that can be obtained from application programs that will be parallelized with this model. The library routines developed for this thread pipelining model are also useful for evaluating the correctness of the codes generated by the superthreaded compiler and in debugging and verifying the simulator for the superthreaded processor 相似文献

11.

Trace Software Pipelining

下载免费PDF全文

Wang Jian Andreas Krall M.Anton Ertl 《计算机科学技术学报》1995,10(6):481-490

Global software pipelining is a complex but efficient compilation technique to exploit instruction-level parallelism for loops with branches.This paper presents a novel global software pipelining technique,called Trace Software Pipelining,targeted to the instruction-level parallel processors such as Very Long Instruction Word (VLIW) and superscalar machines.Trace software pipelining applies a global code scheduling technique to compact the original loop body.The resulting loop is called a trace software pipelined (TSP) code.The trace softwrae pipelined code can be directly executed with special architectural support or can be transformed into a globally software pipelined loop for the current VLIW and superscalar processors.Thus,exploiting parallelism across all iterations of a loop can be completed through compacting the original loop body with any global code scheduling technique.This makes our new technique very promising in practical compilers.Finally,we also present the preliminary experimental results to support our new approach. 相似文献

12.

Exploiting fine-grain parallelism on dataflow architectures

Guang R. Gao 《Parallel Computing》1990,13(3):309-320

Although dataflow computers have many attractive features, skepticism exists concerning their efficiency in handling arrays (vectors) in high performance scientific computation. This paper outlines an efficient implementation scheme for arrays in applicative languages (such as VAL and SISAL) based on the principles of dataflow software pipelining. It illustrates how the fine-grain parallelism of dataflow approach can effectively handle large amount of data structured in applicative array operations. This is done through dataflow software pipelining between pairs of code blocks which act as producer-consumer of array values. To make effective use of the pipelined code mapping scheme, a compiler needs information concerning the overall program structure as well as the structure of each code block. An applicative language provides a basis for such analysis.

The program transformation techniques described here are developed primarily for the computationally intensive part of a scientific numerical program, which is usually formed by one or a few clusters of acyclic connected code blocks. Each code block defines an array value from several input arrays. We outline how mapping decisions of arrays can be based on a global analysis of attributes of the code blocks. We emphasize the role of overall program structure and the strategy of global optimization of the machine code structure. The structure of a proposed dataflow compiler based on the scheme described in this paper is outlined. 相似文献

13.

Executing DSP applications in a fine-grained dataflow environment

Radivojevic I.P. Herath J. 《IEEE transactions on pattern analysis and machine intelligence》1991,17(10):1028-1041

An experimental approach is chosen to investigate the performance of a fine-grained dataflow architecture for numerically intensive digital signal processing (DSP) applications. The focus is on the behavior of pipelined data-parallel algorithms. However, the granularity of the high-level language programming blocks is not explicitly optimized to balance computation and communication; a natural and logical fine-grained decomposition of problems is used instead. The authors interpret their empirical data by means of parameters such as a number of instructions per generic unit of computation, a density of precedence relations, and a serial fraction. The performance and limitations of fine-grained general-purpose dataflow computing are discussed 相似文献

14.

Processor Array Synthesis from Shift-Variant Deep Nested Do Loops

Kittitornkun Surin Hu Yu Hen 《The Journal of supercomputing》2003,24(3):229-249

The consolidation of Internet devices into a universal/portable device will soon be accomplishable through the incorporation of reconfigurable computing in system-on-a-chip (SOC). At any particular moment, it could be a video/audio mobile phone, an MP3 song player, and other devices. The basic construct of these multimedia processing algorithms can be described as deep nested Do loop algorithms. They are considered the most demanding data-intensive algorithms and hence ideal candidates for an array of reconfigurable nanoprocessors. Therefore, algorithm to hardware synthesis methodology is important for an efficient exploitation of both spatial parallelism and temporal pipelining. In this paper, we propose a processor array synthesis methodology. It can map an n-level nested Do loop represented by a nonuniform or shift-variant data dependence graph to a near-optimal of one-or two-dimensional processor array under the available resource constraints to satisfy high-throughput computation demands. 相似文献

15.

On the design and performance of conventional pipelined architectures

Nigel P. Topham Amos Omondi Roland N. Ibbett 《The Journal of supercomputing》1988,1(4):353-393

Pipelining is a widely used technique for implementing architectures that have inherent temporal parallelism when there is an operational requirement for high throughput. Many variations on the basic theme have been proposed, with varying degrees of success. The aim of this paper is to present a critical review of conventional pipelined architectures and put some well-known problems in sharp relief. It is argued that conventional pipelined architectures have underlying limitations that can only be dealt with by adopting a different view of pipelining. These limitations are explained in terms of discontinuities in the flow of instructions and data, and representative machines are examined in support of this argument. In a companion paper [Topham, Omondi and Ibbett, 1988] we examine an alternative approach to the design of pipelined architectures and introduce an alternative theory of pipelining, which we call Context Flow. 相似文献

16.

A neural networks algorithm for data path synthesis

Haidar M. Harmanani^{Author Vitae} 《Computers & Electrical Engineering》2003,29(4):535-551

This paper presents a deterministic parallel algorithm to solve the data path allocation problem in high-level synthesis. The algorithm is driven by a motion equation that determines the neurons firing conditions based on the modified Hopfield neural network model of computation. The method formulates the allocation problem using the clique partitioning problem, an NP-complete problem, and handles multicycle functional units as well as structural pipelining. The algorithm has a running time complexity of O(1) for a circuit with n operations and c shared resources. A sequential simulator was implemented on a Linux Pentium PC under X-Windows. Several benchmark examples have been implemented and favorable design comparisons to other synthesis systems are reported. 相似文献

17.

分布内存系统中节点间软流水优化技术

陈莉张兆庆冯晓兵《计算机科学》2002,29(11):24-28

1.引言目前分布式处理系统的并行编译技术还处于研究阶段,对于实际应用程序还难以获得高性能。在程序中充分发掘并行性和极小化通信的开销是需要协同考虑的两个方面。考虑全局数据分布和计算分割时,在应用程序中仍然可能存在这样一些情况:在程序中按全局优化数据分布策略,某数组采用一种数据分布方式会取得好的效果,但对某些循环而言,该数相似文献

18.

面向密码流处理器的AES算法软件流水实现方法

王寿成徐进辉严迎建李功丽贾永旺《计算机应用》2017,37(6):1620-1624

针对轮函数在分组密码实现过程中耗时过长的问题,提出了面向可重构密码流处理器（RCSP）的高级加密标准（AES）算法软件流水实现方法。该方法将轮函数操作划分为若干流水段,不同流水段对应不同的并行密码资源,通过并行执行多个轮函数的不同流水段,从而开发指令级并行性提高轮函数执行速度,进而提升分组密码的执行性能。在RCSP的单簇、双簇和四簇运算资源下分析了AES算法的流水线划分过程和软件流水映射方法,实验结果表明,该软件流水实现方法使得单分组或多分组不同数据分块的操作并行执行,不仅能够提升单分组串行执行性能,还能够通过开发分组间的并行性来提高多分组并行执行性能。相似文献

19.

CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms

Lee D Dinov I Dong B Gutman B Yanovsky I Toga AW 《Computer methods and programs in biomedicine》2012,106(3):175-187

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. 相似文献

20.

基于数据空间融合的全局计算与数据划分方法 总被引：2，自引：1，他引：2

夏军杨学军《软件学报》2004,15(9):1311-1327

计算与数据划分问题是影响并行程序在分布主存多处理机中执行性能的重要因素,也是并行编译优化的重点.针对该问题,提出了一套关于数据空间融合的理论框架,并基于该框架给出了一种有效的全局计算与数据划分方法,用于分布主存计算环境中的计算与数据划分问题的求解.该方法能够尽量开发计算空间的并行度,利用数据融合技术优化数据分布,并能搜寻优化的全局计算与数据划分.该方法还能很自然地与数据复制以及偏移常量的对准结合在一起,从而使得数据通信量尽可能地小.实验结果表明了所提出方法的有效性. 相似文献