Similar Documents
20 similar documents found (search time: 15 ms)
1.
Loop skewing is a new procedure to derive the wavefront method of execution of nested loops. The wavefront method is used to execute nested loops on parallel and vector computers when none of the loops can be done in vector mode. Loop skewing is a simple transformation of loop bounds and is combined with loop interchanging to generate the wavefront. This derivation is particularly suitable for implementation in compilers that already perform automatic detection of parallelism and generation of vector and parallel code, such as are available today. Loop normalization, a loop transformation used by several vectorizing translators, is related to loop skewing, and we show how loop normalization, applied blindly, can adversely affect the parallelism detected by these translators.
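As a rough illustration of the idea (not code from the paper), the sketch below shows a two-dimensional recurrence in which neither loop is parallel, and the same computation after skewing the inner loop by the outer one and interchanging the two loops; the new inner loop runs along a wavefront and its iterations are independent. The sizes N and M, the array a, and the maxi/mini helpers are assumptions made for the example.

/* Illustrative sketch (not from the paper): loop skewing plus loop
 * interchange turn a doubly-sequential recurrence into a wavefront
 * whose inner loop is parallel. */
#include <stdio.h>

#define N 6
#define M 8
static int maxi(int a, int b) { return a > b ? a : b; }
static int mini(int a, int b) { return a < b ? a : b; }

int main(void) {
    static double a[N][M] = {0};
    for (int j = 0; j < M; j++) a[0][j] = 1.0;   /* boundary values */
    for (int i = 0; i < N; i++) a[i][0] = 1.0;

    /* Original nest: a[i][j] depends on a[i-1][j] and a[i][j-1],
     * so neither the i loop nor the j loop can run in vector mode.
     * Skewing j by i (k = i + j) and interchanging gives an outer
     * sequential loop over wavefronts k and an inner loop whose
     * iterations are mutually independent. */
    for (int k = 2; k <= N + M - 2; k++)              /* wavefront number  */
        for (int i = maxi(1, k - M + 1); i <= mini(N - 1, k - 1); i++) {
            int j = k - i;                            /* recover original j */
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }

    printf("a[N-1][M-1] = %g\n", a[N - 1][M - 1]);
    return 0;
}

A compiler that already detects parallelism can then mark the inner i loop for vector or parallel execution.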

2.
Wavefront parallelism, in which parallelism is limited to hyperplanes in an iteration space, can arise when compilers apply tiling to loop nests to enhance locality. Previous approaches for scheduling wavefront parallelism focused on maximizing parallelism, balancing workloads, and reducing synchronization. In this paper, we show that on large-scale shared-memory multiprocessors, locality is a crucial factor. We make the distinction between intratile and intertile locality and show that as the number of processors grows, intertile locality becomes more important. We consider and experimentally evaluate existing strategies for scheduling wavefront parallelism. We show that dynamic self-scheduling can be efficiently used on a small number of processors, but performs poorly at large scale because it does not enhance intertile locality. By contrast, static scheduling strategies enhance intertile locality for small tiles, maintaining parallelism and resulting in better performance at large scale. Results from a Convex SPP1000 multiprocessor demonstrate the importance of taking intertile locality into account. Static scheduling outperforms dynamic self-scheduling by a factor of up to 2.3 on 30 processors.
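A minimal sketch of one static strategy consistent with this idea (not the paper's implementation): tiles on the same anti-diagonal are independent, and assigning tile rows to threads in a fixed, cyclic way keeps each thread on the same rows from one wavefront to the next, which is what promotes intertile locality. The tile-grid dimensions TI and TJ and the routine process_tile() are hypothetical.

/* Illustrative sketch: static, row-cyclic scheduling of a tiled wavefront.
 * Tile (ti,tj) depends on (ti-1,tj) and (ti,tj-1), so tiles on the same
 * anti-diagonal d = ti + tj are independent. */
#include <omp.h>

extern void process_tile(int ti, int tj);    /* hypothetical tile body */

void wavefront(int TI, int TJ) {             /* TI x TJ grid of tiles   */
    for (int d = 0; d < TI + TJ - 1; d++) {  /* wavefronts, in order    */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int nth = omp_get_num_threads();
            /* Thread tid always handles tile rows tid, tid+nth, ...,
             * so the data of those rows stays with the same thread
             * from one wavefront to the next (intertile locality). */
            for (int ti = tid; ti < TI; ti += nth) {
                int tj = d - ti;
                if (tj >= 0 && tj < TJ)
                    process_tile(ti, tj);
            }
        }   /* implicit barrier acts as the wavefront synchronization */
    }
}

A dynamic self-scheduler would instead hand out whichever diagonal tile is free next, losing this row-to-thread affinity.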

3.
Partitioning and mapping of nested loops for linear array multicomputers
In distributed-memory multicomputers, minimizing interprocessor communication is the key to the efficient execution of parallel programs. In order to reduce the amount of communication overhead, parallel programs on multicomputers must be carefully scheduled by parallelizing compilers. This paper proposes some compilation techniques for partitioning and mapping nested loops with constant data dependences onto linear array multicomputers. First, a systematic partition strategy is proposed to project an n-dimensional computational structure, representing an n-nested loop, onto a line to form a one-dimensional projected structure with low communication overhead. Then, a mapping algorithm is proposed for mapping the partitioned loops onto linear arrays in a way that balances the workload and minimizes the communication cost among processors. Finally, parallel execution codes can be automatically generated for such linear array multicomputers.

4.
This paper presents a novel approach for the problem of generating tiled code for nested for-loops, transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multilevel memory hierarchies, as well as to efficiently execute loops on parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler task, especially when nonrectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops, by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and, second, sweep all points within each tile. For the first subproblem, we refine upon previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a nonunimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a nonunimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code.

5.
This paper presents a method for parallelising nested loops with affine dependences. The data dependences of a program are represented exactly using a dependence matrix rather than an imprecise dependence abstraction. By a careful analysis of the eigenvectors and eigenvalues of the dependence matrix, we detect the parallelism inherent in the program, partition the iteration space of the program into sequential and parallel regions, and generate parallel code to execute these regions. For a class of programs considered in the paper, the proposed method can expose more coarse-grain and fine-grain parallelism than a hyperplane-based loop transformation.

6.
The parallelism of loop nests with non-uniform dependences is difficult to extract and is not effectively exploited by existing parallelization schemes. In this paper, we propose new, efficient techniques for extracting the parallelism of loop nests with non-uniform dependences by exploiting their irregularity. In this way, current highly parallel multiprocessor systems, such as multithreaded and clustering multiprocessor systems, can be fully utilized. The four mechanisms are (a) parallelization part splitting, (b) partial parallelization decomposition, (c) irregular loop interchange, and (d) growing pattern detection. They exploit the parallelism of special parallel patterns in nested loops with non-uniform dependences. The loop transformations used for uniform loops are also applied to non-uniform dependence loops after legality tests. We apply results from classical convex theory to detect special parallel patterns of dependence vectors. We also propose an algorithm that combines the above mechanisms to enhance parallelism. We demonstrate that our technique gives much better speedup and extracts more parallelism than existing techniques, and these improvements encourage us to pursue further development.

7.
In this paper we address the problem of partitioning nested loops with non-uniform (irregular) dependence vectors. Parallelizing and partitioning of nested loops requires efficient inter-iteration dependence analysis. Although many methods exist for nested loop partitioning, most of them perform poorly when parallelizing nested loops with irregular dependences. Unlike nested loops with uniform dependences, these have a complicated dependence pattern which forms a non-uniform dependence vector set. We apply the results of classical convex theory and principles of linear programming to iteration spaces and show the correspondence between minimum dependence distance computation and iteration space tiling. Cross-iteration dependences are analyzed by forming an Integer Dependence Convex Hull (IDCH). Every integer point in this IDCH corresponds to a dependence vector in the iteration space of the nested loops. A simple way to compute minimum dependence distances from the dependence distance vectors of the extreme points of the IDCH is presented. Using these minimum dependence distances, the iteration space can be tiled. Iterations within a tile can be executed in parallel, and the different tiles can then be executed with proper synchronization. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques.
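A minimal sketch of how a minimum dependence distance enables tiling, shown for the one-dimensional case under stated assumptions (this is an illustration, not the paper's algorithm): if every dependence carried by the loop has distance at least dmin, then no dependence can connect two iterations inside a tile of width dmin, so intra-tile iterations run in parallel while the tiles themselves execute in order. The routine body(), the bounds lo and hi, and dmin are hypothetical.

/* Illustrative sketch: tiling by a precomputed minimum dependence
 * distance dmin; tiles run sequentially, iterations inside a tile
 * run in parallel. */
#include <omp.h>

extern void body(int i);        /* hypothetical loop body */

void tiled_execute(int lo, int hi, int dmin) {
    for (int t = lo; t < hi; t += dmin) {            /* tiles: sequential   */
        int end = (t + dmin < hi) ? t + dmin : hi;
        #pragma omp parallel for                     /* intra-tile: parallel */
        for (int i = t; i < end; i++)
            body(i);
        /* implicit barrier between tiles provides the synchronization */
    }
}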

8.
The multigroup, discrete ordinates representation for the linear transport equation enjoys widespread computational use and popularity. Serial solution schemes and numerical algorithms developed over the years provide a timely framework for parallel extension. On the Denelcor HEP, we investigate the parallel structure and extension of a number of standard Sn approaches. Concurrent inner sweeps, coupled acceleration techniques, synchronized inner-outer loops, and chaotic iteration are described, and results of computations are contrasted. The multigroup representation and serial iteration methods are also detailed. The basic iterative Sn method lends itself to parallel tasking, portably affording an effective medium for performing transport calculations on future architectures. This analysis represents a first attempt to extend serial Sn algorithms to parallel environments and provides good baseline estimates on ease of parallel implementation, relative algorithm efficiency, comparative speedup, and some future directions. We find that basic inner-outer and chaotic iteration strategies both easily support comparably high degrees of parallelism. Both accommodate parallel rebalance and diffusion acceleration and appear to be robust and viable parallel techniques for Sn production work.

9.
In this paper we describe how to apply fine-grain parallelism to augmenting path algorithms for the dense linear assignment problem. We show by implementation that the technique we suggest can be used efficiently on commercially available, massively parallel computers. Using n processors, our method reduces the computational complexity from the sequential O(n³) to the parallel complexity of O(n²). Exhaustive experiments are performed on a MasPar MP-2 in order to determine which of the algorithmic flavors fits best on this kind of architecture.

10.
This paper presents an efficient algorithm that automatically generates a parallel program from a dependence-based representation of a sequential program. The resulting parallel program consists of nested fork-join constructs, composed from the loops and statements of the sequential program. Data dependences are handled by two techniques. One technique implicitly satisfies them by sequencing, thereby reducing parallelism. Where increased parallelism results, the other technique eliminates them by privatization: the introduction of process-specific private instances of variables. Additionally, the algorithm determines when copying values of such instances in and out of nested parallel constructs results in greater parallelism. This is the first algorithm for automatically generating parallelism for such a general model. The algorithm generates as much parallelism as is possible in our model while minimizing privatization. An earlier version of this paper was presented at the First Workshop on Languages and Compilers for Vector and Parallel Machines, which was held at Cornell University in August 1988. That same year a select group of these workshop papers were published in two special issues of the journal: volume 2, numbers 2 and 3.
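A minimal sketch of privatization in a fork-join loop, written in OpenMP notation purely for brevity (an assumption; the paper targets its own fork-join model, and the function and array names here are hypothetical): the scalar t is written and read in every iteration, so sharing it would force sequencing, while giving each process its own private instance removes the dependence.

/* Illustrative sketch: privatizing a scalar temporary to expose
 * loop parallelism. */
#include <omp.h>

void scale(double *a, const double *b, int n) {
    double t;                              /* if shared, t creates a      */
                                           /* dependence between all      */
                                           /* iterations                  */
    #pragma omp parallel for private(t)    /* each process gets its own   */
    for (int i = 0; i < n; i++) {          /* private instance of t       */
        t = 2.0 * b[i];                    /* written before every use    */
        a[i] = t * t;
    }
}

Copying a private instance into or out of the construct (in OpenMP terms, firstprivate/lastprivate-style copies) corresponds to the copy-in/copy-out decision the abstract mentions.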

11.
刘晓娴  赵荣彩  赵捷  徐金龙 《软件学报》2014,25(6):1154-1168
Exploiting the parallelism inherent in DOACROSS loops and choosing an appropriate strategy to execute them in parallel is very important for improving the parallel performance of programs. Pipelined execution is an important way to parallelize regular DOACROSS loops. Automatically generating pipelined parallel code with good performance is difficult: when parallelizing programs automatically, parallelizing compilers often treat DOACROSS loops conservatively, losing the parallelism they contain and limiting the parallel performance of the program. To address this problem, this paper designs a heuristic algorithm for selecting the loop levels used for computation partitioning and for loop tiling, gives a formula for computing the tile size based on a cost model of pipelined parallel execution, and uses counting semaphores to synchronize the parallel threads, thereby implementing automatic generation of OpenMP-based pipelined parallel code for regular DOACROSS loops. Tests on the wavefront loops of finite difference relaxation (FDR), typical loops of the finite difference time domain (FDTD) method, and the programs Poisson, LU, and Jacobi show that the automatically generated pipelined code achieves a clear performance improvement on multi-core processors, and that the tile-size formula computes the optimal tile size for pipelined loop execution fairly accurately. Compared with pipelined parallel code using manually selected optimal tile sizes, the automatically generated pipelined code reaches 89% of the manually tuned speedup.
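A minimal sketch of the kind of pipelined code described here (an illustration, not the tool's actual output): rows of a regular DOACROSS loop are distributed cyclically over OpenMP threads, the inner loop is tiled with size B, and POSIX counting semaphores make a thread wait until the owner of the previous row has finished the same column tile. N, M, B, MAXTHREADS and the array a[][] are hypothetical; in the paper the tile size would come from the cost-model formula.

/* Illustrative sketch: pipelined (wavefront) execution of a regular
 * DOACROSS loop with OpenMP threads and counting semaphores.
 * Boundary row 0 and column 0 of a[][] are assumed initialized. */
#include <omp.h>
#include <semaphore.h>

#define N 1024
#define M 1024
#define B 64
#define MAXTHREADS 256

static double a[N][M];
static sem_t ready[MAXTHREADS];   /* ready[t]: tiles completed for thread t */

void doacross_pipeline(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();

        #pragma omp single
        for (int t = 0; t < nth; t++)
            sem_init(&ready[t], 0, 0);            /* implicit barrier follows */

        for (int i = 1 + tid; i < N; i += nth) {          /* my rows          */
            for (int jt = 1; jt < M; jt += B) {            /* column tiles     */
                if (i > 1)
                    sem_wait(&ready[tid]);        /* wait: row i-1, same tile */
                int jend = (jt + B < M) ? jt + B : M;
                for (int j = jt; j < jend; j++)
                    a[i][j] = a[i - 1][j] + a[i][j - 1];
                sem_post(&ready[(tid + 1) % nth]);    /* signal row i+1 owner */
            }
        }
    }
}

Because the semaphores count, the owner of row i-1 may run several tiles ahead without blocking, which is what gives the pipeline its overlap.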

12.
A bipartite graph is vertex-bipancyclic (respectively, edge-bipancyclic) if every vertex (respectively, edge) lies in a cycle of every even length from 4 to |V(G)| inclusive. It is easy to see that every connected edge-bipancyclic graph is vertex-bipancyclic. An n-dimensional hypercube, or n-cube, denoted by Qn, is well known to be bipartite and is one of the most efficient networks for parallel computation. In this paper, we study a stronger bipancyclicity of hypercubes. We prove that every n-dimensional hypercube is (2n−4)-path-bipancyclic for n ≥ 3. That is, for any path P of length k with 1 ≤ k ≤ 2n−4 and any integer l with max{2,k} ≤ l ≤ 2^(n−1), an even cycle C of length 2l can be found in Qn such that the path P is included in C for n ≥ 3.

13.
Ming Hsiang Huang  Wuu Yang 《Software》2020,50(10):1877-1904
OpenACC is a directive-based programming model which allows programmers to write graphics processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices for expressing nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of the memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that would be (possibly recursively) called in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Thread blocks dynamically organize loop iterations into batches and execute the batches in a depth-first order. Different thread blocks share iterations among one another with an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.

14.
We present simple randomized algorithms for parallel priority queues on distributed memory machines. Inserting O(n) elements or deleting the O(n) out of m smallest elements using n processors requires O(Tcoll + log(m/n)) amortized time with high probability, where Tcoll bounds the time for performing prefix sums and randomized routing. The memory requirement is bounded by (m/n)(1 + o(1)) + O(log n) whp. These bounds are an improvement over the best previously known algorithms for many interconnection networks and even match the speed of the best known PRAM algorithms. Generalizations for accessing the kn smallest elements are even more efficient. A portable implementation using MPI demonstrates that our approach is already useful for medium scale parallelism. Two parallel selection algorithms for randomly placed data are a spin-off. One runs in time O(Tcoll) with high probability, beating a lower bound for the worst case. The other requires only a single reduction operation.

15.
In massively parallel computers (MPCs), efficient communication among processors is critical to performance. This paper describes the initial implementation of the ComPaSS communication library to support scalable software development in MPCs. ComPaSS provides high-level global communication operations for both data manipulation and process control, many of which are based upon a small set of low-level communication primitives. The low-level operations of the ComPaSS library are provably optimal for a class of architectures representative of many commercial scalable systems, in particular those using wormhole routing and n-dimensional mesh network topologies. This paper concentrates on the multicast and multireceive components of the ComPaSS library, which are fundamental to implementing efficient high-level data parallel operations. The design of the multicast and multireceive primitives is described and an example of a data parallel application utilizing ComPaSS multicast is given. The scalability of these primitives is discussed, and improvements in performance resulting from use of the library on a 64-node nCUBE-2 are presented.

16.
In this paper, to obtain an efficient parallel algorithm for solving sparse block-tridiagonal linear systems, stair matrices are used to construct some parallel polynomial approximate inverse preconditioners. These preconditioners are suitable when the desired goal is to maximize parallelism. Moreover, some theoretical results concerning these preconditioners are presented, and how to construct preconditioners effectively for any nonsingular block-tridiagonal H-matrix is also described. In addition, the validity of these preconditioners is illustrated with some numerical experiments arising from second-order elliptic partial differential equations and oil reservoir simulations.

17.
18.
A data structure called strips is described for representing linked lists, which enables unit-time access to random list elements. Running parallel prefix on strips effectively converts a list into an array. When combined with nondeterministic statement sequencing and data operations, loops performing iterations over lists, as well as insertions and deletions on lists, can be parallelized, yielding very efficient algorithms. The strips-based representation also allows efficient serial operations on lists, which is important both when loops cannot be parallelized and when there is more parallelism than processors. This work was supported in part under ONR Grant N00014-86-K-0215 and under NSF Grant DCR-8503610.

19.
Self-assembly is a generalization of crystal growth and has been proposed as a mechanism for bottom-up fabrication in autonomous DNA computation. In the same context, the tile assembly model is a highly distributed, parallel model of natural self-assembly. In this paper, we propose a tile assembly system to tackle the well-known NP-complete Minimum Vertex Cover problem. The proposed algorithm requires Θ(n×m) types of tiles, and each parallel assembly executes in linear time, where n is the number of vertices and m is the number of edges. Furthermore, the experimental results demonstrate the simplicity and efficiency of the proposed algorithm for solving Minimum Vertex Cover, reducing the overall complexity of finding the solution.

20.
With the development of map surveying and mapping technology, the volume of raw map data has grown explosively, placing higher demands on the space and time efficiency of map data processing. To meet these demands, this paper proposes a two-level parallel processing technique for vector tiles that parallelizes both vector-data tiling (tile cutting) and vector-tile uploading. Experimental results show that this parallel processing technique significantly improves the efficiency of vector-data tiling and vector-tile uploading, and that the proposed method is feasible for the efficient processing of massive vector data.
