Linear loop transformations and tiling are known to be very effective for enhancing locality of reference in perfectly-nested loops. However, they cannot be applied directly to imperfectly-nested loops. Some compilers attempt to convert imperfectly-nested loops into perfectly-nested loops by using statement sinking, loop fusion, etc., and then apply locality-enhancing transformations to the resulting perfectly-nested loops, but the approaches used are fairly ad hoc and may fail even for simple programs. In this paper, we present a systematic approach for synthesizing transformations to enhance locality in imperfectly-nested loops. The key idea is to embed the iteration space of each statement into a special iteration space called the product space. The product space can be viewed as a perfectly-nested loop nest, so embedding generalizes techniques like statement sinking and loop fusion, which are used in ad hoc ways in current compilers to produce perfectly-nested loops from imperfectly-nested ones. In contrast to these ad hoc techniques, however, our embeddings are chosen carefully to enhance locality. The product space can itself be transformed to increase locality further, after which fully permutable loops can be tiled. The final code generation step may produce imperfectly-nested loops as output if that is desirable. We present experimental evidence for the effectiveness of this approach, using dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks.
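
As a rough illustration of the statement sinking mentioned in this abstract (not of the product-space embedding itself), the sketch below moves a statement that sits between two loop headers into the inner loop under a guard, giving a perfectly-nested loop with the same result. The function names and the row-sum example are assumptions made for illustration only.

```python
def row_sums_imperfect(a):
    """Imperfectly-nested version: the initialization sits between the loops."""
    sums = [None] * len(a)
    for i in range(len(a)):
        sums[i] = 0                  # statement at loop depth 1
        for j in range(len(a[i])):
            sums[i] += a[i][j]       # statement at loop depth 2
    return sums


def row_sums_sunk(a):
    """Perfectly-nested version: the initialization is sunk into the inner
    loop and guarded so that it runs only on the first inner iteration."""
    sums = [None] * len(a)
    for i in range(len(a)):
        for j in range(len(a[i])):
            if j == 0:               # guard introduced by statement sinking
                sums[i] = 0
            sums[i] += a[i][j]
    return sums


data = [[1, 2], [3, 4, 5]]
print(row_sums_imperfect(data), row_sums_sunk(data))
```

The two versions agree only when every inner loop runs at least once: for an empty row the sunk initialization never executes. This is one reason naive sinking can fail on simple programs.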

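Loop tiling is a program transformation widely used to improve data locality and exploit parallelism. Tiling techniques fall into two classes: fixed tiling and parameterized tiling. This paper systematically surveys both classes and analyzes their strengths and weaknesses. Because the choice of tile size strongly affects the performance of tiled code, it also reviews and analyzes the methods for selecting the optimal tile size. In addition, it summarizes techniques for applying loop tiling to multi-level tiling, parallelism exploitation, and imperfectly nested loops. An analysis of the current state of research on loop tiling leads to the following conclusions: 1) the problems of computational complexity and generated-code efficiency in loop tiling are not yet fully solved, and how to use loop bounds to constrain the iteration space effectively and improve data locality needs deeper study; 2) selecting the optimal tile size remains an open problem, and understanding how tiling at each level of a hierarchical memory architecture affects performance is of great importance; 3) from the application perspective, how to build an automatic tiled-code generation system for arbitrary sets of nested loops, while fully exploiting deeply shared memory resources and multicore architectures to achieve highly parallel tiled code, also needs further study.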
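
To make the loop structure produced by tiling concrete, here is a minimal sketch, assuming a square-tile matrix multiplication as the example kernel. The names `matmul` and `matmul_tiled` and the `tile` parameter are illustrative assumptions, not taken from the survey. Pure-Python lists will not show the cache benefit; the sketch only shows how the iteration space is split into tiles while the result is preserved.

```python
def matmul(a, b):
    """Reference triple loop: c = a * b for list-of-lists matrices."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation with all three loops tiled by `tile`.

    The outer loops step over tile origins; the inner loops visit the
    points of one tile. min() clips tiles at the edge of the iteration
    space when the bounds are not multiples of the tile size.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        for k in range(kk, min(kk + tile, m)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(matmul_tiled(a, b, tile=2) == matmul(a, b))
```

Here the tile size is a runtime argument, loosely in the spirit of parameterized tiling; a fixed-tiling code generator would instead bake the constant into the loop bounds.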

Generation of Efficient Nested Loops from Polyhedra
Automatic parallelization in the polyhedral model is based on affine transformations from an original computation domain (iteration space) to a target space-time domain, often with a different transformation for each variable. Code generation is an often ignored step in this process that has a significant impact on the quality of the final code. It involves making a trade-off between code size and control code simplification/optimization. Previous methods of code generation are based on loop splitting; however, they behave nonoptimally on parameterized programs. We present a general parameterized method for code generation based on a dual representation of polyhedra. Our algorithm uses a simple recursion on the dimensions of the domains, and enables fine control over the trade-off between code size and control overhead.

This article presents the formal verification, using the Coq proof assistant, of a memory model for low-level imperative languages such as C and compiler intermediate languages. Beyond giving semantics to pointer-based programs, this model supports reasoning over transformations of such programs. We show how the properties of the memory model are used to prove semantic preservation for three passes of the Compcert verified compiler.

This paper presents source-level transformations that improve the performance of programs using synchronous and asynchronous message passing primitives, including remote call to an active process (rendezvous). It also discusses the applicability of these transformations to shared memory and distributed environments. The transformations presented reduce the need for context switching, simplify the specific form of communication, and/or reduce the complexity of the given form of communication. One additional transformation actually increases the number of processes as well as the number of context switches to improve program performance. These transformations are shown to be generalizable. Results of hand-applying the transformations to SR programs indicate reductions in execution time exceeding 90% in many cases. The transformations also apply to many commonly occurring synchronization patterns and to other concurrent programming languages such as Ada and Concurrent C. The long-term goal of this effort is to include such transformations as an optimization step, performed automatically by a compiler. This work was supported by NSF under Grant Number CCR88-10617.

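To address the problem that a large body of legacy code cannot be reused, a new compilation tool is designed that converts serial C code into hybrid parallel code based on MPI+OpenMP, lowering the development cost of parallel programming. First, by optimizing JavaCC, a lexer and parser capable of parsing C are implemented; they analyze the source code and generate an abstract syntax tree. Second, control-dependence and data-dependence analysis is performed on the source code according to the syntax tree, producing partitions of parallelizable statement blocks. Third, the target code is obtained with the proposed parallel code generation method. Finally, a simulation and verification environment for the target code is built on Visual Studio 2010. Experimental results show that the tool parallelizes serial code automatically with reasonably good results: its speedup deviates from that of hand-written code by 8.2% to 18.4%.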

Chapman B., Merlin J., Pritchard D., Bodin F., Mevel Y., Sørevik T., Hill L. The Journal of Supercomputing, 2000, 17(3): 311-322
Applications are increasingly being executed on computational systems that have hierarchical parallelism. There are several programming paradigms which may be used to adapt a program for execution in such an environment. In this paper, we outline some of the challenges in porting codes to such systems, and describe a programming environment that we are creating to support the migration of sequential and MPI code to a cluster of shared memory parallel systems, where the target program may include MPI, OpenMP or both. As part of this effort, we are evaluating several experimental approaches to aiding in this complex application development task.

This paper describes some of the tools and techniques that are being used in the interactive SUPRENUM parallelization system SUPERB. Emphasis is placed on specific problems arising from the interactive nature of the system, in particular the necessity to incrementally update data flow information that is used to determine the applicability and the effect of transformations.

This paper deals with communication optimization, which is a crucial issue in automatic parallelization. From a system of parameterized affine recurrence equations, we propose a heuristic that determines a set of efficient space-time transformations. It focuses on removing distant communications by using broadcast (including anticipated broadcast) and on enforcing locality.

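Uncovering the parallelism in DOACROSS loops and choosing a suitable strategy to execute them in parallel is very important for improving the parallel performance of programs. Pipelined parallelism is an important way to parallelize regular DOACROSS loops. Automatically generating well-performing pipelined parallel code is difficult; when parallelizing a program automatically, parallelizing compilers often handle DOACROSS loops conservatively, which loses the parallelism these loops contain and limits the program's parallel performance. To address these problems, a heuristic algorithm is designed for selecting the computation-partitioning loop level and the loop-tiling level, a formula for computing the loop tile size from a pipelined-parallelism cost model is given, and counting semaphores are used to synchronize the parallel threads, achieving automatic generation of OpenMP-based pipelined parallel code for regular DOACROSS loops. In tests on the wavefront loop of finite difference relaxation (FDR), typical loops of the finite difference time domain (FDTD) method, and the programs Poisson, LU, and Jacobi, the pipelined parallel code generated automatically by the algorithm achieves clear performance gains on multicore processors, and the pipeline tile-size formula computes the best tile size for pipelined loop parallelism fairly accurately. Compared with pipelined parallel code that uses a hand-selected optimal tile size, the automatically generated code reaches 89% of the hand-selected speedup.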
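
The dependence structure behind a wavefront loop can be sketched as follows. This is a minimal single-threaded illustration, assuming a simple recurrence in which each interior point depends on its upper and left neighbours; it is not the paper's OpenMP code, and it omits the semaphore-based pipelining and tiling. The function names are hypothetical.

```python
def sweep_sequential(grid):
    """Row-major sweep: g[i][j] depends on g[i-1][j] and g[i][j-1]."""
    g = [row[:] for row in grid]
    for i in range(1, len(g)):
        for j in range(1, len(g[0])):
            g[i][j] = g[i - 1][j] + g[i][j - 1]
    return g


def sweep_wavefront(grid):
    """Same recurrence visited by anti-diagonals d = i + j.

    All points on one anti-diagonal depend only on the previous one, so
    the inner loop carries no dependence and could be run in parallel.
    """
    g = [row[:] for row in grid]
    rows, cols = len(g), len(g[0])
    for d in range(2, rows + cols - 1):
        lo = max(1, d - (cols - 1))      # keep j = d - i within 1..cols-1
        hi = min(rows - 1, d - 1)
        for i in range(lo, hi + 1):
            j = d - i
            g[i][j] = g[i - 1][j] + g[i][j - 1]
    return g


start = [[1, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0]]
print(sweep_wavefront(start) == sweep_sequential(start))
```

A pipelined scheme of the kind the abstract describes would instead give each thread a block of rows and let it start a column tile once the thread above has finished that tile, which is where the tile size enters the cost model.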

NestStep is a parallel programming language for the BSP (bulk-synchronous-parallel) model of parallel computation. Extending the classical BSP model, NestStep supports dynamically nested parallelism by nesting of supersteps and a hierarchical processor group concept. Furthermore, NestStep adds a virtual shared memory realization in software, where memory consistency is relaxed to superstep boundaries. Distribution of shared arrays is also supported. A prototype for a subset of NestStep has been implemented based on Java as sequential basis language. The prototype implementation is targeted to a set of Java Virtual Machines coupled by Java socket communication to a virtual parallel computer.

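Systems-on-chip (SoCs) are widely used in multimedia information processing. The frequent loop nests and multidimensional array operations in multimedia processing programs severely affect the data transfer and storage efficiency of multimedia SoC systems. Mapping a program's data onto the SoC memory hierarchy according to the storage requirements of each part of the program is a necessary way to improve SoC performance and power consumption. For multimedia processing programs, a fast storage-requirement analysis method oriented to SoC data mapping is proposed. During the analysis, orthogonal linearly bounded lattices are introduced and used to partition the data domains of the accessed data, and the storage requirement is computed on the basis of dependences. This yields fairly accurate storage requirements and greatly reduces analysis time.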

Program specialization is a program transformation methodology which improves program efficiency by exploiting the information about the input data which are available at compile time. We show that current techniques for program specialization based on partial evaluation do not perform well on nondeterministic logic programs. We then consider a set of transformation rules which extend the ones used for partial evaluation, and we propose a strategy for guiding the application of these extended rules so as to derive very efficient specialized programs. The efficiency improvements, which sometimes are exponential, are due to the reduction of nondeterminism and to the fact that the computations which are performed by the initial programs in different branches of the computation trees are performed by the specialized programs within single branches. In order to reduce nondeterminism we also make use of mode information for guiding the unfolding process. To exemplify our technique, we show that we can automatically derive very efficient matching programs and parsers for regular languages. The derivations we have performed could not have been done by previously known partial evaluation techniques. A preliminary version of this paper appears as: Reducing Nondeterminism while Specializing Logic Programs. Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages, Paris, France, January 15–17, 1997, ACM Press, 1997, pp. 414–427.
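
The general idea of specialization, exploiting input that is known ahead of run time, can be sketched in a few lines. The example below is a deliberately simple imperative illustration, assuming a power function whose exponent is the known input. It does not model logic programs, nondeterminism, or the paper's transformation rules, and all names are hypothetical.

```python
def power(x, n):
    """General program: both x and n are supplied at run time."""
    result = 1
    for _ in range(n):
        result *= x
    return result


def specialize_power(n):
    """Specialize `power` for a fixed, known n.

    The loop over n is unfolded while generating the code, so the
    residual program is a single expression in x with no loop left.
    Returns the generated source text and the compiled function.
    """
    body = " * ".join(["x"] * n) if n > 0 else "1"
    source = f"def power_{n}(x):\n    return {body}\n"
    namespace = {}
    exec(source, namespace)          # compile the residual program
    return source, namespace[f"power_{n}"]


source, cube = specialize_power(3)
print(source)
print(cube(2) == power(2, 3))
```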

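Application mapping is one of the key problems in network-on-chip (NoC) architecture research, and the quality of the mapping greatly affects architecture performance. Most existing application mapping methods target a specific network structure, such as 2d-mesh or 2d-torus, and study application mapping and optimization under NoC performance or power constraints. This paper proposes a topology-aware NoC application mapping and optimization method based on high-level code transformation. The method uses the polyhedral model to automatically parallelize the application's core loops and optimize their locality, abstracts the network topology as a weighted directed graph, and uses this graph to cover the task flow graph so as to increase task parallelism and reduce inter-task synchronization and communication overhead. Experimental results show that with the optimized mapping method the parallelism among task nodes is fully exploited, communication overhead is reduced, and overall NoC system performance improves.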

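Loop closure detection is an important component of simultaneous localization and mapping (SLAM); it reduces the accumulated error that arises when a mobile robot estimates its position and builds a map of the environment. Traditional methods use hand-crafted features, which in real-world environments are easily affected by changes in illumination, weather, and viewpoint. With the development of deep learning, loop closure detection has been widely explored, and deep-learning-based loop closure detection is fairly robust in complex environments. After reviewing the background and current state of loop closure detection, this paper compares and analyzes the basic principles and algorithmic characteristics of current visual SLAM (V-SLAM) loop closure detection methods from three aspects: deep convolutional neural networks, autoencoders, and semantic information. It summarizes, at the level of visual applications, the scenarios that each of the three classes of methods suits, and finally discusses the challenges and research outlook for loop closure detection in three areas: natural environment changes, multiple moving objects, and real-time dynamics.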

Loop tiling and unrolling are two important program transformations to exploit locality and expose instruction level parallelism, respectively. However, these transformations are not independent and each can adversely affect the goal of the other. Furthermore, the best combination will vary dramatically from one processor to the next. In this paper, we therefore address the problem of how to select tile sizes and unroll factors simultaneously. We approach this problem in an architecturally adaptive manner by means of iterative compilation, where we generate many versions of a program and decide upon the best by actually executing them and measuring their execution time. We evaluate several iterative strategies based on genetic algorithms, random sampling and simulated annealing. We compare the levels of optimization obtained by iterative compilation to several well-known static techniques and show that we outperform each of them on a range of benchmarks across a variety of architectures. Finally, we show how to quantitatively trade off the number of profiles needed against the level of optimization that can be reached. In this way, we can reach high levels of optimization within 50 iterations.
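
The random-sampling strategy named in this abstract can be sketched as a small search loop. This is a minimal sketch under stated assumptions: `measure` stands for "compile this version, run it, and return its execution time", and `synthetic_cost` is a made-up stand-in so that the example runs without a compiler. Neither name nor the candidate values come from the paper; the default budget of 50 simply echoes the iteration count quoted above.

```python
import random


def random_search(measure, tile_sizes, unroll_factors, budget=50, seed=0):
    """Sample (tile, unroll) pairs at random and keep the fastest one.

    `measure(tile, unroll)` must return a cost (lower is better), for
    example the measured run time of the program version built with
    those parameters.
    """
    rng = random.Random(seed)        # seeded for reproducible sampling
    best_cfg, best_cost = None, float("inf")
    for _ in range(budget):
        cfg = (rng.choice(tile_sizes), rng.choice(unroll_factors))
        cost = measure(*cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


def synthetic_cost(tile, unroll):
    """Hypothetical cost surface standing in for a real timing run."""
    return abs(tile - 32) + 2 * abs(unroll - 4)


cfg, cost = random_search(synthetic_cost, [8, 16, 32, 64], [1, 2, 4, 8])
print(cfg, cost)
```

With real timings, the budget directly controls the trade-off the abstract mentions: each extra sample costs one more compile-and-run cycle.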

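The periodicity of the Fibonacci-Q transformation and of its inverse is studied, and a relation between the period of the Fibonacci transformation and that of the Arnold transformation is proved. Using the intrinsic relationship between the Fibonacci-Q transformation and the Arnold transformation, a new nonlinear transformation, the A-F transformation, is defined; its period and the period of its inverse are also studied, and concise algorithms for computing the periods of these two classes of transformations are given. Because the periodicity of the A-F and Fibonacci-Q transformations differs greatly from that of the Arnold transformation, they can provide better security and secrecy in practical applications.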
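
For readers who want to experiment with these periods, here is a brute-force sketch. It assumes the commonly used matrix forms, which the abstract does not state: the Arnold transformation as [[1, 1], [1, 2]] and the Fibonacci-Q transformation as [[1, 1], [1, 0]], both acting on pixel coordinates modulo the image size n. The period is then the smallest k for which the k-th matrix power is the identity modulo n. This is plain iteration, not the concise algorithm the paper refers to, and the A-F transformation is not modelled.

```python
ARNOLD = ((1, 1), (1, 2))        # assumed matrix of the Arnold transformation
FIBONACCI_Q = ((1, 1), (1, 0))   # assumed matrix of the Fibonacci-Q transformation


def matrix_period(m, n):
    """Smallest k >= 1 such that m**k is the identity modulo n (n >= 2).

    Terminates for matrices that are invertible modulo n, which holds
    for both matrices above because their determinants are +1 and -1.
    """
    (a, b), (c, d) = m
    p, q, r, s = a % n, b % n, c % n, d % n   # entries of the current power
    k = 1
    while (p, q, r, s) != (1, 0, 0, 1):
        # Multiply the current power by m on the right, reducing modulo n.
        p, q, r, s = ((p * a + q * c) % n, (p * b + q * d) % n,
                      (r * a + s * c) % n, (r * b + s * d) % n)
        k += 1
    return k


for n in (2, 3, 4, 5):
    print(n, matrix_period(FIBONACCI_Q, n), matrix_period(ARNOLD, n))
```

Under these assumed matrices, the Arnold matrix is the square of the Q matrix up to swapping the two coordinates, so its period is the Q period divided by 2 whenever the Q period is even, and equal to it otherwise.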

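Feature-oriented domain analysis can effectively support the ordering of resources in internetware. From the perspective of domain engineering, a feature-model-driven method for assembling and optimizing internetware is proposed. The method describes the feature model of internetware with iJackson diagrams and, taking the characteristics of software architecture into account, analyzes the mechanism for transforming the feature model into a business-component-oriented composition model based on workflow graph techniques. Using graph theory, the composition model is modeled as a component assembly structure graph centered on domain feature clusters. Around a QoS model oriented to multi-objective requirements, a mathematical model of the internetware component assembly problem in the Internet environment is established, and a global optimization method based on the ant colony optimization algorithm is proposed. Finally, taking an online bookstore system as an example, the simulation experiment is described and the effectiveness and feasibility of the method are demonstrated.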

