共查询到20条相似文献,搜索用时 0 毫秒
1.
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests 总被引:2,自引:0,他引:2
Nawaaz Ahmed Nikolay Mateev Keshav Pingali 《International journal of parallel programming》2001,29(5):493-544
Linear loop transformations and tiling are known to be very effective for enhancing locality of reference in perfectly-nested loops. However, they cannot be applied directly to imperfectly-nested loops. Some compilers attempt to convert imperfectly-nested loops into perfectly-nested loops by using statement sinking, loop fusion, etc., and then apply locality enhancing transformations to the resulting perfectly-nested loops, but the approaches used are fairly ad hoc and may fail even for simple programs. In this paper, we present a systematic approach for synthesizing transformations to enhance locality in imperfectly-nested loops. The key idea is to embed the iteration space of each statement into a special iteration space called the product space. The product space can be viewed as a perfectly-nested loop nest, so embedding generalizes techniques like statement sinking and loop fusion which are used in ad hoc ways in current compilers to produce perfectly-nested loops from imperfectly-nested ones. In contrast to these ad hoc techniques however, our embeddings are chosen carefully to enhance locality. The product space can itself be transformed to increase locality further, after which fully permutable loops can be tiled. The final code generation step may produce imperfectly-nested loops as output if that is desirable. We present experimental evidence for the effectiveness of this approach, using dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks. 相似文献
2.
利用U模变换增加并行粒度与改善数据访问局部性的方法 总被引:3,自引:0,他引:3
提出了一种利用循环变换增加循环并行粒度,改善循环数据访问局部性的方法,该方法利用了给定二重循环的相关向量集的某些性质,将外层循环变量不同而内层循环变量相等的若干次迭代合并,成为折叠后迭代空间的一个结点,并且保持内层循环的并行性不变,从而达到增加循环并行粒度的目的。对于更普遍的情况,该文讨论了如何根据给定循环的循环向量集,确定一个U模变换对迭代空间进行变换,达到内层循环可并行和扩大循环粒度两个目的,针对循环变换中数据访问局部性可能变差的问题,该文提出了对内层循环先合并,根据合并后的相关向量集变换迭代空间,以及折叠迭代空间的方法,该文的方法是Wavefront循环并行化方法的一种扩展。 相似文献
3.
基于线性表出的非奇异循环变换局部性优化方法 总被引:1,自引:0,他引:1
开发程序的局部性是当今并行编译优化研究的重点之一,而程序变换是开发程序时间局部性和空间局部性的重要手段之一.该文提出了一种新的利用非奇异循环变换来优化程序局部性的局部性优化方法,即基于线性表出的循环变换.该方法利用一组最少的线性无关向量组来线性表出数组访问的下标表达式,并据此构造非奇异变换矩阵来优化数组访问的时间局部性和空间局部性.该方法能充分开发数组访问的时间局部性,能简便地确定是否能对数组访问进行时间局部性或空间局部性优化,并能对给定的嵌套循环同时进行时间局部性和空间局部性优化.实验结果表明了该文所提出的基于线性表出的非奇异循环变换局部性优化方法是有效的. 相似文献
4.
The exploitation of today's high-performance computer systems requires the effective use of parallelism in many forms and at numerous levels. This survey article discusses program analysis and restructuring techniques that target parallel architectures. We first describe various categories of architectures that are oriented toward parallel computation models: vector architectures, shared-memory multiprocessors, massively parallel machines, message-passing architectures, VLIWs, and multithreaded architectures. We then describe a variety of optimization techniques that can be applied to sequential programs to effectively utilize the vector and parallel processing units. After an overview of basic dependence analysis, we present restructuring transformations on DO loops targeted both to vectorization and to concurrent execution, interprocedural and pointer analysis, task scheduling, instruction-level parallelization, and compiler-assisted data placement. We conclude that although tremendous advances have been made in dependence theory and in the development of a toolkit of transformations, parallel systems are used most effectively when the programmer interacts in the optimization process. 相似文献
5.
Zhiyuan Li 《计算机科学技术学报》2007,22(4):497-504
Loop tiling (or loop blocking) is a well-known loop transformation to improve temporal locality in nested loops which perform matrix computations. When targeting caches that have low associativities, one of the key challenges for loop tiling is to simultaneously minimize capacity misses and conflict misses. This paper analyzes the effect of the tile size and the array-dimension size on capacity misses and conflict misses. The analysis supports the approach of combining tile-size selection (to minimize capacity misses) with array padding (to minimize conflict misses). 相似文献
6.
数据网格已逐步在科学研究领域得到应用.提高数据网格的性能以适应分布式数据管理已经成为研究数据网格的一个热点.提出了网格局部性的概念,分析了网格局部性对数据网格性能的影响,并从增强网格局部性的角度对数据网格的性能进行优化,提出了综合跳一扩散副本替换策略(jump-DRP)和参考生物外激素的任务调度策略(JARIP).实验结果表明,考虑了网格局部性因素的jump-DRP与JARIP的策略组合提高了网格平台的任务处理性能,并对各类应用背景及任务的复杂程度具有鲁棒性. 相似文献
7.
Microprocessor speed has been growing exponentially faster than memory system speed in the recent past. This paper explores the long term implications of this trend. We define scalable locality, which measures our ability to apply ever faster processors to increasingly large problems (just as scalable parallelism measures our ability to apply more numerous processors to larger problems). We provide an algorithm called time skewing that derives an execution order and storage mapping to produce any desired degree of locality, for certain programs that can be made to exhibit scalable locality. Our approach is unusual in that it derives the transformation from the algorithm's dataflow (a fundamental characteristic of the algorithm) instead of searching a space of transformations of the execution order and array layout used by the programmer (artifacts of the expression of the algorithm). We provide empirical results for data sets using L2 cache, main memory, and virtual memory. 相似文献
8.
Terence J. Harmer Patrick J. McParland James M. Boyle 《Automated Software Engineering》1998,5(3):321-345
In this paper we describe a COBOL-program restructuring tool currently under development. The tool is constructed using program transformations executed by the TAMPR program transformation system. We discuss the COBOL knowledge embodied in the transformations and how they restructure an example COBOL program developed in the mid–1970s. While the tool is not yet a robust commercial product, early use for restructuring COBOL programs demonstrates the power and flexibility of this transformation–based approach. 相似文献
9.
Hardware and software cache optimizations are active fields of research, that have yielded powerful but occasionally complex designs and algorithms. The purpose of this paper is to investigate the performance of combined through simple software and hardware optimizations. Because current caches provide little flexibility for exploiting temporal and spatial locality, two hardware modifications are proposed to support these two kinds of locality. Spatial locality is exploited by using large virtual cache lines which do not exhibit the performance flaws of large physical cache lines. Temporal locality is exploited by minimizing cache pollution with a bypass mechanism that still allows to exploit spatial locality. Subsequently, it is shown that simple software informations on the spatial/temporal locality of array references, as provided by current data locality optimization algorithms, can be used to increase cache performance significantly. The performance and design tradeoffs of the proposed mechanisms are discussed, Software-assisted caches are also shown to provide a very convenient support for further enhancement of data locality optimizations. 相似文献
10.
面对日益增长的图像数据库,为用户提供一个简洁高效的搜索和浏览解决方案成为一个紧迫而且充满挑战的问题.图像聚类技术可以在许多方面为此提供帮助,例如图像数据预处理、用户界面设计.以及对搜索结果的聚类等.在众多聚类算法中,谱聚类(spectral clustering)方法由于能够解决复杂分布数据的聚类问题,以及接近全局最优的性能,成为近年来广受关注的一种方法.然而,目前存在的谱聚类方法,譬如normalized cut在处理新增数据点的聚类时,计算复杂度很高.提出了一种新的聚类算法——保局聚类.保局聚类在拥有许多非线性谱聚类方法优点的同时,又具有独特的数学特性——能提供显式的映射函数.这为在原数据集和新增数据集上进行高效的聚类提供了可能.实验结果显示,保局聚类比K均值聚类和主成分分析后的K均值聚类效果要好.实验同样显示,保局聚类与normalized cut效果可比,而前者更加高效. 相似文献
11.
一种利用数据融合来提高局部性和减少伪共享的方法 总被引:6,自引:0,他引:6
某些应用程序不能通过数组内元素的重排优化获得性能提高 .针对这一问题 ,该文扩展了数组之间数据重组优化方法 ,着重分析了将多个数组的数据按一定方式进行融合来提高局部性和减少伪共享优化方法的特性 .文章针对几种典型的数组关联模式 ,提出了相应的数据融合方法 ,并建立了一组粗略的性能代价判别规则 ,以指导编译器有选择地融合数组以提高程序的全局优化效果 .根据在多个平台上的测试结果 ,该文还分析了数据融合优化方法在不同体系结构上的性能可移植性 ,并将体系结构特征加入到性能代价判别规则中 ,使得此优化方法能适用于不同的体系结构 .测试结果表明 ,数据融合优化方法对提高某些应用程序的性能 ,尤其是其在软件DSM体系结构上的性能 ,是非常有效的 相似文献
12.
13.
基于投影分层技术的嵌套循环空间局部性优化方法 总被引:3,自引:0,他引:3
从数据访问轨迹入手,探讨了利用数据变换来改善数据访问局部性的本质,提出了一种新的优化数据访问的投影分层技术以及基于它的数据变换框架.该框架主要利用投影技术来优化数据访问的空间局部性,并同时利用数据分层技术来解决因投影而带来的数据重叠问题.该数据变换框架不仅能处理仿射数组下标,而且还能处理许多非仿射的更复杂的数组下标,同时它还能简单直接地确定数据元素的最优存储布局以及优化数据访问的数据变换短阵,并能使访问间距尽量小.实验结果表明它是有效的. 相似文献
14.
Program restructuring in a segmented virtual memory system is examined on the base of empirical data. The measure of connectivity is based on the decrease of the space-time integral caused by combining the two segments. A heuristic clustering algorithm, based on the use of effectively reduced memory reference strings, is presented. It turns out that performance problems in a segmented system are created by the large number of small segments. Restructuring is able to improve the performance of the system essentially; the optimal performance is reached with values of the memory control parameter which are one magnitude smaller than those applicable to original programs. The clusters formed do not correspond to program phases, but to smaller units of program execution; thus the clustering result supports a hierarchical model of program locality. The restructuring process is reasonably insensitive both to the memory control parameters and to the input data used. 相似文献
15.
Graph embedding (GE) is a unified framework for dimensionality reduction techniques. GE attempts to maximally preserve data locality after embedding for face representation and classification. However, estimation of true data locality could be severely biased due to limited number of training samples, which trigger overfitting problem. In this paper, a graph embedding regularization technique is proposed to remedy this problem. The regularization model, dubbed as Locality Regularization Embedding (LRE), adopts local Laplacian matrix to restore true data locality. Based on LRE model, three dimensionality reduction techniques are proposed. Experimental results on five public benchmark face datasets such as CMU PIE, FERET, ORL, Yale and FRGC, along with Nemenyi Post-hoc statistical of significant test attest the promising performance of the proposed techniques. 相似文献
16.
17.
Carole M. McNamee Ronald A. Olsson 《International journal of parallel programming》1990,19(5):357-387
This paper presents source-level transformations that improve the performance of programs using synchronous and asynchronous message passing primitives, including remote call to an active process (rendezvous). It also discusses the applicability of these transformations to shared memory and distributed environments. The transformations presented reduce the need for context switching, simplify the specific form of communication, and/or reduce the complexity of the given form of communication. One additional transformation actually increases the number of processes as well as the number of context switches to improve program performance. These transformations are shown to be generalizable. Results of hand-applying the transformations to SR programs indicate reductions in execution time exceeding 90% in many cases. The transformations also apply to many commonly occurring synchronization patterns and to other concurrent programming languages such as Ada and Concurrent C. The long term goal of this effort is to include such transformations as an otpimization step, performed automatically by a compiler.This work was supported by NSF under Grant Number CCR88-10617. 相似文献
18.
提出一种基于平滑滤波的小波阈值图像增强算法,利用邻域平均和小波阈值相结合的方法对图像进行平滑滤波处理。实验结果表明,该方法有利于图像去噪处理,同时边缘信息也得到了较好的保留,使图像具有更好的视觉效果。 相似文献
19.
Sylvain Girbal Nicolas Vasilache Cédric Bastoul Albert Cohen David Parello Marc Sigler Olivier Temam 《International journal of parallel programming》2006,34(3):261-317
Modern compilers are responsible for translating the idealistic operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Since optimization problems are associated with huge and unstructured search spaces, this combinational task is poorly achieved in general, resulting in weak scalability and disappointing sustained performance. We address this challenge by working on the program representation itself, using a semi-automatic optimization approach to demonstrate that current compilers offen suffer from unnecessary constraints and intricacies that can be avoided in a semantically richer transformation framework. Technically, the purpose of this paper is threefold: (1) to show that syntactic code representations close to the operational semantics lead to rigid phase ordering and cumbersome expression of architecture-aware loop transformations, (2) to illustrate how complex transformation sequences may be needed to achieve significant performance benefits, (3) to facilitate the automatic search for program transformation sequences, improving on classical polyhedral representations to better support operation research strategies in a simpler, structured search space. The proposed framework relies on a unified polyhedral representation of loops and statements, using normalization rules to allow flexible and expressive transformation sequencing. Thisrepresentation allows to extend the scalability of polyhedral dependence analysis, and to delay the (automatic) legality checks until the end of a transformation sequence. Our work leverages on algorithmic advances in polyhedral code generation and has been implemented in a modern research compiler. 相似文献
20.
Flash Sheridan 《Software》2007,37(14):1475-1488
A simple technique is presented for testing a C99 compiler, by comparing its output with the output from pre‐existing tools. The advantage to this approach is that new test cases can be added in bulk from existing sources, reducing the need for in‐depth investigation of correctness issues and for creating new test code by hand. This technique was used in testing the PalmSource Palm OS® Cobalt ARM C/C++ cross‐compiler for Palm‐Powered® personal digital assistants, primarily for standards compliance and the correct execution of generated code. The technique described here found several hundred bugs, mostly in our in‐house code, but also in longstanding high‐quality front‐ and back‐end code from Edison Design Group and Apogee Software. It also found 18 bugs in the GNU C compiler, as well as a bug specific to the Apple version of GCC, a bug specific to the Suse version of GCC, and a dozen bugs in versions of GCC for the ARM processor, several of which were critical. Copyright © 2007 John Wiley & Sons, Ltd. 相似文献