Found 20 similar articles (search time: 15 ms)
1.
Because sending and receiving data incurs significant communication overhead, loop partitioning approaches on distributed memory systems must guarantee not just computation load balance but computation+communication load balance. Previous loop partitioning approaches have achieved a communication-free, computation load balanced iteration space partitioning for a limited subset of DOALL loops. But a large category of DOALL loops inevitably results in communication, and the trade-offs between computation and communication must be carefully analyzed for these loops in order to balance the combined computation time and communication overheads. In this work, we describe a partitioning approach based on this motivation for the general case of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. Our approach first partitions the iteration and data spaces of a loop nest by analyzing communication and parallelism; it then performs architecture-dependent analysis to adjust the granularity of partitions, load balances each partition with respect to total computation+communication, and maps the partitions onto the available number of processors. This multiphase partitioning method works as follows. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and reused, eliminating more communication volume than the parallelism it gives up.
We then perform data space partitioning based on a new "larger partition owns" rule, which minimizes the communication overhead of a compute-intensive partition by localizing its references relatively more than those of a smaller, non-compute-intensive partition. A partition interaction graph is then constructed and used by the architecture-dependent analysis phase to merge partitions, achieving granularity adjustment, computation+communication load balance, and mapping onto the actual number of available processors. Relevant theory and algorithms are developed along with a performance evaluation on the Cray T3D.
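As a rough illustration of the "larger partition owns" idea described above, the sketch below assigns each data element to the partition with the most iterations among those that reference it. The function name `assign_owners`, the input layout, and the tie-breaking by lowest partition id are assumptions made for this example; the paper's actual algorithm is not reproduced here.

```python
from collections import defaultdict

def assign_owners(partitions):
    """Assign each data element to the largest partition that references it.

    `partitions` maps an integer partition id to a pair
    (iteration_count, set_of_referenced_elements).
    This layout is hypothetical and chosen only for illustration.
    """
    candidates = defaultdict(list)
    for pid, (size, refs) in partitions.items():
        for elem in refs:
            candidates[elem].append((size, pid))
    owners = {}
    for elem, cands in candidates.items():
        # Largest iteration count wins; ties go to the lowest id (assumption).
        best = max(cands, key=lambda c: (c[0], -c[1]))
        owners[elem] = best[1]
    return owners

partitions = {
    0: (100, {"A[0]", "A[1]", "A[2]"}),  # compute-intensive partition
    1: (10, {"A[2]", "A[3]"}),           # small partition
}
owners = assign_owners(partitions)
```

Here the shared element `A[2]` goes to partition 0, so only the smaller partition has to communicate to reach it.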
2.
Array syntax, which is supported in many technical programming languages, adds expressive power by allowing operations on and assignments to whole arrays and array sections. To compile an array assignment statement to a uniprocessor, the language processor must convert the statement into a loop that has the same meaning. This process is called scalarization. Scalarization presents a significant technical problem because an array assignment needs to be implemented as if all inputs are fetched before any outputs are stored. Since a loop intermixes loads and stores, the compiler typically allocates a temporary array to hold the intermediate result. Because these extra temporary arrays can cause performance problems in cache, many techniques have been developed to avoid their use or minimize their size. In this paper, we present a novel application of two compiler strategies, loop alignment and loop skewing, to address this problem. We show that these strategies can achieve the asymptotically minimal memory allocation for stencil computations. Our experiments with loop alignment and loop skewing demonstrate that they are extremely effective in improving memory hierarchy performance of Fortran 90 array code on standard uniprocessors. The result should be applicable to other array languages, such as MATLAB.
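The scalarization problem described above can be seen on a one-line array assignment, `a[1:] = a[:-1] + a[1:]`. The sketch below compares a naive loop, a loop with a full temporary array, and a loop that needs no temporary. Loop reversal is used here only because it is the simplest way to remove the temporary; it is not the loop alignment or loop skewing technique of the paper, and the function names are made up for this example.

```python
def shift_add_naive(a):
    """Naive scalarization: wrong, because a[i - 1] was already overwritten."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] + a[i]
    return a

def shift_add_temp(a):
    """Correct scalarization using a temporary array for all results."""
    a = list(a)
    tmp = [a[i - 1] + a[i] for i in range(1, len(a))]
    for i in range(1, len(a)):
        a[i] = tmp[i - 1]
    return a

def shift_add_reversed(a):
    """Correct scalarization without a temporary: iterate backwards so that
    each a[i - 1] is read before it is overwritten."""
    a = list(a)
    for i in range(len(a) - 1, 0, -1):
        a[i] = a[i - 1] + a[i]
    return a

data = [1, 2, 3, 4]
expected = [1, 3, 5, 7]  # fetch-before-store semantics of the array statement
```

A stencil such as `a[1:-1] = a[:-2] + a[2:]` reads on both sides of the element being written, so reversal alone does not help; that harder case is the one the abstract targets.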
3.
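Experimental study of coincident-loop configurations for mine transient electromagnetic surveys (total citations: 1; self-citations: 1; citations by others: 0)
Considering the special working environment underground, the relationships between coincident-loop configurations of different side lengths, shapes, and numbers of turns and the turn-off time and received signal quality were studied qualitatively. The experiments led to the following conclusions: the turn-off time is only slightly affected by the shape of the coincident-loop configuration and the number of receiver-coil turns, but is sensitive to changes in the loop area and the number of transmitter-coil turns, increasing roughly linearly with them; in the limited working space underground, moderately enlarging the loop area and increasing the number of receiver-coil turns clearly improves and secures the quality of the received signal; for the same perimeter, a circular coincident loop provides the largest receiving area; and increasing the number of transmitter-coil turns does not improve received signal quality.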
4.
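To address the problem that existing communication optimization algorithms cannot make an MPI auto-parallelizing compiler generate message-passing programs with ideal speedup, a communication optimization algorithm based on reordering transformations and loop distribution is proposed. Given the interprocedural side-effect sets and reordering rules based on moving mpi_wait/mpi_irecv, the algorithm applies reordering transformations and loop distribution in an orderly way to widen, as far as is safe, the window in which communication and computation overlap in point-to-point non-blocking communication, so that the compiler generates message-passing code in which more computation overlaps communication. Experimental results show that the algorithm hides more point-to-point non-blocking communication overhead and clearly improves the speedup of message-passing programs.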
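The idea of widening the overlap window by moving the receive and the wait can be sketched on a toy statement list. The model below is an assumption made for illustration: each statement is a name plus the set of variables it touches, the non-blocking receive is hoisted and the wait is sunk past any neighbour that does not touch the receive buffer. It ignores interprocedural side effects and loop distribution, and it does not call MPI; `widen_overlap` is a hypothetical name.

```python
def widen_overlap(stmts, buf):
    """Hoist 'irecv' and sink 'wait' past statements that do not touch `buf`.

    `stmts` is a list of (name, set_of_variables_touched) with exactly one
    'irecv' and one later 'wait' on `buf`. Returns the new statement order
    and the number of statements between the receive and the wait.
    """
    stmts = list(stmts)
    # Move the receive upward until the previous statement touches the buffer.
    i = next(k for k, s in enumerate(stmts) if s[0] == "irecv")
    while i > 0 and buf not in stmts[i - 1][1]:
        stmts[i - 1], stmts[i] = stmts[i], stmts[i - 1]
        i -= 1
    # Move the wait downward until the next statement touches the buffer.
    w = next(k for k, s in enumerate(stmts) if s[0] == "wait")
    while w < len(stmts) - 1 and buf not in stmts[w + 1][1]:
        stmts[w], stmts[w + 1] = stmts[w + 1], stmts[w]
        w += 1
    return [s[0] for s in stmts], w - i - 1

program = [
    ("init_x", {"x"}),
    ("use_old_buf", {"buf"}),
    ("compute_y", {"y"}),
    ("irecv", {"buf"}),
    ("wait", {"buf"}),
    ("compute_z", {"z"}),
    ("consume", {"buf", "z"}),
]
order, window = widen_overlap(program, "buf")
```

In this example the window grows from zero statements to two (`compute_y` and `compute_z`), which is the computation that can now run while the message is in flight.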
5.
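A communication optimization algorithm based on data-flow analysis of region graphs (total citations: 2; self-citations: 1; citations by others: 2)
Reducing communication overhead is very important for a parallelizing compiler to generate efficient distributed code. First, a redundant parallel execution model (RPEM) is proposed as the execution model for the target programs produced by the communication optimization algorithm. Then the concept of a region graph and a region maximization algorithm are given; performing data-flow analysis on the maximized region graph increases the granularity of the analysis, improves its efficiency, and also helps hoist and merge communication. Finally, a communication optimization algorithm based on region-graph data-flow analysis is proposed. The algorithm can perform data-flow analysis across loops and across procedures, improving the precision of the analysis and the effect of communication optimization. Experimental results show that, for programs with heavy communication, the algorithm effectively reduces both the number of communications and the communication volume, and scales well.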
6.
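Inlining is an important optimization in compilers. The inlining model in traditional compilers considers only the execution frequency and size of functions, not the optimizations that follow. An optimization-guided inlining model inlines mainly with the subsequent optimizations in mind, but its drawback is that it ignores function execution frequency and size. To overcome the drawbacks of both, a new inlining model is proposed: a loop-fusion-sensitive optimizing inlining model that considers execution frequency and function size as well as subsequent optimizations. Inlining that takes loop fusion into account was implemented and added to the existing inlining model of ORC, adaptively building a new inlining model, which was then tuned for performance. The experiments showed that hotness as an inlining criterion is not very effective in some cases; the reasons are analyzed, and inlining fewer functions in those cases improves performance. The results show that the new inlining model effectively improves compiler performance: the peak performance of some SPEC CPU2000 benchmarks improves by up to 6%, with an average improvement of 1%.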
7.
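A global computation and data partitioning method based on data space fusion (total citations: 2; self-citations: 1; citations by others: 2)
Computation and data partitioning is an important factor affecting the performance of parallel programs on distributed-memory multiprocessors, and a focus of parallelizing compiler optimization. To address this problem, a theoretical framework for data space fusion is proposed, and based on it an effective global computation and data partitioning method is given for solving computation and data partitioning problems in distributed-memory computing environments. The method exploits as much parallelism in the computation space as possible, uses data fusion to optimize data distribution, and can search for an optimized global computation and data partitioning. It also combines naturally with data replication and the alignment of offset constants, so that the data communication volume is kept as small as possible. Experimental results demonstrate the effectiveness of the proposed method.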
8.
Over the past 20 years, increases in processor speed have dramatically outstripped performance increases for standard memory chips. To bridge this gap, compilers must optimize applications so that data fetched into caches are reused before being displaced. Existing compiler techniques can efficiently optimize simple loop structures such as sequences of perfectly nested loops. However, on more complicated structures, existing techniques are either ineffective or require too much computation time to be practical for a commercial compiler. To optimize complex loop structures both effectively and inexpensively, we present a novel loop transformation, dependence hoisting, for optimizing arbitrarily nested loops, and an efficient framework that applies the new technique to aggressively optimize benchmarks for better locality. Our technique is as inexpensive as the traditional unimodular loop transformation techniques and thus can be incorporated into commercial compilers. In addition, it is highly effective and is able to block several linear algebra kernels containing highly challenging loop structures, in particular, Cholesky, QR, LU factorization without pivoting, and LU with partial pivoting. The automatic blocking of QR and pivoting LU is a notable achievement: to our knowledge, few previous compiler techniques, including theoretically more general loop transformation frameworks [1, 21, 23, 27, 31], were able to completely automate the blocking of these kernels, and none has produced the same blocking as produced by our technique. These results indicate that with low compilation cost, our technique can in practice match the effectiveness of much more expensive frameworks that are theoretically more powerful.
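For readers unfamiliar with blocking, the sketch below shows what a blocked loop nest looks like for matrix multiplication. Matrix multiplication is a perfectly nested loop and is much easier to block than the Cholesky, QR, and LU kernels the abstract discusses, so this only illustrates the goal of the transformation, not dependence hoisting itself. The function name and the square block shape are assumptions.

```python
def matmul_blocked(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) block by block.

    The three outer loops walk over block origins; the three inner loops
    work inside one block, so the touched parts of a, b and c are small
    enough to be reused before they are evicted from cache.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul_blocked(a, b, 2, 1)
```

The `min(...)` bounds handle matrix sizes that are not a multiple of the block size.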
9.
Elana D. Granston, Thierry Montaut, François Bodin. International Journal of Parallel Programming, 1995, 23(4): 263-301
To date, page management in shared virtual memory (SVM) systems has been primarily the responsibility of the run-time system.
However, there are some problems that are difficult to resolve efficiently at run time. Chief among these is false sharing.
In this paper, a loop transformation theory is developed for identifying and eliminating potential sources of multiple-writer
false sharing and other sources of page migration resulting from regular references in numerical applications. Loop nests
of one and two dimensions (before blocking) with single-level, DOALL-style parallelism are covered. The potential of these
transformations is demonstrated experimentally.
Supported by a Postdoctoral Research Associateship in Computational Science and Engineering under National Science Foundation
Grant No. CDA-9310307, and by the Center for Research on Parallel Computation under Grant No. CCR-9120008.
Supported by the Esprit Agency XIII under Grant No. APPARC 6634 BRA III and Intel SSD under Grant No. 1 92 C 250 00 31318 01 2.
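Multiple-writer false sharing, as described in the abstract above, can be made concrete by counting how many pages are written by more than one processor under a given iteration schedule. The sketch below assumes a loop in which iteration `i` writes element `a[i]`, a page size expressed in elements, and two textbook schedules (cyclic and block). These are illustrative assumptions, not the transformation theory of the paper.

```python
def shared_pages(n, procs, page_size, schedule):
    """Count pages written by more than one processor.

    Iteration i writes a[i]; `page_size` is the number of elements per page.
    `schedule` is "cyclic" (i mod procs) or "block" (contiguous chunks).
    """
    chunk = -(-n // procs)  # ceiling division: iterations per processor
    writers = {}
    for i in range(n):
        proc = i % procs if schedule == "cyclic" else i // chunk
        writers.setdefault(i // page_size, set()).add(proc)
    return sum(1 for ps in writers.values() if len(ps) > 1)

cyclic = shared_pages(64, 4, 16, "cyclic")
block = shared_pages(64, 4, 16, "block")
```

With 64 iterations, 4 processors and 16 elements per page, the cyclic schedule makes every page a multiple-writer page, while the block schedule gives each processor its own page.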
10.
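Overlapping (or touching) cells often appear in cell images. Using the prior knowledge that red blood cells are approximately circular, granulometry is used to obtain the red blood cell radius R. For each non-concave point on the contour of an overlapping region, the corresponding circle center is estimated; these center points cluster into the central regions of the individual cells. For each central region, one dilation with a circular structuring element of radius R is applied, and the result is intersected with the original binary image to give an estimate of each cell's shape. Experimental results show that the method achieves fairly satisfactory separation and also applies to separating overlapping regions in images of other approximately circular cells or particles.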
11.
Generation of efficient parallel code is a major goal of a well-designed and developed parallelizing compiler. Another important goal is portability of both the compiler system and the resulting output source codes. The various choices of current and future parallel computer architectures as well as the cost of developing a parallelizing compiler make portability a very important design goal. Since the design of parallelizing compilers is considerably more complex than designing conventional compilers, it is very important to achieve both efficiency and portability. To meet this dual goal, we have investigated the application of object oriented design to parallelizing compilers. Our parallelizing compiler design is based on abstractions of intermediate representations of loops and their class definitions. In this paper, we address the problem of loop parallelization and propose a framework where the loop parallelization process is divided into three phases and the optimization of loops is performed via a cyclic application of these three phases. The class of each phase is hierarchically derived from intermediate representations of loops. This facilitates the portability of the resulting parallelizing compilers. Furthermore, one of the phases uses a reservation table of hardware resources in order to obtain optimized parallel programs for given hardware resources. The validation of the proposed framework is given through the application of the object oriented design on an example program which is then parallelized efficiently.
12.
13.
Loop Staggering, Loop Compacting: Restructuring Techniques for Thrashing Problem (total citations: 1; self-citations: 0; citations by others: 1)
Parallel loops account for the greatest amount of parallelism in numerical programs. Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems. However, in parallel processing systems with caches or local memories in memory hierarchies, a "thrashing problem" may arise when data move back and forth frequently between the caches or local memories in different processors. The techniques associated with parallel compilers to solve the problem are not completely developed. In this paper, we present two restructuring techniques, called "loop staggering" and "loop staggering and compacting", with which we can not only eliminate the cache or local memory thrashing phenomena significantly, but also exploit the potential parallelism existing in the outer serial loop. Loop staggering benefits the dynamic loop scheduling strategies, whereas loop staggering and compacting is good for static loop scheduling strategies. Our method especially benefits parallel programs in which a parallel loop is enclosed by a serial loop and array elements are repeatedly used in the different iterations of the parallel loop.
14.
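Overlapping data points in scatterplots seriously reduce the efficiency of visual analysis. Existing scatterplot overlap-removal algorithms mainly remove overlap completely by adjusting the positions of some data points, but they commonly suffer from canvas area growth, unnatural contour preservation, and long iteration times. We argue that removing overlap completely is unnecessary; experiments show that users can quickly and accurately complete visual analysis tasks such as data point selection and regional density estimation when a scatterplot has slight overlap. We therefore propose an incomplete scatterplot overlap-removal algorithm that combines temporary placeholding with virtual points, Voronoi cell partitioning, selective movement of data points, and fast computation of the overlap rate, producing results that are compactly distributed, have natural contours, and iterate efficiently. The algorithm's performance was evaluated with objective and subjective experiments. The results show that the algorithm outperforms existing algorithms on five objective metrics (movement distance, area growth, shape preservation, orthogonal ordering, and neighborhood preservation) and two subjective metrics (shape similarity and cluster stability).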
15.
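A data partitioning scheme based on pointer arrays (total citations: 1; self-citations: 0; citations by others: 1)
Data partitioning is a key technique for parallelizing compilation on distributed-memory systems. It takes arrays and the nested loops containing them as its objects of study, with the fundamental aims of improving data locality and exploiting computational parallelism. Traditional data partitioning schemes are not suited to partitioning arrays of pointers that point to arrays. This paper proposes a partitioning scheme for such pointer arrays, referred to in the paper as data partitioning of array vectors. It analyzes the characteristics of their data references and, by selecting representative elements, gives a data partitioning strategy, filling a gap in existing research on data partitioning.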
16.
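Global partially replicated computation partitioning (total citations: 1; self-citations: 0; citations by others: 1)
Parallelizing compilers often use the owner-computes rule for computation partitioning. To improve performance and scalability, the concept of partially replicated computation partitioning was later introduced; it is an important optimization for inter-node locality in parallel programs. Earlier partially replicated computation partitioning was confined to a single loop nest. This work therefore poses the problem of global partially replicated computation partitioning and gives a simplified performance model and a framework for it based on integer linear programming. Experimental results show that it performs significantly better than partially replicated computation partitioning confined to a single loop nest and adapts better than previously proposed heuristics.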
17.
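A locality optimization method using non-singular loop transformations based on linear representation (total citations: 1; self-citations: 0; citations by others: 1)
Exploiting program locality is one of the current focuses of research on parallelizing compiler optimization, and program transformation is an important means of exploiting temporal and spatial locality. This paper proposes a new locality optimization method that uses non-singular loop transformations, namely loop transformation based on linear representation. The method uses a minimal set of linearly independent vectors to linearly represent the subscript expressions of array accesses, and from these constructs a non-singular transformation matrix to optimize the temporal and spatial locality of array accesses. The method can fully exploit the temporal locality of array accesses, can easily determine whether temporal or spatial locality optimization is possible for an array access, and can optimize a given loop nest for temporal and spatial locality at the same time. Experimental results show that the proposed method is effective.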
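To make the effect of a non-singular loop transformation on locality concrete, the sketch below classifies a 2-D array reference after applying a 2 x 2 transformation matrix `t` to the iteration vector. It assumes a depth-2 loop nest, row-major storage, and a reference whose subscripts are given by an integer access matrix; legality with respect to data dependences is not checked. The function names are hypothetical, and this is not the paper's construction of the matrix from a linear representation.

```python
from fractions import Fraction

def inverse_2x2(t):
    """Exact inverse of a 2 x 2 integer matrix; fails if t is singular."""
    (a, b), (c, d) = t
    det = a * d - b * c
    if det == 0:
        raise ValueError("transformation must be non-singular")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

def matmul(x, y):
    """Plain matrix product of two small matrices given as lists of rows."""
    return [[sum(x[r][k] * y[k][c] for k in range(len(y)))
             for c in range(len(y[0]))] for r in range(len(x))]

def innermost_locality(access, t):
    """Classify locality in the innermost loop after transformation t.

    With new iteration vector I' = t * I, the subscripts become
    access * t^-1 * I'. The last column says how the subscripts change
    when the innermost loop advances by one.
    """
    new_access = matmul(access, inverse_2x2(t))
    inner = [row[-1] for row in new_access]
    if all(v == 0 for v in inner):
        return "temporal"   # same element reused on every inner iteration
    if inner[0] == 0 and abs(inner[1]) == 1:
        return "spatial"    # row-major: only the last subscript moves by one
    return "none"

identity = [[1, 0], [0, 1]]
interchange = [[0, 1], [1, 0]]
column_walk = [[0, 1], [1, 0]]  # reference A[j][i] in loops (i outer, j inner)
```

For the reference `A[j][i]`, the original loop order walks down a column, and loop interchange turns that into a stride-one walk along a row.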
18.
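The multi-level on-chip memory hierarchy of the Sunway many-core processor is an important structure for alleviating the many-core "memory wall". The fully software-managed SPM structure and the on-chip RMA communication mechanism offer many opportunities for improving application performance, but also pose great challenges for developing, optimizing, and porting applications. To fully exploit the characteristics of the on-chip memory hierarchy to improve application performance while easing the user's programming and optimization burden, this paper proposes a compiler optimization method that fuses memory access and communication across the multi-level memory hierarchy. The method first designs fusion compiler directives that pass high-level program information to the compiler. It then builds a benefit model for the compiler optimization and designs a heuristic framework for iteratively solving for a loop optimization scheme, with the compiler solving for the scheme and transforming the optimized code. Through compiler-generated DMA and RMA bulk data transfer operations, core data in lower, high-latency levels of the memory hierarchy are buffered in bulk into higher, low-latency levels. Optimization experiments on three typical test cases show that the proposed optimization performs on par with manual optimization and significantly better than the unoptimized programs.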
19.
Inter-iteration dependences in loops can hinder loop-level parallelism. For some loops, existing thread-level speculation techniques fail to expose their inherent loop-level parallelism, because some inter-iteration dependences are too costly to synchronize, predict, pre-compute and isolate. This paper presents a compiler technique called loop recreation to change the nature of some dependences (by turning some inter-iteration dependences into intra-iteration ones and vice versa) in a loop so that the inter-iteration dependences in the transformed loop are less costly to enforce at runtime than those in the original loop. We present an algorithm for finding an optimal loop recreation transformation with respect to a simple misspeculation cost model and demonstrate the performance advantages of loop recreation over two recent techniques for multicore systems running nine representative irregular applications. Copyright © 2009 John Wiley & Sons, Ltd.
20.
International Journal of Computer Mathematics, 2012, 89(3): 265-278
Writing programs for a distributed-memory system (DMS) is a difficult job. In this paper, a method for parallelising sequential programs for DMS is presented. The input programs are C programs and the output parallel versions are programs containing routines for the Parallel Virtual Machine (PVM). PVM allows a group of computers in a network to be specified as a DMS and provides the routines for task activation and communication. The main task in parallelising a program is to process the loops in the source program and determine whether any data dependences exist. If the loop iterations are independent, the body will be transformed into tasks that will run independently under PVM.
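One common way to check whether loop iterations can be independent is the GCD test, sketched below for a loop that writes `a[c1*i + k1]` and reads `a[c2*j + k2]`. The abstract does not say which dependence test the paper uses, so this is a standard textbook test shown under that assumption; `may_depend` is a hypothetical name, and loop bounds are ignored, which makes the test conservative.

```python
from math import gcd

def may_depend(c1, k1, c2, k2):
    """GCD test for a write to a[c1*i + k1] and a read of a[c2*j + k2].

    The equation c1*i + k1 == c2*j + k2 has integer solutions only if
    gcd(c1, c2) divides (k2 - k1). Returns False when a dependence is
    impossible and True when it cannot be ruled out.
    """
    g = gcd(c1, c2)
    if g == 0:
        # Both subscripts are constants: they clash only if equal.
        return k1 == k2
    return (k2 - k1) % g == 0

even_odd = may_depend(2, 0, 2, 1)    # a[2*i] = ... a[2*i + 1] ...
neighbour = may_depend(1, 0, 1, -1)  # a[i] = ... a[i - 1] ...
```

A loop for which every such pair returns `False` writes and reads disjoint elements, so its iterations are candidates for running as independent tasks; a `True` result only means a more precise test is needed.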