Similar Documents
20 similar documents found (search time: 15 ms)
1.
Redundant Computation Partitioning for Optimized Communication in Distributed-Memory Systems   (total citations: 1; self-citations: 0; citations by others: 1)
For sequences of parallel loop nests, this paper proposes a communication optimization method based on redundant computation partitioning. Using data-flow analysis, it gives a general method for determining the amount of redundant computation for each loop nest, and on this basis presents the implementation and the legality test of redundant computation partitioning. For programs with regular dependences, an efficient implementation of redundant computation partitioning is also proposed. The technique has been implemented in a parallelizing compiler, and experimental results show that it is clearly superior to traditional communication optimization techniques.
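
The abstract does not give the algorithm itself; as a rough illustration of the underlying idea only, the Python sketch below shows how a block of a 1-D stencil loop can be extended by a redundantly computed halo so that several time steps need no boundary communication. All names (`block_bounds`, `redundant_bounds`, `halo`) and the even-split rule are illustrative assumptions, not the paper's method.

```python
# A minimal sketch of the redundant-computation idea: each block recomputes a
# small halo of its neighbours' iterations so that a stencil update needs no
# boundary communication for `halo` time steps. Illustrative, not the paper's algorithm.

def block_bounds(n, nprocs):
    """Evenly split iteration indices 0..n-1 among nprocs blocks."""
    size = (n + nprocs - 1) // nprocs
    return [(p * size, min(n, (p + 1) * size)) for p in range(nprocs)]

def redundant_bounds(bounds, n, halo):
    """Extend each block by `halo` iterations on both sides; the extension is
    computed redundantly instead of being received from neighbours."""
    return [(max(0, lo - halo), min(n, hi + halo)) for lo, hi in bounds]

if __name__ == "__main__":
    n, nprocs, steps = 1000, 4, 3
    exact = block_bounds(n, nprocs)
    # To run `steps` stencil time steps without communication, each block
    # needs a halo of `steps` extra iterations per side.
    expanded = redundant_bounds(exact, n, halo=steps)
    for p, ((lo, hi), (rlo, rhi)) in enumerate(zip(exact, expanded)):
        print(f"proc {p}: owns [{lo},{hi}), computes [{rlo},{rhi}) redundantly")
```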

2.
In this paper we address the problem of partitioning nested loops with non-uniform (irregular) dependence vectors. Parallelizing and partitioning of nested loops requires efficient inter-iteration dependence analysis. Although many methods exist for nested loop partitioning, most of these perform poorly when parallelizing nested loops with irregular dependences. Unlike the case of nested loops with uniform dependences, such loops have a complicated dependence pattern that forms a non-uniform dependence vector set. We apply the results of classical convex theory and principles of linear programming to iteration spaces and show the correspondence between minimum dependence distance computation and iteration space tiling. Cross-iteration dependences are analyzed by forming an Integer Dependence Convex Hull (IDCH). Every integer point in this IDCH corresponds to a dependence vector in the iteration space of the nested loops. A simple way to compute minimum dependence distances from the dependence distance vectors of the extreme points of the IDCH is presented. Using these minimum dependence distances, the iteration space can be tiled. Iterations within a tile can be executed in parallel, and the different tiles can then be executed with proper synchronization. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques.
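
As a hedged illustration of the tiling step described above, the sketch below takes dependence distance vectors assumed to come from the extreme points of a precomputed IDCH, extracts the minimum positive distance along one dimension, and uses it as the tile width. The function names and example vectors are invented for the example.

```python
# Sketch of the tiling step only: the minimum positive dependence distance
# along a dimension (over the IDCH extreme points) is used as the tile width;
# iterations inside a tile carry no cross-iteration dependence along that dimension.

def min_dependence_distance(extreme_point_distances, dim=0):
    """Minimum positive dependence distance along `dim` over IDCH extreme points."""
    dists = [d[dim] for d in extreme_point_distances if d[dim] > 0]
    return min(dists) if dists else None

def tiles(lower, upper, width):
    """Split the index range [lower, upper) into tiles of the given width."""
    return [(i, min(upper, i + width)) for i in range(lower, upper, width)]

if __name__ == "__main__":
    # Distance vectors at the IDCH extreme points of some 2-deep loop nest (made up).
    extreme_dists = [(3, -1), (5, 2), (4, 0)]
    width = min_dependence_distance(extreme_dists, dim=0)   # -> 3
    for t in tiles(1, 20, width):
        # Iterations of the outer loop inside each tile can run in parallel;
        # tiles themselves are executed with synchronization in between.
        print("tile", t)
```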

3.
This paper addresses the problem of communication-free partition of iteration spaces and data spaces along hyperplanes. To find more possible communication-free hyperplane partitions, we treat statements within a loop body as separate schedulable units. Instead of using information about data dependence distance or direction vectors, our technique explicitly formulates array references as transformations from statement-iteration spaces to data spaces. Based on these transformations, the necessary and sufficient conditions for communication-free partitioning along hyperplanes to be feasible are derived. This approach can be applied to all programs with an imperfectly nested loop or sequences of imperfectly nested loops whose array references are affine functions of outer loop indices or loop-invariant variables. The proposed approach is more practical than existing methods in finding the data and computation distribution patterns that allow processors to execute fully in parallel on multicomputers without any interprocessor communication.
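
The paper derives analytic feasibility conditions; the sketch below instead checks the property by brute force on a tiny example, which may help make the definition concrete: iterations are grouped by a candidate hyperplane normal, and the partition is communication-free if no data element is touched, through any affine reference, by two different groups. The references `A[i][j]` / `A[i][j-1]` and all names are illustrative assumptions.

```python
# Brute-force check (not the paper's analytic test) that a candidate iteration
# hyperplane partition is communication-free for a set of affine references.

import numpy as np

def accessed(F, f, iterations):
    """Data elements touched by reference x -> F @ x + f over the iterations."""
    return {tuple(F @ np.array(it) + f) for it in iterations}

def communication_free(h, refs, iteration_space):
    """True if no data element is touched by two different hyperplane partitions."""
    parts = {}
    for it in iteration_space:
        parts.setdefault(int(np.dot(h, it)), []).append(it)
    owner = {}
    for pid, its in parts.items():
        for F, f in refs:
            for elem in accessed(F, f, its):
                if owner.setdefault(elem, pid) != pid:
                    return False
    return True

if __name__ == "__main__":
    I = np.eye(2, dtype=int)
    refs = [(I, np.array([0, 0])),    # A[i][j]
            (I, np.array([0, -1]))]   # A[i][j-1]
    space = [(i, j) for i in range(8) for j in range(8)]
    print(communication_free(np.array([1, 0]), refs, space))  # row partition: True
    print(communication_free(np.array([0, 1]), refs, space))  # column partition: False
```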

4.
This paper presents an automatic global data partitioning algorithm for SIMD machines. The algorithm handles multiple imperfectly nested loops in which array subscripts are linear expressions of the loop variables. It first abstracts the communication patterns of the computation through data and iteration mappings, then gives formal conditions for recognizing regular communication patterns, and next builds a data-iteration graph that carries alignment information and the corresponding communication costs. On the basis of this graph, a heuristic algorithm computes a good data distribution and iteration distribution so as to reduce the communication cost between processing elements. By analyzing the multiple arrays and mappings involved across several loop nests and

5.
Parallel Loop Partitioning for Optimized Data Transfer on SIMD Machines   (total citations: 2; self-citations: 1; citations by others: 2)
This paper presents a loop partitioning framework for SIMD machines with distributed local memory that optimizes the data transfer required by a computation. The framework uses matrices to represent iteration spaces, data spaces, and array access patterns. We introduce the notion of data transfer and build a simple yet effective data transfer model to evaluate the cost of moving data between global memory and local memory. Finally, for a given loop nest, we give a loop partitioning algorithm that obtains optimized loop tiles, minimizing the data transfer cost required by the loop nest and greatly reducing the synchronization overhead between data transfer and computation. Experimental results demonstrate

6.
This paper presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor.
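
To make the footprint notion concrete, here is a simplified sketch for uniformly intersecting references that differ only by constant offsets over a rectangular tile: the exact footprint is the union of the translated tiles, and a bounding-box product gives a quick upper bound. The identity access matrices, rectangular tiles, and example offsets are assumptions; the paper's lattice-based computation is more general and exact.

```python
# Data footprint of offset-only (uniformly intersecting) references over a
# rectangular tile: exact size by enumeration, and a bounding-box upper bound.

from itertools import product

def footprint_size(tile, offsets):
    """Exact footprint by enumeration: union of the tile translated by each offset."""
    pts = set()
    for it in product(*(range(t) for t in tile)):
        for off in offsets:
            pts.add(tuple(i + o for i, o in zip(it, off)))
    return len(pts)

def footprint_estimate(tile, offsets):
    """Bounding-box upper bound: product of (extent + offset spread) per dimension."""
    size = 1
    for d, t in enumerate(tile):
        spread = max(o[d] for o in offsets) - min(o[d] for o in offsets)
        size *= t + spread
    return size

if __name__ == "__main__":
    tile = (16, 16)
    offsets = [(0, 0), (1, 0), (0, 1)]          # A[i][j], A[i+1][j], A[i][j+1]
    print(footprint_size(tile, offsets))        # exact union: 288
    print(footprint_estimate(tile, offsets))    # bounding-box upper bound: 289
```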

7.
Load balance is important because it may affect the speedup attained through the concurrent execution of loop iterations on a parallel processor. We study loop load balance in the context of the well-known Perfect benchmarks. Several static and dynamic characteristics of the Perfect benchmark DOALL loops are observed and interpreted. The late arrival of processors is noted as a major source of load imbalance. This observation suggested the idea of processor preallocation. An analytic cost model is presented and the advantages of processor preallocation are demonstrated by experimental evaluation on a CRAY Y-MP8 under the Unicos operating system.
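
As a toy stand-in for the paper's analytic cost model (not the model itself), the sketch below charges each processor its arrival delay plus an equal static chunk of iterations and compares the resulting makespans with and without late arrivals; all numbers are made up.

```python
# Toy makespan model: with equal static chunks, a late-arriving processor
# delays the whole DOALL loop; "preallocation" here means all processors are
# reserved before the loop, so no arrival delay is paid inside the parallel region.

def makespan_equal_chunks(n_iters, iter_time, arrivals):
    """Each of the p processors gets n/p iterations and finishes at
    arrival time + chunk work; the loop ends when the last one finishes."""
    p = len(arrivals)
    chunk = n_iters / p
    return max(a + chunk * iter_time for a in arrivals)

if __name__ == "__main__":
    n, t = 10_000, 1e-4
    late = [0.0, 0.0, 0.05, 0.20]        # two processors arrive late
    prealloc = [0.0] * 4                 # all reserved up front
    print(f"late arrivals : {makespan_equal_chunks(n, t, late):.3f} s")
    print(f"preallocated  : {makespan_equal_chunks(n, t, prealloc):.3f} s")
```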

8.
This paper presents a method for parallelising nested loops with affine dependences. The data dependences of a program are represented exactly using a dependence matrix rather than an imprecise dependence abstraction. By a careful analysis of the eigenvectors and eigenvalues of the dependence matrix, we detect the parallelism inherent in the program, partition the iteration space of the program into sequential and parallel regions, and generate parallel code to execute these regions. For a class of programs considered in the paper, the proposed method can expose more coarse-grain and fine-grain parallelism than a hyperplane-based loop transformation.
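
The eigen-analysis itself is paper-specific, so as a simpler and explicitly different companion illustration, the sketch below applies the classical hyperplane (wavefront) method to uniform dependence vectors: it searches for a small integer normal pi with pi·d >= 1 for every dependence vector d, so that all iterations on one wavefront are independent.

```python
# Classical hyperplane (wavefront) method for *uniform* dependence vectors,
# shown only as a companion illustration; it is not the paper's eigen-analysis
# of affine dependences.

from itertools import product

def find_wavefront_normal(deps, bound=3):
    """Search small integer normals pi such that pi . d >= 1 for all dependence vectors."""
    dim = len(deps[0])
    best = None
    for pi in product(range(-bound, bound + 1), repeat=dim):
        if all(sum(p * d for p, d in zip(pi, dep)) >= 1 for dep in deps):
            # prefer the "cheapest" normal (fewest wavefronts for a unit cube)
            if best is None or sum(abs(p) for p in pi) < sum(abs(p) for p in best):
                best = pi
    return best

if __name__ == "__main__":
    deps = [(1, 0), (0, 1), (1, -1)]       # uniform dependence vectors (made up)
    pi = find_wavefront_normal(deps)
    print("wavefront normal:", pi)          # (2, 1)
    # All iterations with equal 2*i + 1*j lie on one wavefront and may run in parallel;
    # wavefronts are executed in increasing order of 2*i + 1*j.
```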

9.
Existing parallelism identification methods have shortcomings on many-core processors: when the chosen parallel loop dimension has few iterations, severe load imbalance may result. To address this problem, this paper proposes a multi-dimensional parallelism identification method for many-core processors. When existing methods cannot achieve good load balance, it selects multiple dimensions of a nested loop for parallelization, merges the iteration spaces of the parallel dimensions, and then performs task partitioning, reducing the impact of load imbalance on parallel efficiency. The method has been implemented in the automatic parallelization system developed by our group, and in practical use it improves the parallel execution efficiency of several applications on many-core processors.
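
A minimal sketch of the merge-then-divide idea follows (the effect of an OpenMP collapse clause rather than the system described in the paper): two parallel dimensions are flattened into one index space, which is then chunked evenly across the cores; the names and the chunking rule are illustrative.

```python
# Flatten two parallel loop dimensions into one index space and split it evenly
# among many cores, so a short outer loop no longer limits load balance.

def collapsed_chunks(n_i, n_j, n_cores):
    """Flatten an n_i x n_j parallel iteration space and split it evenly."""
    total = n_i * n_j
    base, extra = divmod(total, n_cores)
    chunks, start = [], 0
    for c in range(n_cores):
        size = base + (1 if c < extra else 0)
        # each flat index k maps back to the iteration (k // n_j, k % n_j)
        chunks.append([(k // n_j, k % n_j) for k in range(start, start + size)])
        start += size
    return chunks

if __name__ == "__main__":
    # Only 6 outer iterations but 64 cores: parallelizing the outer loop alone
    # leaves most cores idle, whereas the collapsed 6*100 space balances well.
    work = collapsed_chunks(6, 100, 64)
    sizes = sorted(len(c) for c in work)
    print("min/max iterations per core:", sizes[0], sizes[-1])
```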

10.
Inter‐iteration dependences in loops can hinder loop‐level parallelism. For some loops, existing thread‐level speculation techniques fail to expose their inherent loop‐level parallelism, because some inter‐iteration dependences are too costly to synchronize, predict, pre‐compute and isolate. This paper presents a compiler technique called loop recreation to change the nature of some dependences (by turning some inter‐iteration dependences into intra‐iteration ones and vice versa) in a loop so that the inter‐iteration dependences in the transformed loop are less costly to enforce at runtime than those in the original loop. We present an algorithm for finding an optimal loop recreation transformation with respect to a simple misspeculation cost model and demonstrate the performance advantages of loop recreation over two recent techniques for multicore systems running nine representative irregular applications. Copyright © 2009 John Wiley & Sons, Ltd.

11.
This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism, and reducing communication overhead. For out-of-core problems where the data set sizes far exceed the size of the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing the I/O accesses. We present algorithms that consider both iteration space (loop) and data space (file layout) transformations in a unified framework. We show that the performance of an out-of-core loop nest containing references to out-of-core arrays can be improved by using a suitable combination of file layout choices and loop restructuring transformations. Our approach considers array references one-by-one and attempts to optimize each reference for parallelism and locality. When there are references for which parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost loop. Results from hand-compiles on IBM SP-2 and Intel Paragon distributed-memory message-passing architectures show that this approach reduces the execution times and improves the overall speedups. In addition, we extend the base algorithm to work with file layout constraints and show how it is useful for optimizing programs that consist of multiple loop nests.

12.
Parallelizing the Data Cube   (total citations: 1; self-citations: 0; citations by others: 1)
This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting. The bottom-up partitioning strategy balances the number of single attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm-specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared-disk parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array. We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight-processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.
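
As an illustration of the coarse-task load-balancing step only, the sketch below assigns whole group-bys to processors with the simple greedy LPT (longest processing time) rule on estimated group-by sizes. The paper's actual top-down strategy partitions a weighted tree and its bottom-up strategy balances external sorts, so this is a stand-in, with invented size estimates.

```python
# Greedy LPT assignment of whole group-bys (coarse tasks) to processors,
# balancing estimated loads; a stand-in for the paper's tree/sort-based strategies.

import heapq

def assign_group_bys(estimates, p):
    """Assign group-bys to p processors, largest estimated size first,
    always onto the currently least-loaded processor."""
    heap = [(0, proc, []) for proc in range(p)]     # (load, id, assigned tasks)
    heapq.heapify(heap)
    for gb, size in sorted(estimates.items(), key=lambda kv: -kv[1]):
        load, proc, tasks = heapq.heappop(heap)
        tasks.append(gb)
        heapq.heappush(heap, (load + size, proc, tasks))
    return sorted(heap, key=lambda x: x[1])

if __name__ == "__main__":
    # Hypothetical estimated group-by sizes for a 3-dimensional cube.
    est = {("A", "B", "C"): 90, ("A", "B"): 40, ("A", "C"): 35,
           ("B", "C"): 30, ("A",): 12, ("B",): 10, ("C",): 8, (): 1}
    for load, proc, tasks in assign_group_bys(est, p=3):
        print(f"processor {proc}: load={load:3d}  group-bys={tasks}")
```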

13.
In distributed memory multicomputers, local memory accesses are much faster than those involving interprocessor communication. For the sake of reducing or even eliminating the interprocessor communication, the array elements in programs must be carefully distributed to local memory of processors for parallel execution. We devote our efforts to the techniques of allocating array elements of nested loops onto multicomputers in a communication-free fashion for parallelizing compilers. We first analyze the pattern of references among all arrays referenced by a nested loop, and then partition the iteration space into blocks without interblock communication. The arrays can be partitioned under the communication-free criteria with nonduplicate or duplicate data. Finally, a heuristic method is proposed for mapping the partitioned array elements and iterations onto fixed-size multicomputers while taking load balancing into account. Based on these methods, the nested loops can execute without any communication overhead on the distributed memory multicomputers. Moreover, the performance of the strategies with nonduplicate and duplicate data for matrix multiplication is studied.

14.
Among three-dimensional packing problems, the container loading problem is a complex combinatorial optimization problem and is NP-hard. This paper proposes six space-splitting schemes together with different space-merging rules, and incorporates stability and load-bearing constraints; standard benchmark data sets are used to compare how applying the constraints affects space utilization. The results show how the load-bearing constraint affects the different space-splitting schemes, which scheme it affects most, and which scheme achieves the best space utilization under the combined stability and load-bearing constraints. The algorithm is also compared with algorithms from the same research area at home and abroad, and the results show that its performance exceeds that of existing algorithms.
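
As a neutral illustration of what residual-space splitting means (the paper compares six such schemes, which are not reproduced here), the sketch below performs one common three-way split after a box is placed in the corner of a free space; the coordinate convention and the split rule are assumptions.

```python
# One common three-way residual-space split after placing a box in the corner
# of a free space: the remainder is described by a right, a top and a front
# sub-space. Illustrative only; not one of the paper's six specific schemes.

from collections import namedtuple

Space = namedtuple("Space", "x y z w h d")   # origin + extents

def split_after_placement(space, bw, bh, bd):
    """Place a bw x bh x bd box at the space's origin corner and return the
    non-degenerate residual free sub-spaces."""
    right = Space(space.x + bw, space.y, space.z, space.w - bw, space.h, space.d)
    top   = Space(space.x, space.y + bh, space.z, bw, space.h - bh, space.d)
    front = Space(space.x, space.y, space.z + bd, bw, bh, space.d - bd)
    return [s for s in (right, top, front) if s.w > 0 and s.h > 0 and s.d > 0]

if __name__ == "__main__":
    container = Space(0, 0, 0, 10, 10, 10)
    for s in split_after_placement(container, 4, 3, 10):
        print(s)   # right 6x10x10 and top 4x7x10; the front space is degenerate
```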

15.
The quality of a graph partition largely determines the communication cost between machines and the load balance, both of which are critical to the performance of large-scale parallel graph computation. However, as graph data grow larger, the execution time of the partitioning algorithm itself becomes an unavoidable concern, so it is necessary to study how to improve its efficiency. This paper proposes a heuristic graph partitioning method based on weighted-graph generation via breadth-first traversal; it achieves low communication cost and good load balance while introducing only a small preprocessing time overhead. Experimental results show that the proposed partitioning method reduces the replication factor and the communication cost, and that the extra time overhead it introduces is small.
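
A simplified sketch of BFS-driven partitioning follows; it is an assumption about the general idea, not the paper's exact heuristic or its weighted-graph generation step: partitions are grown by breadth-first traversal up to a capacity, so that neighbouring vertices tend to stay together and few edges are cut.

```python
# Grow each partition by breadth-first traversal until it reaches capacity,
# keeping neighbourhoods together to reduce cut (replicated) edges.

from collections import deque

def bfs_partition(adj, k):
    """Partition the vertices of `adj` (dict: vertex -> neighbours) into k parts."""
    n = len(adj)
    capacity = (n + k - 1) // k
    part = {}
    current, filled = 0, 0
    for seed in adj:                      # restart BFS from any unassigned seed
        if seed in part:
            continue
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            if v in part:
                continue
            if filled == capacity:        # current partition is full: open the next one
                current, filled = current + 1, 0
            part[v] = current
            filled += 1
            queue.extend(u for u in adj[v] if u not in part)
    return part

if __name__ == "__main__":
    # A 4x4 grid graph as a tiny example.
    adj = {(i, j): [(i + di, j + dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= i + di < 4 and 0 <= j + dj < 4]
           for i in range(4) for j in range(4)}
    part = bfs_partition(adj, k=4)
    cut = sum(part[u] != part[v] for u in adj for v in adj[u]) // 2
    sizes = [list(part.values()).count(p) for p in range(4)]
    print("cut edges:", cut, " partition sizes:", sizes)
```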

16.
Partitioning of processors on a multiprocessor system involves logically dividing the system into processor partitions. Programs can be executed in the different partitions in parallel. Optimally setting the partition size can significantly improve the throughput of multiprocessor systems. The speedup characteristics of parallel programs can be defined by execution signatures. The execution signature of a parallel program on a multiprocessor system is the rate at which the program executes in the absence of other programs and depends upon the number of allocated processors, the specific architecture, and the specific program implementation. Based on the execution signatures, this paper analyzes simple Markovian models of dynamic partitioning. From the analysis, when there are at most two multiprocessor partitions, the optimal dynamic partition size can be found which maximizes throughput. Compared against other partitioning schemes, the dynamic partitioning scheme is shown to be the best in terms of throughput when the reconfiguration overhead is low. If the reconfiguration overhead is high, dynamic partitioning is to be avoided. An expression for the reconfiguration overhead threshold is derived. A general iterative partitioning technique is presented. It is shown that the technique gives maximum throughput for n partitions.

17.
Distributing spatially located heterogeneous workloads is an important problem in parallel scientific computing. We investigate the problem of partitioning such workloads (represented as a matrix of non-negative integers) into rectangles, such that the load of the most loaded rectangle (processor) is minimized. Since finding the optimal arbitrary rectangle-based partition is an NP-hard problem, we investigate particular classes of solutions: rectilinear, jagged and hierarchical. We present a new class of solutions called m-way jagged partitions, propose new optimal algorithms for m-way jagged partitions and hierarchical partitions, propose new heuristic algorithms, and provide worst case performance analyses for some existing and new heuristics. Moreover, the algorithms are tested in simulation on a wide set of instances. Results show that two of the algorithms we introduce lead to a much better load balance than the state-of-the-art algorithms. We also show how to design a two-phase algorithm that reaches different time/quality tradeoffs.
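
To show the structure of an m-way jagged partition, the sketch below first cuts the load matrix into m row stripes and then cuts each stripe independently into column blocks, using a simple greedy splitter; the paper's optimal and analysed algorithms are not reproduced here.

```python
# Greedy m x n jagged partition of a load matrix: row stripes first, then
# independent column cuts per stripe. A structural illustration only.

def greedy_split(loads, parts):
    """Greedily cut a 1-D load vector into at most `parts` consecutive intervals,
    closing an interval once it reaches the ideal average load."""
    target = sum(loads) / parts
    cuts, acc = [0], 0.0
    for i, w in enumerate(loads):
        acc += w
        if acc >= target and len(cuts) < parts:
            cuts.append(i + 1)
            acc = 0.0
    cuts.append(len(loads))
    return list(zip(cuts[:-1], cuts[1:]))

def jagged_partition(matrix, m, n):
    """Return rectangles ((r0, r1), (c0, c1)) forming an m x n jagged partition."""
    row_loads = [sum(row) for row in matrix]
    rects = []
    for r0, r1 in greedy_split(row_loads, m):
        col_loads = [sum(matrix[r][c] for r in range(r0, r1))
                     for c in range(len(matrix[0]))]
        rects += [((r0, r1), (c0, c1)) for c0, c1 in greedy_split(col_loads, n)]
    return rects

if __name__ == "__main__":
    import random
    random.seed(0)
    M = [[random.randint(0, 9) for _ in range(16)] for _ in range(16)]
    rects = jagged_partition(M, m=2, n=2)       # 4 "processors"
    loads = [sum(M[r][c] for r in range(*rr) for c in range(*cc)) for rr, cc in rects]
    print("rectangle loads:", loads, " max:", max(loads))
```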

18.
As a distributed programming model, MapReduce is widely used to process large-scale, high-dimensional data sets. It partitions data with a plain hash function, which often causes data skew when the data distribution is uneven. Clustering algorithms on MapReduce require many iterations, and the input distribution of each Reduce stage is not known in advance, so existing skew-handling methods do not apply. To address this imbalance in data partitioning, this paper proposes a strategy that changes the indices of the remaining partitions when data skew occurs. During the Map phase, the method counts the amount of data destined for each reducer; the JobTracker monitors the global partition information and dynamically modifies the original partition function according to a data skew model. In the subsequent partitioning, the Partitioner redirects partitions that would cause skew to the reducers with lighter loads, balancing the load across nodes. On Zipf-distributed and real data sets, the proposed algorithm is compared with existing skew-handling methods; the results show that it solves the data skew problem in MapReduce clustering and outperforms hash partitioning and sampling-based dynamic partitioning in both stability and execution time.
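
As a single-process stand-in for the dynamic repartitioning idea (the paper coordinates it globally through the JobTracker and a skew model), the sketch below replaces a fixed hash partitioner with a greedy rule: each key is bound to the currently least-loaded reducer the first time it is seen, so keys that collide under hashing no longer pile onto one node while records with the same key still meet at one reducer. The class and variable names are invented for the example.

```python
# Greedy skew-aware partitioner: assign each key, on first sight, to the
# least-loaded reducer and remember the choice. A stand-in illustration, not
# the paper's JobTracker-coordinated scheme.

import heapq
from collections import Counter

class GreedyPartitioner:
    def __init__(self, n_reducers):
        self.heap = [(0, r) for r in range(n_reducers)]   # (load, reducer)
        self.assignment = {}

    def partition(self, key, weight=1):
        if key not in self.assignment:
            load, r = heapq.heappop(self.heap)            # least-loaded reducer
            self.assignment[key] = r
            heapq.heappush(self.heap, (load + weight, r))
        else:
            r = self.assignment[key]
            self.heap = [(l + weight if red == r else l, red) for l, red in self.heap]
            heapq.heapify(self.heap)
        return self.assignment[key]

if __name__ == "__main__":
    # Four hot keys that all collide under hash % 4 (0, 4, 8, 12), plus a tail.
    stream = [k for k in (0, 4, 8, 12) for _ in range(1000)] + list(range(13, 213)) * 5
    part, loads = GreedyPartitioner(4), Counter()
    for k in stream:
        loads[part.partition(k)] += 1
    print(dict(loads))   # roughly balanced; hash(k) % 4 would send 4000 records to reducer 0
```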

19.
A solution to the problem of partitioning data for distributed memory machines is discussed. The solution uses a matrix notation to describe array accesses in fully parallel loops, which allows the derivation of sufficient conditions for communication-free partitioning (decomposition) of arrays. A series of examples is presented that illustrates the effectiveness of the technique for linear references, the use of loop transformations in deriving the necessary data decompositions, and a formulation that aids in deriving heuristics for minimizing communication when communication-free partitions are not feasible.

20.
This paper describes a method for the correction of optically read Devanagari character strings using a Hindi word dictionary. The word dictionary is partitioned in order to reduce the search space besides preventing forced matching to an incorrect word. The dictionary partitioning strategy takes into account the underlying OCR process. The dictionary words at the top level have been divided into two partitions, namely a short-words partition and the remaining-words partition. The short-word partition is sub-partitioned using the envelope information of the words. The envelope consists of the number of top, lower, and core modifiers, along with the number of core characters. Devanagari characters are written in three strips; most of the characters, referred to as core characters, are written in the middle strip. The remaining words are further partitioned using tags. A tag is a string of fixed length associated with each partition. The correction process uses a distance matrix for assigning a penalty for a mismatch. The distance matrix is based on the information about errors that the classification process is known to make and the confidence figure that the classification process associates with its output. An improvement of approximately 20% in recognition performance is obtained. For a short word, 590 words are searched on average from 14 sub-partitions of the short-words partition before an exact match is found. The average number of partitions and the average number of words increase to 20 and 1585, respectively, when an exact match is not found. For tag-based partitions, on average, 100 words from 30 partitions are compared when either an exact match is found or a word within the preset threshold distance is found. If an exact match or a match within a preset threshold is not found, the average number of partitions becomes 75 and 450 words on average are compared. To the best of our knowledge, this is the first work on the use of a Hindi word dictionary for OCR post-processing. Received August 6, 2001 / Accepted August 22, 2001
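
As a hedged sketch of the matching step, the code below implements a weighted edit distance whose substitution penalty is read from a confusion matrix of the kind the paper derives from known classifier errors and confidence figures; the Latin characters, costs, and threshold are placeholders for the Devanagari setting.

```python
# Weighted edit distance against one dictionary sub-partition, with substitution
# penalties taken from an OCR confusion matrix. Costs and characters are placeholders.

def weighted_edit_distance(word, candidate, confusion, ins_del=1.0):
    """Dynamic-programming edit distance with per-pair substitution penalties."""
    m, n = len(word), len(candidate)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * ins_del
    for j in range(1, n + 1):
        dp[0][j] = j * ins_del
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = word[i - 1] == candidate[j - 1]
            sub = 0.0 if same else confusion.get((word[i - 1], candidate[j - 1]), 1.0)
            dp[i][j] = min(dp[i - 1][j] + ins_del,      # deletion
                           dp[i][j - 1] + ins_del,      # insertion
                           dp[i - 1][j - 1] + sub)      # (mis)match
    return dp[m][n]

def correct(ocr_word, partition, confusion, threshold=0.6):
    """Return the closest dictionary word in this partition if within the threshold."""
    best = min(partition, key=lambda w: weighted_edit_distance(ocr_word, w, confusion))
    return best if weighted_edit_distance(ocr_word, best, confusion) <= threshold else ocr_word

if __name__ == "__main__":
    # Latin stand-ins for Devanagari; visually confusable pairs get low substitution cost.
    confusion = {("c", "e"): 0.2, ("e", "c"): 0.2, ("l", "i"): 0.3, ("i", "l"): 0.3}
    partition = ["table", "cable", "fable"]            # one dictionary sub-partition
    print(correct("cablc", partition, confusion))      # -> "cable" (cost 0.2)
```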
