期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

陈达智赵荣彩姚远韩林《计算机科学》2012,39(6):301-304

传统MPI自动并行化编译系统从数据重分布的角度,生成面向分布式存储系统的消息传递程序,但是大量数据重分布通信的额外开销导致其加速比低。为了解决此问题,在基于Open64的MPI自动并行化编译系统后端,提出了一种消息传递代码生成算法。该算法以统一数据分布为中心,根据给定的并行化循环集和通信数组集,通过修改WHIRL表示的串行代码语法结构树,生成更精确的消息传递代码。实验结果表明,该算法能够较大程度地降低消息传递程序的通信开销,并且明显提升其加速比。相似文献

2.

LS SIMD C编译器的数据通信优化算法 总被引：1，自引：1，他引：0

王晖何华籼等《计算机科学》2001,28(9):116-118

1 引言当前理想的程序自动并行化系统的实现存在许多难于解决的问题,因此较为流行的并行计算方法是利用并行语言编写并行程序,编译器对并行程序进行编译生成相应的节点程序执行。并行语言按并行执行的粒度分为基于任务的并行语言(主要面向一般应用领域的计算)和数据并行语言(主要应用于科学数值计算),典型的数据并行语言如HPF。对于数据并行语言而言,程序执行的并行性已由程序设计人员根据程序中的数据相关性给出。因此,如何确定数据的分布、优化数据的通信是影响并行程序执行效率的重要问题。数据分布大致可以分为两个阶段:首先对源程序中数据的相关性分析得到数据在抽象处理机上的分布,然后将抽象处理机上的数据分布映射到物理处理机上。数据分布的确定通常有以下几种实现方式:一种是由程序员给出抽象数据分布,编译相似文献

3.

无共享并行数据库中的数据分布研究

鱼佳欣师军卢照《计算机与数字工程》2010,38(3):54-56

并行数据库系统的性能与数据分布密切相关。文章介绍了基于无共享并行数据库结构中目前流行的各类数据分布方法,并在此基础上详细讨论了数据分布涉及的错开问题、动态维护问题及高可用性问题,且对这些问题提出相应的解决方法。最后提出了优化数据分布的一些考虑。相似文献

4.

基于线性不等式的数据划分方法的优化 总被引：1，自引：0，他引：1

董春丽赵荣彩杜澎王峥《计算机应用》2007,27(5):1251-1253

计算和数据划分是串行程序并行化时所要解决的一个重要问题，如何对程序中引用的数据进行合理的分布以最大限度的发现程序的并行性减少数据重分布的通信开销，是并行编译优化的重点。给出的数据和计算的优化分解方法是基于Anderson-Lam的分解算法上改进得到的。根据Anderson-Lam的算法得到数据和计算划分后，以线性不等式的形式表示，然后通过分析循环嵌套中能够进行边界冗余的只读数组，重新构造数据划分不等式，根据此不等式进行数据分布，实现具有边界冗余的只读数组的数据划分，有效地减少了数据收发的通信量。相似文献

5.

并行数据库上的并行CMD_Join算法

李建中都薇《软件学报》1998,(4)

并行数据库在多处理机之间的分布方法(简称数据分布方法)对并行数据操作算法的性能影响很大.如果在设计并行数据操作算法时充分利用数据分布方法的特点,可以得到十分有效的并行算法.本文研究如何充分利用数据分布方法的特点,设计并行数据操作算法的问题,提出了基于CMD多维数据分布方法的并行CMD_Join算法.理论分析和实验结果表明,并行CMD_Join算法的效率高于其它并行Join算法. 相似文献

6.

并行数据库上的并行CMD－Join算法 总被引：3，自引：1，他引：3

李建中都薇《软件学报》1998,9(4):256-262

并行数据库在多处理机之间的分布方法(简称数据分布方法)对并行数据操作算法的性能影响很大.如果在设计并行数据操作算法时充分利用数据分布方法的特点,可以得到十分有效的并行算法.本文研究如何充分利用数据分布方法的特点,设计并行数据操作算法的问题,提出了基于CMD多维数据分布方法的并行CMD－Join算法.理论分析和实验结果表明,并行CMD－Join算法的效率高于其它并行Join算法. 相似文献

7.

机群系统中基于MPI的多维稀疏数组传递方法

俞时权胡浩民马德云《计算机工程》2003,29(5):69-71

机群系统是一种分布存储系统，它主要利用消息传递方式来实现各结点之间的通信。而MPI（Message Passing Interface）作为一种基于消息传递的并行程序设计环境，已广泛应用于多种并行系统，尤其是像机群系统那样的分布存储并行机。该文主要探讨了MPI中的消息传递调用接口，提出了几种有效的在结点间传递多维稀疏数组的方法，并通过实践加以比较。相似文献

8.

并行数据库上的进行CMD—Join算法 总被引：1，自引：1，他引：1

李建中都薇《软件学报》1998,9(4):256-262

并行数据库在多处理机之间的分布方法对并行数据算法的性能影响很大，如果在设计并行数据操作算法时充分利用数据分布方法的特点，可以得到十分有效的并行算法。本研究如何充分利用数据分布方法的特点，设计并行数据操作算法的问题，提出了基ＣＭＤ多维数据分布方法的并行ＣＭＤ－Ｊｏｉｎ算法，理论分析和实验结果表明，并行ＣＭＤ－Ｊｏｉｎ算法的效率高于其它并行Ｊｏｉｎ算法。相似文献

9.

一种并行数据库的动态多维数据分布方法 总被引：7，自引：0，他引：7

李建中《软件学报》1999,10(9):909-916

并行数据库系统的性能与数据库在多处理机之间的分布密切相关.目前已经出现一些并行数据库的数据分布方法.但是,这些方法都不能有效地支持动态数据库.文章提出了一种并行数据库的动态多维数据分布方法.该方法不仅能够有效地支持动态数据库的分布,还具有多维数据分布的诸多优点.此方法由初始数据分布机构和启发式动态数据分布调整机构组成.初始分布机构完成给定数据库文件的初始分布.动态数据分布调整机构实现动态数据库数据分布的动态调整.理论分析和实验结果表明,这种方法十分有效,并且能够有力地支持动态数据库上的各种并行数据操作算法. 相似文献

10.

基于Nash-Pareto策略的自动数据分布方法及支持工具

王晓燕陈晋川郭小燕杜小勇《计算机研究与发展》2015,52(9):1965-1975

大数据时代的来临为数据存储与管理提出了新的挑战.随着数据量的迅猛增加,自动数据分布逐渐成为分布式系统中的研究重点和难点.根据对数据分布问题中数据、负载和节点3个要素的研究和分析,将数据分布问题抽象为称为DaWN(data,workload,node)的三角模型,并将3要素之间的相互关联关系抽象为数据分片、数据分配和负载执行3条纽带;据此,提出了解决自动数据分布问题的基本架构,对各功能模块的协动关系进行探讨;同时,结合已有的研究工作,采用Nash-Pareto优化均衡策略使得前述各机制相得益彰,实验结果验证了其有效性.为使研究工作更多地应用于实践,设计并实现了自动数据分布辅助原型工具ADDvisor(automatic data distribution advisor),协同支持自动数据分布的执行,共同促进大规模分布式联机事务处理系统的并行性能和自动化管理技术的发展. 相似文献

11.

Legacy code and parallel computing: updating and parallelizing a numerical model

Tinetti Fernando G. Perez Maximiliano J. Fraidenraich Ariel Altenberg Adolfo E. 《The Journal of supercomputing》2020,76(7):5636-5654

In this paper, we present several important details in the process of legacy code parallelization, mostly related to the problem of maintaining numerical output of a legacy code while obtaining a balanced workload for parallel processing. Since we maintained the non-uniform mesh imposed by the original finite element code, we have to develop a specially designed data distribution among processors so that data restrictions are met in the finite element method. In particular, we introduce a data distribution method that is initially used in shared memory parallel processing and obtain better performance than the previous parallel program version. Besides, this method can be extended to other parallel platforms such as distributed memory parallel computers. We present results including several problems related to performance profiling on different (development and production) parallel platforms. The use of new and old parallel computing architectures leads to different behavior of the same code, which in all cases provides better performance in multiprocessor hardware.

相似文献

12.

Interprocedural partial redundancy elimination with application todistributed memory compilation

Agrawal G. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(7):609-625

Partial Redundancy Elimination (PRE) is a general scheme for suppressing partial redundancies which encompasses traditional optimizations like loop invariant code motion and redundant code elimination. In this paper, we address the problem of performing this optimization interprocedurally. We present an Interprocedural Partial Redundancy Elimination (IPRE) scheme based upon a new, concise, full program representation. Our method is applicable to arbitrary recursive programs. We use interprocedural partial redundancy elimination for placement of communication and communication preprocessing statements while compiling for distributed memory parallel machines. We have implemented our scheme as an extension to the Fortran D compilation system. We present experimental results from two codes compiled using our system to demonstrate the useful of IPRE in distributed memory compilation 相似文献

13.

Techniques for compiling programs on distributed memory multicomputers

PeiZong Lee 《Parallel Computing》1995,21(12):1895-1923

It is widely accepted that distributed memory parallel computers will play an important role in solving computation-intensive problems. However, the design of an algorithm in a distributed memory system is time-consuming and error-prone, because a programmer is forced to manage both parallelism and communication. In this paper, we present techniques for compiling programs on distributed memory parallel computers. We will study the storage management of data arrays and the execution schedule arrangement of Do-loop programs on distributed memory parallel computers. First, we introduce formulas for representing data distribution of specific data arrays across processors. Then, we define communication cost for some message-passing communication operations. Next, we derive a dynamic programming algorithm for data distribution. After that, we show how to improve the communication time by pipelining data, and illustrate how to use data-dependence information for pipelining data. Jacobi's iterative algorithm and the Gauss elimination algorithm for linear systems are used to illustrate our method. We also present experimental results on a 32-node nCUBE-2 computer. 相似文献

14.

Automatic mapping of parallel applications on multicore architectures using the Servet benchmark suite

Jorge González-Domínguez^{Author Vitae} Guillermo L. Taboada Author VitaeBasilio B. Fraguela Author Vitae María J. Martín Author VitaeJuan Touriño Author Vitae 《Computers & Electrical Engineering》2012,38(2):258-269

Servet is a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. These parameters can be used for autotuning codes to increase their performance on multicore clusters. Although Servet has been proved to detect accurately cache hierarchies, bandwidths and bottlenecks in memory accesses, as well as the communication overhead among cores, up to now the impact of the use of this information on application performance optimization has not been assessed. This paper presents a novel algorithm that automatically uses Servet for mapping parallel applications on multicore systems and analyzes its impact on three testbeds using three different parallel programming models: message-passing, shared memory and partitioned global address space (PGAS). Our results show that a suitable mapping policy based on the data provided by this tool can significantly improve the performance of parallel applications without source code modification. 相似文献

15.

Possibilities of Optimal Execution of Parallel Programs Containing Simple and Iterated Loops on Heterogeneous Parallel Computational Systems with Distributed Memory

A. I. Avetisyan S. S. Gaisaryan O. I. Samovarov 《Programming and Computer Software》2002,28(1):28-40

The problem of load balancing when executing parallel programs on computational systems with distributed memory is currently of great interest. The most general statement of this problem is that for one parallel loop: execution of a heterogeneous loop on a heterogeneous computational system. When stated in this way, the problem is NP-complete even in the case of two nodes, and no acceptable heuristics for solving it are found. Since the development of heuristics is a rather complicated task, we decided to examine the problem by elementary methods in order to refine (and, possibly, simplify) the original problem statement. The results of our studies are discussed in this paper. Estimates of efficiency of parallel loop execution as functions of the number of nodes of homogeneous and heterogeneous parallel computational systems are obtained. These estimates show that the use of heterogeneous parallel systems reduces the efficiency even in the case when their communication subsystems are scaleable (see the definition in Section 4). The use of local networks (heterogeneous parallel computational systems with nonscaleable communication subsystems) for parallel computations with heavy data exchange is not advantageous and is possible only for a small number of nodes (about five). An algorithm of optimal distribution of data between the nodes of a homogeneous or heterogeneous computational system is suggested. Results of numerical experiments substantiate the conclusions obtained. 相似文献

16.

Deriving Array Distributions by Optimization Techniques

Rauber Thomas Rünger Gudula 《The Journal of supercomputing》2000,15(3):271-293

The paper presents a new method to derive data distributions for parallel computers with distributed memory organization by a mathematical optimization technique. Prerequisites for this approach are a parameterized data distribution and a rigorous performance prediction technique that allows us to derive runtime formulas containing the parameters of the data distribution. A mathematical optimization technique can then be used to determine the parameters in such a way that the total runtime is minimized, thus also minimizing the communication overhead and the load imbalance penalty. The method is demonstrated by using it to determine a data distribution for the LU decomposition of a matrix. 相似文献

17.

分布存储并行机上的自动数据布局优化模型

谢幸陈国良武继刚《计算机研究与发展》2000,37(10):1173-1178

在分布式并行机上,数据布局的质量极大的影响着应用程序的执行性能,以往的研究一般将自动数据布局优化问题近似分解为数据对准优化和数据分布优化两步来解决,且对两者的结合只研究了一维的情况,在相关研究工作的基础上,在多维情况下将数据对准优化和数据分布优化结合在一个模型当中,提出了一个数据对准优化与数据分布优化统一的多维静态数据布局模型,避免了采用启发式策略,从而更加精确地描述了自动数据布局优化问题,同时给相似文献

18.

Analyses and Optimizations for Shared Address Space Programs

Arvind Krishnamurthy Katherine Yelick 《Journal of Parallel and Distributed Computing》1996,38(2):130

We present compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space. Any type of code motion on explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. The analysis, calledcycle detection, is based on work by Shasha and Snir and checks for cycles among interfering accesses. We improve the accuracy of their analysis by using additional information fromsynchronization analysis, which handles post–wait synchronization, barriers, and locks. We also make the analysis efficient by exploiting the common code image property of SPMD programs. We make no assumptions on the use of synchronization constructs: our transformations preserve program meaning even in the presence of race conditions, user-defined spin locks, or other synchronization mechanisms built from shared memory. However, programs that use linguistic synchronization constructs rather than their user-defined shared memory counterparts will benefit from more accurate analysis and therefore better optimization. We demonstrate the use of this analysis for communication optimizations on distributed memory machines by automatically transforming programs written in a conventional shared memory style into a Split-C program, which has primitives for nonblocking memory operations and one-way communication. The optimizations includemessage pipelining, to allow multiple outstanding remote memory operations, conversion of two-way to one-way communication, and elimination of communication through data reuse. The performance improvements are as high as 20–35% for programs running on a CM-5 multiprocessor using the Split-C language as a global address layer. Even larger benefits can be expected on machines with higher communication latency relative to processor speed. 相似文献

19.

自动并行化中不规则循环的通信代码生成

傅立国姚远丁锐《计算机应用》2014,34(4):1014-1018

不规则计算在大规模并行应用中广泛存在。在面向分布存储结构的自动并行化过程中,较难在编译时为不规则循环生成并行代码。并行代码中的通信代码对程序运行结果的正确性以及加速效果有着严重的影响。通过分析程序的数组重分布图,使用部分冗余的通信方式来维持不规则数组访问的生产者消费者关系,可以在编译时为一类常见的不规则循环自动生成有效的通信代码。该方法使用计算分解和数组引用的访问表达式求解不规则数组在各处理器的本地定义集作为通信的数据集,分析针对此类不规则循环划分的通信策略,继而生成相应的通信代码。实验测试的结果取得了预期的加速效果,验证了方法的有效性。相似文献

20.

Parallel Array Object I/O Support on Distributed Environments

Jenq Kuen Lee Ing-Kuen Tsaur San-Yih Hwang 《Journal of Parallel and Distributed Computing》1997,40(2):1425

This paper presents a parallel file object environment to support distributed array store on shared-nothing distributed computing environments. Our environment enables programmers to extend the concept of array distributions from memory levels to file levels. It allows parallel I/O that facilitates the distribution of objects in an application. When objects are read and/or written by multiple applications using different distributions, we present a novel scheme to help programmers to select the best data distribution pattern according to a minimum amount of remote data movements for the store of array objects on distributed file systems. Our selection scheme, to the best of our knowledge, is the first work to attempt to optimize the distribution patterns in the secondary storage for HPF-like programs with inter-application cases. This is especially important for a class of problems called multiple disciplinary optimization (MDO) problems. Our test bed is built on an 8-node DEC Farm connected with an ethernet, FDDI, or ATM switch. Our experimental results with scientific applications show that not only our parallel file system can provide aggregate bandwidths, but also our selection scheme effectively reduces the communication traffic for the system. 相似文献