Similar Documents
20 similar documents found.
1.
Declarative Programming Languages (DPLs) apply a process model of Horn clauses such as PARLOG [8] or a reduction model of λ-calculus such as SML [7] and are, in principle, well suited to multiprocessor implementation. However, the performance of a parallel declarative program can be impaired by a mismatch between the parallelism available in an application and the parallelism available in the architecture. A particularly attractive solution is to automatically match the parallelism of the program to the parallelism of the target hardware as a compilation step. In this paper, we present an optimizing compilation technique called granularity analysis which identifies and removes excess parallelism that would degrade performance. The main steps are: an analysis of the flow of data to form an attributed call graph between function (or predicate) arguments; and an asymptotic estimation of the granularity of a function (or predicate) to generate an approximate grain size. Compiled procedure calls can be annotated with grain size, and a task scheduler can use the classification of grains to make scheduling decisions that control parallelism at run-time. The resulting granularity analysis scheme is suitable for exploiting adaptive parallelism of declarative programming languages on multiprocessors.
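To make the run-time half of this scheme concrete, here is a minimal sketch in Python, assuming hypothetical names (`run_annotated_call`, `GRAIN_THRESHOLD`) and a toy thread-based scheduler rather than the paper's: the compile-time grain-size annotation travels with the call, and the scheduler parallelizes only grains above a threshold.

```python
import threading

GRAIN_THRESHOLD = 10_000  # hypothetical cut-off, in abstract cost units

def run_annotated_call(func, args, grain_size):
    """Scheduling decision driven by the compile-time grain-size annotation:
    coarse grains are spawned as parallel tasks, fine grains run inline."""
    if grain_size >= GRAIN_THRESHOLD:
        t = threading.Thread(target=func, args=args)  # enough work: parallelize
        t.start()
        return t
    func(*args)                                       # too fine: avoid task overhead
    return None

def fib(n):  # grain size grows like 2**n, estimable asymptotically at compile time
    return n if n < 2 else fib(n - 1) + fib(n - 2)

handle = run_annotated_call(fib, (25,), grain_size=2**25)  # spawned
run_annotated_call(fib, (5,), grain_size=2**5)             # run inline
if handle:
    handle.join()
```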

2.
A Region-Based Compilation Framework
The traditional function-scoped backend compilation framework is a convenient way to partition a program. However, considering the resource demands of compilation (such as compile time and memory usage), code performance, and compiler functionality, the scope and structure of a function are not the most suitable program partition for analysis and optimization. This mismatch becomes even more pronounced as modern compilers adopt increasingly complex algorithms, with higher time and space complexity, to uncover instruction-level parallelism: when a function's scope is large, applying such algorithms with the function as the basic compilation unit usually leads to excessive compile time and/or memory consumption. Hank proposed a compilation framework in which the scope and structure of optimization can be controlled to some extent. Motivated by compile time and optimization opportunities, this paper proposes a new region-based compilation framework that also allows region-based optimization-guiding attributes to be passed and observed across optimization phases. This region-based compilation framework has been implemented in ORC (Open Research Compiler), which targets the Itanium processor. Experimental results show that the framework succeeds in controlling the time and space complexity of compilation.

3.
Parallel Computing, 1999, 25(13-14): 1741-1783
Over the past two decades tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in the form of a program representation. Next, compilation techniques for scheduling instruction level parallelism (ILP) are discussed, along with the relationship between the nature of compiler support and the type of processor architecture. Compilation techniques for exploiting loop and task level parallelism on shared-memory multiprocessors (SMPs) are summarized. Locality optimizations that must be used in conjunction with parallelization techniques for achieving high performance on machines with complex memory hierarchies are also discussed. Finally, we provide an overview of compilation techniques for distributed memory machines, which must partition both code and data for parallel execution. Communication optimization and code generation issues that are unique to such compilers are also briefly discussed.

4.
While the dataflow execution model can potentially uncover all forms and levels of parallelism in a program, in its traditional fine grain form it does not exploit any form of locality. Recent evidence indicates that the exploitation of locality in dataflow programs could have a dramatic impact on performance. The current trend in the design of dataflow processors suggests a synthesis of traditional nonstrict fine grain instruction execution and strict coarse grain execution in order to exploit locality. While an increase in instruction granularity favors the exploitation of locality within a single execution thread, the resulting grain size may increase latency among execution threads. We define fine grain intrathread locality as a dynamic measure of instruction level locality and quantify it using a set of numeric and nonnumeric benchmarks. The results point to a very large degree of intrathread locality and a remarkable uniformity and consistency of the distribution of thread locality across a wide variety of benchmarks. As the execution is moved to a coarser granularity, the input latency of operands can increase, which would have a detrimental effect on performance. We evaluate the latency incurred through the partitioning of fine grain instructions into coarser grain threads. We define the concept of a cluster of fine grain instructions to quantify coarse grain input and output latencies. The results of our experiments offer compelling evidence that coarse grain execution outperforms fine grain execution on a significant number of numeric codes. These results suggest that the effect of increased instruction granularity on latency is minimal for a high percentage of the measured codes, and in large part is offset by available intrathread locality. Furthermore, simulation results indicate that strict or nonstrict data structure access does not change the basic cluster characteristics.

5.
In this paper, we propose a generic method of automatic parallelization for sparse matrix computation. This method is based on both a refinement of the data-dependence test proposed by Bernstein and an inspector-executor scheme which is specialized to each input program of the compiler. The analysis mixes a compilation process with a run-time process. The sparsity of the underlying data structure determines a specific parallelism which increases the degree of parallelism of an algorithm. Such a source of parallelism has already been applied to many numerical algorithms, such as the usual Cholesky factorization or LU-decomposition algorithms, considered the gold standards of parallelization based on sparsity. Standard automatic parallelization methods cannot tackle this source of parallelism because they are based on the values of array cells and not merely on the memory addressing function. Automating this parallelism requires a mixed compile-time and run-time approach integrated in an inspector-executor process. The compilation step provides a dedicated inspector devoted to the analyzed program. The inspector computes the dependence graph at run time, which allows a dynamic parallelization of the execution. As noted above, the generic scheme developed in this paper follows design principles that have been applied, but each time in an ad hoc way, to many sparse parallelizations of numerical algorithms such as the Cholesky algorithm. As far as we know, no general formal framework has been proposed to automate such a method of sparse parallelization. In this paper, we propose a generic framework of sparse parallelization (i.e. numerical-program independent) which can be applied to any numerical program satisfying the usual syntactic constraints of parallelization.
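A minimal sketch of the inspector-executor pattern the paper builds on (a generic illustration, not the specialized inspectors its compiler generates): the inspector scans the sparse index structure at run time, derives iteration-level dependences, and levelizes them into wavefronts that the executor could run in parallel. All names are illustrative.

```python
from collections import defaultdict

def inspector(accesses):
    """accesses: per-iteration lists of memory locations touched (known only
    at run time for sparse codes).  Derives dependences between iterations
    that touch the same location and levelizes them into wavefronts."""
    last_touch = {}          # location -> last iteration that touched it
    level = []               # wavefront number of each iteration
    for i, locs in enumerate(accesses):
        lvl = 0
        for loc in locs:
            if loc in last_touch:                    # dependence on earlier iter
                lvl = max(lvl, level[last_touch[loc]] + 1)
            last_touch[loc] = i
        level.append(lvl)
    waves = defaultdict(list)
    for i, lvl in enumerate(level):
        waves[lvl].append(i)
    return [waves[lvl] for lvl in sorted(waves)]

def executor(accesses, body):
    for wave in inspector(accesses):
        for i in wave:       # iterations within one wavefront are independent;
            body(i)          # a real executor would run them in parallel

# Sparse update x[idx[i]] += ...: iterations 0 and 2 collide on location 3.
idx = [3, 7, 3, 5]
print(inspector([[loc] for loc in idx]))      # [[0, 1, 3], [2]]
executor([[loc] for loc in idx], lambda i: print("run iteration", i))
```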

6.
Effective use of cache memory is becoming more important with the increasing gap between processor speed and memory access speed. Likewise, the use of multigrain parallelism is becoming more important for improving effective performance beyond the limits of loop-iteration-level parallelism. Considering these factors, this paper proposes a static scheduling scheme for coarse grain tasks that incorporates cache optimization. The proposed scheme schedules coarse grain tasks to threads so that data shared among coarse grain tasks can be passed via cache, after task and data decomposition that takes the cache size into account at compile time. It is implemented in the OSCAR Fortran multigrain parallelizing compiler and evaluated on a Sun Ultra80 four-processor SMP workstation using Swim and Tomcatv from SPEC fp 95. The proposed scheme yields speedups on 4 processors of 4.56 for Swim and 2.37 for Tomcatv over the Sun Forte HPC Ver. 6 update 1 loop parallelizing compiler.
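A minimal sketch of the underlying idea, not the OSCAR implementation: once loops are decomposed into coarse grain tasks whose working sets fit in cache, a greedy scheduler can bind tasks that touch the same data block to the same thread so the block is passed via cache. Names and the cost model are assumptions.

```python
from collections import defaultdict

def schedule_cache_aware(tasks, num_threads):
    """Greedy sketch: bind a task to the thread that already owns a data
    block it touches, so shared data is passed via cache; otherwise fall
    back to the least loaded thread.  tasks: (name, cost, blocks) tuples
    listed in a precedence-respecting order."""
    owner = {}                                   # data block -> thread id
    load = [0] * num_threads                     # accumulated cost per thread
    assignment = defaultdict(list)
    for name, cost, blocks in tasks:
        shared = [owner[b] for b in blocks if b in owner]
        tid = shared[0] if shared else load.index(min(load))
        assignment[tid].append(name)
        load[tid] += cost
        for b in blocks:
            owner[b] = tid
    return dict(assignment)

# The two decomposed loops share blocks A1/A2: sharers land on the same thread.
tasks = [("loop1_A1", 10, ["A1"]), ("loop1_A2", 10, ["A2"]),
         ("loop2_A1", 12, ["A1"]), ("loop2_A2", 12, ["A2"])]
print(schedule_cache_aware(tasks, num_threads=4))
```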

7.
Several types of parallelism can be exploited in logic programs while preserving correctness and efficiency, i.e. ensuring that the parallel execution obtains the same results as the sequential one and that the amount of work performed is not greater. However, such results do not take into account a number of overheads which appear in practice, such as process creation and scheduling, which can induce a slow-down, or, at least, limit speedup, if they are not controlled in some way. This paper describes a methodology whereby the granularity of parallel tasks, i.e. the work available under them, is efficiently estimated and used to limit parallelism so that the effect of such overheads is controlled. The run-time overhead associated with the approach is usually quite small, since as much work as possible is done at compile time. Also, a number of run-time optimizations are proposed. Moreover, a static analysis of the overhead associated with the granularity control process is performed in order to decide whether it is worthwhile. The performance improvements resulting from the incorporation of grain size control are shown to be quite good, especially for systems with medium to large parallel execution overheads.
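The following sketch transplants the shape of this transformation into Python for illustration (the paper targets logic programs; `SPAWN_OVERHEAD` and the cost function are assumptions): the compiler derives a cheap grain-size estimator for each procedure, and the inserted run-time test pays for parallelism only when the estimated work exceeds the overhead.

```python
import threading

SPAWN_OVERHEAD = 500  # assumed cost of task creation + scheduling

def cost_nrev(xs):
    """Compile-time-derived cost function: naive list reversal does
    n*(n+1)/2 units of work; evaluating this at run time is one length check."""
    n = len(xs)
    return n * (n + 1) // 2

def nrev(xs):
    return nrev(xs[1:]) + [xs[0]] if xs else []

def nrev_granularity_controlled(xs):
    """Inserted run-time test: go parallel only when the estimated grain
    size exceeds the parallel execution overhead."""
    if cost_nrev(xs) > SPAWN_OVERHEAD:
        out = {}
        t = threading.Thread(target=lambda: out.setdefault("r", nrev(xs)))
        t.start(); t.join()   # a toy stand-in for a real parallel engine
        return out["r"]
    return nrev(xs)           # grain too small: stay sequential

print(nrev_granularity_controlled(list(range(50))))  # cost 1275 > 500: parallel path
```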

8.
Exploiting the parallelism available in loops
Lilja, D.J. Computer, 1994, 27(2): 13-26
Because a loop's body often executes many times, loops provide a rich opportunity for exploiting parallelism. This article explains scheduling techniques and compares results on different architectures. Since parallel architectures differ in synchronization overhead, instruction scheduling constraints, memory latencies, and implementation details, determining the best approach for exploiting parallelism can be difficult. To indicate their performance potential, this article surveys several architectures and compilation techniques using a common notation and consistent terminology. First we develop the critical dependence ratio to determine a loop's maximum possible parallelism, given infinite hardware. Then we look at specific architectures and techniques. Loops can provide a large portion of the parallelism available in an application program, since the iterations of a loop may be executed many times. To exploit this parallelism, however, one must look beyond a single basic block or a single iteration for independent operations. The choice of technique depends on the underlying architecture of the parallel machine and the characteristics of each individual loop.
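The abstract does not give the critical dependence ratio formula, so the sketch below computes a standard bound of the same flavor under stated assumptions: with infinite hardware, the chain of loop-carried dependences must still execute serially, which caps the loop's speedup.

```python
import math

def min_parallel_time(n_iters, iter_time, dep_distance, dep_latency):
    """Lower bound on loop time with infinite processors: the chain of
    loop-carried dependences (every dep_distance-th iteration) must run
    serially.  An illustrative bound, not the paper's exact formula."""
    chain_len = math.ceil(n_iters / dep_distance)
    return chain_len * dep_latency + iter_time

def max_speedup(n_iters, iter_time, dep_distance, dep_latency):
    serial_time = n_iters * iter_time
    return serial_time / min_parallel_time(n_iters, iter_time,
                                           dep_distance, dep_latency)

# for (i = 2; i < N; i++) a[i] = a[i-2] + b[i];  -> dependence distance 2
print(max_speedup(n_iters=1000, iter_time=10, dep_distance=2, dep_latency=3))
```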

9.
A recent paper [7] describes a variable precision interval arithmetic algorithm for computing, to a user-specified accuracy, the zeros of an analytic function inside a given rectangle. The algorithm is based on the argument principle in the set of complex numbers C and carries much potential for parallelization at various levels of granularity. Here we explain how to modify the sequential algorithm to take advantage of parallelism at four levels, ranging from coarse to medium grain. The algorithm is tested in a distributed environment consisting of eight SPARC workstations. The underlying software environment is also discussed.

10.
Nested dissection is a very popular direct method for solving sparse linear systems that arise from finite difference and finite element methods. Worley and Schreiber [16] give a fine grain algorithm for a square array of processors. Their algorithm uses O(N²) processors, each with O(N) memory, to factor an N² × N² sparse matrix whose graph is an N × N mesh. The efficiency of their method is between 1/46 and 1/12. George et al. [6,8] give a medium grain algorithm for hypercube architectures, while George et al. [7] give an algorithm for shared memory machines. These papers present a column oriented approach which can exploit O(N) parallelism and yield efficiencies up to 50%. Lucas [11] also gives a column oriented scheme which achieves up to 75% efficiency and O(N) parallelism. In this paper, we present a medium to fine grain algorithm for a P × P array of processors with local memory. This algorithm can exploit up to O(N²) parallelism. The efficiency of the fine grain version is comparable to that of [16], while as a medium grain algorithm it achieves about 49% efficiency. The strength of the method is due to three factors: its ability to pipeline much of the computation, the overlapping of computation and communication, and the use of level 3 BLAS-like primitives. In addition to its high efficiency, its memory requirement is optimal: only O(N² log N / P²) words of memory are needed per processor.
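For context, here is a minimal sketch of the sequential nested dissection ordering itself (independent of the parallel algorithms compared above): a vertex separator splits the N × N mesh recursively, and separator vertices are numbered last, which produces the sparsity structure these parallel factorizations exploit.

```python
def nested_dissection(rows, cols, order=None):
    """Recursively number the vertices of a grid: the two halves first,
    the separator last, splitting along the longer side each time."""
    if order is None:
        order = []
    if not rows or not cols:
        return order
    if len(rows) <= 2 and len(cols) <= 2:            # small block: number directly
        order.extend((r, c) for r in rows for c in cols)
        return order
    if len(rows) >= len(cols):
        mid = len(rows) // 2
        nested_dissection(rows[:mid], cols, order)       # the two halves are
        nested_dissection(rows[mid + 1:], cols, order)   # independent subproblems
        order.extend((rows[mid], c) for c in cols)       # separator numbered last
    else:
        mid = len(cols) // 2
        nested_dissection(rows, cols[:mid], order)
        nested_dissection(rows, cols[mid + 1:], order)
        order.extend((r, cols[mid]) for r in rows)
    return order

n = 7
ordering = nested_dissection(list(range(n)), list(range(n)))
print(len(ordering), ordering[-n:])   # 49 vertices; the top separator comes last
```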

11.
Analysis of a Heuristic for Code Partitioning
In this paper, we analyze the time complexity and performance of a heuristic for code partitioning for Distributed Memory Multiprocessors (DMMs). The partitioning method is data-flow based, and all levels of parallelism are exploited. Given a weighted Directed Acyclic Graph (DAG) representation of the program, our algorithm automatically determines the granularity of parallelism by partitioning the graph into tasks to be scheduled on the DMM. The granularity of parallelism depends only on the program to be executed and on the target machine parameters. The output of our algorithm is passed on as input to the scheduling phase. Finding an optimal solution to this problem is NP-complete. Due to the high cost of graph algorithms, it is nearly impossible to come up with close-to-optimal solutions that do not themselves have very high (higher-order polynomial) cost. Our proposed heuristic gives good performance and has relatively low cost.
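A minimal sketch of a grain-forming heuristic of this general kind (illustrative only, not the paper's algorithm): walk the weighted DAG in topological order and merge each node into its unique predecessor's task while the merged grain stays under a cap, trading excess parallelism for lower scheduling overhead.

```python
def partition_dag(nodes, edges, max_grain):
    """nodes: {name: compute_cost} in topological order; edges: (src, dst).
    Merge each node into its unique predecessor's task while the merged
    grain stays under max_grain (illustrative, not the paper's heuristic)."""
    task_of = {v: v for v in nodes}
    grain = dict(nodes)                      # task -> accumulated cost
    preds = {v: [s for s, d in edges if d == v] for v in nodes}
    for v in nodes:
        pred_tasks = {task_of[p] for p in preds[v]}
        if len(pred_tasks) == 1:             # no join point: chain candidate
            t = pred_tasks.pop()
            if grain[t] + grain[v] <= max_grain:
                grain[t] += grain.pop(v)     # coarsen: fold v into the task
                task_of[v] = t
    return task_of

nodes = {"a": 5, "b": 3, "c": 3, "d": 4}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(partition_dag(nodes, edges, max_grain=8))
# {'a': 'a', 'b': 'a', 'c': 'c', 'd': 'd'}: b folds into a, then the grain
# cap and the b/c join keep c and d as separate (parallel) tasks
```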

12.
Algorithmic skeletons provide a promising basis for the automatic utilisation of parallelism at sites of higher order function use through static program analysis. However, decisions about whether or not to realise particular higher order function instances as skeletons must be based on information about processing resources available at runtime.

In principle, nested higher order functions may be realised as nested skeletons. However, where higher order function arguments result from partially applied functions, free-variable bindings must be identified and communicated through the corresponding skeleton hierarchy to where those arguments are actually applied.

Here, a skeleton based parallelising compiler for Standard ML is presented. Hybrid skeletons, which can change from parallel to serial evaluation at runtime, are considered and mechanisms for their nesting are discussed. The main compilation stages are illustrated for simple examples. A nested higher order function based algorithm for multiplying matrices of arbitrary length integers is presented along with performance figures for compiled code running on a Fujitsu AP3000.

13.
Program Slicing is a well-known decomposition technique that transforms a large program into a smaller one that contains only statements relevant to the computation of a selected function. In this paper, we present two novel predicate-based dynamic slicing algorithms for message passing programs. Unlike more traditional slicing criteria that focus only on the parts of the program that influence a variable of interest at a specific position in the program, a predicate focuses on those parts of the program that influence the predicate. Dynamic predicate slices capture global requirements or suspected error properties of a distributed program and compute all statements that are relevant to them. The presented algorithms differ from each other in their computational approaches (forward versus backward) and in the granularity of information they provide. A proof of correctness of these algorithms is provided. Through the introduction of dominant states and dominant events, critical statement executions that change the value of the global predicate are identified. Under this formulation, optimizing dynamic predicate slicing becomes a meaningful goal as well. Finally, we show how predicate slices can be applied to support comprehension tasks for analyzing and maintaining distributed programs.

14.
This paper presents a system for parallel execution of Prolog supporting both independent conjunctive and disjunctive parallelism. The system is intended for distributed memory architectures and is composed of a set of workers with a hierarchically structured scheduler. The execution model has been designed in such a way that each worker's environment does not contain references to terms in other environments, thus reducing communication overhead. To guarantee that exploiting parallelism actually improves performance, granularity control has been introduced for each kind of parallelism. For conjunctive parallelism, PDP applies a control based on the estimation provided by CASLOG; the features of the system allow this control to be introduced without adding overhead. For disjunctive parallelism, PDP controls granularity by applying a heuristic-based method, which can be adapted to other parallel Prolog systems. Different scheduling policies have also been tested. The system has been implemented on a transputer network, and performance results show that it provides a high speedup for coarse grain parallel programs.

15.
In this paper, we study several issues related to the medium grain dataflow model of execution. We present bottom-up compilation of medium grain clusters from a fine grain dataflow graph. We compare the basic block and the dependence sets algorithms that partition dataflow graphs into clusters. For an extensive set of benchmarks we assess the average number of instructions in a cluster and the reduction in matching operations compared with fine grain dataflow execution. We study the performance of medium grain dataflow when several architectural parameters, such as the number of processors, matching cost, and network latency, are varied. The results indicate that medium grain execution offers a good speedup over the fine grain model, that it is scalable, and that it tolerates network latency and high matching costs well. Medium grain execution can benefit from a higher output bandwidth of a processor, and finally, a simple superscalar processor with an issue rate of two is sufficient to exploit the internal parallelism of a cluster. This work is supported in part by NSF Grants CCR-9010240 and MIP-9113268.

16.
A new technique for estimating and understanding the speed improvement that can result from executing a program on a parallel computer is described. The technique requires no additional programming and minimal effort by a program's author. The analysis begins by tracing a sequential program. A parallelism analyzer uses information from the trace to simulate parallel execution of the program. In addition to predicting parallel performance, the parallelism analyzer measures many aspects of a program's dynamic behavior. Measurements of six substantial programs are presented. These results indicate that the three symbolic programs differ substantially from the numeric programs and, as a consequence, cannot be automatically parallelized with the same compilation techniques.
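The core of such an analyzer can be sketched in a few lines, assuming the trace has already been reduced to operations with costs and dependences (a hypothetical format): replay the trace, give each operation the earliest finish time its inputs allow, and compare the resulting parallel time with the sequential sum.

```python
def analyze_trace(trace):
    """trace: (op_id, cost, depends_on_ids) tuples in sequential order.
    Simulates ideal parallel execution: infinite processors, no overhead."""
    finish = {}
    for op, cost, deps in trace:
        start = max((finish[d] for d in deps), default=0)
        finish[op] = start + cost            # earliest finish given inputs
    seq_time = sum(cost for _, cost, _ in trace)
    par_time = max(finish.values())
    return seq_time, par_time, seq_time / par_time

trace = [(0, 4, []), (1, 4, []), (2, 2, [0, 1]), (3, 6, [])]
print(analyze_trace(trace))  # seq 16, par 6, speedup ~2.67: ops 0, 1, 3 overlap
```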

17.
A scientific workflow usually consists of a mix of fine and coarse computational granularity tasks with varied runtime requirements. It has been observed that fine grained tasks incur more scheduling overhead than their execution time when executed on widely distributed platforms. In such situations, task clustering is extensively used as a runtime optimization method: multiple short duration tasks are combined into a cluster to be scheduled on a single resource, which minimizes the scheduling overhead of the fine grained tasks. However, grouping tasks curtails the degree of parallelism and hence needs to be done optimally. Though a number of task clustering techniques have been developed to reduce the impact of system overheads, they fail to identify the appropriate number of clusters at each level of the workflow needed to achieve maximum possible parallelism. This work proposes a level-based autonomic Workflow-and-Platform Aware (WPA) task clustering technique which takes into consideration both the workflow structure and the underlying resource set size. It aims to achieve the maximum possible parallelism among the tasks at each level of a workflow while minimizing system overheads and resource wastage. A comparative study with current state of the art task clustering approaches on four well-known scientific workflows shows that the proposed method significantly reduces the overall workflow execution time while consolidating the load onto the minimum possible number of resources.
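A minimal sketch of level-wise clustering under stated assumptions (not the paper's WPA algorithm): at each workflow level, independent tasks are grouped into at most as many clusters as there are resources, so no resource idles while per-task scheduling overhead collapses.

```python
def cluster_by_level(levels, num_resources):
    """levels: list of task lists, one per workflow level (tasks at the
    same level are independent).  Form at most num_resources clusters per
    level so every resource stays busy while per-task overhead shrinks."""
    clustered = []
    for tasks in levels:
        k = min(len(tasks), num_resources)                 # clusters at this level
        clustered.append([tasks[i::k] for i in range(k)])  # round-robin grouping
    return clustered

levels = [["t1"], ["t2", "t3", "t4", "t5", "t6"], ["t7", "t8"]]
for level in cluster_by_level(levels, num_resources=2):
    print(level)
# the five fine grained tasks of level 2 become 2 clusters instead of 5 jobs
```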

18.
When the number of iterations in the computation-partitioned loop level is large, or a single iteration of the loop body involves a large amount of work while only a small number of parallel threads is available, traditional pipeline-granularity optimization methods based on loop tiling cannot cope. To address this, a loop-tiling-based method for reducing the pipeline granularity is proposed; the optimal pipeline granularity is solved for using a cost model of pipelined parallel loops, and an optimization algorithm for the pipeline computation granularity is designed and implemented. Tests on the wavefront loop of the finite difference relaxation (FDR) method and on typical loops from the finite-difference time-domain (FDTD) method show that, compared with traditional pipeline-granularity selection methods, the proposed algorithm obtains better loop tile sizes.

19.
Popular distributed graph processing frameworks, such as Pregel and GraphLab, are based on the vertex-centric computation model, where users write a customized Compute function for each vertex to process the data iteratively. Vertices are evenly partitioned among the compute nodes, and the granularity of parallelism of the graph algorithm is normally tuned by adjusting the number of compute nodes. The vertex-centric model splits the computation into phases. Inside one specific phase, the computation proceeds as an embarrassingly parallel process, because no communication between compute nodes occurs. By default, current graph engines handle only one iteration of the algorithm per phase. However, in this paper, we find that we can also tune the granularity of parallelism by aggregating the computation of multiple iterations into one phase, which has a significant impact on the performance of the graph algorithm. In the ideal case, if all computations are handled in one phase, the whole algorithm turns into an embarrassingly parallel algorithm and the benefit of parallelism is maximized. Based on this observation, we propose two approaches, a function-based approach and a parameter-based approach, to automatically transform a Pregel algorithm into a new one with tunable granularity of parallelism. We study the cost of such transformations and the trade-off between the granularity of parallelism and performance. We provide a new direction for tuning the performance of parallel algorithms. Finally, the approaches are implemented in our graph processing system, N2, and we illustrate their performance using popular graph algorithms.
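A minimal single-machine sketch of the vertex-centric model the paper starts from (a hypothetical API, not N2's): each superstep applies a user Compute function to every vertex in isolation, and messages are exchanged only at the phase boundary, which is exactly the synchronization the proposed transformations amortize over several iterations.

```python
def run_vertex_centric(vertices, edges, compute, init, max_supersteps=30):
    """vertices: vertex ids; edges: {v: [neighbors]};
    compute(value, messages) -> (new_value, message_to_neighbors or None)."""
    value = {v: init(v) for v in vertices}
    inbox = {v: [] for v in vertices}
    for step in range(max_supersteps):           # one iteration per phase
        outbox = {v: [] for v in vertices}
        changed = False
        for v in vertices:                       # embarrassingly parallel part
            new_value, msg = compute(value[v], inbox[v])
            changed |= new_value != value[v]
            value[v] = new_value
            if msg is not None:
                for u in edges.get(v, []):
                    outbox[u].append(msg)
        inbox = outbox                           # communication at the barrier
        if step > 0 and not changed:             # converged: no value moved
            break
    return value

# Connected components by label propagation: every vertex adopts the
# smallest vertex id it has heard of.
edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
compute = lambda val, msgs: (min([val] + msgs),) * 2  # new value == message
print(run_vertex_centric(edges, edges, compute, init=lambda v: v))
```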

20.
On the granularity and clustering of directed acyclic task graphs
The authors consider the impact of granularity on scheduling task graphs. Scheduling consists of two parts: the assignment of tasks to processors, also called clustering, and the ordering of tasks for execution in each processor. The authors introduce two types of clusterings: nonlinear and linear. A clustering is nonlinear if two parallel tasks are mapped in the same cluster; otherwise it is linear. Linear clustering fully exploits the natural parallelism of a given directed acyclic task graph (DAG), while nonlinear clustering sequentializes independent tasks to reduce parallelism. The authors also introduce a new quantification of the granularity of a DAG and define a coarse grain DAG as one whose granularity is greater than one. It is proved that every nonlinear clustering of a coarse grain DAG can be transformed into a linear clustering whose parallel time is less than or equal to that of the nonlinear one. This result is used to prove the optimality of some important linear clusterings used in parallel numerical computing.
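The linear/nonlinear distinction is easy to state operationally; the sketch below (with an illustrative reachability test, not the paper's formal machinery) classifies a clustering as nonlinear as soon as any cluster contains two parallel, i.e. mutually unreachable, tasks.

```python
from itertools import combinations

def reachable(adj, src):
    """All tasks reachable from src in the DAG (its transitive successors)."""
    seen, stack = set(), [src]
    while stack:
        for u in adj.get(stack.pop(), []):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def is_linear(adj, clusters):
    """Linear iff no cluster holds two parallel (mutually unreachable) tasks."""
    reach = {v: reachable(adj, v) for cluster in clusters for v in cluster}
    for cluster in clusters:
        for a, b in combinations(cluster, 2):
            if b not in reach[a] and a not in reach[b]:
                return False          # independent tasks sequentialized
    return True

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(is_linear(adj, [["a", "b"], ["c"], ["d"]]))  # True: b depends on a
print(is_linear(adj, [["b", "c"], ["a"], ["d"]]))  # False: b, c are parallel
```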
