期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

杨博王鼎兴郑纬民《软件学报》2001,12(5):698-705

交互式并行化系统通过提供友好的交互功能并引入用户知识来协助程序的并行化,是解决自动并行化能力不足的一条有效途径.描述了一个并行化系统交互环境TIPSIE(interactive en vironment of Tsinghua interactive parallelizing system),并就构造该环境的性能预测、增量编译和数据相关查询等关键技术进行了讨论.实验结果表明,这些技术能够有效地提高系统的并行化能力和效率. 相似文献

2.

Automatic Parallelization of Recursive Procedures

Manish Gupta Sayak Mukhopadhyay Navin Sinha 《International journal of parallel programming》2000,28(6):537-562

Parallelizing compilers have traditionally focussed mainly on parallelizing loops. This paper presents a new framework for automatically parallelizing recursive procedures that typically appear in divide-and-conquer algorithms. We present compile-time analysis, using powerful, symbolic array section analysis, to detect the independence of multiple recursive calls in a procedure. This allows exploitation of a scalable form of nested parallelism, where each parallel task can further spawn off parallel work in subsequent recursive calls. We describe a runtime system which efficiently supports this kind of nested parallelism without unnecessarily blocking tasks. We have implemented this framework in a parallelizing compiler, which is able to automatically parallelize programs like quicksort and mergesort, written in C. For cases where even the advanced compile-time analysis we describe is not able to prove the independence of procedure calls, we propose novel techniques for speculative runtime parallelization, which are more efficient and powerful in this context than analogous techniques proposed previously for speculatively parallelizing loops. Our experimental results on an IBM G30 SMP machine show good speedups obtained by following our approach. 相似文献

3.

JAVA并行化编译器JAPS-II

于勐陈贵海阳雪林谢立过敏意《软件学报》2002,13(4):739-747

JAPS-II(Java automatic parallelizing system version 2)是一个Java源代码重构编译器,用来发现和实现串行Java程序中对象内和对象间的并行性.其目标体系结构为基于工作站网络环境的分布式存储器计算机系统.介绍了JAPS-II的体系结构和实现JAPS-II的关键技术,包括用于对象并行性分析的数据流分析技术、提高对象并行性和减少运行开销的优化技术以及类重构和代码生成技术.测试结果表明,JAPS-II能够有效地发现循环中和对象内、对象间的并行性,获得加速比.这相似文献

4.

Using knowledge-based systems for research on parallelizing compilers

Chao-Tung Yang Shian-Shyong Tseng Yun-Woei Fann Ting-Ku Tsai Ming-Huei Hsieh Cheng-Tien Wu 《Concurrency and Computation》2001,13(3):181-208

相似文献

5.

一个交互式的Fortran77并行化系统 总被引：5，自引：1，他引：5

陈文光杨博王紫瑶郑丰宙郑纬民《软件学报》1999,10(12):1259-1267

并行化编译器可以把现有的串行程序自动或半自动地转换为并行程序.现有并行化系统的自动并行化效果与手工并行化的效果相比还有一定的差距,这是由于并行化工具的分析能力不足以及程序中所固有的语义信息无法被并行化工具所理解而造成的.TIPS(Tsinghua interactive parallelizing system)系统通过提供一些友好的交互式工具,使用户与编译器紧密协作,是提高并行化系统的能力和效率的一条有效途径. 相似文献

6.

基于嵌套循环分类的并行识别技术

赵捷赵荣彩丁锐黄品丰《软件学报》2012,23(10):2695-2704

传统的分布存储并行编译系统大多是在共享存储并行编译系统的基础上开发的.共享存储并行编译系统的并行识别技术适合OpenMP代码生成,实现方式是将所有嵌套循环都按照相同的识别方法进行处理,用于分布存储并行编译系统必然会导致无法高效发掘程序的并行性.分布存储并行编译系统应根据嵌套循环结构的特点进行分类处理,提出适合MPI代码生成的并行识别技术.为解决上述问题,根据嵌套循环的结构和MPI并行程序的特点,提出了一种新的嵌套循环分类方法,并针对不同的嵌套循环分别提出了相应的并行识别技术.实验结果表明,与采用传统并行识别技术的分布存储并行编译系统相比,按照所提方法对嵌套循环进行分类,采用相应并行识别技术的编译系统能够更高效地识别基准程序中的并行循环,自动生成的MPI并行代码其性能加速比提高了20%以上. 相似文献

7.

Compile-Time Estimation of Communication Costs for Data Parallel Programs

《Journal of Parallel and Distributed Computing》1996,39(1):46-65

Most of the current compiler projects for distributed memory architectures leave the critical and time-consuming problem of finding performance-efficient data distributions and profitable program transformations for a given parallel program almost entirely to the programmer. Performance estimators provide critical performance information to both programmers and parallelizing compilers, the most crucial part of which involves determining the communication overhead induced by a program. In this paper, we present a very practical approach to the problem of compile-time estimation of communication costs for regular codes that includes analytical methods to model the number of messages exchanged, data volume transferred, transfer time, and network contention. In order to achieve high estimation accuracy, our estimator aggressively exploits compiler analysis and optimization information. It is assumed that machine parameters and problem size are known at compile time. We conducted a variety of experiments to validate the estimation accuracy and the ability to support both the programmer and compiler in the effort of performance tuning of parallel programs. We believe that our approach can be automatically applied to a large class of regular codes. 相似文献

8.

面向对象语言并行化中的调用局部化优化

于勐臧婉瑜谢立孙钟秀过敏意《计算机学报》2002,25(4):409-416

该文提出了一种将调用局部化技术应用于并行环境下面向对象语言的方法，文中详细讨论了该技术的适用条件以及如何通过该方法减少循环中的远程过程调用开销，该优化技术产首先将循环分离成多个包含有远程调用的循环，再将分离后的循环分离给循环中对象所在的处理器，最后，化简迭代空间，并且用消息传递来传输数据，这种优化对象分布和循环并行化之后进行，将函数调用局部化于处理器，通过这种优化，可以进一步挖掘循环中的任务并行性，降低计算复杂度，减少函数调用开销，尤其适合面向对象语言中对循环里小函数的优化，该技术已经在作者设计的Java自动并行化编译器JAPS－Ⅱ中实现，在实验中，利用这种优化技术得到了超线性性加速比。相似文献

9.

An Object-Oriented Framework for Loop Parallelization

Omori Youichi Fukuda Akira Joe Kazuki 《The Journal of supercomputing》1999,13(1):57-69

Generation of efficient parallel code is a major goal of a well-designed and developed parallelizing compiler. Another important goal is portability of both compiler system and the resulting output source codes. The various choices of current and future parallel computer architectures as well as the cost of developing a parallelizing compiler make portability a very important design goal. Since the design of parallelizing compilers is considerably move complex than designing conventional compilers, it is very important to achieve both efficiency and portability. To meet this dual goal, we have investigated the application of object oriented design to parallelizing compilers. Our parallelizing compiler design is based on abstractions of intermediate representations of loops and their class definitions. In this paper, we address the problem of loop parallelization and propose a framework where the loop parallelization process is divided into three phases and the optimization of loops is performed via a cyclic application of these three phases. The class of each phase is hierarchically derived from intermediate representations of loops. This facilitates the portability of the resulting parallelizing compilers. Furthermore, one of the phases uses a reservation table of hardware resources in order to obtain optimized parallel programs for given hardware resources. The validation of the proposed framework is given through the application of the object oriented design on an example program which is then parallelized efficiently. 相似文献

10.

The Cetus Source-to-Source Compiler Infrastructure: Overview and Evaluation

Hansang Bae Dheya Mustafa Jae-Woo Lee Aurangzeb Hao Lin Chirag Dave Rudolf Eigenmann Samuel P. Midkiff 《International journal of parallel programming》2013,41(6):753-767

This paper provides an overview and an evaluation of the Cetus source-to-source compiler infrastructure. The original goal of the Cetus project was to create an easy-to-use compiler for research in automatic parallelization of C programs. In meantime, Cetus has been used for many additional program transformation tasks. It serves as a compiler infrastructure for many projects in the US and internationally. Recently, Cetus has been supported by the National Science Foundation to build a community resource. The compiler has gone through several iterations of benchmark studies and implementations of those techniques that could improve the parallel performance of these programs. These efforts have resulted in a system that favorably compares with state-of-the-art parallelizers, such as Intel’s ICC. A key limitation of advanced optimizing compilers is their lack of runtime information, such as the program input data. We will discuss and evaluate several techniques that support dynamic optimization decisions. Finally, as there is an extensive body of proposed compiler analyses and transformations for parallelization, the question of the importance of the techniques arises. This paper evaluates the impact of the individual Cetus techniques on overall program performance. 相似文献

11.

Computing programs containing band linear recurrences on vectorsupercomputers

Haigeng Wang Nicolau A. Keung S. Kai-Yeung Siu 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(8):769-782

Many large-scale scientific and engineering computations, e.g., some of the Grand Challenge problems, spend a major portion of execution time in their core loops computing band linear recurrences (BLRs). Conventional compiler parallelization techniques cannot generate scalable parallel code for this type of computation because they respect loop-carried dependences (LCDs) in programs, and there is a limited amount of parallelism in a BLR with respect to LCDs. For many applications, using library routines to replace the core BLR requires the separation of BLR from its dependent computation, which usually incurs significant overhead. In this paper, we present a new scalable algorithm called the Regular Schedule, for parallel evaluation of BLRs. We describe our implementation of the Regular Schedule and discuss how to obtain maximum memory throughput in implementing the schedule on vector supercomputers. We also illustrate our approach, based on our Regular Schedule, to parallelizing programs containing BLR and other kinds of code. Significant improvements in CPU performance for a range of programs containing BLR implemented using the Regular Schedule in C over the same programs implemented using highly optimized coded-in-assembly BLAS routines [11] are demonstrated on Convex C240. Our approach can be used both at the user level in parallel programming code containing BLRs, and in compiler parallelization of such programs combined with recurrence recognition techniques for vector supercomputers 相似文献

12.

K-DT: a formal system for the evaluation of linear data dependence testing techniques

Jie Zhao Rongcai Zhao 《The Journal of supercomputing》2018,74(4):1655-1675

The power of data dependence testing techniques of a parallelizing compiler is its essence to transform and optimize programs. Numerous techniques were proposed in the past, and it is, however, still a challenging problem to evaluate the relative power of these techniques to better understand the data dependence testing problem. In the past, either empirical studies or experimental evaluation results are published to compare these data dependence testing techniques, being not able to convince the research community completely. In this paper, we show a theoretical study on this issue, comparing the power on disproving dependences of existing techniques by proving theorems in a proposed formal system K-DT. Besides, we also present the upper bounds of these techniques and introduce their minimum complete sets. To the best of our knowledge, K-DT is the first formal system used to compare the power of data dependence testing techniques, and this paper is the first work to show the upper bounds and minimum complete sets of data dependence testing techniques. 相似文献

13.

Self adaptive run time scheduling for the automatic parallelization of loops with the C2μTC/SL compiler

《Parallel Computing》2013,39(10):603-614

In this paper we suggest a new approach for solving the hyperplane problem, also known as “wavefront” computation. In direct contrast to most approaches that reduce the problem to an integer programming one or use several heuristic approaches, we gather information at compile time and delegate the solution to run time. We present an adaptive technique which intuitively calculates which new threads will be able to be executed in the next computation cycle based on which threads are executed in the current one. Moving the solution to the run time environment provides us with higher versatility alongside a perfect solution of the underlying hyperplane pattern being discovered without the need to perform any prior calculations. The main contribution of this paper is the presentation of the self adaptive algorithm, an algorithm which does not need to know the tile size (which controls the granularity of parallelism) beforehand. Instead, the algorithm itself adapts the tile size while the program is running in order to achieve optimal efficiency. Experimental results show that if we have a sufficient number of parallel processing elements to diffuse the scheduler’s workload, its overhead becomes low enough that it is overshadowed by the net gain in parallelism. For the implementation of the algorithm we suggest, and for our experimentations our parallelizing compiler C2μTC/SL is used, a C parallelizing compiler which maps sequential programs on the SVP processor and model. 相似文献

14.

The Design of the PROMIS Compiler—Towards Multi-Level Parallelization

Hideki Saito Nicholas J. Stavrakos Constantine D. Polychronopoulos Alex Nicolau 《International journal of parallel programming》2000,28(2):195-212

相似文献

15.

Automatic mapping of C to FPGAs with the DEFACTO compilation and synthesis system

《Microprocessors and Microsystems》2005,29(2-3):51-62

The DEFACTO compilation and synthesis system is capable of automatically mapping computations expressed in high-level imperative programming languages as C to FPGA-based systems. DEFACTO combines parallelizing compiler technology with behavioral VHDI, synthesis tools to guide the application of high-level compiler transformations in the search of high-quality hardware designs. In this article we illustrate the effectiveness of this approach in automatically mapping several kernel codes to an FPGA quickly and correctly. We also present a detailed example of the comparison of the performance of an automatically generated design against a manually generated implementation of the same computation. The design-space-exploration component of DEFACTO is able to explore a large number of designs for a particular computation that would otherwise be impractical for any designers. 相似文献

16.

Automatic Parallelization of Array-oriented Programs for a Multi-core Machine

Wai-Mee Ching Da Zheng 《International journal of parallel programming》2012,40(5):514-531

We present the work on automatic parallelization of array-oriented programs for multi-core machines. Source programs written in standard APL are translated by a parallelizing APL-to-C compiler into parallelized C code, i.e. C mixed with OpenMP directives. We describe techniques such as virtual operations and data-partitioning used to effectively exploit parallelism structured around array-primitives. We present runtime performance data, showing the speedup of the resulting parallelized code, using different numbers of threads and different problem sizes, on a 4-core machine, for several examples. 相似文献

17.

自动并行编译的新进展 总被引：2，自引：0，他引：2

朱根江谢立孙钟秀《软件学报》1993,4(4):1-7

自动并行编译是并行程序的主要途径之一,本文概述了发展自动并行化编译的必要性及其主要进展,讨论了当前采用的主要技术和今后的发展动向。相似文献

18.

Task scheduling in multiprocessing systems

El-Rewini H. Ali H.H. Lewis T. 《Computer》1995,28(12):27-37

The complex problem of assigning tasks to processing elements in order to optimize a performance measure has resulted in numerous heuristics aimed at approximating an optimal solution. This article addresses the task scheduling problem in many of its variations and surveys the major solutions. The scheduling techniques we discuss might be used by a compiler writer to optimize the code that comes out of a parallelizing compiler. The compiler would produce grains of sequential code, and the optimizer would schedule these grains such that the program runs in the shortest time 相似文献

19.

Towards Acceptability of Optimizations: An Extended View of Compiler Correctness

Wolfgang Goerigk 《Electronic Notes in Theoretical Computer Science》2002,65(2)

The theory of relative program correctness and its preservation allows for elaborate and practically adequate definitions of correct implementation notions as they are established by transformations implemented in a compiler. It generalizes Hoare's and Floyd's partial and total program correctness and correctness preservation by classifying finite and infinite errors to be either acceptable (unavoidable) or unacceptable (chaotic, to be avoided). We will define correct implementation by particular compositional diagram commutativities, and we will further extend this theory also to express correctness of compiling specifications and of compiler programs and their implementations in the same uniform relational setting. Unacceptable error outcomes can semantically model pre-conditions such as well-formedness conditions for compilers or optimization pre-conditions for user programs. Our theory allows to distinguish between different correct implementation requirements, for instance (horizontally) for user programs or (vertically) for the compiler implementation, just as if we would switch on and off compiler options and tune one compiler to appropriately preserve correctness in different application domains. 相似文献

20.

Automatic Parallelization: Executing Sequential Programs on a Task-Based Parallel Runtime

Alcides?Fonseca Email author Bruno?Cabral Jo?o?Rafael Ivo?Correia 《International journal of parallel programming》2016,44(6):1337-1358

There are billions of lines of sequential code inside nowadays’ software which do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote an efficient use of the available parallelism, has been a research goal for some time now. This work proposes a new approach for achieving such goal. We created a new parallelizing compiler that analyses the read and write instructions, and control-flow modifications in programs to identify a set of dependencies between the instructions in the program. Afterwards, the compiler, based on the generated dependencies graph, rewrites and organizes the program in a task-oriented structure. Parallel tasks are composed by instructions that cannot be executed in parallel. A work-stealing-based parallel runtime is responsible for scheduling and managing the granularity of the generated tasks. Furthermore, a compile-time granularity control mechanism also avoids creating unnecessary data-structures. This work focuses on the Java language, but the techniques are general enough to be applied to other programming languages. We have evaluated our approach on 8 benchmark programs against OoOJava, achieving higher speedups. In some cases, values were close to those of a manual parallelization. The resulting parallel code also has the advantage of being readable and easily configured to improve further its performance manually. 相似文献