期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Implementing a dynamic processor allocation policy for multiprogrammed parallel applications in the SolarisTM

Kelvin K. Yue David J. Lilja 《Concurrency and Computation》2001,13(6):449-464

Parallel applications typically do not perform well in a multiprogrammed environment that uses time‐sharing to allocate processor resources to the applications' parallel threads. Co‐scheduling related parallel threads, or statically partitioning the system, often can reduce the applications' execution times, but at the expense of reducing the overall system utilization. To address this problem, there has been increasing interest in dynamically allocating processors to applications based on their resource demands and the dynamically varying system load. The Loop‐Level Process Control (LLPC) policy (Yue K, Lilja D. Efficient execution of parallel applications in multiprogrammed multiprocessor systems. 10th International Parallel Processing Symposium, 1996; 448–456) dynamically adjusts the number of threads an application is allowed to execute based on the application's available parallelism and the overall system load. This study demonstrates the feasibility of incorporating the LLPC strategy into an existing commercial operating system and parallelizing compiler and provides further evidence of the performance improvement that is possible using this dynamic allocation strategy. In this implementation, applications are automatically parallelized and enhanced with the appropriate LLPC hooks so that each application interacts with the modified version of the Solaris operating system. The parallelism of the applications are then dynamically adjusted automatically when they are executed in a multiprogrammed environment so that all applications obtain a fair share of the total processing resources. Copyright © 2001 John Wiley & Sons, Ltd. 相似文献

2.

Dynamically adapting to system load and program behavior in multiprogrammed multiprocessor systems

Iffat H. Kazi David J. Lilja 《Concurrency and Computation》2002,14(12):957-985

Parallel execution of application programs on a multiprocessor system may lead to performance degradation if the workload of a parallel region is not large enough to amortize the overheads associated with the parallel execution. Furthermore, if too many processes are running on the system in a multiprogrammed environment, the performance of the parallel application may degrade due to resource contention. This work proposes a comprehensive dynamic processor allocation scheme that takes both program behavior and system load into consideration when dynamically allocating processors. This mechanism was implemented on the Solaris operating system to dynamically control the execution of parallel C and Java application programs. Performance results show the effectiveness of this scheme in dynamically adapting to the current execution environment and program behavior, and that it outperforms a conventional time‐shared system. Copyright © 2002 John Wiley & Sons, Ltd. 相似文献

3.

Machine independent AND and OR parallel execution of logicprograms. II. Compiled execution

Ramkumar B. Kale L.V. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(2):181-192

For pt.I. see ibid., p. 170-80. In pt.I, we presented a binding environment for the AND and OR parallel execution of logic programs. This environment was instrumental in rendering a compiler for the AND and OR parallel execution of logic programs machine independent. In this paper, we describe a compiler based on the Reduce-OR process model (ROPM) for the parallel execution of Prolog programs, and provide performance of the compiler on five parallel machines: the Encore Multimax, the Sequent Symmetry, the NCUBE 2, the Intel i860 hypercube and a network of Sun workstations. The compiler is part of a machine independent parallel Prolog development system built on top of a run time environment for parallel programming called the Chare kernel, and runs unchanged on these multiprocessors. In keeping with the objectives behind the ROPM, the compiler supports both on and independent AND parallelism in Prolog programs and is suitable for execution on both shared and nonshared memory machines. We discuss the performance of the Prolog compiler in some detail and describe how grain size can be used to deliver performance that is within 10% of the underlying sequential Prolog compiler on one processor, and scale linearly with increasing number of processors on problems exhibiting sufficient parallelism. The loose coupling between parallel and sequential components makes it possible to use the best available sequential compiler as the sequential component of our compiler 相似文献

4.

A performance debugging tool for high performance Fortran programs

Takashi Suzuoka Jaspal Subhlok Thomas Gross 《Concurrency and Computation》1997,9(10):927-945

Parallel languages allow the programmer to express parallelism at a high level. The management of parallelism and the generation of interprocessor communication is left to the compiler and the runtime system. This approach to parallel programming is particularly attractive if a suitable widely accepted parallel language is available. High Performance Fortran (HPF) has emerged as the first popular machine independent parallel language, and remarkable progress has been made towards compiling HPF efficiently. However, the performance of HPF programs is often poor and unpredictable, and obtaining adequate performance is a major stumbling block that must be overcome if HPF is to gain widespread acceptance. The programmer is often in the dark about how to improve the performance of an HPF program since poor performance can be attributed to a variety of reasons, including poor choice of algorithm, limited use of parallelism, or an inefficient data mapping. This paper presents a profiling tool that allows the programmer to identify the regions of the program that execute inefficiently, and to focus on the potential causes of poor performance. The central idea is to distinguish the code that is executing efficiently from the code that is executing poorly. Efficient code uses all processors of a parallel system to make progress, while inefficient code causes processors to wait, execute replicated code, idle, communicate, or perform compiler bookkeeping. We designate the latter code as non-scalable, since adding more processors generally does not lead to improved performance for such code. By analogy, the former code is called scalable. The tool presented here separates a program into scalable and non-scalable components and identifies the causes of non-scalability of different components. We show that compiler information is the key to dividing the execution times into logical categories that are meaningful to the programmer. We present the design and implementation of a profiler that is integrated with Fx, a compiler for a variant of HPF. The paper includes two examples that demonstrate how the data reported by the profiler are used to identify and resolve performance bugs in parallel programs. © 1997 John Wiley & Sons, Ltd. 相似文献

5.

A framework for exploiting task and data parallelism on distributedmemory multicomputers

Ramaswamy S. Sapatnekar S. Banerjee P. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(11):1098-1116

Distributed Memory Multicomputers (DMMs), such as the IBM SP-2, the Intel Paragon, and the Thinking Machines CM-5, offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications-the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster 相似文献

6.

A compiler for exploiting nested parallelism in OpenMP programs

Xinmin Tian Jay P. Hoeflinger Grant Haab Yen-Kuang Chen Milind Girkar Sanjiv Shah 《Parallel Computing》2005,31(10-12):960

This paper presents the design and implementation of a parallelization framework and OpenMP runtime support in Intel^® C++ & Fortran compilers for exploiting nested parallelism in applications using OpenMP pragmas or directives. We conduct the performance evaluation of two multimedia applications parallelized with OpenMP pragmas and compiled with the Intel C++ compiler on Hyper-Threading Technology (HT) enabled multiprocessor systems. The performance results show that the multithreaded code generated by the Intel compiler achieved a speedup up to 4.69 on 4 processors with HT enabled for five different input video sequences for the H.264 encoder workload, and a 1.28 speedup on an HT enabled single-CPU system and 1.99 speedup on an HT-enabled dual-CPU system for the audio–visual speech recognition workload. The performance gain due to exploiting nested parallelism for leveraging Hyper-Threading Technology is up to 70% for two multimedia workloads under different multiprocessor system configurations. These results demonstrate that hyper-threading benefits can be achieved by exploiting nested parallelism through Intel compiler and runtime system support for OpenMP programs. 相似文献

7.

并行化编译器中基于工作量的条件并行化研究

侯永生赵荣彩张平韩枫《微计算机信息》2005,21(4):220-221

并行化编译器通过发掘串行程序中的并行性来提高程序的运行性能。但当可并行的工作量与并行的线程数目之比较小时，有可能采用并行执行反而会降低程序的整体性能。本文工作基于SUIF结构．研究精确的工作量计算方法，并实现了基于工作量的条件并行化技术．有效地提高了并行程序的执行性能。相似文献

8.

EDGE结构上一种通过超块重组加速单线程应用的方法

魏学超安虹毛梦捷《小型微型计算机系统》2012,(10):2249-2254

Explicit Data Graph Execution(EDGE)ISA是一种专门为类数据流驱动的分片式众核处理器而设计的指令集体系结构.相较于传统的采用控制流驱动的处理器,EDGE结构以超块(Hyperblock)而不是单个指令作为其执行单位,在超块内部实现数据流执行,超块之间按照推测序保持控制流执行,有利于挖掘指令级并行性.但是,EDGE编译器按照程序的串行执行顺序组织超块,超块间和超块内部受限于数据依赖,削弱了整个程序运行时的潜在数据级并行性和线程级并行性,不利于发挥EDGE分片式结构的优势.本文通过分析EDGE编译器超块组织的特点,结合EDGE结构特有的执行模型,提出一种普适性的超块组织框架来模拟EDGE结构上多线程运行的效果,进一步挖掘EDGE结构运行串行单线程程序时的指令级并行性.本文选用TRIPS微处理器作为EDGE结构的实例处理器,利用矩阵乘法等三个实验验证了我们所提出的框架的可行性,实验结果表明这些应用在TRIPS上获得了较好的性能提升. 相似文献

9.

Multiprocessor execution of functional programs 总被引：1，自引：0，他引：1

Benjamin Goldberg 《International journal of parallel programming》1988,17(5):425-473

Functional languages have recently gained attention as vehicles for programming in a concise and elegant manner. In addition, it has been suggested that functional programming provides a natural methodology for programming multiprocessor computers. This paper describes research that was performed to demonstrate that multiprocessor execution of functional programs on current multiprocessors is feasible, and results in a significant reduction in their execution times.Two implementations of the functional language ALFL were built on commercially available multiprocessors.Alfalfa is an implementation on the Intel iPSC hypercube multiprocessor, andBuckwheat is an implementation on the Encore Multimax shared-memory multiprocessor. Each implementation includes a compiler that performs automatic decomposition of ALFL programs and a run-time system that supports their execution. The compiler is responsible for detecting the inherent parallelism in a program, and decomposing the program into a collection of tasks, calledserial combinators, that can be executed in parallel.The abstract machine model supported by Alfalfa and Buckwheat is calledheterogeneous graph reduction, which is a hybrid of graph reduction and conventional stack-oriented execution. This model supports parallelism, lazy evaluation, and highe order functions while at the same time making efficient use of the processors in the system. The Alfalfa and Buckwheat runtime systems support dynamic load balancing, interprocessor communication (if required), and storage management. A large number of experiments were performed on Alfalfa and Buckwheat for a variety of programs. The results of these experiments, as well as the conclusions drawn from them, are presented.This research was supported in part by National Science Foundation grants DCR-8302018 and DCR-8521451, by a DARPA subcontract with SDC/Unisys, and by gifts from Burroughs Austin Research Center and the Intel Corporation. 相似文献

10.

面向高性能计算的流VLIW编译技术研究

下载免费PDF全文

管茂林伍楠文梅吴伟柴俊张春元《计算机工程与科学》2008,30(7):100-104

本文基于斯坦福大学设计的KernelC编译器ISCD,针对64位流处理器体系结构,设计实现了其核心VLIW编译器,并针对高性能计算应用需求进行优化,实现了分布式寄存器负载均衡和指令自动合并技术。实验结果表明,该编译器能够很好地开发程序中的并行性,具有较高的效率。相似文献

11.

An effective processor allocation strategy for multiprogrammedshared-memory multiprocessors

Yue K.K. Lilja D.J. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(12):1246-1258

Existing techniques for sharing the processing resources in multiprogrammed shared-memory multiprocessors, such as time-sharing, space-sharing, and gang-scheduling, typically sacrifice the performance of individual parallel applications to improve overall system utilization. We present a new processor allocation technique called Loop-Level Process Control (LLPC) that dynamically adjusts the number of processors an application is allowed to use for the execution of each parallel section of code, based on the current system load. This approach exploits the maximum parallelism possible for each application without overloading the system. We implement our scheme on a Silicon Graphics Challenge multiprocessor system and evaluate its performance using applications from the Perfect Club benchmark suite and synthetic benchmarks. Our approach shows significant improvements over traditional time-sharing and gang-scheduling. It has performance comparable to, or slightly better than, static space-sharing, but our strategy is more robust since, unlike static space-sharing, it does not require a priori knowledge of the applications' parallelism characteristics 相似文献

12.

A study of potential parallelism among traces in Java programs

Borys J. Bradel Tarek S. Abdelrahman 《Science of Computer Programming》2009,74(5-6):296-313

The exploitation of parallelism among traces, i.e. hot paths of execution in programs, is a novel approach to the automatic parallelization of Java programs and it has many advantages. However, to date, the extent to which parallelism exists among traces in programs has not been made clear. The goal of this study is to measure the amount of trace-level parallelism in several Java programs. We extend the Jupiter Java Virtual Machine with a simulator that models an abstract parallel system. We use this simulator to measure trace-level parallelism. We further use it to examine the effects of the number of processors, trace window size, and communication type and cost on performance. We also analyze the dependence characteristics of the benchmarks and see how they relate to parallelism. The results indicate that enough trace-level parallelism exists for a modest number of processors. Thus, we conclude that trace-based parallelization is a potentially viable approach to improve the performance of Java programs. 相似文献

13.

SMA:前瞻性多线程体系结构 总被引：4，自引：1，他引：3

肖刚周兴铭徐明邓鹍《计算机学报》1999,22(6):582-590

提出了一种新的ＩＬＰ处理器体系结构－前瞻性多线程体系的结构,简称ＳＭＡ．它结合了前瞻性执行机制和多线程执行机制,以整个线程为长步进行前瞻性执行,多个线程并行执行并且共享处理器硬件资源,这样,处理器既通过组合每个线程的指令窗口形成一个大的动态指令窗口,开发出程序中更大的ＩＬＰ,又利用多线程执行机制屏蔽各种长延迟操作,达到较高的资源利用率;介绍了ＳＭＡ执行模型,并讨论了ＳＭＡ处理器的实现和其中的关键技相似文献

14.

The superblock: An effective technique for VLIW and superscalar compilation 总被引：8，自引：1，他引：7

Wen -Mei W. Hwu Scott A. Mahlke William Y. Chen Pohua P. Chang Nancy J. Warter Roger A. Bringmann Roland G. Ouellette Richard E. Hank Tokuzo Kiyohara Grant E. Haab John G. Holm Daniel M. Lavery 《The Journal of supercomputing》1993,7(1-2):229-248

A compiler for VLIW and superscalar processors must expose sufficient instruction-level parallelism (ILP) to effectively utilize the parallel hardware. However, ILP within basic blocks is extremely limited for control-intensive programs. We have developed a set of techniques for exploiting ILP across basic block boundaries. These techniques are based on a novel structure called thesuperblock. The superblock enables the optimizer and scheduler to extract more ILP along the important execution paths by systematically removing constraints due to the unimportant paths. Superblock optimization and scheduling have been implemented in the IMPACT-I compiler. This implementation gives us a unique opportunity to fully understand the issues involved in incorporating these techniques into a real compiler. Superblock optimizations and scheduling are shown to be useful while taking into account a variety of architectural features. 相似文献

15.

Profile-assisted instruction scheduling

William Y. Chen Scott A. Mahlke Nancy J. Warter Sadun Anik Wen-Mei W. Hwu 《International journal of parallel programming》1994,22(2):151-181

Instruction schedulers for superscalar and VLIW processors must expose sufficient instruction-level parallelism to the hardware in order to achieve high performance. Traditional compiler instruction scheduling techniques typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase instruction-level parallelism for the frequent execution scenarios at the expense of the less freuent ones. Profile information identifies these important execution scenarios in a program. In this paper, two major categories of profile information are studied: control-flow and memory-dependence. Profile-assisted code scheduling techniques have been incorporated into the IMPACT-I compiler. These techniques are acyclic global scheduling and software pipelining. This paper describes the scheduling algorithms, highlights the modifications required to use profile information, and explains the hardware and compiler support for dealing with hazards that arise from aggressive use of profile information. The effectiveness of these profile-based scheduling techniques is evaluated for a range of superscalar and VLIW processors. 相似文献

16.

Compiling C for the EARTH multithreaded architecture

Laurie J. Hendren Xinan Tang Yingchun Zhu Shereen Ghobrial Guang R. Gao Xun Xue Haiying Cai Pierre Ouellet 《International journal of parallel programming》1997,25(4):305-338

Multithreaded architectures provide an opportunity for efficiently executing programs with irregular parallelism and/or irregular locality. This paper presents a strategy that makes use of the multithreaded execution model without exposing multithreading to the programmer. Our approach is to design simple extensions to C, and to provide compiler support that automatically translates high-level C programs into lower-level threaded programs. In this paper we present EARTH-C our extended C language which contains simple constructs for specifying control parallelism, data locality, shared variables and atomic operations. Based on EARTH-C, we describe compiler techniques that are used for translating to lower-level Threaded-C programs for the EARTH multithreaded architecture. We demonstrate our approach with six benchmark programs. We show that even naive EARTH-C programs can lead to reasonable performance, and that more advanced EARTH-C programs can give performance very close to hand-coded threated-C programs. This work supported, in part, by NSERC and FCAR. 相似文献

17.

Exploiting parallelism on a fine-grained MIMD architecture based upon channel queues

Rajiv Gupta Sunah Lee 《International journal of parallel programming》1992,21(3):169-192

We present techniques for exploiting fine-grained parallelism extracted from sequential programs on a fine-grained MIMD system. The system exploits fine-grained parallelism through parallel execution of instructions on multiple processors as well as pipelined nature of individual processors. The processors can communicate data values via globally shared registers as well as dedicated channel queues. Compilation techniques are presented to utilize these mechanisms. A scheduling algorithm has been developed to distribute operations among the processors in a manner that reduces communication among the processors. The compiler identifies data dependencies which require synchronization and enforces them using channel queues. Delays that may result by attempting write operations to a full channel queue are avoided by spilling values from channels to local registers. If an interprocessor data dependency does not require synchronization, then the data value is passed through a shared register or shared memory.Partially supported by National Science Foundation Presidential Young Investigator Award CCR-9157371 (CCR-9249143) to the University of Pittsburgh. 相似文献

18.

A scheduling policy for preserving cache locality in a multiprogrammed system

《Journal of Systems Architecture》2000,46(13):1191-1204

In a multiprogrammed system, when the operating system switches contexts, in addition to the cost for handling the processes being swapped out and in, the cache performance of processors also can be affected. If frequent context switching replaces the data loaded into cache memory before they are completely reused, the programs suffer from cache misses due to the damage in cache locality. In particular, for the programs with good cache locality, such as blocked programs, a scheduling mechanism of keeping cache locality against context switching is essential to achieve good processor utilization. To solve this requirement, we propose a preemption-safe policy to exploit the cache locality of blocked programs in a multiprogrammed system. The proposed policy delays context switching until a block is fully reused, but also compensates for the monopolized processor time on processor scheduling mechanisms. Our simulation results show that in a situation where blocked programs are run on multiprogrammed shared-memory multiprocessors, the proposed policy improves the performance of these programs due to a decrease in cache misses. In such situations, it also has a beneficial impact on the overall system performance due to the enhanced processor utilization. 相似文献

19.

Performance-driven processor allocation

Corbalan J. Martorell X. Labarta J. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(7):599-611

In current multiprogrammed multiprocessor systems, to take into account the performance of parallel applications is critical to decide an efficient processor allocation. In this paper, we present the performance-driven processor allocation policy (PDPA). PDPA is a new scheduling policy that implements a processor allocation policy and a multiprogramming-level policy, in a coordinated way, based on the measured application performance. With regard to the processor allocation, PDPA is a dynamic policy that allocates to applications the maximum number of processors to reach a given target efficiency. With regard to the multiprogramming level, PDPA allows the execution of a new application when free processors are available and the allocation of all the running applications is stable, or if some applications show bad performance. Results demonstrate that PDPA automatically adjusts the processor allocation of parallel applications to reach the specified target efficiency, and that it adjusts the multiprogramming level to the workload characteristics. PDPA is able to adjust the processor allocation and the multiprogramming level without human intervention, which is a desirable property for self-configurable systems, resulting in a better individual application response time. 相似文献

20.

Trapezoid self-scheduling: a practical scheduling scheme forparallel compilers

Tzen T.H. Ni L.M. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(1):87-98

A practical processor self-scheduling scheme, trapezoid self-scheduling, is proposed for arbitrary parallel nested loops in shared-memory multiprocessors. Generally, loops are the richest source of parallelism in parallel programs. To dynamically allocate loop iterations to processors, one may achieve load balancing among processors at the expense of run-time scheduling overhead. By linearly decreasing the chunk size at run time, the best tradeoff between the scheduling overhead and balanced workload can be obtained in the proposed trapezoid self-scheduling approach. Due to its simplicity and flexibility, this approach can be efficiently implemented in any parallel compiler. The small and predictable number of chores also allow efficient management of memory in a static fashion. The experiments conducted in a 96-node Butterfly GP-1000 clearly show the advantage of the trapezoid self-scheduling over other well-known self-scheduling approaches 相似文献