Similar Documents
20 similar documents found.
1.
In modular programs, groups of routines constitute conceptual abstractions. A method for providing execution profiles for such programs is presented. The central idea is that the execution time for a routine is charged to the routines that call it. The implementation of this method by a profiler called gprof is described. The techniques used to gather the necessary information about the timing and structure of the program are given, as is the processing used to propagate routine execution times along arcs of the call graph of the program. The method for displaying the profile to the user is discussed. Experience using the profiles for hand-tuning large programs is summarized. Additional uses for the profiles are suggested.
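The propagation step described above is easy to sketch. The following is a minimal illustration of the idea, not gprof itself: the routine names, self times, and arc call counts are made-up profile data, and the call graph is assumed acyclic (gprof collapses cycles before propagating).

```python
from collections import defaultdict

# Hypothetical profile data (illustrative numbers): per-routine self time in
# seconds and call counts on each (caller, callee) arc of the call graph.
self_time = {"main": 1.0, "parse": 3.0, "lookup": 6.0}
arc_calls = {("main", "parse"): 1, ("main", "lookup"): 2, ("parse", "lookup"): 4}

def charge_to_callers(self_time, arc_calls):
    """Propagate each routine's time to its callers in proportion to call
    counts, assuming an acyclic call graph (gprof collapses cycles first)."""
    children = defaultdict(list)          # caller -> [(callee, calls)]
    calls_into = defaultdict(int)         # callee -> total incoming calls
    for (caller, callee), n in arc_calls.items():
        children[caller].append((callee, n))
        calls_into[callee] += n

    total = {}                            # routine -> self + inherited time

    def visit(r):
        if r not in total:
            t = self_time.get(r, 0.0)
            for callee, n in children[r]:
                # the caller inherits the callee's share, weighted by call count
                t += visit(callee) * n / calls_into[callee]
            total[r] = t
        return total[r]

    for r in self_time:
        visit(r)
    return total

print(charge_to_callers(self_time, arc_calls))   # main is charged the full 10.0
```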

2.
Scalar performance on processors with instruction level parallelism (ILP) is often limited by control and data dependences. This paper describes a family of compiler techniques, called Critical Path Reduction (CPR) techniques, which reduce the length of critical paths through control and data dependences. Control CPR reduces the number of branches on the critical path and improves the performance of branch intensive codes on processors with inadequate branch throughput or excessive branch latency. Data CPR reduces the number of arithmetic operations on the critical path. Optimization and scheduling are adapted to support CPR.
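As a toy illustration of the data-CPR idea, shortening the dependence chain rather than removing work, the sketch below measures the dependence height of a four-operand sum before and after re-association. The operation names and unit latencies are assumptions; this is not the paper's compiler algorithm.

```python
def critical_path(deps):
    """Length of the longest dependence chain in an acyclic op graph.
    deps maps an op to the list of ops it depends on (unit latency per op)."""
    memo = {}
    def depth(op):
        if op not in memo:
            memo[op] = 1 + max((depth(p) for p in deps[op]), default=0)
        return memo[op]
    return max(depth(op) for op in deps)

# s = a+b+c+d as a serial chain vs. re-associated into a balanced tree.
chain = {"t1": [], "t2": ["t1"], "t3": ["t2"]}     # ((a+b)+c)+d
tree  = {"t1": [], "t2": [], "t3": ["t1", "t2"]}   # (a+b)+(c+d)
print(critical_path(chain), critical_path(tree))   # 3 vs. 2 dependent adds
```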

3.
This paper presents a system for parallel execution of Prolog supporting both independent conjunctive and disjunctive parallelism. The system is intended for distributed memory architectures and is composed of a set of workers with a hierarchical-structure scheduler. The execution model has been designed in such a way that each worker's environment does not contain references to terms in other environments, thus reducing communication overhead. To guarantee that exploiting parallelism actually improves performance, a granularity control has been introduced for each kind of parallelism. For conjunctive parallelism, PDP applies a control based on the estimates provided by CASLOG; the features of the system allow this control to be introduced without adding overhead. For disjunctive parallelism, PDP controls granularity by applying a heuristic-based method, which can be adapted to other parallel Prolog systems. Different scheduling policies have also been tested. The system has been implemented on a transputer network, and performance results show that it provides high speedup for coarse-grain parallel programs.

4.
This paper describes an investigation into estimating the quality of functional programs. The work reported here is part of a larger, ongoing study into a quantitative analysis of the effect of utilizing different programming paradigms on code quality. Before undertaking such a comparative analysis it was necessary to establish a baseline of quality indicators which could then be used as metrics for the remainder of the project. Thus the aim of the research presented here was to evaluate a set of suggested indicators corresponding to internal attributes by investigating the correlation between the suggested indicators and the desired external quality-type attributes of the code. A method for the evaluation of the suggested metrics is discussed and the results of performing such an evaluation for functional programs are presented.

5.
Recent proposals for multi-paradigm declarative programming combine the most important features of functional, logic and concurrent programming into a single framework. The operational semantics of these languages is usually based on a combination of narrowing and residuation. In this paper, we introduce a non-standard, residualizing semantics for multi-paradigm declarative programs and prove its equivalence with a standard operational semantics. Our residualizing semantics is particularly relevant within the area of program transformation where it is useful, e.g., to perform computations during partial evaluation. Thus, the proof of equivalence is a crucial result to demonstrate the correctness of (existing) partial evaluation schemes.

6.
As CMOS feature sizes continue to shrink and traditional microarchitectural methods for delivering high performance (e.g., deep pipelining) become too expensive and power-hungry, chip multiprocessors (CMPs) become an exciting new direction by which system designers can deliver increased performance. Exploiting parallelism in such designs is the key to high performance, and we find that parallelism must be exploited at multiple levels of the system: the thread-level parallelism that has become popular in many designs fails to exploit all the levels of parallelism available in many workloads for CMP systems. We describe the Cell Broadband Engine and the multiple levels at which its architecture exploits parallelism: data-level, instruction-level, thread-level, memory-level, and compute-transfer parallelism. By taking advantage of opportunities at all levels of the system, this CMP revolutionizes parallel architectures to deliver previously unattained levels of single-chip performance. We describe how the heterogeneous cores achieve this performance by parallelizing and offloading computation-intensive application code onto the Synergistic Processor Element (SPE) cores using a heterogeneous thread model. We also give an example of scheduling code to be memory-latency tolerant by using software-pipelining techniques in the SPE. This paper is based in part on “Chip multiprocessing and the Cell Broadband Engine”, ACM Computing Frontiers 2006.
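The software-pipelining idea mentioned at the end, overlapping data transfer with computation, can be sketched generically. This is not SPE or DMA code: fetch() and process() are stand-ins, and a worker thread plays the role of the asynchronous transfer engine.

```python
from concurrent.futures import ThreadPoolExecutor

def process(block):
    return sum(block)                       # stand-in for the computation

def fetch(i):
    # Stand-in for an asynchronous transfer from main memory.
    return list(range(i * 4, i * 4 + 4))

def double_buffered(n_blocks):
    """Overlap the fetch of block i+1 with the processing of block i,
    the same pattern the SPE software pipelining described above exploits."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)
        for i in range(n_blocks):
            block = pending.result()                    # wait for buffer i
            if i + 1 < n_blocks:
                pending = pool.submit(fetch, i + 1)     # start buffer i+1
            results.append(process(block))              # compute on buffer i
    return results

print(double_buffered(4))
```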

7.
Case studies in asynchronous data parallelism
Is the owner-computes style of parallelism, captured in a variety of data parallel languages, attractive as a paradigm for designing explicit control parallel codes? This question gives rise to a number of others. Will such use be unwieldy? Will the resulting code run well? What can such an approach offer beyond merely replicating, in a more labor intensive way, the services and coverage of data parallel languages? We investigate these questions via a simple example and “real world” case studies developed using C-Linda, a language for explicit parallel programming formed by the merger of C with the Linda coordination language. The results demonstrate owner-computes is an effective design strategy in Linda.
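A minimal sketch of the owner-computes rule itself, using Python threads on a one-dimensional relaxation instead of C-Linda tuple-space operations: every worker walks the whole index space but writes only the entries it owns. The cyclic ownership function and the array contents are illustrative assumptions.

```python
import threading

def owner(i, nworkers):
    return i % nworkers                      # cyclic ownership of indices

def relax(src, dst, me, nworkers):
    """Owner-computes rule: worker `me` writes only the entries it owns."""
    for i in range(1, len(src) - 1):
        if owner(i, nworkers) == me:
            dst[i] = 0.5 * (src[i - 1] + src[i + 1])

src = [float(i) for i in range(10)]
dst = src[:]
nworkers = 3
threads = [threading.Thread(target=relax, args=(src, dst, w, nworkers))
           for w in range(nworkers)]
for t in threads: t.start()
for t in threads: t.join()
print(dst)
```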

8.
This paper presents a high-performance single-chip multiprocessor based on a VLIW (very long instruction word) architecture. The architecture uses pipelined register files to eliminate data dependences within loops, so that programs can run at the instruction level with a very high degree of parallelism. Simulation results show that the architecture achieves high computation speed and a good performance/price ratio.

9.
10.
Programs whose parallelism stems from multiple recursion form an interesting subclass of parallel programs with many practical applications. The highly irregular shape of many recursion trees makes it difficult to obtain good load balancing with small overhead. We present a system, called REAPAR, that executes recursive C programs in parallel on SMP machines. Based on data from a single profiling run of the program, REAPAR selects a load-balancing strategy that is both effective and efficient, and it generates parallel code implementing that strategy. The performance obtained by REAPAR on a diverse set of benchmarks matches that published for much more complex systems requiring high-level, problem-oriented, explicitly parallel constructs. A case study even found REAPAR to be competitive with handwritten (low-level, machine-oriented) thread-parallel code.
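One of the simplest load-balancing strategies of the kind REAPAR chooses among is a depth cutoff: spawn parallel tasks near the root of the recursion tree and recurse sequentially below it. The sketch below is a hypothetical Python analogue, not REAPAR output; fib, the cutoff of 3, and the pool size are illustrative. With this scheme at most 2^(cutoff-1) - 1 = 3 submitted tasks block waiting on children, so a pool of 4 worker threads cannot deadlock.

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def fib(n, depth=0, cutoff=3):
    """Spawn a parallel task for one branch while the recursion is shallow;
    below the cutoff, fall back to plain sequential recursion."""
    if n < 2:
        return n
    if depth < cutoff:
        left = pool.submit(fib, n - 1, depth + 1, cutoff)   # parallel branch
        right = fib(n - 2, depth + 1, cutoff)               # caller's branch
        return left.result() + right
    return fib(n - 1, depth + 1, cutoff) + fib(n - 2, depth + 1, cutoff)

print(fib(20))   # 6765
```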

11.
As massively parallel computers proliferate, there is growing interest in finding ways by which the performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper, we describe one solution in which one directly executes the application code but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct-execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), that we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10% relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.
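A heavily simplified sketch of the direct-execution idea: the application's behavior is reduced to per-process traces of compute bursts, sends, and receives, and the simulator advances per-process virtual clocks while modelling message latency. The traces, the fixed latency, and the round-robin driver are illustrative assumptions; LAPSE itself is far more detailed, and is itself parallelized.

```python
import heapq

LATENCY = 5.0     # modelled message latency (illustrative)

# Hypothetical per-process traces of the kind a direct-execution run records:
# compute bursts, sends to a destination process, and blocking receives.
traces = {
    0: [("compute", 10), ("send", 1), ("recv",), ("compute", 3)],
    1: [("compute", 2), ("recv",), ("compute", 7), ("send", 0)],
}

def simulate(traces, latency):
    clock = {p: 0.0 for p in traces}        # per-process virtual time
    pc = {p: 0 for p in traces}             # next operation in each trace
    arrivals = {p: [] for p in traces}      # heap of message arrival times

    def advance(p):
        """Run process p until its trace ends or it blocks on a receive."""
        while pc[p] < len(traces[p]):
            op = traces[p][pc[p]]
            if op[0] == "compute":
                clock[p] += op[1]
            elif op[0] == "send":
                heapq.heappush(arrivals[op[1]], clock[p] + latency)
            else:  # "recv"
                if not arrivals[p]:
                    return False            # message not sent yet: block
                clock[p] = max(clock[p], heapq.heappop(arrivals[p]))
            pc[p] += 1
        return True

    pending = set(traces)
    while pending:
        progressed = False
        for p in sorted(pending):
            before = pc[p]
            if advance(p):
                pending.discard(p)
            progressed |= pc[p] != before
        if not progressed:
            raise RuntimeError("traces deadlock")
    return clock

print(simulate(traces, LATENCY))            # predicted completion times
```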

12.
We model a deterministic parallel program by a directed acyclic graph of tasks, where a task can execute as soon as all tasks preceding it have been executed. Each task can allocate or release an arbitrary amount of memory (i.e., heap memory allocation can be modeled). We call a parallel schedule “space efficient” if the amount of memory required is at most equal to the number of processors times the amount of memory required for some depth-first execution of the program by a single processor. We describe a simple, locally depth-first scheduling algorithm and show that it is always space efficient. Since the scheduling algorithm is greedy, it will be within a factor of two of being optimal with respect to time. For the special case of a program having a series-parallel structure, we show how to efficiently compute the worst case memory requirements over all possible depth-first executions of a program. Finally, we show how scheduling can be decentralized, making the approach scalable to a large number of processors when there is sufficient parallelism.
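The serial baseline in that bound, the memory needed by a depth-first execution on one processor, is easy to compute when every task's net allocation is known. The sketch below assumes a task tree given as a child map and signed per-task allocation deltas (both illustrative); it reports the peak live memory of one depth-first order, not the worst case over all depth-first orders that the paper computes for series-parallel programs.

```python
def serial_dfs_peak(children, delta, root):
    """Peak live memory of one serial depth-first execution of a task tree,
    where delta[t] is the net memory task t allocates (+) or releases (-).
    This is the one-processor baseline that the p-processor space bound
    (p times this value) is stated against."""
    live, peak = 0, 0
    def run(t):
        nonlocal live, peak
        live += delta[t]
        peak = max(peak, live)
        for c in children.get(t, []):
            run(c)
    run(root)
    return peak

# Hypothetical task tree with per-task allocation deltas (illustrative).
children = {"r": ["a", "b"], "a": ["a1", "a2"]}
delta = {"r": 1, "a": 4, "a1": 2, "a2": -3, "b": 5}
print(serial_dfs_peak(children, delta, "r"))   # 9 for this order
```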

13.
This paper studies task-scheduling algorithms for multiprocessor systems with high communication latency (cluster systems). Unlike earlier algorithms, which mainly consider the critical path of the task graph, this paper establishes a correspondence between task-graph scheduling and bipartite graph matching, and on that basis proposes a new heuristic algorithm. Simulation experiments show that the algorithm produces good schedules.
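To make the bipartite-matching view concrete, here is a hypothetical single scheduling step: ready tasks on one side, processors on the other, edge weights given by an estimated finish time that folds in the communication delay, matched greedily by earliest finish. The cost model, the greedy matching, and all numbers are assumptions for illustration; the paper's actual heuristic is not specified in this abstract.

```python
def schedule_step(ready, proc_free, exec_cost, comm_cost):
    """Match ready tasks to processors for one scheduling step.  The weight
    of edge (task, proc) is the estimated finish time, including the delay
    of getting the task's inputs to that processor."""
    finish = {(t, p): max(proc_free[p], comm_cost[t][p]) + exec_cost[t]
              for t in ready for p in proc_free}
    assignment, used = {}, set()
    while len(assignment) < min(len(ready), len(proc_free)):
        t, p = min((tp for tp in finish
                    if tp[0] not in assignment and tp[1] not in used),
                   key=finish.get)
        assignment[t] = p
        used.add(p)
    return assignment

proc_free = {0: 0.0, 1: 0.0}                 # time each processor becomes idle
exec_cost = {"t1": 4, "t2": 6, "t3": 2}
comm_cost = {"t1": {0: 0, 1: 3}, "t2": {0: 5, 1: 0}, "t3": {0: 1, 1: 1}}
print(schedule_step(["t1", "t2", "t3"], proc_free, exec_cost, comm_cost))
```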

14.
A parallel-execution model that can concurrently exploit AND and OR parallelism in logic programs is presented. This model employs a combination of techniques in an approach to executing logic programs in parallel, making tradeoffs among the number of processes, the degree of parallelism, and the combination bandwidth. For interpreting a nondeterministic logic program, this model (1) performs frame inheritance for newly created goals, (2) creates data-dependency graphs (DDGs) that represent relationships among the goals, and (3) constructs appropriate process structures based on the DDGs. (1) The use of frame inheritance serves to increase modularity. In contrast to most previous parallel models that have a large single process structure, frame inheritance facilitates the dynamic construction of multiple independent process structures, and thus permits further manipulation of each process structure. (2) The dynamic determination of data dependency serves to reduce computational complexity. In comparison to models that exploit brute-force parallelism and models that have fixed execution sequences, this model can reduce the number of unification and/or merging steps substantially. In comparison to models that exploit only AND parallelism, this model can selectively exploit demand-driven computation, according to the binding of the query and optional annotations. (3) The construction of appropriate process structures serves to reduce communication complexity. Unlike other methods that map DDGs directly onto process structures, this model can significantly reduce the amount of data sent to a process and/or the number of communication channels connected to a process.

15.
Large scale parallel programming projects may become heterogeneous in both language and architectural model. We propose that skeletal programming techniques can alleviate some of the costs involved in designing and porting such programs, illustrating our approach with a simple program which combines shared memory and message passing code. We introduce Activity Graphs as a simple and practical means of capturing model independent aspects of the operational semantics of skeletal parallel programs. They are independent of low level details of parallel implementation and so can act as an intermediate layer for compilation to diverse underlying models. Activity graphs provide a notion of parallel activities, dependencies between activities, and the process groupings within which these take place. The compilation process uses a set of graph generators (templates) to derive the activity graph. We describe simple schemes for transforming activity graphs into message passing programs, targeting both MPI and BSP.

16.
    
During the last decade object-oriented programming has grown from marginal influence into widespread acceptance. During the same period, progress in hardware and networking has changed the computing environment from sequential to parallel. Multi-processor workstations and clusters are now quite common. Numerous proposals have been made to combine both developments. Always the prime objective has been to provide the advantages of object-oriented software design at the increased power of parallel machines. However, combining both concepts has proven to be notoriously difficult. Depending on the approach, often key characteristics of either the object-oriented paradigm or key performance factors of parallelism are sacrificed, resulting in unsatisfactory languages. This survey first recapitulates well-known characteristics of both the object-oriented paradigm and parallel programming, and then marks out the design space of possible combinations by identifying various interdependencies of key concepts. The design space is then filled with data points: for 111 proposed languages we provide brief characteristics and feature tables. Feature tables, the comprehensive bibliography, and web addresses might help in identifying open questions and preventing re-inventions. Copyright © 2000 John Wiley & Sons, Ltd.

17.
In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedbacks, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our technique using a set of standard numerical applications and running them on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems.
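In the spirit of the dynamic-feedback scheme described above (though not the authors' compiler preprocessor), a run-time selector can time a loop at each candidate thread count on its first invocations and then keep using the fastest one. The candidate counts, the thread-pool loop runner, and the loop body are illustrative; in CPython the GIL limits real speedups for pure-Python bodies, so this is only a sketch of the adaptation logic.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_parallel(loop_body, n_iters, n_threads):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(loop_body, range(n_iters)))

class AdaptiveLoop:
    """Pick the thread count for a parallel loop from run-time feedback:
    try each candidate once, then stick with the fastest one measured."""
    def __init__(self, candidates=(1, 2, 4, 8)):
        self.timings = {}                    # thread count -> measured time
        self.candidates = list(candidates)

    def execute(self, loop_body, n_iters):
        untried = [c for c in self.candidates if c not in self.timings]
        n = untried[0] if untried else min(self.timings, key=self.timings.get)
        start = time.perf_counter()
        run_parallel(loop_body, n_iters, n)
        self.timings[n] = time.perf_counter() - start

loop = AdaptiveLoop()
for _ in range(10):                          # repeated loop invocations
    loop.execute(lambda i: sum(range(1000)), n_iters=64)
print(loop.timings)
```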

18.
Stochastic bounds are obtained on the execution times of parallel programs when the number of processors is unlimited. A parallel program is considered to consist of interdependent tasks with synchronization constraints. These constraints are described by an acyclic directed graph called a task graph. The execution times of tasks are considered to be independently, identically distributed (i.i.d.) random variables. The performance measure of interest is the overall execution time of the considered parallel program (task graph). Stochastic bound methods are applied to obtain lower and upper bounds on this measure. Another upper bound is obtained for parallel programs having 'new better than used in expectation' (NBUE) random variables as task execution times; NBUE random variables are replaced with exponential random variables of the same mean to derive this upper bound.
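The quantity being bounded, the completion time of a task graph on unlimited processors with i.i.d. task times, can be estimated by simulation, which is a convenient way to sanity-check such bounds on small examples. The diamond-shaped graph and Exp(1) task times below are illustrative assumptions; the paper's analytical bounding methods are not reproduced here.

```python
import random

def makespan(preds, sample_time):
    """Completion time of a task graph with unlimited processors: each task
    starts as soon as all of its predecessors have finished."""
    finish = {}
    def f(t):
        if t not in finish:
            finish[t] = sample_time() + max((f(p) for p in preds[t]), default=0.0)
        return finish[t]
    return max(f(t) for t in preds)

# Hypothetical diamond-shaped task graph; task times are i.i.d. Exp(1).
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
random.seed(0)
runs = [makespan(preds, lambda: random.expovariate(1.0)) for _ in range(10000)]
print(sum(runs) / len(runs))        # Monte Carlo estimate of the mean makespan
```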

19.
Calculating the maximum execution time of real-time programs
In real-time systems, the timing behavior is an important property of each task. It has to be guaranteed that the execution of a task does not take longer than the specified amount of time. Thus, knowledge about the maximum execution time of programs is of utmost importance. This paper discusses the problems of calculating the maximum execution time (MAXT: MAximum eXecution Time). It shows the preconditions which have to be met before the MAXT of a task can be calculated. Rules for the MAXT calculation are described. Triggered by the observation that in most cases the calculated MAXT far exceeds the actual execution time, new language constructs are introduced. These constructs allow programmers to put into their programs more information about the behavior of the algorithms implemented and help to improve the self-checking property of programs. As a consequence, the quality of MAXT calculations is improved significantly. In a realistic example, an improvement factor of 11 has been achieved.
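The calculation rules follow the usual timing schema for structured code: sequences add, a conditional contributes its slower branch, and a loop needs an explicit iteration bound, which is exactly the kind of information the proposed language constructs supply. A minimal illustration with made-up instruction costs, not the paper's tool:

```python
def maxt(stmt):
    """Maximum execution time of a structured program fragment:
    sequences add, branches take the slower path, loops multiply the body
    by an explicit iteration bound."""
    kind = stmt[0]
    if kind == "basic":                       # ("basic", cost)
        return stmt[1]
    if kind == "seq":                         # ("seq", [stmts...])
        return sum(maxt(s) for s in stmt[1])
    if kind == "if":                          # ("if", test_cost, then, else)
        return stmt[1] + max(maxt(stmt[2]), maxt(stmt[3]))
    if kind == "loop":                        # ("loop", max_iters, body)
        return stmt[1] * maxt(stmt[2])
    raise ValueError(kind)

prog = ("seq", [("basic", 2),
                ("loop", 10, ("if", 1, ("basic", 5), ("basic", 3))),
                ("basic", 1)])
print(maxt(prog))      # 2 + 10*(1+5) + 1 = 63
```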

20.
Execution monitors are widely used during software development for tasks that require an understanding of program behavior, such as debugging and profiling. The Icon programming language has been enhanced with a framework that supports execution monitoring. Under the enhanced translator and interpreter, neither source modification nor any special compiler command-line option is required in order to monitor an Icon program. Execution monitors are written in the source language, instead of the implementation language. Performance, portability, and detailed access to the monitored program's state are achieved using a coroutine model and dynamic loading rather than the separate-process model employed by many conventional monitoring systems.
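Icon loads the monitor as a coroutine written in Icon itself; as a rough analogue only, Python's sys.settrace lets a monitor observe another function's execution without modifying its source. The fib example and the call counting are illustrative and are not part of the Icon work.

```python
import sys
from collections import Counter

calls = Counter()

def monitor(frame, event, arg):
    """A minimal execution monitor: count function calls as the monitored
    program runs, without touching its source."""
    if event == "call":
        calls[frame.f_code.co_name] += 1
    return monitor

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

sys.settrace(monitor)      # start monitoring
fib(10)
sys.settrace(None)         # stop monitoring
print(calls.most_common(3))
```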
