首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
The time-cost behavior of a program directly depends on its execution environment (i.e., the number of other programs in the system). We present techniques to analyze the time-cost behavior of programs that include the effect of their execution environment. It is assumed the underlying system has a shared memory architecture. The paper shows that one can analytically relate a parallel program's time cost to its execution environment. A direct benefit of the work is to let us meet the real-time requirement of a parallel program and utilize maximally a multiprocessor system's resource  相似文献   

2.
Debuggers play an important role in developing parallel applications. They are used to control the state of many processes, to present distributed information in a concise and clear way, to observe the execution behavior, and to detect and locate programming errors. More sophisticated debugging systems also try to improve understanding of global execution behavior and intricate details of a program. In this paper we describe the design and implementation of SPiDER, which is an interactive source‐level debugging system for both regular and irregular High‐Performance Fortran (HPF) programs. SPiDER combines a base debugging system for message‐passing programs with a high‐level debugger that interfaces with an HPF compiler. SPiDER, in addition to conventional debugging functionality, allows a single process of a parallel program to be expected or the entire program to be examined from a global point of view. A sophisticated visualization system has been developed and included in SPiDER to visualize data distributions, data‐to‐processor mapping relationships, and array values. SPiDER enables a programmer to dynamically change data distributions as well as array values. For arrays whose distribution can change during program execution, an animated replay displays the distribution sequence together with the associated source code location. Array values can be stored at individual execution points and compared against each other to examine execution behavior (e.g. convergence behavior of a numerical algorithm). Finally, SPiDER also offers limited support to evaluate the performance of parallel programs through a graphical load diagram. SPiDER has been fully implemented and is currently being used for the development of various real‐world applications. Several experiments are presented that demonstrate the usefulness of SPiDER. Copyright © 2002 John Wiley & Sons, Ltd.  相似文献   

3.
P-GRADE provides a high-level graphical environment to develop parallel applications transparently both for parallel systems and the Grid. P-GRADE supports the interactive execution of parallel programs as well as the creation of a Condor, Condor-G or Globus job to execute parallel programs in the Grid. In P-GRADE, the user can generate either PVM or MPI code according to the underlying Grid where the parallel application should be executed. PVM applications generated by P-GRADE can migrate between different Grid sites and as a result P-GRADE guarantees reliable, fault-tolerant parallel program execution in the Grid. The GRM/PROVE performance monitoring and visualisation toolset has been extended towards the Grid and connected to a general Grid monitor (Mercury) developed in the EU GridLab project. Using the Mercury/GRM/PROVE Grid application monitoring infrastructure any parallel application launched by P-GRADE can be remotely monitored and analysed at run time even if the application migrates among Grid sites. P-GRADE supports workflow definition and co-ordinated multi-job execution for the Grid. Such workflow management can provide parallel execution at both inter-job and intra-job level. Automatic checkpoint mechanism for parallel programs supports the migration of parallel jobs inside the workflow providing a fault-tolerant workflow execution mechanism. The paper describes all of these features of P-GRADE and their implementation concepts.  相似文献   

4.
This paper proposes novel produced order parallel queue processor architecture. To store intermediate results, the proposed system uses a first-in-first-out (FIFO) circular queue-registers instead of random access registers. Datum is inserted in the queue-registers in produced order scheme and can be reused. We show that this feature has profound implications in the areas of parallel execution, programs compactness, hardware simplicity and high execution speed.Our performance evaluations show a significant performance improvement (e.g., 10 to 26% decrease in program size and 6 to 46% decrease in execution time over a range of benchmark programs) when compared with the earlier proposed architecture.  相似文献   

5.
The construction of large software systems is always achieved through assembly of independently written components — program modules. For these software components to work together, they must share a common set of data types and principles for representing structured data such as arrays of values and files. This common set of tools for creating and operating on data objects is provided by the infrastructure of the computer system: the hardware, operating system and runtime code. Because the nature and properties of these tools are crucial for correct operation of software components and their inter-operation, it is essential to have a precise specification that may be used for verifying correctness of application software on one hand, and to verify correctness of system behavior on the other. We call such a specification a program execution model (PXM). It is evident that the properties of the PXM implemented by a computer system can have serious impact on the ability of application programmers to practice modular software construction. This paper discusses the concept of program execution models and presents a set of principles that a PXM must satisfy to provide a sound basis for modular software construction. Because parallel program execution on computer systems with many processing units is an essential part of contemporary computing environments, the expression of parallelism and modular software construction using components involving parallel operations is included in this treatment. The conclusion is that it is possible to build computer systems that implement a PXM within which any parallel program may be used, unmodified, as a component for building more substantial parallel programs.  相似文献   

6.
The tension between software development costs and efficiency is especially high when considering parallel programs intended to run on a variety of architectures. In the domain of shared memory architectures and explicitly parallel programs, the authors have addressed this problem by defining a programming structure that eases the development of effectively portable programs. On each target multiprocessor, an effectively portable program runs almost as efficiently as a program fine-tuned for that machine. Additionally, its software development cost is close to that of a single program that is portable across the targets. Using this model, programs are defined in terms of data structure and partitioning-scheduling abstractions. Low software development cost is attained by writing source programs in terms of abstract interfaces and thereby requiring minimal modification to port; high performance is attained by matching (often dynamically) the interfaces to implementations that are most appropriate to the execution environment. The authors include results of a prototype used to evaluate the benefits and costs of this approach  相似文献   

7.
Performance is a main issue in parallel application development. Dynamic tuning is a technique that changes certain applications’ parameters on-line to improve their performance adapting the execution to actual conditions. To perform that, it is necessary to collect measurements, analyze application behavior and carry out tuning actions during the application execution. Computational Grids present proclivity for dynamic changes in the environment during the application execution. Therefore, dynamic tuning tools are necessary to reach the expected performance indexes of applications on those environments. This paper addresses the dynamic tuning of parallel/distributed applications on Computational Grids. We analyze Grid environments to determine their characteristics and we present the development of dynamic tuning tool GMATE enabled for such environments. The performance analysis is based on performance models that indicate how to improve the application execution. A particular problem which provokes performance bottlenecks is the load imbalance in Master/Worker applications. A heuristic to dynamically tune granularity of work and number of workers is proposed. Finally, we describe the experimental validation of the performance model and its applicability on a set of real parallel applications.  相似文献   

8.
Parallel execution is normally used to decrease the amount of time required to run a program. However, the parallel execution may require far more space than that required by the sequential execution. Worse yet, the parallel space requirement may be very much more difficult to predict than the sequential space requirement because there are more factors to consider. These include essentially nondeterministic factors that can influence scheduling, which in turn may dramatically influence space requirements. We survey some scheduling algorithms that attempt to place bounds on the amount of time and space used during parallel execution. We also outline a direction for future research. This direction takes us into the area of functional programming, where the declarative nature of the languages can help the programmer to produce correct parallel programs, a feat that can be difficult with procedural languages. Currently the high-level nature of functional languages can make it difficult for the programmer to understand the operational behavior of the program. We look at some of the problems in this area, with the goal of achieving a programming environment that supports correct, efficient parallel programs.  相似文献   

9.
一个基于图像代数的并行图像处理环境   总被引:3,自引:0,他引:3  
系统可用性和应用程序可移植性差是许多现有的并行图像处理计算结构难以获得实际应用的重要原因,基于Ritter提出的图像代数理论,研究、实现了一个并行图像处理环境.用户在并行计算结构上进行程序设计时,只需用图像处理环境提供的图像代数运算描述算法即可,处理环境能够根据用户算法的描述,依据一个时间开销模型,自动从并行实现函数库中提取出最优或近似最优的并行代码完成算法的运行,算法的并行实现和并行计算结构的硬件细节对用户透明。  相似文献   

10.
A programming environment to support the development and use of engineering applications is presented. The environment provides uniform support for a set of Pascal-class languages in which engineering and scientific applications are commonly written. The environment includes a dynamically multilanguage interpreter debugger to aid in the interactive development of applications. For the application and user, the environment provides a graphical program interface based on the concept of a software control panel. Through a control panel, the user may interactively modify program parameters and exercise fine-grain control over program execution. The environment also includes a graphical design tool for constructing executable block diagrams based on standard application programs. The control-panel tool is integrated with the design tool, to provide a uniform interface to all levels of program execution  相似文献   

11.
D.A.  P.D. 《Performance Evaluation》2005,60(1-4):165-187
We present a new performance modeling system for message-passing parallel programs that is based around a Performance Evaluating Virtual Parallel Machine (PEVPM). We explain how to develop PEVPM models for message-passing programs using a performance directive language that describes a program’s serial segments of computation and message-passing events. This is a novel bottom-up approach to performance modeling, which aims to accurately model when processing and message-passing occur during program execution. The times at which these events occur are dynamic, because they are affected by network contention and data dependencies, so we use a virtual machine to simulate program execution. This simulation is done by executing models of the PEVPM performance directives rather than executing the code itself, so it is very fast. The simulation is still very accurate because enough information is stored by the PEVPM to dynamically create detailed models of processing and communication events. Another novel feature of our approach is that the communication times are sampled from probability distributions that describe the performance variability exhibited by communication subject to contention. These performance distributions can be empirically measured using a highly accurate message-passing benchmark that we have developed. This approach provides a Monte Carlo analysis that can give very accurate results for the average and the variance (or even the probability distribution) of program execution time. In this paper, we introduce the ideas underpinning the PEVPM technique, describe the syntax of the performance modeling language and the virtual machine that supports it, and present some results, for example, parallel programs to show the power and accuracy of the methodology.  相似文献   

12.
Current software environments used to support parallel processing on a cluster of workstations (COW) are not satisfactory, do not provide complete transparency and are not specifically designed for parallel processing. In particular, the establishment of a parallel processing environment and the initialisation of parallel processes suffer from poor performance. Each parallel process of an application is created sequentially and in many cases the logon operation must be completed before remote resources could be acquired. These operations are also performed manually. We present in this paper an original approach that addresses the problem of parallel process creation. The remote workstations are acquired completely transparently and dynamically, and parallel processes are created concurrently. To demonstrate the feasibility of this approach we show a system based on RHODOS (a client/server and microkernel based distributed operating system), specifically designed to improve the performance of process instantiation and therefore able to improve the overall execution performance of parallel programs, in particular parallel process creation. Copyright © 1999 John Wiley & Sons, Ltd.  相似文献   

13.
Load balancing increases the efficient use of existing resources for parallel and distributed applications. At a coarse level of granularity, advances in runtime systems for parallel programs have been proposed in order to control available resources as efficiently as possible by utilizing idle resources and using task migration. Simultaneously, at a finer granularity level, advances in algorithmic strategies for dynamically balancing computational loads by data redistribution have been proposed in order to respond to variations in processor performance during the execution of a given parallel application. Combining strategies from each level of granularity can result in a system which delivers advantages of both. The resulting integration is systemic in nature and transfers the responsibility of efficient resource utilization from the application programmer to the runtime system. This paper presents the design and implementation of a system that combines an algorithmic fine-grained data parallel load balancing strategy with a systemic coarse-grained task-parallel load balancing strategy, and reports on recent experimental results of running a computationally intensive scientific application under this integrated system. The experimental results indicate that a distributed runtime environment which combines both task and data migration can provide performance advantages with little overhead. It also presents proposals for performance enhancements of the implementation, as well as future explorations for effective resource management. Copyright © 2001 John Wiley & Sons, Ltd.  相似文献   

14.
We have developed a problem-solving framework, called ConClass, that is capable of classifying continuous real-time problems dynamically and concurrently on a distributed system. ConClass provides an efficient development environment for describing and decomposing a classification problem and synthesizing solutions. In ConClass, decomposed concurrent subproblems specified by the application developer effectively correspond to the actual distributed hardware elements. This scheme is useful for designing and implementing efficient distributed processing, making it easier to anticipate and evaluate system behavior. The ConClass system provides an object replication feature that prevents any particular object from being overloaded. In order to deal with an indeterminate amount of problem data, ConClass dynamically creates object networks that justify hypothesized solutions, and thus achieves a dynamic load distribution. A number of efficient execution mechanisms that manage a variety of asynchronous aspects of distributed processing have been implemented without using schedulers or synchronization schemes that are liable to develop bottlenecks. We have confirmed the efficiency of parallel distributed processing and load balancing of ConClass with an experimental application  相似文献   

15.
Parallel applications typically do not perform well in a multiprogrammed environment that uses time‐sharing to allocate processor resources to the applications' parallel threads. Co‐scheduling related parallel threads, or statically partitioning the system, often can reduce the applications' execution times, but at the expense of reducing the overall system utilization. To address this problem, there has been increasing interest in dynamically allocating processors to applications based on their resource demands and the dynamically varying system load. The Loop‐Level Process Control (LLPC) policy (Yue K, Lilja D. Efficient execution of parallel applications in multiprogrammed multiprocessor systems. 10th International Parallel Processing Symposium, 1996; 448–456) dynamically adjusts the number of threads an application is allowed to execute based on the application's available parallelism and the overall system load. This study demonstrates the feasibility of incorporating the LLPC strategy into an existing commercial operating system and parallelizing compiler and provides further evidence of the performance improvement that is possible using this dynamic allocation strategy. In this implementation, applications are automatically parallelized and enhanced with the appropriate LLPC hooks so that each application interacts with the modified version of the Solaris operating system. The parallelism of the applications are then dynamically adjusted automatically when they are executed in a multiprogrammed environment so that all applications obtain a fair share of the total processing resources. Copyright © 2001 John Wiley & Sons, Ltd.  相似文献   

16.
We consider the time-dependent demands for data movement that a parallel program makes on the architecture that executes it. The result is an architecture-independent metric that represents the temporal behavior of data-movement requirements. Programs are described as series of computations and data movements, and while message passing is not ruled out, we focus on explicit parallel programs using a fixed number of processes in a distributed shared-memory environment. Operations are assumed to be explicitly allocated to processors when the metric is applied, which might correspond to intermediate code in a parallelizing compiler. The metric is called the interprocess read (IR) temporal metric. A key to developing an architecture-independent temporal metric is modeling program execution time in an architecture-independent way. This is possible because well-synchronized parallel programs make coordinated progress above a certain level of granularity. Our execution time characterization takes into account barrier synchronization and critical sections. We illustrate the metric using instruction count on simple code fragments and then from multiprocessor program traces (Splash benchmarks). Results of running the benchmarks on simulated network architectures show that the IR metric for the time scale of network response predicts performance better than whole program measures.  相似文献   

17.
Performance evaluation of fork and join synchronization primitives   总被引:1,自引:0,他引:1  
Summary The paper presents a performance model of fork and join synchronization primitives. The primitives are used in parallel programs executed on distributed systems. Three variants of the execution of parallel programs with fork and join primitives are considered and queueing models are proposed to evaluate their performance on a finite number of processors. Synchronization delays incurred by the programs are represented by a state-dependent server with service rate depending on a particular synchronization scheme. Closed form results are presented for the two processor case and a numerical method is proposed for many processors. Fork-join queueing networks having more complex structure i.e., processors arranged in series and in parallel, are also analyzed in the same manner. The networks can model the execution of jobs with a general task precedence graph corresponding to a nested structure of the fork-join primitives. Some performance indices of the parallel execution of programs are studied. The results show that the speedup which can be obtained theoretically in a parallel system may be decreased significantly by synchronization constraints.This research we carried out while the author was visiting ISEM, Université de Paris-Sud, France  相似文献   

18.
Networks of workstations (NOWs) are a low-cost and widespread platform for parallel computing. This paper focuses on the dynamic task-scheduling problem in NOW environments. The aim is to minimize the completion time of parallel programs by distributing cooperating concurrent tasks to homogeneous networked nodes. Cooperation dependencies as well as creation and termination dependencies between tasks are taken into account. An event lattice model is introduced to describe past, actual and future behavior of a parallel program in execution. Based on this model an algorithm is presented to dynamically assign tasks to the nodes of a dedicated distributed system. Crucial for the efficiency of this approach is a top-down construction of all operating system entities involved in distributed resource management, particularly the close cooperation of the compiler and runtime system, which allows the creation of event lattice clippings at runtime.  相似文献   

19.
This paper deals with the construction and use of simple synthetic programs that model the behavior of more complex, real parallel programs. Synthetic programs can be used in many ways: to construct an easily ported suite of benchmark programs, to experiment with alternate parallel implementations of a program without actually writing them, and to predict the behavior and performance of an algorithm on a new or hypothetical machine. Synthetic programs are constructed easily from scratch and from existing programs and can even be constructed using nothing but information obtained from traces of the real program's execution.  相似文献   

20.
This article reports the use of case studies to evaluate the performance degradation caused by the kernel-level lock. We define the lock ratio as a ratio of the execution time for critical sections to the total execution time of a parallel program. The kernel-level lock ratio determines how effective programs work on symmetric multiprocessor (SMP) systems. We have measured the lock ratios and the performance of three types of parallel programs on SMP systems with Linux 2.0: matrix multiplication, parallel make, and WWW server programs. Experimental results show that the higher the lock ratio of parallel programs, the worse their performance becomes. Copyright © 2001 John Wiley & Sons, Ltd.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号