首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Several large real‐world applications have been developed for distributed and parallel architectures. We examine two different program development approaches. First, the usage of a high‐level programming paradigm which reduces the time to create a parallel program dramatically but sometimes at the cost of a reduced performance; a source‐to‐source compiler, has been employed to automatically compile programs—written in a high‐level programming paradigm—into message passing codes. Second, a manual program development by using a low‐level programming paradigm—such as message passing—enables the programmer to fully exploit a given architecture at the cost of a time‐consuming and error‐prone effort. Performance tools play a central role in supporting the performance‐oriented development of applications for distributed and parallel architectures. SCALA—a portable instrumentation, measurement, and post‐execution performance analysis system for distributed and parallel programs—has been used to analyze and to guide the application development, by selectively instrumenting and measuring the code versions, by comparing performance information of several program executions, by computing a variety of important performance metrics, by detecting performance bottlenecks, and by relating performance information back to the input program. We show several experiments of SCALA when applied to real‐world applications. These experiments are conducted for a NEC Cenju‐4 distributed‐memory machine and a cluster of heterogeneous workstations and networks. Copyright © 2001 John Wiley & Sons, Ltd.  相似文献   

2.
Dynamically allocating computing nodes to parallel applications is a promising technique for improving the utilization of cluster resources. Detailed simulations can help identify allocation strategies and problem decomposition parameters that increase the efficiency of parallel applications. We describe a simulation framework supporting dynamic node allocation which, given a simple cluster model, predicts the running time of parallel applications taking CPU and network sharing into account. Simulations can be carried out without needing to modify the application code. Thanks to partial direct execution, simulation times and memory requirements are reduced. In partial direct execution simulations, the application's parallel behavior is retrieved via direct execution, and the duration of individual operations is obtained from a performance prediction model or from prior measurements. Simulations may then vary cluster model parameters, operation durations and problem decomposition parameters to analyze their impact on the application performance and identify the limiting factors. We implemented the proposed techniques by adding direct execution simulation capabilities to the Dynamic Parallel Schedules parallelization framework. We introduce the concept of dynamic efficiency to express the resource utilization efficiency as a function of time. We verify the accuracy of our simulator by comparing the effective running time, respectively the dynamic efficiency, of parallel program executions with the running time, respectively the dynamic efficiency, predicted by the simulator under different parallelization and dynamic node allocation strategies.  相似文献   

3.
Originally implemented to support wireless sensor networks, the lightweight mobile code daemons works on wired networks as well. The mobile code daemon we present is based on a core network protocol called the remote execution and action protocol (REAP), which is responsible for passing messages between nodes within our network. The authors have implemented them to run on multiple operating systems and architectures. The daemons use a peer-to-peer indexing scheme to find mobile code packages as needed. Calls to the daemons are automatically dereferenced to find the proper package for a node's OS and hardware.  相似文献   

4.
5.
D.A.  P.D. 《Performance Evaluation》2005,60(1-4):165-187
We present a new performance modeling system for message-passing parallel programs that is based around a Performance Evaluating Virtual Parallel Machine (PEVPM). We explain how to develop PEVPM models for message-passing programs using a performance directive language that describes a program’s serial segments of computation and message-passing events. This is a novel bottom-up approach to performance modeling, which aims to accurately model when processing and message-passing occur during program execution. The times at which these events occur are dynamic, because they are affected by network contention and data dependencies, so we use a virtual machine to simulate program execution. This simulation is done by executing models of the PEVPM performance directives rather than executing the code itself, so it is very fast. The simulation is still very accurate because enough information is stored by the PEVPM to dynamically create detailed models of processing and communication events. Another novel feature of our approach is that the communication times are sampled from probability distributions that describe the performance variability exhibited by communication subject to contention. These performance distributions can be empirically measured using a highly accurate message-passing benchmark that we have developed. This approach provides a Monte Carlo analysis that can give very accurate results for the average and the variance (or even the probability distribution) of program execution time. In this paper, we introduce the ideas underpinning the PEVPM technique, describe the syntax of the performance modeling language and the virtual machine that supports it, and present some results, for example, parallel programs to show the power and accuracy of the methodology.  相似文献   

6.
The Java programming language is increasingly used in the implementation of servers with stringent availability, reliability, and performance requirements. Our Java Application Supervisor (JAS) software system is an attachment to a Java runtime environment that enhances the availability of a target Java program. To this end, JAS automatically detects and resolves certain reliability and performance problems during the execution of the target program. JAS does not require any source or byte code modifications in the target program. Instead, JAS is configured for a target program by supplying simple policies that determine how JAS reacts to problems during the target program execution. JAS typically imposes little execution time and memory overhead on the target program. We describe an experiment with a Web proxy that exhibits reliability and performance problems under heavy load. In this experiment, running the proxy in conjunction with JAS increased the rate of successful requests to the proxy by 33% and decreased the averagerequest processing time by 22%. JAS was also used successfully in two Java servers at Bell Labs to monitor server reliability and performance and ensure long‐term availability. Copyright © 2001 John Wiley & Sons, Ltd.  相似文献   

7.
Originally developed with a single language in mind, the JVM is now targeted by numerous programming languages—its automatic memory management, just‐in‐time compilation, and adaptive optimizations—making it an attractive execution platform. However, the garbage collector, the just‐in‐time compiler, and other optimizations and heuristics were designed primarily with the performance of Java programs in mind. Consequently, many of the languages targeting the JVM, and especially the dynamically typed languages, are suffering from performance problems that cannot be simply solved at the JVM side. In this article, we aim to contribute to the understanding of the character of the workloads imposed on the JVM by both dynamically typed and statically typed JVM languages. To this end, we introduce a new set of dynamic metrics for workload characterization, along with an easy‐to‐use toolchain to collect the metrics. We apply the toolchain to applications written in six JVM languages (Java, Scala, Clojure, Jython, JRuby, and JavaScript) and discuss the findings. Given the recently identified importance of inlining for the performance of Scala programs, we also analyze the inlining behavior of the HotSpot JVM when executing bytecode originating from different JVM languages. As a result, we identify several traits in the non‐Java workloads that represent potential opportunities for optimization. © 2015 The Authors. Software: Practice and Experience Published by John Wiley & Sons Ltd.  相似文献   

8.
Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks. In this paper we present two profiling techniques for the fine‐grained parallel programming language Split‐C, which provides a simple global address space memory model. One profiler provides a detailed analysis of a program's execution. The other profiler collects cumulative information. As our experience shows, it is quite challenging to profile programs that make use of efficient, low‐overhead communication. We incorporated techniques which minimize profiling effects on the running program, and quantified the profiling overhead. We present several Split‐C applications showing that the profiler is useful in determining performance bottlenecks. Copyright © 1999 John Wiley & Sons, Ltd.  相似文献   

9.
Parallel execution of application programs on a multiprocessor system may lead to performance degradation if the workload of a parallel region is not large enough to amortize the overheads associated with the parallel execution. Furthermore, if too many processes are running on the system in a multiprogrammed environment, the performance of the parallel application may degrade due to resource contention. This work proposes a comprehensive dynamic processor allocation scheme that takes both program behavior and system load into consideration when dynamically allocating processors. This mechanism was implemented on the Solaris operating system to dynamically control the execution of parallel C and Java application programs. Performance results show the effectiveness of this scheme in dynamically adapting to the current execution environment and program behavior, and that it outperforms a conventional time‐shared system. Copyright © 2002 John Wiley & Sons, Ltd.  相似文献   

10.
Java just‐in‐time compilers often compile only hot methods because the compilation overhead is a part of the running time. This requires precise and efficient hot spot detection, which includes distinguishing hot methods from cold ones, detecting them as early as possible, and paying a small detection overhead. Hot spot detection is especially important in embedded applications because they show more of a start‐up phase behavior of a regular application where methods are not executed heavily, so the hot methods are not definite. Because a long‐running method is likely to be a hot method, we can detect a hot method by measuring its running time during interpretation. However, precise measurement of the running time during execution is too expensive, especially in embedded systems, so many counter‐based heuristics have been proposed to estimate it such as Oracle's HotSpot heuristic. One problem is that although the overhead of these heuristics is low, they do not estimate the running time precisely, which may lead to imprecise hot spot detection.This paper proposes a new hot spot detection heuristic called flow‐sensitive runtime estimation, which can estimate the running time more precisely than others with a relatively low overhead. It only counts important bytecode instructions dynamically, but it can obtain the precise count of all interpreted bytecode instructions with a simple arithmetic calculation. We also propose a static analysis technique to predict those hot methods which spends a huge execution time once invoked, so as to compile them at their first invocation. Our experimental results show that these techniques can improve the performance by as much as an average of 7.4% compared with the HotSpot heuristic for the benchmarks when they run once, which is often regarded as showing the start‐up phase behavior. Even for real embedded Java applications such as the digital TV Java Xlet applications, our techniques can improve the user response time by an average of 7.1%. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

11.

Model checkers frequently fail to completely verify a concurrent program, even if partial-order reduction is applied. The verification engineer is left in doubt whether the program is safe and the effort toward verifying the program is wasted. We present a technique that uses the results of such incomplete verification attempts to construct a (fair) scheduler that allows the safe execution of the partially verified concurrent program. This scheduler restricts the execution to schedules that have been proven safe (and prevents executions that were found to be erroneous). We evaluate the performance of our technique and show how it can be improved using partial-order reduction. While constraining the scheduler results in a considerable performance penalty in general, we show that in some cases our approach—somewhat surprisingly—even leads to faster executions.

  相似文献   

12.
Pilsung Kang 《Software》2018,48(3):385-401
Function call interception (FCI), or method call interception (MCI) in the object‐oriented programming domain, is a technique of intercepting function calls at program runtime. Without directly modifying the original code, FCI enables to undertake certain operations before and/or after the called function or even to replace the intercepted call. Thanks to this capability, FCI has been typically used to profile programs, where functions of interest are dynamically intercepted by instrumentation code so that the execution control is transferred to an external module that performs execution time measurement or logging operations. In addition, FCI allows for manipulating the runtime behavior of program components at the fine‐grained function level, which can be useful in changing an application's original behavior at runtime to meet certain execution requirements such as maintaining performance characteristics for different input problem sets. Due to this capability, however, some FCI techniques can be used as a basis of many security exploits for vulnerable systems. In this paper, we survey a variety of FCI techniques and tools along with their applications to diverse areas in the computing and software domains. We describe static and dynamic FCI techniques at large and discuss the strengths and weaknesses of different implementations in this category. In addition, we also discuss aspect‐oriented programming implementation techniques for intercepting method calls.  相似文献   

13.
The execution model for mobile, dynamically‐linked, object‐oriented programs has evolved from fast interpretation to a mix of interpreted and dynamically compiled execution. The primary motivation for dynamic compilation is that compiled code executes significantly faster than interpreted code. However, dynamic compilation, which is performed while the application is running, introduces execution delay. In this paper we present two dynamic compilation techniques that enable high performance execution while reducing the effect of this compilation overhead. These techniques can be classified as (1) decreasing the amount of compilation performed, and (2) overlapping compilation with execution. We first present and evaluate lazy compilation, an approach used in most dynamic compilation systems in which individual methods are compiled on‐demand upon their first invocation. This is in contrast to eager compilation, in which all methods in a class are compiled when a new class is loaded. In this work, we describe our experience with eager compilation, as well as the implementation and transition to lazy compilation. We empirically detail the effectiveness of this decision. Our experimental results using the SpecJVM Java benchmarks and the Jalapeño JVM show that, compared to eager compilation, lazy compilation results in 57% fewer methods being compiled and reductions in total time of 14 to 26%. Total time in this context is compilation plus execution time. Next, we present profile‐driven, background compilation, a technique that augments lazy compilation by using idle cycles in multiprocessor systems to overlap compilation with application execution. With this approach, compilation occurs on a thread separate from that of application threads so as to reduce intermittent, and possibly substantial, delay in execution. Profile information is used to prioritize methods as candidates for background compilation. Methods are compiled according to this priority scheme so that performance‐critical methods are invoked using optimized code as soon as possible. Our results indicate that background compilation can achieve the performance of off‐line compiled applications and masks almost all compilation overhead. We show significant reductions in total time of 14 to 71% over lazy compilation. Copyright © 2001 John Wiley & Sons, Ltd.  相似文献   

14.
在安全关键系统中,针对多核处理器共享硬件资源竞争带来的执行时间波动性问题,提出了基于性能计数器PMR和RSB的通用测试方法,通过捕捉执行时间波动性相关的硬件事件来分析硬件资源的共享性、执行时间的波动性和硬件平台的黑盒或灰盒行为。此方法可用于硬件平台的性能评估,也可用于应用任务的资源消耗评估,从而为WCET预测提供指导。  相似文献   

15.
It has been difficult to develop simple formulations to predict the execution time of parallel programs due to the complexity of characterizing parallel hardware and software. In an attempt to clarify these characterizations, we introduce a methodology for applying a simple performance model based on Amdahl′s law. Our formulation results in accurate predictions of execution time on available systems, allowing programmers to select the optimal number of processors to apply to a particular problem or to select an appropriate problem size for the number of processors available. In short, we accurately quantify the scalability of a specific algorithm when it is run on a specific parallel computer. Our predictions are based on simple experiments that characterize machine performance and on a simple analysis of the parallel program. We illustrate our method for a program executed on a Sequent Symmetry multiprocessor with 20 processors. Our predictions closely match experimental results, differing by no more than 5% from the actual execution times. Our results illustrate key performance limitations of parallel systems, showing the impact of overhead and the scaling of problem size.  相似文献   

16.
There are several purposes of analyzing a program:functional or performance analysis,debugging or,more recently,mapping a program to a new paraller or distributed architecture.In this paper,we introduce an effective method leading to the Execution Graph (EG)from a program.First,the Unix profiling tool Gprof is used to get the Execution Model(EM)of a C-program.Then the event-driven monitoring tool AICOS-SIMPLE is used to get the EG which includes not only the call graph but also the execution time table of the program.This method is suitable for analyzing modern distributed programs.As the example of the analysis,the well known HTTP protocol under the NCSA Mosaic is chosen.Ae EG of NCSA Mosaic on the routing level is given.  相似文献   

17.
In this paper, we explore how to get the information of input‐output coupling parameters (IOCPs) for a class of uncertain discrete‐time systems by using iterative learning technique. Firstly, by taking advantage of repetitiveness of control system and informative input and output data, we design an iterative learning scheme for unknown IOCPs. It is shown that we can get the exact values of IOCPs one by one through running the repetitive system T+1 times if the control system is with identical initial state and noise free. Secondly, we give the iterative learning scheme for unknown IOCPs in the presence of measurement noise, system noise, or initial state drift and analyze the influence factors on the performance of developed iterative learning scheme. Meanwhile, we introduce the maximum allowable control deviation into the iterative learning mechanism to minimize the negative impact of noise on the performance of learning scheme and to enhance the robust of iterative learning scheme. Thirdly, for a class of multiple‐input–multiple‐output systems, we also develop iterative learning mechanism for unknown input‐output coupling matrices. Finally, an illustrative example is given to demonstrate the effectiveness of proposed iterative learning scheme.  相似文献   

18.
Negative acknowledgments (NACKs) and subsequent retries, used to resolve races and to enforce a total order among shared memory accesses in distributed shared memory (DSM) multiprocessors, not only introduce extra network traffic and contention, but also increase node controller occupancy, especially at the home. We present possible protocol optimizations to minimize these retries and offer a thorough study of the performance effects of these messages on six scalable scientific applications running on 64-node systems and larger. To eliminate NACKs, we present a mechanism to queue pending requests at the main memory of the home node and augment it with a novel technique of combining pending read requests, thereby accelerating the parallel execution for 64 nodes by as much as 41 percent (a speedup of 1.41) compared to a modified version of the SGI Origin 2000 protocol. We further design and evaluate a protocol by combining this mechanism with a technique that we call write string forwarding, used in the AlphaServer GS320 and Piranha systems. We find that without careful design considerations, especially regarding atomic read-modify-write operations, this aggressive write forwarding can hurt performance. We identify and evaluate the necessary micro-architectural support to solve this problem. We compare the performance of these novel NACK-free protocols with a base bitvector protocol, a modified version of the SGI Origin 2000 protocol, and a NACK-free protocol that uses dirty sharing and write string forwarding as in the Piranha system. To understand the effects of network speed and topology the evaluation is carried out on three network configurations.  相似文献   

19.
Data‐intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data‐intensive applications for massive data sets and an execution framework for large‐scale data processing on clusters of commodity servers. While fault tolerance, easy programming structure, and high scalability are considered strong points of MapReduce; however its configuration parameters must be fine‐tuned to the specific deployment, which makes it more complex in configuration and performance. This paper explains tuning of the Hadoop configuration parameters, which directly affect MapReduce's job workflow performance under various conditions to achieve maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies can affect the execution time of MapReduce running on cluster systems. Therefore, in this paper, we present a model that consists of three main modules: (1) Extending a data redistribution technique in order to find the high‐performance nodes, (2) Utilizing the number of map/reduce slots in order to make it more efficient in terms of execution time, and (3) Developing a new hybrid routing schedule shuffle phase in order to define the scheduler task while memory management level is reduced.  相似文献   

20.
We consider the time-dependent demands for data movement that a parallel program makes on the architecture that executes it. The result is an architecture-independent metric that represents the temporal behavior of data-movement requirements. Programs are described as series of computations and data movements, and while message passing is not ruled out, we focus on explicit parallel programs using a fixed number of processes in a distributed shared-memory environment. Operations are assumed to be explicitly allocated to processors when the metric is applied, which might correspond to intermediate code in a parallelizing compiler. The metric is called the interprocess read (IR) temporal metric. A key to developing an architecture-independent temporal metric is modeling program execution time in an architecture-independent way. This is possible because well-synchronized parallel programs make coordinated progress above a certain level of granularity. Our execution time characterization takes into account barrier synchronization and critical sections. We illustrate the metric using instruction count on simple code fragments and then from multiprocessor program traces (Splash benchmarks). Results of running the benchmarks on simulated network architectures show that the IR metric for the time scale of network response predicts performance better than whole program measures.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号