Similar Documents
20 similar documents found.
1.
With the increasing performance demand in real-time systems it becomes more and more important to provide feedback to programmers and software development tools on the performance-relevant code parts of a real-time program. So far, this information was limited to an estimation of the worst-case execution time (WCET) and its associated worst-case execution path (WCEP). However, both the WCET and the WCEP provide only partial information: only code parts that lie on one of the WCEPs are indicated to the programmer, while no information is provided for all other code parts. To give a comprehensive view covering the entire code base, tools in the spirit of program profiling are required. This work proposes an efficient approach to compute worst-case timing information for all code parts of a program using a complementary metric, called criticality. Every statement of a program is assigned a criticality value, expressing how critical the code is with respect to the global WCET. This gives valuable information about how close the worst execution path passing through a specific program part is to the global WCEP. We formally define the criticality metric and investigate some of its properties with respect to dominance in control-flow graphs. Exploiting some of those properties, we propose an algorithm that reduces the overhead of computing the metric for complete programs. We also investigate ways to efficiently find only those code parts whose criticality is above a given threshold. Experiments using well-established real-time benchmark programs show an interesting distribution of the criticality values, revealing considerable amounts of highly critical as well as uncritical code. The metric thus provides ideal information to programmers and software development tools to optimize the worst-case execution time of these programs.
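A minimal sketch of a criticality-style computation, assuming a DAG-shaped control-flow graph with fixed per-node execution times; the node costs, edges, and the exact formula (longest path through a node divided by the global WCET) are illustrative and not the paper's formulation:

```python
# Sketch of a criticality-style metric over a DAG control-flow graph.
# Assumption (illustrative): criticality(n) = length of the longest
# entry-to-exit path through n divided by the global WCET.
cost = {"entry": 1, "a": 5, "b": 20, "c": 2, "exit": 1}      # per-node WCET
edges = [("entry", "a"), ("entry", "b"), ("a", "c"), ("b", "c"), ("c", "exit")]
topo = ["entry", "a", "b", "c", "exit"]                      # topological order

preds = {n: [u for (u, v) in edges if v == n] for n in topo}
succs = {n: [v for (u, v) in edges if u == n] for n in topo}

# Longest path from the entry node ending at n (including n's cost).
down = {}
for n in topo:
    down[n] = cost[n] + max((down[p] for p in preds[n]), default=0)

# Longest path from n to the exit node (including n's cost).
up = {}
for n in reversed(topo):
    up[n] = cost[n] + max((up[s] for s in succs[n]), default=0)

wcet = down["exit"]                                          # global WCET
for n in topo:
    through = down[n] + up[n] - cost[n]                      # longest path via n
    print(f"{n}: criticality = {through / wcet:.2f}")
```

On this toy graph, node b lies on the WCEP and gets criticality 1.0, while node a gets about 0.38, showing how far its worst path is from the global WCEP.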

2.
Harnessing the computational capabilities of a network of workstations promises to off-load work from overloaded supercomputers onto largely idle resources overnight. Several capabilities are needed to do this, including support for an architecture-independent parallel programming environment, task migration, automatic resource allocation, and fault tolerance. The Hector distributed run-time environment is designed to present these capabilities transparently to programmers. MPI programs can be run under this environment on homogeneous clusters with no modifications to their source code needed. The design of Hector, its internal structure, and several benchmarks and tests are presented.

3.
Thread-level speculation (TLS) can exploit the latent parallelism of programs and improve the utilization of multi-core resources, but the TACLeBench kernel benchmarks have not yet been effectively analyzed under TLS parallelization. To address this problem, a profiling scheme and profiling tool for loop-level speculative execution were designed. Seven representative TACLeBench kernel benchmarks were selected. First, an initialization analysis of each program was performed and loop markers were inserted into selected hot program regions; next, these regions were cross-compiled, and data on speculative threads and the related memory addresses were recorded to profile the maximum potential loop-level parallelism; finally, the runtime characteristics of the programs (thread granularity, parallelizable coverage, dependence characteristics) and the influence of the source code on the speedup were examined. The experimental results show that: 1) these programs are well suited to TLS acceleration; compared with serial execution, most programs achieve a speedup above 2 under speculative execution of their loop structures, with a maximum speedup of 20.79; 2) when TLS is used to accelerate the TACLeBench kernel programs, most applications can effectively utilize 4 to 16 cores.

4.
There are many paradigms being promoted and explored for programming parallel computers, including modified sequential languages, new imperative languages and applicative languages. SISAL is an applicative language which has been designed by a consortium of industrial and research organizations for the specification and execution of parallel programs. It allows programs to be written with little concern for the structure of the underlying machine, thus the programmer is free to explore different ways of expressing the parallelism. A major problem with applicative languages has been their poor efficiency at handling large data structures. To counter this problem SISAL includes some advanced memory management techniques for reducing the amount of data copying that occurs. In this paper we discuss the implementation of some image processing benchmarks in SISAL and C to evaluate the effectiveness of the memory management code. In general, the SISAL program was easier to code than the C (augmented with the PARMACS macros) because we were not concerned with the parallel implementation details. We found that the SISAL performance was in general comparable to C, and that it could be brought in line with an efficient parallel C implementation by some programmer-specified code transformations.

5.
We introduce a new performance metric, called load balancing factor (LBF), to assist programmers when evaluating different tuning alternatives. The LBF metric differs from traditional performance metrics since it is intended to measure the performance implications of a specific tuning alternative rather than quantifying where time is spent in the current version of the program. A second unique aspect of the metric is that it provides guidance about moving work within a distributed or parallel program rather than reducing it. A variation of the LBF metric can also be used to predict the performance impact of changing the underlying network. The LBF metric is computed incrementally and online during the execution of the program to be tuned. We also present a case study that shows that our metric can accurately predict the actual performance gains for a test suite of six programs.
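The following toy "what-if" calculation is not the paper's LBF definition (which is not reproduced in the abstract); it only illustrates the general idea of predicting the effect of a tuning alternative that moves work between processes:

```python
# Toy estimate of the benefit of shifting `amount` seconds of work from
# process `src` to process `dst`, ignoring communication costs.
# This is an illustrative stand-in, NOT the LBF metric itself.
def predicted_gain(busy, src, dst, amount):
    before = max(busy)                       # completion time ~ slowest process
    moved = list(busy)
    moved[src] -= amount
    moved[dst] += amount
    after = max(moved)
    return before, after, before - after

busy = [12.0, 7.5, 6.0, 5.5]                 # measured busy time per process (s)
print(predicted_gain(busy, src=0, dst=3, amount=3.0))
# -> (12.0, 9.0, 3.0): moving 3 s of work from rank 0 to rank 3 is predicted
#    to shorten the run by about 3 s.
```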

6.
In this paper we introduce our estimation method for parallel execution times, based on identifying separate “parts” of the work done by parallel programs. Our run time analysis works without any source code inspection. The time of parallel program execution is expressed in terms of the sequential work and the parallel penalty. We measure these values for different problem sizes and numbers of processors and estimate them for unknown values in both dimensions using statistical methods. This allows us to predict parallel execution time for unknown inputs and non-available processor numbers with high precision. Our prediction methods require orders of magnitude less data points than existing approaches. We verified our approach on parallel machines ranging from a multicore computer to a peta-scale supercomputer.
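A minimal sketch of the prediction idea, assuming a simplified model T(p) = W/p + penalty(p) with the penalty fitted against log2(p); the model form and all measurements below are invented for illustration:

```python
# Sketch of the "sequential work + parallel penalty" idea (simplified model).
import numpy as np

W = 120.0                                   # measured sequential time (s)
p_obs = np.array([2, 4, 8, 16])             # processor counts measured
t_obs = np.array([62.0, 33.0, 19.0, 12.5])  # measured parallel times (s)

penalty = t_obs - W / p_obs                 # observed parallel penalty
coeff = np.polyfit(np.log2(p_obs), penalty, 1)   # fit penalty ~ a*log2(p) + b

def predict(p):
    """Extrapolate run time to a processor count that was never measured."""
    return W / p + np.polyval(coeff, np.log2(p))

print(round(predict(32), 2), round(predict(64), 2))
```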

7.
Programs whose parallelism stems from multiple recursion form an interesting subclass of parallel programs with many practical applications. The highly irregular shape of many recursion trees makes it difficult to obtain good load balancing with small overhead. We present a system, called REAPAR, that executes recursive C programs in parallel on SMP machines. Based on data from a single profiling run of the program, REAPAR selects a load-balancing strategy that is both effective and efficient, and it generates parallel code implementing that strategy. The performance obtained by REAPAR on a diverse set of benchmarks matches that published for much more complex systems requiring high-level, problem-oriented, explicitly parallel constructs. A case study even found REAPAR to be competitive with handwritten (low-level, machine-oriented) thread-parallel code.

8.
This paper gives an overview of two related tools that we have developed to provide more accurate measurement and modelling of the performance of message-passing communication and application programs on distributed memory parallel computers. MPIBench uses a very precise, globally synchronised clock to measure the performance of MPI communication routines. It can generate probability distributions of communication times, not just the average values produced by other MPI benchmarks. This allows useful insights to be made into the MPI communication performance of parallel computers, and in particular how performance is affected by network contention. The Performance Evaluating Virtual Parallel Machine (PEVPM) provides a simple, fast and accurate technique for modelling and predicting the performance of message-passing parallel programs. It uses a virtual parallel machine to simulate the execution of the parallel program. The effects of network contention can be accurately modelled by sampling from the probability distributions generated by MPIBench. These tools are particularly useful on clusters with commodity Ethernet networks, where relatively high latencies, network congestion and TCP problems can significantly affect communication performance, which is difficult to model accurately using other tools. Experiments with example parallel programs demonstrate that PEVPM gives accurate performance predictions on commodity clusters. We also show that modelling communication performance using average times rather than sampling from probability distributions can give misleading results, particularly for programs running on a large number of processors.
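A small sketch of why sampling message times from a measured distribution can matter more than using the average; the per-message times below are invented and stand in for an empirically measured distribution:

```python
# Compare a Monte Carlo model (sample each message time from the measured
# distribution) with an average-based model of the same communication phase.
import random

random.seed(1)
samples = [0.2] * 90 + [2.0] * 10            # toy per-message times (ms):
mean = sum(samples) / len(samples)            # occasional contention spikes

def run_with(draw, messages=50, trials=2000):
    """Total time of `messages` sequential messages, repeated `trials` times."""
    return [sum(draw() for _ in range(messages)) for _ in range(trials)]

monte_carlo = run_with(lambda: random.choice(samples))
with_average = run_with(lambda: mean)

print(f"average-based model : {with_average[0]:.1f} ms every time")
print(f"sampled model       : min {min(monte_carlo):.1f} ms, "
      f"max {max(monte_carlo):.1f} ms")
```

The average-based model hides the spread that contention spikes cause, which is exactly the variability the probability-distribution approach is meant to capture.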

9.
D.A., P.D. Performance Evaluation, 2005, 60(1-4): 165-187.
We present a new performance modeling system for message-passing parallel programs that is based around a Performance Evaluating Virtual Parallel Machine (PEVPM). We explain how to develop PEVPM models for message-passing programs using a performance directive language that describes a program’s serial segments of computation and message-passing events. This is a novel bottom-up approach to performance modeling, which aims to accurately model when processing and message-passing occur during program execution. The times at which these events occur are dynamic, because they are affected by network contention and data dependencies, so we use a virtual machine to simulate program execution. This simulation is done by executing models of the PEVPM performance directives rather than executing the code itself, so it is very fast. The simulation is still very accurate because enough information is stored by the PEVPM to dynamically create detailed models of processing and communication events. Another novel feature of our approach is that the communication times are sampled from probability distributions that describe the performance variability exhibited by communication subject to contention. These performance distributions can be empirically measured using a highly accurate message-passing benchmark that we have developed. This approach provides a Monte Carlo analysis that can give very accurate results for the average and the variance (or even the probability distribution) of program execution time. In this paper, we introduce the ideas underpinning the PEVPM technique, describe the syntax of the performance modeling language and the virtual machine that supports it, and present results for some example parallel programs that show the power and accuracy of the methodology.
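A toy, directive-driven simulation in the spirit of a virtual parallel machine, with exchange times drawn from an empirical sample set; the directives, numbers, and two-process structure are invented and do not reproduce the PEVPM language:

```python
# Each "process" runs the same tiny trace of directives; exchanges synchronise
# both processes and pay a message cost sampled from measured-looking data.
import random

random.seed(0)
comm_samples = [0.1, 0.1, 0.12, 0.15, 0.4, 1.2]   # toy exchange times (ms)
trace = [("compute", 5.0), ("exchange", None),
         ("compute", 3.0), ("exchange", None)]

def simulate(trials=1000):
    finishes = []
    for _ in range(trials):
        clock = [0.0, 0.0]                          # per-process virtual time
        for kind, dur in trace:
            if kind == "compute":
                clock = [t + dur for t in clock]
            else:                                   # exchange: both wait, then
                start = max(clock)                  # pay a sampled message cost
                cost = random.choice(comm_samples)
                clock = [start + cost, start + cost]
        finishes.append(max(clock))
    return finishes

times = simulate()
print(f"predicted run time: mean {sum(times)/len(times):.2f} ms, "
      f"worst {max(times):.2f} ms")
```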

10.
Efficient performance tuning of parallel programs is often hard. Optimization is often done once the program has been written, as a last effort to increase performance. In a sequential program, each (executed) code segment affects the completion time. In the case of a parallel program executed on a multiprocessor this is not always true, due to dependencies between the different threads. Thus, certain code segments of the execution may not affect the completion time of the program, and optimizing such code segments will not increase performance. In this paper we present an approach to optimize performance by finding the extended critical path of the multithreaded program. The extended critical path analysis is a generalization of critical path analysis in the sense that it also deals with more threads than processors. We have implemented the extended critical path analysis in a performance optimization tool. The tool allows the user to determine the extended critical path of a multithreaded application written for the Solaris operating system for any number of processors, based on execution on a single-processor workstation.
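An illustration (not the paper's algorithm) of why the classic critical path is insufficient when there are more runnable threads than processors: with p processors, completion time is bounded below by both the longest dependence chain and the total work divided by p.

```python
# Four independent thread segments with invented execution times.
segments = {"t1": 10.0, "t2": 9.0, "t3": 9.0, "t4": 1.0}
critical_path = max(segments.values())        # classic critical path: 10.0
total_work = sum(segments.values())           # 29.0

for p in (1, 2, 4):
    bound = max(critical_path, total_work / p)
    print(f"p={p}: completion >= {bound:.1f}")
# With p=2, shortening t2 or t3 (which are not on the classic critical path)
# still reduces the bound; an analysis for a fixed processor count must
# expose exactly this kind of code segment.
```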

11.
Parallel processing systems using networks of workstations are being used to provide an alternative to expensive parallel processors. Scheduling of tasks on these networks is an important and practical problem that must be addressed. Although CPU load is an important parameter to many of the proposed scheduling schemes, no quantitative analysis of CPU load and its precise relation to the run time of application programs has to date been presented. The work in this paper describes the experimental analysis of one common load measure, the UNIX load average, and its relationship to the run time of computation-bound parallel programs. Data was gathered using a test application program designed to mimic common applications, performing long bursts of computation with occasional interprocess data exchange over the network. The resulting execution times and measured load averages were then analyzed using regression analysis to detect load-run time trends. This paper describes the test program and the experiments, then details the results of the data analysis. A technique is then presented for the evaluation of the load-run time relationship for a computation-bound program on a network of workstations.
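A minimal sketch of the kind of load/run-time regression described above, using an ordinary least-squares linear fit; the measurements are invented:

```python
# Fit run time of a computation-bound test program against the UNIX load
# average observed during the run, then predict run time at an unseen load.
import numpy as np

load_avg = np.array([0.1, 0.5, 1.0, 1.5, 2.0, 3.0])   # measured load average
run_time = np.array([101, 118, 148, 170, 205, 260])    # measured run time (s)

slope, intercept = np.polyfit(load_avg, run_time, 1)
print(f"run_time ~= {intercept:.1f} + {slope:.1f} * load_avg")
print("predicted at load 2.5:", round(intercept + slope * 2.5, 1), "s")
```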

12.
The performance evaluation, workload characterization, and trace-driven simulation of a hypercube multicomputer running realistic workloads are presented. Eleven representative parallel applications were selected as benchmarks. Software monitoring techniques were then used to collect execution traces. Based on the measurement results, both the computation and communication behavior of these parallel programs were investigated. The various time interval distributions were modeled by statistical functions which were verified by a nonlinear regression technique using the empirical data. The temporal and spatial localities of message destinations were also studied. A model for the temporal locality of message length was introduced and used to analyze the communication traces. A trace-driven simulation environment, which uses the communication patterns of the parallel programs as inputs, was developed to study the behavior of the communication hardware under real workloads. Simulation results on DMA and link utilizations are reported.
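A sketch of fitting a statistical function to an empirical interval distribution with nonlinear regression; the inter-arrival times are synthetic and the exponential model is an assumed example, not the distribution used in the paper:

```python
# Fit an exponential CDF to the empirical CDF of synthetic message gaps.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
intervals = rng.exponential(scale=2.0, size=500)        # synthetic gaps (ms)

x = np.sort(intervals)
ecdf = np.arange(1, len(x) + 1) / len(x)                 # empirical CDF

def exp_cdf(t, rate):
    return 1.0 - np.exp(-rate * t)

(rate,), _ = curve_fit(exp_cdf, x, ecdf, p0=[1.0])       # nonlinear least squares
print(f"fitted rate = {rate:.3f} per ms (mean gap ~ {1/rate:.2f} ms)")
```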

13.
This paper describes compiler techniques that can translate standard OpenMP applications into code for distributed computer systems. OpenMP has emerged as an important model and language extension for shared-memory parallel programming. However, despite OpenMP's success on these platforms, it is not currently being used on distributed systems. The long-term goal of our project is to quantify the degree to which such a use is possible and to develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have studied a number of hand optimizations that improve the baseline performance. Our approach complements related efforts that have proposed language extensions for efficient execution of OpenMP programs on distributed systems. Our results show that, while kernel benchmarks can show high efficiency of OpenMP programs on distributed systems, full applications need careful consideration of shared data access patterns. A naive translation (similar to OpenMP compilers for SMPs) leads to acceptable performance in very few applications only. However, additional optimizations, including access privatization, selective touch, and dynamic scheduling, result in a 31% average improvement on our benchmarks.
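A sketch of the baseline idea behind translating a shared-memory parallel loop for a distributed system: each process executes a contiguous block of the iteration space. The blocking scheme and names are illustrative and do not represent the compiler's actual output:

```python
# Block-distribute an OpenMP-style "parallel for" iteration space over
# `nprocs` processes; each rank computes only its own contiguous block.
def block_range(n_iters, rank, nprocs):
    """Contiguous block of [0, n_iters) owned by process `rank`."""
    per = (n_iters + nprocs - 1) // nprocs
    lo = rank * per
    hi = min(n_iters, lo + per)
    return range(lo, hi)

n, nprocs = 10, 4
for rank in range(nprocs):
    print(rank, list(block_range(n, rank, nprocs)))
# 0 [0, 1, 2]  1 [3, 4, 5]  2 [6, 7, 8]  3 [9]
```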

14.
Parallel computing scalability evaluates the extent to which parallel programs and architectures can effectively utilize increasing numbers of processors. In this paper, we compare a group of existing scalability metrics and evaluation models with an experimental metric which uses network latency to measure and evaluate the scalability of parallel programs and architectures. To provide insight into dynamic system performance, we have developed an integrated software environment prototype for measuring and evaluating multiprocessor scalability performance, called Scale-Graph. Scale-Graph uses a graphical instrumentation monitor to collect, measure and analyze latency-related data, and to display scalability performance based on various program execution patterns. The graphical software tool is X-Windows based and is currently implemented on standard workstations to analyze performance data of the KSR-1, a hierarchical ring-based shared-memory architecture.

15.
We study the problem of developing a set of syntax-driven transformations for automatic translation of shared memory parallel programs into sequential programs. The result is a sequential program that is semantically equivalent to the original program. Consequently, the problem of debugging parallel programs is reduced to the problem of debugging sequential programs. Moreover, the efficiency of parallel programs can be increased by sequentializing code segments that include extra parallelism. The main difficulty in developing such a system is to preserve the fairness property of any actual parallel execution, which states that no process can wait forever unserved. Thus, non-termination of the sequential version (namely an infinite loop) is allowed only if there is at least one fair parallel execution that does not halt as well (i.e., a process that executes an infinite loop whose termination does not depend on any other process). We show that it is sufficient to consider the case of two sequential programs executed in parallel in order to solve the general case. We then describe several types of transformations and check their ability to preserve fairness. The results regarding the existence of such a transformation for general parallel programs are not conclusive; however, we do show that restricted cases (which are likely to appear in practice) can be sequentialized using this set of transformations.
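A toy illustration of fair sequential interleaving, modelling two processes as Python generators and scheduling them round-robin so that neither waits forever; this is an analogy for the fairness requirement, not one of the paper's transformations:

```python
# Round-robin interleaving of two "processes": each gets a turn on every pass,
# so no process is starved, mirroring the fairness property discussed above.
def proc(name, steps):
    for i in range(steps):
        yield f"{name} step {i}"

def interleave(p, q):
    done = [False, False]
    procs = [p, q]
    while not all(done):
        for i, g in enumerate(procs):
            if not done[i]:
                try:
                    print(next(g))
                except StopIteration:
                    done[i] = True

interleave(proc("A", 3), proc("B", 2))
```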

16.
In the area of parallelizing compilers, considerable research has been carried out on data dependency analysis, parallelism extraction, as well as program and data partitioning. However, designing a practical, low complexity scheduling algorithm without sacrificing performance remains a challenging problem. A variety of heuristics have been proposed to generate efficient solutions but they take prohibitively long execution times for moderate-size or large problems. In this paper, we propose an algorithm called FASTEST (Fast Assignment and Scheduling of Tasks using an Efficient Search Technique) that has O(e) time complexity, where e is the number of edges in the task graph. The algorithm first generates an initial solution in a short time and then refines it by using a simple but robust random neighborhood search. We have also parallelized the search to further lower the time complexity. We are using the algorithm in a prototype automatic parallelization and scheduling tool which compiles sequential code and generates parallel code optimized with judicious scheduling. The proposed algorithm is evaluated with several application programs and outperforms a number of previous algorithms by generating parallelized code with shorter execution times, while taking dramatically shorter scheduling times. The FASTEST algorithm generates optimal solutions for a majority of the test cases and close-to-optimal solutions for the rest.
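A toy version of the two-phase idea, an initial schedule refined by a random neighborhood search; the task graph, costs, and scheduling rules are illustrative and not the FASTEST algorithm:

```python
# Assign a small task DAG to P processors, then improve the assignment by
# randomly moving one task at a time and keeping changes that shorten the
# schedule (communication costs are ignored in this toy model).
import random

random.seed(3)
cost = {"a": 4, "b": 3, "c": 3, "d": 2, "e": 5}
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": []}
topo = ["a", "e", "b", "c", "d"]                # a valid topological order
P = 2

def makespan(assign):
    """Finish time of a list schedule that respects dependencies."""
    free = [0.0] * P                            # next free time per processor
    finish = {}
    for t in topo:
        ready = max((finish[d] for d in deps[t]), default=0.0)
        start = max(ready, free[assign[t]])
        finish[t] = start + cost[t]
        free[assign[t]] = finish[t]
    return max(finish.values())

best = {t: 0 for t in topo}                     # naive initial: everything on P0
best_len = makespan(best)
for _ in range(200):                            # random neighborhood search
    cand = dict(best)
    cand[random.choice(topo)] = random.randrange(P)
    if makespan(cand) < best_len:
        best, best_len = cand, makespan(cand)

print(best, best_len)
```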

17.
Symbolic execution is a well-known program analysis technique which represents program inputs with symbolic values instead of concrete, initialized data and executes the program by manipulating program expressions involving the symbolic values. Symbolic execution was proposed over three decades ago, but it has recently found renewed interest in the research community, due in part to progress in decision procedures, the availability of powerful computers, and new algorithmic developments. We provide here a survey of some of the new research trends in symbolic execution, with particular emphasis on applications to test generation and program analysis. We first describe an approach that handles complex programming constructs such as input recursive data structures, arrays, as well as multithreading. Furthermore, we describe recent hybrid techniques that combine concrete and symbolic execution to overcome some of the inherent limitations of symbolic execution, such as handling native code or the availability of decision procedures for the application domain. We follow with a discussion of techniques that can be used to limit the (possibly infinite) number of symbolic configurations that need to be analyzed for the symbolic execution of looping programs. Finally, we give a short survey of interesting new applications, such as predictive testing, invariant inference, program repair, analysis of parallel numerical programs, and differential symbolic execution.
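A minimal toy symbolic executor over a two-path program, collecting path conditions and symbolic results as strings; real engines build such conditions as constraints and use decision procedures to check feasibility and generate tests:

```python
# Symbolically "execute" a tiny program on a symbolic input x:
#     y = x * 2
#     if y > 10: y = y - 10
#     return y
def symbolic_paths():
    x = "x"                                          # symbolic input
    y = f"({x} * 2)"
    paths = []
    for taken, cond in ((True, f"{y} > 10"), (False, f"not ({y} > 10)")):
        y_after = f"({y} - 10)" if taken else y      # effect of the branch
        paths.append({"path_condition": cond, "result": y_after})
    return paths

for p in symbolic_paths():
    print(p)
# {'path_condition': '(x * 2) > 10', 'result': '((x * 2) - 10)'}
# {'path_condition': 'not ((x * 2) > 10)', 'result': '(x * 2)'}
```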

18.
19.
Multi-core processors are widely used in high-performance computing, but how to effectively convert traditional serial programs into parallel code and reduce the time spent in nested loops remains a challenging problem in this field. This paper first performs dependence analysis and tiling of nested loops based on the polyhedral model, and automatically generates coarse-grained parallel code from the result. Considering the structural characteristics of multi-core array processors, a genetic algorithm is used to generate a communication-optimized sequence of tile tasks, and an effective task scheduling model is built on this basis. Finally, the method is applied to LU decomposition; the results show that, compared with traditional scheduling algorithms, it achieves better data locality and load balance.
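A highly simplified stand-in for the communication-aware ordering step: a mutation-only evolutionary search (a full genetic algorithm would also use crossover) over an invented tile-sharing relation and cost model, not the paper's scheduling model:

```python
# Order tile tasks so that consecutive tasks share data (toy proxy for
# reduced communication); refine a random order by swap mutations.
import random

random.seed(7)
tiles = list(range(8))
# shares[i][j] is True if tiles i and j touch common data (toy relation).
shares = [[abs(i - j) == 1 for j in tiles] for i in tiles]

def comm_cost(order):
    """Count adjacent pairs in the schedule that share no data."""
    return sum(not shares[a][b] for a, b in zip(order, order[1:]))

best = tiles[:]
random.shuffle(best)
for _ in range(500):
    cand = best[:]
    i, j = random.sample(range(len(cand)), 2)
    cand[i], cand[j] = cand[j], cand[i]          # swap mutation
    if comm_cost(cand) < comm_cost(best):
        best = cand

print(best, "cost:", comm_cost(best))
```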

20.
刘金硕  黄朔  邓娟 《计算机工程》2022,48(12):16-23
Using high-resolution images as input to image processing algorithms slows them down. Parallelizing such algorithms improves execution efficiency, but manually converting serial programs into parallel ones is tedious, existing automatic parallelizing translators perform unreliably, and the translated programs use only a single parallel mode. Targeting the patch-based multi-view stereo (PMVS) algorithm, an automatic two-level parallel translation method from C to CUDA is proposed. ANTLR is used to parse the source C code automatically; parallelizable loop structures are identified by analyzing data dependences and applying loop array privatization, and the algorithm is translated into code with a two-level parallel structure of CPU multithreading and GPU execution. During execution, the input images are processed on the CPU and GPU separately, reducing the total execution time of the algorithm. Experimental results show that the computational speedup of this method increases with the input image resolution, reaching a maximum of about 32, a clear improvement over the PPCG and OpenACC automatic parallel translation methods.
