首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 125 毫秒
1.
随着高性能计算技术的不断发展,并行程序的设计、调试、优化逐渐成为并行技术应用的关键,而性能工具在提高并行程序的执行效率方面发挥着重要的作用。本文阐述了并行程序性能工具的系统结构,以及各个模块功能的原理,并实现了一个基于MPI消息统计的性能工具。  相似文献   

2.
为了便于用户快速、直观地了解到机群系统中并行应用程序的性能情况,将Linux计算机群与Windows控制显示平台相结合,提出了一种基于事件的异构平台并行程序性能可视化方法.该方法以MPI作为底层编程环境,在高层使用MPE技术,依据动态性能检测方式获取程序执行过程信息;设计C#语言及Jumpshot日志图形化分析集成工具实现并行程序性能可视化.实验结果表明,该方法可准确,直观地反映程序性能信息,有助于程序员简便、有效地对并行程序进行量化分析,对提高机群系统的可用性、改善程序性能及效率等方面具有较高的实用价值.  相似文献   

3.
目前,随着大规模并行计算的高速发展,并行程序性能分析与建模的地位日益重要,而并行程序性能数据的采集是进行性能分析的基础。硬件计数器的使用使人们能够更加便利地在程序执行过程中采集性能数据。文章讨论了OpenMP并行程序的性能数据采集技术,并介绍一种利用PAPI进行数据采集的实现方法。  相似文献   

4.
VENUS:一个通用的并行性能可视化环境   总被引:1,自引:0,他引:1  
本文介绍了一个通用的并行程序性能可视化环境VENUS。在分析当前并行性能可视化工具不足的基础上,VENUS系统采用了基于可扩展的多层性能视图模型的可视化方法,并改进了PVM的跟踪机制以支持性能可视化分析与程序源代码的直接对应。实验表明,VENUS系统能够有效地帮助发现并行程序中的性能瓶颈。  相似文献   

5.
为了便于对异构平台下的并行程序性能进行分析,在对可视化技术和并行计算与控制显示平台研究的基础上设计了一种异构环境下的性能可视化模型.针对该模型的特点利用监测代码插桩技术、性能数据事后分析等方法,给出了并行性能数据获取、转换与绘图的具体方法和实现过程,为跨平台并行性能数据的采集和转换提供了一种简便方法.实验结果表明了在异构环境下该方法对并行性能数据可视化的可行性与有效性.  相似文献   

6.
随着多核/众核处理器技术的快速发展,程序需要越来越多地采用多线程并行技术以提升性能.随着线程个数的增多,线程并行运行过程中相互间同步/互斥及资源竞争关系更加复杂,导致程序性能优化的难度增大.为了使编程人员直观地了解线程的动态运行过程,特别是线程间同步及资源共享带来的影响,帮助其进行程序性能优化,设计实现了一种面向Pthread的并行程序线程性能分析工具PPAT(Pthreadsprogramanalysistool),该工具可在程序运行过程中动态获取线程运行及线程间互斥/同步信息,生成线程通信图,并以多种可视化的方法显示,为编程人员优化程序性能提供依据.  相似文献   

7.
基于事件跟踪的并行程序性能分析,就是通过分析各处理器采集的事件记录、计算程序对象的执行时间和探究事件间的相互关系,来揭示程序的性能问题。这一工作要求各处理器采集的事件时戳必须具有可比性。由于各种原因,通过测量获得的事件时戳往往是不同步的,这直接影响性能分析工作的开展。介绍处理器时钟误差的概念及产生原因、并行程序性能分析中的测量误差、时钟条件和时戳同步需求,最后介绍一种基于恒定时钟漂移的线性误差插值技术,在一定程度上解决了并行程序性能分析中的时戳同步问题。  相似文献   

8.
朱鹏  李巍  李云春 《软件学报》2010,21(Z1):284-289
随着超级计算机的发展,其使用到的核心数逐渐达到数十万,而且运行于其上的应用的复杂性也不断加大.因此,开发人员需要对并行应用的性能进行测量,并做出分析,以便对程序源码进行优化,提高程序的执行效率.但是由于核心数的大量增加,对并行程序性能进行测量将得到海量的性能数据,如何处理海量性能数据,以便分析并行程序性能成为一个难点.介绍了一种基于迭代聚类的并行应用性能分析方法,该方法使用数据挖掘的聚类算法处理处理海量性能数据,并可以根据条件迭代执行,确定影响并行程序性能的函数和进程,然后通过贝叶斯信息准则评价聚类结果,以确定迭代聚类的可靠性,最后用实验证明了方法的有效性.  相似文献   

9.
王海兵 《计算机应用》2011,31(Z1):172-173,176
通过重载MPI消息传递函数,在重载的MPI函数中调用MPE库中各日志记录函数,实现了大规模面向对象有限元程序自定义并行性能监测。对一个典型冲击动力学问题进行了16 CPU的并行有限元模拟,通过并行性能监测对其有限元并行算法进行了分析。  相似文献   

10.
陈诗然  胡凯  张伟  张璐 《计算机工程》2008,34(13):75-77
介绍一种多集群计算模式,在分析了多集群系统结构灵活、具有可重组性等特点的基础上,研究适用于该模式的并行作业性能监测分析方法与技术,设计和实现了一个并行作业性能监测分析工具。它采用动态性能分析方法,遵循分布式软件设计架构,具有高内聚、低耦合的模块组织结构,运行验证表明其能够在多集群计算模式下有效工作。  相似文献   

11.
This paper describes an approach to carry out performance analysis of parallel embedded applications. The approach is based on measurement, but in addition, the idea of driving the measurement process (application instrumentation and monitoring) by a behavioral model is introduced. Using this model, highly comprehensible performance information can be collected. The whole approach is based on this behavioral model, one instrumentation method and two tools, one for monitoring and the other for visualization and analysis. Each of these is briefly described, and the steps to carry out performance analysis using them are clearly defined. They are explained by means of a case study. Finally, one method to evaluate the intrusiveness of the monitoring approach is proposed, and the intrusiveness results for the case study are presented.  相似文献   

12.
该文在对并行调试技术进行深入分析的基础上,重点研究了基于事件分析的并行调试与监测分析技术,并对其设计与实现方法进行了详细探讨。  相似文献   

13.
近年来,随着并行编程的普及,性能监测和剖析已经成为计算机系统领域最重要的研究课题之一。PMU(Performance Monitoring Unit),即现代处理器里集成的微体系事件性能计数器,为性能监测提供了底层支持,使得在以极小的额外开销和极少的对目标程序的干扰的情况下对程序进行性能监测成为可能。Pview(Performance View)是一种在系统级支持对并行程序尤其是多线程程序进行性能监测与分析的工具,它同时支持全系统和针对特定进程(线程组)的性能事件直接计数或者抽样的分析方法。Pview在Linux操作系统平台上通过扩展内核2. 6. 30,实现了一个新的系统调用Pvicw来提供性能监测服务;同时与以模块方式实现的数据收集引擎协作,可以实现抽样并将大规模样本数据传输到用户空间供进一步分析。  相似文献   

14.
OP2 is an “active” library framework for the solution of unstructured mesh applications. It aims to decouple the specification of a scientific application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents the design of the current OP2 library for generating efficient code targeting contemporary GPU platforms. In this we focus on some of the software architecture design choices and low-level optimizations to maximize performance on NVIDIA’s Fermi architecture GPUs. The performance impact of these design choices is quantified on two NVIDIA GPUs (GTX560Ti, Tesla C2070) using the end-to-end performance of an industrial representative CFD application developed using the OP2 API. Results show that for each system, a number of key configuration parameters need to be set carefully in order to gain good performance. Utilizing a recently developed auto-tuning framework, we explore the effect of these parameters, their limitations and insights into optimizations for improved performance.  相似文献   

15.
张斌 《微机发展》1998,8(2):10-13
分析讨论了在分布式系统环境下分布式程序的功能行为和性能方法 ;指出事件驱动监视是一种非常适于分析并行和分布式系统上程序的技术。通过分析 Transputer多处理机上运行的并行光线跟踪程序 ,说明了性能监视所起的巨大作用。  相似文献   

16.
网格计算通过网络连接来获得一个高性能和高效的计算平台。网格网络的监测和性能测量为网格性能分析、负载平衡、任务调度等提供了重要的科学依据,而成为大规模网格服务的关键组件。现有的几种网格监测方法因缺乏对监测数据的推断分析而无法对网格网络的性能进行测量。通过对网格网络性能测量的特点、GloPerf及传统网络测量技术的分析,提出了基于网络断层扫描的网格网络性能测量方法。研究结果为网格网络性能的测量提供了新的途径。  相似文献   

17.
Current technologies allow efficient data collection by several sensors to determine an overall evaluation of the status of a cluster. However, no previous work of which we are aware analyzes the behavior of the parallel programs themselves in real time. In this paper, we perform a comparison of different artificial intelligence techniques that can be used to implement a lightweight monitoring and analysis system for parallel applications on a cluster of Linux workstations. We study the accuracy and performance of deterministic and stochastic algorithms when we observe the flow of both library‐function and operating‐system calls of parallel programs written with C and MPI. We demonstrate that monitoring of MPI programs can be achieved with high accuracy and in some cases with a false‐positive rate near 0% in real time, and we show that the added computational load on each node is small. As an example, the monitoring of function calls using a hidden Markov model generates less than 5% overhead. The proposed system is able to automatically detect deviations of a process from its expected behavior in any node of the cluster, and thus it can be used as an anomaly detector, for performance monitoring to complement other systems or as a debugging tool. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

18.
The most computationally demanding scientific problems are solved with large parallel systems. In some cases these systems are Non-Uniform Memory Access (NUMA) multiprocessors made up of a large number of cores which share a hierarchically organized memory. The main basic component of these scientific codes is often matrix multiplication, and the efficient development of other linear algebra packages is directly based on the matrix multiplication routine implemented in the BLAS library. BLAS library is used in the form of packages implemented by the vendors or free implementations. The latest versions of this library are multithreaded and can be used efficiently in multicore systems, but when they are used inside parallel codes, the two parallelism levels can interfere and produce a degradation of the performance. In this work, an auto-tuning method is proposed to select automatically the optimum number of threads to use at each parallel level when multithreaded linear algebra routines are called from OpenMP parallel codes. The method is based on a simple but effective theoretical model of the execution time of the two-level routines. The methodology is applied to a two-level matrix–matrix multiplication and to different matrix factorizations (LU, QR and Cholesky) by blocks. Traditional schemes which directly use the multithreaded routine of BLAS, dgemm, are compared with schemes combining the multithreaded dgemm with OpenMP.  相似文献   

19.
The emergence of high performance 3D graphics cards has opened the way to PC clusters for high performance multi- display environment.In order to exploit the rendering ability of PC clusters,we should design appropriate parallel rendering algorithms and parallel graphics library interfaces.Due to the rapid development of Direct3D,we bring forward DPGL,the Direct3D9-based parallel graphics library in D3DPR parallel rendering system,which implements Direct3D9 interfaces to support existing Direct3D9 application parallelization with no modification.Based on the parallelism analysis of Direct3D9 rendering pipeline,we briefly introduce D3DPR parallel rendering system.DPGL is the fundamental component of D3DPR.After presenting DPGL three layers architecture, we discuss the rendering resource interception and management.Finally,we describe the design and implementation of DPGL in detail, including rendering command interception layer,rendering command interpretation layer and rendering resource parallelization layer.  相似文献   

20.
Automatic performance debugging of parallel applications includes two main steps: locating performance bottlenecks and uncovering their root causes for performance optimization. Previous work fails to resolve this challenging issue in two ways: first, several previous efforts automate locating bottlenecks, but present results in a confined way that only identifies performance problems with a priori knowledge; second, several tools take exploratory or confirmatory data analysis to automatically discover relevant performance data relationships, but these efforts do not focus on locating performance bottlenecks or uncovering their root causes.The simple program and multiple data (SPMD) programming model is widely used for both high performance computing and Cloud computing. In this paper, we design and implement an innovative system, AutoAnalyzer, that automates the process of debugging performance problems of SPMD-style parallel programs, including data collection, performance behavior analysis, locating bottlenecks, and uncovering their root causes. AutoAnalyzer is unique in terms of two features: first, without any prior knowledge, it automatically locates bottlenecks and uncovers their root causes for performance optimization; second, it is lightweight in terms of the size of performance data to be collected and analyzed. Our contributions are three-fold: first, we propose two effective clustering algorithms to investigate the existence of performance bottlenecks that cause process behavior dissimilarity or code region behavior disparity, respectively; meanwhile, we present two searching algorithms to locate bottlenecks; second, on the basis of the rough set theory, we propose an innovative approach to automatically uncover root causes of bottlenecks; third, on the cluster systems with two different configurations, we use two production applications, written in Fortran 77, and one open source code—MPIBZIP2 (http://compression.ca/mpibzip2/), written in C++, to verify the effectiveness and correctness of our methods. For three applications, we also propose an experimental approach to investigating the effects of different metrics on locating bottlenecks.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号