1.

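Peng Chao, Chen Huarong 《微计算机信息》 (Microcomputer Information) 2006,22(6):220-221

As high-performance computing technology continues to advance, the design, debugging, and optimization of parallel programs have become key to applying parallel techniques, and performance tools play an important role in improving the execution efficiency of parallel programs. This paper describes the system architecture of a parallel program performance tool and the principles behind each module's function, and implements a performance tool based on MPI message statistics.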
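As a rough illustration of what "MPI message statistics" can mean in practice, the sketch below aggregates message counts and byte totals per sender/receiver pair. The record format `(src_rank, dst_rank, num_bytes)` and the function name are assumptions made for this example; the abstract does not describe the tool's actual data format or interface.

```python
from collections import defaultdict


def summarize_messages(records):
    """Aggregate message counts and byte totals per (source, destination) pair.

    Each record is a hypothetical (src_rank, dst_rank, num_bytes) tuple, such as
    a profiling wrapper around a send call might log. This layout is assumed.
    """
    stats = defaultdict(lambda: {"count": 0, "bytes": 0})
    for src, dst, num_bytes in records:
        entry = stats[(src, dst)]
        entry["count"] += 1
        entry["bytes"] += num_bytes
    return dict(stats)


# Three logged messages between two ranks (made-up values).
records = [(0, 1, 1024), (0, 1, 2048), (1, 0, 512)]
summary = summarize_messages(records)
for (src, dst), entry in sorted(summary.items()):
    print(f"{src} -> {dst}: {entry['count']} messages, {entry['bytes']} bytes")
```

A per-pair table like this is often enough to spot unbalanced communication patterns before looking at a full trace.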

2.

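Event-based performance visualization of parallel programs on heterogeneous platforms

Gu Hui, Zheng Xiaowei, Shen Anlai, Lu Wenhui 《计算机工程与设计》 (Computer Engineering and Design) 2010,31(24)

To let users quickly and intuitively understand the performance of parallel applications on a cluster system, this work combines a Linux compute cluster with a Windows control and display platform and proposes an event-based method for visualizing parallel program performance on heterogeneous platforms. The method uses MPI as the underlying programming environment and MPE at the higher level, obtaining program execution information through dynamic performance monitoring. An integrated tool combining C# with Jumpshot graphical log analysis is designed to visualize parallel program performance. Experimental results show that the method reflects program performance information accurately and intuitively, helps programmers analyze parallel programs quantitatively in a simple and effective way, and has considerable practical value for improving the usability of cluster systems and for improving program performance and efficiency.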

3.

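Performance data collection for OpenMP parallel programs

Fu Hongyi, Zhou Haifang, Yang Xuejun 《计算机工程》 (Computer Engineering) 2005,31(19):67-69,78

With the rapid development of large-scale parallel computing, performance analysis and modeling of parallel programs are becoming increasingly important, and collecting performance data from parallel programs is the basis of performance analysis. Hardware counters make it more convenient to collect performance data during program execution. This article discusses performance data collection techniques for OpenMP parallel programs and presents an implementation that uses PAPI for data collection.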

4.

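VENUS: a general-purpose parallel performance visualization environment Total citations: 1, self-citations: 0, citations by others: 1

Shi Xiaohong, Zhao Yinliang 《小型微型计算机系统》 (Mini-Micro Systems) 1998,19(12):1-7

This paper presents VENUS, a general-purpose performance visualization environment for parallel programs. Building on an analysis of the shortcomings of current parallel performance visualization tools, VENUS adopts a visualization method based on an extensible multi-layer performance view model, and improves PVM's tracing mechanism so that performance visualization analysis maps directly to program source code. Experiments show that VENUS effectively helps to find performance bottlenecks in parallel programs.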

5.

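A performance visualization method for parallel programs on heterogeneous platforms

Zheng Xiaowei, Gu Hui 《计算机工程与设计》 (Computer Engineering and Design) 2010,31(4)

To facilitate performance analysis of parallel programs on heterogeneous platforms, a performance visualization model for heterogeneous environments is designed, based on a study of visualization techniques and of parallel computing and control/display platforms. Using techniques suited to the model, such as instrumentation with monitoring code and post-mortem analysis of performance data, the paper gives concrete methods and an implementation for acquiring, converting, and plotting parallel performance data, providing a simple way to collect and convert parallel performance data across platforms. Experimental results demonstrate the feasibility and effectiveness of the method for visualizing parallel performance data in heterogeneous environments.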

6.

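PPAT: a thread performance analysis tool for Pthread parallel programs

Wen Shasha, Liu Yi, Liu Tao, Song Ping, Li Bo, Qian Depei 《计算机应用与软件》 (Computer Applications and Software) 2012,(11)

With the rapid development of multi-core and many-core processors, programs increasingly need multithreaded parallelism to improve performance. As the number of threads grows, synchronization, mutual exclusion, and resource contention among concurrently running threads become more complex, making performance optimization harder. To give programmers an intuitive view of how threads run dynamically, especially the effects of inter-thread synchronization and resource sharing, and to help them optimize performance, the authors design and implement PPAT (Pthreads program analysis tool), a thread performance analysis tool for Pthread parallel programs. The tool dynamically collects information on thread execution and on inter-thread mutual exclusion and synchronization while the program runs, generates a thread communication graph, and presents it through several visualizations, giving programmers a basis for optimizing program performance.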

7.

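Timestamp synchronization in parallel program performance analysis: problems and methods

Du Zhuping, Yu Lei, Li Zhibo, Hou Xuemei, Lian Baiyou 《计算机应用与软件》 (Computer Applications and Software) 2012,(1):298-300

Event-trace-based performance analysis of parallel programs reveals performance problems by analyzing the event records collected on each processor, computing the execution time of program objects, and examining the relationships between events. This requires the event timestamps collected on different processors to be comparable. For various reasons, measured event timestamps are often not synchronized, which directly hampers performance analysis. The paper introduces the concept and causes of processor clock error, measurement error in parallel program performance analysis, the clock condition, and timestamp synchronization requirements, and finally presents a linear error interpolation technique based on constant clock drift, which solves the timestamp synchronization problem in parallel program performance analysis to a certain extent.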
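The idea of linear interpolation under a constant-drift assumption can be sketched as follows. This is a generic illustration, not the authors' implementation: it assumes the offset between a reference clock and the local clock has been measured twice (for example at program start and end), and that the offset changes linearly in between. All names and values are hypothetical.

```python
def make_corrector(t_start, offset_start, t_end, offset_end):
    """Build a function that maps local timestamps onto the reference clock.

    offset_* is (reference time - local time), measured at local times t_start
    and t_end. Constant drift means the offset is linear between the two points.
    """
    if t_end == t_start:
        raise ValueError("the two offset measurements must be taken at different times")
    drift = (offset_end - offset_start) / (t_end - t_start)

    def correct(local_ts):
        # Interpolated offset at local_ts, added to the local timestamp.
        return local_ts + offset_start + drift * (local_ts - t_start)

    return correct


# Offset grows from 2.0 s to 3.0 s over 100 s of local time (made-up values).
correct = make_corrector(t_start=0.0, offset_start=2.0, t_end=100.0, offset_end=3.0)
corrected = [correct(t) for t in (0.0, 50.0, 100.0)]
print(corrected)
```

Real clocks do not drift at a perfectly constant rate, which is consistent with the abstract's claim that the technique solves the problem only to a certain extent.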

8.

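An iterative-clustering-based performance analysis method for parallel applications

Zhu Peng, Li Wei, Li Yunchun 《软件学报》 (Journal of Software) 2010,21(Z1):284-289

As supercomputers evolve, the number of cores they use has grown to hundreds of thousands, and the applications running on them are becoming more complex. Developers therefore need to measure and analyze the performance of parallel applications in order to optimize the source code and improve execution efficiency. However, with so many cores, measuring parallel program performance produces massive amounts of performance data, and processing that data for analysis becomes a difficult problem. This paper presents a performance analysis method for parallel applications based on iterative clustering. The method uses clustering algorithms from data mining to process massive performance data and can be run iteratively according to given conditions to identify the functions and processes that affect parallel program performance. It then evaluates the clustering results with the Bayesian information criterion to establish the reliability of the iterative clustering. Finally, experiments demonstrate the effectiveness of the method.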

9.

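Parallel performance monitoring of large-scale object-oriented finite element programs

Wang Haibing 《计算机应用》 (Journal of Computer Applications) 2011,31(Z1):172-173,176

By overloading MPI message-passing functions and calling the logging functions of the MPE library inside the overloaded functions, custom parallel performance monitoring is implemented for a large-scale object-oriented finite element program. A typical impact dynamics problem is simulated with parallel finite elements on 16 CPUs, and the parallel finite element algorithm is analyzed through the parallel performance monitoring.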

10.

Performance monitoring and analysis of multi-cluster parallel jobs

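Chen Shiran, Hu Kai, Zhang Wei, Zhang Lu 《计算机工程》 (Computer Engineering) 2008,34(13):75-77

This paper introduces a multi-cluster computing model. Based on an analysis of the characteristics of multi-cluster systems, such as their flexible structure and reconfigurability, it studies methods and techniques for monitoring and analyzing parallel job performance under this model, and designs and implements a parallel job performance monitoring and analysis tool. The tool uses dynamic performance analysis, follows a distributed software architecture, and has a modular organization with high cohesion and low coupling. Test runs show that it works effectively in the multi-cluster computing model.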

11.

Model-driven monitoring support for the multi-view performance analysis of parallel embedded applications Total citations: 1, self-citations: 0, citations by others: 1

J. García, J. Entrialgo, F. J. Suárez, D. F. García 《Performance Evaluation》2000,39(1-4):81-98

This paper describes an approach to carry out performance analysis of parallel embedded applications. The approach is based on measurement, but in addition, the idea of driving the measurement process (application instrumentation and monitoring) by a behavioral model is introduced. Using this model, highly comprehensible performance information can be collected. The whole approach is based on this behavioral model, one instrumentation method and two tools, one for monitoring and the other for visualization and analysis. Each of these is briefly described, and the steps to carry out performance analysis using them are clearly defined. They are explained by means of a case study. Finally, one method to evaluate the intrusiveness of the monitoring approach is proposed, and the intrusiveness results for the case study are presented.

12.

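Event-analysis-based parallel debugging and monitoring analysis techniques

Zhang Huicheng, Wang Hua, Du Zhuping, Wei Hong 《计算机工程与应用》 (Computer Engineering and Applications) 2002,38(19):45-47

Based on an in-depth analysis of parallel debugging techniques, this paper focuses on event-analysis-based parallel debugging and monitoring analysis techniques, and discusses their design and implementation in detail.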

13.

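Pview: a new PMU-based method for supporting performance analysis of parallel programs

Yan Jie, Xu Hengyang, An Hong, Liu Yu, Wang Yaobin 《计算机科学》 (Computer Science) 2011,38(2):288-292

In recent years, as parallel programming has become widespread, performance monitoring and profiling have become one of the most important research topics in computer systems. The PMU (Performance Monitoring Unit), the set of microarchitectural event performance counters integrated into modern processors, provides low-level support for performance monitoring, making it possible to monitor a program with very little overhead and very little perturbation of the target program. Pview (Performance View) is a tool that supports system-level performance monitoring and analysis of parallel programs, especially multithreaded ones. It supports both direct counting and sampling of performance events, either system-wide or for a specific process (thread group). On Linux, Pview extends kernel 2.6.30 with a new system call, Pview, that provides the performance monitoring service. Working together with a data collection engine implemented as a module, it can sample and transfer large volumes of sample data to user space for further analysis.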

14.

Designing OP2 for GPU architectures

M.B. Giles, G.R. Mudalige, B. Spencer, C. Bertolli, I. Reguly 《Journal of Parallel and Distributed Computing》2013

OP2 is an “active” library framework for the solution of unstructured mesh applications. It aims to decouple the specification of a scientific application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents the design of the current OP2 library for generating efficient code targeting contemporary GPU platforms. In this we focus on some of the software architecture design choices and low-level optimizations to maximize performance on NVIDIA’s Fermi architecture GPUs. The performance impact of these design choices is quantified on two NVIDIA GPUs (GTX560Ti, Tesla C2070) using the end-to-end performance of an industrial representative CFD application developed using the OP2 API. Results show that for each system, a number of key configuration parameters need to be set carefully in order to gain good performance. Utilizing a recently developed auto-tuning framework, we explore the effect of these parameters, their limitations and insights into optimizations for improved performance.

15.

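Performance monitoring of distributed programs and its applications

Zhang Bin 《微机发展》 (Microcomputer Development) 1998,8(2):10-13

The paper analyzes and discusses the functional behavior of distributed programs in distributed system environments and methods for analyzing their performance, and points out that event-driven monitoring is a technique well suited to analyzing programs on parallel and distributed systems. An analysis of a parallel ray-tracing program running on a Transputer multiprocessor illustrates the major role that performance monitoring plays.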

16.

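Grid network performance measurement and analysis based on network tomography

Wang Wei, Cai Wandong, Li Yongjun 《计算机科学》 (Computer Science) 2007,34(5):45-47

Grid computing uses network connections to obtain a high-performance, efficient computing platform. Monitoring and performance measurement of grid networks provide an important scientific basis for grid performance analysis, load balancing, task scheduling, and related work, and have become key components of large-scale grid services. Several existing grid monitoring methods cannot measure grid network performance because they lack inferential analysis of the monitoring data. Based on an analysis of the characteristics of grid network performance measurement, of GloPerf, and of traditional network measurement techniques, the paper proposes a grid network performance measurement method based on network tomography. The results provide a new approach to measuring grid network performance.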

17.

Lightweight monitoring of MPI programs in real time

German Florez, Zhen Liu, Susan M. Bridges, Anthony Skjellum, Rayford B. Vaughn 《Concurrency and Computation》2005,17(13):1547-1578

Current technologies allow efficient data collection by several sensors to determine an overall evaluation of the status of a cluster. However, no previous work of which we are aware analyzes the behavior of the parallel programs themselves in real time. In this paper, we perform a comparison of different artificial intelligence techniques that can be used to implement a lightweight monitoring and analysis system for parallel applications on a cluster of Linux workstations. We study the accuracy and performance of deterministic and stochastic algorithms when we observe the flow of both library‐function and operating‐system calls of parallel programs written with C and MPI. We demonstrate that monitoring of MPI programs can be achieved with high accuracy and in some cases with a false‐positive rate near 0% in real time, and we show that the added computational load on each node is small. As an example, the monitoring of function calls using a hidden Markov model generates less than 5% overhead. The proposed system is able to automatically detect deviations of a process from its expected behavior in any node of the cluster, and thus it can be used as an anomaly detector, for performance monitoring to complement other systems or as a debugging tool. Copyright © 2005 John Wiley & Sons, Ltd.
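To make the hidden Markov model idea concrete, the sketch below scores a sequence of observed function calls with the standard scaled forward algorithm and flags sequences whose average log-likelihood per call falls below a threshold. It is a generic illustration, not the system described in the paper: the two-state model, its hand-set probabilities, the call sequences, and the threshold are all assumptions chosen for the example, and a real monitor would learn the model from traces of normal runs.

```python
import math


def log_likelihood(observations, start, trans, emit):
    """Scaled forward algorithm: returns log P(observations | HMM)."""
    if not observations:
        raise ValueError("need at least one observation")
    states = range(len(start))
    # Initialise with the first observed call; unknown calls get probability 0.
    alpha = [start[s] * emit[s].get(observations[0], 0.0) for s in states]
    total = 0.0
    for t, call in enumerate(observations):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in states) * emit[s].get(call, 0.0)
                for s in states
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        total += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return total


def is_anomalous(observations, model, threshold=-0.5):
    """Flag a sequence whose mean log-likelihood per call is below threshold."""
    start, trans, emit = model
    return log_likelihood(observations, start, trans, emit) / len(observations) < threshold


# Hypothetical toy model: the program mostly alternates sends and receives.
model = (
    [1.0, 0.0],
    [[0.1, 0.9], [0.9, 0.1]],
    [{"MPI_Send": 0.9, "MPI_Recv": 0.1}, {"MPI_Send": 0.1, "MPI_Recv": 0.9}],
)
normal = ["MPI_Send", "MPI_Recv", "MPI_Send", "MPI_Recv"]
odd = ["MPI_Send", "MPI_Send", "MPI_Send", "MPI_Send"]
print(is_anomalous(normal, model), is_anomalous(odd, model))
```

Normalising by sequence length keeps the threshold comparable across windows of different sizes, which matters when calls are scored in sliding windows.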

18.

Auto-tuned nested parallelism: A way to reduce the execution time of scientific software in NUMA systems

Jesús Cámara, Javier Cuenca, Luis-Pedro García, Domingo Giménez 《Parallel Computing》2014

The most computationally demanding scientific problems are solved with large parallel systems. In some cases these systems are Non-Uniform Memory Access (NUMA) multiprocessors made up of a large number of cores which share a hierarchically organized memory. The main basic component of these scientific codes is often matrix multiplication, and the efficient development of other linear algebra packages is directly based on the matrix multiplication routine implemented in the BLAS library. BLAS library is used in the form of packages implemented by the vendors or free implementations. The latest versions of this library are multithreaded and can be used efficiently in multicore systems, but when they are used inside parallel codes, the two parallelism levels can interfere and produce a degradation of the performance. In this work, an auto-tuning method is proposed to select automatically the optimum number of threads to use at each parallel level when multithreaded linear algebra routines are called from OpenMP parallel codes. The method is based on a simple but effective theoretical model of the execution time of the two-level routines. The methodology is applied to a two-level matrix–matrix multiplication and to different matrix factorizations (LU, QR and Cholesky) by blocks. Traditional schemes which directly use the multithreaded routine of BLAS, dgemm, are compared with schemes combining the multithreaded dgemm with OpenMP.
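The selection step of such an auto-tuner can be sketched as a search over thread-count pairs that minimizes a modeled execution time. The cost model below is a hypothetical stand-in (ideal speedup plus a linear overhead per thread at each level), not the model from the paper, and the function names and parameter values are invented for illustration.

```python
def modeled_time(p, q, work, outer_cost, inner_cost):
    """Hypothetical model: p outer (OpenMP) threads, q inner (BLAS) threads."""
    return work / (p * q) + outer_cost * p + inner_cost * q


def best_configuration(cores, work, outer_cost, inner_cost):
    """Return the (p, q) pair with p * q <= cores that minimizes the model."""
    candidates = [
        (p, q)
        for p in range(1, cores + 1)
        for q in range(1, cores // p + 1)
    ]
    return min(
        candidates,
        key=lambda pq: modeled_time(pq[0], pq[1], work, outer_cost, inner_cost),
    )


# Made-up parameters: inner threads are assumed to cost twice as much to manage.
best = best_configuration(cores=8, work=100.0, outer_cost=1.0, inner_cost=2.0)
print(best, modeled_time(*best, 100.0, 1.0, 2.0))
```

With a real tuner the model parameters would be fitted from a few timed runs on the target machine, and the chosen pair would then be applied through the OpenMP and BLAS thread-count settings.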

19.

DPGL: The Direct3D9-based Parallel Graphics Library for Multi-display Environment

Zhen Liu, Jiao-Ying Shi 《国际自动化与计算杂志》 (International Journal of Automation and Computing) 2007,4(1):30-37

The emergence of high performance 3D graphics cards has opened the way to PC clusters for high performance multi-display environment. In order to exploit the rendering ability of PC clusters, we should design appropriate parallel rendering algorithms and parallel graphics library interfaces. Due to the rapid development of Direct3D, we bring forward DPGL, the Direct3D9-based parallel graphics library in D3DPR parallel rendering system, which implements Direct3D9 interfaces to support existing Direct3D9 application parallelization with no modification. Based on the parallelism analysis of Direct3D9 rendering pipeline, we briefly introduce D3DPR parallel rendering system. DPGL is the fundamental component of D3DPR. After presenting DPGL three layers architecture, we discuss the rendering resource interception and management. Finally, we describe the design and implementation of DPGL in detail, including rendering command interception layer, rendering command interpretation layer and rendering resource parallelization layer.

20.

Automatic performance debugging of SPMD-style parallel programs

Xu Liu, Jianfeng Zhan, Kunlin Zhan, Dan Meng 《Journal of Parallel and Distributed Computing》2011,71(7):925-937

Automatic performance debugging of parallel applications includes two main steps: locating performance bottlenecks and uncovering their root causes for performance optimization. Previous work fails to resolve this challenging issue in two ways: first, several previous efforts automate locating bottlenecks, but present results in a confined way that only identifies performance problems with a priori knowledge; second, several tools take exploratory or confirmatory data analysis to automatically discover relevant performance data relationships, but these efforts do not focus on locating performance bottlenecks or uncovering their root causes. The single program, multiple data (SPMD) programming model is widely used for both high performance computing and Cloud computing. In this paper, we design and implement an innovative system, AutoAnalyzer, that automates the process of debugging performance problems of SPMD-style parallel programs, including data collection, performance behavior analysis, locating bottlenecks, and uncovering their root causes. AutoAnalyzer is unique in terms of two features: first, without any prior knowledge, it automatically locates bottlenecks and uncovers their root causes for performance optimization; second, it is lightweight in terms of the size of performance data to be collected and analyzed.
Our contributions are three-fold: first, we propose two effective clustering algorithms to investigate the existence of performance bottlenecks that cause process behavior dissimilarity or code region behavior disparity, respectively; meanwhile, we present two searching algorithms to locate bottlenecks; second, on the basis of the rough set theory, we propose an innovative approach to automatically uncover root causes of bottlenecks; third, on the cluster systems with two different configurations, we use two production applications, written in Fortran 77, and one open source code, MPIBZIP2 (http://compression.ca/mpibzip2/), written in C++, to verify the effectiveness and correctness of our methods. For three applications, we also propose an experimental approach to investigating the effects of different metrics on locating bottlenecks.