期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Message-passing performance of various computers

Jack J. Dongarra Tom Dunigan 《Concurrency and Computation》1997,9(10):915-926

This report compares the performance of different computer systems for basic message passing. Latency and bandwidth are measured on Convex, Cray, IBM, Intel, KSR, Meiko, nCUBE, NEC, SGI and TMC multiprocessors. Communication performance is contrasted with the computational power of each system. The comparison includes both shared and distributed memory computers as well as networked workstation clusters. © 1997 John Wiley & Sons, Ltd. 相似文献

2.

Performance and scalability of MPI on PC clusters

Glenn R. Luecke Marina Kraeva Jing Yuan Silvia Spanoyannis 《Concurrency and Computation》2004,16(1):79-107

The purpose of this paper is to compare the communication performance and scalability of MPI communication routines on a Windows Cluster, a Linux Cluster, a Cray T3E‐600, and an SGI Origin 2000. All tests in this paper were run using various numbers of processors and two message sizes. In spite of the fact that the Cray T3E‐600 is about 7 years old, it performed best of all machines for most of the tests. The Linux Cluster with the Myrinet interconnect and Myricom's MPI performed and scaled quite well and, in most cases, performed better than the Origin 2000, and in some cases better than the T3E. The Windows Cluster using the Giganet Full Interconnect and MPI/Pro's MPI performed and scaled poorly for small messages compared with all of the other machines. Copyright © 2004 John Wiley & Sons, Ltd. 相似文献

3.

NPBMG在分布式环境下的并行实现

下载免费PDF全文

胡庆丰《计算机工程与科学》1998,20(1):28-30

ＮＡＳＰａｒｌｌｅｌＢｅｎｃｈｍａｒｋｓ（ＮＰＢ）是为了测试和评价超级计算机性能而设计的并行基准测试程序集ＭＧＢｅｎｃｈｍａｒｋ是其中的一个核心程度. 相似文献

4.

The performance and scalability of SHMEM and MPI‐2 one‐sided routines on a SGI Origin 2000 and a Cray T3E‐600

Glenn R. Luecke Silvia Spanoyannis Marina Kraeva 《Concurrency and Computation》2004,16(10):1037-1060

This paper compares the performance and scalability of SHMEM and MPI‐2 one‐sided routines on different communication patterns for a SGI Origin 2000 and a Cray T3E‐600. The communication tests were chosen to represent commonly used communication patterns with low contention (accessing distant messages, a circular right shift, a binary tree broadcast) to communication patterns with high contention (a ‘naive’ broadcast and an all‐to‐all). For all the tests and for small message sizes, the SHMEM implementation significantly outperformed the MPI‐2 implementation for both the SGI Origin 2000 and Cray T3E‐600. Copyright © 2004 John Wiley & Sons, Ltd. 相似文献

5.

High performance computing using MPI and OpenMP on multi-core parallel systems 总被引：1，自引：0，他引：1

Haoqiang Jin Dennis JespersenPiyush Mehrotra Rupak BiswasLei Huang Barbara Chapman 《Parallel Computing》2011,37(9):562-575

The rapidly increasing number of cores in modern microprocessors is pushing the current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems - distributed memory across nodes and shared memory with non-uniform memory access within each node - poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems - a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems including an SGI Altix 4700, an IBM p575+ and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures. 相似文献

6.

Non-linear Optimization on a Parallel Intel i860 RISC Based Architecture

J.F. BALL R.E. DORSEY J.D. JOHNSON 《Computational Economics》1997,10(3):279-294

相似文献

7.

Optimization and Performance of a Fortran 90 MPI-Based Unstructured Code on Large-Scale Parallel Systems

Shires Dale Mohan Ram 《The Journal of supercomputing》2003,25(2):131-141

The message-passing interface (MPI) has become the standard in achieving effective results when using the message passing paradigm of parallelization. Codes written using MPI are extremely portable and are applicable to both clusters and massively parallel computing platforms. Since MPI uses the single program, multiple data (SPMD) approach to parallelism, good performance requires careful tuning of the serial code as well as careful data and control flow analysis to limit communication. We discuss optimization strategies used and their degree of success to increase performance of an MPI-based unstructured finite element simulation code written in Fortran 90. We discuss performance results based on implementations using several modern massively parallel computing platforms including the SGI Origin 3800, IBM Nighthawk 2 SMP, and Cray T3E-1200. 相似文献

8.

NPB CG在分布式环境下的并行实现

下载免费PDF全文

胡庆丰刘杰《计算机工程与科学》1997,19(4):54-56

ＣＧＢｅｎｃｈｍａｒｋ是ＮＡＳＰａｒａｌｅｌＢｅｎｃｈｍａｒｋｓ（ＮＰＢ）中的一个核心程序，它用共轭梯度法求大型稀疏对称正定矩阵的最小特征值，本文介绍其主要算法，并给出在分布式环境下的高效并行算法，最后给出了在ＳＧＩＣｈａｌｅｎｇｅＰＶＭ平台上的测试结果相似文献

9.

An Assessment of MPI Environments for Windows NT 总被引：1，自引：0，他引：1

Takeda K. Allsopp N. K. Hardwick J. C. Macey P. C. Nicole D. A. Cox S. J. Lancaster D. J. 《The Journal of supercomputing》2001,19(3):315-323

In this paper we evaluate the MPI environments currently available for Windows NT on the Intel IA32 and Compaq/DEC Alpha architectures. We present benchmark results for low-level communication and for the NAS Parallel Benchmarks to allow comparison with other systems, but our primary interest is determining real application performance and robustness in production cluster environments. For this we use PAFEC-FE, a large FORTRAN code for finite-element analysis. We present results from three MPI implementations, two architectures, and three networking technologies (10 and 100 Mbit/s Ethernet and 1 Gbit/s Myrinet). 相似文献

10.

On the single processor performance of simple lattice Boltzmann kernels

《Computers & Fluids》2006,35(8-9):910-919

This report presents a comprehensive survey of the effect of different data layouts on the single processor performance characteristics for the lattice Boltzmann method both for commodity “off-the-shelf” (COTS) architectures and tailored HPC systems, such as vector computers. We cover modern 64-bit processors ranging from IA32 compatible (Intel Xeon/Nocona, AMD Opteron), superscalar RISC (IBM Power4), IA64 (Intel Itanium 2) to classical vector (NEC SX6+) and novel vector (Cray X1) architectures. Combining different data layouts with architecture dependent optimization strategies we demonstrate that the optimal implementation strongly depends on the architecture used. In particular, the correct choice of the data layout could supersede complex cache-blocking techniques in our kernels. Furthermore our results demonstrate that vector systems can outperform COTS architectures by one order of magnitude. 相似文献

11.

Comparing the performance of MPICH with Cray's MPI and with SGI's MPI

Glenn R. Luecke Marina Kraeva Lili Ju 《Concurrency and Computation》2003,15(9):779-802

The purpose of this paper is to compare the performance of MPICH with the vendor Message Passing Interface (MPI) on a Cray T3E‐900 and an SGI Origin 3000. Seven basic communication tests which include basic point‐to‐point and collective MPI communication routines were chosen to represent commonly‐used communication patterns. Cray's MPI performed better (and sometimes significantly better) than Mississippi State University's (MSU's) MPICH for small and medium messages. They both performed about the same for large messages, however for three tests MSU's MPICH was about 20% faster than Cray's MPI. SGI's MPI performed and scaled better (and sometimes significantly better) than MPICH for all messages, except for the scatter test where MPICH outperformed SGI's MPI for 1 kbyte messages. The poor scalability of MPICH on the Origin 3000 suggests there may be scalability problems with MPICH. Copyright © 2003 John Wiley & Sons, Ltd. 相似文献

12.

一种Nehalem平台上的MPI多级分段归约算法

邹金安刘志强廖蔚《小型微型计算机系统》2012,33(4):733-738

基于线程MPI环境提出一种适用于Nehalem平台长消息归约的多级分段归约算法(HSRA).HSRA考虑了Nehalem系统的体系结构特点,分处理器内归约和处理器外归约两个步骤实施节点内归约通信,在均匀分布计算负载的前提下仅需要较少的远端内存访问.首先在MPIActor的归约算法框架中设计、实现了HSRA算法,从访存角度分析了HSRA算法的开销,然后与单级分段和已有的另外三种节点内基于共享内存的归约算法进行比较;最后在真实系统上通过IMB(Intel MPI Benchmark)验证算法,实验结果表明,该算法是一种适用于在Nehalem系统中处理长消息节点内归约的高效算法. 相似文献

13.

几种XML基准测试的比较与分析

钱茛南马子明曹晓芳《软件》2011,32(5):71-73

XML基准测试已经成为流行的XML文档管理技术的测试方法。文章从测试场景、测试数据和对XQuery支持三方面,对主流的五种XML基准测试：Xmach-1、Xmark、Xbench、XOO7和TpoX进行了比较与分析。并阐明了这五种测试方法当前所存在的问题。相似文献

14.

Performance issues in emerging homogeneous multi-core architectures

Abdullah Kayi Tarek El-Ghazawi Gregory B. Newby 《Simulation Modelling Practice and Theory》2009,17(9):1485-1499

Multi-core architectures have emerged as the dominant architecture for both desktop and high-performance systems. Multi-core systems introduce many challenges that need to be addressed to achieve the best performance. Therefore, benchmarking of these processors is necessary to identify the possible performance issues. In this paper, broad range of homogeneous multi-core architectures are investigated in terms of essential performance metrics. To measure performance, we used micro-benchmarks from High-Performance Computing Challenge (HPCC), NAS Parallel Benchmarks (NPB), LMbench, and an FFT benchmark. Performance analysis is conducted on multi-core systems from UltraSPARC and x86 architectures; including systems based on Conroe, Kentsfield, Clovertown, Santa Rosa, Barcelona, Niagara, and Victoria Falls processors. Also, the effect of multi-core architectures in cluster performance is examined using a Clovertown based cluster. Finally, cache coherence overhead is analyzed using a full-system simulator. Experimental analysis and observations in this study provide for a better understanding of the emerging homogeneous multi-core systems. 相似文献

15.

Optimizing the Cray Graph Engine for performant analytics on cluster,SuperDome Flex,Shasta systems and cloud deployment

Christopher D. Rickett Kristyn J. Maschhoff Sreenivas R. Sukumar 《Concurrency and Computation》2024,36(10):e7982

We present updates to the Cray Graph Engine, a high performance in-memory semantic graph database, which enable performant execution across multiple architectures as well as deployment in a container to support cloud and as-a-service graph analytics. This paper discusses the changes required to port and optimize CGE to target multiple architectures, including Cray Shasta systems, large shared-memory machines such as SuperDome Flex (SDF), and cluster environments such as Apollo systems. The porting effort focused primarily on removing dependences on XPMEM and Cray PGAS and replacing these with a simplified PGAS library based upon POSIX shared memory and one-sided MPI, while preserving the existing Coarray-C++ CGE code base. We also discuss the containerization of CGE using Singularity and the techniques required to enable container performance matching native execution. We present early benchmarking results for running CGE on the SDF, Infiniband clusters and Slingshot interconnect-based Shasta systems. 相似文献

16.

Experiences with Sweep3D implementations in Co-array Fortran

Cristian Coarfa Yuri Dotsenko John Mellor-Crummey 《The Journal of supercomputing》2006,36(2):101-121

As part of the recent focus on increasing the productivity of parallel application developers, Co-array Fortran (CAF) has emerged as an appealing alternative to the Message Passing Interface (MPI). CAF belongs to the family of global address space parallel programming languages; such languages provide the abstraction of globally addressable memory accessed using one-sided communication. At Rice University we are developing caf c, an open source, multiplatform CAF compiler. Our earlier studies show that caf c-compiled CAF programs achieve similar performance to that of corresponding MPI codes for the NAS Parallel Benchmarks. In this paper, we present a study of several CAF implementations of Sweep3D on four modern architectures. We analyze the impact of using one-sided communication in Sweep3D, identify potential sources of inefficiencies and suggest ways to address them. Our results show that we achieve comparable performance to that of the MPI version on three cluster-based architectures and outperform it by up to 10 % on the SGI Altix 3000. This work was supported in part by the Department of Energy under Grant DE-FC03-01ER25504/A000, the Los Alamos Computer Science Institute (LACSI) through LANL contract number 03891-99-23 as part of the prime contract (W-7405-ENG-36) between the DOE and the Regents of the University of California, Texas Advanced Technology Program under Grant 003604-0059-2001, and Compaq Computer Corporation under a cooperative research agreement. This research was performed in part using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the U.S. Department of Energy’s Office of Biological and Environmental Research and located at the Pacific Northwest National Laboratory. Pacific Northwest is operated for the Department of Energy by Battelle. The computations were performed in part on an Itanium cluster purchased with support from the NSF under Grant EIA-0216467, Intel, and Hewlett Packard and on the National Science Foundation Terascale Computing System at the Pittsburgh Supercomputing Center. Cristian Coarfa and Yuri Dotsenko contributed equally to this work. 相似文献

17.

Shared Memory vs. Message Passing: The COMOPS Benchmark Experiment

Luo Yong 《The Journal of supercomputing》1999,13(3):283-301

This paper presents the comparison of the COMOPS benchmark performance in MPI and shared memory on four different shared memory platforms: the DEC AlphaServer 8400/300, the SGI Power Challenge, the SGI Origin2000, and the HP-Convex Exemplar SPP1600. The paper also qualitatively analyzes the obtained performance data based on an understanding of the corresponding architecture and the MPI implementations. Some conclusions are made for the inter-processor communication performance on these four shared memory platforms. 相似文献

18.

Parallel multigrid finite volume computation of three-dimensional thermal convection

《Computers & Mathematics with Applications》1999,37(9):49-60

A parallel implementation of the finite volume method for three-dimensional, time-dependent, thermal convective flows is presented. The algebraic equations resulting from the finite volume discretization, including a pressure equation which consumes most of the computation time, are solved by a parallel multigrid method. A flexible parallel code has been implemented on the Intel Paragon, the Cray T3D, and the IBM SP2 by using domain decomposition techniques and the MPI communication software. The code can use 1D, 2D, or 3D partitions as required by different geometries, and is easily ported to other parallel systems. Numerical solutions for air (Prandtl number Pr = 0.733) with various Rayleigh numbers up to 10⁷ are discussed. 相似文献

19.

PoLAPACK: parallel factorization routines with algorithmic blocking

Jaeyoung Choi 《Concurrency and Computation》2001,13(12):1033-1047

LU, QR, and Cholesky factorizations are the most widely used methods for solving dense linear systems of equations, and have been extensively studied and implemented on vector and parallel computers. Most of these factorization routines are implemented with block‐partitioned algorithms in order to perform matrix–matrix operations, that is, to obtain the highest performance by maximizing reuse of data in the upper levels of memory, such as cache. Since parallel computers have different performance ratios of computation and communication, the optimal computational block sizes are different from one another in order to generate the maximum performance of an algorithm. Therefore, the ata matrix should be distributed with the machine specific optimal block size before the computation. Too small or large a block size makes achieving good performance on a machine nearly impossible. In such a case, getting a better performance may require a complete redistribution of the data matrix. In this paper, we present parallel LU, QR, and Cholesky factorization routines with an ‘algorithmic blocking’ on two‐dimensional block cyclic data distribution. With the algorithmic blocking, it is possible to obtain the near optimal performance irrespective of the physical block size. The routines are implemented on the Intel Paragon and the SGI/Cray T3E and compared with the corresponding ScaLAPACK factorization routines. Copyright © 2001 John Wiley & Sons, Ltd. 相似文献

20.

High performance air pollution modeling for a power plant environment

María J. Martín David E. Singh J. Carlos Mourio Francisco F. Rivera Ramn Doallo Javier D. Bruguera 《Parallel Computing》2003,29(11-12):1763

The aim of this work is to provide a high performance air quality simulation using the STEM-II (Sulphur Transport Eulerian Model 2) program, a large-scale pollution modeling application. First, we optimize the sequential program with the aim of increasing data locality. Then, we parallelized the program using OpenMP directives for shared memory systems, and the MPI library for distributed memory machines. Performance results are presented for a SGI O2000 multiprocessor, a Fujitsu AP3000 multicomputer and a Cluster of PCs. Experimental results show that the parallel versions of the code achieve important reductions in the CPU time needed by each simulation. This will allow us to obtain results with adequate speed and reliability for the industrial environment where it is intended to be applied. 相似文献