Similar Literature
 A total of 10 similar documents were found (search time: 171 ms)
1.
The InfiniBand architecture is an industry standard that offers low latency and high bandwidth as well as advanced features such as remote direct memory access (RDMA), atomic operations, multicast, and quality of service. InfiniBand products can achieve a latency of several microseconds for small messages and a bandwidth of 700 to 900 Mbytes/s. As a result, it is becoming increasingly popular as a high-speed interconnect technology for building high-performance clusters. The Peripheral Component Interconnect (PCI) has been the standard local-I/O-bus technology for the last 10 years. However, more applications require lower latency and higher bandwidth than a PCI bus can provide. As an extension, PCI-X offers higher peak performance and efficiency. InfiniBand host channel adapters (HCAs) with PCI Express achieve 20 to 30 percent lower latency for small messages compared with HCAs using 64-bit, 133-MHz PCI-X interfaces. PCI Express also improves performance at the MPI level, achieving a latency of 4.1 μs for small messages. It can also improve MPI collective communication and bandwidth-bound MPI application performance.
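Small-message MPI latencies such as the 4.1 μs figure above are usually obtained with a ping-pong microbenchmark. The sketch below shows the general pattern, assuming two ranks; the message size and iteration count are illustrative choices, not values from the paper.

```c
/* Minimal MPI ping-pong latency sketch (parameters are illustrative
 * assumptions, not values from the paper). Run with exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    const int iters = 10000, size = 4;      /* 4-byte "small" message */
    char buf[4] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the average round-trip time */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```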

2.
江海昇  范辉 《计算机应用》2006,26(3):550-552
MPI-2 one-sided communication suffers from high communication overhead and from dependence on the remote process involved in the communication. To address this, a high-performance design of MPI-2 one-sided communication over the InfiniBand architecture is proposed, in which the MPI-2 one-sided operations MPI_Put, MPI_Get, and MPI_Accumulate are mapped onto InfiniBand Remote Direct Memory Access (RDMA) operations. The design is implemented on top of MPICH2 over InfiniBand and achieves good overlap of communication and computation.
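For reference, the MPI-2 one-sided interface that the design maps onto RDMA looks as follows in C. This is a minimal, generic usage sketch (window creation, an access epoch bounded by fences, and a single MPI_Put), not code from the proposed implementation; the buffer size is an illustrative assumption.

```c
/* Generic MPI-2 one-sided usage sketch: expose a window, then write into a
 * remote rank's memory with MPI_Put inside a fence epoch. Run with at least
 * 2 ranks; sizes are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank;
    const int n = 1024;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *base = malloc(n * sizeof(double));   /* memory exposed via the window */
    MPI_Win win;
    MPI_Win_create(base, (MPI_Aint)(n * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                        /* open the access epoch */
    if (rank == 0) {
        double val = 3.14;
        /* one-sided write into rank 1's window; with an RDMA-based design
         * this can complete without involving the target process */
        MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                        /* close the epoch; data is now visible */

    MPI_Win_free(&win);
    free(base);
    MPI_Finalize();
    return 0;
}
```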

3.
This paper presents ibvdev, a scalable and efficient low-level Java message-passing communication device over InfiniBand. The continuous increase in the number of cores per processor underscores the need for efficient communication support for parallel solutions. Moreover, current system deployments aggregate a significant number of cores through advanced network technologies, such as InfiniBand, increasing the complexity of communication protocols, especially when dealing with hybrid shared/distributed memory architectures such as clusters. Here, Java represents an attractive choice for the development of communication middleware for these systems, as it provides built-in networking and multithreading support. As the performance gap between Java and compiled languages has been narrowing in recent years, Java is an emerging option for High Performance Computing (HPC). The developed communication middleware, ibvdev, increases the performance of Java applications on clusters of multicore processors interconnected via InfiniBand by: (1) providing Java with direct access to InfiniBand through the InfiniBand Verbs API, so far largely restricted to MPI libraries; (2) implementing an efficient and scalable communication protocol that obtains start-up latencies and bandwidths similar to MPI performance results; and (3) allowing its integration in any Java parallel and distributed application. In fact, it has been successfully integrated in the Java messaging library MPJ Express. The experimental evaluation of this middleware on an InfiniBand cluster of multicore processors has shown significant point-to-point performance benefits, up to 85% start-up latency reduction and twice the bandwidth compared to previous Java middleware on InfiniBand. Additionally, the impact of ibvdev on message-passing collective operations is significant, achieving up to one order of magnitude performance increases compared to previous Java solutions, especially when combined with multithreading. Finally, the efficiency of this middleware, which is even competitive with MPI in terms of performance, increases the scalability of communication-intensive Java HPC applications.
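The Verbs calls that such a device must wrap are C-level APIs. The sketch below shows a typical setup sequence (open an HCA, allocate a protection domain, register a buffer, create a completion queue); it is a generic illustration of the InfiniBand Verbs API rather than code from ibvdev, and it omits queue-pair creation and connection establishment.

```c
/* Generic InfiniBand Verbs setup sketch (not taken from ibvdev); error
 * handling and connection establishment are largely omitted. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* open the first HCA */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                /* protection domain */

    size_t len = 64 * 1024;
    void *buf = malloc(len);
    /* register (pin) the communication buffer so the HCA can DMA to/from it */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);  /* completion queue */
    /* ...create a QP on pd/cq, exchange addresses, post sends/receives... */

    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```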

4.
Recent studies show that MPI processes in real applications could arrive at an MPI collective operation at different times. This imbalanced process arrival pattern can significantly affect the performance of the collective operation. MPI_Alltoall() and MPI_Allgather() are communication-intensive collective operations that are used in many scientific applications. Therefore, their efficient implementations under different process arrival patterns are critical to the performance of scientific applications running on modern clusters. In this paper, we propose novel RDMA-based process arrival pattern aware MPI_Alltoall() and MPI_Allgather() for different message sizes over InfiniBand clusters. We also extend the algorithms to be shared memory aware for small to medium size messages under process arrival patterns. The performance results indicate that the proposed algorithms outperform the native MVAPICH implementations as well as other non-process arrival pattern aware algorithms when processes arrive at different times. Specifically, the RDMA-based process arrival pattern aware MPI_Alltoall() and MPI_Allgather() are 3.1 times faster than MVAPICH for 8 KB messages. On average, the applications studied in this paper (FT, RADIX, and N-BODY) achieve a speedup of 1.44 using the proposed algorithms.
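The arrival-pattern issue can be reproduced with a few lines of MPI: each rank performs a different amount of work before entering the collective, so ranks reach MPI_Alltoall() at different times. The skew and the 8 KB-per-destination message size below are illustrative assumptions.

```c
/* Illustrative imbalanced-arrival scenario: ranks do unequal pre-collective
 * work, so they enter MPI_Alltoall() at different times. */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    const int count = 2048;                  /* 2048 ints = 8 KB per destination */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *sendbuf = malloc((size_t)count * nprocs * sizeof(int));
    int *recvbuf = malloc((size_t)count * nprocs * sizeof(int));
    for (int i = 0; i < count * nprocs; i++) sendbuf[i] = rank;

    usleep(1000 * rank);   /* skewed pre-collective work: later ranks arrive later */

    /* every rank sends `count` ints to every other rank */
    MPI_Alltoall(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```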

5.
Remote Direct Memory Access (RDMA) improves network bandwidth and reduces latency by eliminating unnecessary copies from the network interface card to application buffers, but managing communication buffers to reduce memory registration and deregistration cost remains a significant challenge. Previous studies use a pin-down cache and batched deregistration, but only a simple LRU policy is used to manage the cache space. In this paper, we evaluate the cost of memory registration in both user and kernel space. Based on our analysis, we reduce the overhead of communication buffer management in two ways simultaneously: we utilize a Memory Registration Region Cache (MRRC), and we optimize the RDMA communication process of clients and servers with a Fast RDMA Read and Write Process (FRRWP). MRRC manages memory in terms of memory regions and replaces old memory regions according to both their size and recency. FRRWP overlaps memory registrations between a client and a server, and allows applications to submit RDMA write operations without being blocked by message synchronization. We compare the performance of MRRC and FRRWP with traditional RDMA operations. The results show that the new design reduces total memory registration cost and overall communication latency by up to 70%.
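The pin-down-cache idea behind such designs can be sketched as follows: a registration is reused whenever a buffer falls inside an already-registered region, so the pinning cost is paid only on a miss. The toy linear-scan cache below only illustrates that concept; it is not the paper's MRRC and omits its size-and-recency replacement policy.

```c
/* Toy pin-down cache: reuse a cached registration when a buffer lies inside
 * an already-registered region, otherwise register and cache it. Not the
 * paper's MRRC; eviction and replacement policy are omitted. */
#include <infiniband/verbs.h>
#include <stddef.h>

#define CACHE_SLOTS 64

struct reg_entry { char *addr; size_t len; struct ibv_mr *mr; };
static struct reg_entry cache[CACHE_SLOTS];

struct ibv_mr *get_mr(struct ibv_pd *pd, void *buf, size_t len) {
    char *addr = buf;

    /* hit: the buffer is fully contained in a cached registration */
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct reg_entry *e = &cache[i];
        if (e->mr && addr >= e->addr && addr + len <= e->addr + e->len)
            return e->mr;
    }

    /* miss: pay the registration (pinning) cost once and cache the region */
    struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i].mr) {
            cache[i].addr = addr;
            cache[i].len  = len;
            cache[i].mr   = mr;
            break;
        }
    }
    return mr;   /* a full cache would evict and ibv_dereg_mr() an old region here */
}
```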

6.
Optimizing Message Passing Interface (MPI) point-to-point communication for large messages is of paramount importance since most communication in MPI applications is performed by such operations. Remote Direct Memory Access (RDMA) allows one-sided data transfer and provides great flexibility in the design of efficient communication protocols for large messages. However, achieving high point-to-point communication performance on RDMA-enabled clusters is challenging due to both the complexity of communication protocols and the impact of the protocol invocation scenario on the performance of a given protocol. In this work, we analyze existing protocols and show that they are not ideal in many situations, and we propose protocol customization, that is, using different protocols for different situations to improve MPI performance. More specifically, by leveraging the RDMA capability, we develop a set of protocols that can provide high performance for all protocol invocation scenarios. Armed with this set of protocols, which can collectively achieve high performance in all situations, we demonstrate the potential of protocol customization by developing a trace-driven toolkit that allows the appropriate protocol to be selected for each communication in an MPI application to maximize performance. We evaluate the proposed techniques using micro-benchmarks and application benchmarks. The results indicate that protocol customization can outperform traditional communication schemes by a wide margin in many situations.
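The baseline that protocol customization generalizes is the conventional size-based switch between an eager protocol and a rendezvous protocol. The sketch below shows that switch only; the threshold value and the lower-level function names are hypothetical, and the paper's customized protocols replace this one-size-fits-all decision with per-communication selection.

```c
/* Conventional size-based protocol switch in an MPI point-to-point send path.
 * The threshold and the lower-level primitives are hypothetical. */
#include <stddef.h>

#define EAGER_THRESHOLD (16 * 1024)   /* assumed switch-over point */

/* hypothetical lower-level primitives of an MPI implementation */
void eager_send(int dest, const void *buf, size_t len);
void rendezvous_rdma_send(int dest, const void *buf, size_t len);

void mpi_pt2pt_send(int dest, const void *buf, size_t len) {
    if (len <= EAGER_THRESHOLD)
        eager_send(dest, buf, len);            /* one message, extra copy at receiver */
    else
        rendezvous_rdma_send(dest, buf, len);  /* handshake followed by zero-copy RDMA */
}
```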

7.
MPJ Express is a messaging system that allows application developers to parallelize their compute-intensive sequential Java codes on High Performance Computing clusters and multicore processors. In this paper, we extend the MPJ Express software with two new communication devices. The first device—called hybrid—enables MPJ Express to exploit hybrid parallelism on clusters of multicore processors by sitting on top of the existing shared memory and network communication devices. The second device—called native—uses JNI wrappers to interface MPJ Express with native MPI implementations such as MPICH and Open MPI. We evaluate the performance of these devices on a range of interconnects including 1G/10G Ethernet, 10G Myrinet, and 40G InfiniBand. In addition, we analyze and evaluate the cost of the MPJ Express buffering layer and compare it with the performance numbers of other Java MPI libraries. Our performance evaluation reveals that the native device allows MPJ Express to achieve performance comparable to native MPI libraries—for latency and bandwidth of point-to-point and collective communications—which is a significant gain compared to existing communication devices. The hybrid communication device—without any modifications at the application level—also helps parallel applications achieve better speedups and scalability by exploiting the multicore architecture. Our performance evaluation quantifies the cost incurred by buffering and its impact on the overall performance of the software. We observed comparable performance, as both new devices improve application performance and achieve up to 90% of the available theoretical bandwidth without any application rewriting effort, across NAS Parallel Benchmarks and point-to-point and collective communication benchmarks.
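The native device's approach of bridging Java to an MPI library through JNI can be illustrated with a minimal wrapper like the one below. The Java class and method names are invented for illustration and the code is not taken from MPJ Express; it only shows how a JNI function can hand a Java byte array to MPI_Send.

```c
/* Hypothetical JNI bridge from a Java communication device to a native MPI
 * library; class/method names are assumptions, not MPJ Express code. */
#include <jni.h>
#include <mpi.h>

/* Assumed Java side: package nativedev; class NativeComm with
 *   public native int send(byte[] buf, int count, int dest, int tag); */
JNIEXPORT jint JNICALL
Java_nativedev_NativeComm_send(JNIEnv *env, jobject obj,
                               jbyteArray buf, jint count, jint dest, jint tag) {
    (void)obj;
    jbyte *data = (*env)->GetByteArrayElements(env, buf, NULL);
    int rc = MPI_Send(data, count, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
    /* JNI_ABORT: the buffer was only read, so no copy-back is needed */
    (*env)->ReleaseByteArrayElements(env, buf, data, JNI_ABORT);
    return rc;
}
```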

8.
Overlapping computation with communication is a key technique for concealing the effect of communication latency on the performance of parallel applications. The Message Passing Interface (MPI) is a widely used message passing standard for high performance computing. One of the most important factors in achieving a good level of overlap is the MPI library's ability to make progress on outstanding communication operations. In this paper, we propose a novel speculative MPI rendezvous protocol that uses RDMA Read and RDMA Write to effectively improve communication progress and consequently the ability to overlap. Performance results based on a modified MPICH2 implementation over 10-Gigabit iWARP Ethernet reveal a significant (80–100%) improvement in receiver-side overlap and progress ability. We have also observed up to 30% improvement in application wait time for some NPB applications as well as the RADIX application. For applications that do not benefit from this protocol, an adaptation mechanism stops the speculation to effectively reduce the protocol overhead.
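The overlap that such a protocol targets is typically written as a non-blocking send or receive followed by computation, with the library expected to progress the rendezvous transfer in the background. The sketch below shows that pattern for two ranks; the message size and the compute loop are illustrative assumptions, and the MPI_Test calls reflect that many implementations only progress large transfers when the application re-enters the library.

```c
/* Computation/communication overlap sketch for a large message; sizes and
 * the compute loop are illustrative. Run with 2 ranks. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank;
    const int n = 1 << 20;                   /* "large" message: 1 Mi doubles */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = calloc(n, sizeof(double));
    MPI_Request req = MPI_REQUEST_NULL;
    if (rank == 0)
        MPI_Isend(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    int done = 0;
    volatile double acc = 0.0;
    while (!done) {
        for (int i = 0; i < 100000; i++) acc += i * 1e-9;   /* overlapped computation */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);           /* re-enter MPI to drive progress */
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```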

9.
The InfiniBand Architecture (IBA) is an emerging interconnect technology mainly used to connect processing nodes and I/O nodes into a system-area network. To improve the performance of the communication subsystem, IBA provides many features, one of which is Remote Direct Memory Access (RDMA). However, in standard IBA implementations, control messages and small messages are transferred using send/receive operations. This paper proposes an improved RDMA implementation that allows RDMA to be applied to control and small messages, reducing the latency of transferring small and control messages by 24% on average.
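Replacing a send/receive pair with an RDMA Write for a small control message amounts to posting a work request like the one below. The queue pair, the registered local buffer, and the remote address/rkey are assumed to have been set up and exchanged beforehand, and the IBV_SEND_INLINE flag additionally assumes the QP was created with sufficient max_inline_data.

```c
/* Posting a small control message as an RDMA Write instead of a Send; setup
 * and address/rkey exchange are assumed to have happened already. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_small_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                          void *local_buf, uint32_t len,
                          uint64_t remote_addr, uint32_t rkey) {
    struct ibv_sge sge;
    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;    /* registered local buffer */
    sge.length = len;
    sge.lkey   = mr->lkey;

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* write directly into remote memory */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED | IBV_SEND_INLINE;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}
```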

10.
In this paper, we describe “Martini,” a network interface controller chip for our original network, RHiNET. Martini is designed to provide high-bandwidth, low-latency communication with small overhead. To obtain high-performance communication, protected user-level zero-copy RDMA communication functions are implemented entirely in hardwired logic. Also, to reduce communication latency efficiently, we have proposed PIO-based communication mechanisms called “On-the-fly (OTF)” and have implemented them on Martini. The evaluation results show that Martini connected to a 64-bit/66-MHz PCI bus achieves 470 MByte/s maximum bidirectional bandwidth and 1.74 μs minimum latency on host-to-host memory copying.

