Similar Documents
20 similar documents found (search time: 203 ms)
1.
This paper describes "Martini," a network interface controller chip for our original network, RHiNET. Martini is designed to provide high-bandwidth, low-latency communication with small overhead. To achieve high-performance communication, protected user-level zero-copy RDMA communication functions are implemented entirely in hardwired logic. To further reduce communication latency, we propose PIO-based communication mechanisms called "On-the-Fly (OTF)" and implement them on Martini. Evaluation results show that Martini, connected to a 64-bit/66-MHz PCI bus, achieves a maximum bidirectional bandwidth of 470 MByte/s and a minimum host-to-host memory-copy latency of 1.74 μs.

2.
Tian Zhuo, Chen Yifeng. Journal of Software, 2021, 32(9): 2945-2962
The "Sunway TaihuLight" Chinese supercomputer is characterized as a high-throughput computing system, and such systems tend to have long memory-access and network latencies. A large class of practical applications are time-evolution simulations, which require high-frequency state iteration with communication at every step. Molecular dynamics simulation is a typical example: molecular properties depend on time evolution, so the state-dependent time dimension is hard to parallelize. In practice, all-atom models must be simulated beyond the millisecond scale with a physical time step of 1 fs to 2.5 fs, meaning more than 10^12 time steps are required. On many-core processors, cores incur long "queuing" waits on memory access, causing access latency; NIC communication latency and long data paths add network latency. Together, these make an effective simulation on long-latency many-core processors nearly impossible. The main challenge is to raise the iteration frequency, i.e., to execute as many iteration steps per second as possible. Targeting the architectural features of the Sunway high-performance processor and taking molecular dynamics simulation as the example, this paper studies a series of optimization strategies to raise the iteration frequency: (1) combining single-core communication with on-chip inter-core synchronization to reduce communication cost; (2) combining shared-memory waiting with slave-core synchronization to optimize inter-core synchronization on the heterogeneous architecture; (3) changing the computation pattern to reduce inter-core data coupling and dependencies; (4) overlapping data transfer with computation to hide memory-access latency; (5) regularizing the problem to improve memory-access coalescing.
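Strategy (4) above, overlapping data transfer with computation, can be illustrated with a generic double-buffering sketch (this is an illustrative pattern, not the paper's Sunway code): the fetch of chunk i+1 runs concurrently with the computation on chunk i, so memory-access latency is hidden behind useful work.

```python
# Double-buffering sketch: overlap a (simulated) data fetch for the next
# chunk with the computation on the current chunk.
import threading

def fetch(chunk_id):
    # Stand-in for a DMA/remote memory fetch of one chunk of data.
    return [chunk_id * 10 + k for k in range(4)]

def compute(data):
    # Stand-in for the per-chunk computation (e.g., force evaluation).
    return sum(data)

def pipeline(num_chunks):
    results = []
    buf = fetch(0)                       # prime the first buffer
    for i in range(num_chunks):
        nxt = [None]
        t = None
        if i + 1 < num_chunks:
            # start fetching the next chunk while we compute this one
            t = threading.Thread(target=lambda: nxt.__setitem__(0, fetch(i + 1)))
            t.start()
        results.append(compute(buf))     # compute overlaps the fetch
        if t is not None:
            t.join()
            buf = nxt[0]
    return results
```

The same structure applies whether the "fetch" is a DMA transfer into local device memory or a remote get; the key is issuing it before, not after, the dependent computation.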

3.
Owing to the isolation barrier between VMs, inter-domain communication suffers a significant performance loss. Current solutions widely exploit inter-domain shared-memory mechanisms to improve performance; moreover, the larger the shared-memory buffer, the higher the throughput and the lower the latency. However, these solutions use a statically sized shared memory and take neither memory utilization nor heterogeneous upper-layer applications into consideration. In this paper, we design and implement an adaptive shared-memory mechanism for inter-domain communication, called AdaptIDC, which adjusts the shared memory dynamically. With a completely independent in/out buffer design, the IOIHMD adjustment algorithm, and a control-ring and event-channel reuse mechanism, AdaptIDC achieves superior shared-memory utilization without sacrificing high performance between co-resident VMs. In our evaluation, AdaptIDC greatly improves shared-memory utilization while its performance approaches that of the fixed, statically sized shared-page solution.
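The core idea of adjusting a shared buffer to demand can be sketched with a simple occupancy-driven controller (hypothetical thresholds; the abstract does not specify the IOIHMD algorithm, so this is only an illustration of the general approach):

```python
# Illustrative dynamic buffer sizing: grow the shared buffer when it is
# nearly full, shrink it when mostly idle, within fixed bounds.
MIN_PAGES, MAX_PAGES = 4, 256

def resize(pages, used):
    """Return a new buffer size (in pages) given current occupancy."""
    occupancy = used / pages
    if occupancy > 0.75 and pages < MAX_PAGES:
        return min(pages * 2, MAX_PAGES)   # demand is high: double
    if occupancy < 0.25 and pages > MIN_PAGES:
        return max(pages // 2, MIN_PAGES)  # mostly idle: halve
    return pages                           # within the target band
```

A real mechanism must additionally renegotiate the mapping with the peer domain (e.g., over a control ring) before the new size takes effect.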

4.
Large-scale distributed shared-memory multiprocessors (DSMs) provide a shared address space by physically distributing the memory among different processors. A fundamental DSM communication problem that significantly affects scalability is an increase in remote memory latency as the number of system nodes increases. Remote memory latency, caused by accessing a memory location in a processor other than the one originating the request, includes both communication latency and remote memory access latency over I/O and memory buses. The proposed architecture reduces remote memory access latency by increasing connectivity and maximizing channel availability for remote communication. It also provides efficient and fast unicast, multicast, and broadcast capabilities, using a combination of aggressively designed multiplexing techniques. Simulations show that this architecture provides excellent interconnect support for a highly scalable, high-bandwidth, low-latency network.

5.
To address the communication bottleneck of cluster systems, we study a novel high-speed optical-fiber transmission and switching network centered on "signaling-routed switching." This paper analyzes the communication protocol of the signaling-routed switching network and, on that basis, designs and implements a high-speed network interface card based on Fibre Channel technology. Tailored to the characteristics of both small and large packets, the card provides dedicated designs for each frame type and achieves high communication efficiency. Test results show that a point-to-point network built from this network interface card achieves a one-way transmission latency of 4.22 µs, a communication bandwidth of up to 993.8 Mbps, and a maximum link utilization of 93.53%. The network interface card has proven stable in operation and meets the communication needs of cluster systems.

6.
While it is imperative to exploit middleware technologies in developing software for distributed embedded control systems, it is also necessary to tailor them to meet the stringent resource constraints and performance requirements of embedded control systems. In this paper, we propose a CORBA-based middleware for Controller Area Network (CAN) bus systems. Our design goals are to reduce the memory footprint and remote method invocation overhead of the middleware and to support the group communication often needed in embedded control systems. To achieve these goals, we develop a transport protocol on the CAN and a group communication scheme based on the publisher/subscriber model, realizing subject-based addressing that utilizes the message-filtering mechanism of the CAN. We also customize the method invocation and message passing protocol of CORBA so that CORBA method invocations are efficiently serviced on a low-bandwidth network such as the CAN. This customization includes packed data encoding and variable-length integer encoding for compact representation of IDL data types. We have implemented our CORBA-based middleware using GNU ORBit, and we report on the memory footprint and method invocation latency of our implementation.
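Variable-length integer encoding, one of the compaction techniques mentioned above, is commonly realized as a base-128 "varint": 7 payload bits per byte, with the high bit marking continuation. The following is a generic sketch of that scheme, not the middleware's actual wire format:

```python
# LEB128-style unsigned varint: small values take one byte, which suits
# low-bandwidth buses like the CAN.
def encode_varint(n):
    """Encode a non-negative int into 7-bit groups, MSB as continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # final byte
            return bytes(out)

def decode_varint(buf):
    """Decode a varint; returns (value, bytes_consumed)."""
    value = shift = 0
    for i, byte in enumerate(buf):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")
```

For typical IDL fields whose values are small, this replaces a fixed 4-byte integer with a single byte on the wire.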

7.
Multistage Interconnection Networks (MINs) have been widely used for building large-scale shared-memory multiprocessor systems. Complex interactions between many processors and memory modules through the MIN (such as interprocessor communication, process scheduling and synchronization, and remote-memory access) result in a significantly large space of possible performance behaviors and potential performance bottlenecks. To provide insight into dynamic system performance, we have developed an integrated data collection, analysis, and visualization environment for MIN-based multiprocessor systems, called MIN-Graph. MIN-Graph is a graphical instrumentation monitor that aids users in investigating performance problems and in determining effective ways to exploit the high-performance capabilities of interconnection-network multiprocessor systems. Interconnection network contention is a major bottleneck of parallel computing on MIN-based multiprocessors, and this paper focuses on evaluating contention behavior through performance monitoring and visualization. Four sets of system and scientific application programs with different programming and scheduling models and different memory access patterns are monitored and tested to observe the various network contention behaviors. MIN-Graph is implemented on the BBN GP1000 and the BBN TC2000.

8.
Using a hardware-independent design approach, we implemented HPCL/Ethernet, a high-performance cluster communication library based on commodity Ethernet cards, under Linux. The library uses the network device interface of the Linux operating system to support a variety of Ethernet NICs (Network Interface Cards), and implements a lightweight software-layer communication protocol for reliable message transmission. It simplifies buffer management with channels, reduces message copies with a fixed-buffer policy, and optimizes communication through interrupt reclamation and direct link-layer message processing. Test results show that the library achieves very low latency for small messages and high bandwidth for large messages: at a message length of 1196 bytes, bandwidth reaches 1.18E+08 bytes/s, 94.4% of the gigabit Ethernet card's hardware bandwidth.
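The quoted utilization figure is easy to check: converting the byte rate to bits and dividing by gigabit Ethernet's 1 Gbit/s line rate reproduces the stated percentage.

```python
# Sanity check on the figures above: 1.18E+08 bytes/s as a fraction of
# gigabit Ethernet's 1 Gbit/s line rate.
bytes_per_sec = 1.18e8
fraction = bytes_per_sec * 8 / 1e9      # bits/s over 1 Gbit/s
print(round(fraction * 100, 1))         # 94.4
```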

9.
As the access device of a system-area network, the adapter's functionality and performance are crucial to the performance of the whole cluster system. In view of advances in embedded technology, this paper presents the design of the DCNet system-area network adapter of the Dawning 4000A supercomputer, based on the Intel IOP310 I/O processor. The adapter extends the local memory bus of the original embedded system into a local bus for network interconnection, and implements the network interface components on this bus. The DCNet adapter not only achieves performance comparable to Myrinet, SCI, and QsNet adapters, but also demonstrates that building high-performance adapters from embedded systems with memory-bus-extended network interfaces is effective and feasible.

10.
In this paper, we describe the design and evaluation of a PC cluster system based on IEEE 1394. Networks for parallel cluster computing require low latency and high bandwidth; it is also important that they be commercially available at low cost. Few network devices satisfy all of these requirements, but the IEEE 1394 standard provides a good compromise. We used IEEE 1394 devices supporting a 400 Mbps data transfer rate to connect the nodes of a PC cluster system that we designed and implemented. We implemented two communication libraries: a fast communication library for IEEE 1394 called CF, and an MPI-layer library on top of CF. Experimental results show that CF achieves a 17.2-microsecond round-trip time. On application benchmarks, the system was considerably faster than TCP/IP over Fast Ethernet. Even though the system was constructed at very low cost, it provides good performance; the IEEE 1394 standard is thus a good solution for low-cost cluster systems. Copyright © 2004 John Wiley & Sons, Ltd.

11.
This paper studies the design of cluster communication protocols on CLUMPS (clusters of SMPs). It presents a multi-protocol communication architecture for SMP-node-based cluster systems, together with its protocol model and the implementation of its communication strategies. Finally, intra-node multi-protocol communication is compared with conventional TCP/IP communication, and experimental data are given. The multi-protocol communication system fully exploits the characteristics of SMP nodes by using different communication modes between compute units with different interconnect structures, thereby effectively improving the utilization of the system's communication hardware and the overall system performance.

12.
Remote Direct Memory Access (RDMA) improves network bandwidth and reduces latency by eliminating unnecessary copies from the network interface card to application buffers, but managing communication buffers to reduce memory registration and deregistration cost remains a significant challenge. Previous studies use pin-down caches and batched deregistration, but only simple LRU is used as the replacement algorithm for the cache space. In this paper, we evaluate the cost of memory registration in both user and kernel space. Based on our analysis, we reduce the overhead of communication buffer management in two ways simultaneously: we utilize a Memory Registration Region Cache (MRRC), and we optimize the RDMA communication process of clients and servers with a Fast RDMA Read and Write Process (FRRWP). MRRC manages memory in terms of memory regions and replaces old regions according to both their size and recency. FRRWP overlaps memory registrations between client and server, and allows applications to submit RDMA write operations without being blocked by message synchronization. We compare the performance of MRRC and FRRWP with traditional RDMA operations. The results show that our design improves the total cost of memory registration and overall communication latency by up to 70%.
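The pin-down-cache idea underlying MRRC can be sketched as a bounded cache of registered regions: repeated RDMA on the same buffer reuses the cached registration instead of paying the register/deregister cost again. The class below is a minimal illustration (names, the plain-LRU eviction, and the string handle are placeholders, not the paper's MRRC policy, which also weighs region size):

```python
# Registration (pin-down) cache sketch: keep registered memory regions
# around; evict the least-recently-used region when capacity is exceeded.
from collections import OrderedDict

class RegionCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.regions = OrderedDict()          # (addr, size) -> handle

    def register(self, addr, size):
        key = (addr, size)
        if key in self.regions:
            self.regions.move_to_end(key)     # hit: refresh recency
            return self.regions[key]
        while self.used + size > self.capacity and self.regions:
            # evict the least-recently-used region; a size-aware policy
            # would prefer large cold regions here
            (_, s), _ = self.regions.popitem(last=False)
            self.used -= s
        handle = f"mr:{addr:#x}+{size}"       # stand-in for a real MR handle
        self.regions[key] = handle
        self.used += size
        return handle
```

In a real RDMA stack, the handle would be the memory-region object returned by the verbs registration call, and eviction would trigger the corresponding deregistration.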

13.
MPJ Express is a messaging system that allows application developers to parallelize their compute-intensive sequential Java codes on high-performance computing clusters and multicore processors. In this paper, we extend the MPJ Express software with two new communication devices. The first device, called hybrid, enables MPJ Express to exploit hybrid parallelism on clusters of multicore processors by sitting on top of the existing shared-memory and network communication devices. The second device, called native, uses JNI wrappers to interface MPJ Express to native MPI implementations such as MPICH and Open MPI. We evaluate the performance of these devices on a range of interconnects, including 1G/10G Ethernet, 10G Myrinet, and 40G InfiniBand. In addition, we analyze and evaluate the cost of the MPJ Express buffering layer and compare it with the performance numbers of other Java MPI libraries. Our performance evaluation reveals that the native device allows MPJ Express to achieve performance comparable to native MPI libraries, in both latency and bandwidth of point-to-point and collective communications, which is a significant gain over the existing communication devices. The hybrid communication device, without any modification at the application level, also helps parallel applications achieve better speedups and scalability by exploiting the multicore architecture. Our evaluation quantifies the cost incurred by buffering and its impact on overall software performance: on workloads including the NAS Parallel Benchmarks and point-to-point and collective communication, both new devices improve application performance and achieve up to 90% of the theoretical bandwidth without any application rewriting effort.

14.
With the use of state and memory reduction techniques in verification by explicit state enumeration, runtime becomes a major limiting factor. We describe a parallel version of the explicit state enumeration verifier Murφ for distributed-memory multiprocessors and networks of workstations using the message-passing paradigm. In experiments with three complex cache coherence protocols on an SP2 multiprocessor and on a network of workstations at UC Berkeley, parallel Murφ shows close to linear speedups, which are largely insensitive to communication latency and bandwidth. There is some slowdown with increasing communication overhead, for which a simple yet relatively accurate approximation formula is given. We discuss techniques to reduce overhead and required bandwidth and to allow heterogeneity and dynamically changing load in the parallel machine, which we expect will allow good speedups when using conventional networks of workstations.

15.
Interconnect technology for parallel systems has long been a key research area for high-performance computers. In earlier work (1999), the authors proposed MCIM, a novel interconnect architecture based on multi-port fast memory, and used it to build node systems of 16 to 128 CPUs. This paper applies the MCIM principle to interconnection-network communication and implements MRouter, a memory-centric router. Using pipelined operation and cut-through transmission, MRouter can be used to build low-latency, high-bandwidth, high-performance interconnection networks. Combining this interconnection network with the node systems above enables larger-scale parallel systems; moreover, using the same interconnect technology at both the board and node levels facilitates modular system implementation. The paper introduces the principle of the memory-centric interconnect mechanism MCIM, describes the structure and data-transfer flow of MRouter, and, in simulation experiments, tests and compares MCIM's interconnect communication technology against other interconnect technologies.

16.
Much research has focused on reducing and/or tolerating remote memory access latencies on distributed-memory parallel computers. Caching remote data is intended to reduce average access latency by handling as many remote memory accesses as possible using local copies of the data in the cache. Data-flow and multithreaded approaches help programs tolerate the latency of remote memory accesses by allowing processors to do other work while remote operations take place. The thread migration technique described here is a multithreaded architecture where threads migrate to remote processors that contain data they need. By exploiting access locality, the threads often use several data items from that processor before migrating to other processors for more data. Because the threads migrate in search of data, the approach is called Nomadic Threads. A prototype runtime system has been implemented on the CM5 and is portable to other distributed-memory parallel computers.

17.
The ever-growing need for computation power and access to critical resources has, in a very short time, launched a large number of grid projects, many of which have been realized on dedicated network infrastructures. On Internet-based infrastructures, however, there are very few distributed or interactive applications (MPI, DIS, HLA, remote visualization) because end-to-end performance (bandwidth and latency, for example) is insufficient to support such interactivity. At present, computing resources and network resources are viewed separately in the Grid architecture, and we believe this is the main obstacle to achieving end-to-end performance. In this paper, we promote the idea of a Grid infrastructure able to adapt to application needs, and thus define application-aware Grid infrastructures in which the network infrastructure is tightly involved in both communication and processing. We report on our early experience building application-aware components, based on active networking technologies, that provide a low-latency, low-overhead multicast framework for applications running on a computational Grid. Performance results from both simulations and prototype implementations confirm that introducing application-aware components at specific locations in the network infrastructure can provide not only performance for end users but also new perspectives for building a communication framework for computational Grids.

18.
Lightweight threads have an important role to play in parallel systems: they can be used to exploit shared-memory parallelism, to mask communication and I/O latencies, to implement remote memory access, and to support task-parallel and irregular applications. In this paper, we address the question of how to integrate threads and communication in high-performance distributed-memory systems. We propose an approach based on global pointer and remote service request mechanisms, and explain how these mechanisms support dynamic communication structures, asynchronous messaging, dynamic thread creation and destruction, and a global memory model via interprocessor references. We also explain how these mechanisms can be implemented in various environments. Our global pointer and remote service request mechanisms have been incorporated in a runtime system called Nexus that is used as a compiler target for parallel languages and as a substrate for higher-level communication libraries. We report the results of performance studies conducted using a Nexus implementation; these results indicate that Nexus mechanisms can be implemented efficiently on commodity hardware and software systems.

19.
20.
To meet the high-speed, real-time requirements of simulation test systems, a general-purpose test support platform needs reflective-memory network communication capability. This paper develops a reflective-memory network communication component that provides real-time communication among the test devices on the reflective-memory network of a simulation test system, enabling real-time control of the test devices and monitoring of test data.

