Similar documents
 Found 20 similar documents
1.
Unidirectional ring-based networks are currently popular choices for high-performance, large-scale shared-memory multiprocessors. This class of networks is attractive for its simple hardware interfaces, high-speed communication, wider data path, and easy addition of extra nodes. However, a single ring does not scale well because its bandwidth is fixed, and hierarchical ring networks, the natural extension of a single ring, show limited scalability because of their limited bandwidth near the root. In this paper we present a new interconnection network called the Multistage Ring Network (MRN). The MRN has a 2-level hierarchy of rings, and its interconnection of global rings forms a type of multistage network. The architecture of the MRN is effective at diffusing the global traffic on the network to all global rings, and the bandwidth of the network increases proportionally with the system size. Our results show that, in peak throughput, the MRN performs seven times better than the hierarchical ring network for a system size of 1024.

2.
The paper analyzes and selects an appropriate interconnection network for a compliant multiprocessor. The multiprocessor is compliant to the tasks assigned to it in the sense that it can be reconfigured to provide a more efficient fit to the tasks to be executed. A number of candidate networks for the multiprocessor are considered: Omega, ADM, Hypercube, and Torus. The potential applicability of these networks to the multiprocessor is analyzed from the points of view of partitionability, inter-PE delay, fault impact, and cost. After the individual analysis of each of these points is completed, a weighted network factor is formed, and the optimal type of network is selected under different performance criteria. The overall results point to the selection of the Torus or Hypercube network for most cases under consideration.

3.
Parallel computing performance on scalable shared-memory architectures is affected by the structure of the interconnection networks linking processors to memory modules and by the efficiency of the memory/cache management systems. Cache Coherence Nonuniform Memory Access (CC-NUMA) and Cache Only Memory Access (COMA) are two effective memory systems, and the hierarchical ring structure is an efficient interconnection network in hardware. This paper focuses on comparative performance modeling and evaluation of CC-NUMA and COMA on a hierarchical ring shared-memory architecture. Analytical models of the two memory systems are presented for comparative evaluation. Intensive performance measurements on data migrations have been conducted on the KSR-1, a COMA hierarchical ring shared-memory machine. Experimental results support the analytical models, and we present practical observations and comparisons of the two cache coherence memory systems. Our analytical and experimental results show that a COMA system balances the work load well. However, the overhead of frequent data movement may match the gains obtained from improving load balance. We believe our performance results could be further generalized to the two memory systems on a hierarchical network architecture. Although a CC-NUMA system may not automatically balance the load at the system level, it provides an option for a user to explicitly handle data locality for a possible performance improvement.

4.
ShuffleNet and de Bruijn networks have been proposed as multihop lightwave networks based on wavelength division multiplexing (WDM). With multihop lightwave networks, few fixed wavelength transmitters/receivers are assigned to each user, eliminating the need for wavelength agility and pretransmission coordination. These networks have been shown to be very effective for uniform traffic. For communications with high locality, we propose two-level hierarchical networks. At the first level, each cluster of users can be connected either via a ShuffleNet (SH) or a de Bruijn network (dB). At the second level, all the clusters in the system can be connected by two rings in opposite directions (SH/Ring and dB/Ring), a de Bruijn network (dB/dB), or a ShuffleNet (SH/SH). The performance of ShuffleNet, de Bruijn networks, and the hierarchical networks SH/Ring, dB/Ring, dB/dB, and SH/SH is analyzed. For communications with a high locality, the hierarchical networks are shown to be very effective.

5.
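Chip multiprocessors (CMPs) have become the direction of processor development, and the focus of processor design has shifted to the interconnection network and the memory hierarchy. A key problem is how to keep the caches at every level of every processor coherent. Traditional shared-memory multiprocessors solve this problem with cache coherence protocols, but a CMP offers higher on-chip interconnect bandwidth and speed than a traditional multiprocessor, which places new demands on cache coherence protocols and also creates new opportunities for improvement. Traditional bus snooping protocols suffer from limited scalability and from excessive unnecessary broadcasts and snoops, while directory protocols suffer from long miss indirection latency, high complexity, and difficult verification. A ring interconnect scales better than a bus, and its implementation complexity is far lower than that of the packet-switched point-to-point networks typically used by directory protocols. This paper applies a ring-based snooping protocol to CMPs. It uses the ordering property of the ring to eliminate the retries caused by conflicts in the original protocol, which removes possible starvation, deadlock, and livelock, improves protocol stability, and reduces message traffic and power consumption. It also exploits the short on-chip interconnect latency to propagate snoop results together with snoop requests, so that a processor can snoop selectively based on the snoop results, which reduces unnecessary snoop operations and lowers power consumption.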

6.
A large-scale, cache-based multiprocessor that is interconnected by a hierarchical network such as hierarchical buses or a multistage interconnection network (MIN) is considered. An adaptive cache coherence scheme for the system is proposed based on a hardware approach that handles multiple shared reads efficiently. The new protocol allows multiple copies of a shared data block in the hierarchical network, but minimizes the cache coherence overhead by dynamically partitioning the network into sharing and nonsharing regions based on program behavior. The new cache coherence scheme effectively utilizes the bandwidth of the hierarchical networks and exploits the locality properties of parallel algorithms. Simulation experiments have been carried out to analyze the performance of the new protocol. The simulation results show that the new protocol gives a 15% to 30% performance improvement over some existing cache coherence schemes on similar systems for a wide range of workload parameters.

7.
Multistage Interconnection Networks (MINs) have been widely used for building large-scale shared-memory multiprocessor systems. Complex interactions between many processors and memory modules through the MIN (such as interprocessor communication, process scheduling and synchronization, and remote-memory access) result in a significantly large space of possible performance behavior and potential performance bottlenecks. To provide insight into dynamic system performance, we have developed an integrated data collection, analysis, and visualization environment for a MIN-based multiprocessor system, called MIN-Graph. The MIN-Graph is a graphical instrumentation monitor that helps users investigate performance problems and determine an effective way to exploit the high performance capabilities of interconnection network multiprocessor systems. Interconnection network contention is a major bottleneck of parallel computing on MIN-based multiprocessors. This paper focuses on evaluating the contention behavior through performance monitoring and visualization. Four sets of system and scientific application programs with different programming and scheduling models and different memory access patterns are monitored and tested to observe the various network contention behaviors. The MIN-Graph is implemented on the BBN GP1000 and the BBN TC2000.

8.
The IEEE 802.17 is a standardized ring topology network architecture, called the Resilient Packet Ring (RPR), to be used mainly in metropolitan and wide area networks. This paper introduces destination differentiation in ingress aggregated fairness for RPR and focuses on the RPR MAC client implementation of the IEEE 802.17 RPR MAC in the aggressive mode of operation. It also introduces an enhanced active queue management scheme for ring networks that achieves destination differentiation as well as higher overall utilization of the ring bandwidth with simpler and less expensive implementation than the generic implementation provided in the standard. The enhanced scheme introduced in this paper provides performance comparable to the per destination queuing implementation, which is the best achievable performance, while providing weighted destination based fairness as well as weighted ingress aggregated fairness. In addition, the proposed scheme has been demonstrated via extensive simulations to provide improved stability and fairness with respect to different packet arrival rates as compared to earlier algorithms.

9.
This paper considers various physical constraints that influence the design of interconnection networks used in multiprocessor systems. Design expressions are presented for implementing an N log N packet passing interconnection network composed of circuit switched crossbar chip modules. Expressions reflecting chip level and board level pin and area constraints are derived and used to determine the network delay expected at a given clock frequency. Logic and memory delay, signal path delay, clock skew, and clock distribution delay parameters are defined and used to determine the maximum frequency that can be obtained with a given design. An example 2048 × 2048 network design is considered. This example indicates that, using aggressive packaging and MOS technology, a clock frequency of about 40 MHz is achievable. However, even at this frequency, this network would result in a one-way delay (ignoring blocking and hot spot delays) of about 1 μs. A read operation from memory requiring a round trip would thus require about 2 μs. This represents a considerable slowdown compared with accessing strictly local memory and appears to be a major problem in the design of network centered multiprocessor architectures.
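
To put the delay figures quoted in this abstract into clock cycles, the arithmetic can be sketched as below. This is only an illustration of the quoted numbers (about 40 MHz, about 1 μs one way, about 2 μs round trip); the function and variable names are assumptions made for this example and do not come from the paper.

```python
# Figures quoted in the abstract; all names here are illustrative assumptions.
CLOCK_MHZ = 40            # achievable clock frequency, about 40 MHz
ONE_WAY_DELAY_NS = 1_000  # one-way network delay, about 1 microsecond


def delay_in_cycles(delay_ns: int, clock_mhz: int) -> int:
    """Convert a delay in nanoseconds to clock cycles at the given frequency.

    cycles = delay * frequency; ns * MHz carries a factor of 1e-3,
    so the product is divided by 1000.
    """
    return delay_ns * clock_mhz // 1_000


cycle_ns = 1_000 // CLOCK_MHZ  # clock period in nanoseconds
one_way_cycles = delay_in_cycles(ONE_WAY_DELAY_NS, CLOCK_MHZ)
round_trip_cycles = delay_in_cycles(2 * ONE_WAY_DELAY_NS, CLOCK_MHZ)

print(f"clock period: {cycle_ns} ns")
print(f"one-way delay: {one_way_cycles} cycles")
print(f"round-trip read: {round_trip_cycles} cycles")
```

At these figures a remote read costs roughly 80 network clock cycles before any blocking or hot spot delay is counted, which is the slowdown the abstract points to.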

10.
The performance of processor-memory communications is evaluated for multiprocessor systems using circuit switched interconnection networks with a hold strategy. Message size and processor processing time are considered and shown to have a significant effect on the overall system performance. A closed queuing network model is proposed that requires only (n + 2) states, in contrast to the (n^2 + 3n + 4)/2 states needed in previous studies, where n is the number of stages of the multistage interconnection network. Since a closed-form solution is obtained, the behavior of a complete cycle of memory access through multistage interconnection networks can be accurately analyzed and various performance bounds can be obtained.
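
The two state counts quoted in this abstract can be compared numerically. The sketch below only evaluates the two expressions, reading the garbled "n2" as n squared (an assumption); the function names are invented for this example, and nothing here reproduces the queuing model itself.

```python
def states_proposed(n: int) -> int:
    """State count of the proposed closed queuing model: n + 2."""
    return n + 2


def states_previous(n: int) -> int:
    """State count quoted for previous studies: (n^2 + 3n + 4) / 2.

    n * (n + 3) is always even, so the integer division is exact.
    """
    return (n * n + 3 * n + 4) // 2


# n is the number of stages of the multistage interconnection network.
for n in (2, 5, 10, 20):
    print(n, states_proposed(n), states_previous(n))
```

The proposed count grows linearly in the number of stages while the earlier count grows quadratically, so the gap widens quickly as n increases.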

11.
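To address the problem that current Resilient Packet Ring technology cannot build complex multi-ring networks and cannot transparently carry end-to-end cross-ring data services, this paper proposes a cross-ring interconnection bearer model based on Generalized Multi-Protocol Label Switching (GMPLS). By extending the packet switching capability of the GMPLS data plane, the model enables multi-ring interconnection of Resilient Packet Rings and carries cross-ring services. Simulation results show that the model is feasible and effective and can deliver cross-ring data services transparently and reliably.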

12.
Polymorphic Torus is a novel interconnection network for SIMD massively parallel computers, able to support effectively both local and global communication. Thanks to this characteristic, Polymorphic Torus is highly suitable for computer vision applications, since vision involves local communication at the low-level stage and global communication at the intermediate- and high-level stages. In this paper we evaluate the performance of Polymorphic Torus in the computer vision domain. We consider a set of basic vision tasks, namely, convolution, histogramming, connected component labeling, Hough transform, extreme point identification, diameter computation, and visibility, and show how they can take advantage of the Polymorphic Torus communication capabilities. For each basic vision task we propose a Polymorphic Torus parallel algorithm, give its computational complexity, and compare such a complexity with the complexity of the same task in mesh, tree, pyramid, and hypercube interconnection networks. In spite of the fact that Polymorphic Torus has the same wiring complexity as mesh, the comparison shows that in all of the vision tasks under examination it achieves complexity lower than or at most equal to hypercube, which is the most powerful among the interconnection networks considered.

13.
Passive network monitoring is the basis for a multitude of systems that support the robust, efficient, and secure operation of modern computer networks. Emerging network monitoring applications are more demanding in terms of memory and CPU resources due to the increasingly complex analysis operations that are performed on the inspected traffic. At the same time, as the traffic throughput in modern network links increases, the CPU time that can be devoted to processing each network packet decreases. This leads to a growing demand for more efficient passive network monitoring systems in which runtime performance becomes a critical issue. In this paper we present locality buffering, a novel approach for improving the runtime performance of a large class of CPU and memory intensive passive monitoring applications, such as intrusion detection systems, traffic characterization applications, and NetFlow export probes. Using locality buffering, captured packets are reordered by clustering packets with the same port number before they are delivered to the monitoring application. This results in improved code and data locality and, consequently, in an overall increase in the packet processing throughput and a decrease in the packet loss rate. We have implemented locality buffering within the widely used libpcap packet capturing library, which allows existing monitoring applications to transparently benefit from the reordered packet stream without modifications. Our experimental evaluation shows that locality buffering significantly improves the performance of popular applications, such as the Snort IDS, which exhibits a 21% increase in the packet processing throughput and is able to handle 67% higher traffic rates without dropping any packets.
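
The reordering idea described in this abstract (clustering captured packets that share a port number before handing them to the application) can be sketched roughly as follows. This is a minimal illustration and not the libpcap implementation: the `Packet` type, the choice of destination port as the clustering key, and the fixed `batch_size` flush policy are all assumptions made for this example.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Packet(NamedTuple):
    """Hypothetical captured packet; only the clustering key and a payload."""
    dst_port: int
    payload: bytes


def locality_buffer(packets: Iterable[Packet], batch_size: int) -> Iterator[Packet]:
    """Buffer up to batch_size packets, then emit them grouped by port.

    Within a group the arrival order is preserved; groups are emitted in
    the order in which their port first appeared in the batch.
    """
    groups: Dict[int, List[Packet]] = {}
    buffered = 0
    for pkt in packets:
        groups.setdefault(pkt.dst_port, []).append(pkt)
        buffered += 1
        if buffered == batch_size:
            for group in groups.values():
                yield from group
            groups.clear()
            buffered = 0
    # Flush whatever is left when the input ends.
    for group in groups.values():
        yield from group


captured = [Packet(80, b"a"), Packet(53, b"b"), Packet(80, b"c"),
            Packet(53, b"d"), Packet(22, b"e")]
reordered = list(locality_buffer(captured, batch_size=4))
print([p.dst_port for p in reordered])
```

Delivering same-port packets back to back means the application tends to run the same protocol-handling code and touch the same data for consecutive packets, which is the locality effect the abstract describes. A real implementation would also need a timeout so that packets are not held indefinitely on a quiet link; that is omitted here.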

14.
Hierarchical ring-based multiprocessor systems are attractive and enjoy several advantages over other types of systems. They ensure unique paths between nodes, simple node interfaces, and simple cross-ring connections. Furthermore, employing point-to-point links allows the system to run at a high clock rate, which increases bandwidth and decreases latency. This paper investigates the performance of hierarchical ring-based shared-memory multiprocessors. Rings in the hierarchy are composed of point-to-point, unidirectional links and apply the Scalable Coherent Interface (SCI) protocol. We place special emphasis on the impact of locality on processor and interconnection design issues such as the number of outstanding requests and the ring topology. We find that in order to exploit the power of hierarchical multiprocessors an accurate and appropriate model of locality must be used. Hierarchical multiprocessors that are well balanced (uniform) tend to provide lower latency and higher system throughput. For non-uniform systems, a high degree of locality is required for the hierarchies to perform well. However, restricting the number of outstanding transactions per processor is important in decreasing packet latency and avoiding network contention.

15.
Due to advances in fiber optics and VLSI technology, interconnection networks that allow simultaneous broadcasts are becoming feasible. Distributed shared memory (DSM) implementations on such networks promise high performance even for small applications with small granularity. This paper, after summarizing the architecture of one such implementation called the Simultaneous Multiprocessor Optical Exchange Bus (SOME-Bus), presents simple algorithms for improving the performance of parallel programs running on the SOME-Bus multiprocessor implementing cache-coherent DSM. The algorithms are based on run-time data redistribution via dynamic page migration protocol. They use memory access references together with the information of average channel utilization, average channel waiting time, number of messages in the channel queue or short-term average channel waiting time reported by each node and gathered by hardware monitors to make correct decisions related to the placement of shared data. Simulations with four parallel codes on a 64-processor SOME-Bus show that the algorithms yield significant performance improvements such as reduction in the execution times, number of remote memory accesses, average channel waiting times, average network latencies and increase in average channel utilizations.

16.
In the study of multiprocessor networks, attention must be paid to the effects of packaging constraints on cost and performance as well as the physical hierarchy imposed by packaging. Given wiring costs and constraints, hierarchical or clustered networks are a promising approach to building multiprocessor interconnects. We examine a simple approach to clustering: local buses connect processors within a cluster, and intercluster networks such as k-ary n-cubes and generalized shuffle–exchange networks provide connections between clusters. We perform detailed queueing analysis and simulations and compare the performance of a variety of flat and clustered networks. Previously, hierarchical networks were believed to be beneficial only for traffic with good locality. Under wiring cost models, we find that hierarchical networks may have superior performance to nonhierarchical networks, even under uniform traffic (which previously was believed to be a “worst case” scenario for hierarchical topologies). Relative performance is dependent on the cost constraint used, the system configuration and message granularity. The choice of a multiprocessor interconnect must take into consideration all these factors.

17.
Performance modeling of Cartesian product networks (cited by: 1; self-citations: 0; citations by others: 1)
This paper presents a comprehensive performance model for fully adaptive routing in wormhole-switched Cartesian product networks. Besides the generality of the model, which makes it suitable for any product graph, experimental (simulation) results show that the proposed model exhibits high accuracy even in heavy traffic and in the saturation region, where other models have severe problems predicting the performance of the network. Most popular interconnection networks, including the mesh, hypercube, and torus networks, can be defined as a Cartesian product of two or more networks. Torus and mesh networks are the most popular topologies used in recent supercomputing parallel machines. They have also been widely used for realizing on-chip networks in recent on-chip multicore and multiprocessor systems.

18.
Local area networks (LANs) have received a lot of attention recently, primarily because they offer a very convenient, fast, and cost-effective means of communication within a relatively small geographical area. The token passing ring protocol is one of the popular access methods for communication among stations connected via a local area ring network. In this method, access to the shared transmission medium is controlled by a token. In earlier ring networks, the transmission time of a data packet used to be longer than the time to go around the ring once (this time is called the ring latency). However, with the recent advances in transmission technology, faster and faster transmission media are being used, and the transmission time of a data packet has become so small that it sometimes takes much less time to transmit a data packet than to go around the ring. In this paper, we present a comparative performance evaluation of three service operations in an environment where the ring latency is larger than the transmission time of a data packet. The performance of these service operations is evaluated by computer simulation. The simulation results are presented in terms of normalized average packet delay.

19.
A multistage bus network (MBN) is proposed to overcome some of the shortcomings of the conventional multistage interconnection networks (MINs), single bus, and hierarchical bus interconnection networks. The MBN consists of multiple stages of buses connected in a manner similar to the MINs and has the same bandwidth at each stage. A switch in an MBN is similar to a MIN switch except that there is a single bus connection instead of a crossbar. MBNs support bidirectional routing, and there exist a number of paths between any source and destination pair. The authors develop self-routing techniques for the various paths, present an algorithm to route a request along the path with minimum distance, and analyze the probabilities of a packet taking different routes. Further, they derive a performance analysis of a synchronous packet-switched MBN in a distributed shared memory environment and compare the results with those of an equivalent bidirectional MIN (BMIN). Finally, they present the execution time of various applications on the MBN and the BMIN through an execution-driven simulation. They show that the MBN provides similar performance to a BMIN while offering simplicity in hardware and more fault tolerance than a conventional MIN.

20.
This paper suggests a speed boosting technique for system area networks in massive parallel multiprocessor computers by decreasing the diameter and increasing the throughput of a pair of opposite simplex rings (a duplex ring), a couple and quadruple of such rings. The result is achieved through replacing a duplex ring with a pair of minimal switched multidimensional rings with different steps in each ring. The decreased diameter and increased throughput of rings appreciably reduce packet delivery delays in system area networks based on the pairs of such rings.
