Similar literature
1.
Parallel computing performance on scalable shared-memory architectures is affected by the structure of the interconnection networks linking processors to memory modules and by the efficiency of the memory/cache management systems. Cache Coherence Nonuniform Memory Access (CC-NUMA) and Cache Only Memory Access (COMA) are two effective memory systems, and the hierarchical ring structure is an efficient interconnection network in hardware. This paper focuses on comparative performance modeling and evaluation of CC-NUMA and COMA on a hierarchical ring shared-memory architecture. Analytical models for the two memory systems for comparative evaluation are presented. Intensive performance measurements on data migrations have been conducted on the KSR-1, a COMA hierarchical ring shared-memory machine. Experimental results support the analytical models, and we present practical observations and comparisons of the two cache coherence memory systems. Our analytical and experimental results show that a COMA system balances the workload well. However, the overhead of frequent data movement may match the gains obtained from improving load balance. We believe our performance results could be further generalized to the two memory systems on a hierarchical network architecture. Although a CC-NUMA system may not automatically balance the load at the system level, it provides an option for a user to explicitly handle data locality for a possible performance improvement.

2.
Multistage Interconnection Networks (MINs) have been widely used for building large-scale shared-memory multiprocessor systems. Complex interactions between many processors and memory modules through the MIN (such as interprocessor communication, process scheduling and synchronization, and remote-memory access) result in a significantly large space of possible performance behavior and potential performance bottlenecks. To provide insight into dynamic system performance, we have developed an integrated data collection, analysis, and visualization environment for a MIN-based multiprocessor system, called MIN-Graph. The MIN-Graph is a graphical instrumentation monitor to aid users in investigating performance problems and in determining an effective way to exploit the high performance capabilities of interconnection network multiprocessor systems. Interconnection network contention is a major bottleneck of parallel computing on MIN-based multiprocessors. This paper focuses on evaluating the contention behavior through performance monitoring and visualization. Four sets of system and scientific application programs with different programming and scheduling models and different memory access patterns are monitored and tested to observe the various network contention behaviors. The MIN-Graph is implemented on the BBN GP1000 and the BBN TC2000.

3.
One type of interconnection network for a medium to large-scale parallel processing system (i.e., a system with 2^6 to 2^16 processors) is a buffered packet-switched multistage interconnection network (MIN). It has been shown that the performance of these networks is satisfactory for uniform network traffic. More recently, several studies have indicated that the performance of MINs is degraded significantly when there is hot spot traffic, that is, when a large fraction of the messages are routed to one particular destination. A multipath MIN is a MIN with two or more paths between all source and destination pairs. This research investigates how the Extra Stage Cube multipath MIN can reduce the detrimental effects of tree saturation caused by hot spots. Simulation is used to evaluate the performance of the proposed approaches. The objective of this evaluation is to show that, under certain conditions, the performance of the network with the usual routing scheme is severely degraded by the presence of hot spots. With the proposed approaches, although the delay time of hot spot traffic may be increased, the performance of the background traffic, which constitutes the majority of the network traffic, can be significantly improved.

4.
Even though there has been strong research activity on distributed virtual shared-memory (DVSM) systems, their architectures have not been widely used in current high-performance computing markets. The reason is that the previously introduced DVSM systems use conventional interconnection technologies like Ethernet, which incur high execution overhead due to process interruption at data communication for memory consistency. In this paper, we present a DVSM architecture based on a next-generation interconnection technique, the InfiniBand Architecture (IBA). Because the IBA supports shared-memory programming semantics by means of remote direct-memory access (RDMA) and atomic operations in hardware, we can minimize the communication overhead for memory consistency on the DVSM system. To characterize multithreaded applications on our IBA-based DVSM system, we examined two different shared-memory programming models, i.e., SPMD and OpenMP benchmarks. We show that our DVSM, which uses the full features of the IBA, improves performance significantly over the IPoIB-based DVSM system in all benchmarks, and is also comparable to the bus-based shared-memory multiprocessor system in some benchmarks.

5.
A Multistage Interconnection Network (MIN) makes it possible to build large-scale shared-memory multiprocessor systems. To provide insight into dynamic system performance, we have developed an integrated data collection, analysis, and data visualization environment for a MIN-based multiprocessor system, called MIN-Graph. MIN-Graph is a graphical instrumentation monitor to aid users in investigating performance problems and in determining an effective way of exploiting the high performance capabilities of interconnection network multiprocessor systems. Our monitor measures, analyzes, evaluates, and displays the events, performance, and overhead of interprocessor communication, process scheduling, remote-memory access, network contention, and other processes on a MIN-based multiprocessor. The graphical monitor is X-window based and implemented on the BBN GP1000 and the BBN TC2000.

6.
In a large-scale shared-memory multiprocessor, there is a possibility of serious contention due to many requests issued concurrently for the same memory location. A multistage combining network, in which each switch is enhanced with combining so that multiple requests directed to the same memory location can form a single request, significantly reduces the amount of contention. However, employing combining in every switch of a multistage interconnection network tends to increase the cost and to slow down the network. In this paper, assuming a single-job environment, we investigate some simple strategies that allow only a limited portion of a network to have a combining capability. We show that for situations with a limited number of hot spot locations, these simple strategies can provide performance comparable to a complete combining network in which every switch is enhanced with combining.

7.
This paper extends research into rhombic overlapping-connectivity interconnection networks into the area of parallel applications. As a foundation for a shared-memory non-uniform access bus-based multiprocessor, these interconnection networks create overlapping groups of processors, buses, and memories, forming a clustered computer architecture where the clusters overlap. This overlapping-membership characteristic is shown to be useful for matching parallel application communication topology to the architecture's bandwidth characteristics. Many parallel applications can be mapped to the architecture topology so that most or all communication is localized within an overlapping cluster, at the low latency of processor direct to cache (or memory) over a bus. The latency of communication between parallel threads does not degrade parallel performance or limit the graininess of applications. Parallel applications can execute with good speedup and scaling on a proposed architecture which is designed to obtain maximum advantage from the overlapping-cluster characteristic, and also allows dynamic workload migration without moving the instructions or data. Scalability limitations of bus-based shared-memory multiprocessors are overcome by judicious workload allocation schemes that take advantage of the overlapping-cluster memberships. Bus-based rhombic shared-memory multiprocessors are examined in terms of parallel speedup models to explain their advantages and justify their use as a foundation for the proposed computer architecture. Interconnection bandwidth is maximized with bi-directional circular and segmented overlapping buses. Strategies for mapping parallel application communication topologies to rhombic architectures are developed. Analytical models of enhanced rhombic multiprocessor performance are developed with a unique bandwidth modeling technique, and are compared with the results of simulation.

8.
A tutorial on dependability and performance-related dependability models for multiprocessors is presented. Multiprocessors are classified as having shared-memory or distributed-memory architectures, and some fundamental dependability modeling concepts are presented. Reliability models based on four types of reliability evaluation techniques (terminal, multiterminal, task-based, and network reliability) are examined. The status of research efforts on performance-related dependability is discussed, and the models' effectiveness is illustrated with a few numerical examples. A brief survey of software packages for dependability computation is included.

9.
《Computer Networks》2003,41(5):641-665
The designs of most systems-on-a-chip (SoC) architectures rely on simulation as a means for performance estimation. Such designs usually start with a parameterizable template architecture, and the design space exploration is restricted to identifying the suitable parameters for all the architectural components. However, in the case of heterogeneous SoC architectures such as network processors, the design space exploration also involves a combinatorial aspect (which architectural components are to be chosen, how they should be interconnected, and task mapping decisions), thereby increasing the design space. Moreover, in the case of network processor architectures there is also an associated uncertainty in terms of the application scenario and the traffic it will be required to process. As a result, simulation is no longer a feasible option for evaluating such architectures in any automated or semi-automated design space exploration process due to the high simulation times involved. To address this problem, in this paper we hypothesize that the design space exploration for network processors should be separated into multiple stages, each having a different level of abstraction. Further, it would be appropriate to use analytical evaluation frameworks during the initial stages and resort to simulation techniques only when a relatively small set of potential architectures is identified. None of the known performance evaluation methods for network processors have been positioned from this perspective. We show that there are already suitable analytical models for network processor performance evaluation which may be used to support our hypothesis. To this end, we choose a reference system-level model of a network processor architecture and compare its performance evaluation results derived using a known analytical model [Thiele et al., Design space exploration of network processor architectures, in: Proc. 1st Workshop on Network Processors, Cambridge, MA, February 2002; Thiele et al., A framework for evaluating design tradeoffs in packet processing architectures, in: Proc. 39th Design Automation Conference (DAC), New Orleans, USA, ACM Press, 2002] with the results derived by detailed simulation. Based on this comparison, we propose a scheme for the design space exploration of network processor architectures where both analytical performance evaluation techniques and simulation techniques have unique roles to play.

10.
Parallel computing scalability evaluates the extent to which parallel programs and architectures can effectively utilize increasing numbers of processors. In this paper, we compare a group of existing scalability metrics and evaluation models with an experimental metric which uses network latency to measure and evaluate the scalability of parallel programs and architectures. To provide insight into dynamic system performance, we have developed an integrated software environment prototype for measuring and evaluating multiprocessor scalability performance, called Scale-Graph. Scale-Graph uses a graphical instrumentation monitor to collect, measure and analyze latency-related data, and to display scalability performance based on various program execution patterns. The graphical software tool is X-Windows based and is currently implemented on standard workstations to analyze performance data of the KSR-1, a hierarchical ring-based shared-memory architecture.

11.
This paper presents a multiprocessor performance prediction methodology supported by experimental measurements, which predicts the execution time of large application programs on large parallel architectures based on a small set of sample data. We propose a graph model to describe application program behavior. In order to precisely abstract an architecture model for the prediction, important and implicit architecture parameters are obtained by experiments. We focus on performance predictions of application programs in shared-memory and data-parallel architectures. Real world applications are implemented using the shared-memory model on the KSR-1 and using the data-parallel model on the CM-5 for performance measurements and prediction validation. We show that experimental measurements provide strong support for performance predictions on multiprocessors with implicit communications and complex memory systems, such as shared-memory and data-parallel systems, while analytical techniques partially applied in the prediction significantly reduce computer simulation and measurement time.

12.
This paper presents bitonic sorting schemes for special-purpose parallel architectures such as sorting networks and for general-purpose parallel architectures such as SIMD and/or MIMD computers. First, bitonic sorting algorithms for shared-memory SIMD and/or MIMD computers are developed. Shared-memory accesses through the interconnection network of shared-memory SIMD and/or MIMD computers can be very time consuming. A scheme is introduced which reduces the number of such accesses. This scheme is based on the parity strategy, which is the main idea of the paper. By reducing the communication through the network, a performance improvement is achieved. Second, a recirculating bitonic sorting network is presented, which is composed of one level of N/2 comparators plus an Ω-network of (log N - 1) switch levels. This network reduces the cost complexity to O(N log N) compared with the O(N log^2 N) of the original bitonic sorting network, while preserving the same time complexity. Finally, a simplified multistage bitonic sorting network is presented. To simplify the interlevel wiring, the parity strategy is used, so N/2 keys are wired straight through the network.
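As a rough illustration of where the cost of the original bitonic sorting network in item 12 comes from, the following Python sketch simulates a plain bitonic network and counts its compare-exchange elements. It is not the paper's parity strategy or its recirculating design; the function name `bitonic_sort` and the counting approach are assumptions made only for this example.

```python
def bitonic_sort(keys):
    """Sort a power-of-two-length list with a simulated bitonic network.

    Returns (sorted_list, comparator_count). Hypothetical helper for
    illustration; not taken from the paper summarized above.
    """
    a = list(keys)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    comparators = 0
    k = 2
    while k <= n:              # k: size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # j: distance between the two compared keys
            for i in range(n):
                partner = i ^ j
                if partner > i:                # visit each pair once
                    comparators += 1
                    ascending = (i & k) == 0   # sort direction of this block
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a, comparators


if __name__ == "__main__":
    result, count = bitonic_sort([5, 3, 8, 1, 7, 2, 6, 4])
    print(result, count)
```

Each (k, j) pass corresponds to one level of N/2 comparators, and there are log N (log N + 1) / 2 such levels, which gives the O(N log^2 N) comparator count of the full network. A recirculating design reuses a single level of N/2 comparators across the passes instead of replicating it.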

13.
Providing highly flexible connectivity is a major architectural challenge for hardware implementation of reconfigurable neural networks. We perform an analytical evaluation and comparison of different configurable interconnect architectures (mesh NoC, tree, shared bus and point-to-point) emulating variants of two neural network topologies (having full and random configurable connectivity). We derive analytical expressions and asymptotic limits for performance (in terms of bandwidth) and cost (in terms of area and power) of the interconnect architectures considering three communication methods (unicast, multicast and broadcast). It is shown that multicast mesh NoC provides the highest performance/cost ratio and consequently it is the most suitable interconnect architecture for configurable neural network implementation. Routing table size requirements and their impact on scalability are analyzed. A modular hierarchical architecture based on multicast mesh NoC is proposed to allow large-scale neural network emulation. Simulation results successfully validate the analytical models and the asymptotic behavior of the network as a function of its size.

14.
《Performance Evaluation》2002,47(2-3):139-162
Shared-memory switches are still in commercial use and in the future will possibly be used in large-scale multistage architectures. The need to handle multicasting is also growing. In this paper analytical models are presented for shared-memory switches with the Replicate-At-Send-distinct address queue multicast scheme. Models for both Random (Bernoulli) and Bursty (Correlated) traffic sources are presented. The models are accurate for various loads, with different ratios of multicast traffic and a range of fanout sizes. They can be further extended to analyse more complex multicast schemes or multistage architectures.

15.
This paper presents two different multistage interconnection network designs for shared-memory multiprocessors that provide unrestricted multicast and notification capabilities. The networks allow efficient synchronization and communication because they conserve network bandwidth by eliminating polling and by performing multicast to multiple recipient processors, as opposed to broadcast or individual messages per recipient processor. Simulation results show that the use of these networks not only decreases synchronization overhead, but also increases network performance for nonsynchronization traffic. The hardware complexity of these schemes is reasonable, making them practical for real systems. Their use in supporting efficient directory-based update or invalidate cache coherence is also discussed.

16.
Pipelined circuit switching (PCS), which combines the advantages of both circuit switching and wormhole switching, is an efficient method for passing messages in interconnection networks. Analytical modelling is a cost-effective tool and plays an important role in achieving a clear understanding of the network performance. However, most of the existing models for PCS are unable to capture the realistic nature of message behaviours generated by real-world applications, which have a significant impact on the design and performance of communication networks. This paper presents a new analytical model for PCS in interconnection networks in the presence of bursty and correlated message arrivals coupled with hot-spot destinations, which can capture the bursty message arrival process and non-uniform distribution of message destinations. Such a traffic pattern has been found in many practical communication environments. The accuracy of the proposed analytical model is validated through extensive simulation experiments. The model is then applied to investigate the effects of the bursty message arrivals and hot-spot destinations on the performance of interconnection networks with PCS.

17.
Pipelined Circuit Switching (PCS) has been suggested as an efficient switching method for supporting interprocessor communication in multicomputer networks due to its ability to preserve both communication performance and fault-tolerant demands in these networks. A number of studies have demonstrated that PCS can exhibit superior performance characteristics over Wormhole Switching (WS) under uniform traffic. However, the performance properties of PCS have not yet been thoroughly investigated in the presence of non-uniform traffic. An analytical model of PCS for common networks (e.g., hypercube) under the uniform traffic pattern has been reported in the literature. A non-uniform traffic model that has attracted much attention is the hot spot model, which leads to extreme network congestion resulting in serious performance degradation due to the tree saturation phenomenon in the network. An analytical model for WS with hot spot traffic has been reported in the literature. However, to the best of our knowledge, no analytical model has been reported for PCS augmented with virtual channels in the presence of hot spot traffic. This paper proposes a model for this switching mechanism using new methods to calculate the probability of message header blocking and hot spot rates on channels. The model makes latency predictions that are in good agreement with those obtained through simulation experiments. An extensive performance comparison using the new analytical model reveals that PCS performs the same as or, on some occasions, worse than WS in the presence of hot spot traffic.

18.
The efficiency of the basic operations of a NUMA (nonuniform memory access) multiprocessor determines the parallel processing performance on a NUMA multiprocessor. The authors present several analytical models for predicting and evaluating the overhead of interprocessor communication, process scheduling, process synchronization, and remote memory access, where network contention and memory contention are considered. Performance measurements to support the models and analyses through several numerical examples have been done on the BBN GP1000, a NUMA shared-memory multiprocessor. Analytical and experimental results give a comprehensive understanding of the various effects, which are important for the effective use of NUMA shared-memory multiprocessors. The results presented can be used to determine optimal strategies in developing an efficient programming environment for a NUMA system.

19.
Multistage interconnection networks (MINs) are a basic class of switch-based network architectures, which are used for constructing scalable parallel computers or for connecting networks. Semi-layer MINs are a special case of MINs. A performance evaluation of semi-layer MINs (using simulation models) is presented in this paper. The network configurations under study apply a conflict drop resolution mechanism. The proposed architecture's performance is studied under uniform traffic conditions and various offered loads, buffer lengths and MIN sizes. In this paper, the improvements in semi-layer MIN performance, in terms of throughput and latency, are demonstrated quantitatively. These performance measures can be valuable tools for designers of parallel multiprocessor systems and networks, in order to minimize the overall deployment costs and help deliver efficient systems.

20.
Geyong  Mohamed   《Performance Evaluation》2005,60(1-4):255-273
The efficiency of a large-scale multicomputer is critically dependent on the performance of its interconnection network. Current multicomputers have widely employed the torus as their underlying network topology for efficient interprocessor communication. In order to ensure a successful exploitation of the computational power offered by multicomputers, it is essential to obtain a clear understanding of the performance capabilities of their interconnection networks under various system configurations. Analytical modelling plays an important role in achieving this goal. This study proposes a concise performance model for computing communication delay in the torus network with circuit switching in the presence of multiple time-scale correlated traffic, which is found in many real-world parallel computation environments and has a strong impact on network performance. The tractability and reasonable accuracy of the analytical model demonstrated by extensive simulation experiments make it a practical and cost-effective evaluation tool to investigate network performance with various alternative design solutions and under different operating conditions.
