首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
Hierarchical ring-based multiprocessor systems are attractive and enjoy several advantages over other type of systems. They ensure unique paths between nodes, simple node interfaces and simple cross-ring connections. Furthermore, employing point-to-point links allows the system to run at high clock rate which increases bandwidth and decreases latency. This paper investigates the performance of hierarchical ring-based shared-memory multiprocessors. Rings in the hierarchy are composed of point-to-point, unidirectional links and apply the Scalable Coherent Interface (SCI) protocol. We pay special emphasis on the impact of locality on processor and interconnection design issues such as number of outstanding requests, and ring topology. We find that in order to exploit the power of hierarchical multiprocessors an accurate and appropriate model of locality must be used. Hierarchical multiprocessors that are well balanced (uniform) tend to provide lower latency and higher system throughput. For non-uniform systems, high degree of locality is required for the hierarchies to perform well. However, restricting the number of outstanding transactions per processor is important in decreasing packets latency and avoiding network contention.  相似文献   

2.
Research on multiprocessor interconnection networks has primarily focused on wormhole switching, virtual channel flow control, and routing algorithms to enhance their performance. The rationale behind this research is that by alleviating the network latency for high network loads, the overall system performance would improve; many studies have used synthetic workloads to support this claim. However, such workloads may not necessarily capture the behavior of real applications. In this paper, we have used parallel applications for a closer examination of the network behavior. In particular, the performance benefit from enhancing a 2D mesh with virtual channels (VCs) and a fully adaptive routing algorithm is examined with a set of shared-memory and message passing applications. Execution time and average message latency of shared memory applications are measured using execution-driven simulation and by varying many architectural attributes that affect the network workload. The communication traces of message passing applications, collected on an IBM-SP2, are used to run a trace-driven simulation of the mesh architecture to obtain message latency. Simulation results show that VCs and adaptive routing can reduce the network latency to varying degrees depending on the application. However, these modest benefits do not translate to significant improvements in the overall execution time because the load on the network is not high enough to exploit the advantages of the network enhancements. Moreover, this benefit may be negated if the architectural enhancements increase the network cycle time. Rather, emphasis should be placed on improving the raw network bandwidth and faster network interfaces  相似文献   

3.
Even though there have been strong research activities about distributed virtual shared-memory (DVSM) systems, their architectures have been not widely used in current high-performance computing markets. The reason is that the previously introduced DVSM systems use conventional interconnection technologies like Ethernet, which incurs high execution overhead due to process interruption at data communication for memory consistency. In this paper, we present the DVSM architecture based on the next generation of an interconnection technique, the InfiniBand Architecture (IBA). Because the IBA supports shared-memory programming semantics by means of remote direct-memory access (RDMA) and atomic operations in hardware, we can minimize the communication overhead for memory consistency on the DVSM system. For characterizing multithreaded applications on our IBA-based DVSM system, we examined two different shared-memory programming models, i.e. SPMD and OpenMP benchmarks. We show that our DVSM to use full features of the IBA can improve the performance significantly over the IPoIB-based DVSM system in all benchmarks, and also comparable to the bus-based shared-memory multiprocessor system in some benchmarks.  相似文献   

4.
Loops are the single largest source of parallelism in many applications. One way to exploit this parallelism is to execute loop iterations in parallel on different processors. Previous approaches to loop scheduling attempted to achieve the minimum completion time by distributing the workload as evenly as possible while minimizing the number of synchronization operations required. The authors consider a third dimension to the problem of loop scheduling on shared-memory multiprocessors: communication overhead caused by accesses to nonlocal data. They show that traditional algorithms for loop scheduling, which ignore the location of data when assigning iterations to processors, incur a significant performance penalty on modern shared-memory multiprocessors. They propose a new loop scheduling algorithm that attempts to simultaneously balance the workload, minimize synchronization, and co-locate loop iterations with the necessary data. They compare the performance of this new algorithm to other known algorithms by using five representative kernel programs on a Silicon Graphics multiprocessor workstation, a BBN Butterfly, a Sequent Symmetry, and a KSR-1, and show that the new algorithm offers substantial performance improvements, up to a factor of 4 in some cases. The authors conclude that loop scheduling algorithms for shared-memory multiprocessors cannot afford to ignore the location of data, particularly in light of the increasing disparity between processor and memory speeds  相似文献   

5.
We consider a network of workstations (NOW) organization consisting of bus-based multiprocessors interconnected by a high latency and high bandwidth interconnect, such as ATM, on which a shared-memory programming model using a multiple-writer distributed virtual shared-memory system is imposed. The latencies associated with bringing data into the local memory are a severe performance limitation of such systems. To make the access latencies tolerable, we propose a novel prefetch approach and show how it can be integrated into the software-based coherence layer of a multiple-writer protocol. This approach uses the access history of each page to guide which pages to prefetch. Based on detailed architectural simulations and seven scientific applications we find that our prefetch algorithm can remove a vast majority of the remote operations, which improves the performance of all applications. We also find that the bandwidth provided by ATM switches available today is sufficient to accommodate prefetching. However, the protocol processing overhead of available ATM interfaces limits the gain of the prefetching algorithms.  相似文献   

6.
Sisal (Streams and Iterations in Single Assignment Language) is a general-purpose applicative language intended for use on both conventional and novel multiprocessor systems. In this report we discuss the project's objectives, philosophy, and accomplishments and state our future plans. Four significant results of the Sisal project are compilation techniques for high-performance parallel applicative computation, a microtasking environment that supports dataflow on conventional shared-memory architectures, execution times comparable to those of Fortran, and cost-effective speedup on shared-memory multiprocessors.  相似文献   

7.
随着单个芯片上集成的处理器的个数越来越多,传统的电互连网络已经无法满足对互连网络性能的需求,需要一种新的互连方式,因此光互连网络技术应运而生.目前,电互连的片上网络在功耗、性能、带宽、延迟等方面遇到了瓶颈,而光互连作为一种新的互连方式引用到片上网络具有低损耗、高吞吐率、低延迟等无可比拟的优势.本文主要探讨了片上光网络的...  相似文献   

8.
Large-scale distributed shared-memory multiprocessors (DSMs) provide a shared address space by physically distributing the memory among different processors. A fundamental DSM communication problem that significantly affects scalability is an increase in remote memory latency as the number of system nodes increases. Remote memory latency, caused by accessing a memory location in a processor other than the one originating the request, includes both communication latency and remote memory access latency over I/O and memory buses. The proposed architecture reduces remote memory access latency by increasing connectivity and maximizing channel availability for remote communication. It also provides efficient and fast unicast, multicast, and broadcast capabilities, using a combination of aggressively designed multiplexing techniques. Simulations show that this architecture provides excellent interconnect support for a highly scalable, high-bandwidth, low-latency network.  相似文献   

9.
Disk arrays and shared-memory multiprocessors are new technologies that are rapidly becoming pervasive. They are complementary because disk arrays naturally balance the I/O workload by interleaving data across all disks while a shared-memory multiprocessor balances the processing workload across multiple processors. In this paper, we examine how disk arrays and shared memory multiprocessors lead to an effective method for constructing database machines for general-purpose complex query processing. We show that disk arrays can lead to cost-effective storage systems if they are configured from suitably small formfactor disk drives. We introduce the storage system metricdata temperature (IO/s/Gbyte) as a way to evaluate how well a disk configuration can sustain its workload, and we show that disk arrays can sustain the same data temperature as a more expensive mirrored-disk configuration. We use the metric to evaluate the performance of disk arrays in XPRS, an operational shared-memory multiprocessor database system being developed at the University of California, Berkeley.  相似文献   

10.
Shared-memory multiprocessor performance is strongly affected by factors such as sequential code, barriers, cache coherence, virtual memory paging, and the multiprocessor system itself with resource scheduling and multiprogramming. Several timing models and analysis for these effects are presented. A modified Ware model based on these timing models is given to evaluate comprehensive performance of a shared-memory multiprocessor. Performance measurement has been done on the Encore Multimax, a shared-memory multiprocessor. The evaluation models are the analyses based on a general shared-memory multiprocessor system and architecture and can be applied to other types of shared-memory multiprocessors. Analytical and experimental results give a clear understanding of the various effects and a correct measure of the performance, which are important for the effective use of a shared-memory multiprocessor  相似文献   

11.
In symmetric multiprocessors (SMPs), the cache coherence overhead and the speed of the shared buses limit the address/snoop bandwidth needed to broadcast transactions to all processors. As a solution, a scalable address subnetwork called symmetric multiprocessor network (SYMNET) is proposed in which address requests and snoop responses of SMPs are implemented optically. SYMNET not only uses passive optical interconnects that increases the speed of the proposed network, but also pipelines address requests at a much faster rate than electronics. This increases the address bandwidth for snooping, but the preservation of cache coherence can no longer be maintained with the usual snooping protocols. A modified coherence protocol, coherence in SYMNET (COSYM), is introduced to solve the coherence problem. COSYM was evaluated with a subset of Splash-2 benchmarks and compared with the electrical bus-based MOESI protocol. The simulation studies have shown a 5-66 percent improvement in execution time for COSYM as compared to MOESI for various applications. Simulations have also shown that the average latency for a transaction to complete using COSYM protocol was 5-78 percent better than the MOESI protocol. It is also seen that SYMNET can scale up to hundreds of processors while still using fast snooping-based cache coherence protocols, and additional performance gains may be attained with further improvement in optical device technology.  相似文献   

12.
Commercial transaction processing applications are an important workload running on symmetric multiprocessor systems (SMPs). They differ dramatically from scientific, numeric-intensive, and engineering applications because they are I/O bound, and they contain more system software activities. Most SMP servers available in the market have been designed and optimized for scientific and engineering workloads. A major challenge of studying architectural effects on the performance of a commercial workload is the lack of easy access to large-scale and complex database engines running on a multiprocessor system with powerful I/O facilities. Experiments involving case studies have been shown to be highly time-consuming and expensive. In this paper, we investigate the feasibility of using queuing network models with the support of simulation to study the SMP architectural impacts on the performance of commercial workloads. We use the commercial benchmark TPC-C as the workload. A bus-based SMP machine is used as the target platform. Queueing network modeling is employed to characterize the TPC-C workload on the SMP. The system components such as processors, memory, the memory bus, I/O buses, and disks are modeled as service centers, and their effects on performance are analyzed. Simulations are conducted as well to collect the workload-specific parameters (model parameterization) and to verify the accuracy of the model. Our studies find that among disk-related parameters, the disk rotation latency affects the performance of TPC-C most significantly. Among I/O buses and number of disks, the number of I/O buses has the deepest impact on performance. This study also demonstrates that our modeling approach is feasible, cost-effective, and accurate for evaluating the performance of commercial workloads on SMPs, and it is complementary to the measurement-based experimental approaches.  相似文献   

13.
Data communication among processors is a key problem to increase performance of parallel algorithms for multiprocessors. In this paper, we investigate different types of domain decomposition methods on the distributed memory multiprocessor and give the estimates of their speedup and efficiency based on the computational complexity and the communication overheads.  相似文献   

14.
Many scientific applications are comprised of irregular reductions on large data sets. In shared-memory parallel programs, these irregular reductions are typically computed in parallel using replicated buffers, then combined using synchronization. We develop L W , a new technique which partitions irregular reductions so that each processor computes values only for locally assigned data, eliminating the need for buffers or synchronized writes. Computation is replicated if its results are needed on multiple processors. We experimentally evaluate its performance for three irregular codes on a software DSM running on a distributed-memory multiprocessor and two shared-memory multiprocessors while varying connectivity, locality, and adaptivity. Results show L W improves performance significantly compared to using replicated buffers, and can match or exceed explicit message-passing gather/scatter for applications with low locality or high adaptivity.  相似文献   

15.
Multis, shared-memory multiprocessors that are implemented with single buses and snooping cache protocols are inherently limited to a small number of processors, and, as systems grow beyond a single bus, the bandwidth requirements of broadcast operations limit scalability. Hardware support to provide cache coherence without the use of broadcast can become very expensive. An approach to maintaining coherence using approximate information held in special-purpose caches called pruning-caches that provides robust performance over a wide range of workloads is presented. The pruning-cache approach is compared to the more conventional inclusion cache for providing multilevel inclusion (MLI) in the cache hierarchy. It is shown that pruning-caches are more cost-effective and more robust. Using both analysis and simulation, it is also shown that the k-ary n-cube topology provides scalable, bottleneck-free communication for uniform, point-to-point traffic  相似文献   

16.
In general, message passing multiprocessors suffer from communication overhead between processors and shared memory multiprocessors suffer from memory contention. Also, in computer vision tasks, data I/O overhead limits performance. In particular, high level vision tasks, which are complex and require nondeterministic communication, are strongly affected by these disadvantages. This paper proposes a flexibly (tightly/loosely) coupled hypercube multiprocessor (FCHM) for high level vision to alleviate these problems. A variable address space memory scheme in which a set of adjacent memory modules can be merged into a shared memory module by a dynamically partitionable hypercube topology is proposed. The architecture is quantitatively analyzed using computational models and simulated on the Intel’s Personal SuperComputer (iPSC/I), a hypercube multiprocessor. A parallel algorithm for exhaustive search is simulated on FCHM using the iPSC/I showing significant performance improvements over that of the iPSC/I. This research was supported in part by IBM corporation.  相似文献   

17.
The presence of high-performance mechanisms in shared-memory multiprocessors such as private caches, the extensive pipelining of memory access, and combining networks may render a logical concurrency model complex to implement or inefficient. The problem of implementing a given logical concurrency model in such a multiprocessor is addressed. Two concurrency models are considered, and simple rules are introduced to verify that a multiprocessor architecture adheres to the models. The rules are applied to several examples of multiprocessor architectures  相似文献   

18.
在嵌入式多处理器系统互联中,为满足大量实时数据的传输,要求互联总线具备高带宽、低延迟、高可扩展性和高可靠性等特点,导致传统的共享总线方式成为数据交互的瓶颈。该文对比分析当前流行的互联总线技术,研究StarFabric互联技术的特点,包括网络拓扑结构设计和StarFabric软件技术等,设计并实现基于高速总线的简单通信协议。通过实际应用测试验证其具有高带宽、低延迟的特点。  相似文献   

19.
Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip.For multiprocessors,an efficient communication network that matches the needs of the target application is always critical to the overall performance.Wormhole packet-switching network-on-chip (NoC) solutions are replacing conventional shared buses to deal with scalability and complexity challenges coming along with the increasing number of processing elements (PEs).However,the quest for high performance networks has led to very complex and resource-expensive NoC designs,leaving little room for the real computing force,i.e.,PEs.Moreover,many techniques offer very small performance gains or none at all when network traffic is light while increasing the resource usage of routers.We argue that computation is still the primary task of multiprocessors and sufficient resources should be reserved for PEs.This paper presents our novel design and implementation of a resource-efficient communication network for multiprocessors on FPGAs.We reduce not only the required number of routers for a given number of PEs by introducing a new PE-router topology,but also the resource requirement of each router.Our communication network relies on the NEWS channels to transfer packets in a pipelined fashion following the path determined by the routing network.The implementation results on various Xilinx FPGAs show good performance in the typical range of network load for multiprocessor applications.  相似文献   

20.
Emerging multiprocessor architectures such as chip multiprocessors, embedded architectures, and massively parallel architectures, demand faster, more efficient, and more scalable cache coherence schemes. In devising more cost-efficient schemes, formal insights into a system model is deemed useful. We, in this paper, build formalisms for execution in cache based Distributed shared-memory multiprocessors (DSM) obeying Release Consistency model, and derive conditions for cache coherence. A cost-efficient cache coherence scheme without directories is designed. Our approach relies on processor directed coherence actions, which are early in nature. The scheme exploits sharing information provided by a programmer-centric framework. Per-processor coherence buffers (CB) are employed to impose coherence on live shared variables between consecutive release points in the execution. Simulation of 8 entry 4-way associative CB based system achieves a speedup of 1.07–4.31 over full-map 3-hop directory scheme for six of the SPLASH-2 benchmarks.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号