期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

潘国腾窦强谢伦国《计算机工程与科学》2008,30(6):131-133

基于CC-NUMA结构的DSM多处理器系统是大规模高性能并行计算机的一个实现方式,由于比监听协议具有更好的扩展性,系统多采用基于目录的Cache一致性协议。但是,随着系统规模的不断扩大,目录协议同样面临着可扩展性的问题。本文在分析影响目录协议可扩展性因素的基础上,对当前比较典型的几种目录组织形式从存储开销方面进行了讨论,最后提出了基于目录Cache的两级目录组织方案。相似文献

2.

环连接CMP模拟器：Godson-Ring

下载免费PDF全文

曹非《计算机工程与应用》2013,49(9):13-18

片上互连结构和cache一致性协议是片上多核处理器（CMP）设计的关键。为了探索使用环形互连结构CMP的cache一致性协议设计空间,需要使用对环形互连结构和cache一致性协议进行精确模拟的CMP模拟器平台。Godson-Ring是一个环连接CMP的用户态模拟器平台,采用功能和时序相分离的模拟方式,使用了事件驱动和执行驱动相结合的方法,周期精确地模拟了环形互连结构和cache一致性协议的硬件行为。该模拟器具有速度快和灵活性高的特点,能模拟多种cache一致性协议,可以快速、有效地探索环连接CMP的cache一致性协议设计空间。相似文献

3.

片上多处理器末级Cache优化技术研究

李浩谢伦国《计算机研究与发展》2012,(Z1):172-179

片上多核技术的出现给处理器的设计和实现带来很多挑战,片上存储系统的设计就是其中最重要的方面之一.为了缓解日益严峻的存储墙问题,研究者们通常在片上放置大容量末级Cache,片上末级Cache设计和优化技术已成为当前的研究热点.介绍了片上多处理器(CMP)末级Cache设计面临的挑战,然后分别介绍了以私有设计和共享设计为基础的多种CMP末级Cache优化技术,并对它们进行了比较分析. 相似文献

4.

多核处理器Cache一致性协议关键技术研究

黄安文张民选《计算机工程与科学》2009,31(Z1)

多核处理器规模的不断扩大和核间通信机制的日益复杂,使得Cache一致性维护变得更加困难。本文从多核处理器Cache一致性问题的产生背景出发,分析监听协议、目录协议、Token协议和Hammer协议的实现机制以及在多核环境中的优缺点,分别从一致性协议与片上互连结构协同设计、面向低功耗应用的协议优化策略、Cache一致性协议验证及容错机制等角度考虑,对未来多核处理器Cache一致性协议设计的发展趋势和技术挑战进行详细分析与讨论。相似文献

5.

CMP中基于目录的协作Cache设计方案

下载免费PDF全文

赵小雨吴俊敏隋秀峰王庆波唐轶轩《计算机工程》2010,36(21):283-285

片上多处理器中二级Cache的设计和管理是影响其性能的关键因素之一。在私有二级Cache的基础上,提出一种基于集中式一致性目录的协作Cache设计方案,通过有效地管理片上存储资源来优化处理器的性能,从而使该协作Cache具有平均访存延迟小、Cache缺失率低、可扩展性好等优点。实验结果显示,与共享二级Cache设计相比,协作Cache可以将4核处理器的吞吐量平均提高13.5%,而其硬件开销约为8.1%。相似文献

6.

异构多处理器系统Cache一致性解决方案

田芳姜秀柱王书芹李娜《微计算机信息》2008,24(29)

SoC技术的发展使多个异构的处理器集成到一个芯片成为可能,这种结构已成为提高微处理器性能的重要途径.与传统的多处理器系统一样,Cache一致性问题也是片内异构多处理器系统必须首先解决的问题.本文在分析Cache一致性问题的基础上,对采用不同监听协议的多处理器的集成,以牺牲简单的硬件为代价来完成一致性协议的转化.将此方法并入多处理器芯片封装内来管理,可保证在异构多处理器系统中数据的一致性. 相似文献

7.

可交换数据Cache结构的CMP:EDCA-CMP

陈建党郭松柳王海霞汪东升《小型微型计算机系统》2007,28(7):1331-1333

随着集成电路工艺技术的飞速发展,单芯片多处理器（Single-chip Multiprocessor,CMP）结构将是一种有效利用片上晶体管资源、提高系统性能的有效途径.CMP中各个内核通过共享同级存储装置共享数据,如共享一级Cache,共享二级Cache等.可交换数据Cache结构的CMP（Exchangeable Data Cache Architecture,EDCA-CMP）通过交换一级数据Cache的内容共享数据Cache,降低对下级存储的访问延迟,提高数据Cache的命中率,获得较高的性能. 相似文献

8.

片上多核处理器共享末级缓存动静结合地址映射机制

曹非刘志勇《计算机科学》2012,39(8):304-310

片上多核处理器(CMP)通常采用私有或者共享的末级高速缓存(cache)结构,而共享末级cache一般使用静态地址映射机制。该机制将各处理器临时私有访问的数据映射于分布在其他处理器的末级cache中,使得各处理器对临时私有数据的访问延时增加。针对该问题,提出了一种动静结合的共享末级cache地址映射方法。该方法可将原来静态映射于其他处理器末级cache中的临时私有数据动态映射于访问者处理器的本地末级cache中,减少了大量静态映射所造成的长延时非本地末级cache访问,从而有效降低了整个共享末级cache的访问延时,在提高性能的同时降低了功耗和带宽使用。实验结果表明,动静结合的地址映射方式应用于采用环连接互连结构和侦听顺序环协议的CMP结构时,可获得的平均性能提升为9%,最大性能提升为38%。相似文献

9.

基于总线监听的Cache一致性协议分析

汤伟李俊峰《福建电脑》2009,25(7):58-59

片内多处理器系统是当前计算机体系结构研究的热点问题之一。与传统的多处理机系统一样,Cache一致性问题也是片内多处理器系统必须首先解决的问题。本文首先介绍了片内多处理器系统中的Cache一致性问题及其解决方法,然后着重讨论了两种基于总线监听的Cache一致性协议：MSI协议和MESI协议,并对它们进行了分析比较。相似文献

10.

一种递归定义的可扩展片上网络拓扑结构 总被引：1，自引：0，他引：1

朱晓静《计算机学报》2011,34(5):924-930

晶体管工艺的持续发展导致片上处理器数的逐渐增多,片上系统的核间通信要求吞吐量高、延时低、可扩展性好,传统的片上总线和crossbar互连结构已无法满足片上系统的通信需求,为此研究者提出新的片上互连结构,称为片上网络.为满足片上网络的特有通信需求,提出了一种可扩展的拓扑结构Rgrid及其路由算法DR,它缩短了片上处理器间... 相似文献

11.

An optical interconnection network and a modified snooping protocol for the design of large-scale symmetric multiprocessors (SMPs)

Louri A. Kodi A.K. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(12):1093-1104

In symmetric multiprocessors (SMPs), the cache coherence overhead and the speed of the shared buses limit the address/snoop bandwidth needed to broadcast transactions to all processors. As a solution, a scalable address subnetwork called symmetric multiprocessor network (SYMNET) is proposed in which address requests and snoop responses of SMPs are implemented optically. SYMNET not only uses passive optical interconnects that increases the speed of the proposed network, but also pipelines address requests at a much faster rate than electronics. This increases the address bandwidth for snooping, but the preservation of cache coherence can no longer be maintained with the usual snooping protocols. A modified coherence protocol, coherence in SYMNET (COSYM), is introduced to solve the coherence problem. COSYM was evaluated with a subset of Splash-2 benchmarks and compared with the electrical bus-based MOESI protocol. The simulation studies have shown a 5-66 percent improvement in execution time for COSYM as compared to MOESI for various applications. Simulations have also shown that the average latency for a transaction to complete using COSYM protocol was 5-78 percent better than the MOESI protocol. It is also seen that SYMNET can scale up to hundreds of processors while still using fast snooping-based cache coherence protocols, and additional performance gains may be attained with further improvement in optical device technology. 相似文献

12.

Multiple-bus shared-memory system: Aquarius project

Carlton M. Despain A. 《Computer》1990,23(6):80-83

A multiple-bus architecture called a multi-multi is presented. The architecture is designed to handle several dimensions with a moderate number of processors per bus. It provides scaling to a large number of processors in a system. A key characteristic of the architecture is the large amount of bandwidth it provides. Each node in the architecture contains a microprocessor, memory, and a cache. The cache-coherence protocol for the multi-multi architecture combines features of snooping cache schemes, to provide consistency on individual buses, with features of directory schemes, to provide consistency between buses. The snooping cache component can take advantage of the low-latency communication possible on shared buses for efficiency, yet the complete protocol will support many more processors than a single bus can. The resulting protocol naturally extends cache coherence from a multi to a multi-multi. Cache and directory states are described. Concepts that allow efficient performance, namely, local sharing, root node, and bus addresses in the directory, are discussed 相似文献

13.

Design and Simulation of the Aquarius-II Multiprocessor

Vason P. Srini Tam M. Nguyen Darren R. Busing Mike J. Carlton Bruce K. Holmer Georges E. Smine Alvin M. Despain 《Journal of Systems Integration》1997,7(2):151-178

Aquarius-II is a cache coherent multiprocessor system designed for the parallel execution of Prolog programs. It contains two tiers of memory: synchronization memory and high bandwidth (HB) memory. The synchronization memory consists of snooping caches connected to a bus and is used to store rendezvous points, synchronization bits, synchronization variables such as locks and semaphores and most of the write shared data. The HB memory is used to store the bulk of the application program code and data. It contains caches and an inexpensive VLSI chip based crossbar interconnection network to memory. The caches connected to the crossbar do not have full snooping capability. The architecture is evaluated by a full simulation of parallel execution of Prolog programs on Aquarius-II. The design details of the components of the architecture and simulation results are presented. Simulation results indicate that the two tier memory system significantly reduces memory interference and speeds up synchronization when compared to a single bus multi. This shared memory multiprocesor architecture has the potential to support other parallel programming paradigms. 相似文献

14.

基于Simics的分布式一致性协议仿真

郑志硕郑存陆曹宏徙《计算机与现代化》2011,(9):105-108

用于多种计算机系统和指令系统仿真的Virtutech Simics只提供一个简单的顺序扁平侦听式高速缓存一致性（Snoo-ping Cache Coherence Protocol）模型支持MESI协议,从而制约了可仿真的并行处理器个数。以下将基于目录的分布式高速缓存一致性协议（Distributed Directory-based Cache Coherence Protocol）模型应用于Simics中并给出基于Simics的分布式一致性协议的仿真结果。这一结果证实分布式协议能降低事件总数,减少网络中的事件。本文提出一个简单的基于目录的分布式高速缓存一致性协议,从而解决制约Simics的可扩放性问题。相似文献

15.

分布式RAID中Cache模块的设计

冯晶朱兰娟吴智铭《计算机应用》2005,25(2):475-477

针对分布式RAID的特殊架构，设计了基于总线侦听方法的Cache模块。该模块采用主存分块映射策略来解决总线侦听方法，由于共享网络总线对带宽要求太高，使用较少带宽、较少的数据操作，提高了分布式RAID的系统性能。对Cache模块设计进行了性能分析，对多处理机系统Cache一致性问题的解决方案进行了分析比较。相似文献

16.

SimTile:片状多核处理器的高效模拟器(英文)

刘涛季振洲王庆《计算机科学与探索》2010,4(12):1115-1120

传统的基于共享总线的多核芯片随着核心数增加产生了瓶颈问题。新型TiledCMP(chip multiprocessor)的结构设计中,片上核心互联网络对提高扩展能力和执行效率起到了重要作用。为了实现低延迟、高带宽的核心通信,高速点对点网络方式的片上多核互联结构模拟成为研究的热点。抽象片上Tiled方式16核功能单元结构,设计实现了SimTile模拟器,可提供配置灵活、功能单元齐全的片上多核处理器设计,支持高效率的全局共享缓存、高速片上路由结构。模拟器采用模块化的组件配置方式,片上核心数量与互联网络结构、数据一致性协议、全局寄存器通信与cache共享模式等,均可通过精简的参数调整。实验表明模拟器执行效率较高,为片上多核研究提供了灵活、高效并具备可扩展性的新平台。相似文献

17.

Design and implementation of dual processor block with shared external cache memory

Soo-Won Kim Hanseok Ko Woo-Jong Hahn Jong-Sik HahmAuthor vitae 《Microprocessors and Microsystems》1997,20(10):595-605

The availability of low cost, high performance microprocessors has led to various designs of shared memory multiprocessor systems. As a result, commercial products which are based on shared memory have been proliferated. Such a multiprocessor system is heavily influenced by the structure of memory system and it is not difficult to find that most configurations include local cache memories. The more processors a system carries, the larger local cache memory is needed to maintain the traffic to and from the shared memory at reasonable level. The implementation of local cache memories, however, is not a simple task because of environmental limitations. In particular, the general lack of board space availability presents a formidable problem. A cache memory system usually needs space mostly to support its complex control logic circuits for the cache itself and network interfaces like snooping logic circuits for shared bus. Although packaging can be made denser to reduce system size, there are still multiple processors per board. It requires a more area-efficient cache memory architecture. This paper presents a design of shared cache for dual processor board of bus-based symmetric multiprocessors. The design and implementation issues are described first and then the evaluation and measurement results are discussed. The shared cache proposed in this paper has been determined to be quite area-efficient without the significant loss of throughput and scalability. It has been implemented as a plug-in unit for TICOM, a prevalent commercial multiprocessor system. 相似文献

18.

The Stanford Hydra CMP 总被引：5，自引：0，他引：5

Hammond L. Hubbert B.A. Siu M. Prabhu M.K. Chen M. Olukolun K. 《Micro, IEEE》2000,20(2):71-84

The Hydra chip multiprocessor (CMP) integrates four MIPS-based processors and their primary caches on a single chip together with a shared secondary cache. A standard CMP offers implementation and performance advantages compared to wide-issue superscalar designs. However, it must be programmed with a more complicated parallel programming model to obtain maximum performance. To simplify parallel programming, the Hydra CMP supports thread-level speculation and memory renaming, a paradigm that allows performance similar to a uniprocessor of comparable die area on integer programs. This article motivates the design of a CMP, describes the architecture of the Hydra design with a focus on its speculative thread support, and describes our prototype implementation. Chip multiprocessors offer an economical, scalable architecture for future microprocessors. Thread-level speculation support allows them to speed up past software 相似文献

19.

基于共享总线的多处理器cache一致性的硬件实现*

李均晓张盛兵沈绪榜《计算机应用研究》2008,25(6):1890-1893

龙腾R2微处理器是西北工业大学航空微电子中心设计的采用PowerPC体系结构,具有自主知识产权的R ISC微处理器。为了扩展其多处理器的功能,采用总线侦听的方法来维护多处理器环境下的cache一致性。首先介绍了共享总线侦听技术以及侦听协议,然后详细介绍了龙腾R2微处理器的总线侦听部件的实现方案,对几类cache一致性的实现方案以及性能进行了评析。FPGA实验结果表明,总线侦听部件能高效而准确地保证多处理器系统的cache一致性。相似文献

20.

A Scalable Distributed Shared Memory Architecture

Krishnamoorthy S. Choudhary A. 《Journal of Parallel and Distributed Computing》1994,22(3)

Scalability of a multiprocessor architecture depends on its ability to manage interconnection network latency with increasing number of processors. Interconnection network latency can be minimized by reducing the distance traversed by a message in terms of number of nodes and wire lengths. Scalability of a DSM architecture also depends on the scalability of the coherency protocol and the associated directory storage requirements. In this paper we describe a DSM architecture based on a fat tree interconnection network with augmented switching nodes. The proposed architecture is CC-NUMA, but supports several important features of COMA architectures. The scalability of this architecture is enhanced by integrating routing and cache coherency operations, which helps in improving locality by trapping requests locally. Scalability of a DSM architecture is defined and evaluated in terms of the asymptotic speedup of an algorithm with increasing number of processors. 相似文献