期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Improving multiprocessor performance with fine-grain coherence bypass

WANG Hui WANG Rui LUAN ZhongZhi QIAN XueHai QIAN DePei 《中国科学:信息科学(英文版)》2015,(1):84-98

Efficient and scalable cache coherence protocol is crucial to high-performance servers with sharedmemory. The directory-based cache coherence protocol is more desirable than the snooping-based protocol with respect to the scalability. However, even for the former protocol, scaling to a large number of cores is also challenging due to the additional area requirements of the directories. We observed that a significant percentage of the referenced memory blocks were only accessed by a single core(even in parallel applications) which could be considered as private memory blocks. An intuitive motivation from this observation is that memory blocks accessed by a single core do not require coherence maintenance. The issue is to identify the private block and track the change of its access pattern. We propose a novel hardware approach to(1) dynamically identify the shared memory blocks at the cache block level, and(2) bypass the coherence procedure for the private memory blocks. This approach increases the effectiveness of the directory-based approach and therefore improves the system performance. Experimental results showed that, our approach can on an average(1) avoid the coherence tracking of about 54% referenced memory blocks,(2) reduce the coherence overhead by 77%,(3) avoid 8% L2 cache misses, and(4) shorten the execution time of parallel applications by 13%. 相似文献

2.

CCNoC: Cache-Coherent Network on Chip for Chip Multiprocessors 总被引：1，自引：1，他引：0

下载免费PDF全文

王惊雷薛一波王海霞李崇民汪东升《计算机科学技术学报》2010,25(2):257-266

As the number of cores in chip multiprocessors(CMPs) increases,cache coherence protocol has become a key issue in integration of chip multiprocessors.Supporting cache coherence protocol in large chip multiprocessors still faces three hurdles:design complexity,performance and scalability.This paper proposes Cache Coherent Network on Chip(CCNoC),a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in network on chip to free up processors and ... 相似文献

3.

Index-Based Cache Coherence Protocol

Soha S. Zaghloul Najlaa AHuwaishel Maram AlAlwan 《通讯和计算机》2014,(6):479-483

A multiprocessor envirorLment may encounter many problems such as deadlock, load balancing and cache coherence. However, the latter is considered the most dangerous if not properly designed, the system works naturally but generates inaccurate results. This occurs if obsolete versions of a memory block are used. Users may not be aware of the presence of such problem. Two main approaches are known to maintain data consistency： namely, snoopy and directory-based protocols. Each approach has its advantages and limitations. This paper proposes a new technique that considers both previously mentioned approaches. The network architecture is slightly updated by adding an index table to each processor. The proposed protocol is expected to reduce the access time, decrease the number of accesses to main memory, maintain data consistency, and assure the usage of the most recent value of a shared variable. 相似文献

4.

A Framework of Memory Consistency Models 总被引：2，自引：1，他引：2

下载免费PDF全文

Hu Weiwu Shi Weisong Tang Zhimin 《计算机科学技术学报》1998,13(2):110-124

相似文献

5.

NONH:A New Cache-Based Coherence Protocol for Linked List Structure DSM System and Its Performance Evaluation

下载免费PDF全文

Fang Zhiyi Ju Jiubin 《计算机科学技术学报》1996,11(4):405-415

The management of memory coherence is an important problem in distributed shared memory(DSM)system.In a cache-based coherence DSM system using linked list structure,the key to maintaining the coherence and improving system performance is how to manage the owner in the linked list.This paper presents the design of a new management protocol-NONH(New-Owner New-Head)and its performance evaluation.The analysis results show that this protocol can improve the scalability and performence of a coherent DSM system using linked list.It is also suitable for managing the cache coherency in tree-like hierarchical architecture. 相似文献

6.

Effcient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers

下载免费PDF全文

Benjamín Sahelices Agustín de Dios Pablo Ibáez Víctor Vials-Yúfera José María Llabería 《计算机科学技术学报》2012,27(1):75-91

Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrive at the coherence controller. During lock hand-off only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters. 相似文献

7.

Model Checking Data Consistency for Cache Coherence Protocols 总被引：1，自引：0，他引：1

下载免费PDF全文

Hong Pan Hui-Min Lin and Yi Lv 《计算机科学技术学报》2006,21(5):765-775

A method for automatic verification of cache coherence protocols is presented, in which cache coherence protocols are modeled as concurrent value-passing processes, and control and data consistency requirement are described as formulas in first-orderμ-calculus. A model checker is employed to check if the protocol under investigation satisfies the required properties. Using this method a data consistency error has been revealed in a well-known cache coherence protocol. The error has been corrected, and the revised protocol has been shown free from data consistency error for any data domain size, by appealing to data independence technique. 相似文献

8.

OpenMP on Networks of Workstations for Software DSMs 总被引：3，自引：0，他引：3

下载免费PDF全文

章锋陈国良张兆庆《计算机科学技术学报》2002,17(1):0-0

This paper describes the implementation of a sizable subset of OpenMP on networks of workstations(NOWs) and the source-to-source OpenMP complier(AutoPar） is used for the JIAJIA home-based shared virtual memory system (SVM).The paper suggests some simple modifications and extensions to the OpenMP standard for the difference between SVM and SMP(symmetric multi processor),at which the OpenMP specification is aimed.The OpenMP translator is based on an automatic paralleization compiler,so it is possible to check the correctness of the semantics of OpenMP programs which is not required in an OpenMP-compliant implementation AutoPar is measured for five applications including both programs from NAS Parallel Benchmarks and real applications on a cluster of eight Pentium Ⅱ PCs connected by a 100 Mbps switched Eternet.The evaluation shows that the parallelization by annotaing OpenMPdirectives is simple and the performance of generatd JIAJIA code is still acceptable on NOWs. 相似文献

9.

Efficient fine-grained shared buffer management for multiple OpenCL devices

Chang-qing Xun Dong Chen Qiang Lan Chun-yuan Zhang 《浙江大学学报:C卷英文版》2013,14(11):859-872

OpenCL programming provides full code portability between different hardware platforms, and can serve as a good programming candidate for heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers are requested to manage diverse OpenCL-enabled devices explicitly, including distributing the workload between different devices and managing data transfer between multiple devices. All these tedious jobs pose a huge challenge for programmers. In this paper, a distributed shared OpenCL memory （DSOM） is presented, which relieves users of having to manage data transfer explicitly, by supporting shared buffers across devices. DSOM allocates shared buffers in the system memory and treats the on-device memory as a software managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified, shared, invalid cache coherency is implemented for DSOM to maintain coherency for cache buffers. In addition, we propose a novel strategy to minimize communication cost between devices by launching each necessary data transfer as early as possible. This strategy enables overlap of data transfer with kernel execution. Our experimental results show that the applicability of our method for buffer access range analysis is good, and the efficiency of DSOM is high. 相似文献

10.

A Novel Memory Structure for Embedded Systems: Flexible Sequential and Random Access Memory

下载免费PDF全文

Ying Chen Karthik Ranganathan Vasudev V. Pai DavidJ. Lilja and Kia Bazargan 《计算机科学技术学报》2005,20(5):596-606

The on-chip memory performance of embedded systems directly affects the system designers＇ decision about how to allocate expensive silicon area. A novel memory architecture, flexible sequential and random access memory （FSRAM）, is investigated for embedded systems. To realize sequential accesses, small “links”are added to each row in the RAM array to point to the next row to be prefetched. The potential cache pollution is ameliorated by a small sequential access buyer （SAB）. To evaluate the architecture-level performance of FSRAM, we ran the Mediabench benchmark programs on a modified version of the SimpleScalar simulator. Our results show that the FSRAM improves the performance of a baseline processor with a 16KB data cache up to 55%, with an average of 9%; furthermore, the FSRAM reduces 53.1% of the data cache miss count on average due to its prefetching effect. We also designed RTL and SPICE models of the FSRAM, which show that the FSRAM significantly improves memory access time, while reducing power consumption, with negligible area overhead. 相似文献

11.

基于新型Cache一致性协议的共享虚拟存储系统 总被引：11，自引：2，他引：9

胡伟武施巍松唐志敏《计算机学报》1999,22(5):467-475

介绍了一个基于新型Ｃａｃｈｅ一致性协议的共享虚拟存储系统ＪＩＡＪＩＡ,与目前国际上具有代表性的共享虚拟存储系统相比,ＪＩＡＪＩＡ采用了基于ＵＮＭＡ的结构,能够把多个机器的物理地址空间组织成一个更大的共享虚拟地址空间,此外,ＪＩＡＪＩＡ实现了一种基于锁的新型一致性协议,通过附带在锁上的ｗｒｉｔｅ－ｎｏｔｉｃｅ来维护一致性,从而避免了传统的目录协议中由目录引起的存储开销和系统复杂度,利用一些被广泛使用相似文献

12.

Two proposals for the inclusion of directory information in the last-level private caches of glueless shared-memory multiprocessors

Alberto Ros Ricardo Fernández-Pascual Manuel E. Acacio José M. García 《Journal of Parallel and Distributed Computing》2008

In glueless shared-memory multiprocessors where cache coherence is usually maintained using a directory-based protocol, the fast access to the on-chip components (caches and network router, among others) contrasts with the much slower main memory. Unfortunately, directory-based protocols need to obtain the sharing status of every memory block before coherence actions can be performed. This information has traditionally been stored in main memory, and therefore these cache coherence protocols are far from being optimal. In this work, we propose two alternative designs for the last-level private cache of glueless shared-memory multiprocessors: the lightweight directory and the SGluM cache. Our proposals completely remove directory information from main memory and store it in the home node’s L2 cache, thus reducing both the number of accesses to main memory and the directory memory overhead. The main characteristics of the lightweight directory are its simplicity and the significant improvement in the execution time for most applications. Its drawback, however, is that the performance of some particular applications could be degraded. On the other hand, the SGluM cache offers more modest improvements in execution time for all the applications by adding some extra structures that cope with the cases in which the lightweight directory fails. 相似文献

13.

环连接CMP的缓存一致性协议

曹非刘志勇《计算机研究与发展》2009,46(Z2)

片上多核处理器(CMP)已经成为处理器发展的方向,处理器设计的重点也转到了互连网络和存储层次结构方面,其中的一个关键问题是如何维护各处理器各级缓存(Cache)的一致性,该问题在传统的共享存储多处理器中使用Cache一致性协议来解决,而CMP相对于传统的多处理器结构具有更高的片上互连带宽和速度,给Cache一致协议提出了新的要求,也提供了新的改进机会.传统的总线侦听协议存在可扩展性不足和不必要的广播、侦听过多的缺点,而目录协议则存在失效间接延时大和复杂度高、验证困难等问题.环形连接的可扩展性好于总线结构,而其实现复杂度也远小于通常目录协议所使用的包交换点到点网络.将基于环的侦听协议应用于CMP;并考虑利用环的顺序性取消原有协议中冲突引起的重发操作,消除可能的饥饿、死锁和活锁等情况,增加协议的稳定性,同时减少消息流量和功耗;利用片上互连延时短的特点,将侦听结果和侦听请求同时传播,使得处理器可以根据侦听结果来对侦听请求进行选择性的侦听操作,可减少不必要的侦听操作,降低功耗. 相似文献

14.

基于目录的Cache一致性协议的可扩展性研究

下载免费PDF全文

潘国腾窦强谢伦国《计算机工程与科学》2008,30(6):131-133

基于CC-NUMA结构的DSM多处理器系统是大规模高性能并行计算机的一个实现方式,由于比监听协议具有更好的扩展性,系统多采用基于目录的Cache一致性协议。但是,随着系统规模的不断扩大,目录协议同样面临着可扩展性的问题。本文在分析影响目录协议可扩展性因素的基础上,对当前比较典型的几种目录组织形式从存储开销方面进行了讨论,最后提出了基于目录Cache的两级目录组织方案。相似文献

15.

On the Correctness of Program Execution When Cache Coherence Is Maintained Locally at Data-Sharing Boundaries in Distributed Shared Memory Multiprocessors

H. Sarojadevi S. K. Nandy S. Balakrishnan 《International journal of parallel programming》2004,32(5):415-446

Emerging multiprocessor architectures such as chip multiprocessors, embedded architectures, and massively parallel architectures, demand faster, more efficient, and more scalable cache coherence schemes. In devising more cost-efficient schemes, formal insights into a system model is deemed useful. We, in this paper, build formalisms for execution in cache based Distributed shared-memory multiprocessors (DSM) obeying Release Consistency model, and derive conditions for cache coherence. A cost-efficient cache coherence scheme without directories is designed. Our approach relies on processor directed coherence actions, which are early in nature. The scheme exploits sharing information provided by a programmer-centric framework. Per-processor coherence buffers (CB) are employed to impose coherence on live shared variables between consecutive release points in the execution. Simulation of 8 entry 4-way associative CB based system achieves a speedup of 1.07–4.31 over full-map 3-hop directory scheme for six of the SPLASH-2 benchmarks. 相似文献

16.

DP&TB: a coherence filtering protocol for many-core chip multiprocessors

Fengkai Yuan Zhenzhou Ji 《The Journal of supercomputing》2013,66(1):249-261

Future many-core chip multiprocessors (CMPs) will integrate hundreds of processor cores on chip. Two cache coherence protocols are the mainstream applied to current CMPs. The token-based protocol (Token) provides high performance, but it generates a prohibitive amount of network traffic, which translates into excessive power consumption. The directory-based protocol (Directory) reduces network traffic, yet trades off with the storage overhead of the directory as well as entails comparatively low performance caused by indirection limiting its applicability for many-core CMPs. In this work, we present DP&TB, a novel cache coherence protocol particularly suited to future many-core CMPs. In DP&TB, cache coherence is maintained at the granularity of a page, facilitating to filter out either unnecessary coherence inspections for blocks inside private pages or network traffic for blocks inside shared pages. We employ Directory to detect private and shared pages and Token to maintain the coherence of the blocks inside shared pages. DP&TB inherits the merit of Directory and Token and overcome their problems. Experimental results show that DP&TB comprehensively beyond Directory and Token with improvement by 9.1 % in performance over Token and by 13.8 % in network traffic over Directory. In addition, the storage overhead of DP&TB is less than half of that of Directory. Our proposal can fulfill the requirement of many-core CMPs to achieve high performance, power and area efficiency. 相似文献

17.

多核处理器Cache一致性协议关键技术研究

黄安文张民选《计算机工程与科学》2009,31(Z1)

多核处理器规模的不断扩大和核间通信机制的日益复杂,使得Cache一致性维护变得更加困难。本文从多核处理器Cache一致性问题的产生背景出发,分析监听协议、目录协议、Token协议和Hammer协议的实现机制以及在多核环境中的优缺点,分别从一致性协议与片上互连结构协同设计、面向低功耗应用的协议优化策略、Cache一致性协议验证及容错机制等角度考虑,对未来多核处理器Cache一致性协议设计的发展趋势和技术挑战进行详细分析与讨论。相似文献

18.

Integrated Memory Controllers with Parallel Coherence Streams

Chaudhuri M. Heinrich M. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1159-1173

Previous work in scalable hardware distributed shared memory (DSM) multiprocessors has established the critical and dominant role that protocol processing bandwidth (or its inverse, occupancy) plays in determining overall performance in architectures with standalone memory/coherence controllers. However, with recent architectural trends toward integrated (on-chip) memory controllers and the well-known fact that processor frequency is increasing more rapidly than memory systems, we must ask whether parallel coherence processing engines (either multiple integrated protocol processors/cores or multiple protocol threads) are needed in DSM machines constructed from modern processor architectures and, if so, when. We construct a useful analytical model to give the designer insight into when parallel coherence streams will improve performance and verify our model via detailed simulation on 64-threaded microbenchmarks and parallel applications and on single-node multiprogrammed workloads. Surprisingly, and contrary to related work, we find that, in these architectures, adding a second coherence engine has almost no impact on performance. Further, for less-tuned applications that suffer from hot spots (contentious requests to the same memory line), additional engines offer no benefit whatsoever. Even with double the memory bandwidth (or channels), an additional coherence processing stream yields only slight performance improvement. Only for a special class of DSM machines employing directoryless broadcast protocols over unordered interconnects does parallel "snoop" processing offer reasonable performance improvement for communication-intensive applications. Overall, given the architectural trends, this is good news for DSM designers who want to minimize the resources necessary (protocol threads or integrated protocol processor cores for maintaining internode coherence, respectively) to create SMTp-based or multi-CMP-based scalable DSM machines using directory protocols. 相似文献

19.

硬件结构支持的基于同步的高速缓存一致性协议

黄河刘磊宋风龙马啸宇《计算机学报》2009,32(8)

共享存储系统中如何高效地实现高速缓存一致性是体系结构设计面临的一个关键问题和难点问题.已有的基于目录的协议存在难于实现、验证复杂和存储空间开销大等问题.面向片上众核处理器,文中提出一种由硬件结构支持、基于同步的高速缓存一致性协议.该方案不使用目录,而是通过使用bloom-filter表示一致性信息,并在并行程序中的同步点维护高速缓存一致性.与现有的基于目录的高速缓存一致性协议相比,该方案可以降低目录协议的实现、验证复杂度.用SPLASH一2测试程序集评估表明,基于同步的协议可以获得与基于目录的协议相当的性能. 相似文献