1.

Formal automatic verification of cache coherence in multiprocessors with relaxed memory models

Fong Pong Dubois M. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(9):989-1006

State-based, formal methods have been successfully applied to the automatic verification of cache coherence in sequentially consistent systems. However, coherence in shared memory multiprocessors under a relaxed memory model is much more complex to verify automatically. With relaxed memory models, incoming invalidations and outgoing updates can be delayed in each cache while processors are allowed to race ahead. This buffering of memory accesses considerably increases the amount of state in each cache and the complexity of protocol interactions. Moreover, because caches can hold inconsistent copies of the same data for long periods of time, coherence cannot be verified by simply checking that cached copies are identical at all times. This paper makes two major contributions. First, we demonstrate how to model and verify cache coherence under a relaxed memory model in the context of state-based verification methods. Frameworks for modeling the hardware and for generating correct memory access sequences driving the hardware model are developed. We also show the correctness properties that must be verified on the hardware model. Second, we demonstrate a successful application of a state-based verification tool called SSM for the verification of the delayed protocol, an aggressive protocol for relaxed memory models. SSM is based on an abstraction technique preserving the properties to verify. We show that with classical, explicit approaches the verification of cache coherence is realistically infeasible because of the state space explosion problem, whereas SSM is able to verify protocols at both the behavioral and message-passing levels.

2.

A Discussion of Cache Coherence in Multiprocessor Systems

盘清文谢玉和《计算机工程》1993,19(2):53-58

3.

A Memory Replacement Mechanism and Its Implementation

A·Chaib 胡铭曾《计算机工程与设计》2000,21(6):24-27

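This paper proposes an algorithm that addresses cache coherence and synchronization in PACT01, a new processor architecture that incorporates a dynamically programmable logic array (DPGA), and that also addresses fast context switching for multithreading support and fast user-level operations. The memory replacement mechanism is a hardware method for handling cache coherence and for replacing data from local or remote memory into the cache on a cache miss. Conflicts arise when parallel threads write and read, or read or write, the same location. The paper adopts an associative mapping policy together with the least recently used (LRU) algorithm, which on a cache miss replaces the least recently used block. To implement LRU, a counter is associated with each block.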
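
The per-block counter idea in the abstract above can be illustrated with a small software model. This is only a sketch under stated assumptions: the abstract does not say how the counters are updated, so the aging rule used here (increment every counter on each access, reset the referenced block to zero, evict the largest) is one common textbook choice, and the class name `CounterLRUCache` and its methods are hypothetical.

```python
class CounterLRUCache:
    """Fully associative cache model with one age counter per block.

    Assumption: counters record the age since the last reference;
    the paper's actual hardware update rule is not given in the abstract.
    """

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.tags = [None] * num_blocks    # address tag held by each block
        self.counters = [0] * num_blocks   # age since last reference

    def access(self, tag):
        """Reference `tag`; return True on a hit, False on a miss."""
        hit = tag in self.tags
        if hit:
            slot = self.tags.index(tag)
        elif None in self.tags:
            slot = self.tags.index(None)   # fill an empty block first
        else:
            # Miss with a full cache: evict the block with the largest age.
            slot = max(range(self.num_blocks), key=self.counters.__getitem__)
        # Age every block, then reset the block that was just referenced.
        for i in range(self.num_blocks):
            self.counters[i] += 1
        self.tags[slot] = tag
        self.counters[slot] = 0
        return hit
```

With two blocks, the access sequence A, B, A, C leaves A and C resident, because B is the least recently used block when C misses. Real hardware would likely use small saturating counters rather than unbounded integers.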

4.

The DASH prototype: Logic overhead and performance (total citations: 1; self-citations: 0; citations by others: 1)

Lenoski D. Laudon J. Joe T. Nakahira D. Stevens L. Gupta A. Hennessy J. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(1):41-61

The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. The hardware overhead of directory-based cache coherence in a 48-processor system is examined. The data show that the overhead is only about 10-15%, which appears to be a small cost for the ease of programming offered by coherent caches and the potential for higher performance. The performance of the system is discussed, and the speedups obtained by a variety of parallel applications running on the prototype are shown. Using a sophisticated hardware performance monitor, the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup are characterized. The optimizations incorporated in the DASH protocol are evaluated in terms of their effectiveness on parallel applications and on atomic tests that stress the memory system.

5.

Efficient methods for formally verifying safety properties of hierarchical cache coherence protocols

Xiaofang Chen Yu Yang Ganesh Gopalakrishnan Ching-Tsun Chou 《Formal Methods in System Design》2010,36(1):37-64

Multicore architectures are considered inevitable, given that sequential processing hardware has hit various limits. Unfortunately, the memory system of multicore processors is a huge bottleneck. To combat this problem, one needs to design aggressively optimized cache coherence protocols. This introduces the design correctness problem for advanced cache coherence protocols, which will be hierarchically organized for scalable designs. Experience shows that monolithic formal verification will not scale to hierarchical designs. Hence, one needs to handle the complexity of several coherence protocols running concurrently, i.e. hierarchical protocols, using compositional techniques.

6.

Efficient Integration of Compiler-Directed Cache Coherence and Data Prefetching

Hock-Beng Lim Pen-Chung Yew 《Journal of Parallel and Distributed Computing》2001,61(12):1775

Cache coherence enforcement and memory latency reduction and hiding are very important and challenging problems in the design of large-scale distributed shared-memory (DSM) multiprocessors. We propose an integrated approach to solve these problems through a compiler-directed cache coherence scheme called the Cache Coherence with Data Prefetching (CCDP) scheme. The CCDP scheme enforces cache coherence by prefetching the potentially stale references in a parallel program. It also prefetches the non-stale references to hide their memory latencies. To optimize the performance of the CCDP scheme, some prefetch hardware support is provided to efficiently handle these two forms of data prefetching operations. We also developed the compiler techniques utilized by the CCDP scheme for stale reference detection, prefetch target analysis, and prefetch scheduling. We evaluated the performance of the CCDP scheme via execution-driven simulations of several numerical applications from the SPEC CFP95 and the Perfect benchmark suites. The simulation results show that the CCDP scheme provides significant performance improvements for the applications studied, comparable to those obtained with a full-map hardware cache coherence scheme.

7.

An argument for simple COMA

Ashley Saulsbury Tim Wilkinson John Carter Anders Landin 《Future Generation Computer Systems》1995,11(6):553-566

We present design details and some initial performance results of a novel scalable shared memory multiprocessor architecture. This architecture features the automatic data migration and replication capabilities of cache-only memory architecture (COMA) machines, without the accompanying hardware complexity. A software layer manages cache space allocation at page granularity, similarly to distributed virtual shared memory (DVSM) systems, leaving simpler hardware to maintain shared memory coherence at cache-line granularity.

By reducing the hardware complexity, the machine cost and development time are reduced. We call the resulting hybrid hardware and software multiprocessor architecture Simple COMA. Preliminary results indicate that the performance of Simple COMA is comparable to that of more complex contemporary all-hardware designs.

8.

Transactional coherence and consistency: simplifying parallel hardware and software (total citations: 1; self-citations: 0; citations by others: 1)

Hammond L. Carlstrom B.D. Wong V. Chen M. Kozyrakis C. Olukotun K. 《Micro, IEEE》2004,24(6):92-103

Transactional coherence and consistency (TCC) simplifies parallel hardware and software design by eliminating the need for conventional cache coherence and consistency models and letting programmers parallelize a wide range of applications with a simple, lock-free transactional model. TCC eases both parallel programming and parallel architecture design by relying on programmer-defined transactions as the basic unit of parallel work, communication, memory coherence, and memory consistency.

9.

A Hardware Transactional Memory System That Supports the Operating System

王绍刚吴丹庞征斌杨晓东《计算机工程与科学》2010,32(1):123-126

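This paper analyzes the problems that previously proposed hardware transactional memory systems have in supporting thread switching and migration, memory page replacement, and system calls, and proposes DTM, a new implementation scheme that supports the operating system. DTM uses a separate, transaction-oriented coherence protocol to detect and arbitrate conflicts between transactions. Compared with conventional transactional memory systems, DTM is simpler to implement, scales better, and offers performance advantages.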

10.

Energy-efficient synonym data detection and consistency for virtual cache

《Microprocessors and Microsystems》2016

The cache memory consumes a large proportion of the energy used by a processor. In the on-chip cache, the translation lookaside buffer (TLB) accounts for 20–50% of the energy consumption. To reduce the energy consumed by TLB accesses, a virtual cache can be accessed directly with the virtual addresses issued by a processor. However, a virtual cache may result in the synonym problem. In this paper, we propose low-cost synonym detection hardware and a synonym data coherence mechanism. These reduce the energy consumption incurred by TLB lookups, and maintain synonym data consistency in the virtual cache. To save energy, the proposed synonym detection hardware efficiently reduces the number of blocks that must be looked up in a virtual cache. In addition, the proposed synonym data coherence mechanism reduces the number of invalidated blocks in the virtual cache to prevent the destruction of cache locality. The simulation results show that our proposed energy-aware virtual cache consumes 51%, 27%, and 20% less energy than the traditional physical cache, traditional virtual cache, and synonym lookaside buffer (SLB), respectively. In addition, our proposed design shows almost the same static energy consumption as SLB, and reduces static energy consumption by about 20% compared with the traditional physical cache and virtual cache.

11.

A Directory-Based Cooperative Cache Design for CMPs

赵小雨吴俊敏隋秀峰王庆波唐轶轩《计算机工程》2010,36(21):283-285

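The design and management of the L2 cache is one of the key factors affecting the performance of chip multiprocessors. Building on private L2 caches, this paper proposes a cooperative cache design based on a centralized coherence directory. It optimizes processor performance by managing on-chip storage resources effectively, giving the cooperative cache low average memory access latency, a low cache miss rate, and good scalability. Experimental results show that, compared with a shared L2 cache design, the cooperative cache improves the throughput of a 4-core processor by 13.5% on average, at a hardware overhead of about 8.1%.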

12.

A Survey of Methods for Improving the Performance of Sparse Cache Directories in Multi-Core Processors

吴健虢陈海燕刘胜邓让钰陈俊杰《计算机工程与科学》2019,41(3):385-392

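Constrained by power consumption, general-purpose microprocessors stopped pursuing higher clock frequencies more than a decade ago and turned instead to integrating more processor cores. Meanwhile, as transistor density has kept rising in line with Moore's law, the number of cores that can be integrated on a single chip has multiplied, and on-chip multi-core and many-core processors have become the mainstream of high-performance microprocessor development. Support for the shared-memory programming model in future thousand-core general-purpose many-core processors is an inevitable trend, but conventional cache coherence directory structures suffer from high lookup latency, frequent directory entry replacement, and limited scalability in hardware cost and power. Sparse directories strike a trade-off between the hardware overhead of conventional directory structures and the efficiency of coherence maintenance, and are regarded as an energy-efficient, scalable structure for maintaining cache coherence in many-core processors. This paper surveys recent research and methods for improving sparse directory performance, analyzes them in terms of area, access latency, power, and implementation complexity, and summarizes the strengths and shortcomings of each, providing a useful reference for designing the shared-memory architecture of future high-performance many-core processors.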

13.

A Lock-Based Cache Coherence Protocol for Scope Consistency (total citations: 5; self-citations: 2; citations by others: 5)

Hu Weiwu Shi Weisong Tang Zhimin Li Ming 《计算机科学技术学报》1998,13(2):97-109

Directory protocols are widely adopted to maintain cache coherence of distributed shared memory multiprocessors. Although scalable to a certain extent, directory protocols are complex enough to prevent them from being used in very large scale multiprocessors with tens of thousands of nodes. This paper proposes a lock-based cache coherence protocol for scope consistency. It does not rely on directory information to maintain cache coherence. Instead, cache coherence is maintained by requiring the releasing processor of a lock to store all write notices generated in the associated critical section to the lock, and the acquiring processor invalidates or updates its locally cached data copies according to the write notices of the lock. To evaluate the performance of the lock-based cache coherence protocol, a software DSM system named JIAJIA is built on a network of workstations. Besides the lock-based cache coherence protocol, JIAJIA is also characterized by its shared memory organization scheme, which combines the physical memories of multiple workstations to form a large shared space. Performance measurements with the SPLASH2 program suite and NAS benchmarks indicate that, compared to recent SVM systems such as CVM, higher speedup is achieved by JIAJIA. Besides, JIAJIA can solve large scale problems that cannot be solved by other SVM systems due to memory size limitation.
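
The release/acquire exchange of write notices described in the abstract above can be modeled in a few lines. This is a minimal single-threaded sketch, not JIAJIA's implementation: it assumes the invalidate variant (the abstract also allows updates), tracks notices per address rather than per page, and does not model mutual exclusion itself. The names `Lock`, `Processor`, and their methods are hypothetical.

```python
class Lock:
    """A lock that carries the write notices of past critical sections."""

    def __init__(self):
        self.write_notices = set()   # addresses written under this lock


class Processor:
    """One node with a local cache in front of a shared home memory."""

    def __init__(self, memory):
        self.memory = memory         # shared home memory (a dict)
        self.cache = {}              # locally cached copies
        self.dirty = set()           # addresses written since the last release

    def acquire(self, lock):
        # Invalidate every local copy named in the lock's write notices.
        for addr in lock.write_notices:
            self.cache.pop(addr, None)

    def read(self, addr):
        if addr not in self.cache:   # miss: fetch the copy from home memory
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value
        self.dirty.add(addr)

    def release(self, lock):
        # Write modified data home and store the write notices in the lock.
        for addr in self.dirty:
            self.memory[addr] = self.cache[addr]
        lock.write_notices |= self.dirty
        self.dirty.clear()
```

In this model, a processor that cached `x` before another processor's critical section keeps reading its stale copy until it acquires the same lock. That behavior is the point of the sketch: no directory tracks sharers, and coherence is enforced only at lock acquisition.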

14.

Maintaining Cache Coherence through Compiler-Directed Data Prefetching

Hock-Beng Lim Pen-Chung Yew 《Journal of Parallel and Distributed Computing》1998,53(2):170

In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. The Cache Coherence With Data Prefetching (CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme on five applications from the SPEC CFP95 and CFP92 benchmark suites, and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency.

15.

Processor- and memory-based checkpoint and rollback recovery

Bowen N.S. Pradhan D.K. 《Computer》1993,26(2):22-31

Several hardware-based techniques that support checkpoint and rollback recovery are presented. The focus is on hardware schemes for uniprocessors, shared-memory multiprocessors, and distributed virtual-memory systems. A taxonomy for processor and memory techniques based on the memory hierarchy is presented. This provides a basis for understanding subtle differences among the various schemes. Two classes of schemes are discussed: processor-based schemes, which handle transient faults by using processor-based transparent rollback techniques, and memory-based schemes, which roll back data instead of instructions and can be integrated with the processor techniques or exploited by higher levels of software.

16.

Murphi Modeling and Verification of the Godson-T Cache Coherence Protocol

周琰《计算机系统应用》2013,22(10):124-128

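The Godson-T cache coherence protocol is the cache coherence protocol used in the Godson-T many-core processor. In the Godson-T protocol, the cache coherence protocol is tightly coupled with the memory consistency model. Analysis of the protocol shows that the coherence it provides is not strong coherence and does not meet the traditional requirement that caches be transparent. We chose the Murphi model checker as both our modeling language and our verification tool. Because of these characteristics, modeling the protocol required treating the processor core nodes, the caches, and memory as a single whole. With that model we successfully verified the relevant properties of the protocol.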

17.

Two proposals for the inclusion of directory information in the last-level private caches of glueless shared-memory multiprocessors

Alberto Ros Ricardo Fernández-Pascual Manuel E. Acacio José M. García 《Journal of Parallel and Distributed Computing》2008

In glueless shared-memory multiprocessors where cache coherence is usually maintained using a directory-based protocol, the fast access to the on-chip components (caches and network router, among others) contrasts with the much slower main memory. Unfortunately, directory-based protocols need to obtain the sharing status of every memory block before coherence actions can be performed. This information has traditionally been stored in main memory, and therefore these cache coherence protocols are far from being optimal. In this work, we propose two alternative designs for the last-level private cache of glueless shared-memory multiprocessors: the lightweight directory and the SGluM cache. Our proposals completely remove directory information from main memory and store it in the home node’s L2 cache, thus reducing both the number of accesses to main memory and the directory memory overhead. The main characteristics of the lightweight directory are its simplicity and the significant improvement in the execution time for most applications. Its drawback, however, is that the performance of some particular applications could be degraded. On the other hand, the SGluM cache offers more modest improvements in execution time for all the applications by adding some extra structures that cope with the cases in which the lightweight directory fails.

18.

Compiler analysis for cache coherence: interprocedural array data-flow analysis and its impact on cache performance

Choi L. Pen-Chung Yew 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(9):879-896

In this paper, we present compiler algorithms for detecting references to stale data in shared-memory multiprocessors. The algorithm consists of two key analysis techniques, stale reference detection and locality preserving analysis. While the stale reference detection finds the memory reference patterns that may violate cache coherence, the locality preserving analysis minimizes the number of such stale references by analyzing both temporal and spatial reuses. By computing the regions referenced by arrays inside loops, we extend the previous scalar algorithms for more precise analysis. We develop a full interprocedural array data-flow algorithm, which performs both bottom-up side-effect analysis and top-down context analysis on the procedure call graph to further exploit locality across procedure boundaries. The interprocedural algorithm eliminates cache invalidations at procedure boundaries, which were assumed in the previous compiler algorithms. We have fully implemented the algorithm in the Polaris parallelizing compiler. Using execution-driven simulations on Perfect Club benchmarks, we demonstrate how unnecessary cache misses can be eliminated by the automatic stale reference detection. The algorithm can be used to implement cache coherence in shared-memory multiprocessors that do not have hardware directories, such as the Cray T3D.

19.

Synergistic design of an application-oriented sparse directory on many-core embedded systems

《Journal of Systems Architecture》2017

As many-core embedded systems are evolving from single-memory based designs to systems-on-a-chip running on an on-chip network, implementing a cache coherence mechanism in large-scale many-core embedded systems turns out to be a technical challenge. Existing coherence mechanisms are difficult to scale beyond tens of cores, because they require either excessive area or energy, complex hierarchical protocols, or inexact representations of sharer sets. In this paper, we present a hardware-software synergistic design of a cache coherence mechanism by considering OS-level application allocation and hardware-level coherence operations. The proposed application-oriented sparse directory (AoSD) cooperates with a contiguous allocation algorithm to isolate cache coherence traffic and thereby reduce interference among applications. The proposed micro-architecture of sharer set representations is area-efficient; moreover, it can also be configured dynamically to track a flexible and exact sharer set. We verify our design by analyzing memory requirements of different cache organizations and implementing our design on the popular simulator Graphite to evaluate cache coherence traffic improvement. The results show that our design is area-efficient and improves memory network performance by 11.74%–28.72%. They also indicate that our design can feasibly scale up to embedded systems with thousands of cores.

20.

A Survey of Research on Optimizing Cache Coherence Protocols for Chip Multiprocessors

胡森森计卫星王一拙陈旭付文飞石峰《软件学报》2017,28(4):1027-1047

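With modern transistor technology, integrating multiple processors on a single chip has become a reality. In recent years, as the number of cores integrated in multi-core processors has kept growing, the cache coherence problem has become prominent; it is now one of the performance bottlenecks of multi-core processors and urgently needs to be solved. This paper describes the origin of the coherence problem in chip multiprocessors, summarizes the key issues in designing cache coherence protocols in the multi-core era, and surveys recent academic research on coherence. It describes recent directions in optimizing the performance of cache coherence protocols in terms of program memory access patterns, directory organization, coherence granularity, coherence protocol traffic, and the scalability of directory protocols. It also discusses open problems in the design of cache coherence protocols for chip multiprocessors and points out directions for future research.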