
1.

Hardware Transactional Memory Exploration in Coherence-Free Many-Core Architectures

Dimitra Papagiannopoulou Andrea Marongiu Tali Moreshet Luca Benini Maurice Herlihy R. Iris Bahar 《International journal of parallel programming》2018,46(6):1304-1328

High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory architectures that provide a shared memory abstraction subject to non-uniform memory access costs. In order to keep the cores and memory hierarchy simple, many-core embedded systems tend to employ simple, scratchpad-like memories, rather than hardware managed caches that require some form of cache coherence management. These “coherence-free” systems still require some means to synchronize memory accesses and guarantee memory consistency. Conventional lock-based approaches may be employed to accomplish the synchronization, but may lead to both usability and performance issues. Instead, speculative synchronization, such as hardware transactional memory, may be a more attractive approach. However, hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. The lack of a cache-coherence protocol adds new challenges in the design of hardware speculative support. In this article, we present a new scheme for hardware transactional memory (HTM) support within a cluster-based, many-core embedded system that lacks an underlying cache-coherence protocol. We propose two alternative data versioning implementations for the HTM support, Full-Mirroring and Distributed Logging, and we conduct a performance comparison between them. To the best of our knowledge, these are the first designs for speculative synchronization for this type of architecture. Through a set of benchmark experiments using our simulation platform, we show that our designs can achieve significant performance improvements over traditional lock-based schemes.

2.

Supporting faulty banks in NUCA by NoC assisted remapping mechanisms

Kuei-Chung Chang Chen-Yu Chen Chin-Sheng Yu Ching-Wen Chen 《The Journal of supercomputing》2014,67(2):305-323

The many-core SoC is a future trend technology, and the process yield will face many unpredictable challenges. Nonuniform cache architecture (NUCA) can improve the performance of many-core SoC for embedded systems. It embeds a NoC into the cache memory to enhance the data access by distributing traffic loads to several banks in parallel. Providing a fault-tolerant mechanism in NUCA is very important because it lets the chip keep working efficiently when some memory banks are unusable. In this paper, we design a specific router working with static and dynamic cache remapping policies to support faulty banks in NUCA. When an L2 cache bank in NUCA is unusable, the static remapping policy (SRP) selects a suitable neighbor cache bank according to the collected remapping cost to assist with the cache access by considering cache status and traffic status of the router. We also propose a dynamic remapping policy (DRP) to select the suitable cache bank dynamically at runtime to fit the real loading status of neighbor nodes around the faulty bank. The experimental results show that the average improvement of the SRP is approximately 26%, and the average improvement of the DRP is approximately 28%.

3.

Hybrid address spaces: A methodology for implementing scalable high-level programming models on non-coherent many-core architectures

《Journal of Systems and Software》2014

This paper introduces hybrid address spaces as a fundamental design methodology for implementing scalable runtime systems on many-core architectures without hardware support for cache coherence. We use hybrid address spaces for an implementation of MapReduce, a programming model for large-scale data processing, and the implementation of a remote memory access (RMA) model. Both implementations are available on the Intel SCC and are portable to similar architectures. We present the design and implementation of HyMR, a MapReduce runtime system whereby different stages and the synchronization operations between them alternate between a distributed memory address space and a shared memory address space, to improve performance and scalability. We compare HyMR to a reference implementation and we find that HyMR improves performance by a factor of 1.71× over a set of representative MapReduce benchmarks. We also compare HyMR with Phoenix++, a state-of-the-art implementation for systems with hardware-managed cache coherence, in terms of scalability and sustained to peak data processing bandwidth, where HyMR demonstrates improvements of a factor of 3.1× and 3.2× respectively. We further evaluate our hybrid remote memory access (HyRMA) programming model and assess its performance to be superior to that of message passing.

4.

A Hardware-Supported, Synchronization-Based Cache Coherence Protocol

黄河刘磊宋风龙马啸宇《计算机学报》2009,32(8)

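How to implement cache coherence efficiently in shared-memory systems is a key and difficult problem in architecture design. Existing directory-based protocols are hard to implement, complex to verify, and costly in storage. Targeting on-chip many-core processors, this paper proposes a hardware-supported, synchronization-based cache coherence protocol. The scheme uses no directory. Instead, it represents coherence information with bloom-filters and maintains cache coherence at the synchronization points of a parallel program. Compared with existing directory-based cache coherence protocols, the scheme reduces the implementation and verification complexity of directory protocols. Evaluation with the SPLASH-2 benchmark suite shows that the synchronization-based protocol achieves performance comparable to that of directory-based protocols.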
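
The abstract above gives no implementation detail, so the following is only a rough software sketch of the general idea of summarizing written cache lines in a Bloom filter and consulting it at a synchronization point. The class `BloomFilter`, the function `invalidate_at_sync`, the filter size, and the hash construction are all hypothetical choices for illustration, not the paper's hardware design.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over cache-line addresses (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, line_addr):
        # Derive num_hashes bit positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(str(line_addr).encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "little") % self.num_bits

    def add(self, line_addr):
        for pos in self._positions(line_addr):
            self.bits |= 1 << pos

    def may_contain(self, line_addr):
        # True means "possibly written"; False means "definitely not written".
        return all((self.bits >> pos) & 1 for pos in self._positions(line_addr))


def invalidate_at_sync(cached_lines, writer_filter):
    """Return the cached lines that survive a synchronization point.

    Every line the filter reports as possibly written elsewhere is dropped.
    A false positive costs an extra invalidation but never leaves stale data.
    """
    return {addr for addr in cached_lines if not writer_filter.may_contain(addr)}
```

The design point this illustrates is that a Bloom filter has no false negatives, so a compact summary can replace exact per-line sharing information as long as extra invalidations are acceptable.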

5.

DP&TB: a coherence filtering protocol for many-core chip multiprocessors

Fengkai Yuan Zhenzhou Ji 《The Journal of supercomputing》2013,66(1):249-261

Future many-core chip multiprocessors (CMPs) will integrate hundreds of processor cores on chip. Two cache coherence protocols dominate current CMPs. The token-based protocol (Token) provides high performance, but it generates a prohibitive amount of network traffic, which translates into excessive power consumption. The directory-based protocol (Directory) reduces network traffic, but it pays for this with the storage overhead of the directory and with comparatively low performance caused by indirection, which limits its applicability to many-core CMPs. In this work, we present DP&TB, a novel cache coherence protocol particularly suited to future many-core CMPs. In DP&TB, cache coherence is maintained at the granularity of a page, which makes it possible to filter out either unnecessary coherence inspections for blocks inside private pages or network traffic for blocks inside shared pages. We employ Directory to detect private and shared pages and Token to maintain the coherence of the blocks inside shared pages. DP&TB inherits the merits of Directory and Token and overcomes their problems. Experimental results show that DP&TB outperforms both Directory and Token, improving performance by 9.1% over Token and network traffic by 13.8% over Directory. In addition, the storage overhead of DP&TB is less than half of that of Directory. Our proposal can fulfill the requirement of many-core CMPs to achieve high performance, power and area efficiency.

6.

Design and implementation of dual processor block with shared external cache memory

Soo-Won Kim Hanseok Ko Woo-Jong Hahn Jong-Sik Hahm 《Microprocessors and Microsystems》1997,20(10):595-605

The availability of low cost, high performance microprocessors has led to various designs of shared memory multiprocessor systems. As a result, commercial products based on shared memory have proliferated. Such a multiprocessor system is heavily influenced by the structure of its memory system, and most configurations include local cache memories. The more processors a system carries, the larger the local cache memory needed to keep the traffic to and from the shared memory at a reasonable level. The implementation of local cache memories, however, is not a simple task because of environmental limitations. In particular, the general lack of board space availability presents a formidable problem. A cache memory system needs space mostly to support the complex control logic circuits for the cache itself and network interfaces such as snooping logic circuits for a shared bus. Although packaging can be made denser to reduce system size, there are still multiple processors per board, which calls for a more area-efficient cache memory architecture. This paper presents a design of a shared cache for the dual processor board of bus-based symmetric multiprocessors. The design and implementation issues are described first and then the evaluation and measurement results are discussed. The shared cache proposed in this paper has been determined to be quite area-efficient without significant loss of throughput and scalability. It has been implemented as a plug-in unit for TICOM, a prevalent commercial multiprocessor system.

7.

Murphi Modeling and Verification of the Godson-T Cache Coherence Protocol

周琰《计算机系统应用》2013,22(10):124-128

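The Godson-T cache coherence protocol is the cache coherence protocol used in the Godson-T many-core processor. In the Godson-T protocol, the cache coherence protocol and the memory consistency model are tightly coupled. Analysis of the protocol's coherence shows that the cache coherence it satisfies is not strong coherence and does not meet the conventional requirement that caches be transparent. We chose the Murphi model checker as our modeling language and verification tool. Because of these characteristics of the protocol, when modeling the Godson-T cache coherence protocol we had to model the processor core nodes, the caches, and the memory as a whole, and we successfully verified the relevant properties of the protocol.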

8.

A Survey of Cache Coherence Research for Many-Core Processors

韩立敏安建峰高德远樊晓桠任向隆《计算机应用研究》2012,29(11):4011-4016

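Taking the design of coherence protocols for tiled many-core processors as its main thread, this paper surveys recent research in China and abroad on cache coherence for many-core processors. It introduces the influence of different NUCA structures on coherence protocols, analyzes and compares the characteristics of several traditional directory coherence protocols and their problems, and summarizes the design ideas and characteristics of several recent coherence protocols aimed at many-core architectures. Finally, it points out several key design directions for cache coherence protocols that are both application-adaptive and scalable.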

9.

Mosaic: A Scalable Coherence Protocol

Lucia G. Menezo Valentin Puente Pablo Abad Jose-Angel Gregorio 《International journal of parallel programming》2018,46(6):1110-1138

The coherence protocol presented in this work, denoted Mosaic, introduces a new approach to face the challenges of complex multilevel cache hierarchies in future many-core systems. The essential aspect of the proposal is to eliminate the condition of inclusiveness through the different levels of the memory hierarchy while keeping the complexity of the protocol limited. Cost reduction decisions taken to reduce this complexity may introduce artificial inefficiencies in the on-chip cache hierarchy, especially when the number of cores and private cache size is large. Our approach trades area and complexity for on-chip bandwidth, employing an integrated broadcast mechanism in a directory structure. In energy terms, the protocol scales like a conventional directory coherence protocol, but relaxes the shared information inclusiveness. This allows the performance implications of directory size and associativity reduction to be overcome. As it is even simpler than a conventional directory, the results of our evaluation show that the approach is quite insensitive, in terms of performance and energy expenditure, to the size and associativity of the directory.

10.

A Cache Coherence Protocol for Ring-Connected CMPs

曹非刘志勇《计算机研究与发展》2009,46(Z2)

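Chip multiprocessors (CMPs) have become the direction of processor development, and the focus of processor design has shifted to the interconnection network and the memory hierarchy. A key problem is how to keep the caches at every level of every processor coherent. Traditional shared-memory multiprocessors solve this problem with cache coherence protocols, but CMPs offer higher on-chip interconnect bandwidth and speed than traditional multiprocessor structures, which places new demands on cache coherence protocols and also offers new opportunities for improvement. Traditional bus snooping protocols suffer from limited scalability and from unnecessary broadcasts and excessive snooping, while directory protocols suffer from long indirection latency on misses, high complexity, and difficult verification. A ring interconnect scales better than a bus, and its implementation complexity is far lower than that of the packet-switched point-to-point networks that directory protocols usually use. This paper applies a ring-based snooping protocol to CMPs. It uses the ordering property of the ring to eliminate the retransmissions caused by conflicts in the original protocol, removing possible starvation, deadlock, and livelock, which makes the protocol more stable and reduces message traffic and power consumption. It also exploits the short on-chip interconnect latency to propagate snoop results together with snoop requests, so that a processor can snoop selectively according to the snoop result, which reduces unnecessary snoop operations and lowers power consumption.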

11.

A Lock-Based Cache Coherence Protocol for Scope Consistency

Hu Weiwu Shi Weisong Tang Zhimin Li Ming 《计算机科学技术学报》1998,13(2):97-109

Directory protocols are widely adopted to maintain cache coherence of distributed shared memory multiprocessors. Although scalable to a certain extent, directory protocols are complex enough to prevent them from being used in very large scale multiprocessors with tens of thousands of nodes. This paper proposes a lock-based cache coherence protocol for scope consistency. It does not rely on directory information to maintain cache coherence. Instead, cache coherence is maintained by requiring the releasing processor of a lock to store all write-notices generated in the associated critical section to the lock, and the acquiring processor invalidates or updates its locally cached data copies according to the write-notices of the lock. To evaluate the performance of the lock-based cache coherence protocol, a software DSM system named JIAJIA is built on a network of workstations. Besides the lock-based cache coherence protocol, JIAJIA is also characterized by its shared memory organization scheme, which combines the physical memories of multiple workstations to form a large shared space. Performance measurements with the SPLASH2 program suite and NAS benchmarks indicate that, compared to recent SVM systems such as CVM, higher speedup is achieved by JIAJIA. In addition, JIAJIA can solve large scale problems that cannot be solved by other SVM systems due to memory size limitation.
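
The release and acquire steps described above can be sketched in a few lines. This is a minimal single-process model under stated assumptions, not the JIAJIA implementation: the names `Lock` and `Node`, the page-keyed dictionary standing in for home memory, and the choice to invalidate rather than update on acquire are all illustrative, and nested locks are not modeled.

```python
class Lock:
    """A lock that carries the write-notices of its critical sections."""

    def __init__(self):
        self.write_notices = set()  # pages written under this lock


class Node:
    """One processor with a private cache over a shared 'home' memory."""

    def __init__(self, home_memory):
        self.home = home_memory  # dict: page -> value (assumed home copy)
        self.cache = {}          # locally cached pages
        self.dirty = set()       # pages written since the last release

    def read(self, page):
        # A miss fetches the page from home; a hit may return a stale copy.
        if page not in self.cache:
            self.cache[page] = self.home[page]
        return self.cache[page]

    def write(self, page, value):
        self.cache[page] = value
        self.dirty.add(page)

    def acquire(self, lock):
        # Invalidate every local copy named in the lock's write-notices.
        for page in lock.write_notices:
            self.cache.pop(page, None)

    def release(self, lock):
        # Write modified pages back and store their write-notices on the lock.
        for page in self.dirty:
            self.home[page] = self.cache[page]
        lock.write_notices |= self.dirty
        self.dirty = set()
```

In this model a node that never acquires the lock may keep reading a stale copy, which matches the scope-consistency idea that updates become visible through the lock that guarded them.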

12.

High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study

Arslan Munir Farinaz Koushanfar Ann Gordon-Ross Sanjay Ranka 《The Journal of supercomputing》2013,66(1):431-487

Technological advancements in the silicon industry, as predicted by Moore’s law, have resulted in an increasing number of processor cores on a single chip, giving rise to multicore, and subsequently many-core architectures. This work focuses on identifying key architecture and software optimizations to attain high performance from tiled many-core architectures (TMAs), an architectural innovation in multicore technology. Although embedded systems design is traditionally power-centric, there has been a recent shift toward high-performance embedded computing due to the proliferation of compute-intensive embedded applications. The TMAs are suitable for these embedded applications due to low-power design features in many of these TMAs. We discuss the performance optimizations on a single tile (processor core) as well as parallel performance optimizations, such as application decomposition, cache locality, tile locality, memory balancing, and horizontal communication for TMAs. We elaborate compiler-based optimizations that are applicable to TMAs, such as function inlining, loop unrolling, and feedback-based optimizations. We present a case study with optimized dense matrix multiplication algorithms for Tilera’s TILEPro64 to experimentally demonstrate the performance and performance per watt optimizations on TMAs. Our results quantify the effectiveness of algorithmic choices, cache blocking, compiler optimizations, and horizontal communication in attaining high performance and performance per watt on TMAs.

13.

An L2 Cache Partitioning Mechanism for Multithreaded Array Many-Core Processors

陈逸飞朱蕾李宏亮《计算机工程与科学》2019,41(3):400-408

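Array many-core processors are widely used in high-performance computing because of their high computing performance and energy efficiency. To build processors for future high-performance computing systems, however, the severe "memory wall" challenge and the problem of core cooperation must be solved. The cores of a typical array processor are mostly single-threaded to reduce overhead, but this places high demands on memory access. This paper introduces hardware simultaneous multithreading and, to address the low L2 cache utilization observed in experiments with multithreaded single cores, proposes a shared L2 cache partitioning mechanism. In simulation, the optimized shared L2 cache partitioning mechanism lowers the L2 instruction cache miss rate by 18.59% and the data cache miss rate by 6.60%, and improves overall CPI performance by 10.1%.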

14.

High Performance Software Coherence for Current and Future Architectures

《Journal of Parallel and Distributed Computing》1995,29(2):179-195

Shared memory provides an attractive and intuitive programming model for large-scale parallel computing, but requires a coherence mechanism to allow caching for performance while ensuring that processors do not use stale data in their computation. Implementation options range from distributed shared memory emulations on networks of workstations to tightly coupled fully cache-coherent distributed shared memory multiprocessors. Previous work indicates that performance varies dramatically from one end of this spectrum to the other. Hardware cache coherence is fast, but also costly and time-consuming to design and implement, while DSM systems provide acceptable performance on only a limited class of applications. We claim that an intermediate hardware option, memory-mapped network interfaces that support a global physical address space without cache coherence, can provide most of the performance benefits of fully cache-coherent hardware at a fraction of the cost. To support this claim we present a software coherence protocol that runs on this class of machines, and use simulation to conduct a performance study. We look at both programming and architectural issues in the context of software and hardware coherence protocols. Our results suggest that software coherence on NCC-NUMA machines is a more cost-effective approach to large-scale shared-memory multiprocessing than either pure distributed shared memory or hardware cache coherence.

15.

BufferBank: A distributed cache infrastructure for peer-to-peer application

Bin Huang Zhigang Sun Hongyi Chen Jianbiao Mao Ziwen Zhang 《Peer-to-Peer Networking and Applications》2014,7(4):485-496

Peer-to-peer (P2P) systems generate a major fraction of current Internet traffic, which significantly increases the load on ISP networks. To mitigate these negative impacts, many previous works in the literature have proposed caching of P2P traffic, but very few have considered designing a distributed caching infrastructure in the edge network. This paper demonstrates that a distributed caching infrastructure is more suitable than traditional proxy cache servers which cache data on disk, and that it is viable to use the memory of users in the edge network as the cache space. This paper presents the design and evaluation of a distributed network cache infrastructure for P2P applications, called BufferBank. BufferBank provides a number of application interfaces for P2P applications to make full use of the cache space. Three-level mapping is introduced and elaborated to improve the reliability and security of this distributed cache mechanism. Our measurement results suggest that BufferBank can decrease the data obtaining delay, compared with a traditional disk-based P2P cache server.

16.

Cache-based high-level simulation of microthreaded many-core architectures

《Journal of Systems Architecture》2014,60(7):529-552

The accuracy of simulated cycles in high-level simulators is generally lower than in detailed simulators for single-core systems, because high-level simulators simulate the behaviour of components rather than the components themselves as in detailed simulators. The simulation problem becomes more challenging when simulating many-core systems, where many cores are executing instructions concurrently. In these systems data may be accessed from multiple caches and the abstraction of the instruction execution has to consider the dynamic resource sharing on the whole chip. The problem becomes even more challenging in microthreaded many-core systems, because there may exist concurrent hardware threads, which means that the latency of long-latency operations can be tolerated down from many cycles to just a few cycles. We have previously presented a simulation technique to improve the accuracy in high-level simulation of microthreaded many-core systems, known as the Signature-based high-level simulator, which adapts the throughput of the program based on the type of instructions, number of instructions and number of active threads in the pipeline. However, it disregards the access to different levels of the caches on the many-core system. Accessing the L1 cache has far less latency than accessing off-chip memory, and if the core is not able to tolerate latency, different levels of caches cannot be treated equally. The distributed cache network along with the synchronization-aware coherency protocol in the Microgrid is a complicated memory architecture and it is difficult to simulate its behaviour at a high level. In this article we present a high-level cache model, which aims to improve the accuracy in high-level simulators for general-purpose many-core systems by adding little complexity to the simulator and without affecting the simulation speed.

17.

Mbalancer: Dynamic Prediction and Allocation of Virtual Machine Memory Resources

王志钢汪小林靳辛欣王振林罗英伟《软件学报》2014,25(10):2206-2219

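In modern data centers, virtualization plays a major role in resource management, server consolidation, and improving resource utilization, and it has become a key abstraction layer and an important supporting technology in cloud computing architectures. In a virtualized environment, sustaining high resource utilization and system performance requires an efficient memory management method that keeps each virtual machine's physical memory size matched to the changing memory demand of its applications. How to adjust memory resources dynamically within a single machine and across a data center therefore becomes a key problem. This work implements a low-overhead, high-accuracy mechanism for tracking the memory working set, and uses it to drive local or global memory adjustment. Several dynamic memory adjustment techniques are used: ballooning adjusts memory dynamically among the virtual machines on a single host; remote caching schedules memory between physical machines; and virtual machine migration balances virtual machine load across multiple physical hosts. The paper analyzes the strengths and weaknesses of each of these schemes in depth and designs adjustment strategies targeted at the memory overload situation. Experimental data show that the proposed predictive memory resource management method can monitor memory resources online and allocate them dynamically, effectively improving memory utilization in the data center and reducing its energy consumption.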

18.

A tagless cache design for power saving in embedded systems

Ching-Wen Chen Chang-Jung Ku 《The Journal of supercomputing》2012,62(1):174-198

In embedded systems, cache is commonly used to improve system performance. However, the cache consumes a large amount of power, and among the components of the cache memory, tag comparisons consume the most power. Therefore, how to design a cache that does not consume so much power when comparing tags and that has a high hit ratio is an important challenge. In this paper, we propose a Tagless Instruction Cache, called TL-IC, that does not perform tag comparisons in order to save power in embedded systems. To guarantee that an instruction fetched from TL-IC is the desired instruction, the basic blocks of programs, rather than cache lines, are placed into TL-IC. In addition, to utilize TL-IC as much as possible in order to save the most power and to take into account both general-purpose and special-purpose applications, both static allocation and dynamic allocation of basic blocks are used to select the frequently executed basic blocks of programs in TL-IC. With a high utilization of TL-IC that does not perform tag comparisons, the power consumed in fetching instructions can be efficiently reduced. In the simulation results, we show and compare the power consumption of our proposed TL-IC, L0 cache, Linebuffer, and TH-IC.

19.

Optimized Design of the Dynamic-Partition Memory Management Mechanism in an RTOS

叶新栋唐志强涂时亮《单片机与嵌入式系统应用》2009,(9):9-11

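Partitioned storage management is the simplest storage management method that supports multiprogramming. This paper first analyzes how dynamic-partition memory management is implemented in embedded RTOSs, and on that basis proposes a dynamic cache allocation mechanism for small memory blocks that works together with the dynamic-partition mechanism. The mechanism effectively makes up for the shortcomings of dynamic-partition memory management: it reduces the number of external fragments in memory and improves memory utilization and the real-time behavior of allocation. It offers some guidance for the design of embedded RTOS kernels.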
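
The abstract does not say how the small-block mechanism is built, so the following is only one plausible reading, sketched under assumptions: small requests are served from fixed-size blocks carved out of a single partition and recycled through a free list, while larger requests fall back to the general dynamic-partition allocator. The class `SmallBlockCache` and its parameters are hypothetical.

```python
class SmallBlockCache:
    """Serve small requests from fixed-size blocks carved out of one partition."""

    def __init__(self, partition_base, partition_size, block_size):
        self.block_size = block_size
        # Free list of block start addresses; pop and append are O(1),
        # which keeps allocation time bounded.
        self.free_list = [
            partition_base + offset
            for offset in range(0, partition_size - block_size + 1, block_size)
        ]
        self.in_use = set()

    def alloc(self, size):
        # None tells the caller to fall back to the dynamic-partition allocator.
        if size > self.block_size or not self.free_list:
            return None
        addr = self.free_list.pop()
        self.in_use.add(addr)
        return addr

    def free_block(self, addr):
        if addr not in self.in_use:
            raise ValueError("address was not allocated from this cache")
        self.in_use.remove(addr)
        # The block is kept for reuse instead of being merged back,
        # so small allocations never split the larger partitions.
        self.free_list.append(addr)
```

Because every block in the pool has the same size, freeing and reallocating small objects cannot create external fragments inside it, which is the effect the abstract attributes to the proposed mechanism.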

20.

Improving performance of multi-core NUCA coherent systems using NoC-assisted mechanisms

Kuei-Chung Chang Ing-Ming Liao Chiu-Han Liao 《The Journal of supercomputing》2012,62(3):1318-1337

The significant speed-gap between processor and memory makes last-level cache performance crucial for multi-core architectures (MCA). Non-uniform cache architecture (NUCA) has been proposed to overcome the performance limitations of MCA for many embedded applications. The cache is partitioned into sub-banks, with each sub-bank being an independently accessible entity connected with a fast on-chip network (NoC). This paper presents two NoC-assisted mechanisms to improve the performance and power consumption of NUCA coherence. The first mechanism provides priority-based communication based on the wormhole routing architecture to support NUCA coherence. High-priority coherent packets are transmitted first to save time. The second mechanism offers multicasting communication based on the proposed priority-based NoC to provide efficient cache coherency for NUCA. We dispatch and collect coherence packets at the collecting nodes (CN) to further decrease the number of coherent messages flowing in the NoC. Experimental results show that the priority-based transmission can improve performance by approximately 10%. The proposed multicasting mechanism can further improve performance and decrease power consumption of the NoC in NUCA by approximately 15%. Together, the two proposed mechanisms can enhance performance by 25% on average.