1.

曹非刘志勇《计算机科学》2012,39(8):304-310

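Chip multiprocessors (CMPs) typically use either a private or a shared last-level cache (LLC), and a shared LLC generally uses a static address mapping scheme. This scheme maps data that a processor accesses as temporarily private onto LLC slices located at other processors, which increases each processor's access latency to its temporarily private data. To address this problem, a hybrid dynamic-static address mapping method for shared LLCs is proposed. The method dynamically maps temporarily private data, which would otherwise be statically mapped to other processors' LLC slices, into the accessing processor's local LLC. This removes a large number of long-latency non-local LLC accesses caused by static mapping, effectively reducing the overall access latency of the shared LLC and lowering power consumption and bandwidth usage while improving performance. Experimental results show that, when applied to a CMP with a ring interconnect and a snooping ordered-ring protocol, the hybrid mapping achieves an average performance improvement of 9% and a maximum improvement of 38%.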
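The difference between static and hybrid home-slice selection in the abstract above can be illustrated with a minimal sketch. All names and sizes here (`NUM_CORES`, `LINE_BYTES`, the `temporarily_private` flag, a bidirectional ring) are assumptions made for illustration; the abstract does not say how the paper detects temporarily private data, so that decision is simply passed in as an input.

```python
# Hypothetical sketch, not the paper's implementation.
NUM_CORES = 8      # assumed number of cores, one LLC slice per core
LINE_BYTES = 64    # assumed cache line size


def static_home(addr: int) -> int:
    """Static mapping: low-order line-address bits pick the home LLC slice."""
    return (addr // LINE_BYTES) % NUM_CORES


def hybrid_home(addr: int, requester: int, temporarily_private: bool) -> int:
    """Hybrid mapping: temporarily private lines live in the requester's own slice."""
    return requester if temporarily_private else static_home(addr)


def ring_hops(src: int, dst: int) -> int:
    """Shortest distance between two nodes on an (assumed) bidirectional ring."""
    d = abs(src - dst)
    return min(d, NUM_CORES - d)


if __name__ == "__main__":
    addr, core = 0x1000 + 5 * LINE_BYTES, 2
    print("static hops:", ring_hops(core, static_home(addr)))
    print("hybrid hops:", ring_hops(core, hybrid_home(addr, core, True)))
```

With static mapping the request from core 2 travels to whichever slice the address bits select; with the hybrid mapping a line flagged as temporarily private stays in slice 2 and needs no ring traversal.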

2.

Optimization of In-Memory Database Sorting on Shared-Cache Multicore Processors

邓亚丹吴京熊伟景宁《计算机研究与发展》2009,46(Z2)

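For today's mainstream multicore processors, a shared-cache sensitive multithreaded sorting framework (SCS-MSF) for database sorting is proposed. The paper first analyzes the performance bottlenecks that multithreaded QuickSort faces on shared-cache multicore processors. Based on this analysis and on the data access characteristics of each SCS-MSF processing stage, it proposes a multithreaded parallel execution mode for each stage and applies various optimization strategies to improve the cache performance of the threads, in particular by reducing access conflicts when multiple threads access the shared cache. In the experiments, SCS-MSF was implemented on the in-memory database EaseDB. The results show that SCS-MSF has good cache access behavior, which improves the efficiency of multithreaded execution; performance is stable, and database sorting performance improves considerably.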

3.

Implementation of the PLRU Replacement Algorithm in an Embedded System Cache

李洪毛志刚《微处理机》2010,31(1):16-19

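Sound cache design is the main way to narrow the speed gap between processor and memory, and it is one of the key factors affecting system performance. The cache replacement policy is a major factor in cache performance. The most commonly used replacement algorithm today is LRU; to reduce module complexity and implementation difficulty, a PLRU (pseudo-LRU) replacement algorithm is derived by simplifying LRU. Using the open-source SimpleScalar simulation tool, the performance of common cache replacement algorithms, including LRU, RANDOM, FIFO and PLRU, is compared and analyzed, and PLRU is implemented. Experimental results show that a cache using PLRU has essentially the same miss rate as one using LRU, but with smaller area and a shorter critical path.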
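To make the PLRU idea concrete, here is a minimal sketch of the common tree-based pseudo-LRU scheme for a single 4-way set. Whether the paper uses this tree variant or another PLRU form is not stated in the abstract, so treat the class name `TreePLRU4` and the bit layout as assumptions. The point is that three bits per set replace the full recency ordering that true LRU has to maintain.

```python
class TreePLRU4:
    """Tree pseudo-LRU state for one 4-way set (3 bits instead of a full LRU order)."""

    def __init__(self) -> None:
        # bits[0]: root; bits[1]: left pair (ways 0, 1); bits[2]: right pair (ways 2, 3).
        # A bit of 0 means "the next victim is on the left", 1 means "on the right".
        self.bits = [0, 0, 0]

    def touch(self, way: int) -> None:
        """On an access, make the bits on the path point away from the accessed way."""
        if way < 2:
            self.bits[0] = 1               # next victim comes from the right pair
            self.bits[1] = 1 - way         # within the left pair, point at the other way
        else:
            self.bits[0] = 0               # next victim comes from the left pair
            self.bits[2] = 1 - (way - 2)   # within the right pair, point at the other way

    def victim(self) -> int:
        """Follow the bits from the root to pick the way to replace."""
        if self.bits[0] == 0:
            return self.bits[1]
        return 2 + self.bits[2]


if __name__ == "__main__":
    plru = TreePLRU4()
    for way in (0, 1, 2, 3):
        plru.touch(way)
    print(plru.victim())   # same choice as true LRU here
    plru.touch(0)
    print(plru.victim())   # true LRU would pick way 1; tree PLRU picks a different way
```

The second call shows where the approximation departs from exact LRU: the tree only remembers which half was used more recently, which is what makes the hardware smaller.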

4.

Research on On-Chip Memory Systems of Multicore Processors  Total citations: 1, self-citations: 1, citations by others: 0

黄安文高军张民选《计算机工程》2010,36(4):4-6

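The widening gap between the computing power and the memory access speed of multicore processors constrains multicore system performance. To address this problem, the paper analyzes the design characteristics of the memory systems of several typical multicore processors and discusses key technologies in the development of on-chip memory systems for multicore processors, including non-uniform cache access caused by latency, the constraints that the core-cache interconnect places on memory access performance, and the growing complexity of on-chip cache design.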

5.

A Program Performance Profiling Optimization Technique for Guiding Static Cache Partitioning

贾耀仓武成岗张兆庆《计算机研究与发展》2012,49(1):93-102

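For multicore processors with a shared cache, managing how each core uses the cache is key to fully exploiting multicore performance. With current cache replacement policies, programs interfere with each other's performance; static cache partitioning solves this interference by allocating separate space to concurrently running programs. To allocate an appropriately sized cache space to each program, the program must be profiled, that is, run multiple times in advance to collect performance data under various cache capacities. This profiling approach is very expensive, which limits its practicality. To avoid the need for multiple runs, a profiling optimization technique that requires only a single run is proposed. The technique uses online phase analysis to identify the program's execution phases and avoids re-profiling identical phases; it also analyzes how the performance of each phase trends with cache capacity and skips profiling for capacity changes to which performance is insensitive, reducing overhead. After the run ends, the program's overall performance at each capacity is estimated from the performance of its phases at the various capacities, and this estimate guides static cache partitioning. Experiments show that the overhead of the technique is only 7%, that the partitioning it guides improves performance by 8% over no partitioning, and that this is only 1% below partitioning guided by multi-run profiling.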
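The last two steps of the abstract above (combining per-phase profiles into a whole-program estimate, then using that estimate to choose a static partition) can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: it uses miss counts as the performance metric, weights each phase by how often it occurred, partitions by cache ways between exactly two programs, and the function names are hypothetical.

```python
def whole_program_curve(phase_curves, phase_counts):
    """Combine per-phase miss curves into one curve.

    phase_curves[p][i] is the profiled miss count of phase p with (i + 1) ways;
    phase_counts[p] is how many times phase p occurred during the run.
    """
    n = len(next(iter(phase_curves.values())))
    return [
        sum(phase_counts[p] * phase_curves[p][i] for p in phase_curves)
        for i in range(n)
    ]


def best_split(misses_a, misses_b, total_ways):
    """Pick (ways_a, ways_b) minimizing total misses; each program gets at least one way.

    misses_x[w - 1] is the estimated miss count of program x when given w ways.
    """
    ways_a = min(
        range(1, total_ways),
        key=lambda w: misses_a[w - 1] + misses_b[total_ways - w - 1],
    )
    return ways_a, total_ways - ways_a


if __name__ == "__main__":
    a = [100, 60, 30, 28, 27, 27, 27]   # program A saturates after 3 ways
    b = [50, 45, 40, 20, 10, 9, 8]      # program B keeps benefiting up to 5 ways
    print(best_split(a, b, total_ways=8))
```

In the example, program A stops improving after three ways while program B still gains from more capacity, so the search gives A three ways and B five.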

6.

Design and Implementation of a High-Performance SIMT Processor Cache for Machine Learning

《计算机应用与软件》2019,(7)

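To meet machine learning's requirements for big data and parallel computing, and to narrow the gap between the processor and main memory, a pipelined cache structure is designed for an independently developed SIMT processor. A dedicated pseudo-LRU replacement algorithm is designed by combining the principle of locality with the LRU replacement algorithm; together with the general-purpose round-robin, LFU and LRU replacement algorithms, it satisfies the requirement for a configurable cache replacement policy and enables fast interaction between the processor and main memory. The design was synthesized for a Xilinx Virtex UltraScale xcvu440-flga2892-2-e FPGA. Results show that the maximum delay of the instruction cache is 2.923 ns and that of the data cache is 3.258 ns, which meets the performance requirements of the SIMT processor.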

7.

A Platform-Oblivious In-Memory Join Optimization Technique Based on Vector Referencing

张延松张宇王珊《软件学报》2018,29(3):883-895

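Graph analysis database systems, represented by MapD, use new many-core processors such as GPUs and Phi to support high-performance analytical processing, but join operations remain an important performance bottleneck for complex data schemas. In recent years heterogeneous processors have become the mainstream platform for high-performance computing, and research on in-memory join performance has expanded from multicore CPUs to emerging many-core processors. However, the many published results do not systematically reveal the intrinsic relationship among join algorithm performance, join dataset size and hardware architecture, so they offer little guidance for choosing a join platform in databases on future heterogeneous processor platforms. This paper targets in-memory join optimization for multicore CPU, Xeon Phi and GPU platforms. By optimizing the in-memory hash table design, it replaces hash mapping with vector mapping and removes the influence of hashing cost on the in-memory join algorithm, so that the performance characteristics of in-memory joins can be measured more accurately under hardware-related factors such as the cache size of the multicore CPU, the cache size of the Xeon Phi, the concurrent multithreading of the Xeon Phi, and the SIMT (single instruction, multiple threads) mechanism of the GPU. Experimental results show that caching and concurrent multithreading are important factors in improving in-memory join performance. Caching gives a performance advantage for joins that fit within the cache, the GPU's concurrent multithreading gives higher performance for joins on larger tables, and the Xeon Phi gives the highest performance for joins that fit within its L2 cache. The results reveal the relationship between in-memory join performance and the hardware characteristics of heterogeneous processors, and provide optimization strategies for query optimizers of in-memory databases on future heterogeneous processor platforms.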

8.

Optimizing the Java Virtual Machine Just-in-Time Compiler with a Hardware Cache Locking Mechanism

敖琪蔡嵩松王剑《计算机研究与发展》2012,(Z1):185-190

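The Java virtual machine's just-in-time (JIT) compiler compiles at method granularity: it compiles a bytecode method into executable code, which passes through the data cache on its way to memory. When that code is executed again, the processor must fetch instructions from the memory region containing it; if that region is already mapped in the data cache, the data can be read directly from the data cache and read performance improves greatly. However, the large amount of generated executable code is frequently replaced in the cache. Once generated code has been evicted, the processor must access the slower main memory when the code runs again, which becomes a performance bottleneck for the compiler. A hardware cache locking mechanism is designed and implemented, and a JIT compilation method based on hardware/software co-design is proposed. With this method, the number of cache misses during execution of generated code is reduced by 6.9%, and programs in SPECjvm2008 gain up to 17.9% in performance, with an average gain of 4.2%.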

9.

An Optimized Data Block Replacement Algorithm for Multi-Level Caches

兰丽《计算机工程》2013,39(4)

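Most processors use a multi-level inclusive cache hierarchy, and existing last-level cache (LLC) block replacement algorithms incur considerable performance overhead. To address this problem, an optimized LLC block replacement algorithm, PLI, is proposed. When choosing a block to evict, it takes into account the block's access frequency in the upper-level cache, selecting the best LLC victim at low cost. Evaluation on a cycle-accurate simulator shows that the algorithm improves performance by 7% on average over the original algorithm.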

10.

Hardware Implementation of Cache Coherence for Shared-Bus Multiprocessors

李均晓张盛兵沈绪榜《计算机应用研究》2008,25(6):1890-1893

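The Longteng R2 (龙腾R2) microprocessor is a RISC microprocessor with independent intellectual property rights, designed by the Aviation Microelectronics Center of Northwestern Polytechnical University and based on the PowerPC architecture. To extend its multiprocessor capability, bus snooping is used to maintain cache coherence in a multiprocessor environment. The paper first introduces shared-bus snooping techniques and snooping protocols, then describes in detail the implementation of the bus snooping unit of the Longteng R2 microprocessor, and evaluates several cache coherence implementation schemes and their performance. FPGA experiments show that the bus snooping unit maintains cache coherence in the multiprocessor system efficiently and correctly.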

11.

Sector Cache in Embedded Processors: A Trade-off Between Performance and Area

左琦付宇卓鲁欣《小型微型计算机系统》2006,27(1):166-170

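Sector caches were used in some of the earliest computer systems to employ cache technology. Although a sector cache performs slightly worse than a conventional cache, at the same cache capacity the sector organization needs markedly fewer tag bits than the conventional organization. Because embedded processors have very strict chip area requirements, this advantage is all the more significant for them. This paper uses simulation to analyze in detail the performance of sector-organized caches in embedded application environments. The simulation results show that judicious use of the sector organization can effectively reduce the number of tag bits at a small performance cost. A sector cache can therefore minimize the area of the cache controller while still meeting performance requirements. The paper concludes that the sector cache is an effective means for embedded processor designers to trade off performance against area.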
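The tag-bit saving claimed above follows from simple arithmetic, sketched here. The configuration (8 KiB, direct-mapped, 32-bit addresses, 16-byte lines versus 64-byte sectors holding four 16-byte sub-blocks) is an assumed example, not one taken from the paper. Only address tag bits are counted; per-line or per-sub-block valid bits are the same in both organizations and are left out.

```python
from math import log2


def tag_bits(capacity: int, tagged_block: int, ways: int = 1, addr_bits: int = 32) -> int:
    """Total address-tag bits for a cache that keeps one tag per `tagged_block` bytes.

    For a conventional cache, tagged_block is the line size; for a sector cache it is
    the sector size. All sizes are assumed to be powers of two.
    """
    blocks = capacity // tagged_block
    sets = blocks // ways
    per_tag = addr_bits - int(log2(sets)) - int(log2(tagged_block))
    return blocks * per_tag


if __name__ == "__main__":
    conventional = tag_bits(8 * 1024, tagged_block=16)   # one tag per 16-byte line
    sector = tag_bits(8 * 1024, tagged_block=64)         # one tag per 64-byte sector
    print(conventional, sector)
```

In this configuration both organizations need 19-bit tags, but the sector cache stores 128 of them instead of 512, so tag storage shrinks by a factor of four.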

12.

A Dynamically Reconfigurable Cache Design for an Embedded Processor  Total citations: 1, self-citations: 0, citations by others: 1

张毅汪东升《计算机工程与应用》2004,40(8):94-96,232

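Processor chips generally have an on-chip cache, usually consisting of a fixed-size level-1 cache (L1) and level-2 cache (L2). This article presents a dynamically reconfigurable cache implemented in an embedded processor design. The idea of a dynamically reconfigurable cache was first proposed by researchers at the University of Rochester in a paper on memory hierarchies [1], at the time mainly targeting high-performance superscalar general-purpose processors. In designing this embedded processor, the authors adapted that idea creatively. By adding a small amount of hardware, and with compiler support, the sizes of the L1 and L2 caches can be configured dynamically for the specific application while their combined size stays fixed. Dynamically configuring the cache not only improves the cache hit rate effectively but also reduces processor power consumption.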

13.

A cache-aware program transformation technique suitable for embedded systems

《Information and Software Technology》2002,44(13):783-795

In embedded systems, caches are very valuable for keeping memory bandwidth low and for allowing the use of slow and narrow off-chip devices. Conversely, the power and die size consumed by the cache force embedded system designers to use small and simple cache memories. Such caches can perform poorly because of their inflexible placement policy. In this scenario, a large fraction of the misses can originate from the mismatch between the cache behavior and the locality features of the memory accesses (conflict misses). In this paper we analyze the conflict miss phenomenon and define a cache utilization measure. We then propose an object-level Cache Aware allocation Technique (CAT) that transforms the application to fit the cache structure, minimize the number of conflict misses and maximize cache exploitation. The solution transforms the program layout using the standard functionalities of a linker. The CAT approach allowed the considered applications to deliver the same performance on caches two times, and sometimes four times, smaller. Moreover, the CAT-improved programs on direct-mapped caches outperformed the original versions on set-associative caches. The results thus highlight that our approach can help embedded system designers meet the system requirements with smaller and simpler cache memories.
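A toy simulation shows why object placement alone can remove conflict misses in a direct-mapped cache. This is only an illustration of the phenomenon, not the CAT algorithm itself; the cache geometry (64 lines of 16 bytes), the two 64-byte arrays and the interleaved access pattern are all assumptions chosen for the example.

```python
def direct_mapped_misses(addresses, num_lines=64, line_bytes=16):
    """Count misses of a direct-mapped cache over a byte-address trace."""
    stored = [None] * num_lines          # line address currently held at each index
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_lines
        if stored[index] != line:
            stored[index] = line
            misses += 1
    return misses


def interleaved_trace(base_a, base_b, elements=16, elem_bytes=4, repeats=10):
    """Access a[i] and b[i] alternately, as in a loop like `c += a[i] * b[i]`."""
    trace = []
    for _ in range(repeats):
        for i in range(elements):
            trace.append(base_a + i * elem_bytes)
            trace.append(base_b + i * elem_bytes)
    return trace


if __name__ == "__main__":
    cache_bytes = 64 * 16
    # b placed exactly one cache size after a: both arrays map to the same lines.
    print(direct_mapped_misses(interleaved_trace(0, cache_bytes)))
    # b moved by 64 bytes (the size of a): the arrays no longer collide.
    print(direct_mapped_misses(interleaved_trace(0, cache_bytes + 64)))
```

In the first layout every access evicts the line the other array needs next, so all accesses miss; shifting one array by its own size leaves only the compulsory misses. A linker-level relocation of objects achieves the same kind of shift without changing the program's code.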

14.

Two-level caches tuning technique for energy consumption in reconfigurable embedded MPSoC

《Journal of Systems Architecture》2013,59(8):656-666

In order to meet the ever-increasing computing requirements of the embedded market, multiprocessor chips were proposed as the best way forward. In this work we investigate the energy consumption of these embedded MPSoC systems. One efficient way to reduce energy consumption is to reconfigure the cache memories. This approach has been applied to architectures with one cache level and one processor, but has not yet been investigated for multiprocessor architectures with two-level caches. The main contribution of this paper is to explore a two-level cache (L1/L2) multiprocessor architecture by estimating its energy consumption. Using a simulation platform, we first build a multiprocessor architecture and then propose a new algorithm that tunes the two-level cache memory hierarchy (L1 and L2). The cache tuning approach is based on three parameters: cache size, line size, and associativity. To find the best cache configuration, the application is divided into several execution intervals, and for each interval we generate the best cache configuration. Finally, the approach is validated using a set of open-source benchmarks (SPEC 2006, SPLASH-2 and MediaBench), and we discuss the performance in terms of speedup and energy reduction.

15.

A new cache architecture based on temporal and spatial locality  Total citations: 5, self-citations: 0, citations by others: 5

Jung-Hoon Jang-Soo Shin-Dug 《Journal of Systems Architecture》2000,46(15):1451-1467

A data cache system is designed as a low-power, high-performance cache structure for embedded processors. A direct-mapped cache is a favored choice for short cycle time, but it suffers from a high miss rate. The proposed dual data cache is therefore an approach to improving the miss ratio of a direct-mapped cache without affecting its access time. The proposed cache system can exploit temporal and spatial locality effectively by maximizing the effective cache memory space for any given cache size. It consists of two caches: a direct-mapped cache with a small block size and a fully associative spatial buffer with a large block size. Temporal locality is exploited by selectively caching candidate small blocks in the direct-mapped cache. Spatial locality is exploited aggressively by fetching multiple neighboring small blocks whenever a cache miss occurs. According to the comparison and analysis, similar performance can be achieved with a cache four times smaller than a conventional direct-mapped cache. It is also shown that the power consumption of the proposed cache is around 4% lower than that of the victim cache configuration.

16.

Instruction cache locking for multi-task real-time embedded systems

Tiantian Liu Minming Li Chun Jason Xue 《Real-Time Systems》2012,48(2):166-197

In a multi-task embedded system, a cache is shared by different tasks, which increases the complexity of cache management and the unpredictability of cache behavior. This unpredictability in turn leads to overestimation of an application's worst-case execution time (WCET) and worst-case CPU utilization (WCU), which are two of the most important criteria for real-time embedded systems. Modern processors often provide a cache locking capability, which can be applied statically or dynamically to manage the cache in a predictable manner. The selection of instructions to be locked in the instruction cache (I-Cache) has a dramatic influence on system performance. This paper focuses on applying cache locking techniques to the shared I-Cache to minimize WCU for multi-task embedded systems. We analyze and compare three strategies for I-Cache locking: static locking, semi-dynamic locking, and dynamic locking. Different algorithms are proposed that use the foreknown information of embedded applications. Experimental results show that the proposed algorithms can reduce WCU compared to previous techniques.

17.

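Research on Design Methods for Reducing Cache Miss Penalty in Embedded Processors  Total citations: 2, self-citations: 0, citations by others: 2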

黄海林许彤范东睿唐志敏《小型微型计算机系统》2006,27(11):2077-2081

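Taking the Godson-1 (龙芯1号) processor as the object of study, this paper explores design methods for reducing the cache miss penalty in embedded processors. Based on an analysis of the processor's structural characteristics, it implements a non-blocking data cache that supports hit-under-one-miss on top of critical-word-first, which improves average processor performance by 3.9%. Exploiting the principle of locality, and building on the critical-word-first non-blocking data cache, the paper also proposes a quasi-non-blocking instruction cache design that reduces the instruction cache miss penalty and further improves average processor performance by 7.7% at a small implementation cost. Together, these techniques reduce the miss penalty of both the instruction and data caches, and average processor performance improves by 11.6%.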

18.

A Low-Power Dynamically Reconfigurable Cache Design  Total citations: 1, self-citations: 0, citations by others: 1

何勇肖斌陈章龙涂时亮《计算机应用与软件》2009,26(8):247-250

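In embedded microprocessor design, the cache improves performance but also becomes a major source of power consumption. A non-uniform, dynamically reconfigurable low-power cache structure is proposed, together with a dynamic reconfiguration algorithm, DAS (Dynamic Associativity Selection), which reduces power by reconfiguring the cache dynamically. Simulation results based on MiBench show that the reconfigurable cache outperforms a conventional cache while consuming less energy: instruction and data cache hit rates rise by 2.1% and 1.4% on average, respectively, and the average energy consumption of the memory system falls by 8.1%.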

19.

Cache Management for Embedded Systems

ZHONG Rui FANG Wen-kai 《数字社区&智能家居》2008,(12)

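This paper proposes a cache management scheme for embedded file systems based on the least recently used (LRU) page replacement algorithm. Using this cache management module noticeably improves file system read and write performance. Taking NOR flash operations as an example, with the cache enabled, write speed for large numbers of small files increases by about 20% and read speed by 30%-40%. For bulk data transfers, suitably adjusting the cache capacity also raises write speed by 2% and read speed by about 10%.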
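An LRU page cache of the kind described above can be sketched in a few lines. This is a generic read-side illustration, not the paper's module: the class name `LRUPageCache`, the `read_page` callback standing in for a flash read, and the absence of write-back handling are all assumptions.

```python
from collections import OrderedDict


class LRUPageCache:
    """Read cache that evicts the least recently used page when full."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page      # hypothetical callback: page number -> page data
        self.pages = OrderedDict()      # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)     # mark as most recently used
            self.hits += 1
            return self.pages[page_no]
        self.misses += 1
        data = self.read_page(page_no)          # slow path: go to the storage device
        self.pages[page_no] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)      # drop the least recently used page
        return data


if __name__ == "__main__":
    cache = LRUPageCache(2, read_page=lambda n: f"page-{n}")
    for n in (1, 2, 1, 3, 2):
        cache.get(n)
    print(cache.hits, cache.misses, list(cache.pages))
```

On a real embedded target the same policy would typically be written in C over a fixed array of page buffers; the ordering logic is the part that carries over.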

20.

Analysis and Decision-Making in Cache Design

沈绪榜陆铁军《小型微型计算机系统》1992,13(7):1-8,45
