首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 203 毫秒
1.
近年来,随着集成电路技术的发展处理器与存储器之间的速度差异越来越大,存储器愈发成为制约计算系统性能的瓶颈。对于嵌入式、低功耗领域的DSP而言,其架构和应用场景与通用CPU不同,CPU的访存设计难以满足DSP的访存需求。针对超长指令字DSP在访存实时性、顺序与固定延迟、高效数据一致性方面的需求,设计了一种适用于DSP的标量访存单元,可配置的设计能够满足DSP的访存实时性;基于ID的顺序机制保证超长指令字架构对Load指令返回数据的顺序与固定延迟要求,存储开销为87.5 B;硬件查找“首1”加速了数据一致性所需的写回操作。当Cache中25%,50%和75%的行需要写回时,优化后的一致性写回开销为逐行扫描方法的26.4%,51.3%和76.2%,只与有效脏行数量成正比,与Cache容量无关。  相似文献   

2.
高性能处理器普遍采用片上集成大容量复杂结构的一级Cache提高处理器性能,但随着Cache容量和复杂度的增加,访问Cache所产生的访存延迟和功耗明显增加;基于存储队列,提出了一种通过减少Cache访问次数来降低功耗和延迟的方法,利用存储队列来缓存Load/Store指令的数据,并且当存储队列不满时,通过空闲入口暂存已经完成的仿存数据,提高了连续访存数据的复用率,减少了Cache的访问次数;仿真结果显示,该方法在增加少量的控制逻辑基础上,显著减少了Cache的访问次数,降低了Cache的功耗,减少了访存延迟,加快了执行速度。  相似文献   

3.
片上多处理器中二级Cache的设计和管理是影响其性能的关键因素之一。在私有二级Cache的基础上,提出一种基于集中式一致性目录的协作Cache设计方案,通过有效地管理片上存储资源来优化处理器的性能,从而使该协作Cache具有平均访存延迟小、Cache缺失率低、可扩展性好等优点。实验结果显示,与共享二级Cache设计相比,协作Cache可以将4核处理器的吞吐量平均提高13.5%,而其硬件开销约为8.1%。  相似文献   

4.
存储相关性预测对于减少存储相关性冲突、提高微处理器性能具有十分重要的作用。针对传统相关性预测器硬件开销大、可实现性较差的缺点,通过对存储相关性的局部性分析,提出了一种基于指令距离的存储相关性预测方法。该方法充分利用了发生存储相关性冲突的指令在指令距离上的局部性,预测冲突指令的指令距离,进而控制部分访存指令的发射时机,大大减少了存储相关性冲突的次数。实验结果表明,在硬件开销约为1KB的情况下,使用基于指令距离的相关性预测器后,每个时钟周期平均执行的指令数可以提高1.70%,最高可以提高5.11%。在硬件开销较小的情况下,较大程度提高了微处理器的性能。  相似文献   

5.
双精度普通矩阵乘法DGEMM是BLAS库中最核心的函数之一,大部分三级BLAS库函数的核心计算都是通过调用DGEM M来实现的.该文针对龙芯3A具有128位访存指令的特点,通过理论分析,找到了最佳的循环展开方式;针对龙芯3A的Cache替换策略(随机替换),通过使用地址交错技术,减少了Cache的冲突失效;针对龙芯3A访存带宽有限的问题,通过使用共享数据的任务划分方式,减少了数据访存量.优化后的DGEMM单核和多核运算速度均是性能最高的开源BLAS库(Goto-BLAS)的2倍多.  相似文献   

6.
阵列众核处理器由于其较高的计算性能和能效比已经广泛应用于高性能计算领域。而要构建未来高性能计算系统处理器必须解决严峻的"访存墙"挑战以及核心协同问题。通常的阵列处理器,其核心多采用单线程结构,以减少开销,但是对访存提出了较高的要求。引入硬件同时多线程技术,针对实验中单核心多线程二级Cache利用率较低的问题,提出了一种共享二级Cache划分机制。经实验模拟,通过上述优化的共享二级Cache划分机制,二级指令Cache失效率下降18.59%,数据Cache失效率下降6.60%,整体CPI性能提升达到10.1%。  相似文献   

7.
从单机性能优化角度对一个高阶精度结构网格CFI)并行程序进行了优化。通过识别关键变量并对其进行 常量参数化优化,使编译器能够实现更高级别的针对性优化;根据程序数据结构特点及访问模式,设计了分级数据缓 存技术,使程序主要计算代码能够以更优的方式访问主要数据结构,提高了访存空间局部性;进行了各种循环变换,以 优化访存性能。在国家超算长沙中心“`Tianhe—lA',并行机上的测试结果表明,相对于采用Intel编译器最高优化级别 的版本,其对10。万网格点二维翼型算例,串行程序性能提高约22.2%-28.9%;对1. 12亿网格点三角翼算例,并行 程序性能提高约13.9%-20.2%。  相似文献   

8.
“存储墙”问题已经成为处理器性能提升的主要障碍,而处理器内核猜测执行预测路径上访存指令时预载入的存储器数据所导致Cache污染会严重影响处理器性能.本文提出一种针对猜测执行过程中预载入数据的Cache污染控制方法CSDA.首先,利用置信度评估技术从所有预测路径中分离出错误概率较大的路径.然后,根据低置信度污染型访存指令识别历史表将低置信度预测路径上的访存指令划分为预取型和污染型,为污染型的访存指令建立低优先级Load/Store队列,并采用污染数据Cache存储污染数据.仿真结果表明,在双核模式下,CSDA策略相对于baseline结构来说,L1 D-Cache缺失率降低幅度从9%-23%,平均降低了17%;L2 Cache缺失率的下降范围从1.02%-14.39%,平均为5.67%;IPC的提升幅度从0.19% -5.59%,平均为2.21%.  相似文献   

9.
结合访存失效队列状态的预取策略   总被引:1,自引:0,他引:1  
随着存储系统的访问速度与处理器的运算速度的差距越来越显著,访存性能已成为提高计算机系统性能的瓶颈.通过对指令Cache和数据Cache失效行为的分析,提出一种预取策略--结合访存失效队列状态的预取策略.该预取策略保持了指令和数据访问的次序,有利于预取流的提取.并将指令流和数据流的预取相分离,避免相互替换.在预取发起时机的选择上,不但考虑当前总线是否空闲,而且结合访存失效队列的状态,减小对处理器正常访存请求的影响.通过流过滤机制提高预取准确性,降低预取对访存带宽的需求.结果表明,采用结合访存失效队列状态的预取策略,处理器的平均访存延时减少30%,SPEC CPU2000程序的IPC值平均提高8.3%.  相似文献   

10.
低效率的访存操作是限制微处理器性能提高的一个关键因素。因此提高访存速度可以有效改善微处理器的性能。提出了一种基于增加数据宽度的方式来提高访存速度的方法。通过使用多字宽存储器来增加数据带宽,降低失效开销的时钟周期,从而达到提高访存效率的目的。  相似文献   

11.
Cache自适应写分配策略   总被引:1,自引:0,他引:1  
处理器所能提供的有效带宽是目前制约处理器性能提高的关键因素 .通过对Cache写失效行为的分析,提出了一种新的提高处理器带宽利用率的Cache写失效处理策略--Cache自适应写分配策略 .该策略在访存失效队列中收集全修改Cache块,对全修改Cache块采用非写分配策略,并能够自适应地切换为写分配策略 .与传统的Cache写失效处理策略相比,Cache自适应写分配策略硬件代价小,避免了不必要的数据传输,降低Cache污染,减少存储管理队列阻塞的频率 .结果表明,采用Cache自适应写分配策略,STREAM基准测试程序带宽平均提高62.6%,SPEC CPU2000程序的IPC值平均提高5.9% .  相似文献   

12.
This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache coherent, shared memory multiprocessors. Data prefetching is accomplished by using a multiprocessor software pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronization, and is applied to communication-related accesses between successive parallel loops. Prefetching and forwarding are each shown to be more effective for certain types of architectural and application characteristics. Given this result, a new hybrid prefetching and forwarding approach is proposed and evaluated that allows the relative amounts of prefetching and forwarding used to be adapted to these characteristics. When compared to prefetching or forwarding alone, the new hybrid scheme is shown to increase performance stability over varying application characteristics, to reduce processor instruction overheads, cache miss ratios, and memory system bandwidth requirements, and to reduce performance sensitivity to architectural parameters such as cache size. Algorithms for data prefetching, data forwarding, and hybrid prefetching and forwarding are described. These algorithms are applied by using a parallelizing compiler and are evaluated via execution-driven simulations of large, optimized, numerical application codes with loop-level and vector parallelism.  相似文献   

13.
Wide-issue and high-frequency processors require not only a low-latency but also high-bandwidth memory system to achieve high performance. Previous studies have shown that using multiple small single-ported caches instead of a monolithic large multi-ported one for L1 data cache can be a scalable and inexpensive way to provide higher bandwidth. Various schemes on how to direct the memory references have been proposed in order to achieve a close match to the performance of an ideal multi-ported cache. However, most existing designs seldom take dynamic data access patterns into consideration, thus suffer from access conflicts within one cache and unbalanced loads between the caches. It is observed in this paper that if one can group data references defined in a program into several regions (access regions) to allow parallel accesses, providing separate small caches – access region cache for these regions may prove to have better performance. A register-guided memory reference partitioning approach is proposed and it effectively identifies these semantic regions and organizes them into multiple caches adaptively to maximize concurrent accesses. The base register name, not its content, in the memory reference instruction is used as a basic guide for instruction steering. With the initial assignment to a specific access region cache per the base register name, a reassignment mechanism is applied to capture the access pattern when program is moving across its access regions. In addition, a distribution mechanism is introduced to adaptively enable access regions to extend or shrink among the physical caches to reduce potential conflicts further. The simulations of SPEC CPU2000 benchmarks have shown that the semantic-based scheme can reduce the conflicts effectively, and obtain considerable performance improvement in terms of IPC; with 8 access region caches, 25–33% higher IPC is achieved for integer benchmark programs than a comparable 8-banked cache, while the benefit is less for floating-point benchmark programs, 19% at most.  相似文献   

14.
In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. TheCache Coherence With Data Prefetching(CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme on five applications from the SPEC CFP95 and CFP92 benchmark suites, and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency.  相似文献   

15.
This paper examines the performance and memory-access behavior of the C4.5 decision-tree induction program, a representative example of data mining applications, for both uniprocessor and parallel implementations. The goals of this paper are to characterize C4.5, in particular its memory hierarchy usage, and to decrease the run-time of C4.5 via algorithmic improvement and parallelization. Performance is studied via RSIM, an execution driven simulator, for three uniprocessor models that exploit instruction level parallelism to varying degrees. This paper makes the following four contributions. The first contribution is presenting a complete characterization of the C4.5 decision-tree induction program. The results show that with the exception of the input data set, the working set fits into an 8-Kbyte data cache; the instruction working set also fits into an 8-Kbyte instruction cache. For data sets larger than the L2 cache, performance is limited by accesses to main memory. The results further establish that four-way issue can provide up to a factor of two performance improvement over single-issue for larger L2 caches; for smaller L2 caches, out-of-order dispatch provides a large performance improvement over in-order dispatch. The second contribution made by this paper is examining the effect on the memory hierarchy of changing the layout of the input dataset in memory, showing again that the performance is limited by memory accesses. One proposed data layout decreases the dynamic instruction count by up to 24%, but usually results in lower performance due to worse cache behavior. Another proposed data layout does not improve the dynamic instruction count over the original layout, but has better cache behavior and decreases the run-time by up to a factor of two. Third, this paper presents the first decision-tree induction program parallelized for a ccNUMA architecture. A method for splitting the decision tree hash table is discussed that allows the hash table to be updated and accessed simultaneously without the use of locks. The performance of the parallel version is compared to the original version of C4.5 and a uniprocessor version of C4.5 using the best data layout found. Speedup curves from a six-processor Sun E4000 SMP system show a speedup on the induction step of 3.99, and simulation results show that the performance is mostly unaffected by increasing the remote memory access time until it is over a factor of ten greater than the local memory access time. Last, this paper characterizes the parallelized decision-tree induction program. Compared to the uniprocessor version, the parallel version exerts significantly less pressure on the memory hierarchy, with the exception of having a much larger first level data working set.  相似文献   

16.
Java虚拟机即时编译器以方法为单位进行编译,编译器将字节码方法编译成可执行代码,并经过数据cache存入内存中,当再次执行到该代码段时,处理器需要从包含该代码段的内存区域取指令执行,如果该内存区域在数据cache中已经建立映射,就可以直接从数据cache中读取数据,读数据的性能就会有大幅度的提高.但是编译生成的大量可执行代码在cache中频繁替换,当生成代码被替换出cache后,代码再次执行时处理器必须访问速度较慢的主存储器,成为编译器的性能瓶颈.设计并实现了硬件cache锁机制,提出了一种软硬件协同设计的即时编译方法.通过该方法,生成代码执行时的cache失效次数降低了6.9%,SPECjvm2008中程序最高获得了17.9%的性能提升,平均性能提升4.2%.  相似文献   

17.
The static specification of operations executed in parallel using No Operations (NOPs) is another culprit to make code size to be increased in VLIW architecture. Some alternatives in the instruction encoding and memory subsystem are proposed to minimize the impact of NOP on the code size. One is the compressed cache using the packed encoding scheme and the other is the decompressed cache using the unpacked encoding scheme. The compressed cache shows high memory utilization but increases the pipeline branch penalty because it requires very complex fetch hardware. On the contrary, the fetch overhead can be decreased in the decompressed cache because the unpacked encoding scheme allows an instruction to be issued to the pipeline without any recovery process. However, it has a shortcoming that the memory utilization is deteriorated due to the memory allocation irrespective of the number of useful operations. In this research, a new instruction encoding scheme called a semi-packed encoding scheme and the section cache, which enables effective store and retrieval of semi-packed instructions, are proposed. This can decrease the hardware complexity to fetch an instruction and the wasted memory space due to NOPs via the partially fixed length of an instruction. The experimental results reveal that the memory utilization in the section cache is 3.4 times higher than in the decompressed cache. The memory subsystem using the section cache can provide about 15% performance improvement with the moderate size of chip area.  相似文献   

18.
众核处理器设计在芯片面积上受到了巨大挑战,如何将有限的芯片面积投入到运算能力中,是众核处理器体系结构研究的热点。聚焦众核处理器的指令缓存结构设计,研究通过在多核核心之间共享一级指令缓存,以获取指令系统及处理器流水线性能的提升。给出了共享指令缓存的结构设计,对该结构进行了节拍级精确的性能模拟,并通过RTL级代码的综合得到了面积开销和时序指标。测试结果表明,共享指令缓存可以降低11%~27%的缓存脱靶率,提升4%~7%的流水线性能。  相似文献   

19.
为了提高片上Flash在嵌入式应用中的读取速度,提出了一种基于预取和缓存原理的片上Flash加速控制器。该控制器包括预取缓存和高速缓存两种加速方案。其中预取缓存方案采用位宽扩展和预取技术加速顺序指令的读取,并采用分支缓存存储非顺序指令,降低由非顺序指令造成的预取缺失代价;而高速缓存方案采用组相联和路预测技术,提高指令重用率,减少Flash访问次数,降低系统功耗。针对不同的应用场景,两种加速方案既可通过寄存器来静态切换,也可通过软件流程来自适应动态切换,从而获得最佳的读取速度提升。多项基准程序的测试结果表明了所提出的片上Flash加速控制器在性能和功耗优化上的可行性和高效性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号