Similar Literature
20 similar documents found.
1.
We revisit memory hierarchy design, viewing memory as an inter-operation communication mechanism. We show how dynamically collected information about inter-operation memory communication can be used to improve memory latency. We propose two techniques: (1) Speculative Memory Cloaking and (2) Speculative Memory Bypassing. In the first technique, we use memory dependence prediction to speculatively identify dependent loads and stores early in the pipeline; these instructions may then communicate prior to address calculation and disambiguation via a fast communication mechanism. In the second technique, we use memory dependence prediction to speculatively transform DEF-store-load-USE dependence chains within the instruction window into DEF-USE ones. As a result, dependent stores and loads are taken off the communication path, further reducing communication latency. Experimental analysis shows that our methods, on average, correctly handle 40% (integer) and 19% (floating point) of all memory loads. Moreover, our techniques yield performance improvements of 4.28% (integer) and 3.20% (floating point) over a highly aggressive, dynamically scheduled processor implementing naive memory dependence speculation. We also study the value and address locality characteristics of the values our methods correctly handle, and demonstrate that our methods are orthogonal to both address and value prediction.
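A rough sketch of the bookkeeping that memory dependence prediction involves may help; the class and its fields below are hypothetical illustrations, not the paper's design. A table keyed by load PC remembers which store PC last fed the load, so a recurring store-load pair can communicate speculatively before addresses are known:

```python
# Hypothetical sketch of a store-load dependence predictor in the spirit of
# speculative memory cloaking: a load that previously matched a store is
# paired with it again and receives its value before address calculation.

class DependencePredictor:
    def __init__(self):
        self.pairs = {}        # load PC -> store PC observed to feed it
        self.pending = {}      # store PC -> value of its in-flight store

    def train(self, store_pc, load_pc):
        """Called after disambiguation confirms store->load communication."""
        self.pairs[load_pc] = store_pc

    def on_store(self, store_pc, value):
        """A store enters the window; stash its value for speculative use."""
        self.pending[store_pc] = value

    def on_load(self, load_pc):
        """Return a speculative value if this load is predicted dependent."""
        store_pc = self.pairs.get(load_pc)
        if store_pc is not None and store_pc in self.pending:
            return self.pending[store_pc]   # cloaked communication
        return None                         # fall back to a normal access

pred = DependencePredictor()
pred.train(store_pc=0x40, load_pc=0x80)     # learned on a prior iteration
pred.on_store(0x40, value=123)
assert pred.on_load(0x80) == 123            # value forwarded speculatively
```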

2.
李勇  胡慧俐  杨焕荣 《计算机应用》2014,34(4):1005-1009
Loops account for a large share of the execution time of digital signal processing software, and buffering loop code in an instruction buffer reduces the number of program memory accesses, improving processor performance. This work adds a loop-capable instruction buffer to the instruction pipeline of a VLIW processor; the buffer caches the instructions of a loop and dispatches them to the functional units in software-pipelined form, so loop code is fetched from memory only once but executed many times, greatly reducing memory accesses. While the loop is running, the buffer signals the program memory to enter a sleep state, lowering processor power. Tests on typical application programs show that with the loop buffer the fetch pipeline is idle more than 90% of the time, overall processor performance improves by about 10%, and the loop buffer's hardware area overhead is about 9% of the fetch pipeline.
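As a minimal model of the idea (all names and sizes hypothetical, not from the paper), the sketch below fetches a loop body from program memory once and replays it from a buffer on later iterations, counting the fetch-memory accesses saved:

```python
# Hypothetical model of the loop-buffer idea: the body of a loop is fetched
# from program memory once, then replayed from the buffer on every further
# iteration, so fetch accesses stop and the program memory could sleep.

def run_loop(body, iterations, buffer_size=64):
    loop_buffer = []
    memory_fetches = 0
    for it in range(iterations):
        for pc, instr in enumerate(body):
            if it == 0:
                memory_fetches += 1          # first pass: fetch and capture
                if len(loop_buffer) < buffer_size:
                    loop_buffer.append(instr)
            else:
                instr = loop_buffer[pc]      # replay: program memory idle
    return memory_fetches

body = ["mac r1, r2, r3", "add r4, r4, 1", "cmp r4, r5", "bne loop"]
print(run_loop(body, iterations=1000))       # 4 fetches instead of 4000
```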

3.
One of the major problems with GPU on-chip shared memory is bank conflicts. Our analysis shows that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth nor by the shared memory latency (as long as it stays constant), but rather by the varied latencies caused by memory bank conflicts. These variations create conflicts at the writeback stage of the in-order pipeline and cause pipeline stalls, degrading system throughput. Based on this observation, we investigate and propose a novel Elastic Pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput by decoupling bank conflicts from pipeline stalls. Simulation results show that our proposed Elastic Pipeline, together with the co-designed bank-conflict-aware warp scheduling, reduces pipeline stalls by up to 64.0% (42.3% on average) and improves overall performance by up to 20.7% (13.3% on average) for representative benchmarks, at trivial hardware overhead.
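The bank-conflict effect the paper targets is easy to reproduce; the following sketch (the bank count and word size are common GPU values, chosen here only for illustration) computes the worst-case conflict degree of one warp access and shows how array padding removes it:

```python
# Hypothetical illustration of shared-memory bank conflicts: the worst-case
# degree of conflict for one warp access is the largest number of threads
# whose addresses fall into the same bank, and padding can remove it.

from collections import Counter

def conflict_degree(addresses, num_banks=32, word_bytes=4):
    banks = Counter((addr // word_bytes) % num_banks for addr in addresses)
    return max(banks.values())     # 1 means conflict-free

# 32 threads walking a column of a 32x32 float array: all hit bank 0.
row_bytes = 32 * 4
col = [t * row_bytes for t in range(32)]
print(conflict_degree(col))                        # 32-way conflict

# Padding each row by one word spreads the column across all banks.
padded = [t * (row_bytes + 4) for t in range(32)]
print(conflict_degree(padded))                     # 1, conflict-free
```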

4.
Embedded systems are vulnerable to buffer overflow attacks. In this paper, we propose a hardware memory monitor module that detects buffer overflow attacks by analyzing the security of an embedded processor at the instruction level. The memory monitor module does not rely on source code and performs its security checks dynamically. Compared with several existing countermeasures that protect only part of a program's data space, our memory monitor module protects the program's entire data space. The module incurs negligible performance overhead because it runs in parallel with the embedded processor. As demonstrated in an FPGA (Field Programmable Gate Array) based prototype, the experimental results show that our memory monitor module can effectively resist several types of buffer overflow attacks with approximately 15% hardware cost overhead and only a 0.1% performance penalty.
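As a hypothetical illustration of the monitor's core check (the class and its interface are assumptions, not the paper's hardware), each store is validated against the recorded bounds of the buffer it may legally touch:

```python
# Hypothetical sketch of an instruction-level overflow check: every store
# issued by the processor is compared against the bounds of a registered
# buffer, and a write outside all registered buffers is flagged. Real
# hardware would track far more state; this shows only the bounds test.

class MemoryMonitor:
    def __init__(self):
        self.regions = {}                  # base address -> size in bytes

    def on_alloc(self, base, size):
        self.regions[base] = size

    def on_store(self, addr):
        for base, size in self.regions.items():
            if base <= addr < base + size:
                return True                # store stays inside its buffer
        raise RuntimeError(f"buffer overflow: store to {hex(addr)}")

mon = MemoryMonitor()
mon.on_alloc(base=0x1000, size=16)
mon.on_store(0x100F)                        # last in-bounds byte: allowed
# mon.on_store(0x1010) would raise: one byte past the buffer
```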

5.
As the clock frequencies of digital signal processors keep rising, their arithmetic units have moved from single-stage to multi-stage pipeline structures, which in turn introduces wait cycles when accessing memory. This paper proposes a layered read/write design with a hardware write buffer that eliminates the memory access unit's wait cycles, allowing the unit to operate at 100% efficiency.
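A toy model of the write-buffer idea (the class, names, and depth are illustrative assumptions) shows why it removes wait cycles: stores enqueue without stalling, and later loads check the queue before memory:

```python
# Hypothetical sketch of a write buffer: stores are queued instead of
# stalling the pipeline for the memory's wait cycles, and later loads
# check the queue first so they always see the newest data.

from collections import deque

class WriteBuffer:
    def __init__(self, depth=4):
        self.queue = deque()
        self.depth = depth

    def store(self, addr, value):
        if len(self.queue) >= self.depth:
            self.drain_one()               # only now would the CPU wait
        self.queue.append((addr, value))   # pipeline continues immediately

    def load(self, addr, memory):
        for a, v in reversed(self.queue):  # newest matching entry wins
            if a == addr:
                return v
        return memory.get(addr, 0)

    def drain_one(self):
        self.queue.popleft()               # memory completed one write

wb = WriteBuffer()
wb.store(0x20, 7)
print(wb.load(0x20, memory={}))            # 7, forwarded from the buffer
```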

6.
To address pipeline stalls caused by physical register resource conflicts in deep superscalar pipelines, this paper proposes a non-blocking instruction issue method in which multiple instructions share the same physical register resource. The method continues to allocate physical registers even when the resources conflict, using an issue buffer queue to hold the conflicting instructions temporarily; this increases the number of physical registers effectively allocatable at the issue stage, frees the issue window, and raises the parallelism of physical register usage. Experimental results show that, compared with conventional renaming, the method achieves the same performance with 27.3% fewer physical registers.
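A hypothetical sketch of the non-blocking issue idea (the classes below are illustrative, not the paper's microarchitecture): when the free list runs out, rename parks the conflicting instruction in an issue buffer instead of stalling, and hands it a register as soon as one is released:

```python
# Hypothetical sketch: instead of stalling rename when physical registers
# run out, conflicting instructions wait in an issue buffer and claim a
# register the moment one is freed.

from collections import deque

class Renamer:
    def __init__(self, num_phys_regs):
        self.free = deque(range(num_phys_regs))
        self.issue_buffer = deque()

    def rename(self, instr):
        if self.free:
            instr["dest_phys"] = self.free.popleft()
            return "issued"
        self.issue_buffer.append(instr)       # conflict: buffer, don't stall
        return "buffered"

    def release(self, phys_reg):
        """A physical register is freed; wake a buffered instruction."""
        if self.issue_buffer:
            self.issue_buffer.popleft()["dest_phys"] = phys_reg
        else:
            self.free.append(phys_reg)

r = Renamer(num_phys_regs=2)
print([r.rename({"op": op}) for op in ("add", "mul", "sub")])
# ['issued', 'issued', 'buffered'] -- rename continues despite the conflict
r.release(0)                                   # 'sub' now gets register 0
```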

7.
Research on a High-Bandwidth Memory Access Pipeline for General-Purpose Processors
Memory speed has long failed to keep pace with processor speed, and this increasingly severe memory-access gap constrains further gains in processor performance. Reducing load-to-use latency is the key to improving memory access performance, and with other factors fixed, widening the memory access path is the most effective way to reduce it; however, more bandwidth means more complex hardware logic on the memory access path and, inevitably, more power. Starting from an analysis of programs' inherent memory access characteristics, this work explores the design and optimization space of high-bandwidth memory access pipelines, identifies regularities in program memory access behavior, and exploits them to obtain a low-complexity, low-latency, low-power solution. The approach greatly simplifies the design of high-bandwidth memory access pipelines, reduces critical-path delay and power, and guided the memory access design of the Godsonx processor: at the cost of a 1.7% increase in total processor area, it doubles the bandwidth of the memory access pipeline and improves overall processor performance by 8.6% on average.

8.
Design and Performance Analysis of the Godson-2 Processor
This paper presents the design of the Godson-2 processor and its measured performance. Godson-2 is a four-issue superscalar, superpipelined processor with 64 KB each of on-chip L1 instruction and data cache and up to 8 MB of off-chip L2 cache. To exploit the pipeline fully, Godson-2 implements advanced out-of-order execution techniques such as branch prediction, register renaming, and dynamic scheduling, together with dynamic memory access mechanisms including non-blocking cache access and load speculation. Fabricated in a 0.18 μm CMOS process, the chip runs at up to 500 MHz at nominal voltage, with a measured power of 3-5 W at 500 MHz. Peak floating-point throughput is 2 GFLOPS in single precision and 1 GFLOPS in double precision; measured SPEC CPU2000 performance is 8-10 times that of Godson-1, for overall performance at the level of a Pentium III. The prototype chip smoothly runs a complete 64-bit Chinese Linux operating system, the full-featured Mozilla browser, multimedia players, and the OpenOffice office suite, meeting the needs of the great majority of desktop applications.

9.
We present the Memory Trace Visualizer (MTV), a tool that provides interactive visualization and analysis of the sequence of memory operations performed by a program as it runs. As improvements in processor performance continue to outpace improvements in memory performance, tools for understanding memory access patterns are increasingly important for optimizing data-intensive programs such as those found in scientific computing. Using visual representations of abstract data structures, a simulated cache, and animated memory operations, MTV can expose memory performance bottlenecks and guide programmers toward memory system optimization opportunities. Visualization of detailed memory operations provides a powerful and intuitive way to expose patterns and discover bottlenecks, and is an important addition to existing statistical performance measurements.
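As a rough illustration of the kind of input such a tool consumes (the cache parameters are arbitrary, not MTV's), the sketch below annotates a memory trace with hit/miss outcomes from a small simulated direct-mapped cache, which a visualizer could then animate:

```python
# Hypothetical trace annotator: each access is tagged hit or miss by a tiny
# direct-mapped cache model, producing the raw material a memory-trace
# visualizer would render.

def annotate_trace(addresses, lines=64, line_bytes=64):
    tags = [None] * lines
    trace = []
    for addr in addresses:
        block = addr // line_bytes
        index = block % lines
        hit = tags[index] == block
        tags[index] = block
        trace.append((addr, "hit" if hit else "miss"))
    return trace

# A large power-of-two stride maps every access to the same set, so nothing
# ever hits -- the kind of pattern a visualizer makes immediately obvious.
for addr, outcome in annotate_trace([i * 4096 for i in range(4)] * 2):
    print(hex(addr), outcome)
```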

10.
An Analysis of the Core Techniques of the Modern High-Performance Processors PowerPC 620 and Alpha 21164
The PowerPC 620 and the Alpha 21164 are two of today's high-performance processors, and their implementations embody two distinctly different design philosophies. This paper analyzes the two in detail with respect to architecture, instruction pipelines, instruction scheduling rules, branch handling, and memory systems, giving insight into the core techniques of modern high-performance processors and the direction in which instruction-level parallel processing is heading.

11.
Previous work in scalable hardware distributed shared memory (DSM) multiprocessors has established the critical and dominant role that protocol processing bandwidth (or its inverse, occupancy) plays in determining overall performance in architectures with standalone memory/coherence controllers. However, with recent architectural trends toward integrated (on-chip) memory controllers and the well-known fact that processor frequency is increasing more rapidly than memory systems, we must ask whether parallel coherence processing engines (either multiple integrated protocol processors/cores or multiple protocol threads) are needed in DSM machines constructed from modern processor architectures and, if so, when. We construct a useful analytical model to give the designer insight into when parallel coherence streams will improve performance, and verify our model via detailed simulation on 64-threaded microbenchmarks and parallel applications and on single-node multiprogrammed workloads. Surprisingly, and contrary to related work, we find that, in these architectures, adding a second coherence engine has almost no impact on performance. Further, for less-tuned applications that suffer from hot spots (contentious requests to the same memory line), additional engines offer no benefit whatsoever. Even with double the memory bandwidth (or channels), an additional coherence processing stream yields only slight performance improvement. Only for a special class of DSM machines employing directoryless broadcast protocols over unordered interconnects does parallel "snoop" processing offer reasonable performance improvement for communication-intensive applications. Overall, given these architectural trends, this is good news for DSM designers who want to minimize the resources needed (protocol threads or integrated protocol processor cores, respectively) to create SMTp-based or multi-CMP-based scalable DSM machines using directory protocols.

12.
A crucial problem that needs to be solved is the allocation of memory to processors in a pipeline. Ideally, the processor memories should be totally separate (i.e., one-port memories) in order to minimize contention; however, this minimizes memory sharing. Sharing is maximized by using a single shared memory for all processors, but this also maximizes contention. Instead, in this paper we show that perfect memory sharing can be achieved with a collection of two-port memories, as long as the number of processors is less than the number of memories. We show that the allocation problem is NP-complete in general, but admits a fast approximation algorithm that comes within a constant factor asymptotically. The proof utilizes a new bin packing model, which is interesting in its own right. Further, for important special cases that arise in practice, a more sophisticated modification of this approximation algorithm is in fact optimal. We also discuss the online memory allocation problem and present fast online algorithms that provide good memory utilization while allowing fast updates.
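For flavor, the sketch below uses standard first-fit-decreasing bin packing as a stand-in for the style of allocation studied here; the paper's actual algorithm, its new bin packing model, and its approximation bound differ:

```python
# Classic first-fit-decreasing bin packing, a stand-in illustration of
# packing processor memory demands into fixed-capacity memories. Demands
# are placed largest-first into the first memory with room.

def first_fit_decreasing(demands, capacity):
    bins = []                                  # free space left per memory
    assignment = []
    for size in sorted(demands, reverse=True):
        for i, free in enumerate(bins):
            if size <= free:
                bins[i] -= size
                assignment.append((size, i))
                break
        else:
            bins.append(capacity - size)       # open a new memory
            assignment.append((size, len(bins) - 1))
    return len(bins), assignment

n, placed = first_fit_decreasing([5, 7, 5, 2, 4, 2], capacity=10)
print(n, placed)    # 3 memories suffice for these demands
```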

13.
In this paper, we describe and evaluate three possible architectures for using 3D-DRAMs and PCMs in the processor memory hierarchy. We explore: (i) using 3D-DRAM as main memory with PCM as the backing store; (ii) using 3D-DRAM as the last level cache (LLC) and PCM as the main memory; and (iii) using both 3D-DRAM and PCM as main memory. In each configuration, since the proposed memories are significantly faster than today's off-chip 2D DRAM main memories and magnetic hard drives, we introduce hardware assistance to speed up virtual-to-physical address translation. We use Simics, a full system simulator, and benchmarks from both the SPEC and OLTP suites to evaluate our designs, and we use CACTI to obtain energy and latency values for our configurations, measuring energy consumed and execution performance for the selected benchmarks. Our studies lead to the following conclusions. The best performance is obtained when 3D-DRAM is used as the LLC and PCM as the main memory; however, this organization performs poorly in terms of energy consumed. Using 3D-DRAM together with PCM as main memory is the best choice in terms of energy. In terms of write-backs, 3D-DRAM as LLC causes fewer writes to PCM than the other organizations. These experiments can be extended to explore specific memory organizations, the 3D-DRAM capacities needed as LLC or main memory, and how the hybrid PCM/DRAM memory should be used in specific application contexts.

14.
Memory Complexity Analysis of Numerical Programs
As more and more techniques are deployed to bridge the ever-widening speed gap between processor and memory, computer memory systems have grown increasingly complex. Today it is hard for any programmer, and above all the author of numerical programs, to achieve high performance without considering the characteristics of the memory system of the computing platform in use. Evaluating algorithms in the traditional way, from time complexity and space complexity alone, is therefore clearly insufficient to explain the large performance differences between implementations of the same algorithm on the same platform; the characteristics of the platform's memory system must be taken into account when analyzing an algorithm's complexity.

15.
石云  张素文 《微计算机信息》2007,23(35):108-109,43
This paper describes a method of using FLASH MEMORY for data storage. It explains the basic principle of attaching FLASH MEMORY to an E5CPU as a data memory, analyzes the address mapping process in detail, and gives the corresponding block diagrams and the key program code.

16.
Low-Power Design of the Memory Access Unit in an Embedded Processor
Taking the Godson-1 processor as its subject, this paper explores low-power design techniques for the memory access unit of an embedded processor. After analyzing the structure, power consumption, and critical paths of the memory access unit, it exploits the principle of locality and proposes a scheme that consults a history of recent virtual addresses to decide whether the RAM blocks of the TLB and cache must be accessed at all, markedly reducing their access counts. On average the technique cuts TLB power by 28.1%, cache power by 54.3%, and total processor power by 23.2%, while critical-path delay actually drops and processor performance improves slightly.
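A minimal sketch of the virtual-address-history trick (the class, names, and page size are assumptions, not the paper's design): if the current access falls in the same page as the previous one, the cached translation is reused and the TLB RAM is not read:

```python
# Hypothetical sketch: a one-entry history filter in front of the TLB.
# Accesses that stay within the last page reuse the cached translation,
# so the TLB RAM is not touched and its access energy is saved.

PAGE_SHIFT = 12

class FilteredTLB:
    def __init__(self, translate):
        self.translate = translate            # full TLB lookup (expensive)
        self.last_vpn = None
        self.last_pfn = None
        self.tlb_reads = 0

    def lookup(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        if vpn != self.last_vpn:              # history miss: touch the TLB
            self.tlb_reads += 1
            self.last_vpn, self.last_pfn = vpn, self.translate(vpn)
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        return (self.last_pfn << PAGE_SHIFT) | offset

tlb = FilteredTLB(translate=lambda vpn: vpn + 0x100)   # toy page table
for a in (0x2000, 0x2008, 0x2FFC, 0x3000):             # sequential accesses
    tlb.lookup(a)
print(tlb.tlb_reads)   # 2 RAM reads instead of 4, thanks to locality
```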

17.
This paper presents a cost-effective, high-performance dual-thread VLIW processor model. The dual-thread VLIW processor model is a low-cost subset of the Weld architecture paradigm. It supports one main thread and one speculative thread running simultaneously in a VLIW processor, with a register file and a fetch unit per thread along with memory disambiguation hardware for speculative load and store operations. The paper analyzes the performance impact of the dual-thread VLIW processor, including the effect of migrating the disambiguation hardware for speculative load operations into the compiler and the model's sensitivity to variations in branch misprediction penalty, second-level cache miss penalty, and register file copy time. Up to 34 percent improvement in performance can be attained with the dual-thread VLIW processor compared to a single-threaded VLIW processor model.

18.
This paper analyzes the sources of performance losses in hardware transactional memory (HTM) and investigates techniques to reduce them. It dissects the root causes of data conflicts in HTM systems into four classes: true sharing, false sharing, silent store, and write-write conflicts. These conflicts can cause performance and energy losses due to aborts and extra communication. To quantify the losses, the paper proposes the 5C cache-miss classification model, which extends the well-established 4C model with a new class of cache misses known as contamination misses. The paper also contributes two techniques for removing data conflicts: one for false sharing conflicts and another for silent store conflicts. In addition, it revisits and adapts a technique that can reduce losses due to both true and false conflicts. All of the proposed techniques can be accommodated in a lazy-versioning, lazy-conflict-resolution HTM built on top of a MESI cache-coherence infrastructure with quite modest extensions. Their ability to reduce these losses is quantitatively established, individually as well as in combination; performance and energy consumption are improved substantially.
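The four conflict classes can be illustrated with a toy classifier (the function and its arguments are hypothetical, not the paper's mechanism) that inspects one remote write against a cache line a transaction has read or written:

```python
# Hypothetical classifier for the four conflict classes named above,
# applied to a single remote write hitting a line inside a transaction's
# read/write sets: write-write, false sharing, silent store, true sharing.

def classify(local_read_words, local_write_words, line,
             remote_word, remote_value):
    if remote_word in local_write_words:
        return "write-write conflict"
    if remote_word not in local_read_words:
        return "false sharing"            # same line, different words
    if line[remote_word] == remote_value:
        return "silent store"             # write does not change the value
    return "true sharing"

line = {0: 10, 1: 20, 2: 30, 3: 40}       # word offset -> current value
print(classify({0, 1}, {3}, line, remote_word=2, remote_value=99))  # false sharing
print(classify({0, 1}, {3}, line, remote_word=1, remote_value=20))  # silent store
print(classify({0, 1}, {3}, line, remote_word=0, remote_value=77))  # true sharing
print(classify({0, 1}, {3}, line, remote_word=3, remote_value=55))  # write-write
```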

19.
In this paper, we propose and evaluate a cluster-based network server called PRESS. The server relies on locality-conscious request distribution and a standard for user-level communication to achieve high performance and portability. We evaluate PRESS by first isolating the performance benefits of three key features of user-level communication: low processor overhead, remote memory accesses, and zero-copy transfers. Next, we compare PRESS to servers that involve less intercluster communication but are not as easily portable. Our results for an 8-node server cluster and five WWW traces demonstrate that user-level communication can improve performance by as much as 52 percent compared to a kernel-level protocol. Low processor overhead, remote memory writes, and zero-copy transfers all make nontrivial contributions toward this overall gain. Our results also show that portability in PRESS causes no throughput degradation when we exploit user-level communication extensively.

20.
This paper presents a model for evaluating the memory hierarchy performance of multi-core processors. Built on queueing theory, the model is fast to solve and can show, early in the design cycle, how different configuration parameters affect overall processor performance, allowing the memory hierarchy to be tuned and the design optimized.
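A minimal example of such a model (the formula and all parameters are illustrative assumptions, not the paper's): average memory access time combines per-level hit rates with an M/M/1 queueing delay at the memory controller, so configuration parameters can be swept before any RTL exists:

```python
# Hypothetical queueing-based memory hierarchy model: average memory access
# time (AMAT) from per-level hit rates, with an M/M/1 mean residence time
# standing in for queueing delay at the memory controller.

def amat(l1_hit, l2_hit, t_l1, t_l2, service_time, request_rate):
    util = request_rate * service_time
    assert util < 1, "memory controller saturated"
    t_mem = service_time / (1 - util)          # M/M/1 mean residence time
    miss_l1 = 1 - l1_hit
    miss_l2 = 1 - l2_hit
    return t_l1 + miss_l1 * (t_l2 + miss_l2 * t_mem)

# Sweep the L2 hit rate to see its effect on access time (in cycles).
for l2_hit in (0.5, 0.7, 0.9):
    t = amat(l1_hit=0.95, l2_hit=l2_hit, t_l1=2, t_l2=12,
             service_time=60, request_rate=0.01)
    print(l2_hit, round(t, 2))
```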
