Similar Documents
18 similar documents found.
1.
A software-pipelining loop buffer is designed to store and dispatch loop-body instructions, reducing the number of instruction-memory accesses during loop execution and thus the impact of memory latency on performance. Building on a detailed study of software pipelining and loop unrolling, the loop buffer was designed to hold 112 32-bit instructions, with dedicated loop instructions controlling the execution of loop code. The design was verified by simulation and synthesized with Design Compiler.
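To make the idea concrete, here is a minimal behavioral sketch in C of how such a buffer removes instruction-memory accesses. The fill-on-first-iteration policy, the structure, and all names are our assumptions for illustration; only the 112-instruction capacity comes from the abstract.

```c
/* Minimal behavioral sketch of a loop buffer (not the paper's RTL):
 * the first iteration fills the buffer from instruction memory; later
 * iterations fetch from the buffer, skipping memory accesses entirely. */
#include <stdint.h>
#include <stdio.h>

#define LOOP_BUF_SLOTS 112          /* capacity from the abstract */

typedef struct {
    uint32_t slot[LOOP_BUF_SLOTS];
    int      len;                   /* instructions captured */
    int      active;                /* set by a loop-control instruction */
} loop_buffer;

/* hypothetical instruction memory */
static uint32_t imem_fetch(uint32_t pc) { return 0x90000000u | pc; }

/* fetch one instruction; counts instruction-memory accesses taken */
static uint32_t fetch(loop_buffer *lb, uint32_t pc, uint32_t loop_base,
                      long *mem_accesses) {
    int idx = (int)(pc - loop_base) / 4;
    if (lb->active && idx < lb->len)
        return lb->slot[idx];       /* hit: no instruction-memory access */
    uint32_t insn = imem_fetch(pc);
    (*mem_accesses)++;
    if (lb->active && idx < LOOP_BUF_SLOTS) {
        lb->slot[idx] = insn;       /* capture on first iteration */
        if (idx >= lb->len) lb->len = idx + 1;
    }
    return insn;
}

int main(void) {
    loop_buffer lb = { .active = 1 };
    long mem = 0;
    /* 10-instruction loop body, 100 iterations */
    for (int it = 0; it < 100; it++)
        for (uint32_t pc = 0; pc < 40; pc += 4)
            (void)fetch(&lb, pc, 0, &mem);
    printf("instruction-memory accesses: %ld (vs. 1000 unbuffered)\n", mem);
    return 0;
}
```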

2.
Traditional pipeline design is branch-centric: a large amount of logic is devoted to improving branch prediction so that the issue and execution units receive a sufficient instruction stream. For an array many-core processor, this paper proposes a memory-access-centric core pipeline design. By raising the execution priority of load instructions in the pipeline and speculatively executing predicted loads, stalls caused by memory latency in an in-order pipeline are effectively reduced, improving both pipeline performance and energy efficiency. Test results show that, with a 4 KB load-address table, the memory-centric pipeline design yields an 8.6% performance improvement and a 7% energy-efficiency improvement.
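As a toy illustration of the load-first idea, the C sketch below selects among ready micro-ops so that loads issue ahead of ALU ops; the structures and names are invented for illustration and are not the paper's pipeline.

```c
/* Sketch of a load-first select policy for an in-order issue stage
 * (illustrative only; names are invented, not the paper's design):
 * among ready instructions, loads are issued ahead of ALU ops so
 * their memory latency starts, and overlaps, as early as possible. */
#include <stdio.h>

typedef enum { OP_ALU, OP_LOAD, OP_STORE } op_kind;
typedef struct { int id; op_kind kind; int ready; } uop;

/* return index of the uop to issue next, or -1 if none is ready */
static int select_next(const uop *q, int n) {
    int alu = -1;
    for (int i = 0; i < n; i++) {
        if (!q[i].ready) continue;
        if (q[i].kind == OP_LOAD) return i;  /* loads win */
        if (alu < 0) alu = i;
    }
    return alu;
}

int main(void) {
    uop q[] = { {0, OP_ALU, 1}, {1, OP_LOAD, 1}, {2, OP_ALU, 1} };
    printf("issue uop %d first\n", q[select_next(q, 3)].id); /* -> 1 */
    return 0;
}
```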

3.
Software pipelining is an important instruction-scheduling technique that shortens loop execution time by overlapping instructions from different loop iterations. As the gap between processor and memory speed keeps widening, memory instructions, especially those that miss in the cache, increasingly become the bottleneck for performance. Because the latency of these instructions is not fixed, predicting and hiding their latency during software pipelining is crucial. Unlike previous approaches to latency prediction, this work introduces cache profiling: dynamically collected profile information is used to predict memory latency and guide scheduling. Simply increasing the assumed latency of loads in a modulo-scheduled loop also increases the initiation interval, so performance does not improve. The CSMS and FLMS algorithms change load latencies while trying not to enlarge the initiation interval. This work improves CSMS and FLMS by deriving latencies from cache-profiling information, which is more accurate than earlier methods. Experiments show the new method improves SPEC2000 performance by about 1% on average, and by up to 11% on individual benchmarks.
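A minimal sketch of the profile-driven latency choice, assuming invented hit/miss latencies and a miss-ratio threshold; the actual CSMS/FLMS decision is more involved.

```c
/* Hedged sketch of picking a scheduling latency for each load from
 * cache-profile data (thresholds and latencies are assumptions, not
 * the CSMS/FLMS settings): a load is scheduled with miss latency only
 * if its profiled miss ratio is high enough to justify the longer II. */
#include <stdio.h>

#define HIT_LATENCY   2    /* cycles, assumed */
#define MISS_LATENCY 20    /* cycles, assumed */

typedef struct { const char *name; long accesses, misses; } load_profile;

static int sched_latency(const load_profile *p, double threshold) {
    double miss_ratio = p->accesses ? (double)p->misses / p->accesses : 0.0;
    return miss_ratio >= threshold ? MISS_LATENCY : HIT_LATENCY;
}

int main(void) {
    load_profile loads[] = {
        { "ld_a", 100000,   800 },   /* mostly hits  -> short latency */
        { "ld_b", 100000, 42000 },   /* hot misses   -> long latency  */
    };
    for (int i = 0; i < 2; i++)
        printf("%s: schedule with latency %d\n",
               loads[i].name, sched_latency(&loads[i], 0.10));
    return 0;
}
```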

4.
In digital signal processors with a very long instruction word (VLIW) architecture, the instruction memory accounts for a large share of total power. Given the characteristics of DSP applications, a loop buffer can be used to reduce instruction-memory power. This paper proposes a compiler-controlled loop buffer: the compiler selects suitable loop code and places it in the buffer, reducing instruction-memory power during fetch. The paper presents the loop buffer's architectural design, a power analysis, and a compilation method for exploiting the buffer effectively, and finally validates the approach with a function-level power model.
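One plausible selection heuristic, sketched in C under assumed per-fetch energy numbers; the paper's compiler criterion may differ.

```c
/* Illustrative compiler heuristic (an assumption, not the paper's
 * algorithm): a loop is placed in the loop buffer if its body fits and
 * the fetch energy saved over its expected iterations is positive. */
#include <stdio.h>

#define BUF_CAPACITY     64     /* instructions, assumed */
#define E_IMEM_FETCH     10.0   /* pJ per instruction-memory fetch, assumed */
#define E_BUF_FETCH       2.0   /* pJ per loop-buffer fetch, assumed */
#define E_BUF_FILL       12.0   /* pJ to write one instruction into buffer */

typedef struct { int body_len; long trip_count; } loop_info;

static int place_in_buffer(const loop_info *l) {
    if (l->body_len > BUF_CAPACITY) return 0;
    double saved = (double)l->body_len * (l->trip_count - 1)
                   * (E_IMEM_FETCH - E_BUF_FETCH);
    double cost  = (double)l->body_len * E_BUF_FILL;
    return saved > cost;
}

int main(void) {
    loop_info fir = { 24, 256 }, cold = { 40, 2 };
    printf("fir  -> %s\n", place_in_buffer(&fir)  ? "buffer" : "imem");
    printf("cold -> %s\n", place_in_buffer(&cold) ? "buffer" : "imem");
    return 0;
}
```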

5.
Research on a High-Bandwidth Memory-Access Pipeline for General-Purpose Processors
Memory speed has fallen far behind processor speed, and the increasingly severe memory-access problem constrains further gains in processor performance. Reducing load-to-use latency is the key to improving memory performance; with other conditions fixed, widening the memory path is the most effective way to reduce load-to-use latency, but more bandwidth means more hardware complexity in the memory path, which inevitably increases its power. Starting from an analysis of programs' inherent memory-access characteristics, this work explores the design and optimization space of high-bandwidth memory pipelines, identifies regularities in program memory behavior, and derives from them a low-complexity, low-latency, low-power solution. The work greatly simplifies the design of a high-bandwidth memory pipeline, reduces critical-path delay and power, and was used to guide the memory design of the Godsonx processor: at a cost of 1.7% in total processor area, memory-pipeline bandwidth was doubled and overall processor performance improved by 8.6% on average.

6.
As the speed gap between processor and memory widens, memory instructions, especially frequently missing loads, become a major performance bottleneck. Because the compiler cannot know how many cycles a memory instruction will actually take, it usually assumes either the cache-hit or the cache-miss latency, neither of which is accurate. We introduce cache profiling to collect per-instruction hit/miss information at run time and use it to compute memory latencies. On an out-of-order machine, the hardware dynamically schedules instructions well within the issue window, but the compiler has the advantage of scheduling over a much longer range. Once a cache miss occurs, the reorder buffer easily fills up and the pipeline stalls. Scheduling miss-prone loads to execute in parallel overlaps and hides their long latencies, improving program performance. We therefore both adjust the latencies of frequently missing loads and modify the scheduling policy to raise memory-level parallelism. Experiments show our scheduling improves bzip2 by up to 4.8% and art by 4%, with an overall average improvement of 1.5%.
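A toy sketch of the memory-level-parallelism idea: hoisting independent miss-prone loads together so their latencies overlap. The two-pass ordering and all names are ours, not the paper's scheduler.

```c
/* Toy illustration of raising memory-level parallelism: independent
 * miss-prone loads are hoisted together so their miss latencies overlap
 * instead of serializing one reorder-buffer stall after another. */
#include <stdio.h>

typedef struct { const char *txt; int is_load; int miss_prone; } insn;

/* prints a schedule that issues miss-prone loads first; assumes the
 * instructions shown are independent of each other */
static void schedule(insn *code, int n) {
    for (int pass = 0; pass < 2; pass++)          /* pass 0: hot loads */
        for (int i = 0; i < n; i++)
            if ((pass == 0) == (code[i].is_load && code[i].miss_prone))
                printf("  %s\n", code[i].txt);
}

int main(void) {
    insn code[] = {
        { "ld r1, a[i]   ; miss-prone", 1, 1 },
        { "add r4, r2, r3",             0, 0 },
        { "ld r5, b[i]   ; miss-prone", 1, 1 },
        { "mul r6, r4, r7",             0, 0 },
    };
    puts("MLP-oriented order:");
    schedule(code, 4);
    return 0;
}
```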

7.
The wide adoption of handheld and other embedded applications places strict demands on the cost, performance, and power of embedded systems and their core component, the embedded processor; existing techniques cannot, or can only with difficulty, improve the performance of pipelined embedded processors at low cost. This paper proposes a branch-hiding technique targeting the short loops that appear frequently in embedded applications and account for most of their execution time: by "hiding" the loop-back branch instruction, pipeline efficiency is improved. Analysis shows the method effectively speeds up short-loop execution at a small hardware cost, better meeting the needs that embedded applications place on embedded systems and processors.
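A behavioral sketch of one way to hide the back-branch: a zero-overhead hardware loop, which we assume here for illustration; the paper's exact mechanism may differ.

```c
/* Behavioral sketch of hiding a short loop's back-branch with a
 * hardware loop: once armed, fetch wraps from loop end to loop start
 * without the branch ever being issued into the pipeline. */
#include <stdio.h>

typedef struct { unsigned start, end, count; } hw_loop;

/* next fetch PC: wraps while iterations remain, no branch executed */
static unsigned next_pc(hw_loop *l, unsigned pc) {
    if (l->count > 1 && pc == l->end) { l->count--; return l->start; }
    return pc + 4;
}

int main(void) {
    hw_loop l = { .start = 0x100, .end = 0x108, .count = 3 };
    unsigned pc = 0x100;
    for (int i = 0; i < 9; i++) {          /* 3-insn body, 3 iterations */
        printf("fetch 0x%x\n", pc);
        pc = next_pc(&l, pc);
    }
    return 0;
}
```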

8.
In domestic Sunway (申威) high-performance multicore server systems, the base compiler does not take the processor's instruction features into account when generating code for memory accesses, so the generated address-computation code is inefficient, hurting the performance of the domestic high-performance processor. To fully exploit the processor's computing power, this paper proposes a compiler optimization that accelerates memory-address computation. Based on the processor's support for arithmetic instructions with a scaling factor, a legality-check algorithm for multiply-add address expressions is added to the backend's memory-address legality check: multiply-add patterns in address expressions are recognized and validated automatically, and qualifying expressions are matched during code generation to scaled arithmetic instructions that compute the address quickly, speeding the issue and execution of memory instructions and the generation of memory addresses in applications. Evaluated with the industry-standard SPEC CPU2006 suite, the optimization improves the SPECspeed Integer and SPECspeed Float Point subsets by 2.53% and 1.50% on average, respectively.
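To illustrate the pattern match, the sketch below checks whether base + index*scale can use a scaled-add instruction; the encodable scales (4 and 8, Alpha-style) and the mnemonics are assumptions, not the Sunway compiler's actual tables.

```c
/* Sketch of the kind of pattern check involved (our illustration, not
 * the compiler's actual pass): an address expression base + idx * scale
 * is legal for a scaled-add instruction only when the scale is one the
 * ISA encodes (we assume 4 and 8, as in Alpha-style s4add/s8add). */
#include <stdio.h>

typedef struct { const char *base, *idx; long scale; } addr_expr;

static int matches_scaled_add(const addr_expr *e) {
    return e->scale == 4 || e->scale == 8;   /* assumed encodable scales */
}

static void emit_addr(const addr_expr *e) {
    if (matches_scaled_add(e))               /* one instruction */
        printf("s%ldadd %s, %s, at\n", e->scale, e->idx, e->base);
    else {                                    /* fallback: mul + add */
        printf("mul  %s, %ld, at\n", e->idx, e->scale);
        printf("add  at, %s, at\n", e->base);
    }
}

int main(void) {
    addr_expr a = { "r1", "r2", 8 }, b = { "r1", "r2", 12 };
    emit_addr(&a);   /* s8add r2, r1, at */
    emit_addr(&b);   /* mul/add fallback */
    return 0;
}
```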

9.
Emerging non-volatile memories, represented by phase-change memory (PCM), offer advantages that conventional dynamic random-access memory (DRAM) lacks, such as high storage density and low static power, but their long write latency severely degrades memory performance. This work designs a PCM-based memory system for a graphics processing unit (GPU). Simulation shows that memory write requests in GPU programs are distributed very unevenly, with a small set of memory addresses accessed at very high frequency. A dedicated buffer designed around this skewed access distribution can effectively hold the frequently accessed data, reducing the number of PCM accesses and eliminating the negative impact of long write latency on system performance. Results on a GPU simulator show that the buffer-based PCM memory system effectively improves GPU performance.

10.
For mixed embedded-control and digital-signal-processing applications, a load-ahead mechanism is built for a processor with a fused MCU-DSP architecture. The core uses static superscalar techniques, has three pipelines (integer, load/store, and loop), and adopts a special four-stage pipeline. In the load/store pipeline, the load-ahead mechanism dynamically reorders memory accesses so that loads may execute ahead of earlier stores, preparing operands for the integer pipeline sooner and speeding up pipeline processing.
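The following C sketch shows the standard disambiguation check we assume such load-ahead needs: a load may bypass queued stores only when no address matches; otherwise the store's data is forwarded.

```c
/* Sketch of letting a load run ahead of queued stores (a standard
 * disambiguation check we assume here, not the core's exact logic):
 * the load may bypass only if its address matches no pending store;
 * on a match, the store's data is forwarded instead. */
#include <stdint.h>
#include <stdio.h>

#define SQ_SIZE 4

typedef struct { uint32_t addr; uint32_t data; int valid; } store_entry;

/* returns 1 and forwards data if a pending store matches; else the load
 * may bypass the whole queue and access memory early */
static int load_check(const store_entry *sq, uint32_t addr, uint32_t *out) {
    for (int i = SQ_SIZE - 1; i >= 0; i--)      /* youngest match wins */
        if (sq[i].valid && sq[i].addr == addr) { *out = sq[i].data; return 1; }
    return 0;
}

int main(void) {
    store_entry sq[SQ_SIZE] = { { 0x1000, 42, 1 } };
    uint32_t v;
    if (load_check(sq, 0x1000, &v)) printf("forwarded %u\n", (unsigned)v);
    if (!load_check(sq, 0x2000, &v)) puts("load 0x2000 bypasses stores");
    return 0;
}
```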

11.
The instruction compression mechanism used to overcome the drawbacks of traditional very long instruction word (VLIW) architectures often leads to poor code density in the instruction cache, causing irregular-length long instructions to span cache-line boundaries. Such split long instructions cannot be fetched in a single access, which creates a bottleneck for VLIW architectures. This paper proposes a buffering mechanism that realigns a split long instruction into a contiguous form for more efficient instruction fetching. The approach preserves the behavior of software pipelining, which schedules iterative instructions to enhance the streaming-processing performance of VLIW architectures. In the proposed mechanism, the instruction stream buffer stores the repeated block in its entirety and suspends cache accesses as far as possible to reduce access time. Repeatedly issuing instructions from the instruction buffer and avoiding split long instructions substantially improve instruction-fetch performance. Simulation results show that the mechanism is efficient at the instruction level for the basic DSP/IMG library, improving performance by 35% on average.

12.
As processors continue to exploit more instruction level parallelism, greater demands are placed on the performance of the memory system. In this paper, we introduce a novel modification of the processor pipeline called memory renaming. Memory renaming applies register access techniques to load and store instructions to speed the processing of memory traffic. The approach works by accurately predicting memory communication early in the pipeline and then re-mapping the communication to fast physical registers. This work extends previous studies of data value and dependence speculation. When memory renaming is added to the processor pipeline, renaming can be applied to 30-50% of all memory references, translating to an overall improvement in execution time of up to 14% for current pipeline configurations. As store forward delay times grow larger, renaming support can lead to performance improvements of as much as 42%. Furthermore, this improvement is seen across all memory segments—including the heap segment, which has often been difficult to manage efficiently.
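A deliberately simplified toy model of the idea follows: a predictor maps a load to the value-file entry its producing store wrote. Real hardware trains the predictor differently (e.g., from observed store-load communication), and every renamed load must still be verified against memory.

```c
/* Toy model of memory renaming (our simplification of the idea, not
 * the paper's hardware): a predictor maps a load PC to the value-file
 * entry last written by its communicating store, so the load's value is
 * available without waiting on the memory pipeline. */
#include <stdint.h>
#include <stdio.h>

#define VF_SIZE 8

static uint32_t value_file[VF_SIZE];

/* direct-mapped PC -> value-file-index predictor (assumed structure) */
static int pred_idx[64];

static void store_exec(uint32_t store_pc, uint32_t load_pc, uint32_t data) {
    int idx = store_pc % VF_SIZE;
    value_file[idx] = data;              /* stage value in the file */
    pred_idx[load_pc % 64] = idx;        /* train the communicating load */
}

static uint32_t load_exec(uint32_t load_pc) {
    return value_file[pred_idx[load_pc % 64]];  /* speculative: must be
                                                   verified against memory */
}

int main(void) {
    store_exec(/*store_pc=*/0x40, /*load_pc=*/0x80, /*data=*/7);
    printf("renamed load reads %u\n", (unsigned)load_exec(0x80));
    return 0;
}
```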

13.
Architecture Design of the Godson-1 Processor
This paper first introduces the background and technical roadmap of the Godson processor project, analyzing the reasons for its high-performance positioning, its steady step-by-step design strategy, and its compatibility with mainstream processors. It argues that, given that matching foreign processors' clock frequencies is currently infeasible, performance should be raised by optimizing the processor architecture, with breakthroughs in architecture techniques as the foundation. The paper then presents the architecture of the Godson-1 processor, including a dynamic pipeline based on operation-queue reuse, precise exception handling under out-of-order execution, the instruction-fetch and branch-control structure, memory management, and a system security design against buffer-overflow attacks. Testing shows that Godson-1's instruction pipeline is efficient and that its security design effectively defends against network attacks based on buffer-overflow techniques; however, Godson-1's caches are too small and their organization needs improvement.

14.
The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed during a memory access is 10 times higher than during a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced the cache-access power consumption by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. By adding a new link field, SLC achieves the low power consumption and low access delay of a direct-mapped cache together with a cache hit rate close to that of a two-way set-associative cache. In addition, we developed another design known as multiple linked caches (MLC) to further reduce the power consumption of each cache access and avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions to reduce the power consumption of each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access memory to obtain the requested instruction directly if the instruction is not in the cache. In our simulations, this method performed better than selective compression, a traditional cache, and a filter cache in terms of cache hit rate, power consumption, and execution time.

15.
The branch target buffer (BTB) is one of the main power consumers in high-end embedded CPUs. To eliminate the redundant power introduced by BTB lookups, a loop-body access filter is proposed that removes useless BTB lookups by sequential instructions in loop instruction streams. A branch-trace method is further proposed to compensate for the performance loss caused by the filter wrongly filtering non-loop branch instructions inside loop bodies, saving much of the redundant power that sequential instructions in loop streams would otherwise spend on BTB lookups. Simulations on the Powerstone benchmarks show that, with a 128-entry BTB, a two-level loop filter plus a 4-entry branch trace table reduce BTB power by about 71.9%, with a degradation in average cycles per instruction (CPI) of only 0.66%.
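A sketch of the gating decision, with invented structure sizes and field names: while fetch stays inside a captured loop body, only PCs recorded in the branch trace table probe the BTB.

```c
/* Sketch of gating BTB lookups inside a detected loop (structure sizes
 * and field names are ours): while the fetch PC stays inside a captured
 * loop body, the BTB is probed only at PCs recorded in a small branch
 * trace table, so sequential instructions skip the BTB entirely. */
#include <stdio.h>

#define TRACE_ENTRIES 4

typedef struct {
    unsigned loop_start, loop_end;        /* captured loop body bounds */
    unsigned branch_pc[TRACE_ENTRIES];    /* in-loop branch PCs        */
    int      nbranch, active;
} loop_filter;

/* returns 1 if the BTB must be probed at this PC */
static int need_btb_lookup(const loop_filter *f, unsigned pc) {
    if (!f->active || pc < f->loop_start || pc > f->loop_end)
        return 1;                          /* outside loop: probe as usual */
    for (int i = 0; i < f->nbranch; i++)
        if (f->branch_pc[i] == pc) return 1;  /* known in-loop branch */
    return 0;                              /* sequential: skip the BTB */
}

int main(void) {
    loop_filter f = { 0x100, 0x11c, { 0x110, 0x11c }, 2, 1 };
    long probes = 0;
    for (unsigned pc = 0x100; pc <= 0x11c; pc += 4)
        probes += need_btb_lookup(&f, pc);
    printf("BTB probes per iteration: %ld of 8 fetches\n", probes);
    return 0;
}
```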

16.
IEEE Micro, 2004, 24(6): 62-73
With the natural trend toward integration, microprocessors are increasingly supporting multiple cores on a single chip. To keep design effort and costs down, designers of these multicore microprocessors frequently target an entire product range, from mobile laptops to high-end servers. This article discusses a continual flow pipeline (CFP) processor. Such a processor architecture can sustain a large number of in-flight instructions (commonly referred to as the instruction window and comprising all instructions renamed but not retired) without requiring the cycle-critical structures to scale up. By keeping these structures small and making the processor core tolerant of memory latencies, a CFP mechanism enables the new core to achieve high single-thread performance, and many such cores can be placed on a chip for high throughput. The resulting large instruction window reveals substantial instruction-level parallelism and achieves memory latency tolerance, while the small size of cycle-critical resources permits a high clock frequency.

17.
Simultaneous multithreading (SMT) processors can issue multiple instructions from distinct processes or threads in the same cycle. This technique effectively increases the overall throughput by keeping the pipeline resources more occupied, at the potential expense of reducing single-thread performance due to resource sharing. In the software domain, an increasing number of dynamically linked libraries (DLL) are used by applications and operating systems, providing better flexibility and modularity, and enabling code sharing. It is observed that a significant amount of execution time in software today is spent executing standard DLL instructions that are shared among multiple threads or processes. However, for an SMT processor with a virtually-indexed cache implementation, existing instruction fetching mechanisms can induce unnecessary false I-TLB and I-Cache misses caused by the DLL-based instructions that are intended to be shared. This problem is more prominent when multiple independent threads are executing concurrently on an SMT processor. In this work, we investigate a neglected form of contention between running threads in the I-TLB and I-Cache (including both VIVT and VIPT) due to DLLs. To address these shortcomings, we propose a system-level technique involving a light-weight modification in the microarchitecture and the OS. By exploiting the nature of the DLLs in our optimized system, we can reinstate the intended sharing of the DLLs in an SMT machine. Using Microsoft Windows based applications, our simulation results show that the optimized instruction fetching mechanism can reduce the number of DLL misses up to 5.5 times and improve the instruction cache hit rates by up to 62%, resulting in up to 30% DLL IPC improvements and up to 15% overall IPC improvements.

18.
To address pipeline stalls caused by physical-register resource conflicts in deep superscalar pipelines, a non-blocking instruction-issue method is proposed in which multiple instructions share the same physical register resource. The method continues allocating physical registers under resource conflicts, uses an issue buffer queue to temporarily hold the conflicting instructions, increases the number of physical registers actually allocatable at the issue stage, frees up the issue window, and improves the parallelism of physical-register usage. Experimental results show that, compared with conventional renaming, the method achieves the same performance with 27.3% fewer physical registers.
