首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 203 毫秒
1.
以基本块为单位的非顺序指令预取   总被引:1,自引:0,他引:1  
取指令能力的高低对微处理器的性能有很大影响。指令预取技术能够有效地降低指令Cache的访问失效率,提高微处理器的取指令能力,进而提高微处理器的性能。本文提出了一种由分支指令指导的、以基本块为单位的非顺序指令预取技术,每次预取将一个完整的基本块读入指令Cache。这种方法使用静态策略分析程序行为,实现所需的硬件复杂度低。模拟结果显示,该方法能够有效地提高指令Cache访问的命中率。  相似文献   

2.
同步数据触发体系结构SDTA将传统指令级并行细化到微操作级并行,具有较高的数据处理能力,但其特殊的指令格式及指令特性,给指令Cache访问带来了挑战。指令预取技术能够有效地降低指令Cache的访问失效率,增强处理器取指能力,提高性能。本文分析了SDTA指令集特性,提出了一种适合SDTA指令集特性的软硬件相结合的混合指令预取机制,采用硬件预取引擎和软件提示相结合进行预取。该方法能够有效地提高指令Cache命中率,且具有实现简单、无效预取率低、不会增加代码体积等特点。  相似文献   

3.
在含Cache的处理器中,代码排布和指令预取是减少取指延迟的常用技术.代码排布侧重研究代码执行的空间相对位置,指令预取则关注于代码执行的时间相对关系.片上Trace技术非入侵地获得程序的执行路径及时间信息,将代码执行的时空关系联系起来,因此为排布技术和预取技术的结合使用提供了基础.基于YHFT-DSP平台,利用程序运行的周期行为特性设置预取,利用VLIW结构处理器的空闲单元执行预取指令,提出以增加预取容限为目标的函数级代码排布方法.实验结果表明,该方法能有效预取并减少指令Cache失效.  相似文献   

4.
基于取指执行时序范畴的多核共享Cache干扰分析   总被引:2,自引:0,他引:2  
在多核结构中,获得并行应用线程的安全、精确的最坏情况执行时间(worst case execution time,WCET)的最大挑战之一在于共享资源的竞争冲突检测.在共享Cache的多核处理器中,线程在共享Cache中的指令可能被其他并行线程的指令替换,从而导致了线程间在共享Cache上的干扰,因此多核结构下线程WCET需要考虑并行线程间在共享Cache上的干扰.在现有的简单地址映射干扰分析基础上,考虑了指令取指执行时序因素对干扰的影响,提出了非干扰状态的充分不必要条件,根据指令的取指执行时序范畴判断线程在共享Cache上的干扰状态.通过排除非干扰状态,可以进一步精确多核结构中线程的WCET估值.理论分析证明了该方法的有效性.实验结果表明,与当前现有的考虑执行周期和基于逻辑访问先后顺序的方法相比,基于时序方法下的WCET估值分别可以提高12%和7%的精确度.  相似文献   

5.
在实时软件系统中,软件时间性能的分析与评估技术是一个重要的课题,然而随着CPU的结构越来越复杂,采用传统的模拟底层硬件执行的方法越来越困难。而基于分布函数的最坏执行时间(Worst Case Execution Time,WCET)估计方法从概率角度出发,可以绕过复杂的底层硬件建模,估计程序的最坏执行时间。首先对TI TMS320C6713 DSP汇编代码进行基本块的划分,以基本块为结点构建程序流图;然后用贝塔分布模拟每条指令的运行时间并采用改进的计划评审技术(Program Evaluation and Review Technique,PERT)确定贝塔分布相关参数,指令叠加后用正态分布模拟每个基本块的执行时间;最后利用基于路径的方法得到整个程序的最坏执行时间。实验结果表明此方法是可行的和合理的。  相似文献   

6.
正确有效的指令预取策略是避免指令缺失的关键技术,程序流程改变时指令预取方向正确率不高、指令预取准确度和存储器带宽有效利用率较低是导致指令缺失的主要因素.本文提出基于置信度评估的自适应选择性指令主动推送技术ASIAP,一方面减少无效指令预取的数量,进行精确指令预取,在避免Cache污染的同时提升指令预取的有效性;另一方面采用指令主动推送部件自适应选择性地完成部分非顺序指令预取请求,减少了取入错误路径上无用指令的可能.通过与Next_Line、Target_Line、Wrong_Path、BTA、Markov和CFGP等策略的对比,在2-16内核配置下,ASIAP策略相对于其它策略准确性平均提升3.7%-28.71%;L1 I-cache缺失率平均下降3.3%-14.39%.  相似文献   

7.
在同时多线程处理器中,提高取指单元的吞吐率意味着各线程之间的Cache竞争更加激烈,而这种竞争又制约着取指单元吞吐率的提高。本文针对当前超长指令字体系结构的新特点,提出了一种同时提高取指单元和处理器吞吐率的方法。该方法通过尽可能早地作废取指流水线中的无效地址,减少了由无效取指导致的程序Cache冲突,也提高了整个处理器的性能。实验结果表明,该方法使处理器和取指单元的吞吐率均相对提高了12%~23%,而一级程序Cache的失效率则略微增加甚至降低。另外,它还能够减少10%~25%的一级程
程序Cache读访问,从而降低了处理器的功耗。  相似文献   

8.
由于多核处理器优越的计算性能,多核处理器现已广泛应用在嵌入式实时系统中.相对于单核处理器,多核处理器存在资源共享竞争、并行任务干扰等因素,尤其是缓存(Cache)一致性问题,导致任务最坏情况执行时间(worst-case execution time,WCET)的预测更加困难.基于以上因素,提出基于多级一致性协议的多核处理器WCET分析方法.该方法针对多级一致性协议体系架构,提出多级一致性域的概念,将多核处理器的数据访问分为域内访问和跨域访问2个层次,根据Cache读写策略和MESI(modify exclusive shared invalid)一致性协议,得出一致性域内部和跨一致性域的Cache状态更新函数,从而实现多级一致性协议嵌套情况下的WCET分析.实验结果表明,在改变Cache配置参数的情况下,该方法分析结果与GEM5仿真结果的变化趋势一致,经过相关性分析,GEM5仿真结果与该方法分析结果相关性系数不低于0.98;在分析精度方面,该方法的平均过估计率为1.30,相比现有方法降低了0.78.  相似文献   

9.
吕鸣松  关楠  王义 《软件学报》2014,25(2):179-199
实时系统时间分析的首要任务是估计程序的最坏情况执行时间(worst-case execution time,简称WCET).程序的WCET 通常受到硬件体系结构的影响,Cache则是其中最为突出的因素之一.对面向WCET计算的Cache分析研究进行了综述,介绍了经典Cache分析框架与Cache分析核心技术,并从循环结构分析、数据Cache分析、多级Cache分析、多核共享Cache分析、非LRU替换策略分析等角度介绍了Cache分析在不同维度上的研究问题与主要挑战,总结了现有技术的优缺点,展望了Cache分析研究的未来发展方向.  相似文献   

10.
方娟  张红波 《计算机科学》2012,(Z2):48-50,64
存储访问延迟一直是制约计算机系统整体性能的瓶颈,多核处理器的出现使"存储墙"问题更加严重。预取技术可以隐藏存储访问延迟,因此基于多核处理器的预取技术最近成为学术界研究的热点。研究了目前较为新颖的多核处理器预取技术Future execution,然后针对其缺陷提出改进,即提出了FE-Runahead架构,其减少了二级Cache访问缺失,提高了二级Cache命中率。实验结果表明,改进后的预取架构的二级Cache命中率提高了约9%,相对执行时间减少了8%。  相似文献   

11.
The schedulability analysis of real-time embedded systems requires worst case execution time (WCET) analysis for the individual tasks. Bounding WCET involves not only language-level program path analysis, but also modeling the performance impact of complex micro-architectural features present in modern processors. In this paper, we statically analyze the execution time of embedded software on processors with speculative execution. The speculation of conditional branch outcomes (branch prediction) significantly improves a program's execution time. Thus, accurate modeling of control speculation is important for calculating tight WCET estimates. We present a parameterized framework to model the different branch prediction schemes. We further consider the complex interaction between speculative execution and instruction cache performance, that is, the fact that speculatively executed blocks can generate additional cache hits/misses. We extend our modeling to capture this effect of branch prediction on cache performance. Starting with the control flow graph of a program, our technique uses integer linear programming to estimate the program's WCET. The accuracy of our method is demonstrated by tight estimates obtained on realistic benchmarks.  相似文献   

12.
Lee  Minsuk  Min  Sang Lyul  Shin  Heonshik  Kim  Chong Sang  Park  Chang Yun 《Real-Time Systems》1997,13(1):47-65
Cache memories have been extensively used to bridge the speed gap between high speed processors and relatively slow main memory. However, they are not widely used in real-time systems due to their unpredictable performance. This paper proposes an instruction prefetching scheme called threaded prefetching as an alternative to instruction caching in real-time systems. In the proposed threaded prefetching, an instruction block pointer called a thread is assigned to each instruction memory block and is made to point to the next block on the worst case execution path that is determined by a compile-time analysis. Also, the thread is not updated throughout the entire program execution to guarantee predictability. This paper also compares the worst case performances of various previous instruction prefetching schemes with that of the proposed threaded prefetching. By analyzing several benchmark programs, we show that the worst case performance of the proposed scheme is significantly better than those of previous instruction prefetching schemes. The results also show that when the block size is large enough the worst case performance of the proposed threaded prefetching scheme is almost as good as that of an instruction cache with 100 % hit ratio.  相似文献   

13.
Hard real-time systems demand high performance in combination with a timing predictable program execution. The performance of a system in the worst-case, represented by its worst case execution time (WCET), highly depends on the design of the memory subsystem. In this paper we focus on the instruction memory hierarchy and quantify the impact of different on-chip instruction memories on the worst-case timing of the system. A function-based dynamic instruction scratchpad (D-ISP), an instruction cache, and static instruction scratchpads using basic-block-based and function-based assignment algorithms are compared. Therefore, we provide WCET bounds for systems with different on-chip instruction memories and different off-chip memory timings.We show that for small memory sizes a static instruction scratchpad usually outperforms the other memories in terms of the WCET estimate. However, with increasing memory sizes the D-ISP is able to reach lower WCET bounds. An instruction cache can only provide lower WCET bounds than the other memories, if no suitable assignment for the static instruction scratchpads is found or if the D-ISP suffers from thrashing or frequently loads unused code.  相似文献   

14.
The calculation of worst case execution time (WCET) is a fundamental requirement of almost all scheduling approaches for hard real-time systems. Due to their unpredictability, hardware enhancements such as cache and pipelining are often ignored in attempts to find WCET of programs. This results in estimations that are excessively pessimistic. In this article a simple instruction pipeline is modeled so that more accurate estimations are obtained. The model presented can be used with any schedulability analysis that allows sections of nonpreemptable code to be included. Our results indicate the WCET overestimates at basic block level can be reduced from over 20% to less than 2%, and that the overestimates for typical structured real-time programs can be reduced by 17%–40%.  相似文献   

15.
A prefetch method that enables stride prefetching at the secondary cache without accessing the processor's internal resources is developed and evaluated. It uses a data-range-table that enables it to detect usable strides and memory access streams which fall into the same data range. By using program driven simulation of scientific applications in the context of shared-memory multiprocessors, it is shown that the proposed method can reduce load stall times by an amount comparable to a conventional stride driven prefetching method which requires access to the processor's instruction address register.  相似文献   

16.
Timing analysis of concurrent programs running on shared cache multi-cores   总被引:1,自引:1,他引:0  
Memory accesses form an important source of timing unpredictability. Timing analysis of real-time embedded software thus requires bounding the time for memory accesses. Multiprocessing, a popular approach for performance enhancement, opens up the opportunity for concurrent execution. However due to contention for any shared memory by different processing cores, memory access behavior becomes more unpredictable, and hence harder to analyze. In this paper, we develop a timing analysis method for concurrent software running on multi-cores with a shared instruction cache. Communication across tasks is by message passing. Our method progressively improves the lifetime estimates of tasks that execute concurrently on multiple cores, in order to estimate potential conflicts in the shared cache. Possible conflicts arising from overlapping task lifetimes are accounted for in the hit-miss classification of accesses to the shared cache, to provide safe execution time bounds. We show that our method produces lower worst-case response time (WCRT) estimates than existing shared-cache analysis on a real-world embedded application. Furthermore, we also exploit instruction cache locking to improve WCRT. By locking some beneficial memory blocks into L1 cache, the WCET of the tasks and L2 cache conflicts are reduced, resulting in better WCRT. Experiments demonstrate that significant WCRT reduction is achieved through cache locking.  相似文献   

17.
In this paper we propose and evaluate a new data-prefetching technique for cache coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40–50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach is when a trap handler completes after data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it uses prefetching with very little instruction overhead, which is a limitation for traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining a high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would not require any additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead less than 0.42%. The emulated prefetch engine removes in general less stall time at a higher instruction overhead.  相似文献   

18.
In a multi-task embedded system, a cache is shared by different tasks, which increases the complexity of cache management and the unpredictability of cache behavior. This unpredictability in turn brings an overestimation of application??s worst-case execution time (WCET) and worst-case CPU utilization (WCU) which are two of the most important criteria for real-time embedded systems. Modern processors often provide cache locking capability, which can be applied statically and dynamically to manage cache in a predictable manner. The selection of instructions to be locked in the instruction cache (I-Cache) has dramatic influence on the system performance. This paper focuses on applying cache locking techniques to the shared I-Cache to minimize WCU for multi-task embedded systems. We analyze and compare three different strategies to perform I-Cache locking: static locking, semi-dynamic locking, and dynamic locking. Different algorithms are proposed utilizing the foreknown information of embedded applications. Experimental results show that the proposed algorithms can reduce WCU compared to previous techniques.  相似文献   

19.
Modeling out-of-order processors for WCET analysis   总被引:1,自引:0,他引:1  
Estimating the Worst Case Execution Time (WCET) of a program on a given processor is important for the schedulability analysis of real-time systems. WCET analysis techniques typically model the timing effects of micro-architectural features in modern processors (such as pipeline, cache, branch prediction) to obtain safe and tight estimates. In this paper, we model out-of-order superscalar processor pipelines for WCET analysis. The analysis is, in general, difficult even for a basic block (a sequence of instructions with single-entry and single-exit points) if some of the instructions have variable latencies. This is because the WCET of a basic block on out-of-order pipelines cannot be obtained by assuming maximum latencies of the individual instructions. Our timing estimation technique for a basic block proceeds by a fixed-point analysis of the time intervals at which the instructions enter/leave a pipeline stage. To extend our estimation to whole programs, we use Integer Linear Programming (ILP) to combine the timing estimates for basic blocks. Timing effects of instruction cache and branch prediction are also modeled within our pipeline analysis framework. This forms a combined timing analysis framework that captures out-of-order pipeline, cache, branch prediction as well as the mutual interaction among these micro-architectural features. The accuracy of our analysis is demonstrated via tight estimates obtained for several benchmarks. Preliminary version of parts of this paper has previously been published as Li et al. (2004). Abhik Roychoudhury received his B.E. in Computer Engineering from Jadavpur University (India) in 1995 and his M.S. / Ph.D. degrees (both in Computer Science) from the State University of New York at Stony Brook in 1997 and 2000 respectively. Since 2001 he has been an Assistant Professor at National University of Singapore. His research interests are in models and methods for reliable development of embedded software and systems, with specific focus on software validation, analysis and comprehension. Xianfeng Li is a postdoctoral researcher in the Department of Computer Science and Technology at Peking University, China. He received his Ph.D. from National University of Singapore in 2005. His research interests include real-time systems, modeling and evaluation of computer architecture, and System-on-Chips. Tulika Mitra is an Assistant Professor in School of Computing at National University of Singapore from January 2001. She received her PhD in Computer Science from SUNY at Stony Brook in December 2000. Tulika received M.E in Computer Science and Automation from Indian Institute of Science in 1997 and her B.E. in Computer Engineering from Jadavpur University, India in 1995. Her current research focuses on design and analysis of embedded and real-time systems.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号