期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

蒋进松黄凯陈辰王钰博严晓浪《计算机工程与科学》2016,38(12):2381-2391

为了提高片上Flash在嵌入式应用中的读取速度,提出了一种基于预取和缓存原理的片上Flash加速控制器。该控制器包括预取缓存和高速缓存两种加速方案。其中预取缓存方案采用位宽扩展和预取技术加速顺序指令的读取,并采用分支缓存存储非顺序指令,降低由非顺序指令造成的预取缺失代价;而高速缓存方案采用组相联和路预测技术,提高指令重用率,减少Flash访问次数,降低系统功耗。针对不同的应用场景,两种加速方案既可通过寄存器来静态切换,也可通过软件流程来自适应动态切换,从而获得最佳的读取速度提升。多项基准程序的测试结果表明了所提出的片上Flash加速控制器在性能和功耗优化上的可行性和高效性。相似文献

2.

一种高能效的结构不对称指令缓存

刘骁高红光陈芳园丁亚军《计算机工程与科学》2017,39(3):443-450

在现代微处理器中,指令缓存的Tag读取、比较消耗了指令缓存较大比例的能耗。提出一种基于推断的低能耗指令缓存:不对称指令缓存。根据跳转指令比例低的特点,在该结构中区别处理跳转指令和顺序指令,使用和数据不完全对应的简化标记管理位。该结构采用了命中推断和变长指令取指两种创新技术,其中基于命中推断技术解决了指令缓存命中时Tag比较过多的问题;使用变长指令取指技术提高了顺序指令块的命中率。实验结果表明,对于选取的SPEC2006测试程序,不对称指令缓存结构较常规L1指令Cache取指能耗下降了40%~60%,比无标记指令缓存结构TH IC能耗降低了9%;取指ED2P方面,较常规L1指令Cache优化约50%,比TH IC结构优化约17%。相似文献

3.

Variable Length Instruction Compression on Transport Triggered Architectures

Timo Viitanen Janne Helkala Heikki Kultala Pekka Jääskeläinen Jarmo Takala Tommi Zetterman Heikki Berg 《International journal of parallel programming》2018,46(6):1283-1303

The memories used for embedded microprocessor devices consume a large portion of the system’s power. The power dissipation of the instruction memory can be reduced by using code compression methods, which may require the use of variable length instruction formats in the processor. The power-efficient design of variable length instruction fetch and decode is challenging for static multiple-issue processors, which aim for low power consumption on embedded platforms. The memory-side power savings using compression are easily lost on inefficient fetch unit design. We propose an implementation for instruction template-based compression and two instruction fetch alternatives for variable length instruction encoding on transport triggered architecture, a static multiple-issue exposed data path architecture. With applications from the CHStone benchmark suite, the compression approach reaches an average compression ratio of 44% at best. We show that the variable length fetch designs reduce the number of memory accesses and often allow the use of a smaller memory component. The proposed compression scheme reduced the energy consumption of synthesized benchmark processors by 15% and area by 33% on average. 相似文献

4.

Motorola's 88000 family architecture

Alsup M. 《Micro, IEEE》1990,10(3):48-66

The initial members of the 88000 family of high-performance 32-bit microprocessor are the 88100 processor and the 88200 cache and memory management unit (CMMU). The processor manipulates integer and floating-point data and initiates instruction and data memory transactions. The CMMU minimizes the latency of main memory requests by maintaining a cache for data transaction and a cache for memory management translations. A typical system consists of one processor and two identical cache chips, one servicing instruction fetch requests, the other servicing data read and write requests. The overall design process for the 88000 family is described, and the integer instructions are discussed. Decisions made with respect to the processor, cache, and software are examined. Some data on the use of the instruction set by the available compilers and the efficiency of the cache and memory systems are presented 相似文献

5.

Software Trace Cache for Commercial Applications

Alex Ramirez Josep Ll. Larriba-Pey Carlos Navarro Mateo Valero Josep Torrellas 《International journal of parallel programming》2002,30(5):373-395

In this paper we address the important problem of instruction fetch for future wide issue superscalar processors. Our approach focuses on understanding the interaction between software and hardware techniques targeting an increase in the instruction fetch bandwidth. That is the objective, for instance, of the Hardware Trace Cache (HTC). We design a profile based code reordering technique which targets a maximization of the sequentiality of instructions, while still trying to minimize instruction cache misses. We call our software approach, Software Trace Cache (STC). We evaluate our software approach, and then compare it with the HTC and the combination of both techniques. Our results on PostgreSQL show that for large codes with few loops and deterministic execution sequences the STC offers better results than a HTC. Also, both the software and hardware approaches combine well to obtain improved results. 相似文献

6.

一种提高同时多线程VLIW处理器中取指单元吞吐率的方法

下载免费PDF全文

万江华陈书明《计算机工程与科学》2007,29(6):97-101

在同时多线程处理器中,提高取指单元的吞吐率意味着各线程之间的Cache竞争更加激烈,而这种竞争又制约着取指单元吞吐率的提高。本文针对当前超长指令字体系结构的新特点,提出了一种同时提高取指单元和处理器吞吐率的方法。该方法通过尽可能早地作废取指流水线中的无效地址,减少了由无效取指导致的程序Cache冲突,也提高了整个处理器的性能。实验结果表明,该方法使处理器和取指单元的吞吐率均相对提高了12%～23%,而一级程序Cache的失效率则略微增加甚至降低。另外,它还能够减少10%～25%的一级程
程序Cache读访问,从而降低了处理器的功耗。相似文献

7.

The Precomputed-Branch architecture: Efficient branches with compiler support

《Journal of Systems Architecture》1999,45(9):651-679

Accurate instruction fetch and branch prediction is increasingly important in today's superscalar architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of predicting the likely outcome of branch instructions. A branch target buffer (BTB) is often used to provide target addresses for taken branches and to predict the destination of indirect jumps. Using a BTB avoids the delay needed to recalculate the destination address and reduces the misfetch penalty. However, an effective branch target buffer can be large and can possibly increase the cycle time of a processor. We propose that a design used in older computers, such as the PDP-8, be used in modern architectures instead of a BTB design. The compiler would pre-compute the branch destination for most branch instructions, allowing the branch information to be stored with the instruction. We consider computing branch destinations at link time and as instructions are fetched into the instruction cache; both alternatives offer similar performance with different advantages. A very small BTB is still useful to predict indirect branches, which cannot be pre-computed. Our results show that the Precomputed-Branch architecture performs better than an architecture using only a BTB, and has significant hardware savings. This is particularly true for larger programs more representative of modern applications. 相似文献

8.

Instruction scheduling and transformation for a VLIW unified reduced instruction set computer/digital signal processor processor with shared register architecture

Cheng‐Yu Lee Min‐Chin Hung Rong‐Guey Chang 《Concurrency and Computation》2014,26(1):134-151

The popularity of multimedia applications made them a major theme in embedded systems. The key component for supporting multimedia application well is embedded processor. Thus, we have designed and implemented an embedded processor, called UniDual processor, to achieve this objective. Its key features are the integration of instructions of reduced instruction set computers (RISCs) and digital signal processors (DSPs) as well as the support of special instruction set and shared‐based clustered register architecture. However, an important issue of UniDual that remains open is how to efficiently allocate registers. In this paper, we present a scheduling and instruction transformation approach to resolve the aforementioned issue. The proposed approach schedules instructions and then transforms overlapped instructions into RISC and DSP instructions by taking communication overhead and hardware limitations into account. Compared with the greedy approach, the evaluation shows that our work is relatively effective in performance and code size reduction. Copyright © 2012 John Wiley & Sons, Ltd. 相似文献

9.

A section cache system designed for VLIW architectures

《Journal of Systems Architecture》2000,46(14):1293-1308

The static specification of operations executed in parallel using No Operations (NOPs) is another culprit to make code size to be increased in VLIW architecture. Some alternatives in the instruction encoding and memory subsystem are proposed to minimize the impact of NOP on the code size. One is the compressed cache using the packed encoding scheme and the other is the decompressed cache using the unpacked encoding scheme. The compressed cache shows high memory utilization but increases the pipeline branch penalty because it requires very complex fetch hardware. On the contrary, the fetch overhead can be decreased in the decompressed cache because the unpacked encoding scheme allows an instruction to be issued to the pipeline without any recovery process. However, it has a shortcoming that the memory utilization is deteriorated due to the memory allocation irrespective of the number of useful operations. In this research, a new instruction encoding scheme called a semi-packed encoding scheme and the section cache, which enables effective store and retrieval of semi-packed instructions, are proposed. This can decrease the hardware complexity to fetch an instruction and the wasted memory space due to NOPs via the partially fixed length of an instruction. The experimental results reveal that the memory utilization in the section cache is 3.4 times higher than in the decompressed cache. The memory subsystem using the section cache can provide about 15% performance improvement with the moderate size of chip area. 相似文献

10.

一种精确的分支预测微处理器模型 总被引：3，自引：0，他引：3

陈跃跃周兴铭《计算机研究与发展》2003,40(5):741-745

在当今深流水宽发射的微处理器中，为实现高性能，精确的分支预测是不可缺少的关键技术．分支预测失效将浪费大量的时钟周期，无法发挥乱序执行的效能．宽发射微处理器的有效性能同时还依赖指令窗口的大小和指令预取宽度．提出了一种新的更精确的支持分支预测和分支误预测周期损失的微处理器模型．根据指令的执行带宽为指令窗口中可用指令数的平方根统计规律，给出了一个更为精确的描述微处理器取指带宽、分支预测精度、分支误预测周期损失、指令窗口大小和IPC之间关系的算法，并讨论了这些参数的综合权衡以及这些参数对程序IPC的影响．由此可以确定依赖多个微处理器参数的取指带宽阈值和微处理器中几个关键参数的选取．相似文献

11.

Execution History Guided Instruction Prefetching

Zhang Yi Haga Steve Barua Rajeev 《The Journal of supercomputing》2004,27(2):129-147

The increasing gap in performance between processors and main memory has made effective instructions prefetching techniques more important than ever. A major deficiency of existing prefetching methods is that most of them require an extra port to I-cache. A recent study by Rivers et al. [19] shows that this factor alone explains why most modern microprocessors do not use such hardware-based I-cache prefetch schemes. The contribution of this paper is two-fold. First, we present a method that does not require an extra port to I-cache. Second, the performance improvement for our method is greater than the best competing method BHGP [23] even disregarding the improvement from not having an extra port. The three key features of our method that prevent the above deficiencies are as follows. First, late prefetching is prevented by correlating misses to dynamically preceding instructions. For example, if the I-cache miss latency is 12 cycles, then the instruction that was fetched 12 cycles prior to the miss is used as the prefetch trigger. Second, the miss history table is kept to a reasonable size by grouping contiguous cache misses together and associated them with one preceding instruction, and therefore, one table entry. Third, the extra I-cache port is avoided through efficient prefetch filtering methods. Experiments show that for our benchmarks, chosen for their poor I-cache performance, an average improvement of 9.2% in runtime is achieved versus the BHGP methods [23], while the hardware cost is also reduced. The improvement will be greater if the runtime impact of avoiding an extra port is considered. When compared to the original machine without prefetching, our method improves performance by about 35% for our benchmarks. 相似文献

12.

基于跳转轨迹的分支目标缓冲研究

熊振亚林正浩任浩琪《计算机科学》2017,44(3):195-201, 214

现代计算机体系结构受两个方面的困扰:性能和能耗。为降低嵌入式处理器日益增长的功耗,提出基于跳转轨迹的分支目标缓冲结构(TG-BTB)。与传统分支目标缓冲每次提取指令时需要查询分支目标缓冲不同,TG-BTB只在执行轨迹预测为跳转时才查询分支目标缓冲。该结构通过在程序执行过程中动态分析跳转轨迹行为,可以实现只在轨迹跳转时查询分支目标缓冲,从而降低功耗。在动态分析过程中首先提取记录两条跳转分支指令之间的指令间隔,然后将提取的指令间隔存储在TG-BTB中,最后根据存储在TG-BTB中的指令间隔决定是否需要查询BTB。基于基准测试向量进行模型验证和性能测试,实验结果表明TG-BTB降低了81%的BTB查询能耗。相似文献

13.

An Energy-Efficient Instruction Scheduler Design with Two-Level Shelving and Adaptive Banking 总被引：2，自引：0，他引：2

下载免费PDF全文

Yu-Lai Zhao Xian-Feng Li Dong Tong and Xu Cheng 《计算机科学技术学报》2007,22(1):15-24

Mainstream processors implement the instruction scheduler using a monolithic CAM-based issue queue （IQ）, which consumes increasingly high energy as its size scales. In particular, its instruction wakeup logic accounts for a major portion of the consumed energy. Our study shows that instructions with 2 non-ready operands （called 2OP instructions） are in small percentage, but tend to spend long latencies in the IQ. They can be effectively shelved in a small RAM-based waiting instruction buffer （WIB） and steered into the IQ at appropriate time. With this two-level shelving ability, half of the CAM tag comparators are eliminated in the IQ, which significantly reduces the energy of wakeup operation. In addition, we propose an adaptive banking scheme to downsize the IQ and reduce the bit-width of tag comparators. Experiments indicate that for an 8-wide issue superscalar or SMT proeessor,the energy consumption of the instruction scheduler can be reduced by 67%. Furthermore, the new design has potentially faster scheduler clock speed while maintaining close IPC to the monolithic scheduler design. Compared with the previous work on eliminating tags through prediction, our design is superior in terms of both energy reduction and SMT support. 相似文献

14.

RSA 踪迹驱动指令Cache 计时攻击研究

陈财森王韬郭世泽周平《软件学报》2013,24(7):1683-1694

指令Cache 攻击是基于获取算法执行路径的一种旁路攻击方式.首先,通过分析原有RSA 指令Cache 计时攻击存在可行性不高且能够获取的幂指数位不足等局限性,建立了新的基于监视整个指令Cache 而不只是监视特定指令Cache 的踪迹驱动计时攻击模型;然后,提出了一种改进的基于SWE 算法窗口大小特征的幂指数分析算法;最后,在实际环境下,利用处理器的同步多线程能力确保间谍进程与密码进程能够同步运行.针对OpenSSLv.0.9.8f 中的RSA算法执行指令Cache 计时攻击实验,实验结果表明:新的攻击模型在实际攻击中具有更好的可操作性;改进的幂指数分析算法能够进一步缩小密钥搜索空间,提高了踪迹驱动指令Cache 计时攻击的有效性.对于一个512 位的幂指数,新的分析算法能够比原有分析算法多恢复出大约50 个比特位. 相似文献

15.

Implementation of the PIPE processor

Farrens M.K. Pleszhun A.R. 《Computer》1991,24(1):65-70

The PIPE (parallel instruction with pipelined execution) processor, which is the result of a research project initiated to investigate high-performance computer architectures for VLSI implementation, is described. The lessons learned from the implementation are discussed. The most important result was the discovery that supporting architectural queues does not complicate the instruction issue logic and fees the processor clock rate from external memory speed influences. It was also found that the decision to support an instruction set with two instruction sizes and to allow consecutive two-parcel instruction issues profoundly affected the instruction fetch logic design. Other significant results concerned the issue logic, barrel shifter, cache control logic, and branch count 相似文献

16.

Linked instruction caches for enhancing power efficiency of embedded systems

Chang-Jung Ku Ching-Wen Chen An Hsia Chun-Lin Chen 《Microprocessors and Microsystems》2014

The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed during a memory access is 10 times higher than during a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced the cache-access power consumption by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. SLC has the features of low power consumption and low access delay, similar to a direct mapping cache, and a high cache hit rate similar to a two way-set associative cache by adding a new link field. In addition, we developed another design known as a multiple linked caches (MLC) to further reduce the power consumption during each cache access and avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions to reduce the power consumption during each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access the memory to obtain the requested instruction directly if the instruction is not in the cache. In the simulation results, our method performed better than selective compression, traditional cache, and filter cache in terms of the cache hit rate, power consumption, and execution time. 相似文献

17.

同步数据触发体系结构中指令预取技术研究

下载免费PDF全文

郭建军戴葵王志英《计算机工程与科学》2009,31(8)

同步数据触发体系结构SDTA将传统指令级并行细化到微操作级并行,具有较高的数据处理能力,但其特殊的指令格式及指令特性,给指令Cache访问带来了挑战。指令预取技术能够有效地降低指令Cache的访问失效率,增强处理器取指能力,提高性能。本文分析了SDTA指令集特性,提出了一种适合SDTA指令集特性的软硬件相结合的混合指令预取机制,采用硬件预取引擎和软件提示相结合进行预取。该方法能够有效地提高指令Cache命中率,且具有实现简单、无效预取率低、不会增加代码体积等特点。相似文献

18.

CASA: A New IFU Architecture for Power-Efficient Instruction Cache and TLB Designs

下载免费PDF全文

孙含欣杨鲲鹏赵雨来佟冬程旭《计算机科学技术学报》2008,23(1):141-153

The instruction fetch unit （IFU） usually dissipates a considerable portion of total chip power. In traditional IFU architectures, as soon as the fetch address is generated, it needs to be sent to the instruction cache and TLB arrays for instruction fetch. Since limited work can be done by the power-saving logic after the fetch address generation and before the instruction fetch, previous power-saving approaches usually suffer from the unnecessary restrictions from traditional IFU architectures. In this paper, we present CASA, a new power-aware IFU architecture, which effectively reduces the unnecessary restrictions on the power-saving approaches and provides sufficient time and information for the power-saving logic of both instruction cache and TLB. By analyzing, recording, and utilizing the key information of the dynamic instruction flow early in the front-end pipeline, CASA brings the opportunity to maximize the power efficiency and minimize the performance overhead. Compared to the baseline configuration, the leakage and dynamic power of instruction cache is reduced by 89.7% and 64.1% respectively, and the dynamic power of instruction TLB is reduced by 90.2%. Meanwhile the performance degradation in the worst case is only 0.63%. Compared to previous state-of-the-art power-saving approaches, the CASA-based approach saves IFU power more effectively, incurs less performance overhead and achieves better scalability. It is promising that CASA can stimulate further work on architectural solutions to power-efficient IFU designs. 相似文献

19.

基于同时多线程的IFSBSMT取指策略研究

李静梅关海洋《计算机科学》2012,39(8):311-315

取指策略直接影响处理器的指令吞吐率.针对传统处理器取指策略存在取指带宽利用不均衡、指令队列冲突率高的缺点,提出基于同时多线程处理器的取指策略IFSBSMT.该策略以线程的IPC值为基础,速取优先级高的线程进行取指,并利用预取指令条数预算的方式分配取指带宽,采取线程IPC值和L2 Cache缺失率的双优先级动态资源分配机制分配处理器的系统资源.研究结果表明,IFSBSMT策略有效地解决了取指带宽、指令队列冲突及资源浪费问题,进一步提高了指令吞吐率,且具有较好的取指公平性. 相似文献

20.

Trace Cache及Trace处理器技术 总被引：3，自引：0，他引：3

下载免费PDF全文

杜贵然罗金平徐明胡瀚涛周兴铭《计算机工程与科学》2001,23(1):39-43

Trace Cache和Trace处理器着力解决取指令的带宽,是一种颇具潜力的技术。本文在介绍Trace Cache技术的基础上,结合ILP研究的现状,提出了未来Trace相关技术的研究方向。相似文献