Similar Documents
20 similar documents retrieved.
1.
Timing simulation of a processor is a key enabling technique for exploring the design space of a system architecture or for developing software without available hardware. We propose a fast cycle-approximate simulation technique for modern superscalar out-of-order processors. The proposed technique is designed in two parts: the front-end provides correct functional execution of the guest application, and the back-end provides a timing model. For the back-end, we developed a novel processor timing model that combines a simple formula-based analytical model with a scheduling analysis of sampled traces, boosting simulation speed with minimal accuracy loss. Together with a cache simulator, a branch predictor, and a trace analyzer, the proposed technique is implemented on top of the popular and portable QEMU emulator and named TQSIM (Timed QEMU-based SIMulator). Sacrificing around 8 percent accuracy, TQSIM enables simulation one to two orders of magnitude faster than a reference cycle-accurate simulator when the target architecture is an ARM Cortex-A15 processor. TQSIM is an open-source project currently available online.
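A back-end of this kind can be pictured as an analytical formula applied to statistics gathered from sampled traces. The following C++ sketch is a minimal illustration of such a cycle-approximate estimate; the structure, field names, and all penalty constants are assumptions for illustration, not TQSIM's actual model.

    #include <cstdint>
    #include <cstdio>

    // Statistics collected by the functional front-end and the attached
    // cache simulator / branch predictor over a sampled trace window.
    struct TraceStats {
        uint64_t instructions;        // dynamically executed instructions
        uint64_t branch_mispredicts;
        uint64_t l1_misses;           // misses that hit in L2
        uint64_t l2_misses;           // misses that go to DRAM
    };

    // Hypothetical machine parameters (an A15-class core is 3-wide;
    // the penalties below are illustrative, not measured values).
    struct CoreModel {
        double issue_width = 3.0;
        double mispredict_penalty = 15.0; // cycles per mispredicted branch
        double l2_hit_penalty = 12.0;     // exposed cycles per L1 miss
        double dram_penalty = 120.0;      // exposed cycles per L2 miss
        double mlp = 1.8;                 // assumed memory-level parallelism
    };

    // Cycle-approximate estimate: base issue time plus penalty terms,
    // with long-latency misses overlapped by the assumed MLP factor.
    double estimate_cycles(const TraceStats& s, const CoreModel& m) {
        double base = s.instructions / m.issue_width;
        double branch = s.branch_mispredicts * m.mispredict_penalty;
        double mem = (s.l1_misses * m.l2_hit_penalty +
                      s.l2_misses * m.dram_penalty) / m.mlp;
        return base + branch + mem;
    }

    int main() {
        TraceStats s{1000000, 8000, 30000, 2500};
        CoreModel m;
        std::printf("estimated cycles: %.0f\n", estimate_cycles(s, m));
    }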

2.
Current rendering processors aim to process triangles as fast as possible, and they tend to be equipped with multiple rasterizers so that many triangles can be handled in parallel to increase polygon rendering performance. However, such parallel architectures can suffer a consistency problem when more than one rasterizer tries to access data at the same address. This paper proposes a consistency-free memory architecture for sort-last parallel rendering processors, in which a consistency-free pixel cache architecture is devised and effectively associated with three memory-system components: a single frame buffer, a memory interface unit, and consistency-test units. Furthermore, the proposed architecture reduces the latency caused by pixel cache misses, because a rasterizer does not wait for miss handling to complete when a pixel cache miss occurs. Experimental results show that the proposed architecture achieves almost linear speedup up to four rasterizers with a single frame buffer.

3.
刘祯  刘斌  郑凯 《软件学报》2007,18(12):3115-3123
Routers need to implement route lookup, a basic function, flexibly, at high speed, and at low cost. We design a software-based route-lookup cache algorithm for network processors. A portion of the network processor's on-chip high-speed memory is set aside, and instruction code maintains a table caching route-lookup results. By choosing a suitable hash function that balances conflicts among entries against refresh complexity, the algorithm shortens route-lookup latency, reduces contention for the memory bus among the multiple processing engines, and leaves more processing time for other network applications. Experiments on real network traffic show that even with only a small number of entries per processing engine, the throughput of the network processor is effectively improved.
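A minimal C++ sketch of this kind of software-managed result cache: a small direct-mapped table of destination-to-port entries kept in fast on-chip memory and indexed by a hash of the destination address. The hash function, table size, and helper names are illustrative assumptions, not the paper's implementation.

    #include <cstdint>
    #include <cstdio>

    // One cached lookup result: destination IP -> output port.
    struct CacheEntry {
        uint32_t dst_ip;   // key (0 treated as invalid, for simplicity)
        uint16_t port;     // cached next-hop port
    };

    constexpr int kEntries = 256;        // small table in on-chip memory
    static CacheEntry g_cache[kEntries];

    // Simple multiplicative hash; a real design would choose the function
    // to balance entry collisions against refresh complexity.
    static inline uint32_t hash_ip(uint32_t ip) {
        return (ip * 2654435761u) >> 24; // fold 32 bits down to 8
    }

    // Stand-in for the full longest-prefix match against the routing
    // table in slow off-chip memory.
    static uint16_t slow_route_lookup(uint32_t dst_ip) {
        return static_cast<uint16_t>(dst_ip & 0xF);
    }

    uint16_t route_lookup(uint32_t dst_ip) {
        CacheEntry& e = g_cache[hash_ip(dst_ip)];
        if (e.dst_ip == dst_ip)          // hit: no memory-bus traffic
            return e.port;
        uint16_t port = slow_route_lookup(dst_ip);  // miss: full lookup
        e = {dst_ip, port};              // refresh the cache entry
        return port;
    }

    int main() {
        std::printf("port %u\n", (unsigned)route_lookup(0xC0A80001)); // miss
        std::printf("port %u\n", (unsigned)route_lookup(0xC0A80001)); // hit
    }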

4.
To address reorder-buffer (ROB) stalls in superscalar processors, caused by long-latency instructions delaying retirement while decoding continues, an out-of-order instruction commit mechanism is proposed. By designing a multi-buffer commit structure with configurable capacity, memory instructions and ALU-type instructions are retired separately; the capacities of the target buffer and the store buffer are parameterized according to the superscalar architecture and its performance requirements to reduce the risk of pipeline stalls, while the instruction's destination-register encoding is used for the commit mode…

5.
The proposed trace preconstruction mechanism dynamically marks, based on the program's execution flow, traces that frequently miss, and fetches instructions from the instruction cache ahead of time to construct them. The path-prediction-based variant tracks the commit history of traces and, each time the next trace is predicted, also predicts whether the Nth trace after it will need to be preconstructed. By preconstructing these traces in advance, the miss rate of the trace cache can be greatly reduced.

6.
As the speed gap between processors and memory keeps widening, memory access instructions, especially those that frequently miss in the cache, become a major performance bottleneck. Since the compiler cannot know how many cycles a memory instruction actually takes at run time, it typically assumes either the cache-hit or the cache-miss latency for these instructions, which is inaccurate. We introduce cache profiling to collect run-time cache hit/miss information for memory instructions and use it to compute access latencies. On an out-of-order machine, the hardware scheduler performs good dynamic scheduling within the issue window, whereas the compiler has the advantage when scheduling over longer ranges. Once a cache miss occurs, the reorder buffer easily fills up and the pipeline stalls. By scheduling miss-prone instructions so that they execute in parallel, their long miss latencies can be overlapped and hidden, improving program performance. We therefore, for load instructions, both adjust the assumed latency of frequently missing instructions and modify the scheduling policy to raise memory-level parallelism. Experiments show that our scheduling yields up to a 4.8% improvement for bzip2 and 4% for art, with an overall average improvement of 1.5%.
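The feedback step can be pictured as replacing the compiler's fixed assumed load latency with an expected latency derived from the profiled miss ratio. A minimal C++ sketch under assumed cycle counts (the numbers and the policy are illustrative, not the paper's exact scheme):

    #include <cstdint>
    #include <cstdio>

    // Per-load statistics gathered by a cache-profiling run,
    // keyed in practice by the load's program counter.
    struct LoadProfile {
        uint64_t executions;
        uint64_t misses;
    };

    constexpr int kHitLatency  = 3;    // illustrative cycle counts
    constexpr int kMissLatency = 100;

    // Without profiling, the compiler assumes kHitLatency (or
    // kMissLatency) for every load. With profile data it can assign
    // each load an expected latency, so miss-prone loads are scheduled
    // to overlap and memory-level parallelism rises.
    int scheduled_latency(const LoadProfile& p) {
        if (p.executions == 0) return kHitLatency;
        double miss_ratio = double(p.misses) / double(p.executions);
        return int(kHitLatency + miss_ratio * (kMissLatency - kHitLatency));
    }

    int main() {
        LoadProfile rare{100000, 500}, hot{100000, 40000};
        std::printf("rare-miss load:  %d cycles\n", scheduled_latency(rare));
        std::printf("miss-prone load: %d cycles\n", scheduled_latency(hot));
    }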

7.
Burns, Frank; Koelmans, Albert; Yakovlev, Alexandre 《Real-Time Systems》2000,18(2-3):275-288
Determining a tight WCET for a block of code to be executed on a modern superscalar processor architecture is becoming ever more difficult due to the dynamic behaviour exhibited by current processors, which include dynamic scheduling features such as speculative and out-of-order execution in the context of multiple execution units with deep pipelines. We describe the use of Coloured Petri Nets (CP-nets) in a simulation-based approach to this problem. A complex model of a generic processor architecture is described, with emphasis on the modelling strategy for obtaining the WCET, together with an analysis of the results.

8.
The issue queue is the out-of-order control unit of a superscalar processor and one of its key components, playing a decisive role in overall processor performance. A dual-port issue queue structure that effectively improves the performance of out-of-order superscalar processors is proposed. Based on the dependences between instructions, the queue estimates when each instruction can issue and dispatches instructions into different queues. The effects of two different issue policies on performance are compared; the policy that tags the execution pipeline at the input port achieves a higher I…

9.
Parallel workloads on shared-memory multi-core processors often suffer from performance degradation. Cache eviction, true/false sharing, and bus contention are among the well-understood causes of this problem. This paper presents a study showing that the L2 DPL (data prefetch logic) in processors based on the Intel Core microarchitecture can be a cause as well. Through a case study of image integration, the study finds that parallel integration fails to scale for images whose size exceeds the capacity of the processor's L2 cache. An analysis of the relevant performance events using the Intel VTune™ Performance Analyser shows that L2 DPL prefetching is less effective at fetching the needed data during parallel integration than during serial integration. To resolve the problem, a novel parallel image reverse loading is developed to reduce the number of memory accesses during parallel integration and the associated delay. Experimental results demonstrate that parallel integration after parallel reverse loading shows significant speedup over the same parallel integration after serial loading.
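As a rough illustration of the direction-reversal idea, each thread can touch its strip of the image in reverse row order during loading, so the rows the subsequent integration pass reads first were touched last and are more likely to still be cache-resident. This C++ sketch only shows that reversal; the partitioning and names are assumptions, not the paper's exact scheme.

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Touch one thread's strip of the image in reverse row order, so
    // the rows the following integration pass reads first were loaded
    // last and are still cache-resident.
    void reverse_load_strip(const float* src, float* dst,
                            std::size_t row_begin, std::size_t row_end,
                            std::size_t width) {
        for (std::size_t r = row_end; r-- > row_begin; )   // reverse rows
            for (std::size_t c = 0; c < width; ++c)
                dst[r * width + c] = src[r * width + c];
    }

    void parallel_reverse_load(const float* src, float* dst,
                               std::size_t rows, std::size_t width,
                               unsigned nthreads) {
        std::vector<std::thread> pool;
        std::size_t chunk = rows / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            std::size_t lo = t * chunk;
            std::size_t hi = (t + 1 == nthreads) ? rows : lo + chunk;
            pool.emplace_back(reverse_load_strip, src, dst, lo, hi, width);
        }
        for (auto& th : pool) th.join();
    }

    int main() {
        std::size_t rows = 1024, width = 1024;
        std::vector<float> src(rows * width, 1.0f), dst(rows * width);
        parallel_reverse_load(src.data(), dst.data(), rows, width, 4);
    }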

10.
In low-power embedded systems design, it is important to analyze and optimize both the hardware and the software components of the system. Evaluating the power consumption of embedded systems is a very slow procedure when instruction-level power models are used inside a simulator. Moreover, a huge number of simulations is needed to explore the power consumption of the instruction memory hierarchy and find the best cache parameters for each level of the hierarchy. In this paper we present a methodology that aims to estimate the system power consumption in a short time, without simulation. The proposed methodology is based on fast instruction analysis using instruction-level power models together with cache and memory power models. Based on this methodology, a software tool named FILESPPA was developed to automate the methodology's steps for MIPS processor architectures. The experimental results show the efficiency of the proposed methodology and tool in terms of estimation accuracy, while greatly reducing the power estimation time compared with the simulation technique.
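Such an estimate can be pictured as a weighted sum over the instruction mix plus memory-hierarchy terms. A minimal C++ sketch under assumed, purely hypothetical energy coefficients (FILESPPA's actual models are not reproduced here):

    #include <cstdint>
    #include <cstdio>

    // Instruction counts from a fast, non-simulated analysis of the
    // program, grouped into classes with per-class energy costs.
    struct InstructionMix {
        uint64_t alu, load, store, branch;
    };

    // Hypothetical per-event energy in nanojoules.
    constexpr double kE_alu = 0.5, kE_load = 1.2, kE_store = 1.4,
                     kE_branch = 0.6;
    constexpr double kE_icache_access = 0.8, kE_mem_access = 25.0;

    double estimate_energy_nj(const InstructionMix& m,
                              uint64_t icache_accesses,
                              uint64_t icache_misses) {
        double core = m.alu * kE_alu + m.load * kE_load +
                      m.store * kE_store + m.branch * kE_branch;
        // Instruction-memory hierarchy: every fetch costs a cache
        // access; every miss additionally costs a main-memory access.
        double imem = icache_accesses * kE_icache_access +
                      icache_misses * kE_mem_access;
        return core + imem;
    }

    int main() {
        InstructionMix m{600000, 250000, 100000, 50000};
        std::printf("estimated energy: %.1f uJ\n",
                    estimate_energy_nj(m, 1000000, 20000) / 1000.0);
    }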

11.
For analytical cache modeling, stack distance theory is widely used to predict LRU cache behavior. Typically, the stack distance histogram is collected by profiling memory references. However, profiled memory references merely reflect instruction fetches and load/store executions, which represent only the memory accesses to first-level (L1) caches. That is why these traces cannot be applied directly to construct stack distance histograms for downstream (L2 and L3) caches. This paper therefore proposes a stack distance probability model that extends stack distance theory to multi-level LRU cache behavior prediction. The inputs of the model are the L1 cache stack distance histograms and the multi-level LRU cache configurations. The outputs are the L2 and L3 cache stack distance histograms, with which the conflict misses in the L2 and L3 caches can be quantified quickly and precisely. 15 benchmarks chosen from Mobybench 2.0, MiBench I and Mediabench II are used to evaluate the accuracy of the model. Compared with simulation results from gem5 in AtomicSimpleCPU mode, the average absolute error in predicting cache misses for the I/D shared L2 cache is less than 5%, while that in estimating the L3 cache misses is less than 7%. Furthermore, compared with the time overhead of gem5 AtomicSimpleCPU simulation, the model speeds up cache miss prediction by about 100x on average.
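For reference, the stack distance of an access is the number of distinct blocks touched since the previous access to the same block, and an LRU cache hits exactly when that distance is small enough. A minimal C++ sketch of collecting an L1-style histogram from an address trace (a simple O(N·M) list-based version for clarity, not the paper's probability model):

    #include <cstdint>
    #include <cstdio>
    #include <list>
    #include <vector>

    // Collect a stack distance histogram from an address trace.
    // Distance = number of distinct blocks referenced since the last
    // access to the same block; first-time accesses are cold misses.
    std::vector<uint64_t> stack_histogram(const std::vector<uint64_t>& trace,
                                          uint64_t& cold) {
        std::list<uint64_t> stack;               // most recent at front
        std::vector<uint64_t> hist;
        cold = 0;
        for (uint64_t block : trace) {
            std::size_t depth = 0;
            auto it = stack.begin();
            for (; it != stack.end(); ++it, ++depth)
                if (*it == block) break;
            if (it == stack.end()) {
                ++cold;                          // never seen before
            } else {
                if (hist.size() <= depth) hist.resize(depth + 1, 0);
                ++hist[depth];                   // reuse at this distance
                stack.erase(it);
            }
            stack.push_front(block);             // becomes most recent
        }
        return hist;
    }

    int main() {
        uint64_t cold;
        auto h = stack_histogram({1, 2, 3, 1, 2, 2, 4, 1}, cold);
        for (std::size_t d = 0; d < h.size(); ++d)
            std::printf("distance %zu: %llu\n", d, (unsigned long long)h[d]);
        std::printf("cold: %llu\n", (unsigned long long)cold);
    }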

12.
We propose two hardware mechanisms to decrease energy consumption for ray tracing on massively parallel graphics processors. First, we use a streaming data model and configure part of the L2 cache as a ray stream memory to enable efficient data processing through ray reordering. This increases L1 hit rates and substantially reduces off-chip memory energy through better management of off-chip memory access patterns. To evaluate this model, we augment our architectural simulator with a detailed memory system simulation that includes accurate control, timing and power models for memory controllers and off-chip dynamic random-access memory. These details change the results significantly compared with previous simulations that used a simpler model of off-chip memory, indicating that this type of memory system simulation is important for realistic simulations involving external memory. Second, we employ reconfigurable special-purpose pipelines that are constructed dynamically under program control. These pipelines use shared execution units that can be configured to support the common compute kernels that are the foundation of the ray tracing algorithm. This reduces the overhead incurred by on-chip memory and register accesses. Together, these two synergistic features yield a ray tracing architecture that reduces energy by optimizing both on-chip and off-chip memory activity compared with a more traditional approach.

13.
A Prefetch Policy Combined with the State of the Memory-Access Miss Queue
As the gap between memory access speed and processor computation speed becomes ever more pronounced, memory access performance has become the bottleneck for improving overall system performance. Based on an analysis of the miss behavior of the instruction cache and the data cache, a prefetch policy is proposed that takes the state of the memory-access miss queue into account. The policy preserves the order of instruction and data accesses, which helps extract prefetch streams, and separates instruction-stream prefetching from data-stream prefetching to avoid mutual replacement. In choosing when to launch a prefetch, it considers not only whether the bus is currently idle but also the state of the miss queue, reducing the impact on the processor's normal memory requests. A stream-filtering mechanism improves prefetch accuracy and lowers the memory bandwidth demanded by prefetching. The results show that with this policy the processor's average memory access latency is reduced by 30% and the IPC of the SPEC CPU2000 programs improves by 8.3% on average.
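The launch decision described here can be sketched as a simple predicate over bus and miss-queue state. The field names and the occupancy threshold below are illustrative assumptions:

    #include <cstddef>
    #include <cstdio>

    // Snapshot of the state the prefetcher inspects before launching.
    struct MemSystemState {
        bool bus_idle;               // no demand request on the bus now
        std::size_t miss_queue_len;  // outstanding demand misses
        std::size_t miss_queue_cap;  // miss queue capacity
    };

    // Launch a prefetch only when the bus is idle AND the miss queue
    // has headroom, so prefetches do not delay normal demand misses.
    // The half-occupancy threshold is an illustrative choice.
    bool may_issue_prefetch(const MemSystemState& s) {
        return s.bus_idle && s.miss_queue_len < s.miss_queue_cap / 2;
    }

    int main() {
        MemSystemState busy{true, 7, 8}, calm{true, 1, 8};
        std::printf("%d %d\n", (int)may_issue_prefetch(busy),
                    (int)may_issue_prefetch(calm));         // prints: 0 1
    }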

14.
Design and Performance Analysis of the Godson-2 Processor
This paper describes the design of the Godson-2 (龙芯2号) processor and its measured performance. Godson-2 adopts a four-issue, superscalar, superpipelined structure, with 64 KB each of on-chip L1 instruction and data caches and up to 8 MB of off-chip L2 cache. To fully exploit pipeline efficiency, Godson-2 implements advanced out-of-order execution techniques such as branch prediction, register renaming and dynamic scheduling, as well as dynamic memory access mechanisms including non-blocking cache accesses and load speculation. Godson-2 is fabricated in a 0.18 μm CMOS process; its maximum operating frequency at normal voltage is 500 MHz, and its measured power consumption at 500 MHz is 3-5 W. Its peak floating-point rate is 2 billion operations per second in single precision and 1 billion per second in double precision. Its measured SPEC CPU2000 performance is 8-10 times that of Godson-1, and its overall performance reaches the level of a Pentium III. The prototype chip smoothly runs a complete 64-bit Chinese Linux operating system, the full-featured Mozilla browser, multimedia players and the OpenOffice suite, satisfying the requirements of most desktop applications.

15.
In this paper we propose and evaluate a new data-prefetching technique for cache-coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine, which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40–50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach is when a trap handler completes after data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it uses prefetching with very little instruction overhead, which is a limitation for traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would not require any additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead of less than 0.42%. The emulated prefetch engine removes in general less stall time at a higher instruction overhead.
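The software-emulated variant mentioned at the end can be pictured as a miss-trap handler that walks a compiler-prepared descriptor and issues ordinary prefetch instructions. A heavily simplified user-level C++ sketch; the descriptor layout and handler entry are assumptions, and the GCC/Clang __builtin_prefetch intrinsic stands in for the machine's prefetch instruction:

    #include <cstddef>
    #include <cstdint>

    // Descriptor filled in by the compiler for one missing data
    // structure: a strided stream starting near the faulting address.
    struct PrefetchDescriptor {
        std::intptr_t stride;   // bytes between elements
        int count;              // how many blocks ahead to fetch
    };

    // Handler body: conceptually invoked on a second-level cache miss
    // with the faulting address; it issues the whole stream and
    // returns, so the only instruction overhead is the handler itself.
    void miss_trap_handler(const void* fault_addr,
                           const PrefetchDescriptor& d) {
        const char* p = static_cast<const char*>(fault_addr);
        for (int i = 1; i <= d.count; ++i)
            __builtin_prefetch(p + i * d.stride, /*rw=*/0, /*locality=*/1);
    }

    int main() {
        static char buf[4096];
        PrefetchDescriptor d{64, 8};     // 8 blocks, 64 bytes apart
        miss_trap_handler(buf, d);
    }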

16.
Modern computers provide excellent opportunities for performing fast computations. They are equipped with powerful microprocessors and large memories. However, programs are not necessarily able to exploit those computer resources effectively. In this paper, we present the way in which we have implemented a nearest-neighbor classification. We show how performance can be improved by exploiting the ability of superscalar processors to issue multiple instructions per cycle and by using the memory hierarchy adequately. This is accomplished by the use of floating-point arithmetic, which usually outperforms integer arithmetic, and of block (tiled) algorithms, which exploit the data locality of programs and allow efficient use of the data stored in the cache memory. Our results are validated with both an analytical model and empirical results. We show that regular codes can be executed faster than more complex irregular codes using standard data sets.
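A minimal C++ sketch of the blocking idea for nearest-neighbor classification: scan the training set tile by tile so each tile stays cache-resident while all queries are compared against it, accumulating floating-point squared distances. The tile size and layout are illustrative assumptions, not the paper's exact parameters.

    #include <algorithm>
    #include <cfloat>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Tiled 1-NN: process training points tile by tile so a tile stays
    // cache-resident across all queries. Layout: row-major, one point
    // per row, `dim` features per point.
    std::vector<int> nearest_neighbor_tiled(const std::vector<float>& train,
                                            std::size_t n_train,
                                            const std::vector<float>& query,
                                            std::size_t n_query,
                                            std::size_t dim,
                                            std::size_t tile = 256) {
        std::vector<float> best(n_query, FLT_MAX);
        std::vector<int> best_idx(n_query, -1);
        for (std::size_t t0 = 0; t0 < n_train; t0 += tile) {
            std::size_t t1 = std::min(t0 + tile, n_train);
            for (std::size_t q = 0; q < n_query; ++q) {    // reuse tile
                for (std::size_t t = t0; t < t1; ++t) {
                    float d = 0.0f;                        // squared L2
                    for (std::size_t k = 0; k < dim; ++k) {
                        float diff = query[q * dim + k] - train[t * dim + k];
                        d += diff * diff;
                    }
                    if (d < best[q]) { best[q] = d; best_idx[q] = int(t); }
                }
            }
        }
        return best_idx;
    }

    int main() {
        std::vector<float> train{0, 0,  5, 5,  9, 9};  // 3 points, dim 2
        std::vector<float> query{4, 4};
        auto idx = nearest_neighbor_tiled(train, 3, query, 1, 2);
        std::printf("nearest: %d\n", idx[0]);           // expect 1
    }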

17.

Sparse matrix-vector multiplication (SpMV) is an important algorithm widely used in scientific computing, industrial simulation, intelligent computing and other fields, and is one of the core computational kernels. In some application scenarios, many SpMV iterations are required to meet the demands of accurate numerical simulation, linear algebra solving and the convergence of graph analytics. Because of the high randomness and sparsity inherent in SpMV, its data locality is extremely poor, its cache efficiency extremely low and its computation pattern highly irregular, so this workload has become a difficult optimization target and a research hotspot for current high-performance processors. Based on the architectural characteristics of modern high-performance superscalar out-of-order processors, this work studies the various performance bottlenecks of SpMV in depth and proposes comprehensive performance optimizations from the perspectives of improving predictability and reducing program complexity. The core ideas are: build serially accessed data structures to improve the regularity and locality of data accesses, greatly improving data prefetch efficiency and cache utilization; build regular branch conditions to raise branch prediction accuracy and thereby execution efficiency; and flexibly apply SIMD instruction sets to improve the utilization of compute resources. With these optimizations, the method significantly relieves the performance bottlenecks, greatly improves the utilization of processor resources, cache capacity and memory bandwidth, and achieves an average speedup of 2.6x over the mainstream commercial library MKL and an average of 1.3x over the best existing algorithms.
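For context, the baseline kernel being optimized is the standard CSR sparse matrix-vector product shown below; the indirect gather x[col[j]] is the source of the poor locality and irregular behavior discussed above. This is the textbook kernel, not the optimized version from the paper.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Baseline CSR SpMV: y = A * x. row_ptr[i]..row_ptr[i+1] delimit
    // row i's nonzeros; col[j] is the column of value val[j]. The
    // gather x[col[j]] is the irregular access that defeats caches and
    // prefetchers on unoptimized SpMV.
    void spmv_csr(const std::vector<std::size_t>& row_ptr,
                  const std::vector<int>& col,
                  const std::vector<double>& val,
                  const std::vector<double>& x,
                  std::vector<double>& y) {
        std::size_t n = row_ptr.size() - 1;
        for (std::size_t i = 0; i < n; ++i) {
            double sum = 0.0;
            for (std::size_t j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
                sum += val[j] * x[col[j]];   // irregular gather
            y[i] = sum;
        }
    }

    int main() {
        // A = [[1, 2], [0, 3]], x = [1, 1]  =>  y = [3, 3]
        std::vector<std::size_t> row_ptr{0, 2, 3};
        std::vector<int> col{0, 1, 1};
        std::vector<double> val{1, 2, 3}, x{1, 1}, y(2);
        spmv_csr(row_ptr, col, val, x, y);
        std::printf("y = [%g, %g]\n", y[0], y[1]);
    }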

18.
Routing table lookup is an important operation in packet forwarding and has a significant influence on the overall performance of network processors. Routing tables are usually stored in main memory, which has a large access time; consequently, small fast cache memories are used to improve access time. In this paper, we propose a novel routing table compaction scheme to reduce the number of entries in the routing table. The proposed scheme has three versions and takes advantage of ternary content addressable memory (TCAM) features: two or more routing entries are compacted into one using don't-care elements in the TCAM. A small compacted routing table helps to increase the cache hit rate, which in turn provides fast address lookups. We have evaluated this compaction scheme through extensive simulations involving IPv4 and IPv6 routing tables and routing traces. The routing tables were compacted by over 60% of their original sizes, and the average cache hit rate improved by up to 15% over the original tables. We have also analyzed the port errors caused by caching and developed a new sampling technique to alleviate this problem. The simulations show that sampling is an effective port-error-control scheme that does not degrade cache performance.
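The core compaction step can be illustrated as merging two TCAM entries that share the same next hop and differ in exactly one cared bit, turning that bit into a don't-care. A minimal C++ sketch with a mask-based entry representation (the representation and helper names are assumptions; the paper's three versions are more elaborate):

    #include <cstdint>
    #include <cstdio>
    #include <optional>

    // TCAM-style entry: `care` has a 1 for every bit that must match;
    // `value` gives the required bit values at the cared positions.
    struct TcamEntry {
        uint32_t value;
        uint32_t care;   // 0 bits are don't-care
        uint16_t port;   // next hop
    };

    // Merge two entries into one if they have the same next hop,
    // identical don't-care masks, and differ in exactly one cared bit:
    // that bit becomes a don't-care, saving one table entry.
    std::optional<TcamEntry> try_compact(const TcamEntry& a,
                                         const TcamEntry& b) {
        if (a.port != b.port || a.care != b.care)
            return std::nullopt;
        uint32_t diff = (a.value ^ b.value) & a.care;
        if (diff == 0 || (diff & (diff - 1)) != 0)  // need exactly 1 bit
            return std::nullopt;
        return TcamEntry{a.value & ~diff, a.care & ~diff, a.port};
    }

    int main() {
        TcamEntry a{0xA, 0xF, 7}, b{0xB, 0xF, 7};  // 1010 and 1011 -> 101*
        if (auto m = try_compact(a, b))
            std::printf("merged: value=%x care=%x\n", m->value, m->care);
    }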

19.
Dynamic Partitioning of Shared Cache Memory
This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches. Since memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines the information, and partitions the cache amongst the executing processes/threads. Partition sizes are varied dynamically to reduce the total number of misses. The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve the total IPC significantly over the standard least recently used (LRU) replacement policy. In a certain case, partitioning doubles the total IPC over standard LRU. Our results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory.
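One standard way to use such run-time miss counts is a greedy allocation: repeatedly give the next cache way to whichever process promises the largest marginal miss reduction. The C++ sketch below shows that policy as an illustration; it is not necessarily the paper's exact algorithm.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // miss_at_ways[t][w] = misses thread t would take with w ways,
    // derived from its run-time miss counters (index 0 is unused, so
    // each inner vector must have total_ways + 1 entries).
    // Returns the number of ways given to each thread (each gets >= 1).
    std::vector<int> partition_ways(
            const std::vector<std::vector<long long>>& miss_at_ways,
            int total_ways) {
        std::size_t n = miss_at_ways.size();
        std::vector<int> ways(n, 1);               // minimum allocation
        for (int given = int(n); given < total_ways; ++given) {
            std::size_t best = 0;
            long long best_gain = -1;
            for (std::size_t t = 0; t < n; ++t) {  // marginal reduction
                long long gain = miss_at_ways[t][ways[t]]
                               - miss_at_ways[t][ways[t] + 1];
                if (gain > best_gain) { best_gain = gain; best = t; }
            }
            ++ways[best];                          // grant one more way
        }
        return ways;
    }

    int main() {
        // Misses for 1..4 ways per thread (index 0 unused).
        std::vector<std::vector<long long>> m = {
            {0, 100, 60, 50, 48},
            {0,  80, 30, 10,  5},
        };
        auto w = partition_ways(m, 4);
        std::printf("ways: %d %d\n", w[0], w[1]);  // expect 2 2
    }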

20.
The core of current-generation high-performance multiprocessor systems is out-of-order execution processors with aggressive branch prediction. Despite their relatively high branch prediction accuracy, these processors still execute many memory instructions down mispredicted paths. Previous work that focused on uniprocessors showed that these wrong-path (WP) memory references may pollute the caches and increase the amount of cache and memory traffic. On the positive side, however, they may prefetch data into the caches for memory references on the correct path. While computer architects have thoroughly studied the impact of WP effects in uniprocessor systems, there is no comparable work for multiprocessor systems. In this paper, we explore the effects of WP memory references on the memory system behavior of shared-memory multiprocessor (SMP) systems for both broadcast and directory-based cache coherence. Our results show that these WP memory references can increase the amount of cache-to-cache transfers by 32%, invalidations by 8% and 20% for broadcast and directory-based SMPs, respectively, and the number of writebacks by up to 67% for both systems. In addition to the extra coherence traffic, WP memory references also increase the number of cache line state transitions by 21% and 32% for broadcast and directory-based SMPs, respectively. To reduce the performance impact of these WP memory references, we introduce two simple mechanisms, filtering WP blocks that are unlikely to be used and WP-aware cache replacement, that yield speedups of up to 37%.
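The two mechanisms can be sketched with one extra bit per cache line: lines filled by wrong-path references are tagged, a fill-side filter drops WP fills predicted unlikely to be used, and replacement victimizes tagged lines first. Field names and the heuristic below are illustrative assumptions:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Line {
        uint64_t tag = 0;
        bool valid = false;
        bool wrong_path = false;   // filled under a mispredicted branch
        uint32_t lru = 0;          // larger = older
    };

    // WP-aware victim selection for one set: prefer invalid lines, then
    // wrong-path lines, then plain LRU among correct-path lines.
    std::size_t pick_victim(const std::vector<Line>& set) {
        std::size_t victim = 0;
        int best_rank = -1;
        for (std::size_t i = 0; i < set.size(); ++i) {
            int rank = !set[i].valid     ? 3 :      // free slot: best
                       set[i].wrong_path ? 2 : 1;   // WP before CP
            // Within a rank class, prefer the older (higher lru) line.
            if (rank > best_rank ||
                (rank == best_rank && set[i].lru > set[victim].lru)) {
                best_rank = rank;
                victim = i;
            }
        }
        return victim;
    }

    // Fill-side filter: drop a wrong-path fill outright when a
    // (hypothetical) usefulness predictor says the block is unlikely
    // to be touched again on the correct path.
    bool should_allocate_wp_fill(bool likely_to_be_used) {
        return likely_to_be_used;
    }

    int main() {
        std::vector<Line> set{{0x1, true, false, 3}, {0x2, true, true, 1}};
        std::printf("victim: %zu\n", pick_victim(set));  // expect 1 (WP line)
    }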
