Similar Documents
20 similar documents found (search time: 15 ms)
1.
New load and store processing algorithms let memory-latency-tolerant architectures sustain thousands of in-flight instructions without scaling cycle-critical fully-associative load and store queues. These algorithms rely on redoing some stores after fetching cache miss data from memory (to fix memory dependences). Doing so provides better power and area characteristics than constantly enforcing memory dependences among several loads and stores, many of which have unknown addresses.

2.
As the speed gap between processors and memory keeps widening, memory access instructions, especially those that frequently miss in the cache, have become an important performance bottleneck. Because the compiler cannot know the number of cycles a memory access instruction actually takes at run time, it generally assumes these instructions have either the cache-hit or the cache-miss latency, which is inaccurate. We introduce cache profiling to collect runtime cache hit/miss information for memory access instructions and use this information to compute memory access latencies. On an out-of-order machine, the hardware scheduler does a good job of dynamically scheduling the instructions inside the issue window, whereas the compiler has the advantage when scheduling instructions over a longer range. Once a cache miss occurs, it easily fills the reorder buffer and stalls the pipeline. Scheduling miss-prone instructions so that they execute in parallel hides the long cache miss latency and improves program performance. Therefore, for load instructions, we both modify the latency assigned to frequently missing instructions and modify the scheduling policy to increase memory-level parallelism. Experiments show that our scheduling yields an improvement of up to 4.8% for bzip2 and 4% for art, and 1.5% on average overall.
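As a rough illustration of the idea summarized above, the sketch below shows how a compiler pass might pick a scheduling latency for each load from profiled hit/miss counts. The structure, names, threshold, and latency values are assumptions made for illustration, not the paper's actual algorithm or parameters.

    /* Sketch: assign a scheduling latency to each load from profiled miss data.
     * Threshold and latency values are illustrative assumptions. */
    #include <stdio.h>

    #define HIT_LATENCY   3    /* assumed L1 hit latency in cycles    */
    #define MISS_LATENCY  100  /* assumed cache-miss latency in cycles */

    typedef struct {
        const char *name;   /* load instruction id (illustrative) */
        long accesses;      /* profiled dynamic executions        */
        long misses;        /* profiled cache misses              */
    } load_profile;

    /* Loads that miss often are given the miss latency, so the scheduler
     * hoists them earlier and overlaps their latency with other work. */
    static int sched_latency(const load_profile *p)
    {
        double miss_ratio = p->accesses ? (double)p->misses / p->accesses : 0.0;
        return miss_ratio > 0.1 ? MISS_LATENCY : HIT_LATENCY;  /* 10% threshold assumed */
    }

    int main(void)
    {
        load_profile loads[] = {
            { "load_a", 100000, 25000 },   /* frequently missing load */
            { "load_b", 100000,   300 },   /* almost always hits      */
        };
        for (int i = 0; i < 2; i++)
            printf("%s: latency %d cycles\n", loads[i].name, sched_latency(&loads[i]));
        return 0;
    }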

3.
For cache analytical modeling, the stack distance theory is widely utilized to predict LRU-cache behaviors. Typically, the stack distance histogram is collected by profiling memory references. However, the profiled memory references merely reflect instruction fetching and load/store executions, which only represent the memory accesses to first-level (L1) caches. That is why these traces cannot be applied directly to construct stack distance histograms for downstream (L2 and L3) caches. Therefore, this paper proposes a stack distance probability model to extend the stack distance theory to multi-level LRU cache behavior predictions. The inputs of our model are the L1 cache stack distance histograms and the multi-level LRU cache configurations. The outputs are the L2 and L3 cache stack distance histograms, with which the conflict misses in L2 and L3 caches can be quantified quickly and precisely. Fifteen benchmarks chosen from Mobybench 2.0, Mibench I and Mediabench II are used to evaluate the accuracy of our model. Compared to the simulation results from Gem5 in AtomicSimpleCPU mode, the average absolute error in predicting cache misses for the I/D shared L2 cache is less than 5%, while that of estimating the L3 cache misses is less than 7%. Furthermore, in contrast to the time overhead of Gem5 AtomicSimpleCPU simulations, our model speeds up cache miss prediction by about 100x on average.
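For readers unfamiliar with the stack distance metric that this model takes as input, the following sketch collects an LRU stack distance histogram from a toy address trace. The trace, line size, and table bound are illustrative assumptions; the paper's actual contribution (deriving L2/L3 histograms from the L1 histogram) is not reproduced here.

    /* Sketch: LRU stack distance histogram for a cache-line address trace.
     * The trace, line size and histogram bound are illustrative assumptions. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_LINES 1024   /* bound on distinct lines tracked (assumption) */

    int main(void)
    {
        unsigned long trace[] = { 0x00, 0x40, 0x80, 0x00, 0x40, 0xc0, 0x00 };
        int n = sizeof trace / sizeof trace[0];

        unsigned long stack[MAX_LINES];          /* most-recently-used at index 0 */
        int depth = 0;
        long hist[MAX_LINES + 1] = { 0 };        /* hist[MAX_LINES] counts cold misses */

        for (int i = 0; i < n; i++) {
            unsigned long line = trace[i] / 64;  /* 64-byte lines assumed */
            int d = -1;
            for (int j = 0; j < depth; j++)
                if (stack[j] == line) { d = j; break; }
            if (d < 0) {                         /* first reference: infinite distance */
                hist[MAX_LINES]++;
                d = depth < MAX_LINES ? depth++ : MAX_LINES - 1;
            } else {
                hist[d]++;
            }
            memmove(&stack[1], &stack[0], d * sizeof stack[0]);  /* move line to top */
            stack[0] = line;
        }

        for (int d = 0; d < depth; d++)
            if (hist[d]) printf("distance %d: %ld\n", d, hist[d]);
        printf("cold misses: %ld\n", hist[MAX_LINES]);
        return 0;
    }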

4.
The speculated execution of threads in a multithreaded architecture, plus the branch prediction used in each thread execution unit, allows many instructions to be executed speculatively, that is, before it is known whether they are actually needed by the program. In this study, we examine how load instructions executed on what turn out to be incorrectly executed program paths impact memory system performance. We find that incorrect speculation (wrong execution) at the instruction and thread level provides an indirect prefetching effect for the later correct execution paths and threads. By continuing to execute the mispredicted load instructions even after the instruction- or thread-level control speculation is known to be incorrect, the cache misses observed on the correctly executed paths can be reduced by 16 to 73 percent, with an average reduction of 45 percent. However, we also find that these extra loads can increase the amount of memory traffic and can pollute the cache. We introduce the small, fully associative wrong execution cache (WEC) to eliminate the potential pollution that can be caused by the execution of the mispredicted load instructions. Our simulation results show that the WEC can improve the performance of a concurrent multithreaded architecture by up to 18.5 percent on the benchmark programs tested, with an average improvement of 9.7 percent, due to the reductions in the number of cache misses.

5.
An Adaptive Cache Write-Allocation Policy
The effective bandwidth a processor can provide is currently a key factor limiting further improvement of processor performance. Based on an analysis of cache write-miss behavior, this paper proposes a new cache write-miss handling policy that improves bandwidth utilization: the adaptive cache write-allocation policy. The policy collects fully modified cache blocks in the memory-access miss queue, applies a no-write-allocate policy to fully modified blocks, and can adaptively switch back to a write-allocate policy. Compared with traditional cache write-miss handling policies, the adaptive write-allocation policy has a small hardware cost, avoids unnecessary data transfers, reduces cache pollution, and lowers the frequency with which the memory management queue blocks. Results show that with the adaptive cache write-allocation policy, the bandwidth of the STREAM benchmarks improves by 62.6% on average and the IPC of the SPEC CPU2000 programs improves by 5.9% on average.
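A minimal sketch of the adaptive decision described above, assuming a per-word write bitmap is kept with each pending miss: a block the program has fully overwritten can skip the fill from memory. Field names, block size, and the merging interface are illustrative assumptions, not the paper's hardware design.

    /* Sketch: decide write-allocate vs. no-write-allocate for a write miss by
     * tracking which words of the block have been written while the miss sits
     * in the miss queue.  Block/word sizes and names are illustrative. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdint.h>

    #define WORDS_PER_BLOCK 8            /* 64-byte block of 8-byte words (assumed) */

    typedef struct {
        uint64_t block_addr;
        uint8_t  written;                /* bitmap: one bit per word already written */
    } miss_queue_entry;

    /* Merge another store to the same block into the pending miss. */
    static void merge_store(miss_queue_entry *e, int word_index)
    {
        e->written |= (uint8_t)(1u << word_index);
    }

    /* A fully modified block needs no fill from memory: use no-write-allocate
     * and write the block out directly, saving the fill traffic. */
    static bool use_write_allocate(const miss_queue_entry *e)
    {
        return e->written != (uint8_t)((1u << WORDS_PER_BLOCK) - 1);
    }

    int main(void)
    {
        miss_queue_entry e = { 0x1000, 0 };
        for (int w = 0; w < WORDS_PER_BLOCK; w++)
            merge_store(&e, w);          /* the program overwrites the whole block */
        printf("write-allocate needed: %s\n", use_write_allocate(&e) ? "yes" : "no");
        return 0;
    }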

6.
The performance loss resulting from different cache misses varies in modern systems for two reasons: 1) memory access latency is not uniform, and 2) the latency tolerance of processor cor...

7.
In this paper we propose and evaluate a new data-prefetching technique for cache coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40–50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach is when a trap handler completes after data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it uses prefetching with very little instruction overhead, which is a limitation for traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining a high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would not require any additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead less than 0.42%. The emulated prefetch engine removes in general less stall time at a higher instruction overhead.  相似文献   

8.
Trace-driven simulation of out-of-order superscalar processors is far from straightforward. The dynamic nature of out-of-order superscalar processors combined with the static nature of traces can lead to large inaccuracies in the results when the traces contain only a subset of executed instructions for trace reduction. In this paper, we describe and comprehensively evaluate the pairwise dependent cache miss model (PDCM), a framework for fast and accurate trace-driven simulation of out-of-order superscalar processors. The model determines how to treat a cache miss with respect to other cache misses recorded in the trace by dynamically reconstructing the reorder buffer state during simulation and honoring the dependencies between the trace items. Our experimental results demonstrate that a PDCM-based simulator produces highly accurate simulation results (less than 3% error) with fast simulation speeds (62.5× on average) compared with an execution-driven simulator. Moreover, we observed that the proposed simulation method is capable of preserving a processor’s dynamic off-core memory access behavior and accurately predicting the relative performance change when a processor’s low-level memory hierarchy parameters are changed.  相似文献   
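The following sketch illustrates, in a much simplified form, the kind of check such a model performs: two misses recorded in a trace can overlap only if they fall within one reorder-buffer window and the later one does not depend on the earlier one. The structures and parameters are assumptions; this is not the full PDCM model.

    /* Simplified sketch of reorder-buffer-aware trace simulation: a later miss
     * can overlap an earlier one only if it fits in the same ROB window and
     * does not depend on the earlier load.  Parameters are illustrative. */
    #include <stdbool.h>
    #include <stdio.h>

    #define ROB_SIZE 128

    typedef struct {
        long seq;            /* dynamic instruction number of the missing load */
        long depends_on;     /* seq of the producer miss, or -1 if independent */
    } trace_miss;

    static bool overlaps(const trace_miss *earlier, const trace_miss *later)
    {
        bool in_window   = (later->seq - earlier->seq) < ROB_SIZE;
        bool independent = later->depends_on != earlier->seq;
        return in_window && independent;    /* overlapped misses charge latency once */
    }

    int main(void)
    {
        trace_miss a = { 100, -1 };
        trace_miss b = { 150, -1 };   /* independent, within the ROB window */
        trace_miss c = { 160, 100 };  /* pairwise dependent on miss a        */
        printf("a/b overlap: %d\n", overlaps(&a, &b));
        printf("a/c overlap: %d\n", overlaps(&a, &c));
        return 0;
    }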

9.
Due to advances in fiber‐optics and very large scale integration (VLSI) technology, interconnection networks which allow multiple simultaneous broadcasts are becoming feasible. This paper summarizes one such multiprocessor architecture called the Simultaneous Optical Multiprocessor Exchange Bus (SOME‐Bus). It also presents enhancements to the network interface and the cache and directory controllers which support cache block combining, capture and prefetch, and allow complete overlap of processing time with the communication time due to compulsory misses. The paper uses two fundamental matrix algorithms to characterize the impact of each enhancement on performance. Cache miss analysis and results from the execution of these programs on a SOME‐Bus simulator show that block capture and prefetch combined with an effective block replacement policy succeed in significantly reducing the miss rate due to compulsory misses as the cache size increases, while a similar increase of cache size in traditional architectures leaves the miss rate due to compulsory misses unaffected. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

10.
Coherence misses and invalidation traffic limit the performance of bus-based multiprocessors using write-invalidate snooping caches. This paper considers optimizations of a write-invalidate protocol that remove such overhead. While coherence misses are attacked by a hybrid update/invalidate protocol and another technique where update instructions are selectively inserted by a compiler, invalidation traffic is reduced by three optimizations that coalesce ownership acquisition with miss handling: migrate-on-dirty, an adaptive hardware-based scheme, and compiler-controlled insertion of load-exclusive instructions.

The relative effectiveness of these optimizations is evaluated using detailed architectural simulations and a set of four parallel programs. We find that while both of the update-based schemes effectively remove most coherence misses, the hybrid update/invalidate scheme causes lower traffic. By contrast, the compiler-based approach to cutting invalidation traffic is slightly more efficient than the adaptive hardware-based scheme. Moreover, the migrate-on-dirty heuristic is found to have devastating effects on the miss rate.


11.
Traditional set associative caches are seriously prone to conflict misses. We propose an adapted new skewed associative architecture as an attempt to alleviate this problem. It has already been shown that skewed associative caches can reduce the rate of conflict misses by using different hash functions to index different banks. Building on this observation, we propose yet another approach to further reduce the rate of conflict misses, nicknamed YAARC (Yet Another Approach to Reducing Conflicts), that uses different hash functions to index into a single bank. Mathematical modeling and simulation results are used to evaluate the impact of YAARC on the rate of conflict misses. Mathematical analysis shows the superiority of YAARC caches over set and skewed associative caches from the conflict miss point of view. Simulations, using benchmarks from the SPEC CPU2000 benchmark suite that prior researchers have reported as the best candidates for cache performance evaluation, also show nearly 43% conflict miss rate improvement for the skewed associative cache over the set associative cache, and nearly 31% improvement for the YAARC cache over the skewed associative cache. This implies that YAARC caches considerably outperform set and skewed associative caches from the conflict miss point of view. Since producing YAARC caches requires only a negligible amount of hardware overhead, they can be considered a cost-effective approach to minimizing the rate of conflict misses.
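As a rough illustration of indexing with multiple hash functions, the sketch below defines two XOR-based index functions in the skewed-associative spirit, so that addresses colliding under one function usually map to different sets under the other. The hash functions and parameters are illustrative assumptions, not those proposed for YAARC.

    /* Sketch: two XOR-based index functions in the spirit of skewed-associative
     * indexing.  The hash functions themselves are illustrative, not the ones
     * defined in the paper. */
    #include <stdio.h>
    #include <stdint.h>

    #define SET_BITS 7                      /* 128 sets per bank (assumed) */
    #define SET_MASK ((1u << SET_BITS) - 1)

    static unsigned index_f0(uint64_t block_addr)
    {
        return (unsigned)(block_addr & SET_MASK);          /* conventional index */
    }

    static unsigned index_f1(uint64_t block_addr)
    {
        uint64_t low = block_addr & SET_MASK;
        uint64_t mid = (block_addr >> SET_BITS) & SET_MASK;
        return (unsigned)(low ^ mid);                       /* skewed index */
    }

    int main(void)
    {
        /* Two block addresses whose low set bits collide under f0
         * usually map to different sets under f1. */
        uint64_t a = 0x0080, b = 0x0100;
        printf("f0: %u vs %u\n", index_f0(a), index_f0(b));
        printf("f1: %u vs %u\n", index_f1(a), index_f1(b));
        return 0;
    }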

12.
Hsu, P.Y.-T. IEEE Micro, 1994, 14(2): 23-33
Designed to efficiently support large, real-world, floating-point-intensive applications, the TFP (short for Tremendous Floating-Point) microprocessor is a superscalar implementation of the MIPS Technologies architecture. This floating-point, computation-oriented processor uses a superscalar machine organization that dispatches up to four instructions each clock cycle to two floating-point execution units, two memory load/store units, and two integer execution units. Its split-level cache structure reduces cache misses by directing integer data references to a 16-Kbyte on-chip cache, while channeling floating-point data references off chip to a 4-Mbyte cache.

13.
We consider in this paper the effectiveness of a new approach called compiler-controlled updating to reduce coherence-miss penalties in shared-memory multiprocessors. A key part of the method is a compiler algorithm that identifies the last store instruction to a memory block in a flow graph using classic dataflow analysis techniques. Such stores are marked and replaced by update instructions that at run time make the memory copy clean. Whereas this static method shortens the read-miss latency for actively shared blocks, it can cause useless traffic for shared blocks that are effectively private. We therefore complement the static analysis with a simple dynamic heuristic in the cache coherence protocol aiming at classifying blocks as private or shared at run time. We evaluate the performance effects of compiler-controlled updating using six scientific parallel applications compiled by an optimizing compiler that incorporates our static analysis and then running them on a detailed CC-NUMA architectural simulation model. We have found that the compiler algorithm can convert between 83 and 100% of the dirty misses into clean misses. By adding the private/shared heuristic, the update traffic of private memory blocks can be practically eliminated. Overall, the static analysis in combination with the dynamic heuristic is shown to reduce the execution time by as much as 32%.

14.
Research on an Improved Cache-Sensitive B+ Tree
王晨, 陈刚, 董金祥. 《计算机测量与控制》, 2006, 14(11): 1531-1534, 1550
In a main-memory database, the number of processor cache misses has a significant impact on system performance. A cache-sensitive index can reduce the number of cache misses incurred by query operations and thus improve system performance. The traditional design makes the node size equal to the cache block size, on the assumption that this reduces cache misses, but such a design ignores the impact of TLB misses on system performance. We propose a cache-sensitive index, the improved cache-sensitive B+ tree (MCSB+ tree for short), which accounts for the impact of both cache misses and TLB misses on system performance and provides better operation performance than traditional cache-sensitive indexes.
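A minimal sketch of the general design point discussed above: an index node sized to span several cache lines while staying well inside one page, so a node search touches only a few cache lines and at most one TLB entry. The sizes and layout are assumptions for illustration, not the MCSB+ tree's actual parameters.

    /* Sketch: an index node sized to span several cache lines but stay within
     * one page.  Sizes and fan-out are assumptions, not the MCSB+ parameters. */
    #include <stdio.h>
    #include <stdint.h>

    #define CACHE_LINE  64
    #define NODE_LINES  4                      /* node spans 4 cache lines (assumed) */
    #define NODE_BYTES  (CACHE_LINE * NODE_LINES)
    #define MAX_KEYS    ((NODE_BYTES - 16) / (int)(sizeof(uint32_t) + sizeof(void *)))

    typedef struct node {
        uint16_t nkeys;
        uint16_t is_leaf;
        uint8_t  pad[12];                      /* keep keys[] cache-line friendly */
        uint32_t keys[MAX_KEYS];               /* keys packed contiguously for scanning */
        void    *children[MAX_KEYS];           /* child nodes or record pointers */
    } node;

    int main(void)
    {
        printf("node size: %zu bytes, fan-out: %d\n", sizeof(node), MAX_KEYS);
        return 0;
    }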

15.
Conventional dynamically scheduled processors often use a fully associative structure named the load/store queue (LSQ) to implement value communication between loads and older in-flight stores and to detect store-load order violations. But this in-flight forwarding only accounts for about 15% of all store-load communication, which makes the CAM-based micro-architecture the major bottleneck to scaling store-load communication further. This paper presents a new micro-architecture named ASW (short for active store window). It provides a new structure named the speculative active store window to implement more aggressive speculative store-load forwarding than a conventional LSQ. This structure can forward the data of committed stores to executing loads without accessing the L1 data cache, which is referred to as far forwarding in this paper. At the back-end of the pipeline, it uses in-order load re-execution filtered by the tagged SSBF (short for store sequence bloom filter) to verify the correctness of the store-load forwarding. The speculative active store window and tagged store sequence bloom filter are both set-associative structures that are more efficient and scalable than fully associative structures. Experiments show that this simpler and faster design outperforms a conventional load/store queue based design and the NoSQ design on most benchmarks, by 10.22% and 8.71% respectively.
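The sketch below illustrates only the filtering idea: a committing load is re-executed in order only if a store-address Bloom filter reports a possibly conflicting older store. The hash functions and table size are assumptions, not the tagged SSBF design from the paper.

    /* Sketch of the filtering idea: only loads that hit in a store-address
     * Bloom filter pay the cost of in-order re-execution.  Hash functions and
     * table size are illustrative, not the paper's SSBF. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdint.h>

    #define SSBF_ENTRIES 1024

    static uint8_t ssbf[SSBF_ENTRIES];         /* one presence bit per entry */

    static unsigned h0(uint64_t a) { return (unsigned)(a % SSBF_ENTRIES); }
    static unsigned h1(uint64_t a) { return (unsigned)((a >> 6) % SSBF_ENTRIES); }

    static void record_store(uint64_t addr)    /* executed when a store commits */
    {
        ssbf[h0(addr)] = 1;
        ssbf[h1(addr)] = 1;
    }

    static bool load_must_reexecute(uint64_t addr)
    {
        return ssbf[h0(addr)] && ssbf[h1(addr)];   /* may be a false positive */
    }

    int main(void)
    {
        record_store(0x2000);
        printf("load 0x2000 re-executes: %d\n", load_must_reexecute(0x2000));
        printf("load 0x9abc re-executes: %d\n", load_must_reexecute(0x9abc));
        return 0;
    }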

16.
A Prefetching Policy Combined with the State of the Memory-Access Miss Queue
As the gap between memory access speed and processor computation speed becomes ever more pronounced, memory access performance has become the bottleneck for improving computer system performance. Based on an analysis of instruction cache and data cache miss behavior, this paper proposes a prefetching policy that takes the state of the memory-access miss queue into account. The policy preserves the order of instruction and data accesses, which facilitates the extraction of prefetch streams, and separates instruction-stream prefetching from data-stream prefetching to avoid mutual replacement. When choosing the moment to issue a prefetch, it considers not only whether the bus is currently idle but also the state of the memory-access miss queue, reducing the impact on the processor's normal memory requests. A stream-filtering mechanism improves prefetch accuracy and reduces the memory bandwidth demanded by prefetching. Results show that with this policy the processor's average memory access latency is reduced by 30% and the IPC of the SPEC CPU2000 programs improves by 8.3% on average.
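A minimal sketch of the gating decision described above, assuming the prefetcher can observe bus idleness and miss-queue occupancy: prefetches are issued only when headroom is left for demand misses. The threshold and interface are illustrative assumptions, not the paper's design.

    /* Sketch of the prefetch gating decision: issue a prefetch only when the
     * bus is idle and the memory-access miss queue has headroom, so prefetches
     * do not delay demand misses.  The threshold is an assumption. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MISS_QUEUE_SLOTS   8
    #define PREFETCH_HEADROOM  2   /* keep at least 2 slots free for demand misses */

    static bool may_issue_prefetch(bool bus_idle, int miss_queue_occupancy)
    {
        bool queue_has_room =
            miss_queue_occupancy <= MISS_QUEUE_SLOTS - PREFETCH_HEADROOM;
        return bus_idle && queue_has_room;
    }

    int main(void)
    {
        printf("%d\n", may_issue_prefetch(true, 3));   /* idle bus, light queue: yes */
        printf("%d\n", may_issue_prefetch(true, 7));   /* queue nearly full: no      */
        printf("%d\n", may_issue_prefetch(false, 0));  /* bus busy: no               */
        return 0;
    }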

17.
Write-invalidate protocols suffer from memory-access penalties due to coherence misses. While write-update or hybrid update/invalidate protocols can reduce coherence misses, the update traffic can increase memory-system contention. We show in this paper that update-based cache protocols can perform significantly better than write-invalidate protocols by incorporating a write cache in each processing node. Because it is legal to delay the propagation of modifications of a block until the next synchronization under relaxed memory consistency models, a write cache can significantly reduce traffic by exploiting locality in write accesses. By concentrating on a cache-coherent NUMA architecture, we study the implementation aspects of augmenting a write-invalidate, a write-update and two hybrid update/invalidate protocols with write caches. Through detailed architectural simulations using five benchmark programs, we find that write caches, with only a few blocks each, help write-invalidate protocols to cut the false-sharing miss rate and hybrid update/invalidate protocols to keep other copies, including the memory copy, clean at an acceptable write traffic level. Overall, the memory-access penalty associated with coherence misses is drastically reduced.  相似文献   

18.
In this paper, a new methodology for speeding up matrix-matrix multiplication using the Single Instruction Multiple Data (SIMD) unit, on one or more cores sharing a cache, is presented. This methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.08 up to 3.5) by decreasing the number of instructions (load/store and arithmetic) and the data cache accesses and misses in the memory hierarchy. This is achieved by fully exploiting the software characteristics (e.g. data reuse) and hardware parameters (e.g. data cache sizes and associativities) as one problem and not separately, giving high-quality solutions and a smaller search space.
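The paper's exact methodology is not reproduced here; as a generic illustration of combining cache blocking with a SIMD-friendly inner loop, the sketch below tiles a matrix multiply so that the innermost loop is unit-stride and easy for a compiler to vectorize. The tile size is an assumption, not a value derived from the cache parameters as the paper does.

    /* Minimal cache-blocked matrix multiply with a unit-stride, vectorizable
     * innermost loop.  The tile size is an assumption. */
    #include <stdio.h>

    #define N  256
    #define TB 64                        /* tile chosen to fit the shared cache (assumed) */

    static float A[N][N], B[N][N], C[N][N];

    static void blocked_matmul(void)
    {
        for (int ii = 0; ii < N; ii += TB)
            for (int kk = 0; kk < N; kk += TB)
                for (int jj = 0; jj < N; jj += TB)
                    for (int i = ii; i < ii + TB; i++)
                        for (int k = kk; k < kk + TB; k++) {
                            float a = A[i][k];               /* reused across the j loop */
                            for (int j = jj; j < jj + TB; j++)
                                C[i][j] += a * B[k][j];      /* unit-stride, vectorizable */
                        }
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0f; B[i][j] = 2.0f; }
        blocked_matmul();
        printf("C[0][0] = %.1f\n", C[0][0]);   /* expect N * 1.0 * 2.0 = 512.0 */
        return 0;
    }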

19.
It is observed that the limited memory space of direct-mapped caches is not used in a balanced way and therefore incurs extra conflict misses. We propose a novel cache organization, the balanced cache, which balances accesses to cache sets at the granularity of cache subarrays. The key technique of the balanced cache is a programmable subarray decoder through which the mapping of memory reference addresses to cache subarrays can be optimized and hence the conflict misses of direct-mapped caches can be resolved. The experimental results show that the miss rate of the balanced cache is lower than that of same-sized two-way set-associative caches on average and can be as low as that of same-sized four-way set-associative caches for particular applications. Compared with previous techniques, the balanced cache requires only one cycle to access all cache hits and has the same access time as direct-mapped caches.
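A rough sketch of the remapping idea, assuming a small programmable table sits in front of the subarray decoder: the table translates set-index bits to a subarray number, so hot indices can be spread over under-used subarrays. The table organization and sizes are illustrative assumptions, not the paper's decoder design.

    /* Sketch: a programmable mapping from set-index bits to a cache subarray.
     * Table and sizes are illustrative, not the paper's decoder. */
    #include <stdio.h>
    #include <stdint.h>

    #define SUBARRAYS       8
    #define SETS_PER_ARRAY  64
    #define INDEX_BITS      9          /* 512 sets total */

    static uint8_t subarray_map[1u << INDEX_BITS];   /* programmable decoder state */

    static void program_decoder(void)
    {
        /* Default: upper index bits select the subarray, as in a plain cache.
         * A balancing policy could rewrite entries of this table at run time. */
        for (unsigned idx = 0; idx < (1u << INDEX_BITS); idx++)
            subarray_map[idx] = (uint8_t)(idx / SETS_PER_ARRAY);
    }

    static void locate(uint64_t block_addr, unsigned *subarray, unsigned *set)
    {
        unsigned idx = (unsigned)(block_addr & ((1u << INDEX_BITS) - 1));
        *subarray = subarray_map[idx];               /* programmable step */
        *set      = idx % SETS_PER_ARRAY;
    }

    int main(void)
    {
        program_decoder();
        subarray_map[5] = 7;           /* rebalance: move a hot index elsewhere */
        unsigned sa, set;
        locate(5, &sa, &set);
        printf("index 5 -> subarray %u, set %u\n", sa, set);
        return 0;
    }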

20.
The increasing gap in performance between processors and main memory has made effective instruction prefetching techniques more important than ever. A major deficiency of existing prefetching methods is that most of them require an extra port to the I-cache. A recent study by Rivers et al. [19] shows that this factor alone explains why most modern microprocessors do not use such hardware-based I-cache prefetch schemes. The contribution of this paper is two-fold. First, we present a method that does not require an extra port to the I-cache. Second, the performance improvement of our method is greater than that of the best competing method, BHGP [23], even disregarding the improvement from not having an extra port. The three key features of our method that prevent the above deficiencies are as follows. First, late prefetching is prevented by correlating misses to dynamically preceding instructions. For example, if the I-cache miss latency is 12 cycles, then the instruction that was fetched 12 cycles prior to the miss is used as the prefetch trigger. Second, the miss history table is kept to a reasonable size by grouping contiguous cache misses together and associating them with one preceding instruction, and therefore one table entry. Third, the extra I-cache port is avoided through efficient prefetch filtering methods. Experiments show that for our benchmarks, chosen for their poor I-cache performance, an average improvement of 9.2% in runtime is achieved versus the BHGP method [23], while the hardware cost is also reduced. The improvement will be greater if the runtime impact of avoiding an extra port is considered. When compared to the original machine without prefetching, our method improves performance by about 35% for our benchmarks.
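As a concrete illustration of the trigger correlation described above (using the 12-cycle example from the abstract), the sketch below records, on an I-cache miss, the PC fetched 12 fetches earlier as the trigger for that miss line. The table organization and the history mechanism are assumptions made for illustration, not the paper's hardware.

    /* Sketch of the trigger idea: when an I-cache miss occurs, the instruction
     * fetched miss-latency cycles earlier becomes the trigger; the next time
     * that trigger PC is fetched, the associated line can be prefetched.
     * Table size and history depth are illustrative assumptions. */
    #include <stdio.h>
    #include <stdint.h>

    #define MHT_ENTRIES   256
    #define MISS_LATENCY  12                  /* cycles; also the fetch-history depth */

    typedef struct { uint64_t trigger_pc; uint64_t miss_line; } mht_entry;

    static mht_entry mht[MHT_ENTRIES];
    static uint64_t fetch_history[MISS_LATENCY];   /* PCs of the last 12 fetches */
    static unsigned head;

    static void on_fetch(uint64_t pc)
    {
        fetch_history[head] = pc;
        head = (head + 1) % MISS_LATENCY;
        /* A real prefetcher would also look up pc in the table here and, on a
         * match, issue a prefetch for the recorded miss_line. */
    }

    static void on_icache_miss(uint64_t miss_line)
    {
        uint64_t trigger = fetch_history[head];    /* oldest entry = 12 fetches ago */
        mht[trigger % MHT_ENTRIES] = (mht_entry){ trigger, miss_line };
    }

    int main(void)
    {
        for (uint64_t pc = 0x400000; pc < 0x400000 + 16 * 4; pc += 4)
            on_fetch(pc);
        on_icache_miss(0x401000 / 64);
        printf("trigger pc recorded: 0x%llx\n",
               (unsigned long long)mht[fetch_history[head] % MHT_ENTRIES].trigger_pc);
        return 0;
    }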
