期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Taxonomy of Data Prefetching for Multicore Processors

Surendra Byna Member IEEE Yong Chen 《计算机科学技术学报》2009,24(3):405-417

Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory.With hardware and/or software support,data prefetching brings data closer to a processor before it is actually needed.Many prefetching techniques have been developed for single-core processors.Recent developments in processor technology have brought multicore processors into mainstream. While some of the single-core prefetching t... 相似文献

2.

结合访存失效队列状态的预取策略 总被引：1，自引：0，他引：1

郇丹丹李祖松胡伟武刘志勇《计算机学报》2007,30(7):1104-1114

随着存储系统的访问速度与处理器的运算速度的差距越来越显著,访存性能已成为提高计算机系统性能的瓶颈.通过对指令Cache和数据Cache失效行为的分析,提出一种预取策略--结合访存失效队列状态的预取策略.该预取策略保持了指令和数据访问的次序,有利于预取流的提取.并将指令流和数据流的预取相分离,避免相互替换.在预取发起时机的选择上,不但考虑当前总线是否空闲,而且结合访存失效队列的状态,减小对处理器正常访存请求的影响.通过流过滤机制提高预取准确性,降低预取对访存带宽的需求.结果表明,采用结合访存失效队列状态的预取策略,处理器的平均访存延时减少30%,SPEC CPU2000程序的IPC值平均提高8.3%. 相似文献

3.

A NUCA Substrate for Flexible CMP Cache Sharing 总被引：1，自引：0，他引：1

Jaehyuk Huh J. Changkyu Kim C. Shafi H. Lixin Zhang L. Burger D. Keckler S.W. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1028-1040

We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses. 相似文献

4.

Exploiting locality and tolerating remote memory access latency using thread migration

Stephen Jenks Jean-Luc Gaudiot 《International journal of parallel programming》1997,25(4):281-304

Much research has focused on reducing and/or tolerating remote memory access latencies on distributed-memory parallel computers. Caching remote data is intended to reduce average access latency by handling as many remote memory accesses as possible using local copies of the data in the cache. Data-flow and multithreaded approaches help programs tolerate the latency of remote memory accesses by allowing processors to do other work while remote operations take place. The thread migration technique described here is a multithreaded architecture where threads migrate to remote processors that contain data they need. By exploiting access locality, the threads often use several data items from that processor before migrating to other processors for more data. Because the threads migrate in search of data, the approach is called Nomadic Threads. A prototype runtime system has been implemented on the CM5 and is portable to other distributed memory parallel computers. 相似文献

5.

A Novel Memory Structure for Embedded Systems: Flexible Sequential and Random Access Memory

下载免费PDF全文

Ying Chen Karthik Ranganathan Vasudev V. Pai DavidJ. Lilja and Kia Bazargan 《计算机科学技术学报》2005,20(5):596-606

The on-chip memory performance of embedded systems directly affects the system designers＇ decision about how to allocate expensive silicon area. A novel memory architecture, flexible sequential and random access memory （FSRAM）, is investigated for embedded systems. To realize sequential accesses, small “links”are added to each row in the RAM array to point to the next row to be prefetched. The potential cache pollution is ameliorated by a small sequential access buyer （SAB）. To evaluate the architecture-level performance of FSRAM, we ran the Mediabench benchmark programs on a modified version of the SimpleScalar simulator. Our results show that the FSRAM improves the performance of a baseline processor with a 16KB data cache up to 55%, with an average of 9%; furthermore, the FSRAM reduces 53.1% of the data cache miss count on average due to its prefetching effect. We also designed RTL and SPICE models of the FSRAM, which show that the FSRAM significantly improves memory access time, while reducing power consumption, with negligible area overhead. 相似文献

6.

Retention Benefit Based Intelligent Cache Replacement

下载免费PDF全文

李凌达陆俊林程旭《计算机科学技术学报》2014,29(6):947-961

The performance loss resulting from different cache misses is variable in modern systems for two reasons: 1) memory access latency is not uniform, and 2) the latency toleration ability of processor cor... 相似文献

7.

乱序执行机器上的load指令调度

周谦冯晓兵张兆庆《计算机科学》2007,34(11):298-300

随着处理器和存储器速度差距的不断拉大，访存指令尤其是频繁cache miss的指令成为影响性能的重要瓶颈。编译器由于无法得知访存指令动态执行的拍数，一般假定这些指令的延迟为cache命中或者cache miss的延迟，所以并不准确。我们引入cache profiling技术来收集访存指令运行时的cache miss或者命中的信息，利用这些信息来计算访存的延迟。乱序机器上硬件的指令调度对于发射窗口内的指令能进行很好的动态调度，编译器则对更长的范围内的指令调度更有优势。在reorder buffer中cache miss一旦发生，容易引起reorder buffer满，导致流水线阻塞。调度容易cache miss的指令。使其并行执行，从而隐藏cache miss的长延迟，就可以提高程序性能。因此，我们针对load指令，一方面修改频繁miss的指令的延迟，一方面修改调度策略，提高存储级并行度。实验证明，我们的调度对于bzip2有高达4．8％的提升，art有4％的提升，整体平均提高1．5％。相似文献

8.

Performance of One''s Complement Caches

Qing Yang Sridar Adina T. Sun 《Journal of Parallel and Distributed Computing》1998,48(2):143

On-chip caches to reduce average memory access latency are commonplace in today's commercial microprocessors. These on-chip caches generally have low associativity and small cache sizes. Cache line conflicts are the main source of cache misses, which are critical for overall system performance. This paper introduces an innovative design for on-chip data caches of microprocessors, called one's complement cache. While binary complement numbers have been successfully used in designing arithmetic units, to the best of our knowledge, no one has ever considered using such complement numbers in cache memory designs. This paper will show that such complement numbers help greatly in reducing cache misses in a data cache, thereby improving data cache performance. By parallel computation of cache addresses and memory addresses, the new design does not increase the critical hit time of cache accesses. Cache misses caused by line interference are reduced by evenly distributing data items referenced by program loops across all sets in a cache. Even distribution of data in the cache is achieved by making the number of sets in the cache a prime or an odd number, so that the chance of related data being mapped to a same set is small. Trace-driven simulations are used to evaluate the performance of the new design. Performance results on benchmarks show that the new design improves cache performance significantly with negligible additional hardware cost. 相似文献

9.

基于镜像层关联的Docker注册表缓存预取策略

张晨邓玉辉《计算机科学与探索》2021,15(2):249-260

随着容器技术的广泛普及,大型Docker公共注册表使用对象存储服务来解决镜像数量剧增的问题,但这种松耦合的注册表设计导致较高的延迟开销。为了增强注册表性能,提出一种基于镜像层关联的Docker注册表缓存预取策略LCPA,当注册表服务器缓存未命中时,通过分析镜像元数据文件构建镜像的存储结构,由关联度模型对存储结构计算得到相关镜像层集合,并从后端存储中主动式预取回注册表中以提高缓存命中率。经真实工作负载下收集的Docker数据集测试,实验结果表明LCPA策略比LRU、LIRS和GDFS等缓存算法提高12%~29%的平均缓存命中率,拉取镜像的平均延迟节省率提高了21.1%~49.4%。与现有的LPA预取算法相比,拉取镜像的平均缓存命中率提升25.6%。仿真实验表明该策略可以有效地利用缓存空间,大幅提升注册表的缓存命中率,并降低镜像拉取的延迟开销。相似文献

10.

Server-Based Data Push Architecture for Multi-Processor Environments

下载免费PDF全文

Xian-He Sun Surendra Byna and Yong Chen 《计算机科学技术学报》2007,22(5):641-652

Data access delay is a major bottleneck in utilizing current high-end computing(HEC)machines.Prefetch- ing,where data is fetched before CPU demands for it,has been considered as an effective solution to masking data access delay.However,current client-initiated prefetching strategies,where a computing processor initiates prefetching instructions,have many limitations.They do not work well for applications with complex,non-contiguous data access patterns.While technology advances continue to increase the gap between computing and data access performance, trading computing power for reducing data access delay has become a natural choice.In this paper,we present a server- based data-push approach and discuss its associated implementation mechanisms.In the server-push architecture,a dedicated server called Data Push Server(DPS)initiates and proactively pushes data closer to the client in time.Issues, such as what data to fetch,when to fetch,and how to push are studied.The SimpleScalar simulator is modified with a dedicated prefetching engine that pushes data for another processor to test DPS based prefetching.Simulation results show that L1 Cache miss rate can be reduced by up to 97%(71% on average)over a superscalar processor for SPEC CPU2000 benchmarks that have high cache miss rates. 相似文献

11.

Model-based cache-aware dispatching of object-oriented software for multicore systems

Tolga Ovatman Feza Buzluca 《Journal of Systems and Software》2013

In recent years, processor technology has evolved towards multicore processors, which include multiple processing units (cores) in a single package. Those cores, having their own private caches, often share a higher level cache memory dedicated to each processor die. This multi-level cache hierarchy in multicore processors raises the importance of cache utilization problem. Assigning parallel-running software components with common data to processor cores that do not share a common cache increases the number of cache misses. In this paper we present a novel approach that uses model-based information to guide the OS scheduler in assigning appropriate core affinities to software objects at run-time. We build graph models of software and cache hierarchies of processors and devise a graph matcher algorithm that provides mapping between these two graphs. Using this mapping we obtain candidate core sets that each software object can be affiliated with at run-time. These affiliations are determined based on the idea that software components that have the potential to share common data at run-time should run on cores that share a common cache. We also develop an object dispatcher algorithm that keeps track of object affiliations at run-time and dispatches objects by using the information from the compile-time graph matcher. We apply our approach on design pattern implementations and two different application program running on servers using CFS scheduling. Our results show that cache-aware dispatching based on information obtained from software model, decreases number of cache misses significantly and improves CFS’ scheduling performance. 相似文献

12.

高带宽远程内存结构中的预取研究 总被引：1，自引：0，他引：1

许建卫陈明宇包云岗《计算机科学》2005,32(8):15-20

高速电路和光互联技术的发展极大地提高了网络的速度与带宽。因而,突破高性能计算机CPU与内存紧耦合的传统结构成为可能,CPU与内存的耦合不再受距离的限制,这必将引起体系结构的变革。文[1]提出DSAG结构——CPU与内存在空间上分离,每个CPU节点上仅留少量内存．将海量内存放在远程统一管理作为内存服务器,CPU节点和内存服务器之间通过高速网络互连。这种新的体系结构带来了更好的共享性和可扩展性,但同时也对我们解决CPU和内存之间的不平衡性问题带来了挑战。为了降低DSAG这种远程内存结构增加的访存时延,我们考虑到CPU正常访存没有充分利用网络的高带宽,因此可以利用剩余的网络带宽来进行远程内存数据的预取。本论文在应用程序执行时记录本地（相对于远程内存）不命中的地址信息,以页对齐分析其中存在的页框流（Page Frame Stream）的统计特征,并提出可基于页框流的预取机制可降低访存延迟、提升系统性能的观点。最后我们采用模拟的方法验证了观点的可行性与正确性,进一步提出了三种预取策略,比较并分析影响预取效果的因素。相似文献

13.

A consistency-free memory architecture for sort-last parallel rendering processors

《Journal of Systems Architecture》2007,53(5-6):272-284

Current rendering processors are aiming to process triangles as fast as possible and they have the tendency of equipping with multiple rasterizers to be capable of handling a number of triangles in parallel for increasing polygon rendering performance. However, those parallel architectures may have the consistency problem when more than one rasterizer try to access the data at the same address. This paper proposes a consistency-free memory architecture for sort-last parallel rendering processors, in which a consistency-free pixel cache architecture is devised and effectively associated with three different memory systems consisting of a single frame buffer, a memory interface unit, and consistency-test units. Furthermore, the proposed architecture can reduce the latency caused by pixel cache misses because the rasterizer does not wait until cache miss handling is completed when the pixel cache miss occurs. The experimental results show that the proposed architecture can achieve almost linear speedup upto four rasterizers with a single frame buffer. 相似文献

14.

Using the First-Level Caches as Filters to Reduce the Pollution Caused by Speculative Memory References

Onur?Mutlu Email author Hyesoon?Kim David?N.?Armstrong Yale?N.?Patt 《International journal of parallel programming》2005,33(5):529-559

High-performance processors employ aggressive branch prediction and prefetching techniques to increase performance. Speculative memory references caused by these techniques sometimes bring data into the caches that are not needed by correct execution. This paper proposes the use of the first-level caches as filters that predict the usefulness of speculative memory references. With the proposed technique, speculative memory references bring data only into the first-level caches rather than all levels in the cache hierarchy. The processor monitors the use of the cache blocks in the first-level caches and decides which blocks to keep in the cache hierarchy based on the usefulness of cache blocks. It is shown that a simple implementation of this technique usually outperforms inclusive and exclusive baseline cache hierarchies commonly used by today’s processors and results in IPC performance improvements of up to 10% on the SPEC CPU2000 integer benchmarks. 相似文献

15.

An architecture for high-performance scalable shared-memory multiprocessors exploiting on-chip integration

Acacio M.E. Gonzalez J. Garcia J.M. Duato J. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(8):755-768

Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware, and the network interface/router. In this paper, we exploit such integration scale, presenting a novel node architecture aimed at reducing the long L2 miss latencies and the memory overhead of using directories that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture, as well as it adds a small shared data cache to each of the nodes of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip in every node, which enhances performance by saving accesses to the slower main memory. Scalability is guaranteed by having the second and third-level directories out of the processor chip and using compressed data structures. A taxonomy of the L2 misses, according to the actions performed by the directory to satisfy them, is also presented. Using execution-driven simulations, we show that significant latency reductions can be obtained by using the proposed node architecture, which translates into reductions of more than 30 percent in several cases in the application execution time. 相似文献

16.

Execution History Guided Instruction Prefetching

Zhang Yi Haga Steve Barua Rajeev 《The Journal of supercomputing》2004,27(2):129-147

The increasing gap in performance between processors and main memory has made effective instructions prefetching techniques more important than ever. A major deficiency of existing prefetching methods is that most of them require an extra port to I-cache. A recent study by Rivers et al. [19] shows that this factor alone explains why most modern microprocessors do not use such hardware-based I-cache prefetch schemes. The contribution of this paper is two-fold. First, we present a method that does not require an extra port to I-cache. Second, the performance improvement for our method is greater than the best competing method BHGP [23] even disregarding the improvement from not having an extra port. The three key features of our method that prevent the above deficiencies are as follows. First, late prefetching is prevented by correlating misses to dynamically preceding instructions. For example, if the I-cache miss latency is 12 cycles, then the instruction that was fetched 12 cycles prior to the miss is used as the prefetch trigger. Second, the miss history table is kept to a reasonable size by grouping contiguous cache misses together and associated them with one preceding instruction, and therefore, one table entry. Third, the extra I-cache port is avoided through efficient prefetch filtering methods. Experiments show that for our benchmarks, chosen for their poor I-cache performance, an average improvement of 9.2% in runtime is achieved versus the BHGP methods [23], while the hardware cost is also reduced. The improvement will be greater if the runtime impact of avoiding an extra port is considered. When compared to the original machine without prefetching, our method improves performance by about 35% for our benchmarks. 相似文献

17.

Moving address translation closer to memory in distributed shared-memory multiprocessors

Qiu X. Dubois M. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(7):612-623

To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (translation lookaside buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of distributed shared memory (DSM) multiprocessors, including CC-NUMAs (cache-coherent non-uniform memory access architectures) and COMAs (cache only memory access architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In the context of COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can dynamically migrate and replicate freely among nodes. As the address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect. We also observe that the TLB is very effective when it is merged with the shared-memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Even if the effectiveness of the TLB merged with the shared memory is very high, we also show that the TLB can be removed in a system with address translation done in memory because the frequency of translations is very low. 相似文献

18.

Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems

Lee Jaejin Jung Changhee Lim Daeseob Solihin Yan 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(9):1309-1324

This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have an advantage in that resources such as processor and L1 cache resources are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over the hardware L2 prefetching for the subset of applications with high L2 cache misses per cycle. 相似文献

19.

Design of write merging and read prefetching buffer in DRAM controller for embedded processor

Chen Zhao Kuizhi Mei Nanning ZhengAuthor Vitae 《Microprocessors and Microsystems》2014

Write merging and read prefetching are effective methods for improving processor performance, and they are mainly used in processors for desktop or server. As embedded system requires more powerful microprocessor, how to improve the performance of embedded processor is worthy of concern. This paper presents the architecture of write merging and read prefetching buffer in DRAM controller for embedded processor. The evaluation model is constructed, and the result demonstrates that the proposed method can reduce cache miss penalty dramatically. Additionally, the design of DRAM controller with write merging and read prefetching buffer is implemented and verified on FPGA platform, it can reduce CPI by 19.6% on average. Moreover, the RTL module of presented design is synthesized by Design Compiler, and synthesis result shows that hardware cost of proposed architecture is relatively small compared to performance amelioration. 相似文献

20.

Integrating Fine-Grained Message Passing in Cache Coherent Shared Memory Multiprocessors

David K. Poulsen Pen-Chung Yew 《Journal of Parallel and Distributed Computing》1996,33(2):172

This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache coherent, shared memory multiprocessors. Data prefetching is accomplished by using a multiprocessor software pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronization, and is applied to communication-related accesses between successive parallel loops. Prefetching and forwarding are each shown to be more effective for certain types of architectural and application characteristics. Given this result, a new hybrid prefetching and forwarding approach is proposed and evaluated that allows the relative amounts of prefetching and forwarding used to be adapted to these characteristics. When compared to prefetching or forwarding alone, the new hybrid scheme is shown to increase performance stability over varying application characteristics, to reduce processor instruction overheads, cache miss ratios, and memory system bandwidth requirements, and to reduce performance sensitivity to architectural parameters such as cache size. Algorithms for data prefetching, data forwarding, and hybrid prefetching and forwarding are described. These algorithms are applied by using a parallelizing compiler and are evaluated via execution-driven simulations of large, optimized, numerical application codes with loop-level and vector parallelism. 相似文献