Similar Documents
1.
Prefetching is one of several techniques for hiding and tolerating the large memory latencies of scalable multiprocessors. In this paper, we present a performance model for analyzing the limits and effectiveness of data prefetching. The model incorporates the effects of program behavior, network characteristics, cache coherency protocols, and the memory consistency model. Our results indicate that, as long as there is enough extra network bandwidth, prefetching is very effective in hiding large latencies. In machines with caches large enough to hold the program working set, the intra- and inter-node cache interference is too low to have any significant impact on prefetching performance. Furthermore, we show that the effective prefetch distance plays a vital role and adapts extremely well to changes in cache miss rates and remote latencies, thus allowing prefetches to be more effective in hiding latency. An adaptive algorithm is provided to optimize the prefetch distance, based on the dynamic behavior of the application, the interconnection network, and the distributed caches and memories. This optimization of the prefetch distance constitutes a significant advantage of prefetching over other latency-tolerating techniques, such as multithreading. We show that the prefetch distance can be chosen as a constant, be program-dependent, or be decided by performance information; the optimal distance can be determined adaptively using both compile-time and runtime conditions. Our results are therefore useful not only to compiler writers, but also for the development of runtime support systems in multiprocessors. In large-scale systems, in which network traffic dominates performance, the ultimate goal is to match program behavior with machine behavior.
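As a rough illustration of the prefetch-distance idea discussed above, the Python sketch below (not from the paper) uses the common rule of thumb that prefetches should be issued about ceil(latency / work-per-iteration) iterations ahead, and adapts the distance from a smoothed estimate of observed miss latency; the class and parameter names are illustrative assumptions.

import math

def prefetch_distance(miss_latency_cycles, cycles_per_iteration):
    # Rule of thumb: prefetch far enough ahead that the data arrives before use.
    return max(1, math.ceil(miss_latency_cycles / cycles_per_iteration))

class AdaptiveDistance:
    """Track a smoothed estimate of remote-miss latency and recompute the
    prefetch distance, so it follows changes in network load and miss rate."""
    def __init__(self, cycles_per_iteration, alpha=0.25):
        self.cycles_per_iteration = cycles_per_iteration
        self.alpha = alpha                     # weight of the newest latency sample
        self.latency_estimate = 0.0

    def observe_miss(self, measured_latency_cycles):
        if self.latency_estimate == 0.0:
            self.latency_estimate = float(measured_latency_cycles)
        else:
            self.latency_estimate = (self.alpha * measured_latency_cycles +
                                     (1 - self.alpha) * self.latency_estimate)

    def distance(self):
        return prefetch_distance(self.latency_estimate or 1.0,
                                 self.cycles_per_iteration)

# A loop body of ~60 cycles under a remote miss latency of ~900 cycles:
adaptive = AdaptiveDistance(cycles_per_iteration=60)
adaptive.observe_miss(900)
print(adaptive.distance())                     # 15 iterations ahead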

2.
Routing table lookup is an important operation in packet forwarding, and it has a significant influence on the overall performance of network processors. Routing tables are usually stored in main memory, which has a large access time; consequently, small fast cache memories are used to improve access time. In this paper, we propose a novel routing table compaction scheme to reduce the number of entries in the routing table. The proposed scheme has three versions and takes advantage of ternary content addressable memory (TCAM) features: two or more routing entries are compacted into one using don't-care elements in the TCAM. A small compacted routing table helps to increase the cache hit rate, which in turn provides fast address lookups. We have evaluated this compaction scheme through extensive simulations involving IPv4 and IPv6 routing tables and routing traces. The original routing tables are compacted by over 60% of their original sizes, and the average cache hit rate improves by up to 15% over the original tables. We have also analyzed port errors caused by caching, and developed a new sampling technique to alleviate this problem. The simulations show that sampling is an effective scheme for port error control without degrading cache performance.
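To make the don't-care idea concrete, here is a minimal Python sketch (not the paper's algorithm, which has three versions) of one elementary compaction step: two same-length ternary entries that map to the same output port and differ in exactly one defined bit collapse into a single TCAM entry with a '*' in that position.

def merge_if_compactable(entry_a, entry_b):
    # Each entry is (bits, port), where bits is a string over '0', '1', '*'.
    (bits_a, port_a), (bits_b, port_b) = entry_a, entry_b
    if port_a != port_b or len(bits_a) != len(bits_b):
        return None
    diff = [i for i, (x, y) in enumerate(zip(bits_a, bits_b)) if x != y]
    if len(diff) != 1:
        return None                    # only single-bit differences are mergeable here
    i = diff[0]
    return (bits_a[:i] + '*' + bits_a[i + 1:], port_a)

# Two 4-bit prefixes with the same next hop become one entry:
print(merge_if_compactable(("1010", 3), ("1011", 3)))   # ('101*', 3)
print(merge_if_compactable(("1010", 3), ("1100", 3)))   # None (two bits differ)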

3.
In this paper we propose and evaluate a new data-prefetching technique for cache-coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine, which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40–50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach occurs when a trap handler completes after the data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it prefetches with very little instruction overhead, which is a limitation of traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would not require any additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead of less than 0.42%. The emulated prefetch engine generally removes less stall time at a higher instruction overhead.

4.
A prefetch method that enables stride prefetching at the secondary cache without accessing the processor's internal resources is developed and evaluated. It uses a data-range table that enables it to detect usable strides and memory access streams that fall into the same data range. Using program-driven simulation of scientific applications in the context of shared-memory multiprocessors, it is shown that the proposed method can reduce load stall times by an amount comparable to a conventional stride-driven prefetching method that requires access to the processor's instruction address register.
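A conventional stride prefetcher keys its state on the instruction address; the method above instead groups accesses by the data range they fall into. The Python sketch below illustrates that grouping idea in a simplified form; the range size, the stride-confirmation rule, and the prefetch distance are illustrative assumptions, not values from the paper.

class RangeStrideDetector:
    RANGE_BYTES = 1 << 16        # accesses within the same 64 KiB range form one stream

    def __init__(self, prefetch_distance=4):
        self.prefetch_distance = prefetch_distance
        self.streams = {}        # range id -> (last address, last stride)

    def access(self, addr):
        # Record an access; return an address to prefetch, or None.
        rid = addr // self.RANGE_BYTES
        last_addr, last_stride = self.streams.get(rid, (None, 0))
        prefetch = None
        if last_addr is not None:
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                # The stride repeated, so the stream looks regular: run ahead of it.
                prefetch = addr + stride * self.prefetch_distance
            last_stride = stride
        self.streams[rid] = (addr, last_stride)
        return prefetch

d = RangeStrideDetector()
for a in (0x1000, 0x1040, 0x1080, 0x10C0):
    p = d.access(a)
    print(hex(a), "->", hex(p) if p is not None else None)
# The third and fourth accesses trigger prefetches four strides ahead.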

5.
A Moderately Greedy Integrated Caching and Prefetching Algorithm for Parallel File Systems
卢凯  金士尧  卢锡城 《计算机学报》1999,22(11):1172-1177
Caching and prefetching are two effective techniques for reducing access latency in traditional file systems. Under the I/O access patterns of parallel scientific computing applications, however, simple caching and prefetching can no longer deliver high cache hit rates. Based on an analysis of these I/O patterns, this paper proposes a moderately greedy integrated caching and prefetching algorithm (PGI). The algorithm exploits the characteristics of the parallel file system environment and adopts a moderately greedy dynamic sliding-window technique, which effectively eliminates prefetch thrashing and reduces system processing overhead, while also adopting an integrated caching-and-prefetching …

6.
An Analysis of the Memory Hierarchy of the Pentium 4 Processor
吴金  齐欢 《微机发展》2004,14(7):47-48,51
The efficiency of a processor's memory system plays a very important role in its overall performance. This paper describes the memory architecture of the Pentium 4 processor, which includes the L1 data cache, the L2 cache, and the trace cache; the function of each component; and the prefetching mechanisms adopted to raise hit rates and reduce access time. The Pentium 4 mainly relies on a hierarchical memory design, a large L2 cache, and prefetching in the trace cache to raise cache hit rates and reduce miss penalties, thereby shortening the processor's access time and ultimately improving overall processor performance.

7.
In location-dependent information services, data access involves complex spatial computation, which leads to long access latencies; data prefetching can significantly speed up data access and shorten access time. Prefetching strategies based on location-dependent data (LDD), such as DDP, take data distance into account but ignore access probability, update frequency, and data size. To address these problems, this paper proposes a value-based data prefetching (CDP) strategy in which the important prefetching factors, such as access probability, update frequency, data item size, data distance, and valid scope, are all incorporated into a value function, and the data to prefetch is selected according to the value of this function. Experimental comparison shows that CDP improves the cache hit rate more effectively than DDP.
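The abstract lists the factors that enter the value function but not its exact form, so the Python sketch below only illustrates a plausible shape: data is worth prefetching when it is likely to be accessed, rarely updated, small, nearby, and valid over a wide area. All names, weights, and the budget rule are illustrative assumptions, not the paper's formula.

def prefetch_value(access_prob, update_rate, size_bytes, distance_m, valid_range_m):
    # Illustrative value function: favor likely, stable, small, nearby, widely valid data.
    staleness_penalty = 1.0 + update_rate               # frequently updated data decays fast
    locality = valid_range_m / (valid_range_m + distance_m)
    return (access_prob * locality) / (staleness_penalty * size_bytes)

def select_prefetch(candidates, cache_budget_bytes):
    # Greedily fill the prefetch budget with the highest-value items.
    ranked = sorted(candidates, key=lambda c: prefetch_value(*c[1:]), reverse=True)
    chosen, used = [], 0
    for name, prob, upd, size, dist, rng in ranked:
        if used + size <= cache_budget_bytes:
            chosen.append(name)
            used += size
    return chosen

candidates = [
    # (name, access_prob, update_rate, size_bytes, distance_m, valid_range_m)
    ("nearby_restaurants", 0.8, 0.1, 4096, 200, 1000),
    ("traffic_map",        0.6, 2.0, 65536, 50, 500),
    ("city_guide",         0.3, 0.01, 8192, 1500, 5000),
]
print(select_prefetch(candidates, cache_budget_bytes=16384))
# ['nearby_restaurants', 'city_guide']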

8.
Hash tables, a data indexing structure that provides efficient key-based access, are widely used in computer applications, especially in system software, databases, and high-performance computing. In networking, cloud computing, and IoT services, hash tables have become core components of cache systems. However, as data volumes grow, performance bottlenecks have emerged in hash table designs built around multi-core CPUs, and there is an urgent need to further improve the performance and scalability of hash tables. With the increasing popularity of general-purpose Graphics Processing Units (GPUs) and the substantial improvement of their computing capability and concurrency, many parallel system software tasks have been ported to GPUs and achieved considerable performance gains. However, because hash table accesses are sparse and random, using existing parallel hash table structures directly on GPUs inevitably causes frequent memory accesses and bus transfers, which limits their performance. This study analyzes the memory access behavior, hit ratio, and index overhead of hash table indexes in cache systems, and proposes CCHT (Cache Cuckoo Hash Table), a hybrid-access cache indexing framework adapted to GPUs. CCHT offers cache strategies suited to different hit-ratio and index-overhead requirements, allows write and query operations to execute concurrently, makes full use of the computing power and concurrency of GPU hardware, and reduces memory access and bus transfer overhead. A GPU implementation and experimental evaluation show that CCHT outperforms other cache indexing hash tables while maintaining the cache hit ratio.
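CCHT builds on cuckoo hashing, in which every key has two candidate slots, so a lookup probes at most two locations. The single-threaded Python sketch below shows only that underlying indexing idea; it is not the GPU-concurrent CCHT design, and the capacity and kick limit are illustrative.

class CuckooHashTable:
    """Minimal cuckoo hash table: every key lives in one of two candidate
    slots, so a lookup probes at most two locations."""
    def __init__(self, capacity=8, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.slots = [None] * capacity               # each slot holds (key, value) or None

    def _positions(self, key):
        h1 = hash(key) % self.capacity
        h2 = hash((key, 0x9E3779B9)) % self.capacity  # second, independent-ish hash
        return h1, h2

    def get(self, key):
        for pos in self._positions(key):
            item = self.slots[pos]
            if item is not None and item[0] == key:
                return item[1]
        return None

    def put(self, key, value):
        h1, h2 = self._positions(key)
        for pos in (h1, h2):
            if self.slots[pos] is None or self.slots[pos][0] == key:
                self.slots[pos] = (key, value)
                return True
        # Both candidate slots are occupied: evict and relocate until a hole appears.
        pos, entry = h1, (key, value)
        for _ in range(self.max_kicks):
            entry, self.slots[pos] = self.slots[pos], entry
            p1, p2 = self._positions(entry[0])
            pos = p2 if pos == p1 else p1             # move the evicted entry to its other slot
            if self.slots[pos] is None:
                self.slots[pos] = entry
                return True
        return False                                  # a real table would grow or rehash here

t = CuckooHashTable()
t.put("k1", 1)
t.put("k2", 2)
print(t.get("k1"), t.get("missing"))                  # 1 None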

9.
To address the problem that automated unit testing tools often run out of memory and take a long time when testing large C++ projects, cache optimization is introduced into the testing workflow, and a cache optimization method tailored to different testing modes is proposed. When the user tests an entire project directly, the system uses cache prefetching: a prefetch model supplies data blocks to the cache before read misses occur. When the user tests a single file, the system uses an improved GDSF replacement algorithm for cache replacement. Experiments show that the method effectively prevents such unit testing tools from running out of memory and reduces testing time, raising the size of supported projects from about 5,000 lines of code to over one hundred thousand lines and greatly improving system performance.
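The abstract does not describe how GDSF is improved, so the Python sketch below shows only the standard (baseline) Greedy-Dual-Size-Frequency policy it starts from: priority = clock + frequency × cost / size, evict the entry with the smallest priority, and advance the clock to that priority so older entries age. Sizes and costs in the example are illustrative.

class GDSFCache:
    """Baseline Greedy-Dual-Size-Frequency cache replacement."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.clock = 0.0
        self.entries = {}        # key -> [priority, frequency, size, cost, value]

    def _priority(self, freq, size, cost):
        return self.clock + freq * cost / size

    def get(self, key):
        e = self.entries.get(key)
        if e is None:
            return None
        e[1] += 1                                    # bump frequency on a hit
        e[0] = self._priority(e[1], e[2], e[3])      # recompute priority
        return e[4]

    def put(self, key, value, size, cost=1.0):
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=lambda k: self.entries[k][0])
            self.clock = self.entries[victim][0]     # aging: raise the priority floor
            self.used -= self.entries[victim][2]
            del self.entries[victim]
        self.entries[key] = [self._priority(1, size, cost), 1, size, cost, value]
        self.used += size

c = GDSFCache(capacity_bytes=100)
c.put("a", "A", size=40)
c.put("b", "B", size=40)
c.get("a")                        # "a" now has higher priority than "b"
c.put("c", "C", size=40)          # evicts "b", the lowest-priority entry
print(sorted(c.entries))          # ['a', 'c']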

10.
Optimizing main-memory join on modern hardware
In the past decade, the exponential growth in commodity CPU speed has far outpaced advances in memory latency. A second trend is that CPU performance advances come not only from increased clock rates, but also from increasing parallelism inside the CPU. Current database systems have not yet adapted to these trends and show poor utilization of both CPU and memory resources on current hardware. In this paper, we show how these resources can be optimized for large joins and translate these insights into guidelines for future database architectures, encompassing data structures, algorithms, cost modeling, and implementation. In particular, we discuss how vertically fragmented data structures optimize cache performance for sequential data access. On the algorithmic side, we refine the partitioned hash-join with a new partitioning algorithm called "radix-cluster", which is specifically designed to optimize memory access. The performance of this algorithm is quantified using a detailed analytical model that incorporates memory access costs in terms of a limited number of parameters, such as cache sizes and miss penalties. We also present a calibration tool that extracts such parameters automatically from any computer hardware. The accuracy of our models is proven by exhaustive experiments conducted with the Monet database system on three different hardware platforms. Finally, we investigate the effect of implementation techniques that optimize CPU resource usage. Our experiments show that large joins can be accelerated by almost an order of magnitude on modern RISC hardware when both memory and CPU resources are optimized.
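The core of the radix-cluster idea is to partition both join inputs on a few bits of the join key so that each partition's hash table stays small enough to be cache resident. The Python sketch below shows only that partition-then-join structure, not the Monet implementation or its cost model; the number of radix bits and the toy relations are illustrative.

def radix_cluster(rows, key, radix_bits):
    # Split rows into 2**radix_bits clusters on the low bits of the join key,
    # so each per-cluster hash table stays small enough to be cache resident.
    mask = (1 << radix_bits) - 1
    clusters = [[] for _ in range(1 << radix_bits)]
    for row in rows:
        clusters[key(row) & mask].append(row)
    return clusters

def radix_join(left, right, key_left, key_right, radix_bits=4):
    # Partitioned hash join: cluster both inputs, then join matching clusters.
    left_clusters = radix_cluster(left, key_left, radix_bits)
    right_clusters = radix_cluster(right, key_right, radix_bits)
    result = []
    for lc, rc in zip(left_clusters, right_clusters):
        table = {}
        for row in lc:                                # build on the left cluster
            table.setdefault(key_left(row), []).append(row)
        for row in rc:                                # probe with the right cluster
            for match in table.get(key_right(row), []):
                result.append((match, row))
    return result

orders = [(1, "pen"), (2, "ink"), (3, "pad")]         # (customer_id, item)
names  = [(1, "Ada"), (3, "Bob")]                     # (customer_id, name)
print(radix_join(orders, names, key_left=lambda r: r[0], key_right=lambda r: r[0]))
# [((1, 'pen'), (1, 'Ada')), ((3, 'pad'), (3, 'Bob'))]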

11.
The hierarchical cache memories in GPUs are intended to manage the irregular memory access patterns of general-purpose workloads. The GPU's level-1 data cache (L1D) plays an important role because of its ability to provide high-bandwidth, low-latency data access. Unfortunately, the GPU L1D may become a performance bottleneck due to performance challenges such as cache contention and resource congestion. These critical issues stem from the large number of simultaneous requests from the SIMT cores to the limited-capacity L1D. We observe that many applications issue a large number of requests with a very low reuse probability, resulting in GPU performance degradation. To overcome these challenges, we propose an efficient cache bypassing mechanism that periodically filters the access stream and makes an accurate bypassing decision to improve the efficiency of the L1D. The proposed technique uses a small amount of storage to save the tag array of the L1D for early miss prediction before it makes the bypassing decision. The experimental results reveal that the proposed technique significantly increases cache efficiency and GPU performance.
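The abstract only says that a small saved copy of the L1D tag array drives the bypass decision, so the Python sketch below is a loose illustration of one way such a reuse-based decision could look: a request whose block tag has not been re-referenced within a small recency window is predicted to have low reuse and bypasses the cache. The window size and threshold are illustrative assumptions, not the paper's mechanism.

from collections import OrderedDict

class BypassPredictor:
    """Illustrative shadow-tag bypass filter for one cache partition."""
    def __init__(self, shadow_entries=64, reuse_threshold=1):
        self.shadow = OrderedDict()          # block tag -> times seen in the window
        self.shadow_entries = shadow_entries
        self.reuse_threshold = reuse_threshold

    def should_bypass(self, block_tag):
        seen = self.shadow.get(block_tag, 0)
        # Refresh the tag's recency and count in the shadow array.
        if block_tag in self.shadow:
            self.shadow.move_to_end(block_tag)
        self.shadow[block_tag] = seen + 1
        if len(self.shadow) > self.shadow_entries:
            self.shadow.popitem(last=False)   # drop the least recently seen tag
        # First-touch (or rarely re-touched) blocks skip the cache.
        return seen < self.reuse_threshold

p = BypassPredictor()
print(p.should_bypass(0xA))    # True  - never seen, predicted streaming
print(p.should_bypass(0xA))    # False - re-referenced, worth caching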

12.
李靖  余建桥 《计算机应用》2010,30(7):1950-1952
Data prefetching is key to mobile database caching. The CMIP prefetching strategy obtains prefetch data by mining association rules from clients' historical access records, which improves system performance; however, because it does not consider data update rates or data sizes, cache invalidation occurs frequently. Building on this algorithm, we add checks on data update rate and size, sort the candidate data, and then select the data to prefetch. The improvement lowers the cache invalidation rate and reduces data access time and power consumption.
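As an illustration of the kind of selection step this improvement describes, the Python sketch below ranks rule consequents triggered by recent accesses and drops candidates that are too large or updated too often to stay valid in the cache. The rule format, thresholds, and example data are illustrative assumptions, since the abstract gives no formulas.

def rank_prefetch_candidates(rules, recent_accesses, max_size_bytes, max_update_rate):
    # rules: (antecedent item, consequent item, confidence, consequent size, updates/hour)
    recent = set(recent_accesses)
    candidates = []
    for antecedent, consequent, confidence, size_bytes, update_rate in rules:
        if antecedent not in recent:
            continue                      # rule not triggered by the recent access history
        if size_bytes > max_size_bytes or update_rate > max_update_rate:
            continue                      # likely evicted or stale before it is used
        candidates.append((confidence, consequent))
    return [item for _, item in sorted(candidates, reverse=True)]

rules = [
    ("stock_list", "stock_detail", 0.9, 2048, 30.0),
    ("stock_list", "company_news", 0.7, 4096, 0.5),
    ("weather",    "forecast_map", 0.8, 65536, 1.0),
]
print(rank_prefetch_candidates(rules, ["stock_list"],
                               max_size_bytes=16384, max_update_rate=5.0))
# ['company_news']  - stock_detail is updated too often; forecast_map was not triggered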

13.
Existing NAND flash memory file systems have not taken multiple NAND flash memories into account for large-capacity storage. In addition, since large-capacity NAND flash memory is much more expensive than a hard disk drive of the same capacity, it is not cost-effective to build large-capacity flash drives. To resolve these problems, this paper suggests a new file system called NAFS for large-capacity storage built from multiple small-capacity, low-cost NAND flash memories. It adopts a new cache policy, mount scheme, and garbage collection scheme in order to improve read and write performance, reduce the mount time, and improve wear-leveling effectiveness. Our performance results show that NAFS is more suitable for large-capacity storage than conventional NAND file systems such as YAFFS2 and JFFS2 and a disk-based file system for Linux such as HDD-RAID5-EXT3: its double cache policy improves the read and write transfer rates, and storing metadata on a separate partition reduces the mount time. We also demonstrate that the wear-leveling effectiveness of NAFS can be improved by our adaptive garbage collection scheme.

14.
Mobile computers can be equipped with wireless communication devices that enable users to access data services from any location. In wireless communication, the server-to-client (downlink) communication bandwidth is much higher than the client-to-server (uplink) communication bandwidth. This asymmetry makes the dissemination of data to client machines a desirable approach. However, dissemination of data by broadcasting may induce high access latency when the number of broadcast data items is large. We propose two methods aiming to reduce client access latency for broadcast data. Our methods are based on analyzing the broadcast history (i.e., the chronological sequence of items that have been requested by clients) using data mining techniques. With the first method, the data items in the broadcast disk are organized in such a way that items requested in succession are placed close to each other. The second method focuses on improving the cache hit ratio in order to decrease access latency: it enables clients to prefetch data from the broadcast disk based on rules extracted from previous data request patterns. The proposed methods are evaluated on a Web log to estimate their effectiveness. Performance experiments show that the proposed rule-based methods are effective in improving system performance in terms of both the average latency and the cache hit ratio of mobile clients.

15.
《Data Processing》1986,28(2):79-81
Disc caching is a well-accepted technique designed to speed access to disc storage, and it is now being applied to personal computers for the first time. The problem in the past has been designing and implementing caching algorithms. There are two types, prefetch and replacement; both try to provide an optimum choice of what data to keep in cache memory and to predict future data accesses. A UK company has addressed the problem and claims to have radically improved access time.

16.
The availability of low-cost, high-performance microprocessors has led to various designs of shared-memory multiprocessor systems, and commercial products based on shared memory have proliferated. Such a multiprocessor system is heavily influenced by the structure of its memory system, and most configurations include local cache memories. The more processors a system carries, the larger the local cache memory needed to keep the traffic to and from the shared memory at a reasonable level. The implementation of local cache memories, however, is not a simple task because of environmental limitations. In particular, the general lack of board space presents a formidable problem: a cache memory system needs space mostly for its complex control logic circuits and for network interfaces such as the snooping logic for the shared bus. Although packaging can be made denser to reduce system size, there are still multiple processors per board, which calls for a more area-efficient cache memory architecture. This paper presents a design of a shared cache for the dual-processor boards of bus-based symmetric multiprocessors. The design and implementation issues are described first, and then the evaluation and measurement results are discussed. The shared cache proposed in this paper has been determined to be quite area-efficient without significant loss of throughput or scalability. It has been implemented as a plug-in unit for TICOM, a prevalent commercial multiprocessor system.

17.
Main memory cache performance continues to play an important role in determining the overall performance of object-oriented, object-relational, and XML databases. An effective method of improving main memory cache performance is to prefetch or pre-load pages in advance of their use, in anticipation of main memory cache misses. In this paper we describe a framework for creating prefetching algorithms with the novel features of path and cache consciousness. Path consciousness refers to the use of short sequences of object references at key points in the reference trace to identify paths of navigation. Cache consciousness refers to the use of historical page access knowledge to guess which pages are likely to be main memory cache resident most of the time, and then to assume these pages do not exist in the context of prefetching. We have conducted a number of experiments comparing our approach against four highly competitive prefetching algorithms. The results show that our approach outperforms existing prefetching techniques in some situations while performing worse in others. We provide guidelines as to when our algorithm should be used and when others may be more desirable.
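The Python sketch below gives a loose illustration of the two notions described above: it learns which page tends to follow each short reference path, and it never recommends pages assumed to be cache resident. The path length and the "hot page" set are illustrative assumptions; the paper's actual framework is not reproduced here.

from collections import Counter, defaultdict

class PathPrefetcher:
    """Path-conscious sketch: map each short reference path to likely next pages,
    skipping pages assumed to already be cache resident."""
    def __init__(self, k=2, hot_pages=()):
        self.k = k
        self.hot_pages = set(hot_pages)           # assumed resident; never prefetched
        self.next_page = defaultdict(Counter)     # path tuple -> Counter of next pages

    def train(self, trace):
        for i in range(len(trace) - self.k):
            path = tuple(trace[i:i + self.k])
            self.next_page[path][trace[i + self.k]] += 1

    def predict(self, recent):
        path = tuple(recent[-self.k:])
        for page, _count in self.next_page[path].most_common():
            if page not in self.hot_pages:
                return page                       # best non-resident candidate
        return None

p = PathPrefetcher(k=2, hot_pages={"root"})
p.train(["root", "dept", "staff", "root", "dept", "staff", "root", "dept", "budget"])
print(p.predict(["root", "dept"]))                # 'staff' (seen twice after this path)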

18.
This research explores a compressed memory hierarchy model which can increase both the effective memory space and the bandwidth of each level of the memory hierarchy. It is well known that decompression time has a critical effect on memory access time, and that variable-sized compressed blocks tend to increase the design complexity of compressed memory systems. This paper proposes a selective compressed memory system (SCMS) incorporating a compressed cache architecture and its management method. To reduce or hide decompression overhead, the SCMS employs several effective techniques, including selective compression, parallel decompression, and the use of a decompression buffer. In addition, a fixed memory space allocation method is used to achieve efficient management of the compressed blocks. Trace-driven simulation shows that the SCMS approach can not only reduce the on-chip cache miss ratio and data traffic by about 35% and 53%, respectively, but also achieve a 20% reduction in average memory access time (AMAT) over conventional memory systems (CMS). Moreover, with some architectural enhancement, this approach can provide lower memory traffic at a lower cost than the CMS. Most importantly, the SCMS is an attractive approach for future computer systems because it offers high performance in cases of long DRAM latency and limited bus bandwidth.
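Selective compression as described above compresses a block only when doing so actually pays off under a fixed allocation unit. The Python sketch below illustrates that decision with zlib, an assumed 4 KiB block size, and a fixed half-size slot: a block is kept compressed only if it fits the slot, and incompressible blocks are stored raw so they never pay decompression overhead. This is an illustration of the policy, not the paper's hardware design.

import os
import zlib

BLOCK_SIZE = 4096             # illustrative block size
SLOT_SIZE = BLOCK_SIZE // 2   # fixed allocation unit: a compressed block must fit in half

def store_block(block: bytes):
    # Selective compression: compress only when the result fits the fixed slot.
    assert len(block) == BLOCK_SIZE
    compressed = zlib.compress(block, level=1)    # fast compressor, since latency matters
    if len(compressed) <= SLOT_SIZE:
        return ("compressed", compressed)
    return ("raw", block)                         # poorly compressible: store uncompressed

def load_block(tagged):
    kind, data = tagged
    return zlib.decompress(data) if kind == "compressed" else data

zeros = bytes(BLOCK_SIZE)                         # highly compressible -> kept compressed
noise = os.urandom(BLOCK_SIZE)                    # incompressible -> stored raw
print(store_block(zeros)[0], store_block(noise)[0])   # compressed raw
print(load_block(store_block(zeros)) == zeros)        # True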

19.
On-board disk cache is an effective approach to improving disk performance by reducing the number of physical accesses to the magnetic media. Disk drive manufacturers are increasing the on-board disk cache size to match the capacity growth of the backend magnetic media; some disk drives nowadays have a cache of 32 MB. Modern computer systems use large amounts of memory to improve performance, so any data brought into host memory will be re-accessed there, not in the on-board disk cache. This feature has a significant impact on the behavior of the disk cache, because computer systems are complex systems whose components are correlated with each other; a specific component cannot be isolated from the overall system when we analyze its performance behavior. This paper employs four block-level real traces to explore the performance behavior of the on-board disk cache while considering the impact of the cache hierarchy contained in computer systems. The analysis gives three major implications: (1) the I/O stream at block level contains negligible temporal locality, so a read/write cache can achieve only marginal benefits; (2) a static write cache does not achieve performance gains, since the write stream does not interfere much with the read stream, so it is better to leave the on-board disk cache shared by both the write and read streams; (3) besides prefetching, the read cache dominates the contribution to the hit ratio, so it is better to focus on improving the read performance of the disk cache rather than its write performance.

20.
Cache coherence enforcement and memory latency reduction and hiding are very important and challenging problems in the design of large-scale distributed shared-memory (DSM) multiprocessors. We propose an integrated approach to solve these problems through a compiler-directed cache coherence scheme called the Cache Coherence with Data Prefetching (CCDP) scheme. The CCDP scheme enforces cache coherence by prefetching the potentially stale references in a parallel program. It also prefetches the non-stale references to hide their memory latencies. To optimize the performance of the CCDP scheme, some prefetch hardware support is provided to efficiently handle these two forms of data prefetching operations. We also developed the compiler techniques utilized by the CCDP scheme for stale reference detection, prefetch target analysis, and prefetch scheduling. We evaluated the performance of the CCDP scheme via execution-driven simulations of several numerical applications from the SPEC CFP95 and the Perfect benchmark suites. The simulation results show that the CCDP scheme provides significant performance improvements for the applications studied, comparable to that obtained with a full-map hardware cache coherence scheme.
