Similar Documents
20 similar documents found (search time: 31 ms)
1.
Retrospective adaptive prefetching for interactive Web GIS applications
A major task of a Web GIS (Geographic Information Systems) system is to transfer map data to client applications over the Internet, which may be too costly. To improve this inefficient process, various solutions are available. Caching the responses of the requests on the client side is the most commonly implemented solution. However, this method may not be adequate by itself. Besides caching the responses, predicting the next possible requests from a client and updating the cache with responses for those requests together provide a remarkable performance improvement. This procedure is called “prefetching” and makes caching mechanisms more effective and efficient. This paper proposes an efficient prefetching algorithm called Retrospective Adaptive Prefetch (RAP), which is constructed over a heuristic method that considers the former actions of a given user. The algorithm reduces the user-perceived response time and improves user navigation efficiency. Additionally, it adjusts the cache size automatically, based on the memory size of the client’s machine. RAP is compared with four other prefetching algorithms. The experiments show that RAP provides better performance enhancements than the other methods.
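As a rough illustration of the idea (not the RAP heuristic itself), the sketch below predicts the next map tiles from the user's most recent pan direction and prefetches along it; the single-step history and the tile coordinate scheme are assumptions made for the example.

```cpp
#include <iostream>
#include <vector>

// Minimal sketch of history-based tile prefetching for a tiled Web GIS
// client: the user's previous pan direction is taken as the best guess for
// the next request. Illustrative only; RAP's actual heuristic is richer.
struct Tile { int x, y; };

class TilePrefetcher {
    Tile last_{0, 0};
    bool haveLast_ = false;
public:
    // Record the tile the user just viewed and return tiles to prefetch.
    std::vector<Tile> onRequest(Tile t) {
        std::vector<Tile> out;
        if (haveLast_) {
            int dx = t.x - last_.x, dy = t.y - last_.y;  // recent pan direction
            if (dx != 0 || dy != 0)
                out.push_back({t.x + dx, t.y + dy});     // assume the pan continues
        }
        last_ = t;
        haveLast_ = true;
        return out;
    }
};

int main() {
    TilePrefetcher p;
    for (Tile t : std::vector<Tile>{{0, 0}, {1, 0}, {2, 0}})  // user pans east
        for (Tile pf : p.onRequest(t))
            std::cout << "prefetch tile (" << pf.x << "," << pf.y << ")\n";
}
```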

2.
This paper proposes using a user-level memory thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: The correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide applicability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53.
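A minimal sketch of what a software correlation (Markov) table could look like in such a ULMT, assuming a two-successor-per-entry table kept in an ordinary hash map; the table geometry and replacement choice are illustrative, not the paper's exact design.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

// Sketch of software correlation prefetching: the table maps each miss
// address to its recently observed successor misses, and on a repeat miss
// those successors are replayed as prefetches into the main processor's L2.
class CorrelationPrefetcher {
    static constexpr size_t kSuccessors = 2;  // successors kept per address
    std::unordered_map<uint64_t, std::vector<uint64_t>> table_;
    uint64_t last_miss_ = 0;
    bool have_last_ = false;
public:
    // Called on every L2 miss observed by the memory-side thread.
    std::vector<uint64_t> onMiss(uint64_t addr) {
        if (have_last_) {                          // learn: addr follows last miss
            auto& succ = table_[last_miss_];
            if (succ.size() >= kSuccessors) succ.erase(succ.begin());
            succ.push_back(addr);
        }
        last_miss_ = addr;
        have_last_ = true;
        auto it = table_.find(addr);               // predict: replay successors
        return it == table_.end() ? std::vector<uint64_t>{} : it->second;
    }
};

int main() {
    CorrelationPrefetcher p;
    // An irregular miss sequence that repeats (pointer-chasing-like).
    for (uint64_t a : {0x100, 0x930, 0x2a0, 0x100, 0x930, 0x2a0})
        for (uint64_t pf : p.onMiss(a))
            std::cout << "prefetch 0x" << std::hex << pf << std::dec << "\n";
}
```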

3.
Technology evolution gives easy access to high-performance dedicated computing machines using, for example, GPUs or FPGAs. When designing algorithms dealing with highly structured multidimensional data, the real bottleneck is often linked to memory access. The strategies implemented in standard CPU cache architectures are no longer efficient due to the parallelism level and the inherent structure of the data. This article presents the so-called “n-Dimensional Adaptive and Predictive Cache” (nD-AP Cache) architecture, aiming at efficient data access for grid traversal. A theoretical model of the 3D version of the cache was set up in order to predict the cache efficiency for given statistical characteristics of the access sequences and for given parameters of the cache. The practical example of ray shooting algorithms was chosen in order to carefully explore the design space and exercise the 3D-AP cache. For this purpose, a simulation model as well as a fully functional emulation platform have been designed. Given the proven efficiency of the architecture, further improvements and applications of the nD-AP Cache are discussed. Comparisons with standard caches show that the nD-AP Cache is twice as efficient as an “ideal” associative cache while using four times less memory.

4.
This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have an advantage in that resources such as processor and L1 cache resources are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over the hardware L2 prefetching for the subset of applications with high L2 cache misses per cycle.
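The run-ahead control could be sketched roughly as below: the helper thread is allowed at most a fixed number of iterations of lead over the application thread, which bounds cache pollution while keeping prefetches timely. The window size and the GCC/Clang `__builtin_prefetch` hint are assumptions of the sketch, not the paper's mechanism.

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Sketch of distance-throttled helper-thread prefetching: the helper walks
// the same iteration space as the application but may run at most kMaxAhead
// iterations ahead of it.
constexpr long kMaxAhead = 64;
std::atomic<long> app_iter{0};   // progress published by the application thread

void helper(const std::vector<double>& data) {
    const long n = static_cast<long>(data.size());
    for (long i = 0; i < n; ++i) {
        // Stall while we are too far ahead of the application thread.
        while (i - app_iter.load(std::memory_order_relaxed) > kMaxAhead)
            std::this_thread::yield();
        __builtin_prefetch(&data[i]);   // GCC/Clang builtin; a hint only
    }
}

int main() {
    std::vector<double> data(1 << 20, 1.0);
    std::thread t(helper, std::cref(data));
    double sum = 0.0;
    for (long i = 0; i < static_cast<long>(data.size()); ++i) {
        sum += data[i];                                // the "real" work
        app_iter.store(i, std::memory_order_relaxed);  // publish progress
    }
    t.join();
    std::cout << sum << "\n";
}
```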

5.
The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely-explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches are unable to exploit the potential improvement in performance, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results are reported for a suite of multimedia image processing programs including MPEG-2 decoding and encoding, convolution, thresholding, and edge chain coding.
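A hedged sketch of one multimedia-tailored idea: image kernels touch 2D neighborhoods, so alongside the sequential scan the code below also prefetches the vertically adjacent row, once per assumed 64-byte cache line. The neighborhood choice is an illustrative stand-in for the paper's schemes.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Thresholding kernel with 2D-aware prefetching: while row r is processed,
// the same columns of row r+1 are prefetched, a pattern a purely sequential
// prefetcher misses when the image row is wider than a cache line.
void threshold(std::vector<uint8_t>& img, int w, int h, uint8_t cut) {
    for (int r = 0; r < h; ++r)
        for (int c = 0; c < w; ++c) {
            if ((c & 63) == 0 && r + 1 < h)            // once per 64-byte line
                __builtin_prefetch(&img[static_cast<size_t>(r + 1) * w + c]);
            img[static_cast<size_t>(r) * w + c] =
                img[static_cast<size_t>(r) * w + c] > cut ? 255 : 0;
        }
}

int main() {
    const int w = 1024, h = 768;
    std::vector<uint8_t> img(static_cast<size_t>(w) * h, 100);
    threshold(img, w, h, 90);
    std::cout << static_cast<int>(img[0]) << "\n";  // prints 255
}
```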

6.
This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache coherent, shared memory multiprocessors. Data prefetching is accomplished by using a multiprocessor software pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronization, and is applied to communication-related accesses between successive parallel loops. Prefetching and forwarding are each shown to be more effective for certain types of architectural and application characteristics. Given this result, a new hybrid prefetching and forwarding approach is proposed and evaluated that allows the relative amounts of prefetching and forwarding used to be adapted to these characteristics. When compared to prefetching or forwarding alone, the new hybrid scheme is shown to increase performance stability over varying application characteristics, to reduce processor instruction overheads, cache miss ratios, and memory system bandwidth requirements, and to reduce performance sensitivity to architectural parameters such as cache size. Algorithms for data prefetching, data forwarding, and hybrid prefetching and forwarding are described. These algorithms are applied by using a parallelizing compiler and are evaluated via execution-driven simulations of large, optimized, numerical application codes with loop-level and vector parallelism.

7.
In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. The Cache Coherence With Data Prefetching (CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme on five applications from the SPEC CFP95 and CFP92 benchmark suites, and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency.

8.
Network continuous-media applications are emerging at a great pace. Cache memories have long been recognized as a key resource (along with network bandwidth) whose intelligent exploitation can ensure high performance for such applications. Cache memories exist at the continuous-media servers and their proxy servers in the network. Within a server, cache memories exist in a hierarchy (at the host, the storage devices, and at intermediate multi-device controllers). Our research is concerned with how to best exploit these resources in the context of continuous-media servers and, in particular, how to best exploit the available cache memories at the drive, the disk array controller, and the host levels. Our results determine under which circumstances and system configurations it is preferable to devote the available memory to traditional caching (a.k.a. data sharing) techniques as opposed to prefetching techniques. In addition, we show how to configure the available memory for optimal performance and optimal cost. Our results show that prefetching techniques are preferable for small-size caches (such as those expected at the drive level). For very large caches (such as those employed at the host level) caching techniques are preferable. For intermediate cache sizes (such as those at multi-device controllers) a combination of both strategies should be employed.
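The guideline can be summarized as a tiny policy selector; the byte thresholds below are hypothetical placeholders, not numbers from the paper.

```cpp
#include <cstddef>
#include <iostream>

// Sketch of the configuration guideline: devote small (drive-level) caches
// to prefetching, very large (host-level) caches to traditional caching,
// and intermediate (controller-level) caches to a combination of both.
enum class Strategy { Prefetch, Mixed, Cache };

Strategy choose(std::size_t cacheBytes) {
    if (cacheBytes < (4u << 20))   return Strategy::Prefetch;  // drive-level
    if (cacheBytes < (256u << 20)) return Strategy::Mixed;     // controller-level
    return Strategy::Cache;                                    // host-level
}

int main() {
    std::cout << static_cast<int>(choose(1u << 20)) << "\n"    // 0: prefetch
              << static_cast<int>(choose(64u << 20)) << "\n"   // 1: mixed
              << static_cast<int>(choose(1u << 30)) << "\n";   // 2: cache
}
```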

9.
Main memory cache performance continues to play an important role in determining the overall performance of object-oriented, object-relational and XML databases. An effective method of improving main memory cache performance is to prefetch or pre-load pages in advance of their use, in anticipation of main memory cache misses. In this paper we describe a framework for creating prefetching algorithms with the novel features of path and cache consciousness. Path consciousness refers to the use of short sequences of object references at key points in the reference trace to identify paths of navigation. Cache consciousness refers to the use of historical page access knowledge to guess which pages are likely to be main memory cache resident most of the time, and then assumes these pages do not exist in the context of prefetching. We have conducted a number of experiments comparing our approach against four highly competitive prefetching algorithms. The results show that our approach outperforms existing prefetching techniques in some situations while performing worse in others. We provide guidelines as to when our algorithm should be used and when others may be more desirable.

10.
Memory affinity has become a key element in achieving scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity, which keeps data close to the cores that access it. In particular, software transactional memory (STM) applications exhibit irregular memory access behavior that makes it harder to determine which data will be needed by each core, and when. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we thus propose a skeleton-driven mechanism to improve memory affinity in STM applications that fit the worklist pattern, employing a two-level approach. First, it addresses memory affinity at the DRAM level by automatically selecting page allocation policies. Then it employs data-prefetching helper threads to improve affinity at the cache level. It relies on a skeleton framework to exploit the application pattern in order to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that our proposed mechanism can achieve performance improvements of up to 46%, with an average of 11%, over a baseline version on two NUMA multi-core machines.

11.
Helper-threaded prefetching based on chip multiprocessors is a well-known approach to reducing memory latency and has been explored for linked data structure accesses. However, conventional helper-threaded prefetching often suffers from useless prefetches and cache thrashing, which limit its effectiveness. In this paper, we first analyze the shortcomings of conventional helper-threaded prefetching for linked data structures. We then propose an improved scheme, Skip Helper Threaded Prefetching, for hotspots with two-level data traversals. Our solution is to profile the application and balance delinquent loads between the main thread and the prefetching thread based on the characteristics of the operations in its hotspots. Evaluations show that the proposed solution improves average performance by 8.9% (-O2) and 8.5% (-O3) over conventional helper-threaded prefetching that greedily prefetches all delinquent loads. We also compare our proposal with active threaded prefetching, which synchronizes with the main thread via semaphores, and find that our proposal provides better performance for the targeted applications.

12.
The paper concerns new communication solutions for hierarchical Chip Multiprocessor (CMP) systems composed of many CMP modules interconnected by a global data exchange network. New architectural solutions for intra-module data communication are presented in the presence of hierarchical data caches in CMP modules. Inside CMP modules, dynamic shared-memory core clusters are organized around L1–L2 data cache busses. Such clusters enable group-oriented data communication in which many cores at a time read data present on the busses on the fly into their L1 banks. Dynamic switching of cores between such L1–L2 busses is done while porting the data held in a core’s L1 cache. Together with data reads on the fly, this provides very efficient intercluster “communication on the fly,” especially useful for transfers of strongly shared data. It provides fast cache-to-cache group data transmissions and eliminates standard transactions based on shared memory in the system. Comparative experimental results, based on automatic scheduling of program data flow graphs and execution in a simulator of the proposed architecture, evaluate the assumed architectural solutions. The multi-CMP system structure is assessed while taking into account technological limitations on the size of a single CMP module.

13.
Data prefetching is a well-known technique to hide the memory latency of the last-level cache (LLC). Among the many prefetching methods of recent years, the Global History Buffer (GHB) proves to be efficient in terms of cost and speedup. In this paper, we show that a fixed value for detecting patterns and for the prefetch degree causes GHB to (1) be conservative while there are more opportunities to create new addresses and (2) generate wrong addresses in the presence of constant strides. To resolve these problems, we separate the pattern length from the prefetching degree. The result is an aggressive prefetcher that can generate more addresses with a given pattern length. Furthermore, with a variable pattern length mechanism, constant strides are grouped, so that more accurate patterns are detected. As the aggressiveness of this prefetcher is relatively high, we further propose an efficient throttling procedure to reduce the negative effects of wrong prefetching, using a new measure of cache pollution. This adaptive method is suitable for CMP processors where the prefetcher resides in the shared LLC. Simulation results with a mixed suite of integer and floating-point benchmarks from SPEC CPU2006 show that on a single-core processor both the aggressive and adaptive methods outperform existing prefetchers by 48% and 28%, respectively, while increasing memory traffic by 20% and 14%, respectively. Further, on an 8-core CMP with a mix of multiprogrammed workloads, the adaptive method outperforms state-of-the-art throttling methods by 8% in speedup, while reducing memory traffic by 3%.
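To make the separation of pattern length and prefetch degree concrete, here is a hedged sketch of GHB-style delta correlation: the trailing `patternLen` deltas of the miss history are matched against earlier history, and up to `degree` further addresses are replayed. The buffer handling and search strategy are illustrative, not the paper's configuration.

```cpp
#include <cstdint>
#include <deque>
#include <iostream>
#include <vector>

// GHB-style delta correlation with independent pattern length and degree.
std::vector<uint64_t> ghbPrefetch(const std::deque<uint64_t>& misses,
                                  size_t patternLen, size_t degree) {
    std::vector<int64_t> d;  // delta stream of the miss history
    for (size_t i = 1; i < misses.size(); ++i)
        d.push_back((int64_t)misses[i] - (int64_t)misses[i - 1]);
    if (d.size() < patternLen + 1) return {};

    std::vector<uint64_t> out;
    // Search backwards for the most recent earlier occurrence of the
    // trailing patternLen deltas.
    for (size_t pos = d.size() - patternLen; pos-- > 0; ) {
        bool match = true;
        for (size_t k = 0; k < patternLen; ++k)
            if (d[pos + k] != d[d.size() - patternLen + k]) { match = false; break; }
        if (!match) continue;
        // Replay up to `degree` deltas that followed the earlier occurrence.
        uint64_t addr = misses.back();
        for (size_t k = pos + patternLen; k < d.size() && out.size() < degree; ++k)
            out.push_back(addr += d[k]);
        break;
    }
    return out;
}

int main() {
    // Miss stream with a repeating +0x8, +0x10 delta pattern.
    std::deque<uint64_t> misses{0x1000, 0x1008, 0x1018, 0x1020, 0x1030, 0x1038};
    for (uint64_t a : ghbPrefetch(misses, 2, 4))  // pattern length 2, degree 4
        std::cout << "prefetch 0x" << std::hex << a << "\n";
}
```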

14.
Trace-driven simulations of numerical Fortran programs are used to study the impact of the parallel loop scheduling strategy on data prefetching in a shared memory multiprocessor with private data caches. The simulations indicate that to maximize memory performance, it is important to schedule blocks of consecutive iterations to execute on each processor, and then to adaptively prefetch single-word cache blocks to match the number of iterations scheduled. Prefetching multiple single-word cache blocks on a miss reduces the miss ratio by approximately 5% to 30% compared to a system with no prefetching. In addition, the proposed adaptive prefetching scheme further reduces the miss ratio while significantly reducing the false sharing among cache blocks compared to nonadaptive prefetching strategies. Reducing the false sharing causes fewer coherence invalidations to be generated, and thereby reduces the total network traffic. The impact of the prefetching and scheduling strategies on the temporal distribution of coherence invalidations is also examined. It is found that invalidations tend to be evenly distributed throughout the execution of parallel loops, but tend to be clustered when executing sequential program sections. The distribution of invalidations in both types of program sections is relatively insensitive to the prefetching and scheduling strategy.

15.
To improve the memory access performance of network memory, we propose, on top of a page-level stream cache and prefetch structure, a variable stride stream (VSS) banded-stream detection algorithm and a clock-stride-based stream prefetch optimization algorithm. The banded-stream detection algorithm solves the problem of jumping virtual page addresses in loop accesses under fixed-stride stream detection and eliminates stream breaks, effectively improving stream detection coverage. The clock-stride-based stream prefetch optimization dynamically adjusts the prefetch length, addressing the problem that some prefetches cannot be returned in time, and further improves prefetch performance. Comparison with sequential prefetching algorithms shows that VSS achieves prefetching with high accuracy and low communication overhead. Through simulation, we analyze the application of this stream cache and prefetch mechanism in a network memory system, and verify the feasibility of trading a small performance loss for a flexible remote memory extension.
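A rough sketch of stride-stream detection in the spirit of VSS, assuming a small tolerance band around the trained page stride so that address jumps at loop boundaries do not break the stream; the training threshold and band width are illustrative.

```cpp
#include <cstdint>
#include <iostream>

// Page-level stride stream detector with a tolerance band: a stream is
// trained when strides repeat, and deviations within kBand pages are
// treated as the same stream rather than a stream break.
struct StreamDetector {
    int64_t lastPage = -1, stride = 0;
    int confidence = 0;
    static constexpr int kTrainThreshold = 2;  // repeats before prefetching
    static constexpr int64_t kBand = 2;        // tolerated deviation, in pages

    // Feed one accessed virtual page number; returns the page to prefetch,
    // or -1 while the stream is still training.
    int64_t access(int64_t page) {
        if (lastPage >= 0) {
            int64_t d = page - lastPage;
            int64_t dev = d - stride;
            if (dev < 0) dev = -dev;
            if (stride != 0 && dev <= kBand)
                ++confidence;                     // in-band: stream continues
            else { stride = d; confidence = 1; }  // out of band: retrain
        }
        lastPage = page;
        return confidence >= kTrainThreshold ? page + stride : -1;
    }
};

int main() {
    StreamDetector det;
    // Page stream with stride 3 and one small jump (4) at a loop boundary.
    for (int64_t p : {10, 13, 16, 20, 23, 26}) {
        int64_t pf = det.access(p);
        if (pf >= 0) std::cout << "prefetch page " << pf << "\n";
    }
}
```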

16.
Journal of Systems Architecture, 1999, 45(12–13): 1047–1073
In this paper we provide a survey of hardware-based data cache prefetching strategies. We then present two new methods which improve both the accuracy and effectiveness of data cache prefetching. The first design ties data address prediction to the instruction prefetching logic, allowing data cache prefetching to work in tandem with dynamic branch prediction. The second mechanism prefetches link-based data structures, typically problematic data accesses for sequential prefetching schemes. Combining the two mechanisms we can improve data cache hit rates, while reducing memory bus traffic by as much as 50%.
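A minimal sketch of link-based prefetching in the spirit of the second mechanism, assuming a prefetch distance of one node and the GCC/Clang `__builtin_prefetch` hint; real schemes often use jump pointers to prefetch farther ahead than the next link.

```cpp
#include <iostream>

// Pointer-chase prefetching: while visiting one node, hint the hardware to
// fetch the next node, overlapping the link traversal with the memory fetch.
struct Node { int value; Node* next; };

long sumList(Node* head) {
    long sum = 0;
    for (Node* n = head; n != nullptr; n = n->next) {
        __builtin_prefetch(n->next);  // hint only; safe even when next is null
        sum += n->value;
    }
    return sum;
}

int main() {
    Node c{3, nullptr}, b{2, &c}, a{1, &b};
    std::cout << sumList(&a) << "\n";  // prints 6
}
```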

17.
Proxy caches are essential to improve the performance of the World Wide Web and to reduce user-perceived latency. Appropriate cache management strategies are crucial to achieving these goals. In our previous work, we introduced Web object-based caching policies. A Web object consists of the main HTML page and all of its constituent embedded files. Our studies have shown that these policies improve proxy cache performance substantially. In this paper, we propose a new Web object-based policy to manage the storage system of a proxy cache. We propose two techniques to improve storage system performance. The first technique prefetches the related files belonging to a Web object from the disk into main memory. This prefetching improves performance, as most of the files can then be served from main memory rather than from the proxy disk. The second technique stores the Web object members in contiguous disk blocks in order to reduce disk access time. We used trace-driven simulations to study the performance improvements one can obtain with these two techniques. Our results show that the first technique by itself provides up to a 50% reduction in hit latency, which is the delay involved in providing a hit document from the proxy. An additional 5% improvement can be obtained by incorporating the second technique.
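The first technique might look roughly like this: a request for a Web object's main page triggers disk reads of the object's member files into the memory cache. The data structures and the `loadFromDisk` stand-in are hypothetical illustrations, not the paper's implementation.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of Web object-based prefetching in a proxy: fetching a main HTML
// page pre-loads its embedded member files from disk into main memory.
class WebObjectCache {
    // Web object: main page -> its embedded member files.
    std::unordered_map<std::string, std::vector<std::string>> members_;
    std::unordered_map<std::string, std::string> memoryCache_;

    std::string loadFromDisk(const std::string& url) {
        return "<contents of " + url + ">";  // placeholder for a disk read
    }
public:
    void addObject(std::string page, std::vector<std::string> files) {
        members_[std::move(page)] = std::move(files);
    }
    std::string get(const std::string& url) {
        auto hit = memoryCache_.find(url);
        if (hit != memoryCache_.end()) return hit->second;  // memory hit
        std::string body = loadFromDisk(url);
        memoryCache_[url] = body;
        auto obj = members_.find(url);                      // main page requested:
        if (obj != members_.end())                          // prefetch its members
            for (const auto& f : obj->second)
                memoryCache_.emplace(f, loadFromDisk(f));
        return body;
    }
};

int main() {
    WebObjectCache proxy;
    proxy.addObject("/index.html", {"/logo.png", "/style.css"});
    proxy.get("/index.html");  // loads the page, prefetches both members
    proxy.get("/logo.png");    // now served from main memory
    std::cout << "done\n";
}
```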

18.
High-performance processors employ aggressive branch prediction and prefetching techniques to increase performance. Speculative memory references caused by these techniques sometimes bring data into the caches that are not needed by correct execution. This paper proposes the use of the first-level caches as filters that predict the usefulness of speculative memory references. With the proposed technique, speculative memory references bring data only into the first-level caches rather than all levels in the cache hierarchy. The processor monitors the use of the cache blocks in the first-level caches and decides which blocks to keep in the cache hierarchy based on the usefulness of cache blocks. It is shown that a simple implementation of this technique usually outperforms inclusive and exclusive baseline cache hierarchies commonly used by today’s processors and results in IPC performance improvements of up to 10% on the SPEC CPU2000 integer benchmarks.
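A hedged software model of the filtering idea: speculative fills are tagged in the L1, promoted to the lower levels only if a demand access actually touches them, and dropped otherwise. The structures model the policy, not a hardware design.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// L1-as-filter model: each block carries a speculative tag and a used bit;
// on eviction, only demand-fetched or demonstrably useful speculative
// blocks are kept in the rest of the hierarchy.
struct BlockState { bool speculative; bool used; };

class L1Filter {
    std::unordered_map<uint64_t, BlockState> l1_;
public:
    void fillSpeculative(uint64_t blk) { l1_[blk] = {true, false}; }
    void fillDemand(uint64_t blk)      { l1_[blk] = {false, true}; }
    void demandAccess(uint64_t blk) {
        auto it = l1_.find(blk);
        if (it != l1_.end()) it->second.used = true;  // speculation was useful
    }
    // On eviction: return true if the block should be kept in L2/L3.
    bool evict(uint64_t blk) {
        auto it = l1_.find(blk);
        if (it == l1_.end()) return false;
        bool keep = !it->second.speculative || it->second.used;
        l1_.erase(it);
        return keep;
    }
};

int main() {
    L1Filter f;
    f.fillSpeculative(0x40);
    f.fillSpeculative(0x80);
    f.demandAccess(0x40);  // only 0x40 proves useful
    std::cout << f.evict(0x40) << f.evict(0x80) << "\n";  // prints 10
}
```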

19.
Cache coherence enforcement and memory latency reduction and hiding are very important and challenging problems in the design of large-scale distributed shared-memory (DSM) multiprocessors. We propose an integrated approach to solve these problems through a compiler-directed cache coherence scheme called the Cache Coherence with Data Prefetching (CCDP) scheme. The CCDP scheme enforces cache coherence by prefetching the potentially stale references in a parallel program. It also prefetches the non-stale references to hide their memory latencies. To optimize the performance of the CCDP scheme, some prefetch hardware support is provided to efficiently handle these two forms of data prefetching operations. We also developed the compiler techniques utilized by the CCDP scheme for stale reference detection, prefetch target analysis, and prefetch scheduling. We evaluated the performance of the CCDP scheme via execution-driven simulations of several numerical applications from the SPEC CFP95 and the Perfect benchmark suites. The simulation results show that the CCDP scheme provides significant performance improvements for the applications studied, comparable to that obtained with a full-map hardware cache coherence scheme.

20.
Data prefetching is a useful approach for reducing memory access stalls in many scientific applications. However, it suffers severely from cache pollution in some applications. In this paper, we study the effectiveness of combining data prefetching with non-blocking loads on cache pollution and explain why it shows good results in our simulation.
