Similar Documents
20 similar documents found (search time: 15 ms)
1.
On-chip caches that reduce average memory access latency are commonplace in today's commercial microprocessors. These on-chip caches generally have low associativity and small cache sizes, so cache line conflicts are the main source of cache misses, which are critical for overall system performance. This paper introduces an innovative design for on-chip data caches of microprocessors, called the one's complement cache. While binary complement numbers have been used successfully in designing arithmetic units, to the best of our knowledge no one has previously considered using such complement numbers in cache memory designs. This paper shows that such complement numbers help greatly in reducing cache misses in a data cache, thereby improving data cache performance. By computing cache addresses and memory addresses in parallel, the new design does not increase the critical hit time of cache accesses. Cache misses caused by line interference are reduced by evenly distributing the data items referenced by program loops across all sets in the cache. This even distribution is achieved by making the number of sets in the cache a prime or an odd number, so that the chance of related data being mapped to the same set is small. Trace-driven simulations are used to evaluate the performance of the new design. Performance results on benchmarks show that the new design improves cache performance significantly with negligible additional hardware cost.
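As a rough illustration of the indexing idea, not the paper's actual hardware scheme: if the number of sets is chosen as 2^k - 1 (an odd modulus), the set index is a residue modulo a one's complement modulus and can be computed by end-around-carry folding instead of division. A minimal C sketch with assumed field widths:

    #include <stdint.h>

    #define BLOCK_BITS 5                 /* 32-byte lines (assumed) */
    #define K          9                 /* 2^9 - 1 = 511 sets, an odd modulus */

    /* x mod (2^K - 1) by repeated folding: add the high bits back into the
       low bits (end-around carry), i.e. one's complement arithmetic. */
    static unsigned set_index(uint32_t addr)
    {
        uint32_t m = (1u << K) - 1;
        uint32_t x = addr >> BLOCK_BITS; /* cache line address */
        while (x > m)
            x = (x & m) + (x >> K);
        return (x == m) ? 0 : x;         /* 2^K - 1 is congruent to 0 */
    }

The fold loop runs only a few iterations for 32-bit addresses; unlike a general modulus, this is cheap enough that, as the abstract indicates, the cache address can be computed in parallel with address generation.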

2.
3.
We present a technique for analyzing the number of cache misses incurred by multithreaded cache oblivious algorithms on an idealized parallel machine in which each processor has a private cache. We specialize this technique to computations executed by the Cilk work-stealing scheduler on a machine with dag-consistent shared memory. We show that a multithreaded cache oblivious matrix multiplication incurs cache misses when executed by the Cilk scheduler on a machine with P processors, each with a cache of size Z, with high probability. This bound is tighter than previously published bounds. We also present a new multithreaded cache oblivious algorithm for 1D stencil computations incurring cache misses with high probability, one for Gaussian elimination and back substitution, and one for the length computation part of the longest common subsequence problem incurring cache misses with high probability. This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.
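For context, the sequential skeleton of the cache-oblivious matrix multiplication analyzed above looks roughly as follows; a sketch under assumed array names, where in Cilk the two halves of a row or column split (which write disjoint parts of C) would be spawned in parallel:

    #define N 512
    static double A[N][N], B[N][N], C[N][N];

    /* Compute C[r0..r1)[c0..c1) += A[r0..r1)[k0..k1) * B[k0..k1)[c0..c1)
       by recursively splitting the largest dimension; the recursion reuses
       whatever fits in cache at every level without knowing Z. */
    static void mm(int r0, int r1, int c0, int c1, int k0, int k1)
    {
        int dr = r1 - r0, dc = c1 - c0, dk = k1 - k0;
        if (dr == 1 && dc == 1 && dk == 1) {
            C[r0][c0] += A[r0][k0] * B[k0][c0];
        } else if (dr >= dc && dr >= dk) {       /* split rows: parallelizable */
            mm(r0, r0 + dr / 2, c0, c1, k0, k1);
            mm(r0 + dr / 2, r1, c0, c1, k0, k1);
        } else if (dc >= dk) {                   /* split columns: parallelizable */
            mm(r0, r1, c0, c0 + dc / 2, k0, k1);
            mm(r0, r1, c0 + dc / 2, c1, k0, k1);
        } else {                                 /* split k: sequential (same C) */
            mm(r0, r1, c0, c1, k0, k0 + dk / 2);
            mm(r0, r1, c0, c1, k0 + dk / 2, k1);
        }
    }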

4.
The memory behavior of cache oblivious stencil computations (total citations: 1, self-citations: 0, citations by others: 1)
We present and evaluate a cache oblivious algorithm for stencil computations, which arise, for example, in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an “ideal cache” of size Z, our algorithm saves a factor of Θ(Z^{1/n}) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy. We evaluate our algorithm in terms of the number of cache misses, and demonstrate that the memory behavior agrees with our theoretical predictions. Our experimental evaluation is based on a finite-difference solution of a heat diffusion problem, as well as a Gauss-Seidel iteration and a 2-dimensional LBMHD program, both reformulated as cache oblivious stencil computations. This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.
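To make the recursive structure concrete, here is a minimal sketch of a trapezoidal space-time decomposition of the kind the abstract describes, specialized to a 1D heat-diffusion kernel; the array layout, kernel, and cut test are assumptions of this sketch, not taken verbatim from the paper:

    #define NX 100000
    static double u[2][NX];            /* two time levels, alternating by parity */

    static void kernel(int t, int x)   /* 3-point heat-diffusion update */
    {
        u[(t + 1) % 2][x] = u[t % 2][x]
            + 0.25 * (u[t % 2][x - 1] - 2.0 * u[t % 2][x] + u[t % 2][x + 1]);
    }

    /* Walk the space-time trapezoid: time t0 <= t < t1, space interval
       [x0 + dx0*(t - t0), x1 + dx1*(t - t0)); dx0, dx1 are edge slopes. */
    static void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1)
    {
        int dt = t1 - t0;
        if (dt == 1) {                 /* base case: one time step */
            for (int x = x0; x < x1; x++)
                kernel(t0, x);
        } else if (dt > 1) {
            if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt) {
                /* wide trapezoid: cut space through the center */
                int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) / 4;
                walk1(t0, t1, x0, dx0, xm, -1);
                walk1(t0, t1, xm, -1, x1, dx1);
            } else {
                /* tall trapezoid: cut time at the midpoint */
                int s = dt / 2;
                walk1(t0, t0 + s, x0, dx0, x1, dx1);
                walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
            }
        }
    }

    /* e.g. walk1(0, T, 1, 0, NX - 1, 0) advances the interior T time steps;
       boundary cells u[*][0] and u[*][NX-1] are held fixed. */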

5.
Irani, Algorithmica, 2002, 33(3): 384-409
Abstract. We consider the paging problem where the pages have varying size, which has applications to page replacement policies for caches containing World Wide Web documents. We consider two models for the cost of an algorithm on a request sequence. In the first (the FAULT model) the goal is to minimize the number of page faults; in the second (the BIT model) the goal is to minimize the total number of bits that have to be read into the cache. We give offline algorithms for both cost models that obtain approximation factors of O(log k), where k is the ratio of the size of the cache to the size of the smallest page, and randomized online algorithms for both cost models that are O(log^2 k)-competitive. In addition, if the input sequence is generated by a known distribution, we give an algorithm for the FAULT model whose expected cost is within a factor of O(log k) of any other online algorithm.

6.
Loop tiling is an effective optimizing transformation for boosting the memory performance of a program, especially for dense matrix scientific computations. The magnitude and stability of the achieved performance improvements depend heavily on the appropriate selection of tile sizes. Many existing tile selection algorithms try to find tile sizes that eliminate self-interference cache conflict misses, maximize cache utilization, and minimize cross-interference cache conflict misses. These techniques depend heavily on the actual layout of the arrays in memory. Array padding, an effective data layout optimization technique, is therefore incorporated by many algorithms to help loop tiling stabilize its effectiveness by avoiding pathological array sizes.

In this paper, we examine several such combined algorithms in terms of cost-benefit trade-offs, and introduce a new algorithm. Preliminary experimental results show that more precise and costly tile selection and array padding algorithms may not be justified by the resulting performance improvements, since similar improvements may also be achieved by much simpler and therefore less expensive strategies. The key issues in finding a good tiling algorithm are (1) to identify the critical performance factors and (2) to develop corresponding performance models that allow predictions at a sufficient level of accuracy. Following this insight, we have developed a new tiling algorithm that performs better than previous algorithms in terms of execution time and stability, and generates code with performance comparable to the best measured algorithm. Experimental results on two standard benchmark kernels, matrix multiply and LU factorization, show that the new algorithm is orders of magnitude faster than the best previous algorithm without sacrificing stability or the execution speed of the generated code.
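For reference, the transformation under discussion has the following shape for matrix multiply; a minimal sketch in C, with the tile size T standing in for whatever value a selection algorithm would compute (N is assumed divisible by T to keep the sketch short):

    #define N 1024
    #define T 64                       /* tile size; in practice chosen per cache */
    static double A[N][N], B[N][N], C[N][N];

    /* Tiled matrix multiply C += A * B: the outer ii/kk/jj loops keep a
       T x T working set of A, B and C in cache while it is being reused. */
    void matmul_tiled(void)
    {
        for (int ii = 0; ii < N; ii += T)
            for (int kk = 0; kk < N; kk += T)
                for (int jj = 0; jj < N; jj += T)
                    for (int i = ii; i < ii + T; i++)
                        for (int k = kk; k < kk + T; k++) {
                            double a = A[i][k];
                            for (int j = jj; j < jj + T; j++)
                                C[i][j] += a * B[k][j];
                        }
    }

With T chosen so that roughly three T x T tiles fit in the cache, each loaded element is reused T times; the algorithms compared above differ in how they pick T (and any array padding) for a given cache geometry.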

7.
A new search strategy, called depth-m search, is proposed for branch-and-bound algorithms, where m is a parameter to be set by the user. In particular, depth-1 search is equivalent to the conventional depth-first search, and depth-∞ search is equivalent to the general heuristic search (including best-bound search as a special case). It is confirmed by computational experiment that the performance of depth-m search changes continuously from that of depth-first search to that of heuristic search as m is changed from 1 to ∞. The exact upper bound on the size of the required memory space is derived and shown to be bounded by O(nm), where n is the problem size. Some methods for controlling m during the computation are also proposed and compared, in order to carry out the entire computation within a given memory space bound.

8.
Ruoming & Gagan, Performance Evaluation, 2005, 60(1-4): 73-105
In this paper, we revisit the problem of performance prediction on SMP machines, motivated by the need to select a parallelization strategy for random write reductions. Such reductions arise frequently in data mining algorithms.

In our previous work, we developed a number of techniques for parallelizing this class of reductions. That work showed that each of the three techniques (full replication, optimized full locking, and cache-sensitive) can outperform the others depending on problem, dataset, and machine parameters. Therefore, an important question is: "Can we predict the performance of these techniques for a given problem, dataset, and machine?"

This paper addresses this question by developing an analytical performance model that captures a two-level cache, coherence cache misses, TLB misses, locking overheads, and contention for memory. The analytical model is combined with results from micro-benchmarking to predict performance on real machines. We have validated the model on two different SMP machines. Our results show that the model effectively captures the impact of the memory hierarchy (two-level cache and TLB) as well as the factors that limit parallelism (contention for locks, memory contention, and coherence cache misses). The difference between predicted and measured performance is within 20% in almost all cases. Moreover, the model is quite accurate in predicting the relative performance of the three parallelization techniques.
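The general shape of such a model can be sketched as follows; every name and term here is hypothetical, chosen only to show how micro-benchmarked latencies combine with predicted event counts, and is not the paper's notation:

    typedef struct {                   /* predicted per-execution event counts */
        double cpu_cycles;             /* pure computation */
        double l1_misses, l2_misses, tlb_misses;
        double lock_acquires;
        double contention_factor;      /* >= 1.0; memory/bus contention scaling */
    } workload_t;

    typedef struct {                   /* latencies measured by micro-benchmarks */
        double l2_latency, mem_latency, tlb_penalty, lock_cost;
    } machine_t;

    /* Combine event counts with measured costs into a cycle estimate. */
    double predict_cycles(const workload_t *w, const machine_t *m)
    {
        double t = w->cpu_cycles
                 + w->l1_misses * m->l2_latency   /* L1 miss served by L2 */
                 + w->l2_misses * m->mem_latency  /* L2 miss goes to memory */
                 + w->tlb_misses * m->tlb_penalty
                 + w->lock_acquires * m->lock_cost;
        return t * w->contention_factor;
    }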


9.
Multithreading is a well-known technique for hiding latency in a non-blocking cache architecture. By switching execution from one thread to another, the CPU can perform useful work while waiting for pending requests to be processed by the main memory. In this paper we examine the effects of varying the associativity and block size on cache performance in an environment of reduced locality of reference due to multithreading. We find that for an associativity equal to the number of threads, the cache produces a very low miss rate even at small sizes. Also, by taking into account the increase in cycle time due to larger cache size or associativity, we find that the optimum cache configuration for best processor performance is 16 Kbytes, direct mapped. Finally, with constant main memory bandwidth, increasing the block size beyond 32 bytes reduces the miss rate but degrades processor performance.

11.
The synchronization barrier is a point in the program where the processing elements (PEs) wait until all the PEs have arrived at this point. In a reduction computation, given a commutative and associative binary operation op, one needs to reduce values a_0, ..., a_{N-1}, stored in PEs 0, ..., N-1, to a single value a* = a_0 op a_1 op ... op a_{N-1}, and then to broadcast the result a* to all PEs. This computation is often followed by a synchronization barrier. Routines to perform these functions are frequently required in parallel programs. Simple and efficient working C-language routines for the parallel barrier synchronization and reduction computations are presented. The codes are appropriate for a CREW (concurrent-read-exclusive-write) or EREW parallel random-access shared-memory MIMD computer. They require only shared memory reads and writes; no locks, semaphores, etc. are needed. The running time of each of these routines is O(log N). The amount of shared memory required and the number of shared memory accesses generated are both O(N). These are the asymptotically minimum values for the three parameters. The algorithms employ the obvious computational scheme involving a binary tree. Examples of applications for these routines and results of performance testing on the Sequent Balance 21000 computer are presented. An abstract of this article appeared in Proc. 1989 Int. Conf. Parallel Processing, p. II-175.
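In the spirit of the routines described, here is a minimal C sketch of a combined tree reduction and barrier using only shared-memory reads and writes. It assumes sequentially consistent shared memory (as on the Sequent Balance); modern machines would need fences or C11 atomics, and all names here are hypothetical:

    #define MAXP 1024

    volatile int    ready[MAXP];  /* child -> parent: partial result posted */
    volatile int    go[MAXP];     /* parent -> child: result is final       */
    volatile double val[MAXP];    /* partial reduction values               */

    /* Every PE pe in 0..N-1 calls this with its local value; all PEs return
       a* = val[0]. 'op' is instantiated as + here. O(log N) time, O(N) space. */
    double reduce_and_barrier(int pe, int N, double mine)
    {
        val[pe] = mine;
        /* up-sweep: fold in the children's results (implicit binary tree) */
        for (int c = 2 * pe + 1; c <= 2 * pe + 2 && c < N; c++) {
            while (!ready[c]) ;       /* spin on a shared flag; no locks */
            ready[c] = 0;
            val[pe] = val[pe] + val[c];
        }
        if (pe != 0) {
            ready[pe] = 1;            /* publish partial result to parent */
            while (!go[pe]) ;         /* wait for the global release */
            go[pe] = 0;
        }
        /* down-sweep: release the subtree; doubles as the barrier exit */
        for (int c = 2 * pe + 1; c <= 2 * pe + 2 && c < N; c++)
            go[c] = 1;
        return val[0];                /* final once any PE is released */
    }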

12.
In this paper, we propose a high-performance parallel three-dimensional fast Fourier transform (FFT) algorithm for clusters of PCs. The three-dimensional FFT can be restructured as a block three-dimensional FFT to reduce the number of cache misses, and we show that the blocked algorithm improves performance by utilizing the cache memory effectively. We use the block three-dimensional FFT algorithm to implement the parallel three-dimensional FFT. We succeeded in obtaining performance of over 1.3 GFLOPS on an 8-node dual Pentium III 1 GHz SMP PC cluster.

13.
The uniform memory hierarchy model of computation (total citations: 9, self-citations: 0, citations by others: 9)
The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performance-relevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved data-movement strategies.

A sequential computer's memory is modeled as a sequence M_0, M_1, ... of increasingly large memory modules. Computation takes place in M_0. Thus, M_0 might model a computer's central processor, while M_1 might be cache memory, M_2 main memory, and so on. For each module M_u, a bus B_u connects it with the next larger module M_{u+1}. All buses may be active simultaneously. Data is transferred along a bus in fixed-size blocks. The size of these blocks, the time required to transfer a block, and the number of blocks that fit in a module are larger for modules farther from the processor. The UMH model is parametrized by the rate at which the block sizes increase and by the ratio of the block count to the block size. A third parameter, the transfer-cost (inverse bandwidth) function, determines the time to transfer blocks at the different levels of the hierarchy.

UMH analysis refines traditional methods of algorithm analysis by including the cost of data movement throughout the memory hierarchy. The communication efficiency of a program is a ratio measuring the portion of UMH running time during which M_0 is active. An algorithm that can be implemented by a program whose communication efficiency is nonzero in the limit is said to be communication-efficient. The communication efficiency of a program depends on the parameters of the UMH model, most importantly on the transfer-cost function. A threshold function separates those transfer-cost functions for which an algorithm is communication-efficient from those that are too costly. Threshold functions for matrix transpose, standard matrix multiplication, and Fast Fourier Transform algorithms are established by exhibiting communication-efficient programs at the threshold and showing that more expensive transfer-cost functions are too costly.

A parallel computer can be modeled as a tree of memory modules with computation occurring at the leaves. Threshold functions are established for multiplication of N×N matrices using up to N^2 processors in a tree with constant branching factor.

14.
Conventional implementations of iterative numerical algorithms, especially multigrid methods, reach only a disappointingly small percentage of the theoretically available CPU performance when applied to representative large problems. One of the most important reasons for this phenomenon is that many developers of numerical software entirely neglect the need for data locality imposed by poor main memory latency and limited bandwidth. Fast program execution can be expected only when most of the data to be accessed during the computation are found in the system cache (or in one of the caches if the machine architecture comprises a cache hierarchy). Otherwise, i.e., in the case of a significant rate of cache misses, the processor must stay idle until the necessary operands are fetched from main memory, whose cycle time is in general extremely large compared to the time needed to execute a floating-point instruction. In this paper, we describe program transformation techniques developed to improve the cache performance of two-dimensional multigrid algorithms. Although we consider only the solution of Poisson's equation on the unit square using structured grids, our techniques provide valuable hints towards the efficient treatment of more general problems. Received January 31, 1999; revised October 17, 1999
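One representative transformation of this kind, sketched here under assumptions rather than taken verbatim from the paper, fuses the red and black half-sweeps of a red-black Gauss-Seidel smoother so that each grid row is updated twice while it is still cache-resident:

    #define N 512
    static double u[N][N], f[N][N];   /* solution and scaled right-hand side */

    static void relax(int i, int j)   /* 5-point Gauss-Seidel update */
    {
        /* h*h factor assumed absorbed into f */
        u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1] - f[i][j]);
    }

    /* Fused red-black sweep: once the red points of row i are updated, the
       black points of row i-1 have all their dependencies satisfied, so both
       half-sweeps happen in one pass over the grid instead of two. */
    void fused_redblack_sweep(void)
    {
        for (int i = 1; i < N - 1; i++) {
            for (int j = 2 - (i % 2); j < N - 1; j += 2)      /* red: (i+j) even */
                relax(i, j);
            if (i >= 2)
                for (int j = 2 - (i % 2); j < N - 1; j += 2)  /* black, row i-1 */
                    relax(i - 1, j);
        }
        for (int j = 1 + ((N - 2) % 2); j < N - 1; j += 2)    /* last black row */
            relax(N - 2, j);
    }

The separate red and black sweeps each stream the whole grid through the cache; the fused version halves that traffic for grids too large to stay resident.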

15.
As the speed gap between main memory and modern processors continues to widen, cache behavior becomes increasingly important for main memory database systems (MMDBs). Indexing is a key component of MMDBs. Unfortunately, the predominant indexes, B+-trees and T-trees, have been shown to utilize the cache poorly, which has triggered the development of many cache-conscious indexes, such as CSB+-trees and pB+-trees. Most of these cache-conscious indexes are variants of conventional B+-trees and have better cache performance than B+-trees. In this paper, we develop a novel J+-tree index, inspired by the Judy structure (an associative-array data structure), and propose a more cache-optimized index, the Prefetching J+-tree (pJ+-tree), which applies prefetching to the J+-tree to accelerate range scan operations. The J+-tree stores all keys in its leaf nodes and keeps the reference values of leaf nodes in a Judy structure, which lets the J+-tree not only retain the advantages of Judy (such as fast single-value search) but also outperform it in other respects; for example, J+-trees achieve better performance on range queries than Judy. The pJ+-tree exploits prefetching techniques to further improve the cache behavior of J+-trees and yields a speedup of 2.0 on range scans. Our extensive experimental study shows that, compared with B+-trees, CSB+-trees, pB+-trees and T-trees, pJ+-trees provide better performance in both time (search, scan, update) and space.

16.
Continuing advances in main memory technology have made main memory multimedia databases feasible. Research shows that the performance of main memory multimedia database systems suffers heavily from processor cache misses, and cache-conscious main memory indexes are an effective means of improving data retrieval efficiency. To address the drawback that the SA-Tree is ill-suited to main memory access, we propose its variant, the CSA-Tree. The CSA-Tree uses PCA dimensionality reduction to represent the nodes at different levels of the tree with different numbers of dimensions, which not only improves cache space utilization but also reduces CPU load, thereby improving index query efficiency. Extensive experiments show that the CSA-Tree delivers good high-dimensional data retrieval performance in main memory environments.

17.
For analytical cache modeling, stack distance theory is widely used to predict the behavior of LRU caches. Typically, the stack distance histogram is collected by profiling memory references. However, the profiled memory references reflect only instruction fetches and load/store executions, i.e., the memory accesses made to first-level (L1) caches. That is why these traces cannot be applied directly to construct stack distance histograms for downstream (L2 and L3) caches.

Therefore, this paper proposes a stack distance probability model that extends stack distance theory to multi-level LRU cache behavior prediction. The inputs of the model are the L1 cache stack distance histograms and the multi-level LRU cache configurations. The outputs are the L2 and L3 cache stack distance histograms, with which the conflict misses in the L2 and L3 caches can be quantified quickly and precisely.

Fifteen benchmarks chosen from Mobybench 2.0, Mibench I and Mediabench II are used to evaluate the accuracy of the model. Compared to simulation results from Gem5 in AtomicSimpleCPU mode, the average absolute error in predicting cache misses for the I/D shared L2 cache is less than 5%, while that for the L3 cache misses is less than 7%. Furthermore, in contrast to the time overhead of Gem5 AtomicSimpleCPU simulations, the model speeds up cache miss prediction by about 100x on average.
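For background on what the model consumes: an L1 stack distance histogram can be collected from a reference trace with a simple move-to-front list. A naive O(n·m) C sketch, with assumed sizes:

    #include <stdint.h>
    #include <string.h>

    #define MAX_BLOCKS 4096
    #define COLD       MAX_BLOCKS      /* bucket for first-touch references */

    static uint64_t stack[MAX_BLOCKS]; /* block addresses, most recent first */
    static int      depth = 0;
    static uint64_t hist[MAX_BLOCKS + 1];

    /* Record one reference: its stack distance is its current position in
       the recency list, after which the block moves to the front. */
    void reference(uint64_t block)
    {
        int d;
        for (d = 0; d < depth; d++)
            if (stack[d] == block)
                break;
        if (d == depth) {              /* never seen: infinite distance */
            hist[COLD]++;
            if (depth < MAX_BLOCKS)
                depth++;
            d = depth - 1;
        } else {
            hist[d]++;                 /* reuse at stack distance d */
        }
        memmove(&stack[1], &stack[0], (size_t)d * sizeof stack[0]);
        stack[0] = block;              /* move to front */
    }

A fully associative LRU cache of S blocks hits exactly when the distance is below S, so its miss count is the histogram tail sum over d >= S; this is the property the multi-level model builds on.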

18.
Suppose that a program makes a sequence of m accesses (references) to data blocks; the cache can hold k < m blocks. An access to a block in the cache incurs one time unit, and fetching a missing block incurs d time units. A fetch of a new block can be initiated while a previous fetch is in progress; thus, min{k, d} block fetches can be in progress simultaneously. Any sequence of block references is modeled as a walk on the access graph of the program. The goal is to find a policy for prefetching and caching which minimizes the overall execution time of a given reference sequence. This study is motivated by the pipelined operation of modern memory controllers and by program execution on fast processors. In the offline case, we show that an algorithm proposed by Cao et al. [Proc. of SIGMETRICS, 1995, pp. 188-197] is optimal for this problem. In the online case, we give an algorithm that is within a factor of 2 of the optimal among online deterministic algorithms, for any access graph and k, d ≥ 1. Better ratios are obtained for several classes of access graphs which arise in applications, including complete graphs and directed acyclic graphs (DAGs).

19.
Numerous techniques have been proposed in the literature that aim to improve the performance of cache memories by reducing cache conflicts. These techniques were proposed over the past decade, and each proposal independently claimed to reduce conflict misses. However, because the published results used different benchmarks and different experimental setups, they are not easy to compare. In this paper we report a side-by-side comparison of these techniques. We also evaluate the suitability of some of these techniques for caches with higher set associativities. In addition to evaluating the techniques for their impact on cache misses and average memory access times, we evaluate their ability to reduce the non-uniformity of cache accesses.

The conclusion of our work is that each application may benefit from a different technique; no single scheme works universally well for all applications. We also observe that, for the majority of applications, the XORing (XOR) and odd-multiplier indexing schemes perform reasonably well. Among programmable-associativity techniques, the B-cache performs better than column-associative and adaptive caches, but column-associative caches require only minimal extensions to the hardware. Uniformity of cache accesses is improved most by the B-cache technique, while the column-associative cache also improves cache access uniformity.

Based on the observation that different techniques benefit different applications, we explored the use of multiple programmable addressing mechanisms, with each addressing scheme designed for a specific application. We include some preliminary data using multiple addressing schemes.
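For concreteness, here is what the conventional, XOR, and odd-multiplier index functions typically look like; the field widths and the multiplier constant are assumptions of this sketch:

    #include <stdint.h>

    #define SET_BITS   9                  /* 512 sets (assumed) */
    #define BLOCK_BITS 6                  /* 64-byte lines (assumed) */
    #define SETS       (1u << SET_BITS)

    /* Conventional indexing: the low-order bits of the line address. */
    static unsigned idx_modulo(uint64_t addr)
    {
        return (unsigned)((addr >> BLOCK_BITS) & (SETS - 1));
    }

    /* XOR indexing: fold higher (tag) bits into the index so that addresses
       agreeing in the low bits but differing above no longer collide. */
    static unsigned idx_xor(uint64_t addr)
    {
        uint64_t line = addr >> BLOCK_BITS;
        return (unsigned)((line ^ (line >> SET_BITS)) & (SETS - 1));
    }

    /* Odd-multiplier indexing: multiply the line address by an odd constant
       and take index bits from the product (constant chosen arbitrarily). */
    static unsigned idx_odd_multiplier(uint64_t addr)
    {
        uint64_t line = addr >> BLOCK_BITS;
        return (unsigned)(((line * 9u) >> 4) & (SETS - 1));
    }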

20.
Young, Algorithmica, 2002, 33(3): 371-383
Abstract. Consider the following file caching problem: in response to a sequence of requests for files, where each file has a specified size and retrieval cost, maintain a cache of files of total size at most some specified k so as to minimize the total retrieval cost. Specifically, when a requested file is not in the cache, bring it into the cache and pay the retrieval cost, and remove other files from the cache so that the total size of the files remaining in the cache is at most k. This problem generalizes previous paging and caching problems by allowing objects of arbitrary size and cost, both important attributes when caching files for world-wide-web browsers, servers, and proxies. We give a simple deterministic on-line algorithm that generalizes many well-known paging and weighted-caching strategies, including least-recently-used, first-in-first-out, flush-when-full, and the balance algorithm. On any request sequence, the total cost incurred by the algorithm is at most k/(k-h+1) times the minimum possible using a cache of size h ≤ k. For any algorithm satisfying the latter bound, we show that for most choices of k the retrieval cost is either insignificant or at most a constant (independent of k) times the optimum. This helps explain why the competitive ratios of many on-line paging algorithms have typically been observed to be constant in practice.
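The deterministic strategy described is the one widely known in the literature as Landlord; the following is a minimal C sketch under assumed data structures (a flat array with linear scans), not an efficient implementation:

    typedef struct {
        int    id;
        double size, cost, credit;
    } file_t;

    #define CAP 100.0                 /* cache capacity k (assumed units) */
    static file_t cache[1024];
    static int    n = 0;              /* number of cached files */
    static double used = 0.0;

    /* Serve one request; returns the retrieval cost paid (0 on a hit). */
    double request(int id, double size, double cost)
    {
        for (int i = 0; i < n; i++)
            if (cache[i].id == id) {  /* hit: renew credit (the LRU-like choice) */
                cache[i].credit = cost;
                return 0.0;
            }
        if (size > CAP)               /* too large ever to fit; just fetch */
            return cost;
        while (used + size > CAP) {   /* make room by charging "rent" */
            double delta = cache[0].credit / cache[0].size;
            for (int j = 1; j < n; j++) {
                double r = cache[j].credit / cache[j].size;
                if (r < delta) delta = r;
            }
            for (int j = 0; j < n; j++)       /* decrease credit per unit size */
                cache[j].credit -= delta * cache[j].size;
            for (int j = n - 1; j >= 0; j--)  /* evict files with no credit left */
                if (cache[j].credit <= 1e-12) {
                    used -= cache[j].size;
                    cache[j] = cache[--n];
                }
        }
        cache[n] = (file_t){ id, size, cost, cost };
        used += size;
        n++;
        return cost;
    }

With all sizes and costs equal to 1 this reduces to a classical paging strategy; the choice of how much credit to restore on a hit is what selects among the LRU-like members of the family.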
