Similar Documents
20 similar documents retrieved (search time: 78 ms)
1.
It is observed that the limited memory space of direct-mapped caches is not used in a balanced way and therefore incurs extra conflict misses. We propose a novel cache organization, the balanced cache, which balances accesses to cache sets at the granularity of cache subarrays. The key technique of the balanced cache is a programmable subarray decoder, through which the mapping of memory reference addresses to cache subarrays can be optimized and hence conflict misses of direct-mapped caches can be resolved. The experimental results show that the miss rate of the balanced cache is lower on average than that of same-sized two-way set-associative caches and can be as low as that of same-sized four-way set-associative caches for particular applications. Compared with previous techniques, the balanced cache requires only one cycle to serve all cache hits and has the same access time as direct-mapped caches.
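The abstract describes the mechanism only at a high level; below is a minimal Python sketch of the general idea of a programmable subarray decoder that remaps which subarray a direct-mapped lookup uses, with a toy rebalancing policy. All class names, sizes, and the rebalancing rule are illustrative assumptions, not the paper's hardware design.

```python
# Minimal sketch (illustrative, not the paper's hardware): a direct-mapped cache whose sets are
# grouped into subarrays, with a small programmable table that remaps an address's natural
# subarray to a possibly different one so that heavily used groups land on idle subarrays.

class BalancedDMCache:
    def __init__(self, num_sets=256, num_subarrays=8, block_bytes=64):
        self.block_bytes = block_bytes
        self.num_sets = num_sets
        self.num_subarrays = num_subarrays
        self.sets_per_subarray = num_sets // num_subarrays
        self.tags = [None] * num_sets                      # one block per set (direct-mapped)
        self.decoder = list(range(num_subarrays))          # programmable subarray decoder (identity at start)
        self.subarray_refs = [0] * num_subarrays           # usage counters driving rebalancing

    def _index(self, addr):
        block = addr // self.block_bytes
        natural_set = block % self.num_sets
        natural_sub = natural_set // self.sets_per_subarray
        offset_in_sub = natural_set % self.sets_per_subarray
        mapped_sub = self.decoder[natural_sub]             # remapping applied here
        return mapped_sub * self.sets_per_subarray + offset_in_sub, block

    def access(self, addr):
        set_idx, block = self._index(addr)
        self.subarray_refs[set_idx // self.sets_per_subarray] += 1
        hit = self.tags[set_idx] == block
        if not hit:
            self.tags[set_idx] = block                     # direct-mapped fill
        return hit

    def rebalance(self):
        # Toy policy: steer the busiest subarray's traffic onto the least-used one by swapping
        # their decoder entries. A real design must also handle relocating or invalidating the
        # blocks affected by the remapping; storing full block numbers as tags keeps this sketch
        # correct (no false hits), at the cost of losing the displaced contents.
        busiest = max(range(self.num_subarrays), key=lambda s: self.subarray_refs[s])
        idlest = min(range(self.num_subarrays), key=lambda s: self.subarray_refs[s])
        i, j = self.decoder.index(busiest), self.decoder.index(idlest)
        self.decoder[i], self.decoder[j] = self.decoder[j], self.decoder[i]
        self.subarray_refs = [0] * self.num_subarrays
```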

2.
Second-level buffer cache management
Buffer caches are commonly used in servers to reduce the number of slow disk accesses or network messages. These buffer caches form a multilevel buffer cache hierarchy. In such a hierarchy, second-level buffer caches have different access patterns from first-level buffer caches because accesses to a second-level cache are actually misses from a first-level cache. Therefore, commonly used cache management algorithms, such as the least recently used (LRU) replacement algorithm, that work well for single-level buffer caches may not work well for second-level caches. We investigate multiple approaches to effectively manage second-level buffer caches. In particular, we report our research results in 1) second-level buffer cache access pattern characterization, 2) a new local algorithm called multi-queue (MQ) that performs better than nine tested alternative algorithms for second-level buffer caches, 3) a set of global algorithms that manage a multilevel buffer cache hierarchy globally and significantly improve second-level buffer cache hit ratios over corresponding local algorithms, and 4) implementation and evaluation of these algorithms in a real storage system connected with commercial database servers (Microsoft SQL Server and Oracle) running industrial-strength online transaction processing benchmarks.
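As a rough illustration of a multi-queue style policy, the sketch below keeps several LRU queues indexed by the logarithm of a block's access frequency and lets idle blocks drift back down over time. The parameters, the simplified demotion rule, and the omission of the history queue are assumptions, not the published MQ algorithm.

```python
# Simplified multi-queue replacement sketch (illustrative parameters, no history queue).
from collections import OrderedDict
import math

class MultiQueueCache:
    def __init__(self, capacity, num_queues=8, lifetime=100):
        self.capacity = capacity
        self.lifetime = lifetime                                   # idle time before demotion
        self.queues = [OrderedDict() for _ in range(num_queues)]   # block -> (freq, expire_time)
        self.time = 0

    def _queue_of(self, freq):
        return min(int(math.log2(freq)), len(self.queues) - 1)

    def _size(self):
        return sum(len(q) for q in self.queues)

    def _demote_expired(self):
        # Blocks untouched for `lifetime` ticks drift one queue lower.
        for level in range(len(self.queues) - 1, 0, -1):
            q = self.queues[level]
            for blk in list(q):
                freq, expire = q[blk]
                if expire < self.time:
                    del q[blk]
                    self.queues[level - 1][blk] = (freq, self.time + self.lifetime)

    def access(self, blk):
        self.time += 1
        self._demote_expired()
        for q in self.queues:
            if blk in q:                                   # hit: promote by access frequency
                freq, _ = q.pop(blk)
                freq += 1
                self.queues[self._queue_of(freq)][blk] = (freq, self.time + self.lifetime)
                return True
        if self._size() >= self.capacity:                  # miss: evict LRU block of lowest queue
            for q in self.queues:
                if q:
                    q.popitem(last=False)
                    break
        self.queues[0][blk] = (1, self.time + self.lifetime)
        return False
```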

3.
Wide-issue and high-frequency processors require not only a low-latency but also a high-bandwidth memory system to achieve high performance. Previous studies have shown that using multiple small single-ported caches instead of a monolithic large multi-ported one for the L1 data cache can be a scalable and inexpensive way to provide higher bandwidth. Various schemes for directing memory references have been proposed in order to achieve a close match to the performance of an ideal multi-ported cache. However, most existing designs seldom take dynamic data access patterns into consideration and thus suffer from access conflicts within one cache and unbalanced loads between the caches. It is observed in this paper that if one can group the data references of a program into several regions (access regions) to allow parallel accesses, providing separate small caches (access region caches) for these regions may yield better performance. A register-guided memory reference partitioning approach is proposed; it effectively identifies these semantic regions and organizes them into multiple caches adaptively to maximize concurrent accesses. The base register name, not its content, in the memory reference instruction is used as the basic guide for instruction steering. After the initial assignment to a specific access region cache per base register name, a reassignment mechanism is applied to capture the access pattern as the program moves across its access regions. In addition, a distribution mechanism is introduced that adaptively allows access regions to extend or shrink among the physical caches to further reduce potential conflicts. Simulations of the SPEC CPU2000 benchmarks show that the semantic-based scheme can reduce conflicts effectively and obtain considerable performance improvement in terms of IPC; with 8 access region caches, 25–33% higher IPC is achieved for integer benchmark programs than with a comparable 8-banked cache, while the benefit is smaller for floating-point benchmark programs, at most 19%.
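A minimal sketch of the steering idea, assuming a fixed number of access region caches and a toy conflict-driven reassignment rule; the names and thresholds are illustrative, not the paper's mechanism.

```python
# Illustrative sketch: memory references are steered to one of several small "access region
# caches" using the base register number of the load/store, with a small table that can be
# reassigned when the observed conflicts suggest the access pattern has shifted.

NUM_REGION_CACHES = 8

class RegionSteering:
    def __init__(self):
        self.assign = {}                      # base register -> region cache id
        self.conflicts = [0] * NUM_REGION_CACHES

    def steer(self, base_reg):
        # Initial assignment: base register r maps to cache r mod N.
        if base_reg not in self.assign:
            self.assign[base_reg] = base_reg % NUM_REGION_CACHES
        return self.assign[base_reg]

    def report_conflict(self, cache_id):
        self.conflicts[cache_id] += 1

    def maybe_reassign(self, base_reg, threshold=1000):
        # Toy reassignment: if the register's current cache sees many conflicts,
        # move its references to the currently least-conflicted cache.
        cur = self.steer(base_reg)
        if self.conflicts[cur] > threshold:
            target = min(range(NUM_REGION_CACHES), key=lambda c: self.conflicts[c])
            self.assign[base_reg] = target
```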

4.
Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that an optimistic support for forwarding speeds up five applications by an average of 50% for large caches and 30% for small caches. For large caches, most sharing read misses are eliminated, while for small caches, forwarding does not increase the number of conflict misses significantly. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups.

5.
On-chip caches to reduce average memory access latency are commonplace in today's commercial microprocessors. These on-chip caches generally have low associativity and small cache sizes. Cache line conflicts are the main source of cache misses, which are critical for overall system performance. This paper introduces an innovative design for on-chip data caches of microprocessors, called the one's complement cache. While binary complement numbers have been successfully used in designing arithmetic units, to the best of our knowledge, no one has ever considered using such complement numbers in cache memory designs. This paper shows that such complement numbers help greatly in reducing cache misses in a data cache, thereby improving data cache performance. By computing cache addresses and memory addresses in parallel, the new design does not increase the critical hit time of cache accesses. Cache misses caused by line interference are reduced by evenly distributing data items referenced by program loops across all sets in the cache. Even distribution of data in the cache is achieved by making the number of sets in the cache a prime or an odd number, so that the chance of related data being mapped to the same set is small. Trace-driven simulations are used to evaluate the performance of the new design. Performance results on benchmarks show that the new design improves cache performance significantly with negligible additional hardware cost.
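The sketch below only illustrates why a non-power-of-two (e.g., prime) number of sets spreads a strided reference stream across the cache; the paper's actual contribution, computing such a mapping cheaply with one's complement arithmetic in parallel with the address calculation, is not modeled here.

```python
# Minimal sketch: compare how a power-of-two set count and a prime set count spread a
# strided access pattern over cache sets under simple modulo indexing.
from collections import Counter

def set_distribution(addresses, num_sets, block_bytes=32):
    """Map each address to a set index and count how often each set is used."""
    return Counter((addr // block_bytes) % num_sets for addr in addresses)

# A loop walking a matrix column often produces a large power-of-two stride.
stride = 4096
addresses = [i * stride for i in range(1024)]

pow2_sets = set_distribution(addresses, num_sets=128)   # power of two: everything collides
prime_sets = set_distribution(addresses, num_sets=127)  # prime: references spread out

print("sets touched with 128 sets:", len(pow2_sets))    # expected: 1
print("sets touched with 127 sets:", len(prime_sets))   # expected: 127
```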

6.
Aggressive prefetching mechanisms improve the performance of some important applications, but substantially increase bus traffic and “pressure” on cache tag arrays. They may even reduce the performance of applications that are not memory bound. We introduce a “feedback” mechanism, termed the Prefetcher Assessment Buffer (PAB), which filters out requests that are unlikely to be useful. With this, applications that cannot benefit from aggressive prefetching do not suffer from its side-effects. The PAB is evaluated with different configurations, e.g., “all L1 accesses trigger prefetches” and “only misses to L1 trigger prefetches”. When compared with the non-selective concurrent use of multiple prefetchers, the PAB’s application to prefetching from main memory to the L2 cache can reduce the number of loads from main memory by up to 25% without losing performance. Applying more sophisticated techniques to prefetches between the L2 and L1 caches can increase IPC by 4% while reducing the traffic between the caches 8-fold.
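A hedged sketch of a feedback filter in the spirit of the PAB: it remembers recently issued prefetch addresses per prefetcher, credits a prefetcher when a demand access matches one of them, and suppresses prefetchers whose measured accuracy falls below a threshold. The buffer size, warm-up period, and accuracy threshold are assumptions, not the paper's design.

```python
# Sketch of a prefetch-usefulness filter (structure and thresholds are assumptions).
from collections import deque

class PrefetcherAssessmentBuffer:
    def __init__(self, capacity=256, min_accuracy=0.25, warmup=64):
        self.capacity = capacity
        self.min_accuracy = min_accuracy
        self.warmup = warmup
        self.pending = deque()                 # (prefetcher_id, block_addr) in issue order
        self.pending_set = {}                  # block_addr -> prefetcher_id
        self.issued = {}                       # prefetcher_id -> assessed prefetch count
        self.useful = {}                       # prefetcher_id -> prefetches later demanded

    def _accuracy(self, pid):
        issued = self.issued.get(pid, 0)
        if issued < self.warmup:
            return 1.0                         # grace period: let a new prefetcher prove itself
        return self.useful.get(pid, 0) / issued

    def filter_prefetch(self, pid, block_addr):
        """Return True if this prefetch should actually be sent to memory."""
        if len(self.pending) >= self.capacity: # age out the oldest tracked prefetch
            _, old_addr = self.pending.popleft()
            self.pending_set.pop(old_addr, None)
        # Track the candidate even if it is filtered, so an improving prefetcher regains trust.
        self.pending.append((pid, block_addr))
        self.pending_set[block_addr] = pid
        self.issued[pid] = self.issued.get(pid, 0) + 1
        return self._accuracy(pid) >= self.min_accuracy

    def on_demand_access(self, block_addr):
        """Called for every demand access; credits the prefetcher that predicted it."""
        pid = self.pending_set.pop(block_addr, None)
        if pid is not None:
            self.useful[pid] = self.useful.get(pid, 0) + 1
```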

7.
The performance loss resulting from different cache misses is variable in modern systems for two reasons: 1) memory access latency is not uniform, and 2) the latency toleration ability of processor cor...

8.
Recently, EDRAM cells have gained much attention as a promising alternative for constructing on-chip memories. However, due to the inherent characteristics of DRAM cells, they need to be refreshed periodically, causing a huge refresh energy burden. In particular, employing EDRAM cells in large-scale last-level caches makes the refresh burden much higher due to their large capacity. In this paper, we propose a selective fine-grain round-robin refresh scheme for both performance improvement and refresh energy reduction. To reduce bank conflicts between normal cache accesses and refresh operations, we employ a refresh scheme which refreshes cache lines in a bank-wise round-robin fashion. We also apply a selective refresh that depends on the inclusion information in the cache hierarchy. For data which reside in both the LLC and an upper-level cache (i.e., the L2 cache), accesses are filtered by the upper-level cache. Based on this insight, we skip the refresh of a cache block in the EDRAM-based LLC that also exists in the upper-level caches. By doing so, we can reduce unnecessary refresh operations in EDRAM-based LLCs. According to our evaluation, our proposed scheme improves performance by 7.3% and reduces energy per instruction by 13.3% compared to the baseline all-bank refresh scheme.
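A small sketch of the refresh policy as described: banks are visited round-robin, one line per refresh opportunity, and a line is skipped when inclusion information says an upper-level cache also holds it. The interface (an inclusion callback) and the sizes are assumptions.

```python
# Sketch of bank-wise round-robin refresh with inclusion-based skipping (names are assumptions).

class SelectiveRoundRobinRefresh:
    def __init__(self, num_banks, lines_per_bank, in_upper_level):
        self.num_banks = num_banks
        self.lines_per_bank = lines_per_bank
        self.in_upper_level = in_upper_level    # callback: (bank, line) -> bool, from inclusion info
        self.bank_ptr = 0                       # which bank refreshes next (round-robin)
        self.line_ptr = [0] * num_banks         # next line to refresh within each bank
        self.skipped = 0
        self.refreshed = 0

    def refresh_tick(self):
        """One refresh opportunity: visit the next line of the next bank in round-robin order."""
        bank = self.bank_ptr
        line = self.line_ptr[bank]
        if self.in_upper_level(bank, line):
            self.skipped += 1                   # an upper-level copy filters accesses; skip refresh
        else:
            self.refreshed += 1                 # issue the actual eDRAM refresh for (bank, line)
        self.line_ptr[bank] = (line + 1) % self.lines_per_bank
        self.bank_ptr = (bank + 1) % self.num_banks
        return bank, line

# Toy usage with a dummy inclusion map.
rr = SelectiveRoundRobinRefresh(num_banks=8, lines_per_bank=4096,
                                in_upper_level=lambda b, l: (b + l) % 4 == 0)
for _ in range(32):
    rr.refresh_tick()
print(rr.refreshed, rr.skipped)
```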

9.
The sharing of caches among proxies is an important technique to reduce Web traffic, alleviate network bottlenecks, and improve the response time of document requests. Most existing work on cooperative caching has focused on serving misses collaboratively. Very few studies have examined the effect of cooperation on document placement schemes and its potential enhancements to cache hit ratio and latency reduction. We propose a new document placement scheme which takes into account the contention at individual caches in order to limit the replication of documents within a cache group and increase the document hit ratio. The main idea of this new scheme is to view the aggregate disk space of the cache group as a global resource of the group and to use the concept of cache expiration age to measure the contention of individual caches. The decision of whether to cache a document at a proxy is made collectively among the caches that already have a copy of this document. We refer to this new document placement scheme as the Expiration Age-based scheme (EA scheme). The EA scheme effectively reduces the replication of documents across the cache group, while ensuring that a copy of the document always resides in a cache where it is likely to stay for the longest time. We report our study on the potentials and limits of the EA scheme using both analytic modeling and trace-based simulation. The analytical model compares and contrasts the existing (ad hoc) placement scheme of cooperative proxy caches with our new EA scheme and indicates that the EA scheme improves the effectiveness of aggregate disk usage, thereby increasing the average time duration for which documents stay in the cache. The trace-based simulations show that the EA scheme yields higher hit rates and better response times compared to the existing document placement schemes used in most caching proxies.
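The following sketch illustrates one plausible reading of the EA placement rule: each proxy estimates its cache expiration age from the lifetimes of evicted documents, and a new replica is created only if the requesting proxy would likely keep the document longer than the proxies that already hold it. The formula and decision rule are assumptions, not the paper's exact model.

```python
# Hedged sketch of an expiration-age-based placement decision (illustrative, not the paper's model).
import time

class ProxyCache:
    def __init__(self, name):
        self.name = name
        self.docs = {}                       # url -> insertion time
        self.eviction_ages = []              # observed lifetimes of evicted documents

    def cache(self, url):
        self.docs[url] = time.time()

    def evict(self, url):
        age = time.time() - self.docs.pop(url)
        self.eviction_ages.append(age)

    def expiration_age(self):
        """Average time a document stayed cached before eviction (high = low contention)."""
        if not self.eviction_ages:
            return float("inf")              # an uncontended cache keeps documents "forever"
        return sum(self.eviction_ages) / len(self.eviction_ages)

def should_cache_copy(requesting_proxy, holders):
    """Collective placement decision among the proxies that already hold the document."""
    if not holders:
        return True                          # nobody has it: always cache the first copy
    # Only replicate if the requester would likely keep it longer than every current holder.
    return requesting_proxy.expiration_age() > max(h.expiration_age() for h in holders)
```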

10.
The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed during a memory access is 10 times higher than during a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced the cache-access power consumption by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. The SLC has the low power consumption and low access delay of a direct-mapped cache and, by adding a new link field, a cache hit rate similar to that of a two-way set-associative cache. In addition, we developed another design known as multiple linked caches (MLC) to further reduce the power consumption of each cache access and avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions to reduce the power consumption of each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access memory to obtain the requested instruction directly if the instruction is not in the cache. In the simulation results, our method performed better than selective compression, a traditional cache, and a filter cache in terms of cache hit rate, power consumption, and execution time.
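A hypothetical sketch of a single-linked lookup in the spirit of column-associative designs: each line carries a link field naming one alternate line that is probed on a primary miss. The alternate-placement rule here is an assumption, and the MLC/BTB extensions described above are not modeled.

```python
# Illustrative sketch: a direct-mapped instruction cache whose lines carry a link field pointing
# to one alternate line, so a primary-location miss gets a second chance before going to memory.

class SingleLinkedCache:
    def __init__(self, num_lines=256):
        self.num_lines = num_lines
        self.tags = [None] * num_lines
        self.links = [None] * num_lines      # link field: alternate line index, or None

    def _primary(self, block):
        return block % self.num_lines

    def access(self, block):
        idx = self._primary(block)
        if self.tags[idx] == block:
            return "primary hit"
        alt = self.links[idx]
        if alt is not None and self.tags[alt] == block:
            return "linked hit"              # second probe through the link field
        # Miss: keep the displaced (possibly frequently reused) victim reachable by moving it
        # to an assumed alternate line and recording that line in the link field.
        victim = self.tags[idx]
        if victim is not None:
            alt = (idx + self.num_lines // 2) % self.num_lines
            self.tags[alt] = victim
            self.links[idx] = alt
        self.tags[idx] = block
        return "miss"
```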

11.
Write-invalidate protocols suffer from memory-access penalties due to coherence misses. While write-update or hybrid update/invalidate protocols can reduce coherence misses, the update traffic can increase memory-system contention. We show in this paper that update-based cache protocols can perform significantly better than write-invalidate protocols by incorporating a write cache in each processing node. Because it is legal to delay the propagation of modifications of a block until the next synchronization under relaxed memory consistency models, a write cache can significantly reduce traffic by exploiting locality in write accesses. By concentrating on a cache-coherent NUMA architecture, we study the implementation aspects of augmenting a write-invalidate, a write-update and two hybrid update/invalidate protocols with write caches. Through detailed architectural simulations using five benchmark programs, we find that write caches, with only a few blocks each, help write-invalidate protocols to cut the false-sharing miss rate and hybrid update/invalidate protocols to keep other copies, including the memory copy, clean at an acceptable write traffic level. Overall, the memory-access penalty associated with coherence misses is drastically reduced.

12.
Traditional cache replacement algorithms deliver poor cache performance because they cannot adapt to the streaming access behavior of applications. We design a prediction method based on period detection, analyze the regularity of memory-reuse distances in programs and the complexity of streaming accesses, and propose RDP, an algorithm that uses reuse-distance prediction to adapt to both simple and complex streaming access patterns. The basic idea of RDP is to predict reuse distances, dynamically maintain reuse-distance counters, and dynamically adjust the replacement order of cached data, while reducing storage overhead through stream sampling. Experimental results show that RDP adapts well to the diverse streaming access patterns found in programs; its overall performance is better than that of the LRU and DIP algorithms, and on a 32 MB cache it reduces cache misses by 27.5% on average compared with traditional LRU.
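A simplified sketch of a reuse-distance-driven replacement decision: the predictor assumes the next reuse distance repeats the last observed one and evicts the resident block whose predicted next use is farthest away, so blocks touched only once (streaming data) go first. The period-detection and stream-sampling mechanisms of RDP are not modeled here.

```python
# Hedged sketch of reuse-distance-based replacement (a simplification, not the RDP design).

class ReuseDistancePredictor:
    def __init__(self, capacity):
        self.capacity = capacity
        self.time = 0
        self.last_access = {}        # block -> time of previous access
        self.predicted = {}          # block -> predicted reuse distance (last observed distance)
        self.resident = set()

    def access(self, block):
        self.time += 1
        if block in self.last_access:
            # Predict that the next reuse distance repeats the last observed one.
            self.predicted[block] = self.time - self.last_access[block]
        self.last_access[block] = self.time
        hit = block in self.resident
        if not hit:
            if len(self.resident) >= self.capacity:
                self.resident.remove(self._victim())
            self.resident.add(block)
        return hit

    def _victim(self):
        # Evict the resident block expected to be reused farthest in the future; blocks with no
        # prediction yet (likely streaming data seen once) are treated as never reused.
        def expected_next_use(b):
            return self.last_access[b] + self.predicted.get(b, float("inf"))
        return max(self.resident, key=expected_next_use)
```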

13.
In the race to improve cache performance, many researchers have proposed schemes that increase a cache's associativity. The associativity of a cache is the number of places in the cache where a block may reside. In a direct-mapped cache, which has an associativity of 1, there is only one location to search for a match for each reference. In a cache with associativity n (an n-way set-associative cache), there are n locations. Increasing associativity reduces the miss rate by decreasing the number of conflict, or interference, references. The column-associative cache and the predictive sequential associative cache seem to have achieved near-optimal performance for an associativity of two. Increasing associativity beyond two, therefore, is one of the most important ways to further improve cache performance. We propose two schemes for implementing associativity greater than two: the sequential multicolumn cache, which is an extension of the column-associative cache, and the parallel multicolumn cache. For an associativity of four, they achieve the low miss rate of a four-way set-associative cache. Our simulation results show that both schemes can effectively reduce the average access time.
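A minimal sketch of a sequential multicolumn-style lookup: up to four alternative locations of a direct-mapped array are probed one after another, approximating a four-way hit rate at the cost of extra probes. The index functions and fill policy are assumptions; real designs add swapping or way prediction so that most hits complete on the first probe.

```python
# Illustrative sketch of sequential probing over multiple "columns" of one direct-mapped array.

class SequentialMulticolumnCache:
    def __init__(self, num_lines=1024, columns=4):
        self.num_lines = num_lines
        self.columns = columns
        self.tags = [None] * num_lines

    def _locations(self, block):
        base = block % self.num_lines
        step = self.num_lines // self.columns
        # Column i rehashes the reference into a different quarter of the array (assumed hash).
        return [(base + i * step) % self.num_lines for i in range(self.columns)]

    def access(self, block):
        locs = self._locations(block)
        for probe, idx in enumerate(locs):
            if self.tags[idx] == block:
                return probe + 1                    # hit after `probe + 1` sequential probes
        # Miss: fill the first free alternative, otherwise replace the first-column line.
        for idx in locs:
            if self.tags[idx] is None:
                self.tags[idx] = block
                return 0
        self.tags[locs[0]] = block
        return 0
```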

14.
In this paper, we implement some notable hierarchical or decision-tree-based packet classification algorithms such as extended grid of tries (EGT), hierarchical intelligent cuttings (HiCuts), HyperCuts, and hierarchical binary search (HBS) on an IXP2400 network processor. By using all six of the available processing microengines (MEs), we find that none of these existing packet classification algorithms achieve the line speed of OC-48 provided by IXP2400. To improve the search speed of these packet classification algorithms, we propose the use of software cache designs to take advantage of the temporal locality of the packets because IXP network processors have no built-in caches for fast path processing in MEs. Furthermore, we propose hint-based cache designs to reduce the search duration of the packet classification data structure when cache misses occur. Both the header and prefix caches are studied. Although the proposed cache schemes are designed for all the dimension-by-dimension packet classification schemes, they are, nonetheless, the most suitable for HBS. Our performance simulations show that the HBS enhanced with the proposed cache schemes performs the best in terms of classification speed and number of memory accesses when the memory requirement is in the same range as those of HiCuts and HyperCuts. Based on the experiments with all the high and low locality packet traces, five MEs are sufficient for the proposed rule cache with hints to achieve the line speed of OC-48 provided by IXP2400.
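A hedged sketch of the software caching idea (not the IXP2400 microengine code): an exact-match header cache sits in front of the classifier, and a coarser hint cache remembers where in the classification structure a related search ended, so a header-cache miss can resume part-way down instead of from the root. The coarse key, the wholesale eviction, and the dummy classifier are assumptions.

```python
# Sketch of a header cache plus hint cache in front of a packet classifier (names are assumptions).

class HintedClassifierCache:
    def __init__(self, classify, capacity=4096):
        self.classify = classify              # classify(header, hint) -> (rule, new_hint)
        self.capacity = capacity
        self.header_cache = {}                # exact 5-tuple -> matched rule
        self.hint_cache = {}                  # coarse key -> search hint (e.g. a subtree node)

    @staticmethod
    def _coarse_key(header):
        src, dst, sport, dport, proto = header
        return dst & 0xFFFFFF00               # assumption: one hint per destination /24

    def lookup(self, header):
        rule = self.header_cache.get(header)
        if rule is not None:
            return rule                       # header-cache hit: no classifier walk at all
        hint = self.hint_cache.get(self._coarse_key(header))
        rule, new_hint = self.classify(header, hint)
        if len(self.header_cache) >= self.capacity:
            self.header_cache.clear()         # toy wholesale eviction; a real cache evicts per entry
        self.header_cache[header] = rule
        self.hint_cache[self._coarse_key(header)] = new_hint
        return rule

# Dummy classifier for illustration only: the "rule" is the destination /24 and the hint is unused.
cache = HintedClassifierCache(lambda h, hint: (h[1] >> 8, h[1] >> 8))
print(cache.lookup((0x0A000001, 0xC0A80101, 1234, 80, 6)))   # full classification on first packet
print(cache.lookup((0x0A000001, 0xC0A80101, 1234, 80, 6)))   # served from the header cache
```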

15.
In embedded systems, caches are precious for keeping memory bandwidth requirements low and for allowing the use of slow and narrow off-chip devices. Conversely, the power and die-size resources consumed by the cache force embedded system designers to use small and simple cache memories. Such caches can experience poor performance because of their inflexible placement policy. In this scenario, a big fraction of the misses can originate from the mismatch between the cache behavior and the locality features of the memory accesses (conflict misses). In this paper we analyze the conflict miss phenomenon and define a cache utilization measure. We then propose an object-level Cache Aware allocation Technique (CAT) that transforms the application to fit the cache structure, minimizing the number of conflict misses and maximizing cache exploitation. The solution transforms the program layout using the standard functionality of a linker. The CAT approach allowed the considered applications to deliver the same performance on caches two, and sometimes four, times smaller. Moreover, the CAT-improved programs running on direct-mapped caches outperformed the original versions running on set-associative caches. These results highlight that our approach can help embedded system designers meet system requirements with smaller and simpler cache memories.

16.
Cache misses in small, limited-associativity primary caches very often replace live cache blocks, given the dominance of capacity and conflict misses. Towards motivating novel cache organizations, we study the comparative characteristics of the virtual memory address pairs involved in typical primary-cache contention (block replacements) for the SPEC2000 integer benchmarks. We focus on the cache tag bits, and results show that (i) often just a few tag bits differ between contending addresses, and (ii) accesses to certain segments or page groups of the virtual address space (i.e. certain tag-bit groups) contend frequently. Cache-conscious virtual address space allocation can further reduce the number of conflicting tag bits. We mention two directions for exploiting such page-level contention patterns to improve cache cost and performance.
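A small sketch of the kind of measurement described: replay an address trace through a direct-mapped cache and, at every replacement, record the population count of the XOR of the incoming and evicted tags. The cache geometry and the synthetic trace are illustrative.

```python
# Count how many tag bits differ between contending addresses in a direct-mapped cache.
from collections import Counter

def tag_bit_difference_profile(addresses, num_sets=512, block_bytes=32):
    cache = {}                                   # set index -> resident tag
    profile = Counter()                          # differing-tag-bit count -> occurrences
    for addr in addresses:
        block = addr // block_bytes
        set_idx, tag = block % num_sets, block // num_sets
        old_tag = cache.get(set_idx)
        if old_tag is not None and old_tag != tag:
            profile[bin(old_tag ^ tag).count("1")] += 1    # popcount of the XOR of the two tags
        cache[set_idx] = tag
    return profile

# Two arrays whose base addresses differ in a single tag bit contend set-by-set.
trace = []
for i in range(4096):
    trace.append(0x00100000 + 4 * i)             # array A
    trace.append(0x00120000 + 4 * i)             # array B, one differing tag bit
print(tag_bit_difference_profile(trace))
```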

17.
Bus-based multiprocessors constitute a cost-effective class of shared-memory multiprocessors. Private caches are the key to an efficient utilization of the shared bus, and most such systems use a write-invalidate cache-coherence protocol to keep the caches coherent. Two important factors that limit the performance of the system are cache misses that lead to long-latency reads and bus congestion because of read misses and coherence traffic. While hybrid write-invalidate/write-update snooping protocols lead to fewer read misses than write-invalidate protocols, previous studies have shown them to be incapable of providing consistent performance improvements because of heavily increased coherence traffic. In this paper, we analyze how the deficiencies of hybrid snooping protocols can be dramatically reduced by using write caches and read snarfing (also called read-broadcast) under release consistency. Our performance evaluation is based on program-driven simulation and a set of five scientific applications with different sharing behaviors including migratory sharing as well as producer–consumer sharing. We show that one of the evaluated hybrid protocols, extended with write caches as well as read snarfing, manages to reduce the number of coherence misses by between 83 and 93% as compared to a write-invalidate protocol for all five applications in this study. In addition, the number of bus transactions is reduced substantially. However, we also show that read snarfing and hybrid snooping protocols might lead to higher cache occupancy because of increased sharing. Because of the small implementation cost of the hybrid protocol and the two extensions, we believe the combination to be an effective approach to boosting the performance of bus-based multiprocessors.

18.
In this article, we discuss the performance modeling and optimization of Sparse Matrix-Vector Multiplication (SpMV) on NVIDIA GPUs using CUDA. SpMV has a very low computation-to-data ratio and its performance is mainly bound by memory bandwidth. We propose optimizations of SpMV based on ELLPACK from two aspects: (1) enhanced performance for the dense vector by reducing cache misses, and (2) reduced accessed matrix data by index reduction. With matrix bandwidth reduction techniques, both cache usage enhancement and index compression can be enabled. For GPUs with better cache support, we propose a differentiated memory access scheme to avoid contamination of the caches by matrix data. Performance evaluation shows that the combined speedups of the proposed optimizations are 16% (single precision) and 12.6% (double precision) on the GT-200 GPU, and 19% (single precision) and 15% (double precision) on the GF-100 GPU.
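To make the ELLPACK storage format concrete, here is a minimal NumPy sketch of packing a sparse matrix into padded column-index and value arrays and multiplying it by a dense vector. The paper's CUDA kernels, bandwidth-reduction reordering, and index-compression steps are not shown.

```python
# Minimal ELLPACK SpMV sketch (CPU-side, for illustration only).
import numpy as np

def ellpack_from_rows(rows):
    """Pack a list of {col: value} rows into ELLPACK arrays padded to the longest row."""
    max_nnz = max(len(r) for r in rows)
    cols = np.zeros((len(rows), max_nnz), dtype=np.int32)     # padded column indices
    vals = np.zeros((len(rows), max_nnz), dtype=np.float64)   # padded values (0.0 = padding)
    for i, r in enumerate(rows):
        for j, (c, v) in enumerate(sorted(r.items())):
            cols[i, j], vals[i, j] = c, v
    return cols, vals

def ellpack_spmv(cols, vals, x):
    """y[i] = sum_j vals[i, j] * x[cols[i, j]]; padded entries contribute zero."""
    return (vals * x[cols]).sum(axis=1)

rows = [{0: 4.0, 2: 1.0}, {1: 3.0}, {0: 1.0, 1: 2.0, 2: 5.0}]
cols, vals = ellpack_from_rows(rows)
x = np.array([1.0, 2.0, 3.0])
print(ellpack_spmv(cols, vals, x))     # expected: [ 7.  6. 20.]
```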

19.
A NUCA Substrate for Flexible CMP Cache Sharing
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.

20.
In this article, we analyze the file access characteristics of smartphone applications and find that a large portion of file data in smartphones is written only once. This phenomenon arises from the behavior of SQLite, a lightweight database library used in most smartphone applications. Based on this observation, we present a new buffer cache management scheme for smartphone systems that considers the non-reusability of the write-only-once data that we observe. A buffer cache improves file access performance by maintaining hot data in memory, thereby servicing subsequent requests without storage accesses. The proposed scheme classifies write-only-once data and aggressively evicts them from the buffer cache to improve cache space utilization. Experimental results with various real smartphone applications show that the proposed buffer cache management scheme improves the performance of the smartphone buffer cache by 5%–33%. We also show that our scheme can reduce the buffer cache size to 1/4 of the original system without performance degradation, which allows the reduction of energy consumption in a smartphone memory system by 27%–92%.
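A hedged sketch of the buffer-cache policy described: blocks written exactly once and never re-read are classified as write-only-once and evicted ahead of ordinary LRU victims. The classification rule and data structures are simplifications, not the paper's implementation.

```python
# Sketch of a write-only-once-aware buffer cache (illustrative classification and eviction).
from collections import OrderedDict

class WriteOnceAwareBufferCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()            # block -> {"writes": n, "reads": n}, LRU order

    def _evict(self):
        # Prefer a block that was written exactly once and never read again (e.g. SQLite
        # journal-style data); fall back to plain LRU if no such block exists.
        for blk, stats in list(self.blocks.items()):
            if stats["writes"] == 1 and stats["reads"] == 0:
                del self.blocks[blk]
                return
        self.blocks.popitem(last=False)

    def access(self, blk, is_write):
        hit = blk in self.blocks
        if hit:
            stats = self.blocks.pop(blk)       # move to MRU position on a hit
        else:
            if len(self.blocks) >= self.capacity:
                self._evict()
            stats = {"writes": 0, "reads": 0}
        stats["writes" if is_write else "reads"] += 1
        self.blocks[blk] = stats
        return hit
```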
