Similar Literature
20 similar documents found.
1.
Write-invalidate protocols suffer from memory-access penalties due to coherence misses. While write-update or hybrid update/invalidate protocols can reduce coherence misses, the update traffic can increase memory-system contention. We show in this paper that update-based cache protocols can perform significantly better than write-invalidate protocols by incorporating a write cache in each processing node. Because it is legal to delay the propagation of modifications of a block until the next synchronization under relaxed memory consistency models, a write cache can significantly reduce traffic by exploiting locality in write accesses. By concentrating on a cache-coherent NUMA architecture, we study the implementation aspects of augmenting a write-invalidate, a write-update and two hybrid update/invalidate protocols with write caches. Through detailed architectural simulations using five benchmark programs, we find that write caches, with only a few blocks each, help write-invalidate protocols to cut the false-sharing miss rate and hybrid update/invalidate protocols to keep other copies, including the memory copy, clean at an acceptable write traffic level. Overall, the memory-access penalty associated with coherence misses is drastically reduced.
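A minimal sketch of the mechanism described above: a per-node write cache that coalesces writes to the same block between synchronization points. The capacity, block size, and FIFO-style eviction below are illustrative assumptions, not the paper's parameters.

```python
class WriteCache:
    """Coalesces writes to the same block between synchronization points.

    Under a relaxed consistency model, updates only have to be made
    visible at the next synchronization, so repeated writes to one
    block cost a single update message instead of one per write.
    """

    def __init__(self, num_blocks=4, block_size=32):
        self.capacity = num_blocks          # a few blocks suffice, per the paper
        self.block_size = block_size
        self.dirty = {}                     # block number -> {offset: value}
        self.messages_sent = 0

    def write(self, addr, value):
        block, offset = divmod(addr, self.block_size)
        if block not in self.dirty and len(self.dirty) == self.capacity:
            self.flush_block(next(iter(self.dirty)))   # evict oldest entry
        self.dirty.setdefault(block, {})[offset] = value

    def flush_block(self, block):
        self.messages_sent += 1             # one coherence update per block
        del self.dirty[block]

    def synchronize(self):
        """Propagate all pending updates at a synchronization point."""
        for block in list(self.dirty):
            self.flush_block(block)

wc = WriteCache()
for i in range(8):
    wc.write(i, i)          # 8 writes all fall into one 32-byte block
wc.synchronize()
print(wc.messages_sent)     # 1 update message instead of 8
```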

2.
Coherence misses and invalidation traffic limit the performance of bus-based multiprocessors using write-invalidate snooping caches. This paper considers optimizations of a write-invalidate protocol that remove such overhead. While coherence misses are attacked by a hybrid update/invalidate protocol and another technique where update instructions are selectively inserted by a compiler, invalidation traffic is reduced by three optimizations that coalesce ownership acquisition with miss handling: migrate-on-dirty, an adaptive hardware-based scheme, and compiler-controlled insertion of load-exclusive instructions.

The relative effectiveness of these optimizations is evaluated using detailed architectural simulations and a set of four parallel programs. We find that while both of the update-based schemes effectively remove most coherence misses, the hybrid update/invalidate scheme causes lower traffic. By contrast, the compiler-based approach to cutting invalidation traffic is slightly more efficient than the adaptive hardware-based scheme. Moreover, the migrate-on-dirty heuristic is found to have devastating effects on the miss rate.
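The abstract does not spell out the hybrid scheme's policy; a classic competitive-update policy (keep updating a remote copy until it absorbs N consecutive updates without a local access, then invalidate it) is a common way to realize such a hybrid. The threshold below is an assumed value.

```python
COMPETITIVE_THRESHOLD = 4   # assumed value; the abstract does not specify one

class CacheCopy:
    """One remote cached copy under a competitive update/invalidate policy."""

    def __init__(self):
        self.valid = True
        self.unused_updates = 0   # updates received since the last local access

    def local_access(self):
        self.unused_updates = 0   # the copy is being used: keep updating it

    def remote_write(self):
        """Receive an update; self-invalidate after too many unused updates."""
        if not self.valid:
            return "no-op"
        self.unused_updates += 1
        if self.unused_updates >= COMPETITIVE_THRESHOLD:
            self.valid = False    # stop paying update traffic for a dead copy
            return "invalidated"
        return "updated"

copy = CacheCopy()
for i in range(5):
    print(i, copy.remote_write())   # updated x3, then invalidated, then no-op
```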


3.
4.
The success of large-scale, hierarchical and distributed shared memory systems hinges on our ability to reduce delays resulting from remote accesses to shared data. To facilitate this, we present a compile-time algorithm for analyzing programs with doall-style parallelism to determine when read and write accesses to shared data are redundant (unnecessary). Once identified, redundant remote accesses can be replaced by local accesses or eliminated entirely. This optimization improves program performance in two ways. First, slow memory accesses are replaced by faster ones. Second, the time to perform other remote memory accesses may be reduced as a result of the decreased traffic level. We also show how the information obtained through redundancy analysis can be used for other compiler optimizations such as prefetching and cache management.
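A rough run-time illustration of the redundancy condition the compile-time analysis proves statically: a second read of a shared location is redundant if no remote write could have intervened. The trace format and helper are hypothetical devices, not the paper's algorithm.

```python
def mark_redundant_reads(accesses):
    """Flag remote reads that are redundant because the same shared
    location was already fetched and nothing invalidated it since.

    `accesses` is a list of (op, location) pairs from one processor's
    doall iteration; a real compiler derives this information statically.
    """
    live = set()          # locations whose last fetched value is still valid
    redundant = []
    for i, (op, loc) in enumerate(accesses):
        if op == "read":
            if loc in live:
                redundant.append(i)     # can become a local access
            live.add(loc)
        elif op == "remote_write":
            live.discard(loc)           # another processor may change it
    return redundant

trace = [("read", "A"), ("read", "A"), ("remote_write", "A"), ("read", "A")]
print(mark_redundant_reads(trace))      # [1] -- only the second read is redundant
```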

5.
刘荣  潘洪志  刘波  祖婷  方群  何昕  王杨 《计算机应用》2018,38(2):348-351
To address the problem that data in cloud computing is vulnerable to illegal theft and malicious tampering, a ciphertext-policy attribute-based encryption scheme supporting dynamic update operations (DU-CPABE) is proposed. First, the data is divided into fixed-size blocks using a linear-splitting idea; then each block is encrypted with the ciphertext-policy attribute-based encryption (CP-ABE) algorithm; finally, an Address-Merkle Hash Tree (A-MHT) search structure is proposed, with which data blocks can be quickly located to realize dynamic updates of data on the cloud server. Theoretical analysis verifies the security of the scheme, and simulation results over an ideal channel show that, at 5 updates, the scheme reduces the data-update time overhead by 14.6% on average compared with the CP-ABE scheme. The experimental results show that the DU-CPABE scheme effectively reduces the time overhead of dynamic data updates in cloud computing services as well as the overall system overhead.
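The A-MHT's exact layout is not given in the abstract; a plain binary Merkle hash tree over fixed-size blocks, indexed by block position and recomputed only along the leaf-to-root path on update, sketches the underlying mechanism.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class MerkleTree:
    """Binary Merkle tree over fixed-size blocks; the leaf index acts as
    the 'address' used to locate a block, and an update recomputes only
    the hashes on the leaf-to-root path (O(log n) work)."""

    def __init__(self, blocks):
        n = 1
        while n < len(blocks):
            n *= 2
        self.n = n
        self.nodes = [b""] * (2 * n)
        for i in range(n):                       # hash the leaves
            data = blocks[i] if i < len(blocks) else b""
            self.nodes[n + i] = h(data)
        for i in range(n - 1, 0, -1):            # build internal nodes
            self.nodes[i] = h(self.nodes[2 * i] + self.nodes[2 * i + 1])

    def update(self, index, new_block):
        i = self.n + index
        self.nodes[i] = h(new_block)
        i //= 2
        while i >= 1:                            # refresh the ancestor path only
            self.nodes[i] = h(self.nodes[2 * i] + self.nodes[2 * i + 1])
            i //= 2

    def root(self):
        return self.nodes[1].hex()

tree = MerkleTree([b"block0", b"block1", b"block2", b"block3"])
before = tree.root()
tree.update(2, b"block2-v2")    # dynamic update of one data block
print(before != tree.root())    # True: the root now authenticates the new data
```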

6.
An improved ant colony routing algorithm is proposed. The algorithm replaces the traditional routing table with dynamically updated probabilities and introduces an interference coefficient as heuristic information, which improves the convergence speed. Validation shows that the algorithm converges faster and achieves better throughput. When a network node fails, the algorithm quickly updates the information held at the nodes, letting the network return to a steady state.
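The exact update rule is not given in the abstract; a pheromone-style sketch with an assumed interference coefficient shows the general shape of a dynamically updated next-hop probability table.

```python
def update_route_probs(probs, chosen, reward, interference, rho=0.1):
    """One dynamic probability update for a node's next-hop table.

    probs        : {neighbor: probability}, replacing a static routing table
    chosen       : neighbor just used to forward a packet
    reward       : feedback from that choice (e.g., inverse delay), in [0, 1]
    interference : assumed heuristic coefficient in [0, 1]; higher
                   interference on the chosen link weakens its reinforcement
    rho          : evaporation/learning rate
    """
    reinforcement = rho * reward * (1.0 - interference)
    for n in probs:
        if n == chosen:
            probs[n] += reinforcement * (1.0 - probs[n])
        else:
            probs[n] -= reinforcement * probs[n]
    total = sum(probs.values())
    for n in probs:                 # renormalize to a proper distribution
        probs[n] /= total
    return probs

table = {"B": 0.34, "C": 0.33, "D": 0.33}
print(update_route_probs(table, "B", reward=0.9, interference=0.2))
```

On node failure, the same rule with a strong negative reward (or simply renormalizing after dropping the failed neighbor) quickly shifts probability mass to the surviving links.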

7.
The Euler and Navier-Stokes equations with a k-ε turbulence model are solved numerically in parallel on a distributed memory machine IBM SP2, a shared memory machine SGI Power Challenge, and a cluster of SGI workstations. The grid is partitioned into blocks and the steady state solution is computed using single grid and multigrid iteration. The multigrid algorithm is analyzed leading to an estimate of the elapsed time per iteration. Based on this analysis, a heuristic algorithm is devised for distributing and splitting the blocks for a good static load balance. Speed-up results are presented for a wing, a complete aircraft and an air inlet.
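A simplified stand-in for this kind of heuristic: greedy largest-first assignment of blocks to the least-loaded processor, splitting a block when it would break the load-balance limit. The imbalance threshold is an assumed parameter; the paper's heuristic is driven by its per-iteration time estimate.

```python
import heapq

def distribute_blocks(block_sizes, num_procs, max_imbalance=1.10):
    """Greedy static load balancing of grid blocks onto processors:
    largest block first, each to the least-loaded processor; a block is
    split in half if it would push that processor past the imbalance
    limit."""
    limit = (sum(block_sizes) / num_procs) * max_imbalance
    heap = [(0, p, []) for p in range(num_procs)]   # (load, proc, blocks)
    heapq.heapify(heap)
    work = sorted(block_sizes, reverse=True)
    while work:
        size = work.pop(0)
        load, p, blocks = heapq.heappop(heap)
        if load + size > limit and size > 1:
            half = size // 2
            work.extend([half, size - half])        # split and retry
            work.sort(reverse=True)
        else:
            blocks.append(size)
            load += size
        heapq.heappush(heap, (load, p, blocks))
    return sorted(heap, key=lambda entry: entry[1])

for load, p, blocks in distribute_blocks([64, 48, 40, 24, 16, 8], 3):
    print(f"proc {p}: load={load} blocks={blocks}")
```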

8.
As the speed gap between processors and memory keeps widening, memory instructions, especially those that frequently miss in the cache, become a major performance bottleneck. Since the compiler cannot know how many cycles a memory instruction will actually take at run time, it typically assumes either the cache-hit or the cache-miss latency for such instructions, which is inaccurate. We introduce cache profiling to collect run-time hit/miss information for memory instructions and use it to compute their latencies. On an out-of-order machine, the hardware scheduler handles instructions within the issue window well, while the compiler has the advantage when scheduling over longer instruction ranges. Once a cache miss occurs, the reorder buffer easily fills up and stalls the pipeline. Scheduling miss-prone instructions so that they execute in parallel hides their long latencies and improves performance. For load instructions, we therefore both adjust the assumed latency of frequently missing loads and modify the scheduling policy to increase memory-level parallelism. Experiments show that our scheduling improves bzip2 by up to 4.8%, art by 4%, and the overall average by 1.5%.
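A sketch of the two levers described above, with illustrative latencies and an assumed miss-rate cutoff; the real scheduler is a full compiler pass, not this one-line sort.

```python
MISS_LATENCY = 100     # cycles; illustrative values, not the paper's machine
HIT_LATENCY = 3
MISS_THRESHOLD = 0.2   # assumed cutoff for "frequently missing"

def assign_latencies(loads, miss_profile):
    """Give each load the latency its cache-profiling data suggests,
    instead of a blanket hit (or miss) assumption."""
    latencies = {}
    for load in loads:
        miss_rate = miss_profile.get(load, 0.0)
        latencies[load] = MISS_LATENCY if miss_rate >= MISS_THRESHOLD else HIT_LATENCY
    return latencies

def schedule(loads, latencies):
    """Hoist long-latency loads first so their misses overlap in flight
    (greater memory-level parallelism); a stand-in for the modified
    scheduling policy the abstract describes."""
    return sorted(loads, key=lambda l: -latencies[l])

profile = {"load_a": 0.55, "load_b": 0.01, "load_c": 0.30}
lat = assign_latencies(["load_a", "load_b", "load_c"], profile)
print(schedule(["load_a", "load_b", "load_c"], lat))  # miss-prone loads first
```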

9.
A NUCA Substrate for Flexible CMP Cache Sharing
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing, from unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, to completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.
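A sketch of how a sharing degree selects which banks a processor may use; the contiguous slicing is an assumption, since the paper evaluates several bank-mapping policies.

```python
NUM_PROCS = 16
NUM_BANKS = 64   # 16-Mbyte pool of 64 L2 banks, as in the paper

def banks_for(proc, sharing_degree):
    """Banks a processor may cache into, for a given sharing degree.

    Processors are grouped into NUM_PROCS // sharing_degree clusters;
    each cluster shares an equal slice of the 64 banks."""
    cluster = proc // sharing_degree
    banks_per_cluster = NUM_BANKS * sharing_degree // NUM_PROCS
    start = cluster * banks_per_cluster
    return range(start, start + banks_per_cluster)

for degree in (1, 2, 4, 16):
    print(f"degree {degree:2}: proc 0 -> banks {list(banks_for(0, degree))[:4]}...")
# degree 1 gives each processor 4 private banks (lowest hit latency);
# degree 16 lets every processor use all 64 banks (fewest misses).
```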

10.
徐怡  肖鹏 《计算机应用》2019,39(5):1247-1251
To handle the situation where a missing value in a changing incomplete information system acquires a concrete attribute value, and to address the low time efficiency of updating approximation sets in multi-granulation rough sets, a dynamic update algorithm for approximation sets based on the tolerance relation is proposed. First, the properties of how tolerance-relation-based approximation sets change are discussed, and from these properties the change trends of the approximation sets of optimistic and pessimistic multi-granulation rough sets are derived. Then, to address the inefficiency of updating tolerance classes, a theorem for dynamically updating tolerance classes is given. Finally, on this basis, a dynamic update algorithm for approximation sets based on the tolerance relation is designed. Simulation experiments on four UCI data sets show that as the data sets grow, the computation time of the proposed update algorithm is far less than that of the static update algorithm; that is, the dynamic algorithm is more time-efficient than the static one, verifying the correctness and efficiency of the proposed algorithm.
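A small sketch of the locality that makes the dynamic update cheap: filling in a missing value can only shrink tolerance classes, and only those involving the changed object. The toy data and helpers are illustrative, not the paper's algorithm in full.

```python
MISSING = None   # the '*' of rough-set notation

def tolerant(x, y):
    """Tolerance relation for incomplete data: objects agree on every
    attribute unless one of them has a missing value there."""
    return all(a is MISSING or b is MISSING or a == b for a, b in zip(x, y))

def tolerance_classes(objects):
    return {i: {j for j, y in enumerate(objects) if tolerant(x, y)}
            for i, x in enumerate(objects)}

def update_on_fill(objects, classes, i, attr, value):
    """Incremental update when object i's missing attribute gets a value:
    only pairs involving i can change, and only by *leaving* a tolerance
    class, since a filled value can break agreements but never create
    them. This locality is what makes the dynamic update cheap."""
    objects[i] = objects[i][:attr] + (value,) + objects[i][attr + 1:]
    for j in list(classes[i]):
        if not tolerant(objects[i], objects[j]):
            classes[i].discard(j)
            classes[j].discard(i)
    return classes

objs = [(1, MISSING), (1, 2), (1, 3)]
cls = tolerance_classes(objs)          # object 0 tolerates everyone
print(cls[0])                          # {0, 1, 2}
update_on_fill(objs, cls, 0, 1, 2)     # the missing value becomes 2
print(cls[0])                          # {0, 1} -- only pairs with 0 changed
```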

11.
In this paper we propose and evaluate a new data-prefetching technique for cache coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40–50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach is when a trap handler completes after data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it uses prefetching with very little instruction overhead, which is a limitation for traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining a high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would not require any additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead less than 0.42%. The emulated prefetch engine removes in general less stall time at a higher instruction overhead.
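A toy software model of the control flow: a miss trap programs the engine, which then runs ahead on its own. The real engine is a hardware functional unit programmed by the compiler's static analysis, and the strided descriptor here is an assumption.

```python
class PrefetchEngine:
    """Toy model of a compiler-controlled prefetch engine started by a
    second-level cache miss trap."""

    def __init__(self, cache):
        self.cache = cache

    def miss_trap(self, addr, stride, count):
        """Short trap handler (tens of cycles in the paper): program the
        engine, then let it run with no further instruction overhead."""
        for k in range(1, count + 1):
            self.cache.prefetch(addr + k * stride)

class Cache:
    def __init__(self):
        self.lines = set()

    def prefetch(self, addr):
        self.lines.add(addr)

    def load(self, addr, engine, stride=64, ahead=4):
        if addr in self.lines:
            return "hit"
        self.lines.add(addr)
        engine.miss_trap(addr, stride, ahead)   # the miss starts the engine
        return "miss"

cache = Cache()
engine = PrefetchEngine(cache)
print([cache.load(a, engine) for a in range(0, 64 * 6, 64)])
# ['miss', 'hit', 'hit', 'hit', 'hit', 'miss'] -- one trap covers 4 later loads
```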

12.
Efficiently maintaining cache coherence is a major problem in large-scale shared memory multiprocessors. Hardware directory coherence schemes have very high memory requirements, while software-directed schemes must rely on imprecise compile-time memory disambiguation. Recently proposed dynamically tagged directory schemes allocate pointers to blocks only as they are referenced, which significantly reduces their memory requirements, but they still allocate pointers to blocks that do not need them. The authors present two compiler optimizations that exploit the high-level sharing information available to the compiler to further reduce the size of a tagged directory by allocating pointers only when necessary. Trace-driven simulations are used to show that the performance of this combined hardware-software approach is comparable to other coherence schemes, but with significantly lower memory requirements. In addition, these simulations suggest that this approach is less sensitive to the quality of the memory disambiguation and interprocedural analysis performed by the compiler than software-only coherence schemes.

13.
In this paper, we present compiler algorithms for detecting references to stale data in shared-memory multiprocessors. The algorithm consists of two key analysis techniques, stale reference detection and locality preserving analysis. While stale reference detection finds the memory reference patterns that may violate cache coherence, the locality preserving analysis minimizes the number of such stale references by analyzing both temporal and spatial reuses. By computing the regions referenced by arrays inside loops, we extend the previous scalar algorithms for more precise analysis. We develop a full interprocedural array data-flow algorithm, which performs both bottom-up side-effect analysis and top-down context analysis on the procedure call graph to further exploit locality across procedure boundaries. The interprocedural algorithm eliminates cache invalidations at procedure boundaries, which were assumed in the previous compiler algorithms. We have fully implemented the algorithm in the Polaris parallelizing compiler. Using execution-driven simulations on Perfect Club benchmarks, we demonstrate how unnecessary cache misses can be eliminated by the automatic stale reference detection. The algorithm can be used to implement cache coherence in shared-memory multiprocessors that do not have hardware directories, such as the Cray T3D.
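A run-time illustration of the staleness condition the compiler proves statically: a cached copy is stale if another processor wrote the location after this processor last fetched it. The version counters and trace format are hypothetical devices, not the paper's mechanism.

```python
def find_stale_references(accesses):
    """Flag reads that may return stale data on a machine without
    hardware coherence.

    accesses: list of (proc, op, location) tuples in program order.
    """
    version = {}                               # location -> global write count
    fetched = {}                               # (proc, location) -> version seen
    stale = []
    for i, (p, op, loc) in enumerate(accesses):
        if op == "write":
            version[loc] = version.get(loc, 0) + 1
            fetched[(p, loc)] = version[loc]   # the writer's copy is current
        else:  # read
            seen = fetched.get((p, loc))
            if seen is not None and seen < version.get(loc, 0):
                stale.append(i)                # cached copy is out of date
            fetched[(p, loc)] = version.get(loc, 0)
    return stale

trace = [(0, "read", "A"), (1, "write", "A"), (0, "read", "A")]
print(find_stale_references(trace))   # [2]: proc 0 re-reads a stale copy
```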

14.
Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that an optimistic support for forwarding speeds up five applications by an average of 50% for large caches and 30% for small caches. For large caches, most sharing read misses are eliminated, while for small caches, forwarding does not increase the number of conflict misses significantly. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups.
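A toy model of the forwarding step. In hardware the copies travel as coherence messages; the consumer list comes from the compiler analysis the paper describes.

```python
class Node:
    """One processing node with a private cache."""

    def __init__(self, pid):
        self.pid = pid
        self.cache = {}

    def produce(self, var, value, consumers):
        """Write a datum and, per compiler-identified consumers, forward
        a copy to their caches so their later reads hit locally."""
        self.cache[var] = value
        for node in consumers:
            node.cache[var] = value     # the forwarded copy

    def consume(self, var):
        if var in self.cache:
            return ("local hit", self.cache[var])
        return ("remote miss", None)    # the long-latency access forwarding avoids

p0, p1, p2 = Node(0), Node(1), Node(2)
p0.produce("x", 42, consumers=[p1])     # compiler said p1 reads x next
print(p1.consume("x"))                  # ('local hit', 42)
print(p2.consume("x"))                  # ('remote miss', None) -- not forwarded
```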

15.
Effective use of cache memory is becoming more important as the gap between processor speed and memory access speed widens, and multigrain parallelism is becoming more important for improving effective performance beyond the limits of loop-iteration-level parallelism. Considering these factors, this paper proposes a coarse-grain task static scheduling scheme that incorporates cache optimization. The proposed scheme schedules coarse-grain tasks to threads so that shared data among coarse-grain tasks can be passed via the cache, after task and data decomposition that takes the cache size into account at compile time. It is implemented in the OSCAR Fortran multigrain parallelizing compiler and evaluated on a Sun Ultra80 four-processor SMP workstation using Swim and Tomcatv from SPEC fp 95. The proposed scheme achieves speedups on 4 processors of 4.56 for Swim and 2.37 for Tomcatv over the Sun Forte HPC Ver. 6 update 1 loop-parallelizing compiler.
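A compact sketch of the assignment idea: tasks that touch the same data set are pinned to one thread so the data passes through its cache. This is a stand-in for OSCAR's scheme, which also sizes the decomposition against the cache at compile time.

```python
def schedule_with_cache_locality(tasks, shared, num_threads):
    """Assign coarse-grain tasks to threads so that tasks sharing data
    land on the same thread and pass it via the cache.

    shared: maps each task to the data-set id it touches (assumed input;
    a real compiler derives this from its decomposition)."""
    threads = [[] for _ in range(num_threads)]
    group_of = {}                    # data-set id -> thread index
    loads = [0] * num_threads
    for t in tasks:
        ds = shared[t]
        if ds not in group_of:       # first task on this data set goes to
            group_of[ds] = loads.index(min(loads))   # the least-loaded thread
        tid = group_of[ds]
        threads[tid].append(t)
        loads[tid] += 1
    return threads

tasks = ["t0", "t1", "t2", "t3", "t4", "t5"]
shared = {"t0": "A", "t1": "A", "t2": "B", "t3": "B", "t4": "C", "t5": "C"}
print(schedule_with_cache_locality(tasks, shared, 4))
# [['t0', 't1'], ['t2', 't3'], ['t4', 't5'], []] -- sharers stay together
```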

16.
Extensibility in complex compiler systems goes well beyond modularity of design, and it needs to be considered from the early stages of the design, especially the design of the intermediate representation. One of the primary barriers to compiler pass extensibility and modularity is interference between passes caused by transformations that invalidate existing analysis information. In this paper, we present a callback system that automatically tracks changes to the compiler's internal representation (IR), allowing full pass reordering and providing an easy-to-use interface for developing lazy-update incremental analysis passes. We present a new algorithm for incremental interprocedural data-flow analysis and demonstrate the benefits of our design framework and our prototype compiler system. It is shown that compilation time for multiple data-flow analysis algorithms can be cut in half by incrementally updating data-flow analysis.
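A skeleton of such a callback system: the IR notifies registered analyses on every mutation, and an analysis defers its update until the next query. This is illustrative only; the paper's incremental interprocedural data-flow algorithm is far more involved.

```python
class IR:
    """Minimal IR that notifies registered analyses whenever a
    transformation mutates it, so analyses can invalidate lazily and
    update incrementally instead of recomputing from scratch."""

    def __init__(self):
        self.instructions = []
        self.listeners = []

    def register(self, analysis):
        self.listeners.append(analysis)

    def insert(self, index, instr):
        self.instructions.insert(index, instr)
        for a in self.listeners:
            a.on_change("insert", index, instr)   # the callback hook

class ReachingDefs:
    """Lazy-update incremental analysis: changes are queued, and results
    are refreshed only when next queried (a skeleton, not the paper's
    interprocedural algorithm)."""

    def __init__(self, ir):
        self.ir = ir
        self.pending = []
        ir.register(self)

    def on_change(self, kind, index, instr):
        self.pending.append((kind, index, instr))  # defer the work

    def query(self):
        if self.pending:
            # incrementally fold the queued edits into the solution here
            self.pending.clear()
        return f"defs over {len(self.ir.instructions)} instrs"

ir = IR()
analysis = ReachingDefs(ir)
ir.insert(0, "x = 1")          # the transformation fires the callback
print(analysis.query())        # the analysis updates lazily, on demand
```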

17.
Finding new memory technologies to replace DRAM is a current research focus. Phase-change memory (PCM) has attracted wide attention for its low power, high storage density, and non-volatility, but its write endurance is limited, so using it as main memory requires reducing the writes made to it. An effective approach to this problem is to optimize the cache replacement policy so that fewer dirty blocks are evicted from the cache. Existing work mainly protects dirty blocks by giving them a higher priority on insertion and on hits, but makes no distinction between dirty and clean blocks during demotion, so the cache may still evict dirty blocks first even when many clean blocks are present. A new cache replacement policy, MAC, is proposed: it uses a multi-dimensional level structure to place a strict boundary between dirty and clean blocks, giving dirty blocks stronger protection. Simulations show that, compared with the LRU replacement policy, MAC reduces memory writes by about 25.12% on average at a low hardware cost, with almost no impact on program performance.
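A sketch of the dirty-protecting boundary using two LRU lists; the paper's MAC policy uses a multi-dimensional level structure, so treat this as the core idea only.

```python
from collections import OrderedDict

class MACLikeCache:
    """Replacement policy that prefers evicting clean blocks so dirty
    blocks (the ones that cost a PCM write on eviction) stay cached."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.clean = OrderedDict()    # each list kept in LRU order
        self.dirty = OrderedDict()

    def access(self, block, is_write):
        was_dirty = block in self.dirty
        self.clean.pop(block, None)
        self.dirty.pop(block, None)
        target = self.dirty if (is_write or was_dirty) else self.clean
        target[block] = True          # re-insert at the MRU end of its list
        if len(self.clean) + len(self.dirty) > self.capacity:
            return self.evict()

    def evict(self):
        if self.clean:                # a dirty block is never chosen while
            victim, _ = self.clean.popitem(last=False)   # clean ones exist
            return victim, "dropped (no PCM write)"
        victim, _ = self.dirty.popitem(last=False)
        return victim, "written back to PCM"

c = MACLikeCache(capacity=2)
c.access("A", is_write=True)           # dirty
c.access("B", is_write=False)          # clean
print(c.access("C", is_write=False))   # ('B', 'dropped (no PCM write)')
```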

18.
Dynamic voltage scaling (DVS) is a technique widely used to save energy in real-time systems, but recent research shows that it has a negative impact on system reliability. In this paper, we consider system reliability for periodic task sets whose task instances share resources. First, we present a static low-power scheduling algorithm for periodic tasks with shared resources, called SLPSR, which ignores system reliability. Second, we prove that reliability-aware low-power scheduling for periodic tasks with shared resources is NP-hard and present two heuristic algorithms, SPF and LPF. Finally, we present a dynamic low-power scheduling algorithm for periodic tasks with shared resources, called DLPSR, which reclaims dynamic slack time to save energy while preserving system reliability. Experimental results show that the presented algorithms reduce energy consumption while improving system reliability.

19.
Research on a New Real-Time Scheduling Algorithm
In many application-specific systems-on-chip, tasks are numerous and switch frequently, and the task-switch overhead is large, sometimes seriously harming schedulability. This work studies a dynamic preemption-threshold scheduling algorithm: by computing initial threshold values and dynamic threshold values and by optimizing the assignment of tasks to threads, it effectively reduces task-switch overhead at high processor utilization and correspondingly reduces memory requirements. The algorithm organically combines static preemption-threshold scheduling with dynamic scheduling, achieving a seamless transition from static to dynamic.
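The core test is small enough to show directly: an arriving task preempts only if its priority exceeds the running task's threshold, not merely its priority, which suppresses context switches that cannot help meet deadlines. The numbers below are illustrative; the algorithm computes initial and dynamic thresholds from the task set.

```python
def can_preempt(running, arriving):
    """Preemption-threshold test: compare the arriving task's priority
    against the running task's *threshold* (threshold >= priority)."""
    return arriving["priority"] > running["threshold"]

# priority: higher number = more urgent
running = {"name": "t_low", "priority": 2, "threshold": 5}
mid = {"name": "t_mid", "priority": 4, "threshold": 4}
high = {"name": "t_high", "priority": 7, "threshold": 7}

print(can_preempt(running, mid))    # False: switch suppressed, overhead saved
print(can_preempt(running, high))   # True: an urgent task still preempts
```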

20.
A dimension hierarchy aggregate tree (DHA-Tree) over the Cube is proposed for incrementally maintaining aggregate Cubes. When data updates such as insertions and deletions occur in a dimension-hierarchy-aggregated Cube, the dimension-hierarchy prefixes in the tree are fully exploited to propagate, bottom-up, the delta between the values before and after the update to all ancestor nodes affected by it. When new dimension data is inserted, the aggregate Cube can be updated incrementally without rebuilding it, reducing Cube update time. A performance analysis and comparison between the DHA-Tree-based aggregate Cube and traditional Cubes shows that the proposed incremental update algorithm for aggregate Cubes performs best.
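A simplified model of the bottom-up delta propagation: each update pushes the difference between the new and old values along the ancestor chain only, leaving the rest of the tree untouched.

```python
class AggTreeNode:
    """Node in a dimension-hierarchy aggregation tree: each node stores
    an aggregate over its subtree, so an update can be applied as a
    delta pushed up the ancestor chain (a simplified DHA-Tree model)."""

    def __init__(self, parent=None):
        self.value = 0
        self.parent = parent

    def apply_delta(self, delta):
        """Propagate the difference between new and old values bottom-up,
        touching only the ancestors of the updated leaf."""
        node = self
        while node is not None:
            node.value += delta
            node = node.parent

# one path of a hierarchy: all -> year -> month
all_node = AggTreeNode()
year = AggTreeNode(parent=all_node)
month = AggTreeNode(parent=year)

month.apply_delta(+120)          # insert a fact worth 120
month.apply_delta(80 - 120)      # update it from 120 to 80: delta = -40
print(month.value, year.value, all_node.value)   # 80 80 80
```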
