Similar Articles
1.
In a GPU, all threads within a warp execute the same instruction in lockstep. Some threads' memory requests are serviced quickly, while the rest take much longer; the warp cannot execute its next instruction until the slowest request completes, causing memory divergence. This work studies inter-warp heterogeneity in GPUs and implements and optimizes a cache management mechanism and memory scheduling policy based on inter-warp heterogeneity, in order to mitigate the negative effects of memory divergence and cache queuing delay. Warps are classified by cache hit rate, and the classification drives three components: (1) a warp-type-based cache bypassing component, which makes warps with low cache utilization bypass the L2 cache instead of accessing it; (2) a warp-type-based cache insertion/promotion policy component, which prevents data from high-cache-utilization warps from being evicted prematurely; and (3) a warp-type-based memory controller component, which prioritizes requests received from high-cache-utilization warps and prioritizes requests coming from the same warp. Across 8 different GPGPU applications, the inter-warp-heterogeneity-based cache management and memory scheduling mechanism achieves an average speedup of 18.0% over the baseline GPU.
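A minimal C sketch of how warp classification by hit rate could drive the bypass and scheduling decisions described above. The `warp_stats` structure, the 0.2/0.8 hit-rate thresholds, and the two-level priority are illustrative assumptions, not values from the paper:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-warp L2 statistics gathered at run time. */
typedef struct {
    uint64_t l2_accesses;
    uint64_t l2_hits;
} warp_stats;

typedef enum { WARP_LOW_UTIL, WARP_MEDIUM_UTIL, WARP_HIGH_UTIL } warp_type;

/* Classify a warp by its observed L2 hit rate. */
static warp_type classify_warp(const warp_stats *s) {
    if (s->l2_accesses == 0) return WARP_MEDIUM_UTIL;   /* no evidence yet */
    double hit_rate = (double)s->l2_hits / (double)s->l2_accesses;
    if (hit_rate < 0.2) return WARP_LOW_UTIL;
    if (hit_rate > 0.8) return WARP_HIGH_UTIL;
    return WARP_MEDIUM_UTIL;
}

/* Component (1): low-utilization warps bypass the L2 entirely. */
static bool should_bypass_l2(const warp_stats *s) {
    return classify_warp(s) == WARP_LOW_UTIL;
}

/* Component (3): the memory controller prioritizes requests from
 * high-utilization warps; smaller number = higher priority. */
static int request_priority(const warp_stats *s) {
    return classify_warp(s) == WARP_HIGH_UTIL ? 0 : 1;
}
```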

2.
To address the low cache hit rate and long memory access latency that the MEC (memory efficient convolution) algorithm suffers on conventional devices due to non-contiguous data addresses, this paper proposes an optimization method tailored to MEC's memory access behavior. The method has two parts: intermediate matrix construction and matrix computation. For the intermediate matrix construction, the data read order is modified so that the access pattern matches the algorithm's memory access behavior. For the matrix computation, the convolution kernel matrix is rearranged into a memory layout better suited to matrix multiplication, and the computation between the intermediate matrix and the kernel matrix is redesigned using the compute functions provided by the TVM (tensor virtual machine) platform; the platform's built-in parallel library is used to accelerate the computation. Experimental results show that, compared with the original MEC algorithm, the proposed optimization effectively alleviates the low cache hit rate and long memory access latency: it achieves an average speedup of 50% on single convolution layers and at least 57% on multi-layer neural networks over MEC, and up to 80% over the spatial combination algorithm.
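The abstract does not give MEC's exact intermediate-matrix layout, so the C sketch below only illustrates the general access-order idea: both hypothetical functions fill the same intermediate matrix, but the second iterates so that the row-major input is read sequentially, which is what makes cache lines and hardware prefetchers effective:

```c
#include <stddef.h>

/* Column-major walk over the input: for a fixed column, successive reads
 * jump by W elements, so each access may touch a new cache line. */
void build_intermediate_naive(const float *in, float *mid,
                              size_t H, size_t W, size_t K) {
    for (size_t col = 0; col + K <= W; col++)
        for (size_t row = 0; row < H; row++)
            for (size_t k = 0; k < K; k++)
                mid[(col * H + row) * K + k] = in[row * W + col + k];
}

/* Row-major walk: each input row is scanned once, nearly sequentially.
 * (The writes become strided instead; this toy example assumes reads
 * dominate, which is not necessarily true of the real algorithm.) */
void build_intermediate_reordered(const float *in, float *mid,
                                  size_t H, size_t W, size_t K) {
    for (size_t row = 0; row < H; row++)
        for (size_t col = 0; col + K <= W; col++)
            for (size_t k = 0; k < K; k++)
                mid[(col * H + row) * K + k] = in[row * W + col + k];
}
```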

3.
On-chip caches to reduce average memory access latency are commonplace in today's commercial microprocessors. These on-chip caches generally have low associativity and small cache sizes. Cache line conflicts are the main source of cache misses, which are critical for overall system performance. This paper introduces an innovative design for on-chip data caches of microprocessors, called the one's complement cache. While binary complement numbers have been successfully used in designing arithmetic units, to the best of our knowledge, no one has ever considered using such complement numbers in cache memory designs. This paper will show that such complement numbers help greatly in reducing cache misses in a data cache, thereby improving data cache performance. By parallel computation of cache addresses and memory addresses, the new design does not increase the critical hit time of cache accesses. Cache misses caused by line interference are reduced by evenly distributing data items referenced by program loops across all sets in the cache. Even distribution of data in the cache is achieved by making the number of sets in the cache a prime or an odd number, so that the chance of related data being mapped to the same set is small. Trace-driven simulations are used to evaluate the performance of the new design. Performance results on benchmarks show that the new design improves cache performance significantly with negligible additional hardware cost.
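A small C illustration of the set-indexing idea, comparing an illustrative 128 (power-of-two) sets against 127 (prime) sets in a direct-mapped cache; the paper's actual one's-complement address computation is more involved than a plain modulo:

```c
#include <stdint.h>
#include <stdio.h>

/* Conventional direct-mapped indexing: sets is a power of two. */
static unsigned index_pow2(uint32_t block_addr, unsigned sets) {
    return block_addr & (sets - 1);       /* sets = 2^n */
}

/* Prime/odd indexing in the spirit of the paper: a prime (or odd)
 * set count spreads power-of-two strides across all sets. */
static unsigned index_prime(uint32_t block_addr, unsigned sets) {
    return block_addr % sets;             /* sets = e.g. 127 */
}

int main(void) {
    /* A loop touching every 128th block maps to ONE set with 128
     * power-of-two sets, but cycles through sets with 127 sets. */
    for (uint32_t i = 0; i < 8; i++) {
        uint32_t blk = i * 128;
        printf("block %5u -> pow2 set %3u, prime set %3u\n",
               blk, index_pow2(blk, 128), index_prime(blk, 127));
    }
    return 0;
}
```

With 128 sets, this stride hits set 0 on every iteration; with 127 sets, the same stride walks through 127 different sets before repeating.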

4.
In this paper, we propose a compiler-directed cache coherence scheme which makes use of data prefetching to enforce cache coherence in large-scale distributed shared-memory (DSM) systems. The Cache Coherence With Data Prefetching (CCDP) scheme uses compiler analyses to identify potentially stale and nonstale data references in a parallel program and enforces cache coherence by prefetching the potentially stale references. In this manner, the CCDP scheme brings up-to-date data into the caches to avoid stale references and also hides the latency of these memory accesses. Furthermore, the scheme also prefetches the nonstale references to hide their memory latencies. To evaluate the performance impact of the CCDP scheme on a real system, we applied the scheme to five applications from the SPEC CFP95 and CFP92 benchmark suites, and executed the resulting codes on the Cray T3D. The experimental results indicate that for all of the applications studied, our scheme provides significant performance improvements by caching shared data and using data prefetching to enforce cache coherence and to hide memory latency.

5.
When big-data applications such as large-scale sorting, search engines, and streaming media run on latency-oriented multi-core/many-core processors, resource utilization is low: the L1 cache hit rate is high while the L2/L3 hit rates are low, and enlarging the LLC yields little IPC improvement. To address this low cache resource utilization, this paper analyzes the memory access behavior of big-data applications and proposes two many-core cache organizations tailored to them, both with only a single cache level: the Share design, a fully shared cache, and the Partition design, a partially shared cache. Evaluation shows that both designs save a large amount of chip area at the cost of only a small increase in memory access latency; the Partition design outperforms Share at small cache capacities, while Share gradually overtakes Partition as capacity grows. Since the capacity allocated to each core in a many-core processor is limited, the Partition design holds a certain advantage.

6.
Mehlhorn, Sanders. Algorithmica, 2003, 35(1):75-93
Abstract. We consider the simple problem of scanning multiple sequences. There are k sequences of total length N which are to be scanned concurrently. One pointer into each sequence is maintained and an adversary specifies which pointer is to be advanced. The concept of scanning multiple sequences is ubiquitous in algorithms designed for hierarchical memory. In the external memory model of computation with block size B, a memory consisting of m blocks, and at most m sequences, the problem is trivially solved with N/B memory misses by reserving one block of memory for each sequence. However, in a cache memory with associativity a, every access may lead to a cache fault if k > a. For a direct-mapped cache (a = 1), two sequences suffice. We show that by randomizing the starting addresses of the sequences the number of cache misses can be kept to O(N/B) provided that k = O(m/B^{1/a}), i.e., the number of sequences that can be supported is decreased by a factor of B^{1/a}. We also show a corresponding lower bound. Our result leads to a general method for converting sequence-based algorithms designed for the external memory model of computation to cache memory, even for caches with small associativity.
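A C sketch of the randomization step under simple assumptions (64-byte blocks, a bounded random block-aligned offset per sequence); the paper's analysis, not this allocation trick alone, is what yields the O(N/B) miss bound:

```c
#include <stdlib.h>
#include <stddef.h>

#define B 64                  /* cache block size in bytes (illustrative) */
#define MAX_OFFSET_BLOCKS 64  /* offset range; an illustrative parameter */

/* Allocate one sequence with a random block-aligned starting offset, so
 * the cache sets the k sequence pointers fall into are decorrelated.
 * Assumes srand() was called; a real implementation would also keep the
 * unpadded base pointer around for free(). */
char *alloc_sequence(size_t len) {
    size_t pad = (size_t)MAX_OFFSET_BLOCKS * B;
    char *base = malloc(len + pad);
    if (!base) return NULL;
    size_t off = (size_t)(rand() % MAX_OFFSET_BLOCKS) * B;
    return base + off;        /* caller scans [ptr, ptr + len) */
}
```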

7.
Radiation-induced soft errors have become an emerging reliability threat to high-performance microprocessor design. As the size of on-chip cache memory has steadily increased over the past decades, resilience techniques against soft errors in caches are becoming increasingly important for processor reliability. However, conventional soft error resilient techniques significantly increase the access latency and energy consumption of cache memory, thereby resulting in undesirable performance and energy efficiency degradation. The emerging 3D integration technology provides an attractive advantage, as a 3D microarchitecture exhibits heterogeneous soft error resilience characteristics due to the shielding effect of die stacking. Moreover, the 3D shielding effect can offer several inner dies that are inherently invulnerable to soft errors, as they are implicitly protected by the outer dies. To exploit this invulnerability benefit, we propose a soft error resilient 3D cache architecture in which data blocks on the soft-error-invulnerable dies have no protection against soft errors; therefore, access to a data block on a soft-error-invulnerable die incurs considerably reduced access latency and energy. Furthermore, we propose to maximize accesses to the soft-error-invulnerable dies by dynamically moving data blocks among different dies, thereby achieving further performance and energy efficiency improvement. Simulation results show that the proposed 3D cache architecture can reduce power consumption by up to 65% for the L1 instruction cache, 60% for the L1 data cache, and 20% for the L2 cache, respectively. In general, the overall IPC performance can be improved by 5% on average.

8.
Cache-only memory access (COMA) multiprocessors support scalable coherent shared memory with a uniform memory access programming model. The local portion of shared memory associated with a processor is organized as a cache. This cache-based organization of memory results in long remote memory access latencies. Latency-hiding mechanisms can reduce effective remote memory access latency by making data present in a processor's local memory by the time the data are needed. In this paper we study the effectiveness of latency-hiding mechanisms on the KSR2 multiprocessor in improving the performance of three programs. The communication patterns of each program are analyzed and the mechanisms for latency hiding are applied. Results from a 52-processor system indicate that these mechanisms hide a significant portion of the latency of remote memory accesses. The results also quantify benefits in overall application performance. An earlier version of this paper was presented at the 1995 International Conference on Parallel Processing Techniques and Applications.

9.
Cache coherence enforcement and memory latency reduction and hiding are very important and challenging problems in the design of large-scale distributed shared-memory (DSM) multiprocessors. We propose an integrated approach to solve these problems through a compiler-directed cache coherence scheme called the Cache Coherence with Data Prefetching (CCDP) scheme. The CCDP scheme enforces cache coherence by prefetching the potentially stale references in a parallel program. It also prefetches the non-stale references to hide their memory latencies. To optimize the performance of the CCDP scheme, some prefetch hardware support is provided to efficiently handle these two forms of data prefetching operations. We also developed the compiler techniques utilized by the CCDP scheme for stale reference detection, prefetch target analysis, and prefetch scheduling. We evaluated the performance of the CCDP scheme via execution-driven simulations of several numerical applications from the SPEC CFP95 and the Perfect benchmark suites. The simulation results show that the CCDP scheme provides significant performance improvements for the applications studied, comparable to that obtained with a full-map hardware cache coherence scheme.

10.
Emerging non-volatile memory (NVM) has been considered the most promising candidate to replace DRAM for future main memory design in mobile devices. NVM-based main memory exhibits attractive features, such as byte-addressability, low standby power, high density, and near-DRAM performance. However, the nature of non-volatility makes NVM vulnerable to attacks by malicious programs. Though several data encryption techniques have been proposed to solve this problem, they do not consider the limited resources in mobile systems. To address this issue, in this paper, we propose an energy-efficient encryption mechanism, named MobiLock, to effectively enhance the security of NVM-based main memory in mobile systems. The basic idea is to enhance encryption and decryption performance by utilizing cache and concurrency mechanisms, respectively. To achieve this, we first develop a cache mechanism that caches the encrypted intermediate data (i.e., the PAD) whose plaintexts are updated frequently, accelerating decryption and reducing recomputation of the PAD. We then propose a concurrency mechanism that reads the ciphertext in NVM and calculates the PAD simultaneously, to reduce the decryption latency. The evaluation results show that our technique can effectively reduce encryption energy consumption and decryption latency.
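A toy C sketch of the pad-caching idea. The multiply-xor `make_pad` is only a stand-in for a real block cipher (e.g., AES in counter mode), and the direct-mapped pad-cache geometry is an illustrative assumption, not MobiLock's design:

```c
#include <stdint.h>

/* Stand-in keystream generator: NOT secure; it only marks where the
 * expensive cipher call would go in a real counter-mode design. */
static uint64_t make_pad(uint64_t line_addr, uint64_t counter) {
    return (line_addr ^ counter) * 0x9E3779B97F4A7C15ULL;
}

/* Tiny direct-mapped cache of precomputed pads for hot (frequently
 * rewritten) lines; size and fields are illustrative. */
#define PAD_CACHE_SETS 256
typedef struct { uint64_t line_addr, counter, pad; int valid; } pad_entry;
static pad_entry pad_cache[PAD_CACHE_SETS];

/* Decrypt-on-read: reuse a cached pad instead of recomputing it. */
uint64_t decrypt_word(uint64_t line_addr, uint64_t counter, uint64_t cipher) {
    pad_entry *e = &pad_cache[line_addr % PAD_CACHE_SETS];
    if (!(e->valid && e->line_addr == line_addr && e->counter == counter)) {
        e->line_addr = line_addr;              /* miss: fill the entry */
        e->counter = counter;
        e->pad = make_pad(line_addr, counter); /* the expensive step */
        e->valid = 1;
    }
    return cipher ^ e->pad;                    /* XOR with the cached pad */
}
```

The concurrency mechanism would additionally overlap the NVM ciphertext read with the pad computation; that part is omitted here.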

11.
Cache memories reduce memory latency and traffic in computing systems. Most existing caches are implemented as board-based systems. Advancing VLSI technology will soon permit significant caches to be integrated on chip with the processors they support. In designing on-chip caches, the constraints of VLSI become significant. The primary constraints are economic limitations on circuit area and off-chip communications. The paper explores the design of on-chip instruction-only caches in terms of these constraints. The primary contribution of this work is the development of a unified economic model of on-chip instruction-only cache design which integrates the points of view of the cache designer and of the floorplan architect. With suitable data, this model permits the rational allocation of constrained resources to the achievement of a desired cache performance. Specific conclusions are that random line replacement is superior to LRU replacement, due to an increased flexibility in VLSI floorplan design; that variable set associativity can be an effective tool in regulating a chip's floorplan; and that sectoring permits area efficient caches while avoiding high transfer widths. Results are reported on economic functionality, from chip area and transfer width to miss ratio. These results, or the underlying analysis, can be used by microprocessor architects to make intelligent decisions regarding appropriate cache organizations and resource allocations.

12.
Reducing memory space requirement is important to many applications. For data-intensive applications, it may help avoid executing the program out-of-core. For high-performance computing, memory space reduction may improve the cache hit rate as well as performance. For embedded systems, it can reduce the memory requirement, the memory latency and the energy consumption. This paper investigates program transformations which a compiler can use to reduce the memory space required for storing program data. In particular, the paper uses integer programming to model the problem of combining loop shifting, loop fusion and array contraction to minimize the data memory required to execute a collection of multi-level loop nests. The integer programming problem is then reduced to an equivalent network flow problem which can be solved in polynomial time.
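A hand-written C illustration of the loop fusion plus array contraction that the paper's integer-programming model selects automatically; the functions below are hypothetical examples, not the paper's algorithm:

```c
#include <stddef.h>

/* Before: two loop nests communicate through a full temporary array t[],
 * so the program needs O(n) extra memory and streams t[] twice. */
void before(const double *a, double *c, double *t, size_t n) {
    for (size_t i = 0; i < n; i++) t[i] = a[i] * a[i];
    for (size_t i = 0; i < n; i++) c[i] = t[i] + 1.0;
}

/* After loop fusion + array contraction: each t[i] is consumed in the
 * same iteration it is produced, so the array contracts to a scalar
 * and the O(n) temporary storage disappears. */
void after(const double *a, double *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double t = a[i] * a[i];   /* contracted temporary */
        c[i] = t + 1.0;
    }
}
```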

13.
It is observed that the limited memory space of direct-mapped caches is not used in a balanced way and therefore incurs extra conflict misses. We propose a novel cache organization, the balanced cache, which balances accesses to cache sets at the granularity of cache subarrays. The key technique of the balanced cache is a programmable subarray decoder, through which the mapping of memory reference addresses to cache subarrays can be optimized, and hence the conflict misses of direct-mapped caches can be reduced. The experimental results show that the miss rate of the balanced cache is lower than that of same-sized two-way set-associative caches on average, and can be as low as that of same-sized four-way set-associative caches for particular applications. Compared with previous techniques, the balanced cache requires only one cycle to access all cache hits and has the same access time as direct-mapped caches.
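A C sketch of the programmable-decoder idea, assuming an illustrative 4-subarray geometry; the policy that decides when to reprogram the table is the paper's contribution and is not modeled here:

```c
#include <stdint.h>

#define SUBARRAY_BITS 2
#define NUM_SUBARRAYS (1u << SUBARRAY_BITS)

/* The identity mapping behaves like a plain direct-mapped cache; the
 * decoder table can be reprogrammed to steer references away from an
 * over-used subarray toward an under-used one. */
static uint8_t subarray_map[NUM_SUBARRAYS] = {0, 1, 2, 3};

static unsigned select_subarray(uint32_t block_addr) {
    unsigned raw = block_addr & (NUM_SUBARRAYS - 1);  /* decoder input */
    return subarray_map[raw];                /* programmable output */
}

/* Rebalance by swapping two decoder entries, keeping the mapping a
 * permutation: if subarray 2 is hot and 3 is cold, swap them. */
static void swap_mapping(unsigned a, unsigned b) {
    uint8_t t = subarray_map[a];
    subarray_map[a] = subarray_map[b];
    subarray_map[b] = t;
}
```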

14.
In general, NAND flash memory has advantages in low power consumption, storage capacity, and fast erase/write performance in contrast to NOR flash, but its main drawback is the slow access time for random read operations. We therefore propose a new NAND flash memory package that overcomes this major drawback: a high-performance, low-power NAND flash memory system with a dual cache memory. The proposed NAND flash package consists of two parts, i.e., a NAND flash memory module and a dual cache module. The new NAND flash memory system can achieve dramatically higher performance and lower power consumption compared with any conventional NAND-type flash memory module. Our results show that the proposed system can reduce about 78% of the write operations into the flash memory cell and about 70% of the read operations from the flash memory cell by using only an additional 3 KB of cache space. This represents high potential to achieve low power consumption and high performance gain.

15.
As the speed gap between processors and memory keeps widening, memory access instructions, especially those that frequently miss in the cache, have become a major performance bottleneck. Since a compiler cannot know how many cycles a memory instruction will actually take at run time, it typically assumes either the cache-hit or the cache-miss latency for these instructions, which is inaccurate. We introduce cache profiling to collect run-time hit/miss information for memory instructions and use this information to estimate their latencies. On out-of-order machines, the hardware scheduler handles instructions within the issue window well, but the compiler has the advantage when scheduling over longer ranges. Once a cache miss occurs, the reorder buffer easily fills up and stalls the pipeline. By scheduling miss-prone instructions so that they execute in parallel, their long miss latencies can be overlapped and hidden, improving program performance. We therefore, for load instructions, both adjust the assumed latency of frequently missing instructions and modify the scheduling policy to increase memory-level parallelism. Experiments show that our scheduling improves bzip2 by up to 4.8% and art by 4%, with an overall average improvement of 1.5%.

16.
This research explores a compressed memory hierarchy model which can increase both the effective memory space and the bandwidth of each level of the memory hierarchy. It is well known that decompression time has a critical effect on memory access time, and that variable-sized compressed blocks tend to increase the design complexity of compressed memory systems. This paper proposes a selective compressed memory system (SCMS) incorporating a compressed cache architecture and its management method. To reduce or hide the decompression overhead, the SCMS employs several effective techniques, including selective compression, parallel decompression, and the use of a decompression buffer. In addition, a fixed memory space allocation method is used to achieve efficient management of the compressed blocks. Trace-driven simulation shows that the SCMS approach can not only reduce the on-chip cache miss ratio and data traffic by about 35% and 53%, respectively, but also achieve a 20% reduction in average memory access time (AMAT) over conventional memory systems (CMS). Moreover, this approach can provide lower memory traffic at a lower cost than CMS with some architectural enhancement. Most importantly, the SCMS is an attractive approach for future computer systems because it offers high performance in cases of long DRAM latency and limited bus bandwidth.

17.
Measuring memory access latency provides important guidance for optimizing an application's memory access patterns and data placement, but data caches, multithreading, and data prefetching severely interfere with the accuracy of latency measurements. This paper designs and implements a variable-stride memory latency measurement model: within a region of memory, a cyclic access sequence is built according to a user-specified stride, and the average time of repeatedly traversing this ring gives the memory access latency. Memory latencies of Intel general-purpose processors and the Phytium (飞腾) processor are measured and compared under different data sizes, strides, and thread counts; the model can expose the memory hierarchy and measure latency accurately.
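A C sketch of the described measurement model: build a pointer ring with a user-specified stride and time a dependent-load walk over it. Each load depends on the previous one, which defeats out-of-order overlap, and a sufficiently large stride defeats the hardware prefetcher; buffer sizes and iteration counts are left to the caller (POSIX `clock_gettime` assumed):

```c
#include <stdlib.h>
#include <time.h>

/* Returns the average latency per access in nanoseconds, or -1 on error.
 * `stride` must be at least sizeof(void *). */
double measure_latency_ns(size_t buf_bytes, size_t stride, size_t iters) {
    size_t n = buf_bytes / stride;
    char *buf = malloc(buf_bytes);
    if (!buf || n < 2 || stride < sizeof(void *)) { free(buf); return -1.0; }

    /* Link element i to element i+1; the last wraps to the first,
     * forming the access ring. */
    for (size_t i = 0; i < n; i++)
        *(void **)(buf + i * stride) = buf + ((i + 1) % n) * stride;

    void *p = buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = *(void **)p;                  /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Keep the final pointer live so the loop is not optimized away. */
    volatile void *sink = p;
    (void)sink;

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    free(buf);
    return ns / (double)iters;
}
```

Sweeping `buf_bytes` past each cache capacity makes the latency curve step upward at every level of the hierarchy, which is how such a model exposes the memory hierarchy.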

18.
Power consumption is an important issue for cluster supercomputers as it directly affects running cost and cooling requirements. This paper investigates the memory energy efficiency of high-end data servers used for supercomputers. Emerging memory technologies allow memory devices to dynamically adjust their power states and enable free rides by overlapping multiple DMA transfers from different I/O buses to the same memory device. To achieve maximum energy saving, the memory management on data servers needs to judiciously utilize these energy-aware devices. As we explore different management schemes under five real-world parallel I/O workloads, we find that the memory energy behavior is determined by a complex interaction among four important factors: (1) cache hit rates that may directly translate performance gain into energy saving, (2) cache populating schemes that perform buffer allocation and affect access locality at the chip level, (3) request clustering that aims to temporally align memory transfers from different buses into the same memory chips, and (4) access patterns in workloads that affect the first three factors.

19.
20.
Software pipelining is an important instruction scheduling technique that speeds up loops by executing instructions from different loop iterations simultaneously. As the gap between processor speed and memory speed keeps widening, memory instructions, especially those that miss in the cache, increasingly become the bottleneck of system performance. Because the latency of these instructions is not fixed, predicting and hiding it during software pipelining is very important. Unlike previous approaches to predicting memory latency, we introduce cache profiling and use dynamically collected profile information to predict memory latencies and schedule accordingly. Simply increasing the assumed latency of memory instructions in a modulo-scheduled loop also increases the initiation interval, so performance does not improve. The CSMS and FLMS algorithms change the assumed latency of memory instructions while trying not to enlarge the initiation interval. We improve the CSMS and FLMS algorithms by using cache profiling information to set memory latencies, which is more accurate than previous methods. Experiments show that the new approach effectively improves program performance: about 1% on average for the SPEC2000 benchmarks, and up to 11% for individual programs.
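A C-level illustration of the overlap software pipelining creates. A real pipeliner (including CSMS/FLMS) schedules machine instructions at a chosen initiation interval; this hand-pipelined dot product only shows the principle of issuing the next iteration's loads while the current iteration computes:

```c
#include <stddef.h>

/* The load for iteration i+1 is issued while iteration i computes, so a
 * long (e.g., cache-miss) load latency overlaps useful work instead of
 * stalling it. */
double dot_pipelined(const double *a, const double *b, size_t n) {
    if (n == 0) return 0.0;
    double sum = 0.0;
    double a_cur = a[0], b_cur = b[0];        /* prologue: first loads */
    for (size_t i = 0; i + 1 < n; i++) {
        double a_next = a[i + 1];             /* loads for iteration i+1 */
        double b_next = b[i + 1];
        sum += a_cur * b_cur;                 /* compute for iteration i */
        a_cur = a_next;
        b_cur = b_next;
    }
    sum += a_cur * b_cur;                     /* epilogue: last iteration */
    return sum;
}
```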
