Similar Documents
20 similar documents found
1.
In this paper, we propose a novel on-chip L2 cache organization for chip multiprocessors (CMPs) with private L2 caches. The proposed approach, called reusability-aware cache sharing (RACS), combines the advantages of both a private L2 cache and a shared L2 cache. Since a private L2 cache organization has a short access latency, the RACS scheme employs a private L2 cache organization. However, when a cache block in a private L2 cache is selected for eviction, RACS first evaluates its reusability. If the block is likely to be reused in the near future, it may be saved to a peer L2 cache that has space available. In this way, the RACS scheme effectively simulates the larger capacity of a shared L2 cache. Simulation results show that RACS reduced the number of off-chip memory accesses by 24% on average compared to a pure private L2 cache organization for the SPLASH-2 multi-threaded benchmarks, and by 16% for multi-programmed benchmarks.
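A minimal sketch of the RACS spill-on-eviction path, in Python; the class shape, the `likely_reused` flag, and the peer-scan order are illustrative assumptions, not the paper's implementation:

```python
class PrivateL2:
    """Toy private L2 illustrating the RACS spill-on-eviction idea."""
    def __init__(self, capacity, peers=None):
        self.capacity = capacity
        self.blocks = {}             # addr -> cached data
        self.peers = peers or []     # the other cores' private L2s

    def has_space(self):
        return len(self.blocks) < self.capacity

    def evict(self, addr, likely_reused):
        data = self.blocks.pop(addr)
        if likely_reused:            # stand-in for the paper's reusability test
            for peer in self.peers:  # save into a peer with spare capacity,
                if peer.has_space(): # emulating shared-cache capacity
                    peer.blocks[addr] = data
                    return "spilled"
        return "dropped"             # the block falls back to off-chip memory
```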

2.
Performance trade-offs between fast data access by local data replication and cache capacity maximization by global data sharing have been extensively studied for many-core Chip Multiprocessors (CMPs). Costly simulations over a wide spectrum of the design space are generally required to gain insight for a sound design. To lower the cost, we develop an abstract model for understanding the performance impact of data replication on CMP caches. To overcome the model's lack of real-time interactions among multiple cores, we further develop an efficient single-pass stack simulation to study the performance of CMP cache organizations with various degrees of data replication. The global stack logically incorporates a shared stack and per-core private stacks, so shared and private reuse (stack) distances can be collected in a single pass. With these reuse distances, the performance of CMP cache organizations with any degree of data replication can be calculated. We verify both the model and the stack simulation against execution-driven simulations with commercial multithreaded workloads. The results show that the abstract model provides accurate information about the performance trade-offs of data replication, and that the stack simulation accurately predicts the performance of various cache organizations with 2-9 percent error margins using only about 8 percent of the simulation time.
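As a hedged illustration of the single-pass idea, the sketch below computes LRU reuse (stack) distances from a flat address trace; with the resulting histogram, the hit ratio of any fully-associative LRU cache size falls out of one pass. The paper's shared/per-core stack split and multithreaded interactions are omitted:

```python
from collections import defaultdict

def reuse_distances(trace):
    """One pass over an address trace -> histogram of LRU stack distances."""
    stack = []                       # most recently used at the end
    hist = defaultdict(int)
    for addr in trace:
        if addr in stack:
            d = len(stack) - 1 - stack.index(addr)  # 0 = most recent
            hist[d] += 1
            stack.remove(addr)
        else:
            hist[float("inf")] += 1  # cold miss at every cache size
        stack.append(addr)
    return hist

def hit_ratio(hist, cache_blocks):
    """Hit ratio of a fully-associative LRU cache holding cache_blocks lines."""
    total = sum(hist.values())
    hits = sum(n for d, n in hist.items()
               if d != float("inf") and d < cache_blocks)
    return hits / total
```

For the shared/private study described in the abstract, one such private stack would be kept per core alongside the global shared stack.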

3.
Providing a real-time cloud service requires simultaneously retrieving large amounts of data, so improving file-access performance is a major challenge. This paper first addresses the preconditions of the problem, considering the requirements of applications, hardware, software, and network environments in the cloud. It then proposes a novel distributed layered cache system named HDCache, built on top of the Hadoop Distributed File System (HDFS). Applications integrate the HDCache client library to access the multiple cache services. Each cache service offers three access layers: an in-memory cache, a snapshot of the local disk, and a network disk provided by HDFS. Files loaded from HDFS are cached in shared memory that the client library can access directly. To improve robustness and spread the workload, the cache services are organized in a peer-to-peer style using a distributed hash table, and every cached file has three replicas scattered across different cache service nodes. Experimental results show that HDCache can store files with a wide range of sizes and achieves millisecond-level access performance under highly concurrent workloads; the hit ratio measured on a real-world cloud service exceeds 95%.
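A minimal sketch of the three-layer lookup order, with each layer modeled as a plain dict; the real HDCache client library, shared-memory interface, and DHT-based replication are not modeled here:

```python
def hdcache_get(path, mem_cache, disk_snapshot, hdfs):
    """Misses fall through the layers and refill the faster ones on the way back."""
    if path in mem_cache:
        return mem_cache[path]           # layer 1: shared in-memory cache
    if path in disk_snapshot:
        data = disk_snapshot[path]       # layer 2: local-disk snapshot
    else:
        data = hdfs[path]                # layer 3: HDFS network disk
        disk_snapshot[path] = data       # refill the snapshot layer
    mem_cache[path] = data               # refill the memory layer
    return data
```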

4.
片上多处理器中二级Cache的设计和管理是影响其性能的关键因素之一。在私有二级Cache的基础上,提出一种基于集中式一致性目录的协作Cache设计方案,通过有效地管理片上存储资源来优化处理器的性能,从而使该协作Cache具有平均访存延迟小、Cache缺失率低、可扩展性好等优点。实验结果显示,与共享二级Cache设计相比,协作Cache可以将4核处理器的吞吐量平均提高13.5%,而其硬件开销约为8.1%。  相似文献   

5.
GPUs provide megabytes of registers and shared memories to maintain the contexts of thousands of threads and to enable fast data sharing among the threads of a thread block, respectively. In addition, GPUs employ an L1 cache to serve memory requests at high bandwidth. However, the average L1 cache capacity per thread is very limited, causing cache thrashing that impairs performance. Meanwhile, many registers and shared memories are never assigned to any warp or thread block, and those that are assigned can sit idle after their warps or thread blocks finish. Exploiting these insights, this paper proposes Virtual-Cache, which cost-effectively increases the effective L1 cache size by using unassigned and released registers and shared memories as cache lines. Specifically, unassigned registers and shared memories serve cache requests directly. Registers assigned to a warp can work as cache lines after the warp completes execution and before a newly launched warp claims them; likewise, a thread block's shared memory can serve cache requests from when the thread block finishes until shared-memory instructions of a relaunched thread block reference it. The register file, shared memory, and L1 cache remain physically independent but are logically unified into one large virtual cache with redesigned cache-line management. We develop the control and data path that makes the register file accessible to cache requests, borrowing an operand collector to serve them, and we likewise expand the control and data path of the shared memory. Our evaluation shows that Virtual-Cache improves performance by 28% over the previously proposed cache management technique on cache-sensitive applications.
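A hedged sketch of the logically-unified lookup, with L1 and the idle register-file/shared-memory pools each modeled as a bounded dict; the capacities, fill order, and eviction choice are illustrative, not the paper's redesigned cache-line management:

```python
def vc_access(addr, backings, capacities, memory):
    # backings: [l1, idle_regs, idle_smem] as dicts; capacities: matching sizes
    for store in backings:
        if addr in store:
            return store[addr]                  # a hit in any backing counts
    data = memory.get(addr)                     # miss: fetch from memory
    for store, cap in zip(backings, capacities):
        if len(store) < cap:                    # fill the first store with room
            store[addr] = data
            return data
    victim = next(iter(backings[0]))            # all full: evict from L1 only
    del backings[0][victim]
    backings[0][addr] = data
    return data
```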

6.
Many network applications require access to the most up-to-date information. An update event makes the corresponding cached data item obsolete, and cache hits on obsolete items are simply useless to such applications. Frequently accessed but infrequently updated items should therefore get higher caching preference, while infrequently accessed but frequently updated items should get lower preference; the latter may not be cached at all, or should be evicted to make room for higher-preference items. In wireless networks, remote data access is typically more expensive than in wired networks, so a caching scheme that considers both data access and update patterns can better reduce data transmissions. In this paper, we propose a step-wise optimal update-based replacement policy, called the Update-based Step-wise Optimal (USO) policy, for wireless data networks, which optimizes transmission cost by increasing the effective hit ratio. The policy is built on the idea of preferring frequently accessed but infrequently updated data and is supported by an analytical model with quantitative analysis. We also present results from extensive simulations, demonstrating that (1) the analytical model is validated by the simulation results and (2) the proposed scheme outperforms the Least Frequently Used (LFU) scheme in terms of effective hit ratio and communication cost.
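The preference rule can be pictured with a toy victim selector: score each cached item by access frequency discounted by update frequency and evict the minimum. The ratio below is an illustrative stand-in for the paper's analytically derived step-wise optimal policy:

```python
def choose_victim(stats):
    """stats: {item: (access_count, update_count)} gathered online."""
    def score(item):
        accesses, updates = stats[item]
        return accesses / (1 + updates)   # high access, low update -> keep
    return min(stats, key=score)          # evict the least-preferred item

# e.g. choose_victim({"a": (90, 1), "b": (80, 40)}) evicts "b":
# "b" is accessed often but updated so frequently its hits are rarely useful.
```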

7.
We propose and analyze an adaptive per-user per-object cache consistency management (APPCCM) scheme for mobile data access in wireless mesh networks. APPCCM supports strong data consistency semantics through integrated cache consistency and mobility management, with the objective of minimizing the overall network cost incurred by data query/update processing, cache consistency management, and mobility management. In APPCCM, data objects can be cached adaptively either at the mesh clients directly or at mesh routers dynamically selected by APPCCM. The scheme is adaptive, per-user, and per-object: the decision of where to cache a data object accessed by a mesh client is made dynamically, based on the client's mobility, its data query/update characteristics, and the network's conditions. We develop analytical models for evaluating the performance of APPCCM and devise a computational procedure for dynamically calculating the overall network cost incurred. We demonstrate via both model-based analysis and simulation validation that APPCCM outperforms non-adaptive cache consistency management schemes that always cache data objects at the mesh client, or at the mesh client's current serving mesh router, for mobile data access in wireless mesh networks.
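A toy cost comparison in the spirit of APPCCM's placement decision (not its full analytical model): cache at the mesh client when queries dominate updates, otherwise at the serving mesh router. The rates and hop counts are illustrative inputs:

```python
def placement(query_rate, update_rate, hops_to_client, hops_to_router):
    # caching at the client makes updates expensive (invalidations travel far);
    # caching at the router makes every query pay the router round trip
    cost_at_client = update_rate * hops_to_client
    cost_at_router = query_rate * hops_to_router
    return "client" if cost_at_client < cost_at_router else "router"
```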

8.
Client caching is an important optimization technology for distributed and centralized storage systems. CacheFiles, a representative client cache system, is limited by transition faults and supports only a simple LRU policy in a tightly-coupled design. To overcome these limitations, we employ the Stable Set Model (SSM) to improve CacheFiles and design an enhanced CacheFiles, SAC. SSM assumes that data accesses can be decomposed into accesses on stable sets, whose elements are always accessed repeatedly together, or not accessed at all. SSM-based methods improve cache management and reduce the effect of transition faults. We also adopt loosely-coupled prefetch and replacement policies. We implement our scheme on Linux 2.6.32 and measure its execution time with various file I/O benchmarks. Experiments show that SAC can significantly improve I/O performance and reduce execution time by up to 84%, compared with the existing CacheFiles.

9.
A Small-Capacity, Tightly-Coupled Fast Shared Data Pool for Multi-Core DSPs
Building on the structural characteristics of on-chip scratch-pad memory (SPM), this paper proposes a new small-capacity, tightly-coupled shared memory structure for heterogeneous multi-core DSPs: the fast shared data pool (FSDP). The FSDP sits at the same level of the memory hierarchy as the L1 cache and can be accessed directly by memory instructions. It adopts a multi-bank parallel structure, an interleaved access mode, and an automatic synchronization mechanism based on hardware semaphores, supporting parallel access by multiple DSP cores and fast inter-core data exchange: transferring a single datum between two cores takes only 4 cycles. A simulation model of the FSDP was built, followed by an RTL-level implementation and analysis. Validation with several typical benchmarks shows that the FSDP transfers fine-grained shared data between DSP cores very efficiently, improving program performance by 37% over the comparable VS-SPM structure; combined with a conventional shared data cache, it improves the performance of heterogeneous multi-core DSPs by 13%.

10.
A Latency-Capacity Trade-off Cache Structure for Chip Multiprocessors
The L2 cache design of chip multiprocessors faces a conflict in which latency and capacity cannot be satisfied at once: a private organization has low hit latency but reduces the cache's effective capacity, while a shared organization increases effective capacity but has longer hit latency. This paper proposes a CMP cache structure that trades off latency and capacity (TCLC). TCLC is a hybrid of the private and shared designs. Its core idea is to dynamically identify the sharing type of each cache block and optimize each type separately: private blocks are migrated, shared read-only blocks are replicated, and shared read-write blocks are centrally placed, so that access latency approaches that of the private design while effective capacity approaches that of the shared design, mitigating wire-delay effects and reducing average memory access latency. Full-system simulation shows that TCLC improves performance by 13.7% on average over the private design and by 12% on average over the shared design.
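A minimal sketch of the classification step that drives TCLC's per-type policies; the reader/writer core-ID sets are assumed to come from coherence state, and the policy names simply mirror the abstract:

```python
def classify(readers, writers):
    """readers/writers: sets of core IDs observed for one cache block."""
    if len(readers | writers) <= 1:
        return "private"                       # a single user core
    return "shared_rw" if writers else "shared_ro"

POLICY = {"private": "migrate",                # move toward its only user
          "shared_ro": "replicate",            # copy near each reader
          "shared_rw": "center-place"}         # keep one centrally placed copy
```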

11.
Web caching mitigates network access latency and congestion, and the cache replacement policy directly determines the hit ratio. This paper proposes a Web cache replacement policy that predicts revisit probability with a naive Bayes (NB) classifier. From users' past access logs, partitioning operations extract multiple features to represent each accessed object and build a feature data set. The trained NB classifier then estimates the probability that each cached object will be accessed again and assigns it a weight, and the policy combines these weights with LRU to choose which objects to evict. Simulation results show that the proposed policy maintains a high hit ratio while effectively reducing execution time.
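A hedged sketch of the weighting step, assuming an offline-trained scikit-learn GaussianNB whose positive class means "will be re-accessed"; the feature extraction from the access logs is application-specific and only stubbed here:

```python
from sklearn.naive_bayes import GaussianNB

def evict_one(cache, clf):
    """cache: {obj_id: (feature_vector, last_access_time)}; clf: fitted GaussianNB."""
    def weight(obj_id):
        feats, last_access = cache[obj_id]
        p_revisit = clf.predict_proba([feats])[0][1]  # P(object re-accessed)
        return (p_revisit, last_access)               # ties broken LRU-style
    victim = min(cache, key=weight)                   # least likely to return
    del cache[victim]
    return victim
```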

12.
Modern chip multiprocessors are vulnerable to transient faults caused by deliberate attacks or system errors, especially those with large multi-level caches in cloud servers. In this paper, we propose a modified/shared replication cache that keeps redundant copies of the most recently accessed modified or shared L2 cache lines. Experiments based on Multi2Sim show that a properly sized replication cache provides considerable data reliability. In addition, it reduces the memory hierarchy's average latency for error correction, at only about 20.2% of the L2 cache's energy cost and 2% of its silicon overhead.

13.
Research on a Dynamic Implicit Isolation Mechanism for Shared Caches in On-Chip Many-Core Architectures
Memory bandwidth is the key limit on performance scaling in many-core processors, so the last-level on-chip cache must be shared by all cores. Isolating conflicting data within the shared cache is key to improving its performance. This paper proposes a hardware cache-block chaining mechanism that isolates the data of different threads in the shared cache. The mechanism is evaluated on a cycle-accurate many-core simulator using the SPLASH-2 suite and bioinformatics workloads. Experimental results show that, compared with a conventional shared cache, cache-block chaining reduces the shared cache's conflict miss rate by about 20% and improves IPC by about 10% on average.

14.
We propose a simple solution to the problem of efficient stack evaluation of LRU multiprocessor cache memories with arbitrary set-associative mapping. It extends existing stack evaluation techniques for set-associative LRU uniprocessor caches. Special marker entries in the stack represent data blocks (or lines) deleted by an invalidation-based cache coherence protocol, and a marker-splitting method is applied when a data block below a marker in the stack is accessed. Using this technique, a single pass over a memory reference trace yields hit ratios for all cache sizes and set-associative mappings of multiprocessor caches. Simulation experiments on multiprocessor trace data show an order-of-magnitude speed-up in simulation time with this one-pass technique.

15.
Multi-core x86_64 processors introduced an important architectural change: a shared last-level cache. Historically, each processor had access to a large private cache that seamlessly and transparently (to end users) interfaced with main memory. Processes or threads once competed only for memory bandwidth; now they compete for actual cache space. Competition for space and environmental resources is a problem studied in other scientific domains, and this paper introduces methods from ecology to model multi-core cache usage with the competitive Lotka–Volterra equations. A model is presented and validated for characterizing the interaction of cores through shared caching, and for predicting the degree to which different workloads will interfere with each other's execution through cache contention.
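As a hedged illustration, the two-species competitive Lotka–Volterra system below can stand in for two workloads competing for cache space; the growth rates, capacities, and competition coefficients are illustrative, not fitted values from the paper:

```python
def competitive_lv(x1, x2, steps=20000, dt=0.01,
                   r1=1.0, r2=0.8, K1=1.0, K2=1.0, a12=0.6, a21=0.9):
    """Forward-Euler integration of dx_i/dt = r_i x_i (1 - (x_i + a_ij x_j)/K_i)."""
    for _ in range(steps):
        dx1 = r1 * x1 * (1 - (x1 + a12 * x2) / K1)
        dx2 = r2 * x2 * (1 - (x2 + a21 * x1) / K2)
        x1, x2 = x1 + dt * dx1, x2 + dt * dx2
    return x1, x2    # equilibrium values ~ steady-state cache shares

# both "species" coexist here since a12 < 1 and a21 < 1 (with K1 = K2)
print(competitive_lv(0.1, 0.1))
```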

16.
Non-Uniform Cache Architecture (NUCA) has become the prevailing design for future large-capacity on-chip caches. In a NUCA, data promotion, which places frequently accessed data in cache banks closer to the processor, reduces the processor's wait time for that data and strongly influences NUCA performance. However, existing promotion techniques use a fixed policy that ignores the actual state of the target bank, so promotion can push more useful data in the target bank away from the processor; the resulting cache pollution severely limits the benefit of promotion. To address this problem, this paper proposes smart multi-hop promotion, which senses the state of candidate target banks and dynamically selects a suitable target bank for the promoted data, improving promotion efficiency and reducing cache pollution. The design cleverly reuses the return path of processor accesses: it only extends the format of processor access messages and adds no extra accesses to the cache banks. Detailed evaluation with a full-system simulator on 15 benchmarks from the NAS Parallel Benchmarks and the Livermore Benchmark shows that smart multi-hop promotion saves 1.50 times as many cycles per promotion operation as existing techniques (up to 2.61 times) and improves system IPC by 6.24% on average, and by up to 19.03%.
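A toy selection step in the spirit of the state-aware idea: scan candidate banks nearest-first and accept the first whose coldest line is colder than the block being promoted, so promotion never displaces hotter data. The hotness metric and the dict-based bank model are illustrative:

```python
def pick_target_bank(block_hotness, banks):
    """banks: list of {addr: hotness} dicts, ordered nearest-to-processor first."""
    for i, bank in enumerate(banks):
        victim = min(bank, key=bank.get, default=None)   # coldest line, if any
        if victim is None or bank[victim] < block_hotness:
            return i             # safe target: nothing hotter gets pushed out
    return None                  # no safe target; leave the block in place
```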

17.
In traditional cache structures, entries in the data array and the tag array are tightly coupled, that is, mapped one-to-one. In this paper, we decouple this traditional one-to-one mapping between the data and tag arrays. The key idea is that a block's tag can be stored in different tag arrays, so that the tag and data arrays are accessed through different indices. The freedom gained by decoupling the tag association brings several advantages. We use formal inference to verify whether a cache structure provides correct decoupled addressing, and we summarize three generalized decoupled models that can also be applied to other previously proposed approaches in the literature. We evaluate our schemes and compare them with other approaches by trace-driven simulation. The simulation results show that the decoupled mechanisms can significantly reduce tag area with only a slight increase in the average access time per instruction.

18.
黄光奇  李子木  周兴铭  窦勇 《计算机学报》2001,24(12):1318-1323
With the rapid advance of semiconductor process technology, the single-chip multiprocessor (SCMP) is an effective way to improve processor performance. Based on an analysis of the characteristics of SCMP structures, this paper proposes an SCMP implementation: the Shared Multi-Ported Data Cache Architecture (SMPDCA). SMPDCA has three notable advantages: minimal communication latency, no cache-coherence maintenance overhead, and a higher data cache hit ratio. Simulation results show that, compared with a private data cache organization, these advantages give applications clearly better performance, with the most pronounced gains for applications with heavy inter-processor communication and interaction.

19.
When a chip multiprocessor runs several different programs, allocating an appropriate amount of cache space to each application is difficult. Cache partitioning is an effective solution, and most existing partitioning methods target the shared last-level cache. The private cache partitioning (PCP) method instead organizes multiple private caches with a distributed coherence engine (DCE), and a hardware profiling unit collects each program's hit distribution across the cache ways to guide the partitioning algorithm; each DCE then partitions the cache space according to the algorithm's result. Experimental results show that PCP lowers the miss rate and improves execution performance.
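A hedged sketch of how per-way hit profiles can drive the partition, using a greedy marginal-utility allocation (in the spirit of utility-based partitioning; the paper's exact algorithm may differ). Here `way_hits[p][k]` is the hits program p would gain from its (k+1)-th way, as measured by the hardware profiling unit:

```python
def partition_ways(way_hits, total_ways):
    alloc = [0] * len(way_hits)               # ways granted to each program
    for _ in range(total_ways):
        def gain(p):                          # marginal hits of one more way
            return way_hits[p][alloc[p]] if alloc[p] < len(way_hits[p]) else -1
        best = max(range(len(way_hits)), key=gain)
        alloc[best] += 1                      # grant the way to the best bidder
    return alloc

# e.g. partition_ways([[50, 40, 5, 1], [30, 10, 5, 2]], 4) -> [2, 2]
```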

20.
The traditional dynamic random-access memory (DRAM) storage medium can be integrated on chip via emerging 3D-stacking technology to architect a DRAM shared cache in multicore systems. Compared with static random-access memory (SRAM), DRAM is larger but slower. Much existing work improves workload performance by using SRAM and stacked DRAM together in shared cache systems, ranging from SRAM structure improvements to optimizing cache tags and data access, but little attention has been paid to designing a shared cache scheduling scheme for multiprogrammed workloads with different memory footprints in multicore systems. Motivated by this, we propose a hybrid shared cache scheduling scheme that allows a multicore system to utilize SRAM and 3D-stacked DRAM efficiently, achieving better workload performance. The scheme employs (1) a cache monitor, which collects cache statistics; (2) a cache evaluator, which evaluates the cache information while programs execute; and (3) a cache switcher, which self-adaptively chooses the SRAM or DRAM shared cache module. A cache data migration policy is naturally developed to guarantee that the scheduling scheme works correctly. Extensive experiments show that our method can improve multiprogrammed workload performance by up to 25% compared with state-of-the-art methods (including conventional and DRAM cache systems).
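A toy version of the switcher's decision, assuming the monitor reports the workload footprint and the evaluator supplies per-module miss rates; the inputs and thresholds are illustrative, not the paper's policy:

```python
def choose_cache_module(footprint, sram_capacity, sram_miss_rate, dram_miss_rate):
    if footprint <= sram_capacity:
        return "SRAM"        # working set fits: faster SRAM hits win outright
    # footprint spills out of SRAM: prefer DRAM when its extra capacity
    # saves more misses than its longer hit latency costs
    return "DRAM" if dram_miss_rate < sram_miss_rate else "SRAM"
```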
