
1.

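李家文, 沈立 《计算机科学与探索》2012,6(1):58-66

To improve cache isolation in virtualized systems and raise overall system performance, this paper designs and implements a dynamic cache partitioning algorithm for virtualized environments. The algorithm is based on page coloring: it partitions the cache by assigning each virtual machine pages of private colors, and it dynamically adjusts each virtual machine's cache capacity according to its cache demand. The algorithm is implemented in the Xen virtualization environment. Experimental results show that, at low overhead, it significantly improves the overall performance of concurrent programs running on multiple virtual machines.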
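The page-coloring idea in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a physically indexed cache whose set index uses the address bits directly above the page offset (no slice hashing), and the names `num_colors`, `page_color`, and `frames_for_vm` are hypothetical.

```python
PAGE_SIZE = 4096  # bytes; assumed page size


def num_colors(cache_size, ways, page_size=PAGE_SIZE):
    """Number of page colors = bytes per cache way / page size."""
    return cache_size // (ways * page_size)


def page_color(frame_number, colors):
    """Color of a physical page frame (low bits of the frame number)."""
    return frame_number % colors


def frames_for_vm(free_frames, allowed_colors, colors):
    """Keep only the free frames whose color belongs to this VM."""
    return [f for f in free_frames if page_color(f, colors) in allowed_colors]


colors = num_colors(2 * 1024 * 1024, ways=16)    # assumed 2 MiB, 16-way cache
vm_a = frames_for_vm(range(64), {0, 1}, colors)  # hypothetical VM owning colors 0 and 1
```

Pages of different colors map to disjoint cache sets, so giving each virtual machine its own set of colors keeps their cache contents apart. Growing or shrinking a virtual machine's share then amounts to changing its `allowed_colors` set.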

2.

Dynamic cache partitioning based on hot page migration

Xiaolin WANG Xiang WEN Yechen LI Zhenlin WANG Yingwei LUO Xiaoming LI 《Frontiers of Computer Science》2012,6(4):363-372

Static cache partitioning can reduce inter-application cache interference and improve the composite performance of a cache-polluting application and a cache-sensitive application when they run on cores that share the last-level cache of the same multi-core processor. In a virtualized system, since different applications might run on different virtual machines (VMs) at different times, it is impractical to partition the cache statically in advance. This paper proposes a dynamic cache partitioning scheme that uses hot page detection and page migration to improve the composite performance of co-hosted virtual machines dynamically, according to prior knowledge of cache-sensitive applications. Experimental results show that the overhead of the page migration scheme is low and that, in most cases, the composite performance improves over free composition.

3.

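Weighted shared-cache partitioning for multithreaded multiprogrammed workloads  Total citations: 5; self-citations: 1; citations by others: 4

所光, 杨学军 《计算机学报》2008,31(11)

When parallel applications run on multicore processors with a shared cache, conflicting accesses to the shared cache degrade performance and make execution time unpredictable. Shared-cache partitioning, which allocates the shared cache exclusively among multiple processes, is an effective solution. Because threads share data, applications with different thread counts utilize the shared cache differently, yet traditional partitioning algorithms that minimize the miss rate (for example UCP) do not distinguish applications by thread count. This paper designs a weighted shared-cache partitioning framework for multithreaded multiprogrammed workloads (Weighted Cache Partitioning, WCP), comprising an application-oriented miss-rate monitor and a weighted cache partitioning algorithm. The miss-rate monitor dynamically tracks, per process, the application's miss rate under different cache capacities. The weighted partitioning algorithm extends the traditional miss-rate-optimal algorithm by assigning applications different weights according to their thread counts, so that applications with more threads obtain more shared cache, improving overall system performance. Experimental results show that although weighted partitioning raises the miss rate somewhat, it improves IPC throughput, weighted speedup, and fairness. On multiprogrammed test cases composed of scientific and engineering computing applications, the IPC throughput of WCP-1 exceeds that of the miss-rate-minimizing shared-cache partitioning algorithm by up to 10.8% and by 5.5% on average.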

4.

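Research and development of memory-system partitioning techniques

邱杰凡, 华宗汉, 范菁, 刘磊 《软件学报》2022,33(2):751-769

In the multicore era, memory-access interference among multiprogrammed workloads across the shared memory system is a major factor limiting overall system performance and quality of service. Even though memory resources are now relatively abundant, how to optimize memory-system performance, reduce access interference, and manage memory resources efficiently remains an active research topic in computer architecture. To examine this problem in depth, this paper details a series of advanced methods that apply the page-coloring memory partitioning technique to the entire memory system (including the cache, memory channels, and DRAM banks) and thereby eliminate memory-access interference among parallel multiprogrammed workloads on the shared memory system. The discussion is organized by medium in the memory system: DRAM banks, channels and caches, and non-volatile memory (NVM). First, it details applying page coloring to partition DRAM banks and channels among multiprogrammed workloads, eliminating access conflicts between them. Next, it covers applying page coloring to the "vertical" cooperative partitioning of cache and DRAM, which eliminates access interference on multiple levels of memory media at once. Finally, it covers applying page coloring to hybrid memory systems that include NVM, to improve program execution efficiency and overall system effectiveness. Experimental results show that the proposed memory partitioning methods improve overall system performance (by 5%-15% on average) and quality of service (QoS), and effectively reduce system energy consumption. By reviewing...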

5.

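A level-2 cache partitioning mechanism for multithreaded array many-core processors

陈逸飞, 朱蕾, 李宏亮 《计算机工程与科学》2019,41(3):400-408

Array many-core processors are widely used in high-performance computing because of their high computing performance and energy efficiency. Building processors for future high-performance computing systems, however, requires addressing the severe "memory wall" challenge and the problem of core cooperation. Cores in conventional array processors are mostly single-threaded to reduce overhead, but this places high demands on memory access. This paper introduces hardware simultaneous multithreading and, to address the low L2 cache utilization observed in experiments with multiple threads on a single core, proposes a shared L2 cache partitioning mechanism. In simulation, the optimized shared L2 cache partitioning mechanism lowers the L2 instruction cache miss rate by 18.59% and the data cache miss rate by 6.60%, and improves overall CPI performance by 10.1%.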

6.

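A private cache partitioning method based on distributed cooperative caches

李浩, 谢伦国 《计算机应用研究》2012,29(1):229-233

When a chip multiprocessor runs several different programs, allocating an appropriate amount of cache space to each application becomes a difficult problem. Cache partitioning is an effective solution, but most current partitioning methods target the shared last-level cache. The private cache partitioning (PCP) method uses a distributed coherence engine (DCE) to organize multiple private caches together. A hardware information extraction unit then obtains the hit distribution of the programs across cache ways, which guides the partitioning algorithm. Finally, each DCE partitions the cache space according to the algorithm's result. Experimental results show that PCP lowers the miss rate and improves program execution performance.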

7.

Dynamic Partitioning of Shared Cache Memory  Total citations: 6; self-citations: 0; citations by others: 6

G. E. Suh L. Rudolph S. Devadas 《The Journal of supercomputing》2004,28(1):7-26

This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches. Since memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines the information, and partitions the cache amongst the executing processes/threads. Partition sizes are varied dynamically to reduce the total number of misses. The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve the total IPC significantly over the standard least recently used (LRU) replacement policy. In a certain case, partitioning doubles the total IPC over standard LRU. Our results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory.
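One common way to choose partition sizes that reduce total misses is a greedy marginal-gain heuristic. The sketch below illustrates that general idea only; the abstract does not specify the paper's algorithm, and the function name `greedy_partition` and the miss-curve data are hypothetical. Each curve lists a process's miss count when given 0, 1, 2, ... cache ways.

```python
def greedy_partition(miss_curves, total_ways):
    """Hand out cache ways one at a time to the process that gains most.

    miss_curves[p][w] is the (measured or estimated) number of misses of
    process p when it owns w ways; each curve needs total_ways + 1 entries.
    """
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        # Marginal gain = misses saved by giving process p one more way.
        best = max(
            range(len(miss_curves)),
            key=lambda p: miss_curves[p][alloc[p]] - miss_curves[p][alloc[p] + 1],
        )
        alloc[best] += 1
    return alloc


# Hypothetical miss curves for two processes sharing a 4-way cache.
curve_a = [100, 60, 40, 35, 34]
curve_b = [100, 90, 84, 82, 80]
allocation = greedy_partition([curve_a, curve_b], total_ways=4)
```

The greedy choice is optimal when every curve shows diminishing returns (each extra way saves no more than the previous one); for other curve shapes it is only a heuristic. Rerunning it as the curves are refreshed at run-time gives a dynamic partition.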

8.

Multiprocessor Cache Coherence Based on Virtual Memory Support

Petersen K. Li K. 《Journal of Parallel and Distributed Computing》1995,29(2)

Virtual-memory-based cache coherence is a mechanism that relies only on hardware that already exists on the microprocessors of a shared-memory multiprocessor system, yet dynamically detects and resolves potential cache inconsistencies using virtual-memory techniques. The key feature of the approach is that the virtual-memory translation hardware on each processor is used to detect shared accesses that could lead to memory incoherencies, and VM page fault handlers execute the appropriate actions to maintain cache coherence. VM-based cache coherence basically trades off design simplicity for increased software overheads. The work presented in this paper evaluates this trade-off. We show that VM-based cache coherence performs well for scientific applications that require significant aggregate memory bandwidth.

9.

Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols

Eduardo H.M. Cruz Matthias Diener Marco A.Z. Alves Philippe O.A. Navaux 《Journal of Parallel and Distributed Computing》2014

In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared memory paradigm, the communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with a low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed a reduction of up to 13.9% of the execution time, 30.5% of the cache misses and 39.4% of the number of invalidation messages.

10.

Precise control of page cache for containers

Kun WANG Song WU Shengbang LI Zhuo HUANG Hao FAN Chen YU Hai JIN 《Frontiers of Computer Science》2024,18(2):182102

Container-based virtualization is becoming increasingly popular in cloud computing due to its efficiency and flexibility. Resource isolation is a fundamental property of containers. Existing works have indicated that weak resource isolation can cause significant performance degradation for containerized applications, and have enhanced resource isolation. However, current studies have rarely discussed the isolation problems of the page cache, which is a key resource for containers. Containers leverage the memory cgroup to control page cache usage. Unfortunately, the existing policy introduces two major problems in a container-based environment. First, containers can use more memory than their cgroup limit allows, effectively breaking memory isolation. Second, the OS kernel has to evict page cache to make space for newly arrived memory requests, slowing down containerized applications. This paper performs an empirical study of these problems and demonstrates the performance impacts on containerized applications. Then we propose pCache (precise control of page cache) to address the problems by dividing page cache into private and shared and controlling both kinds of page cache separately and precisely. To do so, pCache leverages two new technologies: fair account (f-account) and evict on demand (EoD). F-account splits the charging of shared page cache according to each container's share, preventing containers from using memory for free and enhancing memory isolation. EoD reduces unnecessary page cache evictions to avoid the performance impacts. The evaluation results demonstrate that our system can effectively enhance memory isolation for containers and achieve substantial performance improvement over the original page cache management policy.

11.

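Design and implementation of cache partitioning for virtual machines

靳辛欣, 陈昊罡, 汪小林, 王振林, 温翔, 罗英伟, 李晓明 《计算机科学与探索》2010,4(1):36-45

This paper describes the design and implementation of VMM-based (virtual machine manager) cache partitioning for virtual machines. The method uses the page coloring technique from operating systems and is implemented in the Xen virtual machine manager. The mechanism is completely transparent to guest operating systems running on the VMM, is easy to operate, and is flexible. Tests show that the proposed cache partitioning method significantly improves the performance of applications running concurrently on different virtual machines. On workloads of concurrent programs selected from the SPEC CPU 2006 benchmarks, cache partitioning improves performance by up to 19%.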

12.

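A dynamically partitioned buffer cache management system designed for storage servers  Total citations: 1; self-citations: 0; citations by others: 1

孟晓垣, 那文武, 徐伟, 卜庆忠, 许鲁 《计算机研究与发展》2009,46(Z2)

This paper proposes a dynamically partitioned buffer cache management system, DPCache (dynamic partitioned buffer cache system), suited to network storage servers in which multiple applications share cache resources. DPCache manages cache resources in per-application partitions, with two advantages: 1) each independent cache partition can choose a replacement policy that fits its application's workload characteristics, improving utilization of the partition's cache resources; 2) while the system runs, cache partitions compete for cache resources in an orderly way through configurable cache reclamation policies, providing application-level differentiated cache service. The system has been implemented in the Linux-2.6.18 kernel. Experimental data show that DPCache effectively supports multiple differentiated cache service semantics in real applications and also supports performance optimization for specific applications.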

13.

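ARP: an adaptive runtime partitioning mechanism for shared caches in simultaneous multithreading processors  Total citations: 1; self-citations: 1; citations by others: 0

隋秀峰, 吴俊敏, 陈国良 《计算机研究与发展》2008,45(7)

Simultaneous multithreading (SMT) is a latency-tolerant architecture that uses a shared L2 cache and can execute multiple instructions from multiple threads in each cycle, which increases pressure on the memory hierarchy. This paper studies how to partition the shared cache among concurrently executing threads in an SMT processor, in particular the fairness of cache sharing and its relationship to throughput. The traditional LRU policy implicitly partitions the shared cache according to thread demand, giving more cache space to threads with higher demand; this unfair cache management can cause thread starvation, priority inversion, and similar problems. The paper implements an adaptive runtime partitioning mechanism (ARP) to manage the shared cache. ARP uses fairness as the partitioning metric and optimizes it with a dynamic partitioning algorithm that is easy to implement and needs little profiling. In hardware, classic monitors collect each thread's stack distance information, with a storage overhead below 0.25%. Experimental results show that, compared with LRU-based cache partitioning, ARP improves the fairness of a 2-way SMT processor by a factor of 2.26 and raises throughput by 14.75% on average.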

14.

Moving address translation closer to memory in distributed shared-memory multiprocessors

Qiu X. Dubois M. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(7):612-623

To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (translation lookaside buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of distributed shared memory (DSM) multiprocessors, including CC-NUMAs (cache-coherent non-uniform memory access architectures) and COMAs (cache only memory access architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In the context of COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can dynamically migrate and replicate freely among nodes. As the address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect. We also observe that the TLB is very effective when it is merged with the shared memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Even if the effectiveness of the TLB merged with the shared memory is very high, we also show that the TLB can be removed in a system with address translation done in memory because the frequency of translations is very low.

15.

Security enhancement of cloud servers with a redundancy-based fault-tolerant cache structure

《Future Generation Computer Systems》2015

Modern chip multiprocessors are vulnerable to transient faults caused by either deliberate attacks or system mistakes, especially those with large, multi-level caches in cloud servers. In this paper, we propose a modified/shared replication cache that keeps a redundant copy of the most recently accessed modified/shared L2 cache lines. According to experiments based on Multi2Sim, this cache, when properly sized, can provide considerable data reliability. In addition, the cache can reduce the average latency of the memory hierarchy for error correction, with only about 20.2% of L2 cache energy cost and 2% of L2 cache silicon overhead.

16.

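Program performance profiling optimization for guiding static cache partitioning

贾耀仓, 武成岗, 张兆庆 《计算机研究与发展》2012,49(1):93-102

On multicore processors with a shared cache, managing how each core uses the cache is key to realizing the processor's full performance. With current cache replacement methods, programs interfere with one another's performance; static cache partitioning addresses this by allocating separate space to concurrently running programs. To allocate a suitably sized cache space to a program, the program must be performance-profiled, that is, run multiple times in advance to collect its performance under various cache capacities. This profiling is very expensive, which limits its practicality. To remove the need for multiple runs, the paper proposes a profiling optimization that needs only a single run. It uses online phase analysis to identify the program's execution phases and avoids re-profiling identical phases; it also analyzes how each phase's performance trends with cache capacity and skips profiling for capacity changes to which performance is insensitive, reducing overhead. After the program finishes, the per-phase performance at each cache capacity is used to estimate the program's overall performance at each capacity, which guides static cache partitioning. Experiments show that the technique's overhead is only 7%, that the cache partitioning it guides improves performance by 8% over no partitioning, and that it is only 1% slower than partitioning guided by multi-run profiling.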

17.

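A low-power-oriented shared cache partitioning scheme for multicore processors  Total citations: 1; self-citations: 0; citations by others: 1

熊伟, 殷建平, 所光, 赵志恒 《计算机工程与科学》2010,32(10):26-29

As multicore processors develop, on-chip cache capacity grows, and the cache accounts for an increasing share of total chip power. Reducing cache power has become a focus of current cache design. This paper studies a low-power-oriented shared cache partitioning technique for multicore processors (LP-CP). It proposes a cache partitioning framework in which miss-rate monitors added to the processor dynamically collect each program's miss rate; a low-power-oriented shared cache partitioning algorithm then computes a partitioning strategy that stays within a performance-loss threshold. On a dual-core system with a shared L2 cache, low-power-oriented cache partitioning was tested with multiprogrammed workloads: with performance-loss thresholds of 1% and 3%, the system's cache shutdown ratio reached 20.8% and 36.9%, respectively.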

18.

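A method for sharing large data between partition processes in VxWorks 653  Total citations: 1; self-citations: 0; citations by others: 1

徐克, 熊智勇, 李奎 《测控技术》2016,35(11):98-102

When the API provided by VxWorks 653 is used to share large data between partition processes, problems arise such as excessive consumption of system resources and memory that cannot be reclaimed, and the update rate and continuity of the shared data cannot be guaranteed. To address this, a data sharing method based on a double-buffer mechanism is designed and implemented. The method dynamically creates two buffers and performs read, write, and memory-release operations according to the different priorities of the processes, thereby sharing the data. Experimental results show that the double-buffer mechanism efficiently shares large data between processes without affecting system scheduling, saves system resources, and is general and portable.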

19.

Reducing Cache Conflicts by Multi-Level Cache Partitioning and Array Elements Mapping

Chang Chih-Yung Sheu Jang-Ping Chen Hsi-Chiuen 《The Journal of supercomputing》2002,22(2):197-219

This article presents an algorithm to reduce cache conflicts and improve cache locality. The proposed algorithm analyzes the locality reference space for each reference pattern, partitions the multi-level cache into several parts of different sizes, and then maps array data onto the scheduled cache positions to eliminate cache conflicts. A greedy method for rearranging array variables in declaration statements is also developed, to reduce the memory overhead of mapping arrays onto a partitioned cache. In addition, loop tiling and the proposed schemes are combined to exploit opportunities for both temporal and spatial reuse. Atom is used as a tool to simulate the behavior of a direct-mapped cache and demonstrate that the approach is effective at reducing the number of cache conflicts and exploiting cache locality. Experimental results reveal that applying the cache partitioning scheme can greatly reduce cache conflicts and thus save program execution time in both single-level and multi-level cache hierarchies.

20.

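Memory pools: a memory management method suited to compilers

杨书鑫, 薛丽萍, 张兆庆 《计算机工程》2005,31(6):79-80,131

This paper introduces the memory management method used in the open-source compiler ORC: the memory pool. The memory pool is not a general-purpose memory management method, but it is particularly well suited to compilers. In a compiler, memory pools have clear advantages over the general-purpose malloc/free mechanism: fast allocation, low management overhead, low cost of release, and no memory leaks.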
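The memory-pool idea can be sketched as a bump allocator that hands out slices of large blocks and frees everything at once. This is an illustration of the general technique only, not ORC's actual interface; the class `MemPool` and its methods are hypothetical, and Python is used purely to show the bookkeeping.

```python
class MemPool:
    """Bump-pointer pool: allocate by advancing an offset, free all at once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = []   # backing blocks owned by the pool
        self.offset = 0    # next free byte in the newest block

    def alloc(self, size):
        # Start a new block when the current one cannot hold the request.
        if not self.blocks or self.offset + size > len(self.blocks[-1]):
            self.blocks.append(bytearray(max(size, self.block_size)))
            self.offset = 0
        start = self.offset
        self.offset += size  # allocation is just a pointer bump
        return memoryview(self.blocks[-1])[start:start + size]

    def release(self):
        # Drop every block together; no per-object free is needed.
        self.blocks.clear()
        self.offset = 0


pool = MemPool(block_size=16)
a = pool.alloc(8)
b = pool.alloc(8)
c = pool.alloc(8)          # does not fit in the first block, so a second one is created
blocks_in_use = len(pool.blocks)
pool.release()
```

Allocation touches only an offset, and release discards whole blocks, which matches the advantages the abstract lists. A compiler could, for instance, keep one pool per compilation phase and release it when the phase ends.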