Similar Documents
 Found 20 similar documents (search time: 31 ms)
1.
For the Linux operating system, a memory management unit (MMU) for a 32-bit RISC embedded processor is implemented. Adding a pre-comparison circuit to the instruction TLB improves address-translation efficiency when the processor accesses the same virtual page consecutively. On a TLB miss, dedicated hardware performs the page-table walk and TLB refill, which is markedly faster than a software handler. The MMU designed in this paper cooperates well with Linux, providing address mapping and memory-access permission management.
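As a rough illustration of the pre-comparison idea, the following C sketch (a software model only; the 4 KiB page size and names such as precompare_t and full_tlb_lookup are illustrative assumptions, not the paper's hardware) caches the last translated virtual page number and skips the full TLB lookup when consecutive accesses stay within the same page:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12u                        /* 4 KiB pages assumed */

/* Software model of the pre-comparison idea: remember the last
 * translated virtual page number and skip the full TLB lookup when the
 * next access falls in the same page. */
typedef struct {
    uint32_t last_vpn;                        /* last translated VPN   */
    uint32_t last_pfn;                        /* its physical frame    */
    bool     valid;
} precompare_t;

/* Stand-in for the expensive CAM search / hardware page-table walk. */
static uint32_t full_tlb_lookup(uint32_t vpn) { return vpn ^ 0x80000u; }

static uint32_t translate(precompare_t *pc, uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1u);

    if (!pc->valid || pc->last_vpn != vpn) {  /* pre-compare miss      */
        pc->last_pfn = full_tlb_lookup(vpn);  /* slow path             */
        pc->last_vpn = vpn;
        pc->valid    = true;
    }
    /* pre-compare hit: reuse the last translation, no CAM access */
    return (pc->last_pfn << PAGE_SHIFT) | offset;
}

int main(void)
{
    precompare_t pc = {0};
    /* Two accesses to the same page: only the first needs a full lookup. */
    printf("0x%08x\n", translate(&pc, 0x00012034u));
    printf("0x%08x\n", translate(&pc, 0x00012abcu));
    return 0;
}
```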

2.
The cache memory consumes a large proportion of the energy used by a processor. Within the on-chip cache, the translation lookaside buffer (TLB) accounts for 20–50% of energy consumption. To reduce the energy consumed by TLB accesses, a virtual cache can be accessed directly with the virtual addresses issued by the processor. However, a virtual cache may suffer from the synonym problem. In this paper, we propose low-cost synonym detection hardware and a synonym data coherence mechanism. These reduce the energy consumption incurred by TLB lookups and maintain synonym data consistency in the virtual cache. The proposed synonym detection hardware efficiently reduces the number of blocks that must be looked up in the virtual cache, saving energy. In addition, the proposed synonym data coherence mechanism reduces the number of invalidated blocks in the virtual cache, preserving cache locality. Simulation results show that our proposed energy-aware virtual cache consumes 51%, 27%, and 20% less energy than the traditional physical cache, the traditional virtual cache, and the synonym lookaside buffer (SLB), respectively. In addition, our design shows almost the same static energy consumption as the SLB, and reduces static energy consumption by about 20% compared with the traditional physical cache and virtual cache.
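Synonyms arise because a virtually indexed cache can place two aliases of the same physical line in different sets when the set-index bits extend above the page offset. The following C sketch (the 32 KiB / 32 B-line / 4 KiB-page geometry is an assumption for illustration, not the paper's configuration) enumerates the candidate sets that a detection mechanism would have to check:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative geometry (assumptions, not from the paper):
 * 32 KiB virtually indexed cache, 32 B lines, 4 KiB pages. */
#define LINE_SHIFT   5u                       /* 32-byte cache lines   */
#define NUM_SETS     1024u                    /* 32 KiB / 32 B         */
#define PAGE_SHIFT   12u                      /* 4 KiB pages           */

/* Index bits that lie above the page offset can differ between two
 * virtual aliases of the same physical line; these are the "synonym
 * bits".  A synonym of a given line can only live in one of
 * 2^synonym_bits sets, so only those sets need to be searched. */
int main(void)
{
    uint32_t vaddr        = 0x00403040u;                    /* example address */
    uint32_t set          = (vaddr >> LINE_SHIFT) % NUM_SETS;
    uint32_t index_bits   = 10u;                            /* log2(NUM_SETS)  */
    uint32_t synonym_bits = (LINE_SHIFT + index_bits > PAGE_SHIFT)
                          ? LINE_SHIFT + index_bits - PAGE_SHIFT : 0u;
    uint32_t low_part     = set & ((1u << (index_bits - synonym_bits)) - 1u);

    printf("accessed set %u; possible synonym sets:", set);
    for (uint32_t s = 0; s < (1u << synonym_bits); s++)
        printf(" %u", (s << (index_bits - synonym_bits)) | low_part);
    printf("\n");
    return 0;
}
```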

3.
The Translation Look-aside Buffer (TLB) is a very important part of the hardware support for virtual memory management in high-performance embedded systems. The TLB, though small, is frequently accessed, and therefore not only consumes significant energy but is also one of the important thermal hot-spots in the processor. Recently, several circuit and microarchitectural implementations of TLBs have been proposed to reduce TLB power. One simple yet effective TLB design for power reduction is the Use-Last TLB architecture proposed in IEEE J. Solid-State Circuits, 1190–1199 (2004). The Use-Last TLB architecture reduces power consumption when the last page is accessed again. In this work, we develop code transformation techniques to reduce the page switches in data cache accesses and propose an efficient page-aware code placement technique to enhance the energy reduction achieved by the Use-Last TLB architecture for instruction cache accesses. Our comprehensive page-switch reduction algorithm results in an average 39% reduction in data-TLB page switches, and our code placement heuristic results in an average 76% reduction in instruction-TLB page switches, with negligible impact on performance for benchmarks from the MiBench, Multimedia, DSPStone, and BDTI suites. The reduced page-switch count achieved by our techniques yields an equivalent power savings, above and beyond the reduction achieved by the Use-Last TLB architecture implementation.
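The quantity these transformations minimize can be stated as a simple metric over an address trace. A minimal C sketch (hypothetical function name, 4 KiB pages assumed) that counts page switches, i.e., the accesses on which a Use-Last TLB must still perform a full lookup:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12u   /* 4 KiB pages assumed */

/* Count how often consecutive accesses fall in different pages.
 * With a Use-Last TLB, only these "page switches" pay for a full
 * TLB lookup, so the placement/transformation goal is to minimize
 * this count over the instruction (or data) address trace. */
size_t count_page_switches(const uint32_t *trace, size_t n)
{
    size_t   switches = 0;
    uint32_t last_vpn = ~0u;              /* sentinel: no page yet */

    for (size_t i = 0; i < n; i++) {
        uint32_t vpn = trace[i] >> PAGE_SHIFT;
        if (vpn != last_vpn) {
            switches++;                   /* full lookup needed here */
            last_vpn = vpn;
        }
    }
    return switches;                      /* first access counts as a switch */
}
```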

4.
Teller, P. J. Computer, 1990, 23(6): 26-36
Nine solutions to the cache consistency problem for shared-memory multiprocessors with multiple translation-lookaside buffers (TLBs) are described. A TLB's function is defined, and it is shown how TLB inconsistency arises in uniprocessor and multiprocessor architectures. The problem of TLB consistency is solved in a uniprocessor and in multiprocessors with a shared bus, virtual-address caches, and hardware cache consistency. Solutions that can be implemented in multiprocessors with more general interconnection networks and without hardware cache consistency are also presented.

5.
Virtual-memory-based cache coherence is a mechanism that relies only on hardware that already exists on the microprocessors of a shared-memory multiprocessor system, yet dynamically detects and resolves potential cache inconsistencies using virtual-memory techniques. The key feature of the approach is that the virtual-memory translation hardware on each processor is used to detect shared accesses that could lead to memory incoherence, and VM page-fault handlers execute the appropriate actions to maintain cache coherence. VM-based cache coherence trades design simplicity for increased software overheads. The work presented in this paper evaluates this trade-off. We show that VM-based cache coherence performs well for scientific applications that require significant aggregate memory bandwidth.
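The detection mechanism can be imitated at user level on a POSIX system: protect shared pages and let the resulting faults drive the coherence action. The sketch below is only a rough analogue under that assumption, not the paper's kernel-level implementation; it counts writes to read-protected pages in a SIGSEGV handler and then re-enables write access:

```c
/* User-level analogue of VM-based detection: mprotect() shared pages and
 * catch the resulting faults, where a coherence action could be run. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;
static char *shared_region;
static volatile sig_atomic_t fault_count;

static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    /* Align the faulting address down to its page and make it writable
     * again; a real scheme would run its coherence protocol here. */
    uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1);
    fault_count++;
    mprotect((void *)page, page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);
    shared_region = mmap(NULL, 4 * page_size, PROT_READ,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_fault;
    sigaction(SIGSEGV, &sa, NULL);

    shared_region[0] = 1;                 /* write fault -> handler runs   */
    shared_region[page_size] = 2;         /* fault on a different page     */
    printf("detected %d potentially incoherent writes\n", (int)fault_count);
    return 0;
}
```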

6.
This paper proposes a model for promoting memory pages to large pages on the IA-64 architecture, in which the data segment of an ELF executable is backed by large pages. Because the translation lookaside buffer (TLB) can then map a larger range of virtual memory, the miss rate is reduced, which improves system performance for high-performance computing (HPC) applications that use large pages, and for any memory-intensive application that uses a large amount of virtual memory.
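On Linux, the effect described here can be approximated by backing a region with huge pages explicitly. A minimal sketch follows (it assumes huge pages have been reserved beforehand, e.g. through /proc/sys/vm/nr_hugepages; the 16 MiB size is an arbitrary example):

```c
/* Back a heap-like region with large pages via MAP_HUGETLB. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LARGE_REGION (16UL * 1024 * 1024)     /* 16 MiB, multiple of the huge page size */

int main(void)
{
    void *buf = mmap(NULL, LARGE_REGION, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");          /* no huge pages reserved? */
        return 1;
    }
    /* One TLB entry now covers a whole huge page (e.g. 2 MiB on x86-64),
     * so a dense sweep touches far fewer TLB entries than with 4 KiB pages. */
    memset(buf, 0, LARGE_REGION);
    munmap(buf, LARGE_REGION);
    return 0;
}
```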

7.
何军, 张晓东, 郭勇. 《计算机工程》(Computer Engineering), 2012, 38(21): 253-256
To address the insufficient performance of the translation lookaside buffer (TLB) in a domestically developed processor, this paper analyzes the existing virtual-to-physical address translation flow and proposes adding an independent cache for third-level page-table base-address mappings, optimizing the data-TLB structure so that the virtual-to-physical mappings of lower-level page tables no longer evict those of higher-level page tables. SPEC CPU2000 results show that nearly half of the benchmarks reduce the data-TLB DM (miss) count by more than 60%, and a few by more than 90%, effectively lowering the data-TLB miss rate.

8.
GPUs are widely used in modern high-performance computing systems. To reduce the burden on GPU programmers, the operating system and GPU hardware provide extensive support for shared virtual memory, which enables the GPU and CPU to share the same virtual address space. Unfortunately, the current SIMT execution model of GPUs poses great challenges for virtual-to-physical address translation on the GPU side, mainly due to the huge number of virtual addresses generated simultaneously and the poor locality of these addresses. The resulting excessive TLB accesses increase the TLB miss ratio. As an attractive solution, the Page Walk Cache (PWC) has received wide attention for its ability to reduce the memory accesses caused by TLB misses. However, the current PWC mechanism suffers from heavy redundancy, which significantly limits its efficiency. In this paper, we first investigate the causes of this issue by evaluating the performance of PWC on typical GPU benchmarks. We find that the repeated L4 and L3 indices of virtual addresses increase the redundancy in PWC, and the low locality of L2 indices causes the low hit ratio in PWC. Based on these observations, we propose a new PWC structure, namely the Compressed Page Walk Cache (CPWC), to resolve the redundancy burden of the current PWC. CPWC can be organized in either direct-mapped or set-associative mode. Experimental results show that CPWC holds 3 times as many page table entries as TPC, improves the L2-index hit ratio by 38.3% over PWC, and reduces page-table memory accesses by 26.9%. The average number of memory accesses caused by each TLB miss is reduced to 1.13. Overall, the average IPC improves by 25.3%.
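The L4/L3 redundancy comes from how a radix page walk splits the virtual address. The C sketch below (assuming an x86-64-style four-level layout of 9-bit indices plus a 12-bit offset for illustration; the addresses are arbitrary examples, not from the paper) shows that nearby addresses differ only in their low-level indices, which is exactly what a compressed walk cache can exploit:

```c
#include <stdint.h>
#include <stdio.h>

/* x86-64-style 4-level walk assumed: 48-bit virtual address =
 * 4 x 9-bit table indices + 12-bit page offset. */
static void split_indices(uint64_t va)
{
    unsigned l4 = (va >> 39) & 0x1FF;
    unsigned l3 = (va >> 30) & 0x1FF;
    unsigned l2 = (va >> 21) & 0x1FF;
    unsigned l1 = (va >> 12) & 0x1FF;
    printf("va=0x%012llx  L4=%3u L3=%3u L2=%3u L1=%3u\n",
           (unsigned long long)va, l4, l3, l2, l1);
}

int main(void)
{
    /* Two addresses 64 KiB apart: only the L1 index differs, so a walk
     * cache that stores the shared (L4,L3,L2) prefix once can avoid
     * three of the four memory accesses of a full walk. */
    split_indices(0x00007f2a40010000ULL);
    split_indices(0x00007f2a40020000ULL);
    return 0;
}
```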

9.
Research on low-power design of the memory-access units in embedded processors (total citations: 2; self-citations: 0; citations by others: 2)
Taking the Loongson-1 (龙芯1号) processor as the subject, this paper explores low-power design techniques for the memory-access units of an embedded processor. By analyzing the structure, power consumption, and critical path of these units, and exploiting the principle of locality, a method is proposed that uses the history of recent virtual addresses to decide whether an access is needed, significantly reducing the number of accesses the TLB and cache make to their RAM blocks. On average, TLB power is reduced by 28.1%, cache power by 54.3%, and total processor power by 23.2%, while the critical-path delay actually decreases and processor performance improves slightly.

10.
For the Linux operating system, a low-power memory management unit for a 32-bit RISC embedded processor is implemented. A pre-comparison circuit added to the instruction TLB improves address-translation efficiency when the processor accesses the same virtual page consecutively and reduces power on instruction-TLB hits by 37.07%. Compared with a conventional design, the two-stage-comparison content-addressable memory saves 44.98% and 74.94% of power on misses and hits, respectively. The memory management unit designed in this paper cooperates well with Linux, providing address mapping and memory-access permission management.

11.
With the rapid development of multi/many-core processors, contention in the shared cache has become increasingly serious and restricts the performance of parallel programs. Recent research has employed the page coloring mechanism to realize cache partitioning on real systems and to reduce contention in the shared cache. However, page coloring-based cache partitioning has side effects: page coloring restricts the memory space an application can allocate, which may lead to memory pressure, and changing the cache partition dynamically requires massive page copying, which incurs a large overhead. To make page coloring-based cache partitioning more practical, this paper proposes a malloc allocator-based dynamic cache partitioning mechanism with page coloring. Memory allocated by our malloc allocator can be dynamically partitioned among different applications according to the partitioning policy. Coloring only the dynamically allocated pages relieves memory pressure and reduces the page-copying overhead caused by re-coloring, compared with coloring all pages. To further reduce the overhead, we introduce a minimum-distance page-copying strategy and a lazy-flush strategy. We conduct experiments on a real system to evaluate these strategies, and the results show that they work well for reducing cache misses and re-coloring overhead.
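Page coloring works because part of the last-level cache set index comes from the physical frame number, so frames of different colors can never conflict in the cache. A small C sketch (the 8 MiB / 16-way / 64 B / 4 KiB geometry and the function names are illustrative assumptions) of how a colored allocator could classify frames:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative geometry (assumptions): 8 MiB 16-way shared LLC,
 * 64 B lines, 4 KiB pages. */
#define CACHE_SIZE    (8u * 1024 * 1024)
#define WAYS          16u
#define LINE_SIZE     64u
#define PAGE_SIZE     4096u

#define SETS          (CACHE_SIZE / (WAYS * LINE_SIZE))       /* 8192 sets    */
#define SETS_PER_PAGE (PAGE_SIZE / LINE_SIZE)                 /* 64 sets/page */
#define NUM_COLORS    (SETS / SETS_PER_PAGE)                  /* 128 colors   */

/* The color of a physical frame is the part of the cache set index that
 * comes from the physical frame number; frames of different colors can
 * never conflict in the LLC. */
static unsigned page_color(uint64_t phys_frame_number)
{
    return (unsigned)(phys_frame_number % NUM_COLORS);
}

/* A colored allocator hands an application only frames whose color falls
 * in its assigned range; repartitioning changes the range and re-colors
 * (copies) only the dynamically allocated pages. */
static bool color_belongs_to(unsigned color, unsigned first, unsigned count)
{
    return color >= first && color < first + count;
}

int main(void)
{
    uint64_t pfn = 0x12345;
    printf("frame 0x%llx has color %u of %u\n",
           (unsigned long long)pfn, page_color(pfn), NUM_COLORS);
    printf("belongs to app0 (colors 0-63)? %d\n",
           (int)color_belongs_to(page_color(pfn), 0, 64));
    return 0;
}
```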

12.
In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared-memory paradigm, communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared-memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed reductions of up to 13.9% in execution time, 30.5% in cache misses, and 39.4% in the number of invalidation messages.

13.
Design and implementation of a memory-model simulator (total citations: 2; self-citations: 1; citations by others: 1)
Memory consistency and cache coherence are two of the most critical issues in shared-memory parallel computers. This work studies them quantitatively through simulation by designing and implementing a memory-model simulator, MMS. Using MMS, the behavior of several memory consistency models is simulated under different parallel-machine architecture models; the consistency models are compared on different kinds of computational problems and the experimental results are analyzed; several cache coherence protocols are also implemented and their performance is compared.

14.
The growing influence of wire delay in cache design has meant that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to the off-chip memory, because of the significant speed gap between processor and memory. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA has eliminated the effectiveness of replacement policies because banks operate independently of each other, and hence their replacement decisions are restricted to a single NUCA bank. In this paper, we propose three different techniques to deal with replacements in NUCA caches.

15.
User-level communication alleviates the software overhead of the communication subsystem by allowing applications to access the network interface directly. For that purpose, efficient virtual-to-physical address translation is critical. In this study, we propose a system-call-based address translation scheme in which every translation is done by the kernel, instead of by a translation cache on the network interface controller as in previous cache-based address translation. According to our experiments, our scheme achieves up to a 4.5% reduction in application execution time compared to the previous cache-based approach.
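As a point of reference for kernel-assisted translation, Linux already exposes per-page virtual-to-physical information through /proc/self/pagemap. The sketch below uses that existing interface only as a rough analogue of a translation system call, not as the paper's mechanism (reading real frame numbers requires privilege on recent kernels):

```c
/* Kernel-assisted virtual-to-physical lookup via /proc/self/pagemap:
 * each page has a 64-bit entry; bit 63 = present, bits 0-54 = PFN. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int virt_to_phys(uintptr_t vaddr, uint64_t *paddr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    uint64_t entry;
    off_t offset = (off_t)(vaddr / page_size) * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
        close(fd);
        return -1;
    }
    close(fd);

    if (!(entry & (1ULL << 63)))          /* page not present in RAM */
        return -1;
    uint64_t pfn = entry & ((1ULL << 55) - 1);
    *paddr = pfn * page_size + (vaddr % page_size);
    return 0;
}

int main(void)
{
    static int x = 42;                    /* something with a backing frame */
    uint64_t pa;
    if (virt_to_phys((uintptr_t)&x, &pa) == 0)
        printf("&x = %p  ->  physical 0x%llx\n",
               (void *)&x, (unsigned long long)pa);
    return 0;
}
```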

16.
As processor technology continues to advance at a rapid pace, the principal performance bottleneck of shared memory systems has become the memory access latency. In order to understand the effects of cache and memory hierarchy on system latencies, performance analysts perform benchmark analysis on existing multiprocessors. In this study, we present a detailed comparison of two architectures, the HP V-Class and the SGI Origin 2000. Our goal is to compare and contrast design techniques used in these multiprocessors. We present the impact of processor design, cache/memory hierarchies and coherence protocol optimizations on the memory system performance of these multiprocessors. We also study the effect of parallelism overheads such as process creation and synchronization on the user-level performance of these multiprocessors. Our experimental methodology uses microbenchmarks as well as scientific applications to characterize the user-level performance. Our microbenchmark results show the impact of L1/L2 cache size and TLB size on uniprocessor load/store latencies, the effect of coherence protocol design/optimizations and data sharing patterns on multiprocessor memory access latencies, and finally the overhead of parallelism. Our application-based evaluation shows the impact of problem size, dominant sharing patterns and number of processors used on speedup and raw execution time. Finally, we use hardware counter measurements to study the correlation of system-level performance metrics with the application's execution time performance. A preliminary version of this paper appeared in the 13th ACM International Conference on Supercomputing (ICS'99).(13) This work was done while Iyer and Bhuyan were at Texas A&M. It was supported in part by a Hewlett-Packard Equipment Grant. Amato and Rauchwerger were supported in part by NSF Grants ACI-9872126, EIA-9975018, EIA-0103742, EIA-9805823, ACR-0081510, ACR-0113971, CCR-0113974, EIA-9810937, EIA-0079874, by the DOE ASCI ASAP program, and by the Texas Higher Education Coordinating Board grant ATP-000512-0261-2001. Perdue was supported in part by a Dept. of Education Graduate Fellowship (GAANN).

17.
卢仕听, 尤凯迪, 韩军, 曾晓洋. 《计算机工程》(Computer Engineering), 2010, 36(21): 270-271, 274
A memory management unit (MMU) is designed for the MIPS32 4Kc processor. The module checks processor addresses for validity and maps virtual addresses either statically or dynamically according to the address space. The joint TLB (JTLB) is implemented in hardware as a three-stage pipeline, and dedicated micro-TLBs are designed for the processor's instruction and data ports to speed up TLB lookups. The interface between the MMU and the bus-interface module follows a simplified AMBA protocol. The module has been co-debugged with the processor, runs the Linux operating system, and has been functionally verified on an FPGA. After Design Compiler (DC) synthesis, the module occupies about 32K equivalent logic gates.

18.
Moving threads is a new approach for multicore processor architectures. Traditionally, each thread stays in the core where it was created, and data is moved from the main memory via caches to each core and thread. In the moving-threads approach, each core can access only a certain portion of the main memory via its local memory block, and extremely lightweight threads are therefore moved between the cores. As a consequence, all cache coherence problems and the need for read-reply messages are eliminated. Lamport's sequential consistency of shared-memory multiprocessor systems is also achieved for free. In this paper, we propose a processor architecture (MTPA) for the moving-threads paradigm. We describe the overall structure, operation, instruction set, and thread management mechanism, evaluate the proposed architecture through simulation with different functional-unit settings, and give early silicon area and power consumption estimates.

19.
Static cache partitioning can reduce inter-application cache interference and improve the composite performance of a cache-polluting application and a cache-sensitive application when they run on cores that share the last-level cache in the same multi-core processor. In a virtualized system, since different applications may run on different virtual machines (VMs) at different times, it is impractical to partition the cache statically in advance. This paper proposes a dynamic cache partitioning scheme that uses hot-page detection and page migration to improve the composite performance of co-hosted virtual machines dynamically, according to prior knowledge of cache-sensitive applications. Experimental results show that the overhead of our page migration scheme is low, while in most cases the composite performance improves over free composition.
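The migration primitive itself is also available to user space on Linux: move_pages() from libnuma relocates a page's backing frame. The sketch below is only an analogue of OS-assisted page migration between NUMA nodes, not the hypervisor-level re-coloring described here (link with -lnuma; node 0 is assumed to exist):

```c
/* Migrate one touched page to NUMA node 0 with move_pages(). */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    char *buf = aligned_alloc(page_size, page_size);
    memset(buf, 1, page_size);            /* touch it so a frame is allocated */

    void *pages[1]  = { buf };
    int   nodes[1]  = { 0 };              /* target NUMA node (assumed to exist) */
    int   status[1] = { -1 };

    /* pid 0 = calling process; MPOL_MF_MOVE moves only pages exclusive
     * to this process. */
    long rc = move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
    printf("move_pages rc=%ld, page status=%d\n", rc, status[0]);

    free(buf);
    return 0;
}
```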

20.
Current rendering processors aim to process triangles as fast as possible and tend to be equipped with multiple rasterizers capable of handling a number of triangles in parallel to increase polygon rendering performance. However, such parallel architectures may have a consistency problem when more than one rasterizer tries to access data at the same address. This paper proposes a consistency-free memory architecture for sort-last parallel rendering processors, in which a consistency-free pixel cache architecture is devised and effectively associated with three different memory systems consisting of a single frame buffer, a memory interface unit, and consistency-test units. Furthermore, the proposed architecture can reduce the latency caused by pixel cache misses because the rasterizer does not wait until cache-miss handling is completed when a pixel cache miss occurs. The experimental results show that the proposed architecture can achieve almost linear speedup up to four rasterizers with a single frame buffer.
