Similar Articles
20 similar articles found (search time: 31 ms)
1.
Memory access scheduling is an effective way to improve the performance of Chip Multi-Processors (CMPs) by exploiting the timing characteristics of a DRAM. A memory access scheduler can subdivide resource utilization (banks and rows) to increase throughput by accessing different DRAM banks in parallel. However, different threads running on different cores may receive different service: one thread may starve while the others are serviced normally. It is therefore necessary to design a scheduler that reduces unfairness in the DRAM system while also improving system throughput across a variety of workloads and systems. In this paper, a distributed fair DRAM scheduler for two-dimensional mesh network-on-chips (NoCs), called DFDS, is presented. The key design points in DFDS are: (i) assessing the total waiting cycles of a memory request in the NoC and using them as a metric in arbitration; for this purpose, the waiting cycles of a memory request are carried in an additional flit in the packet and updated while traversing the NoC; and (ii) a semi-dynamic virtual channel allocation that delivers memory requests to the memory controllers (MCs) in order. Consequently, we can use a simple scheduling algorithm in the MCs instead of complex algorithms. To validate our approach, we apply synthetic workloads and real workloads from the Parsec benchmark suite. The results show the effectiveness of our approach: we reduce the waiting time of memory requests by up to 15%.
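To make the arbitration idea concrete, here is a minimal sketch under assumed names: each request carries a waiting-cycle counter that every router increments in flight, and the controller simply services the longest-waiting request. The flit-level encoding is illustrative; the paper's actual DFDS pipeline and semi-dynamic virtual-channel allocation are more involved.

```python
from dataclasses import dataclass

@dataclass
class MemRequest:
    src_core: int
    bank: int
    row: int
    waiting_cycles: int = 0   # carried in an extra flit, updated hop by hop

def route_hop(req, hop_latency):
    # Each router adds the cycles the packet spent queued/traversing this hop.
    req.waiting_cycles += hop_latency

def arbitrate(pending):
    # Age-based arbitration: the longest-waiting request wins, which bounds
    # starvation without a complex in-controller scheduling algorithm.
    return max(pending, key=lambda r: r.waiting_cycles)

# usage: the request delayed most in the NoC is serviced first
reqs = [MemRequest(0, 1, 7), MemRequest(1, 1, 9), MemRequest(2, 0, 3)]
for r, lat in zip(reqs, (5, 12, 2)):
    route_hop(r, lat)
print(arbitrate(reqs).src_core)  # -> 1, the most-delayed requester
```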

2.
The ever increasing memory demands of many scientific applications and the complexity of today's shared computational resources still require the occasional use of virtual memory, network memory, or even out-of-core implementations, with well-known drawbacks in performance and usability. In Mills et al. (Adapting to memory pressure from within scientific applications on multiprogrammed COWs. In: International Parallel and Distributed Processing Symposium, IPDPS, Santa Fe, NM, 2004), we introduced a basic framework for a runtime, user-level library, MMlib, in which DRAM is treated as a dynamically sized cache for large memory objects residing on local disk. Application developers can specify and access these objects through MMlib, enabling their applications to execute optimally under variable memory availability, using as much DRAM as fluctuating memory levels allow. In this paper, we first extend our earlier MMlib prototype from a proof of concept to a usable, robust, and flexible library. We present a general framework that enables fully customizable memory malleability in a wide variety of scientific applications. We provide several necessary enhancements to the environment-sensing capabilities of MMlib, and introduce a remote memory capability based on MPI communication of cached memory blocks between 'compute nodes' and designated memory servers. The increasing speed of interconnection networks makes a remote memory approach attractive, especially at the large granularity present in large scientific applications. We show experimental results from three important scientific applications that require the general MMlib framework. The memory-adaptive versions perform nearly optimally under constant memory pressure and execute harmoniously with other applications competing for memory, without thrashing the memory system. Under constant memory pressure, we observe execution time improvements by factors of three to five over relying solely on the virtual memory system. With remote memory employed, these factors are even larger and significantly better than other, system-level remote memory implementations.
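MMlib's actual interface is not given in the abstract, so the following is only a toy sketch of the core idea under assumed names: a fixed DRAM budget acts as an LRU cache over disk-backed blocks, and the budget shrinks or grows as sensed memory availability changes.

```python
from collections import OrderedDict

class ManagedStore:
    """Toy DRAM-as-cache for disk-resident blocks (hypothetical MMlib-like API)."""
    def __init__(self, backing_path, block_size, budget_blocks):
        self.block_size = block_size
        self.budget = budget_blocks          # adjusted as memory pressure changes
        self.cache = OrderedDict()           # block_id -> bytearray, in LRU order
        self.backing = open(backing_path, "w+b")

    def set_budget(self, blocks):
        self.budget = blocks                 # e.g. driven by sensed free memory
        while len(self.cache) > self.budget:
            self._evict_one()

    def access(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)  # mark most recently used
            return self.cache[block_id]
        if len(self.cache) >= self.budget:
            self._evict_one()
        self.backing.seek(block_id * self.block_size)
        raw = self.backing.read(self.block_size).ljust(self.block_size, b"\0")
        self.cache[block_id] = bytearray(raw)
        return self.cache[block_id]

    def _evict_one(self):
        victim_id, victim = self.cache.popitem(last=False)  # least recently used
        self.backing.seek(victim_id * self.block_size)
        self.backing.write(victim)           # write back before dropping from DRAM
```

The remote-memory variant described in the paper would replace the disk reads and writes here with MPI transfers to a designated memory server.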

3.
A DRAM (dynamic RAM) with an on-chip cache, called the cache DRAM, has been proposed and fabricated. It is a hierarchical RAM containing a 1-Mb DRAM for the main memory and an 8-kb SRAM (static RAM) for the cache memory, fabricated in a 1.2-μm CMOS technology. Suitable for no-wait-state memory access in low-end workstations and personal computers, the chip also serves high-end systems as a secondary cache. It is shown how the cache DRAM bridges the speed gap between high-performance microprocessors and existing DRAMs. The cache DRAM concept is explained, and its architecture is presented. The error checking and correction scheme used to improve the cache DRAM's reliability is described, and performance results for an experimental device are reported.

4.
Non-volatile memory (NVM) provides a scalable and power-efficient solution to replace dynamic random access memory (DRAM) as main memory. However, because of the relatively high latency and low bandwidth of NVM, it is often paired with DRAM to build a heterogeneous memory system (HMS). As a result, the application's data objects must be carefully placed in NVM and DRAM for the best performance. In this paper, we introduce a lightweight runtime solution that automatically and transparently manages data placement on an HMS without requiring hardware modifications or disruptive changes to applications. Leveraging online profiling and performance models, the runtime characterizes the memory access patterns associated with data objects and minimizes unnecessary data movement. Our runtime solution effectively bridges the performance gap between NVM and DRAM. We demonstrate that using NVM to replace the majority of DRAM can be a feasible solution for future HPC systems with the assistance of software-based data management.
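As a rough sketch of the placement decision such a runtime might make (the paper's profiling and performance models are not detailed in the abstract, so everything here is an assumption): rank objects by access density, pack the hottest into DRAM, and emit moves only for objects whose current placement disagrees, so unnecessary data movement is avoided.

```python
from collections import Counter

def plan_placement(access_counts, sizes, dram_capacity, current):
    """Return a list of (obj, dst) moves: hottest objects fill DRAM, rest go to NVM."""
    hot_first = sorted(sizes, key=lambda o: access_counts[o] / sizes[o], reverse=True)
    moves, used = [], 0
    for obj in hot_first:
        dst = "DRAM" if used + sizes[obj] <= dram_capacity else "NVM"
        if dst == "DRAM":
            used += sizes[obj]
        if current.get(obj) != dst:          # skip moves that change nothing
            moves.append((obj, dst))
    return moves

# usage: object A is hot, so it earns the single DRAM slot
counts = Counter({"A": 900, "B": 50})
print(plan_placement(counts, {"A": 64, "B": 64}, dram_capacity=64,
                     current={"A": "NVM", "B": "DRAM"}))
# -> [('A', 'DRAM'), ('B', 'NVM')]
```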

5.
SPM (Scratchpad Memory) is a common on-chip memory in real-time embedded systems. Its allocation is managed at compile time, so memory access latencies are known once compilation completes. Current SPM allocation methods mainly aim to reduce a program's average-case execution time. In hard real-time systems, however, the worst-case execution time (WCET) is the more critical metric. After analyzing the main difficulties in optimizing a program's WCET and the existing algorithms, this paper proposes a heuristic search algorithm, based on the notion of the degree of variable sharing, for allocating data variables to SPM so as to minimize the program's WCET. Experiments show that the proposed allocation method achieves better optimization results.
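The abstract does not spell out the heuristic or its variable-sharing metric, so the sketch below shows only a generic greedy baseline for the underlying problem: place data variables in SPM in order of estimated WCET gain per byte until capacity runs out. The paper's heuristic search would refine such a starting point.

```python
def spm_allocate(vars_info, spm_capacity):
    """vars_info: {name: (size_bytes, wcet_gain)}; wcet_gain is the estimated
    reduction in worst-case cycles if the variable lives in SPM (assumed known)."""
    # Greedy by gain density (worst-case cycles saved per SPM byte occupied).
    order = sorted(vars_info, key=lambda v: vars_info[v][1] / vars_info[v][0],
                   reverse=True)
    placed, free = [], spm_capacity
    for v in order:
        size, _ = vars_info[v]
        if size <= free:
            placed.append(v)
            free -= size
    return placed

print(spm_allocate({"buf": (512, 4000), "tab": (256, 3000), "tmp": (1024, 3500)}, 1024))
# -> ['tab', 'buf'] (densities ~11.7 and ~7.8 beat tmp's ~3.4)
```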

6.
The many-accelerator architecture, mostly composed of general-purpose cores and accelerator-like function units (FUs), has become a compelling alternative to homogeneous chip multiprocessors (CMPs) because of its superior power efficiency. However, the emerging many-accelerator processor exhibits a much more complicated memory access pattern than general-purpose processors (GPPs), because the abundant on-chip FUs tend to generate highly concurrent memory streams with distinct locality and bandwidth demands. The disordered memory streams issued by diverse accelerators interfere with one another and cannot be handled efficiently by the orthodox main memory interface, which provides only an inflexible data-fetching mode. Unlike a traditional DRAM memory, our proposed Aggregation Memory System (AMS) adapts to the characterized memory streams from different FUs: it offers the FUs different data-fetching sizes and protects their locality in memory access by intelligently interleaving their data across memory devices through sub-rank binding. Moreover, AMS can batch requests without sub-rank conflicts into a single read burst with our optimized memory scheduling policy. Experimental results from trace-based simulation show that AMS brings both a conspicuous performance boost and energy savings.
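A minimal sketch of the batching idea, with assumed request structures: reads bound to distinct sub-ranks can be merged into one burst, while a request contending for an already-claimed sub-rank is deferred to a later burst. The real AMS policy is hardware scheduling; this only illustrates the conflict test.

```python
def form_read_burst(queue):
    """Pick, in queue order, a set of reads whose sub-ranks don't conflict."""
    burst, busy_subranks = [], set()
    for req in queue:
        if req["subrank"] not in busy_subranks:
            burst.append(req)
            busy_subranks.add(req["subrank"])   # sub-rank claimed for this burst
    return burst

q = [{"id": 0, "subrank": 2}, {"id": 1, "subrank": 2}, {"id": 2, "subrank": 5}]
print([r["id"] for r in form_read_burst(q)])    # -> [0, 2]; id 1 conflicts on sub-rank 2
```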

7.
The growing gap between microprocessor speed and DRAM speed is a major problem facing computer designers. To narrow the gap, it is necessary to improve DRAM's speed and throughput. To this end, this paper proposes techniques that exploit the three-stage access of contemporary DRAM chips by grouping accesses to the same row together and interleaving the execution of memory accesses from different banks. A family of Bubble Filling Scheduling (BFS) algorithms is proposed to minimize the memory access schedule length and improve memory access time for embedded systems. When the memory access trace is known in advance, as in some application-specific embedded systems, this information can be fully utilized to generate efficient memory access schedules. The offline BFS algorithm generates schedules that are on average 47.49% shorter than in-order scheduling and 8.51% shorter than existing burst scheduling. When memory accesses arrive at a single memory controller in real time, they have to be scheduled as they come; the online BFS algorithm serves this purpose and generates schedules that are on average 58.47% shorter than in-order scheduling and 4.73% shorter than burst scheduling. To improve memory throughput and further shorten the schedule, an architecture with dual memory controllers is proposed. According to the experimental results, the dual-controller algorithm generates schedules that are on average 62.89% shorter than in-order scheduling, 14.23% shorter than burst scheduling, and 10.07% shorter than the single-controller BFS algorithms.
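The BFS algorithms themselves are not reproduced in the abstract; as a simplified stand-in, this sketch combines the two ingredients the paper exploits: issuing the run of accesses to a bank's currently open row together (avoiding repeated precharge/activate), and rotating across banks so one bank's activation can overlap another's data transfer.

```python
from collections import defaultdict, deque
from itertools import cycle

def schedule(accesses):
    """accesses: list of (bank, row) in arrival order. Returns a reordered schedule
    that keeps same-row runs together and round-robins across banks."""
    per_bank = defaultdict(deque)
    for bank, row in accesses:
        per_bank[bank].append((bank, row))
    order, remaining = [], len(accesses)
    banks = cycle(sorted(per_bank))
    while remaining:
        b = next(banks)
        q = per_bank[b]
        if not q:
            continue                      # this bank is drained; try the next one
        open_row = q[0][1]
        while q and q[0][1] == open_row:  # drain the run hitting the open row
            order.append(q.popleft())
            remaining -= 1
    return order

print(schedule([(0, 1), (1, 4), (0, 1), (0, 2), (1, 4)]))
# -> [(0, 1), (0, 1), (1, 4), (1, 4), (0, 2)]
```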

8.
Phase-change memory (PCM), with its non-volatility, high read speed, and low static power, has become a hot topic in main-memory research. However, the current lack of available PCM devices means that PCM-based algorithm research cannot be validated effectively. This paper therefore proposes simulating and validating PCM algorithms with a main-memory simulator. It first reviews the characteristics of existing main-memory simulators and shows that they do not fully meet the practical needs of current main-memory research; on this basis, it designs and builds a hybrid main-memory simulator based on DRAM and PCM. Experimental comparisons with existing simulators show that the proposed hybrid simulator effectively models a hybrid DRAM/PCM memory architecture, supports the simulation of different forms of hybrid main-memory systems, and is highly configurable. Finally, a usage example demonstrates the ease of use of the simulator's programming interface.
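The simulator's real programming interface is only said to be easy to use, so the toy model below merely conveys the shape of such a tool: a configurable address split between a DRAM region and a PCM region with asymmetric read/write latencies, the asymmetry against which PCM algorithms must be validated. All parameter values are invented.

```python
class HybridMemSim:
    """Toy DRAM+PCM timing model (all latencies illustrative, in ns)."""
    def __init__(self, dram_bytes, lat=None):
        self.dram_bytes = dram_bytes                     # [0, dram_bytes) is DRAM
        self.lat = lat or {("DRAM", "r"): 50, ("DRAM", "w"): 50,
                           ("PCM", "r"): 150, ("PCM", "w"): 500}
        self.total_ns = 0

    def access(self, addr, op):                          # op: "r" or "w"
        region = "DRAM" if addr < self.dram_bytes else "PCM"
        self.total_ns += self.lat[(region, op)]
        return region

sim = HybridMemSim(dram_bytes=1 << 20)
sim.access(0x100, "r"); sim.access(0x200000, "w")        # DRAM read + PCM write
print(sim.total_ns)                                      # -> 550
```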

9.
On-chip distributed memory systems have become an attractive solution for the massively parallel memory accesses found in future many-core processors. However, the increasing number of on-chip cores and memory controllers inevitably introduces many remote memory accesses, which generate a large amount of on-chip traffic and put great pressure on the interconnect. This paper optimizes on-chip memory access traffic via runtime thread migration. We first analyze memory access behavior in multi-threaded applications and find that memory access targets and volumes are similar over short periods, which makes runtime prediction feasible; but the access targets exhibit great mobility over long periods, motivating us to dynamically move threads towards their data. Based on these observations, we propose a novel low-cost, distributed thread migration algorithm that adjusts thread placement in chains based on benefit estimation. We present details of the workflow, including the triggering and arbitration of migration requests and the procedure that determines the migration chains. Simulation results show that our algorithm achieves a system performance speedup of 11.5% and reduces average memory access latency by 11.0%. It finds a few but effective thread migrations that optimize on-chip memory access traffic with acceptable hardware and runtime overheads.
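The chain-based protocol is more elaborate than can be shown here, but the benefit estimate that might trigger a migration request can be sketched as follows (the cost model and names are assumptions): compare a thread's epoch traffic, access volume times hop distance to each memory controller, at its current tile and at a candidate tile, and request migration only when the saving exceeds a migration cost.

```python
def hops(a, b, mesh_width):
    ax, ay = a % mesh_width, a // mesh_width
    bx, by = b % mesh_width, b // mesh_width
    return abs(ax - bx) + abs(ay - by)        # Manhattan distance under XY routing

def traffic(core, accesses, mesh_width):
    # accesses: {memory_controller_tile: volume} profiled over the last epoch
    return sum(v * hops(core, mc, mesh_width) for mc, v in accesses.items())

def should_migrate(cur, cand, accesses, mesh_width, migration_cost):
    saving = traffic(cur, accesses, mesh_width) - traffic(cand, accesses, mesh_width)
    return saving > migration_cost            # migrate only if it pays for itself

# a thread on tile 0 of a 4x4 mesh mostly accessing the MC at tile 15
print(should_migrate(0, 10, {15: 1000, 3: 100}, 4, migration_cost=800))  # -> True
```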

10.
When massive data requests hit a heterogeneous memory system, pages migrate frequently back and forth between dynamic random access memory (DRAM) and non-volatile memory (NVM). However, migration policies designed for conventional memory pages have difficulty adapting to rapid dynamic changes in page hotness, so that pages migrated from DRAM to N...

11.
This paper explores how research teams in Intel's Digital Health Group are using ethnography to identify 'designable moments': spaces, times, objects, issues, and practices that suggest opportunities for appropriate interventions. It argues that technology innovation should incorporate the views, experiences, and practices of users from the start of the design process, to support independent living and to develop culturally sensitive enhancements that contribute towards wellbeing and a life of quality for local older populations.

12.
On cc-NUMA multiprocessors, the non-uniformity of main memory latencies motivates the need for co-location of threads and data. We call this special form of data locality geographical locality. In this article, we study the performance of a parallel PDE solver with adaptive mesh refinement (AMR). The solver is parallelized using OpenMP, and the adaptive mesh refinement makes dynamic load balancing necessary. Because of the dynamically changing memory access pattern caused by the runtime adaptation, achieving a high degree of geographical locality is a challenging task. The main conclusions of the study are: (1) geographical locality is very important for the performance of the solver; (2) performance can be improved significantly using dynamic page migration of misplaced data; (3) a migrate-on-next-touch directive works well, whereas the first-touch strategy is less advantageous for programs exhibiting a dynamically changing memory access pattern; and (4) the overhead of such migration is low compared to the total execution time.
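Migrate-on-next-touch is an OS facility (conclusion 3 above), so a user-level sketch can only model the policy, not invoke it: after each rebalancing step pages are flagged, and the first subsequent touch moves a page to the toucher's node, whereas first-touch would have frozen the placement at initialization. A toy model, with hypothetical structures:

```python
class Page:
    def __init__(self, node):
        self.node = node                 # NUMA node currently holding the page
        self.migrate_on_touch = False

def mark_for_migration(pages):
    for p in pages:                      # issued after each load-rebalancing step
        p.migrate_on_touch = True

def touch(page, toucher_node):
    if page.migrate_on_touch:
        page.node = toucher_node         # data follows the thread that now uses it
        page.migrate_on_touch = False
    return page.node

p = Page(node=0)                         # first-touch placed the page on node 0
mark_for_migration([p])                  # mesh refinement moved its work to node 1
print(touch(p, toucher_node=1))          # -> 1: the page migrated on next touch
```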

13.
Sohie, G.R.L.; Kloker, K.L. 《Micro, IEEE》 1988, 8(6): 49-67
An overview is given of Motorola's DSP96002, a digital signal processor that implements IEEE-standard floating-point arithmetic. It is designed for graphics, image processing, spectral analysis, and scientific computing applications. Performance peaks at 40.5 Mflops (million floating-point operations per second) and 13.5 MIPS (million instructions per second); it delivers 18 Mflops on assembly-language benchmarks. The DSP is software-compatible with the fixed-point 56000/1 family architecture and instruction set. The 96002 achieves compatibility with other processors and databases, higher mathematical accuracy, and better error handling than implementations that do not conform to the IEEE standard. The 96002's on-chip memories, dual-bus architecture, and transparent DMA suit multiprocessor systems in which many 96002s are connected with a minimum of external components. These features result in a smaller-footprint, lower-cost system than other microprocessors or data-path chips. On-chip support for the fast access modes of external memories achieves near-SRAM (static random-access memory) performance with high-density DRAM/VRAM (dynamic RAM/video RAM) devices. An on-chip circuit emulation controller provides full access to and control of the machine state for system debugging. A variety of software and hardware development tools support the 96002.

14.
Memory diagnostics are important for improving the resilience of DRAM main memory. As bit-cell sizes reach physical limits, DRAM will become more likely to suffer both transient and permanent errors. Memory diagnostics that operate online can be one component of a comprehensive strategy to mitigate errors. This paper presents a novel approach, Asteroid, that integrates online memory diagnostics into workload execution. The approach supports diagnostics that adapt at runtime to workload behavior and resource availability, maximizing test quality while reducing performance overhead. We describe Asteroid's design and how it can be efficiently integrated with the hierarchical memory allocator in modern operating systems. We also show how the framework enables control policies to dynamically configure a diagnostic. Using an adaptive policy on a 16-core server, Asteroid has a modest overhead of 1-4% for workloads with low to high memory demand. For these workloads, Asteroid's adaptive policy achieves good error coverage and can test memory thoroughly.
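Asteroid's integration with the kernel allocator cannot be reproduced from the abstract; the sketch below only illustrates the kind of test a memory region could receive while held out of the allocator's pool: a simplified march-style pass that writes and verifies complementary patterns. A real diagnostic would run on physical frames and use a richer pattern set.

```python
def march_test(buf):
    """Simplified march test over a bytearray: ascending write/verify 0x55,
    then flip to 0xAA and verify descending."""
    for i in range(len(buf)):
        buf[i] = 0x55
    for i in range(len(buf)):
        if buf[i] != 0x55:
            return False                 # stuck-at or coupling fault detected
        buf[i] = 0xAA
    for i in reversed(range(len(buf))):
        if buf[i] != 0xAA:
            return False
    return True

# while a frame is checked out from the allocator's free list:
frame = bytearray(4096)
print(march_test(frame))                 # -> True: healthy frame rejoins the pool
```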

15.
Demand for memory capacity and bandwidth keeps increasing rapidly in modern computer systems, and memory power consumption is becoming a considerable portion of the system power budget. However, the current DDR DIMM standard is not well suited to serving CMP memory requests effectively, from either a power or a performance perspective. We propose a new memory module called the multicore DIMM, in which the DRAM chips are grouped into multiple virtual memory devices, each of which has its own data path and receives separate commands. The multicore DIMM is designed to improve the energy efficiency of memory systems with small impact on system performance. Dividing each memory module into four virtual memory devices brings simultaneous improvements of 22% in memory power, 7.6% in IPC, and 18% in system energy-delay product on a set of multithreaded applications and consolidated workloads.

16.
In order to exploit parallel resources, most transactional memory (TM) systems execute atomic blocks concurrently and must thus be prepared for data conflicts. In the event of a conflict, a TM system must choose a policy that decides when and how to manage the resulting contention. In this paper, we analyze the interplay between conflict resolution time and contention management policy in the context of hardware-supported TM systems, highlighting the performance and implementation implications of the various points in the design space. We show that both policy decisions have a significant impact on the ability to exploit available parallelism and to ensure the progress of individual transactions. Our analysis corroborates previous research findings that stalling (especially at access time, then retrying the problematic access) helps side-step conflicts and avoid wasted work. We demonstrate that conflict resolution time has the dominant effect on performance: Lazy (which delays resolution to commit time) uncovers more parallelism than Eager (which resolves conflicts at access time). With Lazy, although an aborted transaction tends to waste more work (31% more than under Eager), the cumulative amount of wasted work is lower because fewer transactions are aborted (1.6× fewer than under Eager). Lazy's delayed conflict resolution also ensures progress and decreases the likelihood of pathologies (like livelock), while Eager needs sophisticated priority mechanisms in its contention management to avoid pathologies. Finally, we evaluate a mixed conflict resolution scheme that detects write-write conflicts eagerly while detecting read-write conflicts lazily, and show that it provides a good compromise between implementation complexity and exploiting concurrency for performance.
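As a toy illustration of the lazy end of this axis (not a real hardware TM), the sketch below validates a transaction's read set against globally committed versions only at commit time, so conflicting transactions run to completion and the later committer aborts; an eager design would run the same check on every access instead.

```python
versions = {}                            # committed version per memory location

class Txn:
    def __init__(self):
        self.reads, self.writes = {}, {}
    def read(self, loc):
        # Record the version we observed; return our own pending write if any.
        self.reads.setdefault(loc, versions.get(loc, 0))
        return self.writes.get(loc)
    def write(self, loc, val):
        self.writes[loc] = val           # buffered until commit (lazy versioning)
    def commit(self):                    # lazy: conflicts are resolved only here
        for loc, seen in self.reads.items():
            if versions.get(loc, 0) != seen:
                return False             # someone committed under us: abort
        for loc in self.writes:
            versions[loc] = versions.get(loc, 0) + 1
        return True

t1, t2 = Txn(), Txn()
t1.read("x"); t2.read("x")
t1.write("x", 1); t2.write("x", 2)
print(t1.commit(), t2.commit())          # -> True False: the later committer aborts
```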

17.
Memory affinity has become a key element in achieving scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation, and cache prefetching are commonly employed to enhance memory affinity, which keeps data close to the cores that access it. Software transactional memory (STM) applications in particular exhibit irregular memory access behavior, which makes it harder to determine which data will be needed by each core, and when. Additionally, existing STM runtime systems are decoupled from concerns such as thread and memory management. In this paper, we therefore propose a skeleton-driven mechanism to improve memory affinity for STM applications that fit the worklist pattern, employing a two-level approach. First, it addresses memory affinity at the DRAM level by automatically selecting page allocation policies. Then it employs data-prefetching helper threads to improve affinity at the cache level. It relies on a skeleton framework that exploits the application pattern to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that the proposed mechanism achieves performance improvements of up to 46%, with an average of 11%, over a baseline version on two NUMA multi-core machines.
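The helper-thread idea can be sketched for a generic worklist as follows; the structure and names are assumptions, and in CPython the GIL makes this merely illustrative, since real cache warming relies on truly concurrent threads. A helper walks a bounded distance ahead of the worker, touching upcoming items so their cache lines are warm when the worker reaches them:

```python
import threading

def run_worklist(items, process, distance=8):
    done = [0]                               # worker's progress (shared index)
    results = []

    def helper():
        i = 0
        while i < len(items):
            if i - done[0] < distance:       # stay at most `distance` items ahead
                _ = sum(items[i])            # dummy read: warms the cache lines
                i += 1

    h = threading.Thread(target=helper, daemon=True)
    h.start()
    for idx, it in enumerate(items):
        results.append(process(it))          # e.g. the STM transaction body
        done[0] = idx + 1
    h.join()
    return results

print(run_worklist([[1, 2], [3, 4], [5, 6]], process=sum))  # -> [3, 7, 11]
```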

18.
Memristor-Based Storage: Research and Prospects
With information growing explosively and CMOS feature sizes gradually approaching their theoretical limits, the demand for nanoscale storage devices is becoming increasingly urgent. The memristor is considered to have great potential to replace dynamic RAM and to accommodate mass storage. Building on a survey of the origin and development of the memristor and memristive-system concepts, this paper discusses the characteristics of memristors as storage cells, and reviews resistive memristor storage structures and the corresponding read/write methods. After summarizing the open problems in current research, it discusses directions for future work.

19.
Greene, F.S. 《Computer》 1972, 5(1): 31-39
Power reduction techniques are described for both LSI memory components and systems. These techniques have been verified experimentally for both read-only and random-access read/write components, using power-switching circuits external to the chips. A number of ways to apply on-chip power switching are described. The power-switching concept is also described for memory cards and systems that can be organized into blocks.

20.
Low-Power Memory Techniques Based on Program Memory-Access Patterns
To keep pace with ever-increasing computing capability, the memory systems of mobile handheld devices have grown more complex in structure and larger in capacity. This trend makes the memory system, chiefly the on-chip caches and main memory, account for a steadily rising share of total system energy. Since handheld devices are mostly battery-powered and battery capacity is very limited, low-power memory system design is critically important. Although existing memory devices provide some hardware support for energy saving, this potential can be fully exploited only in combination with the regularities of an application's memory access behavior. This paper surveys and summarizes existing low-power memory techniques, presents the concept of a program's memory-access pattern, distills three aspects of its meaning, and then details how memory-access patterns are applied in low-power techniques for on-chip caches and main memory. Finally, it outlines possible directions for future low-power memory system research based on memory-access patterns.
