Similar Documents
1.
With the emergence of 3D-stacking technology, dynamic random-access memory (DRAM) can be stacked on chips to build a DRAM last-level cache (LLC). Compared with static random-access memory (SRAM), DRAM is larger but slower. Much existing work has been devoted to improving workload performance by using SRAM and stacked DRAM together, ranging from SRAM structure improvements to optimizing cache tag and data access. In contrast, little attention has been paid to designing an LLC scheduling scheme for multi-programmed workloads with different memory footprints. Motivated by this, we propose a self-adaptive LLC scheduling scheme that utilizes SRAM and 3D-stacked DRAM efficiently, achieving better workload performance. This scheduling scheme employs (1) an evaluation unit, which probes and evaluates cache information while programs are being executed; and (2) an implementation unit, which self-adaptively chooses SRAM or DRAM. To make the scheduling scheme work correctly, we develop a data migration policy. We conduct extensive experiments to evaluate the performance of our proposed scheme. Experimental results show that our method can improve multi-programmed workload performance by up to 30% compared with state-of-the-art methods.
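The abstract does not spell out the selection algorithm; the minimal Python sketch below only illustrates the general shape of such a scheme, with an evaluation unit probing per-program cache behavior and an implementation unit choosing SRAM or stacked DRAM. All class names, thresholds, and the footprint-based decision rule are illustrative assumptions, not the paper's actual design.

```python
# Illustrative sketch of a self-adaptive SRAM/DRAM LLC placement policy.
# All thresholds and structures are hypothetical; the paper's evaluation
# and implementation units are hardware components, not Python objects.

class ProgramStats:
    def __init__(self, name):
        self.name = name
        self.footprint_kb = 0      # estimated LLC working-set size
        self.miss_rate = 0.0       # observed LLC miss rate

class EvaluationUnit:
    """Probes cache information while programs execute (simplified)."""
    def probe(self, stats, footprint_kb, miss_rate):
        stats.footprint_kb = footprint_kb
        stats.miss_rate = miss_rate

class ImplementationUnit:
    """Chooses the SRAM or stacked-DRAM LLC for each program."""
    def __init__(self, sram_capacity_kb=4096):
        self.sram_capacity_kb = sram_capacity_kb

    def place(self, stats):
        # Small, latency-sensitive footprints go to the fast SRAM LLC;
        # large footprints that would thrash SRAM go to the big DRAM LLC.
        if stats.footprint_kb <= self.sram_capacity_kb and stats.miss_rate < 0.2:
            return "SRAM"
        return "DRAM"

eval_unit, impl_unit = EvaluationUnit(), ImplementationUnit()
prog = ProgramStats("mcf")
eval_unit.probe(prog, footprint_kb=16384, miss_rate=0.35)
print(prog.name, "->", impl_unit.place(prog))   # large footprint -> DRAM
```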

2.
A DRAM (dynamic RAM) with an on-chip cache, called the cache DRAM, has been proposed and fabricated. It is a hierarchical RAM containing a 1-Mb DRAM for the main memory and an 8-kb SRAM (static RAM) for cache memory. It uses a 1.2-μm CMOS technology. Suitable for no-wait-state memory access in low-end workstations and personal computers, the chip also serves high-end systems as a secondary cache scheme. It is shown how the cache DRAM bridges the gap in speed between high-performance microprocessor units and existing DRAMs. The cache DRAM concept is explained, and its architecture is presented. The error checking and correction scheme used to improve the cache DRAM's reliability is described. Performance results for an experimental device are reported.

3.
In this paper, we conduct performance scaling analysis of multithreaded multicore processors (MMPs) for parallel computing. We propose a thread-level closed-queuing network model covering a fairly large design space, accounting for hardware scaling models; coarse-grain, fine-grain, and simultaneous multithreading (SMT) cores; and shared resources, including cache, memory, and critical sections. We then derive a closed-form solution for this model in terms of the speedup performance measure. This solution makes it possible to analyze performance scaling properties of MMPs along multiple dimensions. In particular, we show that for the parallelizable part of the workload, the speedup, in the absence of resource contention, is no longer just a linear function of parallel processing unit counts, as predicted by Amdahl's law, but also a strong function of workload characteristics, ranging from strongly memory-bound to strongly CPU-bound workloads. We also find that with core multithreading, super-linear speedup, higher than that predicted by Amdahl's law, may be achieved for the parallelizable part of the workload if core threads exhibit strong cache affinity and the workload is strongly memory-bound. Then, we derive a tight speedup upper bound in the presence of both memory resource contention and critical sections for multicore processors with single-threaded cores. This speedup upper bound indicates that with resource contention among threads, whether it is due to shared memory or critical sections, a sequential term is guaranteed to emerge from the parallelizable part of the workload, fundamentally limiting the scalability of multicore processors for parallel computing, in addition to the sequential part of the workload, as dictated by Amdahl's law. As a result, to improve speedup performance for MMPs, one should strive to enhance memory parallelism and confine critical sections as locally as possible, e.g., to the smallest possible number of threads in the same core.
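For reference, classical Amdahl's law for a workload with parallelizable fraction f on n processing units is reproduced below, together with one hedged way of writing a contention-limited bound of the kind the abstract describes. The term beta and the exact form of the bound are illustrative only and may differ from the paper's model.

```latex
% Classical Amdahl's law: f = parallelizable fraction, n = processing units.
S(n) = \frac{1}{(1 - f) + \dfrac{f}{n}}, \qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - f}.

% Illustrative contention-aware bound (not the paper's exact formula):
% \beta models serialization per unit of parallel work caused by shared
% memory or critical sections, so a sequential term emerges from f itself.
S(n) \le \frac{1}{(1 - f) + \dfrac{f}{n} + \beta f}, \qquad
\lim_{n \to \infty} S(n) \le \frac{1}{(1 - f) + \beta f}.
```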

4.
In a multiprogrammed system, when the operating system switches contexts, in addition to the cost of handling the processes being swapped out and in, the cache performance of processors can also be affected. If frequent context switching replaces the data loaded into cache memory before it is completely reused, programs suffer from cache misses due to the damage to cache locality. In particular, for programs with good cache locality, such as blocked programs, a scheduling mechanism that preserves cache locality across context switches is essential to achieve good processor utilization. To meet this requirement, we propose a preemption-safe policy to exploit the cache locality of blocked programs in a multiprogrammed system. The proposed policy delays context switching until a block is fully reused, but also compensates for the monopolized processor time in the processor scheduling mechanism. Our simulation results show that when blocked programs run on multiprogrammed shared-memory multiprocessors, the proposed policy improves the performance of these programs due to a decrease in cache misses. In such situations, it also has a beneficial impact on overall system performance due to the enhanced processor utilization.
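A minimal Python sketch of the preemption-deferral idea described above: when a preemption request arrives while a blocked program is still reusing its current block, the switch is delayed until the block boundary, and the extra time the program held the processor is charged back at the next scheduling decision. The structure and accounting here are assumptions for illustration, not the paper's implementation.

```python
# Illustrative preemption-safe scheduling sketch: delay a context switch
# until the current cache block has been fully reused, then compensate
# for the monopolized processor time. All details are hypothetical.

class Task:
    def __init__(self, name):
        self.name = name
        self.in_block = False       # currently reusing a cached block?
        self.time_debt = 0.0        # extra time owed back to other tasks

def request_preemption(task, now, block_end_time):
    """Return the time at which the context switch actually happens."""
    if task.in_block and block_end_time > now:
        # Defer the switch so the block's cache locality is not destroyed,
        # and record the monopolized time as a debt for the scheduler.
        task.time_debt += block_end_time - now
        return block_end_time
    return now

def effective_priority(task, base_priority):
    # Tasks that overran their slice are de-prioritized proportionally.
    return base_priority - task.time_debt

t = Task("blocked_matmul")
t.in_block = True
switch_at = request_preemption(t, now=10.0, block_end_time=12.5)
print(switch_at, t.time_debt, effective_priority(t, base_priority=100))
```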

5.
In application-specific designs, customized memory organization expands the search space for cost-optimized solutions. Several optimization strategies can be applied to embedded systems with several different memory architectures: data cache, scratch-pad memory, custom memory architectures, and dynamic random-access memory (DRAM).

6.
Recent studies have shown that embedded DRAM (eDRAM) is a promising alternative to SRAM for 3D-stacked last-level caches (LLCs) due to its advantages over SRAM: (i) eDRAM occupies less area than SRAM because of its smaller bit-cell size; and (ii) eDRAM has much lower leakage power and access energy than SRAM, since it requires far fewer transistors. However, unlike SRAM cells, eDRAM cells must be refreshed periodically in order to retain their data. Since refresh operations consume a noticeable amount of energy, it is important to adopt an appropriate refresh interval, which is highly dependent on temperature. However, the conventional refresh method assumes the worst-case temperature for all eDRAM stacked cache banks, resulting in unnecessarily frequent refresh operations. In this paper, we propose a novel temperature-aware refresh scheme for 3D-stacked eDRAM caches. Our proposed scheme dynamically changes the refresh interval depending on the temperature of the eDRAM stacked last-level cache (LLC). Compared to the conventional refresh method, our proposed scheme reduces the number of refresh operations of the eDRAM stacked LLC by 28.5% (on a 32 MB eDRAM LLC), on average, with small area overhead. Consequently, our proposed scheme reduces the overall eDRAM LLC energy consumption by 12.5% (on a 32 MB eDRAM LLC), on average.
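The sketch below illustrates the kind of temperature-dependent refresh interval such a scheme might use, assuming the common rule of thumb that DRAM retention time roughly halves for every ~10 °C of temperature increase. The base interval, worst-case temperature, and halving constant are illustrative assumptions, not the values used in the paper.

```python
# Illustrative temperature-aware refresh interval for an eDRAM cache bank.
# Assumes retention roughly halves per ~10 degC rise (a common rule of
# thumb); the paper's actual retention model and constants may differ.

WORST_CASE_TEMP_C = 95.0       # temperature the conventional scheme assumes
BASE_INTERVAL_US = 32.0        # safe refresh interval at the worst-case temp

def refresh_interval_us(bank_temp_c):
    # Cooler banks retain data longer, so they can be refreshed less often.
    headroom = max(0.0, WORST_CASE_TEMP_C - bank_temp_c)
    return BASE_INTERVAL_US * (2.0 ** (headroom / 10.0))

for temp in (95.0, 75.0, 55.0):
    print(f"{temp:5.1f} C -> refresh every {refresh_interval_us(temp):8.1f} us")
# Cooler banks refresh 4x / 16x less often than the worst-case assumption.
```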

7.
A large-scale, cache-based multiprocessor that is interconnected by a hierarchical network such as hierarchical buses or a multistage interconnection network (MIN) is considered. An adaptive cache coherence scheme for the system is proposed based on a hardware approach that handles multiple shared reads efficiently. The new protocol allows multiple copies of a shared data block in the hierarchical network, but minimizes the cache coherence overhead by dynamically partitioning the network into sharing and nonsharing regions based on program behavior. The new cache coherence scheme effectively utilizes the bandwidth of the hierarchical networks and exploits the locality properties of parallel algorithms. Simulation experiments have been carried out to analyze the performance of the new protocol. The simulation results show that the new protocol gives 15% to 30% performance improvement over some existing cache coherence schemes on similar systems for a wide range of workload parameters.

8.
Multicore processors are widely used in today's computer systems. Multicore virtualization technology provides an elastic solution for utilizing multicore systems more efficiently. However, the Lock Holder Preemption (LHP) problem in virtualized multicore systems wastes significant CPU cycles, which hurts virtual machine (VM) performance and increases response latency. The more VMs the system consolidates, the worse the LHP problem becomes. In this paper, we propose an efficient consolidation-aware vCPU (CVS) scheduling scheme for multicore virtualization platforms. Based on the vCPU over-commitment rate, the CVS scheme adaptively selects one of three vCPU scheduling algorithms: co-scheduling, yield-to-head, or yield-to-tail, since the actions of vCPU scheduling can be decomposed into simple steps such as scheduling vCPUs simultaneously or inserting a vCPU into the run queue at the head or tail. The CVS scheme can effectively improve VM performance in low, middle, and high VM consolidation scenarios. Using real-life parallel benchmarks, our experimental results show that the proposed CVS scheme improves overall system performance while the optimization overhead remains low.
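A minimal sketch of the selection logic the abstract describes: the scheme picks one of the three vCPU scheduling actions according to the vCPU over-commitment rate (how many vCPUs share each physical core). The thresholds and the mapping from over-commitment bands to algorithms are illustrative assumptions only.

```python
# Illustrative consolidation-aware vCPU (CVS) algorithm selection.
# Thresholds and the band-to-algorithm mapping are hypothetical.

def overcommit_rate(total_vcpus, physical_cores):
    return total_vcpus / physical_cores

def select_algorithm(rate):
    if rate <= 1.5:
        # Low consolidation: co-schedule sibling vCPUs simultaneously.
        return "co-scheduling"
    elif rate <= 3.0:
        # Middle consolidation: move a lock waiter's vCPU to the run-queue head.
        return "yield-to-head"
    else:
        # High consolidation: push the yielding vCPU to the run-queue tail.
        return "yield-to-tail"

rate = overcommit_rate(total_vcpus=24, physical_cores=8)
print(f"over-commitment rate {rate:.1f} -> {select_algorithm(rate)}")
```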

9.
Dynamic Partitioning of Shared Cache Memory
This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches. Since the memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines this information and partitions the cache amongst the executing processes/threads. Partition sizes are varied dynamically to reduce the total number of misses. The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve the total IPC significantly over the standard least recently used (LRU) replacement policy. In a certain case, partitioning doubles the total IPC over standard LRU. Our results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory.
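The following sketch shows one standard way to turn per-process miss counts into partition sizes: cache ways are handed out greedily to whichever process gains the largest reduction in misses from one more way, using miss curves collected at run time. It is a generic marginal-gain allocator written from the abstract's description, not necessarily the authors' exact algorithm.

```python
# Illustrative greedy cache partitioning from per-process miss curves.
# misses[p][w] = estimated misses of process p when given w cache ways.
# This is a generic marginal-gain allocator, not the paper's exact scheme.

def partition(misses, total_ways):
    procs = list(misses.keys())
    alloc = {p: 0 for p in procs}
    for _ in range(total_ways):
        # Give the next way to the process whose misses would drop the most.
        best = max(procs,
                   key=lambda p: misses[p][alloc[p]] - misses[p][alloc[p] + 1])
        alloc[best] += 1
    return alloc

miss_curves = {
    "gcc": [900, 500, 300, 250, 240, 235, 234, 233, 233],
    "mcf": [950, 940, 930, 700, 400, 200, 150, 120, 110],
}
print(partition(miss_curves, total_ways=8))
# Partition sizes adapt to where each process actually benefits from capacity.
```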

10.
In virtualized environments, the VMM (virtual machine monitor) scheduler is critical to overall performance, as it allocates the physical resources. However, traditional schedulers deliver poor I/O performance for mixed workloads. Although recent approaches significantly improve I/O performance, they degrade the performance of computational tasks by shortening time slices and reducing cache efficiency. To eliminate these problems while guaranteeing I/O performance, this paper presents a multicore periodical preemption scheduling scheme with three optimization techniques: (1) periodically coalescing and handling I/O events to reduce the preemption rate and scheduling latency, which guarantees I/O performance; (2) taking advantage of multicore environments and centrally handling I/O events on different cores in a Round-Robin manner to lengthen time slices, which improves the performance of computational tasks; (3) using a dedicated priority for I/O event handling to preserve CPU fairness. We implement a Xen-based prototype and evaluate the performance of I/O workloads and computation-intensive workloads. The experimental results demonstrate that our scheduling scheme efficiently lengthens time slices and improves the performance of computational tasks, while achieving the same I/O performance as existing approaches optimized for I/O.

11.
From a system architecture perspective, 3D technology can satisfy the high memory bandwidth demands that future multicore/manycore architectures require. This article presents a 3D DRAM architecture design and the potential for using 3D DRAM stacking for both the L2 cache and main memory in a 3D multicore architecture.

12.
The efficiency of shared cache usage on multicore processors, i.e., program locality, is one of the key factors affecting parallel program performance. This paper presents a footprint-based theory of locality, describes the conversion relationships among miss rate, reuse distance, and footprint, and builds a locality prediction model for parallel programs by exploiting the composability of footprints.
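Two relations from locality theory help make the abstract concrete. The first (miss ratio of a fully associative LRU cache from the reuse-distance distribution) is standard; the second and third (miss ratio as the derivative of the footprint function, and the additive composability of footprints for co-running programs) are stated here in a generic form that may differ in detail from the paper's notation and assumptions.

```latex
% Standard relation: miss ratio of a fully associative LRU cache of size c
% from the reuse (stack) distance d of each access.
mr(c) = \Pr[\, d \ge c \,].

% Footprint-based form (generic statement; notation may differ from the
% paper): fp(x) is the average volume of distinct data touched in a window
% of length x, and
mr(c) \;\approx\; \frac{d\,fp(x)}{dx} \Big|_{\,c = fp(x)}.

% Composability for co-running programs sharing a cache (approximation):
fp_{\text{co-run}}(x) \;\approx\; \sum_{i} fp_i(x).
```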

13.
Servet is a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. These parameters can be used for autotuning codes to increase their performance on multicore clusters. Although Servet has been shown to accurately detect cache hierarchies, bandwidths, and bottlenecks in memory accesses, as well as the communication overhead among cores, the impact of using this information for application performance optimization had not previously been assessed. This paper presents a novel algorithm that automatically uses Servet for mapping parallel applications on multicore systems and analyzes its impact on three testbeds using three different parallel programming models: message-passing, shared memory, and partitioned global address space (PGAS). Our results show that a suitable mapping policy based on the data provided by this tool can significantly improve the performance of parallel applications without source code modification.

14.
The latest trends in high-performance computing show an increasing demand for using large-scale multicore systems efficiently, so that highly compute-intensive applications can be executed reasonably well. However, the exploitation of the degree of parallelism available at each multicore component can be limited by poor utilization of the memory hierarchy. In fact, the multicore architecture exhibits features already observed in shared-memory and distributed environments; one example is that subsets of cores can share different subsets of memory. In order to achieve high performance, it is imperative to allocate an application carefully to the available cores, based on a scheduling specification that considers not only processor characteristics but also memory contention. This paper proposes a multicore cluster representation that captures relevant performance characteristics of multicore systems, such as the influence of the memory hierarchy and contention on application performance. Improved performance was achieved by a branch-and-bound application for the set-partitioning problem that incorporated a memory-aware load balancing strategy based on the proposed multicore cluster representation. An in-depth analysis of this application's execution showed the approach's applicability to modern systems.

15.
Mei Yan, Wang Lisheng. Computer Engineering, 2005, 31(21): 78-80.
In embedded systems, multiple processes often coexist, and there is more or less data reuse among them. This paper discusses the goal of a process scheduling policy, namely maximizing the reuse of data in the cache to improve program execution efficiency. It covers three main aspects: logically partitioning the data into blocks, traversing the blocks in a chosen order, and restructuring the program code.
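A minimal Python illustration of the three aspects listed above, using blocked (tiled) matrix traversal as the running example: the data is partitioned into blocks, the blocks are visited in a chosen order, and the loop nest is restructured around the blocks. The block size and the summation example are illustrative assumptions, not taken from the paper.

```python
# Illustrative cache blocking (tiling): partition data into blocks, visit
# blocks in a chosen order, and restructure the loops around the blocks.
# Block size and the summation example are hypothetical.

N, B = 8, 4                      # matrix dimension and block size

def blocked_sum(matrix):
    total = 0
    # Outer loops enumerate blocks (the traversal order over blocks);
    # inner loops reuse each block fully before moving to the next one.
    for bi in range(0, N, B):
        for bj in range(0, N, B):
            for i in range(bi, min(bi + B, N)):
                for j in range(bj, min(bj + B, N)):
                    total += matrix[i][j]
    return total

matrix = [[i * N + j for j in range(N)] for i in range(N)]
assert blocked_sum(matrix) == sum(sum(row) for row in matrix)
print(blocked_sum(matrix))
```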

16.
Profile-based optimization can be used for instruction scheduling, loop scheduling, data preloading, function in-lining, and instruction cache performance enhancement. However, these techniques have not been embraced by software vendors because programs instrumented for profiling run significantly slower, an awkward compile-run-recompile sequence is required, and a test input suite must be collected and validated for each program. This paper introduces hardware-based profiling that uses traditional branch handling hardware to generate profile information in real time. Techniques are presented for both one-level and two-level branch hardware organizations. The approach produces high accuracy with small slowdown in execution (0.4%–4.6%). This allows a program to be profiled while it is used, eliminating the need for a test input suite. With contemporary processors driven increasingly by compiler support, hardware-based profiling is important for high-performance systems.

17.
To satisfy the need for ever-increasing processing power, the design of modern computing systems has changed significantly. Major chip vendors are deploying multicore or manycore processors across their product lines. Multicore architectures offer a tremendous amount of processing speed. At the same time, they bring challenges for embedded systems, which suffer from limited resources. Various cache memory hierarchies have been proposed to satisfy the requirements of different embedded systems. Normally, a level-1 cache (CL1) is dedicated to each core. However, the level-2 cache (CL2) can be shared (as in Intel Xeon and IBM Cell) or distributed (as in AMD Athlon). In this paper, we investigate the impact of the CL2 organization type (shared vs. distributed) on the performance and power consumption of homogeneous multicore embedded systems. We use the VisualSim and Heptane tools to model and simulate the target architectures running FFT, MI, and DFT applications. Experimental results show that by replacing a single-core system with an 8-core system, reductions in mean delay per core of 64% for distributed CL2 and 53% for shared CL2 are possible with little additional power (15% for distributed CL2 and 18% for shared CL2) for FFT. Results also reveal that the distributed CL2 hierarchy outperforms the shared CL2 hierarchy for all three applications considered and for other applications with similar code characteristics.

18.
Compared with conventional memory, persistent memory offers large capacity and non-volatility, providing new opportunities for building large-scale key-value storage systems. However, designing a persistent memory key-value system on multicore server architectures faces many challenges, including CPU cache thrashing caused by concurrency control, consumption of and contention for the limited write bandwidth of persistent memory, and intensified thread conflicts caused by the high latency of persistent memory. This paper proposes a multicore-friendly persistent memory key-value store (MPKV), which improves multicore concurrency by designing efficient concurrency control methods and reducing writes to persistent memory. To avoid the extra persistent memory write bandwidth consumed by lock resources, MPKV introduces a volatile lock management mechanism that separates write locks from the index and maintains them separately in DRAM (dynamic RAM). To guarantee crash consistency and improve concurrent query performance, MPKV introduces a two-phase atomic write mechanism, which uses the CPU's atomic write instructions to switch the system atomically from one consistent state to another and supports lock-free queries. Based on the volatile lock management mechanism, MPKV further proposes a concurrent write elimination mechanism to improve concurrency among update operations: when two conflicting update operations occur, one of them returns directly without any persistent memory allocation or write. Experiments show that MPKV achieves better performance and multicore scalability than pmemkv; with 18 threads, MPKV's throughput reaches 1.7x to 6.2x that of pmemkv.
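The following Python sketch only illustrates the structural idea of keeping write locks in volatile memory separate from the persisted index and switching between consistent states with a single atomic pointer update. MPKV itself is a native persistent-memory system; everything here (the names, the dict standing in for the persistent index, the thread locks standing in for DRAM lock words) is an illustrative assumption, not the paper's implementation.

```python
# Purely illustrative sketch of two MPKV ideas: (1) write locks live in
# volatile DRAM, separate from the persisted index; (2) an update prepares
# the new value first and then installs it with one atomic switch.
# Python dicts/locks stand in for persistent-memory structures.

import threading

persistent_index = {}              # stands in for the index kept in PM
volatile_locks = {}                # per-key write locks, kept in DRAM only
locks_guard = threading.Lock()

def key_lock(key):
    with locks_guard:
        return volatile_locks.setdefault(key, threading.Lock())

def put(key, value):
    with key_lock(key):            # serialize writers without touching PM
        new_entry = ("committed", value)    # phase 1: build the new state
        persistent_index[key] = new_entry   # phase 2: single atomic switch
        # (a real system would persist new_entry first and then atomically
        #  update and persist the 8-byte index pointer)

def get(key):
    entry = persistent_index.get(key)   # lock-free read of a consistent entry
    return entry[1] if entry else None

put("k1", "v1")
print(get("k1"))
```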

19.
Memory resources are a serious bottleneck in many real-time multicore systems. Previous work has shown that, in the worst case, the execution time of memory-intensive tasks can grow linearly with the number of cores in the system. To improve hard real-time utilization, a real-time multicore system should be scheduled with a memory-centric approach if its workload is dominated by memory-intensive tasks. In this work, a memory-centric scheduling technique is proposed in which (a) core isolation is provided through a coarse-grained (high-level) Time Division Multiple Access (TDMA) memory schedule; and (b) the scheduling policy of each core promotes the priority of its memory-intensive computations above CPU-only computation when memory access is permitted by the high-level schedule. Our evaluation reveals that under high memory demand, our scheduling approach can improve hard real-time task utilization significantly compared to traditional multicore scheduling.
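A small sketch of the two mechanisms named in the abstract: a coarse-grained TDMA schedule that decides which core may access memory in each high-level slot, and a per-core policy that promotes memory-intensive tasks above CPU-only tasks whenever that core holds the memory slot. Slot lengths, priority offsets, and the task model are illustrative assumptions.

```python
# Illustrative memory-centric scheduling: a coarse-grained TDMA schedule
# grants each core exclusive memory access in turn, and each core promotes
# its memory-intensive work while it owns the slot. Details are hypothetical.

NUM_CORES = 4
SLOT_LEN_MS = 10                       # length of one high-level TDMA slot

def memory_slot_owner(time_ms):
    """Which core may access memory at this time (round-robin TDMA)."""
    return (time_ms // SLOT_LEN_MS) % NUM_CORES

def pick_task(core, time_ms, ready_tasks):
    """ready_tasks: list of (name, is_memory_intensive, base_priority)."""
    owns_memory = (memory_slot_owner(time_ms) == core)

    def effective_priority(task):
        name, mem_intensive, prio = task
        # Promote memory-intensive tasks only while this core owns the slot;
        # otherwise prefer CPU-only work so the core is not stalled on memory.
        if mem_intensive:
            return prio + 100 if owns_memory else prio - 100
        return prio

    return max(ready_tasks, key=effective_priority)[0]

tasks = [("fft_copy_phase", True, 50), ("control_loop", False, 60)]
for t in (0, 10, 20, 30):
    print(f"t={t:2d} ms core0 runs {pick_task(0, t, tasks)}")
```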

20.
Networked Control Systems (NCSs) are pervasively applied in modern industry. With increasing functionality, modern NCSs tend to have dynamic workloads, hosting a variety of applications over a shared network. To handle workload variations and provide performance guarantees, a dynamic network scheduling scheme is highly desirable in NCSs. In this paper, we propose a network scheduling scheme, referred to as DTS, that makes on-the-fly decisions to schedule the applications in an NCS. DTS targets NCSs that use a time-triggered network as the shared medium and time division multiple access (TDMA) as the network access method. DTS dynamically changes the network access sequence of the applications so as to provide optimal system performance and maintain control stability. DTS adopts a decentralized scheduling mechanism in which each application makes its own local scheduling decision, enhancing the scalability of NCSs. Simulation results demonstrate the effectiveness of the proposed scheme, improving network bandwidth utilization and providing better system performance in NCSs compared with existing time-triggered scheduling schemes.
