Similar Documents
20 similar documents found (search time: 31 ms)
1.
The realization of modern processors is based on a multicore architecture with an increasing number of cores per processor. Multicore processors are often designed such that some level of the cache hierarchy is shared among cores; usually the last-level cache (e.g., L3) is shared among several or all cores, while each core has private lower-level caches (e.g., L1 and L2). Superlinear speedup is possible for a matrix multiplication algorithm executed on a shared-memory multiprocessor due to the existence of a superlinear region: a region of problem sizes where the cache requirements for matrix storage cause sequential execution to incur more cache misses than parallel execution. This paper shows theoretically and experimentally that there is a region where superlinear speedup can be achieved. We provide a theoretical proof of the existence of superlinear speedup and determine the boundaries of the region where it can be achieved, and the experiments confirm our theoretical results. These results will therefore have an impact on future software development and on the exploitation of parallel hardware based on shared-memory multiprocessor architectures. Copyright © 2013 John Wiley & Sons, Ltd.
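As a rough illustration (not the paper's actual derivation), the region can be sketched with a simple capacity argument: sequential execution must hold the ~3n² working set of C = A×B against one core's private cache, while parallel execution spreads it over p private caches. The 512 KiB cache size, 8-core count, and 3n² working-set estimate below are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdio.h>

/* Sketch: a problem size n is a superlinear-region candidate when the
 * ~3*n*n*sizeof(double) working set overflows one core's private cache
 * but fits in the p private caches used by the parallel run. */
static bool in_superlinear_region(size_t n, size_t private_cache_bytes, int p) {
    size_t working_set = 3 * n * n * sizeof(double);
    return working_set > private_cache_bytes &&
           working_set <= (size_t)p * private_cache_bytes;
}

int main(void) {
    /* Illustrative 512 KiB private cache, 8 cores. */
    for (size_t n = 64; n <= 2048; n *= 2)
        printf("n=%4zu -> %s\n", n,
               in_superlinear_region(n, 512 * 1024, 8) ? "superlinear candidate"
                                                       : "normal");
    return 0;
}
```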

2.
Data prefetching is an effective data access latency-hiding technique that masks the CPU stalls caused by cache misses and bridges the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to the processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into the mainstream. While some of the single-core prefetching t...
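A minimal software-prefetching sketch in the spirit of the techniques surveyed, using the GCC/Clang `__builtin_prefetch` intrinsic; the prefetch distance of 16 elements is an illustrative tuning parameter, normally chosen per machine:

```c
#include <stddef.h>

#define DIST 16  /* illustrative prefetch distance */

/* Fetch a[i + DIST] toward the cache while summing a[i], overlapping the
 * memory latency of future iterations with current work. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}
```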

3.
张延松, 张宇, 王珊. 《软件学报》, 2018, 29(3): 883-895
Graph-analytics database systems represented by MapD support high-performance analytical processing through new many-core processors such as GPUs and Phi, yet the join operation remains a major performance bottleneck for complex data schemas. In recent years, heterogeneous processors have become a mainstream platform for high-performance computing, and research on in-memory join performance has expanded from multicore CPUs to emerging many-core processors; however, the existing results do not systematically reveal the intrinsic relationships among join-algorithm performance, join dataset size, and hardware architecture, and thus cannot provide platform-selection strategies for optimizing joins in databases on future heterogeneous platforms. Targeting in-memory join optimization on multicore CPU, Xeon Phi, and GPU platforms, this paper optimizes the in-memory hash-table design by replacing hash mapping with vector mapping, eliminating the influence of hashing overhead on the join algorithm, so that the performance characteristics of in-memory joins can be measured more accurately under hardware-related factors such as the cache size of multicore CPUs, the cache size of Xeon Phi, the concurrent multithreading of Xeon Phi, and the GPU's SIMT (single instruction, multiple threads) mechanism. Experimental results show that caching and concurrent multithreading are the key factors affecting in-memory join performance: the cache mechanism is advantageous for joins whose working set fits in cache, the GPU's concurrent multithreading achieves higher performance for joins on larger tables, and Xeon Phi performs best for joins that fit within its L2 cache. The results reveal the relationship between in-memory join performance and the hardware characteristics of heterogeneous processors, and provide optimization strategies for the query optimizers of in-memory databases on future heterogeneous platforms.
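The paper's core trick of replacing hash mapping with vector mapping can be sketched as follows for a foreign-key join: when dimension keys are dense in [0, d), the "hash table" degenerates into a plain array indexed by key, so probing is a single load with no hash computation or collision handling. An illustrative sketch under that dense-key assumption, not the authors' code:

```c
#include <stdint.h>
#include <stdlib.h>

/* Vector-mapping join sketch: dimension keys assumed dense in [0, d).
 * Build: vec[key] = payload. Probe: one array load per fact tuple. */
void vector_join(const uint32_t *dim_keys, const uint32_t *dim_payload, size_t d,
                 const uint32_t *fact_fk, size_t f, uint32_t *out) {
    uint32_t *vec = malloc(d * sizeof *vec);
    if (!vec) return;
    for (size_t i = 0; i < d; i++)
        vec[dim_keys[i]] = dim_payload[i];      /* build the mapping vector */
    for (size_t i = 0; i < f; i++)
        out[i] = vec[fact_fk[i]];               /* probe: no hashing, no chains */
    free(vec);
}
```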

4.
In a multiprogrammed system, when the operating system switches contexts, the cache performance of processors can be affected in addition to the cost of handling the processes being swapped out and in. If frequent context switching evicts the data loaded into cache memory before they are completely reused, programs suffer cache misses due to the damage to cache locality. In particular, for programs with good cache locality, such as blocked programs, a scheduling mechanism that preserves cache locality across context switches is essential to achieve good processor utilization. To meet this requirement, we propose a preemption-safe policy that exploits the cache locality of blocked programs in a multiprogrammed system. The proposed policy delays context switching until a block is fully reused, but compensates for the monopolized processor time in the processor scheduling mechanism. Our simulation results show that when blocked programs run on multiprogrammed shared-memory multiprocessors, the proposed policy improves their performance through a decrease in cache misses, and it also benefits overall system performance through enhanced processor utilization.

5.
赵姗, 杨秋松, 李明树. 《软件学报》, 2019, 30(4): 1164-1190
To satisfy the diverse requirements of applications, heterogeneous multicore processors have emerged and gradually entered the market. Their cores have different microarchitectures or instruction set architectures (ISAs), providing diverse capabilities such as instruction-level parallelism (ILP) and memory-level parallelism (MLP); these cores cooperate to meet the optimization goals of the whole computing system, such as high performance, low power, or good energy efficiency. However, mainstream scheduling techniques are designed mainly for traditional homogeneous processor architectures and do not account for differences in heterogeneous hardware capabilities. On heterogeneous multicore processors, how scheduling can become aware of hardware heterogeneity and provide each type of application with better-matched hardware resources is a problem worth exploring. This paper surveys recent results in this research area, analyzing the main issues facing heterogeneous scheduling, particularly on performance-asymmetric multicore architectures: optimization goals, analytical models, scheduling decisions, and algorithm evaluation. The related techniques are then systematically summarized, and future research directions are discussed from the perspective of hardware/software co-design.

6.
Nowadays, real-time embedded applications have to cope with a growing demand for functionality, which requires increasing processing capability. To this end, real-time systems are being implemented on top of high-performance multicore processors that run multithreaded periodic workloads by allocating threads to individual cores. In addition, to improve both performance and energy savings, the industry is introducing new multicore designs, such as ARM's big.LITTLE, that include heterogeneous cores in the same package. A key issue in improving energy savings in multicore embedded real-time systems while reducing deadline misses is to accurately estimate task execution times under the supported processor frequencies. Two main aspects make this estimation difficult. First, the running threads compete with each other for shared resources. Second, almost all current microprocessors implement Dynamic Voltage and Frequency Scaling (DVFS) regulators to dynamically adjust the voltage/frequency at run-time according to workload behavior. Existing execution-time estimation models rely on off-line analysis or on the assumption that task execution time scales linearly with processor frequency, which can introduce large deviations because the memory system uses a different power supply. In contrast, this paper proposes the Processor–Memory (Proc–Mem) model, which dynamically predicts each task's execution time at each implemented processor frequency. A power-aware EDF (Earliest Deadline First)-based scheduler using the Proc–Mem approach has been evaluated and compared against the same scheduler using a typical Constant Memory Access Time (CMAT) model. Results on a heterogeneous multicore processor show that the average deviation of Proc–Mem from the actual measured execution time is only 5.55%, while the average deviation of the CMAT model is 36.42%. These results translate into important energy savings, 18% on average and up to 31% in some mixes, compared to CMAT for a similar number of deadline misses.
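The gist of the Proc–Mem idea (as opposed to scaling the whole execution time with frequency, as CMAT does) can be sketched by splitting a task's time into a compute part that scales with the core clock and a memory part that does not; how the split is obtained from run-time counters is the paper's contribution and is assumed as given here:

```c
/* Proc-Mem-style estimate (sketch): only the compute portion scales with the
 * core clock; the memory portion is tied to the separately powered memory
 * system. A CMAT-style model would scale (t_compute + t_memory) as a whole. */
double estimate_exec_time(double t_compute_at_fmax, /* seconds at f_max      */
                          double t_memory,          /* frequency-insensitive */
                          double f_max, double f) {
    return t_compute_at_fmax * (f_max / f) + t_memory;
}
```

For instance, a task measured as 2 ms of compute plus 1 ms of memory time at 2.0 GHz is predicted to take 2·(2.0/1.0) + 1 = 5 ms at 1.0 GHz, not the 6 ms that pure linear scaling would suggest.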

7.
As embedded processor technology advances and the performance demands on embedded devices keep growing, embedded processors have moved from the single-core era into the multicore era. Traditional embedded software development, however, still follows the single-core model: it does not exploit the parallelism of embedded multicore processors and therefore fails to realize their full performance. Although multicore parallelization methods are comparatively mature on the PC platform, embedded multicore processors differ greatly in core count, caches, and buses, so embedded platforms cannot simply borrow PC practices. Studying multicore parallelization methods specifically for embedded platforms is therefore worthwhile.

8.
In a shared-memory multiprocessor system, it may be more efficient to schedule a task on one processor than on another if relevant data already reside in that processor's cache. The effects of this type of processor affinity are examined. It is observed that tasks continuously alternate between executing at a processor and releasing it due to I/O, synchronization, quantum expiration, or preemption. Queueing network models of different abstract scheduling policies are formulated, spanning the range from ignoring affinity to fixing tasks on processors. These models are solved via mean-value analysis where possible, and by simulation otherwise. An analytic cache model is developed and used in these scheduling models to capture the initial burst of cache misses that tasks experience when they return to a processor for execution. A mean-value technique is also developed and used in the scheduling models to capture the increased bus traffic due to these bursts of cache misses. Only a small amount of affinity information needs to be maintained for each task. The importance of having a policy that adapts its behavior to changes in system load is demonstrated.

9.
Dynamic Partitioning of Shared Cache Memory
This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches. Since the memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines this information and partitions the cache amongst the executing processes/threads, varying partition sizes dynamically to reduce the total number of misses. The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve total IPC significantly over the standard least-recently-used (LRU) replacement policy; in one case, partitioning doubles total IPC over standard LRU. Our results show that smart cache management and scheduling are essential to achieve high performance with shared cache memory.
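A common way to realize this kind of miss-driven partitioning (a sketch of the general idea, not necessarily the authors' exact algorithm) is a greedy allocation over run-time miss curves: repeatedly grant the next cache way to whichever thread's measured misses would drop the most. The sketch assumes miss counts are non-increasing in the number of ways and that `nways <= MAX_WAYS`:

```c
#define MAX_WAYS 16  /* illustrative associativity bound */

/* Greedy way-partitioning sketch: misses[t][w] = misses of thread t when
 * granted w ways (collected at run-time, assumed non-increasing in w).
 * Hand out ways one at a time to the thread with the largest marginal
 * miss reduction. */
void partition_ways(int nthreads, int nways,
                    const unsigned misses[][MAX_WAYS + 1], int alloc[]) {
    for (int t = 0; t < nthreads; t++) alloc[t] = 0;
    for (int w = 0; w < nways; w++) {
        int best = 0;
        unsigned best_gain = 0;
        for (int t = 0; t < nthreads; t++) {
            unsigned gain = misses[t][alloc[t]] - misses[t][alloc[t] + 1];
            if (gain >= best_gain) { best_gain = gain; best = t; }
        }
        alloc[best]++;  /* give one more way to the biggest beneficiary */
    }
}
```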

10.
Time Warp synchronized parallel discrete event simulators are organized to operate asynchronously and aggressively, without explicit synchronization between the concurrently executing simulation objects. In place of an explicit synchronization mechanism, the concurrent simulators implement an independent but common virtual clock model and a rollback/recovery mechanism to restore causal order when out-of-order events are detected. When the critical path of execution of the simulation is balanced across the parallel threads of execution, this can result in a highly effective, lightweight synchronization mechanism for parallel simulation. However, imbalances in the workload across the threads can result in excessive rollback in some threads and slowed progress of the critical path. On small shared-memory multi-core systems, a lowest-timestamp scheduling policy can effectively balance the workload. However, on larger many-core chips, conventional load balancing and workload migration will once again become necessary. Fortunately, emerging many-core chips contain some interesting features that can potentially be exploited to improve the performance of parallel simulations. In particular, the recently developed Intel Single-chip Cloud Computer (SCC) provides mechanisms for the runtime control of the frequency and voltage settings of the chip. Furthermore, the frequency and voltage settings are set independently within different regions (called islands) of the chip. Thus, in a Time Warp simulation, one could increase the frequency of the cores executing threads on the critical path (those experiencing infrequent rollback) and decrease the frequency of the cores executing threads off the critical path (those experiencing excessive rollback). This paper investigates the run-time control and adjustment of core frequency in some contemporary x86 multi-core processors to identify platforms that can support the exploration of dynamic run-time control of core frequency settings. The results show that while all multi-core processors have software-controllable core frequency modulation capabilities, the settings are generally not fully independent as the system comes under load, making them unsuitable for these studies. Fortunately, one processor, the AMD X6 line, provides software control for core frequencies that can be fixed (by software) even as the system operates under load.
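On Linux, the kind of per-core software frequency control the paper probes for is exposed through the cpufreq sysfs interface; a sketch follows (requires root and the `userspace` governor; paths and governor availability vary by kernel, and whether the hardware honors the setting independently per core is exactly what the paper's measurements test):

```c
#include <stdio.h>

/* Write a string to a sysfs file; returns 0 on success. */
static int write_sysfs(const char *path, const char *value) {
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%s", value);
    return fclose(f);
}

/* Sketch: pin one core to a fixed frequency (in kHz) via Linux cpufreq. */
int set_core_khz(int cpu, long khz) {
    char path[128], val[32];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
    if (write_sysfs(path, "userspace") != 0) return -1;
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    snprintf(val, sizeof val, "%ld", khz);
    return write_sysfs(path, val);
}
```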

11.
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software-controlled data forwarding technique that sends data to the destination's cache before it is needed, eliminating cache misses in the destination's cache as well as reducing coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architectural optimizations to multi-core processors.

12.
A heterogeneous multicore processor typically combines high-performance big cores with low-power little cores; sound thread scheduling on such processors can effectively improve resource utilization and save energy. Prior work on fairness-aware scheduling for big/little cores did not consider cores having multiple frequency/voltage states, yet processors supporting DVFS adjustment are now commonplace, so the computation of inter-thread fairness needs to be extended and improved. This paper proposes a method for computing thread fairness on heterogeneous multicore processors when each core supports several DVFS states, improves an existing performance-prediction model by adapting its coefficients with an adaptive algorithm, and on this basis proposes a scheduling policy that keeps inter-thread fairness and processor power within preset thresholds while selecting the most energy-efficient configuration, thereby reducing the energy consumed by running applications. Experimental results show that, compared with the proposed policy, the static, DVFS-only, and swap-only scheduling approaches consume on average more than 20% extra energy, and up to 50% for some applications, at nearly identical total running times.
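The fairness notion being extended can be sketched numerically: estimate each thread's slowdown relative to running alone (e.g., on a big core at the highest frequency), then take the ratio of the smallest to the largest slowdown, where 1.0 means perfectly fair. The per-thread inputs would come from the paper's prediction model and are assumed given here; min/max-slowdown fairness is a common metric in this literature, not necessarily the paper's exact formula:

```c
/* Fairness sketch: slowdown_i = T_shared_i / T_alone_i, where T_alone is the
 * (model-predicted) solo time on a big core at maximum frequency.
 * fairness = min(slowdown) / max(slowdown), in (0, 1]. */
double fairness(const double *t_shared, const double *t_alone, int n) {
    double min_s = 1e300, max_s = 0.0;
    for (int i = 0; i < n; i++) {
        double s = t_shared[i] / t_alone[i];
        if (s < min_s) min_s = s;
        if (s > max_s) max_s = s;
    }
    return min_s / max_s;
}
```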

13.
A Parallel Computing Task Allocation Method for Multicore Systems
With the spread of multicore processors, today's large-scale parallel processing systems generally adopt them, which places higher demands on resource management and scheduling. This paper proposes a shared-cache partitioning method, builds a process scheduling model that supports cache allocation on multicore processors, and designs and implements an algorithm for mapping parallel tasks onto multicore processors. The approach better solves the task-allocation problem for multicore processors in large-scale resource management systems, reduces the mutual interference among processes sharing the cache, and improves application performance.
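OS-level shared-cache partitioning of this kind is commonly implemented with page coloring, where the physical pages given to a process are restricted so its data map onto a disjoint slice of the cache sets; the abstract does not give the mechanism, so the sketch below illustrates the common technique under illustrative cache parameters rather than the paper's exact method:

```c
/* Page-coloring sketch: the bits of the physical page number that overlap the
 * cache set index define the page "color"; giving each process its own colors
 * partitions a physically indexed shared cache between them. */
#define PAGE_SHIFT 12      /* 4 KiB pages                                 */
#define LINE_SHIFT  6      /* 64 B cache lines                            */
#define CACHE_SETS 8192    /* illustrative: 4 MiB, 8-way, 64 B lines      */

/* Colors = sets / (lines per page); here 8192 / 64 = 128 colors. */
#define NUM_COLORS (CACHE_SETS / (1u << (PAGE_SHIFT - LINE_SHIFT)))

unsigned page_color(unsigned long phys_addr) {
    return (phys_addr >> PAGE_SHIFT) % NUM_COLORS;
}
```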

14.
As a process executes on a processor, it builds up state in that processor's cache. In multiprogrammed workloads, the opportunity to reuse this state may be lost when a process is rescheduled, either because intervening processes destroy its cache state or because the process migrates to another processor. In this paper, we explore affinity scheduling, a technique that helps reduce cache misses by preferentially scheduling a process on a processor where it has run recently. Our study focuses on a bus-based multiprocessor executing a variety of workloads, including mixes of scientific, software development, and database applications. In addition to quantifying the performance benefits of exploiting affinity, our study is distinctive in that it provides low-level data from a hardware performance monitor that details why the workloads perform as they do. Overall, for the workloads studied, we show that affinity scheduling reduces the number of cache misses by 7-36%, resulting in execution time improvements of up to 10%. Although the overall improvements are small, modifying the operating system scheduler to exploit affinity appears worthwhile: affinity has no negative impact on the workloads, and we show that it is extremely simple to add to existing schedulers.
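On Linux, the scheduler hook this kind of study motivates is exposed to user space as CPU affinity masks; a minimal sketch of binding the calling thread to the CPU it last ran on:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Affinity sketch: query where we last ran, then bind to that CPU so cache
 * state built up there can be reused on the next dispatch. */
int pin_to_last_cpu(void) {
    int cpu = sched_getcpu();          /* CPU this thread last ran on */
    if (cpu < 0) return -1;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);  /* 0 = calling thread */
}
```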

15.
The performance loss resulting from different cache misses is variable in modern systems for two reasons: 1) memory access latency is not uniform, and 2) the latency toleration ability of processor cor...

16.
Dynamically reconfigurable caches can be reconfigured at run-time in capacity, organization, and mapping rules, and therefore offer clear advantages in resource utilization and energy consumption. Exploiting the dynamically varying issue width of VLIW (very long instruction word) processors, this paper proposes driving cache reconfiguration from that dynamic characteristic at run-time, so that processor cores can be dynamically split or merged; this differs from the traditional approach of driving cache reconfiguration from the cache miss rate. To smooth cache performance under frequent reconfiguration, a transition mechanism is further proposed that moves the cache smoothly from one configuration to another. Experiments evaluating the reconfiguration policy show that within 2,000 cycles after reconfiguration the cache miss rate drops by 16% on average, and overall system performance improves.

17.
The software and hardware techniques to exploit the potential of multi-core processors are falling behind, even though the number of cores and cache levels per chip is increasing rapidly. There is no explicit communications support available, and hence inter-core communications depend on cache coherence protocols, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we present Software Controlled Eviction (SCE) to improve the performance of multithreaded applications running on multi-core processors by moving shared data to shared cache levels before it is demanded from remote private caches. Simulation results show that SCE offers significant performance improvement (8-28%) and reduces L3 cache misses by 88-98%.
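SCE as proposed is an architectural extension, but recent x86 parts expose a closely related primitive: the `cldemote` instruction hints that a line should be pushed from the producer's private cache toward the shared last-level cache. A sketch of producer-side demotion under that analogy (not the paper's mechanism; requires a compiler with `-mcldemote`, and the hint executes as a NOP on CPUs without the instruction):

```c
#include <stddef.h>
#include <immintrin.h>

/* Producer writes a buffer, then demotes its lines toward the shared cache so
 * a consumer on another core finds them closer than a remote private cache. */
void produce_and_demote(char *buf, size_t n) {
    for (size_t i = 0; i < n; i++)
        buf[i] = (char)i;                  /* produce shared data        */
    for (size_t i = 0; i < n; i += 64)     /* one hint per 64 B line     */
        _mm_cldemote(buf + i);
}
```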

18.
A NUCA Substrate for Flexible CMP Cache Sharing
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.

19.
Li Jianjiang, Deng Zhaochu, Du Panpan, Lin Jie. The Journal of Supercomputing, 2022, 78(4): 4779-4798
The Sunway TaihuLight is the first supercomputer built entirely with domestic processors in China. On Sunway TaihuLight, the local data memory (LDM) of each slave core is limited, so data transfers to and from main memory are frequent during computation and memory-access efficiency is low. Moreover, for many scientific computing programs, solving the storage problem of irregularly accessed data is the key to program optimization. Software cache (SWC) is one effective means to address these problems. Based on the characteristics of the Sunway TaihuLight architecture and of irregular accesses, this paper designs and implements a new software cache structure that uses part of the LDM to emulate cache functionality, with a new cache address mapping and conflict-resolution scheme that avoids the high data-access and storage overheads of a traditional cache. At the same time, the SWC uses register communication between the slave cores to share data across different slave-core LDMs, increasing the capacity of the software cache and improving the hit rate. In addition, a double-buffering strategy is adopted to access regular data in batches, hiding the communication overhead between the slave cores and main memory. Test results on the Sunway TaihuLight platform show that the proposed software cache structure effectively reduces program running time and improves the software cache hit rate, achieving a good optimization effect.
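A direct-mapped software cache of the general kind described can be sketched as follows, in plain C with `memcpy` standing in for the slave core's DMA transfers; the line size, line count, and the absence of the athread DMA machinery are illustrative stand-ins, not the paper's design:

```c
#include <string.h>
#include <stdint.h>

#define LINE_BYTES 256                 /* illustrative LDM line size       */
#define NUM_LINES   64                 /* 16 KiB of LDM devoted to the SWC */

/* tags[] should be initialized to an invalid line ID (e.g., ~0) at startup
 * so slot 0 does not falsely hit on the all-zero state. */
static uint64_t tags[NUM_LINES];       /* cached main-memory line IDs */
static char     lines[NUM_LINES][LINE_BYTES];

/* Sketch of an SWC read: map the address to an LDM slot; on a tag miss,
 * fetch the whole line from main memory (memcpy stands in for DMA). */
char *swc_get(const char *gmem, uint64_t addr) {
    uint64_t line = addr / LINE_BYTES;
    unsigned slot = line % NUM_LINES;  /* direct-mapped placement */
    if (tags[slot] != line) {          /* miss: refill from main memory */
        memcpy(lines[slot], gmem + line * LINE_BYTES, LINE_BYTES);
        tags[slot] = line;
    }
    return &lines[slot][addr % LINE_BYTES];
}
```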

20.
Synchronization plays an important role in guaranteeing the data consistency and correctness of threads on multicore processors. As the number of cores keeps increasing, the cost of synchronization grows as well. Barrier synchronization is one of the important methods of multicore synchronization in parallel applications. Software synchronization usually takes thousands of cycles to synchronize multiple cores; such high-latency, serialized synchronization significantly degrades the performance of multicore programs. Compared with software barriers, hardware barriers achieve much lower synchronization latency, but traditional centralized hardware barriers scale poorly and cannot meet the synchronization needs of many-core processor systems. This paper proposes HSync, a hierarchical hardware barrier mechanism for many-core processors, consisting of local barrier units and global barrier units that cooperate to deliver fast synchronization at low hardware cost. Experimental results show that, compared with a traditional centralized hardware barrier, the hierarchical mechanism improves the performance of a many-core processor system by a factor of 1.13 while reducing network traffic by 74%.
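The hierarchical structure HSync implements in hardware has a well-known software analogue: a two-level sense-reversing barrier in which threads first synchronize within their local cluster and only the last arriver per cluster enters the global barrier. A C11-atomics sketch of that analogue (cluster counts and sizes are illustrative, and this is not the HSync hardware itself):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define CLUSTERS            4  /* illustrative topology: 4 clusters ... */
#define THREADS_PER_CLUSTER 8  /* ... of 8 cores each                   */

static atomic_int  local_count[CLUSTERS];
static atomic_bool local_sense[CLUSTERS];
static atomic_int  global_count;
static atomic_bool global_sense;

/* Two-level sense-reversing barrier: the last arriver in each cluster
 * proceeds to the global level; everyone else spins on its cluster flag. */
void hierarchical_barrier(int cluster) {
    bool sense = !atomic_load(&local_sense[cluster]);
    if (atomic_fetch_add(&local_count[cluster], 1) == THREADS_PER_CLUSTER - 1) {
        atomic_store(&local_count[cluster], 0);
        bool gsense = !atomic_load(&global_sense);
        if (atomic_fetch_add(&global_count, 1) == CLUSTERS - 1) {
            atomic_store(&global_count, 0);
            atomic_store(&global_sense, gsense);       /* release clusters */
        } else {
            while (atomic_load(&global_sense) != gsense) ; /* spin globally */
        }
        atomic_store(&local_sense[cluster], sense);    /* release locals   */
    } else {
        while (atomic_load(&local_sense[cluster]) != sense) ; /* spin locally */
    }
}
```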
