Similar Documents
20 similar documents retrieved.
1.
Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and a lot of time is spent on the contention arising at lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. At a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During the hand-off, only the requests from the winning processor contribute to the progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller holding the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and, as a consequence, speeds up the whole parallel computation. The mechanism requires neither compiler nor programmer support, nor any changes to the ISA or the coherence protocol. By simulating a 32-processor system, we show that request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in the remaining four, which have a large amount of synchronization activity, it reduces execution time by 14% to 39% and lock stall time by 52% to 71%. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing significantly lower execution times with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.
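As a rough illustration of the service policy, consider the following C sketch; the queue model, the structures, and the way the lock winner is identified are assumptions made for exposition, not the paper's hardware design:

    #define QCAP 32

    typedef struct { int proc_id; unsigned long line_addr; } request_t;

    typedef struct {
        request_t buf[QCAP];   /* burst of buffered requests */
        int head, count;
    } req_queue_t;

    /* Serve the winning processor ahead of the buffered burst: if any
       buffered request for the lock line comes from the processor that
       now owns the lock, dequeue it first; otherwise fall back to plain
       FIFO service. Caller guarantees the queue is non-empty. */
    request_t next_request(req_queue_t *q, unsigned long lock_line, int winner) {
        for (int i = 0; i < q->count; i++) {
            int idx = (q->head + i) % QCAP;
            if (q->buf[idx].line_addr == lock_line && q->buf[idx].proc_id == winner) {
                request_t r = q->buf[idx];
                for (int j = i; j > 0; j--)   /* close the gap left by the bypass */
                    q->buf[(q->head + j) % QCAP] = q->buf[(q->head + j - 1) % QCAP];
                q->head = (q->head + 1) % QCAP;
                q->count--;
                return r;
            }
        }
        request_t r = q->buf[q->head];        /* no winner request buffered */
        q->head = (q->head + 1) % QCAP;
        q->count--;
        return r;
    }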

2.
Windows offers no standard reader-writer lock for multithreaded synchronization, so this paper works through three ways of implementing a reader-writer lock on top of critical sections. By combining operations on different numbers of critical sections, the goals of a reader-writer lock can be met, with different reader/writer priority leanings. Because the reader-writer lock is exercised with complex interleaved operations across multiple threads, the internal behavior of critical sections in this specific setting is further analyzed and tested, ultimately ensuring the reliability of the critical-section-based reader-writer lock. The paper closes with a discussion of supplementary try-acquire and try-release functionality for the lock.
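For illustration, here is a minimal C sketch of one write-preferring construction. It pairs a single critical section with Windows condition variables, a newer facility than the pure-critical-section combinations the paper studies, so it shows the counter-based idea rather than reproducing any of the paper's three variants:

    #include <windows.h>

    typedef struct {
        CRITICAL_SECTION cs;        /* protects the counters below */
        CONDITION_VARIABLE can_read;
        CONDITION_VARIABLE can_write;
        int readers;                /* active readers */
        int writer;                 /* 1 while a writer is inside */
        int waiting_writers;        /* produces the write-priority leaning */
    } rwlock_t;

    void rw_init(rwlock_t *l) {
        InitializeCriticalSection(&l->cs);
        InitializeConditionVariable(&l->can_read);
        InitializeConditionVariable(&l->can_write);
        l->readers = l->writer = l->waiting_writers = 0;
    }

    void rw_rdlock(rwlock_t *l) {
        EnterCriticalSection(&l->cs);
        while (l->writer || l->waiting_writers)   /* writers go first */
            SleepConditionVariableCS(&l->can_read, &l->cs, INFINITE);
        l->readers++;
        LeaveCriticalSection(&l->cs);
    }

    void rw_rdunlock(rwlock_t *l) {
        EnterCriticalSection(&l->cs);
        if (--l->readers == 0)
            WakeConditionVariable(&l->can_write);
        LeaveCriticalSection(&l->cs);
    }

    void rw_wrlock(rwlock_t *l) {
        EnterCriticalSection(&l->cs);
        l->waiting_writers++;
        while (l->writer || l->readers)
            SleepConditionVariableCS(&l->can_write, &l->cs, INFINITE);
        l->waiting_writers--;
        l->writer = 1;
        LeaveCriticalSection(&l->cs);
    }

    void rw_wrunlock(rwlock_t *l) {
        EnterCriticalSection(&l->cs);
        l->writer = 0;
        WakeConditionVariable(&l->can_write);     /* hand off to a writer first */
        WakeAllConditionVariable(&l->can_read);
        LeaveCriticalSection(&l->cs);
    }

Flipping the wait predicate in rw_rdlock (ignoring waiting_writers) yields the read-priority leaning instead.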

3.
This paper considers the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency caused by interprocessor communication in cache coherent, shared memory multiprocessors. Data prefetching is accomplished by using a multiprocessor software pipelined algorithm. Data forwarding is used to target interprocessor data communication, rather than synchronization, and is applied to communication-related accesses between successive parallel loops. Prefetching and forwarding are each shown to be more effective for certain types of architectural and application characteristics. Given this result, a new hybrid prefetching and forwarding approach is proposed and evaluated that allows the relative amounts of prefetching and forwarding used to be adapted to these characteristics. When compared to prefetching or forwarding alone, the new hybrid scheme is shown to increase performance stability over varying application characteristics, to reduce processor instruction overheads, cache miss ratios, and memory system bandwidth requirements, and to reduce performance sensitivity to architectural parameters such as cache size. Algorithms for data prefetching, data forwarding, and hybrid prefetching and forwarding are described. These algorithms are applied by using a parallelizing compiler and are evaluated via execution-driven simulations of large, optimized, numerical application codes with loop-level and vector parallelism.

4.
A Lock-Based Cache Coherence Protocol for Scope Consistency
Directory protocols are widely adopted to maintain cache coherence in distributed shared memory multiprocessors. Although scalable to a certain extent, directory protocols are complex enough to prevent them from being used in very large scale multiprocessors with tens of thousands of nodes. This paper proposes a lock-based cache coherence protocol for scope consistency. It does not rely on directory information to maintain cache coherence. Instead, coherence is maintained by requiring the releasing processor of a lock to store all write-notices generated in the associated critical section to the lock; the acquiring processor then invalidates or updates its locally cached data copies according to the write-notices of the lock. To evaluate the performance of the lock-based cache coherence protocol, a software DSM system named JIAJIA was built on a network of workstations. Besides the lock-based cache coherence protocol, JIAJIA is also characterized by its shared memory organization scheme, which combines the physical memories of multiple workstations to form a large shared space. Performance measurements with the SPLASH-2 program suite and the NAS benchmarks indicate that, compared with recent SVM systems such as CVM, JIAJIA achieves higher speedups. Moreover, JIAJIA can solve large-scale problems that other SVM systems cannot due to memory size limitations.
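A minimal C sketch of the acquire/release actions described above; the data structures are illustrative stand-ins, not JIAJIA's actual ones:

    #define MAX_NOTICES 256

    typedef struct {
        unsigned write_notices[MAX_NOTICES]; /* pages dirtied in the critical section */
        int n_notices;
    } jj_lock_t;

    /* Release: the releasing processor attaches the write-notices
       generated inside the critical section to the lock itself. */
    void jj_release(jj_lock_t *l, const unsigned *dirty_pages, int n) {
        for (int i = 0; i < n && l->n_notices < MAX_NOTICES; i++)
            l->write_notices[l->n_notices++] = dirty_pages[i];
        /* ...the lock and its notices then travel to the next acquirer... */
    }

    /* Acquire: instead of consulting a directory, the acquiring processor
       invalidates (or updates) the cached copies named by the notices. */
    void jj_acquire(const jj_lock_t *l, void (*invalidate_page)(unsigned)) {
        for (int i = 0; i < l->n_notices; i++)
            invalidate_page(l->write_notices[i]);
    }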

5.
We have designed and implemented a light-weight process (thread) library called 'Lesser Bear' for SMP computers. Lesser Bear offers high portability and thread-level parallelism. Creating UNIX processes as virtual processors and a memory-mapped file as a huge shared-memory space enables Lesser Bear to execute threads in parallel. Lesser Bear requires exclusive operation between peer virtual processors and treats the shared-memory space as a critical section for thread synchronization; the thread functions of the previous Lesser Bear are therefore serialized. In this paper, we present a scheduling mechanism that executes thread functions in parallel. In the design of the proposed mechanism, we divide the entire shared-memory space into partial spaces for virtual processors and prepare two queues (Protect Queue and Waiver Queue) for each partial space. We adopt an algorithm in which lock operations are not necessary for enqueueing. This algorithm allows us to propose a scheduling mechanism that reduces the scheduling overhead. The mechanism has been applied to Lesser Bear and evaluated experimentally.

6.
This article describes a framework for synchronization optimizations and a set of transformations for programs that implement critical sections using mutual exclusion locks. The basic synchronization transformations take constructs that acquire and release locks and move these constructs both within and between procedures. They also eliminate acquire and release constructs that use the same lock and are adjacent in the program. The article also presents a synchronization optimization algorithm, lock elimination, that uses these transformations to reduce synchronization overhead. This algorithm locates computations that repeatedly acquire and release the same lock, then transforms them so that they acquire and release the lock only once. The goal is to reduce lock overhead by reducing the number of times computations acquire and release locks. But because the algorithm also increases the sizes of the critical sections, it may decrease the amount of available concurrency. The algorithm addresses this trade-off by providing several different optimization policies, which differ in the amount by which they increase the sizes of the critical sections. Experimental results from a parallelizing compiler for object-based programs illustrate the practical utility of the lock elimination algorithm. For three benchmark applications, the algorithm can dramatically reduce the number of times the applications acquire and release locks, which significantly reduces the time processors spend acquiring and releasing locks. The resulting overall performance improvements for these benchmarks range from no observable improvement up to 30%.
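A before/after sketch in C of the transformation, on a hypothetical accumulation loop:

    #include <pthread.h>

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    double total;

    /* Before: the same lock is acquired and released on every iteration. */
    void sum_before(const double *a, int n) {
        for (int i = 0; i < n; i++) {
            pthread_mutex_lock(&lock);
            total += a[i];
            pthread_mutex_unlock(&lock);
        }
    }

    /* After lock elimination: the adjacent release/acquire pairs on the
       same lock are removed, so the lock is taken once. The critical
       section grows, which is exactly the concurrency trade-off the
       algorithm's policies control. */
    void sum_after(const double *a, int n) {
        pthread_mutex_lock(&lock);
        for (int i = 0; i < n; i++)
            total += a[i];
        pthread_mutex_unlock(&lock);
    }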

7.
8.
In shared memory multiprocessors, efficient synchronization is imperative to assure good performance. There are two aspects to the "cost" of a synchronization operation: the first is the waiting time at synchronization points, and the second is the intrinsic overhead in performing the operation. The overhead has two components. The first component deals with contention resolution for a synchronization operation among competing processors. The second component deals with the shared data accesses that the processor has to perform once it enters a synchronization region. We present a mechanism to reduce the overhead of performing synchronization operations in a cache-based shared memory multiprocessor. The mechanism is based on the intuitive notion that parallel programs invariably use synchronization operations to govern access to shared data. Traditional multiprocessor cache protocols treat synchronization accesses the same way as normal read/write memory accesses, leading to inefficiencies in performing synchronization operations which ultimately limit the scalability of such systems. The key idea in our approach is to combine synchronization with the coherence maintenance for the cached data. Each cache line maintains states for synchronization as well as for cache coherence, and the cache protocol ensures the correctness of the synchronization operations and the coherence of the data at these synchronization points. To assess the performance gain due to the proposed mechanism, simulation studies are performed using a workload model that represents a dynamic scheduling paradigm which forms the core of several parallel programs. Results from these studies show that the proposed cache-based synchronization performs significantly better than traditional cache coherence approaches.

9.
The parallel inference machine (PIM) is now being developed at ICOT. It consists of a dozen or more clusters, each of which is a tightly coupled multiprocessor (comprising about eight processing elements) with shared global memory and a common bus. Kernel language 1 (KL1), a parallel logic programming language based on Guarded Horn Clauses (GHC), is executed on each PIM cluster. This paper describes the memory access characteristics of KL1 parallel execution and a locally parallel cache mechanism with a hardware lock. The most important issue in locally parallel cache design is how to reduce common bus traffic. We describe a write-back cache protocol with five cache states, specially optimized for KL1 execution on each PIM cluster. We introduce new software-controlled memory access commands, named DW, ER, and RP. A hardware lock mechanism is attached to the cache on each processor; it enables efficient word-by-word locking and reduces common bus traffic by using the cache states.

10.
Remote Direct Memory Access (RDMA) improves network bandwidth and reduces latency by eliminating unnecessary copies from the network interface card to application buffers, but managing communication buffers to reduce memory registration and deregistration costs remains a significant challenge. Previous studies use a pin-down cache and batched deregistration, but only simple LRU is used as the replacement algorithm for the cache space. In this paper, we evaluate the cost of memory registration in both user and kernel space. Based on our analysis, we reduce the overhead of communication buffer management in two ways simultaneously: we use a Memory Registration Region Cache (MRRC), and we optimize the RDMA communication process of clients and servers with a Fast RDMA Read and Write Process (FRRWP). MRRC manages memory in terms of memory regions and replaces old regions according to both their sizes and recency. FRRWP overlaps memory registrations between a client and a server and allows applications to submit RDMA write operations without being blocked by message synchronization. We compare the performance of MRRC and FRRWP with traditional RDMA operations. The results show that the new design reduces the total cost of memory registration and the overall communication latency by up to 70%.

11.
Multicore processors need to communicate when working on shared tasks. In classical systems, this is performed via shared objects protected by locks, which are implemented with atomic operations on the main memory. However, access to shared main memory is already a bottleneck for multicore processors. Furthermore, the access time to a shared memory is often hard to predict and therefore problematic for real-time systems. This paper presents a shared on-chip memory that is used for communication and supports atomic operations to implement locks. Access to the shared memory is arbitrated with time division multiplexing, providing time-predictable access. The shared memory supports extended time slots so that a processor can execute more than one memory operation atomically. This allows for the implementation of locking and other synchronization primitives. We evaluate this shared scratchpad memory with synchronization support on a 9-core version of the T-CREST multicore platform. Worst-case access latency to the shared scratchpad is 13 clock cycles. Access to the atomic section under full contention, when every processor core wants access to acquire a lock, is 135 clock cycles.
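A sketch of the lock primitive the extended time slots make possible, with C11 atomics standing in for the hardware guarantee that the read-modify-write completes inside one extended TDM slot (an assumption; the platform's actual programming interface is not shown here):

    #include <stdatomic.h>

    static atomic_flag lock_word = ATOMIC_FLAG_INIT;  /* lives in the shared scratchpad */

    /* Test-and-set spin lock: the load + conditional store execute
       atomically, which on the hardware corresponds to holding the
       extended time slot for both memory operations. */
    void sp_lock(void)   { while (atomic_flag_test_and_set(&lock_word)) { /* spin */ } }
    void sp_unlock(void) { atomic_flag_clear(&lock_word); }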

12.
How to implement cache coherence efficiently in shared memory systems is a key and difficult problem in architecture design. Existing directory-based protocols are hard to implement, complex to verify, and incur large storage overhead. Targeting on-chip many-core processors, this paper proposes a hardware-supported, synchronization-based cache coherence protocol. The scheme uses no directory; instead, it represents coherence information with Bloom filters and maintains cache coherence at the synchronization points of the parallel program. Compared with existing directory-based cache coherence protocols, the scheme reduces implementation and verification complexity. Evaluation with the SPLASH-2 benchmark suite shows that the synchronization-based protocol achieves performance comparable to that of directory-based protocols.
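A minimal C sketch of the Bloom-filter side of the scheme, with illustrative sizes and hash functions. The property coherence relies on is that membership tests can yield false positives (at worst an unnecessary invalidation at a synchronization point) but never false negatives:

    #include <stdint.h>

    #define BLOOM_BITS 1024

    typedef struct { uint64_t bits[BLOOM_BITS / 64]; } bloom_t;

    static void bloom_set(bloom_t *b, unsigned h) {
        unsigned i = h % BLOOM_BITS;
        b->bits[i / 64] |= 1ULL << (i % 64);
    }

    static int bloom_get(const bloom_t *b, unsigned h) {
        unsigned i = h % BLOOM_BITS;
        return (b->bits[i / 64] >> (i % 64)) & 1;
    }

    /* Record that this core has cached the line; two multiplicative
       hashes stand in for whatever the hardware would use. */
    static void bloom_add(bloom_t *b, uint64_t line) {
        bloom_set(b, (unsigned)((line * 0x9E3779B97F4A7C15ULL) >> 32));
        bloom_set(b, (unsigned)((line * 0xC2B2AE3D27D4EB4FULL) >> 32));
    }

    /* At a synchronization point: might this core hold the line? */
    static int bloom_may_contain(const bloom_t *b, uint64_t line) {
        return bloom_get(b, (unsigned)((line * 0x9E3779B97F4A7C15ULL) >> 32))
            && bloom_get(b, (unsigned)((line * 0xC2B2AE3D27D4EB4FULL) >> 32));
    }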

13.
Compared with conventional memory, persistent memory offers large capacity and non-volatility, opening new opportunities for building large-scale key-value storage systems. However, designing a persistent-memory key-value system on multicore server architectures faces many challenges, including CPU cache thrashing caused by concurrency control, consumption of and contention for persistent memory's limited write bandwidth, and thread conflicts aggravated by persistent memory's high latency. A multicore-friendly persistent-memory key-value system (multicore-f...

14.
周京晖. 《软件》 (Software), 2013, 34(5): 6-11
Data caching can markedly improve read performance, but caches must be synchronized promptly to keep data fresh. Immediate synchronization triggered by changes at the data source is a general-purpose solution; under heavy load, however, the frequent synchronization caused by rapidly changing data inevitably hurts server performance and communication efficiency. The on-demand cache synchronization scheme introduced in this paper tailors the synchronization policy to business requirements, improving server performance and communication efficiency while still meeting freshness targets. The scheme has been applied successfully in the UniBase distributed in-memory database developed by the author's team, and its feasibility has been borne out in real projects.
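A hypothetical C sketch of the on-demand idea; the field names and the staleness policy are assumptions for illustration, not UniBase's actual interface:

    #include <time.h>

    typedef struct {
        time_t fetched_at;       /* when the cached value was last synchronized */
        int max_staleness_s;     /* per-business-requirement freshness target */
        void *value;
    } cache_entry_t;

    /* Synchronize only when a read would violate the freshness target,
       instead of on every change at the data source. */
    void *cache_read(cache_entry_t *e, void *(*refetch)(void)) {
        if (time(NULL) - e->fetched_at > e->max_staleness_s) {
            e->value = refetch();
            e->fetched_at = time(NULL);
        }
        return e->value;
    }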

15.
We present an energy banding algorithm for Monte Carlo (MC) neutral particle transport simulations that depend on large cross section lookup tables. In MC codes, read-only cross section data tables are accessed frequently, exhibit poor locality, and are typically too large to fit in fast memory. Thus, performance is often limited by long latencies to RAM, or by off-node communication latencies when the data footprint is very large and must be decomposed across a distributed memory machine. The proposed energy banding algorithm allows maximal temporal reuse of data in band sizes that can flexibly accommodate different architectural features. The energy banding algorithm is general and has a number of benefits compared to the traditional approach. In the present analysis we explore its potential to achieve improvements in time-to-solution on modern cache-based architectures.
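A C sketch of the banded sweep under an assumed data layout (one cross-section slice per band, placeholder physics). Advancing all particles that currently sit in one band before moving to the next keeps that band's slice of the table cache-resident:

    typedef struct { double lo, hi; const double *xs_slice; } band_t;
    typedef struct { double energy; int alive; } particle_t;

    /* Placeholder physics: a real MC code would perform its cross
       section lookups against the band's slice of the table here. */
    static void advance_particle(particle_t *p, const band_t *b) {
        (void)b;
        p->energy *= 0.9;
        if (p->energy < 1e-11) p->alive = 0;
    }

    /* Sweep bands from highest to lowest energy (particles only lose
       energy here), advancing each particle until it leaves the band. */
    void transport_banded(particle_t *parts, int n,
                          const band_t *bands, int nbands) {
        for (int b = nbands - 1; b >= 0; b--)
            for (int i = 0; i < n; i++)
                while (parts[i].alive &&
                       parts[i].energy >= bands[b].lo && parts[i].energy < bands[b].hi)
                    advance_particle(&parts[i], &bands[b]);
    }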

16.
To meet the high-speed, high-precision demands of embedded multicore CNC systems, and to address the high latency and small transfer sizes of existing multicore communication, this work studies an embedded CNC system built on a dual-core ARM + DSP architecture and designs and implements a multicore data communication mechanism for that platform. The mechanism is based on shared memory and covers the key parts: the hardware driver, memory partitioning, communication synchronization, a shared buffer pool, and the communication protocol. Experiments were run on the two parameters most important to system performance, inter-core transfer latency and transfer volume, followed by application tests in a real CNC environment. The results show that the communication method meets the embedded CNC system's requirements of 2 MB data transfers with 20 ms communication latency on the ARM + DSP dual-core architecture.
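As one plausible shape for the shared-buffer-pool piece, a single-producer/single-consumer ring buffer placed in the shared region; the sizes, layout, and cache-management details are assumptions, not the paper's protocol:

    #include <stdint.h>

    #define RING_SLOTS 64
    #define SLOT_BYTES 4096

    typedef struct {
        volatile uint32_t head;                /* written only by the producer (e.g. ARM) */
        volatile uint32_t tail;                /* written only by the consumer (e.g. DSP) */
        uint8_t slot[RING_SLOTS][SLOT_BYTES];  /* payload area in shared memory */
    } ring_t;

    /* Producer side: copy a message into the next free slot, then
       publish it by advancing head. Returns -1 when the ring is full. */
    int ring_send(ring_t *r, const uint8_t *msg, uint32_t len) {
        uint32_t h = r->head;
        if ((h + 1) % RING_SLOTS == r->tail)
            return -1;
        for (uint32_t i = 0; i < len && i < SLOT_BYTES; i++)
            r->slot[h][i] = msg[i];
        r->head = (h + 1) % RING_SLOTS;        /* publish after the copy */
        return 0;
    }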

17.
To attack the memory wall problem in modern computer systems, this paper proposes a data prefetching method suited to linked data structures: pure-traversal pushing. Using multithreading on a chip multiprocessor (CMP) platform with a shared cache, a push thread is split off while the main program runs; it prefetches the data the main thread will need into the processor's shared cache ahead of time, hiding the main thread's memory latency. Experimental results show that, on a CMP architecture, the method yields measurable performance improvements for memory-bound programs dominated by linked structures.
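A minimal pthreads sketch of the push idea. Pinning both threads to cores under the shared cache is omitted; because the push thread only loads nodes while the main thread also computes, the push thread naturally runs ahead:

    #include <pthread.h>

    typedef struct node { struct node *next; long key; char pad[48]; } node_t;

    /* Push thread: a pure traversal. Each load pulls the node into the
       shared cache just ahead of the main thread's use of it. */
    static void *push_thread(void *arg) {
        long sink = 0;
        for (node_t *p = (node_t *)arg; p; p = p->next)
            sink += p->key;
        return (void *)sink;          /* keep the loads from being optimized away */
    }

    /* Main thread: the real computation over the same list. */
    long traverse_with_push(node_t *head) {
        pthread_t helper;
        pthread_create(&helper, NULL, push_thread, head);
        long sum = 0;
        for (node_t *p = head; p; p = p->next)
            sum += p->key * p->key;   /* heavier per-node work than the push thread */
        pthread_join(helper, NULL);
        return sum;
    }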

18.
In a GPU, all threads within a warp execute the same instruction in lockstep. Some threads' memory requests are served quickly while the rest take much longer, and the warp cannot issue its next instruction until the slowest request completes, causing memory divergence. This work studies inter-warp heterogeneity in GPUs and implements and optimizes a cache management mechanism and memory scheduling policy based on it, reducing the negative effects of memory divergence and cache queuing delay. Warps are classified by cache hit rate to drive three components: (1) a warp-type-aware cache bypassing component, which routes warps with low cache utilization around the L2 cache; (2) a warp-type-aware cache insertion/promotion policy, which prevents data from warps with high cache utilization from being evicted prematurely; and (3) a warp-type-aware memory controller, which prioritizes requests received from warps with high cache utilization and requests from the same warp. Across eight different GPGPU applications, the inter-warp-heterogeneity-based cache management and memory scheduling mechanism achieves an average speedup of 18.0% over the baseline GPU.
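A small C sketch of the classification step that drives the three components; the hit-rate threshold is an illustrative assumption:

    typedef enum { WARP_LOW_UTIL, WARP_HIGH_UTIL } warp_type_t;

    typedef struct { unsigned hits, accesses; warp_type_t type; } warp_stats_t;

    /* Classify a warp from its sampled L2 behavior. */
    void classify_warp(warp_stats_t *w) {
        double hit_rate = w->accesses ? (double)w->hits / w->accesses : 0.0;
        w->type = (hit_rate < 0.25) ? WARP_LOW_UTIL : WARP_HIGH_UTIL;
    }

    /* Component (1): low-utilization warps skip the L2 entirely. */
    int should_bypass_l2(const warp_stats_t *w) {
        return w->type == WARP_LOW_UTIL;
    }

Components (2) and (3) key off the same warp type when choosing an insertion position in the L2 and an ordering in the memory controller.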

19.
This article presents the system design and implementation of a four-channel digital servo controller. To meet the development task's requirements for a high communication baud rate and multi-channel control, a multi-channel servo controller was designed around the XC164 as the main control chip, using the JS71175 RS-485 protocol processor for high-speed synchronous RS-485 bus communication. While raising the synchronous RS-485 communication rate, the controller can drive four servo mechanisms and lock and unlock their motion, and it is compact, high-power, and highly integrated. Experimental results show that, with communication kept stable, the baud rate reaches 2 Mbit/s and the total power reaches 1 600 W, meeting the task requirements.

20.
To increase the read speed of on-chip Flash in embedded applications, an on-chip Flash acceleration controller based on prefetching and caching is proposed. The controller provides two acceleration schemes: a prefetch buffer and a cache. The prefetch-buffer scheme widens the fetch path and uses prefetching to speed up sequential instruction reads, and adds a branch buffer that stores non-sequential instructions to lower the prefetch-miss penalty they cause. The cache scheme uses set associativity and way prediction to raise instruction reuse, reduce the number of Flash accesses, and lower system power. For different application scenarios, the two schemes can be switched statically through registers or adaptively at run time through the software flow, yielding the best read speedup. Results on several benchmark programs demonstrate the feasibility and efficiency of the proposed on-chip Flash acceleration controller in both performance and power optimization.
