首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 187 毫秒
1.
多核数字信号处理器(DSP)的性能常常受限于共享存储的长延迟Cache一致性访问.数据前向(forwarding)技术是隐藏长延迟访问的一种有效手段.根据多核DSP应用的两类重要特征,提出了一种面向共享存储多核DSP结构的数据流分簇前向技术DSCF(data stream clustered forwarding).DSCF方法的主要特点是:兼容基本的共享存储Cache一致性协议;不污染目标Cache;数据的传输速度能够与消费速度相匹配;系统结构的可扩展性好.典型测试程序的模拟评测表明,采用DSCF方法能够将Cache一致性失效率平均降低44%,将系统总体性能提升30%~70%.  相似文献   

2.
多核环境下的Cache设计技术受到线延时和应用等多方面因素影响,私有和共享方案都存在各自的不足.提出了一种异构的CMP Cache结构,采用两类具有不同Cache层次的结点组成多核芯片,设计了基于间接索引的Cache容量复用等技术,提供了容量有效且访问迅速的片上存储层次.在全系统环境下对SPEC CPU2000,SPLASH2等程序的评测结果表明,异构CMP Cache结构能够适应各类应用的需要,对单进程和多线程应用平均性能提高分别可达16%和9%.异构CMP Cache同时具有硬件设计简单的特点,具有较好的工程可实现性,其设计思想将应用在未来的龙芯多核处理器设计中.  相似文献   

3.
为了提高Cell处理器对共享数据访问的性能,本文设计并实现了一个能够支持释放一致性存储模型的软件Cache。实验结果表明,该软件Cache能够大大缩短SPE对系统主存中共享数据的访问时间开销,提高Cell处理器上OpenMP程序的并行性能。  相似文献   

4.
随着集成电路工艺技术的飞速发展,单芯片多处理器(Single-chip Multiprocessor,CMP)结构将是一种有效利用片上晶体管资源、提高系统性能的有效途径.CMP中各个内核通过共享同级存储装置共享数据,如共享一级Cache,共享二级Cache等.可交换数据Cache结构的CMP(Exchangeable Data Cache Architecture,EDCA-CMP)通过交换一级数据Cache的内容共享数据Cache,降低对下级存储的访问延迟,提高数据Cache的命中率,获得较高的性能.  相似文献   

5.
片上多核处理器已逐渐取代传统超标量处理器成为集成电路设计的主流结构,但芯片的存储墙问题依旧是设计的一个难题。CMP通过大容量的末级高速缓存来缓解访存压力。在软件编程模式向多线程并行方式转变的背景下,针对多线程应用在多核处理器上的Cache访问特征,提出一种面向私有末级Cache的优化算法,通过硬件缓冲器记录处理器访存地址,从而实现共享数据在Cache间的传递机制,有效降低Cache失效开销。实验结果表明,在硬件开销不超过Cache部件0.1%的情况下,测试用例平均加速比为1.13。  相似文献   

6.
现有的并行代价模型大多是面向共享存储或分布存储结构设计的,不完全适合异构多核处理器。为解决这个问题,提出了面向异构多核处理器的并行代价模型,通过定量刻画计算核心运算能力、存储访问延迟和数据传输开销对循环并行执行时间的影响,提高加速并行循环识别的准确性。实验结果表明,提出的并行代价模型能有效识别加速并行循环,将其识别结果作为后端生成并行代码的依据,可有效提高并行程序在异构多核处理器上的性能。  相似文献   

7.
Cell处理器上软件缓存的设计与实现   总被引:1,自引:0,他引:1       下载免费PDF全文
在 Cell异构多核处理器上,并行程序对不规则共享数据的访问延迟较大,共享数据的一致性维护困难。为解决上述问题,提出一种基于扩充Location Consistency存储模型一致性协议的软件缓存。测试结果表明,该软件缓存能够缩短近40%的共享数据访问时间,有效提高并行程序的执行效率。  相似文献   

8.
基于取指执行时序范畴的多核共享Cache干扰分析   总被引:2,自引:0,他引:2  
在多核结构中,获得并行应用线程的安全、精确的最坏情况执行时间(worst case execution time,WCET)的最大挑战之一在于共享资源的竞争冲突检测.在共享Cache的多核处理器中,线程在共享Cache中的指令可能被其他并行线程的指令替换,从而导致了线程间在共享Cache上的干扰,因此多核结构下线程WCET需要考虑并行线程间在共享Cache上的干扰.在现有的简单地址映射干扰分析基础上,考虑了指令取指执行时序因素对干扰的影响,提出了非干扰状态的充分不必要条件,根据指令的取指执行时序范畴判断线程在共享Cache上的干扰状态.通过排除非干扰状态,可以进一步精确多核结构中线程的WCET估值.理论分析证明了该方法的有效性.实验结果表明,与当前现有的考虑执行周期和基于逻辑访问先后顺序的方法相比,基于时序方法下的WCET估值分别可以提高12%和7%的精确度.  相似文献   

9.
同时多线程(SMT)是一种延迟容忍的体系结构,它在每个周期内可以执行多个线程的多条指令.在SMT处理器上,对于片上共享存储这个复杂的结构资源,至今还没有很好的共享和冲突解决方案.本文着重研究了在多个并发执行的线程间划分共享Cache所存在的问题,指出基于LRU策略的传统Cache会根据需要隐式地划分共享Cache,这在某些情况下会导致全局性能的下降.针对这一问题并且考虑到SMT处理器上对Cache访问带宽的需求,本文提出采用一种多模块多体的Cache结构设计方案.并且在一个修改过的SMT模拟器上对该设计方案进行了性能评价.实验结果显示,相比于基于LRU策略的传统Cache,这一结构可以将一个4路SMT处理器的IPC提高9%.  相似文献   

10.
一种片上众核结构共享Cache动态隐式隔离机制研究   总被引:2,自引:0,他引:2  
访存带宽是限制众核处理器件能提升的关键,将片上最后一级Cache设计为所有处理器核共享是必要的.在共享Cache中隔离放置冲突的数据,是提高共享Cache性能的关键.文中提出了缓存块链接的硬件方法,用于隔离共享Cache中不同线程之间的数据.文中基于时钟精准的片上众核结构模拟器,使用Splash2程序组和生物信息学中的仟务,对所提机制进行了评估.实验结果表明,与传统共享Cache相比,使用缓存块链接机制时,使得共享Cache的冲突性缺失率降低约20%,而使得IPC平均提高了约10%.  相似文献   

11.
黄干平 《计算机学报》1993,16(9):655-660
本文给出一种适用于SIMD并行算法的共享存储器设计方案,它允许多个处理机按相应的同步并行算法并行无存取冲突地存取各自的数据,以满足算法执行的需要,该方案包含两部分,即数据在共享存储器内的存放方法和互联网络的结构及其功能,文章最后说明了该方案的若干性能、实现方法和优点。  相似文献   

12.
OpenMP作为共享存储并行编程标准,以其良好的易用性、支持增量并行等特点成为并行程序设计的主流模型之一.OpenMP标准是针对UMA共享存储结构制定的,其循环调度机制只考虑了负载平衡而无须考虑数据分布.然而在机群OpenMP系统中,数据局部性是影响性能的关键因素.针对OpenMP标准中静态调度策略不适合机群计算的缺点,提出了一个充分体现拥有者计算原则的LBS调度算法,并通过扩展制导的方式在机群OpenMP系统(OpenMP/JIAJIA)上加以实现.测试结果表明,LBS算法对于机群OpenMP系统很有效.  相似文献   

13.
TreadMarks: shared memory computing on networks of workstations   总被引:2,自引:0,他引:2  
Shared memory facilitates the transition from sequential to parallel processing. Since most data structures can be retained, simply adding synchronization achieves correct, efficient programs for many applications. We discuss our experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system. DSM allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. We illustrate a DSM system consisting of N networked workstations, each with its own memory. The DSM software provides the abstraction of a globally shared memory, in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value  相似文献   

14.
多核计算机上非递归并行计算矩阵乘积   总被引:1,自引:1,他引:0  
提出延迟隐藏的数据预取模型,实现计算与访存的重叠操作,以达到共享二级缓存零缺失;给出基本块的概念,以简化算法的数据结构和减少存储开销;按基本块连续存储方式存储矩阵元素,从存储层次上优化算法,显著地减少页表缓冲缺失;采取非递归调度基本块的策略,充分利用多核计算机的共享二级缓存来减少访问主存的次数,并且不局限于某种特定的存储结构,实现算法缓存无关.多核计算机上的实验结果表明,给出的非递归计算矩阵乘积的线程级并行算法高效、可扩展.  相似文献   

15.
Concurrent C/C++ is a superset of C and C++ that provides parallel programming facilities based on message passing. Upon porting Concurrent C/C++to a shared memory multiprocessor, the authors believed it would be appropriate to supplement Concurrent C/C++ with explicit facilities for synchronizing accesses to shared data structures. The capsule, which is a shared memory access mechanism designed especially for Concurrent C/C++ to match the C++data abstraction facility called the class, is discussed. Capsules are like monitors but they have significant advantages. Capsules satisfy T. Bloom's (1979) criteria for expressiveness of synchronization conditions, support inheritance, allow operations to execute in parallel, and permit them to time out. The design of capsules is reviewed. The author evaluates existing shared memory mechanisms, describes capsules, gives examples of capsules, compares capsules with monitors, and discusses how capsules are implemented by the Concurrent C compiler  相似文献   

16.
We present an efficient technique for parallel manipulation of data structures that avoids memory access conflicts. That is, this technique works on the Exclusive Read/Exclusive Write (EREW) model of computation, which is the weakest shared memory, MIMD machine model. It is used in a new parallel radix sort algorithm that is optimal for keys whose values are over a small range. Using the radix sort and known results for parallel prefix on linked lists, we develop parallel algorithms that efficiently solve various computations on trees and “unicycular graphs.” Finally, we develop parallel algorithms for connected components, spanning trees, minimum spanning trees, and other graph problems. All of the graph algorithms achieve linear speedup for all but the sparsest graphs.  相似文献   

17.
The authors examine the design, implementation, and experimental analysis of parallel priority queues for device and network simulation. They consider: 1) distributed splay trees using MPI; 2) concurrent heaps using shared memory atomic locks; and 3) a new, more general concurrent data structure based on distributed sorted lists, designed to provide dynamically balanced work allocation and efficient use of shared memory resources. We evaluate performance for all three data structures on a Cray-TSESOO system at KFA-Julich. Our comparisons are based on simulations of single buffers and a 64×64 packet switch which supports multicasting. In all implementations, PEs monitor traffic at their preassigned input/output ports, while priority queue elements are distributed across the Cray-TBE virtual shared memory. Our experiments with up to 60000 packets and two to 64 PEs indicate that concurrent priority queues perform much better than distributed ones. Both concurrent implementations have comparable performance, while our new data structure uses less memory and has been further optimized. We also consider parallel simulation for symmetric networks by sorting integer conflict functions and implementing a packet indexing scheme. The optimized message passing network simulator can process ~500 K packet moves in one second, with an efficiency that exceeds ~50 percent for a few thousand packets on the Cray-T3E with 32 PEs. All developed data structures form a parallel library. Although our concurrent implementations use the Cray-TSE ShMem library, portability can be derived from Open-MP or MP1-2 standard libraries, which will provide support for one-way communication and shared memory lock mechanisms  相似文献   

18.
Parallel Data Mining for Association Rules on Shared-Memory Systems   总被引:11,自引:1,他引:10  
In this paper we present a new parallel algorithm for data mining of association rules on shared-memory multiprocessors. We study the degree of parallelism, synchronization, and data locality issues, and present optimizations for fast frequency computation. Experiments show that a significant improvement of performance is achieved using our proposed optimizations. We also achieved good speed-up for the parallel algorithm. A lot of data-mining tasks (e.g. association rules, sequential patterns) use complex pointer-based data structures (e.g. hash trees) that typically suffer from suboptimal data locality. In the multiprocessor case shared access to these data structures may also result in false sharing. For these tasks it is commonly observed that the recursive data structure is built once and accessed multiple times during each iteration. Furthermore, the access patterns after the build phase are highly ordered. In such cases locality and false sharing sensitive memory placement of these structures can enhance performance significantly. We evaluate a set of placement policies for parallel association discovery, and show that simple placement schemes can improve execution time by more than a factor of two. More complex schemes yield additional gains. Received 24 May 1999 / Revised 20 June 2000 / Accepted in revised form 6 July 2000  相似文献   

19.
In this paper, we propose a new parallel genome matching algorithm using graphics processing units (GPUs). Our proposed approach is based on the Aho–Corasick algorithm and it was developed based on a consideration of the architectural features of existing GPUs with a hundred or more cores. Thus, we provide an appropriate task partitioning method that runs on multiple threads and we fully utilize the cache memory and the shared memory structures available in GPUs. Especially, we propose a tiled access method for rapid data transfer from the global memory to the shared memory. We also provide new models for cache-friendly state transition table to improve performance of pattern matching operations on GPUs. The maximum throughput we achieved in various experiments was 15.3 Gbps. Moreover, we showed that our proposed design outperformed an earlier approach with a 15.4 % performance improvement.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号