首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 281 毫秒
1.
共享存储多核处理器中“忙-等待”技术常用来实现锁或栅栏等同步操作,这些典型的同步机制通常受限于较长的同步延迟和资源竞争等问题,导致扩展性较差,且需要不时进行访存操作,影响正常存储器访问操作,加剧对存储系统的带宽需求。提出了一种用于同步数据触发结构多核处理器的基于指令Cache作废的同步技术,同步时作废将执行的指令Cache行导致取指失效,向L2 Cache发送取指请求,L2 Cache中设置相应的过滤机制,不服务不满足同步条件的处理器核的取指请求,使相应处理器核暂停,达到同步目的。测试表明,该方法在可扩展性和同步性能方面均具有一定的优势。  相似文献   

2.
众核处理器片上同步机制和评估方法研究   总被引:1,自引:0,他引:1  
同步机制是片上多核/众核处理器正确执行和协同通信的关键,其效率对处理器的性能非常重要.针对片上众核体系结构,提出并实现了两种粗粒度同步机制和一种细粒度同步机制,即片上专用硬件支持的同步机制、基于原语的片上互斥访问同步机制和基于满空标志位的细粒度同步机制;提出了粗粒度同步机制的评估标准和评估方法,并设计了量化评估程序.以片上同构众核处理器Godson-T模拟器和AMD Opteron商业片上多核处理器为平台,评估比较了提出的硬件支持的同步机制与基于原语的同步机制的性能.结果表明,硬件支持可以使得片上众核处理器的同步机制性能明显提高;在传统基于原语的同步机制中,大部分性能损失是由于负载不平衡和同步点的串行化操作而造成的等待时间.  相似文献   

3.
层次化片上多核处理器以紧耦合的多个核构成超节点,对访存和片上通信的局部性有良好支撑,能有效地缓解片上多核中数据通信带来的通信开销.在关于多核处理器的Amdahl开销/性能模型已有的研究基础上,引入片上数据通信延迟作为Amdahl任务计算开销的新元素,构建了层次化片上多核处理器的Amdahl加速比扩展模型.基于该扩展模型,就层次化片上多核处理器的加速比与超节点配置的关系问题展开研究.模拟分析发现,要获得良好的加速比性能,层次化片上多核处理器需要在超节点数目与超节点的大小(超节点内核的个数)之间作仔细的权衡;对于给定核数目的层次化片上多核处理器,使系统性能最优的超节点大小往往出现在中间某个值而不是最大或者最小,并且该值随着系统规模的变化会发生相应的变化.  相似文献   

4.
Barrier同步操作是能够直接影响处理器性能的一类操作.针对流处理器体系结构,提出并实现了2种软件同步机制和1种硬件同步机制,即基于互斥计数器的Barrier同步、基于共享状态寄存器的Lock-free Barrier同步和基于专用硬件管理单元的Barrier同步;在一款流处理器原型系统中测试并分析了在不同负载规模、不同负载分布、典型应用情况下3种同步机制的性能.结果表明,基于专用硬件管理单元的Barrier同步机制性能更优.  相似文献   

5.
李崇民  王海霞  张熙  汪东升 《计算机学报》2011,34(11):2064-2072
随着片上可集成的处理器核数增加,多核处理器的片上通信延迟不断增大,目录存储开销也随之线性增长.层次化缓存结构将片上缓存递归划分为多级区域,并将数据复制到各级区域内以减小片上通信延迟,同时通过多级目录结构降低了目录存储开销.文中通过对数据访问特征进行分析,提出一种新型改进层次化缓存结构(EHCD),将从片外读入的数据直接...  相似文献   

6.
受限于功耗,十多年前通用微处理器就停止追求更高的主频转而向集成更多处理器核的方向发展;同时,随着晶体管密度按摩尔定律不断提高,单片可集成的处理器核数成倍增长,片上多核、众核处理器已成为高性能微处理器发展的主流。未来千核级通用众核处理器支持共享存储编程模型是一种必然趋势,但传统的Cache一致性目录结构面临着查找延迟高、目录项替换频繁以及硬件代价和功耗可扩展性有限等问题。稀疏目录实现了传统目录结构硬件开销与一致性维护效率的折衷,被认为是众核处理器维护Cache一致性的一种高能效、可扩展结构。综述了近年来提高稀疏目录性能的相关研究与方法,并对其在面积、访问延迟、功耗和实现复杂性等方面进行分析,归纳出这些方法各自的优点和存在的不足,对创新设计未来高性能众核处理器共享存储体系结构具有一定的参考价值。  相似文献   

7.
多核处理器的软件和硬件设计从传统的任务串行到任务并行处理,与单核处理器软硬件设计区别较大。本文以8核DSP芯片TMS320C6678为例,介绍了多核DSP的系统结构、多核处理器的任务分割和任务分配到各个核的方法、多个核之间的任务管理和核间通信,以及基于多核导航器的硬件设计方法。  相似文献   

8.
为了得到更高的吞吐率和性能功耗比,众核处理器摒弃了复杂的乱序处理器核,而在芯片内集成了大量的轻量级顺序处理器核。为了更好地支持核间数据共享,并减少访问片外存储器带来的开销,众核处理器往往采用共享的末级缓存LLC(Last LevelCache)。因为需要对为数众多相对独立的访问请求作出响应,因此相对于传统多核处理器的末级片内缓存,众核处理器的末级片内缓存更容易产生抖动现象。传统的最久未使用LRU(Least Recent Used)高速缓存替换策略在这种情况下往往无能为力,而几种最新提出的高速缓存替换策略也见效甚微。基于传统的最不经常使用LFU(Least Frequent Used)替换算法,提出一种改进的高速缓存替换算法。相对于LFU替换算法,该算法获取信息的粒度更粗,并且可以掌握更加全局的信息,而这些优势使得该算法更适合作为众核处理器末级片内缓存的替换算法。实验结果表明,在一个64核的众核处理器上,该替换算法可以有效地缓解末级片内缓存的抖动现象,同时该算法实现需要的硬件开销很小。  相似文献   

9.
密码多核处理器互联结构研究与设计   总被引:1,自引:0,他引:1  
为了提高多任务密码算法硬件实现的高效性,对密码算法的多核处理特征及多核互连结构设计基本定律——Amdahl定律进行了研究,提出了密码多核处理器的结构模型,并针对影响多核系统处理性能的参数进行了模拟及分析。基于分析结果,对通用多核处理器中常用的2D-Mesh互联结构进行了优化设计,并给出了优化方案的硬件实现。最后基于VCS仿真平台对设计方案进行了仿真验证。验证结果表明,相比于传统2D-Mesh结构,本方案具有较低的核间信息传递延迟,证明了改进方案的合理性。  相似文献   

10.
李春江  唐滔  杨灿群 《计算机科学》2013,40(9):35-37,60
硬件锁用简单的取数指令实现“取并加一”或“取并减一”的原子操作.首先介绍了通用多核多线程FT处理器实现的硬件锁机制,并和软件锁机制进行了比较,之后介绍了使用硬件锁机制实现多线程同步的方法,然后在GNUOpenMP运行库中设计并实现了利用硬件锁的多线程同步机制,最后采用典型OpenMP测试程序对使用硬件锁和使用软件锁的同步操作性能进行了评估和分析.  相似文献   

11.
Amdahl's Law in the Multicore Era   总被引:4,自引:0,他引:4  
Hill  M.D. Marty  M.R. 《Computer》2008,41(7):33-38
Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores. Obtaining optimal multicore performance will require further research in both extracting more parallelism and making sequential cores faster.  相似文献   

12.
We consider the problem of power and performance management for a multicore server processor in a cloud computing environment by optimal server configuration for a specific application environment. The motivation of the study is that such optimal virtual server configuration is important for dynamic resource provision in a cloud computing environment to optimize the power and performance tradeoff for certain specific type of applications. Our strategy is to treat a multicore server processor as an M/M/m queueing system with multiple servers. The system performance measures are the average task response time and the average power consumption. Two core speed and power consumption models are considered, namely, the idle-speed model and the constant-speed model. Our investigation includes justification of centralized management of computing resources, server speed constrained optimization, power constrained performance optimization, and performance constrained power optimization. Our main results are (1) cores should be managed in a centralized way to provide the highest performance without consumption of more energy in cloud computing; (2) for a given server speed constraint, fewer high-speed cores perform better than more low-speed cores; furthermore, there is an optimal selection of server size and core speed which can be obtained analytically, such that a multicore server processor consumes the minimum power; (3) for a given power consumption constraint, there is an optimal selection of server size and core speed which can be obtained numerically, such that the best performance can be achieved, i.e., the average task response time is minimized; (4) for a given task response time constraint, there is an optimal selection of server size and core speed which can be obtained numerically, such that minimum power consumption can be achieved while the given performance guarantee is maintained.  相似文献   

13.
Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory.With hardware and/or software support,data prefetching brings data closer to a processor before it is actually needed.Many prefetching techniques have been developed for single-core processors.Recent developments in processor technology have brought multicore processors into mainstream. While some of the single-core prefetching t...  相似文献   

14.
A multicore FPGA platform with cache-integrated network interfaces (NIs) has been developed, appropriate for scalable multicores, that combine the best of two worlds – the flexibility of caches (using implicit communication) and the efficiency of scratchpad memories (using explicit communication). Furthermore, the proposed scheme provides virtualized user-level RDAM capabilities and special hardware primitives (counter, queues) for the communication and synchronization of the cores.This paper presents how the proposed architecture can be utilized in the domain of network processing applications using the hardware synchronization mechanisms. Two representatives network processing benchmarks are used; one for header processing and one for payload processing. The Multiple Reader Queue (MRQ) scheme is utilized in the case of header processing, while in the case of payload processing where transfer of bulk data is required, the user-level RDMA scheme is utilized. These applications are mapped and evaluated to an FPGA platform with up to 24 processors. The performance evaluation in the domain of network processing shows that the proposed scheme can offer low latency communication and increased programming efficiency while it also offloads the processor from the communication and synchronization processes.  相似文献   

15.
赵姗  杨秋松  李明树 《软件学报》2019,30(4):1164-1190
为了满足应用程序的多样化需求,异构多核处理器出现并逐渐进入市场,其中的处理核心(core)具有不同的微架构或者指令集架构(ISA),为应用提供多样化特性支持,比如指令级并行(ILP)、内存级并行(MLP),这些核心协同工作满足整个计算系统的优化目标,比如高性能、低功耗或者良好的能效.然而,目前主流的调度技术主要是针对传统同构处理器架构设计,没有考虑异构硬件能力的差异性.在异构多核处理器环境下,调度技术如何感知硬件的异构特性,为不同类型的应用程序提供更加合适和匹配的硬件资源,这是值得探索的问题.对近年来在该研究领域的成果进行了综述研究,特别是在性能非对称多核处理器架构下,异构调度技术面临的优化目标、分析模型、调度决策和算法评估等主要问题进行了分析和描述,并依次对相关技术进行了系统的总结,最后从软硬件融合的角度对今后的研究工作进行了展望.  相似文献   

16.
张延松  张宇  王珊 《软件学报》2018,29(3):883-895
以MapD为代表的图分析数据库系统通过GPU、Phi等新型众核处理器来支持高性能分析处理,在面向复杂数据模式时连接操作仍然是重要的性能瓶颈.近年来,异构处理器逐渐成为高性能计算的主流平台,内存连接性能的研究从多核CPU平台扩展到新兴的众核处理器,但众多的研究成果并未系统地揭示连接算法性能、连接数据集大小、硬件架构之间的内在联系,难以为未来异构处理器平台的数据库提供连接平台优化选择策略.本文以面向多核CPU、Xeon Phi、GPU处理器平台的内存连接优化技术为目标,通过优化内存哈希表设计,实现以向量映射替代哈希映射操作,消除哈希代价对内存连接算法的影响,从而更加准确地测量内存连接算法在多核CPU的cache大小、Xeon Phi的cache大小、Xeon Phi的并发多线程、GPU的SIMT(单指令多线程)机制等硬件相关因素影响下的性能特征.实验结果表明,缓存与并发多线程机制是提高内存连接算法性能的重要影响因素.缓存机制对于满足cache大小的连接操作具有性能优势,而GPU的并发多线程机制则在较大表的连接操作中具有较高的性能,Xeon Phi则在满足其L2 cache大小的连接操作中具有最高性能.实验结果揭示了内存连接操作性能与异构处理器硬件特性的联系,为未来异构处理器平台内存数据库查询优化器提供了优化策略.  相似文献   

17.
This paper describes two dynamic core allocation techniques for video decoding on homogeneous and heterogeneous embedded multicore platforms with the objective of reducing energy consumption while guaranteeing performance. While decoding a frame, the scheme measures “slack” and “overshoot” over the budgeted decode time and amortizes across the neighboring frames to achieve overall performance, compensating for the overshoot with the slack time. It allocates, on a per-frame basis, an appropriate number and types of cores for decoding to guarantee performance, while saving energy by using clock gating to switch off unused cores. Using the Sniper simulator to evaluate the implementation of the scheme on a modern embedded processor, we get an energy saving of 6%–61% while strictly adhering to the required performance of 75 fps on homogeneous multicore architectures. We receive an energy saving of 2%–46% while meeting the performance of 25 fps on heterogeneous multicore architectures. Thus, we show that substantial energy savings can be achieved in video decoding by employing dynamic core allocation, compared with the default strategy of allocating as many cores as available.  相似文献   

18.
《Micro, IEEE》2004,24(6):62-73
With the natural trend toward integration, microprocessors are increasingly supporting multiple cores on a single chip. To keep design effort and costs down, designers of these multicore microprocessors frequently target an entire product range, from mobile laptops to high-end servers. This article discusses a continual flow pipeline (CFP) processor. Such processor architecture can sustain a large number of in-flight instructions (commonly referred to as the instruction window and comprising all instructions renamed but not retired) without requiring the cycle-critical structures to scale up. By keeping these structures small and making the processor core tolerant of memory latencies, a CFP mechanism enables the new core to achieve high single-thread performance, and many of these new cores can be placed on a chip for high throughput. The resulting large instruction window reveals substantial instruction-level parallelism and achieves memory latency tolerance, while the small size of cycle-critical resources permits a high clock frequency  相似文献   

19.
面向层次化NoC的混合并行编程模型   总被引:1,自引:0,他引:1       下载免费PDF全文
曹祥  易伟  潘红兵  高明伦  李丽 《计算机工程》2010,36(13):278-280
为更好发挥多核处理器的硬件性能,针对层次化的片上网络架构,提出MPI/OpenMP混合并行编程模型。运用基于MPI的任务级并行模型实现片内簇间的高效通信,采用OpenMP模型实现簇内四核的通信、同步和数据交换。实验结果表明,与单一并行编程模型相比,混合并行编程模型加速比提高了20%~50%。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号