首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 125 毫秒
1.
传统的推测多线程技术总是假定程序的并行粒度大小应该随着处理器核资源数目的增加而增大,未考虑不同数目的处理器核资源对程序自身并行性能的影响作用。针对这个问题,提出一种自适应的循环并行粒度调节方法用于优化处理器核资源的分配过程。以推测级为单位,通过动态收集循环中所有推测线程的性能量化分析结果,进行推测代价评估。并利用评估结果动态调整循环的并行粒度大小,优化所分配到的处理器核资源的数目,以减少不必要的推测代价。实验表明,该方法不但在SPEC CPU基准测试程序集上能取得较好的性能提升,而且进一步优化了推测时的能耗开销。  相似文献   

2.
如何有效利用多核提供的丰富晶体管资源对串行程序的执行进行加速是当前研究中的热点问题。线程级推测(thread-level speculation,TLS)技术旨在充分利用多核资源,最大化地开发出串行代码中存在的潜在并行性。目前TLS技术已经在多种串行应用的并行化工作中得到有效利用,但嵌入式应用程序仍未在推测并行化方面进行有效的分析。因此,选取了八个具有代表性的嵌入式应用,对其在循环级推测并行化中的性能提升潜力和运行时特征(数据依赖、线程粒度和并行覆盖率)进行探讨。实验结果表明,利用线程级推测并行化嵌入式应用的加速效果优于指令级并行技术,实验中的最大加速比达到了13.29;在嵌入式应用领域,该技术可以有效地利用4到8核的计算资源。  相似文献   

3.
线程级推测(TLS)技术可挖掘程序并行执行潜能,提高多核资源利用率,但目前TACLeBench的内核基准仍未在TLS并行化中得到有效分析。针对该问题设计了循环级推测执行的剖析方案和剖析工具。选取7个代表性的TACLeBench内核基准程序,首先对程序进行初始化分析,选取程序热点片段插入循环标识;其次对这些片段进行交叉编译,记录程序推测线程与内存地址相关数据,剖析其循环级最大潜在并行性;最后综合探讨程序运行时的特征(线程粒度、可并行化覆盖率、依赖特征)以及源码对加速比的影响。实验结果表明:1)该类程序适合采用TLS加速,与串行执行结果相比,循环结构的推测执行下的大部分程序的加速比在2以上,其中最高加速比达到20.79;2)利用TLS加速TACLeBench内核程序时,多数应用可有效利用4核到16核的计算资源。  相似文献   

4.
多核处理器通过增加处理器核数提高计算能力,虽然可以通过同时运行多道程序的方式利用处理器资源,但是多核处理器真正的成功取决于解决并行应用开发中的难题.为此,处理器体系结构和编程模型的协同开发是必须的.而随着核数的增多,传统上使用的软件模拟器因为软件的串行性而性能越来越差,无法支持这种软硬件协同开发.FPGA天生的并行性使它在模拟多核处理器时具有较高的模拟性能和高度的可扩放性,成为处理器体系结构研究的理想工具.本文介绍了基于FPGA的多核模拟系统,RAMP-Pink.该系统基于HASim实现,同时支持事务存储和线程级推测,用于对事务存储和线程级推测的软硬件协同开发.该模拟系统可配置不同的FPGA开发平台,也可以以软件模拟方式运行.  相似文献   

5.
线程级推测技术使在多核上加速传统上难以手工或自动并行化的串行程序成为可能,它不仅需要合理地选择线程的划分策略,而且需要合理地选择适合推测执行的应用.已有的大量研究主要集中在如SPEC CPU这样的桌面应用领域,为了全面地认识TLS技术的应用适用性,本文探讨TLS技术对科学计算应用的性能提升潜力,提出一套TLS适用性的基本判定准则,实验结果表明采用该技术加速SPLASH2中的多数应用可以有效利用16核及以上的计算资源.  相似文献   

6.
推测多线程技术通过推测执行的方式开发应用程序的线程级并行性,以提高程序执行性能。该技术一般通过执行模型来检测运行时可能的线程推测错误情况,并采取合适的机制恢复程序正确运行。描述的Prophet是一种基于硬件实现的推测多线程执行模型。重点描述了Prophet执行模型针对执行模型设计的关键问题的解决方案,包括Prophet的线程状态控制和多版本的Cach。系统,Prophet的多版本Cache系统提供了推测数据缓存功能,并使用基于总线监听的Cache协议实现了数据依赖违规检测。还给出了使用Olden基准程序对Prophet执行模型进行功能和性能测试的结果,并分析说明了Prophet系统可以有效地开发应用程序的线程级并行性。  相似文献   

7.
SpMT WaveCache:开发数据流计算机中的推测多线程   总被引:1,自引:0,他引:1  
推测多线程技术(Speculative Multithreading,SpMT)是通过推测地执行多个线程来开发线程级并行性,提高超标量处理器性能.通过增加额外的硬件单元,比如线程同步单元(Thread Synchronization Unit,TSU)、线程上下文表(Thread Context Table,TCT)和线程内存历史表(Thread Memory History,TMH),扩展了事务性内存系统,提高了基于波标量指令集系统结构(WaveScalar ISA)实现的WaveCache模拟器的性能.同时,还提出了一种新的两级线程级事务提交机制.最后,采用了6个来自SPEC、Media和Mibench测试程序集的真实测试程序.评估了推测多线程WaveCache(SpMT WaveCaehe)的性能.实验表明,SpMT WaveCache比超标量系统结构提高了2~3倍的性能,是一种有效的开发动态数据流计算机性能的方法.  相似文献   

8.
众核处理器具有强大的并行处理能力,成为提升路由器转发性能的有效途径.基于众核处理器的数据包处理采用多级流水线结构,每个流水阶段的执行时间不同,要求分配不同的核数.已有的核资源均衡分配方法(equi-partition,EQUI)为每个流水阶段分配相同的核数,存在核资源浪费等缺点,限制了数据包处理性能.提出了一种众核处理器资源优化方法,即根据数据包的处理步骤将其划分成多个子阶段,通过统计各阶段的总执行时间,按执行时间比例分配给各个模块所需核数.与已有的EQUI相比,核资源最佳分配方法在数据包转发速率上提高了约20%.  相似文献   

9.
针对子程序结构的线程级推测并行性分析   总被引:3,自引:0,他引:3  
线程级推测技术为开发更多的线程级并行性,充分利用多核加速传统上难以手工或自动并行化的串行程序提供可行的技术途径.然而,这种技术的性能严重地依赖于线程划分方案.有研究表明,仅推测执行循环所产生的并行性是不够的.但推测执行子程序结构比循环结构要难.本文提出寻找适于推测并行执行的子程序结构的基本判定依据;通过运行由Simplescalar工具集改造得到的动态剖析工具ProRV、ProFun和SPEC CPU2000基准测试程序,我们对子程序结构线程化推测执行的适合性进行详细分析,给出具有指导意义的实验分析方法和实验数据.我们发现:①无返回值的子程序结构占据程序整体执行时间的大约40%;返回稀疏整型的子程序结构占据了程序整体执行时间的大约10%,对其返回值的预测成功率在70%左右.对于其他返回值类型的子程序结构,由于对其返回值的预测成功率过低,我们认为不适合作为线程划分的对象.②简单的last-value的值预测方案对于返回值的预测是简单而且足够有效的.③访存数据依赖普遍存在于子程序与其后继代码之间,显式同步机制对于针对子程序结构的线程级推测是必要的.  相似文献   

10.
Explicit Data Graph Execution(EDGE)ISA是一种专门为类数据流驱动的分片式众核处理器而设计的指令集体系结构.相较于传统的采用控制流驱动的处理器,EDGE结构以超块(Hyperblock)而不是单个指令作为其执行单位,在超块内部实现数据流执行,超块之间按照推测序保持控制流执行,有利于挖掘指令级并行性.但是,EDGE编译器按照程序的串行执行顺序组织超块,超块间和超块内部受限于数据依赖,削弱了整个程序运行时的潜在数据级并行性和线程级并行性,不利于发挥EDGE分片式结构的优势.本文通过分析EDGE编译器超块组织的特点,结合EDGE结构特有的执行模型,提出一种普适性的超块组织框架来模拟EDGE结构上多线程运行的效果,进一步挖掘EDGE结构运行串行单线程程序时的指令级并行性.本文选用TRIPS微处理器作为EDGE结构的实例处理器,利用矩阵乘法等三个实验验证了我们所提出的框架的可行性,实验结果表明这些应用在TRIPS上获得了较好的性能提升.  相似文献   

11.
In the last 15 years we have seen, as a response to power and thermal limits for current chip technologies, an explosion in the use of multiple and even many computer cores on a single chip. But now, to further improve performance and energy efficiency, when there are potentially hundreds of computing cores on a chip, we see a need for a specialization of individual cores and the development of heterogeneous manycore computer architectures.However, developing such heterogeneous architectures is a significant challenge. Therefore, we propose a design method to generate domain specific manycore architectures based on RISC-V instruction set architecture and automate the main steps of this method with software tools. The design method allows generation of manycore architectures with different configurations including core augmentation through instruction extensions and custom accelerators. The method starts from developing applications in a high-level dataflow language and ends by generating synthesizable Verilog code and cycle accurate emulator for the generated architecture.We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measure their performance and hardware resource usages. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than the general purpose counterparts. In certain cases, replacing general purpose components with specialized components saves hardware resources. Automating the method increases the speed of architecture development and facilitates the design space exploration of manycore architectures.  相似文献   

12.
The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.  相似文献   

13.
连接是数据查询处理中最耗时、使用最频繁的操作之一,对提高连接操作的速率具有重要意义。阵列众核处理器是一类重要的众核处理器,具有强大的并行能力,可用来加速并行计算。基于阵列众核处理器的结构,设计和优化了一种高效的多层分区Hash连接算法。该算法通过多层划分的策略大大降低了主存访问次数,通过分区重排方法有效消除了数据倾斜的影响,获得了很高的性能。在异构融合阵列众核处理器DFMC(Deeply-Fused Many Core)原型系统上的实验结果表明,DFMC上多层分区Hash连接算法的性能是CPU-GPU耦合结构上最快的连接算法的8.0倍,表明利用阵列众核处理器加速数据查询应用具有优势。  相似文献   

14.
The author demonstrates the methodology for parallelizing of finding stochastic bounds for Markov chains on multicore and manycore platforms. The stochastic bounds algorithm for Markov chains with the sparse matrices is investigated, thus needing a lot of irregular memory access. Its parallel implementations should scale across multiple threads and characterize with a high performance and performance portability between multicore and manycore platforms. The presented methods are built on the usage of two parallelization extensions of the C++ language: OpenMP and Cilk Plus. For this two extensions, we use two programming models, namely loop parallelism and task-based parallelism. The numerical experiments show the execution time of the implementations and the scalability on multicore and manycore platforms. This work provides the parallel implementations and at the same time presents an educational example of how computer science problems with irregular memory access can be implemented for high performance using OpenMP and Cilk Plus.  相似文献   

15.
快速多极子方法(FMM)是一种求解N体问题的快速高效数值算法,在宇宙学和分子动力学等模拟中具有广泛的应用。申威SW26010是一款国产众核异构处理器,含260核心(4核组)。基于申威SW26010的众核架构设计和实现了快速多极子方法,并对核心函数(尤其是最耗时的粒子对相互作用)系统地进行了性能优化,包括异步DMA、SIMD向量化、循环展开、内联汇编指令调整等。以粒子对相互作用为例,优化后代码的计算速度约为主核上运行的原始代码的400倍,每个核组上的浮点性能达到250 GFLOPS,即理论峰值性能的32.5%。  相似文献   

16.
与超级计算机的快速的开发,规模和复杂性曾经正在增加,并且可靠性和跳回面临更大的挑战。在容错有许多重要技术,例如基于差错预言的积极失败回避技术,反应容错基于检查点,和安排技术到改进可靠性。系统差错的特征上的质、量的描述为这些技术是很批评的。这研究在超级计算机把 Sunway BlueLight 称为的二典型 petascale 上分析失败的来源(基于多核心中央处理器) 并且 Sunway TaihuLight (基于异构的 manycore 中央处理器) 。它揭开一些有趣的差错特征并且在主要部件差错之中发现未知关联关系。最后,纸在资源和不同时间跨度的各种各样的谷物分析二台超级计算机的失败时间,并且为 petascale 超级计算机造一个一致多维的失败时间模型。  相似文献   

17.
The Journal of Supercomputing - The world is creating ever more data and the applications are required to deal with ever-increasing datasets. To process such datasets heterogeneous and manycore...  相似文献   

18.
Current parallel systems composed of mixed multi/manycore systems and/with GPUs become more complex due to their heterogeneous nature. The programmability barrier inherent to parallel systems increases almost with each new architecture delivery. The development of libraries, languages, and tools that allow an easy and efficient use in this new scenario is mandatory. Among the proposals found to broach this problem, skeletal programming appeared as a natural alternative to easy the programmability of parallel systems in general, but also the GPU programming in particular. In this paper, we develop a programming skeleton for Dynamic Programming on MultiGPU systems. The skeleton, implemented in CUDA, allows the user to execute parallel codes for MultiGPU just by providing sequential C++ specifications of her problems. The performance and easy of use of this skeleton has been tested on several optimization problems. The experimental results obtained over a cluster of Nvidia Fermi prove the advantages of the approach.  相似文献   

19.
Deadlocks arising from insufficiently marked siphons in flexible manufacturing systems can be controlled by adding monitors to each siphon – too many for large systems. Li and Zhou add monitors to elementary siphons only while controlling the rest of (called dependent) siphons by adjusting control depth variables of elementary siphons. Only a linear number of monitors are required. The control of weakly dependent siphons (WDSs) is rather conservative since only positive terms were considered. The structure for strongly dependent siphons (SDSs) has been studied earlier. Based on this structure, the optimal sequence of adding monitors has been discovered earlier. Better controllability has been discovered to achieve faster and more permissive control. The results have been extended earlier to S3PGR2 (systems of simple sequential processes with general resource requirements). This paper explores the structures for WDSs, which, as found in this paper, involve elementary resource circuits interconnecting at more than (for SDSs, exactly) one resource place. This saves the time to compute compound siphons, their complementary sets and T-characteristic vectors. Also it allows us (1) to improve the controllability of WDSs and control siphons and (2) to avoid the time to find independent vectors for elementary siphons. We propose a sufficient and necessary test for adjusting control depth variables in S3PR (systems of simple sequential processes with resources) to avoid the sufficient-only time-consuming linear integer programming test (LIP) (Nondeterministic Polynomial (NP) time complete problem) required previously for some cases.  相似文献   

20.
The last level cache (LLC) in private configurations offer lower latency and isolation but extinguishes the possibility of sharing underutilized cache resources. Cooperative Caching (CC) provides capacity sharing by spilling a line evicted from one cache to another. However, CC proposals did not pay enough attention to the natural problem of private LLC, replication. The static policies either indulging the replicated blocks (replicas) in or excluding them out of LLC invariably are deficient for the complex cache capacity situations in manycore environment. In this paper, we present replication-aware cache management (RACMan) to optimize replication for private configurations. RACMan relies on a novel coarse-grained low-overheard mechanism PBFP that monitors and predicts the replica reusability to dynamically adjust LLC insertion policies giving replicas different positions of LRU chain and chances of survival in LLC according to the prediction. Experiment results show our proposal is competent to optimize replication by performing better than two baseline systems in the respects of L2 Hit Rate, Network Traffics, IPC, and Dynamic Energy. RACMan fulfils the requirements of manycore CMPs with private LLC for increasing system performance, area efficiency, and scalability.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号