Similar Articles
18 similar articles found.
1.
Thread-level speculation (TLS) can exploit the latent parallelism of program execution and improve the utilization of multi-core resources, but the TACLeBench kernel benchmarks have not yet been effectively analyzed for TLS parallelization. To address this, a profiling scheme and a profiling tool for loop-level speculative execution were designed. Seven representative TACLeBench kernel benchmarks were selected. First, each program was analyzed to identify its hot regions, into which loop markers were inserted; next, these regions were cross-compiled and data about the speculative threads and memory addresses were recorded to profile the maximum potential loop-level parallelism; finally, the influence of the programs' run-time characteristics (thread granularity, parallelizable coverage, dependence behavior) and of the source code on the speedup was examined. The experimental results show that: 1) these programs are well suited to TLS acceleration: compared with serial execution, most of them achieve a speedup above 2 under speculative execution of their loop structures, with a maximum speedup of 20.79; 2) when the TACLeBench kernel programs are accelerated with TLS, most applications can make effective use of 4 to 16 cores.

2.
Thread-level speculation (TLS) makes it possible to use multi-core processors to accelerate serial programs that are traditionally hard to parallelize by hand or automatically. It requires not only a sound thread-partitioning strategy but also a careful choice of applications that are suited to speculative execution. Most existing studies concentrate on desktop workloads such as SPEC CPU. To gain a more complete picture of the applicability of TLS, this paper examines its potential for improving the performance of scientific computing applications and proposes a set of basic criteria for judging TLS applicability. Experimental results show that, using this technique, most applications in SPLASH-2 can effectively utilize 16 or more cores.

3.
Analysis of Thread-Level Speculative Parallelism for Subroutine Structures
Thread-level speculation offers a feasible way to uncover more thread-level parallelism and to use multi-core processors to accelerate serial programs that are traditionally hard to parallelize by hand or automatically. The performance of this technique, however, depends heavily on the thread-partitioning scheme, and studies have shown that the parallelism obtained by speculating on loops alone is insufficient. Yet speculative execution of subroutine structures is harder than that of loops. This paper proposes basic criteria for identifying subroutine structures that are suitable for speculative parallel execution. By running the dynamic profiling tools ProRV and ProFun, derived from the SimpleScalar toolset, over the SPEC CPU2000 benchmarks, we analyze in detail the suitability of subroutine structures for threaded speculative execution and present instructive experimental methodology and data. We find that: (1) subroutines without return values account for roughly 40% of overall execution time; subroutines returning sparse integer values account for about 10%, and their return values can be predicted with a success rate of about 70%; subroutines with other return-value types are unsuitable as thread-partitioning targets because their return values are predicted with too low a success rate. (2) A simple last-value prediction scheme is simple yet sufficiently effective for return-value prediction. (3) Memory data dependences are common between a subroutine and its subsequent code, so an explicit synchronization mechanism is necessary for thread-level speculation on subroutine structures.
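A minimal sketch of the last-value return-value prediction idea referred to above (not the ProRV/ProFun implementation); keying predictions by a call-site ID and using a 64-bit integer return type are illustrative assumptions:

```cpp
#include <cstdint>
#include <unordered_map>

// Minimal last-value predictor: for each (hypothetical) call-site ID, predict
// that the subroutine will return the same value it returned last time.
class LastValuePredictor {
public:
    // Returns true and writes the prediction if a previous value was recorded.
    bool predict(uint64_t call_site, int64_t &prediction) const {
        auto it = last_.find(call_site);
        if (it == last_.end()) return false;   // no history yet
        prediction = it->second;
        return true;
    }

    // After the subroutine actually returns, record its real return value.
    void update(uint64_t call_site, int64_t actual) {
        last_[call_site] = actual;
    }

private:
    std::unordered_map<uint64_t, int64_t> last_;  // call site -> last return value
};
```

Last-value prediction stores only one value per call site, which is consistent with the abstract's finding that this simple scheme is already adequate for sparse integer return values.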

4.
Many-core processors built from hundreds or thousands of cores provide enormous computing power, but also pose the challenge of using those resources efficiently. Applications with different degrees of parallelism need different numbers of cores: an unreasonable allocation either wastes resources (too many cores) or limits the exploitable parallelism (too few). Targeting the core-allocation problem faced by thread-level speculative execution of serial programs on many-core architectures, this paper proposes a hardware-based mechanism for monitoring and evaluating speculative-execution capability and designs three evaluators of thread-level speculation capability. The evaluators adjust the number of cores allocated to an application in real time as the speculative-execution capability of the serial program changes dynamically. Experimental results show that guiding core allocation for thread-level speculation on a many-core platform with an evaluator of very small hardware overhead achieves an effective balance between performance and resource utilization.

5.
Using the Intel Parallel Amplifier performance tool, this work addresses the performance of the fuzzy C-means (FCM) clustering algorithm on multi-core platforms, locates the hotspots and available concurrency in the serial program, and proposes a parallelization design. Based on the Intel Threading Building Blocks (TBB) library and OpenMP runtime library functions, the serial program is parallelized on multi-core platforms through loop parallelization and parallel task assignment.
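A minimal sketch of the kind of loop parallelization described above, not the paper's Parallel Amplifier-guided TBB/OpenMP implementation; the flat array layout, function name, and the fuzzifier fixed at m = 2 are illustrative assumptions:

```cpp
#include <cmath>
#include <vector>

// Hypothetical data layout: n points, k clusters, d dimensions,
// points[i*d + t], centers[j*d + t], memberships[i*k + j].
// Parallelizes the FCM membership-update hotspot over the points
// (fuzzifier m = 2, so the exponent 2/(m-1) equals 2; the special case of a
// point coinciding with a center is omitted in this sketch).
void update_memberships(const std::vector<double> &points,
                        const std::vector<double> &centers,
                        std::vector<double> &memberships,
                        int n, int k, int d) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        std::vector<double> dist(k);
        for (int j = 0; j < k; ++j) {
            double s = 0.0;
            for (int t = 0; t < d; ++t) {
                double diff = points[i * d + t] - centers[j * d + t];
                s += diff * diff;
            }
            dist[j] = s;                        // squared distance to center j
        }
        for (int j = 0; j < k; ++j) {
            double sum = 0.0;
            for (int l = 0; l < k; ++l)
                sum += dist[j] / dist[l];       // (d_ij/d_il)^{2/(m-1)} with m = 2
            memberships[i * k + j] = 1.0 / sum;
        }
    }
}
```

Each point's memberships depend only on that point and the (read-only) centers, so the outer loop carries no dependences and parallelizes directly.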

6.
The Hough transform suffers from long computation times; parallel processing, an effective way to handle massive computation, is used here to reduce its running time. This paper mainly studies how to use TBB (Threading Building Blocks) to parallelize the parallelizable parts of the Hough transform on a multi-core machine; experiments show that this approach yields a good speedup for the Hough transform.
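A minimal sketch of a TBB-parallelized Hough voting step under assumed data structures (the Point type, accumulator layout, and function name are illustrative, not the paper's code):

```cpp
#include <cmath>
#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/combinable.h>

struct Point { int x, y; };

// Line-detection voting step: each edge point votes into a (theta, rho)
// accumulator. Per-thread accumulators (tbb::combinable) avoid data races;
// they are merged after the parallel loop.
std::vector<int> hough_votes(const std::vector<Point> &edges,
                             int n_theta, int n_rho, double rho_max) {
    const double kPi = 3.14159265358979323846;
    tbb::combinable<std::vector<int>> local(
        [=] { return std::vector<int>(n_theta * n_rho, 0); });

    tbb::parallel_for(std::size_t(0), edges.size(), [&](std::size_t i) {
        std::vector<int> &acc = local.local();
        for (int t = 0; t < n_theta; ++t) {
            double theta = kPi * t / n_theta;
            double rho = edges[i].x * std::cos(theta) + edges[i].y * std::sin(theta);
            int r = static_cast<int>((rho + rho_max) / (2.0 * rho_max) * (n_rho - 1));
            if (r >= 0 && r < n_rho) ++acc[t * n_rho + r];
        }
    });

    // Merge the per-thread accumulators into the final vote array.
    std::vector<int> result(n_theta * n_rho, 0);
    local.combine_each([&](const std::vector<int> &acc) {
        for (std::size_t k = 0; k < acc.size(); ++k) result[k] += acc[k];
    });
    return result;
}
```

Keeping a private accumulator per thread avoids contention on the shared vote array; the serial merge is small compared with the voting loop itself.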

7.
A Parallel Single-Source Shortest Path Algorithm for Multi-core Platforms
This paper presents a parallel single-source shortest path (SSSP) algorithm for multi-core platforms. It adopts a parallel strategy similar to the Δ-Stepping algorithm: multiple worker threads relax the arcs of the same bucket in parallel, while a main thread controls the serial sequence of buckets as in the serial search. Experimental results show that the algorithm solves the single-source shortest path problem on the full USA road network in about 4 s and achieves a higher speedup than a serial algorithm implemented from the same code base.
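A minimal sketch of the parallel bucket-relaxation step in a Δ-Stepping-style SSSP, under assumed graph and bucket structures (not the paper's implementation); distance updates use an atomic compare-exchange "min", and the caller is assumed to redistribute the returned vertices into buckets serially:

```cpp
#include <atomic>
#include <vector>
#include <omp.h>

// Hypothetical CSR-like graph: for vertex u, its outgoing arcs are
// (head[e], weight[e]) for e in [offset[u], offset[u+1]).
struct Graph {
    std::vector<int> offset, head;
    std::vector<double> weight;
};

// Relax all outgoing arcs of the vertices in one bucket in parallel.
// dist[] is updated with an atomic "min" (compare-exchange loop); each thread
// collects the vertices it improved, and the main thread re-inserts them into
// the appropriate buckets serially, as in Δ-Stepping. Duplicates in the
// returned list are harmless in this sketch.
std::vector<int> relax_bucket(const Graph &g,
                              const std::vector<int> &bucket,
                              std::vector<std::atomic<double>> &dist) {
    std::vector<int> updated;
    #pragma omp parallel
    {
        std::vector<int> local;                      // per-thread updated vertices
        #pragma omp for schedule(dynamic, 64) nowait
        for (std::size_t i = 0; i < bucket.size(); ++i) {
            int u = bucket[i];
            double du = dist[u].load();              // may be slightly stale; only causes re-relaxation
            for (int e = g.offset[u]; e < g.offset[u + 1]; ++e) {
                int v = g.head[e];
                double cand = du + g.weight[e];
                double old = dist[v].load();
                while (cand < old &&
                       !dist[v].compare_exchange_weak(old, cand)) {
                    // 'old' is reloaded on failure; retry while still an improvement
                }
                if (cand < old) local.push_back(v);  // CAS succeeded: distance improved
            }
        }
        #pragma omp critical
        updated.insert(updated.end(), local.begin(), local.end());
    }
    return updated;
}
```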

8.
To address the problems of serial programs that contain large numbers of loops, a loop-selection scheme based on thread-level speculation is proposed. After an optimal selection over the loops, the scheme builds a set of loops that can run in parallel; from this set, the code regions with high parallel efficiency are chosen for parallel execution to speed up the serial program. Experiments show that, compared with the usual approaches that simply parallelize inner or outer loops, the scheme raises the average speedup of nine benchmark codes by 23.8%, improving the efficiency of parallel execution of serial programs.

9.
吴悦  雷超付  杨洪斌 《计算机工程》2010,36(9):35-37,40

10.
With the emergence and rapid development of multi-core processors, parallelizing classic serial programs so that they can better exploit multi-core architectures for higher performance has become a noteworthy problem in multi-core application research. Taking the parallelization of the ray-tracing program PBRT as an example, this paper studies in depth the design and implementation of the parallel model, correctness verification, and post-parallelization performance optimization involved in parallelizing a serial program. The optimized parallel PBRT achieves a speedup of nearly 3.5 with 4 threads, demonstrating that the proposed parallelization and performance optimizations are effective.

11.
In a multi-core processor, the cores can access external memory concurrently, providing memory-level parallelism (MLP) that a single processor does not offer. For loops in irregular applications, traditional parallelization methods have difficulty identifying the parallelism and cannot fully exploit the memory-level parallelism and computing power of multi-core processors. This paper discusses software-based exploitation of memory-level parallelism on multi-core processors and proposes a speculative parallel multithreading algorithm, LLSM (loop-level speculative multithreading). LLSM parallelizes the loops of irregular applications; test results on a multi-core processor show that the algorithm effectively exploits the memory-level parallelism and computing power of multi-core processors, and also indicate that the formula for memory-level parallel computation in a multi-core environment must take thread-synchronization overhead into account.

12.
With the trend of a growing number of processing cores integrated on chip multiprocessors, researchers are working hard to increase the available parallelism of software programs so as to efficiently harness the growing computing power. One noticeable direction among these efforts is speculative multithreading (SpMT), a.k.a. thread-level speculation, which aims to extract thread-level parallelism by splitting a sequential execution thread into several finer ones and executing them in parallel. A SpMT thread remains speculative until it "knows" that all of its input data are correct. A speculative thread needs to write to the L1 cache, but its output may be discarded if the speculation eventually fails; meanwhile, another speculative thread may already have read that speculative output. Some mechanism is therefore needed to support speculative reads and writes. Moreover, because the SpMT threads are extracted from a single thread, they usually share a lot of data, which can cause intense coherence traffic among the L1 caches, and supporting data coherence and speculation together is very complicated. This paper proposes a shared write buffer (SWB) among the SpMT cores. Speculative reads and writes are confined to the SWB, so speculation does not interfere with coherence and the L1 cache design can be drastically simplified. Experiments show that the SWB captures a large portion of inter-core data sharing, reduces cache-coherence traffic, and drastically improves the data-access performance of SpMT threads.

13.
CMP Support for Large and Dependent Speculative Threads
Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and that have minimal dependences between them. However, recent work has shown that TLS can offer compelling performance improvements when targeting much larger speculative threads of more than 50,000 dynamic instructions per thread with many frequent data dependences between them. To support such large and dependent speculative threads, the hardware must be able to buffer the additional speculative state and must also address the more challenging problem of tolerating the resulting cross-thread data dependences. In this paper, we present chip-multiprocessor (CMP) support for large speculative threads that integrates several previous proposals for TLS hardware. We also present support for subthreads: a mechanism for tolerating cross-thread data dependences by checkpointing speculative execution. In an evaluation that exploits the proposed hardware support in the database domain, we find that the transaction response times for three of the five transactions from TPC-C (on a simulated four-processor chip multiprocessor) speed up by a factor of 1.9 to 2.9.

14.
Analysis and Design of Multithreaded Speculation Control for SCMP
Single-chip multiprocessors (SCMP) have long been a focus in the evolution of processor microarchitecture. For general-purpose serial applications, an efficient thread-control method is an important part of implementing thread-level speculation and exploiting thread-level parallelism. Based on a concrete SCMP model, Griffon, this paper proposes and implements a simple, efficient, distributed thread-control method that is easy to implement and highly scalable. Experimental results show that thread control can be accomplished within a few cycles, meeting the requirements of on-chip parallel processing.

15.
As the problem domains of the complex systems under investigation rapidly grow in scale, the complexity of the multi-input component models used to construct logical processes (LPs) has significantly increased. High-performance computing technologies have therefore been used extensively to enable parallel simulation execution. However, the traditional multi-process parallel method (MPM) executes LPs in parallel on multi-core platforms while ignoring the intrinsic parallelism of multi-input component models. This study proposes a vectorized component model (VCM) framework designed to better exploit the parallelism of multi-input component models, and builds a two-level composite parallel method (CPM) within the framework that can sustain complex-system simulation applications consisting of multi-input component models. CPM first employs MPM to dispatch LPs onto a multi-core computing platform and then maps VCMs onto the cores for parallel execution. Experimental results indicate that (1) the proposed VCM framework makes better use of the parallelism of multi-input component models, and (2) CPM significantly improves performance compared with the traditional MPM; the results also show that CPM can effectively cope with the size and complexity of complex simulation applications with multi-input component models.

16.
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction-level parallelism (ILP), to improving multithreaded application performance by supporting thread-level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or by compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache-line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software-controlled data-forwarding technique that sends data to the destination's cache before it is needed, eliminating cache misses in the destination's cache as well as reducing coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architectural optimizations to multi-core processors.

17.
To make full use of multi-core resources and improve mining performance on multi-core processors, a parallel sequential-pattern mining algorithm for multi-core platforms is proposed that combines dynamic and static task-assignment mechanisms. The algorithm uses a strategy combining data parallelism and task parallelism: after each processor core generates its local sequential patterns, the cores cooperate to obtain all global sequential patterns. A parallel local-reduction technique eliminates the redundant generation and computation of local sequences, and the combination of static and dynamic task assignment addresses load imbalance among the cores. Both theoretical analysis and experiments confirm that the algorithm effectively exploits multi-core platforms and the advantages of multi-core architectures, achieving high efficiency and good speedup.
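A heavily simplified sketch of the "parallel local reduction" idea named above: each thread accumulates local counts for its share of the sequence database and the per-thread results are merged afterwards. Pattern generation is reduced here to counting individual items purely for illustration; this is not the paper's mining algorithm, and the dynamic loop schedule only gestures at its dynamic task assignment:

```cpp
#include <map>
#include <string>
#include <vector>
#include <omp.h>

// Each thread counts items over its share of the sequences into a private map
// (the "local" results); the per-thread maps are then merged into the global
// counts, so no locking is needed inside the hot loop.
std::map<std::string, long>
count_items(const std::vector<std::vector<std::string>> &database) {
    std::map<std::string, long> global;
    #pragma omp parallel
    {
        std::map<std::string, long> local;            // per-thread local counts
        #pragma omp for schedule(dynamic) nowait      // dynamic scheduling balances load
        for (std::size_t s = 0; s < database.size(); ++s)
            for (const std::string &item : database[s])
                ++local[item];
        #pragma omp critical                          // merge local results once per thread
        for (const auto &kv : local)
            global[kv.first] += kv.second;
    }
    return global;
}
```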

18.
Large-integer arithmetic is widely used in public-key cryptography, in high-precision floating-point arithmetic for large-scale scientific computing, and in constructing large eigenvalues, but most of its algorithms are expensive in both space and time. For large-integer multiplication in particular, one of the core operations, the extremely long serial computation time becomes a major bottleneck once the data reach a certain scale. In recent years, with the rapid development of multi-core and many-core chips, exploiting the parallelism inherent in an algorithm to harness the computing power of parallel processors, and thereby improve performance efficiently, has become a research trend. Based on a general-purpose multi-core parallel computing platform, this paper studies the parallelization of the fast Comba and Karatsuba algorithms for large-integer multiplication and proposes efficient multi-core parallel algorithms. The implementation and performance optimization use multi-level parallelism combining OpenMP and SIMD, yielding a large performance improvement. In the performance tests, the optimized parallel algorithms are compared with the original serial algorithms; with 8 threads, the parallel Comba and Karatsuba algorithms achieve speedups of 5.85x and 6.14x, respectively, over their serial counterparts.
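A minimal sketch of a parallel Comba-style (column-oriented) multiplication with OpenMP, not the paper's OpenMP+SIMD implementation; the base-2^16 digit representation and function name are illustrative assumptions that keep the 64-bit column accumulators overflow-free:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>
#include <omp.h>

// Numbers are little-endian vectors of 16-bit digits (base 2^16), so each
// partial product fits in 32 bits and a 64-bit column accumulator cannot
// overflow for any realistic operand length.
using Digit = std::uint16_t;
using Big = std::vector<Digit>;

Big comba_mul_parallel(const Big &a, const Big &b) {
    std::size_t n = a.size(), m = b.size();
    if (n == 0 || m == 0) return Big();
    std::vector<std::uint64_t> col(n + m, 0);

    // Comba-style column accumulation: column k sums all a[i]*b[k-i].
    // Columns are independent, so they can be computed in parallel.
    #pragma omp parallel for schedule(dynamic, 64)
    for (std::size_t k = 0; k < n + m - 1; ++k) {
        std::uint64_t sum = 0;
        std::size_t i_lo = (k >= m) ? k - m + 1 : 0;
        std::size_t i_hi = std::min(k, n - 1);
        for (std::size_t i = i_lo; i <= i_hi; ++i)
            sum += std::uint64_t(a[i]) * b[k - i];
        col[k] = sum;
    }

    // Serial carry propagation turns the 64-bit column sums into digits.
    Big r(n + m, 0);
    std::uint64_t carry = 0;
    for (std::size_t k = 0; k < n + m; ++k) {
        std::uint64_t cur = col[k] + carry;
        r[k] = Digit(cur & 0xFFFF);
        carry = cur >> 16;
    }
    return r;
}
```

The columns carry no dependences until the final carry-propagation pass, which stays serial but is only O(n + m), so the expensive O(n·m) voting of partial products is what gets parallelized.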
