期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

邓晴莺张民选蒋江《计算机工程》2008,34(12):13-15

同时多线程能在同一时钟周期执行不同线程的指令,并且指令级并行和线程级并行。显式并行指令计算关注于编译器和硬件的相互协作。寄存器文件的设计在高性能处理器设计中十分重要,寄存器栈和寄存器栈引擎是提高其性能的重要手段。该文设计和实现一套并行环境,其中包括并行编译器OpenUH和基于IA-64的同时多线程体系结构EDSMT,实验表明,该并行架构适用于大多数并行应用,针对NAS的并行测试程序,该架构相对于SMTSIM平均有12.48%的性能提升。相似文献

2.

基于EPIC的同时多线程处理器取指策略

下载免费PDF全文

贾小敏孙彩霞张民选《计算机工程》2007,33(4):256-258

EPIC硬件简单，同时多线程易于开发线程级并行，在EPIC上实现同时多线程可以结合二者的优点。取指策略对同时多线程处理器的性能有重要影响。该文介绍了几种有代表性的超标量同时多线程处理器取指策略，分析了这些策略在EPIC同时多线程处理器上的适用性，提出了一种新的适用于EPIC的取指策略SICOUNT。分析表明SICOUNT策略可以充分利用EPIC软硬件协同的优势，在选择取指线程时使用编译器所提供的停顿信息，能更精确地估计各个线程的流动速度，使取出指令的质量更高。相似文献

3.

多核同时多线程处理器的线程调度器设计

《电子技术应用》2016,(1):19-21

多核同时多线程处理器(SMT_PAAG)是用于图形、图像及数字信号处理的一种多核处理器。基于这种处理器提出了一种硬件线程调度器,该调度器采用同时多线程技术,最多可同时执行四个线程,支持八个线程阻塞模式下的快速上下文切换。这样避免了因阻塞带来的等待问题,能够有效提高处理器的工作效率和资源利用率。通过在处理器上运行图形处理算法进行性能评测。结果表明,SMT-PAAG处理器通过挖掘指令级并行和线程级并行,将处理器的性能提高了69.25%。相似文献

4.

基于事务性执行的投机并行多线程软件模拟

姚震郑启龙陈国良杨晓奇《小型微型计算机系统》2008,29(3):437-443

基于事务性执行的投机并行多线程是一种适合未来多核微处理器架构的新型并行程序设计和编译技术.但在此基础上的并行程序执行过程更为复杂,程序执行过程的模拟成为关键问题之一.本文提出利用二进制代码级动态插桩技术对投机并行多线程程序进行功能性模拟,设计并实现了完整的软件平台,可精确地模拟和监控并行程序的线程级投机执行过程,检测访存冲突,从而实现投机并行多线程的语义.该软件平台同时可以作为进一步研究投机多线程并行程序真实执行过程的基础,并有效支持投机并行多线程编译器的设计和分析. 相似文献

5.

萤火虫2：一种多态并行机的硬件体系结构

李涛杨婷易学渊蒲林钱博文黄光新黄虎才韩俊刚《计算机工程与科学》2014,36(2):191-200

提出了一种新型的多态高效并行阵列机结构--萤火虫2号阵列机。该结构的处理单元可以在SIMD和MIMD两种模式下运行,兼有异步执行机制,还可以实现分布式指令级并行处理。采用了硬件的多线程管理器和高效通信机制,这些机制使得此种阵列机能够实现效率很高的线程级并行运算、数据级并行运算和分布式指令级并行运算。尤其值得指出的是,此种阵列机的流处理性能堪与专用集成电路匹敌。该结构还能有效实现静态与动态数据流计算,可以高效实现图形、图像和数字信号处理任务。相似文献

6.

超标量处理器中引入SMT技术的性能分析研究

下载免费PDF全文

史莉雯樊晓桠黄小平《计算机工程与应用》2009,45(5):13-15

同时多线程（SMT）是一种允许多个独立的线程每周期发射多条指令的技术,这种技术充分利用了可能存在的指令级并行和线程级并行,提高了有限资源的利用率。文章以西北工业大学航空微电子中心自主研发的32位超标量处理器“龙腾R2”为基础,引入SMT技术,在基本不改变内部结构大小、不增加执行功能部件、仅做一些必要修改的前提条件下进行研究。通过仿真不同的线程数和各种线程组合,进行性能分析。尽管存在制约性能提升的一些因素,引入SMT技术后依然获得了最高约50%的性能增加。相似文献

7.

面向DSWP并行的OpenMP任务调度机制的扩展与实现

刘晓娴赵荣彩丁锐《计算机科学》2013,40(9):38-43

多核处理器能够提升多线程程序的性能,但早已存在的诸多单线程程序无法从中获益,程序员也习惯于编写单线程程序.自动并行化技术是将单线程程序移植到多核上的重要手段,但是当循环中存在无法确定的数据依赖或复杂的控制流时,传统的自动并行化技术无法取得良好效果.Ottoni等人针对传统自动并行失败的循环提出了Decoupled Software Pipelining(DSWP)算法用以实现指令级的细粒度并行,但其需要对处理器体系结构的深入了解以及对核间通信队列和专用指令的硬件支持,并行性能和应用广泛性受到限制.基于OpenMP应用编程接口实现的DSWP并行不依赖于硬件上对核间通信队列和专用指令的支持,且不受平台的限制,但现有的OpenMP任务调度机制无法满足DSWP并行中对任务调度的需求.对现有的OpenMP任务调度机制进行扩展,增加了任务与线程绑定的属性,保证了基于OpenMP的DSWP并行程序的正确执行.在GCC的OpenMP运行库libgomp中扩展了任务绑定属性子句的功能,扩展后的GCC作为OpenMP DSWP程序的基础编译器,为自动并行提供支持.通过对基准测试集NPB3.3.1的测试表明,传统自动并行失败的循环,经OpenMP DSWP自动并行后在双核处理器上平均加速比达到1.23以上;使用添加了OpenMP DSWP算法的Open64编译器生成的并行程序,与仅使用传统自动并行方法的Intel 编译器和Open64编译器所得程序相比,平均加速比分别高出22％和26％. 相似文献

8.

Win32平台基于通信的多核并行编程方法

李青徐璐娜《计算机应用与软件》2014,(4):11-14

随着计算机硬件的发展,多核并行计算在计算机软件及应用领域的出现率也越来越频繁。目前的多核编程模型采用线程级并行模型,现有的多线程并行编程模型主要有线程库、指令模型和任务式模型三种。提出一种与MPI并行编程模型相似的基于通信的方法在Win32平台上来实现并行编程,在此基础上实现MTI并行编程模型。通过若干典型的测试给出使用MTI进行并行编程的执行结果,结果表明MTI是有效、易用的。相似文献

9.

同时多线程操作：下一代处理器平台

夏春《电子计算机》1998,(2):2-10

同时多线程操作通过在桢的时钟周期内从不同的线程中发送指令的方法而利用了指令级并行性和线程级并行性相似文献

10.

基于可重构处理器的并行优化算法

下载免费PDF全文

刘石柱尹首一殷崇勇刘雷波魏少军《计算机工程》2012,38(21):286-289

为挖掘可重构处理器的内在并行性,需要编译器通过分析程序的并行性来决定可重构处理器硬件最好的执行模式。为此,提出一种基于可重构处理器的并行优化算法。将有向无环图的并行计算部分映射到可重构处理器上,对任务实现3个不同层次的并行性(指令级并行、循环级并行、线程级并行)。测试结果表明,该算法使得可重构处理器在处理任务时比未用并行优化算法的性能提升1.2倍左右。相似文献

11.

Adaptive execution techniques of parallel programs for multiprocessors

Jaejin Lee Jung-Ho Park Honggyu Kim Changhee Jung Daeseob Lim SangYong Han 《Journal of Parallel and Distributed Computing》2010

In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedbacks, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our technique using a set of standard numerical applications and running them on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems. 相似文献

12.

Tuning Compiler Optimizations for Simultaneous Multithreading

Jack L. Lo Susan J. Eggers Henry M. Levy Sujay S. Parekh Dean M. Tullsen 《International journal of parallel programming》1999,27(6):477-503

Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SM T processor is capable of issuing multiple instructions from multiple threads to a processor's functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding long-latency operations. Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine, particularly for parallel processors. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to decrease high-cost inter-processor communication. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate for SMT, which can benefit from finegrained resource sharing within the processor. This paper reexamines several compiler optimizations in the context of simultaneous multithreading. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three optimizations should be applied differently in the context of SMT architectures: threads should be parallelized with a cyclic, rather than a blocked algorithm; non-loop programs should not be software speculated, and compilers no longer need to be concerned about precisely sizing tiles to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines. 相似文献

13.

低功耗SMT体系结构研究 总被引：3，自引：3，他引：3

赵荣彩唐志敏《计算机工程与设计》2002,23(8):7-12,17

由于应用程序中ILP和TLP的不足或不均衡性，使得超标量和多处理的性能和资源用率受到了挑战；而同时多线程（SMT）处理器则是一种能够充分利用资源，动态进行TLP到ILP转换的能量有效结构。文章围绕高性能、低功耗这两个目标讨论和探究了WMT体系结构的基本思想、设计技术、低功耗考虑了以及编译器和操作系统设计应注意和对待的新问题。相似文献

14.

面向移动终端的统计机器翻译解码定点化方法

李响徐金安姜文斌吕雅娟刘群《中文信息学报》2011,25(2):61-66

面向移动终端的统计机器翻译需求越来越多,但无浮点运算单元的处理器限制了翻译速度。该文提出了一种对统计机器翻译解码运算的定点化运算方法,缓解了无浮点运算单元的处理器对翻译速度的影响。基于PC和移动终端的实验表明,在保证翻译质量的情况下,利用定点处理浮点运算的解码器的运算速度较编译器模拟的浮点运算速度提高135.6%。因此,该方法可以有效地提高浮点运算能力薄弱的移动终端统计机器翻译速度。相似文献

15.

MOSI:一种基于超长指令字处理器的同时多线程微体系结构

万江华陈书明《计算机学报》2006,29(3):378-383

描述了一种基于超长指令字处理器的同时多线程微体系结构——MOSI（MultiOp Splitting Issue,多操作①分离发射）．MOSI动态地发射同一多操作内的指令．并通过写回缓冲保证计算结果的写回顺序与编译器的视图一致,从而以较小的代价解决了SMT技术中的关键问题．文中详细描述了写回缓冲的结构及算法,给出了多个线程的硬件模型,最后对硬件支持线程的个数及Cache的组织结构进行了讨论．实验结果表明,基于MOSI结构的双线程处理器能够将吞吐率提高40％．相似文献

16.

多核微机基于OpenMP的并行计算

蔡佳佳李名世郑锋《微机发展》2007,17(10):87-91

随着四核微机走向市场和八十核处理器在实验室研制成功,多核正引领软件研发发生基础性变化。开发人员需要在代码中添加线程来利用系统所提供的多个内核,从而提升PC应用软件的功能和性能。文中探讨在多核微机上进行并行计算的实现技术。介绍了共享存储系统并行编程接口OpenMP的模型、指令和库函数,以及Intel C 编译器9.1和Microsoft Visual Studio 2005等对OpenMP的支持;着重探讨了二维离散快速傅里叶变换并行算法的设计、实现与优化技术;展望了高性能并行计算软构件库的开发前景。相似文献

17.

Software-directed register deallocation for simultaneousmultithreaded processors

Lo J.L. Parekh S.S. Eggers S.J. Levy H.M. Tullsen D.M. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(9):922-933

This paper proposes and evaluates software techniques that increase register file utilization for simultaneous multithreading (SMT) processors. SMT processors require large register files to hold multiple thread contexts that can issue instructions out of order every cycle. By supporting better interthread sharing and management of physical registers, an SMT processor can reduce the number of registers required and can improve performance for a given register file size. Our techniques specifically target register deal location. While out-of-order processors with register renaming are effective at knowing when a new physical register must be allocated, they have limited knowledge of when physical registers can be deallocated. We propose architectural extensions that permit the compiler and operating system to: 1) free registers immediately upon their last use, and 2) free registers allocated to idle thread contexts. Our results, based on detailed instruction-level simulations of an SMT processor, show that these techniques can increase performance significantly for register-intensive, multithreaded programs 相似文献

18.

Machine independent AND and OR parallel execution of logicprograms. II. Compiled execution

Ramkumar B. Kale L.V. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(2):181-192

For pt.I. see ibid., p. 170-80. In pt.I, we presented a binding environment for the AND and OR parallel execution of logic programs. This environment was instrumental in rendering a compiler for the AND and OR parallel execution of logic programs machine independent. In this paper, we describe a compiler based on the Reduce-OR process model (ROPM) for the parallel execution of Prolog programs, and provide performance of the compiler on five parallel machines: the Encore Multimax, the Sequent Symmetry, the NCUBE 2, the Intel i860 hypercube and a network of Sun workstations. The compiler is part of a machine independent parallel Prolog development system built on top of a run time environment for parallel programming called the Chare kernel, and runs unchanged on these multiprocessors. In keeping with the objectives behind the ROPM, the compiler supports both on and independent AND parallelism in Prolog programs and is suitable for execution on both shared and nonshared memory machines. We discuss the performance of the Prolog compiler in some detail and describe how grain size can be used to deliver performance that is within 10% of the underlying sequential Prolog compiler on one processor, and scale linearly with increasing number of processors on problems exhibiting sufficient parallelism. The loose coupling between parallel and sequential components makes it possible to use the best available sequential compiler as the sequential component of our compiler 相似文献