期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

《微型机与应用》2017,(9):25-27

文章研究了一种多核架构下基于OpenMP的Dijkstra并行算法,以Dijkstra算法为基础设计并行程序。对传统Dijkstra算法进行分析,明确优化方向,再利用OpenMP开发工具对并行程序进行优化调试。结果表明,文中算法易于操作,并充分利用了多核处理器并行计算的优势,提高了算法的运行效率,验证了算法的优越性。相似文献

2.

共享存储环境下非平衡动力学方程组并行计算

迟利华刘杰《计算机应用》2010,30(Z1)

OpenMP是现代多核机群系统采用的主要并行编程模型之一,在单CPU多核上可以获得良好的加速性能,但在整个机群系统上使用时,需要解决可扩展性差的问题.首先设计了求解非平衡动力学方程的并行算法.基于分布共享的多核机群系统,采用显式数据分布OpenMP并行计算方法,将数据进行分布式划分,分配到每个OpenMP线程,通过数据共享实现数据交换.计算结果表明显式OpenMP并行程序在保持可读性的同时,具有良好的可扩展性,在4核Xeon处理器构成的分布共享机群系统上,非平衡动力学方程组的数值并行计算可以扩展到1024个CPU核,具有明显的并行加速计算效果. 相似文献

3.

基于异构多核的CCA并行构件模型

彭云峰张炜《计算机应用研究》2014,31(12)

并行构件技术的出现提高了并行软件的开发效率,但现有的并行构件技术缺乏对异构多核平台的支持.为了提高并行构件程序在异构平台上的执行性能,扩展CCA(通用构件体系结构)并行构件模型支持CCA异构并行构件,提出了一种异构的CCA并行构件模型.使用管理者—工人模式调度CCA异构并行构件内的计算任务到异构多核平台上加速执行.在CCA构件工具包的基础上实现了支持扩展CCA并行构件模型的编译系统和运行时框架.在CELL BE和GPU两种异构多核处理器上进行的实验证明了提出的方法比原始的CCA构件程序具有较优的性能.提出的并行构件模型应用在并行程序开发中可以提高并行程序的性能. 相似文献

4.

面向神威高性能多核处理器的并行编译优化方法

周雍浩徐金龙李斌钱宏聂凯《计算机工程》2022,48(9):130-138

在神威高性能多核服务器上,自动并行化编译系统为识别和申明程序中的并行性,产生的OpenMP程序没有经过充分的优化,其采用简单的fork-join模型,存在大量的并行循环嵌套,导致运行效率低。为提升自动并行化编译系统产生的OpenMP程序的运行效率,提出一种并行域重构优化技术。并行域重构技术通过合并程序中的并行域和扩展嵌套循环中的并行域范围,减少OpenMP程序的并行域数目,降低线程组频繁创建和合并等控制开销,将简单fork-join模型的OpenMP程序转换为性能更为高效的单程序多数据模型的OpenMP程序。实验结果表明,在新一代神威高性能多核服务器SW1621平台上,并行域重构技术在NPB3.3-OMP测试集和SPEC OMP2012测试集上的运行效率分别提高了10.77%和7.94%的,可有效提升自动并行化编译系统OpenMP程序的执行效率。相似文献

5.

面向DSWP并行的OpenMP任务调度机制的扩展与实现

刘晓娴赵荣彩丁锐《计算机科学》2013,40(9):38-43

多核处理器能够提升多线程程序的性能,但早已存在的诸多单线程程序无法从中获益,程序员也习惯于编写单线程程序.自动并行化技术是将单线程程序移植到多核上的重要手段,但是当循环中存在无法确定的数据依赖或复杂的控制流时,传统的自动并行化技术无法取得良好效果.Ottoni等人针对传统自动并行失败的循环提出了Decoupled Software Pipelining(DSWP)算法用以实现指令级的细粒度并行,但其需要对处理器体系结构的深入了解以及对核间通信队列和专用指令的硬件支持,并行性能和应用广泛性受到限制.基于OpenMP应用编程接口实现的DSWP并行不依赖于硬件上对核间通信队列和专用指令的支持,且不受平台的限制,但现有的OpenMP任务调度机制无法满足DSWP并行中对任务调度的需求.对现有的OpenMP任务调度机制进行扩展,增加了任务与线程绑定的属性,保证了基于OpenMP的DSWP并行程序的正确执行.在GCC的OpenMP运行库libgomp中扩展了任务绑定属性子句的功能,扩展后的GCC作为OpenMP DSWP程序的基础编译器,为自动并行提供支持.通过对基准测试集NPB3.3.1的测试表明,传统自动并行失败的循环,经OpenMP DSWP自动并行后在双核处理器上平均加速比达到1.23以上;使用添加了OpenMP DSWP算法的Open64编译器生成的并行程序,与仅使用传统自动并行方法的Intel 编译器和Open64编译器所得程序相比,平均加速比分别高出22％和26％. 相似文献

6.

基于数据对齐属性指导的GCC自动向量化优化

李春江黄娟娟徐颖董钰山《计算机工程与科学》2014,36(6):1011-1017

主流通用处理器都已经实现了多核并行以及处理器核内的SIMD并行。虽然GCC编译器实现了面向SIMD并行的自动向量化,但是编译器针对OpenMP并行程序的自动向量化效果仍很不理想。针对多线程并行的OpenMP程序,基于GCC的OpenMP编译实现,扩展了数据对齐属性指导语句,使编译器在自动向量化时能够进行更准确的数据对齐与否的判断,优化了GCC编译器的自动向量化。相似文献

7.

嵌入式零树小波压缩和解压缩的并行化算法

韩丽洁李文田晏嘉《计算机应用》2009,29(Z1)

嵌入式零树小波压缩算法是图像压缩技术中有效的压缩算法,但其压缩时间较长.对该算法进行了研究,并在多核机群系统下实现了该算法的并行算法,提高了算法的性能.实现了MPI和MPI+OpenMP两种并行算法,并将串行算法、MPI并行算法与MPI+OpenMP并行算法进行比较.结果显示,随着数据量的增多,MPI并行算法和MPI+OpenMP并行算法相对于串行算法的运行效率都有明显提高,其中MPI+OpenMP并行算法的效率更好. 相似文献

8.

多核并行技术在三维场景加载中的应用 总被引：1，自引：0，他引：1

下载免费PDF全文

李喆郑晓薇《计算机工程》2011,37(6):245-246

研究一种用于三维漫游场景的多核并行加载系统。在多核计算机上采用OGRE进行场景加载,利用OpenMP实现多线程创建与同步,动态设置并行程序的线程数量。通过对一个三维山地漫游场景加载不同数量植物的实例,测试出不同线程下并行加载Mesh的时间,获得较好的加速比。实验结果表明,采用OpenMP并行技术可有效改进三维漫游场景的加载速率,加快地形场景的显示,提高绘制效率。相似文献

9.

多核计算机上的并行计算

李斌李海玉孙延君《电脑编程技巧与维护》2011,(18):34-35

研究了多核计算机上0penMP＋Vc＋＋编程模式的并行程序,并在双核和四核计算机上分别使用传统算法和并行算法计算数列求和、矩阵乘积及矩阵Cholesky分解。试验表明,传统串行程序只能利用多核计算机的一个核资源,而采用OpenMP程序的并行效率很高。相似文献

10.

一种利用并行复算实现的OpenMP 容错机制 总被引：1，自引：0，他引：1

富弘毅丁滟宋伟杨学军《软件学报》2012,23(2):411-427

基于并行复算的故障恢复技术,将故障恢复的计算任务分配至未发生故障的结点上并行执行,从而显著缩短复算时间,有效降低故障恢复开销,提高并行程序容错性能.基于该故障恢复技术,提出了一种针对OpenMP并行程序的容错机制PR-OMP,有效解决了分段复算、复算负载重分布等问题;此外,还扩展了传统编译数据流分析技术,提出了针对OpenMP并行程序的数据流分析技术,并基于该技术计算状态保存开销进行优化.设计实现了用于支持PR-OMP的编译工具GiFT-OMP,并通过实验证明了PR-OMP机制及其支持工具的有效性,评估并分析了其性能和可扩展性. 相似文献

11.

Implementation and tuning of a parallel symmetric Toeplitz eigensolver

Pedro AlonsoAuthor Vitae Antonio M. Vidal^{Author Vitae} 《Journal of Parallel and Distributed Computing》2011,71(3):485-494

In a previous paper (Vidal et al., 2008, [21]), we presented a parallel solver for the symmetric Toeplitz eigenvalue problem, which is based on a modified version of the Lanczos iteration. However, its efficient implementation on modern parallel architectures is not trivial.In this paper, we present an efficient implementation on multicore processors which takes advantage of the features of this architecture. Several optimization techniques have been incorporated to the algorithm: improvement of Discrete Sine Transform routines, utilization of the Gohberg-Semencul formulas to solve the Toeplitz linear systems, optimization of the workload distribution among processors, and others. Although the algorithm follows a distributed memory parallel programming paradigm that is led by the nature of the mathematical derivation, special attention has been paid to obtaining the best performance in multicore environments. Hybrid techniques, which merge OpenMP and MPI, have been used to increase the performance in these environments. Experimental results show that our implementation takes advantage of multicore architectures and clearly outperforms the results obtained with LAPACK or ScaLAPACK. 相似文献

12.

Supporting OpenMP on a multi-cluster embedded MPSoC

Andrea Marongiu Paolo Burgio Luca BeniniAuthor vitae 《Microprocessors and Microsystems》2011,35(8):668-682

The ever-increasing complexity of MPSoCs is putting the production of software on the critical path in embedded system development. Several programming models and tools have been proposed in the recent past that aim to facilitate application development for embedded MPSoCs. OpenMP is a mature and easy-to-use standard for shared memory programming, which has recently been successfully adopted in embedded MPSoC programming as well. To achieve performance, however, it is necessary that the implementation of OpenMP constructs efficiently exploits the many peculiarities of MPSoC hardware, and that custom features are provided to the programmer to control it. In this paper we consider a representative template of a modern multi-cluster embedded MPSoC and present an extensive evaluation of the cost associated with supporting OpenMP on such a machine, investigating several implementation variants that are aware of the memory hierarchy and of the heterogeneous interconnection. 相似文献

13.

Parallelization of stochastic bounds for Markov chains on multicore and manycore platforms

Jarosław Bylina 《The Journal of supercomputing》2018,74(4):1497-1509

The author demonstrates the methodology for parallelizing of finding stochastic bounds for Markov chains on multicore and manycore platforms. The stochastic bounds algorithm for Markov chains with the sparse matrices is investigated, thus needing a lot of irregular memory access. Its parallel implementations should scale across multiple threads and characterize with a high performance and performance portability between multicore and manycore platforms. The presented methods are built on the usage of two parallelization extensions of the C++ language: OpenMP and Cilk Plus. For this two extensions, we use two programming models, namely loop parallelism and task-based parallelism. The numerical experiments show the execution time of the implementations and the scalability on multicore and manycore platforms. This work provides the parallel implementations and at the same time presents an educational example of how computer science problems with irregular memory access can be implemented for high performance using OpenMP and Cilk Plus. 相似文献

14.

基于OpenMP/MPI并行编程模型的N体问题的优化实现

祝永志续士强禹继国《计算机工程与应用》2016,52(5):16-21

多核集群的层次化并行编程模型一直是高性能计算的研究热点。以SMP集群为例,从硬件上可分为节点间和节点内的两层架构。阐述了层次化并行编程的实现技术,针对N体问题算法进行了基于Hybrid并行编程模型的并行化研究。提出了一种块同步MPI/OpenMP细粒度N体问题的优化算法。基于曙光TC5000A集群,将该算法与传统的N体并行算法进行了执行时间与加速比的比较,得出了几句总结性具体论述。相似文献

15.

Parallel Smith-Waterman Comparison on Multicore and Manycore Computing Platforms with BSP++

Khaled Hamidouche Fernando Machado Mendonca Joel Falcou Alba Cristina Magalhaes Alves de Melo Daniel Etiemble 《International journal of parallel programming》2013,41(1):111-136

Biological Sequence Comparison is an important operation in Bioinformatics that is often used to relate organisms. Smith and Waterman proposed an exact algorithm that compares two sequences in quadratic time and space. Due to high computing power and memory requirements, SW is usually executed on High Performance Computing (HPC) platforms such as multicore clusters and CellBEs. Since HPC architectures exhibit very different hardware characteristics, porting an application to them is an error-prone time-consuming task. BSP++ is an implementation of BSP that aims to facilitate parallel programming, reducing the effort to port code. In this paper, we propose and evaluate a parallel BSP++ strategy to execute SW on multiple multicore and manycore platforms. Given the same base code, we generated MPI, OpenMP, MPI/OpenMP, CellBE and MPI/CellBE versions, which were executed on heterogeneous platforms with up to 6,144 cores. The results obtained with real DNA sequences show that the performance of our versions is comparable to the hand-tuned strategies in the literature, evidencing the appropriateness and flexibility of our approach. 相似文献

16.

面向规则DOACROSS循环的流水并行代码自动生成

刘晓娴赵荣彩赵捷徐金龙《软件学报》2014,25(6):1154-1168

发掘DOACROSS 循环中蕴含的并行性,选择合适的策略将其并行执行,对提升程序的并行性能非常重要.流水并行方式是规则DOACROSS 循环并行的重要方式.自动生成性能良好的流水并行代码是一项困难的工作,并行编译器对程序自动并行时常常对DOACROSS 循环作保守处理,损失了DOACROSS 循环包含的并行性,限制了程序的并行性能.针对上述问题,设计了一种选择计算划分循环层和循环分块层的启发式算法,给出了一个基于流水并行代价模型的循环分块大小计算公式,并使用计数信号量进行并行线程之间的同步,实现了基于OpenMP 的规则DOACROSS 循环流水并行代码的自动生成.通过对有限差分松弛法（finite difference relaxation,简称FDR）的波前（wavefront）循环和时域有限差分法（finite difference time domain,简称FDTD）中典型循环以及程序Poisson,LU 和Jacobi 的测试,算法自动生成的流水并行代码能够在多核处理器上获得明显的性能提升,使用的流水分块大小计算公式能够较为精确地计算出循环流水并行时的最佳分块大小.自动生成的流水并行代码与基于手工选择的最优分块大小的流水并行代码相比,加速比达到手工选择加速比的89%. 相似文献

17.

机群OpenMP系统的设计与实现 总被引：5，自引：0，他引：5

吴少刚章隆兵蔡飞顾丽红唐志敏《计算机学报》2004,27(7):904-912

OpenMP以其易用性和支持增量并行的特点成为共享存储体系结构的编程标准．目前机群系统已成为高性能计算的主流平台,研究机群OpenMP系统对推进并行应用的开发和普及非常有意义．该文作者以软件DSM系统JIAJIA作为OpenMP的运行时系统,结合一个前端编译器OMP2JIA,在机群系统上实现了OpenMP／JIAJIA计算环境,同时在提高性能方面根据机群系统特点扩展了OpenMP制导,优化了后端运行时库。通过11个OpenMP应用,作者比较了该计算环境和一个支持OpenMP的硬件cc-NUMA系统(SGI 2100)的性能．结果表明,作者的机群OpenMP系统的7机平均加速比为4．62;SGI 2100系统为4．55,二者性能相当．相似文献

18.

Performance‐based parallel loop self‐scheduling using hybrid OpenMP and MPI programming on multicore SMP clusters

Chao‐Tung Yang Chao‐Chin Wu Jen‐Hsiang Chang 《Concurrency and Computation》2011,23(8):721-744

Parallel loop self‐scheduling on parallel and distributed systems has been a critical problem and it is becoming more difficult to deal with in the emerging heterogeneous cluster computing environments. In the past, some self‐scheduling schemes have been proposed as applicable to heterogeneous cluster computing environments. In recent years, multicore computers have been widely included in cluster systems. However, previous researches into parallel loop self‐scheduling did not consider certain aspects of multicore computers; for example, it is more appropriate for shared‐memory multiprocessors to adopt Open Multi‐Processing (OpenMP) for parallel programming. In this paper, we propose a performance‐based approach using hybrid OpenMP and MPI parallel programming, which partition loop iterations according to the performance weighting of multicore nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node. Experimental results show that the proposed approach performs better than previous schemes. Copyright © 2010 John Wiley & Sons, Ltd. 相似文献

19.

Assessment of offload-based programming environments for hybrid CPU–MIC platforms in numerical modeling of solidification

《Simulation Modelling Practice and Theory》2018

Heterogeneous (or hybrid) computing platforms with Intel Xeon Phi accelerators offer potential advantages of energy efficient, massively parallel computing, while supporting parallel programming models familiar for users of multicore CPUs. However, realizing this potential for real-world applications still remains a challenging issue. The main goal of this paper is the suitability assessment of offload-based programming environments for porting a real-life scientific application to hybrid platforms with Intel KNC and KNL accelerators, assuming no significant modifications of the application code. The main criterion of this assessment is the application performance. The evaluated environments include: 1) Intel Offload coupled with OpenMP, 2) OpenMP 4.0 and 3) OpenMP 4.5 Accelerator Models, and 4) hStreams Library with OpenMP. A real-life application dedicated to the numerical modeling of alloy solidification is used as a testbed in the assessment. An experimental evaluation of the four versions of the application code for a platform with KNC coprocessors shows that excluding OpenMP 4.0, the rest of them are able to adapt to expansion of available resources, however, with different efficiency. While the shortest execution time is achieved for Intel Offload, the high-level abstractions of hStreams contribute considerably to making porting and tuning the application easier, with low performance overheads in comparison to the low-level Intel Offload extension. Benchmarking the application performance and scalability on a platform with multiple KNL processors, using the Offload over Fabric technology with Intel Offload and OpenMP 4.5, concludes the assessment. 相似文献

20.

异构平台下格子Boltzmann方法实现及性能分析

张丹丹徐莹徐磊《计算机科学》2012,39(4):296-298,303

对CPU+GPU异构平台下的多种并行编程模式进行了研究,并针对格子Boltzmann方法实现了CUDA,MPI+CUDA,MPI+OpenMP+CUDA多级并行算法。结果表明,算法具有较好的加速性能;提出的根据计算量比例参数调节CPU和GPU之间负载均衡的方法,对于在异构平台上实现多级并行处理及资源的有效利用具有一定的参考和应用价值。相似文献