期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

一种支持多种访存技术的CBEA片上多核MPI并行编程模型 总被引：1，自引：0，他引：1

冯国富董小社胡冰王旭昊王恩东《计算机学报》2008,31(11)

现有的CBEA(Cell Broadband Engine Architecture)编程模型多侧重于支持类似于流处理的"批量访存"(Bulk Data Transfer)应用,传统非规则访存应用性能较低.文中基于Cell架构提出了一种同时支持"批量访存"与非规则访存应用的MPI并行编程模型,将通信分解在PPE(PowerPC Processing Element)上,拓宽模型的适用范围;在统一访存接口下,通过运行时访存剖分信息指导选择和优化访存以提高计算效率.实验结果表明,文中提出的编程模型支持多种访存模式并具有很好的并行加速比,可获得较同类相关技术30%~50%左右的性能提升. 相似文献

2.

PC机群上JIAJIA与MPI的比较 总被引：3，自引：2，他引：3

下载免费PDF全文

胡明昌史岗胡伟武唐志敏张福新《软件学报》2003,14(7):1187-1194

对JIAJIA和MPI (message passing interface)是进行了比较.JIAJIA和MPI分别代表共享存储和消息传递的编程模式.MPI显式进行数据传输,编程复杂;JIAJIA由底层维护数据一致性,并附加提供简单的消息传递函数,编程容易、灵活.JIAJIA分配共享内存时开销较大,初始化时间比MPI长.提出了一个关于并行加速比与进程数目之间关系的近似经验公式,推出JIAJIA和MPI性能差距随着进程数目的增多而增大的结论.测试结果表明,大部分应用程序的JIAJIA和MPI版本的并行性能差距不超过10%.对于通信量很小的应用程序,其JIAJIA和MPI的性能差距较小,而通信量本身较大的应用程序,其JIAJIA和MPI的性能差距主要取决于运行时产生的实际通信量. 相似文献

3.

异构多核上多级并行模型支持及性能优化

李士刚胡长军王珏李建江《软件学报》2013,24(12):2782-2796

低功耗及廉价性使得异构多核在超级计算机计算资源中占有重要比例.然而,异构多核具有高带宽及松耦合一致性等特点,获得理想的存储及计算性能需要更多地考虑底层硬件细节.实现了一种针对典型的异构多核Cell BE 处理器的多级并行模型CellMLP,通过C 语言扩展编译指导语句,实现了对数据并行、任务并行以及流水并行编程模型的支持,提高了并行程序生产率.运行支持优化方面,数据并行采用SPE 并行数据传输、双缓冲等优化手段来提高数据传输带宽;任务并行使用一种新式混合任务队列以支持异步任务窃取,降低SPE 线程间竞争,提高了任务并行的可扩展性;流水并行首次使用阻塞信号传输机制实现SPE 线程间的低开销同步操作.实验对Stream,NASBenchmark 及BOTS 等应用进行了测试,结果表明,CellMLP 可对多种典型并行应用进行高效支持.与目前同类编程模型SARC 及CellSs 进行性能对比,其结果表明,CellMLP 实际数据传输带宽以及非规则应用的支持方面具有明显优势. 相似文献

4.

面向层次化NoC的混合并行编程模型 总被引：1，自引：0，他引：1

下载免费PDF全文

曹祥易伟潘红兵高明伦李丽《计算机工程》2010,36(13):278-280

为更好发挥多核处理器的硬件性能,针对层次化的片上网络架构,提出MPI/OpenMP混合并行编程模型。运用基于MPI的任务级并行模型实现片内簇间的高效通信,采用OpenMP模型实现簇内四核的通信、同步和数据交换。实验结果表明,与单一并行编程模型相比,混合并行编程模型加速比提高了20%~50%。相似文献

5.

MIOS:面向大规模CCNUMA系统的多实例操作系统

卢凯迟万庆高颖慧冯华《计算机研究与发展》2011,48(9)

MIOS是一个面向大规模CCNUMA系统设计的新型高可扩展操作系统.MIOS创新地采用了多实例内核结构,每个内核实例执行相同代码,分别独立运行和管理一个处理器,多核间通过分布存储管理构成高可扩展的一致性系统映像空间,支持弱共享进程、线程并行模型.MIOS针对大规模CCNUMA系统特点和高性能并行科学计算应用的需求,采用了显式共享数据分布、层次式任务调度、自适应任务间通信以及寄存器锁等优化.在大规模CCNUMA体系结构的银河深度并行计算机上的测试表明,MIOS对MPI应用具有同传统操作系统类似的性能,并可以有效支持2048处理器规模的OMP应用高效运行,具有良好的系统可扩展性. 相似文献

6.

基于Docker的MPI和OpenMP混合编程

赵博颖肖鹏张力《计算机与现代化》2018,(5):60

针对当前搭建集群并行系统复杂且耗时等问题,提出基于Docker搭建并行系统。介绍轻量级虚拟化技术Docker的核心概念和基本架构,并基于Docker技术在Linux平台上搭建集群并行开发环境。简要阐述并行计算的思想,叙述MPI和OpenMP并行计算的基本概念和特点,针对矩阵并行乘法的算法建立MPI和OpenMP的混合编程模型,并给出混合编程模型与MPI并行编程模型以及OpenMP并行编程模型的性能对比,分析出现差异的原因。基于该混合编程模型比较Docker与传统物理机两者搭建的并行系统的并行效率。相似文献

7.

基于SMP集群的三维网格多粒度混合并行编程模型 总被引：2，自引：0，他引：2

于方郑晓薇孙晓鹏《计算机应用与软件》2009,26(3)

为提高大规模三维网格并行算法的执行效率,针对SMP集群分布/共享两级内存层次结构的特点,介绍适用于SMP集群混合编程的不同实现方法.对三维网格模型最短路径问题的并行求解提出了多粒度混合并行编程模型,给出了实现该问题的MPI+OpenMP混合并行算法,并在SMP集群上同粗粒度MPI(Message Passing Interface)并行算法做了性能比较.结果表明,采用该多粒度混合并行编程模型具有更好的加速比和运行效率. 相似文献

8.

众核体系结构对Cilk语言的硬件支持及评测研究 总被引：4，自引：0，他引：4

龙国平张军超范东睿《计算机学报》2008,31(11)

如何编程众核体系结构是当前一个亟待解决的问题.研究可扩展的硬件机制支持Cilk编程模型的目的是在良好的编程性和可扩展硬件实现之间达到平衡.Cilk语言是C的精简扩展,程序员编写Cilk程序时和串行编程近似,且不需关心调度、负载均衡和局部性等系统底层相关的问题.文中以域一致性存储模型为基础,主要工作包括两方面:首先针对域一致性模型编程性不好的缺点提出一种以数据为中心维护高速缓存一致性的方法;其次提出实现DAG Consistency的缓存一致性协议,并在此基础上支持Cilk编程模型.实验结果表明,当处理器核数目较少(<16)时所有测试程序都能获得比较好的性能加速,并且指出了众核情况下(>16)难以获得理想加速效果的两个根本原因:静态路由导致片上网络带宽利用不均衡以及有限的访存带宽. 相似文献

9.

基于共享存储的高可伸缩嵌入式集群模型

尹文轩高翔朱晓静郭德源《计算机研究与发展》2012,(Z1):245-251

利用对称多处理机(SMP)作结点可为嵌入式集群带来更高的计算性价比,但多个并行和存储层次也会带来存储一致性、可伸缩性、性能差异等问题.提出一种基于共享存储的嵌入式集群模型LESC.该模型通过高度综合实现"计算单元-互连一致性模块-系统"三级高可伸缩结构,获得功耗成本有效性.LESC完成分布式共享存储的基本功能,其目录缓存一致性和扩展的共享存储机制改善了传统存储层次,并利用"共享存储虚拟网络"提供模块级的高效通信,避免了网络硬件开销,同时支持MPI编程.经该模型的真实系统平台测试,模块内MPI通信性能是传统嵌入式集群的3倍以上,单元间通信性能可达单元内性能的86%以上,Linpack测试其扩展性能在最差情况下接近理想值的70%. 相似文献

10.

一类Stencil应用在众核NUMA架构的性能研究

高凌云勾文进刘夏真袁武张鉴陆忠华《数据与计算发展前沿》2023,(6):58-66

【应用背景】模板计算是CFD（计算流体动力学,Computational Fluid Dynamics）等科学计算的典型算法,其访存性能受到关注。NUMA架构因扩展性好,在以鲲鹏920处理器为代表的ARM架构上普遍被应用。【方法】使用性能分析工具和benchmark程序,对鲲鹏平台的访存和通信子系统进行性能测试。针对典型stencil应用软件CCFD V3.0开展热点分析和性能测试,并建立Roofline模型。【结果】鲲鹏920处理器依托其众核NUMA架构,单节点浮点性能、内存带宽峰值,以及通信时延均优于Intel Xeon E5-2680v2与一款国产处理器。单节点时,CCFD V3.0在鲲鹏平台的运行速度约是Intel平台的2～3倍,是国产处理器的1.5～2倍。【结论】基于ARM架构的鲲鹏平台应用移植简单,其NUMA架构对模板计算一类访存密集性应用具有优势。相似文献

11.

访存密集型应用在SMP机群系统中的性能分析

顾丽红吴少刚《小型微型计算机系统》2006,27(7):1258-1261

SMP机群系统因其良好的性价比、卓越的可扩展性与可用性，逐渐成为当前高性能计算机领域的主流结构．这种结点内共享存储、结点间消息传递的两级混合结构是目前并行计算研究的热点,在单个SMP结点中，总线和内存带宽是否满足CPU和I／O的需求对于访存密集型应用的性能影响很大。本文针对访存密集型应用的特点测试分析了在SMP机群中访存冲突对系统性能的影响，结果表明我们的SMP结点存在性能瓶颈，这种量化分析对于设计大规模的基于SMP的机群系统有很好的指导意义．相似文献

12.

Runtime vs. Manual Data Distribution for Architecture-Agnostic Shared-Memory Programming Models

Nikolopoulos Dimitrios S. Ayguadé Eduard Polychronopoulos Constantine D. 《International journal of parallel programming》2002,30(4):225-255

This paper compares data distribution methodologies for scaling the performance of OpenMP on NUMA architectures. We investigate the performance of automatic page placement algorithms implemented in the operating system, runtime algorithms based on dynamic page migration, runtime algorithms based on loop scheduling transformations and manual data distribution. These techniques present the programmer with trade-offs between performance and programming effort. Automatic page placement algorithms are transparent to the programmer, but may compromise memory access locality. Dynamic page migration algorithms are also transparent, but require careful engineering and tuned implementations to be effective. Manual data distribution requires substantial programming effort and architecture-specific extensions to the API, but may localize memory accesses in a nearly optimal manner. Loop scheduling transformations may or may not require intervention from the programmer, but conform better to an architecture-agnostic programming paradigm like OpenMP. We identify the conditions under which runtime data distribution algorithms can optimize memory access locality in OpenMP. We also present two novel runtime data distribution techniques, one based on memory access traces and another based on affinity scheduling of parallel loops. These techniques can be used to effectively replace manual data distribution in regular applications. The results provide a proof of concept that it is possible to scale a portable shared-memory programming model up to more than 100 processors, without modifying the API and without exposing architectural details to the programmer. 相似文献

13.

Region-based parallelization of irregular reductions on explicitly managed memory hierarchies

Seonggun Kim Hwansoo Han Kwang-Moo Choe 《The Journal of supercomputing》2011,56(1):25-55

Multicore architectures are evolving with the promise of extreme performance for the classes of applications that require high performance and large bandwidth of memory. Irregular reduction is one of important computation patterns for many complex scientific applications, and it typically requires high performance and large bandwidth of memory. In this article, we propose region-based parallelization techniques for irregular reductions on multicore architectures with explicitly managed memory hierarchies. Managing memory hierarchy in software requires a lot of programming efforts and tends to be error-prone. The difficulties are even worse for applications with irregular data access patterns. To relieve the burden of memory management from programmers, we develop abstractions, particularly targeted to irregular reduction, for structuring parallel tasks, mapping the parallel tasks to processing units and scheduling data transfers between the memory hierarchies. Our framework employs iteration reordering based on regions of data along with dynamic scheduling of parallel tasks. We experimentally evaluate the effectiveness of our techniques for irregular reduction kernels on the Cell processor embedded in a Sony PlayStation3. Experimental results show the speedups of 8 to 14 on the six available SPEs. 相似文献

14.

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

《Journal of Parallel and Distributed Computing》2014,74(12):3202-3216

The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries. 相似文献

15.

Data race: tame the beast

K. Leung Z. Huang Q. Huang P. Werstein 《The Journal of supercomputing》2010,51(3):258-278

Data races hamper parallel programming and threaten the reliability of future software. This paper proposes the data race prevention scheme View-Oriented Data race Prevention (VODAP), which can prevent data races in the View-Oriented Parallel Programming (VOPP) model. VOPP is a novel shared-memory data-centric parallel programming model, which uses views to bundle mutual exclusion with data access. We have implemented the data race prevention scheme with a memory protection mechanism. Experimental results show that the extra overhead of memory protection is trivial in our applications. The performance is evaluated and compared with modern programming models such as OpenMP and Cilk. 相似文献

16.

Scalable shared-memory multiprocessor architectures

Thakkar S. Dubois M. Laundrie A.T. Sohi G.S. 《Computer》1990,23(6):71-74

Directory-based and bus-based cache coherence schemes are defined and described. Directory-based schemes can be classified as centralized or distributed. Both categories support local caches to improve processor performance and reduce traffic in the interconnection. Schemes using presence flags, B pointers, and linked lists are discussed. Bus-based systems provide uniform memory access to all processors. This memory organization allows a simpler programming model, making it easier to develop new parallel applications or to move existing applications from a uniprocessor to a parallel system. Two architectural variations of bus-based systems are described: multiple-bus and hierarchical architectures 相似文献

17.

Zippy: A Framework for Computation and Visualization on a GPU Cluster 总被引：1，自引：0，他引：1

Zhe Fan Feng Qiu Arie E. Kaufman 《Computer Graphics Forum》2008,27(2):341-350

Due to its high performance/cost ratio, a GPU cluster is an attractive platform for large scale general‐purpose computation and visualization applications. However, the programming model for high performance general‐purpose computation on GPU clusters remains a complex problem. In this paper, we introduce the Zippy frame‐work, a general and scalable solution to this problem. It abstracts the GPU cluster programming with a two‐level parallelism hierarchy and a non‐uniform memory access (NUMA) model. Zippy preserves the advantages of both message passing and shared‐memory models. It employs global arrays (GA) to simplify the communication, synchronization, and collaboration among multiple GPUs. Moreover, it exposes data locality to the programmer for optimal performance and scalability. We present three example applications developed with Zippy: sort‐last volume rendering, Marching Cubes isosurface extraction and rendering, and lattice Boltzmann flow simulation with online visualization. They demonstrate that Zippy can ease the development and integration of parallel visualization, graphics, and computation modules on a GPU cluster. 相似文献

18.

CUDA-Zero:a framework for porting shared memory GPU applications to multi-GPUs

CHEN DeHao CHEN WenGuang & ZHENG WeiMin 《中国科学:信息科学(英文版)》2012,(3):663-676

As the prevalence of general purpose computations on GPU, shared memory programming models were proposed to ease the pain of GPU programming. However, with the demanding needs of more intensive workloads, it’s desirable to port GPU programs to more scalable distributed memory environment, such as multi-GPUs. To achieve this, programs need to be re-written with mixed programming models (e.g. CUDA and message passing). Programmers not only need to work carefully on workload distribution, but also on scheduling mechanisms to ensure the efficiency of the execution. In this paper, we studied the possibilities of automating the process of parallelization to multi-GPUs. Starting from a GPU program written in shared memory model, our framework analyzes the access patterns of arrays in kernel functions to derive the data partition schemes. To acquire the access pattern, we proposed a 3-tiers approach: static analysis, profile based analysis and user annotation. Experiments show that most access patterns can be derived correctly by the first two tiers, which means that zero efforts are needed to port an existing application to distributed memory environment. We use our framework to parallelize several applications, and show that for certain kinds of applications, CUDA-Zero can achieve efficient parallelization in multi-GPU environment. 相似文献

19.

Adaptive thread mapping strategies for transactional memory applications

Márcio Castro Luís Fabrício W. Góes Jean-François Méhaut 《Journal of Parallel and Distributed Computing》2014

Transactional Memory (TM) is a programmer friendly alternative to traditional lock-based concurrency. Although it intends to simplify concurrent programming, the performance of the applications still relies on how frequent they synchronize and the way they access shared data. These aspects must be taken into consideration if one intends to exploit the full potential of modern multicore platforms. Since these platforms feature complex memory hierarchies composed of different levels of cache, applications may suffer from memory latencies and bandwidth problems if threads are not properly placed on cores. An interesting approach to efficiently exploit the memory hierarchy is called thread mapping. However, a single fixed thread mapping cannot deliver the best performance when dealing with a large range of transactional workloads, TM systems and platforms. In this article, we propose and implement in a TM system a set of adaptive thread mapping strategies for TM applications to tackle this problem. They range from simple strategies that do not require any prior knowledge to strategies based on Machine Learning techniques. Taking the Linux default strategy as baseline, we achieved performance improvements of up to 64.4% on a set of synthetic applications and an overall performance improvement of up to 16.5% on the standard STAMP benchmark suite. 相似文献