Query returned 20 similar documents (search time: 15 ms)
1.
Prasenjit Sengupta Jimmy Nguyen Jason Kwan Padmanabhan K. Menon Eric M. Heien John B. Rundle 《Concurrency and Computation》2015,27(17):5460-5471
Parallelization strategies are presented for Virtual Quake, a numerical simulation code for earthquakes based on topologically realistic systems of interacting earthquake faults. One of the demands placed upon the simulation is the accurate reproduction of the observed earthquake statistics over three to four decades. This requires the use of a high‐resolution fault model in computations, which demands computational power that is well beyond the scope of off‐the‐shelf multi‐core CPU computers. However, the recent advances in general‐purpose graphic processing units have the potential to address this problem at moderate cost increments. A functional decomposition of Virtual Quake is performed, and opportunities for parallelization are discussed in this work. Computationally intensive modules are identified, and these are implemented on graphics processing units, significantly speeding up earthquake simulations. In the current best case scenario, a computer with six graphics processing units can simulate 500 years of fault activity in California at 1.5 km × 1.5 km element resolution in less than 1 hour, whereas a single CPU requires more than 2 days to perform the same simulation. Copyright © 2015 John Wiley & Sons, Ltd.
2.
Parallel computing has become the mainstream trend. In parallel computing systems, synchronization is one of the key design issues and is critical to fully exploiting hardware performance. In recent years the GPU (graphics processing unit) has developed rapidly as the most widely used accelerator, and many applications place ever higher demands on GPU thread synchronization. However, existing GPU systems struggle to support the complex thread synchronization found in real applications efficiently. Although researchers have proposed many methods for GPU thread synchronization and made considerable progress, the GPU's unique architecture and parallel execution model mean that research on GPU thread synchronization still faces many challenges. This survey classifies thread synchronization in GPU parallel programming by synchronization purpose and granularity. On this basis, focusing on how GPU thread synchronization is expressed and executed, it first analyzes and summarizes the key problems and challenges: synchronization is hard to express efficiently, error-prone, and inefficient to execute. Then, for each synchronization granularity, it reviews recent academic and industrial work on GPU competitive synchronization and cooperative synchronization from two perspectives, expression methods and performance optimization methods, and analyzes and summarizes existing approaches. Finally, it points out future research trends and prospects for GPU thread synchronization and suggests possible research directions, providing a reference for researchers in this field.
3.
Mahesh Rajan Courtenay T. Vaughan Doug W. Doerfler Richard F. Barrett Paul T. Lin Kevin T. Pedretti K. Scott Hemmert 《Concurrency and Computation》2012,24(18):2404-2420
Multicore processors form the basis of most traditional high performance parallel processing architectures. Early experiences with these computers showed significant performance problems, both with regard to computation and inter‐process communication. The transition from Purple, an IBM POWER5‐based machine, to Cielo, a Cray XE6, as the main capability computing platform for the United States Department of Energy's Advanced Simulation and Computing campaign provides an opportunity to reexamine these issues after experiences with a few generations of multicore‐based machines. Experiences with Purple identified some important characteristics that led to strong performance of complex scientific application programs at very large scales. Herein, we compare the performance of some Advanced Simulation and Computing mission critical applications at capability scale across this transition to multicore processors. Copyright © 2012 John Wiley & Sons, Ltd.
4.
To study GPU-based high-performance parallel computing, the basic data processing algorithms of pulse compression radar, including pulse compression and coherent integration, were implemented on an NVIDIA GTX470 GPU integrating 448 processing cores. The pulse compression and coherent integration algorithms were redesigned for parallel execution according to the GPU's parallel processing architecture, mapping them efficiently onto the 448 cores of the GTX470 and yielding a GPU parallel implementation of the radar's basic processing algorithms. Finally, the results of the parallel computation were verified, and the quality of the processing results and the real-time performance were evaluated.
5.
WAPM: a parallel programming model for wide-area distributed computing
Earlier programming models such as MPI and OpenMP are unsuitable for the large-scale, dynamic wide-area Internet environment because of scalability limits or differences in parallel granularity. This paper proposes WAPM, a parallel programming model for wide-area networks, offering a new, feasible solution for programming distributed applications. WAPM consists of a communication library, a communication protocol, and an application programming interface; it features general-purpose programming, adaptive parallelism, and fault tolerance, and with a suitable programming language it forms a wide-area parallel programming environment. Using the distributed computing platform P2HP as the working platform, the paper describes how WAPM carries out distributed computation. Experimental results show that WAPM is a general, feasible programming model with good performance.
6.
《Concurrency and Computation》2017,29(4)
This paper proposes a high‐performance graphics processing unit (GPU)‐based parity computing scheduler, which we call GPU‐redundant array of inexpensive disks (RAID), to reduce the encoding and decoding time for storage applications. The proposed GPU‐RAID differs from existing RAID in that it performs additional pairwise‐parallel XOR operations between data code words in each data stripe by applying a divide‐and‐conquer approach using extra reserved space, and it also increases parallelism by processing multiple strips in parallel using multiple GPU threads. The proposed GPU‐RAID then pipelines data blocks into solid‐state disks and parity blocks into hard disk drives at the target server. The proposed algorithm decreases the span complexity of the parity computation schedule to O(log2(nw)), where n is the number of disks and w is the number of code words in a block, and it can be applied to various types of erasure codes. Experimental results show that the proposed storage application (SA1) improves average encoding performance by 63% and 41%, and average decoding performance by 58% and 38%, compared with the traditional storage applications GPUStore (SA3) and Gibraltar RAID (SA2), respectively. Copyright © 2016 John Wiley & Sons, Ltd.
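The logarithmic-span parity schedule described in entry 6 can be illustrated with a small sketch. The following is not the paper's GPU kernel; it is a minimal CPU-side C/OpenMP illustration, under an assumed buffer layout, of combining n data blocks with pairwise XORs in a balanced tree so that only about log2(n) dependent XOR levels remain on the critical path.

```c
/* Minimal sketch (not GPU-RAID's actual kernel): computes a parity block
 * as the XOR of n data blocks using a balanced pairwise tree, so the
 * span (critical path) is O(log2 n) XOR levels rather than n-1.
 * The in-place buffer layout and the use of OpenMP are assumptions. */
#include <stdint.h>
#include <string.h>

static void xor_blocks(uint64_t *dst, const uint64_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++)
        dst[i] ^= src[i];
}

/* data: n blocks of 'words' 64-bit words each, stored contiguously.
 * The reduction accumulates into lower-indexed blocks at each level,
 * standing in for the "extra reserved space" mentioned in the abstract. */
void parity_tree(uint64_t *data, size_t n, size_t words, uint64_t *parity)
{
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* all pairs at one level are independent, so they run in parallel */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n - stride; i += 2 * stride)
            xor_blocks(&data[i * words], &data[(i + stride) * words], words);
    }
    memcpy(parity, data, words * sizeof(uint64_t)); /* root of the tree */
}
```

Within each level all pairs are independent, so with enough parallel workers the schedule's depth, rather than the total XOR work, bounds the latency; that is the property the abstract's span-complexity claim refers to.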
7.
Miguel Lastra José M. Mantas Carlos Ureña Manuel J. Castro José A. García-Rodríguez 《Mathematics and computers in simulation》2009
This paper addresses the speedup of the numerical solution of shallow-water systems in 2D domains by using modern graphics processing units (GPUs). A first order well-balanced finite volume numerical scheme for 2D shallow-water systems is considered. The potential data parallelism of this method is identified and the scheme is efficiently implemented on GPUs for one-layer shallow-water systems. Numerical experiments performed on several GPUs show the high efficiency of the GPU solver in comparison with a highly optimized implementation of a CPU solver.
8.
High-performance computing clusters, as "infrastructure" underpinning national scientific research, have risen to the level of national strategy. High-performance computing is widely used and is an indispensable tool, particularly in materials science research. Providing scientific computing users with high-quality remote, visual, and graphical high-performance computing platforms has become a breakthrough point in current high-performance service research. This paper proposes a new high-performance computing platform for materials science research. The system is developed in Java and provides services through a browser/server (B/S) architecture; it integrates mainstream materials science computing software with the platform, offers a friendly user interface and convenient access, and combines an optimized OpenPBS job scheduling method to give platform users higher-priority handling of their computing requests, ensuring more efficient use of computing resources in materials research applications.
9.
Construction of a volunteer computing environment based on cooperating server groups
This paper constructs P2HP, a volunteer computing environment based on cooperating server groups. P2HP divides all nodes in the platform by role into monitoring server nodes, scheduling server nodes, computing nodes, and data servers, forming a scalable hierarchical network topology. P2HP is open, easy to use, fault tolerant, scalable, and cross-platform, and it provides a set of simple and convenient API (application programming interface) calls to support parallel application development. Test results show that P2HP is a feasible approach to handling high-performance parallel applications.
10.
The El'brus-3 and MARS-M represent two recent efforts to address the Soviet Union's high-performance computing needs through original, indigenous development. The El'brus-3 extends very long instruction word (VLIW) concepts to a multiprocessor environment and offers features that increase performance and efficiency and decrease code size for both scientific and general-purpose applications. It incorporates procedure static and globally dynamic instruction scheduling, multiple, simultaneous branch path execution, and iteration frames for executing loops with recurrences and conditional branches. The MARS-M integrates VLIW, data flow, decoupled heterogeneous processors, and hierarchical systems into a unified framework. It also offers a combination of static and dynamic VLIW scheduling. While the viability of these machines has been demonstrated, significant barriers to their production and use remain. This paper was written nearly entirely by means of e-mail between Tucson and Novosibirsk. It is one of the first examples of this type of collaboration between Russian and American colleagues.
11.
Zhisong Fu Sergey Yakovlev Robert M. Kirby Ross T. Whitaker 《Concurrency and Computation》2015,27(7):1639-1657
12.
徐胜超 《小型微型计算机系统》2011,32(8)
Because the volunteer computing model can efficiently aggregate and exploit large-scale idle computing resources on the Internet, it makes research on and realization of high-performance computing cheaper and easier than with cluster systems, and in recent years it has played an increasingly important role in engineering and scientific computing. In building a volunteer computing environment, the topology of the volunteer computing network, the task scheduling model, the application programming model, the data transfer protocol, and the extension of applications are all key technical issues. This paper surveys the basics of volunteer computing, reviews the progress of current volunteer computing projects on these key technical points, and offers an outlook on several future research directions.
13.
Lanjun Wan Xueyan Cui Yuanyuan Li Weihua Zheng Xinpan Yuan 《Concurrency and Computation》2024,36(11):e8014
Heterogeneous platforms composed of multiple different types of computing devices (such as CPUs, GPUs, and Intel MICs) have been widely used recently. However, most parallel applications developed on such heterogeneous platforms usually utilize only a certain kind of computing device, owing to the lack of easy-to-use heterogeneous cooperative parallel programming models. To reduce the difficulty of heterogeneous cooperative parallel programming, a directive-based heterogeneous cooperative parallel programming framework called HeteroPP is proposed. HeteroPP provides an easier way for programmers to fully exploit multiple different types of computing devices to concurrently and cooperatively perform data-parallel applications on heterogeneous platforms. An extension to OpenMP directives and clauses is proposed to make it possible for programmers to easily offload a data-parallel compute kernel to multiple different types of computing devices. A source-to-source compiler is designed to help programmers automatically generate multiple device-specific compute kernels that can be concurrently and cooperatively performed on heterogeneous platforms. Many experiments are conducted with 12 typical data-parallel applications implemented with HeteroPP on a heterogeneous CPU-GPU-MIC platform. The results show that HeteroPP not only greatly simplifies heterogeneous cooperative parallel programming, but also can fully utilize the CPUs, GPU, and MIC to efficiently perform these applications.
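The abstract above describes HeteroPP's directive extension only at a high level, so the sketch below is a hedged stand-in: it uses plain OpenMP 4.5 target offload plus host threads to show the general pattern of running one data-parallel kernel cooperatively on an accelerator and the CPU. The saxpy-style kernel, the array names, and the fixed 70/30 work split are illustrative assumptions, not HeteroPP syntax or the paper's generated code.

```c
/* Illustrative sketch only: standard OpenMP host + target offload splitting
 * one data-parallel kernel between an accelerator and the CPU. HeteroPP's
 * actual directives, clauses, and generated kernels are not shown here;
 * the 70/30 split and the saxpy update are assumptions. */
#include <stdio.h>

#define N 1000000

static float x[N], y[N];

int main(void)
{
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    const int split = (int)(0.7 * N);   /* first 70% of iterations to the device */
    const float a = 3.0f;

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            /* device portion: offloaded with a standard OpenMP target region */
            #pragma omp target teams distribute parallel for \
                    map(to: x[0:split]) map(tofrom: y[0:split])
            for (int i = 0; i < split; i++)
                y[i] = a * x[i] + y[i];
        }
        #pragma omp section
        {
            /* host portion: remaining iterations run on the CPU */
            for (int i = split; i < N; i++)
                y[i] = a * x[i] + y[i];
        }
    }

    printf("y[0] = %f, y[N-1] = %f\n", y[0], y[N - 1]);
    return 0;
}
```

A directive-based framework of the kind described would generate and balance such device-specific portions automatically rather than hard-coding the split.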
14.
袁静珍 《计算机工程与设计》2010,31(9)
This paper builds IPPE (Internet-based parallel programming environment), a parallel programming environment oriented toward sharing computing resources over the Internet. The environment is developed in Java; by exploiting the Java runtime system and a Java parallel communication class library, Java applications with parallel processing capability can be written under IPPE. IPPE is platform independent, easy to use, load balanced, and fault tolerant. Its platform independence and ease of use come from Java bytecode technology and object serialization, while its load balancing comes from an adaptive parallel task scheduling algorithm and a subtask-level fault-tolerance strategy. Running two typical parallel benchmark programs demonstrates the efficiency and stability of the IPPE environment.
15.
Rafal Krawczyk Tomasz Czarski Pawel Linczuk Andrzej Wojenski Maryna Chernyshova Krzysztof Pozniak Didier Mazon Piotr Kolasinski Grzegorz Kasprowicz Wojciech Zabolotny Michal Gaska Ewa Kowalsaka-Strzeciwilk Karol Malinowski Axel Jardin Philippe Malard 《Concurrency and Computation》2020,32(10)
This paper presents feasibility studies in utilizing graphics processing units (GPUs) as high‐performance computing hardware with front‐end electronics in high‐scale magnetic confinement thermal fusion experiments. The objective of the research is to provide scalable, high‐throughput, and low‐latency measurements for the runtime tokamak metallic impurities X‐ray diagnostic for the Tungsten Environment in Steady‐State Tokamak (WEST) reactor. The heterogeneous system of a front‐end with field‐programmable gate arrays and a back‐end server was introduced to decompose workloads efficiently. It allows the comprehensive evaluation of CPUs and accelerators. In particular, a novel implementation of the back‐end algorithm for the GPU is presented together with a performance analysis.
16.
This paper proposes RB-CRSP, a role-based computing resource sharing platform on the Internet. The design takes full account of node roles and functions, dividing network resources on the Internet by role into four kinds of entities: server nodes, coordinator nodes, worker nodes, and client nodes, which carry out parallel distributed computation in combination with RB-CRSP's application programming model. The adaptive resource scheduling strategy in RB-CRSP is analyzed; it takes node hardware information and a reputation mechanism into account to achieve load balancing, and in the dynamic Internet environment a worker-oriented fault-tolerance scheme guarantees the platform's reliability. A typical parallel benchmark program, the N-queens problem, was chosen as the case study; test results show that RB-CRSP can conveniently aggregate idle computing resources in a heterogeneous environment, and that the platform's performance is closely related to the machines' hardware conditions and reliability.
17.
Single-particle cryo-electron microscopy is one of the important methods in structural biology. RELION (regularized likelihood optimization), a Bayesian-based software package for processing 3D cryo-EM image data, offers good performance and ease of use and has attracted wide attention. However, its enormous computational demands limit its application. Targeting the characteristics of the RELION algorithm, this paper studies GPU-based parallel optimization. It first gives a comprehensive analysis of RELION's principles, the algorithmic structure of the program, and its performance bottlenecks; on this basis, the program is optimized for the GPU's fine-grained architecture and a GPU-based multi-level parallel model is proposed. RELION's data structures are reorganized to obtain good performance, and an adaptive parallel framework is designed to avoid running out of GPU memory. Experimental results show that the GPU-based RELION implementation achieves good performance: compared with a single CPU, the whole application is accelerated by more than 36x, and the compute-intensive algorithms by more than 75x. Tests on multiple GPUs show that GPU-based RELION scales well.
18.
《数据与计算发展前沿》2025,7(2)
[Objective] With the rapid development of information technology and the explosive growth of global data volumes, supercomputers have become an important driving force for scientific research and innovation. This paper examines the current state and development trends of supercomputing applications across multiple fields. [Methods] Through an extensive survey of supercomputers and their domain applications worldwide, the relevant high-performance computing applications are systematically classified and summarized, with a focus on chemistry and materials, physics, and other fields, and the match between computational requirements and supercomputer deployment is discussed. In addition, grid computing and the interconnection of supercomputers are discussed. [Results] Supercomputing has already shown significant benefits in many fields. As application demands and high-performance computing technology continue to evolve, higher requirements are placed on supercomputer hardware and software. [Limitations] Although supercomputing is flourishing and its range of applications is broad, this paper analyzes and summarizes only representative application areas. [Conclusions] Supercomputing markedly improves the efficiency of scientific discovery and technological innovation, providing strong support for future research and applications. At the same time, improving the performance and adaptability of supercomputers will be an important guarantee for future scientific progress.
19.
Baker James M. Gold Brian Bucciero Mark Bennett Sidney Mahajan Rajneesh Ramachandran Priyadarshini Shah Jignesh 《The Journal of supercomputing》2004,30(2):133-149
As technology improves and transistor feature sizes continue to shrink, the effects of on-chip interconnect wire latencies on processor clock speeds will become more important. In addition, as we reach the limits of instruction-level parallelism that can be extracted from application programs, there will be an increased emphasis on thread-level parallelism. To continue to improve performance, computer architects will need to focus on architectures that can efficiently support thread-level parallelism while minimizing the length of on-chip interconnect wires. The SCMP (Single-Chip Message-Passing) parallel computer system is one such architecture. The SCMP system includes up to 64 processors on a single chip, connected in a 2-D mesh with nearest neighbor connections. Memory is included on-chip with the processors and the architecture includes hardware support for communication and the execution of parallel threads. Since there are no global signals or shared resources between the processors, the length of the interconnect wires will be determined by the size of the individual processors, not the size of the entire chip. Avoiding long interconnect wires will allow the use of very high clock frequencies, which, when coupled with the use of multiple processors, will offer tremendous computational power.
20.
Mohr Bernd Malony Allen D. Shende Sameer Wolf Felix 《The Journal of supercomputing》2001,23(1):105-128
This paper proposes a performance tools interface for OpenMP, similar in spirit to the MPI profiling interface in its intent to define a clear and portable API that makes OpenMP execution events visible to runtime performance tools. We present our design using a source-level instrumentation approach based on OpenMP directive rewriting. Rules to instrument each directive and their combination are applied to generate calls to the interface consistent with directive semantics and to pass context information (e.g., source code locations) in a portable and efficient way. Our proposed OpenMP performance API further allows user functions and arbitrary code regions to be marked and performance measurement to be controlled using new OpenMP directives. To prototype the proposed OpenMP performance interface, we have developed compatible performance libraries for the Expert automatic event trace analyzer [17, 18] and the TAU performance analysis framework [13]. The directive instrumentation transformations we define are implemented in a source-to-source translation tool called OPARI. Application examples are presented for both Expert and TAU to show the OpenMP performance interface and OPARI instrumentation tool in operation. When used together with the MPI profiling interface (as the examples also demonstrate), our proposed approach provides a portable and robust solution to performance analysis of OpenMP and mixed-mode (OpenMP+MPI) applications.
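To make the directive-rewriting idea in entry 20 concrete, here is a hedged sketch of the kind of source-to-source transformation the abstract describes. The event_* calls, the region_desc descriptor, and the file/line values are hypothetical placeholders, not the actual performance API or the code OPARI emits; they only illustrate bracketing a parallel directive with event calls that carry source-location context.

```c
/* Hedged sketch of directive rewriting for performance instrumentation.
 * The event_* functions and region_desc struct are hypothetical placeholders,
 * not the interface generated by OPARI; they only show the bracketing pattern. */
#include <omp.h>
#include <stdio.h>

typedef struct {
    const char *file;   /* source location passed as context information */
    int         line;
    const char *kind;   /* e.g. "parallel", "for", "critical" */
} region_desc;

static region_desc r1 = { "compute.c", 42, "parallel" };   /* hypothetical location */

static void event_fork(const region_desc *r)  { printf("fork  %s:%d (%s)\n", r->file, r->line, r->kind); }
static void event_join(const region_desc *r)  { printf("join  %s:%d (%s)\n", r->file, r->line, r->kind); }
static void event_begin(const region_desc *r) { printf("begin %s:%d thread %d\n", r->file, r->line, omp_get_thread_num()); }
static void event_end(const region_desc *r)   { printf("end   %s:%d thread %d\n", r->file, r->line, omp_get_thread_num()); }

int main(void)
{
    /* Original user code:
     *     #pragma omp parallel
     *     { work(); }
     * Rewritten form: the encountering thread reports fork/join around the
     * region, and every team member reports begin/end inside it. */
    event_fork(&r1);
    #pragma omp parallel
    {
        event_begin(&r1);
        /* ... body of the user's parallel region ("work()") ... */
        event_end(&r1);
    }
    event_join(&r1);
    return 0;
}
```

A measurement library such as the ones built for Expert or TAU would implement these event callbacks to record traces or profiles instead of printing.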