Similar Documents
20 similar documents retrieved.
1.
The lattice Boltzmann method (LBM) is a widely used computational fluid dynamics method for flow problems with complex geometries and various boundary conditions. Large-scale LBM simulations with increasing resolution and extended temporal range require massive high-performance computing (HPC) resources, thus motivating us to port the method onto modern many-core heterogeneous supercomputers like Tianhe-2. Although many-core accelerators such as graphics processing units (GPUs) and Intel MIC have a dramatic advantage in floating-point performance and power efficiency over CPUs, they also pose a tough challenge for parallelizing and optimizing computational fluid dynamics codes on large-scale heterogeneous systems. In this paper, we parallelize and optimize the open source 3D multi-phase LBM code openlbmflow on the Intel Xeon Phi (MIC) accelerated Tianhe-2 supercomputer using a hybrid and heterogeneous MPI+OpenMP+Offload+single instruction, multiple data (SIMD) programming model. With cache blocking and a SIMD-friendly data structure transformation, we dramatically improve the SIMD and cache efficiency of the single-thread performance on both the CPU and the Phi, achieving speedups of 7.9X and 8.8X, respectively, compared with the baseline code. To make the CPUs and Phi coprocessors collaborate efficiently, we propose a load-balance scheme to distribute workloads among the two CPUs and three Phi coprocessors within a node and use an asynchronous model to overlap the collaborative computation and communication as much as possible. The collaborative approach with two CPUs and three Phi coprocessors improves the performance by around 3.2X compared with the CPU-only approach. Scalability tests show that openlbmflow can achieve a parallel efficiency of about 60% on 2048 nodes, with about 400K cores in total. To the best of our knowledge, this is the largest scale CPU-MIC collaborative LBM simulation for 3D multi-phase flow problems. Copyright © 2015 John Wiley & Sons, Ltd.
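The SIMD-friendly data-structure transformation mentioned above usually amounts to switching the distribution functions from an array-of-structures to a structure-of-arrays layout so that the innermost loop over cells becomes unit-stride and vectorizable. The fragment below is a minimal sketch of that idea in C with an OpenMP SIMD hint; the names (f_soa, collide_soa, the D3Q19 direction count) and the simple BGK relaxation step are illustrative assumptions, not the actual openlbmflow code.

```c
/* Minimal sketch: structure-of-arrays (SoA) layout for a D3Q19 lattice,
 * so the innermost loop over cells is contiguous and SIMD-friendly.
 * Names (f_soa, NCELL, Q) are illustrative, not from openlbmflow. */
#include <stddef.h>

#define Q     19        /* lattice directions (D3Q19)        */
#define NCELL (1 << 20) /* number of fluid cells (example)   */

/* AoS: f[cell][q] -- strided access per direction, poor for SIMD      */
/* SoA: f[q][cell] -- unit-stride access per direction, SIMD-friendly  */
typedef struct { double f[Q]; } cell_aos;   /* the layout being replaced */

static void collide_soa(double *restrict f_soa, const double *restrict feq,
                        double omega, size_t ncell)
{
    for (int q = 0; q < Q; ++q) {
        double *fq          = f_soa + (size_t)q * ncell;
        const double *feq_q = feq   + (size_t)q * ncell;
        /* unit-stride loop: the compiler can emit packed SIMD instructions */
        #pragma omp simd
        for (size_t i = 0; i < ncell; ++i)
            fq[i] += omega * (feq_q[i] - fq[i]);   /* BGK relaxation */
    }
}
```

The same layout also makes cache blocking straightforward, since each direction's array can be tiled independently.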

2.
We describe an efficient and scalable symmetric iterative eigensolver developed for distributed memory multi-core platforms. We achieve over 80% parallel efficiency by major reductions in communication overheads for the sparse matrix-vector multiplication and basis orthogonalization tasks. We show that the scalability of the solver is significantly improved compared to an earlier version, after we carefully reorganize the computational tasks and map them to processing units in a way that exploits the network topology. We discuss the advantage of using a hybrid OpenMP/MPI programming model to implement such a solver. We also present strategies for hiding communication on a multi-core platform. We demonstrate the effectiveness of these techniques by reporting the performance improvements achieved when we apply our solver to large-scale eigenvalue problems arising in nuclear structure calculations. Because sparse matrix-vector multiplication and inner product computation constitute the main kernels in most iterative methods, our ideas are applicable in general to the solution of problems involving large-scale symmetric sparse matrices with irregular sparsity patterns. Copyright © 2013 John Wiley & Sons, Ltd.
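Communication hiding of the kind described here is typically built on non-blocking MPI operations: the reduction (or halo exchange) is started, communication-free local work proceeds, and the wait happens only when the result is needed. The sketch below illustrates that pattern with a non-blocking all-reduce; the function name local_spmv_interior and the division of work are hypothetical placeholders, not the authors' solver.

```c
/* Sketch: overlap the global reduction needed for orthogonalization with
 * communication-free local SpMV work, using a non-blocking MPI_Iallreduce.
 * local_spmv_interior is a hypothetical placeholder for work on rows that
 * need no remote vector entries. */
#include <mpi.h>

void local_spmv_interior(const double *x, double *y, int nlocal); /* assumed */

void overlapped_step(const double *x, double *y, const double *dot_local,
                     double *dot_global, int nlocal, MPI_Comm comm)
{
    MPI_Request req;

    /* start the inner-product reduction ... */
    MPI_Iallreduce(dot_local, dot_global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* ... and do purely local work while the message is in flight */
    local_spmv_interior(x, y, nlocal);

    /* block only when the reduced value is actually needed */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```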

3.
Uintah is a software framework that provides an environment for solving fluid-structure interaction problems on structured adaptive grids for large-scale science and engineering problems involving the solution of partial differential equations. Uintah uses a combination of fluid flow solvers and particle-based methods for solids, together with adaptive meshing and a novel asynchronous task-based approach with fully automated load balancing. When applying Uintah to fluid-structure interaction problems, the combination of adaptive meshing and the movement of structures through space presents a formidable challenge in terms of achieving scalability on large-scale parallel computers. Uintah's approach to the growing number of cores per socket, together with the prospect of less memory per core, is to adopt a model that uses MPI to communicate between nodes and a shared-memory model on-node, so as to achieve scalability on large-scale systems. For this approach to be successful, it is necessary to design data structures that large numbers of cores can simultaneously access without contention. This scalability challenge is addressed here for Uintah by the development of new hybrid runtime and scheduling algorithms combined with novel lock-free data structures, making it possible for Uintah to achieve excellent scalability for a challenging fluid-structure interaction problem with mesh refinement on as many as 260K cores. Copyright © 2013 John Wiley & Sons, Ltd.
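Lock-free shared data structures of the kind this work relies on usually replace a mutex-protected counter or queue with an atomic fetch-and-add, so that many threads can claim work concurrently without contending on a lock. The sketch below shows the pattern with C11 atomics; it is a generic illustration under that assumption, not Uintah's actual task-queue implementation.

```c
/* Sketch: a bounded task pool where worker threads claim the next task
 * index with an atomic fetch-and-add, so no lock is held on the hot path.
 * Generic illustration only (not Uintah's data structure). */
#include <stdatomic.h>
#include <stddef.h>

#define POOL_CAP 4096

typedef struct {
    void (*run)(void *arg);
    void *arg;
} task_t;

typedef struct {
    task_t        tasks[POOL_CAP];
    atomic_size_t next;   /* next index to hand out to a worker         */
    size_t        count;  /* tasks filled in by the (single) producer   */
} task_pool;

/* worker threads call this concurrently; returns NULL when the pool is drained */
static task_t *pool_claim(task_pool *p)
{
    size_t idx = atomic_fetch_add_explicit(&p->next, 1, memory_order_relaxed);
    return (idx < p->count) ? &p->tasks[idx] : NULL;
}
```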

4.
Cluster computing is an attractive approach to providing high-performance computing for solving large-scale applications. Owing to advances in processor and networking technology, expanding clusters have resulted in system heterogeneity; thus, it is crucial to dispatch jobs to the appropriate heterogeneous computing resources for better resource utilization. In this paper, we propose a new job allocation system for heterogeneous multi-cluster environments named the Adaptive Job Allocation Strategy (AJAS), in which a self-scheduling scheme is applied in the scheduler to dispatch jobs to the most appropriate computing resources. Our strategy focuses on increasing resource utilization by dispatching jobs to computing nodes with similar performance capacities. By doing so, execution times among all nodes can be equalized. The experimental results show that AJAS can improve system performance. Copyright © 2011 John Wiley & Sons, Ltd.
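Equalizing execution times across heterogeneous nodes, as AJAS aims to do, can be pictured as splitting the job stream in proportion to each node's measured capacity. The short sketch below illustrates such a proportional split; the capacity scores and the remainder-handling rule are illustrative assumptions, not the published AJAS algorithm.

```c
/* Sketch: split total_jobs among heterogeneous nodes in proportion to a
 * measured capacity score, so faster nodes receive more work and all nodes
 * finish at roughly the same time.  Illustrative only (not AJAS itself). */
#include <stddef.h>

void proportional_split(const double *capacity, size_t nnodes,
                        size_t total_jobs, size_t *share)
{
    double cap_sum = 0.0;
    for (size_t i = 0; i < nnodes; ++i)
        cap_sum += capacity[i];

    size_t assigned = 0;
    for (size_t i = 0; i < nnodes; ++i) {
        share[i] = (size_t)(total_jobs * (capacity[i] / cap_sum));
        assigned += share[i];
    }

    /* hand any rounding remainder to the fastest node */
    size_t fastest = 0;
    for (size_t i = 1; i < nnodes; ++i)
        if (capacity[i] > capacity[fastest]) fastest = i;
    share[fastest] += total_jobs - assigned;
}
```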

5.
This paper reviews domestic research progress on parallel computing techniques for large-scale reservoir numerical simulation, presents computational examples and application results of fine-scale reservoir simulation on a domestically built Beowulf system, reports numerical simulation results and performance analysis for a reservoir case with one million grid cells run at different processor counts, and implements a post-processing display system that visualizes the massive result data as 3D plots, 2D plots, and tables.

6.
High-performance computing clusters are used for efficient parallel computation and offer a high performance-to-cost ratio and good scalability, so how to test and evaluate cluster system performance becomes a key issue. This paper runs Linpack tests on a six-node cluster, examining important factors such as problem size, number of compute nodes, the matrix block size NB, the process grid topology P×Q, and network communication, and compares single-machine and cluster computing performance in order to assess the cluster. The results show that the cluster delivers good parallel computing performance and strong scalability, but its hardware communication capability needs further improvement. When the cluster was applied to practical large-scale seismic data computation, its parallel computing capability yielded a substantial improvement.
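For context, the numbers reported by an HPL (Linpack) run follow from simple arithmetic: the benchmark performs roughly (2/3)N^3 + 2N^2 floating-point operations, the achieved rate Rmax is that operation count divided by the wall time, and parallel efficiency is Rmax over the theoretical peak Rpeak of the participating cores. The sketch below works through that arithmetic in C; all numeric inputs are placeholders, not measurements from the six-node cluster described here.

```c
/* Sketch: standard HPL performance arithmetic.
 * flops(N) ~ (2/3)N^3 + 2N^2, Rmax = flops/time, efficiency = Rmax/Rpeak.
 * All input values below are placeholders. */
#include <stdio.h>

int main(void)
{
    double N       = 40000.0;  /* HPL problem size (placeholder)          */
    double seconds = 1250.0;   /* measured wall time (placeholder)        */
    double cores   = 48.0;     /* physical cores taking part              */
    double ghz     = 2.5;      /* clock frequency in GHz                  */
    double fpc     = 8.0;      /* double-precision flops/cycle per core   */

    double flops = (2.0 / 3.0) * N * N * N + 2.0 * N * N;
    double rmax  = flops / seconds / 1e9;   /* achieved Gflop/s           */
    double rpeak = cores * ghz * fpc;       /* theoretical peak Gflop/s   */

    printf("Rmax  = %.1f Gflop/s\n", rmax);
    printf("Rpeak = %.1f Gflop/s\n", rpeak);
    printf("Efficiency = %.1f %%\n", 100.0 * rmax / rpeak);
    return 0;
}
```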

7.
Sparse matrix-vector multiply (SpMVM) is an important kernel that frequently arises in high performance computing applications. Due to its low arithmetic intensity, several approaches have been proposed in the literature to improve its scalability and efficiency in large-scale computations. In this paper, our target systems are high-end multi-core architectures, and we use a message passing interface (MPI) + OpenMP hybrid programming model for parallelism. We analyze the performance of a recently proposed implementation of distributed symmetric SpMVM, originally developed for large sparse symmetric matrices arising in ab initio nuclear structure calculations. We study important features of this implementation and compare it with previously reported implementations that do not exploit the underlying symmetry. Our SpMVM implementations leverage the hybrid paradigm to efficiently overlap expensive communications with computations. Our main comparison criterion is the 'CPU core hours' metric, which is the main measure of resource usage on supercomputers. We analyze the effects of a topology-aware mapping heuristic using a simplified network load model. We have tested the different SpMVM implementations on two large clusters with 3D Torus and Dragonfly topologies. Our results show that the distributed SpMVM implementation that exploits matrix symmetry and hides communication yields the best value for the 'CPU core hours' metric and significantly reduces data movement overheads. Copyright © 2015 John Wiley & Sons, Ltd.
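Exploiting matrix symmetry in SpMVM means storing only one triangle and letting every stored entry a(i,j) contribute to both y[i] and y[j], which roughly halves memory traffic at the cost of extra writes into y. The kernel below is a plain serial sketch of that idea in CSR format; it illustrates the principle only and is not the distributed, communication-hiding implementation analyzed in the paper.

```c
/* Sketch: symmetric SpMV y = A*x with only one triangle (plus the diagonal)
 * stored in CSR.  Each off-diagonal entry a(i,j) is applied twice:
 * to y[i] and to y[j].  Serial, generic illustration only. */
#include <stddef.h>

void spmv_sym_csr(size_t n, const size_t *rowptr, const size_t *col,
                  const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = 0.0;

    for (size_t i = 0; i < n; ++i) {
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; ++k) {
            size_t j = col[k];
            double a = val[k];
            y[i] += a * x[j];
            if (j != i)              /* mirror contribution from a(j,i) */
                y[j] += a * x[i];
        }
    }
}
```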

8.
We propose a new framework design for exploiting multi-core architectures in the context of visualization dataflow systems. Recent hardware advancements have greatly increased the levels of parallelism available, with all indications showing this trend will continue in the future. Existing visualization dataflow systems have attempted to take advantage of these new resources, though they still have a number of limitations when deployed on shared memory multi-core architectures. Ideally, visualization systems should be built on top of a parallel dataflow scheme that can optimally utilize CPUs and assign resources adaptively to pipeline elements. We propose the design of a flexible dataflow architecture aimed at addressing many of the shortcomings of existing systems, including a unified execution model for both demand-driven and event-driven models; a resource scheduler that can automatically make decisions on how to allocate computing resources; and support for more general streaming data structures which include unstructured elements. We have implemented our system on top of VTK with backward compatibility. In this paper, we provide evidence of performance improvements on a number of applications.

9.
Streamlines are one of the main techniques for flow-field visualization, but streamline generation for large-scale flow fields is so computationally expensive that it usually requires a parallel computing environment such as a high-performance computer, combined with parallelized algorithms, to accelerate the computation. With heterogeneous computing systems becoming increasingly common, and in order to fully exploit the computing power of parallel heterogeneous environments and achieve more efficient parallel streamline generation, this paper adopts a technical architecture that combines data-parallel primitives with distributed message passing and designs a hybrid parallel streamline generation system suited to heterogeneous clusters; on this basis, it designs the data partitioning, data redundancy, and inter-process communication strategies, and proposes and implements a parallel particle tracing algorithm. The system was deployed on a domestic supercomputing platform and evaluated on the visualization of large-scale CFD flow simulation results. We report the experimental results and analyze the speed, scalability, and load balancing of the core parallel algorithm, demonstrating the effectiveness and scalability of both the system and the algorithm.
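At the core of any parallel particle-tracing scheme is the advection step: each particle is advanced through the locally held block of the velocity field, and particles that step outside the block are handed to the process that owns the neighboring block. The fragment below sketches a single explicit-Euler step and the inside-the-block test; the sample_velocity helper and the block-bounds representation are illustrative assumptions, not the system described above, and a production tracer would use a higher-order integrator such as RK4.

```c
/* Sketch: one explicit-Euler advection step for a particle in a 3-D
 * velocity field.  sample_velocity() stands in for interpolation from the
 * local data block; particles that leave the block must be forwarded to
 * the owning rank by the caller.  Illustrative only. */
typedef struct { double x, y, z; } vec3;

/* assumed helper: interpolate the local velocity field at position p */
vec3 sample_velocity(const vec3 *p);

/* returns 1 if the particle is still inside the local block, 0 otherwise */
int advect_euler(vec3 *p, double dt, const vec3 *block_lo, const vec3 *block_hi)
{
    vec3 v = sample_velocity(p);
    p->x += dt * v.x;
    p->y += dt * v.y;
    p->z += dt * v.z;

    int inside = p->x >= block_lo->x && p->x < block_hi->x &&
                 p->y >= block_lo->y && p->y < block_hi->y &&
                 p->z >= block_lo->z && p->z < block_hi->z;
    return inside;   /* caller forwards particles with inside == 0 */
}
```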

10.
The field of computational biology encompasses a wide range of optimization problems that exhibit NP-hard (non-deterministic polynomial-time hard) complexity. Nowadays, phylogeneticians are dealing with a growing amount of biological data that must be analyzed to explain the origins of modern species. Evolutionary relationships among organisms are often described by means of tree-shaped structures known as phylogenetic trees. When inferring phylogenies, two main challenges must be addressed. The first is the inference of reliable evolutionary trees on data sets where different optimality principles support conflicting evolutionary hypotheses. The second is the processing of enormous tree search spaces where traditional sequential strategies cannot be applied. In this sense, phylogenetic inference can benefit from the combination of high performance computing and evolutionary computation to carry out the reconstruction of complex evolutionary histories in reduced execution times. In this paper, we introduce multiobjective phylogenetics, a hybrid OpenMP/MPI approach to parallelizing a well-known multiobjective metaheuristic, the fast non-dominated sorting genetic algorithm (NSGA-II). This algorithm has been designed to conduct phylogenetic analyses on multi-core clusters in accordance with two principles: maximum parsimony and maximum likelihood. The main goal is to combine the benefits of the shared-memory and distributed-memory programming paradigms to efficiently infer a set of high-quality Pareto solutions. Experiments on six real nucleotide data sets and comparisons with other hybrid parallel approaches show that multiobjective phylogenetics is able to achieve significant performance in terms of parallel, multiobjective, and biological results. Copyright © 2014 John Wiley & Sons, Ltd.
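The non-dominated sorting at the heart of NSGA-II rests on a pairwise dominance test over the objective vectors (here, parsimony and likelihood scores). The helper below is a generic sketch of that test for minimization objectives; it is not taken from the authors' implementation.

```c
/* Sketch: Pareto-dominance test used in NSGA-II style non-dominated sorting,
 * assuming all objectives are to be minimized.  Returns 1 if `a` dominates
 * `b`: no worse in every objective and strictly better in at least one. */
#include <stddef.h>

int dominates(const double *a, const double *b, size_t nobj)
{
    int strictly_better = 0;
    for (size_t m = 0; m < nobj; ++m) {
        if (a[m] > b[m])
            return 0;            /* worse in one objective: cannot dominate */
        if (a[m] < b[m])
            strictly_better = 1;
    }
    return strictly_better;
}
```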

11.
Providing high-performance inter-node communication is a key capability for running high performance computing applications efficiently on parallel architectures. In fact, current system deployments aggregate a significant number of cores interconnected via advanced networking hardware with Remote Direct Memory Access (RDMA) mechanisms that enable zero-copy and kernel-bypass features. The use of Java for parallel programming is becoming more promising thanks to some useful characteristics of this language, particularly its built-in multithreading support, portability, ease of learning, and high productivity, along with the continuous increase in the performance of the Java virtual machine. However, current parallel Java applications generally suffer from inefficient communication middleware, mainly based on protocols with high communication overhead that do not take full advantage of RDMA-enabled networks. This paper presents efficient low-level Java communication devices that overcome these constraints by fully exploiting the underlying RDMA hardware, providing low-latency and high-bandwidth communications for parallel Java applications. The performance evaluation conducted on representative RDMA networks and parallel systems has shown significant point-to-point performance increases compared with previous Java communication middleware, yielding up to a 40% improvement in application-level performance on 4096 cores of a Cray XE6 supercomputer. Copyright © 2015 John Wiley & Sons, Ltd.

12.
Multi-agent systems have been proven very effective for the modelling and simulation (M&S) of complex systems like those related to biology, engineering, social sciences, and so forth. The intrinsic spatial character of many such systems leads to the definition of a situated agent. A situated agent owns spatial coordinates and acts and interacts with its peers in a hosting territory. In the context of parallel/distributed simulation of situated agent models, the territory represents a huge shared variable that requires careful handling. Frequent access by agents to territory information easily becomes a bottleneck degrading system performance and scalability. This paper proposes an original approach to the modelling and distributed simulation of large-scale situated multi-agent systems. Time management is exploited for resolving conflicts and achieving data consistency while accessing the environment. The approach allows a simplification of the M&S tasks by making the modeller unaware of distribution concerns while ensuring good scalability and performance during the distributed simulation. Practical aspects of the approach are demonstrated through some modelling examples based on Tileworld. Copyright © 2014 John Wiley & Sons, Ltd.

13.
Data-intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data-intensive applications on massive data sets and an execution framework for large-scale data processing on clusters of commodity servers. Fault tolerance, an easy programming structure, and high scalability are considered strong points of MapReduce; however, its configuration parameters must be fine-tuned to the specific deployment, which makes configuration and performance tuning more complex. This paper explains the tuning of Hadoop configuration parameters, which directly affect MapReduce job workflow performance under various conditions, in order to achieve maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies can affect the execution time of MapReduce running on cluster systems. Therefore, in this paper, we present a model that consists of three main modules: (1) extending a data redistribution technique in order to find the high-performance nodes, (2) utilizing the number of map/reduce slots in order to make execution more efficient in terms of execution time, and (3) developing a new hybrid routing schedule for the shuffle phase in order to define the scheduler task while the memory management level is reduced.

14.
The power consumption of modern high-performance computing (HPC) systems that are built using power hungry commodity servers is one of the major hurdles for achieving Exascale computation. Several efforts have been made by the HPC community to encourage the use of low-powered system-on-chip (SoC) embedded processors in large-scale HPC systems. These initiatives have successfully demonstrated the use of ARM SoCs in HPC systems, but there is still a need to analyze the viability of these systems for HPC platforms before a case can be made for Exascale computation. The major shortcomings of current ARM-HPC evaluations include a lack of detailed insights about performance levels on distributed multicore systems and about benchmarking performance for large-scale applications running on HPC systems. In this paper, we present a comprehensive evaluation that covers major aspects of server and HPC benchmarking for ARM-based SoCs. For the experiments, we built an unconventional cluster of ARM Cortex-A9s that is referred to as Weiser and ran single-node benchmarks (STREAM, Sysbench, and PARSEC) and multi-node scientific benchmarks (High-performance Linpack (HPL), NASA Advanced Supercomputing (NAS) Parallel Benchmark, and Gadget-2) in order to provide a baseline for performance limitations of the system. Based on the experimental results, we claim that the performance of ARM SoCs depends heavily on the memory bandwidth, network latency, application class, workload type, and support for compiler optimizations. During server-based benchmarking, we observed that when performing memory intensive benchmarks for database transactions, x86 performed 12% better for multithreaded query processing. However, ARM performed four times better in performance-to-power ratio for a single core and 2.6 times better on four cores. We noticed that emulated double precision floating point in Java resulted in three to four times slower performance than C for CPU-bound benchmarks. Even though Intel x86 performed slightly better in computation-oriented applications, ARM showed better scalability in I/O bound applications for shared memory benchmarks. We incorporated support for ARM in the MPJ-Express runtime and performed a comparative analysis of two widely used message passing libraries. We obtained similar results for network bandwidth, large-scale application scaling, floating-point performance, and energy efficiency for clusters in the message passing evaluations (NPB and Gadget-2 with MPJ-Express and MPICH). Our findings can be used to evaluate the energy efficiency of ARM-based clusters for server workloads and scientific workloads and to provide a guideline for building energy-efficient HPC clusters. Copyright © 2015 John Wiley & Sons, Ltd.

15.
Yongpeng Liu, Hong Zhu. Software, 2010, 40(11): 943-964
This paper surveys the research on power management techniques for high-performance systems. These include both commercial high-performance clusters and scientific high-performance computing (HPC) systems. Power consumption has rapidly risen to an intolerable scale. This results in both high operating costs and high failure rates, so it is now a major cause for concern. It has imposed new challenges on the development of high-performance systems. In this paper, we first review the basic mechanisms that underlie power management techniques. Then we survey two fundamental techniques for power management: metrics and profiling. After that, we review the research for the two major types of high-performance systems: commercial clusters and supercomputers. Based on this, we discuss the new opportunities and problems presented by the recent adoption of virtualization techniques, and again we present the most recent research on this. Finally, we summarize and discuss future research directions. Copyright © 2010 John Wiley & Sons, Ltd.

16.
蒋筱斌, 熊轶翔, 张珩, 武延军, 赵琛. 软件学报, 2023, 34(4): 1977-1996
As data continue to grow in scale and diversify in structure, how to use the heterogeneous co-processors interconnected by modern links to provide a real-time, reliable parallel runtime environment for large-scale data processing has become a research hotspot in the high-performance computing and database fields. Modern servers equipped with multiple co-processor (GPU) devices (multi-GPU servers) have become the preferred high-performance platform for analyzing large-scale, irregular graph data. Graph computing systems and algorithms designed for multi-GPU server architectures in existing work (such as breadth-first search and shortest-path algorithms) already significantly outperform multi-core CPU environments in overall performance. However, in such graph computing systems, the transfer of graph partitions between GPU co-processors is limited by PCI-E bus bandwidth and local latency, so adding more GPU devices cannot deliver near-linear growth in overall system performance and may even cause severe latency jitter, which no longer satisfies the high-scalability requirements of large-scale parallel graph computing systems. A series of benchmark experiments revealed two classes of deficiencies in existing systems: (1) the hardware architecture of data paths between modern GPU devices keeps evolving (e.g., NVLink-V1, NVLink-V2), with greatly improved link bandwidth and latency, yet existing systems remain limited to communicating graph partitions over the PCI-E bus and cannot fully exploit modern GPU link resources (including link topology, connectivity, and routing); (2) ...

17.
To gain a concrete understanding of the parallel computing performance of the CFD software NUMECA FINE/Turbo and to better plan subsequent research work, we study single-node versus multi-node parallel computation with hyper-threading enabled, as well as the difference in computing speed before and after hyper-threading is enabled. The results show that, for multi-node parallel computation, computing speed is proportional to the number of physical CPU cores actually participating in the computation; with hyper-threading enabled, the gain in computing speed drops off markedly once the degree of parallelism exceeds the number of physical cores.

18.
Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs. Copyright © 2014 John Wiley & Sons, Ltd.
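A tile algorithm of the kind scheduled by such a runtime decomposes the matrix into square tiles and expresses, for example, Cholesky factorization as a DAG of POTRF/TRSM/SYRK/GEMM tasks whose dependences the scheduler resolves. The loop nest below sketches that task structure with OpenMP task dependences on a single shared-memory node; the *_tile kernels are assumed wrappers around the usual LAPACK/BLAS routines, and the sketch shows only the shape of the DAG, not the authors' decentralized distributed runtime or its data-transfer protocol.

```c
/* Sketch: task structure of a right-looking tile Cholesky factorization
 * (lower-triangular).  A[i][j] points to the (i,j) tile; the *_tile
 * functions are assumed wrappers around the usual LAPACK/BLAS kernels.
 * Call from inside "#pragma omp parallel" + "#pragma omp single". */
void potrf_tile(double *akk);                              /* assumed wrappers */
void trsm_tile(const double *akk, double *aik);
void syrk_tile(const double *aik, double *aii);
void gemm_tile(const double *aik, const double *ajk, double *aij);

void tile_cholesky(double **A[], int nt)     /* nt = number of tiles per side */
{
    for (int k = 0; k < nt; ++k) {
        #pragma omp task depend(inout: A[k][k])
        potrf_tile(A[k][k]);                      /* factor the diagonal tile  */

        for (int i = k + 1; i < nt; ++i) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
            trsm_tile(A[k][k], A[i][k]);          /* triangular solve on panel */
        }
        for (int i = k + 1; i < nt; ++i) {
            #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
            syrk_tile(A[i][k], A[i][i]);          /* symmetric rank-k update   */
            for (int j = k + 1; j < i; ++j) {
                #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                gemm_tile(A[i][k], A[j][k], A[i][j]);   /* trailing update     */
            }
        }
    }
    #pragma omp taskwait
}
```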

19.
To improve the reliability and computing performance of cluster systems while reducing cost, this paper proposes combining a Single System Image (SSI) cluster with Distributed Replicated Block Device (DRBD) technology to build a high-availability cluster (the SSI-DRBD cluster). A cluster built with SSI and DRBD technology offers high performance, high reliability, strong real-time behavior, easy management, and low cost, and can serve as a platform for periodic, high-intensity, multi-source information processing.

20.
Parallel computing software on large-scale clusters must be able to tolerate the failure of some nodes, network links, and other components, and must also be easy to manage, maintain, port, and scale. Targeting the star (master-worker) computing model, we studied and developed a parallel computing framework. Using a variable-granularity decomposer and associated queues inside the scheduling node, the framework achieves system-wide fault tolerance while remaining easy to use, portable, and scalable. The system currently sustains more than 150 hours of continuous operation at a computing capacity of 300 TFlops, and it retains room for further scaling.
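A star-model framework of this kind is commonly realized as a scheduling rank that keeps work units in a queue, hands them out on demand, and re-queues any unit whose worker fails or never replies, which is what makes partial node failures survivable. The MPI skeleton below is a sketch of that master-side dispatch loop; the tags, the integer task payload, and the simplifying assumption that there are at least as many tasks as workers are illustrative, not the framework's actual protocol, and real failure detection is reduced to a comment.

```c
/* Sketch: master side of a star-model dispatch loop.  A task is considered
 * done only when its result arrives, so a task held by a failed worker can
 * simply be pushed back and reissued.  Assumes ntasks >= nworkers; worker
 * ranks 1..nworkers run a matching receive/compute/send loop (not shown). */
#include <mpi.h>

enum { TAG_WORK = 1, TAG_RESULT = 2, TAG_STOP = 3 };

void master_loop(int ntasks, int nworkers)
{
    int next = 0, done = 0;

    /* prime every worker with one task */
    for (int w = 1; w <= nworkers && next < ntasks; ++w, ++next)
        MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);

    while (done < ntasks) {
        int task_id;
        MPI_Status st;
        /* a real framework would poll with a timeout here and re-queue the
         * task of any worker declared dead instead of blocking forever */
        MPI_Recv(&task_id, 1, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                 MPI_COMM_WORLD, &st);
        ++done;

        if (next < ntasks) {                      /* keep the worker busy */
            MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
            ++next;
        } else {                                  /* no work left: release it */
            int stop = -1;
            MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
        }
    }
}
```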
