首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
网格重排序是提升流体力学CPU和GPU并行计算效率的重要手段之一。对于非结构网格,由于其数据存储无规律,数据的间接访问会导致访存延迟,尤其是在GPU并行计算时,数据的间接访问将引起内存的非对齐访问,放大了访存延迟的影响。对此,采用Reverse Cuthill-Mckee网格重排序方法优化了非结构网格的数据局部性,并设计了一种面向编号重排序方法。算例测试表明,网格重排序不影响最终计算结果。对比分析了网格重排序对非结构求解器在CPU和GPU上的性能影响:对CPU计算,可以使部分热点函数运行时间降低约20%,整体运行时间降低15%~20%;对GPU计算,大部分热点函数运行时间可降低35%~60%,程序整体运行时间降低约40%。  相似文献   

2.
非结构网格的并行多重网格解算器   总被引:2,自引:0,他引:2  
李宗哲  王正华  姚路  曹维 《软件学报》2013,24(2):391-404
多重网格方法作为非结构网格的高效解算器,其串行与并行实现在时空上都具有优良特性.以控制方程离散过程为切入点,说明非结构网格在并行数值模拟的流程,指出多重网格方法主要用于求解时间推进格式产生的大规模代数系统方程,简述了算法实现的基本结构,分析了其高效性原理;其次,综述性地概括了几何多重网格与代数多种网格研究动态,并对其并行化的热点问题进行重点论述.同时,针对非结构网格的实际应用,总结了多重网格解算器采用的光滑算子;随后列举了非结构网格应用的部分开源项目软件,并简要说明了其应用功能;最后,指出并行多重网格解算器在非结构网格应用中的若干关键问题和未来的研究方向.  相似文献   

3.
激光推进数值模拟预处理程序为数值模拟主程序提供网格信息和设置边界条件,实现并行计算可以缩短大规模网格生成时间,从而提升高分辨率数值模拟的效率。分析并利用公用数据的特点改进原有串行算法,进而实现并行计算。算法测试结果表明,该并行算法有效地缩短了网格生成时间。  相似文献   

4.
5.
机器学习领域内的多数模型均需要通过迭代计算以求解其最优参数,而MapReduce模型在迭代计算中的缺陷不足导致其在迭代计算中无法得到广泛应用。为解决上述矛盾,基于MapReduce模型提出并实现了一种可用于模型参数求解的并行迭代模型MRI。MRI模型在保持Map以及Reduce阶段的基础上,新增了Iterate阶段以及相关通信协议,实现了迭代过程中模型参数的更新、分发与迭代控制;通过对MapReduce状态机进行增强,实现了节点任务的重用,避免了迭代过程中节点任务重复创建、初始化以及回收带来的性能开销;在任务节点实现了数据缓存,保障了数据的本地性,并在Map节点增加了基于内存的块缓存机制,进一步提高训练集加载效率,以提高整体迭代效率。基于梯度下降算法的实验结果表明:MRI模型在并行迭代计算方面性能优于MapReduce模型。  相似文献   

6.
为进一步提高大规模平台上可扩展矩阵乘法的并行计算效率,提出一种并行分层可扩展矩阵乘法的递阶优化方法。首先,在可扩展矩阵乘法算法(SMM)算法枢轴行和枢轴列通信研究基础上,利用分层方式在更高等级上对网格进行矩形群划分,实现矩阵乘法的二维计算向三维计算转变,并设计对应的集群内通信和集群间通信过程,实现SMM乘法的递阶并行优化(HSMM);其次,对所提HSMM算法进行理论分析,分情况对其通信成本进行分析和预测,推导出最佳计算成本的集群数选取方式;最后,通过在Grid5000和BlueGene/P测试平台实验,验证了所提算法有效性和理论分析的正确性。  相似文献   

7.
GATE/Geant4 Monte Carlo simulations are computationally demanding applications, requiring thousands of processor hours to produce realistic results. The classical strategy of distributing the simulation of individual events does not apply efficiently for Positron Emission Tomography (PET) experiments, because it requires a centralized coincidence processing and large communication overheads. We propose a parallel computational model for GATE that handles event generation and coincidence processing in a simple and efficient way by decentralizing event generation and processing but maintaining a centralized event and time coordinator. The model is implemented with the inclusion of a new set of factory classes that can run the same executable in sequential or parallel mode. A Mann–Whitney test shows that the output produced by this parallel model in terms of number of tallies is equivalent (but not equal) to its sequential counterpart. Computational performance evaluation shows that the software is scalable and well balanced.  相似文献   

8.
基于多区结构网格的计算流体力学方法,在并行处理的难点是多个网格数据块在计算资源上的高效合理分配,以实现大规模并行环境下的负载平衡。本文围绕负载平衡问题,介绍了 CCFD 软件开展的一些工作,包括:1. 面向结构网格的双层图剖分策略,通过细层图剖分环节考虑计算量和通信量的负载平衡;2. 建立可细分的重叠网格体系,并基于该体系建立了重叠网格系统的双级负载平衡模型。算例验证表明,所采用的负载平衡策略在大规模并行环境下能获得较高并行效率。  相似文献   

9.
根据基于PIM(Processor-In-Memory)技术的数据并行计算机体系结构的特点和面向多媒体计算的应用需求,提出了面向嵌入式SIMD(Single Instruction Multiple Data)计算的数据并行语言PIMC。简单讨论了PIMC语言的形式化定义,并以数据并行图像处理的均值滤波算法为例对语言的使用作了说明。结合其他大量的数据并行编程实例,说明了该语言能够在基于PIM技术的SIMD并行计算机上正确描述基本多媒体处理算法的数据并行实现。  相似文献   

10.
A novel multi-block compact-TVD finite difference method for the simulation of compressible flows is presented. The method combines distributed and shared-memory paradigms to take advantage of the configuration of modern supercomputers that host many cores per shared-memory node. In our approach a domain decomposition technique is applied to a compact scheme using explicit flux formulas at block interfaces. This method offers great improvement in performance over earlier parallel compact methods that rely on the parallel solution of a linear system. A test case is presented to assess the accuracy and parallel performance of the new method.  相似文献   

11.
李睿阳  毛国勇  张武 《计算机工程与设计》2007,28(19):4655-4657,4673
以ANSYS和FLUENT为例,分析了商业软件在工作站机群并行运行的优势.将并行运行与网格计算相结合,提出了两者结合的软件结构和硬件结构,实现了并行计算资源的Web发布,从而提高商业软件和高性能计算资源的利用率,为大规模科学工程计算提供了良好的运算平台.同时平台实现用户认证,过载保护和实时监控功能.  相似文献   

12.
We present a scalable framework for parallelizing greedy graph coloring algorithms on distributed-memory computers. The framework unifies several existing algorithms and blends a variety of techniques for creating or facilitating concurrency. The latter techniques include exploiting features of the initial data distribution, the use of speculative coloring and randomization, and a BSP-style organization of computation and communication. We experimentally study the performance of several specialized algorithms designed using the framework and implemented using MPI. The experiments are conducted on two different platforms and the test cases include large-size synthetic graphs as well as real graphs drawn from various application areas. Computational results show that implementations that yield good speedup while at the same time using about the same number of colors as a sequential greedy algorithm can be achieved by setting parameters of the framework in accordance with the size and structure of the graph being colored. Our implementation is freely available as part of the Zoltan parallel data management and load-balancing library.  相似文献   

13.
The problem of maximal clique enumeration (MCE) is to enumerate all of the maximal cliques in a graph. Once enumerated, maximal cliques are widely used to solve problems in areas such as 3-D protein structure alignment, genome mapping, gene expression analysis, and detection of social hierarchies. Even the most efficient serial MCE algorithms require large amounts of time to enumerate the maximal cliques in networks arising from these problems that contain hundreds, thousands, or larger numbers of vertices. The previous attempts to provide practical solutions to the MCE problem through parallel implementation have had limited success, largely due to a number of challenges inherent to the nature of the MCE combinatorial search space. On the one hand, MCE algorithms often create a backtracking search tree that has a highly irregular and hard-or-impossible to predict structure; therefore, almost any static decomposition of the search tree by parallel processors results in highly unbalanced processor execution times. On the other hand, the data-intensive nature of the MCE problem often makes naive dynamic load distribution strategies that require extensive data movement prohibitively expensive. As a result, good scaling of the overall execution time of parallel MCE algorithms has been reported for only up to a couple hundred processors. In this paper, we propose a parallel, scalable, and memory-efficient MCE algorithm for distributed and/or shared memory high performance computing architectures, whose runtime scales linearly for thousands of processors on real-world application graphs with hundreds and thousands of nodes. Its scalability and efficiency are attributed to the proposed: (a) representation of the search tree decomposition to enable parallelization; (b) parallel depth-first backtracking search to both constrain the search space and minimize memory requirement; (c) least stringent synchronization to minimize data movement; and (d) on-demand work stealing intelligently coupled with work stack splitting to minimize computing elements’ idle time. To the best of our knowledge, the proposed parallel MCE algorithm is the first to achieve a linear scaling runtime using up to 2048 processors on Cray XT machines for a number of real-world biological networks.  相似文献   

14.
杨丽鹏  车永刚 《计算机应用》2013,33(9):2423-2427
大规模计算流体动力学(CFD)计算对数据I/O能力提出了很高需求。层次式文件格式(HDF5)可有效管理大规模科学数据,并对并行I/O具有良好的支持。针对结构网格CFD并行程序,设计了其数据文件的HDF5存储模式,并基于HDF5并行I/O编程接口实现了其数据文件的并行I/O,在并行计算机系统上进行了性能测试与分析。结果表明,在使用4~32个进程时,基于HDF5并行I/O方式的写文件性能比每进程独立写普通文件的方式高6.9~16.1倍;基于HDF5并行I/O方式的读文件性能不及后者,为后者的20%~70%,但是读文件的时间开销远小于写文件的时间开销,因此对总体性能的影响较小。  相似文献   

15.
A scalable parallel algorithm has been designed to perform multimillion-atom molecular dynamics (MD) simulations, in which first principles-based reactive force fields (ReaxFF) describe chemical reactions. Environment-dependent bond orders associated with atomic pairs and their derivatives are reused extensively with the aid of linked-list cells to minimize the computation associated with atomic n-tuple interactions (n?4 explicitly and ?6 due to chain-rule differentiation). These n-tuple computations are made modular, so that they can be reconfigured effectively with a multiple time-step integrator to further reduce the computation time. Atomic charges are updated dynamically with an electronegativity equalization method, by iteratively minimizing the electrostatic energy with the charge-neutrality constraint. The ReaxFF-MD simulation algorithm has been implemented on parallel computers based on a spatial decomposition scheme combined with distributed n-tuple data structures. The measured parallel efficiency of the parallel ReaxFF-MD algorithm is 0.998 on 131,072 IBM BlueGene/L processors for a 1.01 billion-atom RDX system.  相似文献   

16.
基于二维/轴对称高精度可压缩多相流计算流体力学方法 MuSiC-CCASSIM的结构化网格部分,设计了区域并行分解方法;针对各处理器边界数据的通信,设计了阻塞式通信与非阻塞式通信并行算法;为了减少通信开销,设计了MPI/OpenMP混合并行优化算法。在天河二号超级计算机上进行了测试,每个核固定网格规模为625*250,最多调用8 192核。测试数据表明,采用MPI/OpenMP混合并行算法、纯MPI非阻塞式通信并行算法和纯MPI阻塞式通信并行算法的程序的平均并行效率分别达到86%、83%和77%,三种算法都具有良好的可扩展性。  相似文献   

17.
Finding similar items in a large and unstructured dataset is a challenging task in many applications of data science, such as searching, indexing, and retrieval. With the increasing data volume and demand for real time responses, similarity search has gained much consideration. In this paper, a parallel computational approach for similarity search using Bloom filters (PCASSB) has been proposed, which uses Bloom filter for the representation of features of document and comparison with user's query. Query features are stored in integer query array (IQA), an array of integer. The PCASSB, an approximate similarity search technique, has been implemented on graphics processing unit with compute unified device architecture as the programming platform. To compute the similarity score between query and reference dataset, Dice coefficient has been used as a baseline method. The accuracy of the results generated by PCASSB is compared with the baseline method and other state‐of‐the‐art methods. The experimental results show that the proposed technique is quite effective in processing large number of text documents as it takes less computational time.  相似文献   

18.
在多区结构网格计算流体动力学(computational fluid dynamics,CFD)并行模拟中,为了与并行计算资源相适应,经常需要对原始流场网格进行二次剖分与区块分组.在对区块分组和网格二次剖分进行了总结综述的基础上,重点提出针对多区结构网格二次剖分的两种策略:几何剖分和嵌套二分.基于这两种策略完成了剖分软件工具TH-MeshSplit,可实现初级方式、专业方式和专家方式3种运行方式,为用户在自动化与灵活性方面提供了多样化选择.数值实验结果表明,两种剖分策略及其实现软件可在较短时间内完成复杂的剖分,剖分后的网格在负载平衡性、计算通信比等方面具有更优的性能,从而为后续CFD流场的高效并行加速求解奠定了基础.  相似文献   

19.
We discuss the effective implementation of parallel processing for linear prediction-based uniform state sampling (LPUSS). In previous work, we proposed LPUSS as an optimization algorithm for mechanical motions that assures high optimality of the solutions and computational efficiency. In parallel computation, LPUSS requires balanced memory allocation and managed processing timing. In this paper, we propose an effective parallel computing method that assures high optimality and calculation efficiency in parallel processing using GPU processor. We conducted two experiments to validate the proposed method. In the first experiment, we compared single-thread processing for LPUSS and the proposed parallel processing. As a result of this experiment, calculation speed of LPUSS was about 4–20 times faster than that with single-thread CPU. In the second experiment, we applied the proposed method to the optimization of sixtuple inverted pendulum. As a result, the proposed method optimized the motion within 40 minutes. According to our survey, there is no other optimization method that is applicable to higher than quadruple inverted pendulum models with standard constraints.  相似文献   

20.
石油作为一种主要能源,在交通、工业生产以及日常生活中发挥着重要作用。大量的新技术已经被开发并用于最大化石油生产,比如聚合物驱等技术在我国得到了广泛的应用。天然裂缝油藏聚合物驱模拟对油田的可持续生产和延长油田寿命至关重要。开发了一种可扩展的并行油藏模拟器,用于使用聚合物驱技术模拟天然裂缝性油藏的石油开采,使油藏工程人员能够利用强大的并行计算机研究生产技术,优化采油过程。通过与现有的商用软件对比,数值结果验证了该模拟器的正确性和有效性。此外,数值实验也证明了该模拟器拥有良好的扩展性,它可以使用成千上万的CPU核来计算具有数亿个网格块的大规模油藏模型。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号