Similar literature
Found 20 similar documents (search time: 93 ms)
1.
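Hybrid MPI+OpenMP parallel computation of contact problems   Total citations: 1, self-citations: 0, citations by others: 1
Contact computations require a large amount of global communication. To address this on today's widely used multiprocessor cluster systems, a hybrid MPI+OpenMP parallel model was adopted to parallelize the computation of contact problems. Building on a dual domain decomposition parallel algorithm, the internal-force computation is parallelized with MPI and additionally uses block-structured OpenMP parallel programming, so that the global communication time involved in the parallel contact-force computation does not have to increase, which further improves parallel efficiency. Numerical experiments show that this approach can solve contact problems with tens of millions of degrees of freedom in parallel on more than one hundred processors.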

2.
A parallel adaptive dynamic relaxation (ADR) algorithm has been developed for non-linear structural analysis. This algorithm has minimal memory requirements, is easily parallelizable and scalable to many processors, and is generally very reliable and efficient for highly non-linear problems. Performance evaluations on single-processor computers have shown that the ADR algorithm is reliable and highly vectorizable, and that it is competitive with direct solution methods for the highly non-linear problems considered. The present algorithm is implemented on the 512-processor Intel Touchstone DELTA system at Caltech, and it is designed to minimize the extent and frequency of interprocessor communication. The algorithm has been used to solve for the non-linear static response of two- and three-dimensional hyperelastic systems involving contact. Impressive relative speed-ups have been achieved and demonstrate the high scalability of the ADR algorithm. For the class of problems addressed, the ADR algorithm represents a very promising approach for parallel-vector processing.

3.
A parallel quaternion-based dissipative particle dynamics (QDPD) program has been developed in Fortran to study the flow properties of complex fluids subject to shear. The parallelization allows for simulations of greater size and complexity and is accomplished with a parallel link-cell spatial (domain) decomposition using MPI. The technique has novel features arising from the DPD formalism, the use of rigid body inclusions spread across processors, and a sheared boundary condition. A detailed discussion of our implementation is presented, along with results on two distributed memory architectures. A parallel speedup of 24.19 was obtained for a benchmark calculation on 27 processors of a distributed memory cluster.
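The link-cell idea mentioned in this abstract can be illustrated with a minimal serial sketch. It is not the QDPD implementation: it assumes a 2D, non-periodic box with no shear, and the names `build_cells` and `neighbor_pairs` are hypothetical. Particles are binned into square cells whose edge equals the interaction cutoff, so each particle only has to be compared against particles in its own cell and the eight adjacent cells.

```python
from collections import defaultdict
from itertools import product


def build_cells(positions, cutoff):
    """Bin particle indices into square cells of edge length `cutoff`."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        cells[(int(x // cutoff), int(y // cutoff))].append(idx)
    return cells


def neighbor_pairs(positions, cutoff):
    """Return all index pairs (i, j), i < j, closer than `cutoff`.

    Only the 3x3 block of cells around each occupied cell is searched,
    which is the core of the link-cell technique (2D, non-periodic here).
    """
    cells = build_cells(positions, cutoff)
    pairs = set()
    for (cx, cy), members in cells.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            # .get() avoids creating empty cells while iterating.
            for j in cells.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j:
                        xi, yi = positions[i]
                        xj, yj = positions[j]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 <= cutoff ** 2:
                            pairs.add((i, j))
    return pairs
```

In a spatial (domain) decomposition, each processor would presumably own a block of such cells and exchange only the boundary cells with its neighbours; that communication step is omitted from this sketch.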

4.
To efficiently execute a finite element program on a hypercube, we need to map nodes of the corresponding finite element graph to processors of a hypercube such that each processor has approximately the same amount of computational load and the communication among processors is minimized. If the number of nodes of a finite element graph will not be increased during the execution of a program, the mapping only needs to be performed once. However, if a finite element graph is solution‐adaptive, that is, the number of nodes will be increased discretely due to the refinement of some finite elements during the execution of a program, a run‐time load balancing algorithm has to be performed many times in order to balance the computational load of processors while keeping the communication cost as low as possible. In this paper, we propose a parallel iterative load balancing algorithm (ILB) to deal with the load imbalance problem of a solution‐adaptive finite element program. The proposed algorithm has three properties. First, the algorithm is simple and easy to implement. Second, the execution of the algorithm is fast. Third, it guarantees that the computational load will be balanced after the execution of the algorithm. We have implemented the proposed algorithm along with two parallel mapping algorithms, parallel orthogonal recursive bisection (ORB) [19] and parallel recursive mincut bipartitioning (MC) [8], on a 16‐node NCUBE‐2. Three criteria are used for performance evaluation: the execution time of the load balancing algorithms, the computation time of an application program under different load balancing algorithms, and the total execution time of an application program (under several refinement phases). Experimental results show that (1) the execution time of ILB is very short compared to those of MC and ORB; (2) the mappings produced by ILB are better than those of ORB and MC; and (3) the speedups produced by ILB are better than those of ORB and MC.

5.
The parallel implementation of the element free Galerkin (EFG) method for heat transfer and fluid flow problems on a MIMD-type parallel computer is treated. A new parallel algorithm is proposed in which parallelization is performed by row-wise data distribution among the processors. The codes have been developed in FORTRAN using the MPI message passing library. Two model problems (one each in heat transfer and fluid flow) have been solved to validate the proposed algorithm. The total time, communication time, user time, speedup and efficiency have been estimated for heat transfer and fluid flow problems. For eight processors, the speedup and efficiency are 7.11 and 88.87%, respectively, in heat transfer problems for a data size of N=2,116, and 7.20 and 90.04%, respectively, in fluid flow problems for a data size of N=2,378.
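The speedup and efficiency figures quoted in this abstract are related by the standard definitions S = T_serial / T_parallel and E = S / p. The following minimal sketch (function names are hypothetical, not taken from the paper) recomputes the efficiency from the rounded speedups; the results agree with the reported 88.87% and 90.04% only to within rounding, presumably because the published speedups are themselves rounded.

```python
def speedup(serial_time, parallel_time):
    """Speedup S = T_serial / T_parallel."""
    if parallel_time <= 0:
        raise ValueError("parallel_time must be positive")
    return serial_time / parallel_time


def parallel_efficiency(speedup_value, processors):
    """Parallel efficiency E = S / p, returned as a fraction in [0, 1]."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup_value / processors


if __name__ == "__main__":
    # Rounded speedups quoted in the abstract for eight processors.
    for label, s in (("heat transfer", 7.11), ("fluid flow", 7.20)):
        print(f"{label}: efficiency = {parallel_efficiency(s, 8):.4f}")
```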

6.
Parallelized FVM algorithm for three-dimensional viscoelastic flows   Total citations: 1, self-citations: 0, citations by others: 1
A parallel implementation of the finite volume method (FVM) for three-dimensional (3D) viscoelastic flows is developed on a distributed computing environment through Parallel Virtual Machine (PVM). The numerical procedure is based on the SIMPLEST algorithm using a staggered FVM discretization in Cartesian coordinates. The final discretized algebraic equations are solved with the TDMA method. The parallelisation of the program is implemented by a domain decomposition strategy, with a master/slave style programming paradigm and message passing through PVM. A load balancing strategy is proposed to reduce the communications between processors. The three-dimensional viscoelastic flow in a rectangular duct is computed with this program. The modified Phan-Thien–Tanner (MPTT) constitutive model is employed for the equation system closure. Computing results are validated on the secondary flow problem due to the non-zero second normal stress difference N2. Three sets of meshes are used, and the effect of domain decomposition strategies on the performance is discussed. It is found that parallel efficiency is strongly dependent on the grid size and the number of processors for a given block number. The convergence rate as well as the total efficiency of domain decomposition depends upon the flow problem and the boundary conditions. The parallel efficiency increases with increasing problem size for a given block number. Compared with two-dimensional flow problems, the 3D parallelized algorithm has a lower efficiency owing to largely overlapped block interfaces, but the parallel algorithm is indeed a powerful means for large scale flow simulations. Received: 2 July 2002 / Accepted: 15 November 2002. This research is supported by an ASTAR Grant EMT/00/011.

7.
On the basis of the Monte Carlo method a theoretical model of charge transfer processes in the multilayer nc-Si/CaF2 structure has been developed. The constructed self-consistent model has made it possible to investigate the influence of the injection rate of charge carriers and the potential barrier height of a dielectric on the volt-ampere characteristic of the structure. The dependence of the rate of injection from a contact on the applied external voltage has been calculated. The main problem of the theoretical model (long computation time) has been solved by organizing parallel calculations in the developed code SIMPS. The implementation of the SIMPS code, written in Fortran-95, on a distributed-memory computer cluster for parallel calculations is presented. The results of the calculations demonstrate an increase in the calculation rate with increasing number of processors.

8.
An efficient, scalable, parallel algorithm for treating material surface contacts in solid mechanics finite element programs has been implemented in a modular way for multiple-instruction, multiple-data (MIMD) parallel computers. The serial contact detection algorithm that was developed previously for the transient dynamics finite element code PRONTO3D has been extended for use in parallel computation by utilizing a dynamic (adaptive) load balancing algorithm. This approach is scalable to thousands of computational nodes.

9.
A contact algorithm has been developed and implemented in a non-linear dynamic explicit finite element program to analyse the response of three-dimensional shell structures. The contact search algorithm accounts for initial contact, sliding, and release through the use of a parametric representation of the motion of points located on the surface of the structure combined with a contact surface representation which approximates the actual surface by means of triangular search planes. The mechanics of contact is handled by taking advantage of the fact that an explicit time integration scheme results in very small displacements during a time step. The amount of overlap of the discrete representation of the surfaces which occurs at contact is taken as a measure of the approach of the surfaces. Hence, experimental data which relates approach to normal contact pressure can be used to determine the contact pressure applied to the finite element model of the surface as contact evolves. The friction model also incorporates experimental data on the dependence of the coefficient of friction on both the relative sliding velocity and on the relative tangential displacement between surfaces in contact observed in friction tests. The parallel implementation of this contact algorithm and its performance on a 128-processor distributed-memory multiprocessor computer is discussed in Part II of this paper.

10.
The existing global–local multiscale computational methods, using finite element discretization at both the macro‐scale and micro‐scale, are intensive in terms of both computational time and memory requirements, and their parallelization using domain decomposition methods incurs substantial communication overhead, limiting their application. We are interested in a class of explicit global–local multiscale methods whose architecture significantly reduces this communication overhead on massively parallel machines. However, a naïve task decomposition based on distributing individual macro‐scale integration points to a single group of processors is not optimal and leads to communication overheads and idling of processors. To overcome this problem, we have developed a novel coarse‐grained parallel algorithm in which groups of macro‐scale integration points are distributed to a layer of processors. Each processor in this layer communicates locally with a group of processors that are responsible for the micro‐scale computations. The overlapping groups of processors are shown to achieve optimal concurrency at significantly reduced communication overhead. Several example problems are presented to demonstrate the efficiency of the proposed algorithm. Copyright © 2009 John Wiley & Sons, Ltd.

11.
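Mixed-integration parallel algorithm for impact dynamics problems and its application   Total citations: 1, self-citations: 0, citations by others: 1
To improve the computational efficiency and speed of impact dynamics simulations, an explicit-integration parallel algorithm with mixed time steps was constructed for a distributed MIMD parallel environment. Based on the domain decomposition method, the algorithm partitions the model into subdomains according to the size of the element time-integration step, assigns subdomains with different time steps to the nodes of a networked cluster, uses subcycling to synchronize the computations of the subdomains, and exchanges information between subdomains through the message-passing software PVM. Engineering examples show that the mixed-integration parallel algorithm with subcycling significantly improves computational efficiency and parallel speedup and shortens the computation time.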

12.
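The discrete element method (DEM) for bar systems is an effective method for solving strongly nonlinear structural problems, but as the scale of structural numerical computation grows, the computation time it requires increases sharply. To improve the computational efficiency of the bar-system DEM, this study proposes element-level and node-level parallel computing methods, builds a parallel computing framework for the bar-system DEM on a CPU-GPU heterogeneous platform, and develops the corresponding ...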

13.
Anjan Bose, Sadhana, 1993, 18(5): 815-841
The dynamic behaviour of a large interconnected electric power system is characterized by a simultaneous set of nonlinear algebraic and ordinary differential equations. The solution is obtained by numerical methods and the simulation of the transient behaviour for a few seconds after a fault is the standard analytical procedure used in planning and operational studies of the system. The need for on-line simulation in near real time for more efficient operation has encouraged the search for faster solution methods and the use of parallel computers for this purpose has attracted the attention of many researchers. The success of parallelization depends on three factors: the problem structure, the computer architecture, and the algorithm that takes maximum advantage of both. In this problem, the generator equations are only coupled through the electrical network providing some parallelization in (variable) space, and a solution is needed at each time step leading to some parallelization in time (waveform relaxation). However, since the problem formulation is not completely decoupled, parallel algorithms can only be developed by trading off any relaxation with a degradation in convergence. The fastest sequential algorithm used today is the combination of implicit trapezoidal integration with a dishonest Newton solution. The Newton algorithm is not parallel at all but has the fastest convergence while a Gauss-Jacobi algorithm is completely parallel but converges very slowly. A relaxation of the Newton algorithm appears to be a good compromise. As for the parallel hardware, the coupling seems to require significant communication between processors thus favouring a data-sharing architecture over a message-passing hypercube. Special architectures to match the problem structure have also been an area of investigation. This paper elaborates on the above issues and assesses the present state-of-the-art.
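The remark that a Gauss-Jacobi algorithm is "completely parallel" can be seen in a small sketch. It is an illustration only: it uses a tiny linear system as a stand-in for the nonlinear power system equations, and the function name `jacobi` is hypothetical. Each component of the new iterate is computed solely from the previous iterate, so all components could be updated concurrently; the price is that many sweeps may be needed before the iteration converges.

```python
def jacobi(a, b, iterations=50):
    """Run a fixed number of Gauss-Jacobi sweeps on the system a x = b.

    `a` is a square matrix given as a list of rows with non-zero diagonal;
    convergence is only assured for suitable matrices (for example,
    strictly diagonally dominant ones).
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # Every entry of the new vector depends only on the old vector x,
        # so the n updates are independent and could run in parallel.
        x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
    return x


if __name__ == "__main__":
    # Diagonally dominant example: 4x + y = 6, 2x + 5y = 12.
    print(jacobi([[4.0, 1.0], [2.0, 5.0]], [6.0, 12.0]))
```

A Newton iteration, by contrast, solves one coupled linear system per step, which is where the trade-off between convergence speed and parallelism described in the abstract arises.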

14.
New parallel software for the analysis of coupled heat and moisture transfer in unsaturated soil is developed. The model, written in a two-dimensional polar co-ordinate formulation, is based on a finite difference self-implicit method. The code is programmed in FORTRAN with the message passing library PARMACS and executed on a ‘Paramid’ parallel supercomputer. The validity of the parallel code is examined by comparing simulation results with experimentally measured values obtained from a laboratory heating experiment. The algorithm's performance on a large network of processors is also assessed. It was found that the simulation results compared very well with the experimental measurements, and that the algorithm was highly efficient. The new parallel code was also found to be more efficient when dealing with larger problems requiring more finite difference nodal points, on a larger network of processors.

15.
Reducing work-in-process (WIP) inventory continues to be an important business need because of several factors, including the need to reduce working capital. Numerous techniques have been suggested for WIP reduction, and CONWIP is a competitive algorithm for WIP reduction. Prior CONWIP algorithms have been primarily sequential and can potentially incur significant computing time, especially when dealing with inventories for multiple products. The paper proposes a card-setting algorithm for multiple product types subject to routing and throughput requirements. The proposed algorithm searches the WIP space iteratively and the step size is adaptively selected based on the known properties of multi-chain, multi-class, closed queuing networks. Furthermore, parallelization of this search algorithm across multiple processors is proposed where each processor searches a different segment of the WIP space while adaptively adjusting its step size for all product types to ensure fast convergence. The proposed parallel algorithm can take advantage of distributed computing architectures to speed up the overall computation. An experimental implementation of the parallel algorithm using Message Passing Interface (MPI) over a high-speed network is described. Computational results demonstrate that the proposed parallel algorithm can be parallelized over eight to ten processors to obtain a speed-up of three to five.

16.
Traditionally, to program a distributed memory multiprocessor, a programmer is responsible for partitioning an application program into modules or tasks, scheduling tasks on processors, inserting communication primitives, and generating parallel codes for each processor manually. As both the number of processors and the complexity of problems to be solved increase, programming distributed memory multiprocessors becomes difficult and error‐prone. In a distributed memory multiprocessor, program partitioning and scheduling play an important role in the performance of a parallel program. However, finding the program partitioning and scheduling that give the best performance of a parallel program on a distributed memory multiprocessor is not an easy task. In this paper, we present a parallel programming tool, PPT, to help programmers find the best program partitioning and scheduling and automatically generate the parallel code for the single program multiple data (SPMD) model on a distributed memory multiprocessor. An example of designing a parallel FFT program by using PPT on an NCUBE‐2 is also presented.

17.
A parallel implementation of a finite volume method for the solution of the Navier-Stokes equations on a distributed computing environment through Parallel Virtual Machine (PVM) is reported. The numerical method is implicit and is based on the SIMPLE algorithm in which the system of equations is discretised using a hybrid scheme. An Alternating Direction Implicit (ADI) scheme and the Thomas tri-diagonal solver are used to solve the algebraic equations. The parallelization of the program is implemented by a domain decomposition strategy on MIMD parallel architectures using the PVM platform. The program was tested for laminar flow in a cavity. The parallelisation strategy and performance are discussed. It is concluded that the efficiency is strongly dependent on the grid size, block numbers and the number of processors. Different strategies to improve the computational efficiency are proposed.

18.
This study describes a newly developed parallel algorithm for phylogenetic analysis of DNA sequences. The newly designed D‐Phylo is a more advanced algorithm for phylogenetic analysis using a maximum likelihood approach. D‐Phylo exploits the search capability of k‐means while avoiding its main limitation of getting stuck at locally conserved motifs. The authors have tested the behaviour of D‐Phylo on an Amazon Linux Amazon Machine Image (Hardware Virtual Machine) i2.4xlarge instance (six central processing units, 122 GiB memory, 8 × 800 solid‐state drive Elastic Block Store volumes, high network performance) with up to 15 processors for several real‐life datasets. Distributing the clusters evenly across all the processors makes it possible to achieve near‐linear speedup when a large number of processors is used.
Inspec keywords: parallel algorithms, Linux, pattern clustering, DNA, molecular biophysics, genetics, biology computing
Other keywords: D‐Phylo algorithm parallel implementation, maximum likelihood clusters, DNA sequence phylogenetic analysis, Amazon Linux AMI, HVM, central processing unit, SSD, real‐life datasets, processors, high‐network performance

19.
Frequent pattern mining is the most important phase of the association rule mining process because of its time and space complexity. Several methods have attempted to improve the performance of association rule mining by enhancing frequent pattern mining efficiency. Because of the large size of the data sets and the huge amounts of data to be mined, many parallel and distributed mining approaches have been introduced to divide data sets or to distribute mining processes between multiple processors or computers and thus improve the efficiency of the mining process. In this paper, we propose a Hadoop-based parallel implementation of the PrePost+ algorithm for frequent itemset mining. In our parallel approach, the construction of the N-Lists of itemsets is distributed between the mappers, and the final pruning and extraction of frequent itemsets is carried out by the reducers in a map-reduce parallel programming model. The experimental results show that our Hadoop-based PrePost+ (HBPrePost+) algorithm outperforms one of the best existing parallel methods of frequent itemset mining (PARMA) in terms of execution time.

20.
This paper presents a nonlinear implicit transient formulation for parallel contact analysis. A new lumping algorithm based on the penalty method is developed to enforce contact constraints. This algorithm has the effect of incorporating contact generated connectivity and eliminating ill conditioning in the system stiffness matrix caused by the use of a large penalty parameter. Communication schemes are also developed to facilitate contact searching and enforcement in a parallel environment. Numerical analyses were conducted to verify the accuracy of the proposed algorithm. In addition, several contact simulations are also performed to study the performance of this new algorithm under different circumstances. Efficiency plots are also presented for one of the simulations to evaluate the performance of this parallel implicit contact formulation. This paper is based upon work supported by, or in part by, the U.S. Army Research Office under grant number DAAD19-99-1-0235. The authors also acknowledge support under a DoD HPC Challenge Project.
