Similar literature
Found 20 similar documents; search took 15 ms.
1.
《Computers & Structures》1987,26(4):551-559
The development of general-purpose finite element computer software systems has provided the capability to analyze a wide range of linear and non-linear structural problems. However, these software systems are severely limited for non-linear response calculations because of the available speed on current sequential computers. Recent and projected advances in parallel multiple instruction multiple data (MIMD) computers provide an opportunity for significant gains in computing speed and for broadening the range of structural problems which may be solved. The key to these gains is the effective selection and implementation of algorithms which exploit parallel computing. This paper documents experience with transient response calculations on an experimental MIMD computer, termed the Finite Element Machine. The paper describes the algorithm used, its implementation for parallel computations, and results for representative one- and two-dimensional dynamic response test problems. The results show computation speedups of up to 7.83 for eight processors, and indicate that significant speedups of solution time are possible for non-linear dynamic response calculations through the use of many processors and appropriate parallel integration algorithms. The results are extremely encouraging and suggest that significant speedups in structural computations can be achieved through advances in parallel computers.

2.
Sparse QR factorization on a massively parallel computer (cited by 1: 0 self-citations, 1 by others)
This paper shows that QR factorization of large, sparse matrices can be performed efficiently on massively parallel SIMD (single instruction stream/multiple data stream) computers such as the Connection Machine CM-2. The problem is cast as a dataflow graph, whose nodes are mapped to a virtual dataflow machine in such a way that only nearest-neighbor communication is required. This virtual machine is implemented by programming the CM-2 processors to support a restricted dataflow protocol. Execution results for several test matrices show that good performance can be obtained without relying on nested dissection techniques.

3.
Implementation of GAMMA on a Massively Parallel Computer (cited by 1: 0 self-citations, 1 by others)
The GAMMA paradigm was recently proposed by Banatre and Metayer to describe the systematic construction of parallel programs without introducing artificial sequentiality. This paper presents two synchronous execution models for GAMMA and discusses how to implement them on the MasPar MP-1, a massively data parallel computer. The results show that the GAMMA paradigm can be implemented very naturally on data parallel machines, and that a very high level language such as GAMMA, in which parallelism is left implicit, is suitable for specifying massively parallel applications.

4.
5.
《Parallel Computing》1997,23(9):1365-1377
A finite element fluid analysis code, which is based on the matrix-storage free formulation and the element-by-element computation strategy, is developed. The code has reduced memory requirements due to the matrix-storage free formulation. Simulations involving one million elements can be carried out with less than 208 Mbytes of memory. The code is implemented on the massively parallel computers, KSR1 and CRAY T3D. In the case of KSR1, high parallel efficiency is achieved, i.e. 95.9% with 16 CPUs. In the case of T3D, excellent scalability is achieved. Each time step of a 3D cavity flow problem with one million elements required 36.3, 18.7 and 9.8 s of CPU time by using 32, 64 and 128 processors, respectively.
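The T3D timings quoted in this abstract can be turned into relative speedup and efficiency figures. The following is a minimal sketch, assuming the 32-processor run is taken as the baseline (the abstract gives no single-processor time); the function name `relative_scaling` is hypothetical and not from the paper.

```python
def relative_scaling(timings):
    """Return {processors: (speedup, efficiency)} relative to the smallest run.

    timings maps a processor count to the time per step in seconds.
    Speedup is baseline_time / time; efficiency divides that speedup by
    the factor by which the processor count grew over the baseline.
    """
    base_p = min(timings)
    base_t = timings[base_p]
    result = {}
    for p in sorted(timings):
        speedup = base_t / timings[p]
        efficiency = speedup / (p / base_p)
        result[p] = (speedup, efficiency)
    return result


# Per-time-step CPU times reported in the abstract for the CRAY T3D.
t3d = {32: 36.3, 64: 18.7, 128: 9.8}
for p, (s, e) in relative_scaling(t3d).items():
    print(f"{p} processors: speedup {s:.2f}x, efficiency {e:.1%}")
```

Under this assumption, doubling from 32 to 64 processors retains roughly 97% efficiency, and quadrupling to 128 retains roughly 93%, which is consistent with the scalability claim.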

6.
This paper describes a parallel solver of tridiagonal systems appropriate for distributed memory computers and implemented on an array of chain-connected T800 Transputers. Each processor in the chain uses the same program to solve its own subset of equations. This implementation is suited, for instance, for solving the heat conduction equation in one-dimensional hydrodynamic codes. The procedure performs a parallel cyclic reduction, a recursive Gaussian elimination on a reduced number of equations and a parallel backward unfolding scheme, with a direct substitution in the reduced equations. The code has been written in the Occam2 language. A one-way communication of values between adjacent processors is required at each cycle of both the reduction and the unfolding steps. Due to the number of floating point operations and the amount of communications, the implementation described here works efficiently on arrays with more than 4 processors and for more than 50 equations per processor.
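To make the reduction and unfolding steps concrete, here is a minimal serial sketch of odd-even cyclic reduction for a tridiagonal system. It is not the paper's Occam2 code and does not reproduce its hybrid scheme with Gaussian elimination on the reduced equations; the function name and the convention `a[0] == 0`, `c[-1] == 0` are assumptions made for illustration. Within one level, every even-row update is independent of the others, which is the property a parallel implementation would exploit.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for x.

    a, b, c are the sub-, main and super-diagonals (a[0] and c[-1] must be 0).
    Assumes no zero pivots arise, e.g. a diagonally dominant system.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Reduction: eliminate the odd-indexed unknowns from each even equation.
    ra, rb, rc, rd = [], [], [], []
    for i in range(0, n, 2):
        new_a, new_b, new_c, new_d = 0.0, b[i], 0.0, d[i]
        if i > 0:                      # use equation i-1 to remove x[i-1]
            alpha = a[i] / b[i - 1]
            new_a = -alpha * a[i - 1]
            new_b -= alpha * c[i - 1]
            new_d -= alpha * d[i - 1]
        if i + 1 < n:                  # use equation i+1 to remove x[i+1]
            gamma = c[i] / b[i + 1]
            new_c = -gamma * c[i + 1]
            new_b -= gamma * a[i + 1]
            new_d -= gamma * d[i + 1]
        ra.append(new_a)
        rb.append(new_b)
        rc.append(new_c)
        rd.append(new_d)

    # The even unknowns satisfy a half-size tridiagonal system.
    x = [0.0] * n
    x[0::2] = cyclic_reduction(ra, rb, rc, rd)

    # Unfolding: recover each odd unknown from its own original equation.
    for i in range(1, n, 2):
        right = x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - a[i] * x[i - 1] - c[i] * right) / b[i]
    return x


# Discrete 1D Laplacian (-1, 2, -1) with a right-hand side built from x = 1..5.
sol = cyclic_reduction([0, -1, -1, -1, -1], [2, 2, 2, 2, 2],
                       [-1, -1, -1, -1, 0], [0, 0, 0, 0, 6])
print(sol)
```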

7.
The performance evaluation process for a massively parallel distributed-memory SIMD computer is described generally. The performance in basic computation, grid communication, and computation with grid communication is analysed. A practical performance evaluation and analysis study is done for the Connection Machine 2, and conclusions about its performance are drawn.

8.
Efficient algorithms for estimating the coefficient parameters of the ordinary linear model on a massively parallel SIMD computer are presented. The numerical stability of the algorithms is ensured by using orthogonal transformations in the form of Householder reflections and Givens plane rotations to compute the complete orthogonal decomposition of the coefficient matrix. Algorithms for reconstructing the orthogonal matrices involved in the decompositions are also designed, implemented and analyzed. The implementation of all algorithms on the targeted SIMD array processor is considered in detail. Timing models for predicting the execution time of the implementations are derived using regression modelling methods. The timing models also provide an insight into how the algorithms interact with the parallel computer. The predetermined factors used in the regression fits are derived from the number of memory layers involved in the factorization process of the matrices. Experimental results show the high accuracy and predictive power of the timing models. Copyright © 1999 John Wiley & Sons, Ltd.
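As an illustration of how Givens plane rotations yield a numerically stable least-squares estimate, the sketch below triangularizes the augmented matrix [A | y] and then back-substitutes. It is a serial, full-column-rank simplification, not the paper's SIMD algorithms or its complete orthogonal decomposition; the function name `givens_least_squares` is hypothetical.

```python
import math


def givens_least_squares(A, y):
    """Minimize ||A x - y||_2 for a full-column-rank A given as a list of rows."""
    m, n = len(A), len(A[0])
    # Work on the augmented matrix so y receives the same rotations as A.
    R = [list(row) + [yi] for row, yi in zip(A, y)]
    for j in range(n):
        # Zero column j below the diagonal, rotating adjacent rows bottom-up.
        for i in range(m - 1, j, -1):
            a, b = R[i - 1][j], R[i][j]
            r = math.hypot(a, b)
            if r == 0.0:
                continue
            cos, sin = a / r, b / r
            for k in range(j, n + 1):
                upper, lower = R[i - 1][k], R[i][k]
                R[i - 1][k] = cos * upper + sin * lower
                R[i][k] = -sin * upper + cos * lower
    # Back substitution on the upper-triangular n-by-n block.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = R[i][n] - sum(R[i][k] * x[k] for k in range(i + 1, n))
        x[i] = acc / R[i][i]
    return x


# Fit y = b0 + b1 * t to points lying exactly on y = 1 + 2t.
coeffs = givens_least_squares([[1, 0], [1, 1], [1, 2], [1, 3]], [1, 3, 5, 7])
print(coeffs)
```

Each rotation touches only two rows, so rotations on disjoint row pairs can proceed concurrently, which is one reason Givens schemes map well onto array processors.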

9.
Visibility analysis algorithms use digital elevation models (DEMs), which represent terrain topography, to determine visibility at each point on the terrain from a given location in space. This analysis can be computationally very demanding, particularly when manipulating high resolution DEMs accurately at interactive response rates. Massively data-parallel computers offer high computing capabilities and are very well-suited to handling and processing large regular spatial data structures. In the paper, the authors present a new scanline-based data-parallel algorithm for visibility analysis. Results from an implementation onto a MasPar massively data-parallel SIMD computer are also presented.
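The core test behind visibility analysis can be sketched on a single terrain profile: a sample is visible when the slope of the sight line from the viewer to it is at least as steep as every sight line to the samples in between. This is a generic serial line-of-sight check, assuming evenly spaced samples, and is not the scanline-based data-parallel algorithm of the paper; the function and parameter names are hypothetical.

```python
def visible_along_profile(elevations, viewer_height=0.0, spacing=1.0):
    """Return a list of booleans: is sample k visible from sample 0?

    elevations are terrain heights at evenly spaced samples; the viewer
    stands at sample 0, viewer_height above the ground.
    """
    eye = elevations[0] + viewer_height
    max_slope = float("-inf")
    visible = [True]  # the viewer's own cell is trivially visible
    for k in range(1, len(elevations)):
        slope = (elevations[k] - eye) / (k * spacing)
        visible.append(slope >= max_slope)
        max_slope = max(max_slope, slope)
    return visible


print(visible_along_profile([0, 2, 1, 9, 4]))
# [True, True, False, True, False]
```

A full viewshed repeats this test along many profiles radiating from the viewer, and those profiles are independent of one another.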

10.
This paper presents parallel computational strategies to implement explicit nonlinear finite element analysis code onto distributed memory parallel computers for solving large-scale problems in structural dynamics. Implementation details on both homogeneous and heterogeneous parallel processing environments are considered in detail in this paper. Implementation of an explicit nonlinear finite element dynamic analysis code on homogeneous systems is discussed first and this is later moved onto heterogeneous systems. Domain decomposition with explicit message passing is preferred for parallel implementation. The message passing implementation in the parallel algorithm is based on MPI (Message Passing Interface) libraries. Implementation aspects of overlapped, non-overlapped domain decomposition techniques, Dynamic Task Allocation (DTA) and clustering techniques for DTA and their relative merits are presented. The interprocessor communications are optimised by overlapping with computations to improve the performance of the domain decomposition based explicit dynamic analysis finite element code. The issues related to implementation of finite element code for nonlinear dynamic analysis in a heterogeneous parallel computing environment are later presented. A new dynamic load-balancing algorithm is developed for this purpose and it is integrated with the domain decomposition based parallel explicit finite element code to test our algorithms on a coarse grain heterogeneous cluster of workstations. Numerical experiments have been carried out on PARAM-10000, an Indian parallel computer, and also on a cluster of Unix workstations.

11.
A new mapping algorithm is presented for domain decomposition for the purpose of allowing researchers to conduct finite element analysis on massively parallel computers. Over the last few years, massively parallel MIMD machines such as the Intel Touchstone Delta and recently the Intel Touchstone Paragon have become increasingly popular for speeding up finite element computations. Most of these applications use domain decomposition as a first step towards conquering the problem. Many different algorithms have been developed by researchers to achieve an effective domain decomposition. Some of these methods use connectivity information only, some use coordinate information only, while others use both of them together. Some algorithms are based on assigning weights to nodes using a particular strategy while others are recursive in nature. As will be discussed in this paper, the logic employed in various algorithms works perfectly well for certain meshes to be decomposed into certain numbers of subdomains, while it gives far from perfect results for other meshes or for the same meshes to be decomposed into a different number of subdomains. The logic used in the proposed algorithm has been developed in a creative way such that it is closer to a human's natural thinking when making decisions. Fairly large meshes can be decomposed in a matter of seconds on a Sun Sparc station by the proposed algorithm. Its execution time remains almost the same for any number of subdomains.

12.
A study has been made of how cost-effectiveness due to the improvement of VLSI technology can apply to a scientific computer system without performance loss. The result is a parallel computer, ADENA (Alternating Direction Edition Nexus Array), with a core consisting of four kinds of VLSI chips, two for processor elements (PEs) and two for the interprocessor network (plus some memory chips). An overview of ADENA and an analysis of its performance are given. The design considerations for the PEs incorporated in ADENA are discussed. The factors that limit performance in a parallel processing environment are analyzed, and the measures employed to improve these factors at the LSI design level are described. The 42.6 sq cm CMOS PEs reach a peak performance of 20 MFLOPS; a 256-PE ADENA has achieved 1.5 GFLOPS, and 300 to 400 MFLOPS for PDE applications.

13.
This paper describes an efficient implementation and evaluation of a parallel eigensolver for computing all eigenvalues of dense symmetric matrices. Our eigensolver uses a Householder tridiagonalization method, which has higher parallelism and performance than conventional methods when the problem size is relatively small, e.g., of the order of 10,000. This is very important for relevant practical applications, where many diagonalizations of such matrices are frequently required. The routine was evaluated on a 1024-processor HITACHI SR2201, giving speedup ratios of about 2–5 times as compared to the ScaLAPACK library on 1024 processors of the HITACHI SR2201.

14.
Dynamic programming techniques are well-established and employed by various practical algorithms, including the edit-distance algorithm or the dynamic time warping algorithm. These algorithms usually operate in an iteration-based manner where new values are computed from values of the previous iteration. The data dependencies enforce synchronization which limits possibilities for internal parallel processing. In this paper, we investigate parallel approaches to processing matrix-based dynamic programming algorithms on modern multicore CPUs, Intel Xeon Phi accelerators, and general purpose GPUs. We address both the problem of computing a single distance on large inputs and the problem of computing a number of distances of smaller inputs simultaneously (e.g., when a similarity query is being resolved). Our proposed solutions yielded significant improvements in performance and achieved speedup of two orders of magnitude when compared to the serial baseline.
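One common way to expose parallelism despite the data dependencies described above is to sweep the dynamic programming matrix by anti-diagonals: every cell with the same i + j depends only on earlier anti-diagonals, so the cells of one anti-diagonal are mutually independent. The sketch below shows this traversal order for Levenshtein edit distance, run serially for clarity. Whether the paper uses exactly this scheme is not stated in the abstract, so treat it as an assumed illustration with a hypothetical function name.

```python
def edit_distance_wavefront(s, t):
    """Levenshtein distance computed anti-diagonal by anti-diagonal."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j          # insert all of t[:j]
    for k in range(2, m + n + 1):            # anti-diagonal with i + j == k
        # All cells in this inner loop are independent of each other and
        # could be assigned to separate threads or SIMD lanes.
        for i in range(max(1, k - n), min(m, k - 1) + 1):
            j = k - i
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # match or substitution
    return D[m][n]


print(edit_distance_wavefront("kitten", "sitting"))  # 3
```

The trade-off is that anti-diagonals vary in length, so the available parallelism ramps up and then down over the sweep.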

15.
We present a high-order method employing Jacobi polynomial-based shape functions, as an alternative to the typical Legendre polynomial-based shape functions in solid mechanics, for solving dynamic three-dimensional geometrically nonlinear elasticity problems. We demonstrate that the method has an exponential convergence rate spatially and a second-order accuracy temporally for the four classes of problems of linear/geometrically nonlinear elastostatics/elastodynamics. The method is parallelized through domain decomposition and message passing interface (MPI), and is scaled to over 2000 processors with high parallel performance.

16.
The paper analyzes and selects an appropriate interconnection network for a compliant multiprocessor. The multiprocessor is compliant to the tasks assigned to it in the sense that it can be reconfigured to provide a more efficient fit to the tasks to be executed. A number of possible candidate networks for the multiprocessor are considered: Omega, ADM, Hypercube and Torus. The potential applicability of these networks to the multiprocessor is analyzed from the points of view of partitionability, inter-PE delay, fault impact, and cost. After the individual analyses of the above points are completed, a weighted network factor is formed, and the optimal type of network is selected, under different performance criteria. The overall results point to the selection of the Torus or Hypercube network for most cases under consideration.

17.
Presents new principles for online monitoring in the context of multiprocessors (especially massively parallel processors) and then focuses on the effect of the aliasing probability on the error detection process. In the proposed test architecture, concurrent testing (or online monitoring) at the system level is accomplished by enforcing the run-time testing of the data and control dependences of the algorithm currently being executed on the parallel computer. In order to help in this process, each message contains both source and destination addresses. At each message source, the sequence of destination addresses of the outgoing messages is compressed on a block basis. At the same time, at each destination, the sequence of source addresses of all incoming messages is compressed, also on a block basis. Concurrent compression of the instructions executed by the PEs is also possible. As a result of this procedure, an image of the data dependences and of the control flow of the currently running algorithm is created. This image is compared, at the end of each computational block, with a reference image created at compilation time. The main results of this work are in proposing new principles for the online system-level testing of multiprocessor systems, based on signaturing and monitoring the data dependences together with the control dependences, and in providing an analytical model and analysis for the address compression process used for monitoring the data routing process.

18.
This paper presents a parallel mixed time integration algorithm formulated by synthesising the implicit and explicit time integration techniques. The proposed algorithm is an extension of the mixed time integration algorithms [Comput. Meth. Appl. Mech. Engng 17/18 (1979) 259; Int. J. Numer. Meth. Engng 12 (1978) 1575] being successfully employed for solving media-structure interaction problems. The parallel algorithm for nonlinear dynamic response of structures employing mixed time integration technique has been devised within the broad framework of domain decomposition. Concurrency is introduced into this algorithm, by integrating interface nodes with explicit time integration technique and later solving the local submeshes with implicit algorithm. A flexible parallel data structure has been devised to implement the parallel mixed time integration algorithm. Parallel finite element code has been developed using portable Message Passing Interface software development environment. Numerical studies have been conducted on PARAM-10000 (Indian parallel supercomputer) to test the accuracy and also the performance of the proposed algorithm. Numerical studies indicate that the proposed algorithm is highly adaptive for parallel processing.

19.
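Objective: Geometric correction (also called geocoding) is an important step in the synthetic aperture radar (SAR) image processing chain; it has considerable computational complexity and requires a geometric positioning model. For spaceborne SAR images, this paper adopts the rational polynomial coefficient (RPC) positioning model and proposes a massively parallel geometric correction method supported by graphics processing units (GPUs). Method: The method exploits the large computing resources of the GPU and the fact that every pixel undergoes the same processing steps during geometric correction. A large number of pixels are loaded onto the GPU at a time and each pixel is assigned one thread; each thread performs the computationally expensive steps of rational function evaluation, projection transformation, and interpolation sampling. GPU parallel performance is improved by tuning the dimGrid and dimBlock parameters. The method handles large SAR scenes through block-wise processing and works with a range of block sizes. Results: Experiments show a computational speedup ratio of 38 to 44. To analyze the characteristics of GPU parallel processing comprehensively and objectively, the overall speedup ratio was also computed; several experiments were used to analyze the factors affecting overall acceleration, and an optimization that improves I/O performance through large-block reads and writes is proposed. Conclusion: The method is simple in form and general; it applies to almost all spaceborne SAR images and to different image sizes, and its acceleration is significant.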
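The one-thread-per-pixel mapping in this abstract implies choosing a grid of thread blocks large enough to cover each image tile. The sketch below models that arithmetic in plain Python rather than CUDA: it computes a grid size by ceiling division and maps a (block, thread) index pair back to a pixel, with the bounds check that padding threads need. The 16 x 16 block shape and all function names are assumptions for illustration; the abstract does not give the authors' actual dimGrid and dimBlock values.

```python
def launch_dims(width, height, block=(16, 16)):
    """Return (grid, block) so that grid * block covers a width x height tile."""
    bx, by = block
    grid = ((width + bx - 1) // bx, (height + by - 1) // by)  # ceiling division
    return grid, block


def thread_to_pixel(block_idx, thread_idx, block, width, height):
    """Map a 2D block index and thread index to a pixel, or None if out of range."""
    x = block_idx[0] * block[0] + thread_idx[0]
    y = block_idx[1] * block[1] + thread_idx[1]
    if x >= width or y >= height:
        return None  # padding thread: the tile size is not a multiple of the block
    return x, y


grid, block = launch_dims(1000, 600)
print(grid)                                                # (63, 38)
print(thread_to_pixel((62, 37), (7, 7), block, 1000, 600))  # (999, 599)
print(thread_to_pixel((62, 37), (8, 0), block, 1000, 600))  # None
```

Larger blocks reduce the number of blocks to schedule but waste more padding threads at tile edges, which is one reason such parameters are usually tuned empirically.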

20.
A parallel finite element analysis based on a domain decomposition technique (DDT) is considered. In the present DDT, an analysis domain is divided into a number of smaller subdomains without overlap. Finite element analyses of the subdomains are performed under the constraint of both displacement continuity and force equivalence among them. The constraint is satisfied through iterative calculations based on either the Uzawa algorithm or the Conjugate Gradient (CG) method. Owing to the iterative algorithm, a large scale finite element analysis can be divided into a number of smaller ones which can be carried out in parallel.

The DDT is implemented on a parallel computer network composed of a number of 32-bit microprocessors, transputers. The developed parallel calculation system named the ‘FEM server type system’ involves peculiar features such as network independence and dynamic workload balance.

The characteristics of the domain decomposition method such as computational speed and memory requirement are first examined in detail through the finite element calculations of a homogeneous or inhomogeneous cracked plate subjected to a tensile load on a single CPU computer.

The ‘speedup’ and ‘performance’ features of the FEM server type system are discussed on a parallel computer system composed of up to 16 transputers, with changing network types and domain decompositions. It is clearly demonstrated that the present parallel computing system requires a much smaller amount of computational memory than the conventional finite element method and also that, due to the feature of dynamic workload balancing, high performance (over 90%) is achieved even in a large scale finite element calculation with irregular domain decomposition.

