20 matching records found (search time: 0 ms)
1.
This paper shows that QR factorization of large, sparse matrices can be performed efficiently on massively parallel SIMD (single instruction stream/multiple data stream) computers such as the Connection Machine CM-2. The problem is cast as a dataflow graph, whose nodes are mapped to a virtual dataflow machine in such a way that only nearest-neighbor communication is required. This virtual machine is implemented by programming the CM-2 processors to support a restricted dataflow protocol. Execution results for several test matrices show that good performance can be obtained without relying on nested dissection techniques.
2.
This paper describes a parallel solver of tridiagonal systems appropriate for distributed memory computers and implemented on an array of chain-connected T800 Transputers. Each processor in the chain uses the same program to solve its own subset of equations. This implementation is suited, for instance, for solving the heat conduction equation in one-dimensional hydrodynamic codes. The procedure performs a parallel cyclic reduction, a recursive Gaussian elimination on a reduced number of equations and a parallel backward unfolding scheme, with a direct substitution in the reduced equations. The code has been written in the Occam2 language. A one-way communication of values between adjacent processors is required at each cycle of both the reduction and the unfolding steps. Due to the number of floating point operations and the amount of communication required, the implementation described here works efficiently on arrays with more than 4 processors and with more than 50 equations per processor.
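The reduction and unfolding steps described in this abstract can be sketched sequentially. The following pure-Python version is a minimal sketch (the function name, the recursive structure, and the diagonal-dominance assumption are ours, not the paper's Occam2 code); each recursion level corresponds to one communication cycle of the processor chain:

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]
    (a[0] and c[-1] unused) by cyclic reduction: eliminate the odd-indexed
    unknowns, solve the half-size even-indexed system recursively, then
    unfold the odd-indexed unknowns by direct substitution. Assumes the
    system is diagonally dominant so the pivots b[i] stay nonzero."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]
    ra, rb, rc, rd = [], [], [], []
    for i in range(0, n, 2):          # reduction: one new equation per even index
        nb, nd = b[i], d[i]
        na = nc = 0.0
        if i > 0:                     # fold equation i-1 into equation i
            alpha = a[i] / b[i - 1]
            na = -alpha * (a[i - 1] if i - 1 > 0 else 0.0)
            nb -= alpha * c[i - 1]
            nd -= alpha * d[i - 1]
        if i < n - 1:                 # fold equation i+1 into equation i
            beta = c[i] / b[i + 1]
            nc = -beta * (c[i + 1] if i + 1 < n - 1 else 0.0)
            nb -= beta * a[i + 1]
            nd -= beta * d[i + 1]
        ra.append(na); rb.append(nb); rc.append(nc); rd.append(nd)
    even = cyclic_reduction(ra, rb, rc, rd)
    x = [0.0] * n
    x[0:n:2] = even
    for i in range(1, n, 2):          # unfolding: direct substitution
        upper = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - a[i] * x[i - 1] - upper) / b[i]
    return x
```

In the parallel version, each fold at a given level needs values only from the two neighboring equations, which is why a one-way exchange between adjacent processors per cycle suffices.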
3.
The performance evaluation process for a massively parallel distributed-memory SIMD computer is described generally. The performance in basic computation, grid communication, and computation with grid communication is analysed. A practical performance evaluation and analysis study is done for the Connection Machine 2, and conclusions about its performance are drawn.
4.
Visibility analysis algorithms use digital elevation models (DEMs), which represent terrain topography, to determine visibility at each point on the terrain from a given location in space. This analysis can be computationally very demanding, particularly when manipulating high resolution DEMs accurately at interactive response rates. Massively data-parallel computers offer high computing capabilities and are very well-suited to handling and processing large regular spatial data structures. In the paper, the authors present a new scanline-based data-parallel algorithm for visibility analysis. Results from an implementation on a MasPar massively data-parallel SIMD computer are also presented.
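The core recurrence behind scanline visibility analysis is simple: a sample along a ray from the viewer is visible iff its elevation angle exceeds the running maximum over all closer samples. A minimal sketch (our own illustrative names, not the paper's algorithm):

```python
def visible_along_ray(heights, viewer_height):
    """Given terrain heights sampled at unit spacing along a ray from the
    viewer (index 0 = first sample in front of the viewer), mark each sample
    visible iff its elevation angle (represented by its tangent) exceeds the
    running maximum over all closer samples."""
    visible = []
    max_tan = float("-inf")
    for dist, h in enumerate(heights, start=1):
        tan_angle = (h - viewer_height) / dist
        visible.append(tan_angle > max_tan)
        max_tan = max(max_tan, tan_angle)
    return visible
```

The running maximum is a prefix-max scan, which is exactly the kind of operation a data-parallel SIMD machine like the MasPar executes efficiently across many scanlines at once.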
5.
A study has been made of how the cost-effectiveness gains of improving VLSI technology can be applied to a scientific computer system without performance loss. The result is a parallel computer, ADENA (Alternating Direction Edition Nexus Array), with a core consisting of four kinds of VLSI chips, two for processor elements (PEs) and two for the interprocessor network (plus some memory chips). An overview of ADENA and an analysis of its performance are given. The design considerations for the PEs incorporated in ADENA are discussed. The factors that limit performance in a parallel processing environment are analyzed, and the measures employed to improve these factors at the LSI design level are described. The 42.6 sq cm CMOS PEs reach a peak performance of 20 MFLOPS; a 256-PE ADENA has achieved 1.5 GFLOPS peak, and 300 to 400 MFLOPS for PDE applications.
6.
This paper describes an efficient implementation and evaluation of a parallel eigensolver for computing all eigenvalues of dense symmetric matrices. Our eigensolver uses a Householder tridiagonalization method, which has higher parallelism and performance than conventional methods when the problem size is relatively small, e.g., of the order of 10,000. This is very important for practical applications in which many diagonalizations of such matrices are required. The routine was evaluated on 1024 processors of the HITACHI SR2201, giving speedup ratios of about 2–5 over the ScaLAPACK library on the same machine.
7.
Dynamic programming techniques are well-established and employed by various practical algorithms, including the edit-distance algorithm and the dynamic time warping algorithm. These algorithms usually operate in an iteration-based manner where new values are computed from values of the previous iteration. The data dependencies enforce synchronization which limits possibilities for internal parallel processing. In this paper, we investigate parallel approaches to processing matrix-based dynamic programming algorithms on modern multicore CPUs, Intel Xeon Phi accelerators, and general purpose GPUs. We address both the problem of computing a single distance on large inputs and the problem of computing a number of distances of smaller inputs simultaneously (e.g., when a similarity query is being resolved). Our proposed solutions yielded significant improvements in performance and achieved speedup of two orders of magnitude when compared to the serial baseline.
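The standard way to expose parallelism despite these data dependencies is to sweep the dynamic programming matrix anti-diagonal by anti-diagonal: all cells with the same i + j depend only on the previous two waves and are mutually independent. A sequential sketch of this wavefront ordering for edit distance (our own illustrative code, not the paper's implementation):

```python
def edit_distance_wavefront(s, t):
    """Levenshtein distance computed anti-diagonal by anti-diagonal.
    Cells on the same anti-diagonal i + j = k are mutually independent,
    so each wave could be dispatched to parallel workers; here the inner
    loop over the wave is sequential for clarity."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # deleting i characters
    for j in range(n + 1):
        D[0][j] = j                      # inserting j characters
    for k in range(2, m + n + 1):        # wave index: i + j == k
        for i in range(max(1, k - n), min(m, k - 1) + 1):
            j = k - i
            cost = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + cost) # substitution / match
    return D[m][n]
```

On a GPU or Xeon Phi, the inner loop over i becomes the parallel dimension, with one synchronization barrier per wave.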
8.
We present a high-order method employing Jacobi polynomial-based shape functions, as an alternative to the typical Legendre polynomial-based shape functions in solid mechanics, for solving dynamic three-dimensional geometrically nonlinear elasticity problems. We demonstrate that the method has an exponential convergence rate spatially and a second-order accuracy temporally for the four classes of problems of linear/geometrically nonlinear elastostatics/elastodynamics. The method is parallelized through domain decomposition and message passing interface (MPI), and is scaled to over 2000 processors with high parallel performance.
9.
The paper analyzes and selects an appropriate interconnection network for a compliant multiprocessor. The multiprocessor is compliant to the tasks assigned to it in the sense that it can be reconfigured to provide a more efficient fit to the tasks to be executed. A number of candidate networks for the multiprocessor are considered: Omega, ADM, Hypercube and Torus. The potential applicability of these networks to the multiprocessor is analyzed from the points of view of partitionability, inter-PE delay, fault impact, and cost. After the individual analysis of each of these criteria is completed, a weighted network factor is formed and the optimal type of network is selected under different performance criteria. The overall results point to the selection of the Torus or Hypercube network for most cases under consideration.
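A weighted network factor of this kind is a weighted sum of per-criterion scores. The sketch below shows the selection mechanics only; all scores and weights are invented placeholders, not the paper's measured values:

```python
def weighted_network_factor(scores, weights):
    """Combine per-criterion scores (higher = better) into a single weighted
    factor per network and return the best network plus all factors."""
    factors = {net: sum(w * s for w, s in zip(weights, vals))
               for net, vals in scores.items()}
    return max(factors, key=factors.get), factors

# Hypothetical scores per criterion:
# (partitionability, inter-PE delay, fault impact, cost)
scores = {"Omega":     [3, 4, 3, 4],
          "ADM":       [3, 3, 3, 3],
          "Hypercube": [5, 3, 4, 2],
          "Torus":     [4, 4, 4, 5]}
weights = [0.3, 0.2, 0.2, 0.3]   # hypothetical criterion weights
best, factors = weighted_network_factor(scores, weights)
```

Varying the weight vector models the paper's "different performance criteria": a cost-heavy weighting and a delay-heavy weighting can select different networks from the same score table.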
10.
Presents new principles for online monitoring in the context of multiprocessors (especially massively parallel processors) and then focuses on the effect of the aliasing probability on the error detection process. In the proposed test architecture, concurrent testing (or online monitoring) at the system level is accomplished by enforcing the run-time testing of the data and control dependences of the algorithm currently being executed on the parallel computer. In order to help in this process, each message contains both source and destination addresses. At each message source, the sequence of destination addresses of the outgoing messages is compressed on a block basis. At the same time, at each destination, the sequence of source addresses of all incoming messages is compressed, also on a block basis. Concurrent compression of the instructions executed by the PEs is also possible. As a result of this procedure, an image of the data dependences and of the control flow of the currently running algorithm is created. This image is compared, at the end of each computational block, with a reference image created at compilation time. The main results of this work are in proposing new principles for the online system-level testing of multiprocessor systems, based on signaturing and monitoring the data dependences together with the control dependences, and in providing an analytical model and analysis for the address compression process used for monitoring the data routing process.
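The block-based address compression can be illustrated with a toy signature scheme. A simple multiplicative hash stands in here for the hardware compactor (real designs typically use MISR/LFSR compaction); the function name and parameters are our own. Comparing run-time signatures against reference signatures computed at compile time flags routing errors, at the cost of a small aliasing probability (two different sequences compressing to the same signature):

```python
def block_signature(addresses, block_size=4, mod=2**16):
    """Compress a sequence of message addresses, one signature per block.
    A mismatch against the compile-time reference signature for any block
    indicates a data-routing (or dependence) error in that block."""
    sigs = []
    for start in range(0, len(addresses), block_size):
        sig = 0
        for addr in addresses[start:start + block_size]:
            sig = (sig * 31 + addr) % mod   # toy stand-in for MISR compaction
        sigs.append(sig)
    return sigs
```

Note that a reordering of addresses within a block changes the signature, so the scheme checks the order of routing events, not just their multiset.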
11.
A parallel finite element analysis based on a domain decomposition technique (DDT) is considered. In the present DDT, an analysis domain is divided into a number of smaller subdomains without overlap. Finite element analyses of the subdomains are performed under the constraint of both displacement continuity and force equivalence among them. The constraint is satisfied through iterative calculations based on either the Uzawa algorithm or the Conjugate Gradient (CG) method. Owing to the iterative algorithm, a large scale finite element analysis can be divided into a number of smaller ones which can be carried out in parallel. The DDT is implemented on a parallel computer network composed of a number of 32-bit microprocessors, transputers. The developed parallel calculation system, named the ‘FEM server type system’, has distinctive features such as network independence and dynamic workload balance. The characteristics of the domain decomposition method, such as computational speed and memory requirement, are first examined in detail through finite element calculations of a homogeneous or inhomogeneous cracked plate subjected to a tensile load on a single CPU computer. The ‘speedup’ and ‘performance’ features of the FEM server type system are discussed on a parallel computer system composed of up to 16 transputers, with changing network types and domain decompositions. It is clearly demonstrated that the present parallel computing system requires a much smaller amount of computational memory than the conventional finite element method and also that, due to the feature of dynamic workload balancing, high performance (over 90%) is achieved even in a large scale finite element calculation with irregular domain decomposition.
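The CG iteration that enforces the interface constraints is the standard conjugate gradient method for a symmetric positive-definite system. A minimal dense-matrix sketch (our own code, applied to a toy system for clarity; in the DDT, each matrix-vector product is assembled from independent subdomain solves, which is where the parallelism comes from):

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive-definite A by conjugate
    gradients, starting from x = 0. A is a dense list-of-lists here;
    in a domain decomposition setting the product A p would be computed
    by concurrent per-subdomain solves followed by interface assembly."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                       # residual b - A x  (x = 0 initially)
    p = r[:]                       # first search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol * tol:
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x
```

Because CG never factorizes A, each processor only needs its own subdomain's stiffness data, which is consistent with the memory savings the abstract reports.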
12.
Large scale high speed design and analysis of next generation aerospace systems is achieved by focusing on the computational integration and synchronization of probabilistic mathematics, structural/material mechanics, and parallel computing. Design costs have been driven upward by mathematical models that require multiple levels of interactive analysis and utilize time consuming convergence criteria that further drive computing and design costs upward. To reduce CPU time and memory limitations, an effective real time parallelization (RTP) of the solution is introduced. Recursive Internal Partitioning (RIP) is used to partition the entire domain into subdomains, with one or more subdomains being assigned to each independent processor. The Alpha STAR multifrontal algorithm (AMF) is implemented to assemble, condense, and solve for the unknowns at all finite element nodes. A multi-level optimization technique is utilized to speed up the simulation processing time of the diversified field of specialized analysis techniques and mathematical models, which require hierarchical multiple levels of interactive analysis with time consuming convergence. The generic high speed civil transport (HSCT) model is used to demonstrate the large scale computing capability. Results of multicriteria optimization indicate an order of magnitude reduction in computing time. Numerical solutions as well as physical phenomena are discussed and recommendations are provided for future solutions.
13.
This paper discusses a parallel Lisp system developed for a distributed memory parallel processor, the Mayfly. The language has been adapted to the problems of distributed data by providing a tight coupling of control and data, including mechanisms for mutual exclusion and data sharing. The language was primarily designed to execute on the Mayfly, but also runs on networked workstations. Initially, we show the relevant parts of the language as seen by the user. Then we concentrate on the system Lisp level implementation of these constructs, with particular attention to agents, a mechanism for limiting the cost of remote operations. Briefly mentioned are the low-level kernel hardware and software support of the system Lisp primitives. This work was supported in part by the Hewlett-Packard Corporation.
14.
Algorithms for solving multiple criteria nonlinear programming problems are frequently based on the generalized reduced gradient (GRG) method. Since the GRG method involves complex, large-scale computation, solving large-scale multiple criteria nonlinear programming problems takes considerable time. Parallel processing of the GRG method is therefore required. We propose a parallel processing algorithm for the GRG method on multiprocessor systems.
15.
A new computer program architecture for the solution of finite element systems using concurrent processing is presented. The basic approach involves the automatic creation of substructures. A host provides control over a set of processors, each of which is assigned initially to one substructure, then dynamically reassigned to the common interface for the solution of the complete system of substructures. Algorithm details are presented for each phase of the analysis. Results of the analysis of large plate bending problems on a hypercube multicomputer are reported. For a system with 2,000 equations, an efficiency of 80 percent of the maximum theoretical value was obtained using 16 processors.
16.
To support parallel processing of data-intensive applications, the interconnection network of a parallel/distributed machine must provide high end-to-end communication bandwidth and handle the bursty and concentrated communication patterns generated by dynamic load balancing and data collection operations. A large-scale interconnection network architecture called a virtual bus is proposed. The virtual bus can scale to terabits-per-second end-to-end communication bandwidth with low queuing delay for nonuniform traffic. A terabit virtual bus architecture can be efficiently implemented for less than 5% of the total cost of an eight-thousand-node system. In addition, the virtual bus has an open system parallel interface that is flexible enough to support up to gigabytes per second data transfer rates, different grades of services, and broadcast operation. Such flexibility makes the virtual bus a plausible open system communication backbone for a broad range of applications.
17.
Methods were developed for parallel processing of finite element solutions of large truss structures. The parallel processing techniques were implemented in two stages, i.e., the repeated forming of the nonlinear global stiffness matrix and the solving of the global system of equations. The Sequent Balance 21000 parallel computer was employed to demonstrate the procedures and the speed-up.
18.
Finite difference schemes have proved to be very flexible numerical methods for the pricing of contingent claims with one and two underlying state variables. This flexibility and the steady stream of new complex financial instruments imply that finite difference schemes for the valuation of contingent claims with three underlying state variables should be very useful. In this paper, two such schemes are developed and tested. For practical purposes, numerical valuation of contingent claims with three underlying state variables by means of finite difference methods is probably too laborious computationally to be performed on a single processor computer. Many calculations can, however, be performed in parallel. Therefore, the methods are well suited to be executed on a massively parallel computer, like the Connection Machine CM-200, which is used in this paper. The accuracy of the schemes proposed in this paper suggests that valuation of multivariate contingent claims with the help of finite difference methods on a massively parallel computer can be a useful approach for academics as well as practitioners.
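The parallel structure of such schemes is easiest to see in one state variable. The sketch below prices a European call with an explicit finite difference scheme under one-factor Black-Scholes dynamics; it is an illustrative single-variable analogue (with arbitrary parameter values), not one of the paper's three-variable schemes. Each time step updates all interior grid nodes independently, which is what maps well onto a SIMD machine:

```python
import math

def european_call_fd(S_max=200.0, K=100.0, r=0.05, sigma=0.2, T=1.0,
                     M=100, N=5000):
    """Explicit finite difference valuation of a European call.
    Grid in the asset price S with spacing dS; start from the payoff at
    maturity and step forward in time-to-maturity tau with step dt."""
    dS = S_max / M
    dt = T / N
    V = [max(i * dS - K, 0.0) for i in range(M + 1)]  # payoff at maturity
    for n in range(1, N + 1):
        tau = n * dt
        prev = V
        V = prev[:]
        for i in range(1, M):          # every interior node is independent
            S = i * dS
            delta = (prev[i + 1] - prev[i - 1]) / (2 * dS)
            gamma = (prev[i + 1] - 2 * prev[i] + prev[i - 1]) / dS ** 2
            V[i] = prev[i] + dt * (0.5 * sigma**2 * S**2 * gamma
                                   + r * S * delta - r * prev[i])
        V[0] = 0.0                               # call is worthless at S = 0
        V[M] = S_max - K * math.exp(-r * tau)    # deep in-the-money boundary
    return V
```

The explicit scheme is only conditionally stable (dt must be small relative to dS squared), but its embarrassingly parallel node updates are exactly what made machines like the CM-200 attractive for this problem.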
19.
This paper describes a study carried out on the development and implementation of two parallel equation solvers for static finite element analysis. The two direct solvers, one for banded storage and the other for skyline profile storage, are implemented on a tightly coupled shared memory system. Certain key issues like (algorithmic) portability across different parallel architectures, matrix sparsity and vectorisation have been kept in mind while designing the algorithms. Performance studies have been conducted by varying the number of processors and the size of the problem. The results indicate that higher efficiencies can be obtained with both the algorithms described in this paper. However, one has to choose the appropriate solver based on the concurrent approach chosen for parallelising the finite element code. The pseudo-codes and the concurrent implementations of the two solvers, for both shared memory and message passing systems, are presented.
20.
A ruggedized computer supporting multi-master parallel processing was implemented using Intel Xeon LV processors, with advanced EDA tools and simulation software applied to the placement and routing of its high-speed serial buses. For the target application, a computer hardware system based on a high-speed interconnection network was built, and the parallel capability of the system was tested with mature commercial parallel software. For operation in harsh environments, and for the thermal design in particular, the environmental design of the computer system was carried out through thermal analysis and modeling/simulation (Icepak).