Similar literature
 Found 20 similar articles (search time: 364 ms)
1.
In this paper we discuss a recursive divide and conquer algorithm to compute the inverse of an unreduced tridiagonal matrix. It is based on the recursive application of the Sherman–Morrison formula to a diagonally dominant tridiagonal matrix to avoid numerical stability problems. A theoretical study of the computational cost of this method is developed, comparing it with the experimental times measured on an IBM SP2 using switch and Ethernet hardware for communications between processors. Experimental results are presented for two and four processors. Finally, the method is compared with a divide and conquer method for solving tridiagonal linear systems.
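For reference, the Sherman–Morrison formula named in this abstract gives the inverse of a rank-one update of an invertible matrix. The symbols A, u, and v below are generic notation, not taken from the paper, and the abstract does not say how the tridiagonal matrix is split at each level of the recursion.

```latex
(A + u v^{T})^{-1} = A^{-1} - \frac{A^{-1} u \, v^{T} A^{-1}}{1 + v^{T} A^{-1} u},
\qquad 1 + v^{T} A^{-1} u \neq 0 .
```

One plausible use in a divide and conquer setting (an assumption here, not confirmed by the abstract) is to write the tridiagonal matrix as a block-diagonal matrix plus a rank-one coupling term, invert the blocks independently, and then apply the formula to restore the coupling.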

2.
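A distributed parallel algorithm for block banded linear systems   Cited by: 3 (self-citations: 0, citations by others: 3)
Based on the divide-and-conquer idea, this paper first proposes a new distributed parallel algorithm for solving block tridiagonal linear systems, then extends the algorithm to the parallel solution of block pentadiagonal and block heptadiagonal linear systems, and analyzes the performance of the algorithm. Trial computations on an SGI workstation cluster and on a cluster of 586 microcomputers show that the speedup increases linearly.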

3.
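An efficient distributed parallel algorithm for tridiagonal linear systems   Cited by: 8 (self-citations: 0, citations by others: 8)
A parallel algorithm is proposed for solving tridiagonal linear systems in a distributed-memory environment. Based on a divide-and-conquer strategy, the algorithm efficiently forms and solves the reduced system while avoiding unnecessary redundant computation, and it balances the load across processors well through a careful estimate of the computational cost. It also makes full use of overlapping computation with communication to reduce processor idle time. The complexity of the algorithm is analyzed, and numerical experiments on a distributed-memory multicomputer system are reported. The numerical results show that the algorithm is considerably more efficient than the DPP algorithm of Chi Lihua (迟利华) and Li Xiaomei (李晓梅).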

4.
The problem of solving tridiagonal linear systems on parallel distributed-memory environments is considered in this paper. In particular, two common direct methods for solving such systems are considered: odd-even cyclic reduction and prefix summing. For each method, a variety of lower bounds on execution time for solving tridiagonal linear systems are presented. Specifically, lower bounds are presented that (a) hold when the number of data items per processor is bounded, (b) are general lower bounds, and (c) hold for specific data layouts commonly used in designing parallel algorithms to solve tridiagonal linear systems. Furthermore, algorithms are presented that have running times within a constant factor of the lower bounds provided. Lastly, a comparison of bounds for odd-even cyclic reduction and prefix summing is given.
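As a point of reference for the first of the two methods named above, here is a minimal serial sketch of odd-even cyclic reduction. The function names and the array layout (`a`, `b`, `c` for the sub-, main and super-diagonal, with `a[0]` and `c[-1]` set to 0) are illustrative assumptions; the paper's parallel data layouts and lower-bound analysis are not reproduced. The sketch does no pivoting, so it assumes a well-behaved system such as a diagonally dominant one.

```python
def tridiag_matvec(a, b, c, x):
    """Return y = T x for the tridiagonal matrix T stored as three lists."""
    n = len(x)
    y = []
    for i in range(n):
        s = b[i] * x[i]
        if i > 0:
            s += a[i] * x[i - 1]
        if i + 1 < n:
            s += c[i] * x[i + 1]
        y.append(s)
    return y


def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] (hypothetical helper)."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Reduction: eliminate the even-indexed unknowns from every odd-indexed
    # equation, leaving a tridiagonal system in x[1], x[3], x[5], ...
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        new_a = -alpha * a[i - 1]
        new_b = b[i] - alpha * c[i - 1]
        new_c = 0.0
        new_d = d[i] - alpha * d[i - 1]
        if i + 1 < n:  # a right-hand even neighbour exists
            gamma = c[i] / b[i + 1]
            new_b -= gamma * a[i + 1]
            new_c = -gamma * c[i + 1]
            new_d -= gamma * d[i + 1]
        ra.append(new_a)
        rb.append(new_b)
        rc.append(new_c)
        rd.append(new_d)

    # Recurse on the half-size system, then back-substitute the even unknowns.
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x
```

In a parallel setting, the eliminations inside each reduction level are independent of one another, which is what makes the method attractive on distributed-memory machines; the recursion depth is about log2(n).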

5.
This paper presents a recursive direct differentiation method for sensitivity analysis of flexible multibody systems. Large rotations and translations in the system are modeled as rigid body degrees of freedom while the deformation field within each body is approximated by superposition of modal shape functions. The equations of motion for the flexible members are differentiated at body level and the sensitivity information is generated via a recursive divide and conquer scheme. The number of differentiations required in this method is minimal. The method works concurrently with the forward dynamics simulation of the system and requires minimum data storage. The use of divide and conquer framework makes the method linear and logarithmic in complexity for serial and parallel implementation, respectively, and ideally suited for general topologies. The method is applied to a flexible two arm robotic manipulator to calculate sensitivity information and the results are compared with the finite difference approach.

6.
This paper describes an efficient algorithm for the parallel solution of systems of linear equations with a block tridiagonal coefficient matrix. The algorithm comprises a multilevel LU-factorization based on block cyclic reduction and a corresponding solution algorithm.

The paper includes a general presentation of the parallel multilevel LU-factorization and solution algorithms, but the main emphasis is on implementation principles for a message passing computer with hypercube topology. Problem partitioning, processor allocation and communication requirement are discussed for the general block tridiagonal algorithm.

Band matrices can be cast into block tridiagonal form, and this special but important problem is dealt with in detail. It is demonstrated how the efficiency of the general block tridiagonal multilevel algorithm can be improved by introducing the equivalent of two-way Gaussian elimination for the first and the last partitioning and by carefully balancing the load of the processors. The presentation of the multilevel band solver is accompanied by detailed complexity analyses.

The properties of the parallel band solver were evaluated by implementing the algorithm on an Intel iPSC hypercube parallel computer and solving a large number of banded linear equations using 2 to 32 processors. The results of the evaluation include speed-up over a sequential processor, and the measured values are in good agreement with the theoretical values resulting from complexity analysis. It is found that the maximum asymptotic speed-up of the multilevel LU-factorization using p processors and load balancing is approximated well by the expression (p + 6)/4.

Finally, the multilevel parallel solver is compared with solvers based on row and column interleaved organization.


7.
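This paper considers the parallel solution of tridiagonal linear systems on networks of workstations (NOWs). Based on the minimal rank decoupling algorithm and the divide-and-conquer parallel computing paradigm, a parallel minimal rank decoupling algorithm (PMRD) is proposed. It preserves the structural characteristics of the original matrix throughout the computation and has high numerical stability. The paper analyzes the numerical properties of the algorithm and its computation and communication complexity, and compares it with Mehrmann's divide-and-conquer algorithm. All algorithms are implemented with the PVM software system and tested on a network of workstations.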

8.
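Efficient algorithms for parallel matrix eigenproblem solvers on SMP cluster systems   Cited by: 2 (self-citations: 0, citations by others: 2)
Tridiagonalization of a symmetric matrix and computation of the eigenvalues of the resulting symmetric tridiagonal matrix are the key steps in a parallel solver for dense symmetric eigenproblems. Targeting the multilevel architecture of SMP cluster systems, hybrid MPI+OpenMP parallel algorithms are presented for Householder-based tridiagonalization and for the divide-and-conquer algorithm for the tridiagonal eigenvalue problem. The study focuses on load balancing, communication overhead, and performance evaluation in an SMP cluster environment. The hybrid design combines a coarse-grained thread-parallel model with dynamic, task-sharing invocation, which improves the load balance of the MPI algorithm and reduces communication overhead. Experiments on the DeepComp 6800 (深腾6800) show that the solver based on the hybrid parallel algorithms has better performance and scalability than the pure MPI version.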

9.
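Building on the divide-and-conquer method for the symmetric tridiagonal eigenproblem, an improved hybrid parallel implementation using the MPI/Cilk model is proposed. It combines data parallelism between nodes with multitask parallelism within a node, and achieves multitask partitioning and dynamic scheduling for both the divide phase and the merge phase of the divide-and-conquer algorithm. Within a node, the Cilk task-parallel model is used to address thread-level problems such as data dependencies and starvation, which increases parallelism. Between nodes, the communication flow of the merge phase is improved so that processes within a group exchange only complementary data, which reduces communication overhead. Numerical experiments demonstrate the advantages of the hybrid parallel algorithm in computational efficiency and scalability.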

10.
Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and the concerns of low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for GPU and propose a queue-based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree-based parallel method for multi-GPU systems. These approaches increase GPU utilization, thus resulting in substantial computational time reduction. Comparisons are made with the existing parallel solvers by using problems arising from practical applications. The experimental results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12-core node, we were able to obtain speedups in the range 1.59× to 2.31× by using one GPU and 1.80× to 3.21× by using two GPUs. Relative to a state-of-the-art solver based on supernodal method for CPU-GPU heterogeneous platform, we were able to obtain speedups in the range 1.52× to 2.30× by using one GPU and 2.15× to 2.76× by using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd.

11.
Solving the Kohn–Sham equation, which arises in density functional theory, is a standard procedure to determine the electronic structure of atoms, molecules, and condensed matter systems. The solution of this nonlinear eigenproblem is used to predict the spatial and energetic distribution of electronic states. However, obtaining a solution for large systems is computationally intensive because the problem scales super-linearly with the number of atoms. Here we demonstrate a divide and conquer method that partitions the necessary eigenvalue spectrum into slices and computes each partial spectrum on an independent group of processors in parallel. We focus on the elements of the spectrum slicing method that are essential to its correctness and robustness such as the choice of filter polynomial, the stopping criterion for a vector iteration, and the detection of duplicate eigenpairs computed in adjacent spectral slices. Some of the more prominent aspects of developing an optimized implementation are discussed.

12.
The parallel stratagem in this paper uses scattered square decomposition, introduced by G. Fox, for its data assignment and then exploits parallelism in the solution steps of the sequential Householder tridiagonalization algorithm. One may condense a real symmetric full matrix A of order n into a tridiagonal form by the stratagem on concurrent machines where N (= D^2) processors are used. Expressions for efficiency and speedup are given for the evaluation of the stratagem. An alternative stratagem which requires less data transmission but more computations is also discussed. The results show that the Householder method of tridiagonalization may be implemented efficiently on a concurrent machine by scattered square decomposition, provided that the number of matrix elements contained in each processor is much larger than the number of processors of the concurrent machine, and that the ratio of the time to transmit one data item from one processor to any other processor to the time to perform a floating-point arithmetic operation is small enough.

13.
This paper describes a parallel solver of tridiagonal systems appropriate for distributed memory computers and implemented on an array of chain-connected T800 Transputers. Each processor in the chain uses the same program to solve its own subset of equations. This implementation is suited, for instance, for solving the heat conduction equation in one-dimensional hydrodynamic codes. The procedure performs a parallel cyclic reduction, a recursive Gaussian elimination on a reduced number of equations and a parallel backward unfolding scheme, with a direct substitution in the reduced equations. The code has been written in Occam2 language. A one-way communication of values between adjacent processors is required at each cycle of both the reduction and the unfolding steps. Due to the number of floating point operations and the amount of communications the implementation described here works efficiently on arrays with more than 4 processors and for more than 50 equations per processor.

14.
In light of GPUs’ powerful floating-point operation capacity, heterogeneous parallel systems incorporating general purpose CPUs and GPUs have become a highlight in the research field of high performance computing (HPC). However, due to the complexity of programming on GPUs, porting a large number of existing scientific computing applications to the heterogeneous parallel systems remains a big challenge. The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing. To effectively inherit existing OpenMP applications and reduce the transplant cost, we extend OpenMP with a group of compiler directives, which explicitly divide tasks among the CPU and the GPU, and map time-consuming computing fragments to run on the GPU, thus dramatically simplifying the transplantation. We have designed and implemented MPtoStream, a compiler of the extended OpenMP for AMD’s stream processing GPUs. Our experimental results show that programming with the extended directives deviates from programming with OpenMP by less than 11% modification and achieves significant speedup ranging from 3.1 to 17.3 on a heterogeneous system, incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU, over the execution on the Xeon CPU alone.

15.
This paper explores algorithmic and implementation principles for gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluates them in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 nVidia Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation–communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.

16.
In this work, we developed a parallel algorithm to speed up the resolution of differential matrix Riccati equations using a backward differentiation formula algorithm based on a fixed-point method. The role and use of differential matrix Riccati equations is especially important in several applications such as optimal control, filtering, and estimation. In some cases, the problem could be large, and it is interesting to speed it up as much as possible. Recently, modern graphics processing units (GPUs) have been used as a way to improve performance. In this paper, we used an approach based on general-purpose computing on graphics processing units. We used NVIDIA © GPUs with unified architecture. To do this, a special version of basic linear algebra subprograms for GPUs, called CUBLAS, and a package (three different packages were studied) to solve linear systems using GPUs have been used. Moreover, we developed a MATLAB © toolkit to use our implementation from MATLAB in such a way that if the user has a graphics card, the performance of the implementation is improved. If the user does not have such a card, the algorithm can also be run using the machine CPU. Experimental results on an NVIDIA Quadro FX 5800 are shown. Copyright © 2011 John Wiley & Sons, Ltd.

17.
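A distributed parallel algorithm for tridiagonal linear systems   Cited by: 4 (self-citations: 1, citations by others: 4)
This paper reviews the Michielse & Vorst (M&V) algorithm and analyzes the main factors that affect its parallel efficiency. Based on the divide-and-conquer idea, a parallel algorithm for solving tridiagonal systems is proposed. The new algorithm needs 50% as many communication start-ups as the M&V algorithm and 33% of its data transfer volume. Finally, the new algorithm is implemented in a network-of-workstations environment and its parallel efficiency is compared with that of the M&V algorithm. The results show that, on a network of six workstations, the new algorithm improves performance by up to 40%.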

18.
《Parallel Computing》1986,3(1):25-34
Three parallel algorithms for computing the QR-factorization of a matrix are presented. The discussion is primarily concerned with implementation of these algorithms on a computer that supports tightly coupled parallel processes sharing a large common memory. The three algorithms are a Householder method based upon high-level modules, a Windowed Householder method that avoids fork-join synchronization, and a Pipelined Givens method that is a variant of the data-flow type algorithms offering large enough granularity to mask synchronization costs. Numerical experiments were conducted on the Denelcor HEP computer. The computational results indicate that the Pipelined Givens method is preferred and that this is primarily due to the number of array references required by the various algorithms.

19.
A Parallel Solver for Circulant Toeplitz Tridiagonal Systems on Hypercubes   Cited by: 1 (self-citations: 0, citations by others: 1)
Circulant Toeplitz tridiagonal systems arise in many engineering applications. This paper presents a fast parallel algorithm for solving this type of system. The number of floating-point operations required by our algorithm is less than that of the previous parallel algorithm [cf. Kim and Lee (1990)] for solving similar systems. Specifically, an overlapping technique is proposed to reduce the communication steps required. In addition, an error analysis is given. The algorithm has been implemented on the nCUBE2/E with 16 processors. The experimental results show that the speedup is almost linearly proportional to the number of processors.

20.
In this paper, we consider the application of the conjugate gradient method specifically to solve nonsymmetric systems which are large, tridiagonal and Toeplitz. Under the condition that the system is diagonally dominant, one can pre-multiply the system by the transpose of the coefficient matrix and take advantage of the structure of the new coefficient matrix to perturb and factor it. This allows us to divide the solution task into pairs of tridiagonal, symmetric Toeplitz systems and to solve these pairs of systems using a parallel implementation of conjugate gradient. Final corrections, to account for the perturbations, provide a numerical approximation to the solution.
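The first step described in this abstract, pre-multiplying by the transpose so that conjugate gradient can be applied to a symmetric positive definite system, can be sketched as follows. This is a plain serial illustration under stated assumptions: the function names and the scalar parameters `lo`, `di`, `up` (sub-, main and super-diagonal values) are hypothetical, and the paper's perturbation, factorization into pairs of symmetric systems, parallel implementation and final corrections are not reproduced.

```python
def toeplitz_tridiag_matvec(lo, di, up, x):
    """Return y = A x, where A is tridiagonal Toeplitz with constant diagonals."""
    n = len(x)
    y = []
    for i in range(n):
        s = di * x[i]
        if i > 0:
            s += lo * x[i - 1]
        if i + 1 < n:
            s += up * x[i + 1]
        y.append(s)
    return y


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def cg_normal_equations(lo, di, up, rhs, tol=1e-12, max_iter=1000):
    """Solve A x = rhs by conjugate gradient on A^T A x = A^T rhs (a sketch)."""
    n = len(rhs)

    def apply_a(v):
        return toeplitz_tridiag_matvec(lo, di, up, v)

    def apply_at(v):
        # The transpose swaps the sub- and super-diagonal values.
        return toeplitz_tridiag_matvec(up, di, lo, v)

    x = [0.0] * n
    r = apply_at(rhs)          # residual of the normal equations at x = 0
    p = r[:]
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 <= tol:
            break
        q = apply_at(apply_a(p))
        step = rs / dot(p, q)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * qi for ri, qi in zip(r, q)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

Working with A^T A squares the condition number, which is one reason diagonal dominance matters here: it keeps the original matrix well conditioned enough for the iteration to converge quickly.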
