期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Fast parallel Preconditioned Conjugate Gradient algorithms for robot manipulator dynamics simulation

Amir Fijany Robert E. Scheid 《Journal of Intelligent and Robotic Systems》1994,9(1-2):73-99

In this paper fast parallel Preconditioned Conjugate Gradient (PCG) algorithms for robot manipulator forward dynamics, or dynamic simulation, problem are presented. By exploiting the inherent structure of the forward dynamics problem, suitable preconditioners are devised to accelerate the iterations. Also, based on the choice of preconditioners, a modified dynamic formulation is used to speedup both serial and parallel computation of each iteration. The implementation of the parallel algorithms on two interconnected processor arrays is discussed and their computation and communication complexities are analyzed. The simulation results for a Puma Arm are presented to illustrate the effectiveness of the proposed preconditioners. With a faster convergence due to preconditioning and a faster computation of iterations due to parallelization, the developed parallel PCG algorithms represent the fastest alternative for parallel computation of the problem withO(n) processors. 相似文献

2.

一种求解H(curl)型椭圆问题的高效并行预条件子及并行实现

冯春生王俊仙舒适《数值计算与计算机应用》2012,33(1):48-58

本文为一类H(curl)型椭圆问题的线性棱有限元方程,构造了一种基于节点辅助空间预条件子(HX预条件子)和基于简单粗空间的非重叠区域分解相结合的预条件子,并为该预条件子设计了并行算法,编制了基于MPI+OpenMP二级并行架构的并行程序.数值实验结果表明基于该预条件子的并行PCG法具有良好的算法可扩展能力和并行可扩展能力. 相似文献

3.

Conjugate residual squared method and its improvement for non-symmetric linear systems

《国际计算机数学杂志》2012,89(7):1578-1590

In this paper, conjugate residual squared (CRS) method for solving linear systems with non-symmetric coefficient matrices is proposed. Moreover, based on the ideas by Gu et al. [An improved bi-conjugate residual algorithm suitable for distributed parallel computing, Appl. Math. Comput. 186 (2007), pp. 1243–1253], we present an improved conjugate residual squared (ICRS) method, which is designed for distributed parallel environments. The improved method reduces two global synchronization points to one by changing the computation sequence in the CRS method and all inner products per iteration are independent, and communication time required for inner product can be overlapped with useful computation. Theoretical analysis shows that the ICRS method has better parallelism and scalability than the CRS method. Finally, some numerical experiments clearly show that the ICRS method can achieve better parallel performance with a higher scalability than the CRS method, and also the improvement percentage of communication is up to 47.33%, which meets our theoretical analysis. 相似文献

4.

一种改进的适合并行计算的共轭剩余算法

刘杰刘兴平迟利华胡庆丰《计算机学报》2006,29(3):495-499

通过改变CR算法的计算次序。提出了一种改进的共轭剩余（ICR）算法．对比CR算法。ICR算法的数值稳定性和CR算法相同,几乎没有增加计算量。但考虑了在MIMD并行机上实现时并行算法的性能,其同步开销减少为CR算法的一半,并且所有内积计算以及矩阵向量乘是独立的,没有数据相关性。可以进行计算与通信的重叠．从理论和实验两个角度来讨论ICR算法的性能,当处理机台数较多时ICR算法的计算速度快于CR算法．在64台处理机机群上进行的数值实验表明,并行ICR算法的计算速度大约比CR算法快30％．相似文献

5.

一种改进的适合并行计算的TFQMR算法

刘杰迟利华胡庆丰李晓梅《计算机研究与发展》2005,42(7):1235-1240

TFQMR算法是一种Krylov子空间算法,常用来求解大型稀疏线性方程组．通过改变TFQMR算法的计算次序,提出了一种改进的TFQMR(ITFQMR)算法．对比TFQMR算法,ITFQMR算法的数值稳定性和TFQMR算法相同,几乎没有增加计算量,但考虑了在MIMD并行机上实现时并行算法的性能,其同步开销减少为TFQMR算法的一半,并且所有内积计算以及矩阵向量乘是独立的,没有数据相关性,可以进行计算与通信的重叠．从理论和实验两个角度来讨论ITFQMR算法的性能,当处理机台数较多时,ITFQMR算法的计算速度快于TFQMR算法．实验说明了在有64台处理机机群上进行,最快的并行ITFQMR算法的计算速度大约比TFQMR算法快20％．相似文献

6.

Massive Parallelization of NMC′s Spectral Model

《Journal of Parallel and Distributed Computing》1994,21(1):140-149

This paper briefly describes the formulation of NMC′s Spectral Model and the numerical methods used in its implementation. The various segments of the numerical algorithm are analyzed for suitability to a massively parallel architecture and a computer algorithm is designed for execution on a CM-200 computer. Special emphasis is placed on the production of a fast Legendre Transform and the minimization of interprocessor communication. Performance and scalability issues are presented. 相似文献

7.

Improving the scalability of a symmetric iterative eigensolver for multi‐core platforms

Hasan Metin Aktulga Chao Yang Esmond G. Ng Pieter Maris James P. Vary 《Concurrency and Computation》2014,26(16):2631-2651

We describe an efficient and scalable symmetric iterative eigensolver developed for distributed memory multi‐core platforms. We achieve over 80% parallel efficiency by major reductions in communication overheads for the sparse matrix‐vector multiplication and basis orthogonalization tasks. We show that the scalability of the solver is significantly improved compared to an earlier version, after we carefully reorganize the computational tasks and map them to processing units in a way that exploits the network topology. We discuss the advantage of using a hybrid OpenMP/MPI programming model to implement such a solver. We also present strategies for hiding communication on a multi‐core platform. We demonstrate the effectiveness of these techniques by reporting the performance improvements achieved when we apply our solver to large‐scale eigenvalue problems arising in nuclear structure calculations. Because sparse matrix‐vector multiplication and inner product computation constitute the main kernels in most iterative methods, our ideas are applicable in general to the solution of problems involving large‐scale symmetric sparse matrices with irregular sparsity patterns. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

8.

SOLVING SPARSE LEAST SQUARES PROBLEMS WITH PRECONDITIONED CGLS METHOD ON PARALLEL DISTRIBUTED MEMORY COMPUTERS

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(4):289-305

Abstract

In this paper we study the parallel aspects of PCGLS, a basic iterative method based on the conjugate gradient method with preconditioner applied to normal equations and Incomplete Modified Gram-Schmidt (IMGS) preconditioner, for solving sparse least squares problems on massively parallel distributed memory computers. The performance of these methods on this kind of architecture is usually limited because of the global communication required for the inner products. We will describe the parallelization of PCGLS and IMGS preconditioner by two ways of improvement. One is to accumulate the results of a number of inner products collectively and the other is to create situations where communication can be overlapped with computation. A theoretical model of computation and communication phases is presented which allows us to determine the optimal number of processors that minimizes the runtime. Several numerical experiments on the Parsytec GC/PowerPlus are presented. 相似文献

9.

Transputer arrays for the on-line computation of robot jacobians

A. Y. H. Zomaya A. S. Morris 《Concurrency and Computation》1992,4(5):399-412

On-line computation of forward and inverse Jacobian matrices is essential in robot manipulator controllers, where high-speed robot motion is required. The complexity of Jacobian calculation is such that the computational burden is large, and parallel processing is necessary if on-line computation is to be achieved. Various algorithms and parallel-processing networks suitable for this are considered. All algorithms have been implemented on transputer networks and computation times measured. The paper emphasises the importance of including communication overheads in comparisons of the computational efficiency of alternative algorithms and processor networks. Theoretical processing times based on computer cycle times and arithmetic operation counts are shown to be a false basis for comparison. Whilst considering the specific case of computation of Jacobian matrices for a robot manipulator, the paper provides a useful example of the considerations and constraints involved in distributing any algorithm across a multi-processor network. 相似文献

10.

Benchmarking the computation and communication performance of the CM-5

Kivanc Dincer Zeki Bozkus Sanjay Ranka Geoffrey Fox 《Concurrency and Computation》1996,8(1):47-69

Thinking Machines' CM-5 machine is a distributed-memory, message-passing computer. In the paper we devise a performance benchmark for the base and vector units and the data communication networks of the CM-5 machine. We model the communication characteristics such as communication latency and bandwidths of point-to-point and global communication primitives. We show, on a simple Gaussian elimination code, that an accurate static performance estimation of parallel algorithms is possible by using those basic machine properties connected with computation, vectorization, communication and synchronization. Futhermore, we describe the embedding of meshes or hypercubes on the CM-5 fat-tree topology and illustrate the performance results of their basic communication primitives. 相似文献

11.

A comparison of projective and direct solvers for finite elements in elastostatics

Carlo Janna Andrea Comerlati Giuseppe Gambolati 《Advances in Engineering Software》2009,40(8):675-685

The Finite Element Method (FEM) is widely used in civil and mechanical engineering to simulate the behavior of complex structures and, more specifically, to predict stress and deformation fields of structural parts or mechanical bodies. In the former case, the coupling between different types of elements, such as beams, trusses, and shells, is often required, while in the latter fully 3D discretizations are typically used. For both, FEM leads to symmetric positive definite (SPD) matrices that, depending on the type of discretization and especially on the topology of the nodal connections, may be efficiently solved by either the Preconditioned Conjugate Gradient (PCG) or a direct solver such as the routine MA57 of the Harwell Software Library. Numerical experiments are shown and discussed where the effect of spatial discretization, different solution techniques, and a possible nodal reordering, is explored. The PCG preconditioner used is a variant of the incomplete Cholesky factorization with variable fill-in. It is shown that for structures with 1D or 2D connections, such as for example a bridge, MA57 performs usually better than PCG. In this case it is noted that some reorderings specifically designed and implemented for direct elimination methods can be very helpful for PCG as well as they yield a cheaper preconditioner and lead to a much faster PCG convergence. The main disadvantage is the need for an appropriate degree of fill-in for the preconditioner which turns out to be problem dependent and must be found empirically. However, in fully 3D problems, arising for example from the FE discretization of structural components or geomechanical structures, PCG outperforms MA57 while also requiring much less memory, and thus allowing for the use of much refined grids, if needed. With the aid of a large geomechanical problem it is shown that direct solvers may not be (even) used on serial computers due to their prohibitive computational cost with PCG the only viable alternative solver. 相似文献

12.

求解Maxwell线性棱元鞍点系统的并行Uzawa算法

王俊仙舒适冯春生《计算机工程与科学》2009,31(11)

本文针对一类Maxwell方程组鞍点问题的第一类N啨d啨lec线性棱元离散系统,设计了一种基于节点辅助空间预条件子的并行Uzawa算法(HX-Uzawa-p)。数值实验结果表明,不论是对光滑系数还是对有无浮动子区域及有无内交叉点的跳系数情形,我们所设计的并行算法HX-Uzawa-p的迭代次数都基本不依赖于网格规模及系数跳幅,且具有很好的并行可扩展性。相似文献

13.

Exploiting thread-level parallelism in the iterative solution of sparse linear systems

José I. Aliaga Matthias Bollhöfer Alberto F. Mart?´n Enrique S. Quintana-Ort?´ 《Parallel Computing》2011,37(3):183-202

We investigate the efficient iterative solution of large-scale sparse linear systems on shared-memory multiprocessors. Our parallel approach is based on a multilevel ILU preconditioner which preserves the mathematical semantics of the sequential method in ILUPACK. We exploit the parallelism exposed by the task tree corresponding to the nested dissection hierarchy (task parallelism), employ dynamic scheduling of tasks to processors to improve load balance, and formulate all stages of the parallel PCG method conformal with the computation of the preconditioner to increase data reuse. Results on a CC-NUMA platform with 16 processors reveal the parallel efficiency of this solution. 相似文献

14.

Low-latency communication over ATM networks using active messages 总被引：1，自引：0，他引：1

von Eicken T. Basu A. Buch V. 《Micro, IEEE》1995,15(1):46-53

Today's communication architectures for parallel machines reduce communication overheads and latencies by over an order of magnitude. However, carrying over these techniques to workstation clusters connected by an ATM network presents major design challenges. We discuss the differences in communication characteristics between workstation clusters built from standard hardware and software components and state-of-the-art multiprocessors, and then evaluate a prototype implementation of an active message communication layer. Application round-trip latencies of about 50 microseconds for small messages roughly compare to a similar implementation on the Thinking Machines CM-5 multiprocessor 相似文献

15.

Parallel computation of manipulator inverse dynamics

Amir Fijany Antal K. Bejczy 《野外机器人技术杂志》1991,8(5):599-635

In this article, parallel computation of manipulator inverse dynamics is investigated. A hierarchical graph-based mapping approach is devised to analyze the inherent parallelism in the Newton-Euler formulation at several computational levels, and to derive the features of an abstract architecture for exploitation of parallelism. At each level, a parallel algorithm represents the application of a parallel model of computation that transforms the computation into a graph whose structure defines the features of an abstract architecture, i.e., number of processors, communication structure, etc. Data flow analysis is employed to derive the time lower bound in the computation as well as the sequencing of the abstract architecture. The features of the target architecture are defined by optimization of the abstract architecture to exploit maximum parallelism while minimizing various overheads and architectural complexity. An algorithmically specialized, highly parallel, MIMD-SIMD architecture is designed and implemented that is capable of efficient exploitation of parallelism at several computational levels. The computation time of the Newton-Euler formulation for a 6-degree-of-freedom (dof) general manipulator is measured as 187 μs. The increase in computation time for each additional dof is 23 μs, which leads to a computation time of less than 500 μs, even for a 12-dof redundant arm. 相似文献

16.

Parallel scheduling of the PCG method for banded matrices rising from FDM/FEM

E. M. Ortigosa L. F. Romero J. I. Ramos 《Journal of Parallel and Distributed Computing》2003,63(12):1243-1256

An implicit time-linearized finite difference discretization of partial differential equations on regular/structured meshes results in an n-diagonal block system of algebraic equations, which is usually solved by means of the Preconditioned Conjugate Gradient (PCG) method. In this paper, an analysis of the parallel implementation of this method on several computer architectures and for several programming paradigms is presented. For three-dimensional regular/structured meshes, a new implementation of the PCG method with Jacobi preconditioner is proposed. For the computer architectures and number of processors employed in this study, it has been found that this implementation is more efficient than the standard one, and can be applied to narrow-band matrices and other preconditioners, such as, for example, polynomial ones. 相似文献

17.

Parallel methods for optimality criteria-based topology optimization

《Computer Methods in Applied Mechanics and Engineering》2005,194(34-35):3637-3667

Topology optimization problems require the repeated solution of finite element problems that are often extremely ill-conditioned due to highly heterogeneous material distributions. This makes the use of iterative linear solvers inefficient unless appropriate preconditioning is used. Even then, the solution time for topology optimization problems is typically very high. These problems are addressed by considering the use of non-overlapping domain decomposition-based parallel methods for the solution of topology optimization problems. The parallel algorithms presented here are based on the solid isotropic material with penalization (SIMP) formulation of the topology optimization problem and use the optimality criteria method for iterative optimization. We consider three parallel linear solvers to solve the equilibrium problem at each step of the iterative optimization procedure. These include two preconditioned conjugate gradient (PCG) methods: one using a diagonal preconditioner and one using an incomplete LU factorization preconditioner with a drop tolerance. A third substructuring solver that employs a hybrid of direct and iterative (PCG) techniques is also studied. This solver is found to be the most effective of the three solvers studied, both in terms of parallel efficiency and in terms of its ability to mitigate the effects of ill-conditioning. In addition to examining parallel linear solvers, we consider the parallelization of the iterative optimality criteria method. To tackle checkerboarding and mesh dependence, we propose a multi-pass filtering technique that limits the number of “ghost” elements that need to be exchanged across interprocessor boundaries. 相似文献

18.

Reliable scalable symbolic computation: The design of SymGridPar2

《Computer Languages, Systems and Structures》2014,40(1):19-35

Symbolic computation is an important area of both Mathematics and Computer Science, with many large computations that would benefit from parallel execution. Symbolic computations are, however, challenging to parallelise as they have complex data and control structures, and both dynamic and highly irregular parallelism. The SymGridPar framework (SGP) has been developed to address these challenges on small-scale parallel architectures. However the multicore revolution means that the number of cores and the number of failures are growing exponentially, and that the communication topology is becoming increasingly complex. Hence an improved parallel symbolic computation framework is required.This paper presents the design and initial evaluation of SymGridPar2 (SGP2), a successor to SymGridPar that is designed to provide scalability onto 10⁵ cores, and hence also provide fault tolerance. We present the SGP2 design goals, principles and architecture. We describe how scalability is achieved using layering and by allowing the programmer to control task placement. We outline how fault tolerance is provided by supervising remote computations, and outline higher-level fault tolerance abstractions.We describe the SGP2 implementation status and development plans. We report the scalability and efficiency, including weak scaling to about 32,000 cores, and investigate the overheads of tolerating faults for simple symbolic computations. 相似文献

19.

Application of the incomplete Cholesky factorization preconditioned Krylov subspace method to the vector finite element method for 3-D electromagnetic scattering problems

Liang Li Yan-Fei Jing 《Computer Physics Communications》2010,181(2):271-276

The incomplete Cholesky (IC) factorization preconditioning technique is applied to the Krylov subspace methods for solving large systems of linear equations resulted from the use of edge-based finite element method (FEM). The construction of the preconditioner is based on the fact that the coefficient matrix is represented in an upper triangular compressed sparse row (CSR) form. An efficient implementation of the IC factorization is described in detail for complex symmetric matrices. With some ordering schemes our IC algorithm can greatly reduce the memory requirement as well as the iteration numbers. Numerical tests on harmonic analysis for plane wave scattering from a metallic plate and a metallic sphere coated by a lossy dielectric layer show the efficiency of this method. 相似文献

20.

Compiler techniques for the distribution of data and computation

Navarro A. Zapata E. Padua D. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(6):545-562

This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a locality-communication graph (ICG) and the formulation of the compiler technique as a mixed integer nonlinear programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes the process of how the compiler extracts the locality information from a nonannotated code and focuses on how this compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion about our model and the solutions - the decompositions - that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines. 相似文献