首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
《Parallel Computing》1997,23(6):637-647
This paper considers the parallel solution of finite element equations using the preconditioned conjugate gradient method on shared memory multiprocessors. The preconditioner used is an incomplete LU factorization of the stiffness matrix. We have designed a new method for implementing parallel forward and backward substitutions which requires the L and U factors to have an independent column structure, which we refer to as I2LU. The algorithm has been implemented on an Encore Multimax and parallel efficiencies up to 80% of the maximum possible theoretical value have been obtained using around 6–8 processors. These figures are shown to be comparable with the efficiencies of an implementation of a similar row-oriented approach, based on level scheduling, tested with the same problems.  相似文献   

2.
This paper analyzes the performance and scalability of an iteration of the preconditioned conjugate gradient algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5 parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5 whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated Incomplete Cholesky (IC) preconditioner can lead to further improvement in scalability on the CM-5 by a constant factor,as a result, a parallel formulation of the PCG algorithm with IC preconditioner may execute faster than that with a simple diagonal preconditioner even if the latter runs faster in a serial implementation  相似文献   

3.
Soft errors are becoming a prominent problem for massive parallel scientific applications. Dual-modular redundancy (DMR) can provide approximately 100% error coverage, but it has the problem of overhead excessive. Stencil kernel is one of the most important routines applied in the context of structured grids. In this paper, we propose Grid Sampling DMR (GS-DMR), a low-overhead soft error detection scheme for stencil-based computation. Instead of comparing the whole set of the results in the traditional DMR, GS-DMR just compares a subset of the results according to sampling on the grid data, which is based on the error propagation pattern on the grid. We also design a fault tolerant (FT) framework combining GS-DMR with checkpoint technology, and provide theoretical analysis and an algorithm for the optimal FT parameters. Experimental results on Tianhe-2 supercomputer demonstrate that GS-DMR can achieve a good FT effect for stencil-based computation, and the effect is greatly improved for massively parallel applications, reducing the total FT overhead up to 51%.  相似文献   

4.
The main idea of this paper is in determination of the pattern of nonzero elements of the LU factors of a given matrix A. The idea is based on taking the powers of the Boolean matrix derived from A. This powers of a Boolean matrix strategy (PBS) is an efficient, effective, and inexpensive approach. Construction of an ILU preconditioner using PBS is described and used in solving large nonsymmetric sparse linear systems. Effectiveness of the proposed ILU preconditioner in solving large nonsymmetric sparse linear systems by the GMRES method is also shown. Numerical experiments are performed which show that it is possible to considerably reduce the number of GMRES iterations when the ILU preconditioner constructed here is used. In numerical examples, the influence of k, the dimension of the Krylov subspace, on the performance of the GMRES method using an ILU preconditioner is tested. For all the tests carried out, the best value for k is found to be 10.  相似文献   

5.
In this paper we propose and evaluate a new data-prefetching technique for cache coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40–50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach is when a trap handler completes after data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it uses prefetching with very little instruction overhead, which is a limitation for traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining a high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would not require any additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead less than 0.42%. The emulated prefetch engine removes in general less stall time at a higher instruction overhead.  相似文献   

6.
In this paper a preconditioned iterative method suitable for the solution of the generalized eigenvalue problem Ax = λBx is presented. The proposed method is suitable for the determination of extreme eigenvalues and their corresponding eigenvectors of large sparse eigenproblems derived from the finite element discretization method. The solution is obtained through the minimization of the Rayleigh quotient by a preconditioned conjugate gradient (CG) method. The proposed triangular splitting preconditioner combines Evans' SSOR preconditioner with a drop-off tolerance criterion and is called partial preconditioner (PPR). The PPR is attractive in a large FE framework because it is simple and can be implemented at the element level, as opposed to incomplete Choleski preconditioners, which require a sparse assembly. Because of the renewed interest in CG techniques for FE work on microprocessors and parallel computers it is believed that this improved approach to the generalized eigenvalue problem, through the minimization of the Rayleigh quotient, is likely to be very promising.  相似文献   

7.
Atomistic simulations of thin film deposition, based on the lattice Monte Carlo method, provide insights into the microstructure evolution at the atomic level. However, large-scale atomistic simulation is limited on a single computer—due to memory and speed constraints. Parallel computation, although promising in memory and speed, has not been widely applied in these simulations because of the intimidating overhead. The key issue in achieving optimal performance is, therefore, to reduce communication overhead among processors. In this paper, we propose a new parallel algorithm for the simulation of large-scale thin film deposition incorporating two optimization strategies: (1) domain decomposition with sub-domain overlapping and (2) asynchronous communication. This algorithm was implemented both on message-passing-processor systems (MPP) and on cluster computers. We found that both architectures are suitable for parallel Monte Carlo simulation of thin film deposition in either a distributed memory mode or a shared memory mode with message-passing libraries.  相似文献   

8.
A parallel multilevel preconditioner based on domain decomposition and fictitious domain methods has been presented for the solution of the Poisson equation in complicated geometries. Rectangular blocks with matching grids on interfaces on a structured rectangular mesh have been used for the decomposition of the problem domain. Sloping sides or curved boundary surfaces are approximated using stepwise surfaces formed by the grid cells. A seven-point stencil based on the central difference scheme has been used for the discretization of the Laplacian for both interior and boundary grid points, and this results in a symmetric linear algebraic system for any type of boundary condition. The preconditioned conjugate gradient method has been used for the solution of this symmetric system. The multilevel preconditioner for the CG is based on a V-cycle multigrid applied to the Poisson equation on a fictitious domain formed by the union of the rectangular blocks used for the domain decomposition. Numerical results are presented for two typical Poisson problems in complicated geometries—one related to heat conduction, and the other one arising from the LES/DNS of incompressible turbulent flow over a packed array of spheres. These results clearly show the efficiency and robustness of the proposed approach.  相似文献   

9.
The most efficient way to parallelize computation is to build and evaluate the task graph constrained only by the data dependencies between the tasks. Both Intel's C++ Concurrent Collections (CnC) and Threading Building Blocks (TBB) libraries allow such task‐based parallel programming. CnC also adapts the macro data flow model by providing only single‐assignment data objects in its global data space. Although CnC makes parallel programming easier, by specifying data flow dependencies only through single‐assignment data objects, its macro data flow model incurs overhead. Intel's C++ CnC library is implemented on top of its C++ TBB library. We can measure the overhead of CnC by comparing its performance with that of TBB. In this paper, we analyze all three types of data dependencies in the tiled in‐place Gauss–Jordan elimination algorithm for the first time. We implement the task‐based parallel tiled Gauss–Jordan algorithm in TBB using the data dependencies analyzed and compare its performance with that of the CnC implementation. We find that the overhead of CnC over TBB is only 12%– 15% of the TBB time, and CnC can deliver as much as 87%– 89% of the TBB performance for Gauss–Jordan elimination, using the optimal tile size. Copyright © 2012 John Wiley & Sons, Ltd.  相似文献   

10.
We have developed a parallel algorithm for radial basis function (rbf) interpolation that exhibits O(N) complexity, requires O(N) storage, and scales excellently up to a thousand processes. The algorithm uses a gmres iterative solver with a restricted additive Schwarz method (rasm) as a preconditioner and a fast matrix-vector algorithm. Previous fast rbf methods — achieving at most O(NlogN) complexity — were developed using multiquadric and polyharmonic basis functions. In contrast, the present method uses Gaussians with a small variance with respect to the domain, but with sufficient overlap. This is a common choice in particle methods for fluid simulation, our main target application. The fast decay of the Gaussian basis function allows rapid convergence of the iterative solver even when the subdomains in the rasm are very small. At the same time we show that the accuracy of the interpolation can achieve machine precision. The present method was implemented in parallel using the petsc library (developer version). Numerical experiments demonstrate its capability in problems of rbf interpolation with more than 50 million data points, timing at 106 s (19 iterations for an error tolerance of 10? 15) on 1024 processors of a Blue Gene/L (700 MHz PowerPC processors). The parallel code is freely available in the open-source model.  相似文献   

11.
DESH: overhead reduction algorithms for deferrable scheduling   总被引:1,自引:0,他引:1  
Although the deferrable scheduling algorithm for fixed priority transactions (DS-FP) has been shown to be a very effective approach for minimizing real-time update transaction workload, it suffers from its on-line scheduling overhead. In this work, we propose two extensions of DS-FP to minimize the on-line scheduling overhead. The proposed algorithms produce a hyperperiod from DS-FP so that the schedule generated by repeating the hyperperiod infinitely satisfies the temporal validity constraint of the real-time data. The first algorithm, named DEferrable Scheduling with Hyperperiod by Schedule Construction (DESH-SC), searches the DS-FP schedule for a hyperperiod. The second algorithm, named DEferrable Scheduling with Hyperperiod by Schedule Adjustment (DESH-SA), adjusts the DS-FP schedule in an interval to form a hyperperiod. Our experimental results demonstrate that while both DESH-SC and DESH-SA can reduce the scheduling overhead of DS-FP, DESH-SA outperforms DESH-SC by accommodating significantly more update transactions in the system. Moreover, DESH-SA can also achieve near-optimal update workload.  相似文献   

12.
We consider numeration systems where digits are integers and the base is an algebraic number β such that |β|>1 and β satisfies a polynomial where one coefficient is dominant in a certain sense. For this class of bases β, we can find an alphabet of signed-digits on which addition is realizable by a parallel algorithm in constant time. This algorithm is a kind of generalization of the one of Avizienis. We also discuss the question of cardinality of the used alphabet, and we are able to modify our algorithm in order to work with a smaller alphabet. We then prove that β satisfies this dominance condition if and only if it has no conjugate of modulus 1. When the base β is the Golden Mean, we further refine the construction to obtain a parallel algorithm on the alphabet {−1,0,1}. This alphabet cannot be reduced any more.  相似文献   

13.
We present a variant of the HMC algorithm with mass preconditioning (Hasenbusch acceleration) and multiple time scale integration. We have tested this variant for standard Wilson fermions at β=5.6 and at pion masses ranging from 380 to 680 MeV. We show that in this situation its performance is comparable to the recently proposed HMC variant with domain decomposition as preconditioner. We give an update of the “Berlin Wall” figure, comparing the performance of our variant of the HMC algorithm to other published performance data. Advantages of the HMC algorithm with mass preconditioning and multiple time scale integration are that it is straightforward to implement and can be used in combination with a wide variety of lattice Dirac operators.  相似文献   

14.
We describe a dynamic load-balancing algorithm for ray-tracing by progressive refinement on a distributed-memory parallel computer. Parallelization of progressive ray-tracing for single images is difficult because of the inherent sequential nature of the sample location generation process, which is optimized (and different) for any given image. Parallelization of progressive ray-tracing when generating image sequences at a fixed interactive rate is even more difficult, because of the time and synchronization constraints imposed on the system. The fixed frame rate requirement complicates matters and even renders meaningless traditional measures of parallel system performance (e.g., speedup). We show how to overcome these problems, which, to the best of our knowledge, have not been treated before. Exploiting the temporal coherence between frames enables us to both accelerate rendering and improve the load-balance throughout the sequence. Our dynamic load-balance algorithm combines local and global methods to account not only for rendering performance, but also for communication overhead and synchronization issues. The algorithm is shown to be robust to the harsh environment imposed by a time-critical application, such as the one we consider.  相似文献   

15.
In this paper we use the Lanczos process for preconditioning discrete ill-posed problems. We show that by few steps of this process one can obtain a well qualified and efficient preconditioner. This is a general method in the sense that it is not limited only to special structured matrices and the matrix–vector multiplications can be carried out in O(n) operations. Also even in problems with structured matrices this preconditioner performs more efficiently than the circulant and Kronecker product approximate preconditioners.  相似文献   

16.
The dynamic lot-sizing model (DLS) is one of the most frequently used models in production and inventory system because lot decisions can greatly affect the performance of the system. The practicality of DLS algorithms is hindered by the huge amount of computer resources required for solving these models, even for a modest problem. This study developed a parallel algorithm to solve the lot-sizing problem efficiently. Given that n is the size of the problem, the complexity of the proposed parallel algorithm is O(n2p) with p processors. Numerical experiments are provided to verify the complexity of the proposed algorithm. The empirical results demonstrate that the speedup of this parallel algorithm approaches linearity, which means that the proposed algorithm can take full advantage of the distributed computing power as the size of the problem increases.  相似文献   

17.
We consider the following problem. For a binary tree T = (V, E) where V = {1, 2, ..., n}, given its inorder traversal and either its preorder or its postorder traversal, reconstruct the binary tree. We present a new parallel algorithm for this problem. Our algorithm requires O(n) space. The main idea of our algorithm is to reduce the reconstruction process to merging two sorted sequences. With the best parallel merging algorithms, our algorithm can be implemented in O(log log n) time using O(n/log log n) processors on the CREW PRAM (or in O(log n) time using O(n/log n) processors on the EREW PRAM). Our result provides one more example of a fundamental problem which can be solved by optimal parallel algorithms in O(log log n)time on the CREW PRAM.  相似文献   

18.
The successive projection algorithm (SPA) can quickly solve a nonnegative matrix factorization problem under a separability assumption. Even if noise is added to the problem, SPA is robust as long as the perturbations caused by the noise are small. In particular, robustness against noise should be high when handling the problems arising from real applications. The preconditioner proposed by Gillis and Vavasis (SIAM J Optim 25(1):677–698, 2015) makes it possible to enhance the noise robustness of SPA. Meanwhile, an additional computational cost is required. The construction of the preconditioner contains a step to compute the top-k truncated singular value decomposition of an input matrix. It is known that the decomposition provides the best rank-k approximation to the input matrix; in other words, a matrix with the smallest approximation error among all matrices of rank less than k. This step is an obstacle to an efficient implementation of the preconditioned SPA. To address the cost issue, we propose a modification of the algorithm for constructing the preconditioner. Although the original algorithm uses the best rank-k approximation, instead of it, our modification uses an alternative. Ideally, this alternative should have high approximation accuracy and low computational cost. To ensure this, our modification employs a rank-k approximation produced by an SPA based algorithm. We analyze the accuracy of the approximation and evaluate the computational cost of the algorithm. We then present an empirical study revealing the actual performance of the SPA based rank-k approximation algorithm and the modified preconditioned SPA.  相似文献   

19.
New circulant block-factorization preconditioners are introduced and studied. The general approach is first formulated for the case of block tridiagonal sparse matrices. Then estimates of the relative condition number for a model Dirichlet boundary value problem are derived. In the case ofy-periodic problems the circulant block-factorization preconditioner is shown to give an optimal convergence rate. Finally, using a proper imbedding of the original Dirichlet boundary value problem to ay-periodic one a preconditioner of optimal convergence rate for the general case is obtained. The total computational cost of the preconditioner isO (N logN) (based on FFT), whereN is the number of unknowns. That is, the algorithm is nearly optimal. Various numerical tests that demonstrate the features of the circulant block-factorization preconditioners are presented.  相似文献   

20.
本文为一类H(curl)型椭圆问题的线性棱有限元方程,构造了一种基于节点辅助空间预条件子(HX预条件子)和基于简单粗空间的非重叠区域分解相结合的预条件子,并为该预条件子设计了并行算法,编制了基于MPI+OpenMP二级并行架构的并行程序.数值实验结果表明基于该预条件子的并行PCG法具有良好的算法可扩展能力和并行可扩展能力.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号