Similar Documents
20 similar documents found (search time: 421 ms)
1.
Transient simulation in circuit simulation tools such as SPICE and Xyce depends on scalable and robust sparse LU factorizations for efficient numerical simulation of circuits and power grids. As the need for simulations of very large circuits grows, the prevalence of multicore architectures enables us to use shared-memory parallel algorithms for such simulations. A parallel factorization is a critical component of such shared-memory parallel simulations. We develop a parallel sparse factorization algorithm that can solve problems from circuit simulations efficiently and map well to architectural features. This new factorization algorithm exposes hierarchical parallelism to accommodate the irregular structures that arise in our target problems. It also uses a hierarchical two-dimensional data layout, which reduces synchronization costs and maps to the memory hierarchy found in multicore processors. We present an OpenMP-based implementation of the parallel algorithm in a new multithreaded solver called Basker in the Trilinos framework. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulation. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to the state-of-the-art solver KLU. Basker outperforms the Intel MKL Pardiso solver (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides a 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.
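Basker itself is not publicly scriptable from this page, but the workflow it accelerates, factor a sparse circuit-like matrix once and reuse the factors for many right-hand sides, can be sketched with SciPy's SuperLU wrapper (an assumption of this illustration; the paper's solver is Basker/KLU, not SuperLU):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Build a small circuit-like sparse matrix: irregular sparsity pattern,
# made diagonally dominant so the factorization is well behaved.
n = 200
rng = np.random.default_rng(0)
rows = rng.integers(0, n, size=600)
cols = rng.integers(0, n, size=600)
vals = rng.standard_normal(600) * 0.1
A = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsc()
A = A + sp.eye(n, format="csc") * 10.0

lu = spla.splu(A)               # sparse LU factorization (SuperLU)
b = rng.standard_normal(n)
x = lu.solve(b)                 # reuse `lu` for every transient time step
print(np.linalg.norm(A @ x - b))
```

In a transient simulation, the factorization (or its symbolic part) is reused across many Newton iterations and time steps, which is why factorization speed dominates total runtime.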

2.
GPU-accelerated preconditioned iterative linear solvers
This work is an overview of our preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors. Our goal is to illustrate the advantages and difficulties encountered when deploying GPU technology to perform sparse linear algebra computations. Techniques for speeding up sparse matrix-vector product (SpMV) kernels and finding suitable preconditioning methods are discussed. Our experiments with an NVIDIA TESLA M2070 show that for unstructured matrices SpMV kernels can be up to 8 times faster on the GPU than the Intel MKL on the host Intel Xeon X5675 processor. The overall performance of the GPU-accelerated Incomplete Cholesky (IC) factorization preconditioned CG method can outperform its CPU counterpart by a smaller factor, up to 3, and the GPU-accelerated incomplete LU (ILU) factorization preconditioned GMRES method can achieve a speed-up nearing 4. However, with preconditioning techniques better suited to GPUs, this performance can be further improved.
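The CPU-side baseline of the pipeline described here, an ILU-preconditioned GMRES solve on a sparse matrix, can be sketched with SciPy (a minimal sketch; the paper's GPU kernels and matrix choices are not reproduced):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# 2-D Poisson-type sparse matrix (structured here for brevity, though the
# paper's interest is in unstructured matrices), solved with GMRES
# preconditioned by an incomplete LU (ILU) factorization.
n = 50
I = sp.eye(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(I, T) + sp.kron(T, I)).tocsc()   # 2500 x 2500
b = np.ones(A.shape[0])

ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
M = spla.LinearOperator(A.shape, matvec=ilu.solve,
                        dtype=np.float64)      # M approximates A^-1

x, info = spla.gmres(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))
```

On a GPU the SpMV inside each GMRES iteration and the triangular solves inside `M` are the kernels being accelerated; the triangular solves are the harder part, which is why the ILU/IC speedups reported above are smaller than the SpMV speedup.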

3.
Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs. Copyright © 2014 John Wiley & Sons, Ltd.
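The task graph a tile-Cholesky runtime schedules consists of four kernel types (POTRF on diagonal tiles, TRSM below them, SYRK/GEMM trailing updates). A sequential NumPy sketch of that tile structure (the scheduling, heterogeneity, and communication machinery of the paper are deliberately omitted):

```python
import numpy as np
from scipy.linalg import solve_triangular

def tiled_cholesky(A, nb):
    """Right-looking tiled Cholesky. Each loop body below is one 'task'
    that a runtime could dispatch to a CPU core or GPU once its tile
    dependencies are satisfied. Requires nb to divide A's order."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        ke = k + nb
        # POTRF: factor the diagonal tile
        A[k:ke, k:ke] = np.linalg.cholesky(A[k:ke, k:ke])
        # TRSM: triangular solves for the tiles below it
        for i in range(ke, n, nb):
            ie = i + nb
            A[i:ie, k:ke] = solve_triangular(
                A[k:ke, k:ke], A[i:ie, k:ke].T, lower=True).T
        # SYRK/GEMM: update the trailing lower-triangular tiles
        for i in range(ke, n, nb):
            ie = i + nb
            for j in range(ke, i + nb, nb):
                je = j + nb
                A[i:ie, j:je] -= A[i:ie, k:ke] @ A[j:je, k:ke].T
    return np.tril(A)

rng = np.random.default_rng(0)
B = rng.standard_normal((64, 64))
S = B @ B.T + 64 * np.eye(64)       # symmetric positive definite
L = tiled_cholesky(S, nb=16)
print(np.allclose(L @ L.T, S))
```

The inner GEMM updates carry no dependencies on each other, which is where the tile formulation exposes its parallelism.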

4.
LU, QR, and Cholesky factorizations are the most widely used methods for solving dense linear systems of equations, and have been extensively studied and implemented on vector and parallel computers. Most of these factorization routines are implemented with block-partitioned algorithms in order to perform matrix-matrix operations, that is, to obtain the highest performance by maximizing reuse of data in the upper levels of memory, such as cache. Since parallel computers have different performance ratios of computation and communication, the optimal computational block sizes differ from machine to machine in order to generate the maximum performance of an algorithm. Therefore, the data matrix should be distributed with the machine-specific optimal block size before the computation. Too small or too large a block size makes achieving good performance on a machine nearly impossible. In such a case, getting better performance may require a complete redistribution of the data matrix. In this paper, we present parallel LU, QR, and Cholesky factorization routines with 'algorithmic blocking' on a two-dimensional block-cyclic data distribution. With algorithmic blocking, it is possible to obtain near-optimal performance irrespective of the physical block size. The routines are implemented on the Intel Paragon and the SGI/Cray T3E and compared with the corresponding ScaLAPACK factorization routines. Copyright © 2001 John Wiley & Sons, Ltd.
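The physical block size referred to here is the unit of the 2-D block-cyclic layout. A small sketch of the owner computation under that layout (function and parameter names are illustrative, not from the paper):

```python
# Map a global matrix entry to its owner under a 2-D block-cyclic
# distribution, the layout used by ScaLAPACK-style factorizations.
def owner(gi, gj, nb, P, Q):
    """Global element (gi, gj), physical block size nb, P x Q process grid.
    Returns ((p, q), (li, lj)): owning process and local indices."""
    bi, bj = gi // nb, gj // nb          # global block coordinates
    p, q = bi % P, bj % Q                # block-cyclic owner
    li = (bi // P) * nb + gi % nb        # local row index on that process
    lj = (bj // Q) * nb + gj % nb        # local column index
    return (p, q), (li, lj)

print(owner(5, 9, nb=2, P=2, Q=2))
```

Algorithmic blocking decouples the *computational* block size used inside the factorization from this physical `nb`, so the mapping above can stay fixed while the algorithm still operates on performance-optimal panels.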

5.
In this paper, we address the problem of preconditioning sequences of large sparse indefinite systems of linear equations and present two new strategies to construct approximate updates of factorized preconditioners. Both updates are based on the availability of an incomplete factorization for one matrix of the sequence and differ in the approximation of the so-called ideal update. For a general treatment, an incomplete LU (ILU) factorization is considered, but the proposed approaches apply to incomplete factorizations of symmetric matrices as well. The first strategy is an approximate diagonal update of the ILU factorization; the second strategy relies on banded approximations of the factors in the ideal update. The efficiency and reliability of the proposed preconditioners are shown in the solution of nonlinear systems of equations by preconditioned Newton-Krylov methods. Nearly matrix-free implementations of the updating strategy are provided, and numerical experiments are carried out on application problems.

6.
We investigate the introduction of look-ahead in two-stage algorithms for the singular value decomposition (SVD). Our approach relies on a specialized reduction for the first stage that produces a band matrix with the same upper and lower bandwidth instead of the conventional upper triangular-band matrix. In the case of a CPU-GPU server, this alternative form accommodates a static look-ahead in the algorithm in order to overlap the reduction of the "next" panel on the CPU with the "current" trailing update on the GPU. For multicore processors, we leverage the same compact form to formulate a version of the algorithm that advances the reduction of "future" panels, yielding a dynamic look-ahead that overcomes the performance bottleneck that the sequential panel factorization represents.

7.
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed-memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block-cyclic data distribution used in all three factorization algorithms is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.

8.
We develop a parallel algorithm for partitioning the vertices of a graph into p ≥ 2 sets in such a way that few edges connect vertices in different sets. The algorithm is intended for a message-passing multiprocessor system, such as the hypercube, and is based on the Kernighan-Lin algorithm for finding small edge separators on a single processor.(1) We use this parallel partitioning algorithm to find orderings for factoring large sparse symmetric positive definite matrices. These orderings not only reduce fill, but also result in good processor utilization and low communication overhead during the factorization. We provide a complexity analysis of the algorithm, as well as some numerical results from an Intel hypercube and a hypercube simulator. Publication of this report was partially supported by the National Science Foundation under Grant DCR-8451385 and by AT&T Bell Laboratories through their Ph.D. scholarship program.

9.
This paper describes the design and implementation of three core factorization routines, LU, QR, and Cholesky, included in the out-of-core extension of ScaLAPACK. These routines allow the factorization and solution of a dense system that is too large to fit entirely in physical memory. The full matrix is stored on disk and the factorization routines transfer sub-matrix panels into memory. The 'left-looking' column-oriented variant of the factorization algorithm is implemented to reduce the disk I/O traffic. The routines are implemented using a portable I/O interface and utilize high-performance ScaLAPACK factorization routines as in-core computational kernels. We present the details of the implementation for the out-of-core ScaLAPACK factorization routines, as well as performance and scalability results on a Beowulf Linux cluster. Copyright © 2000 John Wiley & Sons, Ltd.

10.
This paper concerns an Intel Xeon Phi implementation of the explicit fourth-order Runge-Kutta method (RK4) for very sparse matrices with very short rows. Such matrices arise during Markovian modeling of computer and telecommunication networks. Two implementations were investigated, both using the CSR storage scheme and running on the Intel Xeon Phi: one based on Intel Math Kernel Library (Intel MKL) routines and the authors' own implementation. The implementation based on the Intel MKL library uses the high-performance BLAS and Sparse BLAS routines. In our application we focus on OpenMP-style programming. We implement the SpMV operation and vector addition using basic optimization techniques and vectorization. We evaluate our approach in native and offload modes for various numbers of cores and thread-affinity settings. Both implementations (based on Intel MKL and the authors' own) were compared with respect to time, speedup, and performance. The numerical experiments on the Intel Xeon Phi show that the performance of the authors' implementation is very promising, giving a gain of up to two times over the multithreaded Intel MKL-based implementation running on the CPU (Intel Xeon processor), and up to three times over the application using Intel MKL on the Intel Xeon Phi.
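The numerical core being tuned here is simple: every RK4 stage is one SpMV with the (transposed) Markov generator. A compact sketch with SciPy's CSR type (the generator below is an illustrative birth-death chain, not one of the paper's network models):

```python
import numpy as np
import scipy.sparse as sp

# One RK4 step for dp/dt = p Q, written as SpMVs with Q^T in CSR form.
def rk4_step(p, QT, h):
    k1 = QT @ p
    k2 = QT @ (p + 0.5 * h * k1)
    k3 = QT @ (p + 0.5 * h * k2)
    k4 = QT @ (p + h * k3)
    return p + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Tiny birth-death CTMC generator: very short rows (3 nonzeros each).
n = 1000
Q = sp.diags([np.full(n - 1, 0.5), -np.ones(n), np.full(n - 1, 0.5)],
             [-1, 0, 1])
Q = Q - sp.diags(np.asarray(Q.sum(axis=1)).ravel())  # rows sum to zero
QT = sp.csr_matrix(Q.T)       # transpose once; each stage is one SpMV

pi = np.zeros(n); pi[0] = 1.0
for _ in range(100):
    pi = rk4_step(pi, QT, h=0.01)
print(pi.sum())               # total probability is conserved
```

Because the generator's rows sum to zero, every RK4 stage conserves total probability, a convenient correctness check for an optimized SpMV kernel.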

11.
A multilevel algorithm is presented for direct, parallel factorization of the large sparse matrices that arise from finite element and spectral element discretization of elliptic partial differential equations. Incomplete nested dissection and domain decomposition are used to distribute the domain among the processors and to organize the matrix into sections in which pivoting is applied to stabilize the factorization of indefinite equation sets. The algorithm is highly parallel and memory efficient; the efficient use of sparsity in the matrix allows the solution of larger problems as the number of processors is increased, and minimizes computations as well as the number and volume of communications among the processors. The number of messages and the total volume of messages passed during factorization, which are used as measures of algorithm efficiency, are reduced significantly compared to other algorithms. Factorization times are low and speedups high for implementation on an Intel iPSC/860 hypercube computer. Furthermore, the timings for forward and back substitutions are more than an order-of-magnitude smaller than the matrix decomposition times.
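SciPy does not ship nested dissection, so as a stand-in the mechanics of applying a fill/bandwidth-reducing ordering before factorization can be shown with reverse Cuthill-McKee (plainly a different ordering than the paper's incomplete nested dissection; the symmetric-permutation step is the same):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Random symmetric sparsity pattern, made diagonally dominant.
n = 40
G = sp.random(n, n, density=0.08, random_state=1)
A = (G + G.T).tocsr()
A.data[:] = 1.0
A = (A + sp.eye(n) * n).tocsr()

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
Ap = A[perm, :][:, perm]     # symmetric permutation P A P^T

def bandwidth(M):
    C = M.tocoo()
    return int(np.max(np.abs(C.row - C.col)))

print(bandwidth(A), bandwidth(Ap))   # reordering shrinks the bandwidth
```

Nested dissection orderings additionally expose independent submatrices (the dissected halves) that different processors can factor concurrently, which is what the multilevel algorithm exploits.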

12.
§1. Introduction. Numerical linear algebra computations, exemplified by LU factorization and Cholesky factorization, are widely used in modern scientific research and engineering. With the development of computer technology, the computer algorithms and software that implement these numerical linear algebra computations have also evolved continuously. At present, high-performance RISC computers with multi-level memory hierarchies dominate the field of numerical computing. RISC processors are extremely fast, and the speed gap between them and memory is large, so whether a machine's performance can be fully realized hinges on the effective use of the multi-level memory hierarchy and caches. To this end, current …

13.
Exploiting the special structure of nearly tridiagonal Toeplitz matrices, a new fast algorithm is proposed for solving nearly tridiagonal Toeplitz systems of equations. Building on an approximate LU factorization of the tridiagonal Toeplitz matrix, and combining a divide-and-conquer strategy with Horner's (Qin Jiushao's) scheme and special mathematical techniques that eliminate a large amount of redundant computation, a fast distributed parallel algorithm for solving nearly tridiagonal Toeplitz systems is obtained; it is proved theoretically that the algorithm achieves near-linear speedup. Finally, numerical experiments show that the new parallel algorithm has high parallel efficiency, and that as the matrix order n becomes sufficiently large, its speedup approaches the linear speedup.
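The structural fact the approximate LU factorization exploits is visible in the sequential Thomas elimination: for a tridiagonal Toeplitz matrix with constant diagonals (a, b, c), the LU pivots d_k = b - ac/d_{k-1} converge geometrically to a constant, so far from the boundary the factors are effectively Toeplitz themselves. A sketch of that elimination (the paper's divide-and-conquer parallelization is not reproduced here):

```python
import numpy as np

def toeplitz_tridiag_solve(a, b, c, rhs):
    """Thomas algorithm for a tridiagonal Toeplitz system with
    sub/main/super diagonals a, b, c. Returns (solution, pivot sequence)."""
    n = len(rhs)
    d = np.empty(n)                     # pivots; converge quickly
    y = np.asarray(rhs, float).copy()
    d[0] = b
    for k in range(1, n):               # forward elimination
        m = a / d[k - 1]
        d[k] = b - m * c
        y[k] -= m * y[k - 1]
    x = np.empty(n)                     # back substitution
    x[-1] = y[-1] / d[-1]
    for k in range(n - 2, -1, -1):
        x[k] = (y[k] - c * x[k + 1]) / d[k]
    return x, d

x, d = toeplitz_tridiag_solve(1.0, 4.0, 1.0, np.ones(64))
print(abs(d[-1] - d[-2]))   # pivots converge to (b + sqrt(b^2 - 4ac)) / 2
```

Truncating the pivot recursion once it has converged is what turns this O(n) sequential sweep into a factorization with (approximately) constant coefficients, which is far easier to split across processors.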

14.
An important factorization algorithm for polynomials over finite fields was developed by Niederreiter. The factorization problem is reduced to solving a linear system over the finite field in question, and the solutions are used to produce the complete factorization of the polynomial into irreducibles. One characteristic feature of the linear system arising in the Niederreiter algorithm is that if the polynomial to be factorized is sparse, then so is the Niederreiter matrix associated with it. In this paper, we investigate the special case of factoring trinomials over the binary field. We develop a new algorithm for solving the linear system using sparse Gaussian elimination with the Markowitz ordering strategy. Implementing the new algorithm to solve the Niederreiter linear system for trinomials over F2 suggests that the system is not only initially sparse, but also preserves its sparsity throughout the Gaussian elimination phase. When used with other methods for extracting the irreducible factors from a basis for the solution set, the resulting algorithm provides a more memory-efficient and sometimes faster sequential alternative for achieving high-degree trinomial factorizations over F2.
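The linear-algebra kernel at the heart of this approach, computing a nullspace basis over F2 by Gaussian elimination, can be sketched densely in a few lines (an illustration only: it does not build the Niederreiter matrix, and the paper's version is sparse with Markowitz pivot ordering):

```python
import numpy as np

def nullspace_gf2(A):
    """Nullspace basis of a binary matrix over F2 via Gaussian elimination."""
    A = A.copy() % 2
    rows, cols = A.shape
    pivots, r = [], 0
    for c in range(cols):
        pr = next((i for i in range(r, rows) if A[i, c]), None)
        if pr is None:
            continue                      # free column
        A[[r, pr]] = A[[pr, r]]           # swap pivot row up
        for i in range(rows):             # eliminate column c elsewhere
            if i != r and A[i, c]:
                A[i] ^= A[r]              # row addition mod 2
        pivots.append(c)
        r += 1
    free = [c for c in range(cols) if c not in pivots]
    basis = []
    for f in free:                        # one basis vector per free column
        v = np.zeros(cols, dtype=np.uint8)
        v[f] = 1
        for i, pc in enumerate(pivots):
            v[pc] = A[i, f]
        basis.append(v)
    return basis

A = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=np.uint8)
for v in nullspace_gf2(A):
    print(v, (A @ v) % 2)   # each v satisfies A v = 0 over F2
```

In the Niederreiter algorithm the dimension of this nullspace equals the number of irreducible factors, and the basis vectors drive the factor-extraction phase.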

15.
In this paper, some parallel algorithms are described for solving numerical linear algebra problems on the Dawning-1000. They include matrix multiplication, LU factorization of a dense matrix, Cholesky factorization of a symmetric matrix, and eigendecomposition of symmetric matrices for real and complex data types. These programs are built on the fast BLAS library of the Dawning-1000 under the NX environment. Some comparison results under different parallel environments and implementation methods are also given for Cholesky factorization. The execution time, measured performance, and speedup for each problem on the Dawning-1000 are shown. For matrix multiplication and LU factorization, 1.86 GFLOPS and 1.53 GFLOPS are reached.

16.
In this paper, we use the inherited LU factorization for solving fuzzy linear systems of equations. The inherited LU factorization is a type of LU factorization that is much faster and simpler than the traditional LU factorization. We prove several theorems establishing the conditions under which the inherited LU factorization exists for the coefficient matrix of a fuzzy linear system. Examples illustrate that the proposed method provides a simple way to find the solution of a fuzzy linear system.

17.
In this article we present an adaptation of the QIF (Quadrant Interlocking Factorization) algorithm, which solves systems of linear equations, for implementation on SIMD hypercube computers with distributed memory. The method is based on the WZ decomposition of the system's matrix. The parallel algorithm developed is general in the sense that no restriction is imposed on the size of the problem and that it is independent of the dimension of the hypercube. Comparison of this algorithm with parallel algorithms based on LU factorization shows that the execution time is divided by a factor of approximately two.

18.
The numerical solution of a large-scale variational inequality problem can be obtained using a generalization of an inexact Newton method applied to a semismooth nonlinear system. This approach requires a large sparse linear system to be solved at each step. In this work we obtain an approximate solution of this system by the LSQR algorithm of Paige and Saunders combined with a convenient preconditioner that is a variant of the incomplete LU factorization. Since computing the factorization of the preconditioning matrix can be very expensive and memory consuming, we propose a preconditioner that admits a block factorization, so that the direct factorization is only applied to submatrices of smaller size. Numerical experiments on a set of test problems from the literature show the effectiveness of this approach.
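Right-preconditioned LSQR, solve min ||A M^-1 y - b|| then recover x = M^-1 y, can be sketched with SciPy, using a plain `spilu` in place of the paper's block-factorized ILU variant (an assumption of this sketch):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Sparse test system (tridiagonal, diagonally dominant for simplicity).
n = 400
A = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spla.spilu(A, drop_tol=1e-3)     # M = L U approximately equals A
AM = spla.LinearOperator(
    (n, n),
    matvec=lambda v: A @ ilu.solve(v),                 # (A M^-1) v
    rmatvec=lambda v: ilu.solve(A.T @ v, trans="T"),   # (A M^-1)^T v
    dtype=np.float64,
)
y = spla.lsqr(AM, b, atol=1e-10, btol=1e-10)[0]
x = ilu.solve(y)                       # undo the right preconditioning
print(np.linalg.norm(A @ x - b))
```

LSQR only touches the operator through `matvec`/`rmatvec`, which is exactly why a block-factorized preconditioner (triangular solves on small submatrices) plugs in without ever forming M^-1.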

19.
Dynamic packet classification underpins emerging network services, but the update performance of existing packet-classification algorithms is unsatisfactory. Based on recursive space decomposition and an interpreter approach, we design and implement TICS, a two-stage multidimensional packet-classification algorithm that supports fast incremental updates. It permits incremental rule-set updates through local rebuilding and replacement of data structures, and, with appropriate memory management, allows lookups and updates to proceed concurrently. Experiments show that its update speed is at least an order of magnitude faster than BRPS, currently the fastest-updating algorithm, while consuming little memory and exhibiting good parallel scalability.

20.
Visual and interactive data exploration requires fast and reliable tools for embedding an original data space in 3- (or 2-)dimensional Euclidean space. Multidimensional scaling (MDS) is a good candidate. However, owing to its at least O(M^2) memory and time complexity, MDS is computationally demanding for interactive visualization of data sets consisting of on the order of 10^4 objects on computer systems ranging from PCs with multicore CPUs and graphics processing unit (GPU) boards to midrange MPI clusters. To explore data sets of that size interactively, we have developed novel efficient parallel algorithms for MDS mapping based on virtual particle dynamics. We demonstrate that the performance of our MDS algorithms, implemented in the compute unified device architecture (CUDA) environment on a PC equipped with a modern GPU board (Tesla M2090, GeForce GTX 480), is considerably faster than their MPI/OpenMP parallel implementation on a modern midrange professional cluster (10 nodes, each equipped with 2x Intel Xeon X5670 CPUs). We also show that the hybridized two-level MPI/CUDA implementation, run on a cluster of GPU nodes, can additionally provide a linear speedup. Copyright 2013 John Wiley & Sons, Ltd.
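The virtual-particle idea treats each embedded point as a particle connected by springs whose rest lengths are the original high-dimensional distances. A tiny NumPy sketch of one such stress-descent scheme (step size, iteration count, and data are illustrative; the paper's CUDA/MPI machinery is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 8))                     # original 8-D data
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # target distances

Y = rng.standard_normal((100, 2)) * 0.1               # 2-D embedding
def stress(Y):
    d = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    return ((d - D) ** 2).sum() / 2

lr = 0.0005                                           # step size
s0 = stress(Y)
for _ in range(300):
    diff = Y[:, None] - Y[None, :]
    d = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(d, 1.0)                          # avoid divide-by-zero
    coef = (d - D) / d                                # spring "tension"
    np.fill_diagonal(coef, 0.0)
    grad = 2 * (coef[:, :, None] * diff).sum(axis=1)  # gradient of stress
    Y -= lr * grad
print(s0, stress(Y))                                  # stress drops
```

Each iteration is an all-pairs force computation, the O(M^2) kernel that the paper maps onto GPU threads and MPI ranks.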
