期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Basker: Parallel sparse LU factorization utilizing hierarchical parallelism and data layouts

《Parallel Computing》2017

Transient simulation in circuit simulation tools, such as SPICE and Xyce, depend on scalable and robust sparse LU factorizations for efficient numerical simulation of circuits and power grids. As the need for simulations of very large circuits grow, the prevalence of multicore architectures enable us to use shared memory parallel algorithms for such simulations. A parallel factorization is a critical component of such shared memory parallel simulations. We develop a parallel sparse factorization algorithm that can solve problems from circuit simulations efficiently, and map well to architectural features. This new factorization algorithm exposes hierarchical parallelism to accommodate irregular structure that arise in our target problems. It also uses a hierarchical two-dimensional data layout which reduces synchronization costs and maps to memory hierarchy found in multicore processors. We present an OpenMP based implementation of the parallel algorithm in a new multithreaded solver called Basker in the Trilinos framework. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulation. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to state-of-the-art solver KLU. Basker outperforms Intel MKL Pardiso solver (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation. 相似文献

2.

Scheduling dense linear algebra operations on multicore processors

Jakub Kurzak Hatem Ltaief Jack Dongarra Rosa M. Badia 《Concurrency and Computation》2010,22(1):15-44

State‐of‐the‐art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to their inability to fully exploit thread‐level parallelism. At the same time, the coarse–grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workloads, the Cholesky factorization, the QR factorization and the LU factorization, using dynamic data‐driven execution. Two emerging approaches to implementing coarse–grain dataflow are examined, the model of nested parallelism, represented by the Cilk framework, and the model of parallelism expressed through an arbitrary Direct Acyclic Graph, represented by the SMP Superscalar framework. Performance and coding effort are analyzed and compared against code manually parallelized at the thread level. Copyright © 2009 John Wiley & Sons, Ltd. 相似文献

3.

Accelerating engineering software on modern multi-core processors

《Advances in Engineering Software》2015

Recent multi-core designs migrated from Symmetric Multi Processing to cache coherent Non Uniform Memory Access architectures. In this paper we discuss performance issues that arise when designing parallel Finite Element programs for a 64-core ccNUMA computer and explore solutions for these issues. We first present the overview of the computer architecture and show that highly parallel code that does not take into account the aspects of the system memory organization scales poorly, achieving only 2.8× speedup when running with 64 threads. Then, we discuss how we identified the sources of overhead and evaluate three possible solutions for the problem. We show that the first solution does not require the application’s code to be modified, however, the speedup achieved is only 10.6×. The second solution enables the performance to scale up to 30.9×, however, it requires the programmer to manually schedule threads and allocate related data on local CPUs and memory banks and rely on ccNUMA aware libraries that are not portable across operating systems. Also, we propose and evaluate “copy-on-thread”, an alternative solution that enables the performance to scale up to 25.5× without relying on specialized libraries nor requiring specific data allocation and thread scheduling. Finally, we argue that the issues reported only happen for large data sets and conclude the paper with recommendations to help programmers to design algorithms and programs that perform well on such kind of machine. 相似文献

4.

Supernodal sparse Cholesky factorization on graphics processing units

Dan Zou Yong Dou Song Guo Rongchun Li Lin Deng 《Concurrency and Computation》2014,26(16):2713-2726

Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and the concerns of low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for GPU and propose a queue‐based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree‐based parallel method for multi‐GPU system. These approaches increase GPU utilization, thus resulting in substantial computational time reduction. Comparisons are made with the existing parallel solvers by using problems arising from practical applications. The experiment results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12‐core node, we were able to obtain speedups in the range 1.59× to 2.31× by using one GPU and 1.80× to 3.21× by using two GPUs. Relative to a state‐of‐the‐art solver based on supernodal method for CPU‐GPU heterogeneous platform, we were able to obtain speedups in the range 1.52× to 2.30× by using one GPU and 2.15× to 2.76× by using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

5.

面向申威异构架构的并行代码自动生成

陶小涵朱雨庞建民赵捷徐金龙《软件学报》2023,34(4):1570-1593

异构架构逐渐成为高性能计算领域的主流架构,但相较于同构多核架构,其硬件结构及存储层次更为复杂,程序编写更为困难.先进的优化编译器可以协助程序开发人员实现更为高效的代码,降低程序开发复杂度.多面体编译模型通过抽象分析将程序抽象成空间多面体表示形式,能够将多种循环变换与硬件映射相结合,并面向特定体系结构生成相应的代码.设计实现了一个面向国产申威异构架构的并行代码自动生成系统,采用“源-源”编译模式,基于多面体编译模型实现.系统针对申威异构架构特点将程序计算过程进行硬件部署,同时实现数据传输与内存空间的自动管理.实验基于Polybench测试集中线性代数相关用例进行测试.结果表明,利用代码自动生成系统生成的异构并行代码能够在申威异构平台上正确运行,并能够有效发挥申威异构平台的性能,基于申威异构平台利用64线程加速计算的平均加速比达到了539.16倍. 相似文献

6.

Highly scalable parallel algorithms for sparse matrix factorization

Gupta A. Karypis G. Kumar V. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(5):502-520

In this paper, we describe scalable parallel algorithms for symmetric sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1,024 processors on a Gray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems-both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Gray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer 相似文献

7.

A distributed packed storage for large dense parallel in‐core calculations

Marc Baboulin Luc Giraud Serge Gratton Julien Langou 《Concurrency and Computation》2007,19(4):483-502

In this paper we propose a distributed packed storage format that exploits the symmetry or the triangular structure of a dense matrix. This format stores only half of the matrix while maintaining most of the efficiency compared with a full storage for a wide range of operations. This work has been motivated by the fact that, in contrast to sequential linear algebra libraries (e.g. LAPACK), there is no routine or format that handles packed matrices in the currently available parallel distributed libraries. The proposed algorithms exclusively use the existing ScaLAPACK computational kernels, which proves the generality of the approach, provides easy portability of the code and provides efficient re‐use of existing software. The performance results obtained for the Cholesky factorization show that our packed format performs as good as or better than the ScaLAPACK full storage algorithm for a small number of processors. For a larger number of processors, the ScaLAPACK full storage routine performs slightly better until each processor runs out of memory. Copyright © 2006 John Wiley & Sons, Ltd. 相似文献

8.

Fast Matlab compatible sparse assembly on multicore computers

《Parallel Computing》2016

We develop and implement in this paper a fast sparse assembly algorithm, the fundamental operation which creates a compressed matrix from raw index data. Since it is often a quite demanding and sometimes critical operation, it is of interest to design a highly efficient implementation. We show how to do this, and moreover, we show how our implementation can be parallelized to utilize the power of modern multicore computers. Our freely available code, fully Matlab compatible, achieves about a factor of 5 × in speedup on a typical 6-core machine and 10 × on a dual-socket 16-core machine compared to the built-in serial implementation. 相似文献

9.

Parallel Implementation of Linear Algebra Problems on Dawning-1000

下载免费PDF全文

Chi Xuebin 《计算机科学技术学报》1998,13(2):141-146

In this paper,some parallel algorithms are described for solving numerical linear algebra problems on Dawning-1000.They include matrix multiplication,LU factorization of a dense matrix,Cholesky factorization of a symmetric matrix,and eigendecomposition of symmetric matrix for real and complex data types.These programs are constructed based on fast BLAS library of Dawning-1000 under NX environment.Some comparison results under different parallel environments and implementing methods are also given for Cholesky factorization.The execution time,measured performance and speedup for each problem on Dawning-1000 are shown.For matrix multiplication and IU factorization,1.86GFLOPS and 1.53GFLOPS are reached. 相似文献

10.

Exploiting hardware capabilities in interior point methods

Csaba Mészáros 《Optimization methods & software》2016,31(2):435-443

The increase of computer performance continues to support the practice of large-scale optimization. Computers with multiple computing cores and vector processing capabilities are now widely available. We investigate how the recently introduced Advanced Vector Instruction (AVX) set on Intel-compatible architectures can be exploited in interior point methods for linear and nonlinear optimization. We focus on data structures and implementation techniques that utilize the new vector instructions. Our numerical experiments demonstrate that the AVX instruction set provides a significant performance boost in our implementation on large-scale problem that have significant fill-in in the sparse Cholesky factorization, achieving up to 100 gigaflops performance on a standard desktop computer on linear optimization problems for which the required Cholesky factorization is relatively dense. 相似文献

11.

Accelerating sparse Cholesky factorization on GPUs

《Parallel Computing》2016

Sparse factorization is a fundamental tool in scientific computing. As the major component of a sparse direct solver, it represents the dominant computational cost for many analyses. For factorizations which involve sufficient dense math, the substantial computational capability provided by GPUs (Graphics Processing Units) can help alleviate this cost. However, for many other cases, the prevalence of small/irregular dense math and the relatively slow communication between the host and device over the PCIe bus, make it challenging to significantly accelerate sparse factorization using the GPU.In this paper we describe a left-looking supernodal Cholesky factorization algorithm which permits improved utilization of the GPU when factoring sparse matrices. The central idea is to stream subtrees of the elimination tree through the GPU and perform the factorization of each subtree entirely on the GPU. This avoids the majority of the PCIe communication without the need for a complex task scheduler. Importantly, within these subtrees, many independent, small, dense operations are batched to minimize kernel launch overhead and many of these batched kernels are executed concurrently to maximize device utilization.Performance results for commonly studied matrices are presented along with suggested actions for further optimization. 相似文献

12.

A quasi-Newton method with Cholesky factorization

Dr. J. Bräuninger 《Computing》1980,25(2):155-162

A quasi-Newton method for unconstrained minimization is presented, which uses a Cholesky factorization of an approximation to the Hessian matrix. In each step a new row and column of this approximation matrix is determined and its Cholesky factorization is updated. This reduces storage requirements and simplifies the calculation of the search direction. Precautions are taken to hold the approximation matrix positive definite. It is shown that under usual conditions the method converges superlinearly or evenn-step quadratic. 相似文献

13.

Fast learning of scale‐free networks based on Cholesky factorization

下载免费PDF全文

Vladisav Jelisavcic Ivan Stojkovic Veljko Milutinovic Zoran Obradovic 《国际智能系统杂志》2018,33(6):1322-1339

Recovering network connectivity structure from high‐dimensional observations is of increasing importance in statistical learning applications. A prominent approach is to learn a Sparse Gaussian Markov Random Field by optimizing regularized maximum likelihood, where the sparsity is induced by imposing L₁ norm on the entries of a precision matrix. In this article, we shed light on an alternative objective, where instead of precision, its Cholesky factor is penalized by the L₁ norm. We show that such an objective criterion possesses attractive properties that allowed us to develop a very fast Scale‐Free Networks Estimation Through Cholesky factorization (SNETCH) optimization algorithm based on coordinate descent, which is highly parallelizable and can exploit an active set approach. The approach is particularly suited for problems with structures that allow sparse Cholesky factor, an important example being scale‐free networks. Evaluation on synthetically generated examples and high‐impact applications from a biomedical domain of up to more than 900,000 variables provides evidence that for such tasks the SNETCH algorithm can learn the underlying structure more accurately, and an order of magnitude faster than state‐of‐the‐art approaches based on the L₁ penalized precision matrix. 相似文献

14.

Experimental evaluation and improvements to linear scan register allocation

Konstantinos Sagonas Erik Stenman 《Software》2003,33(11):1003-1034

We report our experience from implementing and experimentally evaluating the performance of various register allocation schemes, focusing on the recently proposed linear scan register allocator. In particular, we describe in detail our implementation of linear scan and report on its behavior both on register‐rich and on register‐poor computer architectures. We also extensively investigate how different options to the basic algorithm and to the compilation process as a whole affect compilation times and quality of the produced code. In a nutshell, our experience is that a well‐tuned linear scan register allocator is a good choice on register‐rich architectures. It performs competitively with graph coloring based allocation schemes and results in significantly lower compilation times. When compilation time is a concern, such as in just‐in‐time compilers, it can also be a viable option on register‐poor architectures. Copyright © 2003 John Wiley & Sons, Ltd. 相似文献

15.

A method for updating Cholesky factorization of a band matrix

Wei H. Yang 《Computer Methods in Applied Mechanics and Engineering》1977,12(3):281-288

A method is presented for updating the Cholesky factorization of a band symmetric matrix modified by a rank-one matrix which has the same band width. Problems which could involve applications of such a method arise frequently in plasticity and structural optimization where repeated solutions of a band algebraic system with a changing matrix are needed. The Cholesky factorization of a stiffness matrix can be updated after modifying a local stiffness matrix which can be written as a sum of a few rank-one matrices. The number of operations required for the updating is of the order mn or less, where n is the dimension of the global matrix and m is its half band width (including the diagonal). 相似文献

16.

Hardware–software optimizations of reconfigurable multi-core processors for floating-point computations of large sparse matrices

Xiaofang Wang 《Journal of Real-Time Image Processing》2014,9(1):187-204

State-of-the-art field-programmable gate array (FPGA) technologies have provided exciting opportunities to develop more flexible, less expensive, and better performance floating-point computing platforms for embedded systems. To better harness the full power of FPGAs and to bring FPGAs to more system designers, we investigate unique advantages and optimization opportunities in both software and hardware offered by multi-core processors on a programmable chip (MPoPCs). In this paper, we present our hardware customization and software dynamic scheduling solutions for LU factorization of large sparse matrices on in-house developed MPoPCs. Theoretical analysis is provided to guide the design. Implementation results on an Altera Stratix III FPGA for five benchmark matrices of size up to 7,917 × 7,917 are presented. Our hardware customization alone can reduce the execution time by up to 17.22 %. The integrated hardware–software optimization improves the speedup by an average of 60.30 %. 相似文献

17.

Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices

《Parallel Computing》2019

Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen. 相似文献

18.

AutoConfig: 面向深度学习编译优化的自动配置机制

张洪滨周旭林邢明杰武延军赵琛《软件学报》2024,35(6)

随着深度学习模型和硬件架构的快速发展,深度学习编译器已经被广泛应用.目前,深度学习模型的编译优化和调优的方法主要依赖基于高性能算子库的手动调优和基于搜索的自动调优策略.然而,面对多变的目标算子和多种硬件平台的适配需求,高性能算子库往往需要为各种架构进行多次重复实现.此外,现有的自动调优方案也面临着搜索开销大和缺乏可解释性的挑战.为了解决上述问题,本文提出了AutoConfig,一种面向深度学习编译优化的自动配置机制.针对不同的深度学习计算负载和特定的硬件平台,AutoConfig可以构建具备可解释性的优化算法分析模型,采用静态信息提取和动态开销测量的方法进行综合分析,并基于分析结果利用可配置的代码生成技术自动完成算法选择和调优.本文创新性地将优化分析模型与可配置的代码生成策略相结合,不仅保证了性能加速效果,还减少了重复开发的开销,同时简化了调优过程.在此基础上,本文进一步将AutoConfig集成到深度学习编译器Buddy Compiler中,对矩阵乘法和卷积的多种优化算法建立分析模型,并将自动配置的代码生成策略应用在多种SIMD硬件平台上进行评估.实验结果验证了AutoConfig在代码生成策略中有效地完成了参数配置和算法选择.与经过手动或自动优化的代码相比,由AutoConfig生成的代码可达到相似的执行性能,并且无需承担手动调优的重复实现开销和自动调优的搜索开销. 相似文献

19.

Performance analysis of direct N-body algorithms for astrophysical simulations on distributed systems

《Parallel Computing》2007,33(3):159-173

We discuss the performance of direct summation codes used in the simulation of astrophysical stellar systems on highly distributed architectures. These codes compute the gravitational interaction among stars in an exact way and have an O(N²) scaling with the number of particles. They can be applied to a variety of astrophysical problems, like the evolution of star clusters, the dynamics of black holes, the formation of planetary systems, and cosmological simulations. The simulation of realistic star clusters with sufficiently high accuracy cannot be performed on a single workstation but may be possible on parallel computers or grids. We have implemented two parallel schemes for a direct N-body code and we study their performance on general purpose parallel computers and large computational grids. We present the results of timing analyzes conducted on the different architectures and compare them with the predictions from theoretical models. We conclude that the simulation of star clusters with up to a million particles will be possible on large distributed computers in the next decade. Simulating entire galaxies however will in addition require new hybrid methods to speedup the calculation. 相似文献

20.

Practical Considerations of Parallel Simulations and Architecture Independent Parallel Algorithm Design

《Journal of Parallel and Distributed Computing》1998,53(1):1-25

We examine combinatorial properties of a class of hash functions and its application to the simulations of classical models of parallel computation on other models, such as theBSPand theS*PRAM, optimally in communication to within additive lower order terms. The BSP model can serve as a programming paradigm as well; we also examine the implications of architecture independent parallel algorithm design in the context of the BSP model and show how it can lead to portable and scalable implementations of algorithms that can work on a multiplicity of hardware platforms with only recompilation of the source program code. Toward this end, dense Cholesky factorization algorithms are presented and their performance on three parallel hardware platforms, an SGI Power Challenge, IBM SP2, and Cray T3D, is examined and analyzed. 相似文献