Similar Documents
1.
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution that is used in all three factorization algorithms is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.
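As context for the block cyclic distribution described above, here is a minimal Python sketch (illustrative only, not ScaLAPACK's actual interface) of how a global matrix index maps to an owning process and a local index under a 2D block cyclic layout; the block size and process-grid shape are example values:

```python
# A minimal sketch (not ScaLAPACK's actual API) of the 2D block cyclic
# mapping: global index -> (owner process coordinate, local index).
def block_cyclic_owner(i, nb, p):
    """Owner process row (or column) of global index i,
    for block size nb over p processes."""
    return (i // nb) % p

def block_cyclic_local(i, nb, p):
    """Local index of global index i on its owner process."""
    block = i // nb            # global block number
    local_block = block // p   # blocks this owner holds before it
    return local_block * nb + i % nb

# A 2D distribution applies the 1D map independently to rows and columns.
nb, prow, pcol = 64, 4, 2      # example block size and process grid
i, j = 1000, 777
owner = (block_cyclic_owner(i, nb, prow), block_cyclic_owner(j, nb, pcol))
local = (block_cyclic_local(i, nb, prow), block_cyclic_local(j, nb, pcol))
```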

2.
This paper describes the design and implementation of three core factorization routines—LU, QR, and Cholesky—included in the out-of-core extension of ScaLAPACK. These routines allow the factorization and solution of a dense system that is too large to fit entirely in physical memory. The full matrix is stored on disk, and the factorization routines transfer sub-matrix panels into memory. The 'left-looking' column-oriented variant of the factorization algorithm is implemented to reduce the disk I/O traffic. The routines are implemented using a portable I/O interface and utilize high-performance ScaLAPACK factorization routines as in-core computational kernels. We present the details of the implementation of the out-of-core ScaLAPACK factorization routines, as well as performance and scalability results on a Beowulf Linux cluster. Copyright © 2000 John Wiley & Sons, Ltd.
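A minimal sketch of the 'left-looking' panel organization, written in-memory in Python rather than against the paper's disk-backed store; the panel-list interface, the absence of pivoting, and the unblocked in-core kernel are simplifying assumptions:

```python
import numpy as np

def left_looking_lu(panels, nb):
    """Minimal in-memory sketch (no pivoting) of left-looking panel LU.
    `panels` is a list of n x nb column panels standing in for the disk
    store; only the current panel plus one previously factored panel are
    touched at a time, which is what keeps disk I/O traffic low."""
    for j, pj in enumerate(panels):
        for k in range(j):                   # "read" earlier panel k
            pk = panels[k]
            r0, r1 = k * nb, (k + 1) * nb
            # U block: solve against the unit-lower diagonal block of panel k
            L11 = np.tril(pk[r0:r1], -1) + np.eye(nb)
            pj[r0:r1] = np.linalg.solve(L11, pj[r0:r1])
            pj[r1:] -= pk[r1:] @ pj[r0:r1]   # trailing update from L21
        r0 = j * nb
        for t in range(nb):                  # unblocked factor of the panel
            g = r0 + t                       # global row of this pivot
            pj[g + 1:, t] /= pj[g, t]
            pj[g + 1:, t + 1:] -= np.outer(pj[g + 1:, t], pj[g, t + 1:])
    return panels
```

The point of the left-looking order is visible in the loop structure: each panel is written once, after re-reading the previously factored panels, instead of rewriting the whole trailing matrix at every step.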

3.
《Parallel Computing》2014,40(5-6):70-85
QR factorization is a computational kernel of scientific computing. How can the latest computers be used to accelerate this task? We investigate this topic by proposing a dense QR factorization algorithm with adaptive block sizes on a hybrid system that contains a central processing unit (CPU) and a graphics processing unit (GPU). To maximize the use of the CPU and GPU, we develop an adaptive scheme that chooses the block size at each iteration. The decision is based on statistical surrogate models of performance and an online monitor, which avoids unexpected occasional performance drops. We modify the highly optimized CPU–GPU based QR factorization in MAGMA to implement the proposed schemes. Numerical results suggest that our approaches are efficient and can lead to near-optimal block sizes. The proposed algorithm can be extended to other one-sided factorizations, such as LU and Cholesky factorizations.
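A toy version of the adaptive idea (the model form, the candidate block sizes, and the monitor threshold are assumptions for exposition, not MAGMA's implementation):

```python
# Illustrative sketch of per-iteration block-size selection driven by a
# statistical surrogate of the step time; everything here is an assumed
# stand-in, not the paper's actual model or interface.
import numpy as np

CANDIDATES = [32, 64, 96, 128, 192, 256]

def fit_surrogate(samples):
    """Fit time(nb, m) ~ a + b*m/nb + c*nb*m by least squares, where m is
    the remaining trailing-matrix size; samples are (nb, m, time)."""
    X = np.array([[1.0, m / nb, nb * m] for nb, m, _ in samples])
    t = np.array([t for _, _, t in samples])
    coef, *_ = np.linalg.lstsq(X, t, rcond=None)
    return coef

def choose_nb(coef, m, recent, tol=1.5):
    """Pick the candidate with the smallest predicted time, skipping any
    size whose last observed time exceeded its prediction by `tol`x
    (the 'online monitor' guarding against occasional performance drops)."""
    def pred(nb):
        return coef @ np.array([1.0, m / nb, nb * m])
    ok = [nb for nb in CANDIDATES if recent.get(nb, 0.0) <= tol * pred(nb)]
    return min(ok or CANDIDATES, key=pred)
```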

4.
In this paper, some parallel algorithms are described for solving numerical linear algebra problems on the Dawning-1000. They include matrix multiplication, LU factorization of a dense matrix, Cholesky factorization of a symmetric matrix, and eigendecomposition of symmetric matrices for real and complex data types. These programs are built on top of the fast BLAS library of the Dawning-1000 under the NX environment. Some comparison results under different parallel environments and implementation methods are also given for Cholesky factorization. The execution time, measured performance, and speedup for each problem on the Dawning-1000 are shown. For matrix multiplication and LU factorization, 1.86 GFLOPS and 1.53 GFLOPS are reached, respectively.

5.
Over the past 20 years, increases in processor speed have dramatically outstripped performance increases for standard memory chips. To bridge this gap, compilers must optimize applications so that data fetched into caches are reused before being displaced. Existing compiler techniques can efficiently optimize simple loop structures such as sequences of perfectly nested loops. However, on more complicated structures, existing techniques are either ineffective or require too much computation time to be practical for a commercial compiler. To optimize complex loop structures both effectively and inexpensively, we present a novel loop transformation, dependence hoisting, for optimizing arbitrarily nested loops, and an efficient framework that applies the new technique to aggressively optimize benchmarks for better locality. Our technique is as inexpensive as the traditional unimodular loop transformation techniques and thus can be incorporated into commercial compilers. In addition, it is highly effective and is able to block several linear algebra kernels containing highly challenging loop structures, in particular, Cholesky, QR, LU factorization without pivoting, and LU with partial pivoting. The automatic blocking of QR and pivoting LU is a notable achievement—to our knowledge, few previous compiler techniques, including theoretically more general loop transformation frameworks [1, 21, 23, 27, 31], were able to completely automate the blocking of these kernels, and none has produced the same blocking as produced by our technique. These results indicate that with low compilation cost, our technique can in practice match the effectiveness of much more expensive frameworks that are theoretically more powerful.

6.
In this paper, we describe scalable parallel algorithms for symmetric sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1,024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.

7.
In this paper we study various implementations of Cholesky factorization on SIMD architectures. A submatrix algorithm is implemented on the MasPar MP-2 using both block and torus-wrap data mappings. Both LL^T and LDL^T (square-root-free) implementations of the algorithm are investigated. The execution times and performance results for the MP-2 are presented. The performance of these algorithms is characterized in terms of the problem size, machine size, and other machine-dependent communication and computation parameters. Analysis of the communication and computation complexities of these algorithms is also presented, and models to predict the performance are derived. The torus-wrap mapped implementations outperformed the block approach for all problem sizes. The LDL^T implementation outperformed LL^T for small to medium problem sizes. © 1997 John Wiley & Sons, Ltd.
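For reference, a compact serial sketch of the square-root-free LDL^T variant compared in the paper (the textbook algorithm, not the MP-2 code):

```python
import numpy as np

def ldlt(A):
    """Square-root-free Cholesky sketch: A = L D L^T for symmetric
    positive definite A, with L unit lower triangular and D diagonal.
    No square roots are taken, which is the variant's selling point."""
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        # d_j = a_jj - sum_k L_jk^2 d_k
        d[j] = A[j, j] - (L[j, :j] ** 2) @ d[:j]
        for i in range(j + 1, n):
            # L_ij = (a_ij - sum_k L_ik L_jk d_k) / d_j
            L[i, j] = (A[i, j] - (L[i, :j] * L[j, :j]) @ d[:j]) / d[j]
    return L, d
```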

8.
Parallelizable block ILU (incomplete LU) factorization preconditioners for a block-tridiagonal matrix have recently been proposed by the author. In this paper, we describe a parallelization of Krylov subspace methods with the block ILU factorization preconditioners on distributed-memory computers such as the Cray T3E, and then provide parallel performance results of a preconditioned Krylov subspace method to evaluate the effectiveness and efficiency of the block ILU preconditioners on the Cray T3E.

9.
The most computationally demanding scientific problems are solved with large parallel systems. In some cases these systems are Non-Uniform Memory Access (NUMA) multiprocessors made up of a large number of cores which share a hierarchically organized memory. The basic building block of these scientific codes is often matrix multiplication, and the efficient development of other linear algebra packages is directly based on the matrix multiplication routine implemented in the BLAS library. The BLAS library is available in vendor-supplied packages or free implementations. The latest versions of this library are multithreaded and can be used efficiently in multicore systems, but when they are used inside parallel codes, the two parallelism levels can interfere and degrade performance. In this work, an auto-tuning method is proposed to automatically select the optimal number of threads to use at each parallelism level when multithreaded linear algebra routines are called from OpenMP parallel codes. The method is based on a simple but effective theoretical model of the execution time of the two-level routines. The methodology is applied to a two-level matrix–matrix multiplication and to different matrix factorizations (LU, QR and Cholesky) by blocks. Traditional schemes which directly use the multithreaded BLAS routine dgemm are compared with schemes combining the multithreaded dgemm with OpenMP.
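A toy illustration of model-guided thread selection at two levels (the Amdahl-style cost model and its constants are assumptions for exposition, not the paper's model):

```python
# Pick the OpenMP (outer) thread count p and the BLAS (inner) thread
# count q, with p*q capped by the core count, minimizing a simple
# two-level execution-time estimate. Constants are illustrative.
def choose_threads(cores, work, seq_frac=0.05, fork_cost=1e-4):
    best = None
    for p in range(1, cores + 1):           # OpenMP threads
        for q in range(1, cores // p + 1):  # BLAS threads per OpenMP task
            # serial part + perfectly parallel part + thread overhead
            t = work * (seq_frac + (1 - seq_frac) / (p * q)) \
                + fork_cost * (p + q)
            if best is None or t < best[0]:
                best = (t, p, q)
    return best[1], best[2]

p, q = choose_threads(cores=16, work=1.0)
```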

10.
Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze cache and translation look-aside buffer (TLB) performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns, and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for the optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSparc II parameters show that tiling and block data layout, with a block size given by our block size selection algorithm, reduce TLB misses by up to 93 percent compared with other techniques. The total miss cost is reduced considerably. Experiments on several platforms show that tiling with block data layout achieves up to 50 percent performance improvement over other techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout. Experimental results show that matrix multiplication using block data layout is up to 15 percent faster than that using Morton data layout.
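As an illustration of block data layout itself (a sketch under the assumption that the block size divides the matrix dimension; not the paper's code):

```python
import numpy as np

def to_block_layout(A, b):
    """Copy a row-major n x n matrix into block data layout: each b x b
    tile is stored contiguously, one tile after another, so a tiled
    kernel touches one contiguous chunk (fewer TLB pages) per tile.
    Sketch; assumes b divides n."""
    n = A.shape[0]
    nt = n // b
    # result[ti, tj] is the contiguous b x b buffer of tile (ti, tj)
    return A.reshape(nt, b, nt, b).transpose(0, 2, 1, 3).copy()

A = np.arange(64.0).reshape(8, 8)
B = to_block_layout(A, 4)          # B[ti, tj] is a contiguous 4x4 tile
assert np.allclose(B[1, 0], A[4:8, 0:4])
```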

11.
Gaussian elimination and LU factoring have been greatly studied from the algorithmic point of view, but much less from the point of view of the best output format. In this paper, we give new output formats for fraction-free LU factoring and for QR factoring. The formats and the algorithms used to obtain them are valid for any matrix system in which the entries are taken from an integral domain, not just for integer matrix systems. After discussing the new output format of LU factoring, the complexity analysis for the fraction-free algorithm and fraction-free output is given. Our new output format contains smaller entries than previously suggested forms, and it avoids the gcd computations required by some other partially fraction-free computations. As applications of our fraction-free algorithm and format, we demonstrate how to construct a fraction-free QR factorization and how to solve linear systems within a given domain.
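For background, the classic one-step fraction-free (Bareiss) elimination over the integers, to which such output formats apply; this sketch omits pivoting:

```python
def bareiss(M):
    """One-step Bareiss fraction-free Gaussian elimination on an integer
    matrix (classic textbook algorithm, shown only as background for the
    fraction-free LU formats discussed above). Every division below is
    exact in the domain, so entries stay integers; the last pivot equals
    det(M). Assumes nonzero pivots (no pivoting for brevity)."""
    M = [row[:] for row in M]
    n = len(M)
    prev = 1
    for k in range(n - 1):
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                M[i][j] = (M[k][k] * M[i][j] - M[i][k] * M[k][j]) // prev
            M[i][k] = 0
        prev = M[k][k]
    return M  # upper triangle is the fraction-free U; M[n-1][n-1] = det

U = bareiss([[2, 1, 1], [4, 3, 3], [8, 7, 9]])
assert U[2][2] == 4  # determinant of the example matrix
```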

12.
A Recursive Algorithm for Cholesky Factorization and Its Improvement
Recursion is a new and effective approach to dense linear algebra computations. Recursion automatically produces variable-sized matrix blocks, which lets an algorithm fully exploit today's high-performance computers with hierarchical memory. This paper studies a recursive algorithm for Cholesky factorization and gives a detailed derivation of the algorithm. The algorithm is implemented in Fortran 90, which supports recursion, and its speed is further improved by reordering the matrix elements in memory. The resulting algorithm is 15%-25% faster than the blocked algorithms in common use.
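A Python sketch of the recursive splitting (the paper's implementation is in Fortran 90; the cutoff to an unblocked base case is an illustrative choice):

```python
import numpy as np

def rchol(A, cutoff=64):
    """Recursive Cholesky sketch: returns lower-triangular L with
    A = L @ L.T. The half/half split produces the automatic, variable
    blocking the abstract describes."""
    n = A.shape[0]
    if n <= cutoff:
        return np.linalg.cholesky(A)       # unblocked base case
    h = n // 2
    L = np.zeros_like(A)
    L11 = rchol(A[:h, :h], cutoff)         # factor the top-left block
    # L21 = A21 * L11^{-T}: a triangular solve against the factored block
    L21 = np.linalg.solve(L11, A[h:, :h].T).T
    # recurse on the Schur complement of the trailing block
    L22 = rchol(A[h:, h:] - L21 @ L21.T, cutoff)
    L[:h, :h], L[h:, :h], L[h:, h:] = L11, L21, L22
    return L

# quick check on a random symmetric positive definite matrix
X = np.random.randn(200, 200)
A = X @ X.T + 200 * np.eye(200)
assert np.allclose(rchol(A) @ rchol(A).T, A)
```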

13.
Many practical applications include matrix operations as essential procedures. In addition, recent studies of matrix operations rely on parallel processing to reduce calculation delays. Because these operations are highly data intensive, many studies have investigated work distribution techniques and data access latency to accelerate algorithms. However, previous studies have not adequately considered hardware architectural features, although these greatly affect the performance of matrix operations. Thus, the present study considers the architectural characteristics that affect the performance of matrix operations on real multicore processors. We use matrix multiplication, LU decomposition, and Cholesky factorization as the test applications, which are well-known data-intensive mathematical algorithms in various fields. We argue that applications only access matrices in a particular direction, and we propose that the canonical data layout is the optimal matrix data layout compared with the block data layout. In addition, the tiling algorithm is utilized to increase the temporal data locality in multilevel caches and to balance the workload as evenly as possible in multicore environments. Our experimental results show that applications using the canonical data layout with tiling have an 8.23% faster execution time and a 3.91% lower last-level cache miss rate compared with applications executed with the block data layout.
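A sketch of tiling over the canonical (row-major) layout, the combination this entry advocates; the tile size is illustrative and is assumed to divide the matrix dimension:

```python
import numpy as np

def tiled_matmul(A, B, b=64):
    """Tiled C = A @ B over the canonical (row-major) layout.
    Sketch only; assumes b divides n."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for ii in range(0, n, b):           # tile row of C
        for kk in range(0, n, b):       # reduction tile
            for jj in range(0, n, b):   # tile column of C
                # one b x b tile update; each operand tile is reused
                # across the inner loop while it is hot in cache
                C[ii:ii+b, jj:jj+b] += (A[ii:ii+b, kk:kk+b]
                                        @ B[kk:kk+b, jj:jj+b])
    return C

A = np.random.randn(256, 256); B = np.random.randn(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)
```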

14.
§1. Introduction. Numerical linear algebra computations, typified by LU factorization and Cholesky factorization, are widely used in modern scientific research and engineering. As computer technology has developed, the algorithms and software implementing these numerical linear algebra computations have evolved as well. Today, high-performance RISC computers with multilevel memory hierarchies dominate the field of numerical computing. RISC processors are very fast, and the speed gap between them and memory is large, so whether a machine's performance can be fully realized depends on the effective use of the multilevel memory hierarchy and caches. To this end, the current…

15.
Research on a Recursive Algorithm for LU Factorization
陈建平 《计算机科学》2004,31(6):141-142
Introducing recursion into dense linear algebra computations produces automatic matrix blocking, which suits the algorithms to the hierarchical memory structure of today's high-performance computers and increases computation speed. This paper studies a recursive LU factorization algorithm for solving systems of linear equations and gives a detailed derivation of the algorithm.

16.
《Parallel Computing》1988,7(2):149-162
We consider the "ijk forms" of LU and Cholesky factorization on certain parallel computers. This extends the analysis for vector computers of Part I of this work. We restrict attention to local memory systems with processors that may or may not have vector capability. Special attention is given to bus architectures but qualitative analyses are given for other interconnection systems.
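Two of the six "ijk forms" as Python sketches (no pivoting), showing how reordering the same three loops changes the memory access pattern from right-looking trailing-matrix updates to left-looking column updates; both compute identical factors:

```python
import numpy as np

def lu_kij(A):
    """'kij' (right-looking) form: after each step k, the whole trailing
    submatrix is updated. In-place LU without pivoting; sketch only."""
    n = A.shape[0]
    for k in range(n - 1):
        for i in range(k + 1, n):
            A[i, k] /= A[k, k]
            for j in range(k + 1, n):
                A[i, j] -= A[i, k] * A[k, j]
    return A

def lu_jki(A):
    """'jki' (left-looking) form: column j is touched once, receiving all
    pending updates before being scaled. In-place; sketch only."""
    n = A.shape[0]
    for j in range(n):
        for k in range(j):                 # delayed updates to column j
            for i in range(k + 1, n):
                A[i, j] -= A[i, k] * A[k, j]
        for i in range(j + 1, n):          # then scale below the pivot
            A[i, j] /= A[j, j]
    return A

M = np.random.rand(6, 6) + 6 * np.eye(6)   # diagonally dominant test case
assert np.allclose(lu_kij(M.copy()), lu_jki(M.copy()))
```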

17.
This paper presents and analyzes two different strategies of heterogeneous distribution of computations solving dense linear algebra problems on heterogeneous networks of computers. The first strategy is based on heterogeneous distribution of processes over processors and homogeneous block cyclic distribution of data over the processes. The second is based on homogeneous distribution of processes over processors and heterogeneous block cyclic distribution of data over the processes. Both strategies were implemented in the mpC language—a dedicated parallel extension of ANSI C for efficient and portable programming of heterogeneous networks of computers. The first strategy was implemented using calls to ScaLAPACK; the second strategy was implemented with calls to LAPACK and BLAS. Cholesky factorization on a heterogeneous network of workstations is used to demonstrate that the heterogeneous distributions have an advantage over the traditional homogeneous distribution.
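A toy version of the second strategy's data assignment: a heterogeneous cyclic pattern giving each processor a share of blocks proportional to its speed (the speeds and the rounding rule are illustrative assumptions, not the mpC implementation):

```python
# Heterogeneous "block cyclic" assignment sketch: each processor gets a
# number of blocks per cycle proportional to its measured speed.
def hetero_cycle(speeds, nblocks):
    """Return owner[i] for each of nblocks blocks, giving processor p
    round(speeds[p]/min(speeds)) consecutive blocks per cycle."""
    quota = [max(1, round(s / min(speeds))) for s in speeds]
    pattern = [p for p, q in enumerate(quota) for _ in range(q)]
    return [pattern[i % len(pattern)] for i in range(nblocks)]

# e.g. three workstations, the first twice as fast as the other two:
owners = hetero_cycle(speeds=[2.0, 1.0, 1.0], nblocks=10)
# -> [0, 0, 1, 2, 0, 0, 1, 2, 0, 0]
```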

18.
1. Introduction. Numerical linear algebra computations, typified by Cholesky factorization and LU factorization, are widely used in modern scientific research and engineering. As computer architecture and technology have developed, the algorithms and software implementing these computations have evolved as well. The general-purpose Basic Linear Algebra Subprograms (BLAS) library has grown from the Level-1 BLAS of the 1970s (performing vector-vector operations) to the Level-2…

19.
On most high-performance architectures, data movement is slow compared to floating point (in particular, vector) performance. On these architectures block algorithms have been successful for matrix computations. By considering a matrix as a collection of submatrices (the so-called blocks), one naturally arrives at algorithms that require little data movement. The optimal blocking strategy, however, depends on the computing environment and on the problem parameters. On parallel machines, tradeoffs between individual floating point performance and overall system performance also come into play. Current approaches use fixed-width blocking strategies which are not optimal. This paper presents an adaptive blocking methodology for determining a good blocking strategy systematically. We demonstrate this technique on a block QR factorization routine on a distributed-memory machine. Using timing models for the high-level kernels of the algorithm, we can formulate in a recurrence relation a blocking strategy that avoids adding extra delays along the critical path of the algorithm. This recurrence relation predicts performance well since we base our timing models on observed data, not other simplistic measures. Experiments on the Intel iPSC/1 hypercube show that, in fact, the resulting blocking strategy is as good as any fixed-width blocking strategy, independent of problem size and the number of processors employed. So while we do not know the optimum fixed-width blocking strategy unless we rerun the same problem several times, adaptive blocking provides close to optimum performance in the first run. We also mention how adaptive blocking can result in performance-portable code by automating the generation of the timing models.

This work was partially performed when the author was a graduate student at Cornell University and was supported by the Office of Naval Research under contract N00014-83-K-0640 and by NSF contract DCR 86-02310. Support at Argonne was provided by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under Contract W-31-109-Eng-38. Computations were performed in part at the facilities of the Cornell Computational Optimization Project, which is supported by the National Science Foundation under contract DMS 87-06133, at the Cornell National Supercomputer Facility, and at the Advanced Research Computing Facility at Argonne National Laboratory.
