Similar Documents
20 similar documents found.
1.
A new SIMD model of parallel computations is proposed, in which the processors are interconnected by a controlled tree-like communication circuit. For the problems considered, an architecture is proposed that is more effective than those of well-known models. Translated from Kibernetika i Sistemnyi Analiz, No. 5, pp. 175–180, September–October, 1999.

2.
The promise of future many-core processors, with hundreds of threads running concurrently, has led the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain better load balance, and pay careful attention to the critical path of computation. In this paper we describe how existing serial libraries such as (C)LAPACK and FLAME can be easily parallelized using the SMPSs tools, consisting of a few OpenMP-like pragmas and a run-time system. In the LAPACK case, this usually requires the development of blocked algorithms for simple BLAS-level operations, which expose concurrency at a finer grain. For better performance, our experimental results indicate that column-major order, as employed by this library, needs to be abandoned in favor of a block data layout. This will require a deeper rewrite of LAPACK or, alternatively, a dynamic conversion of the storage pattern at run-time. The parallelization of FLAME routines using SMPSs is simpler, as this library includes blocked algorithms (or algorithms-by-blocks in the FLAME argot) for most operations and storage-by-blocks (or block data layout) is already in place. Copyright © 2009 John Wiley & Sons, Ltd.
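A point worth making concrete is the run-time conversion from column-major storage to a block data layout mentioned above. The following is a minimal C sketch of such a conversion, written for this listing rather than taken from the paper; the tile size nb, the tile ordering, and the function name are illustrative assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Convert an m-by-n column-major matrix A (leading dimension lda) into a
 * block data layout: nb-by-nb tiles, each tile stored contiguously in
 * column-major order, tiles laid out in column-major tile order.
 * Illustrative sketch only; nb and the tile ordering are assumptions. */
double *to_block_layout(const double *A, int lda, int m, int n, int nb)
{
    int mt = (m + nb - 1) / nb;          /* number of tile rows    */
    int nt = (n + nb - 1) / nb;          /* number of tile columns */
    double *B = calloc((size_t)mt * nt * nb * nb, sizeof *B);
    if (!B) return NULL;

    for (int tj = 0; tj < nt; tj++)
        for (int ti = 0; ti < mt; ti++) {
            int rows = (ti == mt - 1) ? m - ti * nb : nb;
            int cols = (tj == nt - 1) ? n - tj * nb : nb;
            double *tile = B + ((size_t)tj * mt + ti) * nb * nb;
            for (int j = 0; j < cols; j++)   /* copy one tile column */
                memcpy(tile + (size_t)j * nb,
                       A + (size_t)(tj * nb + j) * lda + ti * nb,
                       (size_t)rows * sizeof *A);
        }
    return B;
}
```

Once the data live tile by tile, each SMPSs (or OpenMP) task can operate on one contiguous tile, which is what exposes the finer-grained concurrency the abstract describes.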

3.
This paper is concerned with principal considerations for developing a linear algebra package for the SUPRENUM computer. The design goals, as well as the mapping strategy of the parallelization methodology, are described briefly. Finally, a basic factorization scheme is introduced which can be readily tailored to the LU, Cholesky and QR factorization provided that the corresponding matrices are distributed according to the column-oriented wrap mapping.
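For readers unfamiliar with the column-oriented wrap mapping: column j of the matrix is assigned to process j mod p. A hedged sketch of the trailing update of one right-looking factorization step under that mapping is shown below; owner_of and wrap_update_step are illustrative names, not SUPRENUM routines, and the pivot column k is assumed to have been factored by its owner and broadcast so that A(:,k) already holds the multipliers locally.

```c
/* Column-oriented wrap (cyclic) mapping: column j is owned by process
 * (j mod p).  Sketch of the trailing update in one step of a right-looking
 * factorization; each process touches only the columns it owns. */
static int owner_of(int col, int p) { return col % p; }

void wrap_update_step(double *A, int lda, int n, int k, int me, int p)
{
    for (int j = k + 1; j < n; j++) {
        if (owner_of(j, p) != me)
            continue;                      /* some other process owns column j */
        for (int i = k + 1; i < n; i++)    /* rank-1 update of column j        */
            A[i + j * lda] -= A[i + k * lda] * A[k + j * lda];
    }
}
```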

4.
Parallel accelerators are playing an increasingly important role in scientific computing. However, it is perceived that their weakness nowadays is their reduced “programmability” in comparison with traditional general-purpose CPUs. For the domain of dense linear algebra, we demonstrate that this is not necessarily the case. We show how the libflame library carefully layers routines and abstracts details related to storage and computation, so that extending it to take advantage of multiple accelerators is achievable without introducing platform-specific complexity into the library code base. We focus on the experience of the library developer as he develops a library routine for a new operation, reduction of a generalized Hermitian positive definite eigenvalue problem to a standard Hermitian form, and configures the library to target a multi-GPU platform. It becomes obvious that the library developer does not need to know about the parallelization or the details of the multi-accelerator platform. Excellent performance on a system with four NVIDIA Tesla C2050 GPUs is reported. This makes libflame the first library to be released that incorporates multi-GPU functionality for dense matrix computations, setting a new standard for performance.

5.
Numerical linear algebra libraries provide many kernels that can be composed to perform complex computations. For a given computation, there is typically a large number of functionally equivalent kernel compositions. Some of these compositions achieve better response times than others for particular data and when executed on a particular computer architecture. Previous research provides methods to enumerate (a subset of) these kernel compositions. In this work, we study the problem of determining the composition that yields the lowest response time. Our approach is based on a response time prediction for each candidate combination. While this prediction could in principle be obtained using analytical and/or empirical performance models, developing accurate models of this kind is known to be challenging. Instead, we define a feature space that captures salient properties of kernel combinations and predict response time using supervised machine learning. We experiment with a standard set of machine learning algorithms and identify an effective algorithm for our kernel composition selection problem. Using this algorithm, our approach substantially outperforms the strategy of always using the simplest kernel composition and is often close to the fastest kernel compositions among those evaluated. We quantify the potential benefit of our approach if it were to be implemented as part of an interactive computational tool. We find that although the potential benefit is substantial, a limiting factor is the kernel composition enumeration overhead. Copyright © 2014 John Wiley & Sons, Ltd.

6.
In this paper, three distinct approaches to the implementation of an application in linear algebra on the AMT DAP 510 using the Fortran-Plus Enhanced language and compiler are investigated. In the first, the implementation is tailored to fit a virtual array processor whose edge size corresponds to the maximum matrix dimension used in the application. The second approach is a special case of the first, and corresponds to a situation in which the virtual array processor is constrained to have the same dimensions as the AMT DAP 510. The third approach may be considered a hybrid: an implementation of the application is constructed which incorporates the most efficient features of the first two. Finally, comparisons of the performances of the three implementations, all of which were compiled using the Fortran-Plus Enhanced compiler, are presented.

7.
International Journal of Computer Mathematics, 2012, 89(3–4): 311–325
A comparison is presented in regular algebra of the Gaussian and Gauss–Jordan elimination techniques for solving sparse systems of simultaneous equations. Specifically, the elimination form and product form of the star A* of a matrix A are defined, and it is then shown that the product form is never more sparse than the elimination form. This result generalises an earlier one due to Brayton, Gustavson and Willoughby, in which it is shown that the product form of the inverse A⁻¹ of a matrix A is never more sparse than the elimination form of the inverse. Our result applies both in linear algebra and, more generally, to path-finding problems.
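As background, the star of a matrix is A* = I ⊕ A ⊕ A² ⊕ … over a suitable semiring; in the (min, +) semiring it yields all-pairs shortest path lengths, the kind of path-finding problem the abstract refers to. Below is a small illustrative elimination-style computation of that star in C, not the paper's regular-algebra formulation.

```c
#define N 4   /* illustrative problem size */

/* Elimination-form computation of the star A* over the (min, +) semiring,
 * i.e. all-pairs shortest path lengths (Floyd-Warshall style).
 * Missing edges in A should be set by the caller to a very large value
 * (e.g. DBL_MAX from <float.h>). */
void star_min_plus(const double A[N][N], double S[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            S[i][j] = (i == j) ? 0.0 : A[i][j];   /* S = I (+) A */

    for (int k = 0; k < N; k++)                   /* eliminate vertex k */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double via_k = S[i][k] + S[k][j]; /* "multiply" = +   */
                if (via_k < S[i][j]) S[i][j] = via_k;   /* "add" = min */
            }
}
```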

8.
In this paper, some parallel algorithms are described for solving numerical linear algebra problems on Dawning-1000. They include matrix multiplication, LU factorization of a dense matrix, Cholesky factorization of a symmetric matrix, and eigendecomposition of symmetric matrices for real and complex data types. These programs are constructed on top of the fast BLAS library of Dawning-1000 under the NX environment. Some comparison results under different parallel environments and implementation methods are also given for Cholesky factorization. The execution time, measured performance, and speedup for each problem on Dawning-1000 are shown. For matrix multiplication and LU factorization, 1.86 GFLOPS and 1.53 GFLOPS are reached.

9.
The implementation and evaluation of the performances on the ICL DAP of two algorithms for the parallel computation of eigenvalues and eigenvectors of moderately large real symmetric matrices of order N, where 64 < N ≤ 256, is reported. The first of the algorithms is a modified form of a Parallel Orthogonal Transformation algorithm proposed by Clint et al., which has already been implemented on the DAP for matrices of order N, where N < 65. The second, which has also been implemented on the DAP for matrices of order N, where N < 65, is Jacobi's algorithm, in the modified form proposed by Modi and Pryce. A comparison of the efficiency of the two algorithms for the solution of a variety of large matrices is given.

10.
An implementation of digitized picture rotation and magnification based on Weiman's algorithm is presented. On a programmable array machine, routines that perform small transformations can be coded efficiently. The method illustrates the interpolative nature of the algorithm.

11.
We analyse an alternative decomposition for a tridiagonal matrix which has the property that the decomposition as well as the subsequent solution process can be done in two parallel parts. This decomposition is equivalent to the two-sided Gaussian elimination algorithm that has been discussed by Babuska. In the context of parallel computing a similar approach has been suggested by Joubert and Cloete. The computational complexity of this alternative decomposition is the same as for the standard decomposition and a remarkable aspect is that it often leads to slightly more accurate solutions than the standard process does. The algorithm can be combined with recursive doubling or cyclic reduction in order to increase the degree of parallelism and vectorizability.
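To make the "two parallel parts" concrete, here is a minimal serial sketch of a two-sided elimination for a tridiagonal system in the spirit of the decomposition described; it is an assumption-laden illustration (n ≥ 3, no pivoting), not the paper's exact algorithm.

```c
/* Two-sided elimination for a tridiagonal system
 *   a[i] x[i-1] + b[i] x[i] + c[i] x[i+1] = d[i],  i = 0..n-1,
 * with a[0] = c[n-1] = 0 and n >= 3.  Elimination proceeds from both ends
 * toward a middle row k; the two halves are independent, which is what
 * makes the scheme attractive for two-way parallelism.  Arrays are
 * overwritten in place. */
void tridiag_two_sided(double *a, double *b, double *c, double *d,
                       double *x, int n)
{
    int k = n / 2;                         /* meeting row */

    /* forward elimination from the top, rows 1..k-1
     * (could run concurrently with the loop below on a second processor) */
    for (int i = 1; i < k; i++) {
        double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    /* backward elimination from the bottom, rows n-2..k+1 */
    for (int i = n - 2; i > k; i--) {
        double m = c[i] / b[i + 1];
        b[i] -= m * a[i + 1];
        d[i] -= m * d[i + 1];
    }
    /* the meeting row couples the two halves */
    double mt = a[k] / b[k - 1], mb = c[k] / b[k + 1];
    x[k] = (d[k] - mt * d[k - 1] - mb * d[k + 1])
         / (b[k] - mt * c[k - 1] - mb * a[k + 1]);

    /* back-substitution, again in two independent halves */
    for (int i = k - 1; i >= 0; i--) x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    for (int i = k + 1; i < n; i++)  x[i] = (d[i] - a[i] * x[i - 1]) / b[i];
}
```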

12.
This paper describes the implementation and performance results for a few standard linear algebra routines on the Denelcor HEP computer. The algorithms used here are based on high-level modules that facilitate portability and perform efficiently in a wide range of environments. The modules are chosen to be of a large enough computational granularity so that reasonably optimum performance may be ensured. The design of algorithms with such fundamental modules in mind will also facilitate their replacement by others more suited to gain the desired performance on a particular computer architecture.

13.
The objective of this paper is to analyze the dynamic scheduling of dense linear algebra algorithms on shared-memory, multicore architectures. Current numerical libraries (e.g., LAPACK) show clear limitations on such emerging systems, mainly because of their coarse-grained tasks. Thus, many numerical algorithms need to be redesigned to better fit the architectural design of the multicore platform. The parallel linear algebra for scalable multicore architectures (PLASMA) library developed at the University of Tennessee tackles this challenge by using tile algorithms to achieve a finer task granularity. These tile algorithms can then be represented by directed acyclic graphs, where nodes are the tasks and edges are the dependencies between the tasks. The key to achieving high performance is a runtime environment that efficiently schedules the execution of the directed acyclic graph across the multicore platform. This paper studies the impact on overall performance of some parameters, both at the level of the scheduler (e.g., window size and locality) and of the algorithms (e.g., left-looking and right-looking variants). We conclude that some commonly accepted rules for dense linear algebra algorithms may need to be revisited. Copyright © 2011 John Wiley & Sons, Ltd.
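As an illustration of expressing a tile algorithm as a directed acyclic graph for a runtime scheduler, the sketch below uses standard OpenMP task depend clauses in place of the PLASMA runtime discussed in the paper. The tile kernels potrf_tile, trsm_tile, syrk_tile and gemm_tile are assumed wrappers around the corresponding LAPACK/BLAS operations, and the contiguous tile layout is an illustrative choice.

```c
#include <stddef.h>

extern void potrf_tile(double *Akk, int nb);
extern void trsm_tile (const double *Akk, double *Aik, int nb);
extern void syrk_tile (const double *Aik, double *Aii, int nb);
extern void gemm_tile (const double *Aik, const double *Ajk, double *Aij, int nb);

/* tile (i,j) of an nt-by-nt grid of contiguous nb-by-nb tiles */
static double *tile(double *A, int nt, int nb, int i, int j)
{
    return A + ((size_t)j * nt + i) * nb * nb;
}

/* Tile Cholesky: each task touches whole tiles, and the depend clauses
 * describe the DAG edges that the runtime uses to schedule execution. */
void tile_cholesky(double *A, int nt, int nb)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; k++) {
        double *Akk = tile(A, nt, nb, k, k);

        #pragma omp task depend(inout: Akk[0])
        potrf_tile(Akk, nb);                       /* factor diagonal tile */

        for (int i = k + 1; i < nt; i++) {
            double *Aik = tile(A, nt, nb, i, k);
            #pragma omp task depend(in: Akk[0]) depend(inout: Aik[0])
            trsm_tile(Akk, Aik, nb);               /* panel solve */
        }
        for (int i = k + 1; i < nt; i++) {
            double *Aik = tile(A, nt, nb, i, k);
            double *Aii = tile(A, nt, nb, i, i);
            #pragma omp task depend(in: Aik[0]) depend(inout: Aii[0])
            syrk_tile(Aik, Aii, nb);               /* symmetric rank-nb update */

            for (int j = k + 1; j < i; j++) {
                double *Ajk = tile(A, nt, nb, j, k);
                double *Aij = tile(A, nt, nb, i, j);
                #pragma omp task depend(in: Aik[0], Ajk[0]) depend(inout: Aij[0])
                gemm_tile(Aik, Ajk, Aij, nb);      /* trailing-matrix update */
            }
        }
    }
}
```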

14.
Extreme scale supercomputers available before the end of this decade are expected to have 100 million to 1 billion computing cores. The power and energy efficiency issue has become one of the primary concerns of extreme scale high performance scientific computing. This paper surveys the research on saving power and energy for numerical linear algebra algorithms in high performance scientific computing on supercomputers around the world. We first stress the significance of numerical linear algebra algorithms in high performance scientific computing nowadays, followed by a background introduction on widely used numerical linear algebra algorithms and software libraries and benchmarks. We summarize commonly deployed power management techniques for reducing power and energy consumption in high performance computing systems by presenting power and energy models and two fundamental types of power management techniques: static and dynamic. Further, we review the research on saving power and energy for high performance numerical linear algebra algorithms from four aspects: profiling, trading off performance, static saving, and dynamic saving, and summarize state-of-the-art techniques for achieving power and energy efficiency in each category individually. Finally, we discuss potential directions of future work and summarize the paper.

15.
An Analysis of the Object-Oriented Numerical Software Trilinos and Its Linear Algebra Packages
Zhang Ni, Cao Jianwen. Computer Engineering and Design, 2007, 28(5): 993-998, 1011
This paper analyzes the design philosophy and organizational structure of the Trilinos project at Sandia National Laboratories in the United States, a project funded in part by the ASC program. The Trilinos framework defines a linear algebra object model that serves as the construction foundation and communication medium for its various packages; the most mature implementation of this model at present is Epetra. The organization of Epetra and the hierarchy of its main classes are described in detail, and the current performance of the software is examined through two numerical experiments. Numerical computing projects such as Trilinos illustrate the effort devoted to, and the valuable experience gained from, making diverse numerical software cooperate, integrate, and extend effectively.

16.
We compare five implementations of the Jacobi method for diagonalizing a symmetric matrix. Two of these, the classical Jacobi and sequential sweep Jacobi, have been used on sequential processors. The third method, the parallel sweep Jacobi, has been proposed as the method of choice for parallel processors. The fourth and fifth methods are believed to be new. They are similar to the parallel sweep method but use different schemes for selecting the rotations.

The classical Jacobi method is known to take O(n⁴) time to diagonalize a matrix of order n. We find that the parallel sweep Jacobi run on one processor is about as fast as the sequential sweep Jacobi. Both of these methods take O(n³ log₂ n) time. One of our new methods also takes O(n³ log₂ n) time, but the other one takes only O(n³) time. The choice among the methods for parallel processors depends on the degree of parallelism possible in the hardware. The time required to diagonalize a matrix on a variety of architectures is modeled.

Unfortunately for proponents of the Jacobi method, we find that the sequential QR method is always faster than the Jacobi method. The QR method is faster even for matrices that are nearly diagonal. If we perform the reduction to tridiagonal form in parallel, the QR method will be faster even on highly parallel systems.
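For reference, one cyclic sweep of a sequential-sweep Jacobi method can be sketched as follows; this is a minimal serial illustration written for this listing, without eigenvector accumulation or convergence testing, and is not code from the paper.

```c
#include <math.h>

/* One cyclic sweep of the Jacobi eigenvalue method on a symmetric n-by-n
 * matrix stored row-major in a[n*n]: every off-diagonal element (p,q) is
 * annihilated once by a plane rotation.  Repeated sweeps drive the matrix
 * toward diagonal form. */
void jacobi_sweep(double *a, int n)
{
    for (int p = 0; p < n - 1; p++)
        for (int q = p + 1; q < n; q++) {
            double apq = a[p * n + q];
            if (apq == 0.0) continue;

            /* rotation angle that zeroes a(p,q) */
            double theta = 0.5 * atan2(2.0 * apq, a[q * n + q] - a[p * n + p]);
            double c = cos(theta), s = sin(theta);

            for (int i = 0; i < n; i++) {          /* rotate rows p and q */
                double aip = a[p * n + i], aiq = a[q * n + i];
                a[p * n + i] = c * aip - s * aiq;
                a[q * n + i] = s * aip + c * aiq;
            }
            for (int i = 0; i < n; i++) {          /* rotate columns p and q */
                double api = a[i * n + p], aqi = a[i * n + q];
                a[i * n + p] = c * api - s * aqi;
                a[i * n + q] = s * api + c * aqi;
            }
        }
}
```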


17.
In this paper we discuss design principles to implement a set of linear algebra subroutines in a portable library for parallel computers. Our design supports reuse of code and easy adaptation to new parallel programming paradigms or network configurations. The routines are supposed to be used in self-verifying algorithms. They therefore have to deliver a validated result of high accuracy.

18.
Gaussian elimination over GF(2) is used in a number of applications including the factorisation of large integers. The boolean nature of arithmetic in GF(2) makes the task well suited to highly parallel bit-organised computers. A program to work with up to 4096 × 4096 matrices has been developed for the ICL-DAP. A method has been developed that needs no extra storage to store the history of the elimination. The algorithm is presented and its correctness proved.
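The bit-parallelism the abstract exploits on the DAP can be illustrated on an ordinary CPU by packing each GF(2) row into 64-bit words, so that one row update becomes a short run of XORs. The sketch below shows the idea; the matrix width, the names, and the rank-returning interface are illustrative assumptions, not the DAP program.

```c
#include <stdint.h>

#define NCOLS 4096
#define WORDS (NCOLS / 64)           /* 64 columns packed per word */

/* In-place Gaussian elimination over GF(2): each row is a bit vector packed
 * into 64-bit words, so an entire row update is a handful of XORs.
 * Returns the rank of the nrows-by-NCOLS matrix. */
int gf2_eliminate(uint64_t rows[][WORDS], int nrows)
{
    int rank = 0;
    for (int col = 0; col < NCOLS && rank < nrows; col++) {
        int w = col / 64;
        uint64_t bit = 1ULL << (col % 64);

        /* find a pivot row with a 1 in this column */
        int pivot = -1;
        for (int r = rank; r < nrows; r++)
            if (rows[r][w] & bit) { pivot = r; break; }
        if (pivot < 0) continue;                 /* column already zero */

        /* swap the pivot row into position */
        for (int k = 0; k < WORDS; k++) {
            uint64_t t = rows[rank][k];
            rows[rank][k] = rows[pivot][k];
            rows[pivot][k] = t;
        }
        /* clear the column in every other row: addition over GF(2) is XOR,
         * so one row operation is WORDS word-sized XORs */
        for (int r = 0; r < nrows; r++)
            if (r != rank && (rows[r][w] & bit))
                for (int k = 0; k < WORDS; k++)
                    rows[r][k] ^= rows[rank][k];
        rank++;
    }
    return rank;
}
```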

19.
In this paper we show that finding solutions of a system of multivariate polynomial equalities and inequalities in the max algebra is equivalent to solving an Extended Linear Complementarity Problem. This allows us to find all solutions of such a system of multivariate polynomial equalities and inequalities and provides a geometrical insight into the structure of the solution set. We also demonstrate that this enables us to solve many important problems in the max algebra and the max-min-plus algebra, such as matrix decompositions, construction of matrices with a given characteristic polynomial, state space transformations, and the (minimal) state space realization problem.
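As background, in the max algebra "addition" is the maximum and "multiplication" is ordinary addition, with −∞ playing the role of zero. The sketch below shows a matrix product in that algebra; it is purely illustrative and is not the Extended Linear Complementarity solver described in the paper.

```c
#include <float.h>

#define MP_EPS (-DBL_MAX)    /* the max-plus "zero" element, i.e. -infinity */
#define N 3                  /* illustrative dimension */

/* Matrix product in the max-plus algebra: addition is max, multiplication
 * is ordinary +.  C = A (x) B. */
void maxplus_matmul(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double acc = MP_EPS;                     /* max-plus zero */
            for (int k = 0; k < N; k++) {
                double term = A[i][k] + B[k][j];     /* "multiply" = +   */
                if (term > acc) acc = term;          /* "add"      = max */
            }
            C[i][j] = acc;
        }
}
```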

20.
Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU-GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double-precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared-memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared-memory systems with multiple GPUs. Copyright © 2014 John Wiley & Sons, Ltd.
