
1.

External matrix multiplication and all-pairs shortest path

Jop F. Sibeyn 《Information Processing Letters》2004,91(2):99-106

Algorithms are presented for external matrix multiplication and for all-pairs shortest path computation. In comparison with earlier algorithms, the amount of I/O is reduced by a constant factor. The all-pairs shortest path algorithm even performs fewer internal operations, making the algorithm practically interesting.
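As a rough illustration of why blocking matters for external matrix multiplication, the sketch below multiplies two square matrices tile by tile, so that only three tiles need to be in fast memory at any time. It is a generic tiled product, not the algorithm of the paper; the function name `blocked_matmul` and the tile-size parameter `b` are assumptions made for this example.

```python
def blocked_matmul(A, B, b):
    """Multiply square matrices A and B (lists of lists) using b x b tiles."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # In an external-memory setting, the tiles of A, B and C
                # starting at (i0, k0), (k0, j0) and (i0, j0) would be
                # read from disk at this point (assumption of this sketch).
                for i in range(i0, min(i0 + b, n)):
                    for k in range(k0, min(k0 + b, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + b, n)):
                            C[i][j] += a * B[k][j]
    return C
```

A larger tile size means fewer tile loads per output tile, which is the kind of constant-factor I/O saving the abstract refers to.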

2.

Communication efficient matrix multiplication on hypercubes Total citations: 1, self-citations: 0, citations by others: 1

Jarle Berntsen 《Parallel Computing》1989,12(3):335-342

In a recent paper, Fox, Otto and Hey consider matrix algorithms for hypercubes. For hypercubes allowing pipelined broadcast of messages, they present a communication-efficient algorithm. We present in this paper a similar algorithm that uses only nearest-neighbour communication. This algorithm will therefore be very communication efficient also on hypercubes not allowing pipelined broadcast. We introduce a new algorithm that reduces the asymptotic communication cost. This is achieved by regarding the hypercube as a set of subcubes and by using the cascade sum algorithm.

3.

An $\tilde{O}(m^{2}n)$ Algorithm for Minimum Cycle Basis of Graphs

Telikepalli Kavitha Kurt Mehlhorn Dimitrios Michail Katarzyna E. Paluch 《Algorithmica》2008,52(3):333-349

We consider the problem of computing a minimum cycle basis of an undirected non-negative edge-weighted graph $G$ with $m$ edges and $n$ vertices. In this problem, a {0,1} incidence vector is associated with each cycle, and the vector space over $\mathbb{F}_{2}$ generated by these vectors is the cycle space of $G$. A set of cycles is called a cycle basis of $G$ if it forms a basis for its cycle space. A cycle basis where the sum of the weights of the cycles is minimum is called a minimum cycle basis of $G$. Minimum cycle bases are useful in a number of contexts, e.g. the analysis of electrical networks and structural engineering. The previous best algorithm for computing a minimum cycle basis has running time $O(m^{\omega}n)$, where $\omega$ is the best exponent of matrix multiplication. It is presently known that $\omega < 2.376$. We exhibit an $O(m^{2}n + mn^{2}\log n)$ algorithm. When the edge weights are integers, we have an $O(m^{2}n)$ algorithm. For unweighted graphs which are reasonably dense, our algorithm runs in $O(m^{\omega})$ time. For any $\varepsilon > 0$, we also design a $(1+\varepsilon)$-approximation algorithm. The running time of this algorithm is $O((m^{\omega}/\varepsilon)\log(W/\varepsilon))$ for reasonably dense graphs, where $W$ is the largest edge weight. A preliminary version of this paper appeared in Kavitha et al. (31st International Colloquium on Automata, Languages and Programming (ICALP), pp. 846–857, 2004). T. Kavitha and K.E. Paluch were in Max-Planck-Institut für Informatik, Saarbrücken, Germany, while this work was done.

4.

Fast multiplication of matrices over a finitely generated semiring

Daniel Andrén 《Information Processing Letters》2008,107(6):230-234

In this paper we show that $n \times n$ matrices with entries from a semiring $R$ which is generated additively by $q$ generators can be multiplied in time $O(q^{2}n^{\omega})$, where $n^{\omega}$ is the complexity for matrix multiplication over a ring (Strassen: $\omega < 2.807$, Coppersmith and Winograd: $\omega < 2.376$). We first present a combinatorial matrix multiplication algorithm for the case of semirings with $q$ elements, with a complexity matching the best known methods in this class. Next we show how the ideas used can be combined with those of the fastest known boolean matrix multiplication algorithms to give an $O(q^{2}n^{\omega})$ algorithm for matrices over semirings, not necessarily finite, with $q$ additive generators. For finite semirings our combinatorial algorithm is simple enough to be a practical algorithm and is expected to be faster than the $O(q^{2}n^{\omega})$ algorithm for matrices of practically relevant sizes.
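To make the notion of a matrix product over a semiring concrete, here is a minimal sketch of the plain cubic-time product, parameterised by the semiring's addition, multiplication and additive identity. It is only the baseline definition, not the fast algorithm of the paper; the name `semiring_matmul` and its parameters are assumptions for this example.

```python
def semiring_matmul(A, B, add, mul, zero):
    """Cubic-time product of A and B over the semiring (add, mul, zero)."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[zero] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = zero
            for k in range(m):
                acc = add(acc, mul(A[i][k], B[k][j]))
            C[i][j] = acc
    return C


INF = float("inf")
# Min-plus semiring: "addition" is min, "multiplication" is +, zero is infinity.
D = [[0, 3, INF],
     [INF, 0, 1],
     [INF, INF, 0]]
D2 = semiring_matmul(D, D, min, lambda x, y: x + y, INF)
# D2[i][j] is the shortest distance from i to j using at most two edges.
```

With `add=min` and `mul=+` this is the distance product used in all-pairs shortest path computations; with logical OR and AND it becomes the Boolean matrix product.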

5.

Gonzalo Berger Manuel Freire Renzo Marini Ernesto Dufrechou Pablo Ezzatti 《Concurrency and Computation》2023,35(20):e7271

Sparse matrix multiplication (SpGeMM) has grown in importance in recent years due to its data science and machine learning applications. Consequently, considerable research has focused on accelerating this kernel on GPUs. Designing massively parallel algorithms for the SpGeMM is a challenging task since the computation pattern is highly irregular, and the required memory and operations depend on the interaction between the nonzero layouts of the inputs. One strategy to attack this kernel consists of proposing new sparse matrix storage formats that contribute to mitigating this irregularity. In previous work, we commenced a study of the recently proposed bmSparse matrix format, suggesting several modifications to the SpGeMM algorithm. This work integrates the previous extensions and proposes new improvements to unleash bmSparse's full potential before comparing it to more consolidated options. In particular, we enhance one of the most computationally demanding stages with an adaptive technique, apply optimizations to achieve more efficient data accesses, and analyze the effect of using Tensor Cores to accelerate the multiplication stage of the algorithm. The experimental results on a set of real-world sparse matrices show that the optimized implementation largely outperforms vendor implementations such as NVIDIA cuSPARSE and the Intel MKL CSR variant, while being competitive with MKL's BSR variant.

6.

Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions

Qingda Lu Xiaoyang Gao Sriram Krishnamoorthy Gerald Baumgartner J. Ramanujam P. Sadayappan 《Journal of Parallel and Distributed Computing》2012

Empirical optimizers like ATLAS have been very effective in optimizing computational kernels in libraries. The best choice of parameters such as tile size and degree of loop unrolling is determined in ATLAS by executing different versions of the computation. In contrast, optimizing compilers use a model-driven approach to program transformation. While the model-driven approach of optimizing compilers is generally orders of magnitude faster than ATLAS-like library generators, its effectiveness can be limited by the accuracy of the performance models used. In this paper, we describe an approach where a class of computations is modeled in terms of constituent operations that are empirically measured, thereby allowing modeling of the overall execution time. The performance model with empirically determined cost components is used to select library calls and choose data layout transformations in the context of the Tensor Contraction Engine, a compiler for a high-level domain-specific language for expressing computational models in quantum chemistry. The effectiveness of the approach is demonstrated through experimental measurements on representative computations from quantum chemistry.

7.

Acceleration and Application of Variable-Dimension Batched Small Matrix Multiplication Based on TVM

戴翰文; 陈长波 《Computer Science (计算机科学)》2025,52(5):25-40

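Many practical applications need to compute efficiently the products of large numbers of small matrices with differing dimensions; for example, graph classification based on graph neural networks multiplies many adjacency matrices by node feature matrices. Existing methods cannot compute such batched small matrix multiplications with differing dimensions (variable-dimension, for short) efficiently across different hardware platforms. To address this problem, an efficient cross-platform algorithm, BVSM, is proposed on top of the deep learning compiler TVM. Through optimization templates tailored to small matrices, tensorized batching, and grouped padding, BVSM enables TVM to perform variable-dimension batched small matrix multiplication efficiently. Experiments on datasets from real graph classification tasks show that, on the CPU, BVSM achieves on average more than a 2x speedup over auto-scheduled and auto-tuned TVM (AnsorTVM), reaches on average 95% of the performance of Intel MKL's variable-dimension batched matrix multiplication, and at best 1.27 times its performance. On the GPU, BVSM achieves an average speedup of 62.05x over AnsorTVM, 28.82x over cuBLAS, and 6.59x over MAGMA's variable-dimension batched matrix multiplication.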
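The idea of grouped padding can be sketched in a few lines: matrix pairs whose dimensions round up to the same multiple of a step size are zero-padded to a common shape, so each group could be handed to a single fixed-shape batched kernel, and the results are cropped back afterwards. This is only an illustration in plain Python under that assumption, not the BVSM implementation and not TVM code; `batched_matmul_grouped`, `pad`, `round_up` and the `step` parameter are hypothetical names.

```python
from collections import defaultdict


def matmul(A, B):
    """Plain dense product of two lists-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def pad(M, rows, cols):
    """Zero-pad matrix M to shape rows x cols."""
    padded = [list(r) + [0] * (cols - len(r)) for r in M]
    return padded + [[0] * cols for _ in range(rows - len(M))]


def round_up(x, step):
    """Smallest multiple of step that is >= x."""
    return -(-x // step) * step


def batched_matmul_grouped(pairs, step=4):
    """Multiply each (A, B) pair after grouping pairs by padded shape."""
    groups = defaultdict(list)
    for idx, (A, B) in enumerate(pairs):
        m, k, n = len(A), len(B), len(B[0])
        key = (round_up(m, step), round_up(k, step), round_up(n, step))
        groups[key].append((idx, m, n, A, B))
    out = [None] * len(pairs)
    for (M, K, N), items in groups.items():
        # Every product in this group now has the same shape (M, K, N);
        # a real implementation would launch one batched kernel here.
        for idx, m, n, A, B in items:
            C = matmul(pad(A, M, K), pad(B, K, N))
            out[idx] = [row[:n] for row in C[:m]]  # crop the padding away
    return out
```

Zero padding leaves the cropped result unchanged, so the trade-off is purely between wasted arithmetic on padding and the number of distinct shapes that need a kernel.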

8.

Optimization techniques for small matrix multiplication

Charles-Éric Drevet 《Theoretical computer science》2011,412(22):2219-2236

The complexity of matrix multiplication has attracted a lot of attention in the last forty years. In this paper, instead of considering asymptotic aspects of this problem, we are interested in reducing the cost of multiplication for matrices of small size, say up to 30. Following the previous work of Probert & Fischer, Smith, and Mezzarobba, in a similar vein, we base our approach on the previous algorithms for small matrices, due to Strassen, Winograd, Pan, Laderman, and others and show how to exploit these standard algorithms in an improved way. We illustrate the use of our results by generating multiplication codes over various rings, such as integers, polynomials, differential operators and linear recurrence operators.

9.

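A Parallel Matrix Multiplication Algorithm Based on Diagonal Partitioning Total citations: 5, self-citations: 0, citations by others: 5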

张学波 李晓梅 《Computer Engineering (计算机工程)》2004,30(6):42-43

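A new parallel matrix multiplication algorithm based on diagonal partitioning is proposed. Building on earlier row and column partitioning strategies, it adopts a strategy based on diagonal partitioning. Numerical experiments show that the algorithm achieves a high speedup and parallel efficiency.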

10.

Derandomization, witnesses for Boolean matrix multiplication and construction of perfect hash functions

N. Alon M. Naor 《Algorithmica》1996,16(4-5):434-449

Small sample spaces with almost independent random variables are applied to design efficient sequential deterministic algorithms for two problems. The first algorithm, motivated by the attempt to design efficient algorithms for the All Pairs Shortest Path problem using fast matrix multiplication, solves the problem of computing witnesses for the Boolean product of two matrices. That is, if $A$ and $B$ are two $n$ by $n$ matrices, and $C = AB$ is their Boolean product, the algorithm finds for every entry $C_{ij} = 1$ a witness: an index $k$ so that $A_{ik} = B_{kj} = 1$. Its running time exceeds that of computing the product of two $n$ by $n$ matrices with small integer entries by a polylogarithmic factor. The second algorithm is a nearly linear time deterministic procedure for constructing a perfect hash function for a given $n$-subset of $\{1, \ldots, m\}$. Part of this paper was presented at the IEEE 33rd Symposium on Foundations of Computer Science. Research supported in part by a USA-Israeli BSF grant and by the Fund for Basic Research administered by the Israel Academy of Sciences. Supported by an Alon Fellowship and by a grant from the Israel Science Foundation administered by the Israeli Academy of Sciences. Some of this work was done while the author was with the IBM Almaden Research Center.
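The definition of a witness can be made concrete with a brute-force sketch that scans all indices for every entry. It takes cubic time and is shown only to pin down what is being computed; it is not the derandomized algorithm of the paper, and the name `boolean_product_witnesses` is an assumption for this example.

```python
def boolean_product_witnesses(A, B):
    """Return W where W[i][j] is an index k with A[i][k] = B[k][j] = 1,
    or None when entry (i, j) of the Boolean product is 0."""
    n = len(A)
    W = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if A[i][k] and B[k][j]:
                    W[i][j] = k  # smallest witness for entry (i, j)
                    break
    return W
```

In a shortest-path setting, a witness for entry (i, j) names an intermediate vertex on a path from i to j, which is why witnesses are needed to recover paths and not just distances.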

11.

Implementing Gaussian elimination on a matrix-matrix multiplication systolic array

Tanguy Risset 《Parallel Computing》1990,16(2-3):351-359

We show that any systolic array dedicated to matrix-matrix multiplication can also execute Gaussian elimination.

12.

A New Algorithm for Large Integer Multiplication

张力 张引兵 刘海 《Computer Security (计算机安全)》2011,(1):11-13

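To address arithmetic on integers that exceed what the basic integer types of a computer system can represent (large integers), this paper studies a fundamental algorithm, large integer multiplication. Based on the similarity between the representation of large integers and that of polynomials, and taking into account the carry and modulo characteristics of large integer multiplication, a polynomial algorithm for large integer multiplication is given. The main difference from other methods is that, although the product of two large integers is computed, the whole algorithm uses no multiplication, only addition...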
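The similarity between large integers and polynomials mentioned above can be sketched as follows: the decimal digits are treated as polynomial coefficients, the coefficient sequences are convolved, and carries are propagated at the end. This schoolbook sketch uses ordinary digit products and does not reproduce the addition-only method of the paper; the function name `multiply_decimal` is an assumption for this example.

```python
def multiply_decimal(x, y):
    """Multiply two non-negative integers given as decimal strings."""
    a = [int(d) for d in reversed(x)]  # least significant digit first
    b = [int(d) for d in reversed(y)]
    # Polynomial product: coefficient i+j collects a[i] * b[j].
    coeffs = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            coeffs[i + j] += ai * bj
    # Carry propagation turns coefficients back into base-10 digits.
    carry = 0
    for i in range(len(coeffs)):
        total = coeffs[i] + carry
        coeffs[i] = total % 10
        carry = total // 10
    # Drop leading zeros but keep a single digit for the value zero.
    while len(coeffs) > 1 and coeffs[-1] == 0:
        coeffs.pop()
    return "".join(str(d) for d in reversed(coeffs))
```

Separating the coefficient product from the carry step is what makes polynomial multiplication techniques applicable to integers.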

13.

Parallel Matrix Multiplication Algorithms on the RAPWBN

陈宏建 陈崚 李开荣 陈莉莉 《Computer Engineering (计算机工程)》2004,30(23):31-33,110

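After introducing the basic structure of the reconfigurable array of processors with wide bus networks (RAPWBN) and its prefix-sum operation on binary values, an integer summation algorithm on the RAPWBN is proposed, from which two fast and efficient parallel matrix multiplication algorithms on the RAPWBN are obtained. On an RAPWBN with $N^{3}$ processors and $N^{2}$ row buses, if the bus width $\omega > \log N$ bytes, matrix multiplication can be completed in $O(1)$ time; on an RAPWBN with $N^{2}$ processors and $N$ row buses, matrix multiplication can be completed in $O(N)$ time. Both have an efficiency of $O(N^{3})$, which is optimal.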

14.

Research and Implementation of Parallel Computation of the Strassen Matrix Multiplication Algorithm Based on MPI

石莹莹 李涛 《Digital Community & Smart Home (数字社区&智能家居)》2007,(22)

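On a parallel computing platform under Windows, using the MPICH environment together with the Visual C++ 6.0 programming language, this paper implements a parallel program for the Strassen matrix multiplication algorithm. Experiments show that the algorithm effectively improves the running efficiency of matrix multiplication.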
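For reference, a sequential sketch of the Strassen recursion is given below. It assumes square matrices whose size is a power of two and uses plain Python lists; it is not the MPI program described in the abstract, and the helper names `mat_add`, `mat_sub` and `strassen` are assumptions for this example.

```python
def mat_add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]


def mat_sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]


def strassen(A, B):
    """Strassen product of two n x n matrices, n a power of two."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Split both operands into four h x h quadrants.
    A11, A12 = [r[:h] for r in A[:h]], [r[h:] for r in A[:h]]
    A21, A22 = [r[:h] for r in A[h:]], [r[h:] for r in A[h:]]
    B11, B12 = [r[:h] for r in B[:h]], [r[h:] for r in B[:h]]
    B21, B22 = [r[:h] for r in B[h:]], [r[h:] for r in B[h:]]
    # Seven recursive products instead of the usual eight.
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22))
    M2 = strassen(mat_add(A21, A22), B11)
    M3 = strassen(A11, mat_sub(B12, B22))
    M4 = strassen(A22, mat_sub(B21, B11))
    M5 = strassen(mat_add(A11, A12), B22)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12))
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22))
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)
    # Reassemble the four quadrants into one matrix.
    top = [r + s for r, s in zip(C11, C12)]
    bottom = [r + s for r, s in zip(C21, C22)]
    return top + bottom
```

The seven products M1 to M7 do not depend on one another, so a parallel version could plausibly assign them to separate processes and combine the results on one of them.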

15.

A Study on the Computation of Matrix Tensor Products Total citations: 1, self-citations: 1, citations by others: 1

谭国律 《Computer Engineering and Applications (计算机工程与应用)》2004,40(8):77-78,167

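Using the theory of matrix tensor products, the computation of matrix tensor products is discussed, the complexity of the algorithm is analyzed, and parallel algorithms and their computational complexity are studied.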
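As a small illustration of the operation being studied, the sketch below computes the tensor (Kronecker) product of two matrices directly from its definition, in which every entry of the first matrix scales a full copy of the second. It is a direct sequential computation, not the paper's algorithm, and the function name `kronecker` is an assumption for this example.

```python
def kronecker(A, B):
    """Kronecker product of an m x n matrix A and a p x q matrix B.

    The result has m*p rows and n*q columns; block (i, j) equals A[i][j] * B.
    """
    return [[a * b for a in row_a for b in row_b]
            for row_a in A
            for row_b in B]
```

The direct computation performs one multiplication per output entry, m·n·p·q in total, and every output entry is independent of the others.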

16.

Exploiting processor features to implement error detection in reduced precision matrix multiplications

Pedro Reviriego Serdar Zafer Can Çağrı Eryılmaz Juan Antonio Maestro Oğuz Ergin 《Microprocessors and Microsystems》2014

Modern processors incorporate complex arithmetic units that can work with large word-lengths. Those units are useful for applications that require high precision. There are, however, many applications for which the use of reduced precision is sufficient. In those cases, one possibility is to use the large word-length arithmetic units to implement reduced precision operations with additional error detection. In this paper, this idea is explored for the case of matrix multiplications. A technique is presented and evaluated. The results show that it can detect most errors and that for large matrices the overhead in terms of execution time is small.

17.

Polynomial multiplication over finite fields: from quadratic to straight-line complexity

Nader H. Bshouty Michael Kaminski 《Computational Complexity》2006,15(3):252-262


18.

Chunky and equal-spaced polynomial multiplication

Daniel S. Roche 《Journal of Symbolic Computation》2011,46(7):791-806

Finding the product of two polynomials is an essential and basic problem in computer algebra. While most previous results have focused on the worst-case complexity, we instead employ the technique of adaptive analysis to give an improvement in many “easy” cases. We present two adaptive measures and methods for polynomial multiplication, and also show how to effectively combine them to gain both advantages. One useful feature of these algorithms is that they essentially provide a gradient between existing “sparse” and “dense” methods. We prove that these approaches provide significant improvements in many cases but in the worst case are still comparable to the fastest existing algorithms.

19.

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1-2):53-70

We present in this paper the parallelization of the fast matrix multiplication algorithms of Strassen and Winograd on MIMD distributed architectures whose interconnection networks are ring and torus. Complexity and efficiency are analyzed and good asymptotic behaviour is proved. These new parallel algorithms are compared with standard algorithms on a 128-processor parallel computer; experiments confirm the theoretical results.

20.

A note on a fast algorithm for sparse matrix multiplication

Jacques Cohen 《Information Processing Letters》1983,16(5):247-248
