Similar Documents (20 results)
1.
In a hypercube multiprocessor with distributed memory, messages have a street address and an apartment number, i.e., a hypercube node address and a local memory address. Here we describe an optimal algorithm for performing the communication described by exchanging the bits of the node address with those of the local memory address. Such exchanges occur in both matrix transposition and the bit reversal used by the fast Fourier transform.
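The index map behind this communication pattern can be illustrated with a small sketch (ours, for illustration; it shows the address permutation, not the paper's optimal message schedule). Assuming equal numbers of node-address and local-address bits:

```python
def exchange_bits(addr, k):
    """Swap the high k bits (hypercube node address) with the
    low k bits (local memory address) of a 2k-bit global address."""
    high = addr >> k
    low = addr & ((1 << k) - 1)
    return (low << k) | high

# For a 2^k x 2^k matrix stored row-major across the machine, the
# global index of element (i, j) is i*2^k + j, so exchanging the two
# halves of the address maps (i, j) to (j, i): a matrix transpose.
k = 3
n = 1 << k
for i in range(n):
    for j in range(n):
        assert exchange_bits((i << k) | j, k) == (j << k) | i
```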

2.
We present several variants of the sunflower conjecture of Erdős & Rado (J Lond Math Soc 35:85–90, 1960) and discuss the relations among them. We then show that two of these conjectures (if true) imply negative answers to the questions of Coppersmith & Winograd (J Symb Comput 9:251–280, 1990) and Cohn et al. (2005) regarding possible approaches for obtaining fast matrix-multiplication algorithms. Specifically, we show that the Erdős–Rado sunflower conjecture (if true) implies a negative answer to the “no three disjoint equivoluminous subsets” question of Coppersmith & Winograd; we also formulate a “multicolored” sunflower conjecture in ${\mathbb{Z}_3^n}$ and show that (if true) it implies a negative answer to the “strong USP” conjecture of Cohn et al. (2005) (although it does not seem to impact a second conjecture in Cohn et al. (2005) or the viability of the general group-theoretic approach). A surprising consequence of our results is that the Coppersmith–Winograd conjecture actually implies the Cohn et al. conjecture. The multicolored sunflower conjecture in ${\mathbb{Z}_3^n}$ is a strengthening of the well-known (ordinary) sunflower conjecture in ${\mathbb{Z}_3^n}$, and we show via our connection that a construction from Cohn et al. (2005) yields a lower bound of $(2.51\ldots)^n$ on the size of the largest multicolored 3-sunflower-free set, which beats the current best-known lower bound of $(2.21\ldots)^n$ of Edel (2004) on the size of the largest 3-sunflower-free set in ${\mathbb{Z}_3^n}$.
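The objects involved can be made concrete with a tiny brute-force illustration (our own, not the paper's construction). In ${\mathbb{Z}_3^n}$, three distinct vectors form a 3-sunflower exactly when every coordinate is either constant across them or takes all three values, which is equivalent to the vectors summing to zero coordinatewise:

```python
from itertools import product, combinations

def is_sunflower(x, y, z):
    """Three distinct vectors in Z_3^n form a 3-sunflower iff each
    coordinate is all-equal or all-distinct, i.e. sums to 0 mod 3."""
    return all((a + b + c) % 3 == 0 for a, b, c in zip(x, y, z))

def is_sunflower_free(S):
    return not any(is_sunflower(x, y, z) for x, y, z in combinations(S, 3))

def max_sunflower_free(n):
    """Largest 3-sunflower-free subset of Z_3^n by exhaustive search
    (feasible only for tiny n; shown purely for illustration)."""
    pts = list(product(range(3), repeat=n))
    for r in range(len(pts), 0, -1):
        if any(is_sunflower_free(S) for S in combinations(pts, r)):
            return r
    return 0
```

For n = 1 the answer is 2 (any pair works, but {0, 1, 2} sums to 0), matching the base of the exponential bounds discussed above.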

3.
New hybrid algorithms for matrix multiplication are proposed that have the lowest computational complexity in comparison with well-known matrix multiplication algorithms. Based on the proposed algorithms, efficient algorithms are developed for the basic operation \( D = C + \sum\limits_{l =1}^{\xi} A_{l} B_{l}\) of cellular methods of linear algebra, where the \(A_l\), \(B_l\), \(C\), and \(D\) are square matrices of cell size. The computational complexity of the proposed algorithms is estimated.
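For reference, the basic cellular operation can be written down directly; this is a plain cubic-time sketch of the operation being accelerated, not one of the proposed hybrid algorithms:

```python
def cell_op(C, As, Bs):
    """Compute D = C + sum_l A_l B_l for square cells given as
    lists of lists (a straightforward reference implementation)."""
    n = len(C)
    D = [row[:] for row in C]           # start from C
    for A, B in zip(As, Bs):            # accumulate each product A_l B_l
        for i in range(n):
            for k in range(n):
                a = A[i][k]
                for j in range(n):
                    D[i][j] += a * B[k][j]
    return D
```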

4.
This paper presents an efficient parallel implementation of matrix multiplication on three parallel architectures, namely a linear array, a binary tree, and a mesh-of-trees.

5.
Recursive array layouts and fast matrix multiplication
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication and the more complex algorithms of Strassen (1969) and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance by 10 to 20 percent. Carrying the recursive layout down to the level of individual matrix elements is shown to be counterproductive; a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. Five recursive layouts with successively increasing complexity of address computation are evaluated, and it is shown that addressing overheads can be kept under control even for the most computationally demanding of these layouts.
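One widely used recursive layout of the kind studied here is the Z-Morton order, in which the offset of element (i, j) is obtained by interleaving the bits of i and j, so each quadrant of the matrix occupies a contiguous address range. A minimal sketch (our illustration; the paper evaluates several such layouts):

```python
def morton(i, j, bits):
    """Z-Morton offset of element (i, j) in a 2^bits x 2^bits matrix:
    interleave the bits of i (odd positions) and j (even positions)."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)
        z |= ((j >> b) & 1) << (2 * b)
    return z

# The top-left 2x2 quadrant of a 4x4 matrix fills offsets 0..3:
assert sorted(morton(i, j, 2) for i in range(2) for j in range(2)) == [0, 1, 2, 3]
```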

6.
Algorithms are presented for external matrix multiplication and for all-pairs shortest path computation. In comparison with earlier algorithms, the amount of I/O is reduced by a constant factor. The all-pairs shortest path algorithm even performs fewer internal operations, making the algorithm practically interesting.
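The baseline that such external-memory algorithms improve by a constant factor is the classical blocked multiplication, in which tiles are sized to fit in internal memory. A sketch using tile loads as a stand-in for block I/Os (this is the textbook scheme, not the paper's algorithm):

```python
def blocked_matmul(A, B, t):
    """Blocked multiplication of n x n lists-of-lists with tile size t,
    counting tile fetches as a rough proxy for I/O volume."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    loads = 0
    for ii in range(0, n, t):
        for jj in range(0, n, t):
            # the C-tile stays resident across the whole kk loop
            for kk in range(0, n, t):
                loads += 2  # fetch one A-tile and one B-tile
                for i in range(ii, min(ii + t, n)):
                    for k in range(kk, min(kk + t, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + t, n)):
                            C[i][j] += a * B[k][j]
    return C, loads
```

With (n/t)^3 tile-multiplications and O(t^2) elements per tile, this gives the standard O(n^3 / (B sqrt(M))) I/O bound when t is chosen so three tiles fit in internal memory.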

7.
International Journal of Computer Mathematics, 2012, 89(6):1264–1276
This paper investigates different ways of systolic matrix multiplication. We prove that in total there are 43 arrays for multiplication of rectangular matrices. We also prove that, depending on the mutual relation between the dimensions of the rectangular matrices, there are either one or 21 arrays with the minimal number of processing elements. Explicit mathematical formulae for systolic array synthesis are derived. The methodology applied to obtain the 43 systolic designs is based on a modification of the synthesis procedure based on dependency vectors and space-time mapping of the dependency graph.

8.
New hybrid algorithms are proposed for multiplying (n × n) matrices. They are based on Laderman’s algorithm for multiplying (3 × 3) matrices. As compared with well-known hybrid matrix multiplication algorithms, the new algorithms are characterized by the minimum computational complexity. The multiplicative, additive, and overall complexities of the algorithms are estimated.

9.
International Journal of Computer Mathematics, 2012, 89(3-4):231–248
The systolic concept in parallel architecture design proposed by H. T. Kung [1,2] achieves high throughput and speedup. The linear array for matrix-vector multiplication executes the algorithm in 2n − 1 time steps using 2n − 1 processors. Although the speedup obtained is very high, the efficiency is very poor (typically about 25% for problem sizes greater than 10). H. T. Kung proposed an idea for a linear systolic array using two data streams flowing in opposite directions. However, the processors in that array perform operations only in every second time step.

Attempts to improve this design have been made by many researchers. Nonlinear and folding transformation techniques [3,4,5] only halve the number of processors used, but do not reduce the execution time.

We propose the use of a fast linear systolic computation procedure to obtain a solution that uses 3n/2 processors and executes the algorithm in 3n/2 time steps, with the same cells, the same communication, and the same regular data flow as H. T. Kung's linear array. Only the algorithm is restructured and more efficiently organized: the processors are now utilized in every time step and no idle steps are required.
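The timing behaviour of a linear systolic array can be made concrete with a small discrete-time simulation. The variant below keeps the vector resident in the cells while partial sums move one cell per step (a textbook design, not Kung's two-stream array nor the restructured 3n/2 scheme proposed here); note that each cell is active in only n of the 2n − 1 steps, which is the efficiency gap being discussed:

```python
def systolic_matvec(A, x):
    """Simulate a linear systolic array computing y = A x.
    Cell j holds x[j]; a[i][j] is fed to cell j at time i + j;
    partial sums move one cell right per step, and row i's result
    leaves the last cell at time i + n - 1 (2n - 1 steps in total)."""
    n = len(x)
    y_out = []
    pipe = [0] * n  # partial sums in flight between cells
    for t in range(2 * n - 1):
        new_pipe = [0] * n
        for j in range(n):
            i = t - j                      # row whose entry reaches cell j now
            incoming = pipe[j - 1] if j > 0 else 0
            new_pipe[j] = incoming + (A[i][j] * x[j] if 0 <= i < n else 0)
        if t >= n - 1:                     # a completed row sum exits
            y_out.append(new_pipe[n - 1])
        pipe = new_pipe
    return y_out
```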

10.
11.
We examine several VLSI architectures and compare these for their suitability for various forms of the band matrix multiplication problem. The following architectures are considered: chain, broadcast chain, mesh, broadcast mesh and hexagonally connected. The forms of the matrix multiplication problem that are considered are: band matrix × vector and band matrix × band matrix. Metrics to measure the utilization of resources (bandwidth and processors) are also proposed. An important feature of this paper is the inclusion of correctness proofs. These proofs are provided for selected designs and illustrate how VLSI designs may be proved correct using traditional mathematical tools.

12.
F. Romani, Calcolo, 1980, 17(1):77–86
A new class of algorithms for the computation of bilinear forms has been recently introduced [1, 3]. These algorithms approximate the result with an arbitrarily small error. Such approximate algorithms may have a multiplicative complexity smaller than exact ones. On the other hand, any comparison between approximate and exact algorithms has to take into account the complexity-stability relations. In this paper some complexity measures for matrix multiplication algorithms are discussed and applied to the evaluation of exact and approximate algorithms. Multiplicative complexity is shown to remain a valid comparison test, and the cost of approximation appears to be only a logarithmic factor.

13.
This paper proposes two new cellular methods of matrix multiplication that yield cellular analogs of well-known matrix multiplication algorithms with lower computational complexities than analogs derived from previously known cellular methods. The new fast cellular method reduces the multiplicative, additive, and overall complexities of these algorithms by 15%. The new mixed cellular method combines the Laderman method with the proposed fast cellular method; the interaction of the two reduces the multiplicative, additive, and overall complexities of the matrix multiplication algorithms by 28%. The computational complexities of both methods are estimated using a model of obtaining cellular analogs of the traditional matrix multiplication algorithm.

14.
This paper develops optimal algorithms to multiply an n × n symmetric tridiagonal matrix by: (i) an arbitrary n × m matrix using 2nm − m multiplications; (ii) a symmetric tridiagonal matrix using 6n − 7 multiplications; and (iii) a tridiagonal matrix using 7n − 8 multiplications. Efficient algorithms are also developed to multiply a tridiagonal matrix by an arbitrary matrix, and to multiply two tridiagonal matrices.
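For comparison, the straightforward column-by-column scheme below (ours, not the paper's optimal algorithm) uses (3n − 2)m multiplications for case (i), which the optimal count of 2nm − m improves on:

```python
def symtri_times_matrix(d, e, X):
    """Multiply the n x n symmetric tridiagonal matrix with diagonal d
    and off-diagonal e by an n x m matrix X, counting multiplications.
    This direct scheme uses (3n - 2)*m of them."""
    n, m = len(d), len(X[0])
    mults = 0
    Y = [[0] * m for _ in range(n)]
    for j in range(m):
        for i in range(n):
            s = d[i] * X[i][j]; mults += 1
            if i > 0:
                s += e[i - 1] * X[i - 1][j]; mults += 1
            if i < n - 1:
                s += e[i] * X[i + 1][j]; mults += 1
            Y[i][j] = s
    return Y, mults
```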

15.
The complexity of matrix multiplication has attracted a lot of attention in the last forty years. In this paper, instead of considering asymptotic aspects of this problem, we are interested in reducing the cost of multiplication for matrices of small size, say up to 30. Following previous work in a similar vein by Probert & Fischer, Smith, and Mezzarobba, we base our approach on classical algorithms for small matrices, due to Strassen, Winograd, Pan, Laderman, and others, and show how to exploit these standard algorithms in an improved way. We illustrate the use of our results by generating multiplication codes over various rings, such as integers, polynomials, differential operators and linear recurrence operators.
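As a concrete instance of the standard building blocks mentioned, Strassen's 2 × 2 scheme trades the naive 8 multiplications for 7 (at the cost of extra additions), and its entries may themselves be matrix blocks, which is what makes it recursively applicable:

```python
def strassen_2x2(A, B):
    """Strassen's 7-multiplication scheme for a 2x2 (block) product."""
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    p1 = a * (f - h)
    p2 = (a + b) * h
    p3 = (c + d) * e
    p4 = d * (g - e)
    p5 = (a + d) * (e + h)
    p6 = (b - d) * (g + h)
    p7 = (a - c) * (e + f)
    return ((p5 + p4 - p2 + p6, p1 + p2),
            (p3 + p4, p1 + p5 - p3 - p7))
```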

16.
17.
Communication efficient matrix multiplication on hypercubes
In a recent paper, Fox, Otto and Hey consider matrix algorithms for hypercubes. For hypercubes allowing pipelined broadcast of messages they present a communication-efficient algorithm. We present in this paper a similar algorithm that uses only nearest-neighbour communication. This algorithm will therefore be very communication-efficient also on hypercubes that do not allow pipelined broadcast. We introduce a new algorithm that reduces the asymptotic communication cost. This is achieved by regarding the hypercube as a set of subcubes and by using the cascade sum algorithm.
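The cascade sum referred to uses only nearest-neighbour exchanges across successive hypercube dimensions: in round k, every node swaps its partial sum with its neighbour across dimension k, so after d rounds all 2^d nodes hold the global sum. A minimal simulation (our illustration, not the full matrix algorithm):

```python
def hypercube_sum(values):
    """Recursive-doubling ('cascade') sum on a d-dimensional hypercube,
    simulated with values[node] as the partial sum held by each node."""
    p = len(values)
    assert p & (p - 1) == 0, "node count must be a power of two"
    vals = list(values)
    d = p.bit_length() - 1
    for k in range(d):
        # each node adds the value of its neighbour across dimension k
        vals = [vals[node] + vals[node ^ (1 << k)] for node in range(p)]
    return vals
```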

18.
19.
The realization of modern processors is based on a multicore architecture with an increasing number of cores per processor. Multicore processors are often designed such that some level of the cache hierarchy is shared among cores. Usually, the last-level cache is shared among several or all cores (e.g., the L3 cache) and each core possesses private low-level caches (e.g., the L1 and L2 caches). Superlinear speedup is possible for a matrix multiplication algorithm executed on a shared-memory multiprocessor due to the existence of a superlinear region: a region where the cache requirements of matrix storage cause the sequential execution to incur more cache misses than the parallel execution. This paper shows theoretically and experimentally that there is a region where superlinear speedup can be achieved. We provide a theoretical proof of the existence of a superlinear speedup and determine the boundaries of the region where it can be achieved. The experiments confirm our theoretical results. Therefore, these results will have an impact on future software development and the exploitation of parallel hardware based on a shared-memory multiprocessor architecture. Copyright © 2013 John Wiley & Sons, Ltd.
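A toy capacity model (our own illustration with assumed parameter names; not the paper's analysis) locates such a region between the matrix size at which three n × n matrices overflow one core's private cache and the size at which they still fit in the aggregate of p private caches:

```python
def superlinear_region(p, cache_per_core, elem_size=8):
    """Crude capacity model of the superlinear region for C = A * B:
    sequentially, 3*n^2*elem_size bytes must exceed one private cache;
    in parallel, the same working set must fit in p private caches.
    Returns (n_low, n_high), the rough size range in between."""
    n_low = int((cache_per_core / (3 * elem_size)) ** 0.5)
    n_high = int((p * cache_per_core / (3 * elem_size)) ** 0.5)
    return n_low, n_high
```

With, say, 4 cores and 32 KiB of private cache per core, the model predicts a region roughly between n = 36 and n = 73; the paper derives the actual boundaries rigorously.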

20.
A unified cellular method for matrix multiplication is proposed. The method is a hybrid of three methods, namely, Strassen’s and Laderman’s recursive methods and a fast cellular method for matrix multiplication. The interaction of these three methods provides the highest (in comparison with well-known methods) percentage (equal to 37%) of minimization of the multiplicative, additive, and overall complexities of cellular analogues of well-known matrix multiplication algorithms. The estimation of the computational complexity of the unified method is illustrated by an example of obtaining a cellular analogue of the traditional matrix multiplication algorithm.
