1.
《International Journal of Computer Mathematics》2012,89(1-2):43-54
In this article, we show how polynomial multiplication can be performed in a pipelined fashion on a systolic array in a linear number of time steps. The computational model consists of two linear systolic arrays using 2(n+1) processing elements and requiring (m+2n+2) time steps, where m and n are the degrees of the two given polynomials. Since identical processing elements execute the same program, the design is well suited to VLSI implementation. The algorithm is also proved correct.
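As an illustration of the coefficient pipelining involved, the following is a minimal software sketch of a linear systolic convolution scheme for polynomial multiplication, not the paper's exact 2(n+1)-cell, (m+2n+2)-step design: cell k holds one coefficient of b, the coefficients of a stream past, and the products accumulate into the result.

```python
# A minimal sketch (not the paper's exact design): a linear "systolic"
# convolution array simulated in software.  Cell k holds b[k]; coefficients
# of a stream past, and partial products accumulate into the product a(x)*b(x).

def systolic_poly_mul(a, b):
    """Multiply polynomials given as coefficient lists (lowest degree first)."""
    m, n = len(a) - 1, len(b) - 1          # degrees of the two polynomials
    result = [0] * (m + n + 1)
    # Each "cell" k is responsible for coefficient b[k]; at every time step
    # one coefficient of a enters the array and meets every cell in turn.
    for t, a_i in enumerate(a):            # a_i enters the array at step t
        for k, b_k in enumerate(b):        # cell k multiplies and accumulates
            result[t + k] += a_i * b_k
    return result

if __name__ == "__main__":
    # (1 + 2x)(3 + x + x^2) = 3 + 7x + 3x^2 + 2x^3
    print(systolic_poly_mul([1, 2], [3, 1, 1]))   # -> [3, 7, 3, 2]
```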
2.
Björn Lisper 《Distributed Computing》1991,5(3):133-144
Summary. Forming the transitive closure of a binary relation (or directed graph) is an important part of many algorithms. When the relation is represented by a bit matrix, the transitive closure can be computed efficiently in parallel on a systolic array. Here we propose two novel ways of computing the transitive closure of an arbitrarily big graph on a systolic array of fixed size. The first method is a simple partitioning of a well-known systolic algorithm for computing the transitive closure. The second is a block-structured algorithm. This algorithm is suitable for execution on a systolic array that can multiply fixed-size bit matrices and compute the transitive closure of graphs with a fixed number of nodes. The algorithm is, however, not limited to systolic array implementations; it works on any parallel architecture that can perform these bit matrix operations efficiently. The shortest path problem, for directed graphs with weighted edges, can also be solved in the same manner as the transitive closure is computed.
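For reference, the computation being partitioned can be sketched sequentially as Warshall's algorithm on a boolean adjacency matrix; the paper's fixed-size partitioned and block-structured array versions are not reproduced here.

```python
# A minimal sketch of the underlying computation (Warshall's algorithm on a
# boolean adjacency matrix); the paper's contribution is partitioning and
# block-structuring this computation for a fixed-size systolic array.

def transitive_closure(adj):
    """adj: n x n list of 0/1; returns the reachability (transitive closure) matrix."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for k in range(n):                      # allow node k as an intermediate node
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = 1
    return reach

if __name__ == "__main__":
    g = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
    print(transitive_closure(g))   # node 0 now also reaches node 2
```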
Björn Lisper was born in 1956 in Solna, Sweden. He received the M. Eng. Physics degree in 1980 and the Ph.D. degree in Computer Science in 1987, both from the Royal Institute of Technology in Stockholm. Currently he shares his time between the Royal Institute of Technology and the Swedish Institute of Computer Science. His research interests are mainly in the area of formal methods for deriving efficient parallel implementations of algorithms, including synthesis of fixed hardware structures for specific algorithms and compilation techniques for tightly coupled parallel systems. Dr. Lisper is a member of the European Association for Theoretical Computer Science.
3.
《International Journal of Computer Mathematics》2012,89(1-2):103-116
We consider systolic arrays for matrix computations involving complex elements, and show that in certain circumstances the complex calculations can be decomposed into a number of real matrix sub-problems, which can be used to good effect in reducing computation times and increasing array cell efficiency. Computations involving matrix-vector products, matrix products and matrix factorisation are examined. It was found that matrix-vector and matrix product calculations produce arrays with e = 1 cell efficiency and the same computation time as their real counterparts, with only an increase in hardware related to the bandwidth of the systems. Matrix factorisation is achieved using a 2×2 block factorisation, which requires four times the hardware and is only a factor of two slower than the real version of the factorisation.
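The decomposition into real sub-problems is, in essence, the identity (A + iB)(C + iD) = (AC - BD) + i(AD + BC). The sketch below (using NumPy, as an illustration only, not the paper's array mapping) checks it numerically.

```python
# A small sketch of the real-arithmetic decomposition alluded to above:
# (A + iB)(C + iD) = (AC - BD) + i(AD + BC), so one complex matrix product
# becomes four real matrix products (the paper maps such sub-problems onto
# systolic arrays; that mapping is not reproduced here).
import numpy as np

def complex_matmul_via_real(A, B, C, D):
    """Return (real, imag) parts of (A + iB) @ (C + iD) using real products only."""
    real = A @ C - B @ D
    imag = A @ D + B @ C
    return real, imag

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B, C, D = (rng.standard_normal((3, 3)) for _ in range(4))
    re, im = complex_matmul_via_real(A, B, C, D)
    direct = (A + 1j * B) @ (C + 1j * D)
    print(np.allclose(re + 1j * im, direct))   # True
```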
4.
Christian Lengauer 《Software》1990,20(3):261-282
An experiment in mechanical code generation for a programmable systolic computer is reported. Two-dimensional systolic arrays are automatically reduced to one dimension, and code is generated for the one-dimensional processor array Warp. The technique is demonstrated with two examples: matrix multiplication and LU decomposition.
5.
《International Journal of Computer Mathematics》2012,89(4):403-429
Fast systolic computation and double pipelines were designed to achieve implementations that use fewer processors and execute the algorithm in less time than conventional systolic algorithms. H. T. Kung and C. S. Leiserson in [1-3] proposed systolic algorithms realized on a bidirectional linear array where two data streams flow in opposite directions. The data flow of this solution requires data elements to appear in each stream only at every second time step, which is the only way for them to meet all the elements of the opposing stream. In [4, 5] the authors proposed a linear array where one data stream is double mapped while the elements of the other data stream flow in consecutive time steps. The procedure used to obtain such a solution is called fast systolic design. It was shown in [5] that double-pipeline solutions are obtained by applying separating and grouping techniques in addition to this design. Several more efficient systolic designs have been proposed for the matrix-vector multiplication algorithm in [4, 5]. Here we apply these techniques to other linear-array algorithms such as the triangular linear system solver, string comparison, convolution, correlation, and MA and AR filters.
6.
《International Journal of Computer Mathematics》2012,89(3-4):113-121
Given n items, a parallel algorithm for generating all n! permutations is presented. The computational model used is a linear array consisting of n identical processing elements with a simple structure. A new permutation is produced at every other time step, and the elapsed time to produce each permutation is independent of n. The basic idea is an iterative method based on exchanging two consecutive components of an existing permutation. The design procedure of this algorithm is considered in detail. The ranking and unranking functions of the required permutations are also discussed.
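The classical sequential counterpart of this idea is the Steinhaus-Johnson-Trotter algorithm, in which each permutation is obtained from the previous one by exchanging two consecutive components. The sketch below shows only that sequential algorithm, as an assumed reference point; the paper's linear-array version is not reproduced.

```python
def adjacent_transposition_permutations(n):
    """Yield all n! permutations of 1..n, each obtained from the previous one
    by swapping two neighbouring elements (Steinhaus-Johnson-Trotter)."""
    perm = list(range(1, n + 1))
    direction = [-1] * n                   # -1: looking left, +1: looking right
    yield perm[:]
    while True:
        # find the largest "mobile" element (one pointing at a smaller neighbour)
        mobile_idx = -1
        for i in range(n):
            j = i + direction[i]
            if 0 <= j < n and perm[j] < perm[i]:
                if mobile_idx == -1 or perm[i] > perm[mobile_idx]:
                    mobile_idx = i
        if mobile_idx == -1:
            return                         # no mobile element: all n! produced
        j = mobile_idx + direction[mobile_idx]
        perm[mobile_idx], perm[j] = perm[j], perm[mobile_idx]
        direction[mobile_idx], direction[j] = direction[j], direction[mobile_idx]
        # reverse the direction of every element larger than the one just moved
        for i in range(n):
            if perm[i] > perm[j]:
                direction[i] = -direction[i]
        yield perm[:]

if __name__ == "__main__":
    print(list(adjacent_transposition_permutations(3)))   # 6 permutations
```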
7.
In this work we propose two techniques for improving VLSI implementations of artificial neural networks (ANNs). By making use of two kinds of processing elements (PEs), one dedicated to the basic operations (addition and multiplication) and the other to evaluating the activation function, the total time and cost of the VLSI array implementation of ANNs can be decreased by a factor of two compared with previous work. Taking advantage of the residue number system (RNS), the efficiency of each PE can be further increased. Two RNS-based array processor designs are proposed: the first is built from look-up tables, and the second is constructed from binary adders combined with mixed-radix conversion (MRC), so that the hardware is simple and high-speed operation is obtained. The proposed techniques are general enough to be extended to cover other forms of loading and learning algorithms.
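As background on the arithmetic being exploited, the sketch below illustrates RNS multiply-accumulate over a hypothetical modulus set (7, 11, 13), with reconstruction by the Chinese Remainder Theorem; the paper's second design uses mixed-radix conversion and dedicated hardware, neither of which is reproduced here.

```python
# A minimal sketch of residue-number-system (RNS) arithmetic, the property the
# PE designs above exploit: addition and multiplication act independently on
# each residue channel (no carries between channels).  Reconstruction here uses
# the Chinese Remainder Theorem rather than the paper's mixed-radix conversion.
from math import prod

MODULI = (7, 11, 13)        # pairwise coprime; dynamic range = 7*11*13 = 1001

def to_rns(x, moduli=MODULI):
    return tuple(x % m for m in moduli)

def rns_mul_add(a, b, c, moduli=MODULI):
    """Compute a*b + c channel-wise, the basic multiply-accumulate of a neuron."""
    return tuple((ai * bi + ci) % m for ai, bi, ci, m in zip(a, b, c, moduli))

def from_rns(residues, moduli=MODULI):
    """Chinese Remainder Theorem reconstruction."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)       # pow(Mi, -1, m): modular inverse
    return x % M

if __name__ == "__main__":
    a, b, c = 23, 17, 5
    print(from_rns(rns_mul_add(to_rns(a), to_rns(b), to_rns(c))))  # 23*17 + 5 = 396
```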
8.
The length of the longest common subsequence (LCS) between two strings of M and N characters can be computed by an O(M × N) dynamic programming algorithm that can be executed in O(M + N) steps by a linear systolic array. It has recently been shown that the LCS between run-length-encoded (RLE) strings of m and n runs can be computed by an O(nM + Nm − nm) algorithm that could be executed in O(m + n) steps by parallel hardware. However, that algorithm cannot be directly mapped onto a linear systolic array because of its irregular structure. In this paper, we propose a modified algorithm that exhibits a more regular structure at the cost of a marginal reduction in the efficiency of RLE. We outline the algorithm and discuss its mapping onto a linear systolic array.
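For context, the underlying O(M × N) dynamic program for the LCS length is sketched below; the paper's regularised RLE variant and its systolic mapping are not reproduced.

```python
# A minimal sketch of the underlying O(M*N) dynamic program for LCS length;
# the paper's contribution (a regularised variant over run-length-encoded
# strings that maps onto a linear systolic array) is not reproduced here.

def lcs_length(s, t):
    m, n = len(s), len(t)
    prev = [0] * (n + 1)                   # LCS lengths for the previous row
    for i in range(1, m + 1):
        curr = [0] * (n + 1)
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[n]

if __name__ == "__main__":
    print(lcs_length("aaabbbcc", "aabbbbc"))   # LCS is "aabbbc" -> 6
```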
9.
In this paper we present algorithms for solving some combinatorial problems on one-dimensional processor arrays in which data flows in only one direction through the array. The problems we consider are: ranking the elements in a chain of size n, rooting a spanning tree with n vertices, and computing biconnected components of a connected graph with n vertices. We show that each of these problems can be solved using arrays of size n in which the data enters at the first cell and flows through the array in only one direction until it leaves the last cell as output. We also show how the biconnectivity algorithm for the array yields a new sequential algorithm for computing biconnected components which uses only O(n) locations of random access memory. These results were originally reported in extended abstract form in the Proceedings of the 24th Annual Allerton Conference on Communication, Control, and Computing, October 1986, pp. 797–806. This research was supported in part by the Center for Communications and Signal Processing, North Carolina State University.
10.
Parallel algorithms for solving the satisfaction problem of non-trivial functional and multivalued data dependencies (FDs and MVDs) in a relation of N tuples by M processors are developed in this paper. Algorithms performing, in a parallel manner, batch or interactive checking of these data dependencies are also discussed. The M processors are organized as a linear systolic array. The time complexities of the first two algorithms for solving the FD satisfaction problem under M ≥ N are both O(N), and that of Algorithm (3) or (4) for solving the FD or MVD satisfaction problem under N ≥ M is O(N²/M). The latter complexity reduces to O(N) if M = N and is at least no worse than O(N log N) if M = Ω(N/log N).
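As a reference point, the sequential test that these algorithms parallelise can be sketched as follows: a relation satisfies an FD X → Y exactly when no two tuples agree on X but differ on Y (the MVD test is more involved and is not shown).

```python
# A minimal sequential sketch of the FD satisfaction test that the systolic
# algorithms above parallelise: a relation satisfies X -> Y iff no two tuples
# agree on the X attributes but differ on the Y attributes.

def satisfies_fd(tuples, x_idx, y_idx):
    """tuples: list of equal-length tuples; x_idx, y_idx: attribute positions."""
    seen = {}                                    # X-projection -> Y-projection
    for t in tuples:
        x_val = tuple(t[i] for i in x_idx)
        y_val = tuple(t[i] for i in y_idx)
        if x_val in seen:
            if seen[x_val] != y_val:
                return False                     # two tuples violate X -> Y
        else:
            seen[x_val] = y_val
    return True

if __name__ == "__main__":
    r = [("alice", "cs", 101), ("bob", "cs", 101), ("carol", "math", 202)]
    print(satisfies_fd(r, x_idx=[1], y_idx=[2]))   # dept -> room holds: True
```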
11.
12.
A method for developing network parallel programs based on PVM in a microcomputer environment
The Parallel Virtual Machine (PVM) is a general-purpose environment for developing network parallel programs. It allows networked supercomputers, massively parallel machines, workstations and microcomputers to be used as one large parallel machine for developing parallel algorithms or running parallel systems. This paper introduces the basics of PVM and its latest developments, discusses methods for developing network parallel programs based on PVM, and finally gives concrete examples.
13.
Massively parallel processing (MPP) emphasises the scalability of both parallel architectures and parallel algorithms. On a scalable parallel architecture, a scalable parallel algorithm should be able to exploit a growing number of processors effectively, and the effectiveness of an algorithm is usually measured by its processor efficiency at run time. A commonly overlooked factor is communication efficiency, which is a problem of general significance. This paper gives a definition of communication efficiency, studies its relationship to processor efficiency, and, through an analysis of the execution of a typical algorithm, examines the communication efficiency of several common parallel architectures. The results show that only the combination of processor efficiency and communication efficiency can fully evaluate the scalability of an algorithm and guide the design of parallel architectures.
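As a hedged illustration only (the paper's exact definition of communication efficiency is not reproduced here), the following sketch computes the standard processor efficiency E_p = T_1/(p·T_p) together with one plausible notion of communication efficiency: the fraction of run time spent computing rather than communicating.

```python
# A hedged sketch of the kind of metrics discussed above.  Processor
# efficiency is the standard E_p = T_1 / (p * T_p); the "communication
# efficiency" below (compute time over total time) is only an illustrative
# assumption, not necessarily the paper's definition.

def processor_efficiency(t_serial, t_parallel, p):
    return t_serial / (p * t_parallel)

def communication_efficiency(t_compute, t_comm):
    return t_compute / (t_compute + t_comm)

if __name__ == "__main__":
    # hypothetical run: 100 s serial, 8 processors, 12 s compute + 3 s comm each
    print(processor_efficiency(100.0, 15.0, 8))      # 0.833...
    print(communication_efficiency(12.0, 3.0))       # 0.8
```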
14.
Building on an introduction to the basic structure of the reconfigurable array of processors with wide bus networks (RAPWBN) and its binary prefix-sum operation, an integer summation algorithm on the RAPWBN array is proposed, from which two fast and efficient parallel algorithms for matrix multiplication on the RAPWBN array are derived. On a RAPWBN array with N³ processors and N² row buses, matrix multiplication can be completed in O(1) time provided the bus bandwidth ω > log N bytes; on a RAPWBN array with N² processors and N row buses, matrix multiplication can be completed in O(N) time. The processor-time product of both algorithms is O(N³), which is optimal.
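As an illustration of the primitive involved, the sketch below computes a prefix sum over a 0/1 vector and uses bit-column counting to sum integers; this illustrates the operation only, not the RAPWBN bus algorithm itself.

```python
# A minimal sketch of the building block mentioned above: a prefix sum (scan)
# over a 0/1 vector, and integer summation obtained by scanning the bit
# columns of the operands.  This illustrates the primitive only, not the
# RAPWBN bus algorithm.

def prefix_sums(bits):
    """Inclusive prefix sums of a 0/1 vector (rank of each position)."""
    out, running = [], 0
    for b in bits:
        running += b
        out.append(running)
    return out

def column_sum(values, width=16):
    """Sum integers by counting ones in each bit column, weighted by 2**k."""
    return sum(prefix_sums([(v >> k) & 1 for v in values])[-1] << k
               for k in range(width))

if __name__ == "__main__":
    print(prefix_sums([1, 0, 1, 1, 0]))   # [1, 1, 2, 3, 3]
    print(column_sum([3, 5, 7, 9]))       # 24
```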
15.
《International Journal of Computer Mathematics》2012,89(3-4):231-248
The systolic concept in parallel architecture design proposed by H. T. Kung [1,2] obtains high throughput and speedups. The linear array for matrix-vector multiplication executes the algorithm in 2n − 1 time steps using 2n − 1 processors. Although the speedup obtained is very high, the efficiency is very poor (typically about 25% for problem sizes greater than 10). H. T. Kung proposed a linear systolic array using two data streams flowing in opposite directions; however, the processors in this array perform operations only at every second time step. Attempts to improve this design have been made by many researchers. Nonlinear and folding transformation techniques [3,4,5] only halve the number of processors used, but do not affect the running time. We propose the use of a fast linear systolic computation procedure to obtain a solution that uses 3n/2 processors and executes the algorithm in 3n/2 time steps, with the same cells, the same communication and the same regular data flow as the H. T. Kung linear array. Only the algorithm is restructured and organized more efficiently. The processors are now utilized at every time step and no idle steps are required.
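To make the timing vocabulary concrete, the following is a minimal software model of a linear systolic schedule for y = Ax in which cell j holds x[j] and partial sum y[i] visits cell j at time step i + j. It is a generic unidirectional design for illustration only, neither Kung's bidirectional array nor the 3n/2-processor design proposed here.

```python
def linear_array_matvec(A, x):
    """Simulate a linear systolic schedule for y = A @ x: cell j holds x[j],
    and partial sum y[i] is updated in cell j at time step t = i + j."""
    m, n = len(A), len(x)
    y = [0.0] * m
    for t in range(m + n - 1):             # global clock
        for j in range(n):                 # all cells act in parallel in hardware
            i = t - j                      # partial sum currently passing cell j
            if 0 <= i < m:
                y[i] += A[i][j] * x[j]
    return y

if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    print(linear_array_matvec(A, [1, 1]))  # [3.0, 7.0, 11.0]
```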
16.
Tsorng-Lin Chia, Kuang-Bor Wang, Zen Chen, Der-Chyuan Lou 《Information Processing Letters》2002,82(2):73-81
Distance transformation (DT) has been widely used for image matching and shape analysis. In this paper, a parallel algorithm for computing the distance transformation is presented. First, it is shown that the algorithm has an execution time of 6N−4 cycles for an N×N image, using a parallel architecture that requires ⌈N/2⌉ parallel processors. This fulfils the real-time requirement, and the execution time is independent of the image contents. In addition, a partition method is developed to process an image when the parallel architecture has a fixed number of processing elements (PEs), say two or more. The total execution time for an N×N image employing a fixed number of PEs is 2[N²/M+2(M−1)], where M is the fixed number of PEs.
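For reference, a standard sequential two-pass city-block distance transform is sketched below; the paper's ⌈N/2⌉-processor algorithm and its partitioning scheme are not reproduced.

```python
# A minimal sketch of a sequential two-pass city-block distance transform
# (distance of each pixel to the nearest feature pixel); the paper's parallel
# partitioned algorithm is not reproduced here.

def distance_transform(image):
    """image: 2-D list of 0/1, where 1 marks feature pixels."""
    n, m = len(image), len(image[0])
    INF = n + m
    d = [[0 if image[i][j] else INF for j in range(m)] for i in range(n)]
    # forward pass: propagate distances from the top-left
    for i in range(n):
        for j in range(m):
            if i > 0:
                d[i][j] = min(d[i][j], d[i - 1][j] + 1)
            if j > 0:
                d[i][j] = min(d[i][j], d[i][j - 1] + 1)
    # backward pass: propagate distances from the bottom-right
    for i in reversed(range(n)):
        for j in reversed(range(m)):
            if i < n - 1:
                d[i][j] = min(d[i][j], d[i + 1][j] + 1)
            if j < m - 1:
                d[i][j] = min(d[i][j], d[i][j + 1] + 1)
    return d

if __name__ == "__main__":
    img = [[0, 0, 0],
           [0, 1, 0],
           [0, 0, 0]]
    for row in distance_transform(img):
        print(row)            # centre pixel is 0, its neighbours 1, corners 2
```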
17.
A scheme is presented which transforms systolic programs with a two-dimensional structure to one dimension. The elementary steps of the transformation are justified by theorems in the theory of communicating sequential processes, and the scheme is demonstrated with an example in occam: matrix composition/decomposition. A previous version appeared in Proc. Int. Conf. on Mathematics of Program Construction, J. L. A. van de Snepscheut (ed.), Lecture Notes in Computer Science 375, pp. 307–324, Springer-Verlag, 1989. On leave from the Department of Computer Sciences, The University of Texas at Austin, Taylor Hall 2.124, Austin, Texas 78712-1188, USA.
18.
Parosh Abdulla 《Formal Aspects of Computing》1992,4(2):149-194
Systolic circuits have attracted considerable attention as a means of implementing parallel algorithms in areas such as linear algebra, signal processing, pattern matching, etc. A systolic circuit is composed of a number of computation cells which are connected in a regular pattern. Each cell can perform computations, store data and communicate with other cells in the circuit. We define a language to describe implementations and specifications of a class of systolic circuits whose cells compute over a commutative ring, and present a decision method to check whether or not a circuit implementation fulfils a specification. The main advantage of our approach, compared with earlier work in the field, is that the verification is performed automatically, without user interaction. We give an example of how the method may be applied to verify a convolution circuit.
19.
A VLSI systolic architecture for solving DBT-transformed fuzzy clustering problems of arbitrary size
This article presents a VLSI systolic architecture for solving fuzzy C-means clustering problems of arbitrary size with an array of fixed size. The array design assumes that the data fed in have been subjected to the DBT transformation (dense-to-band matrix transformation by triangular block partitioning). The architectural configurations of the various subarrays involved are presented together with the internal structure of their processing elements, and the performance is analysed. The modularity and regularity of the design make it highly suitable for VLSI implementation.
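As background, the fuzzy C-means iteration that such an architecture accelerates consists of alternating membership and centroid updates; the sketch below (using NumPy) shows those standard updates only, not the DBT transformation or the array mapping.

```python
# A minimal sketch of the fuzzy C-means iteration (standard membership and
# centroid updates); the DBT transformation and the systolic array mapping
# described above are not reproduced here.
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=50, seed=0):
    """X: (n_points, dim) data; c: number of clusters; m: fuzzifier (> 1)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                              # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m
        centers = (W @ X) / W.sum(axis=1, keepdims=True)
        dist = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        U = 1.0 / (dist ** (2 / (m - 1)))
        U /= U.sum(axis=0)                          # renormalise memberships
    return centers, U

if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
    centers, U = fuzzy_c_means(X, c=2)
    print(np.round(centers, 2))                     # two centres near (0,0) and (5,5)
```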
20.
Tanguy Risset 《Parallel Computing》1990,16(2-3):351-359
We show that any systolic array dedicated to matrix-matrix multiplication can also execute Gaussian elimination.
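A minimal illustration of why this is plausible: both computations sweep the same (i, j, k) index space with a multiply-accumulate-style update in the innermost loop. The sketch below shows only this structural similarity, not the paper's construction.

```python
# Both kernels below sweep an i, j, k index space with a multiply-accumulate
# in the inner body, which is the structural similarity behind the result;
# this is an illustration only, not the paper's construction.

def matmul(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]          # multiply-accumulate
    return C

def gaussian_elimination(A):
    """In-place forward elimination (no pivoting, for illustration only)."""
    n = len(A)
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            factor = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= factor * U[k][j]           # same update shape
    return U

if __name__ == "__main__":
    print(matmul([[1, 2], [3, 4]], [[1, 0], [0, 1]]))
    print(gaussian_elimination([[2.0, 1.0], [4.0, 3.0]]))  # upper-triangular result
```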