1.

Double pipelines and fast systolic designs on linear arrays

《International Journal of Computer Mathematics》2012,89(4):403-429

Fast systolic computation and double pipelines were designed to achieve implementations that use fewer processors and execute the algorithm in less time than conventional systolic algorithms. H. T. Kung and C. S. Leiserson in [1-3] proposed systolic algorithms realized on a bidirectional linear array where two data streams flow in opposite directions. The data flow introduced for this solution requires data elements to appear in the data stream at every second time step, which is the only way for each element to meet all the elements from the other data stream.

In [4, 5] the authors proposed a linear array where one data stream is double mapped while the elements from the other data stream flow in consecutive time steps. The procedure to obtain such a solution is called a fast systolic design. It was shown in [5] that double pipeline solutions are obtained by applying separating and grouping techniques in addition to this design.

Several more efficient systolic designs have been proposed for the matrix-vector multiplication algorithm in [4, 5]. Here we implement these techniques on other linear array algorithms, such as the triangular linear system solver, string comparison, convolution, correlation, and MA and AR filters.

2.

Polynomial multiplication on systolic arrays

《International Journal of Computer Mathematics》2012,89(1-2):43-54

In this article, we show how the multiplication of polynomials can be performed in a pipelined fashion on a systolic array in a linear number of time steps. The computational model consists of two linear systolic arrays that use 2(n+1) processing elements and need m+2n+2 time steps, where m and n are the degrees of the two given polynomials. Since processing elements of the same type execute the same program, it is suitable for VLSI implementation. The algorithm is also proved to be correct.
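As a rough illustration of the pipelined idea, the sketch below multiplies two polynomials by letting each of n+1 cells hold one coefficient of the second polynomial while the coefficients of the first stream past, one per step. It is a sequential simulation under assumed conventions (coefficient lists ordered lowest degree first, the hypothetical name `poly_mul_streamed`), not the two-array design of the article, and it does not reproduce the m+2n+2 step count.

```python
def poly_mul_streamed(a, b):
    """Multiply two polynomials given as coefficient lists, lowest degree first.

    Assumed layout (illustrative only): cell j permanently holds b[j], and
    coefficient a[i] reaches cell j at step i + j, so every product a[i]*b[j]
    formed at a given step contributes to the same output coefficient.
    """
    m, n = len(a) - 1, len(b) - 1
    result = [0] * (m + n + 1)
    for step in range(m + n + 1):      # one output coefficient per step
        for j in range(n + 1):         # the n + 1 cells work "in parallel"
            i = step - j               # index of the a-coefficient now in cell j
            if 0 <= i <= m:
                result[step] += a[i] * b[j]
    return result
```

For example, `poly_mul_streamed([1, 2], [1, 3])` multiplies (1 + 2x) by (1 + 3x) and returns the coefficients of 1 + 5x + 6x².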

3.

An orthogonal systolic array for the algebraic path problem

Y. Robert D. Trystram 《Computing》1987,39(3):187-199

This paper is devoted to the design of an orthogonal systolic array of n(n+1) elementary processors which can solve any instance of the Algebraic Path Problem within only 5n−2 time steps, and is compared with the 7n−2 time steps of the hexagonal systolic array of Rote [8].

4.

Systolic algorithm for the solution of dense linear equations

《International Journal of Computer Mathematics》2012,89(1-4):159-167

For an arbitrary n×n matrix A and n×m matrix B, a systolic algorithm to solve the linear systems AX=B is presented. The computational model used consists of n(n+1) PEs. The number of PEs used is independent of m. This algorithm requires 4n+m−2 time steps to solve the linear systems. Since the structure of a PE is simple and PEs of the same type execute the identical program, it is very suitable for VLSI implementation. Moreover, if A is a singular matrix, this can be detected during the execution of the systolic algorithm.

5.

Computing the Euclidean Distance Transform on a Linear Array of Processors

Gavrilova Marina L. Alsuwaiyel Muhammad H. 《The Journal of supercomputing》2003,25(2):177-185

Given an n×n binary image of white and black pixels, we present an optimal parallel algorithm for computing the distance transform and the nearest feature transform using the Euclidean metric. The algorithm employs the systolic computation to achieve O(n) running time on a linear array of n processors.

6.

An efficient all-parses systolic algorithm for general context-free parsing

Oscar H. Ibarra Michael A. Palis 《International journal of parallel programming》1990,19(4):295-331

The problem of outputting all parse trees of a string accepted by a context-free grammar is considered. A systolic algorithm is presented that operates in O(m·n) time, where m is the number of distinct parse trees and n is the length of the input. The systolic array uses n² processors, each of which requires at most O(log n) bits of storage. This is much more space-efficient than a previously reported systolic algorithm for the same problem, which required O(n log n) space per processor. The algorithm also extends previous algorithms that only output a single parse tree of the input. Research supported in part by NSF Grants DCR-8420935 and DCR-8604603.

7.

Parallel algorithms for the reversal distance of signed permutations on the PRAM and LARPBS models

Shen Yifei Chen Guoliang Zhang Qiangfeng 《Journal of Software》2007,18(11):2683-2690

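New parallel algorithms for computing the reversal distance of signed genome permutations are given on two important parallel computation models. Based on the Hannenhalli-Pevzner theory, the parallel algorithms are designed in three main parts: constructing the breakpoint graph, counting the cycles in the breakpoint graph, and counting the hurdles in the breakpoint graph. On the CREW-PRAM model, the algorithm uses O(n²) processors and runs in O(log² n) time; on the linear array with a reconfigurable pipelined bus system (LARPBS) model, the algorithm uses O(n³) processors and runs in O(log n) time.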

8.

Controlling the data space of tree structured computations

I. Gottlieb B. Obreni 《Information and Computation》2003,187(2):246-276

We study the problem of scheduling a parallel computation so as to minimize the maximum number of data items extant at any point in the execution. Computations are expressed as directed graphs, where nodes represent primitive operations and arcs represent data dependences. The result of an operation is extant after the operation executes and until all immediate successors have begun execution. Our goal is to schedule computations so as to minimize both the maximum space required for extant data and the overall completion time. The classical problem of multiprocessor scheduling with precedence constraints is a special case of our problem, obtained by disregarding the data-space constraint. This special case is NP-complete for general graphs; a time-optimal multiprocessor scheduling algorithm is known only for the class of arbitrary trees. For this same class of arbitrary trees we present a multiprocessor scheduling algorithm where the completion time is optimal within a constant factor, while the data-space size exceeds the optimal by a factor not greater than the number of processors. For an arbitrary n-node precedence tree T of in-degree Δ, we present:

(1) an algorithm for evaluating the lower bound on the size of data space required for executing T, regardless of the completion time or number of processors;
(2) a proof that the lower bound of Part 1 may be as large as (Δ−1) lg n but not larger;
(3) a single-processor schedule that executes T in time that equals the optimal, while creating the data space of size equal to the lower bound of Part 1;
(4) an ω-processor schedule that executes T in time not exceeding three times the optimal, while creating the data space of size that exceeds the lower bound of Part 1 by a factor not greater than ω;
(5) a proof that for every number of processors ω and for every 0<ε<1, there exist infinitely many trees such that every ω-processor schedule that executes any of these trees in time not exceeding (2−ε) times the optimal requires a token space as large as that created by the schedule of Part 4, while the schedule of Part 4 executes every such tree in optimal time.

The family of complete binary trees provides an example where our schedule achieves an exponential improvement in the size of the data space, compared to that of the classical time-optimal schedule.

9.

Parallel generation of permutations on systolic arrays

Chau-Jy Lin 《Parallel Computing》1990,15(1-3):267-276

We present a systolic algorithm to generate all the n! permutations of n given items. The computational model used is a linear systolic array consisting of n identical PEs. This algorithm requires n! time steps to solve this problem. Since all PEs are identical and execute the same program, it is suitable for VLSI implementation. The correctness of the algorithm is proved. We also consider the ranking and unranking functions of permutations in this parallel algorithm.
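Ranking and unranking can be illustrated with a small sequential sketch. The version below assumes lexicographic order over permutations of 0..n−1 and uses the factorial number system; the abstract does not say which ordering the paper uses, so the ordering and the function names `rank` and `unrank` are assumptions made for illustration.

```python
from math import factorial


def rank(perm):
    """Return the lexicographic index of a permutation of 0..n-1."""
    n = len(perm)
    r = 0
    for i in range(n):
        # Count later items smaller than perm[i]; each one accounts for
        # (n - 1 - i)! permutations that precede this one.
        smaller = sum(1 for x in perm[i + 1:] if x < perm[i])
        r += smaller * factorial(n - 1 - i)
    return r


def unrank(r, n):
    """Return the permutation of 0..n-1 whose lexicographic index is r."""
    items = list(range(n))
    perm = []
    for i in range(n, 0, -1):
        idx, r = divmod(r, factorial(i - 1))
        perm.append(items.pop(idx))
    return perm
```

With these definitions `unrank(rank(p), len(p))` returns `p` again, and the ranks 0 through n!−1 enumerate all n! permutations exactly once.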

10.

Top-down designs of instruction systolic arrays for polynomial interpolation and evaluation

H. Schroder 《Journal of Parallel and Distributed Computing》1989,6(3)

This paper describes the application of a new parallel architecture, the instruction systolic array (ISA), to the interpolation and evaluation of polynomials using a linear array of processors. It also demonstrates a systematic top-down design of instruction systolic arrays. The periods of the resulting algorithms are O(n) for interpolation and O(1) for evaluation, where n is the degree of the polynomial.

11.

A Fast Efficient Parallel Hough Transform Algorithm on LARPBS

Chen Ling Chen Hongjian Pan Yi Chen Yixin 《The Journal of supercomputing》2004,29(2):185-195

A parallel algorithm for the Hough transform on a linear array with a reconfigurable pipelined bus system (LARPBS) is presented. If the number of θ-values to be considered is m, then for an image with n × n pixels the algorithm can complete the Hough transform in O(1) time using mn² processors and achieve optimal speed and efficiency. We also illustrate how to partition data and perform the algorithm on a LARPBS with fewer than mn² processors, and hence show that the algorithm is highly scalable.

12.

An incremental primal sieve

S. A. Bengelloun 《Acta Informatica》1986,23(2):119-125

A new algorithm is presented for finding all primes between 2 and an incrementally increasing value n. The algorithm executes in linear arithmetic time and space. An outline is given to show how previously developed techniques can be applied to improve the efficiency of the algorithm to O(n/log log n) time and space.
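To make the notion of an incremental sieve concrete, here is a minimal sketch of a well-known dictionary-based variant that handles one new value of n at a time. It is not the algorithm of the paper and does not achieve its linear or O(n/log log n) bounds; the function name `primes_up_to_incremental` is hypothetical.

```python
def primes_up_to_incremental(limit):
    """Return all primes in [2, limit], processing n = 2, 3, ... one at a time."""
    composites = {}  # upcoming composite -> primes that will "hit" it
    primes = []
    for n in range(2, limit + 1):
        if n in composites:
            # n is composite: advance each recorded prime to its next multiple.
            for p in composites.pop(n):
                composites.setdefault(n + p, []).append(p)
        else:
            # No smaller prime reached n, so n is prime; its first
            # multiple not already covered by a smaller prime is n * n.
            primes.append(n)
            composites[n * n] = [n]
    return primes
```

Because the state after handling n depends only on values up to n, the loop can be resumed with a larger limit without redoing earlier work, which is the property the word "incremental" refers to.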

13.

Efficient parallel and sequential algorithms for 4-coloring perfect planar graphs

He Xin 《Algorithmica》1990,5(1-4):545-559

We present an efficient algorithm for 4-coloring perfect planar graphs. The best previously known algorithm for this problem takes O(n^(3/2)) sequential time, or O(log⁴ n) parallel time with O(n³) processors. The sequential implementation of our algorithm takes O(n log n) time. The parallel implementation of our algorithm takes O(log³ n) time with O(n) processors on a PRAM.

14.

An Efficient Systolic Algorithm for the Longest Common Subsequence Problem

Lin Yen-Chun Chen Jyh-Chian 《The Journal of supercomputing》1998,12(4):373-385

A longest common subsequence (LCS) of two strings is a common subsequence of the two strings of maximal length. The LCS problem is to find an LCS of two given strings and the length of the LCS (LLCS). In this paper, a fast linear systolic algorithm that improves on previous systolic algorithms for solving the LCS problem is presented. For two given strings of length m and n, where m ≥ n, the LLCS and an LCS can be found in m + 2n − 1 time steps. This algorithm achieves the tight lower bound of the time complexity in the situation where symbols are input sequentially to a linear array of n processors. The systolic algorithm can be modified to take only m + n steps on multicomputers by using the scatter operation.
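The reason a linear array can finish in a number of steps linear in m + n is that all cells on one anti-diagonal of the dynamic programming table are independent. The sketch below computes the LLCS diagonal by diagonal and counts the diagonals; it is a sequential illustration of that wavefront order, not the systolic algorithm of the paper, and the name `lcs_length_wavefront` is hypothetical.

```python
def lcs_length_wavefront(a, b):
    """Return (LLCS, number of anti-diagonal sweeps) for strings a and b."""
    m, n = len(a), len(b)
    if m == 0 or n == 0:
        return 0, 0
    table = [[0] * (n + 1) for _ in range(m + 1)]
    sweeps = 0
    for d in range(2, m + n + 1):  # cells (i, j) with i + j == d
        # Every cell on this diagonal depends only on diagonals d-1 and d-2,
        # so a parallel machine could update them all in one step.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
        sweeps += 1
    return table[m][n], sweeps
```

For non-empty inputs the sweep count is m + n − 1, which matches the order of the step counts quoted in the abstract but not their exact constants.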

15.

A systolic algorithm for solving dense linear systems

Chau-Jy Lin 《Computers & Mathematics with Applications》1996,32(12):77-91

For an arbitrary n × n matrix A and an n × 1 column vector b, we present a systolic algorithm to solve the dense linear equations Ax = b. An important consideration is that the pivot row can be changed during the execution of our systolic algorithm. The computational model consists of n linear systolic arrays. For 1 ≤ i ≤ n, the i-th linear array is responsible for eliminating the i-th unknown variable x_i of x. This algorithm requires 4n time steps to solve the linear system. The elapsed time within a time step is independent of the problem size n. Since the structure of a PE is simple and PEs of the same type execute identical instructions, it is very suitable for VLSI implementation. The design process and correctness proof are considered in detail. Moreover, this algorithm can detect whether A is singular or not.
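The two features highlighted here, changing the pivot row during elimination and detecting a singular A, can be seen in an ordinary sequential Gaussian elimination. The sketch below is that textbook method with partial pivoting, offered only as a reference point; it is not the systolic design, and the name `solve_dense` and the tolerance `eps` are assumptions.

```python
def solve_dense(A, b, eps=1e-12):
    """Solve Ax = b by Gaussian elimination with partial pivoting.

    Returns the solution as a list of floats, or None when A is
    (numerically) singular. The tolerance eps is an illustrative choice.
    """
    n = len(A)
    # Work on an augmented copy [A | b] so the inputs are left untouched.
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Choose the row with the largest entry in this column as pivot row.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < eps:
            return None  # no usable pivot: A is singular
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate the unknown x_col from all rows below the pivot row.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x
```

Calling `solve_dense([[0, 2], [1, 1]], [4, 3])` needs a row swap because the first diagonal entry is zero, while `solve_dense([[1, 2], [2, 4]], [1, 2])` returns `None` because the matrix is singular.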

16.

An Efficient Parallel Algorithm for the Layered Planar Monotone Circuit Value Problem

Vijaya Ramachandran Honghua Yang 《Algorithmica》1997,18(3):384-404

A planar monotone circuit (PMC) is a Boolean circuit that can be embedded in the plane and that contains only AND and OR gates. A layered PMC is a PMC in which all input nodes are in the external face, and the gates can be assigned to layers in such a way that every wire goes between gates in successive layers. Goldschlager, Cook and Dymond, and others have developed NC² algorithms to evaluate a layered PMC when the output node is in the same face as the input nodes. These algorithms require a large number of processors (Ω(n⁶), where n is the size of the input circuit). In this paper we give an efficient parallel algorithm that evaluates a layered PMC of size n in time using only a linear number of processors on an EREW PRAM. Our parallel algorithm is the best possible to within a polylog factor, and is a substantial improvement over the earlier algorithms for the problem. Received April 18, 1994; revised April 7, 1995.

17.

A Parallel Algorithm for Computing the Generalized Singular Value Decomposition

Bai Z. J. 《Journal of Parallel and Distributed Computing》1994,20(3)

A parallel algorithm for computing the generalized singular value decomposition of two matrices A and B having the same number of columns is described in this paper. The algorithm is designed for efficient implementation on distributed-memory parallel computer architectures. The time cost is O(n²) units for parallel preprocessing, and O(n²/p) units for the GSVD of two upper trapezoidal matrices, where p is the dimension of the triangular array of processors.

18.

Linear rotation based algorithm and systolic architecture for solving linear system equations

I-Chang Jou 《Parallel Computing》1989,11(3):367-379

A linear rotation based algorithm is proposed for solving systems of linear equations, Ax = b. This algorithm modifies the conventional Gaussian elimination method and can avoid the problems of numerical singularity and ill conditioning. In this study, a trapezoidal systolic array of n²/2 + n − 2 processors and a linear array of n processors are implemented for this algorithm. The trapezoidal systolic array performs the triangularization of the matrix A by using the modified linear rotation algorithm, while the linear array performs the backward substitution to evaluate the solution x. The computing time for solving a linear equation system is O(5n) time units. In addition, an implicit representation of the elimination factor by means of a sign parameter sequence, instead of a numerical value, is introduced to reduce the hardware complexity. This systolic architecture is simple, uniform, and regular, and therefore well suited to implementation as a VLSI chip.

19.

A systolic algorithm for Riccati and Lyapunov equations

J.-P. Charlier P. Van Dooren 《Mathematics of Control, Signals, and Systems (MCSS)》1989,2(2):109-136

Riccati and Lyapunov equations can be solved using the recursive matrix sign method applied to symmetric matrices constructed from the corresponding Hamiltonian matrices. In this paper we derive an efficient systolic implementation of that algorithm where the LDL^T and UDU^T decompositions of those symmetric matrices are propagated. As a result the solution of a class of Riccati and Lyapunov equations can be obtained in O(n) time steps on a bidimensional (triangular) grid of O(n²) processors, leading to an optimal speedup.

20.

Parallel comparison of run-length-encoded strings on a linear systolic array

Alessandro Bogliolo Valerio Freschi 《Information Sciences》2007,177(1):231-238

The length of the longest common subsequence (LCS) between two strings of M and N characters can be computed by an O(M × N) dynamic programming algorithm that can be executed in O(M + N) steps by a linear systolic array. It has recently been shown that the LCS between run-length-encoded (RLE) strings of m and n runs can be computed by an O(nM + Nm − nm) algorithm that could be executed in O(m + n) steps by parallel hardware. However, the algorithm cannot be directly mapped onto a linear systolic array because of its irregular structure. In this paper, we propose a modified algorithm that exhibits a more regular structure at the cost of a marginal reduction of the efficiency of RLE. We outline the algorithm and discuss its mapping onto a linear systolic array.
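The distinction between M characters and m runs is easy to see with a small run-length encoder. The sketch below only illustrates the encoding itself, using hypothetical helper names `rle_encode` and `rle_decode`; it does not implement the LCS algorithm on RLE strings.

```python
from itertools import groupby


def rle_encode(s):
    """Encode a string as a list of (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(runs):
    """Rebuild the original string from its (character, run length) pairs."""
    return "".join(ch * count for ch, count in runs)
```

For the string `"aaabccdd"` the encoder yields four runs for eight characters, so here M = 8 and m = 4; the savings of the RLE-based bound grow as runs get longer.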