首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
We present a systolic algorithm to generate all the n! permutations of n given items. The computational model used is a linear systolic array consisting of n identical PEs. This algorithm requires n! time steps to solve this problem. Since any PE is identical and executes the same program, it is suitable for VLSI implementation. The correctness of the algorithm is proved. We also consider the ranking and unranking functions of permutations in this parallel algorithm  相似文献   

2.
For an arbitrary n × n matrix A and an n × 1 column vector b, we present a systolic algorithm to solve the dense linear equations Ax = b. An important consideration is that the pivot row can be changed during the execution of our systolic algorithm. The computational model consists of n linear systolic arrays. For 1 ≤ in, the ith linear array is responsible to eliminate the ith unknown variable xi of x. This algorithm requires 4n time steps to solve the linear system. The elapsed time unit within a time step is independent of the problem size n. Since the structure of a PE is simple and the same type PE executes the identical instructions, it is very suitable for VLSI implementation. The design process and correctness proof are considered in detail. Moreover, this algorithm can detect whether A is singular or not.  相似文献   

3.
A linear rotation based algorithm is proposed for solving linear system equations, Ax = b. This algorithm modified the conventional Gaussian elimination method and can avoid the problems of numerical singularity and ill condition. In this study, the implementation of a trapezoidal systolic array of n2/2 + n −2 processors as well as a linear array of n processors are accomplished for this algorithm. The trapezoidal systolic array performs the triangularization of a matrix A by using the modified linear rotation algorithm; while the linear array performs the backward substitution for evaluating the solution of x. The computing time for solving a linear equation system will be O(5n) time units. Also an implicit representation of the elimination factor by means of the sign parameter sequence instead of an numerical value is introduced for simplifying the hardware complexity. It is clear that this systolic architecture is simple, uniform, and regular, and therefore well suitable for the implementation of a VLSI chip.  相似文献   

4.
A systolic array for the solution of the assignment problem is presented. The algorithm requires O(n2) time on an orthogonally connected array of (n + 2) * (n + 2) cells consisting of simple adders and control logic. The design is area efficient and incorporates the new concept of a Systolic Control Ring (SCR) to generate the necessary systolic wavefronts in any orientation within the design, while special cells are positioned only on the periphery of the design. The design was simulated and tested by an OCCAM program.  相似文献   

5.
A systolic array is presented to improve numerical approximations to integrals using Richardson's extrapolation procedure in the form of Romberg integration. Two designs are presented, the first is an intuitive linear systolic array, the second, a systolic ring using approximately 1/3 of the cells of the first array. Both systolic arrays have a computation time of 3n cycles, which is a significant improvement on the O(n2) steps required to construct the extrapolation table sequentially.  相似文献   

6.
This paper presents a new systolic algorithm for thecompletesolution of a system ofNlinear equations in (N2/2 +O(N)) time steps using 2Nprocessing elements (PEs). It is based on a variant of the Gaussian elimination (GE) algorithm called the successive GE and is faster than any existing GE based algorithm usingO(N) PEs. We also suggest two fault tolerant schemes that tolerate up toNPE failures. The first scheme is a time redundancy based approach with no hardware overhead and 100% time overhead. This scheme can tolerate up toNPE failures. The second scheme is based on algorithm based fault tolerance (ABFT) and usesNextra PEs to tolerate up toN− 1 PE failures with very little time overhead. The number of errors that can be detected/corrected in both schemes is more than that in any existing fault tolerant systolic array.  相似文献   

7.
本文在深入分析K-means算法计算特征的基础上,基于FPGA平台提出并实现了一种细粒度的并行浮点K-means算法。设计采用了阵列多PE并行处理的任务划分策略,实现了处理单元间的负载平衡,采用数据驱动的流水线隐藏片外存储访问,设计了一种基于脉动阵列结构的主从多PE并行计算阵列,并在单片FPGA(XC5VLX330)上成功集成了4个PE。实验结果表明,我们提出的K-means算法加速器结构具备良好的可扩展性。通过实验测试,我们的实现方案相对于Pentium 4 2.66 GHz单处理器程序达到了15倍的加速比。  相似文献   

8.
We show that an n-node complete binary tree can be embedded into an n-node linear array such that the maximum edge length is n/log n, and the distribution of the tree nodes is regular in the sense that if each node in the tree array is a finite-state machine and each edge is a communication channel, then the message exchanges (i.e., communication) among the tree nodes can be easily simulated by the linear array which has also finite-state machines for its nodes. The embedding is optimal with respect to the maximum edge length. The motivation for this is the proof of the following result: every cellular (or (log n)-depth iterative) tree array running in T(n) time can be simulated by a linear cellular (or iterative) array in nT(n)/log n time.  相似文献   

9.
We develop a simple mapping technique to design linear systolic arrays. The basic idea of our technique is to map the computations of a certain class of two-dimensional systolic arrays onto one-dimensional arrays. Using this technique, systolic algorithms are derived for problems such as matrix multiplication and transitive closure on linearly connected arrays of PEs with constant I/O bandwidth. Compared to known designs in the literature, our technique leads to modular systolic arrays with constant hardware in each PE, few control lines, lexicographic data input/output, and improved delay time. The unidirectional flow of control and data in our design assures implementation of the linear array in the known fault models of Wafer Scale Integration.  相似文献   

10.
Consdier I(z) = ∫ba w(t)f(t, z) dt, f(t, z) = (1 + t/z)−1. It is known that generalized Gaussian quadrature of I(z) leads to approximations which occupy the (n, n + r − 1) positions of the Padé matrix table for I(z). Here r is a positive integer or zero. In a previous paper the author developed a series representation for the error in Gaussian quadrature. This approach is now used to study the error in the Padé approximations noted. Three important examples are treated. Two of the examples are generalized to the case where f(t, z) = (1 + t/z)v.  相似文献   

11.
In this paper a regular bidirectional linear systolic array (RBLSA) for computing all-pairs shortest paths of a given directed graph is designed. The obtained array is optimal with respect to a number of processing elements (PE) for a given problem size. The execution time of the array has been minimized. To obtain RBLSA with optimal number of PEs, the accommodation of the inner computation space of the systolic algorithm to the projection direction vector is performed. Finally, FPGA-based reprogrammable systems are revolutionizing certain types of computation and digital logic, since as logic emulation systems they offer some orders of magnitude speedup over software simulation; herein, a FPGA realization of the RBLSA is investigated and the performance evaluation results are discussed.  相似文献   

12.
This paper describes several parallel algorithms for image edge relaxation on array processors with different numbers of processing elements (PEs) connected by a mesh or hypercube network. The time complexity of Prager's original edge relaxation scheme is O(N2) per iteration using floating-point operations on a sequential machine, where N2 is the number of pixels in the image. Modifications to the scheme are made so that no multiplications are employed and only integer operations are required. Moreover, with parallel processing, the time complexity per iteration is reduced to some constant value. A time complexity analysis on two parallel algorithms is performed. Although the algorithm on an array processor with 4N2 PEs achieved higher degree of parallelism, the algorithm with N2 PEs is preferred. Further modifications on the latter algorithm are made to accommodate to fewer PEs.  相似文献   

13.
We present a parallel solution to the unbounded knapsack problem on a linear systolic array. It achieves optimal speedup for this well-known, NP-hard problem on a model of computation that is weaker than the PRAM. Our array is correct by construction, as it is formally derived by transforming a recurrence equation specifying the algorithm. This recurrence has dynamic dependencies, a property that puts it beyond the scope of previous methods for automatic systolic synthesis. Our derivation thus serves as a case study. We generalize the technique and propose a systematic method for deriving systolic arrays by nonlinear transformations of recurrences. We give sufficient conditions that the transformations must satisfy, thus extending systolic synthesis methods. We address a number of pragmatic considerations: implementing the array on only a fixed number of PEs, simplifying the control to just two counters and a few latches, and loading the coefficients so that successive problems can be pipelined without any loss of throughput. Using a register level model of VLSI, we formulate a nonlinear optimization problem to minimize the expected running time of the array. The analytical solution of this problem allows us to choose the memory size of each PE in an optimal manner  相似文献   

14.
We present an optimum bit-parallel/word-sequential systolic convolver. Our design is the best one among the previous many convolvers in the sense that its optimality in time and space performances is simultaneously attained without augmenting any global control, broadcasting, initial-data-preloading, and/or multi-sequential or parallel I/O ports which were allowed in most of the previous designs. As an application of our convolver we give a systolic polynomial divider which can compute the polynomial division in exactly n + O(1) optimum steps on [min(nm, m)/2]+O(1) systolic cells for the division of any degree n polynomial by any degree m polynomial (n m).  相似文献   

15.
With the advent of VLSI it has become possible to map parallel algorithms for compute-bound problems directly on silicon. Systolic architecture is very good candidate for VLSI implementation because of its regular and simple design, and regular communication pattern. In this paper, a systolic algorithm and corresponding systolic architecture, a linear systolic array, for the scanline-based hidden surface removal problem in three-dimensional computer graphics have been proposed. The algorithm is based on the concept of sample spans or intervals. The worst case time taken by the algorithm is O(n), n being the number of segments in a scanline. The time taken by the algorithm for a given scene depends on the scene itself, and on an average considerable improvement over the worst case behaviour is expected. A pipeline scheme for handling the I/O process has also been proposed which is suitable for VLSI implementation of the algorithm.  相似文献   

16.
In this paper, we derive time-minimal systolic arrays for Gaussian elimination and the Algebraic Path Problem (APP) that use a minimal number of processors. For a problem of size n, we obtain an execution time T(n) = 3n −1 using A(n) = n2/4+O(n) processors for Gaussian elimination, and T(n) = 5n −2 and A(n) = n3/+O(n) for the APP.  相似文献   

17.
We present particle simulations of natural convection of a symmetrical, nonlinear, three-dimensional cavity flow problem. Qualitative studies are made in an enclosure with localized heating. The assumption is that particles interact locally by means of a compensating Lennard-Jones type force F, whose magnitude is given by −G/rp + H/rq.

In this formula, the parameters G, H, p, q depend upon the nature of the interacting particles and r is the distance between two particles. We also consider the system to be under the influence of gravity. Assuming that there are n particles, the equations relating position, velocity and acceleration at time tk = kΔt, K = 0, 1, 2, …, are solved simultaneously using the “leap-frog” formulas. The basic formulas relating force and acceleration are Newton's dynamical equations Fi,k = miai,k, I = 1, 2, 3, …, n, where mi is the mass of the ith particle.

Extensive and varied computations on a CRAY X - MP/24 are described and discussed, and comparisons are made with the results of others.  相似文献   


18.
In this paper a recursive method is developed to obtain the steady state probability distribution of the number in system at arbitrary and departure time epochs of a single server state-dependent arrival rate queue λ(n)/G/1/K in which the arrival process is Markovian with arrival rates λ(n) which depend on the number of customers n in the system and general service time distribution. It is assumed that there exists an integer K such that λ(n) > 0 for all 0 n < K and λ(n) = 0 for all n K. Numerical results have been presented for many queueing models by suitably defining the function λ(n). These include machine interference model, queues with balking, queues with finite waiting space and machine interference model with finite waiting space. These models have wide application in computer/communication networks.  相似文献   

19.
We investigate time-constructible functions in one-dimensional cellular automata (CA). It is shown that (i) if a function t(n) is computable by an O(t(n)−n)-time Turing machine, then t(n) is time constructible by CA and (ii) if two functions are time constructible by CA, then the sum, product, and exponential functions of them are time constructible by CA. As an application, it is shown that if t1(n) and t2(n) are time constructible functions such that limn→∞ t1(n)/t2(n) = 0 and t1(n)n, then there is a language which can be recognized by a CA in t2(n) time but not by any CA in t1(n) time.  相似文献   

20.
In this paper we present an O(log n) time parallel algorithm for arithmetic expression evaluation, on an n × n processor array with reconfigurable bus system, where n is the sum of the number of operators and constants in the expression. The basic technique involved here is leaves-cutting (rake operation), as in the case of PRAM model algorithms available in the literature for this problem. The input to our algorithm is assumed to be the binary tree associated with a given expression (also known as expression tree with n number of nodes). Our algorithm is faster compared to the previous best time for expression evaluation on mesh connected computers which is O(√n).  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号