首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
AI应用对硬件算力的需求逐年增加,驱使着AI加速器不断向更高的性能演化。研究表明,AI应用的主要运算形式可以转化为矩阵乘运算,脉动阵列因为在矩阵乘运算上的独特优势,使其成为了主流矩阵乘加速技术之一。然而,矩阵在注入和流出脉动阵列时存在一定的流水线启动和排空开销,特别是支持训练的浮点脉动阵列,其MAC延时往往大于1,矩阵块间切换不及时会导致PE利用率急剧下降。针对上述问题,基于典型应用场景进行理论分析,提出了一种矩阵块间提前切换策略,能够精确计算出各种情况下的矩阵块间最优切换时刻。同时,还实现了RTL设计。经过实验对比可知,优化后的脉动阵列增加的硬件开销微乎其微,但在所有场景中均能得到性能提升。  相似文献   

2.
Computation of the permanent of a general matrix is a hard problem. However, for a sparse matrix its permanent can be obtained. In this note an efficient method is described for computing the permanent of (0,l)-matrix. The method is based on the numerical structure of a matrix defined in Cushkov [4]. We outline a method with reduced computation efforts and show that, by application on five examples, the method significantly increases the computation speed.  相似文献   

3.
基于增广矩阵束方法的平面天线阵列综合   总被引:1,自引:1,他引:0  
针对平面阵列的稀布优化问题,提出了一种基于增广矩阵束方法的减少阵元数目、求解阵元位置和设计幅度激励的优化方法。首先对期望平面阵的方向图进行采样并由采样点数据构造增广矩阵,对此矩阵进行奇异值(SVD)分解,确定在误差允许范围内所需的最小阵元数目;然后基于广义特征值分解分别计算两组特征值,并根据类ESPRIT算法对特征值进行配对;最后在最小二乘准则条件下根据正确的特征值对求解平面阵列的阵元位置和激励。仿真结果表明该算法具有较高的计算效率和数值精度。  相似文献   

4.
5.
We present a new algorithm and systolic array for adaptive beamforming. Our approach improves on McWhirter's pioneering work in two respects. First, our algorithm uses only orthogonal transformations and thus should have better numerical properties. Second, the algorithm can be implemented on one single p × p triangular array of programmable processors that offers a throughput of one residual element per cycle.  相似文献   

6.
The major concern of this paper is the systolic realization of Jacobi algorithms with a cyclic-by-rows iteration scheme. This aim can be achieved by construction of an algorithm which is well suited for parallel processing and is essentially equivalent to the above-mentioned type of algorithm. Moreover, a large variety of applications for Jacobi algorithms is presented as well as a comparison to other parallel schemes for the same problem. Finally, a systolic array is derived which requires (n + 1)2/4 processing cells and has a time complexity of O(n) for each sweep.  相似文献   

7.
Using a directed acyclic graph (DAG) model of algorithms, the authors focus on processor-time-minimal multiprocessor schedules: time-minimal multiprocessor schedules that use as few processors as possible. The Kung, Lo, and Lewis (KLL) algorithm for computing the transitive closure of a relation over a set of n elements requires at least 5n-4 parallel steps. As originally reported. their systolic array comprises n2 processing elements. It is shown that any time-minimal multiprocessor schedule of the KLL algorithm's dag needs at least n2/3 processing elements. Then a processor-time-minimal systolic array realizing the KLL dag is constructed. Its processing elements are organized as a cylindrically connected 2-D mesh, when n=0 mod 3. When n≠0 mod 3, the 2-D mesh is connected as a torus  相似文献   

8.
We present a new probabilistic algorithm to compute the Smith normal form of a sparse integer matrix . The algorithm treats A as a “black box”—A is only used to compute matrix-vector products and we do not access individual entries in A directly. The algorithm requires about black box evaluations for word-sized primes p and , plus additional bit operations. For sparse matrices this represents a substantial improvement over previously known algorithms. The new algorithm suffers from no “fill-in” or intermediate value explosion, and uses very little additional space. We also present an asymptotically fast algorithm for dense matrices which requires about bit operations, where O(MM(m)) operations are sufficient to multiply two matrices over a field. Both algorithms are probabilistic of the Monte Carlo type — on any input they return the correct answer with a controllable, exponentially small probability of error. Received: March 9, 2000.  相似文献   

9.
A simple and efficient algorithm for the bandwidth reduction of sparse symmetric matrices is proposed. It involves column-row permutations and is well-suited to map onto the linear array topology of the SIMD architectures. The efficiency of the algorithm is compared with the other existing algorithms. The interconnectivity and the memory requirement of the linear array are discussed and the complexity of its layout area is derived. The parallel version of the algorithm mapped onto the linear array is then introduced and is explained with the help of an example. The optimality of the parallel algorithm is proved by deriving the time complexities of the algorithm on a single processor and the linear array.  相似文献   

10.
Linear operators for digital contour smoothing are described. These operators are defined by circulant Toeplitz matrices and allow to smooth digital contours in the least-squares sense. They minimize the undersampling, digitizing and quantizing error and allow to calculate invariants, such as curvature, which are not possible to calculate without smoothing. A bit-level systolic array which is capable of realizing the proposed operator is described. This array is easy to implement in VLSI, because the array cells involved are very simple. Furthermore, the array is completely pipelined on the bit-level, so that it operates with a high clock frequency achieving very high throughputs.  相似文献   

11.
Using a directed acyclic graph (DAG) model of algorithms, the paper focuses on time-minimal multiprocessor schedules that use as few processors as possible. Such a processor-time-minimal scheduling of an algorithm's DAG first is illustrated using a triangular shaped 2-D directed mesh (representing, for example, an algorithm for solving a triangular system of linear equations). Then, algorithms represented by an n×n×n directed mesh are investigated. This cubical directed mesh is fundamental; it represents the standard algorithm for computing matrix product as well as many other algorithms. Completion of the cubical mesh required 3n-2 steps. It is shown that the number of processing elements needed to achieve this time bound is at least [3n2/4]. A systolic array for the cubical directed mesh is then presented. It completes the mesh using the minimum number of steps and exactly [3n 2/4] processing elements it is processor-time-minimal. The systolic array's topology is that of a hexagonally shaped, cylindrically connected, 2-D directed mesh  相似文献   

12.
Multivariate time series (MTS) data are widely available in different fields including medicine, finance, bioinformatics, science and engineering. Modelling MTS data accurately is important for many decision making activities. One area that has been largely overlooked so far is the particular type of time series where the data set consists of a large number of variables but with a small number of observations. In this paper we describe the development of a novel computational method based on Natural Computation and sparse matrices that bypasses the size restrictions of traditional statistical MTS methods, makes no distribution assumptions, and also locates the associated parameters. Extensive results are presented, where the proposed method is compared with both traditional statistical and heuristic search techniques and evaluated on a number of criteria. The results have implications for a wide range of applications involving the learning of short MTS models.  相似文献   

13.
This paper describes a computational technique that was used for determining whether a certain matrix equation, Mx = b, had an exact solution. Here, MT was a sparse 96,208 × 40,040 integer matrix. The approach is an appropriate alternative for certain matrices when memory constraints preclude using conventional techniques. The tools used are simple and their data structures well-suited for virtual memory computers. An important aspect of the method is to adopt different data structures to represent the matrix as the matrix changes over the course of the problem.  相似文献   

14.
We introduce a linear systolic array for the Longest Common Subsequence (LCS, for short) problem. We first present an array of m identical cells which computes the length of an LCS of two strings of length m and n, respectively, in linear time (i.e., in time proportional to m + n). Then we show that, by extending any cell with the systolic stack introduced by Guibas and Liang (1982), a new array can be designed to recover an LCS in linear time.  相似文献   

15.
A two-layered mesh array for matrix multiplication is presented. It computers the matrix product faster than the standard array.  相似文献   

16.
In this paper, we present new techniques for designing systolic arrays and asynchronous arrays for recursive algorithms. More specifically, we propose a systolic array with simple local interconnections for matrix multiplication which achieves optimal performance without having undesirable features such as preloading input data or global broadcasting. An asynchronous array for matrix multiplication which can speed up the total computation time significantly is also presented. The key component of the asynchronous array is a communication protocol which controls input data flow properly and efficiently. Finally, performance of the arrays is analyzed and a simulation using Occam programmed in a Transputer network is reported.  相似文献   

17.
In this paper a systolic array is presented for the solution of linear equations occurring in scientific and engineering calculations. The new systolic array is based on the parallel Quadrant Interlocking Elimination method of Evans and Hadjidimos [1].  相似文献   

18.
首先从信号与信息处理的角度阐述了波束形成所要解决的技术问题和波束形成的理论优势和方法局限,并对传统的基于空域波形采样的波束形成技术进行了再思考。其次,分析了稀疏阵列的布阵特点对波束形成技术的挑战,给出了空域、时域、频域分布式相参信号处理等关键技术及其理论性能的分析与比较结果,并利用实例分析了空、时、频协同采样克服低维欠采样模糊的可行性。最后分析了在实际应用中遇到的一些非理想因素对稀疏阵列波束形成与控制的影响及其解决方法。  相似文献   

19.
This paper describes a systolic array for the computation of n-dimensional (nD) convolutions for any positive integer n. Systolic systems usually achieve high performance by allowing computations to be pipelined over a large array of processing elements. To achieve even higher performance, the systolic array described in this paper uses a second level of pipelining by allowing the processing elements themselves to be pipelined to an arbitrary degree.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号