首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In this article, we show how the multiplication of polynomials can be performed in a pipelined fashion on a systolic array in linear time steps. The computational model consists of two linear systolic arrays with 2(n+1) processing elements used and (m+2n+2) running time steps needed, where m, n are the degrees of the two given polynomials, respectively. Since the same types of processing elements execute the same program, it is suitable for VLSI implementation. This algorithm is also proved to be correct.  相似文献   

2.
Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU–GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect result. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms.  相似文献   

3.
A study of implementation schemes for vectorized sparse element-by-element (EBE) matrix-vector multiplication used in the iterative equation solving of finite element problems is presented. The Hughes, Ferencz and Hallquist scheme and a variation are compared with Torigaki's approach. Computer timing and required memory for each scheme on the IBM 3090/VF vector computer regarding the analysis of large scale three-dimensional solid mechanics problems are presented and results compared.  相似文献   

4.
In this paper, we revisit the performance issues of the widely used sparse matrix-vector multiplication (SpMxV) kernel on modern microarchitectures. Previous scientific work reports a number of different factors that may significantly reduce performance. However, the interaction of these factors with the underlying architectural characteristics is not clearly understood, a fact that may lead to misguided, and thus unsuccessful attempts for optimization. In order to gain an insight into the details of SpMxV performance, we conduct a suite of experiments on a rich set of matrices for three different commodity hardware platforms. In addition, we investigate the parallel version of the kernel and report on the corresponding performance results and their relation to each architecture’s specific multithreaded configuration. Based on our experiments, we extract useful conclusions that can serve as guidelines for the optimization process of both single and multithreaded versions of the kernel.  相似文献   

5.
The Finite Element Method (FEM) is a computationally intensive scientific and engineering analysis tool that has diverse applications ranging from structural engineering to electromagnetic simulation. The trends in floating-point performance are moving in favor of Field-Programmable Gate Arrays (FPGAs), hence increasing interest has grown in the scientific community to exploit this technology. We present an architecture and implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications. FEM matrices display specific sparsity patterns that can be exploited to improve the efficiency of hardware designs. Our architecture exploits FEM matrix sparsity structure to achieve a balance between performance and hardware resource requirements by relying on external SDRAM for data storage while utilizing the FPGAs computational resources in a stream-through systolic approach. The architecture is based on a pipelined linear array of processing elements (PEs) coupled with a hardware-oriented matrix striping algorithm and a partitioning scheme which enables it to process arbitrarily big matrices without changing the number of PEs in the architecture. Therefore, this architecture is only limited by the amount of external RAM available to the FPGA. The implemented SMVM-pipeline prototype contains 8 PEs and is clocked at 110 MHz obtaining a peak performance of 1.76 GFLOPS. For 8 GB/s of memory bandwidth typical of recent FPGA systems, this architecture can achieve 1.5 GFLOPS sustained performance. Using multiple instances of the pipeline, linear scaling of the peak and sustained performance can be achieved. Our stream-through architecture provides the added advantage of enabling an iterative implementation of the SMVM computation required by iterative solution techniques such as the conjugate gradient method, avoiding initialization time due to data loading and setup inside the FPGA internal memory.  相似文献   

6.
We examine several VLSI architectures and compare these for their suitability for various forms of the band matrix multiplication problem. The following architectures are considered: chain, broadcast chain, mesh, broadcast mesh and hexagonally connected. The forms of the matrix multiplication problem that are considered are: band matrix × vector and band matrix × band matrix. Metrics to measure the utilization of resources (bandwidth and processors) are also proposed. An important feature of this paper is the inclusion of correctness proofs. These proofs are provided for selected designs and illustrate how VLSI designs may be proved correct using traditional mathematical tools.  相似文献   

7.
In this article, we discuss the performance modeling and optimization of Sparse Matrix-Vector Multiplication ( ) on NVIDIA GPUs using CUDA. has a very low computation-data ratio and its performance is mainly bound by the memory bandwidth. We propose optimization of based on ELLPACK from two aspects: (1) enhanced performance for the dense vector by reducing cache misses, and (2) reduce accessed matrix data by index reduction. With matrix bandwidth reduction techniques, both cache usage enhancement and index compression can be enabled. For GPU with better cache support, we propose differentiated memory access scheme to avoid contamination of caches by matrix data. Performance evaluation shows that the combined speedups of proposed optimizations for GT-200 are 16% (single-precision) and 12.6% (double-precision) for GT-200 GPU, and 19% (single-precision) and 15% (double-precision) for GF-100 GPU.  相似文献   

8.
《国际计算机数学杂志》2012,89(6):1264-1276
This paper investigates different ways of systolic matrix multiplication. We prove that in total there are 43 arrays for multiplication of rectangular matrices. We also prove that, depending on the mutual relation between the dimensions of rectangular matrices, there is either 1 or 21 arrays with minimal number of processing elements. Explicit mathematical formulae for systolic array synthesis are derived. The methodology applied to obtain 43 systolic designs is based on the modification of the synthesis procedure based on dependency vectors and space-time mapping of the dependency graph.  相似文献   

9.
A modified systolic array architecture for performing matrix vector operation is presented. The vector to be transformed is represented at bit level as a matrix having only binary elements which are multiplied with a weight vector whose ith element is 2i−1. This modification permits pipelining of verious carry-save adder stages of the multiplier and thereby permitting increased data throughput. It is shown that the throughput rate of the proposed design is independent of the vector size and the word length in contrast to the previously reported architecture by Kung and Liserson (1978). Detailed area time complexity analysis of the proposed design has also been carried out.  相似文献   

10.
The complexity of adding twon-bit numbers on a two-dimensional systolic array is investigated. We consider different constraints on the systolic array, including:
  • whether or not the input and output ports lie on the periphery of the array,
  • constraints placed on the arrival and departure times of inputs and outputs
  • . For all combinations of the above constraints, we obtain optimal tradeoffs among the resources of area, pipeline delay, and worst-case time. It turns out that there is a subtle interplay among the constraints and some of our results seem counterintuitive. For instance, we show that allowing more-significant bits to arrive earlier than less-significant bits can speed up addition by a factor of logn. We also show that multiplexing can often result in a smaller array. On the other hand, we show that some known results, such as Chazelle and Monier's bounds for arrays that have input/output ports on the perimeter, also hold in less constrained models.  相似文献   

    11.
    《国际计算机数学杂志》2012,89(1-2):103-116
    We consider systolic arrays for matrix computations involving complex elements, and show that in certain circumstances the complex calculations can be decomposed into a number of real matrix sub-problems which can be used to good effect in reducing computation times and increasing array cell efficiency. Computations involving matrix vector, matrix product and matrix factorisation are examined. It was found that matrix vector and product calculation produce arrays which have e= 1 cell efficiency and the same computation time as their real counterparts, with only an increase in hardware related to the bandwidth of the systems.

    Matrix factorisation is achieved by using 2?2 block factorisation requiring four times the hardware and is only twice as slow as the real version of the factorisation.  相似文献   

    12.
    13.
    《国际计算机数学杂志》2012,89(3-4):235-286
    The preconditioning strategy of the incomplete factorisation methods for the numerical solution of sparse linear systems is developed for VLSI architectures to yield compact systolic factorisation arrays.  相似文献   

    14.
    A two-step regularization method in which first permutation sequences and then broadcast planes are selected is proposed to design various regular iterative algorithms for matrix multiplication. The regular iterative algorithms are then spacetime mapped to regular arrays, such as mesh, cylindrical, two-layered mesh, and orbital arrays. The proposed method can be used to design regular arrays with execution time of less than N (problem size)  相似文献   

    15.
    基于高基阵列乘法器的高速模乘单元设计与实现   总被引:1,自引:0,他引:1  
    蒙哥马利模乘算法是最适合硬件实现的模乘算法,被应用在RSA密码和ECC密码的协处理器设计中.目前性能最高的是高基蒙哥马利模乘算法,分析了高基蒙哥马利算法的实现,提出了一种新的基于高基阵列乘法器的Montgomery模乘高速硬件实现结构,基于这种结构位长为n的比特模乘仅需要约n/w+6个时钟周期,该结构设计的电路只与最小单元有关,在硬件实现时可以大大提高频率,并提高设计的性能,可以设计高速的RSA和椭圆曲线密码大规模集成电路.  相似文献   

    16.
    We present a systolic algorithm to generate all the n! permutations of n given items. The computational model used is a linear systolic array consisting of n identical PEs. This algorithm requires n! time steps to solve this problem. Since any PE is identical and executes the same program, it is suitable for VLSI implementation. The correctness of the algorithm is proved. We also consider the ranking and unranking functions of permutations in this parallel algorithm  相似文献   

    17.
    Summary Using modular arithmetic we obtain the following improved bounds on the time and space complexities for n × n Boolean matrix multiplication: O(n log 2 7 lognlogloglognloglogloglogn) bit operations and O(n 2loglog n) bits of storage on a logarithmic cost RAM having no multiply or divide instruction; O(n log 2 7(logn)2–1/2log 2 7(loglog n)1/2log 2 7–1) bit operations and O(n 2log n) bits of storage on a RAM which can use indirect addressing for table lookups. The first algorithm can be realized as a Boolean circuit with O(n log 2 7lognlogloglognloglogloglogn) gates. Whenever n×n arithmetic matrix multiplication can be performed in less than O(n log 2 7) arithmetic operations, our results have corresponding improvements.This work was supported in part by the Office of Naval Research under contract N00014-67-0204-0063, by the National Research Council of Canada under grant A4307, and by the National Science Foundation under grants MCS76-17321 and GJ-43332  相似文献   

    18.
    This paper describes the application of a new parallel architecture—Instruction systolic array (ISA)-for the interpolation and evaluation of polynomials using a linear array of processors. It also demonstrates a systematic top-down design of instruction systolic arrays. The periods of the resulting algorithms are O(n) for interpolation and O(1) for evaluation, where n is the degree of the polynomial.  相似文献   

    19.
    Conclusion The proposed method for solving Boolean differential equations relies on matrix interpretation of the representation and solution of Boolean differential equations. The algorithm is accordingly interpreted in the language of parallel computations, producing an efficient solution of the problem on parallel structures. This approach is consistent with the requirements and development trends in VLSI technology. Translated from Kibernetika i Sistemnyi Analiz, No. 1, pp. 36–53, January–February, 1996.  相似文献   

    20.
    椭圆柱体二维液态声子晶体声波禁带的研究   总被引:1,自引:0,他引:1  
    将椭圆柱体引入2维声子晶体中,采用平面波展开法计算了该系统的声波禁带结构.对于不同的椭圆柱体截面形状以及旋转角度,该体系都发现了完全禁带,但其禁带的位置与大小有很大不同.当晶格常数a1=4cm,a2=3.2cm,填充率F=0.35时,椭圆柱体截面不旋转的体系只产生一个禁带,其宽度为0.453,而截面旋转π/4的体系产生3个声波禁带,其宽度分别为0.458,0.023和0.062.研究结果表明:在这种2维非均匀液态体系中,声波禁带结构受到填充率,椭圆柱体截面形状以及旋转角度的影响.  相似文献   

    设为首页 | 免责声明 | 关于勤云 | 加入收藏

    Copyright©北京勤云科技发展有限公司  京ICP备09084417号