Similar Documents
20 similar documents found.
1.
The Finite Element Method (FEM) is a computationally intensive scientific and engineering analysis tool with diverse applications ranging from structural engineering to electromagnetic simulation. Trends in floating-point performance are moving in favor of Field-Programmable Gate Arrays (FPGAs), so interest has grown in the scientific community in exploiting this technology. We present an architecture and implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications. FEM matrices display specific sparsity patterns that can be exploited to improve the efficiency of hardware designs. Our architecture exploits the FEM matrix sparsity structure to achieve a balance between performance and hardware resource requirements by relying on external SDRAM for data storage while utilizing the FPGA's computational resources in a stream-through systolic approach. The architecture is based on a pipelined linear array of processing elements (PEs) coupled with a hardware-oriented matrix striping algorithm and a partitioning scheme that enables it to process arbitrarily large matrices without changing the number of PEs. The architecture is therefore limited only by the amount of external RAM available to the FPGA. The implemented SMVM-pipeline prototype contains 8 PEs and is clocked at 110 MHz, obtaining a peak performance of 1.76 GFLOPS. For the 8 GB/s of memory bandwidth typical of recent FPGA systems, this architecture can achieve 1.5 GFLOPS sustained performance. Using multiple instances of the pipeline, linear scaling of peak and sustained performance can be achieved. Our stream-through architecture provides the added advantage of enabling an iterative implementation of the SMVM computation required by iterative solution techniques such as the conjugate gradient method, avoiding the initialization time due to data loading and setup inside the FPGA internal memory.
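As a point of reference for the computation the pipeline streams, the following is a minimal software sketch of sparse matrix-vector multiplication over a CSR matrix processed stripe by stripe. The stripe width, the SciPy-based setup and all function names are illustrative assumptions; this is not the paper's hardware striping algorithm.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def smvm_striped(A_csr, x, stripe_cols=1024):
    """Multiply a CSR sparse matrix by a dense vector, one column stripe
    at a time (illustrative stand-in for the hardware striping scheme)."""
    n_rows, n_cols = A_csr.shape
    y = np.zeros(n_rows)
    for start in range(0, n_cols, stripe_cols):
        stop = min(start + stripe_cols, n_cols)
        # Each stripe contributes a partial result that is accumulated,
        # mirroring how partial sums flow through the PE pipeline.
        y += A_csr[:, start:stop] @ x[start:stop]
    return y

A = sparse_random(2000, 2000, density=0.01, format="csr")
x = np.random.rand(2000)
assert np.allclose(smvm_striped(A, x), A @ x)
```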

2.
In many application domains, large-scale floating-point matrix multiplication is one of the most time-consuming computational kernels. Emerging applications often involve large matrices in which at least one dimension is very small; we call matrices with this property non-uniform matrices. Because the on-chip memory available on an FPGA for storing intermediate results is very limited, large matrix multiplications usually have to be partitioned into fine-grained sub-block computation tasks. When accelerating non-uniform matrix multiplication, most existing linear-array hardware matrix multipliers suffer a large performance drop because they support only a fixed block size. To address this problem, an effective optimized blocking strategy is proposed. On this basis, a matrix multiplier supporting variable block sizes was implemented on a Xilinx Zynq XC7Z045 FPGA. By integrating 224 processing elements, the multiplier achieves a measured performance of 48 GFLOPS at a 150 MHz clock for non-uniform matrix multiplications from real applications, while requiring only 4.8 GB/s of bandwidth. Experimental results show that the proposed blocking strategy achieves up to a 12% performance improvement over conventional blocking algorithms.
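A minimal software sketch of blocked matrix multiplication with a caller-chosen block shape may help make the idea concrete; the block sizes and names below are illustrative assumptions and do not reproduce the paper's block-selection strategy.

```python
import numpy as np

def blocked_matmul(A, B, mb=64, nb=64, kb=64):
    """Blocked C = A @ B with a configurable block shape (mb, nb, kb).
    For a 'non-uniform' matrix (one small dimension), the caller can
    shrink the corresponding block dimension instead of being forced
    to use a fixed square block."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(0, M, mb):
        for j in range(0, N, nb):
            for k in range(0, K, kb):
                C[i:i+mb, j:j+nb] += A[i:i+mb, k:k+kb] @ B[k:k+kb, j:j+nb]
    return C

# Example: tall-skinny times wide matrix, block sizes matched to the shapes.
A = np.random.rand(4096, 32)
B = np.random.rand(32, 4096)
assert np.allclose(blocked_matmul(A, B, mb=256, nb=256, kb=32), A @ B)
```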

3.
Reconfigurable array processors are characterized by a large volume of memory accesses, a high demand for data parallelism, little global data reuse but pronounced locality. Targeting these characteristics, this paper proposes a distributed-cache interconnect access structure that gives priority to intra-cluster locality. The structure enables the 4×4 PEs within a cluster to access 4×4 caches in parallel, and was synthesized for FPGA on a Xilinx ZYNQ-series XC7Z045 FFG900-2 device. In the conflict-free case, the interconnect supports simultaneous read/write access by the 16 PEs of a cluster, reaches a maximum frequency of 221 MHz, and delivers a peak memory-access bandwidth of 7.6 GB/s. A texture-feature extraction algorithm based on the gray-level co-occurrence matrix was implemented on this structure, achieving a memory-access bandwidth of 478.125 MB/s and a runtime of 0.24 ms.

4.
The identification of moving objects is a basic step in computer vision. Identification begins with segmentation and is followed by a denoising phase. This paper proposes an FPGA hardware implementation of the segmentation and denoising unit. Segmentation is performed using the Gaussian mixture model (GMM), a probabilistic method for background segmentation. Denoising is performed by implementing the morphological operators of erosion, dilation, opening and closing. The proposed circuit is optimized to perform real-time processing of HD video sequences (1,920 × 1,080 @ 20 fps) when implemented on FPGA devices. The circuit uses an optimized fixed-width representation of the data and implements high-performance arithmetic circuits. The circuit is implemented on Xilinx and Altera FPGAs. Implemented on a xc5vlx50 Virtex-5 FPGA, it can process 24 fps of HD video using 1,179 Slice LUTs and 291 Slice Registers; the dynamic power dissipation is 0.46 mW/MHz. Implemented on an EP2S15F484C3 Stratix II, it provides a maximum working frequency of 44.03 MHz employing 5,038 Logic Elements and 7,957 flip-flops, with a dynamic power dissipation of 4.03 mW/MHz.
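A minimal software sketch of the denoising step (morphological opening followed by closing on a binary foreground mask), using SciPy for illustration; the 3 × 3 structuring element is an assumption, and the hardware implements these operators directly rather than through a library.

```python
import numpy as np
from scipy import ndimage

def denoise_mask(mask, size=3):
    """Morphological opening (erosion then dilation) removes isolated
    foreground noise; closing (dilation then erosion) fills small holes."""
    se = np.ones((size, size), dtype=bool)   # structuring element (assumed 3x3)
    opened = ndimage.binary_opening(mask, structure=se)
    closed = ndimage.binary_closing(opened, structure=se)
    return closed

# Example on a random binary foreground mask standing in for the GMM output.
mask = np.random.rand(1080, 1920) > 0.95
clean = denoise_mask(mask)
```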

5.
Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip. For multiprocessors, an efficient communication network that matches the needs of the target application is always critical to the overall performance. Wormhole packet-switching network-on-chip (NoC) solutions are replacing conventional shared buses to deal with scalability and complexity challenges coming along with the increasing number of processing elements (PEs). However, the quest for high performance networks has led to very complex and resource-expensive NoC designs, leaving little room for the real computing force, i.e., PEs. Moreover, many techniques offer very small performance gains or none at all when network traffic is light while increasing the resource usage of routers. We argue that computation is still the primary task of multiprocessors and sufficient resources should be reserved for PEs. This paper presents our novel design and implementation of a resource-efficient communication network for multiprocessors on FPGAs. We reduce not only the required number of routers for a given number of PEs by introducing a new PE-router topology, but also the resource requirement of each router. Our communication network relies on the NEWS channels to transfer packets in a pipelined fashion following the path determined by the routing network. The implementation results on various Xilinx FPGAs show good performance in the typical range of network load for multiprocessor applications.

6.
A Fine-Grained Parallel Algorithm for Cholesky Decomposition
This paper proposes a fine-grained pipelined parallel algorithm for Cholesky decomposition. The algorithm can handle data of arbitrary size and fully exploits the fine-grained parallelism offered by FPGA accelerators. Experiments show that the algorithm scales well: 36 processing elements (PEs) can be integrated on a Xilinx XC5VLX330 FPGA, and for a matrix of order 16384 at an operating frequency of 200 MHz the performance reaches 14.3 GFLOPS.
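For reference, a minimal column-oriented Cholesky factorization in software; it only shows the arithmetic being parallelized and does not reproduce the paper's fine-grained pipelined scheduling.

```python
import numpy as np

def cholesky_column(A):
    """Column-by-column Cholesky: A = L @ L.T for symmetric positive-definite A.
    Each column update is the kind of regular work a PE pipeline can stream."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - np.dot(L[j, :j], L[j, :j]))
        # Update the remainder of column j using already-computed columns.
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
    return L

M = np.random.rand(200, 200)
A = M @ M.T + 200 * np.eye(200)     # make the test matrix positive definite
assert np.allclose(cholesky_column(A), np.linalg.cholesky(A))
```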

7.
The potential design space of FPGA accelerators is very large. The factors that define the performance of a particular implementation include the architecture design, the number of pipelines, and the memory bandwidth. In this paper we present a mathematical model that, based on these factors, calculates the computation time of pipelined FPGA accelerators and allows for quick exploration of the design space without any implementation or simulation. We evaluate the model and its ability to identify design bottlenecks and improve performance. Being the core of many compute-intensive applications, linear algebra computations are the main contributors to their total execution time. Hence, five relevant linear algebra computations are selected and analyzed, and the accuracy of the model is validated against implemented designs.
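The abstract does not state the model itself; purely as an illustration of the kind of estimate such a model produces, here is a hedged roofline-style sketch. The max-of-compute-and-transfer form and every parameter name are assumptions of this sketch, not the authors' model.

```python
def estimate_time(flops, bytes_moved, n_pipelines, flops_per_cycle_per_pipe,
                  clock_hz, mem_bandwidth_bytes_per_s):
    """Crude pipelined-accelerator time estimate: the design is either
    compute-bound or memory-bound, so the slower of the two dominates."""
    t_compute = flops / (n_pipelines * flops_per_cycle_per_pipe * clock_hz)
    t_memory = bytes_moved / mem_bandwidth_bytes_per_s
    return max(t_compute, t_memory)

# Example: dense 1024x1024 matrix-vector product, 8 pipelines at 150 MHz,
# 2 FLOPs per cycle per pipeline, 6.4 GB/s memory bandwidth (all assumed).
n = 1024
t = estimate_time(flops=2 * n * n,
                  bytes_moved=8 * (n * n + 2 * n),   # 64-bit operands
                  n_pipelines=8,
                  flops_per_cycle_per_pipe=2,
                  clock_hz=150e6,
                  mem_bandwidth_bytes_per_s=6.4e9)
print(f"estimated time: {t * 1e6:.1f} microseconds")   # memory-bound here
```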

8.
The most popular second-order statistical texture features are derived from the co-occurrence matrix, which was proposed by Haralick. However, computing the matrix and extracting the texture features are both very time consuming. In order to improve the performance of co-occurrence matrix and texture feature extraction algorithms, we propose an architecture on an FPGA platform. In the proposed architecture, the co-occurrence matrix is computed first, and then all thirteen texture features are calculated in parallel from the computed co-occurrence matrix. We have implemented the proposed architecture on a Virtex-5 fx130T-3 FPGA device. Our experimental results show a speedup of 421× over a software implementation on an Intel Core i7 2.0 GHz processor. To improve performance further, we reduced the computation from 13 texture features to 3 using a ranking of Haralick's features; the resulting performance improvement is 484×.
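A minimal software sketch of building a gray-level co-occurrence matrix for one displacement and computing two Haralick-style features (contrast and energy); the gray-level count and displacement are illustrative assumptions.

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Gray-level co-occurrence matrix for pixel pairs at offset (dy, dx).
    img is assumed to be already quantized to values in [0, levels)."""
    h, w = img.shape
    P = np.zeros((levels, levels))
    for y in range(h - dy):
        for x in range(w - dx):
            P[img[y, x], img[y + dy, x + dx]] += 1
    P /= P.sum()                       # normalize to joint probabilities
    return P

def contrast(P):
    i, j = np.indices(P.shape)
    return np.sum((i - j) ** 2 * P)    # Haralick contrast

def energy(P):
    return np.sum(P ** 2)              # Haralick angular second moment

img = np.random.randint(0, 8, size=(64, 64))
P = glcm(img)
print(contrast(P), energy(P))
```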

9.
K. L. B. I. Computers & Electrical Engineering, 2007, 33(5-6): 324-332
It is a challenge to implement large-word-length public-key algorithms on embedded systems such as smartcards, RF-ID tags and mobile terminals. This paper presents a HW/SW co-design solution for RSA and Elliptic Curve Cryptography (ECC) over GF(p) on a 12 MHz 8-bit 8051 micro-controller. The hardware coprocessor has a Modular Arithmetic Logic Unit (MALU) whose digit size (d) is variable; it can be adapted to the speed and bandwidth of the micro-controller to which it is connected. The HW/SW co-design space exploration is based on the GEZEL system-level design environment, which allows the designer to find the best performance-area combination for the digit size. As an FPGA prototyping case study, 160-bit ECC over GF(p) (ECC-160p) was implemented on a Xilinx Virtex-II Pro (XC2VP30). The results show that one point multiplication takes only 130 ms including all communication between the 8051 and the coprocessor. This performance is 40 times faster than the most optimized SW implementation on a small CPU in the literature, and is achieved by the HW/SW co-design exploration used to find the optimal digit size for the MALU. At the same time, the ECC-160p design maintains a high level of flexibility by using coprocessor instructions. Our proposed architecture shows that HW/SW co-design provides performance close to ASIC solutions while retaining the flexibility of SW, even on a small CPU.
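As background on what a point multiplication involves, a minimal double-and-add sketch over a toy curve in affine coordinates; the toy parameters and the use of a generic modular inverse are assumptions for illustration, whereas the coprocessor's MALU works on 160-bit operands through its variable digit-size datapath rather than this schoolbook form.

```python
def ec_add(P, Q, a, p):
    """Affine point addition on y^2 = x^3 + a*x + b over GF(p).
    None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                              # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

def scalar_mult(k, P, a, p):
    """Left-to-right double-and-add: one doubling per key bit,
    one addition per set bit."""
    R = None
    for bit in bin(k)[2:]:
        R = ec_add(R, R, a, p)                   # double
        if bit == "1":
            R = ec_add(R, P, a, p)               # add
    return R

# Toy curve y^2 = x^3 + 2x + 3 over GF(97); G = (3, 6) lies on it.
p_mod, a_coef, b_coef = 97, 2, 3
G = (3, 6)
Q = scalar_mult(13, G, a_coef, p_mod)
if Q is not None:
    x, y = Q
    assert (y * y - (x**3 + a_coef * x + b_coef)) % p_mod == 0   # still on the curve
print(Q)
```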

10.
We develop a simple mapping technique to design linear systolic arrays. The basic idea of our technique is to map the computations of a certain class of two-dimensional systolic arrays onto one-dimensional arrays. Using this technique, systolic algorithms are derived for problems such as matrix multiplication and transitive closure on linearly connected arrays of PEs with constant I/O bandwidth. Compared to known designs in the literature, our technique leads to modular systolic arrays with constant hardware in each PE, few control lines, lexicographic data input/output, and improved delay time. The unidirectional flow of control and data in our design assures implementation of the linear array in the known fault models of Wafer Scale Integration.

11.
The crossbar switch is the core logic of switch chips and chipsets. This paper designs and implements the crossbar switch of a multiprocessor chipset; after FPGA place-and-route, its operating frequency reaches 100 MHz. Latency and bandwidth were measured through practical sampling, and performance optimization strategies are proposed. The crossbar has been running stably in the Loongson 2E multiprocessor system.

12.
洪琪, 赵志伟, 何敏. 《计算机工程》, 2013(12): 264-268
In field-programmable gate array (FPGA) based designs, low latency, high throughput, and small area are the three main considerations. Addressing these factors, SRT floating-point division and square-root algorithms with different radices are proposed, and three implementation schemes for variable-bit-width floating-point division and square root on a Virtex-II Pro FPGA are designed: a small-area iterative implementation, a low-latency array implementation, and a high-throughput pipelined implementation. Experimental results show that, with the synthesized area meeting the requirements, the pipelined implementations of floating-point division and square root reach maximum frequencies above 180 MHz and 200 MHz respectively, demonstrating the effectiveness of the proposed schemes.
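As background on the digit-recurrence family that SRT division belongs to, a minimal radix-2 restoring-division sketch over normalized fixed-point mantissas; this shows the simpler restoring variant, not the paper's SRT quotient-digit selection, and the bit width is an assumption.

```python
def restoring_divide(num, den, frac_bits=24):
    """Radix-2 restoring division of two normalized mantissas in [1, 2),
    given as integers scaled by 2**frac_bits. Returns num/den with the same
    scaling. SRT division follows the same shift-and-subtract recurrence but
    chooses quotient digits from {-1, 0, +1} by inspecting only a few
    partial-remainder bits, which is what makes it fast in hardware."""
    assert den != 0
    # Integer bit of the quotient (the quotient of two [1,2) mantissas is in (0.5, 2)).
    if num >= den:
        q, rem = 1, num - den
    else:
        q, rem = 0, num
    for _ in range(frac_bits):                # one quotient bit per iteration
        rem <<= 1
        q <<= 1
        if rem >= den:                        # trial subtraction succeeds
            rem -= den
            q |= 1
    return q                                  # quotient scaled by 2**frac_bits

frac = 24
a, b = int(1.75 * 2**frac), int(1.25 * 2**frac)
q = restoring_divide(a, b, frac)
print(q / 2**frac)                            # ~1.4 (= 1.75 / 1.25)
```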

13.
This article presents HP422-MoCHA: an optimized Motion Compensation hardware architecture for the High 4:2:2 profile of the H.264/AVC video coding standard. The proposed design focuses on real-time decoding of HDTV 1080p (1,920 × 1,080 pixels) at 30 fps. It supports multiple sample bit-widths (8, 9, or 10 bits) and multiple chroma sub-sampling formats (4:0:0, 4:2:0, and 4:2:2) to provide an enhanced video quality experience. The architecture includes an optimized sample interpolator that processes luma and chroma samples in two parallel datapaths and features quarter-sample accuracy, bi-prediction and weighted prediction. HP422-MoCHA also includes a hardwired Motion Vector Predictor supporting temporal and spatial direct predictions. A novel memory hierarchy implemented as a 3-D cache reduces frame memory accesses, providing, on average, 62% bandwidth and 80% clock cycle reduction. The design was implemented in a Xilinx Virtex-II Pro FPGA, and also in an ASIC with TSMC 0.18 μm standard-cell technology. The ASIC implementation occupies 102 K equivalent gates and 56.5 KB of on-chip SRAM in a 3.8 × 3.4 mm² area, with a power consumption of 130 mW. Both implementations reach a maximum operating frequency of ~100 MHz and can motion compensate 37 bi-predictive or 69 predictive frames per second. The minimum frequency required to ensure real-time decoding of HD1080p at 30 fps is 82 MHz. Since HP422-MoCHA is the first Motion Compensation architecture for the High 4:2:2 profile found in the literature, a Main profile MoCHA was used for comparison, showing the highest throughput among all presented works. The HP422-MoCHA architecture also reaches the highest throughput when compared with the other published Main profile MC solutions, even considering the significantly higher complexity of the High 4:2:2 profile.

14.
This paper presents a field-programmable gate array (FPGA) implementation of a three-layer perceptron using the few-DSP-blocks-and-few-block-RAMs (FDFM) approach on a Xilinx Virtex-6 family FPGA. In the FDFM approach, multiple processor cores, each with few DSP slices and few block RAMs, are used. We have implemented 150 processor cores for perceptrons in a Xilinx Virtex-6 family FPGA XC6VLX240T-FF1156. The implementation results show that the 150 processor cores for 32-32-32 input-hidden-output layer perceptrons can be implemented in the FPGA using 150 DSP48 slices, 185 block RAMs and 9,676 slices. It runs at a 242.89 MHz clock frequency, and a single evaluation of the 150-node perceptron can be performed 1.65 × 10^7 times per second.
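A minimal software sketch of one 32-32-32 perceptron evaluation; the sigmoid activation and random weights are assumptions, and the FDFM cores implement this arithmetic with a handful of DSP slices and block RAMs each.

```python
import numpy as np

def perceptron_32_32_32(x, W1, b1, W2, b2):
    """Forward pass of a 32-32-32 input-hidden-output perceptron.
    Sigmoid activation is assumed; each evaluation costs two 32x32
    matrix-vector products plus the activations."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(W1 @ x + b1)       # hidden layer, 32 neurons
    y = sigmoid(W2 @ h + b2)       # output layer, 32 neurons
    return y

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((32, 32)), rng.standard_normal(32)
W2, b2 = rng.standard_normal((32, 32)), rng.standard_normal(32)
x = rng.standard_normal(32)
print(perceptron_32_32_32(x, W1, b1, W2, b2).shape)   # (32,)
```

If the reported 1.65 × 10^7 evaluations per second is an aggregate across the 150 cores, it works out to roughly 1.1 × 10^5 evaluations per second per core, i.e., on the order of 2,200 clock cycles per evaluation at 242.89 MHz, which is consistent with about one multiply-accumulate per cycle for the roughly 2 × 32 × 32 weights.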

15.
A novel approach to developing an FPGA-based chirp signal generator using high-level synthesis (HLS) is proposed. OpenCL, an HLS framework, is employed instead of Verilog/VHDL to program the FPGA. OpenCL has been used for FPGA programming, particularly in high-performance computing applications, and it reduces development time because of the high-level abstraction of the code. However, compared to Verilog/VHDL, standard OpenCL does not enable direct access to the FPGA's I/O. In this study, the FPGA needs to access its I/O pins to communicate with the DAC and generate the chirp signal, so direct access to the FPGA I/O pins from the OpenCL environment is required. Therefore, a new OpenCL component is developed that lets the FPGA communicate with the DAC and stream the data used to generate the chirp signal. We demonstrate that, with this OpenCL implementation, the FPGA can generate an I/Q chirp signal efficiently. Moreover, the same OpenCL kernel can be employed to generate chirp signals of different bandwidths without reprogramming the FPGA. To demonstrate the capability of the system, we generated I/Q chirp signals from 1 MHz to 5 MHz, 5 MHz to 10 MHz, 10 MHz to 15 MHz and 15 MHz to 20 MHz, each over a period of 10 µs.
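A minimal software sketch of the I/Q linear chirp being generated; the sample rate and the complex-baseband formulation are assumptions, and the OpenCL kernel streams equivalent samples to the DAC.

```python
import numpy as np

def iq_chirp(f0, f1, duration, fs):
    """Complex (I/Q) linear chirp sweeping from f0 to f1 over `duration`
    seconds, sampled at fs. Instantaneous frequency: f0 + k*t, k = (f1-f0)/T."""
    t = np.arange(0.0, duration, 1.0 / fs)
    k = (f1 - f0) / duration                      # sweep rate in Hz/s
    phase = 2.0 * np.pi * (f0 * t + 0.5 * k * t**2)
    return np.cos(phase) + 1j * np.sin(phase)     # I + jQ

# One of the reported sweeps: 1 MHz to 5 MHz over 10 microseconds.
samples = iq_chirp(1e6, 5e6, 10e-6, fs=100e6)     # fs = 100 MS/s is an assumption
print(samples.shape)                              # (1000,)
```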

16.
Evolving applications related to video technologies require video encoders and decoders implemented at low cost while achieving real-time performance. To meet this demand, and targeting especially applications with tight VLSI area requirements, this paper describes a real-time VLSI H.264/AVC encoder architecture. The encoder uses a pipelined architecture, and all modules have been optimized with respect to VLSI cost. The encoder design complies with the standard's reference software encoder, follows the baseline profile at level 3.0, and constitutes an IP core and/or an efficient stand-alone solution. The architecture operates at a maximum frequency of 100 MHz and achieves a maximum throughput of 30 frames/s at a frame size of 1,024 × 768. Results and performance measurements of the entire encoder have been validated on FPGA and in 0.18 μm VLSI, occupying a total area of 3.9 mm².

17.
Large-scale QR decomposition is widely used in signal processing, image processing, computational structural mechanics, and other fields. It is mainly computed on high-performance parallel machines, and no FPGA-based accelerated implementation has been reported so far. Based on an analysis of the characteristics of the fast Givens rotation QR decomposition algorithm, this paper proposes and implements a fine-grained parallel QR decomposition algorithm and realizes a scalable QR decomposition linear-array processor on an Altera Stratix II FPGA platform. Compared with a single processing element, the array processor achieves nearly linear speedup, demonstrating good scalability. Performance tests at 100 MHz show that the array processor achieves a 19× speedup over a 2.0 GHz dual-core Pentium general-purpose microprocessor.
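For reference, a minimal QR decomposition by standard (not "fast") Givens rotations in software; it shows the fine-grained rotations a linear PE array can pipeline but none of the paper's scheduling.

```python
import numpy as np

def givens_qr(A):
    """QR decomposition by Givens rotations: each rotation zeroes one
    subdiagonal entry. The many small, largely independent rotations
    are what a linear array of PEs can stream through a pipeline."""
    A = A.astype(float)
    m, n = A.shape
    Q = np.eye(m)
    R = A.copy()
    for j in range(n):
        for i in range(m - 1, j, -1):        # zero R[i, j] against R[j, j]
            a, b = R[j, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])
            R[[j, i], :] = G @ R[[j, i], :]   # rotate the two affected rows
            Q[:, [j, i]] = Q[:, [j, i]] @ G.T # accumulate Q = Q * G^T
    return Q, R

A = np.random.rand(6, 4)
Q, R = givens_qr(A)
assert np.allclose(Q @ R, A) and np.allclose(np.tril(R, -1), 0)
```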

18.
In cluster systems, the performance of the interconnection network has a crucial impact on the performance of the whole system. Traditional interconnect network adapters are mostly based on the PCI interface, whose theoretical per-node egress bandwidth is limited to 132 MB/s [1]. This paper proposes the concept of a virtual DDR (Double Data Rate SDRAM) memory, defines its behavior, and applies it to an interconnect network adapter based on the DDR memory interface. With a 100 MHz motherboard clock, the adapter's per-node bandwidth limit reaches 1600 MB/s, 12 times the bandwidth of the PCI-based approach. An FPGA implementation verifies the feasibility and correctness of the virtual DDR memory and of the network adapter built on it.
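The 1600 MB/s figure is consistent with a 64-bit-wide DDR interface at 100 MHz (two transfers per clock): 100 MHz × 2 transfers/cycle × 8 bytes/transfer = 1600 MB/s, and 1600 / 132 ≈ 12, matching the reported 12× improvement over PCI; the 64-bit width is an inference of this note, not a figure stated in the abstract.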

19.
The increasing density of silicon processes, coupled with the development of ever more energy- and space-efficient embedded core designs, has led to multi-processor system-on-chip (MPSoC) designs becoming increasingly attractive for use in embedded systems. Unfortunately this increase in core count gives rise to an explosion in design space possibilities, especially when heterogeneous designs are considered. To address this problem, new simulation techniques are required that increase the simulation performance of these systems while maintaining the accuracy needed to make good design decisions and to verify the performance characteristics of real-time systems. We present a new high-speed, near cycle-accurate simulator addressing an important but neglected category of multicore systems: deeply embedded cache-incoherent MPSoCs. We take advantage of the unique properties of these systems to relax synchronisation constraints and increase the parallelism of the simulation. In doing so we achieve performance not possible using previous simulation techniques, without compromising the accuracy of the results. Quantitative performance results are presented across a large range of simulated MPSoC designs comprising 1-64 cores; on average we simulate at 5.7 MIPS, with simulation speeds reaching 377 MIPS in the best case. Comparing against FPGA implementations, we demonstrate that the simulator manages this with an average timing error of only 2.1%. Applying some of these techniques to coherent simulation enables even coherent 64-core designs to be simulated accurately at up to 2.2 MIPS.

20.
Parts of a self-organizing map (SOM)-based quantizer can be performed in parallel, namely the distance calculation between an input pixel and the group of codewords held in the processing elements (PEs), and the updating of the codewords. To search for the best matching unit (BMU), whose distance is the minimum, all distances inevitably have to be compared with one another. A group of comparators and registers equal in number to the distances (i.e., to the number of PEs) can be instantiated and operated in a multistage manner to obtain the minimum distance and its index. In this way the algorithm requires n = log2(C) clock cycles, where C is the number of PEs, and sum_{k=0}^{n-1} 2^k (= C - 1) comparators and registers. In this paper, we propose a novel hardware-centric algorithm aimed at accelerating the BMU search stage of the SOM-based quantizer. In its simplest form, the algorithm uses a PE's distance as the address of a memory in which to store that PE's index. Simultaneously with storing the indices of all PEs, the states of all 'non-empty' addresses within the memory are recorded, so the position of the first non-empty state corresponds to the memory address whose content is the BMU index. An approach to finding the first non-empty state within a single clock cycle is also detailed. The algorithm is further adapted to make it more feasible to realize on an FPGA platform. Synthesis results compared with the conventional BMU search indicate that the algorithm's FPGA resource requirements are 1.8× and 1.57× in terms of slice and LUT usage, respectively. In terms of acceleration, the algorithm outperforms the conventional one by a factor of 1.8 for a test image of 512 × 512 pixels.
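A minimal software sketch of the address-as-distance idea: each PE writes its index into a memory location addressed by its distance, and the BMU is read back from the first non-empty location. The single-cycle priority detection of the hardware is emulated here with a linear scan, and all sizes are illustrative assumptions.

```python
import numpy as np

def bmu_by_address(distances, max_dist):
    """Find the best matching unit by using each distance as a memory
    address. Ties keep the lowest PE index, like a priority encoder."""
    index_mem = [None] * (max_dist + 1)        # one slot per possible distance
    for pe, d in enumerate(distances):
        if index_mem[d] is None:               # first writer wins on a tie
            index_mem[d] = pe
    for d, pe in enumerate(index_mem):         # hardware: one-cycle priority detect
        if pe is not None:
            return pe, d
    raise ValueError("no distances given")

# Example: Manhattan distances from one input pixel to 16 PE codewords.
rng = np.random.default_rng(0)
codewords = rng.integers(0, 256, size=(16, 3))
pixel = rng.integers(0, 256, size=3)
dist = np.abs(codewords - pixel).sum(axis=1)
bmu, d = bmu_by_address(dist.tolist(), max_dist=3 * 255)
assert bmu == int(np.argmin(dist))             # matches a conventional search
print(bmu, d)
```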
