1.

A high level FPGA-based abstract machine for image processing

《Journal of Systems Architecture》1999,45(10):809-824

Image processing requires high computational power, plus the ability to experiment with algorithms. Recently, reconfigurable hardware devices in the form of field programmable gate arrays (FPGAs) have been proposed as a way of obtaining high performance at an economical price. At present, however, users must program FPGAs at a very low level and have a detailed knowledge of the architecture of the device being used. FPGAs therefore do not facilitate easy development of, or experimentation with, image processing algorithms. To try to reconcile the dual requirements of high performance and ease of development, this paper reports on the design and realisation of an FPGA-based image processing machine and its associated high level programming model. This abstract programming model allows an application developer to concentrate on the image processing algorithm in hand rather than on its hardware implementation. The abstract machine is based on a PC host system with a PCI-bus add-on card containing Xilinx XC6200 series FPGA(s). The machine's high level instruction set is based on the operators of image algebra. XC6200 series FPGA configurations have been developed to implement each high level instruction.

2.

Design and Performance Study of a Reconfigurable Computing System for Domestic CPUs

Peng Fulai Yu Zhilou Chen Naikuo Geng Shihua Li Kaiyi 《Computer Engineering and Applications》2018,54(23):36-41

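To improve the computing performance of domestic platforms, a reconfigurable computing system based on a domestic CPU is designed using a heterogeneous domestic CPU+FPGA architecture. The system comprises a host unit based on the domestic CPU and an FPGA reconfigurable acceleration unit. The host unit handles tasks such as logic decisions, management, and scheduling, while the FPGA accelerates compute-intensive tasks and is programmed with the OpenCL framework model to shorten the FPGA development cycle. To verify the system's performance, the AES encryption algorithm is used as a benchmark: plaintexts of different lengths are encrypted with AES and the results are compared with serial processing on the CPU. Compared with serial encryption on a single-core FT-1500A CPU, parallel encryption on the reconfigurable computing system achieves a speedup of more than 120 times, and this speedup grows nonlinearly as the plaintext length increases. The experimental results show that a reconfigurable computing system based on a domestic CPU can substantially improve the computing performance of domestic platforms.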

3.

Accelerating floating-point fitness functions in evolutionary algorithms: a FPGA-CPU-GPU performance comparison

Juan A. Gomez-Pulido Miguel A. Vega-Rodriguez Juan M. Sanchez-Perez Silvio Priem-Mendes Vitor Carreira 《Genetic Programming and Evolvable Machines》2011,12(4):403-427

Many large combinatorial optimization problems tackled with evolutionary algorithms require very long computation times, usually because of the fitness evaluation. This forces programmers to use clusters of computers, a solution well suited to computation-intensive applications but with a high acquisition price and operating cost, mainly due to the power consumption of the Central Processing Units (CPUs) and the cooling equipment. A low-cost, high-performance alternative comes from reconfigurable computing, a hardware technology based on Field Programmable Gate Array devices (FPGAs). The main objective of the work presented in this paper is to compare FPGA and CPU implementations of different fitness functions in evolutionary algorithms, in order to study the performance of the floating-point arithmetic that is often present in the optimization problems tackled by these algorithms. We have taken advantage of the chip-level parallelism of FPGAs to accelerate the fitness functions (and, consequently, the evolutionary algorithms) and to show the parallel scalability needed to reach low-cost, low-power, high-performance computational solutions based on FPGAs. Finally, the recent popularity of GPUs as computational units has moved us to include these devices in our performance comparisons. We analyze performance in terms of computation times and economic cost.

4.

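Research on a Parallel Computing Architecture for Siamese Network Tracking Algorithms

Lu Jinyi Tang Weiwei Xu Wenhui Yan Luxin Zhong Sheng Zou Xu 《Measurement & Control Technology》2021,40(3):39-45

Tracking targets against complex backgrounds on embedded platforms plays an important role in fields such as intelligent video surveillance equipment and UAV tracking. Convolutional neural networks offer high accuracy and strong robustness for tracking, but algorithms based on convolutional features are computationally complex and, given the area and power constraints of embedded platforms, struggle to meet the real-time requirements of embedded application scenarios. To address the high computational complexity and large parameter storage of tracking algorithms based on convolutional features, the paper is the first to propose an FPGA hardware acceleration architecture for convolutional-neural-network-based tracking of targets against complex backgrounds. The method applies fixed-point quantization to the Siamese-FC tracking algorithm using KL relative entropy, and a channel-parallel acceleration architecture is designed for the convolutional layers. Experimental results show that the average accuracy loss of the quantized tracking algorithm relative to the original algorithm is no more than 4.57%, and that after deployment on the FPGA, forward inference takes only 16.15% of the CPU time and consumes only 13.7% of the CPU power.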

5.

An improved method for the removal of ring artifacts in synchrotron radiation images by using GPGPU computing with compute unified device architecture

Leqing Zhu Dadong Wang Huiyan Wang 《Concurrency and Computation》2014,26(18):2880-2892

Ring artifacts are a common problem in computed tomography, positron emission tomography, magnetic resonance imaging, and synchrotron radiation images. Before the images are processed further, for example by segmentation and quantification, these artifacts have to be removed or suppressed. Otherwise, they may introduce additional errors into the segmentation and subsequent analysis. This paper proposes an improved ring artifact removal method based on biorthogonal wavelet transform, one‐dimensional fast Fourier transform, and Gaussian damping, which is implemented on general‐purpose computing on graphics processing unit with compute unified device architecture. The experimental results show that the proposed algorithms can be sped up several hundred times compared with the previous algorithms on CPU. The significant performance improvement makes the algorithms much more practical for processing large volumes of images in real time. Copyright © 2013 John Wiley & Sons, Ltd.

6.

Multiplierless and fully pipelined JPEG compression soft IP targeting FPGAs

Luciano Volcan Ivan Saraiva Sergio 《Microprocessors and Microsystems》2007,31(8):487-497

This paper presents the design of a soft IP for JPEG compression targeted for high performance in an FPGA device. The JPEG compressor architecture achieves high throughput with a deep and optimized pipeline and with a multiplierless datapath architecture. The JPEG compressor architecture was designed in a hierarchical and modular fashion, and the details of the global architecture and of its modules are presented in this paper. A modular and strictly structural VHDL design is followed to develop the JPEG compressor soft IP. The VHDL code was synthesized to Altera and Xilinx FPGAs. Synthesis results and relevant performance comparisons with related works are presented. Our high throughput compressor is able to compress 39.8 million pixels per second when mapped onto an Altera FLEX 10KE FPGA. Our JPEG soft IP mapped to the low cost FLEX 10KE FPGA is able to compress 115 images per second in SDTV resolution (720 × 480 pixels). At this SDTV resolution, our design is suitable as the core of an M-JPEG video compressor, reaching a real time processing rate of 30 fps once mapped to the FLEX 10KE FPGA device.
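
The two throughput figures quoted in this abstract can be cross-checked with simple arithmetic. The sketch below only divides the stated pixel rate by the SDTV frame size; the function and constant names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract above.
PIXELS_PER_SECOND = 39.8e6   # stated compressor throughput
WIDTH, HEIGHT = 720, 480     # SDTV resolution
REAL_TIME_FPS = 30           # real-time target mentioned in the abstract


def frames_per_second(pixel_rate, width, height):
    """Frames per second sustained at a given pixel throughput."""
    return pixel_rate / (width * height)


fps = frames_per_second(PIXELS_PER_SECOND, WIDTH, HEIGHT)
print(f"{fps:.1f} images/s")                      # 115.2 images/s
print(f"headroom over 30 fps: {fps / REAL_TIME_FPS:.1f}x")
```

The result agrees with the "115 images per second" stated in the abstract and leaves a margin of almost four times over a 30 fps real-time target.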

7.

Sorting networks on FPGAs

Rene Mueller Jens Teubner Gustavo Alonso 《The VLDB Journal The International Journal on Very Large Data Bases》2012,21(1):1-23

Computer architectures are quickly changing toward heterogeneous many-core systems. Such a trend opens up interesting opportunities but also raises immense challenges since the efficient use of heterogeneous many-core systems is not a trivial problem. Software-configurable microprocessors and FPGAs add further diversity but also increase complexity. In this paper, we explore the use of sorting networks on field-programmable gate arrays (FPGAs). FPGAs are very versatile in terms of how they can be used and can also be added as additional processing units in standard CPU sockets. Our results indicate that efficient usage of FPGAs involves non-trivial aspects such as having the right computation model (a sorting network in this case); a careful implementation that balances all the design constraints in an FPGA; and the proper integration strategy to link the FPGA to the rest of the system. Once these issues are properly addressed, our experiments show that FPGAs exhibit performance figures competitive with those of modern general-purpose CPUs while offering significant advantages in terms of power consumption and parallel stream evaluation.
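
For readers unfamiliar with the computation model named in this abstract, a sorting network is a fixed sequence of compare-exchange operations that does not depend on the data, which is why it maps naturally onto hardware. The sketch below is a generic software model of a well-known 4-input network, not the networks evaluated in the paper; all names are illustrative.

```python
from itertools import product

# Comparator list for a standard 4-input sorting network
# (5 comparators; pairs in the same row of this layout can run in parallel).
NETWORK_4 = [(0, 1), (2, 3),
             (0, 2), (1, 3),
             (1, 2)]


def sort_with_network(values, network=NETWORK_4):
    """Apply each compare-exchange in order; the sequence is data-independent."""
    out = list(values)
    for i, j in network:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out


def is_sorting_network(network, n):
    """Zero-one principle: a network that sorts every 0/1 input sorts any input."""
    return all(
        sort_with_network(bits, network) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )


print(sort_with_network([3, 1, 4, 2]))   # [1, 2, 3, 4]
print(is_sorting_network(NETWORK_4, 4))  # True
```

In hardware, each comparator pair becomes a small compare-and-swap circuit, and comparators that touch disjoint wires can operate in the same clock cycle.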

8.

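Design of an 8-bit RISC CPU Based on FPGA

Zhang Jie 《Microcomputer Information》2006,22(35):155-157

Using a top-down design method and a modular design approach, from the overall structure of the CPU down to the implementation of local functions, an 8-bit CPU soft core was designed and implemented on a Xilinx Spartan-II series FPGA. Inside the FPGA, the design not only implements the arithmetic logic unit, register file, instruction buffer, jump counter, and instruction set that a CPU requires, but also optimizes addresses and data for the internal structural characteristics of the FPGA.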

9.

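Design of a Mimicry Server Based on FPGA

Cui Bingmeng Ni Ming Ling Xinghua 《Computer Systems & Applications》2018,27(4):219-225

As computer technology continues to develop, the processing capability of CPUs under traditional architectures can no longer cope with increasingly diverse computing tasks, and new heterogeneous computing systems also leave room for improvement. This paper analyzes the Mimicry Computing (MC) architecture, which follows the principle that "application determines structure, and structure determines efficiency" and is based on a multi-dimensional reconfigurable functional structure and a dynamic multi-variant operating mechanism. Exploiting the hardware programmability, dynamic reconfigurability, and low power consumption of FPGAs, an FPGA-based mimicry computing server is designed, and the server's core circuit design and key implementation techniques are described.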

10.

Server Selection,Configuration and Reconfiguration Technology for IaaS Cloud with Multiple Server Types

Yoji Yamato 《Journal of Network and Systems Management》2018,26(2):339-360

We propose a server selection, configuration, reconfiguration and automatic performance verification technology to meet user functional and performance requirements on various types of cloud compute servers. Various servers means that there are not only virtual machines on normal CPU servers but also container or baremetal servers on strong graphic processing unit (GPU) servers or field programmable gate arrays (FPGAs) with a configuration that accelerates specified computation. Early cloud systems were composed of many PC-like servers, and virtual machines on these servers used distributed processing technology to achieve high computational performance. However, recent cloud systems have changed to make the best use of advances in hardware power. It is well known that baremetal and container performance is better than virtual machine performance. Dedicated processing servers, such as strong GPU servers for graphics processing and FPGA servers for specified computation, have also increased. Our objective for this study was to enable cloud providers to provision compute resources on appropriate hardware based on user requirements, so that users can easily benefit from high performance of their applications. Our proposed technology selects appropriate servers for user compute resources from various types of hardware, such as GPUs and FPGAs, or sets appropriate configurations or reconfigurations of FPGAs to use hardware power. Furthermore, our technology automatically verifies the performance of provisioned systems. We measured provisioning and automatic performance verification times to show the effectiveness of our technology.

11.

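Design and Implementation of a Rule Processor Based on a Heterogeneous Computing Platform

Chen Mengdong Guo Dongsheng Xie Xianghui Wu Dong 《Computer Science》2020,47(4):312-317

For secure string recovery in identity authentication mechanisms, combining a dictionary with transformation rules is a commonly used method. Processing with transformation rules quickly generates large numbers of targeted new strings for verification. However, rule processing is complex and places high demands on processing performance and system power consumption; existing tools and research all process rules in software and struggle to meet the needs of practical recovery systems. To this end, this paper proposes a rule processor based on a heterogeneous computing platform. It is the first to use reconfigurable FPGA hardware to accelerate rule processing, while an ARM general-purpose computing core handles the configuration, management, and monitoring of the rule processing; the design is implemented on a Xilinx Zynq XC7Z030 chip. Experimental results show that, in typical cases, the rule processor with this hybrid architecture improves performance by a factor of 214 over the ARM general-purpose core alone and outperforms an Intel i7-6700 CPU; its performance per watt improves by 1.4 to 2.1 times over an NVIDIA GeForce GTX 1080 Ti GPU and by 70 times over the CPU, effectively raising the rate and energy efficiency of rule processing. The experimental data show that a hardware-accelerated rule processor on a heterogeneous computing platform effectively solves the rate and energy-efficiency problems of rule processing, can meet practical engineering needs, and lays the foundation for the design of a complete secure string recovery system.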
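
As a rough software illustration of what "dictionary plus transformation rules" means in this abstract, the sketch below expands a small word list with a handful of string transformations. The rule set is entirely hypothetical; the abstract does not list the rules the processor supports, and the FPGA design itself is not modeled here.

```python
# Hypothetical transformation rules, chosen only for illustration;
# the abstract does not enumerate the actual rule set.
RULES = {
    "identity": lambda w: w,
    "capitalize": lambda w: w.capitalize(),
    "reverse": lambda w: w[::-1],
    "append_1": lambda w: w + "1",
    "duplicate": lambda w: w + w,
}


def apply_rules(words, rules):
    """Yield (rule name, transformed word) for every word/rule combination."""
    for word in words:
        for name, rule in rules.items():
            yield name, rule(word)


candidates = [c for _, c in apply_rules(["alpha", "beta"], RULES)]
print(len(candidates))  # 2 words x 5 rules = 10 candidate strings
```

Because the number of candidates is the product of dictionary size and rule count, the workload grows quickly, which is consistent with the motivation for hardware acceleration described above.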

12.

FPGA architecture and implementation of sparse matrix-vector multiplication for the finite element method

Yousef Elkurdi Evgueni Souleimanov Warren J. Gross 《Computer Physics Communications》2008,178(8):558-570

The Finite Element Method (FEM) is a computationally intensive scientific and engineering analysis tool that has diverse applications ranging from structural engineering to electromagnetic simulation. The trends in floating-point performance are moving in favor of Field-Programmable Gate Arrays (FPGAs), hence interest has grown in the scientific community in exploiting this technology. We present an architecture and implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications. FEM matrices display specific sparsity patterns that can be exploited to improve the efficiency of hardware designs. Our architecture exploits FEM matrix sparsity structure to achieve a balance between performance and hardware resource requirements by relying on external SDRAM for data storage while utilizing the FPGA's computational resources in a stream-through systolic approach. The architecture is based on a pipelined linear array of processing elements (PEs) coupled with a hardware-oriented matrix striping algorithm and a partitioning scheme which enables it to process arbitrarily big matrices without changing the number of PEs in the architecture. Therefore, this architecture is only limited by the amount of external RAM available to the FPGA. The implemented SMVM-pipeline prototype contains 8 PEs and is clocked at 110 MHz, obtaining a peak performance of 1.76 GFLOPS. For 8 GB/s of memory bandwidth typical of recent FPGA systems, this architecture can achieve 1.5 GFLOPS sustained performance. Using multiple instances of the pipeline, linear scaling of the peak and sustained performance can be achieved.
Our stream-through architecture provides the added advantage of enabling an iterative implementation of the SMVM computation required by iterative solution techniques such as the conjugate gradient method, avoiding initialization time due to data loading and setup inside the FPGA internal memory.
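
The quoted peak figure is consistent with each PE completing one multiply and one add per clock cycle; that per-PE rate is an assumption made here, not something the abstract states. The sketch below checks the arithmetic and, for orientation, shows a plain software sparse matrix-vector product in the common CSR layout, which stands in for (and is not) the paper's striping scheme.

```python
NUM_PES = 8
CLOCK_HZ = 110e6
FLOPS_PER_PE_PER_CYCLE = 2  # assumption: one multiply + one add per cycle

peak_flops = NUM_PES * CLOCK_HZ * FLOPS_PER_PE_PER_CYCLE
print(f"{peak_flops / 1e9:.2f} GFLOPS")  # 1.76 GFLOPS


def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix stored in compressed sparse row (CSR) form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # one multiply-accumulate
        y.append(acc)
    return y


# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # [7.0, 6.0, 17.0]
```

Each nonzero contributes exactly one multiply-accumulate, so the work is proportional to the number of nonzeros, and throughput is bounded by how fast nonzeros can be streamed from memory.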

13.

Low power data processing system with self-reconfigurable architecture

《Journal of Systems Architecture》2007,53(9):568-576

In this paper, a low power data processing system with a self-reconfigurable architecture and USB interface is presented. A single FPGA performs all processing and controls the multiple configurations without any additional elements, such as a microprocessor, host computer or additional FPGAs. This architecture allows high performance with very low power consumption, making it a comprehensive alternative to microprocessor or DSP systems. In addition, a hierarchical reconfiguration system is used to support a large number of different processing tasks without the power consumption penalty of a big local configuration memory. Due to its simplicity and low power, this data processing system is especially suitable for portable applications, reducing the disadvantage of FPGAs against ASICs in low power consumption applications [A. Amara, F. Amiel, T. Ea, FPGA vs. ASIC for low power applications, Microelectronics Journal 37 (8) (2006) 669–677].

14.

Accelerating thread-intensive and explicit memory management programs with dynamic partial reconfiguration

Qianming Yang Mei Wen Nan Wu Chunyuan Zhang 《The Journal of supercomputing》2013,63(2):508-537

Recent research has shown that field programmable gate arrays (FPGAs) have a large potential for accelerating demanding applications, such as high performance digital signal processing applications with a low-volume market. The loss of generality in the architecture is one disadvantage of using FPGAs; however, the reconfigurability of FPGAs allows reprogramming for other applications. Therefore, a uniform FPGA-based architecture, an efficient programming model, and a simple mapping method are paramount for the wide acceptance of FPGA technology. This paper presents MASALA, a dynamically reconfigurable FPGA-based accelerator for parallel programs written in thread-intensive and explicit memory management (TEMM) programming models. Our system uses a TEMM programming model to parallelize demanding applications, including application decomposition into separate thread blocks and compute and data load/store decoupling. Hardware engines are included into MASALA using partial dynamic reconfiguration modules, each of which encapsulates a thread process engine that implements the hardware’s thread functionality. A data dispatching scheme is also included in MASALA to enable the explicit communication of multiple memory hierarchies such as interhardware engines, host processors, and hardware engines. Finally, this paper illustrates a multi-FPGA prototype system of the presented architecture: MASALA-SX. A large synthetic aperture radar image formatting experiment shows that MASALA’s architecture facilitates the construction of a TEMM program accelerator by providing greater performance and less power consumption than current CPU platforms, without sacrificing programmability, flexibility, and scalability.

15.

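Method and Implementation of Human Head Contour Extraction in a 3D Scanner

Lei Haijun Li Dehua Qian Zhengtie Lei Fengzhong 《Computer Engineering and Applications》2003,39(19):37-39

To address the real-time requirements of a 3D scanner, this paper proposes a fast algorithm in which an FPGA processor and a PC host cooperate interactively to extract contour lines. The algorithm consists of two stages: in the first stage, the host computes the segmentation threshold between background and target; in the second stage, the FPGA processor detects contour position information in real time. The fast algorithm is computationally simple and fast to execute; it also reduces the amount of data transmitted and stored and lightens the subsequent computational load on the host. In addition, it eliminates the need for an expensive image capture and compression card and a high-speed hard disk, lowering cost. The reconfigurable FPGA processor is designed as a pipeline, keeping the average processing time per pixel within 70 ns. Simulation and synthesis results show that contour information can be extracted from one frame of a 720×576 standard PAL video image in real time, within 40 ms.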
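
The timing claim in this abstract can be checked by multiplying the per-pixel budget by the size of a 720×576 PAL frame. The sketch below is only that arithmetic; the 40 ms budget corresponds to the standard PAL rate of 25 frames per second, and the variable names are illustrative.

```python
WIDTH, HEIGHT = 720, 576      # standard PAL frame
NS_PER_PIXEL = 70             # upper bound on average per-pixel time (from the abstract)
FRAME_BUDGET_MS = 1000 / 25   # PAL runs at 25 frames/s, i.e. 40 ms per frame

pixels = WIDTH * HEIGHT
frame_time_ms = pixels * NS_PER_PIXEL / 1e6

print(pixels)                          # 414720
print(f"{frame_time_ms:.2f} ms")       # 29.03 ms
print(frame_time_ms < FRAME_BUDGET_MS) # True
```

At 70 ns per pixel a full frame takes about 29 ms, which leaves roughly 11 ms of slack inside the 40 ms frame period.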

16.

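Hardware Thread Acceleration Method on a CPU/FPGA Hybrid Architecture

Chen Tianzhou Yan Like Hu Wei Ma Jijun 《Journal of Software》2009,20(Z1):15-22

The CPU/FPGA hybrid architecture is a common structure for reconfigurable computing. To simplify the use of the FPGA in a hybrid architecture, a hardware thread method is proposed and an execution mechanism for hardware threads is designed, so that reconfigurable resources are used in the form of hardware threads. Software and hardware threads can run in parallel as multiple threads through shared data storage: the compute-intensive parts of a program execute as hardware threads on the FPGA, while the control-intensive parts execute as software threads on the CPU. On a hybrid architecture platform simulated with Simics, experiments on the DES, MD5SUM, and merge sort algorithms after conversion to software/hardware multithreading show an average speedup of 2.30, making effective use of the computing performance of the CPU/FPGA hybrid architecture.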

17.

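A Dynamically Reconfigurable On-Board Embedded Real-Time Computing System for Satellites

Liu Yong Li Huawang Yin Zengshan Yang Genqing 《Application Research of Computers》2006,23(1):204-205

Owing to the special constraints on board a satellite, computer processing speed cannot meet the needs of signal processing, and diverse computing functions cannot be implemented flexibly within limited hardware size and power consumption. This paper proposes an architecture for an on-board real-time computing system based on an embedded microprocessor combined with a dynamically reconfigurable large-scale field programmable gate array (FPGA), in which different signal processing functions can be implemented on a single FPGA resource through dynamic reconfiguration. Practical application shows that processing speed and performance are greatly improved.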

18.

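A Rule Processing Architecture Based on a Distributed Platform

Chen Mengdong Yuan Hao Xie Xianghui Wu Dong 《Computer Engineering & Science》2020,42(1):18-24

Transforming a dictionary with string transformation rules is an effective method in secure string recovery. However, rule processing is complex, and existing approaches are all implemented in software. To meet practical requirements for processing performance and power consumption, a rule processing architecture based on a distributed platform is proposed. It is the first to use FPGA hardware to accelerate rule processing, and it speeds up processing further by splitting complex rule combinations and distributing them across parallel nodes. Experiments on the Ant Colony system show that a rule processing system using this architecture meets practical requirements, with significantly better performance and energy efficiency than CPUs and GPUs, demonstrating the effectiveness of the distributed rule processing architecture.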

19.

Flexible VLIW processor based on FPGA for efficient embedded real-time image processing

Vincent Brost Fan Yang Charles Meunier 《Journal of Real-Time Image Processing》2014,9(1):47-59

Modern field programmable gate array (FPGA) chips, with their larger memory capacity and reconfigurability potential, are opening new frontiers in rapid prototyping of embedded systems. With the advent of high-density FPGAs, it is now possible to implement a high-performance VLIW (very long instruction word) processor core in an FPGA. With VLIW architecture, the processor effectiveness depends on the ability of compilers to extract sufficient ILP (instruction-level parallelism) from program code. This paper describes research results on enabling the VLIW processor model for real-time processing applications by exploiting FPGA technology. Our goals are to keep the flexibility of processors to shorten the development cycle, and to use the powerful FPGA resources to increase real-time performance. We present a flexible VLIW VHDL processor model with a variable instruction set and a customizable architecture, which allows the intrinsic parallelism of a target application to be exploited using advanced compiler technology and implemented in an optimal manner on an FPGA. Some common image processing algorithms were tested and validated using the proposed development cycle. We also realized the rapid prototyping of embedded contactless palmprint extraction on a Virtex-6 FPGA based board for a biometric application and obtained a processing time of 145.6 ms per image. Our approach applies some criteria for co-design tools: flexibility, modularity, performance, and reusability.

20.

PMSS: A programmable memory system and scheduler for complex memory patterns

Tassadaq Hussain Amna Haider Eduard Ayguadé 《Journal of Parallel and Distributed Computing》2014

The HPC industry demands more computing units on FPGAs to enhance performance through task and data parallelism. FPGAs can deliver their best performance on certain kernels by customizing the hardware for the application. However, applications are becoming more complex, with multiple kernels and complex data arrangements, which generates overhead when scheduling and managing system resources. For this reason, all classes of multi-threaded machines, from minicomputers to supercomputers, require an efficient hardware scheduler and memory manager that improve the effective bandwidth and latency of the DRAM main memory. Such an architecture could be a very competitive choice for supercomputing systems that must meet the parallelism demands of HPC benchmarks. In this article, we propose a Programmable Memory System and Scheduler (PMSS), which provides high-speed access for complex data patterns to the multi-threaded architecture. The proposed PMSS system is implemented and tested on a Xilinx ML505 evaluation FPGA board. The performance of the system is compared with a microprocessor based system that has been integrated with the Xilkernel operating system. Results show that the modified PMSS based multi-accelerator system consumes 50% less hardware resources and 32% less on-chip power, and achieves approximately a 19x speedup compared to the MicroBlaze based system.