1.

FPGA–DSP co-processing for feature tracking in smart video sensors

Matteo Tomasi Shrinivas Pundlik Gang Luo 《Journal of Real-Time Image Processing》2016,11(4):751-767

Motion estimation in videos is a computationally intensive process. A popular strategy for dealing with such a high processing load is to accelerate algorithms with dedicated hardware such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and digital signal processors (DSPs). Previous approaches addressed the problem using accelerators together with a general-purpose processor, such as an Acorn RISC Machine (ARM) processor. In this work, we present a co-processing architecture using an FPGA and a DSP. A portable platform for motion estimation based on sparse feature point detection and tracking is developed for real-time embedded systems and smart video sensor applications. A Harris corner detection IP core is designed with a customized fine-grain pipeline on a Virtex-4 FPGA. The detected feature points are then tracked using the Lucas–Kanade algorithm in a DSP that acts as a co-processor for the FPGA. The hybrid system offers a throughput of 160 frames per second (fps) at VGA image resolution. We have also tested the benefits of our proposed solution (FPGA + DSP) in comparison with two other traditional architectures and co-processing strategies: hybrid ARM + DSP and DSP only. The proposed FPGA + DSP system offers a speedup of about 20 times and 3 times over the ARM + DSP and DSP-only configurations, respectively. A comparison of the Harris feature detection algorithm performance between different embedded processors (DSP, ARM, and FPGA) reveals that the DSP offers the best performance when scaling up from QVGA to VGA resolutions.
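
As a rough illustration of the Harris corner detection step named in this abstract, the sketch below computes the Harris response R = det(M) - k * trace(M)^2 over a 3 × 3 window in plain Python. The function name `harris_response`, the central-difference gradients, the window size and k = 0.04 are assumptions made for the example; they are not details of the paper's FPGA IP core.

```python
def harris_response(img, k=0.04):
    """Return the Harris response map for a 2-D list of grey levels.

    Illustrative software sketch only (hypothetical helper, not the paper's design).
    """
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    # Central-difference gradients; border pixels are left at zero.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Accumulate the structure tensor M over a 3 x 3 neighbourhood.
            sxx = syy = sxy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            resp[y][x] = (sxx * syy - sxy * sxy) - k * (sxx + syy) ** 2
    return resp


# Toy 10 x 10 image: a bright square whose top-left corner sits at row 5, column 5.
image = [[1 if (r >= 5 and c >= 5) else 0 for c in range(10)] for r in range(10)]
response = harris_response(image)
best = max(((r, c) for r in range(10) for c in range(10)),
           key=lambda p: response[p[0]][p[1]])
```

On this toy image the response is positive at the corner, negative along the straight edges and zero in flat regions, which is the property a detector thresholds on.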

2.

Design of a reconfigurable serial Advanced Encryption Standard encryption/decryption circuit

谢惠敏 郭东辉 《计算机应用》2013,33(2):450-459

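To further improve the hardware resource efficiency of the Advanced Encryption Standard (AES) algorithm on field-programmable gate arrays (FPGAs), an implementation of a serial AES encryption/decryption circuit supporting 128/192/256-bit key lengths is proposed. The design uses a composite-field transformation to compute the multiplicative inverse of a byte, shares resources between MixColumns and InvMixColumns, and shares the key expansion logic across the three AES key lengths. The circuit is implemented on a Xilinx Virtex-5 series FPGA and consumes 1871 slices and 4 RAMs. The results show that at the maximum operating frequency of 173.904 MHz, the AES encryption/decryption throughput for 128/192/256-bit keys reaches 2119/1780/1534 Mb/s, respectively. The design achieves a high throughput-to-resource ratio and is suitable for supporting Gigabit Ethernet.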
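
To make the byte-level arithmetic in this abstract concrete, here is a minimal software sketch of GF(2^8) multiplication, inversion and the MixColumns / InvMixColumns pair. It inverts directly in GF(2^8) by exponentiation rather than through the composite-field decomposition the paper describes, and the function names `gf_mul`, `gf_inv` and `mix_column` are hypothetical helpers chosen for this example.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by the AES field polynomial
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse computed as a^254; 0 maps to 0, as in the AES S-box."""
    result, base, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exp >>= 1
    return result


def mix_column(col, inverse=False):
    """Apply MixColumns (or InvMixColumns) to one 4-byte column.

    Both directions share the same loop and differ only in the coefficient
    row, which hints at why the two operations can share hardware.
    """
    coeffs = (0x0E, 0x0B, 0x0D, 0x09) if inverse else (0x02, 0x03, 0x01, 0x01)
    out = []
    for r in range(4):
        v = 0
        for c in range(4):
            v ^= gf_mul(coeffs[(c - r) % 4], col[c])
        out.append(v)
    return out
```

For example, `gf_inv(0x53)` gives `0xCA`, the inverse pair used as a worked example in the AES specification.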

3.

Efficient parallel architecture for multi-level forward discrete wavelet transform processors

Syed Mahfuzul Aziz Duc Minh Pham 《Computers & Electrical Engineering》2012,38(5):1325-1335

A resource efficient and high-performance architecture for a two-dimensional multi-level discrete wavelet transform processor is presented in this paper. The JPEG2000 standard integer lossless 5-3 filter has been implemented. It achieves optimal hardware utilisation with minimal combinational logic block slices and high frequency of operation. To reduce the hardware complexity and to achieve high performance the proposed architecture implements lifting scheme with a single multiplier-free processing element to perform both predict and update operations. Symmetric extension is used at image boundaries without requiring any extra clock cycle. The generic architecture is very flexible and can perform up to five levels of forward transform on any arbitrary image size. Synthesis of the 5-level architecture on Xilinx Virtex 5 FPGA shows that the processor can achieve a maximum frequency of operation of 221.44 MHz. The reduced hardware complexity and high frequency of operation render the design suitable for incorporation in image processing applications requiring fast operations. The 5-level design has been successfully implemented on a Xilinx Spartan 3E FPGA, utilising only 1104 slices for a 512-by-512 pixel test image, the lowest hardware requirements for a 5-level discrete wavelet transform processor reported to date.
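
The predict and update steps of the integer 5-3 lifting scheme mentioned above can be sketched in a few lines. The example below is a one-dimensional, single-level software model for even-length inputs with symmetric boundary extension; the names `fwd53` and `inv53` are made up for this illustration and the code says nothing about how the paper's processing element is organised.

```python
def fwd53(x):
    """One level of the reversible 5/3 lifting transform on an even-length list.

    Returns (s, d): low-pass (approximation) and high-pass (detail) coefficients.
    """
    n = len(x)
    half = n // 2

    def xe(i):
        # Whole-sample symmetric extension of the input at both boundaries.
        if i < 0:
            i = -i
        if i >= n:
            i = 2 * (n - 1) - i
        return x[i]

    # Predict: each odd sample minus the floor of the mean of its even neighbours.
    d = [x[2 * i + 1] - (xe(2 * i) + xe(2 * i + 2)) // 2 for i in range(half)]
    # Update: each even sample plus a rounded quarter of the neighbouring details.
    s = [x[2 * i] + (d[max(i - 1, 0)] + d[i] + 2) // 4 for i in range(half)]
    return s, d


def inv53(s, d):
    """Invert fwd53 exactly by undoing the update step, then the predict step."""
    half = len(s)
    n = 2 * half
    x = [0] * n
    for i in range(half):
        x[2 * i] = s[i] - (d[max(i - 1, 0)] + d[i] + 2) // 4
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]
        x[2 * i + 1] = d[i] + (x[2 * i] + right) // 2
    return x


samples = [3, 7, 1, 8, 2, 9, 4, 6]
low, high = fwd53(samples)
restored = inv53(low, high)
```

Only additions, subtractions and shifts (here written as floor divisions by 2 and 4) are needed, which is why the scheme is described as multiplier-free, and the integer rounding is undone exactly by the inverse.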

4.

A low energy adaptive motion estimation hardware for H.264 multiview video coding

Yusuf Aksehir Kamil Erdayandi Tevfik Zafer Ozcan Ilker Hamzaoglu 《Journal of Real-Time Image Processing》2018,15(1):3-12

Multiview video coding (MVC) is the process of efficiently compressing stereo (two views) or multiview video signals. The improved compression efficiency achieved by H.264 MVC comes with a significant increase in computational complexity. Temporal prediction and inter-view prediction are the most computationally intensive parts of H.264 MVC. Therefore, in this paper, we propose novel techniques for reducing the amount of computation performed by temporal and inter-view predictions in H.264 MVC. The proposed techniques reduce this computation significantly with very small PSNR loss and bit rate increase. We also propose a low energy adaptive H.264 MVC motion estimation hardware for implementing the temporal and inter-view predictions including the proposed computation reduction techniques. The proposed hardware is implemented in Verilog HDL and mapped to a Xilinx Virtex-6 FPGA. The FPGA implementation is capable of processing 30 × 8 = 240 frames per second (fps) of CIF (352 × 288) size eight view video sequence or 30 × 2 = 60 fps of VGA (640 × 480) size stereo (two views) video sequence. The proposed techniques reduce the energy consumption of this hardware significantly.

5.

Real-time embedded systems powered by FPGA dynamic partial self-reconfiguration: a case study oriented to biometric recognition applications

Francisco Fons Mariano Fons Enrique Cantó Mariano López 《Journal of Real-Time Image Processing》2013,8(3):229-251

This work aims to pave the way for an efficient open system architecture for embedded electronic applications that can process computationally complex algorithms in real time and at low cost. The goal is to define a standard architecture that improves on the performance-cost trade-off of alternatives currently on the market, such as general-purpose multi-core processors. Our approach, based on hardware/software (HW/SW) co-design and run-time reconfigurable computing, is synthesizable in SRAM-based programmable logic. As a proof of concept, a run-time partially reconfigurable field-programmable gate array (FPGA) is used to implement an application with high computational demands: an automatic fingerprint authentication system (AFAS). Biometric personal recognition is a good example of a compute-intensive algorithm composed of a series of image processing tasks executed in sequential order. In our approach, these tasks are first partitioned and synthesized into a series of coprocessors, which are then instantiated and executed, multiplexed in time, on a partially reconfigurable region of the FPGA. Benchmarking the AFAS as a pure software implementation on a PC platform with a dual-core processor (Intel Core 2 Duo T5600 at 1.83 GHz) against a reconfigurable FPGA co-design (the identical algorithm partitioned into HW/SW tasks operating at 50 or 100 MHz on the second smallest device of the Xilinx Virtex-4 LX family) shows a speed-up of one order of magnitude in favor of the FPGA alternative. These results point to biometric recognition as a sensible killer application for run-time reconfigurable computing, mainly in terms of efficiently balancing computational power, functional flexibility and cost. Such features, achieved through partial reconfiguration, are readily portable today to a broad range of embedded applications with an identical system architecture.

6.

Design and implementation of a bus interface for the Xilinx PCI-Express core (total citations: 1; self-citations: 0; citations by others: 1)

董永吉 陈庶樵 李玉峰 李印海 《电子技术应用》2011,37(8)

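This paper presents a DMA hardware design based on the PCI-Express bus, implemented in the Verilog HDL hardware description language. The design was verified on a Xilinx Virtex-5 series FPGA. The results show that the design meets practical requirements for readability, reliability, and efficiency.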

7.

Design and verification of an FPGA-based FFT processor

王正勤 朱向冰 《数字社区&智能家居》2007,2(6):1379-1380,1382

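This paper describes an FPGA-based FFT signal processor that uses the radix-2 decimation-in-time (DIT) algorithm with 32-bit floating-point arithmetic, designed on the ISE 6.2i development platform. The logic synthesis and timing of the system were simulated with ModelSim, and the outputs were compared with results computed in Matlab to verify the accuracy of the design. Experiments show that the FPGA implementation of the FFT is fast and suitable for high-speed signal processing applications.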
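
The radix-2 decimation-in-time FFT named in this abstract can be sketched as a short recursive routine and checked against a direct DFT, which mirrors the abstract's comparison of hardware output against a software reference. This is a plain software model under the assumption of a power-of-two input length; `fft_radix2_dit` and `dft` are names invented for the example.

```python
import cmath


def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2_dit(x[0::2])  # transform of even-indexed samples
    odd = fft_radix2_dit(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size transforms with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft(x):
    """Direct O(n^2) DFT used as the reference result."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


signal = [1, 2, 3, 4, 0, -1, -2, -3]
max_error = max(abs(a - b) for a, b in zip(fft_radix2_dit(signal), dft(signal)))
```

`max_error` stays at floating-point rounding level, which is the kind of agreement a hardware-versus-reference comparison looks for.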

8.

Design and implementation of a realtime co-processor for denoising Fiber Optic Gyroscope signal

Rangababu Peesapati Samrat L. Sabat Kiran Kumar Anumandla Palani Karthik Kandyala Jagannath Nayak 《Digital Signal Processing》2013,23(5):1813-1825

The amount of noise present in the Fiber Optic Gyroscope (FOG) signal limits its applications and has a negative impact on the navigation system. Existing algorithms such as the Discrete Wavelet Transform (DWT) and the Kalman Filter (KF) denoise the FOG signal in a static environment; however, denoising fails in a dynamic environment. Therefore, in this paper an Adaptive Moving Average Dual Mode Kalman Filter (AMADMKF) is developed for denoising the FOG signal in both static and dynamic environments. The performance of the proposed algorithm is compared with the DWT and KF techniques. Further, a hardware Intellectual Property (IP) core of the algorithm is developed for System on Chip (SoC) implementation using a Xilinx Virtex-5 Field Programmable Gate Array (Virtex-5FX70T-1136). The developed IP is interfaced as a Co-processor/Auxiliary Processing Unit (APU) with the PowerPC (PPC440) embedded processor of the FPGA. The proposed system is shown to be an efficient solution for denoising the FOG signal in a real-time environment. The hardware acceleration of the developed Co-processor is 65× with respect to the equivalent software implementation of the AMADMKF algorithm on the PPC440 embedded processor.
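
For readers unfamiliar with the baseline, the sketch below shows an ordinary scalar Kalman filter smoothing a noisy constant signal. It is the plain KF that the abstract compares against, not the AMADMKF algorithm itself, and the function name `kalman_1d`, the noise variances `q` and `r`, and the synthetic test signal are all assumptions made for the illustration.

```python
def kalman_1d(measurements, q=1e-5, r=0.25, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a (nearly) constant state.

    q: assumed process-noise variance, r: assumed measurement-noise variance.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q              # predict: state unchanged, uncertainty grows
        k = p / (p + r)        # Kalman gain
        x = x + k * (z - x)    # update with the measurement residual
        p = (1.0 - k) * p      # posterior uncertainty
        estimates.append(x)
    return estimates


# Synthetic "static" signal: true value 1.0 with an alternating +/-0.5 disturbance.
noisy = [1.0 + (0.5 if i % 2 == 0 else -0.5) for i in range(200)]
smoothed = kalman_1d(noisy)
```

With a fixed, small `q` the filter settles well on a static signal but reacts slowly to real changes, which is the static-versus-dynamic limitation the abstract refers to.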

9.

Fast segmented IPv6 lookup and its hardware implementation

李慧杰 杜慧敏 王亚刚 《计算机工程与应用》2013,49(3):96-100

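A fast IPv6 lookup algorithm suitable for hardware implementation is proposed. It uses a segmented lookup mechanism based on content-addressable memory (CAM) and is pipelined so that one lookup result is produced per clock cycle, with a small storage overhead. Tests on a Xilinx Virtex-6 FPGA development board with 150 × 1,024 IPv6 prefixes show a lookup rate of up to 597 Mp/s (million packets per second), at most 2 memory accesses per lookup, a worst-case update time of 50 μs, and a storage cost of only 20.07 MB of RAM and 258 KB of CAM.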
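
The task such a lookup engine solves is longest-prefix matching. The sketch below is a deliberately naive software reference using Python's standard `ipaddress` module: it scans every prefix linearly and has nothing in common with the paper's CAM-based segmented pipeline beyond the result it should return. The function name `longest_prefix_match` and the toy routing table are hypothetical.

```python
import ipaddress


def longest_prefix_match(table, address):
    """Return the next hop of the longest prefix in `table` that contains `address`.

    `table` maps prefix strings to next-hop labels; returns None when nothing matches.
    """
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return None if best is None else best[1]


# Toy routing table (documentation prefixes, invented next-hop labels).
routes = {
    "::/0": "default",
    "2001:db8::/32": "A",
    "2001:db8:1::/48": "B",
}
hop = longest_prefix_match(routes, "2001:db8:1::5")
```

A reference like this is mainly useful for checking the answers of a faster implementation against a large prefix set.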

10.

Keccak algorithm | sponge construction | hash algorithm | reconfigurable | field-programmable gate array

吴武飞 王奕 李仁发 《计算机应用》2012,32(3):864-866

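Based on an analysis of the Keccak algorithm, and to address the problem that existing hardware implementations of Keccak support only a single version and are inflexible in application, a high-performance reconfigurable hardware implementation of Keccak is designed. Experimental results show that on a Xilinx Virtex-5 field-programmable gate array (FPGA) platform the design reaches a clock frequency of 214 MHz and occupies 1607 slices. It offers high throughput (9131 Mbps) and good flexibility, supporting four different parameter versions.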
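
A common back-of-the-envelope estimate for an iterative hash core is throughput = block size × clock frequency / cycles per block. The sketch below applies it under two assumptions that the abstract does not state: a rate of 1024 bits per absorbed block and one Keccak-f[1600] round per clock cycle (24 cycles per block). The helper name `throughput_mbps` is made up for the example.

```python
def throughput_mbps(block_bits, clock_mhz, cycles_per_block):
    """Estimated throughput in Mbps of a core that absorbs one block every N cycles."""
    return block_bits * clock_mhz / cycles_per_block


# Assumed parameters (not given in the abstract): rate r = 1024 bits, 24 rounds,
# one round per clock cycle. The 214 MHz clock is the figure reported above.
estimate = throughput_mbps(block_bits=1024, clock_mhz=214, cycles_per_block=24)
```

Under these assumptions the estimate comes out at about 9131 Mbps, which matches the reported figure, but that agreement is an inference and not a statement from the paper.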

11.

FPGA implementation of the block adaptive quantization algorithm (total citations: 2; self-citations: 1; citations by others: 2)

高俊峰 杨汝良 马小兵 《数据采集与处理》2006,21(1):103-107

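This paper describes in detail the design of a block adaptive quantization (BAQ) algorithm implemented on an FPGA. The design targets a Xilinx 1-million-gate FPGA and follows a top-down methodology to implement a 3-bit BAQ compression algorithm. Resource sharing reduces resource consumption, while parallelism and pipelining raise processing speed, meeting the requirements of spaceborne systems for miniaturization, low power, and high reliability. Compared with a dedicated DSP solution, the FPGA implementation greatly reduces circuit design complexity and routing difficulty.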

12.

Design and implementation of a Viterbi decoder for ultra-wideband systems

欧阳淦 刘亮 叶凡 任俊彦 《计算机工程》2010,36(17):260-263

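A Viterbi decoder for ultra-wideband systems is proposed. The hybrid survivor path management unit is improved, raising its maximum operating frequency by 25% and reducing decoding latency by 40 clock cycles. Implementation results on a Xilinx Virtex-5 XC5VLX330 FPGA show that the decoder operates correctly at a clock frequency of 240 MHz. Using two such decoders in parallel allows data at all 8 rates of the system to be decoded.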
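
To show what a Viterbi decoder computes, the sketch below implements hard-decision decoding for a small rate-1/2, constraint-length-3 convolutional code with generators 7 and 5 (octal). This toy code is chosen only to keep the example short; it is not the code used in the ultra-wideband system, and the survivor paths are stored as simple lists, not in the hybrid survivor path management unit the abstract describes. The names `encode` and `viterbi_decode` are illustrative.

```python
G = (0b111, 0b101)  # generator polynomials of the toy (7, 5) code


def parity(v):
    return bin(v).count("1") & 1


def encode(bits):
    """Convolutionally encode `bits`, appending two zero tail bits to end in state 0."""
    state, out = 0, []
    for b in list(bits) + [0, 0]:
        reg = (b << 2) | state          # newest bit on the left of the 3-bit register
        out += [parity(reg & g) for g in G]
        state = reg >> 1                # state holds the last two input bits
    return out


def viterbi_decode(received):
    """Hard-decision Viterbi decoding with Hamming-distance branch metrics."""
    inf = float("inf")
    metrics = [0, inf, inf, inf]        # the encoder starts in state 0
    paths = [[], [], [], []]
    for i in range(0, len(received), 2):
        pair = received[i:i + 2]
        new_metrics, new_paths = [inf] * 4, [None] * 4
        for state in range(4):
            if metrics[state] == inf:
                continue
            for b in (0, 1):
                reg = (b << 2) | state
                expected = [parity(reg & g) for g in G]
                cost = metrics[state] + sum(e != r for e, r in zip(expected, pair))
                nxt = reg >> 1
                if cost < new_metrics[nxt]:   # keep the survivor into each state
                    new_metrics[nxt] = cost
                    new_paths[nxt] = paths[state] + [b]
        metrics, paths = new_metrics, new_paths
    return paths[0][:-2]                # trellis terminates in state 0; drop tail bits


message = [1, 0, 1, 1, 0, 0, 1, 0]
codeword = encode(message)
corrupted = list(codeword)
corrupted[5] ^= 1                       # inject a single channel bit error
decoded = viterbi_decode(corrupted)
```

The add-compare-select loop over states is the part hardware decoders parallelise, while the growing `paths` lists stand in for the survivor memory.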

13.

Design and implementation of RISC-V-based extension instructions for the SM4 algorithm

李晨琪 袁国材 樊荣 《计算机与数字工程》2022,50(2):256-260

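To better implement the SM4 cipher on resource-constrained terminals, this paper designs and implements an SM4 extension instruction set based on the open-source RISC-V instruction set and the VexRiscv processor. It comprises two SM4 extension instructions, corresponding to the key expansion part and the cipher part of SM4 respectively, trading a low hardware resource overhead for higher throughput in software implementations of SM4. The SM4 extension instructions designed and implemented in this paper, using Xilinx...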

14.

Reconfigurable Grøstl design and its FPGA implementation

李志灿 王奕 李仁发 《计算机工程与应用》2012,(6):49-52

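Grøstl is a SHA-3 candidate algorithm that inherits the Merkle–Damgård (MD) iterative structure and reuses AES components in its compression function. Existing work implements only one or two parameter versions of Grøstl; no design covers all four parameter versions, which limits flexibility. Based on an analysis of the Grøstl algorithm, a reconfigurable design approach is used to implement all four parameter versions on an FPGA. Experimental results show that on a Xilinx Virtex-5 FPGA platform the four-parameter reconfigurable design occupies 4279 slices and runs at a clock frequency of 223.32 MHz. Compared with existing implementations, it offers a smaller area, a higher clock frequency, and greater flexibility.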

15.

Design of an FPGA-based Java processor

南兆阔 须文波 柴志雷 《计算机工程》2008,34(1):253-255

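In view of the wide use of Java technology in embedded systems, a 32-bit Java processor, JPOR, is designed for low-end embedded devices. The processor is implemented on an FPGA chip, adopts a new Java stack structure, has a compact instruction set, executes Java bytecode directly, and provides effective support for the Real-Time Specification for Java (RTSJ). It passed functional simulation on the Xilinx SPARTAN-3 platform, showing that the Java processor can be implemented in a low-cost FPGA chip.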

16.

Low power and high-speed FPGA implementation for 4D memristor chaotic system for image encryption

Hagras Esam A. A. Saber Mohamed 《Multimedia Tools and Applications》2020,79(31-32):23203-23222

In this paper, we propose a novel low-power and high-speed FPGA implementation of the 4D memristor chaotic system with cubic nonlinearity based on a Xilinx System Generator (XSG) model. First, a pseudo-random number generator based on the XSG FPGA implementation of the proposed 4D memristor chaotic system is implemented on a Xilinx Spartan-6 X6SLX45 board in a 32-bit fixed-point format. The aim of the FPGA implementation is to increase the frequency of memristor chaotic random number generators. The results of the FPGA implementation show that the new design approach achieves a maximum frequency of 393 MHz and dissipates 117 mW. The standard fifteen randomization tests are used to measure the quality of the proposed pseudo-random number generator, which shows excellent randomness. A gray image encryption scheme based on the 4D memristor chaotic system is also introduced. The proposed cryptosystem has a large keyspace, very low correlation values, high entropy that is very close to the ideal entropy value, a high number of pixels change rate, and high unified average changing intensity values. The results and security analysis demonstrate that the proposed encryption scheme offers high speed and high security against various attacks.

17.

High performance hardware support for elliptic curve cryptography over general prime field

《Microprocessors and Microsystems》2017

Secure information exchange in resource-constrained devices can be accomplished efficiently through elliptic curve cryptography (ECC). Due to the high computational complexity of ECC arithmetic, a dedicated high-performance hardware architecture is essential to provide sufficient performance in the computation of elliptic curve scalar multiplication. This paper presents high-performance hardware support for elliptic curve cryptography over a prime field GF(p). It exploits the best available parallelism of elliptic curve point operations in projective representation. The proposed hardware for ECC is implemented on Xilinx Virtex-4, Virtex-5 and Virtex-6 FPGAs. A 256-bit scalar multiplication is completed in 2.01 ms, 2.62 ms and 3.91 ms on Virtex-6, Virtex-5 and Virtex-4 FPGA platforms, respectively. The results show that the proposed design is 1.96 times faster, with an insignificant increase in area consumption, compared to the other reported designs. It is therefore a good choice for many ECC-based schemes.
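
The operation being accelerated here, elliptic curve scalar multiplication over GF(p), can be sketched with the textbook double-and-add method. The example below uses affine coordinates and a tiny toy curve, y² = x³ + 2x + 2 over GF(17), purely so the arithmetic can be checked by hand; the paper works in projective coordinates on 256-bit fields, and the function names `point_add` and `scalar_mult` are illustrative. The modular inverse via `pow(x, -1, p)` requires Python 3.8 or later.

```python
def point_add(P, Q, a, p):
    """Add two points on y^2 = x^3 + a*x + b over GF(p); None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                   # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p          # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


def scalar_mult(k, P, a, p):
    """Compute k*P by right-to-left double-and-add."""
    result, addend = None, P
    while k:
        if k & 1:
            result = point_add(result, addend, a, p)
        addend = point_add(addend, addend, a, p)
        k >>= 1
    return result


# Toy curve y^2 = x^3 + 2x + 2 over GF(17); G = (5, 1) generates a group of order 19.
A, PRIME, G_POINT = 2, 17, (5, 1)
doubled = scalar_mult(2, G_POINT, A, PRIME)
```

Each affine addition needs a modular inversion, which is expensive in hardware; projective coordinates, as used in the paper, avoid the per-step inversion.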

18.

Design, implementation, and evaluation of a low-complexity vector-core for executing scalar/vector instructions

Mostafa I. Soliman 《Journal of Parallel and Distributed Computing》2013

This paper proposes a low-complexity vector-core called LcVc for executing both scalar and vector instructions on the same execution datapath. A unified register file in the decode stage is used for storing both scalar operands and vector elements. The execution stage accepts a new set of operands each cycle and produces a new result. Rather than issuing a vector instruction (1-D operations) as a whole, each vector operation is issued sequentially with the existing scalar issue hardware. In the first implementation of LcVc, all loads and stores of registers take place from the data cache in the memory access stage at a rate of one element per clock cycle. The complete design of the proposed LcVc processor is implemented in VHDL targeting the Xilinx Spartan 3E FPGA, xc3s1600e-4-fg320 device. The total number of slices required for implementing LcVc is 1778, where the number of slice flip-flops is 538 and the number of 4-input LUTs is 3706: 1914 for logic and 1792 for RAMs. Moreover, our performance evaluation results show that the speedups of executing vector addition, vector scaling, SAXPY, and matrix–matrix multiplication on LcVc over scalar execution are 2.3, 2.5, 1.9, and 3, respectively. The hardware required to support the enhanced vector capability is insignificant (5%), which reduces the area per core and increases the number of cores available in a given chip area.

19.

Implementation of a Viterbi decoder for MB-OFDM UWB communication systems

徐卓 王雪静 叶凡 任俊彦 《计算机工程》2008,34(18):117-119

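A design is proposed for a Viterbi decoder for multiband orthogonal frequency-division multiplexing (MB-OFDM) ultra-wideband communication systems, and the performance of the convolutional/punctured codes used in MB-OFDM and of the corresponding Viterbi decoding algorithm is analyzed. To reach the highest data rate required by the system while keeping hardware cost economical, the decoder's hardware structure combines the sliding-window and folding methods. In low-speed operating modes, some processing units are disabled to save power. The design was verified on a Xilinx Virtex-4 FPGA, with a maximum decoding rate of up to 432 Mb/s.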

20.

Scalable hardware architecture for disparity map computation and object location in real-time

Pedro Miguel Santos João Canas Ferreira José Silva Matos 《Journal of Real-Time Image Processing》2016,11(3):473-485

We present the disparity map computation core of a hardware system for isolating foreground objects in stereoscopic video streams. The operation is based on the computation of dense disparity maps using block-matching algorithms and two well-known metrics: sum of absolute differences and Census transform. Two sets of disparity maps are computed by taking each of the images as reference so that a consistency check can be performed to identify occluded pixels and eliminate spurious foreground pixels. Taking advantage of parallelism, the proposed architecture is highly scalable and provides numerous degrees of adjustment to different application needs, performance levels and resource usage. A version of the system for 640 × 480 images and a maximum disparity of 135 pixels was implemented in a system based on a Xilinx Virtex II-Pro FPGA and two cameras with a frame rate of 25 fps (less than the maximum supported frame rate of 40 fps on this platform). Implementation of the same system on a Virtex-5 FPGA is estimated to achieve 80 fps, while a version with increased parallelism is estimated to run at 140 fps (which corresponds to the calculation of more than 5.9 × 10⁹ disparity-pixels per second).
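
Block matching with the sum of absolute differences (SAD), one of the two metrics named above, can be sketched on a single scanline. The example below is a minimal software model that assumes rectified images, so that a pixel at column x in the left row appears at column x - d in the right row; the function name `sad_disparity_row`, the window size and the synthetic data are assumptions for the illustration, and the consistency check and Census transform are not shown.

```python
def sad_disparity_row(left, right, max_disp, half_win=1):
    """Return {x: disparity} for one scanline using SAD block matching.

    For each left-row column x, the candidate disparity d in 0..max_disp with the
    smallest SAD over a (2*half_win + 1)-pixel window is selected.
    """
    n = len(left)
    disparities = {}
    # Skip columns whose window or candidate range would fall outside the row.
    for x in range(half_win + max_disp, n - half_win):
        best_d, best_cost = 0, None
        for d in range(max_disp + 1):
            cost = sum(abs(left[x + k] - right[x - d + k])
                       for k in range(-half_win, half_win + 1))
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        disparities[x] = best_d
    return disparities


# Synthetic scanline: the right row is the left row shifted by 3 pixels.
SHIFT = 3
left_row = [i * i for i in range(20)]
right_row = [left_row[j + SHIFT] if j + SHIFT < 20 else 0 for j in range(20)]
row_disparities = sad_disparity_row(left_row, right_row, max_disp=5)
```

The inner loop over candidate disparities is independent for each d, which is the parallelism a hardware architecture can exploit by evaluating all candidates at once.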