Similar literature
1.
The method of moments is one of the most powerful techniques for image analysis. However, real-time applications of this method have been prohibited due to the computational intensity in calculating the moments. This paper presents a novel configurable hardware accelerator for expediting the moment computation. The fundamental building block of the proposed accelerator is a custom-designed floating-point moment processing element (MPE). Running at 75 MHz, the MPE can provide a 12X speedup over a 166 MHz TMS320C6701 digital signal processor. On top of this, a linear performance boost can be obtained by connecting up to eight MPEs into a one-dimensional (1-D) array.
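
For orientation, the quantity such an accelerator computes can be sketched in software. The snippet below is a minimal illustration of the standard geometric moment m_pq = Σ x^p · y^q · f(x, y), not the paper's MPE design; the function names `raw_moment` and `centroid` are hypothetical.

```python
def raw_moment(image, p, q):
    """Geometric moment m_pq of a 2-D image given as a list of rows.

    x is the column index and y is the row index (an assumed convention).
    """
    return sum(
        (x ** p) * (y ** q) * value
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    )


def centroid(image):
    """Centre of mass (m10 / m00, m01 / m00) of the image."""
    m00 = raw_moment(image, 0, 0)
    return raw_moment(image, 1, 0) / m00, raw_moment(image, 0, 1) / m00


# A 2x2 square of ones inside a 4x4 frame.
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
```

Every pixel contributes one multiply-accumulate per moment order, which is the cost that motivates hardware acceleration.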

2.
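Design and Implementation of a 2-D DCT Hardware Accelerator Based on the NEDA Algorithm   Cited by: 1 (self-citations: 1, citations by others: 0)
In image compression systems that use the two-dimensional DCT, the DCT accounts for a large share of the computation. To break through this bottleneck, a DCT hardware accelerator based on the NEDA algorithm was designed. The design replaces multiplications with shift-and-add operations and uses RAM instead of ROM, which saves hardware resources effectively. Verilog simulation results are given; they show that the accelerator computes the 2-D DCT correctly while using very few resources, making it suitable for a wide range of video and image compression applications.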

3.
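SoC Hardware/Software Design of a Face Detection System Based on the AdaBoost Algorithm   Cited by: 1 (self-citations: 0, citations by others: 1)
The AdaBoost face detection algorithm is computationally intensive and is difficult to run in real time in pure software on an embedded platform. This paper analyzes the performance of the AdaBoost detection algorithm and designs a suitable hardware/software partitioning scheme. Most of the algorithm's computation is moved into a hardware accelerator, which greatly increases detection speed. The paper describes a cycle-accurate model of the whole system. Simulation shows that the SoC solution is 11 times faster than pure software and can detect faces in 384×288 images at 28 frames per second with a 200 MHz clock.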

4.
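Building on a fixed-point SoC (System-on-Chip) architecture composed of a general-purpose RISC processor core and additional fixed-point hardware accelerators, this paper proposes a novel word-length design method for fixed-point hardware accelerators based on statistical analysis. The method uses statistical parameters to solve mathematically for the minimum word length that meets a given signal-to-noise ratio requirement. It effectively reduces chip area, power consumption, and manufacturing cost, which allows applications of high computational complexity to run on a low-cost RISC-core SoC without a DSP coprocessor.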

5.
In this paper, we propose two new VLSI architectures for computing the N-point discrete Fourier transform (DFT) and its inverse (IDFT) based on a radix-2 fast algorithm, where N is a power of two. The first part of this work presents a linear systolic array that requires log₂ N complex multipliers and is able to provide a throughput of one transform sample per clock cycle. Compared with other related systolic designs based on direct computation or a radix-2 fast algorithm, the proposed one has the same throughput performance but involves less hardware complexity. This design is suitable for high-speed real-time applications, but it would not be easily realized in a single chip when N gets large. To balance the chip area and the processing speed, we further present a new reduced-complexity design for the DFT/IDFT computation. The alternative design is a memory-based architecture that consists of one complex multiplier, two complex adders, and some special memory units. The new design has the capability of computing one transform sample every log₂ N + 1 clock cycles on average. In comparison with the first design, the second design reaches a lower throughput with less hardware complexity. For N = 512, the chip area required for the memory-based design is about 5742×5222 μm², and the corresponding throughput can attain a rate as high as 4M transform samples per second under 0.6 μm CMOS technology. Such area-time performance makes this design very competitive for use in long-length DFT applications, such as asymmetric digital subscriber lines (ADSL) and orthogonal frequency-division multiplexing (OFDM) systems.
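
As a software point of reference for the radix-2 fast algorithm mentioned above, here is a minimal recursive decimation-in-time sketch. It is a textbook formulation, not either of the proposed VLSI architectures; the function names are hypothetical, and `dft_direct` is included only as a check.

```python
import cmath


def fft_radix2(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # One complex multiplication (the twiddle factor) per butterfly.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft_direct(x):
    """Direct O(N^2) DFT, used here only to check the fast version."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]
```

The recursion has log₂ N stages, which is consistent with the log₂ N complex multipliers quoted for the systolic design if one multiplier serves each stage (an interpretation, not a statement from the abstract).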

6.
This paper presents an integrated systolic array design for implementing full-search block matching, 2-D discrete wavelet transform, and full-search vector quantization on the same VLSI architecture. These functions are the prime components in video compression and require a great amount of computation. To meet real-time application requirements, many systolic array architectures have been proposed for individually performing one of those functions. However, these functions share similar computational procedures, and their matrix-vector product forms are quite analogous. After extracting the common computation component, we design an integrated one-dimensional systolic array that can perform the aforementioned three functions. The proposed architecture can efficiently perform three typical functions: (1) full-search block matching with a block size of 16 × 16 and a search area from –8 to 7; (2) the 2-D two-level Haar transform with a block size of 8 × 8; and (3) full-search vector quantization with an input vector of size 2 × 2. A utilization rate of 97% to 100% is achieved when executing full-search block matching and full-search vector quantization. When performing the 2-D discrete wavelet transform, the utilization rate is about 32%. The proposed integrated architecture lowers hardware cost and simplifies the hardware structure. It is well suited to VLSI implementation for video/image compression applications.
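
To make the first of the three functions concrete, the following is a minimal sketch of full-search block matching. It assumes the sum of absolute differences (SAD) as the matching cost, which the abstract does not specify, and all function and parameter names are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by (dx, dy)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(size)
        for i in range(size)
    )


def full_search(cur, ref, bx, by, size, search_min, search_max):
    """Try every displacement in [search_min, search_max]^2.

    Returns (cost, dx, dy) for the best match, or None if no candidate
    block lies fully inside the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(search_min, search_max + 1):
        for dx in range(search_min, search_max + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue  # candidate block falls outside the frame
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

With the parameters quoted in the abstract, a call would use `size=16`, `search_min=-8`, and `search_max=7`, which gives 256 candidate positions per block.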

7.
Application-specific processors offer an attractive option in the design of embedded systems by providing high performance for a specific application domain. In this work, we describe the use of a reconfigurable processor core based on a RISC architecture as a starting point for application-specific processor design. By using a common base instruction set, development cost can be reduced and design space exploration is focused on the application-specific aspects of performance. An important aspect of deploying any new architecture is verification, which usually requires lengthy software simulation of a design model. We show how hardware emulation based on programmable logic can be integrated into the hardware/software codesign flow. While previously hardware emulation required massive investment in design effort and special purpose emulators, an emulation approach based on high-density field-programmable gate array (FPGA) devices now makes hardware emulation practical and cost effective for embedded processor designs. To reduce development cost and avoid duplication of design effort, FPGA prototypes and ASIC implementations are derived from a common source: We show how to perform targeted optimizations to fully exploit the capabilities of the target technology while maintaining a common source base.

8.
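A Group-Trust-Based Method for Filtering Abnormal Data in WSNs   Cited by: 1 (self-citations: 0, citations by others: 1)
Taking the spatio-temporal correlation of node data as its theoretical basis, the method uses an uncertainty transformation between quantitative data and qualitative knowledge to compare the similarity of data between nodes at the knowledge level, and thereby performs a group-trust evaluation of each individual node's data. On this basis, a real-time method for filtering abnormal WSN data is designed that detects suspicious data in real time while nodes collect data. Simulation experiments verify that the method not only filters abnormal data in real time and improves the intrusion tolerance of the WSN, but also has low communication and computation overhead.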

9.
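As one of the most computation-intensive modules, motion compensation accounts for about 70% of the bandwidth between the decoder and the off-chip data memory, and is the bottleneck in ultra-high-definition video decoding. The Cache-based HEVC motion compensation module designed here reduces bandwidth consumption by 80% while maintaining the data throughput needed for real-time decoding. First, an interpolation module built from reusable filters and a 2D Cache are used to design a motion compensation module capable of parallel, pipelined data processing, which meets the high data throughput demanded by the computation. Second, an efficient internal RAM structure is designed and an effective scheme for reducing on-chip Cache power consumption is proposed. Finally, the correlation of reference-frame data is exploited to reorder interpolation, which reduces the hardware overhead of the Cache by 87.5%. Experimental results on HEVC standard test video sequences based on HM9.0 show that the design significantly reduces bandwidth consumption and hardware overhead.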

10.
This work presents a new approach and an algorithm for binary image representation, which is applied for the fast and efficient computation of moments on binary images. This binary image representation scheme is called image block representation, since it represents the image as a set of nonoverlapping rectangular areas. The main purpose of the image block representation process is to provide an efficient binary image representation rather than the compression of the image. The block represented binary image is well suited for fast implementation of various processing and analysis algorithms in a digital computing machine. The two-dimensional (2-D) statistical moments of the image may be used for image processing and analysis applications. A number of powerful shape analysis methods based on statistical moments have been presented, but they suffer from the drawback of high computational cost. The real-time computation of moments in block represented images is achieved by exploiting the rectangular structure of the blocks.
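
One way the rectangular structure can be exploited is sketched below: over a rectangle of ones, the double sum defining a moment separates into a product of two one-dimensional power sums. This is an illustration under assumed conventions (blocks as inclusive `(x1, y1, x2, y2)` tuples, hypothetical function names), not necessarily the paper's exact procedure.

```python
def power_sum(lo, hi, k):
    """Sum of v**k for integer v in [lo, hi] (inclusive)."""
    return sum(v ** k for v in range(lo, hi + 1))


def block_moment(blocks, p, q):
    """Moment m_pq of a binary image stored as nonoverlapping rectangles.

    The cost per block grows with width + height, not width * height.
    """
    return sum(
        power_sum(x1, x2, p) * power_sum(y1, y2, q)
        for x1, y1, x2, y2 in blocks
    )


def rasterize(blocks, width, height):
    """Expand the block list into a pixel array (used only for checking)."""
    image = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in blocks:
        for y in range(y1, y2 + 1):
            for x in range(x1, x2 + 1):
                image[y][x] = 1
    return image


def pixel_moment(image, p, q):
    """Brute-force m_pq over every pixel, for comparison."""
    return sum(
        (x ** p) * (y ** q) * value
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    )
```

For low orders, `power_sum` could be replaced by closed-form expressions such as n(n + 1)/2, which would make the per-block cost constant.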

11.
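To address the real-time constraints that limit the use of convolutional neural networks (CNNs) on embedded devices, and given the considerable sparsity present in CNN convolution computation, this paper proposes an FPGA-based CNN accelerator implementation that increases computation speed. First, the sparsity of CNN convolution computation is identified. Second, to make good use of parameter sparsity, the CNN convolution is converted into matrix multiplication. Finally, an FPGA-based parallel matrix multiplier implementation is proposed. Simulation results on a Virtex-7 VC707 FPGA show that, compared with a conventional CNN accelerator, the design reduces computation time by 19%. Simplifying CNN computation through sparsity can be implemented not only on FPGAs but can also be ported to other embedded platforms.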
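
The conversion of convolution into matrix multiplication can be sketched with the common im2col rearrangement. The snippet assumes a single channel, stride 1, no padding, and the cross-correlation form usual in CNNs; skipping zero weights is one simple way to use parameter sparsity and is an assumption here, not the paper's FPGA scheme. All names are hypothetical.

```python
def im2col(image, kh, kw):
    """One row per output pixel, holding the kh*kw input patch in row-major order."""
    height, width = len(image), len(image[0])
    return [
        [image[y + j][x + i] for j in range(kh) for i in range(kw)]
        for y in range(height - kh + 1)
        for x in range(width - kw + 1)
    ]


def sparse_matvec(rows, weights):
    """Matrix-vector product that multiplies only by the nonzero weights."""
    nonzero = [(i, w) for i, w in enumerate(weights) if w != 0]
    return [sum(row[i] * w for i, w in nonzero) for row in rows]


def conv2d_as_matmul(image, kernel):
    """2-D convolution (CNN convention) computed as a matrix product."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    weights = [w for row in kernel for w in row]
    flat = sparse_matvec(im2col(image, kh, kw), weights)
    return [flat[r:r + out_w] for r in range(0, len(flat), out_w)]


def conv2d_direct(image, kernel):
    """Sliding-window reference implementation, for comparison."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[y + j][x + i] * kernel[j][i] for j in range(kh) for i in range(kw))
            for x in range(len(image[0]) - kw + 1)
        ]
        for y in range(len(image) - kh + 1)
    ]
```

With several output channels, the weight vector becomes a matrix and the same patch matrix is reused for every filter, which is what makes a parallel matrix multiplier a natural fit.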

12.
13.
Low-Cost Fast VLSI Algorithm for Discrete Fourier Transform   Cited by: 1 (self-citations: 0, citations by others: 1)
A prime N-length discrete Fourier transform (DFT) can be reformulated into a (N-1)-length complex cyclic convolution and then implemented by systolic array or distributed arithmetic. In this paper, a recently proposed hardware efficient fast cyclic convolution algorithm is combined with the symmetry properties of DFT to get a new hardware efficient fast algorithm for small-length DFT, and then WFTA is used to control the increase of the hardware cost when the transform length N is large. Compared with previously proposed low-cost DFT and FFT algorithms with computation complexity of O(logN), the new algorithm can save 30% to 50% multipliers on average and improve the average processing speed by a factor of 2, when DFT length N varies from 20 to 2040. Compared with previous prime-length DFT design, the proposed design can save large amount of hardware cost with the same processing speed when the transform length is long. Furthermore, the proposed design has much more choices for different applicable DFT transform lengths and the processing speed can be flexible and balanced with the hardware cost.
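
The reformulation in the first sentence is commonly known as Rader's algorithm. The sketch below shows only that index mapping, evaluating the (N-1)-length cyclic convolution directly; it does not include the paper's fast convolution algorithm, the DFT symmetry reductions, or WFTA. Function names are hypothetical, and N is assumed to be an odd prime.

```python
import cmath


def primitive_root(n):
    """Smallest generator of the multiplicative group modulo an odd prime n."""
    for g in range(2, n):
        if len({pow(g, e, n) for e in range(n - 1)}) == n - 1:
            return g
    raise ValueError("n must be an odd prime")


def rader_dft(x):
    """DFT of odd prime length N via an (N-1)-point cyclic convolution."""
    n = len(x)
    g = primitive_root(n)
    g_inv = pow(g, n - 2, n)  # modular inverse of g, by Fermat's little theorem
    m = n - 1
    # Inputs permuted by powers of g; twiddles permuted by powers of g^-1.
    a = [x[pow(g, q, n)] for q in range(m)]
    b = [cmath.exp(-2j * cmath.pi * pow(g_inv, q, n) / n) for q in range(m)]
    # Direct cyclic convolution; a fast algorithm would replace this step.
    conv = [sum(a[q] * b[(p - q) % m] for q in range(m)) for p in range(m)]
    out = [0j] * n
    out[0] = sum(x)
    for p in range(m):
        out[pow(g_inv, p, n)] = x[0] + conv[p]
    return out
```

The permutation works because the nonzero indices modulo a prime form a cyclic group, so the product of indices in the DFT exponent becomes a sum of exponents of g.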

14.
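To accommodate the large data volumes and strict real-time requirements of array signal processing, this paper designs an FPGA-based multifunction array signal processing system in line with project requirements. Using an advanced large-scale, high-performance FPGA and multichannel high-precision ADC chips, the system performs synchronous acquisition and digital down-conversion of 40 channels of intermediate-frequency signals, and obtains 36 sets of beam data through digital beamforming. Several types of external interfaces provide network data exchange, serial-port control, beam control, and high-speed MGT data transmission with multiple external devices. The paper presents the overall hardware and software architecture of the system and describes in detail the chip selection, the peripheral interfaces, and the implementation of each software functional module. Test results show that the system meets the design requirements and offers strong array signal processing capability together with good generality and scalability.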

15.
This paper presents a practical design of a smart antenna system based on direction-of-arrival estimation and Dolph–Chebyshev beamforming. Direction-of-arrival estimation is based on the multiple signal classification algorithm for identifying the directions of the source signals incident on the sensor array comprising the smart antenna system. The smart antenna system design involves a hardware part, which provides real data measurements of the incident signals received by the sensor array. This paper presents the Dolph–Chebyshev method for the synthesis and design of antenna arrays with periodic element spacing. A Field-Programmable Gate Array implementation is presented for an antenna array application employing digital beamforming. The array comprises 10 elements, an equal number of radio-frequency and intermediate-frequency receiving components, and a Spartan-3E Field-Programmable Gate Array-based unit, which is responsible for the control of the array. Low-cost switched-beam and fully adaptive antenna arrays suitable for 2-GHz applications are proposed in this paper. Copyright © 2013 John Wiley & Sons, Ltd.

16.
An efficient approach to design very large scale integration (VLSI) architectures and a scheme for the implementation of the discrete sine transform (DST), based on an appropriate decomposition method that uses circular correlations, is presented. The proposed design uses an efficient restructuring of the computation of the DST into two circular correlations, having similar structures and only one half of the length of the original transform; these can be concurrently computed and mapped onto the same systolic array. Significant improvement in the computational speed can be obtained at a reduced input-output (I/O) cost and low hardware complexity, retaining all the other benefits of the VLSI implementations of the discrete transforms, which use circular correlation or cyclic convolution structures. These features are demonstrated by comparing the proposed design with some of the previously reported schemes.

17.
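Discrete wavelet transforms based on the lifting scheme require less computation than traditional convolution-based ones and are easy to implement in VLSI. This paper proposes a VLSI architecture design method, based on the lifting scheme, for efficient real-time implementation of the 9/7 biorthogonal discrete wavelet transform filter in JPEG2000. For the same precision, the resulting architecture reduces the amount of computation, achieves high overall speed, has low hardware cost and low memory requirements, and reaches 100% hardware utilization. The system is described in Verilog HDL and synthesized in the ISE4.1 environment for the Xilinx XCV50e-cs144-8 device.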
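
The predict/update structure of the lifting scheme can be sketched in a few lines. To stay short and exact, the example uses a simpler integer 5/3-style wavelet with two lifting steps instead of the 9/7 filter discussed in the abstract, which adds further lifting steps and a scaling stage. The function names, the even-length restriction, and the symmetric boundary handling are assumptions of this sketch.

```python
def lifting_forward(x):
    """One level of an integer 5/3-style lifting transform.

    Returns (approximation, detail) for an even-length list of integers.
    """
    if len(x) < 2 or len(x) % 2:
        raise ValueError("signal length must be even and at least 2")
    s = list(x[0::2])  # even samples
    d = list(x[1::2])  # odd samples
    half = len(s)
    # Predict: each odd sample minus the floor average of its even neighbours.
    d = [d[i] - ((s[i] + s[min(i + 1, half - 1)]) >> 1) for i in range(half)]
    # Update: each even sample plus a rounded quarter of the neighbouring details.
    s = [s[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    return s, d


def lifting_inverse(s, d):
    """Undo lifting_forward by reversing the steps and flipping the signs."""
    half = len(s)
    s = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    d = [d[i] + ((s[i] + s[min(i + 1, half - 1)]) >> 1) for i in range(half)]
    x = [0] * (2 * half)
    x[0::2] = s
    x[1::2] = d
    return x
```

Each step modifies one half of the samples using only the other half, so the inverse is obtained by running the same steps backwards. The steps use only additions and shifts on data already in place, which is why lifting maps well to hardware.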

18.
Spline curve rendering is an essential operation in modern 2-D graphic applications. Different from the software acceleration approach by graphic processor units, this brief presents a very large scale integration hardware accelerator architecture for supporting fast curve rendering. Many existing accelerators employ a sequential forward-difference algorithm, where a step size is used in calculating the next sample on the curve. The problem of hardware-based curve rendering is that feedback loops are required to accumulate the difference, and these loops inhibit many traditional performance-enhancement strategies such as unfolding and pipelining. This brief proposes a different parallel design approach by transforming the difference equation set into parallel ones. Each equation has to be equipped with the same increased step size but accumulated starting from different initial values. Although more initial values must be precomputed, this computation can itself be sped up by using the accelerator. The proposed design can be applied not only to cubic spline curves but also to any curves defined by parameterized polynomial functions.
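
A software sketch of the idea follows. `forward_difference` evaluates a cubic using only additions inside the loop, and `parallel_forward_difference` runs several such accumulators ("lanes") with the step enlarged by the lane count and staggered start points, then interleaves their outputs. This is a reading of the abstract under stated assumptions, with hypothetical names, and the lanes execute sequentially here rather than in parallel hardware.

```python
def evaluate(coeffs, t):
    """Cubic a*t^3 + b*t^2 + c*t + d evaluated by Horner's rule."""
    a, b, c, d = coeffs
    return ((a * t + b) * t + c) * t + d


def forward_difference(coeffs, t0, h, count):
    """Samples at t0, t0 + h, ... using three additions per sample."""
    f0, f1, f2, f3 = (evaluate(coeffs, t0 + i * h) for i in range(4))
    value = f0
    d1 = f1 - f0                      # first difference
    d2 = f2 - 2 * f1 + f0             # second difference
    d3 = f3 - 3 * f2 + 3 * f1 - f0    # third difference (constant for a cubic)
    samples = []
    for _ in range(count):
        samples.append(value)
        value += d1
        d1 += d2
        d2 += d3
    return samples


def parallel_forward_difference(coeffs, h, count, lanes):
    """Split the work over independent accumulators with step lanes * h."""
    per_lane = -(-count // lanes)  # ceiling division
    streams = [
        forward_difference(coeffs, j * h, lanes * h, per_lane)
        for j in range(lanes)
    ]
    # Lane j produced samples j, j + lanes, j + 2 * lanes, ...
    return [streams[i % lanes][i // lanes] for i in range(count)]
```

Because each lane carries its own feedback loop, the lanes have no data dependence on one another, which is the property that permits parallel hardware. The price is the extra set of initial values per lane, as the abstract notes.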

19.
Multimedia applications such as video and image processing are often characterized as computation intensive applications. For these applications the word-length of data and instructions is different throughout the application. Generating hardware architectures is not a straightforward task since it requires a deep word-length analysis in order to properly determine what hardware resources are needed. In this paper we suggest an automated design methodology based on high-level synthesis which takes care of data word-length and interconnection resource cost in order to generate area and power efficient fixed-point architectures for DSP applications. Both ASIC and FPGA technologies are targeted. Experimental results show that our proposed approach reduces area by 6% to 42% on FPGA technology and by 9% to 48% on ASIC compared to previous approaches. Power saving can reach up to 44% on FPGA technology and 36% on ASIC.

20.
Video object segmentation is an important pre-processing task for many video analysis systems. To meet the requirement of real-time video analysis, hardware acceleration is required. In this paper, after analyzing existing video object segmentation algorithms, it is found that most of the core operations can be implemented with simple morphology operations. Therefore, with the concepts of morphological image processing element array and stream processing, a reconfigurable morphological image processing accelerator is proposed, in which the proposed instruction set controls the operation of each processing element and can also reconfigure the interconnection between processing elements. Simulation results show that most of the core operations of video object segmentation can be supported by the accelerator by only changing the instructions. A prototype chip is designed to support real-time change-detection-and-background-registration based video object segmentation algorithm. This chip incorporates eight macro processing elements and can support a processing capacity of 6,200 9-bit morphological operations per second on a SIF image. Furthermore, with the proposed tiling and pipelined-parallel techniques, a real-time watershed transform can be achieved using 32 macro processing elements.

