Similar literature
1.
A class of Finite Impulse Response (FIR) filtering algorithms based either on short Fast Fourier Transforms (FFT) or on short-length FIR filtering algorithms was recently proposed. Besides significantly reducing the arithmetic complexity, these algorithms have characteristics that make them useful in many applications, namely a small processing delay (independent of the FIR filter length) and a multiply-add based computational structure. These algorithms are presented in a unified framework, which makes it easy to combine any of them. However, a remaining difficulty concerns the implementation of the fast algorithms on Digital Signal Processors (DSP), given the finite DSP resources (number of pointers, registers and memory), while keeping as much as possible of the improvement brought by the reduced arithmetic complexity. This paper provides an efficient implementation methodology that organizes the algorithm so that memory data access is optimized on a DSP. As a result, our implementation requires a constant number of pointers whatever the algorithm combination. This knowledge is used in a DSP code generator that can select the appropriate algorithm meeting the application constraints and automatically generate optimized assembly code, using macro-instructions available in a DSP-dependent library. An improvement of more than 50% in throughput (number of machine cycles per point) compared to the implementation of the direct convolution is generally achieved.
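As a rough illustration of the short-length FIR filtering idea mentioned in this abstract, the sketch below splits the input and the filter into even and odd phases and computes the full convolution with three half-length convolutions instead of four. This is the standard two-phase decomposition, not necessarily the exact algorithm of the paper; the function names `convolve` and `fast_fir_2phase` are illustrative, and both sequence lengths are assumed to be even.

```python
def convolve(a, b):
    """Direct (full) linear convolution of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def fast_fir_2phase(x, h):
    """Full convolution of x and h using three half-length convolutions.

    Assumption: len(x) and len(h) are both even and non-zero.
    """
    x0, x1 = x[0::2], x[1::2]          # even / odd input phases
    h0, h1 = h[0::2], h[1::2]          # even / odd filter phases

    a = convolve(h0, x0)               # H0 * X0
    b = convolve(h1, x1)               # H1 * X1
    c = convolve([p + q for p, q in zip(h0, h1)],
                 [p + q for p, q in zip(x0, x1)])   # (H0 + H1) * (X0 + X1)

    n = len(a)
    # Odd outputs: H0*X1 + H1*X0 = c - a - b, which saves one sub-convolution.
    y_odd = [c[k] - a[k] - b[k] for k in range(n)]
    # Even outputs: H0*X0 plus H1*X1 delayed by one half-rate sample.
    y_even = [(a[k] if k < n else 0) + (b[k - 1] if k >= 1 else 0)
              for k in range(n + 1)]

    y = [0] * (len(x) + len(h) - 1)
    y[0::2] = y_even
    y[1::2] = y_odd
    return y
```

The saving comes from trading one sub-convolution for a few extra additions; applying the same split recursively to the sub-convolutions reduces the multiplication count further.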

2.
Distributed arithmetic techniques are the key to efficient implementation of DSP algorithms in FPGAs. The distributed arithmetic process is briefly described. A representative DSP design application in the form of an 8-tap FIR filter is offered for the Xilinx XC3042 field programmable gate array (FPGA). The design is presented in sufficient detail (from filter specifications, via filter design software, through the detailed logic of salient data and control functions) to obtain a realistic placement and routing of configurable logic block (CLB) and input/output block (IOB) components for simulation verification and for performance evaluation against commercially available dedicated 8-tap FIR filter chips.
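The distributed arithmetic process referred to above can be sketched in software as follows. This is a minimal model, assuming fixed integer coefficients and two's-complement input samples of a known bit width; the names `build_lut` and `da_inner_product` are illustrative and do not come from the paper. The multiplications of an inner product are replaced by one table lookup per bit position, followed by a shift and an add.

```python
def build_lut(coeffs):
    """Precompute, for every bit pattern, the sum of the selected coefficients."""
    k = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << k)]


def da_inner_product(coeffs, samples, bits):
    """Compute sum(c * s) bit-serially with a lookup table.

    Assumption: every sample fits in `bits`-bit two's complement.
    """
    lut = build_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into one table address.
        addr = 0
        for i, s in enumerate(samples):
            addr |= ((s >> b) & 1) << i
        term = lut[addr] << b
        # The most significant bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc
```

For an 8-tap filter the table has 256 entries, which is why the technique maps well onto lookup-table-based logic.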

3.
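Hu Fengguo. 《现代电子技术》 (Modern Electronics Technique), 2010, 33(20): 127-132
Based on a comprehensive comparison of existing algorithms for the four arithmetic operations on very long integers, this paper proposes a relatively complete set of design goals for such algorithms. Using character arrays, it presents a set of algorithms for the four arithmetic operations on unsigned very long integers, gives the logic and program code of the addition, subtraction, multiplication and division algorithms, implements exact arithmetic on very long integers, and applies the algorithms in scientific research practice.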
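A minimal sketch of character-array arithmetic of the kind this abstract describes, covering only addition and multiplication. It is not the paper's code: the function names `add_digits` and `mul_digits` are hypothetical, and the inputs are assumed to be non-empty strings of decimal digits without leading zeros.

```python
def add_digits(a, b):
    """Add two unsigned decimal integers given as digit strings."""
    i, j, carry, out = len(a) - 1, len(b) - 1, 0, []
    while i >= 0 or j >= 0 or carry:
        total = carry
        if i >= 0:
            total += ord(a[i]) - ord("0")
            i -= 1
        if j >= 0:
            total += ord(b[j]) - ord("0")
            j -= 1
        out.append(chr(total % 10 + ord("0")))
        carry = total // 10
    return "".join(reversed(out))


def mul_digits(a, b):
    """Schoolbook multiplication of two unsigned decimal digit strings."""
    result = [0] * (len(a) + len(b))
    # Accumulate all digit products at their positional weight.
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            result[i + j + 1] += (ord(a[i]) - ord("0")) * (ord(b[j]) - ord("0"))
    # Propagate carries from the least significant position upward.
    for k in range(len(result) - 1, 0, -1):
        result[k - 1] += result[k] // 10
        result[k] %= 10
    text = "".join(map(str, result)).lstrip("0")
    return text or "0"
```

Python integers are already arbitrary precision, so the built-in `int` serves as a convenient reference when checking such routines.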

4.
Reconfigurable hardware has become a well-accepted option for implementing digital signal processing (DSP). Traditional devices such as field-programmable gate arrays offer good fine-grain flexibility. More recent coarse-grain reconfigurable architectures are optimized for word-length computations. We have developed a medium-grain reconfigurable architecture that combines the advantages of both approaches. Modules such as multipliers and adders are mapped onto blocks of 4-bit cells. Each cell contains a matrix of lookup tables that implement either mathematics functions or a random-access memory. A hierarchical interconnection network supports data transfer within and between modules. We have created software tools that allow users to map algorithms onto the reconfigurable platform. This paper analyzes the implementation of several common benchmarks, ranging from floating-point arithmetic to a radix-4 fast Fourier transform. The results are compared to contemporary DSP hardware.

5.
Modular arithmetic is a building block for a variety of applications potentially supported on embedded systems. One approach to making modular arithmetic more efficient is to identify algorithmic modifications that enhance the parallelization of the target arithmetic, so as to exploit the properties of parallel devices and platforms. The Residue Number System (RNS) introduces data-level parallelism, enabling parallelization even for algorithms based on modular arithmetic with several data dependencies. However, mapping generic algorithms to full RNS-based implementations can be complex, and it requires suitable hardware architectures that are scalable and adaptable to different demands. This paper proposes and discusses an architecture with scalability features for the parallel implementation of algorithms relying on modular arithmetic fully supported by the RNS. The systematic mapping of a generic modular arithmetic algorithm to the architecture is presented. It can be applied as a high-level synthesis step in an Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA) design flow targeting modular arithmetic algorithms. An implementation of modular exponentiation and Elliptic Curve (EC) point multiplication, used in the Rivest-Shamir-Adleman (RSA) and EC cryptographic algorithms, on Xilinx Virtex 4 and Altera Stratix II FPGA technologies suggests latency results of the same order of magnitude as the fastest hardware implementations of these operations known to date.
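The data-level parallelism of the RNS can be seen in a small sketch: each residue channel is processed independently, and the result is recovered with the Chinese Remainder Theorem. The moduli `(7, 11, 13)` and all function names are assumptions chosen for illustration (Python 3.8 or later is assumed for `math.prod` and the modular inverse via `pow`); they are not taken from the paper.

```python
from math import prod

MODULI = (7, 11, 13)   # illustrative pairwise-coprime moduli, dynamic range 1001


def to_rns(x, moduli=MODULI):
    """Represent x by its residues modulo each channel modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Channel-wise addition; no carries cross between channels."""
    return tuple((ra + rb) % m for ra, rb, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Channel-wise multiplication; each channel could run in parallel."""
    return tuple((ra * rb) % m for ra, rb, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Reconstruct the integer with the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        mi = big_m // m
        total += r * mi * pow(mi, -1, m)   # pow(mi, -1, m) is the inverse of mi mod m
    return total % big_m
```

Results are only meaningful while they stay below the product of the moduli, which is why practical designs use many wider channels.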

6.
Acoustic echo cancellation with large adaptive filters is a computationally intensive problem that needs a real-time, cost-effective solution. To deal with these challenges, designers have increasingly turned to mixed Hardware/Software (HW/SW) implementations of echo canceller algorithms. This paper presents a co-design methodology and environment for both hardware and software modules. We describe how High Level Synthesis (HLS) tools like GAUT and SYNDEX can be used efficiently for rapid prototyping of a heterogeneous architecture based on a TMS320C40 DSP and an ASIC. The HW/SW interface synthesis task is discussed in particular, since it is a key issue for the whole design. As an illustration, we present a mixed implementation of the GMDF alpha algorithm, an adaptive filter well suited to acoustic echo cancellation, on both an ASIC and a TMS320C40 DSP.

7.
A design technique based on a combination of Common Sub-Expression Elimination and Bit-Slice (CSE-BitSlice) arithmetic for hardware and performance optimization of multiplier designs with variable operands is presented in this paper. The CSE-BitSlice technique can be extended to hardware optimization of multiplier circuits operating on vectors or matrices of variables. The CSE-BitSlice technique has been applied to the design and implementation of 12 × 12 and 42 × 42 bit real multipliers, a complex multiplier, a 6-tap FIR filter, and a 5-point DFT circuit. For comparison purposes, circuit implementations of the same arithmetic and DSP functions have been carried out using Radix-4 Booth and CSA algorithms. Simulation results based on implementations using the Xilinx FPGA 5VLX330FF1760-2 device show that the circuits based on the CSE-BitSlice technique require fewer logic resources and yield higher throughput than the CSA and Radix-4 Booth based circuits.

8.
A 32-b RISC/DSP microprocessor with reduced complexity
This paper presents a new 32-b reduced instruction set computer/digital signal processor (RISC/DSP) architecture which can be used as a general purpose microprocessor and in parallel as a 16-/32-b fixed-point DSP. This has been achieved by using RISC design principles for the implementation of DSP functionality. A DSP unit operates in parallel to an arithmetic logic unit (ALU)/barrel shifter on the same register set. This architecture provides the fast loop processing, high data throughput, and deterministic program flow that are essential in DSP applications. Besides offering a basis for general purpose and DSP processing, the RISC philosophy offers a higher degree of flexibility for the implementation of DSP algorithms and achieves higher clock frequencies compared to conventional DSP architectures. The integrated DSP unit provides instruction set support for highly specialized DSP algorithms. Subword processing optimized for DSP algorithms has been implemented to provide maximum performance for 16-b data types. While creating a unified base for both application areas, we also minimized transistor count and reduced complexity by using a short instruction pipeline. A parallelism concept based on a varying number of instruction latency cycles made superscalar instruction execution superfluous.

9.
Many useful DSP algorithms have high dimensions and complex logic. Consequently, an efficient implementation of these algorithms on parallel processor arrays must involve a structured design methodology. Full-search block-matching motion estimation is one of those algorithms that can be developed using parallel processor arrays. In this paper, we present a hierarchical design methodology for full-search block-matching motion estimation. Our proposed methodology breaks the algorithm down into simpler steps and then explores the different possible design options at each step. Input data timing restrictions are taken into consideration, as well as buffering requirements. A designer is able to modify system performance by selecting some of the algorithm variables for pipelining or broadcasting. Our proposed design strategy also allows the designer to study the time and hardware complexities of computations at each level of the hierarchy. The resultant architecture allows easy modifications to the organization of data buffers and processing elements (their number, datapath pipelining, and complexity) to produce a system whose performance matches the video data sample rate requirements.
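For reference, the computation that such processor arrays parallelize can be written sequentially as below. This is a plain software sketch under stated assumptions, not the paper's architecture: frames are lists of rows of integer pixel values, the matching cost is the sum of absolute differences (SAD), and the names `sad` and `full_search` are illustrative.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    return sum(abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
               for r in range(n) for c in range(n))


def full_search(cur, ref, bx, by, n, search_range):
    """Try every displacement in the search window; return (cost, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if x < 0 or y < 0 or x + n > width or y + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

The four nested loops (two over displacements, two over block pixels) are the index space from which pipelined or broadcast variables can be chosen when mapping the algorithm to an array.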

10.
Karnofsky, K. IEEE Spectrum, 1996, 33(7): 79-82
The boom in the use of systems based on digital signal processing (DSP) is rivalled only by the swiftness with which their technology changes. No sooner do engineers master the latest development than a still newer one emerges. It's the same with design tools; it never seems quite possible for the users to catch up. Designing DSP-based systems, therefore, remains a challenging and multidisciplinary task. More often than not, unfortunately, there is a gap between the algorithm development and implementation phases of a DSP design project. Therefore, DSP engineers are turning in droves toward a methodology that integrates the design of DSP algorithms with the later stages of development and implementation. Called “accelerated DSP design” (ADD), the methodology makes use of high-level algorithm simulation and rapid prototyping on off-the-shelf DSP boards, both offline and in real-time environments, to achieve its goals. The tools it uses allow early validation of algorithms and evaluation of tradeoffs, increasing the designer's confidence that a particular design will meet its requirements.

11.
A general-purpose programmable digital signal processor (DSP) has been implemented in 1.5-μm (L_eff) NMOS technology using full-custom circuit design for high performance. The DSP has a 32-bit instruction set, 32-bit data path, and full-hardware 32-bit floating-point arithmetic. The architecture is described section by section, and an overview of the instruction set is presented. The extensive design verification process applied to the DSP is also described.

12.
The authors believe that special-purpose architectures for digital signal processing (DSP) real-time applications will use closely coupled processing elements as array processor modules to implement the various portions of the new algorithms, and several such modules will cooperate in a pipelined manner to implement complete algorithms. Such an architecture, based upon systolic modules, for the MUSIC algorithm is presented. The architecture is suitable for VLSI implementation. The throughput of the pipelined approach is O(N), whereas the sequential approach is O(N^3).

13.
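Based on a study of the architecture and design methods of 32-bit fixed-point DSPs, this article explains the operating principles of the key logic blocks of the CPU in a 32-bit fixed-point DSP, including the ALU, MPY, ARAU, pipeline, instruction set and bus interface, and analyzes and describes the design approach and implementation of each block. Using a standard-cell-based forward design method, a fixed-point DSP circuit with a 32-bit instruction set was designed. The circuit uses a Harvard bus structure and can perform a 16×16-bit signed integer multiplication, a 32-bit accumulation, and arithmetic and logic operations on 32-bit data in a single cycle, giving high processing precision. The circuit was taped out in a 0.5 μm 1P3M CMOS process, with an integration level of 70,000 gates, an operating frequency of up to 36 MHz, and a dynamic power consumption of 594 mW.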

14.
The evolution of CORDIC, an iterative arithmetic computing algorithm capable of evaluating various elementary functions using a unified shift-and-add approach, and of CORDIC processors is reviewed. A method to utilize a CORDIC processor array to implement digital signal processing algorithms is presented. The approach is to reformulate existing DSP algorithms so that they are suitable for implementation with an array performing circular or hyperbolic rotation operations. Three categories of algorithm are surveyed: linear transformations, digital filters, and matrix-based DSP algorithms.
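The circular rotation operation at the heart of CORDIC can be sketched as follows. This is a floating-point model for readability, whereas hardware would use fixed-point shifts and adds. The function name `cordic_sin_cos` and the default of 32 iterations are assumptions for illustration, and the input angle is assumed to lie within the basic convergence range of roughly ±1.74 rad.

```python
import math


def cordic_sin_cos(theta, iterations=32):
    """Return (sin(theta), cos(theta)) using CORDIC circular rotation mode."""
    # Elementary rotation angles atan(2^-i); a small table in hardware.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; precompute the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = 1.0 / gain, 0.0, theta   # start pre-scaled on the x axis
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0    # rotate toward zero residual angle
        # Multiplying by 2^-i is a right shift in fixed-point hardware.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x
```

Each iteration needs only two shift-and-add updates and one table lookup, which is what makes an array of such processors attractive for rotation-based formulations of DSP algorithms.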

15.
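Design of a processing platform for a real-time target tracking system and research on a fast algorithm
Song Huajun, Zhu Ming, Hu Shuo. 《电子器件》 (Chinese Journal of Electron Devices), 2004, 27(3): 474-477
To resolve the conflict between real-time performance and algorithm complexity in television acquisition, tracking and pointing systems, a real-time target recognition and tracking platform was designed. It uses the high-performance DSP chip TMS320C6416 as the core processor, a complex programmable logic device (CPLD) for logic control, and a field-programmable gate array (FPGA) to preprocess the captured digital video images. The target recognition algorithm was also improved, and a fast image correlation matching algorithm based on a genetic algorithm is proposed. The paper focuses on the hardware composition and operating principle of this real-time digital image processing system and on the new image correlation matching algorithm. The results show that the system has good real-time performance and stability.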

16.
17.
The feasibility and performance of implementing kinematics and inverse dynamics algorithms on a DSP chip for real-time robot arm control are investigated. The algorithms include the following modules: forward and inverse kinematics; Jacobian, inverse Jacobian, and Jacobian derivative term; and Newton-Euler inverse dynamics. These modules are unified under a common coordinate system, and then computationally optimized by eliminating the redundancies among the modules. Further optimization is indicated for PUMA-like arms. The algorithms are implemented on a TI TMS320C30 DSP chip. It is found that the execution time for the entire set of algorithms is about 0.78 ms for a six-degree-of-freedom robot with a spherical wrist, and about 0.63 ms for a PUMA-specific arm. The communication time between the host PC and the DSP chip is about 0.376 ms. Thus, it is possible to implement a complete Cartesian controller at a 1000 Hz sampling rate. The algorithms have been successfully tested on a PUMA arm with a PC-based advanced controller.

18.
New energy concepts such as distributed power generation systems (DPGSs) are changing the face of electric distribution and transmission. Power electronics researchers try to apply new electronic controller solutions with the capacity of implementing new and more complex control algorithms combined with internal high-speed communication interfaces. Thus, it is possible to monitor, store, and transfer a large number of internal variables that can be sent online to local or remote hosts in order to take new set points of different generation units. With this objective, this paper presents the design, implementation, and test of an industrial multiprocessor controller based on a floating-point digital signal processor (DSP) and a field-programmable gate array, which operate cooperatively. The communication architecture, which has been added to the proposed electronic solution, consists of a universal serial bus (USB), implemented with a minimum use of the DSP core, and a controller area network (CAN) bus that permits distributed control. Although the proposed system can be readily applied to any DPGS, in this paper, it is focused on a 150-kVA back-to-back three-level neutral-point-clamped voltage source converter for wind turbine applications.

19.
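The basic principle of overlay programming is to apply modular design: the task is divided into several functional modules, and only the module that currently needs to execute is loaded into memory, while modules that are not yet needed are not loaded. When another module needs to execute, the module in memory is first unloaded and the required module is then loaded into memory. In terms of memory usage, overlay technology is very similar to dynamic link libraries. Taking the TMS320C6000 family of DSPs as the target platform, this paper uses a concrete example to present the method and implementation steps for developing DSP overlay programs, and discusses each stage of overlay programming in detail.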

20.
The initial part of this paper reviews the early challenges (c. 1980) in achieving real-time silicon implementations of DSP computations. In particular, it discusses research on application-specific architectures, including bit-level systolic circuits, that led to important advances in achieving the DSP performance levels then required. These were many orders of magnitude greater than those achievable using programmable (including early DSP) processors, and were demonstrated through the design of commercial digital correlator and digital filter chips. As is discussed, an important challenge was the application of these concepts to recursive computations as occur, for example, in Infinite Impulse Response (IIR) filters. An important breakthrough was to show how fine-grained pipelining can be used if arithmetic is performed most significant bit (msb) first. This can be achieved using redundant number systems, including carry-save arithmetic. This research and its practical benefits were again demonstrated through a number of novel IIR filter chip designs which, at the time, exhibited performance much greater than previous solutions. The architectural insights gained, coupled with the regular nature of many DSP and video processing computations, also provided the foundation for new methods for the rapid design and synthesis of complex DSP System-on-Chip (SoC) Intellectual Property (IP) cores. This included the creation of a wide portfolio of commercial SoC video compression cores (MPEG2, MPEG4, H.264) for very high performance applications ranging from cell phones to High Definition TV (HDTV). The work provided the foundation for systematic methodologies, tools and design flows, including high-level design optimizations based on “algorithmic engineering”, and also led to the creation of the Abhainn tool environment for the design of complex heterogeneous DSP platforms comprising processors and multiple FPGAs.
The paper concludes with a discussion of the problems faced by designers in developing complex DSP systems using current SoC technology.
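The carry-save arithmetic mentioned in this abstract can be illustrated with a small integer model. It is a sketch of the general technique only, restricted to non-negative integers, and the names `carry_save_add` and `sum_many` are illustrative. Three operands are reduced to a sum word and a carry word without any carry rippling along the word, so the delay of each step does not depend on the word length.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative integers to a (sum, carry) pair.

    Every bit position is handled independently, as in a row of full adders.
    """
    partial_sum = a ^ b ^ c                       # per-bit sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1    # per-bit majority, weighted x2
    return partial_sum, carry


def sum_many(values):
    """Accumulate values in redundant (sum, carry) form."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s + c   # a single carry-propagate addition at the very end
```

Keeping the running total as two words is one form of redundant number representation; the costly carry propagation is deferred to a single final addition.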
