1.

A Highly Efficient Multicore Floating-Point FFT Architecture Based on Hybrid Linear Algebra/FFT Cores

Ardavan Pedram John D. McCalpin Andreas Gerstlauer 《Journal of Signal Processing Systems》2014,77(1-2):169-190

FFT algorithms have memory access patterns that prevent many architectures from achieving high computational utilization, particularly when parallel processing is required to achieve the desired levels of performance. Starting with a highly efficient hybrid linear algebra/FFT core, we co-design the on-chip memory hierarchy, on-chip interconnect, and FFT algorithms for a multicore FFT processor. We show that it is possible to achieve excellent parallel scaling while maintaining power and area efficiency comparable to that of the single-core solution. The result is an architecture that can effectively use up to 16 hybrid cores for transform sizes that can be contained in on-chip SRAM. When configured with 12 MiB of on-chip SRAM, our technology evaluation shows that the proposed 16-core FFT accelerator should sustain 388 GFLOPS of nominal double-precision performance, with power and area efficiencies of 30 GFLOPS/W and 2.66 GFLOPS/mm², respectively.
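The quoted efficiency ratios imply a total power and die area that the abstract does not state directly. The short calculation below derives them (roughly 13 W and 146 mm²). The variable names are illustrative, and the derived values are an inference from the quoted figures, not numbers reported in the abstract.

```python
# Figures quoted in the abstract.
sustained_gflops = 388.0
gflops_per_watt = 30.0
gflops_per_mm2 = 2.66

# Derived quantities: implied by the quoted ratios, not stated in the abstract.
implied_power_w = sustained_gflops / gflops_per_watt
implied_area_mm2 = sustained_gflops / gflops_per_mm2

print(f"implied power: {implied_power_w:.1f} W")
print(f"implied area:  {implied_area_mm2:.1f} mm^2")
```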

2.

FPGA implementations of fast Fourier transforms for real-time signal and image processing (total citations: 5; self-citations: 0; citations by others: 5)

Uzun I.S. Amira A. Bouridane A. 《Vision, Image and Signal Processing, IEE Proceedings -》2005,152(3):283-296

Applications based on the fast Fourier transform (FFT), such as signal and image processing, require high computational power, plus the ability to experiment with algorithms. Reconfigurable hardware devices in the form of field programmable gate arrays (FPGAs) have been proposed as a way of obtaining high performance at an economical price. However, users must program FPGAs at a very low level and have a detailed knowledge of the architecture of the device being used. They do not therefore facilitate easy development of, or experimentation with, signal/image processing algorithms. To try to reconcile the dual requirements of high performance and ease of development, the paper reports on the design and realisation of a high level framework for the implementation of 1D and 2D FFTs for real-time applications. A wide range of FFT algorithms, including radix-2, radix-4, split-radix and fast Hartley transform (FHT) have been implemented under a common framework in order to enable system designers to meet different system requirements. Results show that the parallel implementation of 2D FFT achieves linear speed-up and real-time performance for large matrix sizes. Finally, an FPGA-based parametrisable environment based on 2D FFT is presented as a solution for frequency-domain image filtering application.

3.

Memory Controller for Vector Processor

Tassadaq Hussain Osman S. Ünsal Adrian Cristal Eduard Ayguadé 《Journal of Signal Processing Systems》2018,90(11):1533-1549

4.

Fault-tolerant designs for 256 Mb DRAM

Kirihata T. Watanabe Y. Hing Wong DeBrosse J.K. Yoshida M. Kato D. Fujii S. Wordeman M.R. Poechmueller P. Parke S.A. Asao Y. 《Solid-State Circuits, IEEE Journal of》1996,31(4):558-566

This paper describes fault-tolerant designs, which have been used to boost the yield of a 286 mm² 256 Mb DRAM with x32 both-ends DQ. The 256 Mb DRAM consists of sixteen 16 Mb units, each containing one 128 Kb row redundancy block. This row redundancy block architecture allows flexible row redundancy replacement, where random faults, clustered faults, and grouped faults can be efficiently repaired. Flexible column redundancy replacement with interchangeable master DQ's (MDQ) is used to allow a 256 b data compression without causing a data conflict, while improving the column access speed by 2 ns. A depletion NMOS bitline-precharge-current-limiter suppresses the current flow which occurs as a result of a wordline-bitline short-circuit to only 15 μA per cross fail, avoiding a standby current fail. Consequently, the hardware results show a significant yield enhancement of 16 times relative to the intra-block/segment replacement. Detailed simulation results show that this 256 Mb DRAM allows 275 random faults to be repaired with 5.5% silicon area overhead for 80% chip yield.

5.

Design Space Exploration of 1-D FFT Processor

Shaohan Liu Dake Liu 《Journal of Signal Processing Systems》2018,90(11):1609-1621

A design space exploration methodology for 1-D FFT processors is proposed to find the best hardware architecture in a quantitative way during early design. The methodology includes architecture candidate collection, coarse-grained architecture selection, and circuit level design optimizations. We show how to select a better architecture from candidates including different architectures (SDF, SDC, MDF, MDC and memory-based) with different degrees of parallelism at different radices. The sub-level designs, including designs of the rotator and data scaling module, are introduced for further optimizations. As a proof of concept, an FFT processor for 4G, WLAN and future 5G is designed supporting 16-4096 and 12-2400 point FFTs. A memory-based architecture with a 16-datapath mixed-radix butterfly unit is selected to satisfy the demands for 1 GS/s (4096) throughput. The synthesis result based on 65 nm technology shows that the silicon cost and power consumption are 1.46 mm² and 68.64 mW, respectively. The proposed processor has better normalized throughput per area unit and normalized FFTs per energy unit than the available state-of-the-art designs.

6.

FPGA-Based Implementation of High-Precision Phase Difference Measurement Algorithms

郎杰 邹建彬 张尔扬 《现代电子技术》2011,(21):28-30,33

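This paper first introduces two high-precision phase difference measurement algorithms: a correlation method based on direct digital synthesis (DDS) and an FFT method based on the fast Fourier transform (FFT). Next, theoretical simulation is used to analyze the performance of the two algorithms under different signal-to-noise ratios and data lengths, and a design for the hardware platform is derived from the results. Finally, a hardware test platform is built around the high-performance FPGA chip XC5SX95T, and both phase difference measurement algorithms are implemented in hardware. Measurements show that the platform achieves good phase difference measurement accuracy.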
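As a rough illustration of the FFT-based approach named in this abstract, the sketch below estimates the phase difference between two sinusoids of equal frequency by comparing their spectra at the dominant bin. For brevity it evaluates a single DFT bin directly rather than running a full FFT. The function names, sample count, and test signal are assumptions made for this example and are not taken from the paper.

```python
import cmath
import math


def dft_bin(samples, k):
    """Return DFT coefficient k of a real sequence (direct evaluation, not an FFT)."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))


def phase_difference(sig_a, sig_b, k):
    """Phase of sig_b relative to sig_a at bin k, in radians within [-pi, pi]."""
    return cmath.phase(dft_bin(sig_b, k) * dft_bin(sig_a, k).conjugate())


# Hypothetical test signal: 8 full cycles in 256 samples; channel b lags a by 0.3 rad.
N, K, LAG = 256, 8, 0.3
a = [math.cos(2 * math.pi * K * i / N) for i in range(N)]
b = [math.cos(2 * math.pi * K * i / N - LAG) for i in range(N)]

# A lag of channel b appears as a negative relative phase.
estimate = phase_difference(a, b, K)
```

Because the test tone completes a whole number of cycles in the window, there is no spectral leakage and the estimate matches the injected lag closely. With noise or non-coherent sampling the accuracy would depend on SNR and data length, which is the trade-off the abstract says the authors analyze.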

7.

Hidden double data transfer scheme for MDL design [merged DRAM logic]

Se-Jeong Park Hoi-Jun Yoo 《Electronics letters》2001,37(11):676-677

A high-speed DRAM data transfer scheme between DRAM and logic parts in merged DRAM logic (MDL) designs is proposed with logically divided DRAM row address mapping. The proposed scheme results in a 20% faster write access and 40% faster read access. It can be used as a general design framework to maximise DRAM access speed in various MDL designs. A test chip has been fabricated in 0.16 μm DRAM technology, and the scheme has been verified in the design of a DRAM L2 cache memory.

8.

Architectures for Dynamic Data Scaling in 2/4/8K Pipeline FFT Cores

Lenart T. Owall V. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(11):1286-1290

This paper presents architectures for supporting dynamic data scaling in pipeline fast Fourier transforms (FFTs), suitable when implementing large size FFTs in applications such as digital video broadcasting and digital holographic imaging. In a pipeline FFT, data is continuously streaming and must, hence, be scaled without stalling the dataflow. We propose a hybrid floating-point scheme with tailored exponent datapath, and a co-optimized architecture between hybrid floating point and block floating point (BFP) to reduce memory requirements for 2-D signal processing. The presented co-optimization generates a higher signal-to-quantization-noise ratio and requires less memory than, for instance, convergent BFP. A 2048-point pipeline FFT has been fabricated in a standard-CMOS process from AMI Semiconductor (Lenart and Owall, 2003), and a field-programmable gate array prototype integrating a 2-D FFT core in a larger design shows that the architecture is suitable for image reconstruction in digital holographic imaging.
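For readers unfamiliar with block floating point, the sketch below shows the conventional form of the idea: a block of integer samples shares one exponent, chosen so that the largest magnitude fits the word width. This is a generic illustration only. It is not the hybrid floating-point scheme proposed in the paper, and the function name and the 8-bit word width are assumptions.

```python
def bfp_normalize(block, word_bits):
    """Scale an integer block to fit a signed word; return (mantissas, shared_exponent)."""
    limit = (1 << (word_bits - 1)) - 1      # largest positive value of a signed word
    peak = max(abs(v) for v in block)
    shift = 0
    while (peak >> shift) > limit:          # smallest right shift that makes the peak fit
        shift += 1
    # Python's >> on negative integers is an arithmetic shift (rounds toward -infinity).
    return [v >> shift for v in block], shift


mantissas, exponent = bfp_normalize([1000, -300, 12], word_bits=8)
# Each original value is approximately mantissa * 2**exponent.
reconstructed = [m << exponent for m in mantissas]
```

Small values lose precision when a single large sample sets the shared exponent (here 12 becomes 8 after reconstruction), which is the kind of quantization noise that motivates finer-grained scaling schemes.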

9.

Research and Implementation of an Optimized Algorithm for Very-Large-Point FFTs

高立宁 马潇 刘腾飞 吴金 《电子与信息学报》2014,36(4):998-1002

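To address the ever-growing performance demands that applications place on very-large-point fast Fourier transforms (FFTs), and the constraints that the resources of existing processing platforms impose on implementing them, this paper proposes an implementation method for very-large-point FFTs. The method optimizes twiddle factor storage and accesses the 2-D matrix by row and column index, which avoids three explicit transpositions and thus saves memory. In addition, by analyzing the characteristics of the processor's hierarchical memory structure, it optimizes the rule for partitioning the matrix into rows and columns, which improves row and column access efficiency. Experimental results show that the method saves nearly half of the memory resources and effectively increases the execution speed of very-large-point FFTs.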
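The general idea of treating a long transform as a 2-D matrix and reaching its rows and columns through index arithmetic, rather than through explicit transposes, can be sketched as follows. This is the textbook decomposition of a length n1*n2 DFT, with a direct DFT standing in for the short FFTs. It does not reproduce the paper's twiddle-factor storage scheme or its memory-hierarchy-aware partitioning, and all names are illustrative.

```python
import cmath


def dft(seq):
    """Direct O(n^2) DFT, standing in for the short FFTs of a real implementation."""
    n = len(seq)
    return [sum(seq[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def fft_by_index(x, n1, n2):
    """Length n1*n2 DFT of x, viewed as an n1-by-n2 row-major matrix, with no transposes."""
    n = n1 * n2
    assert len(x) == n
    work = [0j] * n
    for c in range(n2):
        # Column c is gathered by stride-n2 indexing instead of transposing the matrix.
        col = dft([x[r * n2 + c] for r in range(n1)])
        for k1 in range(n1):
            # Twiddle factor applied while storing the column result.
            work[k1 * n2 + c] = col[k1] * cmath.exp(-2j * cmath.pi * c * k1 / n)
    out = [0j] * n
    for k1 in range(n1):
        row = dft(work[k1 * n2:(k1 + 1) * n2])
        for k2 in range(n2):
            # Results are scattered straight to their final positions.
            out[k1 + n1 * k2] = row[k2]
    return out


signal = [complex(i % 5, -(i % 3)) for i in range(24)]
max_error = max(abs(u - v) for u, v in zip(fft_by_index(signal, 4, 6), dft(signal)))
```

The strided reads and scattered writes take the place of the transposes that a row-only implementation would need before the column pass, after the twiddle step, and at the output.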

10.

Address generators for linear systolic array

M.K. Stojčev I.Ž. Milovanović E.I. Milovanović T.R. Nikolić 《Microelectronics Reliability》2010,50(2):292-303

Systolic arrays (SAs) are very efficient architectures for multimedia processing, database management, and scientific computing applications that are characterized by a high number of data accesses. However, in these data transfer and storage intensive applications, memory access is often the limiting factor to the computation speed. Since the memory subsystem dominates the cost (area), performance and power consumption of the SA, we have to pay special attention to how the memory subsystem can benefit from customization. In this paper we consider the memory organization of a linear systolic array with bi-directional links (called BLSA) suitable for implementation of a broad class of algorithms. We assume that memory is organized into distributed smaller physical memory modules. In order to provide high bandwidth in data access we have designed special hardware, called an address generator unit (AGU). The function of the AGU is threefold. First, during the initialization, it transforms host address space into BLSA address space. Second, it provides efficient memory data access during BLSA operation. Third, it performs fast data transfer between BLSA and host at the end of the computation. In this article, we examine the impact on area and performance of memory access related circuitry in eliminating computationally intensive offset address calculations performed in software by implementing the needed address transformations with the AGUs. By involving hardware AGUs we achieved a speedup of approximately two, compared to the software implementation of address calculation, with a hardware overhead of only 7.6% in the worst case.

11.

Memory Access Optimized VLSI for 5000-Word Continuous Speech Recognition

Kisun You Young-kyu Choi Jungwook Choi Wonyong Sung 《Journal of Signal Processing Systems》2011,63(1):95-105

We have developed a memory access reduced VLSI chip for 5,000 word speaker-independent continuous speech recognition. This chip employs a context-dependent HMM (hidden Markov model) based speech recognition algorithm, and contains parallel and pipelined hardware units for emission probability computation and Viterbi beam search. To maximize the performance, we adopted several memory access reduction techniques such as sub-vector clustering and multi-block processing for the emission probability computation. We also employed a custom DRAM controller for efficient access of consecutive data. Moreover, we analyzed the access pattern of data to minimize the internal SRAM size while maintaining high performance. The experimental results show that the implemented system performs speech recognition 2.4 and 1.8 times faster than real-time utilizing 32-bit DDR SDRAM and SDR SDRAM, respectively.

12.

Pseudo-Parallel Datapath Structure for Power Optimal Implementation of 128-pt FFT/IFFT for WPAN

J. Mathew K. Maharatna B. R. Jose H. Rahaman D. K. Pradhan 《Circuits, Systems, and Signal Processing》2011,30(4):871-882

An optimal implementation of 128-Pt FFT/IFFT for low power IEEE 802.15.3a WPAN using a pseudo-parallel datapath structure is presented, where the 128-Pt FFT is decomposed into 8-Pt and 16-Pt FFTs, and the 16-Pt FFT is decomposed once again into 4×4 and 2×8. We analyze the 128-Pt FFT/IFFT architecture for various pseudo-parallel 8-Pt and 16-Pt FFTs and an optimum datapath architecture is explored. It is suggested that there exists an optimum degree of parallelism for the given algorithm. The analysis demonstrated that with a modest increase in area one can achieve a significant reduction in power. The proposed architectures complete one parallel-to-parallel (i.e., when all input data are available in parallel and all output data are generated in parallel) 128-point FFT computation in less than 312.5 ns and thereby meet the standard specification. The relative merits and demerits of these architectures have been analyzed from the algorithm as well as the implementation point of view. Detailed power analysis of each of the architectures with a different number of datapaths at block level is described. We found that from a power perspective the architecture with eight datapaths is optimum. The core power consumption in the optimum case is 60.6 mW, which is less than half that of the latest reported 128-point FFT design in 0.18 µm technology. Furthermore, a Single Event Upset (SEU) tolerant scheme for registers is also explored. The SEU tolerant scheme will not affect the performance; however, there is an increase in power consumption of about 42 percent. Apart from the low power consumption, the advantages of the proposed architectures include reduced hardware complexity, regular data flow and simple counter based control.

13.

Low‐Power‐Adaptive MC‐CDMA Receiver Architecture

Mohd. Hasan Tughrul Arslan John S. Thompson 《ETRI Journal》2007,29(1):79-88

This paper proposes a novel concept of adjusting the hardware size in a multi‐carrier code division multiple access (MC‐CDMA) receiver in real time as per the channel parameters such as delay spread, signal‐to‐noise ratio, transmission rate, and Doppler frequency. The fast Fourier transform (FFT) or inverse FFT (IFFT) size in orthogonal frequency division multiplexing (OFDM)/MC‐CDMA transceivers varies from 1024 points to 16 points. Two low‐power reconfigurable radix‐4 256‐point FFT processor architectures are proposed that can also be dynamically configured as 64‐point and 16‐point as per the channel parameters to prove the concept. By tailoring the clock of the higher FFT stages for longer FFTs and switching to shorter FFTs from longer FFTs, significant power saving is achieved. In addition, two 256 sub‐carrier MC‐CDMA receiver architectures are proposed which can also be configured for 64 sub‐carriers in real time to prove the feasibility of the concept over the whole receiver.

14.

A Memory Management Mechanism for Hybrid Memory Architectures Based on Emerging Non-Volatile Memory

李琪 钟将 李雪 李青 《电子学报》2019,47(3):664-670

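With the rapid development of the Internet and cloud computing, existing dynamic random access memory (DRAM) can no longer meet the performance and energy requirements of some real-time systems. The emergence of new non-volatile memory (NVM) brings new opportunities for the evolution of computer storage architectures. For hybrid memory systems that combine NVM and DRAM, this paper proposes an efficient hybrid memory page management mechanism. Exploiting the different write characteristics of the two memory media, the mechanism keeps data pages with different access characteristics in the appropriate memory space, which reduces the number of migration operations and thereby improves system performance. The mechanism also uses a two-way linked list to distribute write operations more evenly across the NVM medium and so extend its lifetime. Finally, the proposed mechanism is evaluated by simulation experiments in the Linux kernel and compared with existing memory management mechanisms; the experimental results demonstrate the effectiveness of the proposed method.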

15.

Generation of Heterogeneous Distributed Architectures for Memory-Intensive Applications Through High-Level Synthesis

Chao Huang Ravi S. Raghunathan A. Jha N.K. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(11):1191-1204

Memory-intensive applications present unique challenges to an application-specific integrated circuit (ASIC) designer in terms of the choice of memory organization, memory size requirements, bandwidth and access latencies, etc. The high potential of single-chip distributed logic-memory architectures in addressing many of these issues has been recognized in general-purpose computing, and more recently, in ASIC design. The high-level synthesis (HLS) techniques presented in this paper are motivated by the fact that many memory-intensive applications exhibit irregular array data access patterns. Synthesis should, therefore, be capable of determining a partitioned architecture, wherein array data and computations may have to be heterogeneously distributed for achieving the best performance speed-up. We use a combination of clustering and min-cut style partitioning techniques to yield distributed architectures, based on simulation profiling while considering various factors including data access locality, balanced workloads, inter-partition communication, etc. Our experiments with several benchmark applications show that the proposed techniques yielded two-way partitioned architectures that can achieve up to 2.1× (average of 1.9×) performance speed-up over conventional HLS solutions, while achieving up to 1.5× (average of 1.4×) performance speed-up over the best homogeneous partitioning solution feasible. At the same time, the reduction in the energy-delay product over conventional single-memory designs is up to 2.7× (average of 2.0×). A larger amount of partitioning makes further system performance improvement achievable at the cost of chip area.

16.

Architecture design, performance analysis and VLSI implementation of a reconfigurable shared buffer for high‐speed switch/router

Ling Wu Cheng Li 《International Journal of Communication Systems》2009,22(2):159-186

Modern switches and routers require massive storage space to buffer packets. This becomes more significant as link speed increases and switch size grows. From the memory technology perspective, while DRAM is a good choice to meet the capacity requirement, its access time causes problems for high‐speed applications. On the other hand, though SRAM is faster, it is more costly and does not have high storage density. The SRAM/DRAM hybrid architecture provides a good solution to meet both capacity and speed requirements. From the switch design and network traffic perspective, to minimize packet loss, the buffering space allocated for each switch port is normally based on the worst‐case scenario, which is usually huge. However, under normal traffic load conditions, the buffer utilization for such a configuration is very low. Therefore, we propose a reconfigurable buffer‐sharing scheme that can dynamically adjust the buffering space for each port according to the traffic patterns and buffer saturation status. The target is to achieve high performance and improve buffer utilization, while not posing much constraint on the buffer speed. In this paper, we study the performance of the proposed buffer‐sharing scheme by both a numerical model and extensive simulations under uniform and non‐uniform traffic conditions. We also present the architecture design and VLSI implementation of the proposed reconfigurable shared buffer using 0.18 µm CMOS technology. Our results show that the proposed architecture can always achieve high performance and provide much flexibility for high‐speed packet switches to adapt to various traffic patterns. Furthermore, it can be easily integrated into the functionality of port controllers of modern switches and routers. Copyright © 2008 John Wiley & Sons, Ltd.

17.

Single Chip Dual-Issue RISC Processor for Real-Time MPEG-2 Software Decoding

Edgar Holmann Toyohiko Yoshida Akira Yamada Shin'ichi Uramoto 《The Journal of VLSI Signal Processing》1998,18(2):155-165

A single chip system for real-time MPEG-2 decoding can be created by integrating a general purpose dual-issue RISC processor with a small amount of dedicated hardware for the variable length decoding (VLD) and block loading processes, a 32 KB instruction RAM, and a 32 KB data RAM. The VLD hardware performs Huffman decoding on the input data. The block loader performs the half-sample prediction for motion compensation and acts as a direct memory access (DMA) controller for the RISC processor by transferring data between an external 2 MB DRAM and the internal 32 KB data RAM. The dual-issue RISC processor, running at 250 MHz, is enhanced with a set of key sub-word and multimedia instructions for a sustained peak performance of 1000 MOPS. With this setup for MPEG-2 decoding applications, bi-directionally predicted non-intra video blocks are decoded in less than 800 cycles, leading to a single-chip, real-time MPEG-2 decoding system.

18.

An Efficient SIMD Parallel Memory Structure for the Radix-2 FFT Algorithm

陈海燕 杨超 刘胜 刘仲 《电子学报》2016,44(2):241-246

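As DSPs (Digital Signal Processors) with a SIMD (Single Instruction Multiple Data stream) structure integrate more and more processing units on chip, the flexibility and bandwidth efficiency of parallel memory access have a growing impact on actual computing performance. This paper analyzes in detail the memory access problems that the parallel radix-2 FFT (Fast Fourier Transform) algorithm faces on general SIMD DSPs. It uses simple XOR logic on part of the address to perform SIMD parallel memory address translation, achieving conflict-free SIMD parallel memory access for FFT computation, and proposes several vector memory access instructions with special shuffle modes that completely eliminate the extra shuffle instructions otherwise needed for radix-2 FFT computation on a SIMD structure. Finally, the approach is applied to the optimized design of the vector memory (VM) in YHFT-Matrix2, a 16-lane SIMD digital signal processor. Test results show that the VM optimized with this SIMD parallel memory structure achieves fully pipelined, conflict-free parallel memory access for FFT computation and 100% parallel memory bandwidth utilization at the cost of 18% additional hardware; compared with the design before optimization, FFTs of different sizes obtain speedups of 1.32 to 2.66.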
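To make the XOR-based address translation more concrete, the sketch below uses a generic XOR-folding bank function and checks that lane-aligned, power-of-two strided accesses (the pattern a radix-2 FFT produces at each stage) land in distinct banks. This is a well-known style of mapping chosen for illustration. The paper's actual address logic and shuffle-mode instructions are not reproduced here, and the bank count, lane count, and function names are assumptions.

```python
def bank_of(addr, bank_bits):
    """Fold an address into bank_bits bits by XOR-ing its consecutive bit fields."""
    mask = (1 << bank_bits) - 1
    bank = 0
    while addr:
        bank ^= addr & mask
        addr >>= bank_bits
    return bank


def lane_addresses(base, stride, lanes):
    """Addresses touched by one SIMD access: one element per lane at a fixed stride."""
    return [base + lane * stride for lane in range(lanes)]


def conflict_free(addresses, bank_bits):
    """True if every address maps to a different bank."""
    banks = {bank_of(a, bank_bits) for a in addresses}
    return len(banks) == len(addresses)


# Hypothetical configuration: 256 data points, 4 banks, 4 SIMD lanes.
ADDR_BITS, BANK_BITS = 8, 2
LANES = 1 << BANK_BITS

all_strides_ok = True
for stage in range(ADDR_BITS - BANK_BITS + 1):
    stride = 1 << stage                  # operand spacing at this FFT stage
    for base in range(1 << ADDR_BITS):
        if (base >> stage) & (LANES - 1):
            continue                     # keep only bases aligned to the lane group
        group = lane_addresses(base, stride, LANES)
        all_strides_ok = all_strides_ok and conflict_free(group, BANK_BITS)

# Plain low-order interleaving (bank = addr mod 4) for comparison: stride 4 collides.
naive_banks = {a % LANES for a in lane_addresses(0, 4, LANES)}
```

With plain low-order interleaving, a stride equal to the bank count sends every lane to the same bank, whereas the XOR fold spreads each aligned group across all banks for every power-of-two stride in this configuration.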

19.

Design of a High-Performance FFT Processor Based on Conflict-Free Address Generation

王江 黑勇 郑晓燕 仇玉林 《微电子学与计算机》2007,24(3):15-19

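This paper proposes an FFT processor design method based on an interleaved memory architecture, together with a conflict-free address generation algorithm for the radix-8 FFT; data are processed frame by frame. Each memory is divided into 8 independent banks. By decoding a circular shift register, the butterfly unit reads and writes its operands in parallel without conflicts, and complex multiplications are performed in parallel on the 8 channels of input data. Each stage of computation is fully pipelined, which reduces the clock-cycle overhead, and the inequality that a partially pipelined design must satisfy is also derived. The input and output memories operate in ping-pong fashion, alternating frame by frame, so the FFT accepts input and produces output continuously, and the sampling frequency equals the system operating frequency, giving good real-time behavior. Computational accuracy is ensured through block floating point. The design method can be extended to radix-16 FFT processors.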

20.

Exploiting three-dimensional (3D) memory stacking to improve image data access efficiency for motion estimation accelerators

Yiran Li Yang Liu Tong Zhang 《Signal Processing: Image Communication》2010,25(5):335-344

Enabled by the emerging three-dimensional (3D) integration technologies, 3D integrated computing platforms that stack high-density DRAM die(s) with a logic circuit die appear to be attractive for memory-hungry applications such as multimedia signal processing. This paper considers the design of motion estimation accelerator under a 3D logic-DRAM integrated heterogeneous multi-core system framework. In this work, we develop one specific DRAM organization and image frame storage strategy geared to motion estimation. This design strategy can seamlessly support various motion estimation algorithms and variable block size with high energy efficiency. With a DRAM performance modeling/estimation tool and ASIC design at 65 nm, we demonstrate the energy efficiency of such 3D integrated motion estimation accelerators with a case study on HDTV multi-frame motion estimation.