Similar literature
1.
An integrated memory array processor (IMAP) ULSI with 64 processing elements and a 2-Mb SRAM has been developed for image processing. The chip attains a peak performance of 3.84 GIPS through SIMD parallel processing and an on-chip processor-memory bandwidth of 1.28 GByte/s. The IMAP supports parallel indirect addressing, which broadens the range of parallel algorithms it can run. The large power consumption that normally accompanies wide memory bandwidth is avoided by reducing the number of active sense amplifiers and adopting dynamic power control. Fabricated in a 0.55-μm BiCMOS double-layer-metal process, the IMAP contains 11 million transistors in a 15.1 × 15.6 mm² die area.

2.
In this brief, a high-throughput and low-complexity fast Fourier transform (FFT) processor for wideband orthogonal frequency division multiplexing communication systems is presented. A new indexed-scaling method is proposed to reduce both the critical-path delay and the hardware cost by employing a shorter wordlength. Together with the mixed-radix multipath delay feedback structure, the proposed FFT processor can achieve very high throughput with low hardware cost. Analysis shows that the proposed indexed-scaling method reduces memory utilization by at least 11% compared to other state-of-the-art scaling algorithms. A test chip of a 1.2 Gsample/s 2048-point FFT processor has also been designed in a UMC 90-nm 1P9M process with a core area of 0.97 mm². The signal-to-quantization-noise ratio (SQNR) of this test chip is over 32.7 dB, sufficient to support 16-QAM modulation, and the power consumption is about 117 mW at 300 MHz. Compared to fixed-point FFT processors, about 26% area and 28% power can be saved under the same throughput and SQNR specifications.

3.
In a typical embedded CPU, large on-chip storage is critical to meeting high performance requirements. However, the rapidly increasing size of on-chip storage based on traditional SRAM cells makes the area cost and energy consumption unsustainable for future embedded applications. Replacing SRAM with DRAM on the CPU's chip is generally considered not worthwhile because DRAM is not compatible with common CMOS logic and requires processing steps beyond those required for CMOS. However, a special DRAM technology, Gain-Cell embedded DRAM (GC-eDRAM) [1], [2], [3], is logic compatible and retains some of the good properties of DRAM (small area and low power). In this paper we evaluate the performance of a novel hybrid cache memory in which the data array, generally populated with SRAM cells, is replaced with GC-eDRAM cells while the tag array continues to use SRAM cells. Our evaluation demonstrates that, compared to conventional SRAM-based designs, this architecture exhibits comparable performance with less energy consumption and smaller silicon area, enabling sustainable on-chip storage scaling for future embedded CPUs.

4.
The performance of the processor core depends on the configuration parameters and utilization of on-chip memory in multimedia applications such as image, video, and audio processing. The design of the on-chip memory architecture is critical for power- and area-efficient design without compromising quality in data-intensive computing applications. This paper proposes a high-speed, area- and energy-efficient Static Segment On-Chip (SSOC) memory for error-tolerant applications. In this static segment method, an n-bit data array is reduced to an m-bit data array that keeps the significant part of the input data, achieving balanced design metrics at the cost of accuracy. The proposed m-bit static segmentation algorithm is implemented and verified in a Single Port Static Random Access Memory (SP SRAM) architecture for approximate computing applications. In the overall simulation results, the proposed 4-bit SSOC SP SRAM design provides 49.02% area savings, 50.62% power reduction, and 16.92% speed improvement at the cost of a 0.64% loss in Peak Signal to Noise Ratio (PSNR), and it exhibits the same visual quality as the existing 8-bit conventional on-chip SP SRAM design in image processing applications.
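
The static-segment idea in the abstract above can be illustrated with a minimal sketch. It assumes, as one plausible reading, that the stored m-bit segment is simply the m most significant bits of each n-bit sample; the abstract does not say how the segment is chosen, so the function names and the truncation rule here are hypothetical.

```python
import math

def keep_top_bits(value, n_bits=8, m_bits=4):
    """Keep the m most significant bits of an n-bit value and zero the rest.

    Assumption: the stored segment is the top m bits; the paper's actual
    segment selection may differ.
    """
    shift = n_bits - m_bits
    return (value >> shift) << shift

def psnr(original, approx, n_bits=8):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, approx)) / len(original)
    if mse == 0:
        return math.inf
    peak = (1 << n_bits) - 1
    return 10 * math.log10(peak ** 2 / mse)

pixels = [0, 17, 100, 171, 200, 255]          # illustrative 8-bit samples
approx = [keep_top_bits(p) for p in pixels]   # what a 4-bit array would hold
quality_db = psnr(pixels, approx)
```

Storing only the 4-bit segment halves the width of the data array, which is where the area and power savings in such a scheme would come from; the PSNR function quantifies the accuracy given up in exchange.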

5.
An optimal implementation of a 128-point FFT/IFFT for low-power IEEE 802.15.3a WPAN using a pseudo-parallel datapath structure is presented, in which the 128-point FFT is decomposed into 8-point and 16-point FFTs, and the 16-point FFT is further decomposed into 4×4 and 2×8. We analyze the 128-point FFT/IFFT architecture for various pseudo-parallel 8-point and 16-point FFTs and explore an optimum datapath architecture. It is suggested that there exists an optimum degree of parallelism for the given algorithm. The analysis demonstrates that a modest increase in area yields a significant reduction in power. The proposed architectures complete one parallel-to-parallel 128-point FFT computation (i.e., all input data are available in parallel and all output data are generated in parallel) in less than 312.5 ns and thereby meet the standard specification. The relative merits and demerits of these architectures are analyzed from both the algorithm and the implementation points of view. A detailed block-level power analysis of each architecture with a different number of datapaths is described. From a power perspective, the architecture with eight datapaths is found to be optimum. The core power consumption in the optimum case is 60.6 mW, less than half that of the latest reported 128-point FFT design in 0.18-μm technology. Furthermore, a Single Event Upset (SEU) tolerant scheme for registers is explored. The SEU-tolerant scheme does not affect performance; however, it increases power consumption by about 42 percent. Apart from the low power consumption, the advantages of the proposed architectures include reduced hardware complexity, regular data flow, and simple counter-based control.
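
The split of a 128-point transform into 8-point and 16-point sub-transforms follows the general Cooley-Tukey factorization N = N1 × N2. The sketch below shows that factorization only, using naive DFTs for the sub-blocks; it is not the paper's datapath, and the function names are illustrative.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, used here for the small sub-transforms."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_two_stage(x, n1, n2):
    """Compute a length n1*n2 DFT as n2 DFTs of length n1, twiddle
    multiplications, then n1 DFTs of length n2 (Cooley-Tukey)."""
    n = n1 * n2
    assert len(x) == n
    # Stage 1: for each residue b, transform the decimated sequence x[n2*a + b].
    inner = [dft([x[n2 * a + b] for a in range(n1)]) for b in range(n2)]
    out = [0j] * n
    for k1 in range(n1):
        # Twiddle factors W_N^(b*k1) connect the two stages.
        col = [inner[b][k1] * cmath.exp(-2j * cmath.pi * b * k1 / n)
               for b in range(n2)]
        outer = dft(col)                  # Stage 2: length-n2 transform
        for k2 in range(n2):
            out[k1 + n1 * k2] = outer[k2]
    return out

signal = [complex(i % 7, -(i % 5)) for i in range(128)]
two_stage = fft_two_stage(signal, 8, 16)   # 128 = 8 x 16
reference = dft(signal)
max_error = max(abs(a - b) for a, b in zip(two_stage, reference))
```

As a rough consistency check on the timing figure quoted above, 128 samples in 312.5 ns corresponds to 128 / 312.5 ns = 409.6 Msample/s.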

6.
Recently, the power consumption of integrated circuits has been attracting increasing attention. Many techniques have been studied to improve the power efficiency of digital signal processing units such as fast Fourier transform (FFT) processors, which are widely employed both in traditional research fields, such as satellite communications, and in thriving consumer electronics, such as wireless communications. This paper presents solutions based on parallel architectures for high-throughput and power-efficient FFT cores. Different combinations of hybrid low-power techniques are exploited to reduce power consumption, such as multiplierless units that replace the complex multipliers in FFTs, low-power commutators based on an advanced interconnection, and parallel-pipelined architectures. A number of FFT cores are implemented and evaluated for their power/area performance. The results show that up to 38% and 55% power savings can be achieved by the proposed pipelined FFTs and parallel-pipelined FFTs respectively, compared to conventional pipelined FFT processor architectures.

7.
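陈海燕, 杨超, 刘胜, 刘仲. 《电子学报》 (Acta Electronica Sinica), 2016, 44(2): 241-246
As DSPs (Digital Signal Processors) with SIMD (Single Instruction Multiple Data stream) architectures integrate more and more processing elements on chip, the flexibility and bandwidth efficiency of parallel memory access have a growing impact on actual computing performance. This paper analyzes in detail the memory access problems that the radix-2 FFT (Fast Fourier Transform) parallel algorithm faces on general SIMD DSPs, uses simple partial-address XOR logic to translate SIMD parallel memory access addresses, and thereby achieves conflict-free SIMD parallel memory access for FFT computation. It also proposes several vector memory access instructions with special shuffle modes, which completely eliminate the extra shuffle instructions otherwise required by radix-2 FFT computation on SIMD architectures. Finally, the scheme is applied to the optimized design of the vector memory (VM) in YHFT-Matrix2, a 16-way SIMD digital signal processor. Test results show that the VM optimized with this SIMD parallel memory structure achieves fully pipelined, conflict-free parallel memory access for FFT computation and 100% parallel memory access bandwidth utilization at the cost of an 18% increase in hardware overhead; compared with the design before optimization, FFTs of different sizes obtain speedups of 1.32 to 2.66.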
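
To make the idea of XOR-based address translation concrete, here is a minimal sketch of a generic XOR bank mapping for a 16-bank memory. It is not the mapping used in YHFT-Matrix2, which the abstract does not specify; the choice of which address fields to XOR is an assumption made for illustration.

```python
LANES = 16                           # SIMD width and number of memory banks
LANE_BITS = LANES.bit_length() - 1   # log2(16) = 4

def bank_modulo(addr):
    """Plain interleaving: the bank is the low address bits."""
    return addr % LANES

def bank_xor(addr):
    """Hypothetical XOR mapping: low address field XOR next-higher field."""
    low = addr & (LANES - 1)
    high = (addr >> LANE_BITS) & (LANES - 1)
    return low ^ high

def conflict_free(bank_fn, base, stride):
    """True if the 16 lanes of one vector access hit 16 distinct banks."""
    banks = {bank_fn(base + lane * stride) for lane in range(LANES)}
    return len(banks) == LANES

# Power-of-two strides like those arising between FFT stages, aligned bases.
xor_ok = all(conflict_free(bank_xor, base, stride)
             for base in (0, 256) for stride in (1, 2, 4, 8, 16))
modulo_stride16_ok = conflict_free(bank_modulo, 0, 16)
```

With plain modulo interleaving, a stride-16 access sends every lane to the same bank, whereas the XOR mapping spreads the aligned power-of-two strides checked here across all banks. Unaligned bases are not covered by this simple mapping.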

8.
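High-throughput, flexibly reconfigurable floating-point fast Fourier transform (FFT) processors can meet the needs of applications such as real-time imaging in advanced radar and high-precision scientific computing. Compared with fixed-point FFT, floating-point arithmetic is more complex, which makes the conflict between the throughput of a floating-point FFT and its implementation area and power consumption particularly acute. To reduce computational complexity, a large FFT is first decomposed into several cascaded small radix-2^k sub-stages, and optimized mixed-radix algorithms are proposed for 128/256/512/1024/2048-point FFTs. In addition, building on a proposed new fused add-subtract and dot-product arithmetic unit that supports both a single-channel single-precision mode and a dual-channel half-precision mode, a high-throughput dual-mode floating-point variable-size FFT processor architecture is proposed for the first time, and it is designed and implemented in a 28 nm standard CMOS process. Experimental results show that the throughput and average output signal-to-quantization-noise ratio are 3.478 GSample/s and 135 dB in single-channel single-precision mode, and 6.957 GSample/s and 60 dB in dual-channel half-precision mode. The normalized throughput-to-area ratio is improved by about 12 times compared with other existing floating-point FFT implementations.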

9.
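In the system-on-chip (SoC) of a multimedia system, the real-time requirements of the chip mean that application programs and data should be kept in on-chip memory or cache as far as possible. This makes execution convenient and processing fast, but it requires a large number of memory blocks, so memory accounts for a large share of the chip's area and power consumption. To reduce on-chip memory, part of the program and data can be moved to off-chip memory and swapped into the chip in turn during execution, which inevitably increases I/O overhead. How to optimize the design is therefore a problem in hardware/software co-design. Taking audio memory optimization in an MPEG2 integrated decoder chip as an example, this paper presents several methods for SoC memory optimization, including changing the program structure through LGDFG (Large Grain Data Flow Graph) model analysis, sharing data space, changing data types, and adding on-chip SRAM while reducing on-chip cache capacity so as to reduce system memory consumption. These methods significantly reduce the system's memory consumption and lower the area and power consumption of the SoC.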

10.
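This paper proposes a pipeline architecture based on radix-2²/2³ for the design of low-cost very-large-scale integration (VLSI) fast Fourier transform (FFT) processors. While reducing the number of general complex multiplier stages, the processor uses single-path delay feedback (SDF) memory access to obtain the FFT result with the minimum number of memory words. For the datapath, we adopt a hybrid floating-point data scaling scheme which, while maintaining the signal-to-noise ratio, reduces the data length...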

11.
While an ECL-CMOS SRAM can achieve both ultra-high speed and high density, it consumes a lot of power and cannot be applied to low-supply-voltage applications. This paper describes an NTL (Non Threshold Logic)-CMOS SRAM macro that consists of a PMOS access transistor CMOS memory cell, an NTL decoder with an on-chip voltage generator, and an automatic bit line signal voltage swing controller. A 32 Kb SRAM macro, which achieves a 1 ns access time at a 2.5 V power supply and consumes a mere 1 W, has been developed in a 0.4 μm BiCMOS technology.

12.
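A parallel FFT structure built from FFT chips
The fast Fourier transform (FFT) is widely used in fields such as computed tomography, speech recognition, and image processing. As computer applications develop, there is a growing need to transform large data sets. Parallel FFT is one way to perform fast data transforms. This paper proposes a method of building a parallel FFT from small FFT chips, which can be used to transform large data sets. The area and execution time of the parallel structure are discussed, and a fault-tolerant parallel FFT network is also proposed.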

13.
In this paper, a VLSI architecture based on the radix-2² integer fast Fourier transform (IntFFT) is proposed to demonstrate its efficiency. The IntFFT algorithm guarantees the perfect reconstruction property of transformed samples. For a 64-point radix-2² FFT, the proposed architecture uses 2 sets of complex multipliers (six real multipliers) and has 6 pipeline stages. By exploiting the symmetric property of the lossless transform, the memory usage is reduced by 27.4%. The whole design is synthesized and simulated with a 0.18-μm TSMC 1P6M standard cell library, and its reported equivalent gate count is 17,963 gates. The whole chip size is 975 μm × 977 μm with a core size of 500 μm × 500 μm. The core power consumption is 83.56 mW. A Simulink-based orthogonal frequency demodulation multiplexing platform is utilized to compare the conventional fixed-point FFT and the proposed IntFFT from the viewpoint of system-level behavior in terms of signal-to-quantization-noise ratio (SQNR) and bit error rate (BER). The quantization loss analysis of these two types of FFT is also derived and compared. Based on the simulation results, the proposed lossless IntFFT architecture can achieve comparable SQNR and BER performance with reduced memory usage.
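
The perfect reconstruction property of an integer FFT is usually obtained by replacing each twiddle-factor rotation with rounded lifting steps. The following is a minimal sketch of that standard lifting construction for a single rotation, assuming it is representative; the abstract does not describe the paper's exact butterfly structure, and the function names are illustrative.

```python
import math

def lift_rotate(x, y, theta):
    """Integer-to-integer approximation of rotating (x, y) by theta.

    Uses the factorization of a rotation into three shears, rounding after
    each one. Requires sin(theta) != 0.
    """
    s = math.sin(theta)
    a = (math.cos(theta) - 1.0) / s
    x = x + round(a * y)
    y = y + round(s * x)
    x = x + round(a * y)
    return x, y

def lift_rotate_inverse(x, y, theta):
    """Exact inverse: undo the three shears in reverse order."""
    s = math.sin(theta)
    a = (math.cos(theta) - 1.0) / s
    x = x - round(a * y)
    y = y - round(s * x)
    x = x - round(a * y)
    return x, y

theta = math.pi / 8
samples = [(x, y) for x in range(-20, 21, 5) for y in range(-20, 21, 5)]
round_trip_exact = all(
    lift_rotate_inverse(*lift_rotate(x, y, theta), theta) == (x, y)
    for x, y in samples
)
```

Each shear adds a rounded function of one coordinate to the other, so subtracting the same rounded value undoes it exactly. This is why the transform is lossless on integers even though each step only approximates the real-valued rotation.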

14.
Powering billions of devices is one of the most challenging barriers to achieving the future vision of IoT. Most sensor nodes in IoT-based systems depend on a battery as their power source and therefore fail to meet the design goals of lifetime power supply, cost, and reliable sensing and transmission. Energy harvesting has the potential to supplant batteries and thus avoid frequent battery replacement. However, energy-autonomous systems suffer from sudden power variations due to changes in external natural sources, which result in loss of data. The memory system is a main component that can improve or degrade performance dramatically. The latest versions of many computing systems use a chip multiprocessor (CMP) with on-chip cache memory organized as an array of SRAM cells. In this paper, we outline the challenges of efficient power supply that cause power outages in energy-autonomous/self-powered systems. We also discuss various techniques, at both circuit level and system level, that ensure reliable operation of an IoT device during power failure. We review the emerging non-volatile memories and explore the possibility of integrating STT-MTJ as a prospective candidate for a low-power solution to energy-harvesting-based IoT applications. An ultra-low-power hybrid NV-SRAM cell is designed by integrating MTJs into the conventional 6T SRAM cell. The proposed LP8T2MTJ NV-SRAM cell is then analyzed using multiple key performance parameters including read/write energies, backup/restore energies, access times, and noise margins. Compared to its conventional 6T SRAM counterpart, the proposed LP8T2MTJ cell shows similar read and write performance. A comparison with existing MTJ-based NV-SRAM cells shows a 51–78% reduction in backup energy and a 42–70% reduction in restore energy.

15.
Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of inefficient strided memory access patterns. These access patterns need to be reshaped to achieve high-performance implementations. In this paper we formally restructure 1D, 2D, and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive algorithms with efficient memory access patterns using custom block data layouts. These algorithms need to be carefully mapped to the targeted platform's architecture, particularly the memory subsystem, to fully exploit its performance and energy efficiency potential. Using the Kronecker product formalism, we integrate our optimizations into the Spiral framework and evaluate a family of DRAM-optimized FFT algorithms and their hardware implementation design space via automated techniques. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem parameters (single/double precision 1D, 2D, and 3D FFTs) and hardware platform parameters (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.). We show that Spiral-generated Pareto-optimal designs can achieve close to the theoretical peak performance of the targeted platform, offering 6x and 6.5x improvements in system performance and power efficiency respectively over conventional row-column FFT algorithms.
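
For reference, the conventional row-column baseline mentioned above can be sketched as follows. This is only the textbook 2D algorithm with naive 1D DFTs, written to show where the strided access arises; it does not reproduce the block data layouts or the Spiral-generated designs, and the function names are illustrative.

```python
import cmath

def dft(x):
    """Naive 1D DFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft2_row_column(a):
    """2D DFT as 1D DFTs over rows, then 1D DFTs over columns."""
    m, n = len(a), len(a[0])
    rows = [dft(r) for r in a]                 # unit-stride pass over each row
    # Column pass: in a row-major array, gathering a column is a stride-n
    # access, which is the pattern that maps poorly onto DRAM rows.
    cols = [dft([rows[i][v] for i in range(m)]) for v in range(n)]
    return [[cols[v][u] for v in range(n)] for u in range(m)]

def dft2_direct(a):
    """Direct evaluation of the 2D DFT definition, for checking."""
    m, n = len(a), len(a[0])
    return [[sum(a[i][j] * cmath.exp(-2j * cmath.pi * (i * u / m + j * v / n))
                 for i in range(m) for j in range(n))
             for v in range(n)] for u in range(m)]

image = [[(3 * i + j) % 5 for j in range(4)] for i in range(3)]
fast = dft2_row_column(image)
slow = dft2_direct(image)
max_error = max(abs(fast[u][v] - slow[u][v]) for u in range(3) for v in range(4))
```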

16.
A CMOS static RAM (SRAM) circuit capable of detecting and storing optically transmitted data is described. Bits of data are transferred to the memory circuit via an array of parallel light beams. A 16-b optoelectronic SRAM was fabricated in a standard bulk CMOS process and tested using argon and helium-neon lasers. Data contained in an array of 16 light beams with an average power of 3.35 μW/pixel were successfully transferred to the SRAM in parallel. The storage of the optical information was verified by electronically addressing each cell. The optical data transfer technology is extended to other systems in which high speed and parallelism are essential.

17.
Lowering the supply voltage is an effective way to significantly reduce the power consumption of a static random access memory (SRAM). However, the minimum supply voltage (Vminf) required to support a given operating frequency in an SRAM macro is often elusive, varying from one chip to another due to process variations. Moreover, temperature can vary while an SRAM macro is in operation, which exacerbates the problem since temperature variation can affect Vminf. In this paper, we propose an on-chip self-VDD-tuning scheme that automatically adjusts each manufactured SRAM macro to a minimal voltage near its Vminf. Our scheme can provide a user-specified speed margin (e.g., 10% of the target frequency), thereby creating a guard band that assures robust operation over a wide range of temperatures. Simulation results show that, with the proposed speed margining technique, a 64 Kb SRAM macro can tolerate temperatures up to 125 °C. Measurement results from a test chip in a 0.18-μm CMOS process also demonstrate that 40% power savings can be achieved for an 8 Kb SRAM macro operating at 150 MHz by means of this resilient self-VDD-tuning.

18.
A VLSI array processor for 16-point FFT
An implementation of a two-dimensional array processor for the fast Fourier transform (FFT) in a 2-μm CMOS technology is presented. The array processor, which is dedicated to the 16-point FFT, implements a 4×4 mesh array of 16 processing elements (PEs) working in parallel. Design considerations at both the chip level and the PE level are examined. A layout design methodology based on bit-slice units (BSUs) results in a very simple design, easy debugging, and a regular interconnection scheme through abutment. The chip contains about 48,000 transistors on an area of 53.52 mm², excluding the 83-pad area, and operates on a 15-MHz clock. The array processor performs 24.6 million complex multiplications per second and computes a 16-point FFT in 3 μs.

19.
Memory-processor integration offers new opportunities for reducing the energy of a system. In the case of embedded systems, where memory access patterns can typically be profiled at design time, one solution consists of mapping the most frequently accessed addresses onto the on-chip SRAM to guarantee power and performance efficiency. In this work, we propose an algorithm for the automatic partitioning of on-chip SRAMs into multiple banks. Starting from the dynamic execution profile of an embedded application running on a given processor core, we synthesize a multi-banked SRAM architecture optimally fitted to the execution profile. The algorithm computes an optimal solution to the problem under realistic assumptions on the power cost metrics, and with constraints on the number of memory banks. The partitioning algorithm is integrated with the physical design phase into a complete flow that allows the back annotation of layout information to drive the partitioning process. Results collected on a set of embedded applications for the ARM processor show average energy savings of around 34%.
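
One way to picture profile-driven bank partitioning is as a dynamic program over contiguous address ranges. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it assumes banks cover contiguous ranges and uses a hypothetical cost model in which the energy of one access grows linearly with the size of the bank that serves it.

```python
def partition_banks(counts, max_banks, cost=lambda size: size):
    """Split addresses 0..n-1 into at most max_banks contiguous banks,
    minimizing sum(accesses_in_bank * cost(bank_size)).

    counts[i] is the profiled access count of address i. The linear default
    cost is a placeholder assumption, not a measured SRAM energy model.
    """
    n = len(counts)
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)
    inf = float("inf")
    # best[k][i]: minimum energy for the first i addresses using exactly k banks
    best = [[inf] * (n + 1) for _ in range(max_banks + 1)]
    cut = [[0] * (n + 1) for _ in range(max_banks + 1)]
    best[0][0] = 0
    for k in range(1, max_banks + 1):
        for i in range(1, n + 1):
            for j in range(k - 1, i):
                if best[k - 1][j] == inf:
                    continue
                energy = best[k - 1][j] + (prefix[i] - prefix[j]) * cost(i - j)
                if energy < best[k][i]:
                    best[k][i] = energy
                    cut[k][i] = j
    k = min(range(1, max_banks + 1), key=lambda kk: best[kk][n])
    total, bounds, i = best[k][n], [], n
    while k > 0:                       # walk the cut table back to address 0
        j = cut[k][i]
        bounds.append((j, i))
        i, k = j, k - 1
    return total, bounds[::-1]

profile = [100, 90, 1, 1, 1, 1, 1, 1]  # two hot addresses, six cold ones
energy, banks = partition_banks(profile, max_banks=2)
single_energy, _ = partition_banks(profile, max_banks=1)
```

On this toy profile the hot addresses end up in a small bank of their own, which is the intuition behind the energy savings: frequent accesses are served by a small, cheap bank while rarely used data sits in a larger one.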

20.
A chip set for pipelined and parallel pipelined FFT applications is presented. The set consists of two cascadeable chips with built-in self-test and a chip-interconnectivity test feature. The two ASICs are a 15k-gate Complex-Butterfly and a 9k-gate FFT Switch. The Complex-Butterfly uses redundant binary arithmetic (RBA), a modified Booth algorithm, and a Wallace tree architecture to achieve a throughput of better than 25 Msamples/sec. The cascadeable FFT Switch is designed to support the implementation of radix-2, 2^N-point pipeline FFTs. Both devices have been fabricated in 1.5-μm CMOS gate array technology.
