Similar literature
 Found 20 similar documents; search took 31 ms
1.
We have fabricated a high-yield integrated memory array processor (IMAP) LSI that features high memory bandwidth (1.28-GB/s) and low power consumption (4-W max.), and that contains a 2-Mb SRAM with 1.28-I/O's and 64 processor elements (PE's) on one chip. A high-bandwidth, low-power memory circuit design is the key technology for realizing the IMAP-LSI. We adopted the following new designs for the memory circuit: (1) a memory access time designed to be twice as fast as the PE execution time; (2) a dynamic power control mode, which reduces memory power consumption to 30% of maximum power without loss of access speed; (3) simplified synchronization with the PE's; and (4) 4-way block redundancy. These design techniques are suitable for future system-integrated ULSI's.

2.
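FPGA implementations of digital predistorters (DPD) for 5G wideband power amplifiers often run into insufficient digital processing bandwidth and limited resources. To address this, the paper proposes a DPD bandwidth-extension method based on a dual-path parallel data stream, together with an automated model optimization and verification method based on the Zynq UltraScale+ MPSoC, which allow linearization schemes for 5G wideband power amplifiers to be optimized quickly. A digital predistorter built on this parallel processing structure overcomes the limit that the maximum clock frequency of digital circuits places on the FPGA linearization bandwidth: the predistortion circuit processes more data in each clock cycle, which both increases the digital processing bandwidth and lowers DPD power consumption. The added bandwidth, however, comes at the cost of more hardware resources, so the paper also proposes an online automatic optimization method for the nonlinear predistortion model that simplifies the model and reduces the DPD hardware resource overhead. Finally, an I-MSA self-optimizing digital predistortion circuit with two-path parallel data processing is implemented on a Zynq UltraScale+ FPGA experimental platform. Linearization experiments with a 100 MHz 5G New Radio (NR) signal on a 2.6 GHz power amplifier yield satisfactory predistortion performance and verify the effectiveness of the proposed method.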
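The idea of handling two samples per clock cycle can be illustrated with a small software model. The abstract does not describe the I-MSA model, so the sketch below substitutes a generic memory-polynomial predistorter as an assumption; the function names `basis_sum`, `dpd_serial`, and `dpd_two_path` and the coefficient layout are hypothetical. It only shows that a two-path datapath sharing one sample history produces the same output as the one-sample-per-clock version.

```python
def basis_sum(window, coeffs):
    """Memory-polynomial output for one sample (assumed model, not I-MSA).

    window[m] holds x[n-m]; coeffs[k][m] weights x[n-m] * |x[n-m]|**k.
    """
    return sum(
        coeffs[k][m] * window[m] * abs(window[m]) ** k
        for k in range(len(coeffs))
        for m in range(len(window))
    )


def dpd_serial(x, coeffs):
    """Reference datapath: one sample per clock."""
    depth = len(coeffs[0])            # memory depth M
    history = [0j] * (depth - 1)      # x[n-1], x[n-2], ... (newest first)
    out = []
    for sample in x:
        window = [sample] + history
        out.append(basis_sum(window, coeffs))
        history = window[:depth - 1]
    return out


def dpd_two_path(x, coeffs):
    """Two samples per clock: both paths read the same shared history."""
    assert len(x) % 2 == 0, "pad the input to an even length"
    depth = len(coeffs[0])
    history = [0j] * (depth - 1)
    out = []
    for clock in range(len(x) // 2):
        even, odd = x[2 * clock], x[2 * clock + 1]
        win_even = ([even] + history)[:depth]      # path 0: x[2c], x[2c-1], ...
        win_odd = ([odd, even] + history)[:depth]  # path 1: x[2c+1], x[2c], ...
        out.append(basis_sum(win_even, coeffs))
        out.append(basis_sum(win_odd, coeffs))
        history = ([odd, even] + history)[:depth - 1]
    return out
```

In this model the loop in `dpd_two_path` runs half as many iterations as the serial loop for the same input, which mirrors halving the required clock rate at the price of duplicating the per-sample arithmetic.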

3.
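Design of a Signal Processing Module for a Wideband DRFM Radar Jammer   Total citations: 2 (self-citations: 0, citations by others: 2)
Yang Chun (杨春). 《电讯技术》 (Telecommunication Engineering), 2012, 52(6): 918-921
This paper presents the block diagram and signal processing flow of the signal processing module of a wideband digital radio frequency memory (DRFM) radar jammer and describes the key techniques used to implement the module, in particular the method for high-speed parallel signal processing in an FPGA. The module provides 1 GHz of instantaneous processing bandwidth and a storage depth of 2,048 μs, can effectively jam new-generation wideband radars, and has broad application prospects.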

4.
This paper describes a low-power, single-chip video encoder intended for battery-operated portable applications. The design goals are to minimize system power and utilized bandwidth and to maximize system integration. The encoder achieves competitive compression, with convenient bit rate scalability, at a peak power dissipation of several hundred μW on a video stream of 8-bit gray scale, 30 frame/s, and 128×128 demonstration resolution. Compression is performed using wavelet filtering, zero-trees, and arithmetic coding, all integrated on a single chip (3 million transistors, 1 cm², in 0.6 μm CMOS, operating at 500 kHz), with no external memory or control. The results do not include motion compensation; however, hooks are included at the algorithmic and architectural levels to add motion compensation at the cost of power dissipation a few times higher and more internal memory. In the absence of motion compensation, temporal correlation is still exploited through simple frame differencing. The architectural centerpiece is a massively parallel, fine-granularity SIMD array of processing elements (PEs). Small image blocks (4×4 pixels on the test chip) are mapped to PEs, with each PE containing both the memory and the logic required for its block. These results are obtained by careful coordination of the design in a deep vertical manner, spanning the system, algorithm, architecture, circuit, and layout levels, and by designing simultaneously for all required algorithmic subcomponents.

5.
Two-dimensional (2-D) convolution is widely used in image and video processing. Although the operation is simple, 2-D convolution is both computationally expensive and memory-intensive. Field-programmable-gate-array (FPGA)-based parallel processing architectures have been proposed to accelerate 2-D convolution, with data buffers implemented in FPGA on-chip resources to avoid direct access to external memories. Previous works adopted full buffering and partial buffering (PB) schemes. The former consumes a large amount of FPGA resources, while the latter causes a sharp increase in external memory bus bandwidth. In this brief, we present a multiwindow PB scheme for FPGA-based 2-D convolvers. Compared with the aforementioned methods, the new buffering strategy strikes a good balance between on-chip resource utilization and external memory bus bandwidth, and is therefore suitable for low-cost FPGA implementation.
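The full-buffering baseline mentioned above can be modeled in a few lines. This is a software sketch only, not the multiwindow PB scheme, whose details the abstract does not give; the function name `conv2d_full_buffering` and the read counter are hypothetical. It keeps the last k image rows in an on-chip-style buffer so that every pixel is fetched from external memory exactly once.

```python
from collections import deque


def conv2d_full_buffering(image, kernel):
    """Valid-mode 2-D convolution that streams the image row by row.

    Returns (output, external_reads). The deque stands in for on-chip
    line buffers holding the k most recent rows.
    """
    k = len(kernel)
    width = len(image[0])
    # True convolution flips the kernel in both dimensions.
    flipped = [row[::-1] for row in kernel[::-1]]
    rows = deque(maxlen=k)
    out, external_reads = [], 0
    for row in image:
        rows.append(list(row))      # each pixel is read from "external memory" once
        external_reads += width
        if len(rows) == k:          # enough rows buffered to slide the window
            out.append([
                sum(flipped[i][j] * rows[i][c + j]
                    for i in range(k) for j in range(k))
                for c in range(width - k + 1)
            ])
    return out, external_reads
```

The trade-off the abstract describes shows up directly: the buffer costs k full rows of storage, while a scheme that buffers less would have to re-read pixels and raise `external_reads` above the pixel count.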

6.
Recently, the level of realism in PC graphics applications has been approaching that of high-end graphics workstations, necessitating a more sophisticated texture data cache memory to overcome the finite bandwidth of the AGP or PCI bus. This paper proposes a multilevel parallel texture cache memory to reduce the required data bandwidth on the AGP or PCI bus and to accelerate the operations of parallel graphics pipelines in PC graphics cards. The proposed cache memory is fabricated in a 0.16-μm DRAM-based SOC technology. It is composed of four components: an 8-MB DRAM L2 cache, 8-way parallel SRAM L1 caches, pipelined texture data filters, and a serial-to-parallel loader. For high-speed parallel L1 cache data replacement, the internal bus bandwidth has been maximized up to 75 GB/s with a newly proposed hidden double data transfer scheme. In addition, the cache memory has a reconfigurable line size for optimal caching performance in graphics applications ranging from three-dimensional (3-D) games to high-quality 3-D movies.

7.
8.
Many current applications for battery-powered devices come from the digital signal processing, telecommunication, and multimedia domains. These applications typically set high requirements for computational performance, and parallelism is often the key to meeting them. To exploit parallel processing units, the memory must be able to feed the data path with data, which calls for a memory organization that supports parallel memory accesses. In this paper, a conflict-resolving parallel data memory system for application-specific instruction-set processors is described. The memory structure is generic and reusable, so it can support various application-specific designs. The proposed memory system does not employ any predefined access format signals for memory addressing. The proposed parallel memory system is attached to an application-specific instruction-set processor core, and comparisons of area, power, and critical path are shown. The experiments show that significant power savings can be obtained by using the parallel memory system instead of multi-port memory.
Jarmo Takala
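A bank conflict in a parallel memory can be illustrated with a toy model. The abstract does not give the paper's module assignment function or its conflict-resolution logic, so the sketch below assumes plain low-order interleaving and simple serialization of conflicting requests; `access_cycles` and `strided` are hypothetical helper names.

```python
from collections import Counter


def access_cycles(addresses, num_banks):
    """Cycles needed to serve one non-empty group of parallel requests.

    Assumption: address a lives in bank a % num_banks and each bank serves
    one request per cycle, so requests to the same bank are serialized.
    """
    per_bank = Counter(a % num_banks for a in addresses)
    return max(per_bank.values())


def strided(base, stride, count):
    """Addresses touched by a strided parallel access."""
    return [base + i * stride for i in range(count)]
```

With four banks, `access_cycles(strided(0, 1, 4), 4)` is 1 because the requests spread across all banks, while `access_cycles(strided(0, 4, 4), 4)` is 4 because every request lands in bank 0.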

9.
This superscalar microprocessor is the first implementation of a 32-bit RISC architecture specification incorporating a single-instruction, multiple-data vector processing engine. Two instructions per cycle plus a branch can be dispatched to two of seven execution units in this microarchitecture designed for high execution performance, high memory bandwidth, and low power for desktop, embedded, and multiprocessing systems. The processor features an enhanced memory subsystem, 128-bit internal data buses for improved bandwidth, and 32-KB eight-way instruction/data caches. The integrated L2 tag and cache controller with a dedicated L2 bus interface supports L2 cache sizes of 512 KB, 1 MB, or 2 MB with two-way set associativity. At 450 MHz, and with a 2-MB L2 cache, this processor is estimated to have a floating-point and integer performance metric of 20 while dissipating only 7 W at 1.8 V. The 10.5 million transistor, 83-mm² die is fabricated in a 1.8-V, 0.20-μm CMOS process with six layers of copper interconnect.

10.
Implementing the memory that stores image and transform coefficients in 2-D DWT processing systems with a cost-effective external memory module such as DDR DRAM is shown to yield an effective memory bandwidth significantly lower than the memory system's peak bandwidth when the conventional direct logical-to-physical memory address mapping is adopted. The low effective memory bandwidth is caused by the frequent occurrence of memory overhead cycles, which in turn is closely related to the logical memory access patterns of 2-D DWT processes. The problem becomes even more severe for 2-D DWT processing of video. The logical memory access patterns of multi-level 2-D DWT are analyzed, and an enhanced logical-to-physical memory mapping scheme that minimizes the occurrence of memory overhead cycles is proposed. The proposed scheme is simulated, and its effective memory access bandwidth is evaluated and compared with that of the conventional direct mapping scheme.
Soon-Chieh Lim
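The effect of address mapping on DRAM overhead cycles can be shown with a simplified model. The abstract does not specify the enhanced mapping, so the sketch below uses a generic tiled mapping as a stand-in and a single-bank open-page model in which every change of DRAM row counts as one overhead event; `count_row_misses`, `direct_map`, `tiled_map`, and all sizes are hypothetical.

```python
def count_row_misses(addresses, page_words):
    """Number of DRAM row activations for an address sequence.

    Assumption: one bank, one open row; a new row costs one activation.
    """
    misses, open_row = 0, None
    for addr in addresses:
        row = addr // page_words
        if row != open_row:
            misses += 1
            open_row = row
    return misses


def direct_map(x, y, width):
    """Conventional raster mapping: consecutive pixels of a line are adjacent."""
    return y * width + x


def tiled_map(x, y, width, tile):
    """Stand-in mapping: each tile x tile block occupies contiguous addresses."""
    tiles_per_row = width // tile
    tile_index = (y // tile) * tiles_per_row + (x // tile)
    return tile_index * tile * tile + (y % tile) * tile + (x % tile)


def dwt_like_misses(mapping, size, page_words):
    """Row pass followed by column pass, as in a separable 2-D transform."""
    row_pass = [mapping(x, y) for y in range(size) for x in range(size)]
    col_pass = [mapping(x, y) for x in range(size) for y in range(size)]
    return (count_row_misses(row_pass, page_words)
            + count_row_misses(col_pass, page_words))
```

For a 64×64 image with 64-word pages and 8×8 tiles, the direct mapping incurs 4160 activations over both passes in this model and the tiled mapping 1024, because column accesses no longer change rows on every access.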

11.
Xetal-II is a single-instruction multiple-data (SIMD) processor with 320 processing elements. It delivers a peak performance of 107 GOPS on 16-bit data while dissipating 600 mW. A 10 Mbit on-chip memory is provided which can store up to four VGA frames, allowing efficient implementation of frame-iterative algorithms. A massively parallel interconnect provides an internal bandwidth of more than 1.3 Tbit/s to sustain the peak performance. The IC is realized in 90 nm CMOS and takes up 74 mm².

12.
Two-dimensional (2-D) filters for video signal processing typically operate at high uniform sampling rates and require very large delay-line (DL) memory blocks. By employing 2-D multirate signal processing techniques to reduce the sampling rate, not only can the DL memory blocks be downsized to save silicon area, but the memory access time can also be increased to save power. This is demonstrated in this paper with a 2-D switched-capacitor multirate image processor that realizes (2×2)nd-order recursive low-pass and high-pass filtering functions using half of the storage cells that would be needed in a nonmultirate system. Only one type of operational transconductance amplifier, with 1-mS transconductance and 120-MHz unity gain bandwidth, is needed for the vertical filter, its associated DL memory blocks, and the horizontal decimating filter. Fully differential circuit techniques are employed to increase immunity to charge feedthrough injection in the analog storage cells. The complete system has been implemented in a CMOS 1.0-μm double-poly technology. The core active area is only 2.5×3.0 mm², and at a 5-V supply and 18-MHz sampling it dissipates 85 mW.

13.
FFT algorithms have memory access patterns that prevent many architectures from achieving high computational utilization, particularly when parallel processing is required to achieve the desired levels of performance. Starting with a highly efficient hybrid linear algebra/FFT core, we co-design the on-chip memory hierarchy, on-chip interconnect, and FFT algorithms for a multicore FFT processor. We show that it is possible to achieve excellent parallel scaling while maintaining power and area efficiency comparable to that of the single-core solution. The result is an architecture that can effectively use up to 16 hybrid cores for transform sizes that can be contained in on-chip SRAM. When configured with 12 MiB of on-chip SRAM, our technology evaluation shows that the proposed 16-core FFT accelerator should sustain 388 GFLOPS of nominal double-precision performance, with power and area efficiencies of 30 GFLOPS/W and 2.66 GFLOPS/mm², respectively.

14.
An Algorithm-Hardware-System Approach to VLIW Multimedia Processors   Total citations: 2 (self-citations: 0, citations by others: 2)
Very Long Instruction Word (VLIW) processor architectures for multimedia applications are discussed from an algorithm, hardware, and system point of view. VLIW processors show high flexibility and processing power, as well as a good utilization of resources by compiler-generated code, but their exclusive exploitation of instruction level parallelism (ILP) decreases in efficiency as the degree of parallelism increases. This is mainly caused by characteristics of multimedia algorithms, increasing wiring delays, compiler restrictions, and a widening gap between on-chip processing speed and available bandwidth to external memory. As new multimedia applications and standards continue to evolve (MPEG-4), the demand for higher processing power will continue. Therefore, parallel processing in all its available forms will have to be exploited to achieve significant performance improvements. We show that, due to the diminishing returns from a further increase in ILP, multimedia applications will benefit more from an additional exploitation of parallelism at thread-level. We examine how simultaneous multithreading (SMT), a novel architectural approach combining VLIW techniques with parallel processing of threads, can efficiently be used to further increase performance of typical multimedia workloads.

15.
Multiview video coding (MVC) plays an important role in a 3-D video system. In addition, the resolution of HDTV is increasing to give users a more vivid viewing experience. To realize real-time processing at dozens of TOPS, a VLSI solution is necessary. However, ultra-high computational complexity, large external memory bandwidth and on-chip SRAM requirements, and complex MVC prediction structures are the three main design challenges in implementing an MVC hardware architecture. In this paper, an MVC single-chip encoder is proposed for H.264/AVC Multiview High Profile and High Profile for 3-D and quad full high definition (QFHD) TV applications, respectively. The 4096 × 2160p multiview video encoder chip is implemented on an 11.46 mm² die in 90 nm CMOS technology. An eight-stage macroblock pipelined architecture with the proposed system scheduling and a cache-based prediction core supports real-time processing from one-view 4096 × 2160p to seven-view 720p videos. The 212 Mpixels/s throughput is 3.4 to 7.7 times higher than that of previous work. A power efficiency of 407 Mpixels/W is achieved, and the proposed techniques save 94% of on-chip SRAM size and 79% of external memory bandwidth.

16.
This paper presents a memory-efficient partially parallel decoder architecture suited for high-rate quasi-cyclic low-density parity-check (QC-LDPC) codes using the (modified) min-sum algorithm for decoding. In general, over 30% of memory can be saved over conventional partially parallel decoder architectures. Efficient techniques have been developed to reduce the computation delay of the node processing units and to minimize hardware overhead for parallel processing. The proposed decoder architecture can linearly increase the decoding throughput with a small percentage of extra hardware. Consequently, it facilitates the applications of LDPC codes in area/power sensitive high-speed communication systems.

17.
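To address the large data transfer volume and slow operation of the reference block update module in the H.265 integer motion estimation algorithm, this paper analyzes the data correlation between reference blocks and proposes a parallel scheme that reduces hardware resource usage and improves operating efficiency. The scheme uses an array of 18×17 processing elements and provides three buffers designed around the data overlap between adjacent reference blocks in three directions. During an update, the buffer is located according to the relationship between the reference blocks, and the corresponding reference block data is then loaded from external memory. Resource usage falls to 1/16 of that of a conventional design. Experimental results show that the scheme raises the data reuse rate to 98.4% and effectively reduces the bandwidth requirement of the integer motion estimation algorithm.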

18.
19.
Next-generation mobile devices will continue to demand high processing power for imaging applications. The expected performance is in the class of supercomputers, but delivered with limited energy and memory bandwidth for embedded systems. This article advocates a streaming computation model that leverages the deterministic access patterns in imaging applications to deliver the necessary processing throughput. A reconfigurable datapath connects a set of functional units, forming a computation pipeline to offer energy efficiency. The architecture and implementation of a stream processor are presented along with the memory subsystem to support stream data transfers. The results show speedup ranging from a factor of 2 to 28 for imaging applications, offering favorable comparison against scalar processors.

20.
A network-on-chip (NoC) based parallel processor is presented for bio-inspired real-time object recognition with visual attention algorithm. It contains an ARM10-compatible 32-bit main processor, 8 single-instruction multiple-data (SIMD) clusters with 8 processing elements in each cluster, a cellular neural network based visual attention engine (VAE), a matching accelerator, and a DMA-like external interface. The VAE with 2-D shift register array finds salient objects on the entire image rapidly. Then, the parallel processor performs further detailed image processing within only the pre-selected attention regions. The low-latency NoC employs dual channel, adaptive switching and packet-based power management, providing 76.8 GB/s aggregated bandwidth. The 36 mm² chip contains 1.9 M gates and 226 kB SRAM in a 0.13 μm 8-metal CMOS technology. The fabricated chip achieves a peak performance of 125 GOPS and 22 frames/sec object recognition while dissipating 583 mW at 1.2 V.
