Similar Documents
1.
The 3-D Computer     
The 3-D Computer [1]–[4] is a unique implementation of a cellular array processor. We have developed two radically new technologies that enable massive numbers of communication channels both between silicon wafers and through them. A parallel processor (a single-instruction, multiple-data-stream cellular array processor) has been designed and built to demonstrate the potential of this technological approach. Although the 3-D Computer that has been built and operated is a small-scale implementation relative to the long-term aims of this technology, it is nevertheless an extremely powerful computer. The current feasibility-demonstration 3-D Computer is a 32×32 array of processors partitioned over five wafers stacked one on top of another. The throughput of this machine is >600 million operations per second (MOPS) with a 10 MHz clock, while the projected throughput of a full-scale machine is >100 billion operations per second (BOPS), again with a 10 MHz clock. Extending circuit integration beyond VLSI and WSI, which the 3-D technologies of wafer feedthroughs and microbridges make possible, enables these enormous throughputs in a very compact form and at very low power. The small size and low power of the 3-D Computer result from eliminating the chip-level and board-level packaging and the intraboard wiring required by conventional levels of circuit integration.
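As a rough illustration of the SIMD cellular-array model (not the 3-D Computer's actual instruction set, which the abstract does not describe), the sketch below broadcasts one operation to every cell of a small grid, each cell combining its own value with its east neighbour's. The function names and the wrap-around (toroidal) neighbourhood are assumptions. The helper also shows what the quoted figures imply: at least 60 operations per clock for the demonstrator (600 MOPS / 10 MHz) and at least 10,000 per clock for the full-scale machine.

```python
from typing import Callable, List

Grid = List[List[int]]

def simd_step(grid: Grid, op: Callable[[int, int], int]) -> Grid:
    """Apply one broadcast instruction to every cell at once (SIMD).

    Each cell combines its own value with its east neighbour's value.
    The array is treated as a torus (wrap-around), which is an assumption.
    """
    rows, cols = len(grid), len(grid[0])
    return [[op(grid[r][c], grid[r][(c + 1) % cols]) for c in range(cols)]
            for r in range(rows)]

def ops_per_clock(ops_per_second: float, clock_hz: float) -> float:
    """Average operations completed per clock cycle across the whole array."""
    return ops_per_second / clock_hz

if __name__ == "__main__":
    g = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(simd_step(g, lambda a, b: a + b)[0])  # [1, 3, 5, 3]
    print(ops_per_clock(600e6, 10e6))           # 60.0
    print(ops_per_clock(100e9, 10e6))           # 10000.0
```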

2.
A massively parallel systolic-array architecture is proposed for the implementation of real-time VLSI spatio-temporal 3-D IIR frequency-planar filters at a throughput of one frame per clock cycle (OFPCC). The architecture is based on a differential-form transfer function and has low circuit complexity compared with the direct-form architecture. A 3-D look-ahead (LA) form of the nonseparable 3-D transfer function is proposed to maximize implementation speed. The systolic array enables real-time implementation of 3-D IIR frequency-planar filters at radio-frequency (RF) frame rates and is therefore a suitable building block for 3-D IIR digital filters with beam- and cone-shaped passbands, as required for smart-antenna-array beamforming applications involving broadband spatio-temporal filtering of plane waves. The fixed-point systolic-array implementation has a throughput of OFPCC, and the tested real-time prototype achieves frame (clock) sample frequencies of up to 90 MHz using a single Xilinx Virtex-4 sx35-10ff668 FPGA device.
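The abstract does not give the 3-D LA form itself, so the sketch below shows only the one-dimensional, first-order analogue of a look-ahead transformation, as an illustration of why it helps: substituting the recursion into itself once turns y[n] = a·y[n-1] + x[n] into y[n] = a²·y[n-2] + a·x[n-1] + x[n], which produces identical output but gives the feedback loop two sample delays that can hold pipeline registers.

```python
def iir_direct(x, a):
    """First-order IIR: y[n] = a*y[n-1] + x[n], zero initial state."""
    y, prev = [], 0.0
    for xn in x:
        prev = a * prev + xn
        y.append(prev)
    return y

def iir_lookahead2(x, a):
    """Same filter after one look-ahead step:
    y[n] = a^2*y[n-2] + a*x[n-1] + x[n].
    The feedback now spans two samples, so the recursive loop can hold
    two pipeline registers instead of one."""
    y = []
    for n, xn in enumerate(x):
        y2 = y[n - 2] if n >= 2 else 0.0
        x1 = x[n - 1] if n >= 1 else 0.0
        y.append(a * a * y2 + a * x1 + xn)
    return y
```

The non-recursive term a·x[n-1] is the price paid for the extra loop delay; it sits outside the feedback path and can be pipelined freely.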

3.
A circuit that measures the phase of incoming asynchronous signals relative to the system clock in digital signal processing is described. The system clock can range from 10 to 20 MHz, as is typical for video signal processing applications. The rising or falling edge of the asynchronous signal serves as the reference. Its phase is measured with a resolution of 1/32 of a system clock cycle (approximately 1.5 to 3 ns). Purely digital CMOS technology without precision components is used, so that the circuit can be integrated on the same chip as a processor. Timing precision (jitter) is better than 200 ps without any adjustment. One external capacitor is needed.
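The resolution figures follow directly from dividing the clock period by 32: 100 ns / 32 ≈ 3.1 ns at 10 MHz and 50 ns / 32 ≈ 1.6 ns at 20 MHz. The sketch below models only an ideal quantiser with that resolution (no jitter, no analog behaviour); it says nothing about how the actual circuit, with its external capacitor, performs the measurement, and the function names are hypothetical.

```python
import math

STEPS = 32  # phase resolution: 1/32 of a system clock cycle (from the abstract)

def lsb_ns(clock_mhz: float) -> float:
    """Time resolution (one phase step) in ns for a given system clock."""
    return 1e3 / clock_mhz / STEPS

def phase_code(edge_time_ns: float, clock_mhz: float) -> int:
    """Ideal quantiser: position of an asynchronous edge within the
    current clock period, expressed as a code 0..31."""
    period_ns = 1e3 / clock_mhz
    frac = (edge_time_ns % period_ns) / period_ns
    return min(STEPS - 1, math.floor(frac * STEPS))
```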

4.
A 36 mm² graphics processor with a fixed-point programmable vertex shader is designed and implemented for portable two-dimensional (2-D) and three-dimensional (3-D) graphics applications. The graphics processor contains an ARM-10 compatible 32-bit RISC processor, a 128-bit programmable fixed-point single-instruction-multiple-data (SIMD) vertex shader, a low-power rendering engine, and a programmable frequency synthesizer (PFS). Unlike conventional graphics hardware, the proposed graphics processor implements the ARM-10 coprocessor architecture with dual operations, so that user-programmable vertex shading is possible for advanced graphics algorithms and various streaming multimedia processing tasks in mobile applications. The circuits and architecture of the graphics processor are optimized for fixed-point operations and achieve low power consumption with the help of instruction-level power management in the vertex shader and pixel-level clock gating in the rendering engine. The PFS, which uses a fully balanced voltage-controlled oscillator (VCO), adjusts the clock frequency continuously and adaptively from 8 MHz to 271 MHz under software control for low-power modes. The chip achieves peak graphics performance of 50 Mvertices/s and 200 Mtexels/s while dissipating 155 mW in a 0.18-μm six-metal standard CMOS logic process.
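To make "fixed-point vertex shading" concrete, the sketch below transforms a homogeneous vertex by a 4×4 matrix using integer arithmetic only. The Q16.16 number format is an assumption chosen for illustration; the abstract does not state the shader's actual word format or rounding behaviour.

```python
FRAC_BITS = 16           # assumed Q16.16 format (not stated in the abstract)
ONE = 1 << FRAC_BITS

def to_fixed(v: float) -> int:
    return int(round(v * ONE))

def to_float(q: int) -> float:
    return q / ONE

def fx_mul(a: int, b: int) -> int:
    """Fixed-point multiply: full-precision product, then drop FRAC_BITS.
    Python's >> on negative ints is an arithmetic shift (floors toward -inf)."""
    return (a * b) >> FRAC_BITS

def transform_vertex(m, v):
    """4x4 row-major fixed-point matrix times a homogeneous fixed-point vertex."""
    return [sum(fx_mul(m[r][c], v[c]) for c in range(4)) for r in range(4)]

if __name__ == "__main__":
    # Translation by (1, 2, 3) applied to the point (0.5, 0.25, -1).
    m = [[ONE, 0, 0, to_fixed(1)],
         [0, ONE, 0, to_fixed(2)],
         [0, 0, ONE, to_fixed(3)],
         [0, 0, 0, ONE]]
    v = [to_fixed(0.5), to_fixed(0.25), to_fixed(-1), ONE]
    print([to_float(q) for q in transform_vertex(m, v)])  # [1.5, 2.25, 2.0, 1.0]
```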

5.
Reed-Solomon codes are powerful error-correcting codes found in many digital communications standards. Recently, there has been interest in soft-decision decoding of Reed-Solomon codes, which incorporates reliability information from the channel into the decoding process. The Koetter-Vardy algorithm is a soft-decision decoding algorithm for Reed-Solomon codes that can provide several dB of gain over traditional hard-decision decoders. The algorithm consists of a soft-decision front end to the interpolation-based Guruswami-Sudan list decoder. The main computational task in the algorithm is a weighted interpolation of a bivariate polynomial. We propose a parallel architecture for the hardware implementation of bivariate interpolation for soft-decision decoding. Its key feature is the embedding of both a binary tree and a linear array into a 2-D array processor, enabling fast polynomial evaluation operations. A field-programmable gate array (FPGA) interpolation processor was implemented and demonstrated at a clock frequency of 23 MHz, corresponding to decoding rates of 10–15 Mb/s.
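The role of the binary tree can be illustrated with a small bivariate evaluation: each monomial term of Q(x, y) is computed independently, and the terms are then summed pairwise in log₂(n) adder levels rather than n − 1 sequential additions. This is only a sketch over the prime field GF(7), chosen for readability; practical Reed-Solomon decoders work in GF(2^m), and the paper's actual array mapping is not reproduced here.

```python
P = 7  # small prime field GF(7) for illustration; RS decoders use GF(2^m)

def tree_sum(values, p=P):
    """Pairwise (binary-tree) reduction: about log2(n) adder levels
    instead of n-1 sequential additions."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        nxt = [(level[i] + level[i + 1]) % p for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through to the next level
        level = nxt
    return level[0] % p

def eval_bivariate(coeffs, x, y, p=P):
    """Evaluate Q(x, y) = sum q[i][j] * x^i * y^j over GF(p);
    coeffs[i][j] holds q_ij."""
    terms = [c * pow(x, i, p) * pow(y, j, p) % p
             for i, row in enumerate(coeffs)
             for j, c in enumerate(row)]
    return tree_sum(terms, p)

if __name__ == "__main__":
    # Q(x, y) = y - x - 1 = 6 + 6x + y in GF(7); (2, 3) is a zero of Q.
    q = [[6, 1],
         [6, 0]]
    print(eval_bivariate(q, 2, 3))  # 0
    print(eval_bivariate(q, 2, 4))  # 1
```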

6.
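This paper presents an FPGA implementation architecture for a fast 2-D DCT algorithm. Using the row-column method, the 2-D DCT is decomposed into two 1-D DCTs. The 1-D DCT follows the Loeffler DCT algorithm and uses a parallel pipelined structure to increase the circuit's data throughput and computation speed. Simplifying the coefficient matrix and exploiting equivalent butterfly structures reduces the number of multipliers, so that each 1-D DCT core uses 16 multipliers. The transpose RAM consists of eight dual-port RAMs, allowing eight data words to be read and written in one clock cycle. Experimental results verify the correctness of the 2-D DCT core design; the circuit uses few resources, has simple routing and low power consumption, and is suitable for real-time image processing.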
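The row-column decomposition can be sketched in a few lines: apply a 1-D DCT to every row, transpose (the job of the transpose RAM), apply the 1-D DCT to every row again, and transpose back. For clarity the sketch uses the direct orthonormal DCT-II formula rather than the Loeffler factorization; a Loeffler DCT computes the same coefficients with far fewer multiplications, which is what the hardware exploits.

```python
import math

def dct_1d(x):
    """Orthonormal N-point DCT-II by the direct formula (not Loeffler)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct_2d(block):
    """Row-column 2-D DCT: 1-D DCT on rows, transpose, 1-D DCT on rows
    again (equivalently, on the original columns), then transpose back."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(r) for r in transpose(rows)]
    return transpose(cols)
```

For a constant 8×8 block of ones, all energy lands in the DC coefficient (value 8 with this scaling) and every other coefficient is zero up to rounding.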

7.
Yang Yan, Hou Chaohuan. Acta Electronica Sinica, 2003, 31(11): 1667-1670
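This paper proposes a new VLSI architecture for a hierarchical datapath based on a VLIW processor. A distinctive microcode structure makes it straightforward to obtain a control model for a configurable high-speed datapath. The model effectively improves the flexibility needed for system expansion and is suitable for building high-performance media processor arrays. The hardware design, implemented in VHDL, has passed system simulation. At a 100 MHz clock frequency, the maximum data throughput reaches 1.28 Gbit/s.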
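The abstract does not describe the microcode format, so the sketch below only illustrates the general VLIW idea it builds on: several independent micro-operations are bundled into one long instruction word, and each functional unit decodes its own slot in the same cycle. The four-slot layout and the 6-bit opcode / 5-bit destination / 5-bit source fields are entirely hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical layout: 4 issue slots, each a 16-bit micro-op
# = 6-bit opcode | 5-bit dest | 5-bit src. Not taken from the paper.
SLOTS, SLOT_BITS = 4, 16

@dataclass
class MicroOp:
    opcode: int  # 0..63
    dest: int    # 0..31
    src: int     # 0..31

    def encode(self) -> int:
        assert 0 <= self.opcode < 64 and 0 <= self.dest < 32 and 0 <= self.src < 32
        return (self.opcode << 10) | (self.dest << 5) | self.src

    @staticmethod
    def decode(word: int) -> "MicroOp":
        return MicroOp((word >> 10) & 0x3F, (word >> 5) & 0x1F, word & 0x1F)

def pack(ops: List[MicroOp]) -> int:
    """Bundle one micro-op per slot into a single long instruction word;
    slot 0 occupies the least-significant bits."""
    assert len(ops) == SLOTS
    word = 0
    for i, op in enumerate(ops):
        word |= op.encode() << (i * SLOT_BITS)
    return word

def unpack(word: int) -> List[MicroOp]:
    """Split the long word back into per-slot control fields, as each
    functional unit's decoder would see them in the same cycle."""
    mask = (1 << SLOT_BITS) - 1
    return [MicroOp.decode((word >> (i * SLOT_BITS)) & mask) for i in range(SLOTS)]
```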

8.
This paper describes a new IA-32 architecture microprocessor that implements 70 additional instructions to further accelerate data-streaming applications such as three-dimensional graphics and video encoding/decoding. The processor enhances the previous implementation of this family by adding these new instructions and by improving circuits in several key areas for higher clock frequency. The 10.17 × 12.10 mm² die contains 9.5 million transistors and is fabricated in a 0.25-μm CMOS process with five metal layers, in a six-layer organic land grid array package using C4 interconnect technology. It has an operating range of 1.4–2.2 V and currently runs at up to 650 MHz.

9.
10.
A novel three-dimensional (3-D) masterslice monolithic microwave integrated circuit (MMIC) is presented that significantly reduces turnaround time and cost for multifunction MMIC production. The MMIC incorporates an artificial ground metal that allows master-array elements on the wafer surface to be selected effectively, so that various MMICs can be implemented on one master-arrayed footprint using thin polyimide and metal layers over it. In addition, 3-D miniature circuit components smaller than 0.4 mm² provide a very high integration level. To demonstrate these advantages, a 20-GHz-band receiver MMIC was implemented on a master array of 6×3 array units, containing a total of 36 MESFETs, in a 1.78 × 1.78 mm² area. Details of the miniature circuit components and of the design, which is closely tied to the fabrication process, are also presented. The receiver MMIC exhibited a 19-dB conversion gain with an associated 6.5-dB noise figure from 17 to 24 GHz, and an integration level four times higher than that of conventional planar MMICs. Because it can be applied similarly to large-scale Si wafers with the aid of an artificial ground, this technology promises about a 90% reduction in MMIC cost.

11.
Experimental results are presented for an integrated circuit implementing a simplicial cellular nonlinear network digital pixel processor. The prototype has a 7 × 6 cell array and works as expected at a test clock speed of 10 MHz.

12.
Maximizing the Functional Yield of Wafer-to-Wafer 3-D Integration
Three-dimensional integrated circuit technology with through-silicon vias offers many advantages, including improved form factor, increased circuit performance, robust heterogeneous integration, and reduced costs. Wafer-to-wafer integration supports the highest possible density of through-silicon vias and the highest throughput; however, in contrast to die-to-wafer integration, it does not benefit from the ability to bond only tested and diced good die. In wafer-to-wafer integration, entire wafers are bonded together, which can unintentionally stack a bad die from one wafer onto a good die from another wafer, reducing yield. In this paper, we propose solutions that maximize the yield of wafer-to-wafer 3-D integration, assuming that the individual die can be tested on the wafers before bonding. We exploit some of the available flexibility in the integration process and propose wafer assignment algorithms that maximize the number of good 3-D ICs. Our algorithms range from scalable, fast heuristics to optimal methods that exactly maximize the yield of wafer-to-wafer 3-D integration. Using realistic defect models and yield simulations, we demonstrate the effectiveness of our methods for large numbers of wafer stacks. Our results show that yield can be improved significantly compared with yield-oblivious wafer assignment methods.
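A minimal sketch of yield-aware wafer assignment for a two-tier stack, assuming each wafer's pre-bond test result is stored as a bitmap of good die sites (bit k = 1 if site k is good) and all wafers share the same die map. The greedy rule below is a simple illustrative heuristic, not one of the paper's algorithms; for two tiers, an exact answer can be obtained as a maximum-weight bipartite matching between the two wafer sets.

```python
from typing import List

def good_stacks(a: int, b: int) -> int:
    """A 2-tier 3-D IC is good only if both of its die are good,
    so the good stacks of a wafer pair are the set bits of a AND b."""
    return bin(a & b).count("1")

def greedy_assign(top: List[int], bottom: List[int]) -> List[int]:
    """Greedy heuristic: pair each top wafer with the unused bottom wafer
    that maximises good stacks. Returns the bottom index for each top wafer.
    Assumes len(top) <= len(bottom)."""
    unused = set(range(len(bottom)))
    pairing = []
    for t in top:
        best = max(sorted(unused), key=lambda j: good_stacks(t, bottom[j]))
        pairing.append(best)
        unused.remove(best)
    return pairing

def total_yield(top: List[int], bottom: List[int], pairing: List[int]) -> int:
    return sum(good_stacks(t, bottom[j]) for t, j in zip(top, pairing))

if __name__ == "__main__":
    top = [0b1100, 0b0011]
    bottom = [0b0011, 0b1100]
    print(total_yield(top, bottom, [0, 1]))                    # in-order pairing: 0
    print(total_yield(top, bottom, greedy_assign(top, bottom)))  # 4
```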

13.
A 54-MHz CMOS video processor with a systolic architecture suited to two-dimensional symmetric FIR (finite impulse response) filtering is reported. The circuit is a one-dimensional digital filter comprising a control part and an array of eight multiplication-accumulation cells. The processor can handle 32 equivalent multiply-add operations in a sampling period as short as 18 ns. Devices can be cascaded to increase the order of the filter in both dimensions, up to 1024 stages, with no truncation errors. It was developed in a 1.2-μm CMOS technology and dissipates less than 500 mW at a 54-MHz clock frequency.
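Symmetric FIR filters are commonly implemented with a pre-adder: because h[k] = h[N−1−k], the two samples that share a coefficient are added first and multiplied once, roughly halving the multiplier count. The sketch below demonstrates that identity in one dimension; whether the reported chip counts its "equivalent" operations this way is not stated in the abstract, so treat the connection as an assumption.

```python
def fir_direct(x, h):
    """y[n] = sum_k h[k] * x[n-k], zero initial state."""
    n_taps = len(h)
    return [sum(h[k] * x[n - k] for k in range(n_taps) if n - k >= 0)
            for n in range(len(x))]

def fir_symmetric(x, h):
    """Same output for a symmetric h (h[k] == h[N-1-k]), using one
    pre-addition per coefficient pair so only ceil(N/2) multiplies
    are needed per output sample."""
    n_taps = len(h)
    assert all(h[k] == h[n_taps - 1 - k] for k in range(n_taps))

    def sample(i):
        return x[i] if i >= 0 else 0  # zero before the first input sample

    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(n_taps // 2):
            acc += h[k] * (sample(n - k) + sample(n - (n_taps - 1 - k)))
        if n_taps % 2:  # odd length: the centre tap has no partner
            mid = n_taps // 2
            acc += h[mid] * sample(n - mid)
        y.append(acc)
    return y
```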

14.
An artificial retina is a device that closely couples an imager with processing facilities on a monolithic circuit. However, except for simple environments and applications, analog hardware alone cannot process and compact the raw image flow from the photosensitive array. To solve this output problem, an on-chip array of bare Boolean processors with halftoning facilities is proposed, made versatile through programmability. For a pixel memory size of 3 bits, the authors demonstrate both the technological practicality and the computational efficiency of this programmable Boolean retina concept. Using semistatic shifting structures together with some interaction circuitry, a minimal retina Boolean processor can be built with fewer than 30 transistors and controlled by as few as six global clock signals. The successful design, integration, and test of a 65×76 Boolean retina on a 50-mm² circuit in 2-μm CMOS are described.
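The two ingredients named above, halftoning and Boolean processing through shifts between neighbouring cells, can be sketched on a small binary image. Here halftoning is modelled as 2×2 ordered (Bayer) dithering and the Boolean operation as a 4-neighbour erosion built from one-pixel shifts and ANDs; both choices are illustrative assumptions, and the sketch ignores the 3-bit pixel memory and clocking scheme of the actual chip.

```python
BAYER_2X2 = [[0, 2],
             [3, 1]]  # classic 2x2 ordered-dither index matrix

def halftone(gray, levels=256):
    """Ordered dithering: compare each pixel with a position-dependent
    threshold (index + 0.5) / 4 of full scale to obtain a 1-bit image."""
    h, w = len(gray), len(gray[0])
    return [[1 if gray[r][c] * 4 > (BAYER_2X2[r % 2][c % 2] + 0.5) * levels else 0
             for c in range(w)] for r in range(h)]

def shift(img, dr, dc):
    """Move every bit by (dr, dc); bits shifted in from outside are 0,
    mimicking a transfer through neighbouring cells."""
    h, w = len(img), len(img[0])
    return [[img[r - dr][c - dc] if 0 <= r - dr < h and 0 <= c - dc < w else 0
             for c in range(w)] for r in range(h)]

def erode(img):
    """Boolean erosion with the 4-neighbourhood: AND the image with its four
    one-pixel shifts (the same instruction executed in every cell)."""
    out = img
    for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        s = shift(img, dr, dc)
        out = [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(out, s)]
    return out
```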

15.
FPGA Implementation of an Image Correlation Algorithm for the Space Solar Telescope
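Two-dimensional image correlation tracking is one of the keys to enabling the 1 m optical system of the Space Solar Telescope (SST) to reach 0.1″ resolution. This paper describes how the SST correlation algorithm is implemented on an FPGA, using techniques such as a 2×2 vector-radix butterfly FFT, a modular structure, a two-level state machine, dynamic block floating point, and parallel pipelined timing. At 20 MHz, the correlation algorithm for a 32×32 image takes only 713 μs on an XCV800 device, with a pixel-fitting accuracy better than 1/50 pixel.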
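The sketch below shows the computation that correlation tracking performs: a circular cross-correlation of a live image against a reference, followed by a parabolic fit around the integer peak to estimate the sub-pixel shift. It evaluates the correlation directly in O(N⁴) for readability; an FPGA design such as the one described computes the same circular correlation via FFTs (IFFT of conj(FFT(a))·FFT(b)). The parabolic sub-pixel fit is an assumption, since the abstract does not say which fitting method is used.

```python
def circular_xcorr(a, b):
    """c[dy][dx] = sum_{y,x} a[y][x] * b[(y+dy) % N][(x+dx) % N]."""
    n = len(a)
    return [[sum(a[y][x] * b[(y + dy) % n][(x + dx) % n]
                 for y in range(n) for x in range(n))
             for dx in range(n)] for dy in range(n)]

def parabolic_offset(cm, c0, cp):
    """Vertex of the parabola through (-1, cm), (0, c0), (+1, cp)."""
    denom = cm - 2 * c0 + cp
    return 0.0 if denom == 0 else 0.5 * (cm - cp) / denom

def estimate_shift(a, b):
    """Return the (dy, dx) shift of b relative to a, with sub-pixel parts.
    Shifts are reported modulo N (negative shifts wrap to N - |s|)."""
    n = len(a)
    c = circular_xcorr(a, b)
    dy, dx = max(((y, x) for y in range(n) for x in range(n)),
                 key=lambda p: c[p[0]][p[1]])
    oy = parabolic_offset(c[(dy - 1) % n][dx], c[dy][dx], c[(dy + 1) % n][dx])
    ox = parabolic_offset(c[dy][(dx - 1) % n], c[dy][dx], c[dy][(dx + 1) % n])
    return dy + oy, dx + ox
```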

16.
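Serial scrambling/descrambling circuits suffer from high power consumption, slow data processing, and the need for a high clock frequency. To address these problems, a new parallel scrambling/descrambling circuit based on the JESD204B protocol is proposed, which uses an algorithm derived from a matrix formulation to scramble and descramble 32 bits in parallel. The circuit was designed at the register-transfer level (RTL) in Verilog HDL and verified with Cadence's NCVerilog software. The results show that the circuit correctly performs scrambling and descrambling and can process 10 Gb/s of data with a 312.5 MHz clock. Samples were fabricated in a 65 nm CMOS process, and the test results show that the circuit meets the design requirements. The scrambler/descrambler provides a valuable reference for the independent, self-controlled design and implementation of high-speed data-communication chips.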
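The parallel idea can be sketched as follows. JESD204B's self-synchronous scrambler uses the polynomial 1 + x¹⁴ + x¹⁵, i.e. s[n] = d[n] ⊕ s[n−14] ⊕ s[n−15]. Because a 32-bit word is longer than the 15-bit history, later bits of a word depend on earlier bits of the same word; unrolling the recurrence over GF(2) expresses every output bit as a fixed XOR of data bits and state bits, so all 32 bits can be produced in one clock (10 Gb/s ÷ 32 = 312.5 MHz). The bit ordering (bit 0 first) and the mask representation are assumptions for illustration; this is not the paper's derived matrix.

```python
W = 32             # parallel word width (from the abstract)
TAPS = (14, 15)    # JESD204B scrambler polynomial 1 + x^14 + x^15
STATE = max(TAPS)  # 15 previous scrambled bits

def scramble_serial(bits, state):
    """Bit-serial reference: s[n] = d[n] ^ s[n-14] ^ s[n-15].
    `state` holds the last 15 scrambled bits, oldest first."""
    hist = list(state)
    out = []
    for d in bits:
        s = d ^ hist[-14] ^ hist[-15]
        out.append(s)
        hist.append(s)
    return out, hist[-STATE:]

def _build_masks():
    """Unroll the recurrence: output bit i of a word is the XOR (GF(2) sum)
    of the inputs selected by masks[i]. Mask bits 0..W-1 select data bits;
    mask bit W+k-1 selects state bit s[-k]."""
    masks = []
    for i in range(W):
        m = 1 << i
        for t in TAPS:
            j = i - t
            m ^= masks[j] if j >= 0 else 1 << (W + (t - i) - 1)
        masks.append(m)
    return masks

MASKS = _build_masks()

def scramble_word(bits, state):
    """Parallel scrambler: each output bit is an independent XOR tree."""
    v = sum(b << i for i, b in enumerate(bits))
    v |= sum(state[-k] << (W + k - 1) for k in range(1, STATE + 1))
    out = [bin(m & v).count("1") & 1 for m in MASKS]
    return out, (list(state) + out)[-STATE:]

def descramble_word(bits, state):
    """Descrambler d[n] = s[n] ^ s[n-14] ^ s[n-15] has no feedback,
    so it parallelises directly."""
    hist = list(state) + list(bits)
    out = [hist[STATE + i] ^ hist[STATE + i - 14] ^ hist[STATE + i - 15]
           for i in range(W)]
    return out, hist[-STATE:]
```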

17.
A new VLSI processor (the DIP chip) for image compression is presented that combines the principles of multipipeline and array processing. The device is not specific to any one image compression algorithm and can be regarded as a general-purpose processor. The chip has been implemented in a 1.0-μm CMOS process on a 14.4 × 13.5-mm² die. An internal clock frequency of 40 MHz yields 1.2 × 10⁹ operations/s on 8-bit data. The paper focuses mainly on solving the problems associated with the large bandwidth required by both the image data and the instruction streams. It also addresses the related problem of raising the array clock frequency relative to the input/output clock frequency without requiring a large on-chip instruction cache or fast external clock speeds.

18.
This paper presents a novel variable-latency multiplier architecture suitable for implementation either as a self-timed multiplier core or as a fully synchronous multicycle multiplier core. The architecture combines a second-order Booth algorithm with a pipelined split carry-save array organization, incorporating multiple-row skipping and a completion-predicting carry-select dual adder. The paper reports the architecture and logic design, the CMOS circuit design, and a performance evaluation. In 0.35-μm CMOS, the expected sustainable cycle time for a 32-bit synchronous implementation is 2.25 ns. Instruction-level simulations estimate 54% single-cycle and 46% two-cycle operations in SPEC95 execution. In the same CMOS process, the 32-bit asynchronous implementation is expected to reach an average throughput of 1.76 ns and a latency of 3.48 ns in SPEC95 execution.
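A second-order Booth algorithm is usually called radix-4 (modified) Booth recoding: the multiplier is rewritten as n/2 digits in {−2, −1, 0, +1, +2}, so only half as many partial-product rows are needed, and rows whose digit is 0 contribute nothing and can be skipped. The sketch below models that arithmetic and counts skippable rows; it does not model the paper's split carry-save array, pipelining, or completion prediction.

```python
def booth_radix4_digits(y: int, n_bits: int = 32):
    """Recode an n-bit two's-complement multiplier into n/2 digits:
    d_i = -2*y[2i+1] + y[2i] + y[2i-1], with y[-1] = 0."""
    assert n_bits % 2 == 0 and -(1 << (n_bits - 1)) <= y < (1 << (n_bits - 1))
    yb = y & ((1 << n_bits) - 1)  # two's-complement bit pattern

    def bit(k):
        return (yb >> k) & 1 if k >= 0 else 0

    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range(n_bits // 2)]

def booth_multiply(x: int, y: int, n_bits: int = 32):
    """Sum one partial product d_i * x * 4^i per non-zero digit.
    Returns (product, number of zero rows that could be skipped)."""
    product, skipped = 0, 0
    for i, d in enumerate(booth_radix4_digits(y, n_bits)):
        if d == 0:
            skipped += 1
            continue
        product += (d * x) << (2 * i)
    return product, skipped
```

Operands with many zero digits (for example y = 1, which has 15 zero rows out of 16) are exactly the cases a row-skipping datapath can finish early, which is one source of the variable latency.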

19.
Modern reduced-instruction-set computer (RISC) chips have features that lay the groundwork for high performance. They boast circuits that can run at clock frequencies of up to 300 MHz, pipelines built to execute independent operations continually, and the ability to execute multiple instructions in a single clock cycle. The potential of this hardware can be realized only with sophisticated compiler technology. The best approach to optimal computing performance is to design the processor architecture and the compiler concurrently.
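One example of the compiler technology needed to exploit multiple issue is instruction scheduling. The sketch below is a textbook-style greedy list scheduler for a hypothetical `width`-issue machine; it is not taken from the article. It assumes an acyclic dependence graph whose keys include every referenced operation, and it breaks ties alphabetically, whereas practical schedulers prioritise by critical-path length.

```python
from typing import Dict, List

def list_schedule(deps: Dict[str, List[str]], latency: Dict[str, int],
                  width: int = 2) -> Dict[str, int]:
    """Greedy cycle-by-cycle list scheduling.
    deps[op] lists the ops whose results `op` consumes.
    An op is ready once every producer has issued and its latency elapsed.
    Returns {op: issue cycle}."""
    issue: Dict[str, int] = {}
    remaining = set(deps)
    cycle = 0
    while remaining:
        ready = sorted(op for op in remaining
                       if all(p in issue and issue[p] + latency[p] <= cycle
                              for p in deps[op]))
        for op in ready[:width]:  # issue at most `width` ops this cycle
            issue[op] = cycle
            remaining.remove(op)
        cycle += 1
    return issue

if __name__ == "__main__":
    deps = {"a": [], "b": [], "c": ["a", "b"], "d": [], "e": ["c", "d"]}
    lat = {op: 1 for op in deps}
    print(list_schedule(deps, lat))  # 5 ops in 3 cycles on a 2-issue machine
```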

20.
A new class of fully parameterizable multiple-array architectures for motion estimation in video sequences, based on the Full-Search Block-Matching algorithm, is proposed in this paper. This class is based on a new, efficient AB2 single-array architecture with minimum latency, maximum throughput, and full utilization of the hardware resources. It allows the target processor to be configured within the bounds imposed on the configuration parameters, which concern the algorithm setup, the processing time, and the circuit area. To this end, a software configuration tool has been implemented to determine the set of possible configurations that fulfill the requirements of a given video coder. Experimental results using both FPGA and ASIC technologies are presented. In particular, the implementation of a single-array processor configuration on a single chip is illustrated, demonstrating the ability to estimate motion vectors in real time.
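For reference, the Full-Search Block-Matching computation that such arrays accelerate can be written as a plain loop: every candidate displacement within a search range is evaluated and the one with the lowest matching cost is kept. The sketch assumes the sum of absolute differences (SAD) as the cost, a block size `n` and search range `p` chosen for illustration, and candidates restricted to lie inside the reference frame; none of these parameters come from the paper.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in
    the current frame and the block displaced by (dx, dy) in the reference."""
    return sum(abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
               for r in range(n) for c in range(n))

def full_search(cur, ref, bx, by, n=4, p=2):
    """Exhaustive block matching over every displacement in [-p, p]^2 whose
    candidate block lies inside the reference frame. Returns (dx, dy, SAD)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            if not (0 <= by + dy and by + dy + n <= h
                    and 0 <= bx + dx and bx + dx + n <= w):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

The (2p + 1)² candidates are independent of one another, which is what lets a hardware array evaluate many of them concurrently.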
