Similar Documents
20 similar documents found
1.
The 3-D Computer
The 3-D Computer [1]–[4] is a unique implementation of a cellular array processor. We have developed two radically new technologies that enable massive numbers of communication channels both between silicon wafers and through them. A parallel processor (a single-instruction, multiple-data-stream cellular array processor) has been designed and built to demonstrate the potential of this technological approach. Although the 3-D Computer that has been built and operated is a small-scale implementation relative to the long-term aims of this technology, it is nevertheless an extremely powerful computer. The current feasibility-demonstration 3-D Computer is a 32×32 array of processors partitioned over five wafers stacked one on top of another. The throughput of this machine is >600 million operations per second (MOPS) with a 10 MHz clock, while the projected throughput of a full-scale machine is >100 billion operations per second (BOPS), again with a 10 MHz clock. The 3-D technologies of wafer feedthroughs and microbridges extend circuit integration beyond the level of VLSI and WSI, which enables these enormous throughputs in a very compact form and at very low power. The small size and low power of the 3-D Computer result from eliminating the chip-level and board-level packaging and the intraboard wiring required by conventional levels of circuit integration.

2.
A massively parallel systolic-array architecture is proposed for real-time VLSI implementation of spatio-temporal 3-D IIR frequency-planar filters at a throughput of one frame per clock cycle (OFPCC). The architecture is based on a differential-form transfer function and has low circuit complexity compared with the direct-form architecture. A 3-D look-ahead (LA) form of the nonseparable 3-D transfer function is proposed to maximize the speed of the implementation. The systolic array enables real-time implementation of 3-D IIR frequency-planar filters at radio-frequency (RF) frame rates and is therefore a suitable building block for 3-D IIR digital filters with beam- and cone-shaped passbands, as required for smart-antenna-array beamforming applications involving broadband spatio-temporal filtering of plane waves. The fixed-point systolic-array implementation has a throughput of OFPCC, and the tested real-time prototype achieves frame (clock) sample frequencies of up to 90 MHz on a single Xilinx Virtex-4 sx35-10ff668 FPGA device.

3.
Reed-Solomon codes are powerful error-correcting codes found in many digital communications standards. Recently, there has been interest in soft-decision decoding of Reed-Solomon codes, which incorporates reliability information from the channel into the decoding process. The Koetter-Vardy algorithm is a soft-decision decoding algorithm for Reed-Solomon codes that can provide several dB of gain over traditional hard-decision decoders. The algorithm consists of a soft-decision front end to the interpolation-based Guruswami-Sudan list decoder. The main computational task in the algorithm is a weighted interpolation of a bivariate polynomial. We propose a parallel architecture for the hardware implementation of bivariate interpolation for soft-decision decoding. The key feature is the embedding of both a binary tree and a linear array into a 2-D array processor, enabling fast polynomial evaluation operations. A field-programmable gate array (FPGA) interpolation processor was implemented and demonstrated at a clock frequency of 23 MHz, corresponding to decoding rates of 10-15 Mb/s.
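The abstract names polynomial evaluation as the operation the 2-D array accelerates. As a rough software reference only, the sketch below evaluates a bivariate polynomial Q(x, y) = sum_j q_j(x) y^j with nested Horner's rule. It assumes a small prime field GF(929) to keep the arithmetic readable; practical Reed-Solomon decoders work over GF(2^m). The names `P`, `horner`, and `eval_bivariate` are hypothetical, and the binary-tree/linear-array mapping of the hardware is not modeled.

```python
# Hypothetical reference model: evaluate Q(x, y) = sum_j q_j(x) * y**j over GF(P).
# Assumption: a prime field keeps the example short; real RS decoders use GF(2^m).
P = 929  # assumed prime modulus, chosen only for illustration


def horner(coeffs, x, p=P):
    """Evaluate sum(coeffs[i] * x**i) mod p with Horner's rule (lowest degree first)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc


def eval_bivariate(q, x, y, p=P):
    """q[j] lists the coefficients of q_j(x); return Q(x, y) mod p.

    The inner Horner pass evaluates each q_j(x); the outer pass runs Horner in y.
    """
    acc = 0
    for qj in reversed(q):
        acc = (acc * y + horner(qj, x, p)) % p
    return acc


# Q(x, y) = 1 + 2x + 3y + xy, i.e. q_0(x) = 1 + 2x and q_1(x) = 3 + x
q = [[1, 2], [3, 1]]
print(eval_bivariate(q, 2, 5))  # 30
```

Nesting the two Horner passes keeps each step to one multiply and one add, which is the kind of regular operation that maps well onto an array of simple processing elements.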

4.
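This paper presents an FPGA implementation architecture for a fast 2-D DCT algorithm. Using the row-column method, the 2-D DCT is decomposed into two 1-D DCTs. The 1-D DCT is based on the Loeffler DCT algorithm and uses a parallel pipelined structure to raise the circuit's data throughput and operating speed. Simplifying the coefficient matrix and applying equivalent transformations to the butterfly structure reduce the number of multipliers, so that the 1-D DCT core consumes 16 multipliers. The transpose memory uses eight dual-port RAMs, so eight data words can be read and written in a single clock cycle. Experimental results verify the correctness of the 2-D DCT core design. The circuit uses few resources, has simple routing, and consumes little power, making it suitable for real-time image processing.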
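The row-column decomposition can be illustrated in software. The minimal sketch below assumes an 8×8 block and the orthonormal DCT-II scaling. It computes each 1-D DCT directly from the definition rather than with Loeffler's fast butterfly factorization, and a plain transpose stands in for the dual-port transpose RAM. The function names are hypothetical.

```python
import math

N = 8  # block size assumed for illustration


def dct_1d(v):
    """Orthonormal N-point DCT-II from its definition (not Loeffler's fast form)."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        s = sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        out.append(scale * s)
    return out


def transpose(m):
    """Swap rows and columns; plays the role of the transpose memory."""
    return [list(row) for row in zip(*m)]


def dct_2d_row_column(block):
    """2-D DCT = 1-D DCT on every row, transpose, 1-D DCT on every row again, transpose back."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(c) for c in transpose(rows)]
    return transpose(cols)


# A flat block has all its energy in the DC coefficient: 64 * 10 / 8 = 80.
flat = [[10] * N for _ in range(N)]
print(round(dct_2d_row_column(flat)[0][0], 6))  # 80.0
```

Because the 2-D DCT kernel is separable, the two passes give the same result as the direct double sum while needing only a 1-D DCT engine plus a transpose, which is why the hardware can reuse one 1-D core design.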

5.
A circuit that measures the phase of incoming asynchronous signals relative to the system clock in digital signal processing is described. The system clock can range from 10 to 20 MHz, as is typical for video signal processing applications. Either the rising (positive) or the falling (negative) edge of the asynchronous signal serves as the reference. Its phase is measured with a resolution of 1/32 of a system clock cycle (approximately 1.5 to 3 ns). Purely digital CMOS technology without precision components is used, so that the circuit can be integrated together with processor chips. Timing precision (jitter) is better than 200 ps without any adjustments. One external capacitor is needed.
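The quoted resolution follows from dividing one clock period into 32 bins. A worked check at both ends of the stated 10 to 20 MHz clock range:

```latex
\Delta t = \frac{T_{\mathrm{clk}}}{32} = \frac{1}{32\, f_{\mathrm{clk}}}, \qquad
\Delta t\big|_{20\,\mathrm{MHz}} = \frac{50\ \mathrm{ns}}{32} \approx 1.56\ \mathrm{ns}, \qquad
\Delta t\big|_{10\,\mathrm{MHz}} = \frac{100\ \mathrm{ns}}{32} \approx 3.13\ \mathrm{ns}
```

These values match the "approximately 1.5 to 3 ns" range stated above.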

6.
7.
Yang Yan, Hou Chaohuan. Acta Electronica Sinica (电子学报), 2003, 31(11): 1667-1670
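This paper proposes a new VLSI architecture for a hierarchical datapath based on a VLIW processor. A distinctive microcode structure makes it straightforward to obtain a control model for a configurable high-speed datapath. The model effectively improves the flexibility required for system expansion and is well suited to building high-performance media-processor arrays. The hardware design, implemented in VHDL, has passed system simulation. At a 100 MHz clock frequency, the maximum data throughput reaches 1.28 Gbit/s.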

8.
A 36-mm² graphics processor with a fixed-point programmable vertex shader is designed and implemented for portable two-dimensional (2-D) and three-dimensional (3-D) graphics applications. The graphics processor contains an ARM-10-compatible 32-bit RISC processor, a 128-bit programmable fixed-point single-instruction-multiple-data (SIMD) vertex shader, a low-power rendering engine, and a programmable frequency synthesizer (PFS). Unlike conventional graphics hardware, the proposed graphics processor implements the ARM-10 coprocessor architecture with dual operations, so that user-programmable vertex shading is possible for advanced graphics algorithms and various streaming multimedia processing in mobile applications. The circuits and architecture of the graphics processor are optimized for fixed-point operations and achieve low power consumption with the help of instruction-level power management of the vertex shader and pixel-level clock gating of the rendering engine. The PFS, which uses a fully balanced voltage-controlled oscillator (VCO), adjusts the clock frequency continuously and adaptively from 8 MHz to 271 MHz under software control for low-power modes. The chip shows 50 Mvertices/s and 200 Mtexels/s peak graphics performance, dissipating 155 mW in a 0.18-μm 6-metal standard CMOS logic process.

9.
A novel three-dimensional (3-D) masterslice monolithic microwave integrated circuit (MMIC) is presented that significantly reduces turnaround time and cost for multifunction MMIC production. This MMIC incorporates an artificial ground metal for effective selection of master-array elements on the wafer surface, so that various MMICs can be implemented on one master-arrayed footprint using the thin polyimide and metal layers over it. Additionally, 3-D miniature circuit components smaller than 0.4 mm² provide a very high integration level. To demonstrate these advantages, a 20-GHz-band receiver MMIC was implemented on a master array of 6×3 array units containing a total of 36 MESFETs in a 1.78×1.78-mm area. Details of the miniature circuit components and of the design, which is closely tied to the fabrication process, are also presented. The receiver MMIC exhibited a 19-dB conversion gain with an associated 6.5-dB noise figure from 17 to 24 GHz, and an integration level four times higher than that of conventional planar MMICs. This technology promises about a 90% cost reduction for MMICs because, with the aid of an artificial ground, it can be applied similarly to large-scale Si wafers.

10.
Maximizing the Functional Yield of Wafer-to-Wafer 3-D Integration
Three-dimensional integrated circuit technology with through-silicon vias offers many advantages, including improved form factor, increased circuit performance, robust heterogeneous integration, and reduced costs. Wafer-to-wafer integration supports the highest possible density of through-silicon vias and the highest throughput; however, unlike die-to-wafer integration, it does not benefit from the ability to bond only tested and diced good dies. In wafer-to-wafer integration, whole wafers are bonded together, which can unintentionally stack a bad die from one wafer onto a good die from another wafer, reducing the yield. In this paper, we propose solutions that maximize the yield of wafer-to-wafer 3-D integration, assuming that the individual dies can be tested on the wafers before bonding. We exploit some of the available flexibility in the integration process and propose wafer assignment algorithms that maximize the number of good 3-D ICs. Our algorithms range from scalable, fast heuristics to optimal methods that exactly maximize the yield of wafer-to-wafer 3-D integration. Using realistic defect models and yield simulations, we demonstrate the effectiveness of our methods for large numbers of wafer stacks. Our results show that yield can be improved significantly compared with yield-oblivious wafer assignment methods.
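To make the wafer-assignment idea concrete, here is a minimal sketch for the simplest case: two-layer stacks built from two lots of pre-tested wafers, where each wafer is a die map of 1 (good) and 0 (bad) at fixed positions. It finds the pairing that maximizes good stacks by exhaustive search, which is exact but only feasible for a handful of wafers. This is not the paper's heuristic or optimal method. For two layers the problem is a classical assignment problem that can also be solved in polynomial time, for example with the Hungarian algorithm. All names are hypothetical.

```python
from itertools import permutations


def good_stacks(top, bottom):
    """Count die positions where both bonded dies tested good (1 = good, 0 = bad)."""
    return sum(1 for a, b in zip(top, bottom) if a and b)


def total_yield(assignment, lot_a, lot_b):
    """assignment[i] is the index of the lot_b wafer bonded onto wafer i of lot_a."""
    return sum(good_stacks(lot_a[i], lot_b[j]) for i, j in enumerate(assignment))


def best_assignment(lot_a, lot_b):
    """Exhaustive search over all pairings (exact, O(n!) so only for small lots)."""
    n = len(lot_a)
    return max(permutations(range(n)), key=lambda p: total_yield(p, lot_a, lot_b))


lot_a = [[1, 1, 0, 0], [0, 0, 1, 1]]
lot_b = [[0, 0, 1, 1], [1, 1, 0, 0]]
print(total_yield((0, 1), lot_a, lot_b))  # yield-oblivious pairing: 0
best = best_assignment(lot_a, lot_b)
print(best, total_yield(best, lot_a, lot_b))  # (1, 0) 4
```

The toy lots show why assignment matters: pairing wafers in index order wastes every good die, while swapping the partners recovers four good stacks.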

11.
Experimental results of an integrated circuit implementing a simplicial cellular nonlinear network digital pixel processor are presented. The prototype has a 7 × 6 cell array and works as expected at a testing clock speed of 10 MHz.

12.
An artificial retina is a device that intimately associates an imager with processing facilities on a monolithic circuit. Yet, except for simple environments and applications, analog hardware will not suffice to process and compact the raw image flow from the photosensitive array. To solve this output problem, an on-chip array of bare Boolean processors with halftoning facilities is proposed, with programmability providing versatility. For a pixel memory size of 3 bits, the authors demonstrate both the technological practicality and the computational efficiency of this programmable Boolean retina concept. Using semistatic shifting structures together with some interaction circuitry, a minimal retina Boolean processor can be built with fewer than 30 transistors and controlled by as few as six global clock signals. The successful design, integration, and test of a 65×76 Boolean retina on a 50-mm² circuit in 2-μm CMOS are described.
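The abstract does not say how the retina's halftoning is performed. As a generic illustration of halftoning, which turns gray levels into binary pixels that Boolean processors can handle, here is a sketch of ordered dithering with a 2×2 Bayer threshold matrix. This is a standard textbook technique chosen as an assumption, not the circuit described in the paper. Input levels are assumed to be integers from 0 to 4, and the function name is hypothetical.

```python
# 2x2 Bayer threshold matrix; a pixel turns on when its gray level exceeds its threshold.
BAYER_2X2 = [[0, 2],
             [3, 1]]


def ordered_dither(img):
    """Halftone a gray-level image with values 0..4 into a binary (0/1) image.

    In a uniform region of level g, each 2x2 tile ends up with exactly g pixels on.
    """
    return [[1 if v > BAYER_2X2[r % 2][c % 2] else 0 for c, v in enumerate(row)]
            for r, row in enumerate(img)]


print(ordered_dither([[2, 2], [2, 2]]))  # [[1, 0], [0, 1]]
```

After this step each pixel needs only a single bit, and later operations can use Boolean logic alone.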

13.
A new VLSI processor (the DIP chip) for image compression is presented that combines principles of multipipeline and array processing. The device is not specific to any one image compression algorithm and can be regarded as a general-purpose processor. The chip has been implemented in a 1.0-μm CMOS process on a 14.4×13.5-mm² die. An internal clock frequency of 40 MHz yields 1.2×10⁹ operations/s on 8-bit data. The main aim of the paper is to solve the problems caused by the large bandwidth required for both the image data and the instruction streams. The paper also addresses the problem of raising the array clock frequency relative to the input/output clock frequency without a large on-chip instruction cache or fast external clock speeds.

14.
This paper demonstrates an optimal-time, fully systolic algorithm for edge detection on a mesh-connected processor array. It uses only inexpensive addition and comparison operations, which makes it ideal for fine-grained parallelism in VLSI. Given an N × N image in the form of a two-dimensional array of pixels, our algorithm computes the Sobel and Laplacian operators for skimming lines in the image and then generates the Hough array using thresholding. The Hough transforms for M different angles of projection are obtained in a fully systolic manner in O(M+N) time, which is asymptotically optimal. In comparison, a previously published multiplication-free algorithm has a time complexity of O(NM). An implementation of our algorithm on a mesh-connected fine-grained processor array is discussed, which computes approximately 170,000 Hough transforms per second using a 50 MHz clock. This research was partially supported by the National Science Foundation under Grant No. MIP 8902636.
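The "addition and comparison only" claim can be illustrated for the Sobel-plus-threshold step. The sequential sketch below forms the factor of 2 in the Sobel kernels as x + x, approximates the gradient magnitude with |Gx| + |Gy| using comparisons, and thresholds the result. It does not model the systolic mesh, the Laplacian, or the Hough accumulation. The magnitude approximation, the zeroed border pixels, and the function name are assumptions for illustration.

```python
def sobel_edges(img, threshold):
    """Binary edge map from Sobel gradients using only additions, subtractions, and comparisons.

    Border pixels are left as 0. Magnitude is approximated as |gx| + |gy| (no multiply or sqrt).
    """
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        up, mid, down = img[r - 1], img[r], img[r + 1]
        for c in range(1, w - 1):
            # Right column minus left column; the centre row is weighted by 2 via x + x.
            gx = (up[c + 1] + mid[c + 1] + mid[c + 1] + down[c + 1]) - (
                up[c - 1] + mid[c - 1] + mid[c - 1] + down[c - 1])
            # Bottom row minus top row; the centre column is weighted by 2 via x + x.
            gy = (down[c - 1] + down[c] + down[c] + down[c + 1]) - (
                up[c - 1] + up[c] + up[c] + up[c + 1])
            # Absolute values via comparison and negation.
            mag = (gx if gx >= 0 else -gx) + (gy if gy >= 0 else -gy)
            edges[r][c] = 1 if mag >= threshold else 0
    return edges


# A vertical step edge between columns 1 and 2.
img = [[0, 0, 10, 10, 10] for _ in range(5)]
print(sobel_edges(img, 20)[2])  # [0, 1, 1, 0, 0]
```

Every operation in the inner loop is an add, a subtract, or a compare, which is what makes this kind of kernel attractive for very simple, fine-grained processing elements.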

15.
This paper describes a novel reconfigurable architecture for digital signal processing (DSP). This architecture consists of a two-level array of cells and interconnections. On the upper level, fundamental DSP operations such as multiplication and addition are mapped onto blocks of 4-bit cells. On the lower level, each cell uses a 4 × 4 matrix of smaller “elements” to perform the necessary computations. Cells also contain pipeline latches for increased throughput. The architecture features a simple VLSI implementation that combines the flexibility of memory elements with the speed of DOMINO logic. Initial prototypes have been fabricated using a modest 0.5-μm CMOS technology. Circuit simulations of the cell in 0.25-μm technology indicate that the design achieves a clock frequency of 200 MHz.
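The abstract says multiplications are mapped onto blocks of 4-bit cells but does not give the mapping. One common way to build a wider multiply from 4-bit pieces is to split each operand into nibbles and sum shifted partial products. The sketch below shows this for unsigned 8-bit operands, under the assumption that each cell acts as a 4×4-bit multiplier. The cell's internal 4 × 4 element matrix is not modeled, and the names are hypothetical.

```python
def mul4(a, b):
    """Stand-in for one 4-bit cell: unsigned 4x4-bit multiply (product fits in 8 bits)."""
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("operands must be 4-bit unsigned values")
    return a * b


def mul8_from_4bit_cells(x, y):
    """Unsigned 8x8-bit multiply assembled from four 4x4-bit cell products."""
    xh, xl = x >> 4, x & 0xF
    yh, yl = y >> 4, y & 0xF
    # x*y = (16*xh + xl)(16*yh + yl) = 256*xh*yh + 16*(xh*yl + xl*yh) + xl*yl
    return (mul4(xh, yh) << 8) + ((mul4(xh, yl) + mul4(xl, yh)) << 4) + mul4(xl, yl)


print(mul8_from_4bit_cells(200, 123))  # 24600
```

The same split applies recursively, so a block of 4-bit cells plus shifted additions can form 16-bit or wider products.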

16.
This paper describes a new IA-32 architecture microprocessor that implements 70 additional instructions to further accelerate data-streaming applications such as three-dimensional graphics and video encode/decode. The processor enhances the previous implementation in this family by adding these new instructions, along with circuit improvements in several key areas for higher clock frequency. The 10.17×12.10-mm² die contains 9.5 million transistors and is fabricated in a five-layer-metal 0.25-μm CMOS process, with a six-layer organic land grid array package using C4 interconnect technology. It has an operating range of 1.4-2.2 V and currently runs at up to 650 MHz.

17.
Modern reduced-instruction-set computer (RISC) chips have features that lay the groundwork for great performance. They boast circuits that can run at clock frequencies of up to 300 MHz, pipelines built to continually execute independent operations, and the ability to execute multiple instructions in a single clock cycle. The potential of this hardware can be realized only with sophisticated compiler technology. The best approach to optimal computing performance is to design a processor architecture and a compiler concurrently.

18.
This paper proposes a new class of fully parameterizable multiple-array architectures for motion estimation in video sequences based on the Full-Search Block-Matching algorithm. This class is based on a new and efficient AB2 single-array architecture with minimum latency, maximum throughput, and full utilization of the hardware resources. It allows the target processor to be configured within the bounds imposed on the configuration parameters for the algorithm setup, the processing time, and the circuit area. For this purpose, a software configuration tool has been implemented to determine the set of possible configurations that meet the requirements of a given video coder. Experimental results using both FPGA and ASIC technologies are presented. In particular, the implementation of a single-array processor configuration on a single chip is illustrated, demonstrating real-time estimation of motion vectors.
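As a behavioral reference for what the array architectures compute, here is a minimal sequential sketch of full-search block matching. It assumes the sum of absolute differences (SAD) as the matching criterion, which is common for this algorithm but not stated in the abstract. It also assumes a square n × n block and a ±p search range, and skips candidate blocks that fall outside the reference frame. The parameter and function names are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block at (bx, by) in cur and the block shifted by (dx, dy) in ref."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += abs(cur[by + i][bx + j] - ref[by + dy + i][bx + dx + j])
    return total


def full_search(cur, ref, bx, by, n, p):
    """Try every displacement in [-p, p] x [-p, p]; return (dx, dy, best_sad)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Skip candidates that would read outside the reference frame.
            if not (0 <= by + dy and by + dy + n <= h and 0 <= bx + dx and bx + dx + n <= w):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# The current frame is the reference frame shifted so the true motion is (dx, dy) = (-1, 1).
ref = [[r * 8 + c for c in range(8)] for r in range(8)]
cur = [[ref[y + 1][x - 1] if y + 1 < 8 and x >= 1 else 0 for x in range(8)] for y in range(8)]
print(full_search(cur, ref, bx=3, by=2, n=2, p=2))  # (-1, 1, 0)
```

The doubly nested displacement loop, with a fixed amount of work per candidate, is what makes full search regular enough to spread across many processing elements with full hardware utilization.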

19.
This 512K-word × 8-bit × 3-way synchronous BiCMOS SRAM uses a two-stage wave-pipeline scheme, a PLL self-timing generator, and a 0.4-μm BiCMOS process to achieve 220 MHz fully random read/write operation with a GTL I/O interface. Newly developed circuit technologies include: 1) a zig-zag double word-line scheme, 2) a centered bit-line load layout scheme, and 3) a phase-locked loop (PLL) with a multistage-tapped ring oscillator that generates a clock-cycle-proportional pulse (CCPP) and a clock-edge lookahead pulse (CELP).

20.
This paper describes the design and implementation of a dedicated Data Encryption Standard (DES) processor. The processor consists of three 0.6-μm complementary metal-oxide-semiconductor (CMOS) integrated circuits (ICs) mounted on a single MCM-D thin-film substrate. Each chip can operate on an individual data stream, or the three can be cascaded to implement the so-called "triple-DES" (3DES) function for increased security. Measurements show 3DES operation at 110 MHz, which translates to a throughput of over 7 Gb/s, the highest reported 3DES throughput to date. System features that contribute to this throughput are the use of area-array (flip-chip) input/output (I/O) and global IC power/ground/clock distribution in the MCM package. In this case, package-level distribution reduced clock skew by 150 ps and reduced the chip area required for power distribution by 20%. This paper also includes measurements of switching noise on the MCM's V_dd plane and how it correlates with a simple model of the system power distribution.
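The quoted throughput is consistent with a simple rate calculation. The following check assumes, although the abstract does not state it, that the cascaded 3DES datapath accepts and completes one 64-bit DES block per clock cycle:

```latex
R_{\mathrm{3DES}} \approx 64\ \frac{\mathrm{bit}}{\mathrm{block}} \times 110 \times 10^{6}\ \frac{\mathrm{blocks}}{\mathrm{s}}
= 7.04 \times 10^{9}\ \mathrm{bit/s} \approx 7.04\ \mathrm{Gb/s}
```

The result matches the "over 7 Gb/s" figure. A design that needed several cycles per block at the same clock would deliver proportionally less.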
