Similar literature
1.
Two new 3-D chip stacking technologies, wire-on-bump (WOB) and bump-on-flex (BOF), are proposed and demonstrated with their prototypes. The WOB and BOF technologies are for low-cost 3-D stacking of memory chips by vertical side interconnections with metal wires and flex-circuits, respectively. These new 3-D chip stacking technologies have benefits such as a shorter signal path and 3-D stackability of an unlimited number of chips compared to wire-bonded chip stacking. In the case of the BOF technology, additional active and passive components can be either surface-mounted onto or embedded into the flex-circuit, which is an added value that other chip stacking technologies have not demonstrated so far. More importantly, the WOB and BOF technologies enable lower-cost processes than Si through-via technology and are thus more suitable for memory chip stacking. This paper describes the detailed processes for our unique chip stacking structures with the vertical interconnection methods of the WOB and BOF. Finite-element modeling and thermal cycle (TC) tests are also performed to address their thermo-mechanical reliability.

2.
One major issue in designing image processors is to design a memory system that supports parallel access with a simple interconnection network. This paper presents an efficient memory allocation to minimize the number of memory modules and processing elements with a parallel access capability when multiple windows with arbitrary shapes are specified. This paper also presents an efficient search method based on regularity of window-type image processing. We give some practical examples including a stereo-matching processor for acquiring 3-D information, and an optical-flow processor for motion estimation. These examples show that the numbers of memory modules are reduced to 2.7% and 10%, respectively, in comparison with a basic approach. It is also shown that the search time is less than 1 ms for practical image sizes and window sizes.

3.
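This paper proposes a memory access reduction method for vector-radix 2-D DCT pruning. The method aims to reduce the memory accesses caused by weighting factors and signal inputs during computation. It first exploits the properties of the weighting factors to merge the butterfly units of every two adjacent stages in the computation flow graph, and then performs the computation with fewer weighting factors. A general-purpose DSP processor is used to verify the effectiveness of the method for the vector-radix 2-D DCT pruning algorithm. Experimental results show that, compared with the conventional method, the proposed method substantially reduces the number of clock cycles required, lowers the amount of memory access during computation, and occupies less memory.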

4.
We present an architecture of decoupled processors with a memory hierarchy consisting only of scratch-pad memories and a main memory. This architecture exploits the more efficient pre-fetching of decoupled processors, which make use of the parallelism between address computation and application data processing that mainly exists in streaming applications. This benefit, combined with the ability of scratch-pad memories to store data with no conflict misses and low energy per access, contributes significantly to increasing the system’s performance. The application code is split into two parallel programs. The first runs on the Access processor and computes the addresses of the data in the memory hierarchy. The second processes the application data and runs on the Execute processor, a processor with a limited address space (just the register file addresses). Each transfer of any block in the memory hierarchy up to the Execute processor’s register file is controlled by the Access processor and the DMA units. This strongly differentiates this architecture from traditional uniprocessors and existing decoupled processors with cache memory hierarchies. The architecture is compared in performance with uniprocessor architectures with (a) scratch-pad and (b) cache memory hierarchies and (c) the existing decoupled architectures, showing its higher normalized performance. The reason for this gain is the efficiency of data transfer that the scratch-pad memory hierarchy provides, combined with the ability of the decoupled processors to eliminate memory latency by using memory management techniques for transferring data instead of fixed prefetching methods. Experimental results show that the performance is increased by up to almost 2 times compared to uniprocessor architectures with scratch-pad memories and by up to 3.7 times compared to the ones with cache. The proposed architecture achieves the above performance without penalties in energy-delay product.

5.
朱玉飞, 戴紫彬, 徐进辉, 李功丽. 《电子学报》 (Acta Electronica Sinica), 2017, 45(12): 2957-2964
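Based on the cryptographic application requirements of information security devices, and incorporating the basic framework of stream-architecture processors, a stream-architecture cryptographic processor is designed. This paper focuses on the study and design of the stream memory system, the bottleneck that limits the performance of this processor. The system targets the storage characteristics of dedicated cryptographic processors and adopts a configurable design to meet the flexibility and efficiency that cryptographic applications demand of a processor memory system. The design also combines hierarchical, distributed, multi-bank storage with pipelined parallel memory access over multiple data channels and a stream memory-access scheduling policy, optimizing the access efficiency of the memory system and thereby improving the overall performance of the processor. The results show that, compared with the memory design of typical cryptographic processors, the memory access efficiency of this design is improved by up to about 6 times.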

6.
A single-chip rendering engine that consists of a DRAM frame buffer, an SRAM serial access memory, a pixel/edge processor array, and a 32-b RISC core is proposed for low-power three-dimensional (3-D) graphics in portable systems. The main features are a two-dimensional (2-D) hierarchical octet tree (HOT) array structure with bandwidth amplification, three dedicated network schemes, virtual page mapping, a memory-coupled logic pipeline, low-power operation, 7.1-GB/s memory bandwidth, and 11.1-Mpolygon/s drawing speed. The 56-mm² prototype die, integrating one edge processor, eight pixel processors, eight frame buffers, and a RISC core, is fabricated using 0.35-μm CMOS embedded memory logic (EML) technology with four poly layers and three metal layers. The fabricated test chip, which consumes 590 mW at 100 MHz and 3.3 V, is demonstrated with a host PC through a PCI bridge.

7.
The 3-D Computer     
The 3-D Computer [1]–[4] is a unique implementation of a cellular array processor. We have developed two radically new technologies which enable massive numbers of communication channels both between silicon wafers and through them. A parallel processor (single instruction-multiple data stream cellular array processor) has been designed and built to demonstrate the potential of this technological approach. While the 3-D Computer has so far been built and operated only in a small-scale implementation relative to the long-term aims of this technology, it is nevertheless an extremely powerful computer. The current feasibility demonstration 3-D Computer is a 32×32 array of processors partitioned over five wafers stacked one on top of another. The throughput of this current machine is >600 million operations per second (MOPS) with a 10 MHz clock, while the projected throughput of a full-scale machine is >100 billion operations per second (BOPS), again with a 10 MHz clock. The extension of the level of circuit integration beyond that of VLSI and WSI, which is made possible by the 3-D technologies of wafer feedthroughs and microbridges, enables us to achieve these enormous throughputs in a very compact form and at very low power. The small size and low power attributes of the 3-D Computer result from the elimination of the chip-level and board-level packaging and the intraboard wiring required by conventional levels of circuit integration.

8.
In this paper, we present a novel memory access reduction scheme (MARS) for the two-dimensional fast cosine transform (2-D FCT). It targets programmable DSPs with high memory-access latency. It reduces the number of memory accesses by: 1) reducing the number of weighting factors and 2) combining butterflies in the vector-radix 2-D FCT pruning diagram from two stages into one stage with an efficient structure. A hardware platform based on a general-purpose processor is used to verify the effectiveness of the proposed method for vector-radix 2-D FCT pruning implementation. Experimental results validate the benefits of the proposed method: fewer memory accesses, fewer clock cycles, and less memory space than the conventional implementation.

10.
A 121-mm² graphics LSI is designed and implemented for portable two-dimensional (2-D) and three-dimensional (3-D) graphics and MPEG-4 applications. The LSI contains a RISC processor with a multiply-accumulate unit (MAC), a 3-D rendering engine, a programmable power optimizer, and 29-Mb embedded DRAM. The chip is built in a 0.16-μm pure DRAM technology to reduce the fabrication cost. Texture-mapped 3-D graphics with perspective-correct address calculation and bilinear MIPMAP filtering can be realized at low power consumption with the help of depth-first clock gating, address alignment logic, and embedded DRAM. Programmable clocking allows the LSI to operate in lower power modes for various applications. The chip consumes less than 210 mW, delivering 66 Mpixels/s and 264 Mtexels/s of texture-mapped pixels with real-time special effects such as full-scene antialiasing and motion blur.

11.
When the memory that stores image and transform coefficients in a 2-D DWT processing system is implemented with a more cost-effective external memory module such as DDR DRAM, the effective memory bandwidth is shown to be significantly lower than the memory system's peak bandwidth if the conventional direct logical-to-physical memory address mapping is adopted. The low effective memory bandwidth is caused by the frequent occurrence of memory overhead cycles, which in turn is closely related to the logical memory access patterns of 2-D DWT processes. The problem becomes even more severe for the 2-D DWT processing of video. An analysis of the logical memory access patterns of multi-level 2-D DWT is carried out, and an enhanced logical-to-physical memory mapping scheme which minimizes the occurrence of memory overhead cycles is proposed. The proposed scheme is simulated, and its performance in terms of effective memory access bandwidth is evaluated and compared with the conventional direct mapping scheme.
Soon-Chieh Lim
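
As a rough illustration of why the address mapping matters for the abstract above, the following sketch counts DRAM row switches (a stand-in for overhead cycles) during a column-wise pass over an image, once with a direct raster-order mapping and once with a hypothetical tiled mapping. The row size, tile shape, image size, and all function names are assumptions made for this example; this is not the mapping scheme proposed in the paper.

```python
def row_switches(addresses, row_words):
    """Count how often consecutive accesses fall in different DRAM rows."""
    switches = 0
    prev_row = None
    for addr in addresses:
        row = addr // row_words
        if prev_row is not None and row != prev_row:
            switches += 1
        prev_row = row
    return switches


def direct_map(x, y, width):
    """Direct mapping: pixel (x, y) is stored in raster order."""
    return y * width + x


def tiled_map(x, y, width, tile_w, tile_h):
    """Hypothetical tiled mapping: each tile_w x tile_h block is contiguous."""
    tiles_per_line = width // tile_w
    tile_index = (y // tile_h) * tiles_per_line + (x // tile_w)
    offset = (y % tile_h) * tile_w + (x % tile_w)
    return tile_index * tile_w * tile_h + offset


def column_scan(width, height):
    """Column-wise traversal, as in the vertical filtering pass of a 2-D DWT."""
    return [(x, y) for x in range(width) for y in range(height)]


WIDTH, HEIGHT = 64, 64   # assumed image size
ROW_WORDS = 256          # assumed DRAM row (page) size in words
TILE_W, TILE_H = 16, 16  # assumed tile shape; one tile fills one DRAM row

scan = column_scan(WIDTH, HEIGHT)
direct = row_switches([direct_map(x, y, WIDTH) for x, y in scan], ROW_WORDS)
tiled = row_switches(
    [tiled_map(x, y, WIDTH, TILE_W, TILE_H) for x, y in scan], ROW_WORDS
)
print("row switches, direct mapping:", direct)
print("row switches, tiled mapping: ", tiled)
```

Under these assumptions the tiled mapping changes DRAM rows far less often during the column pass, which is the kind of effect an access-pattern-aware mapping aims for.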

12.
Including multiple cores on a single chip has become the dominant mechanism for scaling processor performance. Exponential growth in the number of cores on a single processor is expected to lead in a short time to mainstream computers with hundreds of cores. Scalable implementations of parallel algorithms will be necessary in order to achieve improved single-application performance on such processors. In addition, memory access will continue to be an important limiting factor on achieving performance, and heterogeneous systems may make use of cores with varying capabilities and performance characteristics. An appropriate programming model can address scalability and can expose data locality while making it possible to migrate application code between processors with different parallel architectures and variable numbers and kinds of cores. We survey and evaluate a range of multicore processor architectures and programming models with a focus on GPUs and the Cell BE processor. These processors have a large number of cores and are available to consumers today, but the scalable programming models developed for them are also applicable to current and future multicore CPUs.

13.
14.
We propose a technique for reducing the energy spent in the memory-processor interface of an embedded system during the execution of firmware code. The method is based on the idea of compressing the most commonly executed instructions so as to reduce the energy dissipated during memory access. Instruction decompression is performed on the fly by a hardware block located between processor and memory; no changes to the processor architecture are required. Hence, our technique is well suited for systems employing IP cores whose internal architecture cannot be modified. We describe a number of decompression schemes and architectures that effectively trade off hardware complexity and static code size increase for memory energy and bandwidth reduction, as shown by the experimental data we have collected by executing several test programs on different design templates.
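
The idea in the abstract above can be pictured with a minimal dictionary-based sketch: the most frequently executed instruction words are replaced by short indexes, and a small decompressor expands them again. The 32-bit word width, the 8-bit index, the 1-bit flag, the toy trace, and all function names are assumptions for illustration only; the paper's actual decompression schemes are not specified in the abstract.

```python
from collections import Counter

WORD_BITS = 32   # assumed instruction width
INDEX_BITS = 8   # assumed dictionary index width (up to 256 entries)
FLAG_BITS = 1    # assumed flag marking a fetched item as compressed or raw


def build_dictionary(trace, size):
    """Pick the `size` most frequently executed instruction words."""
    return [word for word, _ in Counter(trace).most_common(size)]


def compress(trace, dictionary):
    """Encode each word as ('idx', i) if it is in the dictionary, else ('raw', word)."""
    index_of = {word: i for i, word in enumerate(dictionary)}
    return [("idx", index_of[w]) if w in index_of else ("raw", w) for w in trace]


def decompress(stream, dictionary):
    """Software model of the block that sits between memory and processor."""
    return [dictionary[v] if kind == "idx" else v for kind, v in stream]


def fetched_bits(stream):
    """Bits transferred over the memory interface for the encoded stream."""
    return sum(
        FLAG_BITS + (INDEX_BITS if kind == "idx" else WORD_BITS)
        for kind, _ in stream
    )


# Toy execution trace: a hot loop of three instructions plus two cold ones.
trace = [0xE3A00000, 0xE2800001, 0xE3500064] * 50 + [0xE12FFF1E, 0xE1A00000]
dictionary = build_dictionary(trace, size=3)
stream = compress(trace, dictionary)

print("uncompressed bits:", len(trace) * WORD_BITS)
print("compressed bits:  ", fetched_bits(stream))
```

Because the processor only ever sees the output of `decompress`, the sketch mirrors the property stressed in the abstract: the processor itself is left unchanged.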

15.
A direct method for the computation of 2-D DCT/IDCT on a linear-array architecture is presented. The 2-D DCT/IDCT is first converted into its corresponding 1-D DCT/IDCT problem through proper input/output index reordering. Then, a new coefficient matrix factorisation is derived, leading to a cascade of several basic computation blocks. Unlike other previously proposed high-speed 2-D N × N DCT/IDCT processors that usually require intermediate transpose memory and have computation complexity O(N³), the proposed hardware-efficient architecture with distributed memory structure has computation complexity O(N² log₂ N) and requires only log₂ N multipliers. The new pipelinable and scalable 2-D DCT/IDCT processor uses storage elements local to the processing elements and thus does not require any address generation hardware or global memory-to-array routing.

16.
In this paper, a three-dimensional (3-D) memory array architecture is proposed. This new architecture is realized by stacking several cells in series vertically on each cell located in a two-dimensional array matrix. Therefore, this memory array architecture has a conventional horizontal row and column address and a new vertical row address. The total bit-line capacitance of this proposed architecture's DRAM is suppressed to 37% of normal DRAM when one bit-line has 1-Kbit cells and the same design rules are used. Moreover, an array area of 1-Mbit DRAM using the proposed architecture is reduced to 11.5% of normal DRAM using the same design rules. This proposed architecture's DRAM can realize small bit-line capacitance and small array area simultaneously. Therefore, this proposed 3-D memory array architecture is suitable for future ultrahigh-density DRAM.

17.
The first half of this paper presents the design rationale for CNAPS, a specialized one-dimensional (1-D) processor array developed by Adaptive Solutions Inc. In this context, we discuss the problem of Amdahl's law, which severely constrains special-purpose architectures. We also discuss specific architectural decisions such as the kind of parallelism, the computational precision of the processors, on-chip versus off-chip processor memory, and, most importantly, the interprocessor communication architecture. We argue that, for our particular set of applications, a 1-D architecture gives the best “bang for the buck”, even when compared to the more traditional two-dimensional (2-D) architecture. The second half of this paper describes how several simple algorithms map to the CNAPS array. Our results show that the CNAPS 1-D array offers excellent performance over a range of IP algorithms. We also briefly look at the performance of CNAPS as a pattern recognition engine because many image processing and pattern recognition problems are intimately related.

18.
A four-processor chip, for use in processor arrays for image computations, is described. The large degree of data parallelism available in image computations allows dense array implementations where all processors operate under the control of a single instruction stream. An instruction decoder shared by the four processors on the chip minimizes the pin count allocated for global control of the processors. The chip incorporates an interface to an external SRAM (static RAM) for memory expansion without glue chips. The full-custom 2-μm CMOS chip contains 56669 transistors and runs instructions at 10 MHz. Five hundred and twelve 16-b processors and 4 Mbyte of distributed external memory fit on two industry-standard cards to yield a peak throughput of 5 billion instructions per second. As image I/O can overlap perfectly with pixel computation, an array containing 128 of these chips can provide more than 600 16-b operations per pixel on 512×512 images at 30 Hz.
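
The throughput figures quoted in the abstract above can be cross-checked with a few lines of arithmetic. The sketch assumes that every processor completes one instruction per clock cycle at peak, which the abstract implies but does not state outright; the variable names are chosen for this example.

```python
PROCESSORS_PER_CHIP = 4
CHIPS = 128
CLOCK_HZ = 10_000_000        # 10 MHz instruction rate
IMAGE_PIXELS = 512 * 512
FRAME_RATE_HZ = 30

processors = PROCESSORS_PER_CHIP * CHIPS
# Assumption: one instruction per processor per clock cycle at peak.
peak_ips = processors * CLOCK_HZ
pixels_per_second = IMAGE_PIXELS * FRAME_RATE_HZ
ops_per_pixel = peak_ips / pixels_per_second

print("processors:", processors)
print("peak instructions per second:", peak_ips)
print("operations per pixel:", round(ops_per_pixel, 1))
```

The result is consistent with the abstract: 128 chips give 512 processors, roughly 5 billion instructions per second, and somewhat more than 600 operations per pixel at 30 frames per second.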

19.
In this paper, we propose an efficient pipeline architecture for the DWT 9/7 filter defined in JPEG 2000. The proposed architecture is composed of column and row processors to perform the separable 2-D DWT. Based on the rescheduling DWT algorithm, we derive a new data flow graph to shorten the critical path. The proposed 1-D column processor requires fewer pipeline registers to achieve about the same critical path as other lifting-based architectures. For the row processor, the data dependency of each lifting step is reduced to only two computation nodes, and therefore more pipeline registers can be applied to achieve higher processing speed without increasing the internal memory size in the 2-D case. That is, for an N × N image, it only requires 4N internal memory to perform the row-wise transform. For the memory bit-width analysis, we use software simulation to reduce the memory bit-width for various compression ratios. Since a portion of the information in the least significant bits of the DWT coefficients is discarded after EBCOT-tier2 processing, one can decrease the data width of the internal memory for various compression ratios of JPEG 2000 coding, especially at low bit rates. Our simulation results suggest that it is practically possible to design an energy-aware memory architecture to further reduce the power consumption in future work.

20.
In modern multimedia applications, the memory bottleneck can be alleviated with special stride data accesses. Data elements in a stride access can be retrieved in parallel with parallel memories, in which the idea is to increase memory bandwidth with several memory modules working in parallel and to feed the processor with only the necessary data. Arbitrary stride access capability with interleaved memories is described in previous research, where the skewing scheme is changed at run time according to the currently used stride. This paper presents improved schemes which are adapted to parallel memories. The proposed novel parallel memory implementation allows conflict-free accesses with all constant strides, which has not been possible in prior application-specific parallel memories. Moreover, the possible access locations are unrestricted, and the accessed data element count equals the number of memory modules. Timing and area estimates are given for an Altera Stratix FPGA and a 0.18 micrometer CMOS process with memory module counts from 2 to 32. The FPGA results show a 129 MHz clock frequency for a system with 16 memory modules when read and write latencies are 3 and 2 clock cycles, respectively. The complexity of the proposed system is shown to be a trade-off between application-specific and highly configurable parallel memory systems.
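
To make the notion of a conflict-free stride access concrete, the sketch below checks which strides let N consecutive elements land in N distinct memory modules. It compares plain low-order interleaving with a simple stride-dependent module function that shifts the address by the number of trailing zero bits of the stride. This module function, the module count, and the names are assumptions for illustration; it is a textbook-style example, not the scheme proposed in the paper, and it only models module selection, not the address within a module.

```python
def modules_hit(base, stride, n_modules, shift=0):
    """Module index for each of n_modules consecutive elements of a strided access."""
    return [((base + i * stride) >> shift) % n_modules for i in range(n_modules)]


def conflict_free(base, stride, n_modules, shift=0):
    """True if every accessed element maps to a different module."""
    return len(set(modules_hit(base, stride, n_modules, shift))) == n_modules


def stride_shift(stride):
    """Number of trailing zero bits of the stride (stride = odd * 2**shift)."""
    return (stride & -stride).bit_length() - 1


N = 8     # assumed number of memory modules (a power of two)
BASE = 5  # arbitrary start address

# Fixed low-order interleaving: module = address mod N.
fixed = {s: conflict_free(BASE, s, N) for s in range(1, 17)}
# Stride-dependent module function, reconfigured per access.
adaptive = {s: conflict_free(BASE, s, N, stride_shift(s)) for s in range(1, 17)}

print("fixed interleaving, conflict-free strides:", [s for s, ok in fixed.items() if ok])
print("stride-dependent, conflict-free strides:  ", [s for s, ok in adaptive.items() if ok])
```

With fixed interleaving only odd strides avoid conflicts, whereas the stride-dependent function handles every constant stride in the tested range, which hints at why run-time configurable schemes are attractive.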
