Similar literature
 20 similar documents found (search time: 15 ms)
1.
The architecture and implementation of a programmable video signal processor designed as a building block of a multiple instruction multiple data (MIMD)-based bus-connected multiprocessor system is presented. This system can either be constructed from several single-processor chips, or it can be integrated on a large-area integrated circuit containing several processors. The processor allows an efficient implementation of different video coding standards such as H.261, H.263, MPEG-1 and MPEG-2. It consists of a RISC processor supplemented by a coprocessor for computation-intensive convolution-like tasks, which provides a peak performance of more than 1 giga-arithmetic operations per second (GOPS). A large-area integrated circuit integrating 9 processor elements (PEs) on an area of 16.6 cm² has been designed. Due to yield considerations, redundancy concepts have been implemented that, even in the presence of production defects, result in working chips utilizing a lower number of PEs. Each PE has built-in self-test (BIST) capabilities, which allow for an independent test of itself under the control of its integrated fault-tolerant BIST controller. Defective PEs are switched off; only the PEs passing the BIST are used for video processing tasks. Prototypes have been fabricated in a 0.8 μm complementary metal-oxide-semiconductor (CMOS) process structured by masks using wafer stepping with overlapping exposures. Employing redundancy, up to 6 PEs per chip were functional at 66 MHz, thus providing a peak arithmetic performance of up to 6 GOPS.

2.
An integrated memory array processor (IMAP) ULSI with 64 processing elements and a 2-Mb SRAM has been developed for image processing. The chip attains a 3.84 GIPS peak performance through the use of SIMD parallel processing and a 1.28 GByte/s on-chip processor-memory bandwidth. The IMAP is capable of parallel indirect addressing, which increases applications for parallel algorithms. Large power consumption with the wide memory bandwidth is avoided by reducing the number of active sense amplifiers and adopting dynamic power control. Fabricated with a 0.55-μm BiCMOS double-layer metal process technology, the IMAP contains 11 million transistors in a 15.1 × 15.6 mm² die area.

3.
A low-power multimedia processor for mobile applications is presented. An 80-MHz 32-b RISC with enhanced multiplier, two 20-MHz hardware accelerators with 7.125-Mb embedded DRAM for MPEG-4 visual SP@L1 decoding and 3-D graphics processing, 2-kB dual-port SRAM, and peripheral blocks are integrated together on a single chip. MPEG-4 SP@L1 video decoding and 3-D graphics rendering with 16-b depth-buffer, alpha-blending, double-buffering, and Gouraud-shading features at 2.2-Mpolygons/s speed are realized with the help of the dedicated hardware accelerators. The architecture of the processor is optimized in terms of power consumption and performance, and various low-power circuit techniques are adopted in each hardware block. The chip is implemented using 0.18-μm embedded memory logic (EML) technology. Its area is 84 mm², and power consumption is 160 mW when all of the functions are activated.

4.
Novel circuits and a design methodology for a massively parallel processor based on the matrix architecture are introduced. A fine-grained processing element (PE) circuit for high-throughput MAC operations based on Booth's algorithm enhances the performance of a 16-bit fixed-point signed MAC, which operates at up to 30.0 GOPS/W. The dedicated I/O interface circuits are designed for converting the direction of data access and supporting the interleaved memory architecture, and they are implemented to maximize the processor core efficiency. Power management techniques for suppressing current peaks and reducing average power consumption are proposed to enhance the robustness of the macro. The circuits and the design methodology proposed in this paper are attractive for achieving a high-performance, robust massively parallel SIMD processor core employed in multimedia SoCs.
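The Booth-based signed MAC mentioned in this abstract can be illustrated in software. The following is a minimal sketch of textbook radix-2 Booth multiplication feeding an accumulator; it is not the circuit from the paper, and the names `booth_multiply` and `mac` are hypothetical, chosen only for illustration.

```python
def booth_multiply(multiplicand: int, multiplier: int, bits: int = 16) -> int:
    """Radix-2 Booth multiplication of two signed `bits`-wide integers.

    Textbook algorithm, assumed here for illustration only; the abstract
    does not say which Booth radix the PE circuit uses.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not (lo <= multiplicand <= hi and lo <= multiplier <= hi):
        raise ValueError("operand does not fit in the given bit width")

    pattern = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    product = 0
    prev = 0  # implicit 0 to the right of the least significant bit
    for i in range(bits):
        cur = (pattern >> i) & 1
        if cur == 1 and prev == 0:    # bit pair "10": start of a run of ones
            product -= multiplicand << i
        elif cur == 0 and prev == 1:  # bit pair "01": end of a run of ones
            product += multiplicand << i
        prev = cur                    # pairs "00" and "11" add nothing
    return product


def mac(acc: int, a: int, b: int) -> int:
    """Multiply-accumulate: acc + a * b, using the Booth multiplier above."""
    return acc + booth_multiply(a, b)
```

A run of consecutive ones in the multiplier costs one subtraction and one addition regardless of its length, which is the property that makes Booth recoding attractive for signed multipliers.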

5.
Design of a 3-D fully depleted SOI computational RAM   (Total citations: 1; self-citations: 0; citations by others: 1)
We introduce a three-dimensional (3-D) processor-in-memory integrated circuit design that provides progressively increasing processing power as the number of stacked dies increases, while incurring no extra design effort or mask sets. Innovative techniques for processor/memory redundancy and fast global bus evaluation are described. The architecture can be augmented with a nearest-neighbor physical 3-D communications network that can substantially reduce interconnect lengths and relieve routing congestion. The test chip, with 128 Kb of memory and 512 processing elements (PEs) on two fully depleted silicon-on-insulator (SOI) dies, can achieve a peak of 170 billion-bit-operations per second at 400 MHz.

6.
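Qiao Wen, Feng Quanyuan. Microelectronics (《微电子学》), 2012, 42(2): 164-167, 172
A baseband processor for UHF RFID tags based on the EPC Class1 Generation2 protocol is presented. Because operating range is a key metric for passive tags, and extending it requires lowering tag power consumption, a series of low-power measures are adopted, such as a dual-clock strategy with 2.56 MHz and 1.28 MHz clocks, added on/off switching of functional units, and the use of asynchronous counters. The design uses a TSMC 0.18 μm process with a 1.8 V supply voltage; power consumption is 6.4 μW and the layout measures 415 μm × 398 μm. The design was verified on a Xilinx FPGA development platform, and the test results meet the requirements of the C1G2 protocol.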

7.
The programmable video signal processor (VSP) is an important category of processors for multimedia systems. Programmable video processors combine the flexibility of programmability with special architectural features that improve performance on video processing applications. VSPs are typically multiple processors with several processing elements (PEs) and a parallel memory system. This paper focuses on the architectural design of the PEs in a video processor and shows how technology and circuit parameters influence the structure of the datapath and, hence, the overall architecture of a programmable VSP. We emphasize the need to consider technological and circuit-level issues during the design of a system architecture and present a method whereby the conceptual organization of the PEs (the number of PEs, pipelining of the datapath, size of the register file, and number of register ports) can be evaluated in terms of a target set of applications before a detailed design is undertaken. We use motion estimation and the discrete cosine transform as example applications to illustrate how various technology parameters affect the architectural design choices. We show that the design of the register file and the datapath pipeline depth can drastically affect PE utilization and, therefore, the number of PEs required for different applications. Our results demonstrate that pursuing the fastest cycle time can greatly increase the silicon area which must be devoted to PEs, due to both increased pipeline latency and reduced register file bandwidth.

8.
Embedded and portable systems running multimedia applications create a new challenge for hardware architects. A microprocessor for such applications needs to be easy to program like a general-purpose processor and have the performance and power efficiency of a digital signal processor. This paper presents the codevelopment of the instruction set, the hardware, and the compiler for the Vector IRAM media processor. A vector architecture is used to exploit the data parallelism of multimedia programs, which allows the use of highly modular hardware and enables implementations that combine high performance, low power consumption, and reduced design complexity. It also leads to a compiler model that is efficient both in terms of performance and executable code size. The memory system for the vector processor is implemented using embedded DRAM technology, which provides high bandwidth in an integrated, cost-effective manner. The hardware and the compiler for this architecture make complementary contributions to the efficiency of the overall system. This paper explores the interactions and tradeoffs between them, as well as the enhancements to a vector architecture necessary for multimedia processing. We also describe how the architecture, design, and compiler features come together in a prototype system-on-a-chip, able to execute 3.2 billion operations per second per watt.

9.
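SDRAM power model and instruction FIFO optimization   (Total citations: 2; self-citations: 0; citations by others: 2)
Ling Ming, Yang Jun, Zhang Yongxin. Chinese Journal of Electron Devices (《电子器件》), 2005, 28(4): 834-838
In embedded systems, the memory subsystem is usually critical to system performance, and because it is accessed so frequently it also typically accounts for most of the power consumption. Starting from the instruction FIFO design of an embedded RISC processor, this paper proposes a power model for SDRAM and, based on that model, an optimized instruction FIFO design. Experimental results show that a FIFO depth of 4 or 5 gives the highest performance and the lowest energy consumption.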

10.
The constant-ratio-coupled multi-grain digital synchronizer (CRC-MG synchronizer) is proposed as a means for making high-speed connections with very low power consumption, both among multiple chips such as processors, controllers, and storage devices, and among on-chip modules. The synchronizer not only provides a wide range of operating frequencies, but also locks quickly and occupies only a small area on chip. Therefore, it contributes to large reductions in power consumption and costs. It is suitable for use in various low-power systems (e.g., battery-hungry mobile appliances and low-cost consumer electronic products). Three major techniques were applied to the design: 1) a multi-grain structure for the delay elements, which greatly reduces the number of gates while facilitating locking in a very small number of clock cycles; 2) constant-ratio-coupled (CRC) delay lines (measurement versus generation) for flexible selection of the input-output delay; and 3) a new lock stage decision circuit (LSDC) scheme, conferring excellent testability. Moreover, the architecture is all-digital, and thus it has high process portability. By applying these techniques to a DDR memory interface circuit for a mobile application processor fabricated in 130-nm technology, we were able to reduce power consumption by 42% and chip area by 65% compared with a conventional implementation. Furthermore, the novel design spans a frequency range covering 12 times the minimum frequency.

11.
In this paper, we consider programmable tightly-coupled processor arrays consisting of interconnected small light-weight VLIW cores, which can exploit both loop-level parallelism and instruction-level parallelism. These arrays are well suited for compute-intensive nested loop applications, often providing a higher power and area efficiency compared with commercial off-the-shelf processors. They are ideal candidates for accelerating the computation of nested loop programs in future heterogeneous systems, where energy efficiency is one of the most important design goals for overall system-on-chip design. In this context, we present a novel design methodology for the mapping of nested loop programs onto such processor arrays. Key features of our approach are: (1) design entry in the form of a functional programming language and loop parallelization in the polyhedron model, and (2) support of zero-overhead looping not only for innermost loops but also for arbitrarily nested loops. Processors of such arrays are often limited in instruction memory size to reduce the area and power consumption. Hence, (3) we present methods for code compaction and code generation, and integrated these methods into a design tool. Finally, (4) we evaluated selected benchmarks by comparing our code generator with the Trimaran and VEX compiler frameworks. As the results show, our approach can reduce the size of the generated processor codes by up to 64% (Trimaran) and 55% (VEX) while at the same time achieving a significantly higher throughput.

12.
The implementation of a 2-core, multi-threaded Itanium family processor   (Total citations: 1; self-citations: 0; citations by others: 1)
The design of the high-end server processor code-named Montecito incorporated several ambitious goals requiring innovation. The most obvious was the incorporation of two legacy cores on-die while at the same time reducing power by 23%. This is an effective 325% increase in MIPS per watt, which necessitated a holistic focus on power reduction and management. The next challenge in the implementation was to ensure robust and high-frequency circuit operation in the 90-nm process generation, which brings with it higher leakage and greater variability. Achieving this goal required new methodologies for design, a greatly improved and tunable clock system, and a better understanding of our power grid behavior, all of which required new circuits and capabilities. The final aspect of circuit design improvement involved the I/O design for our legacy multi-drop system bus. To properly feed the two high-frequency cores with memory bandwidth, we needed to ensure frequency headroom in the operation of the bus. This was achieved through several innovations in controllability and tuning of the I/O buffers, which are discussed as well.

13.
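Design and implementation of a cache memory   (Total citations: 3; self-citations: 0; citations by others: 3)
As chip integration density increases, inserting a fast, small-capacity cache memory as a buffer between the high-speed CPU and the slower main memory resolves the speed mismatch between the two and greatly improves overall microprocessor performance. Starting from the structure and basic theory of caches and combining theory with practice, this paper describes how the cache is implemented in a 32-bit high-performance, low-power embedded microprocessor, discusses each stage from RTL design to layout design, and presents the implementation of the full-custom circuits and layout of the module.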

14.
Memory circuit architecture (decoder, cell, cell array, and sense circuit) is surveyed, with emphasis on implementing a memory with fast access and low power consumption. Recent progress in fabrication and circuit technology has improved memory performance. An AC powering scheme, instead of the earlier DC system, has been developed. The AC powering scheme eliminates complicated timing control, which restricts shortening access time, but introduces large power consumption and in-phase powering problems. A parallel decoding scheme that decreases the number of decoding stages is presented. It will decrease the decoding time and AND scheme decoder. An attractive OR-inverter scheme has been proposed for a decoder suitable for a memory with a large capacity. The chip performance strongly depends not only on whether the read mode is destructive or nondestructive but also on the cell connection method, which determines the line inductance. Because the cell input line inductance depends on layered construction of the lines, a planarizing technology for an Nb Josephson integrated circuit has been developed to reduce line inductance by thinning the insulators. Access time of less than 0.5 ns has been confirmed in 1-kb and 4-kb memories using the proposed memory architecture.

15.
As technology evolves into the deep submicron level, synchronous circuit designs based on a single global clock have incurred problems in such areas as timing closure and power consumption. An asynchronous circuit design methodology is one of the strong candidates to solve such problems. To verify the feasibility and efficiency of a large-scale asynchronous circuit, we design a fully clockless 32-bit processor. We model the processor using an asynchronous HDL and synthesize it using a tool specialized for asynchronous circuits with a top-down design approach. In this paper, two microarchitectures, basic and enhanced, are explored. The results from a pre-layout simulation utilizing 0.13-μm CMOS technology show that the performance and power consumption of the enhanced microarchitecture are respectively improved by 109% and 30% with respect to the basic architecture. Furthermore, the measured power efficiency is about 238 μW/MHz and is comparable to that of a synchronous counterpart.

16.
A monolithic integrated high-gain limiting amplifier for future optical-fiber receivers is described. It is characterized by the following features: high insertion-voltage gain (maximum 54 dB); high input dynamic range (about 52 dB) at constant output-voltage swing (400 mV peak-to-peak); high operating speed (up to at least 4 Gb/s); low power dissipation (350 mW at 50-Ω load); standard supply voltage (5 V); 50-Ω output buffer; one-chip solution; and low fabrication costs through the use of a 2-μm standard bipolar technology without needing polysilicon self-aligning processes. The good values of operating speed and power consumption, which the authors believe have until now not nearly been achieved by other comparable bipolar amplifier ICs, are a result of careful circuit design and optimization. The amplifier was extended to a high-sensitivity (amplitude and time) decision circuit operating at up to 4.0 Gb/s by adding a high-speed master-slave D-flip-flop IC fabricated with the same technology.

17.
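A driver chip integrating dual half-bridges and four power switches is designed. It uses a symmetric dual-channel design in which enable, bootstrap, and drive can be controlled independently for each channel. The chip incorporates a high-precision reference source and an LDO circuit, and provides undervoltage lockout, overvoltage protection, and overtemperature protection. Dead-time control prevents large shoot-through currents through the high-side and low-side power transistors, and the bootstrap design allows the turn-on voltage of the high-side power transistor to reach 5 V, which reduces the losses of the power transistor itself and lets the power transistor output reach 11.90 V. The chip was taped out in a TSMC 0.18 μm BCD process. Test results show output square-wave amplitudes of 11.96 V/11.95 V, dead times of 60 ns/80 ns, and static consumption as low as 478 μA.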

18.
Analog circuit techniques can be beneficially applied to reduce the circuit complexity and power consumption of motion estimation processors for digital video encoding. However, analog circuits are sensitive to mismatch, which affects motion estimation. This paper presents the design of an analog motion estimation processor which overcomes these limitations. A novel architecture is described featuring pixel reuse and input offset error cancellation. The proof-of-concept realization was fabricated in 0.8-μm CMOS, and operates on 4×4 pixel blocks and a search area of 8×8 pixels. However, the architecture is scalable to larger block sizes and more advanced technologies. Measured results for various QCIF video sequences at 15 f/s showed excellent PSNR performance. The prototype dissipates 0.9 mW of power from a single 3-V power supply and occupies an area of 0.95 mm². Energy consumption is 1.51 nJ per motion vector.
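For readers unfamiliar with block-matching motion estimation, the following is a minimal digital sketch using the block and search-area sizes quoted in this abstract (a 4×4 block inside an 8×8 search area). The matching criterion (sum of absolute differences), the exhaustive search, and the function names `sad` and `full_search` are assumptions made for illustration; the abstract does not state which criterion or search order the analog processor uses.

```python
def sad(block, window, dy, dx):
    """Sum of absolute differences between `block` and the same-sized
    patch of `window` whose top-left corner is at row dy, column dx."""
    n = len(block)
    return sum(
        abs(block[y][x] - window[y + dy][x + dx])
        for y in range(n)
        for x in range(n)
    )


def full_search(block, window):
    """Try every placement of `block` inside `window` and return
    (best_sad, dy, dx) for the first placement with the lowest SAD."""
    n, m = len(block), len(window)
    best = None
    for dy in range(m - n + 1):
        for dx in range(m - n + 1):
            cost = sad(block, window, dy, dx)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


# Hypothetical test data: an 8x8 search area with unique pixel values,
# and a 4x4 block cut from it at row 2, column 3.
window = [[8 * y + x for x in range(8)] for y in range(8)]
block = [row[3:7] for row in window[2:6]]
result = full_search(block, window)
```

With these sizes the search evaluates 5 × 5 = 25 candidate positions per block, each requiring 16 absolute differences, which gives a sense of the arithmetic load that the analog approach aims to reduce.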

19.
This work presents the design and implementation of a 2.4 GHz low-power fast-settling frequency-presetting PLL frequency synthesizer in a 0.18 μm CMOS process. A low-power mixed-signal LC VCO, a low-power dual-mode prescaler, and a digital processor with non-volatile memory are developed to greatly reduce the power consumption and the settling time. The digital processor can automatically calibrate the presetting frequency and accurately preset the frequency of the VCO under process variations. The experiment...

20.
A system design for performing low-level image processing tasks in real time is presented. The design is based on large processor-per-pixel arrays implemented using integrated circuit technology. Two integrated circuit architectures are summarized: an associative parallel processor and a parallel processor employing DRAM cells. In both architectures, the layout pitch of one-bit-wide logic is matched to the pitch of memory cells to form high-density processing element arrays. The system design features an efficient control path implementation, providing high processing element array utilization without demanding complex controller hardware. Sequences of array instructions are generated by a host computer before processing begins, then stored in a simple controller. Once processing begins, the host computer initiates stored sequences to perform pixel-parallel operations. A programming framework implemented using the C++ programming language supports application development. A prototype system employs associative parallel processor devices, a controller, and the programming framework. Three sample applications, smoothing and segmentation, median filtering, and optical flow, establish the suitability of the system for real-time image processing.
