首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
This paper describes the design and implementation of the massively parallel processor based on the matrix architecture which is suitable for portable multimedia applications. The proposed architecture in this paper achieves the high performance of 40 GOPS in the case of consecutive fixed-point 16-bit additions at 200MHz clock frequency and the small power dissipation of 250mW. In addition, 1Mbit SRAM for data registers and 2048 2-bit-grained processing elements connected by a flexible switching network are integrated in the small area of 3.1 mm 2 in 90nm CMOS low standby technology. These design techniques and architectures described in this paper are attractive for realizing area-efficient, energy-efficient, and high-performance multimedia processors  相似文献   

2.
We present a systematic methodology to support the design tradeoffs of array processors in several emerging issues, such as (1) high performance and high flexibility, (2) low cost, low power, (3) efficient memory usage, and (4) system-on-a-chip or the ease of system integration. This methodology is algebraic based, so it can cope with high-dimensional data dependence. The methodology consists of some transformation rules of data dependency graphs for facilitating flexible array designs. For example, two common partitioning approaches, LPGS and LSGP, could be unified under the methodology. It supports the design of high-speed and massively parallel processor arrays with efficient memory usage. More specifically, it leads to a novel systolic cache architecture comprising of shift registers only (cache without tags). To demonstrate how the methodology works, we have presented several systolic design examples based on the block-matching motion estimation algorithm (BMA). By multiprojecting a 4D DG of the BMA to 2D mesh, we can reconstruct several existing array processors. By multiprojecting a 6D DG of the BMA, a novel 2D systolic array can be derived that features significantly improved rates in data reusability (96%) and processor utilization (99%).  相似文献   

3.
4.
In this paper PAPRICA, a massively parallel coprocessor devoted to the analysis of bitmapped images is presented considering first the computational model, then the architecture and its implementation, and finally the performance analysis. The main goal of the project was to develop a subsystem to be attached to a standard workstation and to operate as a specialized processing module in dedicated systems. The computational model is strongly related to the concepts of mathematical morphology, and therefore the instruction set of the processing units implements basic morphological transformations. Moreover, the specific processor virtualization mechanism allows to handle and process multiresolution data sets. The actual implementation consists of a mesh of 256 single bit processing units operating in a SIMD style and is based on a set of custom VLSI circuits. The architecture comprises specific hardware extensions that significantly improved performances in real-time applications.  相似文献   

5.
嵌入式Flash CISC/DSP微处理器的研究与实现   总被引:1,自引:0,他引:1       下载免费PDF全文
卢结成  丁丁  丁晓兵  朱少华 《电子学报》2003,31(8):1252-1254
本文研究一种新的既具有微控制器功能,又有增强DSP功能的高性能微处理器的实现架构.在统一的增强CISC指令集下,我们将基于哈佛和寄存器-寄存器结构的微处理器模块和单周期乘法/累加器、桶形移位寄存器、无开销循环及跳转硬件支持模块、硬件地址产生器等DSP功能模块以及嵌入式Flash Memory和指令队列缓冲器有机的集成起来,在统一架构下通过单核实现CISC/DSP微处理器,有效地提高了处理器的性能.该微处理器采用0.35μm CMOS工艺实现,芯片面积为25mm2.在80M工作频率下,动态功耗为425mW,峰值数据处理能力可达80MIPS.该处理器核可满足片上系统(SOC)对高性能处理器的需求.  相似文献   

6.
IEEE 802.11 wireless LANs have a tremendous market potential, as they support high data rates, work in the world-wide licence free 2.4 GHz ISM band and have high performance in terms of range and power consumption. Hence, the development of a communication processor that supports the IEEE 802.11 MAC functions has a significant potential, making very important the concept of having an ASIC ‘right from the first time’, in order to minimize development cost and to meet short time-to-market requirements. This paper presents the methodology of developing such a component with emphasis on the rapid prototyping approach. The chip architecture, which is based on an ARM processor core, is described in detail, focusing on the implementation of the protocol functions using custom hardware modules. Finally, the paper presents experimental results on the ASIC implementation.  相似文献   

7.
HMAC􀀁MD5 算􀀂法的硬件实现   总被引:1,自引:0,他引:1       下载免费PDF全文
信息安全体系中的消息验证是一个非常重要的方面。采用以散列函数为基础的消息验证编码是其中的一种重要方法。现提出了硬件实现一种以MD5算法为基础的消息验证编码(HMAC-MD5)的电路结构。该电路结构通过对MD5核心运算模块的复用,缩小了电路规模,达到了较高的处理速度。用Verilog HDL描述电路结构,并且在FPGA上验证了该结构的正确性。  相似文献   

8.
The implementation method of the IEEE 802.11 Medium Access Control (MAC) protocol is mainly based on DSP (Digital Signal Processor)/ARM (Advanced Reduced instruction set computer Machine) processor or DSP/ARM IP (Intellectual Property) core. This paper presents a method based on Nios II soft-core processor embedded in Altera's Cyclone FPGA (Field Programmable Gate Array) and MicroC/OS-II RTOS (Real-Time Operation System). The benefits and drawbacks of above methods are compared, and then the method presented in this paper is described. The hardware and software partitioning are discussed; the hardware architecture is also illustrated and the MAC software programming is described in detail. The presented method has some advantages, such as low cost, easy-implementation and very suitable for the implementation of IEEE 802.11 MAC in research stage.  相似文献   

9.
We present a design framework that consists of a high-throughput, parallel, and scalable elliptic curve cryptographic (ECC) processor, and its cost-effectiveness methodology for the design exploration. A two-phase scheduling methodology is proposed to optimize the ECC arithmetic over both ${rm GF}(p)$ and ${rm GF}(2^m)$. Based on the methodology, a parallel and scalable ECC architecture is also proposed. Our dual-field ECC architecture supports arbitrary elliptic curves and arbitrary finite fields with different field sizes. The optimization to a variety of applications with different area/throughput requirements can be achieved rapidly and efficiently. Using 0.13-$mu$m CMOS technology, a 160-bit ECC processor core is implemented, which can perform elliptic-curve scalar multiplication in 340 $mu$s over ${rm GF}(p)$ and 155 $mu$s over ${rm GF}(2^m)$, respectively. The comparison of speed and area overhead among different ECC designs justifies the cost-effectiveness of the proposed ECC architecture with its design methodology.   相似文献   

10.
《IEE Review》2001,47(3):38-40
The author reports on the theory and practice behind an innovative 32 bit RISC processor core, whose architecture can be customised to provide the optimum design solution for processor-based application-specific integrated circuits. In its basic configuration, the ARC processor, the Tangent-A4, is a four-stage pipeline device, with instructions, data and address formats all 32 bit. It also boasts separate instruction and data buses (Harvard architecture), data and instruction caches, and a unique host interface (parallel, JTAG or user defined), giving external devices access to the internal registers and memory  相似文献   

11.
We describe the design and implementation of an asynchronous discrete cosine transform/inverse discrete cosine transform (DCT/IDCT) processor core compliant with the CCITT recommendation H.261. First, a micropipelined implementation with level-sensitive latches is shown. This is improved by replacing the level-sensitive latches with dual-edge triggered flip-flops to save power and using completion-detection adders in the critical stage of the pipeline to exploit the data-dependent processing delay. Gate-level simulation of extracted layouts indicates that the performance of asynchronous implementations is comparable with that of a synchronous implementation based on an identical architecture. This is because part of the penalty introduced by handshaking circuitry in an asynchronous pipeline can be recovered by exploiting data-dependent processing delays with completion-detection circuitry. In pipelines with significant arithmetic processing such as the DCT/IDCT processor, this is easily accomplished. Our results are encouraging because asynchronous designs do not employ global clocking. In the near future when clock generation, clock distribution, and the power consumed in the clock circuitry become limiting factors in the design of large synchronous application specific integrated circuits (ASICs), asynchronous implementation methodology could be pursued as a real alternative  相似文献   

12.
As technology evolves into the deep submicron level, synchronous circuit designs based on a single global clock have incurred problems in such areas as timing closure and power consumption. An asynchronous circuit design methodology is one of the strong candidates to solve such problems. To verify the feasibility and efficiency of a large‐scale asynchronous circuit, we design a fully clockless 32‐bit processor. We model the processor using an asynchronous HDL and synthesize it using a tool specialized for asynchronous circuits with a top‐down design approach. In this paper, two microarchitectures, basic and enhanced, are explored. The results from a pre‐layout simulation utilizing 0.13‐μm CMOS technology show that the performance and power consumption of the enhanced microarchitecture are respectively improved by 109% and 30% with respect to the basic architecture. Furthermore, the measured power efficiency is about 238 μW/MHz and is comparable to that of a synchronous counterpart.  相似文献   

13.
This 64-b microprocessor is the second-generation design of the new Itanium architecture, termed explicitly parallel instruction computing (EPIC). The design seeks to extract maximum performance from EPIC by optimizing the memory system and execution resources for a combination of high bandwidth and low latency. This is achieved by tightly coupling microarchitecture choices to innovative circuit designs and the capabilities of the transistors and wires in the 0.18-/spl mu/m bulk Al metal process. The key features of this design are: a short eight-stage pipeline, 11 sustainable issue ports (six integer, four floating point, half-cycle access level-1 caches, 64-GB/s level-2 cache and 3-MB level-3 cache), all integrated on a 421 mm/sup 2/ die. The chip operates at over 1 GHz and is built on significant advances in CMOS circuits and methodologies. After providing an overview of the processor microarchitecture and design, this paper describes a few of these key enabling circuits and design techniques.  相似文献   

14.
This paper presents a 2-D DCT/IDCT processor chip for high data rate image processing and video coding. It uses a fully pipelined row–column decomposition method based on two 1-D DCT processors and a transpose buffer based on D-type flip-flops with a double serial input/output data-flow. The proposed architecture allows the main processing elements and arithmetic units to operate in parallel at half the frequency of the data input rate. The main characteristics are: high throughput, parallel processing, reduced internal storage, and maximum efficiency in computational elements. The processor has been implemented using standard cell design methodology in 0.35 μm CMOS technology. It measures 6.25 mm2 (the core is 3 mm2) and contains a total of 11.7 k gates. The maximum frequency is 300 MHz with a latency of 172 cycles for 2-D DCT and 178 cycles for 2-D IDCT. The computing time of a block is close to 580 ns. It has been designed to meets the demands of IEEE Std. 1,180–1,990 used in different video codecs. The good performance in the computing speed and hardware cost indicate that this processor is suitable for HDTV applications. This work was supported by the Spanish Ministry of Science and Technology (TIC2000-1289).
  相似文献   

15.
Many useful DSP algorithms have high dimensions and complex logic. Consequently, an efficient implementation of these algorithms on parallel processor arrays must involve a structured design methodology. Full-search block-matching motion estimation is one of those algorithms that can be developed using parallel processor arrays. In this paper, we present a hierarchical design methodology for the full-search block matching motion estimation. Our proposed methodology reduces the complexity of the algorithm into simpler steps and then explores the different possible design options at each step. Input data timing restrictions are taken into consideration as well as buffering requirements. A designer is able to modify system performance by selecting some of the algorithm variables for pipelining or broadcasting. Our proposed design strategy also allows the designer to study time and hardware complexities of computations at each level of the hierarchy. The resultant architecture allows easy modifications to the organization of data buffers and processing elements-their number, datapath pipelining, and complexity-to produce a system whose performance matches the video data sample rate requirements.  相似文献   

16.
Distributed integrated circuits are presented as a methodology to design high-frequency communication building blocks. Distributed circuits operate based on multiple parallel signal paths working in synchronization that can be used to enhance the frequency of operation, combine power, and enhance the robustness of the design. These multiple signal paths usually result in strong couplings inside the circuit that necessitate a treatment spanning architecture, circuits, devices, and electromagnetic levels of abstraction  相似文献   

17.
A computational circuit is custom-designed hardware which promises to offer maximum speedup of computationally intensive software algorithms. However, the practical needs to manage development cost and many low-level physical design details erodes much of the potential speedup by distracting attention away from high-level architectural design. Instead, designers need an inexpensive, processor-like platform where computational circuits can be rapidly synthesized and simulated. This enables rapid architectural evolution and mitigates the risk of producing custom hardware. In this paper we present a tool flow (RVETool) for compiling computational circuits into a massively parallel processor array (MPPA). We demonstrate the CAD runtime is on average 70× faster than FPGA tools, with a circuit speed 5.8× slower than FPGA devices. Unlike the fixed logic capacity of FPGAs, RVETool can trade area for simulation performance by targeting a wide range in the number of processor cores. We also demonstrate tool scalability to very large circuits, synthesizing, placing, and routing a ≈1.6 million gate random circuit in 54 min.  相似文献   

18.
马旭  陈杰   《电子器件》2007,30(2):415-418
提出了一种面向视频处理应用的二维8X8IDCT(反离散余弦变换)处理器结构.该处理器设计利用了IDCT算法中的对称性,采用基于并行的乘累加器的结构加快处理速度.设计过程中对于有限位宽对运算结果误差及精度的影响进行了仿真与分析,并根据要求确定了运算位宽的优化值,在满足精度的条件下使芯片的面积开销最小.该处理器核的面积为48K逻辑门,能够在0.64μs内完成对一个数据块的运算,可以满足对高清晰度视频实时解码的性能需求.  相似文献   

19.
The first implementation of the IA-64 architecture achieves high performance by using a highly parallel execution core, while maintaining binary compatibility with the IA-32 instruction set. Explicitly parallel instruction computing (EPIC) design maximizes performance through hardware and software synergy. The processor contains 25.4 million transistors and operates at 800 MHz. The chip is fabricated in a 0.18-μm CMOS process with six metal layers and packaged in a 1012-pad organic land grid array using C4 (flip chip) assembly technology. A core speed back-side bus connects the processor to a 4-MB L3 cache  相似文献   

20.
蔡宇  张浩  罗飞  贺光辉  周祖成 《半导体技术》2005,30(3):50-53,61
研究了无线局域网网卡的实现框架.整个网卡是一个嵌入式实时系统,主要涉及MAC控制器等.它建立于SOC的基础之上,在Xilinx的EDK工具下进行软硬件的协同设计.其硬件实现包括32位的MicroBlaze软核,通用IP和一些用户定制的IP,以及它们总线连接实现.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号