Similar Documents
A total of 20 similar documents were found.
1.
This paper describes debug facilities in the Philips TriMedia CPU64, which is an embedded processor core for multimedia applications. Its architecture provides a VLIW pipeline, support for 64-bit vector data, and virtual memory management. The debug hardware in the TriMedia CPU64 supports two complementary debug strategies. One strategy provides a snapshot of the processor state at certain moments in time, which is achieved by single-step execution and various breakpoint types. The other debug strategy provides continuous monitoring of the processor state by using a PC trace buffer. Precise exceptions are used to provide accurate context switching from application software to debugger software.

2.
Mpact media processors enable powerful, flexible and cost-effective multimedia in a PC. A single chip replaces today's multiboard, multichip solutions for graphics, video, audio, and communications. The architecture combines a high-bandwidth RAMBUS memory, VLIW/SIMD (single instruction, multiple data) processing, standard buses, and software programmability for the cost of a modern graphics chip. Mpact architecture uses a modified VLIW style with two RISC-like instructions per VLIW. The instructions are either executed sequentially or concurrently based on a tag in the VLIW. Classical VLIW suffers from low code density due to unused instruction fields, but the Mpact modified VLIW has the same code density as RISC instructions. Additionally, the SIMD instructions improve code density by increasing the work done by each instruction. An 8-byte word size was chosen to balance vector and scalar performance and also to balance data and instruction bandwidth. A 9-bit byte was chosen to represent color-component differences in one byte and to represent 18-bit color or 18-bit audio samples in two bytes. Hardware-dithered rounding of quantization noise allows most audio to be processed in two-byte precision. The maximal multiplier precision of 24×24 was chosen for audio requirements. The article reviews the first-generation Mpact media processor and then describes the multimedia performance goals and architecture of Chromatic's second-generation media processor architecture. It then presents newer modules of the architecture in more detail.
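The tag-based issue scheme can be pictured with a small C model. This is only an illustrative sketch: the struct layout, field names, and the cycle bookkeeping below are assumptions rather than the actual Mpact encoding; only the idea that a tag selects concurrent versus sequential execution of the two RISC-like slots comes from the abstract.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of one Mpact-style VLIW word: two RISC-like slots
 * plus a tag that tells the issue logic whether to run them in the same
 * cycle or back-to-back. Field names and widths are illustrative. */
typedef struct {
    uint32_t slot0;      /* first RISC-like instruction  */
    uint32_t slot1;      /* second RISC-like instruction */
    bool     concurrent; /* tag: issue both slots in one cycle? */
} vliw_word_t;

static void dispatch(unsigned cyc, uint32_t insn)
{
    printf("cycle %u: issue %08x\n", cyc, (unsigned)insn);
}

/* Because every slot is a full instruction, code density matches a
 * plain RISC encoding even when the tag forces sequential execution. */
static unsigned issue(const vliw_word_t *w, unsigned cycle)
{
    dispatch(cycle, w->slot0);
    if (w->concurrent) {
        dispatch(cycle, w->slot1);      /* second slot in the same cycle */
        return cycle + 1;
    }
    dispatch(cycle + 1, w->slot1);      /* second slot one cycle later   */
    return cycle + 2;
}

int main(void)
{
    vliw_word_t pair = { 0x12345678u, 0x9abcdef0u, true };
    unsigned next = issue(&pair, 0);    /* concurrent pair               */
    pair.concurrent = false;
    issue(&pair, next);                 /* sequential pair               */
    return 0;
}
```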

3.
A 2-μm CMOS VLSI digital signal processor (DSP) family, the SP50, is described that is capable of eight million instructions per second and up to six concurrent operations in each instruction. Two DSPs, the PCB5010 and PCB5011, have been developed. Both are based on a common architecture which contains two 16-bit data buses, and a 16×16→40-bit multiplier accumulator and 16-bit ALU, both with multiprecision support in hardware. Also implemented are two static data RAMs (128×16 or 256×16), a data ROM (51×16), a 15-word three-port register file, three address computation units, and five serial and parallel I/O interfaces. The data path is controlled by an orthogonal instruction set, using 40-bit microcode words. The controller contains a five-level stack and an instruction repeat register, and can have either on-chip program memory (RAM: 32×40; ROM: 987×40) or off-chip program memory (up to 64K×40). Benchmarks show a two to sixfold improvement in overall performance over its predecessors.

4.
This superscalar microprocessor is the first implementation of a 32-bit RISC architecture specification incorporating a single-instruction, multiple-data vector processing engine. Two instructions per cycle plus a branch can be dispatched to two of seven execution units in this microarchitecture designed for high execution performance, high memory bandwidth, and low power for desktop, embedded, and multiprocessing systems. The processor features an enhanced memory subsystem, 128-bit internal data buses for improved bandwidth, and 32-KB eight-way instruction/data caches. The integrated L2 tag and cache controller with a dedicated L2 bus interface supports L2 cache sizes of 512 KB, 1 MB, or 2 MB with two-way set associativity. At 450 MHz, and with a 2-MB L2 cache, this processor is estimated to have a floating-point and integer performance metric of 20 while dissipating only 7 W at 1.8 V. The 10.5 million transistor, 83-mm² die is fabricated in a 1.8-V, 0.20-μm CMOS process with six layers of copper interconnect.

5.
To meet the low-power, low-cost, high-efficiency, and flexible-upgrade requirements of mobile-terminal processors, this paper analyzes the parallelism of LTE-A baseband algorithms and proposes a vector processor with a hybrid Very Long Instruction Word (VLIW) and Single Instruction Multiple Data (SIMD) architecture as a software-baseband solution for the terminal. The vector processor uses variable-length VLIW instruction words and provides seven vector data paths, each of which can execute sixteen 16-bit fixed-point operations. A banked coefficient memory improves flexibility, and a register file with restricted access reduces circuit area. Dedicated SHUF and ISHUF instructions are designed for the vectorized implementation of the Fast Fourier Transform (FFT) and Viterbi decoding algorithms. Finally, the FFT and Viterbi decoding algorithms are implemented and analyzed.

6.
We propose two new instructions, swperm and sieve, that can be used to efficiently complete an arbitrary bit-level permutation of an n-bit word with or without repetitions. Permutations with repetitions are rearrangements of an ordered set in which elements may replace other elements in the set; such permutations are useful in cryptographic algorithms. On a four-way superscalar processor, we can complete an arbitrary 64-bit permutation with repetitions of 1-bit subwords in 11 instructions and only four cycles using the two proposed instructions. For subwords of size 4 bits or greater, we can perform an arbitrary permutation with repetitions of a 64-bit register in a single cycle using a single swperm instruction. This improves upon previous results by requiring fewer instructions to permute 4-bit or larger subwords packed in a 64-bit register and fewer execution cycles for 1-bit subwords on wide superscalar processors. We also demonstrate that we can accelerate the performance of the popular DES block cipher using the proposed instructions. We obtain a DES performance improvement of at least 55% in constrained embedded environments and an improvement of 71% on a four-way superscalar processor when applying DES as a cryptographic hash function.
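As a plain-software reference for what these instructions accelerate, the loop below builds an arbitrary 64-bit permutation with repetitions from an index table: destination bit i is copied from source bit idx[i], and repeated table entries give the "with repetitions" behaviour used in ciphers such as DES. It is a sketch of the operation itself, not of the swperm/sieve instruction semantics.

```c
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* dst bit i takes its value from src bit idx[i]; idx[] entries may
 * repeat, so a single source bit can fan out to many result bits. */
static uint64_t permute_bits(uint64_t src, const uint8_t idx[64])
{
    uint64_t dst = 0;
    for (int i = 0; i < 64; i++)
        dst |= ((src >> idx[i]) & 1ULL) << i;
    return dst;
}

int main(void)
{
    /* Toy table: replicate the low byte of src eight times. */
    uint8_t idx[64];
    for (int i = 0; i < 64; i++)
        idx[i] = (uint8_t)(i % 8);

    printf("%016" PRIx64 "\n", permute_bits(0xA5u, idx));
    return 0;
}
```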

7.
A novel processor with micro-pipelined architecture is proposed for latch-type Josephson logic devices. The processor is segmented into several operating stages activated by a multi-phase power system. Independent register groups are allocated to each stage in order to support pipeline processing of several instruction streams. This architecture allows building of a fine pipeline pitch processor which is capable of MIMD processing. A 12-bit micro-pipelined Josephson processor, containing an ALU, a multiplier and 16 registers, is described. Driven by a 3-phase AC power system, it is able to process 4 instruction streams simultaneously. A pipeline pitch of 3.3 GHz is expected using conventional Josephson device technology. A 4-bit processor design for 12-bit data length is also discussed.

8.
This work presents a detailed case study in customizing a configurable, extensible, 32-bit RISC processor with vector/SIMD instruction extensions for the efficient execution of block-based video-coding algorithms utilizing a proprietary co-design environment. In addition to the default Full-Search motion estimation of the MPEG-2 Test Model 5, fourteen fast ME algorithms were implemented in both scalar and vector form. Results demonstrate a reduction of up to 68% in the dynamic instruction count of the full search-based encoder whereas the fast motion estimation algorithms achieved a reduction in instruction count of nearly 90%, both accelerated via three 128-bit vector/SIMD instructions when compared to the scalar, reference implementation of the standard. We address in detail the profiling, vectorization and the development of these vector instruction set extensions, discuss in depth the implementation of a parametric vector accelerator that implements these instructions and show the introduction of that accelerator into a 32-bit RISC processor pipeline, in a closely-coupled configuration.
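For readers unfamiliar with the kernel being vectorized: full-search motion estimation repeatedly evaluates a sum of absolute differences (SAD) between a current block and candidate reference blocks. A scalar C version of the 16×16 SAD is sketched below (the block size and the stride parameter are illustrative); it is this kind of tight byte-wise loop that maps naturally onto 128-bit vector/SIMD instructions.

```c
#include <stdint.h>
#include <stdlib.h>

#define BLK 16  /* 16x16 macroblock, as in MPEG-2 motion estimation */

/* Scalar sum-of-absolute-differences between a current block and one
 * candidate reference block; 'stride' is the row pitch of both frames. */
static unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++)
            sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}
```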

9.
邓晴莺 (Deng Qingying), 张民选 (Zhang Minxuan). 电子学报 (Acta Electronica Sinica), 2008, 36(2): 392-396
The register file is a critical component of high-performance processor design, and the register stack and register stack engine are important means of improving its performance. Compiler optimizations are usually based on a specific architecture and target machine. For the EDSMT microarchitecture (a simultaneous multithreading architecture based on IA-64), this paper proposes a novel mapping-table-based register mechanism, MTRM (Mapping Table-based Register Management), which maps contiguous virtual register numbers to non-contiguous physical registers through a mapping table, and studies compiler-assisted timely deallocation. Experimental results show that the scheme effectively improves performance.
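A minimal C sketch of the mapping-table idea follows. The table sizes, the busy-flag bookkeeping, and the function names are assumptions made for illustration; only the core notion, translating contiguous virtual register numbers into possibly non-contiguous physical registers via a table and releasing a physical register when the compiler signals the value is dead, comes from the abstract.

```c
#include <stdint.h>

/* Illustrative sizes: a virtual register window mapped onto a larger
 * physical register file. */
#define NUM_VIRT 128
#define NUM_PHYS 256

typedef struct {
    uint16_t map[NUM_VIRT];   /* virtual register -> physical register   */
    uint8_t  busy[NUM_PHYS];  /* physical registers currently allocated  */
} reg_map_t;

/* Translate a virtual register number used by an instruction. */
static uint16_t map_vreg(const reg_map_t *t, uint16_t vreg)
{
    return t->map[vreg];
}

/* "Timely deallocation": when the compiler signals that a virtual
 * register is dead, its physical register returns to the free pool. */
static void release(reg_map_t *t, uint16_t vreg)
{
    t->busy[t->map[vreg]] = 0;
}
```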

10.
Proposes a low-power approach to the design of embedded very long instruction word (VLIW) processor architectures based on the forwarding (or bypassing) hardware, which provides operands from interstage pipeline registers directly to the inputs of the function units. The power optimization technique exploits the forwarding paths to avoid the power cost of writing/reading short-lived variables to/from the register file (RF). Such optimization is justified by the fact that, in application-specific embedded systems, a significant number of variables are short-lived, that is, their liveness (from first definition to last use) spans only few instructions. Values of short-lived variables can thus be accessed directly through the forwarding registers, avoiding writeback to the RF by the producer instruction and successive read from the RF by the consumer instruction. The decision concerning the enabling of the RF writeback phase is taken at compile time by the compiler static scheduling algorithm. This approach implies a minimal overhead on the complexity of the processor control logic and, thus, no critical path increase. The application of the proposed solution to a VLIW embedded core has shown an average RF power saving of 7.8% with respect to the unoptimized approach on the given set of target benchmarks.
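The following sketch shows, under assumed names and an assumed forwarding depth, the kind of test a static scheduling pass could apply when deciding whether to disable the register-file writeback for a value: if the distance between the producing instruction and its last consumer fits inside the forwarding window, the value never needs to reach the RF.

```c
#include <stdbool.h>

/* Assumed depth of the forwarding (bypass) network: results remain
 * available in interstage pipeline registers for this many cycles. */
#define FORWARD_DEPTH 2

typedef struct {
    int def_cycle;        /* cycle in which the value is produced */
    int last_use_cycle;   /* cycle of its last consumer           */
} value_lifetime_t;

/* Returns true when the scheduler may clear the writeback enable bit,
 * so the short-lived value is read from the forwarding registers only. */
static bool can_skip_writeback(const value_lifetime_t *v)
{
    return (v->last_use_cycle - v->def_cycle) <= FORWARD_DEPTH;
}
```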

11.
The available instruction level parallelism allowed by current register file organizations is not always fully exploited by media processors when running a multimedia application. This paper introduces a novel register file organization, called multi-shared register file, that eliminates this superfluous instruction scheduling flexibility by reducing the number of read and write ports and partitioning the register file in a special ring structure. A parameterized generic VLIW architecture is used to explore different configurations of our proposed register file structure in terms of estimated silicon area, minimum clock period, estimated power consumption, and multimedia task processing performance. Moreover, a metric highly related to multimedia applications is introduced to study trade-offs between hardware cost and performance. The results show that by substituting a monolithic register file with an equivalent multi-shared register file, the estimated area and the power consumption are considerably reduced at the cost of a negligible performance degradation.

12.
In the past years, many works have demonstrated the applicability of Coarse-Grained Reconfigurable Array (CGRA) accelerators to optimize loops by using software pipelining approaches. They are proven to be effective in reducing the total execution time of multimedia and signal processing applications. However, the run-time reconfigurability of CGRAs is hampered by the overheads introduced by the needed translation and mapping steps. In this work, we present a novel run-time translation technique for the modulo scheduling approach that can convert binary code on-the-fly to run on a CGRA. We propose a greedy approach, since modulo scheduling for CGRAs is an NP-complete problem. In addition to read-after-write dependencies, dynamic modulo scheduling faces new challenges, such as register insertion to resolve recurrence dependencies and to balance the pipelining paths. Our results demonstrate that the greedy run-time algorithm can reach a near-optimal ILP rate, better than an off-line compiler approach for a 16-issue VLIW processor. The proposed mechanism ensures software compatibility as it supports different source ISAs. As a proof of concept of scaling, a change in the memory bandwidth has been evaluated. This analysis demonstrates that, when changing from one memory access per cycle to two memory accesses per cycle, the modulo scheduling algorithm is able to exploit the increase in memory bandwidth and enhance performance accordingly. Additionally, to measure area and performance, the proposed CGRA was prototyped on an FPGA. The area comparisons show that a crossbar CGRA (with 16 processing elements and including a 4-issue VLIW host processor) is only 1.11× larger than a standalone 8-issue VLIW softcore processor.
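Whether performed off-line or at run time, a modulo scheduler must respect the classical lower bound on the initiation interval. The helper below computes that bound as background to the technique; the parameter names are illustrative and this is not the paper's own greedy algorithm.

```c
/* Classical lower bound on the initiation interval (II) of a
 * modulo-scheduled loop: II >= max(ResMII, RecMII). */

/* Ceiling division for positive integers. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

static int min_initiation_interval(int ops_on_busiest_unit, int units_of_that_kind,
                                   int worst_cycle_latency, int worst_cycle_distance)
{
    /* ResMII: operations competing for the scarcest resource class. */
    int res_mii = ceil_div(ops_on_busiest_unit, units_of_that_kind);

    /* RecMII: the tightest recurrence cycle in the dependence graph,
     * i.e. accumulated latency over accumulated loop-carried distance.
     * Loops without recurrences impose no RecMII constraint. */
    int rec_mii = worst_cycle_distance > 0
                ? ceil_div(worst_cycle_latency, worst_cycle_distance)
                : 0;

    return res_mii > rec_mii ? res_mii : rec_mii;
}
```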

13.
A scalable, distributed processor architecture is presented that emphasizes high-performance computing for digital signal processing applications by combining high-frequency design techniques with a very high degree of parallel processing on a chip. The architecture is based on a superscalar processor model with a modified Tomasulo scheme that was extended to eliminate all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads [simultaneous multithreading (SMT)]. Consequent application of fine clustering reduces the cycle time for wire-sensitive building blocks of the processor, such as the register file and the scheduling window, and leads to a distributed architecture model in which independent thread processing units, arithmetic logic units, register files and memories are distributed across the chip and communicate with each other over a special network. A special communication protocol replaces the broadcasting and associative compare of destination tags in a centralised instruction scheduler with explicit operand transfer instructions, thus decentralizing the control of the data flow to the greatest extent. As a result, the processor cycle time depends neither on the issue bandwidth of a single thread nor on the execution bandwidth of the SMT processor. This makes the performance of the architecture scalable with both the number of function units and the number of thread units without having any impact on the processor's cycle time. Performance and scalability of the proposed microarchitecture are demonstrated with critical signal processing kernels from the MPEG-4 video coding standard on a cycle-true simulator.

14.
The convergence of voice and video in next-generation wireless applications requires a processor that can efficiently implement a broad range of advanced third generation (3G) wireless algorithms. The micro signal architecture (MSA) core is a dual-MAC modified Harvard architecture that has been designed to have good performance on both voice and video algorithms. In addition, some of the best features and the simplicity of microcontrollers have been incorporated into the MSA core. This article presents an overview of the MSA architecture, key engineering issues and their solutions, and details associated with the first implementation of the core. The utility of the MSA architecture for practical 3G wireless applications is illustrated with several application examples and performance benchmarks for typical DSP and image/video kernels. The DSP features of the MSA core include: two 16-bit single-cycle throughput multipliers, two 40-bit split data ALUs, and hardware support for on-the-fly saturation and clipping; two 32-bit pointer ALUs with support for circular and bit-reversed addressing; two separate data ports to a unified 4 GB memory space, a parallel port for instructions, and two loop counters that allow nested zero-overhead looping.
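Of the addressing modes listed above, bit-reversed addressing is the one specific to FFT processing. The routine below simply shows in C what the hardware index update computes; the transform size in the comment (and hence the number of reversed bits) is an illustrative assumption.

```c
#include <stdint.h>

/* Return 'index' with its low 'bits' bits reversed -- the access
 * pattern an FFT needs to reorder its inputs or outputs.  A DSP with
 * bit-reversed addressing produces this sequence directly in its
 * address-generation unit instead of in a software loop like this one. */
static uint32_t bit_reverse(uint32_t index, unsigned bits)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (index & 1u);
        index >>= 1;
    }
    return r;
}
/* Example: for a 1024-point FFT (bits = 10), index 1 maps to 512. */
```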

15.
A 1-million transistor 64-b microprocessor has been fabricated using 0.8-μm double-metal CMOS technology. A 40-MIPS (million instructions per second) and 20-MFLOPS (million floating-point operations per second) peak performance at 40 MHz is realized by a self-clocked register file and two translation lookaside buffers (TLBs) with word-line transition detection circuits. The processor contains an integer unit based on the SPARC (scalable processor architecture) RISC (reduced instruction set computer) architecture, a floating-point unit (FPU) which executes IEEE-754 single- and double-precision floating-point operations, a 6-KB three-way set-associative physical instruction cache, a 2-KB two-way set-associative physical data cache, a memory management unit that has two TLBs, and a bus control unit with an ECC (error-correcting code) circuit.

16.
This article proposes the design and architecture of a dynamically scalable dual-core pipelined processor. The design methodology is the core fusion of two processors, where two independent cores can dynamically morph into a larger processing unit, or they can be used as distinct processing elements to achieve high sequential performance and high parallel performance. The processor provides two execution modes. Mode 1 is a multiprogramming mode for executing streams of instructions of lower data width, i.e., each core can perform 16-bit operations individually. Performance is improved in this mode due to the parallel execution of instructions in both cores, at the cost of area. In mode 2, both processing cores are coupled and behave like a single processing unit of higher data width, i.e., they can perform 32-bit operations. Additional core-to-core communication is needed to realise this mode. The mode can be switched dynamically; therefore, this processor can provide multiple functions with a single design. Design and verification of the processor have been carried out using Verilog on the Xilinx 14.1 platform. The processor is verified in both simulation and synthesis with the help of test programs. The design is aimed at implementation on a Xilinx Spartan-3E XC3S500E FPGA.
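How two 16-bit cores can behave like a single processing unit of higher data width is easiest to see on an adder. The C model below is only an illustration of that coupling idea under assumed interfaces, not the paper's Verilog design: the carry out of the low core is the core-to-core signal that lets the pair act as one 32-bit ALU in mode 2.

```c
#include <stdint.h>

/* Toy model of the fused execution mode: each 16-bit core owns a
 * 16-bit adder; in mode 2 the carry out of the low core is chained
 * into the high core so the pair behaves as one 32-bit adder. */
static uint32_t fused_add32(uint16_t a_lo, uint16_t a_hi,
                            uint16_t b_lo, uint16_t b_hi)
{
    uint32_t lo  = (uint32_t)a_lo + b_lo;       /* low core: 16-bit add        */
    uint32_t cin = lo >> 16;                    /* carry passed core-to-core   */
    uint32_t hi  = (uint32_t)a_hi + b_hi + cin; /* high core: 16-bit add       */
    return ((hi & 0xFFFFu) << 16) | (lo & 0xFFFFu);
}
```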

17.
We report the first implementation of the new SPARC V9 64-b instruction set architecture. The HaL processor called SPARC64 is a ceramic Multi-Chip Module (MCM) that contains one CPU chip, one Memory Management Unit (MMU) chip, and four 64 KB Cache chips. Together, they implement a unique three-level address translation scheme that efficiently supports using virtual addresses spread anywhere in the full 64-b address range. The processor assigns a serial number to each issued instruction to track up to 64 in-progress instructions and can speculatively issue through up to 16 branches. It issues up to 4 instructions per cycle and utilizes superscalar instruction issue, register renaming, and dataflow (potentially out-of-order) execution to fully exploit instruction-level parallelism. The processor maintains a precise-state execution model, and commits in-order, up to 9 instructions in a cycle. In a HaL R1 system, a production SPARC64 running at 143 MHz has a performance of 230 SPECint92 and 300 SPECfp92 and dissipates 50 W from a 3.3 V supply.

18.
A software methodology for detecting hardware faults in VLIW data paths
The proposed methodology aims to achieve processor data paths for VLIW architectures that are able to autonomously detect transient and permanent hardware faults while executing their applications. The approach, carried out on the compiled application software, introduces additional instructions that check the correctness of the computation with respect to failures in one of the data path functional units. A software approach to hardware fault detection is attractive because it can be applied only to the critical applications executed on the VLIW architecture, thus not causing a delay in the execution of noncritical tasks. Furthermore, by exploiting the intrinsic redundancy of this class of architectures, no hardware modification is required on the data path, so no processor customization is necessary.
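The flavour of the compiler-inserted checks can be conveyed with a small C analogue. This is a sketch under assumptions (the names, the abort-on-mismatch policy, and the use of plain function calls in place of VLIW issue slots are all illustrative): on the real machine the redundant copy of an operation would be scheduled on an otherwise idle functional unit, and a comparison instruction flags a fault in either unit.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for one data-path operation (here, an ALU addition). */
static int32_t alu_op(int32_t a, int32_t b) { return a + b; }

/* Duplicate-and-compare: execute the operation twice (conceptually on
 * two different functional units) and compare the results; a mismatch
 * signals a transient or permanent hardware fault. */
static int32_t checked_add(int32_t a, int32_t b)
{
    int32_t primary   = alu_op(a, b);   /* original instruction            */
    int32_t duplicate = alu_op(a, b);   /* redundant copy on a spare slot  */
    if (primary != duplicate)
        abort();                        /* fault detected: raise an error  */
    return primary;
}
```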

19.
This paper presents SWIFT, a computationally intensive DSP architecture for communication applications. We introduce the VLIW feature and SIMD capability of SWIFT, as well as the vector register file, store buffer, and multi-banked memories. Then, the structure of the nine-stage pipeline with its powerful bypass logic is described. Finally, the hardware implementation of SWIFT is presented. In SWIFT, computation, data access, and control operations can be handled orthogonally. With the efficient SIMD feature, multi-accessible local data memories, and the fine-tuned VLIW instructions, it is possible to achieve high utilization of the DSP datapath in SWIFT.

20.
A 16-kbit nonvolatile charge addressed memory (NOVCAM) is described. A unique cell design allows a high-density memory array layout without reduced line widths or spacings. A cell size of 0.5 square mils is produced by a seven-mask process with 6-μm polysilicon gates, 10-μm aluminum gates, and 10-μm minimum spacing on all mask levels. Charge addressed write and read operations are implemented with a very simple interface between the memory array and a two-phase dynamic shift register. The memory is organized as 256 columns by 64 rows. Two 64-bit shift registers provide data access to the memory array via a 2:1 column decoder. With single polysilicon processing the memory array is 50×161 mils; the 16-kbit chip is 131×200 mils.
