首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
针对嵌入式应用中三维图形渲染的要求,设计了一款可编程的多线程顶点处理器.该顶点处理器采用单指令多数据结构,一条指令能够同时处理4个单精度浮点数,并采用多线程技术,支持4个线程并发执行,能够有效地减少发生数据写读冲突时的停顿周期数,提高了处理效率.相对于单线程结构,4线程顶点处理器在较小的硬件开销下,可以实现2.1~2.8倍的性能提升.该顶点处理器支持OpenGL ES 1.1和Vertex Shader Model 1.1,在90nm CMOS工艺库下可实现频率为200MHz,性能为50Mvertices/s.  相似文献   

2.
面向VLIW结构的高性能代码生成技术   总被引:1,自引:1,他引:0  
DSP处理器通过采用VLIW结构获得了高性能,同时也增加了编译器为其生成汇编代码的难度.代码生成器作为编译器的代码生成部件,是VLIW结构能够发挥性能的关键.由此提出并实现了一种基于可重定向编译框架的代码生成器.该代码生成器充分利用VLIW的体系结构特点,支持SIMD指令,支持谓词执行,能够生成高度指令级并行的汇编代码,显著提高应用程序的执行性能.  相似文献   

3.
The single instruction multiple data (SIMD) architecture is very efficient for executing arithmetic intensive programs, but frequently suffers from data-alignment problems. The data-alignment problem not only induces extra time overhead but also hinders automatic vectorization of the SIMD compiler. In this paper, we compare three on-chip memory systems, which are single-bank, multi-bank, and multi-port, for the SIMD architecture to resolve the data-alignment problems. The single-bank memory is the simplest, but supports only the aligned accesses. The multi-bank memory requires a little higher complexity, but enables the unaligned accesses and the stride accesses with a bank-conflict limitation. The multi-port memory is capable of both the unaligned and stride accesses without any restriction, but needs quite much expensive hardware. We also developed a vectorizing compiler that can conduct dynamic memory allocation and SIMD code generation. The performances of the three memory systems with our SIMD compiler are evaluated using several digital signal processing kernels and the MPEG2 encoder. The experimental results show that the multi-bank memory can carry out MPEG2 encoding 5.8 times faster, whereas the single-bank memory only achieves 2.9 times speed-up when employed in a multimedia system with a 2-issue host processor and an 8-way SIMD coprocessor. The multi-port memory obviously shows the best performance, which is however an impractical improvement over the multi-bank memory when the hardware cost is considered.  相似文献   

4.
王向前  洪一  王昊  郑启龙 《电子学报》2015,43(8):1656-1661
魂芯DSP是一款字寻址的、分簇结构的、支持SIMD的VLIW处理器.介绍了基于开源编译器基础设施open64开发魂芯编译器的关键技术,包括地址寄存器的优化处理、综合多种启发因子的指令分簇、分簇架构下的寄存器分配和指令调度.介绍了魂芯DSP编译器的体系结构优化关键技术,包括基于依赖分析的向量化、高效指令的使用和零开销循环的识别.并总结开发经验,给出了基于开源编译基础设施开发编译器的若干注意点.  相似文献   

5.
A scalable, distributed, processor architecture is presented that emphasizes on high performance computing for digital signal processing applications by combining high frequency design techniques with a very high degree of parallel processing on a chip. The architecture is based on a superscalar processor model with a modified Tomasulo scheme that was extended to eliminate all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads [simultaneously multi-threaded (SMT)]. Consequent application of fine clustering reduces the cycle-time for wire-sensitive building blocks of the processor like the register file and the scheduling window and leads to a distributed architecture model, where independent thread processing units, arithmetic logic units, registers files and memories are distributed across the chip and communicate with each other by special network. A special communication protocol replaces broadcasting and associative compare of destination tags in a centralised instruction scheduler with explicit operand transfer instructions, thus decentralizing the control of the data flow to the greatest extent. As a result, the processor cycle time does neither depend on the issue bandwidth of a single thread nor on the execution bandwidth of the SMT processor. This makes the performance of the architecture scalable with both the number of function and the number of thread units without having any impact on the processors cycle-time. Performance and scalability of the proposed microarchitecture is demonstrated with critical signal processing kernels from the MPEG-4 video coding standard on a cycle-true simulator.
Tim NiggemeierEmail:
  相似文献   

6.
This paper describes the implementation of a digital audio effect system‐on‐a‐chip (SoC), which integrates an embedded digital signal processor (DSP) core, audio codec intellectual property, a number of peripheral blocks, and various audio effect algorithms. The audio effect SoC is developed using a software and hardware co‐design method. In the design of the SoC, the embedded DSP and some dedicated hardware blocks are developed as a hardware design, while the audio effect algorithms are realized using a software centric method. Most of the audio effect algorithms are implemented using a C code with primitive functions that run on the embedded DSP, while the equalization effect, which requires a large amount of computation, is implemented using a dedicated hardware block with high flexibility. For the optimized implementation of audio effects, we exploit the primitive functions of the embedded DSP compiler, which is a very efficient way to reduce the code size and computation. The audio effect SoC was fabricated using a 0.18 μm CMOS process and evaluated successfully on a real‐time test board.  相似文献   

7.
This paper describes a new architecture for JAVA-based, interactive multimedia applications. A hardware implementation of a Java Virtual Machine (JVM) is proposed, which allows the direct execution of Java bytecode. In a single clock cycle, up to 3 bytecode instructions can be decoded and executed in parallel using a RISC pipeline. A splitable 64-bit ALU implementation addresses demanding processing requirements of typical multimedia signal processing schemes. The on-chip caches are adapted to the specific data structures of the JVM. The proposed architecture supports execution of multiple Java threads in parallel. An implementation of basic building blocks of the processor with a standard-cell library provides an estimate of 150 MHz clock-speed for a 0.35 m 3 metal layer CMOS process. With a size of less than 10 mm2 needed for the core logic, it is possible to integrate multiple JVMs together with larger cache memories on a single chip. Based on these results, we discuss various performance aspects of JAVA for use in future multimedia terminals.  相似文献   

8.
Many-core processors are good candidates for speeding up video coding because the parallelism of these applications can be exploited more efficiently by the many-core architecture. Lock methods are important for many-core architecture to ensure correct execution of the program and communication between threads on chip. The efficiency of lock method is critical to overall performance of chipped many-core processor. In this paper, we propose two types of hardware locks for on-chip many-core architecture, a centralized lock and a distributed lock. First, we design the architectures of centralized lock and distributed lock to implement the two hardware lock methods. Then, we evaluate the performance of the two hardware locks and a software lock by quantitative evaluation micro-benchmarks on a many-core processor simulator Godson-T. The experimental results show that the locks with dedicated hardware support have higher performance than the software lock, and the distributed hardware lock is more scalable than the centralized hardware lock.  相似文献   

9.
李萱  郭炜 《信息技术》2007,31(5):51-53,57
提出了一种适用于JPEG2000标准中并行通道编码的Embedded Block Coding with Optimized Truncation (EBCOT)高速MQ编码器的硬件架构。首先对JPEG2000标准流程的标码流程选择和字节输出等流程进行改进,使之更适应于硬件实现,并提出一种区间重整时对前导零位数的更简洁的判断方法和电路实现,充分利用硬件并行性,提高了编码速度。进而提出了四级流水的MQ编码器硬件架构,有效提高了MQ编码速率,充分满足并行通道编码的要求。  相似文献   

10.
The hardware/software co-exploration is a critical phase for a broad range of embedded platforms based on the System-On-Chip approach. Traditionally, the compilation and the architectural design sub-spaces have been explored independently. Only recently, some approaches have analyzed the problem of the concurrent exploration of the compilation/architecture sub-spaces. This paper proposes a framework to support the co-exploration phase of the design space composed of architectural parameters and source program transformations. The objective space is multi-dimensional, including conflicting objectives such as energy and delay. In the proposed framework, heuristic co-exploration techniques based on Pareto Simulated Annealing (PSA) have been used to efficiently explore the architecture/compiler co-design space. A first result of this paper consists of showing how the architecture/compiler co-exploration can be more effective than a traditional two-phase exploration. Since the co-exploration space is quite large, to speed up the co-exploration phase by several orders of magnitude over simulation-based approaches, a methodology based on analytical models has been introduced in the co-exploration framework. The goal of analytical models is to quickly evaluate energy/delay metrics at the systems level, while maintaining accuracy with respect to simulation-based co-exploration. The proposed co-exploration framework has been applied to a parameterized SoC superscalar architecture during the execution of a selected set of multimedia applications.  相似文献   

11.
In this paper, a 3-D vertex processor with a floating-point four-threaded and four-issue expanded VLIW architecture and vertex caches for mobile multimedia applications is proposed. The multi-threaded datapath prevents data hazards, and the multi-issue expanded VLIW architecture enables the processor to have an opportunity to execute instructions in parallel and a well-balanced way. The efficient vertex caches are proposed and implemented for the embedded vertex processors to accelerate its geometry operations and to save bandwidth between hosts and vertex processors. The proposed architecture with the vertex caches reduces the average total energy dissipation of 44.7% compared to a conventional single-threaded SIMD architecture, and the proposed vertex processor achieves 120 M vertices/s of geometry performance which is 3.3 times faster than the previous result, and it supports OpenGL ES 2.0 and vertex shader model 3.0. The processor is implemented in a 0.18-mum 1P4M CMOS process, and the operating frequency is 100 MHz.  相似文献   

12.
An Algorithm-Hardware-System Approach to VLIW Multimedia Processors   总被引:2,自引:0,他引:2  
Very Long Instruction Word (VLIW) processor architectures for multimedia applications are discussed from an algorithm, hardware and system based point of view. VLIW processors show high flexibility and processing power, as well as a good utilization of resources by compiler-generated code, but their exclusive exploitation of instruction level parallelism (ILP) decreases in efficiency as the degree of parallelism increases. This is mainly caused by characteristics of multimedia algorithms, increasing wiring delays, compiler restrictions, and a widening gap between on-chip processing speed and available bandwidth to external memory. As new multimedia applications and standards continue to evolve (MPEG-4), the demand for higher processing power will continue. Therefore, parallel processing in all its available forms will have to be exploited to achieve significant performance improvements. We show that, due to the diminishing returns from a further increase in ILP, multimedia applications will benefit more from an additional exploitation of parallelism at thread-level. We examine how simultaneous multithreading (SMT), a novel architectural approach combining VLIW techniques with parallel processing of threads, can efficiently be used to further increase performance of typical multimedia workloads.  相似文献   

13.
介绍了一种应用于ARM处理器的增强DSP功能乘加单元。为了减小乘加指令的周期数,采用了两个并行16×16位乘加单元构成的单指令多数据(SIMD)结构,可以通过适当的配置支持16到32位的各种乘加运算以及16位的复数乘法。理论分析表明,这种乘加单元与传统的单指令单数据(SISD)结构相比在周期数上有明显的减小。尤其对于16位乘加及16位复数乘法,其所需周期数分别只有ARM1022E的1/4和1/3。0.35mm的标准单元库实现表明该乘加单元可以工作在120MHz,使得其非常适合数字信号处理的应用。  相似文献   

14.
For mobile intelligent robot applications, an 81.6 GOPS object recognition processor is implemented. Based on an analysis of the target application, the chip architecture and hardware features are decided. The proposed processor aims to support both task-level and data-level parallelism. Ten processing elements are integrated for the task-level parallelism and single instruction multiple data (SIMD) instruction is added to exploit the data-level parallelism. The Memory-Centric network-on-chip7 (NoC) is proposed to support efficient pipelined task execution using the ten processing elements. It also provides coherence and consistency schemes tailored for 1-to-N and M-to-1 data transactions in a task-level pipeline. For further performance gain, the visual image processing memory is also implemented. The chip is fabricated in a 0.18- $mu$m CMOS technology and computes the key-point localization stage of the SIFT object recognition twice faster than the 2.3 GHz Core 2 Duo processor.   相似文献   

15.
The second in the Niagara series of processors (Niagara2) from Sun Microsystems is based on the power-efficient chip multi-threading (CMT) architecture optimized for Space, Watts (Power), and Performance (SWaP) [SWap Rating = Performance/(Space * Power) ]. It doubles the throughput performance and performance/watt, and provides >10times improvement in floating point throughput performance as compared to UltraSPARC T1 (Niagara1). There are two 10 Gb Ethernet ports on chip. Niagara2 has eight SPARC cores, each supporting concurrent execution of eight threads for 64 threads total. Each SPARC core has a floating point and graphics unit and an advanced cryptographic unit which provides high enough bandwidth to run the two 10 Gb Ethernet ports encrypted at wire speeds. There is a 4 MB Level2 cache on chip. Each of the four on-chip memory controllers controls two FBDIMM channels. Niagara2 has 503 million transistors on a 342 mm2 die packaged in a flip-chip glass ceramic package with 1831 pins. The chip is built in Texas Instruments' 65 nm 11LM triple-Vt CMOS process. It operates at 1.4 GHz at 1.1 V and consumes 84 W.  相似文献   

16.
A low-power I-cache architecture is proposed that is appropriate for embedded low-power processors. Unlike existing schemes, the proposed organisation places an extra small cache in parallel alongside the L1 cache. Since it allows simultaneous accesses to both caches, the proposed scheme introduces little performance degradation. Using simple hardware logic (for sequential accesses) and a compiler transformation (for loop accesses), most L1 cache requests are served by a small cache, so that the amount of energy consumed by the L1 cache is significantly reduced. Experimental results show that for the SPEC95 benchmarks, the proposed organisation reduces the energy-delay product on average by 67.2% over a conventional cache design and 16.8% over the filter cache design  相似文献   

17.
A high speed analog VLSI image acquisition and low-level image processing system is presented. The architecture of the chip is based on a dynamically reconfigurable SIMD processor array. The chip features a massively parallel architecture enabling the computation of programmable mask-based image processing in each pixel. Each pixel include a photodiode, an amplifier, two storage capacitors, and an analog arithmetic unit based on a four-quadrant multiplier architecture. A 64 × 64 pixel proof-of-concept chip was fabricated in a 0.35 μm standard CMOS process, with a pixel size of 35 μm × 35 μm. The chip can capture raw images up to 10,000 fps and runs low-level image processing at a framerate of 2,000–5,000 fps.  相似文献   

18.
This paper presents a methodology for hardware/software co-design with particular emphasis on the problems related to the concurrent simulation and synthesis of hardware and software parts of the overall system. The proposed approach aims at overcoming the problem of having two separate simulation environments by defining a VHDL-based modeling strategy for software execution, thus enabling the simulation of hardware and software modules within the same VHDL-based CAD framework. The proposed methodology is oriented towards the application field of control-dominated embedded systems implemented onto a single chip.  相似文献   

19.
Because of the intrinsic lack of internal‐system observability and controllability in highly integrated multicore processors, very restricted access is allowed for the debugging of erroneous chip behavior. Therefore, the building of an efficient debug function is an important consideration in the design of multicore processors. In this paper, we propose a flexible on‐chip debug architecture that embeds a special logic supporting the debug functionality in the multicore processor. It is designed to support run‐stop‐type debug functions that can halt and control the execution of the multicore processor at breakpoint events and inspect the possible causes of any errors. The debug architecture consists of the following three functional components: the core debug support block, the multicore debug support block, and the debug interface and control block. By embedding this debug infrastructure, the embedded processor cores within the multicore processor can be debugged simultaneously as well as independently. The debug control is performed by employing a JTAG‐based scanning operation. We apply this on‐chip debug architecture to build a debugger for a prototype multicore processor and demonstrate the validity and scalability of our approach.  相似文献   

20.

Achieving high performance in task-parallel runtime systems, especially with high degrees of parallelism and fine-grained tasks, requires tuning a large variety of behavioral parameters according to program characteristics. In the current state of the art, this tuning is generally performed in one of two ways: either by a group of experts who derive a single setup which achieves good – but not optimal – performance across a wide variety of use cases, or by monitoring a system’s behavior at runtime and responding to it. The former approach invariably fails to achieve optimal performance for programs with highly distinct execution patterns, while the latter induces overhead and cannot affect parameters which need to be set at compile time. In order to mitigate these drawbacks, we propose a set of novel static compiler analyses specifically designed to determine program features which affect the optimal settings for a task-parallel execution environment. These features include the parallel structure of task spawning, the granularity of individual tasks, the memory size of the closure required for task parameters, and an estimate of the stack size required per task. Based on the result of these analyses, various runtime system parameters are then tuned at compile time. We have implemented this approach in the Insieme compiler and runtime system, and evaluated its effectiveness on a set of 12 task parallel benchmarks running with 1 to 64 hardware threads. Across this entire space of use cases, our implementation achieves a geometric mean performance improvement of 39%. To illustrate the impact of our optimizations, we also provide a comparison to current state-of-the art task-parallel runtime systems, including OpenMP, Cilk, HPX, and Intel TBB.

  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号