共查询到20条相似文献,搜索用时 109 毫秒
1.
针对嵌入式应用中三维图形渲染的要求,设计了一款可编程的多线程顶点处理器.该顶点处理器采用单指令多数据结构,一条指令能够同时处理4个单精度浮点数,并采用多线程技术,支持4个线程并发执行,能够有效地减少发生数据写读冲突时的停顿周期数,提高了处理效率.相对于单线程结构,4线程顶点处理器在较小的硬件开销下,可以实现2.1~2.8倍的性能提升.该顶点处理器支持OpenGL ES 1.1和Vertex Shader Model 1.1,在90nm CMOS工艺库下可实现频率为200MHz,性能为50Mvertices/s. 相似文献
2.
为了降低超长指令字(VLIW)架构的平均跳转开销和平均访存时延,并减少VLIW程序的代码体积,提出了一种全新的将分支预测与值预测技术应用于VLIW架构的方法。首先分析现有超标量(Superscalar)架构中动态预测技术与V L IW架构中指令静态并行之间所存在的矛盾;通过拓展原有跳转指令和读内存指令,使之与不同的延时槽个数相对应,并根据不同的指令来阻塞流水线或延时写回寄存器,从而解决动态预测技术造成V L IW架构静态调度周期错乱的问题。基于Gem5仿真平台和清华大学Magnolia VLIW数字信号处理器(DSP)的基准测试程序实验表明,该分支预测与值预测技术能显著地提高VLIW架构的性能,缩小VLIW程序的代码体积。 相似文献
3.
提出了一种DSP和通用CPU一体化的处理器架构,并完成了一款基于该架构的同构4核处理器设计和流片验证.该处理器基于VLIW结构,支持自主定义的DSP指令系统,兼容现有通用的MIPS 4KC处理器指令集,支持最大8个指令通道的并行发射.处理器在不改变CPU的指令编码以及执行顺序的前提下,实现了芯片结构上的DSP和CPU执行处理的一体化,适合在统一的平台上同时完成宽带通信和多媒体的信号和协议处理的嵌入式应用开发.处理器内核通过自主定义的DSP指令字中前后并行标识位和一条专用的前导paralink指令实现了DSP与CPU指令的并行发射.在4核处理器的同构架构上,采用了全局读局部写的多核间片上数据存储策略,在控制硬件开销的基础上实现片上数据的共享.仿真和流片验证结果表明,所提出的DSP和CPU一体化处理器架构可行,在宽带通信和多媒体等嵌入式应用上具有优势. 相似文献
4.
面向VLIW结构的高性能代码生成技术 总被引:1,自引:1,他引:0
DSP处理器通过采用VLIW结构获得了高性能,同时也增加了编译器为其生成汇编代码的难度.代码生成器作为编译器的代码生成部件,是VLIW结构能够发挥性能的关键.由此提出并实现了一种基于可重定向编译框架的代码生成器.该代码生成器充分利用VLIW的体系结构特点,支持SIMD指令,支持谓词执行,能够生成高度指令级并行的汇编代码,显著提高应用程序的执行性能. 相似文献
5.
为了提高高速DSP或通用处理器的程序执行速度,描述了一种指令缓存单元的有效架构,特别是实现细节和性能分析.因所提出的指令缓存单元是为一种高性能VLIW结构的DSP核而设计,使用了并行的标签比较逻辑和寄存器堆的结构,芯片面积、关键路径延迟、功耗都大大减小.该指令缓存单元使用高层次的RTL(使用Verilog)编码,并由Synopsys的Design Compiler综合,使用不同的StarCoreTM基准程序测试比较,并进行性能分析.比较结果表明,所提出的结构是有效的,适合用于任何高速的处理器核. 相似文献
6.
本文提出了一种VLIW处理器的预取和针对循环指令的优化策略.文中重点介绍了预取普通指令和处理循环指令的方法,以及普通预取和循环预取这两种预取模式间的切换方式.基于该设计和优化方案,可以有效减小取指操作的功耗.实验证明,在针对不同的应用上,减少的功耗从40%到90%不等,优化了该VLIW多运算簇DSP处理器的性能. 相似文献
7.
线程级并行技术能有效的提高微处理器内核的资源利用率,是目前高性能微处理器研究的重点内容。文章分析了网络处理器的线程级并行技术中存在的几个关键问题,结合网络协议处理的特征提出了一种适合于网络协议处理的混合多线程结构。并将其成功应用于网络协议处理微引擎NRS05的设计中,最大程度的提高了网络处理器的分组吞吐率。 相似文献
8.
本文研究一种新的既具有微控制器功能,又有增强DSP功能的高性能微处理器的实现架构.在统一的增强CISC指令集下,我们将基于哈佛和寄存器-寄存器结构的微处理器模块和单周期乘法/累加器、桶形移位寄存器、无开销循环及跳转硬件支持模块、硬件地址产生器等DSP功能模块以及嵌入式Flash Memory和指令队列缓冲器有机的集成起来,在统一架构下通过单核实现CISC/DSP微处理器,有效地提高了处理器的性能.该微处理器采用0.35μm CMOS工艺实现,芯片面积为25mm2.在80M工作频率下,动态功耗为425mW,峰值数据处理能力可达80MIPS.该处理器核可满足片上系统(SOC)对高性能处理器的需求. 相似文献
9.
本文提出了一种基于硬件抽象机的动态翻译技术,它可用于实现Java处理器.该技术采用了硬件抽象机的"模糊执行"(HAM)方法,通过分析Java程序之间的相关性,动态地将Java字节码转换成基于标签的类RISC指令.然后,将堆栈折叠与动态翻译相结合进一步优化指令.应用该技术设计了一个Java指令级并行处理器,并且扩展它,支持Java多线程功能. 相似文献
10.
以图形处理、数字信号处理等为代表的流应用,对微处理器提出了高并行度、高性能和高带宽的要求。针对流应用加速的流处理器体系架构得到了广泛研究。流体系结构大多集成大量的功能单元、开发多层次并行和存储来加速流应用,但同时增加了系统功耗和芯片面积。分析和比较了近年来主流的流处理器架构,提出了一种用于流应用加速的可重构协处理器。该协处理器针对流应用特点,实现了数据级和指令级并行,并集成了多个可以动态配置的运算单元,可动态配置其运算类型和数据类型,提升系统灵活性,降低芯片面积。针对典型算法,该处理器实现了更高的加速比,综合后延时为9.74 ns,功耗为63.69 mW。 相似文献
11.
Dongming Peng Mi Lu 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(1):106-125
Although the notion of the parallelism in multidimensional applications has existed for a long time, it is so far unknown what the bound (if any) of inter-iteration parallelism in multirate multidimensional digital signal processing (DSP) algorithms is, and whether the maximum inter-iteration parallelism can be achieved for arbitrary multirate data flow algorithms. This paper explores the bound of inter-iteration parallelism within rate-balanced multirate multidimensional DSP algorithms and proves that this parallelism can always be achieved in hardware system given the availability of a large number of processors and the interconnections between them. 相似文献
12.
13.
We deal with parallelism at the data level. We describe an implementation of the architectural technique called sub-word parallelism (SWP), which increases parallelism at the data-element-level by means of partitioning a processor's data path. The specific implementation we focus on is based on the TigerSHARC DSP architecture, developed at Analog Devices, Inc. As a result of SWP, the same data path and computation units perform more than one computation on an N-element composite word. This composite word consists of more than one adjacent sub-words. SWP is quite common and exists in production versions of most major general-purpose microprocessors. We also present an implementation of an FIR filter in the TigerSHARC using data-level SWP as an example 相似文献
14.
Mladen Berekovic Peter Pirsch Johannes Kneip 《The Journal of VLSI Signal Processing》1998,20(1-2):163-180
Very Long Instruction Word (VLIW) processor architectures for multimedia applications are discussed from an algorithm, hardware and system based point of view. VLIW processors show high flexibility and processing power, as well as a good utilization of resources by compiler-generated code, but their exclusive exploitation of instruction level parallelism (ILP) decreases in efficiency as the degree of parallelism increases. This is mainly caused by characteristics of multimedia algorithms, increasing wiring delays, compiler restrictions, and a widening gap between on-chip processing speed and available bandwidth to external memory. As new multimedia applications and standards continue to evolve (MPEG-4), the demand for higher processing power will continue. Therefore, parallel processing in all its available forms will have to be exploited to achieve significant performance improvements. We show that, due to the diminishing returns from a further increase in ILP, multimedia applications will benefit more from an additional exploitation of parallelism at thread-level. We examine how simultaneous multithreading (SMT), a novel architectural approach combining VLIW techniques with parallel processing of threads, can efficiently be used to further increase performance of typical multimedia workloads. 相似文献
15.
16.
A flow graph scheduling algorithm that simultaneously considers pipelining, retiming, parallelism, and hierarchical node decomposition is presented. The ability to simultaneously consider the many types of concurrency allows the scheduler to find efficient multiprocessor solutions for a wide range of DSP applications. It has been implemented as part of a software environment for scheduling DSP programs onto fixed and configurable multiprocessor systems. The results on a set of benchmarks demonstrate that the algorithm achieves near ideal speedups even across programs with different types of concurrency 相似文献
17.
18.
从视频信号的特征出发,简要说明了实时视频压缩的常用算法及其国际标准。通过系统地分析了视频压缩算法中内在的模块特性和并行特性,结合数字信号处理领域中具有并行实现机制的典型硬件结构,得出了可用于实时视频压缩的两种单片硬件结构模型。 相似文献
19.
20.
Leupers R. Marwedel P. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1997,5(1):112-122
This paper addresses instruction-level parallelism in code generation for digital signal processors (DSPs). In the presence of potential parallelism, the task of code generation includes code compaction, which parallelizes primitive processor operations under given dependency and resource constraints. Furthermore, DSP algorithms in most cases are required to guarantee real-time response. Since the exact execution speed of a DSP program is only known after compaction, real-time constraints should be taken into account during the compaction phase. While previous DSP code generators rely on rigid heuristics for compaction, we propose a novel approach to exact local code compaction based on an integer programming (IP) model, which handles time constraints. Due to a general problem formulation, the IP model also captures encoding restrictions and handles instructions having alternative encodings and side effects and therefore applies to a large class of instruction formats. Capabilities and limitations of our approach are discussed for different DSPs 相似文献