首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 218 毫秒
1.
分簇结构超长指令字DSP编译器的设计与实现   总被引:5,自引:0,他引:5  
超长指令字(VLIW)是高端DSP普遍采用的体系结构。VLIW DSP在硬件上没有调度和冲突判决的机制,其性能的发挥完全依靠编译嚣的优化效果.基于可重定向编译基础设施IMPACT,为分簇VLIW DSP YHFT—D4设计与实现了优化编译器.其中着重讨论了可重定向信息的定义、代码注释、SIMD指令的支持、分簇寄存器分配以度指令级并行开发和资源冲突解决等内容.实验结果表明该编译器可以达到较好的优化效果.  相似文献   

2.
曾斌  安虹  王莉 《计算机科学》2010,37(3):248-252
开发利用ILP(Inst ruction-level Parallelism)是现代高性能处理器取得高性能的关键要素之一。宽发射的超标量处理器、超长指令字处理器和数据流处理器只有在并行执行多条相邻的指令时才能获得较高的性能。数据流处理器的一个关键问题是如何把指令的计算结果高效地播送给目标指令而不用读写集中式寄存器文件。对于每条目标数大于指令所能编码的目标数的指令,编译程序都要插入一棵由MOV指令构成的软件扇出树来把计算结果播送给多条目标指令。为了暴露更多的ILP给硬件执行基底,提出了一种改进的软件扇出树生成算法,本算法根据目标指令的执行概率大小以及目标指令到该指令所在块的出口的关键路径长度来计算目标指令的权值,然后对各个叶子的优先权值进行排序,再根据优先权值的顺序来构造一棵软件扇出树,以便把指令的计算结果播送给多条目标指令。实验结果发现,本算法相对于传统的软件扇出树生成算法其性能有较大的提高。  相似文献   

3.
BWDSP100是一款采用超长指令字(VLIW)和单指令多数据流(SIMD)架构的针对高性能计算领域而设计的32位静态标量数字信号处理器,其指令级并行(ILP)主要是通过其特殊的分簇体系结构和SIMD指令来实现,然而现有的编译框架无法对这些特殊的SIMD指令提供支持。由于BWDSP100拥有丰富的SIMD向量化资源,且其所运用的雷达数字信号处理领域对程序的性能要求极高,因此针对BWDSP100结构的特点,在传统Open64编译器中SIMD编译优化框架的基础上提出并实现了一种支持单双字模式选择的SIMD编译优化算法,通过该算法可以显著提高一些在DSP上有着广泛运用计算密集型程序的性能。实验结果表明,与优化前相比,该算法方案在BWDSP编译器上的实现能够平均取得5.66的加速比。  相似文献   

4.
Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence due to branch instructions causes underutilization of computational resources, resulting in performance degradation of SIMD architecture. Graphics processing unit (GPU) is a representative parallel architecture based on SIMD architecture. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, contrary to graphics applications, general-purpose applications include many branch instructions, resulting in serious performance degradation of GPU due to branch divergence. In this paper, we propose concurrent warp execution (CWE) technique to reduce the performance degradation of GPU in executing general-purpose applications by increasing resource utilization. The proposed CWE enables selecting co-warps to activate more threads in the warp, leading to concurrent execution of combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85 % over PDOM, 91 % over DWF) with little hardware overhead.  相似文献   

5.
MMX technology extension to the Intel architecture   总被引:2,自引:0,他引:2  
Peleg  A. Weiser  U. 《Micro, IEEE》1996,16(4):42-50
Designed to accelerate multimedia and communications software, MMX technology improves performance by introducing data types and instructions to the IA that exploit the parallelism in these applications. MMX technology extends the Intel architecture (IA) to improve the performance of multimedia, communications, and other numeric-intensive applications. It uses a SIMD (single-instruction, multiple-data) technique to exploit the parallelism inherent in many algorithms, producing full application performance of 1.5 to 2 times faster than the same applications run on the same processor without MMX. The extension also maintains full compatibility with existing IA microprocessors, operating systems, and applications while providing new instructions and data types that applications can use to achieve a higher level of performance on the host CPU  相似文献   

6.
一种SIMD优化中的向量寄存器部分重用方法   总被引:1,自引:0,他引:1       下载免费PDF全文
SIMD架构用于多媒体加速,已经广泛应用于现代通用处理器中.SIMD架构的数据并行性可大大提高处理器的运算能力,但由于存储系统的速度远远不能与其匹配,使得应用程序的性能很难获得进一步的提高.因此,本文基于SIMD架构的访存特性,提出了一种向量寄存器部分重用的方法,以提高访存效率;并给出了相应的程序转换算法,通过数据相关性的分
分析,在应用程序向量化时,生成采用向量寄存器部分重用的优化代码.实验结果说明,该算法对多媒体应用程序的性能有显著的提高.  相似文献   

7.
VLIW DSP通过软件流水获得时间并行性,通过指令分簇获得空间并行性.指令的分簇本质上是资源分配问题.传统的指令分簇假设一条指令分到某一簇执行,而某些体系结构提供SIMD指令,传统的分簇算法对这类体系结构并不完全适用.提出的基于评估模型的分簇算法能对SIMD指令和普通指令进行合理的分簇.分簇之后,通过调度簇间传输指令,合成适当的簇间双字传输指令.由于SIMD和簇间双字传输的引入,以及较好的分簇决策,程序整体的调度延迟变短.对许多数字信号处理程序相对于没分簇的情况下的性能有2~3倍的性能提升,相对寄存器压力分簇算法有约7~10%性能的提升.  相似文献   

8.
一种动态VLIW调度机制的研究和实现   总被引:2,自引:0,他引:2       下载免费PDF全文
VLIW结构是开发ILP的一种重要手段,其优点是结构规整简单、硬件复杂度低。但是,完全依靠编译器进行指令调度的机制限制了VLIW结构性能的提高。本文提出了一种基于确定指令延迟的动态VLIW调度机制,该机制利用大部分指令执行时间确定的特点,根据运行时信息重新调度指令的执行顺序,以进一步开发ILP。在FPGA上的实验结果表明,该机制具有线性的硬件复杂度。  相似文献   

9.
Extensive research has been done on extracting parallelism from single instruction stream processors. This paper presents our investigation into ways to modify MIMD architectures to allow them to extract the instruction level parallelism achieved by current superscalar and VLIW machines. A new architecture is proposed which utilizes the advantages of a multiple instruction stream design while addressing some of the limitations that have prevented MIMD architectures from performing ILP operation. A new code scheduling mechanism is described to support this new architecture by partitioning instructions across multiple processing elements in order to exploit this level of parallelism.  相似文献   

10.
本文基于TTA结构提出了一种嵌入式协处理器体系结构,并完成了其VLSI设计与实现.该协处理器具有双Cluster的运算内核,能够高效地支持多媒体应用中的数据密集型计算.为了充分发挥协处理器工作效率,本文还设计了具有流缓冲代理特征的流存储子系统,通过实现数据流存储访问机制以及计算资源与片外存储之间的低耦合结构,提高访存带宽.最后,基于该嵌入式协处理器,本文在0.18μm CMOS工艺下实现了一款多核SoC芯片,其工作主频为300MHz,实测功耗为910mW.  相似文献   

11.
Wireless communication over LTE (long term evolution) brings several design challenges to industry and academia, due to its high throughput demand. Specially in the case of hand held mobile devices where the power budget is very limited and high throughput requires more computation power. On the other hand, the industry is struggling for flexible hardware solution, a Software Defined Radio (SDR), to amortize huge costs of hardware changes to suit the continued evolution in wireless standards. In this article, an MPSoC design has been presented for the baseband processing of a 20 MHz LTE system. Transport Triggered Architecture (TTA) has been preferred over conventional DSPs/VLIW architectures as processing element (PE) of MPSoC. Processing tasks are statically scheduled. Synchronization among the PEs is based on polling of a shared memory space. In addition an approach is presented to organize I/O buffer in such a way that the stalling probability of a PE should be reduced to exploit efficiently data and task level parallelism. The total power consumption by all the PEs synthesized on 130 nm technology at 200 MHz and 1.5 V is 105.04 mW. The total energy consumption to process one subframe including carrier recovery is 0.0767 mJ. Our study shows that TTA architecture brings several improvements in conventional SIMD/VLIW architectures. TTA as contrary to other run time designs has a guaranteed performance and lower energy consumption due to the fact that all the data dependency/independency issues are resolved at compile time. Further, it is also true due to the fact that TTA has a reduced register file (RF) traffic, number of RF ports and lower overall cycle count for a given task. To the best of author’s knowledge this article is among the first few published articles on LTE receiver implementation with published figures like time, frequency, power and perhaps the first article explaining further in detail about data access pattern to process an LTE subframe, memory organization, subsystem interconnection, and synchronization.  相似文献   

12.
The single‐instruction multiple‐data (SIMD) computing capability of modern processors is continually improved to deliver ever better performance and power efficiency. For example, Intel has increased SIMD register lengths from 128 bits in streaming SIMD extension to 512 bits in AVX‐512. The ARM scalable vector extension supports SIMD register length up to 2048 bits and includes predicated instructions. However, SIMD instruction translation in dynamic binary translation has not received similar attention. For example, the widely used QEMU emulates guest SIMD instructions with a sequence of scalar instructions, even when the host machines have relevant SIMD instructions. This leaves significant potential for performance enhancement. We propose a newly designed SIMD translation framework for dynamic binary translation, which takes advantage of the host's SIMD capabilities. The proposed framework has been built in HQEMU, an enhanced QEMU with a separate thread for applying LLVM optimizations. The current prototype supports ARMv7, ARMv8, and IA32 guests on the X86‐64 AVX‐2 host. Compared with the scalar‐translation version HQEMU, our framework runs up to 1.84 times faster on Standard Performance Evaluation Corporation 2006 CFP benchmarks and up to 6.81 times faster on selected real applications.  相似文献   

13.
It is an established trend that CPU development takes advantage of Moore's Law to improve in parallelism much more than in scalar execution speed. This results in higher hardware thread counts (MIMD) and improved vector units (SIMD), of which the MIMD developments have received the focus of library research and development in recent years. To make use of the latest hardware improvements, SIMD must receive a stronger focus of API research and development because the computational power can no longer be neglected and often auto‐vectorizing compilers cannot generate the necessary SIMD code, as will be shown in this paper. Nowadays, the SIMD capabilities are sufficiently significant to warrant vectorization of algorithms requiring more conditional execution than was originally expected for Streaming SIMD Extension to handle. The Vc library ( http://compeng.uni‐frankfurt.de/?vc ) was designed to support developers in the creation of portable vectorized code. Its capabilities and performance have been thoroughly tested. Vc provides portability of the source code, allowing full utilization of the hardware's SIMD capabilities, without introducing any overhead. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

14.
何义  何圣  彭向军  戴健  张春元 《软件》2011,32(9):45-48
随着媒体处理和科学计算等应用领域数据级并行性的需求不断增加,SIMD体系结构以其固有的易扩展数据并行处理结构被广泛采用且系统规模日益增大,这使得SIMD体系结构的仿真测试逐渐成为难题,仿真速度与成本的矛盾加剧。本文提出了一种适用于SIMD体系结构的多时钟耦合仿真技术,它采用多个不同频率的时钟分别控制仿真系统的不同功能模块,实现计算单元的分时复用。实验结果表明,多时钟耦合仿真技术能有效提高FPGA芯片的仿真能力,增强仿真系统的灵活可配置性,降低了硬件仿真的成本。  相似文献   

15.
Control architectures based on emotions are becoming promising solutions for the implementation of future robotic systems. The basic controllers of this architecture are the emotional processes that decide which behaviors the robot must activate to fulfill the objectives. The number of emotional processes increases (hundreds of millions/s) with the complexity level of the application, limiting the processing capacity of a main processor to solve the complex problems. Fortunately, the potential parallelism of emotional processes permits their execution in parallel, hence enabling the computing power to tackle the complex dynamic problems. In this paper, Graphic Processing Unit (GPU), multicore processors and single instruction multiple data (SIMD) instructions are used to provide parallelism for the emotional processes. Different GPUs, multicore processors and SIMD instruction sets are evaluated and compared to analyze their suitability to cope with robotic applications. The applications are set-up taking into account different environmental conditions, robot dynamics and emotional states. Experimental results show that, despite the fact that GPUs have a bottleneck in the data transmission between the host and the device, the evaluated GTX 670 GPU provides a performance of more than one order of magnitude greater than the initial implementation of the architecture on a single core. Thus, all complex proposed application problems can be solved using the GPU technology in contrast to the first prototype where only 55% of them could be solved. Using AVX SIMD instructions, the performance of the architecture is increased in 3.25 times in relation to the first implementation. Thus, from the 27 proposed applications about 88.8% are solved. In the case of the SSE SIMD instructions, the performance is almost doubled and the robot could solve about 74% of the proposed application problems. The use of AVX and SSE SIMD instructions provides almost the same performance as a quad- and a dual-core, respectively, with the advantage that they do not add any additional hardware cost.  相似文献   

16.
Modern field programmable gate array (FPGA) chips, with their larger memory capacity and reconfigurability potential, are opening new frontiers in rapid prototyping of embedded systems. With the advent of high-density FPGAs, it is now possible to implement a high-performance VLIW (very long instruction word) processor core in an FPGA. With VLIW architecture, the processor effectiveness depends on the ability of compilers to provide sufficient ILP (instruction-level parallelism) from program code. This paper describes research result about enabling the VLIW processor model for real-time processing applications by exploiting FPGA technology. Our goals are to keep the flexibility of processors to shorten the development cycle, and to use the powerful FPGA resources to increase real-time performance. We present a flexible VLIW VHDL processor model with a variable instruction set and a customizable architecture which allows exploiting intrinsic parallelism of a target application using advanced compiler technology and implementing it in an optimal manner on FPGA. Some common algorithms of image processing were tested and validated using the proposed development cycle. We also realized the rapid prototyping of embedded contactless palmprint extraction on an FPGA Virtex-6 based board for a biometric application and obtained a processing time of 145.6 ms per image. Our approach applies some criteria for co-design tools: flexibility, modularity, performance, and reusability.  相似文献   

17.
Demand for mobile video applications is growing today in wireless handheld platforms. Optimizing instruction set architectures and employing SIMD techniques is a logical approach towards attaining higher performance in mobile multimedia applications. Intel® Wireless MMX? technology has been designed to accelerate mobile multimedia and applications processing in a power efficient manner. This paper provides an overview of Intel® Wireless MMX? technology, a 64-bit Single Instruction Multiple Data (SIMD) coprocessor for the Intel® XScale® microarchitecture, and the key features of the architecture that specifically enhance the multi-media performance. Tools and techniques for optimization are also described.  相似文献   

18.
Speculative execution is the execution of instructions before it is known whether these instructions should be executed. In the speculative execution for instruction level parallelism (ILP) processors, the concept of shadow register provides a hardware solution to maintain semantics of a program from the pollution of boosted instructions that are incorrectly predicted. In a recent study, Chang and Lai proposed a special register file based on shadow register, named conjugate register file (CRF), to support multilevel boosting in speculative execution. They also proposed a scheduling heuristic named frequency-driven scheduling to incorporate with CRF for execution. However, the ability of boosting is still constrained since the concept of register pair will force the results produced speculatively be stored in dedicated locations. Moreover, when the parallelism potential increases to tens through the advancement of hardware techniques, the heavy demand on register usage and the complexity of register file may well become a serious bottleneck for the exploitation of ILP.In this paper, the algorithm of frequency-driven scheduling is modified by replacing the function of hardware CRF with the technique of variable renaming during compilation. The new scheduling technique, named LESS, can exploit the parallelism efficiently with limited number of registers. Moreover, since the technique can benefit ILP without any special hardware support, it can be incorporated with any other ILP architecture without changing its instruction set architecture (ISA).Simulation results show that the performance achievable by LESS is better than other existing methods. For example, under the ILP model with an issue rate of 8, the speculative execution can achieve an increase of 34% in parallelism, as compared to 18% in CRF scheme.  相似文献   

19.
模板匹配是进行滤波、边缘检测、目标识别和图像匹配的一种基本和有效的方法。但是模板匹配是一种密集型运算,在单处理机上实现耗时较多,但是如采用并行阵列计算机,硬软件成本也会相应提高。所幸Intel处理器提供了MMX/SSE/SSE2指令集,支持指令级SIMD操作。可将模板匹配主要运算部分进行SIMD并行化,在Linux平台下编程实现单处理机上的并行处理。测试结果表明:SIMD大大加快了模板匹配的速度。  相似文献   

20.
一种改进的嵌入式SIMD协处理器设计   总被引:1,自引:0,他引:1  
论文介绍的SIMD协处理器是用于低层图像理解的16位定点嵌入式阵列处理器。该协处理器采用load/store体系结构,并且除SIMD固有的数据并行性外,还具有三级流水和三组指令并发执行的并行性。三组指令并发执行使数据交换操作和其它类型操作并发执行,从而实现了数据交换操作的隐含执行,大大减少了通信和I/O操作的开销。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号