1.

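Design and Implementation of a Compiler for a Clustered VLIW DSP Total citations: 5; self-citations: 0; citations by others: 5

胡定磊陈书明刘春林《小型微型计算机系统》2006,27(2):348-353

Very long instruction word (VLIW) is the architecture commonly adopted by high-end DSPs. A VLIW DSP has no hardware mechanism for scheduling or conflict resolution, so its performance depends entirely on how well the compiler optimizes. Based on the retargetable compiler infrastructure IMPACT, an optimizing compiler was designed and implemented for the clustered VLIW DSP YHFT-D4. The paper focuses on the definition of retargeting information, code annotation, support for SIMD instructions, clustered register allocation, exploitation of instruction-level parallelism, and resolution of resource conflicts. Experimental results show that the compiler achieves good optimization results.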

2.

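A Software Fanout Tree Generation Algorithm Based on Profiling Information and Critical Path Length

曾斌安虹王莉《计算机科学》2010,37(3):248-252

Exploiting instruction-level parallelism (ILP) is one of the key factors in the performance of modern high-performance processors. Wide-issue superscalar processors, VLIW processors, and dataflow processors achieve high performance only when they execute multiple neighboring instructions in parallel. A key problem for dataflow processors is how to deliver an instruction's result efficiently to its target instructions without reading or writing a centralized register file. For every instruction whose number of targets exceeds the number it can encode, the compiler must insert a software fanout tree built from MOV instructions to deliver the result to the multiple target instructions. To expose more ILP to the hardware execution substrate, an improved software fanout tree generation algorithm is proposed. It computes a weight for each target instruction from that instruction's execution probability and the length of the critical path from the target instruction to the exit of its block, sorts the leaves by priority weight, and then builds the software fanout tree in that order so that the result is delivered to the multiple target instructions. Experiments show that the algorithm performs considerably better than the traditional software fanout tree generation algorithm.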
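The priority-ordered fanout tree construction summarized in this abstract can be illustrated with a small sketch. The abstract does not give the weight formula or the tree shape, so everything here is an assumption made for illustration: the weight is taken as execution probability times critical path length, each instruction is assumed to encode at most `max_targets` targets, and the tree is a simple skewed chain of MOV nodes. The names `Target`, `priority`, and `build_fanout_tree` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    exec_prob: float  # profiled probability that the target executes
    crit_path: int    # critical path length from the target to the block exit


def priority(t: Target) -> float:
    # Assumed weighting; the abstract does not state the exact formula.
    return t.exec_prob * t.crit_path


def build_fanout_tree(targets, max_targets=2):
    """Return {target name: number of MOV hops between producer and target}.

    Targets are sorted by descending priority, so the most important
    targets end up closest to the producing instruction.
    """
    if max_targets < 2:
        raise ValueError("each instruction must encode at least two targets")
    ordered = sorted(targets, key=priority, reverse=True)
    hops = {}
    level = 0
    while len(ordered) > max_targets:
        # The current node feeds (max_targets - 1) real targets directly
        # and spends its last slot on a MOV that forwards to the rest.
        for t in ordered[:max_targets - 1]:
            hops[t.name] = level
        ordered = ordered[max_targets - 1:]
        level += 1
    for t in ordered:
        hops[t.name] = level
    return hops


if __name__ == "__main__":
    demo = [Target("a", 0.9, 10), Target("b", 0.5, 4),
            Target("c", 0.1, 30), Target("d", 1.0, 1)]
    print(build_fanout_tree(demo))
```

In this sketch a target with a high weight receives the value after fewer MOV hops, which is the intuition behind ordering the leaves by priority; a balanced tree or a different weight function would be equally consistent with the abstract.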

3.

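SIMD Compilation Optimization Supporting Single/Double-Word Mode Selection on a Clustered VLIW DSP

黄胜兵郑启龙郭连伟《计算机应用》2015,35(8):2371-2374

BWDSP100 is a 32-bit static scalar digital signal processor designed for high-performance computing that combines very long instruction word (VLIW) and single instruction, multiple data (SIMD) architectures. Its instruction-level parallelism (ILP) is achieved mainly through its special clustered architecture and its SIMD instructions, but existing compiler frameworks cannot support these special SIMD instructions. Because BWDSP100 has abundant SIMD vectorization resources, and because radar digital signal processing, the domain in which it is used, places very high demands on program performance, a SIMD compilation optimization algorithm supporting single/double-word mode selection was proposed and implemented on top of the SIMD optimization framework of the traditional Open64 compiler, tailored to the characteristics of the BWDSP100 architecture. The algorithm significantly improves the performance of compute-intensive programs that are widely used on DSPs. Experimental results show that, compared with the unoptimized version, the implementation in the BWDSP compiler achieves an average speedup of 5.66.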

4.

Concurrent warp execution: improving performance of GPU-likely SIMD architecture by increasing resource utilization

Hong Jun Choi Dong Oh Son Jong Myon Kim Cheol Hong Kim 《The Journal of supercomputing》2014,69(1):330-356

Hardware parallelism should be exploited to improve the performance of computing systems. Single instruction multiple data (SIMD) architecture has been widely used to maximize the throughput of computing systems by exploiting hardware parallelism. Unfortunately, branch divergence due to branch instructions causes underutilization of computational resources, resulting in performance degradation of SIMD architecture. Graphics processing unit (GPU) is a representative parallel architecture based on SIMD architecture. In recent computing systems, GPUs can process general-purpose applications as well as graphics applications with the help of convenient APIs. However, contrary to graphics applications, general-purpose applications include many branch instructions, resulting in serious performance degradation of GPU due to branch divergence. In this paper, we propose a concurrent warp execution (CWE) technique to reduce the performance degradation of GPU in executing general-purpose applications by increasing resource utilization. The proposed CWE enables selecting co-warps to activate more threads in the warp, leading to concurrent execution of combined warps. According to our simulation results, the proposed architecture provides a significant performance improvement (5.85% over PDOM, 91% over DWF) with little hardware overhead.
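The underutilization caused by branch divergence can be made concrete with a toy lane-occupancy model. This is only a sketch under assumptions: warps are represented as lists of 0/1 active-lane masks, and two warps are treated as combinable when no lane is active in both. It is not the co-warp selection logic of the paper, and the function names are hypothetical.

```python
def lane_utilization(masks):
    """Fraction of SIMD lanes doing useful work over a list of issue slots."""
    width = len(masks[0])
    active = sum(sum(m) for m in masks)
    return active / (width * len(masks))


def can_combine(mask_a, mask_b):
    """True when no lane is active in both masks (assumed pairing condition)."""
    return not any(a and b for a, b in zip(mask_a, mask_b))


def combine(mask_a, mask_b):
    """Lane-wise union of two active masks."""
    return [a or b for a, b in zip(mask_a, mask_b)]


if __name__ == "__main__":
    warp_a = [1, 1, 0, 0, 0, 0, 0, 1]  # 3 of 8 lanes active after divergence
    warp_b = [0, 0, 1, 1, 1, 1, 1, 0]  # 5 of 8 lanes active
    print(lane_utilization([warp_a, warp_b]))  # issued one after the other
    if can_combine(warp_a, warp_b):
        print(lane_utilization([combine(warp_a, warp_b)]))  # issued together
```

Issuing the two partially active warps separately fills half of the lane slots in this example, while issuing them together fills all of them; real hardware adds constraints (register file ports, program counters per warp) that this model ignores.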

5.

MMX technology extension to the Intel architecture Total citations: 2; self-citations: 0; citations by others: 2

Peleg A. Weiser U. 《Micro, IEEE》1996,16(4):42-50

Designed to accelerate multimedia and communications software, MMX technology improves performance by introducing data types and instructions to the IA that exploit the parallelism in these applications. MMX technology extends the Intel architecture (IA) to improve the performance of multimedia, communications, and other numeric-intensive applications. It uses a SIMD (single-instruction, multiple-data) technique to exploit the parallelism inherent in many algorithms, producing full application performance of 1.5 to 2 times faster than the same applications run on the same processor without MMX. The extension also maintains full compatibility with existing IA microprocessors, operating systems, and applications while providing new instructions and data types that applications can use to achieve a higher level of performance on the host CPU.
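As an illustration of the packed data types and SIMD instructions the abstract refers to, the sketch below emulates in plain Python what a packed unsigned saturating byte add does on a 64-bit word holding eight 8-bit lanes. MMX provides such an operation as the PADDUSB instruction; the Python function named `paddusb` is only a software model written for this example, not an Intel API.

```python
def paddusb(a: int, b: int) -> int:
    """Model of an 8-lane unsigned saturating byte add on 64-bit words."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        # Saturate at 255 instead of wrapping around, as pixel math requires.
        result |= min(x + y, 0xFF) << shift
    return result


if __name__ == "__main__":
    # Eight pixel values brightened in one "instruction".
    print(hex(paddusb(0x0102F0FF0010807F, 0x0102200100108001)))
```

The hardware performs all eight lane additions with a single instruction, whereas scalar code needs a loop with an explicit clamp per element; that difference is the source of the speedups the abstract reports for multimedia code.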

6.

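A Method for Partial Reuse of Vector Registers in SIMD Optimization Total citations: 1; self-citations: 0; citations by others: 1

钱兴隆臧斌宇朱传琪《计算机工程与科学》2007,29(5):141-146

SIMD architectures, used for multimedia acceleration, are now widely deployed in modern general-purpose processors. The data parallelism of SIMD greatly increases a processor's computing power, but because the memory system is far too slow to keep up, application performance is hard to improve further. Based on the memory-access characteristics of SIMD architectures, this paper proposes a method for partial reuse of vector registers to improve memory-access efficiency, and gives the corresponding program transformation algorithm: using data dependence analysis, it generates optimized code that partially reuses vector registers when the application is vectorized. Experimental results show that the algorithm significantly improves the performance of multimedia applications.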

7.

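A Clustering Algorithm for VLIW DSPs on Architectures Supporting SIMD and Inter-Cluster Double-Word Transfers

陈思灵郑启龙冯玉谦付和萍《计算机系统应用》2012,21(10):100-104

VLIW DSPs obtain temporal parallelism through software pipelining and spatial parallelism through instruction clustering. Instruction clustering is essentially a resource allocation problem. Traditional instruction clustering assumes that an instruction is assigned to a single cluster for execution, but some architectures provide SIMD instructions, for which traditional clustering algorithms are not fully applicable. The proposed clustering algorithm, based on an evaluation model, clusters both SIMD and ordinary instructions sensibly. After clustering, inter-cluster transfer instructions are scheduled and suitable inter-cluster double-word transfer instructions are synthesized. Owing to the introduction of SIMD and inter-cluster double-word transfers, together with better clustering decisions, the overall schedule latency of a program becomes shorter. For many digital signal processing programs, performance improves by a factor of 2 to 3 over the unclustered case, and by about 7% to 10% over a register-pressure clustering algorithm.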

8.

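Research and Implementation of a Dynamic VLIW Scheduling Mechanism Total citations: 2; self-citations: 0; citations by others: 2

李云照王志英沈立《计算机工程与科学》2008,30(7):90-93

VLIW is an important means of exploiting ILP; its advantages are a regular, simple structure and low hardware complexity. However, relying entirely on the compiler for instruction scheduling limits the performance of VLIW architectures. This paper proposes a dynamic VLIW scheduling mechanism based on deterministic instruction latencies: exploiting the fact that most instructions have deterministic execution times, it reschedules the execution order of instructions according to run-time information to exploit ILP further. Experiments on an FPGA show that the mechanism has linear hardware complexity.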

9.

Code scheduling for multiple instruction stream architectures

Gary Tyson Matthew Farrens 《International journal of parallel programming》1994,22(3):243-272

Extensive research has been done on extracting parallelism from single instruction stream processors. This paper presents our investigation into ways to modify MIMD architectures to allow them to extract the instruction level parallelism achieved by current superscalar and VLIW machines. A new architecture is proposed which utilizes the advantages of a multiple instruction stream design while addressing some of the limitations that have prevented MIMD architectures from performing ILP operation. A new code scheduling mechanism is described to support this new architecture by partitioning instructions across multiple processing elements in order to exploit this level of parallelism.

10.

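Design and Implementation of an Embedded Coprocessor Based on the TTA Architecture

赖明澈戴葵陆洪毅岳虹王志英《计算机科学》2008,35(2):293-297

This paper proposes an embedded coprocessor architecture based on the TTA structure and completes its VLSI design and implementation. The coprocessor has a dual-cluster computing core and efficiently supports the data-intensive computation found in multimedia applications. To make full use of the coprocessor, the paper also designs a stream memory subsystem with stream-buffer-agent features, which raises memory bandwidth by implementing a data-stream memory access mechanism and a loosely coupled structure between the computing resources and off-chip memory. Finally, based on this embedded coprocessor, a multi-core SoC chip was implemented in a 0.18 μm CMOS process; it runs at a clock frequency of 300 MHz and has a measured power consumption of 910 mW.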

11.

MPSoC based on Transport Triggered Architecture for baseband processing of an LTE receiver

《Journal of Systems Architecture》2014,60(1):140-149

Wireless communication over LTE (long term evolution) brings several design challenges to industry and academia, due to its high throughput demand, especially in the case of handheld mobile devices, where the power budget is very limited and high throughput requires more computation power. On the other hand, the industry is struggling for a flexible hardware solution, a Software Defined Radio (SDR), to amortize the huge costs of hardware changes to suit the continued evolution in wireless standards. In this article, an MPSoC design is presented for the baseband processing of a 20 MHz LTE system. Transport Triggered Architecture (TTA) has been preferred over conventional DSP/VLIW architectures as the processing element (PE) of the MPSoC. Processing tasks are statically scheduled. Synchronization among the PEs is based on polling of a shared memory space. In addition, an approach is presented to organize the I/O buffer in such a way that the stalling probability of a PE is reduced, to exploit data and task level parallelism efficiently. The total power consumption by all the PEs synthesized on 130 nm technology at 200 MHz and 1.5 V is 105.04 mW. The total energy consumption to process one subframe including carrier recovery is 0.0767 mJ. Our study shows that the TTA architecture brings several improvements over conventional SIMD/VLIW architectures. TTA, in contrast to other run-time designs, has guaranteed performance and lower energy consumption because all the data dependency/independency issues are resolved at compile time. Further, this also holds because TTA has reduced register file (RF) traffic, fewer RF ports, and a lower overall cycle count for a given task.
To the best of the authors' knowledge, this article is among the first few published articles on LTE receiver implementation with published figures such as time, frequency, and power, and perhaps the first article explaining in further detail the data access pattern to process an LTE subframe, memory organization, subsystem interconnection, and synchronization.

12.

Efficient and retargetable SIMD translation in a dynamic binary translator

Sheng‐Yu Fu Ding‐Yong Hong Yu‐Ping Liu Jan‐Jan Wu Wei‐Chung Hsu 《Software》2018,48(6):1312-1330

The single‐instruction multiple‐data (SIMD) computing capability of modern processors is continually improved to deliver ever better performance and power efficiency. For example, Intel has increased SIMD register lengths from 128 bits in streaming SIMD extension to 512 bits in AVX‐512. The ARM scalable vector extension supports SIMD register length up to 2048 bits and includes predicated instructions. However, SIMD instruction translation in dynamic binary translation has not received similar attention. For example, the widely used QEMU emulates guest SIMD instructions with a sequence of scalar instructions, even when the host machines have relevant SIMD instructions. This leaves significant potential for performance enhancement. We propose a newly designed SIMD translation framework for dynamic binary translation, which takes advantage of the host's SIMD capabilities. The proposed framework has been built in HQEMU, an enhanced QEMU with a separate thread for applying LLVM optimizations. The current prototype supports ARMv7, ARMv8, and IA32 guests on the X86‐64 AVX‐2 host. Compared with the scalar‐translation version HQEMU, our framework runs up to 1.84 times faster on Standard Performance Evaluation Corporation 2006 CFP benchmarks and up to 6.81 times faster on selected real applications.

13.

Vc: A C++ library for explicit vectorization

Matthias Kretz Volker Lindenstruth 《Software》2012,42(11):1409-1430

It is an established trend that CPU development takes advantage of Moore's Law to improve in parallelism much more than in scalar execution speed. This results in higher hardware thread counts (MIMD) and improved vector units (SIMD), of which the MIMD developments have received the focus of library research and development in recent years. To make use of the latest hardware improvements, SIMD must receive a stronger focus of API research and development because the computational power can no longer be neglected and often auto‐vectorizing compilers cannot generate the necessary SIMD code, as will be shown in this paper. Nowadays, the SIMD capabilities are sufficiently significant to warrant vectorization of algorithms requiring more conditional execution than was originally expected for Streaming SIMD Extension to handle. The Vc library ( http://compeng.uni‐frankfurt.de/?vc ) was designed to support developers in the creation of portable vectorized code. Its capabilities and performance have been thoroughly tested. Vc provides portability of the source code, allowing full utilization of the hardware's SIMD capabilities, without introducing any overhead. Copyright © 2011 John Wiley & Sons, Ltd.

14.

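A Multi-Clock Coupled Simulation Technique for SIMD Architectures

何义何圣彭向军戴健张春元《软件》2011,32(9):45-48

As demand for data-level parallelism grows in application domains such as media processing and scientific computing, SIMD architectures, with their inherently easy-to-scale data-parallel processing structure, have been widely adopted, and system sizes keep increasing. This makes simulation and testing of SIMD architectures increasingly difficult and sharpens the conflict between simulation speed and cost. This paper proposes a multi-clock coupled simulation technique for SIMD architectures, which uses several clocks of different frequencies to control different functional modules of the simulation system and so time-multiplexes the computing units. Experimental results show that the technique effectively increases the simulation capacity of FPGA chips, makes the simulation system more flexible and configurable, and lowers the cost of hardware simulation.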

15.

Embedded GPU and multicore processors for emotional-based mobile robotic agents

《Future Generation Computer Systems》2016

Control architectures based on emotions are becoming promising solutions for the implementation of future robotic systems. The basic controllers of this architecture are the emotional processes that decide which behaviors the robot must activate to fulfill the objectives. The number of emotional processes increases (hundreds of millions/s) with the complexity level of the application, limiting the processing capacity of a main processor to solve the complex problems. Fortunately, the potential parallelism of emotional processes permits their execution in parallel, hence enabling the computing power to tackle the complex dynamic problems. In this paper, Graphic Processing Unit (GPU), multicore processors and single instruction multiple data (SIMD) instructions are used to provide parallelism for the emotional processes. Different GPUs, multicore processors and SIMD instruction sets are evaluated and compared to analyze their suitability to cope with robotic applications. The applications are set up taking into account different environmental conditions, robot dynamics and emotional states. Experimental results show that, despite the fact that GPUs have a bottleneck in the data transmission between the host and the device, the evaluated GTX 670 GPU provides a performance more than one order of magnitude greater than the initial implementation of the architecture on a single core. Thus, all complex proposed application problems can be solved using the GPU technology, in contrast to the first prototype, where only 55% of them could be solved. Using AVX SIMD instructions, the performance of the architecture is increased 3.25 times relative to the first implementation. Thus, of the 27 proposed applications, about 88.8% are solved. In the case of the SSE SIMD instructions, the performance is almost doubled and the robot could solve about 74% of the proposed application problems.
The use of AVX and SSE SIMD instructions provides almost the same performance as a quad- and a dual-core, respectively, with the advantage that they do not add any additional hardware cost.

16.

Flexible VLIW processor based on FPGA for efficient embedded real-time image processing

Vincent Brost Fan Yang Charles Meunier 《Journal of Real-Time Image Processing》2014,9(1):47-59

Modern field programmable gate array (FPGA) chips, with their larger memory capacity and reconfigurability potential, are opening new frontiers in rapid prototyping of embedded systems. With the advent of high-density FPGAs, it is now possible to implement a high-performance VLIW (very long instruction word) processor core in an FPGA. With VLIW architecture, the processor effectiveness depends on the ability of compilers to provide sufficient ILP (instruction-level parallelism) from program code. This paper describes research results on enabling the VLIW processor model for real-time processing applications by exploiting FPGA technology. Our goals are to keep the flexibility of processors to shorten the development cycle, and to use the powerful FPGA resources to increase real-time performance. We present a flexible VLIW VHDL processor model with a variable instruction set and a customizable architecture which allows exploiting intrinsic parallelism of a target application using advanced compiler technology and implementing it in an optimal manner on FPGA. Some common algorithms of image processing were tested and validated using the proposed development cycle. We also realized the rapid prototyping of embedded contactless palmprint extraction on an FPGA Virtex-6 based board for a biometric application and obtained a processing time of 145.6 ms per image. Our approach applies some criteria for co-design tools: flexibility, modularity, performance, and reusability.

17.

Optimizing mobile multimedia using SIMD techniques

N. C. Paver M. H. Khan B. C. Aldrich 《Multimedia Tools and Applications》2006,28(2):221-238

Demand for mobile video applications is growing today in wireless handheld platforms. Optimizing instruction set architectures and employing SIMD techniques is a logical approach towards attaining higher performance in mobile multimedia applications. Intel® Wireless MMX™ technology has been designed to accelerate mobile multimedia and applications processing in a power efficient manner. This paper provides an overview of Intel® Wireless MMX™ technology, a 64-bit Single Instruction Multiple Data (SIMD) coprocessor for the Intel® XScale® microarchitecture, and the key features of the architecture that specifically enhance the multi-media performance. Tools and techniques for optimization are also described.

18.

On the Boosting of Instruction Scheduling by Renaming

Wang L. Yang Ted C. 《The Journal of supercomputing》2001,19(2):173-197

Speculative execution is the execution of instructions before it is known whether these instructions should be executed. In the speculative execution for instruction level parallelism (ILP) processors, the concept of shadow register provides a hardware solution to maintain semantics of a program from the pollution of boosted instructions that are incorrectly predicted. In a recent study, Chang and Lai proposed a special register file based on shadow register, named conjugate register file (CRF), to support multilevel boosting in speculative execution. They also proposed a scheduling heuristic named frequency-driven scheduling to incorporate with CRF for execution. However, the ability of boosting is still constrained since the concept of register pair will force the results produced speculatively be stored in dedicated locations. Moreover, when the parallelism potential increases to tens through the advancement of hardware techniques, the heavy demand on register usage and the complexity of register file may well become a serious bottleneck for the exploitation of ILP. In this paper, the algorithm of frequency-driven scheduling is modified by replacing the function of hardware CRF with the technique of variable renaming during compilation. The new scheduling technique, named LESS, can exploit the parallelism efficiently with limited number of registers. Moreover, since the technique can benefit ILP without any special hardware support, it can be incorporated with any other ILP architecture without changing its instruction set architecture (ISA). Simulation results show that the performance achievable by LESS is better than other existing methods. For example, under the ILP model with an issue rate of 8, the speculative execution can achieve an increase of 34% in parallelism, as compared to 18% in CRF scheme.

19.

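Template Matching Optimization Based on SIMD Programming on the Linux Platform

陈辉龚浩张燕忠《计算机测量与控制》2004,12(12):1222-1225

Template matching is a basic and effective method for filtering, edge detection, object recognition, and image matching. It is, however, computationally intensive: on a uniprocessor it is time-consuming, while a parallel array computer raises hardware and software costs accordingly. Fortunately, Intel processors provide the MMX/SSE/SSE2 instruction sets, which support instruction-level SIMD operations. The main computation of template matching can be parallelized with SIMD and programmed on the Linux platform to achieve parallel processing on a uniprocessor. Test results show that SIMD greatly speeds up template matching.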
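To show which part of template matching lends itself to SIMD, here is a scalar reference sketch. The abstract does not say which similarity measure the authors used; sum of absolute differences (SAD) is assumed here because it is a common choice, and the function names are hypothetical. The per-row inner loop in `sad_row` is the regular, data-parallel part that a SIMD instruction could process many pixels at a time.

```python
def sad_row(a, b):
    # Contiguous pixels, identical operation on every element:
    # this is the loop a SIMD implementation would vectorize.
    return sum(abs(x - y) for x, y in zip(a, b))


def match_template(image, template):
    """Return (row, col, score) of the window with the smallest SAD score."""
    th, tw = len(template), len(template[0])
    best = None
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            score = sum(sad_row(image[r + i][c:c + tw], template[i])
                        for i in range(th))
            if best is None or score < best[2]:
                best = (r, c, score)
    return best


if __name__ == "__main__":
    image = [[0, 0, 0, 0],
             [0, 0, 9, 8],
             [0, 0, 7, 6],
             [0, 0, 0, 0]]
    template = [[9, 8],
                [7, 6]]
    print(match_template(image, template))
```

The outer loops over window positions stay scalar in this sketch; only the element-wise difference and accumulation are candidates for packed instructions, which is consistent with the abstract's remark that the main computation is what gets parallelized.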

20.

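An Improved Embedded SIMD Coprocessor Design Total citations: 1; self-citations: 0; citations by others: 1

周国昌王忠车德亮冯国臣《计算机工程与应用》2004,40(31):13-16

The SIMD coprocessor described in this paper is a 16-bit fixed-point embedded array processor for low-level image understanding. It uses a load/store architecture and, in addition to the data parallelism inherent in SIMD, offers parallelism from a three-stage pipeline and the concurrent execution of three groups of instructions. Concurrent execution of the three instruction groups lets data-exchange operations run concurrently with other types of operations, so that data exchange is effectively hidden and the overhead of communication and I/O operations is greatly reduced.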