1.
2.
3.
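The Hunxin (魂芯) DSP is a 32-bit static scalar digital signal processor with a VLIW and SIMD architecture, designed for high-performance computing. To meet the demands of high-performance numerical computing, it provides a rich set of complex-number instructions, but the compiler cannot use these instructions directly to improve the performance of compiled code. Targeting this large set of complex-number instructions, the work builds on the compilation framework of the open-source compiler Open64 and adds support for complex numbers as a basic compiler type, along with complex arithmetic operations. It also recognizes specific patterns of complex-number operations and maps them to the Hunxin DSP's complex instructions to optimize compiled programs. Experimental results show that, with this scheme, the Hunxin DSP compiler achieves an average speedup of 5.28 on complex-number programs.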
4.
5.
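Single instruction, multiple data (SIMD) technology is now widely used in digital signal processors (DSPs), and most existing vectorizing compilers implement automatic vectorization. These compilers, however, are poorly suited to SIMD auto-vectorization for DSP targets: complex DSP instruction sets, DSP-specific addressing models, data dependences and unaligned data all lead to low vectorization efficiency. To address this, the instruction analysis and redundancy optimization algorithms of superword-level parallelism (SLP) auto-vectorization were extended and improved in the back end of an Open64-based SLP auto-vectorizing compilation system, so that it generates more efficient vectorized code. Experimental results show that the optimization effectively improves DSP performance and reduces power consumption.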
6.
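With the rapid development of multimedia technology, more and more processors integrate SIMD extensions, and most current compilers implement automatic vectorization. To exploit parallelism within loop iterations, some compilers have introduced the SLP vectorization method into their auto-vectorization modules. Because multimedia data is densely stored and processed with regular operations, handling it requires frequent data type conversions, yet current SLP vectorization methods handle such conversions poorly. To find more SLP vectorization opportunities in programs with many type conversion statements, an SLP extraction method for type conversion statements is proposed. Within the SLP vectorization framework, it uses data reorganization to convert between data types that have the same vectorization factor as well as between types with different vectorization factors. Experimental results show that the method effectively extracts SLP vectorization from type conversion statements and improves the efficiency of vectorized program execution.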
7.
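Design and Implementation of a Compiler for a Clustered VLIW DSP  Total citations: 5 (self-citations: 0; citations by others: 5)
Very long instruction word (VLIW) is the architecture commonly adopted by high-end DSPs. A VLIW DSP has no hardware mechanism for scheduling or conflict resolution, so its performance depends entirely on how well the compiler optimizes. Based on the retargetable compiler infrastructure IMPACT, an optimizing compiler was designed and implemented for the clustered VLIW DSP YHFT-D4. The discussion focuses on the definition of retargeting information, code annotation, support for SIMD instructions, clustered register allocation, and the exploitation of instruction-level parallelism together with the resolution of resource conflicts. Experimental results show that the compiler achieves good optimization results.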
8.
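SIMD Code Generation for Multimedia Processors  Total citations: 1 (self-citations: 0; citations by others: 1)
SIMD (Single Instruction Multiple Data) multimedia extensions to general-purpose processors provide new architectural support for improving the performance of multimedia applications, but current compiler technology does not support these instructions well. This paper proposes a new SIMD instruction generation algorithm. It combines program analysis in the compiler front end with machine information in the compiler back end, and uses an extended tree parsing technique to identify parallel operations in a program and generate SIMD instructions. Experiments based on the SUIF (Stanford University Intermediate Format) compiler framework show that, for a set of multimedia kernels, the proposed algorithm reduces cycles by 47% on average relative to the non-SIMD code.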
9.
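Quadruple-precision triangular matrix solve (QTRSM) was implemented on an ARMv8 64-bit multi-core processor, based on OpenBLAS. QTRSM was implemented with two data formats: the first implementation relies on the GCC compiler's support for the long double data type, and the second uses the double-double data format with its corresponding quadruple-precision addition, subtraction, multiplication and division. With the long double QTRSM as the baseline, its accuracy and run time were compared with those of the double-double QTRSM for different matrix sizes. Experimental results show that the two produce numerical results of approximately the same accuracy, but the double-double QTRSM is 1.6 times as fast as the long double QTRSM. As the number of threads increases, the speedup of both QTRSM implementations approaches 2.0, indicating good scalability.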
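The double-double format mentioned in this abstract stores a value as an unevaluated sum of two doubles. The following is a minimal Python sketch of the standard error-free transformations (Knuth's two-sum and Dekker's product) and of a simple double-double add and multiply built on them. It is not taken from the paper: the function names are illustrative, the addition is the simpler "sloppy" variant, and overflow and underflow are ignored.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and s + err == a + b exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def fast_two_sum(a, b):
    """Same as two_sum, but assumes |a| >= |b|."""
    s = a + b
    err = b - (s - a)
    return s, err

def split(a):
    """Veltkamp split of a double into two non-overlapping halves."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    lo = a - hi
    return hi, lo

def two_prod(a, b):
    """Return (p, err) with p = fl(a * b) and p + err == a * b exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

def dd_add(x, y):
    """Add two double-double values given as (hi, lo) pairs."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return fast_two_sum(s, e)

def dd_mul(x, y):
    """Multiply two double-double values given as (hi, lo) pairs."""
    p, e = two_prod(x[0], y[0])
    e += x[0] * y[1] + x[1] * y[0]
    return fast_two_sum(p, e)
```

For example, `dd_add((1.0, 0.0), (2.0**-80, 0.0))` keeps the tiny addend in the low word instead of rounding it away, which a plain double addition cannot do.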
10.
11.
12.
Data abstractions have been proposed as a means to enhance program modularity. Adding such new features to an existing language is typically handled either by rewriting large portions of an existing compiler or by using a preprocessor to translate the extensions into the standard language. The first technique is expensive to implement, while the latter is usually slow and clumsy to use. In this paper a data abstraction addition to PL 1 is described and a hybrid implementation is given. A minimal set of primitive features is added to the compiler, and the other extensions are added via an internal macro processor that expands the new syntax into the existing language.
13.
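Explicitly Parallel Resource Computing Architecture and Its Compiler Optimization  Total citations: 1 (self-citations: 0; citations by others: 1)
A new microprocessor model based on the very long instruction word (VLIW) concept is proposed and analyzed. The model exposes the processor's internal result registers and data paths at the architectural level, lets the optimizing compiler control and schedule them directly, and relies on the compiler to guarantee the dependences between operations, which simplifies the hardware design and allows a higher clock frequency. For this target model, a complete optimizing compilation and simulation environment was built, and a corresponding software bypassing optimization and an integrated resource allocation and instruction scheduling algorithm were proposed, analyzed and implemented.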
14.
Kevin O’Brien, Kathryn O’Brien, Zehra Sura, Tong Chen, Tao Zhang. International Journal of Parallel Programming, 2008, 36(3): 289-311
The Cell processor is a heterogeneous multi-core processor with one power processing engine (PPE) core and eight synergistic processing engine (SPE) cores. There is a significant amount of ongoing research in programming models and tools that attempts to make it easy to exploit the computation power of the Cell architecture. In our work, we explore supporting OpenMP on the Cell processor. It is attractive to support OpenMP because programmers can continue using their familiar programming model, and existing code can be re-used. We base our work on IBM’s XL compiler, and developed new components in the XL compiler and a new runtime library. Three major issues are addressed: (1) synchronization support on heterogeneous cores; (2) code generation targeting the different instruction sets; (3) data transfers to implement the OpenMP memory model. We present experimental results for some SPEC OMP 2001 and NAS benchmarks to demonstrate the effectiveness of this approach. A visualization tool based on Paraver is also used to provide some insights into actual thread and synchronization behaviors.
15.
16.
Computation in the Context of Transport Triggered Architectures  Total citations: 1 (self-citations: 0; citations by others: 1)
Henk Corporaal, Johan Janssen, Marnix Arnold. International Journal of Parallel Programming, 2000, 28(4): 401-427
Processors used in embedded systems have specific requirements which are not always met by off-the-shelf processors. A templated processor architecture, which can easily be tuned towards a certain application (domain), offers a solution. The transport triggered architecture (TTA) template presented in this paper has a number of properties that make it very suitable for embedded system design. Key to its success is giving the compiler more control: it has to schedule all data transports within the processor. This paper highlights two important TTA-related issues. First, a new code generation method for TTAs is discussed; it integrates scheduling and register allocation, thereby avoiding the notorious phase ordering problem between these two steps. Second, we discuss how to tune the instruction repertoire for an embedded processor. A tool is described which automatically detects frequent patterns of operations. These patterns can then be implemented on special function units.
17.
Summary. General models of multiprocessor systems in which processors are functionally dedicated are described. In these models, processors are divided into different types, and a task can be assigned only to a processor of a certain type. Clearly, the model of multiprocessor systems with identical processors is a special case of our models. These models also include the job shop problem, in which there is exactly one processor of each type. Worst-case performance bounds of priority-driven schedules are obtained. This work was supported by the National Science Foundation under Grants NSF DCR 72-03740 and NSF MCS 73-03408.
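As an illustration of the model in this abstract, the sketch below simulates a priority-driven (list) schedule in which each task may run only on a processor of its own type. It is an assumption-laden toy, not the paper's formulation: the function name `list_schedule`, the task tuple layout and the example data are all hypothetical, and no performance bound is computed.

```python
def list_schedule(tasks, procs, priority):
    """Simulate a priority-driven schedule on typed processors.

    tasks:    name -> (processor type, duration, set of predecessor names)
    procs:    processor type -> number of processors of that type
    priority: task names, highest priority first
    Returns (start times by task name, makespan).
    """
    start, finish = {}, {}
    running = []            # (finish time, task name) for tasks in progress
    free = dict(procs)      # idle processors per type
    remaining = list(priority)
    time = 0
    while remaining or running:
        # Scan in priority order; start every task that is ready and
        # has an idle processor of the required type.
        for name in list(remaining):
            typ, dur, preds = tasks[name]
            if free.get(typ, 0) > 0 and all(p in finish for p in preds):
                free[typ] -= 1
                start[name] = time
                running.append((time + dur, name))
                remaining.remove(name)
        if not running:
            raise ValueError("remaining tasks can never be scheduled")
        # Advance to the next completion and release those processors.
        time = min(f for f, _ in running)
        for f, name in [r for r in running if r[0] == time]:
            running.remove((f, name))
            finish[name] = f
            free[tasks[name][0]] += 1
    return start, time

# Hypothetical example: one "cpu" and one "io" processor.
example_tasks = {
    "A": ("cpu", 2, set()),
    "B": ("io", 3, set()),
    "C": ("cpu", 1, {"A"}),
    "D": ("cpu", 2, set()),
}
example_start, example_makespan = list_schedule(
    example_tasks, {"cpu": 1, "io": 1}, ["A", "B", "C", "D"]
)
```

With one processor of each type this reduces to the job shop setting the abstract mentions; with a single type and several processors it reduces to the identical-processor model.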
18.
19.
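With the growing popularity of the RISC-V instruction set, a range of open-source and commercial soft IP cores has appeared for fields such as IoT smart hardware, embedded systems, artificial intelligence chips, security devices and high-performance computing. Balancing performance, power and area requires an instruction set that can be trimmed and easily extended, together with a matching software development environment. To this end, a customized RISC-V processor was rapidly implemented on an FPGA by following this flow: adding custom instructions, extending the ALU functional units, connecting control signals and data paths, FPGA prototype verification, customizing the cross-compilation environment, and application testing. Taking matrix acceleration as an example, a custom instruction that computes a vector inner product was designed on the open-source IP core Hummingbird (蜂鸟) E203 and verified as an FPGA prototype. Application tests show that the customized RISC-V processor delivers a significant performance improvement, with matrix multiplication speedups of 5.3 to 7.6.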
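To make the link between an inner-product instruction and matrix multiplication concrete, here is a small software model in Python. The function `vdot` is a hypothetical stand-in for the custom instruction. It says nothing about the actual instruction encoding, operand width or vector length used on the E203, none of which the abstract specifies.

```python
def vdot(a, b):
    """Software stand-in for a hypothetical vector inner-product instruction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def matmul(A, B):
    """Matrix product in which every output element is one inner product."""
    cols = list(zip(*B))  # columns of B, so each pairs with a row of A
    return [[vdot(row, col) for col in cols] for row in A]

product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each element of the result is a single `vdot` call, which is why moving that one operation into hardware can speed up the whole multiplication.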