20 similar documents found; search took 187 ms
1.
2.
HXDSP is a 32-bit static scalar digital signal processor built on VLIW and SIMD architectures and aimed at high-performance computing. To meet the performance demands of that domain, HXDSP provides a rich set of complex-number instructions, but the compiler cannot exploit these instructions directly to improve compiled code. Targeting this feature of the HXDSP chip, this work builds on the framework of the traditional open-source Open64 compiler to add complex numbers as a built-in compiler type together with support for complex arithmetic operations. In addition, by recognizing specific complex-operation patterns, the compiler maps them onto the HXDSP complex instructions to optimize programs during compilation. Experimental results show that the scheme achieves an average speedup of 5.28 on complex-number programs with the HXDSP compiler.
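As a rough illustration of the pattern recognition described above, the scalar expansion of a complex multiply is a fixed four-multiply, two-add shape that a compiler pass can match and collapse into a single complex-multiply instruction. The function name below is invented for illustration; the arithmetic itself is the standard identity (a+bi)(c+di) = (ac-bd) + (ad+bc)i.

```python
def complex_mul(ar, ai, br, bi):
    # Four multiplies and two adds: the scalar pattern a compiler would
    # recognize and replace with one hardware complex-multiply instruction.
    real = ar * br - ai * bi
    imag = ar * bi + ai * br
    return real, imag

# (1 + 2i) * (3 + 4i) = -5 + 10i
print(complex_mul(1.0, 2.0, 3.0, 4.0))
```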
3.
4.
Single-instruction multiple-data (SIMD) technology is now widely used in digital signal processors (DSPs), and most existing vectorizing compilers implement automatic vectorization. These compilers, however, are poorly suited to DSP-oriented SIMD auto-vectorization: the complex instruction sets and special addressing models of DSPs, together with dependences and unaligned data, keep vectorization efficiency low. To address this, the instruction-analysis and redundancy-elimination algorithms of superword-level parallelism (SLP) auto-vectorization were extended and improved in the back end of an Open64-based SLP auto-vectorizing compiler, producing more efficient vectorized source programs. Experimental results show that the optimization effectively improves DSP performance and reduces power consumption.
5.
The rapid development of multimedia technology has led more and more processors to integrate SIMD extensions, and most current compilers already implement automatic vectorization. To exploit parallelism within a single iteration, some compilers add the SLP vectorization method to their auto-vectorization modules. Because multimedia data is stored densely and operated on regularly, processing it requires frequent data-type conversions, yet current SLP methods handle such conversions poorly. To uncover more SLP vectorization opportunities in programs containing many type-conversion statements, an SLP extraction method for type-conversion statements is proposed. Within the SLP vectorization framework, it uses data reorganization to convert between data types with the same vectorization factor and between types with different vectorization factors. Experimental results show that the method effectively applies SLP vectorization to type-conversion statements and improves the vectorized execution efficiency of programs.
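The factor-mismatch problem above can be sketched in a few lines. Assuming (for illustration) 128-bit vectors, an 8-lane int16 pack that is widened to int32 no longer fits a single vector: the vectorization factor halves, so the pack must first be split (data reorganization) and each half converted lane by lane.

```python
def widen_i16_pack(pack16):
    # An 8-lane int16 pack widened to int32 needs two 4-lane int32 packs:
    # split the pack (data reorganization), then convert each lane.
    assert len(pack16) == 8
    lo = [int(x) for x in pack16[:4]]
    hi = [int(x) for x in pack16[4:]]
    return lo, hi

print(widen_i16_pack([1, 2, 3, 4, 5, 6, 7, 8]))
```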
6.
7.
Quad-precision triangular solve (QTRSM) was implemented on top of OpenBLAS on an ARMv8 64-bit multi-core processor. QTRSM was implemented with two data formats: the first relies on the GCC compiler's support for the long double data type, while the second uses the double-double format together with its quad-precision addition, subtraction, multiplication, and division. Taking the long double QTRSM as the baseline, the accuracy and runtime of the double-double QTRSM were compared across matrix sizes. The results show that the two produce numerical results of nearly identical accuracy, but the double-double QTRSM performs 1.6 times faster than the long double version. As the thread count increases, the speedup of both implementations approaches 2.0, demonstrating good scalability.
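The building block of double-double arithmetic mentioned above is the error-free transformation: two doubles are added so that the rounding error is captured exactly in a second double, and a double-double value is then carried as such a (high, low) pair. A minimal sketch of Knuth's two-sum:

```python
def two_sum(a, b):
    # Error-free transformation: s + err equals a + b exactly,
    # where s is the rounded sum and err the rounding error.
    s = a + b
    bv = s - a          # the part of b that made it into s
    err = (a - (s - bv)) + (b - bv)
    return s, err

# The tiny addend is lost in the rounded sum but preserved in err.
print(two_sum(1.0, 2.0 ** -60))
```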
8.
9.
10.
On the Sunway processor, compiler support for locking was added to the back end of the LLVM (Low Level Virtual Machine) compiler, guaranteeing the atomicity of shared-memory operations in multithreaded environments. The work on the lock mechanism mainly comprises an atomic-instruction semantic-mapping strategy that preserves the atomicity of atomic operations, plus the addition of 8-bit and 16-bit data handling to the locking algorithms, giving the lock mechanism support for small-granularity data types on the Sunway processor. Tested with the NPB parallel benchmark suite, all programs passed self-verification under multithreading. With 16 threads, the Fortran programs achieve an average speedup of 11.91 and a maximum of 15.73, and the C programs an average of 8.08 and a maximum of 13.32.
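The small-granularity support described above commonly works by widening: a processor with only word-wide atomics emulates an 8-bit or 16-bit atomic by read-modify-writing the containing 32-bit word. The merge step is sketched below (names and the 32-bit width are illustrative assumptions); on real hardware this merge sits inside an LL/SC or compare-and-swap retry loop that provides the atomicity.

```python
def insert_byte(word, byte_index, new_byte):
    # Shift-mask-merge: replace one byte of a 32-bit word while
    # leaving the neighbouring bytes untouched.
    shift = byte_index * 8
    mask = 0xFF << shift
    return (word & ~mask & 0xFFFFFFFF) | ((new_byte & 0xFF) << shift)

print(hex(insert_byte(0x11223344, 0, 0xAA)))  # low byte replaced
```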
11.
12.
Data abstractions have been proposed as a means to enhance program modularity. Adding such new features to an existing language is typically handled either by rewriting large portions of an existing compiler or by using a preprocessor to translate the extensions into the standard language. The first technique is expensive to implement, while the latter is usually slow and clumsy to use. In this paper a data abstraction addition to PL/1 is described and a hybrid implementation is given: a minimal set of primitive features is added to the compiler, and the other extensions are added via an internal macro processor that expands the new syntax into the existing language.
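The macro-processor half of the hybrid approach can be sketched in miniature. Everything here is invented for illustration: a made-up `abstract name;` declaration is expanded into plain declarations that the base language would already accept, standing in for the internal macro expansion the paper describes.

```python
import re

# Template of ordinary declarations that the hypothetical
# `abstract <name>;` syntax expands into.
TEMPLATE = "DECLARE {name}_data(100) FIXED;\nDECLARE {name}_top FIXED INIT(0);"

def expand_abstractions(src):
    # Rewrite each `abstract name;` occurrence into base-language code,
    # the way an internal macro processor would during compilation.
    return re.sub(r"abstract\s+(\w+)\s*;",
                  lambda m: TEMPLATE.format(name=m.group(1)),
                  src)

print(expand_abstractions("abstract stack;"))
```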
13.
Explicitly Parallel Resource Computing Architecture and Its Compiler Optimizations
A new microprocessor model based on the very long instruction word (VLIW) idea is proposed and analyzed. The model exposes the processor's internal result registers and datapaths to the architecture, allowing an optimizing compiler to control and schedule them directly, and it relies on the compiler to enforce the dependences between operations, which simplifies the hardware design and enables higher clock frequencies. For this target model, a complete optimizing-compilation and simulation environment was built, and a software bypassing optimization together with an integrated resource-allocation and instruction-scheduling algorithm was proposed, analyzed, and implemented.
14.
Kevin O’Brien, Kathryn O’Brien, Zehra Sura, Tong Chen, Tao Zhang. International Journal of Parallel Programming, 2008, 36(3): 289–311
The Cell processor is a heterogeneous multi-core processor with one power processing engine (PPE) core and eight synergistic processing engine (SPE) cores. There is a significant amount of ongoing research in programming models and tools that attempts to make it easy to exploit the computation power of the Cell architecture. In our work, we explore supporting OpenMP on the Cell processor. Supporting OpenMP is attractive because programmers can continue using their familiar programming model and existing code can be re-used. We base our work on IBM’s XL compiler, and developed new components in the XL compiler and a new runtime library. Three major issues are addressed: (1) synchronization support on heterogeneous cores; (2) code generation targeting the different instruction sets; (3) data transfers and implementation of the OpenMP memory model. We present experimental results for some SPEC OMP 2001 and NAS benchmarks to demonstrate the effectiveness of this approach. A visualization tool based on Paraver is also used to provide some insights into actual thread and synchronization behaviors.
15.
16.
General models of multiprocessor systems in which processors are functionally dedicated are described. In these models, processors are divided into different types, and a task can be assigned only to a processor of certain types. Clearly, the model of multiprocessor systems with identical processors is a special case of these models. The models also include the job shop problem, in which there is exactly one processor of each type. Worst-case performance bounds of priority-driven schedules are obtained. This work was supported by the National Science Foundation under Grants NSFDCR 72-03740 and NSFMCS 73-03408.
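A priority-driven schedule of the kind analyzed above can be sketched for independent tasks. Assuming tasks are given in priority order as (duration, allowed processor types), the greedy rule places each task on the earliest-free processor among its allowed types; all names below are illustrative.

```python
def list_schedule(tasks, procs):
    # tasks: list of (duration, tuple_of_allowed_types), in priority order
    # procs: {type: number_of_processors_of_that_type}
    # Returns the makespan of the greedy priority-driven schedule.
    free = {t: [0.0] * n for t, n in procs.items()}  # per-processor free times
    makespan = 0.0
    for dur, types in tasks:
        ty = min(types, key=lambda t: min(free[t]))  # best allowed type
        i = free[ty].index(min(free[ty]))            # earliest-free processor
        free[ty][i] += dur                           # run the task there
        makespan = max(makespan, free[ty][i])
    return makespan

# Three tasks restricted to type "A", two "A" processors available.
print(list_schedule([(3, ("A",)), (2, ("A",)), (2, ("A",))], {"A": 2}))
```

With identical processors (a single type) the model degenerates to the classical case noted in the abstract; with one processor per type it resembles the job shop setting.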
17.
Computation in the Context of Transport Triggered Architectures
Henk Corporaal, Johan Janssen, Marnix Arnold. International Journal of Parallel Programming, 2000, 28(4): 401–427
Processors used in embedded systems have specific requirements which are not always met by off-the-shelf processors. A templated processor architecture, which can easily be tuned towards a certain application (domain), offers a solution. The transport triggered architecture (TTA) template presented in this paper has a number of properties that make it very suitable for embedded system design. Key to its success is giving the compiler more control: it has to schedule all data transports within the processor. This paper highlights two important TTA-related issues. First, a new code generation method for TTAs is discussed; it integrates scheduling and register allocation, thereby avoiding the notorious phase-ordering problem between these two steps. Second, we discuss how to tune the instruction repertoire for an embedded processor. A tool is described which automatically detects frequent patterns of operations; these patterns can then be implemented on special function units.
18.
Journal of Parallel and Distributed Computing, 1994, 21(1): 150–159
Array statements as included in Fortran 90 or High Performance Fortran (HPF) are a well-accepted way to specify data parallelism in programs. When generating code for such a data parallel program for a private memory parallel system, the compiler must determine when array elements must be moved from one processor to another. This paper describes a practical method to compute the set of array elements that are to be moved; it covers all the distributions that are included in HPF: block, cyclic, and block-cyclic. This method is the foundation for an efficient protocol for modern private memory parallel systems: for each block of data to be sent, the sender processor computes the local address in the receiver's address space, and the address is then transmitted together with the data. This strategy increases the communication load but reduces the overhead on the receiving processor. We implemented this optimization in an experimental Fortran compiler, and this paper reports an empirical evaluation on a 64-node private memory iWarp system, using a number of different distributions.
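The address computation underlying the protocol above can be sketched for the general block-cyclic case, which subsumes block (one block per processor) and cyclic (block size 1). For a 1-D array dealt out in blocks of size b over p processors, ownership and the local index follow directly from integer arithmetic; function names are illustrative.

```python
def owner(i, block, nprocs):
    # Global element i sits in block i // block; blocks are dealt to
    # processors round-robin, so the owner is that block number mod p.
    return (i // block) % nprocs

def local_index(i, block, nprocs):
    # Local position on the owner: full local blocks stored before this
    # one (one per p*b global elements), plus the offset inside the block.
    return (i // (block * nprocs)) * block + i % block

# block=2, nprocs=3: processor 0 owns globals 0,1,6,7 at locals 0,1,2,3.
print([(owner(i, 2, 3), local_index(i, 2, 3)) for i in range(8)])
```

This is how the sender can compute the local address in the receiver's address space without any lookup table: ownership and local offset are closed-form functions of the distribution parameters.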
19.