Similar Literature
 18 similar documents found (search time: 209 ms)
1.
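Chen Xiang (陈向), Shen Li (沈立), Li Jiawen (李家文). 《计算机科学》 (Computer Science), 2011, 38(5): 290-294
SIMD instructions exploit data-level parallelism efficiently, so the vast majority of current general-purpose microprocessors support them. However, inherent characteristics of applications and algorithms, such as misaligned memory addresses, non-contiguous memory accesses, and control flow, force the compiler or programmer to use permutation instructions to rearrange vector elements before the operands meet the requirements of SIMD instructions. These redundant permutation instructions have become the main performance bottleneck in exploiting data-level parallelism. This paper proposes an automatic algorithm for generating and optimizing data permutation instructions, in order to reduce the performance loss they cause. The algorithm is based on a newly proposed intermediate representation that carries sufficient operand address information, so that generating permutation instructions becomes the problem of identifying conflict edges in the data-flow graph, and optimizing them becomes the problem of removing all conflict edges with the fewest permutation instructions. Tests on a set of typical multimedia programs show that the proposed algorithm achieves an average performance speedup of 7%.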
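
To make the cost of misalignment concrete, the following is a minimal sketch (not the algorithm from the paper) of why an unaligned vector access needs an extra permutation step. It assumes a hypothetical machine whose vector loads only accept addresses that are multiples of the vector width; the names `aligned_load`, `permute`, and `unaligned_load` and the width of 4 elements are illustrative assumptions.

```python
WIDTH = 4  # assumed vector width in elements (illustrative only)


def aligned_load(mem, addr, width=WIDTH):
    """Model of a SIMD load that only accepts width-aligned addresses."""
    if addr % width != 0:
        raise ValueError("address is not aligned to the vector width")
    return mem[addr:addr + width]


def permute(lo, hi, shift):
    """Model of a permutation instruction: select `len(lo)` consecutive
    elements starting at offset `shift` from the concatenation lo ++ hi."""
    return (lo + hi)[shift:shift + len(lo)]


def unaligned_load(mem, addr, width=WIDTH):
    """Emulate an unaligned load with two aligned loads and one permutation.
    Assumes the memory is long enough for the second aligned load."""
    base = addr - addr % width
    shift = addr - base
    lo = aligned_load(mem, base, width)
    if shift == 0:
        return lo  # already aligned: no permutation needed
    hi = aligned_load(mem, base + width, width)
    return permute(lo, hi, shift)
```

In this model an aligned access costs one load, while an unaligned access costs two loads plus one permutation, which is the kind of overhead the abstract describes.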

2.
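An Optimization Algorithm for Increasing the Parallelism of SIMD Architectures Based on Bit-Width Control   Total citations: 1, self-citations: 0, citations by others: 1
With SIMD functional units widely used as multimedia acceleration components, how to exploit this architecture effectively to optimize applications has become a hot topic in compiler optimization research. Typical SIMD architectures today provide different instruction versions of the same operation for different data bit widths; as the operand bit width increases, the number of operations a corresponding SIMD instruction can complete simultaneously decreases. Identifying the effective bit width of operands therefore has a crucial impact on the parallelism of operations within SIMD instructions during optimization. To address the parallelism problem facing SIMD optimization, this paper proposes an optimization algorithm that analyzes the effective bits of operands and then applies overflow control, thereby reducing the operands' dependence on wide data types. Experimental data show that the algorithm effectively improves the parallelism of multimedia program optimization and yields good speedups for multimedia programs.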
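
The relationship between operand width and parallelism can be sketched with simple arithmetic. The example below is illustrative only and is not the analysis from the paper; the 128-bit register width, the candidate widths, and the function names `simd_lanes` and `effective_bits` are assumptions.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one SIMD register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


def effective_bits(max_abs_value: int) -> int:
    """Smallest signed two's-complement width among 8/16/32/64 bits that can
    hold every value in the range [-max_abs_value, +max_abs_value]."""
    if max_abs_value < 0:
        raise ValueError("max_abs_value must be non-negative")
    for width in (8, 16, 32, 64):
        if max_abs_value <= (1 << (width - 1)) - 1:
            return width
    raise ValueError("value does not fit in 64 bits")
```

For example, if a value declared as a 32-bit integer is known never to exceed 100 in magnitude, `effective_bits(100)` gives 8, and an assumed 128-bit register then holds `simd_lanes(128, 8)` = 16 elements instead of `simd_lanes(128, 32)` = 4.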

3.
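Code selection is a very important task in the code generation phase of a compiler; its goal is to find an efficient mapping between machine-independent intermediate representation code and processor-specific machine instructions. To support the SIMD instructions of DSP processors, this paper builds on traditional code selection algorithms based on data-flow tree intermediate representations and proposes a code selection technique based on data-flow graphs (DFGs), which seeks an optimal cover of the entire DFG while exploiting SIMD instructions to the greatest possible extent.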

4.
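To address the lack of research on compilation techniques for two-dimensional SIMD architectures, and taking into account the shared data paths and small register count commonly found in such architectures as well as the characteristics of applications, this paper proposes an algorithm for data vector reuse. The algorithm first uses representative elements of data vectors to compute data vector reuse information between SIMD instructions, and then schedules the SIMD instructions according to that information. The algorithm relieves the data loading pressure when applications run on a two-dimensional SIMD architecture and improves the parallelism of resource-constrained two-dimensional SIMD architectures. Experimental data show that, across a variety of applications, the algorithm achieves an average speedup of 2.97 and an average SIMD instruction-level parallelism of 3.86.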

5.
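BWDSP is a new processor designed for high-performance computing. It adopts a multi-cluster very long instruction word architecture and a SIMD architecture, and has a rich instruction set. To make full use of the vectorization resources BWDSP provides, a vectorization algorithm is urgently needed. This paper studies and implements, on the basis of Open64, a SIMD compiler optimization algorithm for multi-cluster very long instruction word (VLIW) DSPs. The algorithm is based on WHIRL, the intermediate language of Open64, and makes full use of BWDSP's rich hardware resources and vector instructions. Experimental results show that for loops whose operations can be combined into double-word and single-word form, the optimization achieves average speedups of 6x and 4x, respectively.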

6.
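Liu Peng (刘鹏), Zhao Rongcai (赵荣彩), Zhao Bo (赵博), Gao Wei (高伟). 《计算机科学》 (Computer Science), 2014, 41(9): 28-31, 44
With the spread of multimedia applications and the demand for high-performance computing, more and more processors integrate SIMD extensions. To automatically generate efficient vectorized code for different SIMD extension units, a virtual vector instruction set was designed, and on that basis a unified vectorization framework for SIMD extension units was built. The input program is converted into an intermediate representation of virtual vector instructions through phases such as vector recognition, and is then converted into the vector instruction set of a specific SIMD unit through vector-length devirtualization and instruction-set devirtualization. Experimental results on Shenwei 1600 (申威1600), DSP, and Alpha show that the unified framework automatically produces efficient vectorized code for all three platforms, and that the speedup on DSP is clearly better than on the other two platforms.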

7.
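To improve multimedia data processing capability, high-performance DSPs commonly adopt SIMD techniques. The multiplier, an important component of a DSP, must also have this capability. This paper studies the implementation of SIMD multipliers in depth and proposes a new SIMD multiplier architecture that uses two 16×8 multipliers and, through sign extension and concatenation of their operands and results, implements a 16-bit FT-SIMD multiplier simply and efficiently. The architecture can also be extended to 32-bit and 64-bit SIMD multipliers.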
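
The arithmetic identity that makes it possible to build a wide product from narrower multipliers can be sketched as follows. This is only an illustration under simplifying assumptions: operands are unsigned, the sign extension described in the abstract is not modeled, and the functions `mul16x8` and `mul16x16` are hypothetical names, not the paper's hardware design.

```python
def mul16x8(a: int, b: int) -> int:
    """Model of an unsigned 16-bit by 8-bit multiplier (result fits in 24 bits)."""
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 8)
    return a * b


def mul16x16(a: int, b: int) -> int:
    """Unsigned 16x16 product assembled from two 16x8 partial products:
    a * b = a * b_low + (a * b_high) * 2**8."""
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 16)
    low = mul16x8(a, b & 0xFF)   # partial product with the low byte of b
    high = mul16x8(a, b >> 8)    # partial product with the high byte of b
    return low + (high << 8)     # shift the high partial product and add
```

When the two 16×8 units are instead fed independent operands, the same hardware yields two narrower products at once, which is the general idea behind sharing one datapath between wide and SIMD modes.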

8.
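SIMD Code Generation for Multimedia Processors   Total citations: 1, self-citations: 0, citations by others: 1
The SIMD (Single Instruction Multiple Data) multimedia extensions of general-purpose processors provide new architectural support for improving the performance of multimedia applications. However, current compiler technology does not support such instructions well. This paper proposes a new SIMD instruction generation algorithm that, based on the idea of combining the program analysis of the compiler front end with the machine information of the compiler back end, uses an extended tree parsing technique to identify parallel operations in a program and generate SIMD instructions. Experiments based on the SUIF (Stanford University Intermediate Format) compiler framework show that, for a set of multimedia kernels, the proposed algorithm reduces cycles by 47% on average relative to the non-SIMD code.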

9.
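Based on theoretical analysis and experimental validation of news stories, this paper proposes a multi-vector representation model that partitions the feature set as finely as possible while losing as little information as possible. Based on this model, a fuzzy matching method is designed to compute the relatedness between named-entity sub-vectors; these values, together with the similarities of the multiple vectors, are combined with a support vector machine to form the similarity between story models. The TDT4 Chinese corpus is used as the test corpus, and the model and fuzzy matching technique are applied to topic link detection. Experiments show that the multi-vector model improves topic link detection performance, and that fuzzy matching compensates to some extent for the performance loss caused by exact matching.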

10.
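Designing and implementing a new production-quality compiler usually takes several years. Modifying and extending an existing compiler is the main route to developing a compiler for a new architecture. The GNU Compiler Collection (GCC) supports many high-level languages and many target processor platforms, and its documentation and source code are open. Based on the SPARC back end of GCC, a description of four-way double-precision short vector registers supporting four-way double-precision SIMD instructions was implemented. In the process, a new target machine was defined, a class of vector modes was added, a new class of register constraints was defined, the description of the four-way double-precision registers was implemented, and the machine description of the four-way double-precision SIMD instructions was defined. For built-in functions targeting such SIMD instructions, the GCC compiler correctly uses these vector registers to generate the corresponding SIMD instructions.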

11.
《Parallel Computing》2013,39(10):586-602
Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code mixed with data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor integrates single-instruction multiple-data (SIMD) engines into architectures that exploit ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length needed to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions limits current SIMD engines in handling memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generations, and loop branches, are required to aid the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long vector computations as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called multimedia coprocessor (MCP). This architecture includes the following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively.

12.
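Introducing the TCG intermediate representation into binary translation enables program porting across multiple target platforms and makes it easier to bring in new platforms, resolving their compatibility with mainstream platforms. However, because the original intermediate representation weakens the relatedness of code during translation, the generated back-end code contains many redundant instructions, which hurts the execution efficiency of translated programs. This paper analyzes the feasibility of instruction optimization and optimizes conditional jump instructions: instruction preprocessing improves the intermediate representation, changing the translation from intermediate representation to back-end code from a one-to-many mode to a many-to-many mode. Using instruction reduction, optimized translation algorithms are designed for the two patterns of conditional jump instructions, CMP-JX and TEST-JX, and implemented on the open-source binary translation platform QEMU. Tests on the NPB-3.3 and SPEC CPU 2006 benchmark suites show that, compared with the previous translation mode, the optimized code expansion rate decreases by 14.62% on average and the speed of translated programs improves by 17.23%, verifying the effectiveness of the optimization method.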

13.
To achieve maximum efficiency, modern embedded processors for media applications exploit single instruction multiple data (SIMD) instructions. SIMD instructions provide a form of vectorization where a large machine word is viewed as a vector of subwords and the same operation is performed on all subwords in parallel. Systematic usage of SIMD instructions can significantly improve program performance. With C becoming the dominant language for programming embedded devices, there is a clear need for C compilers that use SIMD instructions whenever appropriate. However, SIMD instructions typically require each memory access to be aligned with the instruction's data access size. Therefore an important problem in designing the compiler is to determine whether a C pointer is aligned, i.e. whether it refers to the beginning of a machine word. In this paper, we describe our SIMD generation algorithm and present an analysis method which determines the alignment of pointers at compile time. The alignment information is used to reduce the number of dynamic alignment checks and the overhead incurred by them. Our method uses an interprocedural analysis which propagates pointer alignment information in function bodies and through function calls. The effectiveness of our method is supported by experimental results which show that in typical programs the alignments of about 50% of the pointers can be statically determined. Copyright © 2006 John Wiley & Sons, Ltd.
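
The idea of replacing dynamic alignment checks with statically known facts can be illustrated with a small sketch. It is not the interprocedural analysis from the paper; it only shows one propagation rule, namely that adding a constant byte offset `k` to a pointer known to be a multiple of `A` yields a pointer that is a multiple of `gcd(A, k)`. All function names are hypothetical.

```python
from math import gcd


def is_aligned(address: int, access_size: int) -> bool:
    """Dynamic check: does the address start on an access_size boundary?"""
    return address % access_size == 0


def alignment_after_offset(known_alignment: int, byte_offset: int) -> int:
    """Static rule: if a pointer is a multiple of known_alignment, then
    pointer + byte_offset is guaranteed to be a multiple of this result."""
    return gcd(known_alignment, abs(byte_offset))


def needs_dynamic_check(known_alignment: int, byte_offset: int,
                        access_size: int) -> bool:
    """A runtime check can be skipped when the statically guaranteed
    alignment is already a multiple of the access size."""
    return alignment_after_offset(known_alignment, byte_offset) % access_size != 0
```

For a pointer known to be 16-byte aligned, an offset of 32 bytes keeps the 16-byte guarantee, so a 16-byte SIMD access needs no runtime check, whereas an offset of 4 bytes only guarantees 4-byte alignment and the check (or a fallback path) is still required.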

14.
Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single‐instruction multiple‐data (SIMD) instructions and thread‐level parallelism. In this paper, we propose a new high‐performance sorting algorithm, called aligned‐access sort (AA‐sort), that exploits both the SIMD instructions and thread‐level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in‐core sorting phase and an out‐of‐core merging phase. The in‐core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out‐of‐core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA‐sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA‐sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32‐bit integers. Also, a parallel version of AA‐sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms. Copyright © 2011 John Wiley & Sons, Ltd.
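
For readers unfamiliar with combsort, the scalar algorithm that the in‐core phase extends can be sketched as below. This is the plain textbook combsort, not AA‐sort itself: it contains no SIMD instructions and no alignment handling, and the shrink factor of 1.3 is a commonly used value assumed here rather than taken from the paper.

```python
def comb_sort(values, shrink=1.3):
    """Plain scalar combsort: compare elements `gap` apart, shrinking the gap
    each pass until it reaches 1, then repeat passes until no swap occurs."""
    items = list(values)  # sort a copy and leave the input untouched
    gap = len(items)
    swapped = True
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(len(items) - gap):
            if items[i] > items[i + gap]:
                items[i], items[i + gap] = items[i + gap], items[i]
                swapped = True
    return items
```

The inner loop has no data-dependent branches other than the compare-and-swap on elements a fixed distance apart, which is one reason this family of algorithms lends itself to vector min/max operations.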

15.
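VLIW DSPs obtain temporal parallelism through software pipelining and spatial parallelism through instruction clustering. Instruction clustering is essentially a resource allocation problem. Traditional clustering assumes that an instruction is assigned to a single cluster for execution, but some architectures provide SIMD instructions, and traditional clustering algorithms do not fully apply to them. The proposed clustering algorithm, based on an evaluation model, assigns both SIMD instructions and ordinary instructions to clusters sensibly. After clustering, inter-cluster transfer instructions are scheduled and combined into suitable inter-cluster double-word transfer instructions. Thanks to the introduction of SIMD and inter-cluster double-word transfers, together with better clustering decisions, the overall schedule latency of the program is shortened. For many digital signal processing programs, performance improves by 2 to 3 times relative to the unclustered case, and by about 7% to 10% relative to the register-pressure clustering algorithm.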

16.
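Zhang Qian (张倩). 《计算机工程》 (Computer Engineering), 2009, 35(10): 273-275
For two-dimensional SIMD architectures, this paper proposes a low-power scheduling algorithm that can dynamically shut down idle components and combines compiler, instruction set, and architecture support: the compiler optimizes two-dimensional SIMD instructions, power instructions issue component on/off signals, and the system receives the signals and acts on them. Different functional units are scheduled separately, and components are localized. Experimental results on a simulator show that the method saves about 15% of the energy consumption of the whole system.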

17.
18.
Functional parallelism can be supported on SIMD machines by interpretation. Under such a scheme, the programs and data of each task are loaded on the processing elements (PEs) and the Control Unit of the machine executes a central control algorithm that causes the concurrent interpretation of the tasks on the PEs. The central control algorithm is, in many respects, analogous to the control store program on microprogrammed machines. Accordingly, the organization of the control algorithm greatly influences the performance of the synthesized MIMD environment. Most central control algorithms are constructed to interpret the execution phase of all instructions during every cycle (iteration). However, it is possible to delay the interpretation of infrequent and costly instructions to improve the overall performance. Interpreters that attempt improved performance by delaying the issue of infrequent instructions are referred to as variable issue control algorithms. This paper examines the construction of optimized variable issue control algorithms. In particular, a mathematical model for the interpretation process is built and two objective functions (instruction throughput and PE utilization) are defined. The problem of deriving variable issue control algorithms for these objective functions has been shown elsewhere to be NP-complete. Therefore, this paper investigates three heuristic algorithms for constructing near optimal variable issue control algorithms. The performance of the algorithms is studied on four different instruction sets and the trends of the schedulers with respect to the instruction sets and the objective functions are analyzed.
