首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 25 毫秒
1.
GCC后端中四路双精度短向量寄存器的实现   总被引:1,自引:1,他引:0  
设计和实现一个新的产品化的编译器通常需要几年时间。基于已有的编译器进行修改和扩展,是研发面向新体系结构的编译器的主要途径。GNU编译器集合(GCC)支持多种高级语言和多种目标处理器平台、文档及源代码开放等。基于GCC的Sparc后端,实现了支持四路双精度SIMD指令的四路双精度短向量寄存器的描述。在此过程中,定义了新的目标机,扩充了一类向量模式,定义了一类新的寄存器约束,实现了四路双精度寄存器的描述,定义了四路双精度SIMD指令的机器描述。对于面向此类SIMD指令的内嵌函数,GCC编译器能够正确使用该类向量寄存器来生成对应的SIMD指令。  相似文献   

2.
BWDSP100是一款采用超长指令字(VLIW)和单指令多数据流(SIMD)架构的针对高性能计算领域而设计的32位静态标量数字信号处理器,其指令级并行(ILP)主要是通过其特殊的分簇体系结构和SIMD指令来实现,然而现有的编译框架无法对这些特殊的SIMD指令提供支持。由于BWDSP100拥有丰富的SIMD向量化资源,且其所运用的雷达数字信号处理领域对程序的性能要求极高,因此针对BWDSP100结构的特点,在传统Open64编译器中SIMD编译优化框架的基础上提出并实现了一种支持单双字模式选择的SIMD编译优化算法,通过该算法可以显著提高一些在DSP上有着广泛运用计算密集型程序的性能。实验结果表明,与优化前相比,该算法方案在BWDSP编译器上的实现能够平均取得5.66的加速比。  相似文献   

3.
Queue processors are a viable alternative for high performance embedded computing and parallel processing. We present the design and implementation of a compiler for a queue-based processor. Instructions of a queue processor implicitly reference their operands making the programs free of false dependencies. Compiling for a queue machine differs from traditional compilation methods for register machines. The queue compiler is responsible for scheduling the program in level-order manner to expose natural parallelism and calculating instructions relative offset values to access their operands. This paper describes the phases and data structures used in the queue compiler to compile C programs into assembly code for the QueueCore, an embedded queue processor. Experimental results demonstrate that our compiler produces good code in terms of parallelism and code size when compared to code produced by a traditional compiler for a RISC processor.  相似文献   

4.
张倩 《计算机工程》2009,35(10):273-275
针对二维SIMD结构,提出一种可以动态关闭空转部件且结合编译器、指令集和体系结构支持的低功耗调度算法,其中包括编译器优化二维SIMD指令,功耗指令发出部件开关信号,系统接收信号并执行。采用对不同功能单元分别调度的方式和部件局部化的方法。在模拟器上的实验结果表明该方法可以节省整个系统约15%的能量消耗。  相似文献   

5.
Single-Instruction Multiple-Data (SIMD) instructions provide an inexpensive way to exploit the Data-Level Parallelism in multimedia applications. However, the performance improvement obtained by employing SIMD instructions is often limited because frequently many overhead instructions are required to bring data in a form amenable to SIMD processing. In this paper, we employ two techniques to overcome this limitation. The first technique, extended subwords, uses four extra bits for every byte in a media register. This allows many SIMD operations to be performed without overflow and avoids packing/unpacking conversion overhead. The second technique, Matrix Register File (MRF), allows flexible row-wise as well as column-wise access to the register file. It is useful for many two-dimensional multimedia algorithms such as the (I) Discrete Cosine Transform, 2 × 2 Haar Transform, and pixel padding. In addition, we propose a few new media instructions. Experimental results obtained by extending the SimpleScalar toolset show that these techniques improve performance by up to a factor of 4.5 compared to a conventional SIMD instruction set extension.  相似文献   

6.
基于SSE指令的大内存快速拷贝   总被引:1,自引:0,他引:1  
在深入研究单指令多数据流扩展指令集(Streaming SIMD Extensions,SSE)数据传输指令操作特点的基础上,充分考虑了数据预取、数据对齐、CPU缓存和新的128位寄存器等因素,在Visual C++平台上用嵌入汇编开发了内存拷贝函数。通过实验分析了各内存拷贝函数拷贝速度与拷贝内存量之间的对应关系。  相似文献   

7.
Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single‐instruction multiple‐data (SIMD) instructions and thread‐level parallelism. In this paper, we propose a new high‐performance sorting algorithm, called aligned‐access sort (AA‐sort), that exploits both the SIMD instructions and thread‐level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in‐core sorting phase and an out‐of‐core merging phase. The in‐core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out‐of‐core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA‐sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA‐sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32‐bit integers. Also, a parallel version of AA‐sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

8.
This paper addresses how to automatically generate code for multimedia extension architectures in the presence of conditionals. We evaluate the costs and benefits of exploiting branches on the aggregate condition codes associated with the fields of a superword (an aggregate object larger than a machine word) such as the branch-on-any instruction of the AltiVec. Branch-on-superword-condition-codes (BOSCC) instructions allow fast detection of aggregate conditions, an optimization opportunity often found in multimedia applications. This paper presents compiler analyses and techniques for generating efficient parallel code using BOSCC instructions. We evaluate our approach, which has been implemented in the SUIF compiler, through a set of experiments with multimedia benchmarks, and compare it with the default approach previously implemented in our compiler. Our experimental results show that using BOSCC instructions can result in better performance for applications where the aggregate condition codes of a superword often evaluate to the same value.  相似文献   

9.
分簇结构超长指令字DSP编译器的设计与实现   总被引:5,自引:0,他引:5  
超长指令字(VLIW)是高端DSP普遍采用的体系结构。VLIW DSP在硬件上没有调度和冲突判决的机制,其性能的发挥完全依靠编译嚣的优化效果.基于可重定向编译基础设施IMPACT,为分簇VLIW DSP YHFT—D4设计与实现了优化编译器.其中着重讨论了可重定向信息的定义、代码注释、SIMD指令的支持、分簇寄存器分配以度指令级并行开发和资源冲突解决等内容.实验结果表明该编译器可以达到较好的优化效果.  相似文献   

10.
多媒体处理器的SIMD代码生成   总被引:1,自引:0,他引:1  
通用处理器的SIMD(Single Instruction Multiple Data)多媒体扩展,为提高多媒体应用的性能提供了新的体系结构支持。但目前编译技术对这类指令不能提供很好的支持。本文提出了一个新的SIMD指令生成算法,基于把编译器前端的程序分析和编译器后端的机器信息相结合的思想,采用扩展的treeparsing技术,有效识别程序中的并行操作以生成SIMD指令。基于SUIF(Stanford University Intermediate Format)编译器框架的实验表明,针对一组多媒体kernel,本文提出的算法可平均减少其非SIMD代码47%的cycles。  相似文献   

11.
This work presents a static method implemented in a compiler for extracting high instruction level parallelism for the 32-bit QueueCore, a queue computation-based processor. The instructions of a queue processor implicitly read and write their operands, making instructions short and the programs free of false dependencies. This characteristic allows the exploitation of maximum parallelism and improves code density. Compiling for the QueueCore requires a new approach since the concept of registers disappears. We propose a new efficient code generation algorithm for the QueueCore. For a set of numerical benchmark programs, our compiler extracts more parallelism than the optimizing compiler for an RISC machine by a factor of 1.38. Through the use of QueueCore’s reduced instruction set, we are able to generate 20% and 26% denser code than two embedded RISC processors.  相似文献   

12.
随着处理器和存储器速度差距的不断拉大,访存指令尤其是频繁cache miss的指令成为影响性能的重要瓶颈。编译器由于无法得知访存指令动态执行的拍数,一般假定这些指令的延迟为cache命中或者cache miss的延迟,所以并不准确。我们引入cache profiling技术来收集访存指令运行时的cache miss或者命中的信息,利用这些信息来计算访存的延迟。乱序机器上硬件的指令调度对于发射窗口内的指令能进行很好的动态调度,编译器则对更长的范围内的指令调度更有优势。在reorder buffer中cache miss一旦发生,容易引起reorder buffer满,导致流水线阻塞。调度容易cache miss的指令。使其并行执行,从而隐藏cache miss的长延迟,就可以提高程序性能。因此,我们针对load指令,一方面修改频繁miss的指令的延迟,一方面修改调度策略,提高存储级并行度。实验证明,我们的调度对于bzip2有高达4.8%的提升,art有4%的提升,整体平均提高1.5%。  相似文献   

13.
龚慧敏  段湘煜  张民 《计算机科学》2017,44(12):216-220, 238
词对齐是统计机器翻译系统的重要一环,但词对齐的获得往往基于序列模型的计算,而没有考虑语言的结构化信息及语言特征,从而造成词对齐中出现一些不符合语言特征的结果。文中提出一种词对齐的自纠正机制,以纠正词对齐中的错误部分。该机制使用一些语言学上的先验知识,对词对齐结果进行由粗颗粒度到细颗粒度的纠正。首先采用基于标点的方法对句对进行粗粒度化纠正,然后采用基于统计特征的方法对子句对进行细粒度化纠正。该自纠正过程不需要借助任何其他词对齐工具和新语料。实验结果显示,自纠正词对齐显著提高了词对齐的准确率,并提高了机器翻译的质量,其中粗粒度的纠正方法对翻译质量的提高最为显著,细粒度的纠正方法也提升了翻译质量,最终通过结合粗颗粒度和细颗粒度的纠正方法,使翻译结果相对基准系统取得了显著的提高。  相似文献   

14.
《Parallel Computing》2013,39(10):586-602
Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code mixed with data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor resembles of single-instruction multiple-data (SIMD) engines into architectures exploiting ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions causes current SIMD engines to have limitations in handling memory alignment, data reorganization, and control flow. Many supporting instructions such as data permutations, address generations, and loop branches, are required to aid in the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that have the capabilities for combining VLIW or TTA processing with a unified scalar and long vector computations as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called multimedia coprocessor (MCP). This architecture includes following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively.  相似文献   

15.
通过分簇结构实现向量化执行是一种高效而灵活的体系结构选择.在编译中间表示里,向量指令与标量指令交叠出现.分簇结构向量化实现的特殊方式给传统的寄存器分配框架带来了挑战.针对该问题,本文从向量指令的表示形式、Callee/Caller寄存器划分、向量寄存器分配等进行研究,并给出全局与局部向量寄存器的分配方法.  相似文献   

16.
句子对齐能够为跨语言的自然语言处理任务提供高质量的对齐句子对。受对齐句子对通常包含大量对齐的单词对这种直觉的启发,该文通过探索神经网络框架下词对间的语义相互作用来解决句子对齐问题。特别地,该文提出的词对关联网络通过融合三种相似性度量方法从不同角度来捕获词对之间的语义关系,并进一步融合它们之间的语义关系来确定两个句子是否对齐。在单调和非单调文本上的实验结果表明,该文提出的方法显著提高了句子对齐的性能。  相似文献   

17.
A. Biliris  S. Dar  N. H. Gehani 《Software》1993,23(12):1285-1303
C++ objects of types that have virtual functions or virtual base classes contain volatile (‘memory’) pointers. We call such pointers ‘hidden pointers’ because they were not specified by the user. If such C++ objects are made persistent, then these pointers become invalid across program invocations. We encountered this problem in our implementation of O++, which is a database language based on C++. O++ extends C++ with the ability to create and access persistent objects. In this paper, we describe the hidden pointers problem in detail and present several solutions to it. Our solutions are elegant in that they do not require modifying the C++ compiler or the semantics of C++. We also discuss another problem that arises because C++ allows base class pointers to point to derived class objects. C++ has emerged as the de facto standard language for software development, and database systems based on C++ have attracted much attention. We hope that the details and techniques presented will be useful to database researchers and to implementors of object-oriented database systems based on C++.  相似文献   

18.
In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility.In this paper we present a timing analysis that allows a compiler for a MIMD machine to eliminate a large fraction of the run-time synchronization by making efficient use of static code scheduling. Although these techniques can be adapted to be applied to most MIMD machines, this paper centers on the analysis and scheduling for barrier MIMD machines. Barrier MIMDs are asynchronous multiple instruction stream/multiple data stream architectures capable of parallel execution of variable execution-time instructions and arbitrary control flow (e.g., while loops and calls). However, they also incorporate a special hardware barrier synchronization mechanism that facilitates static scheduling by providing a mechanism which the compiler can use to enforce precise timing constraints. In other words, the compiler tracks relative timing between processors and uses static code scheduling until the timing imprecision becomes too large, at which point the compiler simply inserts a barrier to reduce that timing imprecision to zero (or a small constant).This paper describes new scheduling and barrier placement algorithms for barrier MIMDs that are based loosely on the list scheduling approach employed for VLIWs [Ellis 1985]. In addition, the experimental results from scheduling thousands of synthetic benchmark programs for a parameterized barrier MIMD machine are presented.  相似文献   

19.
编译器是嵌入式系统软件中的重要组成部分,它对嵌入式系统的软件开发有重要影响。本文在将体系结构描述语言(ADL)与传统可移植编译器相结合,自动生成嵌入式系统编译器的思想基础上,对自动生成工具genmd的结构进行了分析。重点对其指令识别和机器描述生成部分进行了抽象和建模。同时,针对genmd不支持分支跳转类指令的问题提出了改进方案。  相似文献   

20.
在国产申威高性能多核服务器系统中,基础编译系统对应用程序中访存操作进行代码生成时,没有考虑国产处理器指令特征,导致编译器生成的访存地址计算代码效率较低,影响国产高性能处理器的性能。为充分发挥国产处理器高性能计算能力,提出一种加速访存地址计算的编译优化方法。加速访存地址计算编译优化基于处理器支持带扩展因子的运算指令,在编译器后端内存地址表达式合法性检查中,添加针对乘加模式的地址计算表达式合法性检查算法,自动识别地址表达式中存在的乘加运算并进行合法性检验,对符合条件的地址表达式在代码生成阶段匹配生成带扩展因子的运算指令来快速计算访存地址,从而加快访存指令的发射与执行以及应用程序中的访存地址生成,提升访存效率。使用行业标准性能测试集SPEC CPU2006对优化效果进行评测,结果表明,相比优化前SPECspeed Integer与SPECspeed Float Point两个子集,该优化方法平均性能分别提高了2.53%与1.50%。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号