Similar Literature
20 similar documents retrieved.
1.
The IA-64 architecture uses a 64-bit instruction set that applies Explicitly Parallel Instruction Computing (EPIC) technology to deliver higher instruction-level parallelism (ILP), but this also makes the analysis and transformation of IA-64 binary code streams more difficult. This paper presents the structure and implementation of an automatic generator for IA-64 decoders: the generator takes an SLED description of the IA-64 instruction set as input and automatically produces the C code of an IA-64 instruction decoder. The generator effectively shortens decoder development time, ensures decoder correctness, and improves decoder execution efficiency. The implemented generator can be applied to IA-64 binary translation and reverse engineering.
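The abstract does not show the generated decoder itself. As a rough illustration of the table-driven style such SLED-based generators emit (sketched here in Python rather than the generated C, with invented masks, field positions, and mnemonics rather than real IA-64 encodings), a pattern-table decoder looks like this:

```python
# Minimal sketch of a table-driven instruction decoder, in the spirit of
# tools generated from SLED-like descriptions. The opcode masks, field
# positions, and mnemonics below are hypothetical, not real IA-64 encodings.

PATTERNS = [
    # (mask, value, mnemonic, operand fields as (name, shift, width))
    (0xF0000000, 0x10000000, "add", [("rd", 20, 7), ("rs1", 13, 7), ("rs2", 6, 7)]),
    (0xF0000000, 0x20000000, "ld",  [("rd", 20, 7), ("base", 13, 7)]),
]

def extract(word, shift, width):
    """Pull an unsigned bit field out of an instruction word."""
    return (word >> shift) & ((1 << width) - 1)

def decode(word):
    """Return (mnemonic, {field: value}) for the first matching pattern."""
    for mask, value, mnemonic, fields in PATTERNS:
        if word & mask == value:
            return mnemonic, {name: extract(word, s, w) for name, s, w in fields}
    return "unknown", {}

if __name__ == "__main__":
    print(decode(0x10A0_2040))  # matches the hypothetical "add" pattern
```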

2.
Targeting the BWDSP architecture, this work analyzes assembly-level optimization methods for string and memory handling functions. Building on vectorization and software pipelining, it exploits efficient memory-access instructions, a zero-overhead loop mechanism that improves loop execution efficiency, and instruction reordering, combined with the loop characteristics of each function, to mine instruction-level parallelism in the string and memory handling functions. Experimental results show that the optimized library functions run within 1.5 times the theoretical running time of the functions provided by the hardware platform, which is significant for improving the overall performance of the BWDSP platform.

3.
杨旭  何虎  孙义和 《计算机学报》2011,34(1):182-192
Application demands push today's processors to exploit as much of the instruction-level parallelism present in programs as possible; however, highly parallel hardware and aggressive instruction scheduling techniques place enormous pressure on register resources. With a single register file it is very difficult to maintain both a high degree of instruction-level parallelism and a high clock frequency, because when the degree of ILP is high enough, the limit on the number of access ports of a single register file will…

4.
Speculative execution is the execution of instructions before it is known whether those instructions should be executed. In speculative execution for instruction-level parallelism (ILP) processors, the shadow register concept provides a hardware solution that protects the semantics of a program from pollution by boosted instructions that are incorrectly predicted. In a recent study, Chang and Lai proposed a special register file based on shadow registers, named the conjugate register file (CRF), to support multilevel boosting in speculative execution, together with a scheduling heuristic named frequency-driven scheduling to work with the CRF. However, the ability to boost is still constrained, since the register-pair concept forces speculatively produced results to be stored in dedicated locations. Moreover, when the parallelism potential grows to tens of instructions through advances in hardware techniques, the heavy demand on register usage and the complexity of the register file may well become a serious bottleneck for the exploitation of ILP. In this paper, the frequency-driven scheduling algorithm is modified by replacing the function of the hardware CRF with variable renaming during compilation. The new scheduling technique, named LESS, can exploit the parallelism efficiently with a limited number of registers. Moreover, since the technique benefits ILP without any special hardware support, it can be incorporated into any other ILP architecture without changing its instruction set architecture (ISA). Simulation results show that the performance achievable with LESS is better than that of other existing methods. For example, under an ILP model with an issue rate of 8, speculative execution achieves a 34% increase in parallelism, compared to 18% for the CRF scheme.
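The abstract describes LESS only at a high level. The sketch below illustrates the general idea of replacing shadow/conjugate registers with compile-time variable renaming: a speculatively boosted instruction writes a fresh virtual register, and the value is committed to the original destination only on the predicted path. The toy IR and names are assumptions for illustration, not the paper's actual algorithm.

```python
# Rough illustration of compile-time renaming for speculative code motion.
# Instructions are (dest, op, srcs) triples in a toy IR; names are invented.

from itertools import count

_fresh = count()

def rename_for_speculation(instr):
    """Give a speculatively boosted instruction a fresh destination register."""
    dest, op, srcs = instr
    new_dest = f"v{next(_fresh)}"
    boosted = (new_dest, op, srcs)          # executes above the branch
    commit  = (dest, "mov", [new_dest])     # placed on the predicted path
    return boosted, commit

# Original code:   if (p) { r3 = r1 + r2 }
# After boosting the add above the branch:
boosted, commit = rename_for_speculation(("r3", "add", ["r1", "r2"]))
print(boosted)  # ('v0', 'add', ['r1', 'r2'])  -- safe: r3 is not clobbered
print(commit)   # ('r3', 'mov', ['v0'])        -- only on the taken path
```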

5.
Automatic Checking Techniques for Instruction Descriptions
杨欣  赵荣彩  李崇 《计算机工程与设计》2006,27(18):3344-3348,3352
By describing the instruction set in a high-level specification language, instruction encoding and decoding programs are generated automatically, which automates the tedious and highly error-prone work of retargeting machine code; the correctness of the description is then checked automatically on a disassembly test platform. For the 64-bit IA-64 architecture with its higher instruction-level parallelism (ILP), this is of great significance for the automatic analysis and transformation of binary instruction streams and for implementing IA-64 binary translation and reverse engineering based on descriptions of the machine and operating system. The paper outlines the SLED description of the IA-64 instructions and details the design and implementation of reverse tools generated automatically with NJMCT.

6.
High-performance microprocessors are currently designed with the purpose of exploiting instruction-level parallelism (ILP). The techniques used in their design and the aggressive scheduling techniques used to exploit this ILP tend to increase the register requirements of the loops. This paper reviews hardware and software techniques that alleviate the high register demands of aggressive scheduling heuristics on VLIW cores. From the software point of view, instruction scheduling can stretch lifetimes and reduce the register pressure. If more registers than those available in the architecture are required, some actions (such as the injection of spill code) have to be applied to reduce this pressure, at the expense of some performance degradation. From the hardware point of view, this degradation could be reduced if a high-capacity register file were included without causing a negative impact on the design of the processor (cycle time, area and power dissipation). Novel organizations for the register file based on clustering and hierarchical organization are necessary to meet the technology constraints. This paper proposes the use of a clustered organization together with an aggressive instruction scheduling technique that minimizes the negative effect of the limitations imposed by the register file organization.
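As a small, assumed illustration of the register-pressure check that drives spill-code injection (not the paper's actual heuristic), the peak number of simultaneously live values can be compared against the register file size like this:

```python
# Toy sketch of the register-pressure check behind spill-code injection.
# Live ranges and the spill policy here are illustrative assumptions only.

def max_pressure(live_ranges, num_cycles):
    """live_ranges: {reg: (def_cycle, last_use_cycle)} -> peak register pressure."""
    pressure = [0] * num_cycles
    for start, end in live_ranges.values():
        for c in range(start, end + 1):
            pressure[c] += 1
    return max(pressure)

def needs_spill(live_ranges, num_cycles, num_regs):
    """If peak pressure exceeds the register file size, some value must be spilled."""
    return max_pressure(live_ranges, num_cycles) > num_regs

ranges = {"r1": (0, 5), "r2": (1, 3), "r3": (2, 6), "r4": (2, 4)}
print(max_pressure(ranges, 8))    # 4 values live at cycles 2-3
print(needs_spill(ranges, 8, 3))  # True: spill (or split) one lifetime
```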

7.
Conditional branch instructions are among the most frequently used instructions in VLIW DSPs, and loops are one of their main application areas. An efficient design for conditional branches is key to efficient execution on a VLIW DSP. To address the complexity of implementing such instructions, this paper discusses a new structure, the Hyperblock, and uses it to design and implement the conditional branch instructions of the BWDSP100 processor. Experiments show that the approach yields good optimization results for both DSP kernel algorithms and real application programs, improving instruction-level parallelism.
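The abstract does not detail the BWDSP100 implementation. As a generic illustration of the idea Hyperblock formation relies on (merging both branch paths under predicates so the branch itself disappears), here is a toy if-conversion pass over a made-up three-address IR; the predicate notation is an assumption:

```python
# Toy if-conversion: fold a simple diamond (if/else) into predicated
# instructions, the transformation that Hyperblock formation builds on.
# The IR and predicate notation are invented for illustration.

def if_convert(cond, then_block, else_block):
    """Return a single straight-line block guarded by predicates p / !p."""
    merged = [("p", "setp", [cond])]                 # p = evaluate condition
    merged += [(dst, op, srcs, "p") for dst, op, srcs in then_block]
    merged += [(dst, op, srcs, "!p") for dst, op, srcs in else_block]
    return merged

block = if_convert(
    "r1 < r2",
    then_block=[("r3", "add", ["r1", "r2"])],
    else_block=[("r3", "sub", ["r1", "r2"])],
)
for insn in block:
    print(insn)   # no branch left: both paths can be scheduled together
```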

8.
As a 64-bit processor architecture, IA-64 provides higher instruction-level parallelism (ILP) and represents the direction of a new generation of microprocessors. The automatic analysis and transformation of IA-64 binary instruction streams is of great significance for implementing IA-64 binary translation and reverse engineering based on descriptions of the machine and operating system. This paper outlines the characteristics of SLED and of the IA-64 instructions, describes in detail the SLED-based description of IA-64 instructions and the design and implementation of reverse tools generated automatically with MLTK, and gives test results for the automatically generated disassembler.

9.
Research and Implementation of a Dynamic VLIW Scheduling Mechanism
The VLIW architecture is an important means of exploiting ILP; its advantages are a simple, regular structure and low hardware complexity. However, relying entirely on the compiler for instruction scheduling limits the performance of VLIW architectures. This paper proposes a dynamic VLIW scheduling mechanism based on deterministic instruction latencies: exploiting the fact that the execution time of most instructions is fixed, it reorders instruction execution according to run-time information to exploit further ILP. Experimental results on an FPGA show that the mechanism has linear hardware complexity.
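A minimal software model of latency-aware dynamic issue, under the assumption that latencies are fixed and known, is sketched below; it is a toy illustration of the idea, not the hardware design in the paper.

```python
# Illustrative model of latency-aware dynamic issue: since most instruction
# latencies are fixed, a small table of "ready cycles" is enough to reorder
# issue at run time. Instruction format and latencies are invented.

def dynamic_issue(instrs, latency):
    """instrs: list of (name, dest, srcs). Returns (cycle, name) issue order."""
    ready = {}                      # register -> cycle its value becomes available
    schedule = []
    cycle = 0
    pending = list(instrs)
    while pending:
        # pick any instruction whose sources are available this cycle
        for i, (name, dest, srcs) in enumerate(pending):
            if all(ready.get(s, 0) <= cycle for s in srcs):
                schedule.append((cycle, name))
                ready[dest] = cycle + latency[name]
                pending.pop(i)
                break
        else:
            cycle += 1              # nothing ready: stall one cycle
    return schedule

lat = {"load": 3, "add": 1, "mul": 2}
prog = [("load", "r1", []), ("mul", "r3", ["r1"]), ("add", "r4", ["r2"])]
print(dynamic_issue(prog, lat))   # the independent add issues before the mul
```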

10.
SMA: A Speculative Multithreading Architecture
肖刚  周兴铭  徐明  邓鹍 《计算机学报》1999,22(6):582-590
This paper proposes a new ILP processor architecture, the speculative multithreading architecture (SMA). It combines speculative execution with multithreaded execution, speculating at the granularity of whole threads, with multiple threads executing in parallel and sharing the processor's hardware resources. In this way the processor both forms a large dynamic instruction window by combining the instruction windows of the individual threads, extracting more of the ILP in the program, and uses multithreaded execution to hide various long-latency operations, achieving high resource utilization. The paper introduces the SMA execution model and discusses the implementation of an SMA processor and its key techniques.

11.
12.
When implementing a dynamic binary translation system, traditional software-based solutions suffer from significant runtime overhead and are not suitable for complex optimizations. This paper proposes using hardware–software collaboration techniques to create a highly efficient dynamic binary translation system, CoDBT, which emulates several heterogeneous ISAs (Instruction Set Architectures) on a host processor without changes to the existing processor. We analyze the major performance bottlenecks by evaluating the overhead of a pure-software DBT. Guidelines are provided for applying a suitable hardware–software partitioning process to CoDBT, as are algorithms for the hardware-based binary translator and for code cache management. An intermediate instruction set is introduced to make multi-source translation more practical and scalable. Meanwhile, a novel runtime profiling strategy is integrated into the infrastructure to collect program hot-spot information in support of potential future optimizations. The advantages of using co-design as an implementation approach for a DBT system are assessed with several SPEC benchmarks. Our results demonstrate that significant performance improvements can be achieved with appropriate hardware support choices. CoDBT could be an efficient and cost-effective solution for situations where the usual methods of performance acceleration for dynamic binary translation are inappropriate.
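The abstract only names the runtime profiling strategy; a minimal software model of hot-spot detection in a DBT (count executions of each translated block and flag it once it crosses a threshold) might look like the sketch below. The class, threshold, and policy are assumptions, not CoDBT's actual design.

```python
# Minimal model of runtime hot-spot profiling in a dynamic binary translator:
# count how often each translated block executes and flag it for further
# optimization once it crosses a threshold. Threshold and policy are invented.

from collections import defaultdict

HOT_THRESHOLD = 50          # assumed value, not from the paper

class HotSpotProfiler:
    def __init__(self):
        self.counts = defaultdict(int)
        self.hot = set()

    def on_block_executed(self, block_pc):
        """Called each time the translated block at guest PC block_pc runs."""
        self.counts[block_pc] += 1
        if block_pc not in self.hot and self.counts[block_pc] >= HOT_THRESHOLD:
            self.hot.add(block_pc)
            return True          # caller may hand the block to the optimizer
        return False

prof = HotSpotProfiler()
for _ in range(60):
    if prof.on_block_executed(0x400120):
        print("block 0x400120 became hot -> reoptimize")
```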

13.
Discovering and exploiting instruction level parallelism in code will be key to future increases in microprocessor performance. What technical challenges must compiler writers meet to better use ILP? Instruction level parallelism allows a sequence of instructions derived from a sequential program to be parallelized for execution on multiple pipelined functional units. If industry acceptance is a measure of importance, ILP has blossomed. It now profoundly influences the design of almost all leading edge microprocessors and their compilers. Yet the development of ILP is far from complete, as research continues to find better ways to use more hardware parallelism over a broader class of applications.

14.
Clustered VLIW DSPs reduce hardware design complexity but significantly increase the difficulty of instruction scheduling for the compiler. This paper proposes first partitioning instructions into clusters and then scheduling within each cluster, which fully exploits the clustered structure to improve instruction-level parallelism and reduce scheduling time at the cost of only a few added copy instructions.
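As a rough sketch of what a cluster-assignment pass can look like (a greedy operand-affinity heuristic with cross-cluster copy insertion, assumed for illustration rather than taken from the paper):

```python
# Toy cluster assignment: place each instruction on the cluster that already
# holds most of its operands, inserting a copy for operands that live on the
# other cluster. A greedy illustration, not the paper's algorithm.

def assign_clusters(instrs, live_in=None, num_clusters=2):
    """instrs: list of (dest, srcs). live_in maps pre-placed registers to clusters."""
    home = dict(live_in or {})   # register -> cluster holding its value
    copies, placement = [], []
    for dest, srcs in instrs:
        votes = [0] * num_clusters
        for s in srcs:
            if s in home:
                votes[home[s]] += 1
        cluster = max(range(num_clusters), key=lambda c: votes[c])
        for s in set(srcs):
            if s in home and home[s] != cluster:
                copies.append(("copy", s, home[s], cluster))  # cross-cluster move
        home[dest] = cluster
        placement.append((dest, cluster))
    return placement, copies

prog = [("r3", ["r1", "r1"]), ("r4", ["r2", "r2"]), ("r5", ["r3", "r4"])]
print(assign_clusters(prog, live_in={"r1": 0, "r2": 1}))
# r5 lands on cluster 0 and a single copy instruction for r4 is inserted
```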

15.
《Parallel Computing》2013,39(10):586-602
Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code mixed with data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor integrates single-instruction multiple-data (SIMD) engines into architectures exploiting ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions causes current SIMD engines to have limitations in handling memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generations, and loop branches, are required to aid in the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that have the capabilities for combining VLIW or TTA processing with unified scalar and long vector computations as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called the multimedia coprocessor (MCP). This architecture includes the following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively.

16.
The superblock: An effective technique for VLIW and superscalar compilation
A compiler for VLIW and superscalar processors must expose sufficient instruction-level parallelism (ILP) to effectively utilize the parallel hardware. However, ILP within basic blocks is extremely limited for control-intensive programs. We have developed a set of techniques for exploiting ILP across basic block boundaries. These techniques are based on a novel structure called the superblock. The superblock enables the optimizer and scheduler to extract more ILP along the important execution paths by systematically removing constraints due to the unimportant paths. Superblock optimization and scheduling have been implemented in the IMPACT-I compiler. This implementation gives us a unique opportunity to fully understand the issues involved in incorporating these techniques into a real compiler. Superblock optimizations and scheduling are shown to be useful while taking into account a variety of architectural features.
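As a simplified illustration of the usual superblock formation step (select the hottest trace, then tail-duplicate trace blocks that have side entrances so the trace is entered only at the top), consider the toy CFG pass below; the CFG, profile counts, and block names are assumptions, not the IMPACT-I implementation.

```python
# Simplified superblock formation: pick the hottest successor at each block
# to build a trace, then tail-duplicate any trace block with a side entrance
# so the trace has a single entry. Toy CFG and profile, for illustration only.

def form_superblock(cfg, edge_freq, entry):
    """cfg: {block: [succs]}, edge_freq: {(src, dst): count}. Returns (trace, duplicated)."""
    trace = [entry]
    block = entry
    while cfg.get(block):
        block = max(cfg[block], key=lambda s: edge_freq.get((block, s), 0))
        if block in trace:
            break                      # avoid cycles in this toy version
        trace.append(block)

    preds = {b: [p for p in cfg for s in cfg[p] if s == b] for b in cfg}
    duplicated = [b for b in trace[1:] if len(preds.get(b, [])) > 1]
    return trace, duplicated           # duplicated blocks get private copies

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
freq = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
print(form_superblock(cfg, freq, "A"))   # trace A-B-D; D is tail-duplicated
```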

17.
曾斌  安虹  王莉 《计算机科学》2010,37(3):248-252
Exploiting ILP (instruction-level parallelism) is one of the key factors by which modern high-performance processors achieve high performance. Wide-issue superscalar processors, VLIW processors, and dataflow processors obtain high performance only when many adjacent instructions execute in parallel. A key problem for dataflow processors is how to broadcast an instruction's result to its target instructions efficiently without reading or writing a centralized register file. For every instruction whose number of targets exceeds the number that can be encoded in the instruction, the compiler must insert a software fanout tree made of MOV instructions to broadcast the result to the multiple targets. To expose more ILP to the hardware execution substrate, an improved software fanout tree generation algorithm is proposed: the weight of each target instruction is computed from its execution probability and the length of its critical path to the exit of the block containing it, the leaves are then sorted by these priority weights, and the software fanout tree is constructed in priority order so that the instruction's result is broadcast to its multiple targets. Experimental results show that this algorithm achieves a substantial performance improvement over the traditional software fanout tree generation algorithm.
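A small sketch of the priority computation described above follows. The weight formula (probability times critical-path length) and the flat grouping of targets under successive MOV nodes are assumptions made for illustration; the paper's exact weighting and tree shape are not given in the abstract.

```python
# Sketch of priority-ordered fanout tree construction: weight each target by
# its execution probability and critical-path length to the block exit, then
# feed the highest-weight targets first. Weight formula and grouping assumed.

def build_fanout_tree(targets, fanout=2):
    """targets: list of (name, exec_prob, critical_path_len).
    Returns (mov_node, children) groups, highest priority first (flattened view)."""
    ranked = sorted(targets,
                    key=lambda t: t[1] * t[2],   # assumed weight: prob * path length
                    reverse=True)
    tree = []
    for i in range(0, len(ranked), fanout):
        tree.append((f"mov{i // fanout}", [t[0] for t in ranked[i:i + fanout]]))
    return tree

targets = [("t1", 0.9, 5), ("t2", 0.2, 8), ("t3", 0.9, 2), ("t4", 0.5, 4)]
print(build_fanout_tree(targets))
# t1 (4.5) and t4 (2.0) are fed first; t3 and t2 wait one more MOV level
```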

18.
Nearly two decades of research in the area of Inductive Logic Programming (ILP) have seen steady progress in clarifying its theoretical foundations and regular demonstrations of its applicability to complex problems in very diverse domains. These results are necessary, but not sufficient, for ILP to be adopted as a tool for data analysis in an era of very large machine-generated scientific and industrial datasets, accompanied by programs that provide ready access to complex relational information in machine-readable forms (ontologies, parsers, and so on). Besides the usual issues of ease of use, ILP is now confronted with questions of implementation. We are concerned here with two of these, namely: can an ILP system construct models efficiently when (a) dataset sizes are too large to fit in the memory of a single machine; and (b) search space sizes become prohibitively large to explore using a single machine. In this paper, we examine the applicability to ILP of a popular distributed computing approach that provides a uniform way for performing data and task parallel computations in ILP. The MapReduce programming model allows, in principle, very large numbers of processors to be used without any special understanding of the underlying hardware or software involved. Specifically, we show how the MapReduce approach can be used to perform the coverage-test that is at the heart of many ILP systems, and to perform multiple searches required by a greedy set-covering algorithm used by some popular ILP systems. Our principal findings with synthetic and real-world datasets for both data and task parallelism are these: (a) Ignoring overheads, the time to perform the computations concurrently increases with the size of the dataset for data parallelism and with the size of the search space for task parallelism. For data parallelism this increase is roughly in proportion to increases in dataset size; (b) If a MapReduce implementation is used as part of an ILP system, then benefits for data parallelism can only be expected above some minimal dataset size, and for task parallelism can only be expected above some minimal search-space size; and (c) The MapReduce approach appears better suited to exploit data-parallelism in ILP.
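The abstract only names the coverage test; in a MapReduce formulation it is naturally a map of each example partition through the clause's cover check followed by a sum reduce. Below is a small in-memory imitation of that structure; the `covers` predicate is a stand-in, not a real ILP-system API.

```python
# In-memory imitation of the MapReduce coverage test: each mapper counts the
# examples in its partition covered by the candidate clause, and the reducer
# sums the counts. `covers` is a stand-in for the real ILP coverage check
# (e.g. a Prolog query), not an actual ILP-system API.

from functools import reduce

def covers(clause, example):
    """Stand-in coverage test: the clause 'covers' examples it is a subset of."""
    return clause.issubset(example)

def map_partition(clause, partition):
    return sum(1 for ex in partition if covers(clause, ex))

def coverage(clause, partitions):
    counts = map(lambda part: map_partition(clause, part), partitions)  # map step
    return reduce(lambda a, b: a + b, counts, 0)                        # reduce step

clause = {"parent(X,Y)", "male(X)"}
partitions = [
    [{"parent(X,Y)", "male(X)", "age(X,40)"}, {"parent(X,Y)"}],
    [{"parent(X,Y)", "male(X)"}],
]
print(coverage(clause, partitions))   # 2 examples covered across partitions
```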

19.
While high-performance architectures have included some Instruction-Level Parallelism (ILP) for at least 25 years, recent computer designs have exploited ILP to a significant degree. Although a local scheduler is not sufficient for generation of excellent ILP code, it is necessary, as many global scheduling and software pipelining techniques rely on a local scheduler. Global scheduling techniques are well-documented, yet practical discussions of local schedulers are notable in their absence. This paper strives to remedy that disparity by describing a list scheduling framework and several important practical details that, taken together, allow implementation of an efficient local instruction scheduler that is easily retargetable for ILP architectures. The foundation of our machine-independent instruction scheduler is a timing model that allows easy retargetability to a wide range of architectures. In addition to describing how a general list scheduler can be implemented within the framework of our timing model, experimental results indicate that lookahead scheduling can profoundly improve a scheduler's ability to produce a legal schedule. Further experimental data shows that deciding to schedule a data dependence DAG (DDD) in forward or reverse order depends significantly upon the target architecture, suggesting the possibility of scheduling in each direction and using the better of the two schedules. In contrast, experiments demonstrate little difference in code quality for schedules generated by either instruction-driven or operation-driven schedulers. Thus, the inherent flexibility of operation-driven methods suggests including that approach in a retargetable instruction scheduler. List scheduling is, of course, a heuristic scheduling method. A variety of scheduling heuristics are presented. In addition, the paper describes a method, using a genetic algorithm search, to 'fine-tune' the weights of twenty-four individual heuristics to form a DDD-node heuristic tuned to a specific architecture. © 1998 John Wiley & Sons, Ltd.
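A bare-bones sketch of a list scheduler whose node priority is a weighted sum of simple heuristics is given below; the DAG, latencies, and the two heuristics (critical-path height and successor count) are assumptions for illustration, and the weights here are fixed by hand rather than tuned by a genetic algorithm as in the paper.

```python
# Bare-bones list scheduler over a data-dependence DAG (DDD). Node priority
# is a weighted sum of simple heuristics, mirroring the idea of tunable
# heuristic weights. Sequential-issue timing model, for illustration only.

def list_schedule(ddd, latency, weights=(1.0, 0.5)):
    """ddd: {node: [successors]}; latency: {node: int}. Returns (cycle, node) list."""
    preds = {n: 0 for n in ddd}
    for n, succs in ddd.items():
        for s in succs:
            preds[s] += 1

    memo = {}
    def height(n):
        """Critical-path height of n to any DAG leaf."""
        if n not in memo:
            memo[n] = latency[n] + max((height(s) for s in ddd[n]), default=0)
        return memo[n]

    def priority(n):
        w_height, w_succs = weights
        return w_height * height(n) + w_succs * len(ddd[n])

    ready = [n for n, c in preds.items() if c == 0]
    schedule, cycle = [], 0
    while ready:
        n = max(ready, key=priority)          # pick the highest-priority ready node
        ready.remove(n)
        schedule.append((cycle, n))
        cycle += latency[n]
        for s in ddd[n]:
            preds[s] -= 1
            if preds[s] == 0:
                ready.append(s)
    return schedule

ddd = {"load": ["mul"], "mul": ["add"], "add": [], "inc": []}
lat = {"load": 3, "mul": 2, "add": 1, "inc": 1}
print(list_schedule(ddd, lat))   # the long load/mul/add chain is started first
```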

20.
Cooperative Global Instruction Scheduling and Register Allocation
Instruction-level parallelism is a defining feature of modern high-performance processors, and the compiler has a crucial influence on how well the parallel processing capability of such processors is exploited. This paper discusses the core problems in compiling for instruction-level parallelism: global instruction scheduling and register allocation. Using as background the compilation system the authors built for a new explicitly parallel architecture microprocessor, it describes the phase-ordering problem between instruction scheduling and register allocation faced in the back-end design of such compilers, and presents a cooperative global instruction scheduling and register allocation method proposed to solve it.
