1.

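Su Ming, Song Zongyu, Zhao Rongcai, Zhong Sheng 《Computer Engineering》2007,33(6):86-88,9

The IA-64 architecture supports predicated execution, which increases instruction-level parallelism. However, the optimizations a compiler performs to exploit this feature deeply restructure the program code, which makes it hard for reverse engineering to recover the original program logic from the optimized executable. This paper proposes a de-optimization technique that eliminates predicates, improving the quality of reverse engineering of executable code.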

2.

Construction of speculative optimization algorithms

A. A. Belevantsev S. S. Gaisaryan V. P. Ivannikov 《Programming and Computer Software》2008,34(3):138-153

In modern processors, instructions are often issued before it is known whether their execution is required. This technique, called speculative execution, helps expose instruction-level parallelism. In EPIC architectures, speculative execution is controlled entirely by the compiler, which makes it possible to avoid complex hardware mechanisms for supporting speculative instruction issue. Moreover, the idea of speculative execution can be used by the compiler in machine-independent optimizations. The paper describes a scheme for constructing speculative optimizations that is based on selecting the properties of the control flow and data flow that matter for optimization and on estimating the probabilities that these properties hold. The probabilities found are used for searching for and constructing advantageous speculative and bookkeeping transformations. For optimizations that include only speculative movements of instructions upwards along the control flow graph, a method based on the suggested scheme has been developed that includes algorithms for finding probabilities of data and control dependences, for estimating the benefit of speculative movements, and for constructing recovery code. On the basis of this method, an algorithm for the speculative scheduling of instructions for the Intel Itanium architecture has been developed and implemented. Specific features of its implementation and experimental results are described.

3.

Itanium 2 processor 6M: higher frequency and larger L3 cache

Rusu S. Muljono H. Cherkauer B. 《Micro, IEEE》2004,24(2):10-18

The third-generation Itanium processor targets the high-performance server and workstation market. To do so, the design team sought to provide higher performance through increased frequency and a larger L3 cache. At the same time, we had to limit the power dissipation to fit into the existing platform envelope. These considerations led to what we now call the Itanium 2 processor 6M: the latest generation of Itanium 2, which features a 6-Mbyte, 24-way set-associative on-die L3 cache. The design implements a 2-bundle 64-bit explicitly parallel instruction computing (EPIC) architecture and is fully compatible with previous implementations. Although this processor's frequency is 50 percent higher than that of the previous generation, the maximum power dissipation holds flat at 130 W to ensure the platform's backward compatibility.

4.

Montecito: a dual-core, dual-thread Itanium processor (cited 2 times: 0 self-citations, 2 by others)

McNairy C. Bhatia R. 《Micro, IEEE》2005,25(2):10-20

Intel's Montecito is the first Itanium processor to feature duplicate, dual-thread cores and cache hierarchies on a single die. It features a landmark 1.72 billion transistors and server-focused technologies, and it requires only 100 watts of power. Intel's Itanium 2 processor series has regularly delivered additional performance through the increased frequency and cache as evidenced by the 6-Mbyte and 9-Mbyte versions.

5.

Hardware-Software Collaborative Techniques for Runtime Profiling and Phase Transition Detection

Youfeng Wu Yong-Fong Lee 《Journal of Computer Science and Technology》2005,20(5):665-675

Dynamic optimization relies on runtime profile information to improve the performance of program execution. Traditional profiling techniques incur significant overhead and are not suitable for dynamic optimization. In this paper, a new profiling technique is proposed that combines the strengths of software and hardware to achieve near-zero-overhead profiling. The compiler passes profiling requests to the hardware as a few bits of information in branch instructions, and the processor executes profiling operations asynchronously in available free slots or on dedicated hardware. The compiler instrumentation for this technique is implemented using an Itanium research compiler. The results show that accurate block profiling incurs very little overhead for the user program in terms of program scheduling cycles. For example, the average overhead is 0.6% for the SPECint95 benchmarks. The hardware support required for the new profiling is practical. The technique is extended to collect edge profiles for continuous phase transition detection. It is believed that the hardware-software collaborative scheme will enable many profile-driven dynamic optimizations for EPIC processors such as the Itanium processors.

6.

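Three algorithms for improving the effectiveness of software pipelining: comparison and combination (cited 1 time: 0 self-citations, 1 by others)

Li Wenlong, Chen Yu, Lin Haibo, Tang Zhizhong 《Journal of Software》2005,16(10):1822-1832

Software pipelining is a technique for exploiting instruction-level parallelism in loops: it speeds up a loop by executing several consecutive iterations in parallel. Overlapping the loop bodies increases register requirements and raises register pressure, and when the target processor provides too few registers, software pipelining may fail. The register requirements of software-pipelined loops in the NAS and SPEC2000 benchmarks are evaluated on the Itanium processor, and a shortage of static registers is found to be the main cause of software pipelining failure. Three algorithms are proposed to increase the number of software-pipelined loops and improve the effectiveness of software pipelining: register sensitive unrolling (RSU), which limits the loop unrolling factor; stacked register allocation (SRA); and variable type conversion (VTC). RSU determines a reasonable unrolling factor from the static register requirement, which raises the success rate of software pipelining. SRA and VTC use idle stacked registers and rotating registers, respectively, in place of static registers, which improves register utilization. The three algorithms are implemented in ORC (Open Research Compiler), an open-source compiler targeting the Itanium processor. Their effectiveness is compared on the NAS programs, and their combined use is also studied and evaluated experimentally.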

7.

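Implementation of cryptographic algorithms on the Itanium processor

Chen Xun, Jiang Jingfei, Zhang Minxuan 《Computer Engineering and Applications》2004,40(15):40-42,208

When existing C implementations of block ciphers are compiled to assembly code with Itanium Compiler 7.0, analysis of the assembly shows that the compiler does not make full use of the resources the Itanium processor provides. To address this problem, the paper proposes methods for implementing common cryptographic algorithms efficiently on the Itanium processor, mainly by using the SIMD instructions in the Itanium instruction set to increase processing parallelism, and discusses how to use the Itanium processor's SIMD instructions.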

8.

Itanium processor microarchitecture

Sharangpani H. Arora H. 《Micro, IEEE》2000,20(5):24-43

The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA). The design team optimized the processor to meet a wide range of requirements: high performance on Internet servers and workstations, support for 64-bit addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and platforms. The processor employs EPIC (explicitly parallel instruction computing) design concepts for a tighter coupling between hardware and software. In this design style the hardware-software interface lets the software exploit all available compilation time information and efficiently deliver this information to the hardware. It addresses several fundamental performance bottlenecks in modern computers, such as memory latency, memory address disambiguation, and control flow dependencies.

9.

The IA-64 Itanium processor cartridge

Samaras W.A. Cherukuri N. Venkataraman S. 《Micro, IEEE》2001,21(1):82-89

The Itanium processor cartridge is a packaging optimization for electrical and thermal performance in a server environment. The 3-in. x 5-in. cartridge contains the Itanium CPU, up to 4 megabytes of level-3 (L3) cache, an innovative power delivery scheme, and an integrated vapor chamber thermal spreading lid for removing power. Cartridges and a chip set can be ganged electrically by means of a glueless bidirectional, multidrop system bus. Power is delivered through a custom connection with separate voltages for the 0.18-micron CPU and 0.25-micron custom cache devices. An I²C serial connection provides access to system management features such as temperature monitoring and cartridge identification information.

10.

Simple code optimizations

David R. Hanson 《Software: Practice and Experience》1983,13(8):745-763

Program optimization has received a great deal of attention for many years, which has resulted in numerous advances in compiler technology. The effectiveness of various simple optimizations has received comparatively little attention during the same time period. The simplicity of most programs suggests that straightforward optimizations pay the greatest dividends. This paper describes three such optimizations suitable for one-pass compilers. The optimizations involve expression rearrangement, instruction selection, and the use of a cache for the allocation of resources. The cost of these optimizations is low; none require major changes to the size or structure of the compiler or reduce compilation speed by more than 10%. The benefits are high; each optimization results in at least a 10% average reduction in object code size and a corresponding reduction in execution time. Examples and implementation details are also described.

11.

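Program performance optimization and analysis on the Itanium microprocessor

Chi Lihua, Liu Jie 《Computer Engineering & Science》2011,33(9):42

High-performance computing is applied ever more widely across science and engineering, but the performance that real applications achieve has not risen in proportion to machine peak performance: applications reach only about 5% to 10% of peak, and the gap is widening. Program performance optimization, as one way of addressing this problem, has attracted wide attention in academia. Based on the Itanium microprocessor, this paper summarizes general methods of program optimization and gives the general steps of program optimization and analysis. Following these steps, four programs are first analyzed in detail to find their performance bottlenecks and key subroutines. Then, according to the characteristics of each program, optimization techniques based on the cache and the instruction pipeline are applied. Finally, the optimization test results are given: the four programs gain 8% to 33% in performance, a good optimization result.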

12.

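Optimizing the dynamic binary translator DigitalBridge

Bai Tongxin, Feng Xiaobing, Wu Chenggang, Zhang Zhaoqing 《Computer Engineering》2005,31(10):103-105

This paper discusses the design and implementation of dynamic optimization in the dynamic binary translator DigitalBridge. It presents FHFS, a hot-path selection algorithm based on edge profiles, and applies two optimizations on hot paths: instruction-combining translation based on pattern matching, and lazy evaluation of condition flags. Experimental results show that the optimizations improve dynamic translation performance by 40% on average.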

13.

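A study of the relationship between instruction optimization design and instruction cache structure

Wang Yuying, Zhou Xingshe 《Journal of Chinese Computer Systems》2008,29(9)

Traditional instruction optimization methods usually do not consider adjusting the hardware architecture of the instruction cache, so they reach only locally optimal results. This paper studies experimentally the relationship between instruction optimization design and instruction cache configuration. Program instruction optimizations are implemented, the programs before and after optimization are run on platforms with different instruction cache configurations, and the cache miss rates are compared, providing an important reference for further improving instruction cache performance. The experimental results show that the instruction cache configuration has a very large effect on the performance of instruction optimization, and that considering instruction optimization and instruction cache structure together at the system design stage can substantially improve instruction cache performance.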
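
The kind of experiment this abstract describes, running the same code layout against different instruction cache configurations and comparing miss rates, can be illustrated with a small simulation. The sketch below is not the authors' setup: it assumes a direct-mapped cache, 4-byte instructions, and a made-up workload in which a loop body calls one helper routine. The names `miss_rate` and `call_trace` are hypothetical.

```python
def miss_rate(addresses, num_lines, line_size):
    """Miss rate of a direct-mapped cache over a trace of byte addresses."""
    tags = [None] * num_lines          # block currently held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // line_size      # memory block containing this address
        index = block % num_lines      # cache line that block maps to
        if tags[index] != block:       # miss: line is empty or holds another block
            tags[index] = block
            misses += 1
    return misses / len(addresses)


def call_trace(loop_start, helper_start, size, iterations, step=4):
    """Instruction fetch trace of a loop body that calls a helper each iteration.

    Assumption for illustration: both routines are `size` bytes long and
    instructions are `step` bytes wide.
    """
    loop = list(range(loop_start, loop_start + size, step))
    helper = list(range(helper_start, helper_start + size, step))
    return (loop + helper) * iterations


# Original layout: helper placed 1024 bytes away, so with a 16-line x 64-byte
# cache it maps to the same cache line as the loop body and evicts it.
conflicting = call_trace(0, 1024, 64, iterations=100)
# Optimized layout: helper placed right after the loop body.
adjacent = call_trace(0, 64, 64, iterations=100)

small_cache_before = miss_rate(conflicting, num_lines=16, line_size=64)
small_cache_after = miss_rate(adjacent, num_lines=16, line_size=64)
large_cache_before = miss_rate(conflicting, num_lines=32, line_size=64)
```

In this toy workload the layout change removes the conflict misses on the 16-line cache, while on the 32-line cache the original layout already avoids them, so the benefit of the code optimization depends on the cache configuration it runs against.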

14.

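A survey of performance analysis, optimization, and applications for heterogeneous fused processors

Zhang Feng, Zhai Jidong, Chen Zheng, Lin Jiazao, Du Xiaoyong 《Journal of Software》2020,31(8):2603-2624

With the continued progress of heterogeneous computing, heterogeneous fused processors that integrate devices such as CPUs and GPUs have developed substantially in recent years and have attracted attention from both academia and industry. Integrating multiple devices brings many benefits: for example, the devices can access the same memory and can interact at fine granularity. However, it also poses major challenges for system programming and optimization. Getting full performance from a heterogeneous fused processor requires exploiting features of the integrated architecture such as shared memory, and also requires optimizing for the different devices on the processor according to the characteristics of the specific application. This paper first analyzes current research involving heterogeneous fused processors, then presents work on their performance analysis and the related optimization techniques, and then summarizes their applications. Finally, it discusses future research directions and concludes.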

15.

Register allocation with instruction scheduling for VLIW-architectures

D. S. Ivanov 《Programming and Computer Software》2010,36(6):363-367

Interaction between the phases of register allocation and instruction scheduling is often considered in publications devoted to optimizations for the final stage of compilation. Typically, it is proposed to adapt one of the phases to the needs of the other without combining them into a single unit. However, their integration can substantially reduce compilation time and enhance the performance of the resulting code. This study describes an attempt to combine these phases as completely as possible, taking into account the features of static scheduling for VLIW architectures.

16.

Path Analysis and Renaming for Predicated Instruction Scheduling

Lori Carter Beth Simon Brad Calder Larry Carter Jeanne Ferrante 《International journal of parallel programming》2000,28(6):563-588

Increases in instruction level parallelism are needed to exploit the potential parallelism available in future wide issue architectures. Predicated execution is an architectural mechanism that increases instruction level parallelism by removing branches and allowing simultaneous execution of multiple paths of control, only committing instructions from the correct path. In order for the compiler to expose and use such parallelism, traditional compiler data-flow and path analysis needs to be extended to predicated code. In this paper, we motivate the need for renaming and for predicates that reflect path information. We present Predicated Static Single Assignment (PSSA) which uses renaming and introduces Full-Path Predicates to remove false dependences and enable aggressive predicated optimization and instruction scheduling. We demonstrate the usefulness of PSSA for Predicated Speculation and Control Height Reduction. These two predicated code optimizations used during instruction scheduling reduce the dependence length of the critical paths through a predicated region. Our results show that using PSSA to enable speculation and control height reduction reduces execution time by 12% to 68%.
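
The predicated execution mechanism described in this abstract can be made concrete with a toy interpreter. This is a minimal sketch under stated assumptions, not PSSA and not real IA-64 semantics: the instruction tuple format, the register names, and the always-true predicate `p0` are all invented for illustration. It shows an if/else turned into straight-line code in which both paths are computed and only the instruction guarded by a true predicate commits its result.

```python
import operator


def run_predicated(program, regs):
    """Execute a straight-line list of predicated instructions.

    Each instruction is (guard, dest, op, src1, src2). Every instruction is
    visited in order with no jumps; its result is committed to `dest` only
    when the guard predicate register holds True.
    """
    regs = dict(regs)
    regs["p0"] = True                      # hypothetical always-true predicate
    for guard, dest, op, src1, src2 in program:
        result = op(regs[src1], regs[src2])
        if regs[guard]:                    # commit only on the correct path
            regs[dest] = result
    return regs


# If-converted form of:  r = a - b if a > b else b - a
ABS_DIFF = [
    ("p0", "p1", operator.gt, "a", "b"),   # p1 = (a > b)
    ("p0", "p2", operator.le, "a", "b"),   # p2 = (a <= b), the complement
    ("p1", "r", operator.sub, "a", "b"),   # (p1) r = a - b
    ("p2", "r", operator.sub, "b", "a"),   # (p2) r = b - a
]


def abs_diff(a, b):
    """Absolute difference computed by the branch-free predicated program."""
    return run_predicated(ABS_DIFF, {"a": a, "b": b})["r"]
```

Both subtractions write the same register `r` under complementary predicates, which is the kind of situation where conventional data-flow analysis sees a false dependence and where predicate-aware renaming becomes useful.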

17.

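Research on G-Machine techniques in functional language compilers

CHEN Zi-xin 《Digital Community & Smart Home》2008,(20)

Functional programming languages offer concise programs and are amenable to reasoning and correctness proofs. Abstract machine techniques convert the reduction-based computation of functional languages into the state-transition computation of conventional architectures, and they are the core of functional language compilation. This paper starts from the graph reduction machine model of the Spineless G-Machine and improves on it, adding closures and constructing fully lazy expressions, to obtain an abstract machine model that is easier to understand and to optimize. On this model, techniques such as an extended MKAP instruction and G-code peephole optimization are used to improve the efficiency of the abstract machine.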

18.

Compilation Techniques for High Level Parallel Code

Benedict R. Gaster Tim Bainbridge David Lacey David Gardner 《International journal of parallel programming》2010,38(1):4-18

This paper describes methods to adapt existing optimizing compilers for sequential languages to produce code for parallel processors. In particular it looks at targeting data-parallel processors using SIMD (single instruction multiple data) or vector processors where users need features similar to high-level control flow across the data-parallelism. The premise of the paper is that we do not want to write an optimizing compiler from scratch. Rather, a method is described that allows a developer to take an existing compiler for a sequential language and modify it to handle SIMD extensions. As well as modifying the front-end, the intermediate representation and the code generation to handle the parallelism, specific optimizations are described to target the architecture efficiently.

19.

High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study

Arslan Munir Farinaz Koushanfar Ann Gordon-Ross Sanjay Ranka 《The Journal of supercomputing》2013,66(1):431-487

Technological advancements in the silicon industry, as predicted by Moore’s law, have resulted in an increasing number of processor cores on a single chip, giving rise to multicore, and subsequently many-core architectures. This work focuses on identifying key architecture and software optimizations to attain high performance from tiled many-core architectures (TMAs), an architectural innovation in multicore technology. Although embedded systems design is traditionally power-centric, there has been a recent shift toward high-performance embedded computing due to the proliferation of compute-intensive embedded applications. The TMAs are suitable for these embedded applications due to low-power design features in many of these TMAs. We discuss the performance optimizations on a single tile (processor core) as well as parallel performance optimizations, such as application decomposition, cache locality, tile locality, memory balancing, and horizontal communication for TMAs. We elaborate compiler-based optimizations that are applicable to TMAs, such as function inlining, loop unrolling, and feedback-based optimizations. We present a case study with optimized dense matrix multiplication algorithms for Tilera’s TILEPro64 to experimentally demonstrate the performance and performance per watt optimizations on TMAs. Our results quantify the effectiveness of algorithmic choices, cache blocking, compiler optimizations, and horizontal communication in attaining high performance and performance per watt on TMAs.
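
Cache blocking, one of the optimizations this abstract evaluates for dense matrix multiplication, can be sketched as follows. This is a generic illustration of the loop structure only, not the paper's TILEPro64 code: the function name `matmul_blocked` and the default block size are assumptions, and pure Python will not show the speedup that blocking gives in compiled code on real caches.

```python
def matmul_blocked(a, b, block=2):
    """Multiply matrices a (n x m) and b (m x p) using loop blocking (tiling).

    The three outer loops walk over block-sized tiles; the three inner loops
    do the multiply-accumulate within one tile, so each tile of `a`, `b`,
    and `c` is reused while it would still be resident in cache.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # min(...) handles edge tiles when a dimension is not a
                # multiple of the block size
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c


product = matmul_blocked([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
```

The block size would normally be chosen so that the working tiles fit in the cache level being targeted; the value 2 here is only small enough to exercise the edge-tile handling in the example.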

20.

What is Itanium Memory Consistency from the Programmer's Point of View?

Lisa Higham LillAnne Jackson Jalal Kawash 《Electronic Notes in Theoretical Computer Science》2007,174(9):63
