1.

Background optimization in full system binary translation

R. A. Sokolov A. V. Ermolovich 《Programming and Computer Software》2012,38(3):119-126

Binary translation and dynamic optimization are widely used to provide compatibility between legacy and promising upcoming architectures at the level of executable binary code. Dynamic optimization is one of the key contributors to the performance of a dynamic binary translation system. At the same time, it can be a major source of overhead, both in CPU cycles and in whole-system latency, as long as optimization time is included in the execution time of the application under translation. One way to eliminate dynamic optimization overhead is to perform optimization concurrently with execution, in a separate thread. In this paper we present an implementation of this technique in a full-system dynamic binary translator. For this purpose, an infrastructure for multithreaded execution was implemented in the binary translation system. This allows dynamic optimization to run in a separate thread, independently of and concurrently with the main thread that executes the binary code under translation. Depending on the computational resources available, this is achieved either by interleaving the two threads on a single processor core or by moving the optimization thread to an underutilized processor core. In the first case, the latency introduced into the system by computationally intensive dynamic optimization is reduced. In the second case, overlapping the execution and optimization threads also eliminates optimization time from the total execution time of the original binary code.
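The idea of moving optimization off the execution path can be illustrated with a minimal Python sketch. It is not the translator described in the abstract: the `Translator` class, the `HOT_THRESHOLD` value, and the string stand-ins for translated code are all assumptions made for illustration. The execution thread hands a hot block to a worker thread through a queue and keeps running the baseline version until the optimized one is installed.

```python
import queue
import threading

HOT_THRESHOLD = 3  # hypothetical: executions before a block is considered hot


class Translator:
    """Toy model: the main thread executes blocks, a worker thread optimizes."""

    def __init__(self):
        self.code_cache = {}    # block address -> "baseline" or "optimized"
        self.exec_counts = {}   # block address -> number of executions
        self.pending = queue.Queue()
        self.lock = threading.Lock()
        self.worker = threading.Thread(target=self._optimize_loop, daemon=True)
        self.worker.start()

    def _optimize_loop(self):
        # Runs concurrently with execute(); None is the shutdown signal.
        while True:
            addr = self.pending.get()
            if addr is None:
                self.pending.task_done()
                break
            # Stand-in for an expensive optimization pass.
            with self.lock:
                self.code_cache[addr] = "optimized"
            self.pending.task_done()

    def execute(self, addr):
        # Return the version of the block that this execution used.
        with self.lock:
            version = self.code_cache.setdefault(addr, "baseline")
            count = self.exec_counts.get(addr, 0) + 1
            self.exec_counts[addr] = count
        if count == HOT_THRESHOLD:
            self.pending.put(addr)  # hand off without waiting for the result
        return version

    def drain(self):
        # Wait until every queued optimization request has been processed.
        self.pending.join()

    def shutdown(self):
        self.pending.put(None)
        self.worker.join()
```

In this sketch the execution path never waits for the optimizer; whether the two threads share a core or run on separate cores is left to the operating system scheduler.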

2.

Optimizing basic block overlap redundancy in dynamic binary translation

李骏 管海兵 李增祥 梁阿磊 《计算机工程》2007,33(22):60-62

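Dynamic binary translation usually uses the basic block as the basic unit of translation and execution. When basic blocks are partitioned during dynamic translation, overlap redundancy arises: the block currently being translated may be a subset of an already translated basic block, or may contain one, which increases translation overhead. From the perspective of optimizing dynamic binary translation, this paper detects and eliminates the overhead caused by basic block overlap redundancy. Experiments show that the basic block overlap rate during dynamic binary translation is about 5%, and that eliminating this redundancy improves translation and execution performance by 1% to 4%.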
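The two overlap cases named in this abstract (a new block that is a subset of a translated block, or that contains one) can be sketched as an address-range check. This is only an illustration under assumptions, not the paper's detection method: blocks are modeled as half-open `(start, end)` address ranges, and the function names are hypothetical.

```python
def classify_overlap(new_block, translated):
    """Compare a new block against already translated blocks.

    Blocks are (start, end) half-open address ranges (an assumption of this
    sketch). Returns (kind, matching_block) or (None, None).
    """
    n_start, n_end = new_block
    for start, end in translated:
        if (start, end) == (n_start, n_end):
            return "duplicate", (start, end)
        if start <= n_start and n_end <= end:
            return "subset", (start, end)      # new block lies inside an old one
        if n_start <= start and end <= n_end:
            return "superset", (start, end)    # new block contains an old one
    return None, None


def overlap_rate(blocks):
    """Fraction of blocks, in translation order, that overlap an earlier block."""
    translated = []
    overlapping = 0
    for block in blocks:
        kind, _ = classify_overlap(block, translated)
        if kind is not None:
            overlapping += 1
        translated.append(block)
    return overlapping / len(blocks) if blocks else 0.0
```

A translator could use such a check to reuse the tail of an existing translation instead of translating the same instructions again; how the reuse is done is outside this sketch.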

3.

Accelerating application behavior characterization using dynamic binary translation

赵天磊 唐遇星 付桂涛 贾小敏 齐树波 张民选 《计算机研究与发展》2012,49(1):35-43

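SimPoint is an important method for analyzing the typical runtime behavior of applications, but generating the basic block vector profile (BBV profile) files that SimPoint requires is very time-consuming. This paper first proposes DBT-BBV, a general framework that uses dynamic binary translation to generate BBV profiles, then analyzes in detail several optimization techniques for reducing overhead, and finally designs and implements QPoint, an efficient BBV profile collection tool based on DBT-BBV and the proposed optimizations. The performance and overhead of the proposed optimizations and of QPoint are evaluated with the SPEC2006 benchmark suite. Compared with existing tools, QPoint has two advantages: (1) QPoint outperforms existing tools, reaching a peak speed of 292 MIPS and an average speed of 109 MIPS on an ordinary PC, with an average BBV profile collection overhead below 4%, the lowest among comparable tools; (2) QPoint supports many architectures, including x86/x86_64, ARM, POWER, SPARC, and MIPS, and can collect BBV profiles across instruction sets. The results show that dynamic binary translation is very effective at accelerating application behavior characterization.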

4.

Research on data prefetching optimization in dynamic binary translation

罗琼程吴强《计算机应用研究》2009,26(12):4572-4576

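Dynamic optimization is a very important topic in dynamic binary translation research, and data prefetching optimization can improve application performance on modern processor architectures. Superblock-based dynamic data prefetching uses software instrumentation to collect load memory-access latency information for the application and to construct superblocks; it then applies prefetching to high-latency load instructions based on the latency information and on the register definition-use relationships obtained from superblock data-flow analysis. Experiments on the 龙芯 (Loongson) DigitalBridge dynamic binary translation system show that data prefetching improves the average performance of translated SPEC2000 floating-point benchmark code by 3.3%, with overhead far below 0.5%.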

5.

Binary compatibility for embedded systems using greedy subgraph mapping

CHEN XuHao SHEN Li WANG ZhiYing ZHENG Zhong CHEN Wei 《中国科学:信息科学(英文版)》2014,57(7):1-16

We propose a novel lightweight code generation algorithm, GSM (Greedy Subgraph Mapping), which can generate compact code with low overhead using many-to-one mapping. GSM is implemented and evaluated in a dynamic binary translation prototype system called TransARM. Experimental results demonstrate that GSM generates higher-quality target code than a conventional implementation, bringing the average code expansion rate close to 1.3 for the 11 selected benchmarks. Moreover, GSM incurs only slight extra overhead and a negligible slowdown of translation, and enables a 10% performance improvement in target code execution.

6.

Research on multi-instruction-set processor technology compatible with ARM Thumb instructions

白创 陈益如 童元满 《计算机应用研究》2023,40(11)

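With the rapid development of processors, building the RISC-V software ecosystem has become one of the key factors for RISC-V to gain a firm foothold in the processor market. Binary translation is one of the key technologies for solving binary code compatibility problems and for buying time to build a processor ecosystem, but because binary translators struggle to produce efficiently executing binary code at low power and area cost, they cannot be widely used in the embedded domain. To address the difficulty of balancing execution efficiency against power and area overhead, this work uses hardware logic acceleration to handle the conditional execution instructions, flag-updating instructions, and barrel shifter instructions of ARMv7-M, and uses a static binary translator to perform IT block splitting, address recalculation, and instruction mapping on ARMv7-M programs to generate RISC-V binary code, thereby supporting all classes of ARMv7-M instructions. A processor core supporting ARMv7-M was designed on the basis of the open-source CV32E40P core. Results show that the average performance of running ARMv7-M programs reaches 137% of that of running RISC-V programs directly, and that the core achieves a 5.59-fold performance improvement on ARMv7-M programs compared with supporting ARMv7-M through pure software binary translation.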

7.

Composing high-performance schedulers: a case study from real-time simulation

Kaushik Ghosh Richard M. Fujimoto Karsten Schwan 《Concurrency and Computation》1999,11(5):221-245

Dynamic, high-performance or real-time applications require scheduling latencies and throughput not typically offered by current kernel or user-level thread schedulers. Moreover, it is widely accepted that it is important to be able to specialize scheduling policies for specific target applications and their execution environments. This paper presents one solution to the construction of such high-performance, application-specific thread schedulers. Specifically, scheduler implementations are composed from modular components, where individual scheduler modules may be specialized to underlying hardware characteristics or implement precisely the mechanisms and policies desired by application programs. The resulting user-level schedulers' implementations can provide resource guarantees by interaction with kernel-level facilities which provide means of resource reservation. This paper demonstrates the concept of composable schedulers by construction of several compositions for highly dynamic target applications, where low scheduling latencies are critical to application performance. Claims about the importance and effectiveness of scheduler composition are validated experimentally on a shared-memory multiprocessor. Scheduler compositions are optimized to take advantage of different low-level hardware attributes and of knowledge about application requirements specific to certain applications, including a Time Warp-based real-time discrete event simulator. Experimental evaluations are based on synthetic workloads, on a real-time simulation blending simulated with implemented control system components, and on a dynamic robot control program. Measurements indicate that schedulers can be composed and specialized to offer performance similar to that of dedicated scheduling co-processors. Copyright © 1999 John Wiley & Sons, Ltd.

8.

CoDBT: A multi-source dynamic binary translator using hardware–software collaborative techniques

Haibing Guan Bo Liu Zhengwei Qi Yindong Yang Hongbo Yang Alei Liang 《Journal of Systems Architecture》2010,56(10):500-508

When implementing a dynamic binary translation system, traditional software-based solutions suffer from significant runtime overhead and are not suitable for particularly complex optimization. This paper proposes using hardware–software collaboration techniques to create a highly efficient dynamic binary translation system, CoDBT, which emulates several heterogeneous ISAs (Instruction Set Architectures) on a host processor without changes to the existing processor. We analyze the major performance bottlenecks by evaluating the overhead of a pure software DBT. Guidelines are provided for applying a suitable hardware–software partition process to CoDBT, as are algorithms for designing a hardware-based binary translator and code cache management. An intermediate instruction set is introduced to make multi-source translation more practicable and scalable. Meanwhile, a novel runtime profiling strategy is integrated into the infrastructure to collect program hot-spot information to support potential future optimizations. The advantages of using co-design as an implementation approach for a DBT system are assessed with several SPEC benchmarks. Our results demonstrate that significant performance improvements can be achieved with appropriate hardware support choices. CoDBT could be an efficient and cost-effective solution for situations where the usual methods of performance acceleration for dynamic binary translation are inappropriate.

9.

Performance analysis and evaluation of the dynamic binary translator CrossBit

官孝峰 梁阿磊 《计算机工程与应用》2008,44(27):91-94

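Dynamic binary translation is a binary code translation technique widely used in virtual machine systems. Supported by optimizations such as code caching, native execution, code block chaining, and dynamic hot path generation, it achieves high performance. CrossBit is a multi-source, multi-target dynamic binary translation system. By studying the performance of the CrossBit binary translator, this paper analyzes several problems that must be solved to improve dynamic binary translator performance and, through quantitative analysis, summarizes runtime strategies for applying system optimizations under different configurations and workloads.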

10.

A transparent ROP protection mechanism for kernel modules based on IPT hardware

王心然 刘宇涛 陈海波 《软件学报》2018,29(5):1333-1347

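Return-Oriented Programming (ROP) is a popular method for attacking software by exploiting buffer overflow vulnerabilities. It overwrites return addresses on the program stack so that, when the program later executes a return instruction, it jumps to code at a location chosen by the attacker, thereby violating the control flow the program was intended to follow. Control-flow integrity (CFI) checking is currently the most popular defense against ROP: it restricts the legal targets of each control-flow transfer instruction to a set of legal target addresses, preventing an attacker from maliciously altering the program's control flow. Most existing CFI mechanisms protect user-mode programs, yet many attacks against kernel mode have now been disclosed; among them, return-oriented rootkits [1] (ROR) mount ROP attacks inside vulnerable kernel modules to execute arbitrary kernel code. Compared with traditional user-space ROP attacks, ROR attacks are more dangerous. According to Linux CVE statistics, 76% of operating system kernel vulnerabilities in 2014-2016 appeared in kernel modules, and essentially all published attacks occurred in kernel modules. Kernel modules are thus a hot spot for attacks on the kernel and very dangerous. On the other hand, few CFI protection schemes currently target operating system kernels, and the existing ones all depend on recompiling the kernel, which greatly limits where they can be applied. To address these problems, this paper is the first to propose using the Intel Processor Trace (IPT) hardware mechanism, combined with virtualization, to protect kernel modules transparently and effectively and thus defend against ROP attacks on them. Experiments show that the system offers very high protection precision, compatibility, and efficiency.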

11.

Annotation-guided dynamic binary translation and optimization

李剑慧 王昀 黄波 乐永年 刘江宁 叶锦云 《小型微型计算机系统》2007,28(3):558-565

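A dynamic binary translator translates machine code of a source architecture into machine code of a target architecture at run time. This just-in-time compilation technique allows software for the source machine to run fairly efficiently on the target machine without recompilation. However, running source software through a dynamic binary translator is much less efficient than recompiling it for the target machine. Based on a comparative analysis of why the target machine code generated by dynamic translation performs poorly, this paper proposes annotation-guided dynamic binary translation and optimization. Three kinds of annotation information were selected, and the technique was implemented in Intel's commercial dynamic binary translator "IA-32 Execution Layer" and static compiler "Intel(R) Compiler". Experimental results show that the three kinds of annotation information substantially improve the execution efficiency of dynamically translated code.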

12.

Dynamic binary translation optimization based on hot routines

董卫宇 刘金鑫 戚旭衍 何红旗 蒋烈辉 《计算机科学》2016,43(5):27-33, 41

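Based on observations of the behavior of system-level programs, this paper proposes a hot-routine-based optimization method for dynamic binary translation. The method takes frequently executed routines as the unit of optimization and uses intra-block and inter-block optimization algorithms to eliminate the redundancy introduced by dynamic binary translation. Compared with trace-based optimization, it has lower overhead for discovering optimization units, covers larger code regions, and avoids repeated translation, making it better suited to optimizing operating system code in system virtual machines. Tests on the cross-platform system virtual machine monitor ARCH-BRIDGE show that applying the method to kernel code improves the efficiency of SPEC CPU2006 integer programs by 3.5% to 14.4%, with a maximum performance gain of 5.1% over trace-based optimization.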

13.

Polymorphic bytecode instrumentation

Walter Binder Philippe Moret Éric Tanter Danilo Ansaloni 《Software》2016,46(10):1351-1380

Bytecode instrumentation is a widely used technique to implement aspect weaving and dynamic analyses in virtual machines such as the Java virtual machine. Aspect weavers and other instrumentations are usually developed independently, and combining them often requires significant engineering effort, if it is possible at all. In this article, we present polymorphic bytecode instrumentation (PBI), a simple but effective technique that allows dynamic dispatch amongst several, possibly independent instrumentations. PBI enables complete bytecode coverage, that is, any method with a bytecode representation can be instrumented. We illustrate further benefits of PBI with three case studies. First, we describe how PBI can be used to implement a comprehensive profiler of inter-procedural and intra-procedural control flow. Second, we provide an implementation of execution levels for AspectJ, which avoids infinite regression and unwanted interference between aspects. Third, we present a framework for adaptive dynamic analysis, where the analysis to be performed can be changed at runtime by the user. We assess the overhead introduced by PBI and provide thorough performance evaluations of PBI in all three case studies. We show that pure Java profilers like JP2 can, thanks to PBI, produce accurate execution profiles by covering all code, including the core Java libraries. We then demonstrate that PBI-based execution levels are much faster than control flow pointcuts to avoid interference between aspects and that their efficient integration in a practical aspect language is possible. Finally, we report that PBI enables adaptive dynamic analysis tools that are more reactive to user inputs than existing tools that rely on dynamic aspect-oriented programming with runtime weaving. These experiments position PBI as a widely applicable and practical approach for combining bytecode instrumentations. © 2015 The Authors. Software: Practice and Experience Published by John Wiley & Sons Ltd.

14.

A distributed dynamic binary translation system for thin clients

林凌 管海兵 梁阿磊 《计算机工程》2009,35(22):272-274

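Traditional dynamic binary translation systems are not suitable for direct use on thin clients, because most thin clients (such as mobile phones) are resource-constrained, while dynamic binary translation consumes considerable computing and memory resources. To address this problem, this paper proposes a distributed dynamic binary translation system for thin clients, in which a remote server performs the binary translation and the client only needs to execute the translated code. Experimental results on SPEC CPU2000 show that using the system on a thin client delivers higher performance and lower overhead than a traditional dynamic binary translator.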

15.

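Design and implementation of a user-level dynamic binary translation system (total citations: 1; self-citations: 0; citations by others: 1)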

曹宏嘉 俞磊 邓鹍 周兴铭 《计算机工程与科学》2004,26(8):79-82

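This paper presents the design and implementation of a dynamic binary translation system for x86 Linux. The system translates IA-32 user-level integer code to a RISC instruction set, and a simulator executes the target code. The paper describes in detail the system's overall composition, the target architecture simulator, the code translation process, and the execution of translated code.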

16.

Achieving Efficiency and Accuracy in the ALPS Application-level Proportional-share Scheduler

T. Newhouse J. Pasquale 《Journal of Grid Computing》2007,5(2):251-270

We present the design and implementation of ALPS, a per-application user-level proportional-share scheduler. It provides an application with a way to control the relative allocation of CPU time amongst its individual processes. The ALPS scheduler runs as just another process (belonging to the application) at user level; thus, it does not require any special kernel support, nor does it require any special privileges, making it highly portable. To achieve efficiency, ALPS delegates fine-grained time-slicing responsibility to the underlying kernel scheduler, while itself making coarse-grained decisions to achieve proportional-share scheduling, all in a way that is transparent to the underlying kernel. Our results show that the ALPS approach is practical; we can achieve good accuracy (under 5% relative error) and low overhead (under 1% of CPU time), despite user-level operation.
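The split between coarse-grained share decisions and fine-grained time slicing can be sketched with a small simulation. This is a generic proportional-share illustration under assumptions, not the ALPS algorithm: the function names are hypothetical, and an equal split of each quantum among eligible processes stands in for the kernel scheduler.

```python
def eligible_processes(shares, consumed):
    """Coarse-grained decision: which processes may run in the next quantum.

    A process stays eligible while its fraction of the CPU time consumed so
    far does not exceed its fraction of the total shares.
    """
    total_shares = sum(shares.values())
    total_consumed = sum(consumed.values())
    if total_consumed == 0:
        return set(shares)
    return {
        pid
        for pid, share in shares.items()
        # Cross-multiplied form of consumed/total_consumed <= share/total_shares.
        if consumed[pid] * total_shares <= share * total_consumed
    }


def simulate(shares, quanta, quantum=1.0):
    """Run the decision loop; eligible processes split each quantum equally,
    which models the fine-grained slicing left to the kernel scheduler."""
    consumed = {pid: 0.0 for pid in shares}
    for _ in range(quanta):
        runnable = eligible_processes(shares, consumed)
        slice_each = quantum / len(runnable)
        for pid in runnable:
            consumed[pid] += slice_each
    return consumed
```

With shares of 3 and 1, eight quanta of this simulation give the two processes 6.0 and 2.0 units of CPU time, which matches the 3:1 target ratio.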

17.

LDMBL: An architecture for reducing code duplication in heavyweight binary instrumentations

Behnam Momeni Mehdi Kharrazi 《Software》2018,48(9):1642-1659

The emergence of instrumentation frameworks has contributed greatly to software engineering practice. As instrumentation use cases become more complex, the complexity of instrumenting programs also increases, leading to a higher risk of software defects, increased development time, and decreased maintainability. In security applications such as symbolic execution and taint analysis, which need to instrument a large number of instruction types, this complexity is prominent. This paper presents an architecture based on the Pin binary instrumentation framework to abstract the low-level OS and hardware-dependent implementation details, facilitate code reuse in heavyweight instrumentation use cases, and improve instrumenting program development time. Instructions of the x86 and x86-64 hardware architectures are formally categorized using the Z language based on the Pin framework API. This categorization is used to automate the instrumentation phase on the basis of a configuration list. Furthermore, instrumentation context data such as register data are modeled in an object-oriented scheme. This makes it possible to focus the instrumenting program development time on writing the essential analysis logic, while access to low-level OS and hardware dependencies is streamlined. The proposed architecture is evaluated by instrumenting 135 instruction types in a concrete symbolic execution engine, resulting in a reduction of the instrumenting program size by 59.7%. Furthermore, the performance overhead measured against the SPEC CINT2006 programs is limited to 8.7%.

18.

An implementation framework for binary stream pattern extraction on CPU/GPU

章一超 陈凯 梁阿磊 白英彩 管海兵 《计算机应用与软件》2012,(1):113-115,140

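As a processor with a stream architecture, the graphics processing unit (GPU) is now widely used for general-purpose high-performance computing and is no longer limited to image processing. NVIDIA's CUDA and AMD's Stream SDK are both popular stream programming environments for general-purpose computing on GPUs (GPGPU). However, they have their own shortcomings and limitations, chief among them the lack of binary compatibility across different GPUs and the high cost of rewriting the source code of existing programs. Using binary analysis and dynamic binary translation, this paper implements GxBit, an automated execution framework that provides a way to extract stream patterns from x86 binaries and map them to the NVIDIA CUDA programming environment. Validated with several programs from the CUDA SDK Sample and the Parboil Benchmark Suite, the framework achieves an average performance improvement of more than 10x.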

19.

Parallel Heat Kernel Volume Based Local Binary Pattern on Multi-Orientation Planes for Face Representation

Wei Lu Xiaomin Yang Xu Gou Lihua Jian Wei Wu Gwanggil Jeon 《International journal of parallel programming》2018,46(5):943-962

20.

Design of a user-level dynamic binary translation system

吴浩 管海兵 梁阿磊 《计算机应用与软件》2007,24(10):1-3

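This paper presents the structural design of a user-level dynamic binary translation system that performs user-level dynamic translation from ARM to x86. It describes in detail the function of each component, the design difficulties, and the system's run-time operation.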