Similar Documents
 20 similar documents found (search time: 109 ms)
1.
Binary translation attempts to emulate one instruction set with another on the same or different platforms. This important technique is widely used in modern software. Vector and floating‐point instructions are widely used in many applications, including multimedia, graphics, and gaming. Although these instructions are usually simulated with software in a binary translator, it is important to support them so that the host single‐instruction, multiple‐data (SIMD) and floating‐point hardware is used efficiently during emulation. We report our design and implementation of the emulation of ARM Neon and vector floating point (VFP) instructions in the machine‐code‐to‐low‐level‐virtual‐machine (MC2LLVM) binary translator. The Neon and VFP instructions are first translated into carefully chosen sequences of LLVM intermediate representation (IR), and later, the IR sequences are optimized and translated into the host native binary by the existing LLVM backend. Because MC2LLVM makes use of the vector and floating‐point types in LLVM IR, the generated host native binary can take full advantage of the vector and floating‐point functional units, if present, of the host machine. To be fully compliant with the Neon and VFP instruction sets, all the features are supported, including the flush‐to‐zero mode, default NaN (not a number) mode, and floating‐point exceptions. The experimental results show that code generated by MC2LLVM with the Neon and VFP extensions achieves an average speedup of 1.174× in the SPEC 2006 benchmark suite and 12.05× higher floating‐point throughput in LINPACK, compared with code generated by MC2LLVM without the Neon and VFP extensions. Furthermore, MC2LLVM is 3.36× faster than QEMU for processing Neon/VFP instructions. Copyright © 2016 John Wiley & Sons, Ltd.
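A minimal C sketch of the kind of software modeling involved (the union layout and helper names are hypothetical, not MC2LLVM code): a guest 128-bit Neon Q register emulated lane by lane, plus a flush-to-zero helper of the sort the translator must apply to stay compliant. MC2LLVM itself instead emits LLVM IR vector types and lets the backend select host SIMD instructions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical software model of one ARM Neon Q register (128 bits). */
typedef union {
    uint8_t  b[16];
    uint32_t w[4];
    float    f[4];
} neon_q_t;

/* Lane-wise emulation of VADD.I32 Qd, Qn, Qm (four 32-bit adds). */
static void emu_vadd_i32(neon_q_t *qd, const neon_q_t *qn, const neon_q_t *qm)
{
    for (int i = 0; i < 4; i++)
        qd->w[i] = qn->w[i] + qm->w[i];
}

/* Flush-to-zero: denormal values are replaced with signed zero. */
static float ftz(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    /* A denormal has a zero exponent field but a non-zero mantissa. */
    if ((bits & 0x7F800000u) == 0 && (bits & 0x007FFFFFu) != 0)
        bits &= 0x80000000u;            /* keep only the sign bit */
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```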

2.
Single-Instruction Multiple-Data (SIMD) instructions provide an inexpensive way to exploit the data-level parallelism in multimedia applications. However, the performance improvement obtained by employing SIMD instructions is often limited because many overhead instructions are frequently required to bring data into a form amenable to SIMD processing. In this paper, we employ two techniques to overcome this limitation. The first technique, extended subwords, uses four extra bits for every byte in a media register. This allows many SIMD operations to be performed without overflow and avoids packing/unpacking conversion overhead. The second technique, the Matrix Register File (MRF), allows flexible row-wise as well as column-wise access to the register file. It is useful for many two-dimensional multimedia algorithms such as the (inverse) discrete cosine transform, the 2 × 2 Haar transform, and pixel padding. In addition, we propose a few new media instructions. Experimental results obtained by extending the SimpleScalar toolset show that these techniques improve performance by up to a factor of 4.5 compared with a conventional SIMD instruction set extension.
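A scalar C illustration of the extended-subwords idea, assuming 8-bit pixel data and modeling the 12-bit subwords with ordinary 16-bit lanes: because each lane has headroom beyond 8 bits, the intermediate sum never overflows and no packing/unpacking step is needed.

```c
#include <stdint.h>

/* Illustrative only: a media register with extended subwords is modeled
 * here as eight 16-bit lanes holding 8-bit pixel data.                   */
#define LANES 8

/* Rounded average of two pixel rows.  With plain 8-bit subwords the sum
 * a+b can overflow and the data would first have to be unpacked to 16
 * bits; with extra bits per byte the add fits and no unpacking is needed. */
static void avg_rows(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    int16_t ext[LANES];                     /* stands in for extended subwords */
    for (int i = 0; i < LANES; i++)
        ext[i] = (int16_t)(a[i] + b[i]);    /* max 510, no overflow possible  */
    for (int i = 0; i < LANES; i++)
        out[i] = (uint8_t)((ext[i] + 1) >> 1);   /* rounded average */
}
```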

3.
Control architectures based on emotions are becoming promising solutions for the implementation of future robotic systems. The basic controllers of this architecture are the emotional processes that decide which behaviors the robot must activate to fulfill its objectives. The number of emotional processes grows (to hundreds of millions per second) with the complexity level of the application, so a single main processor lacks the processing capacity to solve the complex problems. Fortunately, the potential parallelism of emotional processes permits their execution in parallel, providing the computing power to tackle complex dynamic problems. In this paper, graphics processing units (GPUs), multicore processors, and single instruction multiple data (SIMD) instructions are used to provide parallelism for the emotional processes. Different GPUs, multicore processors, and SIMD instruction sets are evaluated and compared to analyze their suitability for robotic applications. The applications are set up taking into account different environmental conditions, robot dynamics, and emotional states. Experimental results show that, despite the bottleneck in data transmission between the host and the device, the evaluated GTX 670 GPU provides more than one order of magnitude higher performance than the initial implementation of the architecture on a single core. Thus, all of the complex proposed application problems can be solved using GPU technology, in contrast to the first prototype, where only 55% of them could be solved. Using AVX SIMD instructions, the performance of the architecture increases by a factor of 3.25 relative to the first implementation, so about 88.8% of the 27 proposed applications are solved. With SSE SIMD instructions, the performance is almost doubled and the robot can solve about 74% of the proposed application problems. The use of AVX and SSE SIMD instructions provides almost the same performance as a quad-core and a dual-core, respectively, with the advantage that they do not add any additional hardware cost.
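A hedged sketch of the SIMD side of such an evaluation; the per-process update rule below is invented for illustration and is not the paper's emotional-process model. AVX handles eight single-precision values per instruction where SSE handles four, which is where the roughly two-to-one gap between the AVX and SSE results above comes from.

```c
#include <immintrin.h>

/* Hypothetical per-process activation update, act = act + gain * stimulus,
 * applied to n emotional processes with AVX (eight floats per operation). */
static void update_activations_avx(float *act, const float *stim,
                                   float gain, int n)
{
    __m256 g = _mm256_set1_ps(gain);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 a = _mm256_loadu_ps(act + i);
        __m256 s = _mm256_loadu_ps(stim + i);
        a = _mm256_add_ps(a, _mm256_mul_ps(g, s));
        _mm256_storeu_ps(act + i, a);
    }
    for (; i < n; i++)                  /* scalar tail */
        act[i] += gain * stim[i];
}
```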

4.
With the rapid development of processors, building a software ecosystem for RISC-V has become one of the key factors in its gaining a firm foothold in the processor market. Binary translation is a key technique for resolving binary-code compatibility between processors and for buying time while a processor ecosystem is built, but binary translators struggle to produce efficiently executing binary code at low power and area cost, which has kept them from being widely used in the embedded domain. To balance execution efficiency against power and area overhead, this work uses hardware logic to accelerate the conditionally executed instructions, flag-updating instructions, and barrel-shift instructions of ARMv7-M, and uses a static binary translator to perform IT-block splitting, address recomputation, and instruction mapping on ARMv7-M programs before generating RISC-V binary code, thereby supporting the full range of ARMv7-M instructions. A processor core supporting ARMv7-M was designed on top of the open-source CV32E40P core. The results show that the average performance when running ARMv7-M programs reaches 137% of that of running RISC-V programs directly, and that the core runs ARMv7-M programs 5.59 times faster than supporting ARMv7-M through purely software binary translation.
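For illustration, a C model (hypothetical names, not the paper's RTL or translator code) of the NZCV flag computation that a purely software translator would have to emit after every flag-setting ADDS; the paper moves exactly this kind of work into hardware logic in the modified CV32E40P core.

```c
#include <stdint.h>

/* Software model of the ARM NZCV flags for a flag-setting 32-bit ADDS. */
typedef struct { uint8_t n, z, c, v; } nzcv_t;

static uint32_t emu_adds(uint32_t a, uint32_t b, nzcv_t *f)
{
    uint32_t r = a + b;
    f->n = r >> 31;                              /* negative           */
    f->z = (r == 0);                             /* zero               */
    f->c = (r < a);                              /* unsigned carry out */
    f->v = ((~(a ^ b) & (a ^ r)) >> 31) & 1;     /* signed overflow    */
    return r;
}
```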

5.
Instruction-set simulators play an important role in pre-silicon software development for ASIP processors, but simulators based on traditional simulation methods are slow. Building on binary instrumentation, this work proposes a hybrid instruction-set simulation method for ASIP processors: base instructions run directly on the host machine and only the extension instructions are simulated, reducing simulation overhead and raising simulation speed. Experiments show that simulating mainstream HD audio/video decoding software with this method reaches an average speed of 1058.5 MIPS, 34.7 times the speed of a simulator based on current state-of-the-art dynamic binary translation.
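A skeletal C sketch of the hybrid-simulation idea under stated assumptions (the types and callback names are hypothetical): translated base code runs as native host functions, and only extension instructions fall back to a software handler.

```c
#include <stdint.h>

/* Hypothetical hybrid-simulation skeleton: blocks containing only base
 * instructions have been turned into host functions and run at native
 * speed; when an ASIP extension instruction is reached, control drops
 * into a software handler that interprets just that one instruction.   */
typedef struct { uint32_t reg[32]; uint32_t pc; } cpu_state_t;

typedef void (*native_block_fn)(cpu_state_t *);             /* base insns   */
typedef void (*ext_insn_handler)(cpu_state_t *, uint32_t);  /* one ext insn */

static void run_segment(cpu_state_t *s, native_block_fn base_block,
                        uint32_t ext_insn_word, ext_insn_handler simulate)
{
    base_block(s);                 /* base instructions: native execution  */
    simulate(s, ext_insn_word);    /* extension instruction: interpreted   */
}
```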

6.
黄厚华  刘嘉祥  施晓牧 《软件学报》2023,34(8):3853-3869
For the ARMv8.1-M microprocessor architecture, ARM introduced a technology based on the M-Profile Vector Extension, named ARM Helium, which is claimed to raise the machine-learning performance of ARM Cortex-M processors by up to 15 times. With the rapid growth of the Internet of Things, the correctness of microprocessor instruction execution is especially important. The official instruction-set manual, on which chip simulators and on-chip application development rely, is the basic guarantee of program correctness. This paper presents a study of the semantic correctness of the vectorized machine-learning instructions in the official ARMv8.1-M reference manual using the executable semantics framework K Framework. Pseudocode describing the execution of the vectorized machine-learning instructions is automatically extracted from the official ARMv8.1-M reference manual and converted into formal semantic rewrite rules. Using the executable framework provided by K Framework and a set of test cases, the correctness of the arithmetic operations of the machine-learning instructions is verified.
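A small C reference-model sketch, not K Framework syntax and not the manual's pseudocode, of the sort of lane-predicated vector arithmetic (here a Helium/MVE-style VADD over four 32-bit lanes) whose extracted semantics the paper validates.

```c
#include <stdint.h>

/* Reference-model sketch of a predicated vector add over a 128-bit
 * register split into four 32-bit lanes: lanes whose predicate bits are
 * clear keep their old values instead of being written.                 */
static void mve_vadd_ref(uint32_t qd[4], const uint32_t qn[4],
                         const uint32_t qm[4], uint8_t lane_pred /* 4 bits */)
{
    for (int i = 0; i < 4; i++)
        if (lane_pred & (1u << i))
            qd[i] = qn[i] + qm[i];
}
```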

7.
Implementing vectorized execution through a clustered structure is an efficient and flexible architectural choice. In the compiler's intermediate representation, vector instructions appear interleaved with scalar instructions. The particular way vectorization is realized on a clustered structure challenges the traditional register-allocation framework. To address this problem, this paper studies the representation of vector instructions, the callee/caller register partitioning, and vector register allocation, and presents allocation methods for global and local vector registers.

8.
董卫宇  王瑞敏  戚旭衍  曾韵 《计算机科学》2015,42(6):189-192, 203
This paper proposes a decode-directed, lightweight optimization technique for dynamic binary translation. The technique extracts high-level semantic information from source instructions during the decode stage, annotates them in context, and then uses the annotations during translation to directly generate optimized target instructions. It can identify the main basic-block-level optimization opportunities in a dynamic binary translation system, removing load/store redundancy, redundancy caused by precise exceptions, and flag-handling redundancy. Tests show that, compared with QEMU, the cross-platform x86 system virtual machine ARCH-BRIDGE using this optimization reduces translation overhead by 53%, translation-block size by 78%, and the numbers of load and store operations by 50% and 21%, respectively.
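One of the redundancies named above, flag handling, is commonly attacked with "lazy flags"; the C sketch below (hypothetical names) shows the general idea of recording the last operation and materializing flags only on demand, whereas the paper removes the redundancy at translation time using decode-stage annotations.

```c
#include <stdint.h>

/* Instead of computing x86 EFLAGS after every arithmetic instruction,
 * record the last flag-producing operation and compute individual flags
 * only if a later instruction actually reads them.                      */
enum lazy_op { LAZY_ADD32, LAZY_SUB32 };

struct lazy_flags {
    enum lazy_op op;
    uint32_t src1, src2, result;
};

static int lazy_zf(const struct lazy_flags *lf) { return lf->result == 0; }

static int lazy_cf(const struct lazy_flags *lf)
{
    return lf->op == LAZY_ADD32 ? lf->result < lf->src1   /* carry  */
                                : lf->src1   < lf->src2;  /* borrow */
}
```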

9.
A VLIW DSP obtains temporal parallelism through software pipelining and spatial parallelism through instruction clustering. Instruction clustering is essentially a resource-allocation problem. Traditional clustering assumes that each instruction is assigned to a single cluster, but some architectures provide SIMD instructions, and for these architectures traditional clustering algorithms are not fully applicable. The proposed clustering algorithm, based on an evaluation model, clusters SIMD instructions and ordinary instructions sensibly. After clustering, inter-cluster transfer instructions are scheduled and combined into appropriate inter-cluster double-word transfer instructions. Thanks to the introduction of SIMD and inter-cluster double-word transfers, together with better clustering decisions, the overall scheduling latency of the program is reduced. For many digital-signal-processing programs, performance improves by 2-3 times over the unclustered case and by about 7-10% over a register-pressure-based clustering algorithm.

10.
The complexity of software is ever increasing, and it requires more and more computational resources for its execution. A way to satisfy these requirements is the use of vector instructions that operate on fixed-length vectors of data of the same type. A method for representing the vector instructions of one processor architecture in terms of the vector instructions of another architecture during dynamic binary translation is proposed. An implementation of this method covering the translation of vector addition and vector memory accesses increased the performance of the QEMU emulator by a factor greater than three on an artificial example and by 12% on a real-life application.
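A C sketch of the contrast behind the reported speedup, assuming a hypothetical guest register layout: element-by-element emulation of a guest 128-bit integer add versus mapping it onto a single host SSE2 instruction.

```c
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Hypothetical 128-bit guest vector register. */
typedef union { uint32_t w[4]; __m128i v; } guest_vec_t;

/* Slow path: lane-by-lane helper, one host loop per guest instruction. */
static void vadd_scalar(guest_vec_t *d, const guest_vec_t *a, const guest_vec_t *b)
{
    for (int i = 0; i < 4; i++)
        d->w[i] = a->w[i] + b->w[i];
}

/* Mapped path: the guest vector add becomes one host SSE2 instruction. */
static void vadd_mapped(guest_vec_t *d, const guest_vec_t *a, const guest_vec_t *b)
{
    d->v = _mm_add_epi32(a->v, b->v);
}
```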

11.
Introducing the TCG intermediate representation into binary translation enables program migration across multiple target platforms and makes it easier to bring in new platforms, solving their compatibility with mainstream platforms. However, because the original intermediate representation weakens the coupling between instructions during translation, the generated backend code contains many redundant instructions, hurting the execution efficiency of the translated program. This paper analyzes the feasibility of instruction optimization and optimizes conditional jump instructions: instruction preprocessing improves the intermediate representation, turning IR-to-backend code generation from a one-to-many into a many-to-many translation mode. Using instruction reduction, corresponding optimized translation algorithms are designed for the two conditional-jump patterns, CMP-JX and TEST-JX, and implemented on the open-source binary platform QEMU. Tests on the NPB-3.3 and SPEC CPU 2006 benchmark suites show that, compared with the previous translation mode, the code expansion rate of the optimized code drops by 14.62% on average and the translated programs run 17.23% faster, validating the effectiveness of the optimization.
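An illustrative C view (not TCG code) of the CMP-JX fusion: the naive one-to-one translation derives the flags from the CMP and then re-tests them for the Jcc, while the fused many-to-many translation branches on the comparison directly.

```c
#include <stdint.h>

/* Naive translation of "CMP a,b; JL taken" : materialize SF and OF,
 * then evaluate the JL condition SF != OF.                            */
static uint32_t naive_cmp_jl(int32_t a, int32_t b,
                             uint32_t taken_pc, uint32_t fallthru_pc)
{
    uint32_t r  = (uint32_t)a - (uint32_t)b;             /* CMP result   */
    int sf = (int32_t)r < 0;                             /* sign flag    */
    int of = (int32_t)((a ^ b) & (a ^ (int32_t)r)) < 0;  /* overflow flag*/
    return (sf != of) ? taken_pc : fallthru_pc;
}

/* Fused translation: a single signed compare-and-branch, no flags. */
static uint32_t fused_cmp_jl(int32_t a, int32_t b,
                             uint32_t taken_pc, uint32_t fallthru_pc)
{
    return (a < b) ? taken_pc : fallthru_pc;
}
```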

12.
《Parallel Computing》2013,39(10):586-602
Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code that mix data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor incorporates single-instruction multiple-data (SIMD) engines into architectures that exploit ILP at compile time, such as the very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length needed to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions means that current SIMD engines have limitations in handling memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generation, and loop branches, are required to aid the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long-vector computation as well as efficient SIMD hardware for the real computation. Our new architecture is based on TTA and is called the multimedia coprocessor (MCP). This architecture includes the following features: (1) a simple coprocessor structure with an 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long-vector capabilities built upon the existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that the MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively.

13.
Building on the cross-platform system virtual machine prototype ARCH-BRIDGE, this paper proposes a basic-block-level optimization method for dynamic binary translation. Through two-phase translation, optimized translation based on virtual registers, and delayed commit of machine state, it effectively removes the redundancy introduced by binary translation without using an intermediate representation and while preserving precise exceptions. Tests show that the optimized ARCH-BRIDGE clearly outperforms QEMU in translation overhead while greatly reducing translation-block size and translation redundancy, and that the performance of SPEC CPU2006, NBENCH, and OS booting all improves significantly.
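A C sketch of the delayed machine-state commit idea, with hypothetical structure names: inside one translated block the guest registers live in host temporaries and the architectural state is written back once at block exit rather than after every guest instruction.

```c
#include <stdint.h>

struct guest_cpu { uint32_t r[16]; uint32_t pc; };

/* What one translated basic block might look like as host code. */
static void translated_block_example(struct guest_cpu *cpu)
{
    /* load the registers this block uses into locals (host registers) */
    uint32_t r0 = cpu->r[0], r1 = cpu->r[1], r2 = cpu->r[2];

    r0 = r1 + r2;        /* guest: add r0, r1, r2 */
    r1 = r0 << 2;        /* guest: lsl r1, r0, #2 */
    r2 = r1 - r0;        /* guest: sub r2, r1, r0 */

    /* single commit of architectural state at the end of the block */
    cpu->r[0] = r0; cpu->r[1] = r1; cpu->r[2] = r2;
    cpu->pc += 12;
}
```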

14.
D. Lemire  L. Boytsov 《Software》2015,45(1):1-29
In many important applications—such as search engines and relational database systems—data are stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and single‐instruction, multiple‐data (SIMD) instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD‐BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint‐G8IU and PFOR). At the same time, SIMD‐BP128 saves up to 2 bits/int. For even better compression, we propose another new vectorized scheme (SIMD‐FastPFOR) that has a compression ratio within 10% of a state‐of‐the‐art scheme (Simple‐8b) while being two times faster during decoding. © 2013 The Authors. Software: Practice and Experience Published by John Wiley & Sons, Ltd.
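A scalar C sketch of the binary-packing layout that SIMD-BP128 builds on, assuming every value in the block fits in b bits; the published scheme packs 128 integers at a time with vectorized kernels, which this loop does not attempt to reproduce.

```c
#include <stdint.h>

/* Pack n integers, each known to fit in b bits (0 < b <= 32), into a
 * dense 32-bit word stream.  out must hold at least (n*b + 31)/32 words. */
static void pack_block(const uint32_t *in, int n, int b, uint32_t *out)
{
    uint32_t mask = (b < 32) ? (1u << b) - 1 : 0xFFFFFFFFu;
    uint64_t acc = 0;      /* bit accumulator     */
    int bits = 0;          /* bits currently held */
    int o = 0;
    for (int i = 0; i < n; i++) {
        acc |= (uint64_t)(in[i] & mask) << bits;
        bits += b;
        while (bits >= 32) {
            out[o++] = (uint32_t)acc;
            acc >>= 32;
            bits -= 32;
        }
    }
    if (bits > 0)
        out[o] = (uint32_t)acc;   /* flush the final partial word */
}
```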

15.
In dynamic binary translation, when the target platform has no floating-point unit and does not support floating-point operations, floating-point instructions can only be simulated by interpretation, which drastically lowers the efficiency of the translation system. By converting floating-point operations into fixed-point operations, this work solves the translation of floating-point instructions on such targets and opens a new path for translating them. Experiments in a dynamic binary translation system verify the feasibility of the method and show a clear performance improvement: the higher the proportion of floating-point instructions, the higher the speedup, reaching 1.55 for programs containing 25% floating-point instructions.
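A minimal C sketch of the conversion, assuming a Q16.16 fixed-point format chosen purely for illustration (the paper's format and rounding rules may differ): floating-point add and multiply become integer operations that an FPU-less target can execute directly.

```c
#include <stdint.h>

/* Q16.16 fixed point: 16 integer bits, 16 fraction bits. */
typedef int32_t q16_16_t;

static q16_16_t q_from_int(int32_t x)         { return x << 16; }   /* |x| < 32768 */
static q16_16_t q_add(q16_16_t a, q16_16_t b) { return a + b; }
static q16_16_t q_mul(q16_16_t a, q16_16_t b)
{
    /* 64-bit intermediate keeps the full product before rescaling. */
    return (q16_16_t)(((int64_t)a * b) >> 16);
}
```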

16.
Optimization of a Binary Translator Based on the Loongson Processor
Binary translation is the main approach to system migration, but software-only binary translation on general-purpose platforms performs poorly. Taking the Loongson 2F processor as the implementation platform, this paper presents and optimizes a QEMU binary translator, covering both the compilation environment and the translator itself; the latter mainly involves direct register mapping and improved handling of multimedia instructions. Experimental results show that the register-mapping optimization yields a speedup of 1.45, and that with the multimedia optimization, multimedia programs reach 80% of native execution performance.

17.
A system emulator creates a complete virtual computer environment by emulating hardware resources such as the processor, memory, and peripherals; it supports running and debugging software for different architectures and can greatly shorten the cross-architecture software development cycle. The debugging module of an emulator usually provides instruction tracing, recording the instruction sequence of a program run for further analysis such as run-time estimation, behavior-pattern analysis, and hardware/software co-simulation. The mainstream RISC-V-capable emulators QEMU and Spike both offer instruction tracing, but their time and space overheads are too large and they become inefficient for larger applications. This paper proposes a QEMU-based instruction tracing technique that decouples static information such as basic blocks and the control-flow graph from dynamic information such as branch choices, tracing the execution sequence efficiently while keeping the instruction sequence faithful. Compared with QEMU's native instruction tracing, the proposed technique reduces time overhead by more than 80% and space overhead by more than 95% on average. In addition, targeting the RISC-V architecture, several kinds of offline analysis of instruction sequences are implemented, including instruction-category statistics, program hotspot marking, and behavior-pattern analysis.
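A C sketch of the decoupled trace format under stated assumptions (the record layout is hypothetical): the static block table and control-flow graph are dumped once, and at run time only a block identifier plus the branch decision is appended, from which the full instruction sequence can be rebuilt offline.

```c
#include <stdint.h>
#include <stdio.h>

/* One dynamic trace record: which static basic block ran and which
 * successor was chosen.  The static block table and CFG live elsewhere. */
struct trace_event {
    uint32_t block_id;      /* index into the statically dumped block table */
    uint8_t  branch_taken;  /* branch decision at the end of the block      */
};

static void emit_event(FILE *trace, uint32_t block_id, int taken)
{
    struct trace_event ev = { block_id, (uint8_t)(taken != 0) };
    fwrite(&ev, sizeof ev, 1, trace);
}

/* Offline, the full instruction sequence is reconstructed by walking the
 * recorded block ids through the saved control-flow graph.               */
```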

18.
Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single‐instruction multiple‐data (SIMD) instructions and thread‐level parallelism. In this paper, we propose a new high‐performance sorting algorithm, called aligned‐access sort (AA‐sort), that exploits both the SIMD instructions and thread‐level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in‐core sorting phase and an out‐of‐core merging phase. The in‐core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out‐of‐core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA‐sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA‐sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32‐bit integers. Also, a parallel version of AA‐sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms. Copyright © 2011 John Wiley & Sons, Ltd.
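A hedged illustration of the kind of branch-free SIMD primitive such an in-core phase is built from, written here with SSE4.1 intrinsics even though AA-sort was evaluated on PowerPC 970MP and Cell SIMD: one compare-exchange step over two vectors of four 32-bit integers.

```c
#include <smmintrin.h>   /* SSE4.1: _mm_min_epi32 / _mm_max_epi32 */

/* Lane-wise compare-exchange of two aligned vectors: after the call, *lo
 * holds the lane minima and *hi the lane maxima, with no branches.      */
static void vector_cmp_exchange(__m128i *lo, __m128i *hi)
{
    __m128i a = *lo, b = *hi;
    *lo = _mm_min_epi32(a, b);
    *hi = _mm_max_epi32(a, b);
}
```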

19.
Taking the dynamic binary emulator QEMU as the platform, this work analyzes the application of dynamic binary translation in emulator development, studies QEMU's translation mechanism, optimization strategies, and key techniques, and dissects the relevant core code. The performance of the emulated CPU is tested, and by combining per-stage test results, the key stages that limit emulated CPU performance are identified, providing a reference for subsequent optimization work.

20.
Using the RNGAVXLIB random number generator library as an example, this paper considers some approaches to employing AVX vectorization for speeding up computation. The RNGAVXLIB library contains AVX implementations of modern generators and routines allowing one to initialize up to 10^19 independent random number streams. The AVX implementations yield exactly the same pseudorandom sequences as the original algorithms do, while being up to 40 times faster than the ANSI C implementations.
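A hypothetical illustration of the approach, not RNGAVXLIB code: eight independent xorshift32 streams advanced in lockstep with AVX2, so one call yields eight 32-bit values. RNGAVXLIB vectorizes different, higher-quality generators while reproducing their scalar sequences exactly.

```c
#include <immintrin.h>

/* Advance eight xorshift32 states at once (each lane must start nonzero);
 * the returned vector is both the new state and the next eight outputs.   */
static __m256i xorshift32x8(__m256i state)
{
    state = _mm256_xor_si256(state, _mm256_slli_epi32(state, 13));
    state = _mm256_xor_si256(state, _mm256_srli_epi32(state, 17));
    state = _mm256_xor_si256(state, _mm256_slli_epi32(state, 5));
    return state;
}
```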
