期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Compilation Techniques for Multimedia Processors 总被引：5，自引：0，他引：5

Andreas Krall Sylvain Lelait 《International journal of parallel programming》2000,28(4):347-361

The huge processing power needed by multimedia applications has led to multimedia extensions in the instruction set of microprocessors which exploit subword parallelism. Examples of these extended instruction sets are the Visual Instruction Set of the UltraSPARC processor, the AltiVec instruction set of the PowerPC processor, the MMX and ISS extensions of the Pentium processors, and the MAX-2 instruction set of the HP PA-RISC processor. Currently, these extensions can only be used by programs written in assembly language, through system libraries or by calling specialized macros in a high-level language. Therefore, these instructions are not used by most applications. We propose two code generation techniques to produce native code using these multimedia extensions for programs written in a high-level language: classical vectorization and vectorization by unrolling. Vectorization by unrolling is simpler than classical vectorization since data dependence analysis is reduced to acyclic control flow graph analysis. Furthermore, we address the problem of unaligned memory accesses. This can be handled by both static analysis and dynamic runtime checking. Preliminary experimental results for a code generator for the UltraSPARC VIS instruction set show that speedups of up to a factor of 4.8 are possible, and that vectorization by unrolling is much simpler but as effective as classical vectorization. 相似文献

2.

面向特定应用的指令集自动扩展

下载免费PDF全文

吕雅帅沈立王志英戴葵《计算机工程与科学》2007,29(6):84-86

面向应用扩展指令集是面向特定应用处理器设计过程的一个重要环节,这一工作的自动实现对于缩短产品开发周期具有非常重要的意义。现有的技术未能实现该过程的完全自动化,而且在选择指令时并没有全面考虑指令对处理器面积和功耗的影响。本文设计并实现了一个面向特定应用的指令集自动扩展系统,该系统不仅可以根据应用特征自动扩展
新指令,而且可以自动完成编译器的修改。模拟结果显示,扩展的新指令能够在保持功耗、面积基本不变的前提下,带来4.7%～16.7%的性能提升。相似文献

3.

An architecture framework for an adaptive extensible processor

Hamid Noori Farhad Mehdipour Kazuaki Murakami Koji Inoue Morteza Saheb Zamani 《The Journal of supercomputing》2008,45(3):313-340

To improve the performance of embedded processors, an effective technique is collapsing critical computation subgraphs as application-specific instruction set extensions and executing them on custom functional units. The problem with this approach is the immense cost and the long times required to design a new processor for each application. As a solution to this issue, we propose an adaptive extensible processor in which custom instructions (CIs) are generated and added after chip-fabrication. To support this feature, custom functional units are replaced by a reconfigurable matrix of functional units (FUs). A systematic quantitative approach is used for determining the appropriate structure of the reconfigurable functional unit (RFU). We also introduce an integrated framework for generating mappable CIs on the RFU. Using this architecture, performance is improved by up to 1.33, with an average improvement of 1.16, compared to a 4-issue in-order RISC processor. By partitioning the configuration memory, detecting similar/subset CIs and merging small CIs, the size of the configuration memory is reduced by 40%. 相似文献

4.

面向专用处理器指令集设计的应用特征分析方法研究与实现

沈弼龙赵鹏陈旭灿李思昆《计算机工程与科学》2009,31(Z1)

专用处理器的指令集设计是专用处理器设计中的关键问题。SoC专用处理器指令集设计有其特殊的程序特征分析需求,迫切需要面向专用指令集设计的程序特征分析工具支持,但当前能够完全支持专用指令集设计的特征分析工具比较少,设计人员仍需人工或同时调用多种分析工具才能来获取所需特征信息,且效率低、结果不够直观,无法迅速有效地对专用指令集设计提供有效的数据支持。本文面向专用处理器指令集设计,研究并实现了一种基于程序中间表示的应用特征自动分析方法,以可视化的方式得到了程序的控制关系特征、计算特征、操作数据特征和核心运算等特征。该工具不仅可支持专用指令集设计,对于SoC编译优化、任务分配等也可以提供简明直观的辅助支持,具有一定通用性,同时提供图形化的结果显示与友好的人机界面,使用简明方便。相似文献

5.

Automatic Design of Application Specific Instruction Set Extensions Through Dataflow Graph Exploration

Clark Nathan Zhong Hongtao Tang Wilkin Mahlke Scott 《International journal of parallel programming》2003,31(6):429-449

General-purpose processors are often incapable of achieving the challenging cost, performance, and power demands of high-performance applications. To meet these demands, most systems employ a number of hardware accelerators to off-load the computationally demanding portions of the application. As an alternative to this strategy, we examine customizing the computation capabilities of a processor for a particular application. The processor is extended with hardware in the form of a set of custom function units and instruction set extensions. To effectively identify opportunities for creating custom hardware, a dataflow graph design space exploration engine heuristically identifies candidate computation subgraphs without artificially constraining their size or shape. The engine combines estimates of performance gain, cost, and inherent limitations of the processor to grow candidate graphs in profitable directions while pruning unprofitable paths. This paper describes the dataflow graph exploration engine and evaluates its effectiveness across a set of embedded applications. 相似文献

6.

A soft multi-core architecture for edge detection and data analysis of microarray images

George Kornaros 《Journal of Systems Architecture》2010,56(1):48-62

As configurable processing advances, elements from the traditional approaches of both hardware and software development can be combined by incorporating customized, application-specific computational resources into the processor’s architecture, especially in the case of field-programmable-gate-array-based systems with soft-processors, so as to enhance the performance of embedded applications. This paper explores the use of several different microarchitectural alternatives to increase the performance of edge detection algorithms, which are of fundamental importance for the analysis of DNA microarray images. Optimized application-specific hardware modules are combined with efficient parallelized software in an embedded soft-core-based multi-processor. It is demonstrated that the performance of one common edge detection algorithm, namely Sobel, can be boosted remarkably. By exploiting the architectural extensions offered by the soft-processor, in conjunction with the execution of carefully selected application-specific instruction-set extensions on a custom-made accelerating co-processor connected to the processor core, we introduce a new approach that makes this methodology noticeably more efficient across various applications from the same domain, which are often similar in structure. With flexibility to update the processing algorithms, an improvement reaching one order of magnitude over all-software solutions could be obtained. In support of this flexibility, an effective adaptation of this approach is demonstrated which performs real-time analysis of extracted microarray data; the proposed reconfigurable multi-core prototype has been exploited with minor changes to achieve almost 5× speedup. 相似文献

7.

Fuzzy logic microcontroller

Costa A. De Gloria A. Giudici F. Olivieri M. 《Micro, IEEE》1997,17(1):66-74

We propose an architecture dedicated mainly to medium-range applications that demand computational power combined with low cost for the resulting hardware system (chip and board). Our architecture is a 16-bit processor with dedicated instructions and hardware for efficient support of fuzzy logic. To make the architecture effective for control applications developed with a traditional approach or with fuzzy logic, we equipped the processor with a microcontroller's general features. Our design accounts for application characteristics to provide efficient hardware support for fuzzy logic. To achieve this we first analyzed fuzzy control algorithms and derived a general model for fuzzy computation. In defining the model, we considered the large spectrum of possible inference methods, fuzzification and defuzzification mechanisms, and the operators used in control applications. On this basis, we defined the instruction set that supports this computational model and a proper architectural solution. We tested the system (composed of the software model and its hardware support) by simulating different sets of general-purpose and fuzzy control benchmarks 相似文献

8.

Challenges to combining general-purpose and multimedia processors

《Computer》1997,30(12):33-37

Multimedia processor media extensions to general purpose processors present new challenges to the compiler writer, language designer, and microarchitect. Multimedia workloads have always held an important role in embedded applications, such as video cards or set top boxes, but these workloads are becoming increasingly common in general purpose computing as well. Over the past three years the major vendors of general purpose processors (GPPs) have announced extensions to their instruction set architectures that supposedly enhance the performance of multimedia workloads. These include North Carolina MAX 2 extensions to Hewlett-Packard PA-RISC, MMX for Intel's x86, UltraSparc's VIS, and MDMX extensions to MIPS V. Merging these new multimedia instructions with existing GPPs poses several challenges. Also, some doubt remains as to whether multimedia extensions are a real development or just a competition induced fad in the GPP industry. If it is indeed a development, how must current processor microarchitectures change in reaction? And if they change, can GPPs and MMPs apply application specific integrated circuit (ASIC) solutions to the same problems? 相似文献

9.

Instruction set extensions for software defined radio

Suman Mamidi Emily Blem Michael J. Schulte John Glossner Daniel Iancu Andrei Iancu Mayan Moudgill Sanjay Jinturkar 《Microprocessors and Microsystems》2009,33(4):260-272

Software defined radios provide programmable solutions for implementing the physical layer processing of multiple communication standards. Mobile devices implementing these standards require high-performance processors to perform high-bandwidth physical layer processing in real time. In this paper, we present instruction set extensions for several important communication algorithms including cyclic redundancy checking, convolutional encoding, Viterbi decoding, turbo decoding, and Reed–Solomon encoding and decoding. We also present hardware designs for implementing these extensions, along with estimates of their area, critical path delay, and power consumption. The performance benefits of these extensions are evaluated using a supercomputer-class vectorizing compiler and the Sandblaster low-power multithreaded processor for software defined radio. The proposed instruction set extensions provide significant performance improvements at relatively low cost, while maintaining a high degree of programmability. 相似文献

10.

A Software-Configurable Processor Architecture

《Micro, IEEE》2006,26(5):42-51

A software-configurable processor combines a traditional RISC processor with a field-programmable instruction extension unit that lets the system designer tailor the processor to a particular application. To add application-specific instructions to the processor, the programmer adds a pragma before a C or C++ function declaration, and the compiler then turns the function into a single instruction 相似文献

11.

Improving performance and energy efficiency of embedded processors via post-fabrication instruction set customization

Hamid Noori Farhad Mehdipour Koji Inoue Kazuaki Murakami 《The Journal of supercomputing》2012,60(2):196-222

Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique to enhance the performance and energy efficiency of embedded processors. However, the addition of custom functional units to the base processor is required to support the execution of custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market, significant non-recurring engineering and design costs are issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip-fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a matrix of functional units which is multi-cycle with the capability of conditional execution. To generate more effective custom instructions, they are extended over basic blocks and hence, multiple-exits custom instruction and intuition behind it are introduced. Conditional execution capability has been added to the RFU to support the multi-exit feature of custom instructions. Because the proposed RFU has limitations on hardware resources (i.e., connections and processing elements), an integrated mapping-temporal partitioning framework is proposed to guarantee that the generated custom instructions can be mapped on the RFU (mappable custom instructions). Experimental results show that multi-exit custom instructions enhance the performance and energy efficiency by an average of 32% and 3% compared to custom instructions limited to one basic block, respectively. A maximum speedup of 4.9, compared to a single-issue embedded processor, and an average speedup of 1.9 was achieved on MiBench benchmark suite. The maximum and average energy saving are 56% and 22%, respectively. These performance and energy efficiency are obtained at the cost of 30% area overhead. 相似文献

12.

密码指令集扩展研究*

李美峰戴冠中刘航苗胜张德刚《计算机应用研究》2008,25(6):1833-1835

详细分析了常见密码算法的基本操作以及密码指令集扩展的研究现状,针对当前密码系统需要支持多种密码算法的特点指出未来密码指令集扩展的发展方向:指令设计需朝通用性上发展且通用密码处理器是处理器密码指令集扩展的最终目的。相似文献

13.

一款可编程语音处理器的设计与应用

下载免费PDF全文

韩大晗崔慧娟唐昆刘大力《计算机工程》2007,33(12):251-252,255

为了提高通信系统的保密性,降低制造成本,需要进行专用处理器的设计.该文基于SELP(Sinusoidal Excitation Linear Prediction)算法模型原理,设计了一款高质量多速率语音专用处理器芯片.芯片使用可重构体系结构和超长指令字系统设计方法,将复杂度高的子程序进行优化,能够显著提高指令并行度.仿真结果表明:在该芯片上实现语音压缩编码算法,执行效率高于相同工艺水平的通用数字信号处理器,并保持原有编码质量.该处理器能够实现多种类型的语音压缩算法,使语音算法可以达到高保密性、低复杂度和易开发性. 相似文献

14.

An embedded implementation of the Common Language Infrastructure

Joseph C. Libby Kenneth B. Kent 《Journal of Systems Architecture》2009,55(2):114-126

The Common Language Infrastructure provides a unified instruction set which may be targeted by a variety of high level language compilers. This unified instruction set simplifies the construction of compilers and gives application designers the ability to choose the high level programming language that best suits the problem being solved. While the Common Language Infrastructure solves many problems related to design of applications and compilers, it is not without its own problems. The Common Language Infrastructure is based upon a virtual machine, much like the Java Virtual Machine. This requires that all instructions being executed on the Common Language Infrastructure be translated to native machine instructions before they can be executed on the host processor. This leads to degradation in performance. In order to overcome this problem it is proposed that an embedded processor capable of natively executing the CLI instruction set be developed. The objective of this work is the design and implementation, using VHDL and simulation, of an embedded processor capable of natively executing the CLI instruction set. This processor provides a platform easily targeted by software developers. 相似文献

15.

面向SDTA处理单元的一种功耗评估方法

赖明澈戴葵王志英《计算机工程与科学》2009,31(7)

本文针对同步数据传输体系结构(SDTA)处理单元提出了一种功耗评估方法。基于处理单元的结构抽象,结合SDTA特点,采取不同方法对各个子部件功耗分别进行评估。该方法不仅满足了精度要求,而且具有较好的灵活性与较高的工作效率,特别适应于专用指令集处理器的设计流程。实验表明,与PrimePower门级功耗评估工具的模拟结果比较,70%与90%的样本误差分别小于8.2%与10.8%,但评估效率提高了12000倍左右。相似文献

16.

OpenVX高效能并行可重构运算通路的设计与实现

王宇李涛邢立冬冯臻夫《计算机工程》2021,47(12):236-248

针对专用硬件在处理图形图像时无法同时兼顾灵活性、可扩展性和时效性的问题,设计一种支持OpenVX 1.3标准的专用处理器。通过对OpenVX 1.3标准中的核函数进行数据通路映射,分析实现函数高效处理所需的运算单元数目,确定适用于该标准的数据通路运算器的结构。通过编写指令对数据通路进行重构,适应OpenVX标准的演进和扩展。应用65 nm CMOS工艺库对整体电路进行综合验证,实现的OpenVX可重构数据通路运算器面积为21 076.21 μm²、功耗为778.63 mW、系统主频为500 MHz、吞吐量为1.86 GB/s。实验结果表明,该数据通路运算器具有较强的可编程性和可扩展性,能够有效满足实时和高速的通用图像处理要求。相似文献

17.

Automatic generation of compiler backends

Florian Brandner Viktor Pavlu Andreas Krall 《Software》2013,43(2):207-240

相似文献

18.

基于AHB总线的RISC-V微处理器设计与实现

下载免费PDF全文

郝振和焦继业李雨倩《计算机工程与应用》2020,56(20):52-58

在嵌入式应用中,为了满足小面积低功耗的设计需求,设计了一种支持RISC-V指令集架构的微处理器,系统采用2级流水结构,实现了RV32IMAC指令集。处理器采用AHB总线作为片上互连总线,可方便调用外部IP核进行功能拓展。在VCS环境下验证了该微处理器的逻辑功能,仿真结果表明该微处理器能够正常稳定运行。在面积、功耗和性能等方面与蜂鸟E203处理器以及ARM Cortex-M系列处理器进行了对比,该设计比蜂鸟E203处理器面积小了6%,功耗和性能上与Cortex-M0处理器相当。分析结果表明该处理器较适合在小面积、低功耗的嵌入式应用领域进行开发。相似文献

19.

Itanium processor microarchitecture

Sharangpani H. Arora H. 《Micro, IEEE》2000,20(5):24-43

The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA). The design team optimized the processor to meet a wide range of requirements: high performance on Internet servers and workstations, support for 64-bit addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and platforms. The processor employs EPIC (explicitly parallel instruction computing) design concepts for a tighter coupling between hardware and software. In this design style the hardware-software interface lets the software exploit all available compilation time information and efficiently deliver this information to the hardware. It addresses several fundamental performance bottlenecks in modern computers, such as memory latency, memory address disambiguation, and control flow dependencies 相似文献

20.

FPGA emulation methodology for fast and accurate power estimation of embedded processors

《Journal of Systems Architecture》2017

Early estimation of application-specific power consumption has become one of the major constraints of modern ASIC design. While in early stages of the design process precise power consumption can only be obtained from very time consuming gate-level (GTL) simulation, power estimation methodologies aim to reduce computational overhead by deriving models to approximate power consumption on higher levels. This work presents an FPGA accelerated power estimation methodology for programmable processors based on a hybrid functional level (FLPA) and instruction level power analysis (ILPA) that can be mapped onto an FPGA together with the functional emulation. It enables fast and accurate estimation of application-specific power consumption and energy per task which is crucial for power-aware design of embedded processor architectures. The approach allows both hardware and software designers to optimize their implementations not only for processing performance but also for power efficiency. The power emulation methodology and considerations for the FPGA implementation of the power estimation is described in detail. Model validation against GTL power simulation and results are given for a typical embedded RISC processor and a commercial-grade Application Specific Instruction Set Processor (ASIP). Power consumption models yield fast and accurate power estimation with a %MAE of less than 9% and NRMSE of less than 7% enabling co-optimization of both hardware and software with respect to power consumption in early design stages. 相似文献