
1.

Rapid design of area-efficient custom instructions for reconfigurable embedded processing

Siew-Kei Lam Thambipillai Srikanthan 《Journal of Systems Architecture》2009,55(1):1-14

RISPs (Reconfigurable Instruction Set Processors) are becoming increasingly popular because they can be customized to meet design constraints. However, existing instruction set customization methodologies do not lend themselves well to mapping custom instructions onto commercial FPGA architectures. In this paper, we propose a design exploration framework that rapidly identifies a reduced set of profitable custom instructions and their area costs on commercial architectures without the need for a time-consuming hardware synthesis process. A novel clustering strategy is used to estimate the utilization of LUT (Look-Up Table) based FPGAs for the chosen custom instructions. Our investigations show that the area costs computed with the proposed hardware estimation technique for 20 custom instructions are within 8% of those obtained using hardware synthesis. A systematic approach has been adopted to select the most profitable custom instruction candidates, which leads to a notable reduction in the number of custom instructions with only marginal degradation in performance. Simulations based on domain-specific application sets from the MiBench and MediaBench benchmark suites show that, on average, more than 25% area utilization efficiency (performance/area) can be achieved with the proposed technique.

2.

Parallel custom instruction identification for extensible processors

《Journal of Systems Architecture》2017

Because they can be customized for an application domain, extensible processors have been increasingly used in embedded systems in recent years. Extensible processors are customized for an application domain by executing parts of the application code in hardware instead of software. Determining which parts of the application code become custom instructions generally requires subgraph enumeration and subgraph selection, both of which are computationally difficult problems. Most previous work focuses on sequential algorithms for these two problems. In this paper, we present a parallel implementation of a recent subgraph enumeration algorithm on a computer cluster. A standard ant colony optimization (ACO) algorithm, a modified version of ACO with local optimum search, and a parallel ACO algorithm are also proposed to solve the subgraph selection problem. Experimental results show that the parallel algorithms outperform the sequential algorithms in terms of runtime, quality of results, or both. In addition, we formally prove the upper bound on the number of feasible solutions in the subgraph selection problem with or without the overlapping constraint.

3.

Compiler optimizations for processors with SIMD instructions

Ivan Pryanishnikov Andreas Krall Nigel Horspool 《Software》2007,37(1):93-113

To achieve maximum efficiency, modern embedded processors for media applications exploit single instruction multiple data (SIMD) instructions. SIMD instructions provide a form of vectorization where a large machine word is viewed as a vector of subwords and the same operation is performed on all subwords in parallel. Systematic usage of SIMD instructions can significantly improve program performance. With C becoming the dominant language for programming embedded devices, there is a clear need for C compilers that use SIMD instructions whenever appropriate. However, SIMD instructions typically require each memory access to be aligned with the instruction's data access size. Therefore an important problem in designing the compiler is to determine whether a C pointer is aligned, i.e. whether it refers to the beginning of a machine word. In this paper, we describe our SIMD generation algorithm and present an analysis method which determines the alignment of pointers at compile time. The alignment information is used to reduce the number of dynamic alignment checks and the overhead incurred by them. Our method uses an interprocedural analysis which propagates pointer alignment information in function bodies and through function calls. The effectiveness of our method is supported by experimental results which show that in typical programs the alignments of about 50% of the pointers can be statically determined. Copyright © 2006 John Wiley & Sons, Ltd.
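The two ideas in this abstract, subword parallelism and pointer alignment, can be illustrated with a small Python sketch. It is an illustration only and not the paper's algorithm: `swar_add_u8`, `is_aligned`, and `known_alignment` are hypothetical names, the layout of four 8-bit lanes in a 32-bit word is an assumption, and the gcd rule is a simplified stand-in for the interprocedural alignment analysis the authors describe.

```python
from math import gcd

LOW7 = 0x7F7F7F7F   # low seven bits of each 8-bit lane
HIGH = 0x80808080   # top bit of each 8-bit lane


def swar_add_u8(a: int, b: int) -> int:
    """Add four unsigned 8-bit lanes packed in 32-bit words, wrapping per lane."""
    # Adding only the low 7 bits keeps carries from crossing lane boundaries;
    # the top bit of each lane is then fixed up with XOR.
    return ((a & LOW7) + (b & LOW7)) ^ ((a ^ b) & HIGH)


def is_aligned(address: int, access_size: int) -> bool:
    """Dynamic check: does the address start on an access_size boundary?"""
    return address % access_size == 0


def known_alignment(base_alignment: int, offset: int) -> int:
    """Alignment guaranteed for (p + offset) when p is a multiple of base_alignment."""
    return gcd(base_alignment, offset)
```

For example, `swar_add_u8(0x01FF7F10, 0x01010180)` adds the lanes independently, so the lane holding `0xFF + 0x01` wraps to `0x00` without disturbing its neighbour. When `known_alignment` already proves the required alignment, a compiler could omit the dynamic `is_aligned` check.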

4.

System-level performance evaluation of reconfigurable processors

《Microprocessors and Microsystems》2005,29(2-3):63-73

Reconfigurable architectures that tightly integrate a standard CPU core with a field-programmable hardware structure have recently been receiving increased attention. The design of such a hybrid reconfigurable processor involves a multitude of design decisions regarding the field-programmable structure as well as its system integration with the CPU core. Determining the impact of these design decisions on the overall system performance is a challenging task. In this paper, we first present a framework for the cycle-accurate performance evaluation of hybrid reconfigurable processors on the system level. Then, we discuss a reconfigurable processor for data-streaming applications, which attaches a coarse-grained reconfigurable unit to the coprocessor interface of a standard embedded CPU core. By means of a case study we evaluate the system-level impact of certain design features for the reconfigurable unit, such as multiple contexts, register replication, and hardware context scheduling. The results illustrate that a system-level evaluation framework is of paramount importance for studying the architectural trade-offs and optimizing design parameters for reconfigurable processors.

5.

A holistic approach for tightly coupled reconfigurable parallel processors

Hritam Dutta Dmitrij Kissler Frank Hannig Alexey Kupriyanov Jürgen Teich Bernard Pottier 《Microprocessors and Microsystems》2009,33(1):53-62

New standards in signal, multimedia, and network processing for embedded electronics are characterized by computationally intensive algorithms and by high flexibility due to the swift change in specifications. In order to meet the demanding challenges of increasing computational requirements and stringent constraints on area and power consumption in embedded engineering, there is a gradual trend towards coarse-grained parallel embedded processors. Furthermore, such processors are equipped with dynamic reconfiguration features to support time- and space-multiplexed execution of the algorithms. However, the formidable problem of efficiently mapping applications (mostly loop algorithms) onto such architectures has hindered their mass acceptance. In this paper we present (a) a highly parameterizable, tightly coupled, and reconfigurable parallel processor architecture together with the corresponding power breakdown and reconfiguration time analysis of a case study application, (b) a retargetable methodology for mapping of loop algorithms, (c) a co-design framework for modeling, simulation, and programming of such architectures, and (d) loosely coupled communication with the host processor.

6.

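Automatic custom instruction identification method for high-level synthesis

Xiao Chenglong Lin Jun Wang Shanshan Wang Ning 《Journal of Computer Applications》2018,38(7):2024-2031

To address the difficulty of improving performance and reducing power consumption during High-Level Synthesis (HLS), an automatic custom instruction identification method for HLS is proposed. Custom instructions are enumerated and selected before the HLS process, which provides a general custom instruction identification method for HLS. First, the high-level source code is converted into a Control Data Flow Graph (CDFG) as a preprocessing step. Second, based on the Data Flow Graphs (DFGs) within the CDFG, a subgraph enumeration algorithm enumerates all connected convex subgraphs in a bottom-up manner, which makes it easier for users to flexibly modify the constraints. Then, considering area, performance, and code size in turn, a subgraph selection algorithm selects some of the best subgraphs as the final custom instructions. Finally, new code is regenerated with the selected custom instructions and used as the input to the HLS tool. Compared with traditional HLS, pattern selection based on occurrence frequency reduces area by 19.1% on average, and subgraph selection based on the critical path reduces latency by 22.3% on average. In addition, compared with the TD algorithm, the enumeration efficiency of the proposed algorithm improves by 70.8% on average. The experimental results show that the automatic custom instruction identification method enables HLS to significantly improve performance and reduce area and code size in circuit design.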
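The enumeration step in this abstract keeps only connected convex subgraphs of a data-flow graph. The sketch below shows one way to test those two properties for a candidate node set. It is not the authors' enumeration algorithm: the function names and the dictionary-of-successors graph encoding are assumptions made for illustration. A subset is treated as convex when no path between two of its nodes passes through a node outside it.

```python
from collections import deque


def reverse(graph):
    """Build the predecessor map of a DAG given as {node: [successors]}."""
    rev = {}
    for u, succs in graph.items():
        rev.setdefault(u, [])
        for v in succs:
            rev.setdefault(v, []).append(u)
    return rev


def reachable(graph, sources):
    """Nodes reachable from any source by following at least one edge."""
    seen, queue = set(), deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_convex(graph, subset):
    """True if no outside node lies on a path between two subset nodes."""
    subset = set(subset)
    below = reachable(graph, subset)            # descendants of the subset
    above = reachable(reverse(graph), subset)   # ancestors of the subset
    return not ((below & above) - subset)


def is_connected(graph, subset):
    """True if the subset is connected when edge direction is ignored."""
    subset = set(subset)
    if not subset:
        return False
    rev = reverse(graph)
    start = next(iter(subset))
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in list(graph.get(node, ())) + rev.get(node, []):
            if nxt in subset and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == subset
```

On the diamond-shaped graph `{"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}`, the set `{"a", "b", "d"}` is connected but not convex, because the path from `a` to `d` through `c` leaves the set.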

7.

Implementation-aware selection of the custom instruction set for extensible processors

《Microprocessors and Microsystems》2014,38(7):681-691

This paper presents an approach for incorporating the effect of various logic synthesis options and logic level implementations into the custom instruction (CI) selection for extensible processors. This effect translates into the availability of a piecewise continuous spectrum of delay versus area choices for each CI, which in turn influences the selection of the CI set that maximizes the speedup per area cost (SPA) metric. The effectiveness of the proposed approach is evaluated by applying it to several benchmarks and comparing the results with those of a conventional technique. We also apply the methodology to the existing serialization algorithms aimed at relaxing register file constraints in multi-cycle custom instruction design. The comparison shows considerable improvements in the speedup per area compared to the custom instruction selection algorithms under the same area-budget constraint.
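To show how several implementation choices per CI affect a speedup-per-area selection, here is a minimal exhaustive-search sketch. It is not the paper's method. The discrete `(area, cycles_saved)` points, the definition of SPA as overall speedup divided by total area, and the name `best_selection` are all assumptions chosen for illustration. It further assumes positive areas and total savings smaller than `base_cycles`.

```python
from itertools import product


def best_selection(options, base_cycles, area_budget):
    """Exhaustively pick at most one implementation per custom instruction.

    options maps a CI name to a list of (area, cycles_saved) points.
    Returns (selection, speedup_per_area), or ({}, 0.0) if nothing fits.
    """
    names = list(options)
    best, best_spa = {}, 0.0
    # None stands for "do not implement this CI".
    for choice in product(*[[None] + list(options[n]) for n in names]):
        picked = {n: c for n, c in zip(names, choice) if c is not None}
        area = sum(a for a, _ in picked.values())
        if not picked or area > area_budget:
            continue
        saved = sum(s for _, s in picked.values())
        speedup = base_cycles / (base_cycles - saved)
        spa = speedup / area
        if spa > best_spa:
            best, best_spa = picked, spa
    return best, best_spa
```

With `options = {"ci_a": [(10, 100), (20, 150)], "ci_b": [(15, 200)]}`, `base_cycles = 1000`, and `area_budget = 30`, the search prefers the small implementation of `ci_a` alone, because the larger variants add area faster than they add speedup. Exhaustive search is only practical for a handful of CIs.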

8.

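Research and implementation of timing instruction extensions in CPS processors

Gao Zhenhua Yang Fan Chen Wenjie Chai Zhilei 《Journal of Computer Applications》2012,32(6):1730-1733

Physical processes are inherently concurrent and real-time, so developing Cyber-Physical Systems (CPS) requires computing processes that can express these properties. However, to make it easier for users to describe logic, traditional computing paradigms have gradually discarded the precise description of timing as their level of abstraction has risen. Building on the embedded Java processor JPOR-32, a clock register and a clock counter are added for CPS applications, and four timing instructions are introduced, combined with the exception mechanism, according to programmers' timing requirements, so that users can control time precisely according to their needs. Finally, the feasibility, correctness, and precision of this timing control mechanism are verified by the results of running an image processing program that uses the timing instructions on the CPS processor.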

9.

Supporting multiple-input, multiple-output custom functions in configurable processors

《Journal of Systems Architecture》2007,53(5-6):263-271

Configurable processors have emerged as a promising solution for high performance embedded systems. Many of these processors extend a RISC core with configurable functional units that execute dual-input, single-output (DISO) custom functions. Although studies have shown that supporting multiple-input, multiple-output (MIMO) custom functions can lead to significant speedups, mechanisms to efficiently achieve this have not been adequately addressed. The underlying reason is that a custom function is normally invoked by a single instruction, which usually transfers only two inputs and one output. Attempts to transfer more inputs and outputs in one instruction are impeded by the instruction length and the register file’s R/W ports. This paper proposes a simple extension to transfer multiple inputs and outputs of the custom functions using repeated instructions. While transferring the inputs and outputs may take a few extra cycles, our experiments show that the MIMO extension can still achieve an average 51% increase in speedup compared to a DISO extension and an average 27% increase in speedup compared to a multiple-input, single-output (MISO) extension.

10.

Recalling instructions from idling threads to maximize resource utilization for simultaneous multi-threading processors

Yilin Zhang Caleb Douglas Wei-Ming Lin 《Computers & Electrical Engineering》2013

Simultaneous Multi-Threading (SMT) has been a very popular design for improving resource utilization by sharing key datapath components among multiple independent threads. However, allowing any of the threads to overwhelm these shared resources not only leads to unfair thread processing but may also result in severely degraded overall performance. Preventing idling threads from clogging the critical resources in the pipeline is therefore essential to sustaining the desired system performance. In this paper, we show that, if one can manage to recall instructions of idling threads from the shared Issue Queue (IQ), the system performance is easily enhanced by a significant margin, with up to 20% for some benchmark mixes. An even more noteworthy feature of this technique is that the ensuing hardware overhead is insignificant, and it can also be coupled with other advanced techniques employed in other stages of the SMT pipeline for potentially additive benefits.

11.

Principles, structures, and implementation of reconfigurable ternary optical processors

JIN Yi WANG HongJian OUYANG Shan ZHOU Yu SHEN YunFu PENG JunJie LIU XueMin 《Science China Information Sciences》2011,(11):2236-2246


12.

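Principles, basic structures, and implementation of reconfigurable ternary optical processors

Jin Yi Wang Hongjian Ouyang Shan Zhou Yu Shen Yunfu Peng Junjie Liu Xuemin 《Scientia Sinica Informationis》2012,(6):778-788

This paper discusses the reconfiguration principles, reconfiguration structures, and reconfiguration operations of the ternary optical processor, and presents the typical structures, classification, naming, reconfiguration circuits, reconfiguration instructions, and reconfiguration routines of the ternary optical arithmetic unit and its basic operation units. It also briefly analyzes the high-speed and low-power performance of ternary optical computers. Finally, a reconfiguration experiment on a one-bit basic operation unit is described; the results show that the principles of the reconfigurable ternary optical processor discussed in this paper are correct and that the reconfiguration devices and reconfiguration instructions are effective.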

13.

Principles, structures, and implementation of reconfigurable ternary optical processors

JIN Yi WANG HongJian OUYANG Shan ZHOU Yu SHEN YunFu PENG JunJie & LIU XueMin 《Science China Information Sciences》2011,(11):2236-2246

This paper discusses the principles, structures, and implementation procedures for reconfiguring a ternary optical processor (TOP). Typical structures, classifications, and naming of the TOP and basic operation units (BOUs) are described in this paper. The circuit implementations, commands, and processes for the reconfiguration are also discussed in detail. A simple analysis of the high performance and low power consumption of ternary optical computers is presented. Finally, an experiment is performed on reconfiguring a 1-bit BOU, which shows that the principles of reconfigurable TOPs are valid, and that the implementations and commands for the reconfiguration are effective.

14.

A constant time algorithm for finding maxima on reconfigurable bus systems using fewer processors

CSR Krishnan C Siva Ram 《Microprocessors and Microsystems》1993,17(10):607-610


15.

Optimal algorithms for the channel-assignment problem on a reconfigurable array of processors with wider bus networks

Shi-Jinn Horng Horng-Ren Tsai Yi Pan Seitzer J. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(11):1124-1138

The computation model on which the algorithms are developed is the reconfigurable array of processors with wider bus networks (abbreviated to RAPWBN). The main difference between the RAPWBN model and other existing reconfigurable parallel processing systems is that the bus width of each network is bounded within the range $[2, [\sqrt{N}]]$. Such a strategy not only saves the silicon area of the chip as well as increases the computational power enormously, but the strategy also allows the execution speed of the proposed algorithms to be tuned by the bus bandwidth. To demonstrate the computational power of the RAPWBN, the channel-assignment problem is derived in this paper. For the channel-assignment problem with $N$ pairs of components, we first design an $O(T + [N/\omega])$ time parallel algorithm using $2N$ processors with a $2N$-row by $2N$-column bus network, where the bus width of each bus network is $\omega$-bit for $2 \leq \omega \leq [\sqrt{N}]$ and $T = [\log_{\omega} N] + 1$. By tuning the bus bandwidth to the natural $\log N$-bit and the extended $N^{1/c}$-bit ($N^{1/c} > \log N$) for any constant $c$ and $c \geq 1$, two more results which run in $O(\log N/\log \log N)$ and $O(1)$ time, respectively, are also derived. When compared to the algorithms proposed by Olariu et al. [17] and Lin [14], it is shown that our algorithm runs in the equivalent time complexity while significantly reducing the number of processors to $O(N)$.
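As a worked instance of the first bound, the sketch below evaluates T = [log_ω N] + 1 and the sum T + [N/ω] for concrete values. The extracted text does not preserve the bracket style, so the code assumes a floor around the logarithm and a ceiling around N/ω. That reading, and the function names, are assumptions rather than something the abstract states.

```python
def floor_log(n: int, base: int) -> int:
    """Largest k with base**k <= n, computed with integers only."""
    k, power = 0, 1
    while power * base <= n:
        power *= base
        k += 1
    return k


def time_bound_terms(n: int, omega: int):
    """Return (T, ceil(n / omega), T + ceil(n / omega)) under the stated reading."""
    t = floor_log(n, omega) + 1
    chunks = -(-n // omega)   # integer ceiling division
    return t, chunks, t + chunks
```

For N = 1024, a 2-bit bus gives `time_bound_terms(1024, 2) == (11, 512, 523)`, while a 32-bit bus (the square root of N) gives `(3, 32, 35)`, which shows how widening the bus shrinks both terms.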

16.

A reconfigurable manager for dynamically reconfigurable hardware

Resano J. Mozos D. Verkest D. Catthoor F. 《Design & Test of Computers, IEEE》2005,22(5):452-460

Dynamic reconfiguration has been a technology solution in search of the right problem to solve. Effective use of the technology requires new programming and task management models. This article describes an approach to dynamic reconfiguration that reduces reconfiguration latency to the point where dynamic multimedia applications can now exploit such platforms.

17.

Using open source for a profitable startup

Wall D.A.E. 《Computer》2001,34(12):158-160

The viability of open source projects such as Linux and GNU is frequently questioned. This month, the author shows how the use of such software can help a small, underfunded company establish itself in the commercial software arena. It's apparent that from his perspective the viability question has been answered with a resounding yes.

18.

High level modeling and automated generation of heterogeneous SoC architectures with optimized custom reconfigurable cores and on-chip communication media

Balal Ahmad Ali Ahmadinia Tughrul Arslan 《Journal of Systems Architecture》2010,56(11):597-615

In this paper we propose a framework for modeling and automated generation of heterogeneous SoC architectures with emphasis on reconfigurable component integration and optimized communication media. In order to facilitate rapid development of SoC architectures, we present communication-centric platforms for data-intensive applications, high-level modeling of reconfigurable components for quick simulation, and a tool for generating complete SoC architectures. Four different communication-centric platforms based on traditional bus, crossbar, hierarchical bus, and novel hybrid communication media are proposed to cater for the different communication requirements of future SoC architectures. A multi-standard telecommunication application is chosen as our target application domain, and a case study of WiMAX is used as a real-world example to demonstrate the effectiveness of our approach. A system consisting of an ARM processor, a reconfigurable FFT, and a reconfigurable Viterbi decoder is considered, with the option of system scalability for future upgrades. The behavior of the system with different communication platforms is analyzed for its throughput and power characteristics under different reconfiguration scenarios to show the effectiveness of our approach.

19.

Power-aware BTB for modern processors

Kaveh Jokar Deris 《Computers & Electrical Engineering》2010,36(5):902-911

Modern processors access the branch target buffer (BTB) every cycle to speculate branch target addresses. This aggressive approach improves performance as it results in early identification of target addresses. Unfortunately, such accesses are quite often unnecessary, as there is no control flow instruction among those fetched. In this work, we introduce speculative BTB access to address this design inefficiency. Our technique relies on a simple power-efficient structure, referred to as the BLC-filter, to identify cycles where there is no control flow instruction among those fetched, at least one cycle in advance. By identifying such cycles and eliminating unnecessary BTB accesses we reduce BTB power dissipation (and therefore power density).

20.

Accurate arithmetic for vector processors

《Journal of Parallel and Distributed Computing》1988,5(3):250-270

In addition to the four elementary arithmetic operations, more advanced electronic computers such as vector and parallel computers often provide a number of compound operations as additional elementary operations. If pipelined, compound operations like “multiply and add,” “accumulate,” and “multiply and accumulate” contribute essentially to the high speed of the system. Accuracy requirements lead to very similar operations. We identify a set of operations which meet both requirements: high speed and accuracy. After a brief discussion of implementation techniques for the simpler of these operations we present two methods and circuits which allow a fast and correct computation of the more complicated of these operations: “accumulate” and “multiply and accumulate.” The first method computes sums and dot products by making use of a matrix-shaped and pipelined arrangement of adders which cover the full floating-point range. The second method requires some local memory on the arithmetic unit. It permits a drastic reduction in the number of adders required. Both methods can also be used to build a fast arithmetic unit for microcomputers in VLSI technology.
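The accuracy problem behind “multiply and accumulate” can be seen in a few lines of Python. The sketch contrasts a dot product that rounds after every step with one that accumulates exactly and rounds once at the end. Exact rational arithmetic from the standard `fractions` module stands in here for the wide hardware accumulators the abstract describes; the circuits themselves are not modelled, and the function names are chosen only for illustration.

```python
from fractions import Fraction


def naive_dot(xs, ys):
    """Dot product with a rounding error after every multiply and add."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


def exact_dot(xs, ys):
    """Dot product accumulated exactly, then rounded once to a float."""
    total = Fraction(0)
    for x, y in zip(xs, ys):
        total += Fraction(x) * Fraction(y)   # floats convert to Fraction exactly
    return float(total)
```

For `xs = [1e16, 1.0, -1e16]` and `ys = [1.0, 1.0, 1.0]`, the naive loop loses the middle term when it is added to `1e16`, whereas the exact accumulation keeps it until the single final rounding.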