
1.

A method for partitioning applications in hybrid reconfigurable architectures

Michalis D. Galanis Athanasios Milidonis George Theodoridis Dimitrios Soudris Costas E. Goutis 《Design Automation for Embedded Systems》2005,10(1):27-47

In this paper, we propose a methodology for accelerating application segments by partitioning them between reconfigurable hardware blocks of different granularity. Critical parts are sped up on the coarse-grain reconfigurable hardware to meet the timing requirements of application code mapped on the reconfigurable logic. The reconfigurable processing units are embedded in a generic hybrid system architecture which can model a large number of existing heterogeneous reconfigurable platforms. The fine-grain reconfigurable logic is realized by an FPGA unit, while the coarse-grain reconfigurable hardware is realized by our developed high-performance data-path. The methodology mainly consists of three stages: the analysis, the mapping of the application parts onto fine and coarse-grain reconfigurable hardware, and the partitioning engine. A prototype software framework realizes the partitioning flow. In this work, the methodology is validated using five real-life applications. Analytical partitioning experiments show that the speedup relative to the all-FPGA mapping solution ranges from 1.5 to 4.0, while the specified timing constraints are satisfied for all the applications.

2.

A Medium-Grain Reconfigurable Architecture for DSP: VLSI Design, Benchmark Mapping, and Performance

Myjak M.J. Delgado-Frias J.G. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(1):14-23

Reconfigurable hardware has become a well-accepted option for implementing digital signal processing (DSP). Traditional devices such as field-programmable gate arrays offer good fine-grain flexibility. More recent coarse-grain reconfigurable architectures are optimized for word-length computations. We have developed a medium-grain reconfigurable architecture that combines the advantages of both approaches. Modules such as multipliers and adders are mapped onto blocks of 4-bit cells. Each cell contains a matrix of lookup tables that implement either mathematics functions or a random-access memory. A hierarchical interconnection network supports data transfer within and between modules. We have created software tools that allow users to map algorithms onto the reconfigurable platform. This paper analyzes the implementation of several common benchmarks, ranging from floating-point arithmetic to a radix-4 fast Fourier transform. The results are compared to contemporary DSP hardware.

3.

Run-Time Management of a MPSoC Containing FPGA Fabric Tiles

Nollet V. Avasare P. Eeckhaut H. Verkest D. Corporaal H. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(1):24-33

Multimedia applications, such as 3-D games and video decoders, are typically composed of communicating tasks. Their target embedded computing platforms (e.g., TI OMAP3, IBM Cell) contain multiple heterogeneous processing elements. At application design-time, it is often unknown which applications will execute simultaneously. Hence, resource assignment decisions need to be made by a run-time manager. Run-time assignment of these communicating tasks onto the communication and computation resources of such a multiprocessor platform is a challenging task. In the presence of fine-grain reconfigurable hardware processing elements, the run-time manager also needs to consider the creation of a so-called configuration hierarchy. Instead of executing a dedicated hardware task, the fine-grain reconfigurable hardware fabric hosts a programmable softcore block that, in turn, executes the task functionality. Hence, the next challenge for run-time management is to efficiently handle a configuration hierarchy. This paper details a run-time task assignment heuristic that performs fast and efficient task assignment in a multiprocessor system-on-chip containing fine-grain reconfigurable hardware tiles. In addition, this algorithm is capable of managing a configuration hierarchy. We show that being capable of handling a configuration hierarchy significantly improves the task assignment performance (i.e., success rate and assignment quality). In several cases, adding a configuration hierarchy improves the assignment success rate of the assignment heuristic by 20%.

4.

Mapping Data-Parallel Tasks Onto Partially Reconfigurable Hybrid Processor Architectures

Vikram K. N. Vasudevan V. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(9):1010-1023

Reconfigurable hybrid processor systems provide a flexible platform for mapping data-parallel applications, while providing considerable speedup over software implementations. However, the overhead for reconfiguration presents a significant deterrent in mapping applications onto reconfigurable hardware. Partial runtime reconfiguration is one approach to reduce the reconfiguration overhead. In this paper, we present a methodology to map data-parallel tasks onto hardware that supports partial reconfiguration. The aim is to obtain the maximum possible speedup for a given reconfiguration time, bus speed, and computation speed. The proposed approach involves using multiple, identical but independent processing units in the reconfigurable hardware. Under nonzero reconfiguration overhead, we show that there exists an upper limit on the number of processing units that can be employed, beyond which further reduction in execution time is not possible. We obtain solutions for the minimum processing time, the corresponding load distribution, and the schedule for data transfer. To demonstrate the applicability of the analysis, we present the following: 1) various plots showing the variation of processing time with different parameters; 2) hardware simulations for two examples, viz., 1-D discrete wavelet transform and finite impulse response filter, targeted to Xilinx field-programmable gate arrays (FPGAs); and 3) experimental results for a hardware prototype implemented on an FPGA board.
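The claim that adding processing units stops paying off under nonzero reconfiguration overhead can be illustrated with a toy model. The sketch below is not the authors' formulation: it assumes, purely for illustration, that the units are reconfigured one after another (cost `reconfig_time` each) and that the work is then split evenly, and it ignores bus transfer time. The function names and parameters are hypothetical.

```python
def processing_time(n_units, work_time, reconfig_time):
    """Toy estimate of total time with n_units identical processing units.

    Assumption (not from the paper): units are configured sequentially,
    then the data-parallel work is divided evenly among them.
    """
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return n_units * reconfig_time + work_time / n_units


def best_unit_count(work_time, reconfig_time, max_units):
    """Return the unit count in 1..max_units with the smallest toy time."""
    return min(
        range(1, max_units + 1),
        key=lambda n: processing_time(n, work_time, reconfig_time),
    )
```

Under these assumptions, with `work_time=100.0` and `reconfig_time=1.0` the best count is 10 units; beyond that, each extra unit costs more in reconfiguration than it saves in computation. With `reconfig_time=0.0` the model has no such limit and always prefers `max_units`.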

5.

Reconfigurable Filter Coprocessor Architecture for DSP Applications

S. Ramanathan S.K. Nandy V. Visvanathan 《The Journal of VLSI Signal Processing》2000,26(3):333-359

Digital Signal Processing (DSP) is widely used in high-performance media processing and communication systems. In the majority of these applications, critical DSP functions are realized as embedded cores to meet the low-power budget and high computational complexity. Usually these cores are ASICs that cannot be easily retargeted for other similar applications that share certain commonalities. This stretches the design cycle, which affects time-to-market constraints. In this paper, we present a reconfigurable high-performance low-power filter coprocessor architecture for DSP applications. The coprocessor architecture, apart from having the performance and power advantage of its ASIC counterpart, can be reconfigured to support a wide variety of filtering computations. Since filtering computations abound in DSP applications, the implementation of this coprocessor architecture can serve as an important embedded hardware IP.

6.

A Case Study of Hardware/Software Partitioning of Traffic Simulation on the Cray XD1

Tripp J.L. Gokhale M.B. Hansson A. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(1):66-74

Scientific application kernels mapped to reconfigurable hardware have been reported to have 10× to 100× speedup over equivalent software. These promising results suggest that reconfigurable logic might offer significant speedup on applications in science and engineering. To accurately assess the benefit of hardware acceleration on scientific applications, however, it is necessary to consider the entire application including software components as well as the accelerated kernels. Aspects to be considered include alternative methods of hardware/software partitioning, communications costs, and opportunities for concurrent computation between software and hardware. Analysis of these factors is beyond the scope of current automatic parallelizing compilers. In this paper, a case study is presented in which a simulation of metropolitan road traffic networks is mapped onto a reconfigurable supercomputer, the Cray XD1. Five different methods are presented for mapping the application onto the combined hardware/software system. An approach for approximating the performance of each method is derived through analytic equations. Our analytical and empirical results show that key predictors of performance (which are often not considered in reported speedup of kernel operations) are not necessarily maximum parallelism, but must account for the fraction of the problem that runs on the reconfigurable logic and the amount of data flow between software and hardware.
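The gap between kernel speedup and whole-application speedup described here can be sketched with a standard Amdahl-style estimate. This is a generic illustration, not the analytic equations derived in the paper; the function name and the way communication cost is expressed (as a fraction of the original runtime) are assumptions made for the example.

```python
def overall_speedup(accel_fraction, kernel_speedup, comm_overhead=0.0):
    """Amdahl-style estimate of whole-application speedup.

    accel_fraction: share of the original runtime moved to hardware (0..1)
    kernel_speedup: speedup of that share on the reconfigurable logic
    comm_overhead:  added hardware/software transfer time, expressed as a
                    fraction of the original runtime (illustrative assumption)
    """
    if not 0.0 <= accel_fraction <= 1.0:
        raise ValueError("accel_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    if comm_overhead < 0.0:
        raise ValueError("comm_overhead must be non-negative")
    # Normalized new runtime: untouched software part, accelerated part,
    # plus the extra data-movement cost.
    new_time = (1.0 - accel_fraction) + accel_fraction / kernel_speedup + comm_overhead
    return 1.0 / new_time
```

For instance, a kernel that is 100× faster but covers only half of the runtime gives an overall speedup just under 2×, and any data-transfer overhead lowers it further.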

7.

A RISC architecture extended by an efficient tightly coupled reconfigurable unit

N. Vassiliadis N. Kavvadias G. Theodoridis S. Nikolaidis 《International Journal of Electronics》2013,100(6):421-438

In this paper, the architecture of an embedded processor extended with a tightly-coupled coarse-grain reconfigurable functional unit (RFU) is proposed. The efficient integration of the RFU with the control unit and the datapath of the processor eliminates the communication overhead between them. To speed up execution, the RFU exploits instruction level parallelism (ILP) and spatial computation. Also, the proposed integration of the RFU efficiently exploits the pipeline structure of the processor, leading to further performance improvements. Furthermore, a development framework for the introduced architecture is presented. The framework is fully automated, hiding all reconfigurable hardware related issues from the user. The hardware model of the architecture was synthesized in a 0.13 µm process, and all information regarding area and delay was estimated and presented. A set of benchmarks is used to evaluate the architecture and the development framework. Experimental results show performance improvements in addition to potential energy reduction.

8.

Design of a Hardware Verification Platform for an Embedded Reconfigurable System-on-Chip

Niu Junfeng, Luo Ao, Liu Leibo, Wei Shaojun 《电视技术》 (Video Engineering) 2008,32(2):24-26

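A new video decoding scheme based on an embedded reconfigurable system-on-chip is proposed, using a hardware/software co-verification approach. A corresponding hardware verification platform is designed, and the feasibility of implementing the H.264 decoding algorithm on the reconfigurable processor is verified.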

9.

A Data-Driven Unified Programming Model for Reconfigurable Computing

Zhou Xuehai, Luo Sai, Wang Feng, Qi Ji 《电子学报》 (Acta Electronica Sinica) 2007,35(11):2123-2128

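Reconfigurable computing, with its excellent performance and high flexibility, has gradually attracted wide attention in research communities at home and abroad. However, the reconfigurable computing system architectures under study are highly varied, and programming models are mostly tied to a specific architecture, which makes them difficult to use and to port. To address the problem of programming generality, this paper starts from the basic characteristics of reconfigurable computing, proposes a data-driven unified programming model that supports heterogeneous task-parallel computing, and discusses its implementation. The model is based on a producer-consumer communication mechanism, supports multiple types of computing nodes and communication networks, and is highly abstract. Experimental results show that applications designed with the unified programming model can use the same user program on different architectures and achieve a higher speedup than a pure hardware acceleration approach.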

10.

Design Aspects of Multi-level Reconfigurable Architectures

Sebastian Lange Martin Middendorf 《Journal of Signal Processing Systems》2008,51(1):23-37

Dynamically reconfigurable hardware has already been deployed for accelerating computationally demanding applications. Some of these hardware architectures allow run time reconfiguration but this usually leads to a large reconfiguration overhead. The advantage of run time reconfiguration is that it allows new algorithmic solutions for many applications. To study the potential of frequent run time reconfiguration it is interesting to investigate its costs and benefits from an abstract point of view and to develop new architectural concepts. Multi-level reconfigurable architectures are one such concept that introduces several levels of reconfiguration. This paper deals with new types of multi-level reconfigurable architectures. The corresponding problem of finding the best granularity for different reconfiguration levels is formulated and investigated. Although this problem is shown to be NP-complete, an interesting restricted subcase is solved optimally in polynomial time. For the general case, a good heuristic is proposed that is based on solutions for the restricted case. Results on three example applications show that the reconfiguration cost can be reduced with the new architectures. Based on a proposed measure of relative efficiency it is also shown that the new architectures are more efficient so that they obtain a larger reconfiguration cost reduction with less additional hardware.


11.

Medium-Grain Cells for Reconfigurable DSP Hardware

Myjak M.J. Delgado-Frias J.G. 《IEEE transactions on circuits and systems. I, Regular papers》2007,54(6):1255-1265

Reconfigurable hardware contains an array of programmable cells and interconnection structures. Field-programmable gate arrays use fine-grain cells that implement simple logic functions. Some proposed reconfigurable architectures for digital signal processing (DSP) use coarse-grain cells that perform 16-b or 32-b operations. A third alternative is to use medium-grain cells with a word length of 4 or 8 b. This approach combines high flexibility with inherent support for binary arithmetic such as multiplication. This paper presents two medium-grain cells for reconfigurable DSP hardware. Both cells contain an array of small lookup tables, or “elements”, that can assume two structures. In memory mode, the elements act as a random-access memory. In mathematics mode, the elements implement 4-b arithmetic operations. The first design uses a matrix of 4 × 4 elements and operates in bit-parallel fashion. The second design uses an array of five elements and computes arithmetic functions in bit-serial fashion. Layout simulations in 180-nm CMOS indicate that the parallel cell operates at 267 MHz, whereas the serial cell runs at 167 MHz. However, the parallel design requires over twice the area. The proposed medium-grain cells provide the performance and flexibility needed to implement DSP. To evaluate the designs, the paper estimates the execution time and resource utilization for common benchmarks such as the fast Fourier transform. The architecture model used in this analysis combines the cells with a pipelined hierarchical interconnection network. The end results show great promise compared to other devices, including field-programmable gate arrays.

12.

Design of an Ant Colony Routing System for Coarse-Grained Configurable Architecture Chips

Song Liguo, Jiang Yuxian 《微电子学与计算机》 (Microelectronics & Computer) 2007,24(4):15-17

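Building on the MAX-MIN Ant System, the ants are given an added olfactory discrimination ability, and the approach is applied to the routing problem of coarse-grained configurable architecture chips. Taking the developed coarse-grained reconfigurable chip CTaiJi as the target, a comparison over several test cases shows that this method is better at finding the optimal solution than the currently common negotiated congestion algorithm.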

13.

Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System

Galanis M.D. Dimitroulakos G. Goutis C.E. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(12):1362-1366

This paper presents performance improvements and energy savings from mapping real-world benchmarks on an embedded single-chip platform that includes coarse-grained reconfigurable logic with a microprocessor. The reconfigurable hardware is a 2-D array of processing elements connected with a mesh-like network. Analytical results derived from mapping seven real-life digital signal processing applications, with the aid of an automated design flow, on six different instances of the system architecture are presented. Significant overall application speedups relative to an all-software solution, ranging from 1.81 to 3.99, are reported; these are close to the theoretical speedup bounds. Additionally, the energy savings range from 43% to 71%. Finally, a comparison with a system coupling a microprocessor with a very long instruction word core shows that the microprocessor/coarse-grained reconfigurable array platform is more efficient in terms of performance and energy consumption.

14.

A 2-D motion detection model for low-cost embedded reconfigurable I/O devices

Dollas A Sotiropoulos S Papademetriou K 《IEEE transactions on bio-medical engineering》2005,52(8):1443-1449

A low-cost reconfigurable embedded apparatus for two-dimensional (2-D) motion detection has been developed. This paper briefly outlines the embedded reconfigurable system architecture, and presents in depth the 2-D motion detection model, which is directly mapped to reconfigurable hardware. Emphasis is placed on the hardware's ability to adapt to the individual needs of kinetically challenged persons by altering detection thresholds and delays, resulting in an efficient low-cost reconfigurable hardware implementation of the model. This paper also presents how the model detects complex motions through a vocabulary of simple motions, and how the system is trained to individual users' needs. Experimental results and integrated applications of the model for text processing are also presented.

15.

Mapping of generalized template matching onto reconfigurable computers

Xuejun Liang Jean J.S.-N. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(3):485-498

Image processing algorithms for template matching, two-dimensional (2-D) digital filtering, morphologic operations, and motion estimation share some common properties. They can all benefit from using reconfigurable computers that use coprocessor boards based on field-programmable gate array (FPGA) chips. This paper characterizes those applications as generalized template matching (GTM) operations and describes the mapping of the GTM operations onto reconfigurable computers. A three-step approach is described. The first two steps enumerate and prune the design space of basic GTM building blocks, which consist of FPGA buffers and GTM computation cores. The last step is to achieve a solution through an optimal combination of these building blocks where the cost function is the FPGA computation time and the constraints are FPGA coprocessor board resources. Various FPGA buffers are presented so as to introduce design options of basic GTM building blocks. Algorithms used for the mapping are described. Experimental results are summarized to reveal the relationship between the GTM mapping results and FPGA board resource parameters.

16.

Interprocedural Compiler Optimization for Partial Run-Time Reconfiguration

Elena Moscu Panainte Koen Bertels Stamatis Vassiliadis 《The Journal of VLSI Signal Processing》2006,43(2-3):161-172

In this paper, we study the performance impact of dynamic hardware reconfigurations for current reconfigurable technology. As a testbed, we target the Xilinx Virtex II Pro, the Molen experimental platform and the MPEG2 encoder as the application. Our experiments show that slowdowns of up to a factor of 1000 are observed when the configuration latency is not hidden by the compiler. In order to avoid the performance decrease, we propose an interprocedural optimization that minimizes the number of executed hardware configuration instructions, taking into account constraints such as the “FPGA-area placement conflicts” between the available hardware configurations. The presented algorithm allows the anticipation of hardware configuration instructions up to the application’s main procedure. The presented results show that our optimization reduces the number of executed hardware configuration instructions by 3 to 5 orders of magnitude. Moreover, the optimization makes it possible to exploit up to 97% of the maximal theoretical speedup achieved by the reconfigurable hardware execution.

17.

Rapid System Prototyping for High Performance Reconfigurable Computing

Vijay Jain S. Shrivastava 《Design Automation for Embedded Systems》2000,5(3-4):339-350

This paper proposes a soft-reconfigurable coarse-grain platform, called J-platform, based on very few types of computational cells, and shows how a range of advanced applications can be mapped to achieve high performance. Two very flexible cells are proposed that account for the versatility of the approach. These are the MA_PLUS, which is an enhanced Multiply-Add cell, and the UNL, a Universal NonLinear cell. These two cells, together, provide an unprecedented capability for coarse-grain reconfigurable computing, as discussed in the paper. Although the applications of the platform range from FIR filtering of images to large-scale inverse problems, the paper demonstrates mapping of two specific advanced problems, namely (1) Reconstruction of Computerized Tomography images from fan beam projections, and (2) Color Conversion of Video from the RGB to HSI domains. Speed-up by factors of 20 or more over today's workstations is estimated.

18.

Deployment of Run-Time Reconfigurable Hardware Coprocessors Into Compute-Intensive Embedded Applications

Francisco Fons Mariano Fons Enrique Cantó Mariano López 《Journal of Signal Processing Systems》2012,66(2):191-221

Day after day, embedded systems add more compute-intensive applications to their end products: cryptography and image and video processing are some examples found in leading markets such as consumer electronics and automotive. To meet these ever-increasing computational demands, hardware accelerators synthesized in field-programmable gate arrays (FPGAs) can achieve processing speedups of orders of magnitude over their CPU-based software counterparts. However, the inherent increase in physical resources carries a cost penalty. To address this issue, dynamically reconfigurable hardware technology has definitively reached maturity. SRAM-based reconfigurable logic goes beyond the classical conception of static hardware resources distributed in space and held invariant for the entire application life cycle; it provides a new design abstraction characterized by the temporal partitioning of such resources to promote their continuous reuse, reconfiguring them on the fly to play a different role at each instant. This new computing paradigm makes it possible to balance the design of embedded applications by partitioning their functionality in space and time (through a series of mutually exclusive processing tasks time-multiplexed on the same set of resources), thus achieving cost savings in both area and power metrics. However, the exploitation of this system versatility requires special attention to avoid performance degradation. Such technical aspects are addressed in this work, which is intended as a survey on reconfigurable hardware technology and aims at defining an open, standard and cost-effective system architecture driven by flexible coprocessors instantiated on demand on the reconfigurable resources of an FPGA. This concept fits well with the functional features demanded of many embedded applications today, and its feasibility has been proved with a state-of-the-art commercial SRAM-based FPGA platform. The achieved results highlight dynamic partial reconfiguration as a potential technology to lead the next computing wave in the industry.

19.

Configuration relocation and defragmentation for run-time reconfigurable computing

Compton K. Zhiyuan Li Cooley J. Knol S. Hauck S. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2002,10(3):209-220

Due to its potential to greatly accelerate a wide variety of applications, reconfigurable computing has become a subject of a great deal of research. By mapping the compute-intensive sections of an application to reconfigurable hardware, custom computing systems exhibit significant speedups over traditional microprocessors. However, this potential acceleration is limited by the requirement that the speedups provided must outweigh the considerable cost of reconfiguration. The ability to relocate and defragment configurations on field programmable gate arrays (FPGAs) can dramatically decrease the overall reconfiguration overhead incurred by the use of the reconfigurable hardware. We therefore present hardware solutions to provide relocation and defragmentation support with a negligible area increase over a generic partially reconfigurable FPGA, as well as software algorithms for controlling this hardware. This results in an improvement by factors of 8 to 12 over the configuration overheads displayed by traditional serially programmed FPGAs.

20.

Research on a High-Efficiency Block Cipher Processor Based on Stream Architecture

Wang Shoucheng, Yan Yingjian, Xu Jinhui 《电子学报》 (Acta Electronica Sinica) 2017,45(4):937-943

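To address the problems of existing cryptographic processors, and drawing on stream processor architecture, a high-efficiency reconfigurable block cipher stream processor architecture is proposed. The architecture adopts a hierarchical design approach. Through a data organization based on partitioned local register files and a mechanism for sharing and concatenating arithmetic units, it achieves cooperation between software pipelining and hardware pipelining, exploits instruction-level parallelism within and across blocks, and improves the utilization of functional units. The architecture was synthesized and simulated in a 65 nm CMOS process, and a large number of algorithms were mapped onto it. Experimental results show that the architecture delivers good encryption performance in both CBC and ECB modes. Compared with other cryptographic processors, it features small area and high efficiency.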