1.

Exploiting Application Data-Parallelism on Dynamically Reconfigurable Architectures: Placement and Architectural Considerations

Banerjee S. Bozorgzadeh E. Dutt N. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(2):234-247

Partial dynamic reconfiguration, often called run-time reconfiguration (RTR), is a key feature in modern reconfigurable platforms. In this paper, we present parallelism granularity selection (PARLGRAN), an application mapping approach that maximizes performance of application task chains on architectures with such capability. PARLGRAN essentially selects a suitable granularity of data-parallelism for individual data-parallel tasks while considering key issues such as significant reconfiguration overhead and placement constraints. It integrates granularity selection very effectively in a joint scheduling and placement formulation, necessary due to constraints imposed by partial RTR. As a key step to validating PARLGRAN, we additionally present an exact strategy (integer linear programming formulation). We demonstrate that PARLGRAN generates high-quality schedules with: (1) a set of small test cases where we compare our results with the exact strategy; (2) a very large set of synthetic experiments with over a thousand data-points where we compare it with a simpler strategy that tries to statically maximize data-parallelism, i.e., only considers resource availability; and (3) a detailed application case study of JPEG encoding. The application case study confirms that blindly maximizing data-parallelism can result in schedules even worse than those generated by a simple (but RTR-aware) approach oblivious to data-parallelism. Last, but very important, we demonstrate that our approach is well-suited for true on-demand computing with detailed execution time estimates on a typical embedded processor. Heuristic execution time is comparable to task execution time, i.e., it is feasible to integrate PARLGRAN in a run-time scheduler for dynamically reconfigurable architectures.

2.

HW/SW codesign techniques for dynamically reconfigurable architectures

Noguera J. Badia R.M. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2002,10(4):399-415

Hardware/software (HW/SW) codesign and reconfigurable computing are commonly used methodologies for digital-systems design. However, no previous work has been carried out in order to define a HW/SW codesign methodology with dynamic scheduling for run-time reconfigurable architectures. In addition, all previous approaches to reconfigurable computing multicontext scheduling are based on static-scheduling techniques. In this paper, we present three main contributions: 1) a novel HW/SW codesign methodology with dynamic scheduling for discrete event systems using dynamically reconfigurable architectures; 2) a new dynamic approach to reconfigurable computing multicontext scheduling; and 3) a HW/SW partitioning algorithm for dynamically reconfigurable architectures. We have developed a whole codesign framework, where we have applied our methodology and algorithms to the case study of software acceleration. An exhaustive study has been carried out, and the obtained results demonstrate the benefits of our approach.

3.

Algorithmic aspects for functional partitioning and scheduling in hardware/software co-design (total citations: 4; self-citations: 0; citations by others: 4)

Wu Jigang Thambipillai Srikanthan Tao Jiao 《Design Automation for Embedded Systems》2008,12(4):345-375

Hardware/software (HW/SW) partitioning and scheduling are the crucial steps during HW/SW co-design. It has been shown that they are classical combinatorial optimization problems. Due to the possible sequential or concurrent execution of the tasks, HW/SW partitioning and scheduling have become more difficult to solve optimally. In this paper, more efficient heuristic algorithms are proposed for HW/SW partitioning and scheduling. The proposed partitioning algorithm partitions a task graph by iteratively moving, with higher priority, the task with the highest benefit-to-area ratio. The benefit-to-area ratio is updated in each iteration step to account for task concurrency. The proposed algorithm for task scheduling gives higher priority to tasks lying on the hardware-only critical path to enhance the task forecast. A large body of experimental results conclusively shows that the proposed heuristic algorithm for partitioning is superior to the latest efficient combinatorial algorithm (Tabu search) cited in this paper. Moreover, the Tabu search for partitioning has been further improved by utilizing the proposed heuristic solution as its initial solution. In addition, the proposed scheduling algorithm achieves improvements of up to 10% over the most widely used approaches without a large increase in running time. This work was presented in part at the 2006 IEEE International Conference on Field Programmable Technology (ICFPT).
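
The idea of moving tasks to hardware in order of benefit-to-area ratio can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes purely sequential execution, so each task's benefit is simply its software time minus its hardware time and never needs updating, whereas the abstract says the ratio is updated each iteration to handle concurrency. The `Task` fields, the `greedy_partition` name, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    sw_time: float  # execution time in software (hypothetical units)
    hw_time: float  # execution time in hardware
    area: float     # hardware area needed if moved to hardware (must be > 0)


def greedy_partition(tasks, area_budget):
    """Move tasks to hardware in decreasing benefit-to-area order.

    Assumption: tasks run sequentially, so the benefit of a task is
    sw_time - hw_time and does not change between iterations.
    """
    in_hw = []
    remaining = list(tasks)
    free_area = area_budget
    while True:
        # Candidates: tasks that still fit and actually save time.
        candidates = [t for t in remaining
                      if t.area <= free_area and t.sw_time > t.hw_time]
        if not candidates:
            break
        best = max(candidates, key=lambda t: (t.sw_time - t.hw_time) / t.area)
        in_hw.append(best.name)
        free_area -= best.area
        remaining.remove(best)
    # Sequential total: software tasks plus hardware tasks.
    total_time = (sum(t.sw_time for t in remaining)
                  + sum(t.hw_time for t in tasks if t.name in in_hw))
    return in_hw, total_time


tasks = [
    Task("A", sw_time=10, hw_time=2, area=4),  # ratio 2.0
    Task("B", sw_time=8, hw_time=5, area=1),   # ratio 3.0
    Task("C", sw_time=6, hw_time=1, area=5),   # ratio 1.0
]
hw_tasks, total = greedy_partition(tasks, area_budget=5)
```

With an area budget of 5, task B is chosen first and task A second; task C no longer fits and stays in software.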

4.

Mapping Data-Parallel Tasks Onto Partially Reconfigurable Hybrid Processor Architectures

Vikram K. N. Vasudevan V. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(9):1010-1023

Reconfigurable hybrid processor systems provide a flexible platform for mapping data-parallel applications, while providing considerable speedup over software implementations. However, the overhead for reconfiguration presents a significant deterrent in mapping applications onto reconfigurable hardware. Partial runtime reconfiguration is one approach to reduce the reconfiguration overhead. In this paper, we present a methodology to map data-parallel tasks onto hardware that supports partial reconfiguration. The aim is to obtain the maximum possible speedup, for a given reconfiguration time, bus speed, and computation speed. The proposed approach involves using multiple, identical but independent processing units in the reconfigurable hardware. Under nonzero reconfiguration overhead, we show that there exists an upper limit on the number of processing units that can be employed beyond which further reduction in execution time is not possible. We obtain solutions for the minimum processing time, the corresponding load distribution, and schedule for data transfer. To demonstrate the applicability of the analysis, we present the following: 1) various plots showing the variation of processing time with different parameters; 2) hardware simulations for two examples, viz., 1-D discrete wavelet transform and finite impulse response filter, targeted to Xilinx field-programmable gate arrays (FPGAs); and 3) experimental results for a hardware prototype implemented on an FPGA board.
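
A toy model shows why reconfiguration overhead can cap the useful number of processing units. The sketch below is not the paper's analysis, which also covers bus speed and load distribution. It assumes the units are configured one after another, that the work then divides perfectly among them, and that data transfer is free, giving T(n) = n * t_reconf + work / n. The function names and parameter values are hypothetical.

```python
def processing_time(work, t_reconf, n_units):
    """Toy model: configure n units in sequence, then split the work evenly."""
    return n_units * t_reconf + work / n_units


def best_unit_count(work, t_reconf, max_units):
    """Return (n, time) minimizing the toy processing time for 1..max_units."""
    best_n = min(range(1, max_units + 1),
                 key=lambda n: processing_time(work, t_reconf, n))
    return best_n, processing_time(work, t_reconf, best_n)


# 100 time units of work, 4 time units to configure each unit.
n_opt, t_opt = best_unit_count(work=100, t_reconf=4, max_units=16)
```

In this model, adding a sixth unit makes the total time longer than with five, because the extra configuration time outweighs the reduced share of work per unit.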

5.

A Fault Tolerant Approach for FPGA Embedded Processors Based on Runtime Partial Reconfiguration

Alexandros Vavousis Andreas Apostolakis Mihalis Psarakis 《Journal of Electronic Testing》2013,29(6):805-823

The ever increasing adoption of field programmable devices in various application domains for building complex embedded systems based on FPGA processors along with the reliability issues having emerged for FPGA devices built with the latest nanometer technologies, have raised the need for new fault tolerant techniques in order to improve dependability and extend system lifetime. In addition, the runtime partial reconfiguration technology highly mature in the modern FPGA families along with the availability of unused programmable resources in most FPGA designs provide new and interesting opportunities to build advanced fault tolerance mechanisms. In this paper, we exploit the dynamic reconfiguration potential of today’s FPGA architectures and the advances in the related design support tools and we propose a fault-tolerant approach for FPGA embedded processors based on runtime partial reconfiguration. According to the proposed methodology, the processor core is partitioned into reconfigurable modules and each module is duplicated to implement a concurrent error detection mechanism. Precompiled configurations containing spare resources are generated for each duplicated module and are used to repair at runtime the defective modules. Also, a fault tolerance scheme for the proxy logic of the reconfigurable modules, which cannot move in the alternative configurations along with the rest logic, is proposed. Moreover, a compression method for the alternative partial bitstreams, which significantly reduces the high storage space requirements of the proposed approach, is presented. Two different hardware decompression schemes have been implemented in a Virtex-5 device and compared in terms of area overhead and decompression latency. Furthermore, a thorough examination has been performed, regarding how the percentage of the spare resources and their allocation in the reconfigurable regions affect the compression efficiency and the processor performance. 
Finally, the proposed approach has been demonstrated in three different components – ALU, multiplier-accumulator, and instruction-fetch unit – of an open-source embedded processor.

6.

An efficient Resource Management to optimize the placement of hardware task on FPGA in the RVC framework

Manel Hentati Samya Elaoud Yassine Aoudni Jean-François Nezan Mohamed Abid 《Design Automation for Embedded Systems》2012,16(4):363-380

Dynamic partial reconfiguration (DPR) functionality allows implementing multi-task applications by exchanging tasks in a design at run-time. It is a promising solution to enhance system performance. However, the effective use of DPR is often hampered by the complexity added to the system design process. In this paper, we investigate the implementation of multi-task applications using DPR in the RVC framework. We present a resource management method which includes three steps: partitioning the application into HW/SW tasks, dividing the FPGA into static and dynamic regions, and placing the tasks on the FPGA. The proposed method is based on a linear programming strategy to find the optimal placement of hardware tasks. We take into account the heterogeneity of the device. The goal is to minimize resource utilization and fragmentation. We use RVC technology, which is based on a specific language for writing dataflow models called RVC-CAL. This language describes the application as a set of blocks called actors connected through a network. To test the efficiency of our approach, we use the MPEG-4 SP decoder described in RVC-CAL. We measure the quality of placement in terms of task rejection, execution time and resource wastage. Application of different data combinations and a comparison with the state-of-the-art method show the high performance of the proposed approach.

7.

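Hardware/software co-design flow for an embedded coarse-grained reconfigurable processor (total citations: 4; self-citations: 2; citations by others: 2)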

于苏东刘雷波尹首一魏少军《电子学报》2009,37(5):1136-1140

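The reconfigurable processor architecture for multimedia applications consists of a main processor and a dynamically configured reconfigurable cell array (RCA). The co-design flow is built on loop pipelining and pipelined configuration techniques. It uses a heuristic algorithm to perform hardware/software partitioning of the larger critical loops in an application, and a table-based scheduling algorithm to map tasks onto the RCA. In FPGA verification, the average execution speed of the core algorithms in the H.264 benchmark is improved by 3.34 times compared with PipeRench, MorphoSys, and the TI DSP TMS320C64X.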

8.

Memory Controller for Vector Processor

Tassadaq Hussain Osman S. Ünsal Adrian Cristal Eduard Ayguadé 《Journal of Signal Processing Systems》2018,90(11):1533-1549

9.

Design of a reconfigurable processor based on the XC6200 (total citations: 1; self-citations: 0; citations by others: 1)

常青孙广富卢焕章《信号处理》2001,17(5):454-458

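This paper discusses the design and implementation of a reconfigurable processor for image information processing applications. The processor adopts a hybrid DSP+FPGA computing architecture, which remains programmable after fabrication and also delivers relatively high computing performance, so it can meet the needs of a variety of real-time image information processing applications. The paper also discusses in some depth the implementation of dynamic reconfiguration and the design of reconfigurable chips, and uses a design example to support the authors' design ideas.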

10.

A HW/SW Partitioner for Multi-Mode Multi-Task Embedded Applications

Young-Jun Kim Taewhan Kim 《The Journal of VLSI Signal Processing》2006,44(3):269-283

An embedded system is called a multi-mode embedded system if it performs multiple applications by dynamically reconfiguring the system functionality. Further, the embedded system is called a multi-mode multi-task embedded system if it additionally supports multiple tasks to be executed in a mode. In this paper, we address an important HW/SW partitioning problem, that is, HW/SW partitioning of multi-mode multi-task embedded applications with timing constraints of tasks. The objective of the optimization problem is to find a minimal total system cost of allocation/mapping of processing resources to functional modules in tasks together with a schedule that satisfies the timing constraints. The key to solving the problem is closely related to the degree to which the potential parallelism among the executions of modules is utilized. However, due to an inherently excessively large search space of the parallelism, and to make the task of schedulability analysis easy, the prior HW/SW partitioning methods have not been able to fully exploit the potential parallel execution of modules. To overcome the limitation, we propose a set of comprehensive HW/SW partitioning techniques which solve the three subproblems of the partitioning problem simultaneously: (1) allocation of processing resources, (2) mapping the processing resources to the modules in tasks, and (3) determining an execution schedule of modules. Specifically, based on a precise measurement of the parallel execution and schedulability of modules, we develop a stepwise refinement partitioning technique for single-mode multi-task applications, which aims to solve the subproblems 1, 2 and 3 effectively in an integrated fashion. The proposed technique is then extended to solve the HW/SW partitioning problem of multi-mode multi-task applications (i.e., to find a globally optimized allocation/mapping of processing resources with feasible execution schedule of modules).
From experiments with a set of real-life applications, it is shown that the proposed techniques are able to reduce the implementation cost by 19.0% and 17.0% for single- and multi-mode multi-task applications, respectively, over that of the conventional method.

11.

Virtual memory window for application-specific reconfigurable coprocessors

Vuletic M. Pozzi L. Ienne P. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(8):910-915

The complexity of hardware/software (HW/SW) interfacing and the lack of portability across different platforms restrain the widespread use of reconfigurable accelerators and limit designer productivity. Furthermore, communication between the SW and HW parts of codesigned applications is typically exposed to SW programmers and HW designers. In this work, we introduce a virtualization layer that allows reconfigurable application-specific coprocessors to access the user-space virtual memory and share the memory address space with user applications. The layer, consisting of an operating system (OS) extension and a HW component, shifts the burden of moving data between processor and coprocessor from the programmer to the OS, lowers the complexity of interfacing, and hides physical details of the system. Not only does the virtualization layer enhance programming abstraction and portability, but it also performs runtime optimizations: by predicting future memory accesses and speculatively prefetching data, the virtualization layer improves coprocessor execution, so applications achieve better performance without any user intervention. We use two different reconfigurable systems-on-chip (SoCs) running Linux and codesigned applications to prove the viability of our concept. The applications run faster than their SW versions, and the overhead due to the virtualization is limited. Dynamic prefetching in the virtualization layer further reduces the abstraction overhead.

12.

A Cooperative Management Scheme for Power Efficient Implementations of Real-Time Operating Systems on Soft Processors

Jingzhao Ou Prasanna V.K. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(1):45-56

A cooperative management scheme for power efficient implementations of real-time operating systems on field-programmable gate-array (FPGA)-based soft processors is presented. Dedicated power management hardware peripherals are tightly coupled to a soft processor by utilizing its configurability. These hardware peripherals manage tasks and interrupts in cooperation with the soft processor, while retaining the real-time responsiveness of the operating system. More specifically, the hardware peripherals perform the following power management functionalities: (1) control the on-chip clock distribution network for driving the soft processor, its hardware peripherals, and the bus interfaces between them; (2) perform task and interrupt management responsibilities of the operating system when the soft processor is turned off; and (3) selectively wake up the soft processor and its hardware components, and put them into proper activation states based on the hardware resource requirements of the tasks under execution. The implementations of two popular real-time operating systems on a state-of-the-art FPGA device are presented. Measurements on an experimental board show that the proposed power management scheme can lead to significant power savings.

13.

The Chimaera reconfigurable functional unit

Hauck S. Fry T.W. Hosler M.M. Kao J.P. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2004,12(2):206-217

By strictly separating reconfigurable logic from the host processor, current custom computing systems suffer from a significant communication bottleneck. In this paper, we describe Chimaera, a system that overcomes the communication bottleneck by integrating reconfigurable logic into the host processor itself. With direct access to the host processor's register file, the system enables the creation of multi-operand instructions and a speculative execution model key to high-performance, general-purpose reconfigurable computing. Chimaera also supports multi-output functions and utilizes partial run-time reconfiguration to reduce reconfiguration time. Combined, the system can provide speedups of a factor of two or more for general-purpose computing, and speedups of 160 or more are possible for hand-mapped applications.

14.

G729 Voice Decoder Design

Fatma Sayadi Emmanuel Casseau Mohamed Atri Mehrez Marzougui Rached Tourki Eric Martin 《Journal of Signal Processing Systems》2006,42(2):173-184

Embedded digital signal processing (DSP) systems are usually associated with real-time constraints and/or high data rates such that fully software implementations are often not satisfactory. In that case, mixed hardware/software implementations are to be investigated. This paper presents the design of a HW/SW G.729 voice decoder dedicated to embedded systems. The decoder has been built around, on the one hand, a reconfigurable digital circuit (FPGA) to implement the so-called IP hardware part (the autocorrelation computation) using a linear systolic array and, on the other hand, a digital signal processor (DSP) for the remainder of the algorithm. Apart from the fact that such an implementation is typically driven by the use of reusable components (IP), it is of great interest for new G729-based applications such as Voice over IP (VoIP). It results in an overall reduction of the execution time per frame. Another interesting point is the design of a parameterizable autocorrelation block which can be useful for a wide range of applications such as GSM 13 Kbit/s, APC 9.6 Kbit/s and G723 6.3 Kbit/s and 5.3 Kbit/s. In the G729 context and using a V50 Virtex FPGA, this function executes 10 times faster than a TMS320C6201 DSP implementation.

15.

Design-Space Exploration for Block-Processing Based Temporal Partitioning of Run-Time Reconfigurable Systems

Meenakshi Kaul Ranga Vemuri 《The Journal of VLSI Signal Processing》2000,24(2-3):181-209

The reconfiguration capability of modern FPGA devices can be utilized to execute an application by partitioning it into multiple segments such that each segment is executed one after the other on the device. This division of an application into multiple reconfigurable segments is called temporal partitioning. We present an automated temporal partitioning technique for acyclic behavior-level task graphs. To be effective, any behavior-level partitioning method should ensure that each temporal partition meets the underlying resource constraints. For this, the implementation cost of each task on the hardware should be known. Since multiple implementations of a task that differ in area and delay are possible, we perform design-space exploration to choose the best implementation of a task from among the available implementations. To overcome the high reconfiguration overhead of current FPGA devices, we propose integration of the temporal partitioning and design space exploration methodology with block-processing. Block-processing is used to process multiple blocks of data on each temporal partition so as to amortize the reconfiguration time. We focus on applications that can be represented as task graphs that have to be executed many times over a large set of input data. We have integrated block-processing in the temporal partitioning framework so that it also influences the design point selection for each task. However, this does not exclude usage of our system for designs for which block-processing is not possible. For both block-processing and non-block-processing designs, our algorithm selects the best possible design point to minimize the execution time of the design. We present an ILP-based methodology for the integrated temporal partitioning, design space exploration and block-processing technique that is solved to optimality for small design problems and in an iterative constraint satisfaction approach for large design problems.
We demonstrate with extensive experimental results for the Discrete Cosine Transform (DCT) and random graphs the validity of our approach.
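
How block-processing amortizes reconfiguration time can be seen from a small back-of-the-envelope calculation. The sketch below is only an illustration under stated assumptions, not the paper's ILP formulation. It assumes every temporal partition is reconfigured once per pass, that `blocks_per_pass` data blocks flow through all partitions in each pass, and that memory for intermediate data is unlimited. The function name and all numbers are hypothetical.

```python
import math


def total_execution_time(num_partitions, t_reconf, t_compute_per_block,
                         num_blocks, blocks_per_pass):
    """Total time when each pass reconfigures every partition once.

    t_compute_per_block is the compute time of one block summed over
    all temporal partitions.
    """
    passes = math.ceil(num_blocks / blocks_per_pass)
    reconfiguration = passes * num_partitions * t_reconf
    computation = num_blocks * t_compute_per_block
    return reconfiguration + computation


# 3 temporal partitions, 10 time units per reconfiguration,
# 2 time units of computation per block, 100 blocks in total.
no_blocking = total_execution_time(3, 10, 2, 100, blocks_per_pass=1)
some_blocking = total_execution_time(3, 10, 2, 100, blocks_per_pass=25)
full_blocking = total_execution_time(3, 10, 2, 100, blocks_per_pass=100)
```

Here the computation time stays fixed at 200 time units, while the reconfiguration share drops from 3000 to 120 to 30 as more blocks are processed per pass.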

16.

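Design and implementation of an FFT processor based on dynamic reconfiguration (total citations: 3; self-citations: 1; citations by others: 2)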

潘伟刘欢李广军《微电子学》2009,39(1)

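A novel reconfigurable FFT processor based on dynamic partial reconfiguration (DPR) is proposed. Compared with traditional FFT designs, this design method greatly improves reconfiguration time, and the processor can dynamically add or remove reconfigurable units. A novel FFT control algorithm keeps the area of the reconfigurable part very small. The processor architecture was synthesized and post-simulated on a Xilinx Virtex-II Pro series FPGA. Compared with the Xilinx IP core, its computational efficiency is markedly higher, and it also provides dynamic reconfigurability, which the IP core lacks.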

17.

A dynamic partial reconfigurable system with combined task allocation method to improve the reliability of FPGA

《Microelectronics Reliability》2018

Currently most FPGAs use SRAM-based technology, which is susceptible to faults caused by external electromagnetic radiation or by long-term internal overload operation. The dynamic partial reconfigurable (DPR) system, as an emerging technology, provides a promising way to solve this problem by reallocating the tasks in damaged resource areas to non-faulty regions at runtime. Based on this idea, an infrastructure for coordinately executing specialized hardware tasks on a reconfigurable FPGA is presented to achieve the flexibility needed to tolerate faults occurring at runtime. Moreover, a method named MER-3D-Contact, which combines the maximum empty rectangles (MER) technique with the adjacency heuristic, is proposed to allocate tasks in the dynamic partial reconfiguration system for higher resource utilization, higher task acceptance ratio and lower fragmentation ratio. Finally, experiments are carried out to evaluate the performance of the proposed system. The results show that it improves the task acceptance ratio by up to 36% without damaged areas and by up to 58% with damaged resources. Thus, the proposed system is expected to find wide application in the field of more reliable FPGAs.
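
Placing rectangular tasks while avoiding damaged regions can be illustrated with a very simple baseline. The sketch below is a plain first-fit scan over a grid of cells. It is not MER-3D-Contact and does not maintain maximal empty rectangles or use the adjacency heuristic. The grid representation (`True` means free, `False` means occupied or damaged) and the function names are assumptions made only for this example.

```python
def first_fit(free, width, height):
    """Return (row, col) of the first all-free width x height region, or None."""
    rows, cols = len(free), len(free[0])
    for r in range(rows - height + 1):
        for c in range(cols - width + 1):
            if all(free[r + dr][c + dc]
                   for dr in range(height) for dc in range(width)):
                return (r, c)
    return None


def place(free, width, height):
    """Place a task if a region exists, marking its cells as occupied."""
    pos = first_fit(free, width, height)
    if pos is not None:
        r, c = pos
        for dr in range(height):
            for dc in range(width):
                free[r + dr][c + dc] = False
    return pos


# 3 x 4 device with one damaged cell at row 0, column 1.
grid = [[True] * 4 for _ in range(3)]
grid[0][1] = False

first = place(grid, 2, 2)   # must avoid the damaged cell
second = place(grid, 2, 2)
third = place(grid, 2, 2)   # rejected: no 2 x 2 free region is left
```

The third task is rejected even though four free cells remain, which is the kind of fragmentation that smarter placement heuristics aim to reduce.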

18.

An Iterative Algorithm for Hardware-Software Partitioning, Hardware Design Space Exploration and Scheduling

Karam S. Chatha Ranga Vemuri 《Design Automation for Embedded Systems》2000,5(3-4):281-293

The paper proposes a novel heuristic technique for integrated hardware-software partitioning, hardware design space exploration and scheduling. The technique maps an application specified as a task graph on a heterogeneous architecture with an objective to minimize the latency of the task graph subject to the area constraint on the hardware coprocessor. The technique uses an iterative approach where the partitioner decides the processor mapping and HW design points of some tasks. The scheduler then simultaneously decides the processor mapping, HW design point and schedule time of the remaining tasks. There exists a tight coupling between the two design stages allowing them to produce superior quality designs in fewer iterations. The technique accounts for the time overheads due to inter-processor/intra-processor communication and shared memory access conflicts. It can therefore be used for both communication intensive and computation intensive applications. The technique also considers dynamic reconfiguration capability of the hardware coprocessor. The technique performs tradeoff analysis and maps hardware tasks to mutually exclusive temporal segments if this results in lower latency. The effectiveness of the technique is demonstrated by a case study of the JPEG image compression algorithm, comparison with an optimal ILP based approach and experimentation with synthetic graphs.

19.

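Task scheduling technique for coarse-grained dynamically reconfigurable systems based on pre-configuration and configuration reuse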

戴紫彬曲彤洲《电子与信息学报》2019,41(6):1458-1465

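Excessive configuration time is a major factor limiting the overall performance of reconfigurable systems, and a sound task scheduling technique can effectively reduce system configuration time. For coarse-grained dynamically reconfigurable systems (CGDRS) and stream applications with data dependencies, this paper proposes a three-dimensional task scheduling model. First, based on this model, a task scheduling algorithm based on a pre-configuration strategy (CPSA) is designed. Then, according to the configuration reusability between tasks, interval configuration reuse and continuous configuration reuse strategies are proposed, and the CPSA algorithm is improved accordingly. Experimental results show that the CPSA algorithm effectively resolves scheduling deadlock, reduces the execution time of stream applications, and raises the scheduling success rate. Compared with other scheduling algorithms, the average reduction in stream application execution time is 6.13% to 19.53%.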

20.

Foreword

Ping ZHANG Shilin LI Yiming LIU Xiaoqi QIN Xiaodong XU 《通信学报》2005,26(10):1-14

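The requirements for post-3G systems are seamless global coverage and high-speed multimedia communication including voice, text, images, and video. To this end, under limited spectrum resources, the transmission radius of wireless signals must be shortened so that spectrum is reused as much as possible and the channel capacity per unit of space is increased. Wireless transmission networks that adopt various advanced wireless transmission technologies provide high-rate, high-quality wireless mobile communication services over small and medium ranges. Consequently, the demand for and use of WLAN and WPAN keep growing, and short-range, high-spatial-capacity technologies such as ultra-wideband (UWB) are emerging and have become a hot topic in wireless communications. The core of UWB is impulse radio technology, which uses pulse waveforms of very short duration (nanoseconds or sub-nanoseconds) to…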