共查询到20条相似文献,搜索用时 46 毫秒
1.
Sima M. Cotofana S.D. Vassiliadis S. van Eijndhoven J.T.J. Vissers K.A. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2004,12(6):622-635
This paper presents a TriMedia processor extended with three reconfigurable designs for entropy decoding (ED), inverse quantization (IQ), and two-dimensional (2-D) inverse discrete cosine transform (IDCT), and assesses the performance gain that is provided by such extensions when performing MPEG2-compliant pel reconstruction. We first describe an extension of the TriMedia architecture, which consists of a multiple-context field programmable gate array (FPGA)-based reconfigurable functional unit (RFU), a configuration unit managing the reconfiguration of the RFU, and their associated instructions. Then, we address the computation of the ED, IQ, and 2-D IDCT tasks, and propose to provide reconfigurable hardware support for a variable-length decoder that can decode two symbols per call (VLD-2), an inverse quantizer that can dequantize four coefficients per call (IQ-4), and an 1-D IDCT (1-D IDCT). The most important aspects concerning the implementation of the FPGA-mapped VLD-2, IQ-4, and 1-D IDCT units, as well as the organization of the software routines calling these FPGA-mapped computing units are outlined. Experimental results indicate that by configuring each of the VLD-2, IQ-4, and 1-D IDCT units on a different FPGA context, and by activating the contexts as needed, the FPGA-augmented TriMedia can perform MPEG2-compliant pel reconstruction with an average speed-up of 1.4/spl times/ over the standard TriMedia. 相似文献
2.
Kiarash Bazargan Ryan Kastner Majid Sarrafzadeh 《Design Automation for Embedded Systems》2000,5(3-4):329-338
The advances in the programmable hardware has lead to new architectures where the hardware can be dynamically adapted to the application to gain better performance. There are still many challenging problems to be solved before any practical general-purpose reconfigurable system is built. One fundamental problem is the placement of the modules on the reconfigurable functional unit (RFU). In reconfigurable systems, we are interested both in online placement, where arrival time of tasks is determined at runtime and is not known a priori, and offline in which the schedule is known at compile time. In the case of offline placement, we are willing to spend more time during compile time to find a compact floorplan for the RFU modules and utilize the RFU area more efficiently. In this paper we present offline placement algorithms based on simulated annealing and greedy methods and show the superiority of their placements over the ones generated by an online algorithm. 相似文献
3.
A software radio architecture for linear multiuser detection 总被引:5,自引:0,他引:5
The integration of multimedia services over wireless channels calls for provision of variable quality of service (QoS) requirements. While radio resource management algorithms (such as power control and call admission control) can provide certain levels of variability in QoS, an alternate approach is to use reconfigurable radio architectures to provide diverse QoS guarantees. We outline a novel reconfigurable architecture for linear multiuser detection, thereby providing a wide range of bit error rate (BER) requirements amongst the constituent receivers of the reconfigurable architecture. Specifically, we focus on achieving this dynamic reconfiguration via a software radio implementation of linear multiuser receivers. Using a unified framework for achieving this reconfiguration, we partition functionality into two core technologies [field programmable gate arrays (FPGA) and digital signal processor (DSP) devices] based on processing speed requirements. We present experimental results on the performance and reconfigurability of the software radio architecture as well as the impact of fixed point arithmetic (due to hardware constraints) 相似文献
4.
5.
《Solid-State Circuits, IEEE Journal of》2009,44(7):2055-2064
6.
Like Yan Binbin Wu Yuan Wen Shaobin Zhang Tianzhou Chen 《Telecommunication Systems》2014,55(3):333-344
It’s a promising way to improve performance significantly by adding reconfigurable processing unit (RPU) to a general purpose processor. In this paper, a Reconfigurable Multi-Core (RMC) architecture combining general multi-core and reconfigurable logic is proposed. Reconfigurable logic is separated into RPUs logically, which are coupled with general purpose cores as co-processors via a full crossbar switch. An RPU Manager (RPU-M) is also designed to manage RPUs. To verify RMC, a simulation method based on the Simics and Virtex 5 FPGA is adopted, which simplifies the simulation and assures the evaluation accuracy of hardware function cores. Five workloads are selected to test RMC, including 3-DES, AES, SHA2, IDCT and JPEG_ENC. The experimental results show a 3.10 times average speedup over software implementation on the original multi-core, and the data and control communication overhead on RMC is acceptable. 相似文献
7.
This paper proposes a reconfigurable remote radio head (RRH) software and hardware architecture and presents the results of its implementation. The implemented RRH has been designed for a centralized architecture super base station that facilitates the integration of 2G, 3G, and 4G ground base stations and geosynchronous equatorial orbit (GEO) satellite gateway stations. The reconfigurable RRH is a primary aspect of the system. By modifying the programmable software configuration from the perspective of software-defined radio, the RRH can support multiple frequency allocations. The RRH is composed of radio frequency (RF) modules, intermediate frequency (IF) modules, and some of the physical layer functions. The development of a reconfigurable IF module is emphasized. A filter-bank multiplexing technique has been employed for the IF module, which reduces the use of hardware resources. To this end, we experimentally evaluated the main key performance indicators (KPIs) of the RRH. The experimental results demonstrate that the proposed design attains the goal of reducing hardware costs and improving system flexibility and stability. 相似文献
8.
《电子学报:英文版》2017,(6):1161-1167
By exploring symmetric cryptographic data level and instruction-level parallelism, the reconfigurable processor architecture for symmetric ciphers is presented based on Very-long instruction word (VLIW) structure. The application-specific instruction-set system for symmetric ciphers is proposed. As for the same arithmetic operation of symmetric ciphers, eleven kinds of reconfigurable cryptographic arithmetic units are designed by the reconfigurable technology. As to the requirement of high energy-efficient design, the loop buffer structure for instruction fetching unit is proposed to reduce the power consumption significantly with the same frequency as conventional, meanwhile, the chain processing mechanism is proposed to improve the cryptographic throughput without any area overhead. It has been fabricated with 0.18μm CMOS technology. The result shows that the processor can work up to 200MHz, and the fourteen kinds of cryptographic algorithms were mapped in the processor, the encryption throughput of AES, SNOW2.0 and SHA2 algorithm can achieve 1.19Gbps, 1.05Gbps, and 407Mbps respectively. 相似文献
9.
In this paper, we propose a methodology for accelerating application segments by partitioning them between reconfigurable
hardware blocks of different granularity. Critical parts are speeded-up on the coarse-grain reconfigurable hardware for meeting
the timing requirements of application code mapped on the reconfigurable logic. The reconfigurable processing units are embedded
in a generic hybrid system architecture which can model a large number of existing heterogeneous reconfigurable platforms.
The fine-grain reconfigurable logic is realized by an FPGA unit, while the coarse-grain reconfigurable hardware by our developed
high-performance data-path. The methodology mainly consists of three stages; the analysis, the mapping of the application
parts onto fine and coarse-grain reconfigurable hardware, and the partitioning engine. A prototype software framework realizes
the partitioning flow. In this work, the methodology is validated using five real-life applications. Analytical partitioning
experiments show that the speedup relative to the all-FPGA mapping solution ranges from 1.5 to 4.0, while the specified timing
constraints are satisfied for all the applications. 相似文献
10.
Gschwind M. Salapura V. Maurer D. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2001,9(2):241-250
Application-specific processors offer an attractive option in the design of embedded systems by providing high performance for a specific application domain. In this work, we describe the use of a reconfigurable processor core based on an RISC architecture as starting point for application-specific processor design. By using a common base instruction set, development cost can be reduced and design space exploration is focused on the application-specific aspects of performance. An important aspect of deploying any new architecture is verification which usually requires lengthy software simulation of a design model. We show how hardware emulation based on programmable logic can be integrated into the hardware/software codesign flow. While previously hardware emulation required massive investment in design effort and special purpose emulators, an emulation approach based on high-density field-programmable gate array (FPGA) devices now makes hardware emulation practical and cost effective for embedded processor designs. To reduce development cost and avoid duplication of design effort, FPGA prototypes and ASIC implementations are derived from a common source: We show how to perform targeted optimizations to fully exploit the capabilities of the target technology while maintaining a common source base 相似文献
11.
Lodi A. Toma M. Campi F. Cappelli A. Canegallo R. Guerrieri R. 《Solid-State Circuits, IEEE Journal of》2003,38(11):1876-1886
This paper describes a new architecture for embedded reconfigurable computing, based on a very-long instruction word (VLIW) processor enhanced with an additional run-time configurable datapath. The reconfigurable unit is tightly coupled with the processor, featuring an application-specific instruction-set extension. Mapping computation intensive algorithmic portions on the reconfigurable unit allows a more efficient elaboration, thus leading to an improvement in both timing performance and power consumption. A test chip has been implemented in a standard 0.18-/spl mu/m CMOS technology. The test of a signal processing algorithmic benchmark showed speedups ranging from 4.3/spl times/ to 13.5/spl times/ and energy consumption reduced up to 92%. 相似文献
12.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(8):1021-1033
13.
As technology scaling reduces pace and energy efficiency becomes a new important design constraint, superscalar processor designs are reaching their performance limits due to area and power restrictions. As a result, new microarchitectural paradigms need to be developed. This work proposes a new organization for x86 processors, based on a traditional superscalar design coupled to a reconfigurable array. The system exploits the fact that few basic blocks are responsible for most of the instructions that execute in the processor, and transforms these basic blocks into configurations for the reconfigurable array. Each configuration encodes the semantics and dependencies for all instructions in the block, so that the ones already mapped can execute bypassing the fetch, decode and dependency checks stages and improving instruction throughput. Our study on the potential of the architecture shows that performance gains of up to 2.5\(\times \) with respect to a traditional superscalar can be achieved. 相似文献
14.
Exploiting Thread‐Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline
Jaegeun Oh Seok Joong Hwang Huong Giang Nguyen Areum Kim Seon Wook Kim Chulwoo Kim Jong‐Kook Kim 《ETRI Journal》2008,30(4):576-586
In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler‐hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write‐back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single‐instruction multiple‐data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32‐bit embedded core AE32000C and synthesize it on Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2‐way MLEP and 33.7% faster with a 4‐way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler. 相似文献
15.
提出了一个基于硬件抽象机的流水线微处理器设计框架,创造性地使用了一种基于标签结构的模拟执行技术.基于这一框架,描述了一个堆栈抽象机的工作原理,实现了一个Java指令级并行处理器.利用堆栈硬件抽象机和堆栈指令折叠技术的组合解决了Java处理器中的堆栈依赖瓶颈问题.软件模拟证明了该处理器能够最大限度地挖掘出Java程序中的指令级并行,并且拥有更高的处理能力. 相似文献
16.
针对现有密码处理器存在的问题,借鉴流处理器架构,提出了高效能的可重构分组密码流处理器架构.该架构采用层次化设计思想,通过分块式本地寄存器组的数据组织方式和共享拼接使用运算单元机制,实现了软件流水和硬件流水的协同工作,能够挖掘分组内和分组间的指令级并行性并提高功能单元的利用率.在65nm CMOS工艺下对架构进行了综合仿真,并经过了大量算法映射.实验结果证明,该架构在CBC和ECB加密模式下均具有良好的加密性能.与其他密码处理器相比,该架构具有小面积、高效能的特点. 相似文献
17.
针对立方体钠卫星GNC信息处理系统高计算性能与低功率消耗相矛盾的问题,提出了一种资源限制型可重构并行信息处理方法。该方法采用紧耦合可重构并行信息处理架构,将GNC信息处理中需要多次迭代计算且不适合CPU处理的复杂软件算法,以动态部分重构硬件电路单元(DPR)的方式实现,采用基于互斥量的多核并行可重构资源调度算法,通过多核CPU并行管理与调度共享的DPR单元,完成软件算法的硬件加速与优化。实验结果表明,该方法实现了立方星GNC信息处理系统的高效实时快速处理,与传统信息处理方法相比,可节约50%左右的功耗,可应用于计算资源极为有限的星上信息处理领域,具有很好的工程应用前景。 相似文献
18.
Li Zhou Dongpei Liu Jianfeng Zhang Hengzhu Liu 《International Journal of Electronics》2013,100(6):897-910
Coarse-grained reconfigurable arrays (CGRAs) have shown potential for application in embedded systems in recent years. Numerous reconfigurable processing elements (PEs) in CGRAs provide flexibility while maintaining high performance by exploring different levels of parallelism. However, a difference remains between the CGRA and the application-specific integrated circuit (ASIC). Some application domains, such as software-defined radios (SDRs), require flexibility with performance demand increases. More effective CGRA architectures are expected to be developed. Customisation of a CGRA according to its application can improve performance and efficiency. This study proposes an application-specific CGRA architecture template composed of generic PEs (GPEs) and special PEs (SPEs). The hardware of the SPE can be customised to accelerate specific computational patterns. An automatic design methodology that includes pattern identification and application-specific function unit generation is also presented. A mapping algorithm based on ant colony optimisation is provided. Experimental results on the SDR target domain show that compared with other ordinary and application-specific reconfigurable architectures, the CGRA generated by the proposed method performs more efficiently for given applications. 相似文献
19.
Data hiding systems have emerged as a solution against the piracy problem, particularly those based on quantization have been
widely used for its simplicity and high performance. Several data hiding applications, such as broadcasting monitoring and
live performance watermarking, require a real-time multi-channel behavior. While Digital Signal Processors (DSP) have been
used for implementing these schemes achieving real-time performance for audio signal processing, custom hardware architectures
offer the possibility of fully exploiting the inherent parallelism of this type of algorithms for more demanding applications.
This paper presents an efficient hardware implementation of a Rational Dither Modulation (RDM) algorithm-based data hiding
system in the Modulated Complex Lapped Transform (MCLT) domain. In general terms, the proposed hardware architecture is conformed
by an MCLT processor, an Inverse MCLT processor, a Coordinate Rotation Digital Computer (CORDIC) and an RDM-QIM processor.
Results of implementing the proposed hardware architecture on a Field Programmable Gate Array (FPGA) are presented and discussed. 相似文献
20.
《Integration, the VLSI Journal》2007,40(2):74-93
A run-time reconfigurable multiply-accumulate (MAC) architecture is introduced. It can be easily reconfigured to trade bitwidth for array size (thus maximizing the utilization of available hardware); process signed-magnitude, unsigned or 2's complement data; make use of part of its structure or adapt its structure based on the specified throughput requirements and the anticipated computational load. The proposed architecture consists of a reconfigurable multiplier, a reconfigurable adder, an accumulation unit, and two units for data representation conversion and incoming and outgoing data stream transfer. Reconfiguration can be done dynamically by using only a few control bits and the main component modules can operate independently from each other. Therefore, they can be enabled or disabled according to the required function each time. Comparison results in terms of performance, area and power consumption prove the superiority of the proposed reconfigurable module over existing realizations in a quantitative and qualitative manner. 相似文献