首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 62 毫秒
1.
Recently coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their efficiency and flexibility. While many CGRAs have demonstrated impressive performance improvements, the effectiveness of CGRA platforms ultimately hinges on the compiler. Existing CGRA compilers do not model the details of the CGRA, and thus they are i) unable to map applications, even though a mapping exists, and ii) using too many processing elements (PEs) to map an application. In this paper, we model several CGRA details, e.g., irregular CGRA topologies, shared resources and routing PEs in our compiler and develop a graph drawing based approach, split-push kernel mapping (SPKM), for mapping applications onto CGRAs. On randomly generated graphs our technique can map on average 4.5times more applications than the previous approach, while generating mappings which have better qualities in terms of utilized CGRA resources. Utilizing fewer resources is directly translated into increased opportunities for novel power and performance optimization techniques. Our technique shows less power consumption in 71 cases and shorter execution cycles in 66 cases out of 100 synthetic applications, with minimum mapping time overhead. We observe similar results on a suite of benchmarks collected from Livermore loops, Mediabench, Multimedia, Wavelet and DSPStone benchmarks. SPKM is not a customized algorithm only for a specific CGRA template, and it is demonstrated by exploring various PE interconnection topologies and shared resource configurations with SPKM.  相似文献   

2.
Coarse-grained reconfigurable arrays (CGRAs) have shown potential for application in embedded systems in recent years. Numerous reconfigurable processing elements (PEs) in CGRAs provide flexibility while maintaining high performance by exploring different levels of parallelism. However, a difference remains between the CGRA and the application-specific integrated circuit (ASIC). Some application domains, such as software-defined radios (SDRs), require flexibility with performance demand increases. More effective CGRA architectures are expected to be developed. Customisation of a CGRA according to its application can improve performance and efficiency. This study proposes an application-specific CGRA architecture template composed of generic PEs (GPEs) and special PEs (SPEs). The hardware of the SPE can be customised to accelerate specific computational patterns. An automatic design methodology that includes pattern identification and application-specific function unit generation is also presented. A mapping algorithm based on ant colony optimisation is provided. Experimental results on the SDR target domain show that compared with other ordinary and application-specific reconfigurable architectures, the CGRA generated by the proposed method performs more efficiently for given applications.  相似文献   

3.
Most of the coarse-grained reconfigurable architectures (CGRAs) are composed of reconfigurable ALU arrays and configuration cache (or context memory) to achieve high performance and flexibility. Specially, configuration cache is the main component in CGRA that provides distinct feature for dynamic reconfiguration in every cycle. However, frequent memory-read operations for dynamic reconfiguration cause much power consumption. Thus, reducing power in configuration cache has become critical for CGRA to be more competitive and reliable for its use in embedded systems. In this paper, we propose dynamically compressible context architecture for power saving in configuration cache. This power-efficient design of context architecture works without degrading the performance and flexibility of CGRA. Experimental results show that the proposed approach saves up to 39.72% power in configuration cache with negligible area overhead (2.16%).   相似文献   

4.
Coarse-grained reconfigurable architectures (CGRAs) require many processing elements (PEs) and a configuration memory unit (configuration cache) for reconfiguration of its PE array. Although this structure is meant for high performance and flexibility, it consumes significant power. Specially, power consumption by configuration cache is explicit overhead compared to other types of intellectual property (IP) cores. Reducing power is very crucial for CGRA to be more competitive and reliable processing core in embedded systems. In this paper, we propose a reusable context pipelining (RCP) architecture to reduce power-overhead caused by reconfiguration. It shows that the power reduction can be achieved by using the characteristics of loop pipelining, which is a multiple instruction stream, multiple data stream (MIMD)-style execution model. RCP efficiently reduces power consumption in configuration cache without performance degradation. Experimental results show that the proposed approach saves much power even with reduced configuration cache size. Power reduction ratio in the configuration cache and the entire architecture are up to 86.33% and 37.19%, respectively, compared to the base architecture.  相似文献   

5.
面向DSP应用的可重构计算   总被引:2,自引:2,他引:0  
DSP应用的特点是计算密集并适合并行处理,传统的可编程处理器与ASIC在性能和灵活性上各有优劣.因此出现了一种新的计算模式-可重构计算.由于它能将效率和灵活性很好地结合在一起,故正得到广泛的关注和研究.本文在介绍可重构计算的概念和分类的基础上,着重讨论了一些主流的可重构计算系统,分析了各类系统应用于DSP的特点,对可重构计算在计算模型,编译器,映射技术以及开发环境等方面的现状和趋势进行了探讨,并给出了自己的思考.  相似文献   

6.
As DSP (Digital Signal Processing) applications become more complex, there is also a growing need for new architectures supporting efficient high-level language compilers. We try to synthesize a new DSP processor architecture by adding several DSP processor specific features to a RISC core that has a compiler friendly structure, such as many general-purpose registers and orthogonal instructions. The synthesized digital signal processor supports single-cycle MAC (Multiply-and-ACcumulate), direct memory access, automatic address generation, and hardware looping capabilities in addition to ordinary RISC instructions. The compiler for the new architecture is quickly implemented by developing a code-converter that modifies the assembly codes that are generated by the RISC compiler. The performance effects of adding each of these as well as all the combined features are evaluated using seven DSP-kernel benchmarks, a QCELP vocoder, and an MPEG video decoder. The effects of CPU clock frequency change due to the addition of these features are also considered. Finally, we also compare the performances with several existing DSP processors, such as TMS320C3x, TMS320C54x, and TMS320C5x.  相似文献   

7.
Reconfigurable hardware contains an array of programmable cells and interconnection structures. Field-programmable gate arrays use fine-grain cells that implement simple logic functions. Some proposed reconfigurable architectures for digital signal processing (DSP) use coarse-grain cells that perform 16-b or 32-b operations. A third alternative is to use medium-grain cells with a word length of 4 or 8 b. This approach combines high flexibility with inherent support for binary arithmetic such as multiplication. This paper presents two medium-grain cells for reconfigurable DSP hardware. Both cells contain an array of small lookup tables, or ldquoelementsrdquo, that can assume two structures. In memory mode, the elements act as a random-access memory. In mathematics mode, the elements implement 4-b arithmetic operations. The first design uses a matrix of 4 times 4 elements and operates in bit-parallel fashion. The second design uses an array of five elements and computes arithmetic functions in bit-serial fashion. Layout simulations in 180-nm CMOS indicate that the parallel cell operates at 267 MHz, whereas the serial cell runs at 167 MHz. However, the parallel design requires over twice the area. The proposed medium-grain cells provide the performance and flexibility needed to implement DSP. To evaluate the designs, the paper estimates the execution time and resource utilization for common benchmarks such as the fast Fourier transform. The architecture model used in this analysis combines the cells with a pipelined hierarchical interconnection network. The end results show great promise compared to other devices, including field-programmable gate arrays.  相似文献   

8.
This paper presents design, development and evaluation of an eXtra-large Scale, Homogeneous and a Heterogeneous Accelerator-Rich Platform (HARP2) for massively parallel signal processing algorithms. HARP is an integrated platform of multiple Coarse-Grained Reconfigurable Arrays (CGRAs) over a Network-on-Chip (NoC) where each CGRA is scaled and tailored for a specific application. The architecture of the NoC consists of nine nodes in a topology of 3-rows × 3-columns and acts as backbone of communication between different CGRAs. In this experimental work, the HARP template is used to instantiate a homogeneous (HARP-hom) and a heterogeneous (HARP-het) platform. The HARP-het is generated for a proof-of-concept test to verify the design and functionality of HARP. It also provides insight to many features of the design and evaluation in terms of different performance metrics. The other version (HARP-hom) is instantiated for a relatively realistic design problem, i.e., satisfying the execution-time constraints imposed on Fast Fourier Transform processing in IEEE-802.11n demodulators. Both of the versions of HARP are treated for comparative analysis using different performance metrics against some of the existing state-of-the-art platforms. The HARP versions are designed to illustrate large-scale homogeneous/heterogeneous multicore architectures while presenting the advantages of maximizing the number of reconfigurable processing resources on a single chip.  相似文献   

9.
Domain specific coarse-grained reconfigurable architectures (CGRAs) have great promise for energy-efficient flexible designs for a suite of applications. Designing such a reconfigurable device for an application domain is very challenging because the needs of different applications must be carefully balanced to achieve the targeted design goals. It requires the evaluation of many potential architectural options to select an optimal solution. Exploring the design space manually would be very time consuming and may not even be feasible for very large designs. Even mapping one algorithm onto a customized architecture can require time ranging from minutes to hours. Running a full power simulation on a complete suite of benchmarks for various architectural options require several days. Finding the optimal point in a design space could require a very long time. We have designed a framework/tool that made such design space exploration (DSE) feasible. The resulting framework allows testing a family of algorithms and architectural options in minutes rather than days and can allow rapid selection of architectural choices. In this paper, we describe our DSE framework for domain specific reconfigurable computing where the needs of the application domain drive the construction of the device architecture. The framework has been developed to automate design space case studies, allowing application developers to explore architectural tradeoffs efficiently and reach solutions quickly. We selected some of the core signal processing benchmarks from the MediaBench benchmark suite and some edge-detection benchmarks from the image processing domain for our case studies. We describe two search algorithms: a stepped search algorithm motivated by our manual design studies and a more traditional gradient based optimization. Approximate energy models are developed in each case to guide the search toward a minimal energy solution. We validate our search results by comparing the architectural solutions selected by our tool to an architecture optimized manually and by performing sensitivity tests to evaluate the ability of our algorithms to find good quality minima in the design space. All selected fabric architectures were synthesized on 130 nm cell-based ASIC fabrication process from IBM. These architectures consume almost same amount of energy on average, but the gradient based approach is more general and promises to extend well to new problem domains. We expect these or similar heuristics and the overall design flow of the system to be useful for a wide range of architectures, including mesh based and other commonly used architectures for CGRAs.  相似文献   

10.
The new encoding tools of high efficiency video coding (HEVC) make the interpolation operation more complex in motion compensation (MC) for better video compression, but impose higher requirements on the computational efficiency and control logic of the hardware architecture. The reconfigurable array processor can take into consideration both the computational efficiency and flexible switching of algorithms very well. Through mining the data dependency and parallelism among interpolation operation, this paper presents a parallelization method based on the dynamic reconfigurable array processor proposed by the project team. The number of pixels loaded from the external memory is reduced significantly, by multiplexing the common data in the previous reference block and the current reference block. Flexible switching of variable block operation is realized by using dynamic reconfiguration mechanism. A 16 x 16 processor element (PE)'s array is used to dynamically process a 4 x 4 - 64 x 64 block size. The experimental results show that, the reference block update speed is increased by 39.9%. In the case of an array size of 16 PEs, the number of pixels processed in parallel reaches 16.  相似文献   

11.
This paper presents an integrated self-aware computing model mitigating the power dissipation of a heterogeneous reconfigurable multicore architecture by dynamically scaling the operating frequency of each core. The power mitigation is achieved by equalizing the performance of all the cores for an uninterrupted exchange of data. The multicore platform consists of heterogeneous Coarse-Grained Reconfigurable Arrays (CGRAs) of application-specific sizes and a Reduced Instruction-Set Computing (RISC) core. The CGRAs and the RISC core are integrated with each other over a Network-on-Chip (NoC) of six nodes arranged in a topology of two rows and three columns. The RISC core constantly monitors and controls the performance of each CGRA accelerator by adjusting the operating frequencies unless the performance of all the CGRAs is optimally balanced over the platform. The CGRA cores on the platform are processing some of the most computationally-intensive signal processing algorithms while the RISC core establishes packet based synchronization between the cores for computation and communication. All the cores can access each other’s computational and memory resources while processing the kernels simultaneously and independently of each other. Besides general-purpose processing and overall platform supervision, the RISC processor manages performance equalization among all the cores which mitigates the overall dynamic power dissipation by 20.7 % for a proof-of-concept test.  相似文献   

12.
This paper introduces a design technique for coarse-grained reconfigurable architectures targeting digital signal processing (DSP) applications. The design procedure is analyzed in detail and an area-time-power efficient reconfigurable kernel architecture is presented. The proposed technique inlines flexibility into custom carry-save (CS) arithmetic datapaths exploiting a stable and canonical interconnection scheme. The canonical interconnection is revealed by a transformation, called uniformity transformation, imposed on the basic architectures of CS-multipliers and CS-chain-adders/subtractors. Experimental results including quantitative and qualitative comparisons with existing reconfigurable arithmetic cores and exploration results of the proposed reconfigurable architecture are provided.  相似文献   

13.
A software radio architecture for linear multiuser detection   总被引:5,自引:0,他引:5  
The integration of multimedia services over wireless channels calls for provision of variable quality of service (QoS) requirements. While radio resource management algorithms (such as power control and call admission control) can provide certain levels of variability in QoS, an alternate approach is to use reconfigurable radio architectures to provide diverse QoS guarantees. We outline a novel reconfigurable architecture for linear multiuser detection, thereby providing a wide range of bit error rate (BER) requirements amongst the constituent receivers of the reconfigurable architecture. Specifically, we focus on achieving this dynamic reconfiguration via a software radio implementation of linear multiuser receivers. Using a unified framework for achieving this reconfiguration, we partition functionality into two core technologies [field programmable gate arrays (FPGA) and digital signal processor (DSP) devices] based on processing speed requirements. We present experimental results on the performance and reconfigurability of the software radio architecture as well as the impact of fixed point arithmetic (due to hardware constraints)  相似文献   

14.
The compiler is generally regarded as the most important software component that supports a processor design to achieve success. This paper describes our application of the open research compiler infrastructure to a novel VLIW DSP (known as the PAC DSP core) and the specific design of code generation for its register file architecture. The PAC DSP utilizes port-restricted, distributed, and partitioned register file structures in addition to a heterogeneous clustered data-path architecture to attain low power consumption and a smaller die. As part of an effort to overcome the new challenges of code generation for the PAC DSP, we have developed a new register allocation scheme and other retargeting optimization phases that allow the effective generation of high quality code. Our preliminary experimental results indicate that our developed compiler can efficiently utilize the features of the specific register file architectures in the PAC DSP. Our experiences in designing compiler support for the PAC VLIW DSP with irregular resource constraints may also be of interest to those involved in developing compilers for similar architectures.
Jenq-Kuen Lee (Corresponding author)Email:
  相似文献   

15.
可重构处理器阵列的系统级建模研究   总被引:1,自引:1,他引:0  
由于粗粒度可重构体系结构设计空间复杂,设计满足应用需求的CGRA需要建立系统级仿真模型进行性能评估.文中提出一种可重构处理器阵列的系统级模型,使用SystemC事务级语言实现建模.模型采用多层互连网络结构实现任意2个处理器间的通信,并且处理器的资源能够通过参数快速地进行配置.仿真实验表明,模型适用于应用算法到粗粒度可重构体系结构映射的模拟仿真.  相似文献   

16.
以4片"魂芯一号"国产高性能DSP处理器和FPGA为核心,设计了一种新型通用雷达信号处理机。处理机采用高速链路口实现4片DSP处理器之间的点对点通信,采用Altera公司的高端FPGA芯片作数据预处理和接口协议转换。该处理机具有很高的运算性能和数据交换能力,并具有较好的通用性、可重构性和扩展性。通过运算性能测试,并在信号处理机上实现某数字阵列雷达信号处理,验证了"魂芯一号"的性能和应用价值。  相似文献   

17.
王向前  洪一  王昊  郑启龙 《电子学报》2015,43(8):1656-1661
魂芯DSP是一款字寻址的、分簇结构的、支持SIMD的VLIW处理器.介绍了基于开源编译器基础设施open64开发魂芯编译器的关键技术,包括地址寄存器的优化处理、综合多种启发因子的指令分簇、分簇架构下的寄存器分配和指令调度.介绍了魂芯DSP编译器的体系结构优化关键技术,包括基于依赖分析的向量化、高效指令的使用和零开销循环的识别.并总结开发经验,给出了基于开源编译基础设施开发编译器的若干注意点.  相似文献   

18.
DReAC:一种新型动态可重构协处理器   总被引:1,自引:1,他引:0       下载免费PDF全文
本文提出了一种应用于数据并行和高密度计算任务的新型动态可重构协处理器——DReAC.DReAC可以独立地以并行或流水工作模式重构协处理器内部数据路径,完成主处理器分配的任务.DReAC由全局控制器、计算阵列和阵列数据缓冲区三部分组成.文中简要介绍了DReAC系统模型,并使用该模型模拟了部份典型算法在DReAC中的实现.仿真结果表明,在典型的多媒体和信号处理应用中,DReAC能够达到通用处理器的10倍以上的速度,甚至在某些应用中远优于其他可重构处理器的性能.  相似文献   

19.
Architecture for Dynamically Reconfigurable Embedded Systems (ADRES) is a templatized coarse-grained reconfigurable processor architecture. It targets at embedded applications which demand high-performance, low-power and high-level language programmability. Compared with typical very long instruction word-based digital signal processor, ADRES can exploit higher parallelism by using more scalable hardware with support of novel compilation techniques. We developed a complete tool-chain, including compiler, simulator and HDL generator. This paper describes the design case of a media processor targeting at H.264 decoder and other video tasks based on the ADRES template. The whole processor design, hardware implementaiton and application mapping are done in a relative short period. Yet we obtain C-programmed real-time H.264/AVC CIF decoding at 50 MHz. The die size, clock speed and the power consumption are also very competitive compared with other processors.
S. DupontEmail:
  相似文献   

20.
Dynamically reconfigurable architectures are emerging as a viable design alternative to implement a wide range of computationally intensive applications. At the same time, an urgent necessity has arisen for support tool development to automate the design process and achieve optimal exploitation of the architectural features of the system. Task scheduling and context (configuration) management become very critical issues in achieving the high performance that digital signal processing (DSP) and multimedia applications demand. This article proposes a strategy to automate the design process which considers all possible optimizations that can be carried out at compilation time, regarding context and data transfers. This strategy is general in nature and could be applied to different reconfigurable systems. We also discuss the key aspects of the scheduling problem in a reconfigurable architecture such as MorphoSys. In particular, we focus on a task scheduling methodology for DSP and multimedia applications, as well as the context management and scheduling optimizations  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号