首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Coarse-grained reconfigurable architectures (CGRAs) require many processing elements (PEs) and a configuration memory unit (configuration cache) for reconfiguration of its PE array. Although this structure is meant for high performance and flexibility, it consumes significant power. Specially, power consumption by configuration cache is explicit overhead compared to other types of intellectual property (IP) cores. Reducing power is very crucial for CGRA to be more competitive and reliable processing core in embedded systems. In this paper, we propose a reusable context pipelining (RCP) architecture to reduce power-overhead caused by reconfiguration. It shows that the power reduction can be achieved by using the characteristics of loop pipelining, which is a multiple instruction stream, multiple data stream (MIMD)-style execution model. RCP efficiently reduces power consumption in configuration cache without performance degradation. Experimental results show that the proposed approach saves much power even with reduced configuration cache size. Power reduction ratio in the configuration cache and the entire architecture are up to 86.33% and 37.19%, respectively, compared to the base architecture.  相似文献   

2.
Most of the coarse-grained reconfigurable architectures (CGRAs) are composed of reconfigurable ALU arrays and configuration cache (or context memory) to achieve high performance and flexibility. Specially, configuration cache is the main component in CGRA that provides distinct feature for dynamic reconfiguration in every cycle. However, frequent memory-read operations for dynamic reconfiguration cause much power consumption. Thus, reducing power in configuration cache has become critical for CGRA to be more competitive and reliable for its use in embedded systems. In this paper, we propose dynamically compressible context architecture for power saving in configuration cache. This power-efficient design of context architecture works without degrading the performance and flexibility of CGRA. Experimental results show that the proposed approach saves up to 39.72% power in configuration cache with negligible area overhead (2.16%).   相似文献   

3.
A convolutional neural network (CNN) architecture supporting on-device user customization is proposed. The network architecture consists of a large CNN trained on a general data and a smaller augmenting network that can be re-trained on-device using a small user-specific data provided by the user. The proposed approach is applied to handwritten character recognition of the Latin and the Korean alphabet, Hangul. Experiments show a 3.5-fold reduction of the prediction error after user customization for both the Latin and the Korean character set compared to the CNN trained with general data. To minimize the energy required when retraining on-device, the use of a coarse-grained reconfigurable array processor (CGRA) in a low-power, efficient manner is presented. The CGRA achieves a speedup of 36× and a 54-fold reduced energy consumption compared to an ARMv8 processor. Compared to a 3-way VLIW processor, a speedup of 42× and a 12-fold energy reduction is observed, demonstrating the potential of general-purpose CGRAs as light-weight DNN accelerators.  相似文献   

4.
Coarse-grained reconfigurable arrays (CGRAs) have shown potential for application in embedded systems in recent years. Numerous reconfigurable processing elements (PEs) in CGRAs provide flexibility while maintaining high performance by exploring different levels of parallelism. However, a difference remains between the CGRA and the application-specific integrated circuit (ASIC). Some application domains, such as software-defined radios (SDRs), require flexibility with performance demand increases. More effective CGRA architectures are expected to be developed. Customisation of a CGRA according to its application can improve performance and efficiency. This study proposes an application-specific CGRA architecture template composed of generic PEs (GPEs) and special PEs (SPEs). The hardware of the SPE can be customised to accelerate specific computational patterns. An automatic design methodology that includes pattern identification and application-specific function unit generation is also presented. A mapping algorithm based on ant colony optimisation is provided. Experimental results on the SDR target domain show that compared with other ordinary and application-specific reconfigurable architectures, the CGRA generated by the proposed method performs more efficiently for given applications.  相似文献   

5.
Recently coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their efficiency and flexibility. While many CGRAs have demonstrated impressive performance improvements, the effectiveness of CGRA platforms ultimately hinges on the compiler. Existing CGRA compilers do not model the details of the CGRA, and thus they are i) unable to map applications, even though a mapping exists, and ii) using too many processing elements (PEs) to map an application. In this paper, we model several CGRA details, e.g., irregular CGRA topologies, shared resources and routing PEs in our compiler and develop a graph drawing based approach, split-push kernel mapping (SPKM), for mapping applications onto CGRAs. On randomly generated graphs our technique can map on average 4.5times more applications than the previous approach, while generating mappings which have better qualities in terms of utilized CGRA resources. Utilizing fewer resources is directly translated into increased opportunities for novel power and performance optimization techniques. Our technique shows less power consumption in 71 cases and shorter execution cycles in 66 cases out of 100 synthetic applications, with minimum mapping time overhead. We observe similar results on a suite of benchmarks collected from Livermore loops, Mediabench, Multimedia, Wavelet and DSPStone benchmarks. SPKM is not a customized algorithm only for a specific CGRA template, and it is demonstrated by exploring various PE interconnection topologies and shared resource configurations with SPKM.  相似文献   

6.
针对可重构密码处理器对于不同域上的序列密码算法兼容性差、实现性能低的问题,该文分析了序列密码算法的多级并行性并提出了一种反馈移位寄存器(FSR)的预抽取更新模型。进而基于该模型设计了面向密码阵列架构的可重构反馈移位寄存器运算单元(RFAU),兼容不同有限域上序列密码算法的同时,采取并行抽取和流水处理策略开发了序列密码算法的反馈移位寄存器级并行性,从而有效提升了粗粒度可重构阵列(CGRA)平台上序列密码算法的处理性能。实验结果表明与其他可重构处理器相比,对于有限域(GF)(2)上的序列密码算法,RFAU带来的性能提升为23%~186%;对于GF(2u)域上的序列密码算法,性能提升达约66%~79%,且面积效率提升约64%~91%。  相似文献   

7.
互连网络在粗粒度可重构结构(Coarse-Grained Reconfigurable Array, CGRA)中非常重要,对CGRA的性能、面积和功耗均有较大影响。为了减小互连网络导致的面积开销和功耗并提升CGRA的性能,该文提出一种具有自路由和无阻塞特性的互连网络,构建了一种层次型的网络拓扑结构。通过这种互连网络,任意一对处理单元之间均可以建立连接和交换数据,而且这种连接是自路由和无阻塞的。实验结果显示,与已有结构相比,该结构以至多增加14.1%的面积开销为代价,获得最高可达46.2%的整体性能提升。  相似文献   

8.
High-speed real-time digital frequency analysis is one major field of Fast Fourier Transform (FFT) application, such as Synthetic Aperture Radar(SAR) processing and medical imaging. In SAR processing, the image size could be 4 k×4 k in normal and it has become larger over the years. In the view of real-time, extensibility and reusable characteristics, an Field Programmable Gate Array(FPGA) based multi-channel variable-length FFT architecture which adopts radix-2 butterfly algorithm is proposed in this paper. The hardware implementation of FFT is partially reconfigurable architecture. Firstly, the proposed architecture in the paper has flexibility in terms of chip area, speed, resource utilization and power consumption. Secondly, the proposed architecture combines serial and parallel methods in its butterfly computations. Furthermore, on system-level issue, the proposed architecture takes advantage of state processing in serial mode and data processing in parallel mode. In case of sufficient FPGA resources, state processing of serial mode mentioned above is converted to pipeline mode. State processing of pipeline mode achieves high throughput.  相似文献   

9.
The standard H.264/AVC Intra frame encoding process has several data dependent and computational intensive coding methodologies that limit the overall encoding speed. It causes not only a high degree of computational complexity but also an unacceptable delay especially for the real-time video applications. Based on DCT properties and spatial activity analysis, low power hardware architecture for high throughput Full-Search Free (FSF) Intra mode selection and direction prediction algorithm is proposed. The FSF Intra prediction Algorithm significantly reduces the computational complexity and the processing run-time required for the H.264/AVC Intra frame prediction process. The ASIC implementation for the proposed architecture is carried out and synthesizing results are obtained. The heavily tested 45nm ASIC design is able to achieve an operating frequency of 140 MHz while limiting the overall power consumption to 9.01 mW, which nominates our proposed FSF Intra prediction architecture for interactive real-time H.264/AVC mobile video decoders.  相似文献   

10.
This paper presents an integrated self-aware computing model mitigating the power dissipation of a heterogeneous reconfigurable multicore architecture by dynamically scaling the operating frequency of each core. The power mitigation is achieved by equalizing the performance of all the cores for an uninterrupted exchange of data. The multicore platform consists of heterogeneous Coarse-Grained Reconfigurable Arrays (CGRAs) of application-specific sizes and a Reduced Instruction-Set Computing (RISC) core. The CGRAs and the RISC core are integrated with each other over a Network-on-Chip (NoC) of six nodes arranged in a topology of two rows and three columns. The RISC core constantly monitors and controls the performance of each CGRA accelerator by adjusting the operating frequencies unless the performance of all the CGRAs is optimally balanced over the platform. The CGRA cores on the platform are processing some of the most computationally-intensive signal processing algorithms while the RISC core establishes packet based synchronization between the cores for computation and communication. All the cores can access each other’s computational and memory resources while processing the kernels simultaneously and independently of each other. Besides general-purpose processing and overall platform supervision, the RISC processor manages performance equalization among all the cores which mitigates the overall dynamic power dissipation by 20.7 % for a proof-of-concept test.  相似文献   

11.
CSSA-低功耗Montgomery模乘的环形脉动阵列   总被引:1,自引:0,他引:1  
文章提出了一种环形脉动阵列CSSA(Circular Structured Systolic Array),用于实现Montgomery模乘算法MMM(Montgomery Modular Multiplication)。该阵列采用循环结构,迭代计算。仿真结果表明,与基于一维脉动阵列的MMM硬件实现相比,该结构牺牲了运算时间,但是降低了功耗和芯片面积(本文实现的两个例子,功耗和芯片面积均减少了约97%)。并且,处理单元的数量可配置,以平衡速度和功耗。  相似文献   

12.
This paper presents the system architecture, modeling, and design constraints for a baseband, integrated, CMOS, impulse ultra-wideband transceiver targeting very low power consumption on the order of 1 mW. Intended for a sensor network application, the radio supports low communication rates (/spl sim/100 kpbs) and ranging capabilities over short distances (/spl sim/10 m). Based on a "mostly digital" architecture, the analog complexity is reduced by moving the A/D convertor as close to the antenna as is reasonable. Pulses are generated from simple digital switches, overlaying the signal energy on the lower FCC UWB band (0-960 MHz). Reception is achieved using baseband gain blocks feeding a time-interleaved bank of low resolution A/D converters. A window of energy is captured in time and fed to the digital backend for processing. To save power and area, the digital backend implements only a pulse template correlation filter block overlaid with an additional spreading code. As a pulse template is used, no specific channel estimation or interference cancellation is assumed. The system performance is quantified for this case and implementation tradeoffs are explored with a strong focus on reducing power consumption. In particular, the issues of modulation choice, clock generation, gain and noise figure, ADC resolution, and digital signal processing requirements will be discussed.  相似文献   

13.
The retina of the human eye and more particularly the retinal blood vasculature can be used in several medical and biometric applications. The use of retinal images in such applications however, is computationally intensive, due to the high complexity of the algorithms used to extract the vessels from the retina. In addition, the emergence of portable biometric authentication applications, as well as onsite biomedical diagnostics raises the need for real-time, power-efficient implementations of such algorithms that can also satisfy the performance and accuracy requirements of portable systems that use retinal images. In an attempt to meet those requirements, this work presents a VLSI implementation of a retina vessel segmentation system while exploring various parameters that affect the power consumption, the accuracy and performance of the system. The proposed design implements an unsupervised vessel segmentation algorithm which utilizes matched filtering with signed integers to enhance the difference between the blood vessels and the rest of the retina. The design accelerates the process of obtaining a binary map of the vessels tree by using parallel processing and efficient resource sharing, achieving real-time performance. The design has been verified on a commercial FPGA platform and exhibits significant performance improvements (up to 90×) when compared to other existing hardware and software implementations, with an overall accuracy of 92.4%. Furthermore, the low power consumption of the proposed VLSI implementation enables the proposed architecture to be used in portable systems, as it achieves an efficient balance between performance, power consumption and accuracy.  相似文献   

14.
在视频信号的编解码流程中,离散余弦变换(DCT)是一个至关重要的环节,其决定了视频压缩的质量和效率。针对88尺寸的2维离散余弦变换,该文提出一种基于粗粒度可重构阵列结构(Coarse-Grained Reconfigurable Array, CGRA)的硬件电路结构。利用粗粒度可重构阵列的可重配置的特性,实现在单一平台支持多个视频压缩编码标准的88 2维离散余弦变换。实验结果显示,这种结构每个时钟周期可以并行处理8个像素,吞吐率最高可达1.157109像素/s。与已有结构相比,设计效率和功耗效率最高可分别提升4.33倍和12.3倍,并能够以最高30帧/s的帧率解码尺寸为40962048,格式为4:2:0的视频序列。  相似文献   

15.
This article presents a parallel architecture for 3-D discrete wavelet transform (3-DDWT). The proposed design is based on the 1-D pipelined lifting scheme. The architecture is fully scalable beyond the present coherent Daubechies filter bank (9,?7). This 3-DDWT architecture has advantages such as no group of pictures restriction and reduced memory referencing. It offers low power consumption, low latency and high throughput. The computing technique is based on the concept that lifting scheme minimises the storage requirement. The application specific integrated circuit implementation of the proposed architecture is done by synthesising it using 65?nm Taiwan Semiconductor Manufacturing Company standard cell library. It offers a speed of 486?MHz with a power consumption of 2.56?mW. This architecture is suitable for real-time video compression even with large frame dimensions.  相似文献   

16.
Increasing complexity of a system-on-chip design demands efficient on-chip interconnection architecture such as on-chip network to overcome limitations of bus architecture. In this brief, we propose a packet-switched on-chip interconnection network architecture, through which multiple processing units of different clock frequencies can communicate with each other without global synchronization. The architecture is analyzed in terms of area and energy consumption, and implementation issues on building blocks are addressed for cost-effective design. A test chip is implemented using 0.38-/spl mu/m CMOS technology, and measured its operation at 800 MHz to demonstrate its feasibility.  相似文献   

17.
Fractional Motion Estimation (FME) in high-definition H.264 presents a significant design challenge in terms of memory bandwidth, latency and area cost as there are various modes and complex mode decision flow, which require over 45% of the computation complexity in the H.264 encoding process. In this paper, a new high-performance VLSI architecture for Fractional Motion Estimation (FME) in H.264/AVC based on the full-search algorithm is presented. This architecture is made up of three different pipeline processors to establish a trade-off between processing time and hardware utilization. The computing scheme based on a 4-pixel interpolation unit with a 10-pixel input bandwidth is capable of processing a macroblock (MB) in 870 clock cycles. The final VLSI implementation only requires 11.4 k gates and 4.4kBytes of RAM in a standard 180 nm CMOS technology operating at 290 MHz. Our design generates the residual image and the best MVs and mode in a high throughput and low area cost architecture while achieving enough processing capacity for 1080HD (1920 × 1088@30fps) real-time video streams.  相似文献   

18.
In the past years, many works have demonstrated the applicability of Coarse-Grained Reconfigurable Array (CGRA) accelerators to optimize loops by using software pipelining approaches. They are proven to be effective in reducing the total execution time of multimedia and signal processing applications. However, the run-time reconfigurability of CGRAs is hampered overheads introduced by the needed translation and mapping steps. In this work, we present a novel run-time translation technique for the modulo scheduling approach that can convert binary code on-the-fly to run on a CGRA. We propose a greedy approach, since the modulo scheduling for CGRA is an NP-complete problem. In addition to read-after-write dependencies, the dynamic modulo scheduling faces new challenges, such as register insertion to solve recurrence dependences and to balance the pipelining paths. Our results demonstrate that the greedy run-time algorithm can reach a near-optimal ILP rate, better than an off-line compiler approach for a 16-issue VLIW processor. The proposed mechanism ensures software compatibility as it supports different source ISAs. As proof of concept of scaling, a change in the memory bandwidth has been evaluated. In this analysis it is demonstrated that when changing from one memory access per cycle to two memory accesses per cycle, the modulo scheduling algorithm is able to exploit this increase in memory bandwidth and enhance performance accordingly. Additionally, to measure area and performance, the proposed CGRA was prototyped on an FPGA. The area comparisons show that a crossbar CGRA (with 16 processing elements and including an 4-issue VLIW host processor) is only 1.11 × bigger than a standalone 8-issue VLIW softcore processor.  相似文献   

19.
Given the energy waste problem in contemporary households and the consequent need for optimal energy use, this article presents a novel network architecture that is generically applicable on domestic appliances, such as white goods, and audiovisual and communication equipment, and is capable of performing real-time management of their energy consumption. Deploying the latest information and communication technology, the proposed architecture enables definition of energy saving applications that perform three main functions: real-time estimation of the energy consumption of the home environment, without exploiting smart metering devices; control of domestic appliances energy use so that energy consumption of the home environment is kept within user-defined limits; and autonomous identification and management of standby devices, targeting minimal energy consumption.  相似文献   

20.
This paper presents a reduced-complexity, fixed-point algorithm and efficient real-time VLSI architectures for multiuser channel estimation, one of the core baseband processing operations in wireless base-station receivers for CDMA. Future wireless base-station receivers will need to use sophisticated algorithms to support extremely high data rates and multimedia. Current DSP implementations of these algorithms are unable to meet real-time requirements. However, there exists massive parallelism and bit level arithmetic present in these algorithms than can be revealed and efficiently implemented in a VLSI architecture. We re-design an existing channel estimation algorithm from an implementation perspective for a reduced complexity, fixed-point hardware implementation. Fixed point simulations are presented to evaluate the precision requirements of the algorithm. A dependence graph of the algorithm is presented and area-time trade-offs are developed. An area-constrained architecture achieves low data rates with minimum hardware, which may be used in pico-cell base-stations. A time-constrained solution exploits the entire available parallelism and determines the maximum theoretical data processing rates. An area-time efficient architecture meets real-time requirements with minimum area overhead.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号