期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

MMNNN: A tree-based Multicast Mechanism for NoC-based deep Neural Network accelerators

《Microprocessors and Microsystems》2021

Network-on-Chip (NoC) devices have been widely used in multiprocessor systems. In recent years, NoC-based Deep Neural Network (DNN) accelerators have been proposed to connect neural computing devices using NoCs. Such designs dramatically reduce off-chip memory accesses of these platforms. However, the large number of one-to-many packet transfers significantly degrade performance with traditional unicast channels. We propose a multicast mechanism for a NoC-based DNN accelerator called Multicast Mechanism for NoC-based Neural Network accelerator (MMNNN). To do so, we propose a tree-based multicast routing algorithm with excellent scalability and the ability to minimize the number of packets in the network. We also propose a router architecture for single-flit packets. Our proposed router transfers flits to multiple destinations in a single process and has no head-of-line blocking issue, offering higher throughput and lower latency than traditional wormhole router architectures. Simulation results show that our proposed multicast mechanism offers excellent performance in classification latency, average packet latency, and energy consumption. 相似文献

2.

A hardware/software platform for QoS bridging over multi-chip NoC-based systems

Ashkan Beyranvand Nejad Anca Molnos Matias Escudero Martinez Kees Goossens 《Parallel Computing》2013

Recent embedded systems integrate a growing number of intellectual property cores into increasingly large designs. Implementation, prototyping, and verification of such large systems has become very challenging. One of the reasons is that chips/FPGAs resources are limited and therefore it is not always possible to implement the whole design in the traditional system-on-a-chip solutions. The state-of-the-art is to partition such systems into smaller sub-systems to implement each on a separate chip. Consequently, it requires interconnecting separate chips/FPGAs. Since Networks-on-Chip (NoCs) have become common interconnection solutions in embedded designs, we propose to bridge NoC-based SoCs enabling a generic multi-chip systems interconnection. In this context, the contribution of this paper is threefold, (i) we explore the NoC protocol stack to determine the best layer for implementing the off-chip bridge, (ii) we propose a generic hardware architecture for the bridge, and (iii) we develop a new software architecture enabling seamless configuration and communication of multi-chip NoC-based SoCs. Finally, we demonstrate performance, i.e., bandwidth and latency, of the bridge in a multi-FPGA platform, while the bridge guarantees QoS of traffic. The synthesis results indicate the implementation area cost of the bridge is only 1% of Xilinx Virtex6 FPGA. 相似文献

3.

Modular Neural Tile Architecture for Compact Embedded Hardware Spiking Neural Network

Sandeep Pande Fearghal Morgan Seamus Cawley Tom Bruintjes Gerard Smit Brian McGinley Snaider Carrillo Jim Harkin Liam McDaid 《Neural Processing Letters》2013,38(2):131-153

Biologically-inspired packet switched network on chip (NoC) based hardware spiking neural network (SNN) architectures have been proposed as an embedded computing platform for classification, estimation and control applications. Storage of large synaptic connectivity (SNN topology) information in SNNs require large distributed on-chip memory, which poses serious challenges for compact hardware implementation of such architectures. Based on the structured neural organisation observed in human brain, a modular neural networks (MNN) design strategy partitions complex application tasks into smaller subtasks executing on distinct neural network modules, and integrates intermediate outputs in higher level functions. This paper proposes a hardware modular neural tile (MNT) architecture that reduces the SNN topology memory requirement of NoC-based hardware SNNs by using a combination of fixed and configurable synaptic connections. The proposed MNT contains a 16:16 fully-connected feed-forward SNN structure and integrates in a mesh topology NoC communication infrastructure. The SNN topology memory requirement is 50 % of the monolithic NoC-based hardware SNN implementation. The paper also presents a lookup table based SNN topology memory allocation technique, which further increases the memory utilisation efficiency. Overall the area requirement of the architecture is reduced by an average of 66 % for practical SNN application topologies. The paper presents micro-architecture details of the proposed MNT and digital neuron circuit. The proposed architecture has been validated on a Xilinx Virtex-6 FPGA and synthesised using 65 nm low-power CMOS technology. The evolvable capability of the proposed MNT and its suitability for executing subtasks within a MNN execution architecture is demonstrated by successfully evolving benchmark SNN application tasks representing classification and non-linear control functions. The paper addresses hardware modular SNN design and implementation challenges and contributes to the development of a compact hardware modular SNN architecture suitable for embedded applications 相似文献

4.

支持存储访问的NoC模拟器的设计与实现

邓庆绪王文凤金曦刘柄蔚孔繁鑫《小型微型计算机系统》2012,33(7):1537-1545

随着片上网络的发展,片上多处理器系统通信性能提高的同时,存储器的访问性能将成为片上多处理器系统的性能瓶颈.目前片上网络的研究主要依赖于模拟器,而现有的片上网络模拟器都不能完成对存储器访问的准确模拟.本文设计并实现了一个能对存储器访问进行模拟的模拟器,为存储器性能的研究提供了一个实验平台;论文通过采用大量访问集对该模拟器进行测试,得出了若干条与存储器访问性能优化相关的片上网络设计建议. 相似文献

5.

LPNet: A DNN based latency prediction technique for application mapping in Network-on-Chip design

《Microprocessors and Microsystems》2021

Analytical models used for latency estimation of Network-on-Chip (NoC) are not producing reliable accuracy. This makes these analytical models difficult to use in optimization of design space exploration. In this paper, we propose a learning based model using deep neural network (DNN) for latency predictions. Input features for DNN model are collected from analytical model as well as from Booksim simulator. Then this DNN model has been adopted in mapping optimization loop for predicting the best mapping of given application and NoC parameters combination. Our simulations show that using the proposed DNN model, prediction error is less than 12% for both synthetic and application specific traffic. More than 108 times speedup could be achieved using DPSO with DNN model compared to DPSO using Booksim simulator. 相似文献

6.

基于多FPGA的NoC多核处理器验证平台设计 总被引：1，自引：0，他引：1

黄晓林潘红兵易伟杨虎凌梦黄辰何书专李丽《计算机工程与设计》2012,33(1):180-185

为了能够灵活地验证和实现自主设计的基于NoC的多核处理器,缩短NoC多核处理器的设计周期,提出了设计集成4片Virtex-6-550T FPGA的NoC多核处理器原型芯片设计/验证平台.分析和评估了NoC多核处理器的规模以及对FPGA硬件资源的需求,在此基础上给出了集成4片FPGA的开发板详细设计方案,并对各主要模块如互联架构、电源、板级时钟分布、接口技术、存储资源等关键设计要点进行阐述.描述了开发板各个主要模块的测试过程和结果,表明了该设计的可行性. 相似文献

7.

Elastic superposition task mapping for NoC-based reconfigurable systems

《Microprocessors and Microsystems》2017

With technology progress, more and more applications are integrated into a single chip. This requires a large number of processing elements (PEs) in a system, such that computation can be effectively enhanced through parallel processing. To support more efficient parallel processing, the Network-on-Chip (NoC) is being increasingly adopted as an interconnection architecture. Nevertheless, for NoC-based reconfigurable systems, the issue of mapping tasks to the PEs becomes more complex, due to the characteristic of hardware reconfiguration. This work proposes a novel Elastic Superposition Mapping (ESM) that introduces a useful PE reservation heuristic along with dynamic cross-application superposition. The ESM can provide a great elasticity for an NoC-based reconfigurable system to map more applications. Thus, the task load on PE will increase. Experiments show that, compared to the state-of-the-art mapping methods, 7% to 49% more applications can be executed, the average task load on PE can be increased by 5.5% to 56%, and the application waiting time can be reduced by 11% to 54%. 相似文献

8.

A generic FPGA prototype for on-chip systems with network-on-chip communication infrastructure

Mohammad Arjomand Amirali Boroumand Hamid Sarbazi-Azad 《Computers & Electrical Engineering》2014

As System-on-Chips (SoCs) grow in complexity and size, proposals of networks-on-chip (NoCs) as the on-chip communication infrastructure are justified by reusability, scalability, and energy efficiency provided by the interconnection networks. Simulation and mathematical analysis offer flexibility for the evaluations under various network configurations. However, the accuracy of such analyzing methods largely depends on the approximations made. On the other hand, prototyping can be used to improve the evaluation accuracy by bringing the design closer to reality. In this paper, we propose a FPGA prototype that is general enough to model different video-processing SoCs where different cores communicate via NoC. To model NoC, we accurately implement a fully-synthesized on-chip router supporting multiple virtual channels. For the processing nodes, on the other side, we propose a general and simple traffic generator capable of modeling different synthetic functions (i.e. Poisson and self-similar). Indeed, the application traffic is modeled using 1-D hybrid cellular automata which can effectively generate high quality pseudorandom patterns. Finally, for the energy efficiency, the proposed prototype is capable to support multiple frequency regions. To realize the voltage–frequency island partitioned SoC, we use the utilities that Xilinx FPGA platform offers to design Globally Synchronous Locally Asynchronous (GALS) systems via Delay-Locked Loop elements. 相似文献

9.

A multi-processor NoC-based architecture for real-time image/video enhancement 总被引：1，自引：0，他引：1

Sergio Saponara Luca Fanucci Esa Petri 《Journal of Real-Time Image Processing》2013,8(1):111-125

The paper presents a multi-processor architecture for real-time and low-power image and video enhancement applications. Differently from other state-of-the-art parallel architectures the proposed solution is composed of heterogeneous tiles. The tiles have computational and memory capabilities, support different algorithmic classes and are connected by a novel Network-on-Chip (NoC) infrastructure. The proposed packet-switched data transfer scheme avoids communication bottlenecks when more tiles are working concurrently. The functional performances of the NoC-based multi-processor architecture are assessed by presenting the achieved results when the platform is programmed to support different enhancement algorithms for still images or videos. The implementation complexity of the NoC-based multi-tile platform, integrated in 65 nm CMOS technology, is reported and discussed. 相似文献

10.

Power optimization for application-specific networks-on-chips: A topology-based approach

Haytham Elmiligi Ahmed A. Morgan M. Watheq El-Kharashi Fayez Gebali 《Microprocessors and Microsystems》2009,33(5-6):343-355

This paper analyzes the main sources of power consumption in Networks-on-Chip (NoC)-based systems. Analytical power models of global interconnection links are studied at different levels of abstraction. Additionally, power measurement experiments are performed for different types of routers. Based on this study, we propose a new topology-based methodology to optimize the power consumption of complex NoC-based systems at early design phases. The efficiency of the proposed methodology is verified through a case study of an MPEG4 video application. Experimental results show a promising improvement in power consumption (8.55%), average number of hops (10.80%), and number of global links (56.25%) compared to the best known related work. 相似文献

11.

Random access schemes for efficient FPGA SpMV acceleration

《Microprocessors and Microsystems》2016

Utilizing hardware resources efficiently is vital to building the future generation of high-performance computing systems. The sparse matrix – dense vector multiplication (SpMV) kernel, which is notorious for its poor efficiency on conventional processors, is a key component in many scientific computing applications and increasing SpMV efficiency can contribute significantly to improving overall system efficiency. The major challenge in implementing SpMV efficiently is handling the input-dependent memory access patterns, and reconfigurable logic is a strong candidate for tackling this problem via memory system customization. In this work, we consider three schemes (all off-chip, all on-chip, caching) for servicing the irregular-access component of SpMV and investigate their effects on accelerator efficiency. To combine the strengths of on-chip and off-chip random accesses, we propose a hardware-software caching scheme named NCVCS that combines software preprocessing with a nonblocking cache to enable highly efficient SpMV accelerators with modest on-chip memory requirements. Our results from the comparison of the three schemes implemented as part of an FPGA SpMV accelerator show that our scheme effectively combines the high efficiency from on-chip accesses with the capability of working with large matrices from off-chip accesses. 相似文献

12.

Application of design reuse to artificial neural networks: case study of the back propagation algorithm

N. Izeboudjen A. Bouridane A. Farah H. Bessalah 《Neural computing & applications》2012,21(7):1531-1544

相似文献

13.

一种新型片上网络互连结构的仿真和实现 总被引：2，自引：0，他引：2

陈芳露陆雯青虞志益周晓方《小型微型计算机系统》2010,31(5)

综合性能、硬件实现等方面考虑,提出一种基于片上网络的互连拓扑结构-层次化路由结构MLR(Multi-Layer Router).该结构通过层次化设计减小网络直径,具有良好的对称性和扩展性.网络建模仿真和硬件实现结果显示,在不同网络负载和不同IP核节点数的情况下,MLR与传统结构相比,在处理网络通信时,对于网络丢包率、通信延迟和网络吞吐量等网络性能参数均有最多50%-70%的提升;同时通过共享路由的方式,减少了超过20%的芯片面积和40%以上的动态功耗,有效降低了互连结构的硬件开销相似文献

14.

CoQoS: Coordinating QoS-aware shared resources in NoC-based SoCs

Bin LiAuthor Vitae Li ZhaoAuthor VitaeRavi IyerAuthor Vitae Li-Shiuan PehAuthor VitaeMichael LeddigeAuthor Vitae Michael EspigAuthor VitaeSeung Eun LeeAuthor Vitae Donald NewellAuthor Vitae 《Journal of Parallel and Distributed Computing》2011,71(5):700-713

Contention in performance-critical shared resources affects performance and quality-of-service (QoS) significantly. While this issue has been studied recently in CMP architectures, the same problem exists in SoC architectures where the challenge is even more severe due to the contention of shared resources between programmable cores and fixed-function IP blocks. In the SoC environment, efficient resource sharing and a guarantee of a certain level of QoS are highly desirable. Researchers have proposed different techniques to support QoS, but most existing works focus on only one individual resource. Coordinated management of multiple QoS-aware shared resources remains an open problem. In this paper, we propose a class-of-service based QoS architecture (CoQoS), which can jointly manage three performance-critical resources (cache, NoC, and memory) in a NoC-based SoC platform. We evaluate the interaction between the QoS-aware allocation of shared resources in a trace-driven platform simulator consisting of detailed NoC and cache/memory models. Our simulations show that the class-of-service based approach provides a low-cost flexible solution for SoCs. We show that assigning the same class-of-service to multiple resources is not as effective as tuning the class-of-service of each resource while observing the joint interactions. This demonstrates the importance of overall QoS support and the coordination of QoS-aware shared resources. 相似文献

15.

面向应用的流存储系统评测与改进

汪芳安虹徐光许牧姚平《小型微型计算机系统》2010,31(5)

有限的片外存储带宽是制约流处理器性能提升的瓶颈之一,流存储系统已经采用了多种方式来缓解这个问题,但当前的设计并没有充分考虑应用具体的访存模式对有效带宽利用率的影响.通过分析和实验,评估流存储系统主要设计参数对不同访存模式的优化效果;在此基础上针对不同的流访问并行度提出了相应的结构改进,加入宽发射和短作业优先调度支持,充分挖掘存储访问的局部性和并行性,改善了负载平衡,从而有效地提高了片外带宽的使用效率和流程序的整体性能. 相似文献

16.

New access modes of parallel memory subsystem for sub-pixel motion estimation

Radomir Jakovljević Aleksandar Berić Edwin van Dalen Dragan Milićev 《Journal of Real-Time Image Processing》2018,15(2):279-296

Accessing pixels in memory is a well-known bottleneck of SIMD (single instruction multiple data) processors in video/imaging. To tackle it, we propose new block and row access modes of parallel on-chip memory subsystem, which enable a higher processing throughput and lower energy consumption than the access modes of the state-of-the-art subsystems. The new access modes significantly reduce the number of on-chip memory accesses, and thereby accelerate one of key video/imaging kernels: sub-pixel block-matching motion estimation. The main idea is to exploit spatial overlaps of blocks/rows accessed for pixel interpolation, which are known at the subsystem design-time, and merge multiple accesses into a single one by accessing somewhat more pixels at a time than with other parallel memories. To avoid the need for a wider, and, therefore, more costly SIMD datapath, we propose new memory read operations that split all pixels accessed at a time into multiple SIMD-wide blocks/rows, in a convenient way for further processing. As a proof of concept, we describe a parametric, scalable, and cost-efficient architecture that supports the new access modes. The architecture is based on a previously proposed set of memory banks with multiple pixels per bank word, and a previously proposed shifted scheme for arranging pixels in the banks. We analytically and experimentally demonstrate advantages of this work on a case study of sub-pixel motion estimation for video frame-rate conversion. The implemented motion estimator processes 2160p video at 60 fps in real time, while clocked at 600 MHz. Compared to the implementations based on the state-of-the-art subsystems, this work enables 40–70 % higher throughput, consumes 17–44 % less energy and has similar silicon area and off-chip memory bandwidth costs. That is 1.8–2.9 times more efficient than the prior art, considering the throughput and all costs, i.e., consumption, area, and off-chip bandwidth. Such a higher efficiency is the result of the new access modes, which reduced the number of on-chip memory accesses by 1.6–2.1 times, and the cost-efficient architecture. 相似文献

17.

ProNoC: A low latency network-on-chip based many-core system-on-chip prototyping platform

《Microprocessors and Microsystems》2017

Network-on-chip (NoC) is an emerging interconnect infrastructure to address the scalability limitation of conventional shared bus architecture for many-core system-on-chip (MCSoC). Current field-programmable gate arrays (FPGAs) have over million lookup tables, making it possible to prototype a complete NoC-based MCSoC on a single FPGA device. FPGA prototyping allows rapid system verification and optimum design parameters estimation. However, existing NoC-based MCSoC prototypes are usually adopting simple NoC architectural functionality. These NoC prototypes cannot represent a realistic projection of the state-of-the-art application-specific integrated circuit (ASIC) NoCs as these prototypes have limited overall system performance. This paper presents ProNoC, an integrated tool for rapid prototyping and validation of NoC-based MCSoC projects targeting FPGA devices. ProNoC adopts most advanced NoC features such as the support of virtual channel (VC), virtual network, low latency routing and different routing algorithms. Results show that NoC interconnect in ProNoC outperforms CONNECT, the most recent VC based prototype NoC with lower logic cell utilization, higher maximum operating frequency, higher average saturation throughput, and lower average communication latency. Moreover, ProNoC is equipped with graphical user interface to facilitate the development of MCSoC prototypes on FPGA platforms. 相似文献

18.

Balanced Quantization: An Effective and Efficient Approach to Quantized Neural Networks

下载免费PDF全文

Shu-Chang Zhou Yu-Zhi Wang He Wen Qin-Yao He Yu-Heng Zou 《计算机科学技术学报》2017,32(4):667-682

Quantized neural networks (QNNs), which use low bitwidth numbers for representing parameters and performing computations, have been proposed to reduce the computation complexity, storage size and memory usage. In QNNs, parameters and activations are uniformly quantized, such that the multiplications and additions can be accelerated by bitwise operations. However, distributions of parameters in neural networks are often imbalanced, such that the uniform quantization determined from extremal values may underutilize available bitwidth. In this paper, we propose a novel quantization method that can ensure the balance of distributions of quantized values. Our method first recursively partitions the parameters by percentiles into balanced bins, and then applies uniform quantization. We also introduce computationally cheaper approximations of percentiles to reduce the computation overhead introduced. Overall, our method improves the prediction accuracies of QNNs without introducing extra computation during inference, has negligible impact on training speed, and is applicable to both convolutional neural networks and recurrent neural networks. Experiments on standard datasets including ImageNet and Penn Treebank confirm the effectiveness of our method. On ImageNet, the top-5 error rate of our 4-bit quantized GoogLeNet model is 12.7%, which is superior to the state-of-the-arts of QNNs. 相似文献

19.

Adaptive image compression technique for wireless sensor networks

Mohsen Nasri Author VitaeAbdelhamid HelaliAuthor Vitae Halim Sghaier Author VitaeHassen Maaref Author Vitae 《Computers & Electrical Engineering》2011,37(5):798-810

When using wireless sensor networks for real-time image transmission, some critical points should be considered. These points are limited computational power, storage capability, narrow bandwidth and required energy. Therefore, efficient compression and transmission of images in wireless sensor network is considered. To address the above mentioned concerns, an efficient adaptive compression scheme that ensures a significant computational and energy reduction as well as communication with minimal degradation of the image quality is proposed. This scheme is based on wavelet image transform and distributed image compression by sharing the processing of tasks to extend the overall lifetime of the network. Simulation results are presented and they show that the proposed scheme optimizes the network lifetime, reduces significantly the amount of the required memory and minimizes the computation energy by reducing the number of arithmetic operations and memory accesses. 相似文献

20.

复用NoC测试SoC内嵌IP芯核的测试规划研究 总被引：1，自引：0，他引：1

下载免费PDF全文

赵建武师奕兵王志刚《计算机工程与应用》2010,46(15):60-63

测试规划是SoC芯片测试中需要解决的一个重要问题。一种复用片上网络测试内嵌IP芯核的测试规划方法被用于限制测试模式下SoC芯片功耗不超出最大芯片功耗范围,消除测试资源共享所引起的冲突,达到减小测试时间的目的。提出了支持测试规划的无拥塞路由算法和测试扫描链优化配置方法。使用VHDL硬件描述语言实现了在FPGA芯片中可综合的二维Mesh片上网络测试平台,用于片上网络性能参数、路由算法以及基于片上网络的SoC芯片测试方法的分析评估。相似文献