Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
Network-on-Chip (NoC) has been proposed to overcome the complex on-chip communication problem of System-on-Chip (SoC) design in deep sub-micron technologies. A complete NoC design explores both the hardware and the software architecture. The hardware architecture covers the selection of Processing Elements (PEs) of multiple types and their topology. The software architecture covers the allocation of tasks to PEs and the scheduling of tasks and their communications. To find the best hardware design for the target tasks, the hardware and software architectures must be considered simultaneously. Previous work on NoC design has concentrated on only one or two design parameters at a time. In this paper, we propose a hardware–software co-synthesis algorithm for a heterogeneous NoC architecture. The design goal is to minimize energy consumption while meeting the real-time requirements commonly seen in embedded applications. The proposed algorithm is based on Simulated Annealing (SA). To compare the solution quality and efficiency of the proposed algorithm, we also implement a branch-and-bound algorithm and an iterative algorithm for the hardware–software co-synthesis problem of a heterogeneous NoC. On the given synthetic task sets, the experimental results show that the proposed SA-based algorithm achieves a near-optimal solution in a reasonable time, while the branch-and-bound algorithm takes a very long time to find the optimal solution and the iterative algorithm fails to achieve good solution quality. When the co-synthesis algorithms are applied to a real-world application with a PE library that has little variation in PE performance and energy consumption, the iterative algorithm achieves solution quality comparable to that of the proposed SA-based algorithm.
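
The simulated-annealing idea in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the cost model (total energy plus a penalty when the busiest PE exceeds the deadline), the function name `anneal_mapping`, and all parameters are assumptions made for illustration, and topology and communication scheduling are ignored.

```python
import math
import random


def anneal_mapping(exec_time, energy, deadline,
                   steps=5000, t0=10.0, alpha=0.999, seed=0):
    """Assign each task to one PE by simulated annealing (illustrative only).

    exec_time[t][p] and energy[t][p] are the hypothetical time and energy
    of running task t on PE p. Returns (mapping, cost).
    """
    rng = random.Random(seed)
    n_tasks, n_pes = len(exec_time), len(exec_time[0])

    def cost(mapping):
        total_energy = sum(energy[t][p] for t, p in enumerate(mapping))
        # Assumption: tasks on one PE run back to back, so the busiest PE
        # determines the finish time.
        load = [0.0] * n_pes
        for t, p in enumerate(mapping):
            load[p] += exec_time[t][p]
        lateness = max(0.0, max(load) - deadline)
        return total_energy + 1000.0 * lateness  # heavy penalty for lateness

    current = [rng.randrange(n_pes) for _ in range(n_tasks)]
    current_cost = cost(current)
    best, best_cost = current[:], current_cost
    temp = t0
    for _ in range(steps):
        # Neighbor move: reassign one randomly chosen task.
        candidate = current[:]
        candidate[rng.randrange(n_tasks)] = rng.randrange(n_pes)
        cand_cost = cost(candidate)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
        temp *= alpha
    return best, best_cost
```

In this toy model a slow, low-energy PE competes with a fast, high-energy one, and the deadline limits how many tasks the slow PE can take.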

2.
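Brain-inspired (neuromorphic) processors can support the deployment of various spiking neural networks (SNNs) to accomplish a variety of tasks. A network-on-chip (NoC) can solve complex on-chip interconnect and communication problems with modest resources and power. Most existing neuromorphic processors use an NoC to connect multiple neuron cores and support communication between neurons. The instantaneous bursts of SNN communication within a time step generate a large number of spike packets in a short time. Under this communication behavior, the NoC saturates quickly, causing network congestion. Routing algorithms that are not congestion-aware further aggravate the congestion, so handling these packets effectively within each time step, and thereby reducing network latency and improving throughput, is a problem that currently needs to be solved. This paper first analyzes the instantaneous burst communication characteristics of SNNs; it then proposes a congestion-aware Hamiltonian-path routing algorithm to reduce average NoC latency and improve throughput; finally, the routing algorithm is implemented in Verilog HDL and its performance is evaluated through simulation. On a 16×16 2D mesh NoC, compared with a routing algorithm without congestion awareness, the proposed congestion-aware routing algorithm reduces average NoC latency by 13.9% and 15.9% under the quantity burst mode and the probability burst mode, respectively, and improves throughput by 21.6% and 16.8%, respectively.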
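
As a rough illustration of Hamiltonian-path routing on a 2D mesh, the sketch below labels nodes in snake order and only ever moves toward the destination label, which is the classic label-monotonic scheme. The congestion tie-break and the names `label`, `route`, and `congestion` are assumptions for illustration; the abstract does not give the details of the paper's congestion-aware rule.

```python
def label(x, y, width):
    """Snake-order label: even rows run left to right, odd rows right to left."""
    return y * width + (x if y % 2 == 0 else width - 1 - x)


def route(src, dst, width, height, congestion=None):
    """Return the list of (x, y) hops from src to dst.

    Labels change monotonically along the path, so a packet never leaves
    the 'upward' (or 'downward') subnetwork it started in. `congestion`
    is a hypothetical map from node to a load estimate; less loaded
    neighbors are preferred, then the neighbor closest in label to dst.
    """
    congestion = congestion or {}
    path = [src]
    cur = src
    target = label(dst[0], dst[1], width)
    while cur != dst:
        here = label(cur[0], cur[1], width)
        x, y = cur
        neighbors = [
            (nx, ny)
            for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
            if 0 <= nx < width and 0 <= ny < height
        ]
        if target > here:
            # Move only to labels in (here, target]; the path successor
            # (label here + 1) is always a mesh neighbor, so this is non-empty.
            allowed = [n for n in neighbors
                       if here < label(n[0], n[1], width) <= target]
            cur = min(allowed, key=lambda n: (congestion.get(n, 0),
                                              -label(n[0], n[1], width)))
        else:
            # Symmetric case: move only to labels in [target, here).
            allowed = [n for n in neighbors
                       if target <= label(n[0], n[1], width) < here]
            cur = min(allowed, key=lambda n: (congestion.get(n, 0),
                                              label(n[0], n[1], width)))
        path.append(cur)
    return path
```

With an empty `congestion` map this reduces to plain label-monotonic routing; marking a neighbor as loaded steers the packet around it when another allowed neighbor exists.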

3.
The authors examine the design, implementation, and experimental analysis of parallel priority queues for device and network simulation. They consider: 1) distributed splay trees using MPI; 2) concurrent heaps using shared memory atomic locks; and 3) a new, more general concurrent data structure based on distributed sorted lists, designed to provide dynamically balanced work allocation and efficient use of shared memory resources. We evaluate performance for all three data structures on a Cray-T3E900 system at KFA-Jülich. Our comparisons are based on simulations of single buffers and a 64×64 packet switch which supports multicasting. In all implementations, PEs monitor traffic at their preassigned input/output ports, while priority queue elements are distributed across the Cray-T3E virtual shared memory. Our experiments with up to 60000 packets and two to 64 PEs indicate that concurrent priority queues perform much better than distributed ones. Both concurrent implementations have comparable performance, while our new data structure uses less memory and has been further optimized. We also consider parallel simulation for symmetric networks by sorting integer conflict functions and implementing a packet indexing scheme. The optimized message passing network simulator can process ~500 K packet moves in one second, with an efficiency that exceeds ~50 percent for a few thousand packets on the Cray-T3E with 32 PEs. All developed data structures form a parallel library. Although our concurrent implementations use the Cray-T3E ShMem library, portability can be derived from OpenMP or MPI-2 standard libraries, which will provide support for one-way communication and shared memory lock mechanisms.
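
The "concurrent heap protected by a lock" idea can be sketched in a few lines. This is a single-process analogue using Python threads, not the Cray ShMem implementation described in the abstract; the class name `ConcurrentEventQueue` and its methods are hypothetical.

```python
import heapq
import threading


class ConcurrentEventQueue:
    """A binary heap of (timestamp, event) pairs guarded by one lock."""

    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()
        self._counter = 0  # tie-breaker: equal timestamps pop in insertion order

    def push(self, timestamp, event):
        with self._lock:
            heapq.heappush(self._heap, (timestamp, self._counter, event))
            self._counter += 1

    def pop(self):
        """Remove and return the earliest (timestamp, event), or None if empty."""
        with self._lock:
            if not self._heap:
                return None
            timestamp, _, event = heapq.heappop(self._heap)
            return timestamp, event
```

A single global lock serializes every access, which is the simplest correct design; finer-grained locking would be needed for the queue to scale with the number of workers.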

4.
Network-on-chip (NoC) communication architectures present promising solutions for scalable communication requests in large system-on-chip (SoC) designs. Intellectual property (IP) core assignment and mapping are two key steps in NoC design, significantly affecting the quality of NoC systems. Both are NP-hard problems, so it is necessary to apply intelligent algorithms. In this paper, we propose improved intelligent algorithms for NoC assignment and mapping to overcome the drawbacks of traditional intelligent algorithms. The aim of our proposed algorithms is to minimize power consumption, time, and area and to balance load. This work involves multiple conflicting objectives, so we combine multi-objective optimization with intelligent algorithms. In addition, we design a fault-tolerant routing algorithm and take account of reliability using comprehensive performance indices. The proposed algorithms were evaluated on the Embedded System Synthesis Benchmarks Suite (E3S). Experimental results show that the improved algorithms achieve good performance in NoC designs, with high reliability.

5.
The increasing complexity of Multi-Processor Systems on Chip (MPSoCs) requires communication infrastructures that efficiently accommodate the communication needs of the integrated computation resources. Exploring the arbitration space is crucial for achieving low-latency communication. This paper illustrates an arbiter synthesis approach that enables high-performance MPSoC communication for multi-bus and Network on Chip (NoC) architectures. A cost function has been formulated to assign a priority order to each component or each set of components in a manner that minimizes the communication latency and generates a multi-level arbiter. The performance of the proposed approach has been analyzed in the design of an 8 × 8 ATM switch subsystem and an MPEG4 decoder mapped onto a 2-D mesh NoC. The results demonstrate that the MPSoC arbiter is well suited to serving high-priority communication traffic with low latencies by allowing preemption of lower-priority transport. The sum of the mean waiting times at the eight ports of the ATM switch is lowest under the MPSoC arbitration scheme (4.30 cycles per word), while it is 3.00 times larger under the poorest-performing arbitration scheme. In the case of the MPEG4 decoder, the average packet latency under the MPSoC arbiter is about 480 cycles, while it is 640 cycles under the poorest-performing arbitration scheme at an injection rate of 0.4 flits/cycle.
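
The figures quoted in this abstract can be put side by side with a short calculation. The reading of "3.00 times larger" as a multiplicative factor of 3.00 is an assumption, and the helper names are hypothetical.

```python
def scaled_wait(best_wait, factor):
    """Waiting time of the slower scheme, assuming 'N times larger' means best * N."""
    return best_wait * factor


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# ATM switch: 4.30 cycles per word under the MPSoC arbiter, factor 3.00 for the other scheme.
atm_other_scheme = scaled_wait(4.30, 3.00)        # about 12.9 cycles per word

# MPEG4 decoder: 480 cycles versus 640 cycles average packet latency.
mpeg4_reduction = relative_reduction(640, 480)    # 0.25, i.e. 25% lower latency
```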

6.
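NoC mapping is an important step in NoC design, and the quality of the mapping result strongly affects the NoC's QoS constraints and communication power. This paper proposes a scheme that performs NoC mapping with a cloud adaptive genetic algorithm. The algorithm uses the cloud model to improve the traditional genetic algorithm, automatically adjusting the crossover and mutation probabilities during the genetic process and thereby optimizing the genetic algorithm. Taking the specific problems of NoC mapping into account, a mathematical model of NoC mapping power under a delay constraint is established, subject to power and delay constraints. Experiments show that the method achieves good results in NoC mapping and reduces communication power.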

7.
Application mapping in 2-D mesh-based Network-on-Chip (NoC) architecture is an optimization problem in which each application task (e.g., processor or memory units) should be mapped one-to-one onto a network element (switch or router) to optimize performance requirements (e.g., communication energy or communication latency) under certain platform constraints (e.g., bandwidth and/or latency). Network-on-Chip is a scheme that establishes links between limited application-specific components within a Multi-Processor System-on-Chip (MPSoC), but it has a vital role in ensuring the maximum data transfer rate and reducing the total number of physical interconnections. Most work on heuristic application mapping for mesh-based NoC design aims to minimize both total communication energy and run-time; however, these approaches suffer from the following issues: (i) relatively high CPU time due to linear search over the task and tile mapping combinations, (ii) relatively high communication energy due to random tile selection when two or more tiles are equivalent in terms of average weighted distance to their adjacent mapped tasks, and (iii) even after constructive application mapping, some tasks consume more communication energy because of inappropriate placement. In this paper we present a low time-complexity heuristic algorithm for mapping a weighted application graph under a permissible bandwidth constraint to minimize the communication energy of a 2-D mesh-based NoC architecture. Experimental results on multimedia benchmarks, as well as on randomly generated samples, show low communication energy and low time complexity under bandwidth constraints in comparison with recent heuristic application mapping approaches. In our approach, the communication energy is also close to the optimal solution obtained by Integer Linear Programming (ILP).
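
A common proxy for the communication energy of a mesh mapping is the traffic volume of each task pair weighted by the hop distance between their tiles. The sketch below uses that proxy; it is an assumption rather than the paper's exact energy model, and the names `manhattan` and `communication_cost` are hypothetical.

```python
def manhattan(a, b):
    """Hop count between two tiles under minimal (e.g. XY) routing on a 2-D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(edges, placement):
    """Sum of bandwidth * hop distance over all communicating task pairs.

    edges: {(src_task, dst_task): bandwidth}
    placement: {task: (x, y)}; the mapping must be one-to-one.
    """
    tiles = list(placement.values())
    if len(set(tiles)) != len(tiles):
        raise ValueError("two tasks are mapped onto the same tile")
    return sum(bw * manhattan(placement[s], placement[d])
               for (s, d), bw in edges.items())


# Toy application graph: the heaviest edge (a -> b) should sit on adjacent tiles.
edges = {("a", "b"): 10, ("b", "c"): 2, ("a", "c"): 1}
good = {"a": (0, 0), "b": (1, 0), "c": (1, 1)}
poor = {"a": (0, 0), "b": (1, 1), "c": (1, 0)}
```

Here `communication_cost(edges, good)` is 14 and `communication_cost(edges, poor)` is 23, which shows why placing heavily communicating tasks on neighboring tiles matters.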

8.
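Research on Mapping Irregular IP Modules onto a 2-D NoC Structure   Total citations: 1, self-citations: 0, citations by others: 1
A new NoC (Network on Chip)-based mapping method for irregular IP modules is proposed. The basic idea is to decompose a large IP module into several small virtual IP models, or to combine several small IP modules into one virtual IP model, so that each virtual IP model can be mapped onto one resource node of the NoC structure. The size of the buffer in each communication node can be determined by computing the Manhattan distance and the input/output degree. The initial mapping result can be adjusted according to the computed communication cost, thereby avoiding communication congestion and reducing system power consumption.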

9.
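To address the special requirements that energy and storage capacity impose on routing in wireless sensor networks (WSNs), an improved ant algorithm is proposed that keeps node energy consumption relatively balanced while avoiding congestion. Simulation results show that the algorithm reduces communication load and energy consumption more effectively.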

10.
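As the number of processor cores grows, the structure of the on-chip interconnection network (NoC) becomes increasingly complex, so both the share of power consumed by the NoC and the difficulty of power analysis increase. Task mapping on an NoC must guarantee high-performance communication among the processor cores while consuming as little power and area as possible, that is, achieve high performance under limited power and area overhead. In task mapping, the communication distance between cores is the key to reducing task communication power. Contiguous, nearly convex regions help shorten the communication distance of tasks. This paper analyzes a power-optimal heuristic NoC mapping algorithm (INC), which consists of a region selection algorithm and a node mapping algorithm. Two factors of the region selection algorithm are improved so as to minimize the total communication overhead of an application and to ensure that subsequent applications can select regions at a small communication cost. A new mapping algorithm based on the selected region is also proposed. Experimental results on the mapping of dynamically arriving programs show that, compared with INC, the new region selection and node mapping algorithms reduce communication power by 12.10% and improve communication latency by 11.23%.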

11.
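To address the problem of mapping computation tasks appropriately onto a three-dimensional network-on-chip (3D NoC), an improved algorithm based on the genetic algorithm (GA) is proposed. GA offers fast randomized search, and Prim's algorithm yields a minimum spanning tree of a weighted connected graph. The improved algorithm combines the strengths of both to assign computation tasks appropriately to the network nodes, and is highly efficient at optimizing the power consumption and heat dissipation of 3D NoCs. Simulation experiments compare 3D NoC mapping using the proposed Prim-based improved GA with mapping using the basic GA. The results show that the Prim-based improved GA has lower average power; overall, the size of the power reduction is positively correlated with the number of processing elements, and with 101 processing elements the average power is 32% lower than with the basic GA.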

12.
The significant speed gap between processor and memory makes last-level cache performance crucial for multi-core architectures (MCA). Non-uniform cache architecture (NUCA) has been proposed to overcome the performance limitations of MCA for many embedded applications. The cache is partitioned into sub-banks, with each sub-bank being an independently accessible entity connected by a fast on-chip network (NoC). This paper presents two NoC-assisted mechanisms to improve the performance and power consumption of NUCA coherence. The first mechanism provides priority-based communication built on the wormhole routing architecture to support NUCA coherence. High-priority coherence packets are transmitted first to save time. The second mechanism offers multicast communication based on the proposed priority-based NoC to provide efficient cache coherence for NUCA. We dispatch and collect coherence packets at the collecting nodes (CN) to further decrease the number of coherence messages flowing in the NoC. Experimental results show that the priority-based transmission can improve performance by approximately 10%. The proposed multicasting mechanism can further improve performance and decrease the power consumption of the NoC in NUCA by approximately 15%. Together, the two proposed mechanisms enhance performance by 25% on average.

13.
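Delay-Optimized Low-Power Mapping for Networks-on-Chip   Total citations: 2, self-citations: 1, citations by others: 2
The network-on-chip (NoC) is an effective solution to the power, delay, synchronization, and signal-integrity challenges faced by traditional bus-based systems-on-chip (SoCs). Power and delay are important constraints and performance metrics in NoC design, and there is room for optimization at every design stage. Based on the ant colony optimization algorithm, power and delay in the NoC mapping stage are reduced by distributing concurrent communication events evenly across the communication links. Simulation experiments show that, compared with methods that balance link traffic load, this scheme further optimizes power and delay in the topology mapping stage.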

14.
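In networks-on-chip (NoCs), replacing traditional electrical signal transmission with optical communication to obtain low latency and low power has become an emerging research direction: the optical network-on-chip (ONoC). This paper proposes a new wavelength-routed on-chip network with bidirectional transmission. The new structure routes between network nodes by examining the wavelength of the modulated optical signal, and it also achieves bidirectional data transmission by sharing devices and transmission channels. Compared with a traditional electrical transmission network, the proposed bidirectional structure reduces hardware overhead by 50% and chip area overhead by 70%, improves device utilization, lowers network transmission latency, and greatly improves network transmission performance, which is of great significance for optical-interconnect networks-on-chip.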

15.
Wireless communication over LTE (Long Term Evolution) poses several design challenges for industry and academia because of its high throughput demand, especially for handheld mobile devices, where the power budget is very limited and high throughput requires more computation power. At the same time, the industry is looking for a flexible hardware solution, a Software Defined Radio (SDR), to amortize the huge cost of hardware changes needed to follow the continued evolution of wireless standards. In this article, an MPSoC design is presented for the baseband processing of a 20 MHz LTE system. Transport Triggered Architecture (TTA) is preferred over conventional DSP/VLIW architectures as the processing element (PE) of the MPSoC. Processing tasks are statically scheduled. Synchronization among the PEs is based on polling of a shared memory space. In addition, an approach is presented for organizing the I/O buffers so that the stalling probability of a PE is reduced and data- and task-level parallelism is exploited efficiently. The total power consumption of all the PEs, synthesized in 130 nm technology at 200 MHz and 1.5 V, is 105.04 mW. The total energy consumption to process one subframe, including carrier recovery, is 0.0767 mJ. Our study shows that TTA brings several improvements over conventional SIMD/VLIW architectures. Unlike designs that resolve dependencies at run time, TTA offers guaranteed performance and lower energy consumption because all data dependency/independency issues are resolved at compile time. TTA also has reduced register file (RF) traffic, fewer RF ports, and a lower overall cycle count for a given task.
To the best of the authors' knowledge, this article is among the first few published articles on LTE receiver implementation that report figures such as time, frequency, and power, and perhaps the first to explain in detail the data access pattern for processing an LTE subframe, the memory organization, the subsystem interconnection, and the synchronization.

16.
Networks-on-Chip (NoCs) are the most popular interconnection mechanism for systems that require flexibility, extensibility, and low power consumption. However, communication performance is strongly related to the routing algorithm used in the NoC. The most important issues in the routing process are deadlock, livelock, congestion, and faults. In this paper, a classification of NoC routing protocols is proposed according to the problems they address. Two main families emerge: mono-objective and multi-objective. The advantages and drawbacks of each protocol family are discussed. A summary of the most used and the less used practices in this field is provided. This survey shows that it is hard to satisfy the four objectives at the same time with classical methods, highlighting the strengths of multi-objective approaches.

17.
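Topology-Aware Application Mapping and Optimization for Network-on-Chip Architectures   Total citations: 1, self-citations: 0, citations by others: 1
Application mapping is one of the key problems in NoC architecture research, and the quality of the mapping result greatly affects the performance of the architecture. Most existing application mapping methods target a specific network structure, such as 2d-mesh or 2d-torus, and study application mapping and optimization under NoC performance or power constraints. This paper proposes a topology-aware NoC application mapping and optimization method based on high-level code transformation. The method uses the polyhedral model to automatically parallelize the core loops of the application and optimize their locality, abstracts the network topology as a weighted directed graph, and uses that directed graph to cover the task flow graph, improving task parallelism and reducing inter-task synchronization and communication overhead. Experimental results show that with the optimized mapping method the parallelism among task nodes is fully exploited, communication overhead is reduced, and overall NoC system performance improves.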

18.
Three-dimensional integrated circuits (3D ICs) are suitable alternatives to traditional two-dimensional (2D) ICs because of their better performance and packaging, and have therefore attracted considerable research attention. In addition, emerging network-on-chip (NoC) based many-core chips offer great potential for running multiple applications simultaneously. However, this approach increases the interference between applications, which lowers the performance of each application. Hence, mapping tasks belonging to various applications onto the nodes of an architecture is a very important issue. In this study, based on the partitioning concept, a novel methodology is presented for run-time mapping of multiple applications onto an irregular wireless 3D NoC-based multiprocessor system-on-chip (MPSoC) platform in which each processing element (PE) can support more than one task. In the second algorithm (enhanced irregular-partitioning best neighbor), the partitioning of the network is changed dynamically according to the number of applications running simultaneously, in order to minimize the communication overhead and congestion on the NoC and thus achieve more efficient task mapping. The simulation results reveal that the second proposed algorithm (enhanced IPBN), in comparison with the NPBN (non-partitioning best neighbor) algorithm and our first proposed algorithm (basic IPBN), enhances performance by decreasing the total execution time, average hop count, average channel load, and energy consumption.

19.
Multiprocessor system-on-chip (MP-SoC) platforms represent an emerging trend for embedded multimedia applications. To enable MP-SoC platforms, scalable communication-centric interconnect fabrics, such as networks-on-chip (NoCs), have recently been proposed. The shared memory is one of the key elements in designing MP-SoCs to provide data exchange and synchronization support. This paper focuses on the energy/delay exploration of a distributed shared memory architecture suitable for low-power on-chip multiprocessors based on NoC. A mechanism is proposed for data allocation in the distributed shared memory space, dynamically managed by an on-chip hardware memory management unit (HwMMU). Moreover, the exploitation of the HwMMU primitives for the migration, replication, and compaction of shared data is discussed. Experimental results show the impact of different distributed shared memory configurations for a selected set of parallel benchmark applications from the power/performance perspective. Furthermore, a case study for a graph exploration algorithm is discussed, accounting for the effects of the core mapping and the network topology on energy and performance at the system level.

20.
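Reconfigurable array processors access large volumes of memory data, demand high data parallelism, and exhibit little global data reuse but pronounced locality. To match these characteristics, an efficient intra-cluster, local-priority interconnect access structure for a distributed cache architecture is proposed. The structure allows the 4×4 PEs in a cluster to access 4×4 caches in parallel, and it was synthesized for FPGA on the Xilinx ZYNQ-series device XC7Z045 FFG900-2. In the conflict-free case, the interconnect supports simultaneous read/write accesses from all 16 PEs in a cluster, with a maximum frequency of 221 MHz and a peak memory access bandwidth of 7.6 GB/s. A gray-level co-occurrence matrix algorithm for extracting texture image features was implemented on this structure, achieving a memory access bandwidth of 478.125 MB/s and a run time of 0.24 ms.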
