Similar literature
19 similar articles found.
1.
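With the continuing development of device, process, and application technologies, chip multiprocessors (CMPs) have become a mainstream technology. As CMPs scale up and integrate ever more processor cores, the network-on-chip (NoC) that interconnects the cores and other on-chip components is becoming one of the bottlenecks limiting CMP performance. The NoC topology defines the physical layout and interconnection of the network's nodes, and determines or influences the network's cost, latency, throughput, area, fault tolerance...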

2.
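A survey of network-on-chip interconnection topologies   Total citations: 1 (self-citations: 0, citations by others: 1)
With the continuing development of device, process, and application technologies, chip multiprocessors (CMPs) have become a mainstream technology. As CMPs scale up and integrate ever more processor cores, the network-on-chip (NoC) that interconnects the cores and other on-chip components is becoming one of the bottlenecks limiting CMP performance. The NoC topology defines the physical layout and interconnection of the network's nodes. It determines or influences the network's cost, latency, throughput, area, fault tolerance, and power consumption, and it also affects the routing strategy and the chip's placement and routing, which makes it one of the key topics in NoC research. The paper compares different NoC topologies, analyzes the performance of each, and offers suggestions for future research on NoC topology.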

3.
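Performance analysis of two-dimensional on-chip network interconnects   Total citations: 1 (self-citations: 1, citations by others: 1)
On-chip interconnection networks have increasingly become one of the important factors affecting CMP performance, and almost all interconnection structures evolved from two-dimensional networks. The paper first analyzes the static properties of several common 2D networks whose internal nodes all have degree 4, and proposes a new routing structure and communication protocol for 2D on-chip networks. Using a global uniform random traffic model, it then analyzes the dynamic behavior of the different networks while varying network size and traffic intensity. Taking the number of links as the communication cost, it proposes a new composite performance metric for interconnects: the latency-load capacity per unit network cost. Finally, it compares the overall performance of 2D on-chip interconnects and identifies the scenarios each one suits.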

4.
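Hierarchical-ring network-on-chip interconnection   Total citations: 1 (self-citations: 0, citations by others: 1)
In large-scale and very-large-scale on-chip interconnection networks, the poor performance of 2D interconnects makes multidimensional interconnects a candidate alternative. The paper first designs a hierarchical ring interconnect based on region partitioning and analyzes its static interconnection properties. It then designs a routing structure and routing method for hierarchical rings based on Karnaugh-map coding, tests network performance under uniform traffic for different ways of sizing the buffers on the inter-level ring links, and analyzes in detail the dynamic network behavior when those buffers are sized according to a geometric sequence. Finally, based on a performance comparison with 2D on-chip interconnects such as Mesh, it states where hierarchical rings are appropriate. The experimental results show that, although hierarchical rings perform worse in smaller networks, they can interconnect large and very large NoCs at lower cost and higher performance. The single-ring hierarchical scheme has better overall performance under low network load, whereas the dual-ring scheme has greater load capacity and performs better under high load.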
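The abstract above does not spell out its Karnaugh-map coding. As an illustration only, the following sketch assumes it resembles the reflected binary Gray code used to label Karnaugh-map cells, in which neighbouring labels differ in exactly one bit. The function names `ring_labels` and `hamming` are hypothetical and are not taken from the paper.

```python
def ring_labels(bits):
    """Reflected binary Gray code: one label per node of a ring of 2**bits nodes.

    Assumption: this stands in for the paper's Karnaugh-map coding, which the
    abstract does not define.
    """
    return [i ^ (i >> 1) for i in range(1 << bits)]


def hamming(a, b):
    """Number of bit positions in which two labels differ."""
    return bin(a ^ b).count("1")


labels = ring_labels(3)
# Walk the ring, including the wrap-around link from the last node to the first.
neighbour_distances = [
    hamming(labels[i], labels[(i + 1) % len(labels)]) for i in range(len(labels))
]
```

With such a labelling, a router could compare a packet's destination label with its own and see at which bit position they first differ; whether the paper's routing method works this way is not stated in the abstract.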

5.
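衡霞, 支亚军, 韩俊刚. 《计算机科学》 (Computer Science), 2013, 40(Z6): 220-222
Building on a study of NoC quality of service (QoS), the paper proposes a 64-core NoC architecture for multiprocessors. IP units generate packets of different types, and the network provides priority-level service to meet the low-latency needs of high-priority packets. Performance statistics show that the model meets the QoS requirements for communication of the different packet types among the processors.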
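Priority-level service of the kind described above can be sketched as a router output port that always forwards the most urgent waiting packet first. This is a minimal illustration, not the paper's design: the class name `PriorityPort`, the convention that a smaller number means higher priority, and the first-in-first-out tie-break are all assumptions.

```python
import heapq
from itertools import count


class PriorityPort:
    """Hypothetical output port: lowest priority number leaves first,
    and packets of equal priority leave in arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # monotonically increasing tie-breaker

    def enqueue(self, priority, packet):
        heapq.heappush(self._heap, (priority, next(self._arrival), packet))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


port = PriorityPort()
port.enqueue(2, "bulk-0")
port.enqueue(0, "ctrl-0")
port.enqueue(2, "bulk-1")
port.enqueue(1, "stream-0")
order = [port.dequeue() for _ in range(len(port))]
```

A strict-priority scheme like this keeps high-priority latency low but can starve low-priority traffic under sustained load; the abstract does not say how, or whether, the proposed network addresses that.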

6.
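Designing customized NoCs to meet the needs of specific applications has become a trend in NoC design. Custom special-purpose systems generally consist of many different types of devices, and mapping them onto a traditional regular topology can yield a low performance/cost ratio. Customized NoCs based on fine-grained design have therefore become the mainstream choice for domain-specific system architectures. Fine-grained design, however, poses many challenges for hardware designers, and traditional manual design is very time-consuming, so exploring custom topologies in a way that is both accurate and agile has become an important challenge. To find the best topology for a customized NoC, the paper designs an accurate and efficient exploration algorithm. To reduce time complexity, it also proposes a heuristic linear programming algorithm, HLP, which speeds up traversal across multiple network layers. Compared with a traditional Mesh topology, the generated topologies improve performance by about 20% and reduce the average hop count by about 30%. The exploration algorithm generates customized NoC architectures automatically in linear time, scales well, and can be applied to large-scale systems-on-chip.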

7.
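NoC topology directly affects chip performance. The paper proposes a new topology, TM, that combines the advantages of torus and mesh networks. For an n×n network, TM has the same number of physical links as the mesh and 2n fewer than the torus. TM has a diameter of n, while the torus and mesh have diameters of 2×(n/2) and 2×(n-1), respectively. For fully adaptive routing, the torus needs at least 3 virtual channels and the virtual-network partitioning mechanism cannot be applied to it directly, whereas the mechanism does apply to mesh and TM, which need only 2 virtual channels. TM is validated both theoretically and by simulation. The experiments show that TM outperforms mesh under both balanced and unbalanced load, and that in most cases TM's performance lies between that of mesh and torus. Under some traffic patterns the torus performs worse than TM, mainly because virtual channel usage in the torus is unbalanced under those patterns.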
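The link counts and diameters quoted above for mesh and torus can be checked by brute force. The sketch below builds both n×n networks and measures them with breadth-first search. It does not model TM itself, because the abstract does not describe TM's wiring. The helper names are hypothetical, the torus diameter is assumed to mean 2×⌊n/2⌋, and n ≥ 3 is assumed so that wrap-around links are distinct from mesh links.

```python
from collections import deque


def build_grid(n, wrap):
    """Adjacency sets of an n x n mesh (wrap=False) or torus (wrap=True)."""
    adj = {(r, c): set() for r in range(n) for c in range(n)}
    for r in range(n):
        for c in range(n):
            for dr, dc in ((0, 1), (1, 0)):  # link to the east and south neighbour
                nr, nc = r + dr, c + dc
                if wrap:
                    nr, nc = nr % n, nc % n
                elif nr >= n or nc >= n:
                    continue  # a mesh has no wrap-around links
                adj[(r, c)].add((nr, nc))
                adj[(nr, nc)].add((r, c))
    return adj


def link_count(adj):
    """Each bidirectional link appears in two adjacency sets."""
    return sum(len(neigh) for neigh in adj.values()) // 2


def diameter(adj):
    """Longest shortest path, found by a breadth-first search from every node."""
    longest = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        longest = max(longest, max(dist.values()))
    return longest


n = 4
mesh, torus = build_grid(n, wrap=False), build_grid(n, wrap=True)
extra_torus_links = link_count(torus) - link_count(mesh)
```

For n = 4 the torus has 2n = 8 more links than the mesh, which matches the abstract's statement that TM, having the mesh's link count, uses 2n fewer links than the torus.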

8.
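This paper discusses the current state and trends of NoC development, summarizes the advantages and disadvantages of NoCs relative to other networks, and focuses on NoC topologies and communication methods.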

9.
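As the number of processors integrated on a single chip grows, traditional electrical interconnection networks can no longer meet performance demands and a new kind of interconnect is needed; optical interconnection network technology emerged in response. Electrical on-chip networks have hit bottlenecks in power, performance, bandwidth, and latency, whereas optical interconnect, introduced into NoCs as a new interconnect, offers unmatched advantages such as low loss, high throughput, and low latency. This paper mainly discusses the on-chip optical network's...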

10.
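The rapid development of semiconductor technology and the growing complexity of system-on-chip applications have made the throughput, power, latency, and clock synchronization problems of on-chip interconnects more complex, giving rise to the network-on-chip, which separates the communication mechanism from computing resources. NoC design involves issues at every level from the physical layer to the application layer. This paper presents key techniques in NoC design (design flow, topology, routing, switching, and performance evaluation) and points out open problems in current research and directions for future work.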

11.
We present a new architecture-level unified reliability evaluation methodology for chip multiprocessors (CMPs). The proposed reliability estimation (REST) is based on a Monte Carlo algorithm. What distinguishes REST from previous work is that both the computational and communication components are considered in a unified manner to compute the reliability of the CMP. We use the REST tool to develop a new dynamic reliability management (DRM) scheme to address time-dependent dielectric breakdown and negative-bias temperature instability aging mechanisms in network-on-chip (NoC) based CMPs. Designed as a control loop, the proposed DRM scheme uses an effective neural-network-based reliability estimation module. The neural-network predictor is trained using the REST tool. We investigate how the system's lifetime changes depending on whether the NoC, as the communication unit of the CMP, is considered during the reliability evaluation process, and find that differences can be as high as 60%. Full-system simulations using a customized GEM5 simulator show that reliability can be improved by up to 52% using the proposed DRM scheme in a best-effort scenario, with a 2–9% performance penalty (using a user-set target lifetime of 7 years) over the case when no DRM is employed.
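The effect of including or excluding the NoC in a Monte Carlo lifetime estimate can be illustrated with a toy model. The sketch below is not the REST tool: it assumes Weibull-distributed component lifetimes, a chip that fails as soon as any one component fails, and made-up scale parameters for cores and routers. The function name `estimate_mttf` is hypothetical.

```python
import random


def estimate_mttf(scales, shape=2.0, trials=20000, seed=0):
    """Monte Carlo estimate of mean time to failure for a series system.

    scales: Weibull scale parameter (in years) of each component.
    Assumption: the system fails at the earliest component failure.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Sample one lifetime per component; the shortest one ends the trial.
        total += min(rng.weibullvariate(scale, shape) for scale in scales)
    return total / trials


cores = [10.0] * 4    # hypothetical core lifetime scales
routers = [12.0] * 4  # hypothetical NoC router lifetime scales

mttf_cores_only = estimate_mttf(cores)
mttf_with_noc = estimate_mttf(cores + routers)
```

Under these assumptions the estimate that ignores the routers is noticeably more optimistic than the one that includes them, which is the kind of gap the abstract reports when the NoC is left out of the evaluation.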

12.
Virtualizing network-on-chip resources in chip-multiprocessors   总被引:1,自引:0,他引:1  
The number of cores on a single silicon chip is growing rapidly, and chips containing tens or even hundreds of identical cores are expected in the future. To take advantage of multicore chips, multiple applications will run simultaneously. As a consequence, traffic interference between applications increases and the performance of individual applications can be seriously affected. In this paper, we improve individual application performance when several applications are running simultaneously. The proposal is based on the virtualization concept and significantly reduces execution time and network latency.

13.
In tiled chip multiprocessors (CMPs), last-level cache (LLC) banks are usually shared but distributed among the tiles. A static mapping of cache blocks to the LLC banks leads to poor efficiency, since a block may be mapped far from the tiles that actually access it. Dynamic policies either rely on the static mapping of blocks to a set of banks (D-NUCA) or rely on the OS to dynamically load pages to statically mapped addresses (first-touch).

14.
Networks-on-chip will serve as the central integration platform in future complex system-on-chip (SoC) designs composed of a large number of heterogeneous processing resources. Most researchers advocate traditional regular networks such as meshes, tori, or trees as architectural templates, which gained high popularity in general-purpose parallel computing. However, most SoC platforms are special-purpose and tailored to the domain-specific requirements of their application. They are usually built from a wide variety of heterogeneous components that communicate in a very specific, mostly irregular way.

In this work, we propose a methodology for the design of customized irregular networks-on-chip, called INoC. We take advantage of a priori knowledge of the application's communication characteristics to generate an optimized network topology and routing algorithm. We show that, for irregular workloads, customized irregular networks are clearly superior to traditional regular architectures in performance at comparable implementation cost. Moreover, they inherently offer a high degree of scalability and expandability, which allows the network to be adapted to an arbitrary number of nodes with a given communication demand. This normally cannot be accomplished by traditional approaches.


15.
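陈嘉, 安虹, 刘圆, 王莉. 《计算机仿真》 (Computer Simulation), 2007, 24(6): 81-85
Multicore architectures use parallel programming models explicitly directed by the user, with locks and synchronization variables for synchronization. The transactional memory model can solve a range of problems caused by locking and improve program concurrency. The paper introduces a parallel programming model for its proposed transactional-memory-based multicore architecture (Transactional-Memory based Chip Multiple-Superscaler, TMCMS), along with an execution model for loop programs. Using an FFT program as an example, it details the parallelization method and compilation transformation for loop structures. In preliminary experiments, as the number of processing units increases from 1 to 16, IPC (instructions per cycle) grows nearly linearly under the programming model. This indicates that the model can exploit the latent fine-grained thread-level parallelism in programs while keeping parallel programming simple.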

16.
Current high-end microprocessors achieve high performance by adding more features and therefore increasing complexity. This paper makes the case for a chip multiprocessor based on the Data-Driven Multithreading (DDM-CMP) execution model in order to overcome the limitations of current design trends. Data-Driven Multithreading (DDM) is a multithreading model that effectively hides communication delay and synchronization overheads. DDM-CMP avoids the complexity of other designs by combining simple commodity microprocessors with a small hardware overhead for thread scheduling and an interconnection network. Preliminary experimental results show that a DDM-CMP chip with the same hardware budget as a high-end commercial microprocessor, clocked at the same frequency, achieves a speedup of up to 18.5 while consuming 78–81% of the commercial chip's power. Overall, the estimated results for the proposed DDM-CMP architecture show a significant benefit in both speedup and power consumption, making it an attractive architecture for future processors.

17.
As CMOS feature sizes continue to shrink and traditional microarchitectural methods for delivering high performance (e.g., deep pipelining) become too expensive and power-hungry, chip multiprocessors (CMPs) become an exciting new direction by which system designers can deliver increased performance. Exploiting parallelism in such designs is the key to high performance, and we find that parallelism must be exploited at multiple levels of the system: the thread-level parallelism that has become popular in many designs fails to exploit all the levels of parallelism available in many workloads for CMP systems. We describe the Cell Broadband Engine and the multiple levels at which its architecture exploits parallelism: data-level, instruction-level, thread-level, memory-level, and compute-transfer parallelism. By taking advantage of opportunities at all levels of the system, this CMP revolutionizes parallel architectures to deliver previously unattained levels of single-chip performance. We describe how the heterogeneous cores make this performance achievable by parallelizing and offloading computation-intensive application code onto the Synergistic Processor Element (SPE) cores using a heterogeneous thread model with SPEs. We also give an example of scheduling code to tolerate memory latency using software pipelining techniques in the SPE. This paper is based in part on “Chip multiprocessing and the Cell Broadband Engine”, ACM Computing Frontiers 2006.

18.
Coherence protocols consume a significant fraction of power to determine which coherence action to perform. Specifically, on CMPs with a shared cache and a directory-based coherence protocol implemented as a duplicate of the local cache tags, we have observed that a large fraction of directory lookups miss because the block looked up is not allocated in any local cache. To reduce the number of directory lookups, and therefore the power consumption, we propose adding a filter before the directory access. We introduce two filter implementations. In the first, filtering information is explicitly kept in the shared cache for every block. In the second, filtering information is decoupled from the shared cache organization, so the filter size does not depend on the shared cache size. We evaluate our filters in a CMP with 8 in-order processors with 4 threads each and a memory hierarchy with write-through local caches and a shared cache. We show that, for SPLASH2 benchmarks, the proposed filters reduce the number of directory lookups by 60% while power consumption is reduced by ∼28%. For Specweb2005, the number of directory lookups is reduced by 68% (44%), while directory power consumption is reduced by 19% (9%) using the first (second) filter implementation.
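The idea of filtering directory lookups can be sketched with a small bookkeeping structure that records how many local caches currently hold each block, so that a lookup is skipped when no local cache can have it. This is an illustration only and matches neither of the paper's two implementations in detail: the class `DirectoryFilter`, its method names, and the exact per-block counter are all assumptions.

```python
from collections import Counter


class DirectoryFilter:
    """Hypothetical filter placed in front of the directory."""

    def __init__(self):
        self._holders = Counter()  # block address -> number of local caches holding it
        self.lookups = 0           # requests that still reach the directory
        self.filtered = 0          # requests answered without a directory lookup

    def on_fill(self, block):
        """A local cache allocated the block."""
        self._holders[block] += 1

    def on_evict(self, block):
        """A local cache dropped the block."""
        if self._holders[block] > 0:
            self._holders[block] -= 1
            if self._holders[block] == 0:
                del self._holders[block]

    def needs_directory_lookup(self, block):
        """True only if some local cache may hold the block."""
        if block in self._holders:
            self.lookups += 1
            return True
        self.filtered += 1
        return False


f = DirectoryFilter()
f.on_fill(0x40)
first = f.needs_directory_lookup(0x40)   # held locally, so the directory is consulted
second = f.needs_directory_lookup(0x80)  # held nowhere, so the lookup is skipped
f.on_evict(0x40)
third = f.needs_directory_lookup(0x40)   # no longer held, so the lookup is skipped
```

An exact counter per block is the simplest correct filter. A design whose size is decoupled from the shared cache, as in the paper's second implementation, would need a more compact representation, which the abstract does not describe.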

19.
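Optimizing VTD-XML node query execution performance on multicore processors   Total citations: 1 (self-citations: 0, citations by others: 1)
郭宪勇, 陈性元, 邓亚丹. 《计算机科学》 (Computer Science), 2014, 41(2): 179-181, 190
Targeting today's mainstream multicore processors, the paper studies performance optimization of VTD-XML-based node query execution. Based on a prefetching strategy, it optimizes XML node query performance in two respects: concurrent multithreaded execution and improved per-thread memory access performance. Experimental results show that the proposed multithreaded XML document parsing framework makes full use of the computing resources of multicore processors, effectively improves threads' memory access performance, and greatly improves XML node query performance.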
