Similar Documents
20 similar documents found (search time: 31 ms)
1.
The industry trend for processors is toward integrating an increasing number of cores into a single chip. Researchers have to deal with frequent data migration across the network-on-chip and increasing on-chip traffic. Moving from a flat to a hierarchical organization is probably a natural design methodology for scalable systems (Martin et al. in Commun ACM, 55(7):78–89, 2012. doi: 10.1145/2209249.2209269). Unfortunately, a hierarchical directory protocol inevitably adds on-chip traffic overhead, protocol complexity, and access latency. In this paper, we target hierarchical cache coherence protocols to overcome the potentially high cost of maintaining cache coherence in current multicore processors. We propose a novel vertical caching protocol combined with grouped coherence, in which the coherence domain expands on demand. More specifically, its design philosophy is to provide 'best-effort' single-copy delivery, which allows shared data to reside only in the first common shared level. Compared to the previous hierarchical protocol, our proposal achieves a performance improvement of 9.9% in a 16-core system and 13.4% in a 64-core system, as well as an on-chip traffic reduction of about 10.8% and 15.9%, respectively.

2.
Computer Communications, 2002, 25(11–12):1009–1017
The token bucket characterization provides a deterministic yet concise representation of a traffic source. In this paper, we study the impact of the long-range dependence (LRD) property of traffic generated by today's multimedia applications on the optimal dimensioning of token bucket parameters. To this end, we empirically illustrate the difference between the token bucket characteristics of traffic exhibiting different degrees of time dependence but with identical macroscopic properties (i.e. inter-arrival time and packet size distributions). In addition, we use a statistical model to analytically determine optimal token bucket parameters under various optimization criteria. The statistical model is based on fractional Brownian motion and takes LRD into account. We apply this model to several aggregated MPEG video sources. We then assess the validity of these analytic results by comparing them to empirical results. We conclude that the analytic approach presented here is effective in optimally sizing token buckets for LRD traffic, and promises to be applicable under different traffic conditions and for various optimization criteria.
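The (r, b) token bucket characterization the paper dimensions can be illustrated with a minimal conformance-check sketch; the class and parameter names below are hypothetical, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class TokenBucket:
    """Token bucket (r, b): tokens accrue at `rate` up to `depth`."""
    rate: float            # token fill rate (bytes per second)
    depth: float           # bucket depth b (bytes)
    tokens: float = None   # current token level; filled in below
    last: float = 0.0      # timestamp of the previous arrival

    def __post_init__(self):
        if self.tokens is None:
            self.tokens = self.depth  # bucket starts full

    def conforms(self, t: float, size: float) -> bool:
        """True if a packet of `size` bytes arriving at time t conforms."""
        # refill tokens for the elapsed time, capped at the bucket depth
        self.tokens = min(self.depth, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False
```

Dimensioning then amounts to picking the smallest (rate, depth) pair for which a given traffic trace stays conforming; burstier (e.g. LRD) traces need a deeper bucket at the same rate.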

3.
One of the major problems with GPU on-chip shared memory is bank conflicts. Our analysis shows that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth nor by the shared memory latency (as long as it stays constant), but rather by the varied latencies caused by memory bank conflicts. These cause conflicts at the writeback stage of the in-order pipeline and pipeline stalls, degrading system throughput. Based on this observation, we investigate and propose a novel Elastic Pipeline design that minimizes the negative impact of on-chip memory bank conflicts on system throughput by decoupling bank conflicts from pipeline stalls. Simulation results show that our proposed Elastic Pipeline, together with the co-designed bank-conflict-aware warp scheduling, reduces pipeline stalls by up to 64.0% (42.3% on average) and improves overall performance by up to 20.7% (13.3% on average) for representative benchmarks, at trivial hardware overhead.

4.
A Survey of Network-on-Chip Interconnection Topologies
With continuous advances in devices, process technology, and applications, chip multiprocessors have become mainstream, and their scale keeps growing, with ever more processor cores integrated on a single chip. The network-on-chip (NoC) that interconnects the on-chip processor cores and other components is gradually becoming one of the bottlenecks limiting chip multiprocessor performance. The NoC topology defines the physical layout and interconnection of the nodes inside the network; it determines and influences the cost, latency, throughput, area, fault tolerance, and power consumption of the NoC, and it also affects the network routing strategy and the chip's placement and routing, making it one of the key issues in NoC research. This paper compares different NoC topologies, analyzes the performance of each structure, and offers suggestions for future research on NoC topologies.

5.
The benefit of Class-of-Service (CoS) is an important topic in the "Network Neutrality" debate. As part of the debate, it has been suggested that over-provisioning is a viable strategy for meeting the performance targets of future applications, and that there is no need to worry about provisioning differentiated services in an IP backbone for a small fraction of users needing better-than-best-effort service. In this paper, we quantify the extra capacity required for an over-provisioned classless (i.e., best-effort) network compared to a CoS network providing the same delay or loss performance for premium traffic. We first develop a link model that quantifies the required extra capacity (REC). To illustrate the key parameters involved in analytically quantifying REC, we start with simple traffic distributions. Then, for more bursty traffic distributions (e.g., long-range dependent), we find the REC using ns-2 simulations of CoS and classless links. We then use these link models to quantify the REC for network topologies (obtained from Rocketfuel) under various scenarios, including situations with "closed-loop" traffic generated by many TCP sources that adapt to the available capacity. We also study the REC under link and node failures. We show that REC can still be significant even when the proportion of premium traffic requiring performance assurances is small, a situation often considered benign for the over-provisioning alternative. We also show that the impact of CoS on best-effort (BE) traffic is relatively small while still providing the desired performance for premium traffic.

6.
In this paper, a processor allocation mechanism for NoC-based chip multiprocessors is presented. Processor allocation is a well-known problem in parallel computer systems and aims to allocate the processing nodes of a multiprocessor to the different tasks of an input application at run time. The proposed mechanism targets optimizing on-chip communication power/latency and relies on two procedures: processor allocation and task migration. Allocation is done by a fast heuristic algorithm that allocates free processors to the tasks of an incoming application when a new application begins execution. The task-migration algorithm is activated when an application completes execution and frees up the allocated resources. Task migration uses the recently deallocated processors and tries to rearrange the current tasks in order to find a better mapping for them. The proposed method can also capture the dynamic traffic pattern of the network and perform task migration based on the current communication demands of the tasks. Consequently, task migration adapts the task mapping to the current network status. We adopt a non-contiguous processor allocation strategy in which the tasks of the input application are allowed to be mapped onto disjoint regions (groups of processors) of the network. We then use virtual point-to-point circuits, a state-of-the-art fast on-chip connection designed for networks-on-chip, to virtually connect the disjoint regions and bring the communication latency/power closer to the values offered by contiguous allocation schemes. The experimental results show considerable improvement over existing allocation mechanisms.

7.
A distributed counter allows each processor in an asynchronous message-passing network to access the counter value and increment it. We study the problem of implementing a distributed counter so that no processor is a communication bottleneck. We prove a lower bound of Ω(log n / log log n) on the number of messages that some processor must exchange in a sequence of n counting operations spread over n processors. We propose a counter that achieves this bound when each processor increments the counter exactly once. Hence, the lower bound is tight. Because most algorithms and data structures count in some way, the lower bound holds for many distributed computations. We feel that the proposed concept of a communication bottleneck is a relevant measure of efficiency for distributed algorithms and data structures, because it indicates the achievable degree of distribution.

8.
Monte Carlo Simulation Study of Self-Similar Queueing Systems
Self-similarity is a universal property of network traffic and has a significant impact on network performance. This paper uses the Monte Carlo method to study the performance of self-similar queueing systems. The study shows that long-range dependence and short-range dependence have very different effects on queueing performance, especially when the buffer is large. It is also found that the time scale over which long-range dependence takes effect depends on both the traffic and the parameters of the queueing system itself, which offers practical guidance for network design.
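This kind of Monte Carlo queueing study can be reproduced on a small scale with the standard Lindley waiting-time recursion; the sketch below uses short-range dependent (M/M/1) input as a baseline, and a long-range dependent trace could be substituted for the arrival sequence. All names and parameters are illustrative, not from the paper:

```python
import random

def lindley_waits(interarrivals, services):
    """Waiting time of each customer via the Lindley recursion:
    W(1) = 0, W(next) = max(0, W + S - A)."""
    w, waits = 0.0, []
    for a, s in zip(interarrivals, services):
        waits.append(w)
        w = max(0.0, w + s - a)
    return waits

# Short-range dependent baseline: M/M/1 at offered load rho = 0.8.
random.seed(1)
n = 100_000
rho = 0.8
arrivals = [random.expovariate(1.0) for _ in range(n)]        # mean 1
services = [random.expovariate(1.0 / rho) for _ in range(n)]  # mean rho
mean_wait = sum(lindley_waits(arrivals, services)) / n
# M/M/1 theory predicts E[W] = rho / (mu - lambda) = 3.2 here.
```

Feeding the same recursion with an LRD arrival process of identical mean is exactly the kind of comparison the paper performs, and is where the large-buffer differences appear.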

9.
The network-on-chip paradigm is an emerging paradigm that effectively addresses and presumably can overcome the many on-chip interconnection and communication challenges that already exist in today's chips or will likely occur in future chips. Effective on-chip implementation of network-based interconnect paradigms requires developing and deploying a whole new set of infrastructure IPs and supporting tools and methodologies. This special issue illustrates how, to date, engineers have successfully deployed NoCs to meet certain very aggressive specifications. At the same time, the articles reveal many issues and challenges that require solutions if the NoC paradigm is indeed to become a panacea or quasi-panacea for tomorrow's SoCs.

10.
Analysis of Key Parameters of Self-Similar Traffic
A large body of research shows that network traffic commonly exhibits self-similarity and long-range dependence, both of which have an important impact on network performance. Most current studies concentrate only on estimating the Hurst parameter and its performance impact, which is incomplete. This paper investigates the key parameters of self-similar traffic that affect network performance. Simulations of the influence of the Hurst parameter and the variance coefficient show that both have a significant impact on network performance. We analyze why variance affects network performance, study the relationship between G and variance and how to compute it, present an IDC-based parameter estimation algorithm for the composite fractal renewal process, and analyze the impact of the fractal onset time on network performance.
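One standard way to estimate the Hurst parameter discussed above is the aggregated-variance method; the following is a minimal sketch (function name and block sizes are illustrative, not taken from the paper):

```python
import math
import random

def hurst_aggvar(x, block_sizes=(1, 2, 4, 8, 16, 32)):
    """Aggregated-variance Hurst estimator: for a self-similar series the
    variance of the m-aggregated mean decays like m^(2H-2), so the slope
    beta of log Var vs log m gives H = 1 + beta / 2."""
    logs = []
    for m in block_sizes:
        # non-overlapping block means at aggregation level m
        blocks = [sum(x[i:i + m]) / m for i in range(0, len(x) - m + 1, m)]
        mu = sum(blocks) / len(blocks)
        var = sum((b - mu) ** 2 for b in blocks) / (len(blocks) - 1)
        logs.append((math.log(m), math.log(var)))
    # least-squares slope of the log-log points
    k = len(logs)
    sx = sum(u for u, _ in logs); sy = sum(v for _, v in logs)
    sxx = sum(u * u for u, _ in logs); sxy = sum(u * v for u, v in logs)
    beta = (k * sxy - sx * sy) / (k * sxx - sx * sx)
    return 1.0 + beta / 2.0
```

For uncorrelated input the slope is close to -1, giving H near 0.5; long-range dependent input yields H above 0.5.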

11.
Cell Multiprocessor Communication Network: Built for Speed
Kistler, M.; Perrone, M.; Petrini, F. IEEE Micro, 2006, 26(3):10–23
Multicore designs promise various power-performance and area-performance benefits. But inadequate design of the on-chip communication network can deprive applications of these benefits. To illuminate this important point in multicore processor design, the authors analyze the Cell processor's communication network, using a series of benchmarks involving DMA traffic patterns and synchronization protocols.

12.
In this paper, we study buffer queueing behaviour in high-speed networks. Some limited analytical derivations of queue models have been proposed in the literature, but their solutions are often a great mathematical challenge. We propose to use the Polya distribution to overcome such limitations. The specific behaviour of an IP interface with bursty traffic and long-range dependence is investigated by a version of the "classical" M/D/n queueing model called Polya/D/n. This is a queueing system with a Polya input stream (a negative-binomially distributed number of arrivals in a fixed time interval), a constant service time, multiple servers, and an infinite waiting room. The model is treated as a renewal process because of its quasi-random input stream and constant service time. We develop balance equations for the state of the system and obtain results for packet loss and delay. The finding that the Polya distribution is adequate for modelling bursty input streams at IP network interfaces motivated the proposal to evaluate the Polya/D/n system. It is shown that the variance of the input stream significantly changes the characteristics of the waiting system. The suggested model is new and allows defining different bursty traffic and evaluating losses and delays relatively easily.
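A rough Monte Carlo counterpart of the Polya/D/n model can be sketched with negative binomial per-slot arrivals and one deterministic service per server per slot. This is an illustrative discrete-time approximation, not the paper's balance-equation analysis; all names and parameter values are assumptions:

```python
import math
import random

def negbin(r, p):
    """Negative binomial draw (failures before the r-th success) via the
    gamma-Poisson (Polya) mixture: NB(r, p) = Poisson(Gamma(r, (1-p)/p))."""
    lam = random.gammavariate(r, (1 - p) / p)
    k, term, acc, u = 0, math.exp(-lam), math.exp(-lam), random.random()
    while u > acc:        # invert the Poisson CDF term by term
        k += 1
        term *= lam / k
        acc += term
    return k

def polya_d_n_loss(r, p, servers, buffer, slots):
    """Discrete-time sketch of Polya/D/n with a finite buffer: NB(r, p)
    arrivals per slot, each server removes one packet per slot, arrivals
    beyond `buffer` queued packets are dropped. Returns loss fraction."""
    q = lost = offered = 0
    for _ in range(slots):
        a = negbin(r, p)
        offered += a
        accepted = min(a, buffer - q)
        lost += a - accepted
        q = max(0, q + accepted - servers)
    return lost / max(offered, 1)

random.seed(7)
small = polya_d_n_loss(1, 1 / 3, 3, 5, 20000)   # mean 2 arrivals/slot
large = polya_d_n_loss(1, 1 / 3, 3, 50, 20000)
```

Even at a load well below one, the bursty (high-variance) input stream produces noticeable loss with a small buffer, which is the qualitative effect of input-stream variance the paper quantifies.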

13.
Molecular dynamics (MD) simulation has broad applications, and an increasing amount of computing power is needed to satisfy the large scale of real-world simulations. The advent of the many-core paradigm brings unprecedented computing power, but harvesting it remains a great challenge due to MD's irregular memory-access pattern. To address this challenge, this paper presents a joint application/architecture study to enhance the scalability of MD on Godson-T-like many-core architectures. First, a preprocessing approach leveraging an adaptive divide-and-conquer framework is designed to exploit locality through a memory hierarchy with software-controlled memory. Then three incremental optimization strategies are proposed to enhance on-chip parallelism for the Godson-T many-core processor: a novel data layout to improve data locality, an on-chip locality-aware parallel algorithm to enhance data reuse, and a pipelining algorithm to hide the latency to shared memory. Experiments on a Godson-T simulator exhibit a strong-scaling parallel efficiency of 0.99 on 64 cores, which is confirmed by a field-programmable gate array emulator. The performance per watt of MD on Godson-T is also much higher than on a 16-core Intel Core i7 symmetric multiprocessor (SMP), and 26 times higher than on an 8-core 64-thread Sun T2 processor. Detailed analysis shows that optimizations utilizing architectural features to maximize data locality and enhance data reuse benefit scalability most. Furthermore, a hierarchical parallelization scheme is designed to map the MD algorithm to a Godson-T many-core cluster, and a simple performance model is derived which suggests that the optimization scheme is likely to scale well toward exascale. Certain architectural features are found essential for these optimizations, which could guide future hardware development.

14.
Design of a Quality-of-Service NoC Router
The complex applications of systems-on-chip have made on-chip interconnect a system performance bottleneck, giving rise to communication architectures centered on the network-on-chip. The router is the key component of a NoC: it carries out the transmission of data over the NoC topology. This paper presents the design of a NoC router that supports quality of service. It uses connection-oriented, fine-grained data switching to provide strict end-to-end delay guarantees for guaranteed-service traffic, and connectionless data switching to support best-effort traffic, together with a routing algorithm that balances the on-chip communication load, effectively improving average communication performance.

15.
Fractional Brownian motion (fBm) has emerged as a useful model for self-similar and long-range dependent aggregate Internet traffic. Asymptotic and approximate performance measures, respectively, are known for single queueing systems with fBm through traffic. In this paper, end-to-end performance bounds are derived for a through flow in a network of tandem queues under open-loop fBm cross traffic. To this end, a rigorous sample-path envelope for fBm is proven that complements previous approximate results. The sample-path envelope and the concept of leftover service curves are employed to model the service remaining after scheduling fBm cross traffic at a queueing system. Using composition results for tandem systems from the stochastic network calculus, end-to-end statistical performance bounds for individual flows in networks under fBm cross traffic are derived. The discovery is that these bounds grow in O(n (log n)^(1/(2-2H))) for n systems in series, where H is the Hurst parameter of the cross traffic. Explicit results on the impact of the variability and burstiness of through and cross traffic on network performance are shown. Our analysis has direct implications for fundamental questions in network planning and service management.
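The long-range dependence that makes fBm cross traffic hard to bound shows up directly in the autocovariance of its increment process, fractional Gaussian noise. A small sketch of the standard formula (the function name is illustrative):

```python
def fgn_autocov(k, H):
    """Autocovariance at integer lag k of unit-variance fractional Gaussian
    noise, the stationary increment process of fBm with Hurst parameter H:
    gamma(k) = 0.5 * (|k+1|^(2H) - 2|k|^(2H) + |k-1|^(2H))."""
    return 0.5 * (abs(k + 1) ** (2 * H) - 2 * abs(k) ** (2 * H)
                  + abs(k - 1) ** (2 * H))
```

For H = 1/2 the increments are uncorrelated (gamma(k) = 0 for k >= 1, the ordinary Brownian case), while for H > 1/2 the autocovariance decays only like k^(2H-2) and is non-summable, which is precisely the long-range dependence driving the O(n (log n)^(1/(2-2H))) growth of the bounds above.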

16.
Fault-tolerant scheduling is an imperative step for large-scale computational Grid systems, as geographically distributed nodes often co-operate to execute a task. By and large, the primary-backup approach is a common methodology for fault tolerance, wherein each task has a primary and a backup on two different processors. In this paper, we address the problem of how to schedule DAGs in Grids with communication delays so that service failures can be avoided in the presence of processor faults. The challenge is that, as tasks in a DAG depend on each other, a task must be scheduled so that it will succeed when any of its predecessors fails due to a processor failure. We first propose a communication model and determine when communications between a backup and the backups of its successors are necessary. Then we determine when a backup can start, and its eligible processors, so as to guarantee that every DAG can complete upon any processor failure. We develop two algorithms to schedule backups, which minimize response time and replication cost, respectively. We also develop a suboptimal algorithm which targets minimizing replication cost while not affecting response time. We conduct extensive simulation experiments to quantify the performance of the proposed algorithms.

17.
In a planar geometric network, vertices are located in the plane, and edges are straight line segments connecting pairs of vertices such that no two of them intersect. In this paper we study distributed computing in asynchronous, failure-free planar geometric networks, where each vertex is associated with a processor and each edge with a bidirectional message communication link. Processors are aware of their locations in the plane. We consider fundamental computational geometry problems from the distributed computing point of view, such as finding the convex hull of a geometric network and identifying the external face. We also study the classic distributed computing problem of leader election, to understand the impact that geometric information has on the message complexity of solving it. We obtain an O(n log² n) message complexity algorithm to find the convex hull, and an O(n log n) message complexity algorithm to identify the external face of a geometric network of n processors. We present a matching lower bound for the external face problem. We prove that the message complexity of leader election in a geometric ring is Ω(n log n); hence geometric information does not help in reducing the message complexity of this problem.
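The local building block of the convex hull problem, computing a hull from known point locations, can be illustrated with the standard sequential monotone chain method. This sequential sketch is for intuition only; the paper's contribution is the O(n log² n)-message distributed version:

```python
def convex_hull(points):
    """Andrew's monotone chain: returns hull vertices in counter-clockwise
    order, lowest-leftmost point first."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:                      # build lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # drop the duplicated endpoints of each chain
    return lower[:-1] + upper[:-1]
```

In the distributed setting the same geometric predicate (the cross-product orientation test) is evaluated locally, and the message cost comes from gathering and merging partial hulls.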

18.
We study deterministic gossiping in synchronous systems with dynamic crash failures. Each processor is initialized with an input value called a rumor. In the standard gossip problem, the goal of every processor is to learn all the rumors. When processors may crash, this goal needs to be revised, since it is possible, at a point in an execution, that certain rumors are known only to processors that have already crashed. We define gossiping to be completed, for a system with crashes, when every processor knows either the rumor of processor v or that v has already crashed, for any processor v. We design gossiping algorithms that are efficient with respect to both time and communication. Let t < n be the number of failures, where n is the number of processors. If , then one of our algorithms completes gossiping in O(log² t) time and with O(n polylog n) messages. We develop an algorithm that performs gossiping with O(n^1.77) messages and in O(log² n) time, in any execution in which at least one processor remains non-faulty. We show a trade-off between time and communication in gossiping algorithms: if the number of messages is at most O(n polylog n), then the time has to be at least . By way of application, we show that if n − t = Ω(n), then consensus can be solved in O(t) time and with O(n log² t) messages.

19.
Optimizing the 1-D FFT Algorithm on a Many-Core Processor with Hardware/Software Co-Design
With the ever-increasing demand for high-performance computing, on-chip many-core processors have become the direction of future processor architecture. The fast Fourier transform (FFT), an important application in high-performance computing, places high demands on both computing power and communication bandwidth. Implementing an efficient, scalable FFT algorithm on a many-core platform is therefore a challenge faced jointly by algorithm and architecture designers. This paper optimizes and evaluates the 1-D FFT algorithm on the Godson-T many-core processor. While saving almost one third of the L2 cache storage, optimization strategies such as hiding the matrix transposition and overlapping computation with communication give the optimized 1-D FFT more than a 3x performance improvement. An experimental analysis of on-chip network congestion further shows that, for memory-bandwidth-bound applications such as FFT, increasing the L2 cache access bandwidth relieves the pressure that bursty reads and writes place on the on-chip network and the L2 cache, further improving program performance and scalability.
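For reference, the sequential kernel underlying any 1-D FFT optimization is the radix-2 Cooley-Tukey recursion; a minimal sketch follows. The paper's contribution layers data layout, transposition hiding, and computation/communication overlap on top of such a kernel, none of which is shown here:

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # DFT of even-indexed samples
    odd = fft(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # twiddle factor e^(-2*pi*i*k/n) combines the two half-size DFTs
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

The even/odd split is what becomes a matrix transposition in the blocked, cache-friendly formulations that the paper optimizes.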

20.
Distributed-memory parallel computers and networks of workstations (NOWs) both rely on efficient communication over increasingly high-speed networks. Software communication protocols are often the performance bottleneck. Several current and proposed parallel systems address this problem by dedicating one general-purpose processor in a symmetric multiprocessor (SMP) node specifically to protocol processing. This convention reduces communication latency and increases effective bandwidth, but also reduces peak performance, since the dedicated processor no longer performs computation. In this paper, we study a parallel machine with SMP nodes and compare two protocol processing policies: the Fixed policy, which uses a dedicated protocol processor, and the Floating policy, where all processors perform both computation and protocol processing. The results from synthetic microbenchmarks and five macrobenchmarks show that: (i) a dedicated protocol processor benefits light-weight protocols much more than heavy-weight protocols; (ii) a dedicated protocol processor is generally advantageous when there are four or more processors per node; (iii) multiprocessor node performance is not as sensitive to interrupt overhead as uniprocessor node performance, because a message arrival is likely to find an idle processor on a multiprocessor node, thereby eliminating interrupts; (iv) the system with the lowest cost-performance will include a dedicated protocol processor when interrupt overheads are much higher than protocol weight, as in light-weight protocols.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号