
1.

RAFT: A router architecture with frequency tuning for on-chip networks

Asit K. Mishra Aditya Yanamandra 《Journal of Parallel and Distributed Computing》2011,71(5):625-640

With an increasing number of cores being integrated on a single die, Networks-on-Chip (NoCs) have become the de facto standard for providing scalable communication backbones for these multi-core chips. NoCs have a significant impact on the system’s performance, power and reliability. However, NoCs can be plagued by higher power consumption and degraded throughput if the network and router are not designed properly. To this end, this paper proposes a novel router architecture in which we tune the frequency of a router in response to network load to manage both performance and power. We propose three dynamic frequency tuning techniques, FreqBoost, FreqThrtl and FreqTune, targeted at congestion and power management in NoCs. We also propose and evaluate a novel fine-grained frequency tuning scheme in which we vary the number of virtual channels in a router dynamically. As a further optimization to these schemes, we propose a frequency tuning scheme in which we tune the frequency of the four ports of a mesh router separately from the local port. As enablers for these techniques, we exploit Dynamic Voltage and Frequency Scaling (DVFS) and the imbalance in a generic router pipeline through time stealing. We also evaluate and analyze the proposed schemes with respect to soft error vulnerability and provide guidelines for choosing the appropriate scheme when reliability is the prime design constraint. Experiments using synthetic workloads on an 8 × 8 wormhole-switched mesh interconnect show that FreqBoost is a better choice for reducing average latency (maximum 40%), while FreqThrtl provides the maximum benefits in terms of power saving and energy delay product (EDP). The FreqTune scheme is a better candidate for optimizing both performance and power, achieving on average a 36% reduction in latency, 13% savings in power (up to 24% at high load), and 40% savings (up to 70% at high load) in EDP. With application benchmarks, we observe IPC improvement of up to 23% using our design. Our analysis shows FreqBoost to be the most robust of the three schemes when reliability is a concern.
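As a rough illustration of tuning a router's frequency in response to load, the sketch below maps a router's buffer occupancy to one of three frequency levels. The frequency values, thresholds, and the function name `pick_frequency` are hypothetical and chosen only for illustration; this is not the paper's FreqBoost, FreqThrtl or FreqTune policy, whose details the abstract does not give.

```python
# Hypothetical frequency levels (GHz) and occupancy thresholds; illustrative only.
FREQ_LEVELS = (1.0, 1.5, 2.0)
LOW_LOAD, HIGH_LOAD = 0.3, 0.7


def pick_frequency(buffer_occupancy: float) -> float:
    """Map buffer occupancy in [0, 1] to a frequency level.

    High occupancy is treated as congestion and gets the highest level;
    low occupancy gets the lowest level to save power.
    """
    if not 0.0 <= buffer_occupancy <= 1.0:
        raise ValueError("buffer_occupancy must be in [0, 1]")
    if buffer_occupancy >= HIGH_LOAD:
        return FREQ_LEVELS[2]
    if buffer_occupancy <= LOW_LOAD:
        return FREQ_LEVELS[0]
    return FREQ_LEVELS[1]
```

A real controller would also need hysteresis and a voltage level paired with each frequency; both are omitted here.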

2.

Globally Synchronized Frames for guaranteed quality-of-service in on-chip networks

Jae W. Lee Man Cheuk Ng Krste Asanović 《Journal of Parallel and Distributed Computing》2012

Future chip multiprocessors (CMPs) may have hundreds to thousands of threads competing to access shared resources, and will require quality-of-service (QoS) support to improve system utilization. This paper introduces Globally-Synchronized Frames (GSF), a framework for providing guaranteed QoS in on-chip networks in terms of minimum bandwidth and maximum delay bound. The GSF framework can be easily integrated in a conventional virtual channel (VC) router without significantly increasing the hardware complexity. We exploit a fast on-chip barrier network to efficiently implement GSF. Performance guarantees are verified by analysis and simulation. According to our simulations, all concurrent flows receive their guaranteed minimum share of bandwidth in compliance with a given bandwidth allocation. The average throughput degradation of GSF on an 8×8 mesh network is within 10% compared to the conventional best-effort VC router.
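One way to picture frame-based bandwidth allocation is the sketch below. It assumes, as a simplification of ours rather than a statement of the paper's mechanism, that each flow has a fixed flit budget per frame and may inject into a bounded window of future frames. The function name `assign_frames` and its parameters are hypothetical.

```python
def assign_frames(packet_flits, quota, window):
    """Assign each packet of one flow to a frame offset (0 = current frame).

    quota  -- flits the flow may inject per frame (assumed fixed)
    window -- number of frames open for injection at once (assumed fixed)
    Returns a list with a frame offset per packet, or None for packets
    that must wait until an older frame is recycled.
    """
    frame, used = 0, 0
    offsets = []
    for flits in packet_flits:
        if flits > quota:
            raise ValueError("packet larger than per-frame quota")
        # Move to the next frame when this packet would exceed the budget.
        if frame < window and used + flits > quota:
            frame, used = frame + 1, 0
        if frame >= window:
            offsets.append(None)  # window exhausted: the source stalls
        else:
            used += flits
            offsets.append(frame)
    return offsets
```

For example, five 2-flit packets with a 4-flit quota and a 2-frame window fill frames 0 and 1, and the fifth packet has to wait.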

3.

Securing the data path of next-generation router systems

Tilman Wolf Russell Tessier Gayatri Prabhu 《Computer Communications》2011,34(4):598-606

As the technology used to implement computer network infrastructure advances, networking resources are becoming more vulnerable to attack. Recent router designs are based on general-purpose programmable processors, which increase their potential vulnerability. To address this issue, a Secure Packet Processing platform has been developed that can flexibly protect emerging router systems. Both instruction-level operation of embedded processors and I/O operations of router ports are monitored to detect anomalous behavior. If such behavior is detected, a recovery system is invoked to restore the system into an operational state. Experimental results show that processor-based attacks can generally be determined by a processing monitor within a single instruction. I/O anomalies, including unexpected packet broadcast or delay, can be detected by an I/O monitor with limited overhead. Overall, the system overhead for secure monitoring is limited to a fraction of the overall system space, memory, and power budget.

4.

The psi-cube: a bus-based cube-type clustering network for high-performance on-chip systems

Masaru Takesue 《Parallel Computing》2006,32(11-12):852

This paper proposes a bus-based cube-type network, called psi-cube, that alleviates two problems facing on-chip systems, long wires and a limited number of I/O pins, through a small diameter and dynamic clusters, respectively. The 2ⁿ-node psi-cube is organized on the sets of node-partitions produced with an extended n-bit Hamming code ψ(n, k) [M. Takesue, Ψ-Cubes: recursive bused fat-hypercubes for multilevel snoopy caches, in: Proceedings of the International Symposium on Parallel Architectures, Algorithms, and Networks, IEEE CS Press, 1999, pp. 62–67] if we connect the nodes in each partition to the bus owned by the leader of the partition. Owing to the routing between the leaders separated by the distance of 1–3, the diameter equals n/2 if n≠2^p − 1 or n/2 otherwise. The maximum bus length is O(2^p−1) or O(2^k−1) when the psi-cube is mapped onto an array. We dynamically produce separate sets of clusters for different off-chip targets such as memory blocks, so the traffic to the leaders of clusters is much smaller than in static clusters fixed in hardware. Simulation results show that the psi-cube outperforms the mesh if the bus delay is less than 4 times the mesh link’s, and that the dynamic clusters increase the psi-cube bandwidth by over 60%.

5.

A survey of research on high-radix router architectures

杨文祥 董德尊 雷斐 李存禄 吴际 孙凯旋 《计算机工程与科学》2016,38(8):1517-1523

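As high-performance networks grow in scale, the design of high-radix router architectures has become a key focus and hot topic of research in high-performance computing. With high-radix routers, a network can achieve lower packet transmission latency, lower construction cost, and lower power consumption; using high-radix routers can also improve network reliability. The past decade has seen the fastest development of high-radix routers. This paper surveys recent research on high-radix routers and forecasts future trends, focusing on the classic structured designs represented by YARC and on new design approaches that have emerged in recent years, such as "network within a network". Future research will focus on solving problems such as buffering and arbitration that arise in high-radix router design, and on using technologies such as optical interconnects to design better-performing architectures.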

6.

Sharded Router: A novel on-chip router architecture employing bandwidth sharding and stealing

Junghee Lee Chrysostomos Nicopoulos Hyung Gyu Lee Jongman Kim 《Parallel Computing》2013

Packet-based networks-on-chip (NoC) are considered among the most viable candidates for the on-chip interconnection network of many-core chips. Unrelenting increases in the number of processing elements on a single chip die necessitate a scalable and efficient communication fabric. The resulting enlargement of the on-chip network size has been accompanied by an equivalent widening of the physical inter-router channels. However, the growing link bandwidth is not fully utilized, because the packet size is not always a multiple of the channel width. While slicing of the physical channel enhances link utilization, it incurs additional delay, because the number of flits per packet also increases. This paper proposes a novel router micro-architecture that employs fine-grained bandwidth “sharding” (i.e., partitioning) and stealing in order to mitigate the elevation in the zero-load latency caused by slicing. Consequently, the zero-load latency of the Sharded Router becomes identical to that of a conventional router, whereas its throughput is markedly improved by fully utilizing all available bandwidth. Detailed experiments using a full-system simulation framework indicate that the proposed router reduces the average network latency by up to 19% and the execution time of real multi-threaded workloads by up to 43%. Finally, hardware synthesis analysis verifies the modest area overhead of the Sharded Router over a conventional design.
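The trade-off between link utilization and flit count can be checked with a little arithmetic. The sketch below assumes a packet is split into whole flits of exactly the channel width, with the last flit padded; the function name and the example sizes are ours, not the paper's.

```python
def flits_and_utilization(packet_bits: int, channel_bits: int):
    """Return (number of flits, fraction of transferred bits that carry data)."""
    if packet_bits <= 0 or channel_bits <= 0:
        raise ValueError("sizes must be positive")
    flits = -(-packet_bits // channel_bits)  # integer ceiling division
    utilization = packet_bits / (flits * channel_bits)
    return flits, utilization


# A 160-bit packet on a wide 128-bit channel wastes part of the second flit,
# while a narrow 32-bit slice is fully used but needs more flits.
wide = flits_and_utilization(160, 128)
narrow = flits_and_utilization(160, 32)
```

Here `wide` is 2 flits at 62.5% utilization and `narrow` is 5 flits at 100% utilization, which is the latency cost of slicing that the abstract describes.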

7.

A new solution to the anonymous router problem (cited by: 1; self-citations: 0; citations by others: 1)

史怀洲 朱培栋 《信息网络安全》2008,(11):45-46

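Router-level topology discovery plays a very important role in network topology discovery; it is also the basis for network analysis and modeling and is closely tied to network security. To address the anonymous router problem that arises in router-level topology discovery, this paper proposes a new solution: conditional maximal anonymous merging (带条件的最大匿名综合). The method adds no load to the network, applies broadly, and is efficient; it makes router-level topology data more accurate and complete, and is a good solution to the anonymous router problem.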

8.

Scalable architecture for a contention-free optical network on-chip

Somayyeh Koohi Shaahin Hessabi 《Journal of Parallel and Distributed Computing》2012

This paper proposes CoNoC (Contention-free optical NoC) as a new architecture for on-chip routing of optical packets. CoNoC is built upon all-optical switches (AOSs) which passively route optical data streams based on their wavelengths. The key idea of the proposed architecture is the utilization of per-receiver wavelength in the data network to prevent optical contention at the intermediate nodes. Routing optical packets according to their wavelength eliminates the need for resource reservation at the intermediate nodes and the corresponding latency, power, and area overheads. Since passive architecture of the AOS confines the optical contention to the end-points, we propose an electrical arbitration architecture for resolving optical contention at the destination nodes. By performing a series of simulations, we study the efficiency of the proposed architecture, its power and energy consumption, and the data transmission latency. Moreover, we compare the proposed architecture with electrical NoCs and alternative ONoC architectures under various synthetic traffic patterns. Averaged across different traffic patterns, the proposed architecture reduces per-packet power consumption by 19%, 28%, 29%, and 91% and achieves per-packet energy reduction of 28%, 40%, 20%, and 99% over Columbia, Phastlane, λ-router, and electrical torus, respectively.

9.

Silicon-aware distributed switch architecture for on-chip networks

《Journal of Systems Architecture》2013,59(7):505-515

It is well known that current Chip MultiProcessor (CMP) and high-end MultiProcessor System-on-Chip (MPSoC) designs are growing in their number of components. Networks-on-Chip (NoC) provide the required connectivity for such CMP and MPSoC designs at reasonable costs. As technology advances, links become the critical component in the NoC due to their long delay and power consumption, which become unacceptable for long global interconnects. In this paper we present a new switch architecture that reduces the negative impact of links on the NoC. We call our proposal the distributed switch. The distributed switch spreads the circuitry of the switch onto the links. Thus, packets are buffered, routed, and forwarded at the same time they are crossing the link. Distributing a modular switch onto the link improves the trade-off between the power consumption and the operating frequency of the entire network. On the other hand, area requirements increase. Additionally, the distributed switch presents better fault tolerance and process variation behavior than a non-distributed switch.

10.

An on-chip instruction cache design with one-bit tag for low-power embedded systems

Ji Gu Hui Guo Patrick Li 《Microprocessors and Microsystems》2011,35(4):382-391

The on-chip instruction cache is a potentially power-hungry component in embedded systems due to its large chip area and high access frequency. Aiming to reduce the power consumption of the on-chip cache, we propose a Reduced One-Bit Tag Instruction Cache (ROBTIC), where the cache size is judiciously reduced and the cache tag field only contains the least significant bit of the full tag. We develop a cache operational control scheme for ROBTIC so that with the one-bit cache tag, the program locality can still be efficiently exploited. For applications where most of the memory accesses are localized, our cache can achieve performance similar to that of a traditional full-tag cache; however, the power consumption of the cache can be significantly reduced due to the much smaller cache size, narrower tag array (just one bit), and smaller tag comparison circuit. Experiments on a set of benchmarks implemented in CMOS 180 nm process technology demonstrate that our proposed design can reduce dynamic power consumption by up to 27.3% and area by 30.9% relative to the traditional cache when the cache size is fixed at 32 instructions, which outperforms the existing partial-tag based cache design. With cache size customization, a further 47.8% power saving can be achieved. Our experimental results also show that when implemented in deep sub-micron technologies, where the leakage power is not negligible, our design is still efficient: a consistent power saving trend (about 22%) has been observed for technologies from 130 nm down to 65 nm.

11.

An analog VLSI design for a neuron with a choice of learning rules (cited by: 1; self-citations: 0; citations by others: 1)

Mark James Neal 《Neurocomputing》2000,30(1-4):185-200

An analog implementation of a neuron using standard VLSI components is described. The node is capable of both delta-rule and simple error-correcting learning. Decomposition into functional blocks allows the parts of the design to be easily separated and understood. The connectivity problem is eased by serially encoding inputs so that all nodes in a layer are connected to a single line carrying activations from the previous layer. Performance implications of the architecture are considered. The design was simulated with the Spice transistor level simulator. Schemas for interconnection of large numbers of nodes and simulations of the circuitry required are presented. Results show that effective learning is achieved by both algorithms. Implementation of multiple learning rules in a single neuron is demonstrated as an effective way of increasing flexibility in neural network hardware implementations.

12.

Two algorithms for three-layer channel routing

R. Srinivasan L.M. Patnaik 《Computer aided design》1984,16(5):264-271

A channel router is an important design aid in the design automation of VLSI circuit layout. Many algorithms have been developed based on various wiring models with routing done on two layers. With the recent advances in VLSI process technology, it is possible to have three independent layers for interconnection. In this paper two algorithms are presented for three-layer channel routing. The first assumes a very simple wiring model. This enables the routing problem to be solved optimally in O(n log n) time. The second algorithm is for a different wiring model and has an upper bound of O(n^2) for its execution time. It uses fewer horizontal tracks than the first algorithm. For the second model the channel width is not bounded by the channel density.
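For readers unfamiliar with the term, channel density is commonly defined as the largest number of nets whose horizontal spans cross any single column of the channel. The sketch below computes it with a sort-and-sweep in O(n log n) time. It only illustrates the density concept and is not either of the paper's routing algorithms; representing each net as a closed `(left, right)` column interval is our assumption.

```python
def channel_density(intervals):
    """Maximum number of closed intervals (left, right) covering one column."""
    events = []
    for left, right in intervals:
        if left > right:
            raise ValueError("interval must have left <= right")
        events.append((left, 0))   # 0 = span starts; sorts before an end
        events.append((right, 1))  # 1 = span ends at the same column
    events.sort()                  # O(n log n) dominates the running time

    current = best = 0
    for _, kind in events:
        if kind == 0:
            current += 1
            best = max(best, current)
        else:
            current -= 1
    return best
```

Because starts sort before ends at the same column, two nets that merely touch at a column are counted as overlapping there, which matches the closed-interval assumption.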

13.

A low-latency modular switch for CMP systems

Antoni Roca José Flich Federico Silla José Duato 《Microprocessors and Microsystems》2011,35(8):742-754

As technology advances, the number of cores in Chip MultiProcessor systems and MultiProcessor Systems-on-Chips keeps increasing. The network must provide sustained throughput and ultra-low latencies. In this paper we propose new pipelined switch designs focused on reducing the switch latency. We identify the switch component that limits the switch frequency: the arbiter. Then, we simplify the arbiter logic by using multiple smaller arbiters, at the cost of a large increase in switch area. To solve this problem, a second design is presented in which the routing, traversal and arbitration tasks are mixed. Results demonstrate a switch latency reduction ranging from 10% to 21%. Network latency is reduced by 11% to 15%.

14.

A scalable single-chip multi-processor architecture with on-chip RTOS kernel

B. D. A. C. V. V. M. P. J. A. 《Journal of Systems Architecture》2003,49(12-15):619

Now that system-on-chip technology is emerging, single-chip multi-processors are becoming feasible. A key problem of designing such systems is the complexity of their on-chip interconnects and memory architecture. It is furthermore unclear at what level software should be integrated. An example of a single-chip multi-processor for real-time (networked) embedded systems is the multi-microprocessor (MμP). Its architecture consists of a scalable number of identical master processors and a configurable set of shared co-processors. Additionally, an on-chip real-time operating system kernel is included to support transparent multi-tasking over the set of master processors. In this paper, we explore the main design issues of the architecture platform on which the MμP is based. In addition, synthesis results are presented for a lightweight configuration of this architecture platform.

15.

A practical bytecode interpreter for programmable routers on IXP network processors

S. G. 《Computer Networks》2009,53(15):2740-2751


16.

Fixed latency on-chip interconnect for hardware spiking neural network architectures

Sandeep Pande Fearghal Morgan Gerard Smit Tom Bruintjes Jochem Rutgers Brian McGinley Seamus Cawley Jim Harkin Liam McDaid 《Parallel Computing》2013

Information in a Spiking Neural Network (SNN) is encoded as the relative timing between spikes. Distortion in spike timings can impact the accuracy of SNN operation by modifying the precise firing time of neurons within the SNN. Maintaining the integrity of spike timings is crucial for reliable operation of SNN applications. A packet switched Network on Chip (NoC) infrastructure offers scalable connectivity for spike communication in hardware SNN architectures. However, shared resources in NoC architectures can result in unwanted variation in spike packet transfer latency. This packet latency jitter distorts the timing information conveyed on the synaptic connections in the SNN, resulting in unreliable application behaviour.

17.

Multistage A2LVQ architecture and implementation for image compression

Infall Syafalni M.F.M. Salleh 《Digital Signal Processing》2013,23(5):1414-1426

This paper presents the implementation of two hardware architectures, i.e., A₂ Lattice Vector Quantization (LVQ) and Multistage A₂LVQ (MA₂LVQ), using a Field-Programmable Gate Array (FPGA). First, the renowned LVQ quantizer by Conway and Sloane is implemented, followed by a low-complexity A₂LVQ based on a new A₂LVQ algorithm. The first implementation turns out to require a high number of multiplier circuits. The low-complexity A₂LVQ implementation uses only the first quadrant of the A₂ lattice Voronoi region formed by W and T regions. This paper also presents the implementation of a multistage A₂LVQ (MA₂LVQ) with an architecture built from successive A₂ quantizer blocks. Synthesis results show that the execution time of the low-complexity A₂LVQ reaches up to 35.97 ns. The MA₂LVQ is implemented using both low-complexity A₂LVQ and ordinary A₂ architectures. The system built on the former uses 47% fewer logic and register elements.

18.

MPFS: A truly scalable router architecture for next generation Internet

ZhiGang Sun Yi Dai ZhengHu Gong 《中国科学F辑(英文版)》2008,51(11):1761-1771

A new-generation IP router architecture called massive parallel forwarding and switching (MPFS) is proposed, which differs fundamentally from current routers. The basic idea of MPFS is to map the complicated forwarding process onto a multilevel scalable switch fabric so as to implement packet forwarding in a pipelined and distributed way. This processing mechanism is named forwarding in switching (FIS). By interconnecting multiple stages of lower-speed components, called forwarding and switching nodes (FSN), MPFS achieves better scalability in forwarding and switching performance, just like MPP. We put emphasis on the IPv6 lookup problem in MPFS and propose a method for partitioning the IPv6 FIB and mapping the partitions to the switch fabric. Simulation and computation results suggest that MPFS routers can support line-speed forwarding with a million IPv6 prefixes at 40 Gbps. Finally, we propose an implementation of a 160 Tbps core router based on the MPFS architecture.

19.

A solution to the hidden-terminal and exposed-terminal problems in single-channel wireless sensor networks

王晖 《电子技术应用》2008,34(9)

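Drawing on the positioning capability provided by GPS and adopting a cross-layer design approach, this paper proposes a medium access control protocol for single-channel multi-hop WSNs. The protocol's control frames carry routing and state information, which is used to identify exposed and hidden terminals; in addition, the upstream node's ACK doubles as the RTS handshake with the downstream node, establishing a three-way CTS/DATA/ACK exchange. Simulations show that, compared with existing protocols of the same kind such as IEEE 802.11 and MACA-BI, the protocol effectively solves the hidden-terminal and exposed-terminal problems, improves network throughput and end-to-end delay, and reduces handshake overhead and the collision probability of control frames.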

20.

A parallel algorithm with enhancements via partial objective value cuts for cluster-based wireless sensor network design

Hui Lin Halit Üster 《Journal of Parallel and Distributed Computing》2014

In this paper, we develop a parallel algorithm for the solution of an integrated topology control and routing problem in Wireless Sensor Networks (WSNs). After presenting a mixed-integer linear optimization formulation for the problem, for its solution, we develop an effective parallel algorithm in a Master–Worker model that incorporates three parallelization strategies, namely low-level parallelism, domain decomposition, and multiple search (both cooperative and independent) in a single Master–Worker framework.