
1.

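Li Cheng (李诚), Li Huawei (李华伟) 《计算机工程》 (Computer Engineering) 2007,33(2):252-254

With the rapid growth of network bandwidth and the continual emergence of new network applications, the existing Internet architecture based on general-purpose processors and ASICs can no longer meet new demands. Network processors, which combine strong processing power with flexible programmability and configurability, are gradually coming into wide use. High-performance network processors usually employ multiple concurrent processing elements for fast data-plane processing, and these processing elements occupy a central position in the network processor. This paper discusses the factors to consider when designing processing elements for a network processor, designs a relatively flexible and effective processing-element architecture, and verifies it with an FPGA prototype, confirming the feasibility of the structure.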

2.

Shared memory multiprocessor architectures for software IP routers

Luo Y. Laxmi Narayan Bhuyan Chen X. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(12):1240-1249

We propose new shared memory multiprocessor architectures and evaluate their performance for future Internet protocol (IP) routers based on symmetric multiprocessor (SMP) and cache coherent nonuniform memory access (CC-NUMA) paradigms. We also propose a benchmark application suite, RouterBench, which consists of four categories of applications representing key functions on the time-critical path of packet processing in routers. An execution driven simulation environment is created to evaluate SMP and CC-NUMA router architectures using this RouterBench. The execution driven simulation can produce accurate cycle-level execution time prediction and reveal the impact of various architectural parameters on the performance of routers. We port the FUNET trace and its routing table for use in our experiments. We find that the CC-NUMA architecture provides excellent scalability for the design of high-performance IP routers. Results also show that the CC-NUMA architecture can sustain good lookup performance, even at a high frequency of route updates.

3.

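A survey of next-generation network processors and their applications

Zhao Yuyu (赵玉宇), Cheng Guang (程光), Liu Xuhui (刘旭辉), Yuan Shuai (袁帅), Tang Lu (唐路) 《软件学报》 (Journal of Software) 2021,32(2):445-474

As the core computing chip of network devices, handling mainstream tasks such as route lookup, high-speed packet processing, and QoS guarantees, the network processor can use its programmability to meet diverse packet-processing requirements and adapt to different network application scenarios. In view of the changes in the network environment brought about by ultra-high bandwidth and intelligent terminals, the design of high-performance, evolvable next-generation network processors is a hot topic in network communications and has attracted wide attention from researchers. Combining the strengths of different chip architectures, serving specific services at high speed...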

4.

ETA: experience with an Intel Xeon processor as a packet processing engine

Regnier G. Minturn D. McAlpine G. Saletore V.A. Foong A. 《Micro, IEEE》2004,24(1):24-31

Server-based networks have well-documented performance limitations. These limitations outline a major goal of Intel's embedded transport acceleration (ETA) project, the ability to deliver high-performance server communication and I/O over standard Ethernet and transmission control protocol/Internet protocol (TCP/IP) networks. By developing this capability, Intel hopes to take advantage of the large knowledge base and ubiquity of these standard technologies. With the advent of 10 gigabit Ethernet, these standards promise to provide the bandwidth required of the most demanding server applications. We use the term packet processing engine (PPE) as a generic term for the computing and memory resources necessary for communication-centric processing. Such PPEs have certain desirable attributes; the ETA project focuses on developing PPEs with such attributes, which include scalability, extensibility, and programmability. General-purpose processors, such as the Intel Xeon in our prototype, are extensible and programmable by definition. Our results show that software partitioning can significantly increase the overall communication performance of a standard multiprocessor server. Specifically, partitioning the packet processing onto a dedicated set of compute resources allows for optimizations that are otherwise impossible when time sharing the same compute resources with the operating system and applications.

5.

Multiprocessing in multiprotocol routers

《Journal of Microcomputer Applications》1994,17(2):99-112

High-speed networks place strict requirements on the architecture of communication subsystems. One of the most significant problems in conventional subsystems is provision of high-speed protocol processing. The protocol processing problem is especially significant in the environment of multiprotocol routers where several routing protocols are supported. A multiprocessor architecture for high-speed processing in multiprotocol environments is presented and analyzed. It is shown that exploitation of vertical and horizontal parallelism in protocol stacks combined with parallelism in memory accesses and packet memory management significantly increases system performance. The presented architecture, used in realistic environments, meets the throughput requirements of high-speed network links offering throughput up to 100 Mbps.

6.

Impact of out-of-sequence processing on the performance of data transmission

《Computer Networks》1999,31(5):475-492

Application Level Framing (ALF) was proposed by Clark and Tennenhouse as an important design principle for developing high performance applications. ALF relies in part on the ability of applications and protocols to process packets independently of one another. Thus, the performance gains one might expect from the use of ALF are clearly related to the performance gains one might expect from applications that can handle and process packets received out-of-sequence, as compared to applications that require in-sequence delivery (FTP, TELNET, etc.). In this paper, we examine how the ability to process out-of-sequence packets impacts the efficiency of data transmission. We consider both the impact of application parameters such as the time to process a packet by the application, as well as network parameters such as network transmission delay, network loss rate and flow and congestion control characteristics. The performance measures of interest are total latency, buffer requirements, and jitter. We show, using experimental and simulation results, that out-of-sequence processing is beneficial only for very limited ranges of transmission delays and application processing time. We discuss the impact of this on the architecture of communication systems dedicated to distributed multimedia applications.

7.

A highly flexible, distributed multiprocessor architecture for network processing

《Computer Networks》2003,41(5):563-586

Network processors (NPs) are an emerging field of programmable processors that are optimized to implement data plane packet processing networking functions. Unlike the general-purpose CPUs that rely heavily on caching for improving performance, the lack of locality in packet processing and need for high-performance I/O have forced designers to come up with innovative architectures that can hide memory latency while still processing packets at high data rates. Most of these NPs use some type of multiprocessing in combination with a hierarchy of memory types to achieve high performance. In addition, to keep up with packets arriving at high data rates over multiple incoming media interfaces, an NP must perform fast I/O and memory operations such as packet storage, table lookup, and extraction of fields in packet headers. We describe an architecture that uses a combination of distributed memory architecture and one or more multithreaded processors to achieve the necessary performance. We describe the challenges in programming such a processor including the issues related to consistency and maintaining packet ordering. We also present a programming model for generic network applications that uses software pipelines. We then demonstrate the use of the programming model in implementing two applications, namely, mapping traffic management algorithms onto a multithreaded architecture and an implementation of a media gateway based on voice-over-AAL2.

8.

Flexible VLIW processor based on FPGA for efficient embedded real-time image processing

Vincent Brost Fan Yang Charles Meunier 《Journal of Real-Time Image Processing》2014,9(1):47-59

Modern field programmable gate array (FPGA) chips, with their larger memory capacity and reconfigurability potential, are opening new frontiers in rapid prototyping of embedded systems. With the advent of high-density FPGAs, it is now possible to implement a high-performance VLIW (very long instruction word) processor core in an FPGA. With VLIW architecture, the processor effectiveness depends on the ability of compilers to provide sufficient ILP (instruction-level parallelism) from program code. This paper describes research results on enabling the VLIW processor model for real-time processing applications by exploiting FPGA technology. Our goals are to keep the flexibility of processors to shorten the development cycle, and to use the powerful FPGA resources to increase real-time performance. We present a flexible VLIW VHDL processor model with a variable instruction set and a customizable architecture which allows exploiting intrinsic parallelism of a target application using advanced compiler technology and implementing it in an optimal manner on FPGA. Some common algorithms of image processing were tested and validated using the proposed development cycle. We also realized the rapid prototyping of embedded contactless palmprint extraction on an FPGA Virtex-6 based board for a biometric application and obtained a processing time of 145.6 ms per image. Our approach applies some criteria for co-design tools: flexibility, modularity, performance, and reusability.

9.

Rapid prototyping of an ATM programmable associative operator

《Journal of Systems Architecture》2000,46(13):1159-1173

In this paper, we describe a semi-automatic method for designing a programmable architecture related to high speed communication protocols. A case study of an associative-based architecture for a high speed communication system is presented with a validation environment. The environment provides an interesting estimation using a XILINX prototyping board including memories (content addressable memory, CAM, RAM, DPRAM). In our approach, we try to perform a rapid prototyping of such an architecture and allow the designer to interact easily in order to customize the architecture according to application requirements. This method of validation provides important benefits in hardware prototyping: a better validation environment and reduced time to give a real estimation for a large variety of applications.

10.

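Design of a multiprocessor MPEG2 parallel decoding system (total citations: 1; self-citations: 0; citations by others: 1)

Xu Jiebin (许洁斌), Wei Gang (韦岗) 《数据采集与处理》 (Journal of Data Acquisition and Processing) 1998,13(4):367-373

The MPEG2 standard for compressing moving pictures and associated audio is the core algorithm of many video service applications. Implementing MPEG2 decompression on a parallel system that combines software with multiple processors flexibly supports the playback functions of many kinds of MPEG2 products and avoids the limitations of decompression in dedicated hardware chips. Moreover, as personal computers become more widespread and more powerful, this adapter-card approach can give personal computers more MPEG2 service functions, and it makes it easier to study and test updated algorithms for the MPEG2 family of standards. This paper analyzes the requirements that MPEG2 decoding places on the implementing system, in particular the computational load of each stage of decompression and the requirements for data transfer and processing. Based on these data, and using several TMS320C40 parallel processing boards, the paper experiments with and analyzes data partitioning of the MPEG2 input bitstream, storage control and communication for parallel decoding, and the complexity of the decoding algorithm, and derives the corresponding design choices and data. Finally, a design for an MPEG2 parallel-processing decoding system is proposed.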

11.

On the defense of the distributed denial of service attacks: an on-off feedback control approach

Yong Xiong Liu S. Sun P. 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》2001,31(4):282-293

Proposes a coordinated defense scheme against distributed denial of service (DDoS) network attacks, based on a backward-propagation, on-off control strategy. When a DDoS attack is in effect, a high concentration of malicious packet streams is routed to the victim in a short time, making it a hot spot. A similar problem has been observed in multiprocessor systems, where a hot spot is formed when a large number of processors simultaneously access shared variables in the same memory module. Despite the similar terminology, solutions for multiprocessor hot spot problems cannot be applied to those in the Internet, because the hot traffic in DDoS may only represent a small fraction of the Internet traffic, and the attack strategies on the Internet are far more sophisticated than those in multiprocessor systems. The performance impact on the hot spot is related to the total hot packet rate that can be tolerated by the victim. We present a backward pressure propagation, feedback control scheme to defend against DDoS attacks. We use a generic network model to analyze the dynamics of network traffic, and develop the algorithms for rate-based and queue-length-based feedback control. We show a simple design to implement our control scheme on a practical switch queue architecture.
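To make the idea of queue-length-based on-off feedback concrete, the following is a minimal sketch of a generic hysteresis (on-off) controller, not the algorithm from the paper. The class name, the `high` and `low` thresholds, and the toy single-queue simulation are all assumptions introduced for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class OnOffController:
    """Generic on-off (hysteresis) controller driven by queue length.

    `high` and `low` are hypothetical thresholds, not values from the paper.
    """
    high: int  # switch upstream traffic off at or above this queue length
    low: int   # switch upstream traffic back on at or below this queue length
    upstream_on: bool = True

    def update(self, queue_len: int) -> bool:
        if self.upstream_on and queue_len >= self.high:
            self.upstream_on = False   # propagate an "off" signal backward
        elif not self.upstream_on and queue_len <= self.low:
            self.upstream_on = True    # propagate an "on" signal backward
        return self.upstream_on


def simulate(arrivals, service_rate, ctrl):
    """Toy model: return the queue length after each time step."""
    queue = 0
    history = []
    for offered in arrivals:
        # Traffic is admitted only while the upstream is switched on.
        admitted = offered if ctrl.upstream_on else 0
        queue = max(0, queue + admitted - service_rate)
        ctrl.update(queue)
        history.append(queue)
    return history
```

In this toy model, a constant offered load of 6 packets per step against a service rate of 2 would grow the queue without bound; with the controller, the queue oscillates between the two thresholds instead.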

12.

Network virtualization substrate with parallelized data plane

Yong Liao 《Computer Communications》2011,34(13):1549-1558

Network virtualization provides the ability to run multiple concurrent virtual networks over a shared substrate. However, it is challenging to design such a platform to host multiple heterogeneous and often highly customized virtual networks. Not only is a high degree of flexibility desired for virtual networks to customize their functions, but fast packet forwarding is also required. This paper presents PdP, a flexible network virtualization platform capable of achieving high speed packet forwarding. A PdP node has multiple machines to perform packet processing for virtual networks hosted in the system. To forward packets at high speed, the data plane of a virtual network in PdP can be allocated multiple forwarding machines to process packets in parallel. Furthermore, a virtual network in PdP can be fully customized. Both the control plane and data plane of a virtual network run in virtual machines so as to be isolated from other virtual networks. We have built a proof-of-concept prototype PdP platform using off-the-shelf commodity hardware and open source software. The performance evaluation results show that our system can closely match the best-known packet forwarding speed of software routers running on commodity hardware.

13.

Torus Ring: improving performance of interconnection network by modifying hierarchical ring

《Parallel Computing》2007,33(1):2-20

In multiprocessor systems, interconnection network design is critical for overall system performance. Among the popular interconnection networks, unidirectional ring-based networks have been a popular choice for high performance large-scale shared memory multiprocessor systems. In this paper, we propose “Torus Ring”, which is a modified version of the two-level hierarchical ring. The Torus Ring has the same complexity as the hierarchical rings, and the only difference is the way it connects the local rings. Compared to hierarchical rings, the Torus Ring helps exploit the memory access locality of application programs more efficiently. It has an advantage over the hierarchical ring when the destination of a packet is the adjacent local ring, especially the backward adjacent local ring. Although we assume that the destination of a network packet is uniformly distributed across the processing nodes, the average number of hops in Torus Ring is equal to that of the hierarchical ring. However, the performance gain of the Torus Ring is expected to increase, due to the memory access locality of the application programs in the real parallel programming environment. In the simulation results, the latency of the interconnection network is reduced by up to 19% and the execution time is reduced by up to 10%, at a moderate ring utilization ratio.

14.

Steps towards the automatic production of performance models of web applications

《Computer Networks》2003,41(1):29-39


15.

A Resource-Efficient Communication Architecture for Chip Multiprocessors on FPGAs


Maggie Swetha Thota 《计算机科学技术学报》 (Journal of Computer Science and Technology) 2011,26(3):434-447

Significant advances in field-programmable gate arrays (FPGAs) have made it viable to explore innovative multiprocessor solutions on a single FPGA chip. For multiprocessors, an efficient communication network that matches the needs of the target application is always critical to the overall performance. Wormhole packet-switching network-on-chip (NoC) solutions are replacing conventional shared buses to deal with scalability and complexity challenges coming along with the increasing number of processing elements (PEs). However, the quest for high performance networks has led to very complex and resource-expensive NoC designs, leaving little room for the real computing force, i.e., PEs. Moreover, many techniques offer very small performance gains or none at all when network traffic is light while increasing the resource usage of routers. We argue that computation is still the primary task of multiprocessors and sufficient resources should be reserved for PEs. This paper presents our novel design and implementation of a resource-efficient communication network for multiprocessors on FPGAs. We reduce not only the required number of routers for a given number of PEs by introducing a new PE-router topology, but also the resource requirement of each router. Our communication network relies on the NEWS channels to transfer packets in a pipelined fashion following the path determined by the routing network. The implementation results on various Xilinx FPGAs show good performance in the typical range of network load for multiprocessor applications.

16.

Study of OpenMP applications on the InfiniBand-based software distributed shared-memory system

Inho Park Seon Wook Kim 《Parallel Computing》2005,31(10-12):1099

For the past decades computer engineers have focused on building high-performance and large-scale computer systems at low cost. One example is a distributed-memory computer system like a cluster, where fast processing nodes using commodity processors are connected through a high speed network. But it is not easy to develop applications on this system, because a programmer must consider all data and control dependences between processes and program them explicitly. To alleviate this problem, the distributed virtual shared-memory (DVSM) system has been proposed. It is well known that the performance of the DVSM system highly depends on the network’s performance and programming semantics, and currently its performance is very limited on a conventional network. Recently many advanced hardware-based interconnection technologies have been introduced, and one of them is the InfiniBand Architecture (IBA), which supports shared-memory programming semantics by means of remote direct-memory access (RDMA) and atomic operations. In this paper, we present the implementation of our InfiniBand-based DVSM system and analyze the performance of SPEC OMP benchmarks in detail by comparing it with the DVSM based on the traditional network architecture and the hardware shared-memory multiprocessor (SMP) system. As an experimental result, we show that our DVSM system using the full features of the IBA can improve performance significantly over the IPoIB-based traditional system on the IBA, and furthermore the performance of one application on the IBA-based DVSM system is better than on the hardware SMP.

17.

A massively parallel architecture for self-organizing feature maps

Porrmann M. Witkowski U. Ruckert U. 《Neural Networks, IEEE Transactions on》2003,14(5):1110-1121

A hardware accelerator for self-organizing feature maps is presented. We have developed a massively parallel architecture that, on the one hand, allows a resource-efficient implementation of small or medium-sized maps for embedded applications, requiring only small areas of silicon. On the other hand, large maps can be simulated with systems that consist of several integrated circuits that work in parallel. Apart from the learning and recall of self-organizing feature maps, the hardware accelerates data pre- and postprocessing. For the verification of our architectural concepts in a real-world environment, we have implemented an ASIC that is integrated into our heterogeneous multiprocessor system for neural applications. The performance of our system is analyzed for various simulation parameters. Additionally, the performance that can be achieved with future microelectronic technologies is estimated.

18.

Performance analysis of BusNet protocol for backplane bus-based interprocessor communication

《Computer Communications》2001,24(15-16):1578-1588

Nowadays, backplane bus-based multiprocessor systems often utilize a standard network protocol such as TCP/IP for communication between processors on the backplane bus. In such systems, it is common for the backplane bus to emulate standard MAC protocols such as CSMA/CD. This paper aims to analyze the delay performance of the MAC emulation-based backplane network by constructing queueing models based on detailed bus operations. For this purpose, we choose BusNet as a target protocol. BusNet is an ANSI standard network protocol and its specification contains basic operations commonly used in most backplane buses. We investigate the throughput-delay characteristics in terms of packet size, block transfer scale, and arbitration scheme. We also compare the packet delay in BusNet with the IEEE 802.3 CSMA/CD network which BusNet is expected to be compatible with. The simulation result shows how an optimal block transfer scale can be determined in respect of the performance trade-off between BusNet and other real-time traffic.

19.

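Research on packet order-preserving mechanisms in network-processor-based applications

Yu Jianming (余建明), Xue Yibo (薛一波) 《计算机工程与应用》 (Computer Engineering and Applications) 2007,43(33):147-149

Packet order preservation is a strict requirement that higher-layer protocols place on network devices. The parallel processing architecture widely adopted in network processors (NPs) poses new challenges for implementing order-preserving mechanisms. Through theoretical analysis, this paper compares the performance of two order-preserving models under the NP architecture, critical-section round robin (CSRR) and pool of threads (POTS), and argues that POTS performs better in real network environments. Measurements on the Intel IXP2400, a representative NP design, show that POTS outperforms CSRR by 1.7% to 102%.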
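The ordering problem this abstract addresses can be illustrated with a generic sequence-number reorder buffer. This is a minimal sketch of one common order-restoration technique, assuming each packet is tagged with a unique sequence number on arrival. It is not the CSRR or POTS model from the paper, and the class and method names are hypothetical.

```python
import heapq


class ReorderBuffer:
    """Releases packets strictly in arrival-sequence order.

    Assumes sequence numbers start at 0 and are unique.
    """

    def __init__(self):
        self._next_seq = 0   # sequence number that must leave next
        self._pending = []   # min-heap of (seq, packet) finished early

    def complete(self, seq, packet):
        """Record a processed packet; return packets now releasable, in order."""
        heapq.heappush(self._pending, (seq, packet))
        released = []
        # Drain the heap while its smallest entry is the next one expected.
        while self._pending and self._pending[0][0] == self._next_seq:
            released.append(heapq.heappop(self._pending)[1])
            self._next_seq += 1
        return released
```

Parallel processing threads may finish packets in any order; each calls `complete` when done, and a packet that finishes early is held until every earlier packet has been released.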

20.

An effective processor allocation strategy for multiprogrammed shared-memory multiprocessors

Yue K.K. Lilja D.J. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(12):1246-1258

Existing techniques for sharing the processing resources in multiprogrammed shared-memory multiprocessors, such as time-sharing, space-sharing, and gang-scheduling, typically sacrifice the performance of individual parallel applications to improve overall system utilization. We present a new processor allocation technique called Loop-Level Process Control (LLPC) that dynamically adjusts the number of processors an application is allowed to use for the execution of each parallel section of code, based on the current system load. This approach exploits the maximum parallelism possible for each application without overloading the system. We implement our scheme on a Silicon Graphics Challenge multiprocessor system and evaluate its performance using applications from the Perfect Club benchmark suite and synthetic benchmarks. Our approach shows significant improvements over traditional time-sharing and gang-scheduling. It has performance comparable to, or slightly better than, static space-sharing, but our strategy is more robust since, unlike static space-sharing, it does not require a priori knowledge of the applications' parallelism characteristics.
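The load-based adjustment described in this abstract can be pictured with a small sketch. The abstract does not give the actual LLPC allocation rule, so the function below is only an assumed policy: grant a parallel section the idle processors, capped by what the section can exploit. The function name and all three parameters are hypothetical.

```python
def processors_for_section(total_cpus: int, busy_cpus: int, max_useful: int) -> int:
    """Pick how many processors a parallel section may use right now.

    total_cpus: processors in the machine
    busy_cpus:  processors currently occupied by other work (the system load)
    max_useful: the most processors this section can exploit

    Assumed policy for illustration only, not the published LLPC rule.
    """
    idle = max(total_cpus - busy_cpus, 0)
    # Always grant at least one processor so the section makes progress.
    return max(1, min(idle, max_useful))
```

Calling such a function at the start of each parallel section, rather than once per program, is what lets the allocation follow the system load as it changes.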