期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Synchronous dataflow architecture for network processors 总被引：1，自引：0，他引：1

Carlstrom J. Boden T. 《Micro, IEEE》2004,24(5):10-18

Network processors are programmable, highly integrated communications circuits optimized to provide processing at high data and packet rates. The packet instruction set computer (PISC) architecture is a synchronous dataflow architecture developed for network processors. It uses a deep pipeline that contains two types of processing elements: PISC processors, which perform programmable data manipulation, and I/O processors, which provide access to shared resources such as look-up table memory, hardware accelerators, or coprocessors. 相似文献

2.

A scalable high-performance computing solution for networks onchips

《Micro, IEEE》2002,22(5):46-55

The Eclipse network-on-a-chip architecture uses a sophisticated parallel programming model, realized through multithreaded processors, interleaved memory modules, and a high-capacity interconnection network to support system-on-a-chip designs 相似文献

3.

Speeding up high-speed protocol processors

Serpanos D.N. 《Computer》2004,37(9):108-111

Many network technologies aim to exploit the bandwidth of high-speed links, which now achieve data transfer rates up to several terabits per second. As packet interarrival times shrink to a few tens of nanoseconds, network systems must address a transmission-processing gap by providing extremely fast data paths as well as high-performance subsystems to implement such functions as protocol processing, memory management, and scheduling. Today, network processors are an important class of embedded processors, used all across the network systems space-from personal to local and wide area networks. Network processor architectures focus on exploiting parallelism to achieve high performance. They usually employ conventional architectural concepts to accelerate the processing required to switch packets between different protocol stacks. The architectures support the mechanisms that network protocols implement in a specific stack by providing efficient data paths and by executing many intelligent network or more homogeneous links - for example, a set of Ethernet links. Although network processors can also handle packets concurrently from different protocol stacks, we describe only single-stack processing here. However, the arguments and results extend to a multistack environment. 相似文献

4.

Architectural support for thread communications in multi-core processors

Sevin Varoglu Stephen Jenks 《Parallel Computing》2011,37(1):26-41

In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to destination’s cache before it is needed, eliminating cache misses in the destination’s cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors. 相似文献

5.

Analyzing and enhancing the parallel sort operation on multithreaded architectures 总被引：1，自引：1，他引：0

Layali Rashid Wessam M. Hassanein Moustafa A. Hammad 《The Journal of supercomputing》2010,53(2):293-312

The sort operation is a core part of many critical applications (e.g., database management systems). Despite the large efforts to parallelize it, the fact that it suffers from high data-dependencies vastly limits its performance. Multithreaded architectures are emerging as the most demanding technology in leading-edge processors. These architectures include simultaneous multithreading, chip multiprocessors, and machines combining different multithreading technologies. In this paper, we analyze the memory behavior and improve the performance of the most recent parallel radix and quick integer sort algorithms on modern multithreaded architectures. We achieve speedups up to 4.69× for radix sort and up to 4.17× for quicksort on a machine with 4 multithreaded processors compared to single threaded versions, respectively. We find that since radix sort is CPU-intensive, it exhibits better results on chip multiprocessors where multiple CPUs are available. While quicksort is accomplishing speedups on all types of multithreading processers due to its ability to overlap memory miss latencies with other useful processing. 相似文献

6.

NePSim: a network processor simulator with a power evaluation framework

Yan Luo Jun Yang Bhuyan L.N. Li Zhao 《Micro, IEEE》2004,24(5):34-44

This article presents NePSim, an integrated system that includes a cycle-accurate architecture simulator, an automatic formal verification engine, and a parameterizable power estimator for NPs consisting of clusters of multithreaded execution cores, memory controllers, I/O ports, packet buffers, and high-speed buses. To perform concrete simulation and provide reliable performance and power analysis, we defined our system to comply with Intel's IXP1200 processor specification because academia has widely adopted it as a representative model for NP research. 相似文献

7.

Predictive scheduling of network processors

《Computer Networks》2003,41(5):601-621

To provide flexibility in deploying new protocols and services, general-purpose processing engines are being placed in the datapath of routers. Such network processors (NPs) are typically simple RISC multiprocessors that perform forwarding and custom application processing of packets. The inherent unpredictability of execution time of arbitrary instruction code poses a significant challenge in providing service guarantees for data flows that compete for such processing resources in the network. However, we show that network processing workloads are highly regular and predictable, which can be exploited for scheduling purposes. We present two such predictive processor scheduling algorithms that aim at providing service guarantees as well as improving the performance of the NP by increasing the instruction data locality. Simulation results show that these algorithms provide significantly better performance than processor scheduling algorithms that do not take packet processing times into consideration. 相似文献

8.

网络处理器中处理单元的设计与实现

下载免费PDF全文

李诚李华伟《计算机工程》2007,33(2):252-254

随着网络带宽的飞速增长和各种新的网络应用不断涌现，原有的基于通用处理器和ASIC的互联网架构已经不能满足新的需求。兼具强大处理能力和灵活可编程配置能力的网络处理器逐渐得到广泛的应用。高性能的网络处理器通常采用多个并发的处理单元进行数据平面的快速处理，这些处理单元在网络处理器中居于核心的地位。该文讨论了网络处理器中处理单元设计需要考虑的因素，设计了一种较为灵活有效的处理单元架构，并进行了FPGA原型验证，证实了该结构的可行性。相似文献

9.

Hint-based cache design for reducing miss penalty in HBS packet classification algorithm

Yeim-Kuan Chang Fang-Chen Kuo 《Journal of Parallel and Distributed Computing》2013

In this paper, we implement some notable hierarchical or decision-tree-based packet classification algorithms such as extended grid of tries (EGT), hierarchical intelligent cuttings (HiCuts), HyperCuts, and hierarchical binary search (HBS) on an IXP2400 network processor. By using all six of the available processing microengines (MEs), we find that none of these existing packet classification algorithms achieve the line speed of OC-48 provided by IXP2400. To improve the search speed of these packet classification algorithms, we propose the use of software cache designs to take advantage of the temporal locality of the packets because IXP network processors have no built-in caches for fast path processing in MEs. Furthermore, we propose hint-based cache designs to reduce the search duration of the packet classification data structure when cache misses occur. Both the header and prefix caches are studied. Although the proposed cache schemes are designed for all the dimension-by-dimension packet classification schemes, they are, nonetheless, the most suitable for HBS. Our performance simulations show that the HBS enhanced with the proposed cache schemes performs the best in terms of classification speed and number of memory accesses when the memory requirement is in the same range as those of HiCuts and HyperCuts. Based on the experiments with all the high and low locality packet traces, five MEs are sufficient for the proposed rule cache with hints to achieve the line speed of OC-48 provided by IXP2400. 相似文献

10.

并行网络处理器在路由器中的应用

史旭东陈平梁阿磊《计算机应用与软件》2003,20(1):32-34

网络处理器是为提高报文处理效率而出现的专用处理器。随着网络带宽需求的增加和大量网络服务的涌现，网络处理器向可编程，并行的方向发展，本文以Intel IXP1200网络处理器为例描述了处理器的并行性，并提出了基于并行网络处理器的路由器体系结构，实验证明，这种体系较传统的路由体系可以大大提高报文转发能力。相似文献

11.

A performance study of general-purpose applications on graphics processors using CUDA 总被引：1，自引：0，他引：1

Shuai Che Michael Boyer Jiayuan Meng David Tarjan Jeremy W. Sheaffer Kevin Skadron 《Journal of Parallel and Distributed Computing》2008

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA’s C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and also more readily support a larger body of applications. 相似文献

12.

Performance models for network processor design

Wolf T. Franklin M.A. 《Parallel and Distributed Systems, IEEE Transactions on》2006,17(6):548-561

To provide a variety of new and advanced communications services, computer networks are required to perform increasingly complex packet processing. This processing typically takes place on network routers and their associated components. An increasingly central component in router design is a chip-multiprocessor (CMP) referred to as "network processor" or NP. In addition to multiple processors, NPs have multiple forms of on-chip memory, various network and off-chip memory interfaces, and other specialized logic components such as CAMs (content addressable memories). The design space for NPs (e.g., number of processors, caches, cache sizes, etc.) is large due to the diverse workload, application requirements, and system characteristics. System design constraints relate to the maximum chip area and the power consumption that are permissible while achieving defined line rates and executing required packet functions. In this paper, an analytic performance model that captures the processing performance, chip area, and power consumption for a prototypical NP is developed and used to provide quantitative insights into system design trade offs. The model, parameterized with a networking application benchmark, provides the basis for the design of a scalable, high-performance network processor and presents insights into how best to configure the numerous design elements associated with NPs. 相似文献

13.

网络测试中海量数据包存储与处理技术的研究

吕雪峰梅天凤《计算机应用》2009,29(Z2)

为了有效地捕获、存储与处理网络测试中所产生的海量数据包,提出一种基于WinPcap库与Oracle数据库的软件架构.通过性能试验,该架构可在每秒5000个的数据包速率下,不发生掉包现象,同时又具有灵活与强大的数据包统计与分析功能.试验结果表明,所提出的架构能够满足通常网络测试的需求.同时,相对传统数据包文件直接存储的架构,该架构的内存管理压力更小,数据回放与处理更灵活,分析能力更强大. 相似文献

14.

Exploiting locality and tolerating remote memory access latency using thread migration

Stephen Jenks Jean-Luc Gaudiot 《International journal of parallel programming》1997,25(4):281-304

Much research has focused on reducing and/or tolerating remote memory access latencies on distributed-memory parallel computers. Caching remote data is intended to reduce average access latency by handling as many remote memory accesses as possible using local copies of the data in the cache. Data-flow and multithreaded approaches help programs tolerate the latency of remote memory accesses by allowing processors to do other work while remote operations take place. The thread migration technique described here is a multithreaded architecture where threads migrate to remote processors that contain data they need. By exploiting access locality, the threads often use several data items from that processor before migrating to other processors for more data. Because the threads migrate in search of data, the approach is called Nomadic Threads. A prototype runtime system has been implemented on the CM5 and is portable to other distributed memory parallel computers. 相似文献

15.

MIMS: Towards a Message Interface Based Memory System

下载免费PDF全文

陈荔城陈明宇阮元黄永兵崔泽汉卢天越包云岗《计算机科学技术学报》2014,29(2):255-272

The decades-old synchronous memory bus interface has restricted many innovations in the memory system, which is facing various challenges （or walls） in the era of multi-core and big data. In this paper, we argue that a message- based interface should be adopted to replace the traditional bus-based interface in the memory system. A novel message interface based memory system called MIMS is proposed. The key innovation of MIMS is that processors communicate with the memory system through a universal and flexible message packet interface. Each message packet is allowed to encapsulate multiple memory requests （or commands） and additional semantic information. The memory system is more intelligent and active by equipping with a local buffer scheduler, which is responsible for processing packets, scheduling memory requests, preparing responses, and executing specific commands with the help of semantic information. Under the MIMS framework, many previous innovations on memory architecture as well as new optimization opportunities such as address compression and continuous requests combination can be naturally incorporated. The experimental results on a 16-core cycle-detailed simulation system show that： with accurate granularity message, MIMS can improve system performance by 53.21% and reduce energy delay product （EDP） by 55.90%. Furthermore, it can improve effective bandwidth utilization by 62.42% and reduce memory access latency by 51% on average. 相似文献

16.

基于网络处理器的负载均衡算法的研究与实现

古俐明《计算机工程与应用》2008,44(10):108-110

针对高速环境下转发决策困难的问题,提出一种由多个网络处理器组成的并行转发引擎结构。为解决负载分配问题,提出一种基于映射表的自适应负载分配算法AIHDA。AIHDA算法根据各网络处理器的负载状况和网络流量特性调整负载分配方式,折衷考虑了负载均衡和报文保序要求,具有较好的综合性能。相似文献

17.

一种适用于网络处理器的队列管理算法 总被引：5，自引：0，他引：5

郑波林闯李寅《计算机研究与发展》2005,42(10):1698-1705

遵循比例区分服务模型,设计了一种适用于网络处理器的队列管理算法．算法包含两部分,分组入队列时实现丢失率控制的RR—PLR（round—robin based proportional loss rate）和分组出队列时实现时延控制的WRR—PAD（WRR based proportional average delay）．算法采用轮循的机制,避免了除法运算和排序操作,具有O（1）的复杂度,而且易于在网络处理器上实现．性能模拟以及实测的结果表明,该算法能有效实现平均分组丢失率和平均排队时延的比例控制,系统的总吞吐率达到了1．125Gbps（每个分组64B,即2．25Mpps）．相似文献

18.

实时微处理器体系结构综述 总被引：1，自引：0，他引：1

下载免费PDF全文

石伟张明郭御风龚锐《计算机工程与科学》2015,37(5):857-864

实时应用已经成为嵌入式应用中一类快速崛起的典型应用。作为实时系统的核心部件,实时微处理器体系结构是微处理器领域的一个重要研究方向。与通用处理器追求最大吞吐量不同,实时处理器要求具有紧凑且可计算的最坏执行时间。传统的实时处理器往往采用较为简单的处理器结构,避免复杂结构引入执行时间的不确定性。随着实时应用对处理器性能需求越来越高,实时处理器正逐渐向多线程与多核结构发展。在多线程与多核处理器中,共享资源竞争导致实时系统的确定性变差,对实时处理器体系结构带来了更大挑战。对实时微处理器体系结构进行综述,首先从指令集、微体系结构、存储、I/O、任务调度等多个方面对传统实时处理器进行分析;然后分别对采用多线程与多核结构的高性能实时处理器展开分析;最后对几种商用实时处理器结构进行比较,总结实时处理器发展现状与未来发展趋势。相似文献

19.

支持多核并行程序确定性重放的高效访存冲突记录方法 总被引：2，自引：0，他引：2

刘磊黄河唐志敏《计算机研究与发展》2012,49(1):64-75

多核系统中并行程序执行过程的不确定性给程序调试带来了很大的困难.准确记录初始执行中冲突访存的次序是并行程序确定性重放的基础.提出了通过建立精确happens-before关系记录访存冲突的方法.此方法利用简洁高效的地址冲突检测机制确定冲突访存操作在执行中所处happens-before序关系的位置,可以抑制部分记录信息的产生,从而有效减少记录信息.与其他方式方法相比,可以进一步压缩17%的记录条数.采用逻辑向量时钟描述冲突访存操作间的happens-before关系,与采用标量时钟相比,可以避免happens-before关系的误识,降低重放执行时并行度的损失. 相似文献

20.

Sequence-preserving parallel IP lookup using multiple SRAM-based pipelines

Weirong Jiang Viktor K. Prasanna 《Journal of Parallel and Distributed Computing》2009

SRAM (static random access memory)-based pipelined algorithmic solutions have become competitive alternatives to TCAMs (ternary content addressable memories) for high-throughput IP lookup. Multiple pipelines can be utilized in parallel to improve the throughput further. However, several challenges must be addressed to make such solutions feasible. First, the memory distribution over different pipelines, as well as across different stages of each pipeline, must be balanced. Second, the traffic among these pipelines should be balanced. Third, the intra-flow packet order (i.e. the sequence) must be preserved. In this paper, we propose a parallel SRAM-based multi-pipeline architecture for IP lookup. A two-level mapping scheme is developed to balance the memory requirement among the pipelines as well as across the stages in each pipeline. To balance the traffic, we propose an early caching scheme to exploit the data locality inherent in the architecture. Our technique uses neither a large reorder buffer nor complex reorder logic. Instead, a flow-aware queuing scheme exploiting the flow information is used to maintain the intra-flow sequence. Extensive simulation using real-life traffic traces shows that the proposed architecture with 8 pipelines can achieve a throughput of up to 10 billion packets per second, i.e. 3.2 Tbps for minimum size (40 bytes) packets, while preserving intra-flow packet order. 相似文献