Similar Articles
20 similar articles found
1.
RFC (Recursive Flow Classification) is currently one of the fastest software-based multi-dimensional packet classification algorithms, but its memory consumption grows rapidly as the rule set grows. To address this problem, this paper proposes a memory-optimized variant of RFC called Compact RFC. Based on the distribution of entries in the cross-product tables that RFC builds, the algorithm introduces a compact data structure and a compression method that eliminate more than 60% of the redundant space in RFC's cross-product tables while preserving RFC's time complexity. Both RFC and Compact RFC were implemented on the Intel IXP2800 network processor; the experiments confirm Compact RFC's advantage and show that it reaches OC-192 (10 Gbps) classification speed on the IXP2800 with modest resource usage, giving it high practical value.
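Below is a minimal Python sketch of the two ideas the abstract combines: RFC-style cross-producting over equivalence classes, and removing redundancy by storing each distinct table row only once. The 8-bit fields, the rule set, and every name are illustrative assumptions, not the paper's data structure.

```python
# Toy sketch of RFC-style cross-producting plus row deduplication, in the
# spirit of Compact RFC's redundancy removal. Rules and names are made up.

RULES = [                    # (src_lo, src_hi, dst_lo, dst_hi, action)
    (0, 63, 200, 255, "deny"),
    (64, 127, 200, 255, "deny"),
    (0, 255, 0, 255, "accept"),
]

def phase0(bounds):
    """Give each 8-bit value an equivalence-class ID: two values share a
    class iff they fall in exactly the same subset of rule ranges."""
    key_to_cls, ids, reps = {}, [], []
    for v in range(256):
        key = tuple(lo <= v <= hi for lo, hi in bounds)
        if key not in key_to_cls:
            key_to_cls[key] = len(reps)
            reps.append(v)               # one representative value per class
        ids.append(key_to_cls[key])
    return ids, reps

src_ids, src_reps = phase0([(r[0], r[1]) for r in RULES])
dst_ids, dst_reps = phase0([(r[2], r[3]) for r in RULES])

def first_match(sv, dv):                 # rules are priority ordered
    for s0, s1, d0, d1, act in RULES:
        if s0 <= sv <= s1 and d0 <= dv <= d1:
            return act
    return "default"

# Cross-product table: rows = source classes, columns = destination classes.
full = [[first_match(sr, dr) for dr in dst_reps] for sr in src_reps]

# Compression: identical rows are stored once and each row keeps only a
# small index -- the kind of redundancy a Compact-RFC-style scheme removes.
distinct, row_idx = {}, []
for row in full:
    row_idx.append(distinct.setdefault(tuple(row), len(distinct)))
rows = list(distinct)                    # distinct rows, first-seen order

def classify(src, dst):                  # still simple indexed lookups
    return rows[row_idx[src_ids[src]]][dst_ids[dst]]

assert classify(10, 210) == "deny" and classify(130, 210) == "accept"
print(f"{len(full)} rows stored as {len(rows)} distinct rows")
```

In this toy, two of the three source-class rows resolve to the same action row, so the table shrinks; on realistic classifiers the fraction of duplicate rows is far larger, which is where the reported 60%+ savings would come from.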

2.
Hardware and software cache optimizations are active fields of research that have yielded powerful but occasionally complex designs and algorithms. The purpose of this paper is to investigate the performance gains achievable by combining simple software and hardware optimizations. Because current caches provide little flexibility for exploiting temporal and spatial locality, two hardware modifications are proposed to support these two kinds of locality. Spatial locality is exploited by using large virtual cache lines, which do not exhibit the performance flaws of large physical cache lines. Temporal locality is exploited by minimizing cache pollution with a bypass mechanism that still allows spatial locality to be exploited. Subsequently, it is shown that simple software information on the spatial/temporal locality of array references, as provided by current data locality optimization algorithms, can be used to increase cache performance significantly. The performance and design tradeoffs of the proposed mechanisms are discussed. Software-assisted caches are also shown to provide very convenient support for further enhancement of data locality optimizations.
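The following toy simulator illustrates the bypass idea under stated assumptions: each reference carries a software-provided "non-temporal" flag, and hinted misses are serviced without allocating a line, so a streaming scan stops evicting a reused buffer. The cache geometry, trace, and flag semantics are invented for illustration, not the paper's hardware.

```python
LINE, SETS = 16, 64                       # 1 KiB direct-mapped toy cache

def run(trace, use_hints):
    tags, misses = [None] * SETS, 0
    for addr, non_temporal in trace:
        block = addr // LINE
        s, tag = block % SETS, block // SETS
        if tags[s] == tag:
            continue                      # hit
        misses += 1
        if not (use_hints and non_temporal):
            tags[s] = tag                 # allocate only cache-worthy lines
    return misses

# Synthetic trace: a hot 512-byte buffer reused constantly, interleaved
# with a one-shot streaming scan that maps onto the same sets.
trace = []
for i in range(4000):
    trace.append((i * 4 % 512, False))        # temporal: heavy reuse
    trace.append((0x10000 + i * 16, True))    # spatial-only stream
print("no hints:", run(trace, False), "with hints:", run(trace, True))
```

With hints enabled, the stream stops polluting the cache and the hot buffer's conflict misses collapse to its 32 cold misses.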

3.
A query-result cache can store either the document-identifier sets of query results or the actual returned result pages, in order to speed up responses to user queries; the two forms are called identifier caching and page caching, respectively. For a fixed amount of memory, an identifier cache achieves a higher hit rate, while a page cache delivers faster responses. Exploiting the temporal and spatial locality of user queries, this paper proposes a novel hierarchical result-caching mechanism based on both. First, the mechanism splits the fixed-size result cache into two levels: a page cache and an identifier cache. A submitted query is first answered from the first-level page cache; on a miss, the second-level identifier cache is tried. Experiments show that, compared with traditional mechanisms that rely on a single cache form, this hierarchical design achieves a considerable improvement in average query response time: for example, 9% on average over a pure page cache and 11% in the best case. Second, on top of the identifier cache, the mechanism adds a heuristic prefetching strategy that mines the spatial locality of user retrieval. Experiments show that incorporating this prefetching further improves retrieval performance, yielding a complete and effective result-caching mechanism that covers both temporal and spatial locality.
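A minimal sketch of the two-tier lookup follows: the page cache is consulted first, and a miss there can still hit the identifier cache and skip the expensive ranking step, paying only page rendering. The class and function names, sizes, and the stand-in `rank`/`render` functions are hypothetical, not the paper's system.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, cap):
        self.cap, self.d = cap, OrderedDict()
    def lookup(self, k):
        if k in self.d:
            self.d.move_to_end(k)         # refresh recency on a hit
            return self.d[k]
        return None
    def insert(self, k, v):
        self.d[k] = v
        self.d.move_to_end(k)
        if len(self.d) > self.cap:
            self.d.popitem(last=False)    # evict least recently used

def rank(query):      # stand-in for the expensive index traversal
    return [hash((query, i)) % 100000 for i in range(10)]

def render(docids):   # cheaper: fetch snippets, build the result page
    return "page:" + ",".join(map(str, docids))

page_cache = LRUCache(100)    # pages are big, so few fit
docid_cache = LRUCache(400)   # identifier lists are far smaller

def answer(query):
    page = page_cache.lookup(query)
    if page is not None:
        return page                       # fastest path: whole page cached
    docids = docid_cache.lookup(query)
    if docids is None:
        docids = rank(query)              # full evaluation on a double miss
        docid_cache.insert(query, docids)
    page = render(docids)
    page_cache.insert(query, page)
    return page
```

The paper's heuristic prefetch would hook in at the identifier layer, e.g. precomputing the identifier list for a likely follow-up request of the same query, which this sketch omits.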

4.
Packet classification is one of the most challenging functions in Internet routers, since it involves a multi-dimensional search that must be performed at wire speed. Hierarchical packet classification is an effective approach that reduces the search space significantly each time a field search completes. However, the hierarchical approach using binary tries has two intrinsic problems: back-tracking and empty internal nodes. To avoid back-tracking, the hierarchical set-pruning trie applies rule copying, and the grid-of-tries uses pre-computed switch pointers. However, none of the known hierarchical algorithms avoids both empty internal nodes and back-tracking simultaneously. This paper describes various packet classification algorithms and proposes a new, efficient packet classification algorithm using the hierarchical approach. In the proposed algorithm, a hierarchical binary search tree, which contains no empty internal nodes, is constructed for the pruned set of rules; hence both back-tracking and empty internal nodes are avoided. Two refinement techniques are also proposed: one reduces the rule copying caused by set pruning, and the other avoids rule copying altogether. Simulation results show that the proposed algorithm improves search performance without increasing the memory requirement compared with other existing hierarchical algorithms.
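The two-field sketch below shows the set-pruning idea in miniature: the first field is resolved by binary search over only the boundaries that actually occur (so there are no empty internal nodes), and each elementary interval keeps its pruned, priority-ordered rule list for the second field. This is a simplification under assumed 8-bit fields, not the paper's exact structure.

```python
import bisect

RULES = [  # (src_lo, src_hi, dst_lo, dst_hi, action), highest priority first
    (0, 63, 0, 127, "A"),
    (0, 127, 64, 255, "B"),
    (64, 255, 0, 255, "C"),
]

# Elementary source intervals: between consecutive boundary points the set
# of potentially matching rules is constant; copying each rule into every
# interval it covers is exactly the set-pruning step.
points = sorted({0} | {r[0] for r in RULES} | {r[1] + 1 for r in RULES})
starts, pruned = [], []
for i, lo in enumerate(points):
    hi = (points[i + 1] - 1) if i + 1 < len(points) else 255
    subset = [r for r in RULES if r[0] <= lo and hi <= r[1]]
    if starts and pruned[-1] == subset:
        continue                          # merge intervals with equal sets
    starts.append(lo)
    pruned.append(subset)

def classify(src, dst):
    i = bisect.bisect_right(starts, src) - 1   # binary search on field 1
    for _, _, d0, d1, act in pruned[i]:        # pruned list, field 2
        if d0 <= dst <= d1:
            return act
    return "default"

assert classify(10, 10) == "A" and classify(10, 200) == "B"
assert classify(200, 10) == "C"
```

The paper's refinements would then attack the memory blow-up this rule copying causes; the sketch's interval merging hints at one direction but is not the proposed technique.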

5.
In embedded systems, caches are precious for keeping memory bandwidth low and for allowing the use of slow, narrow off-chip devices. Conversely, the power and die area consumed by the cache force embedded system designers to use small and simple cache memories. Such caches can suffer poor performance because of their inflexible placement policy. In this scenario, a large fraction of the misses can originate from the mismatch between the cache's behavior and the locality features of the memory accesses (conflict misses). In this paper we analyze the conflict-miss phenomenon and define a cache utilization measure. We then propose an object-level Cache Aware allocation Technique (CAT) that transforms the application to fit the cache structure, minimizing the number of conflict misses and maximizing cache exploitation. The solution transforms the program layout using the standard functionality of a linker. The CAT approach allowed the considered applications to deliver the same performance on caches two and sometimes four times smaller. Moreover, the CAT-improved programs on direct-mapped caches outperformed the original versions on set-associative caches. These results highlight that our approach can help embedded system designers meet system requirements with smaller and simpler cache memories.
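A small experiment makes the conflict-miss phenomenon concrete: the same loop over the same two arrays thrashes or runs cleanly depending purely on where the linker places them relative to the cache geometry. The sizes and addresses below are hypothetical, and this measurement, not the paper's CAT tool itself, is what a cache-aware layout would minimize.

```python
LINE, SETS = 32, 128                      # 4 KiB direct-mapped toy cache

def misses(base_a, base_b, n=1024):
    tags, m = [None] * SETS, 0
    for i in range(n):                    # the classic a[i] + b[i] loop
        for addr in (base_a + 4 * i, base_b + 4 * i):
            block = addr // LINE
            s, tag = block % SETS, block // SETS
            if tags[s] != tag:
                tags[s] = tag
                m += 1
    return m

# Layout 1: arrays exactly one cache size apart -> same sets, thrashing.
# Layout 2: offset by one extra line -> the two streams never collide.
print("conflicting layout:", misses(0x0000, 0x1000))   # 2048 misses
print("cache-aware layout:", misses(0x0000, 0x1020))   # 256 cold misses
```

Shifting one base address by a single line turns every access into a conflict miss or removes the conflicts entirely, which is why a link-time transformation can substitute for a larger or more associative cache.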

6.
In this paper, we evaluate the compressibility of the L1 data caches and the L2 cache in general-purpose graphics processing units (GPGPUs). Our proposed scheme is geared toward improving the performance and power of GPGPUs through cache compression. GPGPUs are throughput-oriented devices that execute thousands of threads simultaneously. To handle the working set of this massive number of threads, modern GPGPUs exploit several levels of caches, and the design trend shows cache sizes continuing to grow to support even more thread-level parallelism. We propose using cache compression to increase effective cache capacity, improve performance, and reduce power consumption in GPGPUs. Our work is motivated by the observation that the values within a cache block are similar, i.e., the arithmetic difference of two successive values within a cache block is small. To reduce data redundancy in the L1 data caches and the L2 cache, we use the low-cost, implementation-efficient base-delta-immediate (BDI) algorithm. BDI replaces a cache block with a base and an array of deltas, where the combined size of the base and deltas is less than the original cache block. We also study the locality of fields in integer and floating-point numbers, and find that the entropy of the fields varies across data types. Based on this entropy, we offer different BDI compression schemes for integer and floating-point numbers. We add a simple yet effective predictor that determines the type of values dynamically in hardware, without the help of a compiler or a programmer. Evaluation results show that, on average, cache compression improves performance by 8% and saves 9% of cache energy.
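The core BDI transformation is simple enough to sketch in software: try to re-encode a block of words as one base plus narrow deltas, and fall back to the uncompressed block when the deltas do not fit. The block size, word width, and single delta width below are assumptions for illustration; the paper's hardware variants and its entropy-guided integer/floating-point schemes are not modeled.

```python
import struct

def bdi_compress(block, delta_bytes=1):
    """block: 32 bytes viewed as eight 4-byte little-endian words.
    Returns compressed bytes, or None if the deltas don't fit."""
    words = struct.unpack("<8I", block)
    base = words[0]
    lim = 1 << (8 * delta_bytes - 1)
    # deltas relative to the base, interpreted as signed 32-bit values
    signed = [(w - base + 2**31) % 2**32 - 2**31 for w in words]
    if not all(-lim <= d < lim for d in signed):
        return None                       # block is not BDI-compressible
    mask = (1 << (8 * delta_bytes)) - 1
    body = b"".join((d & mask).to_bytes(delta_bytes, "little") for d in signed)
    return struct.pack("<I", base) + body     # 4 + 8*delta_bytes bytes

def bdi_decompress(comp, delta_bytes=1):
    base = struct.unpack("<I", comp[:4])[0]
    lim = 1 << (8 * delta_bytes - 1)
    words = []
    for i in range(8):
        raw = int.from_bytes(comp[4 + i*delta_bytes:4 + (i+1)*delta_bytes],
                             "little")
        d = raw - (1 << (8 * delta_bytes)) if raw >= lim else raw
        words.append((base + d) % 2**32)
    return struct.pack("<8I", *words)

block = struct.pack("<8I", *[1000 + v for v in (0, 3, 1, 7, 2, 5, 4, 6)])
comp = bdi_compress(block)
assert comp is not None and len(comp) == 12
assert bdi_decompress(comp) == block
print(f"32 -> {len(comp)} bytes")
```

Here a 32-byte block of near-identical values shrinks to 12 bytes; a real design would try several (base size, delta size) pairs and keep the smallest encoding that round-trips.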

7.
The increasing gap between processor and memory speeds, as well as the introduction of multi-core CPUs, has exacerbated the dependency of CPU performance on the memory subsystem. This trend motivates the search for more efficient caching mechanisms, enabling both faster service of frequently used blocks and decreased power consumption. In this paper we describe a novel random-sampling-based predictor that can distinguish transient cache insertions from non-transient ones. We show that this predictor can identify a small set of data-cache-resident blocks that service most of the memory references, thus serving as a building block for new cache designs and block replacement policies. Although we only discuss the L1 data cache, we have found this predictor to be efficient also when handling L1 instruction caches and shared L2 caches.
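One way to realize such a predictor, sketched below under stated assumptions, is to track a small random sample of insertions and charge each insertion site whose sampled blocks get evicted untouched. The hooks, the per-site score table, and all constants are invented to convey the flavor; they are not the paper's predictor.

```python
import random
from collections import defaultdict

random.seed(1)
SAMPLE_RATE = 0.1                 # track ~10% of insertions
score = defaultdict(int)          # insertion site (e.g., PC) -> transience
tracked = {}                      # block address -> (site, reused?)

def on_insert(addr, site):
    """Called when a block enters the cache; sample a few to follow."""
    if random.random() < SAMPLE_RATE:
        tracked[addr] = (site, False)

def on_hit(addr):
    """A sampled block that sees reuse is, by definition, not transient."""
    if addr in tracked:
        site, _ = tracked[addr]
        tracked[addr] = (site, True)

def on_evict(addr):
    """Evicted without reuse -> evidence the site inserts transient blocks."""
    entry = tracked.pop(addr, None)
    if entry is not None:
        site, reused = entry
        score[site] += -1 if reused else 1   # saturating counter in hardware

def predict_transient(site, threshold=4):
    return score[site] >= threshold
```

A replacement or bypass policy would then consult `predict_transient` at insertion time, e.g. inserting predicted-transient blocks at low priority so the small resident set that services most references stays put.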

8.
Wide-issue, high-frequency processors require not only a low-latency but also a high-bandwidth memory system to achieve high performance. Previous studies have shown that using multiple small single-ported caches, instead of a monolithic large multi-ported one, for the L1 data cache can be a scalable and inexpensive way to provide higher bandwidth. Various schemes for steering memory references have been proposed in order to closely match the performance of an ideal multi-ported cache. However, most existing designs seldom take dynamic data access patterns into consideration, and thus suffer from access conflicts within one cache and unbalanced loads between the caches. This paper observes that if the data references defined in a program can be grouped into several regions (access regions) that allow parallel accesses, then providing a separate small cache (an access region cache) for each region may deliver better performance. A register-guided memory reference partitioning approach is proposed; it effectively identifies these semantic regions and organizes them into multiple caches adaptively to maximize concurrent accesses. The base register name, not its content, in the memory reference instruction is used as the basic guide for instruction steering. After an initial assignment to a specific access region cache according to the base register name, a reassignment mechanism captures the access pattern as the program moves across its access regions. In addition, a distribution mechanism adaptively lets access regions extend or shrink among the physical caches to further reduce potential conflicts. Simulations of the SPEC CPU2000 benchmarks show that the semantic-based scheme reduces conflicts effectively and obtains considerable IPC improvement: with 8 access region caches, 25-33% higher IPC is achieved for integer benchmark programs than with a comparable 8-banked cache, while the benefit is smaller for floating-point benchmark programs, at most 19%.

9.
Modern Internet routers have to handle large numbers of packet classification rules, which requires classification schemes to be scalable in both time and space. In this paper, we present a scalable packet classification algorithm developed by adding two new concepts to the well-known bit vector (BV) scheme. We propose a range search method based on a cache-aware tree (CATree) that makes full use of the processor's cache line to reduce the number of dynamic random access memory (DRAM) accesses. Theoretically, the number of DRAM accesses of CATree is about log(m+1) times lower than that of the widely used binary search algorithm, where m is the number of keys in a single cache line. In our computational results on a set of 1024 keys, the CATree algorithm is 36% faster than binary search, and the advantage grows for larger key sets. In addition, we develop a rule re-arrangement algorithm to reduce the bitmap space of BV. With this re-arrangement, rules with the same action may be assigned an identical priority, which reduces the number of priorities as well as the memory space of the bitmap; it also reduces the number of memory accesses and hence increases CPU cache utilization. With CATree and rule re-arrangement, the cache-aware bit vector with rule re-arrangement algorithm outperforms the regular BV scheme in both space and time. In our experiments, the proposed algorithm reduces the bitmap memory of a practical set of firewall rules by two orders of magnitude and is 91% faster than regular BV.
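The cache-aware search idea can be modeled in a few lines: pack m sorted keys per node so that one cache-line fetch resolves an (m+1)-way branch, giving roughly log base (m+1) of N line fills instead of log base 2 of N. The toy below counts node touches as line fills; the node layout and constants are assumptions, not the paper's CATree.

```python
import bisect

M = 7                                    # keys per node ~ one 64-byte line

def build(keys):
    """(M+1)-ary search tree over sorted keys, as nested lists of the form
    [separators, child0, child1, ...]; a leaf is just [keys]."""
    if len(keys) <= M:
        return [list(keys)]
    step = len(keys) // (M + 1)
    seps = [keys[(i + 1) * step] for i in range(M)]
    children, lo = [], 0
    for s in seps:
        hi = keys.index(s)
        children.append(build(keys[lo:hi]))
        lo = hi + 1                       # the separator lives in the node
    children.append(build(keys[lo:]))
    return [seps] + children

def lookup(node, x, fills=0):
    """Return (found, number_of_cache_line_fills)."""
    seps = node[0]
    fills += 1                            # touching a node = one line fill
    if x in seps:
        return True, fills
    if len(node) == 1:                    # leaf, nowhere left to go
        return False, fills
    return lookup(node[1 + bisect.bisect_right(seps, x)], x, fills)

keys = sorted(range(0, 4096, 4))          # 1024 keys
root = build(keys)
print(lookup(root, 2052))                 # found in ~log_8(1024) ~ 4 fills
print(lookup(root, 3))                    # miss, same depth
```

Plain binary search over 1024 keys touches about 10 scattered cache lines; here a hit costs around 4, which is the log(m+1) factor the abstract cites.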

10.
Recent results in the Rio project at the University of Michigan show that it is possible to create an area of main memory that is as safe as disk from operating system crashes. This paper explores how to integrate the reliable memory provided by the Rio file cache into a database system. Prior studies have analyzed the performance benefits of reliable memory; we focus instead on how different designs affect reliability. We propose three designs for integrating reliable memory into databases: non-persistent database buffer cache, persistent database buffer cache, and persistent database buffer cache with protection. Non-persistent buffer caches use an I/O interface to reliable memory and require the fewest modifications to existing databases. However, they waste memory capacity and bandwidth due to double buffering. Persistent buffer caches use a memory interface to reliable memory by mapping it into the database address space. This places reliable memory under complete database control and eliminates double buffering, but it may expose the buffer cache to database errors. Our third design reduces this exposure by write protecting the buffer pages. Extensive fault tests show that mapping reliable memory into the database address space does not significantly hurt reliability. This is because wild stores rarely touch dirty, committed pages written by previous transactions. As a result, we believe that databases should use a memory interface to reliable memory. Received January 1, 1998 / Accepted June 20, 1998

11.
Design and Implementation of a Packet Classification Algorithm for an IXP2400-Based Gigabit Firewall (cited by 3: 0 self-citations, 3 external)
Targeting packet-filtering firewalls on gigabit networks, this paper proposes the HSBIPG (Hash Search Based on IP Group) packet classification algorithm and analyzes its advantages and disadvantages. Using this algorithm, a wire-speed gigabit packet-filtering firewall was implemented on the IXP2400; experiments demonstrate that the algorithm is feasible and efficient.
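The abstract gives few details, so the sketch below is only a guessed illustration of a "hash search based on IP group" flavor: rules are bucketed by a hash of their destination /24 group, and a packet probes just its own bucket instead of the full rule list. The grouping choice and every name are assumptions, not the HSBIPG algorithm itself.

```python
import ipaddress

RULES = [  # (dst_network, action); longest-prefix subtleties omitted
    (ipaddress.ip_network("10.1.2.0/24"), "deny"),
    (ipaddress.ip_network("10.1.3.0/24"), "accept"),
    (ipaddress.ip_network("192.168.0.0/24"), "accept"),
]

N_BUCKETS = 256
buckets = [[] for _ in range(N_BUCKETS)]
for net, action in RULES:
    group = int(net.network_address) >> 8         # the /24 "IP group"
    buckets[hash(group) % N_BUCKETS].append((net, action))

def classify(dst_ip):
    addr = ipaddress.ip_address(dst_ip)
    group = int(addr) >> 8
    for net, action in buckets[hash(group) % N_BUCKETS]:
        if addr in net:                            # confirm within the bucket
            return action
    return "default-deny"

assert classify("10.1.2.7") == "deny"
assert classify("8.8.8.8") == "default-deny"
```

On a network processor the hash would be computed by the hardware hash unit, so the per-packet cost is one bucket probe plus a short confirmation scan.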

12.
This paper presents CSAC (Classification on Self-Adaptive Cache), an efficient, widely applicable, and easy-to-implement packet classification algorithm. By caching the classification query path of the packets in an attribute subspace, the algorithm reuses the query result for subsequent packets in the same subspace; when a cache lookup misses, the query need not restart from the beginning, reducing the time cost of misses. To cope with the effect of shifting traffic context on cache behavior, the algorithm employs a self-adaptive cache mechanism that dynamically adjusts the cache granularity, the cache structure, and the position of cache entries within the hash buckets, effectively sustaining the cache hit rate. In addition, the algorithm requires no preprocessing and supports complex multi-dimensional rules (e.g., layer 4-7 attributes and logical match operations) as well as incremental rule updates, making it well suited to applications with complex packet classification such as network perimeter security, user traffic auditing, and load balancing. High-end firewall and intrusion detection devices built with CSAC perform well in real network environments.
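Below is a rough sketch of the subspace-caching idea with a toy self-adaptation policy: results are cached per masked header subspace, and the mask granularity widens when the hit rate sags. The policies, constants, and names are made up; the real CSAC caches the query path, which is valid across the whole subspace by construction, whereas this toy is only safe because its stand-in classifier inspects just the uncoarsened `dport` field.

```python
def slow_classify(pkt):                   # stand-in for the full rule search
    return "deny" if pkt["dport"] == 23 else "accept"

class SubspaceCache:
    def __init__(self, grain=8, cap=4096):
        self.grain, self.cap = grain, cap  # grain = host bits masked away
        self.table, self.hits, self.lookups = {}, 0, 0

    def key(self, pkt):
        g = self.grain                     # one subspace per masked tuple
        return (pkt["src"] >> g, pkt["dst"] >> g, pkt["proto"], pkt["dport"])

    def classify(self, pkt):
        self.lookups += 1
        k = self.key(pkt)
        if k in self.table:
            self.hits += 1
            action = self.table[k]         # reuse the cached result
        else:
            action = slow_classify(pkt)
            if len(self.table) < self.cap:
                self.table[k] = action
        self.adapt()
        return action

    def adapt(self):
        # toy self-adaptation: widen the subspaces when the hit rate sags
        if self.lookups == 1024:
            if self.hits / self.lookups < 0.5 and self.grain < 24:
                self.grain += 8            # coarser subspaces, more reuse
                self.table.clear()         # keys change with the grain
            self.hits = self.lookups = 0

cache = SubspaceCache()
pkt = {"src": 0x0A010203, "dst": 0x08080808, "proto": 6, "dport": 23}
assert cache.classify(pkt) == "deny"
```

The paper's mechanism additionally adapts the cache structure and entry placement in hash buckets, which this sketch leaves out.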

13.
Gupta, P., McKeown, N. IEEE Micro, 2000, 20(1): 34-41
Increasing demands on Internet router performance and functionality create a need for algorithms that classify packets quickly, with minimal storage requirements, while allowing frequent updates. Unlike previous algorithms, the algorithm proposed here meets this need well by using heuristics that exploit the structure present in classifiers. Our approach, which we call HiCuts (hierarchical intelligent cuttings), partitions the search space in each dimension, guided by simple heuristics that exploit the classifier's structure. We discover this structure by preprocessing the classifier. The algorithm's parameters can be tuned to trade off query time against storage requirements. In classifying packets based on four header fields, HiCuts performs quickly and requires relatively little storage compared with previously described algorithms.
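A much-simplified, two-field rendition of the HiCuts idea follows: recursively cut one dimension into equal slices until each leaf holds at most `binth` rules, then classify with a short linear scan at the leaf. The heuristics that pick the cut dimension and the number of cuts are reduced here to toy choices (alternate dimensions, four cuts), so this conveys the shape of the algorithm rather than the published one.

```python
BINTH, CUTS = 2, 4

def overlaps(rule, box):
    (s0, s1, d0, d1, _), (bs0, bs1, bd0, bd1) = rule, box
    return s0 <= bs1 and bs0 <= s1 and d0 <= bd1 and bd0 <= d1

def build(rules, box, dim=0):
    if len(rules) <= BINTH:
        return ("leaf", rules)            # few enough rules: stop cutting
    lo, hi = box[2 * dim], box[2 * dim + 1]
    width = (hi - lo + 1) // CUTS
    kids = []
    for c in range(CUTS):
        clo = lo + c * width
        chi = hi if c == CUTS - 1 else clo + width - 1
        cbox = list(box)
        cbox[2 * dim], cbox[2 * dim + 1] = clo, chi
        sub = [r for r in rules if overlaps(r, tuple(cbox))]
        kids.append(build(sub, tuple(cbox), 1 - dim))  # alternate dims
    return ("node", dim, lo, width, kids)

def classify(node, src, dst):
    while node[0] == "node":
        _, dim, lo, width, kids = node
        v = src if dim == 0 else dst
        node = kids[min((v - lo) // width, CUTS - 1)]
    for s0, s1, d0, d1, act in node[1]:   # priority-ordered leaf scan
        if s0 <= src <= s1 and d0 <= dst <= d1:
            return act
    return "default"

RULES = [(0, 63, 0, 255, "A"), (64, 127, 0, 63, "B"),
         (0, 255, 64, 127, "C"), (128, 255, 0, 255, "D")]
root = build(RULES, (0, 255, 0, 255))
assert classify(root, 10, 10) == "A" and classify(root, 200, 70) == "C"
```

The published heuristics choose, per node, the dimension and cut count that best spread the rules for a given storage budget, which is exactly the query-time versus storage knob the abstract mentions.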

14.
We study the on-line caching problem in a restricted cache, where each memory item can be placed in only a restricted subset of cache locations. Examples of restricted caches in practice include victim caches, assist caches, and skew caches. To the best of our knowledge, all previous on-line caching studies have considered on-line caching in identical or fully-associative caches, where every memory item can be placed in any cache location. In this paper, we focus on companion caches, a simple restricted cache that includes victim caches and assist caches as special cases. Our results show that restricted caches are significantly more complex than identical caches. For example, we show that the commonly studied Least Recently Used algorithm is not competitive unless cache reorganization is allowed, while the performance of the First In First Out algorithm is competitive but not optimal. We also present two near-optimal algorithms for this problem, as well as lower bound arguments.

15.
The growing influence of wire delay in cache design has meant that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to off-chip memory, because of the significant speed gap between processor and memory. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA has undermined the effectiveness of replacement policies, because banks operate independently of each other and their replacement decisions are therefore restricted to a single NUCA bank. In this paper, we propose three different techniques to deal with replacements in NUCA caches.

16.
Research on the IXP2400 Network Processor and the Multithreading Implementation in Its Microengines (cited by 2: 0 self-citations, 2 external)
Network processors combine the high performance of ASICs with the programmable flexibility of RISC chips, meeting the demands of rapidly advancing data communication well, and they have broad application prospects in future network equipment. The IXP2400 is Intel's second-generation network processor. It adopts a high-performance parallel architecture to handle complex algorithms, packet content inspection, traffic management, and wire-speed forwarding. Multithreading is the key technique by which the IXP2400 achieves high-speed data processing. This paper introduces the IXP2400's hardware architecture and software development, and analyzes the techniques involved in implementing multithreading in its microengines.

17.
A new cache architecture based on temporal and spatial locality (cited by 5: 0 self-citations, 5 external)
A data cache system is designed as a low-power, high-performance cache structure for embedded processors. A direct-mapped cache is a favorite choice for short cycle times but suffers from a high miss rate. The proposed dual data cache is therefore an approach to improving the miss ratio of a direct-mapped cache without affecting its access time. The proposed cache system can exploit temporal and spatial locality effectively by maximizing the effective cache memory space for any given cache size. It consists of two caches: a direct-mapped cache with a small block size and a fully associative spatial buffer with a large block size. Temporal locality is exploited by selectively caching candidate small blocks in the direct-mapped cache, while spatial locality is exploited aggressively by fetching multiple neighboring small blocks whenever a cache miss occurs. According to our comparison and analysis, similar performance can be achieved with a cache one quarter the size of a conventional direct-mapped cache, and the power consumption of the proposed cache can be reduced by around 4% compared with a victim cache configuration.
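The behavioral sketch below models the described division of labor with invented parameters: misses fill a fully associative spatial buffer with a large block (capturing spatial locality), and when a large block retires, only the sub-blocks that were actually touched are promoted into the direct-mapped side (selecting temporal candidates). It is an illustration of the idea, not the paper's design.

```python
from collections import OrderedDict

SMALL, RATIO, SETS, BUF = 8, 4, 64, 8   # 8 B sub-blocks, 32 B large blocks

dm = [None] * SETS                      # direct-mapped: set -> small-block id
buf = OrderedDict()                     # large-block id -> touched sub-blocks

def access(addr, stats):
    small = addr // SMALL
    large = small // RATIO
    if dm[small % SETS] == small:
        stats["hit"] += 1               # temporal hit in the DM cache
        return
    if large in buf:
        stats["hit"] += 1               # spatial hit inside the buffer
        buf[large].add(small % RATIO)
        return
    stats["miss"] += 1
    buf[large] = {small % RATIO}        # fetch the whole large block
    if len(buf) > BUF:                  # retire the oldest large block
        old, touched = buf.popitem(last=False)
        for sub in touched:             # promote demonstrated reuse only
            sid = old * RATIO + sub
            dm[sid % SETS] = sid

stats = {"hit": 0, "miss": 0}
for rep in range(4):
    for a in range(0, 2048, 8):         # repeated streaming pass
        access(a, stats)
print(stats)
```

Even on this simple trace, three of every four sub-block accesses are served by the spatial buffer, which is how a small structure can substitute for a much larger plain direct-mapped cache.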

18.
In a snoopy cache multiprocessor system, each processor has a cache in which it stores blocks of data. Each cache is connected to a bus used to communicate with the other caches and with main memory. Each cache monitors the activity on the bus and in its own processor and decides which blocks of data to keep and which to discard. For several of the proposed architectures for snoopy caching systems, we present new on-line algorithms to be used by the caches to decide which blocks to retain and which to drop in order to minimize communication over the bus. We prove that, for any sequence of operations, our algorithms' communication costs are within a constant factor of the minimum required for that sequence; for some of our algorithms we prove that no on-line algorithm has this property with a smaller constant.

19.
A large-scale, cache-based multiprocessor that is interconnected by a hierarchical network, such as hierarchical buses or a multistage interconnection network (MIN), is considered. An adaptive cache coherence scheme for the system is proposed, based on a hardware approach that handles multiple shared reads efficiently. The new protocol allows multiple copies of a shared data block in the hierarchical network, but minimizes the cache coherence overhead by dynamically partitioning the network into sharing and nonsharing regions based on program behavior. The new cache coherence scheme effectively utilizes the bandwidth of the hierarchical networks and exploits the locality properties of parallel algorithms. Simulation experiments have been carried out to analyze the performance of the new protocol. The simulation results show that the new protocol gives a 15% to 30% performance improvement over some existing cache coherence schemes on similar systems, for a wide range of workload parameters.

20.
Zhong Ting, Liu Yong, Geng Ji. Journal of Computer Applications, 2005, 25(11): 2568-2570
Packet filtering efficiency greatly affects firewall performance. This paper proposes an efficient packet filtering scheme based on the Intel IXP2400 network processor. The scheme optimizes packet filtering through a dynamic rule table, a static rule tree, and the hardware hash acceleration unit, enabling an Intel IXP2400-based firewall to genuinely reach gigabit wire speed.
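The abstract describes the design only broadly, so the following is a guessed illustration of how such a two-stage arrangement typically fits together: an exact-match flow table (the "dynamic rule table", which the IXP2400's hash unit would accelerate) answers established flows in one lookup, and a miss falls back to the static rule search, then installs a flow entry. All names and the rule format are assumptions, not the paper's implementation.

```python
STATIC_RULES = [  # (src_net, dst_net, dport, action); priority-ordered
    (("10.0.0.0", 8), ("0.0.0.0", 0), 22, "accept"),
    (("0.0.0.0", 0), ("0.0.0.0", 0), None, "deny"),
]

def ip_int(s):
    a, b, c, d = map(int, s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def match_static(src, dst, dport):
    """Slow path: linear scan standing in for the static rule tree."""
    for (snet, sbits), (dnet, dbits), p, action in STATIC_RULES:
        smask = 0 if sbits == 0 else (~0 << (32 - sbits)) & 0xFFFFFFFF
        dmask = 0 if dbits == 0 else (~0 << (32 - dbits)) & 0xFFFFFFFF
        if ((src ^ ip_int(snet)) & smask) == 0 and \
           ((dst ^ ip_int(dnet)) & dmask) == 0 and p in (None, dport):
            return action
    return "deny"

flow_table = {}                          # 5-tuple -> action (dynamic rules)

def filter_packet(src, dst, sport, dport, proto):
    key = (src, dst, sport, dport, proto)
    if key in flow_table:                # fast path: one hashed lookup
        return flow_table[key]
    action = match_static(src, dst, dport)
    flow_table[key] = action             # install; real code ages entries out
    return action

assert filter_packet(ip_int("10.1.2.3"), ip_int("8.8.8.8"),
                     4242, 22, 6) == "accept"
assert filter_packet(ip_int("4.4.4.4"), ip_int("8.8.8.8"),
                     4242, 80, 6) == "deny"
```

After the first packet of a flow pays for the static search, every later packet of that flow costs one hash probe, which is what makes wire-speed filtering feasible on the hardware described.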
