1.

LOP: A packet classification architecture with higher throughput and lower power consumption than TCAM

Xin He Jorgen Peddersen Sri Parameswaran 《Design Automation for Embedded Systems》2010,14(3):231-263

Packet classification is an important method implemented in modern network processors used in embedded systems such as routers. Current software-based packet classification techniques exhibit low performance, prompting researchers to move their focus to architectures encompassing both software and hardware components. Some of the newer hardware architectures exclusively utilize Ternary Content Addressable Memory (TCAM) to improve the performance of rule matching. However, this results in systems with high power consumption. In this paper, we propose LOP, a novel SRAM-based architecture where incoming packets are compared against parts of all rules simultaneously until a single matching rule is found for the compared bits in the packets. LOP significantly reduces power consumption as only a segment of the memory is compared against the incoming packet. Despite the additional time penalty to match a single packet, parallel comparison of multiple packets can improve throughput beyond that of the TCAM approaches, while consuming significantly less power. Compared with a state-of-the-art TCAM implementation (throughput of 495 Million Searches per Second (Msps)) in 65 nm CMOS technology, on average, LOP saves 43% of energy consumption with a throughput of 590 Msps. In addition, an analysis of how the area scales is provided.
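The segment-wise matching described in this abstract can be illustrated in software. The following is a minimal sketch only, not the authors' hardware design: it assumes hypothetical ternary rules written as strings of '0', '1' and '*' (wildcard), a fixed segment width, and priority given to the lowest rule index. The names segment_matches and classify are illustrative.

```python
from typing import List, Optional


def segment_matches(rule_seg: str, key_seg: str) -> bool:
    """Ternary compare of one segment: '*' in the rule matches either bit."""
    return all(r == "*" or r == k for r, k in zip(rule_seg, key_seg))


def classify(rules: List[str], key: str, seg_width: int = 4) -> Optional[int]:
    """Return the index of the first rule matching `key`, or None.

    Assumption: all rules and the key have the same length. Only one
    segment of every surviving rule is compared per step, so the set of
    candidate rules shrinks as more bits of the key are examined.
    """
    candidates = list(range(len(rules)))
    for start in range(0, len(key), seg_width):
        end = start + seg_width
        key_seg = key[start:end]
        # Keep only the rules whose current segment matches the key segment.
        candidates = [
            i for i in candidates if segment_matches(rules[i][start:end], key_seg)
        ]
        if not candidates:
            return None  # no rule matches the bits compared so far
    return candidates[0]  # lowest index is treated as highest priority
```

With the hypothetical rule set ["1010****", "10******", "********"], the key "10001111" is rejected by the first rule in the first segment and resolves to the second rule.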

2.

Low-memory and high-performance architectures for the CCSDS 122.0-B-1 compression standard

《Integration, the VLSI Journal》2019

Two low-memory and high-performance architectures for the CCSDS 122.0-B-1 standard are proposed. They use novel memory organizations to reduce the total memory requirements so that they can be implemented in a single FPGA device. The architectures were implemented in radiation-hardened and commercial FPGA devices. Based on the experimental results for the Virtex5QV radiation-hardened device, the throughput is 135 MSamples/sec for images with 12 bits/pixel and a horizontal resolution of up to 8192 pixels. Also, the proposed architectures outperform the existing one in terms of memory requirements and area.

3.

Efficient Parallelization of Polyphase Arbitrary Resampling FIR Filters for High-Speed Applications

Hannes Ramon Haolin Li Piet Demeester Johan Bauwelinck Guy Torfs 《Journal of Signal Processing Systems》2018,90(3):295-303

This article describes a method for increasing the sampling rate of efficient polyphase arbitrary resampling FIR filters. An FPGA proof of concept prototype of this architecture has been implemented in a Xilinx Kintex-7 FPGA which is able to convert the sampling rate of a signal from 500 MHz to 600 MHz. This article compares this new architecture with other best known efficient resampling architectures implemented on the same FPGA. The area usage on the FPGA shows that our proposed implementation is very proficient in high bandwidth applications without requiring significantly more resources on the FPGA. A theoretical calculation of the resampling error introduced on a modulated data stream is provided to evaluate the new architecture against other existing resampling architectures.

4.

High-throughput Block Turbo Decoding: From Full-parallel Architecture to FPGA Prototyping (total citations: 1; self-citations: 0; citations by others: 1)

Camille Leroux Christophe Jégo Patrick Adde Michel Jézéquel 《Journal of Signal Processing Systems》2009,57(3):349-361

Ultra high-speed block turbo decoder architectures meet the demand for even higher data rates and open up new opportunities for the next generations of communication systems such as fiber optic transmissions. This paper presents the implementation, on an FPGA device, of an ultra high throughput block turbo code decoder. An innovative architecture of a block turbo decoder which enables the memory blocks between all half-iterations to be removed is presented. A complexity analysis of the elementary decoder leads to a low complexity decoder architecture with negligible performance degradation. The resulting turbo decoder is implemented on a Xilinx Virtex II-Pro FPGA in a communication experimental setup which also includes an innovative parallel product encoder. The implemented block turbo decoder processes input data at 600 Mb/s. The component code is an extended Bose, Ray-Chaudhuri, Hocquenghem (eBCH(16,11)) code. Some solutions to reach even higher data rates are finally presented.

5.

Reduced Memory and Low Power Architectures for CORDIC-based FFT Processors

Erdal Oruklu Xin Xiao Jafar Saniie 《Journal of Signal Processing Systems》2012,66(2):129-134

This paper presents a pipelined, reduced memory and low power CORDIC-based architecture for fast Fourier transform implementation. The proposed algorithm utilizes a new addressing scheme and the associated angle generator logic in order to remove any ROM usage for storing twiddle factors. As a case study, the radix-2 and radix-4 FFT algorithms have been implemented on FPGA hardware. The synthesis results match the theoretical analysis, and a reduction of more than 20% can be achieved in total memory logic. In addition, the dynamic power consumption can be reduced by as much as 15% by reducing memory accesses.
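To illustrate why a CORDIC rotator can stand in for stored twiddle factors, the sketch below computes an FFT twiddle factor by rotating the vector (1, 0) through shift-and-add style micro-rotations. It is a floating-point illustration under stated assumptions, not the paper's addressing scheme or angle generator: the function names cordic_rotate and twiddle are hypothetical, the angle is restricted to |theta| <= pi/2, and the arctangent constants are computed with math.atan rather than generated in hardware.

```python
import math


def cordic_rotate(x, y, theta, iterations=16):
    """Rotate (x, y) by theta radians (|theta| <= pi/2) with CORDIC micro-rotations."""
    # Constant scale factor that compensates for the growth of each micro-rotation.
    gain = 1.0
    for i in range(iterations):
        gain *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = theta  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # In hardware the multiplications by 2**-i are right shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return x * gain, y * gain


def twiddle(k, n):
    """Approximate W_n^k = exp(-2j*pi*k/n) as (real, imag) for 0 <= k <= n/4."""
    return cordic_rotate(1.0, 0.0, -2.0 * math.pi * k / n)
```

For example, twiddle(1, 8) approximates (cos(pi/4), -sin(pi/4)); with 16 iterations the residual angle error is bounded by atan(2**-15), about 3e-5 rad.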

6.

Edge‐Preserving Algorithm for Block Artifact Reduction and Its Pipelined Architecture

Truong Quang Vinh Young‐Chul Kim 《ETRI Journal》2010,32(3):380-389

This paper presents a new edge‐protection algorithm and its very large scale integration (VLSI) architecture for block artifact reduction. Unlike previous approaches using block classification, our algorithm utilizes pixel classification to categorize each pixel into one of two classes, namely smooth region and edge region, which are described by the edge‐protection maps. Based on these maps, a two‐step adaptive filter which includes offset filtering and edge‐preserving filtering is used to remove block artifacts. A pipelined VLSI architecture of the proposed deblocking algorithm for HD video processing is also presented in this paper. A memory‐reduced architecture for a block buffer is used to optimize memory usage. The architecture of the proposed deblocking filter is verified on FPGA Cyclone II and implemented using the ANAM 0.25 µm CMOS cell library. Our experimental results show that our proposed algorithm effectively reduces block artifacts while preserving the details. The PSNR performance of our algorithm using pixel classification is better than that of previous algorithms using block classification.

7.

Dual-Data Rate Transpose-Memory Architecture Improves the Performance, Power and Area of Signal-Processing Systems

Mohamed El-Hadedy Xinfei Guo Martin Margala Mircea R. Stan Kevin Skadron 《Journal of Signal Processing Systems》2017,88(2):167-184

This paper presents a novel type of high-speed and area-efficient register-based transpose memory architecture enabled by reporting on both edges of the clock. The proposed new architecture, by using double-edge triggered registers, doubles the throughput and increases the maximum frequency by avoiding some of the combinational circuitry used in prior work. The proposed design is evaluated with both FPGA and ASIC flows in 28/32 nm technology. The experimental results show that the proposed memory achieves almost 4X improvement in throughput while consuming 46% less area with the FPGA implementations compared to prior work. For ASIC implementations, it achieves more than 60% area reduction and at least 2X performance improvement while burning 60% less power compared to other register-based designs implemented with the same flow. As an example, a proposed 8X8 transpose memory with 12-bit input/output resolution is able to achieve a throughput of 107.83 Gbps at 647 MHz by taking only 140 slices on a Virtex-7 Xilinx FPGA platform, and achieve a throughput of 88.2 Gbps at 529 MHz by taking 0.024 mm² silicon area for ASIC. The proposed transpose memory is integrated in both 2D-DCT and 2D-IDCT blocks for signal processing applications on the same FPGA platform. The new architecture allows a 3.5X speed-up in performance for the 2D-DCT algorithm, compared to the previous work, while consuming 28% less area, and 2D-IDCT achieves a 3X speed-up while consuming 20% less area.

8.

PMCNOC: A Pipelining Multi-channel Central Caching Network-on-chip Communication Architecture Design

N. Wang A. Sanusi P. Y. Zhao M. Elgamel M. A. Bayoumi 《Journal of Signal Processing Systems》2010,60(3):315-331

With the de facto transformation of technology into nano-technology, more and more functional components can be embedded on a single silicon die, thus enabling high degree pipelining operations such as those required for multimedia applications. In recent years, system-on-chip designs have migrated from fairly simple single processor and memory designs to relatively complicated systems with multiple processors, on-chip memories, standard peripherals, and other functional blocks. The communication between these IP blocks is becoming the dominant critical system path and performance bottleneck of system-on-chip designs. Network-on-chip architectures, such as Virtual Channel (2004), Black-bus (2004), Pirate (2004), AEthereal (2005), and VICHAR (2006) architectures, emerged as promising solutions for future system-on-chip communication architecture designs. However, these existing architectures all suffer from certain problems, including high area cost and communication latency and/or low network throughput. This paper presents a novel network-on-chip architecture, Pipelining Multi-channel Central Caching, to address the shortcomings of the existing architectures. By embedding a central cache into every switch of the network, blocked head packets can be removed from the input buffers and stored in the caches temporarily, thus alleviating the effect of head-of-line and deadlock problems and achieving higher network throughput and lower communication latency without paying the price of higher area cost. Experimental results showed that the proposed architecture exhibits both hardware simplicity and system performance improvement compared to the existing network-on-chip architectures.

9.

Sharing of SRAM Tables Among NPN-Equivalent LUTs in SRAM-Based FPGAs

Meyer J. Kocan F. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(2):182-195

This article introduces a novel lookup table (LUT) and its usage in the configurable logic block (CLB) architectures for SRAM-based field-programmable gate array (FPGA) architectures. The proposed CLB allows sharing of SRAM tables of LUTs among NPN-equivalent functions, which reduces both the size of the memories used for storing the functions and the number of configuration bits required. We measured many different characteristics of FPGAs using our new CLB architecture, including area, delay, routing, and power requirements. We experimentally found that for many different FPGA architectures, CLBs can share one-fourth of their SRAM tables between two basic logic elements (BLEs), which reduced both power consumption and area without negatively affecting routing or wirelength, and there was only a negligible increase in critical path delay of 0.27%. Specifically, we find that FPGAs consisting of CLBs with 16 BLEs and 34 inputs can be implemented with eight normal SRAMs and four SRAMs shared between two BLEs, for an overall reduction of four out of sixteen SRAM tables per CLB. With this new CLB architecture, we measured an approximate reduction in overall power consumption of 2% and an estimated reduction in area of 3%.

10.

VLSI design of memory-efficient, high-speed baseline MQ coder for JPEG 2000

Kishor Sarawadekar Swapna Banerjee 《Integration, the VLSI Journal》2012,45(1):1-8

The embedded block coding with optimized truncation (EBCOT) algorithm is the heart of the JPEG 2000 image compression system. The MQ coder used in this algorithm restricts throughput of the EBCOT because there is very high correlation among all procedures to be performed in it. To overcome this obstacle, a high throughput MQ coder architecture is presented in this paper. To accomplish this, we have studied the number of rotations performed and the rate of byte emission in an image. This study reveals that in an image, on average, one and two shifts occur 75.03% and 22.72% of the time, respectively. Similarly, about 5.5% of the time two bytes are emitted concurrently. Based on these facts, a new MQ coder architecture is proposed which is capable of consuming one symbol per clock cycle. The throughput of this coder is improved by operating the renormalization and byte out stages concurrently. To reduce the hardware cost, synchronous shifters are used instead of hard shifters. The proposed architecture is implemented on Stratix FPGA and is capable of operating at 145.9 MHz. Memory requirement of the proposed architecture is reduced by a minimum of 66% compared to those of the other existing architectures. A relative figure of merit is computed to compare the overall efficiency of all architectures, which shows that the proposed architecture provides a good balance between throughput and hardware cost.

11.

Application-aware virtual paths insertion for NOCs

Majed ValadBeigi Farshad Safaei Bahareh Pourshirazi 《Microelectronics Journal》2014

Network-on-chip (NoC) has rapidly become a promising alternative for complex system-on-chip architectures including recent multicore architectures. Additionally, optimizing NoC architectures with respect to different design objectives that are suitable for a particular application domain is crucial for achieving high-performance and energy-efficient customized solutions. Although many studies have provided various solutions for different aspects of NoC design, a comprehensive NoC system solution has not emerged yet. This paper presents a novel methodology to provide a solution for complex on-chip communication problems to reduce power, latency and area overhead. Our proposed NoC communication architecture is based on setting up virtual source–destination paths between selected pairs of NoC cores so that the packets belonging to distant nodes in the network can bypass intermediate routers while traveling through these virtual paths. In this scheme, the paths are constructed for an application based on its task-graph at design time. After that, a run-time scheduling mechanism is applied to improve the buffer management, virtual channel and switch allocation schemes and hence, the constructed paths are optimized dynamically. Moreover, in our design the router complexity and its overheads are reduced. Additionally, the suggested router has been implemented on the Xilinx Virtex-5 FPGA family. The evaluation results captured by the SPLASH-2 benchmark suite reveal that in comparison with the conventional NoC router, the proposed router achieves 25% and 53% reductions in latency and energy, respectively, with a 3.5% area overhead. Indeed, our experimental results demonstrate a significant reduction in the average packet latency and total power consumption with negligible area overhead.

12.

Stateful-NOR based reconfigurable architecture for logic implementation

《Microelectronics Journal》2015,46(6):551-562

Most commercial Field Programmable Gate Arrays (FPGAs) have limitations in terms of density, speed, configuration overhead and power consumption, mostly due to the use of SRAM cells in Look-Up Tables (LUTs), configuration memory and programmable interconnects. Also, hardwired Application Specific Integrated Circuit (ASIC) blocks designed for high performance arithmetic circuits in FPGAs reduce the area available for reconfiguration. In this paper, we propose a novel generalized hybrid CMOS-memristor based architecture using stateful-NOR gates as basic building blocks for implementation of logic functions. These logic functions are implemented on memristor nanocrossbar layers, while the CMOS layer is used for selection and connection of memristors. The proposed pipelined architecture combines the features of ASIC, FPGA and microprocessor based designs. It has high density due to the use of the nanocrossbar layer and high throughput, especially for arithmetic circuits. The proposed architecture for a three input, one output logic block is compared with a conventional LUT based Configurable Logic Block (CLB) having the same number of inputs and outputs; the comparison shows 1.82× area saving, 1.57× speedup and 3.63× less power consumption. The automation algorithm to implement any logic function using the proposed architecture is also presented.

13.

Exploiting generalized de-Bruijn/Kautz topologies for flexible iterative channel code decoder architectures

《Integration, the VLSI Journal》2015

Modern iterative channel code decoder architectures have tight constraints on the throughput but require flexibility to support different modes and standards. Unfortunately, flexibility often comes at the expense of increasing the number of clock cycles required to complete the decoding of a data-frame, thus reducing the sustained throughput. The Network-on-Chip (NoC) paradigm is an interesting option to achieve flexibility, but several design choices, including the topology and the routing algorithm, can affect the decoder throughput. In this work, logarithmic diameter topologies, in particular generalized de-Bruijn and Kautz topologies, are addressed as possible solutions to achieve both flexible and high throughput architectures for iterative channel code decoding. In particular, this work shows that the optimal shortest-path routing algorithm for these topologies, that is still available in the open literature, can be efficiently implemented resorting to a very simple circuit. Experimental results show that the proposed architecture features a reduction of about 14% and 10% for area and power consumption respectively, with respect to a previous shortest-path routing-table-based design.

14.

A high performance MQ encoder architecture in JPEG2000

Kai Liu Yu Zhou Jian Feng Ma 《Integration, the VLSI Journal》2010,43(3):305-317

In this paper, a novel architecture for an MQ arithmetic coder with high throughput is proposed. The architecture can process two symbols in parallel. The main characteristics are eight process elements for the prediction of probability interval A, the combination of calculation units for the code register C with the Byteout&Flush procedure, and the use of a dedicated probability estimation table to decrease the internal memory. From FPGA synthesis results, the architecture’s throughput can reach 96.60 M context symbols per second with an internal memory size of 1509 bits, which is comparable to that of other architectures and suitable for chip implementation.

15.

Efficient Implementations for AES Encryption and Decryption (total citations: 1; self-citations: 0; citations by others: 1)

Rashmi Ramesh Rachh P. V. Ananda Mohan B. S. Anami 《Circuits, Systems, and Signal Processing》2012,31(5):1765-1785

This paper proposes two efficient architectures for hardware implementation of the Advanced Encryption Standard (AES) algorithm. The composite field arithmetic for implementing SubBytes (S-box) and InvSubBytes (Inverse S-box) transformations investigated by several authors is used as the basis for deriving the proposed architectures. The first architecture for encryption is based on an optimized S-box followed by bit-wise implementation of MixColumns and AddRoundKey, and on an optimized Inverse S-box followed by bit-wise implementation of InvMixColumns and AddMixRoundKey for decryption. The proposed S-box and Inverse S-box used in this architecture are designed as a cascade of three blocks. In the second proposed architecture, block III of the proposed S-box is combined with the MixColumns and AddRoundKey transformations, forming an integrated unit for encryption. An integrated unit for decryption combining block III of the proposed InvSubBytes with InvMixColumns and AddMixRoundKey is formed on similar lines. The delays of the proposed architectures for VLSI implementation are found to be the shortest compared to the state-of-the-art implementations of AES operating in non-feedback mode. Iterative and fully unrolled sub-pipelined designs including key schedule are implemented using FPGA and ASIC. The proposed designs are efficient in terms of Kgates/Giga-bits per second ratio compared with a few recent state-of-the-art ASIC (0.18-μm CMOS standard cell) based designs and in terms of throughput per area (TPA) for FPGA implementations.

16.

A Gbps IPSec SSL Security Processor Design and Implementation in an FPGA Prototyping Platform

Haixin Wang Guoqiang Bai Hongyi Chen 《Journal of Signal Processing Systems》2010,58(3):311-324

17.

An SDN approach to route massive data flows of sensor networks

Olivier Flauzac Carlos Javier Gonzalez Santamaria Florent Nolot Isaac Woungang 《International Journal of Communication Systems》2020,33(7)

With the advent of the Internet of Things (IoT), more and more devices can establish a connection with local area networks and use routing protocols to forward all information to the sink. But these devices may not have enough resources to execute a complex routing protocol or to memorize all information about the network. With proactive routing protocols, each node calculates the best path, and it needs enough resources to memorize the network topology. With reactive routing protocols, each node has to broadcast the message to learn the right path that the packets must follow. In all cases, in large networks such as IoT, this is not an appropriate mechanism. This paper presents a new software‐defined network (SDN)–based network architecture to optimize the resource consumption of each IoT object while securing the exchange of messages between the embedded devices. In this architecture, the controller is in charge of all decisions, and objects only exchange messages and forward packets among themselves. In the case of large networks, the network is organized into clusters. Our proposed network architectures are tested with 1000 things grouped in five clusters and managed by one SDN controller. The tests using OpenDayLight and IoT embedded applications have been implemented on several scenarios providing the ability and the scalability from dynamic reorganization of the end‐devices. This approach explores the network performance issues using a virtualized SDN‐clustered environment which contributes to a new model for future network architectures.

18.

VLSI Architecture Design of a High-Performance Traffic Sign Detection Module

Gangyi Wang Yansheng Jin Guanghui Ren Tong Liu 《Infrared and Laser Engineering》2018,47(9):926001-0926001(9)

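Traffic sign detection is an important function of driver assistance systems, but its extremely demanding real-time requirements make it very challenging. This paper proposes a VLSI architecture for a high-performance prohibitory sign detection module, which was implemented and verified on an FPGA platform. The basic principle of the architecture is to use color and shape features together, detecting circles with the circular Hough transform in the red-edge bitmap of the image. By exploiting the local characteristics of the circular Hough transform, the proposed architecture uses significantly less memory than conventional architectures. A design in which all radii vote simultaneously fully exploits the parallelism of the FPGA's logic elements and memory. The architecture was verified on an Altera EP3C55F484C6 FPGA, where it reaches a maximum operating frequency of 122 MHz with acceptable resource usage. Experimental results show that the architecture achieves a throughput of 115 Mpixels/s and adapts well to adverse conditions such as low illumination, partial occlusion, multiple connected signs, and similarly colored backgrounds.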
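The circular Hough transform at the core of this abstract can be sketched in a few lines. The code below is a plain software illustration, not the paper's memory-reduced hardware structure: every edge pixel votes for the candidate centers of all radii under test, and the most-voted (cx, cy, r) cell is taken as the detected circle. The names circle_points and hough_circles, the 72-angle sampling, and the use of a Counter as the accumulator are assumptions made for the example.

```python
import math
from collections import Counter


def circle_points(cx, cy, r, n_angles=72):
    """Hypothetical helper: integer edge pixels of a circle, for demonstration."""
    pts = set()
    for k in range(n_angles):
        a = 2.0 * math.pi * k / n_angles
        pts.add((round(cx + r * math.cos(a)), round(cy + r * math.sin(a))))
    return pts


def hough_circles(edge_points, radii, n_angles=72):
    """Accumulate votes for (cx, cy, r); each edge pixel votes for every radius."""
    votes = Counter()
    for (x, y) in edge_points:
        for r in radii:
            centers = set()  # one vote per accumulator cell per edge pixel
            for k in range(n_angles):
                a = 2.0 * math.pi * k / n_angles
                centers.add((round(x - r * math.cos(a)), round(y - r * math.sin(a))))
            for (cx, cy) in centers:
                votes[(cx, cy, r)] += 1
    return votes
```

For instance, feeding circle_points(20, 20, 8) to hough_circles with candidate radii [6, 8, 10] concentrates the votes in the cell (20, 20, 8). In software the radii are visited in a loop; the abstract's point is that hardware can evaluate them in parallel.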

19.

Adaptable, Fast, Area-Efficient Architecture for Logarithm Approximation with Arbitrary Accuracy on FPGA

Dimitris Bariamis Dimitris Maroulis Dimitris K. Iakovidis 《Journal of Signal Processing Systems》2010,58(3):301-310

This paper presents ALA (Adaptable Logarithm Approximation), a novel hardware architecture for the approximation of the base-2 logarithm of integers at an arbitrary accuracy, suitable for fast and area-efficient FPGA implementation. It is based on a piecewise linear approximation methodology, implemented so that an arbitrary number of linear segments approximate the logarithm function. The achieved approximation accuracy depends on the number of segments used, which also affects the size of a ROM used for storing the parameters that control the computation. The implementation of the ROM using an FPGA BlockRAM allows the parameters to be updated without reconfiguration of the FPGA core. This gives the proposed architecture the considerable advantage of data set adaptability over the other relevant architectures, as the parameters can be easily updated to minimize the approximation error for different data sets. Both real and synthetic data sets have been used for evaluation purposes. The results show that ALA adapts well to all data sets used and requires significantly fewer FPGA slices than the CORDIC architecture to achieve the same or higher approximation accuracy. Moreover, it provides a throughput of one result per cycle and up to four times lower latency than the CORDIC core.
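The piecewise linear approach described above can be sketched as follows. This is an illustration under stated assumptions, not the ALA design itself: the segments are uniform over the mantissa range [1, 2), the (slope, intercept) table is filled by interpolating math.log2 at the segment endpoints rather than by the data-set-driven parameter optimization the abstract mentions, and the names build_table and approx_log2 are hypothetical.

```python
import math


def build_table(segments):
    """Build a (slope, intercept) table playing the role of the parameter ROM."""
    table = []
    for s in range(segments):
        m0 = 1.0 + s / segments
        m1 = 1.0 + (s + 1) / segments
        slope = (math.log2(m1) - math.log2(m0)) / (m1 - m0)
        intercept = math.log2(m0) - slope * m0
        table.append((slope, intercept))
    return table


def approx_log2(n, table):
    """Approximate log2 of a positive integer n with one table lookup."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    e = n.bit_length() - 1  # position of the leading one (integer part)
    m = n / (1 << e)        # normalized mantissa in [1, 2)
    s = min(int((m - 1.0) * len(table)), len(table) - 1)  # segment index
    slope, intercept = table[s]
    return e + slope * m + intercept
```

With endpoint interpolation the worst-case error shrinks roughly with the square of the number of segments, so 16 segments already keep the error below 1e-3. Replacing the table contents changes the approximation without touching approx_log2, which mirrors the adaptability argument in the abstract.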

20.

Low-Complexity High-Speed Decoder Design for Quasi-Cyclic LDPC Codes (total citations: 1; self-citations: 0; citations by others: 1)

Wang Z. Cui Z. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(1):104-114

This paper studies low-complexity high-speed decoder architectures for quasi-cyclic low density parity check (QC-LDPC) codes. Algorithmic transformation and architectural level optimization are incorporated to reduce the critical path. Enhanced partially parallel decoding architectures are proposed to linearly increase the throughput of conventional partially parallel decoders by introducing a small percentage of extra hardware. Based on the proposed architectures, a (8176, 7154) Euclidean geometry-based QC-LDPC code decoder is implemented on a Xilinx Virtex-II 6000 field programmable gate array (FPGA), where an efficient nonuniform quantization scheme is employed to reduce the size of the memories storing soft messages. FPGA implementation results show that the proposed decoder can achieve a maximum (source data) decoding throughput of 172 Mb/s at 15 iterations.