期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Efficient hardware implementation of a highly-parallel 3GPP LTE/LTE-advance turbo decoder

Yang Sun Joseph R. CavallaroAuthor vitae 《Integration, the VLSI Journal》2011,44(4):305-315

We present an efficient VLSI architecture for 3GPP LTE/LTE-Advance Turbo decoder by utilizing the algebraic-geometric properties of the quadratic permutation polynomial (QPP) interleaver. The high-throughput 3GPP LTE/LTE-Advance Turbo codes require a highly-parallel decoder architecture. Turbo interleaver is known to be the main obstacle to the decoder parallelism due to the collisions it introduces in accesses to memory. The QPP interleaver solves the memory contention issues when several MAP decoders are used in parallel to improve Turbo decoding throughput. In this paper, we propose a low-complexity QPP interleaving address generator and a multi-bank memory architecture to enable parallel Turbo decoding. Design trade-offs in terms of area and throughput efficiency are explored to find the optimal architecture. The proposed parallel Turbo decoder has been synthesized, placed and routed in a 65-nm CMOS technology with a core area of 8.3 mm² and a maximum clock frequency of 400 MHz. This parallel decoder, comprising 64 MAP decoder cores, can achieve a maximum decoding throughput of 1.28 Gbps at 6 iterations 相似文献

2.

A 285-MHz pipelined MAP decoder in 0.18-/spl mu/m CMOS

Seok-Jun Lee Shanbhag N.R. Singer A.C. 《Solid-State Circuits, IEEE Journal of》2005,40(8):1718-1725

Presented in this paper is a pipelined 285-MHz maximum a posteriori probability (MAP) decoder IC. The 8.7-mm/sup 2/ IC is implemented in a 1.8-V 0.18-/spl mu/m CMOS technology and consumes 330 mW at maximum frequency. The MAP decoder chip features a block-interleaved pipelined architecture, which enables the pipelining of the add-compare-select kernels. Measured results indicate that a turbo decoder based on the presented MAP decoder core can achieve: 1) a decoding throughput of 27.6 Mb/s with an energy-efficiency of 2.36 nJ/b/iter; 2) the highest clock frequency compared to existing 0.18-/spl mu/m designs with the smallest area; and 3) comparable throughput with an area reduction of 3-4.3/spl times/ with reference to a look-ahead based high-speed design (Radix-4 design), and a parallel architecture. 相似文献

3.

Area-efficient high-throughput MAP decoder architectures

Seok-Jun Lee Shanbhag N.R. Singer A.C. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(8):921-933

Iterative decoders such as turbo decoders have become integral components of modern broadband communication systems because of their ability to provide substantial coding gains. A key computational kernel in iterative decoders is the maximum a posteriori probability (MAP) decoder. The MAP decoder is recursive and complex, which makes high-speed implementations extremely difficult to realize. In this paper, we present block-interleaved pipelining (BIP) as a new high-throughput technique for MAP decoders. An area-efficient symbol-based BIP MAP decoder architecture is proposed by combining BIP with the well-known look-ahead computation. These architectures are compared with conventional parallel architectures in terms of speed-up, memory and logic complexity, and area. Compared to the parallel architecture, the BIP architecture provides the same speed-up with a reduction in logic complexity by a factor of M, where M is the level of parallelism. The symbol-based architecture provides a speed-up in the range from 1 to 2 with a logic complexity that grows exponentially with M and a state metric storage requirement that is reduced by a factor of M as compared to a parallel architecture. The symbol-based BIP architecture provides speed-up in the range M to 2M with an exponentially higher logic complexity and a reduced memory complexity compared to a parallel architecture. These high-throughput architectures are synthesized in a 2.5-V 0.25-/spl mu/m CMOS standard cell library and post-layout simulations are conducted. For turbo decoder applications, we find that the BIP architecture provides a throughput gain of 1.96 at the cost of 63% area overhead. For turbo equalizer applications, the symbol-based BIP architecture enables us to achieve a throughput gain of 1.79 with an area savings of 25%. 相似文献

4.

A Flexible UMTS-WiMax Turbo Decoder Architecture

Martina M. Nicola M. Masera G. 《Circuits and Systems II: Express Briefs, IEEE Transactions on》2008,55(4):369-373

This work proposes a VLSI decoding architecture for concatenated convolutional codes. The novelty of this architecture is twofold: 1) the possibility to switch on-the-fly from the universal mobile telecommunication system turbo decoder to the WiMax duo-binary turbo decoder with a limited resources overhead compared to a single-mode WiMax architecture; and 2) the design of a parallel, collision free WiMax decoder architecture. Compared to two single-mode solutions, the proposed architecture achieves a complexity reduction of 17.1% and 27.3% in terms of logic and memory, respectively. The proposed, flexible architecture has been characterized in terms of performance and complexity on a 0.13-mum standard cell technology, and sustains a maximum throughput of more than 70 Mb/s. 相似文献

5.

High-speed VLSI architecture for parallel Reed-Solomon decoder 总被引：3，自引：0，他引：3

Hanho Lee 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(2):288-294

This paper presents high-speed parallel Reed-Solomon (RS) (255,239) decoder architecture using modified Euclidean algorithm for the high-speed multigigabit-per-second fiber optic systems. Pipelining and parallelizing allow inputs to be received at very high fiber-optic rates and outputs to be delivered at correspondingly high rates with minimum delay. A parallel processing architecture results in speed-ups of as much as or more than 10 Gb, since the maximum achievable clock frequency is generally bounded by the critical path of the modified Euclidean algorithm block. The parallel RS decoders have been designed and implemented with the 0.13-/spl mu/m CMOS standard cell technology in a supply voltage of 1.1 V. It is suggested that a parallel RS decoder, which can keep up with optical transmission rates, i.e., 10 Gb/s and beyond, could be implemented. The proposed channel = 4 parallel RS decoder operates at a clock frequency of 770 MHz and has a data processing rate of 26.6 Gb/s. 相似文献

6.

A 1-Gb/s, four-state, sliding block Viterbi decoder 总被引：1，自引：0，他引：1

Black P.J. Meng T.H.-Y. 《Solid-State Circuits, IEEE Journal of》1997,32(6):797-805

To achieve unlimited concurrency and hence throughput in an area-efficient manner, a sliding block Viterbi decoder (SBVD) is implemented that combines the filtering characteristics of a sliding block decoder with the computational efficiency of the Viterbi algorithm. The SBVD approach reduces decode of a continuous input stream to decode of independent overlapping blocks, without constraining the encoding process. A systolic SBVD architecture is presented that combines forward and backward processing of the block interval. The architecture is demonstrated in a four-state, R=1/2, eight-level soft decision Viterbi decoder that has been designed and fabricated in double-metal CMOS. The 9.21 mm×8.77 mm chip containing 150 k transistors is fully functional at a clock rate of 83 MHz and dissipates 3.0 W under typical operating conditions (V_DD=5.0 V, T_A=27°C). This corresponds to a block decode rate of 83 MHz, equivalent to a decode rate of 1 Gb/s. For low-power operation, typical parts are fully functional at a clock rate of greater than 12 MHz, equivalent to a decode rate of 144 Mb/s, and dissipate 24 mW at V_DD=1.5 V, demonstrating extremely low power consumption at such high rates 相似文献

7.

Low complexity NB-LDPC decoder based on shared comparator architecture for ECN/EVN

孙书龙 Linmin 《中国邮电高校学报(英文版)》2018,25(3):65-70

Non-binary low density parity check (NB-LDPC) codes are considered as preferred candidate in conditions where short/medium codeword length codes and better performance at low signal to noise ratios (SNR) are required. They have better burst error correcting performance, especially with high order Galois fields (GF). A shared comparator(SCOMP) architecture for elementary of check node (ECN)/elementary of variable node (EVN) to reduce decoder complexity is introduced because high complexity of check node (CN) and variable node (VN) prevent NB-LDPC decoder from widely applications. The decoder over GF(16) is based on the extended min-sum (EMS) algorithm. The decoder matrix is an irregular structure as it can provide better performance than regular ones. In order to provide higher throughput and increase the parallel processing efficiency,the clock which is 8 times of the system frequency is adopted in this paper to drive the CN/VN modules. The decoder complexity can be reduced by 28% from traditional decoder when shared comparator architecture is introduced. The result of synthesis software shows that the throughput can achieve 34 Mbit/s at 10 iterations. The proposed architecture can be conveniently extended to GF such as GF(64) or GF(256). Compared with previous works, the decoder proposed in this paper has better hardware efficiency for practical applications. 相似文献

8.

A Memory Efficient Partially Parallel Decoder Architecture for Quasi-Cyclic LDPC Codes

Wang Z. Cui Z. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(4):483-488

This paper presents a memory efficient partially parallel decoder architecture suited for high rate quasi-cyclic low-density parity-check (QC-LDPC) codes using (modified) min-sum algorithm for decoding. In general, over 30% of memory can be saved over conventional partially parallel decoder architectures. Efficient techniques have been developed to reduce the computation delay of the node processing units and to minimize hardware overhead for parallel processing. The proposed decoder architecture can linearly increase the decoding throughput with a small percentage of extra hardware. Consequently, it facilitates the applications of LDPC codes in area/power sensitive high-speed communication systems 相似文献

9.

一种合并状态度量计算的高效并行Turbo码译码器结构设计及FPGA实现

张茜詹明章坚武王富龙冯云开唐浩《电信科学》2022,38(2):47-58

为满足无线通信中高吞吐、低功耗的要求,并行译码器的结构设计得到了广泛的关注。基于并行Turbo码译码算法,研究了前后向度量计算中的对称性,提出了一种基于前后向合并计算的高效并行Turbo码译码器结构设计方案,并进行现场可编程门阵列(field-programmable gate array,FPGA)实现。结果表明,与已有的并行Turbo码译码器结构相比,本文提出的设计结构使状态度量计算模块的逻辑资源降低50%左右,动态功耗在125 MHz频率下降低5.26%,同时译码性能与并行算法的译码性能接近。相似文献

10.

Loosely coupled memory-based decoding architecture for low density parity check codes

Se-Hyeon Kang In-Cheol Park 《IEEE transactions on circuits and systems. I, Regular papers》2006,53(5):1045-1056

Parallel decoding is required for low density parity check (LDPC) codes to achieve high decoding throughput, but it suffers from a large set of registers and complex interconnections due to randomly located 1's in the sparse parity check matrix. This paper proposes a new LDPC decoding architecture to reduce registers and alleviate complex interconnections. To reduce the number of messages to be exchanged among processing units (PUs), two data flows that can be loosely coupled are developed by allowing duplicated operations. In addition, intermediate values are grouped and stored into local storages each of which is accessed by only one PU. In order to save area, local storages are implemented using memories instead of registers. A partially parallel architecture is proposed to promote the memory usage and an efficient algorithm that schedules the processing order of the partially parallel architecture is also proposed to reduce the overall processing time by overlapping operations. To verify the proposed architecture, a 1024 bit rate-1/2 LDPC decoder is implemented using a 0.18-/spl mu/m CMOS process. The decoder runs correctly at the frequency of 200 MHz, which enables almost 1 Gbps decoding throughput. Since the proposed decoder occupies an area of 10.08 mm/sup 2/, it is less than one fifth of area compared to the previous architecture. 相似文献

11.

Layered Approx-Regular LDPC: Code Construction and Encoder/Decoder Design

Haibin Zhang Jia Zhu Huifeng Shi Dawei Wang 《IEEE transactions on circuits and systems. I, Regular papers》2008,55(2):572-585

Layered approximately regular (LAR) low-density parity-check (LDPC) codes are proposed, with which one single pair of encoder and decoder support various code lengths and code rates. The parity check matrices of LAR-LDPC codes have a "layer-block-cell" structure with some additional constraints. An encoder architecture is then designed for LAR-LDPC codes, by making two improvements to the Richardson-Urbanke approach: the forward substitution operation is entirely removed and the dense-matrix-vector multiplication is handled using feedback shift-registers. A partially parallel decoder architecture is also designed for LAR-LDPC codes, where a layered modified min-sum decoding algorithm is used to trade off among complexity, speed, and performance. More importantly, the interconnection network, which is inevitable for partially parallel decoders, has much lower hardware complexity compared with that for general LDPC codes. Both the encoder and decoder architectures are highly flexible in code length and code rate. 相似文献

12.

Highly-Parallel Decoding Architectures for Convolutional Turbo Codes

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(10):1147-1151

Highly parallel decoders for convolutional turbo codes have been studied by proposing two parallel decoding architectures and a design approach of parallel interleavers. To solve the memory conflict problem of extrinsic information in a parallel decoder, a block-like approach in which data is written row-by-row and read diagonal-wise is proposed for designing collision-free parallel interleavers. Furthermore, a warm-up-free parallel sliding window architecture is proposed for long turbo codes to maximize the decoding speeds of parallel decoders. The proposed architecture increases decoding speed by 6%-34% at a cost of a storage increase of 1% for an eight-parallel decoder. For short turbo codes (e.g., length of 512 bits), a warm-up-free parallel window architecture is proposed to double the speed at the cost of a hardware increase of 12% 相似文献

13.

An artificial neural net Viterbi decoder

Xiao-An Wang Wicker S.B. 《Communications, IEEE Transactions on》1996,44(2):165-171

The Viterbi algorithm is a maximum likelihood means for decoding convolutional codes and has thus played an important role in applications ranging from satellite communications to cellular telephony. In the past, Viterbi decoders have usually been implemented using digital circuits. The speed of these digital decoders is directly related to the amount of parallelism in the design. As the constraint length of the code increases, parallelism becomes problematic due to the complexity of the decoder. In this paper an artificial neural network (ANN) Viterbi decoder is presented. The ANN decoder is significantly faster than comparable digital-only designs due to its fully parallel architecture. The fully parallel structure is obtained by implementing most of the Viterbi algorithm using analog neurons as opposed to digital circuits. Several modifications to the ANN decoder are considered, including an analog/digital hybrid design that results in an extremely fast and efficient decoder. The ANN decoder requires one-sixth the number of transistors required by the digital decoder. The connection weights of the ANN decoder are either +1 or -1, so weight considerations in the implementation are eliminated. This, together with the design's modularity and local connectivity, makes the ANN Viterbi decoder a natural fit for VLSI implementation. Simulation results are provided to show that the performance of the ANN decoder matches that of an ideal Viterbi decoder 相似文献

14.

FPGA implementation of low complexity LDPC iterative decoder

Shivani Verma Sanjay Sharma 《International Journal of Electronics》2016,103(7):1112-1126

Low-density parity-check (LDPC) codes, proposed by Gallager, emerged as a class of codes which can yield very good performance on the additive white Gaussian noise channel as well as on the binary symmetric channel. LDPC codes have gained lots of importance due to their capacity achieving property and excellent performance in the noisy channel. Belief propagation (BP) algorithm and its approximations, most notably min-sum, are popular iterative decoding algorithms used for LDPC and turbo codes. The trade-off between the hardware complexity and the decoding throughput is a critical factor in the implementation of the practical decoder. This article presents introduction to LDPC codes and its various decoding algorithms followed by realisation of LDPC decoder by using simplified message passing algorithm and partially parallel decoder architecture. Simplified message passing algorithm has been proposed for trade-off between low decoding complexity and decoder performance. It greatly reduces the routing and check node complexity of the decoder. Partially parallel decoder architecture possesses high speed and reduced complexity. The improved design of the decoder possesses a maximum symbol throughput of 92.95 Mbps and a maximum of 18 decoding iterations. The article presents implementation of 9216 bits, rate-1/2, (3, 6) LDPC decoder on Xilinx XC3D3400A device from Spartan-3A DSP family. 相似文献

15.

Turbo Decoder Using Contention-Free Interleaver and Parallel Architecture 总被引：1，自引：0，他引：1

Wong C.-C. Lai M.-W. Lin C.-C. Chang H.-C. Lee C.-Y. 《Solid-State Circuits, IEEE Journal of》2010,45(2):422-432

This paper introduces a turbo decoder that utilizes multiple soft-in/soft-out (SISO) decoders to decode one codeword. In addition, each SISO decoder is modified to allow simultaneous execution over multiple successive trellis stages. The design issues related to the architecture with parallel high-radix SISO decoders are discussed. First, a contention-free interleaver for the hybrid parallelism is presented to overcome the complicated collision problem as well as reduce interconnection network complexity. Second, two techniques for the high-speed add-compare-select (ACS) circuits are given to lessen area overhead of the SISO decoder. Third, a modification of the processing schedule is made for higher operating efficiency. Two designs with parallel architecture have been implemented. The first design with 32 SISO decoders, each of which processes 2 symbols per cycle, has 160 Mb/s and 0.22 nJ/b/iter after measurement. The second design uses 16 SISO decoders to deal with 4 symbols per cycle and achieves 100% efficiency, leading to 1000 Mb/s and 0.15 nJ/b/iter in post-layout simulation. 相似文献

16.

Block-LDPC: a practical LDPC coding system design approach

Hao Zhong Tong Zhang 《IEEE transactions on circuits and systems. I, Regular papers》2005,52(4):766-775

This paper presents a joint low-density parity-check (LDPC) code-encoder-decoder design approach, called Block-LDPC, for practical LDPC coding system implementations. The key idea is to construct LDPC codes subject to certain hardware-oriented constraints that ensure the effective encoder and decoder hardware implementations. We develop a set of hardware-oriented constraints, subject to which a semi-random approach is used to construct Block-LDPC codes with good error-correcting performance. Correspondingly, we develop an efficient encoding strategy and a pipelined partially parallel Block-LDPC encoder architecture, and a partially parallel Block-LDPC decoder architecture. We present the estimation of Block-LDPC coding system implementation key metrics including the throughput and hardware complexity for both encoder and decoder. The good error-correcting performance of Block-LDPC codes has been demonstrated through computer simulations. With the effective encoder/decoder design and good error-correcting performance, Block-LDPC provides a promising vehicle for real-life LDPC coding system implementations. 相似文献

17.

High-throughput layered decoder implementation for quasi-cyclic LDPC codes 总被引：2，自引：0，他引：2

Zhang K. Huang X. Wang Z. 《Selected Areas in Communications, IEEE Journal on》2009,27(6):985-994

This paper presents a high-throughput decoder design for the Quasi-Cyclic (QC) Low-Density Parity-Check (LDPC) codes. Two new techniques are proposed, including parallel layered decoding architecture (PLDA) and critical path splitting. PLDA enables parallel processing for all layers by establishing dedicated message passing paths among them. The decoder avoids crossbar-based large interconnect network. Critical path splitting technique is based on articulate adjustment of the starting point of each layer to maximize the time intervals between adjacent layers, such that the critical path delay can be split into pipeline stages. Furthermore, min-sum and loosely coupled algorithms are employed for area efficiency. As a case study, a rate-1/2 2304-bit irregular LDPC decoder is implemented using ASIC design in 90nm CMOS process. The decoder can achieve the maximum decoding throughput of 2.2Gbps at 10 iterations. The operating frequency is 950MHz after synthesis and the chip area is 2.9mm². 相似文献

18.

Parallel turbo coding interleavers: avoiding collisions in accessesto storage elements 总被引：1，自引：0，他引：1

Giulietti A. van der Perre L. Strum A. 《Electronics letters》2002,38(5):232-234

High-speed, low latency convolutional turbo codes require a parallel decoder architecture. To maximise the gain in speed, the interleaver also should have a parallel structure. Here, a class of optimum parallel interleavers regarding the access to storage elements is presented. They combine regularity (easy implementation) with no latency in data transfer between the decoder module and intrinsic/extrinsic values memories, and show excellent BER performance 相似文献

19.

Memory-Reduced Maximum A Posteriori Probability Decoding for High-Throughput Parallel Turbo Decoders

Rahul Shrestha Roy Paily 《Circuits, Systems, and Signal Processing》2016,35(8):2832-2854

Wireless communication standards make use of parallel turbo decoder for higher data rate at the cost of large hardware resources. This paper presents a memory-reduced back-trace technique, which is based on a new method of estimating backward-recursion factors, for the maximum a posteriori probability (MAP) decoding. Mathematical reformulations of branch-metric equations are performed to reduce the memory requirement of branch metrics for each trellis stage. Subsequently, an architecture of MAP decoder and its scheduling based on the proposed back trace as well as branch-metric reformulation are presented in this work. Comparative analysis of bit-error-rate (BER) performances in additive white Gaussian noise channel environment for MAP as well as parallel turbo decoders are carried out. It has shown that a MAP decoder with a code rate of 1/2 and a parallel turbo decoder with a code rate of 1/3 have achieved coding gains of 1.28 dB at a BER of 10\(^{-5}\) and of 0.4 dB at a BER of 10\(^{-4}\), respectively. In order to meet high-data-rate benchmarks of recently deployed wireless communication standards, very large scale integration implementations of parallel turbo decoder with 8–64 MAP decoders have been reported. Thereby, savings of hardware resources by such parallel turbo decoders based on the suggested memory-reduced techniques are accounted in terms of complementary metal oxide semiconductor transistor count. It has shown that the parallel turbo decoder with 32 and 64 MAP decoders has shown hardware savings of 34 and 44 % respectively. 相似文献

20.

A 1-Gb/s joint equalizer and trellis decoder for 1000BASE-T Gigabit Ethernet

Haratsch E.F. Azadet K. 《Solid-State Circuits, IEEE Journal of》2001,36(3):374-384

1000BASE-T Gigabit Ethernet employs eight-state 4-dimensional trellis-coded modulation to achieve robust 1-Gb/s transmission over four pairs of Category-5 copper cabling. This paper compares several postcursor equalization and trellis decoding algorithms with respect to performance, hardware complexity, and critical path. It is shown that parallel decision-feedback decoders (PDFD) offer the best tradeoff. The example of a 14-tap PDFD, however, shows that it is challenging to meet the required throughput of 1 Gb/s using current standard-cell CMOS technology. A modified approach is proposed which uses decision-feedback prefilters followed by a one-tap PDFD. This considerably reduces hardware complexity and improves the throughput while still meeting the bit-error-rate requirement. The critical path is further reduced by employing a look-ahead technique. The proposed joint equalizer and trellis decoder architecture has been implemented in 3.3-V 0.25-/spl mu/m standard-cell CMOS process. It achieves a throughput of 1 Gb/s with a 125 MHz clock. Compared to a 14-tap PDFD, the design improves both gate count and throughput by a factor of two, while suffering only from a 1.3-dB performance degradation. 相似文献