Similar literature
 20 similar documents found (search time: 31 ms)
1.
A new cell architecture for high-performance digit-serial computation is presented. The design of this cell is based on feeding forward the carry digit, which allows a high level of pipelining to increase the throughput rate with minimum latency. This gives designers greater flexibility in finding the best trade-off between hardware cost and throughput rate. A twin-pipe architecture that doubles the throughput rate of digit-serial/parallel multipliers is also presented, along with the effects of the number of pipelining levels and of the twin architecture on throughput rate and hardware cost. Finally, a two's complement digit-serial/parallel multiplier that can operate on both negative and positive numbers is presented.

2.
Multiplication in the finite field GF(2^{m}) has particular computational advantages in data encryption systems. This paper presents a new algorithm for performing fast multiplication in GF(2^{m}), which is O(m) in computation time and implementation area. The bit-slice architecture of a serial-in-serial-out modulo multiplier is described and the circuit details given. The design is highly regular, modular, and well-suited for VLSI implementation. The resulting multiplier will have application in algorithms that are based on arithmetic in large finite fields of characteristic 2 and that require high throughput.
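
To make the O(m) claim concrete, the following is a minimal sketch of generic bit-serial (shift-and-add) multiplication in GF(2^{m}). It is not the algorithm of the paper above, whose details the abstract does not give. The function name `gf2m_multiply` and the integer encoding of field elements are assumptions made for illustration.

```python
def gf2m_multiply(a: int, b: int, modulus: int, m: int) -> int:
    """Multiply a and b in GF(2^m) with a generic MSB-first shift-and-add loop.

    Assumed encoding (not from the source): bit i of an integer is the
    coefficient of x^i. `modulus` is the irreducible polynomial including
    its x^m term, and both operands must be below 2^m.
    """
    result = 0
    for i in reversed(range(m)):      # one multiplier bit per step: m steps
        result <<= 1                  # multiply the running result by x
        if (result >> m) & 1:         # degree reached m: reduce once
            result ^= modulus
        if (b >> i) & 1:              # add (XOR) a when the current bit is 1
            result ^= a
    return result


# Worked example in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1.
AES_POLY = 0x11B
product = gf2m_multiply(0x57, 0x83, AES_POLY, 8)
```

The loop runs exactly m times, which mirrors a serial multiplier that consumes one bit per clock cycle. With the AES polynomial, `product` is `0xC1`, the worked example given in FIPS 197.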

3.
A novel unidirectional pipelined serial/parallel multiplier (PSPM) is presented. This design halves the initial delay and reduces the number of latches by 10% relative to the conventional structure. An area-time criterion is used to compare the new architecture with the old PSPM.

4.
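Design and Implementation of a Novel 2-D DCT/IDCT Architecture   Cited by: 2 (self-citations: 0, citations by others: 2)
傅宇卓, 王嘉芳, 胡铭曾. 《电子学报》 (Acta Electronica Sinica), 2002, 30(Z1): 2126-2129
Based on the characteristics of MPEG-2 video coding, this paper designs a 2-D DCT/IDCT architecture that is built from a single 1-D DCT core. The conversion matrix of the architecture is implemented in SRAM with dual-port input and output, which gives a high data throughput and effectively saves chip area. The 1-D DCT core consists of seven multipliers, which can be designed flexibly according to the required computation speed. A data arrangement scheme is proposed to provide conflict-free dual-port memory access. Because one operand of each multiplier is a constant, a constant modification scheme is designed that effectively reduces the hardware cost of the multipliers. The 2-D DCT/IDCT architecture has been verified on an FPGA and has considerable practical engineering value.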

5.
A high-responsivity 9-V/lux-s, high-speed 5000-frames/s (at full 512×512 resolution) CMOS active pixel sensor (APS) is presented in this paper. The sensor was designed for a 0.35-μm 2P3M CMOS sensor process and utilizes a five-transistor pixel to provide a true parallel shutter. A column-parallel analog-to-digital converter (ADC) architecture yields fast readout from pixels and digitization of the data simultaneously with acquiring a new frame. The chip has a two-row SRAM to store data from the ADC and read previous rows of data out of the chip. There are a total of 16 parallel ports operating at up to 90 MHz, delivering ~1.3 Gpixel/s or 13 Gb/s of data at the maximum rate. In conclusion, a comparison is conducted between two high-speed digital CMOS sensor architectures: a column-parallel APS and a digital pixel sensor (DPS).
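
The data-rate figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above. The 10-bit sample width is inferred from the ratio of the two quoted rates, not stated in the abstract, and the even split of pixels across ports is an assumption. The helper name `pixel_rate` is hypothetical.

```python
def pixel_rate(width: int, height: int, frames_per_second: int) -> int:
    """Pixels delivered per second at full resolution."""
    return width * height * frames_per_second


rate = pixel_rate(512, 512, 5000)      # 1,310,720,000 pixel/s, i.e. ~1.3 Gpixel/s

# Inferred, not stated in the abstract: bits per pixel implied by 13 Gb/s.
bits_per_pixel = 13e9 / rate           # roughly 9.9, consistent with 10-bit samples

# Assumption: pixels are spread evenly over the 16 output ports.
per_port_rate = rate / 16              # 81.92 Mpixel/s per port
fits_port_clock = per_port_rate <= 90e6
```

Under these assumptions each port carries about 82 Mpixel/s, which fits within the quoted 90 MHz port clock if one pixel is transferred per port per cycle.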

6.
Presented in this paper is a pipelined 285-MHz maximum a posteriori probability (MAP) decoder IC. The 8.7-mm^2 IC is implemented in a 1.8-V 0.18-μm CMOS technology and consumes 330 mW at maximum frequency. The MAP decoder chip features a block-interleaved pipelined architecture, which enables the pipelining of the add-compare-select kernels. Measured results indicate that a turbo decoder based on the presented MAP decoder core can achieve: 1) a decoding throughput of 27.6 Mb/s with an energy efficiency of 2.36 nJ/b/iter; 2) the highest clock frequency compared to existing 0.18-μm designs with the smallest area; and 3) comparable throughput with an area reduction of 3-4.3× with reference to a look-ahead based high-speed design (Radix-4 design), and a parallel architecture.

7.
Multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) technology is regarded as a promising solution for offering ultra-high data rates in wireless communications. This paper presents a field-programmable gate array (FPGA) implementation of an early-pruned K-Best detection algorithm applicable to ultra-high-throughput MIMO-OFDM communication systems. The algorithm simplifies the computation significantly compared to the conventional K-Best algorithm, with negligible bit error ratio (BER) degradation. A fully parallel structure is implemented on an FPGA platform and achieves a detection throughput of 1.9 Gb/s, about three times that of the previous implementation. Moreover, a pre-processing method is realized to reduce the number of multipliers inside the detector and shrink the critical path delay to 8.32 ns. Together with a candidate-sharing and early-pruning architecture that further saves hardware cost, a high-speed, compact MIMO signal detector is demonstrated.

8.
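The lifting scheme is a new method for constructing biorthogonal wavelet transforms. It greatly reduces computational complexity and effectively shortens run time. This paper describes the design of a high-speed, FPGA-based 9/7 lifting wavelet transform and proposes a multi-stage pipelined hardware structure for the one-dimensional discrete wavelet transform (1-D DWT). The structure raises system throughput to three times the original while increasing area by only 40%. The two-dimensional discrete wavelet transform (2-D DWT) is implemented with a line-based structure, which improves on-chip resource utilization and operating speed and meets the real-time requirements of the wavelet transform.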

9.
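Optimization of Parallel BCH Syndrome Computation Circuits   Cited by: 1 (self-citations: 0, citations by others: 1)
张亮, 王志功, 胡庆生. 《信号处理》 (Signal Processing), 2010, 26(3): 458-461
As the data rates of communication systems keep rising, the throughput required of BCH decoders rises as well. Because BCH codes process data serially, high-throughput applications generally require parallel processing, but this significantly increases circuit complexity. This paper studies the optimization of parallel syndrome computation circuits. By merging the constant multipliers at the input side, an improved parallel syndrome structure is obtained. The structure overcomes the drawback of traditional methods, which can optimize only local groups of multipliers, and instead optimizes all of the multipliers, effectively reducing logic resources. Experimental results show that for a BCH(2040,1952) decoder with a parallelism of 64, the optimized structure saves 67% of the logic resources, and good optimization results are still obtained when the parallelism, error-correcting capability, and code length vary.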

10.
In H.264/AVC, the motion estimation (ME) routine supports variable block sizes and involves highly parallel sum of absolute difference (SAD) computations. In this study, we introduce a bit-serial, hybrid-grained processing element (PE) based 2D architecture that has both early termination and intensive data reuse capabilities. The PEs operate on most-significant-bit-first arithmetic for early termination, and the 2D architecture enables on-chip data reuse between neighboring PEs in a bit-by-bit pipelined fashion. Hybrid-grained PEs reduce the hardware overhead of the conventional adder tree structures used for implementing variable block size ME. Our design reduces the gate count by 7x compared to its ASIC counterpart, operates at a comparable frequency while sustaining 30 fps and 60 fps, and outperforms bit-parallel and bit-serial architectures in terms of throughput and performance per gate for various video formats.

11.
RHiNET-2/SW is a network switch that enables a high-performance, optical-network-based parallel computing system in a distributed environment. The switch used in such a computing system must provide high-speed, low-latency packet switching with high reliability. Our switch allows high-speed 8-Gb/s/port optical data transmission over a distance of up to 100 m, and the aggregate throughput is 64 Gb/s. In RHiNET-2/SW, eight pairs of 800-Mb/s × 12-channel optical interconnection modules and a one-chip CMOS ASIC switch LSI (a 784-pin BGA package) are mounted on a single compact board. To enable high-performance parallel computing, this switch must provide high-speed, highly reliable node-to-node data transmission. To evaluate the reliability of the switch, we measured the bit error rate (BER) and the skew between the data channels. The BER of the signal transmission through one I/O port was better than 10^{-11} at a data rate of 800 Mb/s × 10 b, with a large timing-budget margin (870 ps) and a skew of less than 140 ps. This shows that RHiNET-2/SW can provide high-throughput, highly reliable optical data transmission between the nodes of a network-based parallel computing system.

12.
We present a new serial-parallel concurrent modular-multiplication algorithm and architecture suitable for standard RSA encryption. In the new scheme, multiplication is performed modulo a multiple of the RSA modulus n, where the multiple has a diminished-radix form 2^k - v, with k and v positive integers and v < n. This design is the first concurrent modular multiplier to use a diminished-radix algorithm and to pipeline concurrent modular reduction to optimize the clock rate. For a modular multiplier of order ranging from 1 to 10 (number of multiplier bits per clock cycle), a faster clock rate and throughput is possible than with other known designs, including those of Brickell, Morita, Sedlak and Golze, and Miyaguchi. Throughput estimates for 512-bit RSA decryption range from 100 kbit/s in a serial mode to 650 kbit/s with a modular multiplier of order 10, at a clock rate of 20 MHz on 1.5 μm CMOS.
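
The appeal of a diminished-radix modulus of the form 2^k - v is that 2^k is congruent to v, so the high part of a number can be folded back with a multiplication by the small constant v instead of a division. The sketch below shows only this generic folding identity. It is not the concurrent, pipelined scheme of the paper above, and the function name `reduce_diminished_radix` and the example parameters are assumptions for illustration.

```python
def reduce_diminished_radix(x: int, k: int, v: int) -> int:
    """Reduce a non-negative x modulo (2^k - v) by folding the high bits.

    Generic illustration (not the paper's architecture): since
    2^k = v (mod 2^k - v), x = hi * 2^k + lo is congruent to hi * v + lo.
    Requires 0 < v < 2^k.
    """
    modulus = (1 << k) - v
    mask = (1 << k) - 1
    while x >> k:                         # fold until x fits in k bits
        x = (x >> k) * v + (x & mask)     # each fold strictly decreases x
    while x >= modulus:                   # final correction by subtraction
        x -= modulus
    return x


# Hypothetical small parameters: modulus 2^16 - 15 = 65521.
residue = reduce_diminished_radix(123_456_789, 16, 15)
```

Each fold replaces `hi * 2^k` with `hi * v`, so the value shrinks by `hi * (2^k - v)` and the loop always terminates. The smaller v is, the fewer folds are needed.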

13.
In this paper, we propose two new VLSI architectures for computing the N-point discrete Fourier transform (DFT) and its inverse (IDFT) based on a radix-2 fast algorithm, where N is a power of two. The first part of this work presents a linear systolic array that requires log_2 N complex multipliers and provides a throughput of one transform sample per clock cycle. Compared with other related systolic designs based on direct computation or a radix-2 fast algorithm, the proposed design has the same throughput performance but involves less hardware complexity. This design is suitable for high-speed real-time applications, but it would not be easily realized in a single chip when N gets large. To balance chip area and processing speed, we further present a new reduced-complexity design for the DFT/IDFT computation. The alternative design is a memory-based architecture that consists of one complex multiplier, two complex adders, and some special memory units. It can compute one transform sample every log_2 N + 1 clock cycles on average. In comparison with the first design, the second design reaches a lower throughput with less hardware complexity. For N = 512, the chip area required for the memory-based design is about 5742 × 5222 μm^2, and the corresponding throughput can reach 4M transform samples per second in 0.6-μm CMOS technology. Such area-time performance makes this design very competitive for use in long-length DFT applications, such as asymmetric digital subscriber line (ADSL) and orthogonal frequency-division multiplexing (OFDM) systems.

14.
Two efficient approaches are proposed to improve the performance of soft-output Viterbi algorithm (SOVA)-based turbo decoders. In the first approach, an easily obtainable variable and a simple mapping function are used to compute a target scaling factor to normalize the extrinsic information output from turbo decoders. An extra coding gain of 0.5 dB can be obtained with additive white Gaussian noise channels. This approach does not introduce extra latency, and the hardware overhead is negligible. In the second approach, an adaptive upper bound based on the channel reliability is set for computing the metric difference between competing paths. By combining the two approaches, we show that the new SOVA-based turbo decoders can approach maximum a posteriori probability (MAP)-based turbo decoders within 0.1 dB when the target bit-error rate (BER) is moderately low (e.g., BER < 10^{-4} for rate-1/2 codes). Following this, practical implementation issues are discussed and finite-precision simulation results are provided. An area-efficient parallel decoding architecture is presented as an effective approach to designing high-throughput turbo/SOVA decoders. With the efficient parallel architecture, a multiple of the throughput of a conventional serial decoder can be obtained by increasing the overall hardware by a small percentage. To resolve the problem of multiple memory accesses per cycle in the efficient parallel architecture, a novel two-level hierarchical interleaver architecture is proposed. Simulation results show that the proposed interleaver architecture performs as well as random interleavers while requiring much less storage of random patterns.

15.
A new high-capacity, reservation-based switch architecture for ATM/WDM networks is presented. The scheme is contention-free and highly flexible, yielding a powerful solution for high-speed broadband packet-switched networks. Switching management and control are studied for data rates of up to 10 Gbit/s/port, providing an aggregate throughput of over 1 Tbit/s.

16.
Novel architectures for the 1-D and 2-D discrete wavelet transform (DWT) using lifting schemes are presented in this paper. An embedded decimation technique is exploited to optimize the architecture for the 1-D DWT, which is designed to receive an input and generate an output with the low- and high-frequency components of the original data available alternately. Based on this 1-D DWT architecture, an efficient line-based architecture for the 2-D DWT is further proposed by employing parallel and pipeline techniques. It is mainly composed of two horizontal filter modules and one vertical filter module, working in parallel and pipelined fashion with 100% hardware utilization. This 2-D architecture, called the fast architecture (FA), can perform J levels of decomposition for an N × N image in approximately 2N^2(1 - 4^{-J})/3 internal clock cycles. Moreover, another efficient generic line-based 2-D architecture is proposed by exploiting the parallelism among the four subband transforms in the lifting-based 2-D DWT. It can perform J levels of decomposition for an N × N image in approximately N^2(1 - 4^{-J})/3 internal clock cycles and is therefore called the high-speed architecture. The throughput rate of the latter is twice that of the former 2-D architecture, at only a small additional hardware cost. Compared with the works reported in previous literature, the proposed architectures for the 2-D DWT are efficient alternatives in the trade-off among hardware cost, throughput rate, output latency, control complexity, and so on.
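
The two cycle counts quoted in this abstract, 2N^2(1 - 4^{-J})/3 for the fast architecture and N^2(1 - 4^{-J})/3 for the high-speed architecture, can be evaluated and related to the amount of data processed. The sketch below is only a reading aid. The function names are hypothetical, and the interpretation that level j operates on an (N/2^j) × (N/2^j) block is the usual dyadic decomposition, assumed here rather than taken from the abstract.

```python
def fa_cycles(n: int, levels: int) -> float:
    """Approximate cycle count quoted for the fast architecture (FA)."""
    return 2 * n * n * (1 - 4 ** (-levels)) / 3


def hsa_cycles(n: int, levels: int) -> float:
    """Approximate cycle count quoted for the high-speed architecture."""
    return n * n * (1 - 4 ** (-levels)) / 3


def total_input_samples(n: int, levels: int) -> int:
    """Samples read over all levels, assuming level j sees an (n / 2^j)-square block."""
    return sum((n >> j) ** 2 for j in range(levels))


n, levels = 512, 3
fa = fa_cycles(n, levels)                         # 172032.0 cycles
hsa = hsa_cycles(n, levels)                       # 86016.0 cycles
samples_per_cycle_fa = total_input_samples(n, levels) / fa
samples_per_cycle_hsa = total_input_samples(n, levels) / hsa
```

Under the assumed dyadic decomposition, the total sample count is 4N^2(1 - 4^{-J})/3, so the formulas correspond to about two samples per cycle for the fast architecture and four for the high-speed architecture, in line with the stated doubling of throughput.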

17.
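A new finite-field multiplier is introduced. Its basic principle is to introduce the concept and method of polynomial splitting: a degree-m polynomial is split into two degree-m/2 polynomials, each of which undergoes finite-field multiplication separately. This lowers the order of the multiplication, with adder circuits taking its place. A circuit implementation of the new multiplier is designed according to this algorithm, and its simulation results are compared with those of a bit-serial multiplier. The results show that the new finite-field multiplier achieves a high system data throughput and can be used in error-correction systems and in RS encoders and decoders.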

18.
A 16×16-b parallel multiplier fabricated in a 0.6-μm CMOS technology is described. The chip uses a modified array scheme combined with Booth's algorithm to reduce the number of adding stages of partial products. The combination of scaled 0.6-μm CMOS technology and advanced arithmetic architecture achieves a multiplication time of 7.4 ns while dissipating only 400 mW. This multiplication time is shorter than that of other MOS high-speed multipliers previously reported and is comparable to those of advanced bipolar and GaAs multipliers.
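
As an illustration of how Booth recoding cuts the number of partial products, the sketch below uses radix-4 (modified) Booth recoding, which turns a 16-bit two's complement multiplier into 8 signed digits. The abstract does not say which Booth variant the chip uses, so the radix-4 choice is an assumption, and the function names are hypothetical.

```python
def booth_radix4_digits(y: int, bits: int = 16) -> list:
    """Recode a two's complement multiplier into bits/2 digits in {-2, ..., 2}.

    Digits are returned least significant first. Each digit is
    y[i-1] + y[i] - 2 * y[i+1], with an implicit y[-1] = 0.
    """
    assert bits % 2 == 0
    y &= (1 << bits) - 1              # two's complement bit pattern of y
    digits = []
    prev = 0                          # implicit bit to the right of the LSB
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, bits: int = 16) -> int:
    """Sum one partial product per radix-4 digit: bits/2 terms instead of bits."""
    digits = booth_radix4_digits(y, bits)
    return sum(d * x * 4 ** i for i, d in enumerate(digits))


digits = booth_radix4_digits(6789)
product = booth_multiply(-12345, 6789)
```

For a 16-bit multiplier the recoding yields 8 digits, so only 8 partial products need to be added, each a shifted copy of 0, ±x, or ±2x.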

19.
A chip set for high-speed radix-2 fast Fourier transform (FFT) applications up to 512 points is described. The chip set comprises a (16+16)×(12+12)-bit complex number multiplier and a 16-bit butterfly chip for data reordering, twiddle factor generation, and butterfly arithmetic. The chips have been implemented using a standard cell design methodology on a 2-μm bulk CMOS process. Three chips implement a complex FFT butterfly with a throughput of 10 MHz and are cascadable up to 512 points. The chips feature an offline self-testing capability.

20.
Iterative decoders such as turbo decoders have become integral components of modern broadband communication systems because of their ability to provide substantial coding gains. A key computational kernel in iterative decoders is the maximum a posteriori probability (MAP) decoder. The MAP decoder is recursive and complex, which makes high-speed implementations extremely difficult to realize. In this paper, we present block-interleaved pipelining (BIP) as a new high-throughput technique for MAP decoders. An area-efficient symbol-based BIP MAP decoder architecture is proposed by combining BIP with the well-known look-ahead computation. These architectures are compared with conventional parallel architectures in terms of speed-up, memory and logic complexity, and area. Compared to the parallel architecture, the BIP architecture provides the same speed-up with a reduction in logic complexity by a factor of M, where M is the level of parallelism. The symbol-based architecture provides a speed-up in the range from 1 to 2 with a logic complexity that grows exponentially with M and a state metric storage requirement that is reduced by a factor of M as compared to a parallel architecture. The symbol-based BIP architecture provides a speed-up in the range M to 2M with an exponentially higher logic complexity and a reduced memory complexity compared to a parallel architecture. These high-throughput architectures are synthesized in a 2.5-V 0.25-μm CMOS standard cell library, and post-layout simulations are conducted. For turbo decoder applications, we find that the BIP architecture provides a throughput gain of 1.96 at the cost of 63% area overhead. For turbo equalizer applications, the symbol-based BIP architecture enables us to achieve a throughput gain of 1.79 with an area savings of 25%.
