期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

陆晓凤刘锋佟冬王克义《电子学报》2011,39(5):1072-1076

本文针对H.264 Fidelity Range Extensions(FRExt,High Profile)解码过程中扩展的所有变换,采用二维矩阵分解和基于矩阵运算提取公共因子的操作,利用通用运算单元来设计高效的可重构VLSI结构.该结构不但节省面积(可重构变换结构只消耗了4807门电路),并且具有高性能(采用TSM... 相似文献

2.

Algorithm-based low-power and high-performance multimedia signalprocessing

Liu K.J.R. An-Yeu Wu Raghupathy A. Jie Chen 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》1998,86(6):1155-1202

Low power and high performance are the two most important criteria for many signal-processing system designs, particularly in real-time multimedia applications. There have been many approaches to achieve these two design goals at many different implementation levels ranging from very-large-scale-integration fabrication technology to system design. We review the works that have been done at various levels and focus on the algorithm-based approaches for low-power and high-performance design of signal processing systems. We present the concept of multirate computing that originates from filterbank design, then show how to employ it along with the other algorithmic methods to develop low-power and high-performance signal processing systems. The proposed multirate design methodology is systematic and applicable to many problems. We demonstrate that multirate computing is a powerful tool at the algorithmic level that enables designers to achieve either significant power reduction or high throughput depending on their choice. Design examples on basic multimedia processing blocks such as filtering, source coding, and channel coding are given. A digital signal-processing engine that is an adaptive reconfigurable architecture is also derived from the common features of our approach. Such an architecture forms a new generation of high-performance embedded signal processor based on the adaptive computing model. The goal of this paper is to demonstrate the flexibility and effectiveness of algorithm-based approaches and to show that the multirate approach is an effective and systematic design methodology to achieve low-power and high throughput signal processing at the algorithmic and architectural level 相似文献

3.

Three-Parallel Reed-Solomon Decoder Using S-DCME for High-Speed Communications

Jae Do Lee Myung Hoon Sunwoo 《Journal of Signal Processing Systems》2012,66(1):15-24

This paper proposes a high-speed and area-efficient three-parallel Reed-Solomon (RS) decoder using the simplified degree computationless modified Euclid (S-DCME) algorithm for the key equation solver (KES) block. To achieve a high throughput rate, the inner signals, such as the syndrome, error locator and error value polynomials, are computed in parallel. In addition, the key equations are solved by using the S-DCME algorithm to reduce the hardware complexity. To handle the many problems caused by applying the S-DCME algorithm to the KES block, we modify the architectures of some of the blocks in the three-parallel RS decoder. The proposed RS architecture can reduce the hardware complexity by about 80% with respect to the KES block. In addition, the proposed RS architecture has an approximately 25% shorter latency than the conventional parallel RS architectures. 相似文献

4.

Novel Area-Efficient FPGA Architectures for FIR Filtering With Symmetric Signal Extension

Benkrid Abd.S. Benkrid K. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(5):709-722

This paper presents four novel area-efficient field-programmable gate-array (FPGA) bit-parallel architectures of finite impulse response (FIR) filters that smartly support the technique of symmetric signal extension while processing finite length signals at their boundaries. The key to this is a clever use of variable-depth shift registers which are efficiently implemented in Xilinx FPGAs in the form of shift register logic (SRL) components. Comparisons with the conventional architecture of FIR filter with symmetric boundary processing show considerable area saving especially with long-tap filters. For instance, our architecture implementation of the 8-tap low Daubechies-8 FIR filter achieves ~ 30% reduction in the area requirement (in terms of slices) compared to the conventional architecture while maintaining the same throughput. Two of the above-cited novel architectures are dedicated to the special case of symmetric FIR filters. The first architecture is highly area-efficient but requires a clock frequency doubler. While this reduces the overall processing speed (to a maximum of 2), it does maintain a high throughput. Moreover, this speed penalty is cancelled in bi-phase filters which are widely used in multirate architectures (e.g., wavelets). Our second symmetric FIR filter architecture saves less logic than the first architecture (e.g., 10% with the 9-tap low Biorthogonal 9&7 symmetric filter instead of 37% with the first architecture) but overcomes its speed penalty as it matches the throughput of the conventional architecture. 相似文献

5.

Dual-Data Rate Transpose-Memory Architecture Improves the Performance,Power and Area of Signal-Processing Systems

Mohamed El-Hadedy Xinfei Guo Martin Margala Mircea R. Stan Kevin Skadron 《Journal of Signal Processing Systems》2017,88(2):167-184

This paper presents a novel type of high-speed and area-efficient register-based transpose memory architecture enabled by reporting on both edges of the clock. The proposed new architecture, by using the double-edge triggered registers, doubles the throughput and increases the maximum frequency by avoiding some of the combinational circuit used in prior work. The proposed design is evaluated with both FPGA and ASIC flow in 28/32nm technology. The experimental results show that the proposed memory achieves almost 4X improvement in throughput while consuming 46 % less area with the FPGA implementations compared to prior work. For ASIC implementations, it achieves more than 60 % area reduction and at least 2X performance improvement while burning 60 % less power compared to other register-based designs implemented with the same flow. As an example, a proposed 8X8 transpose memory with 12-bit input/output resolution is able to achieve a throughput of 107.83Gbps at 647MHz by taking only 140 slices on a Virtex-7 Xilinx FPGA platform, and achieve a throughput of 88.2Gbps at 529MHz by taking 0.024mm ² silicon area for ASIC. The proposed transpose memory is integrated in both 2D-DCT and 2D-IDCT blocks for signal processing applications on the same FPGA platform. The new architecture allows a 3.5X speed-up in performance for the 2D-DCT algorithm, compared to the previous work, while consuming 28 % less area, and 2D-IDCT achieves a 3X speed-up while consuming 20 % less area. 相似文献

6.

Design Space Exploration of 1-D FFT Processor

Shaohan Liu Dake Liu 《Journal of Signal Processing Systems》2018,90(11):1609-1621

A design space exploration methodology of 1-D FFT processor is proposed to find the best hardware architecture in a quantitative way during early design. The methodology includes architecture candidate collection, coarse-grained architecture selection, and circuit level design optimizations. We show how to select a better architecture from candidates including different architectures (SDF, SDC, MDF, MDC and memory-based) with different degree of parallelism at different radices. The sub-level designs, including designs of rotator and data scaling module, are introduced for further optimizations. As a proof of concept, an FFT processor for 4G, WLAN and future 5G is designed supporting 16-4096 and 12-2400 point FFTs. Memory-based architecture with 16-datapath mixed-radix butterfly unit is selected to satisfy the demands for 1GS/s (4096) throughput. The synthesis result based on 65nm technology shows that the silicon cost and power consumption are 1.46mm2 and 68.64mW respectively. The proposed processor has better normalized throughput per area unit and normalized FFTs per energy unit than the state of the art available designs. 相似文献

7.

Current-mode subthreshold MOS implementation of the Herault-Juttenautoadaptive network

Cohen M.H. Andreou A.G. 《Solid-State Circuits, IEEE Journal of》1992,27(5):714-727

The authors explore translinear circuits in subthreshold MOS technology and current-mode design techniques for the implementation of neuromorphic analog network processing. The architecture, also known as the Herault-Jutten network, performs an independent component analysis and is essentially a continuous-time recursive linear adaptive filter. Analog I/O interface, weight coefficients, and adaptation blocks are all integrated on the chip. A small network with six neurons and 30 synapses was fabricated in a 2-μm double-polysilicon, double-metal n-well CMOS process. Circuit designs at the transistor level yield area-efficient implementations for neurons, synapses, and the adaptation blocks. The authors discuss the design methodology and constraints as well as test results from the fabricated chips 相似文献

8.

PMCNOC: A Pipelining Multi-channel Central Caching Network-on-chip Communication Architecture Design

N. Wang A. Sanusi P. Y. Zhao M. Elgamel M. A. Bayoumi 《Journal of Signal Processing Systems》2010,60(3):315-331

With the de facto transformation of technology into nano-technology, more and more functional components can be embedded on a single silicon die, thus enabling high degree pipelining operations such as those required for multimedia applications. In recent years, system-on-chip designs have migrated from fairly simple single processor and memory designs to relatively complicated systems with multiple processors, on-chip memories, standard peripherals, and other functional blocks. The communication between these IP blocks is becoming the dominant critical system path and performance bottleneck of system-on-chip designs. Network-on-chip architectures, such as Virtual Channel (2004), Black-bus (2004), Pirate (2004), AEthereal (2005), and VICHAR (2006) architectures, emerged as promising solutions for future system-on-chip communication architecture designs. However, these existing architectures all suffer from certain problems, including high area cost and communication latency and/or low network throughput. This paper presents a novel network-on-chip architecture, Pipelining Multi-channel Central Caching, to address the shortcomings of the existing architectures. By embedding a central cache into every switch of the network, blocked head packets can be removed from the input buffers and stored in the caches temporally, thus alleviating the effect of head-of-line and deadlock problems and achieving higher network throughput and lower communication latency without paying the price of higher area cost. Experimental results showed that the proposed architecture exhibits both hardware simplicity and system performance improvement compared to the existing network-on-chip architectures. 相似文献

9.

Area-Efficient Arithmetic Expression Evaluation Using Deeply Pipelined Floating-Point Cores

Scrofano R. Ling Zhuo Prasanna V.K. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(2):167-176

Recently, it has become possible to implement floating-point cores on field-programmable gate arrays (FPGAs) to provide acceleration for the myriad applications that require high-performance floating-point arithmetic. To achieve high clock rates, floating-point cores for FPGAs must be deeply pipelined. This deep pipelining makes it difficult to reuse the same floating-point core for a series of dependent computations. However, floating-point cores use a great deal of area, so it is important to use as few of them in an architecture as possible. In this paper, we describe area-efficient architectures and algorithms for arithmetic expression evaluation. Such expression evaluation is necessary in applications from a wide variety of fields, including scientific computing and cognition. The proposed designs effectively hide the pipeline latency of the floating-point cores and use at most two floating-point cores for each type of operator in the expression. While best-suited for particular classes of expressions, the proposed designs can evaluate general expressions as well. Additionally, multiple expressions can be evaluated without reconfiguration. Experimental results show that the areas of our designs increase linearly with the number of types of operations in the expression and that our designs occupy less area and achieve higher throughput than designs generated by a commercial hardware compiler. 相似文献

10.

Relaxed Annihilation-Reordering Look-Ahead QRD-RLS Adaptive Filters

Lijun Gao Keshab K. Parhi Jun Ma 《The Journal of VLSI Signal Processing》2003,35(2):119-135

The optimum architecture design and mapping of QRD-RLS adaptive filters can be achieved through filter architecture selections, look-ahead transformations, and hierarchical pipelining/folding transformations. In this paper, a relaxed annihilation-reordering look-ahead (RARL) architecture is proposed, and shown to be more power and area efficient than pipelined processing architecture which was considered the most area efficient. The filters with this architecture are based on relaxed weight-update through filtering approximation, where a filter tap weight is updated upon arrival of every block of input data, and are speeded up with annihilation-reordering look-ahead transformation. As a result of the computational complexity reduction, this architecture does not change the iteration bound and filter clock frequency, and leads to speed up with linear increase in power consumption, while the pipelined processing architectures result in speedup with quadratic increase in power consumption. Upon hardware mapping, this architecture is also more advantageous to achieve low area designs. Two design examples are presented to illustrate mapping optimization using above transformations. These results are important for mapping designs onto ASICs, FPGAs or parallel computing machines. The results show significant improvements in throughput, power consumption and hardware requirement. It is also interesting to show through mathematics and simulations that the RARL QRD-RLS filters have no performance degradation in terms of convergence rate. 相似文献

11.

From synchronous to GALS: A new architecture for FPGAs

René Gagné Jean Belzile 《Microelectronics Journal》2009,40(11):1657-1666

The conflictual demand of faster and larger designs is increasingly difficult to answer by the advances of solid state technology alone. At some point, it is expected that designers and manufacturers will have to give up the traditional synchronous design methodology for a Globally Asynchronous Locally Synchronous (GALS) one. Such changes imply more synchronization constraints, but also more flexibility. Consequently, this paper proposes a novel Field-Programmable Gate Arrays (FPGA) architecture that is compatible with existing devices and that can also support GALS designs. The main objective is simple: the proposed architecture must appear unchanged for synchronous design, but it must also include a minimal amount of basic components to prevent metastability for efficient asynchronous communications. Thus, the paper presents the constraint equations required to implement such a circuit. It also presents a pausible clock generator application and simulation results for the proposed architecture. All results demonstrate that with a few additional customized circuits, a standard FPGA cell can become appropriate for GALS methodologies. 相似文献

12.

The Harmonized Parabolic Synthesis Methodology for Hardware Efficient Function Generation with Full Error Control

Erik Hertz Jingou Lai Bertil Svensson Peter Nilsson 《Journal of Signal Processing Systems》2018,90(12):1623-1637

The Harmonized Parabolic Synthesis methodology is a further development of the Parabolic Synthesis methodology for approximation of unary functions such as trigonometric functions, logarithms and the square root with moderate accuracy for ASIC implementation. These functions are extensively used in computer graphics, communication systems and many other application areas. For these high-speed applications, software solutions are not sufficient, and a hardware implementation is therefore needed. The Harmonized Parabolic Synthesis methodology has two outstanding advantages: it is parallel, thus reducing the execution time, and it is based on low complexity operations, thus being simple to implement in hardware. A difference compared to other approximation methodologies is that it is a multiplicative, and not additive, methodology. Compared to the Parabolic Synthesis methodologies it is possible to significantly enhance the performance in terms of reducing chip area, computation delay and power consumption. Furthermore, it increases the possibility to tailor the characteristics of the error, improving conditions for subsequent calculations. To evaluate the methodology, the fractional part of the logarithm is implemented and its performance is compared to the Parabolic Synthesis methodology. The comparison is made with 15-bit resolution. The design implemented using the proposed methodology performs 3× better than the Parabolic Synthesis implementation in terms of throughput, while consuming 90% less energy. The chip area is 70% smaller than for the Parabolic Synthesis methodology. In summary, the new technology further increases the advantages of Parabolic Synthesis. 相似文献

13.

Custom design of a VLSI PCM-FDM transmultiplexer from system specifications to circuit layout using a computer-aided design system

《Solid-State Circuits, IEEE Journal of》1986,21(1):73-85

The computer-aided design of a VLSI PCM-FDM transmultiplexer is presented. The entire design process, from system specifications to integrated circuit layout, is carried out with the aid of specialized computer programs for the analysis, synthesis, and optimization at each design level: the filter network, the architecture, and the circuit layout. These CAD tools support a top-down custom design methodology based on bit-serial architectures and standard cells. A customized architecture is constructed which is integrated using a 5-/spl mu/m CMOS cell library. The results are compared with a fully manual design and demonstrate the power of architecture based computer-aided design methodologies for VLSI filtering. By combining both synthesis and optimization aids at each design level it is possible to achieve a high degree of automation while retaining an efficient use of silicon area, high throughput, and moderate power consumption. 相似文献

14.

A 0.8-μm BiCMOS ATM switch on an 800 Mb/s asynchronous bufferedbanyan network

Sakaue K. Shobatake Y. Motoyama M. Kumaki Y. Takatsuka S. Tanaka S. Hara H. Matsuda K. Kitaoka S. Noda M. Niitsu Y. Norishima M. Momose H. Maeguchi K. Ishibe M. Shimizu S. Kodama T. 《Solid-State Circuits, IEEE Journal of》1991,26(8):1133-1144

An experimental element switch LSI for asynchronous transfer mode (ATM) switching systems was realized using 0.8-μm BiCMOS technology. The element switch transfers cells asynchronously when used in a buffered banyan network. Three key features of the element switch architecture are CASO buffers to increase the throughput, a synchronization technique called SCDB (synchronization in a clocked dual port buffer) to make possible asynchronous cell transmission on the element switches with simple hardware, and an implementation technique for virtual cut through, called CELL-BYPASS, which lowers the latency. An implementation of elastic store is proposed to achieve high-speed synchronization with simple hardware. The element switch LSI adopts an emitter-coupled-logic (ECL) interface. The maximum operation frequency of the element switch LSI is 200 MHz (typical) 相似文献

15.

Multi-symbol-sliced dynamically reconfigurable Reed-Solomon decoder design based on unified finite-field processing element 总被引：1，自引：0，他引：1

Huai-Yi Hsu Jih-Chiang Yeo An-Yeu Wu 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(5):489-500

Reed-Solomon (RS) codes play an important role in providing the error correction and the data integrity in various communication/storage applications. For high-speed applications, most RS decoders are implemented as dedicated application-specified integrated circuits (ASICs) based on parallel architectures, which can deliver high data throughput rate. For lower-speed applications, the RS decoding operations are usually performed by using fine-grained processing elements (PE) controlled by a programmable digital signal processing (DSP) core, which provides high flexibility. In this paper, we propose a novel m-PE multi-symbol-sliced (MSS) RS datapath structure. The m-PE RS architecture is a highly scalable design and can be dynamically reconfigured at 1-PE, 2-PE,...,m/2-PE, and m-PE modes to deliver necessary data throughput rate. With the help of the gated-clock scheme to turn off the idle PEs, the proposed runtime configurable ASIC design provides good tradeoff between the data throughput rate and the power consumption. Hence, it can save energy to extend the battery life of the portable devices. We demonstrate a prototyping design using 4 PEs by using UMC 0.18-/spl mu/m CMOS technology. The design can be dynamically reconfigured to be operated at 1-PE, 2-PE, and 4-PE modes, with performance of 140 Mb/s at 18.91 mW, 280 Mb/s at 28.77 mW, and 560 Mb/s at 48.47 mW, respectively. Compared with existing RS designs, the proposed m-PE RS decoder has better normalized area/power efficiency than most DSP-type and ASIC-type RS designs. The reconfigurable feature makes our design a good candidate for the error control coding (ECC) unit of the storage system in power-aware portable devices. 相似文献

16.

Low cost high throughput pipelined architecture of 2-D 8 × 8 integer transforms for H.264/AVC

Meeturani Sharma Honey Durga Tiwari 《International Journal of Electronics》2013,100(8):1033-1045

In this article, we present the implementation of high throughput two-dimensional (2-D) 8?×?8 forward and inverse integer DCT transform for H.264. Using matrix decomposition and matrix operation, such as the Kronecker product and direct sum, the forward and inverse integer transform can be represented using simple addition operations. The dual clocked pipelined structure of the proposed implementation uses non-floating point adders and does not require any transpose memory. Hardware synthesis shows that the maximum operating frequency of the proposed pipelined architecture is 1.31?GHz, which achieves 21.05 Gpixels/s throughput rate with the hardware cost of 42932 gates. High throughput and low hardware makes the proposed design useful for real time H.264/AVC high definition processing. 相似文献

17.

Automatic Design of Reconfigurable Domain-Specific Flexible Cores

Compton K. Hauck S. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(5):493-503

Reconfigurable hardware is ideal for use in systems-on-a-chip (SoC), as it provides both hardware-level performance and post-fabrication flexibility. However, any one architecture is rarely equally optimized for all applications. SoCs targeting a specific set of applications can greatly benefit from incorporating customized reconfigurable logic instead of generic field-programmable gate-array (FPGA) logic. Unfortunately, manually designing a domain-specific architecture for every SoC would require significant design time. Instead, this paper discusses our initial efforts towards creating a reconfigurable hardware generator capable of automatically creating flexible, yet domain-specific, designs. Our tests indicate that our generated architectures are more than 5times smaller than equivalent FPGA implementations and nearly as area-efficient as standard cell designs. We also use a novel technique employing synthetic circuit generation to demonstrate the flexibility of our architecture generation techniques. 相似文献

18.

High performance, high throughput turbo/SOVA decoder design

Zhongfeng Wang Parhi K.K. 《Communications, IEEE Transactions on》2003,51(4):570-579

Two efficient approaches are proposed to improve the performance of soft-output Viterbi (1998) algorithm (SOVA)-based turbo decoders. In the first approach, an easily obtainable variable and a simple mapping function are used to compute a target scaling factor to normalize the extrinsic information output from turbo decoders. An extra coding gain of 0.5 dB can be obtained with additive white Gaussian noise channels. This approach does not introduce extra latency and the hardware overhead is negligible. In the second approach, an adaptive upper bound based on the channel reliability is set for computing the metric difference between competing paths. By combining the two approaches, we show that the new SOVA-based turbo decoders can approach maximum a posteriori probability (MAP)-based turbo decoders within 0.1 dB when the target bit-error rate (BER) is moderately low (e.g., BER<10/sup -4/ for 1/2 rate codes). Following this, practical implementation issues are discussed and finite precision simulation results are provided. An area-efficient parallel decoding architecture is presented in this paper as an effective approach to design high-throughput turbo/SOVA decoders. With the efficient parallel architecture, multiple times throughput of a conventional serial decoder can be obtained by increasing the overall hardware by a small percentage. To resolve the problem of multiple memory accesses per cycle for the efficient parallel architecture, a novel two-level hierarchical interleaver architecture is proposed. Simulation results show that the proposed interleaver architecture performs as well as random interleavers, while requiring much less storage of random patterns. 相似文献

19.

A "flying-adder" architecture of frequency and phase synthesis with scalability

Liming Xiu Zhihong You 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2002,10(5):637-649

Most of today's digital designs, from small-scale digital block designs to system-on-chip (SoC) designs, are based on "synchronous" design principle. Clock is the most important issue in these designs. Frequency and phase synthesis is closely related to the clock generation. A frequency and phase synthesis technique based on phase-locked loop is proposed in that delivers high performance, easy integration, and high stability. However, there are problems associated with this architecture, such as: 1) its highest deliverable frequency is limited by the speed of the accumulator and 2) the phase synthesis circuitry will not work well in certain ranges (dead zone) and in certain conditions (dual stability). This paper presents an improved architecture that addresses these problems. The new frequency synthesis circuitry has scalability for higher output frequency. It also has an internal node whose frequency is twice that of output signal. When duty cycle is not a concern, this signal can be used directly as clock source. The new phase synthesis circuitry is free of "dead zone" and "dual stability." The improved architecture has better performance, is simpler to implement, and is easier to understand. 相似文献

20.

Design Reuse and Buffers in High-Tech Infrastructure Development: A Stakeholder Perspective

Gil N. Beckman S. 《Engineering Management, IEEE Transactions on》2007,54(3):484-497

This study investigates the implementation of design reuse and buffers in developing the infrastructure of high-tech production facilities Design reuse entails using the same systems architecture from one project to the next. Design buffers involve building slack into a proven systems architecture to absorb foreseeable change requests. Choosing the appropriate amounts of reuse and slack is dependent on the uncertainty in the manufacturing technology over the infrastructure life cycle. While proven infrastructure designs can economically accommodate incremental changes in technology, adaptation costs escalate when sufficient buffers are not built-in and changes are radical. We uncover opposing stakeholder interests in determining the extent to which reuse or buffers are used. Design reuse is attractive to the client to reduce the risk that a facility fails to perform, but limits the designer's job to tedious customization work. Design buffers are attractive to the designer to do original problem-solving and limit the risks of being unresponsive to uncertainty, but not to the client who is not guaranteed that the investments will pay off. We find that inequalities between the two stakeholders in the governing power on design decision-making compound the difficulties in assessing and implementing the reuse versus buffers tradeoff. 相似文献