首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 109 毫秒
1.
This paper presents a novel type of high-speed and area-efficient register-based transpose memory architecture enabled by reporting on both edges of the clock. The proposed new architecture, by using the double-edge triggered registers, doubles the throughput and increases the maximum frequency by avoiding some of the combinational circuit used in prior work. The proposed design is evaluated with both FPGA and ASIC flow in 28/32nm technology. The experimental results show that the proposed memory achieves almost 4X improvement in throughput while consuming 46 % less area with the FPGA implementations compared to prior work. For ASIC implementations, it achieves more than 60 % area reduction and at least 2X performance improvement while burning 60 % less power compared to other register-based designs implemented with the same flow. As an example, a proposed 8X8 transpose memory with 12-bit input/output resolution is able to achieve a throughput of 107.83Gbps at 647MHz by taking only 140 slices on a Virtex-7 Xilinx FPGA platform, and achieve a throughput of 88.2Gbps at 529MHz by taking 0.024mm 2 silicon area for ASIC. The proposed transpose memory is integrated in both 2D-DCT and 2D-IDCT blocks for signal processing applications on the same FPGA platform. The new architecture allows a 3.5X speed-up in performance for the 2D-DCT algorithm, compared to the previous work, while consuming 28 % less area, and 2D-IDCT achieves a 3X speed-up while consuming 20 % less area.  相似文献   

2.
The simultaneous application of voltage scaling, repeater insertion, and wire sizing is proposed in this paper to achieve high performance, low power, and low area on wave-pipelined interconnect circuits. Based on this methodology, design optimizations for three different types of applications are performed and different design metrics are used to obtain the optimal values of supply voltage, number of repeaters, and interconnect dimensions for these applications. The optimal supply voltage for low-power applications is shown to be twice the threshold voltage. In addition, an optimal throughput-per-energy-area (TPEA) design is compared with low-voltage differential signaling (LVDS). The optimal TPEA design is shown to reduce dynamic power by 10% and wire area by 70% compared to LVDS, without any loss of throughput performance.  相似文献   

3.
Wave-pipelining enables a digital circuit to be operated at a higher frequency. In the literature, only trial-and-error and manual procedures are adopted for the choice of the optimum value of clock frequency and clock skew between the input and output registers of wave-pipelined circuits. One of the major contributions of this paper is the proposal for automating the above procedure. A second contribution is the study of how logic depths determine the superiority of wave-pipelining over pipelining with regard to power dissipation. For the study and implementation of wave-pipelined circuits, filters using the distributed arithmetic algorithm are considered. In this paper, two automation schemes are proposed for the implementation of the wave-pipelined filters on both Xilinx and Altera field programmable gate arrays (FPGAs). In the first scheme, a self-tuning finite state machine (FSM) is used to choose the clock skew and clock period for I/O registers between the wave-pipelined blocks. In the second approach, an on-chip soft-core processor is used to choose the clock skew and clock period. To test the efficacy of the schemes proposed, filters with different taps are implemented using three schemes: synchronous pipelining, sub-optimal wave-pipelining and no pipelining (i.e. using neither synchronous pipelining nor wave-pipelining). From the implementation results, it is observed that wave-pipelined distributed arithmetic (DA) filters are faster by a factor of 1.31–1.61 compared to non-pipelined DA filters. The synchronous pipelined DA filters are in turn faster by a factor of 1.73–3.27 compared to the wave-pipelined DA filters. The increased speeds are achieved in the pipelined filters at the cost of an increase in the number of slices by 15–33% and in the number of registers by 350–530%. To compare the power dissipation, both pipelined and wave-pipelined DA filters are tested by operating them at the same frequency. For medium logic depths, the wave-pipelined DA filters dissipate less power than pipelined filters. This work is carried out with the funding received from the Department of Information Technology, Ministry of information and Telecommunication, New Delhi, India.  相似文献   

4.
The problem of minimizing dynamic power consumption by scaling down the supply voltage of computational elements off critical paths is widely addressed in the literature for the case of combinational designs. The problem is NP-hard in general. To address the problem in the case of synchronous sequential digital designs, one needs to move some registers while applying voltage scaling. Moving these registers shifts some computational elements from critical paths, and can be done by basic retiming. Integrating basic retiming and supply voltage scaling to address this NP-hard problem cannot in general be done in polynomial run time. In this paper, we propose to first apply a guided retiming and then to apply supply voltage scaling on the retimed design. We devise new polynomial time algorithms to realize this guided retiming, and the supply voltage scaling on the retimed design. Also, we show that the problem in the case of combinational designs is not NP-hard for some combinational circuits with certain structure, and give a polynomial time algorithm to optimally solve it. Methods to determine lower bounds on the optimal reduction of dynamic power consumption are also provided. Experimental results on known benchmarks have shown that the proposed approach can reduce dynamic power consumption by factors as high as 61% for single-phase designs with minimal clock period. Also, they have shown that it can solve optimally the problem, and produce converter-free designs with reduced dynamic power consumption. For large size circuits from ISCAS'89 benchmark suite, the proposed algorithms run in 15 s-1 h.  相似文献   

5.
Dynamic voltage scaling has been widely acknowledged as a powerful technique for trading off power consumption and delay for processors. Recently, variable-frequency (and variable-voltage) parallel and serial links have also been proposed, which can save link power consumption by exploiting variations in the bandwidth requirement. This provides a new dimension for power optimization in a distributed embedded system connected by a voltage-scalable interconnection network. At the same time, it imposes new challenges for variable-voltage scheduling as well as flow control. First, the variable-voltage scheduling algorithm should be able to trade off the power consumption and delay jointly for both processors and links. Second, for the variable-frequency network, the scheduling algorithm should not only consider the real-time constraints, but should also be consistent with the underlying flow control techniques. In this paper, we address joint dynamic voltage scaling for variable-voltage processors and communication links in such systems. We propose a scheduling algorithm for real-time applications that captures both data flow and control flow information. It performs efficient routing of communication events through multihops, as well as efficient slack allocation among heterogeneous processors and communication links to maximize energy savings, while meeting all real-time constraints. Our experimental study shows that on an average, joint voltage scaling on processors and links can achieve 32% less power compared with voltage scaling on processors alone  相似文献   

6.
SRAM based FPGA are subjected to ion radiation in many operating environments. Following the current trend of shrinking device feature size & increasing die area, newer FPGA are more susceptible to radiation induced errors. Single event upsets (SEU), (also known as soft-errors) account for a considerable amount of radiation induced errors. SEU are difficult to detect & correct when they affect memory-elements present in the FPGA, which are used for the implementation of finite state machines (FSM). Conventional practice to improve FPGA design reliability in the presence of soft-errors is through configuration memory scrubbing, and through component redundancy. Configuration memory scrubbing, although suitable for combinatorial logic in an FPGA design, does not work for sequential blocks such as FSM. This is because the state-bits stored in flip-flops (FF) are variable, and change their value after each state transition. Component redundancy, which is also used to mitigate soft-errors, comes at the expense of significant area overhead, and increased power consumption compared to nonredundant designs. In this paper, we propose an alternate approach to implement the FSM using synchronous embedded memory blocks to enhance the runtime reliability without significant increase in power consumption. Experiments conducted on various benchmark FSM show that this approach has higher reliability, lower area overhead, and consumes less power compared to a component redundancy technique.  相似文献   

7.
High computing capabilities and limited number of input/output pins of modern integrated circuits require an efficient and reliable interconnection architecture. The proposed communication scheme allows a large number of IP cores to send data over a single wire using logic code division multiple access (LCDMA) technique. Reliability is increased by using hardware redundancy, and three LCDMA-based fault tolerant designs are proposed: (a) duplication with logic comparison (DLC), (b) conventional triple modular redundancy (TMR), and (c) triple modular redundancy with sign voter (TSV). With aim to detect a received bit from chip sequence, LCDMA–DLC and LCDMA–TSV designs compare absolute values of the sums, while LCDMA–TMR compares only sign bits of the sums generated at the outputs of decoders. All proposed designs are implemented in FPGA and ASIC technologies. MATLAB simulation results show that increasing the length of spreading codes affects to an increase in reliability. A comparative analysis of the proposed fault tolerant designs in terms of hardware complexity, latency, power consumption and error detecting and correcting capability is conducted. It is shown that LCDMA–DLC design has lower hardware overhead and power consumption, with satisfactory better bit error rate (BER) performance, in comparison to LCDMA–TMR and LCDMA–TSV approach.  相似文献   

8.
《Microelectronics Journal》2015,46(6):551-562
Most commercial Field Programmable Gate Arrays (FPGAs) have limitations in terms of density, speed, configuration overhead and power consumption mostly due to the use of SRAM cells in Look-Up Tables (LUTs), configuration memory and programmable interconnects. Also, hardwired Application Specific Integrated Circuit (ASIC) blocks designed for high performance arithmetic circuits in FPGA reduce the area available for reconfiguration. In this paper, we propose a novel generalized hybrid CMOS-memristor based architecture using stateful-NOR gates as basic building blocks for implementation of logic functions. These logic functions are implemented on memristor nanocrossbar layers, while the CMOS layer is used for selection and connection of memristors. The proposed pipelined architecture combines the features of ASIC, FPGA and microprocessor based designs. It has high density due to the use of nanocrossbar layer and high throughput especially for arithmetic circuits. The proposed architecture for three input one output logic block is compared with conventional LUT based Configurable Logic Block (CLB) having the same number of inputs and outputs; which shows 1.82×area saving, 1.57×speedup and 3.63×less power consumption. The automation algorithm to implement any logic function using proposed architecture is also presented.  相似文献   

9.
Joint scheduling and power control schemes have previously been proposed to reduce power dissipation in wireless ad hoc networks. However, instead of power consumption, throughput is a more important performance concern for some emerging multihop wireless networks, such as wireless mesh networks. This paper examines joint link scheduling and power control with the objective of throughput improvement. The MAximum THroughput link Scheduling with Power Control (MATH-SPC) problem is first formulated and then a mixed integer linear programming (MILP) formulation is presented to provide optimal solutions. However, simply maximizing the throughput may lead to a severe bias on bandwidth allocation among links. To achieve a good tradeoff between throughput and fairness, a new parameter called the demand satisfaction factor (DSF) to characterize the fairness of bandwidth allocation and formulate the MAximum Throughput fAir link Scheduling with Power Control (MATA-SPC) problem is defined. An MILP formulation and an effective polynomial-time heuristic algorithm, namely, the serial linear programming rounding (SLPR) heuristic, to solve the MATA-SPC problem are also presented. Numerical results show that bandwidth can be fairly allocated among all links/flows by solving the MILP formulation or by using the heuristic algorithm at the cost of a minor reduction of network throughput. In addition, extensions to end-to-end throughput and fairness and multiradio wireless multihop networks are discussed.  相似文献   

10.
在保证一定吞吐量前提下,为了降低传输功率,文中提出了一种基于部分信道反馈信息的多用户OFDM系统子载波分配方案。该方案利用少量已反馈子载波的状态信息来线性推导未反馈子载波的信道增益。仿真结果表明,在平坦瑞利信道衰落中,相比传统方式该方案在不增加反馈开销的前提下该方案能够大幅度地改善系统的性能,提高资源利用率。  相似文献   

11.
In this article, a novel block-based visible image watermark VLSI architecture design and its hardware implementation in field programmable gate array (FPGA) is proposed. In this watermarking process, 1D-DCT is introduced to facilitate hardware implementation. Mathematical model is developed to reduce the computational complexity for the calculation of embedding and scaling factors, which are used to make the resultant image of best quality with uniform watermark visibility. The proposed architecture has a 12–stage pipeline. Parallelism techniques are employed in block level in order to achieve high performance. A single 8-point fast 1D-DCT is used to calculate the DCT coefficient values of the host image and the watermark image to minimize the resource utilization and power consumption. The hardware implementation of this algorithm leads to numerous advantages including reduced power, area and higher pipeline throughput. The performance of the architecture is studied by implementing Xilinx Virtex V technology based FPGA with DSP 48E. Throughput achieved based on this VLSI architecture is 5.21 Gbits/s with a total resource utilization of 4058BELs.  相似文献   

12.
Multiple-input multiple-output (MIMO) significantly increases the throughput of a communication system by employing multiple antennas at the transmitter and the receiver. To extract maximum performance from a MIMO system, a computationally intensive search based detector is needed. To meet the challenge of MIMO detection, typical suboptimal MIMO detectors are ASIC or FPGA designs. We aim to show that a MIMO detector on Graphic processor unit (GPU), a low-cost parallel programmable co-processor, can achieve high throughput and can serve as an alternative to ASIC/FPGA designs. However, careful architecture aware software design is needed to leverage the performance offered by GPU. We propose a novel soft MIMO detection algorithm, multi-pass trellis traversal (MTT), and show that we can achieve ASIC/FPGA-like performance and handle different configurations in software on GPU. The proposed design can be used to accelerate wireless physical layer simulations and to offload MIMO detection processing in wireless testbed platforms.  相似文献   

13.
Network security for mobile devices is in high demand because of the increasing virus count. Since mobile devices have limited CPU power, dedicated hardware is essential to provide sufficient virus detection performance. A TCAM-based virus-detection unit provides high throughput, but also challenges for low power and low cost. In this paper, an adaptively dividable dual-port BiTCAM (unifying binary and ternary CAMs) is proposed to achieve a high-throughput, low-power, and low-cost virus-detection processor for mobile devices. The proposed dual-port BiTCAM is realized with the dual-port AND-type match-line scheme which is composed of dual-port dynamic AND gates. The dual-port designs reduce power consumption and increase storage efficiency due to shared storage spaces. In addition, the dividable BiTCAM provides high flexibility for regularly updating the virus-database. The BiTCAM achieves a 48% power reduction and a 40% transistor count reduction compared with the design using a conventional single-port TCAM. The implemented 0.13 mum processor performs up to 3 Gbps virus detection with an energy consumption of 0.44 fJ/pattern-byte/scan at peak throughput.  相似文献   

14.
In this brief, a high-throughput and low-complexity fast Fourier transform (FFT) processor for wideband orthogonal frequency division multiplexing communication systems is presented. A new indexed-scaling method is proposed to reduce both the critical-path delay and hardware cost by employing shorter wordlength. Together with the mixed-radix multipath delay feedback structure, the proposed FFT processor can achieve very high throughput with low hardware cost. From analysis, it is shown that the proposed indexed-scaling method can save at least 11% memory utilizations compared to other state-of-the-art scaling algorithms. Also, a test chip of a 1.2 Gsample/s 2048-point FFT processor has been designed using UMC 90-nm 1P9M process with a core area of 0.97 mm2. The signal-to-quantization-noise ratio (SQNR) performance of this test chip is over 32.7 dB to support 16-QAM modulation and the power consumption is about 117 mW at 300 MHz. Compared to the fixed-point FFT processors, about 26% area and 28% power can be saved under the same throughput and SQNR specifications.  相似文献   

15.
This paper describes a new programmable routing fabric for field-programmable gate arrays (FPGAs). Our results show that an FPGA using this fabric can achieve 1.57 times lower dynamic power consumption and 1.35 times lower average net delays with only 9% reduction in logic density over a baseline island-style FPGA implemented in the same 65-nm CMOS technology. These improvements in power and delay are achieved by 1) using only short interconnect segments to reduce routed net lengths, and 2) reducing interconnect segment loading due to programming overhead relative to the baseline FPGA without compromising routability. The new routing fabric is also well-suited to monolithically stacked 3-D-IC implementation. It is shown that a 3-D-FPGA using this fabric can achieve a 3.3 times improvement in logic density, a 2.51 times improvement in delay, and a 2.93 times improvement in dynamic power consumption over the same baseline 2-D-FPGA.  相似文献   

16.
In this paper, we present a novel, high throughput field-programmable gate array (FPGA) architecture, PITIA, which combines the high-performance of application specific integrated circuits (ASICs) and the flexibility afforded by the reconfigurability of FPGAs. The new architecture, which targets datapath circuits, uses the concepts of wave steering and pipelined interconnects. We discuss the FPGA architecture and show results for performance, power consumption, clock network performance, and routability. Results for some commonly used datapath designs are encouraging with throughputs in the neighborhood of 625MHz in 0.25-/spl mu/m 2.5-V CMOS technology. Results for random benchmark circuits are also shown. We characterize designs according to their Rent's exponents and argue that designs with predominantly local interconnects are the best fit in PITIA. We also show that as technology scales down toward deep submicron, PITIA shows an increasing throughput performance.  相似文献   

17.
Chip multiprocessors with globally asynchronous locally synchronous (GALS) clocking styles are promising candidates for processing computationally-intensive and energy-constrained workloads. The GALS methodology simplifies clock tree design, provides opportunities to use clock and voltage scaling jointly in system submodules to achieve high energy efficiencies, and can also result in easily scalable clocking systems. However, its use typically also introduces performance penalties due to additional communication latency between clock domains. We show that GALS chip multiprocessors (CMPs) with large inter-processor first-inputs–first-outputs (FIFOs) buffers can inherently hide much of the GALS performance penalty while executing applications that have been mapped with few communication loops. In fact, the penalty can be driven to zero with sufficiently large FIFOs and the removal of multiple-loop communication links. We present an example mesh-connected GALS chip multiprocessor and show it has a less than 1% performance (throughput) reduction on average compared to the corresponding synchronous system for many DSP workloads. Furthermore, adaptive clock and voltage scaling for each processor provides an approximately 40% power savings without any performance reduction. These results compare favorably with the GALS uniprocessor, which compared to the corresponding synchronous uniprocessor, has a reported greater than 10% performance (throughput) reduction and an energy savings of approximately 25% using dynamic clock and voltage scaling for many general purpose applications.   相似文献   

18.
Today's data reconstruction in digital communication systems requires designs of highest throughput rate at low power. The Viterbi algorithm is a key element in such digital signal processing applications. The nonlinear and recursive nature of the Viterbi decoder makes its high-speed implementation challenging. Several promising approaches to achieve either high throughput or low power have been proposed in the past. A combination of these is developed in this paper. Additional new concepts allow building a signal-flow graph suitable for the design of high-speed Viterbi decoders with low power. Using a flexible datapath generator facilitates the essential quantitative optimization from architectural down to physical level to fully exploit the low-power and high-speed potential of a given technology. With parameterizable design entry, this datapath generator establishes the basis of a scalable platform-based design library. Altogether, this allows coverage of the range of today's industrial interest in high throughput rates, from 150 Msymbols/s up to 1.2 Gsymbols/s using conventional CMOS logic. The features of two exemplary Viterbi decoder implementations prove the benefit of this physically oriented design methodology in terms of speed and low power, when compared to other leading edge implementations  相似文献   

19.
This paper focuses on potential issues related to the random selection of a sensing channel that occurs after the prediction phase in a Cognitive Radio Network (CRN). A novel approach (Approach-1) for improved selection is proposed, which relies on the probabilities of channels by which they are predicted idle. Further, closed-form expressions are derived for the throughput of Cognitive Users (CUs) in the conventional and proposed approaches. In addition to this, a fundamental approach for computing the prediction probabilities is also proposed. Moreover, a new challenging issue named “sense and stuck” was observed in the conventional approach. The proposed approach is validated by comparing the results achieved with the results of the conventional approach. However, to achieve the prediction probabilities, the pre-channel-state-information is a prerequisite, but it may be unavailable for particular scenarios; therefore, a modified selection method is introduced to avoid the sense and stuck problem. An algorithm to evaluate the throughput using the random, improved, and modified selection methods is presented with its space and time complexities. Furthermore, for additional improvement in the throughput of the CU, a new frame structure is introduced, in which the spectrum prediction and sensing periods are exploited for simultaneous transmission of data via the underlay spectrum access technique (Approach-2). The simulated results of Approach-2 are compared with our pre-obtained results of Approach-1, which confirm significant improvement in the throughput.  相似文献   

20.
We consider circuit techniques for reducing field-programmable gate-array (FPGA) power consumption and propose a family of new FPGA routing switch designs that are programmable to operate in three different modes: high-speed, low-power, or sleep. High-speed mode provides similar power and performance to traditional FPGA routing switches. In low-power mode, speed is curtailed in order to reduce power consumption. Leakage is reduced by 28%–52% in low-power versus high-speed mode, depending on the particular switch design selected. Dynamic power is reduced by 28%–31% in low-power mode. Leakage power in sleep mode, which is suitable for unused routing switches, is 61%–79% lower than in high-speed mode. Each of the proposed switch designs has a different power/area/speed tradeoff. All of the designs require only minor changes to a traditional routing switch and involve relatively small area overhead, making them easy to incorporate into current commercial FPGAs. The applicability of the new switches is motivated through an analysis of timing slack in industrial FPGA designs. It is observed that a considerable fraction of routing switches may be slowed down (operate in low-power mode), without impacting overall design performance.   相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号