首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 33 毫秒
1.
A logarithmic processor is proposed that uses external RAM for holding the table required for logarithmic subtraction. The proposed processor requires that the RAM be initialized before any computations occur. We give an algorithm to initialize the RAM using the limited arithmetic unit of the processor. The algorithm is ten times faster than a bit by bit computation of the logarithm and antilogarithm. Bounds are developed for comparing the error of this algorithm against the error of earlier algorithms. Simulation results show that this algorithm avoids catastrophic cancellation, and is as accurate as any previously known single precision algorith.  相似文献   

2.
A 32-bit fixed-point logarithmic arithmetic unit is proposed for the possible application to mobile three-dimensional (3-D) graphics system. The proposed logarithmic arithmetic unit performs division, reciprocal, square-root, reciprocal-square-root and square operations in two clock cycles and powering operation in four clock cycles. It can program its number range for accurate computation flexibility of 3-D graphics pipeline and eight -region piecewise linear approximation model for logarithmic and antilogarithmic conversion to reduce the operation error under 0.2%. Its test chip is implemented by 1-poly 6-metal 0.18-mum CMOS technology with 9-k gates. It operates at the maximum frequency of 231 MHz and consumes 2.18 mW at 1.8-V supply  相似文献   

3.
A low-power, area-efficient four-way 32-bit multifunction arithmetic unit has been developed for programmable shaders for handheld 3D graphics systems. It adopts the logarithmic number system (LNS) at the arithmetic core for the single-cycle throughput and the small-size low-power unification of various complicated arithmetic operations such as power, logarithm, trigonometric functions, vector-SIMD multiplication, division, square root and vector dot product. 24-region and 16-region piecewise linear logarithmic and antilogarithmic converters are proposed with 0.8% and 0.02% maximum conversion error, respectively. All the supported operations are implemented with less than 6.3% operation error and unified into a single arithmetic platform with maximum four-cycle latency and single-cycle throughput. A 93 K gate test chip is fabricated using one-poly five-metal 0.18-mum CMOS technology. It operates at 210 MHz with maximum power consumption of 15.3 mW at 1.8 V.  相似文献   

4.
A convolutional neural network (CNN) architecture supporting on-device user customization is proposed. The network architecture consists of a large CNN trained on a general data and a smaller augmenting network that can be re-trained on-device using a small user-specific data provided by the user. The proposed approach is applied to handwritten character recognition of the Latin and the Korean alphabet, Hangul. Experiments show a 3.5-fold reduction of the prediction error after user customization for both the Latin and the Korean character set compared to the CNN trained with general data. To minimize the energy required when retraining on-device, the use of a coarse-grained reconfigurable array processor (CGRA) in a low-power, efficient manner is presented. The CGRA achieves a speedup of 36× and a 54-fold reduced energy consumption compared to an ARMv8 processor. Compared to a 3-way VLIW processor, a speedup of 42× and a 12-fold energy reduction is observed, demonstrating the potential of general-purpose CGRAs as light-weight DNN accelerators.  相似文献   

5.
To design a power-efficient digital signal processor, this study develops a fundamental arithmetic unit of a low-power adder that operates on effective dynamic data ranges. Before performing an addition operation, the effective dynamic ranges of two input data are determined. Based on a larger effective dynamic range, only selected functional blocks of the adder are activated to generate the desired result while the input bits of the unused functional blocks remain in their previous states. The added result is then recovered to match the required word length. Using this approach to reduce switching operations of noneffective bits allows input data in 2's complement and sign magnitude representations to have similar switching activities. This investigation thus proposes a 2's complement adder with two master-stage and slave-stage flip-flops, a dynamic-range determination unit and a sign-extension unit, owing to the easy implementation of addition and subtraction in such a system. Furthermore, this adder has a minimum number of transistors addressed by carry-in bits and thus is designed to reduce the power consumption of its unused functional blocks. The dynamic range and sign-extension units are explored in detail to minimize their circuit area and power consumption. Experimental results demonstrate that the proposed 32-bit adder can reduce power dissipation of conventional low-power adders for practical multimedia applications. Besides the ripple adder, the proposed approach can be utilized in other adder cells, such as carry lookahead and carry-select adders, to compromise complexity, speed and power consumption for application-specific integrated circuits and digital signal processors.  相似文献   

6.
The authors discuss employing alternative number systems to reduce power dissipation in portable devices and high-performance systems. They focus on two alternative number systems that are quite different from the conventional linear number representations, namely the logarithmic number system (LNS) and the residue number system (RNS). Both have recently attracted the interest of researchers for their low-power properties. The authors address aspects of the conventional arithmetic representations, the impact of logarithmic arithmetic on power dissipation, and discuss the low-power aspects of residue arithmetic  相似文献   

7.
This paper describes the main features and functions of the Pentium(R) 4 processor microarchitecture. We present the front-end of the machine, including its new form of instruction cache called the trace cache, and describe the out-of-order execution engine, including a low latency double-pumped arithmetic logic unit (ALU) that runs at 4 GHz. We also discuss the memory subsystem, including the low-latency Level 1 data cache that is accessed in two clock cycles. We then describe some of the key features that contribute to the Pentium(R) 4 processor's floating-point and multimedia performance. We provide some key performance numbers for this processor, comparing it to the Pentium(R) III processor  相似文献   

8.
A highly functional circuit for pulse width modulation (PWM) signal processing is proposed as a core of the A-D merged circuit architecture for time-domain information processing. The core circuit employs a switched-current integration technique as its computing architecture and functions as a linear arithmetic operator, a memory, and also a delaying device of PWM signals. A 0.8-μm CMOS test chip includes 110 transistors plus two capacitors and performs parallel additions and multiplications at the accuracy of 1.2 ns. A cumulative property of the technique allows the circuit to serve as a low-power accumulator that consumes 23% of the energy of the full digital 7-b accumulator. A PWM multiply-accumulate unit and a nonlinear operation unit are also proposed to extend functionality of the circuit. Since the PWM signal carries multibit data in a binary amplitude pulse, these circuits can be favorably applicable to low-voltage and low-power designs in the deep submicrometer era  相似文献   

9.
岳梦云  白冰 《电子学报》2000,48(10):2041-2046
本文设计了一种适用于电机矢量控制算法的数字信号处理系统的微架构定义,包括其指令集定义、存储器模型以及与主CPU的交互模式.该设计具有通过固定部分多操作数有效缩减指令编码长度提高代码密度以及后台执行多周期指令提高ALU并行效率的显著优点.文中给出了典型的FOC控制算法在DSP (Digital Signal Processor)指令集上实现的指令周期数,也给出了对应架构的电路实现情况,最终以ARM CORTEX-M0及几款主流DSP作为比较基线,通过实测实验数据证明了体系结构的高能效比,以较为有限的电路面积代价,极大提高了集成DSP的嵌入式系统的运行效率.  相似文献   

10.
The advent of development of high-performance, low-power digital circuits is achieved by a suitable emerging nanodevice called quantum-dot cellular automata(QCA). Even though many efficient arithmetic circuits were designed using QCA, there is still a challenge to implement high-speed circuits in an optimized manner. Among these circuits, one of the essential structures is a parallel multi-digit decimal adder unit with significant speed which is very attractive for future environments. To achieve high speed, a new correction logic formulation method is proposed for single and multi-digit BCD adder. The proposed enhanced single-digit BCD adder(ESDBA) is 26% faster than the carry flow adder(CFA)-based BCD adder. The multi-digit operations are also performed using the proposed ESDBA, which is cascaded innovatively. The enhanced multi-digit BCD adder(EMDBA) performs two 4-digit and two 8-digit BCD addition 50% faster than the CFA-based BCD adder with the nominal overhead of the area. The EMDBA performs two 4-digit BCD addition 24% faster with 23% decrease in the area, similarly for 8-digit operation the EMDBA achieves 36% increase in speed with 21% less area compared to the existing carry look ahead(CLA)-based BCD adder design. The proposed multi-digit adder produces significantly less delay of(N-1)+3.5 clock cycles compared to the N*One digit BCD adder delay required by the conventional BCD adder method. It is observed that as per our knowledge this is the first innovative proposal for multi-digit BCD addition using QCA.  相似文献   

11.
Network security for mobile devices is in high demand because of the increasing virus count. Since mobile devices have limited CPU power, dedicated hardware is essential to provide sufficient virus detection performance. A TCAM-based virus-detection unit provides high throughput, but also challenges for low power and low cost. In this paper, an adaptively dividable dual-port BiTCAM (unifying binary and ternary CAMs) is proposed to achieve a high-throughput, low-power, and low-cost virus-detection processor for mobile devices. The proposed dual-port BiTCAM is realized with the dual-port AND-type match-line scheme which is composed of dual-port dynamic AND gates. The dual-port designs reduce power consumption and increase storage efficiency due to shared storage spaces. In addition, the dividable BiTCAM provides high flexibility for regularly updating the virus-database. The BiTCAM achieves a 48% power reduction and a 40% transistor count reduction compared with the design using a conventional single-port TCAM. The implemented 0.13 mum processor performs up to 3 Gbps virus detection with an energy consumption of 0.44 fJ/pattern-byte/scan at peak throughput.  相似文献   

12.
In this paper, we proposed a new architecture of lifting processor for JPEG2000 and implemented it with both FPGA and ASIC. It includes a new cell structure that executes a unit of lifting calculation to satisfy the requirements of the lifting process of a repetitive arithmetic. After analyzing the operational sequence of lifting arithmetic in detail and imposing the causality to implement in hardware, the unit cell was optimized. A new simple lifting kernel was organized by repeatedly arranging the unit cells and a lifting processor was realized for Motion JPEG2000 with the kernel. The proposed processor can handle any size of tiles and support both lossy and lossless operation with (9,7) filter and (5,3) filter, respectively. Also, it has the same throughput rate as the input, and can continuously output the wavelet coefficients of the four types (LL, LH, HL, HH) simultaneously. The lifting processor was implemented in a 0.35 mum CMOS fabrication process, the result of which occupied about 90 000 gates, and was stably operated in about 150 MHz  相似文献   

13.
We have fabricated a high yield integrated memory array processor (IMAP) LSI, which features a high memory bandwidth (1.28-GB/s) and low power consumption (4-W max.) and which contains a 2-Mb SRAM with 1.28-I/O's and 64 processor elements (PE's) in one chip. A high-bandwidth and low-power memory circuit design is the key technology to realize the IMAP-LSI. We adopted following new designs for memory circuit. (1) Memory access time is designed to be twice as fast as PE execution time (2) Employment of dynamic power control mode, which reduces the memory power consumption down to 30% of maximum power without a loss in access-speed (3) Simplified synchronization with PE's (4) 4-way block redundancy. These design techniques are suitable for future system integrated ULSI's  相似文献   

14.
An arithmetic unit (AU) that performs all basic arithmetic operations in the finite field GF(2m) is presented, where m is an arbitrary integer. The presented finite field AU consists of an arithmetic processor, an arithmetic logic unit, and a control unit. The proposed AU has low circuit complexity and is programmable, so that any error-correcting decoder that operates in GF(2m) can be easily implemented with this AU.  相似文献   

15.
A 32-b RISC/DSP microprocessor with reduced complexity   总被引:2,自引:0,他引:2  
This paper presents a new 32-b reduced instruction set computer/digital signal processor (RISC/DSP) architecture which can be used as a general purpose microprocessor and in parallel as a 16-/32-b fixed-point DSP. This has been achieved by using RISC design principles for the implementation of DSP functionality. A DSP unit operates in parallel to an arithmetic logic unit (ALU)/barrelshifter on the same register set. This architecture provides the fast loop processing, high data throughput, and deterministic program flow absolutely necessary in DSP applications. Besides offering a basis for general purpose and DSP processing, the RISC philosophy offers a higher degree of flexibility for the implementation of DSP algorithms and achieves higher clock frequencies compared to conventional DSP architectures. The integrated DSP unit provides instruction set support for highly specialized DSP algorithms. Subword processing optimized for DSP algorithms has been implemented to provide maximum performance for 16-b data types. While creating a unified base for both application areas, we also minimized transistor count and we reduced complexity by using a short instruction pipeline. A parallelism concept based on a varying number of instruction latency cycles made superscalar instruction execution superfluous  相似文献   

16.
This work proposes a communication digital signal processor (DSP) suitable for massive signal processing operations in orthogonal frequency division multiplexing (OFDM) and code-division multiple-access (CDMA) communication systems. The OFDM-based IEEE 802.11a wireless LAN transceiver and CDMA-based WCDMA uplink receiver are simulated to evaluate the computation requirements of future communication systems. The architecture of the communication digital signal processor is established according to the computational complexity of these simulations. The proposed architecture supports basic butterfly operations, single/double-precision and real- and complex-valued multiplication-and-accumulation (MAC), squared error computation, and add-compare-select (ACS) operation. This butterfly/complex MAC architecture can greatly enhance the execution efficiency of operations often found in communication applications. The processor chip is fabricated using a 0.35-/spl mu/m n-well one-poly four-metal CMOS technology. The fabricated DSP chip reaches a speed of 1.1 G MAC/s when operating in the high-speed mode, and it achieves 4 M MAC/s/mW in the low-power mode.  相似文献   

17.
This paper describes the design, realization, and evaluation of a mixed-signal motion estimation processor using the full-search block-matching algorithm. The approach features digital I/O and a low-power, compact analog computational core. The proof-of-concept realization whose architecture incorporates pixel reuse, was fabricated in 0.8-/spl mu/m CMOS technology occupying 0.65 mm/sup 2/, and operates on 4 /spl times/ 4 pixel blocks and a search area of 8 /spl times/ 8 pixels. The processor achieves a low energy consumption per motion vector of 1.35 nJ and dissipates 0.8 mW from a 3-V power supply at QCIF 15 frames/s. The approach is intended for portable applications of digital video encoding.  相似文献   

18.
A low-spurious low-power 12-bit 160-MS/s digital to analog converter (DAC) for baseband wireless transmitter is proposed and demonstrated. Degenerated current switches are introduced and benefits of using them are discussed. Mismatch behavior under packaging-induced die stress is also presented. The mobility shift caused by package stress inherited from a thin-die is a dominant source of I/Q mismatch. A 2-channel I/Q DAC core consumes 4 mA with a 1.3/2.6 V dual supply. The 0.13 mm2 I/Q DAC core fabricated in 90-nm digital CMOS process with a highly-integrated digital processor achieves 74 dB SFDR, 55 dB SNDR, and -73 dB THD for a 975 kHz sinusoid at 153.6 MS/s sample rate  相似文献   

19.
《电子学报:英文版》2017,(6):1161-1167
By exploring symmetric cryptographic data level and instruction-level parallelism, the reconfigurable processor architecture for symmetric ciphers is presented based on Very-long instruction word (VLIW) structure. The application-specific instruction-set system for symmetric ciphers is proposed. As for the same arithmetic operation of symmetric ciphers, eleven kinds of reconfigurable cryptographic arithmetic units are designed by the reconfigurable technology. As to the requirement of high energy-efficient design, the loop buffer structure for instruction fetching unit is proposed to reduce the power consumption significantly with the same frequency as conventional, meanwhile, the chain processing mechanism is proposed to improve the cryptographic throughput without any area overhead. It has been fabricated with 0.18μm CMOS technology. The result shows that the processor can work up to 200MHz, and the fourteen kinds of cryptographic algorithms were mapped in the processor, the encryption throughput of AES, SNOW2.0 and SHA2 algorithm can achieve 1.19Gbps, 1.05Gbps, and 407Mbps respectively.  相似文献   

20.
The paper proposes a new continuous-flow mixed-radix (CFMR) fast Fourier transform (FFT) processor that uses the MR (radix-4/2) algorithm and a novel in-place strategy. The existing in-place strategy supports only a fixed-radix FFT algorithm. In contrast, the proposed in-place strategy can support the MR algorithm, which allows CF FFT computations regardless of the length of FFT. The novel in-place strategy is made by interchanging storage locations of butterfly outputs. The CFMR FFT processor provides the MR algorithm, the in-place strategy, and the CF FFT computations at the same time. The CFMR FFT processor requires only two N-word memories due to the proposed in-place strategy. In addition, it uses one butterfly unit that can perform either one radix-4 butterfly or two radix-2 butterflies. The CFMR FFT processor using the 0.18 /spl mu/m SEC cell library consists of 37,000 gates excluding memories, requires only 640 clock cycles for a 512-point FFT and runs at 100 MHz. Therefore, the CFMR FFT processor can reduce hardware complexity and computation cycles compared with existing FFT processors.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号