1.

Shmuel Wimer Amir Albeck Israel Koren 《Computers & Electrical Engineering》2014

VLSI designs are typically data-independent and, as such, must produce the correct result even for worst-case inputs. Adders in particular assume that addition must be completed within a prescribed number of clock cycles, independently of the operands. While the longest carry propagation of an n-bit adder is n bits, its expected length is only O(log₂ n) bits. We present a novel dual-mode adder architecture that reduces the average energy consumption by up to 50%. In normal mode the adder targets the O(log₂ n)-bit average worst-case carry propagation chains, while in extended mode it accommodates the less frequent O(n)-bit chain. We prove that minimum energy is achieved when the adder is designed for O(log₂ n) carry propagation, and present a circuit implementation. Dual-mode adders enable voltage scaling of the entire system, potentially supporting further overall energy reduction. The energy-time tradeoff obtained when incorporating such adders in an ordinary microprocessor's pipeline and other architectures is discussed.
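The gap between the n-bit worst case and the O(log₂ n) expected carry chain can be checked empirically. The following is a minimal sketch, not the authors' model: the function names, the trial count, and the definition of a chain (a generate position followed by consecutive propagate positions) are assumptions made for illustration.

```python
import random


def longest_carry_chain(a, b, n):
    """Longest run of bit positions a single carry travels when adding a + b (n bits)."""
    longest = 0
    current = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if ai & bi:
            # Generate: a new carry is born at this position.
            current = 1
        elif (ai ^ bi) and current:
            # Propagate: a live carry moves one position further.
            current += 1
        else:
            # Kill (both bits 0) or propagate with no incoming carry.
            current = 0
        longest = max(longest, current)
    return longest


def average_longest_chain(n, trials=2000, seed=0):
    """Monte Carlo estimate of the expected longest chain for uniform random operands."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_carry_chain(rng.getrandbits(n), rng.getrandbits(n), n)
    return total / trials
```

For uniformly random 64-bit operands the average stays far below 64, whereas the pair `1` and `2**n - 1` triggers the full n-bit chain that an extended mode would have to absorb.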

2.

Design and Optimization of an Efficient Diminished-1 Modulo 2ⁿ+1 Adder

吕晓兰《测控技术》2014,33(2):127-129

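Considering the strengths and weaknesses of existing diminished-1 modulo 2ⁿ+1 adders, an efficient carry-select-based diminished-1 modulo 2ⁿ+1 adder is designed. In the carry computation of the modular adder, carry-select computation replaces the conventional carry computation, which markedly reduces the number of prefix operations needed to compute the carries. Analysis and experimental results show that, for relatively large n, the carry-select diminished-1 modulo 2ⁿ+1 adder effectively improves integration density while maintaining a high operating speed.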
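For readers unfamiliar with the diminished-1 representation, the arithmetic it relies on can be sketched in a few lines. This illustrates the standard inverted end-around-carry rule only, not the carry-select circuit of the paper. The function name and the separate zero flag are assumptions.

```python
def diminished1_add(a_d1, b_d1, n):
    """Add two nonzero residues modulo 2**n + 1 given in diminished-1 form.

    A value v in 1..2**n is stored as v - 1, so it fits in n bits.
    Returns (is_zero, sum_d1); sum_d1 is meaningful only when is_zero is False.
    """
    mask = (1 << n) - 1
    total = a_d1 + b_d1
    carry_out = total >> n
    # Inverted end-around carry: add 1 exactly when no carry left the n-bit adder.
    sum_d1 = (total + (1 - carry_out)) & mask
    # The true sum is 0 mod 2**n + 1 exactly when the n-bit sum is all ones.
    is_zero = carry_out == 0 and total == mask
    return is_zero, sum_d1
```

A zero result has no n-bit diminished-1 encoding, which is why this sketch reports it through a separate flag.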

3.

Radix-8 full adder in QCA with single clock-zone carry propagation delay

《Microprocessors and Microsystems》2017

We design a 3-bit adder, or radix-8 full adder (FA), in quantum-dot cellular automata (QCA), where the 3-bit carry propagation path can be accommodated in one clock-zone. To achieve this, we introduce group majority signals, similar to the group propagate and generate signals in parallel prefix computations, and use them to reformulate the carry expressions of a previous radix-4 FA so that it can be extended to higher-radix FAs. Applying this new interpretation of carry expressions (via group majority signals) to 3-bit adders means that only a single clock cycle is required for 12-bit (vs. the previous 8-bit) carry propagation across four radix-8 FAs. Based on the proposed radix-8 QCA-FA, we realized 8-, 16-, 32-, 64-, and 128-bit QCA adders via QCADesigner. Comparison of these adders with the previous radix-4 experiment showed 9–41% speedup and 57–76% area saving for 16–128-bit adders, respectively. On the other hand, compared to the best previous radix-2 design at the same bit widths, we observed 57–172% speedup, but at the cost of a 138–4% area increase, except for the 64- and 128-bit cases, where we also observed 19% and 41% area savings, respectively.

4.

Comparative study of FFA architectures using different multiplier and adder topologies

Paliwal Payal Sharma Janki Ballabh Nath Vijay 《Microsystem Technologies》2020,26(5):1455-1462

The parallel FIR filter is the prime block of many modern communication applications such as MIMO and multi-point transceivers. However, the hardware replication required by parallel techniques makes the system bulkier and costlier. The fast FIR algorithm (FFA) offers the best alternative to traditional parallel techniques. In this paper, FFA-based FIR structures with different multiplier and adder topologies are implemented. To optimize the design, different multiplication techniques such as the add-and-shift method, the Vedic multiplier and the Booth multiplier are used for computation. Various adders such as the carry select adder, the carry save adder and the Han-Carlson adder are analyzed for improved performance of the FFA structure. The basic objective is to investigate the performance of these designs with respect to the tradeoffs between area, delay and power dissipation. A comparative study is carried out among the conventional and the different proposed designs. The advantage of the presented work is that, based on the constraints, one can select the design suitable for a specific application. It also fills the literature gap regarding critical analysis of FPGA implementations of FFA architectures using different multiplier and adder topologies. The Xilinx Vivado HLS tool is used to implement the proposed designs in VHDL.
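The hardware saving behind FFA can be seen in the textbook 2-parallel case, where three half-length subfilters replace the four that a direct polyphase split would need. The sketch below models that standard decomposition in software and is not the paper's VHDL design. The function names are hypothetical, and even-length inputs are assumed for simplicity.

```python
def convolve(a, b):
    """Direct linear convolution, used both as a subfilter and as the reference."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def ffa2(x, h):
    """2-parallel fast FIR: three half-length convolutions instead of four.

    Assumes len(x) and len(h) are both even.
    """
    x0, x1 = x[0::2], x[1::2]          # even / odd polyphase parts of the input
    h0, h1 = h[0::2], h[1::2]          # even / odd polyphase parts of the taps
    p0 = convolve(x0, h0)
    p1 = convolve(x1, h1)
    p2 = convolve([a + b for a, b in zip(x0, x1)],
                  [a + b for a, b in zip(h0, h1)])
    m = len(p0)
    y = [0] * (2 * m + 1)
    for k in range(m):
        y[2 * k] += p0[k]                       # X0*H0 feeds even outputs
        y[2 * k + 2] += p1[k]                   # X1*H1 feeds even outputs, delayed
        y[2 * k + 1] += p2[k] - p0[k] - p1[k]   # cross terms feed odd outputs
    return y
```

The pre-addition `x0 + x1` and the post-subtractions trade one subfilter's worth of multipliers for extra adders, which is why the choice of adder topology matters in such structures.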

5.

Design of a 64-bit Reconfigurable Adder for a VLIW Digital Signal Processor

张志伟马鸿李立健王东琳《计算机工程》2007,33(16):29-31,34

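This paper describes the design of a 64-bit adder for a very long instruction word (VLIW) digital signal processor. The adder is highly reconfigurable and supports the addition of two 64-bit operands, four 32-bit operands, eight 16-bit operands, or sixteen 8-bit operands. It combines the advantages of the Brent-Kung logarithmic carry-lookahead adder and the carry-select adder, reducing the adder's area and wiring by 50%, while its delay is proportional to the logarithm of the adder length. Simulation results show that, under typical operating conditions with a 0.18 μm standard-cell library, the critical-path delay is 0.83 ns, the area is 0.149 mm², and the power consumption is only 0.315 mW.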

6.

Design of a High-Performance Reconfigurable Adder for Digital Signal Processors

马鸿李振伟彭思龙《计算机工程》2009,35(12):1-4

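A 16-bit adder for high-performance digital signal processors is designed. The adder combines features of the conditional carry-select adder and the conditional sum-select adder and is reconfigurable: it can add two 16-bit operands or four 8-bit operands, and its carry chain is optimized. Compared with a conventional conditional carry-select adder, under typical operating conditions with a 0.18 μm standard-cell library, its delay is reduced by 46% and its power consumption by 5%.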

7.

Research and Implementation of Subword-Parallel Adders

马胜黄立波王志英刘聪戴葵《计算机工程与应用》2009,45(36):54-59

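Subword-parallel adders can effectively improve the processing performance of multimedia applications. Based on a gate-delay model, the principles and performance of adders are analyzed, and two subword-parallel control mechanisms are designed: carry truncation and carry elimination. Guided by these two mechanisms, several subword-parallel adders are implemented, and their performance is compared and analyzed. The results show that, relative to carry truncation, the carry-elimination mechanism requires shorter delay, fewer logic gates, and lower power. Among the subword-parallel adders, the Kogge-Stone adder has the smallest delay, while the RCA has the fewest logic gates and the lowest power consumption. The results can guide the design and selection of subword-parallel adders.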
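The effect of stopping carries at subword boundaries can be imitated in software with the well-known SWAR masking trick. This is only a software analogue offered for intuition and does not reproduce either hardware mechanism studied in the paper. The function names and the 64-bit default word width are assumptions.

```python
def subword_add(a, b, lane_bits, word_bits=64):
    """Add packed lanes of lane_bits each; no carry crosses a lane boundary."""
    high = 0
    for i in range(word_bits // lane_bits):
        high |= 1 << (i * lane_bits + lane_bits - 1)   # top bit of every lane
    word_mask = (1 << word_bits) - 1
    low = word_mask & ~high
    # Add everything except each lane's top bit; the carry lands in that top bit.
    partial = (a & low) + (b & low)
    # Fold the top bits back in with XOR so nothing spills into the next lane.
    return (partial ^ ((a ^ b) & high)) & word_mask


def lanewise_reference(a, b, lane_bits, word_bits=64):
    """Straightforward lane-by-lane addition used to check subword_add."""
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, word_bits, lane_bits):
        lane_sum = ((a >> shift) + (b >> shift)) & lane_mask
        result |= lane_sum << shift
    return result
```

Calling `subword_add(a, b, 8)` treats a 64-bit word as eight independent bytes, while `lane_bits=64` degenerates to an ordinary wrap-around 64-bit addition.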

8.

Design of high-performance QCA incrementer/decrementer circuit based on adder/subtractor methodology

《Microprocessors and Microsystems》2020

This paper focuses on a novel design of an adder/subtractor-based incrementer/decrementer using quantum-dot cellular automata (QCA) technology. QCA is a promising nanotechnology that offers new techniques of computation and data transmission. We use the multilayer crossover technique in the proposed designs to achieve low latency and area for the scalability feature. Moreover, new designs of QCA half and full adders are proposed to improve the operating speed of the incrementer/decrementer. The working of the proposed designs is analyzed via the QCA simulator tool, and the results are compared with previous studies in terms of cell count, area, and latency. According to the analysis, the presented designs perform well; for example, the proposed 4-bit incrementer design shows an improvement of 65% in terms of area usage and 3.2 times lower latency compared to its existing counterpart.

9.

Lane of parallel through carry in ternary optical adder

JIN Yi, HE Huacan & AI Lirong. School of Computer Engineering Science, Shanghai University, Shanghai, China; College of Computer Science & Engineering, Northwestern Polytechnical University, Xi'an, China 《中国科学F辑(英文版)》2005,48(1):107-116

At present, 50 to 100 microseconds are necessary for a liquid crystal to change its state from opacity to clarity; however, 1.14×10⁻⁵ microseconds are proved to be enough for light to pass through a clear liquid crystal device. Building on this great difference in time, an optical adder was constructed with parallel through carry lanes (PTCL) composed of liquid crystals. Because all carries in the PTCL are processed in parallel, the carry delay in the ternary optical computer's adder is avoided. Because it eliminates the carry delay in the adder of the ternary optical computer by physical means, the PTCL is also applicable to other types of optical adders. Moreover, a light diagram of the adder and one PTCL structure are provided.

10.

Effective RCA design using quantum dot cellular automata

《Microprocessors and Microsystems》2020

Quantum-dot Cellular Automata (QCA) is a transistorless technology alternative to CMOS for developing low-power, high-speed digital circuits. Adder circuits are broadly employed in all digital computation systems. In this paper, a novel coplanar QCA full adder circuit is proposed which is designed with a minimum number of QCA cells. The proposed full adder requires only 13 QCA cells, an area of 0.008 μm² and a delay of about 2 clock cycles to implement its function. Then an efficient 4-bit Ripple Carry Adder (RCA) is designed based on the proposed full adder that performs higher-end addition in an effective way. Simulation results are obtained using QCADesigner tool version 2.0.3. The simulation results also show that the proposed 4-bit RCA requires only 70 QCA cells, an area of 0.18 μm² and a delay of about 5 clock cycles to implement its function, with enhanced performance in terms of latency, area and QCA cost. From the comparisons, it is found that our work achieves over 55% improvement in QCA cell count.
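Since QCA logic is built from majority gates and inverters, a ripple carry adder is naturally expressed in terms of majority functions. The sketch below uses the widely cited three-majority-gate full adder formulation as an assumption and does not model the 13-cell layout proposed in the paper. All function names are hypothetical.

```python
def majority(a, b, c):
    """Three-input majority gate, the basic QCA logic primitive."""
    return (a & b) | (a & c) | (b & c)


def full_adder(a, b, cin):
    """Full adder built from three majority gates and two inverters."""
    cout = majority(a, b, cin)
    total = majority(1 - cout, majority(a, b, 1 - cin), cin)
    return total, cout


def ripple_carry_add(a, b, bits=4):
    """Chain full adders from the least significant bit; returns (sum, carry_out)."""
    carry = 0
    result = 0
    for i in range(bits):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry
```

Each stage must wait for the carry of the stage below it, which is why the latency of such an adder grows with its width.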

11.

Delay and area efficient approximate multiplier using reverse carry propagate full adder

《Microprocessors and Microsystems》2020

Approximate computing improves the performance and power efficiency of error-resilient applications by reducing design complexity. This paper presents a new method for approximating multipliers. Terms of variable likelihood are produced by altering the partial products of the multiplier. Based on probability statistics, the accumulation of the altered partial products leads to variation in logic complexity. Here the approximation is implemented in two variants of a 16-bit multiplier, with a reverse carry propagate adder (RCPA) in the final stage. In the reverse carry propagate adder the carry signal propagates from the most significant bit (MSB) to the least significant bit (LSB), which gives greater weight to the input carry than to the output carry. Circulating the carry in reverse order increases stability under delay variations. Using the RCPA in the approximate multiplier provides 21% and 7% improvements in area and delay, respectively. In comparison, this structure is more resilient to delay variations than the ideal approximate adder.

12.

An Efficient Self-Test Design for Adders in VLSI

肖继学陈光《计算机辅助设计与图形学学报》2007,19(11):1465-1470

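Based on test generation by arithmetic addition, a self-test scheme for adders in VLSI is proposed in which the adder itself generates all of the test vectors it needs. The test vectors are improved by optimizing their initial values, which raises their fault detection and fault location capability. The adder self-test is designed using operations such as left-shifting the test vectors and logical AND. Experimental results on 8-, 16-, and 32-bit ripple-carry and carry-lookahead adders show that the self-test achieves complete testing of single and double stuck-at faults, with single- and double-fault location rates above 95.570% and 72.656%, respectively. The scheme supports at-speed testing without degrading the circuit's original performance, and its test time is independent of the adder length.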

13.

A fast inner product processor based on equal alignments

S. P. Smith H. C. Torng 《Journal of Parallel and Distributed Computing》1985,2(4):376-390

Inner product computation is an important operation, invoked repeatedly in matrix multiplications. A high-speed inner product processor can be very useful (among many possible applications) in real-time signal processing. This paper presents the design of a fast inner product processor, with appreciably reduced latency and cost. The inner product processor is implemented with a tree of carry-propagate or carry-save adders; this structure is obtained with the incorporation of three innovations in the conventional multiply/add tree: (1) The leaf-multipliers are expanded into adder subtrees, thus achieving an O(log Nb) latency, where N denotes the number of elements in a vector and b the number of bits in each element. (2) The partial products, to be summed in producing an inner product, are reordered according to their “minimum alignments.” This reordering brings approximately a 20% saving in hardware—including adders and data paths. The reduction in adder widths also yields savings in carry propagation time for carry-propagate adders. (3) For trees implemented with carry-save adders, the partial product reordering also serves to truncate the carry propagation chain in the final propagation stage by 2 log b − 1 positions, thus significantly reducing the latency further. A form of the Baugh and Wooley algorithm is adopted to implement two's complement notation with changes only in peripheral hardware.
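The carry-save tree mentioned in the abstract rests on a simple identity: three operands can be compressed into a sum word and a carry word without propagating any carry, so only the final two-operand addition pays the propagation cost. The sketch below illustrates that generic idea for non-negative integers. It does not implement the paper's alignment-based reordering, and the function names are assumptions.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: returns (sum_word, carry_word) with a + b + c == sum + carry."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def csa_tree_sum(operands):
    """Reduce non-negative operands with 3:2 compressors, then add the last two."""
    ops = list(operands)
    while len(ops) > 2:
        reduced = []
        for i in range(0, len(ops) - 2, 3):
            reduced.extend(carry_save_add(ops[i], ops[i + 1], ops[i + 2]))
        leftover = len(ops) % 3
        if leftover:
            reduced.extend(ops[-leftover:])   # pass ungrouped operands to the next level
        ops = reduced
    return sum(ops)                           # the single carry-propagate addition


def inner_product(xs, ys):
    """Inner product whose partial results are accumulated through the CSA tree."""
    return csa_tree_sum(x * y for x, y in zip(xs, ys))
```

Each pass shrinks the operand count by roughly a third, so the number of compressor levels grows logarithmically with the number of operands.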

14.

A new construction adder based on Chinese abacus algorithm

Shu-Chung Yi 《Computers & Electrical Engineering》2012,38(2):185-193

A new adder construction based on the Chinese abacus algorithm is presented in this paper. Two kinds of beads are used in this construction. Each column element has three higher beads with a weight of four and three lower beads with a weight of one. The proposed 32-bit adder contains eight column elements. The construction was simulated in a TSMC 0.18 μm CMOS process, and the layout was made in the same technology. The maximum delay of the 32-bit abacus adder is 0.91 ns, 14% less than that of carry look-ahead adders in 0.18 μm technology. The power consumption of the abacus adder is 3.1 mW, 28% less than that of carry look-ahead adders in 0.18 μm technology. Recent designs are compared with the proposed adder. The construction was also simulated with the Predictive Technology Model (PTM), and the PTM results are also presented. The Chinese abacus approach offers a technique that is competitive with other adders.
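The bead weights described above imply that one column covers the values 0 to 15, so eight columns cover a 32-bit word. The following sketch shows only that number representation, under the assumption that each column holds one 4-bit digit. It does not model the adder circuit itself, and the function names are hypothetical.

```python
def to_abacus_columns(value, columns=8):
    """Split a value into columns of (higher_beads, lower_beads), least significant first.

    Each column holds one 4-bit digit: higher beads weigh 4, lower beads weigh 1,
    and both counts lie in 0..3.
    """
    result = []
    for i in range(columns):
        digit = (value >> (4 * i)) & 0xF
        higher, lower = divmod(digit, 4)
        result.append((higher, lower))
    return result


def from_abacus_columns(columns):
    """Rebuild the integer from its bead counts."""
    return sum((4 * higher + lower) << (4 * i)
               for i, (higher, lower) in enumerate(columns))
```

For example, the digit 14 becomes three higher beads and two lower beads, since 3 × 4 + 2 × 1 = 14.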

15.

Performance Measurement of Energy Efficient and Highly Scalable Hybrid Adder

B. Annapoorani P. Marikkannu 《计算机系统科学与工程》2023,45(3):2659-2672

Adders are vital building blocks for arithmetic operations such as multiplication, subtraction, and division. Binary number additions are performed by the digital circuit known as the adder. In VLSI (Very Large Scale Integration), the full adder is a basic component, as it plays a major role in the design of integrated circuit applications. To minimize power, various adder designs have been implemented, and each of them has its own drawbacks. A designed adder requires high power when its driving capability is good, and requires low power only when it incurs more delay. To overcome such issues and to obtain better performance, a novel parallel adder is proposed. The design of the adder starts at 1 bit and is extended up to 32 bits so as to verify its scalability. The proposed novel parallel adder is derived from the carry look-ahead adder. The merits of the suggested adder are better speed, power consumption, delay, and driving capability. The designed adders are verified for different supply voltages, delay, power and leakage, and their performance is found to be superior to the competing Manchester Carry Chain Adder (MCCA), Carry Look Ahead Adder (CLAA), Carry Select Adder (CSLA), Carry Select Adder (CSA) and other adders.

16.

Applying Approximate Counting for Computing the Frequency Moments of Long Data Streams

André Gronemeier Martin Sauerhoff 《Theory of Computing Systems》2009,44(3):332-348

This paper takes up a remark in the well-known paper of Alon, Matias, and Szegedy (J. Comput. Syst. Sci. 58(1):137–147, 1999) about the computation of the frequency moments of data streams and shows in detail how any F_k with k ≥ 1 can be approximately computed using space O(k m^{1−1/k}(k + log m + log log n)) based on approximate counting. An important building block for this, which may be interesting in its own right, is a new approximate variant of reservoir sampling using space O(log log n) for constant error parameters.
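The log log n terms come from approximate counting, in which a counter stores roughly the logarithm of the count rather than the count itself. The sketch below shows the classic Morris counter as the textbook instance of this idea. It is not the paper's algorithm or its reservoir-sampling variant, and the class and function names are assumptions.

```python
import random


class MorrisCounter:
    """Approximate counter that stores only an exponent, about log2 of the count."""

    def __init__(self, rng):
        self.exponent = 0
        self.rng = rng

    def increment(self):
        # Bump the exponent with probability 2**-exponent.
        if self.rng.random() < 2.0 ** -self.exponent:
            self.exponent += 1

    def estimate(self):
        # 2**exponent - 1 is an unbiased estimate of the number of increments.
        return 2 ** self.exponent - 1


def average_estimate(n_events, n_counters=1000, seed=1):
    """Average many independent counters to tame the variance of a single one."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_counters):
        counter = MorrisCounter(rng)
        for _ in range(n_events):
            counter.increment()
        total += counter.estimate()
    return total / n_counters
```

A single counter is noisy, so practical schemes average several copies or change the base of the exponent, trading a little extra space for a smaller error.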

17.

Finding the conditional location of a median path on a tree

Biing-Feng Wang Tzu-Chin Lin Chien-Hsin Lin Shan-Chyun Ku 《Information and Computation》2008,206(7):828-839

In this paper, we study the problem of locating a median path of limited length on a tree under the condition that some existing facilities are already located. The existing facilities may be located at any subset of vertices. Upper and lower bounds are proposed for both the discrete and continuous models. In the discrete model, a median path is not allowed to contain partial edges. In the continuous model, a median path may contain partial edges. The proposed upper bounds for these two models are O(n log n) and O(n log n α(n)), respectively. They improve the previously known bounds of O(n log² n) and O(n²), respectively. The proposed lower bounds are both Ω(n log n).

18.

A versatile Montgomery multiplier architecture with characteristic three support

E. Öztürk 《Computers & Electrical Engineering》2009,35(1):71-85

We present a novel unified core design which is extended to realize Montgomery multiplication in the fields GF(2ⁿ), GF(3^m), and GF(p). Our unified design supports RSA and elliptic curve schemes, as well as the identity-based encryption which requires a pairing computation on an elliptic curve. The architecture is pipelined and is highly scalable. The unified core utilizes the redundant signed digit representation to reduce the critical path delay. While the carry-save representation used in classical unified architectures is only good for addition and multiplication operations, the redundant signed digit representation also facilitates efficient computation of comparison and subtraction operations besides addition and multiplication. Thus, there is no need for a transformation between the redundant and the non-redundant representations of field elements, which would be required in the classical unified architectures to realize the subtraction and comparison operations. We also quantify the benefits of the unified architectures in terms of area and critical path delay. We provide detailed implementation results. The metric shows that the new unified architecture provides an improvement over a hypothetical non-unified architecture of at least 24.88%, while the improvement over a classical unified architecture is at least 32.07%.
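For the GF(p) case, Montgomery multiplication replaces division by p with shifts and masks against a power of two R. The sketch below is a plain integer version of the standard reduction (REDC) for an odd modulus. It says nothing about the paper's redundant signed digit datapath or its GF(2ⁿ) and GF(3^m) modes. The function names are assumptions, and `pow(p, -1, R)` requires Python 3.8 or later.

```python
def montgomery_setup(p, k):
    """Precompute R = 2**k and -p^{-1} mod R for an odd modulus p < R."""
    R = 1 << k
    p_inv_neg = (-pow(p, -1, R)) % R
    return R, p_inv_neg


def redc(t, p, k, p_inv_neg):
    """Montgomery reduction: returns t * R^{-1} mod p for 0 <= t < p * R."""
    mask = (1 << k) - 1
    m = ((t & mask) * p_inv_neg) & mask
    u = (t + m * p) >> k          # exact division: the low k bits of t + m*p are zero
    return u - p if u >= p else u


def montgomery_multiply(a, b, p, k):
    """Compute a * b mod p by way of the Montgomery domain."""
    R, p_inv_neg = montgomery_setup(p, k)
    a_m = (a * R) % p             # convert operands into Montgomery form
    b_m = (b * R) % p
    prod_m = redc(a_m * b_m, p, k, p_inv_neg)   # equals a*b*R mod p
    return redc(prod_m, p, k, p_inv_neg)        # convert back to ordinary form
```

In practice the conversions into and out of Montgomery form are amortized over a long chain of multiplications, as in modular exponentiation.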

19.

A SAD Computation Accelerator Based on the Block-Matching Algorithm

谷会涛陈书明《计算机工程与科学》2012,34(7):60-64

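Motion estimation based on block matching is a key technique in image and video applications. The SAD (sum of absolute differences) operation is the dominant computation in motion estimation and has very high computational complexity and transfer bandwidth requirements. This paper proposes a configurable SAD accelerator architecture that uses a 16×1 PE array and an adder tree to accelerate SAD computation. The pipelines of the PE array and the adder tree are finely partitioned, which effectively raises the operating frequency. The accelerator adopts a DMA event mechanism so that most data transfers proceed in parallel with SAD computation, reducing the performance loss caused by data transfer latency. Experimental results show that searching a 16×16 search window takes only 4102 cycles. Synthesized with an SMIC 0.13 μm CMOS standard-cell process, the design reaches a maximum operating frequency of 750 MHz, with an area of about 16.8k gates plus 3.5 KB of on-chip memory.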
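The computation being accelerated is easy to state in software: for each candidate position, sum the absolute pixel differences and keep the smallest total. The sketch below is a naive full search over lists of pixel rows, included only to make the workload concrete. It does not reflect the PE array, the pipelining, or the DMA scheme of the paper, and the function names are assumptions.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(p - q)
               for row_a, row_b in zip(block_a, block_b)
               for p, q in zip(row_a, row_b))


def best_match(block, frame):
    """Full search: return (cost, row, col) of the best-matching position in frame."""
    block_h, block_w = len(block), len(block[0])
    best = None
    for y in range(len(frame) - block_h + 1):
        for x in range(len(frame[0]) - block_w + 1):
            candidate = [row[x:x + block_w] for row in frame[y:y + block_h]]
            cost = sad(block, candidate)
            if best is None or cost < best[0]:
                best = (cost, y, x)
    return best
```

Every candidate position repeats the same subtract, absolute-value and accumulate pattern, which is what makes the operation a good fit for an array of processing elements feeding an adder tree.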

20.

OTIS-MOT: an efficient interconnection network for parallel processing

Prasanta K. Jana Dheeresh K. Mallick 《The Journal of supercomputing》2012,59(2):920-940

Mesh of trees (MOT) is well known for its small diameter, high bisection width, simple decomposability and area universality. On the other hand, OTIS (Optical Transpose Interconnection System) provides an efficient optoelectronic model for massively parallel processing systems. In this paper, we present OTIS-MOT as a competent candidate for a two-tier architecture that can take advantage of both the OTIS and the MOT. We show that an n⁴-processor OTIS-MOT has diameter 8 log n + 1 (the base of the logarithm is assumed to be 2 throughout this paper) and fault diameter 8 log n + 2 under single node failure. We establish other topological properties such as bisection width, multiple paths and modularity. We show that many communication as well as application algorithms can run on this network in comparable time or even faster than on other similar tree-based two-tier architectures. The communication algorithms, including row/column-group broadcast and one-to-all broadcast, are shown to require O(log n) time, multicast O(n² log n) time, and the bit-reverse permutation O(n) time. Many parallel algorithms for various problems such as finding polynomial zeros, sales forecasting, matrix-vector multiplication and the DFT computation are proposed to map in O(log n) time. Sorting and prefix computation are also shown to run in O(log n) time.