1.

Estimating interlock and improving balance for pipelined architectures

《Journal of Parallel and Distributed Computing》1988,5(4):334-358

Pipelining is now a standard technique for increasing the speed of computers, particularly for floating-point arithmetic. Single-chip, pipelined floating-point functional units are available as “off the shelf” components. Addressing arithmetic can be done concurrently with floating-point operations to construct a fast processor that can exploit fine-grain parallelism. This paper describes a metric to estimate the optimal execution time of DO loops on particular processors. This metric is parameterized by the memory bandwidth and peak floating-point rate of the processor, as well as the length of the pipelines used in the functional units. Data dependence analysis provides information about the execution order constraints of the operations in the DO loop and is used to estimate the amount of pipeline interlock required by a loop. Several transformations are investigated to determine their impact on loops under this metric.

2.

The exact dot product as basic tool for long interval arithmetic

Ulrich Kulisch Van Snyder 《Computing》2011,91(3):307-313

Computing with guarantees is based on two arithmetical features. One is fixed (double) precision interval arithmetic. The other one is dynamic precision interval arithmetic, here also called long interval arithmetic. The basic tool to achieve high speed dynamic precision arithmetic for real and interval data is an exact multiply and accumulate operation and with it an exact dot product. Pipelining allows it to be computed at the same high speed as vector operations on conventional vector processors. Long interval arithmetic fully benefits from such high speed. Exactitude brings very high accuracy, and thereby stability, into computation. This document, which has been incorporated into the draft standard for interval arithmetic being developed by IEEE P1788, specifies the implementation of an exact multiply and accumulate operation.
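To make the idea of an exact dot product concrete, here is a minimal Python sketch. It is not the hardware scheme the abstract specifies: it stands in for the long accumulator with exact rational arithmetic from the standard `fractions` module, and the function names `naive_dot` and `exact_dot` are hypothetical. It only shows why accumulating exactly and rounding once can differ from rounding after every operation.

```python
from fractions import Fraction


def naive_dot(xs, ys):
    """Ordinary floating-point dot product: every multiply and add is rounded."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


def exact_dot(xs, ys):
    """Accumulate all products exactly, then round a single time at the end.

    Fraction(float) converts a float to the exact rational it represents,
    so no information is lost before the final conversion back to float.
    """
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)
    return float(acc)  # the only rounding step


xs = [1e16, 1.0, -1e16]
ys = [1.0, 1.0, 1.0]
print(naive_dot(xs, ys))  # 0.0: the 1.0 is absorbed when added to 1e16
print(exact_dot(xs, ys))  # 1.0
```

The rational accumulator is far slower than a hardware long accumulator; it is used here only because it makes the exactness easy to check.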

3.

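An improved algorithm for radix-2 redundant signed-digit addition

Li Yunfeng, Zhao Jinwei, Zhou Hui, Yu Jun 《Computer Engineering》2007,33(24):242-243

Redundant signed-digit adders meet the demand for adders with both high speed and high precision. To address the shortcomings of the traditional algorithm for binary signed-digit addition, this paper proposes an improved algorithm and designs the corresponding adder circuit. The adder is implemented as a three-level structure that is simple and regular, and the intermediate carry and the intermediate sum each need only a 1-bit encoding. Compared with the traditional structure, the circuit implementing this algorithm is faster, smaller in area, and lower in dynamic power consumption.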
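For orientation, the sketch below shows the textbook carry-free addition rule for radix-2 signed-digit numbers (digits in {-1, 0, 1}). It is the conventional scheme, not the improved algorithm the paper proposes, whose details the abstract does not give. The names `sd_value` and `sd_add` are hypothetical, and digits are stored least significant first.

```python
def sd_value(digits):
    """Integer value of a signed-digit number, least significant digit first."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def sd_add(x, y):
    """Add two equal-length radix-2 signed-digit numbers without carry propagation.

    Each position produces a transfer t[i+1] and an interim sum w[i] with
    x[i] + y[i] == 2 * t[i+1] + w[i]. The choice depends only on the digits
    one position lower, so the final digit w[i] + t[i] stays in {-1, 0, 1}.
    """
    n = len(x)
    assert len(y) == n
    w = [0] * n
    t = [0] * (n + 1)
    for i in range(n):
        p = x[i] + y[i]
        # True when the incoming transfer t[i] cannot be negative.
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if p == 2:
            t[i + 1], w[i] = 1, 0
        elif p == 1:
            t[i + 1], w[i] = (1, -1) if lower_nonneg else (0, 1)
        elif p == 0:
            t[i + 1], w[i] = 0, 0
        elif p == -1:
            t[i + 1], w[i] = (0, -1) if lower_nonneg else (-1, 1)
        else:  # p == -2
            t[i + 1], w[i] = -1, 0
    # Every output digit depends on at most two input positions.
    return [w[i] + t[i] for i in range(n)] + [t[n]]


a = [1, -1, 1]   # 1 - 2 + 4 = 3
b = [1, 1, 0]    # 1 + 2     = 3
s = sd_add(a, b)
print(s, sd_value(s))
```

Because no digit waits on a carry from arbitrarily far away, the delay of such an adder does not grow with word length, which is the property the abstract builds on.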

4.

Verve

Günter Knittel 《Computer Graphics Forum》1993,12(3):37-48

The design principles of a hardware accelerator for volume rendering are described. The architecture represents a voxel subsystem which interfaces easily to any existing workstation. Host requirements are low since it contains a multiport memory holding the complete data set and all arithmetic units needed to perform an effective visualization. Our approach aims at virtual reality by providing some “real-world” examination techniques. The user (e.g., a physician) is enabled to analyze the data set from an arbitrary viewpoint and, even more, to “walk through” the volume model. For a realistic impression, the machine produces perspective projections, supports the illumination by non-parallel light coming from a freely movable point light source and provides depth cueing. The objects are Phong shaded at a rate of 10⁷ operations/s and can be displayed semitransparently. One unit achieves interactive speed: for real-time operation only a small number of units (typically 4-16) must be placed in parallel.

5.

Design and implementation of low power and high speed multiplier using quaternary carry look-ahead adder

《Microprocessors and Microsystems》2020

The need for embedded, portable Digital Signal Processing (DSP) systems has been increasing as a result of the rapid growth of semiconductor technology. The multiplier is a crucial part of almost every DSP application, so low-power, high-speed multipliers are needed for high-speed DSP. The array multiplier is one of the fast multipliers because it has a regular structure and can be designed very easily. An array multiplier multiplies unsigned numbers using full adders and half adders. It depends on the previous partial-sum computations to produce the final output; hence, the delay in producing the output is larger. In previous work, Complementary Metal Oxide Semiconductor (CMOS) Carry Look-ahead Adders (CLA) and CMOS power-gating-based CLAs were used to maximize the speed of the multiplier and to improve power dissipation with minimum delay. CMOS logic is based on the radix-2 (binary) number system. In arithmetic operations, the major issue in the binary number system is the carry. A higher-radix number system such as Quaternary Signed Digit (QSD) can be used to perform arithmetic operations without carry. The proposed system is an array multiplier with a QSD-based Carry Look-Ahead Adder (CLA) to improve performance. Generally, quaternary devices require simpler circuits to process the same amount of data than binary logic devices do. Hence quaternary logic is applied in the CLA to improve adder speed and throughput. In the array multiplier architecture, QSD-based carry look-ahead adders are used instead of full adders. This facilitates low power consumption and quick multiplication. The Tanner EDA tool is used to simulate the proposed multiplier circuit in 180 nm technology. The proposed QSD CLA multiplier is compared with the power-gating CLA and CLA multipliers with respect to area, Power Delay Product (PDP), and average power.

6.

Fast methods on parallel and vector machines

Clive Temperton 《Computer Physics Communications》1982,26(3-4):331-334

The implementation of fast numerical methods on parallel and vector computers is illustrated by describing the development of fast Fourier transform routines for the vector-processing Cray-1 and Cyber 205 machines. Various vectorization methods are presented for FFTs on the Cray-1. By performing a number of transforms in parallel, “super-vector” performance can be achieved. By modifying the algorithms slightly, multiple transforms can be implemented faster on the Cyber 205 (using 64-bit arithmetic on the 2-pipe model) than on the Cray-1, provided that enough transforms (of order 100) can be performed in parallel.

7.

PIPE (Pipelined Image-Processing Engine)

Ernest W. Kent Michael O. Shneier Ronald Lumia 《Journal of Parallel and Distributed Computing》1985,2(1):50-78

The Sensory-Interactive Robotics Group of the National Bureau of Standards' Industrial Systems Division is designing and constructing an experimental multistage pipelined image-processing device for research in machine vision. The device can acquire images from a variety of sources, such as analog or digital television cameras, ranging devices, and conformal mapping arrays. It can process sequences of images in real time, through a serial pipeline of local operations, under the control of an external device. Its output can be presented to such devices as monitors, robot vision systems, iconic-to-symbolic mapping devices, and image-processing computers. In addition to a forward flow of images through successive stages of operations in the pipeline, other paths between the stages of the device permit recursive operations within a single stage, and feedback of the results of operations from a stage to the preceding stage. This architecture facilitates a variety of relaxation operations, interactions of images over time, and other interesting functions. Numerous operations are supported, including arithmetic and Boolean neighborhood operations on images within each stage, and between-stage operations on each pixel such as thresholding, Boolean and arithmetic operations, functional mappings, and a variety of functions for combining pixel data converging via the multiple image paths. The device can also be used to implement several alternative processing modes. Some operate within each stage, for example, to control edge effects or to implement “MIMD” operations specific to regions of interest defined by the host device. Others operate between stages, for example, to support variable-resolution pyramids.

8.

FPGA implementation of XOR-MUX full adder based DWT for signal processing applications

《Microprocessors and Microsystems》2020

In the recent past there has been rapid development in digital technology, especially in signal-processing and image-processing applications. Excellent performance, high speed, compact size, low power and low delay are essential for devices used in applications such as signal processing, audio processing and software-defined radio. In particular, digital devices tend to have large logic size and power consumption and take up a large area in VLSI implementation because of the arithmetic operations of adder and multiplier designs. The architecture of the Digital Wavelet Transform (DWT) is affected because it comprises a number of filter banks at each level, and every filter bank has a number of adders and multipliers owing to the coefficient decompositions of the low-pass and high-pass filters. This repeated n-size filter logic increases logic size and power consumption. The proposed work presents a novel DWT approach that replaces conventional adders and multipliers with XOR-MUX adders and truncation multipliers, thereby reducing the 2n logic size to n-size logic. Finally, the proposed DWT architecture, designed in VHDL and implemented on an FPGA XC6SLX9-2TQG144, demonstrated its performance in terms of delay, area and power.

9.

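Hardware implementation of an elliptic curve cryptosystem over GF(2^m) (total citations: 2; self-citations: 0; citations by others: 2)

Tang Xuefeng, Shen Haibin, Yan Xiaolang 《Computer Engineering and Applications》2004,40(11):96-98

The finite field GF(2^m) of characteristic 2 is well suited to hardware implementation of elliptic curve cryptographic algorithms. By analyzing the modular operations over GF(2^m), this paper converts all of them into modular multiplication and modular addition, and improves the LSD multiplier. The arithmetic unit designed in this way can perform all modular operations over GF(2^m), and the elliptic curve cryptographic algorithm implemented with it is small in area and fast, making it suitable for devices with limited processing power and storage.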
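The two primitive operations the abstract reduces everything to can be sketched in a few lines of Python. This is a plain bit-serial shift-and-add multiplier, not the improved LSD (digit-serial) multiplier of the paper. The function names are hypothetical, and the field GF(2^8) with reduction polynomial 0x11B is used only because its worked examples are published in the AES specification; elliptic curve fields use much larger m.

```python
def gf2m_add(a, b):
    """Addition in GF(2^m): coefficient-wise addition modulo 2, i.e. XOR."""
    return a ^ b


def gf2m_mul(a, b, m, poly):
    """Multiply a and b in GF(2^m), reducing by the polynomial `poly`.

    Elements are integers whose bits are polynomial coefficients, and
    `poly` includes the x^m term. Both inputs must be below 2**m.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a      # add the current multiple of a
        b >>= 1
        a <<= 1              # multiply a by x
        if (a >> m) & 1:
            a ^= poly        # reduce as soon as degree m is reached
    return result


# Worked examples from the AES specification (GF(2^8), polynomial 0x11B).
print(hex(gf2m_add(0x57, 0x83)))            # 0xd4
print(hex(gf2m_mul(0x57, 0x83, 8, 0x11B)))  # 0xc1
```

A digit-serial multiplier processes several bits of `b` per step instead of one, trading area for fewer cycles; the loop above corresponds to a digit size of one bit.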

10.

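Research and design of a four-level reconfigurable ALU in a vocoder

Jing Tao, Wang Qin 《Journal of Chinese Computer Systems》2008,29(12)

In the design of a high-performance vocoder for implementing speech coding and decoding algorithms, an ALU that supports a variable-length VLIW instruction set is key to meeting the design goals. This paper proposes a four-level reconfigurable ALU design built around a prefix adder. By reconfiguring operands and resources, it can complete 81 compound arithmetic and logic operations in a single cycle, while its control encoding is compressed by 58.93% to fit the width constraint of the instruction set. The design efficiently exploits the high parallelism latent in the algorithms and meets the needs of computation-intensive applications well.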

11.

Advanced Arithmetic for the Digital Computer, Design of Arithmetic Units

《Electronic Notes in Theoretical Computer Science》2000

Advances in computer technology are now so profound that the arithmetic capability and repertoire of computers can and should be expanded. Nowadays the elementary floating-point operations +, −, ×, / give computed results that coincide with the rounded exact result for any operands. Advanced computer arithmetic extends this accuracy requirement to all operations in the usual product spaces of computation: the real and complex vector spaces as well as their interval correspondents. This enhances the mathematical power of the digital computer considerably. A new computer operation, the scalar product, is fundamental to the development of advanced computer arithmetic. This paper studies the design of arithmetic units for advanced computer arithmetic. Scalar product units are developed for different kinds of computers like personal computers, workstations, mainframes, super computers or digital signal processors. The new expanded computational capability is gained at modest cost. The units put a methodology into modern computer hardware which was available on old calculators before the electronic computer entered the scene. In general the new arithmetic units increase both the speed of computation as well as the accuracy of the computed result. The circuits developed in this paper show that there is no way to compute an approximation of a scalar product faster than the correct result. A collection of constructs in terms of which a source language may accommodate advanced computer arithmetic is described in the paper. The development of programming languages in the context of advanced computer arithmetic is reviewed. The simulation of the accurate scalar product on existing, conventional processors is discussed. Finally the theoretical foundation of advanced computer arithmetic is reviewed and a comparison with other approaches to achieving higher accuracy in computation is given. Shortcomings of existing processors and standards are discussed.

12.

Bounded algebra and current-mode digital circuits (total citations: 4; self-citations: 0; citations by others: 4)

WU Xunwei Massoud Pedram 《Journal of Computer Science and Technology》1999,14(6):551-557

This paper proposes two bounded arithmetic operations, which are easily realized with current signals. Based on these two operations, a bounded algebra system suitable for describing current-mode digital circuits is developed, and its relationship with Boolean algebra, which is suitable for representing voltage-mode digital circuits, is investigated. The design procedure for current-mode circuits using the proposed algebra system is demonstrated on a number of common circuit elements that are used to realize arithmetic operations, such as adders and multipliers.

13.

Performance Measurement of Energy Efficient and Highly Scalable Hybrid Adder

B. Annapoorani P. Marikkannu 《Computer Systems Science and Engineering》2023,45(3):2659-2672

Adders are vital to arithmetic operations such as multiplication, subtraction, and division. Binary number additions are performed by the digital circuit known as the adder. In VLSI (Very Large Scale Integration), the full adder is a basic component, as it plays a major role in designing integrated circuit applications. To minimize power, various adder designs have been implemented, and each implemented design has certain drawbacks. The designed adder requires high power when the driving capability is good and low power when the delay is greater. To overcome such issues and to obtain better performance, a novel parallel adder is proposed. The adder design starts at 1 bit and has been extended up to 32 bits so as to verify its scalability. The proposed novel parallel adder is derived from the carry look-ahead adder. The merits of the suggested adder are better speed, power consumption, delay, and driving capability. The designed adders are verified for different supply, delay, power and leakage conditions, and their performance is found to be superior to the competing Manchester Carry Chain Adder (MCCA), Carry Look Ahead Adder (CLAA), Carry Select Adder (CSLA), Carry Select Adder (CSA) and other adders.

14.

Design and use of DIP-1: A fast, flexible and dynamically microprogrammable pipelined image processor

F.A. Gerritsen L.G. Aardema 《Pattern recognition》1981,14(1-6):319-330

The design of a fast, flexible and dynamically microprogrammable pipelined image processor is presented. The machine is especially suited, though not completely devoted, to perform local operations (up to 16 × 16) of both logical and arithmetic character on pictures, stored in a random access image memory in a 256 level grey scale. Separate parts of the machine take care of data manipulation and address generation. The machine's functioning is illustrated by discussing the way in which arithmetic N × N neighbourhood operations and binary 3 × 3 neighbourhood operations were implemented and finally the software supporting microprogram development and debugging and the run-time support software is described.

15.

Granularity via nondeterministic computations: What we gain and what we lose

Vladik Kreinovich Bernadette Bouchon-Meunier 《International Journal of Intelligent Systems》1997,12(6):469-481

We humans usually think in words; to represent our opinion about, e.g., the size of an object, it is sufficient to pick one of the few (say, five) words used to describe size (“tiny,” “small,” “medium,” etc.). Indicating which of 5 words we have chosen takes 3 bits. However, in the modern computer representations of uncertainty, real numbers are used to represent this “fuzziness.” A real number takes 10 times more memory to store, and therefore, processing a real number takes 10 times longer than it should. Therefore, for the computers to reach the ability of a human brain, Zadeh proposed to represent and process uncertainty in the computer by storing and processing the very words that humans use, without translating them into real numbers (he called this idea granularity). If we try to define operations with words, we run into the following problem: e.g., if we define “tiny” + “tiny” as “tiny,” then we will have to make a counter-intuitive conclusion that the sum of any number of tiny objects is also tiny. If we define “tiny” + “tiny” as “small,” we may be overestimating the size. To overcome this problem, we suggest using nondeterministic (probabilistic) operations with words. For example, in the above case, “tiny” + “tiny” is, with some probability, equal to “tiny,” and with some other probability, equal to “small.” We also analyze the advantages and disadvantages of this approach: The main advantage is that we now have granularity and we can thus speed up processing uncertainty. The main disadvantage is that in some cases, when defining symmetric associative operations for the set of words, we must give up either symmetry, or associativity. Luckily, this necessity does not always arise: in some cases, we can define symmetric associative operations. © 1997 John Wiley & Sons, Inc.
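A small Python sketch can illustrate both the 3-bit count and a probabilistic "+" on words. Everything numeric here is an assumption made for illustration: the abstract names only "tiny", "small" and "medium", so the words "large" and "huge", the nominal sizes in `NOMINAL`, and the rule that picks the probability so the expected size is preserved are all hypothetical and are not taken from the paper.

```python
import random

WORDS = ["tiny", "small", "medium", "large", "huge"]
# Hypothetical nominal sizes, chosen only to make the example concrete.
NOMINAL = {"tiny": 1.0, "small": 4.0, "medium": 16.0, "large": 64.0, "huge": 256.0}


def bits_needed(n_words):
    """Bits required to indicate one of n_words choices."""
    return max(1, (n_words - 1).bit_length())


def add_words(a, b, rng=random):
    """Nondeterministic sum of two size words.

    The exact sum of the nominal sizes usually falls between two words.
    The larger word is returned with a probability that keeps the expected
    nominal size equal to the exact sum (an assumed rule).
    """
    total = NOMINAL[a] + NOMINAL[b]
    if total >= NOMINAL[WORDS[-1]]:
        return WORDS[-1]  # saturate at the largest word
    for lo, hi in zip(WORDS, WORDS[1:]):
        lo_v, hi_v = NOMINAL[lo], NOMINAL[hi]
        if lo_v <= total < hi_v:
            p_hi = (total - lo_v) / (hi_v - lo_v)
            return hi if rng.random() < p_hi else lo
    return WORDS[0]


print(bits_needed(len(WORDS)))  # 3
rng = random.Random(0)
print([add_words("tiny", "tiny", rng) for _ in range(8)])
```

With these assumed sizes, "tiny" + "tiny" yields "small" with probability 1/3 and "tiny" otherwise, so repeated sums of tiny objects eventually grow, which is the behaviour the abstract argues a deterministic table cannot give.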

16.

Hardware Support for Interval Arithmetic (total citations: 1; self-citations: 0; citations by others: 1)

Reinhard Kirchner Ulrich W. Kulisch 《Reliable Computing》2006,12(3):225-237

A hardware unit for interval arithmetic (including division by an interval that contains zero) is described in this paper. After a brief introduction an instruction set for interval arithmetic is defined which is attractive from the mathematical point of view. These instructions consist of the basic arithmetic operations and comparisons for intervals including the relevant lattice operations. To enable high speed, the case selections for interval multiplication (9 cases) and interval division (14 cases) are done in hardware. The lower bound of the result is computed with rounding downwards and the upper bound with rounding upwards by parallel units simultaneously. The rounding mode must be an integral part of the arithmetic operation. Also the basic comparisons for intervals together with the corresponding lattice operations and the result selection in more complicated cases of multiplication and division are done in hardware. There they are executed by parallel units simultaneously. The circuits described in this paper show that with modest additional hardware costs interval arithmetic can be made almost as fast as simple floating-point arithmetic.
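The nine-case selection for interval multiplication mentioned above can be written out as a short Python sketch. It follows the standard sign-based case analysis and is not taken from the paper's circuits. The function name `interval_mul` is hypothetical, intervals are `(lower, upper)` tuples, and directed rounding is deliberately not modeled, so the bounds are only guaranteed when the products are exact (as with integers).

```python
def interval_mul(a, b):
    """Multiply intervals a = (a1, a2) and b = (b1, b2) by sign-case selection.

    Each operand is classified as non-negative, non-positive, or containing
    zero in its interior, giving 3 x 3 = 9 cases. Only the last case needs
    more than two products. In hardware the lower bound would be rounded
    downwards and the upper bound upwards; that step is omitted here.
    """
    a1, a2 = a
    b1, b2 = b
    if a1 >= 0:                      # a is non-negative
        if b1 >= 0:
            return (a1 * b1, a2 * b2)
        if b2 <= 0:
            return (a2 * b1, a1 * b2)
        return (a2 * b1, a2 * b2)
    if a2 <= 0:                      # a is non-positive
        if b1 >= 0:
            return (a1 * b2, a2 * b1)
        if b2 <= 0:
            return (a2 * b2, a1 * b1)
        return (a1 * b2, a1 * b1)
    # a contains zero in its interior
    if b1 >= 0:
        return (a1 * b2, a2 * b2)
    if b2 <= 0:
        return (a2 * b1, a1 * b1)
    return (min(a1 * b2, a2 * b1), max(a1 * b1, a2 * b2))


print(interval_mul((-1, 2), (3, 4)))   # (-4, 8)
print(interval_mul((-2, 3), (-5, 1)))  # (-15, 10)
```

The alternative of always taking the minimum and maximum of all four endpoint products gives the same result but costs four multiplications every time, which is why selecting the case first is attractive for a hardware unit.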

17.

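Design and implementation of a high-performance subword-parallel multiplier

Huang Libo, Yue Hong, Lu Hongyi, Dai Kui 《Computer Engineering and Applications》2007,43(20):104-106

This paper proposes a multiplier architecture that supports subword parallelism and completes its VLSI design and implementation. Building on a 16-bit array subword-parallel structure, the multiplier adds mixed signed/unsigned operations, uses a multi-cycle merging technique to achieve 32-bit-wide subword parallelism, and supports subword-mode multiply-accumulate. With a pipelined design, it can complete four 8×8, two 16×16, or one 32×16 signed/unsigned multiplications in a single cycle. An implementation in a 0.18 μm standard-cell library shows that the multiplier both reduces area and raises clock frequency, offering a good trade-off between hardware cost and arithmetic performance and making it well suited to multimedia microprocessor design.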

18.

Generalization of computing systems: the architecture and organization

ROLAND YII 《International journal of systems science》2013,44(9):877-888

Modern computers are primarily binary. Proposed in this paper is a new generalized computing system, by which a number system of any arbitrary radix can be manipulated directly without encoders and decoders. The feature of being able to vary the radix is achieved by pulse counting with alterable carry conditions. The architecture and the organization of the new system are also drastically different from those of existing computers. Instead of the adder being the basic functional structure, the heart of the system consists essentially of an adder, subtractor, multiplier and an accumulator integrated into one inseparable unit. Introduced in this paper is a unified system design suitable for all computing systems and also the concept of a universal building block which can be shared both by the arithmetic unit and the memory. As a memory element, the universal building block is capable of storing a digit of any given radix rather than just binary. Instead of communicating between the binary registers and the constituents of the memory or the arithmetic unit as in existing computers, variable pulse train generators and counters are made to communicate with each other. Binary logic is adopted to channel the pulses or the instructions instead of performing the arithmetic operations. A modern computer, however complex, thus becomes a special case of the generalized computing system.

19.

Parallel vector reduction algorithms and architectures

《Journal of Parallel and Distributed Computing》1988,5(2):103-130

Vector operations are important in many computer applications. They often represent the main part of operations of the entire problem and consume a great amount of computing time. So, it is natural to apply parallel computation to vector operations in order to increase the speed of solving a problem. Among vector operations, vector reduction is a known and common type of operation (e.g., vector summation, inner product evaluation). In this paper vector reduction techniques for parallel pipelined processing are discussed. The computation and communication properties and constraints of both single and multiple vector reductions in a multipipeline environment are considered. From this a simple, yet efficient “partitioned linear pipeline array” (PLPA) architecture is proposed and the performance of a number of scheduling algorithms related to this architecture is determined. The performance comparison between the proposed approach and the well-known tree-structured reduction processor is given. From the results of performance analysis, it is shown that the PLPA approach has approximately the same performance as a pipelined binary reduction tree. However, the PLPA approach is much simpler and easier to implement, and is also more flexible than a tree-structured reduction processor. Finally, as an example, the matrix multiplication operation on a PLPA is considered. It is shown that with the PLPA architecture a very good performance can be obtained.
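As a point of reference for the comparison in this abstract, the following Python sketch models the binary reduction tree that the PLPA is measured against. It is a sequential software model, not the PLPA architecture or any of its scheduling algorithms, which the abstract does not describe in detail. The function name `tree_reduce` is hypothetical; it returns the reduced value together with the number of tree levels used.

```python
import operator


def tree_reduce(values, op):
    """Reduce a vector with a binary tree of `op` applications.

    All pairs on one level are independent of each other, so a parallel
    machine could evaluate a whole level at once. The returned depth is
    the number of such levels, ceil(log2(n)) for n values.
    """
    level = list(values)
    if not level:
        raise ValueError("cannot reduce an empty vector")
    depth = 0
    while len(level) > 1:
        # Combine neighbours pairwise; an odd element passes through unchanged.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, 1, 2, 2, 2, 2]
print(tree_reduce(xs, operator.add))                             # (36, 3)
print(tree_reduce([x * y for x, y in zip(xs, ys)], operator.add))  # (62, 3)
```

The second call is an inner product: the element-wise products form the leaves and the tree sums them in three levels instead of seven sequential additions.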

20.

A Fast Domain Decomposition High Order Poisson Solver

Bertil Gustafsson Lina Hemmingsson-Frändén 《Journal of scientific computing》1999,14(3):223-243

We present a fast high-order Poisson solver for implementation on parallel computers. The method uses deferred correction, such that high-order accuracy is obtained by solving a sequence of systems with a narrow stencil on the left-hand side. These systems are solved by a domain decomposition method. The method is direct in the sense that for any given order of accuracy, the number of arithmetic operations is fixed. Numerical experiments show that these high-order solvers easily outperform standard second-order ones. The very fast algorithm in combination with the coarser grid allowed for by the high-order method, also makes it quite possible to compete with adaptive methods and irregular grids for problems with solutions containing widely different scales.