Similar literature
 20 similar documents found; search took 46 ms
1.

Modular multiplication is one of the most time-consuming operations in elliptic curve cryptography, accounting for almost 80% of the computational overhead of a scalar multiplication. In this paper, we present a new speed record for modular multiplication over the 192-bit NIST prime P-192 on 8-bit AVR ATmega microcontrollers. We propose a new integer representation, named Range Shifted Representation (RSR), which enables the reduction operation to be merged efficiently into the subtractive Karatsuba multiplication. This merging substantially optimizes the intermediate accumulation of modular multiplication by removing a significant amount of unnecessary memory access as well as a number of addition operations. Our merged modular multiplication on RSR is designed to have two duplicated groups of 96-bit intermediate values during accumulation. Hence, the group needs to be accumulated only once and the result can be used twice. Consequently, we significantly reduce the number of load/store instructions, which are known to be among the most time-consuming operations for modular multiplication on constrained devices. Our implementation requires only 2888 cycles for the modular multiplication of 192-bit integers and outperforms the previous best result for modular multiplication over P-192 by 17%. In addition, our modular multiplication is even faster than the Karatsuba multiplication (without reduction) that held the speed record for multiplication on AVR processors.
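For context on the reduction step that the abstract merges into the multiplication, the following is a minimal sketch of the conventional word-folding reduction modulo the NIST prime p = 2^192 - 2^64 - 1. It is not the paper's RSR method or its AVR implementation; the names `reduce_p192` and `mulmod_p192` are illustrative assumptions.

```python
# Sketch: conventional fast reduction modulo NIST P-192.
# This is NOT the paper's RSR technique; it only shows the congruences
# that any P-192 reduction relies on.
P192 = 2**192 - 2**64 - 1
MASK64 = (1 << 64) - 1

def reduce_p192(c):
    """Reduce 0 <= c < 2**384 modulo P192 by folding the high 64-bit words."""
    w = [(c >> (64 * i)) & MASK64 for i in range(6)]
    low = w[0] + (w[1] << 64) + (w[2] << 128)
    # 2^192 is congruent to 2^64 + 1 (mod p)
    s1 = w[3] * ((1 << 64) + 1)
    # 2^256 is congruent to 2^128 + 2^64 (mod p)
    s2 = w[4] * ((1 << 128) + (1 << 64))
    # 2^320 is congruent to 2^128 + 2^64 + 1 (mod p)
    s3 = w[5] * ((1 << 128) + (1 << 64) + 1)
    r = low + s1 + s2 + s3
    # r is only a few multiples of p too large, so a short loop suffices.
    while r >= P192:
        r -= P192
    return r

def mulmod_p192(a, b):
    """Multiply two residues and reduce the 384-bit product."""
    return reduce_p192(a * b)
```

Because the folding uses only additions of shifted words, a reduction of this shape is a natural candidate for interleaving with the accumulation phase of a multiplication, which is the kind of merging the abstract describes.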

2.
Tremblay M., O'Connor J.M. 《Micro, IEEE》1996,16(2):42-50
UltraSPARC I is a second-generation superscalar processor. It is a high-performance, highly integrated, four-issue superscalar processor based on the SPARC Version 9 64-bit RISC architecture. We have extended the core instruction set to include graphics instructions that provide the most common operations related to two-dimensional image processing; two- and three-dimensional graphics and image compression algorithms; and parallel operations on pixel data with 8-, 16-, and 32-bit components. Additionally, new memory access instructions support the very high bandwidth requirements typical of graphics and multimedia applications.

3.
Pseudorandom number generators are required for many computational tasks, such as stochastic modelling and simulation. This paper investigates the serial and parallel implementation of a Linear Congruential Generator for Graphics Processing Units (GPUs) based on the binary representation of the normal number $\alpha_{2,3}$. We adapted two methods of modular reduction that allowed us to perform most operations in 64-bit integer arithmetic, improving on the original implementation based on 106-bit double-double operations and yielding a four-fold increase in efficiency. We found that our implementation is faster than existing methods in the literature, and that our generation rate is close to the limiting rate imposed by the efficiency of writing to a GPU's global memory.
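A parallel congruential generator typically gives each worker its own starting state by jumping ahead in the sequence. The sketch below illustrates that idea for a multiplicative congruential generator. The multiplier and modulus are the classic Park-Miller "minimal standard" values, chosen here only as an assumption for illustration; they are not the generator studied in the paper, and the function names are hypothetical.

```python
# Sketch: multiplicative congruential generator with jump-ahead.
# A and M are the Park-Miller "minimal standard" parameters, used only
# for illustration; they are NOT the paper's alpha_{2,3}-based generator.
A = 16807
M = 2**31 - 1

def lcg_next(x):
    """Advance the state by one step: x -> A * x mod M."""
    return (A * x) % M

def lcg_jump(x, k):
    """State after k steps, computed in O(log k) via modular exponentiation."""
    return (pow(A, k, M) * x) % M

def parallel_streams(seed, n_streams, block):
    """Starting state for each of n_streams consecutive blocks of `block` values."""
    return [lcg_jump(seed, i * block) for i in range(n_streams)]
```

Each stream can then call `lcg_next` independently, and the concatenation of the blocks reproduces the serial sequence.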

4.
董冕  吴丹  饶金理  黄威  戴葵  邹雪城 《计算机工程》2012,38(16):249-252
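A set of high-performance subword-parallel arithmetic units is implemented through hardware sharing. The units are pipelined and can perform, in one cycle, one 64-bit, two 32-bit, four 16-bit, or eight 8-bit fixed-point operations, or one double-precision or two single-precision floating-point operations. The units are designed in Verilog HDL, implemented with a 0.18 μm standard CMOS process library, and evaluated on real multimedia applications using the ESCA system. Experimental results show that the units achieve a good balance between hardware overhead and performance.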

5.
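Ramtron's VRS51L3074 microcontroller has an enhanced arithmetic unit that supports 16-bit multiplication and division, multiply-accumulate, and shift operations. This paper analyzes the unit's features and key usage points, and presents two practical algorithms implemented with it: the square root of a 32-bit signed integer and the conversion of a 16-bit binary number to BCD. Practice shows that the unit effectively improves the efficiency of complex arithmetic on the VRS51L3074.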
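The two algorithms named in the abstract can be sketched in their textbook forms: a shift-and-subtract integer square root and the "double dabble" binary-to-BCD conversion. This is a minimal illustration under the assumption that plain integer operations stand in for the microcontroller's arithmetic unit; it does not model the VRS51L3074's registers or instructions, and the function names are hypothetical.

```python
# Sketch: textbook versions of the two algorithms named in the abstract.
# The VRS51L3074 arithmetic unit itself is not modeled here.

def isqrt32(n):
    """Floor of the square root of a non-negative 32-bit signed integer."""
    if n < 0:
        raise ValueError("square root of a negative integer")
    root = 0
    bit = 1 << 30              # highest power of four within 31 bits
    while bit > n:
        bit >>= 2
    while bit:
        if n >= root + bit:
            n -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root

def bin16_to_bcd(value):
    """Convert a 16-bit unsigned value to packed BCD (up to five digits)."""
    bcd = 0
    for i in range(15, -1, -1):
        # Before each shift, add 3 to every BCD digit that is 5 or more.
        for d in range(5):
            if ((bcd >> (4 * d)) & 0xF) >= 5:
                bcd += 3 << (4 * d)
        bcd = (bcd << 1) | ((value >> i) & 1)
    return bcd
```

For example, `bin16_to_bcd(1234)` yields the packed value `0x1234`, one decimal digit per nibble.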

6.
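To improve multimedia data processing capability, high-performance DSPs have widely adopted SIMD technology, and the multiplier, an important DSP component, must support it as well. This paper studies the implementation of SIMD multipliers in depth and proposes a new SIMD multiplier architecture. It uses two 16×8 multipliers and applies sign extension and concatenation to their operands and results, implementing a 16-bit FT-SIMD multiplier simply and efficiently. The architecture can also be extended to 32-bit and 64-bit SIMD multipliers.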

7.
The floating-point number is the most commonly used real-number representation for digital computation because of its high precision. It is used on computers and in single-chip applications such as DSP chips. Double-precision (64-bit) representations allow a wider range of real numbers to be denoted, but single-precision (32-bit) operations are more efficient. Recently, there has been increasing interest in mixed-precision computations, which take advantage of single-precision efficiency on 64-bit numbers. This calls for the ability to convert between the two formats. In this paper, an algorithm that converts floating-point numbers from 64-bit to 32-bit representation is presented. The algorithm was implemented in Verilog and tested on a field-programmable gate array (FPGA) using the Quartus II DE2 board and an Agilent 16821A portable logic analyzer. Results indicate that the algorithm performs the conversion reliably and accurately within a constant execution time of 25 ns at a 20 MHz clock frequency, regardless of the number being converted.
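The conversion described above amounts to re-biasing the exponent and rounding the 52-bit fraction to 23 bits. The sketch below shows one way to do this at the bit level for IEEE 754 values. It is an illustration under stated assumptions, not the paper's Verilog algorithm: it rounds to nearest-even, flushes results below the single-precision normal range to zero as a simplification, and uses the hypothetical name `f64_to_f32_bits`.

```python
import struct

def f64_to_f32_bits(x):
    """Return the 32-bit IEEE 754 pattern for the double x.

    Simplifications (assumptions of this sketch): round-to-nearest-even,
    and values below the float32 normal range are flushed to signed zero.
    """
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    if exp == 0x7FF:                       # infinity or NaN
        return (sign << 31) | (0xFF << 23) | (0x400000 if frac else 0)
    e32 = exp - 1023 + 127                 # re-bias the exponent
    if exp == 0 or e32 <= 0:               # zero, subnormal, or underflow
        return sign << 31
    m = frac >> 29                         # keep the top 23 fraction bits
    rem = frac & ((1 << 29) - 1)           # the 29 discarded bits
    half = 1 << 28
    if rem > half or (rem == half and (m & 1)):
        m += 1
        if m == 1 << 23:                   # fraction overflowed into exponent
            m = 0
            e32 += 1
    if e32 >= 0xFF:                        # overflow to infinity
        return (sign << 31) | (0xFF << 23)
    return (sign << 31) | (e32 << 23) | m
```

For values inside the float32 normal range, the result can be compared with the pattern produced by `struct.pack(">f", x)`.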

8.
The Advanced Encryption Standard (AES) has replaced its predecessor, the Data Encryption Standard (DES), as the most widely used encryption algorithm in many security applications. To date, the AES standard has key-size variants of 128, 192, and 256 bits, where longer keys provide more secure ciphertext. From a hardware perspective, a bigger key size also means larger area and power consumption because more operations must be performed. Some companies that employ ultra-high security in their systems may look for a key size bigger than 256-bit AES. In this paper, 128-bit and 256-bit AES hardware, as well as two variants of an AES encryption algorithm for 512-bit and 1024-bit key sizes, are implemented and compared in terms of power consumption and area. The experiment is done in 45 nm CMOS technology at 1.1 V using Synopsys DC Compiler and ModelSim, and total power consumption and area results are presented and graphically compared.

9.
A new secret sharing scheme capable of protecting image data coded with B bits per pixel is introduced and analyzed in this paper. The proposed input-agnostic encryption solution generates B-bit shares by combining bit-level decomposition/stacking with a {k,n}-threshold sharing strategy. Perfect reconstruction is achieved by performing decryption through simple logical operations in the decomposed bit-levels without the need for any postprocessing operations. The framework allows for cost-effective cryptographic image processing of B-bit images over the Internet.
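To illustrate how B-bit shares can be recombined with simple logical operations, here is a minimal sketch of the simplest threshold case, n-of-n XOR sharing. This is an assumption chosen for illustration: it is not the paper's {k,n} construction or its bit-level decomposition and stacking, and the function names are hypothetical.

```python
import secrets

# Sketch: n-of-n XOR secret sharing of B-bit pixel values.
# This is NOT the paper's {k,n}-threshold scheme; it only shows that
# B-bit shares can be recombined exactly with logical operations.

def make_shares(pixels, n, bits):
    """Split each B-bit pixel into n B-bit shares whose XOR is the pixel."""
    shares = [[secrets.randbits(bits) for _ in pixels] for _ in range(n - 1)]
    last = []
    for i, p in enumerate(pixels):
        acc = p
        for s in shares:
            acc ^= s[i]
        last.append(acc)
    shares.append(last)
    return shares

def reconstruct(shares):
    """XOR all shares together, pixel by pixel, to recover the image."""
    out = [0] * len(shares[0])
    for s in shares:
        out = [a ^ b for a, b in zip(out, s)]
    return out
```

Every share has the same bit depth as the input, and reconstruction is exact with no postprocessing, which matches the properties the abstract highlights.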

10.
We propose an ultra-lightweight, compact, and low power block cipher BORON. BORON is a substitution and permutation based network, which operates on a 64-bit plain text and supports a key length of 128/80 bits. BORON has a compact structure which requires 1939 gate equivalents (GEs) for a 128-bit key and 1626 GEs for an 80-bit key. The BORON cipher includes shift operators, round permutation layers, and XOR operations. Its unique design helps generate a large number of active S-boxes in fewer rounds, which thwarts the linear and differential attacks on the cipher. BORON shows good performance on both hardware and software platforms. BORON consumes less power as compared to the lightweight cipher LED and it has a higher throughput as compared to other existing SP network ciphers. We also present the security analysis of BORON and its performance as an ultra-lightweight compact cipher. BORON is a well-suited cipher design for applications where both a small footprint area and low power dissipation play a crucial role.

11.
Algorithmic cooling is a potentially important technique for making scalable NMR quantum computation feasible in practice. Given the constraints imposed by this approach to quantum computing, the most likely cooling algorithms to be practicable are those based on simple reversible polarization compression (RPC) operations acting locally on small numbers of bits. Several different algorithms using 2- and 3-bit RPC operations have appeared in the literature, and these are the algorithms I consider in this note. Specifically, I show that the RPC operation used in all these algorithms is essentially a majority vote of 3 bits, and prove the optimality of the best such algorithm. I go on to derive some theoretical bounds on the performance of these algorithms under some specific assumptions about errors.
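The effect of a 3-bit majority vote on polarization can be checked by direct enumeration. The sketch below computes the exact bias of the majority of three independent bits that each have bias ε, which works out to (3ε - ε³)/2. It is a classical probability illustration under the assumption of independent, identically biased bits; it does not model the reversible circuit or the error assumptions analyzed in the note, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def majority(a, b, c):
    """Majority vote of three bits."""
    return (a & b) | (a & c) | (b & c)

def bias_after_majority(eps):
    """Exact bias of the majority of three independent bits with bias eps.

    Bias is defined here as P(bit = 0) - P(bit = 1).
    """
    p0 = (1 + eps) / 2
    prob0 = Fraction(0)
    for bits in product((0, 1), repeat=3):
        pr = Fraction(1)
        for b in bits:
            pr *= p0 if b == 0 else 1 - p0
        if majority(*bits) == 0:
            prob0 += pr
    return 2 * prob0 - 1
```

For small ε the output bias is close to 1.5ε, which is why a single compression step raises polarization by roughly a factor of 3/2.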

12.
Sakamura K. 《Micro, IEEE》1987,7(2):17-31
The TRON microprocessor has an open architecture; it is being implemented by several manufacturers. It will be expandable to 64-bit operations, and its design reflects the needs of a family of application-specific operating system kernels.

13.
The complexity involved in mapping an algorithm to hardware is a function of the controller logic and the data path. Minimizing data path size can lead to significant savings in hardware area and power dissipation. This paper presents an implementation of a novel architectural transformation technique for mapping a word-wide algorithm to a byte-vector serial architecture. The technique divides the input word into several bytes and then traces each byte to extract the architectural transformation. The technique is applied to the Advanced Encryption Standard (AES) algorithm, which is non-linear in nature. Using this technique, the 32-bit AES algorithm is transformed into a byte-systolic architecture. The novelty of the technique is most pronounced in the MixColumns design, which is the most complex part of the AES algorithm. The complex matrix multiplication component and standard transformations of the 32-bit AES algorithm are transformed to support 8-bit operations. The resulting AES architectures reuse the same logic resources for key expansion and encryption/decryption. The proposed design offers moderate data rates of about 41 Mbps for encryption and 37 Mbps for decryption while utilizing 236 and 280 slices, respectively, on a Xilinx Virtex II xc2v1000-6 FPGA. Comparison results show a significant gain in throughput over other 8-bit designs. This makes it a viable data/communication security solution for a variety of embedded and consumer electronics.
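As an illustration of why the MixColumns step maps onto 8-bit operations, the sketch below computes one AES MixColumns column using only byte-wide XORs and the GF(2^8) doubling operation. This is the standard software formulation, shown under the assumption that it conveys the general idea; it is not the paper's byte-systolic hardware design, and the function names are hypothetical.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo the AES polynomial 0x11B."""
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def mix_column(col):
    """Apply AES MixColumns to one 4-byte column using only 8-bit operations."""
    a0, a1, a2, a3 = col
    t = a0 ^ a1 ^ a2 ^ a3
    # Each output byte is 2*a_i ^ 3*a_(i+1) ^ a_(i+2) ^ a_(i+3),
    # rewritten as a_i ^ t ^ xtime(a_i ^ a_(i+1)).
    return [
        a0 ^ t ^ xtime(a0 ^ a1),
        a1 ^ t ^ xtime(a1 ^ a2),
        a2 ^ t ^ xtime(a2 ^ a3),
        a3 ^ t ^ xtime(a3 ^ a0),
    ]
```

Each output byte needs one doubling and a handful of XORs, so the 32-bit column operation decomposes into four byte-serial steps that can share the same logic.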

14.
The calculation of the motion observed between two images of the same scene is required for many applications such as video compression, panoramic stitching and optic flow algorithms for vehicle navigation. The particular application that we focus on in this paper is the need for small light-weight vehicles, such as unmanned ground or air vehicles, to sense their own motion for use in autonomous navigation algorithms. As the processing is ideally performed on board these vehicles, there are severe restrictions on the processing environment available to perform the optic flow calculations. This has led to the development of FPGA solutions to calculate optic flow. However, the most recent approaches still have extensive on-board memory requirements and make use of complex processing operations such as multiplication and matrix inversion. We present an FPGA implementation of a low-complexity version of the Lucas–Kanade registration algorithm. This algorithm operates on one-bit images instead of the standard eight-bit approach; consequently, it can use simple logic operations such as exclusive-or rather than multiplications, and it makes very efficient use of the available internal memory and resources.

15.
The parallel computation model upon which the proposed algorithms are based is the hyper-bus broadcast network, which consists of processors connected by global buses only. Based on this improved architecture, we first design two O(1)-time basic operations: one for finding the maximum and minimum of N numbers, each O(log N) bits in size, and one for computing the product of two N×N matrices. Then, based on these two basic operations, three of the most important instances of the algebraic path problem, the connectivity problem, and several related problems are all solved in O(log N) time. These include the all-pair shortest paths, the minimum-weight spanning tree, the transitive closure, the connected component, the biconnected component, the articulation point, and the bridge problems, in either an undirected or a directed graph.

16.
Papamichalis P., Simar R. Jr. 《Micro, IEEE》1988,8(6):13-29
The 320C30 is a fast processor with a large memory space and floating-point-arithmetic capabilities. The authors describe the 320C30 architecture in detail, discussing both the internal organization of the device and the external interfaces. They also explain the pipeline structure, addressing software-related issues and constructs, and examine the development tools and support. Finally, they present examples of applications. Some of the major features of the 320C30 are: a 60-ns cycle time that results in execution of over 16 million instructions per second (MIPS) and over 33 million floating-point operations per second (Mflops); 32-bit data buses and 24-bit address buses for a 16M-word overall memory space; dual-access, 4 K×32-bit on-chip ROM and 2 K×32-bit on-chip RAM; a 64×32-bit program cache; a 32-bit integer/40-bit floating-point multiplier and ALU; eight extended-precision registers, eight auxiliary registers, and 23 control and status registers; generally single-cycle instructions; integer, floating-point, and logical operations; two- and three-operand instructions; an on-chip DMA controller; and fabrication in 1-μm CMOS technology and packaging in a 180-pin package. These features facilitate FIR (finite impulse response) and IIR (infinite impulse response) filtering, telecommunications and speech applications, and graphics and image processing applications.

17.
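Integrating SIMD units has become one of the important ways to improve processor performance. Although low-cost design techniques based on hardware reuse are fairly mature for fixed-point SIMD units, most floating-point SIMD hardware designs still rely on simple hardware replication. Targeting the growing computational demand for 128-bit high-precision floating-point operations, this paper proposes a corresponding low-cost SIMD hardware architecture. Synthesis results show that the proposed SIMD floating-point multiply-add unit has better performance and area figures than a conventional 128-bit high-precision floating-point multiply-add unit.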

18.
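Based on arithmetic additive test generation, this paper proposes a self-test scheme for adders in VLSI in which the adder generates all of the test vectors it needs. The test vectors are improved by optimizing their initial values, which raises their fault detection and localization capability. The adder self-test is designed using left shifts of the test vectors, logical AND operations, and similar means. Experimental results on 8-, 16-, and 32-bit ripple-carry and carry-lookahead adders show that the self-test achieves complete testing of single and double stuck-at faults, with single- and double-fault localization rates above 95.570% and 72.656%, respectively. The scheme supports at-speed testing without degrading the circuit's original performance, and its test time is independent of the adder length.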

19.
Multi-operand associative techniques attain their full power in algorithms where the data may be recast into disjoint data sets, all acted upon concurrently, each by a different operand common to the set. But the multi-operand approach can also serve to enhance arithmetic operations significantly. The speed-up of associative multiplication by handling a number of multiplier bits at a time is described and analyzed, including an effective algorithm for limited sum of products. The most complex process treated is convolution, which serves to illustrate the enhancement of an extended sum of products. Any number of vectors stored in memory can be convolved simultaneously by a common filter vector. Execution time is 45 msec for 1024 element data and filter vectors, 2048 element results, and 16-bit precision.

20.
The Internet of Things (IoT) and cyber-physical systems (CPS) have grown exponentially in recent years, motivating the development and deployment of low-resource devices for a wide range of IoT applications. Many such resource-constrained devices are deployed to match the heterogeneous application requirements of IoT and CPS systems, where privacy and security have emerged as the most difficult challenges because constrained devices are not designed with security features. This paper presents a lightweight cipher based on ARX (addition modulo, rotation, and XOR) operations and a Feistel structure; it amalgamates the BRIGHT and SIMON structures, hence the name BRISI. The cipher encrypts 32-bit plaintext using a 64-bit key. The software implementation is performed in MATLAB and satisfies the avalanche, key-sensitivity, correlation-coefficient, entropy, and histogram criteria. The proposed design is simulated using Xilinx Vivado, implemented on Nexys-4 DDR Artix-7 and Basys-3 Artix-7 FPGAs, and evaluated for resource usage (LUTs and registers), power, and timing.
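To show how ARX operations fit inside a Feistel structure with a 32-bit block and a 64-bit key, here is a toy sketch. It is explicitly not BRISI: the round function, rotation amounts, round count, and key schedule are all hypothetical choices made for illustration, and the construction has no security claim.

```python
# Sketch: a toy ARX Feistel network (32-bit block, 64-bit key).
# NOT the BRISI cipher; every constant and the key schedule are
# hypothetical, and this toy offers no security.
MASK16 = 0xFFFF

def rotl16(x, r):
    """Rotate a 16-bit value left by r bits (0 < r < 16)."""
    return ((x << r) | (x >> (16 - r))) & MASK16

def round_fn(x, k):
    """Toy ARX round function: modular addition, rotation, XOR."""
    return rotl16((x + k) & MASK16, 3) ^ rotl16(x, 7)

def round_keys(key64):
    """Hypothetical key schedule: the four 16-bit key words, used twice."""
    words = [(key64 >> (16 * i)) & MASK16 for i in range(4)]
    return words * 2

def encrypt(block32, key64):
    left, right = block32 >> 16, block32 & MASK16
    for k in round_keys(key64):
        left, right = right, left ^ round_fn(right, k)
    return (left << 16) | right

def decrypt(block32, key64):
    left, right = block32 >> 16, block32 & MASK16
    for k in reversed(round_keys(key64)):
        left, right = right ^ round_fn(left, k), left
    return (left << 16) | right
```

The Feistel structure makes decryption work regardless of whether the round function is invertible, which is why the same ARX logic can serve both directions.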
