1.

王堃 夏宏 《移动信息》2023,45(6):245-249

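To meet the growing throughput demands that information security places on encrypted network data, this paper builds on SM4, the first commercial cipher algorithm designed independently in China, and implements an SM4 encryption/decryption unit with direct memory access inside an open-source RISC-V processor. The RISC-V instruction set is extended with instructions that invoke the SM4 unit directly. The approach implements SM4 encryption and decryption in hardware and also reduces how often load and store instructions are needed to access memory during encryption and decryption, which substantially increases encryption speed. To resolve conflicts between CPU memory accesses and SM4-unit memory accesses, the design uses a pipeline interlock scheme, and it was verified by simulation in Modelsim. At a clock frequency of 300 MHz, encrypting or decrypting 4 kB of data takes 10,500 clock cycles, for a throughput of 914.28 Mbit/s.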
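
The reported throughput can be checked against the other figures in the abstract. The short Python sketch below is an illustration, not code from the paper; the function name is hypothetical, and it assumes that "4 kB" means 4,000 bytes, since that reading reproduces the stated figure (4,096 bytes would give a slightly higher number).

```python
def throughput_mbit_per_s(data_bytes: int, cycles: int, clock_hz: float) -> float:
    """Throughput in Mbit/s when data_bytes are processed in `cycles` clock cycles."""
    seconds = cycles / clock_hz          # 10,500 cycles at 300 MHz is 35 microseconds
    return data_bytes * 8 / seconds / 1e6


# Assumption: 4 kB read as 4,000 bytes (decimal kilobytes).
decimal_kb = throughput_mbit_per_s(4000, 10_500, 300e6)
# Alternative reading: 4 KiB = 4,096 bytes.
binary_kb = throughput_mbit_per_s(4096, 10_500, 300e6)

print(f"4000 bytes: {decimal_kb:.2f} Mbit/s")
print(f"4096 bytes: {binary_kb:.2f} Mbit/s")
```

Under the decimal reading the result is about 914.29 Mbit/s, which matches the abstract's 914.28 Mbit/s if that value was truncated rather than rounded.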

2.

High-Performance FPGA Implementation of the SM3 Algorithm

王汉宁 孙浩 邓辰辰 杨锦江 《微电子学与计算机》2023,(7):105-110

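Existing high-performance implementations of SM3 mainly rely on multi-stage pipelines and various critical-path optimizations to raise throughput, but multi-stage pipelined designs consume a large amount of hardware resources. This paper first exploits the parallelism available to SM3 on FPGA platforms: adding a small number of registers reduces the logic depth of the critical path, and running message expansion in parallel with the compression function achieves a single-core throughput of 2.55 Gbit/s using only 1,211 LUTs of logic. Compared with existing schemes, throughput per unit of logic resource improves by a factor of 5.40, with smaller area, lower power, and higher performance. Finally, a 32-core SM3 hardware design built on this structure achieves higher throughput than an existing 64-stage pipelined design at lower hardware cost, improving throughput per unit of logic resource by a factor of 2.27.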
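
For readers unfamiliar with the message expansion step mentioned above, the following Python sketch shows the expansion as defined in the SM3 standard. It is a software illustration only and says nothing about the paper's circuit; the helper names are hypothetical. Round j of the compression function consumes W[j] and W'[j], and W'[j] depends on W[j+4], so expansion only needs to stay a few words ahead of compression, which is what makes running the two side by side plausible in hardware.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def p1(x: int) -> int:
    """Permutation P1 used by the SM3 message expansion."""
    return x ^ rotl32(x, 15) ^ rotl32(x, 23)


def expand_block(block: bytes):
    """Expand one 64-byte block into 68 words W and 64 words W'."""
    if len(block) != 64:
        raise ValueError("SM3 processes 64-byte blocks")
    # The first 16 words are the block itself, read big-endian.
    w = [int.from_bytes(block[4 * i:4 * i + 4], "big") for i in range(16)]
    for j in range(16, 68):
        w.append(
            p1(w[j - 16] ^ w[j - 9] ^ rotl32(w[j - 3], 15))
            ^ rotl32(w[j - 13], 7)
            ^ w[j - 6]
        )
    # W'[j] is the XOR of two expanded words four positions apart.
    w_prime = [w[j] ^ w[j + 4] for j in range(64)]
    return w, w_prime


w, w_prime = expand_block((1).to_bytes(4, "big") + bytes(60))
print(len(w), len(w_prime), hex(w[16]))
```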

3.

A Pre-Execution Mechanism for In-Order Processors Based on Value Prediction and Instruction Reuse

党向磊 王箫音 佟冬 陆俊林 易江芳 王克义 《电子学报》2011,39(12):2880-2883

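To improve the performance and energy efficiency of in-order processors, this paper proposes a pre-execution mechanism based on value prediction and instruction reuse (PVPIR). Unlike conventional pre-execution methods, PVPIR predicts the data read by a missing Load instruction during pre-execution and uses the predicted value to execute the subsequent instructions that depend on that Load, so that memory accesses for the long-latency cache misses among them are issued early, improving processor performance. After leaving pre-execution, PVPIR reuses valid pre-execution results to avoid re-executing instructions that already completed correctly, reducing the energy overhead of pre-execution. PVPIR implements a value predictor that combines stride prediction with AVD (Address-Value Delta) prediction and records information only for Load instructions that have incurred long-latency cache misses, achieving good value-prediction accuracy at a small hardware cost. Experimental results show that, compared with Runahead-AVD and iEA, PVPIR improves performance by 7.5% and 9.2%, reduces energy consumption by 11.3% and 4.9%, and thereby improves energy efficiency by 17.5% and 12.9%, respectively.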
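
To make the stride half of such a value predictor concrete, here is a minimal Python sketch. It is a generic textbook-style stride predictor written for illustration, not PVPIR's predictor: the class name, table layout, and confidence rule are all assumptions, and the AVD component is not modeled. Following the abstract, a caller would train it only on Loads that have incurred long-latency cache misses.

```python
class StridePredictor:
    """Per-PC stride value predictor (illustrative; not the paper's design)."""

    def __init__(self):
        # Maps load PC -> (last_value, stride, confident)
        self.table = {}

    def predict(self, pc):
        """Return the predicted load value, or None when there is no confident entry."""
        entry = self.table.get(pc)
        if entry is None or not entry[2]:
            return None
        last_value, stride, _ = entry
        return last_value + stride

    def train(self, pc, value):
        """Update the entry for `pc` with the value the load actually returned."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (value, 0, False)
            return
        last_value, stride, _ = entry
        new_stride = value - last_value
        # Become confident only when the same stride is seen twice in a row.
        self.table[pc] = (value, new_stride, new_stride == stride)


predictor = StridePredictor()
for observed in (10, 20, 30):
    predictor.train(0x400, observed)
print(predictor.predict(0x400))
```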

4.

Design and Porting of a Floating-Point Unit for RISC-V Embedded Processors

唐俊龙 吴圳羲 卢英龙 黄智昌 邹望辉 《电子设计工程》2023,(7):119-123+131

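Because software floating-point arithmetic is too slow to meet the needs of RISC-V embedded processors, a floating-point unit (FPU) consisting of a floating-point adder and a floating-point multiplier is designed; the multiplier introduces a new Wallace-tree compression structure that increases compression speed. In the "Hummingbird E203" (蜂鸟E203) processor, the decode and dispatch modules for floating-point instructions are designed and the FPU module is ported. Based on the SMIC 180 nm process, Synopsys Design Compiler and VCS are used for functional verification and synthesis of the FPU. Simulation results show that the critical-path delay of the floating-point adder is 10.17 ns, 23% shorter than that of a serial floating-point adder; the critical-path delay of the multiplier's compression structure is 0.27 ns, 10% shorter than conventional Wallace-tree compression; and the FPU produces identical results before and after porting.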
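
As background for the Wallace-tree compression mentioned above, the following Python sketch models a conventional Wallace-style reduction built from 3:2 compressors (carry-save adders). It illustrates the baseline technique only; the paper's new compression structure is not described in the abstract and is not reproduced here. The function names and the 24-bit default width (a single-precision significand) are assumptions.

```python
def compress_3_to_2(a: int, b: int, c: int):
    """3:2 compressor: three rows in, a sum row and a carry row out (same total)."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def wallace_reduce(rows):
    """Repeatedly compress groups of three rows until at most two remain."""
    rows = list(rows)
    while len(rows) > 2:
        next_rows = []
        for i in range(0, len(rows) - 2, 3):
            next_rows.extend(compress_3_to_2(rows[i], rows[i + 1], rows[i + 2]))
        # Rows that did not fit into a group of three pass through unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    return rows


def multiply(x: int, y: int, width: int = 24) -> int:
    """Unsigned multiply: partial products, tree reduction, one final addition."""
    partial_products = [(x << i) if (y >> i) & 1 else 0 for i in range(width)]
    return sum(wallace_reduce(partial_products))


print(hex(multiply(0xABCDEF, 0x123456)))
```

Each compression level removes about a third of the rows without propagating carries, so only the final two-row addition needs a carry-propagate adder.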

5.

Application of the Hypercube Method to the S-Box of the SM4 Algorithm

王悦 李树国 《微电子学与计算机》2014,(7)

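The S-box is the most time-consuming part of the commercial cipher SM4, so constructing a high-performance S-box is important. To significantly reduce the latency of SM4 encryption and decryption, we introduce an N-dimensional hypercube method for constructing the S-box. In hardware, compared with the conventional lookup-table S-box, delay is reduced by 6% and area by 17%. The method also applies to S-box transformations in other symmetric ciphers and can serve as a reference.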

6.

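GISEES: An Automatic Instruction Set Extension Generation Method for Embedded Systems (total citations: 1; self-citations: 0; citations by others: 1)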

陈虎 陈书明 陈胜刚 谷会涛 陈小文 《电子学报》2011,39(9):2026-2033

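Application-specific instruction set processors can effectively improve performance by adding extension instructions while still meeting time-to-market requirements. However, customizing extension instructions for embedded systems must address three problems: the design space grows exponentially with application complexity; limited on-chip resources constrain the number and complexity of extension instructions; and existing instruction set extension algorithms are too complex to run on embedded systems. This paper proposes a fast instruction set extension method, GISEES. The method generates extension instructions centered on the typical operations of the application in order to prune the design space, and uses a resource-sharing strategy based on maximal common equivalent substrings to reduce resource overhead and the number of inserted multiplexers. Experimental results show that the method has linear complexity, produces more efficient extension instructions, and is better suited to customizing efficient extension instructions for embedded systems.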

7.

A Reconfigurable Optimization Method for the Prime-Field SM2 Algorithm

李斌 周清雷 陈晓杰 冯峰 《通信学报》2022,(3):30-41

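To address the low efficiency of software implementations of SM2 and the low resource utilization and poor scalability of hardware implementations, a reconfigurable optimization method for the prime-field SM2 algorithm is proposed. Based on an in-depth analysis of SM2 across its computation stages and characteristics, KOA fast multiplication, fast modular reduction, and the Barrett algorithm are used to implement modular multiplication for the recommended or arbitrary parameters, and a radix-4 extended Euclidean algorithm is optimized to accelerate modular inversion. Then, in the standard projective coordinate system...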
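
Barrett reduction, named above as the route for arbitrary moduli, can be sketched in a few lines of Python. This is the textbook form of the algorithm for illustration, not the paper's hardware datapath; the function names are hypothetical, and the modulus used in the example is the SM2 recommended prime written as 2^256 - 2^224 - 2^96 + 2^64 - 1.

```python
def barrett_setup(m: int):
    """Precompute k = bit length of m and mu = floor(4**k / m)."""
    k = m.bit_length()
    mu = (1 << (2 * k)) // m
    return k, mu


def barrett_reduce(x: int, m: int, k: int, mu: int) -> int:
    """Return x mod m for 0 <= x < m*m using multiplications and shifts only."""
    q = (x * mu) >> (2 * k)   # estimate of x // m, never too large, at most 2 too small
    r = x - q * m
    while r >= m:             # at most two correction subtractions
        r -= m
    return r


def mod_mul(a: int, b: int, m: int, k: int, mu: int) -> int:
    """Modular multiplication for operands already reduced modulo m."""
    return barrett_reduce(a * b, m, k, mu)


# SM2 recommended prime field modulus.
P = 2**256 - 2**224 - 2**96 + 2**64 - 1
K, MU = barrett_setup(P)
a = 0x123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF
b = P - 2
print(mod_mul(a, b, P, K, MU) == (a * b) % P)
```

The precomputed constant replaces the division by the modulus, which is why the same logic can serve any modulus once `mu` is loaded.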

8.

On-Chip Debuggability Architecture Design for Electronic Controllers

陈芳芳《电子器件》2018,41(3)

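An on-chip debug architecture that meets the high-reliability requirements of electronic controllers is proposed. Reusing the JTAG interface eliminates the cost and size overhead of redundant pins, and custom instructions designed around the TAP controller let the JTAG chain combine structural test with functional debug. To handle protocol conversion between debug commands and bus accesses, a low-overhead, high-efficiency serial-to-parallel conversion unit is designed; together with external debug software and a protocol converter, it enables debug access to the global address space. Experimental results show that the proposed debug architecture shortens debug time by 79.8% on average and reduces area overhead by 16.73%, while significantly improving the reliability of the debug link.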

9.

A High-Performance SIMD Multiplier Array Architecture

吴虎成 刘洋 徐瑞 刘建平 《微电子学与计算机》2014,(3)

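A new high-performance, energy-efficient SIMD multiplier array structure is described. The array can execute one 64-bit multiplication, four 32-bit multiplications, or sixteen 16-bit signed/unsigned multiplications simultaneously. Modifying the implementation structure of the multiplication algorithm increases the area reuse of the multiply-add units, providing these functions with small area and performance overhead. An "overflow compensation technique" is introduced to solve overflow detection in complex matrix multiplication. Performance of narrow multiplications on non-critical paths is traded for better performance of wide multiplications on the critical path. Compared with the multiplier cluster structure in reference [1], the proposed structure reduces 64-bit multiplication delay by 3.65% and area by 3.92%, and increases power by 5.71%.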
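
The 1/4/16 split of operating modes follows from how a wide product decomposes into narrow ones. The Python sketch below is an arithmetic illustration for unsigned operands only, not the paper's array: a 64-bit by 64-bit product is the shifted sum of sixteen 16-bit by 16-bit partial products, so the same sixteen narrow multipliers could in principle serve either as sixteen independent lanes or as one wide multiplier. All function names are hypothetical.

```python
def split_lanes(x: int, lane_bits: int, lanes: int):
    """Split x into `lanes` fields of `lane_bits` bits, least significant first."""
    mask = (1 << lane_bits) - 1
    return [(x >> (i * lane_bits)) & mask for i in range(lanes)]


def mul64_from_16bit_products(a: int, b: int) -> int:
    """Unsigned 64x64 multiply assembled from sixteen 16x16 partial products."""
    a_parts = split_lanes(a, 16, 4)
    b_parts = split_lanes(b, 16, 4)
    total = 0
    for i, a_i in enumerate(a_parts):
        for j, b_j in enumerate(b_parts):
            # Each 16x16 product is weighted by the positions of its operands.
            total += (a_i * b_j) << (16 * (i + j))
    return total


def simd_mul16(a_lanes, b_lanes):
    """SIMD mode: independent unsigned 16x16 products, one 32-bit result per lane."""
    return [x * y for x, y in zip(a_lanes, b_lanes)]


a, b = 0xFEDCBA9876543210, 0x0123456789ABCDEF
print(mul64_from_16bit_products(a, b) == a * b)
print(simd_mul16([3, 0xFFFF], [5, 0xFFFF]))
```

Signed operands and the overflow handling mentioned in the abstract need extra correction terms that this sketch leaves out.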

10.

An FPGA Technology Mapping Algorithm for Arithmetic Units

路宝珠 杨海钢 祁亚男 张茉莉 崔秀海 《微电子学与计算机》2012,29(12):1-6

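This paper proposes ArithM, an FPGA technology mapping algorithm targeting arithmetic units. Experimental results show that, compared with the black-box mapping algorithm in the widely recognized ABC tool, the proposed algorithm reduces logic-cell area by 7% and circuit critical-path delay by 5% on average. ArithM uses three techniques to optimize arithmetic resources: cell sharing, balancing arithmetic chains, and absorbing neighboring nodes.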

11.

Power Efficient SDR Implementation of IEEE 802.11a/p Physical Layer

Daniele Lo Iacono Teo Cupaiuolo 《Journal of Signal Processing Systems》2013,73(3):281-289

Software defined physical layer modems can be considered the new trend in the field of communications. Differently from dedicated hardware, software can be easily modified to implement a large variety of standards on the same platform. The use of software can significantly reduce development costs, but generally comes at the price of an increase in silicon area and power consumption. For different reasons, this price is something that is not always convenient or even possible to pay, as in the case of low-cost ICs implementing a single waveform, or even multi-mode modems embedding legacy IPs already available in hardware. In particular, power consumption overhead can be prohibitive for mobile terminals or in general for battery-powered devices. The very first challenge for a computing fabric to be competitive is to find and implement the right trade-off between flexibility and performance. This was the guideline for the design of the Block Processing Engine (BPE), a template architecture conceived for power-efficient baseband processing. The BPE core feature is a mixed-grain instruction set balancing general-purpose fine-grain instructions with more specific coarse-grain instructions wrapping custom hardware modules. To further limit the power consumption, the BPE also implements instruction-pipelining, variable-size SIMD and multi-task support. To prove the efficiency of such an approach, a dual-mode IEEE 802.11a/p receiver has been implemented.

12.

Exploiting Thread‐Level Parallelism in Lockstep Execution by Partially Duplicating a Single Pipeline

Jaegeun Oh Seok Joong Hwang Huong Giang Nguyen Areum Kim Seon Wook Kim Chulwoo Kim Jong‐Kook Kim 《ETRI Journal》2008,30(4):576-586

In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler‐hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write‐back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single‐instruction multiple‐data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32‐bit embedded core AE32000C and synthesize it on Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2‐way MLEP and 33.7% faster with a 4‐way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler.

13.

Error detection by duplicated instructions in super-scalar processors

Oh N. Shirvani P.P. McCluskey E.J. 《Reliability, IEEE Transactions on》2002,51(1):63-75

This paper proposes a pure software technique "error detection by duplicated instructions" (EDDI), for detecting errors during usual system operation. Compared to other error-detection techniques that use hardware redundancy, EDDI does not require any hardware modifications to add error detection capability to the original system. EDDI duplicates instructions during compilation and uses different registers and variables for the new instructions. Especially for the fault in the code segment of memory, formulas are derived to estimate the error-detection coverage of EDDI using probabilistic methods. These formulas use statistics of the program, which are collected during compilation. EDDI was applied to eight benchmark programs and the error-detection coverage was estimated. Then, the estimates were verified by simulation, in which a fault injector forced a bit-flip in the code segment of executable machine codes. The simulation results validated the estimated fault coverage and show that approximately 1.5% of injected faults produced incorrect results in eight benchmark programs with EDDI, while on average, 20% of injected faults produced undetected incorrect results in the programs without EDDI. Based on the theoretical estimates and actual fault-injection experiments, EDDI can provide over 98% fault-coverage without any extra hardware for error detection. This pure software technique is especially useful when designers cannot change the hardware, but they need dependability in the computer system. To reduce the performance overhead, EDDI schedules the instructions that are added for detecting errors such that "instruction-level parallelism" (ILP) is maximized. Performance overhead can be reduced by increasing ILP within a single super-scalar processor. The execution time overhead in a 4-way super-scalar processor is less than the execution time overhead in the processors that can issue two instructions in one cycle.
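
The idea of duplicating computation on separate variables and comparing the copies can be shown with a small Python sketch. This is a source-level analogy written for illustration, not EDDI itself, which duplicates machine instructions at compile time and uses separate registers; the function name, the exception type, and the fault-injection parameter are all hypothetical.

```python
class FaultDetected(Exception):
    """Raised when the master and shadow computations disagree."""


def checked_sum(values, inject_fault_at=None):
    """Sum `values` twice on separate variables and compare before returning."""
    total = 0
    total_shadow = 0
    for index, value in enumerate(values):
        total += value            # master computation
        total_shadow += value     # duplicated computation on a shadow variable
        if index == inject_fault_at:
            total ^= 1 << 3       # simulated single bit flip in the master copy
    # Comparison point: check the copies before the result leaves the function.
    if total != total_shadow:
        raise FaultDetected("master and shadow results differ")
    return total


print(checked_sum([1, 2, 3, 4]))
try:
    checked_sum([1, 2, 3, 4], inject_fault_at=2)
except FaultDetected as error:
    print("detected:", error)
```

Because the two copies are independent, a superscalar processor can often execute them in parallel, which is the source of the overhead reduction the abstract describes.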

14.

A Hardening Method for Android Native Code Based on Instruction Virtualization

张晓寒 张源 池信坚 杨珉 《电子与信息学报》2020,42(9):2108-2116

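Android is increasingly used on all kinds of smart devices, such as smartphones, smartwatches, smart TVs, and smart cars. At the same time, reverse-engineering attacks on application software for these platforms are increasing, which seriously infringes the legitimate rights of software developers and exposes end users to potential security risks. How to protect Android applications running on diverse devices from reverse-engineering attacks has become an important research problem. Existing Android software protection methods such as name obfuscation, dynamic loading, and code hiding raise the difficulty of reverse engineering to some extent, but their principles are relatively simple and they are easily bypassed. A more effective approach is hardening based on instruction virtualization, but existing instruction virtualization methods target only a specific architecture (x86) and are not compatible with Android devices running on multiple architectures. For native code in Android applications, this paper proposes an architecture-independent instruction virtualization technique and designs and implements a hardening system based on virtual machine packing protection (VMPP). The system comprises a fixed-length virtual instruction set based on a register architecture, an interpreter for that instruction set, and a toolchain that integrates with existing development environments. Tests on a large body of C/C++ code and real Android software show that VMPP significantly improves the resistance of Android native code to reverse engineering while introducing low runtime overhead, and that it can protect Android native code on different architectures.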
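
To show what a fixed-length, register-based virtual instruction set and its interpreter look like in the simplest case, here is a toy Python sketch. It is not VMPP's instruction set or interpreter: the opcodes, the 4-byte encoding, and the function names are all invented for illustration. The point is only that protected code is shipped as bytecode for a private instruction set, so a disassembler for the host architecture sees just the interpreter loop.

```python
import struct

# Hypothetical opcodes for a toy instruction set (not VMPP's).
OP_LOADI, OP_ADD, OP_MUL, OP_HALT = range(4)


def encode(op: int, rd: int = 0, rs1: int = 0, operand: int = 0) -> bytes:
    """Encode one fixed-length 4-byte instruction: opcode, rd, rs1, rs2-or-immediate."""
    return struct.pack("<BBBB", op, rd, rs1, operand)


def run(bytecode: bytes, num_regs: int = 8):
    """Interpret the bytecode and return the final register file."""
    regs = [0] * num_regs
    pc = 0
    while pc < len(bytecode):
        op, rd, rs1, operand = struct.unpack_from("<BBBB", bytecode, pc)
        pc += 4                               # every instruction is 4 bytes long
        if op == OP_LOADI:
            regs[rd] = operand                # load an 8-bit immediate
        elif op == OP_ADD:
            regs[rd] = regs[rs1] + regs[operand]
        elif op == OP_MUL:
            regs[rd] = regs[rs1] * regs[operand]
        elif op == OP_HALT:
            break
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs


# r0 = 6; r1 = 7; r2 = r0 * r1; r2 = r2 + r0
program = b"".join([
    encode(OP_LOADI, 0, 0, 6),
    encode(OP_LOADI, 1, 0, 7),
    encode(OP_MUL, 2, 0, 1),
    encode(OP_ADD, 2, 2, 0),
    encode(OP_HALT),
])
print(run(program)[2])
```

A fixed instruction length keeps decoding trivial, and because the interpreter itself is ordinary portable code, the same bytecode can run on any host architecture the interpreter is compiled for.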

15.

An Ultra-Compact Processor Architecture for On-Chip Parallel Computing Arrays

周韧研 刘雷波 魏少军 《电路与系统学报》2012,17(2):1-5

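An ultra-compact processing element architecture is proposed. The processing element is based on an operate-and-branch one-instruction processor. Using instruction optimization and accelerators on an internal bus, it can perform efficient bit operations, which conventional arithmetic one-instruction processors struggle to execute, as well as data-movement operations, which they execute inefficiently. Massively parallel on-chip arrays built from this processing element can be used for computing tasks with strong locality and tight real-time requirements, such as image processing. A 16×16 prototype array containing this processing element architecture has been implemented on an FPGA, reaching 30.7 GOPS at 120 MHz with an average power of 39.5 mW.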
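
The quoted performance figure is consistent with a simple peak-rate calculation. The Python sketch below is a back-of-the-envelope check, not something stated in the abstract: it assumes a 16×16 array of 256 processing elements and one operation per element per clock cycle, and the function name is hypothetical.

```python
def peak_gops(rows: int, cols: int, clock_hz: float, ops_per_pe_per_cycle: int = 1) -> float:
    """Peak throughput of a PE array in giga-operations per second."""
    return rows * cols * ops_per_pe_per_cycle * clock_hz / 1e9


# Assumption: each of the 256 processing elements completes one operation per cycle.
peak = peak_gops(16, 16, 120e6)
print(f"{peak:.2f} GOPS")
```

Under that assumption the result is 30.72 GOPS, in line with the 30.7 GOPS reported at 120 MHz.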

16.

Reconfigurable Hardware Design and Implementation of SM3 and the SHA-2 Family of Algorithms

朱宁龙 戴紫彬 张立朝 赵峰 《微电子学》2015,45(6):777-780, 784

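Given that hash algorithm standards and application requirements currently differ between China and other countries, a dataflow-reconfigurable design approach is adopted. Based on an analysis of the distinct characteristics of SM3 and the SHA-2 family of hash algorithms, a unified processing model is derived, and a new hardware structure is designed from it. With this structure, SM3, SHA-256, SHA-384, and SHA-512 can each be implemented flexibly and independently according to the hash security strength that different environments require. Experimental results show that the designed circuit effectively reduces hardware resource consumption, increases system throughput, and meets the requirements of commercial hash algorithms used in China and abroad.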

17.

Hardware/Software Co-reconfigurable Instruction Decoder for Adaptive Multi-core DSP Architectures

Yong-Kyu Jung 《Journal of Signal Processing Systems》2011,62(3):273-285

A programmable instruction decoder (PID) is introduced for designing adaptive multi-core DSP architectures by using a hardware/software co-reconfigurable approach without employing programmable devices. This PID permits DSP software developers for post-manufacturing modification of their DSP instruction sets to add their application-specific instructions whenever necessary. In addition, PID offers software developers an enhanced means to utilize the underlying DSP architectures by rescheduling implemented micro-operations for their tailored instructions in the DSP processors. Thus, emerging DSP applications can be swiftly and efficiently re-imported to PID-based DSP processors without re-fabrication of new DSP chips. In addition to instruction-level modification, an innovative instruction-packing procedure for PID is presented for further enhancement of the PID-based DSP systems. PID architecture was developed and implemented in VHDL. The PID-based DSP systems were also developed and evaluated to demonstrate various post-manufacturing adaptabilities in DSP processor systems. Various multi-core DSP architectures based on Texas Instruments’ TMS320C55 DSP processor were used for evaluating performance and adaptability of this new programmable instruction decoder.

18.

A hardware accelerator for two-dimensional image analysis

《Integration, the VLSI Journal》1988,6(3):329-344

This paper describes the architecture and operation of a new hardware accelerator called MultiRing for performing various geometrical operations on two-dimensional image space. This hardware architecture is shown to be applicable for design rule checking in VLSI layout and many image processing operations including noise suppression and contour extraction. It has both a fast execution speed and extremely high flexibility. Each row data stored in ring memory is processed in the corresponding processor in full parallelism. Each processor is simultaneously configured by the instruction decoder/controller to perform one of the 20 basic instructions each ring cycle, which gives MultiRing maximal flexibility in terms of design rule change or the instruction set enhancement. Correct functional behavior of MultiRing was confirmed by successfully running a software simulator having one-to-one structural correspondence to the MultiRing hardware.

19.

Software defined industrial network architecture for edge computing offloading

许方敏 叶桓宇 崔绍华 赵成林 姚海鹏 《中国邮电高校学报(英文版)》2019,26(1):49-58

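The convergence of the Internet and traditional manufacturing has made the Industrial Internet of Things (IIoT) a hot research topic. However, traditional industrial networks still face challenges in resource management, limits on raw data storage, and computing capability. In this paper, we propose a new software-defined industrial network (SDIN) architecture to address shortcomings of the IIoT in resource utilization, data processing and storage, and system compatibility. The architecture is based on the software-defined networking (SDN) architecture and combines hierarchical cloud-fog computing with content-aware caching. Based on the SDIN architecture, two edge computing strategies for industrial applications are discussed, and simulation results under different scenarios and service requirements confirm the feasibility and effectiveness of the SDIN architecture for edge computing offloading.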