Similar literature
1.
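A new polymorphic, high-efficiency parallel array machine architecture, the Firefly 2 (萤火虫2号) array machine, is proposed. Its processing elements can run in both SIMD and MIMD modes, support an asynchronous execution mechanism, and can also perform distributed instruction-level parallel processing. A hardware multithread manager and efficient communication mechanisms enable the array machine to carry out thread-level, data-level, and distributed instruction-level parallel computation with high efficiency. Notably, its stream-processing performance rivals that of application-specific integrated circuits. The architecture also supports static and dynamic dataflow computation effectively and can handle graphics, image, and digital signal processing tasks efficiently.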

2.
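A network processor is a processor designed specifically for network processing. Its instruction set is the interface between software and hardware, and the design of the instruction set has a considerable effect on performance. This paper proposes a combination optimization method for high-frequency instruction pairs (HFIP). The method exploits the dynamic correlation between instructions during the execution of network processor benchmark programs and uses the unused bits in the instruction format of the SimpleScalar simulator as the extension field for new instructions. The experimental results are analyzed quantitatively. Simulation results show that the method is sound and effective: it improves network processor performance while reducing instruction cache power consumption, achieving a performance/power trade-off.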

3.
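Implementation of DSP instruction dispatch based on the VelociTI architecture   Total citations: 1, self-citations: 0, citations by others: 1
Designing a digital signal processor based on the VelociTI architecture requires fast dispatch of parallel instructions. To this end, an instruction dispatch method for this architecture, the sorting method, is proposed. The method uses decision-tree principles to test the parallelism of the instructions in a fetch packet and arranges the processor's functional units in a fixed order so that each functional unit corresponds to one field of the execute packet. The instructions in the execute packet are then reordered according to the decoding results and the functional-unit order, which completes the assignment of instructions to functional units. Simulation results show that the method is effective.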

4.
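Design of a high-performance digital signal processor   Total citations: 1, self-citations: 0, citations by others: 1
Yan Wei, Gong Youmin. 《微处理机》 (Microprocessors), 2004, 25(4): 10-15
This paper presents the design of a 16-bit digital signal processor with an instruction set and architecture tailored to signal processing. The instruction set contains 88 instructions, and the synthesized core comprises 12,799 cells. The 16-bit fixed-point processor is a single-issue system. Multiple data and address buses keep data flowing normally through all four cycles of the four-stage pipeline, and a distributed register structure lets most instructions complete in one cycle. The design covers the central arithmetic logic unit, multiplier unit, shifter unit, sequencer unit, auxiliary register unit, and interrupt unit. The central arithmetic logic unit performs addition/subtraction and logic operations, uses a carry-select chain, and handles data overflow by saturation. The multiplier unit combines the Booth algorithm with a carry-lookahead adder. The sequencer provides separate paths for the different program sequencing sources (interrupts, the second instruction word, the accumulator, the stack, and so on), and conditional branch instructions are designed around the ZLVC conditions. The auxiliary register unit uses a carry that propagates in the direction opposite to the normal carry to meet the bit-reversal requirement of the FFT algorithm. The interrupt unit uses two-level interrupts, a large stack for saving addresses, and pipeline flushing.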
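The reverse-direction carry used for FFT bit-reversed addressing can be illustrated in software. The following is a minimal sketch, not the circuit from the paper: the function names `reverse_carry_add` and `bit_reversed_sequence` are hypothetical, and the assumption is that the address is stepped by N/2 with the carry rippling from the most significant bit toward the least significant bit.

```python
def reverse_carry_add(a: int, b: int, bits: int) -> int:
    """Add a and b over `bits` bits with the carry moving from MSB to LSB.

    Hypothetical helper: models a reverse-carry adder bit by bit.
    """
    result = 0
    carry = 0
    for i in range(bits - 1, -1, -1):  # start at the most significant bit
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry
        carry = (x & y) | (x & carry) | (y & carry)
        result |= s << i
    return result  # the carry out of the LSB is discarded


def bit_reversed_sequence(n_bits: int) -> list[int]:
    """Generate the bit-reversed index order for an FFT of size 2**n_bits."""
    n = 1 << n_bits
    index = 0
    order = []
    for _ in range(n):
        order.append(index)
        # Stepping by N/2 with a reversed carry visits indices in
        # bit-reversed order.
        index = reverse_carry_add(index, n >> 1, n_bits)
    return order


print(bit_reversed_sequence(3))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

For an 8-point FFT the address register starts at 0 and is repeatedly incremented by 4 with the reversed carry, which yields the familiar order 0, 4, 2, 6, 1, 5, 3, 7 without a separate bit-swapping step.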

5.
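As hardware features grow richer and software development environments mature, GPUs are increasingly used for general-purpose computing to help the CPU accelerate programs. In pursuit of high performance, a GPU typically contains hundreds or thousands of core arithmetic units. This density of computing resources gives it far higher performance than a CPU but also higher power consumption, and power has become one of the major constraints on GPU development. Based on an in-depth study of the Fermi GPU architecture, a high-accuracy architecture-level power model is proposed. The model first calculates the power consumed by each type of native instruction and by each memory access. It then analyzes and predicts an application's power consumption from the instructions the application executes on the hardware and from the results obtained with a sampling tool. Finally, measured results and model predictions are compared across 13 benchmark applications; the model's prediction accuracy reaches about 90%.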
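The general shape of an instruction-level power model like the one described can be sketched as a weighted sum. This is only an illustration under stated assumptions: the function names, the instruction and memory labels, and all energy figures are hypothetical, and accuracy is assumed to mean one minus the relative error, which the abstract does not specify.

```python
def predict_energy_nj(instr_counts, energy_per_instr_nj,
                      mem_accesses, energy_per_access_nj):
    """Sum per-instruction and per-memory-access energy (nanojoules).

    All dictionaries are hypothetical inputs: counts would come from a
    profiler, energy costs from microbenchmark measurements.
    """
    compute = sum(count * energy_per_instr_nj[op]
                  for op, count in instr_counts.items())
    memory = sum(count * energy_per_access_nj[level]
                 for level, count in mem_accesses.items())
    return compute + memory


def prediction_accuracy(predicted, measured):
    """Accuracy as 1 minus relative error (an assumed definition)."""
    return 1.0 - abs(predicted - measured) / measured


# Made-up numbers, for illustration only.
counts = {"fadd": 1000, "fmul": 500}
cost = {"fadd": 0.5, "fmul": 1.0}
mem = {"global": 10}
mem_cost = {"global": 20.0}

energy = predict_energy_nj(counts, cost, mem, mem_cost)
print(energy)
```

Dividing the predicted energy by the measured run time would give an average power figure that can be compared against a power meter reading.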

6.
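YHFT-D4 is a DSP with a clustered VLIW architecture. It has multiple functional units and can execute several instructions in parallel in a single clock cycle. Which functional unit executes an instruction, and which instructions execute in parallel, are decided statically by the compiler or the programmer. This article presents the design and implementation of the YHFT-D4 assembler.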

7.
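To meet the demand for finite-field modular multiplication in elliptic curve cryptography, a dedicated modular multiplication instruction is proposed. A group parameter in the instruction field supports multiple groups of modular multiplications within the algorithm, and configuring the parameters lets the instruction support extension of the operand length. The modular multiplication unit implements the Montgomery modular multiplication algorithm, and a hardware pipeline unified across prime fields and binary fields is designed, together with a dual-field multiplier unit structure. Experimental results show that the finite-field modular multiplication instruction and hardware unit offer high execution efficiency and good flexibility.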
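For readers unfamiliar with Montgomery modular multiplication, the textbook prime-field version can be sketched in a few lines. This is the standard REDC formulation, not the paper's dual-field hardware pipeline; the function names are hypothetical, the modulus `n` is assumed odd with `n < 2**k`, and Python 3.8 or later is assumed for the modular inverse via `pow`.

```python
def montgomery_setup(n: int, k: int) -> int:
    """Return n' with n * n' == -1 (mod 2**k). Requires odd n."""
    r = 1 << k
    return (-pow(n, -1, r)) % r


def redc(t: int, n: int, k: int, n_prime: int) -> int:
    """Montgomery reduction: return t * 2**(-k) mod n for 0 <= t < n * 2**k."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k          # exact division: low k bits are zero
    return u - n if u >= n else u


def mont_mul(a: int, b: int, n: int, k: int) -> int:
    """Compute a * b mod n by way of the Montgomery domain."""
    n_prime = montgomery_setup(n, k)
    a_bar = (a << k) % n          # convert operands into the Montgomery domain
    b_bar = (b << k) % n
    prod_bar = redc(a_bar * b_bar, n, k, n_prime)
    return redc(prod_bar, n, k, n_prime)  # convert back to normal form


print(mont_mul(7, 9, 13, 4))  # 11
```

The point of the transformation is that `redc` needs only multiplications, masks, and shifts, so a hardware unit avoids trial division by the modulus. In a real elliptic curve computation the operands would stay in the Montgomery domain across many multiplications rather than being converted on every call.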

8.
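Zhang Qian. 《计算机工程》 (Computer Engineering), 2009, 35(10): 273-275
For two-dimensional SIMD architectures, a low-power scheduling algorithm is proposed that can dynamically switch off idle components with combined support from the compiler, the instruction set, and the architecture: the compiler optimizes the two-dimensional SIMD instructions, power instructions issue component on/off signals, and the system receives the signals and acts on them. Different functional units are scheduled separately, and components are localized. Experiments on a simulator show that the method saves about 15% of the energy consumed by the whole system.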

9.
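Runahead execution can significantly increase the memory-level parallelism of a computer system without major changes to the processor structure. However, a runahead processor executes many more instructions than a conventional processor, at most more than three times the number executed normally, which greatly increases processor power consumption. Analysis in this paper shows that runahead execution executes a large number of useless instructions in the pre-execution phase. Based on this finding, a method that reduces useless instructions is proposed to improve the efficiency of runahead processors. Experiments show that, with little effect on performance, the method removes up to 50% of the useless instructions that a runahead processor executes in the pre-execution phase.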

10.
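The design of a parallel 8-bit × 8-bit + 20-bit MAC unit embedded in a microprocessor is presented. The unit performs multiplication or multiply-accumulate on 8-bit integers or ordinals and provides saturation detection and saturation handling for integer multiply-accumulate operations. The design uses a new Booth encoding method. The partial-product compression array is optimized: the accumulated value is treated as one more partial product in the array's summation, which saves one stage of carry-lookahead adder. The array uses a new 4:2 compressor, which further shortens the delay and saves area.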
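The saturating behaviour of an 8-bit × 8-bit + 20-bit multiply-accumulate can be modelled functionally as follows. This is a behavioural sketch only: it assumes signed two's-complement operands and a signed 20-bit accumulator, the name `mac_saturating` is hypothetical, and nothing here reflects the Booth encoder or compressor tree of the actual design.

```python
ACC_BITS = 20
ACC_MAX = (1 << (ACC_BITS - 1)) - 1   # largest signed 20-bit value
ACC_MIN = -(1 << (ACC_BITS - 1))      # smallest signed 20-bit value


def mac_saturating(a: int, b: int, acc: int) -> tuple[int, bool]:
    """Return (acc + a * b clamped to 20 bits, saturation flag).

    a and b are assumed to be signed 8-bit integers.
    """
    if not (-128 <= a <= 127 and -128 <= b <= 127):
        raise ValueError("operands must fit in signed 8 bits")
    if not (ACC_MIN <= acc <= ACC_MAX):
        raise ValueError("accumulator must fit in signed 20 bits")
    total = acc + a * b
    saturated = total > ACC_MAX or total < ACC_MIN
    return max(ACC_MIN, min(ACC_MAX, total)), saturated


print(mac_saturating(3, 4, 5))  # (17, False)
```

Clamping to the range limits instead of wrapping around keeps an overflowing filter output at full scale rather than flipping its sign, which is the usual motivation for saturation handling in signal processing.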

11.
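Asynchronous circuits offer a good solution to problems in synchronous integrated circuit design such as clock skew and excessive clock power. Using an asynchronous integrated circuit design methodology, this paper designs a 32-bit asynchronous subword-parallel multiply-accumulate unit and implements it in a 0.18 μm process. A special partial-product decoding circuit lets the unit support multiple subword-parallel modes, making it suitable for multimedia processing. Evaluation results show that the asynchronous multiply-accumulate unit outperforms a synchronous unit of the same structure in both performance and power.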

12.
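Proportional recursive-integral control of active power filters   Total citations: 6, self-citations: 1, citations by others: 6
When an active power filter is used to eliminate harmonics, the filtering effect is degraded by factors such as the filter's detection accuracy, the delay in computing the reference current, and the phase shift of the output filter. A proportional recursive-integral control algorithm for shunt active filters is proposed, in which the proportional and integral coefficients are tuned online by fuzzy inference. The algorithm effectively improves the dynamic and steady-state performance of the system and is suitable for suppressing current waveform distortion caused by nonlinear loads in single-phase, three-phase three-wire, and three-phase four-wire systems. Experimental results confirm the effectiveness of the proposed control algorithm.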

13.
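A power-control-based MAC protocol for wireless sensor networks is proposed. The optimal transmit power of a node is computed from the node's receive threshold, which reduces transmit power at the source and thereby saves node energy. To reduce collisions between nodes, adaptive contention-window adjustment and a fast backoff mechanism are introduced; these shorten node idle time and further reduce node energy consumption. Simulation results show significant improvements in both energy and throughput.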

14.
Traditionally, code scheduling is used to optimize the performance of an application, because it can rearrange the code to allow the execution of independent instructions in parallel based on instruction level parallelism (ILP). According to our observations, it can also be applied to reduce power dissipation by taking advantage of the properties of existing low-power techniques. In this paper, we present power-aware code scheduling (PACS), which integrates code scheduling with power gating (PG) and dynamic voltage scaling (DVS) to reduce power consumption while executing an application. In other words, from the viewpoint of compilation optimization, PG and DVS can be applied simultaneously to a code and their impact can be enhanced by code scheduling to further save power. The results show that, compared with hardware power gating, the proposed PACS performs better by more than 33% and 41% in terms of energy-delay product and energy-delay² product, respectively, for DSPStone and Mediabench.
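The two metrics named at the end of this abstract, the energy-delay product and the energy-delay-squared product, are simple to compute. The sketch below uses hypothetical function names and made-up energy and delay figures that are not taken from the paper.

```python
def energy_delay_product(energy_j: float, delay_s: float,
                         delay_exponent: int = 1) -> float:
    """E * D**n: n=1 gives the energy-delay product, n=2 the ED^2 product."""
    return energy_j * delay_s ** delay_exponent


def relative_improvement(baseline: float, proposed: float) -> float:
    """Fraction by which `proposed` is lower than `baseline`."""
    return 1.0 - proposed / baseline


# Made-up figures for illustration only.
base_edp = energy_delay_product(2.0, 4.0)      # 8.0
new_edp = energy_delay_product(1.0, 4.0)       # 4.0
print(relative_improvement(base_edp, new_edp))  # 0.5
```

Raising the delay exponent weights execution time more heavily, so the squared variant penalizes techniques that save energy only by running slower.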

15.
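By analyzing the characteristics of different types of multi-channel MAC protocols, this paper identifies the disappearing-node problem and the communication contention problem in multi-channel MAC protocols based on parallel negotiation. To address these problems, and with the energy efficiency of wireless sensor network nodes in mind, a new multi-channel MAC protocol, LPR-MAC, is proposed. The protocol synchronizes the whole network and divides time into multiple time slots. When the network is set up, each node randomly selects one slot as its fixed receive period. During its receive period a node hops among multiple channels according to its own pseudo-random sequence and negotiates in parallel; it sleeps in the remaining slots. Simulation results show that the protocol reduces communication contention and lowers energy consumption.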

16.
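LPR-MAC: a low-power multi-channel MAC protocol using a parallel negotiation mechanism   Total citations: 1, self-citations: 1, citations by others: 0
By analyzing the characteristics of different types of multi-channel MAC protocols, this paper identifies the disappearing-node problem and the communication contention problem in multi-channel MAC protocols based on parallel negotiation. To address these problems, and with the energy efficiency of wireless sensor network nodes in mind, a new multi-channel MAC protocol, LPR-MAC, is proposed. The protocol synchronizes the whole network and divides time into multiple time slots. When the network is set up, each node randomly selects one slot as its fixed receive period. During its receive period a node hops among multiple channels according to its own pseudo-random sequence and negotiates in parallel; it sleeps in the remaining slots. Simulation results show that the protocol reduces communication contention and lowers energy consumption.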
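The receive-slot and channel-hopping scheme can be sketched as a pair of deterministic functions of a per-node seed. This is an illustration under several assumptions that the abstract does not confirm: every helper name is hypothetical, a sender is assumed to know the receiver's seed, a node is assumed to use one channel per frame, and SHA-256 stands in for whatever pseudo-random generator the protocol really uses.

```python
import hashlib


def _prn(seed: int, counter: int) -> int:
    """Deterministic pseudo-random integer from (seed, counter)."""
    digest = hashlib.sha256(f"{seed}:{counter}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def receive_slot(node_seed: int, slots_per_frame: int) -> int:
    """Slot chosen once at network setup as the node's fixed receive period."""
    return _prn(node_seed, 0) % slots_per_frame


def receive_channel(node_seed: int, frame: int, n_channels: int) -> int:
    """Channel the node listens on during its receive slot in a given frame."""
    return _prn(node_seed, frame + 1) % n_channels


def is_listening(node_seed: int, slot_index: int, slots_per_frame: int) -> bool:
    """A node is awake only in its own receive slot and sleeps otherwise."""
    return slot_index % slots_per_frame == receive_slot(node_seed, slots_per_frame)


def rendezvous(receiver_seed: int, frame: int,
               slots_per_frame: int, n_channels: int) -> tuple[int, int]:
    """(slot, channel) a sender should use to reach the receiver in `frame`."""
    return (receive_slot(receiver_seed, slots_per_frame),
            receive_channel(receiver_seed, frame, n_channels))
```

Because both sides evaluate the same functions, a sender can wake up only for the receiver's slot and tune directly to the right channel, while receivers with different seeds tend to spread over different slots and channels.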

17.
In an attempt to improve the speed of VLSI signal processing systems, a new architecture for a high-speed multiply-accumulate (MAC) unit optimized for digital filters is proposed. This unit is designed as a coprocessor for the LEON2 RISC processor [LEON2 Processor; 2005 [Online]. <http://www.gaisler.com/products/leon2/leon.html>]. In this work, four parallel MAC units with two dual-port coefficient register files, a three-port general register file and a control unit are included in the coprocessing block. With four parallel units available, several SIMD-format instructions have been added to the LEON2 instruction set. Each MAC unit has two 16-bit inputs, a 32-bit output register and a programmable round-saturate block. The MAC unit uses a new architecture which embeds the accumulate module within the partial-product summation tree of the multiplier with minimum overhead. A central control unit controls the inputs of the four MACs and the loading of the output registers. Our experimental results demonstrate high performance in the implementation of digital filters, at speeds of up to 33 million input samples per second in a 0.18 μm technology.

18.
Embedded systems are vulnerable to buffer overflow attacks. In this paper, we propose a hardware memory monitor module that aims to detect buffer overflow attacks by analyzing the security of an embedded processor at the instruction level. The memory monitor module does not rely on the source code and performs security checks through dynamic methods. Compared with several existing countermeasures that protect only part of the program's data space, our proposed memory monitor module can protect the program's entire data space. The proposed memory monitor module has negligible performance overhead because it runs in parallel with the embedded processor. As demonstrated in an FPGA (Field Programmable Gate Array) based prototype, the experimental results show that our memory monitor module can effectively resist several types of buffer overflow attacks with approximately a 15% hardware cost overhead and only a 0.1% performance penalty.

19.
Wi-Fi technology, driven by its tremendous success, is expanding into a wide variety of devices and applications. However, many of these new devices, like handheld devices, pose new challenges in terms of QoS and energy efficiency. In order to address these challenges, in this paper we study how the novel MAC aggregation mechanisms developed in the 802.11n standard can be used to enhance the current 802.11 QoS and power saving protocols. Our contribution is twofold. First, we present a simulation study that illustrates the interactions between 802.11n and the current 802.11 QoS and power saving protocols. This study reveals that the 802.11n MAC aggregation mechanisms perform better when combined with the power save mode included in the original 802.11 standard than with the 802.11e U-APSD protocol. Second, we design CA-DFA, an algorithm that, using only information available at layer two, adapts the amount of 802.11n aggregation used by a Wi-Fi station according to the level of congestion in the network. A detailed performance evaluation demonstrates the benefits of CA-DFA in terms of QoS, energy efficiency and network capacity with respect to state-of-the-art alternatives.

20.
In embedded systems, a cache is commonly used to improve system performance. However, the cache consumes a large amount of power, and among the components of the cache memory, tag comparisons consume the most power. Therefore, designing a cache that consumes less power when comparing tags while keeping a high hit ratio is an important challenge. In this paper, we propose a Tagless Instruction Cache, called TL-IC, that does not perform tag comparisons, in order to save power in embedded systems. To guarantee that an instruction fetched from TL-IC is the desired instruction, the basic blocks of programs, rather than cache lines, are placed into TL-IC. In addition, to utilize TL-IC as much as possible and thus save the most power, and to accommodate both general-purpose and special-purpose applications, both static allocation and dynamic allocation of basic blocks are used to select the frequently executed basic blocks of programs for TL-IC. With a high utilization of TL-IC, which performs no tag comparisons, the power consumed in fetching instructions can be reduced efficiently. In the simulation results, we show and compare the power consumption of our proposed TL-IC, L0 cache, Linebuffer, and TH-IC.
