期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

周国昌王忠车德亮冯国臣《计算机工程与应用》2004,40(31):13-16

论文介绍的SIMD协处理器是用于低层图像理解的16位定点嵌入式阵列处理器。该协处理器采用load/store体系结构,并且除SIMD固有的数据并行性外,还具有三级流水和三组指令并发执行的并行性。三组指令并发执行使数据交换操作和其它类型操作并发执行,从而实现了数据交换操作的隐含执行,大大减少了通信和I/O操作的开销。相似文献

2.

基于Itanium处理器的密码算法实现

陈迅姜晶菲张民选《计算机工程与应用》2004,40(15):40-42,208

使用ItaniumCompiler7.0编译器对现有分组密码算法的C语言实现进行编译得到汇编代码,在对这些汇编代码进行分析时可以发现编译器并没有充分利用Itanium处理器提供的资源。针对这一问题,该文提出了在Itanium处理器上有效实现常用密码算法的方法,主要是利用Itanium处理器指令集中提供的SIMD指令提高处理的并行性,并探讨了Itanium处理器SIMD指令的使用方法。相似文献

3.

DSP体系结构发展的新趋势 总被引：3，自引：0，他引：3

黄峰熊召新李胜平朱全庆邹雪城《计算机工程》2002,28(4):1-2,218

CISC→RISC设计思想对DSP体系结构设计中数据和指令级并行性开发产生了深刻影响，融合RISC和SIMD技术的单核处理器已经成为DSP体系结构设计的新趋势。相似文献

4.

LS SIMD阵列微处理器的控制逻辑设计 总被引：10，自引：0，他引：10

李莉沈绪榜《计算机学报》2000,23(5):557-560

首先介绍了 L S SIMD阵列微处理器的三种并行性 :数据并行、流水线并行和指令的并行执行 .针对这三种并行性 ,阐述了控制逻辑的设计 . 相似文献

5.

面向粗粒度数据流网络处理器的混合定制硬件加速

李韬孙志刚《计算机工程与科学》2011,33(11):40

本文针对控制流网络处理器固定拓扑结构的限制及指令集并行性开发的不足,将粗粒度数据流设计思想引入到网络处理器体系结构设计中,提出了一种新型粗粒度数据流网络处理器体系结构-DynaNP。DynaNP利用处理引擎(PE)内控制流执行方式获得较高的可编程性,还利用PE间数据流执行方式开发了报文处理中的任务级并行性。为了进一步提高DynaNP的系统流量,面向DynaNP的多核及数据流特性,设计了混合定制硬件加速机制,并详细介绍了实现混合定制硬件加速的关键技术,通过提供统一的混合定制硬件加速接口,可以支持定制指令和协处理器两种典型硬件加速器。相似文献

6.

支持SIMD 与簇间双字传输体系下的VLIW DSP 分簇算法

陈思灵郑启龙冯玉谦付和萍《计算机系统应用》2012,21(10):100-104

VLIW DSP通过软件流水获得时间并行性,通过指令分簇获得空间并行性.指令的分簇本质上是资源分配问题.传统的指令分簇假设一条指令分到某一簇执行,而某些体系结构提供SIMD指令,传统的分簇算法对这类体系结构并不完全适用.提出的基于评估模型的分簇算法能对SIMD指令和普通指令进行合理的分簇.分簇之后,通过调度簇间传输指令,合成适当的簇间双字传输指令.由于SIMD和簇间双字传输的引入,以及较好的分簇决策,程序整体的调度延迟变短.对许多数字信号处理程序相对于没分簇的情况下的性能有2～3倍的性能提升,相对寄存器压力分簇算法有约7～10%性能的提升. 相似文献

7.

基于共享向量的二维SIMD调度算法

张为华臧斌宇王晔钱兴隆朱传琪《计算机学报》2006,29(10):1740-1749

针对目前二维SIMD结构编译技术研究的不足,结合二维SIMD结构中普遍采用的复用数据通路和寄存器少的限制和应用程序的特点,提出了一种解决数据向量复用的算法.该算法先使用数据向量的代表元计算各SIMD指令间数据向量的重用信息,再根据这些信息对SIMD指令进行调度.该算法可以有效缓解应用程序在二维SIMD结构执行时加载数据的压力,有效提高结构受限二维SIMD结构的并行性.实验数据显示,该算法对各种应用程序可获得平均2.97的加速比和平均3.86的SIMD指令级并行度. 相似文献

8.

BWDSP SIMD编译的寄存器分配优化技术研究

王昊王向前《单片机与嵌入式系统应用》2015,15(4)

BWDSP是一款自主设计的国产VLIW(超长指令字)数字信号处理器,支持SIMD技术,其SIMD指令可以在4个宏上同时执行4个32位计算,对寄存器使用有特殊规则,Open64编译器的寄存器分配策略并不适用于这种规则.本文对BWDSP SIMD指令的寄存器分配优化技术进行了研究,并在BWDSP的编译器OCC上得以实现. 相似文献

9.

EDGE结构上一种通过超块重组加速单线程应用的方法

魏学超安虹毛梦捷《小型微型计算机系统》2012,(10):2249-2254

Explicit Data Graph Execution(EDGE)ISA是一种专门为类数据流驱动的分片式众核处理器而设计的指令集体系结构.相较于传统的采用控制流驱动的处理器,EDGE结构以超块(Hyperblock)而不是单个指令作为其执行单位,在超块内部实现数据流执行,超块之间按照推测序保持控制流执行,有利于挖掘指令级并行性.但是,EDGE编译器按照程序的串行执行顺序组织超块,超块间和超块内部受限于数据依赖,削弱了整个程序运行时的潜在数据级并行性和线程级并行性,不利于发挥EDGE分片式结构的优势.本文通过分析EDGE编译器超块组织的特点,结合EDGE结构特有的执行模型,提出一种普适性的超块组织框架来模拟EDGE结构上多线程运行的效果,进一步挖掘EDGE结构运行串行单线程程序时的指令级并行性.本文选用TRIPS微处理器作为EDGE结构的实例处理器,利用矩阵乘法等三个实验验证了我们所提出的框架的可行性,实验结果表明这些应用在TRIPS上获得了较好的性能提升. 相似文献

10.

利用流SIMD扩展加速3D曲线网格的流线计算 总被引：4，自引：0，他引：4

张文李晓梅《计算机学报》2001,24(8):785-790

流线是一种基本的流场可视化技术,计算流线要耗费大量时间,Intel处理器（Pentium Ⅲ,Pentium4）提供流SIMD扩展（SSE）,支持指令级SIMD操作。3D曲线网格上的流线计算包含速度插值、数值积分、点定位等主要子过程,具有很高的内在SIMD并行性。通过将数据按SSE数据类型组织以及对主要子过程进行SIMD并行化,设计了线流计算的SSE算法。采用向量类库、嵌入汇编两种SSE编码方式分别实现SSE算法,并依据处理器的体系结构优化代码。测试结果表明：SSE大大加速了3D曲线网格的流线计算,向量类库方式比传统计算提高55％左右的性能,嵌入汇编提高75％左右。相似文献

11.

处理元操作自主性的分析与设计

李炜肖潇沈绪榜《计算机应用研究》2006,23(10):89-92

SIMD计算机在并行计算中有着广泛的应用,为了增加其并行处理的能力,一个很重要的方面就是如何提高其PE（Processing Element）的操作自治能力。通过在原有指令系统中加入卫式指令和伪跳转指令,并且加入少量相应的硬件,不仅能够大大提高PE的操作自主性从而实现数据和指令的高度并行,而且在低功耗方面也具有重要作用。相似文献

12.

Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP

《Parallel Computing》2013,39(10):586-602

Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code mixed with data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor resembles of single-instruction multiple-data (SIMD) engines into architectures exploiting ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions causes current SIMD engines to have limitations in handling memory alignment, data reorganization, and control flow. Many supporting instructions such as data permutations, address generations, and loop branches, are required to aid in the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that have the capabilities for combining VLIW or TTA processing with a unified scalar and long vector computations as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called multimedia coprocessor (MCP). This architecture includes following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively. 相似文献

13.

SIMD计算机的优化编译器设计 总被引：1，自引：1，他引：0

下载免费PDF全文

赵辉黄石《计算机工程》2009,35(1):201-203

利用处理器的相关资源,提高编译器优化性能和增强代码可适应性是SIMD处理器优化编译的关键。该文基于M语言和LSSIMD体系结构,结合现代编译器的编译技术,提出针对SIMD协处理器编译器的优化和实现方法,包括寄存器分配、单值合并、代码压缩等。实验结果表明,编译生成的目标代码准确、高效。相似文献

14.

类比推理协处理器中的流水线技术 总被引：1，自引：0，他引：1

沈旭昆《计算机研究与发展》1998,35(5):393-397

流水线是提高当代处理器性能的最重要技术，转移指令的处理策略直接影响流水线效率。文中讨论了类比推理协处理体系结构中采用的流流水线及转移指令处理策略，它是使该系统处理类比推理问题的速度比通用处理器快一个数量级以上的重要因素之一。相似文献

15.

A pipelined interface for high floating-point performance with precise exceptions

Iacobovici S. 《Micro, IEEE》1988,8(3):77-87

Two options are presented that were considered for a pipelined interface between a central processing unit (CPU) and a floating-point coprocessor (FPU), along with the CPU recovery mechanisms that provide precise floating-point exceptions for each option. The first option supports parallel execution of both floating-point and integer instructions, while the second option pipelines only the execution of floating-point instructions. The use of the second option in National Semiconductor's 32532/32580 processor cluster because it offers high performance with significantly lower complexity. The 32532 microprocessor features a pipelined slave protocol that hides the CPU-FPU communication overhead for most floating-point instructions by pipelining their execution. A simple recovery mechanism implemented within the CPU maintains the precision of floating-point exceptions. As a result, the 32532 microprocessor supports very high floating point performance without sacrificing software compatibility with previous Series 32000 CPU-FPU clusters.<> 相似文献

16.

LS SIMD计算机并行计算的面向对象仿真

张发存赵晓红沈绪榜《计算机工程与应用》2003,39(26):143-146

论文详细介绍了基于LSSIMD计算机的并行计算的面向对象仿真,提出了一个新颖的SIMD机的面向对象软件模型,并在PC机Windows平台上用MicrosoftVisualC++6.0编程实现。通过对一组数字图象采用不同的处理算法进行仿真计算、并与LSSIMD并行机的同样图象的相同算法的运行结果进行比较,证明该系统具有正确性、实用性和可靠性。相似文献

17.

A high‐performance sorting algorithm for multicore single‐instruction multiple‐data processors

Hiroshi Inoue Takao Moriyama Hideaki Komatsu Toshio Nakatani 《Software》2012,42(6):753-777

Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single‐instruction multiple‐data (SIMD) instructions and thread‐level parallelism. In this paper, we propose a new high‐performance sorting algorithm, called aligned‐access sort (AA‐sort), that exploits both the SIMD instructions and thread‐level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in‐core sorting phase and an out‐of‐core merging phase. The in‐core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out‐of‐core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA‐sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA‐sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32‐bit integers. Also, a parallel version of AA‐sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms. Copyright © 2011 John Wiley & Sons, Ltd. 相似文献