20 similar documents found; search took 468 ms
1.
Accelerating Streamline Computation on 3D Curvilinear Grids Using Streaming SIMD Extensions  (Cited 4 times: 0 self-citations, 4 by others)
Streamlines are a fundamental flow-field visualization technique, but computing them is time-consuming. Intel processors (Pentium III, Pentium 4) provide Streaming SIMD Extensions (SSE), which support instruction-level SIMD operations. Streamline computation on 3D curvilinear grids comprises velocity interpolation, numerical integration, and point location as its main sub-processes, and exhibits a high degree of inherent SIMD parallelism. By organizing the data according to SSE data types and parallelizing the main sub-processes, an SSE algorithm for streamline computation was designed. The algorithm was implemented in two coding styles, a vector class library and inline assembly, and the code was optimized for the processor's architecture. Test results show that SSE greatly accelerates streamline computation on 3D curvilinear grids: the vector class library implementation improves performance by about 55% over conventional computation, and the inline assembly implementation by about 75%.
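The sub-processes named above can be illustrated with a minimal sketch. This is not the paper's code: it assumes a regular grid (the paper uses curvilinear grids), and it mimics four-wide SSE lanes by advancing four streamlines at once with NumPy, packing one streamline front per lane. The function names are illustrative.

```python
import numpy as np

def interp_velocity(field, pos):
    """Trilinear velocity interpolation for four streamline fronts at once.
    field: (nx, ny, nz, 3) velocity samples on a regular grid (assumption).
    pos:   (4, 3) positions, one per emulated SIMD lane."""
    i = np.floor(pos).astype(int)   # cell index per lane
    f = pos - i                     # fractional offset per lane
    v = np.zeros((4, 3))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # trilinear weight of this cell corner, per lane
                w = (np.where(dx, f[:, 0], 1 - f[:, 0])
                     * np.where(dy, f[:, 1], 1 - f[:, 1])
                     * np.where(dz, f[:, 2], 1 - f[:, 2]))
                v += w[:, None] * field[i[:, 0] + dx, i[:, 1] + dy, i[:, 2] + dz]
    return v

def euler_step(field, pos, h=0.1):
    """One numerical-integration step (forward Euler) for all four lanes."""
    return pos + h * interp_velocity(field, pos)
```

In the SSE version described by the abstract, each of these per-lane operations maps to one packed-float instruction; point location on the curvilinear grid would add a cell-search step omitted here.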
2.
SIMD architectures effectively exploit the parallelism in multimedia and complex scientific computing, and have become a focus of industrial application and research. In large-scale SIMD architecture research, to ease the limit that FPGA chip capacity places on the scale of emulation systems, an FPGA paged emulation model suited to SIMD architectures is proposed. It effectively reduces the demand a SIMD design places on FPGA computation and storage resources and increases the verifiable scale of SIMD designs. Emulation experiments on the MASA stream processor show that, without any emulation optimization, the EP2S180 FPGA chip can support an emulated MASA of at most 8 clusters; with the paged emulation model, the maximum emulation scale on the EP2S180 grows to a 256-cluster MASA, with an acceptable increase in emulation time.
3.
4.
Automatically Extracting the SIMD and MIMD Heterogeneity of Programs  (Cited 1 time: 0 self-citations, 1 by others)
Heterogeneous computing is a new area of parallel processing that can be expected to achieve superlinear speedup. Extracting a program's heterogeneity is an important step in heterogeneous computing; work in this area is difficult, and its concepts and terminology have been vague. From the viewpoint of program structure and program execution, formal definitions of SIMD and MIMD parallelism are clearly given; they serve as the basis for extracting program heterogeneity. Two methods are also proposed, one based on program-structure transformation and one based on performance analysis, which together can serve as a framework for pioneering the automatic extraction of program heterogeneity.
5.
SIMD Code Generation for Multimedia Processors  (Cited 1 time: 0 self-citations, 1 by others)
The SIMD (Single Instruction Multiple Data) multimedia extensions of general-purpose processors provide new architectural support for improving the performance of multimedia applications, but current compiler technology does not support these instructions well. This paper proposes a new SIMD instruction generation algorithm: based on the idea of combining program analysis in the compiler front end with machine information from the back end, it uses an extended tree-parsing technique to effectively recognize parallel operations in a program and generate SIMD instructions. Experiments on the SUIF (Stanford University Intermediate Format) compiler framework show that, on a set of multimedia kernels, the proposed algorithm reduces the cycle count of the non-SIMD code by 47% on average.
6.
7.
Sun Haiyan. Computer Engineering & Science, 2017, 39(11): 1986-1990
As problem sizes grow and real-time requirements rise, SIMD vector processors, especially processors with vector arithmetic units, have come into wide industrial use. A program's run-time state on a processor is generally managed by the compiler through a stack, and existing compiler stack-design mechanisms seriously hurt overall application performance on SIMD architectures. Based on the characteristics of SIMD architectures, an efficient distributed stack design method, HEDSSA, is proposed. Experimental results show that the HEDSSA stack lets applications access stack data more efficiently during local data accesses, function calls, interrupts, and dynamic data allocation.
8.
To study how much SIMD improves processor performance in the embedded domain, Yolov3, a highly parallelizable image-processing algorithm, was chosen for SIMD vectorization. The Yolov3 code was modified according to the V (Vector) extension of the open-source RISC-V instruction set and deployed for verification on the VPU (Vector Processor Unit) of the WH64 processor developed in-house by 优矽科技; the performance gain of the SIMD algorithm was evaluated by combining Amdahl's law with the Yolov3 self-test program. Experimental results show that, running at a 50 MHz clock on a Xilinx Kintex-7 board and with the vectorized algorithm accounting for more than 90% of the run, the SIMD-processed code achieves a 2.25x speedup over scalar computation.
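The Amdahl's-law evaluation described above can be reproduced numerically. This sketch assumes the abstract's figures (vectorized fraction p = 0.9, overall speedup 2.25x) and shows what per-kernel speedup the vector unit itself must deliver to produce that overall number:

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work is accelerated by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

def implied_kernel_speedup(p, overall):
    """Invert Amdahl's law: per-kernel speedup implied by a measured overall speedup."""
    return p / (1.0 / overall - (1.0 - p))

# With 90% of the run vectorized and a measured 2.25x overall speedup,
# the vectorized portion itself must run about 2.61x faster than scalar code.
s_kernel = implied_kernel_speedup(0.9, 2.25)
```

Note how the 10% scalar remainder caps the achievable speedup at 1/(1-p) = 10x no matter how wide the vector unit is, which is why the abstract stresses that vectorized code accounts for over 90% of the run.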
9.
10.
Ultra-high-speed multipliers are key components of high-performance general-purpose microprocessors and media processors. This paper proposes a reconfigurable multiply-accumulator, together with its correction algorithm, based on a high-performance SIMD (Single Instruction Multiple Data) parallel processor architecture. Intended for multimedia data processing such as audio, video, and network communication, it overcomes the limitations inherent in traditional fixed-length data processing for multimedia applications and meets the requirements of next-generation high-performance computing.
11.
IP core implementation of a self-organizing neural network  (Cited 1 time: 0 self-citations, 1 by others)
This paper reports on the design issues and subsequent performance of a soft intellectual property (IP) core implementation of a self-organizing neural network. The design is a development of a previous 0.65-μm single silicon chip providing an array of 256 neurons, where each neuron stores a 16-element reference vector. Migrating the design to a soft IP core presents challenges in achieving the required performance as regards area, power, and clock speed. This same migration, however, offers opportunities for parameterizing the design in a manner which permits a single soft core to meet the requirements of many end users. Thus, the number of neurons within the single instruction multiple data (SIMD) array, the number of elements per reference vector, and the number of bits of each such element are defined by synthesis-time parameters. The construction of the SIMD array of neurons is presented, including performance results as regards power, area, and classifications per second. For typical parameters (256 neurons with 16 elements per reference vector) the design provides over 2,000,000 classifications per second using a mainstream 0.18-μm digital process. A RISC processor, the array controller (AC), provides both the instruction stream and data to the SIMD array of neurons and an interface to a host processor. The design of this processor is discussed with emphasis on the control aspects which permit supply of a continuous instruction stream to the SIMD array and a flexible interface with the host processor.
12.
Schneider B.-O., van Welzen J. IEEE Transactions on Visualization and Computer Graphics, 1998, 4(3): 272-285
SIMD processors have become popular architectures for multimedia. Though most of the 3D graphics pipeline can be implemented on such SIMD platforms in a straightforward manner, polygon clipping tends to cause clumsy and expensive interruptions to the SIMD pipeline. This paper describes a way to increase the efficiency of SIMD clipping without sacrificing the efficient flow of a SIMD graphics pipeline. In order to fully utilize the parallel execution units, we have developed two methods to avoid serialization of the execution stream: deferred clipping postpones polygon clipping and uses hardware assistance to buffer polygons that need to be clipped, while SIMD clipping partitions the actual polygon clipping procedure between the SIMD engine and a conventional RISC processor. To increase the efficiency of SIMD clipping, we introduce the concepts of clip-plane pairs and edge batching. Clip-plane pairs allow clipping a polygon against two clip planes without introducing corner vertices. Edge batching reduces the communication and control overhead for starting clipping on the SIMD engine.
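The underlying operation the paper accelerates is clipping a polygon against one half-plane (the Sutherland-Hodgman step). The scalar sketch below shows only that baseline; the paper's contributions are batching these edge tests on the SIMD engine and pairing clip planes, which are not reproduced here.

```python
def clip_polygon(poly, inside, intersect):
    """Clip a polygon against one half-plane.
    poly: list of (x, y) vertices; inside(p) -> bool; intersect(p, q) -> point."""
    out = []
    for i, cur in enumerate(poly):
        prev = poly[i - 1]                      # wraps around to the last vertex
        if inside(cur):
            if not inside(prev):
                out.append(intersect(prev, cur))  # edge enters: add crossing point
            out.append(cur)
        elif inside(prev):
            out.append(intersect(prev, cur))      # edge leaves: add crossing point
    return out

# Illustrative half-plane x <= 1 and its edge/plane intersection.
def inside(p):
    return p[0] <= 1.0

def isect(p, q):
    t = (1.0 - p[0]) / (q[0] - p[0])
    return (1.0, p[1] + t * (q[1] - p[1]))

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
clipped = clip_polygon(square, inside, isect)   # left half of the square
```

Each loop iteration is an independent edge test, which is why the paper can batch edges across SIMD lanes; the serial dependency is only in assembling the output list.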
13.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1982, (4): 319-331
This paper examines measures for evaluating the performance of algorithms for single instruction stream–multiple data stream (SIMD) machines. The SIMD mode of parallelism involves using a large number of processors synchronized together. All processors execute the same instruction at the same time; however, each processor operates on a different data item. The complexity of parallel algorithms is, in general, a function of the machine size (number of processors), problem size, and type of interconnection network used to provide communications among the processors. Measures which quantify the effect of changing the machine-size/problem-size/network-type relationships are therefore needed. A number of such measures are presented and are applied to an example SIMD algorithm from the image processing problem domain. The measures discussed and compared include execution time, speed, parallel efficiency, overhead ratio, processor utilization, redundancy, cost effectiveness, speed-up of the parallel algorithm over the corresponding serial algorithm, and an additive measure called "sprice" which assigns a weighted value to computations and processors.
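Several of the measures listed above follow directly from three quantities: serial time, parallel time, and machine size. A minimal sketch, using common textbook definitions (the paper's exact formulations, and measures like "sprice", may differ):

```python
def measures(t_serial, t_parallel, n_procs, busy_proc_time=None):
    """Classical SIMD performance measures from timing data.
    busy_proc_time: total processor-seconds spent doing useful work (optional)."""
    speedup = t_serial / t_parallel
    efficiency = speedup / n_procs                      # fraction of ideal speedup
    # Overhead ratio: extra processor-time beyond the serial work (assumed definition).
    overhead = (n_procs * t_parallel - t_serial) / t_serial
    util = (busy_proc_time / (n_procs * t_parallel)) if busy_proc_time else None
    return {"speedup": speedup, "efficiency": efficiency,
            "overhead_ratio": overhead, "utilization": util}

# Example: 16 processors turn a 100 s serial job into an 8 s parallel one.
m = measures(t_serial=100.0, t_parallel=8.0, n_procs=16, busy_proc_time=112.0)
```

Here the 12.5x speedup on 16 processors gives 78% efficiency; the gap between the two is exactly what the interconnection-network and synchronization costs discussed in the paper account for.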
14.
Di Blas A., Dahle D.M., Diekhans M., Grate L., Hirschberg J., Karplus K., Keller H., Kendrick M., Mesa-Martinez F.J., Pease D., Rice E., Schultz A., Speck D., Hughey R. IEEE Transactions on Parallel and Distributed Systems, 2005, 16(1): 80-92
The architectural landscape of high-performance computing stretches from superscalar uniprocessor to explicitly parallel systems, to dedicated hardware implementations of algorithms. Single-purpose hardware can achieve the highest performance and uniprocessors can be the most programmable. Between these extremes, programmable and reconfigurable architectures provide a wide range of choice in flexibility, programmability, computational density, and performance. The UCSC Kestrel parallel processor strives to attain single-purpose performance while maintaining user programmability. Kestrel is a single-instruction stream, multiple-data stream (SIMD) parallel processor with a 512-element linear array of 8-bit processing elements. The system design focuses on efficient high-throughput DNA and protein sequence analysis, but its programmability enables high performance on computational chemistry, image processing, machine learning, and other applications. The Kestrel system has had unexpected longevity in its utility due to a careful design and analysis process. Experience with the system leads to the conclusion that programmable SIMD architectures can excel in both programmability and performance. This work presents the architecture, implementation, applications, and observations of the Kestrel project at the University of California at Santa Cruz.
15.
The paper first introduces the algorithmic principles of the G.729 speech codec and the architecture and performance of the VFASTDSP embedded SIMD processor chip, then focuses on the VP6 assembly code optimization, the scheduling strategy, and the design and optimization of the parallel algorithms for each functional module during system implementation. Practice shows that, over a 384K network bandwidth, the optimized codec delivers calls with no perceptible delay and excellent subjective voice quality, meeting commercial requirements.
16.
In this paper emerging parallel/distributed architectures are explored for the digital VLSI implementation of an adaptive bidirectional associative memory (BAM) neural network. A single instruction stream many data stream (SIMD)-based parallel processing architecture is developed for the adaptive BAM neural network, taking advantage of the inherent parallelism in BAM. This novel neural processor architecture is named the sliding feeder BAM array processor (SLiFBAM). The SLiFBAM processor can be viewed as a two-stroke neural processing engine. It has four operating modes: learn pattern, evaluate pattern, read weight, and write weight. Design of a SLiFBAM VLSI processor chip is also described. By using 2-μm scalable CMOS technology, a SLiFBAM processor chip with 4+4 neurons and eight modules of 256×5-bit local weight-storage SRAM was integrated on a 6.9×7.4 mm² prototype die. The system architecture is highly flexible and modular, enabling the construction of larger BAM networks of up to 252 neurons using multiple SLiFBAM chips.
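The BAM computation that SLiFBAM's "learn pattern" and "evaluate pattern" modes parallelize can be sketched with the standard bipolar BAM equations (textbook formulation, not the chip's exact arithmetic; the pattern sizes here are illustrative, not the chip's 4+4 neuron configuration):

```python
import numpy as np

def train_bam(pairs):
    """Hebbian outer-product learning: W = sum of x y^T over bipolar pairs."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    W = np.zeros((n, m))
    for x, y in pairs:
        W += np.outer(x, y)
    return W

def recall(W, x, steps=10):
    """Bidirectional recall: bounce activations between the two layers
    (x -> sign(W^T x), y -> sign(W y)) until the pair stabilizes."""
    for _ in range(steps):
        y = np.sign(W.T @ x)
        x = np.sign(W @ y)
    return x, y
```

In the chip, each neuron holds one row of `W` locally, so the matrix-vector products above become one synchronized SIMD step per layer, which is exactly the parallelism the sliding-feeder architecture exploits.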
17.
In the last decade, the volume of unstructured data that Internet and enterprise applications create and consume has been growing at impressive rates. The tools we use to process these data are search engines, business analytics suites, natural-language processors and XML processors. These tools rely on tokenization, a form of regular expression matching aimed at extracting words and keywords in a character stream. The further growth of unstructured data-processing paradigms depends critically on the availability of high-performance tokenizers. Despite the impressive amount of parallelism that the multi-core revolution has made available (in terms of multiple threads and wider SIMD units), most applications employ tokenizers that do not exploit this parallelism. I present a technique to design tokenizers that exploit multiple threads and wide SIMD units to process multiple independent streams of data at a high throughput. The technique benefits indefinitely from any future scaling in the number of threads or SIMD width. I show the approach's viability by presenting a family of tokenizer kernels optimized for the Cell/B.E. processor that deliver a performance seen, so far, only on dedicated hardware. These kernels deliver a peak throughput of 14.30 Gbps per chip, and a typical throughput of 9.76 Gbps on Wikipedia input. Also, they achieve almost-ideal resource utilization (99.2%). The approach is applicable to any SIMD-enabled processor and matches well the trend toward wider SIMD units in contemporary architecture design.
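The multi-stream idea above, keeping every lane busy by giving each its own independent character stream, can be emulated in plain Python. This is a conceptual sketch only: each "lane" extracts one token per round, where a real SIMD kernel would execute the whole round as one vector step. The token pattern is an illustrative stand-in for the paper's tokenizers.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9]+")   # illustrative word/keyword pattern

def tokenize_streams(streams):
    """Advance all streams in lockstep, one token per 'lane' per round."""
    scanners = [TOKEN.finditer(s) for s in streams]
    out = [[] for _ in streams]
    live = set(range(len(streams)))
    while live:
        for i in list(live):
            m = next(scanners[i], None)
            if m is None:
                live.discard(i)       # this lane has drained its stream
            else:
                out[i].append(m.group())
        # In the SIMD kernel, this entire round is one wide vector operation.
    return out
```

Because the streams are independent, throughput scales with the number of lanes until the streams drain at different rates, which is why the paper's technique keeps benefiting from wider SIMD units.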
18.
Jung-Wook Park Hoon-Mo Yang Gi-Ho Park Shin-Dug Kim Charles C. Weems 《Journal of Parallel and Distributed Computing》2010
In order to guarantee both performance and programmability demands in 3D graphics applications, vector and multithreaded SIMD architectures have been employed in recent graphics processing units. This paper introduces a novel instruction-systolic array architecture, which transfers an instruction stream in a pipelined fashion to efficiently share the expensive functional resources of a graphics processor. Specifically, cache misses and dynamic branches can cause additional latencies and complicated management in these parallel architectures. To address this problem, we combine a systolic execution scheme with on-demand warp activation that handles cache miss latency and branch divergence efficiently without significantly increasing hardware resources, either in terms of logic or register space. Simulation indicates that the proposed architecture offers 25% better performance than a traditional SIMD architecture with the same resources, and requires significantly fewer resources to match the performance of a typical modern vector multi-threaded GPU architecture.
19.