20 similar documents found; search took 468 ms
1.
Accelerating Streamline Computation on 3D Curvilinear Grids Using Streaming SIMD Extensions  (Cited 4 times: 0 self-citations, 4 by others)
Streamlines are a fundamental flow-field visualization technique, but computing them is time-consuming. Intel processors (Pentium III, Pentium 4) provide Streaming SIMD Extensions (SSE), which support instruction-level SIMD operations. Streamline computation on 3D curvilinear grids comprises velocity interpolation, numerical integration, and point location as its main sub-processes, and exhibits a high degree of inherent SIMD parallelism. By organizing the data according to SSE data types and parallelizing the main sub-processes, an SSE algorithm for streamline computation was designed. The algorithm was implemented in two coding styles, a vector class library and inline assembly, and the code was optimized for the processor's architecture. Test results show that SSE greatly accelerates streamline computation on 3D curvilinear grids: the vector class library implementation improves performance by about 55% over conventional computation, and the inline assembly implementation by about 75%.
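The sub-processes named above can be illustrated with a minimal sketch. This is not the paper's code: it assumes a regular grid (the paper uses curvilinear grids), and it mimics four-wide SSE lanes by advancing four streamlines at once with NumPy, packing one streamline front per lane. The function names are illustrative.

```python
import numpy as np

def interp_velocity(field, pos):
    """Trilinear velocity interpolation for four streamline fronts at once.
    field: (nx, ny, nz, 3) velocity samples on a regular grid (assumption).
    pos:   (4, 3) positions, one per emulated SIMD lane."""
    i = np.floor(pos).astype(int)   # cell index per lane
    f = pos - i                     # fractional offset per lane
    v = np.zeros((4, 3))
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # trilinear weight of this cell corner, per lane
                w = (np.where(dx, f[:, 0], 1 - f[:, 0])
                     * np.where(dy, f[:, 1], 1 - f[:, 1])
                     * np.where(dz, f[:, 2], 1 - f[:, 2]))
                v += w[:, None] * field[i[:, 0] + dx, i[:, 1] + dy, i[:, 2] + dz]
    return v

def euler_step(field, pos, h=0.1):
    """One numerical-integration step (forward Euler) for all four lanes."""
    return pos + h * interp_velocity(field, pos)
```

In the SSE version described by the abstract, each of these per-lane operations maps to one packed-float instruction; point location on the curvilinear grid would add a cell-search step omitted here.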
2.
SIMD architectures effectively exploit the parallelism in multimedia and complex scientific computing, and have become a focus of industrial application and research. In large-scale SIMD architecture research, to ease the limit that FPGA chip capacity places on the scale of emulation systems, an FPGA paged emulation model suited to SIMD architectures is proposed. It effectively reduces the demand a SIMD design places on FPGA computation and storage resources and increases the verifiable scale of SIMD designs. Emulation experiments on the MASA stream processor show that, without any emulation optimization, the EP2S180 FPGA chip can support an emulated MASA of at most 8 clusters; with the paged emulation model, the maximum emulation scale on the EP2S180 grows to a 256-cluster MASA, with an acceptable increase in emulation time.
3.
4.
Automatically Extracting the SIMD and MIMD Heterogeneity of Programs  (Cited 1 time: 0 self-citations, 1 by others)
Heterogeneous computing is a new area of parallel processing that can be expected to achieve superlinear speedup. Extracting a program's heterogeneity is an important step in heterogeneous computing; work in this area is difficult, and its concepts and terminology have been vague. From the viewpoint of program structure and program execution, formal definitions of SIMD and MIMD parallelism are clearly given; they serve as the basis for extracting program heterogeneity. Two methods are also proposed, one based on program-structure transformation and one based on performance analysis, which together can serve as a framework for pioneering the automatic extraction of program heterogeneity.
5.
SIMD Code Generation for Multimedia Processors  (Cited 1 time: 0 self-citations, 1 by others)
The SIMD (Single Instruction Multiple Data) multimedia extensions of general-purpose processors provide new architectural support for improving the performance of multimedia applications, but current compiler technology does not support these instructions well. This paper proposes a new SIMD instruction generation algorithm: based on the idea of combining program analysis in the compiler front end with machine information from the back end, it uses an extended tree-parsing technique to effectively recognize parallel operations in a program and generate SIMD instructions. Experiments on the SUIF (Stanford University Intermediate Format) compiler framework show that, on a set of multimedia kernels, the proposed algorithm reduces the cycle count of the non-SIMD code by 47% on average.
6.
7.
Sun Haiyan. Computer Engineering & Science, 2017, 39(11): 1986-1990
As problem sizes grow and real-time requirements rise, SIMD vector processors, especially processors with vector arithmetic units, have come into wide industrial use. A program's run-time state on a processor is generally managed by the compiler through a stack, and existing compiler stack-design mechanisms seriously hurt overall application performance on SIMD architectures. Based on the characteristics of SIMD architectures, an efficient distributed stack design method, HEDSSA, is proposed. Experimental results show that the HEDSSA stack lets applications access stack data more efficiently during local data accesses, function calls, interrupts, and dynamic data allocation.
8.
To study how much SIMD improves processor performance in the embedded domain, Yolov3, a highly parallelizable image-processing algorithm, was chosen for SIMD vectorization. The Yolov3 code was modified according to the V (Vector) extension of the open-source RISC-V instruction set and deployed for verification on the VPU (Vector Processor Unit) of the WH64 processor developed in-house by 优矽科技; the performance gain of the SIMD algorithm was evaluated by combining Amdahl's law with the Yolov3 self-test program. Experimental results show that, running at a 50 MHz clock on a Xilinx Kintex-7 board and with the vectorized algorithm accounting for more than 90% of the run, the SIMD-processed code achieves a 2.25x speedup over scalar computation.
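The Amdahl's-law evaluation described above can be reproduced numerically. This sketch assumes the abstract's figures (vectorized fraction p = 0.9, overall speedup 2.25x) and shows what per-kernel speedup the vector unit itself must deliver to produce that overall number:

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work is accelerated by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

def implied_kernel_speedup(p, overall):
    """Invert Amdahl's law: per-kernel speedup implied by a measured overall speedup."""
    return p / (1.0 / overall - (1.0 - p))

# With 90% of the run vectorized and a measured 2.25x overall speedup,
# the vectorized portion itself must run about 2.61x faster than scalar code.
s_kernel = implied_kernel_speedup(0.9, 2.25)
```

Note how the 10% scalar remainder caps the achievable speedup at 1/(1-p) = 10x no matter how wide the vector unit is, which is why the abstract stresses that vectorized code accounts for over 90% of the run.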
9.
10.
Ultra-high-speed multipliers are key components of high-performance general-purpose microprocessors and media processors. This paper proposes a reconfigurable multiply-accumulator, together with its correction algorithm, based on a high-performance SIMD (Single Instruction Multiple Data) parallel processor architecture. Intended for multimedia data processing such as audio, video, and network communication, it overcomes the limitations inherent in traditional fixed-length data processing for multimedia applications and meets the requirements of next-generation high-performance computing.
11.
IP core implementation of a self-organizing neural network  (Cited 1 time: 0 self-citations, 1 by others)
This paper reports on the design issues and subsequent performance of a soft intellectual property (IP) core implementation of a self-organizing neural network. The design is a development of a previous 0.65-μm single silicon chip providing an array of 256 neurons, where each neuron stores a 16-element reference vector. Migrating the design to a soft IP core presents challenges in achieving the required performance as regards area, power, and clock speed. This same migration, however, offers opportunities for parameterizing the design in a manner which permits a single soft core to meet the requirements of many end users. Thus, the number of neurons within the single instruction multiple data (SIMD) array, the number of elements per reference vector, and the number of bits of each such element are defined by synthesis-time parameters. The construction of the SIMD array of neurons is presented, including performance results as regards power, area, and classifications per second. For typical parameters (256 neurons with 16 elements per reference vector) the design provides over 2,000,000 classifications per second using a mainstream 0.18-μm digital process. A RISC processor, the array controller (AC), provides both the instruction stream and data to the SIMD array of neurons and an interface to a host processor. The design of this processor is discussed with emphasis on the control aspects which permit supply of a continuous instruction stream to the SIMD array and a flexible interface with the host processor.
12.
Schneider B.-O., van Welzen J. IEEE Transactions on Visualization and Computer Graphics, 1998, 4(3): 272-285
SIMD processors have become popular architectures for multimedia. Though most of the 3D graphics pipeline can be implemented on such SIMD platforms in a straightforward manner, polygon clipping tends to cause clumsy and expensive interruptions to the SIMD pipeline. This paper describes a way to increase the efficiency of SIMD clipping without sacrificing the efficient flow of a SIMD graphics pipeline. In order to fully utilize the parallel execution units, we have developed two methods to avoid serialization of the execution stream: deferred clipping postpones polygon clipping and uses hardware assistance to buffer polygons that need to be clipped, while SIMD clipping partitions the actual polygon clipping procedure between the SIMD engine and a conventional RISC processor. To increase the efficiency of SIMD clipping, we introduce the concepts of clip-plane pairs and edge batching. Clip-plane pairs allow clipping a polygon against two clip planes without introducing corner vertices. Edge batching reduces the communication and control overhead for starting clipping on the SIMD engine.
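The underlying operation the paper accelerates is clipping a polygon against one half-plane (the Sutherland-Hodgman step). The scalar sketch below shows only that baseline; the paper's contributions are batching these edge tests on the SIMD engine and pairing clip planes, which are not reproduced here.

```python
def clip_polygon(poly, inside, intersect):
    """Clip a polygon against one half-plane.
    poly: list of (x, y) vertices; inside(p) -> bool; intersect(p, q) -> point."""
    out = []
    for i, cur in enumerate(poly):
        prev = poly[i - 1]                      # wraps around to the last vertex
        if inside(cur):
            if not inside(prev):
                out.append(intersect(prev, cur))  # edge enters: add crossing point
            out.append(cur)
        elif inside(prev):
            out.append(intersect(prev, cur))      # edge leaves: add crossing point
    return out

# Illustrative half-plane x <= 1 and its edge/plane intersection.
def inside(p):
    return p[0] <= 1.0

def isect(p, q):
    t = (1.0 - p[0]) / (q[0] - p[0])
    return (1.0, p[1] + t * (q[1] - p[1]))

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
clipped = clip_polygon(square, inside, isect)   # left half of the square
```

Each loop iteration is an independent edge test, which is why the paper can batch edges across SIMD lanes; the serial dependency is only in assembling the output list.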
13.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1982, (4): 319-331
This paper examines measures for evaluating the performance of algorithms for single instruction stream–multiple data stream (SIMD) machines. The SIMD mode of parallelism involves using a large number of processors synchronized together. All processors execute the same instruction at the same time; however, each processor operates on a different data item. The complexity of parallel algorithms is, in general, a function of the machine size (number of processors), problem size, and type of interconnection network used to provide communications among the processors. Measures which quantify the effect of changing the machine-size/problem-size/network-type relationships are therefore needed. A number of such measures are presented and are applied to an example SIMD algorithm from the image processing problem domain. The measures discussed and compared include execution time, speed, parallel efficiency, overhead ratio, processor utilization, redundancy, cost effectiveness, speed-up of the parallel algorithm over the corresponding serial algorithm, and an additive measure called "sprice" which assigns a weighted value to computations and processors.
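Several of the measures listed above follow directly from three quantities: serial time, parallel time, and machine size. A minimal sketch, using common textbook definitions (the paper's exact formulations, and measures like "sprice", may differ):

```python
def measures(t_serial, t_parallel, n_procs, busy_proc_time=None):
    """Classical SIMD performance measures from timing data.
    busy_proc_time: total processor-seconds spent doing useful work (optional)."""
    speedup = t_serial / t_parallel
    efficiency = speedup / n_procs                      # fraction of ideal speedup
    # Overhead ratio: extra processor-time beyond the serial work (assumed definition).
    overhead = (n_procs * t_parallel - t_serial) / t_serial
    util = (busy_proc_time / (n_procs * t_parallel)) if busy_proc_time else None
    return {"speedup": speedup, "efficiency": efficiency,
            "overhead_ratio": overhead, "utilization": util}

# Example: 16 processors turn a 100 s serial job into an 8 s parallel one.
m = measures(t_serial=100.0, t_parallel=8.0, n_procs=16, busy_proc_time=112.0)
```

Here the 12.5x speedup on 16 processors gives 78% efficiency; the gap between the two is exactly what the interconnection-network and synchronization costs discussed in the paper account for.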
14.
Di Blas A., Dahle D.M., Diekhans M., Grate L., Hirschberg J., Karplus K., Keller H., Kendrick M., Mesa-Martinez F.J., Pease D., Rice E., Schultz A., Speck D., Hughey R. IEEE Transactions on Parallel and Distributed Systems, 2005, 16(1): 80-92
The architectural landscape of high-performance computing stretches from superscalar uniprocessor to explicitly parallel systems, to dedicated hardware implementations of algorithms. Single-purpose hardware can achieve the highest performance and uniprocessors can be the most programmable. Between these extremes, programmable and reconfigurable architectures provide a wide range of choice in flexibility, programmability, computational density, and performance. The UCSC Kestrel parallel processor strives to attain single-purpose performance while maintaining user programmability. Kestrel is a single-instruction stream, multiple-data stream (SIMD) parallel processor with a 512-element linear array of 8-bit processing elements. The system design focuses on efficient high-throughput DNA and protein sequence analysis, but its programmability enables high performance on computational chemistry, image processing, machine learning, and other applications. The Kestrel system has had unexpected longevity in its utility due to a careful design and analysis process. Experience with the system leads to the conclusion that programmable SIMD architectures can excel in both programmability and performance. This work presents the architecture, implementation, applications, and observations of the Kestrel project at the University of California at Santa Cruz.
15.
The paper first introduces the algorithmic principles of the G.729 speech codec and the architecture and performance of the VFASTDSP embedded SIMD processor chip, then focuses on the VP6 assembly code optimization, the scheduling strategy, and the design and optimization of the parallel algorithms for each functional module during system implementation. Practice shows that, over a 384K network bandwidth, the optimized codec delivers calls with no perceptible delay and excellent subjective voice quality, meeting commercial requirements.
16.
In this paper emerging parallel/distributed architectures are explored for the digital VLSI implementation of an adaptive bidirectional associative memory (BAM) neural network. A single instruction stream many data stream (SIMD)-based parallel processing architecture is developed for the adaptive BAM neural network, taking advantage of the inherent parallelism in BAM. This novel neural processor architecture is named the sliding feeder BAM array processor (SLiFBAM). The SLiFBAM processor can be viewed as a two-stroke neural processing engine. It has four operating modes: learn pattern, evaluate pattern, read weight, and write weight. Design of a SLiFBAM VLSI processor chip is also described. By using 2-μm scalable CMOS technology, a SLiFBAM processor chip with 4+4 neurons and eight modules of 256×5-bit local weight-storage SRAM was integrated on a 6.9×7.4 mm² prototype die. The system architecture is highly flexible and modular, enabling the construction of larger BAM networks of up to 252 neurons using multiple SLiFBAM chips.
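The BAM computation that SLiFBAM's "learn pattern" and "evaluate pattern" modes parallelize can be sketched with the standard bipolar BAM equations (textbook formulation, not the chip's exact arithmetic; the pattern sizes here are illustrative, not the chip's 4+4 neuron configuration):

```python
import numpy as np

def train_bam(pairs):
    """Hebbian outer-product learning: W = sum of x y^T over bipolar pairs."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    W = np.zeros((n, m))
    for x, y in pairs:
        W += np.outer(x, y)
    return W

def recall(W, x, steps=10):
    """Bidirectional recall: bounce activations between the two layers
    (x -> sign(W^T x), y -> sign(W y)) until the pair stabilizes."""
    for _ in range(steps):
        y = np.sign(W.T @ x)
        x = np.sign(W @ y)
    return x, y
```

In the chip, each neuron holds one row of `W` locally, so the matrix-vector products above become one synchronized SIMD step per layer, which is exactly the parallelism the sliding-feeder architecture exploits.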
17.
In the last decade, the volume of unstructured data that Internet and enterprise applications create and consume has been growing at impressive rates. The tools we use to process these data are search engines, business analytics suites, natural-language processors and XML processors. These tools rely on tokenization, a form of regular expression matching aimed at extracting words and keywords in a character stream. The further growth of unstructured data-processing paradigms depends critically on the availability of high-performance tokenizers. Despite the impressive amount of parallelism that the multi-core revolution has made available (in terms of multiple threads and wider SIMD units), most applications employ tokenizers that do not exploit this parallelism. I present a technique to design tokenizers that exploit multiple threads and wide SIMD units to process multiple independent streams of data at a high throughput. The technique benefits indefinitely from any future scaling in the number of threads or SIMD width. I show the approach's viability by presenting a family of tokenizer kernels optimized for the Cell/B.E. processor that deliver a performance seen, so far, only on dedicated hardware. These kernels deliver a peak throughput of 14.30 Gbps per chip, and a typical throughput of 9.76 Gbps on Wikipedia input. Also, they achieve almost-ideal resource utilization (99.2%). The approach is applicable to any SIMD-enabled processor and matches well the trend toward wider SIMD units in contemporary architecture design.
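The multi-stream idea above, keeping every lane busy by giving each its own independent character stream, can be emulated in plain Python. This is a conceptual sketch only: each "lane" extracts one token per round, where a real SIMD kernel would execute the whole round as one vector step. The token pattern is an illustrative stand-in for the paper's tokenizers.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9]+")   # illustrative word/keyword pattern

def tokenize_streams(streams):
    """Advance all streams in lockstep, one token per 'lane' per round."""
    scanners = [TOKEN.finditer(s) for s in streams]
    out = [[] for _ in streams]
    live = set(range(len(streams)))
    while live:
        for i in list(live):
            m = next(scanners[i], None)
            if m is None:
                live.discard(i)       # this lane has drained its stream
            else:
                out[i].append(m.group())
        # In the SIMD kernel, this entire round is one wide vector operation.
    return out
```

Because the streams are independent, throughput scales with the number of lanes until the streams drain at different rates, which is why the paper's technique keeps benefiting from wider SIMD units.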
18.
Jung-Wook Park Hoon-Mo Yang Gi-Ho Park Shin-Dug Kim Charles C. Weems 《Journal of Parallel and Distributed Computing》2010
In order to guarantee both performance and programmability demands in 3D graphics applications, vector and multithreaded SIMD architectures have been employed in recent graphics processing units. This paper introduces a novel instruction-systolic array architecture, which transfers an instruction stream in a pipelined fashion to efficiently share the expensive functional resources of a graphics processor. Specifically, cache misses and dynamic branches can cause additional latencies and complicated management in these parallel architectures. To address this problem, we combine a systolic execution scheme with on-demand warp activation that handles cache miss latency and branch divergence efficiently without significantly increasing hardware resources, either in terms of logic or register space. Simulation indicates that the proposed architecture offers 25% better performance than a traditional SIMD architecture with the same resources, and requires significantly fewer resources to match the performance of a typical modern vector multi-threaded GPU architecture.
19.