1.

Monte Carlo simulation of neutron diffusion on SIMD architectures

A. McKerrell L. M. Delves 《Parallel Computing》1988,8(1-3):363-370

An algorithm is described for the Monte Carlo simulation of neutron diffusion, including the treatment of fission neutrons, which is efficient on a computer with SIMD architecture. Tests carried out on the ICL DAP give results which compare favourably for timing with those obtained using standard UKAEA serial programs.

2.

Mapping massive SIMD parallelism onto vector architectures for simulation

Jonathan B. Rosenberg Jonathan D. Becker 《Software》1989,19(8):739-756

A software behavioural simulator for a new massively parallel single-instruction/multiple-data (SIMD) architecture has been developed that can accurately simulate the entire 16,384 bit-serial processor array. The key to this high performance modelling is the exploitation of an inherent mapping that exists between massively parallel SIMD architectures and the vector architectures used in many high performance scientific super-computers. The new SIMD architecture, called BLITZEN, is based on the Massively Parallel Processor (MPP) built for NASA by Goodyear in the late 1970s. By simulating the full-scale machine with very high performance, the simulator allows development of algorithms and high-level software to proceed before realization of the hardware. This paper describes the SIMD-to-vector architecture mapping, the highly vectorized simulator in which it is used, and how the result was a simulator that achieved a level of performance three orders of magnitude faster than the conventional uniprocessor approach.
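
The idea of mapping a bit-serial SIMD array onto wide vector operations can be illustrated with a small sketch. This is not the BLITZEN simulator; it assumes a hypothetical layout in which each register is stored as bit planes, one Python integer per bit position, with bit i of a plane belonging to processing element (PE) i. A single bitwise operation on a plane then updates every PE at once, much as one vector instruction would. The function names and the 8-bit register width are illustrative assumptions.

```python
# Hypothetical sketch: simulate a bit-serial SIMD array with word-wide operations.
# Bit i of each "plane" integer belongs to processing element (PE) i.

def to_planes(values, width):
    """Pack one integer per PE into `width` bit planes (least significant first)."""
    planes = []
    for bit in range(width):
        plane = 0
        for pe, v in enumerate(values):
            plane |= ((v >> bit) & 1) << pe
        planes.append(plane)
    return planes

def from_planes(planes, num_pes):
    """Unpack bit planes back into one integer per PE."""
    return [
        sum(((plane >> pe) & 1) << bit for bit, plane in enumerate(planes))
        for pe in range(num_pes)
    ]

def simd_add(a_planes, b_planes):
    """Bit-serial ripple-carry add, performed for all PEs simultaneously."""
    carry = 0
    out = []
    for a, b in zip(a_planes, b_planes):
        out.append(a ^ b ^ carry)            # sum bit for every PE
        carry = (a & b) | (carry & (a ^ b))  # carry bit for every PE
    return out  # result is modulo 2**width; the final carry is dropped

NUM_PES, WIDTH = 16384, 8  # 16,384 PEs as in the abstract; width is assumed
a = [pe % 100 for pe in range(NUM_PES)]
b = [(3 * pe) % 50 for pe in range(NUM_PES)]
result = from_planes(simd_add(to_planes(a, WIDTH), to_planes(b, WIDTH)), NUM_PES)
```

The addition loop runs once per bit position, not once per PE, so its cost is independent of the array size apart from the width of the underlying word or vector operation.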

3.

Vector data flow analysis for SIMD optimizations on OpenCL programs

Yu‐Te Lin Jenq‐Kuen Lee 《Concurrency and Computation》2016,28(5):1629-1654

Multi‐core systems equipped with micro processing units and accelerators such as digital signal processors (DSPs) and graphics processing units (GPUs) have become a major trend in processor design in recent years in attempts to meet ever‐increasing application performance requirements. Open Computing Language (OpenCL) is one of the programming languages that include new extensions proposed to exploit the computing power of these kinds of processors. Among the newly extended language features, the single‐instruction multiple‐data (SIMD) linguistics and vector types are added to OpenCL to exploit hardware features of the accelerators. The addition makes it necessary to consider how traditional compiler data flow analysis can be adapted to meet the optimization requirements of vector linguistics. In this paper, we propose a calculus framework to support the data flow analysis of vector constructs for OpenCL programs that compilers can use to perform SIMD optimizations. We model OpenCL vector operations as data access functions in the style of mathematical functions. We then show that the data flow analysis for OpenCL vector linguistics can be performed based on the data access functions. Based on the information gathered from data flow analysis, we illustrate a set of SIMD optimizations on OpenCL programs. The experimental results incorporating our calculus and our proposed compiler optimizations show that the proposed SIMD optimizations can provide average performance improvements of 22% on x86 CPUs and 4% on AMD GPUs. Of the 15 selected benchmarks, 11 are improved on x86 CPUs, and six are improved on AMD GPUs. The proposed framework has the potential to be used to construct other SIMD optimizations on OpenCL programs. Copyright © 2015 John Wiley & Sons, Ltd.

4.

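RealVideo deblocking filter optimization based on "Loongson" SIMD technology

吴少刚 刘波 《计算机工程与设计》2009,30(3)

This paper introduces the filtering process of the deblocking filter in RealVideo decoding, analyzes the filter's complexity, and optimizes it on the "Loongson" platform using multimedia instructions. Experimental results show that the deblocking filter in the RealVideo decoder effectively improves the subjective and objective quality of the image, and that the multimedia-instruction-based optimization handles the parallelism of multimedia data processing well, reduces filtering time, and improves the playback capability for RealVideo files on the "Loongson" platform.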

5.

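A two-dimensional FFT optimization algorithm based on Intel SIMD instructions (cited by: 1; self-citations: 0; citations by others: 1)

李成军 周卫峰 朱重光 《计算机工程与应用》2007,43(5):41-44

In frequency-domain image processing algorithms for large data volumes, the most time-consuming step is the two-dimensional FFT of the image data. To address this problem, the paper proposes a two-dimensional FFT optimization algorithm based on Intel SIMD instructions. By organizing data in a layout convenient for SIMD computation, using SSE3 instructions to accelerate complex multiplication, and optimizing for the processor cache in the two-dimensional processing, the algorithm achieves very high performance. Experimental results show that the described algorithm is about 30% faster than FFTW, currently the most widely used public-domain FFT package. It meets the requirement of fast processing of large images and has considerable practical engineering value.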

6.

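Object-oriented simulation of parallel computing on the LS SIMD computer

张发存 赵晓红 沈绪榜 《计算机工程与应用》2003,39(26):143-146

The paper describes in detail an object-oriented simulation of parallel computing based on the LS SIMD computer, proposes a novel object-oriented software model of a SIMD machine, and implements it in Microsoft Visual C++ 6.0 on a Windows PC. A set of digital images was processed with different algorithms in the simulator, and the results were compared with those obtained by running the same algorithms on the same images on the LS SIMD parallel machine, demonstrating that the system is correct, practical, and reliable.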

7.

A study on a quantum-inspired evolutionary algorithm based on pair swap

Takahiro Imabeppu Shigeru Nakayama Satoshi Ono 《Artificial Life and Robotics》2008,12(1-2):148-152

A quantum-inspired evolutionary algorithm (QEA) is proposed as a stochastic algorithm to solve combinatorial optimization problems. The QEA is an evolutionary computation method that uses quantum bits and superposition states from quantum computation. Although the QEA is a coarse-grained parallel algorithm, it involves many parameters that must be adjusted manually. This paper proposes a new method, named pair swap, which exchanges best-solution information between two individuals instead of using migration in the QEA. Experimental results show that our proposed method is a simpler algorithm and can find a high quality solution for the 0-1 knapsack problem. This work was presented in part at the 12th International Symposium on Artificial Life and Robotics, Oita, Japan, January 25–27, 2007.

8.

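Research on an FPGA paging simulation model for SIMD architectures

何义 任巨 文梅 杨乾明 伍楠 张春元 郭敏 《计算机研究与发展》2011,48(1)

SIMD architectures can effectively exploit the parallelism of multimedia and complex scientific computing, and have become a focus of industrial application and research. In research on large-scale SIMD architectures, to ease the limit that FPGA chip capacity places on the scale of the simulated system, an FPGA paging simulation model suited to SIMD architectures is proposed. It effectively reduces the FPGA compute and storage resources that a SIMD architecture requires and increases the scale of SIMD architecture that can be verified. Simulation experiments on the MASA stream processor show that, without any simulation optimization technique, the EP2S180 FPGA chip supports at most a MASA with 8 clusters; with the paging simulation model, the maximum rises to a MASA with 256 clusters, and the increase in simulation time is acceptable.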

9.

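A register-pressure-sensitive list scheduling algorithm for VLIW architectures

王红梅 王敏 张铁军 单睿 侯朝焕 《计算机应用研究》2009,26(11):4039-4041

To alleviate register pressure, a register-pressure-sensitive instruction scheduling algorithm is proposed. Building on traditional list scheduling, the algorithm uses the critical path as its priority function and adjusts when non-critical nodes are scheduled within register-pressure regions, reducing register pressure without loss of application performance.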
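
As background for the abstract above, the following is a minimal sketch of conventional list scheduling with a critical-path priority function. It covers only the baseline; the paper's register-pressure adjustment for non-critical nodes is not reproduced here. The dependence graph, the latencies, the issue width, and all names are hypothetical.

```python
from functools import lru_cache

def list_schedule(succs, latency, issue_width):
    """Return {node: start_cycle} using critical-path-first list scheduling."""
    nodes = list(latency)
    preds = {n: [] for n in nodes}
    for n, ss in succs.items():
        for s in ss:
            preds[s].append(n)

    @lru_cache(maxsize=None)
    def critical_path(n):
        # Longest latency-weighted path from n to any exit of the DAG.
        return latency[n] + max((critical_path(s) for s in succs.get(n, ())), default=0)

    start = {}
    cycle = 0
    while len(start) < len(nodes):
        # A node is ready once all of its predecessors have finished.
        ready = [
            n for n in nodes
            if n not in start
            and all(p in start and start[p] + latency[p] <= cycle for p in preds[n])
        ]
        # Longest critical path first; ties broken by name for determinism.
        ready.sort(key=lambda n: (-critical_path(n), n))
        for n in ready[:issue_width]:
            start[n] = cycle
        cycle += 1
    return start

# Hypothetical dependence graph: two loads feed a multiply, whose result is added to c.
succs = {"load_a": ["mul"], "load_b": ["mul"], "load_c": ["add"], "mul": ["add"], "add": []}
latency = {"load_a": 2, "load_b": 2, "load_c": 2, "mul": 3, "add": 1}
schedule = list_schedule(succs, latency, issue_width=2)
```

In this example `load_c` is off the critical path, so it is the kind of node whose issue cycle a pressure-aware scheduler could shift without lengthening the schedule.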

10.

Application of the G-JF discrete-time thermostat for fast and accurate molecular simulations

Niels Grønbech-Jensen Natha Robert Hayre Oded Farago 《Computer Physics Communications》2014

A new Langevin–Verlet thermostat that preserves the fluctuation–dissipation relationship for discrete time steps is applied to molecular modeling and tested against several popular suites (AMBER, GROMACS, LAMMPS) using a small molecule as an example that can be easily simulated by all three packages. Contrary to existing methods, the new thermostat exhibits no detectable changes in the sampling statistics as the time step is varied in the entire numerical stability range. The simple form of the method, which we express in the three common forms (Velocity-Explicit, Störmer–Verlet, and Leap-Frog), allows for easy implementation within existing molecular simulation packages to achieve faster and more accurate results with no cost in either computing time or programming complexity.
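
The abstract does not state the update equations, so the sketch below is an assumption: it uses the velocity-explicit G-JF form as commonly quoted from the G-JF literature, for a single particle in one dimension, and should be checked against the paper before use. The names `gjf_step` and `force`, the harmonic test force, and all parameter values are illustrative.

```python
import math
import random

def gjf_step(x, v, f, force, m, alpha, kT, dt, rng):
    """Advance one velocity-explicit G-JF step; returns (x_new, v_new, f_new).

    alpha is the friction coefficient and kT the thermal energy. With
    alpha = 0 the update reduces to ordinary velocity Verlet.
    """
    b = 1.0 / (1.0 + alpha * dt / (2.0 * m))
    a = b * (1.0 - alpha * dt / (2.0 * m))
    # Noise integrated over one step: zero mean, variance 2 * alpha * kT * dt.
    beta = rng.gauss(0.0, math.sqrt(2.0 * alpha * kT * dt))
    x_new = x + b * dt * v + b * dt * dt / (2.0 * m) * f + b * dt / (2.0 * m) * beta
    f_new = force(x_new)
    v_new = a * v + dt / (2.0 * m) * (a * f + f_new) + b / m * beta
    return x_new, v_new, f_new

def force(x):
    return -x  # harmonic spring with unit stiffness (illustrative)

rng = random.Random(0)
x, v = 1.0, 0.0
f = force(x)
for _ in range(1000):
    x, v, f = gjf_step(x, v, f, force, m=1.0, alpha=0.5, kT=1.0, dt=0.1, rng=rng)
```

The same random number `beta` enters both the position and the velocity update within a step, and the force is evaluated once per step, so the cost matches that of a plain Verlet integrator.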

11.

Exploiting hierarchy parallelism for molecular dynamics on a petascale heterogeneous system

Qiang Wu Canqun Yang Tao Tang Liquan Xiao 《Journal of Parallel and Distributed Computing》2013

Heterogeneous systems with nodes containing more than one type of computation units, e.g., central processing units (CPUs) and graphics processing units (GPUs), are becoming popular because of their low cost and high performance. In this paper, we have developed a Three-Level Parallelization Scheme (TLPS) for molecular dynamics (MD) simulation on heterogeneous systems. The scheme exploits multi-level parallelism combining (1) inter-node parallelism using spatial decomposition via message passing, (2) intra-node parallelism using spatial decomposition via dynamically scheduled multi-threading, and (3) intra-chip parallelism using multi-threading and short vector extension in CPUs, and employing multiple CUDA threads in GPUs. By using a hierarchy of parallelism with optimizations such as communication hiding intra-node, and memory optimizations in both CPUs and GPUs, we have implemented and evaluated an MD simulation on a petascale heterogeneous supercomputer TH-1A. The results show that MD simulations can be efficiently parallelized with our TLPS scheme and can benefit from the optimizations.

12.

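A fast likelihood algorithm based on SIMD

欧建林 蔡骏 林茜 《计算机工程》2009,35(13):177-178

The paper analyzes the likelihood computation in large-vocabulary continuous speech recognition systems based on continuous-density hidden Markov models and discusses the feasibility of computing likelihoods in parallel. On this basis, it proposes a fast SIMD-based likelihood algorithm, implemented by modifying the likelihood computation module of the speech recognition toolkit HTK 3.4. Experimental results show that the algorithm effectively speeds up likelihood computation without reducing recognition accuracy.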
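
To show why this computation suits SIMD, here is a minimal sketch of a diagonal-covariance Gaussian log-likelihood in which the per-dimension sum is kept as four independent partial sums, the way a four-lane SIMD register would accumulate them. It is plain Python rather than SIMD code, it is not HTK's implementation, and the lane count, the function names, and the `gconst` convention (D·log 2π plus the sum of log variances) are assumptions made for illustration.

```python
import math

LANES = 4  # assumed SIMD width, e.g. four 32-bit floats per register

def make_gconst(var):
    """Precompute the observation-independent term D*log(2*pi) + sum(log var)."""
    return len(var) * math.log(2.0 * math.pi) + sum(math.log(v) for v in var)

def gaussian_log_likelihood(obs, mean, inv_var, gconst):
    """log N(obs; mean, diag(var)) with inv_var[d] = 1 / var[d]."""
    acc = [0.0] * LANES                  # one partial sum per lane
    for base in range(0, len(obs), LANES):
        for lane in range(LANES):        # this inner loop models one SIMD step
            d = base + lane
            if d < len(obs):             # tail handling when D is not a multiple of LANES
                diff = obs[d] - mean[d]
                acc[lane] += diff * diff * inv_var[d]
    return -0.5 * (gconst + sum(acc))    # horizontal add, then scale

# Hypothetical 5-dimensional example.
obs = [0.5, -1.0, 2.0, 0.0, 1.5]
mean = [0.0, 0.0, 1.0, 0.5, 1.0]
var = [1.0, 2.0, 0.5, 1.0, 4.0]
loglik = gaussian_log_likelihood(obs, mean, [1.0 / v for v in var], make_gconst(var))
```

Storing inverse variances and the constant term ahead of time leaves only subtract, multiply, and accumulate in the inner loop, which are the operations a SIMD unit performs on several dimensions at once.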

13.

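A parallel image template matching algorithm for SIMD computers with a k-ary 2-cube network (cited by: 5; self-citations: 0; citations by others: 5)

李俊山 沈绪榜 《计算机学报》2001,24(11):1196-1201

Template matching is a basic and effective method for filtering, edge detection, target recognition, and image matching. For an N × N image and an M × N (M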

14.

Achieving speedups for APL on an SIMD distributed memory machine

Raymond Greenlaw Lawrence Snyder 《International journal of parallel programming》1990,19(2):111-127

The potential speedup for SIMD parallel implementations of APL programs is considered. Both analytical and (simulated) empirical studies are presented. The approach is to recognize that nearly 95% of the operators appearing in APL programs are either scalar primitive, reduction or indexing and so the performance of these operators gives a good estimate of the amount of speedup a full program might receive. Substantial speedups are demonstrated for these operators and the empirical evidence accords with the analytical estimates. This research has been funded by the Office of Naval Research Contract No. N00014-86-K-0264 and the National Science Foundation Grant No. DCR 8416878.

15.

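A tree-matching vectorization algorithm for multi-cluster DSPs

郭连伟 郑启龙 黄胜兵 徐华叶 《计算机系统应用》2015,24(10):142-147

BWDSP is a new processor designed for high-performance computing. It combines a multi-cluster very long instruction word (VLIW) architecture with a SIMD architecture and has a rich instruction set. To make full use of the vectorization resources BWDSP provides, a vectorization algorithm is urgently needed. This paper studies and implements, on top of Open64, a SIMD compiler optimization algorithm for multi-cluster VLIW DSPs. The algorithm operates on WHIRL, the intermediate language of Open64, and can make full use of BWDSP's rich hardware resources and vector instructions. Experimental results show that for loops that can be combined into double-word and single-word operations, the optimization achieves average speedups of 6x and 4x, respectively.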

16.

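A fast background extraction method based on Pentium SIMD instructions (cited by: 3; self-citations: 0; citations by others: 3)

周西汉 刘勃 周荷琴 袁非牛 《计算机工程与应用》2004,40(27):81-83

The paper proposes a fast background extraction method based on Intel Pentium SIMD instructions. In an improved Gaussian mixture background model, the computation of the Jeffrey value and the update of the background model have a high degree of inherent SIMD parallelism; by organizing data according to SSE data types, a SIMD algorithm for the Gaussian mixture background model is implemented. Experimental results show that the method with embedded Pentium SIMD instructions improves performance by about 75% over the conventional computation, speeds up background extraction, meets real-time processing requirements, and has considerable practical value.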

17.

Distributed evaluation of an iterative function for all object pairs on an SIMD hypercube

Fikret Erçal 《Information Processing Letters》1991,40(6):341-345

An efficient distributed algorithm for evaluating an iterative function on all pairwise combinations of C objects on an SIMD hypercube is presented. The algorithm achieves uniform load distribution and minimal, completely local interprocessor communication.

18.

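A two-dimensional SIMD scheduling algorithm based on shared vectors

张为华 臧斌宇 王晔 钱兴隆 朱传琪 《计算机学报》2006,29(10):1740-1749

To address the shortcomings of current compiler research for two-dimensional SIMD architectures, and considering both the constraints commonly found in such architectures (multiplexed data paths and few registers) and the characteristics of application programs, an algorithm for data vector reuse is proposed. The algorithm first uses representative elements of data vectors to compute reuse information for data vectors across SIMD instructions, and then schedules the SIMD instructions according to this information. It effectively relieves the data-loading pressure when applications run on two-dimensional SIMD architectures and improves the parallelism of resource-constrained two-dimensional SIMD architectures. Experimental data show that the algorithm achieves an average speedup of 2.97 and an average SIMD instruction-level parallelism of 3.86 across a range of applications.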

19.

Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU accelerators

《Parallel Computing》2014,40(8):425-447

EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model developed for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The dynamic core of EULAG includes the multidimensional positive definite advection transport algorithm (MPDATA) and elliptic solver. In this work we investigate aspects of an optimal parallel version of the 2D MPDATA algorithm on modern hybrid architectures with GPU accelerators, where computations are distributed across both GPU and CPU components. Using the hybrid OpenMP–OpenCL model of parallel programming opens the way to harness the power of CPU–GPU platforms in a portable way. In order to better utilize features of such computing platforms, comprehensive adaptations of MPDATA computations to hybrid architectures are proposed. These adaptations are based on efficient strategies for memory and computing resource management, which allow us to ease memory and communication bounds, and better exploit the theoretical floating point efficiency of CPU–GPU platforms. The main contributions of the paper are:

• method for the decomposition of the 2D MPDATA algorithm as a tool to adapt MPDATA computations to hybrid architectures with GPU accelerators by minimizing communication and synchronization between CPU and GPU components at the cost of additional computations;
• method for the adaptation of 2D MPDATA computations to multicore CPU platforms, based on space and temporal blocking techniques;
• method for the adaptation of the 2D MPDATA algorithm to GPU architectures, based on a hierarchical decomposition strategy across data and computation domains, with support provided by the developed GPU task scheduler allowing for the flexible management of available resources;
• approach to the parametric optimization of 2D MPDATA computations on GPUs using the autotuning technique, which allows us to provide a portable implementation methodology across a variety of GPUs.

Hybrid platforms tested in this study contain different numbers of CPUs and GPUs – from solutions consisting of a single CPU and a single GPU to the most elaborate configuration containing two CPUs and two GPUs. Processors of different vendors are employed in these systems – both Intel and AMD CPUs, as well as GPUs from NVIDIA and AMD. For all the grid sizes and for all the tested platforms, the hybrid version with computations spread across CPU and GPU components allows us to achieve the highest performance. In particular, for the largest MPDATA grids used in our experiments, the speedups of the hybrid versions over GPU and CPU versions vary from 1.30 to 1.69, and from 1.95 to 2.25, respectively.

20.

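AES encryption and decryption optimization based on Loongson SIMD technology (cited by: 1; self-citations: 1; citations by others: 0)

顾丽红 魏海蕊 《计算机工程》2009,35(3):189-191

The Advanced Encryption Standard (AES) is the mainstream encryption and decryption algorithm used by secure network protocols on Linux systems. By analyzing the AES algorithm and taking into account the architectural features of the Loongson platform, this paper proposes a method for optimizing AES performance using the multimedia instruction extension (SIMD technology). Data transfer results for the secure file transfer protocol Sftp (with AES encryption and decryption) before and after optimization show that the Loongson SIMD optimization reduces encryption and decryption time and effectively increases Sftp's network transfer rate.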