1.
Toru Takahashi, Cris Cecka, William Fong, Eric Darve 《International journal for numerical methods in engineering》2012,89(1):105-133
This paper presents a number of algorithms to run the fast multipole method (FMM) on NVIDIA CUDA-capable graphics processing units (GPUs) (Nvidia Corporation, Sta. Clara, CA, USA). The FMM is a class of methods to compute pairwise interactions between N particles for a given error tolerance and with computational cost of O(N). The methods described in the paper are applicable to any FMM in which the multipole-to-local (M2L) operator is a dense matrix and the matrix is precomputed. This is the case, for example, in the black-box fast multipole method (bbFMM), a variant of the FMM that can handle a large class of kernels. This example is used in our benchmarks. In the FMM, two operators represent most of the computational cost, and an optimal implementation typically tries to balance them. One is the nearby interaction calculation (direct sum calculation, line 29 in Listing 1), and the other is the M2L operation. We focus on the M2L. By combining multiple M2L operations and reordering the primitive loops of the M2L so that CUDA threads can reuse or share common data, these approaches reduce the movement of data in the GPU. Because memory bandwidth is the primary bottleneck of these methods, significant performance improvements are realized. Four M2L schemes are detailed and analyzed in the case of a uniform tree. The four schemes are tested and compared with an optimized, OpenMP-parallelized, multi-core CPU code. We consider high- and low-precision calculations by varying the number of Chebyshev nodes used in the bbFMM. The accuracy of the GPU codes is found to be satisfactory, and they achieve over 200 Gflop/s on one NVIDIA Tesla C1060 GPU (Nvidia Corporation, Sta. Clara, CA, USA). This was compared against two quad-core Intel Xeon E5345 processors (Intel Corporation, Sta. Clara, CA, USA) running at 2.33 GHz, for a combined peak performance of 149 Gflop/s in single precision. For the low FMM accuracy case, the observed performance of the CPU code was 37 Gflop/s, whereas for the high FMM accuracy case it was about 8.5 Gflop/s, most likely because of a higher frequency of cache misses. We also present benchmarks on an NVIDIA C2050 GPU (a Fermi processor) (Nvidia Corporation, Sta. Clara, CA, USA) in single and double precision. Copyright © 2011 John Wiley & Sons, Ltd.
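The idea of combining several M2L operations so that the precomputed dense operator is reused can be sketched in a few lines. This is only an illustration of the data-reuse argument, not one of the four GPU schemes from the paper; the function names, the plain-list data layout, and the matrix sizes are all assumptions made for the example.

```python
# Illustrative sketch: a dense, precomputed M2L operator K applied to the
# multipole vectors of several source cells. Names and layout are hypothetical.

def matvec(K, m):
    """Apply the dense operator K (a list of rows) to one multipole vector m."""
    return [sum(k_ij * m_j for k_ij, m_j in zip(row, m)) for row in K]

def m2l_one_by_one(K, multipoles):
    """Baseline: one matrix-vector product per cell, so K is re-read per cell."""
    return [matvec(K, m) for m in multipoles]

def m2l_batched(K, multipoles):
    """Batched variant: each row of K is fetched once and reused for all cells.

    The arithmetic is identical to the baseline; only the loop order changes,
    which turns many matrix-vector products into one matrix-matrix product.
    """
    local_expansions = [[0.0] * len(K) for _ in multipoles]
    for i, row in enumerate(K):                  # fetch row i of K once
        for c, m in enumerate(multipoles):       # reuse it for every cell
            local_expansions[c][i] = sum(k_ij * m_j for k_ij, m_j in zip(row, m))
    return local_expansions
```

On a GPU the reuse would happen in shared memory or registers rather than in a Python loop, but the reordering shown here is the reason the batched form moves less data per floating-point operation.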
2.
3.
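Most current finite element analysis software is built on central processing unit (CPU) platforms. When used for nonlinear response analysis of complex high-rise structures, it suffers from long computation times, low efficiency, and demanding hardware requirements. Owing to its hardware architecture, the graphics processing unit (GPU) can deliver ten to more than one hundred times the floating-point and parallel computing performance of a CPU, and thus offers a practical way around the bottleneck in nonlinear analysis of high-rise structures. Building on a heterogeneous parallel computing platform, this paper proposes a parallel finite element numerical method suited to GPU acceleration. The method establishes a one-to-one mapping between the degrees of freedom of a refined structural analysis model and GPU threads, parallelizes the implicit integration algorithm for dynamic response at the GPU thread level, and combines this with an element-by-element (EBE) storage optimization scheme to reduce the memory needed to solve the system of equations. The method is validated against shaking-table test results and applied to the elastoplastic seismic response analysis of an actual high-rise reinforced concrete frame-tube structure. The results show that the proposed method effectively improves the efficiency of nonlinear response analysis of large, complex high-rise structures while preserving model accuracy.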
4.
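To improve the execution efficiency of convolution on vector digital signal processors (DSPs), an efficient parallel convolution algorithm is proposed: the radix-2 parallel short convolution (PSC R2) algorithm. The algorithm adopts a radix-2 short-convolution structure instead of the direct-form structure of traditional parallel convolution algorithms, which effectively reduces the number of loop iterations. Based on this structure, dedicated vector DSP instructions are also proposed to match the computation structure of the convolution and sustain execution efficiency. Evaluation shows that the time complexity of the algorithm is only 43% of that of the traditional inner-loop vectorization (VIL) algorithm and 55% of that of the outer-loop vectorization (VOL) algorithm, while its storage overhead is essentially the same as that of the traditional algorithms. The algorithm can substantially reduce the time complexity of convolution, correlation, and filtering operations in mobile communications and digital signal processing.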
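A radix-2 short-convolution structure can be illustrated with the generic even/odd (polyphase) decomposition used in fast FIR algorithms. The abstract does not give the details of PSC R2, so the sketch below is an assumption about the general technique rather than the paper's algorithm; the function names are hypothetical and both inputs are assumed non-empty.

```python
def direct_conv(x, h):
    """Direct-form linear convolution, used here as the reference."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def radix2_conv(x, h):
    """Linear convolution through a radix-2 split into even and odd phases.

    With X(z) = X0(z^2) + z^-1 X1(z^2) and H(z) split the same way,
    Y(z) = H0X0(z^2) + z^-1 [H0X1 + H1X0](z^2) + z^-2 H1X1(z^2), and the
    cross term equals (H0 + H1)(X0 + X1) - H0X0 - H1X1. Three half-length
    convolutions therefore replace the four a naive split would need.
    """
    out_len = len(x) + len(h) - 1
    # Zero-pad to even length so both phases of each signal have equal length.
    x = x + [0] * (len(x) % 2)
    h = h + [0] * (len(h) % 2)
    x0, x1 = x[0::2], x[1::2]
    h0, h1 = h[0::2], h[1::2]

    p0 = direct_conv(x0, h0)                       # H0 * X0
    p1 = direct_conv(x1, h1)                       # H1 * X1
    pm = direct_conv([a + b for a, b in zip(x0, x1)],
                     [a + b for a, b in zip(h0, h1)])
    cross = [m - a - b for m, a, b in zip(pm, p0, p1)]

    y = [0] * (len(x) + len(h) - 1)
    for k, v in enumerate(p0):
        y[2 * k] += v          # even output samples
    for k, v in enumerate(cross):
        y[2 * k + 1] += v      # odd output samples
    for k, v in enumerate(p1):
        y[2 * k + 2] += v      # even samples delayed by z^-2
    return y[:out_len]         # drop zeros introduced by the padding
```

The sub-convolutions could themselves be split again; whether that pays off on a given vector DSP depends on its instruction set, which is the point the abstract makes about dedicated instructions.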
5.
6.
7.
Felix Fritzen, Liang Xia, Matthias Leuschner, Piotr Breitkopf 《International journal for numerical methods in engineering》2016,106(6):430-453
This paper extends current concepts of topology optimization to the design of structures made of nonlinear microheterogeneous materials. The objective is to maximize the macroscopic structural stiffness for a prescribed material volume usage while accounting for the nonlinearity and the microstructure of the material. The resulting design problem considers two scales: the macroscopic scale at which the optimization is performed and the microscopic scale at which the material heterogeneities and the nonlinearities are observed. The topology optimization at the macroscopic scale is performed by means of the bi-directional evolutionary structural optimization method. The solution of the macroscopic boundary value problem requires as input the effective constitutive response with full consideration of the microstructure. While computational homogenization methods such as the FE² method could be used to solve the nonlinear multiscale problem, the associated numerical expense (CPU time and memory) is prohibitive. To regain the computational feasibility of the scale transition, a recent model reduction technique of the authors is employed: the potential-based reduced basis model order reduction with graphics processing unit acceleration. Numerical examples show the efficiency of the resulting nonlinear two-scale designs. The impact of different load amplitudes on the design is examined. Copyright © 2015 John Wiley & Sons, Ltd.
8.
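This paper studies edge-preserving image smoothing in computer graphics and computer vision. Optimization-based edge-preserving smoothing algorithms mostly use a first-order smoothness prior as the regularization term of the energy function, which produces staircase artifacts in the smoothed result. An edge-preserving smoothing algorithm based on a second-order smoothness prior is therefore proposed; it avoids the staircase bias of the first-order prior while sharply preserving salient edges in the image. For the resulting mixed optimization problem over continuous and 0-1 variables, a fast solver is used that, with graphics processing unit (GPU) parallel acceleration, obtains the smoothed result quickly. Experiments verify the algorithm's effectiveness in edge-preserving smoothing of depth maps, restoration of compression artifacts in JPEG cartoon images, and edge extraction.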
9.
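An improved split-spectrum analysis method (total citations: 3; self-citations: 0; citations by others: 3)
This paper replaces the Gaussian filters of the traditional split-spectrum analysis method with flat-top raised-cosine (RAD) filters and proposes an improved split-spectrum analysis method. Experimental results show that the new method has highly stable performance and a strong ability to enhance flaw echo signals buried in scattering from grains (or other scatterers).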
10.
11.
12.
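The image sensor is the most important component of a digital camera (DC). Starting from the basic concepts of DC image sensors, this article describes in detail the sizes and types of image sensors used in high-end, mid-to-low-end, and consumer DCs, along with the advantages and disadvantages of different sensor sizes, and then guides readers in choosing a suitable image sensor.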
13.
Linda A. Felton, Jing Yang, Khalid Shah, Hossein Omidian, Jose G. Rocca 《Drug development and industrial pharmacy》2013,39(6):683-689
The objective of the current study was to investigate the oxidative induction time (OIT) as a measurement of the stability of an oxygen-sensitive model drug. The OIT was determined by differential scanning calorimetry and represents the time required for oxidative decomposition to occur at a given temperature. Samples were heated to a specific temperature under a nitrogen blanket, then held isothermal while exposed to oxygen. The experiment proceeded until oxidative degradation of the sample was apparent from the real-time heat flow graphs. Variables investigated in this study included different lots and suppliers of a model drug as well as the addition of antioxidants. Results demonstrated that the stability of the drug was dependent on the supplier. All antioxidants investigated in this study improved oxygen stability of the model compound, as evidenced by a longer OIT. Butylated hydroxyanisole (BHA) was found to better stabilize the drug than butylated hydroxytoluene at equivalent concentrations. The combination of ascorbic acid and BHA provided the greatest protection against oxidation of the model compound. The results of this study demonstrate the usefulness of OIT to investigate the oxygen stability of pharmaceutical compounds.
14.
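Application of photoelectric signal analysis in cosmic-ray observation research (total citations: 2; self-citations: 0; citations by others: 2)
This paper introduces a detector commonly used in ground-based cosmic-ray observation systems: the plastic scintillator detector. Its main advantages are stable operation, suitability for long-term field observation, ease of maintenance, and low cost. Its basic function is to generate and collect light pulses and convert them into electrical pulses.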
15.
Jeffrey M. Snow, John S. Usher, Bruce E. Stuckman 《Quality and Reliability Engineering International》1992,8(2):105-111
In this paper we investigate the use of the average unit run length (AURL) as an important measure of the effectiveness of various quality control charting schemes. In particular, we focus on its appropriateness for normally distributed processes that tend to produce units (or measurements) at slow rates. In our investigations with the standard Shewhart X̄ and R charts, as well as the CUSUM chart, AURL shows that a sample size of n=1 can yield the fastest means of detecting shifts.
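The effect of sample size on detection speed can be checked numerically for the X̄ chart. The sketch below is not taken from the paper: it assumes that AURL equals the sample size n times the average run length (every produced unit is placed into consecutive subgroups of size n), that the in-control mean and standard deviation are known, and that the chart uses symmetric limits at ±L standard errors. The function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def arl_xbar(shift_sigma, n, L=3.0):
    """Average run length, in samples, of a Shewhart X-bar chart.

    shift_sigma is the mean shift in units of the process standard deviation;
    the subgroup mean then shifts by shift_sigma * sqrt(n) standard errors.
    """
    d = shift_sigma * math.sqrt(n)
    p_signal = 1.0 - (normal_cdf(L - d) - normal_cdf(-L - d))
    return 1.0 / p_signal

def aurl_xbar(shift_sigma, n, L=3.0):
    """Average unit run length under the assumption AURL = n * ARL."""
    return n * arl_xbar(shift_sigma, n, L)
```

Comparing `aurl_xbar(shift, 1)` with `aurl_xbar(shift, 4)` for a few shift sizes shows how counting units rather than samples changes which sample size looks best; the CUSUM and R charts discussed in the abstract would need their own run-length calculations.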
16.
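Decoding and detection for layered space-time coded multicarrier CDMA (total citations: 1; self-citations: 0; citations by others: 1)
This paper studies the V-BLAST MIMO MC-CDMA downlink system. A nonlinear algorithm that performs V-BLAST decoding on each subcarrier is proposed. The system is simulated and analyzed for different numbers of antennas and users, and the linear and nonlinear V-BLAST decoding algorithms are compared through system simulation.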
17.
18.
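This paper addresses the high complexity and low practicality of traditional blind channel estimation algorithms for orthogonal frequency-division multiplexing (OFDM) systems that exploit the finite-alphabet property of the transmitted signal, and proposes an improved blind channel estimation algorithm with reduced complexity. By equivalently decomposing the frequency-domain received-signal model of the OFDM system into multiple received-signal groups, the improved algorithm needs to search over only a single transmitted-signal group to achieve blind channel estimation, which greatly reduces the computational load. In addition, by controlling the number of groups used in the blind estimation, the improved algorithm allows a good trade-off between system performance and complexity, making it more practical. Numerical simulations show that the optimal estimation scheme of the improved algorithm performs identically to the traditional algorithm, and that the performance of its suboptimal estimation schemes gradually approaches that of the traditional algorithm as the bit signal-to-noise ratio increases, converging at 10 dB and 15 dB, respectively.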
19.
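Within the architecture of a fully integrated, distributed combat system for a new type of submarine, and taking sonar information as the main information source, this article reviews several approaches to target motion analysis (TMA): bearings-only TMA, information-fusion TMA, bearing-Doppler TMA, joint noise-energy and bearing range estimation, and combination with matched-field source localization. On this basis, new ideas for the TMA function are put forward and a preliminary TMA functional design is proposed. The design develops the TMA function from the combat-system perspective, making full use of information from multiple sensors together with human judgment. This strengthens TMA capability so that it can better support the commander's tactical decisions and provide more accurate target motion parameter solutions for weapon launch control. Compared with conventional TMA designs, this design can recognize target maneuvers, adds human-computer interaction, and includes TMA display screens designed for viewing TMA results, monitoring target maneuvers, and performing interactive tracking refinement.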
20.
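For the single-carrier frequency-division multiple access system used in the LTE uplink, namely discrete Fourier transform spread orthogonal frequency-division multiplexing (DFT-S-OFDM), a low-complexity iterative detection method is proposed. The traditional method performs minimum mean square error (MMSE) iterative detection using the equivalent transfer matrix of the cascaded transmitter and channel, but inverting that non-diagonal matrix is computationally expensive. The proposed method therefore first applies single-tap MMSE detection to the DFT-spread signal from the transmitter, and then derives the extrinsic bit likelihood ratios at the output from the a posteriori means and variances obtained after inverse discrete Fourier transform (IDFT) despreading. Simulation results show that the proposed iterative detection receiver performs close to the traditional method with considerably lower implementation complexity.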
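The first stage described above, a per-subcarrier MMSE step followed by IDFT despreading, can be sketched as follows. This covers only a single non-iterative pass with hard decisions, not the extrinsic-information exchange of the proposed receiver. It assumes a unitary DFT, unit average symbol power, a known channel frequency response, and QPSK symbols; all function names are hypothetical.

```python
import cmath
import math

def unitary_dft(x, inverse=False):
    """Unitary DFT (scaled by 1/sqrt(N) in both directions)."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                        for t in range(n))
            for k in range(n)]

def single_tap_mmse(received_freq, channel_freq, noise_var):
    """Per-subcarrier MMSE weights w_k = conj(H_k) / (|H_k|^2 + noise_var).

    Because the channel acts as one complex gain per subcarrier, no matrix
    inversion is needed; each subcarrier is equalized independently.
    """
    return [y * complex(h).conjugate() / (abs(h) ** 2 + noise_var)
            for y, h in zip(received_freq, channel_freq)]

def detect(received_freq, channel_freq, noise_var):
    """Equalize per subcarrier, despread with the IDFT, take QPSK hard decisions."""
    equalized = single_tap_mmse(received_freq, channel_freq, noise_var)
    estimates = unitary_dft(equalized, inverse=True)
    return [(1 if z.real >= 0 else -1, 1 if z.imag >= 0 else -1)
            for z in estimates]
```

In an iterative receiver the hard decisions would be replaced by a posteriori means and variances fed back from the decoder, which is where the method in the abstract departs from this sketch.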