Similar literature
 20 similar articles found
1.
This paper presents a number of algorithms to run the fast multipole method (FMM) on NVIDIA CUDA-capable graphical processing units (GPUs) (Nvidia Corporation, Santa Clara, CA, USA). The FMM is a class of methods to compute pairwise interactions between N particles for a given error tolerance and with computational cost of O(N). The methods described in the paper are applicable to any FMM in which the multipole-to-local (M2L) operator is a dense matrix and the matrix is precomputed. This is the case, for example, in the black-box fast multipole method (bbFMM), which is a variant of the FMM that can handle a large class of kernels. This example will be used in our benchmarks. In the FMM, two operators represent most of the computational cost, and an optimal implementation typically tries to balance those two operators. One is the nearby interaction calculation (direct sum calculation, line 29 in Listing 1), and the other is the M2L operation. We focus on the M2L. By combining multiple M2L operations and reordering the primitive loops of the M2L so that CUDA threads can reuse or share common data, these approaches reduce the movement of data in the GPU. Because memory bandwidth is the primary bottleneck of these methods, significant performance improvements are realized. Four M2L schemes are detailed and analyzed in the case of a uniform tree. The four schemes are tested and compared with an optimized, OpenMP-parallelized, multi-core CPU code. We consider high and low precision calculations by varying the number of Chebyshev nodes used in the bbFMM. The accuracy of the GPU codes is found to be satisfactory, and a performance of over 200 Gflop/s is achieved on one NVIDIA Tesla C1060 GPU (Nvidia Corporation, Santa Clara, CA, USA). This was compared against two quad-core Intel Xeon E5345 processors (Intel Corporation, Santa Clara, CA, USA) running at 2.33 GHz, for a combined peak performance of 149 Gflop/s for single precision. For the low FMM accuracy case, the observed performance of the CPU code was 37 Gflop/s, whereas for the high FMM accuracy case, the performance was about 8.5 Gflop/s, most likely because of a higher frequency of cache misses. We also present benchmarks on an NVIDIA C2050 GPU (a Fermi processor) (Nvidia Corporation, Santa Clara, CA, USA) in single and double precision. Copyright © 2011 John Wiley & Sons, Ltd.
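The data-reuse idea behind combining M2L operations can be illustrated with a small CPU-side sketch. It is not one of the four GPU schemes from the paper: it only assumes that the precomputed dense M2L matrix depends on the relative offset between source and target cells, and it groups source-target pairs by offset so that each matrix is fetched once per group. All names (`m2l_grouped`, `K_by_offset`, the `pairs` layout) are hypothetical.

```python
from collections import defaultdict


def matvec(K, m):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(k * x for k, x in zip(row, m)) for row in K]


def m2l_grouped(pairs, K_by_offset, multipoles, n_coeff):
    """Accumulate local expansions from M2L translations.

    pairs: list of (target_id, source_id, offset) tuples (hypothetical layout).
    K_by_offset: precomputed dense M2L matrix for each relative offset.
    multipoles: multipole coefficient vector for each source cell.
    """
    # Group the work by offset so each dense matrix is looked up once
    # and reused for every pair that shares it.
    by_offset = defaultdict(list)
    for tgt, src, off in pairs:
        by_offset[off].append((tgt, src))

    local = defaultdict(lambda: [0.0] * n_coeff)
    for off, group in by_offset.items():
        K = K_by_offset[off]  # fetched once per offset
        for tgt, src in group:
            contrib = matvec(K, multipoles[src])
            local[tgt] = [a + b for a, b in zip(local[tgt], contrib)]
    return dict(local)
```

On a GPU the same grouping would let a block of threads keep one matrix in fast memory while it is applied to many multipole vectors; how the paper maps this to threads is not stated in the abstract.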

2.
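A novel all-optical metropolitan area network node architecture is proposed, based on wavelength-division multiplexing (WDM), packet switching, subcarrier multiplexing, and wavelength conversion. The network uses a ring topology, and each node follows a tunable-output, fixed-input wavelength selection principle to facilitate multiple access for data. Using subcarrier multiplexing and wavelength conversion based on cascaded semiconductor optical amplifiers, the design achieves synchronous multiplexing of RF subcarrier routing information with baseband IP packets and transparent transmission of IP packets.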

3.
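李红豫, 滕军, 李祚华, 张璐. 《工程力学》 (Engineering Mechanics), 2018, 35(11): 79-85, 91
Most finite element analysis software is currently built on central processing unit (CPU) platforms. When used for nonlinear response analysis of complex high-rise structures, it suffers from long computation times, low computational efficiency, and demanding hardware requirements. Owing to its hardware architecture, a graphics processing unit (GPU) can deliver ten to over a hundred times the floating-point and parallel computing performance of a CPU, and thus offers a practical solution to the bottleneck in nonlinear computation of high-rise structures. Having built a heterogeneous parallel computing platform, this paper proposes a parallel finite element numerical method suited to GPU acceleration. The method establishes a one-to-one mapping between the degree-of-freedom data of a refined structural analysis model and GPU threads, parallelizes the implicit integration algorithm for dynamic response at the GPU thread level, and combines this with an element-by-element (EBE) storage optimization scheme to reduce the memory required when solving the system of equations. The method is validated against shaking table test results and applied to the elastoplastic seismic response analysis of an actual high-rise reinforced concrete frame-tube structure. The results show that the proposed method effectively improves the efficiency of nonlinear response analysis for large, complex high-rise structures while preserving model accuracy.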
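The memory saving of an element-by-element (EBE) scheme comes from never assembling the global stiffness matrix: the product of the global matrix with a vector is accumulated from small element matrices. The following is a minimal serial sketch for two-node elements, not the paper's GPU implementation; the function name and the data layout are assumptions made for illustration.

```python
def ebe_matvec(elements, x):
    """Compute y = K x without assembling the global matrix K.

    elements: list of ((i, j), ke) where i, j are global degree-of-freedom
    indices and ke is the 2x2 element stiffness matrix (hypothetical layout).
    """
    y = [0.0] * len(x)
    for (i, j), ke in elements:
        xi, xj = x[i], x[j]
        # Scatter the element contribution into the global result vector.
        y[i] += ke[0][0] * xi + ke[0][1] * xj
        y[j] += ke[1][0] * xi + ke[1][1] * xj
    return y
```

An iterative solver such as conjugate gradients needs only this product, so storage grows with the number of elements rather than with the size of the assembled matrix. In a thread-per-degree-of-freedom GPU version the scatter step would have to be reorganized or made atomic to avoid write conflicts.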

4.
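To improve the execution efficiency of convolution on vector digital signal processors (DSPs), an efficient parallel convolution algorithm is proposed: the radix-2 parallel short convolution (PSC R2) algorithm. The algorithm adopts a radix-2 short convolution structure rather than the direct-form structure of traditional parallel convolution algorithms, which effectively reduces the loop count. Based on this structure, dedicated vector DSP instructions are also proposed to match the convolution structure and sustain execution efficiency. Evaluation shows that the time complexity of the algorithm is only 43% of that of the traditional inner-loop vectorization (VIL) algorithm and 55% of that of the outer-loop vectorization (VOL) algorithm, while its storage overhead is roughly on par with the traditional algorithms. The algorithm can substantially reduce the time complexity of convolution, correlation, and filtering in mobile communications and digital signal processing.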
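The abstract does not spell out the PSC R2 structure, so the sketch below shows one well-known radix-2 short convolution as an assumption: each input is split into two halves, and the linear convolution is built from three half-size convolutions instead of four (a Karatsuba-style decomposition). The function names are hypothetical, and the 43% and 55% figures above are not reproduced by this code.

```python
def direct_conv(a, b):
    """Direct-form linear convolution of two sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def short_conv_radix2(a, b):
    """Linear convolution of two equal, even-length sequences
    using three half-size convolutions instead of four."""
    n = len(a)
    assert n == len(b) and n % 2 == 0
    h = n // 2
    a0, a1 = a[:h], a[h:]
    b0, b1 = b[:h], b[h:]

    p0 = direct_conv(a0, b0)  # low  x low
    p2 = direct_conv(a1, b1)  # high x high
    pm = direct_conv([x + y for x, y in zip(a0, a1)],
                     [x + y for x, y in zip(b0, b1)])
    # Cross terms a0*b1 + a1*b0 recovered without a fourth convolution.
    mid = [m - u - v for m, u, v in zip(pm, p0, p2)]

    out = [0] * (2 * n - 1)
    for k, v in enumerate(p0):
        out[k] += v
    for k, v in enumerate(mid):
        out[k + h] += v
    for k, v in enumerate(p2):
        out[k + 2 * h] += v
    return out
```

With one level of splitting, the multiplication count drops from 4h² to 3h² for half-length h, at the cost of extra additions.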

5.
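Fabrication of micromirrors for MOEMS array optical switches   Total citations: 2, self-citations: 0, citations by others: 2
A micromirror array for an 8 × 8 array optical switch was fabricated on (110) silicon wafers by anisotropic wet etching. The mirror surfaces are {111} planes with a surface roughness below 10 nm, and their verticality is better than that of mirrors fabricated by the same method on (100) silicon wafers.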

6.
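Building on an introduction to Tsai's calibration method, a calibration method for microscopic measurement systems is proposed. Rather than calibrating the absolute coordinates of the cameras, the method experimentally measures the angular relationship between the two camera systems to obtain their rotation matrix, and from it calibrates the extrinsic parameters of the whole system. In microscopic measurement applications, the new technique requires less computation and is more practical than the traditional Tsai calibration, while maintaining high calibration accuracy.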

7.
This paper extends current concepts of topology optimization to the design of structures made of nonlinear microheterogeneous materials. The objective is to maximize the macroscopic structural stiffness for a prescribed material volume usage while accounting for the nonlinearity and the microstructure of the material. The resulting design problem considers two scales: the macroscopic scale at which the optimization is performed and the microscopic scale at which the material heterogeneities and the nonlinearities are observed. The topology optimization at the macroscopic scale is performed by means of the bi-directional evolutionary structural optimization method. The solution of the macroscopic boundary value problem requires as inputs the effective constitutive response with full consideration of the microstructure. While computational homogenization methods such as the FE² method could be used to solve the nonlinear multiscale problem, the associated numerical expense (CPU time and memory) is highly unacceptable. In order to regain the computational feasibility of the computational scale transition, a recent model reduction technique of the authors is employed: the potential-based reduced basis model order reduction with graphics processing unit acceleration. Numerical examples show the efficiency of the resulting nonlinear two-scale designs. The impact of different load amplitudes on the design is examined. Copyright © 2015 John Wiley & Sons, Ltd.

8.
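This work studies edge-preserving image smoothing (smoothing that preserves image edges) in computer graphics and computer vision. Optimization-based edge-preserving smoothing algorithms mostly use a first-order smoothness prior as the regularization term of the energy function, which produces staircase-like results. An edge-preserving smoothing algorithm based on a second-order smoothness prior is therefore proposed; it avoids the staircase bias of the first-order prior while sharply preserving salient edges in the image. For the resulting mixed optimization problem over continuous and 0-1 variables, a fast solver is used that obtains smoothing results quickly with parallel acceleration on a graphics processing unit (GPU). Experiments verify the algorithm's effectiveness in edge-preserving smoothing of depth maps, restoration of compression artifacts in JPEG cartoon images, and edge extraction.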
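The difference between first-order and second-order smoothness priors can be seen on a one-dimensional signal. The sketch below assumes absolute-value penalties on finite differences; the abstract does not state the exact form of the energy, and the function names are hypothetical.

```python
def first_order_penalty(u):
    """Sum of absolute first differences (penalizes any slope)."""
    return sum(abs(u[i + 1] - u[i]) for i in range(len(u) - 1))


def second_order_penalty(u):
    """Sum of absolute second differences (penalizes only curvature)."""
    return sum(abs(u[i - 1] - 2 * u[i] + u[i + 1])
               for i in range(1, len(u) - 1))


ramp = [0, 1, 2, 3, 4]        # smooth slanted profile
staircase = [0, 0, 2, 2, 4]   # same endpoints, stepped profile
```

Under these assumed penalties the first-order prior assigns the ramp and the staircase the same cost, so nothing discourages the stepped result, whereas the second-order prior assigns the ramp zero cost and the staircase a positive cost.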

9.
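An improved split-spectrum analysis method   Total citations: 3, self-citations: 0, citations by others: 3
This paper replaces the Gaussian filters of the traditional split-spectrum analysis method with flat-top raised-cosine (RAD) filters, yielding an improved split-spectrum analysis method. Experimental results confirm that the new method has highly stable performance and a strong ability to enhance flaw echo signals buried in scattering from grains (or other scatterers).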

10.
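Design of the power processing unit for a space ion electric propulsion system   Total citations: 3, self-citations: 0, citations by others: 3
The power processing unit (PPU) is one of the key components of a space electric propulsion system. It is a complex power conversion product that combines multiple power supplies, delivers high output power and high voltage, and requires sequencing control. Taking the PPU of an ion electric propulsion system as an example, this paper describes the circuit design of each functional power supply for an output power of 1 kW and a screen-grid supply output voltage of 1 000 V, and presents test results from the experimental circuits.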

11.
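An ARX program developed with MFC reads vector maps directly. Contour layers are converted using an improved, efficient linear interpolation algorithm that resolves the flat-top artifact common in ordinary linear interpolation. Region layers are discretized directly with a modified ordered-edge-table polygon scan conversion algorithm. Finally, these techniques are used to implement a digital discretization system for vector maps, and discretization of the vector map of a city demonstrates the system's efficiency and shows that it meets the requirements of practical engineering use.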

12.
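The image sensor is the most important component of a digital camera (DC). Starting from the basic concepts of DC image sensors, the article details the sizes and types of image sensors in high-end, mid- to low-end, and consumer DCs and the advantages and disadvantages of sensors of different sizes, and then guides readers in choosing an image sensor that suits them.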

13.
ABSTRACT

The objective of the current study was to investigate the oxidative induction time (OIT) as a measurement of the stability of an oxygen-sensitive model drug. The OIT was determined by differential scanning calorimetry and represents the time required for oxidative decomposition to occur at a given temperature. Samples were heated to a specific temperature under a nitrogen blanket then held isothermal while exposed to oxygen. The experiment proceeded until oxidative degradation of the sample was apparent from the real-time heat flow graphs. Variables investigated in this study included different lots and suppliers of a model drug as well as the addition of antioxidants. Results demonstrated that the stability of the drug was dependent on the supplier. All antioxidants investigated in this study improved oxygen stability of the model compound, as evidenced by a longer OIT. Butylated hydroxyanisole (BHA) was found to better stabilize the drug than butylated hydroxytoluene at equivalent concentrations. The combination of ascorbic acid and BHA provided the greatest protection against oxidation of the model compound. The results of this study demonstrate the usefulness of OIT to investigate the oxygen stability of pharmaceutical compounds.

14.
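Application of photoelectric signal analysis in cosmic ray observation research   Total citations: 2, self-citations: 0, citations by others: 2
冯振勇, 黄庆. 《光电工程》 (Opto-Electronic Engineering), 1997, 24(3): 49-52
This paper introduces the plastic scintillator detector, a detector commonly used in ground-based cosmic ray observation systems. Its main advantages are stable operation, suitability for long-term field observation, ease of maintenance, and low cost. Its basic function is to generate and collect light pulse signals and convert them into electrical pulse signals.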

15.
In this paper we investigate the use of the average unit run length (AURL) as an important measure of the effectiveness of various quality control charting schemes. In particular we focus on its appropriateness for normally distributed processes that tend to produce units (or measurements) at slow rates. In our investigations with the standard Shewhart X̄ and R charts, as well as the CUSUM chart, AURL shows that a sample size of n=1 can yield the fastest means of detecting shifts.
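The abstract does not define AURL, so the sketch below assumes the natural reading: the average run length (ARL, in samples) multiplied by the sample size n, giving the expected number of individual units produced before a signal. It covers only a Shewhart X̄ chart with 3-sigma limits and a known process standard deviation; the function names and the parameter `delta` (mean shift in units of the process standard deviation) are hypothetical.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def xbar_arl(delta, n, limit=3.0):
    """Average run length (in samples) of a Shewhart X-bar chart
    after a mean shift of delta process standard deviations."""
    shift = delta * math.sqrt(n)
    p_signal = 1.0 - (phi(limit - shift) - phi(-limit - shift))
    return 1.0 / p_signal


def aurl(delta, n, limit=3.0):
    """Assumed definition: units inspected before a signal = n * ARL."""
    return n * xbar_arl(delta, n, limit)
```

Under these assumptions, for a large shift such as `delta = 3.0`, `aurl(3.0, 1)` is smaller than `aurl(3.0, 4)`, which is consistent with the statement that n=1 can be the fastest choice. The comparison depends on the shift size and on the chart used.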

16.
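Decoding and detection for layered space-time coded multicarrier CDMA   Total citations: 1, self-citations: 0, citations by others: 1
A V-BLAST MIMO MC-CDMA downlink system is studied. A nonlinear algorithm that performs V-BLAST decoding on each subcarrier is proposed. The system is simulated and analyzed for different numbers of antennas and users, and the linear and nonlinear V-BLAST decoding algorithms are compared by system simulation.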

17.
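Automatic design technique for self-test adapters of automatic test systems   Total citations: 2, self-citations: 0, citations by others: 2
Instrument and switch models are built using a signal-oriented approach. All instrument ports in the system are automatically and optimally matched and combined with the corresponding switches to form self-test loops, and the self-test loops are wired automatically. Self-test programs are generated automatically according to the IVI (Interchangeable Virtual Instrument) Signal Interface standard, which considerably improves design efficiency and standardization. For the automatic test system (ATS) of an airborne communication device, the technique shortens the development cycle of the ATS self-test adapter to one third of the original and significantly reduces development cost.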

18.
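李玮, 程时昕. 《高技术通讯》 (High Technology Letters), 2007, 17(12): 1216-1220
This paper addresses the high complexity and low practicality of traditional blind channel estimation algorithms for orthogonal frequency-division multiplexing (OFDM) systems that exploit the finite-alphabet property of the transmitted signal, and proposes an improved blind channel estimation algorithm with reduced complexity. By equivalently decomposing the frequency-domain received signal model of the OFDM system into multiple received-signal groups, the improved algorithm needs to search only a single transmitted-signal group to perform blind channel estimation, which greatly reduces computation. The improved algorithm can also trade off system performance against complexity by controlling the number of groups, making it more practical. Numerical simulations show that the optimal estimation scheme of the improved algorithm performs identically to the traditional algorithm, and that the performance of its suboptimal schemes gradually approaches that of the traditional algorithm as the bit signal-to-noise ratio increases, converging at 10 dB and 15 dB, respectively.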

19.
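Within the architecture of a fully integrated, distributed combat system for a new type of submarine, and taking sonar information as the primary information source, this article surveys several approaches to target motion analysis (TMA): bearings-only TMA, information-fusion TMA, bearing-Doppler TMA, joint noise-energy and bearing range estimation, and combination with matched-field source localization. On this basis it puts forward new ideas for the TMA function and a preliminary functional design. The design develops the TMA function from the perspective of the combat system, making full use of information from multiple sensors as well as human judgment. This strengthens the TMA capability so that it better serves the commander's tactical decisions and provides more accurate target motion parameter solutions for weapon launch control. Compared with conventional TMA designs, this design can recognize target maneuvers, adds human-machine interaction, and includes TMA display screens for viewing TMA results, monitoring target maneuvers, and performing interactive tracking refinement.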

20.
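For the discrete Fourier transform spread OFDM (DFT-S-OFDM) system, the single-carrier frequency-division multiple access scheme of the LTE uplink, a low-complexity iterative detection method is proposed. The traditional method performs minimum mean square error (MMSE) iterative detection using the equivalent transfer matrix of the cascaded transmitter and channel, but inverting the non-diagonal matrix is costly. The proposed method therefore first applies single-tap (per-subcarrier) MMSE detection to the signal after DFT spreading at the transmitter, and then derives the output extrinsic bit log-likelihood ratios from the posterior mean and variance after inverse discrete Fourier transform (IDFT) despreading. Simulation results show that the proposed iterative detection receiver performs close to the traditional method while substantially reducing implementation complexity.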
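The complexity saving comes from replacing a matrix inversion with one scalar division per subcarrier. The sketch below shows only that first step, single-tap MMSE equalization followed by IDFT despreading, for one pass without the iterative soft-information exchange. It assumes unit-energy symbols, a known per-subcarrier channel gain, and an identity subcarrier mapping; the function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive DFT / inverse DFT, adequate for short illustrative blocks."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
               for k in range(n))
           for m in range(n)]
    return [v / n for v in out] if inverse else out


def single_tap_mmse_detect(rx_freq, channel_freq, noise_var):
    """Per-subcarrier MMSE equalization, then IDFT despreading.

    rx_freq: received frequency-domain samples, one per subcarrier.
    channel_freq: channel gain on each subcarrier (assumed known).
    noise_var: noise variance, assuming unit-energy transmitted symbols.
    """
    equalized = [h.conjugate() * y / (abs(h) ** 2 + noise_var)
                 for h, y in zip(channel_freq, rx_freq)]
    return dft(equalized, inverse=True)
```

In the noiseless case with nonzero channel gains this recovers the transmitted symbols exactly. With noise, the estimates are biased toward zero, and the method described above would go on to compute posterior means and variances for the bit log-likelihood ratios, a step not shown here.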
