Found 20 similar articles (search time: 15 ms)
P systems are inherently parallel and non-deterministic theoretical computing devices defined inside the field of Membrane Computing. Many P system simulators have been presented in this area, but they are inefficient since they cannot handle the parallelism of these devices. Nowadays, we are witnessing the consolidation of the GPUs as a parallel framework to compute general purpose applications. In this paper, we analyse GPUs as an alternative parallel architecture to improve the performance in the simulation of P systems, and we illustrate it by using the case study of a family of P systems that provides an efficient and uniform solution to the SAT problem. Firstly, we develop a simulator that fully simulates the computation of the P system, demonstrating that GPUs are well suited to simulate them. Then, we adapt this simulator to the GPU architecture idiosyncrasies, improving the performance of the previous simulator.

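In current quantum computing research, quantum circuit simulators are an important research tool and have long received close attention from researchers. QuEST is an open-source, general-purpose quantum circuit simulator that runs flexibly on a variety of platforms, including a single CPU node, multiple CPU nodes, and a single GPU. The inherent parallelism of quantum circuit simulation makes it well suited to GPUs, where it can achieve substantial speedups. Its drawback is that it consumes an enormous amount of memory; a single GP...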

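The conflict detection algorithm of a tower simulator is a time-consuming parallel algorithm. To address the excessive load it places on the CPU of the tower simulation system's core server, this paper proposes a CUDA (Compute Unified Device Architecture) based implementation of tower simulator conflict detection, building on commonly used conflict detection algorithms. It first introduces the architectural basis of GPU parallel computing and ports a hierarchical conflict detection algorithm, based on Kalman-filter target tracking, to the GPU. It then compares the results on a CPU and a GPU of the same price. Experimental results show that, compared with a CPU implementation of the same algorithm, the GPU implementation improves computational efficiency by a factor of 10 to 50. This approach greatly reduces the load on the core server and substantially improves the performance of the tower simulator.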

Heterogeneous computing is a growing trend in recent computer architecture design and is often used to improve the performance and power efficiency of computing applications by utilizing special-purpose processors or accelerators, such as the Graphics Processing Unit (GPU), Field Programmable Gate Array (FPGA) and Digital Signal Processor (DSP). As complexity increases, the interaction among accelerators and processors can fail if a race condition occurs. However, the existing tools for detecting such conditions are either too slow or make it hard to extend the race condition detection mechanism. Therefore, tools for application profiling with an approximate timing model are important for designing such heterogeneous systems in a timely manner. In this paper, we propose a pluggable GPU interface on an existing timing-approximate CPU simulator based on QEMU for analyzing the memory behavior of heterogeneous systems. By monitoring memory behavior, the pluggable interface can be extended to any kind of accelerator, such as a GPU, DSP or FPGA, for race condition detection. Taking the GPU as an example, we integrated the detailed GPU simulator from Multi2Sim with the existing timing-approximate CPU simulator, VPA, to showcase the efficiency of the proposed work. The experimental results show that the emulation speed of the proposed framework can be up to 9× faster than Multi2Sim in some cases. In addition, the race condition detection mechanism reports the problematic memory accesses to users.

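Recurrent neural networks (RNNs) have been increasingly used in machine learning in recent years, especially for sequence learning tasks, where they outperform neural networks such as CNNs. However, RNNs and their variants, such as LSTM and GRU, are fully connected networks with high computational and storage complexity, which makes inference slow and hard to deploy in products. On the one hand, the traditional computing platform, the CPU, is poorly suited to the large-scale matrix operations of RNNs; on the other hand, the shared memory and global memory of the GPU, a hardware acceleration platform, make GPU-based RNN accelerators relatively power-hungry. Because of their parallel computing capability and low power consumption, FPGAs have been increasingly used in recent years as the hardware platform for RNN accelerators. This paper surveys recent FPGA-based RNN accelerators, summarizes the data optimization algorithms and hardware architecture design techniques they use, and proposes directions for future research.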

袁良, 张云泉, 龙国平, 王可, 张先轶. Journal of Software (软件学报), 2010, 21(Z1): 251-262
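In recent years, GPU-accelerated computing has been applied successfully in fields such as biological computing and scientific computing, achieving high speedups. However, programming and tuning on GPUs is very tedious. Researchers have therefore proposed many programming models and compilers to improve programming efficiency, as well as computation models to guide program optimization, which simplify algorithm design and optimization on GPUs to some extent, but all existing work has shortcomings. Targeting the low-latency, high-bandwidth characteristics of the GPU, this paper proposes a GPU computation model based on a latency hiding factor; the model extracts an algorithm's ability to hide latency in order to guide algorithm optimization. Three matrix multiplication algorithms were measured and compared with the model's predictions, and the experimental results show an average error rate of 0.19 under the simplified model.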

徐新海, 杨学军, 林宇斐, 林一松, 唐滔. Journal of Software (软件学报), 2011, 22(10): 2538-2552
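In recent years, heterogeneous parallel architectures have become an important trend in supercomputer development as a way to ease increasingly severe power consumption problems. Thanks to its very high computing performance and performance per watt, the graphics processing unit (GPU) has been widely adopted in high-performance computing as an efficient accelerator. However, the GPU's inherent reliability weaknesses are bound to aggravate the reliability problems of supercomputers. Current international research on fault tolerance for CPU-GPU heterogeneous systems mainly isolates the GPU from the heterogeneous system and applies fault tolerance at the granularity of each invocation. This paper designs a Lazy fault-tolerance method for CPU-GPU heterogeneous systems, presents a fault-tolerance framework based on compiler directives together with its constraints, discusses the related compiler implementation and optimization methods, and finally verifies the correctness of the method experimentally. The experimental results show that, compared with existing fault-tolerance methods, applying the proposed LazyFT method to GPGPU (general purpose computation on graphics hardware) programs significantly reduces the cost of fault tolerance.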

The algorithmic and implementation principles are explored in gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 nVidia Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation–communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.

Beam tracing combines the flexibility of ray tracing and the speed of polygon rasterization. However, beam tracing so far only handles linear transformations; thus, it is only applicable to linear effects such as planar mirror reflections but not to non‐linear effects such as curved mirror reflection, refraction, caustics and shadows. In this paper, we introduce non‐linear beam tracing to render these non‐linear effects. Non‐linear beam tracing is highly challenging because commodity graphics hardware supports only linear vertex transformation and triangle rasterization. We overcome this difficulty by designing a non‐linear graphics pipeline and implementing it on top of a commodity GPU. This allows beams to be non‐linear where rays within the same beam do not have to be parallel or intersect at a single point. Using these non‐linear beams, real‐time GPU applications can render secondary rays via polygon streaming similar to how they render primary rays. A major strength of this methodology is that it naturally supports fully dynamic scenes without the need to pre‐store a scene database. Utilizing our approach, non‐linear ray tracing effects can be rendered in real‐time on a commodity GPU under a unified framework.

王金英, 陈砾. Computer Simulation (计算机仿真), 2006, 23(7): 280-282, 298
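The flight simulator is developed in the Windows environment with software tools such as VC++, OpenGL, and OpenGVS, and uses object-oriented programming to implement the system simulation functions of instrument simulation, visual scene simulation, flight equation simulation, and console simulation. The multifunction flight simulator uses a cockpit with a real control stick, throttle lever, rudder pedals, and various buttons and switches. It consists of an industrial PC acting as the main controller with several attached microcomputers, which use network protocols for real-time information transfer, data exchange, and condition setting. The equipment offers good human-computer interaction, can serve as demonstration equipment for undergraduates and as experimental equipment for graduate students in related fields, and can also be used as a platform for further development of instrument, visual scene, and other systems, so it has good extensibility.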

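High-current pulsed electron beams, as a high-brightness light source, have broad application prospects, and controlling dual-axis beam measurement is important for research on the accelerator technology that produces the beams. Using an implementation of multithreading in the CVI environment, this work studies the characteristics of dual-channel beam parameters and a timing synchronization control method for dual-channel beam measurement. On this basis, thread-pool multithreading is used to implement TCP/IP-based data acquisition software for accelerator beam measurement. The results show that multithreading allows multiple tasks to run concurrently and helps improve the response speed and execution efficiency of beam measurement.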

GPUs are slowly becoming ubiquitous devices in High Performance Computing, as their ability to improve the performance per watt of compute-intensive algorithms compared with multicore CPUs has been recognized. The primary shortcoming of a GPU is usability, since vendor-specific APIs are quite different from existing programming languages, and optimizing applications requires substantial knowledge of the device and programming interface. Hence, a growing number of higher-level programming models are targeting GPUs to alleviate this problem. The ultimate goal for a high-level model is to expose an easy-to-use interface for the user to offload compute-intensive portions of code (kernels) to the GPU, and tune the code according to the target accelerator to maximize overall performance with a reduced development effort. In this paper, we share our experiences of three of the notable high-level directive-based GPU programming models, namely PGI, CAPS and OpenACC (from CAPS and PGI), on an Nvidia M2090 GPU. We analyze their performance and programmability against Isotropic (ISO)/Tilted Transversely Isotropic (TTI) finite difference kernels, which are primary components in the Reverse Time Migration (RTM) application used by oil and gas exploration for seismic imaging of the sub-surface. When ported to a single GPU using the mentioned directives, we observe an average 1.5–1.8x improvement in performance for both ISO and TTI kernels, when compared with optimized multi-threaded CPU implementations using OpenMP.

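Systolic arrays have a regular structure and high throughput, suit matrix multiplication algorithms, and are widely used in designing high-performance convolution and matrix multiplication accelerators. In deep-submicron processes, increasing the array size to raise chip computing performance leads to problems such as lower frequency and sharply higher power consumption. Therefore, drawing on 3D integrated circuit technology, this paper proposes 3D-MMA, a double-precision floating-point matrix multiplication accelerator that maps a planar systolic array structure onto a 3D integrated circuit. First, a block mapping and scheduling algorithm is designed for the structure to improve matrix multiplication efficiency. Second, an acceleration system based on 3D-MMA is proposed, a performance model of 3D-MMA is built, and its design space is explored. Finally, the implementation cost of the structure is evaluated and compared with existing state-of-the-art accelerators. Experimental results show that, with a memory bandwidth of 160 GB/s and a stack of four layers of 16×16 systolic arrays, 3D-MMA reaches a peak performance of 3 TFLOPS at 99% efficiency, with a lower implementation cost than a 2D implementation. In the same process, the performance of 3D-MMA is 1.36 times that of a linear-array accelerator and 1.92 times that of a K40 GPU, while its area is far smaller. The work explores the advantages of 3D integrated circuits in the design of high-performance matrix multiplication accelerators and offers a useful reference for further improving the performance of high-performance computing platforms.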

In analytical queries, a number of important operators like JOIN and GROUP BY are suitable for parallelization, and the GPU is an ideal accelerator considering its parallel computing power. However, when data size increases to hundreds of gigabytes, one GPU card becomes insufficient due to the small capacity of global memory and the slow data transfer between host and device. A straightforward solution is to add more GPUs linked with high-bandwidth connectors, but this raises the cost considerably. We utilize unified memory (UM) provided by NVIDIA CUDA (Compute Unified Device Architecture) to make it possible to accelerate large-scale queries on just one GPU, but we notice that the transfer performance between host and UM, which happens before kernel execution, is often significantly slower than the theoretical bandwidth. An important reason is that, in a single-GPU environment, data processing systems usually invoke only one or a static number of threads for data copy, leading to an inefficient transfer which heavily slows down the overall performance. In this paper, we present D-Cubicle, a runtime module to accelerate data transfer between host-managed memory and unified memory. D-Cubicle boosts the actual transfer speed dynamically through a self-adaptive approach. In our experiments, taking data transfer into account, D-Cubicle processes 200 GB of data on a single GPU with 32 GB of global memory, achieving 1.43x on average and up to 2.09x the performance of the baseline system.

Discrete Wavelet Transform on Consumer-Level Graphics Hardware (cited by: 1; self-citations: 0; citations by others: 1)
Discrete wavelet transform (DWT) has been heavily studied and developed in various scientific and engineering fields. Its multiresolution and locality nature facilitates applications requiring progressiveness and capturing high-frequency details. However, when dealing with enormous data volumes, its performance may drop drastically. On the other hand, with the recent advances in consumer-level graphics hardware, personal computers nowadays are usually equipped with a graphics processing unit (GPU) based graphics accelerator which offers SIMD-based parallel processing power. This paper presents a SIMD algorithm that performs the convolution-based DWT completely on a GPU, which brings a significant performance gain on a normal PC without extra cost. Although the forward and inverse wavelet transforms are mathematically different, the proposed algorithm unifies them into an almost identical process that can be efficiently implemented on a GPU. Different wavelet kernels and boundary extension schemes can be easily incorporated by simply modifying input parameters. To demonstrate its applicability and performance, we apply it to wavelet-based geometric design, stylized image processing, texture-illuminance decoupling, and JPEG2000 image encoding.
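As a rough illustration of why the DWT maps well onto SIMD hardware, the sketch below computes one level of a forward and inverse transform in plain Python. It assumes the Haar kernel, which is swapped in here for simplicity and is not stated in the abstract; the function names haar_forward and haar_inverse are hypothetical, and nothing here runs on a GPU. The point is only that every output pair depends on a fixed, small window of inputs, so each pair could be computed by an independent thread.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)  # orthonormal Haar filter coefficient


def haar_forward(signal):
    """One level of the Haar DWT; returns (approximation, detail).

    Each output pair depends only on inputs 2k and 2k+1, so all pairs
    are independent of one another (the data-parallel property).
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) * SCALE
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * SCALE
              for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert one level: the same add/subtract pattern, applied to (a, d)."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * SCALE)
        out.append((a - d) * SCALE)
    return out


if __name__ == "__main__":
    data = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0]
    low, high = haar_forward(data)
    print(haar_inverse(low, high))
```

For Haar, the forward and inverse passes share the same add/subtract structure, which loosely mirrors the abstract's remark that the two transforms can be unified into an almost identical process; longer kernels would additionally need a boundary extension scheme, which this sketch omits.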

A Survey of GPU Virtualization Techniques (cited by: 1; self-citations: 0; citations by others: 1)
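With the growth of compute-intensive applications, the cloud platforms of companies such as Amazon and Alibaba have begun to introduce GPUs (graphics processing units) to accelerate computation. Letting multiple users share GPUs on a cloud platform improves GPU utilization, lowers cost, and also makes GPUs easier to manage. With the help of virtual machine monitors and various software and hardware mechanisms, GPU virtualization provides a feasible way for cloud platforms to share GPUs. This paper gives a comprehensive analysis of recent progress in GPU virtualization: it first classifies the techniques by the common features of their frameworks, then compares them in terms of extensibility, sharing, usage transparency, performance, and scalability, and finally summarizes the open problems and future directions of GPU virtualization.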

Graphical processing units (GPUs) have recently attracted attention for scientific applications such as particle simulations. This is partially driven by low commodity pricing of GPUs but also by recent toolkit and library developments that make them more accessible to scientific programmers. We discuss the application of GPU programming to two significantly different paradigms: regular mesh field equations with unusual boundary conditions and graph analysis algorithms. The differing optimization techniques required for these two paradigms cover many of the challenges faced when developing GPU applications. We discuss the relevance of these application paradigms to simulation engines and games. GPUs were aimed primarily at the accelerated graphics market but since this is often closely coupled to advanced game products it is interesting to speculate about the future of fully integrated accelerator hardware for both visualization and simulation combined. As well as reporting the speed‐up performance on selected simulation paradigms, we discuss suitable data‐parallel algorithms and present code examples for exploiting GPU features like large numbers of threads and localized texture memory. We find a surprising variation in the performance that can be achieved on GPUs for our applications and discuss how these findings relate to past known effects in parallel computing such as memory speed‐related super‐linear speed up. Copyright © 2009 John Wiley & Sons, Ltd.

Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed‐memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task‐parallel programs executed on hybrid distributed‐memory CPU‐graphics processing unit (GPU) systems in a global‐address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU‐GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state‐of‐the‐art CCSD(T) application module from the computational chemistry domain. Copyright © 2016 John Wiley & Sons, Ltd.
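The basic work stealing idea mentioned in this abstract can be sketched as a small sequential simulation. This is not the paper's distributed CPU-GPU algorithm: the function name run_with_work_stealing, the round-robin scheduling, and the "steal from the longest queue" victim policy are all assumptions made for illustration, and GPU tasks, data movement costs, and the global address space are left out entirely.

```python
from collections import deque


def run_with_work_stealing(queues):
    """Simulate workers taking turns; returns the tasks each worker ran.

    queues: one deque of tasks per worker. An owner pops from the tail
    of its own deque; an idle worker steals from the head of the
    longest deque (an assumed victim-selection policy).
    """
    done = [[] for _ in queues]
    remaining = sum(len(q) for q in queues)
    while remaining:
        for wid, own in enumerate(queues):
            if own:
                task = own.pop()  # owner: newest task, from the tail
            else:
                victim = max(range(len(queues)),
                             key=lambda i: len(queues[i]))
                if not queues[victim]:
                    continue  # nothing left to steal this turn
                task = queues[victim].popleft()  # thief: oldest task
            done[wid].append(task)
            remaining -= 1
    return done


if __name__ == "__main__":
    # Worker 0 starts with all six tasks, worker 1 starts idle.
    print(run_with_work_stealing([deque(range(6)), deque()]))
```

Taking from opposite ends of the deque is the usual design choice in work stealing schedulers: the owner and the thief rarely contend for the same task, and the thief tends to take older, often larger, units of work.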

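A Method for Precisely Locating the Centroid of Large-Diameter Electron Beam Spots (cited by: 1; self-citations: 0; citations by others: 1)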
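High-current pulsed electron beams, as a high-brightness light source, have broad application prospects, and controlling the centroid position of the beam is important for research on the accelerator technology that produces the beams. To meet the requirement of measuring how the position of a short-pulse electron beam varies over time, this work studies a method for computing the beam spot centroid that is relatively insensitive to data background and noise. Building on the traditional centroid method, it determines the centroid position from the intersection of multiple center trajectory curves obtained by computing the centroid over different data regions. This greatly lowers the requirements on data quality and, most importantly, addresses the large errors that arise when an accurate value is taken from a single calculation, as well as the susceptibility to adverse factors such as noise and data truncation. For the centroid of an extended electron beam spot, the computational error remains below 0.5 pixel even in the presence of substantial noise.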
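For orientation, the traditional centroid method that this abstract builds on can be sketched as an intensity-weighted mean of pixel coordinates. The sketch covers only that baseline step; the abstract's refinement (intersecting center trajectory curves computed over different data regions) is not reproduced here. The function name weighted_centroid and the constant background parameter are assumptions for illustration.

```python
def weighted_centroid(image, background=0.0):
    """Intensity-weighted centroid (x, y) of a 2D image given as rows.

    Pixel values at or below `background` contribute no weight; a
    constant background level is an assumed, simplified noise model.
    """
    total = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            weight = max(value - background, 0.0)
            total += weight
            sum_x += weight * x
            sum_y += weight * y
    if total == 0.0:
        raise ValueError("no signal above background")
    return sum_x / total, sum_y / total


if __name__ == "__main__":
    spot = [
        [1, 1, 1, 1],
        [1, 5, 9, 1],
        [1, 5, 9, 1],
        [1, 1, 1, 1],
    ]
    print(weighted_centroid(spot, background=1.0))
```

A single pass like this is sensitive to the choice of background level and to truncated data, which is the weakness the abstract says its multi-region approach is meant to reduce.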

High computational power of GPUs (Graphics Processing Units) offers a promising accelerator for general-purpose computing. However, the need for dedicated programming environments has made the usage of GPUs rather complicated, and a GPU cannot directly execute binary code of a general-purpose application. This paper proposes a two-phase virtual execution environment (GXBIT) for automatically executing general-purpose binary applications on CPU/GPU architectures. GXBIT incorporates two execution phases. The first phase is responsible for extracting parallel hot spots from the sequential binary code. The second phase is responsible for generating the hybrid executable (both CPU and GPU instructions) for execution. This virtual execution environment works well for any applications that run repeatedly. The performance of generated CUDA (Compute Unified Device Architecture) code from GXBIT on a number of benchmarks is close to 63% of the hand-tuned GPU code. It also achieves much better overall performance than the native platforms.
