Similar Documents
19 similar documents found (search time: 187 ms)
1.
张延松  刘专  韩瑞琛  张宇  王珊 《软件学报》2023,34(11):5205-5229
GPU databases have attracted extensive attention from both academia and industry in recent years. Although several prototype and commercial systems (including open-source ones) have been developed as next-generation database systems, it remains an open question whether GPU-based OLAP engines truly outperform CPU-based systems and, if so, which workload/data/query-processing models suit them best; this calls for deeper study. GPU-based OLAP engines follow two main technical routes: the GPU in-memory processing model and the GPU acceleration model. The former stores the entire dataset in GPU device memory to fully exploit the GPU's computational power and high-bandwidth memory; its drawbacks are that the limited capacity of GPU memory constrains dataset size and that sparse access patterns reduce the storage efficiency of device memory. The latter keeps only part of the dataset in GPU memory and uses the GPU to accelerate compute-intensive workloads so that large datasets can be supported; its main challenge is choosing optimized data-placement and workload-distribution models for GPU memory that minimize PCIe transfer cost and maximize GPU computational efficiency. This work integrates both routes into one OLAP acceleration engine, studying OLAP Accelerator, an OLAP framework on a customized hybrid CPU-GPU platform. It designs three OLAP computation models (CPU in-memory, GPU in-memory, and GPU-accelerated), implements vectorized query processing on the GPU, optimizes device-memory utilization and query performance, and explores the different technical routes and performance characteristics of GPU databases. Experimental results show that the GPU in-memory vectorized query-processing model achieves the best performance and memory utilization, with speedups of 3.1x and 4.2x over OmniSciDB and Hyper, respectively. The partition-based GPU acceleration model accelerates only the join workload to balance load between the CPU and GPU sides, and supports larger datasets than the GPU in-memory model.
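The vectorized, column-at-a-time processing style such an engine relies on is easy to illustrate. The CUDA sketch below filters one integer column into a selection vector; the kernel name, predicate, and sizes are invented for illustration and are not the paper's OLAP Accelerator code.

```cpp
// Minimal sketch of GPU column-at-a-time vectorized filtering.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void filterColumn(const int *col, int n, int lo, int hi,
                             unsigned char *sel) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // Emit a selection vector instead of materializing rows, so the
        // next operator in the plan can consume it without branching.
        sel[i] = (col[i] >= lo && col[i] < hi);
}

int main() {
    const int n = 1 << 20;
    int *hCol = new int[n];
    for (int i = 0; i < n; ++i) hCol[i] = i % 1000;

    int *dCol; unsigned char *dSel;
    cudaMalloc((void **)&dCol, n * sizeof(int));
    cudaMalloc((void **)&dSel, n);
    cudaMemcpy(dCol, hCol, n * sizeof(int), cudaMemcpyHostToDevice);

    filterColumn<<<(n + 255) / 256, 256>>>(dCol, n, 100, 200, dSel);
    cudaDeviceSynchronize();

    unsigned char *hSel = new unsigned char[n];
    cudaMemcpy(hSel, dSel, n, cudaMemcpyDeviceToHost);
    long hits = 0;
    for (int i = 0; i < n; ++i) hits += hSel[i];
    printf("selected %ld of %d rows\n", hits, n);
    return 0;
}
```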

2.
The sweeping integration of artificial intelligence into every industry is pushing traditional cloud platforms to embrace many-core architectures, typified by the graphics processing unit (GPU). To serve different tenants' demands for compute-dense workloads such as machine learning and deep learning, cloud platforms have invested heavily in GPU virtualization. Security is a key aspect of deploying GPU virtualization on cloud platforms, yet it has rarely been treated systematically. This paper therefore addresses two fundamental questions of cloud-platform GPU virtualization security: the potential threats that typical GPU virtualization techniques introduce, and the resulting security requirements together with the evolution of protection techniques. First, it analyzes typical GPU virtualization methods and their security mechanisms, and surveys side-channel, covert-channel, and memory-overflow attacks against them. Second, it examines the potential security threats that GPU virtualization brings to cloud platforms and summarizes the corresponding security requirements. Finally, it proposes five research directions: coordinated isolation of GPU compute and memory resources to guarantee performance isolation among tenants' tasks, behavioral profiling of GPU tasks to detect malicious programs, secure GPU task scheduling, blocking of multi-layer joint attacks, and desensitization of GPU-derived information. We hope this survey provides a useful reference for the development and application of GPU virtualization security on cloud platforms.

3.
In current virtual desktop deployments, the tension between end users' growing demand for 3D graphics processing and the limited GPU capability of virtual machines has become increasingly apparent. To address this problem, typical approaches to GPU virtualization are studied. Building on an analysis of these techniques, an improved scheme combining device pass-through and API remoting is presented. A hypervisor creates two kinds of virtual machines: one parent VM (GVM) and multiple child VMs (DVMs). The GVM exclusively owns the physical GPU, while DVMs have no direct interaction with it. The two kinds of VMs share GPU memory and a command channel: GPU call commands issued in a DVM are forwarded to the GVM, which invokes the physical GPU on their behalf and writes the results back to the shared memory region for presentation to the user. Finally, the improved scheme is compared with typical virtualization approaches, its strengths and weaknesses are summarized, and directions for future research are outlined.
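To make the shared command channel concrete, here is a toy single-producer/single-consumer ring of GPU commands standing in for the DVM-to-GVM instruction path; a real system would place such a ring in hypervisor-shared pages with proper cross-VM synchronization, and every name here is hypothetical.

```cpp
// Toy SPSC command ring: the DVM pushes GPU commands, the GVM pops and
// executes them against the physical GPU it exclusively owns.
#include <atomic>
#include <cstdint>
#include <cstdio>

struct GpuCmd { uint32_t op; uint64_t arg0, arg1; };

struct CmdRing {
    static const uint32_t N = 256;            // power of two
    GpuCmd slots[N];
    std::atomic<uint32_t> head{0}, tail{0};   // DVM writes head, GVM writes tail

    bool push(const GpuCmd &c) {              // called in the DVM
        uint32_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == N) return false; // full
        slots[h % N] = c;
        head.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(GpuCmd &c) {                     // called in the GVM
        uint32_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;     // empty
        c = slots[t % N];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};

int main() {
    CmdRing ring;
    ring.push({1 /* hypothetical DRAW op */, 0x1000, 64});
    GpuCmd c;
    while (ring.pop(c))
        printf("GVM executes op %u on the physical GPU\n", c.op);
    return 0;
}
```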

4.
GPUs can significantly boost the performance of some network functions, but in GPU-accelerated Network Function Virtualization (NFV) systems, because network functions must be developed and deployed independently in virtualized form, the CPU stage of the CPU-GPU processing pipeline incurs large overheads that blunt the benefit of GPU acceleration. To solve this problem, a new GPU-enabled NFV framework is proposed. Exploiting the fact that network functions in a service chain share packet data and flow state, it introduces a shared state-management mechanism that eliminates redundant protocol-stack processing and flow-state bookkeeping across functions, improving the payoff of GPU acceleration. Evaluation of a prototype shows that, compared with existing frameworks, it markedly reduces the CPU-stage latency of several GPU-accelerated network functions and achieves up to 2x throughput on common service chains.

5.
For distributed environments with multiple nodes and multiple GPUs per node, a CUDA-based multi-GPU general-purpose computing virtualization platform is implemented. Applications can use multiple remote GPUs as conveniently as local ones, and existing CUDA programs run on the virtualized GPU platform with little or no modification, unifying the programming model for single-node multi-GPU and multi-node multi-GPU setups. The platform is validated with a data-clustering program based on a Gaussian mixture model. Experiments show that, without affecting program correctness, using two remote GPUs yields roughly a tenfold speedup over the original CPU version.
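The core of such API remoting is a client-side stub that serializes each CUDA call and ships it to the node owning the GPU. The sketch below shows the shape of that interception for cudaMalloc; the message layout, call IDs, sendToGpuServer(), and the returned handle are all invented for illustration.

```cpp
// Hypothetical client-side stub illustrating CUDA API remoting.
#include <cstdint>
#include <cstdio>

enum CallId : uint32_t { CALL_CUDA_MALLOC = 1, CALL_CUDA_FREE = 2 };

struct RpcMessage {
    CallId   id;
    uint64_t args[4];          // call-specific arguments
};

// Transport is abstracted away; a real platform would use TCP/RDMA and
// block for the server's reply.
static void sendToGpuServer(const RpcMessage &m) {
    printf("forward call %u (arg0=%llu) to remote GPU server\n",
           m.id, (unsigned long long)m.args[0]);
}

// Same signature shape as cudaMalloc, so existing CUDA programs can be
// linked against the stub with little or no modification.
int remoteCudaMalloc(void **devPtr, size_t size) {
    RpcMessage m{CALL_CUDA_MALLOC, {size, 0, 0, 0}};
    sendToGpuServer(m);
    *devPtr = reinterpret_cast<void *>(0x1000); // placeholder remote handle
    return 0;                                   // cudaSuccess
}

int main() {
    void *p = nullptr;
    remoteCudaMalloc(&p, 1 << 20);
    printf("remote device handle: %p\n", p);
    return 0;
}
```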

6.
张鸿骏  武延军  张珩  张立波 《软件学报》2020,31(10):3038-3055
The hash table, a data-indexing structure that offers efficient data access by key value, is widely used across computer applications, especially in performance-critical system software, databases, and high-performance computing. In networking, cloud computing, and IoT services, hash tables have become a core component of caching systems. However, as data volumes grow dramatically, hash-table designs centered on multi-core CPUs are hitting performance bottlenecks, and their performance and scalability urgently need improvement. With the growing adoption of general-purpose graphics processing units (GPUs) and the leap in hardware compute capability and concurrency, many parallel system-software tasks have been redesigned for the GPU with substantial gains. Because of sparsity and randomness, however, naively porting existing parallel hash-table structures to the GPU incurs frequent memory accesses and bus transfers, limiting the performance the GPU can deliver. This paper analyzes the memory accesses, hit rates, and indexing overheads of hash-table indexes in caching systems, and proposes CCHT (cache cuckoo hash table), a GPU-oriented hybrid-access cache-indexing framework. CCHT offers two cache policies for different hit-rate and overhead requirements, allows insert and query operations to execute concurrently, maximizes the use of the GPU's compute performance and concurrency, and reduces memory accesses and bus transfers. Implemented and evaluated on GPU hardware, CCHT outperforms other hash tables used for cache indexing while maintaining the cache hit rate.
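The two-hash-function cuckoo scheme underlying CCHT guarantees at most two probes per lookup. The sketch below pairs a GPU lookup kernel with a host-side cuckoo insert using displacement; CCHT's cache policies, concurrent writes, and GPU-side inserts are not reproduced, and the hash constants and table size are illustrative.

```cpp
// Minimal cuckoo hash: two candidate slots per key, O(1) GPU lookups.
#include <cuda_runtime.h>
#include <cstdio>

const unsigned EMPTY = 0xFFFFFFFFu;

__host__ __device__ unsigned h1(unsigned k, unsigned cap) { return (k * 2654435761u) % cap; }
__host__ __device__ unsigned h2(unsigned k, unsigned cap) { return (k * 40503u + 131) % cap; }

__global__ void lookupKernel(const unsigned *table, unsigned cap,
                             const unsigned *keys, int n, int *found) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned k = keys[i];
        // At most two memory probes per lookup.
        found[i] = (table[h1(k, cap)] == k) || (table[h2(k, cap)] == k);
    }
}

// Host-side cuckoo insert: evict and relocate until a free slot is found.
bool insert(unsigned *table, unsigned cap, unsigned k) {
    unsigned pos = h1(k, cap);
    for (int kick = 0; kick < 64; ++kick) {
        if (table[pos] == EMPTY) { table[pos] = k; return true; }
        unsigned evicted = table[pos];
        table[pos] = k;
        k = evicted;                               // displace the old key
        pos = (pos == h1(k, cap)) ? h2(k, cap) : h1(k, cap);
    }
    return false;                                  // would need a rehash
}

int main() {
    const unsigned cap = 1024;
    const int n = 100;
    unsigned hTable[cap], hKeys[n]; int hFound[n];
    for (unsigned i = 0; i < cap; ++i) hTable[i] = EMPTY;
    for (int i = 0; i < n; ++i) { hKeys[i] = 7 * i + 3; insert(hTable, cap, hKeys[i]); }

    unsigned *dTable, *dKeys; int *dFound;
    cudaMalloc((void **)&dTable, cap * sizeof(unsigned));
    cudaMalloc((void **)&dKeys, n * sizeof(unsigned));
    cudaMalloc((void **)&dFound, n * sizeof(int));
    cudaMemcpy(dTable, hTable, cap * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemcpy(dKeys, hKeys, n * sizeof(unsigned), cudaMemcpyHostToDevice);

    lookupKernel<<<(n + 255) / 256, 256>>>(dTable, cap, dKeys, n, dFound);
    cudaMemcpy(hFound, dFound, n * sizeof(int), cudaMemcpyDeviceToHost);

    int hits = 0;
    for (int i = 0; i < n; ++i) hits += hFound[i];
    printf("%d of %d keys found\n", hits, n);
    return 0;
}
```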

7.
余宽隆  陈渝  茅俊杰  张磊 《计算机科学》2014,41(10):7-11,22
Running multiple operating systems concurrently on a mobile device broadens and diversifies its usage modes, but current mobile virtualization techniques incur performance overheads and extra memory consumption. After analyzing the challenges of multi-OS memory management and peripheral allocation that supporting multiple operating systems on a single mobile device entails, this work designs new techniques for online physical-memory allocation and time-multiplexed peripherals, and implements the ARM-MuxOS prototype on a Galaxy Nexus smartphone. The system not only supports multiple operating systems on a single mobile device but also manages their memory allocation under tight memory budgets, avoiding the performance overhead and engineering cost of traditional virtualization. Experiments show that ARM-MuxOS supports fast concurrent execution of Android and Firefox OS and outperforms existing mobile virtualization systems in both performance and memory consumption.

8.
Memory virtualization, i.e., how to effectively abstract, utilize, and isolate the computer's physical memory, is central to system virtualization and determines its overall performance. Traditional pure-software memory virtualization incurs heavy resource overheads and poor compatibility, while hardware-assisted memory virtualization requires redesigning the processor's hardware architecture. This paper proposes a hardware-software cooperative memory-virtualization method for MIPS processors that improves memory-virtualization performance without additional hardware support. The proposed multi-layer virtual address space model not only fixes the virtualization deficiencies of MIPS processors but also improves performance over existing memory-virtualization methods. On top of this model, a dynamically partitioned TLB (translation lookaside buffer) sharing scheme based on ASIDs (address space identifiers) reduces VM-switch overhead. Finally, the system virtual machine VIRT-LOONGSON is implemented on the MIPS-based Loongson-3 processor. Performance tests show that the method improves the performance of most test programs, reaching 3-5x the performance of binary-translation execution and gaining 5%-16% over the TLB-simulation method.

9.
Hash tables, known for their O(1) access time, are adopted by many big-data applications as an algorithm and data structure for efficient large-scale data access, e.g., in emerging high-performance computing (HPC) and database workloads and scenarios. With the rising performance of GPU co-processors, parallel optimization of hash tables for high-performance GPU environments has attracted substantial research. Existing GPU hash-table optimizations focus on exploiting the GPU's massive thread parallelism and high memory bandwidth for highly concurrent transaction processing and fast key-value access. However, they generally neglect effective management of GPU resources, failing to fully utilize both thread and device-memory resources. Moreover, the limited size of GPU device memory caps how large a resident hash-table structure can be. The scalability and performance of GPU hash tables therefore remain technically challenging. This paper designs Starfish, a GPU hash-table technique that handles large-scale concurrent transactions. Starfish proposes a novel "swap layer" based on asynchronous GPU streams to support dynamic hash tables larger than GPU device memory while preserving the performance of GPU hash-table indexing. To address the hash…
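The swap-layer idea can be sketched with standard CUDA streams: table segments live in pinned host memory and are staged in and out of device memory on a dedicated stream, so transfers overlap with kernels running on other streams. Segment names and sizes below are illustrative, not Starfish's actual layout.

```cpp
// Staging a hash-table segment in/out of device memory on its own stream.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t segBytes = 4 << 20;          // one 4 MB table segment
    unsigned *hostSeg, *devSeg;
    cudaMallocHost((void **)&hostSeg, segBytes); // pinned => true async copies
    cudaMalloc((void **)&devSeg, segBytes);
    for (size_t i = 0; i < segBytes / sizeof(unsigned); ++i) hostSeg[i] = i;

    cudaStream_t swapStream;
    cudaStreamCreate(&swapStream);

    // Swap the segment in without blocking kernels on other streams.
    cudaMemcpyAsync(devSeg, hostSeg, segBytes,
                    cudaMemcpyHostToDevice, swapStream);
    // ... kernels indexing other, already-resident segments run here ...
    cudaStreamSynchronize(swapStream);        // segment is now resident
    printf("segment staged into device memory\n");

    // Eviction works the same way in the opposite direction.
    cudaMemcpyAsync(hostSeg, devSeg, segBytes,
                    cudaMemcpyDeviceToHost, swapStream);
    cudaStreamSynchronize(swapStream);

    cudaStreamDestroy(swapStream);
    cudaFreeHost(hostSeg);
    cudaFree(devSeg);
    return 0;
}
```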

10.
For accelerating the training of deep neural networks on distributed multi-machine, multi-GPU systems, an implementation based on virtualized remote multi-GPU calls is proposed. It uses a distributed GPU cluster deployed with remote GPU invocation to improve the traditional one-to-one virtualization, and relocates where parameters are exchanged during distributed multi-GPU training so that the two are compatible. The method exploits remote GPU resources in a distributed environment to accelerate DNN training and unifies single-node multi-GPU and multi-node multi-GPU setups under the same CUDA programming model. Taking handwritten-digit recognition as an example, experiments with multi-machine, multi-GPU data-parallel training of a deep neural network over a commodity network validate the method's effectiveness and feasibility.
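The paper's contribution lies in where the parameter exchange happens under remote GPU calls; as generic background, the sketch below shows only the standard data-parallel exchange step itself, a gradient allreduce followed by averaging, which is not the paper's specific mechanism.

```cpp
// Generic data-parallel parameter exchange: sum gradients across
// workers, then divide by the worker count to average them.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;                       // stand-in gradient length
    std::vector<float> grad(n, rank + 1.0f);  // stand-in local gradient

    MPI_Allreduce(MPI_IN_PLACE, grad.data(), n, MPI_FLOAT,
                  MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < n; ++i) grad[i] /= size;

    if (rank == 0)
        printf("averaged gradient[0] = %f\n", grad[0]);
    MPI_Finalize();
    return 0;
}
```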

11.
蒋筱斌  熊轶翔  张珩  武延军  赵琛 《软件学报》2023,34(4):1977-1996
As data grow ever larger in scale and more diverse in structure, how to use the heterogeneous co-processors attached to modern interconnects to provide real-time, reliable parallel runtimes for large-scale data processing has become a research focus in high-performance computing and databases. Modern servers equipped with multiple GPU co-processors (multi-GPU servers) have become the high-performance platform of choice for analyzing large-scale, irregular graph data. Graph computing systems and algorithms designed for multi-GPU server architectures (such as breadth-first traversal and shortest-path algorithms) already significantly outperform multi-core CPU environments. In such systems, however, the transfer of graph partitions between GPU co-processors is limited by PCIe bus bandwidth and local latency, so adding GPU devices fails to yield near-linear growth of overall system performance and can even cause severe latency jitter, which no longer satisfies the scalability requirements of large-scale parallel graph-computing systems. A series of benchmark experiments reveals two flaws in existing systems: (1) the hardware architecture of inter-GPU data paths keeps evolving (e.g., NVLink-V1, NVLink-V2), greatly improving link bandwidth and latency, yet existing systems remain confined to PCIe for partition communication and cannot fully exploit modern GPU link resources (including link topology, connectivity, and routing); (2) …

12.
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems need to be explored for large-scale MHD simulations; however, the data transfer between GPUs bottlenecks the efficiency of simulations on such systems. In this paper we propose a novel GPU Direct–MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) we exploit GPU Direct 2.0 to speed up data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) we design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speed up the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable on distributed multi-GPU systems because of the low efficiency of its fragmented data exchange; our approach makes 3D decomposition practical on such systems, reducing the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS compared with a common MPI-only 2D-decomposition implementation. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, the MGPU–MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single-GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements were conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 1200³ grid points using 216 GPUs.
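The second strategy, replacing fragmented copies with a kernel, can be sketched as a face-packing kernel: the strided x-face of a 3D block is gathered into a contiguous buffer that then travels in one GPU Direct/MPI transfer. The grid layout and sizes below are illustrative.

```cpp
// Gather one strided x-face of a 3D block into a contiguous buffer.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void packXFace(const double *grid, double *buf,
                          int nx, int ny, int nz, int x) {
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    int z = blockIdx.y * blockDim.y + threadIdx.y;
    if (y < ny && z < nz)
        // grid is laid out as grid[z][y][x]; the x-face is fully strided.
        buf[z * ny + y] = grid[((size_t)z * ny + y) * nx + x];
}

int main() {
    const int nx = 128, ny = 128, nz = 128;
    double *dGrid, *dBuf;
    cudaMalloc((void **)&dGrid, (size_t)nx * ny * nz * sizeof(double));
    cudaMalloc((void **)&dBuf, (size_t)ny * nz * sizeof(double));
    cudaMemset(dGrid, 0, (size_t)nx * ny * nz * sizeof(double));

    dim3 block(16, 16);
    dim3 gridDim((ny + 15) / 16, (nz + 15) / 16);
    packXFace<<<gridDim, block>>>(dGrid, dBuf, nx, ny, nz, nx - 1);
    cudaDeviceSynchronize();
    printf("face packed: %d values ready for one contiguous transfer\n",
           ny * nz);

    cudaFree(dGrid); cudaFree(dBuf);
    return 0;
}
```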

13.
Existing GPU-accelerated High-Performance Linpack (HPL) benchmark implementations generally use dynamic load balancing based on measured compute capability. That algorithm performs poorly on single-node multi-GPU platforms, however, because each GPU's share of the work is small and the aggregate performance gap between the GPUs and the CPU is large. This paper proposes an experience-guided dynamic load-balancing algorithm together with a multi-GPU adaptive load-balancing algorithm, validated on a single-node multi-GPU platform; the results show a 6.3% speedup over the existing HPL implementation for NVIDIA Fermi GPUs.
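The feedback rule behind such dynamic balancing is simple: from the measured times of the last panel, recompute the GPU's share so that both sides finish together. The sketch below is an illustrative version of that rule, not the paper's exact heuristic.

```cpp
// Feedback rule for CPU/GPU work partitioning.
#include <cstdio>

// f = fraction of the panel given to the GPU in the previous step.
double nextGpuFraction(double f, double tGpu, double tCpu) {
    double gpuRate = f / tGpu;            // work per second on the GPU
    double cpuRate = (1.0 - f) / tCpu;    // work per second on the CPU
    return gpuRate / (gpuRate + cpuRate); // share equalizing finish times
}

int main() {
    double f = 0.5;                       // naive initial split
    // Pretend the GPU runs 8x faster than the CPU on each panel.
    for (int step = 0; step < 5; ++step) {
        double tGpu = f / 8.0, tCpu = (1.0 - f) / 1.0; // synthetic timings
        f = nextGpuFraction(f, tGpu, tCpu);
        printf("step %d: GPU fraction = %.4f\n", step, f);
    }
    return 0;                             // converges to 8/9 ~= 0.889
}
```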

14.
We present an innovative, massively-parallel heterogeneous architecture for the very fast construction and implementation of very large Aho–Corasick and Commentz-Walter pattern-matching automata, commonly used in data-matching applications, and validate its use with large rule sets actively used in intrusion detection systems. Our approach represents the first known hybrid-parallel model for the construction of such automata and the first to allow pattern-matching automata to self-adjust in real time, by allowing full-duplex transfers at maximum throughput between the host (CPU) and the device (GPU). The proposed architecture scales easily to multi-GPU and multi-CPU systems and benefits greatly from GPU acceleration. It also relies on a highly efficient storage model for the automata, includes on-demand support for regular-expression matching, and supports custom heuristics built on top of the architecture at different processing stages.
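For reference, the structure being parallelized is the classic Aho–Corasick automaton; a compact CPU-side construction (goto table plus BFS-built failure links) is sketched below, with the hybrid CPU/GPU pipeline itself omitted.

```cpp
// Compact Aho-Corasick automaton: goto table + BFS failure links.
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

struct AhoCorasick {
    struct Node { int next[256] = {}; int fail = 0; bool hit = false; };
    std::vector<Node> t;
    AhoCorasick() : t(1) {}

    void add(const std::string &p) {           // insert one pattern
        int s = 0;
        for (unsigned char c : p) {
            if (!t[s].next[c]) {
                t.push_back(Node());
                t[s].next[c] = (int)t.size() - 1;
            }
            s = t[s].next[c];
        }
        t[s].hit = true;
    }
    void build() {                             // BFS fills failure links
        std::queue<int> q;
        for (int c = 0; c < 256; ++c)
            if (t[0].next[c]) q.push(t[0].next[c]);
        while (!q.empty()) {
            int s = q.front(); q.pop();
            for (int c = 0; c < 256; ++c) {
                int &nx = t[s].next[c];
                if (!nx) { nx = t[t[s].fail].next[c]; continue; }
                t[nx].fail = t[t[s].fail].next[c];
                t[nx].hit = t[nx].hit || t[t[nx].fail].hit;
                q.push(nx);
            }
        }
    }
    bool matchAny(const std::string &text) const { // one state step per byte
        int s = 0;
        for (unsigned char c : text) {
            s = t[s].next[c];
            if (t[s].hit) return true;
        }
        return false;
    }
};

int main() {
    AhoCorasick ac;
    ac.add("attack"); ac.add("exploit");       // toy signature set
    ac.build();
    printf("%d\n", ac.matchAny("no exploit here")); // prints 1
    return 0;
}
```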

15.
Graphics processing units (GPUs), originally designed for graphics rendering, have emerged as massively-parallel "co-processors" to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can substantially accelerate compute-intensive simulation science applications. In this study, we describe the implementation of an incompressible-flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA's Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity, and this memory-hierarchy-specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single- and double-precision computational performance on the GPU. Using a quad-GPU platform for single-precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective, small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.
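One representative kernel in the projection algorithm is a Jacobi sweep of the pressure Poisson equation ∇²p = div. The sketch below shows such a sweep with ping-pong buffers; the full solver, boundary conditions, and the multi-GPU Pthreads layer are omitted, and sizes are illustrative.

```cpp
// One Jacobi sweep of the pressure Poisson solve in the projection step.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void jacobiSweep(const float *p, float *pNew, const float *div,
                            int nx, int ny, float h2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
        int c = j * nx + i;
        // p_new = (sum of 4 neighbors - h^2 * div) / 4
        pNew[c] = 0.25f * (p[c - 1] + p[c + 1] + p[c - nx] + p[c + nx]
                           - h2 * div[c]);
    }
}

int main() {
    const int nx = 256, ny = 256;
    const size_t bytes = (size_t)nx * ny * sizeof(float);
    float *dP, *dPNew, *dDiv;
    cudaMalloc((void **)&dP, bytes);
    cudaMalloc((void **)&dPNew, bytes);
    cudaMalloc((void **)&dDiv, bytes);
    cudaMemset(dP, 0, bytes); cudaMemset(dDiv, 0, bytes);

    dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
    for (int it = 0; it < 100; ++it) {            // fixed iteration count
        jacobiSweep<<<grid, block>>>(dP, dPNew, dDiv, nx, ny, 1.0f);
        float *tmp = dP; dP = dPNew; dPNew = tmp; // ping-pong buffers
    }
    cudaDeviceSynchronize();
    printf("pressure solve finished\n");
    cudaFree(dP); cudaFree(dPNew); cudaFree(dDiv);
    return 0;
}
```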

16.
《Parallel Computing》2014,40(5-6):86-99
Simulation of in vivo cellular processes with the reaction–diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems.

17.
MPI (Message Passing Interface) was designed for node-dense, large-scale compute clusters, but with the advent of MPI+CUDA (Compute Unified Device Architecture) applications and clusters whose compute nodes carry GPUs, traditional MPI-like communication libraries no longer suffice. Machine learning faces the same challenge: in deep-learning frameworks such as Caffe and CNTK (Microsoft Cognitive Toolkit), GPUs cache huge volumes of data during training, and because most training optimizers are iterative, inter-GPU communication is both high-volume and high-frequency, now one of the main factors limiting deep-learning training performance. Collective communication libraries such as NCCL (NVIDIA Collective Communications Library) target deep-learning communication but have issues such as incompatibility with MPI. Designing a more efficient communication-acceleration mechanism suited to these new trends is therefore crucial. To meet these challenges, this paper proposes two new broadcast mechanisms: (1) a pipelined-chain (PC) mechanism based on MPI_Bcast that provides efficient intra- and inter-node communication for GPU buffers; (2) a topology-aware pipelined-chain (TA-PC) mechanism for multi-GPU cluster systems that fully utilizes the PCIe links available among multi-GPU nodes. To validate the proposed broadcast designs, experiments were run on three differently configured GPU clusters: the GPU-dense cluster RX1, the node-dense cluster RX2, and the balanced cluster RX3. Against MPI+NCCL1 MPI_Bcast, the new designs achieve roughly 14x and 16.6x improvements for intra-node and inter-node communication, respectively; against NCCL2, they achieve about 10x for small and medium messages and comparable performance for large messages; and TA-PC outperforms PC by about 50% on a 64-GPU cluster. The results demonstrate clear advantages of the proposed solution in both portability and performance.
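The pipelined-chain idea is independent of the GPU specifics: cut the broadcast buffer into chunks and relay them rank-to-rank, so forwarding of chunk i overlaps with receiving chunk i+1. The MPI sketch below shows that pattern; with a CUDA-aware MPI the buffer could be a device pointer, and topology awareness (TA-PC) is not modeled.

```cpp
// Pipelined chain broadcast: root's buffer relayed chunk by chunk.
#include <mpi.h>
#include <vector>
#include <cstdio>

void chainBcast(char *buf, int count, int chunk, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int off = 0; off < count; off += chunk) {
        int len = (off + chunk <= count) ? chunk : count - off;
        if (rank > 0)                       // receive chunk from predecessor
            MPI_Recv(buf + off, len, MPI_CHAR, rank - 1, 0, comm,
                     MPI_STATUS_IGNORE);
        if (rank < size - 1)                // forward it down the chain
            MPI_Send(buf + off, len, MPI_CHAR, rank + 1, 0, comm);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<char> buf(1 << 20, rank == 0 ? 'x' : 0);
    chainBcast(buf.data(), (int)buf.size(), 64 << 10, MPI_COMM_WORLD);
    printf("rank %d: buf[0] = %c\n", rank, buf[0]);

    MPI_Finalize();
    return 0;
}
```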

18.
We present and compare the performance of two many-core architectures, the NVIDIA Kepler and the Intel MIC, both in a single system and in a cluster configuration, for the simulation of spin systems. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin-glass model using the over-relaxation algorithm. We also present data for a traditional high-end multi-core architecture, the Intel Sandy Bridge. The results show that although the two Intel architectures can run basically the same code, the performance of the Intel MIC changes dramatically depending on (apparently) minor details. Another issue is that obtaining reasonable scalability with the Intel Phi coprocessor (the coprocessor that implements the MIC architecture) in a cluster configuration requires the so-called offload mode, which reduces the performance of the single system. As for the GPU, the Kepler architecture offers a clear advantage over the previous Fermi architecture while maintaining exactly the same source code. Scalability of the multi-GPU implementation remains very good when the CPU is used as a communication co-processor for the GPU. All source codes are provided for inspection and for double-checking the results.

19.
One of the main building blocks and major challenges for 5G cellular systems is the design of flexible network architectures, which can be realized through the software-defined networking paradigm. Existing commercial cellular systems rely on closed and inflexible hardware-based architectures both at the radio frontend and in the core network. These problems significantly delay the adoption and deployment of new standards, impose significant challenges for implementing and innovating new techniques to maximize network capacity and coverage, and prevent the provisioning of truly differentiated services able to adapt to growing, uneven, and highly variable traffic patterns. In this paper, a new software-defined architecture for next-generation (5G) wireless systems, called SoftAir, is introduced. Specifically, the novel ideas of network function cloudification and network virtualization are exploited to provide a scalable, flexible, and resilient network architecture. Moreover, the essential enabling technologies to support and manage the proposed architecture are discussed in detail, including fine-grained base station decomposition, seamless incorporation of OpenFlow, mobility-aware control traffic balancing, resource-efficient network virtualization, and distributed and collaborative traffic classification. Furthermore, the major benefits of the SoftAir architecture and its enabling technologies are showcased through software-defined traffic-engineering solutions, and the challenging issues in realizing SoftAir are discussed in detail.
