Similar Articles
Found 20 similar articles. Search time: 31 ms.
1.
Prestack time migration of seismic waves is one of the most effective methods for imaging structurally complex strata. Seismic exploration has entered the era of massive data, and prestack migration is the most time-consuming stage of data processing, so parallel optimization of prestack migration algorithms is of significant research interest. In recent years, high-performance parallel computing has entered a heterogeneous, many-core era; new many-core processors such as Intel's Xeon Phi (MIC) offer low cost and high performance. Starting from the classic Kirchhoff prestack time migration (PKTM) algorithm, we port and optimize PKTM in parallel on a CPU+MIC heterogeneous platform using the offload programming model. For an application problem on the order of 60 million points (8,000 x 8,000), the total simulation time drops from 357.52 s to 1.66 s, a 214.37x performance improvement.
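A minimal sketch of the offload pattern described above, assuming the classic Intel compiler's offload pragmas and hypothetical kernel/buffer names (migrate_point, traces, image); not the paper's actual code:

```c
#include <omp.h>

/* Hypothetical per-point Kirchhoff summation; compiled for host and MIC. */
__attribute__((target(mic)))
static void migrate_point(const float *traces, float *image,
                          long n_traces, long ip) {
    (void)traces; (void)n_traces;
    image[ip] += 0.0f;  /* travel-time summation would go here */
}

void pktm_offload(const float *traces, float *image,
                  long n_traces, long n_image) {
    /* Ship the traces to the coprocessor, run the migration loop across
     * all MIC threads, and copy the partial image back. */
    #pragma offload target(mic:0) in(traces : length(n_traces)) \
                                  inout(image : length(n_image))
    {
        #pragma omp parallel for schedule(dynamic)
        for (long ip = 0; ip < n_image; ip++)
            migrate_point(traces, image, n_traces, ip);
    }
}
```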

2.
Stackless traversal algorithms for ray tracing acceleration structures require significantly less storage per ray than ordinary stack‐based ones. This advantage is important for massively parallel rendering methods, where there are many rays in flight. On SIMD architectures, a commonly used acceleration structure is the MBVH, which has multiple bounding boxes per node for improved parallelism. It scales to branching factors higher than two, for which, however, only stack‐based traversal methods have been proposed so far. In this paper, we introduce a novel stackless traversal algorithm for MBVHs with up to four‐way branching. Our approach replaces the stack with a small bitmask, supports dynamic ordered traversal, and has a low computation overhead. We also present efficient implementation techniques for recent CPU, MIC (Intel Xeon Phi) and GPU (NVIDIA Kepler) architectures.
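A hedged sketch of how such a bitmask traversal can be organized (not necessarily the paper's exact scheme): deferred siblings are recorded four bits per level in a small "trail" word instead of a node stack, and backtracking follows parent links. Node layout, box test, and leaf convention are placeholder assumptions:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct Node {
    const struct Node *parent;    /* link used for stackless backtracking */
    const struct Node *child[4];  /* NULL child[0] marks a leaf (assumed) */
} Node;

/* 4-bit mask of children whose boxes the ray hits (placeholder). */
extern unsigned hit_mask(const Node *n, const float ray[8]);
extern void shade_leaf(const Node *n, const float ray[8]);

void traverse(const Node *root, const float ray[8]) {
    uint64_t trail = 0;               /* 4 deferred-sibling bits per level;
                                         a 64-bit word covers 16 levels    */
    const Node *node = root;
    unsigned m = hit_mask(node, ray) & 0xFu;
    for (;;) {
        if (m == 0) {                 /* this node is exhausted */
            if (trail == 0) return;   /* nothing deferred anywhere: done */
            node = node->parent;      /* backtrack without a stack */
            m = trail & 0xFu;         /* restore that level's pending mask */
            trail >>= 4;
            continue;
        }
        int slot = __builtin_ctz(m);  /* dynamic ordering would pick the
                                         nearest hit child here instead */
        m &= m - 1u;                  /* clear it from the pending mask */
        const Node *c = node->child[slot];
        if (c->child[0] == NULL) {    /* leaf */
            shade_leaf(c, ray);
        } else {                      /* interior: descend, defer siblings */
            trail = (trail << 4) | m;
            node = c;
            m = hit_mask(node, ray) & 0xFu;
        }
    }
}
```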

3.
4.
For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures but also an effective scheduling scheme is needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)-2 symmetric matrix-vector multiplication, and the BLAS-3 low rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi-GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU-GPU kernel into computational kernels at higher levels of the software stack, namely a shared-memory dense eigensolver and a distributed-memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher-level kernels, not only reducing the solution time but also enabling the solution of larger-scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms present algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels on their own but also useful testbeds to study the performance of the emerging computers and the effects of the various optimization techniques. Copyright © 2013 John Wiley & Sons, Ltd.
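As a concrete sketch of the BLAS-2 building block being scheduled, here is a plain cuBLAS symmetric matrix-vector call on one GPU; in the kind of scheme the abstract describes, each GPU would apply this to its own block of the matrix while CPU cores handle the remainder (a sketch, not the authors' custom-optimized kernel):

```c
#include <cublas_v2.h>

/* y = A*x for symmetric A (lower triangle referenced); dA, dx, dy are
 * device pointers already allocated and filled. One such call per GPU on
 * its own block of A is the granularity a static schedule would dispatch. */
void symv_on_gpu(cublasHandle_t handle, int n,
                 const float *dA, int lda, const float *dx, float *dy) {
    const float one = 1.0f, zero = 0.0f;
    cublasSsymv(handle, CUBLAS_FILL_MODE_LOWER, n,
                &one, dA, lda, dx, 1, &zero, dy, 1);
}
```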

5.
The paper deals with parallelization of computing similarity measures between large vectors. Such computations are important components of many applications. Rather than optimizing the algorithm itself for specific measures, the paper assumes a general scheme for finding similarity measures for all pairs of vectors and investigates optimizations for scalability in a hybrid Intel Xeon/Xeon Phi system. Hybrid systems that combine multicore CPUs with many-core compute devices such as the Intel Xeon Phi allow such computations to be parallelized using vectorization but require proper load balancing and optimization techniques. The proposed implementation uses C/OpenMP with the offload mode to Xeon Phi cards. Several results are presented: execution times for various partitioning parameters such as batch sizes of the vectors being compared, the impact of dynamic batch-size adjustment, and overlapping of computation and communication. Execution times are presented both for comparison of all pairs of vectors and for the case where similarity measures must exceed a predefined threshold; the latter makes load balancing more difficult and is used as a benchmark for the proposed optimizations. Results are presented for the native mode on an Intel Xeon Phi, for CPU only, and for the CPU + offload mode on a hybrid system with 2 Intel Xeons (20 physical cores, 40 logical processors) and 2 Intel Xeon Phis (a total of 120 physical cores, 480 logical processors).
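A hedged sketch of the core pattern: one batch of rows compared against all vectors inside an offloaded OpenMP region, with a SIMD-reduced dot product. The names (VLEN, dot, batch_on_phi) and the fixed vector length are illustrative assumptions:

```c
#include <omp.h>

#define VLEN 1024                     /* vector length (assumed) */

__attribute__((target(mic)))
static float dot(const float *a, const float *b) {
    float s = 0.0f;
    #pragma omp simd reduction(+:s)   /* vectorizes on the 512-bit VPU */
    for (int k = 0; k < VLEN; k++)
        s += a[k] * b[k];
    return s;
}

/* Similarities of batch rows [r0, r1) against all n vectors; the batch
 * size r1-r0 is the tuning parameter the paper varies dynamically. */
void batch_on_phi(const float *v, float *sim, long n, long r0, long r1) {
    #pragma offload target(mic:0) in(v : length(n * VLEN)) \
                                  out(sim : length((r1 - r0) * n))
    {
        #pragma omp parallel for collapse(2) schedule(dynamic)
        for (long i = r0; i < r1; i++)
            for (long j = 0; j < n; j++)
                sim[(i - r0) * n + j] = dot(v + i * VLEN, v + j * VLEN);
    }
}
```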

6.
Adding coprocessors such as GPUs or MICs on the server side can improve performance, but traditional Web server software cannot fully exploit a heterogeneous hardware system combining multicore CPUs with MIC coprocessors. To address this, we propose a new Web server software framework for this hardware architecture. Based on the staged event-driven architecture, the framework dispatches some dynamic requests to the MIC coprocessor and processes dynamic requests in parallel on both the multicore CPU and the MIC. An adaptive scheduling algorithm balances the load between the CPU and the MIC coprocessor. Simulation experiments show that the model outperforms the traditional first-come-first-served (FCFS) Web server software model in average response time, throughput, and other metrics.
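The abstract does not detail the adaptive algorithm; as a generic illustration of such a decision, the sketch below routes each dynamic request to whichever device currently has the shorter estimated completion time. The structure, names, and heuristic are all assumptions, not the paper's method:

```c
/* Per-device state: measured service rate (requests/s) and queue depth. */
typedef struct { double svc_rate; int queued; } Device;

/* Return 1 to dispatch the request to the MIC, 0 to keep it on the CPU. */
int pick_device(const Device *cpu, const Device *mic) {
    double t_cpu = (cpu->queued + 1) / cpu->svc_rate;  /* est. completion */
    double t_mic = (mic->queued + 1) / mic->svc_rate;
    return t_mic < t_cpu;
}
```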

7.
Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU‐GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double‐precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared‐memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared‐memory systems with multiple GPUs. Copyright © 2014 John Wiley & Sons, Ltd.

8.
We investigate heterogeneous computing, which involves both multicore CPUs and manycore Xeon Phi coprocessors, as a new strategy for computational cardiology. In particular, 3D tissues of the human cardiac ventricle are studied with a physiologically realistic model that has 10,000 calcium release units per cell and 100 ryanodine receptors per release unit, together with tissue-scale simulations of the electrical activity and calcium handling. In order to attain resource-efficient use of heterogeneous computing systems that consist of both CPUs and Xeon Phis, we first direct the coding effort at ensuring good performance on the two types of compute devices individually. Although SIMD code vectorization is the main theme of performance programming, the actual implementation details differ considerably between CPU and Xeon Phi. Moreover, in addition to combined OpenMP+MPI programming, a suitable division of the cells between the CPUs and Xeon Phis is important for resource-efficient usage of an entire heterogeneous system. Numerical experiments show that good resource utilization is indeed achieved and that such a heterogeneous simulator paves the way for ultimately understanding the mechanisms of arrhythmia. The uncovered good programming practices can be used by computational scientists who want to adopt similar heterogeneous hardware platforms for a wide variety of applications.
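Two hedged fragments illustrating the abstract's themes: a flat, alignment-hinted SIMD loop as the portable vectorization core (the paper's actual update rule is far richer), and a static split of cells proportional to measured per-device throughput. All names are illustrative:

```c
#include <stddef.h>

/* Structure-of-arrays state update; the same `omp simd` loop is what both
 * the AVX CPU build and the 512-bit Xeon Phi build vectorize, even though
 * the low-level tuning differs between them. */
void update_release_units(float *restrict ca, const float *restrict flux,
                          float dt, size_t n) {
    #pragma omp simd aligned(ca, flux : 64)
    for (size_t i = 0; i < n; i++)
        ca[i] += dt * flux[i];
}

/* Divide cells between CPU and Phi in proportion to measured rates
 * (cells per second), for resource-efficient use of the whole system. */
size_t cells_for_cpu(size_t total, double cpu_rate, double phi_rate) {
    return (size_t)((double)total * cpu_rate / (cpu_rate + phi_rate));
}
```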

9.
Laser-plasma particle simulation is widely used to explore scientific problems in extreme states of matter. We port LARED-P, a three-dimensional plasma particle simulation code based on the particle-in-cell method, to the Intel Xeon Phi coprocessor. The port combines the Native and Offload programming models: first, the Native model is used to optimize the hot-spot computation in LARED-P, achieving a 4.61x speedup through SIMD extension instructions; then the Offload model is used to port the code to a CPU-Intel Xeon Phi heterogeneous system, where asynchronous data transfer and double buffering improve performance by a further 9.8% and 21.8%, respectively.
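A minimal sketch of the double-buffering pattern behind the quoted 21.8% gain, in the classic Intel compiler's offload syntax: while the coprocessor computes on one buffer, the host asynchronously ships the next chunk into the other. Kernel and producer names are placeholders:

```c
#define CHUNK 100000

/* Buffers exist on both the host and the coprocessor side. */
__attribute__((target(mic))) static float bufA[CHUNK], bufB[CHUNK];

__attribute__((target(mic)))
static void push_particles(const float *p, long n) { (void)p; (void)n; }

extern void produce(float *dst, int step);  /* host-side producer (assumed) */

void two_steps(void) {
    produce(bufA, 0);
    /* start shipping step 0 asynchronously; the tag is the buffer address */
    #pragma offload_transfer target(mic:0) in(bufA : length(CHUNK)) signal(bufA)

    produce(bufB, 1);   /* overlap: prepare and ship step 1 meanwhile */
    #pragma offload_transfer target(mic:0) in(bufB : length(CHUNK)) signal(bufB)

    /* compute on step 0 as soon as its transfer has landed */
    #pragma offload target(mic:0) wait(bufA)
    { push_particles(bufA, CHUNK); }

    #pragma offload target(mic:0) wait(bufB)
    { push_particles(bufB, CHUNK); }
}
```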

10.
In order to exploit the computing power of the many integrated cores on a heterogeneous cluster, a multi-level, multi-granularity collaborative parallel computing method is proposed for finite element structural mechanics analysis. Computing tasks are divided into three levels: inter-node parallelism, inter-device parallelism, and inter-core parallelism. By mapping decomposable computing jobs to the different hardware layers of the heterogeneous MIC system, the proposed method not only effectively resolves the load-balancing problem between CPU and MIC devices but also significantly reduces the communication overheads of the system. Large-scale parallel computing experiments on different engineering simulation cases were conducted on the Tianhe-2 supercomputer, employing up to 39,000 CPU+MIC cores with finite element models of more than 100 million elements. Test results show that the proposed method achieves good speedup and parallel efficiency in large-scale finite element structural analysis, realizing an optimized adaptation of finite element structural analysis to the heterogeneous MIC computing platform and providing a reference for the parallel porting and performance optimization of similar applications.

11.
Shared memory parallelization of the flux kernel of PETSc-FUN3D, an unstructured tetrahedral mesh Euler flow code previously studied for distributed memory and multi-core shared memory, is evaluated on up to 61 cores per node and up to 4 threads per core. We explore several thread-level optimizations to improve flux kernel performance on the state-of-the-art many integrated core (MIC) Intel processor Xeon Phi "Knights Corner," with a focus on strong thread scaling. While the linear algebraic kernel is bottlenecked by memory bandwidth for even modest numbers of cores sharing a common memory, the flux kernel, which arises in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, is compute-intensive and is known to exploit contemporary multi-core hardware effectively. We extend study of the performance of the flux kernel to the Xeon Phi in three thread affinity modes, namely scatter, compact, and balanced, in both offload and native mode, with and without various code optimizations to improve alignment and reduce cache coherency penalties. Relative to baseline "out-of-the-box" optimized compilation, code restructuring optimizations provide about 3.8x speedup using the offload mode and about 5x speedup using the native mode. Even with these gains for the flux kernel, with respect to execution time the MIC merely achieves parity with optimized compilation on a contemporary multi-core Intel CPU, the 16-core Sandy Bridge E5-2670. Nevertheless, the optimizations employed to reduce the data motion and cache coherency protocol penalties of the MIC are expected to be of value for CFD and many other unstructured applications as many-core architecture evolves. We explore large-scale distributed-shared memory performance on the Cray XC40 supercomputer, to demonstrate that optimizations employed on Phi hybridize to this context, where each of thousands of nodes comprises two sockets of Intel Xeon Haswell CPUs, 32 cores per node.

12.
The 3D gradient vector flow field (3D GVF field) is widely used in many 3D image analysis algorithms. Its computation requires many iterations and is computationally expensive, so accelerating it is of significant research interest. Targeting the Intel Xeon Phi many-core integrated architecture, we present the first accelerated optimization of 3D GVF field computation. First, we exploit the natural parallelism among the voxels of a 3D image, applying thread-level parallelism (across cores) and data-level parallelism (SIMD) to leverage the many-core architecture. Second, the 3D GVF field computation is a typical 3D 7-point stencil operation; based on the L2 cache capacity of the Xeon Phi, we propose an efficient data blocking strategy that exploits temporal and spatial locality, effectively mitigating the cache misses caused by stencil computation and improving performance. Experimental results show that the stencil optimizations significantly accelerate 3D GVF field computation: for an image of dimension 512³, our method on a 57-core Xeon Phi achieves a speedup of 2.77 over a 2.6 GHz 8-core, 16-thread Intel Xeon E5-2670 CPU.
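A hedged sketch of the two ingredients named above: a 3D 7-point stencil sweep tiled in y and z so each tile's working set fits the per-core L2, with the unit-stride x loop left whole for SIMD. Block sizes are illustrative, and the pure-diffusion update stands in for the full GVF iteration (which also carries a data term):

```c
#include <stddef.h>

#define BY 8
#define BZ 8
#define IDX(x, y, z, nx, ny) \
    ((size_t)(z) * (ny) * (nx) + (size_t)(y) * (nx) + (x))

void gvf_sweep(float *restrict dst, const float *restrict src,
               int nx, int ny, int nz, float mu) {
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int zb = 1; zb < nz - 1; zb += BZ)
        for (int yb = 1; yb < ny - 1; yb += BY)
            for (int z = zb; z < (zb + BZ < nz - 1 ? zb + BZ : nz - 1); z++)
                for (int y = yb; y < (yb + BY < ny - 1 ? yb + BY : ny - 1); y++) {
                    #pragma omp simd
                    for (int x = 1; x < nx - 1; x++) {
                        size_t i = IDX(x, y, z, nx, ny);
                        /* 7-point Laplacian step of the GVF diffusion */
                        dst[i] = src[i] + mu *
                            (src[i - 1] + src[i + 1] +
                             src[i - nx] + src[i + nx] +
                             src[i - (size_t)nx * ny] +
                             src[i + (size_t)nx * ny] - 6.0f * src[i]);
                    }
                }
}
```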

13.
The Sugon exascale prototype is one of three prototype systems in China's 13th Five-Year Plan. The system adopts a heterogeneous computing architecture, with its CPU and accelerator based on the domestic Hygon processor architecture licensed from AMD. Beyond running benchmark programs on the chips, to investigate the performance of a real application on the prototype we ported the laser-plasma application GTC-P, compared its performance on the Hygon CPU and DCU with that on an Intel 6148 CPU and an NVIDIA V100 GPU, and analyzed its scalability across multiple nodes of the prototype. This performance evaluation reflects the actual running performance of high-performance computing applications on the Sugon exascale prototype.

14.
While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack. The library code and several applications are available as open source.

15.
CPU and GPU development has hit new bottlenecks, and integrating both on a single chip has become a new trend. This new heterogeneous architecture puts pressure on the management of shared on-chip resources, among which management of the shared last-level cache (LLC) is critical to performance. The differing characteristics of CPU and GPU programs pose new challenges for managing an LLC shared between them. By analyzing the memory-access characteristics of GPU programs and drawing on previous cache-management schemes, we propose equal static partitioning and optimal static partitioning of the LLC for fused CPU-GPU systems. Experimental results show that cache partitioning effectively avoids interference between CPU and GPU programs: compared with the traditional LRU policy, equal static partitioning and optimal static partitioning improve overall system performance by 7.68% and 11.62%, respectively.
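A hedged sketch of what static way-partitioning of a shared LLC looks like in a simulator: on a miss, the victim is selected only among the ways assigned to the requesting side, so CPU and GPU lines can never evict each other. The 8/8 mask corresponds to an equal split; an optimal split would tune the masks per workload. LRU bookkeeping is elided:

```c
#include <stdint.h>

#define WAYS 16

/* Equal split: CPU (requester 0) owns ways 0-7, GPU (1) owns ways 8-15. */
static const uint16_t way_mask[2] = { 0x00FF, 0xFF00 };

/* Pick the LRU victim among the requester's own ways in one cache set. */
int pick_victim(int requester, const uint8_t lru_age[WAYS]) {
    int victim = -1, oldest = -1;
    for (int w = 0; w < WAYS; w++) {
        if (!(way_mask[requester] & (1u << w))) continue;  /* not our way */
        if (lru_age[w] > oldest) { oldest = lru_age[w]; victim = w; }
    }
    return victim;
}
```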

16.
The IMCI instruction set of the Intel Xeon Phi coprocessor introduces a hardware vgather instruction to help the 512-bit SIMD registers load data from non-contiguous memory addresses. Experimental results show, however, that vgather is likely to become one of the key performance bottlenecks of applications on the Xeon Phi. A performance model of vgather can therefore help users understand the coprocessor's performance characteristics in depth. Methodologically, instead of embedding assembly code in program segments to gather statistics, as existing work does, we use performance-analysis tools such as PAPI to collect hardware-counter readings directly as the model's experimental data. The model is built from the number of AGI events and the average vgather-induced latency estimated from the VPU_DATA_READ count, and it predicts the total latency caused by vgather in Xeon Phi application code. To validate the model's accuracy, we applied it to a 3D 7-point stencil code: the prediction shows that vgather accounts for about 40% of total computation time, which matches the computation time measured after eliminating vgather with intrinsics. Finally, comparison with vgather on other platforms suggests that the model also applies to Intel CPU platforms that provide the same gather capability.
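A hedged sketch of the measurement side: reading the two counters the model is built on (AGI events and VPU_DATA_READ) through PAPI around the kernel under study. The exact native event names depend on the PAPI/Knights Corner installation, so treat the strings below as assumptions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

extern void stencil_kernel(void);   /* the code containing the vgathers */

int main(void) {
    long long v[2];
    int es = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    PAPI_create_eventset(&es);
    PAPI_add_named_event(es, "AGI");            /* address-generation events */
    PAPI_add_named_event(es, "VPU_DATA_READ");  /* VPU loads, incl. gathers  */

    PAPI_start(es);
    stencil_kernel();
    PAPI_stop(es, v);

    /* the model estimates total vgather latency from these two counts */
    printf("AGI=%lld VPU_DATA_READ=%lld\n", v[0], v[1]);
    return 0;
}
```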

17.
The finite difference method is a numerical discretization technique for partial differential equations that is widely used in the numerical simulation of elastic wave propagation. The algorithm has large memory-access strides, high computational density, and low CPU utilization, which become performance bottlenecks in practice. Based on a detailed analysis of the 3D finite difference algorithm (3DFD), we optimize it on the Intel MIC architecture in three progressive steps. First, basic optimizations (branch elimination, loop unrolling, hoisting of loop invariants) reduce the computational intensity and clear obstacles to vectorization. Second, parallel optimizations (analyzing data dependences, loop blocking, and rewriting the core algorithm with vector instruction sets) fully exploit the MIC coprocessor's many threads and long vectors. Finally, heterogeneous cooperative optimizations (minimizing data transfer, load balancing) on the heterogeneous many-core platform (CPU + MIC, Many Integrated Cores) enable the CPU and MIC to compute in parallel. Experiments verify that, compared with the original algorithm, the optimized algorithm achieves a 50-120x speedup on the heterogeneous platform.
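A hedged sketch of the vector-rewrite step, shown with AVX-512 intrinsics (Knights Corner's IMCI uses the same multiply-add style but lacks unaligned 512-bit loads, so the neighbor loads below would need IMCI's loadunpack pair instead). Stencil order and coefficients are illustrative, and boundary columns are assumed handled separately:

```c
#include <immintrin.h>

/* dst[i] = c0*src[i] + c1*(src[i-1] + src[i+1]); nx a multiple of 16,
 * arrays 64-byte aligned. Second-order operator kept for brevity. */
void fd_row(float *restrict dst, const float *restrict src,
            int nx, float c0, float c1) {
    const __m512 vc0 = _mm512_set1_ps(c0);
    const __m512 vc1 = _mm512_set1_ps(c1);
    for (int i = 16; i < nx - 16; i += 16) {
        __m512 mid = _mm512_load_ps(src + i);       /* aligned centre load */
        __m512 lft = _mm512_loadu_ps(src + i - 1);  /* unaligned neighbors */
        __m512 rgt = _mm512_loadu_ps(src + i + 1);
        __m512 acc = _mm512_mul_ps(vc0, mid);
        acc = _mm512_fmadd_ps(vc1, _mm512_add_ps(lft, rgt), acc);
        _mm512_store_ps(dst + i, acc);
    }
}
```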

18.
The use of Graphics Processing Units (GPUs) for high‐performance computing has gained growing momentum in recent years. Unfortunately, GPU‐programming platforms like Compute Unified Device Architecture (CUDA) are complex, user unfriendly, and increase the complexity of developing high‐performance parallel applications. In addition, runtime systems that execute those applications often fail to fully utilize the parallelism of modern CPU‐GPU systems. Typically, parallel kernels run entirely on the most powerful device available, leaving other devices idle. These observations sparked research in two directions: (1) high‐level approaches to software development for GPUs, which strike a balance between performance and ease of programming; and (2) task partitioning to fully utilize the available devices. In this paper, we propose a framework, called PSkel, that provides a single high‐level abstraction for stencil programming on heterogeneous CPU‐GPU systems, while allowing the programmer to partition and assign data and computation to both CPU and GPU. Our current implementation uses parallel skeletons to transparently leverage Intel Threading Building Blocks (Intel Corporation, Santa Clara, CA, USA) and NVIDIA CUDA (Nvidia Corporation, Santa Clara, CA, USA). In our experiments, we observed that parallel applications with task partitioning can improve average performance by up to 76% and 28% compared with CPU‐only and GPU‐only parallel applications, respectively. Copyright © 2015 John Wiley & Sons, Ltd.

19.
The single-instruction multiple-data (SIMD) computing capability of modern processors is continually improved to deliver ever better performance and power efficiency. For example, Intel has increased SIMD register lengths from 128 bits in Streaming SIMD Extensions (SSE) to 512 bits in AVX-512, and the ARM Scalable Vector Extension supports SIMD register lengths up to 2048 bits and includes predicated instructions. However, SIMD instruction translation in dynamic binary translation has not received similar attention. For example, the widely used QEMU emulates guest SIMD instructions with a sequence of scalar instructions, even when the host machine has relevant SIMD instructions, leaving significant potential for performance enhancement. We propose a newly designed SIMD translation framework for dynamic binary translation, which takes advantage of the host's SIMD capabilities. The proposed framework has been built into HQEMU, an enhanced QEMU with a separate thread for applying LLVM optimizations. The current prototype supports ARMv7, ARMv8, and IA32 guests on an x86-64 AVX2 host. Compared with the scalar-translation version of HQEMU, our framework runs up to 1.84 times faster on the SPEC CFP2006 benchmarks and up to 6.81 times faster on selected real applications.

20.
A Survey of Cache Management Techniques for Multicore Processor Systems
The design and management of the cache hierarchy in multicore processors is an important problem in microprocessor design. Current mainstream commercial microprocessors adopt an architecture with a shared last-level cache (LLC), and the performance of the on-chip last-level cache usually has a large impact on overall processor performance, so shared-cache management has become a research hotspot. This paper first introduces current mainstream multicore processors and their design issues, then presents three key techniques for shared-cache management: thread scheduling, NUCA, and cache partitioning. Finally, directions for the future development of multicore cache management techniques are given.
