Search results: 17 articles (14 subscription full text, 3 free full text); search time 0 ms.
By subject: radio 1; automation technology 16. By year: 2020 (1), 2019 (4), 2018 (1), 2017 (1), 2016 (1), 2015 (2), 2014 (6), 2012 (1).
1.
We report fast computation of computer-generated holograms (CGHs) using Xeon Phi coprocessors, recently released by Intel, which integrate many x86-based cores on a single chip. CGHs can generate arbitrary light wavefronts and are therefore a promising technology for many applications, for example three-dimensional displays, diffractive optical elements, and the generation of arbitrary beams; however, they incur enormous computational cost. In this paper, we describe implementations of several CGH generation algorithms on the Xeon Phi and compare the Xeon Phi, a CPU, and a graphics processing unit (GPU) in terms of performance and ease of programming.
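To make the computational cost concrete, the sketch below shows the point-source CGH summation that typically dominates such computations, parallelized with OpenMP so that hologram pixels map onto the coprocessor's many cores and the inner loop onto its SIMD lanes. This is an illustrative baseline, not the authors' code; the names (ObjPoint, cgh_kernel, wavenumber) are assumptions.

```c
#include <math.h>

typedef struct { float x, y, z, amp; } ObjPoint;

/* Accumulate the real-valued interference pattern of n_points point sources
 * on a width x height hologram with the given pixel pitch. */
void cgh_kernel(float *hologram, int width, int height, float pitch,
                const ObjPoint *obj, int n_points, float wavenumber)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int iy = 0; iy < height; ++iy) {
        for (int ix = 0; ix < width; ++ix) {
            float px = ix * pitch, py = iy * pitch;
            float acc = 0.0f;
            /* Inner loop over object points is the SIMD-friendly hot spot. */
            #pragma omp simd reduction(+:acc)
            for (int j = 0; j < n_points; ++j) {
                float dx = px - obj[j].x;
                float dy = py - obj[j].y;
                float r  = sqrtf(dx * dx + dy * dy + obj[j].z * obj[j].z);
                acc += obj[j].amp * cosf(wavenumber * r);
            }
            hologram[iy * width + ix] = acc;
        }
    }
}
```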
2.
Equipped with 512-bit wide SIMD instructions and large numbers of computing cores, the emerging x86-based Intel Many Integrated Core (MIC) architecture provides not only high floating-point performanc...
3.
Laser plasma particle simulation is widely used to explore scientific problems involving matter under extreme conditions. We ported LARED P, a three-dimensional plasma particle simulation code based on the particle-in-cell method, to the Intel Xeon Phi coprocessor. The port combined the Native and Offload programming modes: first, the hotspot computation of LARED P was optimized in Native mode, where SIMD extension instructions yielded a 4.61x speedup of that computation; then, the program was ported to a CPU-Intel Xeon Phi heterogeneous system in Offload mode, where asynchronous data transfer and double buffering improved program performance by 9.8% and 21.8%, respectively.
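A minimal sketch of the double-buffering idea described above, written with OpenMP 4.5 target directives rather than the Intel Offload pragmas used in the original port: the upload of particle block i+1 to the coprocessor overlaps with computation on block i. BLOCK, buf_a/buf_b, and process_block() are hypothetical names, not taken from LARED P.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 1000000                 /* particles per staging block (assumed) */

static float buf_a[BLOCK], buf_b[BLOCK];

#pragma omp declare target
void process_block(float *p, int n);  /* device-side particle push (assumed) */
#pragma omp end declare target

void push_all(const float *particles, int n_blocks)
{
    if (n_blocks <= 0) return;

    /* Allocate both staging buffers on the coprocessor once. */
    #pragma omp target enter data map(alloc: buf_a, buf_b)

    memcpy(buf_a, particles, sizeof buf_a);
    #pragma omp target update to(buf_a) nowait depend(out: buf_a)

    for (int i = 0; i < n_blocks; ++i) {
        /* Stage and asynchronously upload the next block before computing. */
        if (i + 1 < n_blocks) {
            float *nxt = (i % 2 == 0) ? buf_b : buf_a;
            memcpy(nxt, particles + (size_t)(i + 1) * BLOCK, BLOCK * sizeof *nxt);
            if (i % 2 == 0) {
                #pragma omp target update to(buf_b) nowait depend(out: buf_b)
            } else {
                #pragma omp target update to(buf_a) nowait depend(out: buf_a)
            }
        }

        /* Compute on the block whose upload has already completed. */
        if (i % 2 == 0) {
            #pragma omp target depend(in: buf_a)
            process_block(buf_a, BLOCK);
        } else {
            #pragma omp target depend(in: buf_b)
            process_block(buf_b, BLOCK);
        }
    }
    #pragma omp target exit data map(delete: buf_a, buf_b)
}
```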
4.
Communication and computation overlapping techniques have been introduced in the five-dimensional gyrokinetic codes GYSELA and GKV. In order to anticipate some of the exascale requirements, these codes were ported to modern accelerators, the Xeon Phi KNL and the Tesla P100 GPU. On the accelerators, the serial versions of GYSELA on KNL and of GKV on GPU are respectively 1.3× and 7.4× faster than on a single Skylake processor (a single socket). For scalability, we measured GYSELA performance on Xeon Phi KNL from 16 to 512 KNLs (1024 to 32k cores) and GKV performance on Tesla P100 GPUs from 32 to 256 GPUs. In the parallel versions, the transpose communication in the semi-Lagrangian solver of GYSELA and the convolution kernel of GKV turned out to be the main bottlenecks. This indicates that at exascale, network constraints will be critical. In order to mitigate the communication costs, pipeline- and task-based overlapping techniques have been implemented in these codes. The GYSELA 2D advection solver achieved a 33% to 92% speedup, and the GKV 2D convolution kernel achieved a factor-of-2 speedup with pipelining. The task-based approach gives an 11% to 82% performance gain in the derivative computation of the electrostatic potential in GYSELA. We have shown that the pipeline-based approach is applicable in the presence of symmetry, while the task-based approach is applicable to more general situations.
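As an illustration of the pipeline-based overlap (not taken from GYSELA or GKV), the sketch below exchanges transposed data in chunks with non-blocking MPI so that computation on chunk i proceeds while chunk i+1 is still in flight; compute_chunk() and the chunking scheme are assumptions.

```c
#include <stddef.h>
#include <mpi.h>

void compute_chunk(double *data, int n);   /* stands in for the advection/convolution work */

/* Exchange n_chunks blocks of `chunk` doubles with `peer`, overlapping the
 * communication of chunk i+1 with the computation on chunk i. */
void pipelined_exchange(const double *sendbuf, double *recvbuf,
                        int chunk, int n_chunks, int peer, MPI_Comm comm)
{
    MPI_Request reqs[2];

    MPI_Irecv(recvbuf, chunk, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, chunk, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

    for (int i = 0; i < n_chunks; ++i) {
        /* Wait for chunk i, then immediately post chunk i+1. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        if (i + 1 < n_chunks) {
            MPI_Irecv(recvbuf + (size_t)(i + 1) * chunk, chunk, MPI_DOUBLE,
                      peer, 0, comm, &reqs[0]);
            MPI_Isend(sendbuf + (size_t)(i + 1) * chunk, chunk, MPI_DOUBLE,
                      peer, 0, comm, &reqs[1]);
        }
        /* Overlap: chunk i has arrived, so it can be processed now. */
        compute_chunk(recvbuf + (size_t)i * chunk, chunk);
    }
}
```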
5.
While GPUs have seen a steady increase in usage, Xeon Phis have struggled to prove their value and were eventually discontinued. Is this a matter of the Intel many-core architecture's younger age, or are there reasons rooted in specific architectural features? This paper reviews quantitative information addressing these questions. Using the two latest coprocessors, we evaluate performance and programming productivity across a range of microbenchmarks and applications. We consider productivity as the percentage of hand-optimized performance reached by a simple high-level parallel programming model that is translated onto the specific architectures by advanced compilers. We evaluate and compare the performance of the two coprocessors and point out where Xeon Phis fall short. We also briefly review the performance of the different execution modes of the Xeon Phi. Moreover, contrary to common expectation, we found that Xeon Phis' productivity is marginally better than that of GPUs. The current results suggest that the performance advantage of GPUs outweighs the productivity benefit of Xeon Phis. Closing the performance gap and increasing the productivity benefit that a more regular many-core paradigm can offer will be essential in designing a next-generation architecture.
6.
We present a high-performance implementation of the lattice Boltzmann method (LBM) on the Knights Landing generation of Xeon Phi. The Knights Landing architecture includes 16 GB of high-speed memory (MCDRAM) with a reported bandwidth of over 400 GB/s, and a subset of the AVX-512 single instruction multiple data (SIMD) instruction set. We explain five critical implementation aspects for high performance on this architecture: (1) the choice of an appropriate LBM algorithm, (2) a suitable data layout, (3) vectorization of the computation, (4) data prefetching, and (5) running our LBM simulations exclusively from the MCDRAM. The effects of these implementation aspects on computational performance are demonstrated with a lattice Boltzmann scheme involving the D3Q19 discrete velocity set and the TRT collision operator. In our benchmark simulations of fluid flow through porous media, using double-precision floating-point arithmetic, the observed performance exceeds 960 million fluid lattice site updates per second.
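Two of the listed aspects, MCDRAM placement and vectorization, can be sketched as follows (an illustration under assumed names, not the authors' code): the lattice is allocated from high-bandwidth memory through the memkind library's hbw_posix_memalign, and a structure-of-arrays layout keeps the per-velocity data contiguous so a sweep can be expressed as an OpenMP SIMD loop.

```c
#include <hbwmalloc.h>   /* memkind interface to MCDRAM (link with -lmemkind) */
#include <stdlib.h>
#include <stddef.h>

enum { Q = 19 };          /* D3Q19 discrete velocity set */

/* Allocate Q contiguous planes of n_sites doubles, preferring MCDRAM. */
double *alloc_lattice(size_t n_sites)
{
    void *f = NULL;
    if (hbw_posix_memalign(&f, 64, n_sites * Q * sizeof(double)) != 0 &&
        posix_memalign(&f, 64, n_sites * Q * sizeof(double)) != 0)
        return NULL;      /* neither MCDRAM nor DDR allocation succeeded */
    return (double *)f;
}

/* Placeholder relaxation sweep: with the SoA layout, each velocity plane is a
 * unit-stride array that the compiler can vectorize with AVX-512. */
void relax(double *f, size_t n_sites, double omega)
{
    for (int q = 0; q < Q; ++q) {
        double *fq = f + (size_t)q * n_sites;
        #pragma omp parallel for simd schedule(static)
        for (size_t i = 0; i < n_sites; ++i)
            fq[i] -= omega * fq[i];   /* stand-in for the TRT collision */
    }
}
```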
7.
孙杰 (Sun Jie). 《电子技术》 (Electronic Technology), 2012, 39(6): 48-49, 44.
Blade servers represent the current trend in server development and offer many advantages. This article analyzes the composition of a blade server based on the Intel Xeon 5410 processor, focusing on key aspects such as its hardware composition, compute-blade architecture, hardware interconnection network, software monitoring and management, and database management.
8.
Emerging new architectures used in High Performance Computing require new research to adapt and optimise algorithms for them. As part of this effort, we propose the new AXC format to improve the performance of the SpMV product on the Intel Xeon Phi coprocessor. The performance of the OpenCL kernel based on our new format is compared with three very different and highly efficient sparse matrix formats, i.e., CSR, ELLR-T, and K1. We perform tests with most of the matrices from the Williams collection, which has been used to test SpMV kernels for GPU architectures in several related works. The numerical results show that the AXC format is more robust to the spatial indirections characteristic of sparse matrices and favors matrices with low variability among their row populations, much like the matrices generated by FEM codes. The Conjugate Gradient (CG) method is implemented in OpenCL using all of the formats in this work to expose the strengths and weaknesses of the formats in a real application. The CG implementation shows that the AXC format has the fastest conversion time, consistent with the numerical results of the SpMV tests, and that the format has slower memory operation times due to an extra step it requires and its larger memory footprint.
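For context, the sketch below is a plain CSR SpMV kernel (the baseline format mentioned above), written here in C with OpenMP rather than OpenCL; the AXC layout itself is specific to the paper and is not reproduced. The row-dependent inner-loop length is the variability among rows that the text says AXC is sensitive to.

```c
/* Baseline CSR sparse matrix-vector product: y = A * x. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        /* The trip count of this loop varies per row, which hurts SIMD
         * efficiency and load balance on wide architectures. */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```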
9.
Dynamic programming techniques are well established and employed by various practical algorithms, including the edit-distance algorithm and the dynamic time warping algorithm. These algorithms usually operate in an iteration-based manner where new values are computed from the values of the previous iteration. The data dependencies enforce synchronization, which limits the possibilities for internal parallel processing. In this paper, we investigate parallel approaches to processing matrix-based dynamic programming algorithms on modern multicore CPUs, Intel Xeon Phi accelerators, and general-purpose GPUs. We address both the problem of computing a single distance on large inputs and the problem of computing a number of distances of smaller inputs simultaneously (e.g., when a similarity query is being resolved). Our proposed solutions yielded significant improvements in performance and achieved speedups of two orders of magnitude when compared to the serial baseline.
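As an illustration of how the iteration-based dependencies can still be parallelized (a generic sketch, not the authors' implementation), the edit-distance recurrence below is swept by anti-diagonals: every cell on one diagonal depends only on the two previous diagonals, so all cells of a diagonal can be computed in parallel between synchronization points.

```c
#include <stdlib.h>
#include <string.h>

static int min3(int a, int b, int c) { int m = a < b ? a : b; return m < c ? m : c; }

/* Levenshtein distance with anti-diagonal ("wavefront") parallelism. */
int edit_distance(const char *a, const char *b)
{
    int n = (int)strlen(a), m = (int)strlen(b);
    int *d = malloc((size_t)(n + 1) * (m + 1) * sizeof *d);
    if (!d) return -1;
    #define D(i, j) d[(size_t)(i) * (m + 1) + (j)]

    for (int i = 0; i <= n; ++i) D(i, 0) = i;
    for (int j = 0; j <= m; ++j) D(0, j) = j;

    for (int diag = 2; diag <= n + m; ++diag) {
        int lo = diag - m < 1 ? 1 : diag - m;   /* valid row range on this diagonal */
        int hi = diag - 1 < n ? diag - 1 : n;
        /* All cells (i, diag - i) depend only on earlier diagonals. */
        #pragma omp parallel for schedule(static)
        for (int i = lo; i <= hi; ++i) {
            int j = diag - i;
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            D(i, j) = min3(D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1) + cost);
        }
    }
    int result = D(n, m);
    #undef D
    free(d);
    return result;
}
```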
10.
Wave-equation prestack depth migration is a high-accuracy imaging method suited to media with strong lateral velocity variation, but its enormous computational cost has hindered the application of this technique. The Xeon Phi is a new high-performance computing device that provides new technical support for the wider application of wave-equation prestack depth migration. Taking the split-step Fourier operator as an example, we describe how the migration algorithm is ported to and optimized for the Xeon Phi platform: the compute kernels are loaded onto the Xeon Phi device in offload mode, multithreading is used on the Xeon Phi coprocessor, and the program structure is reorganized to make full use of the SIMD vector engine and improve vectorization efficiency. A dynamically load-balanced parallel framework is extended to form a wave-equation prestack depth migration software package based on the Xeon Phi platform and suitable for large-scale heterogeneous systems. Tests on real data show that the Xeon Phi platform greatly improves the efficiency of seismic migration processing and offers good scalability.
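The kind of loop that such SIMD tuning targets can be illustrated as follows (an assumed example, not the production migration code): the wavenumber-domain phase-shift step of a split-step Fourier extrapolator applied to one frequency slice, written so that the compiler can vectorize it with the coprocessor's wide vector engine. The variable names and the surrounding FFT stages are assumptions.

```c
#include <complex.h>
#include <math.h>

/* Apply one depth step dz of the split-step Fourier phase shift to a
 * frequency slice of the wavefield, using a reference velocity v_ref. */
void phase_shift(float complex *wavefield, const float *kx, int nkx,
                 float omega, float v_ref, float dz)
{
    float w2 = (omega / v_ref) * (omega / v_ref);

    #pragma omp parallel for simd schedule(static)
    for (int i = 0; i < nkx; ++i) {
        float kz2 = w2 - kx[i] * kx[i];
        if (kz2 > 0.0f)                               /* propagating waves */
            wavefield[i] *= cexpf(-I * sqrtf(kz2) * dz);
        else                                          /* damp evanescent energy */
            wavefield[i] *= expf(-sqrtf(-kz2) * dz);
    }
}
```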