期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Using the Xeon Phi Platform to Run Speculatively-Parallelized Codes

Alvaro Estebanez Diego R. Llanos Arturo Gonzalez-Escribano 《International journal of parallel programming》2017,45(2):225-241

Intel Xeon Phi accelerators are one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism of loops without the need of a compile-time analysis that guarantees that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a software, state-of-the-art thread-level speculative parallelization library in the execution of well-known benchmarks. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the thread-level speculation library being used as benchmark. Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units when specific vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures not competitive (in terms of absolute performance) with respect to conventional multicore systems for the execution of speculatively parallelized code. 相似文献

2.

Data Parallel Algorithmic Skeletons with Accelerator Support

Steffen Ernsting Herbert Kuchen 《International journal of parallel programming》2017,45(2):283-299

Hardware accelerators such as GPUs or Intel Xeon Phi comprise hundreds or thousands of cores on a single chip and promise to deliver high performance. They are widely used to boost the performance of highly parallel applications. However, because of their diverging architectures programmers are facing diverging programming paradigms. Programmers also have to deal with low-level concepts of parallel programming that make it a cumbersome task. In order to assist programmers in developing parallel applications Algorithmic Skeletons have been proposed. They encapsulate well-defined, frequently recurring parallel programming patterns, thereby shielding programmers from low-level aspects of parallel programming. The main contribution of this paper is a comparison of two skeleton library implementations, one in C++ and one in Java, in terms of library design and programmability. Besides, on the basis of four benchmark applications we evaluate the performance of the presented implementations on two test systems, a GPU cluster and a Xeon Phi system. The two implementations achieve comparable performance with a slight advantage for the C++ implementation. Xeon Phi performance ranges between CPU and GPU performance. 相似文献

3.

Improving performance of iterative solvers with the AXC format using the Intel Xeon Phi

Coronado-Barrientos Edoardo Indalecio Guillermo García-Loureiro Antonio 《The Journal of supercomputing》2018,74(6):2823-2840

The Journal of Supercomputing - This work is focused on the application of the new AXC format in iterative algorithms on the Intel Xeon Phi coprocessor to solve linear systems by accelerating the... 相似文献

4.

面向MIC协处理器的OLAP外键连接算法

张宇张延松陈红王珊《软件学报》2017,28(3):490-501

众核架构协处理器Xeon Phi成为新兴的主流高性能计算平台.对于数据库应用而言,内存分析处理是一种计算密集型负载,其主要的性能取决于大事实表与维表之间的内存外键连接性能.本文关注于一种相对于缓存相关的分区哈希连接算法和缓存不相关的无分区哈希连接算法的缓存友好型外键连接算法,以适应Xeon Phi协处理器较小的LLC和高并发线程的特点.通过挖掘OLAP模式中的代理键特征,基于键值匹配的哈希探测操作可以进一步简化为事实表与维表之间基于主-外键参照完整性约束的代理键参照访问,因此复杂的哈希表和CPU代价较高的哈希探测操作可以简化为通过映射外键值为代理键向量内存偏移地址的方法对代理向量直接访问.基于代理向量参照访问的外键连接算法能够简单并高效地应用于Xeon Phi协处理器平台,通过更多的核心和高并发线程来掩盖内存访问延迟.实验中对传统的哈希连接算法（无分区哈希连接算法和基数分区哈希连接算法）和基于代理向量参照技术的外键连接算法在Xeon E5-2650 v3 10核处理器平台和Xeon Phi 5110P 60核协处理器平台进行性能测试和比较,实验结果给出了主流的内存外键连接算法在不同数据集和不同平台上全面的性能特征. 相似文献

5.

Characterizing and optimizing Java-based HPC applications on Intel many-core architecture

Yang Yu Tianyang Lei Haibo Chen Binyu Zang 《中国科学:信息科学(英文版)》2017,60(12):122106

The increasing demand for performance has stimulated the wide adoption of many-core accelerators like Intel^® Xeon Phi^TM Coprocessor, which is based on Intel’s Many Integrated Core architecture. While many HPC applications running in native mode have been tuned to run efficiently on Xeon Phi, it is still unclear how a managed runtime like JVM performs on such an architecture. In this paper, we present the first measurement study of a set of Java HPC applications on Xeon Phi under JVM. One key obstacle to the study is that there is currently little support of Java for Xeon Phi. This paper presents the result based on the first porting of OpenJDK platform to Xeon Phi, in which the HotSpot virtual machine acts as the kernel execution engine. The main difficulty includes the incompatibility between Xeon Phi ISA and the assembly library of Hotspot VM. By evaluating the multithreaded Java Grande benchmark suite and our ported Java Phoenix benchmarks, we quantitatively study the performance and scalability issues of JVM on Xeon Phi and draw several conclusions from the study. To fully utilize the vector computing capability and hide the significant memory access latency on the coprocessor, we present a semi-automatic vectorization scheme and software prefetching model in HotSpot. Together with 60 physical cores and tuning, our optimized JVM achieves averagely 2.7x and 3.5x speedup compared to Xeon CPU processor by using vectorization and prefetching accordingly. Our study also indicates that it is viable and potentially performance-beneficial to run applications written for such a managed runtime like JVM on Xeon Phi. 相似文献

6.

Improvement of workload balancing using parallel loop self-scheduling on Intel Xeon Phi

Chao-Tung Yang Chao-Wei Huang Shuo-Tsung Chen 《The Journal of supercomputing》2017,73(11):4981-5005

In recent years, Intel promotes its new product Xeon Phi coprocessor, which is similar to the x86 architecture coprocessor. It has about 60 cores and can be regarded as a single computing node, with the computing power that cannot be ignored. This work aims to improve the workload balance by parallel loop self-scheduling scheme performed on Xeon Phi-based computer cluster. The proposed concept is implemented by hybrid MPI and OpenMP parallel programming in C language. Since parallel loop self-scheduling composes of static and dynamic allocation, weighting algorithm is adopted in the static part, while the well-known loop self-scheduling is adopted in dynamic part. The loop block is partitioned according to the weighting of MIC and HOST nodes. Accordingly, Xeon Phi with many-core is adopted to implement parallel loop self-scheduling. Finally, we test the performance in the experiments by four applicable problems: matrix multiplication, sparse matrix multiplication, Mandelbrot set and circuit meet. The experimental results indicate how to do the weight allocation and which scheduling method can achieve the best performance. 相似文献

7.

Basker: Parallel sparse LU factorization utilizing hierarchical parallelism and data layouts

《Parallel Computing》2017

Transient simulation in circuit simulation tools, such as SPICE and Xyce, depend on scalable and robust sparse LU factorizations for efficient numerical simulation of circuits and power grids. As the need for simulations of very large circuits grow, the prevalence of multicore architectures enable us to use shared memory parallel algorithms for such simulations. A parallel factorization is a critical component of such shared memory parallel simulations. We develop a parallel sparse factorization algorithm that can solve problems from circuit simulations efficiently, and map well to architectural features. This new factorization algorithm exposes hierarchical parallelism to accommodate irregular structure that arise in our target problems. It also uses a hierarchical two-dimensional data layout which reduces synchronization costs and maps to memory hierarchy found in multicore processors. We present an OpenMP based implementation of the parallel algorithm in a new multithreaded solver called Basker in the Trilinos framework. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulation. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to state-of-the-art solver KLU. Basker outperforms Intel MKL Pardiso solver (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation. 相似文献

8.

Fast computation of computer-generated hologram using Xeon Phi coprocessor

Koki Murano Tomoyoshi Shimobaba Atsushi Sugiyama Naoki Takada Takashi Kakue Minoru Oikawa Tomoyoshi Ito 《Computer Physics Communications》2014

We report fast computation of computer-generated holograms (CGHs) using Xeon Phi coprocessors, which have massively x86-based processors on one chip, recently released by Intel. CGHs can generate arbitrary light wavefronts, and therefore, are promising technology for many applications: for example, three-dimensional displays, diffractive optical elements, and the generation of arbitrary beams. CGHs incur enormous computational cost. In this paper, we describe the implementations of several CGH generating algorithms on the Xeon Phi, and the comparisons in terms of the performance and the ease of programming between the Xeon Phi, a CPU and graphics processing unit (GPU). 相似文献

9.

Explicit Fourth-Order Runge–Kutta Method on Intel Xeon Phi Coprocessor

Beata Bylina Joanna Potiopa 《International journal of parallel programming》2017,45(5):1073-1090

This paper concerns an Intel Xeon Phi implementation of the explicit fourth-order Runge–Kutta method (RK4) for very sparse matrices with very short rows. Such matrices arise during Markovian modeling of computer and telecommunication networks. In this work an implementation based on Intel Math Kernel Library (Intel MKL) routines and the authors’ own implementation, both using the CSR storage scheme and working on Intel Xeon Phi, were investigated. The implementation based on the Intel MKL library uses the high-performance BLAS and Sparse BLAS routines. In our application we focus on OpenMP style programming. We implement SpMV operation and vector addition using the basic optimizing techniques and the vectorization. We evaluate our approach in native and offload modes for various number of cores and thread allocation affinities. Both implementations (based on Intel MKL and made by the authors) were compared in respect of the time, the speedup and the performance. The numerical experiments on Intel Xeon Phi show that the performance of authors’ implementation is very promising and gives a gain of up to two times compared to the multithreaded implementation (based on Intel MKL) running on CPU (Intel Xeon processor) and even three times in comparison with the application which uses Intel MKL on Intel Xeon Phi. 相似文献

10.

Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices

《Parallel Computing》2019

Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen. 相似文献

11.

Particle-in-Cell algorithms for emerging computer architectures

Viktor K. Decyk Tajendra V. Singh 《Computer Physics Communications》2014

We have designed Particle-in-Cell algorithms for emerging architectures. These algorithms share a common approach, using fine-grained tiles, but different implementations depending on the architecture. On the GPU, there were two different implementations, one with atomic operations and one with no data collisions, using CUDA C and Fortran. Speedups up to about 50 compared to a single core of the Intel i7 processor have been achieved. There was also an implementation for traditional multi-core processors using OpenMP which achieved high parallel efficiency. We believe that this approach should work for other emerging designs such as Intel Phi coprocessor from the Intel MIC architecture. 相似文献

12.

A thread-level parallelization of pairwise additive potential and force calculations suitable for current many-core architectures

Yoshimichi Andoh Soichiro Suzuki Satoshi Ohshima Tatsuya Sakashita Masao Ogino Takahiro Katagiri Noriyuki Yoshii Susumu Okazaki 《The Journal of supercomputing》2018,74(6):2449-2469

In molecular dynamics (MD) simulations, calculations of potentials and their derivatives by coordinate, i.e., forces, in a pairwise additive manner such as the Lennard–Jones interactions and a short-range part of the Coulombic interactions form the main part of arithmetic operations. It is essential to achieve high thread-level parallelization efficiency of these pairwise additive calculations of potentials and forces to use current supercomputers with many-core architectures effectively. In this paper, we propose four new thread-level parallelization algorithms for the pairwise additive potential and force calculations. We implement the four codes in a MD calculation code based on the fast multipole method. Performance benchmarks were taken on the FX100 supercomputer and Intel Xeon Phi coprocessor. The code succeeds in achieving high thread-level parallelization efficiency with 32 threads on the FX100 and up to 60 threads on the Xeon Phi. 相似文献

13.

基于向量引用Platform-Oblivious内存连接优化技术

张延松张宇王珊《软件学报》2018,29(3):883-895

以MapD为代表的图分析数据库系统通过GPU、Phi等新型众核处理器来支持高性能分析处理,在面向复杂数据模式时连接操作仍然是重要的性能瓶颈.近年来,异构处理器逐渐成为高性能计算的主流平台,内存连接性能的研究从多核CPU平台扩展到新兴的众核处理器,但众多的研究成果并未系统地揭示连接算法性能、连接数据集大小、硬件架构之间的内在联系,难以为未来异构处理器平台的数据库提供连接平台优化选择策略.本文以面向多核CPU、Xeon Phi、GPU处理器平台的内存连接优化技术为目标,通过优化内存哈希表设计,实现以向量映射替代哈希映射操作,消除哈希代价对内存连接算法的影响,从而更加准确地测量内存连接算法在多核CPU的cache大小、Xeon Phi的cache大小、Xeon Phi的并发多线程、GPU的SIMT（单指令多线程）机制等硬件相关因素影响下的性能特征.实验结果表明,缓存与并发多线程机制是提高内存连接算法性能的重要影响因素.缓存机制对于满足cache大小的连接操作具有性能优势,而GPU的并发多线程机制则在较大表的连接操作中具有较高的性能,Xeon Phi则在满足其L2 cache大小的连接操作中具有最高性能.实验结果揭示了内存连接操作性能与异构处理器硬件特性的联系,为未来异构处理器平台内存数据库查询优化器提供了优化策略. 相似文献

14.

Compressing three-dimensional sparse arrays using inter- and intra-task parallelization strategies on Intel Xeon and Xeon Phi

Chun-Yuan Lin Huang Ting Yen Che-Lun Hung 《The Journal of supercomputing》2017,73(8):3391-3410

Array operations are useful in a lot of scientific codes. In recent years, several applications, such as the geological analysis and the medical images processing, are processed using array operations for three-dimensional (abbreviate to “3D”) sparse arrays. Due to the huge computation time, it is necessary to compress 3D sparse arrays and use parallel computing technologies to speed up sparse array operations. How to compress the sparse arrays efficiently is an important task for practical applications. Hence, in this paper, two strategies, inter- and intra-task parallelization (abbreviate to “ETP” and “RTP”), are presented to compress 3D sparse arrays, respectively. Each strategy was designed and implemented on Intel Xeon and Xeon Phi, respectively. From experimental results, the ETP strategy achieves 17.5\(\times \) and 18.2\(\times \) speedup ratios based on Intel Xeon E5-2670 v2 and Intel Xeon Phi SE10X, respectively; 4.5\(\times \) and 4.5\(\times \) speedup ratios for the RTP strategy based on these two environments, respectively. 相似文献

15.

阵列众核处理器上的高效归并排序算法

石嵩李宏亮朱巍《计算机研究与发展》2016,53(2):362-373

排序是计算机科学中最基本的问题之一,随着众核处理器结构的不断发展,设计众核结构上的高效排序算法具有重要意义.众核处理器的一个重要方向是阵列众核处理器,根据阵列众核处理器的结构特点,提出了2种面向阵列众核结构的高效归并排序算法,通过利用DMA(direct memory access)多缓冲机制提高访存效率、深度平衡归并策略保持众多核心之间的负载均衡、SIMD(single instruction multiple data)归并方法提高归并计算效率以及片上交换归并策略提高片上数据重用率,大幅度提高了阵列众核处理器的排序性能.在异构融合阵列众核处理器DFMC(deeply-fused many-core)原型系统的实验结果表明,算法排序速度达647 MKeys/s(million keys per second),其排序效率(排序速度/峰值性能)是NVIDIA GPU上最快的归并排序算法(GTX580平台)的3.3倍,是Intel Xeon Phi上最快的归并排序算法的2.7倍.最后,建立了阵列众核处理器上归并排序算法的性能分析模型,利用该模型分析了主要结构参数与算法性能的关系,对阵列众核处理器的研究有一定的指导意义. 相似文献

16.

Parallel Ant Colony Optimization on Graphics Processing Units

Audrey Delévacq Pierre Delisle Marc Gravel Michaël Krajecki 《Journal of Parallel and Distributed Computing》2013

The purpose of this paper is to propose effective parallelization strategies for the Ant Colony Optimization (ACO) metaheuristic on Graphics Processing Units (GPUs). The Max–Min Ant System (MMAS) algorithm augmented with 3-opt local search is used as a framework for the implementation of the parallel ants and multiple ant colonies general parallelization approaches. The four resulting GPU algorithms are extensively evaluated and compared on both speedup and solution quality on a state-of-the-art Fermi GPU architecture. A rigorous effort is made to keep parallel algorithms true to the original MMAS applied to the Traveling Salesman Problem. We report speedups of up to 23.60 with solution quality similar to the original sequential implementation. With the intent of providing a parallelization framework for ACO on GPUs, a comparative experimental study highlights the performance impact of ACO parameters, GPU technical configuration, memory structures and parallelization granularity. 相似文献

17.

Xeon Phi平台上基于模板优化的3D-GVF场计算加速

齐金李宽杨灿群杜云飞《计算机工程与科学》2014,36(8):1435-1440

3D梯度向量流场（3D GVF field）广泛应用于多种3D图像分析算法中,其计算需要多次迭代,计算量大,如何提高其计算速度具有重要的研究意义。面向Intel Xeon Phi众核集成架构,首次进行了3D GVF场计算的加速优化。首先,挖掘3D图像像素点间存在的天然并行性,发挥众核架构优势,尝试线程级并行（多核）和数据级并行（SIMD）。其次,3D GVF场的计算过程是一种典型的3D 7点模板运算,结合Xeon Phi架构的L2 缓存规格,提出一种高效的数据分块策略,充分挖掘数据的时/空局部性,有效缓解模板计算引起的缓存缺失,提升了计算性能。实验结果表明,引入模板优化技术能显著提升3D GVF场的计算速度,在图像维度为5123时,所提方法在57核Xeon Phi平台上的性能相比在2.6GHz 8核16线程的Intel Xeon E5 2670 CPU上的性能,加速比可达2.77。相似文献

18.

Load balancing in a heterogeneous world: CPU-Xeon Phi co-execution of data-parallel kernels

Nozal Raúl Perez Borja Bosque Jose Luis Beivide Ramón 《The Journal of supercomputing》2019,75(3):1123-1136

Heterogeneous systems composed by a CPU and a set of different hardware accelerators are very compelling thanks to their excellent performance and energy consumption features. One of the most important problems of those systems is the workload distribution among their devices. This paper describes an extension of the Maat library to allow the co-execution of a data-parallel OpenCL kernel on a heterogeneous system composed by a CPU and an Intel Xeon Phi. Maat provides an abstract view of the heterogeneous system as well as set of load balancing algorithms to squeeze the performance out of the node. It automatically performs the data partition and distribution among the devices, generates the kernels and efficiently merges the partial outputs together. Experimental results show that this approach always outperforms the baseline with only a Xeon Phi, giving excellent performance and energy efficiency. Furthermore, it is essential to select the right load balancing algorithm because it has a huge impact in the system performance and energy consumption.

相似文献

19.

Benchmarking Performance of a Hybrid Intel Xeon/Xeon Phi System for Parallel Computation of Similarity Measures Between Large Vectors

Paweł Czarnul 《International journal of parallel programming》2017,45(5):1091-1107

The paper deals with parallelization of computing similarity measures between large vectors. Such computations are important components within many applications and consequently are of high importance. Rather than focusing on optimization of the algorithm itself, assuming specific measures, the paper assumes a general scheme for finding similarity measures for all pairs of vectors and investigates optimizations for scalability in a hybrid Intel Xeon/Xeon Phi system. Hybrid systems including multicore CPUs and many-core compute devices such as Intel Xeon Phi allow parallelization of such computations using vectorization but require proper load balancing and optimization techniques. The proposed implementation uses C/OpenMP with the offload mode to Xeon Phi cards. Several results are presented: execution times for various partitioning parameters such as batch sizes of vectors being compared, impact of dynamic adjustment of batch size, overlapping computations and communication. Execution times for comparison of all pairs of vectors are presented as well as those for which similarity measures account for a predefined threshold. The latter makes load balancing more difficult and is used as a benchmark for the proposed optimizations. Results are presented for the native mode on an Intel Xeon Phi, CPU only and the CPU \(+\) offload mode for a hybrid system with 2 Intel Xeons with 20 physical cores and 40 logical processors and 2 Intel Xeon Phis with a total of 120 physical cores and 480 logical processors. 相似文献

20.

基于Intel MIC众核架构的视频字幕提取算法并行加速

下载免费PDF全文

高飞刘轶《计算机工程与科学》2015,37(4):634-640

视频字幕检索是视频检索领域的重要部分。随着OCR技术的不断完善,视频字幕检索算法也取得了很多重大突破,然而在检索效果提升的同时,视频包含的大量图像、文字信息使数据处理成为制约字幕提取的性能瓶颈。众核架构高性能协处理器近年发展迅猛,为高性能计算研究打下了良好的硬件基础。将Intel众核MIC应用到视频字幕提取中,选用OpenMP并行语言进行加速。通过在Intel Xeon Phi 7110P进行测试,获得了比较理想的加速比。相似文献