
1.

Compressing three-dimensional sparse arrays using inter- and intra-task parallelization strategies on Intel Xeon and Xeon Phi

Chun-Yuan Lin Huang Ting Yen Che-Lun Hung 《The Journal of supercomputing》2017,73(8):3391-3410

Array operations are useful in many scientific codes. In recent years, several applications, such as geological analysis and medical image processing, have relied on array operations for three-dimensional (abbreviated as “3D”) sparse arrays. Because of the huge computation time, it is necessary to compress 3D sparse arrays and use parallel computing technologies to speed up sparse array operations. How to compress sparse arrays efficiently is an important question for practical applications. Hence, in this paper, two strategies, inter- and intra-task parallelization (abbreviated as “ETP” and “RTP”, respectively), are presented to compress 3D sparse arrays. Each strategy was designed and implemented on Intel Xeon and Xeon Phi. In the experiments, the ETP strategy achieves 17.5× and 18.2× speedup ratios on Intel Xeon E5-2670 v2 and Intel Xeon Phi SE10X, respectively; the RTP strategy achieves a 4.5× speedup ratio on both platforms.

2.

Improvement of workload balancing using parallel loop self-scheduling on Intel Xeon Phi

Chao-Tung Yang Chao-Wei Huang Shuo-Tsung Chen 《The Journal of supercomputing》2017,73(11):4981-5005

In recent years, Intel has promoted its Xeon Phi coprocessor, which is based on an x86-like architecture. It has about 60 cores and can be regarded as a single computing node with considerable computing power. This work aims to improve workload balance with a parallel loop self-scheduling scheme on a Xeon Phi-based computer cluster. The proposed concept is implemented with hybrid MPI and OpenMP parallel programming in C. Since parallel loop self-scheduling consists of static and dynamic allocation, a weighting algorithm is adopted in the static part, while the well-known loop self-scheduling is adopted in the dynamic part. The loop block is partitioned according to the weighting of MIC and HOST nodes. Accordingly, the many-core Xeon Phi is used to implement parallel loop self-scheduling. Finally, we test the performance experimentally on four application problems: matrix multiplication, sparse matrix multiplication, Mandelbrot set and circuit meet. The experimental results indicate how to do the weight allocation and which scheduling method achieves the best performance.
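
The split between a weighted static part and a self-scheduled dynamic part can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `static_fraction` parameter, the example weights, and the choice of guided self-scheduling (each request takes `ceil(remaining / workers)` iterations) as the dynamic rule are all assumptions made for illustration.

```python
import math


def static_partition(n_iters, weights):
    """Split n_iters among nodes in proportion to their (hypothetical) weights.

    Integer division may leave a few iterations over; they go to the last node.
    """
    total = sum(weights)
    shares = [n_iters * w // total for w in weights]
    shares[-1] += n_iters - sum(shares)
    return shares


def guided_chunks(n_iters, n_workers):
    """Guided self-scheduling: every request takes ceil(remaining / n_workers)."""
    chunks = []
    remaining = n_iters
    while remaining > 0:
        size = math.ceil(remaining / n_workers)
        chunks.append(size)
        remaining -= size
    return chunks


def hybrid_schedule(n_iters, static_fraction, weights, n_workers):
    """Assign a fixed fraction statically by weight, the rest in shrinking chunks."""
    static_iters = int(n_iters * static_fraction)
    static_part = static_partition(static_iters, weights)
    dynamic_part = guided_chunks(n_iters - static_iters, n_workers)
    return static_part, dynamic_part
```

For example, `hybrid_schedule(1000, 0.5, [3, 1], 4)` gives the two nodes 375 and 125 iterations up front and leaves the remaining 500 to be handed out in decreasing chunks, so faster workers naturally pick up more of the tail.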

3.

High-performance simulations of turbulent boundary layer flow using Intel Xeon Phi many-core processors

Kang Ji-Hoon Hwang Jinyul Sung Hyung Jin Ryu Hoon 《The Journal of supercomputing》2021,77(9):9597-9614

Direct numerical simulations (DNS) of turbulent flows have increasing importance because they not only provide fundamental understanding of turbulent flows but also...

4.

Performance benchmarking of deep learning framework on Intel Xeon Phi

Yang Chao-Tung Liu Jung-Chun Chan Yu-Wei Kristiani Endah Kuo Chan-Fu 《The Journal of supercomputing》2021,77(3):2486-2510

With the success of deep learning (DL) methods in diverse application domains, several deep learning software frameworks have been proposed to facilitate the usage...

5.

Accelerating time series motif discovery in the Intel Xeon Phi KNL processor

Fernandez Ivan Villegas Alejandro Gutierrez Eladio Plata Oscar 《The Journal of supercomputing》2019,75(11):7053-7075

Time series analysis is an important research topic of great interest in many fields. Recently, the Matrix Profile method, and particularly one of its...

6.

Benchmarking Performance of a Hybrid Intel Xeon/Xeon Phi System for Parallel Computation of Similarity Measures Between Large Vectors

Paweł Czarnul 《International journal of parallel programming》2017,45(5):1091-1107

The paper deals with parallelization of computing similarity measures between large vectors. Such computations are important components within many applications and consequently are of high importance. Rather than focusing on optimization of the algorithm itself, assuming specific measures, the paper assumes a general scheme for finding similarity measures for all pairs of vectors and investigates optimizations for scalability in a hybrid Intel Xeon/Xeon Phi system. Hybrid systems including multicore CPUs and many-core compute devices such as Intel Xeon Phi allow parallelization of such computations using vectorization but require proper load balancing and optimization techniques. The proposed implementation uses C/OpenMP with the offload mode to Xeon Phi cards. Several results are presented: execution times for various partitioning parameters such as batch sizes of vectors being compared, impact of dynamic adjustment of batch size, overlapping computations and communication. Execution times for comparison of all pairs of vectors are presented as well as those for which similarity measures account for a predefined threshold. The latter makes load balancing more difficult and is used as a benchmark for the proposed optimizations. Results are presented for the native mode on an Intel Xeon Phi, CPU only and the CPU + offload mode for a hybrid system with 2 Intel Xeons with 20 physical cores and 40 logical processors and 2 Intel Xeon Phis with a total of 120 physical cores and 480 logical processors.

7.

Fast computation of computer-generated hologram using Xeon Phi coprocessor

Koki Murano Tomoyoshi Shimobaba Atsushi Sugiyama Naoki Takada Takashi Kakue Minoru Oikawa Tomoyoshi Ito 《Computer Physics Communications》2014

We report fast computation of computer-generated holograms (CGHs) using Xeon Phi coprocessors, recently released by Intel, which integrate a large number of x86-based processor cores on one chip. CGHs can generate arbitrary light wavefronts and are therefore a promising technology for many applications: for example, three-dimensional displays, diffractive optical elements, and the generation of arbitrary beams. CGHs incur enormous computational cost. In this paper, we describe the implementations of several CGH generating algorithms on the Xeon Phi, and compare the performance and the ease of programming among the Xeon Phi, a CPU, and a graphics processing unit (GPU).

8.

Using the Xeon Phi Platform to Run Speculatively-Parallelized Codes

Alvaro Estebanez Diego R. Llanos Arturo Gonzalez-Escribano 《International journal of parallel programming》2017,45(2):225-241

Intel Xeon Phi accelerators are one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism of loops without the need of a compile-time analysis that guarantees that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a software, state-of-the-art thread-level speculative parallelization library in the execution of well-known benchmarks. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the thread-level speculation library being used as benchmark. Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units when specific vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures not competitive (in terms of absolute performance) with respect to conventional multicore systems for the execution of speculatively parallelized code.

9.

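Intel launches quad-core multi-processor Xeon: unlimited virtualization and consolidation

He Feng (和风) 《办公自动化》(Office Automation) 2007,(10)

On September 6, 2007, Intel Corporation launched worldwide the quad-core Intel Xeon 7300 series server processor, the industry's first designed specifically for multi-processor (MP) servers. The processor meets the demands for high server performance, high reliability, and high scalability placed by applications such as server consolidation, database applications, enterprise resource planning, and business intelligence in enterprise and virtualized environments.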

10.

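Measurement and analysis of the power consumption characteristics of the Xeon Phi coprocessor

《计算机工程》(Computer Engineering) 2017,(6):313-321

Accurately measuring and analyzing the power consumption characteristics of the Xeon Phi coprocessor is a basic prerequisite for coprocessor power management and optimization, but accurately extracting and analyzing the power consumption of parallel programs running on the Xeon Phi is rather complex. To this end, purpose-built power measurement equipment is used to capture the real-time voltage and current of all 14 power supply channels, the real-time power of the coprocessor is obtained by calculation, and, on the basis of the measured data, the power characteristics of the Xeon Phi coprocessor at startup, when idle, and for its threads and memory system are analyzed. Experimental results show that the power model provides trustworthy baseline data for power optimization and can guide power optimization on Xeon Phi processors.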
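
The calculation step mentioned in the abstract, deriving power from per-channel voltage and current, amounts to summing V × I over the supply channels. The sketch below shows that arithmetic only; the function names and the sample readings are hypothetical, and the trapezoidal energy integration is an addition of this sketch that the abstract does not describe.

```python
def instantaneous_power(voltages, currents):
    """Total power in watts: sum of V_i * I_i over all supply channels."""
    if len(voltages) != len(currents):
        raise ValueError("need one current reading per voltage reading")
    return sum(v * i for v, i in zip(voltages, currents))


def energy_joules(power_samples, dt):
    """Energy over a trace of equally spaced power samples (trapezoidal rule).

    dt is the sampling interval in seconds; this step is an assumption of
    the sketch, not something taken from the abstract.
    """
    return sum((a + b) / 2 * dt for a, b in zip(power_samples, power_samples[1:]))
```

With 14 channels each reading 12.0 V and 1.0 A (made-up values), `instantaneous_power([12.0] * 14, [1.0] * 14)` returns 168.0 W.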

11.

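Intel offers customers products based on the new Intel(R) Xeon(R) processors: the Jasper Forest processor with integrated I/O is an ideal choice for communications and storage applications

《办公自动化》(Office Automation) 2010,(4)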

12.

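The platform for the all-new Intel Xeon 7300 series processors

《办公自动化》(Office Automation) 2007,(10):43-43

Reported by this journal on September 6: the all-new Intel(R) Xeon(R) multi-processor 7300 series server platform (Caneland) is built from the all-new quad-core Intel(R) Xeon 7300 series processor (Tigerton) and the Intel 7300 chipset (code-named "Clarksboro"), which features Data Traffic Optimizations.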

13.

Accelerating iterative linear solvers using multiple graphical processing units

Zhangxin Chen Bo Yang 《International Journal of Computer Mathematics》2015,92(7):1422-1438

In this paper, we develop, study and implement iterative linear solvers and preconditioners using multiple graphical processing units (GPUs). Techniques for accelerating sparse matrix–vector (SpMV) multiplication, linear solvers and preconditioners are presented. Four Krylov subspace solvers, a Neumann polynomial preconditioner and a domain decomposition preconditioner are implemented. Our numerical tests with NVIDIA C2050 GPUs show that the SpMV kernel can be sped up by more than 40 times using four GPUs. Our linear solvers and preconditioners have similar speedup.
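
For readers unfamiliar with the SpMV kernel named above, here is a minimal CPU-side sketch in plain Python. It assumes the compressed sparse row (CSR) layout and a contiguous row-block split across devices; the abstract does not state which storage format or partitioning the authors use, and all function names are illustrative.

```python
def csr_spmv_rows(indptr, indices, data, x, row_start, row_end):
    """Compute y[row_start:row_end] of y = A @ x for a CSR matrix A."""
    y = []
    for row in range(row_start, row_end):
        acc = 0.0
        # Non-zeros of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y


def csr_spmv(indptr, indices, data, x):
    """Full sparse matrix-vector product on a single device."""
    return csr_spmv_rows(indptr, indices, data, x, 0, len(indptr) - 1)


def csr_spmv_partitioned(indptr, indices, data, x, n_parts):
    """Split rows into n_parts contiguous blocks, as one might across devices.

    The blocks are independent, so each could be computed by a different
    device; here they are simply evaluated one after another.
    """
    n_rows = len(indptr) - 1
    bounds = [n_rows * p // n_parts for p in range(n_parts + 1)]
    y = []
    for p in range(n_parts):
        y.extend(csr_spmv_rows(indptr, indices, data, x, bounds[p], bounds[p + 1]))
    return y
```

Because each row's dot product touches only its own slice of `data` and `indices`, the row blocks need no communication beyond a shared copy of `x`, which is what makes this kernel a natural fit for multiple devices.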

14.

Parallelized mining of domain knowledge on GPGPU and Xeon Phi clusters

Tanvir Atahary Tarek M. Taha Scott Douglass 《The Journal of supercomputing》2016,72(6):2132-2156


15.

ETA: experience with an Intel Xeon processor as a packet processing engine

Regnier G. Minturn D. McAlpine G. Saletore V.A. Foong A. 《Micro, IEEE》2004,24(1):24-31

Server-based networks have well-documented performance limitations. These limitations outline a major goal of Intel's embedded transport acceleration (ETA) project, the ability to deliver high-performance server communication and I/O over standard Ethernet and transmission control protocol/Internet protocol (TCP/IP) networks. By developing this capability, Intel hopes to take advantage of the large knowledge base and ubiquity of these standard technologies. With the advent of 10 gigabit Ethernet, these standards promise to provide the bandwidth required of the most demanding server applications. We use the term packet processing engine (PPE) as a generic term for the computing and memory resources necessary for communication-centric processing. Such PPEs have certain desirable attributes; the ETA project focuses on developing PPEs with such attributes, which include scalability, extensibility, and programmability. General-purpose processors, such as the Intel Xeon in our prototype, are extensible and programmable by definition. Our results show that software partitioning can significantly increase the overall communication performance of a standard multiprocessor server. Specifically, partitioning the packet processing onto a dedicated set of compute resources allows for optimizations that are otherwise impossible when time sharing the same compute resources with the operating system and applications.

16.

Scientific applications of iterative Toeplitz solvers

Michael K. Ng Raymond H. Chan 《Calcolo》1996,33(3-4):249-267

Recent research on using the preconditioned conjugate gradient method as an iterative method for solving Toeplitz systems has attracted much attention. One of the main results of this methodology is that the complexity of solving a large class of Toeplitz systems can be reduced to O(n log n) operations, as compared to the O(n log² n) operations required by fast direct Toeplitz solvers, provided that a suitable preconditioner is chosen under certain conditions on the Toeplitz operator. In this paper, we survey some applications of iterative Toeplitz solvers to Toeplitz-related problems arising from scientific applications. These applications include partial differential equations, queueing networks, signal and image processing, integral equations, and time series analysis. Research supported by the Cooperative Research Centre for Advanced Computational Systems. Research supported in part by HKRGC grants no. CUHK 316/94E.

17.

Efficient exploitation of the Xeon Phi architecture for the Ant Colony Optimization (ACO) metaheuristic

Felipe Tirado Ricardo J. Barrientos Paulo González Marco Mora 《The Journal of supercomputing》2017,73(11):5053-5070

In recent years, the use of compute-intensive coprocessors has been widely studied in the field of Parallel Computing to accelerate sequential processes through a Graphic Processing Unit (GPU). Intel has recently released a GPU-type coprocessor, the Intel Xeon Phi. It comprises up to 72 cores connected by a bidirectional ring network, each with a Vector Processing Unit (VPU) operating on large vector registers. In this work, we present novel parallel algorithms for the well-known Ant Colony Optimization (ACO) on the recent many-core platform Intel Xeon Phi coprocessor. ACO is a popular metaheuristic algorithm applied to a wide range of NP-hard problems. To show the efficiency of our approaches, we test our algorithms on the Traveling Salesman Problem. Our results confirm the potential of our proposed algorithms, which led to distinct improvements in performance over previous state-of-the-art approaches on GPU. We implement and compare a set of algorithms to deal with the different steps of ACO. The matrix calculations in the proposed algorithms efficiently exploit the VPU and cache in Xeon Phi. We also show a novel implementation of the roulette wheel selection algorithm, named UV-Roulette (unique random value roulette). We compare our results on Xeon Phi to state-of-the-art GPU methods, achieving higher performance on large problems. We also expose the difficulties and key hardware performance factors in dealing with the ACO algorithm on a Xeon Phi coprocessor.
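
As background for the selection step mentioned above, the following sketch shows the textbook roulette wheel selection and the standard ACO transition weights (pheromone to the power `alpha` times inverse distance to the power `beta`). It is not the paper's UV-Roulette variant, whose details the abstract does not give; the function names and default parameter values are assumptions.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def roulette_select(weights, rng=random):
    """Pick an index with probability proportional to its non-negative weight."""
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    r = rng.random() * total
    # First index whose cumulative weight exceeds r; the min() guards
    # against floating-point rounding at the upper end.
    return min(bisect_right(cumulative, r), len(weights) - 1)


def next_city(current, unvisited, pheromone, distance, alpha=1.0, beta=2.0, rng=random):
    """Choose an ant's next city using the classic ACO transition weights."""
    candidates = list(unvisited)
    weights = [
        pheromone[current][j] ** alpha * (1.0 / distance[current][j]) ** beta
        for j in candidates
    ]
    return candidates[roulette_select(weights, rng)]
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible, which helps when comparing variants of the selection step.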

18.

Object-oriented programming of distributed iterative equation solvers

Robert Ian Mackie 《Computers & Structures》2008,86(6):511-519

An object-oriented approach is used to develop classes and frameworks for the implementation of distributed iterative equation solution. The software is implemented using the .NET framework, and builds upon previous work by the author. Development of the framework for iterative solution makes good use of interfaces to isolate sources of complexity. The framework is used for three different solution scenarios: (i) conjugate gradient iteration on a single matrix; (ii) conjugate gradient iteration when domain decomposition is used; and (iii) using the Schur complement approach. Moreover, the framework is used for both local and remote objects. The .NET framework makes it very straightforward to program distributed applications, and the object-oriented approach greatly facilitates the software development. The framework was used in a finite element program and the speed-up results are shown.
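
Scenario (i), conjugate gradient iteration on a single matrix, can be sketched in a few lines. The version below is the textbook algorithm for a symmetric positive-definite system, not the author's .NET framework; taking the matrix only through a `matvec` callable is a design choice of this sketch, loosely echoing the idea of hiding the operator behind an interface.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, given x -> A @ x."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the zero initial guess
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 <= tol:
            break
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def dense_matvec(A):
    """Wrap a dense list-of-lists matrix as a matvec callable."""
    return lambda v: [dot(row, v) for row in A]
```

Since the solver only ever calls `matvec`, the same loop could in principle be driven by a local matrix or by an operator whose product is assembled from remote subdomains.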

19.

An extended iterative format for the progressive-iteration approximation

Hongwei Lin Zhiyu Zhang 《Computers & Graphics》2011,35(5):967-975

Progressive-iteration approximation (PIA) is a new data fitting technique developed recently for blending curves and surfaces. Taking the given data points as the initial control points, PIA constructs a series of fitting curves (surfaces) by adjusting the control points iteratively, while the limit curve (surface) interpolates the data points. More importantly, progressive-iteration approximation has the local property, that is, the limit curve (surface) can interpolate a subset of data points by adjusting only the corresponding control points and leaving the others unchanged. However, the current PIA format requires that the number of control points equal that of the data points, thus making the PIA technique inappropriate for fitting large-scale data. To overcome this drawback, in this paper, we develop an extended PIA (EPIA) format, which allows the number of control points to be less than that of the given data points. Moreover, since the main computations of EPIA are independent, they can be performed in parallel efficiently, with storage requirement O(n), where n is the number of the control points. Therefore, due to its local property and parallel computing capability, the EPIA technique has great potential in large-scale data fitting. Specifically, by the EPIA format, we develop an incremental data fitting algorithm in this paper. In addition, some examples are demonstrated in this paper, all implemented with the parallel computing toolbox of Matlab, and run on a PC with a four-core CPU.
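
The basic PIA update described at the start of the abstract (start from the data points, then repeatedly add each data point's current fitting error to its control point) can be sketched as follows. This is the classic format with as many control points as data points, not the EPIA extension; the use of a Bézier curve with Bernstein basis and uniform parameters is an assumption of the sketch, chosen because it needs nothing beyond the standard library.

```python
from math import comb


def bernstein(n, j, t):
    """Bernstein basis polynomial B_j^n(t)."""
    return comb(n, j) * t ** j * (1 - t) ** (n - j)


def curve_point(control, t):
    """Evaluate the Bezier curve defined by the control points at parameter t."""
    n = len(control) - 1
    dim = len(control[0])
    return tuple(
        sum(bernstein(n, j, t) * control[j][d] for j in range(n + 1))
        for d in range(dim)
    )


def pia_fit(data, iterations=100):
    """Classic PIA: P_i <- P_i + (Q_i - C(t_i)), starting from P_i = Q_i."""
    n = len(data) - 1
    params = [i / n for i in range(n + 1)]   # uniform parameters (assumption)
    control = [tuple(q) for q in data]
    for _ in range(iterations):
        on_curve = [curve_point(control, t) for t in params]
        control = [
            tuple(c + (q - s) for c, q, s in zip(ctrl, pt, cur))
            for ctrl, pt, cur in zip(control, data, on_curve)
        ]
    return control, params


def max_error(control, data, params):
    """Largest coordinate-wise gap between the curve and the data points."""
    return max(
        abs(q - s)
        for pt, t in zip(data, params)
        for q, s in zip(pt, curve_point(control, t))
    )
```

Each control point's update depends only on its own data point and the current curve, so the per-point adjustments within one iteration are independent of each other, which is the property the abstract relies on for parallel execution.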

20.

Efficient use of iterative solvers in nested topology optimization

Oded Amir Mathias Stolpe Ole Sigmund 《Structural and Multidisciplinary Optimization》2010,42(1):55-72

In the nested approach to structural optimization, most of the computational effort is invested in the solution of the analysis equations. In this study, it is suggested to reduce this computational cost by using an approximation to the solution of the analysis problem, generated by a Krylov subspace iterative solver. By choosing convergence criteria for the iterative solver that are strongly related to the optimization objective and to the design sensitivities, it is possible to terminate the iterative solution of the nested equations earlier compared to traditional convergence measures. The approximation is computationally shown to be sufficiently accurate for the purpose of optimization though the nested equation system is not necessarily solved accurately. The approach is tested on several large-scale topology optimization problems, including minimum compliance problems and compliant mechanism design problems. The optimized designs are practically identical while the time spent on the analysis is reduced significantly.