1.
Array operations are widely used in scientific codes. In recent years, several applications, such as geological analysis and medical image processing, have relied on array operations over three-dimensional (abbreviated to “3D”) sparse arrays. Because the computation time is large, it is necessary to compress 3D sparse arrays and use parallel computing technologies to speed up sparse array operations. Compressing sparse arrays efficiently is therefore an important task for practical applications. In this paper, two strategies, inter- and intra-task parallelization (abbreviated to “ETP” and “RTP”, respectively), are presented to compress 3D sparse arrays. Each strategy was designed and implemented on both Intel Xeon and Xeon Phi. In the experiments, the ETP strategy achieves speedup ratios of 17.5\(\times\) and 18.2\(\times\) on Intel Xeon E5-2670 v2 and Intel Xeon Phi SE10X, respectively; the RTP strategy achieves 4.5\(\times\) on both platforms.
2.
In recent years, Intel has promoted its Xeon Phi coprocessor, which is based on an x86-like architecture. It has about 60 cores and can be regarded as a single computing node with considerable computing power. This work aims to improve workload balance with a parallel loop self-scheduling scheme on a Xeon Phi-based computer cluster. The proposed concept is implemented with hybrid MPI and OpenMP parallel programming in C. Since parallel loop self-scheduling consists of static and dynamic allocation, a weighting algorithm is adopted in the static part, while well-known loop self-scheduling is adopted in the dynamic part. The loop block is partitioned according to the weights of the MIC and HOST nodes. Accordingly, the many-core Xeon Phi is used to implement parallel loop self-scheduling. Finally, we evaluate performance on four application problems: matrix multiplication, sparse matrix multiplication, Mandelbrot set and circuit meet. The experimental results indicate how to allocate the weights and which scheduling method achieves the best performance.
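As a rough illustration of the two-phase scheme described in this abstract, the following Python sketch splits a loop into a static part, divided in proportion to per-node weights, and a dynamic part handed out in shrinking chunks. The split ratio `alpha`, the integer weights, and the choice of guided self-scheduling (chunk = ceil(remaining / workers)) for the dynamic phase are assumptions made for illustration; the abstract does not say which self-scheduling variant or weighting formula the authors use, and their implementation is in C with MPI and OpenMP.

```python
from math import ceil

def static_partition(n_static, weights):
    """Split n_static iterations among nodes in proportion to integer weights."""
    total = sum(weights)
    shares = [n_static * w // total for w in weights]
    # Iterations lost to integer rounding go to the most heavily weighted node.
    shares[weights.index(max(weights))] += n_static - sum(shares)
    return shares

def guided_chunks(n_dynamic, workers):
    """Guided self-scheduling: each request receives ceil(remaining / workers)."""
    chunks, remaining = [], n_dynamic
    while remaining > 0:
        size = ceil(remaining / workers)
        chunks.append(size)
        remaining -= size
    return chunks

def schedule(n_iters, weights, alpha=0.5):
    """Return (static shares per node, dynamic chunk sizes in request order).

    alpha is a hypothetical fraction of the loop assigned statically.
    """
    n_static = int(n_iters * alpha)
    return (static_partition(n_static, weights),
            guided_chunks(n_iters - n_static, len(weights)))
```

For example, `schedule(1000, [3, 1])` gives the node with weight 3 three quarters of the static half, and the remaining 500 iterations are handed out in chunks that start at 250 and shrink as the loop drains.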
3.
Kang Ji-Hoon Hwang Jinyul Sung Hyung Jin Ryu Hoon 《The Journal of supercomputing》2021,77(9):9597-9614
Direct numerical simulations (DNS) of turbulent flows have increasing importance because they not only provide fundamental understanding of turbulent flows but also...
4.
Yang Chao-Tung Liu Jung-Chun Chan Yu-Wei Kristiani Endah Kuo Chan-Fu 《The Journal of supercomputing》2021,77(3):2486-2510
With the success of deep learning (DL) methods in diverse application domains, several deep learning software frameworks have been proposed to facilitate the usage...
5.
Fernandez Ivan Villegas Alejandro Gutierrez Eladio Plata Oscar 《The Journal of supercomputing》2019,75(11):7053-7075
Time series analysis is an important research topic of great interest in many fields. Recently, the Matrix Profile method, and particularly one of its...
6.
Paweł Czarnul 《International journal of parallel programming》2017,45(5):1091-1107
The paper deals with parallelization of computing similarity measures between large vectors. Such computations are important components within many applications and consequently are of high importance. Rather than focusing on optimization of the algorithm itself, assuming specific measures, the paper assumes a general scheme for finding similarity measures for all pairs of vectors and investigates optimizations for scalability in a hybrid Intel Xeon/Xeon Phi system. Hybrid systems including multicore CPUs and many-core compute devices such as Intel Xeon Phi allow parallelization of such computations using vectorization but require proper load balancing and optimization techniques. The proposed implementation uses C/OpenMP with the offload mode to Xeon Phi cards. Several results are presented: execution times for various partitioning parameters such as batch sizes of vectors being compared, impact of dynamic adjustment of batch size, overlapping computations and communication. Execution times for comparison of all pairs of vectors are presented as well as those for which similarity measures account for a predefined threshold. The latter makes load balancing more difficult and is used as a benchmark for the proposed optimizations. Results are presented for the native mode on an Intel Xeon Phi, CPU only and the CPU \(+\) offload mode for a hybrid system with 2 Intel Xeons with 20 physical cores and 40 logical processors and 2 Intel Xeon Phis with a total of 120 physical cores and 480 logical processors.
7.
Koki Murano Tomoyoshi Shimobaba Atsushi Sugiyama Naoki Takada Takashi Kakue Minoru Oikawa Tomoyoshi Ito 《Computer Physics Communications》2014
We report fast computation of computer-generated holograms (CGHs) using Xeon Phi coprocessors, recently released by Intel, which integrate a large number of x86-based processor cores on one chip. CGHs can generate arbitrary light wavefronts and are therefore a promising technology for many applications, for example three-dimensional displays, diffractive optical elements, and the generation of arbitrary beams. CGHs incur enormous computational cost. In this paper, we describe implementations of several CGH generation algorithms on the Xeon Phi and compare the performance and ease of programming of the Xeon Phi, a CPU, and a graphics processing unit (GPU).
8.
Alvaro Estebanez Diego R. Llanos Arturo Gonzalez-Escribano 《International journal of parallel programming》2017,45(2):225-241
Intel Xeon Phi accelerators are one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism of loops without the need of a compile-time analysis that guarantees that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a software, state-of-the-art thread-level speculative parallelization library in the execution of well-known benchmarks. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the thread-level speculation library being used as benchmark. Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units when specific vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures not competitive (in terms of absolute performance) with respect to conventional multicore systems for the execution of speculatively parallelized code.
9.
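On September 6, 2007, Intel launched worldwide the Quad-Core Intel Xeon 7300 series server processor, the industry's first quad-core processor designed specifically for multi-processor (MP) servers. The processor meets the demands for high performance, high reliability, and high scalability that server consolidation, database applications, enterprise resource planning, and business intelligence place on servers in enterprise and virtualized environments.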
10.
12.
13.
In this paper, we develop, study and implement iterative linear solvers and preconditioners using multiple graphics processing units (GPUs). Techniques for accelerating sparse matrix–vector (SpMV) multiplication, linear solvers and preconditioners are presented. Four Krylov subspace solvers, a Neumann polynomial preconditioner and a domain decomposition preconditioner are implemented. Our numerical tests with NVIDIA C2050 GPUs show that the SpMV kernel can be accelerated by more than 40 times using four GPUs. Our linear solvers and preconditioners achieve similar speedups.
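For readers unfamiliar with the SpMV kernel this abstract refers to, here is a minimal sequential sketch in Python. It assumes the compressed sparse row (CSR) layout, which the abstract does not name, and the helper names `to_csr` and `spmv_csr` are hypothetical. On a GPU the outer loop over rows would typically be distributed across threads; this version only shows the data access pattern.

```python
def to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends inside values/col_idx.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        # Only the stored nonzeros of row i are visited.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

Each row's dot product is independent of the others, which is what makes the kernel a natural target for parallel hardware.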
14.
15.
Server-based networks have well-documented performance limitations. These limitations outline a major goal of Intel's embedded transport acceleration (ETA) project, the ability to deliver high-performance server communication and I/O over standard Ethernet and transmission control protocol/Internet protocol (TCP/IP) networks. By developing this capability, Intel hopes to take advantage of the large knowledge base and ubiquity of these standard technologies. With the advent of 10 gigabit Ethernet, these standards promise to provide the bandwidth required of the most demanding server applications. We use the term packet processing engine (PPE) as a generic term for the computing and memory resources necessary for communication-centric processing. Such PPEs have certain desirable attributes; the ETA project focuses on developing PPEs with such attributes, which include scalability, extensibility, and programmability. General-purpose processors, such as the Intel Xeon in our prototype, are extensible and programmable by definition. Our results show that software partitioning can significantly increase the overall communication performance of a standard multiprocessor server. Specifically, partitioning the packet processing onto a dedicated set of compute resources allows for optimizations that are otherwise impossible when time sharing the same compute resources with the operating system and applications.
16.
Recent research on using the preconditioned conjugate gradient method as an iterative method for solving Toeplitz systems has attracted much attention. One of the main results of this methodology is that the complexity of solving a large class of Toeplitz systems can be reduced to \(O(n \log n)\) operations, compared with the \(O(n \log^2 n)\) operations required by fast direct Toeplitz solvers, provided that a suitable preconditioner is chosen under certain conditions on the Toeplitz operator. In this paper, we survey some applications of iterative Toeplitz solvers to Toeplitz-related problems arising from scientific applications. These applications include partial differential equations, queueing networks, signal and image processing, integral equations, and time series analysis.
Research supported by the Cooperative Research Centre for Advanced Computational Systems.
Research supported in part by HKRGC grants no. CUHK 316/94E.
17.
Felipe Tirado Ricardo J. Barrientos Paulo González Marco Mora 《The Journal of supercomputing》2017,73(11):5053-5070
In recent years, the use of compute-intensive coprocessors has been widely studied in the field of parallel computing to accelerate sequential processes through a Graphics Processing Unit (GPU). Intel has recently released a GPU-type coprocessor, the Intel Xeon Phi. It comprises up to 72 cores connected by a bidirectional ring network, each with a Vector Processing Unit (VPU) operating on large vector registers. In this work, we present novel parallel algorithms for the well-known Ant Colony Optimization (ACO) on the recent many-core Intel Xeon Phi coprocessor. ACO is a popular metaheuristic algorithm applied to a wide range of NP-hard problems. To show the efficiency of our approaches, we test our algorithms on the Traveling Salesman Problem. Our results confirm the potential of the proposed algorithms, which led to distinct performance improvements over previous state-of-the-art GPU approaches. We implement and compare a set of algorithms for the different steps of ACO. The matrix calculations in the proposed algorithms efficiently exploit the VPU and cache of the Xeon Phi. We also show a novel implementation of the roulette wheel selection algorithm, named UV-Roulette (unique random value roulette). We compare our results on the Xeon Phi to state-of-the-art GPU methods, achieving higher performance on large problems. We also expose the difficulties and key hardware performance factors in handling the ACO algorithm on a Xeon Phi coprocessor.
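The roulette wheel selection mentioned in this abstract can be sketched as follows. This is the standard textbook form, in which an index is drawn with probability proportional to its weight; it is not the UV-Roulette variant proposed by the authors, whose details the abstract does not give. The function name `roulette_select` and the idea of passing in a `random.Random` instance are choices made for this illustration.

```python
import random

def roulette_select(weights, rng):
    """Pick index i with probability weights[i] / sum(weights).

    weights: non-negative numbers with a positive sum (in ACO these would
             combine pheromone and heuristic values for candidate cities).
    rng:     a random.Random instance, passed in so runs are reproducible.
    """
    total = sum(weights)
    r = rng.random() * total  # a point on the wheel, in [0, total)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r < cumulative:
            return i
    # Guard against floating-point rounding in the cumulative sum.
    return len(weights) - 1
```

In an ACO tour construction step, each ant would call a function like this once per move to choose its next city among those not yet visited.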
18.
Robert Ian Mackie 《Computers & Structures》2008,86(6):511-519
An object-oriented approach is used to develop classes and frameworks for the implementation of distributed iterative equation solution. The software is implemented using the .NET framework, and builds upon previous work by the author. Development of the framework for iterative solution makes good use of interfaces to isolate sources of complexity. The framework is used for three different solution scenarios: (i) conjugate gradient iteration on a single matrix; (ii) conjugate gradient iteration when domain decomposition is used; and (iii) the Schur complement approach. Moreover, the framework is used for both local and remote objects. The .NET framework makes it very straightforward to program distributed applications, and the object-oriented approach greatly facilitates the software development. The framework was used in a finite element program and the speed-up results are shown.
19.
Progressive-iteration approximation (PIA) is a recently developed data fitting technique for blending curves and surfaces. Taking the given data points as the initial control points, PIA constructs a series of fitting curves (surfaces) by adjusting the control points iteratively, while the limit curve (surface) interpolates the data points. More importantly, progressive-iteration approximation has a local property: the limit curve (surface) can interpolate a subset of the data points by adjusting only the corresponding control points and leaving the others unchanged. However, the current PIA format requires that the number of control points equal the number of data points, which makes the PIA technique unsuitable for fitting large-scale data sets. To overcome this drawback, in this paper we develop an extended PIA (EPIA) format, which allows the number of control points to be less than the number of given data points. Moreover, since the main computations of EPIA are independent, they can be performed efficiently in parallel, with storage requirement O(n), where n is the number of control points. Therefore, due to its local property and parallel computing capability, the EPIA technique has great potential in large-scale data fitting. Specifically, using the EPIA format, we develop an incremental data fitting algorithm. In addition, some examples are demonstrated, all implemented with the parallel computing toolbox of Matlab and run on a PC with a four-core CPU.
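The classical PIA iteration described at the start of this abstract (not the extended EPIA format) can be sketched in a few lines of Python. The sketch assumes a planar Bézier curve with the Bernstein basis and uniformly spaced parameters, which is one common setting for PIA and is not taken from the abstract; the function names are hypothetical. Each step moves every control point by the gap between its data point and the current curve.

```python
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * t**i * (1 - t)**(n - i)

def curve_point(ctrl, t):
    """Evaluate the Bezier curve defined by control points ctrl at t."""
    n = len(ctrl) - 1
    x = sum(bernstein(n, i, t) * p[0] for i, p in enumerate(ctrl))
    y = sum(bernstein(n, i, t) * p[1] for i, p in enumerate(ctrl))
    return (x, y)

def pia_fit(data, iterations=200):
    """Classical PIA: one control point per data point.

    Starts with the data points as control points, then repeatedly adds to
    each control point the difference between its data point and the curve
    evaluated at that point's parameter.
    """
    n = len(data) - 1
    params = [i / n for i in range(n + 1)]  # assumed uniform parameters
    ctrl = [tuple(p) for p in data]
    for _ in range(iterations):
        # All differences use the same curve, so they are independent
        # of one another and could be computed in parallel.
        deltas = []
        for q, t in zip(data, params):
            c = curve_point(ctrl, t)
            deltas.append((q[0] - c[0], q[1] - c[1]))
        ctrl = [(p[0] + d[0], p[1] + d[1]) for p, d in zip(ctrl, deltas)]
    return ctrl, params
```

After enough iterations the curve evaluated at the chosen parameters passes through the data points to within floating-point tolerance, which is the interpolation property of the limit curve that the abstract describes.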
20.
Oded Amir Mathias Stolpe Ole Sigmund 《Structural and Multidisciplinary Optimization》2010,42(1):55-72
In the nested approach to structural optimization, most of the computational effort is invested in the solution of the analysis
equations. In this study, it is suggested to reduce this computational cost by using an approximation to the solution of the
analysis problem, generated by a Krylov subspace iterative solver. By choosing convergence criteria for the iterative solver
that are strongly related to the optimization objective and to the design sensitivities, it is possible to terminate the iterative
solution of the nested equations earlier compared to traditional convergence measures. The approximation is computationally
shown to be sufficiently accurate for the purpose of optimization though the nested equation system is not necessarily solved
accurately. The approach is tested on several large-scale topology optimization problems, including minimum compliance problems
and compliant mechanism design problems. The optimized designs are practically identical while the time spent on the analysis
is reduced significantly.