Similar literature
Found 20 similar documents
1.
A macro package for expressing message passing functions within parallel FORTRAN programs is presented. It makes the user program fully portable among all parallel computers where the macros are implemented. The implementation on the Intel iPSC/2 hypercube is discussed in more detail. New message passing primitives have been added to the iPSC/2 operating system, offering the user broader functionality with no loss of efficiency. The full macro set, using these primitives, works with the same performance as the original Intel primitives.

2.
Karp, A.H.; Babb, R.B., II. IEEE Software, 1988, 5(5): 52-67
A simple program that approximates π by numerical quadrature is rewritten to run on nine commercially available processors to illustrate the complications that arise in parallel programming in FORTRAN. The machines used are the Alliant FX/8, BBN Butterfly, Cray X-MP/48, ELXSI 6400, Encore Multimax, Flex/32, IBM 3090/VF, Intel iPSC, and Sequent Balance. Some general impediments to using parallel processors to do production work are identified.
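
The kind of program this abstract describes can be sketched as follows. This is a minimal illustration, not the authors' code: the integrand 4 / (1 + x^2) on [0, 1], the midpoint rule, and the names `partial_sum` and `approximate_pi` are all assumptions, and the "workers" here run one after another in a single process.

```python
def partial_sum(rank: int, num_workers: int, num_intervals: int) -> float:
    """Midpoint-rule contribution of one hypothetical worker.

    Worker `rank` handles subintervals rank, rank + num_workers, ...
    (a cyclic distribution of the loop iterations).
    """
    width = 1.0 / num_intervals
    total = 0.0
    for i in range(rank, num_intervals, num_workers):
        x = (i + 0.5) * width  # midpoint of the i-th subinterval
        total += 4.0 / (1.0 + x * x)
    return total * width


def approximate_pi(num_intervals: int, num_workers: int = 4) -> float:
    """Sum the per-worker contributions (computed serially in this sketch)."""
    return sum(
        partial_sum(rank, num_workers, num_intervals)
        for rank in range(num_workers)
    )


if __name__ == "__main__":
    print(approximate_pi(1000))
```

On a real parallel machine each call to `partial_sum` would run on its own processor and the final `sum` would become a reduction; how that reduction is expressed is the machine-specific part.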

3.
In a previous work we studied the concurrent implementation of a numerical model, CONDIFP, developed for the analysis of depth-averaged convection–diffusion problems. Initial experiments were conducted on the Intel Touchstone Delta System, using up to 512 processors and different problem sizes. As for other computation-intensive applications, the results demonstrated an asymptotic trend to unity efficiency when the computational load dominates the communication load. This paper relates some other numerical experiences, in both one and two space dimensions with various choices of initial and boundary conditions, carried out on the Intel Paragon XP/S Model L38 with the aim to illustrate the parallel solver versatility and reliability.

4.
Ordering clones from a genomic library into physical maps of whole chromosomes presents a pivotal computational problem in genetics. Previous research has shown the physical mapping problem to be isomorphic to the NP-complete Optimal Linear Arrangement (OLA) problem for which no polynomial-time algorithm for determining the optimal solution is known. Serial implementations of stochastic global optimization techniques such as simulated annealing yielded very good results but proved computationally intensive. The design, analysis and implementation of coarse-grained parallel MIMD algorithms for simulated annealing on the Intel iPSC/860 hypercube are presented. Data decomposition and control decomposition strategies based on Markov chain decomposition, perturbation methods and problem-specific annealing heuristics are proposed and applied to the physical mapping problem. A suite of parallel algorithms is implemented on an 8-node Intel iPSC/860 hypercube, exploiting the nearest-neighbor communication pattern on the Boolean hypercube topology. Convergence, speedup and scalability characteristics of the various parallel algorithms are analyzed and discussed. Results indicate a deterioration of performance when a single Markov chain of solution states is distributed across multiple processing elements in the Intel iPSC/860 hypercube.

5.
Heterogeneous (or hybrid) computing platforms with Intel Xeon Phi accelerators offer potential advantages of energy efficient, massively parallel computing, while supporting parallel programming models familiar to users of multicore CPUs. However, realizing this potential for real-world applications remains a challenging issue. The main goal of this paper is the suitability assessment of offload-based programming environments for porting a real-life scientific application to hybrid platforms with Intel KNC and KNL accelerators, assuming no significant modifications of the application code. The main criterion of this assessment is the application performance. The evaluated environments include: 1) Intel Offload coupled with OpenMP, 2) OpenMP 4.0 and 3) OpenMP 4.5 Accelerator Models, and 4) hStreams Library with OpenMP. A real-life application dedicated to the numerical modeling of alloy solidification is used as a testbed in the assessment. An experimental evaluation of the four versions of the application code for a platform with KNC coprocessors shows that all of them except OpenMP 4.0 are able to adapt to an expansion of available resources, although with different efficiency. While the shortest execution time is achieved for Intel Offload, the high-level abstractions of hStreams contribute considerably to making porting and tuning the application easier, with low performance overheads in comparison to the low-level Intel Offload extension. Benchmarking the application performance and scalability on a platform with multiple KNL processors, using the Offload over Fabric technology with Intel Offload and OpenMP 4.5, concludes the assessment.

6.
A parallel implementation of the finite volume method for three-dimensional, time-dependent, thermal convective flows is presented. The algebraic equations resulting from the finite volume discretization, including a pressure equation which consumes most of the computation time, are solved by a parallel multigrid method. A flexible parallel code has been implemented on the Intel Paragon, the Cray T3D, and the IBM SP2 by using domain decomposition techniques and the MPI communication software. The code can use 1D, 2D, or 3D partitions as required by different geometries, and is easily ported to other parallel systems. Numerical solutions for air (Prandtl number Pr = 0.733) with various Rayleigh numbers up to 10^7 are discussed.

7.
An analysis is presented of the primary factors influencing the performance of a parallel implementation of the UCLA atmospheric general circulation model (AGCM) on distributed-memory, massively parallel computer systems. Several modifications to the original parallel AGCM code aimed at improving its numerical efficiency, load-balance and single-node code performance are discussed. The impact of these optimization strategies on the performance on two of the state-of-the-art parallel computers, the Intel Paragon and Cray T3D, is presented and analyzed. It is found that implementation of a load-balanced FFT algorithm results in a reduction in overall execution time of approximately 45% compared to the original convolution-based algorithm. Preliminary results of the application of a load-balancing scheme for the physics part of the AGCM code suggest that additional reductions in execution time of 10–15% can be achieved. Finally, several strategies for improving the single-node performance of the code are presented, and the results obtained thus far suggest that reductions in execution time in the range of 35–45% are possible. © 1998 John Wiley & Sons, Ltd.

8.
Many large-scale finite element problems are intractable on current generation production supercomputers. High-performance computer architectures offer effective avenues to bridge the gap between computational needs and the power of computational hardware. The biggest challenge lies in the substitution of the key algorithms in an application program with redesigned algorithms which exploit the new architectures and use better or more appropriate numerical techniques. A methodology for implementing non-linear finite element analysis on a homogeneous distributed processing network is discussed. The method can also be extended to heterogeneous networks comprised of different machine architectures provided that they have a mutual communication interface. This unique feature has greatly facilitated the port of the code to the 8-node Intel Touchstone Gamma and then the 512-node Intel Touchstone Delta. The domain is decomposed serially in a preprocessor. Separate input files are written for each subdomain. These files are read in by local copies of the program executable operating in parallel. Communication between processors is addressed utilizing asynchronous and synchronous message passing. The basic kernel of message passing is the internal force exchange which is analogous to the computed interactions between sections of physical bodies in static stress analysis. Benchmarks for the Intel Delta are presented. Performance exceeding 1 gigaflop was attained. Results for two large-scale finite element meshes are presented.

9.
We investigate the numerical solution of discrete-time algebraic Riccati equations on a parallel distributed architecture. Our solvers obtain an initial solution of the Riccati equation via the disc function method, and then refine this solution using Newton's method. The Smith iteration is employed to solve the Stein equation that arises at each step of Newton's method. The numerical experiments on an Intel Pentium-II cluster, connected via a Myrinet switch, report the performance and scalability of the new algorithms. Copyright © 2001 John Wiley & Sons, Ltd.
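
To illustrate the Smith iteration mentioned in this abstract, here is a small serial sketch. It assumes the Stein equation has the form X = A X A^T + Q with all eigenvalues of A inside the unit circle, and it uses the squared variant of the iteration; the abstract does not state which form or variant the authors use, and the helper names `matmul`, `transpose`, and `smith_iteration` are hypothetical.

```python
def matmul(a, b):
    """Product of two dense matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Transpose of a dense matrix stored as a list of rows."""
    return [list(column) for column in zip(*a)]


def smith_iteration(a, q, tol=1e-12, max_iter=50):
    """Squared Smith iteration for X = A X A^T + Q (assumed form).

    X_0 = Q, A_0 = A, then X_{k+1} = X_k + A_k X_k A_k^T and A_{k+1} = A_k^2,
    so the number of accumulated series terms doubles at every step.
    """
    x = [row[:] for row in q]
    a_k = [row[:] for row in a]
    for _ in range(max_iter):
        update = matmul(matmul(a_k, x), transpose(a_k))
        x = [
            [x[i][j] + update[i][j] for j in range(len(x[0]))]
            for i in range(len(x))
        ]
        if max(abs(value) for row in update for value in row) < tol:
            break  # the newly added terms are negligible
        a_k = matmul(a_k, a_k)
    return x
```

Each step consists only of matrix products, which is one plausible reason such an iteration suits a distributed-memory cluster; that remark is an inference, not a claim from the abstract.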

10.
We discuss a parallel library of efficient algorithms for model reduction of large-scale systems with state-space dimension up to O(10^4). We survey the numerical algorithms underlying the implementation of the chosen model reduction methods. The approach considered here is based on state-space truncation of the system matrices and includes absolute and relative error methods for both stable and unstable systems. In contrast to serial implementations of these methods, we employ Newton-type iterative algorithms for the solution of the major computational tasks. Experimental results report the numerical accuracy and the parallel performance of our approach on a cluster of Intel Pentium II processors.

11.
A finite element code with a polycrystal plasticity model for simulating deformation processing of metals has been developed for parallel computers using High Performance Fortran (HPF). The conversion of the code from an original implementation on the Connection Machine systems using CM Fortran is described. The sections of the code requiring minimal inter-processor communication are easily parallelized, by changing only the syntax for specifying data layout. However, the solver routine based on the conjugate gradient method required additional modifications, which are discussed in detail. The performance of the code on a massively parallel distributed-memory Intel PARAGON supercomputer is evaluated through timing statistics. Published by Elsevier Science Ltd.

12.
Parallel workloads on shared-memory multi-core processors often suffer from performance degradation. Cache eviction, true/false sharing and bus contention are among the well-understood causes of this problem. This paper presents a study showing that the L2 DPL (data prefetch logic) in processors based on the Intel Core microarchitecture can be a cause as well. Through a case study of image integration, the study finds a nonscaling problem in the parallel integration of images whose size exceeds the capacity of the processor's L2 cache. An analysis of the relevant performance events using the Intel VTune™ Performance Analyzer finds that the L2 DPL prefetch is less effective at prefetching needed data during parallel integration than during serial integration. To resolve the problem, a novel parallel image reverse loading is developed to reduce the number of memory accesses during the parallel integration and the associated delay. Experimental results demonstrate that parallel integration after parallel reverse loading shows a significant speedup over the same parallel integration after serial loading.

13.
Transient simulation in circuit simulation tools, such as SPICE and Xyce, depends on scalable and robust sparse LU factorizations for efficient numerical simulation of circuits and power grids. As the need for simulations of very large circuits grows, the prevalence of multicore architectures enables us to use shared memory parallel algorithms for such simulations. A parallel factorization is a critical component of such shared memory parallel simulations. We develop a parallel sparse factorization algorithm that can solve problems from circuit simulations efficiently, and maps well to architectural features. This new factorization algorithm exposes hierarchical parallelism to accommodate the irregular structure that arises in our target problems. It also uses a hierarchical two-dimensional data layout which reduces synchronization costs and maps to the memory hierarchy found in multicore processors. We present an OpenMP based implementation of the parallel algorithm in a new multithreaded solver called Basker in the Trilinos framework. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulation. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to the state-of-the-art solver KLU. Basker outperforms the Intel MKL Pardiso solver (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.

14.
Various tridiagonal solvers have been proposed in recent years for different parallel platforms. In this paper, the performance of three tridiagonal solvers, namely, the parallel partition LU algorithm, the parallel diagonal dominant algorithm, and the reduced diagonal dominant algorithm, is studied. These algorithms are designed for distributed-memory machines and are tested on an Intel Paragon and an IBM SP2 machine. Measured results are reported in terms of execution time and speedup. Analytical studies are conducted for different communication topologies and for different tridiagonal systems. The measured results match the analytical results closely. In addition to addressing implementation issues, performance considerations such as problem sizes and models of speedup are also discussed. © 1997 John Wiley & Sons, Ltd.

15.
The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is a characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS. Copyright © 2013 John Wiley & Sons, Ltd.
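
For orientation, the standard LU factorization with partial pivoting that this abstract takes as its numerical reference can be sketched serially as below. This is the textbook unblocked algorithm, not the recursive panel or tile formulation the paper proposes, and the function name `lu_partial_pivoting` and its return convention are assumptions made for the example.

```python
def lu_partial_pivoting(matrix):
    """Factor P A = L U with row pivoting.

    Returns (lu, perm): `lu` holds U on and above the diagonal and the
    multipliers of the unit lower-triangular L below it; `perm[i]` is the
    index of the original row that ended up in row i.
    """
    n = len(matrix)
    lu = [[float(value) for value in row] for row in matrix]
    perm = list(range(n))
    for k in range(n):
        # Pivot selection: this serial search is the step the abstract
        # describes as hard to parallelize.
        pivot_row = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[pivot_row][k] == 0.0:
            raise ValueError("matrix is singular")
        if pivot_row != k:
            lu[k], lu[pivot_row] = lu[pivot_row], lu[k]
            perm[k], perm[pivot_row] = perm[pivot_row], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier, stored in place of L
            # Trailing update: independent across rows i and columns j.
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm
```

The two comments mark where the abstract's distinction shows up in code: the pivot search and column scaling form the panel step, while the innermost double loop is the trailing submatrix update.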

16.
A new mapping algorithm is presented for domain decomposition for the purpose of allowing researchers to conduct finite element analysis on massively parallel computers. Over the last few years, massively parallel MIMD machines such as the Intel Touchstone Delta and recently the Intel Touchstone Paragon have become increasingly popular for speeding up finite element computations. Most of these applications use domain decomposition as a first step towards conquering the problem. Many different algorithms have been developed by researchers to achieve an effective domain decomposition. Some of these methods use connectivity information only, some use coordinate information only, while others use both of them together. Some algorithms are based on assigning weights to nodes using a particular strategy while others are recursive in nature. As will be discussed in this paper, the logic employed in various algorithms works perfectly well for certain meshes to be decomposed into certain numbers of subdomains, while it gives far from perfect results for other meshes or for the same meshes to be decomposed into a different number of subdomains. The logic used in the proposed algorithm has been developed in a creative way such that it is closer to a human's natural thinking when making decisions. Fairly large meshes can be decomposed in a matter of seconds on a Sun SPARCstation by the proposed algorithm. Its execution time remains almost the same for any number of subdomains.

17.
This paper describes the parallel implementation of a numerical model for the simulation of problems from fluid dynamics on distributed memory multiprocessors. The basic procedure is to apply a fully explicit upwind finite difference approximation on a staggered grid. A theoretical time complexity analysis shows that a perfect speedup is achieved asymptotically. Experimental results on the Intel Touchstone Delta System confirm the analytical performance model. © 1997 John Wiley & Sons, Ltd.
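
A one-dimensional analogue of an explicit upwind update can be sketched as follows. It is only an illustration under stated assumptions: a scalar advection equation u_t + c u_x = 0 with c > 0, a periodic domain, and an ordinary (non-staggered) grid, none of which is taken from the paper; `upwind_step` and `courant` (the ratio c * dt / dx) are hypothetical names.

```python
def upwind_step(u, courant):
    """Advance one time step of first-order explicit upwind advection.

    New value: u[i] - courant * (u[i] - u[i - 1]).
    For i == 0 the index -1 wraps to the last cell, which gives the
    periodic boundary assumed in this sketch. Stable for 0 <= courant <= 1.
    """
    return [u[i] - courant * (u[i] - u[i - 1]) for i in range(len(u))]


if __name__ == "__main__":
    profile = [0.0, 1.0, 0.0, 0.0]
    for _ in range(3):
        profile = upwind_step(profile, 0.5)
    print(profile)
```

Because each new value depends only on a cell and its upwind neighbour, a block of cells assigned to one processor would need a single boundary value from the neighbouring block per step. That small communication volume relative to the per-block work is consistent with the asymptotic speedup claim, although the paper's own analysis is not reproduced here.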

18.
Traditionally, computer programs have been developed using the sequential programming paradigm. With the advent of parallel computing systems, such as multicore processors and distributed environments, the sequential paradigm became a barrier to the utilisation of the available resources, since the program is restricted to a single processing unit. To address this issue, we propose a transparent automatic parallelisation tool with a binary rewriter. The steps of our approach are: the disassembly of the Intel x86 application; its transformation into an intermediary language; the analysis of this intermediary representation to obtain the flow and dependency graphs; the partitioning of the application into parallel units, using the obtained graphs; and, finally, the reassembly of the application back into the original Intel x86 architecture. By transforming the compiled application software, we aim at obtaining a program which can better exploit the parallel resources, with no extra effort required from users or developers.

19.
In the paper we present parallel solutions for performing image contour ranking on coarse-grained machines. In contour ranking, a linear representation of the edge contours is generated from the edge contours of a raw image. We describe solutions that employ different divide-and-conquer approaches and that use different communication patterns. The combining step of the divide-and-conquer solutions uses efficient sequential techniques for merging information about subimages. The proposed solutions are implemented on Intel Delta and Intel Paragon machines. We discuss performance results and present scalability analysis using different image and machine sizes. © 1997 by John Wiley & Sons, Ltd.

20.
We present a portable, parallel implementation of an urban air quality model. The parallel model runs on the Intel Delta, Intel Paragon, IBM SP2, and Cray T3D, using a variety of standard communication libraries. We analyze the performance of the air quality model on these platforms based on a model derived from the parallel communication behavior and sequential execution time of the air quality model. We predict the performance of the next generation air quality models based on this analysis.
