首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
A parallel ordering technique is a typical strategy for parallelization of the ICCG method. This paper proposes a new parallel ordering method to develop a parallel ICCG solver utilizing fewer synchronization points and achieving a high convergence rate. The new parallel ordering is called block red-black ordering. In this method, nodes in an analyzed grid are divided into several or many blocks, and red-black ordering is applied to the blocks. Since the blocks with an identical color are independent of each other, forward and backward substitutions in the ICCG iteration can be parallelized in each color. The new method has the advantage that only one synchronization point exists in each parallelized substitution. In order to evaluate the convergence and the parallel speed-up of the method, we carried out an analytical investigation using the ordering graph theory and numerical tests on a scalar parallel computer. The analytical study shows that the convergence rate is improved by an increase in the number of nodes of one block and that an optimal block size for getting the best convergence rate is easily set. The numerical tests show that the new method achieves a high parallel speed-up rate due to fast convergence, small synchronization costs, and effective utilization of the data cache on a scalar parallel computer.  相似文献   

2.
Two parallel block tridiagonalization algorithms and implementations for dense real symmetric matrices are presented. Block tridiagonalization is a critical pre-processing step for the block tridiagonal divide-and-conquer algorithm for computing eigensystems and is useful for many algorithms desiring the efficiencies of block structure in matrices. For an “effectively” sparse matrix, which frequently results from applications with strong locality properties, a heuristic parallel algorithm is used to transform it into a block tridiagonal matrix such that the eigenvalue errors remain bounded by some prescribed accuracy tolerance. For a dense matrix without any usable structure, orthogonal transformations are used to reduce it to block tridiagonal form using mostly level 3 BLAS operations. Numerical experiments show that block tridiagonal structure obtained from this algorithm directly affects the computational complexity of the parallel block tridiagonal divide-and-conquer eigensolver. Reduction to block tridiagonal form provides significantly lower execution times, as well as memory traffic and communication cost, over the traditional reduction to tridiagonal form for eigensystem computations.  相似文献   

3.
As an example application the elliptic partial differential equation for steady groundwater flow is considered. Uncertainties in the conductivity may be quantified with a stochastic model. A discretisation by a Galerkin ansatz with tensor products of finite element functions in space and stochastic ansatz functions leads to a certain type of stochastic finite element system (SFEM). This yields a large system of equations with a particular structure. They can be efficiently solved by Krylov subspace methods, as here the main ingredient is the multiplication with the system matrix and the application of the preconditioner. We have implemented a “hierarchical parallel solver” on a distributed memory architecture for this. The multiplication and the preconditioning uses a—possibly parallel—deterministic solver for the spatial discretisation as a building block in a black-box fashion. This paper is concerned with a coarser grained level of parallelism resulting from the stochastic formulation. These coarser levels are implemented by running different instances of the deterministic solver in parallel. Different possibilities for the distribution of data are investigated, and the efficiencies determined. On up to 128 processors, systems with more than 5 × 107 unknowns are solved.  相似文献   

4.
Selfconsistent simulations of ion–cyclotron heating of tokamak plasmas require iterating between a solver of the wave equations in toroidal geometry and a solver of the Fokker–Planck equation describing the evolution of the ion distribution functions. A huge amount of information must be exchanged between the two codes at each iteration. For the package TORICSSFPQL, we have developed an interface which substantially reduces the CPU and memory requirements for the storage and transmission of these data. We present this interface here, and we take advantage of its efficiency to compare simulations in which the quasilinear diffusion operator in SSFPQL is evaluated superposing the TORIC results for all relevant toroidal modes excited by a given antenna with simulations using the fields of a single “representative” toroidal mode.  相似文献   

5.
A domain where recombination does not occur often, yet maintained linkage disequilibrium exists on DNA sequence is known as a “haplotype block” or “LD block”. Many methods are available to identify LD blocks using disequilibrium parameters, such as the well-known Gabriel's method, Zhu's method and so on. We considered that Echelon analysis can be applied to identify LD block, and report herein that LD blocks can be identified according to our new method applying Echelon analysis. The results of numerical examples are also provided.  相似文献   

6.
A coarse-grain parallel solver for systems of linear algebraic equations with general sparse matrices by Gaussian elimination is discussed. Before the factorization two other steps are performed. A reordering algorithm is used during the first step in order to obtain a permuted matrix with as many zero elements under the main diagonal as possible. During the second step the reordered matrix is partitioned into blocks for asynchronous parallel processing (normally the number of blocks is equal to the number of processors). It is possible to obtain blocks with nearly the same number of rows, because there is no requirement to produce square diagonal blocks. The first step is much more important than the second one and has a significant influence on the performance of the solver. A straightforward implementation of the reordering algorithm will result inO(n 2) operations. By using binary trees this cost can be reduced toO(NZ logn), whereNZ is the number of non-zero elements in the matrix andn is its order (normallyNZ is much smaller thann 2). Some experiments on parallel computers with shared memory have been performed. The results show that a solver based on the proposed reordering performs better than another solver based on a cheaper (but at the same time rather crude) reordering whose cost is onlyO(NZ) operations.  相似文献   

7.
A three-dimensional parallel unstructured non-nested multigrid solver for solutions of unsteady incompressible viscous flow is developed and validated. The finite-volume Navier–Stokes solver is based on the artificial compressibility approach with a high-resolution method of characteristics-based scheme for handling convection terms. The unsteady flow is calculated with a matrix-free implicit dual time stepping scheme. The parallelization of the multigrid solver is achieved by multigrid domain decomposition approach (MG-DD), using single program multiple data (SPMD) and multiple instruction multiple data (MIMD) programming paradigm. There are two parallelization strategies proposed in this work, first strategy is a one-level parallelization strategy using geometric domain decomposition technique alone, second strategy is a two-level parallelization strategy that consists of a hybrid of both geometric domain decomposition and data decomposition techniques. Message-passing interface (MPI) and OpenMP standard are used to communicate data between processors and decompose loop iterations arrays, respectively. The parallel-multigrid code is used to simulate both steady and unsteady incompressible viscous flows over a circular cylinder and a lid-driven cavity flow. A maximum speedup of 22.5 could be achieved on 32 processors, for instance, the lid-driven cavity flow of Re = 1000. The results obtained agree well with numerical solutions obtained by other researchers as well as experimental measurements. A detailed study of the time step size and number of pseudo-sub-iterations per time step required for simulating unsteady flow are presented in this paper.  相似文献   

8.
Hydrologic modeling requires the handling of a wide range of highly nonlinear processes from the scale of a hill slope to the continental scale, and thus the computational efficiency of the model becomes a critical issue for water resource management. This work is aimed at implementing and evaluating a flexible parallel computing framework for hydrologic simulations by applying OpenMP in the HydroGeoSphere (HGS) model. HGS is a 3D control-volume finite element model that solves the nonlinear coupled equations describing surface–subsurface water flow, solute migration and energy transport. The computing efficiency of HGS is improved by three parallel computing schemes: 1) parallelization of Jacobian matrix assembly, 2) multi-block node reordering for performing LU solve efficiently, and 3) parameter privatization for reducing memory access latency. Regarding to the accuracy and consistency of the simulation solutions obtained with parallel computing, differences in the solutions are entirely due to use of a finite linear solver iteration tolerance, which produces slightly different solutions which satisfy the convergence tolerance. The maximum difference in the head solution between the serial and parallel simulations is less than 10−3 m, using typical convergence tolerances. Using the parallel schemes developed in this work, three key achievements can be summarized: (1) parallelization of a physically-based hydrologic simulator can be performed in a manner that allows the same code to be executed on various shared memory platforms with minimal maintenance; (2) a general, flexible and robust parallel iterative sparse-matrix solver can be implemented in a wide range of numerical models employing either structured or unstructured mesh; and (3) the methodology is flexible, especially for the efficient construction of the coefficient and Jacobian matrices, compared to other parallelized hydrologic models which use parallel library packages.  相似文献   

9.
This note is a sequel of paper (Escudero et al. (2012) [1]), in which the sequential Branch-and-Fix Coordination referred to as the BFC-MS algorithm was introduced for solving large-scale multistage mixed 0–1 optimization problems up to optimality under uncertainty. The aim of the note is to present the parallelization version of the BFC algorithm, referred to as PC-BFCMS, so that the elapsed time reduction on problem solving is analyzed. The parallelization is performed at two levels. The inner level parallelizes the optimization of the MIP submodels attached to the set of scenario clusters that have been created by the modeler-defined break stage to decompose the original problem, where the nonanticipativity constraints are partially relaxed. Several strategies are presented for analyzing the performance of the inner parallel computation based on MPI (Message Passing Interface) threads to solve scenario cluster submodels versus the sequential version of the BFC-MS methodology. The outer level of parallelization defines a set of 0–1 variables, the combinations of whose 0–1 values, referred to as paths (one for each combination), allow independent models to be optimized in parallel, such that each one can itself be internally optimized with the inner parallelization approach. The results of using the outer–inner parallelization are remarkable: the larger the number of paths and MPI threads (in addition to the number of threads that the MIP solver allows to be used), the smaller the elapsed time to obtain the optimal solution. This new approach allows problems to be solved faster, and, can thus solve very large scale problems that could not otherwise be solved by plain use of a state-of the-art MIP solver, or could not be solved by the sequential version of the decomposition algorithm in acceptable elapsed time.  相似文献   

10.
Multicore accelerators are used today to supplement traditional superscalar processors in massively parallel computer nodes with extra floating‐point computation power. This paper presents our parallelization and performance enhancement and evaluation of the conjugate gradient (CG) linear equation solver with enhanced matrix multiplication on the Cell Broadband Engine accelerator. The paper also compares the CG performance results on the Cell and two CG implementations on a computer with two quadcore Xeon processors, one with OpenMP and the other with OpenMPI. We also report the enhancements made on the CG code and performance analysis of CG on single and dual Cell Broadband Engine packages with 8 and 16 synergistic processing elements and on Xeon for heptadiagonal matrices, in particular to matrix multiplication and synchronization. We also report the communication and computation time breakdowns and the floating point operations per second ratio. Our parallel CG solver is shown to scale well with data size, grid dimensionality, and number of cores. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

11.
In this paper we study two generalizations of the well known unrelated parallel machines scheduling problem under makespan (Cmax) minimization. First, a situation in which not every available parallel machine should be used and it is desirable to employ only a subset of the parallel machines. This is referred to as “Not All Machines” or NAM in short. This environment applies frequently in production shops where capacity exceeds demand or when production capacity can be lent to third companies. Also, NAM can be used to increase production capacity and it is not clear how many additional machines should be acquired. The second studied generalization has been referred to as “Not All Jobs” or NAJ. Here, there is no obligation to process all available jobs. We propose Mixed Integer Programming mathematical formulations for both NAM and NAJ, and it is shown that the latter can be effectively solved with modern commercial solvers. We also present three algorithms to solve the NAM problem. These algorithms are compared with the proposed MIP formulation when solved with IBM ILOG CPLEX 12.1. Comprehensive computational and statistical experiments prove that our proposed algorithms significantly improve the results given by the solver.  相似文献   

12.
This work is devoted to the development of efficient parallel algorithms for the direct numerical simulation (DNS) of incompressible flows on modern supercomputers. In doing so, a Poisson equation needs to be solved at each time-step to project the velocity field onto a divergence-free space. Due to the non-local nature of its solution, this elliptic system is the part of the algorithm that is most difficult to parallelize. The Poisson solver presented here is restricted to problems with one uniform periodic direction. It is a combination of a block preconditioned Conjugate Gradient (PCG) and an FFT diagonalization. The latter decomposes the original system into a set of mutually independent 2D systems that are solved by means of the PCG algorithm. For the most ill-conditioned systems, that correspond to the lowest Fourier frequencies, the PCG is replaced by a direct Schur-complement based solver.The previous version of the Poisson solver was conceived for single-core (also dual-core) processors and therefore, the distributed memory model with message-passing interface (MPI) was used. The irruption of multi-core architectures motivated the use of a two-level hybrid MPI + OpenMP parallelization with the shared memory model on the second level. Advantages and implementation details for the additional OpenMP parallelization are presented and discussed in this paper. Numerical experiments show that, within its range of efficient scalability, the previous MPI-only parallelization is slightly outperformed by the MPI + OpenMP approach. But more importantly, the hybrid parallelization has allowed to significantly extend the range of efficient scalability. Here, the solver has been successfully tested up to 12800 CPU cores for meshes with up to 109 grid points. However, estimations based on the presented results show that this range can be potentially stretched up until 200,000 cores approximately. Finally, several examples of DNS simulations are briefly presented to illustrate some potential applications of the solver.  相似文献   

13.
Exploratory data mining and analysis requires a computing environment which provides facilities for the user-friendly expression and rapid execution of scientific queries. In this paper, we address research issues in the parallelization of scientific queries containing complex user-defined operations. In a parallel query execution environment, parallelizing a query execution plan involves determining how input data streams to evaluators implementing logical operations can be divided to be processed by clones of the same evaluator in parallel. We introduced the concept of relevance window that characterizes data lineage and data partitioning opportunities available for an user-defined evaluator. In addition, we developed a query parallelization framework by extending relational parallel query optimization algorithms to allow the parallelization characteristics of user-defined evaluators to guide the process of query parallelization in an extensible query processing environment. We demonstrated the utility of our system by performing experiments mining cyclonic activity, blocking events, and the upward wave-energy propagation features from several observational and model simulation datasets.  相似文献   

14.
Peigin  S.  Epstein  B.  Rubin  T.  Seror  S. 《The Journal of supercomputing》2004,27(1):49-68
We present a highly scalable parallelization of a high-accuracy 3D serial multiblock Navier-Stokes solver. The code solves the full Navier-Stokes equations and is capable of performing large-scale computations for practical configurations in an industrial enviroment. The parallelization strategy is based on the geometrical domain decomposition principle, and on the overlapped communication and computation concept. The important advantage of the strategy is that the suggested type of message-passing ensures a very high scalability of the algorithm from the network point of view, because, on the average, the communication work per processor is not increased if the number of processors is increased. The parallel multiblock-structured Navier-Stokes solver based on the parallel virtual machine (PVM) routines was implemented on 106-processors distributed memory cluster managed by the MOSIX software package. Analysis of the results demonstrated a high level of parallel efficiency (speed up) of the computational algorithm. This allowed the reduction of the execution time for large-scale computations employing 10 million of grid points, from an estimated 46 days on the SGI ORIGIN 2000 computer (in the serial single-user mode) to 5–6 hours on 106-processors cluster. Thus, the parallel multiblock full Navier-Stokes code can be successfully used for large-scale practical aerodynamic simulations of a complete aircraft on millions-points grids on a daily basis, as needed in industry.  相似文献   

15.
16.
A new efficient parallelization strategy for optimization of aerodynamic shapes is proposed. The optimization method employs a full Navier-Stokes solver for accurate estimation of the objective function. As such it requires huge computational resources which makes efficient parallelization crucial for successful promotion of the method to an engineering environment. The algorithm is based on a multilevel embedded parallelization approach, which includes (1) parallelization of the multiblock full Navier-Stokes solver with parallel CFD evaluation of objective function, (2) parallelization of optimization process with parallel optimal search on multiple search domains and, finally, (3) parallel grid generation. Applications (implemented on a 144-processors distributed memory cluster) include various transonic profile optimizations in the presence of nonlinear constraints. The results demonstrate that the approach combines high accuracy of optimization with high parallel efficiency. The proposed multilevel parallelization which efficiently makes use of computational power supplied by multiprocessor systems, leads to a significant computational time-saving and allows application of the method to practical aerodynamic design in the aircraft industry.  相似文献   

17.
Single- and multi-level iterative methods for sparse linear systems are applied to unsteady flow simulations via implementation into a direct numerical simulation solver for incompressible turbulent flows on unstructured meshes. The performance of these solution methods, implemented in the well-established SAMG and ML packages, are quantified in terms of computational speed and memory consumption, with a direct sparse LU solver (SuperLU) used as a reference. The classical test case of unsteady flow over a circular cylinder at low Reynolds numbers is considered, employing a series of increasingly fine anisotropic meshes. As expected, the memory consumption increases dramatically with the considered problem size for the direct solver. Surprisingly, however, the computation times remain reasonable. The speed and memory usage of pointwise algebraic and smoothed aggregation multigrid solvers are found to exhibit near-linear scaling. As an alternative to multi-level solvers, a single-level ILUT-preconditioned GMRES solver with low drop tolerance is also considered. This solver is found to perform sufficiently well only on small meshes. Even then, it is outperformed by pointwise algebraic multigrid on all counts. Finally, the effectiveness of pointwise algebraic multigrid is illustrated by considering a large three-dimensional direct numerical simulation case using a novel parallelization approach on a large distributed memory computing cluster.  相似文献   

18.
19.
In this paper, two significant weaknesses of locally linear embedding (LLE) applied to computer vision are addressed: “intrinsic dimension” and “eigenvector meanings”. “Topological embedding” and “multi-resolution nonlinearity capture” are introduced based on mathematical analysis of topological manifolds and LLE. The manifold topological analysis (MTA) method is described and is based on “topological embedding”. MTA is a more robust method to determine the “intrinsic dimension” of a manifold with typical topology, which is important for tracking and perception understanding. The manifold multi-resolution analysis (MMA) method is based on “multi-resolution nonlinearity capture”. MMA defines LLE eigenvectors as features for pattern recognition and dimension reduction. Both MTA and MMA are proved mathematically, and several examples are provided. Applications in 3D object recognition and 3D object viewpoint space partitioning are also described.  相似文献   

20.
The aim of this study was to investigate the relationship between three Eysenckian personality dimensions – psychoticism, extroversion and neuroticism – and the Internet use. A sample of 427 Turkish university students completed the Eysenck’s Personality Questionnaire, an Internet survey which contained questions about interpersonal motives for Internet use and a scale for measuring the tendency for expressing one’s “true” self on the Internet. The results indicated that psychoticism was the only personality dimension related to establishing new relationships and having “Internet only” friends; and extroversion was the only personality dimension that is related to maintaining long-distance relationships, and supporting daily face-to-face relationships. The results supported the idea that for some individuals, Internet can be used as social substitute for face-to-face social interactions while for some others it can be used as a tool of social extension, depending on the user’s personality characteristics. Also, psychoticism and neuroticism were found to be positively associated with the expressing “true self” on the Internet, and it was shown that the relationship between psychoticism and Internet uses as social substitute is mediated by the tendency to express one’s true self on the Internet.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号