首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 297 毫秒
1.
This paper introduces new approaches to the data distribution-partition problem for sparse matrices in a multiprocessor environment. In this work, the data partition problem of a sparse matrix is modeled as a Min-Max Problem subject to the uniformity constrain when the goal is to balance the load for both sparse and dense operations. This problem is NP-Complete and two heuristic solutions (ABO and GPB) are proposed. The key of ABO and GPB is to determine the permutation of rows/columns of the input sparse matrix to obtain a sorted matrix with a homogeneous density of nonzero elements. Due to the heuristic nature of the proposed methods their validation is carried out by a comparative study of the parallel efficiency of two types of problems (sparse and mixed) when ABO, GPB, Block, Cyclic and MRD data distributions are applied.This work has been partially supported by the Spanish CICYT through grant TIC2002-00228.  相似文献   

2.
Transient simulation in circuit simulation tools, such as SPICE and Xyce, depend on scalable and robust sparse LU factorizations for efficient numerical simulation of circuits and power grids. As the need for simulations of very large circuits grow, the prevalence of multicore architectures enable us to use shared memory parallel algorithms for such simulations. A parallel factorization is a critical component of such shared memory parallel simulations. We develop a parallel sparse factorization algorithm that can solve problems from circuit simulations efficiently, and map well to architectural features. This new factorization algorithm exposes hierarchical parallelism to accommodate irregular structure that arise in our target problems. It also uses a hierarchical two-dimensional data layout which reduces synchronization costs and maps to memory hierarchy found in multicore processors. We present an OpenMP based implementation of the parallel algorithm in a new multithreaded solver called Basker in the Trilinos framework. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulation. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to state-of-the-art solver KLU. Basker outperforms Intel MKL Pardiso solver (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.  相似文献   

3.
《国际计算机数学杂志》2012,89(3-4):249-270
We study the parallel implementation of two diagonalization methods for solving dense linear systems: the well known Gauss-Jordan method and a new one introduced by Huard. The number of arithmetic operations performed by the Huard method is the same as for Gaussian elimination, namely 2n 3/3, less than for the Jordan method, namely n 3. We introduce parallel versions of these methods, compare their performances and study their complexity. We assume a shared memory computer with a number of processors p of the order of n, the size of the problem to be solved, We show that the best parallel version for Jordan's method is by rows whereas the best one for Huard's method is by columns. Our main result states that for a small number of processors the parallel Huard method is faster than the parallel Jordan method and slower otherwise. The separation is obtained for p = 0.44n.  相似文献   

4.
The preconditioned weighted essentially non-oscillatory (P-WENO) solver for viscous flows (Huang et al. (2009) [9]) is extended to non-inertial reference frames. In the present scheme, patched multi-block grid system is employed and parallel computing is adopted as well. With the present parallel P-WENO solver, three-dimensional flows of the Phase VI Rotor from National Renewable Energy Laboratory (NREL) can be simulated and analyzed for different wind speeds. Our simulation results show good agreement with the numerical predictions based on incompressible Navier–Stokes (N–S) equations as well as the available wind tunnel data from NREL. The flow phenomena, including separation and attachment line, can be captured by the present scheme. The parallel strategy adopted is a block-domain decomposition method for the patched multi-block grid system. To balance the load among different computing nodes, a Tabu search algorithm is adopted for the parallelization. The parallel efficiency of the parallel P-WENO scheme is examined for node numbers ranging from 1 to 64. It is found that the parallel efficiency is monotonically decreased as the node number adopted is increased; the parallel efficiency is retained over 90% for all cases of different node numbers. Due to the high parallel efficiency, our parallel P-WENO solver is validated for applying to practical fluid problems from compressible to incompressible limits.  相似文献   

5.
基于模糊Petri网的并行推理算法的矩阵维数越大,其算法的时间复杂度也就越高。针对反向搜索压缩模糊Petri网模型的相关理论和并行推理算法的特点,结合矩阵命令提出一种实现双向推理的矩阵运算机制,以及其对应的基于模糊Petri网的双向并行推理算法。在使用一般模糊推理算法的过程中,推理矩阵为(11×8)维的模糊Petri网模型,而使用改进算法进行双向推理时所涉及的推理矩阵阶数仅为(7×6)。实验结果表明,与一般的模糊推理算法和反向搜索算法相比,该算法能够提高整个推理过程的并行度,降低算法的时间复杂度,从而提高推理效率。  相似文献   

6.
Randomized algorithms are gaining ground in high-performance computing applications as they have the potential to outperform deterministic methods, while still providing accurate results. We propose a randomized solver for distributed multicore architectures to efficiently solve large dense symmetric indefinite linear systems that are encountered, for instance, in parameter estimation problems or electromagnetism simulations. The contribution of this paper is to propose efficient kernels for applying random butterfly transformations and a new distributed implementation combined with a runtime (PaRSEC) that automatically adjusts data structures, data mappings, and the scheduling as systems scale up. Both the parallel distributed solver and the supporting runtime environment are innovative. To our knowledge, the randomization approach associated with this solver has never been used in public domain software for symmetric indefinite systems. The underlying runtime framework allows seamless data mapping and task scheduling, mapping its capabilities to the underlying hardware features of heterogeneous distributed architectures. The performance of our software is similar to that obtained for symmetric positive definite systems, but requires only half the execution time and half the amount of data storage of a general dense solver.  相似文献   

7.
《Parallel Computing》2007,33(7-8):541-560
A new parallel code for the simulation of the transient, 3D dispersal of volcanic particles in the atmosphere is presented. The model equations, describing the multiphase flow dynamics of gas and solid pyroclasts ejected from the volcanic vent during explosive eruptions, are solved by a finite-volume discretization scheme and a pressure-based iterative non-linear solver suited to compressible multiphase flows. The solution of the multiphase equation set is computationally so demanding that the simulation of the transient 3D dynamics of eruptive columns would not be cost-effective on a single workstation. The new code has been parallelized by adopting an ad hoc domain partitioning scheme that enforces the load balancing in the presence of a large number of topographic blocking-cells. An optimized communication layer has been built over the Message-Passing Interface. It is shown that the present code has a remarkable efficiency on several high-performance platforms and makes it possible, for the first time, to simulate fully 3D eruptive scenarios on realistic volcano topography.  相似文献   

8.
New parallel computational techniques are introduced for the parallelization of Generic Approximate Sparse Inverse multigrid methods, based on Portable Operating System Interface for UniX (POSIX) threads, for multicore systems. Parallelization of the Generic Approximate Sparse Inverse Matrix (GenAspI) algorithm is achieved based on a new computational approach, namely “strip,” which utilizes the data independence of the rows assigned in each available processor. Additionally, new parallel computational techniques are proposed for the parallelization of a modified multigrid V-Cycle method, based on POSIX Threads, for multicore systems. The modified V-Cycle utilized a Parallel PGenAspI Preconditioned Bi-Conjugate Gradient STABilized (BiCGSTAB) as a coarse solver to ensure better parallel performance of the multigrid method. For parallelization purposes, a replication of the multigrid method function is executed on each processor with different index bands and with proper synchronization points to ensure less thread-creation overhead and to maximize parallel performance. Theoretical estimates on speedups and efficiency are also presented. Finally, numerical results for the performance of the PGenAspI algorithm and the PGenAspI–MGV method for solving classical two-dimensional boundary value problems on multicore computer systems are presented. The implementation issues of the proposed method are also discussed using POSIX threads on multicore systems.  相似文献   

9.
本文主要介绍了大规模油藏数值模拟并行计算技术在国内的研究进展,提供了精细油藏模拟在国产Beowulf系统上的计算实例和应用效果,给出了百万网格点规模的油藏应用算例在不同处理器规模下的数值模拟计算结果与性能分析,并实现了一个针对海量数据可视化的三维图、二维图、表格显示的后处理显示系统.  相似文献   

10.
We investigate fully parallel Newton-Krylov-Schwarz (NKS) algorithms for solving the large sparse nonlinear systems of equations arising from the finite element discretization of the three-dimensional Poisson-Boltzmann equation (PBE), which is often used to describe the colloidal phenomena of an electric double layer around charged objects in colloidal and interfacial science. The NKS algorithm employs an inexact Newton method with backtracking (INB) as the nonlinear solver in conjunction with a Krylov subspace method as the linear solver for the corresponding Jacobian system. An overlapping Schwarz method as a preconditioner to accelerate the convergence of the linear solver. Two test cases including two isolated charged particles and two colloidal particles in a cylindrical pore are used as benchmark problems to validate the correctness of our parallel NKS-based PBE solver. In addition, a truly three-dimensional case, which models the interaction between two charged spherical particles within a rough charged micro-capillary, is simulated to demonstrate the applicability of our PBE solver to handle a problem with complex geometry. Finally, based on the result obtained from a PC cluster of parallel machines, we show numerically that NKS is quite suitable for the numerical simulation of interaction between colloidal particles, since NKS is robust in the sense that INB is able to converge within a small number of iterations regardless of the geometry, the mesh size, the number of processors. With help of an additive preconditioned Krylov subspace method NKS achieves parallel efficiency of 71% or better on up to a hundred processors for a 3D problem with 5 million unknowns.  相似文献   

11.
遥感卫星图像几何粗校正的数据并行方法研究   总被引:1,自引:0,他引:1  
主要研究星上遥感图像的实时几何粗校正问题.卫星遥感图像现在一般都大到上万个像素行和列,采用传统的单个处理器的串行方式在星上进行实时处理是难以满足应用要求的.提出了一种在一维PE阵列的SIMD计算机上采用基于处理元阵列平移的数据并行校正方法,并根据NASA的LANDSAT-1卫星的有关的参数,对该方法进行了详细讨论,给出了具体的实现方法.通过对复杂性和加速比的讨论,表明该方法在性能上比采用单个处理器的串行方法提高了N倍.  相似文献   

12.
This paper describes a parallel simulation of an electrical circuit using the Transmission Line Modelling (TLM) method. The TLM method is used to decouple the circuit into sub-circuits which are then simulated concurrently in a parallel processing system. This approach not only simplifies the circuit formulation process but also reduces the overall computing time of the circuit simulation when compared with traditional sequential method. The techniques for decoupling the circuit and implementing the parallel algorithm are described. The method is demonstrated in an electrical circuit simulation. Comparison of the computing time and simulated results with the sequential approach confirms the computing efficiency and accuracy of the proposed method.  相似文献   

13.
Two-level parallelization is introduced to solve a massive block-tridiagonal matrix system. One-level is used for distributing blocks whose size is as large as the number of block rows due to the spectral basis, and the other level is used for parallelizing in the block row dimension. The purpose of the added parallelization dimension is to retard the saturation of the scaling due to communication overhead and inefficiencies in the single-level parallelization only distributing blocks. As a technique for parallelizing the tridiagonal matrix, the combined method of “Partitioned Thomas method” and “Cyclic Odd–Even Reduction” is implemented in an MPI-Fortran90 based finite element-spectral code (TORIC) that calculates the propagation of electromagnetic waves in a tokamak. The two-level parallel solver using thousands of processors shows more than 5 times improved computation speed with the optimized processor grid compared to the single-level parallel solver under the same conditions. Three-dimensional RF field reconstructions in a tokamak are shown as examples of the physics simulations that have been enabled by this algorithmic advance.  相似文献   

14.
A two-dimensional principal component analysis (2D PCA) method directed at processing digital images is discussed. The method is based on representation of images as a set of rows and columns analyzing these sets. Two methods of realizing the 2D PCA corresponding to the parallel and cascade forms of its realization are presented, and their characteristics are estimated. The application of the 2D PCA method is shown for solving problems of representation and recognition of facial images. The experiments are fulfilled on ORL and FERET bases.  相似文献   

15.
该文提出一个针对大型实对称正定稠密方程组或复对称非Hermitian稠密方程组线性求解器的并行分布式算法。它使用了不同于ScaLAPACK的J-变量块Cholesky分解算法和一维块循环列数据分配。该算法以MPI作为消息传递库,在最多可达16个处理器的集群上针对实对称正定稠密方程组可提供与ScaLAPACK近似的浮点操作性能,并可解决一些涉及复对称非Hermitian稠密方程组的电磁场散射问题。该算法的优点是执行Cholesky分解所需的存储量只是标准并行库ScaLAPACK的一半。仿真的数值结果表明该算法是正确、有效的。  相似文献   

16.
A coarse-grain parallel solver for systems of linear algebraic equations with general sparse matrices by Gaussian elimination is discussed. Before the factorization two other steps are performed. A reordering algorithm is used during the first step in order to obtain a permuted matrix with as many zero elements under the main diagonal as possible. During the second step the reordered matrix is partitioned into blocks for asynchronous parallel processing (normally the number of blocks is equal to the number of processors). It is possible to obtain blocks with nearly the same number of rows, because there is no requirement to produce square diagonal blocks. The first step is much more important than the second one and has a significant influence on the performance of the solver. A straightforward implementation of the reordering algorithm will result inO(n 2) operations. By using binary trees this cost can be reduced toO(NZ logn), whereNZ is the number of non-zero elements in the matrix andn is its order (normallyNZ is much smaller thann 2). Some experiments on parallel computers with shared memory have been performed. The results show that a solver based on the proposed reordering performs better than another solver based on a cheaper (but at the same time rather crude) reordering whose cost is onlyO(NZ) operations.  相似文献   

17.
There are several reconfiguring-network models of parallel computation that are considered in the published literature, depending on their switching capabilities. Can these reconfigurable models be the basis for the design of massively parallel computers? Perhaps the most fundamental related issue is virtual parallelism, or the self-simulation problem: Given an algorithm which is designed for a large reconfigurable mesh, can it be executed efficiently on a smaller reconfigurable mesh? In this work, we give several positive answers to the self-simulation problem. We show that the simulation of a reconfiguring mesh by a smaller one can be carried optimally and using standard methods on the model in which buses are established along rows or along columns. A novel technique is shown to achieve asymptotically optimal self simulation on models which allow buses to switch column and row edges, provided that a bus is a "linear" path of connected edges. Finally, for models in which a bus is any subgraph of the underlying mesh, efficient simulations are presented, paying by an extra factor which is polylogarithmic in the size of the simulated mesh. Although the self-simulation algorithms are complex and require extensive bookkeeping operations, the required space is asymptotically optimal.  相似文献   

18.
汉字笔段形成规律及其提取方法   总被引:8,自引:0,他引:8  
该文从点阵图像行(列)连通像素段出发,研究汉字图像的笔段构成,发现汉字点阵图像仅由阶梯型笔段和平行长笔段两种类型的笔段构成,并归纳出阶梯型笔段和平行长笔段的形成规律.以笔段形成规律为基础提出了汉字笔段的提取方法,该方法将像素级汉字图像转变为以笔段为单位的图像,有利于汉字识别、汉字细化及汉字字体的自动生成.最后该文给出了印刷体和手写体汉字笔段提取的实验结果.  相似文献   

19.
为研究极端条件下金属材料的性质,在JASMIN 框架上研制了三维并行位错动力学程序PDD3D. 它集成了离散位错动力学模拟的物理方案和数值算法.通过设计实现高效的分布式数据结构、可扩展的快速多极子解法器以及基于影像区的拓扑操作通信方式,该程序具有较高的性能和较好的可扩展性.1024 个处理器上,对包含3 千万条位错线的物理模型的模拟结果显示,PDD3D 程序获得了81%的并行效率.  相似文献   

20.
针对以往效率较低的串行计算CRC16 CCITT校验码的算法,研究了其计算效率低下的原因,并引入了一种通用的并行算法。在Quartus II下使用Verilog HDL实现了该算法并进行了仿真,使用Nios II自定义指令分析了采用并行算法对串行算法的性能改进。最后,通过多级流水线技术对基本并行电路进行改进和仿真,揭示了利用流水线技术提高存在反馈结构的逻辑电路Fmax存在的问题,并提出了应对的方法。仿真的结果表明,采用改进后的多级流水线电路可以大幅提高并行计算电路Fmax,进而提升CRC16 CCITT校验码计算的效率。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号