首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 953 毫秒
1.
The implementation and performance of the multidimensional Fast Fourier Transform (FFT) on a distributed memory Beowulf cluster is examined. We focus on the three-dimensional (3D) real transform, an essential computational component of Galerkin and pseudo-spectral codes. The approach studied is a 1D domain decomposition algorithm that relies on communication-intensive transpose operation involving P processors. Communication is based upon the standard portable message passing interface (MPI). We show that 1/P scaling for execution time at fixed problem size N3 (i.e., linear speedup) can be obtained provided that (1) the transpose algorithm is optimized for simultaneous block communication by all processors; and (2) communication is arranged for non-overlapping pairwise communication between processors, thus eliminating blocking when standard fast ethernet interconnects are employed. This method provides the basis for implementation of scalable and efficient spectral method computations of hydrodynamic and magneto-hydrodynamic turbulence on Beowulf clusters assembled from standard commodity components. An example is presented using a 3D passive scalar code.  相似文献   

2.
Constraint aggregation is the key for efficient structural optimization when using a gradient-based optimizer and an adjoint method for sensitivity analysis. We explore different methods of constraint aggregation for numerical optimization. We analyze existing approaches, such as considering all constraints individually, taking the maximum of the constraints and using the Kreisselmeier–Steinhauser (KS) function. A new adaptive approach based on the KS function is proposed that updates the aggregation parameter by taking into account the constraint sensitivity. This adaptive approach is shown to significantly increase the accuracy of the results without additional computational cost especially when a large number of constraints are active at the optimum. The characteristics of each aggregation method and the performance of the proposed adaptive approach are shown by solving a wing structure weight minimization problem.  相似文献   

3.
《国际计算机数学杂志》2012,89(3-4):249-270
We study the parallel implementation of two diagonalization methods for solving dense linear systems: the well known Gauss-Jordan method and a new one introduced by Huard. The number of arithmetic operations performed by the Huard method is the same as for Gaussian elimination, namely 2n 3/3, less than for the Jordan method, namely n 3. We introduce parallel versions of these methods, compare their performances and study their complexity. We assume a shared memory computer with a number of processors p of the order of n, the size of the problem to be solved, We show that the best parallel version for Jordan's method is by rows whereas the best one for Huard's method is by columns. Our main result states that for a small number of processors the parallel Huard method is faster than the parallel Jordan method and slower otherwise. The separation is obtained for p = 0.44n.  相似文献   

4.
The quantum Monte Carlo diagonalization or stochastic diagonalization serves as a computational method of solving exactly quantum Hamiltonian models. While based on a variational method, in which the solution approaches the optimal eigenstate of a huge Hamiltonian matrix, the diagonalization method in practice has difficulty because of the rapidly increasing number of quantum states. In this paper, we suggest an improved implementation method of finding the ground state via exact diagonalization of the Hubbard and t-J model Hamiltonians. Achieved is a great increase in the computational capability through an optimized code based on Boolean operations, a reduction of the state space using symmetry properties, and an effective variation on the trial ground state. Our method is restricted mainly by the memory capacity to keep the components of the trial ground state. Carried out on a single personal computer, the method turns out to find exact solutions in a relatively short time with 108-109 basis states.  相似文献   

5.
We examine the computational efficiency of linear algebra components in iterative solvers for grid-oriented simulations of PDEs. While the standard sparse matrix-vector (MV) techniques show significant losses of performance, especially on modern processors, our sparse banded components have the potential to exploit today's high computing power. We explain the major concepts of the FEAST software which contains such highly tuned numerical linear algebra basic components (Sparse Banded Blas) up to complete multigrid solvers, all being optimized with respect to the actual hardware platform. Based on algorithmic and computational studies, we present the FEAST indices which are indicators for the true performance of many modern processors, depending on the underlying FEM space, the problem size and the implementation style. These indices allow a new rating of the various hardware platforms with regard to different mathematical solution strategies, for academic and realistic numerical problems and ranging from ‘low cost’ PCs up to supercomputers.  相似文献   

6.
Null hypothesis significance tests and their p-values currently dominate the statistical evaluation of classifiers in machine learning. Here, we discuss fundamental problems of this research practice. We focus on the problem of comparing multiple fully specified classifiers on a small-sample test set. On the basis of the method by Quesenberry and Hurst, we derive confidence intervals for the effect size, i.e. the difference in true classification performance. These confidence intervals disentangle the effect size from its uncertainty and thereby provide information beyond the p-value. This additional information can drastically change the way in which classification results are currently interpreted, published and acted upon. We illustrate how our reasoning can change, depending on whether we focus on p-values or confidence intervals. We argue that the conclusions from comparative classification studies should be based primarily on effect size estimation with confidence intervals, and not on significance tests and p-values.  相似文献   

7.
We describe the design and implementation of GAUSS — an online algorithm selection system for numerical quadrature. Given a quadrature problem and performance constraints on its solution, GAUSS selects the best (or nearly best) algorithm. GAUSS uses inductive logic programming to generalize a database of performance data; this produces high-level rules that correlate problem features with algorithm performance. Such rules then serve as the basis for recommending algorithms for new problem instances. GAUSS functions online (new data and information can be incrementally incorporated) and can also provide phenomenological explanations of algorithm recommendations.  相似文献   

8.
李炜  杨慧中 《控制与决策》2014,29(3):541-545

联合对角化能够成功解决盲分离问题, 但在求解时会得到非期望的奇异解, 从而无法完全分离出源信号. 鉴于此, 提出一种用于线性卷积混合盲分离的联合对角化方法, 将卷积混合模型变换为瞬时模型, 并对变换后的模型应用联合对角化求取分离矩阵. 在求解过程中, 引入约束条件对解的范围进行限定, 避免了奇异解的出现. 仿真结果表明, 所提出的方法能够成功实现卷积混合信号盲分离.

  相似文献   

9.
非对称非正交快速联合对角化算法   总被引:1,自引:0,他引:1  
针对非对称联合对角化算法收敛速度慢以及有可能收敛到奇异解的问题, 首先提出一种基于最小二乘的非对称代价函数, 该代价函数在最小二乘标准的基础上增加了使对角化矩阵非奇异的约束项, 以保证算法不会收敛到奇异解. 然后利用一种循环最小化技术来优化提出的代价函数, 得到一种非对称非正交快速联合对角化算法. 算法的性能分析证明, 该算法不仅全局渐近收敛, 而且具有不变性. 左右对角化矩阵的关系也证明了非对称联合对角化的一般性. 实验仿真表明, 与原非对称联合对角化算法相比, 提出的算法收敛速度更快, 而且可以显著降低干扰信号比.  相似文献   

10.
To make the results reasonable, existing joint diagonalization algorithms have imposed a variety of constraints on diagonalizers. Actually, those constraints can be imposed uniformly by minimizing the condition number of diagonalizers. Motivated by this, the approximate joint diagonalization problem is reviewed as a multiobjective optimization problem for the first time. Based on this, a new algorithm for nonorthogonal joint diagonalization is developed. The new algorithm yields diagonalizers which not only minimize the diagonalization error but also have as small condition numbers as possible. Meanwhile, degenerate solutions are avoided strictly. Besides, the new algorithm imposes few restrictions on the target set of matrices to be diagonalized, which makes it widely applicable. Primary results on convergence are presented and we also show that, for exactly jointly diagonalizable sets, no local minima exist and the solutions are unique under mild conditions. Extensive numerical simulations illustrate the performance of the algorithm and provide comparison with other leading diagonalization methods. The practical use of our algorithm is shown for blind source separation (BSS) problems, especially when ill-conditioned mixing matrices are involved.   相似文献   

11.
In this paper, a parallel implementation of the Iterative Alternating Direction Explicit method by D’Yakonov (IADE-DY) to solve 2-D telegraphic problem on a distributed system using Message Passing Interface (MPI) and Parallel Virtue Machine (PVM) are presented. The parallelization of the program is implemented by a domain decomposition strategy. A Single Program Multiple Data (SPMD) model is employed for the implementation. The implementation is discussed in relation to means of the parallel performance strategies and analysis. The model enhances overlap communication and computation to avoid unnecessary synchronization, hence, the method yields significant speedup. The level of speedup observed from tables as the mesh increases are in the range of 5–10%. Improvement has been achieved by numbers of tables and figures in our experiment. We present some analyses that are helpful for speedup and efficiency. It is concluded that the efficiency is strongly dependent on the grid size, block numbers and the number of processors for both MPI and PVM. Different strategies to improve the computational efficiency are proposed.  相似文献   

12.
This paper focuses on the resolution of a large number of small random symmetric linear systems and its parallel implementation in single precision on graphics processing units (GPUs). The computations involved by each linear system are independent from the others, and the number of unknowns does not exceed 64. For this purpose, we present the adaptation to our context of largely used methods that include: LDLt factorization, Householder reduction to a tridiagonal matrix, parallel cyclic reduction (PCR) that is not a power of two and the divide and conquer algorithm for tridiagonal eigenproblems. We not only detail the implementation and optimization of each method, but we also compare the sustainability of each solution and its performance which include both parallel complexity and cache memory occupation. In the context of solving a large number of small random linear systems on GPUs with no information about their conditioning, our research indicates that the best strategy requires the use of Householder tridiagonalization + PCR followed if necessary by a divide and conquer diagonalization.  相似文献   

13.
Single machine scheduling is a classical optimization problem that depicts multiple real life systems in which a single resource (the machine) represents the whole system or the bottleneck operation of the system. In this paper we consider the problem under a weighted completion time performance metric in which the processing time of the tasks to perform (the jobs) are uncertain, but can only take values from closed intervals. The objective is then to find a solution that minimizes the maximum absolute regret for any possible realization of the processing times. We present an exact branch-and-bound method to solve the problem, and conduct a computational experiment to ascertain the possibilities and limitations of the proposed method. The results show that the algorithm is able to optimally solve instances of moderate size (25–40 jobs depending on the characteristics of the instance).  相似文献   

14.
We develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. We present a cache-oblivious implementation of the Floyd-Warshall algorithm for the fundamental graph problem of all-pairs shortest paths by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of /spl Omega/(N/sup 3///spl radic/C), where N and C are the problem size and cache size, respectively. Experimental results show that this cache-oblivious implementation shows more than six times the improvement in real execution time over that of the iterative implementation with the usual row major data layout, on three state-of-the-art architectures. Second, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for minimum spanning tree problem. For these algorithms, we demonstrate up to two times the improvement in real execution time by using a simple cache-friendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show performance improvements of two to three times in real execution time by using the technique of making the algorithm initially work on subproblems to generate a suboptimal solution and, then, solving the whole problem using the suboptimal solution as a starting point. Experimental results are shown for the Pentium III, UltraSPARC III, Alpha 21264, and MIPS R12000 machines.  相似文献   

15.
An approach to the exact diagonalization of many-electron Hamiltonian in semiconductor quantum dot (QD) structures is proposed. The QD model is based on 3D finite hard-wall confinement potential and nonparabolic effective-mass approximation (EMA) that render analytical basis functions such as Laguerre polynomials inaccessible for the numerical treatment of this kind of models. In this approach, the many-electron wave function is expanded in a basis of Slater determinants constructed from numerical wave functions of the single-electron Hamiltonian with the nonparabolic EMA which results in a cubic eigenvalue problem from a finite difference discretization. The nonlinear eigenvalue problem is solved by using the Jacobi-Davidson method. The Coulomb matrix elements in the many-electron Hamiltonian are obtained by solving Poisson's problems via GMRES. Numerical results reveal that a good convergence can be achieved by means of a few single-electron basis states.  相似文献   

16.
We present a parallel implementation of the Bose Hubbard model, using imaginary time propagation to find the lowest quantum eigenstate and real time propagation for simulation of quantum dynamics. Scaling issues, performance of sparse matrix-vector multiplication, and a parallel algorithm for determining nonzero matrix elements are described. Implementation of imaginary time propagation yields an O(N) linear convergence on a single processor and slightly better than ideal performance on up to 160 processors for a particular problem size. The determination of the nonzero matrix elements is intractable using sequential non-optimized techniques for large problem sizes. Thus, we discuss a parallel algorithm that takes advantage of the intrinsic structural characteristics of the Fock-space matrix representation of the Bose Hubbard Hamiltonian and utilizes a parallel implementation of a Fock state look up table to make this task solvable within reasonable timeframes. Our parallel algorithm demonstrates near ideal scaling on thousand of processors. We include results for a matrix 22.6 million square, with 202 million nonzero elements, utilizing 2048 processors.  相似文献   

17.
In this paper, the parallel implementation of the iterative alternating direction explicit method by D'Yakonov (IADE-DY) and the Mitchell and Fairweather double sweep method (DS-MF) compared with the alternative direction implicit (ADI) method for solving the 2D telegraphic problem on a distributed system of Armadillo cluster and geranium Cadcam cluster using the parallel virtual machine and message passing interface is presented. We implement the scheduling of n tridiagonal system of equations with the above methods to show improvement on speedup and efficiency with parallel strategies across the two platforms. The single program multiple data model was employed for the implementation. The IADE-DY, DS-MF and ADI are developed by the splitting of the implicit equation using the finite difference discretization on the 2D telegraphic problem. The implementation is discussed in relation to means of the performance parallel strategies and analysis. The model enhances overlap communication and computation to avoid unnecessary synchronization, hence, the method yields significant speedup by the use of the non-blocking communication. We present some analyses that are helpful for speedup and efficiency.  相似文献   

18.
Here we introduce a class of linear operators called recursive orthogonal transforms (ROTs) that allow a natural implementation on a distributed control network. We derive conditions under which ROTs can be used to represent SO(n) for n4. We propose a paradigm for distributed feedback control based on plant matrix diagonalization. To find an ROT suitable for this task, we derive a gradient flow on the appropriate underlying Lie group. A numerical example is presented.  相似文献   

19.
《Advanced Robotics》2013,27(4):453-481
The kinematic (KS) and algorithmic singularities (AS) in controlling robotic manipulators have been investigated intensively because they are not predictable or difficult to avoid. The problem with handling these singularities is an unnecessary performance reduction in the non-singular region and the difficulty in performance tuning. In this paper, we propose a method of avoiding KS and AS by applying a task reconstruction approach while maximizing the task performance by calculating singularity measures. The proposed method is implemented by removing the component approaching the singularity calculated by using a singularity measure in real-time. The outstanding feature of the proposed task reconstruction (TR) method is that it is based on a local TR as opposed to the local joint reconstruction of many other approaches. This method has a dynamic task priority assignment feature which ensures system stability under singular regions due to the change of task priority. The TR method enables us to increase the task controller gain to improve the task performance, whereas this increase can destabilize the system for conventional algorithms in real experiments. In addition, the physical meaning of tuning parameters is very straightforward. Hence, we can maximize task performance even near the singular region while simultaneously obtaining the singularity-free motion. The advantage of the proposed method is experimentally tested by using a 7-d. o. f. spatial manipulator and the result shows that the new method improves the performance several times over the existing algorithms.  相似文献   

20.
Using Hermite's formulation of polynomial stability conditions, static output feedback (SOF) controller design can be formulated as a polynomial matrix inequality (PMI), a (generally nonconvex) nonlinear semidefinite programming problem that can be solved (locally) with PENNON, an implementation of a penalty and augmented Lagrangian method. Typically, Hermite SOF PMI problems are badly scaled and experiments reveal that this has a negative impact on the overall performance of the solver. In this note we recall the algebraic interpretation of Hermite's quadratic form as a particular Bézoutian and we use results on polynomial interpolation to express the Hermite PMI in a Lagrange polynomial basis, as an alternative to the conventional power basis. Numerical experiments on benchmark problem instances show the improvement brought by the approach, in terms of problem scaling, number of iterations and convergence behaviour of PENNON.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号