Found 20 similar documents (search time: 0 ms)
1.
This paper describes an efficient implementation and evaluation of a parallel eigensolver for computing all eigenvalues of dense symmetric matrices. Our eigensolver uses a Householder tridiagonalization method, which has higher parallelism and performance than conventional methods when the problem size is relatively small, e.g., of order 10,000. This matters for practical applications that require many diagonalizations of such matrices. The routine was evaluated on a 1024-processor HITACHI SR2201, giving speedups of about 2–5 times over the ScaLAPACK library on the same 1024 processors.
2.
The computational complexity of a parallel algorithm depends critically on the model of computation. We describe a simple and elegant rule-based model of computation in which processors apply rules asynchronously to pairs of objects from a global object space. Application of a rule to a pair of objects results in the creation of a new object if the objects satisfy the guard of the rule. The model can be efficiently implemented as a novel MIMD array processor architecture, the Intersecting Broadcast Machine. For this model of computation, we describe an efficient parallel sorting algorithm based on mergesort. The computational complexity of the sorting algorithm is $O(n \log^2 n)$, comparable to that for specialized sorting networks and an improvement on the $O(n^{1.5})$ complexity of conventional mesh-connected array processors.
3.
A symbolic-numerical algorithm for the computation of the matrix elements in the parametric eigenvalue problem to a prescribed accuracy is presented. A procedure for calculating the oblate angular spheroidal functions that depend on a parameter is discussed. This procedure also yields the corresponding eigenvalues and the matrix elements (integrals of the eigenfunctions multiplied by their derivatives with respect to the parameter). The efficiency of the algorithm is confirmed by the computation of the eigenvalues, eigenfunctions, and the matrix elements and by the comparison with the known data and the asymptotic expansions for small and large values of the parameter. The algorithm is implemented as a package of programs in Maple-Fortran and is used for the reduction of a singular two-dimensional boundary value problem for the elliptic second-order partial differential equation to a regular boundary value problem for a system of second-order ordinary differential equations using the Kantorovich method.
5.
A parallel algorithm is proposed for the two-dimensional discrete Fourier transform (2-D DFT) computation which eliminates interprocessor communications and uses only $O(N)$ processors. The mapping of the algorithm onto architectures with broadcast and report capabilities is discussed. Expressions are obtained for estimating the speed performance on these machines as a function of the size $N \times N$ of the 2-D DFT, the bandwidth of the communications channel, the time for an addition, the time $T(F_N)$ for a single processing element to perform an $N$-point DFT, and the degree of parallelism. For single I/O channel machines that are capable of exploiting the full degree of parallelism of the algorithm, attainable execution times are as low as the time $T(F_N)$ plus the I/O time for data upload and download. An implementation on a binary tree computer is discussed.
6.
In this paper we present a new stable algorithm for the parallel QR-decomposition of “tall and skinny” matrices. The algorithm has been developed for the dense symmetric eigensolver ELPA, where the QR-decomposition of tall and skinny matrices represents an important substep. Our new approach is based on the fast but unstable CholeskyQR algorithm (Stathopoulos and Wu, 2002) [1]. We show the stability of our new algorithm and provide promising results of our MPI-based implementation on a BlueGene/P and a Power6 system.
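For orientation, the CholeskyQR baseline named in this abstract can be sketched as follows. This is a minimal, sequential, pure-Python illustration of plain CholeskyQR only, not the stable variant the paper proposes; the function name `cholesky_qr` and the list-of-rows matrix layout are assumptions made for this example.

```python
import math

def cholesky_qr(A):
    """Return (Q, R) with A = Q R, where R is the Cholesky factor of A^T A.

    Illustrative sketch of plain CholeskyQR (hypothetical helper, not ELPA code).
    A is a list of m rows, each a list of n floats, with m >= n.
    """
    m, n = len(A), len(A[0])
    # Gram matrix G = A^T A (n x n). In a parallel setting each process would
    # form the contribution of its own rows, followed by a global sum.
    G = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    # Cholesky factorization G = R^T R with R upper triangular.
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = G[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            if i == j:
                if s <= 0.0:
                    raise ValueError("Gram matrix is not numerically positive definite")
                R[i][i] = math.sqrt(s)
            else:
                R[i][j] = s / R[i][i]
    # Q = A R^{-1}: solve q R = a for each row a of A by substitution.
    Q = []
    for a in A:
        q = [0.0] * n
        for j in range(n):
            q[j] = (a[j] - sum(q[k] * R[k][j] for k in range(j))) / R[j][j]
        Q.append(q)
    return Q, R
```

Forming the Gram matrix squares the condition number of A, which is the usual explanation for the instability the abstract mentions: for ill-conditioned inputs the computed Q can lose orthogonality or the Cholesky step can fail outright.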
7.
In this paper, a fully parallel method for finding some or all finite eigenvalues of a real symmetric matrix pencil $(A, B)$ is presented, where $A$ is a symmetric tridiagonal matrix and $B$ is a diagonal matrix with $b_1 > 0$ and $b_i \ge 0$, $i = 2, 3, \ldots, n$. The method is based on homotopy continuation with a rank-2 perturbation. It is shown that there are exactly $m$ disjoint, smooth homotopy paths connecting the trivial eigenvalues to the desired eigenvalues, where $m$ is the number of finite eigenvalues of $(A, B)$. It is also shown that the homotopy curves are monotonic and easy to follow.
8.
We propose a model of parallel computation, the YPRAM, that allows general parallel algorithms to be designed for a wide class of parallel models. The basic model captures locality among processors, which is measured as a function of two parameters: latency and bandwidth. We design YPRAM algorithms for solving several fundamental problems: parallel prefix, sorting, sorting numbers from a bounded range, and list ranking. We show that our model predicts, reasonably accurately, the actual known performances of several basic parallel models (PRAM, hypercube, mesh and tree) when solving these problems.
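As a reminder of what the parallel prefix problem in this abstract asks for, here is a minimal sketch of a textbook PRAM-style inclusive scan, simulated sequentially. It is not the YPRAM algorithm, which additionally accounts for latency and bandwidth; the name `parallel_prefix` and its signature are assumptions made for this illustration.

```python
def parallel_prefix(values, op=lambda a, b: a + b):
    """Inclusive scan in about log2(n) synchronous rounds (simulated sequentially).

    `op` must be associative; it does not need to be commutative.
    """
    x = list(values)
    n = len(x)
    step = 1
    while step < n:
        # In one round, every position i >= step combines its value with the
        # value `step` places to its left. Building a fresh list models all
        # processors reading their inputs before any of them writes.
        x = [x[i] if i < step else op(x[i - step], x[i]) for i in range(n)]
        step *= 2
    return x
```

With one processor per element this takes a logarithmic number of rounds but performs more total operations than a sequential scan, which is the kind of trade-off a cost model with explicit latency and bandwidth parameters is meant to expose.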
9.
The Lanczos algorithm is a very effective method for finding extreme eigenvalues of symmetric matrices. In this paper, we present our parallel version of the Lanczos method for the symmetric generalized eigenvalue problem, PLANSO. PLANSO is based on a sequential package called LANSO, which implements the Lanczos algorithm with partial reorthogonalization. It is portable to all parallel machines that support MPI and is easy to interface with most parallel computing packages. Through numerical experiments, we demonstrate that it achieves parallel efficiency similar to that of PARPACK but uses considerably less time.
Received: 21 January 1998 / Accepted: 10 June 1999
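The basic Lanczos recurrence behind this abstract can be sketched as below. This is a sequential illustration for the standard symmetric problem with full reorthogonalization, chosen for simplicity; it is not PLANSO or LANSO, which target the generalized problem and use partial reorthogonalization. The function name `lanczos`, the `matvec` callback, and the breakdown tolerance are assumptions of this sketch.

```python
import math

def lanczos(matvec, v0, k):
    """Run up to k Lanczos steps for a symmetric operator.

    Returns (alphas, betas): the diagonal and off-diagonal entries of the
    symmetric tridiagonal matrix T whose eigenvalues approximate extreme
    eigenvalues of the operator.
    """
    norm = math.sqrt(sum(x * x for x in v0))
    basis = [[x / norm for x in v0]]
    alphas, betas = [], []
    for j in range(k):
        w = matvec(basis[j])
        alphas.append(sum(a * b for a, b in zip(w, basis[j])))
        # Full reorthogonalization against every stored vector, done twice for
        # robustness. This is the simple but costly choice made in this sketch.
        for _ in range(2):
            for u in basis:
                c = sum(a * b for a, b in zip(w, u))
                w = [a - c * b for a, b in zip(w, u)]
        beta = math.sqrt(sum(x * x for x in w))
        if j == k - 1 or beta < 1e-12:  # last step, or an invariant subspace was found
            break
        betas.append(beta)
        basis.append([x / beta for x in w])
    return alphas, betas
```

The only contact with the matrix is through `matvec`, which is why the method parallelizes naturally: distributing the matrix-vector product and the inner products is enough to distribute the whole iteration.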
11.
A major reason for the lack of practical use of parallel computers has been the absence of a suitable model of parallel computation. Many existing models are either theoretical or are tied to a particular architecture. A more general model must be architecture independent, must realistically reflect execution costs, and must reduce the cognitive overhead of managing massive parallelism. A growing number of models meeting some of these goals have been suggested. We discuss their properties and relative strengths and weaknesses. We conclude that data parallelism is a style with much to commend it, and discuss the Bird-Meertens formalism as a coherent approach to data parallel programming. This work was supported by the Natural Sciences and Engineering Research Council of Canada.
12.
Split-and-merge segmentation is a popular region-based segmentation scheme because of its robustness and computational efficiency. However, it is hard to realize in real time for larger images or video frames because of its iterative, sequential data flow. A quad-tree data structure is commonly used for software implementations of the algorithm, but local parallelism is difficult to establish because of the inherent data dependency between processes. In this paper, we propose a parallel split-and-merge algorithm that depends only on local operations. The algorithm is mapped onto a hierarchical cell network, which is a parallel version of the Locally Excitatory Globally Inhibitory Oscillatory Network (LEGION). Simulation results show that the proposed design is faster than any of the standard split-and-merge implementations, without compromising segmentation quality. The improvement in timing performance is demonstrated in its finite-state-machine-based VLSI implementation on VIRTEX series FPGA platforms. We also show that, although the split-and-merge algorithm is slightly behind state-of-the-art algorithms in segmentation quality, it outperforms those more sophisticated and complex algorithms in computational speed. Good segmentation performance at minimal computational cost enables the proposed design to tackle real-time segmentation of live video streams. We demonstrate live PAL video segmentation using a VIRTEX 5 series FPGA. Moreover, we extend our design to HD resolution, for which the time taken is less than 5 ms, giving a processing throughput of 200 frames per second.
13.
We introduce a new method to reduce a real matrix to a real Schur form by a sequence of similarity transformations that are 3D orthogonal transformations. Two significant features of this method are that: all the transformed matrices and all the computations are done in the real field; and it can be easily parallelized. We call the algorithm that uses this method the real two-zero (RTZ) algorithm. We describe both serial and parallel implementations of the RTZ algorithm. Our tests indicate that the rate of convergence to a real Schur form is quadratic for real near-normal matrices with real distinct eigenvalues. Suppose n is the order of a real matrix A. In order to choose a sequence of 3D orthogonal transformations on A, we need to determine some ordering on triples in T={(k,l,m)|1⩽k
15.
A modified Numerov-like eigenvalue algorithm, previously introduced, is parallelized. An implementation of this algorithm on a Helios-based parallel processing transputer system is discussed. Time savings relative to a sequential approach are commented on.
16.
A parallel algorithm for generating all combinations of m items out of n given items in lexicographic order is presented. The computational model is a linear systolic array consisting of m identical processing elements. It takes $\binom{n}{m}$ time-steps to generate all the $\binom{n}{m}$ combinations. Since all processing elements are identical and execute the same procedure, the algorithm is suitable for VLSI implementation. The algorithm is proved correct by mathematical induction.
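To make the output order concrete, the following is a minimal sequential sketch that produces the m-item combinations of the items 1..n in lexicographic order, one per loop iteration. It is not the systolic-array algorithm of the abstract; the generator name `combinations_lex` and the choice of 1-based items are assumptions made for this illustration.

```python
def combinations_lex(n, m):
    """Yield all m-element subsets of {1, ..., n} as tuples, in lexicographic order."""
    if m < 0 or m > n:
        return
    c = list(range(1, m + 1))  # first combination: (1, 2, ..., m)
    while True:
        yield tuple(c)
        # Find the rightmost position that is still below its maximum value,
        # which is n - m + i + 1 for 0-based position i.
        i = m - 1
        while i >= 0 and c[i] == n - m + i + 1:
            i -= 1
        if i < 0:
            return  # (n-m+1, ..., n) was the last combination
        c[i] += 1
        # Reset everything to the right to the smallest increasing run.
        for j in range(i + 1, m):
            c[j] = c[j - 1] + 1
```

Each of the m positions is updated by the same local rule (increment, or reset relative to the left neighbour), which gives some intuition for why one identical processing element per position is a natural fit.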
17.
Spatial regularity amidst a seemingly chaotic image is often meaningful. Many papers in computational geometry are concerned with detecting some type of regularity via exact solutions to problems in geometric pattern recognition. However, real-world applications often have data that is approximate, and may rely on calculations that are approximate. Thus, it is useful to develop solutions that have an error tolerance. A solution has recently been presented by Robins et al. [Inform. Process. Lett. 69 (1999) 189–195] to the problem of finding all maximal subsets of an input set in the Euclidean plane that are approximately equally-spaced and approximately collinear. This is a problem that arises in computer vision, military applications, and other areas. The algorithm of Robins et al. is different in several important respects from the optimal algorithm given by Kahng and Robins [Pattern Recognition Lett. 12 (1991) 757–764] for the exact version of the problem. The algorithm of Robins et al. seems inherently sequential and runs in $O(n^{5/2})$ time, where $n$ is the size of the input set. In this paper, we give parallel solutions to this problem.
18.
A parallel algorithm for tiling with polyominoes is presented. The tiling problem is to pack polyominoes in a finite checkerboard. The algorithm using $l \times m \times n$ processing elements requires $O(1)$ time, where $l$ is the number of different kinds of polyominoes on an $m \times n$ checkerboard. The algorithm can be used for placement of components or cells in a very large-scale integrated circuit (VLSI) chip, designing and compacting printed circuit boards, and solving a variety of two- or three-dimensional packing problems.
19.
Automatic process partitioning is the operation of automatically rewriting an algorithm as a collection of tasks, each operating primarily on its own portion of the data, to carry out the computation in parallel. Hybrid shared memory systems provide a hierarchy of globally accessible memories. To achieve high performance on such machines one must carefully distribute the work and the data so as to keep the workload balanced while optimizing the access to nonlocal data. In this paper we consider a semi-automatic approach to process partitioning in which the compiler, guided by advice from the user, automatically transforms programs into such an interacting set of tasks. This approach is illustrated with a picture processing example written in BLAZE, which is transformed by the compiler into a task system maximizing locality of memory reference. Research supported by an IBM Graduate Fellowship. Research supported under NASA Contract No. 520-1398-0356. Research supported by NASA Contract No. NAS1-18107 while the last two authors were in residence at ICASE, NASA, Langley Research Center.
20.
A new real structure-preserving Jacobi algorithm is proposed for solving the eigenvalue problem of a quaternion Hermitian matrix. By employing generalized JRS-symplectic Jacobi rotations, the new quaternion Jacobi algorithm preserves the symmetry and JRS-symmetry of the real counterpart of the quaternion Hermitian matrix. Moreover, the proposed algorithm involves only real operations, without dimension expansion, and is generally superior to the state-of-the-art algorithm. Numerical experiments are reported to indicate its efficiency and accuracy.