Similar Documents
20 similar documents retrieved (search time: 21 ms)
1.
E. Kaltofen, A. Lobo. Algorithmica, 1999, 24(3-4): 331-348
We describe a coarse-grain parallel approach for the homogeneous solution of linear systems. Our solutions are symbolic, i.e., exact rather than numerical approximations. We have performed an outer loop parallelization that works well in conjunction with a black box abstraction for the coefficient matrix. Our implementation can be run on a network cluster of UNIX workstations as well as on an SP-2 multiprocessor. Task distribution and management are effected through MPI and other packages. Fault tolerance, checkpointing, and recovery are incorporated. Detailed timings are presented for experiments with systems that arise in RSA challenge integer factoring efforts. For example, we can solve a 252,222 × 252,222 system with about 11.04 million nonzero entries over the Galois field with two elements using four processors of an SP-2 multiprocessor, in about 26.5 hours CPU time. Received June 1, 1997; revised March 10, 1998.
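To make the black box abstraction concrete, the sketch below shows one way a sparse coefficient matrix over the Galois field with two elements can be exposed to a solver purely through a matrix-times-vector routine. It is an illustration under our own assumptions (the struct name, storage scheme, and bit packing are ours), not the authors' implementation.

```cpp
// Minimal sketch of a "black box" sparse matrix over GF(2): the solver only
// ever calls apply(), never inspects the matrix entries directly.
// Names (SparseGF2Matrix, apply) are illustrative, not from the paper.
#include <cstddef>
#include <cstdint>
#include <vector>

struct SparseGF2Matrix {
    std::size_t n;                                  // matrix dimension (n x n)
    std::vector<std::vector<std::size_t>> rows;     // column indices of the 1-entries in each row

    // y = A * x over GF(2); x and y are bit vectors packed 64 bits per word.
    void apply(const std::vector<std::uint64_t>& x, std::vector<std::uint64_t>& y) const {
        y.assign((n + 63) / 64, 0);
        for (std::size_t i = 0; i < n; ++i) {
            std::uint64_t bit = 0;
            for (std::size_t j : rows[i])           // XOR of the selected bits of x
                bit ^= (x[j / 64] >> (j % 64)) & 1u;
            y[i / 64] |= bit << (i % 64);
        }
    }
};
```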

2.
We present a cross-language C++/Python program for simulations of quantum mechanical systems with the use of Quantum Monte Carlo (QMC) methods. We describe a system to which QMC can be applied, the variational Monte Carlo and diffusion Monte Carlo algorithms, and how to implement these methods in pure C++ and in C++/Python. Furthermore, we check the efficiency of the implementations in serial and parallel cases to show that the overhead of using Python can be negligible.

Program summary

Program title: MontePython
Catalogue identifier: ADZP_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADZP_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 49 519
No. of bytes in distributed program, including test data, etc.: 114 484
Distribution format: tar.gz
Programming language: C++, Python
Computer: PC, IBM RS6000/320, HP, ALPHA
Operating system: LINUX
Has the code been vectorised or parallelized?: Yes, parallelized with MPI
Number of processors used: 1-96
RAM: Depends on physical system to be simulated
Classification: 7.6; 16.1
Nature of problem: Investigating ab initio quantum mechanical systems, specifically Bose-Einstein condensation in dilute gases of 87Rb
Solution method: Quantum Monte Carlo
Running time: 225 min with 20 particles (4800 walkers moved in 1750 time steps) on one AMD Opteron(TM) Processor 2218; a production run for, e.g., 200 particles takes around 24 hours on 32 such processors.
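For readers unfamiliar with the method, the following minimal variational Monte Carlo sketch shows the Metropolis sampling loop at the heart of such a code for a single particle in a one-dimensional harmonic oscillator. The trial function, parameter values, and all names are illustrative assumptions; this is not the MontePython program.

```cpp
// Minimal variational Monte Carlo sketch (1D harmonic oscillator, one particle).
// Trial wave function psi(x) = exp(-alpha*x^2); local energy derived analytically.
// Illustrative only -- not the MontePython program.
#include <cmath>
#include <iostream>
#include <random>

int main() {
    const double alpha = 0.4;        // variational parameter (assumption)
    const double step  = 1.0;        // Metropolis step length
    const int    nMC   = 200000;     // number of Monte Carlo samples
    std::mt19937 gen(12345);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    auto prob = [&](double x) { return std::exp(-2.0 * alpha * x * x); };         // |psi|^2
    auto eLoc = [&](double x) { return alpha + (0.5 - 2.0 * alpha * alpha) * x * x; };

    double x = 0.0, eSum = 0.0;
    for (int i = 0; i < nMC; ++i) {
        double xNew = x + step * (2.0 * u(gen) - 1.0);    // propose a move
        if (u(gen) < prob(xNew) / prob(x)) x = xNew;      // Metropolis accept/reject
        eSum += eLoc(x);
    }
    std::cout << "<E> ~ " << eSum / nMC << "\n";          // exact ground state energy is 0.5 at alpha = 0.5
    return 0;
}
```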

3.
We describe portable software to simulate universal quantum computers on massively parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as an IBM BlueGene/L, an IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, an SGI Altix 3700, and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as a benchmark for testing high-end parallel computers.
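The central kernel of such a simulator is applying a gate to a state vector of 2^n complex amplitudes. A minimal serial sketch of a single-qubit Hadamard gate is given below; the distributed-memory version described in the paper additionally splits the amplitude array over MPI processes, which is omitted here.

```cpp
// Apply a Hadamard gate to qubit k of an n-qubit state vector of 2^n amplitudes.
// Serial sketch only; a distributed simulator splits the amplitude array over MPI ranks.
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

void hadamard(std::vector<std::complex<double>>& psi, int k) {
    const double s = 1.0 / std::sqrt(2.0);
    const std::size_t stride = std::size_t(1) << k;       // distance between paired amplitudes
    for (std::size_t i = 0; i < psi.size(); i += 2 * stride)
        for (std::size_t j = i; j < i + stride; ++j) {
            const std::complex<double> a = psi[j], b = psi[j + stride];
            psi[j]          = s * (a + b);
            psi[j + stride] = s * (a - b);
        }
}
```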

4.
5.
Massively parallel computers are emerging as a valuable tool for supercomputer applications. Their processing speed and memory size make them ideal for solving large applications. An implementation of a molecular dynamics simulation using a neighbour list type algorithm is presented. By efficient use and understanding of the architecture, an extremely efficient neighbour list algorithm (without the need to store the list) has been developed. The large number of processors has allowed us to model large samples (up to one million atoms), reducing the artefacts which may be caused by having a small sample size. This implementation has provided performance results that surpass those of standard machines, with improvements by factors of hundreds both in speed of calculation and in the sizes of the systems that can be modelled.
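As a rough illustration of how interaction partners can be found without storing a neighbour list, the sketch below bins particles into cells no smaller than the cutoff and visits only adjacent cells. It is a simplified serial, non-periodic 2D example with made-up parameters, not the implementation described above.

```cpp
// Sketch of neighbour finding through a cell grid instead of a stored neighbour list
// (2D, non-periodic, serial). Each particle is binned into a cell of side >= cutoff,
// so interaction partners can only lie in the same cell or the 8 adjacent cells.
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const double L = 10.0, rc = 1.0;            // box size and cutoff (assumptions)
    const int nc = int(L / rc);                 // cells per dimension
    const int N  = 1000;
    std::mt19937 gen(1);
    std::uniform_real_distribution<double> u(0.0, L);
    std::vector<double> x(N), y(N);
    for (int i = 0; i < N; ++i) { x[i] = u(gen); y[i] = u(gen); }

    // Bin particles into cells.
    std::vector<std::vector<int>> cell(nc * nc);
    auto cellOf = [&](double xi, double yi) {
        int cx = std::min(nc - 1, int(xi / rc)), cy = std::min(nc - 1, int(yi / rc));
        return cy * nc + cx;
    };
    for (int i = 0; i < N; ++i) cell[cellOf(x[i], y[i])].push_back(i);

    // Visit each particle's own cell and its neighbouring cells; no pair list is stored.
    long pairs = 0;
    for (int cy = 0; cy < nc; ++cy)
        for (int cx = 0; cx < nc; ++cx)
            for (int i : cell[cy * nc + cx])
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx) {
                        int ny = cy + dy, nx = cx + dx;
                        if (nx < 0 || ny < 0 || nx >= nc || ny >= nc) continue;
                        for (int j : cell[ny * nc + nx]) {
                            if (j <= i) continue;                       // count each pair once
                            double ddx = x[i] - x[j], ddy = y[i] - y[j];
                            if (ddx * ddx + ddy * ddy < rc * rc) ++pairs;
                        }
                    }
    std::cout << "interacting pairs within cutoff: " << pairs << "\n";
    return 0;
}
```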

6.
A new parallel implementation of Strassen's matrix multiplication algorithm is proposed for massively parallel supercomputers with 2D, all-port torus interconnection networks. The proposed algorithm employs a special conflict-free routing pattern for better scalability and is able to yield a performance rate very close to the theoretical bound for many practical network and matrix sizes. It effectively scales up to very large networks, typically containing hundreds of thousands of processors, where petaflop or exaflop processing rates are sought.
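For reference, the sketch below shows the serial Strassen recursion (seven half-size products per level) that the parallel algorithm distributes over the torus network; the distribution and routing pattern themselves are not shown, and the matrix size is assumed to be a power of two.

```cpp
// Serial Strassen sketch for n x n matrices with n a power of two (row-major
// std::vector<double>). The paper's contribution, distributing the seven recursive
// products over a 2D torus with conflict-free routing, is not shown here.
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;   // row-major n*n

static Mat add(const Mat& a, const Mat& b, double s = 1.0) {
    Mat c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] + s * b[i];
    return c;
}

Mat strassen(const Mat& A, const Mat& B, int n) {
    if (n <= 64) {                                  // base case: naive multiply
        Mat C(std::size_t(n) * n, 0.0);
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
        return C;
    }
    const int h = n / 2;
    auto sub = [&](const Mat& M, int r, int c) {    // copy an h x h quadrant
        Mat S(std::size_t(h) * h);
        for (int i = 0; i < h; ++i)
            for (int j = 0; j < h; ++j) S[i * h + j] = M[(r + i) * n + (c + j)];
        return S;
    };
    Mat A11 = sub(A,0,0), A12 = sub(A,0,h), A21 = sub(A,h,0), A22 = sub(A,h,h);
    Mat B11 = sub(B,0,0), B12 = sub(B,0,h), B21 = sub(B,h,0), B22 = sub(B,h,h);

    // The seven Strassen products.
    Mat M1 = strassen(add(A11, A22), add(B11, B22), h);
    Mat M2 = strassen(add(A21, A22), B11, h);
    Mat M3 = strassen(A11, add(B12, B22, -1.0), h);
    Mat M4 = strassen(A22, add(B21, B11, -1.0), h);
    Mat M5 = strassen(add(A11, A12), B22, h);
    Mat M6 = strassen(add(A21, A11, -1.0), add(B11, B12), h);
    Mat M7 = strassen(add(A12, A22, -1.0), add(B21, B22), h);

    Mat C(std::size_t(n) * n);
    auto put = [&](const Mat& S, int r, int c) {    // write an h x h quadrant back
        for (int i = 0; i < h; ++i)
            for (int j = 0; j < h; ++j) C[(r + i) * n + (c + j)] = S[i * h + j];
    };
    put(add(add(M1, M4), add(M7, M5, -1.0)), 0, 0); // C11 = M1 + M4 - M5 + M7
    put(add(M3, M5), 0, h);                         // C12 = M3 + M5
    put(add(M2, M4), h, 0);                         // C21 = M2 + M4
    put(add(add(M1, M3), add(M6, M2, -1.0)), h, h); // C22 = M1 - M2 + M3 + M6
    return C;
}
```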

7.
The binary adaptive resonance (ART1) neural network algorithm has been successfully implemented in the past for the classifying and grouping of similar vectors from a machine-part matrix. A modified ART1 paradigm which reorders the input vectors, along with a modified procedure for storing a group's representation vectors, has proven successful in both speed and functionality in comparison to former techniques. This paradigm has been adapted and implemented on a neuro-computer utilizing 256 processors, which allows the computer to take advantage of the inherent parallelism of the ART1 algorithm. The parallel implementation results in tremendous improvements in the speed of the machine-part matrix optimization. The machine-part matrix was initially limited to 65,536 elements (256×256), which is a consequence of the maximum number of processors within the parallel computer. The restructuring and modification of the parallel implementation has allowed the number of matrix elements to increase well beyond their previous limits. Comparisons of the modified structure with both the serial algorithm and the initial parallel implementation are made. The advantages of using a neural network approach in this case are discussed.
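A minimal serial sketch of the ART1 category search that the paper parallelizes is given below. The vigilance and choice parameters, the fast-learning update, and all names are illustrative assumptions rather than the modified paradigm described above.

```cpp
// Serial sketch of the ART1 category search for binary vectors (the step that the
// paper maps onto 256 parallel processors). Parameter values are illustrative.
#include <vector>

using Bits = std::vector<int>;   // binary vector of 0/1

static int norm1(const Bits& v) { int s = 0; for (int b : v) s += b; return s; }
static Bits andVec(const Bits& a, const Bits& b) {
    Bits r(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] & b[i];
    return r;
}

struct ART1 {
    double rho  = 0.7;    // vigilance parameter (assumption)
    double beta = 1.0;    // choice parameter (assumption)
    std::vector<Bits> w;  // one binary weight vector per category

    // Classify one input; create a new category if no existing one passes vigilance.
    int present(const Bits& in) {
        std::vector<bool> disabled(w.size(), false);
        for (;;) {
            int best = -1; double bestT = -1.0;
            for (std::size_t j = 0; j < w.size(); ++j) {        // choice function T_j
                if (disabled[j]) continue;
                double T = norm1(andVec(in, w[j])) / (beta + norm1(w[j]));
                if (T > bestT) { bestT = T; best = int(j); }
            }
            if (best < 0) {                                     // no category left: create one
                w.push_back(in);
                return int(w.size()) - 1;
            }
            // Vigilance test: the match must cover enough of the input.
            if (norm1(andVec(in, w[best])) >= rho * norm1(in)) {
                w[best] = andVec(in, w[best]);                  // fast learning: w <- w AND input
                return best;
            }
            disabled[best] = true;                              // reset and search the next category
        }
    }
};
```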

8.
The problem of finding an optimal product sequence for sequential multiplication of a chain of matrices (the matrix chain ordering problem, MCOP) is well-known. We consider the problem of finding an optimal product schedule for evaluating a chain of matrix products on a parallel computer (the matrix chain scheduling problem, MCSP). The difference between MCSP and MCOP is that MCOP pertains to a product sequence for single processor systems and MCSP pertains to a sequence of concurrent matrix products for parallel systems. The approach of parallelizing each matrix product after finding an optimal product sequence for single processor systems does not always guarantee minimum evaluation time on parallel systems since each parallelized matrix product may use processors inefficiently. We introduce a new processor scheduling algorithm for MCSP which reduces the evaluation time of a chain of matrix products on a parallel computer, even at the expense of a slight increase in the total number of operations. Given a chain of n matrices and a matrix product utilizing at most P/k processors in a P-processor system, the proposed algorithm approaches k(n-1)/(n+k log(k)-k) times the performance of parallel evaluation using the optimal sequence found for MCOP. Experiments performed on a Fujitsu AP1000 multicomputer also show that the proposed algorithm significantly decreases the time required to evaluate a chain of matrix products in parallel systems.
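For comparison, the single-processor MCOP that the paper contrasts with MCSP is solved by the classic O(n^3) dynamic program sketched below (a textbook version, not the paper's scheduling algorithm).

```cpp
// Classic O(n^3) dynamic program for the matrix chain ordering problem (MCOP).
// dims has n+1 entries, matrix i being dims[i] x dims[i+1]. Returns the minimum
// number of scalar multiplications on a single processor.
#include <algorithm>
#include <limits>
#include <vector>

long long matrixChainOrder(const std::vector<long long>& dims) {
    const int n = int(dims.size()) - 1;                         // number of matrices
    std::vector<std::vector<long long>> m(n, std::vector<long long>(n, 0));
    for (int len = 2; len <= n; ++len)                          // chain length
        for (int i = 0; i + len - 1 < n; ++i) {
            const int j = i + len - 1;
            m[i][j] = std::numeric_limits<long long>::max();
            for (int k = i; k < j; ++k)                         // split point
                m[i][j] = std::min(m[i][j],
                    m[i][k] + m[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]);
        }
    return m[0][n - 1];
}
```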

9.
Improving scheduling of tasks in a heterogeneous environment
Optimal scheduling of parallel tasks with some precedence relationship onto a parallel machine is known to be NP-complete. The complexity of the problem increases when task scheduling is to be done in a heterogeneous environment, where the processors in the network may not be identical and take different amounts of time to execute the same task. We introduce a task duplication-based scheduling algorithm for a network of heterogeneous systems (TANH), with complexity O(V²), which provides optimal results for applications represented by directed acyclic graphs (DAGs), provided a simple set of conditions on task computation and network communication time can be satisfied. The performance of the algorithm is illustrated by comparing the scheduling time with an existing "best imaginary level scheduling (BIL)" scheme for heterogeneous systems. The scalability to a higher or lower number of processors, as per their availability, is also discussed. The algorithm is shown to provide substantial improvement over existing work on the task duplication-based scheduling (TDS) algorithm.
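As a baseline illustration of list scheduling on heterogeneous processors, the sketch below assigns tasks, taken in a topological order of the DAG, to the processor giving the earliest finish time from a per-processor execution-time table. Communication costs and task duplication (the key ingredients of TANH) are deliberately omitted, so this is not the paper's algorithm.

```cpp
// Simplified heterogeneous list scheduler: each task (in topological order) is placed on
// the processor that gives the earliest finish time. Communication costs and duplication
// are omitted; this is only a baseline sketch.
#include <algorithm>
#include <vector>

struct Dag {
    std::vector<std::vector<int>> preds;           // predecessor lists
    std::vector<std::vector<double>> cost;         // cost[task][proc] = execution time
};

// Returns the overall makespan of the greedy schedule.
double listSchedule(const Dag& g, const std::vector<int>& topoOrder, int nProc) {
    const int n = int(g.cost.size());
    std::vector<double> finish(n, 0.0);            // finish time of each task
    std::vector<double> procFree(nProc, 0.0);      // time each processor becomes free
    double makespan = 0.0;
    for (int t : topoOrder) {
        double ready = 0.0;                        // all predecessors must have finished
        for (int p : g.preds[t]) ready = std::max(ready, finish[p]);
        int bestProc = 0; double bestFinish = 1e300;
        for (int q = 0; q < nProc; ++q) {          // earliest-finish-time rule
            double f = std::max(ready, procFree[q]) + g.cost[t][q];
            if (f < bestFinish) { bestFinish = f; bestProc = q; }
        }
        finish[t] = bestFinish;
        procFree[bestProc] = bestFinish;
        makespan = std::max(makespan, bestFinish);
    }
    return makespan;
}
```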

10.
Extraction of frequent patterns from a transaction-oriented database is crucial to several data mining tasks such as association rule generation, time series analysis, and classification. Most of these mining tasks require multiple passes over the database and, if the database size is large, which is usually the case, scalable high-performance solutions involving multiple processors are required. This paper presents an efficient scalable parallel algorithm for mining frequent patterns on parallel shared-nothing platforms. The proposed algorithm is based on one of the best known sequential techniques, the Frequent Pattern (FP) Growth algorithm. Unlike most of the earlier parallel approaches based on different variants of the Apriori algorithm, the algorithm presented in this paper does not result in the entire counting data structure being duplicated on each processor. Furthermore, the proposed algorithm introduces minimal communication (and hence synchronization) overheads by efficiently partitioning the list of frequent elements over the processors. The experimental results show scalable performance over different machine and problem sizes. The comparison of implementation results with existing parallel approaches shows significant gains in speedup. On an 8-processor machine, we report an average speedup of 6 for different problem sizes.
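The first database scan of FP-Growth, counting item supports so that the frequent-element list can later be partitioned over processors, might look like the serial sketch below (illustrative only; the parallel tree construction and mining are not shown).

```cpp
// First pass of FP-Growth: count item frequencies over all transactions and keep the
// items that meet the minimum support, ordered by descending support.
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, int>>
frequentItems(const std::vector<std::vector<std::string>>& transactions, int minSupport) {
    std::unordered_map<std::string, int> count;
    for (const auto& t : transactions)
        for (const auto& item : t) ++count[item];          // one scan of the database
    std::vector<std::pair<std::string, int>> frequent;
    for (const auto& kv : count)
        if (kv.second >= minSupport) frequent.push_back(kv);
    // FP-Growth orders frequent items by descending support before building the FP-tree;
    // the parallel algorithm then partitions this list over the processors.
    std::sort(frequent.begin(), frequent.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return frequent;
}
```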

11.
A coarse-grain parallel solver for systems of linear algebraic equations with general sparse matrices by Gaussian elimination is discussed. Before the factorization two other steps are performed. A reordering algorithm is used during the first step in order to obtain a permuted matrix with as many zero elements under the main diagonal as possible. During the second step the reordered matrix is partitioned into blocks for asynchronous parallel processing (normally the number of blocks is equal to the number of processors). It is possible to obtain blocks with nearly the same number of rows, because there is no requirement to produce square diagonal blocks. The first step is much more important than the second one and has a significant influence on the performance of the solver. A straightforward implementation of the reordering algorithm will result in O(n²) operations. By using binary trees this cost can be reduced to O(NZ log n), where NZ is the number of non-zero elements in the matrix and n is its order (normally NZ is much smaller than n²). Some experiments on parallel computers with shared memory have been performed. The results show that a solver based on the proposed reordering performs better than another solver based on a cheaper (but at the same time rather crude) reordering whose cost is only O(NZ) operations.
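The role of the binary tree can be illustrated with a much simplified stand-in: a binary heap that returns the remaining row with the fewest nonzeros in O(log n) per selection. The paper's reordering additionally updates the counts as the matrix changes, which is what yields the O(NZ log n) total; the sketch below is only the static skeleton, not the algorithm from the paper.

```cpp
// Simplified stand-in for a tree-driven reordering pass: repeatedly pick the remaining
// row with the fewest nonzeros using a binary heap (std::priority_queue).
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// rows[i] holds the column indices of the nonzeros in row i.
std::vector<int> orderBySparsity(const std::vector<std::vector<int>>& rows) {
    using Entry = std::pair<int, int>;                      // (nonzero count, row index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;  // min-heap
    for (int i = 0; i < int(rows.size()); ++i)
        heap.push({int(rows[i].size()), i});
    std::vector<int> perm;                                  // new row order
    while (!heap.empty()) {
        perm.push_back(heap.top().second);                  // O(log n) per selection
        heap.pop();
    }
    return perm;
}
```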

12.
The problem of maximal clique enumeration (MCE) is to enumerate all of the maximal cliques in a graph. Once enumerated, maximal cliques are widely used to solve problems in areas such as 3-D protein structure alignment, genome mapping, gene expression analysis, and detection of social hierarchies. Even the most efficient serial MCE algorithms require large amounts of time to enumerate the maximal cliques in networks arising from these problems that contain hundreds, thousands, or larger numbers of vertices. The previous attempts to provide practical solutions to the MCE problem through parallel implementation have had limited success, largely due to a number of challenges inherent to the nature of the MCE combinatorial search space. On the one hand, MCE algorithms often create a backtracking search tree that has a highly irregular and hard or impossible to predict structure; therefore, almost any static decomposition of the search tree by parallel processors results in highly unbalanced processor execution times. On the other hand, the data-intensive nature of the MCE problem often makes naive dynamic load distribution strategies that require extensive data movement prohibitively expensive. As a result, good scaling of the overall execution time of parallel MCE algorithms has been reported for only up to a couple of hundred processors. In this paper, we propose a parallel, scalable, and memory-efficient MCE algorithm for distributed and/or shared memory high performance computing architectures, whose runtime scales linearly for thousands of processors on real-world application graphs with hundreds and thousands of nodes. Its scalability and efficiency are attributed to the proposed: (a) representation of the search tree decomposition to enable parallelization; (b) parallel depth-first backtracking search to both constrain the search space and minimize memory requirement; (c) least stringent synchronization to minimize data movement; and (d) on-demand work stealing intelligently coupled with work stack splitting to minimize computing elements' idle time. To the best of our knowledge, the proposed parallel MCE algorithm is the first to achieve a linear scaling runtime using up to 2048 processors on Cray XT machines for a number of real-world biological networks.
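The serial backtracking search underlying MCE is the Bron-Kerbosch algorithm with pivoting, sketched below; the paper's contribution lies in decomposing and balancing this search tree across processors, which the sketch does not attempt.

```cpp
// Serial Bron-Kerbosch maximal clique enumeration with pivoting -- the kind of
// backtracking search tree whose decomposition the parallel algorithm has to balance.
#include <algorithm>
#include <set>
#include <vector>

using VSet = std::set<int>;

static VSet intersect(const VSet& a, const std::vector<int>& nbrs) {
    VSet r;
    for (int v : nbrs) if (a.count(v)) r.insert(v);
    return r;
}

// adj[v] lists the neighbours of vertex v; every maximal clique is appended to out.
void bronKerbosch(VSet R, VSet P, VSet X,
                  const std::vector<std::vector<int>>& adj,
                  std::vector<VSet>& out) {
    if (P.empty() && X.empty()) { out.push_back(R); return; }    // R is a maximal clique
    int pivot = !P.empty() ? *P.begin() : *X.begin();            // simple pivot choice
    VSet candidates;
    for (int v : P)                                              // skip neighbours of the pivot
        if (std::find(adj[pivot].begin(), adj[pivot].end(), v) == adj[pivot].end())
            candidates.insert(v);
    for (int v : candidates) {
        VSet R2 = R; R2.insert(v);
        bronKerbosch(R2, intersect(P, adj[v]), intersect(X, adj[v]), adj, out);
        P.erase(v);
        X.insert(v);
    }
}
```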

13.
The general problem considered in the paper is partitioning a matrix operation between the processors of a parallel system in an optimally load-balanced way without potential memory contention. The parallel system considered is defined by several features, the main one being the availability of a virtual shared memory divided into segments. If partitioning of a matrix operation causes parallel access to the same memory segment, with at least one processor writing data to the segment, then contention between processors arises, which implies performance degradation. To eliminate this situation, a restriction is imposed on the class of possible partitionings so that no two processors write data to the same segment. On the resulting class of contention-free partitionings, a load-balanced optimum partitioning is defined as one satisfying independent minimax criteria. The main result of the paper is an algorithm for finding the optimum partitioning by means of an analytical solution of the respective minimax problems. The paper also discusses implementation and performance issues related to the algorithm, on the basis of experience at Kendall Square Research Corporation, where the partitioning algorithm was used for creating high-performance parallel matrix libraries.
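A simple instance of a contention-free partitioning is to give each processor a contiguous block of output rows, so that no two processors ever write to the same memory segment. The sketch below only computes such near-equal blocks (the commented usage line with A.row and dot is hypothetical); the paper goes further and derives optimal partitionings from minimax criteria.

```cpp
// Contention-free partitioning sketch for y = A*x: each processor owns a contiguous,
// nearly equal block of rows of y, so writes from different processors never collide.
#include <utility>
#include <vector>

// Returns for each processor p the half-open row range [begin, end) it owns.
std::vector<std::pair<int, int>> rowBlocks(int nRows, int nProcs) {
    std::vector<std::pair<int, int>> blocks(nProcs);
    int base = nRows / nProcs, extra = nRows % nProcs, start = 0;
    for (int p = 0; p < nProcs; ++p) {
        int len = base + (p < extra ? 1 : 0);        // spread the remainder over the first ranks
        blocks[p] = {start, start + len};
        start += len;
    }
    return blocks;
}
// Each processor then computes y[i] only for i in its own block, e.g. (hypothetical helpers):
//   for (int i = blocks[p].first; i < blocks[p].second; ++i) y[i] = dot(A.row(i), x);
```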

14.
Particle-in-cell simulations often suffer from load-imbalance on parallel machines due to the competing requirements of the field-solve and particle-push computations. We propose a new algorithm that balances the two computations independently. The grid for the field-solve computation is statically partitioned. The particles within a processor's sub-domain(s) are dynamically balanced by migrating spatially-compact groups of particles from heavily loaded processors to lightly loaded ones as needed. The algorithm has been implemented in the quicksilver electromagnetic particle-in-cell code. We provide details of the implementation and present performance results for quicksilver running models with up to a billion grid cells and particles on thousands of processors of a large distributed-memory parallel machine.
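The planning half of such a dynamic balance step can be sketched as a greedy transfer of particles from the most loaded to the least loaded ranks. The routine below only produces a migration plan from per-rank particle counts; the actual movement of spatially compact particle groups, as described above, is the hard part and is not shown. It is an illustration, not the quicksilver implementation.

```cpp
// Greedy rebalancing sketch: given the particle count on each rank, plan migrations from
// the most loaded to the least loaded ranks until every rank is close to the average.
#include <algorithm>
#include <numeric>
#include <vector>

struct Move { int from, to; long count; };

std::vector<Move> balancePlan(std::vector<long> load, long tolerance) {
    const long total  = std::accumulate(load.begin(), load.end(), 0L);
    const long target = total / long(load.size());
    std::vector<Move> plan;
    for (;;) {
        auto hi = std::max_element(load.begin(), load.end());
        auto lo = std::min_element(load.begin(), load.end());
        if (*hi - target <= tolerance) break;                 // heaviest rank is close enough
        long amount = std::min(*hi - target, target - *lo);   // don't overshoot the receiver
        if (amount <= 0) break;
        plan.push_back({int(hi - load.begin()), int(lo - load.begin()), amount});
        *hi -= amount; *lo += amount;
    }
    return plan;
}
```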

15.
We present an approach to parallel iterative four-atom quantum mechanics calculations in a computing environment of distributed memory nodes, each node consisting of a group of processors with a shared memory. We parallelize the action of the Hamiltonian matrix on a vector, which is the main computational bottleneck in both iterative calculations of eigenvalues and eigenvectors and the iterative determination of quantum dynamics information via, e.g., wavepacket methods. OpenMP is used to facilitate the parallel work within each node, and MPI is used to communicate information between nodes. For a realistic problem the approach is shown to scale very well up to 512 processors at the NERSC computing facility, working at up to 20% of the theoretical peak performance rate. The highest total floating point rate we achieve is 0.16 Tflops, using 768 processors. Our approach should also be applicable to quantum dynamics problems with more than four atoms.
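Within one shared-memory node, the Hamiltonian-times-vector kernel has the shape of an OpenMP-parallel sparse matrix-vector product; a CSR-format sketch is shown below. The MPI layer between nodes and the physics-specific matrix structure are omitted, and all names are illustrative.

```cpp
// OpenMP-parallel sparse matrix-vector product in CSR format -- the shape of the
// "Hamiltonian times vector" kernel parallelized within each shared-memory node.
#include <vector>

struct Csr {
    std::vector<int>    rowPtr;   // size nRows + 1
    std::vector<int>    col;      // column index of each stored entry
    std::vector<double> val;      // value of each stored entry
};

// y must already have nRows elements; each row of y is written by exactly one thread.
void hamiltonianTimesVector(const Csr& H, const std::vector<double>& x, std::vector<double>& y) {
    const int nRows = int(H.rowPtr.size()) - 1;
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nRows; ++i) {
        double sum = 0.0;
        for (int k = H.rowPtr[i]; k < H.rowPtr[i + 1]; ++k)
            sum += H.val[k] * x[H.col[k]];
        y[i] = sum;
    }
}
```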

16.
Many computationally intensive problems from science and engineering are irregular in nature. This makes it difficult to develop an efficient parallel implementation, even for shared-memory machines. As a typical example, we investigate a parallel implementation of an irregular particle simulation algorithm. We concentrate on the question of which programming and system support is needed to yield an efficient implementation for a large number of processors. As an execution platform we use the SB-PRAM, a shared memory machine with up to 2048 processors. The processors of the SB-PRAM can access the global memory in unit time, which is the basis for an exact performance prediction. Common approaches for parallel implementations, like lock protection for concurrent accesses and sequential or distributed task queues, are replaced by more efficient access mechanisms and data structures which can be realized by the powerful multiprefix operations of the SB-PRAM. Their use simplifies the implementation and yields large speedup values.
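On commodity shared-memory hardware, the closest everyday analogue of the SB-PRAM's multiprefix operation is an atomic fetch-and-add, which likewise lets many threads reserve distinct slots of a shared array without locks. The sketch below uses standard C++ threads and is only an analogy, not SB-PRAM code.

```cpp
// Concurrent insertion into a shared array via atomic fetch-and-add: each thread
// reserves a unique slot without locks, loosely analogous to a multiprefix operation.
#include <atomic>
#include <thread>
#include <vector>

int main() {
    const int nThreads = 8, perThread = 1000;
    std::vector<int> items(nThreads * perThread);
    std::atomic<int> next{0};                       // shared insertion counter

    auto worker = [&](int id) {
        for (int k = 0; k < perThread; ++k) {
            int slot = next.fetch_add(1);           // each thread gets a unique slot
            items[slot] = id;                       // concurrent insertion into the shared array
        }
    };
    std::vector<std::thread> pool;
    for (int id = 0; id < nThreads; ++id) pool.emplace_back(worker, id);
    for (auto& t : pool) t.join();
    return 0;
}
```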

17.
Inhomogeneous boson systems, such as the dilute gases of integral spin atoms in low-temperature magnetic traps, are believed to be well described by the Gross-Pitaevskii equation (GPE). GPE is a nonlinear Schrödinger equation which describes the order parameter of such systems at the mean field level. In the present work, we describe a Fortran 90 computer program developed by us, which solves the GPE using a basis set expansion technique. In this technique, the condensate wave function (order parameter) is expanded in terms of the solutions of the simple-harmonic oscillator (SHO) characterizing the atomic trap. Additionally, the same approach is also used to solve the problems in which the trap is weakly anharmonic, and the anharmonic potential can be expressed as a polynomial in the position operators x, y, and z. The resulting eigenvalue problem is solved iteratively using either the self-consistent-field (SCF) approach, or the imaginary time steepest-descent (SD) approach. Iterations can be initiated using either the simple-harmonic-oscillator ground state solution, or the Thomas-Fermi (TF) solution. It is found that for condensates containing up to a few hundred atoms, both approaches lead to rapid convergence. However, in the strong interaction limit of condensates containing thousands of atoms, it is the SD approach coupled with the TF starting orbitals, which leads to quick convergence. Our results for harmonic traps are also compared with those published by other authors using different numerical approaches, and excellent agreement is obtained. GPE is also solved for a few anharmonic potentials, and the influence of anharmonicity on the condensate is discussed. Additionally, the notion of Shannon entropy for the condensate wave function is defined and studied as a function of the number of particles in the trap. It is demonstrated numerically that the entropy increases with the particle number in a monotonic way.

Program summary

Title of program: bose.x
Catalogue identifier: ADWZ_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADWZ_v1_0
Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland
Distribution format: tar.gz
Computers: PC's/Linux, Sun Ultra 10/Solaris, HP Alpha/Tru64, IBM/AIX
Programming language used: mostly Fortran 90
Number of bytes in distributed program, including test data, etc.: 27 313
Number of lines in distributed program, including test data, etc.: 28 015
Card punching code: ASCII
Nature of physical problem: It is widely believed that the static properties of dilute Bose condensates, as obtained in atomic traps, can be described to a fairly good accuracy by the time-independent Gross-Pitaevskii equation. This program presents an efficient approach of solving this equation.
Method of solution: The solutions of the Gross-Pitaevskii equation corresponding to the condensates in atomic traps are expanded as linear combinations of simple-harmonic oscillator eigenfunctions. Thus, the Gross-Pitaevskii equation, which is a second-order nonlinear differential equation, is transformed into a matrix eigenvalue problem. Thereby, its solutions are obtained in a self-consistent manner, using methods of computational linear algebra.
Unusual features of the program: None

18.
A parallel algorithm for transforming an n × n symmetric matrix to tridiagonal form is described. The algorithm implements Givens rotations on a square array of n × n processors in such a way that the transformation can be performed in time O(n log n). The processors require only nearest-neighbor communication. The reduction to tridiagonal form could be the first step in the parallel solution of the symmetric eigenvalue problem in time O(n log n).
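A serial version of the underlying transformation, reducing a dense symmetric matrix to tridiagonal form by Givens similarity rotations, is sketched below; the paper's contribution is mapping these rotations onto an n × n processor array with nearest-neighbor communication, which the sketch does not show.

```cpp
// Serial Givens-rotation reduction of a symmetric matrix to tridiagonal form.
// Dense row-major storage; sketch only.
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Apply the similarity transform G A G^T, where G rotates rows/columns p and q.
static void rotate(Mat& A, int p, int q, double c, double s) {
    const int n = int(A.size());
    for (int k = 0; k < n; ++k) {                      // rows p and q
        double ap = A[p][k], aq = A[q][k];
        A[p][k] = c * ap + s * aq;
        A[q][k] = -s * ap + c * aq;
    }
    for (int k = 0; k < n; ++k) {                      // columns p and q
        double ap = A[k][p], aq = A[k][q];
        A[k][p] = c * ap + s * aq;
        A[k][q] = -s * ap + c * aq;
    }
}

void tridiagonalize(Mat& A) {
    const int n = int(A.size());
    for (int j = 0; j < n - 2; ++j)                    // zero column j below the subdiagonal
        for (int i = j + 2; i < n; ++i) {
            if (A[i][j] == 0.0) continue;
            double r = std::hypot(A[j + 1][j], A[i][j]);
            double c = A[j + 1][j] / r, s = A[i][j] / r;
            rotate(A, j + 1, i, c, s);                 // annihilates A[i][j] and, by symmetry, A[j][i]
        }
}
```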

19.
We propose an algorithm for the cost-effective evaluation of structured sparse matrix-vector multiplications on massively parallel SIMD computers with distributed memory. Under the assumption that a large percentage of the nonzero entries of the matrix is concentrated on almost full diagonals, a specific data transport problem arises, for which we derive an equivalent graph theoretical formulation. We find an optimal solution of this communication problem in terms of a minimum-cost spanning tree.

We demonstrate the efficiency of our matrix-vector multiplication on an MP-1 with 16,384 processors.
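The minimum-cost spanning tree itself can be computed with any standard MST algorithm; a Kruskal-with-union-find sketch is shown below, with the edge costs (which the paper derives from the diagonal structure of the sparse matrix) left as input.

```cpp
// Minimum-cost spanning tree by Kruskal's algorithm with union-find -- a standard way
// to solve the communication-planning problem once it has been reduced to an MST.
#include <algorithm>
#include <numeric>
#include <vector>

struct Edge { int u, v; double cost; };

struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    bool unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return false;
        parent[a] = b;
        return true;
    }
};

// Returns the edges of a minimum-cost spanning tree of an n-vertex graph.
std::vector<Edge> minimumSpanningTree(int n, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.cost < b.cost; });
    UnionFind uf(n);
    std::vector<Edge> tree;
    for (const Edge& e : edges)
        if (uf.unite(e.u, e.v)) tree.push_back(e);     // keep the edge if it joins two components
    return tree;
}
```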


20.
Design of an efficient parallel blocked algorithm for symmetric matrix tridiagonalization
In numerical matrix computations, blocked algorithms are usually more efficient than unblocked ones, but they also make parallel algorithm design and implementation more difficult. In a parallel solver for the generalized dense symmetric eigenproblem, parallel blocked algorithms can be applied to the Cholesky factorization of the positive definite symmetric matrix, the tridiagonalization of the symmetric matrix, and the back-transformation step. This paper focuses the discussion of parallel blocked algorithms on the representative case of symmetric matrix tridiagonalization and presents a design method for a parallel blocked tridiagonalization algorithm under a non-blocked storage scheme. The relationship between the block size, the matrix size, and the number of processors is analyzed. Experiments on the DeepComp 6800 (深腾6800) show that our algorithm performs very well and achieves higher performance than ScaLAPACK.
