1.

Introducing k-point parallelism into VASP

Asimina Maniopoulou Erlend R.M. Davidson Ricardo Grau-Crespo Aron Walsh Ian J. Bush C. Richard A. Catlow Scott M. Woodley 《Computer Physics Communications》2012,183(8):1696-1701

For many years ab initio electronic structure calculations based upon density functional theory have been one of the main application areas in high performance computing (HPC). Typically, the Kohn–Sham equations are solved by minimisation of the total energy functional, using a plane wave basis set for valence electrons and pseudopotentials to obviate the representation of core states. One of the best-known and most widely used software packages for performing this type of calculation is the Vienna Ab initio Simulation Package, VASP, which currently offers a parallelisation strategy based on the distribution of bands and plane wave coefficients over the machine processors. We report here an improved parallelisation strategy that also distributes the k-point sampling workload over different processors, allowing much better scalability on massively parallel computers. As a result, some difficult problems requiring large k-point sampling become tractable on current computing facilities. We showcase three important applications: the dielectric function of epitaxially strained indium oxide, solution energies of tetravalent dopants in metallic VO₂, and hydrogen on graphene.
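The idea of distributing k-points over processors can be illustrated with a small sketch. This is not VASP's implementation: the function name `split_kpoints`, the round-robin assignment, and the requirement that the processor count divide evenly into groups are all assumptions made for illustration. Each group of ranks would then handle bands and plane wave coefficients for its own k-points only.

```python
def split_kpoints(n_kpoints, n_procs, n_groups):
    """Assign k-points round-robin to equally sized processor groups.

    Hypothetical helper for illustration; the abstract does not say how
    the k-points are actually assigned.
    """
    if n_procs % n_groups != 0:
        raise ValueError("n_procs must be divisible by n_groups")
    procs_per_group = n_procs // n_groups
    layout = {}
    for g in range(n_groups):
        # Contiguous block of ranks that forms group g.
        ranks = list(range(g * procs_per_group, (g + 1) * procs_per_group))
        # Every n_groups-th k-point, starting at g, goes to group g.
        kpoints = list(range(g, n_kpoints, n_groups))
        layout[g] = {"ranks": ranks, "kpoints": kpoints}
    return layout


layout = split_kpoints(n_kpoints=10, n_procs=8, n_groups=4)
for group, info in layout.items():
    print(group, info["ranks"], info["kpoints"])
```

With 10 k-points and 4 groups the load is slightly uneven (3, 3, 2, 2 k-points), which is the kind of imbalance a real scheme would have to weigh against communication cost.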

2.

Recursion method for electronic structure calculations

E. Lorin G. Zérah 《Computer Physics Communications》2004,158(1):39-46

We present a new recursion method based on the Trotter formula for electronic structure calculations of molecules or solids. The proposed method becomes more effective at high temperatures, in contrast with direct calculation methods (real-space or plane-wave methods).

3.

A highly flexible, distributed multiprocessor architecture for network processing

《Computer Networks》2003,41(5):563-586

Network processors (NPs) are an emerging field of programmable processors that are optimized to implement data plane packet processing networking functions. Unlike general-purpose CPUs, which rely heavily on caching to improve performance, the lack of locality in packet processing and the need for high-performance I/O have forced designers to come up with innovative architectures that can hide memory latency while still processing packets at high data rates. Most of these NPs use some type of multiprocessing in combination with a hierarchy of memory types to achieve high performance. In addition, to keep up with packets arriving at high data rates over multiple incoming media interfaces, an NP must perform fast I/O and memory operations such as packet storage, table lookup, and extraction of fields in packet headers. We describe an architecture that uses a combination of distributed memory architecture and one or more multithreaded processors to achieve the necessary performance. We describe the challenges in programming such a processor, including the issues related to consistency and maintaining packet ordering. We also present a programming model for generic network applications that uses software pipelines. We then demonstrate the use of the programming model in implementing two applications, namely, mapping traffic management algorithms onto a multithreaded architecture and an implementation of a media gateway based on voice-over-AAL2.

4.

New approach for approximating the continuum wave function by Gaussian basis set

Marcelo Fiori J.E. Miraglia 《Computer Physics Communications》2012,183(12):2528-2534

A new approach for approximating the continuum wave functions for hydrogenic atoms with Gaussian basis sets is developed and tested. In this approach the plane wave is left unchanged and the distorting factor, represented by the confluent hypergeometric function, is expanded as a sum of spherical harmonics multiplied by a series of Gaussians. In this way the number of spherical waves and Gaussians is significantly reduced, and the plane wave is responsible for momentum conservation. Compared with previous methods that expand the full continuum, including the plane wave, our strategy does not require a large number of partial waves for convergence. Dense oscillations, which are characteristic of the plane wave, are avoided. To test how well this approximation describes a free-bound atomic form factor, the ionization cross section of hydrogen by proton impact is calculated in the first Born approximation. Good agreement with the exact results is obtained with just four spherical waves of ten Gaussians each. The method looks very promising, especially for speeding up atomic and molecular collision calculations involving the continuum.

5.

Phase diagram calculations in the Co–Mo and Fe–Mo systems using first-principles results for the sigma phase

《Calphad》2005,29(2):133-139

6.

Performance of computationally intensive parameter sweep applications on Internet‐based Grids of computers: the mapping of molecular potential energy hypersurfaces

S. Reyes C. Muñoz‐Caro A. Niño R. M. Badia J. M. Cela 《Concurrency and Computation》2007,19(4):463-481

This work focuses on the use of computational Grids for processing the large set of jobs arising in parameter sweep applications. In particular, we tackle the mapping of molecular potential energy hypersurfaces. For computationally intensive parameter sweep problems, performance models are developed to compare the parallel computation in a multiprocessor system with the computation on an Internet‐based Grid of computers. We find that the relative performance of the Grid approach increases with the number of processors, being independent of the number of jobs. The experimental data, obtained using electronic structure calculations, fit the proposed performance expressions accurately. To automate the mapping of potential energy hypersurfaces, an application based on GRID superscalar is developed. It is tested on the prototypical case of the internal dynamics of acetone. Copyright © 2006 John Wiley & Sons, Ltd.

7.

An OpenMP/MPI approach to the parallelization of iterative four-atom quantum mechanics

Dmitry M. Medvedev Stephen K. Gray 《Computer Physics Communications》2005,166(2):94-108

We present an approach to parallel iterative four-atom quantum mechanics calculations in a computing environment of distributed memory nodes, each node consisting of a group of processors with a shared memory. We parallelize the action of the Hamiltonian matrix on a vector, which is the main computational bottleneck in both iterative calculations of eigenvalues and eigenvectors and the iterative determination of quantum dynamics information via, e.g., wavepacket methods. OpenMP is used to facilitate the parallel work within each node, and MPI is used to communicate information between nodes. For a realistic problem the approach is shown to scale very well up to 512 processors at the NERSC computing facility, working at up to 20% of the theoretical peak performance rate. The highest total floating point rate we achieve is 0.16 Tflops, using 768 processors. Our approach should also be applicable to quantum dynamics problems with more than four atoms.

8.

Implementation of a Numerical Solution of the Multicomponent Kinetic Collection Equation (MKCE) on Parallel Computers

《Journal of Parallel and Distributed Computing》2001,61(5):641-661

Two different numerical solutions of the two-component kinetic collection equation were implemented on parallel computers. The parallelization approach included domain decomposition and MPI commands for communications. Four different parallel codes were tested. A dynamic decomposition based on an occupancy function provided the optimum balance between time performance and flexibility for any number of processors. The occupancy function was defined according to the number of calculations required at each grid point in the domain. Speed-up depended strongly on the parallel code used, and in some cases very good results were obtained for up to 32 processors.
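A decomposition driven by an occupancy function can be sketched as a work-weighted split of the grid. The following is a minimal illustration, not the paper's code: it assumes a 1D grid, contiguous chunks, and a greedy prefix-sum rule, and the name `partition_by_occupancy` is hypothetical.

```python
def partition_by_occupancy(work, n_procs):
    """Split grid points into n_procs contiguous chunks of similar total work.

    work[i] is the assumed number of calculations needed at grid point i.
    Returns a list of half-open (start, end) index ranges, one per processor.
    """
    total = sum(work)
    bounds = []
    start = 0
    running = 0.0  # work assigned so far, across all earlier chunks
    for p in range(1, n_procs):
        target = total * p / n_procs  # cumulative work the first p chunks may hold
        end = start
        while end < len(work) and running + work[end] <= target:
            running += work[end]
            end += 1
        bounds.append((start, end))
        start = end
    bounds.append((start, len(work)))  # last processor takes the remainder
    return bounds


work = [1, 1, 1, 1, 4, 4, 4, 4]
bounds = partition_by_occupancy(work, n_procs=2)
loads = [sum(work[a:b]) for a, b in bounds]
print(bounds, loads)
```

For this example an equal split by point count would give loads of 4 and 16, while the weighted split gives 8 and 12. Recomputing the split as the occupancy changes during a run is what would make such a scheme dynamic.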

9.

Fast Fourier transform-based correlation of DNA sequences using complex plane encoding

E A Cheever G C Overton D B Searls 《Computer applications in the biosciences》1991,7(2):143-154

The detection of similarities between DNA sequences can be accomplished using the signal-processing technique of cross-correlation. An early method used the fast Fourier transform (FFT) to perform correlations on DNA sequences in O(n log n) time for sequences of any length. However, this method requires many FFTs (nine), runs no faster if one sequence is much shorter than the other, and measures only global similarity, so that significant short local matches may be missed. We report that, through the use of alternative encodings of the DNA sequence in the complex plane, the number of FFTs performed can be traded off against (i) signal-to-noise ratio, and (ii) a certain degree of filtering for local similarity via k-tuple correlation. Also, when comparing probe sequences against much longer targets, the algorithm can be sped up by decomposing the target and performing multiple small FFTs in an overlap-save arrangement. Finally, by decomposing the probe sequence as well, the detection of local similarities can be further enhanced. With current advances in extremely fast hardware implementations of signal-processing operations, this approach may prove more practical than heretofore.
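FFT-based correlation with a complex-plane encoding can be illustrated with a short, self-contained sketch. The mapping used here (A = 1, T = -1, C = i, G = -i) is an assumption chosen for illustration and is not necessarily one of the encodings studied in the paper; the radix-2 FFT is a plain textbook version, and the overlap-save and k-tuple refinements are not shown.

```python
import cmath

# Assumed encoding: each base becomes a unit-magnitude complex number.
ENCODING = {"A": 1 + 0j, "T": -1 + 0j, "C": 1j, "G": -1j}


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def cross_correlate(target, probe):
    """Score the probe at every offset of the target using three FFTs."""
    n = 1
    while n < len(target) + len(probe):  # zero-pad so shifts do not wrap around
        n *= 2
    a = [ENCODING[c] for c in target] + [0j] * (n - len(target))
    b = [ENCODING[c] for c in probe] + [0j] * (n - len(probe))
    fa, fb = fft(a), fft(b)
    prod = [x * y.conjugate() for x, y in zip(fa, fb)]
    corr = fft(prod, inverse=True)
    # The real part is len(probe) only where every base matches exactly.
    return [c.real / n for c in corr[: len(target)]]


scores = cross_correlate("ACGTTGACCA", "TGAC")
best = max(range(len(scores)), key=scores.__getitem__)
print(best, round(scores[best], 6))
```

Under this assumed encoding a matching base contributes +1 to the real part of the score, a complementary pair (A/T or C/G) contributes -1, and any other mismatch contributes 0, so the score is a similarity signal rather than a plain match count.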

10.

Performance of the 3D FFT on the 6D network torus QCDOC parallel supercomputer

Bin Fang Glenn Martyna 《Computer Physics Communications》2007,176(8):531-538

QCDOC is a massively parallel supercomputer with tens of thousands of nodes distributed on a six-dimensional torus network. The 6D structure of the network provides the needed communication resources for many communication-intensive applications. In this paper, we present a parallel algorithm for the three-dimensional Fast Fourier Transform and its implementation on a 4096-node QCDOC prototype. Two techniques have been used to increase its parallel performance: simultaneous multi-dimensional communication and communication-and-computation overlapping. Benchmarking experiments suggest that 3D FFTs of size 128×128×128 can scale well on such platforms up to 4096 nodes. Our performance results suggest stronger scalability on QCDOC than on the IBM BlueGene/L supercomputer.

11.

Parallel implementation of the projector augmented plane wave method for charged systems

Eric J. Bylaska Marat Valiev Ryoichi Kawai John H. Weare 《Computer Physics Communications》2002,143(1):11-28

A parallel implementation of the projector augmented plane wave (PAW) method with applications to several transition metal complexes is presented. A unique aspect of our PAW code is that it can treat both charged and neutral cluster systems. We discuss how this is achieved via accurate numerical treatment of the Coulomb Green's function with free space boundary conditions. The strategy for parallelizing the PAW code is based on distributing the plane wave basis across processors. This is a versatile approach and is implemented using a parallel three-dimensional Fast Fourier Transformation (FFT). We report parallel performance analysis of our program and of the three-dimensional FFTs and discuss large-scale parallelization issues of the PAW code. Using a series of transition metal monoxides and dioxides, as well as two iron aqueous complexes, it is shown that a free space PAW code can give structural parameters and energies in good accord with Gaussian based methods.

12.

A multi-level instruction cache structure for array many-core processors

Chen Yifei Li Hongliang Liu Xiao Gao Hongguang 《Computer Engineering & Science》2018,40(4):571-579

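Array many-core processors are widely used in high-performance computing because of their high computational performance and energy efficiency. To build processors for future high-performance computing systems, however, the severe "memory wall" challenge and the problem of cooperation between cores must be addressed. In typical array processors, cores mostly adopt a single-threaded structure to reduce overhead, but this places high demands on memory access. In this work, hardware simultaneous multithreading is introduced into the individual cores of an array many-core processor. To address the experimentally observed problem that the level-1 instruction cache hit rate drops significantly as the number of threads increases, a redundant instruction cache storage structure for array many-core processors is proposed and, based on this structure, FIFO and LRU-like replacement policies are adopted. With this optimized cache design, simulation experiments show that the overall instruction cache miss rate with two threads is reduced by 25.2% and overall CPI performance is improved by 30.2%.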
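The difference between FIFO and LRU replacement can be seen in a toy miss-rate simulation. This sketch assumes a single fully associative cache and a hand-written address trace; it does not model the paper's redundant multi-level structure, its LRU-like variant, or multithreading, and the function name `miss_rate` is hypothetical.

```python
from collections import OrderedDict


def miss_rate(trace, capacity, policy="lru"):
    """Return the miss rate of a fully associative cache on an address trace.

    policy is "lru" (evict the least recently used line) or
    "fifo" (evict the line that was inserted first).
    """
    if policy not in ("lru", "fifo"):
        raise ValueError("policy must be 'lru' or 'fifo'")
    cache = OrderedDict()  # keys are cached addresses, oldest first
    misses = 0
    for addr in trace:
        if addr in cache:
            if policy == "lru":
                cache.move_to_end(addr)  # a hit refreshes recency under LRU only
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the entry at the front
            cache[addr] = True
    return misses / len(trace)


trace = [0, 1, 2, 0, 3, 0, 4, 0]  # address 0 is reused frequently
lru = miss_rate(trace, capacity=3, policy="lru")
fifo = miss_rate(trace, capacity=3, policy="fifo")
print(lru, fifo)
```

On this trace LRU keeps the frequently reused address resident and misses 5 of 8 accesses, while FIFO evicts it once and misses 6 of 8. Other traces can favour FIFO, so the result says nothing about the paper's measured figures.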

13.

Optimization and implementation of an embedded GUI based on the Blackfin processor

Lu Gang 《Computer Engineering》2009,35(6):269-271

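Targeting the strong multimedia performance of the Blackfin processor, this work studies optimization techniques for graphical user interfaces (GUIs) and implements an embedded GUI system based on uClinux. The system is compact and offers high performance, and it suits electronic products that require high-resolution displays. The embedded GUI is compared with Microwindows, and the performance test results show that its implementation lays a foundation for high-performance embedded multimedia products.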

14.

MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs

Yan Li Yunquan Zhang Yiqun Liu Guoping Long Haipeng Jia 《Journal of Computer Science and Technology》2013,28(1):90-105

Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of generating a Fourier transform. The emerging class of high performance computing architectures, such as GPUs, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the clAmdFft library on AMD GPUs and achieves performance comparable to the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two sizes. For 3D non-power-of-two FFTs, our library is 1.5x to 28x faster than FFTW with 4 threads and achieves a 20.01x average speedup over CUFFT 4.0 on Tesla C2050.

15.

High grid resolution and parallelized tsunami simulation with fully nonlinear Boussinesq equations

Nuttita Pophet Narongrit Kaewbanjak Jack Asavanant Mansour Ioualalen 《Computers & Fluids》2011,40(1):258-268

Numerical simulation of tsunami propagation in large basins across the ocean demands significantly high computational capability in terms of CPU time and memory allocation. Due to this limitation, the use of sequential codes on a single scientific workstation is possible only for small-scale tsunami problems. To overcome this difficulty, a parallel Boussinesq wave model is developed based on the original FUNWAVE sequential model for efficient simulation of long wave propagation, coastal inundation and runup. The numerical resolution is decomposed into small sub-domains using a domain decomposition technique so that each processor performs the calculations for its own sub-domain. The wave information is exchanged between processors via the Message Passing Interface (MPI). We show the effectiveness of this parallel code on distributed- and shared-memory computer clusters in simulating two tsunami events: the 2004 Indian Ocean and the 1999 Vanuatu tsunamis. Communication in the overlapping domains and load balancing in the partitioned domains are considered to ensure the efficiency of this method. It is found that the performance of the parallel model for both large- and small-scale tsunami problems is very satisfactory. Finally, the parallel model is applied to a spatial hierarchical grid methodology for a location-specific numerical simulation. Grid sensitivity and improved simulation results for runups along the Phang Nga coastline from Takua Thung to Khao Lak are presented.
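Domain decomposition with communication in overlapping regions can be sketched on a toy 1D stencil. This is only an illustration under stated assumptions: a simple neighbour-averaging update stands in for the Boussinesq solver, the sub-domains live in one Python process, and list indexing stands in for the MPI exchange of ghost (overlap) values. The function names are hypothetical.

```python
def step_serial(u):
    """One update: interior points become the mean of their two neighbours.

    The two end points are held fixed. Requires len(u) >= 2.
    """
    inner = [0.5 * (u[i - 1] + u[i + 1]) for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def step_decomposed(u, n_parts):
    """Same update, computed independently on n_parts equal sub-domains."""
    if len(u) % n_parts != 0:
        raise ValueError("len(u) must be divisible by n_parts")
    size = len(u) // n_parts
    chunks = [u[p * size:(p + 1) * size] for p in range(n_parts)]
    result = []
    for p, chunk in enumerate(chunks):
        # Stand-in for the halo exchange: fetch one ghost value per neighbour.
        left = [chunks[p - 1][-1]] if p > 0 else []
        right = [chunks[p + 1][0]] if p < n_parts - 1 else []
        padded = left + chunk + right
        updated = step_serial(padded)
        # Drop the ghost cells again; only owned points are kept.
        result.extend(updated[len(left):len(updated) - len(right)])
    return result


u = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0]
serial = step_serial(u)
parallel = step_decomposed(u, n_parts=4)
print(serial)
print(parallel)
```

The decomposed update reproduces the serial result because each sub-domain sees exactly the neighbour values it needs. In a real MPI code the two ghost values would arrive through message passing, and the cost of that exchange relative to the interior work is what limits scalability.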

16.

Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations

T. Auckenthaler H. Lederer 《Parallel Computing》2011,37(12):783-794

The computation of selected eigenvalues and eigenvectors of a symmetric (Hermitian) matrix is an important subtask in many contexts, for example in electronic structure calculations. If a significant portion of the eigensystem is required then typically direct eigensolvers are used. The central three steps are: reduce the matrix to tridiagonal form, compute the eigenpairs of the tridiagonal matrix, and transform the eigenvectors back. To better utilize memory hierarchies, the reduction may be effected in two stages: full to banded, and banded to tridiagonal. Then the back transformation of the eigenvectors also involves two stages. For large problems, the eigensystem calculations can be the computational bottleneck, in particular with large numbers of processors. In this paper we discuss variants of the tridiagonal-to-banded back transformation, improving the parallel efficiency for large numbers of processors as well as the per-processor utilization. We also modify the divide-and-conquer algorithm for symmetric tridiagonal matrices such that it can compute a subset of the eigenpairs at reduced cost. The effectiveness of our modifications is demonstrated with numerical experiments.

17.

Quickstep: Fast and accurate density functional calculations using a mixed Gaussian and plane waves approach

Joost VandeVondele Matthias Krack Michele Parrinello Jürg Hutter 《Computer Physics Communications》2005,167(2):103-128

We present the Gaussian and plane waves (GPW) method and its implementation in Quickstep, which is part of the freely available program package CP2K. The GPW method allows for accurate density functional calculations in gas and condensed phases and can be effectively used for molecular dynamics simulations. We show how derivatives of the GPW energy functional, namely ionic forces and the Kohn-Sham matrix, can be computed in a consistent way. The cost of computing the total energy and the Kohn-Sham matrix scales linearly with the system size, even for condensed phase systems of just a few tens of atoms. The efficiency of the method allows for the use of large Gaussian basis sets for systems up to 3000 atoms, and we illustrate the accuracy of the method for various basis sets in gas and condensed phases. Agreement with basis set free calculations for single molecules and plane wave based calculations in the condensed phase is excellent. Wave function optimisation with the orbital transformation technique leads to good parallel performance, and outperforms traditional diagonalisation methods. Energy conserving Born-Oppenheimer dynamics can be performed, and a highly efficient scheme is obtained using an extrapolation of the density matrix. We illustrate these findings with calculations using commodity PCs as well as supercomputers.

18.

Practical aspects of algorithmic solutions to gas pulsation problems of pipeline systems

《Computers in Industry》1986,7(6):505-509

Experiences gained in the field of program development for acoustic pressure and velocity pulsation calculation of reciprocating compressor pipeline systems are discussed. An interactive personal computer program package based on the acoustic plane wave model has been elaborated. Using graph and matrix interpretation of the system, tree-structure pipeline networks are handled by generalized algorithms which make it possible to realize pulsation calculations on two levels, with general logical operations of structure recognition being separated from structure dependent numerical calculations. Economical time and memory usage, as well as program structure and data flow design are detailed.

19.

Embedded divide-and-conquer algorithm on hierarchical real-space grids: parallel molecular dynamics simulation based on linear-scaling density functional theory

Fuyuki Shimojo Rajiv K. Kalia Priya Vashishta 《Computer Physics Communications》2005,167(3):151-164

A linear-scaling algorithm has been developed to perform large-scale molecular-dynamics (MD) simulations, in which interatomic forces are computed quantum mechanically in the framework of density functional theory. A divide-and-conquer algorithm is used to compute the electronic structure, where the non-additive contribution to the kinetic energy is included with an embedded cluster scheme. Electronic wave functions are represented on a real-space grid, which is augmented with coarse multigrids to accelerate the convergence of iterative solutions and adaptive fine grids around atoms to accurately calculate ionic pseudopotentials. Spatial decomposition is employed to implement the hierarchical-grid algorithm on massively parallel computers. A converged solution to the electronic-structure problem is obtained for a 32,768-atom amorphous CdSe system on 512 IBM POWER4 processors. The total energy is well conserved during MD simulations of liquid Rb, showing the applicability of this algorithm to first-principles MD simulations. The parallel efficiency is 0.985 on 128 Intel Xeon processors for a 65,536-atom CdSe system.

20.

A multi-algorithm approach to very high performance one-dimensional FFTs

Jim Armstrong 《The Journal of supercomputing》1988,2(4):415-433

This paper presents a multi-algorithm approach to computing one-dimensional FFTs. The type of parallelism introduced is most amenable to execution on multi-headed vector machines. The usage of multiple algorithms provides high performance regardless of transform size.