Similar articles
20 similar articles found (search time: 31 ms)
1.
In this paper, a programming model is presented which enables scalable parallel performance on multi-core shared memory architectures. The model has been developed for application to a wide range of numerical simulation problems. Such problems involve time stepping or iteration algorithms where synchronization of multiple threads of execution is required. It is shown that traditional approaches to parallelism, including message passing and scatter-gather, can be improved upon in terms of speed-up and memory management. Using spatial decomposition to create orthogonal computational tasks, a new task management algorithm called H-Dispatch is developed. This algorithm makes efficient use of memory resources by limiting the need for garbage collection and takes optimal advantage of multiple cores by employing a “hungry” pull strategy. The technique is demonstrated on a simple finite difference solver and results are compared to traditional MPI and scatter-gather approaches. The H-Dispatch approach achieves near-linear speed-up, with an efficiency of 85% on a 24-core machine. It is noted that the H-Dispatch algorithm is quite general and can be applied to a wide class of computational tasks on heterogeneous architectures involving multi-core and GPGPU hardware.
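The "hungry" pull strategy can be illustrated with a minimal sketch: idle workers pull the next spatial sub-domain from a shared queue themselves, rather than being assigned work up front. This is not the authors' H-Dispatch implementation; all names and the threading layer below are illustrative.

```python
# Minimal sketch of a pull-based ("hungry") dispatch loop, not the authors'
# H-Dispatch code: idle workers pull the next spatial sub-domain themselves.
import queue
import threading

def run_time_step(subdomains, process, n_workers=4):
    tasks = queue.Queue()
    for sd in subdomains:              # orthogonal tasks from spatial decomposition
        tasks.put(sd)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                sd = tasks.get_nowait()    # "hungry" pull: take work when idle
            except queue.Empty:
                return
            r = process(sd)                # user-supplied per-subdomain kernel
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                           # synchronisation barrier per time step
    return results

# Example: one finite-difference-style "time step" over 16 sub-domains.
if __name__ == "__main__":
    print(sum(run_time_step(range(16), lambda sd: sd * sd)))
```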

2.
With the growing popularity of the velocity map imaging technique, a need for the analysis of photoion and photoelectron images arose. Here, a computer program is presented that allows for the analysis of cylindrically symmetric images. It permits the inversion of the projection of the 3D charged particle distribution using the Onion Peeling Algorithm. Further analysis includes the determination of radial and angular distributions, from which velocity distributions and spatial anisotropy parameters are obtained. Identification and quantification of the different photolysis channels is therefore straightforward. In addition, the program features geometry correction, centering, and multi-Gaussian fitting routines, as well as a user-friendly graphical interface and the possibility of generating synthetic images using either the fitted or user-defined parameters.

Program summary

Title of program: Glass Onion
Catalogue identifier: ADRY
Program Summary URL: http://cpc.cs.qub.ac.uk/summaries/ADRY
Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland
Licensing provisions: none
Computer: IBM PC
Operating system under which the program has been tested: Windows 98, Windows 2000, Windows NT
Programming language used: Delphi 4.0
Memory required to execute with typical data: 18 Mwords
No. of bits in a word: 32
No. of bytes in distributed program, including test data, etc.: 9 911 434
Distribution format: zip file
Keywords: Photofragment image, onion peeling, anisotropy parameters
Nature of physical problem: Information about the velocity and angular distributions of photofragments is the basis on which the analysis of the photolysis process rests. Reconstructing the three-dimensional distribution from the photofragment image is the first step; further processing involves angular and radial integration of the inverted image to obtain velocity and angular distributions. Provisions have to be made to correct for slight distortions of the image and to verify the accuracy of the analysis process.
Method of solution: The “Onion Peeling” algorithm described by Helm [Rev. Sci. Instrum. 67 (6) (1996)] is used to perform the image reconstruction. Angular integration with a subsequent multi-Gaussian fit supplies information about the velocity distribution of the photofragments, whereas radial integration with subsequent expansion of the angular distributions over Legendre polynomials gives the spatial anisotropy parameters. Fitting algorithms have been developed to centre the image and to correct for image distortion.
Restrictions on the complexity of the problem: The maximum image size (1280×1280) and resolution (16 bit) are restricted by available memory and can be changed in the source code. Initial centre coordinates within 5 pixels may be required for the correction and centering algorithms to converge. Peaks in the velocity profile separated by less than the peak width may not be deconvolved. In the charged particle image reconstruction, it is assumed that the kinetic energy released in the dissociation process is small compared to the energy acquired in the electric field. For the fitting parameters to be physically meaningful, cylindrical symmetry of the image has to be assumed, but the actual inversion algorithm is stable against distortions of this symmetry in experimental images.
Typical running time: The analysis procedure can be divided into three parts: inversion, fitting, and geometry correction. The inversion time grows approximately as R³, where R is the radius of the region of interest: for R = 200 pixels it is less than a minute, for R = 400 pixels less than 6 min on a 400 MHz IBM personal computer. The time for the velocity fitting procedure to converge depends strongly on the number of peaks in the velocity profile and the convergence criterion. It ranges from less than a second for simple curves to a few minutes for profiles with up to twenty peaks. The time taken for the image correction scales as R² and depends on the curve profile. It is on the order of a few minutes for images with R = 500 pixels.
Unusual features of the program: Our centering and image correction algorithm is based on Fourier analysis of the radial distribution to ensure the sharpest velocity profile and is insensitive to an uneven intensity distribution. An angular averaging option is available to stabilize the inversion algorithm without losing resolution.
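As a hedged illustration of the angular analysis described above (not code from Glass Onion), the spatial anisotropy parameter β can be extracted by fitting a radially integrated angular distribution to I(θ) ∝ 1 + β·P2(cos θ):

```python
# Sketch (not Glass Onion code): extract the anisotropy parameter beta by
# fitting I(theta) = A * (1 + beta * P2(cos theta)) to an angular distribution.
import numpy as np
from scipy.optimize import curve_fit

def angular_model(theta, amplitude, beta):
    p2 = 0.5 * (3.0 * np.cos(theta) ** 2 - 1.0)   # Legendre polynomial P2
    return amplitude * (1.0 + beta * p2)

theta = np.linspace(0.0, np.pi, 180)
rng = np.random.default_rng(0)
intensity = angular_model(theta, 100.0, 1.3) + rng.normal(0.0, 2.0, theta.size)

(amplitude, beta), _ = curve_fit(angular_model, theta, intensity, p0=(1.0, 0.0))
print(f"fitted beta = {beta:.2f}")   # close to 1.3 for this synthetic example
```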

3.
In this paper we present a compact library for the analysis of nuclear spectra. The library consists of sophisticated functions for background elimination, smoothing, peak searching, deconvolution, and peak fitting. The functions can process one- and two-dimensional spectra. The software described in the paper comprises a number of conventional as well as newly developed methods needed to analyze experimental data.

Program summary

Program title: SpecAnalysLib 1.1
Catalogue identifier: AEDZ_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDZ_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 42 154
No. of bytes in distributed program, including test data, etc.: 2 379 437
Distribution format: tar.gz
Programming language: C++
Computer: Pentium 3 PC 2.4 GHz or higher, Borland C++ Builder v. 6. A precompiled Windows version is included in the distribution package
Operating system: Windows 32 bit versions
RAM: 10 MB
Word size: 32 bits
Classification: 17.6
Nature of problem: The demand for advanced, highly effective experimental data analysis functions is enormous. The library package represents one approach to giving physicists the possibility of using advanced routines simply by calling them from their own programs. SpecAnalysLib is a collection of functions for the analysis of one- and two-parameter γ-ray spectra, but they can be used for other types of data as well. The library consists of sophisticated functions for background elimination, smoothing, peak searching, deconvolution, and peak fitting.
Solution method: The background estimation algorithms are based on the Sensitive Non-linear Iterative Peak (SNIP) clipping algorithm. The smoothing algorithms are based on the convolution of the original data with several types of filters and on algorithms based on discrete Markov chains. The peak searching algorithms use smoothed second differences and can search for peaks of general form. The deconvolution (decomposition - unfolding) functions use the Gold iterative algorithm, its improved high-resolution version, and the Richardson-Lucy algorithm. Two approaches are implemented in the peak fitting algorithms. The first is based on the algorithm without matrix inversion (AWMI), which allows fitting of large blocks of data and a large number of parameters. The second is based on solving a system of linear equations using the Stiefel-Hestenes method; it converges faster than AWMI but is not suitable for fitting a large number of parameters.
Restrictions: The dimensionality of the analyzed data is limited to two.
Unusual features: A dynamically loadable library (DLL) of processing functions that users can call from their own programs.
Running time: Most processing routines execute interactively or in a few seconds. Computationally intensive routines (deconvolution, fitting) take longer, depending on the number of iterations specified and the volume of the processed data.
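A minimal one-dimensional sketch of SNIP-style background clipping, assuming a simple symmetric window, is shown below; the library's own routines are considerably more elaborate (2D variants, configurable clipping windows).

```python
# Minimal 1D sketch of SNIP background clipping (the library's own routines
# are more elaborate): each pass replaces a bin by the smaller of its value
# and the average of its neighbours at distance p, for a growing window p.
import numpy as np

def snip_background(spectrum, n_iterations=20):
    background = np.asarray(spectrum, dtype=float).copy()
    n = background.size
    for p in range(1, n_iterations + 1):
        clipped = background.copy()
        for i in range(p, n - p):
            mean_of_neighbours = 0.5 * (background[i - p] + background[i + p])
            clipped[i] = min(background[i], mean_of_neighbours)
        background = clipped
    return background

# Usage: subtract the estimated background before peak searching.
spectrum = np.full(256, 10.0)
spectrum[100:110] += 50.0                        # a synthetic peak
net = spectrum - snip_background(spectrum, 20)
print(int(net.argmax()))                         # peak position survives clipping
```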

4.
In this article we focus on the implementation of a Lattice Monte Carlo simulation for a generic pair potential within a reconfigurable computing platform. The approach presented was used to simulate a specific soft matter system. We found the simulations to be in excellent agreement with previous theoretical and simulation studies. By taking advantage of the shortened processing time, we were also able to find new micro- and macroscopic properties of this system. Furthermore, we analytically analyzed the effects of the spatial discretization introduced by the Lattice Monte Carlo algorithm.
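The inner loop of such a lattice Monte Carlo simulation can be sketched in plain software as below. This is an illustration of the Metropolis step for a user-supplied pair potential, assuming nearest-neighbour interactions only; it is not the authors' reconfigurable-hardware implementation.

```python
# Plain-software sketch of one Metropolis sweep of a lattice Monte Carlo
# simulation with a user-supplied pair potential (nearest neighbours only).
# Illustration of the method, not the authors' FPGA implementation.
import numpy as np

def metropolis_sweep(lattice, pair_potential, beta, rng):
    n = lattice.shape[0]
    for _ in range(lattice.size):
        i, j = rng.integers(0, n, size=2)
        old, new = lattice[i, j], int(rng.integers(0, 2))   # propose new occupancy
        neighbours = [lattice[(i + 1) % n, j], lattice[(i - 1) % n, j],
                      lattice[i, (j + 1) % n], lattice[i, (j - 1) % n]]
        delta_e = sum(pair_potential(new, nb) - pair_potential(old, nb)
                      for nb in neighbours)
        if delta_e <= 0 or rng.random() < np.exp(-beta * delta_e):
            lattice[i, j] = new                              # Metropolis acceptance
    return lattice

rng = np.random.default_rng(1)
lattice = rng.integers(0, 2, size=(32, 32))
repulsion = lambda a, b: 1.0 if (a == 1 and b == 1) else 0.0
metropolis_sweep(lattice, repulsion, beta=2.0, rng=rng)
print(lattice.sum())
```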

5.
6.
Quantum Monte Carlo (QMC) is among the most accurate methods for solving the time-independent Schrödinger equation. Unfortunately, the method is very expensive and requires a vast array of computing resources in order to obtain results of a reasonable convergence level. On the other hand, the method is not only easily parallelizable across CPU clusters, but as we report here, it also has a high degree of data parallelism. This facilitates the use of recent technological advances in Graphical Processing Units (GPUs), a powerful type of processor well known to computer gamers. In this paper we report on an end-to-end QMC application with core elements of the algorithm running on a GPU. With individual kernels achieving as much as 30× speed-up, the overall application runs up to 6× faster than an optimized CPU implementation, yet requires only a modest increase in hardware cost. This demonstrates the speed-up possible for QMC on advanced hardware, and suggests a path toward providing QMC-level accuracy as a more standard tool. The major current challenge in running codes of this type on the GPU arises from the lack of fully compliant IEEE floating point implementations. To achieve better accuracy we propose the use of the Kahan summation formula in matrix multiplications. While this reduces overall performance, we demonstrate that the proposed algorithm can match CPU single precision.
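The Kahan (compensated) summation mentioned above carries a running error term so that single-precision accumulations, e.g. the dot products inside a matrix multiplication, lose far less accuracy. A minimal sketch, not the paper's GPU kernel:

```python
# Minimal sketch of Kahan (compensated) summation for a single-precision dot
# product, the building block of the matrix multiplications discussed above.
import numpy as np

def kahan_dot(x, y):
    total = np.float32(0.0)
    compensation = np.float32(0.0)     # running record of lost low-order bits
    for a, b in zip(x, y):
        term = a * b - compensation
        new_total = total + term
        compensation = (new_total - total) - term
        total = new_total
    return total

x = np.random.default_rng(0).random(100_000).astype(np.float32)
y = np.ones(100_000, dtype=np.float32)
# Compensated float32 result vs. a double-precision reference.
print(float(kahan_dot(x, y)), float(np.dot(x.astype(np.float64), y.astype(np.float64))))
```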

7.
The growing power and number of high performance computing resources made available through computational grids present major opportunities as well as a number of challenges to the user. At issue is how these resources can be accessed and how their power can be effectively exploited. In this paper we first present our views on the usability of contemporary high-performance computational resources. We introduce the concept of grid application virtualization as a solution to some of the problems with grid-based HPC usability. We then describe a middleware tool that we have developed to realize the virtualization of grid applications, the Application Hosting Environment (AHE), and describe the features of the new release, AHE 2.0, which provides access to a common platform of federated computational grid resources in standard and non-standard ways. Finally, we describe a case study showing how AHE supports clinical use of whole brain blood flow modelling in a routine and automated fashion.

Program summary

Program title: Application Hosting Environment 2.0
Catalogue identifier: AEEJ_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEJ_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GNU Public Licence, Version 2
No. of lines in distributed program, including test data, etc.: not applicable
No. of bytes in distributed program, including test data, etc.: 1 685 603 766
Distribution format: tar.gz
Programming language: Perl (server), Java (client)
Computer: x86
Operating system: Linux (server); Linux/Windows/MacOS (client)
RAM: 134 217 728 bytes (server), 67 108 864 bytes (client)
Classification: 6.5
External routines: VirtualBox (server), Java (client)
Nature of problem: The middleware that makes grid computing possible has been found by many users to be too unwieldy, and presents an obstacle to use rather than providing assistance [1,2]. Such problems are compounded when one attempts to harness the power of a grid, or a federation of different grids, rather than just a single resource on the grid.
Solution method: To address the above problem, we have developed AHE, a lightweight interface designed to simplify the process of running scientific codes on a grid of HPC and local resources. AHE does this by introducing a layer of middleware between the user and the grid, which encapsulates much of the complexity associated with launching grid applications.
Unusual features: The server is distributed as a VirtualBox virtual machine. VirtualBox (http://www.virtualbox.org) must be downloaded and installed in order to run the AHE server virtual machine. Details of how to do this are given in the AHE 2.0 Quick Start Guide.
Running time: Not applicable
References:
[1] J. Chin, P.V. Coveney, Towards tractable toolkits for the grid: A plea for lightweight, useable middleware, NeSC Technical Report, 2004, http://nesc.ac.uk/technical_papers/UKeS-2004-01.pdf.
[2] P.V. Coveney, R.S. Saksena, S.J. Zasada, M. McKeown, S. Pickles, The Application Hosting Environment: Lightweight middleware for grid-based computational science, Computer Physics Communications 176 (2007) 406–418.

8.
We present HONEI, an open-source collection of libraries offering a hardware-oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a finite element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a two-fold speed-up over straightforward C++ code using HONEI's SSE backend, and a further 3–4× and 4–16× faster execution on the Cell processor and a GPU, respectively. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for the development and evaluation of such kernels, significantly simplifying their development.

Program summary

Program title: HONEI
Catalogue identifier: AEDW_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDW_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPLv2
No. of lines in distributed program, including test data, etc.: 216 180
No. of bytes in distributed program, including test data, etc.: 1 270 140
Distribution format: tar.gz
Programming language: C++
Computer: x86, x86_64, NVIDIA CUDA GPUs, Cell blades and PlayStation 3
Operating system: Linux
RAM: at least 500 MB free
Classification: 4.8, 4.3, 6.1
External routines: SSE: none; [1] for the GPU backend, [2] for the Cell backend
Nature of problem: Computational science in general and numerical simulation in particular have reached a turning point. The revolution developers are facing is not primarily driven by a change in (problem-specific) methodology, but rather by the fundamental paradigm shift of the underlying hardware towards heterogeneity and parallelism. This is particularly relevant for data-intensive problems stemming from discretisations with local support, such as finite differences, volumes and elements.
Solution method: To address these issues, we present a hardware-aware collection of libraries combining the advantages of modern software techniques and hardware-oriented programming. Applications built on top of these libraries can be configured trivially to execute on CPUs, GPUs or the Cell processor. In order to evaluate the performance and accuracy of our approach, we provide two domain-specific applications: a multigrid solver for the Poisson problem and a fully explicit solver for the 2D shallow water equations.
Restrictions: HONEI is actively being developed, and its feature list is continuously expanded. Not all combinations of operations and architectures might be supported in earlier versions of the code. Obtaining snapshots from http://www.honei.org is recommended.
Unusual features: The considered applications as well as all library operations can be run on NVIDIA GPUs and the Cell BE.
Running time: Depends on the application and the input sizes. The Poisson solver executes in a few seconds, while the SWE solver requires up to 5 minutes for large spatial discretisations or small timesteps.
References:
[1] http://www.nvidia.com/cuda.
[2] http://www.ibm.com/developerworks/power/cell.
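The central design idea, writing an operation once and selecting a hardware backend at configuration time, can be mirrored conceptually as below. This is only a conceptual sketch in Python with invented names, not HONEI's actual C++ API.

```python
# Conceptual sketch only (invented names, not HONEI's API): each operation is
# registered once per backend and the application selects a backend by name.
import numpy as np

BACKENDS = {}

def backend(name):
    def register(fn):
        BACKENDS.setdefault(fn.__name__, {})[name] = fn
        return fn
    return register

@backend("cpu")
def scaled_sum(a, b, alpha):            # plain reference implementation
    return a + alpha * b

@backend("vectorised")
def scaled_sum(a, b, alpha):            # stand-in for an SSE/GPU/Cell kernel
    return np.add(a, np.multiply(alpha, b))

def dispatch(op_name, backend_name, *args):
    return BACKENDS[op_name][backend_name](*args)

a, b = np.arange(4.0), np.ones(4)
print(dispatch("scaled_sum", "vectorised", a, b, 2.0))   # [2. 3. 4. 5.]
```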

9.
We describe the hardwired implementation of algorithms for Monte Carlo simulations of a large class of spin models. We have implemented these algorithms as VHDL codes and mapped them onto a dedicated processor based on a large FPGA device. The measured performance of one such processor is comparable to that of O(100) carefully programmed high-end PCs, and turns out to be even better for some selected spin models. We describe here the codes that we are currently executing on the IANUS massively parallel FPGA-based system.
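One reason spin-model Monte Carlo maps well onto hardwired engines is that only a handful of Metropolis acceptance probabilities can occur, so they can be precomputed and looked up rather than evaluated. The sketch below shows this table-driven update in plain software for the 2D Ising model; it is illustrative only and is not the IANUS VHDL code.

```python
# Software sketch of a table-driven Metropolis update for the 2D Ising model,
# illustrating why such kernels suit FPGAs (not the IANUS VHDL code): only a
# few acceptance probabilities exist, so they are precomputed once.
import numpy as np

def metropolis_ising_sweep(spins, beta, rng):
    n = spins.shape[0]
    # The local field over 4 neighbours is in {-4,-2,0,2,4}; dE = 2*s*field,
    # so only dE = 4 and dE = 8 ever need an exponential.
    accept = {d: np.exp(-beta * d) for d in (4, 8)}
    for _ in range(spins.size):
        i, j = rng.integers(0, n, size=2)
        field = (spins[(i + 1) % n, j] + spins[(i - 1) % n, j]
                 + spins[i, (j + 1) % n] + spins[i, (j - 1) % n])
        d_energy = 2 * spins[i, j] * field
        if d_energy <= 0 or rng.random() < accept[d_energy]:
            spins[i, j] = -spins[i, j]
    return spins

rng = np.random.default_rng(2)
spins = rng.choice([-1, 1], size=(64, 64))
metropolis_ising_sweep(spins, beta=0.5, rng=rng)
print(spins.mean())
```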

10.
In order to model complex heterogeneous biophysical macrostructures with non-trivial charge distributions, such as globular proteins in water, it is important to evaluate the long-range forces present in these systems accurately and efficiently. The Smooth Particle Mesh Ewald summation technique (SPME) is commonly used to determine the long-range part of the electrostatic energy in large-scale molecular simulations. While the SPME technique does not give rise to a performance bottleneck on a single processor, current implementations of SPME on massively parallel supercomputers become problematic at large processor counts, limiting the time and length scales that can be reached. Here, a synergistic investigation involving method improvement, parallel programming and novel architectures is employed to address this difficulty. A relatively simple modification of the SPME technique is described which gives rise to both improved accuracy and efficiency on both massively parallel and scalar computing platforms. Our fine-grained parallel implementation of the modified SPME method for the novel QCDOC supercomputer with its 6D-torus architecture is then given. Numerical tests of algorithm performance on up to 1024 processors of the QCDOC machine at BNL are presented for two systems of interest: a β-hairpin solvated in explicit water, a system which consists of 1142 water molecules and a 20-residue protein for a total of 3579 atoms, and the HIV-1 protease solvated in explicit water, a system which consists of 9331 water molecules and a 198-residue protein for a total of 29,508 atoms.
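For orientation, the quantity that SPME accelerates is the reciprocal-space part of the Ewald sum. A direct (non-mesh) evaluation for a small periodic box of point charges can be sketched as follows; this is illustrative only, in reduced Gaussian-like units, far slower than SPME, and not the paper's modified algorithm.

```python
# Direct (non-mesh) reciprocal-space Ewald sum for a small cubic box of point
# charges, in reduced units: the quantity SPME evaluates much faster on a mesh.
# Illustrative sketch only, not the modified SPME algorithm of the paper.
import numpy as np

def ewald_reciprocal_energy(positions, charges, box_length, alpha, k_max=5):
    volume = box_length ** 3
    energy = 0.0
    for nx in range(-k_max, k_max + 1):
        for ny in range(-k_max, k_max + 1):
            for nz in range(-k_max, k_max + 1):
                if nx == ny == nz == 0:
                    continue
                k = 2.0 * np.pi * np.array([nx, ny, nz]) / box_length
                k2 = k @ k
                structure_factor = np.sum(charges * np.exp(1j * positions @ k))
                energy += (np.exp(-k2 / (4.0 * alpha ** 2)) / k2
                           * abs(structure_factor) ** 2)
    return 2.0 * np.pi / volume * energy

rng = np.random.default_rng(3)
positions = rng.random((8, 3)) * 10.0
charges = np.array([1.0, -1.0] * 4)
print(ewald_reciprocal_energy(positions, charges, box_length=10.0, alpha=0.3))
```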

11.
Given the resurgent attractiveness of single-instruction-multiple-data (SIMD) processing, it is important for high-performance computing applications to be SIMD-capable. The Hartree-Fock SCF (HF-SCF) application, in its canonical form, cannot fully exploit SIMD processing. Prior attempts to implement Electron Repulsion Integral (ERI) sorting functionality to essentially “SIMD-ify” the HF-SCF application have been frustrated by the low throughput of the sorting functionality. With greater awareness of computer architecture, we discuss how the sorting functionality may be practically implemented to provide high performance. Overall system performance analysis, including memory locality analysis, is also conducted, and further emphasises that a system with ERI sorting is capable of very high throughput. We discuss two alternative implementation options, with one immediately accessible software-based option discussed in detail. The impact of workload characteristics on expected performance is also discussed, and it is found that, in general, as basis set size increases the potential performance of the system also increases. Consideration is given to conventional CPUs, GPUs, FPGAs, and the Cell Broadband Engine architecture.

12.
The efficiency and scalability of traditional parallel force-decomposition (FD) algorithms are poor because of the high communication cost introduced when the skew-symmetric character of the force matrix is exploited. This paper proposes a new parallel algorithm called UTFBD (Under Triangle Force Block Decomposition), which is based on a new, efficient force matrix decomposition strategy. This strategy decomposes only the lower (under) triangle of the force matrix and greatly reduces parallel communication cost; for example, the communication cost of the UTFBD algorithm is only one third of that of Taylor's FD algorithm. The UTFBD algorithm is implemented on a cluster system and applied to a physical nucleation problem with 500,000 particles. Numerical results are analyzed and compared among three algorithms, namely FRI, Taylor's FD and UTFBD. The efficiency of UTFBD on 105 processors is 41.3%, while the efficiencies of FRI and Taylor's FD on 100 processors are 4.3% and 35.2%, respectively. In other words, the efficiency of UTFBD on about 100 processors is 37.0 and 6.1 percentage points higher than that of FRI and Taylor's FD, respectively. Results show that UTFBD increases the efficiency of parallel MD (Molecular Dynamics) simulation and has better scalability.
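The essence of the decomposition is that only the lower triangle of the (skew-symmetric) force matrix is distributed, with the mirror blocks recovered from Newton's third law. The sketch below shows one simple way to assign lower-triangle blocks to processes; the round-robin mapping is illustrative and not necessarily the paper's exact UTFBD assignment.

```python
# Sketch of distributing only the lower-triangle blocks of the block force
# matrix over P processes (illustrative round-robin mapping). The mirror block
# F[j][i] = -F[i][j] is never stored or recomputed.
def lower_triangle_blocks(n_blocks):
    """Yield (row, col) indices of blocks on or below the diagonal."""
    for row in range(n_blocks):
        for col in range(row + 1):
            yield row, col

def assign_blocks(n_blocks, n_procs):
    owner = {}
    for rank, block in enumerate(lower_triangle_blocks(n_blocks)):
        owner[block] = rank % n_procs           # round-robin over ranks
    return owner

owner = assign_blocks(n_blocks=8, n_procs=4)
print(sum(1 for b in owner if owner[b] == 0), "blocks on rank 0 of",
      len(owner), "total")
```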

13.
A linear-scaling algorithm has been developed to perform large-scale molecular-dynamics (MD) simulations, in which interatomic forces are computed quantum mechanically in the framework of density functional theory. A divide-and-conquer algorithm is used to compute the electronic structure, where the non-additive contribution to the kinetic energy is included with an embedded cluster scheme. Electronic wave functions are represented on a real-space grid, which is augmented with coarse multigrids to accelerate the convergence of iterative solutions and with adaptive fine grids around atoms to accurately calculate ionic pseudopotentials. Spatial decomposition is employed to implement the hierarchical-grid algorithm on massively parallel computers. A converged solution to the electronic-structure problem is obtained for a 32,768-atom amorphous CdSe system on 512 IBM POWER4 processors. The total energy is well conserved during MD simulations of liquid Rb, showing the applicability of this algorithm to first-principles MD simulations. The parallel efficiency is 0.985 on 128 Intel Xeon processors for a 65,536-atom CdSe system.

14.
We present a data mining technique for the analysis of multichannel oscillatory timeseries data and show an application using poloidal arrays of magnetic sensors installed in the H-1 heliac. The procedure is highly automated and scales well to large datasets. The timeseries data is split into short time segments to provide time resolution, and each segment is represented by a singular value decomposition (SVD). By comparing power spectra of the temporal singular vectors, related singular values are grouped into subsets which define fluctuation structures. Thresholds for the normalised energy of a fluctuation structure and the normalised entropy of the SVD can be used to filter the dataset. We assume that distinct classes of fluctuations are localised in the space of phase differences Δψ(n,n+1) between each pair of nearest-neighbour channels. An expectation maximisation clustering algorithm is used to locate the distinct classes of fluctuations and assign mode numbers where possible, and a cluster tree mapping is used to visualise the results.
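The segment-by-segment SVD representation can be sketched as follows with synthetic data. This is a minimal illustration of the idea, not the H-1 analysis code: each short time segment of the multichannel signal is factorised, and the normalised singular-value energies and the spectra of the temporal singular vectors are the quantities that later get grouped and clustered.

```python
# Minimal sketch (not the H-1 analysis code): split a multichannel timeseries
# into short segments and represent each segment by its SVD; the temporal
# singular vectors' power spectra are what get grouped into fluctuation structures.
import numpy as np

def segment_svd(signals, segment_length):
    """signals: array of shape (n_channels, n_samples)."""
    n_channels, n_samples = signals.shape
    features = []
    for start in range(0, n_samples - segment_length + 1, segment_length):
        segment = signals[:, start:start + segment_length]
        u, s, vt = np.linalg.svd(segment, full_matrices=False)
        energy = s ** 2 / np.sum(s ** 2)            # normalised energy per structure
        spectrum = np.abs(np.fft.rfft(vt[0])) ** 2  # leading temporal singular vector
        features.append((energy, spectrum))
    return features

t = np.arange(4096) / 1000.0
phases = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)     # 16-channel array
signals = np.sin(2.0 * np.pi * 30.0 * t[None, :] + 3.0 * phases[:, None])
features = segment_svd(signals, segment_length=256)
print(len(features), features[0][0][0])   # number of segments, leading energy
```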

15.
16.
17.
Accurate and efficient automatic or semi-automatic brain image segmentation methods are of great interest to both scientific and clinical researchers of the human central nervous system. Cerebral white matter segmentation in brain Magnetic Resonance Imaging (MRI) data is a challenging problem due to a combination of several factors: low contrast, the presence of noise and imaging artifacts, partial volume effects, intrinsic tissue variation due to neurodevelopment and neuropathologies, and the highly convoluted geometry of the cortex. In this paper, we propose a new set of edge weights for the traditional graph cut algorithm (Boykov and Jolly, 2001) to correctly segment the cerebral white matter from a T1-weighted MRI sequence. In this algorithm, the edge weights of Boykov and Jolly (2001) are modified by comparing the probabilities of an individual voxel and its neighboring voxels to belong to different segmentation classes. A shape prior in the form of a series of ellipses is then used to model the contours of the human skull in the various 2D slices in the sequence. This shape constraint is imposed to prune the original graph constructed from the input, forming a subgraph consisting of voxels within the skull contours. Our graph cut algorithm with the new set of edge weights is applied to this subgraph, thereby increasing the segmentation accuracy as well as decreasing the computation time. Average segmentation errors for the proposed algorithm, the graph cut algorithm (Boykov and Jolly, 2001), and the Expectation Maximization Segmentation (EMS) algorithm (Van Leemput et al., 2001), in terms of Dice coefficients, are found to be (3.72 ± 1.12)%, (14.88 ± 1.69)%, and (11.95 ± 5.2)%, respectively.
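The reported errors are expressed via the Dice coefficient, which for two binary segmentations A and B is 2|A∩B|/(|A|+|B|). A minimal sketch of its computation:

```python
# Minimal sketch of the Dice similarity coefficient used to quantify the
# overlap between a computed segmentation and the ground truth.
import numpy as np

def dice_coefficient(segmentation, ground_truth):
    segmentation = segmentation.astype(bool)
    ground_truth = ground_truth.astype(bool)
    intersection = np.logical_and(segmentation, ground_truth).sum()
    return 2.0 * intersection / (segmentation.sum() + ground_truth.sum())

seg = np.zeros((10, 10), dtype=bool); seg[2:8, 2:8] = True
gt = np.zeros((10, 10), dtype=bool);  gt[3:9, 3:9] = True
print(f"Dice = {dice_coefficient(seg, gt):.3f}")   # 25 overlapping of 36 + 36 voxels
```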

18.
A scalable parallel algorithm has been designed to perform multimillion-atom molecular dynamics (MD) simulations, in which first principles-based reactive force fields (ReaxFF) describe chemical reactions. Environment-dependent bond orders associated with atomic pairs and their derivatives are reused extensively with the aid of linked-list cells to minimize the computation associated with atomic n-tuple interactions (n ≤ 4 explicitly and ≤ 6 due to chain-rule differentiation). These n-tuple computations are made modular, so that they can be reconfigured effectively with a multiple time-step integrator to further reduce the computation time. Atomic charges are updated dynamically with an electronegativity equalization method, by iteratively minimizing the electrostatic energy with the charge-neutrality constraint. The ReaxFF-MD simulation algorithm has been implemented on parallel computers based on a spatial decomposition scheme combined with distributed n-tuple data structures. The measured parallel efficiency of the parallel ReaxFF-MD algorithm is 0.998 on 131,072 IBM BlueGene/L processors for a 1.01 billion-atom RDX system.
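The linked-list cell technique referred to above is the standard way to find near neighbours in O(N): atoms are binned into cells no smaller than the interaction cutoff, so each atom only needs to examine atoms in its own and the surrounding cells. A minimal construction sketch, not the paper's distributed data structure:

```python
# Sketch of building linked-list cells for neighbour finding in O(N).
# Illustrative only; not the paper's distributed n-tuple data structures.
import numpy as np

def build_cell_lists(positions, box, cutoff):
    n_cells = max(1, int(box // cutoff))          # cells at least one cutoff wide
    cell_size = box / n_cells
    head = -np.ones((n_cells, n_cells, n_cells), dtype=int)   # first atom per cell
    next_atom = -np.ones(len(positions), dtype=int)           # per-atom linked list
    for i, r in enumerate(positions):
        c = tuple((r // cell_size).astype(int) % n_cells)
        next_atom[i] = head[c]
        head[c] = i
    return head, next_atom

rng = np.random.default_rng(4)
pos = rng.random((1000, 3)) * 20.0
head, nxt = build_cell_lists(pos, box=20.0, cutoff=2.5)
print((head >= 0).sum(), "occupied cells")
```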

19.
Yu Liu, Yi Zhang, Yong Gao, Information Sciences 178 (9) (2008) 2163–2175
A data model, named generalized network (GNet), is proposed to perform various network-tracing tasks, especially tracing conceptual proposition networks in qualitative spatial reasoning (QSR). The GNet model can be defined as a 6-tuple (V, A, q, ⊕, ∼, L). By specifying each element of the 6-tuple, a GNet can function as a conventional network, an activity-on-edge (AOE) network, etc. The algorithm for searching for the generalized optimum path weight (GOPW) between two vertices in a GNet is developed by extending the Bellman-Ford algorithm (EBFA). Based on the GNet model, this paper focuses on representing spatial knowledge, which consists of a set of binary relations. We present two applications of GNets, namely the RCC8 network and the hybrid RCC8 network involving cardinal direction relations. Both can be traced to infer new spatial knowledge using EBFA. The applications demonstrate that the GNet model provides a promising approach to dealing with proposition-based geospatial knowledge based on weak composition. We also point out that EBFA can check whether a network is algebraically closed, or path-consistent when the corresponding composition table is extensional.
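A Bellman-Ford extension of this kind can be sketched generically: the usual '+' and 'min' are replaced by a user-supplied composition operator (the ⊕ of the tuple) and a preference ordering, so the same relaxation loop traces different kinds of networks. The sketch below is illustrative, with hypothetical operator names, and is not the paper's exact EBFA formulation.

```python
# Generic Bellman-Ford-style relaxation with pluggable "combine" (a stand-in
# for the composition operator) and "better" (preference ordering) operations.
# Illustrative sketch only; names are hypothetical, not the paper's EBFA.
def generalized_bellman_ford(n_vertices, edges, source, combine, better, identity):
    best = {source: identity}
    for _ in range(n_vertices - 1):            # standard |V|-1 relaxation passes
        updated = False
        for u, v, w in edges:
            if u in best:
                candidate = combine(best[u], w)
                if v not in best or better(candidate, best[v]):
                    best[v] = candidate
                    updated = True
        if not updated:
            break
    return best

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 5.0)]
# Instantiated as the ordinary shortest-path problem: combine = +, better = <.
print(generalized_bellman_ford(4, edges, 0,
                               combine=lambda a, b: a + b,
                               better=lambda a, b: a < b,
                               identity=0.0))   # {0: 0.0, 1: 3.0, 2: 1.0, 3: 8.0}
```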

20.
An algorithm has been designed to search for the escape paths with the lowest activation barriers starting from a local minimum-energy configuration of a many-atom system. The pathfinder algorithm combines: (1) a steered eigenvector-following method that guides a constrained escape from the convex region and subsequently climbs to a transition state tangentially to the eigenvector corresponding to the lowest negative Hessian eigenvalue; (2) discrete abstraction of the atomic configuration to systematically enumerate concerted events as linear combinations of atomistic events; (3) evolutionary control of the population dynamics of low activation-barrier events; and (4) hybrid task + spatial decompositions to implement a massive search for complex events on parallel computers. The program exhibits good scalability on parallel computers and has been used to study concerted bond-breaking events in the fracture of alumina.
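The heart of step (1) can be illustrated with a toy numerical example: find the lowest Hessian eigenvalue and its eigenvector at the current configuration, then step uphill along that direction and downhill in all others, climbing toward a saddle point. The code below is a schematic iteration on an analytic 2D surface, not the paper's pathfinder algorithm.

```python
# Schematic eigenvector-following iteration on an analytic 2D energy surface
# (not the paper's pathfinder code): reverse the force component along the
# eigenvector of the lowest Hessian eigenvalue to climb toward a saddle point.
import numpy as np

def energy(x):
    return (x[0] ** 2 - 1.0) ** 2 + 2.0 * x[1] ** 2   # double well in x, harmonic in y

def gradient(x, h=1e-5):
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = h
        g[i] = (energy(x + e) - energy(x - e)) / (2.0 * h)
    return g

def hessian(x, h=1e-4):
    hess = np.zeros((2, 2))
    for i in range(2):
        e = np.zeros(2); e[i] = h
        hess[:, i] = (gradient(x + e) - gradient(x - e)) / (2.0 * h)
    return 0.5 * (hess + hess.T)

x = np.array([0.6, 0.1])                         # start near one minimum
for _ in range(50):
    evals, evecs = np.linalg.eigh(hessian(x))
    soft_mode = evecs[:, 0]                      # lowest-eigenvalue direction
    g = gradient(x)
    # Reverse the force along the soft mode, follow it in the other directions.
    step = -(g - 2.0 * (g @ soft_mode) * soft_mode)
    x += 0.05 * step
print(x)                                         # approaches the saddle near (0, 0)
```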
