Similar literature
Found 20 similar documents (search time: 0 ms)
1.
Agent-based models, an emerging paradigm for simulating complex systems, appear well suited to parallel processing. However, while parallelizing a simulator of financial markets, we found that some features of these codes expose non-trivial issues in present hardware/software platforms for parallel processing. Here we present the results of a series of tests, on different platforms, of simplified codes that reproduce these problems and can serve as a starting point in the search for a possible solution.

2.
The extended full-potential (FPX) helicopter rotor computational fluid dynamics (CFD) code, written in Fortran, is successfully converted in its reduced two-dimensional version into a parallel version for multiprocessing. The FPX code, which has an internal grid generator, solves the compressible full-potential equation using an approximately factored finite-difference scheme with numerous added physical modeling enhancements, including viscous boundary layers, shock-induced entropy corrections and wake-vortex embedding. The parallel version of the code uses Open Multi-Processing (OpenMP) directives as the parallel programming tool in a shared-memory (SM) environment. The OpenMP code is portable and scalable, and can run on various computer platforms, including UNIX and Windows NT platforms. A performance study of the parallel code on an SGI Origin 2000 UNIX platform is presented. The results show that reasonable speedups are obtained through parallelization and that OpenMP is an easy-to-use and efficient parallel programming tool for the present problem.

3.
A new parallel normalized optimized approximate inverse algorithm, based on the concept of antidiagonal wave pattern, for computing classes of explicitly approximate inverses, is introduced for symmetric multiprocessor systems. The parallel normalized explicit approximate inverses are used in conjunction with parallel normalized explicit preconditioned conjugate gradient schemes for the efficient solution of finite element sparse linear systems. The parallel design and implementation issues of the new algorithm are discussed and the parallel performance is presented using OpenMP. Copyright © 2008 John Wiley & Sons, Ltd.

4.
The NAME Atmospheric Dispersion Model is a Lagrangian particle model used by the Met Office to predict the propagation and spread of pollutants in the atmosphere. The model is routinely used in emergency response applications, where it is important to obtain results as quickly as possible. This requirement for a short runtime and the increase in the number of cores of commonly available CPUs, such as the Intel Xeon series, have motivated the parallelisation of NAME in the OpenMP shared memory framework. In this work we describe the implementation of this parallelisation strategy in NAME and discuss the performance of the model for different setups. Due to the independence of the model particles, the parallelisation of the main compute-intensive loops is relatively straightforward. The random number generator for modelling sub-grid scale turbulent motion needs to be adapted to ensure that different particles use independent sets of random numbers. We find that on Intel Xeon X5680 CPUs the model shows very good strong scaling up to 12 cores in a realistic emergency response application for predicting the dispersion of volcanic ash in the North Atlantic airspace. We implemented a mechanism for asynchronous reading of meteorological data from disk and demonstrate how this can reduce the runtime if disk access plays a significant role in a model run. To explore the performance on different chip architectures we also ported the part of the code which is used for calculating the gamma dose from a cloud of radioactive particles to a graphics processing unit (GPU) using CUDA-C. We were able to demonstrate a significant speedup of around one order of magnitude relative to the serial CPU version.
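The requirement above, that different particles draw from independent sets of random numbers, can be illustrated with a minimal Python sketch. It is not the NAME implementation: the names `BASE_SEED`, `particle_rng`, `turbulent_displacement` and `advance_all` are hypothetical, and giving each particle its own seeded generator is only one simple way to make a particle's random draws independent of the order (or thread) in which particles are processed.

```python
import random

BASE_SEED = 12345  # hypothetical run-level seed, chosen for illustration


def particle_rng(particle_id, base_seed=BASE_SEED):
    """Return a private generator for one particle.

    Assumption: distinct integer seeds give distinct, reproducible streams.
    This is a simplification and not a proof of statistical independence.
    """
    return random.Random(base_seed * 1_000_003 + particle_id)


def turbulent_displacement(particle_id, n_steps, sigma=1.0):
    """Sum n_steps Gaussian kicks drawn from the particle's own stream."""
    rng = particle_rng(particle_id)
    return sum(rng.gauss(0.0, sigma) for _ in range(n_steps))


def advance_all(particle_ids, n_steps):
    """Process particles in whatever order they arrive."""
    return {pid: turbulent_displacement(pid, n_steps) for pid in particle_ids}


if __name__ == "__main__":
    forward = advance_all(range(4), 100)
    backward = advance_all(reversed(range(4)), 100)
    # Each particle's result does not depend on the processing order.
    print(forward == backward)
```

Because no generator state is shared, the loop over particles has no ordering dependence, which is the property a shared-memory parallel loop over particles needs.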

5.
We present the development of a novel high‐performance face detection system using a neural network‐based classification algorithm and an efficient parallelization with OpenMP. We discuss the design of the system in detail along with experimental assessment. Our parallelization strategy starts with one level of threads and moves to the exploitation of nested parallel regions in order to further improve, by up to 19%, the image‐processing capability. The presented system is able to process images in real time (38 images/sec) by sustaining almost linear speedups on a system with a quad‐core processor and a particular OpenMP runtime library. Copyright © 2009 John Wiley & Sons, Ltd.

6.
Direct volume visualization is an important method in many areas, including computational fluid dynamics and medicine. Achieving interactive rates for direct volume rendering of large unstructured volumetric grids is a challenging problem, but parallelizing direct volume rendering algorithms can help achieve this goal. Using Compute Unified Device Architecture (CUDA), we propose a GPU-based volume rendering algorithm that itself is based on a cell projection-based ray-casting algorithm designed for CPU implementations. We also propose a multicore parallelized version of the cell-projection algorithm using OpenMP. In both algorithms, we favor image quality over rendering speed. Our algorithm has a low memory footprint, allowing us to render large datasets. Our algorithm supports progressive rendering. We compared the GPU implementation with the serial and multicore implementations. We observed significant speed-ups that, together with progressive rendering, enable reaching interactive rates for large datasets.

7.
The rapidly increasing number of cores in modern microprocessors is pushing the current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems - distributed memory across nodes and shared memory with non-uniform memory access within each node - poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems - a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems including an SGI Altix 4700, an IBM p575+ and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.

8.
Genetic algorithms (GAs) have been applied to solve the 2-page crossing number problem successfully, but since they work with one global population, the search time and space are limited. Parallelisation provides an attractive prospect to improve the efficiency and solution quality of GAs. This paper investigates the complexity of parallel genetic algorithms (PGAs) based on two evaluation measures: computation time to communication time and population size to chromosome size. Moreover, the paper unifies the framework of PGA models with the function PGA (subpopulation size, cluster size, migration period, topology), and explores the performance of PGAs for the 2-page crossing number problem.

9.
The Journal of Supercomputing - We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) to predict parallel applications’ performance running on a...

10.
JPEG 2000 and MPEG-4 Visual Texture Coding (VTC) are both wavelet-based and state of the art in still image coding. In this paper we show sequential as well as parallel strategies for speeding up two selected implementations of MPEG-4 VTC and JPEG 2000 using the popular shared memory programming paradigm OpenMP. Furthermore, we discuss the sequential and parallel performance of the improved versions and compare the efficiency of both algorithms.

11.
The Design of OpenMP Tasks   Cited by: 2 (self-citations: 0, citations by others: 2)
OpenMP has been very successful in exploiting structured parallelism in applications. With increasing application complexity, there is a growing need for addressing irregular parallelism in the presence of complicated control structures. This is evident in various efforts by the industry and research communities to provide a solution to this challenging problem. One of the primary goals of OpenMP 3.0 is to define a standard dialect to express and efficiently exploit unstructured parallelism. This paper presents the design of the OpenMP tasking model by members of the OpenMP 3.0 tasking sub-committee which was formed for this purpose. The paper summarizes the efforts of the sub-committee (spanning over two years) in designing, evaluating and seamlessly integrating the tasking model into the OpenMP specification. In this paper, we present the design goals and key features of the tasking model, including a rich set of examples and an in-depth discussion of the rationale behind various design choices. We compare a prototype implementation of the tasking model with existing models, and evaluate it on a wide range of applications. The comparison shows that the OpenMP tasking model provides expressiveness, flexibility, and huge potential for performance and scalability.

12.
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared-memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks system uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intranode hardware shared memory. We present performance results for seven applications (Barnes-Hut, CLU, and Water from SPLASH-2, 3D-FFT from NAS, Red-Black SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the thread implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7–30% of the MPI versions.

13.
This paper investigates a high-performance implementation of an Arbitrary Lagrangian Eulerian (ALE) moving mesh technique on shared-memory systems using OpenMP. Moving mesh techniques are an integral part of a wider class of fluid mechanics problems that involve moving and deforming spatial domains, namely free-surface flows and Fluid Structure Interaction (FSI). The moving mesh technique adopted in this work is based on node relocation, subject to a certain evolution as well as constraint conditions. A preconditioned conjugate gradient method is employed to solve the resulting system of equations. The proposed algorithm first reorders the mesh using an efficient divide-and-conquer approach and then parallelizes the ALE moving mesh scheme. Numerical simulations are conducted on multicore AMD Opteron and Intel Xeon processors, and unstructured triangular and tetrahedral meshes are used for the 2D and 3D problems. The quality of the generated meshes is checked by comparing the element Jacobians in the reference and current meshes, and by keeping track of the change in the interior angles of triangles and tetrahedra. Overall, efficiencies in terms of speedup of 51% and 72% are achieved for the parallel mesh reordering and ALE moving mesh algorithms, respectively.

14.
The aim of this paper is to evaluate OpenMP, TBB and Cilk Plus as basic language-based tools for simple and efficient parallelization of recursively defined computational problems and other problems that need both task and data parallelization techniques. We show how to use these models of parallel programming to transform the source code of Adaptive Simpson’s Integration into programs that can utilize multiple cores of modern processors. Using the example of the Bellman–Ford algorithm for solving single-source shortest path problems, we advise how to improve the performance of data parallel algorithms by tuning data structures for better utilization of the vector extensions of modern processors. Manual vectorization techniques based on Cilk array notation and intrinsics are presented. We also show how to simplify such optimization using Intel SIMD Data Layout Template containers.
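For readers unfamiliar with the recursively defined problem mentioned above, the following is a minimal serial sketch of adaptive Simpson integration in Python. It is not the code from the paper: the function names, the tolerance-halving rule and the `max_depth` guard are assumptions chosen for illustration. The two recursive calls at the end of `_adaptive` are independent of each other, which is the structure that task-based models such as OpenMP tasks, TBB or Cilk Plus would typically exploit.

```python
import math


def _simpson(f, a, fa, b, fb):
    """Simpson's rule on [a, b]; returns the midpoint, f(midpoint), estimate."""
    m = 0.5 * (a + b)
    fm = f(m)
    return m, fm, (b - a) / 6.0 * (fa + 4.0 * fm + fb)


def _adaptive(f, a, fa, b, fb, m, fm, whole, eps, depth):
    """Refine [a, b] until the two-half estimate agrees with the whole."""
    lm, flm, left = _simpson(f, a, fa, m, fm)
    rm, frm, right = _simpson(f, m, fm, b, fb)
    delta = left + right - whole
    if depth <= 0 or abs(delta) <= 15.0 * eps:
        # Richardson-style correction of the refined estimate.
        return left + right + delta / 15.0
    # The two halves share no state, so each could run as a separate task.
    return (_adaptive(f, a, fa, m, fm, lm, flm, left, eps / 2.0, depth - 1)
            + _adaptive(f, m, fm, b, fb, rm, frm, right, eps / 2.0, depth - 1))


def adaptive_simpson(f, a, b, eps=1e-9, max_depth=40):
    """Integrate f over [a, b] with adaptive interval subdivision."""
    fa, fb = f(a), f(b)
    m, fm, whole = _simpson(f, a, fa, b, fb)
    return _adaptive(f, a, fa, b, fb, m, fm, whole, eps, max_depth)


if __name__ == "__main__":
    # The exact value of the integral of sin(x) over [0, pi] is 2.
    print(adaptive_simpson(math.sin, 0.0, math.pi))
```

The recursion depth varies with how smooth the integrand is in each subinterval, so the work is irregular; this is why a task model fits the problem better than a fixed loop partition.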

15.
16.
17.
Protein secondary structure prediction has a fundamental influence on today’s bioinformatics research. In this work, tertiary classifiers for protein secondary structure prediction are implemented on the Denoeux Belief Neural Network (DBNN) architecture. The hydrophobicity matrix, orthogonal matrix, BLOSUM62 matrix and PSSM matrix are tested separately as encoding schemes for DBNN. The hydrophobicity, BLOSUM62 and PSSM matrices are applied to the DBNN architecture for the first time. The experimental results contribute to the design of new encoding schemes. The accuracy of our tertiary classifier with the PSSM encoding scheme reaches 72.01%, which is almost 10% better than the previous results obtained in 2003. Because training the neural networks is time-consuming, Pthread and OpenMP are employed to parallelize DBNN on the Hyper-Threading enabled Intel architecture. The speedup for 16 Pthreads is 4.9 and the speedup for 16 OpenMP threads is 4 on the 4-processor shared memory architecture. The speedups of both OpenMP and Pthread are superior to those reported in other research. With the new parallel training algorithm, thousands of amino acids can be processed in a reasonable amount of time. Our research also shows that Hyper-Threading technology for the Intel architecture is efficient for parallel biological algorithms.
Yi Pan (Corresponding author)
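The speedup figures quoted above can be put in context with the standard definition of parallel efficiency, speedup divided by the number of workers. The short Python sketch below is only an arithmetic illustration: the helper name `parallel_efficiency` is hypothetical, and the abstract does not say whether the authors normalize by the 16 threads or by the 4 processors, so both are shown.

```python
def parallel_efficiency(speedup, workers):
    """Parallel efficiency E = S / p for speedup S on p workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


if __name__ == "__main__":
    # Speedups taken from the abstract: 4.9 (Pthreads) and 4 (OpenMP).
    for label, speedup in (("Pthreads", 4.9), ("OpenMP", 4.0)):
        per_thread = parallel_efficiency(speedup, 16)    # 16 software threads
        per_processor = parallel_efficiency(speedup, 4)  # 4 physical processors
        print(f"{label}: {per_thread:.3f} per thread, "
              f"{per_processor:.3f} per processor")
```

A per-processor value above 1 is possible here only because Hyper-Threading lets each processor run more than one thread at a time.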

18.
This paper deals with the parallelization of fire spreading simulations that follow detailed physical experiments. The proposal presented here has been tested and evaluated in collaboration with physicists to meet their requirements in terms of both performance and precision. For this purpose, an object-oriented framework using two abstraction levels has been developed. A first level considers the simulation as a global phenomenon which evolves in space and time. A local level describes the phenomena occurring on elementary parts of the domain. In order to develop an extensible and modular architecture, the cellular automata paradigm, the DEVS discrete event system formalism and design patterns have been used. Simulation treatments are limited to a set of active elements to improve execution times. A new kind of model, called Active-DEVS, is then specified. The model is computed with a fine-grain parallelization that is very efficient on present-day multi-core processors, which are the elementary units of modern computing clusters and computing grids. In this paper, the parallelization with Open MultiProcessing (OpenMP) standard directives on Symmetric MultiProcessing (SMP) architectures is discussed and the efficiency of the retained solution is studied.

19.
In this study, we propose a new method for applying the rapid flood spreading model (RFSM), using cellular automata (CA), to multiple inflows in Carlisle, UK. The purpose of the RFSM is to generate predictions of water depth and flood extent using fewer computer resources than two-dimensional shallow water equation models (SWEMs) require. To be useful, the RFSM must produce predictions that are comparable with those obtained from SWEMs. This paper reports a validation against the data available to date on an urban flood, collected in January 2005 after a major event in the city of Carlisle, UK. The validation demonstrates agreement between the proposed RFSM and the measured data.

20.
We propose simple models to predict the performance degradation of disk requests due to storage device contention in consolidated virtualized environments. Model parameters can be deduced from measurements obtained inside Virtual Machines (VMs) from a system where a single VM accesses a remote storage server. The parameterized model can then be used to predict the effect of storage contention when multiple VMs are consolidated on the same server. We first propose a trace-driven approach that evaluates a queueing network with fair share scheduling using simulation. The model parameters consider Virtual Machine Monitor level disk access optimizations and rely on a calibration technique. We further present a measurement-based approach that allows a distinct characterization of read/write performance attributes. In particular, we define simple linear prediction models for I/O request mean response times, throughputs and read/write mixes, as well as a simulation model for predicting response time distributions. We found our models to be effective in predicting such quantities across a range of synthetic and emulated application workloads.

