Similar literature
 20 similar articles found (search time: 0 ms)
1.
Agent-based models, an emerging paradigm for the simulation of complex systems, appear well suited to parallel processing. However, during the parallelization of a simulator of financial markets, we found that some features of these codes highlight non-trivial issues of the present hardware/software platforms for parallel processing. Here we present the results of a series of tests, on different platforms, of simplified codes that reproduce such problems and can be used as a starting point in the search for a possible solution.

2.
Direct volume visualization is an important method in many areas, including computational fluid dynamics and medicine. Achieving interactive rates for direct volume rendering of large unstructured volumetric grids is a challenging problem, but parallelizing direct volume rendering algorithms can help achieve this goal. Using Compute Unified Device Architecture (CUDA), we propose a GPU-based volume rendering algorithm that itself is based on a cell projection-based ray-casting algorithm designed for CPU implementations. We also propose a multicore parallelized version of the cell-projection algorithm using OpenMP. In both algorithms, we favor image quality over rendering speed. Our algorithm has a low memory footprint, allowing us to render large datasets. Our algorithm supports progressive rendering. We compared the GPU implementation with the serial and multicore implementations. We observed significant speed-ups that, together with progressive rendering, enable reaching interactive rates for large datasets.

3.
The NAME Atmospheric Dispersion Model is a Lagrangian particle model used by the Met Office to predict the propagation and spread of pollutants in the atmosphere. The model is routinely used in emergency response applications, where it is important to obtain results as quickly as possible. This requirement for a short runtime and the increase in core number of commonly available CPUs, such as the Intel Xeon series, have motivated the parallelisation of NAME in the OpenMP shared memory framework. In this work we describe the implementation of this parallelisation strategy in NAME and discuss the performance of the model for different setups. Due to the independence of the model particles, the parallelisation of the main compute intensive loops is relatively straightforward. The random number generator for modelling sub-grid scale turbulent motion needs to be adapted to ensure that different particles use independent sets of random numbers. We find that on Intel Xeon X5680 CPUs the model shows very good strong scaling up to 12 cores in a realistic emergency response application for predicting the dispersion of volcanic ash in the North Atlantic airspace. We implemented a mechanism for asynchronous reading of meteorological data from disk and demonstrate how this can reduce the runtime if disk access plays a significant role in a model run. To explore the performance on different chip architectures we also ported the part of the code which is used for calculating the gamma dose from a cloud of radioactive particles to a graphics processing unit (GPU) using CUDA-C. We were able to demonstrate a significant speedup of around one order of magnitude relative to the serial CPU version.
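
The requirement that different particles draw from separate random number sets can be illustrated with a small sketch. This is not the scheme used in NAME, which the abstract does not describe; it assumes a hypothetical run-level seed (`BASE_SEED`) and a toy random walk (`advance_particle`), and it uses distinct per-particle seeds, which yield distinct streams but are not by themselves a proof of statistical independence.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BASE_SEED = 12345  # hypothetical run-level seed (assumption, not from the abstract)


def advance_particle(particle_id, n_steps=100, sigma=1.0):
    """Random-walk one particle using a generator private to that particle."""
    # Seeding from (run seed, particle id) makes the stream depend only on the
    # particle, not on which worker handles it or in what order.
    rng = random.Random(f"{BASE_SEED}:{particle_id}")
    x = 0.0
    for _ in range(n_steps):
        x += rng.gauss(0.0, sigma)  # toy stand-in for a turbulent displacement
    return x


def run(n_particles, n_workers):
    """Advance all particles; the result is the same for any worker count."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(advance_particle, range(n_particles)))
```

Because no generator state is shared between particles, `run(50, 1)` and `run(50, 4)` return identical lists, which is the reproducibility property a shared global generator would lose under parallel execution.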

4.
The rapidly increasing number of cores in modern microprocessors is pushing the current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems - distributed memory across nodes and shared memory with non-uniform memory access within each node - poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems - a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems including an SGI Altix 4700, an IBM p575+ and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.

5.
The Journal of Supercomputing - We present PPT-Multicore, an analytical model embedded in the Performance Prediction Toolkit (PPT) to predict parallel applications’ performance running on a...

6.
This paper investigates a high performance implementation of an Arbitrary Lagrangian Eulerian (ALE) moving mesh technique on shared memory systems using the OpenMP environment. Moving mesh techniques are considered an integral part of a wider class of fluid mechanics problems that involve moving and deforming spatial domains, namely, free-surface flows and Fluid Structure Interaction (FSI). The moving mesh technique adopted in this work is based on the notion of node relocation, subject to a certain evolution as well as constraint conditions. A conjugate gradient method augmented with preconditioning is employed for the solution of the resulting system of equations. The proposed algorithm initially reorders the mesh using an efficient divide and conquer approach and then parallelizes the ALE moving mesh scheme. Numerical simulations are conducted on multicore AMD Opteron and Intel Xeon processors, and unstructured triangular and tetrahedral meshes are used for the 2D and 3D problems. The quality of generated meshes is checked by comparing the element Jacobians in the reference and current meshes, and by keeping track of the change in the interior angles in triangles and tetrahedrons. Overall, efficiencies of 51% and 72% in terms of speedup are achieved for the parallel mesh reordering and ALE moving mesh algorithms, respectively.

7.
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared-memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks system uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intranode hardware shared memory. We present performance results for seven applications (Barnes-Hut, CLU, and Water from SPLASH-2, 3D-FFT from NAS, Red-Black SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the thread implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7–30% of the MPI versions.

8.
The Design of OpenMP Tasks   Cited by: 2 (self-citations: 0, citations by others: 2)
OpenMP has been very successful in exploiting structured parallelism in applications. With increasing application complexity, there is a growing need for addressing irregular parallelism in the presence of complicated control structures. This is evident in various efforts by the industry and research communities to provide a solution to this challenging problem. One of the primary goals of OpenMP 3.0 is to define a standard dialect to express and efficiently exploit unstructured parallelism. This paper presents the design of the OpenMP tasking model by members of the OpenMP 3.0 tasking sub-committee which was formed for this purpose. The paper summarizes the efforts of the sub-committee (spanning over two years) in designing, evaluating and seamlessly integrating the tasking model into the OpenMP specification. In this paper, we present the design goals and key features of the tasking model, including a rich set of examples and an in-depth discussion of the rationale behind various design choices. We compare a prototype implementation of the tasking model with existing models, and evaluate it on a wide range of applications. The comparison shows that the OpenMP tasking model provides expressiveness, flexibility, and huge potential for performance and scalability.

9.
10.
The aim of this paper is to evaluate OpenMP, TBB and Cilk Plus as basic language-based tools for simple and efficient parallelization of recursively defined computational problems and other problems that need both task and data parallelization techniques. We show how to use these models of parallel programming to transform the source code of Adaptive Simpson’s Integration into programs that can utilize multiple cores of modern processors. Using the example of the Bellman–Ford algorithm for solving single-source shortest path problems, we advise how to improve the performance of data parallel algorithms by tuning data structures for better utilization of the vector extensions of modern processors. Manual vectorization techniques based on Cilk array notation and intrinsics are presented. We also show how to simplify such optimization using Intel SIMD Data Layout Template containers.
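
For readers unfamiliar with the recursive problem named in this abstract, the following is a minimal serial sketch of adaptive Simpson integration. It is a textbook formulation, not the paper's code; the function names, the tolerance `eps` and the `max_depth` guard are assumptions. The two recursive calls at the end of `_adaptive` work on disjoint subintervals and share no state, which is the kind of structure that task-based models such as OpenMP tasks, TBB or Cilk Plus can run concurrently.

```python
import math


def _simpson(f, a, fa, b, fb):
    """Simpson's rule on [a, b]; also returns the midpoint and f(midpoint)."""
    m = 0.5 * (a + b)
    fm = f(m)
    return m, fm, (b - a) / 6.0 * (fa + 4.0 * fm + fb)


def _adaptive(f, a, fa, b, fb, m, fm, whole, eps, depth):
    # Refine by applying Simpson's rule to each half of the interval.
    lm, flm, left = _simpson(f, a, fa, m, fm)
    rm, frm, right = _simpson(f, m, fm, b, fb)
    delta = left + right - whole
    if depth <= 0 or abs(delta) <= 15.0 * eps:
        # Accept, with the usual Richardson correction term.
        return left + right + delta / 15.0
    # The two halves are independent: a task-parallel version would spawn them.
    return (_adaptive(f, a, fa, m, fm, lm, flm, left, eps / 2.0, depth - 1)
            + _adaptive(f, m, fm, b, fb, rm, frm, right, eps / 2.0, depth - 1))


def adaptive_simpson(f, a, b, eps=1e-9, max_depth=50):
    """Integrate f over [a, b] with adaptive interval subdivision."""
    fa, fb = f(a), f(b)
    m, fm, whole = _simpson(f, a, fa, b, fb)
    return _adaptive(f, a, fa, b, fb, m, fm, whole, eps, max_depth)
```

A quick check is `adaptive_simpson(math.sin, 0.0, math.pi)`, whose exact value is 2.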

11.
Genetic algorithms (GAs) have been applied to solve the 2-page crossing number problem successfully, but since they work with one global population, the search time and space are limited. Parallelisation provides an attractive prospect to improve the efficiency and solution quality of GAs. This paper investigates the complexity of parallel genetic algorithms (PGAs) based on two evaluation measures: computation time to communication time and population size to chromosome size. Moreover, the paper unifies the framework of PGA models with the function PGA (subpopulation size, cluster size, migration period, topology), and explores the performance of PGAs for the 2-page crossing number problem.

12.
Protein secondary structure prediction has a fundamental influence on today’s bioinformatics research. In this work, tertiary classifiers for protein secondary structure prediction are implemented on the Denoeux Belief Neural Network (DBNN) architecture. The hydrophobicity matrix, orthogonal matrix, BLOSUM62 matrix and PSSM matrix are tested separately as encoding schemes for DBNN. The hydrophobicity matrix, BLOSUM62 matrix and PSSM matrix are applied to the DBNN architecture for the first time. The experimental results contribute to the design of new encoding schemes. The accuracy of our tertiary classifier with the PSSM encoding scheme reaches 72.01%, which is almost 10% better than the previous results obtained in 2003. Because training the neural networks is time consuming, Pthread and OpenMP are employed to parallelize DBNN on the Hyper-Threading enabled Intel architecture. The speedup for 16 Pthreads is 4.9 and the speedup for 16 OpenMP threads is 4 on the 4-processor shared memory architecture. The speedup performance of both OpenMP and Pthread is superior to that of other research. With the new parallel training algorithm, thousands of amino acids can be processed in a reasonable amount of time. Our research also shows that Hyper-Threading technology for the Intel architecture is efficient for parallel biological algorithms.
Yi Pan (Corresponding author)

13.
14.
We deal here with the parallelization of fire spreading simulations following detailed physical experiments. The proposal presented in this paper has been tested and evaluated in collaboration with physicists to meet their requirements in terms of both performance and precision. For this purpose, an object-oriented framework using two abstraction levels has been developed. A first level considers the simulation as a global phenomenon which evolves in space and time. A local level describes the phenomena occurring on elementary parts of the domain. In order to develop an extensible and modular architecture, the cellular automata paradigm, the DEVS discrete event system formalism and design patterns have been used. Simulation treatments are limited to a set of active elements to improve execution times. A new kind of model, called Active-DEVS, is then specified. The model is computed with a fine grain parallelization that is very efficient for present-day multi-core processors, which are the elementary units of modern computing clusters and computing grids. In this paper, the parallelization with Open MultiProcessing (OpenMP) standard directives on Symmetric MultiProcessing (SMP) architectures is discussed and the efficiency of the retained solution is studied.

15.
In this study, we propose a new method to apply the rapid flood spreading model (RFSM) using cellular automata (CA) to multiple inflows in Carlisle, UK. The purpose of the RFSM is to generate predictions of water depth and flood extent using fewer computing resources than required by two-dimensional shallow water equation models (SWEMs). To be useful, the RFSM must produce predictions that are comparable with those obtained from SWEMs. This paper reports a validation against the data available to date on an urban flood, collected in January 2005 after a major event in the city of Carlisle, UK. The validation demonstrates agreement between the proposed RFSM and the measured data.

16.
We propose simple models to predict the performance degradation of disk requests due to storage device contention in consolidated virtualized environments. Model parameters can be deduced from measurements obtained inside Virtual Machines (VMs) from a system where a single VM accesses a remote storage server. The parameterized model can then be used to predict the effect of storage contention when multiple VMs are consolidated on the same server. We first propose a trace-driven approach that evaluates a queueing network with fair share scheduling using simulation. The model parameters consider Virtual Machine Monitor level disk access optimizations and rely on a calibration technique. We further present a measurement-based approach that allows a distinct characterization of read/write performance attributes. In particular, we define simple linear prediction models for I/O request mean response times, throughputs and read/write mixes, as well as a simulation model for predicting response time distributions. We found our models to be effective in predicting such quantities across a range of synthetic and emulated application workloads.

17.
This paper outlines how it is possible to decompose a complex non-linear modelling problem into a set of simpler linear modelling problems. Local ARMAX models valid within certain operating regimes are interpolated to construct a global NARMAX (non-linear ARMAX) model. Knowledge of the system behaviour in terms of operating regimes is the primary basis for building such models; hence it should not be considered a pure black-box approach, but an approach that utilizes a limited amount of a priori system knowledge. It is shown that a large class of non-linear systems can be modelled in this way, and it is indicated how to decompose the system's range of operation into operating regimes. Standard system identification algorithms can be used to identify the NARMAX model, and several aspects of the system identification problem are discussed and illustrated by a simulation example.
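
The idea of interpolating local linear models across operating regimes can be sketched as follows. The abstract does not state the interpolation functions or the model orders, so everything concrete here is an assumption: two first-order local models without a noise term (`LOCAL_MODELS`), a scalar operating-point variable `z`, and normalized Gaussian validity functions.

```python
import math

# Hypothetical local models y(t) = a*y(t-1) + b*u(t-1), each tied to an
# operating regime described by a center and a width in the variable z.
LOCAL_MODELS = [
    {"center": 0.0, "width": 1.0, "a": 0.9, "b": 0.1},
    {"center": 10.0, "width": 1.0, "a": 0.5, "b": 0.5},
]


def validity_weights(z):
    """Normalized Gaussian weights saying how valid each local model is at z."""
    raw = [math.exp(-0.5 * ((z - m["center"]) / m["width"]) ** 2)
           for m in LOCAL_MODELS]
    total = sum(raw)  # note: underflows to 0.0 if z is very far from every center
    return [r / total for r in raw]


def predict(y_prev, u_prev, z):
    """Global non-linear prediction as a weighted blend of local linear ones."""
    weights = validity_weights(z)
    return sum(w * (m["a"] * y_prev + m["b"] * u_prev)
               for w, m in zip(weights, LOCAL_MODELS))
```

Near a regime center the global model reduces to the corresponding local model, for example `predict(1.0, 2.0, 0.0)` is approximately 1.1; midway between the regimes (`z = 5.0`) the two local predictions are averaged, giving approximately 1.3.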

18.
For many applications, two-dimensional hydraulic models are time intensive to run due to their computational requirements, which can adversely affect the progress of both research and industry modelling projects. Computational time can be reduced by running a model in parallel over multiple cores. However, there are many parallelisation methods and these differ in terms of difficulty of implementation, suitability for particular codes and parallel efficiency. This study compares three parallelisation methods based on OpenMP, message passing and specialised accelerator cards. The parallel implementations of the codes were required to produce near identical results to a serial version for two urban inundation test cases. OpenMP was considered the easiest method to develop and produced similar speedups (of ~3.9×) to the message passing code on up to four cores for a fully wet domain. The message passing code was more efficient than OpenMP, and remained over 90% efficient on up to 50 cores for a completely wet domain. All parallel codes were less efficient for a partially wet domain test case. The accelerator card code was faster and more power efficient than the standard code on a single core for a fully wet domain, but was subject to longer development time (2 months compared to <2 weeks for the other methods).
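
The speedup and efficiency figures quoted in this abstract follow the usual definitions, which the short sketch below spells out. The function names are illustrative, and the example timings in the usage note are made up rather than taken from the study.

```python
def speedup_from_times(t_serial, t_parallel):
    """Speedup S = T_serial / T_parallel."""
    if t_parallel <= 0:
        raise ValueError("parallel runtime must be positive")
    return t_serial / t_parallel


def parallel_efficiency(speedup, n_cores):
    """Parallel efficiency E = S / p, where p is the number of cores."""
    if n_cores <= 0:
        raise ValueError("core count must be positive")
    return speedup / n_cores
```

With the quoted speedup of about 3.9 on four cores, `parallel_efficiency(3.9, 4)` gives about 0.975, i.e. roughly 97.5% efficiency. Conversely, remaining over 90% efficient on 50 cores corresponds to a speedup above 45. With hypothetical timings of 100 s serial and 25 s on four cores, `speedup_from_times(100.0, 25.0)` is 4.0.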

19.
Cache coherence in shared-memory multiprocessor systems has been studied mostly from an architecture viewpoint, often by means of aggregating metrics. In many cases, aggregate events provide insufficient information for programmers to understand and optimize the coherence behavior of their applications. A better understanding would be given by source code correlations of not only aggregate events, but also finer granularity metrics directly linked to high-level source code constructs, such as source lines and data structures. In this paper, we explore a novel application-centric approach to studying coherence traffic. We develop a coherence analysis framework based on incremental coherence simulation of actual reference traces. We provide tool support to extract these reference traces and synchronization information from OpenMP threads at runtime using dynamic binary rewriting of the application executable. These traces are fed to ccSIM, our cache-coherence simulator. The novelty of ccSIM lies in its ability to relate low-level cache coherence metrics (such as coherence misses and their causative invalidations) to high-level source code constructs including source code locations and data structures. We explore the degree of freedom in interleaving data traces from different processors and assess simulation accuracy in comparison to metrics obtained from hardware performance counters. Our quantitative results show that: 1) Cache coherence traffic can be simulated with a considerable degree of accuracy for SPMD programs, as the invalidation traffic closely matches the corresponding hardware performance counters. 2) Detailed, high-level coherence statistics are very useful in detecting, isolating, and understanding coherence bottlenecks. We use ccSIM with several well-known benchmarks and find coherence optimization opportunities leading to significant reductions in coherence traffic and savings in wall-clock execution time.

20.
In this paper, we describe our experience of creating an OpenMP implementation of the SPICE3 circuit simulator program. Given the irregular patterns of access to dynamic data structures in the SPICE code, a parallelization using current standard OpenMP directives is impossible without major rewriting of the original program. The aim of this work is to present a case study showing the development of a shared memory parallel code with minimum effort. We present two implementations, one with minimal code modification and one without modification to the original SPICE3 program using Intel’s taskq construct. We also discuss the results of the case study in terms of what future compiler tools may be needed to help OpenMP application developers with similar porting goals. Our experiments using SPICE3, based on SRAM model simulation, were compiled by the SUN compiler running on a SunFire V880 UltraSPARC-III 750 MHz and by the Intel icc compiler running on both an IBM Itanium machine with four CPUs and an Intel Xeon machine with two processors. The results are promising.
