Similar documents
20 similar documents retrieved; search time: 0 ms
1.
This paper describes a massively parallel code for a state-of-the-art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on a single GPU and to scale well to a large number of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task: codes must adapt to increasingly parallel architectures, and the overheads of node-to-node communication must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and by experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally, we compare the results of our GPU code with those measured on other currently available high-performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used for the development of other high-performance applications for computational physics.
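As an illustration of the method family behind this abstract (not the authors' thermal code), a minimal athermal D2Q9 lattice-Boltzmann step, a BGK collision followed by periodic streaming, can be sketched in NumPy. All names and parameters here are illustrative assumptions:

```python
import numpy as np

# D2Q9 velocity set and lattice weights
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)

def equilibrium(rho, ux, uy):
    """Second-order Maxwell-Boltzmann equilibrium distribution."""
    cu = c[:, 0, None, None]*ux + c[:, 1, None, None]*uy
    usq = ux**2 + uy**2
    return w[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)

def step(f, tau=0.6):
    """One BGK collide-and-stream update with periodic boundaries."""
    rho = f.sum(axis=0)
    ux = (f * c[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * c[:, 1, None, None]).sum(axis=0) / rho
    f = f + (equilibrium(rho, ux, uy) - f) / tau   # BGK collision
    for i, (cx, cy) in enumerate(c):               # streaming per direction
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    return f

# a uniform fluid at rest is a fixed point of the update
f = equilibrium(np.ones((8, 8)), np.zeros((8, 8)), np.zeros((8, 8)))
f2 = step(f)
```

The per-site collision and per-direction streaming are independent across lattice sites, which is exactly the parallelism that a GPU implementation of this kind exploits.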

2.
Fu, You; Zhou, Wei. The Journal of Supercomputing, 2022, 78(7): 9017-9037
The Journal of Supercomputing - Biological interaction databases accommodate information about interacting proteins or genes. Clustering on the networks formed by the interaction information for...

3.
Hyperspectral unmixing is essential for efficient hyperspectral image processing. Nonnegative matrix factorization based on a minimum volume constraint (MVC-NMF) is one of the most widely used methods for unsupervised unmixing of hyperspectral images without the pure-pixel assumption. But the MVC-NMF model is unstable, and the traditional solution based on the projected gradient algorithm (PG-MVC-NMF) converges slowly and with low accuracy. In this paper, a novel parallel method is proposed for minimum volume constrained hyperspectral image unmixing on a CPU–GPU heterogeneous platform. First, an optimized unmixing model of minimum logarithmic volume regularized NMF is introduced and solved based on a second-order approximation of the objective function and the alternating direction method of multipliers (SO-MVC-NMF). Then, a parallel algorithm for the optimized MVC-NMF (PO-MVC-NMF) is proposed for the CPU–GPU heterogeneous platform, taking advantage of the parallel processing capabilities of GPUs and the logic control abilities of CPUs. Experimental results based on both simulated and real hyperspectral images indicate that the proposed algorithm is more accurate and robust than the traditional PG-MVC-NMF, and the total speedup of PO-MVC-NMF over PG-MVC-NMF is more than 50 times.

4.
Solving block-tridiagonal systems is one of the key issues in numerical simulations of many scientific and engineering problems. In most block-tridiagonal matrices, the non-zero elements are concentrated in the blocks on the main diagonal, while the blocks above and below the main diagonal contain few non-zero elements. We therefore present a solving method that mixes direct and iterative methods. In our method, the submatrices on the main diagonal are solved by direct methods within the iteration process. Because the approximate solutions obtained by the direct methods are closer to the exact solutions, the convergence of the block-tridiagonal system of linear equations is accelerated. Some direct methods perform well on small-scale equations, and the sub-equations can be solved in parallel. We present an improved algorithm that solves the sub-equations by thread blocks on the GPU, with intermediate data stored in shared memory so as to significantly reduce memory-access latency. Furthermore, we analyze a cloud resource scheduling model and obtain ten block-tridiagonal matrices produced by simulation of the cloud-computing system. The performance of solving these block-tridiagonal systems of linear equations is improved using our method.
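The mixed direct/iterative idea described above can be sketched as a block-Jacobi iteration in which each diagonal block is solved by a direct method, while the off-diagonal blocks only contribute to the right-hand side. The function names, iteration count, and the diagonally dominant test problem are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def block_jacobi_tridiag(D, L, U, b, iters=200):
    """Block-Jacobi iteration for a block-tridiagonal system.
    D[k]: diagonal blocks; L[k]: sub-diagonal blocks; U[k]: super-diagonal
    blocks; b[k]: right-hand-side segments. Each diagonal block is solved
    directly (the role played by a per-thread-block direct solver on a GPU).
    """
    n, m = len(D), D[0].shape[0]
    x = [np.zeros(m) for _ in range(n)]
    for _ in range(iters):
        x_new = []
        for k in range(n):
            r = b[k].copy()
            if k > 0:
                r -= L[k-1] @ x[k-1]       # coupling to the block above
            if k < n - 1:
                r -= U[k] @ x[k+1]         # coupling to the block below
            x_new.append(np.linalg.solve(D[k], r))  # direct per-block solve
        x = x_new
    return np.concatenate(x)

# a small diagonally dominant test system (dominant diagonal blocks, as in
# the matrix structure the abstract describes)
rng = np.random.default_rng(0)
m, n = 3, 4
L = [rng.standard_normal((m, m)) * 0.1 for _ in range(n - 1)]
U = [rng.standard_normal((m, m)) * 0.1 for _ in range(n - 1)]
D = [np.eye(m) * 4 + rng.standard_normal((m, m)) * 0.1 for _ in range(n)]
b = [rng.standard_normal(m) for _ in range(n)]
x = block_jacobi_tridiag(D, L, U, b)
```

Since the diagonal blocks dominate, the direct inner solves make each outer sweep a strong contraction, which is the convergence argument the abstract appeals to.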

5.
The Journal of Supercomputing - Heterogeneous accelerated processing units (APUs) integrate a multi-core CPU and a GPU within the same chip. Modern APUs implement CPU–GPU platform atomics...

6.
Incorporating a GPU architecture, which is more efficient for certain types of applications, into a chip multiprocessor (CMP) is a popular trend in recent processors. This heterogeneous mix of architectures uses an on-chip interconnect to access shared resources such as last-level cache tiles and memory controllers. The configuration of this on-chip network is likely to have a significant impact on resource distribution, fairness, and overall performance.

7.
Heterogeneous architectures comprising a multi-core CPU and one or more many-core GPUs are increasingly being used within cluster and cloud environments. In this paper, we study the problem of optimizing the overall throughput of a set of applications deployed on a cluster of such heterogeneous nodes. We consider two different scheduling formulations. In the first, we consider jobs that can be executed on either the GPU or the CPU of a single node. In the second, we consider jobs that can be executed on the CPU, the GPU, or both, of any number of nodes in the system. We have developed scheduling schemes addressing both problems. In our evaluation, we first show that the schemes proposed for the first formulation outperform a blind round-robin scheduler and approximate the performance of an ideal scheduler that involves an impractical exhaustive exploration of all possible schedules. Next, we show that the scheme proposed for the second formulation outperforms the best of the existing schemes for heterogeneous clusters, TORQUE and MCT, by up to 42%. Additionally, we evaluate the robustness of our proposed scheduling policies under inaccurate inputs to account for real execution scenarios. We show that, with up to 20% inaccuracy in the input, the degradation in performance is marginal (less than 7%) on average.
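The MCT baseline mentioned in the comparison can be sketched as a greedy minimum-completion-time scheduler: each job is placed on whichever device would finish it earliest given the current load. The job timing table below is made up for illustration and is not from the paper's workloads:

```python
def mct_schedule(jobs, devices):
    """Greedy MCT scheduling.
    jobs: list of dicts mapping device name -> estimated runtime.
    Returns the per-job device assignment and the resulting makespan.
    """
    ready = {d: 0.0 for d in devices}      # time at which each device frees up
    assignment = []
    for runtimes in jobs:
        # pick the device with the minimum completion time for this job
        best = min(devices, key=lambda d: ready[d] + runtimes[d])
        ready[best] += runtimes[best]
        assignment.append(best)
    return assignment, max(ready.values())

# hypothetical per-device runtime estimates for three jobs
jobs = [{"cpu": 4.0, "gpu": 1.0},   # strongly GPU-friendly
        {"cpu": 2.0, "gpu": 2.5},   # slightly CPU-friendly
        {"cpu": 5.0, "gpu": 1.5}]   # strongly GPU-friendly
plan, makespan = mct_schedule(jobs, ["cpu", "gpu"])
```

MCT considers only the current load, not future jobs, which is one reason schemes that reason about the whole job mix can beat it, as the abstract reports.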

8.

In recent decades, the socio-demographic evolution of the population has substantially changed mobility demand, posing new challenges in minimizing urban congestion and reducing environmental impact. In this scenario, understanding how different modes of transport can efficiently share (partially or totally) a common infrastructure is crucial for urban development. To this aim, we present a stochastic model-based analysis of critical intersections shared by tram traffic and private traffic, combining a microscopic model of the former with a macroscopic model of the latter. Advanced simulation tools are typically used for this kind of analysis, playing out various traffic scenarios. However, simulation is not an exhaustive approach, and some critical, possibly rare, events may be missed. For this reason, we instead adopt analytical solution techniques and tools that support a complete, exhaustive analysis, and are thus able to account for rare events as well. Transient analysis of the overall traffic model using the method of stochastic state classes supports the evaluation of relevant performance measures, namely the probability of traffic congestion over time and the average number of private vehicles in the queue over time. A sensitivity analysis is performed with respect to multiple parameters, notably including the arrival rate of private vehicles, the frequency of tram rides, and the time needed to recover from traffic congestion.


9.
Multiphase flow implementations of the lattice Boltzmann method (LBM) are widely applied to the study of porous medium systems. In this work, we construct a new variant of the popular “color” LBM for two-phase flow in which a three-dimensional, 19-velocity (D3Q19) lattice is used to compute the momentum transport solution while a three-dimensional, seven-velocity (D3Q7) lattice is used to compute the mass transport solution. Based on this formulation, we implement a novel heterogeneous GPU-accelerated algorithm in which the mass transport solution is computed by multiple shared-memory CPU cores programmed using OpenMP, while a concurrent solution of the momentum transport is performed on a GPU. The heterogeneous solution is demonstrated to provide a speedup of 2.6× compared to the multi-core CPU solution and 1.8× compared to the GPU solution, due to the concurrent utilization of both CPU and GPU bandwidths. Furthermore, we verify that the proposed formulation provides an accurate physical representation of multiphase flow processes and demonstrate that the approach can be applied to perform heterogeneous simulations of two-phase flow in porous media using a typical GPU-accelerated workstation.

10.
The NaCl–KCl–ZnCl2 ternary system is examined and modeled using the CALPHAD methodology in conjunction with molecular dynamics (MD) simulations. In particular, MD simulations are used to calculate liquid enthalpies of mixing as a function of composition for the ternary and its binary sub-systems. In addition, key structural features obtained from MD are used to inform the employed two-sublattice ionic liquid model (Na+1, K+1: Cl−1, ZnCl2), which describes the ternary liquid phase. The structure of the simulated liquid systems shows that Zn2+ cations primarily exhibit 4-fold coordination, with smaller percentages of 5-fold and then 3-fold coordination; in contrast, the coordination of both Na+ and K+ cations is distributed between 2- and 4-fold states. The optimized self-consistent thermodynamic model parameters show good agreement with the MD data obtained in this work and with available experimental literature data.

11.
The Journal of Supercomputing - Since the advent of deep belief network deep learning technology in 2006, artificial intelligence technology has been utilized in various convergence areas, such as...

12.
Neural Computing and Applications - Seismic catalogs are vital to understanding and analyzing the progress of active fault systems. The background seismicity rate in a seismic catalog, strongly...

13.
14.
The oncoprotein MDM2 (murine double minute 2) negatively regulates the activity and stability of the tumor suppressor p53. Inactivation of the MDM2–p53 interaction by potent inhibitors offers new possibilities for anticancer therapy. Molecular dynamics (MD) simulations were performed on three inhibitor–MDM2 complexes to investigate their stability and structural transitions. The simulations show that the backbone of MDM2 remains stable throughout, although slight structural changes of the inhibitors and of MDM2 are observed. Furthermore, the molecular mechanics generalized Born surface area (MM-GBSA) approach was used to analyze the interactions between the inhibitors and MDM2. The results show that binding of the inhibitor pDIQ to MDM2 is significantly stronger than that of pMI and pDI. In the energy decomposition, the side chains of residues contribute more than the backbones. The structure–affinity analyses show that L54, I61, M62, Y67, Q72, H73 and V93 produce important interaction energies with the inhibitors. The residue W/Y22′ is also very important to the interaction between the inhibitors and MDM2. The three-dimensional structures at different times indicate that the mobility of Y100 influences the binding of inhibitors to MDM2, and its motion plays an important role in the conformations of the inhibitors and MDM2.

15.
Face tracking is an important computer vision technology that has been widely adopted in many areas, from cell phone applications to industrial robots. In this paper, we introduce a novel way to parallelize a face contour detection application based on the color-entropy preprocessed Chan–Vese model utilizing a total variation G-norm. This particular application is a complicated, unsupervised computational method requiring a large amount of calculation. Several core parts are difficult to parallelize due to heavily correlated data processing among iterations and pixels. We develop a novel approach to parallelize these data-dependent core parts and significantly improve the runtime performance of the model computation. We implement the parallelized program in OpenCL for both multi-core CPU and GPU. For 640 × 480 input images, the parallelized program on an NVIDIA GTX970 GPU, an NVIDIA GTX660 GPU, and an AMD FX8530 8-core CPU is on average 18.6, 12.0 and 4.40 times faster, respectively, than its single-thread C version on the AMD FX8530 CPU. Some parallelized routines show a much higher performance improvement than the whole program. For instance, on the NVIDIA GTX970 GPU, the parallelized entropy filter routine is on average 74.0 times faster than its single-thread C version on the AMD FX8530 8-core CPU. We discuss the parallelization methodologies in detail, including scalability, thread models, and synchronization methods for both multi-core CPU and GPU.
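The entropy filter (the routine with the largest reported speedup) can be sketched as a per-pixel Shannon entropy over a sliding window. The NumPy version below is a serial reference, not the paper's OpenCL code; because each output pixel is independent, the double loop is exactly what maps onto one GPU work-item per pixel. Window size and binning are illustrative assumptions:

```python
import numpy as np

def entropy_filter(img, win=3, levels=8):
    """Per-pixel Shannon entropy of quantized intensities in a win x win
    neighborhood (edge-padded). img: 2-D float array with values in [0, 1]."""
    h, w = img.shape
    q = np.minimum((img * levels).astype(int), levels - 1)  # quantize to bins
    pad = win // 2
    qp = np.pad(q, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for y in range(h):                 # every pixel is independent: these two
        for x in range(w):             # loops become parallel GPU work-items
            window = qp[y:y+win, x:x+win]
            counts = np.bincount(window.ravel(), minlength=levels)
            p = counts[counts > 0] / counts.sum()
            out[y, x] = -(p * np.log2(p)).sum()
    return out
```

A flat region has zero entropy while a textured one does not, which is what makes this filter useful as a preprocessing step for contour detection.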

16.
We present a study of the performance of the Wang–Landau algorithm in a lattice model of liquid crystals, which is a continuous lattice spin model. We propose a novel spin-update scheme for continuous lattice spin models. The proposed scheme reduces the autocorrelation time of the simulation and results in faster convergence.
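For readers unfamiliar with the algorithm being benchmarked, the flat-histogram machinery of Wang–Landau can be sketched on a toy discrete model: N independent spins whose "energy" is the number of up spins, so the exact density of states is binomial. The continuous-spin update scheme proposed in the paper is not reproduced here; all parameters below are illustrative:

```python
import math, random

def wang_landau(N=8, flatness=0.8, ln_f_min=1e-4, seed=1):
    """Estimate ln g(E) (up to an additive constant) for the toy model
    E = number of up spins among N, via the Wang-Landau flat-histogram walk."""
    rng = random.Random(seed)
    spins = [0] * N
    E = 0
    ln_g = [0.0] * (N + 1)   # running estimate of ln(density of states)
    hist = [0] * (N + 1)     # visit histogram for the flatness test
    ln_f = 1.0               # modification factor, refined over stages
    while ln_f > ln_f_min:
        for _ in range(1000):
            i = rng.randrange(N)
            E_new = E + (1 if spins[i] == 0 else -1)
            # accept the flip with probability min(1, g(E)/g(E_new))
            if rng.random() < math.exp(min(0.0, ln_g[E] - ln_g[E_new])):
                spins[i] ^= 1
                E = E_new
            ln_g[E] += ln_f
            hist[E] += 1
        if min(hist) > flatness * (sum(hist) / len(hist)):
            hist = [0] * (N + 1)   # histogram is flat: reset and refine
            ln_f /= 2
    return ln_g

ln_g = wang_landau()   # ln g(E) up to an additive constant
```

In a continuous spin model the state space is no longer a finite set of flips, which is precisely why the choice of spin-update scheme, the subject of this abstract, matters for the autocorrelation time.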

17.
An extremely scalable lattice Boltzmann (LB)–cellular automaton (CA) model for simulations of two-dimensional (2D) dendritic solidification under forced convection is presented. The model incorporates the effects of phase change, solute diffusion, melt convection, and heat transport. The LB model represents the diffusion, convection, and heat transfer phenomena. The dendrite growth is driven by the difference between the actual and equilibrium liquid composition at the solid–liquid interface. The CA technique is deployed to track the new interface cells. The computer program was parallelized using the Message Passing Interface (MPI) technique. Parallel scaling of the algorithm was studied and the major scalability bottlenecks were identified. Efficiency loss attributable to the high memory-bandwidth requirement of the algorithm was observed when using multiple cores per processor. Parallel writing of the output variables of interest was implemented in the binary Hierarchical Data Format 5 (HDF5) to improve output performance and to simplify visualization. Calculations were carried out in single-precision arithmetic without significant loss in accuracy, resulting in a 50% reduction of memory and computational time requirements. The presented solidification model shows very good scalability up to centimeter-size domains including more than ten million dendrites.

18.
19.
Parallel Computing, 2014, 40(5-6): 70-85
QR factorization is a computational kernel of scientific computing. How can the latest computers be used to accelerate this task? We investigate this topic by proposing a dense QR factorization algorithm with adaptive block sizes on a hybrid system that contains a central processing unit (CPU) and a graphics processing unit (GPU). To maximize the use of the CPU and GPU, we develop an adaptive scheme that chooses the block size at each iteration. The decision is based on statistical surrogate models of performance and an online monitor, which avoids unexpected occasional performance drops. We modify the highly optimized CPU–GPU based QR factorization in MAGMA to implement the proposed schemes. Numerical results suggest that our approaches are efficient and can lead to near-optimal block sizes. The proposed algorithm can be extended to other one-sided factorizations, such as the LU and Cholesky factorizations.
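The adaptive block-size idea can be sketched as follows: a statistical surrogate predicts the time of one panel-factorization plus trailing-update step for each candidate block size, and the scheduler picks the minimizer at every iteration. The quadratic cost model, rates, and candidate grid below are made-up illustrations, not MAGMA's measured model:

```python
def pick_block_size(candidates, predict):
    """Choose the block size whose predicted step time is smallest."""
    return min(candidates, key=predict)

def surrogate(nb, cpu_rate=10.0, gpu_rate=20.0, sync_cost=5.0, n=4096):
    """Hypothetical cost model for one blocked-QR step of a size-n matrix:
    a CPU-bound panel factorization that grows with nb, a GPU-bound
    trailing-matrix update that shrinks with nb, and a fixed sync cost."""
    panel = nb * n / cpu_rate            # panel factorization on the CPU
    update = n * n / (gpu_rate * nb)     # trailing update on the GPU
    return panel + update + sync_cost    # fixed CPU-GPU synchronization cost

nb = pick_block_size(range(16, 513, 16), surrogate)
```

The tension captured here is real: larger blocks amortize the GPU update but lengthen the serial CPU panel, so the optimum sits between the extremes, and an online monitor can re-fit the model when measured times drift from predictions.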

20.

Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号