20 similar documents found (search time: 0 ms)
1.
Particle-In-Cell (PIC) methods have been widely used for plasma physics simulations over the past three decades. To ensure an acceptable level of statistical accuracy, relatively large numbers of particles are needed. State-of-the-art Graphics Processing Units (GPUs), with their high memory bandwidth, hundreds of SPMD processors, and half-a-teraflop performance potential, offer a viable alternative to distributed-memory parallel computers for running medium-scale PIC plasma simulations on inexpensive commodity hardware. In this paper, we present an overview of a typical plasma PIC code and discuss its GPU implementation. In particular, we focus on fast algorithms for the performance-bottleneck operation of particle-to-grid interpolation.
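The particle-to-grid interpolation this abstract identifies as the bottleneck can be illustrated with a minimal 1-D cloud-in-cell (CIC) charge deposition. The grid size and particle data below are invented for illustration; a real PIC code would vectorize or use atomic updates to parallelize the scatter, which is exactly what makes this step hard on a GPU:

```python
import numpy as np

def deposit_cic(positions, charges, n_cells, dx):
    """1-D cloud-in-cell deposition: each particle's charge is split
    linearly between the two grid points that bracket it (periodic)."""
    rho = np.zeros(n_cells)
    for x, q in zip(positions, charges):
        cell = int(x / dx)                   # left grid index
        w = (x / dx) - cell                  # fractional distance to left point
        rho[cell] += q * (1.0 - w)           # share to left grid point
        rho[(cell + 1) % n_cells] += q * w   # share to right grid point
    return rho

rho = deposit_cic(positions=[0.25, 1.6], charges=[1.0, 2.0],
                  n_cells=4, dx=1.0)
# total deposited charge equals total particle charge
```

Note the scatter pattern: two particles in the same cell write to the same `rho` entries, which on a GPU forces atomics or a reordering of the deposition loop.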
2.
《Parallel Computing》2014,40(5-6):144-158
One of the main difficulties in using multi-point statistical (MPS) simulation based on annealing techniques or genetic algorithms is the excessive amount of time and memory that must be spent in order to achieve convergence. In this work we propose code optimizations and parallelization schemes for a genetic-based MPS code with the aim of speeding up execution. The code optimizations reduce cache misses in array accesses, avoid branching instructions, and increase the locality of the accessed data. The hybrid parallelization scheme involves a fine-grain parallelization of loops using a shared-memory programming model (OpenMP) and a coarse-grain distribution of load among several computational nodes using a distributed-memory programming model (MPI). Convergence, execution time, and speed-up results are presented using 2D training images of sizes 100 × 100 × 1 and 1000 × 1000 × 1 on a distributed-shared memory supercomputing facility.
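The coarse-grain MPI distribution described here amounts to a block decomposition of the training-image rows among ranks. A small sketch of that index arithmetic (the rank count and grid size are illustrative, not taken from the paper):

```python
def block_range(n_rows, n_ranks, rank):
    """Rows [lo, hi) owned by `rank` in an even block decomposition,
    spreading the remainder over the first n_rows % n_ranks ranks."""
    base, extra = divmod(n_rows, n_ranks)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

# A 1000-row training image split among 3 ranks: sizes 334, 333, 333.
ranges = [block_range(1000, 3, r) for r in range(3)]
```

Keeping block sizes within one row of each other is what keeps the coarse-grain load balanced before the fine-grain OpenMP level takes over inside each block.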
3.
Graph matching and similarity measures of graphs have many applications to pattern recognition, machine vision in robotics,
and similarity-based approximate reasoning in artificial intelligence. This paper proposes a method of matching and a similarity
measure between two directed labeled graphs. We define the degree of similarity, the similar correspondence, and the similarity
map which denotes the matching between the graphs. As an approximate computing method, we apply genetic algorithms (GA) to
find a similarity map and compute the degree of similarity between graphs. For speed, we parallelize almost all steps of
the GA. We implemented both the sequential and the parallel GA as C programs and ran simulations for both; the results
show that our method is efficient and useful.
This work was presented, in part, at the Second International Symposium on Artificial Life and Robotics, Oita Japan, February
18–20, 1997.
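A toy version of the GA search for a similarity map described in this abstract: candidate solutions are node correspondences between two small directed graphs, and fitness counts preserved edges. The graphs, population size, and mutation scheme below are all invented for illustration, not the authors' parameters:

```python
import random

def fitness(mapping, edges_a, edges_b):
    """Number of edges of graph A preserved under the node mapping."""
    return sum((mapping[u], mapping[v]) in edges_b for (u, v) in edges_a)

def ga_similarity_map(edges_a, edges_b, n_nodes, pop=30, gens=60, seed=1):
    rng = random.Random(seed)
    # each individual is a permutation: node i of A -> mapping[i] of B
    population = [rng.sample(range(n_nodes), n_nodes) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda m: -fitness(m, edges_a, edges_b))
        survivors = population[:pop // 2]       # elitist selection
        children = []
        for parent in survivors:
            child = parent[:]
            i, j = rng.sample(range(n_nodes), 2)  # swap mutation
            child[i], child[j] = child[j], child[i]
            children.append(child)
        population = survivors + children
    best = max(population, key=lambda m: fitness(m, edges_a, edges_b))
    return best, fitness(best, edges_a, edges_b)

# Two 4-node digraphs that are isomorphic under a relabeling.
A = {(0, 1), (1, 2), (2, 3)}
B = {(3, 2), (2, 1), (1, 0)}
best_map, score = ga_similarity_map(A, B, n_nodes=4)
```

Because every individual's fitness is evaluated independently, the evaluation loop is the natural place to parallelize, as the abstract notes.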
4.
This paper generalizes the widely used Nelder and Mead (Comput J 7:308–313, 1965) simplex algorithm to parallel processors.
Unlike most previous parallelization methods, which are based on parallelizing the tasks required to compute a specific objective
function given a vector of parameters, our parallel simplex algorithm uses parallelization at the parameter level. Our parallel
simplex algorithm assigns to each processor a separate vector of parameters corresponding to a point on a simplex. The processors
then conduct the simplex search steps for an improved point, communicate the results, and a new simplex is formed. The advantage
of this method is that our algorithm is generic and can be applied, without re-writing computer code, to any optimization
problem to which the non-parallel Nelder–Mead algorithm is applicable. The method is also easily scalable to any degree of parallelization
up to the number of parameters. In a series of Monte Carlo experiments, we show that this parallel simplex method yields computational
savings, in some experiments up to three times the number of processors.
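The parameter-level scheme can be caricatured as follows: each worker independently reflects one non-best vertex through the centroid of the others, all trial points are evaluated concurrently, and the improved points form the next simplex. This is a simplified thread-based sketch with a made-up quadratic objective and a contraction fallback, not the authors' full algorithm:

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):  # illustrative objective: minimum at (1, 2)
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

def parallel_simplex_step(simplex):
    """One synchronized step: every vertex except the best is handled by
    its own worker; trial points are evaluated in parallel."""
    simplex = sorted(simplex, key=f)
    best = simplex[0]

    def reflect(i):
        others = [p for j, p in enumerate(simplex) if j != i]
        centroid = [sum(c) / len(others) for c in zip(*others)]
        trial = [2 * c - x for c, x in zip(centroid, simplex[i])]
        if f(trial) < f(simplex[i]):          # accept improving reflection
            return trial
        # otherwise contract this vertex halfway toward the best point
        return [(x + b) / 2 for x, b in zip(simplex[i], best)]

    with ThreadPoolExecutor() as pool:
        rest = list(pool.map(reflect, range(1, len(simplex))))
    return [best] + rest

simplex = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
for _ in range(30):
    simplex = parallel_simplex_step(simplex)
```

Since the best vertex is carried over unchanged, the best objective value never worsens, mirroring the generic, objective-agnostic character of the method described above.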
5.
In this paper we present efficient algorithms for packet routing on the reconfigurable linear array and the reconfigurable
two-dimensional mesh. We introduce algorithms that are efficient in the worst case and algorithms that are better on average.
The time bounds presented are better than those achievable on the conventional mesh and previously known algorithms. We present
two variants of the reconfigurable mesh. In the first model, M_r, the processors are attached to a reconfigurable bus, the
individual edge connections being bidirectional. In the second model, M_mr, the processors are attached to two unidirectional
buses. In this paper we present lower bounds and nearly matching upper bounds for packet routing on these two models. As a
consequence, we solve two of the open problems mentioned in [9].
Received August 17, 1998; revised November 3, 1999.
6.
This work parallelized a widely used structural analysis platform, OpenSees, using graphics processing units (GPUs). The paper presents task decomposition diagrams with data flow and the sequential and parallel flowcharts for element matrix/vector calculations. It introduces a Bulk Model to ease the parallelization of the element matrix/vector calculations, and presents an implementation of this model for shell elements. Three versions of the Bulk Model—sequential, OpenMP multi-threaded, and CUDA GPU-parallelized—were implemented in this work. Nonlinear dynamic analyses of two building models subjected to a tri-axial earthquake were tested. The results demonstrate speedups higher than four on a 4-core system, while the GPU parallelism achieves speedups higher than 7.6 on a single GPU device in comparison to the original sequential implementation.
7.
Doruk Bozdağ, Assefaw H. Gebremedhin, Fredrik Manne, Erik G. Boman, Umit V. Catalyurek 《Journal of Parallel and Distributed Computing》2008
We present a scalable framework for parallelizing greedy graph coloring algorithms on distributed-memory computers. The framework unifies several existing algorithms and blends a variety of techniques for creating or facilitating concurrency, including exploiting features of the initial data distribution, the use of speculative coloring and randomization, and a BSP-style organization of computation and communication. We experimentally study the performance of several specialized algorithms designed using the framework and implemented using MPI. The experiments are conducted on two different platforms, and the test cases include large synthetic graphs as well as real graphs drawn from various application areas. Computational results show that, by setting the parameters of the framework in accordance with the size and structure of the graph being colored, one can obtain implementations that yield good speedup while using about the same number of colors as a sequential greedy algorithm. Our implementation is freely available as part of the Zoltan parallel data management and load-balancing library.
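The sequential baseline these experiments compare against is first-fit greedy coloring: visit the vertices in some order and give each the smallest color not used by its already-colored neighbors. A compact sketch (the example graph is made up; the paper's distributed, speculative version is far more involved):

```python
def greedy_color(adjacency):
    """First-fit greedy coloring; adjacency maps vertex -> neighbor list."""
    color = {}
    for v in adjacency:
        taken = {color[u] for u in adjacency[v] if u in color}
        c = 0
        while c in taken:   # smallest color unused by a colored neighbor
            c += 1
        color[v] = c
    return color

# A 4-cycle: two colors suffice, and greedy finds them in this order.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
coloring = greedy_color(ring)
```

The sequential dependence on already-colored neighbors is what the framework's speculative coloring relaxes: ranks color concurrently, then detect and repair conflicts in BSP-style rounds.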
8.
We construct a parallel algorithm, suitable for distributed-memory architectures, of an explicit shock-capturing finite volume method for solving the two-dimensional shallow water equations. The finite volume method is based on the very popular approximate Riemann solver of Roe and is extended to second-order spatial accuracy by an appropriate TVD technique. The parallel code is applied to distributed-memory architectures using domain decomposition techniques, and we investigate its performance on a grid computer and on a distributed shared-memory supercomputer. The effectiveness of the parallel algorithm is considered for specific benchmark test cases. Its performance, measured in terms of execution time and speedup factors, reveals the efficiency of the implementation.
9.
Teodoro G, Tavares T, Ferreira R, Kurc T, Meira W, Guedes D, Pan T, Saltz J 《International journal of parallel programming》2008,36(2):250-266
Scientific workflow systems have been introduced in response to the demand of researchers from several domains of science
who need to process and analyze increasingly larger datasets. The design of these systems is largely based on the observation
that data analysis applications can be composed as pipelines or networks of computations on data. In this work, we present
a run-time support system that is designed to facilitate this type of computation in distributed computing environments. Our
system is optimized for data-intensive workflows, in which efficient management and retrieval of data, coordination of data
processing and data movement, and check-pointing of intermediate results are critical and challenging issues. Experimental
evaluation of our system shows that linear speedups can be achieved for sophisticated applications, which are implemented
as a network of multiple data processing components.
10.
Kenneth L. Rice, Tarek M. Taha, Christopher N. Vutsinas 《The Journal of supercomputing》2009,47(1):21-43
This paper presents the implementation and scaling of a neocortex inspired cognitive model on a Cray XD1. Both software and
reconfigurable-logic-based FPGA implementations of the model are examined. This model belongs to a new class of biologically
inspired cognitive models. Large scale versions of these models have the potential for significantly stronger inference capabilities
than current conventional computing systems. These models have large amounts of parallelism and simple computations, thus
allowing highly efficient hardware implementations. As a result, hardware-acceleration of these models can produce significant
speedups over fully software implementations. Parallel software and hardware-accelerated implementations of such a model are
investigated for networks of varying complexity. A scaling analysis of these networks is presented and utilized to estimate
the throughput of both hardware-accelerated and software implementations of larger networks that utilize the full resources
of the Cray XD1. Our results indicate that hardware-acceleration can provide average throughput gains of 75 times over software-only
implementations of the networks we examined on this system.
11.
A repartitioning hypergraph model for dynamic load balancing
Umit V. Catalyurek, Erik G. Boman, Karen D. Devine, Doruk Bozdağ, Robert T. Heaphy, Lee Ann Riesen 《Journal of Parallel and Distributed Computing》2009
In parallel adaptive applications, the computational structure of the applications changes over time, leading to load imbalances even though the initial load distributions were balanced. To restore balance and to keep communication volume low in further iterations of the applications, dynamic load balancing (repartitioning) of the changed computational structure is required. Repartitioning differs from static load balancing (partitioning) due to the additional requirement of minimizing migration cost to move data from an existing partition to a new partition. In this paper, we present a novel repartitioning hypergraph model for dynamic load balancing that accounts for both communication volume in the application and migration cost to move data, in order to minimize the overall cost. The use of a hypergraph-based model allows us to accurately model communication costs rather than approximate them with graph-based models. We show that the new model can be realized using hypergraph partitioning with fixed vertices and describe our parallel multilevel implementation within the Zoltan load balancing toolkit. To the best of our knowledge, this is the first implementation for dynamic load balancing based on hypergraph partitioning. To demonstrate the effectiveness of our approach, we conducted experiments on a Linux cluster with 1024 processors. The results show that, in terms of reducing total cost, our new model compares favorably to the graph-based dynamic load balancing approaches, and multilevel approaches improve the repartitioning quality significantly.
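The objective described here combines two terms: the communication volume of the candidate partition and the migration cost of moving data into it. A tiny sketch of how such a total might be scored, with a graph-edge approximation of communication rather than the paper's hypergraph model; the weighting factor `alpha` and the example partitions are invented:

```python
def repartition_cost(old_part, new_part, edges, data_size, alpha=3.0):
    """Total cost = alpha * application communication + migration volume.

    old_part/new_part map item -> owning rank, edges are communicating
    item pairs, data_size maps item -> bytes moved if its owner changes.
    """
    comm = sum(1 for u, v in edges if new_part[u] != new_part[v])
    migration = sum(data_size[i] for i in old_part
                    if old_part[i] != new_part[i])
    return alpha * comm + migration

old = {"a": 0, "b": 0, "c": 1, "d": 1}
stay = repartition_cost(old, old, edges=[("b", "c")],
                        data_size={k: 10 for k in old})
move = repartition_cost(old, {"a": 0, "b": 0, "c": 0, "d": 1},
                        edges=[("b", "c")],
                        data_size={k: 10 for k in old})
```

With these made-up numbers, eliminating the cross-partition edge costs more in migration (10) than it saves in repeated communication (3), illustrating the trade-off the repartitioner must weigh; the paper resolves it via hypergraph partitioning with fixed vertices, which this sketch does not implement.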
12.
Many problems in the operations research field cannot be solved to optimality within reasonable amounts of time with current computational resources. In order to find acceptable solutions to these computationally demanding problems, heuristic methods such as genetic algorithms are often developed. Parallel computing provides alternative design options for heuristic algorithms, as well as the opportunity to obtain performance benefits in both the computational time and the solution quality of these heuristics. Heuristic algorithms may be designed to benefit from parallelism by taking advantage of the parallel architecture. This study investigates the performance of the same global parallel genetic algorithm on two popular parallel architectures in order to examine the interaction between parallel platform choice and genetic algorithm design. The computational results illustrate the impact of platform choice on parallel heuristic methods. The paper develops computational experiments to compare algorithm development on a shared-memory architecture and a distributed-memory architecture. The results suggest that the performance of a parallel heuristic can be increased by considering the desired outcome and tailoring the development of the parallel heuristic to a specific platform based on the hardware and software characteristics of that platform.
13.
High-dimensional data is pervasive in many fields such as engineering, geospatial, and medical applications. It is a constant challenge to build tools that help people in these fields understand the underlying complexities of their data. Many techniques perform dimensionality reduction or other "compression" to show views of data in two or three dimensions, leaving the data analyst to infer relationships with the remaining independent and dependent variables. Contextual self-organizing maps offer a way to represent and interact with all dimensions of a data set simultaneously. However, the computational time needed to generate these representations limits their feasibility in realistic industry settings. Batch self-organizing maps provide a data-independent method that allows the training process to be parallelized and therefore sped up, saving time and money involved in processing data prior to analysis. This research parallelizes the batch self-organizing map by combining network partitioning and data partitioning methods with CUDA on the graphics processing unit to achieve significant reductions in training time. Training-time reductions of up to twenty-five times were found while using map sizes where other implementations have shown weakness. The reduced training times make the contextual self-organizing map a viable option for engineering data visualization.
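The batch SOM update is what makes this parallelization possible: unlike the online rule, each epoch's new codebook is a neighborhood-weighted average over all data points, so per-sample contributions are order-independent and can be partitioned across threads or CUDA blocks. A small NumPy sketch with a 1-D map topology; the map size, data, and Gaussian width are arbitrary:

```python
import numpy as np

def batch_som_epoch(codebook, data, sigma):
    """One batch update: w_j = sum_i h(j, bmu_i) x_i / sum_i h(j, bmu_i)."""
    n_units = len(codebook)
    # best-matching unit for every sample (independent -> parallelizable)
    d = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    bmu = d.argmin(axis=1)
    # Gaussian neighborhood over a 1-D map topology
    units = np.arange(n_units)
    h = np.exp(-((units[None, :] - bmu[:, None]) ** 2) / (2 * sigma ** 2))
    # weighted average of all samples per unit (order-independent)
    return (h.T @ data) / h.sum(axis=0)[:, None]

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-5, 0.1, (20, 2)),
                       rng.normal(5, 0.1, (20, 2))])
codebook = rng.normal(0, 1, (4, 2))
for _ in range(10):
    codebook = batch_som_epoch(codebook, data, sigma=0.5)
```

Because the per-sample sums commute, the network-partitioning and data-partitioning schemes in the abstract correspond to splitting this reduction over units or over samples.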
14.
Genetic algorithms (GAs) are a class of randomized search methods inspired by the evolutionary mechanisms of the natural world, and have been applied successfully to many large-scale combinatorial optimization problems. Using today's popular parallel computer systems, a GA can be parallelized to overcome the speed bottleneck of the standard GA. In this paper, a coarse-grained parallel genetic algorithm is implemented in C++ under the MPI parallel environment. Drawing on the characteristics of parallel GAs, we propose a strategy for optimizing logistics distribution routes, give the corresponding algorithm, and validate it experimentally. The results show that, compared with the traditional GA, the parallel GA improves computation speed, reduces average overhead time, and finds a better minimum total route length.
15.
16.
17.
Cloud computing enables many applications of Web services and rekindles interest in providing ERP services via the Internet. It has the potential to reshape the way IT services are consumed. Recent research indicates that ERP delivered through SaaS will outperform traditional IT offerings. However, distributing a service is more complicated than distributing a product because of the immateriality, the integration, and the one-shot principle inherent to services. This paper defines a CloudERP platform on which enterprise customers can select web services and customize a unique ERP system to meet their specific needs. CloudERP aims to provide enterprise users with the flexibility of renting an entire ERP service through multiple vendors. The paper also addresses the challenge of composing web services and proposes a web-based solution for automating the ERP service customization process. The proposed service composition method builds on the genetic algorithm concept and incorporates knowledge of web services, extracted from the web service platform using rough set theory. A system prototype was built on the Google App Engine platform to verify the proposed composition process. Experimental results from running the prototype show that the composition method works effectively and has great potential for supporting a fully functional CloudERP platform.
18.
Research on a remote sharing mechanism for large-scale scientific instruments
The sharing of large-scale scientific instruments is currently an important component of building the national science and technology infrastructure platform, and building a sharing platform is an effective measure for opening up science and technology resources and enhancing innovation capability. This paper analyzes the current state of scientific instrument sharing, proposes a classification method for the remote sharing of large-scale scientific instruments, and designs a solution for the remote sharing of instruments that can be operated remotely. The proposed scheme is a breakthrough over the traditional information-only sharing of large-scale scientific instruments and has significant social and economic value.
19.
李昕 《计算机工程与科学》2005,27(10):107-110
This paper first analyzes the current state of Web service technology in grid computing, reviewing the Web services architecture, the representation and description of Web services, and information exchange, discovery, and publication in Web services. It then discusses the development of the Web service resource model; the description, discovery, and location of Web service resources; and the scheduling and allocation of Web service resources. Finally, it identifies the key problems that must be solved for Web service technology to advance in grid computing.
20.
Juan A. Villar, Francisco J. Andújar, José L. Sánchez, Francisco J. Alfaro, José A. Gámez, José Duato 《Journal of Parallel and Distributed Computing》2013
High-radix switches reduce network cost and improve network performance, especially in large switch-based interconnection networks. However, the current integration scale makes it difficult to implement such switches in a single chip. An interesting alternative for building high-radix switches is to combine several current smaller single-chip switches to obtain switches with a greater number of ports. A key design issue for this kind of high-radix switch is the internal switch configuration, specifically the correspondence between the ports of these high-radix switches and the ports of their smaller internal single-chip switches. In this paper we use artificial intelligence and data mining techniques to obtain the optimal internal configuration of all the switches in the network of large supercomputers running parallel applications. Simulation results show that, using the resultant switch configurations, it is possible to achieve performance similar to that of single-chip switches with the same radix, which would be unfeasible at the current integration scale.