Similar Documents
20 similar documents found (search time: 30 ms)
1.

In this study, the problems of elastic and elastoplastic torsion were formulated with the finite element method, and the finite element code was parallelized on both shared- and distributed-memory architectures. An assembly method with a high degree of parallelism and minimal memory consumption was proposed to build the compressed global stiffness matrix directly. Parallel programming principles were described for both the shared-memory and the distributed-memory approach, and well-known parallel mathematical libraries were briefly reviewed. The main focus of the paper is a clear, detailed explanation of the parallelization mechanisms on the two memory architectures, including some settings of the Linux operating system required for large-scale problems. To verify the proposed method and its parallel performance, several benchmark examples with different mesh sizes were presented and compared with their respective analytical solutions. The proposed sparse assembly algorithm reduced the required memory significantly (by a factor of about 10^3.5 to 10^5.5), and a speedup of about 3.4 was obtained for the elastoplastic torsion problem on a commodity multicore computer.
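The idea of assembling a compressed global stiffness matrix directly can be illustrated with a minimal sketch (not the authors' code): element contributions are scattered into per-row hash maps and then flattened into CSR arrays, so a dense global matrix is never formed. The mesh, connectivity, and `element_stiffness` callback below are hypothetical.

```python
import numpy as np
from collections import defaultdict

def assemble_csr(n_nodes, elements, element_stiffness):
    """Assemble the global stiffness matrix directly in compressed (CSR)
    form: accumulate entries in per-row dictionaries, then flatten."""
    rows = [defaultdict(float) for _ in range(n_nodes)]
    for elem in elements:
        ke = element_stiffness(elem)          # local stiffness matrix
        for a, i in enumerate(elem):
            for b, j in enumerate(elem):
                rows[i][j] += ke[a][b]        # scatter into global row i
    # flatten the row dictionaries into CSR arrays
    indptr, indices, data = [0], [], []
    for r in rows:
        for j in sorted(r):
            indices.append(j)
            data.append(r[j])
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)
```

In a shared-memory setting the element loop can be split across threads, each thread owning a disjoint set of rows so no locking is needed on the row dictionaries.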


2.
Data-parallel volume rendering algorithms
In this presentation, we consider the image-composition scheme for parallel volume rendering in which each processor is assigned a portion of the volume. A processor renders its data by using any existing volume-rendering algorithm. We describe one such parallel algorithm that also takes advantage of vector-processing capabilities. The resulting images from all processors are then combined (composited) in visibility order to form the final image. The major advantage of this approach is that, as viewing and shading parameters change, only 2D partial images, and not 3D volume data, are communicated among processors. Through experimental results and performance analysis, we show that our parallel algorithm is amenable to extremely efficient implementations on distributed memory, multiple instruction-multiple data (MIMD), vector-processor architectures. This algorithm is also very suitable for hardware implementation based on image composition architectures. It supports various volume-rendering algorithms, and it can be extended to provide load-balanced execution.
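Combining partial images in visibility order, as described above, is typically done with the "over" operator on premultiplied RGBA; a minimal sequential sketch (in the parallel scheme only these 2D partial images would be exchanged, not volume data):

```python
import numpy as np

def composite_over(partials):
    """Composite partial images in front-to-back visibility order using
    the 'over' operator; each partial is an (H, W, 4) premultiplied-RGBA
    array, front-most processor's image first."""
    acc = np.zeros_like(partials[0])
    for img in partials:
        trans = 1.0 - acc[..., 3:4]   # transparency remaining in front
        acc = acc + trans * img       # premultiplied 'over'
    return acc
```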

3.

In this paper, we present several important details of the process of parallelizing a legacy code, mostly related to the problem of preserving the numerical output of the legacy code while obtaining a balanced workload for parallel processing. Since we maintained the non-uniform mesh imposed by the original finite element code, we had to develop a specially designed data distribution among processors so that the data restrictions of the finite element method are met. In particular, we introduce a data distribution method that is initially used in shared-memory parallel processing and obtains better performance than the previous parallel program version. Moreover, this method can be extended to other parallel platforms such as distributed-memory parallel computers. We present results covering several problems related to performance profiling on different (development and production) parallel platforms. The use of new and old parallel computing architectures leads to different behavior of the same code, which in all cases performs better on multiprocessor hardware.


4.
3-D data visualization is very useful for medical imaging and computational fluid dynamics. Volume rendering can be used to exhibit the shape and volumetric properties of 3-D objects. However, volume rendering requires a considerable amount of time to process the large volume of data. To deliver the necessary rendering rates, parallel hardware architectures such as distributed memory multicomputers offer viable solutions. The challenge is to design efficient parallel algorithms that utilize the hardware parallelism effectively. In this paper, we present two efficient parallel volume rendering algorithms, the 1D-partition and 2D-partition methods, based on the shear-warp factorization for distributed memory multicomputers. The 1D-partition method has a performance bound that depends on the size of the volume data. If the number of processors is below a threshold, the 1D-partition method can deliver a good rendering rate; if the number of processors is over the threshold, the 2D-partition method can be used. To evaluate the performance of these two algorithms, we implemented the proposed methods along with the slice data partitioning, volume data partitioning, and sheared volume data partitioning methods on an IBM SP2 parallel machine. Six volume data sets were used as the test samples. The experimental results show that the proposed methods outperform the other comparable algorithms for all test samples. When the number of processors is over the threshold, the experimental results also demonstrate that the 2D-partition method is better than the 1D-partition method.

5.
The paper presents an extension of the composite programmable graph grammar (CP-graph grammar) suitable for modeling the parallel direct solver algorithm utilized by the hp finite element method (hp-FEM). In the proposed graph grammar model, the computational mesh is represented by a CP-graph, and the solver algorithm is modeled by a set of graph grammar productions. The graph grammar model makes it possible to examine the concurrency of the algorithm by analyzing the interdependence between the atomic tasks, tasks and super-tasks. The atomic tasks correspond to the graph grammar productions, representing basic indivisible parts of the algorithm; the level of atomic tasks models the concurrency for shared memory architectures. The tasks, in turn, correspond to groups of atomic tasks with predefined inter-task communication channels; they constitute the grain for the decomposition of the parallel algorithm for distributed memory architectures. Finally, the super-tasks correspond to groups of tasks resulting from the execution of the load-balancing algorithm. The solver algorithm is tested on a distributed-memory Linux cluster for up to 192 processors. Copyright © 2009 John Wiley & Sons, Ltd.

6.
The authors describe a novel algorithm for computing view-independent finite element radiosity solutions on distributed shared memory parallel architectures. Our approach is based on the notion of a subiteration being the transfer of energy from a single source to a subset of the scene's receiver patches. By using an efficient queue based scheduling system to process these subiterations, we show how radiosity solutions can be generated without the need for processor synchronization between iterations of the progressive refinement algorithm. The only significant source of interprocessor communication required by our method is for visibility calculations. We also describe a perceptually driven approach to visibility estimation, which employs an efficient volumetric grid structure and attempts to reduce the amount of interprocessor communication by approximating visibility queries between distant patches. Our algorithm also eliminates the need for dynamic load balancing until the end of the solution process and is shown to achieve a superlinear speedup in many situations.

7.

Community detection (or clustering) in large-scale graphs is an important problem in graph mining, as communities reveal interesting organizational and functional characteristics of a network. The Louvain algorithm is an efficient sequential algorithm for community detection; however, sequential algorithms fail to scale to emerging large-scale data, and scalable parallel algorithms are necessary to process large graph datasets. In this work, we present a comparative analysis of our different parallel implementations of the Louvain algorithm. We design parallel algorithms for the Louvain method in both shared-memory and distributed-memory settings. Developing distributed-memory parallel algorithms is challenging because of inter-process communication and load-balancing issues. We incorporate dynamic load balancing in our final algorithm, DPLAL (Distributed Parallel Louvain Algorithm with Load-balancing). DPLAL overcomes the performance bottlenecks of the previous algorithms and shows around 12-fold speedup, scaling to a larger number of processors. We also compare the performance of our algorithm with other prominent algorithms in the literature and obtain better or comparable performance. We identify the challenges in developing distributed-memory algorithms and provide an optimized solution, DPLAL, together with a performance analysis on large-scale real-world networks from different domains.
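Louvain-style methods greedily maximize modularity. As an illustration of the objective being optimized (not of the DPLAL implementation), a direct computation of modularity Q for a given partition of a weighted undirected graph:

```python
def modularity(adj, comm):
    """Modularity Q of a partition.
    adj:  {node: {neighbor: weight}} for an undirected graph
          (each edge stored in both directions).
    comm: {node: community id}."""
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())  # = 2m
    deg = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    q = 0.0
    for u in adj:
        for v in adj:
            if comm[u] == comm[v]:
                a_uv = adj[u].get(v, 0.0)          # actual edge weight
                q += a_uv - deg[u] * deg[v] / two_m  # minus expected weight
    return q / two_m
```

For two triangles joined by a single edge, partitioning each triangle into its own community gives the well-known value Q = 5/14 ≈ 0.357.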


8.
This paper first identifies some of the key concerns about the techniques and algorithms developed for parallel model checking; specifically, the load-balancing problem and the large queue sizes inherent in a static partition algorithm. This paper then presents a load balancing algorithm to improve the run time performance in distributed model checking, reduce maximum queue size, and reduce the number of states expanded before error discovery. The load balancing algorithm is based on generalized dimension exchange (GDE). This paper presents an empirical analysis of the GDE-based load balancing algorithm on three different supercomputing architectures: distributed memory clusters, Networks of Workstations (NOW), and shared memory machines. The analysis shows increased speedup, lower maximum queue sizes and fewer total states explored before error discovery on each of the architectures. Finally, this paper presents a study of the communication overhead incurred by using the load balancing algorithm, which, although significant, does not offset the performance gains.
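Generalized dimension exchange can be illustrated on a hypercube topology: in each sweep, every processor exchanges a fraction λ of its load difference with its neighbor along each dimension. A serial simulation sketch (with λ = 0.5 the exchange reduces to pairwise averaging, and a single sweep equalizes the load on a hypercube):

```python
def gde_balance(loads, lam=0.5, sweeps=1):
    """Simulate GDE load balancing on a d-dimensional hypercube of
    len(loads) processors (must be a power of two)."""
    n = len(loads)
    d = n.bit_length() - 1
    loads = list(loads)
    for _ in range(sweeps):
        for k in range(d):             # one exchange per hypercube dimension
            for p in range(n):
                q = p ^ (1 << k)       # neighbor across dimension k
                if p < q:              # handle each pair once
                    diff = loads[p] - loads[q]
                    loads[p] -= lam * diff
                    loads[q] += lam * diff
    return loads
```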

9.
In this paper, a modified parallel Jacobi-conditioned conjugate gradient (CG) method is proposed for solving linear elastic finite element systems of equations. The conventional element-by-element and diagonally conditioned approaches are discussed with respect to parallel implementation on distributed memory MIMD architectures. The effects of communication overheads on the efficiency of the parallel CG solver are considered, and it is shown that for efficient performance of a parallel CG solver, the interprocessor communication has to be carried out concurrently. A concurrent communication scheme is proposed by relating the semi-bandwidth of the stiffness matrix to the number of independent degrees of freedom and the number of processors, and by inducing directionalization of communication within the processor pipeline. With the aid of two examples, the effectiveness of the proposed method is demonstrated, showing that the cost of communication remains low and relatively insensitive to an increase in the number of processors.
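A Jacobi (diagonal) preconditioner applies the inverse diagonal of the matrix to the residual, an element-wise operation that parallelizes trivially, which is part of its appeal here. A minimal serial sketch of Jacobi-preconditioned CG (not the paper's parallel implementation):

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate gradients with Jacobi preconditioner M = diag(A),
    for symmetric positive definite A."""
    m_inv = 1.0 / np.diag(A)      # applying M^-1 is element-wise
    x = np.zeros_like(b)
    r = b - A @ x                 # initial residual
    z = m_inv * r                 # preconditioned residual
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = m_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```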

10.
Initially, parallel algorithms were designed by parallelising the existing sequential algorithms for frequently occurring problems on available parallel architectures.

More recently, parallel strategies have been identified and utilised resulting in many new parallel algorithms. However, the analysis of such techniques reveals that further strategies can be applied to increase the parallelism. One of these strategies, i.e., increasing the computational work in each processing node, can reduce the memory accesses and hence congestion in a shared memory multiprocessor system. Similarly, when network message passing is minimised in a distributed memory processor system, dramatic improvements in the performance of the algorithm ensue.

A frequently occurring computational problem in digital signal processing (DSP) is the solution of symmetric positive definite Toeplitz linear systems. The Levinson algorithm for solving such linear systems exploits the Toeplitz matrix property to advantage in the elimination process of each element. However, it can be shown that in the Parallel Implicit Elimination (PIE) method, where more than one element is eliminated simultaneously, the Toeplitz structure can again be exploited to advantage. This relatively simple strategy yields a reduction in shared-memory accesses or network message passing, resulting in a significant improvement in the performance of the algorithm [2].
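For reference, the sequential Levinson recursion solves a symmetric Toeplitz system in O(n^2) operations by extending a forward vector (whose reversal is the backward vector, by symmetry) one order at a time; the PIE method discussed above restructures this elimination to expose parallelism. A compact sketch of the sequential recursion:

```python
def levinson_solve(t, y):
    """Solve T x = y, where T is the symmetric Toeplitz matrix whose
    first column is t, via the O(n^2) Levinson recursion."""
    n = len(y)
    f = [1.0 / t[0]]                  # forward vector: T_k f = e_1
    x = [y[0] / t[0]]                 # current solution: T_k x = y[:k]
    for k in range(1, n):
        # errors in the new last row when padding with a trailing zero
        eps_f = sum(t[k - i] * f[i] for i in range(k))
        eps_x = sum(t[k - i] * x[i] for i in range(k))
        alpha = 1.0 / (1.0 - eps_f * eps_f)
        f_pad = f + [0.0]             # forward vector, zero-padded
        b_pad = [0.0] + f[::-1]       # backward vector, zero-padded
        f = [alpha * (f_pad[i] - eps_f * b_pad[i]) for i in range(k + 1)]
        b = f[::-1]                   # symmetric T: backward = reversed f
        x = [(x[i] if i < k else 0.0) + (y[k] - eps_x) * b[i]
             for i in range(k + 1)]
    return x
```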

11.
We consider a graph-theoretical model and study a parallel implementation of the well-known Gaussian elimination method on parallel distributed memory architectures, where the communication delay for the transmission of an elementary data item is higher than the computation time of an elementary instruction. We propose and analyze two low-complexity algorithms for scheduling the tasks of parallel Gaussian elimination on an unbounded number of completely connected processors. We compare these two algorithms with a higher-complexity general-purpose scheduling algorithm, the DSC heuristic, proposed by A. Gerasoulis and T. Yang (1993).

12.
In a previous paper (Vidal et al., 2008, [21]), we presented a parallel solver for the symmetric Toeplitz eigenvalue problem, which is based on a modified version of the Lanczos iteration. However, its efficient implementation on modern parallel architectures is not trivial. In this paper, we present an efficient implementation on multicore processors which takes advantage of the features of this architecture. Several optimization techniques have been incorporated into the algorithm: improvement of the Discrete Sine Transform routines, utilization of the Gohberg-Semencul formulas to solve the Toeplitz linear systems, optimization of the workload distribution among processors, and others. Although the algorithm follows a distributed-memory parallel programming paradigm dictated by the nature of the mathematical derivation, special attention has been paid to obtaining the best performance in multicore environments. Hybrid techniques, which merge OpenMP and MPI, have been used to increase the performance in these environments. Experimental results show that our implementation takes advantage of multicore architectures and clearly outperforms the results obtained with LAPACK or ScaLAPACK.

13.
Atomistic simulations of thin film deposition, based on the lattice Monte Carlo method, provide insights into the microstructure evolution at the atomic level. However, large-scale atomistic simulation is limited on a single computer due to memory and speed constraints. Parallel computation, although promising in memory and speed, has not been widely applied in these simulations because of the intimidating overhead. The key issue in achieving optimal performance is, therefore, to reduce communication overhead among processors. In this paper, we propose a new parallel algorithm for the simulation of large-scale thin film deposition incorporating two optimization strategies: (1) domain decomposition with sub-domain overlapping and (2) asynchronous communication. This algorithm was implemented both on message-passing-processor systems (MPP) and on cluster computers. We found that both architectures are suitable for parallel Monte Carlo simulation of thin film deposition in either a distributed memory mode or a shared memory mode with message-passing libraries.
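Strategy (1), domain decomposition with sub-domain overlapping, can be sketched for a 1-D lattice: each sub-domain is padded with ghost sites on its interior boundaries so neighboring domains can update independently between synchronizations. The partitioning helper below is purely illustrative, not the paper's implementation:

```python
def decompose_with_overlap(n_cells, n_domains, overlap=1):
    """Split a 1-D lattice of n_cells sites into n_domains sub-domains,
    padding each with `overlap` ghost sites where it touches a neighbor.
    Returns (ghost_lo, ghost_hi, owned_lo, owned_hi) per domain."""
    base, rem = divmod(n_cells, n_domains)
    domains, start = [], 0
    for d in range(n_domains):
        size = base + (1 if d < rem else 0)   # spread the remainder
        lo = max(0, start - overlap)          # ghost region start
        hi = min(n_cells, start + size + overlap)  # ghost region end
        domains.append((lo, hi, start, start + size))
        start += size
    return domains
```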

14.
For a real-time image-recognition system built on a multicomputer architecture of multiple TMS320C64 DSPs, we designed and implemented PIPORTDOS (Parallel Image Processing-Oriented Real-Time Distributed Operating System), a real-time distributed operating system for parallel image processing. PIPORTDOS is based on a microkernel architecture and comprises four layers: a hardware abstraction layer, a system core layer, a distributed message-communication mechanism, and a system services layer. Its multitasking kernel implements priority-based preemptive scheduling, inter-task synchronization and communication primitives, real-time interrupt handling, and an application-oriented buffer management mechanism. To support distributed parallel image processing, PIPORTDOS adopts a message-passing approach, and its implementation takes full account of the requirements of upper-layer image-processing algorithms and the hardware capabilities of the DSP. The measured performance indicators show that PIPORTDOS fully meets the system's hard real-time requirements and is functionally able to accommodate the different parallel structures required by the algorithms.

15.
Many problems in the operations research field cannot be solved to optimality within reasonable amounts of time with current computational resources. In order to find acceptable solutions to these computationally demanding problems, heuristic methods such as genetic algorithms are often developed. Parallel computing provides alternative design options for heuristic algorithms, as well as the opportunity to obtain performance benefits in both computational time and solution quality of these heuristics. Heuristic algorithms may be designed to benefit from parallelism by taking advantage of the parallel architecture. This study investigates the performance of the same global parallel genetic algorithm on two popular parallel architectures to examine the interaction of parallel platform choice and genetic algorithm design. The computational results of the study illustrate the impact of platform choice on parallel heuristic methods. This paper develops computational experiments to compare algorithm development on a shared memory architecture and a distributed memory architecture. The results suggest that the performance of a parallel heuristic can be increased by considering the desired outcome and tailoring the development of the parallel heuristic to a specific platform based on the hardware and software characteristics of that platform.
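A global (master-worker) parallel GA keeps the genetic operators sequential on the master and parallelizes only the fitness evaluation of the population each generation. A toy sketch using a thread pool on a OneMax objective; the objective, operators, and all parameters are hypothetical stand-ins, and on a distributed-memory platform the pool would be replaced by message passing:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def onemax(bits):
    """Toy fitness (OneMax): number of 1-bits."""
    return sum(bits)

def global_parallel_ga(fitness, n_bits=8, pop_size=16, generations=30, seed=1):
    """Global parallel GA: selection/crossover/mutation run on the master,
    fitness evaluation is farmed out to a worker pool each generation.
    Returns the best fitness seen."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = 0
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(generations):
            scores = list(pool.map(fitness, pop))    # parallel evaluation
            best = max(best, max(scores))
            ranked = [p for _, p in sorted(zip(scores, pop), reverse=True)]
            parents = ranked[: pop_size // 2]        # truncation selection
            pop = []
            while len(pop) < pop_size:
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, n_bits)
                child = a[:cut] + b[cut:]            # one-point crossover
                if rng.random() < 0.1:
                    child[rng.randrange(n_bits)] ^= 1  # bit-flip mutation
                pop.append(child)
    return best
```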

16.
Dynamic programming (DP) is a popular technique which is used to solve combinatorial search and optimization problems. This paper focuses on one type of DP, which is called nonserial polyadic dynamic programming (NPDP). Owing to the nonuniform data dependencies of NPDP, it is difficult to exploit either parallelism or locality. Worse still, the emerging multi/many-core architectures with small on-chip memory make these issues more challenging. In this paper, we address the challenges of exploiting the fine grain parallelism and locality of NPDP on multicore architectures. We describe a latency-tolerant model and a percolation technique for programming on multicore architectures. On an algorithmic level, both parallelism and locality benefit from a specific data dependence transformation of NPDP. Next, we propose a parallel pipelining algorithm by decomposing computation operators and percolating data through a memory hierarchy to create just-in-time locality. In order to predict the execution time, we formulate an analytical performance model of the parallel algorithm. The parallel pipelining algorithm achieves not only high scalability on the 160-core IBM Cyclops64, but portable performance as well, across the 8-core Sun Niagara and quad-core Intel Clovertown.
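Matrix-chain multiplication is a textbook instance of nonserial polyadic DP: entry (i, j) depends on non-adjacent pairs (i, k) and (k+1, j), which is exactly the nonuniform dependence pattern that makes parallelism and locality hard to exploit. A serial reference sketch:

```python
def matrix_chain_order(dims):
    """Nonserial polyadic DP: minimum number of scalar multiplications
    to multiply a chain of matrices, where matrix i has shape
    dims[i] x dims[i+1]."""
    n = len(dims) - 1                        # number of matrices
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):           # anti-diagonals of the table
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(                # nonuniform dependence on (i,k),(k+1,j)
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

Cells on the same anti-diagonal are independent, which is the parallelism a pipelined multicore schedule such as the one described above would exploit.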

17.
Among the many techniques in computer graphics, ray tracing is prized because it can render realistic images, albeit at great computational expense. Ray tracing's large computation requirements, coupled with its inherently parallel nature, make ray tracing algorithms attractive candidates for parallel implementation. In this paper, we illustrate the utility and the importance of a suite of performance analysis tools when exploring the performance of several approaches to ray tracing on a distributed memory parallel system. These ray tracing algorithm variations introduce parallelism based on both ray and object partitions. Traditional timing analysis can quantify the performance effects of different algorithm choices (i.e. when an algorithm is best matched to a given problem), but it cannot identify the causes of these performance differences. We argue, by example, that a performance instrumentation system is needed that can trace the execution of distributed memory parallel programs by recording the occurrence of parallel program events. The resulting event traces can be used to compile summary statistics that provide a global view of program performance. In addition, visualization tools permit the graphic display of event traces. Visual presentation of performance data is particularly useful, indeed, necessary for large-scale, parallel computations; assimilating the enormous volume of performance data mandates visual display.

18.
We present and compare a variety of parallelization approaches for a real-world case study on modern parallel and distributed computer architectures. Our case study is a production-quality, time-intensive algorithm for medical image reconstruction used in positron emission tomography (PET). We parallelize this algorithm for the main kinds of contemporary parallel architectures: shared-memory multiprocessors, distributed-memory clusters, graphics processing units (GPU) using the CUDA framework, and the Cell processor; finally, we show how these various architectures can be accessed in a distributed Grid environment. The main contribution of the paper, besides the parallelization approaches, is their systematic comparison regarding four important criteria: performance, programming comfort, accessibility, and cost-effectiveness. We report results of experiments on particular parallel machines of different architectures that confirm the findings of our systematic comparison.

19.
In this paper, a programming model is presented which enables scalable parallel performance on multi-core shared memory architectures. The model has been developed for application to a wide range of numerical simulation problems. Such problems involve time stepping or iteration algorithms where synchronization of multiple threads of execution is required. It is shown that traditional approaches to parallelism including message passing and scatter-gather can be improved upon in terms of speed-up and memory management. Using spatial decomposition to create orthogonal computational tasks, a new task management algorithm called H-Dispatch is developed. This algorithm makes efficient use of memory resources by limiting the need for garbage collection and takes optimal advantage of multiple cores by employing a "hungry" pull strategy. The technique is demonstrated on a simple finite difference solver and results are compared to traditional MPI and scatter-gather approaches. The H-Dispatch approach achieves near linear speed-up with results for efficiency of 85% on a 24-core machine. It is noted that the H-Dispatch algorithm is quite general and can be applied to a wide class of computational tasks on heterogeneous architectures involving multi-core and GPGPU hardware.
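The "hungry" pull strategy can be sketched with a shared task queue: each worker pulls a new task the moment it finishes its current one, so faster workers naturally take on more work and the load balances dynamically. This sketch is an illustration of the idea, not the H-Dispatch implementation:

```python
import queue
import threading

def hungry_pull(tasks, worker, n_workers=4):
    """Run worker(t) over all tasks with a pool of threads that each
    pull the next task from a shared queue as soon as they become idle."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)                       # all tasks enqueued up front
    results, lock = [], threading.Lock()

    def run():
        while True:
            try:
                t = q.get_nowait()     # pull when hungry
            except queue.Empty:
                return                 # no work left: worker exits
            r = worker(t)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=run) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```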

20.
Any parallel program has abstractions that are shared by the program's multiple processes. Such shared abstractions can considerably affect the performance of parallel programs, on both distributed and shared memory multiprocessors. As a result, their implementation must be efficient, and such efficiency should be achieved without unduly compromising program portability and maintainability. The primary contribution of the DSA library is its representation of shared abstractions as objects that may be internally distributed across different nodes of a parallel machine. Such distributed shared abstractions (DSA) are encapsulated so that their implementations are easily changed while maintaining program portability across parallel architectures. The principal results presented are: a demonstration that the fragmentation of object state across different nodes of a multiprocessor machine can significantly improve program performance; and that such object fragmentation can be achieved without compromising portability by changing object interfaces. These results are demonstrated using implementations of the DSA library on several medium scale multiprocessors, including the BBN Butterfly, Kendall Square Research, and SGI shared memory multiprocessors. The DSA library's evaluation uses synthetic workloads and a parallel implementation of a branch and bound algorithm for solving the traveling salesperson problem (TSP).
