Similar documents
Found 20 similar documents (search time: 15 ms)
1.
There are two distinct types of MIMD (Multiple Instruction, Multiple Data) computers: the shared memory machine, e.g. Butterfly, and the distributed memory machine, e.g. Hypercubes, Transputer arrays. Typically these utilize different programming models: the shared memory machine has monitors, semaphores and fetch-and-add; whereas the distributed memory machine uses message passing. Moreover there are two popular types of operating systems: a multi-tasking, asynchronous operating system and a crystalline, loosely synchronous operating system.

In this paper I first describe the Butterfly, Hypercube and Transputer array MIMD computers, and review monitors, semaphores, fetch-and-add and message passing; then I explain the two types of operating systems and give examples of how they are implemented on these MIMD computers. Next I discuss the advantages and disadvantages of shared memory machines with monitors, semaphores and fetch-and-add, compared to distributed memory machines using message passing, answering questions such as “is one model ‘easier’ to program than the other?” and “which is ‘more efficient’?”. One may think that a shared memory machine with monitors, semaphores and fetch-and-add is simpler to program and runs faster than a distributed memory machine using message passing, but we shall see that this is not necessarily the case. Finally I briefly discuss which type of operating system to use and on which type of computer. This of course depends on the algorithm one wishes to compute.
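The contrast between the two programming models can be illustrated with a minimal thread-based sketch (Python threads standing in for processors; all function names here are illustrative, not from the paper). The shared-memory version guards one shared total with a lock, in the spirit of monitors and semaphores; the message-passing version gives each worker private data and collects partial results over a queue.

```python
import threading
import queue

# Shared-memory style: workers update one shared counter guarded by a lock
# (a minimal stand-in for the monitor/semaphore model).
def shared_memory_sum(values, n_workers=4):
    total = [0]
    lock = threading.Lock()
    chunks = [values[i::n_workers] for i in range(n_workers)]

    def worker(chunk):
        local = sum(chunk)      # compute privately ...
        with lock:              # ... then update shared state atomically
            total[0] += local

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return total[0]

# Message-passing style: workers own their data and "send" partial results
# to a collector over a queue -- no shared mutable state at all.
def message_passing_sum(values, n_workers=4):
    inbox = queue.Queue()
    chunks = [values[i::n_workers] for i in range(n_workers)]

    def worker(chunk):
        inbox.put(sum(chunk))   # send the partial sum as a message

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads: t.start()
    for t in threads: t.join()
    return sum(inbox.get() for _ in range(n_workers))
```

Both return the same answer; the difference is entirely in who may touch which data, which is the heart of the shared-memory versus message-passing distinction.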


2.
This paper discusses the development of an automatic mesh generation technique designed to operate effectively on multiple instruction multiple data (MIMD) parallel computers. The meshing approach is hierarchical, that is, model entities are meshed after their boundaries have been meshed. Focus is on the region meshing step. An octree is constructed to serve as a localization tool and for efficiency. The tree is also key to the efficient parallelization of the meshing process since it supports the distribution of load to processors. The parallel mesh generation procedure repartitions the domain to be meshed and applies on-processor face removals until all face removals that require only local data have been performed. The remaining portion of the domain to be meshed is then dynamically repartitioned at the octant level using an Inertial Recursive Bisection method, and local face removals are applied again. Migration of a terminal octant involves migration of the octant data and the octant's mesh faces and/or mesh regions. Results show relatively good speed-ups for parallel face removals on small numbers of processors. Once the three-dimensional mesh has been generated, mesh regions may be scattered across processors. Therefore, a final dynamic repartitioning step is applied at the region level to produce a partition ready for finite element analysis.

3.
This paper presents a general strategy, based on a high level of abstraction, designed to reduce the development time and cost of parallel explicit finite element codes. Such a level of abstraction provides a general, rigorous, and efficient framework in which to tackle the parallelisation of various solvers. The intention of this work is not to make an exhaustive study or comparison of different platforms, but rather to provide some quantitative evidence of the efficiency of the presented parallel implementation strategy. The results obtained so far show that good scalability is achieved on parallel architectures such as the Cray T3D. They demonstrate that solutions that took days on a sequential machine can be reduced to hours or minutes on such a parallel machine.

4.
The R-matrix and Logarithmic Derivative methods are numerically very stable and are therefore ideal for integrating the large sets of coupled second-order linear differential equations which arise in non-exchange scattering problems (e.g., electron scattering by atoms and molecules). These calculations, which typically are repeated at many scattering energies, can become computationally demanding, requiring the use of massively parallel computers. Here the results of a study of various parallel decompositions of typical R-matrix propagator methods are reported. A data decomposition approach is employed in the solution-following Baluja-Burke-Morgan method, whereas a hybrid approach, involving both control and domain decomposition, is adopted for the potential-following Light-Walker method. Timings of test computations obtained using a Cray T3D computer demonstrate that R-matrix external region computations involving between 500 and 1500 scattering channels are feasible. The approach is easily extended to much larger calculations and to other computer architectures.

5.
Implementation of a boundary element method on distributed memory computers (cited 1 time: 0 self-citations, 1 by others)
In this paper, we analyse and compare different parallel implementations of the Boundary Element Method on distributed memory computers. We deal with the computation of two-dimensional magnetostatic problems. The resulting linear system is solved using Householder transformations and Gaussian elimination. Experimental results are obtained on a Meiko Computing Surface with 32 T800 transputers.
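Boundary element discretizations produce dense linear systems, so direct solvers are the natural choice. As a minimal sequential sketch of one of the solution methods named above, here is Gaussian elimination with partial pivoting (the paper's parallel, distributed variant and its Householder alternative are not shown):

```python
def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    A is a list of row lists, b a list; inputs are copied, not modified."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for k in range(n):
        # partial pivoting: bring the largest |entry| in column k to row k
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):                 # eliminate below the pivot
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x
```

In a distributed setting, the elimination loop is the part that gets parallelized, typically by distributing rows or columns of the dense matrix across processors.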

6.
We describe the parallel performance of the pure Java CartaBlanca code on heat transfer and multiphase fluid flow problems. CartaBlanca is designed for parallel computations on partitioned unstructured meshes. It uses Java's thread facility to manage computations on each of the mesh partitions. Inter‐partition communications are handled by two compact objects for node‐by‐node communication along partition boundaries and for global reduction calculations across the entire mesh. For distributed calculations, the JavaParty package from the University of Karlsruhe is demonstrated to work with CartaBlanca. Copyright © 2004 John Wiley & Sons, Ltd.

7.
We present a new parallel algorithm for computing a maximum cardinality matching in a bipartite graph suitable for distributed memory computers. The presented algorithm is based on the Push-Relabel algorithm, which is known to be one of the fastest algorithms for the bipartite matching problem. Previous attempts at developing parallel implementations of it have focused on shared memory computers using only a limited number of processors. We first present a straightforward adaptation of these shared memory algorithms to distributed memory computers. However, this is not a viable approach as it requires too much communication. We then develop our new algorithm by modifying the previous approach through a sequence of steps with the main goal being to reduce the amount of communication and to increase load balance. The first goal is achieved by changing the algorithm so that many push and relabel operations can be performed locally between communication rounds and also by selecting augmenting paths that cross processor boundaries infrequently. To achieve good load balance, we limit the speed at which global relabelings traverse the graph. In several experiments on a large number of instances, we study weak and strong scalability of our algorithm using up to 128 processors. The algorithm can also be used to find ε-approximate matchings quickly.
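For readers unfamiliar with the problem itself, here is a minimal sequential baseline for maximum cardinality bipartite matching using augmenting paths (this is the classical Hungarian-style DFS approach, not the parallel push-relabel algorithm of the paper; names are illustrative):

```python
def max_bipartite_matching(adj, n_right):
    """Maximum cardinality matching in a bipartite graph via augmenting paths.
    adj[u] lists the right-side vertices adjacent to left vertex u."""
    match_right = [-1] * n_right          # right vertex -> matched left vertex

    def try_augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # v is free, or its current partner can be re-matched elsewhere:
            # either way the alternating path from u becomes augmenting
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    return sum(try_augment(u, set()) for u in range(len(adj)))
```

Push-relabel reaches the same matching size by local push and relabel operations on vertex labels instead of explicit path searches, which is what makes it amenable to the localized, communication-sparse execution the paper describes.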

8.
Basic techniques for measuring the performance of parallel computers with distributed memory are considered. The results obtained via the de facto standard LINPACK benchmark suite are shown to be weakly related to the efficiency of applied parallel programs. As a result, models and methods of macro-piping computations proposed by V. M. Glushkov in the late 1970s become topical again. These results are presented in the context of the modern architecture of cluster complexes.

9.
This paper describes the parallel implementation of a numerical model for the simulation of problems from fluid dynamics on distributed memory multiprocessors. The basic procedure is to apply a fully explicit upwind finite difference approximation on a staggered grid. A theoretical time complexity analysis shows that a perfect speedup is achieved asymptotically. Experimental results on the Intel Touchstone Delta System confirm the analytical performance model. © 1997 John Wiley & Sons, Ltd.
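The explicit upwind idea is easy to show in one dimension (the paper's model is a fluid dynamics code on a staggered grid; this is only a one-dimensional linear-advection sketch of the same class of scheme, with illustrative names):

```python
def upwind_advect(u, c, dx, dt, steps):
    """Advance u_t + c u_x = 0 with a first-order explicit upwind scheme
    (assumes c > 0; periodic boundary via Python's wrapping u[-1])."""
    nu = c * dt / dx                     # Courant number
    assert c > 0 and nu <= 1.0, "CFL condition violated"
    u = u[:]
    n = len(u)
    for _ in range(steps):
        # upwind: each point takes its flux difference from the left neighbor,
        # so every update needs only one "ghost" value -- this locality is
        # what makes explicit schemes parallelize so cleanly
        u = [u[i] - nu * (u[i] - u[i - 1]) for i in range(n)]
    return u
```

Because each grid point depends only on its immediate upwind neighbor, a domain-decomposed parallel version needs just one halo-cell exchange per time step, which is why explicit schemes scale close to perfectly.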

10.
Computers & Fluids, 1999, 28(4–5): 675–700
This work describes the application of a control theory-based aerodynamic shape optimization method to the problem of supersonic aircraft design. A high fidelity computational fluid dynamics (CFD) algorithm modelling the Euler equations is used to calculate the aerodynamic properties of complex three-dimensional aircraft configurations. The design process is greatly accelerated through the use of both control theory and parallel computing. Control theory is employed to derive the adjoint differential equations whose solution allows for the evaluation of design gradient information at a fraction of the computational cost required by previous design methods. The resulting problem is then implemented in parallel using a domain decomposition approach, an optimized communication schedule, and the Message Passing Interface (MPI) Standard for portability and efficiency. In our earlier studies, the serial implementation of this design method was shown to be effective for the optimization of airfoils, wings, wing–bodies, and complex aircraft configurations using both the potential equation and the Euler equations. In this work, our concern is to extend the methodologies such that the combined capabilities of these new technologies can be used routinely and efficiently in an industrial design environment. The aerodynamic optimization of a supersonic transport configuration is presented as a demonstration test case of the capability. A particular difficulty of this test case is posed by the close coupling of propulsion/airframe integration.

11.
The parallel version of the sheet metal forming semi-implicit finite element code ITAS3D has been developed using the domain decomposition method and direct solution methods at both the subdomain and interface levels. The IBM Message Passing Library is used for data communication between tasks of the parallel code. Solutions of some sheet metal forming problems on an IBM SP2 computer show that the adopted DDM algorithm with the direct solver provides acceptable parallel efficiency using a moderate number of processors. A speedup of 6.7 is achieved for a problem with 20,000 degrees of freedom on an 8-processor configuration.

12.
The problem of optimizing communications during the execution of a program on a parallel computer with distributed memory is investigated. Statements are formulated that make it possible to determine whether data broadcast and data translation can be organized. The proposed conditions are represented in a form suitable for practical application and can be used for automated parallelization of programs. This work was done within the framework of the State Program of Fundamental Studies of the Republic of Belarus (under the code name “Mathematical structures 21”) with the partial support of the Foundation for Fundamental Studies of the Republic of Belarus (grant F03-062). Translated from Kibernetika i Sistemnyi Analiz, No. 2, pp. 166–182, March–April 2006.

13.
The paper describes Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non-transposed matrix multiplication routine C = A ⋅ B, but also the transposed multiplication routines C = Aᵀ ⋅ B, C = A ⋅ Bᵀ, and C = Aᵀ ⋅ Bᵀ, for a block cyclic data distribution. The routines perform efficiently for a wide range of processor configurations and block sizes. Together, the PUMMA routines provide the same functionality as the Level 3 BLAS routine xGEMM. Details of the parallel implementation of the routines are given, and results are presented for runs on the Intel Touchstone Delta computer.
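The four variants share one kernel, differing only in whether each operand is transposed first. A tiny sequential sketch of that interface (illustrative only, standing in for the distributed PUMMA/xGEMM routines):

```python
def matmul(A, B, transA=False, transB=False):
    """Dense matrix multiply C = op(A) . op(B), where op(X) is X or its
    transpose -- the same four cases as xGEMM, sequential and unblocked."""
    transpose = lambda X: [list(col) for col in zip(*X)]
    A = transpose(A) if transA else A
    B = transpose(B) if transB else B
    # row of op(A) dotted with column of op(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]
```

In PUMMA the same four cases are realized over a block cyclic distribution, where the transposed variants require different communication patterns rather than an explicit transpose of the whole matrix.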

14.
What might seem perfectly intuitive to a young rehabilitation engineer designing assistive devices might not be intuitive at all to a disabled or elderly person experiencing a serious loss of function for the first time. When designers understand the complex nature of disabilities, they are more likely to meet the disabled users' needs. Using the results of his work in designing assistive technology, the author describes impaired people's needs and offers design strategies to accommodate them. He also presents his research to develop a wearable computer-based orientation and wayfinding aid for the severely visually impaired.

15.
A common time reference (i.e. global clock) is needed for observing the behavior of a distributed algorithm on a distributed computing system. The paper presents a pragmatic algorithm to build a global clock on any distributed system, which is optimal for homogeneous distributed memory parallel computers (DMPCs). In order to observe and sort concurrent events in common DMPCs, we need a global clock with a resolution finer than the message transfer time variance, which is better than what deterministic and fault-tolerant algorithms can obtain. Thus a statistical method is chosen as a building block to derive an original algorithm valid for any topology. Its main originality over related approaches is to cope with the problem of clock granularity in computing frequency offsets between local clocks to achieve a resolution comparable with the resolution of the physical clocks. This algorithm is particularly well suited for debugging distributed algorithms by means of trace recordings because after its acquisition step it does not induce message overhead: the perturbation induced on the execution remains as small as possible. It has been implemented on various DMPCs: Intel iPSC/2 hypercube and Paragon XP/S, Transputer-based networks and Sun networks, so we can provide some data about its behavior and performances on these DMPCs.
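The core statistical idea, estimating a remote clock's offset from round-trip timestamp samples, can be sketched as follows (this shows only the offset-estimation step under a symmetric-delay assumption; the paper's algorithm additionally handles clock granularity and frequency drift, which are not modeled here):

```python
def estimate_offset(samples):
    """Estimate a remote clock's offset from (send, remote, recv) timestamp
    triples: offset = remote reading minus the midpoint of the local
    send/recv interval. The sample with the smallest round-trip time is
    trusted most, since it suffered the least queuing delay."""
    best = min(samples, key=lambda s: s[2] - s[0])   # minimal round trip
    send, remote, recv = best
    return remote - (send + recv) / 2.0
```

With many such samples taken over time, fitting a line through the estimated offsets also yields the relative frequency (drift) between the two clocks, which is the quantity the paper computes despite coarse clock granularity.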

16.
Ray tracing is a well known technique to generate life-like images. Unfortunately, ray tracing complex scenes can require large amounts of CPU time and memory storage. Distributed memory parallel computers with large memory capacities and high processing speeds are ideal candidates to perform ray tracing. However, the computational cost of rendering pixels and the patterns of data access cannot be predicted until runtime. To parallelize such an application efficiently on distributed memory parallel computers, the issues of database distribution, dynamic data management and dynamic load balancing must be addressed. In this paper, we present a parallel implementation of a ray tracing algorithm on the Intel Delta parallel computer. In our database distribution, a small fraction of the database is duplicated on each processor, while the remaining part is evenly distributed among groups of processors. In the system, there are multiple copies of the entire database in the memory of groups of processors. Dynamic data management is achieved by an ALRU cache scheme which can exploit image coherence to reduce data movements in ray tracing consecutive pixels. We balance load among processors by distributing subimages to processors in a global fashion based on previous workload requests. The success of our implementation depends crucially on a number of parameters which are experimentally evaluated. © 1997 John Wiley & Sons, Ltd.

17.
18.
An extension of the threaded code technique has been developed. This new development has been used to convert a typical minicomputer into a threaded code machine by the addition of a simple, fast interpreter implemented in microcode. When made directly available to the programmer, the threaded code programming technique is a very convenient and efficient means of structuring programs, particularly in systems where programs are continually being modified.
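In threaded code, a program is a list of addresses of routines, and the inner interpreter simply jumps from one to the next. A minimal sketch using Python function objects as the "addresses" (illustrative; the paper's interpreter is implemented in minicomputer microcode):

```python
# Each primitive operates on a shared operand stack.
def DUP(s): s.append(s[-1])
def ADD(s): s.append(s.pop() + s.pop())
def MUL(s): s.append(s.pop() * s.pop())

def run_threaded(program, data):
    """A threaded-code 'program' is just a list of code addresses (here,
    function objects). The inner interpreter is a bare dispatch loop:
    fetch the next address, jump to it, repeat -- no decoding step."""
    stack = list(data)
    for op in program:
        op(stack)
    return stack
```

Because the dispatch loop contains no instruction decoding, moving it into microcode (as the paper does) makes the per-word overhead close to that of a native subroutine call, while the program itself stays compact and easy to modify.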

19.
We describe the evolution of a distributed shared memory (DSM) system, Mirage, and the difficulties encountered when moving the system from a Unix-based kernel on the VAX to a Unix-based kernel on personal computers. Mirage provides a network transparent form of shared memory for a loosely coupled environment. The system hides network boundaries for processes that are accessing shared memory and is upward compatible with the Unix System V Interface Definition. This paper addresses the architectural dependencies in the design of the system and evaluates performance of the implementation. The new version, MIRAGE+, performs well compared to Mirage even though eight times the amount of data is sent on each page fault because of the larger page size used in the implementation. We show that performance of systems with a large page size to network packet size ratio can be dramatically improved on conventional hardware by applying three well-known techniques: packet blasting, compression, and running at interrupt level. The measured time for a page fault in MIRAGE+ has been reduced 37 per cent by sending a page using packet blasting instead of using a handshake for each portion of the page. When compression was added to MIRAGE+, the time to fault a page across the network was further improved by 47 per cent when the page was compressed into one network packet. Our measured performance compares favorably with the amount of time it takes to fault a page from disk. Lastly, running at interrupt level may improve performance 16 per cent when faulting pages without compression.
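The compression win is easy to see in miniature: if a page compresses to less than one packet, a fault needs a single network message instead of several. A sketch of that idea using zlib (the packet size constant and function names are illustrative, not from MIRAGE+):

```python
import zlib

PACKET_SIZE = 1500   # assumed per-packet payload, roughly an Ethernet MTU

def send_page(page: bytes):
    """Compress a page before 'sending'; fall back to the raw bytes if
    compression does not help. Returns the packet list and a flag."""
    compressed = zlib.compress(page)
    if len(compressed) < len(page):
        payload, was_compressed = compressed, True
    else:
        payload, was_compressed = page, False
    packets = [payload[i:i + PACKET_SIZE]
               for i in range(0, len(payload), PACKET_SIZE)]
    return packets, was_compressed

def receive_page(packets, was_compressed):
    """Reassemble and, if needed, decompress the faulted page."""
    data = b"".join(packets)
    return zlib.decompress(data) if was_compressed else data
```

For a typical memory page full of zeros or repetitive data, the compressed payload fits in one packet, which is exactly the single-packet case where the paper measures its 47 per cent improvement.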

20.
The parallel ‘Deutschland-Modell’ and its implementation on distributed memory parallel computers using the message-passing library PARMACS 6.0 is described. Performance results on a Cray T3D are given and the problem of dynamical load imbalances is addressed.
