Similar Literature
20 similar records found (search time: 0 ms)
1.
Previous studies of scalability did not adequately account for parallel execution time, and the relationship between scalability and parallel execution time remained unclear. To address this, the relationship between latency scalability and parallel execution time is studied in depth and in full generality, and several important conclusions are derived and proved, among them: if different algorithm-machine combinations are latency-scaled from the same initial state, and the combination that executes faster also has better latency scalability, then that combination remains faster after scaling. These conclusions enrich the study of the relationship between scalability and parallel execution time and provide a theoretical basis for obtaining good scaling performance under latency scaling in parallel computing. Finally, scaling experiments on different algorithm-machine combinations further validate the conclusions.
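A rough formalization of the main claim, under assumed notation (the symbols T, S, W, and p are illustrative, not taken from the paper):

    % T_i(W, p): parallel execution time of algorithm-machine combination i
    % on problem size W with p processors; S_i: its latency scalability when
    % scaled from (W, p) to (W', p'). The claimed implication reads roughly:
    \[
      T_A(W, p) < T_B(W, p) \;\wedge\; S_A \ge S_B
      \;\Longrightarrow\; T_A(W', p') < T_B(W', p').
    \]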

2.
3.
Parallel algorithms are presented for the Fourier pseudospectral method and parts of the Chebyshev pseudospectral method. The performance of these schemes, as implemented on the NCUBE hypercube, is reported. The problem to which these methods are applied is the time integration of the incompressible Navier-Stokes equations. Despite serious communication requirements, the efficiencies are high; e.g., 92% for a 128³ mesh on 1024 processors. Benchmark timings rival those of optimized codes on supercomputers.
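For context, the 92% figure converts to a speedup via the usual definition of parallel efficiency (assuming the paper uses the standard definition):

    % E = T_1 / (p T_p); with E = 0.92 and p = 1024 processors,
    \[
      E = \frac{T_1}{p\,T_p}, \qquad S = E\,p \approx 0.92 \times 1024 \approx 942.
    \]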

4.
5.
Many computationally intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multilevel hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: it handles only nested loops of depth two, only orthogonal tiling, or only rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which a compiler can use to automatically determine the tiling parameters that minimize the execution time. We also introduce the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool for predicting the execution time of a tiled loop: it allows us to reason about how the tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of the convex iteration spaces, and show how to choose the optimal tile shape.
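As a toy illustration of the "longest path of dependent tiles" idea, here is a minimal sketch for rectangular tiles on a 2D rectangular space with nearest-neighbor forward dependences (the paper's polyhedral model is far more general):

    def longest_dependent_tile_path(n1, n2, t1, t2):
        """Number of tiles on the critical path when an n1 x n2 iteration
        space is cut into t1 x t2 rectangular tiles, assuming each tile
        depends on its left and lower neighbors (a hypothetical model,
        not the paper's polyhedral formulation)."""
        tiles1 = -(-n1 // t1)  # ceiling division: tiles along dimension 1
        tiles2 = -(-n2 // t2)  # ceiling division: tiles along dimension 2
        return tiles1 + tiles2 - 1  # wavefront length through the tile grid

    # Example: a 1000 x 1000 space with 50 x 20 tiles
    print(longest_dependent_tile_path(1000, 1000, 50, 20))  # 20 + 50 - 1 = 69

The critical path, multiplied by the time per tile, bounds the parallel execution time, which is why tile shape matters even when tile volume is fixed.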

6.
Dinning, A. Computer, 1989, 22(7): 66-77
An examination is given of how traditional synchronization methods influence the design of MIMD (multiple-instruction, multiple-data-stream) multiprocessors. The author provides an overview of MIMD multiprocessing and goes on to discuss semaphore-based implementations (Ultracomputers, Cedar, and the Sequent Balance/21000), monitor-based implementations (the HM2p), and implementations based on message passing (HEP, the BBN Butterfly, and the Transputer).

7.
Performance studies show that traditional semi-join processing methods are sometimes inefficient because of their storage and processing overhead. To remedy this, a new semi-join processing method, called one-shot semi-join execution, is proposed. This method allows parallel generation of all the semi-join projections, parallel transmission of all the projections, and parallel execution of all the semi-joins. The authors apply the method to optimize the response time of distributed query processing. A response time model is established that considers both data transmission time and local processing time. Based on this model, an efficient query processing algorithm is developed and analyzed.
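The semi-join reduction at the core of the method looks roughly like this (a minimal single semi-join sketch; the "one-shot" scheme generates and ships all such projections in parallel):

    def semi_join(r_tuples, s_tuples, r_key, s_key):
        """Return the tuples of R whose join attribute appears in S.
        Only the projection of S on its join attribute is 'shipped',
        which is what saves transmission cost in a distributed query."""
        s_projection = {t[s_key] for t in s_tuples}  # the semi-join projection
        return [t for t in r_tuples if t[r_key] in s_projection]

    R = [(1, "a"), (2, "b"), (3, "c")]
    S = [(2, "x"), (3, "y"), (4, "z")]
    print(semi_join(R, S, 0, 0))  # [(2, 'b'), (3, 'c')]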

8.
The use of parallel operations in automation, such as part fabrication and assembly or the computation and control of industrial robots, may yield a minimum production time and thereby increase productivity. However, the coupling between consecutive phases of operations results in series-parallel precedence constraints that can create unavoidable idle time intervals during the operations. To solve the problem, these idle time intervals must be broken down and reduced. An algorithm that determines a minimum-time-ordered schedule for the parallel operations is developed, based on the Program Evaluation and Review Technique and using the method of "variable" branch and bound.
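A PERT-style forward pass makes such idle intervals visible (an illustrative sketch only; the paper's "variable" branch-and-bound scheduler is not reproduced here):

    def earliest_finish_times(durations, preds):
        """durations: {task: time}; preds: {task: [predecessors]}.
        Tasks must be listed in a valid topological order (an assumption
        this toy model makes; a real scheduler would sort first)."""
        finish = {}
        for task, d in durations.items():
            start = max((finish[p] for p in preds.get(task, [])), default=0)
            finish[task] = start + d
        return finish

    # Two parallel fabrication steps feeding one assembly step:
    print(earliest_finish_times({"f1": 3, "f2": 5, "asm": 2},
                                {"asm": ["f1", "f2"]}))
    # {'f1': 3, 'f2': 5, 'asm': 7} -> the f1 resource idles for 2 time units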

9.
The role of multistage turbomachinery simulation in the development of propulsion system models is discussed. Particularly, the need for simulations with higher fidelity and faster turnaround time is highlighted. It is shown how such fast simulations can be used in engineering-oriented environments. The use of parallel processing to achieve the required turnaround times is discussed. Current work by several researchers in this area is summarized, as well as efforts at the NASA Lewis Research Center. The latter efforts are focused on implementing the average-passage turbomachinery model on MIMD, distributed memory parallel computers. Performance results are given for inviscid, single blade row and viscous, multistage applications on several parallel computers, including networked workstations.

10.
Parallel Computing, 1999, 25(13-14): 2015-2037
Parallel computers demonstrated their suitability in principle for numerical simulation during the eighties and early nineties. In particular, they were able to provide a cost-effective means of achieving high-performance computing (HPC) power. Even so, this technology had only a limited impact on industrial computing. To foster its take-up by industrial users, the European Commission launched a number of projects as part of the Esprit programme to parallelize commercial application programs; to demonstrate, document, and disseminate the benefits of parallel architectures; and to explore the potential of parallel simulation in new application areas. Large-scale technology transfer initiatives such as Europort, Europort-D, and the Preparatory Support and Transfer Programme (PST) aimed at helping industry in Europe to exploit the benefits of HPC based on parallel computing, thus increasing its competitiveness. This paper reviews the major activities and highlights their impact on industry by means of selected examples.

11.
In previous work we studied the concurrent implementation of a numerical model, CONDIFP, developed for the analysis of depth-averaged convection-diffusion problems. Initial experiments were conducted on the Intel Touchstone Delta System, using up to 512 processors and different problem sizes. As for other computation-intensive applications, the results demonstrated an asymptotic trend toward unit efficiency when the computational load dominates the communication load. This paper reports further numerical experiments, in both one and two space dimensions with various choices of initial and boundary conditions, carried out on the Intel Paragon XP/S Model L38 to illustrate the parallel solver's versatility and reliability.

12.
Parallel Computing, 1997, 23(14): 2135-2142
This special issue on 'regional weather models' complements the October 1995 special issue on 'climate and weather modeling', which focused on global models. In this introduction we review the similarities and differences between regional and global atmospheric models. Next, the structure of regional models is described, and we consider how the basic algorithms applied in these models influence the parallelization strategy. Finally, we give a brief overview of the eight articles in this issue and discuss some remaining challenges in adapting regional weather models to parallel computers.

13.
One of the essential problems in parallel computing is: can SIMD machines handle asynchronous problems? This is a difficult, unsolved problem because of the mismatch between asynchronous problems and SIMD architectures. We propose a solution that lets SIMD machines handle general asynchronous problems. Our approach is to implement a runtime support system that can run MIMD-like software on SIMD hardware. The runtime support system, named the P kernel, is thread-based. The thread-based model has two major advantages. First, for application problems with irregular and/or unpredictable features, automatic scheduling can move threads from overloaded processors to underloaded processors. Second, and more importantly, the granularity of threads can be controlled to reduce system overhead. The P kernel also handles bookkeeping and message management, and it makes these low-level tasks transparent to users. Substantial performance has been obtained on the MasPar MP-1.

14.
Various proposals for networks of large numbers of processors are reviewed. Bottleneck problems arise in these networks with the flow of data between processors. Communication problems which can arise in practical situations are discussed and techniques for reducing bottlenecks are developed. Some simulation results are given for the binary n-cube.

15.
Assembly of large genomes from tens of millions of short genomic fragments is computationally demanding, requiring hundreds of gigabytes of memory and tens of thousands of CPU hours. The advent of high-throughput sequencing technologies, new gene-enrichment sequencing strategies, and collective sequencing of environmental samples further exacerbates this situation. In this paper, we present the first massively parallel genome assembly framework. The unique features of our approach include space-efficient, on-demand algorithms that consume only linear space, and strategies to reduce the number of expensive pairwise sequence alignments while maintaining assembly quality. Developed as part of the ongoing efforts in maize genome sequencing, our assembly framework was applied to genomic data containing a mixture of gene-enriched and random shotgun sequences. We report the partitioning of more than 1.6 million fragments, over 1.25 billion nucleotides in total, into genomic islands in under 2 h on 1024 processors of an IBM BlueGene/L supercomputer. We also demonstrate the effectiveness of the proposed approach for traditional whole-genome shotgun sequencing and for the assembly of environmental sequences.

16.
This paper discusses the implementation of methods for the approximate calculation of multiple integrals on a SIMD parallel computer. Adaptive methods using polynomial integrating basic rules, as well as Monte Carlo and number-theoretic basic rules, are considered, with particular reference to implementation on the ICL DAP computer. Test results are given which compare serial and parallel versions of the same method.

17.
The principal current development in computing is the advent of the parallel computer in all its various forms, for example pipelined vector computers (the CRAY-1 and CYBER 205) and arrays of processors (the ICL DAP). This paper defines a two-parameter characterization of such computers that measures both the maximum performance and the amount of hardware parallelism. This allows a rational comparison of the performance of alternative algorithms on widely differing computers. As an example, we consider the choice of the best algorithm for the solution of tridiagonal systems of equations.
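The characterization described is presumably of the well-known (r_infinity, n_1/2) kind; as a hedged sketch (the paper's exact symbols are an assumption):

    % The time for a vector operation of length n is modeled as
    \[
      T(n) = \frac{n + n_{1/2}}{r_\infty},
    \]
    % where r_\infty is the maximum (asymptotic) performance and n_{1/2},
    % the vector length at which half of r_\infty is achieved, measures
    % the amount of hardware parallelism.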

18.
19.
The parallel solution of large sets of equations derived from finite element or finite difference methods often involves the use of domain decomposition methods. This paper is concerned with the related problem of mapping a subdomain partitioning to a processor topology in such a way that the communication overheads during solution are minimized; in this manner, processors far from each other in the processor network are not required to communicate directly. This is achieved by solving an optimization problem with the simulated annealing method. The techniques described here can be applied more generally to distribute tasks (as opposed to subdomains) among processors.
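A generic simulated-annealing skeleton for this kind of mapping problem might look as follows (illustrative only; the paper's cost model and move set are not specified here, and the cost function below is made up):

    import math
    import random

    def anneal(mapping, cost, neighbor, t0=1.0, alpha=0.95, steps=2000):
        """Generic simulated annealing: cost(m) -> float to minimize,
        neighbor(m) -> perturbed copy of an assignment m."""
        current, best = mapping, mapping
        t = t0
        for _ in range(steps):
            cand = neighbor(current)
            delta = cost(cand) - cost(current)
            # Accept improvements always, worsenings with Boltzmann probability
            if delta < 0 or random.random() < math.exp(-delta / t):
                current = cand
                if cost(current) < cost(best):
                    best = current
            t *= alpha  # geometric cooling schedule
        return best

    # Example: map 6 subdomains onto 3 processors in a line; the cost is the
    # total processor distance over communicating subdomain pairs.
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
    def comm_cost(m):
        return sum(abs(m[a] - m[b]) for a, b in edges)
    def perturb(m):
        m = list(m)
        m[random.randrange(len(m))] = random.randrange(3)
        return tuple(m)
    print(anneal((0, 1, 2, 0, 1, 2), comm_cost, perturb))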

20.
The authors present a scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh- and hypercube-connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm-architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube than on a mesh, despite the fact that large-scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures, such as SIMD hypercube and mesh architectures and shared-memory architectures.
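The abstract's definition matches the standard isoefficiency formulation, sketched here (exact symbols are an assumption):

    % With problem size W (serial work) and total parallel overhead
    % T_o(W, p) across p processors, efficiency is
    \[
      E = \frac{W}{W + T_o(W, p)} = \frac{1}{1 + T_o(W, p)/W},
    \]
    % so keeping E fixed requires the problem size to grow as
    \[
      W = \frac{E}{1 - E}\, T_o(W, p),
    \]
    % and the solution W(p) of this relation is the isoefficiency function:
    % the slower it grows, the more scalable the combination.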
