首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
The rapid rise of OpenMP as the preferred parallel programming paradigm for small‐to‐medium scale parallelism could slow unless OpenMP can show capabilities for becoming the model‐of‐choice for large scale high‐performance parallel computing in the coming decade. The main stumbling block for the adaptation of OpenMP to distributed shared memory (DSM) machines, which are based on architectures like cc‐NUMA, stems from the lack of capabilities for data placement among processors and threads for achieving data locality. The absence of such a mechanism causes remote memory accesses and inefficient cache memory use, both of which lead to poor performance. This paper presents a simple software programming approach called copy‐inside–copy‐back (CC) that exploits the data privatization mechanism of OpenMP for data placement and replacement. This technique enables one to distribute data manually without taking away control and flexibility from the programmer and is thus an alternative to the automat and implicit approaches. Moreover, the CC approach improves on the OpenMP‐SPMD style of programming that makes the development process of an OpenMP application more structured and simpler. The CC technique was tested and analyzed using the NAS Parallel Benchmarks on SGI Origin 2000 multiprocessor machines. This study shows that OpenMP improves performance of coarse‐grained parallelism, although a fast copy mechanism is essential. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

2.
Distributed shared memory (DSM) systems have become popular as a means of utilizing clusters of computers for solving large applications. We have developed a high-performance DSM, Strings. In addition, to improve the performance of our DSM, a memory hierarchy simulator has been developed that allows us to compare various techniques very quickly and with much less effort. This paper describes our simulator, DSMSim. We show that the simulator's performance closely matches the real system and demonstrate potential performance gains of up to 60% after adding optimization features to the simulator. The simulator also accepts the same code as the software distributed shared memory.  相似文献   

3.
Explains why the Force parallel programming language has been easily portable between eight different shared memory multiprocessors. The authors show how a two-layer macro processor allows them to hide machine dependencies and to build machine-independent high-level language constructs. The importance of packaging low-level synchronization operations is demonstrated by a proof of mutual exclusion for asynchronous variable operations. The Force constructs enable one to write portable parallel programs largely independent of the number of processes executing them  相似文献   

4.
Threads provides a mechanism for simulating the execution of parallel algorithms on a simplified model of a shared-memory multiprocessor. The algorithms can be expressed in a high-level block-structured language, which supports multiple threads of execution within a common body of program code. Results show an ability to achieve good speedup for small problems using algorithms derived by simple modifications of sequential algorithms. As well, a sibling thread synchronisation feature provides the basis for the synchronous execution of threads. k-parallel algorithms tailored to the machine size and implemented as synchronously executing iterations, can provide near linear speedup as the problem size is increased. The techniques described in this paper seem to promise an effective synchronous execution mode for shared-memory MIMD architectures.  相似文献   

5.
Proposed hardware optimizations to CC-NUMA machines-shared memory multiprocessors that use cache consistency protocols-can shorten the time processors lose because of cache misses and invalidations. The authors look at cost-performance trade-offs for each  相似文献   

6.
The lack of high-level languages and good compilers for parallel machines hinders their widespread acceptance and use. Programmers must address issues such as process decomposition, synchronization, and load balancing. We have developed a parallelizing compiler that, given a sequential program and a memory layout of its data, performs process decomposition while balancing parallelism against locality of reference. A process decomposition is obtained by specializing the program for each processor to the data that resides on that processor. If this analysis fails, the compiler falls back to a simple but inefficient scheme called run-time resolution. Each process's role in the computation is determined by examining the data required for execution at run-time. Thus, our approach to process decomposition is data-driven rather than program-driven. We discuss several message optimizations that address the issues of overhead and synchronization in message transmission. Accumulation reorganizes the computation of a commutative and associative operator to reduce message traffic. Pipelining sends a value as close to its computation as possible to increase parallelism. Vectorization of messages combines messages with the same source and the same destination to reduce overhead. Our results from experiments in parallelizing SIMPLE, a large hydrodynamics benchmark, for the Intel iPSC/2, show a speedup within 60% to 70% of handwritten code  相似文献   

7.
We present a randomized parallel list ranking algorithm for distributed memory multiprocessors, using a BSP type model. We first describe a simple version which requires, with high probability, log(3p)+log ln(n)=Õ(logp+log logn) communication rounds (h-relations withh=Õ(n/p)) andÕ(n/p)) local computation. We then outline an improved version that requires high probability, onlyr?(4k+6) log(2/3p)+8=Õ(k logp) communication rounds wherek=min{i?0 |ln(i+1)n?(2/3p)2i+1}. Notekn) is an extremely small number. Forn andp?4, the value ofk is at most 2. Hence, for a given number of processors,p, the number of communication rounds required is, for all practical purposes, independent ofn. Forn?1, 500,000 and 4?p?2048, the number of communication rounds in our algorithm is bounded, with high probability, by 78, but the actual number of communication rounds observed so far is 25 in the worst case. Forn?10010100 and 4?p?2048, the number of communication rounds in our algorithm is bounded, with high probability, by 118; and we conjecture that the actual number of communication rounds required will not exceed 50. Our algorithm has a considerably smaller member of communication rounds than the list ranking algorithm used in Reid-Miller’s empirical study of parallel list ranking on the Cray C-90.(1) To our knowledge, Reid-Miller’s algorithm(1) was the fastest list ranking implementation so far. Therefore, we expect that our result will have considerable practical relevance.  相似文献   

8.
In this paper, we present a scalable three-dimensional hybrid parallel Delaunay image-to-mesh conversion algorithm (PDR.PODM) for distributed shared memory architectures. PDR.PODM is able to explore parallelism early in the mesh generation process thanks to the aggressive speculative approach employed by the Parallel Optimistic Delaunay Mesh generation algorithm (PODM). In addition, it decreases the communication overhead and improves data locality by making use of a data partitioning scheme offered by the Parallel Delaunay Refinement algorithm (PDR). PDR.PODM supports fully functional volume grading by creating elements with varying size. Small elements are created near boundary or inside the critical regions in order to capture the fine features while big elements are created in the rest of the mesh. We tested PDR.PODM on Blacklight, a distributed shared memory (DSM) machine in Pittsburgh Supercomputing Center. For the uniform mesh generation, we observed a weak scaling speedup of 163.8 and above for up to 256 cores as opposed to PODM whose weak scaling speedup is only 44.7 on 256 cores. PDR.PODM scales well on uniform refinement cases running on DSM supercomputers. The end result is that PDR.PODM can generate 18 million elements per second as opposed to 14 million per second in our earlier work. The varying size version sharply reduces the number of elements compared to the uniform version and thus reduces the time to generate the mesh while keeping the same fidelity.  相似文献   

9.
We present a parallel algorithm for distributed memory multiprocessors, which is based on generalized marching (GM), one of the fastest methods in the class of fast Poisson solvers. The GM algorithm is not suited for any but very coarse-grain parallel processing. The main difficulty with parallelization is that the number of independent processes and the amount of work in each process change exponentially and in inverse proportion of each other. To improve parallelism, the matrices involved in GM are diagonalized performing multiple FFTs. In this way, independent processes extending across all the algorithm are obtained. The parallel GM has been tested on an Ncube/10 and a Symult S2010, running the Express communication system. A performance evaluation has been carried out using a scaled efficiency model and some classical parameters.  相似文献   

10.
OpenMP is an emerging industry standard for shared memory architectures. While OpenMP has advantages on its ease of use and incremental programming, message passing is today still the most widely-used programming model for distributed memory architectures. How to effectively extend OpenMP to distributed memory architectures has been a hot spot. This paper proposes an OpenMP system, called KLCoMP, for distributed memory architectures. Based on the partially replicating shared arrays memory model, we propose ...  相似文献   

11.
The problem of recovering from processor transient faults in shared memory multiprocessor systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented  相似文献   

12.
We describe a single-copy mechanism which enables an efficient message passing among UNIX processes on shared memory multiprocessors. A special version of PVMe, IBM's AIX implementation of the PVM message passing programming model, has been built based on this approach. Some preliminary results here reported show the clear advantage of the single-copy with respect to more conventional schemes.  相似文献   

13.
Dandamudi  S.P. 《Computer》1997,30(3):82-89
The performance of parallel processing systems, especially large systems, is sensitive to various types of overhead and contention. Performance consequences may be serious when contention occurs for hardware resources such as memory or the interconnection network. Contention can also occur for software resources such as critical data structures maintained by either system or application software. A run queue is one such critical data structure that can affect overall system performance. There are two basic types of run queues, centralized and distributed. Both present performance problems. There are also several techniques to mitigate their drawbacks, but none is completely satisfactory. Instead, the author proposes a different run queue organization, a hierarchical organization that inherits the best features of the centralized and the distributed queue organizations while avoiding their pitfalls. Thus, the hierarchical organization is suitable for building large-scale multiprocessor systems  相似文献   

14.
The design, implementation, and performance of heterogeneous distributed shared memory (HDSM) are studied. A prototype HDSM system that integrates very different types of hosts has been developed, and a number of applications of this system are reported. Experience shows that despite a number of difficulties in data conversion, HDSM is implementable with minimal loss in functional and performance transparency when compared to homogeneous DSM systems  相似文献   

15.
《Parallel Computing》1997,23(7):915-925
Direct volume rendering algorithms are too computationally expensive to offer interactive frame rates when rendering large 3D medical datasets on standard workstations. This article presents an image space parallelization of an image order volume rendering algorithm aimed at shared memory multiprocessors. This parallel implementation of direct volume rendering can significantly speed up rendering times and visualize 3D datasets with speeds of several frames per second. The algorithm was implemented and evaluated on Convex SPP Exemplar and SGI Challenge multiprocessors.  相似文献   

16.
This paper proposes a new memory allocation method for shared memory multiprocessors with large virtual address spaces. An evaluation of its performance is also presented. For effective use of shared memory multiprocessors, it is important that no processor's execution is blocked. If several processors simultaneously access a shared variable, their processes are blocked and access to the variable is serialized. Thus, frequent access to shared variables reduces the parallelism. In particular, the parallelism is significantly reduced when a special shared variable – the ‘allocation pointer’ – is frequently accessed in the dynamic object allocation by an application program. In this paper, we propose a new method for allocating physical memory pages where the allocation pointer is monotonically increased in the virtual address space in contrast to the conventional method. This allows the critical sections for access to the allocation pointer to be executed effectively and atomically by using the fetch-and-add primitive. Our method improves the application program's parallelism by access to the allocation pointer with considerably short blocking time to the process. © 1997 John Wiley & Sons, Ltd.  相似文献   

17.
We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single-source and multiple-source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multidimensional arrays, and emulation of butterfly networks. We also show how the processor configuration, the data aggregation, and the encoding of the address space affect the performance for two important basic computations: the multiplication of arbitrarily shaped matrices and the Fast Fourier Transform. We also give an example of the performance behavior for local matrix operations for a processor with a single path to local memory and a set of processor registers. The analytic models are verified by measurements on the Connection Machine Model CM-2.  相似文献   

18.
Algorithms implementing distributed shared memory   总被引:1,自引:0,他引:1  
Stumm  M. Zhou  S. 《Computer》1990,23(5):54-64
Four basic algorithms for implementing distributed shared memory are compared. Conceptually, these algorithms extend local virtual address spaces to span multiple hosts connected by a local area network, and some of them can easily be integrated with the hosts' virtual memory systems. The merits of distributed shared memory and the assumptions made with respect to the environment in which the shared memory algorithms are executed are described. The algorithms are then described, and a comparative analysis of their performance in relation to application-level access behavior is presented. It is shown that the correct choice of algorithm is determined largely by the memory access behavior of the applications. Two particularly interesting extensions of the basic algorithms are described, and some limitations of distributed shared memory are noted  相似文献   

19.
When using a shared memory multiprocessor, the programmer faces the issue of selecting the portable programming model which will provide the best performance. Even if they restricts their choice to the standard programming environments (MPI and OpenMP), they have to select a programming approach among MPI and the variety of OpenMP programming styles. To help the programmer in their decision, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmark (CG, MG, FT, LU), two dataset sizes (A and B), and two shared memory multiprocessors (IBM SP3 NightHawk II, SGI Origin 3800). We have developed the first SPMD OpenMP version of the NAS benchmark and gathered other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared with MPI for a large set of experimental conditions. Not surprisingly, the two best OpenMP versions are those requiring the strongest programming effort. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

20.
One data-independent and one data-dependent algorithm for the computation of image histograms on parallel computers are presented, analysed and implemented on the Connection Machine system CM-2. The data-dependent algorithm has a lower requirement on communication bandwidth by only transferring bins with a non-zero count. Both algorithms perform all-to-all reduction, which is implemented through a sequence of exchanges as defined by a butterfly network. The two algorithms are compared based on predicted and actual performance on the Connection Machine CM-2. With few pixels per processor the data-dependent algorithm requires in the order of √B data transfers for B bins compared to B data transfers for the data-independent algorithm. As the number of pixels per processor grows the advantage of the data-dependent algorithm decreases. The advantage of the data-dependent algorithm increases with the number of bins of the histogram.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号