首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
We study the problem of exploiting parallelism from search-based AI systems on share-nothing platforms, i.e., platforms where different machines do not have access to any form of shared memory. We propose a novel environment representation technique, called stack-splitting, which is a modification of the well-known stack-copying technique, that enables the efficient exploitation of or-parallelism from AI systems on distributed-memory machines. Stack-splitting, coupled with appropriate scheduling strategies, leads to reduced communication during distributed execution and effective distribution of larger grain-sized work to processors. The novel technique can also be implemented on shared-memory machines and it is quite competitive. In this paper we present a distributed implementation of or-parallelism based on stack-splitting including results. Our results suggest that stack-splitting is an effective technique for obtaining high performance parallel AI systems on shared-memory as well as distributed-memory multiprocessors.  相似文献   

2.
The VMMP (virtual machine for multiprocessors) software package is presented. It provides a coherent set of services for parallel application programs running on diverse multiple input multiple data (MIMD) multiprocessors, including shared memory and message passing multiprocessors. The communication, synchronization, and data distribution requirements of parallel algorithms are analyzed. Related languages and tools are described. VMMP services are identified. VMMP implementation, coding and portability are discussed. Some measurements of the performance of VROMP application programs and VMMP overhead are given. Several hints for improving the performance of application programs are described  相似文献   

3.
Irregular parallel algorithms pose a significant challenge for achieving high performance because of the difficulty predicting memory access patterns or execution paths. Within an irregular application, fine-grained synchronization is one technique for managing the coordination of work; but in practice the actual performance for irregular problems depends on the input, the access pattern to shared data structures, the relative speed of processors, and the hardware support of synchronization primitives. In this paper, we focus on lock-free and mutual exclusion protocols for handling fine-grained synchronization. Mutual exclusion and lock-free protocols have received a fair amount of attention in coordinating accesses to shared data structures from concurrent processes. Mutual exclusion offers a simple programming abstraction, while lock-free data structures provide better fault tolerance and eliminate problems associated with critical sections such as priority inversion and deadlock. These synchronization protocols, however, are seldom used in parallel algorithm designs, especially for algorithms under the SPMD paradigm, as their implementations are highly hardware dependent and their costs are hard to characterize. Using graph-theoretic algorithms for illustrative purposes, we show experimental results on two shared-memory multiprocessors, the IBM pSeries 570 and the Sun Enterprise 4500, that irregular parallel algorithms with efficient fine-grained synchronization may yield good performance.  相似文献   

4.
《Parallel Computing》1997,23(6):637-647
This paper considers the parallel solution of finite element equations using the preconditioned conjugate gradient method on shared memory multiprocessors. The preconditioner used is an incomplete LU factorization of the stiffness matrix. We have designed a new method for implementing parallel forward and backward substitutions which requires the L and U factors to have an independent column structure, which we refer to as I2LU. The algorithm has been implemented on an Encore Multimax and parallel efficiencies up to 80% of the maximum possible theoretical value have been obtained using around 6–8 processors. These figures are shown to be comparable with the efficiencies of an implementation of a similar row-oriented approach, based on level scheduling, tested with the same problems.  相似文献   

5.
A portable parallelization of the Cooley–Tukey FFT algorithm for MIMD multiprocessors is presented. The implementation uses the virtual machine for multiprocessors (VMMP) and PVM portable software packages. Since VMMP provides the same set of services on all target machines, a single version of the parallel FFT code was used for shared memory (25-processor Sequent Symmetry), shared bus (MOS-running distributed UNIX) and distributed memory multiprocessor (transputer network and 64-processor IBM SP2). It is accompanied with detailed performance analysis of the implementations. The algorithm achieved high efficiencies on all target machines. The analysis indicates that most overheads are caused by the target architecture and not by VMMP or PVM inefficiencies. The portability analysis of the FFT provides several important insights. On the message passing architecture, the parallel FFT algorithm can obtain linearly increasing speedup with respect to the number of processors with only a moderate increase in the problem size. The parallel FFT can be executed by any number of processors, but generally the number of processors is much less than the length of the input data. The results indicate that the parallel FFT is portable: it achieves very good speedups on either a shared memory multiprocessor with high memory bandwidth or on a message passing multiprocessor without any change in the programs. © 1998 John Wiley & Sons, Ltd.  相似文献   

6.
Concurrent processes can be used both for programming computation and for programming storage. Previous implementations of Flat GHC, however, have been tuned for computation-intensive programs, and perform poorly for storage-intensive programs (such as programs implementing reconfigurable data structures using processes and streams) and demand-driven programs. This paper proposes an optimization technique for programs in which processes are almost always suspended. The technique compiles unification for data transfer into message passing. Instead of reducing the number of process switching operations, the technique optimizes the cost of each process switching operation and reduces the number ofcons operations for data buffering. The technique is based on a mode system which is powerful enough to analyze bidirectional communication and streams of streams. The mode system is based on mode constraints imposed by individual clauses rather than on global dataflow analysis, enabling separate analysis of program modules. The introduction of a mode system into Flat GHC effectively subsets Flat GHC; the resulting language is calledModed Flat GHC. Moded Flat GHC programs enjoy two important properties under certain conditions: (1) reduction of a goal clause retains the well-modedness of the clause, and (2) when execution terminates, all the variables in an initial goal clause will be bound to ground terms. Practically, the computational complexity of all-at-once mode analysis can be made almost linear with respect to the sizen of the program and the complexity of the data structures used, and the complexity of separate analysis is higher only by O(logn) times. Mode analysis provides useful information for debugging as well. Benchmark results show that the proposed technique well improves the performance of storage-intensive programs and demand-driven programs compared with a conventional native-code implementation. It also improves the performance of some computation-intensive programs. We expect that the proposed technique will expand the application areas of concurrent logic languages.  相似文献   

7.
We present three parallel sorting algorithms suitable for implementation on tightly coupled multiprocessors and compare their performance on the Denelcor HEP. Two of the algorithms implemented—parallel Shellsort and quickmerge—are new. Shellsort is amenable to parallelization; however, since Shellsort has higher complexity than quicksort, parallel Shellsort is inferior to parallel quicksort. A second new parallel algorithm, called quickmerge, is based upon both quicksort and mergesort. Our implementation of quickmerge achieves significantly higher speedup than occur implementation of parallel quicksort.  相似文献   

8.
Shared memory multiprocessors offer a relatively simple programming model and are suitable for a wide variety of parallel applications. Unfortunately, shared memory multiprocessors are not as scalable as distributed memory multiprocessors owing to memory and switch contentions that can result in the formation of hot spots. Spinning on synchronization variables appears to be the main culprit behind the formation of hot spots, affecting system scalability adversely. The purpose of this paper is to address the issue of performing barrier synchronization efficiently in large-scale shared memory multiprocessors. We propose a very simple design for a hardware barrier synchronizer that has the characteristics of what one would call an ideal barrier synchronizer. In particular, the proposed barrier synchronizer allows fast barrier synchronization without injecting spin traffic to create hot spots and can be reused as soon as it has completed a barrier synchronization. We also show that by augmenting this barrier synchronizer with a few gates, it can be used to perform dynamic barrier synchronization, where neither the number, nor the exact identity of processors participating in the barrier is known a priori. We will also show that a low-latency barrier synchronizer can be used not only for high-speed barrier synchronization but also, very profitably, for implementing software combining (allowing distributed hot spot accessing), for data and producer-consumer type synchronization and for the implementation of a variety of other useful applications. A high-speed barrier synchronizer can also be used to implement highly concurrent data structures and will also allow a MIMD (Multiple Instruction streams, Multiple Data streams) system to be effectively operated in a SIMD (Single Instruction stream, Multiple Data streams)-style mode, giving rise to a number of potential advantages. We use simulations to confirm that our proposed synchronizers and their applications outperform the existing barrier synchronization schemes.  相似文献   

9.
A distributed counter is a concurrent object which provides a fetch-and-increment operation on a shared value. On the basis of a distributed counter, one can implement various fundamental data structures, such as queues or stacks. We present the counting pyramid, an efficient implementation of a distributed counter in a message passing system, which is based on software combining. The counting pyramid adapts gracefully to changing access patterns, guarantees linearizability, and offers more general fetch-and-Φ operations. We analyze the expected performance of the counting pyramid, using queueing theory and simulation. We show that the latency of the counting pyramid is asymptotically optimal.  相似文献   

10.
In this paper we describe the parallelization of a medium-size symbolic fixed-point computation, CONSAT. CONSAT is a constraint satisfaction system that computes globally consistent solutions. The parallel version of CONSAT is implemented using abstractions from a parallel programming toolbox we developed. The toolbox is intended for novice parallel programmers, and programs based on abstractions from this toolbox may be executed on both uniprocessors and shared-memory multiprocessors without modifications. We explain how parallelism is introduced, and how concurrent accesses to shared data structures are handled. We will also describe the performance of CONSAT on sample inputs.  相似文献   

11.
Low-latency communication over ATM networks using active messages   总被引:1,自引:0,他引:1  
von Eicken  T. Basu  A. Buch  V. 《Micro, IEEE》1995,15(1):46-53
Today's communication architectures for parallel machines reduce communication overheads and latencies by over an order of magnitude. However, carrying over these techniques to workstation clusters connected by an ATM network presents major design challenges. We discuss the differences in communication characteristics between workstation clusters built from standard hardware and software components and state-of-the-art multiprocessors, and then evaluate a prototype implementation of an active message communication layer. Application round-trip latencies of about 50 microseconds for small messages roughly compare to a similar implementation on the Thinking Machines CM-5 multiprocessor  相似文献   

12.
Particle swarm optimization (PSO) is an evolutionary heuristics-based method used for continuous function optimization. PSO is stochastic yet very robust. Nevertheless, real-world optimizations require a high computational effort to converge to a good solution for the problem. In general, parallel PSO implementations provide good performance. However, this depends heavily on the parallelization strategy used as well as the number and characteristics of the exploited processors. In this paper, we propose a cooperative strategy, which consists of subdividing an optimization problem into many simpler sub-problems. Each of these focuses on a distinct subset of the problem dimensions. The optimization work for all the selected sub-problems is done in parallel. We map the work onto four different parallel high-performance multiprocessors, which are based on multi- and many-core architectures. The performance of the strategy thus implemented is evaluated for four well known benchmark functions with high-dimension and different complexity. The obtained speedups are compared to that yielded by a serial PSO implementation.  相似文献   

13.
In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedbacks, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our technique using a set of standard numerical applications and running them on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems.  相似文献   

14.
Filtering algorithms are well accepted as a means of speeding up the solution of the consistent labeling problem (CLP). Despite the fact that path consistency does a better job of filtering than arc consistency, AC is still the preferred technique because it has a much lower time complexity. We are implementing parallel path consistency algorithms on multiprocessors and comparing their performance to the best sequential and parallel arc consistency algorithms.(1,2) (See also work by Kerethoet al. (3) and Kasif(4)) Preliminary work has shown linear performance increases for parallelized path consistency and also shown that in many cases performance is significantly better than the theoretical worst case. These two results lead us to believe that parallel path consistency may be a superior filtering technique. Finally, we have implemented path consistency as an outer product computation and have obtained good results (e.g., linear speedup on a 64K-node Connection Machine 2).  相似文献   

15.
Robert J. McGlinn 《Software》1989,19(10):917-930
The Cook and Kim algorithm is a well known method for sorting presorted lists. This paper presents observations based on an implementation of the algorithm on a single processor. An extension of the algorithm to a tightly coupled multiprocessor will also be presented. The performance of the parallel version of the algorithm on presorted lists will be compared to that of a heavily used parallel sort algorithm for tightly coupled multiprocessors, Parallel Quicksort.  相似文献   

16.
The authors examine the design, implementation, and experimental analysis of parallel priority queues for device and network simulation. They consider: 1) distributed splay trees using MPI; 2) concurrent heaps using shared memory atomic locks; and 3) a new, more general concurrent data structure based on distributed sorted lists, designed to provide dynamically balanced work allocation and efficient use of shared memory resources. We evaluate performance for all three data structures on a Cray-TSESOO system at KFA-Julich. Our comparisons are based on simulations of single buffers and a 64×64 packet switch which supports multicasting. In all implementations, PEs monitor traffic at their preassigned input/output ports, while priority queue elements are distributed across the Cray-TBE virtual shared memory. Our experiments with up to 60000 packets and two to 64 PEs indicate that concurrent priority queues perform much better than distributed ones. Both concurrent implementations have comparable performance, while our new data structure uses less memory and has been further optimized. We also consider parallel simulation for symmetric networks by sorting integer conflict functions and implementing a packet indexing scheme. The optimized message passing network simulator can process ~500 K packet moves in one second, with an efficiency that exceeds ~50 percent for a few thousand packets on the Cray-T3E with 32 PEs. All developed data structures form a parallel library. Although our concurrent implementations use the Cray-TSE ShMem library, portability can be derived from Open-MP or MP1-2 standard libraries, which will provide support for one-way communication and shared memory lock mechanisms  相似文献   

17.
We describe a single-copy mechanism which enables an efficient message passing among UNIX processes on shared memory multiprocessors. A special version of PVMe, IBM's AIX implementation of the PVM message passing programming model, has been built based on this approach. Some preliminary results here reported show the clear advantage of the single-copy with respect to more conventional schemes.  相似文献   

18.
In this paper, we introduce an analytical technique based on queueing networks and Petri nets for making a performance analysis of dataflow computations when executed on the Manchester machine. This technique is also applicable for the analysis of parallel computations on multiprocessors. We characterize the parallelism in dataflow computations through a four-parameter characterization, namely, the minimum parallelism, the maximum parallelism, the average parallelism and the variance in parallelism. We observe through detailed investigation of our analytical models that the average parallelism is a good characterization of the dataflow computations only as long as the variance in parallelism is small. However, significant difference in performance measures will result when the variance in parallelism is comparable to or higher than the average parallelism.  相似文献   

19.
20.
Many-task computing is a well-established paradigm for implementing loosely coupled applications (tasks) on large-scale computing systems. However, few of the model’s existing implementations provide efficient, low-latency support for executing tasks that are tightly coupled multiprocessing applications. Thus, a vast array of parallel applications cannot readily be used effectively within many-task workloads. In this work, we present JETS, a middleware component that provides high performance support for many-parallel-task computing (MPTC). JETS is based on a highly concurrent approach to parallel task dispatch and on new capabilities now available in the MPICH2 MPI implementation and the ZeptoOS Linux operating system. JETS represents an advance over the few known examples of multilevel many-parallel-task scheduling systems: it more efficiently schedules and launches many short-duration parallel application invocations; it overcomes the challenges of coupling the user processes of each multiprocessing application invocation via the messaging fabric; and it concurrently manages many application executions in various stages. We report here on the JETS architecture and its performance on both synthetic benchmarks and an MPTC application in molecular dynamics.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号