Similar Documents
20 similar documents found.
1.
For pt. I, see ibid., pp. 170-80. In pt. I, we presented a binding environment for the AND and OR parallel execution of logic programs. This environment was instrumental in rendering a compiler for the AND and OR parallel execution of logic programs machine independent. In this paper, we describe a compiler based on the Reduce-OR process model (ROPM) for the parallel execution of Prolog programs, and report its performance on five parallel machines: the Encore Multimax, the Sequent Symmetry, the NCUBE 2, the Intel i860 hypercube, and a network of Sun workstations. The compiler is part of a machine-independent parallel Prolog development system built on top of a run-time environment for parallel programming called the Chare kernel, and runs unchanged on these multiprocessors. In keeping with the objectives behind the ROPM, the compiler supports both OR and independent AND parallelism in Prolog programs and is suitable for execution on both shared and nonshared memory machines. We discuss the performance of the Prolog compiler in some detail and describe how grain size can be used to deliver performance that is within 10% of the underlying sequential Prolog compiler on one processor and scales linearly with increasing numbers of processors on problems exhibiting sufficient parallelism. The loose coupling between the parallel and sequential components makes it possible to use the best available sequential compiler as the sequential component of our compiler.
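The grain-size principle mentioned above can be illustrated with a small hypothetical sketch (this is not the ROPM compiler itself; the fib example, the cost estimate, and the threshold are all assumptions made for illustration): goals whose estimated cost falls below a threshold fall through to plain sequential code, and only coarse-grained goals are spawned as parallel tasks, which is how near-sequential single-processor performance is usually obtained. Python threads are used only to keep the sketch self-contained; they do not give real speedup under the CPython GIL.

# Hypothetical sketch of grain-size control: run fine-grained goals
# sequentially, spawn only coarse-grained goals as parallel tasks.
import threading

GRAIN_THRESHOLD = 22            # assumed cutoff on the cost estimate

def estimated_cost(n):
    # Stand-in for whatever cost estimate a real compiler would attach to a goal.
    return n

def fib_seq(n):
    return n if n < 2 else fib_seq(n - 1) + fib_seq(n - 2)

def fib_par(n):
    if estimated_cost(n) < GRAIN_THRESHOLD:
        return fib_seq(n)                      # fine grain: stay sequential
    result = {}
    t = threading.Thread(target=lambda: result.setdefault("left", fib_par(n - 1)))
    t.start()                                  # coarse grain: spawn one subgoal
    right = fib_par(n - 2)                     # evaluate the other locally
    t.join()
    return result["left"] + right

if __name__ == "__main__":
    assert fib_par(28) == fib_seq(28)
    print(fib_par(28))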

2.
We describe a binding environment for the AND and OR parallel execution of logic programs that is suitable for both shared and nonshared memory multiprocessors. The binding environment was designed with a view to rendering a compiler that uses it machine independent. The binding environment is similar to the closed environments proposed by J. Conery; however, unlike Conery's scheme, it supports OR and independent AND parallelism on both types of machines. The term representation, the unification algorithms, and the join algorithms for parallel AND branches are presented in this paper. We also detail the differences between our scheme and Conery's. A compiler based on this binding environment has been implemented on a platform for machine-independent parallel programming called the Chare Kernel.
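As a rough illustration of the two operations the abstract names (the tuple term encoding and the variable convention below are assumptions for the sketch, not the paper's term representation), the following code unifies terms and then joins the binding sets produced by two independent AND branches, failing if the branches disagree on a shared variable.

# Minimal sketch: unification over tuple-encoded terms, plus a join that
# merges bindings from independent AND branches.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()      # e.g. "X", "Y"

def walk(t, subst):
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None                                        # clash: failure

def join(bindings_left, bindings_right):
    # Combine the answer substitutions of two parallel AND branches.
    merged = dict(bindings_left)
    for var, val in bindings_right.items():
        merged = unify(var, val, merged)
        if merged is None:
            return None                                # inconsistent bindings
    return merged

# p(X, f(Y)) unified with p(a, f(b)) gives {X: a, Y: b}
print(unify(("p", "X", ("f", "Y")), ("p", "a", ("f", "b")), {}))
print(join({"X": "a"}, {"Y": "b", "X": "a"}))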

3.
We present techniques for exploiting fine-grained parallelism extracted from sequential programs on a fine-grained MIMD system. The system exploits fine-grained parallelism through parallel execution of instructions on multiple processors as well as through the pipelined nature of the individual processors. The processors can communicate data values via globally shared registers as well as dedicated channel queues. Compilation techniques are presented to utilize these mechanisms. A scheduling algorithm has been developed to distribute operations among the processors in a manner that reduces communication among them. The compiler identifies data dependencies that require synchronization and enforces them using channel queues. Delays that may result from attempting write operations to a full channel queue are avoided by spilling values from channels to local registers. If an interprocessor data dependency does not require synchronization, the data value is passed through a shared register or shared memory. Partially supported by National Science Foundation Presidential Young Investigator Award CCR-9157371 (CCR-9249143) to the University of Pittsburgh.
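A small hypothetical sketch of the spilling idea (the queue capacity, the class and method names, and the spill policy are assumptions, not the paper's hardware design): the producer tries a bounded channel first and, when the channel is full, parks the value in a local spill area that is drained before later sends so FIFO order is preserved.

# Hypothetical sketch: a bounded channel queue whose sender spills values
# to a local buffer instead of blocking when the channel is full.
from collections import deque
from queue import Queue, Full, Empty

class SpillingChannel:
    def __init__(self, capacity=4):
        self.channel = Queue(maxsize=capacity)   # hardware-like FIFO
        self.spill = deque()                     # stand-in for local registers

    def send(self, value):
        self._drain()                            # flush earlier spilled values first
        if self.spill:
            self.spill.append(value)             # still backed up: queue behind them
            return
        try:
            self.channel.put_nowait(value)
        except Full:
            self.spill.append(value)             # avoid stalling the producer

    def _drain(self):
        while self.spill:
            try:
                self.channel.put_nowait(self.spill[0])
            except Full:
                return
            self.spill.popleft()

    def receive(self):
        try:
            value = self.channel.get_nowait()
        except Empty:
            value = None
        self._drain()                            # refill the channel from the spill area
        return value

ch = SpillingChannel(capacity=2)
for v in range(5):
    ch.send(v)                                   # 2 enter the channel, 3 spill
print([ch.receive() for _ in range(6)])          # [0, 1, 2, 3, 4, None]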

4.
This paper describes a technique for adapting the Morris sliding garbage collection algorithm to execute on parallel machines with shared memory. The algorithm is described within the framework of an implementation of the parallel logic language Parlog. However, the algorithm is a general one and can easily be adapted to parallel Prolog systems and to other languages. The performance of the algorithm executing a few simple Parlog benchmarks is analyzed. Finally, it is shown how the technique for parallelizing the sequential algorithm can be adapted for a semi-space copying algorithm.

5.
Logic programming is expected to make knowledge information processing feasible. However, conventional Prolog systems lack both processing power and flexibility for solving large problems. To overcome these limitations, an approach is developed in which natural execution features of logic programs can be represented using Proof Diagrams. AND/OR parallel processing based on a goal-rewriting model is examined. Then the abstract architecture of a highly parallel inference engine (PIE) is described. PIE makes it possible to achieve logic/control separation in machine architecture. The architecture proposed here is discussed from the viewpoint of its high degree of parallelism and flexibility in problem solving in comparison with other approaches.

6.
Parallel unification algorithms are not nearly so numerous or well-developed as sequential ones. In order to estimate the improvement in efficiency which may be expected, we define and discuss an objective measure of the effect of parallelism on a sequential algorithm. This measure, known as the potential parallel factor (PPF), is applied to parallel versions of the unification algorithms of Yasuura and Jaffar. The PPFs for these algorithms are measured on a variety of running Prolog programs to estimate what increase in speed may be expected in a Prolog environment from the use of parallelism. Other potential uses of parallelism may be evaluated by different applications of our general methods and techniques.
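The abstract does not define the PPF, so the sketch below uses one plausible proxy purely for illustration and should not be read as the authors' measure or as Yasuura's or Jaffar's algorithms: the ratio of total sequential unification work to the critical-path work when the argument pairs of a compound term are compared in parallel.

# Illustrative proxy only: "PPF" approximated as
# (total sequential work) / (parallel critical-path work).

def work_and_span(a, b):
    """Return (sequential steps, critical-path steps) for comparing two
    ground, identically shaped terms encoded as nested tuples."""
    if not (isinstance(a, tuple) and isinstance(b, tuple)):
        return 1, 1                               # one primitive comparison
    child = [work_and_span(x, y) for x, y in zip(a, b)]
    work = 1 + sum(w for w, _ in child)           # every comparison is done
    span = 1 + (max(s for _, s in child) if child else 0)   # children in parallel
    return work, span

def potential_parallel_factor(a, b):
    work, span = work_and_span(a, b)
    return work / span

t1 = ("f", ("g", "a", "b"), ("g", "c", "d"), ("h", "e"))
t2 = ("f", ("g", "a", "b"), ("g", "c", "d"), ("h", "e"))
print(potential_parallel_factor(t1, t2))          # > 1 when subterms are independent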

7.
This paper begins by describing BSL, a new logic programming language fundamentally different from Prolog. BSL is a nondeterministic Algol-class language whose programs have a natural translation to first order logic; executing a BSL program without free variables amounts to proving the corresponding first order sentence. A new approach is proposed for parallel execution of logic programs coded in BSL, which relies on advanced compilation techniques for extracting fine grain parallelism from sequential code. We describe a new “Very Long Instruction Word” (VLIW) architecture for parallel execution of BSL programs. The architecture, now being designed at the IBM Thomas J. Watson Research Center, avoids the synchronization and communication delays (normally associated with parallel execution of logic programs on multiprocessors) by determining data dependences between operations at compile time, and by coupling the processing elements very tightly via a single central shared register file. A simulator for the architecture has been implemented, and some encouraging simulation results are reported in the paper.

8.
The restricted and-parallelism (RAP) execution model of logic programs is described. This model uses a compile-time data-dependence analysis to generate execution graph expressions for the clauses in a Prolog program. These expressions use simple run-time tests to determine the possibilities of parallelism and produce multiple parallel execution graphs from a single execution graph expression. A simple algorithm is then presented which automatically produces these execution graphs for Prolog clauses. Several ways in which the algorithm can be significantly improved by using the results of program-level data-dependence analysis are discussed.
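As a loose illustration of the run-time tests that RAP-style expressions rely on (the goal encoding, the helper names, and the conditional form are assumptions for this sketch, not the model's actual graph expressions), a conditional expression can check groundness and variable independence at run time and fall back to the sequential order when the test fails.

# Hypothetical sketch of a RAP-style conditional expression:
# run two goals in parallel only if the run-time independence test holds.
from concurrent.futures import ThreadPoolExecutor

def variables(term):
    """Collect variable names from a tuple-encoded term (variables start uppercase)."""
    if isinstance(term, str):
        return {term} if term[:1].isupper() else set()
    return set().union(*(variables(t) for t in term)) if term else set()

def ground(term):
    return not variables(term)

def independent(term_a, term_b):
    return not (variables(term_a) & variables(term_b))

def cge(test, goal1, goal2, pool):
    """Conditional expression: if test holds, goal1 and goal2 run in parallel."""
    if test():
        f = pool.submit(goal1)        # independent: safe to run in parallel
        return f.result() + goal2()
    return goal1() + goal2()          # dependent: keep the sequential order

arg1, arg2 = ("f", "a", "b"), ("g", "X", "c")
with ThreadPoolExecutor(max_workers=2) as pool:
    answers = cge(lambda: ground(arg1) or independent(arg1, arg2),
                  lambda: [("p", arg1)],      # stand-in for solving p(arg1)
                  lambda: [("q", arg2)],      # stand-in for solving q(arg2)
                  pool)
print(answers)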

9.
This paper presents a system (PDP) for the parallel execution of Prolog that supports both independent conjunctive and disjunctive parallelism. The system is intended for distributed memory architectures and is composed of a set of workers with a hierarchically structured scheduler. The execution model has been designed so that each worker's environment contains no references to terms in other environments, thus reducing communication overhead. To guarantee that exploiting parallelism actually improves performance, a granularity control has been introduced for each kind of parallelism. For conjunctive parallelism, PDP applies a control based on the estimates provided by CASLOG; the features of the system allow this control to be introduced without adding overhead. For disjunctive parallelism, PDP controls granularity by applying a heuristic-based method, which can be adapted to other parallel Prolog systems. Different scheduling policies have also been tested. The system has been implemented on a transputer network, and performance results show that it provides high speedups for coarse-grain parallel programs.

10.
A method known as closed environments can be used to represent variable bindings for OR-parallel logic programs without relying on a shared memory or common address space. The representation is based on a procedure that transforms stack frames after unification, taking into account problems with common unbound ancestors and shared instances of complex terms. Closed environments were developed for the AND/OR Process Model, but may be applicable to other OR-parallel models.

11.
PAN is a general-purpose, portable environment for executing logic programs in parallel. It combines a flexible, distributed architecture that is resilient to software and platform evolution with facilities for automatically extracting and exploiting AND and OR parallelism in ordinary Prolog programs. PAN incorporates a range of compile-time and run-time techniques to deliver the performance benefits of parallel execution while retaining sequential execution semantics. Several examples illustrate the efficiency of the controls that facilitate the execution of logic programs in a distributed manner and identify the class of applications that benefit from distributed platforms like PAN. George Xirogiannis, Ph.D.: He received his B.S. in Mathematics from the University of Ioannina, Greece in 1993, his M.S. in Artificial Intelligence from the University of Bristol in 1994, and his Ph.D. in Computer Science from Heriot-Watt University, Edinburgh in 1998. His Ph.D. thesis concerns the automated execution of Prolog on distributed heterogeneous multi-processors. His research interests have progressed from knowledge-based systems to distributed logic programming and data mining. Currently, he is working as a senior IT consultant at Pricewaterhouse Coopers. He is also a Research Associate at the National Technical University of Athens, researching knowledge and data mining. Hamish Taylor, Ph.D.: He is a lecturer in Computer Science in the Computing and Electrical Engineering Department of Heriot-Watt University in Edinburgh. He received M.A. and M.Litt. degrees in philosophy from Cambridge University and an M.S. and a Ph.D. degree in computer science from Heriot-Watt University, Scotland. Since 1985 he has worked on research projects concerned with implementing concurrent logic programming languages, developing formal models for automated reasoning, performance modelling of parallel relational database systems, and visualising resources in shared web caches. His current research interests are in applications of collaborative virtual environments, parallel logic programming and networked computing technologies.

12.
We argue that, in order to exploit both independent And- and Or-parallelism in Prolog programs, there is an advantage in recomputing some of the independent goals rather than reusing all of their solutions. We present an abstract model, called the Composition-tree, for representing and-or parallelism in Prolog programs. The Composition-tree closely mirrors sequential Prolog execution by recomputing some independent goals rather than fully reusing them. We also outline two environment representation techniques for And-Or parallel execution of full Prolog based on the Composition-tree abstraction. We argue that these techniques have advantages over earlier proposals for exploiting and-or parallelism in Prolog.

13.
We present an optimal parallel algorithm for computing a cycle separator of an n-vertex embedded planar undirected graph in O(log n) time on n/log n processors. As a consequence, we also obtain an improved parallel algorithm for constructing a depth-first search tree rooted at any given vertex in a connected planar undirected graph in O(log² n) time on n/log n processors. The best previous algorithms for computing depth-first search trees and cycle separators achieved the same time complexities, but with n processors. Our algorithms run on a parallel random access machine that permits concurrent reads and concurrent writes in its shared memory and allows an arbitrary processor to succeed in case of a write conflict. A preliminary version of this paper appeared as "Improved Parallel Depth-First Search in Undirected Planar Graphs" in the Proceedings of the Third Workshop on Algorithms and Data Structures, 1993, pp. 407-420. Supported in part by NSF Grant CCR-9101385.

14.
With the growing availability of multiprocessors, a great deal of attention has been given to executing Prolog in parallel. A question that naturally arises is how to execute standard sequential Prolog programs with side effects in parallel. The problem of performing side effects in AND parallel systems has been considered elsewhere. This paper presents a method that preserves the sequential semantics of side-effect predicates in an OR parallel system. First, a general method is given for performing data side effects such as read and write. This method is then extended to control side effects such as asserta, assertz, and retract. Finally, a constant-time algorithm for performing cut is presented. The work of L. V. Kale was supported by the National Science Foundation under Grant NSF-CCR-8700988. The work of D. A. Padua and D. C. Sehr was supported in part by the National Science Foundation under Grant NSF-MIP-8410110, the Department of Energy under Grant DOE DE-FG02-85ER25001, and a donation from the IBM Corporation to the Center for Supercomputing Research and Development. D. C. Sehr holds a fellowship from the Office of Naval Research.
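A common way to give side effects their sequential meaning in an OR-parallel setting is to release each side effect only in the left-to-right order of the branches that requested it. The sketch below illustrates that ordering principle only; it is an assumption for illustration, not this paper's method or its constant-time cut algorithm, and a real system would release an action as soon as its branch becomes leftmost rather than buffering everything to the end.

# Hypothetical sketch: side effects requested by OR-parallel branches are
# released in the left-to-right order of their branches, i.e. the order a
# sequential Prolog engine would perform them.
import heapq

class SideEffectSequencer:
    def __init__(self):
        self.buffer = []                 # min-heap keyed by branch path

    def request(self, branch_path, action):
        # branch_path = tuple of child indices from the root, e.g. (0, 2, 1);
        # tuples compare lexicographically, so smaller means further left.
        # Paths are assumed distinct (one entry per requesting branch).
        heapq.heappush(self.buffer, (branch_path, action))

    def release_all(self):
        while self.buffer:
            _, action = heapq.heappop(self.buffer)
            action()

seq = SideEffectSequencer()
# Branches may register their writes in any (parallel) completion order:
seq.request((1, 0), lambda: print("write from branch (1,0)"))
seq.request((0, 2), lambda: print("write from branch (0,2)"))
seq.request((0, 0), lambda: print("write from branch (0,0)"))
seq.release_all()        # prints in sequential order: (0,0), (0,2), (1,0)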

15.
This paper describes the implementation of a logic programming language on a massively parallel architecture. The implementation is based on the AND/OR Process Model, which allows the exploitation of both AND and OR parallelism in logic programs. A distributed memory model is used, and a decentralized control mechanism has been designed. The multicomputer on which the system has been implemented consists of a network of Inmos Transputers. The AND/OR processes are implemented as Occam processes mapped onto the Transputer nodes. After a presentation of the system architecture and a detailed discussion of the distributed memory management, some preliminary performance results are discussed.

16.
A parallel-execution model that can concurrently exploit AND and OR parallelism in logic programs is presented. This model employs a combination of techniques in an approach to executing logic programs in parallel, making tradeoffs among the number of processes, the degree of parallelism, and combination bandwidth. To interpret a nondeterministic logic program, this model (1) performs frame inheritance for newly created goals, (2) creates data-dependency graphs (DDGs) that represent relationships among the goals, and (3) constructs appropriate process structures based on the DDGs. (1) The use of frame inheritance serves to increase modularity. In contrast to most previous parallel models, which have a single large process structure, frame inheritance facilitates the dynamic construction of multiple independent process structures, and thus permits further manipulation of each process structure. (2) The dynamic determination of data dependency serves to reduce computational complexity. In comparison to models that exploit brute-force parallelism and models that have fixed execution sequences, this model can substantially reduce the number of unification and/or merging steps. In comparison to models that exploit only AND parallelism, this model can selectively exploit demand-driven computation, according to the binding of the query and optional annotations. (3) The construction of appropriate process structures serves to reduce communication complexity. Unlike other methods that map DDGs directly onto process structures, this model can significantly reduce the amount of data sent to a process and/or the number of communication channels connected to a process.
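A rough sketch of step (2), building a data-dependency graph from the variables the goals share (the goal encoding and the left-to-right producer rule below are assumptions for illustration, not this model's actual analysis): the earliest goal that mentions a variable is treated as its producer, every later goal that mentions the same variable depends on it, and goals with no edges between them can then be scheduled in parallel.

# Hypothetical sketch: build a data-dependency graph (DDG) for a clause body
# and group independent goals into parallel layers.
from collections import defaultdict

def build_ddg(goals):
    """goals: list of (name, set of variables). Returns {goal index: set of
    indices it depends on}."""
    producer = {}
    deps = defaultdict(set)
    for i, (_, vars_i) in enumerate(goals):
        for v in vars_i:
            if v in producer:
                deps[i].add(producer[v])   # consumer depends on the producer
            else:
                producer[v] = i            # leftmost occurrence produces v
    return deps

def parallel_layers(goals, deps):
    """Topologically layer the DDG; goals in the same layer are independent."""
    remaining = set(range(len(goals)))
    layers = []
    while remaining:
        ready = {i for i in remaining if not (deps[i] & remaining)}
        layers.append(sorted(ready))
        remaining -= ready
    return layers

# body:  p(X, Y), q(X, Z), r(Y), s(Z, W)
body = [("p", {"X", "Y"}), ("q", {"X", "Z"}), ("r", {"Y"}), ("s", {"Z", "W"})]
deps = build_ddg(body)
print(dict(deps))                    # {1: {0}, 2: {0}, 3: {1}}
print(parallel_layers(body, deps))   # [[0], [1, 2], [3]]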

17.
In distributed shared memory multiprocessor systems, parallel tasks communicate by sharing memory data. As the system size increases, this communication cost becomes the main factor limiting overall parallelism and performance. In this paper, we propose a new solution to the problem by judiciously managing the relevant resources, namely the shared data and the interconnection network (IN) through which the sharing is carried out. In this approach, communication cost is minimized by means of data migration/allocation based on analyzing general layered task graphs, the sharing behavior of parallel tasks, and the network topology. Our method is not applicable to read-only variables. Further, for the time being, its usefulness is limited to multiprocessors that implement no cache coherence mechanism. Four typical interconnection topologies for multiprocessors are considered, namely shared-bus, hierarchical-bus, 2-D mesh, and fat-tree structures. Efficient data allocation algorithms for each of the four network topologies are developed that make data allocation/migration decisions at compile time. The complexity of the algorithm is O(np) for the shared bus and O(n²p) for the remaining three topologies, in a system with n processors executing a p-layer task graph for one shared variable. We have also given an algorithm to determine an optimal allocation/migration scheme for multiple shared variables. However, the cost of that algorithm becomes prohibitive when the number of shared variables is high, so a heuristic of low complexity is suggested. The heuristic is optimal for some topologies.
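A hedged sketch of what a compile-time allocation/migration decision for a single shared variable could look like (the cost model, access counts, and uniform migration cost below are invented for illustration and are not the paper's algorithms): a dynamic program walks the p layers of the task graph and, for every candidate home processor, keeps the cheapest way to have the variable there, giving the O(n²p) flavor mentioned above; with a uniform migration cost, as on a shared bus, the inner minimization collapses and the cost drops toward O(np).

# Hypothetical dynamic program: choose, layer by layer, which processor holds
# one shared variable so that access cost plus migration cost is minimized.
# access[l][q] = assumed cost for layer l's tasks to reach the data on q;
# migrate[q][r] = assumed cost of moving the data from q to r.

def plan_placement(access, migrate):
    p, n = len(access), len(access[0])
    INF = float("inf")
    cost = [access[0][q] for q in range(n)]          # best cost ending on q
    choice = [list(range(n))]                        # backpointers per layer
    for l in range(1, p):
        new_cost, prev = [INF] * n, [0] * n
        for r in range(n):                           # O(n^2) work per layer
            for q in range(n):
                c = cost[q] + migrate[q][r] + access[l][r]
                if c < new_cost[r]:
                    new_cost[r], prev[r] = c, q
        cost, choice = new_cost, choice + [prev]
    where = min(range(n), key=lambda q: cost[q])     # recover the placement
    placement = [where]
    for l in range(p - 1, 0, -1):
        where = choice[l][where]
        placement.append(where)
    return list(reversed(placement)), min(cost)

# 3 layers, 2 processors: layers 0 and 2 access mostly from P0, layer 1 from P1.
access = [[0, 4], [6, 1], [0, 4]]
migrate = [[0, 2], [2, 0]]
print(plan_placement(access, migrate))   # ([0, 1, 0], 5): migrate to P1 for layer 1, then back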

18.
Muse (Multi-sequential Prolog engines) is a simple and efficient approach to the Or-parallel execution of Prolog programs. It is based on having several sequential Prolog engines, each with its own local address space, together with some shared memory space. It is currently implemented on a 7-processor machine with local/shared memory constructed at SICS, a 16-processor Sequent Symmetry, a 96-processor BBN Butterfly I, and a 45-processor BBN Butterfly II. The sequential SICStus Prolog system has been adapted for Or-parallel implementation; the extra overhead associated with this adaptation is very low in comparison with other approaches. The speed-up factor is very close to the number of processors in the system for a large class of problems. The goal of this paper is to present the Muse execution model, some of its implementation issues, a variant of Prolog suitable for multiprocessor implementations, and some experimental results obtained from two different multiprocessor systems.

19.
Task parallelism is an attractive approach to automatically load balancing the computation in a parallel system and adapting to the dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address-space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain. Copyright © 2016 John Wiley & Sons, Ltd.
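For readers unfamiliar with the underlying mechanism, below is a generic work-stealing sketch; it illustrates the classic owner-works-from-one-end, thief-steals-from-the-other discipline and is not the CPU-GPU scheduler studied in this paper (the worker counts, task list, and class names are invented). Each worker keeps its own deque of tasks, works from the tail, and when it runs dry it steals from the head of a randomly chosen victim.

# Generic work-stealing sketch (not the paper's CPU-GPU scheduler).
import random
import threading
from collections import deque

class Worker:
    def __init__(self, wid, tasks):
        self.wid = wid
        self.deque = deque(tasks)
        self.lock = threading.Lock()
        self.executed = 0

    def pop_local(self):
        with self.lock:
            return self.deque.pop() if self.deque else None      # owner: tail

    def steal(self):
        with self.lock:
            return self.deque.popleft() if self.deque else None  # thief: head

def run(worker, others):
    while True:
        task = worker.pop_local()
        if task is None:
            victims = [w for w in others if w.deque]
            if not victims:
                return                      # nothing left anywhere: terminate
            task = random.choice(victims).steal()
            if task is None:
                continue                    # lost the race; try again
        task()                              # execute the (toy) task
        worker.executed += 1

tasks = [lambda: None] * 40
workers = [Worker(0, tasks), Worker(1, []), Worker(2, [])]   # imbalanced load
threads = [threading.Thread(target=run, args=(w, [x for x in workers if x is not w]))
           for w in workers]
for t in threads: t.start()
for t in threads: t.join()
print({w.wid: w.executed for w in workers})   # counts sum to 40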

20.
In this paper, we propose a three-dimensional, two-level Locality-Aware Parallel Delaunay image-to-mesh conversion algorithm (LAPD). The algorithm exploits two levels of parallelism at different granularities: coarse-grain parallelism at the region level (which is mapped to a node with multiple cores) and medium-grain parallelism at the cavity level (which is mapped to a single core). We employ a data-locality-aware mesh refinement process to reduce the latency caused by remote memory accesses. We evaluated LAPD on Blacklight, a cache-coherent NUMA distributed shared memory (DSM) machine at the Pittsburgh Supercomputing Center, and observed a weak scaling efficiency of almost 70% for roughly 200 cores, compared to only 30% for the previous algorithm, the Parallel Optimistic Mesh Generation algorithm (PODM). To the best of our knowledge, LAPD exhibits the best scalability among parallel Delaunay mesh generation algorithms running on NUMA DSM supercomputers.
