
1.

Automatic compile-time parallelization of logic programs for restricted, goal-level, independent and-parallelism

《The Journal of Logic Programming》1999,38(2):165-218

A framework for the automatic parallelization of (constraint) logic programs is proposed and proved correct. Intuitively, the parallelization process replaces conjunctions of literals with parallel expressions. Such expressions trigger at run-time the exploitation of restricted, goal-level, independent and-parallelism. The parallelization process performs two steps. The first one builds a conditional dependency graph (which can be simplified using compile-time analysis information), while the second transforms the resulting graph into linear conditional expressions, the parallel expressions of the &-Prolog language. Several heuristic algorithms for the latter (“annotation”) process are proposed and proved correct. Algorithms are also given which determine if there is any loss of parallelism in the linearization process with respect to a proposed notion of maximal parallelism. Finally, a system is presented which implements the proposed approach. The performance of the different annotation algorithms is compared experimentally in this system by studying the time spent in parallelization and the effectiveness of the results in terms of speedups.

2.

Parallelization of deduction strategies: An analytical study (cited by: 1; self-citations: 0; citations by others: 1)

Maria Paola Bonacina Jieh Hsiang 《Journal of Automated Reasoning》1994,13(1):1-33

In this paper we present a general analysis of the parallelization of deduction strategies. We classify strategies as subgoal-reduction strategies, expansion-oriented strategies, and contraction-based strategies. For each class we analyze how and what types of parallelism can be utilized. Since the operational semantics of deduction-based programming languages can be construed as subgoal-reduction strategies, our analysis encompasses, at the abstract level, both strategies for deduction-based programming and those for theorem proving. We distinguish different types of parallel deduction based on the granularity of parallelism. These two criteria, the classification of strategies and of types of parallelism, provide us with a framework to treat problems and with a grid to classify approaches to parallel deduction. Within this framework, we analyze many issues, including the dynamicity and size of the database of clauses during the derivation, the possibility of conflicts between parallel inferences, and duplication versus sharing of clauses. We also suggest the type of architectures that may be suitable for each class of strategies. We substantiate our analysis by describing existing methods, emphasizing parallel expansion-oriented strategies and parallel contraction-based strategies for theorem proving. The most interesting and least explored by existing approaches are the contraction-based strategies. The presence of contraction rules (rules that delete clauses), and especially the application of backward contraction, emerges as a key issue for parallelization of these strategies. Backward contraction is the main reason for the impressive experimental success of contraction-based strategies. Our analysis shows that backward contraction makes efficient parallelization much more difficult. In our analysis, coarse-grain parallelism appears to be the best choice for parallelizing contraction-based reasoning. Accordingly, we propose a notion of parallelism at the search level as coarse-grain parallelism for deduction. Supported by the GE Foundation Faculty Fellowship to the University of Iowa and by the National Science Foundation with grant CCR-94-08667. Supported by grant NSC 83-0408-E-002-012T of the National Science Council of the Republic of China.

3.

A study of potential parallelism among traces in Java programs

Borys J. Bradel Tarek S. Abdelrahman 《Science of Computer Programming》2009,74(5-6):296-313

The exploitation of parallelism among traces, i.e. hot paths of execution in programs, is a novel approach to the automatic parallelization of Java programs and it has many advantages. However, to date, the extent to which parallelism exists among traces in programs has not been made clear. The goal of this study is to measure the amount of trace-level parallelism in several Java programs. We extend the Jupiter Java Virtual Machine with a simulator that models an abstract parallel system. We use this simulator to measure trace-level parallelism. We further use it to examine the effects of the number of processors, trace window size, and communication type and cost on performance. We also analyze the dependence characteristics of the benchmarks and see how they relate to parallelism. The results indicate that enough trace-level parallelism exists for a modest number of processors. Thus, we conclude that trace-based parallelization is a potentially viable approach to improve the performance of Java programs.

4.

A parallel spectral model for atmospheric transport processes

Thomas Kindler Karsten Schwan Dilma Silva Mary Trauner Fred Alyea 《Concurrency and Computation》1996,8(9):639-666

The paper describes a parallel implementation of a grand challenge problem: global atmospheric modeling. The novel contributions of our work include (1) a detailed investigation of opportunities for parallelism in atmospheric global modeling based on spectral solution methods, (2) the experimental evaluation of overheads arising from load imbalances and data movement for alternative parallelization methods, and (3) the development of a parallel code that can be monitored and steered interactively based on output data visualizations and animations of program functionality or performance. Code parallelization takes advantage of the relative independence of computations at different levels in the earth's atmosphere, resulting in parallelism of up to 40 processors, each independently performing computations for different atmospheric levels and requiring few communications between different levels across model time steps. Next, additional parallelism is attained within each level by taking advantage of the natural parallelism offered by the spectral computations being performed (e.g. taking advantage of independently computable terms in equations). Performance measurements are performed on a 64-node KSR2 supercomputer. However, the parallel code has been ported to several shared memory parallel machines, including SGI multiprocessors, and has also been ported to distributed memory platforms like the IBM SP-2.

5.

ParC—An Extension of C for Shared Memory Parallel Processing

Yosi Ben-Asher Dror G. Feitelson Larry Rudolph 《Software》1996,26(5):581-612

ParC is an extension of the C programming language with block-oriented parallel constructs that allow the programmer to express fine-grain parallelism in a shared-memory model. It is suitable for the expression of parallel shared-memory algorithms, and also conducive to the parallelization of sequential C programs. In addition, performance-enhancing transformations can be applied within the language, without resorting to low-level programming. The language includes closed constructs to create parallelism, as well as instructions to cause the termination of parallel activities and to enforce synchronization. The parallel constructs are used to define the scope of shared variables, and also to delimit the sets of activities that are influenced by termination or synchronization instructions. The semantics of parallelism are discussed, especially relating to the discrepancy between the limited number of physical processors and the potentially much larger number of parallel activities in a program.

6.

Use of parallel FORTRAN for engineering problems on the IBM 3090 vector multiprocessor

《Parallel Computing》1988,9(1):107-115


7.

Effective parallelization techniques for loop nests with non-uniform dependences

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1):37-64

The parallelism of loop nests with non-uniform dependences is difficult to extract and is explored ineffectively by existing parallelization schemes. In this paper, we propose new efficient techniques for extracting the parallelism of loop nests with non-uniform dependences by using their irregularity. In this way, current highly parallel multiprocessor systems such as multithreaded and clustering multiprocessor systems can be fully utilized. The four mechanisms are (a) parallelization part splitting, (b) partial parallelization decomposition, (c) irregular loop interchange and (d) growing pattern detection. They explore the parallelism of special parallel patterns in nested loops with non-uniform dependences. The loop transformations used for uniform loops are also applied to non-uniform dependence loops after legality tests. We apply the results of classical convex theory and detect special parallel patterns of dependence vectors. We also propose an algorithm that combines the above mechanisms to enhance parallelism. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques. Thus, we are encouraged by these apparent enhancements to pursue further development.

8.

MPI+OpenMP hybrid programming implementation of multi-color SSOR-PCG

林绍忠 许合伟 颉志强 《计算机辅助工程》(Computer Aided Engineering) 2013,22(6):79-83

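Parallelizing the symmetric successive over-relaxation preconditioned conjugate gradient (SSOR-PCG) method is difficult because two triangular systems must be solved in parallel at every iteration. To address this, a multi-color ordering technique is used to increase the degree of parallelism, and a parallel program suited to distributed shared-memory computers is developed on the MPI+OpenMP hybrid programming model. Effective MPI communication functions are selected through testing, and three measures for avoiding races on shared data are given, to be chosen according to the problem size and the memory capacity of the computer.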
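The multi-color ordering mentioned in this abstract can be illustrated with a minimal Python sketch. It greedily colors the unknowns of a small grid so that no two coupled unknowns share a color; unknowns of one color could then be updated concurrently inside each triangular sweep. The function names (`grid_adjacency`, `greedy_coloring`, `group_by_color`) and the 5-point grid test problem are assumptions made for illustration, not the authors' code, and the MPI and OpenMP layers are left out.

```python
def grid_adjacency(nx, ny):
    """Coupling graph of a 5-point stencil on an nx-by-ny grid (hypothetical test problem)."""
    adjacency = {}
    for i in range(nx):
        for j in range(ny):
            k = i * ny + j
            neighbours = []
            if i > 0:
                neighbours.append(k - ny)
            if i < nx - 1:
                neighbours.append(k + ny)
            if j > 0:
                neighbours.append(k - 1)
            if j < ny - 1:
                neighbours.append(k + 1)
            adjacency[k] = neighbours
    return adjacency


def greedy_coloring(adjacency):
    """Give each unknown the smallest color not used by an already-colored neighbour."""
    colors = {}
    for node in sorted(adjacency):
        used = {colors[n] for n in adjacency[node] if n in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


def group_by_color(colors):
    """Return one sorted list of unknowns per color, in color order."""
    groups = {}
    for node, color in colors.items():
        groups.setdefault(color, []).append(node)
    return [sorted(groups[c]) for c in sorted(groups)]


if __name__ == "__main__":
    adjacency = grid_adjacency(3, 3)
    for color, nodes in enumerate(group_by_color(greedy_coloring(adjacency))):
        # Unknowns listed together have no mutual coupling in this graph.
        print(color, nodes)
```

On a regular grid this reduces to the familiar red-black ordering; on an unstructured mesh the greedy pass may need more colors, and fewer unknowns per color means less parallel work per step.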

9.

Parallelism and Scalability in an Image Processing Application

Morten S. Rasmussen Matthias B. Stuart Sven Karlsson 《International journal of parallel programming》2009,37(3):306-323

The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor system-on-chips. This means that parallel processing is required in application areas that traditionally have not used parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately available parallelism and further extraction of parallelism is limited by small data sets and a relatively high parallelization overhead. Load balance is difficult to obtain due to the limited parallelism and made worse by non-uniform memory latency. Three parallel OpenMP implementations of the application are discussed and evaluated. We show that with some modifications relative speedups in excess of 9 on a 16 CPU system can be reached.

10.

Dynamic resolution: A runtime technique for the parallelization of modifications to directed acyclic graphs

Lorenz Huelsbergen 《International journal of parallel programming》1997,25(5):385-417


11.

Automatic Parallelization of Recursive Procedures

Manish Gupta Sayak Mukhopadhyay Navin Sinha 《International journal of parallel programming》2000,28(6):537-562

Parallelizing compilers have traditionally focussed mainly on parallelizing loops. This paper presents a new framework for automatically parallelizing recursive procedures that typically appear in divide-and-conquer algorithms. We present compile-time analysis, using powerful, symbolic array section analysis, to detect the independence of multiple recursive calls in a procedure. This allows exploitation of a scalable form of nested parallelism, where each parallel task can further spawn off parallel work in subsequent recursive calls. We describe a runtime system which efficiently supports this kind of nested parallelism without unnecessarily blocking tasks. We have implemented this framework in a parallelizing compiler, which is able to automatically parallelize programs like quicksort and mergesort, written in C. For cases where even the advanced compile-time analysis we describe is not able to prove the independence of procedure calls, we propose novel techniques for speculative runtime parallelization, which are more efficient and powerful in this context than analogous techniques proposed previously for speculatively parallelizing loops. Our experimental results on an IBM G30 SMP machine show good speedups obtained by following our approach.
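The idea of running two recursive calls in parallel once their array sections are known to be disjoint can be sketched in Python as follows. This is a hand-written illustration under stated assumptions, not the compiler framework of the paper: the disjointness test `sections_disjoint` is evaluated at run time on explicit index ranges rather than proved symbolically at compile time, the `depth` parameter that bounds thread creation is a hypothetical knob, and CPython threads are used only to show the structure (they will not give real speedups for this workload).

```python
import threading


def sections_disjoint(a, b):
    """True if the half-open index ranges a = (lo, hi) and b = (lo, hi) do not overlap."""
    return a[1] <= b[0] or b[1] <= a[0]


def merge(data, lo, mid, hi):
    """Merge the sorted sections data[lo:mid] and data[mid:hi] in place."""
    merged = []
    i, j = lo, mid
    while i < mid and j < hi:
        if data[i] <= data[j]:
            merged.append(data[i])
            i += 1
        else:
            merged.append(data[j])
            j += 1
    merged.extend(data[i:mid])
    merged.extend(data[j:hi])
    for offset, value in enumerate(merged):
        data[lo + offset] = value


def parallel_mergesort(data, lo, hi, depth=2):
    """Sort data[lo:hi]; spawn a thread for one half while depth > 0."""
    if hi - lo < 2:
        return
    mid = (lo + hi) // 2
    left, right = (lo, mid), (mid, hi)
    if depth > 0 and sections_disjoint(left, right):
        # The two calls touch disjoint sections, so they may run concurrently,
        # and each may spawn further parallel work (nested parallelism).
        worker = threading.Thread(
            target=parallel_mergesort, args=(data, lo, mid, depth - 1)
        )
        worker.start()
        parallel_mergesort(data, mid, hi, depth - 1)
        worker.join()
    else:
        parallel_mergesort(data, lo, mid, 0)
        parallel_mergesort(data, mid, hi, 0)
    merge(data, lo, mid, hi)


if __name__ == "__main__":
    values = [5, 2, 9, 1, 7, 3, 8]
    parallel_mergesort(values, 0, len(values))
    print(values)
```

The merge step runs only after both halves have finished, which is the single synchronization point each level of the recursion needs.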

12.

A scalable method for run-time loop parallelization

Lawrence Rauchwerger Nancy M. Amato David A. Padua 《International journal of parallel programming》1995,23(6):537-576

Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the available parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run-time. In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (elementwise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and ... An abstract of this paper has been published in Ref. 1. Research supported in part by Army contract #DABT63-92-C-0033. This work is not necessarily representative of the positions or policies of the Army or the Government. Research supported in part by Intel and NASA Graduate Fellowships. Research supported in part by an AT&T Bell Laboratories Graduate Fellowship and by the International Computer Science Institute, Berkeley, California.
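The inspector/scheduler split described in this abstract can be illustrated with a minimal Python sketch for a loop of the form `a[writes[i]] = f(a[reads[i]])`. The inspector assigns each iteration a wavefront number from the access pattern alone, and the scheduler groups iterations by wavefront. Everything here is an assumption for illustration: the loop shape, the function names, and above all the sequential inspector, which is a simplification and not the fully parallel, synchronization-free inspector of the paper. Privatization and reduction handling are omitted.

```python
def inspect(writes, reads):
    """Return the wavefront number of each iteration of a[writes[i]] = f(a[reads[i]])."""
    last_write = {}  # element -> wavefront of its most recent writer
    max_read = {}    # element -> highest wavefront of any reader so far
    wavefront = []
    for w, r in zip(writes, reads):
        level = 0
        if r in last_write:   # flow dependence: read after write
            level = max(level, last_write[r] + 1)
        if w in last_write:   # output dependence: write after write
            level = max(level, last_write[w] + 1)
        if w in max_read:     # anti dependence: write after read
            level = max(level, max_read[w] + 1)
        wavefront.append(level)
        max_read[r] = max(max_read.get(r, -1), level)
        last_write[w] = level
    return wavefront


def schedule(wavefront):
    """Group iteration indices by wavefront, earliest wavefront first."""
    groups = {}
    for i, level in enumerate(wavefront):
        groups.setdefault(level, []).append(i)
    return [groups[k] for k in sorted(groups)]


def run_sequential(a, writes, reads, f):
    """Reference execution in original iteration order."""
    a = list(a)
    for w, r in zip(writes, reads):
        a[w] = f(a[r])
    return a


def run_wavefronts(a, writes, reads, f, groups):
    """Execute wavefront by wavefront; order inside a wavefront is arbitrary."""
    a = list(a)
    for group in groups:
        for i in reversed(group):  # deliberately not program order
            a[writes[i]] = f(a[reads[i]])
    return a


if __name__ == "__main__":
    writes, reads = [0, 1, 2, 0, 1], [3, 3, 0, 4, 2]
    groups = schedule(inspect(writes, reads))
    print(groups)
```

Iterations inside one group have no dependences on each other, so a real executor could hand each group to a set of workers and place a barrier between groups.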

13.

Stabilizing large‐scale generalized systems on parallel computers using multithreading and message‐passing

Peter Benner Maribel Castillo Rafael Mayo Enrique S. Quintana‐Ortí Gregorio Quintana‐Ortí 《Concurrency and Computation》2007,19(4):531-542

We discuss the parallelization of an efficient algorithm for the partial stabilization of large‐scale linear control systems in generalized state‐space form. The algorithm is composed of highly parallel iterative schemes that appear in the computation of certain matrix functions. Here we evaluate different approaches to exploit parallelism at two levels, based on threads and processes. Our experimental results on a cluster of symmetric multiprocessors and a CC‐NUMA platform show that the efficiency of the matrix operations underlying the iterative schemes carries over to the parallel implementation of the stabilization algorithm. Copyright © 2006 John Wiley & Sons, Ltd.

14.

SPS-Parallelism + SETHEO = SPTHEO

Christian B. Suttner 《Journal of Automated Reasoning》1999,22(4):397-431


15.

Exploiting Distributed-Memory and Shared-Memory Parallelism on Clusters of SMPs with Data Parallel Programs

Benkner Siegfried Sipkova Viera 《International journal of parallel programming》2003,31(1):3-19

Clusters of SMPs are hybrid-parallel architectures that combine the main concepts of distributed-memory and shared-memory parallel machines. Although SMP clusters are widely used in the high performance computing community, there exists no single programming paradigm that allows exploiting the hierarchical structure of these machines. Most parallel applications deployed on SMP clusters are based on MPI, the standard API for distributed-memory parallel programming, and thus may miss a number of optimization opportunities offered by the shared memory available within SMP nodes. In this paper we present extensions to the data parallel programming language HPF and associated compilation techniques for optimizing HPF programs on clusters of SMPs. The proposed extensions enable programmers to control key aspects of distributed-memory and shared-memory parallelization at a high level of abstraction. Based on these language extensions, a compiler can adopt a hybrid parallelization strategy which closely reflects the hierarchical structure of SMP clusters by automatically exploiting shared-memory parallelism based on OpenMP within cluster nodes and distributed-memory parallelism utilizing MPI across nodes. We describe the implementation of these features in the VFC compiler and present experimental results which show the effectiveness of these techniques.

16.

Performance characteristics of the multi-zone NAS parallel benchmarks

《Journal of Parallel and Distributed Computing》2006,66(5):674-685

We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of meshes, but had not previously been captured in benchmarks. The new suite, named NPB (NAS parallel benchmarks) multi-zone, is derived from the NPB suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the message passing interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on four different parallel computers. We also use an empirical formula to investigate the performance characteristics of the hybrid parallel codes.

17.

Potential thread-level-parallelism exploration with superblock reordering

John Ye Hui Yan Honglun Hou Tianzhou Chen 《Computing》2014,96(6):545-564

The growing number of processing cores in a single CPU is demanding more parallelism from sequential programs. But in the past decades little work has succeeded in automatically exploiting enough parallelism, which casts a shadow over the many-core architecture and automatic parallelization research. Moreover, little work has actually tried to understand the nature, or amount, of the potentially available parallelism in programs. In this paper we analyze at runtime the dynamic data dependencies among superblocks of sequential programs. We designed a meta re-arrange buffer to measure and exploit the available parallelism, with which the superblocks are dynamically analyzed, reordered and dispatched to run in parallel on an ideal many-core processor, while the data dependencies and program correctness are still maintained. In our experiments, we observed that with the superblock reordering, the potential speedup ranged from 1.08 to 89.60. The results showed that the potential parallelism of normal programs was still far from fully exploited by existing technologies. This observation makes automatic parallelization a promising research direction for many-core architectures.
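A rough way to see what "potential speedup on an ideal many-core processor" means is to compare the serial running time of a block trace with the length of its longest dependence chain. The Python sketch below does that under strong assumptions that are not taken from the paper: an unlimited number of cores, an unbounded reordering window, zero dispatch cost, and a hypothetical input format in which `costs[i]` is the running time of block `i` and `deps[i]` lists the earlier blocks it depends on. It does not model the meta re-arrange buffer itself.

```python
def potential_speedup(costs, deps):
    """Serial time divided by critical-path time for a dependence-ordered block trace.

    costs[i] is the execution time of block i in program order.
    deps maps a block index to the indices of earlier blocks it must wait for.
    """
    finish = []
    for i, cost in enumerate(costs):
        # A block may start as soon as every block it depends on has finished.
        start = max((finish[d] for d in deps.get(i, ())), default=0)
        finish.append(start + cost)
    serial_time = sum(costs)
    parallel_time = max(finish, default=0)
    return serial_time / parallel_time if parallel_time else 1.0


if __name__ == "__main__":
    # Blocks 0 and 1 are independent; block 2 needs block 0, block 3 needs block 1.
    print(potential_speedup([2, 3, 1, 4], {2: [0], 3: [1]}))
```

A fully serial chain gives a ratio of 1, while fully independent blocks of equal cost give a ratio equal to the number of blocks, so the value is an upper bound on what any reordering scheme could reach for that trace.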

18.

Compilation techniques for parallel systems

《Parallel Computing》1999,25(13-14):1741-1783

Over the past two decades tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in the form of a program representation. Next, compilation techniques for scheduling instruction level parallelism (ILP) are discussed along with the relationship between the nature of compiler support and type of processor architecture. Compilation techniques for exploiting loop and task level parallelism on shared-memory multiprocessors (SMPs) are summarized. Locality optimizations that must be used in conjunction with parallelization techniques for achieving high performance on machines with complex memory hierarchies are also discussed. Finally, we provide an overview of compilation techniques for distributed memory machines that must perform partitioning of both code and data for parallel execution. Communication optimization and code generation issues that are unique to such compilers are also briefly discussed.

19.

Time stamp algorithms for runtime parallelization of DOACROSS loops with dynamic dependences

Xu C.-Z. Chaudhary V. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(5):433-450

This paper presents a time stamp algorithm for runtime parallelization of general DOACROSS loops that have indirect access patterns. The algorithm follows the INSPECTOR/EXECUTOR scheme and exploits parallelism at a fine-grained memory reference level. It features a parallel inspector and improves upon previous algorithms of the same generality by exploiting parallelism among consecutive reads of the same memory element. Two variants of the algorithm are considered: One allows partially concurrent reads (PCR) and the other allows fully concurrent reads (FCR). Analyses of their time complexities derive a necessary condition with respect to the iteration workload for runtime parallelization. Experimental results for a Gaussian elimination loop, as well as an extensive set of synthetic loops on a 12-way SMP server, show that the time stamp algorithms outperform iteration-level parallelization techniques in most test cases and gain speedups over sequential execution for loops that have heavy iteration workloads. The PCR algorithm performs best because it makes a better trade-off between maximizing the parallelism and minimizing the analysis overhead. For loops with light or unknown iteration loads, an alternative speculative runtime parallelization technique is preferred.

20.

Implementation of functional parallel typified language (FPTL) on multicore computers

V. P. Kutepov P. N. Shamal’ 《Journal of Computer and Systems Sciences International》2014,53(3):345-358

A functional programming language supporting implicit parallelization of programs is described. The language is based on four operations of composition, of which three can perform parallel processing. Functional programs are represented schematically to use a dynamic parallelization algorithm. The implemented algorithms make it possible to dynamically distribute the load between processors and control the grain of parallelism. Experimental results for the efficiency of the implemented system obtained on examples of typical problems are presented.