期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A scalable method for run-time loop parallelization

Lawrence Rauchwerger Nancy M. Amato David A. Padua 《International journal of parallel programming》1995,23(6):537-576

Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the avaialable, parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run-time. In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generatesinspector code that performas run-time preprocessing of the loop's access pattern, andscheduler code that schedules (and executes) the loop interations. The inspector is fully parallel, uses no sychronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop:array privatization andreduction parallelization (elementwise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and An abstract of this paper has been publsihed in Ref. 1. Research supported in part by Army contract #DABT63-92-C-0033. This work is not necessarily representative of the positions or policies of the Army of the Government. Research supported in part by Intel and NASA Graduate Fellowships. Research supported in part by an AT&T Bell Laboratoroies Graduate Fellowship and by the International Computer Science Institute, Berkeley, California. 相似文献

2.

The LRPD test: speculative run-time parallelization of loops withprivatization and reduction parallelization

Rauchwerger L. Padua D.A. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(2):160-180

Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we advocate a novel framework for their identification: speculatively execute the loop as a doall and apply a fully parallel data dependence test to determine if it had any cross-iteration dependences; if the test fails, then the loop is reexecuted serially. Since, from our experience, a significant amount of the available parallelism in Fortran programs can be exploited by loops transformed through privatization and reduction parallelization, our methods can speculatively apply these transformations and then check their validity at run-time. Another important contribution of this paper is a novel method for reduction recognition which goes beyond syntactic pattern matching: it detects at run-time if the values stored in an array participate in a reduction operation, even if they are transferred through private variables and/or are affected by statically unpredictable control flow. We present experimental results on loops from the PERFECT Benchmarks, which substantiate our claim that these techniques can yield significant speedups which are often superior to those obtainable by inspector/executor methods 相似文献

3.

Parallel computing for irregular applications

J. Chassin de KergommeauxP. J. HatcherL. Rauchwerger 《Parallel Computing》2000,26(13-14)

相似文献

4.

An Adaptive Algorithm Selection Framework for Reduction Parallelization

Hao Yu Rauchwerger L. 《Parallel and Distributed Systems, IEEE Transactions on》2006,17(10):1084-1096

Irregular and dynamic memory reference patterns can cause performance variations for low level algorithms in general and for parallel algorithms in particular. In this paper, we present an adaptive algorithm selection framework which can collect and interpret the characteristics of a particular instance of parallel reduction algorithms and select the best performing one from an existing library. The framework consists of the following components: 1) an offline systematic process for characterizing the input sensitivity of parallel reduction algorithms and a method for building corresponding predictive performance models, 2) an online input characterization and algorithm selection module, and 3) a small library of parallel reduction algorithms, which represent the algorithmic choices made available at runtime. We also present one possible integration of this framework in a restructuring compiler. We validate our design experimentally and show that our framework 1) selects the most appropriate algorithms in 85 percent of the cases studied, 2) overall, delivers 98 percent of the optimal performance, 3) adaptively selects the best algorithms for dynamic phases of a running program (resulting in performance improvements otherwise not possible), and 4) adapts to the underlying machine architectures (evaluated on IBM Regatta and HP V-Class systems). 相似文献

5.

Hybrid Analysis: Static & Dynamic Memory Reference Analysis

Rus Silvius Rauchwerger Lawrence Hoeflinger Jay 《International journal of parallel programming》2003,31(4):251-283

We present a novel Hybrid Analysis technology which can efficiently and seamlessly integrate all static and run-time analysis of memory references into a single framework that is capable of performing all data dependence analysis and can generate necessary information for most associated memory related optimizations. We use HA to perform automatic parallelization by extracting run-time assertions from any loop and generating appropriate run-time tests that range from a low cost scalar comparison to a full, reference by reference run-time analysis. Moreover we can order the run-time tests in increasing order of complexity (overhead) and thus risk the minimum necessary overhead. We accomplish this by both extending compile time IP analysis techniques and by incorporating speculative run-time techniques when necessary. Our solution is to bridge free compile time techniques with exhaustive run-time techniques through a continuum of simple to complex solutions. We have implemented our framework in the Polaris compiler by introducing an innovative intermediate representation called RT_LMAD and a run-time library that can operate on it. Based on the experimental results obtained to date we hope to automatically parallelize most and possibly all PERFECT codes, a significant accomplishment. 相似文献