共查询到20条相似文献,搜索用时 31 毫秒
1.
S.D. Kaushik C.-H. Huang P. Sadayappan 《Journal of Parallel and Distributed Computing》1996,38(2):237
In languages such as High Performance Fortran (HPF), array statements are used to express data parallelism. In compiling array statements for distributed-memory machines, efficient enumeration of local index sets and commmunication sets is important. A method based on a virtual processor approach has been proposed for efficient index set enumeration for array statements involving arrays distributed using block-cyclic distributions. The virtual processor approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to the physical processors. The key idea of the method is to first develop closed forms in terms of simple regular sections for the index sets for arrays distributed using block or cyclic distributions. These closed forms are then used with the virtual processor approach to give an efficient solution for arrays with the block-cyclic distribution. HPF supports a two-level mapping of arrays to processors. Arrays are first aligned with a template at an offset and a stride and the template is then distributed among the processors using a regular data distribution. The introduction of a nonunit stride in the alignment creates “holes” in the distributed arrays which leads to memory wastage. In this paper, using simple mathematical properties of regular sections, we extend the virtual processor approach to address the memory allocation and index set enumeration problems for array statements involving arrays mapped using the two-level mapping. We develop a methodology for translating the closed forms for block and cyclically distributed arrays mapped using a one-level mapping to closed forms for arrays mapped using the two-level mapping. Using these closed forms, the virtual processor approach is extended to handle array statements involving arrays mapped using two-level mappings. Performance results on the Cray T3D are presented to demonstrate the efficacy of the extensions and identify various trade-offs associated with the proposed method. 相似文献
2.
《Journal of Parallel and Distributed Computing》1994,21(2):223-236
The problem of scheduling tasks on distributed memory machines is known to be NP-complete in the strong sense, ruling out the possibility of a pseudo-polynomial algorithm. This paper introduces a new heuristic algorithm for scheduling Sisal (Streams and Iterations In a Single Assignment Language) programs on a distributed memory machine, Intel Touchstone i860. Our compile time scheduling method works on IF-2, an intermediate form based on the dataflow parallelism in the program. We initially carry out a dependence analysis, to bind the implicit dependencies across IF-2 graph boundaries, followed by a cost assignment based on Intel Touchstone i860 timings. The scheduler works in two phases. The first phase of the scheduler finds the earliest and latest completion times of each task given by the shortest and longest paths from root task to the given task respectively. A threshold defined as the difference between the latest and the earliest start times of the task, is found. The scheduler varies the value of the allowable threshold, and determines the best value for minimal schedule length. In the second phase of the scheduler, we merge the processors to generate schedule to match the available number of processors. Schedule results for several benchmark programs have been included to demonstrate the effectiveness of our approach. 相似文献
3.
We give an in-depth introduction to the design of our functional array programming language SaC, the main aspects of its compilation into host machine code, and its parallelisation based on multi-threading. The language design of SaC aims at combining high-level, compositional array programming with fully automatic resource management for highly productive code development and maintenance. We outline the compilation process that maps SaC programs to computing machinery. Here, our focus is on optimisation techniques that aim at restructuring entire applications from nested compositions of general fine-grained operations into specialised coarse-grained operations. We present our implicit parallelisation technology for shared memory architectures based on multi-threading and discuss further optimisation opportunities on this level of code generation. Both optimisation and parallelisation rigorously exploit the absence of side-effects and the explicit data flow characteristic of a functional setting. 相似文献
4.
An Interleaving Transformation for Parallelizing Reductions for Distributed-Memory Parallel Machines
Reduction operations frequently appear in algorithms. Due to their mathematical invariance properties (assuming that round-off errorscan be tolerated), it is reasonable to ignore ordering constraints on the computation of reductions in order to take advantage of the computing power of parallel machines.One obvious and widely-used compilation approach for reductions is syntactic pattern recognition. Either the source language includes explicit reduction operators, or certain specific loops are recognized as equivalent to known reductions. Once such patterns are recognized, hand optimized code for the reductions are incorporated in the target program. The advantage of this approach is simplicity. However, it imposes restrictions on the reduction loops—no data dependence other than that caused by the reduction operation itself is allowed in the reduction loops.In this paper, we present a parallelizing technique, interleaving transformation, for distributed-memory parallel machines. This optimization exploits parallelism embodied in reduction loops through combination of data dependence analysis and region analysis. Data dependence analysis identifies the loop structures and the conditions that can trigger this optimization. Region analysis divides the iteration domain into a sequential region and an order-insensitive region. Parallelism is achieved by distributing the iterations in the order-insensitive region among multiple processors. We use a triangular solver as an example to illustrate the optimization. Experimental results on various distributed-memory parallel machines, including the Connection Machines CM-5, the nCUBE, the IBM SP-2, and a network of Sun Workstations are reported. 相似文献
5.
6.
《Journal of Parallel and Distributed Computing》2000,60(9):1074-1102
We describe a dynamic load-balancing algorithm for ray-tracing by progressive refinement on a distributed-memory parallel computer. Parallelization of progressive ray-tracing for single images is difficult because of the inherent sequential nature of the sample location generation process, which is optimized (and different) for any given image. Parallelization of progressive ray-tracing when generating image sequences at a fixed interactive rate is even more difficult, because of the time and synchronization constraints imposed on the system. The fixed frame rate requirement complicates matters and even renders meaningless traditional measures of parallel system performance (e.g., speedup). We show how to overcome these problems, which, to the best of our knowledge, have not been treated before. Exploiting the temporal coherence between frames enables us to both accelerate rendering and improve the load-balance throughout the sequence. Our dynamic load-balance algorithm combines local and global methods to account not only for rendering performance, but also for communication overhead and synchronization issues. The algorithm is shown to be robust to the harsh environment imposed by a time-critical application, such as the one we consider. 相似文献
7.
This paper describes an implementation of P
L
for a massively parallel SIMD machine, the M
P
MP-1. The system is based on a byte code interpreter which can emulate as many virtual processors on each physical processor as desired (within the limits of memory). The implementation makes it possible to activate more virtual processors once execution has begun and this feature can be used to support nested parallelism. Nested parallelism describes the ability to nest data parallel constructs, a feature of P
L
, C
M
L
, and N
; however, the outer parallel forms usually have to be sequentialized, with only the innermost forms being executed in parallel. N
and a subset of P
L
have been implemented to fully support nested parallelism by flattening nested structures at compile time. To do this the languages must impose various restrictions on both the data and control structures. There is an overhead associated with the runtime technique described here, but it is very versatile and can execute code in parallel that cannot be “flattened.” Hence this technique can be used to effectively support many of the moredifficultaspects of P
L
. 相似文献
8.
9.
In many scientific applications, array redistribution is usually required to enhance data locality and reduce remote memory access on distributed memory multicomputers. Since the redistribution is performed at run-time, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient methods for multi-dimensional array redistribution. Based on the previous work, the basic-cycle calculation technique, we present a basic-block calculation (BBC) and a complete-dimension calculation (CDC) techniques. We also developed a theoretical model to analyze the computation costs of these two techniques. The theoretical model shows that the BBC method has smaller indexing costs and performs well for the redistribution with small array size. The CDC method has smaller packing/unpacking costs and performs well when array size is large. When implemented these two techniques on an IBM SP2 parallel machine along with the PITFALLS method and the Prylli's method, the experimental results show that the BBC method has the smallest execution time of these four algorithms when the array size is small. The CDC method has the smallest execution time of these four algorithms when the array size is large. 相似文献
10.
11.
Optimizations for Efficient Array Redistribution on Distributed Memory Multicomputers 总被引:2,自引:0,他引:2
Shankar Ramaswamy Barbara Simons Prithviraj Banerjee 《Journal of Parallel and Distributed Computing》1996,38(2):217
Appropriate data distribution has been found to be critical for obtaining good performance on distributed memory multicomputers such as the Thinking Machines CM-5, Intel Paragon, and IBM SP-1/SP-2. It has also been found that some programs need to change their distributions during execution for better performance (redistribution). This work focuses on automatically generating efficient routines for redistribution. We present a new mathematical representation for regular distributions called FALLS and then discuss algorithms for redistribution based on this representation. One of the significant contributions of this work is being able to handle arbitrary source and target processor sets while performing redistribution. Another important contribution is the ability to handle an arbitrary number of dimensions for the array involved in the redistribution in a scalable manner. Our implementation of these techniques is based on the MPI communication library. The results presented show the efficiency and scalability of our redistribution algorithm. 相似文献
12.
《IEEE transactions on pattern analysis and machine intelligence》2008,34(5):597-613
We present Delta Execution, a technique that speeds up state-space exploration of object-oriented programs. State-space exploration is the essence of model checking and an increasingly popular approach for automating test generation. A key issue in exploration of object-oriented programs is handling the program state, in particular the heap. We exploit the fact that many execution paths in state-space exploration partially overlap. Delta Execution simultaneously operates on several states/heaps and shares the common parts across the executions, separately executing only the "deltas" where the executions differ. We implemented Delta Execution in two model checkers: JPF, a popular general-purpose model checker for Java programs, and BOX, a specialized model checker that we developed for efficient exploration of sequential Java programs. The results for bounded-exhaustive exploration of ten basic subject programs and one larger case study show that Delta Execution reduces exploration time from 1.06x to 126.80x (with median 5.60x) in JPF and from 0.58x to 4.16x (with median 2.23x) in BOX. The results for a non-exhaustive exploration in JPF show that Delta Execution reduces exploration time from 0.92x to 6.28x (with median 4.52x). 相似文献
13.
Daniel J. Palermo Eugene W. Hodges IV Prithviraj Banerjee 《Journal of Parallel and Distributed Computing》1996,38(2):158
For distributed-memory multicomputers such as the Intel Paragon, the IBM SP-1/SP-2, the NCUBE/2, and the Thinking Machines CM-5, the quality of the data partitioning for a given application is crucial to obtaining high performace. This task has traditionally been the user's responsibility, but in recent years much effort has been directed to automating the selection of data partitioning schemes. Several researchers have proposed systems that are able to produce data distributons that remain in effect for the entire execution of an application. For complex programs, however, such static data distributions may be insufficient to obtain acceptable performance. The selection of distributions that dynamically change over the course of a program's execution adds another dimension to the data partitioning problem. In this paper, we present a technique that can be used to automatically determine which partitionings are most beneficial over specific sections of a program while taking into account the added overhead of performing redistribution. This system has been implemented as part of the PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose Multicomputers) project at the University of Illinois. The complete system strives to provide a fully automated means to parallelize programs written in a serial programming model obtaining high performance on a wide range of distributed-memory multicomputers. 相似文献
14.
《Journal of Parallel and Distributed Computing》2000,60(2):189-216
This paper presents compilation techniques used to compress holes, which are caused by the nonunit alignment stride in a two-level data-processor mapping. Holes are the memory locations mapped by useless template cells. To fully utilize the memory space, memory holes should be removed. In a two-level data-processor mapping, there is a repetitive pattern for array elements mapped onto processors. We classify blocks into classes and use a class table to record the distribution of each class in the first repetitive data distribution pattern. Similarly, data distribution on a processor also has a repetitive pattern. We use a compression table to record the distribution of each block in the first repetitive data distribution pattern on a processor. By using a class table and a compression table, hole compression can be easily and efficiently achieved. Compressing holes can save memory usage, improve spatial locality and further improve system performance. The proposed method is efficient, stable, and easy to implement. The experimental results do confirm the advantages of our proposed method over existing methods. 相似文献
15.
Teodoro G Tavares T Ferreira R Kurc T Meira W Guedes D Pan T Saltz J 《International journal of parallel programming》2008,36(2):250-266
Scientific workflow systems have been introduced in response to the demand of researchers from several domains of science
who need to process and analyze increasingly larger datasets. The design of these systems is largely based on the observation
that data analysis applications can be composed as pipelines or networks of computations on data. In this work, we present
a run-time support system that is designed to facilitate this type of computation in distributed computing environments. Our
system is optimized for data-intensive workflows, in which efficient management and retrieval of data, coordination of data
processing and data movement, and check-pointing of intermediate results are critical and challenging issues. Experimental
evaluation of our system shows that linear speedups can be achieved for sophisticated applications, which are implemented
as a network of multiple data processing components. 相似文献
16.
Array partitioning is an important research problem in array management area, since the partitioning strategies have important influence on storage, query evaluation, and other components in array management systems. Meanwhile, compression is highly needed for the array data due to its growing volume. Observing that the array partitioning can affect the compression performance significantly, this paper aims to design the efficient partitioning method for array data to optimize the compression performance. As far as we know, there still lacks research efforts on this problem. In this paper, the problem of array partitioning for optimizing the compression performance (PPCP for short) is firstly proposed. We adopt a popular compression technique which allows to process queries on the compressed data without decompression. Secondly, because the above problem is NP-hard, two essential principles for exploring the partitioning solution are introduced, which can explain the core idea of the partitioning algorithms proposed by us. The first principle shows that the compression performance can be improved if an array can be partitioned into two parts with different sparsities. The second principle introduces a greedy strategy which can well support the selection of the partitioning positions heuristically. Supported by the two principles, two greedy strategy based array partitioning algorithms are designed for the independent case and the dependent case respectively. Observing the expensive cost of the algorithm for the dependent case, a further optimization based on random sampling and dimension grouping is proposed to achieve linear time cost. Finally, the experiments are conducted on both synthetic and real-life data, and the results show that the two proposed partitioning algorithms achieve better performance on both compression and query evaluation. 相似文献
17.
This paper proposes a complementary novel idea, called MiniTasking to further reduce the number of cache misses by improving the data temporal locality for multiple concurrent queries. Our idea is based on the observation that, in many workloads such as decision support systems (DSS), there is usually significant amount of data sharing among different concurrent queries. MiniTasking exploits such data sharing to improve data temporal locality by scheduling query execution at three levels: query level batching, operator level grouping and mini-task level scheduling. The experimental results with various types of concurrent TPC-H query workloads show that, with the traditional N-ary Storage Model (NSM) layout, MiniTasking significantly reduces the L2 cache misses by up to 83%, and thereby achieves 24% reduction in execution time. With the Partition Attributes Across (PAX) layout, MiniTasking further reduces the cache misses by 65% and the execution time by 9%. For the TPC-H throughput test workload, MiniTasking improves the end performance up to 20%. 相似文献
18.
《Journal of Parallel and Distributed Computing》1995,24(1):11-26
We discuss implementations of block algorithms for recursive least squares (RLS for short) problems on ring distributed-memory multiprocessors. We consider the sliding rectangular window case which involves triangularization followed by updating and downdating of the data matrix. We compare several schemes for computing the current least-squares solution, including a direct back-substitution scheme and a scheme where the previous solution vector is updated to the current solution vector by adding the so-called Kalman gain vector. The techniques are implemented on a linear array of transputers and on the Intel iPSC/2 hypercube, and evaluated with respect to their execution time and numerical accuracy. 相似文献
19.
20.
《Journal of Parallel and Distributed Computing》2002,62(4):517-543
The problem of designing efficient parallel algorithms for summing and prefix summing for certain classes of the LogP model is studied. We present optimal algorithms for summing and show that any optimal summing algorithm must have a certain inherent structure. Moreover, we present optimal or near-optimal algorithms for prefix summing for both non-commutative and commutative binary operators. Furthermore, we show that the optimal algorithms for prefix summing for these two types of operators are not equivalent. 相似文献