首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 250 毫秒
1.
Iteration space tiling is a common strategy used by parallelizing compilers and in performance tuning of parallel codes. We address the problem of determining the tile size that minimizes the total execution time. We restrict our attention to uniform dependency computations with two-dimensional, parallelogram-shaped iteration domain which can be tiled with lines parallel to the domain boundaries. The target architecture is a linear array (or a ring). Our model is developed in two steps. We first abstract each tile by two simple parameters, namely tile periodPtand intertile latencyLt. We formulate and partially resolve the corresponding optimization problem independent of the machine and program. Next, we refine the model with realistic machine and program parameters, yielding a discrete nonlinear optimization problem. We solve this analytically, yielding a closed form solution, which can be used by a compiler before code generation.  相似文献   

2.
This paper proposes a new method for the problem of minimizing the execution time of nested for-loops using a tiling transformation. In our approach, we are interested not only in tile size and shape according to the required communication to computation ratio, but also in overall completion time. We select a time hyperplane to execute different tiles much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive, atomic tile executions. We assign tiles to processors according to the tile space boundaries, thus considering the iteration space bounds. Our schedule considerably reduces overall completion time under the assumption that some part from every communication phase can be efficiently overlapped with atomic, pure tile computations. The overall schedule resembles a pipelined datapath where computations are not anymore interleaved with sends and receives to nonlocal processors. We survey the application of our schedule to modern communication architectures. We performed two sets of experimental results, one using MPI primitives over FastEthernet and one using the SISCI API over an SCI network. In both cases, the total completion time is significantly reduced.  相似文献   

3.
It has been difficult to develop simple formulations to predict the execution time of parallel programs due to the complexity of characterizing parallel hardware and software. In an attempt to clarify these characterizations, we introduce a methodology for applying a simple performance model based on Amdahl′s law. Our formulation results in accurate predictions of execution time on available systems, allowing programmers to select the optimal number of processors to apply to a particular problem or to select an appropriate problem size for the number of processors available. In short, we accurately quantify the scalability of a specific algorithm when it is run on a specific parallel computer. Our predictions are based on simple experiments that characterize machine performance and on a simple analysis of the parallel program. We illustrate our method for a program executed on a Sequent Symmetry multiprocessor with 20 processors. Our predictions closely match experimental results, differing by no more than 5% from the actual execution times. Our results illustrate key performance limitations of parallel systems, showing the impact of overhead and the scaling of problem size.  相似文献   

4.
In many applications of parallel computing, distribution of the data unambiguously implies distribution of work among processors. But, there are exceptions where some tasks can be assigned to one of several processors without altering the total volume of communication. In this paper, we study the problem of exploiting this flexibility in assignment of tasks to improve load balance. We first model the problem in terms of network flow and use combinatorial techniques for its solution. Our parametric search algorithms use maximum flow algorithms for probing on a candidate optimal solution value. We describe two algorithms to solve the assignment problem with log W/sub T/ and |P| probe calls, where W/sub T/ and |P|, respectively, denote the total workload and number of processors. We also define augmenting paths and cuts for this problem, and show that any algorithm based on augmenting paths can be used to find an optimal solution for the task assignment problem. We then consider a continuous version of the problem and formulate it as a linearly constrained optimization problem, i.e., min /spl par/Ax/spl par//spl infin/, s.t. Bx=d. To avoid solving an intractable /spl infin/-norm optimization problem, we show that, in this case, minimizing the 2-norm is sufficient to minimize the /spl infin/-norm, which reduces the problem to the well-studied linearly constrained least squares problem. The continuous version of the problem has the advantage of being easily amenable to parallelization. Our experiments with molecular dynamics and overlapped domain decomposition applications proved the effectiveness of our methods with significant improvements in load balance. We also discuss how our techniques can be extended to heterogeneous parallel computers.  相似文献   

5.
Presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor  相似文献   

6.
The problem of merging k (k⩾2) sorted lists is considered. We give an optimal parallel algorithm which takes O((n log k/p)+log n) time using p processors on a parallel random access machine that allows concurrent reads and exclusive writes, where n is the total size of the input lists. This algorithm achieves O(log n) time using p=n log k/log n processors. Most of the previous log n research for this problem has been focused on the case when k=2. Very recently, parallel solutions for the case when k=2 have been reported. Our solution is the first logarithmic time optimal parallel algorithm for the problem when k⩾2. It can also be seen as a unified optimal parallel algorithm for sorting and merging. In order to support the algorithm, a new processor assignment strategy is also presented  相似文献   

7.
Consider a set P of points in the plane sorted by the x-coordinate. A point p in P is said to be a proximate point if there exists a point q on the x-axis such that p is the closest point to q over all points in P. The proximate point problem is to determine all the proximate points in P. Our main contribution is to propose optimal parallel algorithms for solving instances of size n of the proximate points problem. We begin by developing a work-time optimal algorithm running in O(log log n) time and using n/loglogn Common-CRCW processors. We then go on to show that this algorithm can be implemented to run in O(log n) time using n/logn EREW processors. In addition to being work-time optimal, our EREW algorithm turns out to also be time-optimal. Our second main contribution is to show that the proximate points problem finds interesting, and quite unexpected, applications to digital geometry and image processing. As a first application, we present a work-time optimal parallel algorithm for finding the convex hull of a set of n points in the plane sorted by x-coordinate; this algorithm runs in O(log log n) time using n/logn Common-CRCW processors. We then show that this algorithm can be implemented to run in O(log n) time using n/logn EREW processors. Next, we show that the proximate points algorithms afford us work-time optimal (resp, time-optimal) parallel algorithms for various fundamental digital geometry and image processing problems  相似文献   

8.
Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multilevel hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: they only handle nested loops of depth two, orthogonal tiling, or rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which can be used by a compiler to automatically determine the tiling parameters that minimizes the execution time. We also explain the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool in predicting the execution time of a tiled loop. It allows us to reason about how the tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of the convex iteration spaces, and show how to choose the optimal tile shape.  相似文献   

9.

Centrality measures or indicators of centrality identify most relevant nodes of graphs. Although optimized algorithms exist for computing of most of them, they are still time consuming and are even infeasible to apply to big enough graphs like the ones representing social networks or extensive enough computer networks. In this paper, we present a parallel implementation in C language of some optimal algorithms for computing of some indicators of centrality. Our parallel version greatly reduces the execution time of their sequential (non-parallel) counterpart. The proposed solution relies on threading, allowing for a theoretical improvement in performance close to the number of logical processors (cores) of the single computer in which it is running. Our software has been tested in several platforms, including the supercomputer Calendula, in which we achieved execution times close to 18 times faster when running our parallel implementation instead of our sequential one. Our solution is multi-platform and portable, working on any machine with several logical processor which is capable of compiling and running C language code.

  相似文献   

10.
The communication and synchronization overhead inherent in parallel processing can lead to situations where adding processors to the solution method actually increases execution time. Problem type, problem size, and architecture type all affect the optimal number of processors to employ. In this paper we examine the numerical solution of an elliptic partial differential equation in order to study the relationship between problem size and architecture. The equation's domain is discretized into n2 grid points which are divided into partitions and mapped onto the individual processor memories. We analytically quantify the relationships among grid size, stencil type, partitioning strategy processor execution time, and communication network type. In doing so, we determine the optimal number of processors to assign to the solution (and hence the optimal speedup), and identify (i) the smallest grid size which fully benefits from using all available processors, (ii) the leverage on performance given by increasing processor speed or communication network speed, and (iii) the suitability of various architectures for large numerical problems. Finally, we compare the predictions of our analytic model with measurements from a multiprocessor and find that the model accurately predicts performance.  相似文献   

11.
This paper deals with the scheduling of operations on a multiprocessor machine in the context of shoe manufacturing. Multiprocessor machines are composed of several parallel processors. Unlike parallel machines, the entire machine needs to be stopped whenever a single processor needs a setup. Our industrial partner, one of the major winter-shoe manufacturers in Canada, offers a relatively large variety of models. For each of these models, all the sizes proposed by this manufacturer must be produced in various quantities. Our objective is to schedule the production of the required sizes on the machine's different processors in order to minimize the global makespan, which includes both the production time and the set up time. We first present a mathematical formulation of the problem. Then, we introduce a decomposition procedure based on the mathematical model, and we demonstrate that this procedure is very efficient for small- and medium-size instances. We also propose two construction heuristics and two improvement heuristics for larger problems, and we report the results of our extensive computational experiments to demonstrate the quality of the proposed heuristics in terms of reducing production time and computational effort.  相似文献   

12.
This paper introduces a novel surface-modeling method to stochastically distribute features on arbitrary topological surfaces. The generated distribution of features follows the Poisson disk distribution, so we can have a minimum separation guarantee between features and avoid feature overlap. With the proposed method, we not only can interactively adjust and edit features with the help of the proposed Poisson disk map, but can also efficiently re-distribute features on object surfaces. The underlying mechanism is our dual tiling scheme, known as the Dual Poisson-Disk Tiling. First, we compute the dual of a given surface parameterization, and tile the dual surface by our specially-designed dual tiles; during the pre-processing, the Poisson disk distribution has been pre-generated on these tiles. By dual tiling, we can nicely avoid the problem of corner heterogeneity when tiling arbitrary parameterized surfaces, and can also reduce the tile set complexity. Furthermore, the dual tiling scheme is non-periodic, and we can also maintain a manageable tile set. To demonstrate the applicability of this technique, we explore a number of surface-modeling applications: pattern and shape distribution, bump-mapping, illustrative rendering, mold simulation, the modeling of separable features in texture and BTF, and the distribution of geometric textures in shell space.  相似文献   

13.
Current connectionist simulations require huge computational resources. We describe a neural network simulator for the IBM GF11, an experimental SIMD machine with 566 processors and a peak arithmetic performance of 11 Gigaflops. We present our parallel implementation of the backpropagation learning algorithm, techniques for increasing efficiency performance measurements on the NETTALK text-to-speech benchmark, and a performance model for the simulator. Our simulator currently runs the backpropagation learning algorithm at 900 million connections per second, where each “connection per second” includes both a forward and a backward pass. This figure was obtained on the machine when only 356 processors were working; with all 566 processors operational, our simulator will run at over one billion connections per second. We conclude that the GF11 is well-suited to neural network simulation, and we analyze our use of the machine to determine which features are the most important for high performance.  相似文献   

14.
This paper investigates the problem of allocating parallel application tasks to processors in heterogeneous distributed computing systems with the goal of maximizing the system reliability. The problem of finding an optimal task allocation for more than three processors is known to be NP-hard in the strong sense. To deal with this challenging problem, we propose a simple and effective iterative greedy algorithm to find the best possible solution within a reasonable amount of computation time. The algorithm first uses a constructive heuristic to obtain an initial assignment and iteratively improves it in a greedy way. We study the performance of the proposed algorithm over a wide range of parameters including problem size, the ratio of average communication time to average computation time, and task interaction density. The viability and effectiveness of our algorithm is demonstrated by comparing it with recently proposed task allocation algorithms for maximizing system reliability available in the literature.  相似文献   

15.
Computing an optimal solution to the knapsack problem is known to be NP-hard. Consequently, fast parallel algorithms for finding such a solution without using an exponential number of processors appear unlikely. An attractive alternative is to compute an approximate solution to this problem rapidly using a polynomial number of processors. In this paper, we present an efficient parallel algorithm for finding approximate solutions to the 0–1 knapsack problem. Our algorithm takes an , 0 < < 1, as a parameter and computes a solution such that the ratio of its deviation from the optimal solution is at most a fraction of the optimal solution. For a problem instance having n items, this computation uses O(n5/2/3/2) processors and requires O(log3n + log2nlog(1/)) time. The upper bound on the processor requirement of our algorithm is established by reducing it to a problem on weighted bipartite graphs. This processor complexity is a significant improvement over that of other known parallel algorithms for this problem.  相似文献   

16.
The problem of distributing and processing a divisible load in a heterogeneous linear network of processors with arbitrary processors release times is considered. A divisible load is very large in size and has computationally intensive CPU requirements. Further, it has the property that the load can be partitioned arbitrarily into any number of portions and can be scheduled onto processors independently for computation. The load is assumed to arrive at one of the farthest end processors, referred to as boundary processors, for processing. The processors in the network are assumed to have nonzero release times, i.e., the time instants from which the processors are available for processing the divisible load. Our objective is to design a load distribution strategy by taking into account the release times of the processors in such a way that the entire processing time of the load is a minimum. We consider two generic cases in which all processors have identical release times and when all processors have arbitrary release times. We adopt both the single and multiinstallment strategies proposed in the divisible load scheduling literature in our design of load distribution strategies, wherever necessary, to achieve a minimum processing time. Finally, when optimal strategies cannot be realized, we propose two heuristic strategies, one for the identical case, and the other for nonidentical release times case, respectively. Several conditions are derived to determine whether or not optimal load distribution exists and illustrative examples are provided for the ease of understanding.  相似文献   

17.
Effective design of parallel matrix multiplication algorithms relies on the consideration of many interdependent issues based on the underlying parallel machine or network upon which such algorithms will be implemented, as well as, the type of methodology utilized by an algorithm. In this paper, we determine the parallel complexity of multiplying two (not necessarily square) matrices on parallel distributed-memory machines and/or networks. In other words, we provided an achievable parallel run-time that can not be beaten by any algorithm (known or unknown) for solving this problem. In addition, any algorithm that claims to be optimal must attain this run-time. In order to obtain results that are general and useful throughout a span of machines, we base our results on the well-known LogP model. Furthermore, three important criteria must be considered in order to determine the running time of a parallel algorithm; namely, (i) local computational tasks, (ii) the initial data layout, and (iii) the communication schedule. We provide optimality results by first proving general lower bounds on parallel run-time. These lower bounds lead to significant insights on (i)–(iii) above. In particular, we present what types of data layouts and communication schedules are needed in order to obtain optimal run-times. We prove that no one data layout can achieve optimal running times for all cases. Instead, optimal layouts depend on the dimensions of each matrix, and on the number of processors. Lastly, optimal algorithms are provided.  相似文献   

18.
Improving scheduling of tasks in a heterogeneous environment   总被引:1,自引:0,他引:1  
Optimal scheduling of parallel tasks with some precedence relationship, onto a parallel machine is known to be NP-complete. The complexity of the problem increases when task scheduling is to be done in a heterogeneous environment, where the processors in the network may not be identical and take different amounts of time to execute the same task. We introduce a task duplication-based scheduling algorithm for network of heterogeneous systems (TANH), with complexity O(V/sup 2/), which provides optimal results for applications represented by directed acyclic graphs (DAGs), provided a simple set of conditions on task computation and network communication time could be satisfied. The performance of the algorithm is illustrated by comparing the scheduling time with an existing "best imaginary level scheduling (BIL)" scheme for heterogeneous systems. The scalability for a higher or lower number of processors, as per their availability is also discussed. We have shown to provide substantial improvement over existing work on the task duplication-based scheduling algorithm (TDS).  相似文献   

19.
We present an optimal parallel algorithm for computing a cycle separator of ann-vertex embedded planar undirected graph inO(logn) time onn/logn processors. As a consequence, we also obtain an improved parallel algorithm for constructing a depth-first search tree rooted at any given vertex in a connected planar undirected graph in O(log2 n) time on n/logn processors. The best previous algorithms for computing depth-first search trees and cycle separators achieved the same time complexities, but withn processors. Our algorithms run on a parallel random access machine that permits concurrent reads and concurrent writes in its shared memory and allows an arbitrary processor to succeed in case of a write conflict.A preliminary version of this paper appeared as Improved Parallel Depth-First Search in Undirected Planar Graphs in theProceedings of the Third Workshop on Algorithms and Data Structures, 1993, pp. 407–420.Supported in part by NSF Grant CCR-9101385.  相似文献   

20.
This paper presents a network parallel genetic algorithm for the one machine sequencing problem. It examines a parallel genetic algorithm in which processors exchange their best solution found at periodic intervals and the case when no exchange is performed. The network parallel genetic algorithm is executed on a cluster of IBM RS/6000 workstations using a master-slave approach. Performance to a serial genetic algorithm is reported.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号