Similar Documents
A total of 20 similar documents were found (search time: 15 ms).
1.
Stochastic bounds are obtained on execution times of parallel programs when the number of processors is unlimited. A parallel program is considered to consist of interdependent tasks with synchronization constraints. These constraints are described by an acyclic directed graph called a task graph. The execution times of tasks are considered to be independent and identically distributed (i.i.d.) random variables. The performance measure of interest is the overall execution time of the considered parallel program (task graph). Stochastic bound methods are applied to obtain lower and upper bounds on this measure. Another upper bound is obtained for parallel programs whose task execution times are "new better than used in expectation" (NBUE) random variables. NBUE random variables are replaced with exponential random variables of the same mean to derive this upper bound.
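With an unlimited number of processors, the completion time of the task graph is simply the length of its longest path once each task's duration has been sampled. The sketch below (an illustrative Monte Carlo estimate on a made-up six-task graph, not the paper's analytical bounding method; all names are hypothetical) estimates that measure and shows the effect of replacing an NBUE distribution (here uniform, which is NBUE) by an exponential distribution with the same mean:

    import random

    # Hypothetical task graph: EDGES[u] lists the tasks that must wait for u.
    EDGES = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
    TOPO_ORDER = [0, 1, 2, 3, 4, 5]        # a valid topological order

    def makespan(sample_time):
        """Completion time with unlimited processors = longest path length."""
        finish = {}
        for v in TOPO_ORDER:
            start = max((finish[u] for u in TOPO_ORDER if v in EDGES[u]),
                        default=0.0)
            finish[v] = start + sample_time()
        return max(finish.values())

    def estimate(sample_time, runs=20000):
        return sum(makespan(sample_time) for _ in range(runs)) / runs

    mean = 1.0
    nbue = lambda: random.uniform(0.0, 2 * mean)    # uniform is NBUE
    expo = lambda: random.expovariate(1.0 / mean)   # exponential, same mean

    print("NBUE (uniform) task times:", round(estimate(nbue), 3))
    print("Exponential task times   :", round(estimate(expo), 3))
    # The exponential estimate should come out larger, illustrating the upper
    # bound obtained by the mean-preserving exponential substitution.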

2.
3.
Parallel loops account for the greatest amount of parallelism in numerical programs. Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems. However, in parallel processing systems with caches or local memories in their memory hierarchies, a "thrashing problem" may arise whenever data move back and forth between the caches or local memories of different processors. Previous techniques can only deal with the rather simple cases involving a single linear function in a perfectly nested loop. In this paper, we present a parallel program optimizing technique called hybrid loop interchange (HLI) for cases with multiple linear functions and loop-carried data dependences in the nested loop. With HLI we can easily eliminate or reduce the thrashing phenomena without reducing the program's parallelism.
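HLI itself is not spelled out in the abstract, so the sketch below is only a generic illustration of why iteration order matters for cache behavior in a nested loop: it sums a row-major NumPy array column by column and then, with the loops interchanged, row by row. The array size is arbitrary, and on most machines the column-wise traversal is noticeably slower because consecutive accesses are far apart in memory:

    import time
    import numpy as np

    N = 4000
    a = np.zeros((N, N))            # NumPy default is row-major (C order)

    def sum_by_columns(a):
        # Inner dimension walked column by column: each a[:, j] is a strided
        # slice, so consecutive elements are N * 8 bytes apart.
        return sum(float(a[:, j].sum()) for j in range(a.shape[1]))

    def sum_by_rows(a):
        # Interchanged loop order: each a[i, :] is contiguous in memory.
        return sum(float(a[i, :].sum()) for i in range(a.shape[0]))

    for f in (sum_by_columns, sum_by_rows):
        t0 = time.perf_counter()
        f(a)
        print(f"{f.__name__}: {time.perf_counter() - t0:.3f} s")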

4.
We study a multiprocessing computer system which accepts parallel programs that have a fork-join computational paradigm. The multiprocessing computer system under study is modeled as K homogeneous servers, each with an infinite-capacity queue. Parallel programs arrive at the multiprocessing system according to a series-parallel phase-type interarrival process with mean arrival rate h. Upon arrival, a program forks into K independent tasks and each task is assigned to a unique server. Each task's service time has a k-stage Erlang distribution with mean service time λ. A parallel program is completed upon the completion of its last task. This kind of queueing model has no known closed-form solution in the general (K⩾2) case. In this paper, we show that by carefully modifying the arrival and service distributions at some imbedded points in time, we can obtain tight performance bounds. We also provide a computationally efficient algorithm for obtaining upper and lower bounds on the expected response time. The methodology is flexible and allows one to trade off the tightness of the bounds against computational cost.
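As a rough illustration of the model (not the paper's bounding algorithm: Poisson arrivals stand in for the phase-type interarrival process, and the rates below are arbitrary), the sketch estimates the mean response time of a fork-join system in which every arriving program places one task in each of K FCFS server queues and completes when its last task finishes:

    import random

    def erlang(k, mean):
        """k-stage Erlang sample with the given mean (sum of k exponentials)."""
        rate = k / mean
        return sum(random.expovariate(rate) for _ in range(k))

    def simulate(K=4, arrival_rate=0.5, service_mean=1.0, stages=3, jobs=100_000):
        server_free = [0.0] * K            # time at which each server drains
        t = 0.0
        total_response = 0.0
        for _ in range(jobs):
            t += random.expovariate(arrival_rate)   # next program arrival
            finish_last = 0.0
            for s in range(K):                      # fork: one task per server
                start = max(t, server_free[s])
                server_free[s] = start + erlang(stages, service_mean)
                finish_last = max(finish_last, server_free[s])
            total_response += finish_last - t       # join: wait for slowest task
        return total_response / jobs

    print("estimated mean response time:", round(simulate(), 3))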

5.
Large scientific parallel applications demand large amounts of memory space. Current parallel computing platforms schedule jobs without fully knowing their memory requirements. This leads to uneven memory allocation in which some nodes are overloaded. This, in turn, leads to disk paging, which is extremely expensive in the context of scientific parallel computing. To solve this problem, we propose a new peer-to-peer solution called parallel network RAM. This approach avoids the use of disk, better utilizes available RAM resources, and will allow larger problems to be solved while reducing the computational, communication, and synchronization overhead typically involved in parallel applications. We proposed several different parallel network RAM designs and evaluated the performance of each under different conditions. We discovered that different designs are appropriate in different situations.

6.
A method known as closed environments can be used to represent variable bindings for OR-parallel logic programs without relying on a shared memory or common address space. The representation is based on a procedure that transforms stack frames after unification, taking into account problems with common unbound ancestors and shared instances of complex terms. Closed environments were developed for the AND/OR Process Model, but may be applicable to other OR-parallel models.

7.
Martonosi, M., Gupta, A., Anderson, T.E. 《Computer》 1995, 28(4): 32-40
To improve program memory performance, programmers and compiler writers can transform the application so that its memory-referencing behavior better exploits the memory hierarchy. The challenge in achieving these program transformations is overcoming the difficulty of statically analyzing or reasoning about an application's referencing behavior and interactions. In addition, many performance-monitoring tools collect high-level information that is inadequately detailed to analyze specific memory performance bugs. We describe MemSpy, a performance-monitoring tool we designed to help programmers discern where and why memory bottlenecks occur. MemSpy guides programmers toward program transformations that improve memory performance through detailed statistics on cache-miss causes and frequency. Because of the natural link between data-reference patterns and memory performance, MemSpy helps programmers comprehend data structure and code segment interactions by displaying statistics in terms of both the program's data and code structures, rather than for code structures alone.

8.
Based on a thorough study of the relationship between array element accesses and loop indices of the nested loop, a method is presented with which the staggering relation and the compacting relation between the threads of the nested loop (either with a single linear function or with multiple linear functions) can be determined at compile time, and accordingly the nested loop (either perfectly nested or imperfectly nested) can be restructured to avoid the thrashing problem. Due to its simplicity, our method can be efficiently implemented in any parallel compiler, and the performance improvement is significant, as shown by the experimental results.

9.
Parallelism has become a way of life for many scientific programmers. A significant challenge in bringing the power of parallel machines to these programmers is providing them with a suite of software tools similar to the tools that sequential programmers currently utilize. Unfortunately, writing correct parallel programs remains a challenging task. In particular, automatic or semi-automatic testing tools for parallel programs are lacking. This paper takes a first step in developing an approach to providing all-uses coverage for parallel programs. A testing framework and theoretical foundations for structural testing are presented, including test data adequacy criteria and hierarchy, formulation and illustration of all-uses testing problems, classification of all-uses test cases for parallel programs, and both theoretical and empirical results with regard to what can be achieved with all-uses coverage for parallel programs. Copyright © 2003 John Wiley & Sons, Ltd.

10.
In this paper, we focus on the compiling implementation of the parallel logic language PARLOG and the functional language ML on distributed-memory multiprocessors. Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model (HPGREM) is presented first. Then, based on HPGREM, a parallel abstract machine PAM/TGR is described. Furthermore, several optimizing compilation schemes for executing declarative programs on a transputer array are proposed. The performance statistics on the transputer array demonstrate the effectiveness of our model, parallel abstract machine, optimizing compilation strategies and compiler.

11.
Span programs provide a linear algebraic model of computation. Lower bounds for span programs imply lower bounds for formula size, symmetric branching programs, and contact schemes. Monotone span programs also correspond to linear secret-sharing schemes. We present a new technique for proving lower bounds for monotone span programs. We prove a lower bound of Ω(m^{2.5}) for the 6-clique function. Our results improve on the previously known bounds for explicit functions.
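As a minimal worked example of the model itself (a toy program over GF(2), unrelated to the 6-clique bound; all names are illustrative): a monotone span program is a matrix whose rows are labelled by variables, together with a fixed target vector, and an input is accepted exactly when the target lies in the span of the rows whose labels are set to 1. The sketch below encodes f(x1, x2, x3) = x1 AND (x2 OR x3) and checks it by brute force:

    from itertools import product

    # Rows of the span program and the variable labelling each row.
    ROWS   = [((1, 0), "x1"), ((0, 1), "x2"), ((0, 1), "x3")]
    TARGET = (1, 1)

    def accepts(assignment):
        """True iff TARGET is in the GF(2) span of the rows whose label is 1."""
        usable = [row for row, label in ROWS if assignment[label]]
        for coeffs in product((0, 1), repeat=len(usable)):
            combo = tuple(sum(c * row[i] for c, row in zip(coeffs, usable)) % 2
                          for i in range(len(TARGET)))
            if combo == TARGET:
                return True
        return False

    for bits in product((0, 1), repeat=3):
        x = dict(zip(("x1", "x2", "x3"), bits))
        print(bits, accepts(x), bool(x["x1"] and (x["x2"] or x["x3"])))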

12.
Parallel execution is normally used to decrease the amount of time required to run a program. However, the parallel execution may require far more space than that required by the sequential execution. Worse yet, the parallel space requirement may be very much more difficult to predict than the sequential space requirement because there are more factors to consider. These include essentially nondeterministic factors that can influence scheduling, which in turn may dramatically influence space requirements. We survey some scheduling algorithms that attempt to place bounds on the amount of time and space used during parallel execution. We also outline a direction for future research. This direction takes us into the area of functional programming, where the declarative nature of the languages can help the programmer to produce correct parallel programs, a feat that can be difficult with procedural languages. Currently the high-level nature of functional languages can make it difficult for the programmer to understand the operational behavior of the program. We look at some of the problems in this area, with the goal of achieving a programming environment that supports correct, efficient parallel programs.

13.
A syntactic read-k-times branching program has the restriction that no variable occurs more than k times on any path (whether or not consistent) of the branching program. We first extend the result in [31] to show that the "n/2 clique only function", which is easily seen to be computable by deterministic polynomial-size read-twice programs, cannot be computed by nondeterministic polynomial-size read-once programs, although its complement can be so computed. We then exhibit an explicit Boolean function f such that every nondeterministic syntactic read-k-times branching program for computing f has size exp(Ω(n/(4^k k^3))).
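For concreteness (a toy example, not connected to the lower bound), a branching program is a DAG in which each internal node queries one variable and branches on its value, and a syntactic read-once program queries each variable at most once on every path. The sketch below evaluates such a program for x1 XOR x2:

    # Each internal node: (variable, 0-successor, 1-successor); sinks are
    # "accept"/"reject".  Every root-to-sink path reads x1 then x2 once.
    XOR_BP = {
        "n1":  ("x1", "n2a", "n2b"),
        "n2a": ("x2", "reject", "accept"),
        "n2b": ("x2", "accept", "reject"),
    }

    def evaluate(bp, root, assignment):
        node = root
        while node not in ("accept", "reject"):
            var, on_zero, on_one = bp[node]
            node = on_one if assignment[var] else on_zero
        return node == "accept"

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, evaluate(XOR_BP, "n1", {"x1": x1, "x2": x2}))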

14.
In this paper we investigate the "mandatory-work-first" approach to parallel alpha-beta search first proposed by Akl, Barnard, and Doran. This approach is based on a version of alpha-beta search without deep cutoffs and a two-stage evaluation process, the second stage of which is often pruned. Our analysis shows that for best-first ordering on the lookahead tree, this approach provides greater speedup than the Palphabeta tree-splitting technique, and that for worst-first ordering, mandatory work first provides only slightly worse speedup than Palphabeta.
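For reference, the sketch below is a plain sequential alpha-beta search over a toy nested-list game tree (it is neither the mandatory-work-first scheme nor Palphabeta); it shows the cutoff mechanism whose effectiveness depends on child ordering, which is why best-first versus worst-first ordering changes the achievable parallel speedup:

    import math

    def alphabeta(node, alpha=-math.inf, beta=math.inf, maximizing=True):
        """Minimax with alpha-beta pruning; leaves are scores, internal nodes are lists."""
        if not isinstance(node, list):               # leaf
            return node
        if maximizing:
            value = -math.inf
            for child in node:
                value = max(value, alphabeta(child, alpha, beta, False))
                alpha = max(alpha, value)
                if alpha >= beta:                     # cutoff: prune the rest
                    break
            return value
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

    # With the strongest subtree searched first, later subtrees are cut off early.
    tree = [[3, 5], [2, 9], [0, 1]]
    print(alphabeta(tree))    # prints 3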

15.
Given a distributed system of n balls and n bins, how evenly can we distribute the balls to the bins, minimizing communication? The fastest non-adaptive and symmetric algorithm achieving a constant maximum bin load requires Θ(log log n) rounds, and any such algorithm running for r ∈ O(1) rounds incurs a bin load of Ω((log n/log log n)^{1/r}). In this work, we explore the fundamental limits of the general problem. We present a simple adaptive symmetric algorithm that achieves a bin load of 2 in log* n + O(1) communication rounds using O(n) messages in total. Our main result, however, is a matching lower bound of (1 - o(1)) log* n on the time complexity of symmetric algorithms that guarantee small bin loads. The essential preconditions of the proof are (i) a limit of O(n) on the total number of messages sent by the algorithm and (ii) anonymity of bins, i.e., the port numberings of balls need not be globally consistent. In order to show that our technique yields indeed tight bounds, we provide for each assumption an algorithm violating it, in turn achieving a constant maximum bin load in constant time.
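For intuition about the baseline such algorithms improve on (the trivial one-round, non-adaptive strategy, not the adaptive log* n-round algorithm of the paper), the sketch below lets each of n balls pick a bin uniformly at random with no communication and reports the maximum load, which is known to be Θ(log n / log log n) with high probability:

    import math
    import random
    from collections import Counter

    def max_load_one_round(n):
        """Each of n balls picks one of n bins uniformly at random."""
        loads = Counter(random.randrange(n) for _ in range(n))
        return max(loads.values())

    n = 1_000_000
    print("observed max load:", max_load_one_round(n))
    print("log n / log log n ≈", round(math.log(n) / math.log(math.log(n)), 1))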

16.
17.
Curare, the program restructurer described in this paper, automatically transforms a sequential Lisp program into an equivalent concurrent program that runs on a multiprocessor. Data dependences constrain the program's concurrent execution because, in general, two conflicting statements cannot execute in a different order without affecting the program's result. Not all dependences are essential to produce the program's result. Curare attempts to transform the program so it computes its result with fewer conflicts. An optimized program will execute with less synchronization and more concurrency. Curare then examines loops in a program to find those that are unconstrained or lightly constrained by dependences. By necessity, Curare treats recursive functions as loops and does not limit itself to explicit program loops. Recursive functions offer several advantages over explicit loops since they provide a convenient framework for inserting locks and handling the dynamic behavior of symbolic programs. Loops that are suitable for concurrent execution are changed to execute on a set of concurrent server processes. These servers execute single loop iterations and therefore need to be extremely inexpensive to invoke. Restructured programs execute significantly faster than the original sequential programs. This improvement is large enough to attract programmers to a multiprocessor, particularly since it requires little effort on their part. This research was funded by DARPA contract numbers N00039-85-C-0269 (SPUR) and N00039-84-C-0089 (XCS) and by an NSF Presidential Young Investigator award to Paul N. Hilfinger. Additional funding came from the California MICRO program (in conjunction with Texas Instruments, Xerox, Honeywell, and Phillips/Signetics).

18.
The main issues when supporting fault tolerance based on checkpointing and rollback recovery for High-Performance applications are related to the scalability of the introduced support, the possibility of analyzing the induced overhead and, in more general terms, the optimization of the trade-off between failure-free and recovery performances. In this paper we describe our contribution in fault tolerance for high-level structured parallelism models. We take a different viewpoint w.r.t. existing contributions, by introducing a methodology to derive interesting properties to support fault tolerance. We show how to apply this methodology to a general data parallel model, deriving useful properties to introduce a class of checkpointing protocols. Thanks to this methodology, this class of protocols is not affected by the described issues. We exemplify two checkpointing protocols and the related rollback recovery techniques. For each protocol we also derive cost models statically describing the failure-free performance, which can be used for performance tuning or to target some Quality of Service parameter. To assess the innovation of the results we analytically and experimentally compare the introduced protocols with two literature protocols. Results show that while the protocols introduced in this paper permit the definition of cost models and have a good scalability, the literature protocols do not always have these properties. Copyright © 2010 John Wiley & Sons, Ltd.

19.
To speed up production systems, researchers have studied how to execute many rules simultaneously. Unfortunately, such systems can yield results that are impossible for a serial system to produce, leading to erroneous behaviors. We present a formal solution to the problem of guaranteeing serializable behavior in synchronous parallel production systems that execute many rules simultaneously. We also present a variety of algorithms that implement this solution. Our framework builds upon the work of Ishida and Stolfo [14] and improves upon their solution. The practical advantages of these strategies are demonstrated with measurements from an initial implementation.

20.
Parallel simulation of parallel programs for large datasets has been shown to offer significant reduction in the execution time of many discrete event models. The paper describes the design and implementation of MPI-SIM, a library for the execution-driven parallel simulation of task- and data-parallel programs. MPI-SIM can be used to predict the performance of existing programs written using MPI for message passing, or written in UC, a data parallel language, compiled to use message passing. The simulation models can be executed sequentially or in parallel. Parallel execution of the models is synchronized using a set of asynchronous conservative protocols. The paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis. The analysis targets the communication patterns of the application. We show the application-level analysis for message passing and data parallel languages. We present the validation and performance results for the simulator for a set of applications that include the NAS Parallel Benchmark suite. The application-level optimization described in the paper yielded significant performance improvements in the simulation of parallel programs, and in some cases completely eliminated the synchronizations in the parallel execution of the simulation model.
