Similar Documents
A total of 20 similar documents were found (search time: 15 ms).
1.
Stochastic bounds are obtained on execution times of parallel programs when the number of processors is unlimited. A parallel program is considered to consist of interdependent tasks with synchronization constraints. These constraints are described by an acyclic directed graph called a task graph. The execution times of tasks are considered to be independent and identically distributed (i.i.d.) random variables. The performance measure of interest is the overall execution time of the considered parallel program (task graph). Stochastic bound methods are applied to obtain lower and upper bounds on this measure. Another upper bound is obtained for parallel programs whose task execution times are "new better than used in expectation" (NBUE) random variables. NBUE random variables are replaced with exponential random variables of the same mean to derive this upper bound.
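With an unlimited number of processors, the completion time of the task graph is simply the length of its longest path once each task's duration has been sampled. The sketch below (an illustrative Monte Carlo estimate on a made-up six-task graph, not the paper's analytical bounding method; all names are hypothetical) estimates that measure and shows the effect of replacing an NBUE distribution (here uniform, which is NBUE) by an exponential distribution with the same mean:

    import random

    # Hypothetical task graph: EDGES[u] lists the tasks that must wait for u.
    EDGES = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
    TOPO_ORDER = [0, 1, 2, 3, 4, 5]        # a valid topological order

    def makespan(sample_time):
        """Completion time with unlimited processors = longest path length."""
        finish = {}
        for v in TOPO_ORDER:
            start = max((finish[u] for u in TOPO_ORDER if v in EDGES[u]),
                        default=0.0)
            finish[v] = start + sample_time()
        return max(finish.values())

    def estimate(sample_time, runs=20000):
        return sum(makespan(sample_time) for _ in range(runs)) / runs

    mean = 1.0
    nbue = lambda: random.uniform(0.0, 2 * mean)    # uniform is NBUE
    expo = lambda: random.expovariate(1.0 / mean)   # exponential, same mean

    print("NBUE (uniform) task times:", round(estimate(nbue), 3))
    print("Exponential task times   :", round(estimate(expo), 3))
    # The exponential estimate should come out larger, illustrating the upper
    # bound obtained by the mean-preserving exponential substitution.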

2.
3.
Parallel loops account for the greatest amount of parallelism in numerical programs. Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems. However, in parallel processing systems with caches or local memories in their memory hierarchies, a "thrashing problem" may arise whenever data move back and forth between the caches or local memories of different processors. Previous techniques can only deal with the rather simple cases involving a single linear function in a perfectly nested loop. In this paper, we present a parallel program optimizing technique called hybrid loop interchange (HLI) for cases with multiple linear functions and loop-carried data dependences in the nested loop. With HLI we can easily eliminate or reduce the thrashing phenomena without reducing the program's parallelism.
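HLI itself is not spelled out in the abstract, so the sketch below is only a generic illustration of why iteration order matters for cache behavior in a nested loop: it sums a row-major NumPy array column by column and then, with the loops interchanged, row by row. The array size is arbitrary, and on most machines the column-wise traversal is noticeably slower because consecutive accesses are far apart in memory:

    import time
    import numpy as np

    N = 4000
    a = np.zeros((N, N))            # NumPy default is row-major (C order)

    def sum_by_columns(a):
        # Inner dimension walked column by column: each a[:, j] is a strided
        # slice, so consecutive elements are N * 8 bytes apart.
        return sum(float(a[:, j].sum()) for j in range(a.shape[1]))

    def sum_by_rows(a):
        # Interchanged loop order: each a[i, :] is contiguous in memory.
        return sum(float(a[i, :].sum()) for i in range(a.shape[0]))

    for f in (sum_by_columns, sum_by_rows):
        t0 = time.perf_counter()
        f(a)
        print(f"{f.__name__}: {time.perf_counter() - t0:.3f} s")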

4.
We study a multiprocessing computer system which accepts parallel programs that have a fork-join computational paradigm. The multiprocessing computer system under study is modeled as K homogeneous servers, each with an infinite-capacity queue. Parallel programs arrive at the multiprocessing system according to a series-parallel phase-type interarrival process with mean arrival rate h. Upon arrival, a program forks into K independent tasks and each task is assigned to a unique server. Each task's service time has a k-stage Erlang distribution with mean service time λ. A parallel program is completed upon the completion of its last task. This kind of queueing model has no known closed-form solution in the general (K⩾2) case. In this paper, we show that by carefully modifying the arrival and service distributions at some imbedded points in time, we can obtain tight performance bounds. We also provide a computationally efficient algorithm for obtaining upper and lower bounds on the expected response time. The methodology is flexible and allows one to trade off the tightness of the bounds against computational cost.
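As a rough illustration of the model (not the paper's bounding algorithm: Poisson arrivals stand in for the phase-type interarrival process, and the rates below are arbitrary), the sketch estimates the mean response time of a fork-join system in which every arriving program places one task in each of K FCFS server queues and completes when its last task finishes:

    import random

    def erlang(k, mean):
        """k-stage Erlang sample with the given mean (sum of k exponentials)."""
        rate = k / mean
        return sum(random.expovariate(rate) for _ in range(k))

    def simulate(K=4, arrival_rate=0.5, service_mean=1.0, stages=3, jobs=100_000):
        server_free = [0.0] * K            # time at which each server drains
        t = 0.0
        total_response = 0.0
        for _ in range(jobs):
            t += random.expovariate(arrival_rate)   # next program arrival
            finish_last = 0.0
            for s in range(K):                      # fork: one task per server
                start = max(t, server_free[s])
                server_free[s] = start + erlang(stages, service_mean)
                finish_last = max(finish_last, server_free[s])
            total_response += finish_last - t       # join: wait for slowest task
        return total_response / jobs

    print("estimated mean response time:", round(simulate(), 3))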

5.
Large scientific parallel applications demand large amounts of memory space. Current parallel computing platforms schedule jobs without fully knowing their memory requirements. This leads to uneven memory allocation in which some nodes are overloaded. This, in turn, leads to disk paging, which is extremely expensive in the context of scientific parallel computing. To solve this problem, we propose a new peer-to-peer solution called parallel network RAM. This approach avoids the use of disk, better utilizes available RAM resources, and will allow larger problems to be solved while reducing the computational, communication, and synchronization overhead typically involved in parallel applications. We proposed several different parallel network RAM designs and evaluated the performance of each under different conditions. We discovered that different designs are appropriate in different situations.

6.
A method known as closed environments can be used to represent variable bindings for OR-parallel logic programs without relying on a shared memory or common address space. The representation is based on a procedure that transforms stack frames after unification, taking into account problems with common unbound ancestors and shared instances of complex terms. Closed environments were developed for the AND/OR Process Model, but may be applicable to other OR-parallel models.

7.
Martonosi, M., Gupta, A., Anderson, T.E. 《Computer》 1995, 28(4): 32-40
To improve program memory performance, programmers and compiler writers can transform the application so that its memory-referencing behavior better exploits the memory hierarchy. The challenge in achieving these program transformations is overcoming the difficulty of statically analyzing or reasoning about an application's referencing behavior and interactions. In addition, many performance-monitoring tools collect high-level information that is inadequately detailed to analyze specific memory performance bugs. We describe MemSpy, a performance-monitoring tool we designed to help programmers discern where and why memory bottlenecks occur. MemSpy guides programmers toward program transformations that improve memory performance through detailed statistics on cache-miss causes and frequency. Because of the natural link between data-reference patterns and memory performance, MemSpy helps programmers comprehend data structure and code segment interactions by displaying statistics in terms of both the program's data and code structures, rather than for code structures alone.

8.
Based on a thorough study of the relationship between array element accesses and loop indices of the nested loop, a method is presented with which the staggering relation and the compacting relation between the threads of the nested loop (either with a single linear function or with multiple linear functions) can be determined at compile time, and accordingly the nested loop (either perfectly nested or imperfectly nested) can be restructured to avoid the thrashing problem. Due to its simplicity, our method can be efficiently implemented in any parallel compiler, and the performance improvement is significant, as shown by the experimental results.

9.
Parallelism has become a way of life for many scientific programmers. A significant challenge in bringing the power of parallel machines to these programmers is providing them with a suite of software tools similar to the tools that sequential programmers currently utilize. Unfortunately, writing correct parallel programs remains a challenging task. In particular, automatic or semi-automatic testing tools for parallel programs are lacking. This paper takes a first step in developing an approach to providing all-uses coverage for parallel programs. A testing framework and theoretical foundations for structural testing are presented, including test data adequacy criteria and hierarchy, formulation and illustration of all-uses testing problems, classification of all-uses test cases for parallel programs, and both theoretical and empirical results with regard to what can be achieved with all-uses coverage for parallel programs. Copyright © 2003 John Wiley & Sons, Ltd.

10.
In this paper, we focus on the compiling implementation of the parallel logic language PARLOG and the functional language ML on distributed-memory multiprocessors. Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model (HPGREM) is presented first. Then, based on HPGREM, a parallel abstract machine PAM/TGR is described. Furthermore, several optimizing compilation schemes for executing declarative programs on a transputer array are proposed. The performance statistics on the transputer array demonstrate the effectiveness of our model, parallel abstract machine, optimizing compilation strategies and compiler.

11.
Span programs provide a linear algebraic model of computation. Lower bounds for span programs imply lower bounds for formula size, symmetric branching programs, and contact schemes. Monotone span programs also correspond to linear secret-sharing schemes. We present a new technique for proving lower bounds for monotone span programs. We prove a lower bound of Ω(m^{2.5}) for the 6-clique function. Our results improve on the previously known bounds for explicit functions.
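As a minimal worked example of the model itself (a toy program over GF(2), unrelated to the 6-clique bound; all names are illustrative): a monotone span program is a matrix whose rows are labelled by variables, together with a fixed target vector, and an input is accepted exactly when the target lies in the span of the rows whose labels are set to 1. The sketch below encodes f(x1, x2, x3) = x1 AND (x2 OR x3) and checks it by brute force:

    from itertools import product

    # Rows of the span program and the variable labelling each row.
    ROWS   = [((1, 0), "x1"), ((0, 1), "x2"), ((0, 1), "x3")]
    TARGET = (1, 1)

    def accepts(assignment):
        """True iff TARGET is in the GF(2) span of the rows whose label is 1."""
        usable = [row for row, label in ROWS if assignment[label]]
        for coeffs in product((0, 1), repeat=len(usable)):
            combo = tuple(sum(c * row[i] for c, row in zip(coeffs, usable)) % 2
                          for i in range(len(TARGET)))
            if combo == TARGET:
                return True
        return False

    for bits in product((0, 1), repeat=3):
        x = dict(zip(("x1", "x2", "x3"), bits))
        print(bits, accepts(x), bool(x["x1"] and (x["x2"] or x["x3"])))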

12.
Parallel execution is normally used to decrease the amount of time required to run a program. However, the parallel execution may require far more space than that required by the sequential execution. Worse yet, the parallel space requirement may be very much more difficult to predict than the sequential space requirement because there are more factors to consider. These include essentially nondeterministic factors that can influence scheduling, which in turn may dramatically influence space requirements. We survey some scheduling algorithms that attempt to place bounds on the amount of time and space used during parallel execution. We also outline a direction for future research. This direction takes us into the area of functional programming, where the declarative nature of the languages can help the programmer to produce correct parallel programs, a feat that can be difficult with procedural languages. Currently the high-level nature of functional languages can make it difficult for the programmer to understand the operational behavior of the program. We look at some of the problems in this area, with the goal of achieving a programming environment that supports correct, efficient parallel programs.

13.
A syntactic read-k-times branching program has the restriction that no variable occurs more than k times on any path (whether or not consistent) of the branching program. We first extend the result in [31] to show that the "n/2 clique only function", which is easily seen to be computable by deterministic polynomial-size read-twice programs, cannot be computed by nondeterministic polynomial-size read-once programs, although its complement can be so computed. We then exhibit an explicit Boolean function f such that every nondeterministic syntactic read-k-times branching program for computing f has size exp(Ω(n/(4^k k^3))).
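For concreteness (a toy example, not connected to the lower bound), a branching program is a DAG in which each internal node queries one variable and branches on its value, and a syntactic read-once program queries each variable at most once on every path. The sketch below evaluates such a program for x1 XOR x2:

    # Each internal node: (variable, 0-successor, 1-successor); sinks are
    # "accept"/"reject".  Every root-to-sink path reads x1 then x2 once.
    XOR_BP = {
        "n1":  ("x1", "n2a", "n2b"),
        "n2a": ("x2", "reject", "accept"),
        "n2b": ("x2", "accept", "reject"),
    }

    def evaluate(bp, root, assignment):
        node = root
        while node not in ("accept", "reject"):
            var, on_zero, on_one = bp[node]
            node = on_one if assignment[var] else on_zero
        return node == "accept"

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, evaluate(XOR_BP, "n1", {"x1": x1, "x2": x2}))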

14.
In this paper we investigate the "mandatory-work-first" approach to parallel alpha-beta search first proposed by Akl, Barnard, and Doran. This approach is based on a version of alpha-beta search without deep cutoffs and a two-stage evaluation process, the second stage of which is often pruned. Our analysis shows that for best-first ordering on the lookahead tree, this approach provides greater speedup than the Palphabeta tree-splitting technique, and that for worst-first ordering, mandatory work first provides only slightly worse speedup than Palphabeta.
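For reference, the sketch below is a plain sequential alpha-beta search over a toy nested-list game tree (it is neither the mandatory-work-first scheme nor Palphabeta); it shows the cutoff mechanism whose effectiveness depends on child ordering, which is why best-first versus worst-first ordering changes the achievable parallel speedup:

    import math

    def alphabeta(node, alpha=-math.inf, beta=math.inf, maximizing=True):
        """Minimax with alpha-beta pruning; leaves are scores, internal nodes are lists."""
        if not isinstance(node, list):               # leaf
            return node
        if maximizing:
            value = -math.inf
            for child in node:
                value = max(value, alphabeta(child, alpha, beta, False))
                alpha = max(alpha, value)
                if alpha >= beta:                     # cutoff: prune the rest
                    break
            return value
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

    # With the strongest subtree searched first, later subtrees are cut off early.
    tree = [[3, 5], [2, 9], [0, 1]]
    print(alphabeta(tree))    # prints 3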

15.
Given a distributed system of n balls and n bins, how evenly can we distribute the balls to the bins, minimizing communication? The fastest non-adaptive and symmetric algorithm achieving a constant maximum bin load requires Θ(log log n) rounds, and any such algorithm running for r ∈ O(1) rounds incurs a bin load of Ω((log n/log log n)^{1/r}). In this work, we explore the fundamental limits of the general problem. We present a simple adaptive symmetric algorithm that achieves a bin load of 2 in log* n + O(1) communication rounds using O(n) messages in total. Our main result, however, is a matching lower bound of (1 - o(1)) log* n on the time complexity of symmetric algorithms that guarantee small bin loads. The essential preconditions of the proof are (i) a limit of O(n) on the total number of messages sent by the algorithm and (ii) anonymity of bins, i.e., the port numberings of balls need not be globally consistent. In order to show that our technique yields indeed tight bounds, we provide for each assumption an algorithm violating it, in turn achieving a constant maximum bin load in constant time.
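For intuition about the baseline such algorithms improve on (the trivial one-round, non-adaptive strategy, not the adaptive log* n-round algorithm of the paper), the sketch below lets each of n balls pick a bin uniformly at random with no communication and reports the maximum load, which is known to be Θ(log n / log log n) with high probability:

    import math
    import random
    from collections import Counter

    def max_load_one_round(n):
        """Each of n balls picks one of n bins uniformly at random."""
        loads = Counter(random.randrange(n) for _ in range(n))
        return max(loads.values())

    n = 1_000_000
    print("observed max load:", max_load_one_round(n))
    print("log n / log log n ≈", round(math.log(n) / math.log(math.log(n)), 1))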

16.
17.
Curare, the program restructurer described in this paper, automatically transforms a sequential Lisp program into an equivalent concurrent program that runs on a multiprocessor. Data dependences constrain the program's concurrent execution because, in general, two conflicting statements cannot execute in a different order without affecting the program's result. Not all dependences are essential to produce the program's result. Curare attempts to transform the program so it computes its result with fewer conflicts. An optimized program will execute with less synchronization and more concurrency. Curare then examines loops in a program to find those that are unconstrained or lightly constrained by dependences. By necessity, Curare treats recursive functions as loops and does not limit itself to explicit program loops. Recursive functions offer several advantages over explicit loops since they provide a convenient framework for inserting locks and handling the dynamic behavior of symbolic programs. Loops that are suitable for concurrent execution are changed to execute on a set of concurrent server processes. These servers execute single loop iterations and therefore need to be extremely inexpensive to invoke. Restructured programs execute significantly faster than the original sequential programs. This improvement is large enough to attract programmers to a multiprocessor, particularly since it requires little effort on their part. This research was funded by DARPA contract numbers N00039-85-C-0269 (SPUR) and N00039-84-C-0089 (XCS) and by an NSF Presidential Young Investigator award to Paul N. Hilfinger. Additional funding came from the California MICRO program (in conjunction with Texas Instruments, Xerox, Honeywell, and Phillips/Signetics).

18.
The main issues when supporting fault tolerance based on checkpointing and rollback recovery for High-Performance applications are related to the scalability of the introduced support, the possibility of analyzing the induced overhead and, in more general terms, the optimization of the trade-off between failure-free and recovery performances. In this paper we describe our contribution in fault tolerance for high-level structured parallelism models. We take a different viewpoint w.r.t. existing contributions, by introducing a methodology to derive interesting properties to support fault tolerance. We show how to apply this methodology to a general data parallel model, deriving useful properties to introduce a class of checkpointing protocols. Thanks to this methodology, this class of protocols is not affected by the described issues. We exemplify two checkpointing protocols and the related rollback recovery techniques. For each protocol we also derive cost models statically describing the failure-free performance, which can be used for performance tuning or to target some Quality of Service parameter. To assess the innovation of the results we analytically and experimentally compare the introduced protocols with two literature protocols. Results show that while the protocols introduced in this paper permit the definition of cost models and have a good scalability, the literature protocols do not always have these properties. Copyright © 2010 John Wiley & Sons, Ltd.

19.
To speed up production systems, researchers have studied how to execute many rules simultaneously. Unfortunately, such systems can yield results that are impossible for a serial system to produce, leading to erroneous behaviors. We present a formal solution to the problem of guaranteeing serializable behavior in synchronous parallel production systems that execute many rules simultaneously. We also present a variety of algorithms that implement this solution. Our framework builds upon the work of Ishida and Stolfo [14] and improves upon their solution. The practical advantages of these strategies are demonstrated with measurements from an initial implementation.

20.
Parallel simulation of parallel programs for large datasets has been shown to offer significant reduction in the execution time of many discrete event models. The paper describes the design and implementation of MPI-SIM, a library for the execution-driven parallel simulation of task- and data-parallel programs. MPI-SIM can be used to predict the performance of existing programs written using MPI for message passing, or written in UC, a data parallel language, compiled to use message passing. The simulation models can be executed sequentially or in parallel. Parallel execution of the models is synchronized using a set of asynchronous conservative protocols. The paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis. The analysis targets the communication patterns of the application. We show the application-level analysis for message passing and data parallel languages. We present the validation and performance results for the simulator for a set of applications that include the NAS Parallel Benchmark suite. The application-level optimization described in the paper yielded significant performance improvements in the simulation of parallel programs, and in some cases completely eliminated the synchronizations in the parallel execution of the simulation model.
