期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A task-uncoordinated distributed dataflow model for scalable high performance parallel program execution

《Parallel Computing》2016

相似文献

2.

Optimized Parallel Execution of Declarative Programs on Distributed Memory Multiprocessors

下载免费PDF全文

Shen Meiming Tian Xinmin Wang Dingxing Zheng Weimin Wen Dongchan 《计算机科学技术学报》1993,8(3):43-52

In this paper,we focus on the compiling implementation of parlalel logic language PARLOG and functional language ML on distributed memory multiprocessors.Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model(HPGREM)is presented firstly.Then based on HPGREM,a parallel abstact machine PAM/TGR is described.Furthermore,several optimizing compilation schemes for executing declarative programs on transputer array are proposed.The performance statistics on transputer array demonstrate the effectiveness of our model,parallel abstract machine,optimizing compilation strategies and compiler. 相似文献

3.

On the parallel execution time of tiled loops

Hogstedt K. Carter L. Ferrante J. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(3):307-321

Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multilevel hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: they only handle nested loops of depth two, orthogonal tiling, or rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which can be used by a compiler to automatically determine the tiling parameters that minimizes the execution time. We also explain the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool in predicting the execution time of a tiled loop. It allows us to reason about how the tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of the convex iteration spaces, and show how to choose the optimal tile shape. 相似文献

4.

Scheduling parallel operations in automation for minimum execution time based on pert

J.Y.S. Luh C.S. Lin 《Computers & Industrial Engineering》1985,9(2):149-164

The use of parallel operations in automation, such as part fabrication and assembly, computation and control of industrial robots, etc., may yield a minimum production time and thereby increase productivity. However, the coupling between consecutive phases of operations results in series-parallel precedence constraints that may create unavoidable idle time intervals during the operations. To solve the problem, idle time intervals must be broken down and reduced. An algorithm that determines a minimum time-ordered schedule for the parallel operaitons is developed based on the Program Evaluation and Review Technique using the method of “variable” branch and bound. 相似文献

5.

Derivation of optimal input parameters for minimizing execution time of matrix-based computations on a GPU

《Parallel Computing》2014,40(10):628-645

As GPUs are continually being utilized as coprocessors, the demand for optimally utilizing them for various computations continues to grow. The goal of this work is to derive input parameters which yield the minimum execution time for matrix-based computations executing on a GPU. Input parameters are defined as the dimensions of the grid and blocks assigned for execution on the GPU. Since input parameters inadequately represent the executional behavior of the GPU, execution metrics are formulated as functions of the input parameters to represent the behavior. The execution metrics are architecture independent and are utilized to derive optimal input parameters, which are input parameters that yield the minimum execution time. Optimal input parameters are derived for the following matrix-based computations: matrix–vector multiplication (Mv), matrix–matrix multiplication (MM), and convolution. The derivation allows for selection of optimal input parameters without executing code. Results, for all matrix-based computations and sizes tested, show that utilizing the derived optimal input parameters often yields the minimum execution time, and, at worst, execution time within 13.6% of the minimum. 相似文献

6.

Optimising the distributed execution of join queries in polynomial time

《Computers & Mathematics with Applications》1999,37(3):105-126

It is proposed that an optimal strategy for executing a join query in a distributed database system may be computed in a time which is bounded by a polynomial function of the number of relations and the size parameters of the network. The solution so unveiled considers both the transmission costs and the processing costs incurred in delivering the required result to the user that issued the query.The query specifies that several relational tables are to be coalesced and presented to the appropriate user. Undertaking this task demands the utilisation of limited system resources, so that a strategy for fulfilling the request that imposes minimal cost to the system should be devised. Both the processor sites, and the communications links that interconnect them, are utilised; an optimal strategy is one that minimises a weighted sum of processing and data transmission costs.An integer linear programming model of this problem was originally proposed in [1]; however, no suggestion was given as to how this model might be efficiently solved. By extending the earlier analysis, the recursive nature of the join computation is revealed. Further investigations then produce a modified relationship amenable to algorithmic solution; the resultant procedure has polynomial time and space requirements. 相似文献

7.

The parallel execution of loops: The parallelepiped method

V. A. Val'kovskii 《Cybernetics and Systems Analysis》1982,18(2):201-215

相似文献

8.

Compiling lisp programs for parallel execution

James R. Larus 《LISP and Symbolic Computation》1991,4(1):29-99

Curare, the program restructurer described in this paper automatically transforms a sequential Lisp program into an equivalent concurrent program that runs on a multiprocessor.Data dependences constrain the program's concurrent execution because, in general, two conflicting statements cannot execute in a different order without affecting the program's result. Not all dependences are essential to produce the program's result.Curare attempts to transform the program so it computes its result with fewer conflicts. An optimized program will execute with less synchronization and more concurrency. Curare then examines loops in a program to find those that are unconstrained or lightly constrained by dependences. By necessity,Curare treats recursive functions as loops and does not limit itself to explicit program loops. Recursive functions offer several advantages over explicit loops since they provide a convenient framework for inserting locks and handling the dynamic behavior of symbolic programs. Loops that are suitable for concurrent execution are changed to execute on a set of concurrent server processes. These servers execute single loop iterations and therefore need to be extremely inexpensive to invoke.Restructured programs execute significantly faster than the original sequential programs. This improvement is large enough to attract programmers to a multiprocessor, particularly since it requires little effort on their part.This research was funded by DARPA contract numbers N00039-85-C-0269 (SPUR) and N00039-84-C-0089 (XCS) and by an NSF Presidential Young Investigator award to Paul N. Hilfinger. Additional funding came from the California MICRO program (in conjunction with Texas Instruments, Xerox, Honeywell, and Phillips/Signetics). 相似文献

9.

一种精确程序最坏执行时间分析方法 总被引：1，自引：0，他引：1

下载免费PDF全文

吴国伟李张《计算机工程与应用》2010,46(18):60-64

Java语言的动态特性使程序的最坏执行时间分析较悲观和难以预测,提出一种精确最坏执行时间分析方法,在高层分析中,引入一种标记方法,对带有标记的Java类文件进行反编译提取控制流程,得到每一个基本块中的Java 字节码指令的最坏情况下的执行次数,在底层分析中,建立结合流水线和高级缓存影响的时间模型,得到每条指令所对应的执行时间,最后结合高层分析和底层分析的结果得到程序的最坏情况下的执行时间。实验表明,该方法可以使对实时Java 程序的最坏情况执行时间预测更加安全和精确。相似文献

10.

Optimizing the execution of multiple data analysis queries on parallel and distributed environments

Andrade H. Kurc T. Sussman A. Saltz J. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(6):520-532

We investigate techniques for efficiently executing multiquery workloads from data and computation-intensive applications in parallel and/or distributed computing environments. In this context, we describe a database optimization framework that supports data and computation reuse, query scheduling, and active semantic caching to speed up the evaluation of multiquery workloads. Its most striking feature is the ability of optimizing the execution of queries in the presence of application-specific constructs by employing a customizable data and computation reuse model. Furthermore, we discuss how the proposed optimization model is flexible enough to work efficiently irrespective of the parallel/distributed environment underneath. In order to evaluate the proposed optimization techniques, we present experimental evidence using real data analysis applications. For this purpose, a common implementation for the queries under study was provided according to the database optimization framework and deployed on top of three distinct experimental configurations: a shared memory multiprocessor, a cluster of workstations, and a distributed computational Grid-like environment. 相似文献

11.

A parallel execution model of logic programs

Chen A.C. Wu C.-I. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(1):79-92

A parallel-execution model that can concurrently exploit AND and OR parallelism in logic programs is presented. This model employs a combination of techniques in an approach to executing logic problems in parallel, making tradeoffs among number of processes, degree of parallelism, and combination bandwidth. For interpreting a nondeterministic logic program, this model (1) performs frame inheritance for newly created goals, (2) creates data-dependency graphs (DDGs) that represent relationships among the goals, and (3) constructs appropriate process structures based on the DDGs. (1) The use of frame inheritance serves to increase modularity. In contrast to most previous parallel models that have a large single process structure, frame inheritance facilitates the dynamic construction of multiple independent process structures, and thus permits further manipulation of each process structure. (2) The dynamic determination of data dependency serves to reduce computational complexity. In comparison to models that exploit brute-force parallelism and models that have fixed execution sequences, this model can reduce the number of unification and/or merging steps substantially. In comparison to models that exploit only AND parallelism, this model can selectively exploit demand-driven computation, according to the binding of the query and optional annotations. (3) The construction of appropriate process structures serves to reduce communication complexity. Unlike other methods that map DDGs directly onto process structures, this model can significantly reduce the number of data sent to a process and/or the number of communication channels connected to a process 相似文献

12.

A parallel architecture for distributed adaptive control

《Journal of Microcomputer Applications》1994,17(1):17-34

This paper describes the development of a flexible control system based on multiple T800 floating point transputers. In Part 1 of this paper, entitled 'The application of transputers to distributed control', an overview has been presented of distributed command and control systems (DCCS) and the suitability of the transputer for implementation within such systems discussed. In Part 2 of the article a transputer-based control module which has been developed at the University of Paisley is described. The module allows a number of different modes of operation in that it can be configured to act as either a fixed controller with the coefficients being down-loaded from a central control station, or as an adaptive controller which can make use of an explicit pole placement or linear quadratic Gaussian (LQG) structure. A supervisor, or coordination level, is incorporated into the adaptive controller to monitor the various parameters produced within the controller and direct the system to maintain a safe operation. This improves the applicability, robustness and integrity of the controller in real-time applications. The ease with which software tasks can be distributed over different transputer architectures allows the same software to be configured to accommodate between one and four transputers within the module. In this way the controller module can utilize a number of different transputer configurations depending on cost/performance trade-offs.The use of the transputer also allows the controller to communicate easily through its serial links with other controllers, hosts and external devices. In this way the module can be used as a universal controller node which can be easily incorporated into a large DCCS. Software has been developed to facilitate the production of this type of integrated environment such that a central network interface can initialize, analyse and supervise a number of the controller modules. 相似文献

13.

On the minimal time required for execution of distributed concurrent processes in synchronous modes

V. P. Ivannikov N. S. Kovalenko V. M. Metel'skii 《Programming and Computer Software》2000,26(5):268-274

Methods of linear complexity are proposed for finding the minimum total time required for performing a given amount of computations in various modes of synchronous interaction of distributed heterogeneous concurrent processes. 相似文献

14.

Improved QMRCGSTAB method in distributed parallel environments

《国际计算机数学杂志》2012,89(9):1141-1148

An improved version of the QMRCGSTAB method (IQMRCGSTAB method) for solving large sparse linear systems with non-symmetric coefficient matrices is proposed. The proposed method combines elements of numerical stability and parallel algorithm design without increasing computational costs. Performance analysis shows that the IQMRCGSTAB method has better parallelism and scalability than the original method. Numerical experiments show the efficiency of the new method. 相似文献

15.

A parallel block LU decomposition method for distributed finite element matrices

Daniel Maurer Christian Wieners 《Parallel Computing》2011,37(12):742-758

In this work we present a new parallel direct linear solver for matrices resulting from finite element problems. The algorithm follows the nested dissection approach, where the resulting Schur complements are also distributed in parallel. The sparsity structure of the finite element matrices is used to pre-compute an efficient block structure for the LU factors. We demonstrate the performance and the parallel scaling behavior by several test examples. 相似文献

16.

A distributed parallel programming framework

Stankovic N. Kang Zhang 《IEEE transactions on pattern analysis and machine intelligence》2002,28(5):478-493

This paper presents Visper, a novel object-oriented framework that identifies and enhances common services and programming primitives, and implements a generic set of classes applicable to multiple programming models in a distributed environment. Groups of objects, which can be programmed in a uniform and transparent manner, and agent-based distributed system management, are also featured in Visper. A prototype system is designed and implemented in Java, with a number of visual utilities that facilitate program development and portability. As a use case, Visper integrates parallel programming in an MPI-like message-passing paradigm at a high level with services such as checkpointing and fault tolerance at a lower level. The paper reports a range of performance evaluation on the prototype and compares it to related works 相似文献

17.

Optimization of parallel execution for multi-join queries 总被引：5，自引：0，他引：5

Ming-Syan Chen Yu P.S. Kun-Lung Wu 《Knowledge and Data Engineering, IEEE Transactions on》1996,8(3):416-428

We study the subject of exploiting interoperator parallelism to optimize the execution of multi-join queries. Specifically, we focus on two major issues: (1) scheduling the execution sequence of multiple joins within a query, and (2) determining the number of processors to be allocated for the execution of each join operation obtained in (1). For the first issue, we propose and evaluate by simulation several methods to determine the general join sequences, or bushy trees. Despite their simplicity, the heuristics proposed can lead to the general join sequences that significantly outperform the optimal sequential join sequence. The quality of the join sequences obtained by the proposed heuristics is shown to be fairly close to that of the optimal one. For the second issue, it is shown that the processor allocation for exploiting interoperator parallelism is subject to more constraints-such as execution dependency and system fragmentation-than those in the study of intraoperator parallelism for a single join. The concept of synchronous execution time is proposed to alleviate these constraints. Several heuristics to deal with the processor allocation, categorized by bottom-up and top-down approaches, are derived and are evaluated by simulation. The relationship between issues (1) and (2) is explored. Among all the schemes evaluated, the two-step approach proposed, which first applies the join sequence heuristic to build a bushy tree as if under a single processor system, and then, in light of the concept of synchronous execution time, allocates processors to execute each join in the bushy tree in a top-down manner, emerges as the best solution to minimize the query execution time 相似文献

18.

Relationships between efficiency and execution time of fullmultigrid methods on parallel computers

Martin I. Tirado F. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(6):562-573

The large number of processing elements in current parallel systems necessitates the development of more comprehensive and realistic tools for the scalability analysis of algorithms on those architectures. This paper presents a simple analytical tool with which to study the scalability of parallel algorithm-architecture combinations. Our practical method studies separately execution time, efficiency, and memory usage in the accuracy-critical scaling model, where the problem size-input data set size-increases with the number of processors, which is the relevant one in many situations. The paper defines quantitative and qualitative measurements of the scalability and derives important relationships between execution time and efficiency. For example, results show that the best way to scale the system (to deteriorate as little as possible the properties of the system) is by maintaining constant execution time. These analytical results are verified with one candidate application for massive parallel computers: the full multigrid method. We study the scalability of a general d-dimensional full multigrid method on an r-dimensional mesh of processors. The analytical expressions are verified through experimental results obtained by implementing the full multigrid method on a Transputer-based machine and on the CRAY T3D 相似文献

19.

An analysis method for variable execution time tasks based on histograms

Joan Vila-Carbó Enrique Hernández-Orallo 《Real-Time Systems》2008,38(1):1-37

相似文献

20.

A massively parallel architecture for distributed genetic algorithms

《Parallel Computing》2004,30(5-6):647-676

相似文献