首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Asynchronous pipelining is a form of parallelism that is useful in both distributed and shared memory systems. We show that asynchronous pipeline schedules are a generalization of both noniterative DAG (directed acyclic graph) schedules as well as simpler pipeline schedules, unifying these two types of scheduling. We generalize previous work on determining if a pipeline schedule will deadlock, and generalize Reiter's well-known formula for determining the iteration interval of a deadlock-free schedule, which is the primary measure of the execution time of a schedule. Our generalizations account for nonzero communication times (easy) and the assignment of multiple tasks to processors (nontrivial). A key component of our generalized approach to pipeline schedule analysis is the use of pipeline scheduling edges with potentially negative data dependence distances. We also discuss implementation of an asynchronous pipeline schedule at runtime; show how to efficiently simulate pipeline execution on a sequential processor; define and derive bounds on the startup time of a schedule, which is a secondary schedule performance measure; and describe a new algorithm for evaluating the iteration interval formula.  相似文献   

2.
Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application tasks with dependences. These applications exhibit both task and data parallelism, and combining these two (also called mixed parallelism) has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task and data parallelism required to minimize the parallel completion time (makespan) of these applications. In other words, our algorithm determines the set of tasks that should be run concurrently and the number of processors to be allocated to each task. The processor allocation and scheduling decisions are made in an integrated manner and are based on several factors such as the structure of the task graph, the runtime estimates and scalability characteristics of the tasks, and the intertask data communication volumes. A locality-conscious scheduling strategy is used to improve intertask data reuse. Evaluation through simulations and actual executions of task graphs derived from real applications and synthetic graphs shows that our algorithm consistently generates schedules with a lower makespan as compared to Critical Path Reduction (CPR) and Critical Path and Allocation (CPA), two previously proposed scheduling algorithms. Our algorithm also produces schedules that have a lower makespan than pure task- and data-parallel schedules. For task graphs with known optimal schedules or lower bounds on the makespan, our algorithm generates schedules that are closer to the optima than other scheduling approaches.  相似文献   

3.
With the increasing amount of parallelism obtainable on multicore platforms, stream programming has been proposed as an effective solution for exposing distributed parallelization. Nonetheless, a pressing demand of scheduling task and data parallelism in stream programming exists that can accomplish robust multicore performance in the face of varying application characteristics. This paper addresses the problem of scheduling task and data parallelism in stream programming. We present StreamMDE, an asynchronous concurrency stream programming framework which offers a novel parallel programming model for scheduling task and data parallelism in the message-driven execution paradigm. A key property of this framework is exposing controlled-grained parallelism, which allows us to control the granularity of task and data parallelism in stream graph. Our empirical evaluation of StreamMDE shows that higher efficiency of mixed task and data parallelism in stream programming can be exploited with the appropriate granularity control. The framework bridges the gap between the parallel scale and the architecture of stream programs and facilitates in designing and coding stream features in different schedules.  相似文献   

4.
The use of multiprocessors for discrete event simulation is an active research area where work has focused on strategies for model execution with little regard for the underlying formalism in which models may be expressed. However, a formalism-based approach offers several advantages including the ability to migrate models from sequential to parallel platforms and the ability to calibrate simulation architectures to model structural properties. In this article, we extend the DEVS (discrete event system specification) formalism, originally developed for sequential simulation, to accommodate the full potential of parallel processing. The extension facilitates exploitation of both internal and external event parallelism manifested in hierarchical, modular DEVS models. After developing a mapping of the extended formalism to parallel architectures, we describe an implementation of the approach on a massively parallel architecture, the Connection Machine. Execution results are discussed for a class of models exhibiting high external and internal event parallelism, the so-called broadcast models. These verify the tenets of the underlying theory and demonstrate that significant reduction in execution time is possible compared to the same model executed in serial simulation.  相似文献   

5.
Recent simulation based studies suggest that while superpipelines and superscalars are equally capable of exploiting fine grained concurrency, multiprocessors are better at exploiting coarse grained parallelism. An analytical model that is more flexible and less costly in terms of run time than simulation, is proposed as a tool for analyzing the tradeoff between superpipelined processors, superscalar processors, and multiprocessors. The duality of superpipelines and superscalars is examined in detail. The performance limit for these systems has been derived and it supports the fetch bottleneck observation of previous researchers. Common characteristics of utilization curves for such systems are examined. Combined systems, such as superpipelined multiprocessors and superscalar multiprocessors, are also analyzed. The model shows that the number of pipelines (or processors) at which the maximum throughput is obtained is, as memory access time increases, increasingly sensitive to the ratio of memory access time to network access delay. Further, as a function of interiteration dependence distance, optimum throughput is shown to vary nonlinearly, whereas the corresponding optimum number of processors varies linearly. The predictions from the analytical model agree with similar results published using simulation based techniques  相似文献   

6.
7.
A scientific workflow, usually consists of a good mix of fine and coarse computational granularity tasks displaying varied runtime requirements. It has been observed that fine grained tasks incur more scheduling overhead than their execution time, when executed on widely distributed platforms. Task clustering is extensively used, in such situations, as a runtime optimization method which involves combining multiple short duration tasks into a cluster, to be scheduled on a single resource. This helps in minimizing the scheduling overheads of the fine grained tasks. However, tasks grouping curtails the degree of parallelism and hence needs to be done optimally. Though a number of task clustering techniques have been developed to reduce the impact of system overheads, they fail to identify the appropriate number of clusters at each level of workflow in order to achieve maximum possible parallelism. This work proposes a level based autonomic Workflow-and-Platform Aware (WPA) task clustering technique which takes into consideration both; the workflow structure and the underlying resource set size for task clustering. It aims to achieve maximum possible parallelism among the tasks at a level of a workflow while minimizing the system overheads and resource wastage. A comparative study with current state of the art task clustering approaches on four well-known scientific workflows show that the proposed method significantly reduces the overall workflow execution time and at the same time is able to consolidate the load onto minimum possible resources.  相似文献   

8.
新型体系结构概念—虚拟寄存器与并行的指令处理部件   总被引:4,自引:1,他引:3  
随着程序对地址空间的需求日益提高,研究者提出了虚拟存储器概念,使程序访问的地址空间免受物理存储器的限制。随着面向寄存器的RISC技术发展以及多发射结构中指令调度的日益重要,我们提出了虚拟寄存器的新概念,使寄存器空间不受物理寄存器堆大小的束缚,有利于指令调度和寄存器重新命名技术,提高指令级并行性ILP。此外,现代新型RISC处理机都着重于加强数据处理部件中的执行并行度,忽略了放在存储器中指令的处理。  相似文献   

9.
We model a deterministic parallel program by a directed acyclic graph of tasks, where a task can execute as soon as all tasks preceding it have been executed. Each task can allocate or release an arbitrary amount of memory (i.e., heap memory allocation can be modeled). We call a parallel schedule “space efficient” if the amount of memory required is at most equal to the number of processors times the amount of memory required for some depth-first execution of the program by a single processor. We describe a simple, locally depth-first scheduling algorithm and show that it is always space efficient. Since the scheduling algorithm is greedy, it will be within a factor of two of being optimal with respect to time. For the special case of a program having a series-parallel structure, we show how to efficiently compute the worst case memory requirements over all possible depth-first executions of a program. Finally, we show how scheduling can be decentralized, making the approach scalable to a large number of processors when there is sufficient parallelism  相似文献   

10.
In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility.In this paper we present a timing analysis that allows a compiler for a MIMD machine to eliminate a large fraction of the run-time synchronization by making efficient use of static code scheduling. Although these techniques can be adapted to be applied to most MIMD machines, this paper centers on the analysis and scheduling for barrier MIMD machines. Barrier MIMDs are asynchronous multiple instruction stream/multiple data stream architectures capable of parallel execution of variable execution-time instructions and arbitrary control flow (e.g., while loops and calls). However, they also incorporate a special hardware barrier synchronization mechanism that facilitates static scheduling by providing a mechanism which the compiler can use to enforce precise timing constraints. In other words, the compiler tracks relative timing between processors and uses static code scheduling until the timing imprecision becomes too large, at which point the compiler simply inserts a barrier to reduce that timing imprecision to zero (or a small constant).This paper describes new scheduling and barrier placement algorithms for barrier MIMDs that are based loosely on the list scheduling approach employed for VLIWs [Ellis 1985]. In addition, the experimental results from scheduling thousands of synthetic benchmark programs for a parameterized barrier MIMD machine are presented.  相似文献   

11.
The Muse (multiple sequential Prolog engines) approach has been used to make a simple and efficient OR-parallel implementation of the full Prolog language. The performance results of the Muse system on bus-based multiprocessor machines have been presented in previous papers. This paper discusses the implementation and performance results of the Muse system on switch-based multiprocessors (the BBN Butterfly GP1000 and TC2000). The results of Muse execution show that high real speedups can be achieved for Prolog programs that exhibit coarse-grained parallelism. The scheduling overhead is equivalent to around 8–26 Prolog procedure calls per task on the TC2000. The paper also compares the Muse results with corresponding results for the Aurora OR-parallel Prolog system. For a large set of benchmarks, the results are in favor of the Muse system.  相似文献   

12.
Barrier MIMD's are asynchronous multiple instruction stream, multiple data stream architectures capable of parallel execution of variable execution time instructions and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMD's in that the need for run-time synchronization is significantly reduced. The authors consider the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. In addition, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general, and can be applied to extend previous work using loop coalescing techniques. The authors concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations and linear scheduling theory can be applied to this problem  相似文献   

13.
Extensive research has been done on extracting parallelism from single instruction stream processors. This paper presents our investigation into ways to modify MIMD architectures to allow them to extract the instruction level parallelism achieved by current superscalar and VLIW machines. A new architecture is proposed which utilizes the advantages of a multiple instruction stream design while addressing some of the limitations that have prevented MIMD architectures from performing ILP operation. A new code scheduling mechanism is described to support this new architecture by partitioning instructions across multiple processing elements in order to exploit this level of parallelism.  相似文献   

14.
Automatic scheduling for directed acyclic graphs (DAG) and its applications for coarse-grained irregular problems such as largen-body simulation have been studied in the literature. However, solving irregular problems with mixed granularities such as sparse matrix factorization is challenging since it requires efficient run-time support to execute a DAG schedule. In this paper, we investigate run-time optimization techniques for executing general asynchronous DAG schedules on distributed memory machines and discuss an approach for exploiting parallelism from commuting operations in the DAG model. Our solution tightly integrates the run-time scheme with a fast communication mechanism to eliminate unnecessary overhead in message buffering and copying. We present a consistency model incorporating the above optimizations, and take advantage of task dependence properties to ensure the correctness of execution. We demonstrate the applications of this scheme in sparse matrix factorizations and triangular equation solving for which actual speedups are difficult to obtain. We provide a detailed experimental study on Meiko CS-2 to show that the automatically scheduled code has achieved good performance for these difficult problems, and the run-time overhead is small compared to total execution times.  相似文献   

15.
Real-time database systems associate the concept of deadlines with transaction executions. Previous approaches use “best effort” techniques to schedule a given set of transactions to meet the deadlines as well as to ensure the consistency of the database. However, such approaches are inadequate for target applications which have “hard” real-time deadlines that need to be met in the event of crisis situations. In such cases, it is important to obtain contingency plans that may be invoked with guaranteed execution time characteristics. This paper presents an alternative model for real-time database systems in which deadlines are associated with “contingency” constraints rather than directly with transactions. Our approach leads to a predicate-based model that intrinsically incorporates both triggering and relative timing constraints regarding the transaction executions. We exhibit that selecting contingency plans with respect to various optimality criteria has inherent computational inefficiencies. We study the issues in scheduling of the selected plans with the focus on the contention among the transactions for data resources. Our results exhibit that the data contention, by itself, has a severe adverse impact on the schedulability of the deadline-constrained transactions. We discuss some of the practical implications of our results, and we suggest some counter-measures to handle the computational complexities  相似文献   

16.
Compile-time scheduling is one approach to extract parallelism which has proved effective when the execution behavior is predictable. Unfortunately, the performance of most priority-based scheduling algorithms is computation dependent. Scheduling based on the concept of earliest-startable-task produces reasonably short schedules only when available parallelism is large enough to cover the communications. A priority-based decision is more effective when parallelism is low. We propose a scheduling in which the decision function combines two concepts: (1) task-level as global priority and (2) earliest-task-first as local priority The degree of dominance of one of the above concepts is controlled by a computation profile factor that is the ratio of task parallelism to communication. It is shown that the above factor is an upper bound on the deviation of schedule length from optimum. To tune the solution finish time the above scheduler is iteratively applied on the computation graph. In each iteration, the newly generated schedule is used to sharpen the task-levels which contribute in finding shorter schedules in the next iteration. Evaluation is carried out for a wide category of computation graphs with communications for which optimum schedules are known. It is found that pure local scheduling and static priority-based scheduling significantly deviate from the optimum under specific problem instances. Our approach to adapting the scheduling decision to the computation profile is able to produce near-optimum solutions via a much reduced number of iterations than other approaches.  相似文献   

17.
Optimization of parallel execution for multi-join queries   总被引:5,自引:0,他引:5  
We study the subject of exploiting interoperator parallelism to optimize the execution of multi-join queries. Specifically, we focus on two major issues: (1) scheduling the execution sequence of multiple joins within a query, and (2) determining the number of processors to be allocated for the execution of each join operation obtained in (1). For the first issue, we propose and evaluate by simulation several methods to determine the general join sequences, or bushy trees. Despite their simplicity, the heuristics proposed can lead to the general join sequences that significantly outperform the optimal sequential join sequence. The quality of the join sequences obtained by the proposed heuristics is shown to be fairly close to that of the optimal one. For the second issue, it is shown that the processor allocation for exploiting interoperator parallelism is subject to more constraints-such as execution dependency and system fragmentation-than those in the study of intraoperator parallelism for a single join. The concept of synchronous execution time is proposed to alleviate these constraints. Several heuristics to deal with the processor allocation, categorized by bottom-up and top-down approaches, are derived and are evaluated by simulation. The relationship between issues (1) and (2) is explored. Among all the schemes evaluated, the two-step approach proposed, which first applies the join sequence heuristic to build a bushy tree as if under a single processor system, and then, in light of the concept of synchronous execution time, allocates processors to execute each join in the bushy tree in a top-down manner, emerges as the best solution to minimize the query execution time  相似文献   

18.
Increases in instruction level parallelism are needed to exploit the potential parallelism available in future wide issue architectures. Predicated execution is an architectural mechanism that increases instruction level parallelism by removing branches and allowing simultaneous execution of multiple paths of control, only committing instructions from the correct path. In order for the compiler to expose and use such parallelism, traditional compiler data-flow and path analysis needs to be extended to predicated code. In this paper, we motivate the need for renaming and for predicates that reflect path information. We present Predicated Static Single Assignment (PSSA) which uses renaming and introduces Full -Path Predicates to remove false dependences and enable aggressive predicated optimization and instruction scheduling. We demonstrate the usefulness of PSSA for Predicated Speculation and Control Height Reduction. These two predicated code optimizations used during instruction scheduling reduce the dependence length of the critical paths through a predicated region. Our results show that using PSSA to enable speculation and control height reduction reduces execution time from 12 to 68%.  相似文献   

19.
This paper describes the implementation and the performance of a synchronous, parallel discrete event simulation (SPaDES) method on two shared memory multiprocessors. The presented method aims at the efficient simulation of architectural designs for which the asynchronous PDES methods are less effective. A multiprocessor machine is simulated, and the performance achieved is compared to the performance of a parallel version of the centralized-time synchronous simulation method. The results show that the SPaDES method alleviates bottlenecks usually attributed to synchronous methods, and thus we are able to efficiently exploit most of the parallelism available in the simulation of synchronous architectural designs.  相似文献   

20.
We investigate the efficient iterative solution of large-scale sparse linear systems on shared-memory multiprocessors. Our parallel approach is based on a multilevel ILU preconditioner which preserves the mathematical semantics of the sequential method in ILUPACK. We exploit the parallelism exposed by the task tree corresponding to the nested dissection hierarchy (task parallelism), employ dynamic scheduling of tasks to processors to improve load balance, and formulate all stages of the parallel PCG method conformal with the computation of the preconditioner to increase data reuse. Results on a CC-NUMA platform with 16 processors reveal the parallel efficiency of this solution.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号