首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
《Performance Evaluation》2006,63(4-5):265-277
Performance prediction for parallel applications running in heterogeneous clusters is difficult to accomplish due to the unpredictable resource contention patterns that can be found in such environments. Typically, components of a parallel application will contend for the use of resources among themselves and with entities external to the application, such as other processes running in the computers of the cluster. The performance modeling approach should be able to represent these sources of contention and to produce an estimate of the execution time, preferably in polynomial time. This paper presents a polynomial time static performance prediction approach in which the prediction takes the form of an interval of values instead of a single value. The extra information given by an interval of values represents the variability of the underlying environment more accurately, as indicated by the practical examples presented.  相似文献   

2.
《Parallel Computing》1997,23(10):1405-1420
Performance prediction is necessary in order to deal with multi-dimensional performance effects on parallel systems. The compiler-generated analytical model developed in this paper accounts for the effects of cache behavior, CPU execution time and message passing overhead for real programs written in high level data-parallel languages. The performance prediction technique is shown to be effective in analyzing several non-trivial data-parallel applications as the problem size and number of processors vary. We leverage technology from the Maple symbolic manipulation system and the S-PLUS statistical package in order to present users with critical performance information necessary for performance debugging, architectural enhancement and procurement of parallel systems. The usability of these results is improved through specifying confidence intervals as well as predicted execution times for data-parallel applications.  相似文献   

3.
4.
Low-cost task scheduling for distributed-memory machines   总被引:2,自引:0,他引:2  
In compile-time task scheduling for distributed-memory systems, list scheduling is generally accepted as an attractive approach, since it pairs low cost with good results. List-scheduling algorithms schedule tasks in order of their priority. This priority can be computed either (1) statically, before the scheduling, or (2) dynamically, during the scheduling. In this paper, we show that list scheduling with statically-computed priorities (LSSP) can be performed at a significantly lower cost than existing approaches, without sacrificing performance. Our approach is general, i.e. it can be applied to any LSSP algorithm. The low complexity is achieved by using low-complexity methods for the most time-consuming parts in list-scheduling algorithms, i.e. processor selection and task selection, preserving the criteria used in the original algorithms. We exemplify our method by applying it to the MCP (Modified Critical Path) algorithm. Using an extension of this method, we can also reduce the time complexity of a particular class of list scheduling with dynamic priorities (LSDP) [including algorithms such as DLS (Dynamic Level Scheduling), ETF (Earliest Task First) and ERT (Earliest Ready Task)]. Our results confirm that the modified versions of the list-scheduling algorithms obtain a performance comparable to their original versions, yet at a significantly lower cost. We also show that the modified versions of the list-scheduling algorithms consistently outperform multi-step algorithms, such as DSC-LLB (Dynamic Sequence Clustering with List Load Balancing), which also have higher complexity and clearly outperform algorithms in the same class of complexity, such as CPM (Critical Path Method)  相似文献   

5.
Performance modeling for large industrial or scientific codes is of value for program tuning or for selection of new machines when benchmarking is not yet possible. We discuss an empirical method of estimating runtime for certain large parallel programs where computational work is estimated by regression functions based on measurements and time cost of communication is modeled by program analysis and benchmarks for communication primitives. The method is demonstrated with the local weather model (LM) of the German Weather Service (DWD) on SP-2, T3E, and SX-4. The method is an economic way of developing performance models because only a moderate number of measurements is required. The resulting model is sufficiently accurate even for very large test cases.  相似文献   

6.
PSEE (Parallel System Evaluation Environment) is a software tool that provides a multiprocessor system for research into alternative architectural decisions and experimentation, with such issues as selection, design, tuning, scheduling, clustering and routing policies. PSEE facilitates simulation and performance evaluation as well as a prediction environment for the design and tuning of parallel systems. These tasks involve cycles through programming, simulation, measurement, visualization and modification of parallel system parameters. PSEE includes a parallel programming tool, a simulator for link oriented parallel systems, BOLAS, and a performance evaluation tool, GRAPH. These PSEE modules are tools oriented to support the above tasks in user-friendly, interactive and animated graphical form. PSEE provides quantitative information in a graphical tailored form. This numerical/graphical output helps the user make decisions about his/her particular development.  相似文献   

7.
Cognition, Technology & Work - Research on human multitasking suggests several measures to evaluate performance. However, the suggested measures evaluate performance either when tasks are...  相似文献   

8.
Effective task scheduling is essential for obtaining high performance in heterogeneous distributed computing systems (HeDCSs). However, finding an effective task schedule in HeDCSs requires the consideration of both the heterogeneity of processors and high interprocessor communication overhead, which results from non-trivial data movement between tasks scheduled on different processors. In this paper, we present a new high-performance scheduling algorithm, called the longest dynamic critical path (LDCP) algorithm, for HeDCSs with a bounded number of processors. The LDCP algorithm is a list-based scheduling algorithm that uses a new attribute to efficiently select tasks for scheduling in HeDCSs. The efficient selection of tasks enables the LDCP algorithm to generate high-quality task schedules in a heterogeneous computing environment. The performance of the LDCP algorithm is compared to two of the best existing scheduling algorithms for HeDCSs: the HEFT and DLS algorithms. The comparison study shows that the LDCP algorithm outperforms the HEFT and DLS algorithms in terms of schedule length and speedup. Moreover, the improvement in performance obtained by the LDCP algorithm over the HEFT and DLS algorithms increases as the inter-task communication cost increases. Therefore, the LDCP algorithm provides a practical solution for scheduling parallel applications with high communication costs in HeDCSs.  相似文献   

9.
10.
针对现有任务调度算法优先级选取过于单一所产生局部较优调度结果的问题,从全局较优出发,提出一种先分层后分支决定优先级的静态任务调度算法—HGCOTS算法。该算法考虑了任务间较大的通信开销和冗余任务对异构CMP任务调度效率的影响,通过综合区间插入和任务复制技术最大限度地降低了任务间的通信开销,对冗余任务进行删除,明显提高了任务调度效率。使用随机生成图进行模拟实验,与其他算法相比,新算法具有更小的调度长度。  相似文献   

11.
The growth in size of networked high performance computers along with novel accelerator‐based node architectures has further emphasized the importance of communication efficiency in high performance computing. The world's largest high performance computers are usually operated as shared user facilities due to the costs of acquisition and operation. Applications are scheduled for execution in a shared environment and are placed on nodes that are not necessarily contiguous on the interconnect. Furthermore, the placement of tasks on the nodes allocated by the scheduler is sub‐optimal, leading to performance loss and variability. Here, we investigate the impact of task placement on the performance of two massively parallel application codes on the Titan supercomputer, a turbulent combustion flow solver (S3D) and a molecular dynamics code (LAMMPS). Benchmark studies show a significant deviation from ideal weak scaling and variability in performance. The inter‐task communication distance was determined to be one of the significant contributors to the performance degradation and variability. A genetic algorithm‐based parallel optimization technique was used to optimize the task ordering. This technique provides an improved placement of the tasks on the nodes, taking into account the application's communication topology and the system interconnect topology. Application benchmarks after task reordering through genetic algorithm show a significant improvement in performance and reduction in variability, thereby enabling the applications to achieve better time to solution and scalability on Titan during production. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

12.
On the static assignment to parallel servers   总被引:1,自引:0,他引:1  
The authors study the static assignment to M parallel, exponential, heterogeneous servers. Blocked customers are lost, while the objective is to minimize the average number of blocked customers. The problem is formulated as a stochastic control problem with partial observation, and an equivalent full observation problem is formulated. Numerical experiments are conducted and the structure of the optimal policies is studied  相似文献   

13.
A general parallel task scheduling problem is considered. A task can be processed in parallel on one of several alternative subsets of processors. The processing time of the task depends on the subset of processors assigned to the task. We first show the hardness of approximating the problem for both preemptive and nonpreemptive cases in the general setting. Next we focus on linear array network of m processors. We give an approximation algorithm of ratio O(logm) for nonpreemptive scheduling, and another algorithm of ratio 2 for preemptive scheduling. Finally, we give a nonpreemptive scheduling algorithm of ratio O(log2m) for m×m two-dimensional meshes.  相似文献   

14.
In this paper we introduce our estimation method for parallel execution times, based on identifying separate “parts” of the work done by parallel programs. Our run time analysis works without any source code inspection. The time of parallel program execution is expressed in terms of the sequential work and the parallel penalty. We measure these values for different problem sizes and numbers of processors and estimate them for unknown values in both dimensions using statistical methods. This allows us to predict parallel execution time for unknown inputs and non-available processor numbers with high precision. Our prediction methods require orders of magnitude less data points than existing approaches. We verified our approach on parallel machines ranging from a multicore computer to a peta-scale supercomputer.  相似文献   

15.
科学与工程计算中的很多复杂应用问题需要使用科学工作流技术,超算领域中的科学工作流常以并行任务图建模,并行任务图的有效调度对应用的高效执行有重要意义。给出了资源限制条件下并行任务图的调度模型;针对Fork-Join类并行任务图给出了若干最优化调度结论;针对一般并行任务图提出了一种新的调度算法,该算法考虑了数据通信开销对资源分配和调度性能的影响,并对已有的CPA算法在特定情况下进行了改进。通过实验与常用的CPR和CPA算法做比较,验证了提出的新算法能够获得很好的调度效果。本文提出的调度算法和得到的最优调度结论对工作流应用系统的高性能调度功能开发具有借鉴意义。  相似文献   

16.
17.
A parallel algorithm for syntactic image segmentation is introduced. Stochastic tree grammar is used as a context-generating model. It is shown that when this context-generating process is in the equilibrium state, a matched filter can be designed and applied in parallel to the image. This process can be used for image segmentation in a syntactic pattern recognition system to enhance the performance of the succeeding recognition process.  相似文献   

18.
A number of high‐level parallel programming platforms for networks of workstations (NOWs) have been developed in recent times. Most of these platforms target the exploitation of data parallelism in applications. They do not allow expressibility of applications as a collection of tasks along with their precedence relationships. As a result, the control or task parallelism in an application cannot be expressed or exploited. The current work aims at integrating the notion of task parallelism and precedence relationships among constituting tasks to such high‐level data parallel platforms for NOWs. Our model of integration provides for arbitrary nesting of data and task parallel modules. Also, the precedence relationships are clearly reflected from the program structure. The model relieves the programmer from the need to design applications for non‐determinism in the order of completion of constituting tasks. The design of the runtime support as well as system‐level book keeping is discussed. The model is general enough to be applied to a wide range of data parallel platforms. A specific case of integrating the model into anonymous remote computing (ARC), a data parallel programming platform, is presented. The performance related aspects are also discussed. Copyright © 2000 John Wiley & Sons, Ltd.  相似文献   

19.
On exploiting task duplication in parallel program scheduling   总被引:1,自引:0,他引:1  
One of the main obstacles in obtaining high performance from message-passing multicomputer systems is the inevitable communication overhead which is incurred when tasks executing on different processors exchange data. Given a task graph, duplication-based scheduling can mitigate this overhead by allocating some of the tasks redundantly on more than one processor. In this paper, we focus on the problem of using duplication in static scheduling of task graphs on parallel and distributed systems. We discuss five previously proposed algorithms and examine their merits and demerits. We describe some of the essential principles for exploiting duplication in a more useful manner and, based on these principles, propose an algorithm which outperforms the previous algorithms. The proposed algorithm generates optimal solutions for a number of task graphs. The algorithm assumes an unbounded number of processors. For scheduling on a bounded number of processors, we propose a second algorithm which controls the degree of duplication according to the number of available processors. The proposed algorithms are analytically and experimentally evaluated and are also compared with the previous algorithms  相似文献   

20.
In order to improve a parallel program's performance it is critical to evaluate how even the work contained in a program is distributed over all processors dedicated to the computation. Traditional work distribution analysis is commonly performed at the machine level. The disadvantage of this method is that it cannot identify whether the processors are performing useful or redundant (replicated) work. The paper describes a novel method of statically estimating the useful work distribution of distributed-memory parallel programs at the program level, which carefully distinguishes between useful and redundant work. The amount of work contained in a parallel program, which correlates with the number of loop iterations to be executed by each processor, is estimated by accurately modeling loop iteration spaces, array access patterns and data distributions. A cost function defines the useful work distribution of loops, procedures and the entire program. Lower and upper bounds of the described parameter are presented. The computational complexity of the cost function is independent of the program's problem size, statement execution and loop iteration counts. As a consequence, estimating the work distribution based on the described method is considerably faster than simulating or actually compiling and executing the program. Automatically estimating the useful work distribution is fully implemented as part of P3T, which is a static parameter based performance prediction tool under the Vienna Fortran Compilation System (VFCS). The Lawrence Livermore Loops are used as a test case to verify the approach.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号