Similar Documents
20 similar documents found (search time: 31 ms)
1.
The recent data deluge represents one of the major challenges in computing, and has led to the growth of specially designed programs known as data-intensive applications. In general, to ease the parallel execution of data-intensive applications, input data is divided into smaller data chunks that can be processed separately. In many cases, however, these applications show severe performance problems, mainly due to load imbalance, inefficient use of available resources, and improper data-partition policies. Moreover, the impact of these performance problems can depend on the dynamic behavior of the application. This work proposes a methodology to dynamically improve the performance of data-intensive applications based on: (i) adapting the size and number of data partitions to reduce overall execution time; and (ii) adapting the number of processing nodes to achieve an efficient execution. We propose to monitor the application behavior for each exploration (query) and use the gathered data to dynamically tune the performance of the application. The methodology assumes that a single execution includes multiple related queries on the same partitioned workload. The adaptation of the workload partition factor is addressed through the definition of the initial size for the data chunks; the modification of the scheduling policy to send first the data chunks with large processing times; the division of the data chunks with the largest associated computation times; and the joining of data chunks with small computation times. The criteria for dividing or joining chunks are based on the chunks' associated execution times (average and standard deviation) and the number of processing elements being used. Additionally, resource utilization is addressed through dynamic evaluation of the application performance and estimation and modification of the number of processing nodes that can be used efficiently. We have evaluated our strategy using a real and a synthetic data-intensive application as case studies. Analytical expressions have been analyzed through simulation. Applying our methodology, we obtained encouraging results, reducing total execution times and making efficient use of resources.
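A minimal sketch of the divide/join criterion described above, assuming a simple threshold rule over measured per-chunk times; the function name, thresholds, and data layout are illustrative, not taken from the paper:

```python
import statistics

def tune_partitions(chunk_times, split_k=1.5, merge_frac=0.5):
    """Decide which data chunks to divide and which to join, based on the
    mean and standard deviation of measured chunk execution times.
    The thresholds are illustrative assumptions, not the paper's exact rule."""
    mean = statistics.mean(chunk_times.values())
    stdev = statistics.pstdev(chunk_times.values())
    to_divide = [c for c, t in chunk_times.items() if t > mean + split_k * stdev]
    to_join = [c for c, t in chunk_times.items() if t < merge_frac * mean]
    return to_divide, to_join

# Per-chunk times (seconds) gathered while monitoring one query
times = {"c0": 4.1, "c1": 9.8, "c2": 0.4, "c3": 3.9, "c4": 0.6}
print(tune_partitions(times))   # (['c1'], ['c2', 'c4'])
```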

2.
New computer architectures based on large numbers of processors are now used in application areas ranging from embedded systems to supercomputers. Efficient parallel-processing algorithms are applied in a wide variety of applications such as simulation, robot control, and image synthesis. This article presents two novel parallel algorithms for computing robot inverse dynamics (as well as control laws) starting from customized symbolic robot models. To gain the most benefit from the concurrent processor architecture, the whole job is divided into a large number of simple tasks, each involving only a single floating-point operation. Although it requires sophisticated scheduling schemes, the fine granularity of the tasks was the key factor in achieving nearly maximum efficiency and speedup. The first algorithm resolves the scheduling problem for an array of pipelined processors. The second is devoted to parallel processors connected by a complete crossbar interconnection network. The main feature of the proposed algorithms is that they take into account the communication delays between processors and minimize both the execution time and the communication cost. To validate the theoretical results, the algorithms have been verified by experiments on an INMOS T800 transputer-based system. We used four transputers in serial and parallel configurations. The experimental results show that even the most complicated dynamic control laws can be executed in a sub-millisecond time interval. © 1993 John Wiley & Sons, Inc.

3.
We examine applications of parallel processing to the analysis of large data sets typically used in social science research. Our research uses a parallel environment which makes it possible to have 1024 processors working simultaneously on a problem. The application is tested using various configurations of the number of processors and the block size of data reads, on the estimation of a linear model of earnings for the California portion of the 15% sample of the 1970 Census. Performance factors assessed include total execution time, speed-up, and efficiency. Execution times are also compared with reference to execution times on an IBM 3081 using SPSS-X. Results indicate that optimal configurations of the number of processors and data block size can produce significantly faster execution times for linear-model estimation on relatively large (80,000 cases) data sets. We also discuss other applications of parallel processing to statistical analyses commonly found in social science.
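For reference, the performance factors assessed here have the standard definitions below (well-known formulas; the notation is ours):

```latex
% T_1: execution time on one processor; T_P: time on P processors.
\[
  \text{speed-up: } S(P) = \frac{T_1}{T_P},
  \qquad
  \text{efficiency: } E(P) = \frac{S(P)}{P} = \frac{T_1}{P\,T_P}
\]
```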

4.
Models for two processor-sharing policies, called task-scheduling processor sharing and job-scheduling processor sharing, are developed and analyzed. The first policy schedules each task independently and allows parallel execution of an individual program, whereas the second schedules each job as a unit, thereby not allowing parallel execution of an individual program. It is found that task scheduling performs better than job scheduling for most system parameter values. The performance of task-scheduling processor sharing is compared to a first-come-first-served policy. First-come-first-served performs better than processor sharing over a wide range of system parameters; processor sharing performs best when task service-time variability is high. The performance of processor sharing and first-come-first-served is also studied with two classes of jobs, and for the case in which a specific number of processors is statically assigned to each class.

5.
Multi-core real-time scheduling for generalized parallel task models
Multi-core processors offer a significant performance increase over single-core processors. They have the potential to enable computation-intensive real-time applications with stringent timing constraints that cannot be met on traditional single-core processors. However, most results in traditional multiprocessor real-time scheduling are limited to sequential programming models and ignore intra-task parallelism. In this paper, we address the problem of scheduling periodic parallel tasks with implicit deadlines on multi-core processors. We first consider a synchronous task model where each task consists of segments, each having an arbitrary number of parallel threads that synchronize at the end of the segment. We propose a new task decomposition method that decomposes each parallel task into a set of sequential tasks. We prove that our task decomposition achieves a resource augmentation bound of 4 and 5 when the decomposed tasks are scheduled using global EDF and partitioned deadline-monotonic scheduling, respectively. Finally, we extend our analysis to a directed acyclic graph (DAG) task model where each node in the DAG has a unit execution requirement. We show how these tasks can be converted into synchronous tasks such that the same decomposition can be applied and the same augmentation bounds hold. Simulations based on synthetic workloads demonstrate that the derived resource augmentation bounds are safe and sufficient.
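The resource augmentation bound quoted above can be read as follows (a standard definition, phrased in our own notation; \mathcal{A} is the analyzed scheduler):

```latex
% Resource augmentation bound b:
\[
  \tau \text{ feasible on } m \text{ speed-1 cores}
  \;\Longrightarrow\;
  \mathcal{A} \text{ meets all deadlines of } \tau \text{ on } m \text{ speed-}b \text{ cores}
\]
% i.e., the decomposed tasks need cores at most 4x (global EDF) or
% 5x (partitioned deadline-monotonic) as fast as an ideal scheduler requires.
```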

6.
Corollaries to Amdahl's Law for Energy
This paper studies the important interaction between parallelization and energy consumption in a parallelizable application. Given the ratio of serial to parallel portions in an application and the number of processors, we first derive the optimal frequencies allocated to the serial and parallel regions of the application so as to minimize total energy consumption while execution time is preserved (i.e., speedup = 1). We show that the dynamic energy improvement due to parallelization grows faster with the number of processors than the speed improvement given by the well-known Amdahl's Law. Furthermore, we determine the conditions under which one can obtain both energy and speed improvements, as well as the amount of improvement. The formulas we obtain capture the fundamental relationship between parallelization, speedup, and energy consumption, and can be directly utilized in energy-aware processor resource management. Our results form a basis for several interesting research directions in the area of power- and energy-aware parallel processing.
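The setup can be summarized with a common analytical model (our formulation; in particular, the cubic dynamic-power law is an assumption here, not necessarily the paper's exact one). With serial fraction s, N processors, and per-region frequencies f_s and f_p, minimizing E subject to the iso-time constraint yields the optimal per-region frequencies the abstract refers to:

```latex
% Time normalized so one processor at frequency 1 takes unit time.
\begin{align*}
  \text{Amdahl speedup:}\quad & S(N) = \frac{1}{s + (1-s)/N} \\
  \text{iso-time constraint (speedup} = 1\text{):}\quad & \frac{s}{f_s} + \frac{1-s}{N f_p} = 1 \\
  \text{dynamic energy under } P_{\mathrm{dyn}} \propto f^{3}\text{:}\quad
    & E(f_s, f_p) \;\propto\; s\,f_s^{2} + (1-s)\,f_p^{2}
\end{align*}
```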

7.
Metaschedulers co-allocate resources by requesting a fixed number of processors and a usage time for each cluster. These static requests, defined by users, limit the initial scheduling and prevent rescheduling of applications onto other resource sets. It is also difficult for users to estimate application execution times, especially in heterogeneous environments. To overcome these problems, metaschedulers can use performance predictions for automatic resource selection. This paper proposes a resource co-allocation technique with rescheduling support, based on performance predictions, for multi-cluster iterative parallel applications. Iterative applications have been used to solve a variety of problems in science and engineering, including, more recently, large-scale computations based on the asynchronous model. We performed experiments using an iterative parallel application, consisting of benchmark multiobjective problems, with both synchronous and asynchronous communication models on Grid’5000. The results show run-time predictions with an average error of 7% and the prevention of up to 35% and 57% of run-time overestimations to support rescheduling for the synchronous and asynchronous models, respectively. The performance predictions require no access to application source code. One of the main findings is that, because the asynchronous model overlaps communication and computation, it requires no network information to predict execution times. With our co-allocation technique, metaschedulers become responsible for run-time predictions, process mapping, and application rescheduling, relieving the user of these burdensome tasks.

8.
Linear-Time Approximation Schemes for Scheduling Malleable Parallel Tasks
Jansen  Porkolab 《Algorithmica》2002,32(3):507-520
A malleable parallel task is one whose execution time is a function of the number of (identical) processors allotted to it. We study the problem of scheduling a set of n independent malleable tasks on a fixed number of parallel processors, and propose an approximation scheme that, for any fixed ε > 0, computes in O(n) time a non-preemptive schedule of length at most (1+ε) times the optimum.
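Formally (our formalization of the abstract's statement): an instance has n tasks and m identical processors with m fixed, and t_j(p) is the running time of task j when allotted p processors; the scheme's guarantee is

```latex
% Find allotments p_j in {1,...,m} and a non-preemptive schedule
% minimizing the makespan C_max, with
\[
  C_{\max} \;\le\; (1+\varepsilon)\, C_{\max}^{\mathrm{OPT}},
  \qquad \text{computed in } O(n) \text{ time for any fixed } \varepsilon > 0 .
\]
```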

9.
Data-flow directed acyclic graphs (digraphs) are widely used to describe the data dependencies of mesh-based scientific computing, and the parallel execution of such digraphs approximately depicts the flow of a parallel computation. During parallel execution, vertex priorities are a key performance factor. This paper first takes the distributed digraph and its resource-constrained parallel scheduling as the vertex-priority model, and then presents a new parallel algorithm that solves for vertex priorities using the well-known technique of forward–backward iteration. In particular, a more efficient vertex-ranking strategy is proposed for each iteration. For simple digraphs, both theoretical analysis and benchmarks show that the vertex priorities produced by the algorithm make the digraph scheduling time converge non-increasingly with the number of iterations. For non-simple digraphs, benchmarks also show that the new algorithm is superior to many traditional approaches. Embedding the new algorithm into the heuristic framework for the parallel sweep solution of neutron transport applications, the new vertex priorities improve performance by roughly 20% as the number of processors scales from 32 to 2048.
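A sketch in the spirit of the forward–backward technique mentioned above (not the paper's resource-constrained, iterated algorithm): one forward and one backward longest-path pass over the DAG, combined into a per-vertex priority. The combination rule and names are our own:

```python
from collections import defaultdict

def topo_order(succ):
    """Kahn's algorithm; `succ` maps each vertex to its successor list."""
    indeg = defaultdict(int)
    nodes = set(succ)
    for u, vs in succ.items():
        nodes.update(vs)
        for v in vs:
            indeg[v] += 1
    ready = [u for u in nodes if indeg[u] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in succ.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order

def forward_backward_priorities(succ, cost):
    """Priority of a vertex = length of the longest chain through it,
    found with one forward and one backward pass."""
    pred = defaultdict(list)
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    order = topo_order(succ)
    fwd, bwd = {}, {}
    for u in order:                      # forward: longest path ending at u
        fwd[u] = cost[u] + max((fwd[p] for p in pred[u]), default=0)
    for u in reversed(order):            # backward: longest path starting at u
        bwd[u] = cost[u] + max((bwd[s] for s in succ.get(u, [])), default=0)
    return {u: fwd[u] + bwd[u] - cost[u] for u in order}

succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cost = {"a": 2, "b": 5, "c": 1, "d": 3}
print(forward_backward_priorities(succ, cost))  # chain a-b-d scores 10, c scores 6
```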

10.
In current multiprogrammed multiprocessor systems, taking the measured performance of parallel applications into account is critical for deciding on an efficient processor allocation. In this paper, we present the performance-driven processor allocation policy (PDPA). PDPA is a new scheduling policy that implements a processor-allocation policy and a multiprogramming-level policy, in a coordinated way, based on the measured application performance. With regard to processor allocation, PDPA is a dynamic policy that allocates to applications the maximum number of processors with which they reach a given target efficiency. With regard to the multiprogramming level, PDPA admits a new application when free processors are available and the allocation of all running applications is stable, or when some applications show poor performance. Results demonstrate that PDPA automatically adjusts the processor allocation of parallel applications to reach the specified target efficiency, and that it adjusts the multiprogramming level to the workload characteristics. PDPA is able to adjust the processor allocation and the multiprogramming level without human intervention, a desirable property for self-configurable systems, resulting in better individual application response times.
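A minimal sketch of the target-efficiency rule described above, assuming measured speedups per processor count are available; the target value and search rule are illustrative, not PDPA's exact algorithm:

```python
def pdpa_allocation(speedup, max_procs, target_eff=0.7):
    """Return the largest processor count whose efficiency
    (speedup / processors) still meets the target.
    Target and rule are assumptions in the spirit of PDPA."""
    best = 1
    for p in range(1, max_procs + 1):
        if p in speedup and speedup[p] / p >= target_eff:
            best = p
    return best

measured = {1: 1.0, 2: 1.9, 4: 3.4, 8: 5.1, 16: 6.0}
print(pdpa_allocation(measured, 16))   # -> 4 (eff(8) = 0.64 misses the target)
```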

11.
Parallel applications typically do not perform well in a multiprogrammed environment that uses time-sharing to allocate processor resources to the applications' parallel threads. Co-scheduling related parallel threads, or statically partitioning the system, can often reduce the applications' execution times, but at the expense of reduced overall system utilization. To address this problem, there has been increasing interest in dynamically allocating processors to applications based on their resource demands and the dynamically varying system load. The Loop-Level Process Control (LLPC) policy (Yue K, Lilja D. Efficient execution of parallel applications in multiprogrammed multiprocessor systems. 10th International Parallel Processing Symposium, 1996; 448–456) dynamically adjusts the number of threads an application is allowed to execute based on the application's available parallelism and the overall system load. This study demonstrates the feasibility of incorporating the LLPC strategy into an existing commercial operating system and parallelizing compiler, and provides further evidence of the performance improvement that is possible using this dynamic allocation strategy. In this implementation, applications are automatically parallelized and enhanced with the appropriate LLPC hooks so that each application interacts with the modified version of the Solaris operating system. The parallelism of the applications is then adjusted automatically when they are executed in a multiprogrammed environment, so that all applications obtain a fair share of the total processing resources. Copyright © 2001 John Wiley & Sons, Ltd.

12.
Allocating Independent Tasks to Parallel Processors: An Experimental Study
We study a scheduling or allocation problem with the following characteristics: The goal is to execute a number of unspecified tasks on a parallel machine in any order and as quickly as possible. The tasks are maintained by a central monitor that will hand out batches of a variable number of tasks to requesting processors. A processor works on the batch assigned to it until it has completed all tasks in the batch, at which point it returns to the monitor for another batch. The time needed to execute a task is assumed to be a random variable with known mean and variance, and the execution times of distinct tasks are assumed to be independent. Moreover, each time a processor obtains a new batch from the monitor, it suffers a fixed delay. The challenge is to allocate batches to processors in such a way as to achieve a small expected overall finishing time. We introduce a new allocation strategy, the Bold strategy, and show that it outperforms other strategies suggested in the literature in a number of simulations.
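Large batches amortize the fixed per-batch delay but risk imbalance at the end; small batches do the opposite. The toy simulation below shows this trade-off; the decreasing-batch rule is a simple stand-in in the spirit of guided self-scheduling, not the Bold strategy itself:

```python
import random

def simulate(batch_size_fn, n_tasks, n_procs, mean, std, delay, seed=0):
    """Central monitor hands out batches; each request costs a fixed
    `delay`, task times are i.i.d. normal(mean, std) clipped at 0.
    `batch_size_fn(remaining, procs)` picks the next batch size."""
    rng = random.Random(seed)
    remaining = n_tasks
    finish = [0.0] * n_procs
    while remaining > 0:
        p = min(range(n_procs), key=finish.__getitem__)   # next free processor
        size = max(1, min(remaining, batch_size_fn(remaining, n_procs)))
        work = sum(max(0.0, rng.gauss(mean, std)) for _ in range(size))
        finish[p] += delay + work
        remaining -= size
    return max(finish)   # overall finishing time for this run

print(simulate(lambda r, p: 25, 1000, 8, mean=1.0, std=0.5, delay=2.0))
print(simulate(lambda r, p: r // (2 * p) or 1, 1000, 8, mean=1.0, std=0.5, delay=2.0))
```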

13.
A Resource Co-scheduling Algorithm for Grid Environments
Building on traditional DAG scheduling algorithms, this paper proposes a co-scheduling algorithm suitable for grid environments. Its goal is to minimize the overall execution time of a group of applications, and the algorithm supports advance reservation of resources. Simulation results show that, compared with traditional scheduling algorithms, it achieves a considerable performance improvement.

14.
A general parallel task scheduling problem is considered. A task can be processed in parallel on one of several alternative subsets of processors, and its processing time depends on the subset of processors assigned to it. We first show the hardness of approximating the problem for both the preemptive and nonpreemptive cases in the general setting. Next we focus on a linear array network of m processors. We give an approximation algorithm of ratio O(log m) for nonpreemptive scheduling, and another algorithm of ratio 2 for preemptive scheduling. Finally, we give a nonpreemptive scheduling algorithm of ratio O(log² m) for m×m two-dimensional meshes.

15.
Research on Dynamic Scheduling Algorithms for XEN Virtual Machines on Multi-core Platforms
Existing virtual machine scheduling algorithms give insufficient consideration to the execution efficiency of parallel tasks. Modern processor platforms provide multiple computing cores, making the concurrent execution of multiple virtual machines a reality. Targeting the optimization of parallel virtual machine scheduling on multi-core platforms, this paper proposes CON-Credit, a virtual machine scheduling algorithm based on task characteristics. When scheduling parallel tasks, the algorithm allocates processor cores dynamically: virtual machines running ordinary tasks are allocated with a traditional VM scheduling algorithm, while virtual machines running parallel tasks are allocated with a customized synchronization algorithm. Experiments show that the CON-Credit scheduling algorithm significantly improves the execution efficiency of parallel tasks.
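A sketch of the split described above, assuming a round-based model in which parallel VMs are gang-scheduled (all vCPUs run together) and ordinary VMs fill the remaining cores; this illustrates the idea only, not the CON-Credit algorithm:

```python
def schedule_round(vms, n_cores):
    """Pick VMs to run this round: gang-schedule parallel VMs first,
    then hand leftover cores to ordinary VMs one vCPU at a time."""
    free = n_cores
    running = []
    # Parallel VMs: all vCPUs must be co-scheduled, so take all or none.
    for vm in sorted((v for v in vms if v["parallel"]), key=lambda v: -v["vcpus"]):
        if vm["vcpus"] <= free:
            running.append(vm["name"])
            free -= vm["vcpus"]
    # Ordinary VMs: credit-style, one core each while cores remain.
    for vm in (v for v in vms if not v["parallel"]):
        if free == 0:
            break
        running.append(vm["name"])
        free -= 1
    return running

vms = [{"name": "mpi-vm", "vcpus": 4, "parallel": True},
       {"name": "web-vm", "vcpus": 1, "parallel": False}]
print(schedule_round(vms, 4))   # ['mpi-vm']; 'web-vm' waits for the next round
```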

16.
In this paper, we present a fine-grained parallel implementation of the MPEG-2 video encoder on the Intel Paragon XP/S parallel computer. We use a data-parallel approach and exploit parallelism within each frame, unlike some previous approaches that process several disjoint video sequences concurrently. This makes our encoder suitable for real-time applications where the complete video sequence may not be present on disk and may become available frame by frame over time. The Express parallel programming environment is employed as the underlying message-passing system, making our encoder portable across a wide range of parallel and distributed architectures. The encoder also provides control over various parameters such as the number of processors in each dimension, the size of the motion-search window, buffer management, and bitrate. Moreover, it has the flexibility to allow fast and new algorithms for different stages of the codec to be included in the program, replacing the current algorithms. Comparisons of execution times, speedups, and frame-encoding rates using different numbers of processors are provided. An analysis of frame-data distribution among multiple processors is also presented. In addition, our study reveals the degrees of parallelism and the bottlenecks in the various computational modules of the MPEG-2 algorithm. We have used two motion-estimation techniques and five different video sequences in our experiments. Using maximum parallelism, with one block per processor, an encoding rate higher than 30 frames/s has been achieved.

17.
Load-balancing algorithms are designed to distribute load equally across processors and maximize their utilization while minimizing total task execution time. To achieve these goals, the load-balancing mechanism should be "fair" in distributing the load across the processors, meaning that the difference between the heaviest-loaded and the lightest-loaded processors should be minimized; the load information on each processor must therefore be kept up to date so that the load-balancing mechanism can be effective. In this work, we present an application-independent dynamic algorithm for task scheduling and load balancing in message-passing systems. We propose a DAG-based Dynamic Load Balancing algorithm for Real-time applications (DAG-DLBR), designed to work dynamically and cope with changes in load that might occur at runtime. The algorithm addresses the challenge of devising a load-balancing scheme that judiciously handles the hybrid execution of an existing real-time application, represented by a directed acyclic graph (DAG), together with newly arriving jobs. Its main objective is to reduce the response times of newly arriving jobs while maintaining the time constraints of the existing DAG. To evaluate the performance of the DAG-DLBR algorithm, a comparison with two common dynamic load-balancing algorithms is presented. This comparison is performed by experimentally evaluating the execution times of the different load-balancing algorithms on a homogeneous real parallel machine. In addition, the load imbalance, execution time, and communication overhead are evaluated analytically using different benchmarks as test-bed workloads. These workloads cover a wide range of dynamic applications with different task types. Experimental results illustrate the improved performance of the DAG-DLBR algorithm compared to both distributed and hierarchical algorithms, by at least 12% and 19%, respectively. This improvement holds for all workloads, even highly dependent ones. The DAG-DLBR algorithm achieves lower computation times than both the distributed and the hierarchical algorithms for 4, 8, 12, and 16 processors.
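The fairness goal above (minimizing the gap between the heaviest- and lightest-loaded processors) can be made concrete with a classical greedy baseline, longest-processing-time-first; this is a reference point, not the DAG-DLBR algorithm:

```python
import heapq

def assign_least_loaded(job_costs, n_procs):
    """Greedy LPT assignment of independent jobs: always place the next
    largest job on the currently lightest processor."""
    heap = [(0.0, p) for p in range(n_procs)]
    heapq.heapify(heap)
    placement = {}
    for job, cost in sorted(job_costs.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        placement[job] = p
        heapq.heappush(heap, (load + cost, p))
    loads = [load for load, _ in heap]
    return placement, max(loads) - min(loads)   # imbalance: heaviest - lightest

jobs = {"j1": 7.0, "j2": 5.0, "j3": 4.0, "j4": 3.0, "j5": 1.0}
placement, imbalance = assign_least_loaded(jobs, 2)
print(placement, imbalance)   # both processors end at load 10.0 -> imbalance 0.0
```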

18.
This paper addresses the problem of minimizing the schedule length (makespan) of a batch of jobs with different arrival times. A job is described by a directed acyclic graph (DAG) of parallel tasks. The paper proposes a dynamic scheduling method that adapts the schedule when new jobs are submitted and that may change the processors assigned to a job during its execution. The scheduling method is divided into a scheduling strategy and a scheduling algorithm. We also propose an adaptation of the Heterogeneous Earliest-Finish-Time (HEFT) algorithm, called here P-HEFT, to handle parallel tasks in heterogeneous clusters efficiently without compromising the makespan. A comparison of this algorithm with another DAG scheduler, using simulations of several machine configurations and job types, shows that P-HEFT gives a shorter makespan for a single DAG but scores worse for multiple DAGs. Finally, the dynamic scheduling of a batch of jobs using the proposed method showed significant improvements on more heavily loaded machines when compared to the alternative resource-reservation approach.
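For context, HEFT orders tasks by "upward rank" and then maps each to the processor giving the earliest finish time. A minimal sketch of the rank computation follows (average computation and communication costs are assumed as inputs; P-HEFT's extension to parallel tasks is not shown):

```python
def heft_upward_ranks(succ, comp, comm):
    """rank_u(t) = comp[t] + max over successors s of (comm[(t, s)] + rank_u(s)).
    Tasks are scheduled in decreasing rank order."""
    memo = {}
    def rank(t):
        if t not in memo:
            memo[t] = comp[t] + max(
                (comm.get((t, s), 0) + rank(s) for s in succ.get(t, [])),
                default=0)
        return memo[t]
    for t in comp:
        rank(t)
    return memo

succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
comp = {"a": 3, "b": 2, "c": 4, "d": 1}
comm = {("a", "b"): 1, ("a", "c"): 2, ("b", "d"): 1, ("c", "d"): 1}
print(heft_upward_ranks(succ, comp, comm))  # a=11, c=6, b=4, d=1: schedule "a" first
```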

19.
Jansen  Porkolab 《Algorithmica》2008,32(3):507-520
Abstract. A malleable parallel task is one whose execution time is a function of the number of (identical) processors allotted to it. We study the problem of scheduling a set of n independent malleable tasks on a fixed number of parallel processors, and propose an approximation scheme that, for any fixed ε > 0, computes in O(n) time a non-preemptive schedule of length at most (1+ε) times the optimum.

20.
Rod Adams  Sue Gray 《Software》1995,25(9):1003-1020
Multiple-instruction-issue processors seek to improve performance over scalar RISC processors by providing multiple pipelined functional units in order to fetch, decode and execute several instructions per cycle. The process of identifying instructions which can be executed in parallel and distributing them between the available functional units is referred to as instruction scheduling. This paper describes a simple compile-time scheduling technique, called conditional compaction, which uses the concept of conditional execution to move instructions across basic block boundaries. It then presents the results of an investigation into the performance of the scheduling technique using C benchmark programs scheduled for machines with different functional unit configurations. This paper represents the culmination of our investigation into how much performance improvement can be obtained using conditional execution as the sole scheduling technique.
