期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

An effective processor allocation strategy for multiprogrammedshared-memory multiprocessors

Yue K.K. Lilja D.J. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(12):1246-1258

Existing techniques for sharing the processing resources in multiprogrammed shared-memory multiprocessors, such as time-sharing, space-sharing, and gang-scheduling, typically sacrifice the performance of individual parallel applications to improve overall system utilization. We present a new processor allocation technique called Loop-Level Process Control (LLPC) that dynamically adjusts the number of processors an application is allowed to use for the execution of each parallel section of code, based on the current system load. This approach exploits the maximum parallelism possible for each application without overloading the system. We implement our scheme on a Silicon Graphics Challenge multiprocessor system and evaluate its performance using applications from the Perfect Club benchmark suite and synthetic benchmarks. Our approach shows significant improvements over traditional time-sharing and gang-scheduling. It has performance comparable to, or slightly better than, static space-sharing, but our strategy is more robust since, unlike static space-sharing, it does not require a priori knowledge of the applications' parallelism characteristics 相似文献

2.

Implementing a dynamic processor allocation policy for multiprogrammed parallel applications in the SolarisTM

Kelvin K. Yue David J. Lilja 《Concurrency and Computation》2001,13(6):449-464

Parallel applications typically do not perform well in a multiprogrammed environment that uses time‐sharing to allocate processor resources to the applications' parallel threads. Co‐scheduling related parallel threads, or statically partitioning the system, often can reduce the applications' execution times, but at the expense of reducing the overall system utilization. To address this problem, there has been increasing interest in dynamically allocating processors to applications based on their resource demands and the dynamically varying system load. The Loop‐Level Process Control (LLPC) policy (Yue K, Lilja D. Efficient execution of parallel applications in multiprogrammed multiprocessor systems. 10th International Parallel Processing Symposium, 1996; 448–456) dynamically adjusts the number of threads an application is allowed to execute based on the application's available parallelism and the overall system load. This study demonstrates the feasibility of incorporating the LLPC strategy into an existing commercial operating system and parallelizing compiler and provides further evidence of the performance improvement that is possible using this dynamic allocation strategy. In this implementation, applications are automatically parallelized and enhanced with the appropriate LLPC hooks so that each application interacts with the modified version of the Solaris operating system. The parallelism of the applications are then dynamically adjusted automatically when they are executed in a multiprogrammed environment so that all applications obtain a fair share of the total processing resources. Copyright © 2001 John Wiley & Sons, Ltd. 相似文献

3.

FEADS: A Framework for Exploring the Application Design Space on Network Processors

Rajani Pai R. Govindarajan 《International journal of parallel programming》2007,35(1):1-31

Network processors are designed to handle the inherently parallel nature of network processing applications. However, partitioning and scheduling of application tasks and data allocation to reduce memory contention remain as major challenges in realizing the full performance potential of a given network processor. The large variety of processor architectures in use and the increasing complexity of network applications further aggravate the problem. This work proposes a novel framework, called FEADS, for automating the task of application partitioning and scheduling for network processors. FEADS uses the simulated annealing approach to perform design space exploration of application mapping onto processor resources. Further, it uses cyclic and r-periodic scheduling to achieve higher throughput schedules. To evaluate dynamic performance metrics such as throughput and resource utilization under realistic workloads, FEADS automatically generates a Petri net (PN) which models the application, architectural resources, mapping and the constructed schedule and their interaction. The throughput obtained by schedules constructed by FEADS is comparable to that obtained by manual scheduling for linear task flow graphs; for more complicated task graphs, FEADS’ schedules have a throughput which is upto 2.5 times higher compared to the manual schedules. Further, static scheduling of tasks results in an increase in throughput by upto 30% compared to an implementation of the same mapping without task scheduling. 相似文献

4.

Static and Dynamic Processor Scheduling Disciplines in Heterogeneous Parallel Architectures

《Journal of Parallel and Distributed Computing》1995,28(1):1-18

Most parallel jobs cannot be fully parallelized. In a homogeneous parallel machine-one in which all processors are identical-the serial fraction of the computation has to be executed at the speed of any of the identical processors, limiting the speedup that can be obtained due to parallelism. In a heterogeneous architecture, the sequential bottleneck can be greatly reduced by running the sequential part of the job or even the critical tasks in a faster processor. This paper uses Markov chain based models to analyze the performance of static and dynamic processor assignment policies for heterogeneous architectures. Parallel jobs are assumed to be described by acyclic directed task graphs. A new static processor assignment policy, called Largest Task First Minimum Finish Time (LTFMFT), is introduced. The analysis shows that this policy is very sensitive to the degree of heterogeneity of the architecture, and that it outperforms all other policies analyzed. Three dynamic assignment disciplines are compared and it is shown that, in heterogeneous environments, the disciplines that perform better are those that consider the structure of the task graph, and not only the service demands of the individual tasks. The performance of heterogeneous architectures is compared with cost-equivalent homogeneous ones taking into account different scheduling policies. Finally, static and dynamic processor assignment disciplines are compared in terms of performance. 相似文献

5.

Performance-driven processor allocation

Corbalan J. Martorell X. Labarta J. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(7):599-611

In current multiprogrammed multiprocessor systems, to take into account the performance of parallel applications is critical to decide an efficient processor allocation. In this paper, we present the performance-driven processor allocation policy (PDPA). PDPA is a new scheduling policy that implements a processor allocation policy and a multiprogramming-level policy, in a coordinated way, based on the measured application performance. With regard to the processor allocation, PDPA is a dynamic policy that allocates to applications the maximum number of processors to reach a given target efficiency. With regard to the multiprogramming level, PDPA allows the execution of a new application when free processors are available and the allocation of all the running applications is stable, or if some applications show bad performance. Results demonstrate that PDPA automatically adjusts the processor allocation of parallel applications to reach the specified target efficiency, and that it adjusts the multiprogramming level to the workload characteristics. PDPA is able to adjust the processor allocation and the multiprogramming level without human intervention, which is a desirable property for self-configurable systems, resulting in a better individual application response time. 相似文献

6.

Parallel processsing in finite flement structural analysis

Ahmed K. Noor 《Engineering with Computers》1988,3(4):225-241

A brief review is made of the fundamental concepts and basic issues of parallel processing. Discussion focuses on mechanisms for parallel processing, construction and implementation of parallel numerical algorithms, performance evaluation of parallel processing machines and numerical algorithms, and parallelism in finite element computations. A novel partitioning strategy is outlined for maximizing the degree of parallelism on computers with a small number of powerful processors. 相似文献

7.

基于多核并行的海量数据序列模式挖掘*

俞东进郑苏杭李万清《计算机应用研究》2012,29(2):478-481

为了在多核处理器上充分利用多核资源以提升挖掘性能,提出了一种动态与静态任务分配机制相结合的基于多核的并行序列模式挖掘算法。该算法采用数据并行与任务并行相结合的策略,在各处理器核生成局部序列模式后,再与其他处理器核协同,以最终获得所有的全局序列模式。算法通过并行局部归约技术消除了局部序列的重复生成与计算,并可结合静态与动态任务分配机制解决处理器的负载不均衡问题。理论分析和实验都证实了该算法可有效利用多核计算平台及多核体系结构优势,具有较高的运行效率和加速比。相似文献

8.

Impact of workload and system parameters on next generation clusterscheduling mechanisms

Yanyong Zhang Sivasubramaniam A. Moreira J. Franke H. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(9):967-985

Scheduling of processes onto processors of a parallel machine has always been an important and challenging area of research. The issue becomes even more crucial and difficult as we gradually progress to the use of off-the-shelf workstations, operating systems, and high bandwidth networks to build cost-effective clusters for demanding applications. Clusters are gaining acceptance not just in scientific applications that need supercomputing power, but also in domains such as databases, web service, and multimedia which place diverse Quality-of-Service (QoS) demands on the underlying system. Further, these applications have diverse characteristics in terms of their computation, communication, and I/O requirements, making conventional parallel scheduling solutions, such as space sharing or gang scheduling, unattractive. At the same time, leaving it to the native operating system of each node to make decisions independently can lead to ineffective use of system resources whenever there is communication. Instead, an emerging class of dynamic coscheduling mechanisms that attempt to take remedial actions to guide the system toward coscheduled execution without requiring explicit synchronization offers a lot of promise for cluster scheduling. Using a detailed simulator, this paper evaluates the pros and cons of different dynamic coscheduling alternatives while comparing their advantages over traditional gang scheduling (and not performing any coordinated scheduling at all). The impact of dynamic job arrivals, job characteristics, and different system parameters on these alternatives is evaluated in terms of several performance criteria. In addition, heuristics to enhance one of the alternatives even further are identified, classified, and evaluated. It is shown that these heuristics can significantly outperform the other alternatives over a spectrum of workload and system parameters and is thus a much better option for clusters than conventional gang scheduling 相似文献

9.

On the Interplay of Parallelization, Program Performance, and Energy Consumption 总被引：2，自引：0，他引：2

Cho Sangyeun Melhem Rami G. 《Parallel and Distributed Systems, IEEE Transactions on》2010,21(3):342-353

This paper derives simple, yet fundamental formulas to describe the interplay between parallelism of an application, program performance, and energy consumption. Given the ratio of serial and parallel portions in an application and the number of processors, we derive optimal frequencies allocated to the serial and parallel regions in an application to either minimize the total energy consumption or minimize the energy-delay product. The impact of static power is revealed by considering the ratio between static and dynamic power and quantifying the advantages of adding to the architecture capability to turn off individual processors and save static energy. We further determine the conditions under which one can obtain both energy and speed improvement, as well as the amount of improvement. While the formulas we obtain use simplifying assumptions, they provide valuable theoretical insights into energy-aware processor resource management. Our results form a basis for several interesting research directions in the area of energy-aware multicore processor architectures. 相似文献

10.

Power and performance management for parallel computations in clouds and data centers

《Journal of Computer and System Sciences》2016,82(2):174-190

We address scheduling independent and precedence constrained parallel tasks on multiple homogeneous processors in a data center with dynamically variable voltage and speed as combinatorial optimization problems. We consider the problem of minimizing schedule length with energy consumption constraint and the problem of minimizing energy consumption with schedule length constraint on multiple processors. Our approach is to use level-by-level scheduling algorithms to deal with precedence constraints. We use a simple system partitioning and processor allocation scheme, which always schedules as many parallel tasks as possible for simultaneous execution. We use two heuristic algorithms for scheduling independent parallel tasks in the same level, i.e., SIMPLE and GREEDY. We adopt a two-level energy/time/power allocation scheme, namely, optimal energy/time allocation among levels of tasks and equal power supply to tasks in the same level. Our approach results in significant performance improvement compared with previous algorithms in scheduling independent and precedence constrained parallel tasks. 相似文献

11.

TLA: Temporal look-ahead processor allocation method for heterogeneous multi-cluster systems

Po-Chi Shih Kuo-Chan Huang Che-Rung Lee I-Hsin Chung Yeh-Ching Chung 《Journal of Parallel and Distributed Computing》2013

In a heterogeneous multi-cluster (HMC) system, processor allocation is responsible for choosing available processors among clusters for job execution. Traditionally, processor allocation in HMC considers only resource fragmentation or processor heterogeneity, which leads to heuristics such as Best-Fit (BF) and Fastest-First (FF). However, those heuristics only favor certain types of workloads and cannot be changed adaptively. In this paper, a temporal look-ahead (TLA) method is proposed, which uses an allocation simulation process to guide the decision of processor allocation. Thus, the allocation decision is made dynamically according to the current workload and system configurations. We evaluate the performance of TLA by simulations, with different workloads and system configurations, in terms of average turnaround time. Simulation results indicate that, with precise runtime information, TLA outperforms traditional processor allocation methods and has up to an 87% performance improvement. 相似文献

12.

Optimization of parallel execution for multi-join queries 总被引：5，自引：0，他引：5

Ming-Syan Chen Yu P.S. Kun-Lung Wu 《Knowledge and Data Engineering, IEEE Transactions on》1996,8(3):416-428

We study the subject of exploiting interoperator parallelism to optimize the execution of multi-join queries. Specifically, we focus on two major issues: (1) scheduling the execution sequence of multiple joins within a query, and (2) determining the number of processors to be allocated for the execution of each join operation obtained in (1). For the first issue, we propose and evaluate by simulation several methods to determine the general join sequences, or bushy trees. Despite their simplicity, the heuristics proposed can lead to the general join sequences that significantly outperform the optimal sequential join sequence. The quality of the join sequences obtained by the proposed heuristics is shown to be fairly close to that of the optimal one. For the second issue, it is shown that the processor allocation for exploiting interoperator parallelism is subject to more constraints-such as execution dependency and system fragmentation-than those in the study of intraoperator parallelism for a single join. The concept of synchronous execution time is proposed to alleviate these constraints. Several heuristics to deal with the processor allocation, categorized by bottom-up and top-down approaches, are derived and are evaluated by simulation. The relationship between issues (1) and (2) is explored. Among all the schemes evaluated, the two-step approach proposed, which first applies the join sequence heuristic to build a bushy tree as if under a single processor system, and then, in light of the concept of synchronous execution time, allocates processors to execute each join in the bushy tree in a top-down manner, emerges as the best solution to minimize the query execution time 相似文献

13.

一种同构机群系统中的处理机分配算法 总被引：5，自引：0，他引：5

温钰洪王鼎兴沈美明《软件学报》1997,8(3):161-169

机群系统的分布式计算环境为并行处理技术带来了新的研究与应用问题，正成为并行计算的热点问题．如何合理、有效地将并行任务划分到机群系统的结点上，将直接影响系统的执行性能．本文分析影响系统执行效率的执行开销因素，同时提出一个启发式的处理机分配算法. 相似文献

14.

Adaptive parallelism and Piranha

Carriero N. Freeman E. Gelernter D. Kaminsky D. 《Computer》1995,28(1):40-49

相似文献

15.

The impact of task service time variability on gang scheduling performance in a two-cluster system

Zafeirios C. Papazachos Helen D. Karatza 《Simulation Modelling Practice and Theory》2009,17(7):1276-1289

Gang scheduling is a common task scheduling policy for parallel and distributed systems which combines elements of space-sharing and time-sharing. In this paper we present a migration strategy which reduces the fragmentation in the schedule caused by gang scheduled jobs. We consider the existence of high priority jobs in the workload. These jobs need to be started immediately and they may interrupt a parallel job’s execution. A distributed system consisting of two homogeneous clusters is simulated to evaluate the performance for various workloads. We study the impact on performance of the variability in service time of the parallel tasks. Our simulation results indicate that the proposed strategy can result in a significant performance gain and that the performance improvement depends on the variability of gang tasks’ service time. 相似文献

16.

Large scale scientific computing - future directions

G.S. Patterson 《Computer Physics Communications》1982,26(3-4):217-225

Every new generation of scientific computers has opened up new areas of science for exploration through the use of more realistic numerical models or the ability to process ever larger amounts of data. Concomitantly, scientists, because of the success of past models and the wide range of physical phenomena left unexplored, have pressed computer designers to strive for the maximum performance that current technology will permit. This encompasses not only increased processor speed, but also substantial improvements in processor memory, I/O bandwidth, secondary storage and facilities to augment the scientist's ability both to program and to understand the results of a computation. Over the past decade, performance improvements for scientific calculations have come from algoeithm development and a major change in the underlying architecture of the hardware, not from significantly faster circuitry. It appears that this trend will continue for another decade. A future archetectural change for improved performance will most likely be multiple processors coupled together in some fashion. Because the demand for a significantly more powerful computer system comes from users with single large applications, it is essential that an application be efficiently partitionable over a set of processors; otherwise, a multiprocessor system will not be effective. This paper explores some of the constraints on multiple processor architecture posed by these large applications. In particular, the trade-offs between large numbers of slow processors and small numbers of fast processors is examined. Strategies for partitioning range from partitioning at the language statement level (in-the-small) and at the program module level (in-the-large). Some examples of partitioning in-the-large are given and a strategy for efficiently executing a partitioned program is explored. 相似文献

17.

Availability-based noncontiguous processor allocation policies for 2D mesh-connected multicomputers 总被引：1，自引：0，他引：1

Ismail Ababneh 《Journal of Systems and Software》2008,81(7):1081-1092

Various contiguous and noncontiguous processor allocation policies have been proposed for mesh-connected multicomputers. Contiguous allocation suffers from high external processor fragmentation because it requires that the processors allocated to a parallel job be contiguous and have the same topology as the multicomputer. The goal of lifting the contiguity condition in noncontiguous allocation is reducing processor fragmentation. However, this can increase the communication overhead because the distances traversed by messages can be longer, and messages from different jobs can interfere with each other by competing for communication resources. The extra communication overhead depends on how the allocation request is partitioned and mapped to free processors. In this paper, we investigate a new class of noncontiguous allocation schemes for two-dimensional mesh-connected multicomputers. These schemes are different from previous ones in that request partitioning is based on the submeshes available for allocation. The available submeshes selected for allocation to a job are such that a high degree of contiguity among their processors is achieved. The proposed policies are compared to previous noncontiguous policies using detailed simulations, where several common communication patterns are considered. The results show that the proposed policies can reduce the communication overhead and improve performance substantially. 相似文献

18.

Dynamic Data Reallocation for Skew Management in Shared-Nothing Parallel Databases

Abdelsalam Helal David Yuan Hesham EL-Rewini 《Distributed and Parallel Databases》1997,5(3):271-288

The shared nothing parallel database architecture is gaining wide popularity due to its scalability and increased data availability. However, in order to efficiently utilize parallelism in such architecture, independent data sets must be assigned to different processing nodes. This, of course, can initially be achieved by employing a careful partitioning scheme that allocates disjoint data sets to different processors. However, variations in the data access pattern may render some processors overloaded while others underloaded. This skewness in data access decreases the effective parallelism and eventually leads to overall performance degradation. A number of solutions have been proposed to periodically perform data re-allocation to remove the skewness in data access. Most of the proposed solutions perform either static re-allocation that requires the system to be taken off-line or dynamic, but non-transactional, re-allocation. In this paper, we introduce a dynamic and transactional re-allocation scheme based on the work on disk cooling in shared memory architecture by Scheuermann et al. The proposed scheme enhances the effective parallelism in the system regardless of the variations in the pattern of access. The proposed scheme detects access skew as it occurs and re-allocates data partitions to underloaded processing elements on the fly. Only the block being moved becomes unavailable. In addition, mutual consistency among transactions concurrent to the re-allocation event is preserved. The proposed scheme also uses replication as an additional cooling mechanism to help distribute access load over multiple replicas. We conducted a series of simulation experiments to study the behavior of shared nothing parallel database systems with and without the proposed dynamic re-allocation scheme. We also experimented with several replication strategies to measure their impact on the system performance. Finally, we studied the effect of using different concurrency control strategies on the efficiency of dynamic re-allocation. 相似文献

19.

Co-scheduling algorithms for high-throughput workload execution

Guillaume Aupy Manu Shantharam Anne Benoit Yves Robert Padma Raghavan 《Journal of Scheduling》2016,19(6):627-640

This paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several applications concurrently. We partition the original application set into a series of packs, which are executed one by one. A pack comprises several applications, each of them with an assigned number of processors, with the constraint that the total number of processors assigned within a pack does not exceed the maximum number of available processors. The objective is to determine a partition into packs, and an assignment of processors to applications, that minimize the sum of the execution times of the packs. We thoroughly study the complexity of this optimization problem, and propose several heuristics that exhibit very good performance on a variety of workloads, whose application execution times model profiles of parallel scientific codes. We show that co-scheduling leads to faster workload completion time (40 % improvement on average over traditional scheduling) and to faster response times (50 % improvement). Hence, co-scheduling increases system throughput and saves energy, leading to significant benefits from both the user and system perspectives. 相似文献

20.

On the Performance of Randomized Embedding of Reproduction Trees in Static Networks

Keqin Li 《International journal of parallel programming》2003,31(5):393-406

High performance computing requires high quality load distribution of processes of a parallel application over processors in a parallel computer at runtime such that both the maximum load and dilation are minimized. The performance of a simple randomized tree embedding algorithm that dynamically supports tree-structured parallel computations on arbitrary static networks is analyzed in this paper. The algorithm spreads newly created tree nodes to neighboring processors, which actually provides randomized dilation-1 tree embedding in static networks. We develop a linear system of equations that characterizes expected loads on all processors under the reproduction tree model, which can generate trees of arbitrary size and shape. It is shown that as the tree size becomes large, the asymptotic performance ratio of the randomized tree embedding algorithm is the ratio of the maximum processor degree to the average processor degree. This implies that the simple randomized tree embedding algorithm is able to generate high quality load distributions on virtually all static networks commonly employed in parallel and distributed computing. 相似文献