首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
In current multiprogrammed multiprocessor systems, to take into account the performance of parallel applications is critical to decide an efficient processor allocation. In this paper, we present the performance-driven processor allocation policy (PDPA). PDPA is a new scheduling policy that implements a processor allocation policy and a multiprogramming-level policy, in a coordinated way, based on the measured application performance. With regard to the processor allocation, PDPA is a dynamic policy that allocates to applications the maximum number of processors to reach a given target efficiency. With regard to the multiprogramming level, PDPA allows the execution of a new application when free processors are available and the allocation of all the running applications is stable, or if some applications show bad performance. Results demonstrate that PDPA automatically adjusts the processor allocation of parallel applications to reach the specified target efficiency, and that it adjusts the multiprogramming level to the workload characteristics. PDPA is able to adjust the processor allocation and the multiprogramming level without human intervention, which is a desirable property for self-configurable systems, resulting in a better individual application response time.  相似文献   

2.
In this paper we analyze the algorithms expressed as a system of recurrence equations. The algorithms are called 2?1 output algorithms if two output values of one function (variable identification) are specified by the system of recurrence equations for each index point in the algorithm. The algorithm is in free form if the indexes of these two values are not dependent. Two standard classes are determined by this criteria: the nearest neighbour and the all pair form. For example the sorting algorithm can be expressed in the all pair form i.e., the linear insertion algorithm or in the nearest neighbour form i.e., the bubble sort algorithm. However these algorithms are different in their nature. A procedure to eliminate the computational broadcast for the all pair 2?1output algorithm has been proposed by the authors in [1]. The result obtained by implementing this procedure was a localized form of the algorithm and a system of uniform recurrence equations by eliminating the computational and data broadcast. So he data dependence method can be efficiently used for parallel implementations. The proposed procedure cannot be implemented directly on the nearest neighbour form algorithms. Here we show how the algorithm can be restructured into a form where the computational and data broadcast can be eliminated. These transformations result in localized algorithms. A few examples show how these algorithms can be implemented on processor arrays. For example, the Gentleman Kung triangular array [2] can be used for solving the QR decomposition algorithm for both forms of the algorithm. The implementations differ in the order of the data flow and the processor operation. We show that the implementation of the nearest neighbour algorithm is even better than the standard one.  相似文献   

3.
Allocating nodes in a concurrent computer system depends on the topology of the system. In this work, we present a number of processor allocation strategies for Hypercycle based concurrent systems. Hypercycles is a class of multidimensional interconnection networks which includes such widely used networks as the binary n-cubes, k-ary n-cubes, generalized hypercubes etc. The allocation strategies presented include a statically optimal first-fit allocation, a suboptimal-first fit, and strategies with extended search space through the inclusion of additional search lists formed by permuting the base through which a hypercycle is defined. For all these strategies, we examine their optimality and present simulation results characterizing their performance relative to each other  相似文献   

4.
We show that there is a randomizedoblivious algorithm for routing any (partial) permutation on ann ×n grid in 2n +O(logn) parallel communication steps. The queues will not grow larger than Θ(logn/log logn) with high probability. We then modify this to obtain a (nonoblivious) algorithm with the same running time such that the size of the queues is bounded by a constant with high probability. For permutations withlocality, where each packet has to travel a distance at mostL, a generalization of the algorithm routes in time proportional toL with high probability. Finally, we identify a class of meshlike networks that have optimal or near-optimal diameter. These meshes have the potential of being adapted to run existing sorting and routing algorithms with corresponding reduction in their running times.  相似文献   

5.
Effective fault tolerance techniques are essential for improving the reliability of multiprocessor systems. At the same time, fault tolerance must be achieved at high speed to meet the real-time constraints of embedded systems. While parallelism has often been exploited to increase performance, to the best of our knowledge, there has been no previously reported work on parallel reconfiguration of mesh-connected processor arrays with faults. This paper presents two parallel algorithms to accelerate reconfiguration of the processor arrays. The first algorithm reconfigures a host array in parallel in a multithreading manner. The threads in the parallel algorithm execute independently within a safe rerouting distance. The second algorithm is based on a divide-and-conquer approach to first generate the leftmost segments in parallel and then merge the segments in parallel. When compared to the conventional algorithm, simulation results from a large number of instances confirm that the proposed algorithms significantly accelerate the reconfiguration without loss of harvest.  相似文献   

6.
We show that there is a randomizedoblivious algorithm for routing any (partial) permutation on ann ×n grid in 2n +O(logn) parallel communication steps. The queues will not grow larger than (logn/log logn) with high probability. We then modify this to obtain a (nonoblivious) algorithm with the same running time such that the size of the queues is bounded by a constant with high probability. For permutations withlocality, where each packet has to travel a distance at mostL, a generalization of the algorithm routes in time proportional toL with high probability. Finally, we identify a class of meshlike networks that have optimal or near-optimal diameter. These meshes have the potential of being adapted to run existing sorting and routing algorithms with corresponding reduction in their running times.Preliminary reports of portions of the results contained in this paper have appeared in theProceedings of the 1988 Aegean Workshop on Computing [5], and in theProceedings of the 1987 Conference on Foundations of Software Technology and Theoretical Computer Science [18]. The work of the first author was supported in part by NSF Grant NSF-DCR-85-03251 and ONR contract N00014-80-C-0647. The work of the second author was supported in part by NSF Grant NSF-DCR-86-00379.  相似文献   

7.
Existing techniques for sharing the processing resources in multiprogrammed shared-memory multiprocessors, such as time-sharing, space-sharing, and gang-scheduling, typically sacrifice the performance of individual parallel applications to improve overall system utilization. We present a new processor allocation technique called Loop-Level Process Control (LLPC) that dynamically adjusts the number of processors an application is allowed to use for the execution of each parallel section of code, based on the current system load. This approach exploits the maximum parallelism possible for each application without overloading the system. We implement our scheme on a Silicon Graphics Challenge multiprocessor system and evaluate its performance using applications from the Perfect Club benchmark suite and synthetic benchmarks. Our approach shows significant improvements over traditional time-sharing and gang-scheduling. It has performance comparable to, or slightly better than, static space-sharing, but our strategy is more robust since, unlike static space-sharing, it does not require a priori knowledge of the applications' parallelism characteristics  相似文献   

8.
An efficient processor allocation policy is presented for hypercube computers. The allocation policy is called free list since it maintains a list of free subcubes available in the system. An incoming request of dimension k (2k nodes) is allocated by finding a free subcube of dimension k or by decomposing an available subcube of dimension greater than k. This free list policy uses a top-down allocation rule in contrast to the bottom-up approach used by the previous bit-map allocation algorithms. This allocation scheme is compared to the buddy, gray code (GC), and modified buddy allocation policies reported for the hypercubes. It is shown that the free list policy is optimal in a static environment, as are the other policies, and it also gives better subcube recognition ability compared to the previous schemes in a dynamic environment. The performance of this policy, in terms of parameters such as average delay, system utilization, and time complexity, is compared to the other schemes to demonstrate its effectiveness. The extension of the algorithm for parallel implementation, noncubic allocation, and inclusion/exclusion allocation is also given  相似文献   

9.
Wu  Ying-Jhih  Yu  Shuo-Ting  Lai  Kuan-Chou  Chhabra  Amit  Chang  Hsi-Ya  Huang  Kuo-Chan 《The Journal of supercomputing》2020,76(12):10212-10239
The Journal of Supercomputing - Most modern parallel programs are written with the moldable property. However, most existing parallel computing systems treat such parallel programs as rigid jobs...  相似文献   

10.
Homogeneous processor arrays are emerging in tera-scale computation and effective fault tolerance techniques are essential to improving the reliability of such complex integrated circuits. We study the degradable processor arrays to achieve fault tolerance by employing reconfiguration. Three bypass schemes and three rerouting schemes are proposed to reconfigure three-dimensional processor arrays with defective processors to achieve target arrays without faults. A heuristic algorithm is proposed to construct a target array on the selected rows and columns. It is also proved that the proposed greedy plane rerouting algorithm (GPR) produces maximum target array. In addition, the problem of constructing the communication efficient array is considered in this paper. An algorithm is proposed to refine the communication among processors within the target array constructed by GPR. Experimental study shows that the proposed algorithm GPR produces target arrays with higher harvest and lower degradation on the host arrays with fault density no more than 5%. In addition, the communication performance is significantly optimized by reducing the number of long interconnects, and the average improvement is about 34% for all cases considered in this paper.  相似文献   

11.
VLIW是DSP芯片上使用最多的一种技术,要发挥DSP芯片的性能优势,需要编译器的支持.目前关于VLLW技术的研究主要集中在如何形成更长的基本块,以及基本块之间的代码优化算法上,对于如何选择指令从而形成一个超长指令字的算法,却没有仔细地描述和实现,但这是在编译器的指令调度模块中需要具体考虑的问题,具有工程实践意义.本文通过改进编译器的lisf算法实现了支持VLIW技术的指令调度优化算法,改进的算法可以充分利用芯片的VLIW结构的优势,加速程序运行,具有较好性能.  相似文献   

12.
Two new algorithms for the distributed static resource allocation problem are presented. The first algorithm, which shows excellent average case behavior in simulation requires the maintenance of a global queue. The second, which needs only local communication, has polynomial waiting time and exhibits better average case behavior than the only other known polynomial time algorithm. Both algorithms have polynomial message complexity  相似文献   

13.
14.
Examines a distributed algorithm where the processors may fail in a random fashion. This results in a model with random communication delays. Convergence conditions are derived. Extensions of the analysis and results to cases where the random processor failures are perceived and corrected within random time intervals are possible. For the sake of simplicity, the analysis is presented for a two processor model for solving a system of linear equations  相似文献   

15.
In this paper we present algorithms for the solution of two server (machine) allocation problems that occur in manufacturing networks. The manufacturing network is modelled as an open network of queues with general interarrival time and service time distributions. The queueing network is analyzed by using the parametric decomposition method: a two-moment approximation scheme. The server allocation problems are solved by means of a marginal analysis scheme. Numerical results on two manufacturing networks are presented.  相似文献   

16.
17.
Various contiguous and noncontiguous processor allocation policies have been proposed for mesh-connected multicomputers. Contiguous allocation suffers from high external processor fragmentation because it requires that the processors allocated to a parallel job be contiguous and have the same topology as the multicomputer. The goal of lifting the contiguity condition in noncontiguous allocation is reducing processor fragmentation. However, this can increase the communication overhead because the distances traversed by messages can be longer, and messages from different jobs can interfere with each other by competing for communication resources. The extra communication overhead depends on how the allocation request is partitioned and mapped to free processors. In this paper, we investigate a new class of noncontiguous allocation schemes for two-dimensional mesh-connected multicomputers. These schemes are different from previous ones in that request partitioning is based on the submeshes available for allocation. The available submeshes selected for allocation to a job are such that a high degree of contiguity among their processors is achieved. The proposed policies are compared to previous noncontiguous policies using detailed simulations, where several common communication patterns are considered. The results show that the proposed policies can reduce the communication overhead and improve performance substantially.  相似文献   

18.
Processor Allocator (PA) is a crucial factor in modern Chip MultiProcessors (CMPs). A modern CMP uses Network on Chip (NoC) as communication technique between cores. Thus, the topology of the implemented NoC has also significant impact on the CMP’s performance. A good processor allocation technique needs to be fast and ensure the highest possible system utilization. In this paper, we propose a processor allocation technique for such an efficient and fast PA. The PA is driven by a Bit Map Allocation for Torus (BMAT) algorithm, which is a technique designed for k-ary 2-cube topology. The proposed BMAT scheme is presented and described along with a new Busy List Allocation for Torus (BLAT), Sorting Allocation for Torus (SAT) and Stack Based Allocation for Torus (SBAT) algorithms. The presented techniques are compared with previously known important schemes for k-ary 2-mesh topology. The research ideas have been verified using experiments that have also been described in the paper. The presented simulation results reveal that the proposed processor allocation algorithm for k-ary 2-cube, as a technique for PA, achieves better allocation time than all other existing algorithms while the CMP with such a PA is characterized by very high system utilization.  相似文献   

19.
Scheduling is a fundamental issue in achieving high performance on metacomputers and computational grids. For the first time, the job scheduling problem for grid computing on metacomputers is studied as a combinatorial optimization problem. A cost model is proposed for modeling communication heterogeneity on computational grids. A processor allocation algorithm is developed which always finds an optimal processor allocation that minimizes the effective execution time of a job when the job is being scheduled. It is proven that the list scheduling (LS) algorithm can achieve reasonable worst-case performance bound in grid environments supporting distributed supercomputing with large applications. We compare the performance of various job scheduling and processor allocation algorithms for grid computing on metacomputers. We evaluate the performance of 128 combinations of two job scheduling algorithms, four initial job ordering strategies, four processor allocation algorithms, and four metacomputers by extensive simulation. It is found that the combination of largest job first (LJF) initial job ordering and minimum effective execution time (MEET) or largest machine first (LMF) processor allocation algorithm yields the best average-case performance, and the choice of FCFS and LS depends on the range of job sizes. It is also observed that communication heterogeneity does have significant impact on schedule lengths.  相似文献   

20.
Consider the problem of scheduling a set of preemptible tasks in one or more processor systems. The task system consists of a set of independent tasks or a task set with precedence relations. Each task is characterized by execution time and deadline. This article presents scheduling algorithms that guarantee all time constraints. These algorithms are so easy to implement that they can be used in real-time operating systems. An overview is given for the different feasible scheduling algorithms of some task and processor systems.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号