Similar Documents (20 results)
1.
In this paper, we develop load balancing strategies for scalable high-performance parallel A* algorithms suitable for distributed-memory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algorithm) grow with the number of processors P used, thus restricting its scalability. To alleviate this effect, we propose a novel parallel startup phase and an efficient dynamic load balancing strategy called the quality equalizing (QE) strategy. Our new parallel startup scheme executes optimally in Θ(log P) time and, in addition, achieves good initial load balance. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that enable it to significantly reduce starvation and nonessential work. Consequently, we obtain a highly scalable parallel A* algorithm with an almost-linear speedup. The startup and load balancing schemes were employed in parallel A* algorithms to solve the Traveling Salesman Problem on an nCUBE2 hypercube multicomputer. The QE strategy yields average speedup improvements of about 20-185% and 15-120% at low and intermediate work densities (the ratio of the problem size to P), respectively, over three well-known load balancing methods: the round-robin (RR), the random communication (RC), and the neighborhood averaging (NA) strategies. The average speedup observed on 1024 processors is about 985, representing a very high efficiency of 0.96. Finally, we analyze and empirically evaluate the scalability of parallel A* algorithms in terms of the isoefficiency metric. Our analysis gives (1) a Θ(P log P) lower bound on the isoefficiency function of any parallel A* algorithm, and (2) a general expression for the upper bound on the isoefficiency function of our parallel A* algorithm using the QE strategy on any topology; for the hypercube and 2-D mesh architectures the upper bounds on the isoefficiency function are found to be Θ(P log²P) and Θ(P[formula]), respectively. Experimental results validate our analysis, and also show that parallel A* search has better scalability using the QE load balancing strategy than using the RR, RC, or NA strategies.
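The efficiency figure quoted above follows directly from the standard relation between speedup and efficiency (a textbook definition, not a result specific to this paper):

```latex
E \;=\; \frac{S}{P} \;=\; \frac{985}{1024} \;\approx\; 0.96
```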

2.
Load balancing involves assigning to each processor work proportional to its performance, thereby minimizing the execution time of a program. Although static load balancing can solve many problems (e.g., those caused by processor heterogeneity and nonuniform loops) for most regular applications, the transient external load due to multiple users on a network of workstations necessitates a dynamic approach to load balancing. In this paper we show that different load balancing schemes are best for different applications under varying program and system parameters. Therefore, application-driven customized dynamic load balancing becomes essential for good performance. We present a hybrid compile-time and run-time modeling and decision process which selects (customizes) the best scheme, along with automatic generation of parallel code with calls to a run-time library for load balancing.
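A run-time decision step of the general kind described can be reduced to evaluating a per-scheme cost model and picking the cheapest scheme. The sketch below is purely illustrative: the model functions, parameter names, and numbers are assumptions, not the paper's actual models.

```python
def choose_scheme(cost_models, params):
    """Return the name of the load balancing scheme whose cost model
    predicts the lowest execution time for the measured parameters.

    cost_models: dict mapping a scheme name to a function params -> seconds
    params:      measured program/system parameters (iteration count,
                 mean iteration cost, processor count, overheads, ...)
    """
    return min(cost_models, key=lambda name: cost_models[name](params))

# Hypothetical example with two toy cost models.
models = {
    "static-block":  lambda p: p["iters"] * p["mean_cost"] / p["procs"],
    "dynamic-chunk": lambda p: p["iters"] * p["mean_cost"] / p["procs"] * 0.9
                               + p["iters"] / p["chunk"] * p["overhead"],
}
params = {"iters": 10**6, "mean_cost": 2e-6, "procs": 8,
          "chunk": 256, "overhead": 5e-6}
print(choose_scheme(models, params))   # -> dynamic-chunk for these numbers
```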

3.
We present new methods for load balancing of unstructured tree computations on large-scale SIMD machines, and analyze the scalability of these and other existing schemes. An efficient formulation of tree search on an SIMD machine consists of two major components: a triggering mechanism, which determines when the search space redistribution must occur to balance the search space over processors, and a scheme to redistribute the search space. We have devised a new redistribution mechanism and a new triggering mechanism. Either of these can be used in conjunction with triggering and redistribution mechanisms developed by other researchers. We analyze the scalability of these mechanisms and verify the results experimentally. The analysis and experiments show that our new load-balancing methods are highly scalable on SIMD architectures. Their scalability is shown to be no worse than that of the best load-balancing schemes on MIMD architectures. We verify our theoretical results by implementing the 15-puzzle problem on a CM-2 SIMD parallel computer.
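A triggering mechanism of the general kind mentioned can be as simple as monitoring the fraction of processors that still hold work and firing a redistribution when it drops below a threshold. The function below is only an illustration of the idea; the threshold value and names are assumptions, not the paper's actual trigger.

```python
def should_redistribute(num_active, num_procs, activity_threshold=0.75):
    """Trigger a global search-space redistribution once the fraction of
    SIMD processors that still hold unexpanded nodes falls below the
    given threshold (illustrative default: 75%)."""
    return (num_active / num_procs) < activity_threshold

print(should_redistribute(600, 1024))   # True: only ~59% of processors busy
```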

4.
Distributed virtual environments (DVEs) have become very popular in recent years, due to the rapid growth of applications such as massively multiplayer online games (MMOGs). As the number of concurrent users increases, scalability becomes one of the major challenges in designing an interactive DVE system. One solution to address this scalability problem is to adopt a multi-server architecture. While some methods focus on the quality of partitioning the load among the servers, others focus on the efficiency of the partitioning process itself. However, all these methods neglect the effect of network delay among the servers on the accuracy of the load balancing solutions. As we show in this paper, the change in the load of the servers due to network delay affects the performance of the load balancing algorithm. In this work, we conduct a formal analysis of this problem and discuss two efficient delay adjustment schemes to address it. Our experimental results show that our proposed schemes can significantly improve the performance of the load balancing algorithm with negligible computation overhead.

5.
We consider the problem of reducing data traffic among processor nodes during the parallel factorization of a sparse matrix on a hypercube multiprocessor. A task assignment strategy based on the structure of an elimination tree is presented. This assignment is aimed at achieving load balancing among the processors and also reducing the amount of processor-to-processor data communication. An analysis of regular grid problems is presented, providing a bound on communication volume generated by the new strategy, and showing that the allocation scheme is optimal in the asymptotic sense. Some experimental results on the performance of this scheme are presented.

6.
The efficient implementation of particle-in-cell (PIC) plasma simulation codes on distributed memory concurrent computers is made difficult by the conflicting aims of balancing the computational load and minimizing interprocessor communication. This paper surveys previous work on PIC plasma simulation codes on advanced architecture computers, identifies the main issues in parallelizing such codes, and discusses different decomposition schemes. For MIMD concurrent computers the adaptive Eulerian (AE) decomposition scheme is attractive because it seeks to maintain approximate load balance dynamically while avoiding excessive non-local communication. The load balance and communication characteristics of a large-scale AE/PIC code on the hypercube are investigated by simulating the behavior of the parallel code on a Cray-2. The results show that for the three-dimensional problems studied, in which the particle distribution is highly inhomogeneous, the communication of data between the particle and mesh decompositions dominates the communication overhead. Performing the load balancing may also make a significant contribution to the concurrent overhead if the grain size is sufficiently small. The advantages of the simulation approach in investigating the behavior of concurrent large-scale applications are discussed, together with portability and software engineering issues.

7.
Multidisciplinary design optimization (MDO) for large-scale engineering problems poses many challenges (e.g. the design of an efficient concurrent paradigm for global optimization based on disciplinary analyses, expensive computations over vast data sets, etc.). This work focuses on the application of distributed schemes for massively parallel architectures to MDO problems, as a tool for reducing computation time and solving larger problems. The specific problem considered here is configuration optimization of a high speed civil transport (HSCT), and the efficient parallelization of the embedded paradigm for reasonable design space identification. Two distributed dynamic load balancing techniques (random polling and global round robin with message combining) and two necessary termination detection schemes (global task count and token passing) were implemented and evaluated in terms of effectiveness and scalability to large problem sizes and a thousand processors. The effect of certain parameters on execution time was also inspected. Empirical results demonstrated stable performance and effectiveness for all schemes, and the parametric study showed that the selected algorithmic parameters have a negligible effect on performance. Copyright © 1999 John Wiley & Sons, Ltd.
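For readers unfamiliar with the first of the two load distribution techniques named above, the following is a minimal single-threaded simulation of random polling, in which an idle worker asks a randomly chosen peer for half of its pending tasks. It is an illustrative sketch of the general technique under assumed data structures, not the parallel implementation evaluated in the paper.

```python
import random

def random_polling_step(queues, idle_id, rng=random):
    """One random-polling step: the idle worker `idle_id` picks a random
    victim and, if the victim has work, steals half of its task queue."""
    candidates = [i for i in range(len(queues)) if i != idle_id]
    victim = rng.choice(candidates)
    if queues[victim]:
        half = len(queues[victim]) // 2 or 1
        queues[idle_id].extend(queues[victim][-half:])
        del queues[victim][-half:]
        return True          # work was transferred
    return False             # victim was empty; caller may poll again

# Tiny demonstration: 4 workers, all work initially on worker 0.
queues = [list(range(12)), [], [], []]
for worker in (1, 2, 3):
    random_polling_step(queues, worker)
print([len(q) for q in queues])
```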

8.
The response time is the most important factor determining user experience in the service provision model involving server clusters. However, traditional server cluster load balancing schemes are limited by hardware conditions and cannot fully exploit server response times for load balancing. To address the limitations of traditional load balancing schemes, we propose a load balancing scheme based on server response times, named LBBSRT, which takes advantage of the flexibility of SDN. Using the real-time response time of each server, as measured by the controller, for load balancing, user requests are processed so that server loads remain evenly balanced. Simulation experiments show that our scheme exhibits a better load balancing effect and processes requests with the minimum average server response time. In addition, our scheme is easy to implement, and exhibits good scalability and low cost.
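The core decision of a response-time-based scheme like the one described, selecting the backend with the smallest measured response time, can be sketched in a few lines. The function below is an illustration under assumed data structures, not the LBBSRT controller logic itself.

```python
def pick_server(response_times_ms):
    """Pick the server with the smallest measured response time.

    `response_times_ms` maps a server address to its latest response-time
    sample in milliseconds, e.g. as collected periodically by an SDN
    controller probing each backend.
    """
    if not response_times_ms:
        raise ValueError("no servers available")
    return min(response_times_ms, key=response_times_ms.get)

# Example: the controller would then direct the next batch of client
# requests to the chosen server.
samples = {"10.0.0.1": 12.4, "10.0.0.2": 7.9, "10.0.0.3": 15.1}
print(pick_server(samples))   # -> 10.0.0.2
```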

9.
This paper presents the analysis of a parallel formulation of depth-first search. At the heart of this parallel formulation is a dynamic work-distribution scheme that divides the work between different processors. The effectiveness of the parallel formulation is strongly influenced by the work-distribution scheme and the target architecture. We introduce the concept of isoefficiency function to characterize the effectiveness of different architectures and work-distribution schemes. Many researchers considered the ring architecture to be quite suitable for parallel depth-first search. Our analytical and experimental results show that hypercube and shared-memory architectures are significantly better. The analysis of previously known work-distribution schemes motivated the design of substantially improved schemes for ring and shared-memory architectures. In particular, we present a work-distribution algorithm that guarantees close to optimal performance on a shared-memory/network-with-message-combining architecture (e.g. RP3). Much of the analysis presented in this paper is applicable to other parallel algorithms in which work is dynamically shared between different processors (e.g., parallel divide-and-conquer algorithms). The concept of isoefficiency is useful in characterizing the scalability of a variety of parallel algorithms. This work was supported by Army Research Office Grant No. DAAG29-84-K-0060 to the Artificial Intelligence Laboratory, and Office of Naval Research Grant N00014-86-K-0763 to the Computer Science Department at the University of Texas at Austin.
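For reference, the isoefficiency function introduced here is commonly stated as follows (standard definitions from the scalability literature, not a result specific to this paper). With sequential time $T_1$, parallel time $T_P$ on $P$ processors, total overhead $T_o = P\,T_P - T_1$, and efficiency $E = T_1/(P\,T_P)$, keeping $E$ fixed as $P$ grows requires the problem size to grow so that

```latex
T_1 \;=\; \frac{E}{1-E}\, T_o(T_1, P)
```

and the solution $T_1 = f(P)$ of this relation is the isoefficiency function; a faster-growing $f$ indicates poorer scalability.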

10.
In this paper, we present a game theoretic framework for obtaining a user-optimal load balancing scheme in heterogeneous distributed systems. We formulate the static load balancing problem in heterogeneous distributed systems as a noncooperative game among users. For the proposed noncooperative load balancing game, we present the structure of the Nash equilibrium. Based on this structure we derive a new distributed load balancing algorithm. Finally, the performance of our noncooperative load balancing scheme is compared with that of other existing schemes. The main advantages of our load balancing scheme are the distributed structure, low complexity and optimality of allocation for each user.

11.
In this paper the performance of the Intel iPSC/2 hypercube multiprocessor is analyzed. Computation and communication performance for a number of benchmarks are presented. We derive some fundamental performance parameters of the machine. Further, we investigate the difference between several communication schemes. Using the results of our measurements we can highlight some features and peculiarities in the iPSC/2 hardware and software. Where possible we make a comparison with the iPSC/1 and Ncube hypercubes.

12.
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objectives of this paper are to critically assess the state of the art in the theory of scalability analysis, and to motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms that have been developed for studying scalability issues, and discuss their interrelationships. For example, we derive an important relationship between time-constrained scaling and the isoefficiency function. We point out some of the weaknesses of the existing schemes for measuring scalability, and discuss possible ways of extending them.

13.
Existing certificateless proxy signature schemes based on classical number-theoretic hardness assumptions cannot resist attacks by quantum computers, and they suffer from limitations such as single points of failure and poor scalability when applied to systems with large numbers of users. To address these problems, a lattice-based hierarchical certificateless proxy signature scheme is proposed. First, rejection sampling and trapdoor-free techniques are adopted to improve the computational efficiency of key generation. Second, original signers and proxy signers at different levels authenticate each other by exchanging randomly chosen matrices, thereby realizing proxy delegation. Finally, the security of the scheme is proven under the small integer solution (SIS) hardness assumption in the random oracle model. Compared with existing proxy signature schemes, the proposed scheme allows signers to come from different levels and to be affiliated with different key generation centers (KGCs). Performance evaluation results show that the public key size of the scheme is constant, that the proxy signing and verification overheads are independent of the hierarchy level, and that the sizes of the proxy key and of the signature are not linear in the number of levels. The scheme therefore better satisfies the load balancing requirements of large-scale distributed heterogeneous networks and is efficient and practical.

14.
We present a fully distributed dynamic load balancing algorithm for parallel MIMD architectures. The algorithm can be described as a system of identical parallel processes, each running on a processor of an arbitrarily interconnected network of processors. We show that the algorithm can be interpreted as a Poisson (heat) equation on a graph. This equation is analysed using Markov chain techniques and is proved to converge in polynomial time, resulting in a global load balance. We also discuss some important parallel architectures and interconnection schemes such as linear processor arrays, tori, hypercubes, etc. Finally we present two applications where the algorithm has been successfully embedded (process mapping and molecular dynamics simulation).
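A minimal sketch of the kind of diffusion iteration described, written here as a synchronous update on an arbitrary undirected graph with a uniform diffusion coefficient; the actual algorithm in the paper is fully distributed and analysed via Markov chains, so this is only an illustration of the basic idea.

```python
def diffusion_step(load, neighbors, alpha):
    """One synchronous diffusion step on an undirected graph.

    load:      list of current loads, one entry per processor
    neighbors: adjacency list, neighbors[i] = indices adjacent to i
    alpha:     diffusion coefficient (choose alpha <= 1/(max_degree + 1)
               so the iteration is stable); total load is conserved.
    """
    new_load = list(load)
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            new_load[i] += alpha * (load[j] - load[i])
    return new_load

# 4-node ring (a 1-D torus), all load initially on node 0.
neighbors = [[1, 3], [0, 2], [1, 3], [0, 2]]
load = [100.0, 0.0, 0.0, 0.0]
for _ in range(50):
    load = diffusion_step(load, neighbors, alpha=0.25)
print([round(x, 2) for x in load])   # approaches [25.0, 25.0, 25.0, 25.0]
```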

15.
Scalability is a primary issue for existing sequential pattern mining algorithms when dealing with a large amount of data. Previous work, namely sequential pattern mining on the cloud (SPAMC), has already addressed the scalability problem. It supports the MapReduce cloud computing architecture for mining frequent sequential patterns on large datasets. However, this existing algorithm does not address the iterative mining problem, i.e., the problem that repeatedly reloading data incurs additional costs. Furthermore, it does not study the load balancing problem. To remedy these problems, we devised a powerful sequential pattern mining algorithm, the sequential pattern mining in the cloud-uniform distributed lexical sequence tree algorithm (SPAMC-UDLT), exploiting MapReduce and streaming processes. SPAMC-UDLT dramatically improves overall performance without launching multiple MapReduce rounds and provides perfect load balancing across machines in the cloud. The results show that SPAMC-UDLT can significantly reduce execution time, achieves extremely high scalability, and provides much better load balancing than existing algorithms in the cloud.

16.
沈耿彪  李清  江勇  汪漪  徐明伟 《软件学报》2020,31(7):2221-2244
Data center networks are key infrastructure for modern networking and cloud computing, and load balancing in data center networks is essential for sustaining network throughput and improving the quality of service. This paper first analyzes the differences between data center networks and the traditional Internet, and summarizes how their characteristics and particularities can be exploited when designing load balancing schemes. It then analyzes the challenges facing load balancing design from the perspective of the complexity and diversity of data centers. Existing data center network load balancing schemes are categorized by the layer at which they are implemented, namely network-layer, transport-layer, application-layer, and integrated schemes; their advantages and disadvantages are compared, and they are evaluated comprehensively in terms of control structure, load balancing granularity, congestion awareness mechanism, load balancing policy, scalability, and deployment difficulty. Finally, existing data center network load balancing schemes are summarized and possible future research directions are pointed out.

17.
Diffusion Schemes for Load Balancing on Heterogeneous Networks
Several different diffusion schemes have previously been developed for load balancing on homogeneous processor networks. We generalize existing schemes, in order to deal with heterogeneous networks. Generalized schemes may operate efficiently on networks where each processor can have arbitrary computing power, i.e., the load will be balanced proportionally to these powers. The balancing flow that is calculated by schemes for homogeneous networks is minimal with regard to the l2-norm and we prove this to hold true for generalized schemes, too. We demonstrate the usability of generalized schemes by a number of experiments on several heterogeneous networks.
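One common way to write such a generalization (given here purely as an illustration of the idea, not necessarily the exact scheme of the paper) is to diffuse the normalized load $x_i/c_i$, where $c_i$ is the computing power of processor $i$ and $\alpha_{ij} = \alpha_{ji} > 0$ are edge weights chosen small enough for convergence:

```latex
x_i^{(t+1)} \;=\; x_i^{(t)} \;-\; \sum_{j \in N(i)} \alpha_{ij} \left( \frac{x_i^{(t)}}{c_i} - \frac{x_j^{(t)}}{c_j} \right)
```

Because the flow on each edge is antisymmetric, total load is conserved, and at a fixed point on a connected network all $x_i/c_i$ are equal, i.e., each processor holds load proportional to its power; with $c_i \equiv 1$ this reduces to the usual homogeneous diffusion scheme.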

18.
This paper describes a software architecture designed as a support for tackling the load distribution problem when solving complex problems on concurrent processors. We have considered transputer-based MIMD multiprocessors as concurrent processors and a simulator for biologically inspired neural networks as a case study. Biologically inspired neural networks are characterized by having many thousands of neurons and synapses and topologically based connection schemes. It has been our main aim to give the user the possibility of simply defining and modifying widely differing load distribution strategies, in order to make it possible to deal with a broad range of neural network architectures and processor topologies. Furthermore we provide a real tool for hiding communication delays.

19.
One of the most significant causes for performance degradation of scientific and engineering applications on high performance computing systems is the uneven distribution of the computational work to the resources of the system. This effect, which is known as load imbalance, is even more noticeable in the case of irregular applications and heterogeneous distributed systems. This motivated the parallel and distributed computing research community to focus on methods that provide good load balancing for scientific and engineering applications running on (heterogeneous) distributed systems. Efficient load balancing and scheduling methods are employed for scientific applications from various fields, such as mechanics, materials, physics, chemistry, biology, applied mathematics, etc. Such applications typically employ a large number of computational methods in order to simulate complex phenomena, on very large scales of time and magnitude. These simulations consist of routines that perform repetitive computations (in the form of DO/FOR loops) over very large data sets, which, if not properly implemented and executed, may suffer from poor performance. The number of repetitive computations in the simulation codes is not always constant. Moreover, the computational nature of these simulations may be in fact irregular, leading to the case when one computation takes (unpredictably) more time than others. For successful and timely results, large scale simulations require the use of large scale computing systems, which often are widely distributed and highly heterogeneous. Moreover, large scale computing systems are usually shared among multiple users, which causes the quality and quantity of the available resources to be highly unpredictable. There are numerous load balancing methods in the literature for different parallel architectures. The most recent of these methods typically follow the master-worker paradigm, where a single coordinator (master) is responsible for making all the scheduling decisions based on information provided by the workers. Depending on the application requirements, the scheduling policy and the computational environment, the benefits of this paradigm may be limited as follows: (1) its efficiency may not scale as the number of processors increases, and (2) it is quite probable that the scheduling decisions are made based on outdated information, especially on systems where the workload changes rapidly. In an effort to address these limitations, we propose a distributed (master-less) load balancing scheme, in which the scheduling decisions are made by the workers in a distributed fashion. We implemented this method along with two other master-worker schemes (a previously existing one and a recently modified one) for three different scientific computational kernels. In order to validate the usefulness and efficiency of the proposed scheme, we conducted a series of comparative performance tests with the two master-worker schemes for each computational kernel. The target system is an SMP cluster, on which we simulated three different patterns of system load fluctuation. The experiments strongly support the belief that the distributed approach offers greater performance and better scalability on such systems, showing an overall improvement ranging from 13% to 24% over the master-worker approaches.
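As a toy illustration of the difference between the two paradigms discussed, the sketch below lets worker threads schedule themselves by atomically grabbing chunks of loop iterations from a shared counter, so no central master makes assignment decisions. It is a shared-memory analogue written under simplifying assumptions, not the distributed scheme evaluated in the paper.

```python
import threading

def run_self_scheduled(total_iters, num_workers, chunk, work_fn):
    """Workers repeatedly grab the next `chunk` of iterations themselves;
    there is no master process handing out assignments."""
    next_index = 0
    lock = threading.Lock()
    partial = [0.0] * num_workers

    def worker(wid):
        nonlocal next_index
        while True:
            with lock:                        # atomic "grab a chunk"
                start = next_index
                next_index += chunk
            if start >= total_iters:
                return
            end = min(start + chunk, total_iters)
            for i in range(start, end):
                partial[wid] += work_fn(i)    # execute the loop body

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)

# Example: sum of squares of 0..9999, computed by 4 self-scheduling workers.
print(run_self_scheduled(10_000, 4, chunk=128, work_fn=lambda i: i * i))
```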

20.
In this paper, we study parallel branch and bound on fine grained hypercube multiprocessors. Each processor in a fine grained system has only a very small amount of memory available. Therefore, current parallel branch and bound methods for coarse grained systems ( 1000 nodes) cannot be applied, since all these methods assume that every processor stores the path from the node it is currently processing back to the node where the process was created (the back-up path). Furthermore, the much larger number of processors available in a fine grained system makes it imperative that global information (e.g. the current best solution) is continuously available at every processor; otherwise the amount of unnecessary search would become intolerable. We describe an efficient branch-and-bound algorithm for fine grained hypercube multiprocessors. Our method uses a global scheme where all processors collectively store all back-up paths such that each processor needs to store only a constant amount of information. At each iteration of the algorithm, all current nodes may decide whether they need to create new children, be pruned, or remain unchanged. We describe an algorithm that, based on these decisions, updates the current back-up paths and distributes global information in O(log m) steps, where m is the current number of nodes. This method also includes dynamic allocation of search processes to processors and provides optimal load balancing. Even if very drastic changes in the set of current nodes occur, our load balancing mechanism does not suffer any slow down.
