Similar Documents
20 similar documents found (search time: 656 ms)
1.
It has been suggested that parallel processing helps in the solution of difficult discrete optimization problems, in particular those that exhibit combinatorial search and require large-scale computation. By using a number of processors that are connected, coordinated, and operating simultaneously, the solutions to such problems can be obtained much more quickly. The purpose of this paper is to propose an efficient parallel hypercube algorithm for the discrete resource allocation problem (DRAP). A sequential divide-and-conquer algorithm is first proposed. The algorithm is then modified for a parallel hypercube machine by exploiting its inherent parallelism. To allocate N units of discrete resources to n agents using a d-dimensional hypercube of p = 2^d nodes, this parallel algorithm solves the DRAP in O((n/p + log_2 p)N^2) time. A simulation study is conducted on a 32-node nCUBE/2 hypercube computer to present the experimental results. The speedup factor of the parallel hypercube algorithm is found to be more significant when the number of agents in the DRAP is much greater than the number of processing nodes on the hypercube. Some issues related to load balancing, routing, scalability, and mappings of the parallel hypercube algorithm are also discussed.
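To make the divide-and-conquer structure concrete, here is a minimal serial sketch under assumed inputs: a hypothetical table f[i][x] of returns for giving agent i exactly x units (not the authors' code). The O(N^2) merge over all budget splits is the step the hypercube algorithm distributes across nodes.

```python
# Sketch of a sequential divide-and-conquer solver for the discrete
# resource allocation problem (DRAP): allocate N units to n agents to
# maximize total return. The return tables f[i][x] are assumed inputs;
# this illustrates the recurrence, not the paper's implementation.

def solve(f, lo, hi, N):
    """Best achievable return for agents lo..hi-1, for every budget 0..N."""
    if hi - lo == 1:
        return [f[lo][x] for x in range(N + 1)]
    mid = (lo + hi) // 2
    left = solve(f, lo, mid, N)     # conquer the left half of the agents
    right = solve(f, mid, hi, N)    # conquer the right half of the agents
    # Merge step: for each budget, try every split between the halves.
    # This O(N^2) merge dominates the cost and is what gets parallelized.
    return [max(left[k] + right[x - k] for k in range(x + 1))
            for x in range(N + 1)]

# Example: 4 agents, 5 units, with diminishing-returns tables.
f = [[(x * (a + 1)) ** 0.5 for x in range(6)] for a in range(4)]
print(solve(f, 0, len(f), 5))  # best total return for budgets 0..5
```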

2.
袁源  李炳法  杨杰  丁莹  彭代毅 《计算机工程》2006,32(23):193-195
After analyzing the Iterative Deepening A* (IDA*) algorithm and its potential for parallelization, this paper proposes a new parallel IDA* algorithm based on hybrid load balancing. The algorithm combines the advantages of static and dynamic load balancing and can achieve high speedup and efficiency in parallel search across many nodes. A Java RMI implementation of the algorithm is given. Experiments on a 72-node parallel machine show that the algorithm greatly improves the speedup and efficiency of parallel search.
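For reference, a minimal serial IDA* sketch — the search that the proposed algorithm parallelizes. The successor function and heuristic in the toy usage are hypothetical stand-ins.

```python
# Minimal iterative deepening A* (IDA*): repeated depth-first searches
# with an increasing bound on f = g + h. A generic illustration of the
# sequential search that the paper distributes across nodes.
import math

def ida_star(start, h, successors, is_goal):
    bound = h(start)
    path = [start]
    while True:
        t = _search(path, 0, bound, h, successors, is_goal)
        if t == "FOUND":
            return path
        if t == math.inf:          # search space exhausted, no solution
            return None
        bound = t                  # deepen to the smallest pruned f-value

def _search(path, g, bound, h, successors, is_goal):
    node = path[-1]
    f = g + h(node)
    if f > bound:
        return f                   # prune; report f for the next bound
    if is_goal(node):
        return "FOUND"
    minimum = math.inf
    for step_cost, child in successors(node):
        if child in path:          # avoid cycles on the current path
            continue
        path.append(child)
        t = _search(path, g + step_cost, bound, h, successors, is_goal)
        if t == "FOUND":
            return "FOUND"
        minimum = min(minimum, t)
        path.pop()
    return minimum

# Toy usage: shortest path from 0 to 5 on the integer line.
print(ida_star(0, h=lambda s: abs(5 - s),
               successors=lambda s: [(1, s - 1), (1, s + 1)],
               is_goal=lambda s: s == 5))
```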

3.
This paper presents the analysis of a parallel formulation of depth-first search. At the heart of this parallel formulation is a dynamic work-distribution scheme that divides the work between different processors. The effectiveness of the parallel formulation is strongly influenced by the work-distribution scheme and the target architecture. We introduce the concept of the isoefficiency function to characterize the effectiveness of different architectures and work-distribution schemes. Many researchers have considered the ring architecture to be quite suitable for parallel depth-first search. Our analytical and experimental results show that hypercube and shared-memory architectures are significantly better. The analysis of previously known work-distribution schemes motivated the design of substantially improved schemes for ring and shared-memory architectures. In particular, we present a work-distribution algorithm that guarantees close to optimal performance on a shared-memory/ω-network-with-message-combining architecture (e.g., the RP3). Much of the analysis presented in this paper is applicable to other parallel algorithms in which work is dynamically shared between different processors (e.g., parallel divide-and-conquer algorithms). The concept of isoefficiency is useful in characterizing the scalability of a variety of parallel algorithms. This work was supported by Army Research Office Grant No. DAAG29-84-K-0060 to the Artificial Intelligence Laboratory, and Office of Naval Research Grant N00014-86-K-0763 to the Computer Science Department at the University of Texas at Austin.
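Since several entries in this list rely on the isoefficiency metric, a compact textbook-style derivation may help. It uses standard definitions (serial time T_s = W, parallel time T_p, total overhead T_o), not necessarily this paper's exact notation.

```latex
% Textbook-style derivation of the isoefficiency function (standard
% definitions, not copied from the paper). W = problem size measured
% in basic operations, so T_s = W. Total overhead is everything the
% p processors do beyond the serial work:
\[
T_o(W, p) = p\,T_p - W
\]
% Efficiency then depends only on the ratio of overhead to useful work:
\[
E = \frac{T_s}{p\,T_p} = \frac{W}{W + T_o(W,p)} = \frac{1}{1 + T_o(W,p)/W}
\]
% Holding E at a fixed target and solving for W gives
\[
W = \frac{E}{1-E}\,T_o(W,p) = K\,T_o(W,p),
\]
% and the solution W(p) of this implicit equation is the isoefficiency
% function: the slower W must grow with p to keep E constant, the more
% scalable the algorithm-architecture combination.
```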

4.
The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh- and hypercube-connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm-architecture combination is defined as the rate at which the problem size must grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube than on a mesh, despite the fact that large-scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures, such as SIMD hypercube and mesh architectures and shared-memory architectures.

5.
Massively parallel computers are becoming quite common in computational fluid dynamics. In this study, a parallel algorithm for a 3-D primitive-equation coastal ocean circulation model is designed for the hypercube MIMD computer architecture. The grid is partitioned using one-dimensional domain decomposition. The code is tested on a uniform rectangular grid problem for which the model domain in each node is a cube. For the problem where the grain size (n_y) is fixed, the speedup is linear and close to ideal for P ≤ 8 processors. The overhead (F_C) increases as the number of processors increases. The background overhead is inversely proportional to the size of the grain. The slope of F_C vs. P is a measure of the fraction of non-parallel code. For the problem where the domain is fixed, the speedup is 7.8 using 8 processors and 29.6 using 32 processors. The overhead increases linearly with P; here the slope of F_C is a measure of the communication cost. The load-balancing problem is examined for a model of the Gulf of California, whose computational domain is irregular.
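A small sketch of the 1-D domain decomposition described above, assuming a generic grid split along one axis; the function names and grid sizes are illustrative, not the model's code.

```python
# One-dimensional domain decomposition: split the y-range of an
# nx x ny x nz grid into contiguous slabs, one per processor.
# A hypothetical sketch of the partitioning and its halo partners.

def decompose_1d(ny, p, rank):
    """Return the (start, stop) rows owned by `rank` out of `p` processors."""
    base, extra = divmod(ny, p)
    start = rank * base + min(rank, extra)   # first `extra` ranks get +1 row
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

def neighbors(p, rank):
    """Ranks holding the adjacent slabs (halo-exchange partners)."""
    return (rank - 1 if rank > 0 else None,
            rank + 1 if rank < p - 1 else None)

# Example: 100 y-rows over 8 nodes. Speedup stays near ideal only while
# the slabs remain (almost) equal, matching the P <= 8 regime above.
for r in range(8):
    print(r, decompose_1d(100, 8, r), neighbors(8, r))
```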

6.
A Parallel Volume Rendering Algorithm for Irregular Data Fields
Parallel algorithms are an important way to accelerate volume rendering; however, most existing parallel volume rendering algorithms target regular data fields.

7.
Ordering clones from a genomic library into physical maps of whole chromosomes presents a pivotal computational problem in genetics. Previous research has shown the physical mapping problem to be isomorphic to the NP-complete Optimal Linear Arrangement (OLA) problem, for which no polynomial-time algorithm for determining the optimal solution is known. Serial implementations of stochastic global optimization techniques such as simulated annealing yielded very good results but proved computationally intensive. The design, analysis, and implementation of coarse-grained parallel MIMD algorithms for simulated annealing on the Intel iPSC/860 hypercube are presented. Data decomposition and control decomposition strategies based on Markov chain decomposition, perturbation methods, and problem-specific annealing heuristics are proposed and applied to the physical mapping problem. A suite of parallel algorithms is implemented on an 8-node Intel iPSC/860 hypercube, exploiting the nearest-neighbor communication pattern of the Boolean hypercube topology. Convergence, speedup, and scalability characteristics of the various parallel algorithms are analyzed and discussed. Results indicate a deterioration of performance when a single Markov chain of solution states is distributed across multiple processing elements of the Intel iPSC/860 hypercube.
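As a point of reference, here is a minimal serial simulated-annealing loop for the OLA objective. The weight matrix, swap perturbation, and geometric cooling schedule are illustrative assumptions, not the paper's annealing heuristics.

```python
# Minimal serial simulated annealing for Optimal Linear Arrangement (OLA):
# find a permutation of clones minimizing sum over pairs of
# w[i][j] * |pos(i) - pos(j)|. A sketch of the serial baseline that
# coarse-grained parallel variants would distribute.
import math
import random

def ola_cost(order, w):
    """Weighted sum of pairwise distances for a clone ordering."""
    pos = {c: i for i, c in enumerate(order)}
    n = len(order)
    return sum(w[i][j] * abs(pos[i] - pos[j])
               for i in range(n) for j in range(i + 1, n))

def anneal(w, temp=10.0, cooling=0.995, steps=20000, seed=0):
    rng = random.Random(seed)
    order = list(range(len(w)))
    cost = ola_cost(order, w)
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)      # perturbation: swap two clones
        order[i], order[j] = order[j], order[i]
        new = ola_cost(order, w)
        # Metropolis criterion: always accept improvements, sometimes
        # accept uphill moves so the search can escape local minima.
        if new <= cost or rng.random() < math.exp((cost - new) / temp):
            cost = new
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
        temp *= cooling                              # geometric cooling schedule
    return order, cost

# Example: 5 clones with random symmetric overlap weights.
rng = random.Random(1)
w = [[0] * 5 for _ in range(5)]
for i in range(5):
    for j in range(i + 1, 5):
        w[i][j] = w[j][i] = rng.randint(0, 3)
print(anneal(w, steps=2000))
```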

8.
Much prior work in AI has attempted to speed up rule-based systems by parallel processing. Unfortunately, many of these results indicate that there is limited parallelism to be found when rules are applied to relatively small amounts of data; one can therefore expect much greater parallelism when rules are applied to large amounts of data. However, traditional compile-time parallelization strategies developed for main-memory-based systems do not scale to large databases. We propose a scalable strategy for the efficient parallel implementation of rule-based systems operating upon large databases. We concentrate on load balancing techniques in a synchronous model of rule execution, where the variance in runtime of the distributed sites is minimized per cycle of rule processing, thus increasing utilization and speedup. We demonstrate that static load balancing techniques are insufficient, and thus low-overhead dynamic load balancing is the key to successful scaling. We present a form of dynamic load balancing that is based upon predicting future system loads, rather than conventional demand-driven approaches that monitor current system state. We analyze a number of possible predictive dynamic load balancing protocols by isoefficiency analysis to guide the design of a parallel database rule processing system.
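One way to read "predicting future system loads" is a per-site time-series forecast. The sketch below uses exponential smoothing as a hypothetical stand-in for the paper's predictive protocols; the class and parameter names are ours.

```python
# Sketch of a predictive load balancer: instead of reacting to current
# queue lengths, forecast each site's load for the next rule-processing
# cycle and assign new work to the site with the smallest forecast.
# Exponential smoothing here is an assumed predictor, not the paper's.

class PredictiveBalancer:
    def __init__(self, n_sites, alpha=0.5):
        self.alpha = alpha                   # smoothing weight (assumed)
        self.forecast = [0.0] * n_sites      # predicted load per site

    def observe(self, measured_loads):
        """Fold the latest per-site runtimes into the forecasts."""
        for i, m in enumerate(measured_loads):
            self.forecast[i] = self.alpha * m + (1 - self.alpha) * self.forecast[i]

    def assign(self, task_cost):
        """Send the next batch of work to the least-loaded (predicted) site."""
        site = min(range(len(self.forecast)), key=self.forecast.__getitem__)
        self.forecast[site] += task_cost     # account for the new work
        return site

b = PredictiveBalancer(4)
b.observe([3.0, 1.0, 2.0, 5.0])
print(b.assign(task_cost=1.0))   # -> 1, the site with the lowest forecast
```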

9.
This paper proposes a simple yet efficient algorithm to distribute loads evenly on multiprocessor computers with hypercube interconnection networks. The algorithm was developed based upon the well-known dimension exchange method. However, the error accumulation suffered by other algorithms based on the dimension exchange method is avoided by exploiting the notion of regular distributions, which are commonly deployed for data distribution in parallel programming. The algorithm achieves a perfect load balance over P processors with an error of 1 and a worst-case time complexity of O(M log_2 P), where M is the maximum number of tasks initially assigned to each processor. Furthermore, perfect load balance is achieved over subcubes as well: once a hypercube is balanced, if the cube is decomposed into two subcubes by the lowest bit of node addresses, then the difference between the total task counts of these subcubes is at most 1.
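The underlying dimension exchange method is easy to simulate: in round k, every node pairs with the neighbor whose address differs in bit k, and the pair evens out its combined load. A sketch with assumed integer task counts follows; the naive rounding shown here is exactly where error can otherwise accumulate, which the paper's regular-distribution scheme bounds at 1.

```python
# Simulation of dimension-exchange load balancing on a d-dimensional
# hypercube: for each dimension, paired neighbors split their combined
# load as evenly as possible. Integer rounding in the split is the
# source of the residual imbalance discussed in the abstract.

def dimension_exchange(load, d):
    """load[i] = tasks on node i; len(load) == 2**d. Balances in d rounds."""
    for k in range(d):
        for i in range(len(load)):
            j = i ^ (1 << k)            # neighbor across dimension k
            if i < j:                   # handle each pair exactly once
                total = load[i] + load[j]
                load[i], load[j] = total // 2, total - total // 2
    return load

# Example: 8 nodes (d = 3); the result differs by at most one task here.
print(dimension_exchange([13, 0, 7, 2, 9, 1, 0, 4], 3))
```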

10.
A Two-Level Parallel Implementation of the Hmmsearch Program Based on a Sequence Partitioning Strategy
Hmmsearch is one of the main programs in the biological sequence analysis package HMMER; it uses a centralized dynamic load-balancing scheme. To improve its scalability, this paper proposes a two-level parallel algorithm based on a sequence partitioning strategy that adopts a distributed dynamic load-balancing scheme. An intermediate layer is added to form a "Master-subMaster-Slave" structure, remedying the shortcomings of the original "Master-Slave" structure. Experimental data show that the improved algorithm has better scalable parallel performance than the original.

11.
We present two new parallel algorithms, QSP1 and QSP2, based on sequential quicksort for sorting data on a mesh multicomputer, and analyze their scalability using the isoefficiency metric. We show that QSP2 matches the lower bound on the isoefficiency function for mesh multicomputers, while QSP1 is fairly close to optimal. Lang et al. (1) and Schnorr et al. (2) have developed parallel sorting algorithms for the mesh architecture that have either optimal (Schnorr) or close to optimal (Lang) run-time complexity for the one-element-per-processor case. Both QSP1 and QSP2 have better scalability than the scaled-down variants of these algorithms (for the case in which there are more elements than processors). We also analyze a different variant of Lang's sort which is as scalable as QSP2. We briefly discuss another metric called resource consumption. According to this metric, both QSP1 and QSP2 are superior to the variants of Lang's sort.

12.
Research on Distributed Parallel Constraint Inductive Logic Programming
Constraint inductive logic programming (CILP) is one of the main techniques for relational data mining. To improve the efficiency of CILP systems, a distributed parallel CILP algorithm is proposed based on the C3 model, meta-learning techniques, and a master-slave static load balancing strategy, and a distributed parallel CILP prototype system is implemented on a COW (cluster of workstations) architecture. Experiments show that the algorithm is efficient and achieves good load balance, high speedup, and high parallel efficiency.

13.
In this paper we present a parallel algorithm for CTL* model checking on a virtual shared-memory high-performance parallel machine architecture. The algorithm is automata-driven and follows a games approach in which one player wins if the automaton is empty while the other player wins if it is nonempty. We show how this game can be played in parallel using a dynamic load balancing technique to divide the work across the processors. The practicality and effective speedup of the algorithm are illustrated by performance graphs.

14.
The paper discusses issues pertinent to the performance analysis of massively parallel systems. A model of parallel execution based on threads of control and events is then introduced. The key ingredient of this model is a measure of communication complexity, which gives the number of events E as a function of the number of threads of control P and provides a signature of a parallel computation. Various consequences for speedup and load balancing are presented.

15.
We present a parallel algorithm for computing an optimal sequence alignment in efficient space. The algorithm is intended for a message-passing architecture with a one-dimensional-array topology. The algorithm computes an optimal alignment of two sequences of lengths M and N in O((M+N)^2/P) time and O((M+N)/P) space per processor, where the number of processors is P ≤ max(M, N). Thus, when P = max(M, N) it achieves linear speedup and requires constant space per processor. Some experimental results on an Intel hypercube are provided. This research was supported by NIH Grant LM05110 from the National Library of Medicine.
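The "efficient space" claim rests on the classic observation that the dynamic-programming recurrence for alignment needs only the previous row. A serial sketch with assumed unit scoring follows; the parallel algorithm would distribute this computation across the one-dimensional processor array.

```python
# Linear-space computation of the optimal global alignment score
# (Needleman-Wunsch recurrence, keeping only the previous DP row).
# Unit match/mismatch/gap scores are assumed for illustration; they
# are not the paper's scoring scheme.

def align_score(a, b, match=1, mismatch=-1, gap=-1):
    prev = [j * gap for j in range(len(b) + 1)]   # row 0: all gaps
    for i in range(1, len(a) + 1):
        cur = [i * gap]                            # column 0: all gaps
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,        # align a[i-1] with b[j-1]
                           prev[j] + gap,          # gap in b
                           cur[j - 1] + gap))      # gap in a
        prev = cur
    return prev[-1]

print(align_score("GATTACA", "GCATGCU"))  # optimal score under unit costs
```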

16.
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objectives of this paper are to critically assess the state of the art in the theory of scalability analysis, and to motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms that have been developed for studying scalability issues, and discuss their interrelationships. For example, we derive an important relationship between time-constrained scaling and the isoefficiency function. We point out some of the weaknesses of the existing schemes for measuring scalability, and discuss possible ways of extending them.

17.
In this paper, we present a static load balancing method for mapping the production rules of an expert system onto the processors of a message-passing multicomputer. The method uses simulated annealing to achieve a nearly optimal allocation of production rules to processor nodes. The approach balances the initial rule distribution to avoid high communication demand among processor nodes at run time. A formal mapping model is developed and a new cost function is defined for the annealing process. New heuristic swap functions and cooling policies are provided to ensure the quality of the annealing process. A software load balancing package, SIMAL, was implemented on a SUN workstation to carry out the benchmark experiments. The overhead associated with this mapping method is O(m ln m), where m is the number of rules in the production system. Two benchmark production systems, Toru-waltz and Tourney, are mapped onto a hypercube computer with 32 nodes. Experimental benchmark results verify the effectiveness of the rule mapping method. The method can be applied to distributed artificial intelligence processing or to the parallel execution of cooperating expert systems on a message-passing multicomputer.

18.
Scheduling tasks onto the processors of a parallel system is a crucial part of program parallelisation. Due to the NP-hard nature of the task scheduling problem, scheduling algorithms are based on heuristics that try to produce good rather than optimal schedules. Nevertheless, in certain situations it is desirable to have optimal schedules, for example for time-critical systems or to evaluate scheduling heuristics. This paper investigates the task scheduling problem using the A* search algorithm, a best-first state-space search. The adaptation of A* to the task scheduling problem is referred to as the A* scheduling algorithm. It can produce optimal schedules in reasonable time for small to medium sized task graphs with several tens of nodes. In comparison to a previous approach, the A* scheduling algorithm presented here has a significantly reduced search space due to a much-improved consistent and admissible cost function f(s) and additional pruning techniques. Experimental results show that the cost function and the various pruning techniques are very effective for the workload. Last but not least, the results show that the proposed A* scheduling algorithm significantly outperforms the previous approach.
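The search skeleton itself is standard A*: a priority queue keyed on f(s) = g(s) + h(s) plus duplicate pruning. A generic sketch follows; the schedule state encoding and the paper's improved cost function are not reproduced here.

```python
# Generic A* best-first state-space search with an admissible, consistent
# estimate f(s) = g(s) + h(s). For task scheduling, a state would encode
# a partial schedule; that encoding is omitted, so this is only the
# search skeleton such an algorithm builds on.
import heapq
import itertools

def a_star(start, h, expand, is_goal):
    counter = itertools.count()          # tie-breaker for equal f-values
    open_heap = [(h(start), next(counter), 0, start)]
    best_g = {start: 0}                  # pruning: best g seen per state
    while open_heap:
        f, _, g, state = heapq.heappop(open_heap)
        if is_goal(state):
            return g                     # optimal cost (schedule length)
        if g > best_g.get(state, g):     # stale queue entry; skip it
            continue
        for step_cost, child in expand(state):
            g2 = g + step_cost
            if g2 < best_g.get(child, float("inf")):
                best_g[child] = g2
                heapq.heappush(open_heap,
                               (g2 + h(child), next(counter), g2, child))
    return None                          # no goal state reachable
```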

19.
Approaches to Improving Parallel Computing Efficiency in a PVM Environment
Based on an analysis of PVM and practical application development, this paper examines the factors to consider and the general measures to take for improving the efficiency of parallel computing on a network of workstations. For cluster environments composed of multiple networks, it proposes a dynamic load balancing strategy that groups nodes by network segment, reducing the communication overhead incurred by the strategy itself through fewer inter-node load-balancing message exchanges and task migrations. Simulations show this to be an effective strategy.

20.
This paper describes and evaluates a parallel program for determining the three-dimensional structure of nucleic acids. A parallel constraint satisfaction algorithm is used to search a discrete space of shapes. Using two realistic data sets, we compare a previous sequential version of the program written in Miranda to the new sequential and parallel versions written in C, Scheme, and Multilisp, and explain how these new versions were designed to attain good absolute performance. Critical issues were the performance of floating-point operations, garbage collection, load balancing, and contention for shared data. We found that speedup was dependent on the data set. For the first data set, nearly linear speedup was observed for up to 64 processors, whereas for the second the speedup was limited to a factor of 16.
