首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 312 毫秒
1.
Keqin   《Performance Evaluation》2005,60(1-4):141-163
The problem of dynamic tree embedding in static networks is studied in this paper. We provide a unified framework for studying the performance of randomized tree embedding algorithms which allow a newly created tree node to take a random walk of short distance to reach a processor nearby. In particular, we propose simple randomized algorithms on several most common and important static networks, including d-dimensional meshes, d-dimensional tori, and hypercubes. It is shown that these algorithms, which have small constant dilation, are asymptotically optimal for embedding healthy trees. Our analysis technique is based on random walks on static networks. Hence, analytical expressions for expected load on all the processors are available.  相似文献   

2.
There are many parallel computations that are tree structured. The structure of a tree is usually unpredictable at compiler-time; the tree grows gradually during the course of a computation. The dynamic tree embedding problem is to distribute the processes of a parallel computation over processors in a parallel computer such that processors perform roughly the same amount of computation, and that communicating processes are assigned to processors that are close to each other. In this paper, we establish lower bounds for the performance ratio of dynamic tree embedding in bipartite static networks, including numerous important networks such asn-dimensional meshes,n-dimensional tori,k-aryn-cubes, cube-connected cycles, and butterflies. Our lower bounds are obtained from both tree and network properties and are applicable to a general class of dynamic tree embedding algorithms.  相似文献   

3.
为了在多核处理器上充分利用多核资源以提升挖掘性能,提出了一种动态与静态任务分配机制相结合的基于多核的并行序列模式挖掘算法。该算法采用数据并行与任务并行相结合的策略,在各处理器核生成局部序列模式后,再与其他处理器核协同,以最终获得所有的全局序列模式。算法通过并行局部归约技术消除了局部序列的重复生成与计算,并可结合静态与动态任务分配机制解决处理器的负载不均衡问题。理论分析和实验都证实了该算法可有效利用多核计算平台及多核体系结构优势,具有较高的运行效率和加速比。  相似文献   

4.
The problem of maximal clique enumeration (MCE) is to enumerate all of the maximal cliques in a graph. Once enumerated, maximal cliques are widely used to solve problems in areas such as 3-D protein structure alignment, genome mapping, gene expression analysis, and detection of social hierarchies. Even the most efficient serial MCE algorithms require large amounts of time to enumerate the maximal cliques in networks arising from these problems that contain hundreds, thousands, or larger numbers of vertices. The previous attempts to provide practical solutions to the MCE problem through parallel implementation have had limited success, largely due to a number of challenges inherent to the nature of the MCE combinatorial search space. On the one hand, MCE algorithms often create a backtracking search tree that has a highly irregular and hard-or-impossible to predict structure; therefore, almost any static decomposition of the search tree by parallel processors results in highly unbalanced processor execution times. On the other hand, the data-intensive nature of the MCE problem often makes naive dynamic load distribution strategies that require extensive data movement prohibitively expensive. As a result, good scaling of the overall execution time of parallel MCE algorithms has been reported for only up to a couple hundred processors. In this paper, we propose a parallel, scalable, and memory-efficient MCE algorithm for distributed and/or shared memory high performance computing architectures, whose runtime scales linearly for thousands of processors on real-world application graphs with hundreds and thousands of nodes. Its scalability and efficiency are attributed to the proposed: (a) representation of the search tree decomposition to enable parallelization; (b) parallel depth-first backtracking search to both constrain the search space and minimize memory requirement; (c) least stringent synchronization to minimize data movement; and (d) on-demand work stealing intelligently coupled with work stack splitting to minimize computing elements’ idle time. To the best of our knowledge, the proposed parallel MCE algorithm is the first to achieve a linear scaling runtime using up to 2048 processors on Cray XT machines for a number of real-world biological networks.  相似文献   

5.
异构计算中的负载共享   总被引:18,自引:0,他引:18  
曾国荪  陆鑫达 《软件学报》2000,11(4):551-556
在基于消息传递的异构并行计算系统中 ,各处理器或计算机具有自制和独立地调度、执行作业的能力 .当一个可划分的作业初始位于一个处理器上时 ,为了提高计算性能 ,该处理器可以请求其他异构处理器负载共享 ,参与协同计算 ,减少作业的完成时间 .该文提出了异构计算负载共享的一种方案 .首先 ,调用负载共享协议 ,收集当前各处理器参与负载共享的许可数据 ,包括共享时间段、计算能力等 .然后 ,构造一个作业量与作业完成时间之间的关系函数 .该函数是选择一组合适的处理器群、优化作业划分、作业完成时间最小的理论基础 .最  相似文献   

6.
Restricted computational capacity has become a key factor in limiting the development of a majority of distributed basin models. Parallel computing is one of the most effective methods for solving this problem. Although many parallel-computing methods have been employed in basin models, few studies have carried out theoretical research on parallel characteristics of river basins. In this paper, the drainage network of river basins is treated as a binary-tree structure. Using the binary-tree theory, we find that there exists a maximum speedup curve (MSC) for an arbitrary drainage network. The x-coordinate of the MSC represents the number of processors used during the computing, while the y-coordinate corresponds to the maximum speedup ratio (MSR) that can be obtained. Under several essential assumptions, the theoretical function of MSC is established. The function indicates that the MSC consists of an ascending section and a horizontal section. A parallel algorithm capable of acquiring the MSC is proposed as well. Using this algorithm, the MSC is tested at two different-resolution drainage networks of the Lhasa River Basin. A 2-year rainfall-runoff process is simulated. The results prove the existence of MSC. However, primarily influenced by the load imbalance of subbasins, the simulation values of MSR are usually smaller than the theoretical ones.  相似文献   

7.
This paper analyzes the effect of communication delay on the optimal distribution of processing loads in distributed computing networks. The processing load is assumed to satisfy the property of arbitrary divisibility. The objective is to divide and distribute this processing load among various processors in the network in order to minimize the processing time. An asymptotic analysis of the performance of such networks is carried out to obtain a limit on the performance enhancement obtained by using additional processors. The architectures considered are linear and single-level tree configurations. The cases when the processors are equipped with and without front-ends are considered.  相似文献   

8.
There is extensive literature concerning the divisible load theory. Based on the divisible load theory (DLT) the load can be divided into some arbitrary independent parts, in which each part can be processed independently by a processor. The divisible load theory has also been examined on the processors that cheat the algorithm, i.e., the processors do not report their true computation rates. According to the literature, if the processors do not report their true computation rates, the divisible load scheduling model fails to achieve its optimal performance. This paper focuses on the divisible load scheduling, where the processors cheat the algorithm. In this paper, a multi-objective method for divisible load scheduling is proposed. The goal is to improve the performance of the divisible load scheduling when the processors cheat the algorithm. The proposed method has been examined on several function approximation problems. The experimental results indicate the proposed method has approximately 66% decrease in finish time in the best case.  相似文献   

9.
This paper describes an object recognition algorithm both on a sequential machine and on a single instruction multiple data (SIMD) parallel processor such as the MIT connection machine. The problem, in the way it is presently formulated on a sequential machine, is essentially a propagation of constraints through a tree of possibilities in an attempt to prune the tree to a small number of leaves. The tree can become excessively large, however, and so implementations on massively parallel machines are sought in order to speed up the problem. Two fast parallel algorithms are described here, a static algorithm and a dynamic algorithm. The static algorithm reformulates the problem by assigning every leaf in the completely expanded unpruned tree to a separate processor in the connection machine. Then pruning is done in nearly constant time by broadcasting constraints to the entire SIMD array. This parallel version is shown to run three to four orders of magnitude faster than the sequential version. For large recognition problems which would exceed the capacity of the machine, a dynamic algorithm is described which performs a series of loading and pruning steps, dynamically allocating and deallocating processors through the use of the connection machine's global router communications mechanism.  相似文献   

10.
The problem of optimally distributing a divisible load to the nodes of an arbitrary processor tree is tackled in this paper. The rigorous mathematical foundation presented allows the derivation of the sequence of operations that is necessary to obtain the minimum processing time, along with closed-form expressions that yield the solution in time O(NP), where P is the number of tree nodes and N their maximum degree. The main contributions of this work are: (1) both load distribution and result collection overheads are considered, thus providing better resource utilization, and (2) arbitrary processor trees are examined in contrast with previous approaches that examined either complete homogeneous trees, or single level trees. Additionally, approximate algorithms for solving the problem of specifying the optimum subset of active processors for a given load, are presented and evaluated  相似文献   

11.
In this paper we introduce a biologically inspired distributed computing model called networks of evolutionary processors with parallel string rewriting rules (NEPPS), which is a variation of the hybrid networks of evolutionary processors introduced by Martin-Vide et al. Such a network contains simple processors that are located in the nodes of a virtual graph. Each processor has strings (each string having multiple copies) and string rewriting rules. The rules are applied parallely on the strings. After the strings have been rewritten, they are communicated among the processors through filters. We show that we can theoretically break the DES (data encryption standard), which is the most widely used cryptosystem, using NEPPS. We prove that, given an arbitrary <plain-text, cipher-text> pair, one can recover the DES key in a constant number of steps.  相似文献   

12.
A domain decomposition algorithm for molecular dynamics simulation of atomic and molecular systems with arbitrary shape and non-periodic boundary conditions is described. The molecular dynamics program uses cell multipole method for efficient calculation of long range electrostatic interactions and a multiple time step method to facilitate bigger time steps. The system is enclosed in a cube and the cube is divided into a hierarchy of cells. The deepest level cells are assigned to processors such that each processor has contiguous cells and static load balancing is achieved by redistributing the cells so that each processor has approximately same number of atoms. The resulting domains have irregular shape and may have more than 26 neighbors. Atoms constituting bond angles and torsion angles may straddle more than two processors. An efficient strategy is devised for initial assignment and subsequent reassignment of such multiple-atom potentials to processors. At each step, computation is overlapped with communication greatly reducing the effect of communication overhead on parallel performance. The algorithm is tested on a spherical cluster of water molecules, a hexasaccharide and an enzyme both solvated by a spherical cluster of water molecules. In each case a spherical boundary containing oxygen atoms with only repulsive interactions is used to prevent evaporation of water molecules. The algorithm shows excellent parallel efficiency even for small number of cells/atoms per processor.  相似文献   

13.
处理器阵列的容错重构技术是片上网络多核、众核高性能体系结构的可靠性技术之一。现有的最大逻辑阵列并行重构技术仅对单条逻辑列的构造实现了并行化,而对多条逻辑列的同步并行仍未见可行算法。依据处理器阵列的潜在并行性,在分治策略的基础上,提出了一种阵列分块的并行重构算法。算法对处理器阵列实施横向分块划分,对每个阵列块进行并行重构,并对所得逻辑子阵列进行归并,实现了多条逻辑列的同步并行重构。与现有的并行算法相比,新算法同样能够生成最大逻辑列,并且减少了通信开销与计算中的数据冗余,有效提高了运行速度。实验结果表明,在物理阵列大小为64×64的处理器阵列上,运行速度比现有并行算法提高39.55%,并且具有良好的可扩展性。  相似文献   

14.
§1.引 言EPHDC-2D程序是一个计算二维弹塑性流体动力学拉氏程序,其空间离散采用四节点的有限元方法,其时间积分采用显式的中心差分格式.程序分为三大部分:导程序、主体计算和网格重分.主体计算又分为滑移线计算和守恒方程的计算.我们已实现守恒方程的并行计算,(见文[5]),本文总结关于滑移线的并行计算.  相似文献   

15.
Scheduling precedence constrained task graphs, with or without duplication, is one of the most challenging NP-complete problems in parallel and distributed computing systems. Duplication heuristics are more effective, in general, for fine grain task graphs and for networks with high communication latencies. However, most of the available duplication algorithms are designed under the assumption of unbounded availability of fully connected processors, and lie in a high complexity range. Low complexity optimal duplication algorithms work under restricted cost and/or shape parameters for the task graphs. Further, the required number of processors grows in proportion to the task-graph size significantly. An improved duplication strategy is proposed that works for arbitrary task graphs, with a limited number of interconnection-constrained processors. Unlike most other algorithms that replicate all possible parents/ancestors of a given task, the proposed algorithm tends to avoid redundant duplications and duplicates the nodes selectively, only if it helps in improving the performance. This results in lower duplications and also lower time and space complexity. Simulation results are presented for clique and an interconnection-constrained network topology with random and regular benchmark task graph suites, representing a variety of parallel numerical applications. Performance, in terms of normalized schedule length and efficiency, is compared with some of the well-known and recently proposed algorithms. The suggested algorithm turns out to be most efficient, as it generates better or comparable schedules with remarkably less processor consumption.  相似文献   

16.
张云泉  施巍松 《软件学报》2000,11(12):1674-1680
用户在编写并行程序时,通常是把物理处理器看成逻辑的处理器(进程)网格,以便于算法的实现.随着用户可用处理器的不断增多,可选择的网格形状也随之增加,如何为基于消息传递的并行程序选择合适的、能发挥出并行机潜在性能的处理器网格形状,是一个迫切需要解决的问题.在提出基于通信点概念的最小度数通信点集合法之后,通过对并行程序通信模式的分析,试图解决与负载平衡无关的并行程序的最适处理器网格选择问题.通过对ScaLAPACK软件包中的一个并行测试程序——并行Cholesky(对称正定矩阵分解)通信点集合度的分析,此方法成功地选择了最适处理器网格形状,并与实验结果相一致.  相似文献   

17.
一种同构机群系统中的处理机分配算法   总被引:5,自引:0,他引:5  
机群系统的分布式计算环境为并行处理技术带来了新的研究与应用问题,正成为并行计算的热点问题.如何合理、有效地将并行任务划分到机群系统的结点上,将直接影响系统的执行性能.本文分析影响系统执行效率的执行开销因素,同时提出一个启发式的处理机分配算法.  相似文献   

18.
Parallelizing the Data Cube   总被引:1,自引:0,他引:1  
This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce inter processor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting.The bottom-up partitioning strategy balances the number of single attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared disk type parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array.We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.  相似文献   

19.
An important issue for the efficient use of multiprocessor systems is the assignment of parallel processors to nested parallel loops. It is desirable for a processor assignment algorithm to be fast and always generate an optimal processor assignment. The paper proposes two efficient algorithms to decide the optimal number of processors assigned to each individual loop. Efficient parallel counterparts of these two algorithms are also presented. These algorithms not only always generate an optimal processor assignment, but also are much faster than the exiting optimal algorithm in the literature. The paper discusses improving the performance of parallel execution by transforming a nested parallel loop into a semantically equivalent one. Three loop transformations are investigated. It is observed that, in most cases, the parallel execution time is improved after applying these transformations  相似文献   

20.
This paper describes a pipelined parallel algorithm for the MMSE-OSIC decoding procedure proposed in V-BLAST wireless MIMO systems, for heterogeneous networks of processors. It is based on a block version of the square-root Kalman Filter algorithm that was initially devised to solve the RLS problem. It has been parallelized in a pipelined way obtaining a good efficiency and scalability. The optimum load balancing for this parallel algorithm is dynamic, but we derive a static load balancing scheme with good performance.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号