首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 187 毫秒
1.
TFQMR算法是一种Krylov子空间算法,常用来求解大型稀疏线性方程组.通过改变TFQMR算法的计算次序,提出了一种改进的TFQMR(ITFQMR)算法.对比TFQMR算法,ITFQMR算法的数值稳定性和TFQMR算法相同,几乎没有增加计算量,但考虑了在MIMD并行机上实现时并行算法的性能,其同步开销减少为TFQMR算法的一半,并且所有内积计算以及矩阵向量乘是独立的,没有数据相关性,可以进行计算与通信的重叠.从理论和实验两个角度来讨论ITFQMR算法的性能,当处理机台数较多时,ITFQMR算法的计算速度快于TFQMR算法.实验说明了在有64台处理机机群上进行,最快的并行ITFQMR算法的计算速度大约比TFQMR算法快20%.  相似文献   

2.
本文根据影响并行蚁群算法性能的关键因素,提出了一种自适应的并行蚁群算法.首先提出了基于适应度和基于距离选择的两种不同的信息交流策略,使得各处理机自适应地选择与之进行信息交换的处理机,然后采用自适应的更新策略进行信息素的更新.为了增强该算法的搜索能力,还根据解的多样性给出了自适应地调节处理机之间的信息交流周期的方法.在MPP处理机深腾1800上对TSP问题的实验结果表明了该算法在保证有效的加速比的同时,具有很好的收敛性.  相似文献   

3.
基于PVM的线性方程组的一种网上并行迭代算法   总被引:1,自引:0,他引:1  
针对基于PVM的桌面PC机联网而成的网络并行计算环境中,处理机的运算速度较快,而处理机间的通信相对较慢的实际情况,提出了求解线性方程组的一种分组Guass-Seidel并行迭代算法,该算法将线性方程组的增广矩阵按行分块储存在各处理机,每台处理机分别对各自的块采用Guass-Seidel迭代法进行迭代计算,其处理机间的通信较少,实现容易。并用1~24台桌面PC机联成的局域网,在PVM 3.4 on Windows2000,VC 6.0并行计算平台上编程对该算法进行了数值试验,试验结果表明,该算法较传统的Jacobi并行迭代算法和传统的Guass—Seidel并行迭代算法更优越。  相似文献   

4.
基于异构分布式系统的实时容错调度算法   总被引:26,自引:1,他引:26  
目前文献中研究的实时容错调度算法都是基于同构分布式系统,系统中的所有处理机完全相同。该文首先建立了一个基于异构分布式系统实时容错调度模型,异构分布式系统中的各个处理机均不相同。基于该异构分布式系统模型,该文引入了可靠性代价(reliability cost)概念,并提出两种静态实时容错调度算法(RTFTNO和RTFTRC)用于调度周期性实时容错任务。算法RTFTRC在调度任务时,尽量使系统的可靠性代价最小;而算法RTFTNO在调度实时任务时,没有考虑系统的可靠性代价。该文详细讨论了两种调度算法的性能。性能模拟实验分别比较了两个算法的可靠性代价,超时比率和可调度性;并研究了任务的计算时间与可靠性代价的关系以及调度长度阈值与最小处理机个数的关系。实验结果表明,算法RTFTRC的性能优于算法RTFTNO。  相似文献   

5.
针对二分图匹配算法在任务之间存在时序关系时无法进行有效调度以及EFT算法没有充分考虑各处理机性能及网络通信状况的问题,提出基于二分图匹配的改进ETF算法。该算法综合考虑任务之间的时序关系、处理机的性能、处理机之间的通信情况及已处理任务的调度情况,利用二分图最佳匹配思想对局部任务进行调度。实验表明该算法具有较小的调度长度和较好的负载均衡性。  相似文献   

6.
针对基于工作站网络环境下,处理机的运算速度较快而处理机间的通信相对较慢的实际情况,给出了一种基于行循环分布的并行求解线性方程组的Guass-Seidel迭代算法.该算法将方程组的增广矩阵按行循环分布存储在各处理机中,循环传送每一次的迭代向量以减少处理器间的通信次数,同时,采用计算与通信部分重叠技术,提高并行算法的效率.同时用8台PC机联成局域网,在DebianLinux4.0操作系统、MPICH1.2.7并行计算平台上对该算法进行了数值实验,实验结果表明,该算法较传统的基于行带状分布的Guass-Seidel并行迭代算法优越.  相似文献   

7.
提出了一种在MIMD分布式存储环境下求解块三对角线性方程组的并行算法。基于Galerkin原理适当取基构造算法,使整个计算过程只在相邻处理机间通信两次,并给出了系数矩阵为对称正定矩阵时算法收敛的条件。在HP rx2600集群系统上进行的数值计算结果表明该算法与多分裂方法相比具有较高的加速比和并行效率。  相似文献   

8.
稀疏矩阵乘以一个向量(SpM×V)的问题是许多大型应用问题的核心计算问题,文中提出了一种在并行计算机上并行计算SpMXV的负载平衡算法,计算复杂性为O(N)(N为稀疏矩阵的阶),而目前计算此类问题的最优负载平衡算法的计算复杂性为O(N·P)(P为处理机台数)。文章最后给出了并行数值实验。  相似文献   

9.
提出了求解系数矩阵为块三对角的线性方程组的一种适合于MIMD分布式存储的并行算法,该算法以系数矩阵分解为基础,充分利用了系数矩阵结构的特殊性,进行了近似处理,使整个计算过程只在相邻处理机间通信两次,具有很高的并行效率,并在理论上给出了该算法成立的充分条件。最后,在HPrx2600集群上进行数值试验,结果表明,加速比呈线性增加,并行效率达到90%以上。  相似文献   

10.
基于机群的求解TSP问题的分布式演化算法   总被引:4,自引:0,他引:4  
讨论一种基于PC机群的术解TSP问题的分布式演化算法.在PVM环境下进行了数值实验,结果表明该算法在一定的扩展范围内可以得到接近线性加速比的性能.  相似文献   

11.
A new multiresolution coarse-to-fine search algorithm for efficient computation of the Hough transform is proposed. The algorithm uses multiresolution images and parameter arrays. Logarithmic range reduction is proposed to achieve faster convergence. Discretization errors are taken into consideration when accumulating the parameter array. This permits the use of a very simple peak detection algorithm. Comparative results using three peak detection methods are presented. Tests on synthetic and real-world images show that the parameters converge rapidly toward the true value. The errors in ρ and &thetas;, as well as the computation time, are much lower than those obtained by other methods. Since the multiresolution Hough transform (MHT) uses a simple peak detection algorithm, the computation time will be significantly lower than other algorithms if the time for peak detection is also taken into account. The algorithm can be generalized for patterns with any number of parameters  相似文献   

12.
This note is a sequel of paper (Escudero et al. (2012) [1]), in which the sequential Branch-and-Fix Coordination referred to as the BFC-MS algorithm was introduced for solving large-scale multistage mixed 0–1 optimization problems up to optimality under uncertainty. The aim of the note is to present the parallelization version of the BFC algorithm, referred to as PC-BFCMS, so that the elapsed time reduction on problem solving is analyzed. The parallelization is performed at two levels. The inner level parallelizes the optimization of the MIP submodels attached to the set of scenario clusters that have been created by the modeler-defined break stage to decompose the original problem, where the nonanticipativity constraints are partially relaxed. Several strategies are presented for analyzing the performance of the inner parallel computation based on MPI (Message Passing Interface) threads to solve scenario cluster submodels versus the sequential version of the BFC-MS methodology. The outer level of parallelization defines a set of 0–1 variables, the combinations of whose 0–1 values, referred to as paths (one for each combination), allow independent models to be optimized in parallel, such that each one can itself be internally optimized with the inner parallelization approach. The results of using the outer–inner parallelization are remarkable: the larger the number of paths and MPI threads (in addition to the number of threads that the MIP solver allows to be used), the smaller the elapsed time to obtain the optimal solution. This new approach allows problems to be solved faster, and, can thus solve very large scale problems that could not otherwise be solved by plain use of a state-of the-art MIP solver, or could not be solved by the sequential version of the decomposition algorithm in acceptable elapsed time.  相似文献   

13.
This paper analyzes the performance and scalability of an iteration of the preconditioned conjugate gradient algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5 parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5 whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated Incomplete Cholesky (IC) preconditioner can lead to further improvement in scalability on the CM-5 by a constant factor,as a result, a parallel formulation of the PCG algorithm with IC preconditioner may execute faster than that with a simple diagonal preconditioner even if the latter runs faster in a serial implementation  相似文献   

14.
The conjugate gradient squared (CGS) algorithm is a Krylov subspace algorithm that can be used to obtain fast solutions for linear systems (Ax=b) with complex nonsymmetric, very large, and very sparse coefficient matrices (A). By considering electromagnetic scattering problems as examples, a study of the performance and scalability of this algorithm on two MIMD machines is presented. A modified CGS (MCGS) algorithm, where the synchronization overhead is effectively reduced by a factor of two, is proposed in this paper. This is achieved by changing the computation sequence in the CGS algorithm. Both experimental and theoretical analyses are performed to investigate the impact of this modification on the overall execution time. From the theoretical and experimental analysis it is found that CGS is faster than MCGS for smaller number of processors and MCGS outperforms CGS as the number of processors increases. Based on this observation, a set of algorithms approach is proposed, where either CGS or MGS is selected depending on the values of the dimension of the A matrix (N) and number of processors (P). The set approach provides an algorithm that is more scalable than either the CGS or MCGS algorithms. The experiments performed on a 128-processor mesh Intel Paragon and on a 16-processor IBM SP2 with multistage network indicate that MCGS is approximately 20% faster than CGS.  相似文献   

15.
16.
This paper presents a new cost-effective algorithm to compute exact loop bounds when multilevel tiling is applied to a loop nest having affine functions as bounds (nonrectangular loop nest). Traditionally, exact loop bounds computation has not been performed because its complexity is doubly exponential on the number of loops in the multilevel tiled code and, therefore, for certain classes of loops (i.e., nonrectangular loop nests), can be extremely time consuming. Although computation of exact loop bounds is not very important when tiling only for cache levels, it is critical when tiling includes the register level. This paper presents an efficient implementation of multilevel tiling that computes exact loop bounds and has a much lower complexity than conventional techniques. To achieve this lower complexity, our technique deals simultaneously with all levels to be tiled, rather than applying tiling level by level as is usually done. For loop nests having very simple affine functions as bounds, results show that our method is between 15 and 28 times faster than conventional techniques. For loop nests caving not so simple bounds, we have measured speedups as high as 2,300. Additionally, our technique allows eliminating redundant bounds efficiently. Results show that eliminating redundant bounds in our method is between 22 and 11 times faster than in conventional techniques for typical linear algebra programs.  相似文献   

17.
In this paper, a new algorithm is presented to compute the disparity map from a stereo pair of images by using Belief Propagation (BP). While many algorithms have been proposed in recent years, the real-time computation of an accurate disparity map is still a challenging task. The computation time and run-time memory requirements are two very important factors for all real-time applications. The proposed algorithm divides the matching process into two steps; they are initial matching and disparity map refinement. Initial matching is performed by memory efficient hierarchical belief propagation algorithm that uses less than half memory at run-time and minimizes the energy function at much faster rate as compare to other hierarchical BP algorithms that makes it more suitable for real-time applications. Disparity map refinement uses a simple but very effective single-pass approach that improves the accuracy without affecting the computation cost. Experiments by using Middlebury dataset demonstrate that the performance of our algorithm is the best among other real-time stereo matching algorithms.  相似文献   

18.
This paper presents new efficient shortest path algorithms to solve single origin shortest path problems (SOSP problems) and multiple origins shortest path problems (MOSP problems) for hierarchically clustered data networks. To solve an SOSP problem for a network with n nodes, the distributed version of our algorithm reaches the time complexity of O(log(n)), which is less than the time complexity of O(log 2 (n)) achieved by the best existing algorithm. To solve an MOSP problem, our algorithm minimizes the needed computation resources, including computation processors and communication links for the computation of each shortest path so that we can achieve massive parallelization. The time complexity of our algorithm for an MOSP problem is O(m log(n)), which is much less than the time complexity of O(M log2 (0)) of the best previous algorithm. Here, M is the number of the shortest paths to be computed and m is a positive number related to the network topology and the distribution of the nodes incurring communications, m is usually much smaller than M. Our experiment shows that m is almost a constant when the network size increases. Accordingly, our algorithm is significantly faster than the best previous algorithms to solve MOSP problems for large data networks  相似文献   

19.
Given a distributed computation and a global predicate, predicate detection involves determining whether there exists at least one consistent cut (or global state) of the computation that satisfies the predicate. On the other hand, computation slicing is concerned with computing the smallest subcomputation (with the least number of consistent cuts) that contains all consistent cuts of the computation satisfying the predicate. In this paper, we investigate the relationship between predicate detection and computation slicing and show that the two problems are actually equivalent. Specifically, given an algorithm to detect a predicate b in a computation C, we derive an algorithm to compute the slice of C with respect to b. The time complexity of the (derived) slicing algorithm is O(n|E|T), where n is the number of processes, E is the set of events, and O(T) is the time complexity of the detection algorithm. We discuss how the "equivalence" result of this paper can be utilized to derive a faster algorithm for solving the general predicate detection problem in many cases. Slicing algorithms described in our earlier papers are all offline in nature. In this paper, we also present two online algorithms for computing the slice. The first algorithm can be used to compute the slice for a general predicate. Its amortized time complexity is O(n(c + n)T) per event, where c is the average concurrency in the computation and O(T) is the time complexity of the detection algorithm. The second algorithm can be used to compute the slice for a regular predicate. Its amortized time complexity is only O(n2) per event.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号