Similar Documents
20 similar documents found (search time: 31 ms)
1.
Massively parallel processing (MPP) emphasizes the scalability of both parallel architectures and parallel algorithms. On a scalable parallel architecture, a scalable parallel algorithm should be able to exploit an ever-increasing number of processors effectively, and the effectiveness of an algorithm is usually measured by its processor efficiency at run time. A commonly overlooked factor is communication efficiency, which is a problem of general significance. This paper defines communication efficiency, studies its relationship with processor efficiency, and, by analyzing the execution of a typical algorithm, examines the communication efficiency of several common parallel architectures. The results show that only the combination of processor efficiency and communication efficiency gives a complete assessment of an algorithm's scalability and can guide the design of parallel architectures.
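The abstract does not reproduce the paper's definitions; the sketch below only illustrates textbook-style quantities one might compute, and both the function names and the particular form of "communication efficiency" shown here are assumptions rather than the paper's formulas.

```python
# Illustrative only: textbook-style efficiency metrics, not the paper's exact definitions.

def processor_efficiency(t_serial, t_parallel, p):
    """Classical processor efficiency E = speedup / p = T1 / (p * Tp)."""
    return t_serial / (p * t_parallel)

def communication_efficiency(t_compute, t_comm):
    """One plausible notion of communication efficiency:
    the fraction of parallel run time spent on useful computation."""
    return t_compute / (t_compute + t_comm)

if __name__ == "__main__":
    # Hypothetical timings (seconds) for a run on p = 64 processors.
    p, t1, tp = 64, 1000.0, 20.0
    t_comp, t_comm = 15.0, 5.0
    print("processor efficiency:", processor_efficiency(t1, tp, p))
    print("communication efficiency:", communication_efficiency(t_comp, t_comm))
```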

2.
The performance of parallel solvers for the generalized Hermitian eigenproblem depends on many factors, including the chosen parallel algorithm and the matrix distribution strategy. Based on block storage and block-algorithm strategies, a new parallel algorithm for the reduction to standard form is proposed; it incorporates the Cholesky factorization into the standardization of the generalized eigenproblem, reducing the communication overhead of existing parallel algorithms and increasing parallelism. The new algorithm significantly improves the performance and scalability of existing parallel algorithms. In addition, an efficient parallel block algorithm is given for solving the triangular matrix equation AX=B with multiple right-hand sides. Test results with the in-house parallel eigenproblem software package PSEPS show that the parallel algorithm is roughly twice as fast as the conventional parallel algorithm and has good scalability.
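For orientation, a plain serial reference of the standard-form reduction the abstract describes (A x = λ B x with B = L L^H, C = L^{-1} A L^{-H}, x = L^{-H} y) is sketched below using SciPy; it is not the PSEPS block-parallel algorithm.

```python
# Serial reference for the reduction that the abstract parallelizes; not the PSEPS code.
import numpy as np
from scipy.linalg import cholesky, solve_triangular, eigh

def standardize_and_solve(A, B):
    L = cholesky(B, lower=True)                                # B = L L^H
    W = solve_triangular(L, A, lower=True)                     # W = L^{-1} A
    C = solve_triangular(L, W.conj().T, lower=True).conj().T   # C = L^{-1} A L^{-H}
    w, Y = eigh(C)                                             # standard Hermitian eigenproblem
    X = solve_triangular(L.conj().T, Y, lower=False)           # back-transform: x = L^{-H} y
    return w, X
```

The result can be checked against scipy.linalg.eigh(A, B) or by verifying that A @ X is approximately B @ X @ np.diag(w).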

3.
For the structured-grid part of MuSiC-CCASSIM, a two-dimensional/axisymmetric high-order computational fluid dynamics method for compressible multiphase flow, a parallel domain decomposition method is designed. For the exchange of boundary data between processors, both blocking and non-blocking communication parallel algorithms are designed, and to reduce communication overhead an MPI/OpenMP hybrid parallel optimization algorithm is developed. Tests on the Tianhe-2 supercomputer, with a fixed grid size of 625*250 per core and up to 8,192 cores, show that the programs using the MPI/OpenMP hybrid algorithm, the pure-MPI non-blocking algorithm, and the pure-MPI blocking algorithm reach average parallel efficiencies of 86%, 83%, and 77%, respectively; all three algorithms scale well.
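As a rough illustration of the non-blocking boundary exchange the abstract mentions, a minimal one-dimensional halo exchange in mpi4py is sketched below; Python is used only for brevity, and neither the decomposition nor the data layout of MuSiC-CCASSIM is reproduced.

```python
# Minimal sketch of a 1-D non-blocking halo exchange with mpi4py.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

n = 625 * 250                        # interior cells owned by this rank (illustrative)
u = np.random.rand(n + 2)            # one ghost cell on each side
send_l, send_r = u[1:2].copy(), u[-2:-1].copy()
recv_l, recv_r = np.empty(1), np.empty(1)

reqs = [comm.Isend(send_l, dest=left,  tag=0),
        comm.Isend(send_r, dest=right, tag=1),
        comm.Irecv(recv_l, source=left,  tag=1),
        comm.Irecv(recv_r, source=right, tag=0)]
# ... interior work that does not touch the ghost cells can overlap here ...
MPI.Request.Waitall(reqs)
u[0], u[-1] = recv_l[0], recv_r[0]   # fill ghost cells (physical boundaries would be
                                     # filled by boundary conditions instead)
```

In a blocking variant the Isend/Irecv pairs would be replaced by matching Sendrecv calls, and a hybrid variant would additionally thread the interior loops with OpenMP.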

4.
Scalability Analysis of Parallel Algorithms (cited by 8; self-citations: 0; others: 8)
This paper discusses the definition of scalability for parallel algorithms, the goals of scalability study, and various evaluation criteria, in order to help understand the match between parallel algorithms and architectures, maximize system speedup and efficiency, and predict the performance of parallel algorithms currently running on small-scale parallel machines when they are ported to massively parallel computers (MPC).
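For reference, the textbook metrics usually treated in such scalability surveys (not necessarily the paper's exact formulations) are:

```latex
% Speedup, efficiency and the isoefficiency condition; W is the problem size and
% T_o(W,p) the total parallel overhead.
S(p) = \frac{T_1}{T_p}, \qquad
E(p) = \frac{S(p)}{p} = \frac{T_1}{T_1 + T_o(W,p)}, \qquad
W = \frac{E}{1-E}\, T_o(W,p) \quad \text{(isoefficiency condition for constant } E\text{)}
```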

5.
Parallelization of a Three-Dimensional Laser Ablation Hydrodynamic Interface Instability Code (cited by 1; self-citations: 0; others: 1)
On shared-memory parallel machines and MPP machines, and based on the MPI (Message Passing Interface) parallel programming environment, this paper studies the parallel implementation of the three-dimensional laser ablation interface instability code Lared-S. The numerical simulation of 3D laser ablation uses operator splitting; more than 90% of the computational load lies in solving the hydrodynamic equations and the heat conduction equation (the hydrodynamic equations are solved with a split explicit scheme, the heat conduction equation with a split implicit scheme). An alternating-plane data communication pattern based on the 3D splitting scheme is given. Solving the split implicit scheme reduces to solving tridiagonal systems, whose parallel implementation uses a block pipelined parallel algorithm. Numerical experiments show that the alternating-plane communication strategy and the block pipelined algorithm are effective and scalable: a parallel efficiency above 93% is obtained with 64 processors on the shared-memory machine, and above 90% with 128 processors on the MPP machine.
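The split implicit heat-conduction step reduces to tridiagonal solves; for orientation, a plain serial Thomas solver is sketched below. The paper's contribution is the block-pipelined parallel version, which is not shown here.

```python
# Serial Thomas algorithm for tridiagonal systems such as those arising from the
# split implicit heat-conduction step.
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b, super-diagonal c,
    right-hand side d (a[0] and c[-1] are ignored). Returns the solution x."""
    n = len(b)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```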

6.
Parallel matrix multiplication is one of the most important basic operations in linear algebra and a cornerstone of many scientific applications. As high-performance computing (HPC) moves toward exascale, the communication overhead of parallel matrix multiplication accounts for an ever larger share of the run time. Reducing this communication overhead and improving the scalability of parallel matrix multiplication is currently a hot research topic. This paper proposes a new distributed parallel dense matrix multiplication algorithm, a 2.5D version of PUMMA (Parallel...
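The abstract is truncated, but for orientation the widely cited cost model of 2.5D matrix multiplication with c-fold memory replication (Solomonik and Demmel) is reproduced below; it is not taken from this abstract.

```latex
% Per-process memory and communication volume of 2.5D matrix multiplication
% with replication factor c on P processes.
M = \Theta\!\left(\frac{c\,n^{2}}{P}\right) \ \text{memory per process}, \qquad
W = \Theta\!\left(\frac{n^{2}}{\sqrt{c\,P}}\right) \ \text{words moved per process}
```

Setting c = 1 recovers the classical 2D (Cannon/SUMMA/PUMMA-style) communication volume, while c = P^{1/3} matches the 3D algorithm.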

7.
In this paper, we develop load balancing strategies for scalable high-performance parallel A* algorithms suitable for distributed-memory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algorithm) grow with the number of processors P used, thus restricting its scalability. To alleviate this effect, we propose a novel parallel startup phase and an efficient dynamic load balancing strategy called the quality equalizing (QE) strategy. Our new parallel startup scheme executes optimally in Θ(log P) time and, in addition, achieves good initial load balance. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that enable it to significantly reduce starvation and nonessential work. Consequently, we obtain a highly scalable parallel A* algorithm with an almost-linear speedup. The startup and load balancing schemes were employed in parallel A* algorithms to solve the Traveling Salesman Problem on an nCUBE2 hypercube multicomputer. The QE strategy yields average speedup improvements of about 20-185% and 15-120% at low and intermediate work densities (the ratio of the problem size to P), respectively, over three well-known load balancing methods: the round-robin (RR), the random communication (RC), and the neighborhood averaging (NA) strategies. The average speedup observed on 1024 processors is about 985, representing a very high efficiency of 0.96. Finally, we analyze and empirically evaluate the scalability of parallel A* algorithms in terms of the isoefficiency metric. Our analysis gives (1) a Θ(P log P) lower bound on the isoefficiency function of any parallel A* algorithm, and (2) a general expression for the upper bound on the isoefficiency function of our parallel A* algorithm using the QE strategy on any topology; for the hypercube and 2-D mesh architectures the upper bounds on the isoefficiency function are found to be Θ(P log² P) and Θ(P[formula]), respectively. Experimental results validate our analysis, and also show that parallel A* search has better scalability using the QE load balancing strategy than using the RR, RC, or NA strategies.
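The startup phase itself is not spelled out in the abstract; purely as an illustration of how work can be seeded across P processors in Θ(log P) rounds, a generic recursive-halving distribution is simulated below. It does not model the node-quality balancing that the paper's startup performs.

```python
# Toy simulation of a recursive-halving startup: processor 0 holds all initial tasks,
# and in round k every active rank r < 2**k hands the upper half of its tasks to rank
# r + 2**k, so all P ranks are seeded after log2(P) rounds.
import math

def startup(work, P):
    """work: list of initial tasks held by rank 0; returns a dict rank -> task list."""
    owned = {0: list(work)}
    for k in range(int(math.log2(P))):          # assumes P is a power of two
        for r in list(owned):                   # ranks active at the start of round k
            partner = r + 2 ** k
            half = len(owned[r]) // 2
            owned[partner] = owned[r][half:]    # "send" the upper half to the partner
            owned[r] = owned[r][:half]
    return owned

shares = startup(list(range(64)), P=8)
print({r: len(t) for r, t in sorted(shares.items())})   # each of the 8 ranks gets 8 tasks
```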

8.
This paper presents a parallel simulated annealing algorithm that is able to achieve 90% parallel efficiency in iteration on up to 192 processors and up to 40% parallel efficiency in time when applied to a 5000-dimension Rastrigin function. Our algorithm breaks scalability barriers in the method of Chu et al. (1999) by abandoning adaptive cooling based on variance. The resulting gains in parallel efficiency are much larger than the loss of serial efficiency from the lack of adaptive cooling. Our algorithm resamples the states across processors periodically. The resampling interval is tuned according to the success rate for each specific number of processors. We further present an adaptive method to determine the resampling interval based on the adoption rate. This adaptive method is able to achieve nearly identical parallel efficiency but higher success rates compared to the fixed-interval scheme using the best interval found.
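A toy, single-process rendition of the core idea (independent annealing chains with periodic resampling at a fixed interval) is sketched below; the MPI implementation, the success-rate tuning and the adaptive interval of the paper are not modeled, and the sizes are far below the paper's 5000-dimension runs.

```python
# Toy simulated annealing with periodic resampling across "chains" simulated in one process.
import numpy as np

def rastrigin(x):
    return 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

rng = np.random.default_rng(0)
n_chains, dim, interval = 8, 50, 100
states = rng.uniform(-5.12, 5.12, (n_chains, dim))
energies = np.array([rastrigin(s) for s in states])
T = 10.0
for it in range(1, 5001):
    for c in range(n_chains):
        cand = states[c] + rng.normal(0, 0.1, dim)
        e = rastrigin(cand)
        if e < energies[c] or rng.random() < np.exp((energies[c] - e) / T):
            states[c], energies[c] = cand, e
    T *= 0.999                              # fixed (non-adaptive) geometric cooling
    if it % interval == 0:                  # periodic resampling: clone the better half
        order = np.argsort(energies)
        states[order[n_chains // 2:]] = states[order[:n_chains // 2]]
        energies[order[n_chains // 2:]] = energies[order[:n_chains // 2]]
print("best energy:", energies.min())
```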

9.
Based on spatial decomposition of the mesh, this paper proposes a new parallel Sn algorithm for particle transport on unstructured grids that computes multiple angular directions and multiple energy groups simultaneously; during the computation no priority evaluation or priority-queue maintenance is required, and the parallel computation is simply organized in the order of the computation queue. Taking into account the data dependences of all directions and all mesh points, and combining them with B-level priorities, a priority computation method is proposed that schedules tasks that must send data first and delays tasks that must receive data, thereby reducing processor idle time and overlapping computation with communication. Numerical experiments on two-dimensional particle transport problems using the proposed parallel Sn algorithm and priority queues show good parallel speedup: when scaled to 1,024 processors, the parallel efficiency relative to 64 processors reaches 52%.
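The priorities rest on B-levels; as a generic illustration only (not the transport-specific task graph), the sketch below computes B-levels, i.e. the longest path from a task to an exit task including the task's own cost, on a small hypothetical DAG.

```python
# Generic b-level computation on a task DAG.
def b_levels(succ, cost):
    """succ: dict task -> list of successor tasks; cost: dict task -> execution cost."""
    memo = {}
    def bl(t):
        if t not in memo:
            memo[t] = cost[t] + max((bl(s) for s in succ.get(t, [])), default=0)
        return memo[t]
    return {t: bl(t) for t in cost}

# Hypothetical 4-task graph: A -> B -> D and A -> C -> D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
cost = {"A": 2, "B": 3, "C": 1, "D": 2}
print(b_levels(succ, cost))   # {'A': 7, 'B': 5, 'C': 3, 'D': 2}
```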

10.
The mining of partial periodic patterns is an interesting type of data mining that is widely used in the analysis of markets, such as for stock management and sales management. However, the existence of huge data sets makes the scalability of data-mining algorithms a very important objective, and in recent years parallel computing has been applied to general data-mining algorithms. This paper addresses the problem of mining multiple partial periodic patterns in a parallel computing environment. To reduce the cost of communication between the processors, our approach employs the independence property of prime numbers to classify partial periodic patterns into multiple independent sets. Moreover, a novel method of distributing mining tasks among the processors is proposed. A set of simulations is used to demonstrate the benefits of our approach.
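For readers unfamiliar with partial periodic patterns, a generic single-pattern check is sketched below: how often a symbol appears at a given offset of a given period. The paper's prime-based grouping of patterns into independent sets for parallel mining is not reproduced here.

```python
# Generic check of one partial periodic pattern in a symbol sequence.
def pattern_confidence(seq, period, offset, symbol):
    positions = range(offset, len(seq), period)
    hits = sum(1 for i in positions if seq[i] == symbol)
    total = len(range(offset, len(seq), period))
    return hits / total if total else 0.0

seq = "abcabcabdabc"
print(pattern_confidence(seq, period=3, offset=0, symbol="a"))   # 1.0
print(pattern_confidence(seq, period=3, offset=2, symbol="c"))   # 0.75
```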

11.
In this paper, we study the parallelization of the Jacobi method to solve the symmetric eigenvalue problem on a mesh of processors. To solve this problem with a theoretical efficiency of 100%, it is necessary to exploit the symmetry of the matrix. The only previous algorithm we know of that exploits the symmetry on multicomputers is that of van de Geijn (1991), but that algorithm uses a storage scheme suited to a logical ring of processors and therefore has low scalability. In this paper we show how matrix symmetry can be exploited on a logical mesh of processors, obtaining higher scalability than van de Geijn's algorithm. In addition, we show how the storage scheme exploiting the symmetry can be combined with a blocked scheme to obtain a highly efficient and scalable Jacobi method for solving the symmetric eigenvalue problem on distributed-memory parallel computers. We report performance results from the Intel Touchstone Delta, the iPSC/860, the Alliant FX/80 and the PARSYS SN-1040. © 1997 by John Wiley & Sons, Ltd.
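For context, a classical serial cyclic Jacobi iteration (the method being parallelized) is sketched below; the symmetry-exploiting mesh distribution that is the paper's contribution is not modeled, and a production code would apply each rotation only to the two affected rows and columns instead of forming a full rotation matrix.

```python
# Serial cyclic Jacobi reference for the symmetric eigenvalue problem.
import numpy as np

def jacobi_eig(A, tol=1e-10, max_sweeps=30):
    A = A.astype(float).copy()
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(max_sweeps):
        off = np.sqrt(np.sum(A**2) - np.sum(np.diag(A)**2))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-15:
                    continue
                theta = (A[q, q] - A[p, p]) / (2 * A[p, q])
                t = 1.0 if theta == 0 else np.sign(theta) / (abs(theta) + np.sqrt(theta**2 + 1))
                c = 1 / np.sqrt(1 + t**2)
                s = t * c
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J          # zeroes the (p, q) entry
                V = V @ J
    return np.diag(A), V                 # eigenvalues (approx.) and eigenvectors
```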

12.
Dynamic data redistribution is used to enhance data locality and algorithm performance by reducing interprocessor communication in many parallel scientific applications on distributed-memory multicomputers. Since the redistribution is performed at runtime, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a processor replacement scheme to minimize the cost of interprocessor data exchange during runtime. The main idea of the proposed technique is to develop a replacement function for reordering logical processors in the destination phase. Based on the replacement function, a realigned sequence of destination processors can be derived and is then used to perform data decomposition in the receiving phase. Together with local matrix transposition and compressed CRS vector transposition schemes, interprocessor communication can be eliminated at runtime. A significant advantage of this approach is that the realignment of data can be performed without interprocessor communication in special cases. The second contribution of the present technique is that the complicated generation of communication sets can be simplified by applying local matrix transposition. Consequently, the indexing cost can be reduced significantly. The proposed techniques can be applied in both dense and sparse applications. A generalized symmetric redistribution algorithm is also presented in this work. To analyze the efficiency of the proposed technique, the theoretical analysis proves that up to (p-1)/p of the data transmission cost can be saved. For general cases, the symmetric redistribution algorithm saves 1/p of the communication overhead compared with the traditional method. Experimental results also show that the proposed techniques provide superior performance in most data redistribution instances.

13.
New computer architectures based on large numbers of processors are now used in various application areas ranging from embedded systems to supercomputers. Efficient parallel processing algorithms are applied in a wide variety of applications such as simulation, robot control, and image synthesis. This article presents two novel parallel algorithms for computing robot inverse dynamics (as well as control laws) starting from customized symbolic robot models. To gain the most benefit from the concurrent processor architecture, the whole job is divided into a large number of simple tasks, each involving only a single floating-point operation. Although it requires sophisticated scheduling schemes, the fine granularity of tasks was the key factor for achieving nearly maximum efficiency and speedup. The first algorithm resolves the scheduling problem for an array of pipelined processors. The second one is devoted to parallel processors connected by a complete crossbar interconnection network. The main feature of the proposed algorithms is that they take into account the communication delays between processors and minimize both the execution time and the communication cost. To prove the theoretical results, the algorithms have been verified by experiments on an INMOS T800 transputer-based system. We used four transputers in serial and parallel configurations. The experimental results show that the most complicated dynamic control laws can be executed in a submillisecond time interval. © 1993 John Wiley & Sons, Inc.

14.
A parallel finite element procedure for contact-impact problems (cited by 2; self-citations: 0; others: 2)
An efficient parallel finite element procedure for contact-impact problems is presented within the framework of explicit finite element analysis with the penalty method. The procedure includes a parallel Belytschko-Lin-Tsay shell element generation algorithm and a parallel contact-impact algorithm based on the master-slave slideline algorithm. An element-wise domain decomposition strategy and a communication minimization strategy are featured to achieve almost perfect load balancing among processors and to ensure scalable parallel performance. Throughout this work, a prototype code, named GT-PARADYN, is developed on the IBM SP2 to implement the procedure presented, under a message-passing paradigm. Some examples are provided to demonstrate the timing results of the algorithms and to discuss the accuracy and efficiency of the code.

15.
Based on the idea of exact partitioning, a new parallel set-difference algorithm, DIFF-DL, is proposed. A DL sub-algorithm is used to locate the split points at the equal-division positions of the final global sequence, so that the data are evenly partitioned and distributed to all processors and every processor carries the same workload. Experimental results in a networked parallel computing environment are given and compared with the DIFF-S and DIFF-NS algorithms. Both theoretical analysis and experiments show that DIFF-DL achieves high parallel efficiency and scalability, because the performance of partition-based algorithms is positively correlated with how evenly the data are divided among the resulting intervals.
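The DL split-point search is not described in enough detail to reproduce; the sketch below only illustrates why value-range partitioning makes the per-bucket set differences independent (and thus parallelizable), using quantiles of A as stand-in split points rather than DIFF-DL's exact ones.

```python
# Value-range partitioning for a set difference A \ B; each bucket is independent and
# could be handled by a separate processor.
import numpy as np

def partitioned_setdiff(A, B, p):
    A, B = np.sort(A), np.sort(B)
    splits = np.quantile(A, np.linspace(0, 1, p + 1)[1:-1])   # p-1 interior split values
    a_parts = np.split(A, np.searchsorted(A, splits))
    b_parts = np.split(B, np.searchsorted(B, splits))
    return np.concatenate([np.setdiff1d(a, b) for a, b in zip(a_parts, b_parts)])

A = np.array([1, 3, 5, 7, 9, 11, 13, 15])
B = np.array([3, 4, 7, 8, 11, 12])
print(partitioned_setdiff(A, B, 4))    # [ 1  5  9 13 15]
```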

16.
A parallel, workload-balanced and memory-efficient lattice-Boltzmann algorithm for laminar Newtonian fluid flow through large porous media is investigated. It relies on a simplified LBM scheme using a single unit BGK relaxation time, which is implemented by means of a shift algorithm and comprises an even fluid-node-partitioning domain decomposition strategy based on a vector data structure. It provides perfect parallel workload balance, and its two-nearest-neighbour communication pattern combined with a simple data transfer layout results in 20-55% lower communication cost, 25-60% higher computational parallel performance and 40-90% lower memory usage than previously reported LBM algorithms. Performance tests carried out using scale-up and speed-up case studies of laminar Newtonian fluid flow through hexagonal packings of cylinders and a random packing of polydisperse spheres on two different computer architectures reveal parallel efficiencies with 128 processors as high as 75% for domain sizes comprising more than 5 billion fluid nodes.
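As a point of reference only, a minimal serial D2Q9 BGK collision-and-streaming step on a periodic box is sketched below; the grid size and relaxation time are illustrative, and the shift algorithm, vector data structure and porous-media boundary handling of the paper are all omitted.

```python
# Minimal serial D2Q9 single-relaxation-time (BGK) lattice-Boltzmann step, periodic box.
import numpy as np

nx, ny, tau = 64, 64, 0.8
e = np.array([[0,0],[1,0],[0,1],[-1,0],[0,-1],[1,1],[-1,1],[-1,-1],[1,-1]])  # velocities
w = np.array([4/9] + [1/9]*4 + [1/36]*4)                                     # weights

f = np.ones((9, nx, ny)) * w[:, None, None]        # start from rest (rho = 1, u = 0)

def equilibrium(rho, ux, uy):
    eu = e[:, 0, None, None] * ux + e[:, 1, None, None] * uy
    usq = ux**2 + uy**2
    return w[:, None, None] * rho * (1 + 3*eu + 4.5*eu**2 - 1.5*usq)

for step in range(100):
    rho = f.sum(axis=0)
    ux = (f * e[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * e[:, 1, None, None]).sum(axis=0) / rho
    f += (equilibrium(rho, ux, uy) - f) / tau      # BGK collision
    for k in range(9):                             # streaming (periodic wrap-around)
        f[k] = np.roll(np.roll(f[k], e[k, 0], axis=0), e[k, 1], axis=1)
```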

17.
A parallel algorithm suitable for MIMD distributed-memory machines is proposed for solving linear systems whose coefficient matrix is block tridiagonal. The algorithm is based on a factorization of the coefficient matrix and fully exploits the special structure of the matrix; with an approximation step, the whole computation requires only two communications between adjacent processors, giving very high parallel efficiency. A sufficient condition for the validity of the algorithm is derived theoretically. Finally, numerical experiments on an HP rx2600 cluster show that the speedup grows linearly and the parallel efficiency exceeds 90%.

18.
Many parallel algorithms use hypercubes as the communication topology among their processes. When such algorithms are executed on hypercube multicomputers the communication cost is kept to a minimum, since processes can be allocated to processors in such a way that only communication between neighboring processors is required. However, the scalability of hypercube multicomputers is constrained by the fact that the interconnection cost per node increases with the total number of nodes. From a scalability point of view, meshes and toruses are more interesting classes of interconnection topologies. This paper focuses on the execution of algorithms with hypercube communication topology on multicomputers with mesh or torus interconnection topologies. The proposed approach is based on looking at different embeddings of hypercube graphs onto mesh or torus graphs. The paper concentrates on toruses, since an already known embedding, called the standard embedding, is optimal for meshes. In this paper, an embedding of hypercubes onto toruses of any given dimension is proposed. This novel embedding is called the xor embedding. The paper presents a set of performance figures for both the standard and the xor embeddings and shows that the latter outperforms the former for any torus. In addition, it is proven that for a one-dimensional torus (a ring) the xor embedding is optimal in the sense that it minimizes the execution time of a class of parallel algorithms with hypercube topology. This class of algorithms is frequently found in real applications, such as the FFT and some classes of sorting algorithms.
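The xor embedding itself is not defined in the abstract, so it is not reproduced here; the sketch below only shows, generically, how the dilation of any candidate hypercube-to-ring mapping can be measured, using the identity mapping as a stand-in.

```python
# Dilation of an embedding of a d-dimensional hypercube onto a ring of 2**d nodes:
# the worst-case ring distance over all hypercube edges.
def ring_dist(a, b, n):
    return min((a - b) % n, (b - a) % n)

def dilation(phi, d):
    """phi: list of length 2**d mapping each hypercube node id to a ring position."""
    n = 2 ** d
    return max(ring_dist(phi[u], phi[u ^ (1 << k)], n)
               for u in range(n) for k in range(d))

d = 4
identity = list(range(2 ** d))
print(dilation(identity, d))   # identity mapping: dilation 2**(d-1) = 8
```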

19.
A parallel algorithm for reordering the eigenvalues in the real Schur form of a matrix is presented and discussed. Our novel approach adopts computational windows and delays multiple outside-window updates until each window has been completely reordered locally. By using multiple concurrent windows the parallel algorithm has a high level of concurrency, and most of the work is level 3 BLAS operations. The presented algorithm is also extended to the generalized real Schur form. Experimental results for ScaLAPACK-style Fortran 77 implementations on a Linux cluster confirm the efficiency and scalability of our algorithms, achieving a parallel speedup of more than 16 with 64 processors for large-scale problems. Even on a single processor our implementation is demonstrated to perform significantly better than the state-of-the-art serial implementation. Copyright © 2009 John Wiley & Sons, Ltd.

20.
Building on a restarted Arnoldi parallel algorithm that uses a data-parallel model, a refined restarted Arnoldi parallel algorithm is proposed. To reduce the negative impact of weak scalability on parallel performance, the algorithm computes the refined vectors in parallel using a task-graph model, reducing the number of communications between processor processes and implementing the parallel computation efficiently. Test results on the KD-50-I teraflops machine show that the algorithm has good scalability and parallel efficiency.
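For context, a plain serial Arnoldi iteration (modified Gram-Schmidt) is sketched below; the refined vectors, the restarting strategy and the task-graph parallelization of the paper are outside its scope.

```python
# Serial Arnoldi iteration: builds Q (orthonormal Krylov basis) and upper Hessenberg H
# such that A @ Q[:, :k] ≈ Q @ H.
import numpy as np

def arnoldi(A, v0, k):
    n = len(v0)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(k):
        w = A @ Q[:, j]
        for i in range(j + 1):               # modified Gram-Schmidt orthogonalization
            H[i, j] = Q[:, i] @ w
            w -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:              # invariant subspace found; stop early
            return Q[:, :j + 1], H[:j + 1, :j]
        Q[:, j + 1] = w / H[j + 1, j]
    return Q, H
```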
