首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 62 毫秒
1.
Parallel Gaussian elimination on an MIMD computer   总被引:3,自引:0,他引:3  
This paper introduces a graph-theoretic approach to analyse the performances of several parallel Gaussian-like triangularization algorithms on an MIMD computer. We show that the SAXPY, GAXPY and DOT algorithms of Dongarra, Gustavson and Karp, as well as parallel versions of the LDMt, LDLt, Doolittle and Cholesky algorithms, can be classified into four task graph models. We derive new complexity results and compare the asymptotic performances of these parallel versions.  相似文献   

2.
Gaussian elimination is a canonical linear algebra procedure for solving linear systems of equations. In the last few years, the algorithm has received a lot of attention in an attempt to improve its parallel performance. This article surveys recent developments in parallel implementations of Gaussian elimination for shared memory architecture. Five different flavors are investigated. Three of them are based on different strategies for pivoting: partial pivoting, incremental pivoting, and tournament pivoting. The fourth one replaces pivoting with the Partial Random Butterfly Transformation, and finally, an implementation without pivoting is used as a performance baseline. The technique of iterative refinement is applied to recover numerical accuracy when necessary. All parallel implementations are produced using dynamic, superscalar, runtime scheduling and tile matrix layout. Results on two multisocket multicore systems are presented. Performance and numerical accuracy is analyzed. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   

3.
A parallel FFT on an MIMD machine   总被引:5,自引:0,他引:5  
In this paper we present a parallelization of the Cooley- Tukey FFT algorithm that is implemented on a shared-memory MIMD (non-vector) machine that was built in the Dept. of Computer Science, Tel Aviv University. A parallel algorithm is presented for one dimension Fourier transform with performance analysis. For a large array of complex numbers to be transformed, an almost linear speed-up is demonstrated. This algorithm can be executed by any number of processors, but generally the number is much less than the length of the input data.  相似文献   

4.
Various proposals for networks of large numbers of processors are reviewed. Bottleneck problems arise in these networks with the flow of data between processors. Communication problems which can arise in practical situations are discussed and techniques for reducing bottlenecks are developed. Some simulation results are given for the binary n-cube.  相似文献   

5.
Gaussian elimination over GF(2) is used in a number of applications including the factorisation of large integers. The boolean nature of arithmetic in GF(2) makes the task well suited to highly parallel bit-organised computers. A program to work with up to 4096 × 4096 matrices has been developed for the ICL-DAP. A method has been developed that needs no extra storage to store the history of the elimination. The algorithm is presented and its correctness proved.  相似文献   

6.
为满足大规模线性方程组对内存容量的要求,针对对称方程组提出一种高斯消去的并行化方案。对称方程组在高斯消去过程中其子方阵的对称性仍然存在,因此在并行计算时只读入和计算三角部分的数据,从而减少储存空间的大小,提高并行效率。测试表明,该方案的并行效率优于传统算法,可应用于对称方程组的大规模数值计算中。  相似文献   

7.
一种基于高斯分布的白平衡改进算法   总被引:1,自引:0,他引:1  
针对白平衡算法中普遍存在的校正不足及算法复杂度过高等问题,结合灰度世界算法和完美反射算法的优点,提出了一种基于高斯分解的自动白平衡算法。该算法通过对图像灰度分布的高斯分解,得到图像的高斯分布模型,并以此计算增益曲线。实验结果表明,该算法能有效校正图像的偏色问题,适用于多种不同场景。  相似文献   

8.
As is well known, the strategy of divide-and-conquer is widely used in problem solving. The method of partitioning is also a fundamental strategy for the design of a parallel algorithm. The problem of enumerating the spanning trees of a graph arises in several contexts such as computer-aided design and computer networks. A parallel algorithm for solving the problem is presented in this paper. It is based on the principle of the inclusion and exclusion of sets, and not directly based on the partitioning of the graph itself. The results of the preliminary experiments on a MIMD system appear promising.  相似文献   

9.
In a previous article, one of the authors presented an extension of an iterative approximate orthogonalisation algorithm, due to Z. Kovarik, for arbitrary rectangular matrices. In the present article, we propose a modified version of this extension for the class of arbitrary symmetric matrices. For this new algorithm, the computational effort per iteration is much smaller than for the initial one. We prove its convergence and also derive an error reduction factor per iteration. In the second part of the article, we show that we can eliminate the matrix inversion required by the previous algorithm in each iteration, by replacing it with a polynomial matrix expression. Some numerical experiments are also presented for a collocation discretisation of a first kind integral equation.  相似文献   

10.
11.
为了解决串行部分选主元的高斯消去算法不能充分利用多核处理器的问题,提出并实现了并行多线程的部分选主元的高斯消去算法,并将整个算法进行了分析和优化,使数据的存储布局和算法的访存模式匹配,从而大幅提高了程序的性能。通过对本地Linux服务器以及美国亚马逊EC2云的多种平台上的实验结果的比较和分析,确定了部分选主元的高斯消去算法受缓存影响较大,所以在CPU和内存/缓存配置较为均衡的平台上运行性能最好。文中展现了一种高效率、扩展性好的多线程并行部分选主元的高斯消去算法以及将一般性串行算法进行并行化和优化的方法。  相似文献   

12.
The computational complexity of a parallel algorithm depends critically on the model of computation. We describe a simple and elegant rule-based model of computation in which processors apply rules asynchronously to pairs of objects from a global object space. Application of a rule to a pair of objects results in the creation of a new object if the objects satisfy the guard of the rule. The model can be efficiently implemented as a novel MIMD array processor architecture, the Intersecting Broadcast Machine. For this model of computation, we describe an efficient parallel sorting algorithm based on mergesort. The computational complexity of the sorting algorithm isO(nlog2 n), comparable to that for specialized sorting networks and an improvement on theO(n 1.5) complexity of conventional mesh-connected array processors.  相似文献   

13.
In this paper, a more efficient and a more accurate algorithm is developed for designing asymptotically optimal unrestricted uniform polar quantization (UUPQ) of bivariate Gaussian random variables compared to the existing algorithms on this subject. The proposed algorithm is an iterative one defining the analytical model of asymptotically optimal UUPQ in only a few iterations. The UUPQ model is also improved via optimization of the last magnitude reconstruction level so that the mean squared error (MSE) is minimal. Moreover, for the straightforward performance assessment of our analytical UUPQ model an asymptotic formula for signal to quantization noise ratio (SQNR) is derived, which is reasonably accurate for any rate (R) greater than or equal to 2.5 bits/sample. It is demonstrated empirically that our asymptotically optimal UUPQ model outperforms the previous UUPQ models in terms of SQNR. Eventually, the transition from the analytical to the practically designed UUPQ model, as an important aspect in quantizer design, is considered in the paper and, as a result, a novel method to achieve this is provided. The proposed method is applicable to the practical design of any unrestricted polar quantization.  相似文献   

14.
针对蝙蝠算法个体越界、易早熟收敛的问题,提出一种基于越界重置和高斯变异的蝙蝠优化算法。新算法将飞越解空间边界的个体拉回解空间内,利用越界重置策略重新分配位置。通过高斯变异策略控制个体的搜索范围,使种群以最优解为中心向四周呈放射状搜索,增强了算法的局部搜索和全局寻优能力。蝙蝠算法在靠近目标解时响度和脉冲发射频率更新不协调,影响了算法的持续进化能力,通过线性渐变策略保证响度和脉冲发射频率的变化与算法持续进化相适应。研究了在解空间不同位置关系的情况下新算法和对比算法的优化能力,并结合实验数据对算法收敛稳定性进行分析。实验结果表明,提出的新算法具有较好的收敛速度和精度,其全局寻优能力和高维问题优化能力体现了很好的鲁棒性。  相似文献   

15.
This paper introduces the parallelization on a distributed memory multicomputer of two iterative methods for finding all the roots of a given polynomial. The parallel algorithms share the computation of the roots among the processors and perform a total exchange of the data at each step. Since the amount of communications is the main drawback of this approach, we study the effect of the network topology on the performance of the algorithms. Particularly, we show that among the different classical processors networks topologies (ring, 2d-torus or n-cube), the hypercube topology minimizes the communications. For each topology is computed the optimal number of processors. Experiments on the hypercube FPS T40 illustrate the results.  相似文献   

16.
高斯烟羽模型扩散面积的算法研究   总被引:4,自引:1,他引:3  
高斯烟羽模型在研究非重气云扩散方面应用越来越广泛,但是计算高斯烟羽模型扩散面积却没有特别精确有效的方法。因此对高斯烟羽模型计算扩散面积提出一种合理、科学的算法显得十分必要。本文提出了一种计算高斯烟羽模型扩散面积的算法,并介绍了该算法的理论基础和思路实现。应用该算法于工程实例中,并对结果进行验证,表明该算法方便、精确、便于应用,是计算高斯烟羽模型扩散面积的一种行之有效的算法。  相似文献   

17.
Configurable computing, where hardware resources are configured appropriately to match specific hardware designs, has recently demonstrated its ability to significantly improve performance for a wide range of computation‐intensive applications. With steady advances in silicon technology, as predicted by Moore's Law, Field‐Programmable Gate Array (FPGA) technologies have enabled the implementation of System‐on‐a‐Programmable‐Chip (SOPC or SOC) computing platforms, which, in turn, have given a significant boost to the field of configurable computing. It is possible to implement various specialized parallel machines in a single silicon chip. In this paper, we describe our design and implementation of a parallel machine on an SOPC development board, using multiple instances of a soft IP configurable processor; we use this machine for LU factorization. LU factorization is widely used in engineering and science to solve efficiently large systems of linear equations. Our implementation facilitates the efficient solution of linear equations at a cost much lower than that of supercomputers and networks of workstations. The intricacies of our FPGA‐based design are presented along with tradeoff choices made for the purpose of illustration. Performance results prove the viability of our approach. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

18.
This paper presents a parallel algorithm for computing the inversion of a dense matrix based on modified Jordan's elimination which requires fewer calculation steps than the standard one. The algorithm is proposed for the implementation on the linear array with a small to moderate number of processors which operate in a parallel-pipeline fashion. A communication between neighboring processors is achieved by a common memory module implemented as a FIFO memory module. For the proposed algorithm we define a task scheduling procedure and prove that it is time optimal. In order to compute the speedup and efficiency of the system, two definitions (Amdahl's and Gustafson's) were used. For the proposed architecture, involving two to 16 processors, estimated Gustafson's (Amdahl's) speedups are in the range 1.99 to 13.76 (1.99 to 9.69).  相似文献   

19.
文中提出了在分布式环境下并行求解对称带状矩阵特征值问题的并行二分.多分法及其改进,该算法利用变形高斯消去法计算对称带状矩阵的Sturm序列,并利用Rayleigh商迭代对二分/多分法加以改进,在算法的并行执行过程中,各处理机间不需通信,特别适用在分布式环境下的并行计算,最后给出了数值实验结果。  相似文献   

20.
After introducing the parallel Schwarz overrelaxation method for linear systems, we analyse the convergence factor of the method in detail. The optimal overrelaxation parameter ω of the method is discussed in this paper. Some examples are also shown.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号