首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 11 毫秒
1.
On Dawning-1000,the two-dimension mesh interconnection network enables low-latency,high-bandwidth communication,however,these capabilities have not been realized because of the high processing overhead imposed by existing communication software.Active Messages provide an efficient communication mechanism with small overhead,which may expose the raw capabilities of the underlying hardware.In addition,one of the most promising techniques,use-level communication,is often used to improve the performance of the traditional protocols such as TCP and UDP,and is also adopted in implementing the novel abstractions like Active Messages.Thus a user-level Active Messages model is designed and implemented on Dawning-1000.Preliminary experiments show that the combination of Active Messages mechanism and user-level communication technique is quite efficient in reducing software overhead associated with sending and receiving messages,and in exploiting the capabilities of the interconnection network.  相似文献   

2.
Over the years, computational physics and chemistry served as an ongoing source of problems that demanded the ever increasing performance from hardware as well as the software that ran on top of it. Most of these problems could be translated into solutions for systems of linear equations: the very topic of numerical linear algebra. Seemingly then, a set of efficient linear solvers could be solving important scientific problems for years to come. We argue that dramatic changes in hardware designs precipitated by the shifting nature of the marketplace of computer hardware had a continuous effect on the software for numerical linear algebra. The extraction of high percentages of peak performance continues to require adaptation of software. If the past history of this adaptive nature of linear algebra software is any guide then the future theme will feature changes as well–changes aimed at harnessing the incredible advances of the evolving hardware infrastructure.  相似文献   

3.
当前最流行的网络并行计算消息传递模型是PVM和MPI,通常使用者认为PVM和MPI仅是代表了解决相同问题的不同解答方案。而该文结合曙光-2000(分布式大规模并行计算机系统)所用的消息传递型编程模型,从PVM和MPI的设计目标,起源,规范、动态进程,非阻塞操作等几个方面来说明这两种程序设计方法有许多明显区别点,通常用来解决不同的问题。  相似文献   

4.
5.
《国际计算机数学杂志》2012,89(3-4):249-270
We study the parallel implementation of two diagonalization methods for solving dense linear systems: the well known Gauss-Jordan method and a new one introduced by Huard. The number of arithmetic operations performed by the Huard method is the same as for Gaussian elimination, namely 2n 3/3, less than for the Jordan method, namely n 3. We introduce parallel versions of these methods, compare their performances and study their complexity. We assume a shared memory computer with a number of processors p of the order of n, the size of the problem to be solved, We show that the best parallel version for Jordan's method is by rows whereas the best one for Huard's method is by columns. Our main result states that for a small number of processors the parallel Huard method is faster than the parallel Jordan method and slower otherwise. The separation is obtained for p = 0.44n.  相似文献   

6.
面向对象数值软件Trilinos及其线性代数包剖析   总被引:1,自引:0,他引:1  
张妮  曹建文 《计算机工程与设计》2007,28(5):993-998,1011
分析研究了美国Sandia国家实验室Trilinos项目的设计思想、组织结构,该项目受到ASC计划等的资助.Trilinos框架中定义了一个线性代数对象模型,作为各种软件包的构建基础和沟通载体,此模型当前较成熟的实现为Epetra.详细地阐述了Epetra的组织设计和主要类的层次结构,通过两个数值实验考察该软件当前的性能.可以看到,以Trilinos为代表的众多数值计算项目,为各种数值软件的有效协作、集成和扩展所进行的努力和一些有价值的经验.  相似文献   

7.
《国际计算机数学杂志》2012,89(3-4):113-121
Given n items, a parallel algorithm for generating all the n! permutations is presented. The computational model used is a linear array which consists of n identical processing elements with a simple structure. One permutation is produced at each other time step. The elapsed time to produce a permutation is independent of the integer n. The basic idea used is the iterative method and the exchange of two consecutive components in an existing permutation. The design procedures of this algorithm are considered in detail. The ranking and unranking functions of the required permutations are also discussed.  相似文献   

8.
Numerical linear algebra libraries provide many kernels that can be composed to perform complex computations. For a given computation, there is typically a large number of functionally equivalent kernel compositions. Some of these compositions achieve better response times than others for particular data and when executed on a particular computer architecture. Previous research provides methods to enumerate (a subset of) these kernel compositions. In this work, we study the problem of determining the composition that yields the lowest response time. Our approach is based on a response time prediction for each candidate combination. While this prediction could in principle be obtained using analytical and/or empirical performance models, developing accurate such models is known to be challenging. Instead, we define a feature space that captures salient properties of kernel combinations and predict response time using supervised machine learning. We experiment with a standard set of machine learning algorithms and identify an effective algorithm for our kernel composition selection problem. Using this algorithm, our approach widely outperforms the strategy that would consist in always using the simplest kernel composition and is often close to the fastest kernel compositions among those evaluated. We quantify the potential benefit of our approach if it were to be implemented as part of an interactive computational tool. We find that although the potential benefit is substantial, a limiting factor is the kernel composition enumeration overhead. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   

9.
Recently, AMT has issued an extended version of Fortran Plus [1] which allows software to be developed without the developer needing to take explicit accout of the grid size of the target processor. Fortran-Plus and its extension, Fortran Plus Enhanced [2], have been developed for use on the AMT DAP 510 array processor. This machine has 1024 processors arranged in a square grid with nearest neighbour and wraparound connections. It is interesting to enquire whether the performance of code generated by the Fortran-Plus Enhanced compiler is, for a particular application, superior to that generated by the Fortran-Plus compiler from a program which recognises and is tailored to fit the characteristic features of the DAP 510. In this paper the performances of two implementations of an algorithm for the eigensolution of real tridiagonal symmetric matrices are compared. The algorithm is characterised by its heavy use of matrix operations, all of which can be efficiently implemented on an array processor. Some of the constituent operations commonly occur in other applications while others are specific to the problem being addressed.  相似文献   

10.
Any factorization/back substitution scheme for the solution of linear systems consists of two phases which are different in nature, and hence may be inefficient for parallel implementation on a single computational network. The Gauss-Jordan elimination scheme unifies the nature of the two phases of the solution process and thus seems to be more suitable for parallel architectures, especially if reconfiguration of the communication pattern is not permitted. In this communication, a computational network for the Gauss-Jordan algorithm is presented. This network compares favorably with optimal implementations of the Gauss elimination/back substitution algorithm.  相似文献   

11.
This paper describes an efficient implementation and evaluation of a parallel eigensolver for computing all eigenvalues of dense symmetric matrices. Our eigensolver uses a Householder tridiagonalization method, which has higher parallelism and performance than conventional methods when problem size is relatively small, e.g., the order of 10,000. This is very important for relevant practical applications, where many diagonalizations for such matrices are required so often. The routine was evaluated on the 1024 processors HITACHI SR2201, and giving speedup ratios of about 2–5 times as compared to the ScaLAPACK library on 1024 processors of the HITACHI SR2201.  相似文献   

12.
The partition method of Wang for tridiagonal equations is generalized to the arbitrary band case. A stability criterion is given. The algorithm is compared to Gaussian elimination and cyclic reduction.  相似文献   

13.
基于递归耦合方法的三对角线性方程组分布式并行算法   总被引:1,自引:2,他引:1  
方蓉  赵瑛 《计算机工程与设计》2006,27(4):670-671,687
提出了一种在分布式计算机上用递归倍增方法解三对角线性方程组的并行算法。通过研究算法中的额外开销达到优化标量算法的执行和通讯,并减少了存储开销。当三对角线性方程组的系数矩阵满足对角占优时,该算法在运行过程中不会中断。最后,在采用消息传递编程模型的基于局域网MPI并行环境下对算法进行了评价。数值实验结果表明,该算法是高效的。  相似文献   

14.
Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU‐GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double‐precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared‐memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared‐memory systems with multiple GPUs. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   

15.
Consider the problem of exploring a large state-space for a goal state where although many such states may exist in the state-space, finding any one state satisfying the requirements is sufficient. All the methods known until now for conducting such search in parallel using multiprocessors fail to provide consistent linear speedups over sequential execution. The speedups vary between sublinear to superlinear and from one execution to another. Further, adding more processors may sometimes lead to a slow-down rather than speedup, giving rise to speedup anomalies reported in literature. We present a prioritizing strategy which yields consistent speedups that are close toP withP processors, and that monotonically increase with the additon of processors. This is achieved by keeping the total number of nodes expanded during parallel search very close to that of a sequential search. In addition, the strategy requires substantially smaller memory relative to other methods. The performance of this strategy is demonstrated on a multiprocessor with several state-space search problems.This research has been supported in part by the National Science Foundation under Contract No. CCR-89-02496.  相似文献   

16.
A linear scaling parallel clustering algorithm implementation and its application to very large datasets for cluster analysis is reported. WaveCluster is a novel clustering approach based on wavelet transforms. Despite this approach has an ability to detect clusters of arbitrary shapes in an efficient way, it requires considerable amount of time to collect results for large sizes of multi-dimensional datasets. We propose the parallel implementation of the WaveCluster algorithm based on the message passing model for a distributed-memory multiprocessor system. In the proposed method, communication among processors and memory requirements are kept at minimum to achieve high efficiency. We have conducted the experiments on a dense dataset and a sparse dataset to measure the algorithm behavior appropriately. Our results obtained from performed experiments demonstrate that developed parallel WaveCluster algorithm exposes high speedup and scales linearly with the increasing number of processors.  相似文献   

17.
并行数据库上的并行CMD-Join算法   总被引:3,自引:1,他引:3  
李建中  都薇 《软件学报》1998,9(4):256-262
并行数据库在多处理机之间的分布方法(简称数据分布方法)对并行数据操作算法的性能影响很大.如果在设计并行数据操作算法时充分利用数据分布方法的特点,可以得到十分有效的并行算法.本文研究如何充分利用数据分布方法的特点,设计并行数据操作算法的问题,提出了基于CMD多维数据分布方法的并行CMD-Join算法.理论分析和实验结果表明,并行CMD-Join算法的效率高于其它并行Join算法.  相似文献   

18.
周期块三对角线性方程组的一种并行算法   总被引:1,自引:1,他引:0  
该文提出了分布式环境下求解周期块三对角线性方程组的一种并行算法,该算法通过对系数矩阵进行一次预处理后,充分利用系数矩阵结构的特殊性,使算法只在相邻处理机间通信两次。并从理论上给出了算法收敛的一个充分条件。最后,在HPrx2600集群上进行了数值试验,结果表明,实算与理论是一致的,并行性也很好。  相似文献   

19.
CUDA架构下大规模稠密线性方程组的并行求解   总被引:1,自引:0,他引:1       下载免费PDF全文
在Gauss-Jordan消去法的基础上,给出了一种适应于CUDA架构的改进Gauss-Jordan消去并行算法。通过分析该方法的处理过程以及CUDA架构的相应限制,在CUDA的grid-block-thread三层组织结构的基础上,从算法构造的角度提出了grid-strip-group-block-thread五层结构,给出了基础行以及全局基础行等概念,并构建了适应于CUDA架构的Gauss-Jordan消去法的并行版本,在最高维数为4 000维的大规模稠密线性方程组的算例求解上与串行Gauss-Jordan消去法进行了比较,实验结果表明,该算法能够充分利用GPU的硬件特性,有效地降低了大规模稠密线性方程组的求解时间。  相似文献   

20.
Parallel modifications of linear iteration schemes are proposed that are used to solve systems of liner algebraic equations and to achieve the time complexity equal to T = log2 k · O(log2 n), where k is the number of iterations of an original scheme and n is the dimension of a system. Such schemes are extended to the case of approximate solution of systems of linear differential equations with constant coefficients. Based on them and using a program, the stability of solutions in the Lyapunov sense is analyzed.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号