Similar Documents
20 similar documents found (search time: 15 ms)
1.
Incomplete LU (ILU) factorizations are popular preconditioning techniques for solving large linear systems that arise in scientific computations. We propose a (generalized) domain decomposition-based approach that leads to almost perfect speed-up with respect to standard ILU. Experimental results and a theoretical spectral condition number are reported for two-dimensional problems.

2.
In this paper a computer memory system intended for storing an arbitrary sequence of multidimensional arrays is described. This memory system permits parallel access to the cuts obtained from a given array by fixing one of its coordinates, as well as to a large set of parallelepipeds, i.e., subarrays of the given array of the same dimension.

3.
We present a polynomial preconditioner that can be used with the conjugate gradient method to solve symmetric and positive definite systems of linear equations. Each step of the preconditioning is achieved by simultaneously taking an iteration of the SOR method and an iteration of the reverse SOR method (equations taken in reverse order) and averaging the results. This yields a symmetric preconditioner that can be implemented on parallel computers by performing the forward and reverse SOR iterations simultaneously. We give necessary and sufficient conditions for additive preconditioners to be positive definite.

We find an optimal parameter, ω, for the SOR-Additive linear stationary iterative method applied to 2-cyclic matrices. We show this method is asymptotically twice as fast as SSOR when the optimal ω is used.

We compare our preconditioner to the SSOR polynomial preconditioner for a model problem. With the optimal ω, our preconditioner was found to be as effective as the SSOR polynomial preconditioner in reducing the number of conjugate gradient iterations. Parallel implementations of both methods are discussed for vector and multiple processors. Results show that if the same number of processors is used for both preconditioners, the SSOR preconditioner is more effective. If twice as many processors are used for the SOR-Additive preconditioner, it becomes more efficient than the SSOR preconditioner when the number of equations assigned to a processor is small. These results were confirmed on the Blue Chip emulator at the University of Washington.
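As a concrete illustration of the preconditioning step described above (one forward SOR sweep and one reverse SOR sweep, averaged), here is a minimal dense-matrix sketch in Python; the NumPy layout, the zero starting guess, and the function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sor_additive_precondition(A, r, omega=1.0):
    """Apply one SOR-Additive preconditioning step to a residual r.

    One forward SOR sweep and one reverse SOR sweep (equations taken in
    reverse order) are applied to A z = r starting from z = 0, and the two
    results are averaged.  A is assumed symmetric positive definite; the
    dense-matrix layout and names are illustrative only.
    """
    n = len(r)
    z_fwd = np.zeros(n)
    z_rev = np.zeros(n)
    # Forward SOR sweep (rows 0, 1, ..., n-1).
    for i in range(n):
        sigma = A[i, :i] @ z_fwd[:i] + A[i, i+1:] @ z_fwd[i+1:]
        z_fwd[i] += omega * ((r[i] - sigma) / A[i, i] - z_fwd[i])
    # Reverse SOR sweep (rows n-1, ..., 1, 0).
    for i in reversed(range(n)):
        sigma = A[i, :i] @ z_rev[:i] + A[i, i+1:] @ z_rev[i+1:]
        z_rev[i] += omega * ((r[i] - sigma) / A[i, i] - z_rev[i])
    # Averaging the forward and reverse results yields a symmetric operator.
    return 0.5 * (z_fwd + z_rev)
```

Used inside conjugate gradients, this routine plays the role of applying the preconditioner to the residual; because the forward and reverse sweeps are independent of each other, they can be executed simultaneously, which is the parallelism the abstract refers to.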


4.
Three commercial systems are considered from a programmer's point of view. The three are the Intel iPSC, a network of Inmos transputers, and the Sequent Balance. The differences in overhead are examined by implementing a solution to the traveling-salesman problem on all three. The evaluation focuses on three major issues in parallel programming: (1) how execution is divided among processing elements and how it is controlled; (2) how data are shared; and (3) how events are synchronized. The experiences of the authors are presented and some specific as well as general conclusions are drawn.

5.
With nearest-neighbor load-balancing algorithms, a processor makes balancing decisions based on localized workload information and manages workload migrations within its neighborhood. The paper compares two fairly well-known nearest-neighbor algorithms, the dimension-exchange (DE) and the diffusion (DF) methods, and several of their variants: the average dimension-exchange (ADE), optimally tuned dimension-exchange (ODE), local average diffusion (ADF) and optimally tuned diffusion (ODF). The measures of interest are their efficiency in driving any initial workload distribution to a uniform distribution and their ability to control the growth of the variance among the processors' workloads. The comparison is made with respect to both one-port and all-port communication architectures and in consideration of various implementation strategies, including synchronous/asynchronous invocation policies and static/dynamic random workload behaviors. It turns out that the dimension-exchange method outperforms the diffusion method in the one-port communication model. In particular, the ODE algorithm is best suited for statically synchronous implementations of a load-balancing process regardless of its underlying communication model. The strength of the diffusion method is in asynchronous implementations in the all-port communication model; the ODF algorithm performs best in that case. The underlying communication networks considered assume the most popular topologies, the mesh and the torus, and their special cases: the hypercube and the k-ary n-cube.
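To make the diffusion (DF) idea above concrete, here is a minimal synchronous sketch on a ring of processors; the ring topology and the diffusion parameter alpha are illustrative assumptions, not the optimally tuned variants analyzed in the paper.

```python
def diffusion_step(load, alpha=0.25):
    """One synchronous diffusion (DF) load-balancing step on a ring.

    Each processor exchanges workload with its two ring neighbors, moving a
    fraction alpha of every load difference.  Topology and alpha are
    illustrative; the paper tunes such parameters optimally per topology.
    """
    n = len(load)
    new_load = list(load)
    for i in range(n):
        left, right = load[(i - 1) % n], load[(i + 1) % n]
        new_load[i] = load[i] + alpha * (left - load[i]) + alpha * (right - load[i])
    return new_load

# Example: an unbalanced initial distribution converging toward uniform.
w = [16, 0, 0, 0]
for _ in range(20):
    w = diffusion_step(w)
print([round(x, 2) for x in w])   # all entries approach 4.0
```

The total workload is conserved at every step, and the rate at which the variance decays depends on alpha and on the network topology, which is exactly the efficiency measure the paper compares across the DE and DF families.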

6.
Almost all simulational computations require uniformly distributed random numbers. Generators of uniform random numbers are considered and assessed with respect to their possible use on parallel computers. Two recent, commercially available computers are given special attention: the Connection Machine and the T Series. Feedback shift register type generators with a large Mersenne prime are recommended for implementation on these computers.
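A minimal sketch of a feedback shift register (lagged-Fibonacci XOR) generator of the kind recommended above; the R250-style lags (250, 103), word size, and seeding are illustrative assumptions rather than the specific Mersenne-prime-based choice discussed in the paper.

```python
import numpy as np

class GFSR:
    """Generalized feedback shift register generator: x[n] = x[n-p] ^ x[n-q].

    The lags p=250, q=103 (the classic R250-style trinomial) are used here only
    to illustrate the feedback-shift-register family; the paper recommends
    generators based on a large Mersenne prime exponent, and a careful
    (linearly independent) initialization of the state.
    """
    def __init__(self, seed=12345, p=250, q=103, bits=32):
        rng = np.random.default_rng(seed)     # used only to fill the initial state
        self.state = list(rng.integers(0, 2**bits, size=p, dtype=np.uint64))
        self.p, self.q, self.i = p, q, 0
        self.norm = float(2**bits)

    def next_uint(self):
        j = (self.i - self.q) % self.p         # position of x[n-q] in the circular buffer
        self.state[self.i] ^= self.state[j]    # x[n] = x[n-p] ^ x[n-q], stored in place
        out = self.state[self.i]
        self.i = (self.i + 1) % self.p
        return int(out)

    def next_uniform(self):
        return self.next_uint() / self.norm    # uniform in [0, 1)

g = GFSR()
print([round(g.next_uniform(), 4) for _ in range(5)])
```

Because each new word depends only on two earlier words via XOR, the recurrence maps naturally onto wide SIMD machines, which is why this family is attractive for the computers studied in the paper.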

7.
This paper presents a novel parallel memory architecture for multimedia computers. By applying configurable or programmable addressing circuitry capable of parallel memory accesses, the memory management of multimedia applications can be enhanced. Necessary computer architecture changes to virtual address representation, paging, virtual memory, address computation circuitry and data permutation are discussed. These changes allow the memory to be partitioned for different access functions. In addition, the same memory area can be accessed by multiple access patterns. Therefore, a general-purpose computing system that is capable of exploiting the repeating memory access patterns in its applications can be built. Performance of the configurable parallel memory architecture (CPMA) is analyzed for a selection of algorithms from a video encoder. These motion estimation algorithms and zigzag scanning benefit from the multiple memory access functions, which is apparent from the comparisons to traditional sequential memory accesses.

8.
Algorithms for parallel memory, I: Two-level memories
We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the P secondary storage devices can simultaneously transfer a contiguous block of B records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with P disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into P separate physical devices. Our algorithms' performances can be significantly better than those obtained by the well-known but nonoptimal technique of disk striping. Our optimal sorting algorithm is randomized, but practical; the probability of using more than l times the optimal number of I/Os is exponentially small in l (log l) log(M/B), where M is the internal memory size. A summarized version of this research was presented at the 22nd Annual ACM Symposium on Theory of Computing, Baltimore, MD, May 1990. This work was done while the first author was at Brown University. Support was provided in part by a National Science Foundation Presidential Young Investigator Award with matching funds from IBM, by NSF Research Grants DCR-8403613 and CCR-9007851, by Army Research Office Grant DAAL03-91-G-0035, and by the Office of Naval Research and the Defense Advanced Research Projects Agency under Contract N00014-91-J-4052, ARPA Order 8225. This work was done in part while the second author was at Brown University, supported by a Bellcore graduate fellowship, and at Bellcore.
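For orientation, the optimal sorting bound in this parallel-disk setting is usually quoted in the following form (stated here from the standard two-level memory model literature, with N records, internal memory M, block size B and P independent disks; not copied verbatim from the paper):

```latex
% Optimal parallel-disk sorting bound in the two-level memory model:
% N records, internal memory M, block size B, P independent disks.
\[
  \Theta\!\left( \frac{N}{PB}\,\cdot\,\frac{\log(N/B)}{\log(M/B)} \right)
  \quad \text{parallel I/Os for sorting } N \text{ records.}
\]
```

Disk striping effectively replaces the block size B by PB in this expression, which shrinks the base of the logarithm and is why striping can be noticeably suboptimal when M/B is small relative to P.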

9.
10.
Molecular dynamics is an important computational technique with wide application in many fields. Because its computational cost is large, parallel computing methods are increasingly being introduced into molecular dynamics simulations. Targeting the SMP cluster systems that are common today, and taking their structural characteristics into account, this paper implements two-level parallel computation of short-range molecular dynamics for three parallel algorithms: domain decomposition, atom decomposition and force decomposition, using a hybrid MPI/Pthreads programming model with message passing between nodes and shared memory within each node. The results show that, compared with the original message-passing-only parallelization, the different algorithms achieve different efficiencies under two-level parallelism; overall, however, the two-level parallel approach can exploit more computing resources and thus helps to improve computing capability.
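A minimal sketch of the two-level structure described above, with message passing between nodes and threads inside a node; it uses mpi4py and Python threads purely for illustration, and the toy decomposition, placeholder force routine, and parameters are hypothetical, not the paper's code.

```python
# Hypothetical sketch of two-level parallel MD: MPI ranks stand in for nodes,
# threads stand in for cores within a node splitting the local force loop.
# Requires mpi4py; run e.g. with "mpiexec -n 4 python md_sketch.py".
from mpi4py import MPI
from threading import Thread

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

def compute_forces(atoms, lo, hi, out):
    # Placeholder for the short-range force loop over the slice [lo, hi).
    out[lo:hi] = [0.0] * (hi - lo)

def force_step(local_atoms, nthreads=4):
    forces = [0.0] * len(local_atoms)
    chunk = (len(local_atoms) + nthreads - 1) // nthreads
    threads = [Thread(target=compute_forces,
                      args=(local_atoms, t * chunk,
                            min((t + 1) * chunk, len(local_atoms)), forces))
               for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return forces

local_atoms = list(range(rank, 1000, nprocs))   # toy decomposition across ranks
forces = force_step(local_atoms)                # intra-node, shared-memory threads
# Inter-node communication, e.g. exchanging boundary atoms or reducing a
# global quantity such as the total atom count or potential energy:
total_atoms = comm.allreduce(len(local_atoms), op=MPI.SUM)
```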

11.
12.
Dinning, A. Computer, 1989, 22(7): 66-77
An examination is given of how traditional synchronization methods influence the design of MIMD (multiple-instruction, multiple-data-stream) multiprocessors. The author provides an overview of MIMD multiprocessing and goes on to discuss semaphore-based implementations (Ultracomputers, Cedar, and the Sequent Balance/21000), monitor-based implementations (the HM2p), and implementations based on message passing (HEP, the BBN Butterfly, and the Transputer).

13.
In this article, we present a multigrid algorithm for parallel computers, the chopped parallel multigrid (CPMG) algorithm. The CPMG algorithm improves processor utilization by reducing the work load on coarse grids without affecting the convergence rate of the algorithm. This is in contrast to earlier approaches (Gannon and van Rosendale, 1986; Frederickson and McBryan, 1989), where unutilized processors are used to improve the convergence rate. The CPMG algorithm reduces the coarse grid work by chopping the alternate cycles of multigrid. Using analytical results and simulations on sequential machines, we show that CPMG can achieve almost the same convergence rate as standard MG in many cases. Analytically, we show that the advantage gained by CPMG over standard MG on a mesh-connected massively parallel machine is 33% in hardware utilization, 50% in communication overheads and 38% in overall execution time. We have also evaluated the performance of CPMG on an actual massively parallel machine, the DAP-510. The advantage gained by CPMG over standard MG is 35% in overall execution time. Moreover, CPMG can be integrated with other parallel multigrid algorithms, such as the PSMG algorithm (Frederickson and McBryan, 1989) and Decker's algorithm (Decker, 1990).

14.
In a previous work we studied the concurrent implementation of a numerical model, CONDIFP, developed for the analysis of depth-averaged convection–diffusion problems. Initial experiments were conducted on the Intel Touchstone Delta System, using up to 512 processors and different problem sizes. As for other computation-intensive applications, the results demonstrated an asymptotic trend to unity efficiency when the computational load dominates the communication load. This paper relates further numerical experiments, in both one and two space dimensions and with various choices of initial and boundary conditions, carried out on the Intel Paragon XP/S Model L38 with the aim of illustrating the parallel solver's versatility and reliability.

15.
One of the essential problems in parallel computing is: Can SIMD machines handle asynchronous problems? This is a difficult, unsolved problem because of the mismatch between asynchronous problems and SIMD architectures. We propose a solution to let SIMD machines handle general asynchronous problems. Our approach is to implement a runtime support system which can run MIMD-like software on SIMD hardware. The runtime support system, named P kernel, is thread-based. There are two major advantages of the thread-based model. First, for application problems with irregular and/or unpredictable features, automatic scheduling can move some threads from overloaded processors to underloaded processors. Second, and more importantly, the granularity of threads can be controlled to reduce system overhead. The P kernel is also able to handle bookkeeping and message management, as well as to make these low-level tasks transparent to users. Substantial performance has been obtained on the MasPar MP-1.

16.
This paper is a discussion of functional languages and parallel computers. It is aimed at an audience that has a background in computer architecture, but not necessarily in the area of functional languages. It therefore constitutes an introductory survey of functional languages, on the one hand, and a non-introductory discussion of parallel computers, on the other. The aim is to highlight some important issues regarding the use and adequacy of these languages, and also the design of parallel computers to interpret them. The concluding thesis put forth is twofold: one, that to widen their scope of applicability, functional languages need to include more features of nondeterminism and may need to be integrated with features from conventional languages; two, that the right sort of architectures for such extended languages may well be less-specialised ones with a von Neumann flavour.

17.
Computing an optimal solution to the knapsack problem is known to be NP-hard. Consequently, fast parallel algorithms for finding such a solution without using an exponential number of processors appear unlikely. An attractive alternative is to compute an approximate solution to this problem rapidly using a polynomial number of processors. In this paper, we present an efficient parallel algorithm for finding approximate solutions to the 0–1 knapsack problem. Our algorithm takes an ε, 0 < ε < 1, as a parameter and computes a solution whose deviation from the optimal solution is at most a fraction ε of the optimal solution. For a problem instance having n items, this computation uses O(n^{5/2}/ε^{3/2}) processors and requires O(log^3 n + log^2 n · log(1/ε)) time. The upper bound on the processor requirement of our algorithm is established by reducing it to a problem on weighted bipartite graphs. This processor complexity is a significant improvement over that of other known parallel algorithms for this problem.
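The sequential idea behind such ε-approximation schemes is profit scaling followed by dynamic programming over the scaled profits; the sketch below shows that standard scaling scheme only (names and details are illustrative), not the paper's parallel algorithm, which instead works through a reduction to a weighted bipartite graph problem.

```python
def knapsack_approx(values, weights, capacity, eps):
    """Profit-scaling approximation scheme for 0-1 knapsack (sequential sketch).

    Returns an item subset whose total value is at least (1 - eps) times the
    optimum, assuming every single item fits within the capacity.  This only
    illustrates the approximation idea; the paper's contribution is achieving
    such a guarantee in polylogarithmic parallel time.
    """
    n = len(values)
    vmax = max(values)
    scale = eps * vmax / n                      # scaling factor
    scaled = [int(v // scale) for v in values]  # rounded-down scaled profits
    top = sum(scaled)
    INF = float("inf")
    # min_weight[p] = lightest total weight achieving scaled profit exactly p
    min_weight = [0.0] + [INF] * top
    choice = [[False] * (top + 1) for _ in range(n)]
    for i in range(n):
        for p in range(top, scaled[i] - 1, -1):
            cand = min_weight[p - scaled[i]] + weights[i]
            if cand < min_weight[p]:
                min_weight[p] = cand
                choice[i][p] = True
    best_p = max(p for p in range(top + 1) if min_weight[p] <= capacity)
    # Trace back the chosen items from the highest feasible scaled profit.
    items, p = [], best_p
    for i in reversed(range(n)):
        if choice[i][p]:
            items.append(i)
            p -= scaled[i]
    return items

print(knapsack_approx([60, 100, 120], [10, 20, 30], 50, 0.1))  # e.g. [2, 1]
```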

18.
Eckardt, H. Computing, 1994, 53(1): 13-31
I/O in computer systems is prone to become a bottleneck. This is a particularly severe problem in highly parallel machines, where some applications are fully I/O bound if only one or few...

19.
Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.

20.
Various proposals for networks of large numbers of processors are reviewed. Bottleneck problems arise in these networks with the flow of data between processors. Communication problems which can arise in practical situations are discussed and techniques for reducing bottlenecks are developed. Some simulation results are given for the binary n-cube.
