首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 218 毫秒
1.
基于SMP集群的MPI+OpenMP混合编程模型及有效实现   总被引:12,自引:1,他引:11  
SMP集群混合了两个内存模型:每个节点是一个共享存储的多处理器,而节点间使用分布存储。这一多级体系结构引起了编程模型和性能方面的问题。文章讨论了MPI+OpenMP混合编程模型的性能和不同的实现方法,提出了多粒度MPI+OpenMP混合编程方法。建立了对称三对角特征问题的多粒度混合并行算法.并在深腾6800超级计算机上同纯MPI算法作了性能方面的比较。结果表明,该混合并行算法具有更好的扩展性和加速比。  相似文献   

2.
分布式共享存储系统的特点是每个节点内是共享存储的,而节点间是分布式存储.为了更好地利用这种多级体系结构,讨论了MPI+OpenMP混合编程模型的性能及实现方法,建立了大规模三对角线性方程组的MPI+OpenMP混合并行算法,并在上海大学高性能计算集群上与单纯MPI算法进行了性能方面的比较.结果表明,MPI+OpenMP混合并行算法具有更好的加速比和扩展性.  相似文献   

3.
The parallelization of many algorithms can be obtained using space-time transformations which are applied on nested do-loops or on recurrence equations. In this paper, we analyze systems of linear recurrence equations, a generalization of uniform recurrence equations. The first part of the paper describes a method for finding automatically whether such a system can be scheduled by an affine timing function, independent of the size parameter of the algorithm. In the second part, we describe a powerful method that makes it possible to transform linear recurrences into uniform recurrence equations. Both parts rely on results on integral convex polyhedra. Our results are illustrated on the Gauss elimination algorithm and on the Gauss-Jordan diagonalization algorithm. This work was partially funded by the French Coordinated Research ProgramC 3 and by a Grant from the SOREP company  相似文献   

4.
In this paper we examine order reduction of parabolic systems using modal truncation. The parabolic distributed system is first approximated using the Galerkin method. The system matrices have a special structure that allows us to find the approximate spectrum of the parabolic system. To do this we compute approximate inverses of tridiagonal, diagonally dominant symmetric matrices. The approximation leads to algorithms of orderO(n), as opposed to traditional algorithms of orderO(n), wheren is the order of the system. Finally, an example is presented to illustrate the proposed algorithm.This research was supported by the National Science Foundation under contract NCR-9210408, by the Advanced Research Programs Agency under contract MDA-972-93-1-0032, and by the University of Hawaii Research Council Funds.  相似文献   

5.
A new method of cryptologic attack on binary sequences is given, using their linear complexities relative to odd prime numbers. We show that, relative to a particular prime number p, the linear complexity of a binary geometric sequence is low. It is also shown that the prime p can be determined with high probability by a randomized algorithm if a number of bits much smaller than the linear complexity is known. This determination is made by exploiting the imbalance in the number of zeros and ones in the sequences in question, and uses a new statistical measure, the partial imbalance.This project was sponsored by the National Security Agency under Grant No. MDA904-91-H-0012. The United States Government is authorized to reproduce and distribute reprints notwithstanding any copyright notation hereon.  相似文献   

6.
On convergence of the Horn and Schunck optical-flow estimation method   总被引:2,自引:0,他引:2  
The purpose of this study is to prove convergence results for the Horn and Schunck optical-flow estimation method. Horn and Schunck stated optical-flow estimation as the minimization of a functional. When discretized, the corresponding Euler-Lagrange equations form a linear system of equations We write explicitly this system and order the equations in such a way that its matrix is symmetric positive definite. This property implies the convergence Gauss-Seidel iterative resolution method, but does not afford a conclusion on the convergence of the Jacobi method. However, we prove directly that this method also converges. We also show that the matrix of the linear system is block tridiagonal. The blockwise iterations corresponding to this block tridiagonal structure converge for both the Jacobi and the Gauss-Seidel methods, and the Gauss-Seidel method is faster than the (sequential) Jacobi method.  相似文献   

7.
An ideal computation of the decorrelating or the linear minimum mean-squared-error (LMMSE) detector requires computational complexity of order K 3 when there K is the number of users. To alleviate the computational complexity, iterative decorrelating and LMMSE detectors are proposed for solving a set of linear equations corresponding to linear interference cancellation structures. Iterative conjugate gradient (CG) method has been used for the linear interference cancellation detectors. Its main advantages are to reduce the order of computation complexity and their suitability to highly parallel implementations. In this paper, the symmetric successive overrelaxation (SSOR) preconditioning scheme is applied to the CG method. The performance of the detectors is investigated and it is found that the SSOR preconditioned CG method can provide significantly faster convergence than CG method.  相似文献   

8.
For a large-scale adaptive array, heavy computational load and high-rate data transmission are two challenges in the implementation of an adaptive digital beamforming system. Moreover, the large-scale array becomes extremely sensitive to array imperfections. First, based on a restructured recursive linearly constrained minimum variance algorithm and a gradient-based optimization method, a new robust recursive linearly constrained minimum variance (RRLCMV) algorithm is proposed in this paper. The computational load of the RRLCMV algorithm is on the order of o(N), which is less than that of the conventional gradient-based robust adaptive algorithm. Then, a new efficient parallel robust recursive linearly constrained minimum variance (PRRLCMV) adaptive algorithm is proposed by appropriately partitioning the RRLCMV algorithm into a number of operational modules. It can be easily executed in a distributed-parallel-processing fashion, sequentially and in parallel. As a result, the PRRLCMV algorithm provides an effective solution that can alleviate the bottleneck of high-rate data transmission and reduce the computational cost. Finally, an implementation scheme of the PRRLCMV algorithm based on a distributed-parallel-processing system is also proposed. The simulation results demonstrate that the new PRRLCMV algorithm can significantly reduce the degradation due to various array errors.  相似文献   

9.
A new method to recover packet losses using (2,1,m)convolutional codes is proposed.The erasure correcting decoding algorithm and the decoding determinant theorem is presented.It is also proved that the codes with optimal distance profile have also optimal delay characteristic.Simulation results show that the proposed method can recover the packet losses more efficiently than RS codes over different decoding delay conditions and thus suits for different packet network delay conditions.  相似文献   

10.
Multiple on-chip memory modules are attractive to many high-performance digital signal processing (DSP) applications. This architectural feature supports higher memory bandwidth by allowing multiple data memory accesses to be executed in parallel. However, making effective use of multiple memory modules remains difficult. The performance gain in this kind of architecture strongly depends on variable partitioning and scheduling techniques. In this paper, we propose a graph model known as the variable independence graph (VIG) and algorithms to tackle the variable partitioning problem. Our results show that VIG is more effective than interference graph for solving variable partitioning problem. Then, we present a scheduling algorithm known as the rotation scheduling with variable repartition (RSVR) to improve the schedule lengths efficiently on a multiple memory module architecture. This algorithm adjusts the variable partitions during scheduling and generates a compact schedule based on retiming and software pipelining. The experimental results show that the average improvement on schedule lengths is 44.8% by using RSVR with VIG. We also propose a design space exploration algorithm using RSVR to find the minimum number of memory modules and functional units satisfying a schedule length requirement. The algorithm produces more feasible solutions with equal or fewer number of functional units compared with the method using interference graph.  相似文献   

11.
A novel parallel sequence fault simulation (PSF) algorithm for synchronous sequential circuits is presented. The algorithm successfully extend the parallel pattern method for combinational circuits to sequential circuits by proposing a multiple-pass mechanism to overcome the state dependency in sequential circuits. The fault simulation is performed in parallel by partitioning the entire sequence into subsequences of equal length. Furthermore, techniques are developed to minimize the number of simulation passes. Notably, two compact counters, C x and C d , are proposed to faciliate the early stabilization detection of faulty circuit simulation with minimum space overhead. The experimental results on the benchmark circuits show that the speedup ratio over a serial sequence fault simulator based on ROOFS is 9.16 on average for pseudo random vectors. The parallel sequence algorithm of PSF is especially adaptable to parallel and distributed simulation which exploits sequence partition.  相似文献   

12.
分析了双向并行分裂DPP算法存在空闲等待、通信阻塞以及冗余计算等方面的不足,并在此基础上提出一种基于动态分配模式求解三对角线性方程组的并行算法.该算法摒弃了DPP算法平均分配方程组的模式和完成向中间通信后必须消除所有下(上)对角元素的方式,而采用基于运算和通信参数的动态分布模式以及仅适量消元的方法,从而在保持通信畅通的前提下,充分利用计算与通信重叠技术,减少处理机空闲等待和冗余计算.最后分析了新算法的理论性能,并在IBMRS60000机群上进行了数值实验.实验结果表明,该算法的效率较DPP算法有较大提高.  相似文献   

13.
In this paper, precoding techniques are studied for combating the intersymbol interference (ISI)/multipath effects in communication systems. We introduce a new type-precoder which we term thepolynomial ambiguity resistant (PAR) precoder for its ability to resist signal distortion induced by finite impulse response (FIR) channels. In particular, the precoder allows a receiver to identify an input signal without knowing the channel characteristics at the expense of a minimum amount of bandwidth increase. A family of such precoders, which is linear (no modulo operations), channel independent, and modulation pattern preserving (except for some occasional 0 symbols), is presented. Also presented is a closed-form algorithm that can simultaneously identify the input signals and zero-forcing equalizers.Work supported in part by the Air Force Office of Scientific Research (AFOSR) under Grant No. F49620-97-1-0253 and F49620-98-1-0352, the National Science Foundation CAREER Program under Grant MIP-9703377, and the University of Delaware Research Foundation.Work supported in part by the Air Force Office of Scientific Research (AFOSR) under Grant No. F49620-97-1-0318, the Office of Naval Research (ONR) under Grant No. N00014-97-1-0475, and the National Science Foundation CAREER Program under Grant MIP-9703074.  相似文献   

14.
An algorithm can be modeled as a set of indexed computations, and a schedule is a mapping of the algorithm index space into time.Linear schedules are a special class of schedules that are described by a linear mapping and are commonly used in many systolic algorithms.Free schedules cause computations of an algorithm to execute as soon as their operands are available. If one computation uses data generated by another computation, then a data dependence exists between these two computations which can be represented by the difference of their indices (calleddependence vector). Many important algorithms are characterized by the fact that data dependencies areuniform, i.e., the values of the dependence vectors are independent of the indices of computations. There are applications where it is of interest to find an optimal linear schedule with respect to the time of execution ofa specific computation of the given algorithm. This paper addresses the problem of identifying optimal linear schedules for uniform dependence algorithms so that the execution time ofa specific computation of the algorithm is minimized and proposes a procedure to solve this problem based on the mathematical solution of a linear optimization problem. Also, linear schedules are compared with free schedules. The comparison indicates that optimal linear schedules can be as efficient as free schedules, the best schedules possible, and identifies a class of algorithms for which this is always true. This research was supported in part by the National Science Foundation under Grant DCI-8419745 and in part by the Innovative Science and Technology Office of the Strategic Defense Initiative Organization and was administered through the Office of Naval Research under contracts No. 00014-85-k-0588 and No. 00014-88-k-0723.  相似文献   

15.
Serialization of memory access can be a critical bottleneck in shared memory parallel computers. The NYU Ultracomputer, a large-scale MIMD (multiple instruction stream, multiple data stream) shared memory architecture, may be viewed as a column of processors and a column of memory modules connected by a rectangular network of enhanced 2×2 buffered crossbars. These VLSI nodes enable the network to combine multiple requests directed at the same memory location. Such requests include a new coordination primitive, fetch- and-add, which permits task coordination to be achieved in a highly parallel manner. Processing within the network is used to reduce serialization at the memory modules.To avoid large network latency, the VLSI network nodes must be high-performance components. Design tradeoffs between architectural features, asymptotic performance requirements, cycle time, and packaging limitations are complex. This report sketches the Ultracomputer architecture and discusses the issues involved in the design of the VLSI enhanced buffered crossbars which are the key element in reducing serialization.This work was supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under contract number DEA-C0276-ER03077-V, and in part by the National Science Foundation, under grant number DCR-8413359.  相似文献   

16.
We present a parallel implementation of the alternating direction implicit—body of revolution—finite difference time domain (ADI-BOR-FDTD) method on a high performance computer using a Message Passing Interface (MPI) library. In BOR-FDTD codes, the body of revolution symmetry is exploited to reduce the computational complexity by projecting the 3-D Yee-cell in cylindrical coordinates onto a 2-D plane. Adopting an implicit update scheme (ADI) frees the time-step size from the Courant–Friedrichs–Lewy time step constraint. We demonstrate further performance gains by parallelizing the algorithm, although the communication overhead between processors is proportional to the area of the domain. In our parallel ADI-BOR-FDTD method each tridiagonal matrix system is solved by a single processor, with the parallel computer architecture being exploited to solve multiple systems at the same time. We benchmarked our code on an IBM p690 symmetric multiprocessor revealing excellent scalability and efficiency.   相似文献   

17.
The structure inherent to the centrosymmetric matrices is exploited to obtain factorization results leading to significant computational savings in many engineering applications. Several interesting properties of the matrices are discussed with a view toward algorithm computational complexity. It is shown that the multiplicative complexity involved in the process of principal component (eigen-value/eigenvector) extraction, and in the evaluation of the determinant and inverse of such matrices, can be reduced by nearly 75% by employing the results presented here. The theory presented hereunifies andgeneralizes previous computationally efficient results onseveral specialized generalized centrosymmetric matrices; for example, the class of symmetric centrosymmetric (or doubly symmetric) matrices is shown to be a special case of the class of centrosymmetric matrices, and since the results obtained here are applicable over the field of complex numbers, the factorization results available for centrosymmetric matrices are readilyextended to the complex field. The centrosymmetric matrices play an important role in a number of areas such as pattern recognition, antenna theory, mechanical and electrical systems, and quantum physics. Specific examples of pattern recognition feature selection, a uniform linear antenna array, vibration in structures, and the quantummechanical oscillator are discussed to demonstrate that the theory developed here has a wide range of applications. In addition, certain specialized cases of the centrosymmetric matrices have, in the past, proved their indisputable usefulness in the areas of communication theory, digital filters, linear systems, linear prediction, and speech analysis.Research supported by NSERC Grant A0912.  相似文献   

18.
With the computational capabilities of parallel computers, we should investigate new methods which have performance advantages on parallel computers even if they are not faster than conventional methods on sequential computers. One such method is the partitioning finite-element method (FEM). In this paper, we consider the implementation of the partitioning FEM on both the Gray Y-MP and the Intel Touchstone Delta. The partitioning method is shown to have many advantages over a traditional finite-element approach. On the Gray YMP and sequential computers, the partitioning method requires significantly less memory. For parallel processors such as the Intel Delta, we show that the partitioning FEM has a higher parallel efficiency than traditional FEM. EM scattering from an infinitely long dielectric cylinder is used as an example  相似文献   

19.
Correlation properties of a general binary combiner with memory   总被引:8,自引:0,他引:8  
Correlation properties of a general binary combiner with an arbitrary number M of memory bits are derived and novel design criteria proposed. For any positive integer m, the sum of the squares of the correlation coefficients between all nonzero linear functions of m successive output bits and all linear functions of the corresponding m successive inputs is shown to be dependent upon a particular combiner, unlike the memoryless combiners. The minimum and maximum values of the correlation sum as well as the necessary and sufficient conditions for them to be achieved are determined. It turns out that the security of combiners with memory can be considerably improved if M is not small.An efficient linear sequential circuit approximation (LSCA) method is developed for obtaining output and input linear functions with comparatively large correlation coefficients which is feasible for large M and works for any practical scheme. The method consists in deriving and solving a linear sequential circuit with additional nonbalanced inputs that is based on linear approximations of the output and the component next-state functions. The corresponding correlation attack on combiners with linear feedback shift registers is analyzed and it is shown that every such combiner with or without memory is essentially zero-order correlation immune.A preliminary version of this paper was presented at Eurocrypt '92 and was published in the proceedings. This research was supported in part by the Science Fund of Serbia, Grant #0403, through the Institute of Mathematics, Serbian Academy of Arts and Sciences.  相似文献   

20.
We compare Yee's finite-difference time-domain (FDTD) and symmetric condensed node transmission-line matrix (SCN-TLM) solutions for a cavity containing a metallic fin. Differential equation-based numerical methods are known to produce inaccurate results for this type of problem due to the rapid spatial variation of the field distribution in the vicinity of the singularity at the edge of the metal fin. This problem is relevant to the analysis of structures of practical interest such as microstrip and coplanar waveguides. Based on simulations, it is determined that for identical discretizations, SCN-TLM is more accurate than FDTD for this problem. We interpret this result as an indication that the symmetric condensed representation of fields (used within the SCN-TLM) lends itself to a more accurate algorithm than the distributed representation used by Yee. We estimate that the FDTD method requires 3.33 times more cells for a given three-dimensional problem than the transmission-line matrix (TLM) method (1.49 times more cells per linear dimension of the problem) in order to achieve the same accuracy. If we consider the requirements to update and store a single TLM or FDTD cell, we find the SCN-TLM algorithm is more efficient than the Yee FDTD algorithm in terms of both computational effort and memory requirements. Our conclusions regarding computational effort and memory requirements are limited to problems with homogeneous material properties  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号