首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
The solution of the algebraic eigenvalue problem is an important component of many applications in science and engineering. With the advent of novel architecture machines, much research effort is now being expended in the search for parallel algorithms for the computation of eigensystems which can gainfully exploit the processing power which these machines provide. Among important recent work References 1-4 address the real symmetric eigenproblem in both its dense and sparse forms, Reference 5 treats the unsymmetric eigenproblem, and Reference 6 investigates the solution of the generalized eigenproblem. In this paper two algorithms for the parallel computation of the eigensolution of Hermitian matrices on an array processor are presented. These algorithms are based on the Parallel Orthogonal Transformation algorithm (POT) for the solution of real symmetric matrices[7,8]. POT was developed to exploit the SIMD parallelism supported by array processors such as the AMT DAP 510. The new algorithms use the highly efficient implementation strategies devised for use in POT. The implementations of the algorithms permit the computation of the eigensolution of matrices whose order exceeds the mesh size of the array processor used. A comparison of the efficiency of the two algorithms for the solution of a variety of matrices is given.  相似文献   

2.
Recently, AMT has issued an extended version of Fortran Plus [1] which allows software to be developed without the developer needing to take explicit accout of the grid size of the target processor. Fortran-Plus and its extension, Fortran Plus Enhanced [2], have been developed for use on the AMT DAP 510 array processor. This machine has 1024 processors arranged in a square grid with nearest neighbour and wraparound connections. It is interesting to enquire whether the performance of code generated by the Fortran-Plus Enhanced compiler is, for a particular application, superior to that generated by the Fortran-Plus compiler from a program which recognises and is tailored to fit the characteristic features of the DAP 510. In this paper the performances of two implementations of an algorithm for the eigensolution of real tridiagonal symmetric matrices are compared. The algorithm is characterised by its heavy use of matrix operations, all of which can be efficiently implemented on an array processor. Some of the constituent operations commonly occur in other applications while others are specific to the problem being addressed.  相似文献   

3.
This paper considers how the image restoration technique of Geman and Geman, which involves searching for the maximum a posteriori distribution of an image modelled as bounded Markov random fields using simulated annealing, can be approximated on a parallel SIMD processor array, the ICL Distributed Array Processor (DAP). For the version implemented, the potential speed-up over an equivalent serial processor is equal to half the number of processors in the array, or 2048 for the 64×64 DAP. The time taken by the DAP for one updating cycle is about 30 ms, and the typical complete picture restoration consisting of 1000 annealing cycles takes around 30 s. A method of increasing the efficiency of the present algorithm is suggested and the possibility of making the algorithm work with several frames of data collected over time is discussed.  相似文献   

4.
In the paper three distinct approaches to the implementation of an application in linear algebra on the AMT DAP 510 using the Fortran-Plus Enhanced language and compiler are investigated. In the first the implementation is tailored to fit a virtual array processor whose edge size corresponds to that of the maximum dimension of matrix used in the application. The second approach is a special case of the first, and corresponds to a situation in which the virtual array processor is constrained to have the same dimensions as the AMT DAP 510. The third approach may be considered to be a hybrid approach. In this case an implementation of the application is constructed which incorporates the most efficient features of the first two. Finally, comparisons of the performances of the three implementations, all of which were compiled using the Fortran-Plus Enhanced compiler, are presented.  相似文献   

5.
The authors consider a mesh-connected processor array with broadcast, in which the processors in each row and in each column can access a shared bus. This computation model is an abstraction of commercially available parallel machines such as the distributed array processor (DAP). The authors show efficient data movement using the buses, which leads to significantly faster solution times for several image problems compared to those on an ordinary mesh-connected array. For several problems, the time performance is comparable to that on the pyramid of corresponding size. Alternate parallel organizations with broadcast feature are also shown. Utilizing the broadcast feature on the arrays can lead to significant improvement of the time performance of many image computations  相似文献   

6.
Theorists in Edinburgh University Physics Department are currently using two ICL Distributed Array Processors (DAPs), programmed in a matrix and vector extension of FORTRAN called DAP FORTRAN, to perform a variety of numerical simulations. However, many of the next generation of array processors, in particular the GEC Rectangular Image and Data processor (GRID), will be programmed in parallel extensions of C, like GRID-extended C. In this paper software is described which translates DAP FORTRAN into GRID-extended C, as well as FORTRAN 77 into C, enabling DAP FORTRAN programs to be run on the GRID.  相似文献   

7.
Effective fault tolerance techniques are essential for improving the reliability of multiprocessor systems. At the same time, fault tolerance must be achieved at high speed to meet the real-time constraints of embedded systems. While parallelism has often been exploited to increase performance, to the best of our knowledge, there has been no previously reported work on parallel reconfiguration of mesh-connected processor arrays with faults. This paper presents two parallel algorithms to accelerate reconfiguration of the processor arrays. The first algorithm reconfigures a host array in parallel in a multithreading manner. The threads in the parallel algorithm execute independently within a safe rerouting distance. The second algorithm is based on a divide-and-conquer approach to first generate the leftmost segments in parallel and then merge the segments in parallel. When compared to the conventional algorithm, simulation results from a large number of instances confirm that the proposed algorithms significantly accelerate the reconfiguration without loss of harvest.  相似文献   

8.
处理器阵列的容错重构技术是片上网络多核、众核高性能体系结构的可靠性技术之一。现有的最大逻辑阵列并行重构技术仅对单条逻辑列的构造实现了并行化,而对多条逻辑列的同步并行仍未见可行算法。依据处理器阵列的潜在并行性,在分治策略的基础上,提出了一种阵列分块的并行重构算法。算法对处理器阵列实施横向分块划分,对每个阵列块进行并行重构,并对所得逻辑子阵列进行归并,实现了多条逻辑列的同步并行重构。与现有的并行算法相比,新算法同样能够生成最大逻辑列,并且减少了通信开销与计算中的数据冗余,有效提高了运行速度。实验结果表明,在物理阵列大小为64×64的处理器阵列上,运行速度比现有并行算法提高39.55%,并且具有良好的可扩展性。  相似文献   

9.
Timed Petri-nets are used to model numerous types of large complex systems, especially computer architectures and communication networks. While formal analysis of such models is sometimes possible, discrete-event simulation remains the most general technique available for assessing the model′s behavior. Simulation′s computational requirements, however, can be massive, especially on the large complex models that defeat analytic methods. One way of meeting these requirements is by executing the simulation on a parallel machine. This paper describes simple techniques for the automated parallelization of timed Petri-net simulations. We address the issue of processor synchronization as well as the automated mapping, both static and dynamic, of the Petri-net to the parallel architecture. As part of this effort we describe a new mapping algorithm, one that also applies to more general parallel computations. We establish analytic properties of the solution produced by the algorithm, including optimality on some regular topologies, The viability of our integrated approach is demonstrated empirically on the Intel iPSC/860 and Delta architectures on Petri-net-based simulations of parallel architectures.  相似文献   

10.
In this paper, we present a class of preconditioning methods for a parallel solution of the three-dimensional Richards equation. The preconditioning methods Jacobi scaling, block-Jacobi, incomplete lower–upper, incomplete Cholesky and algebraic multigrid were applied in combination with a parallel conjugate gradient solver and tested for robustness and convergence using two model scenarios. The first scenario was an infiltration into initially dry, sandy soil discretised in 500,000 nodes. The second scenario comprised spatially distributed soil properties using 275,706 numerical nodes and atmospheric boundary conditions. Computational results showed a high efficiency of the nonlinear parallel solution procedure for both scenarios using up to 64 processors. Using 32 processors for the first scenario reduced the wall clock time to slightly more than 1% of the single processor run. For scenario 2 the use of 64 processors reduces the wall clock time to slightly more than 20% of the 8 processors wall clock time. The difference in the efficiency of the various preconditioning methods is moderate but not negligible. The use of the multigrid preconditioning algorithm is recommended, since on average it performed best for both scenarios.  相似文献   

11.
Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance tradeoff between the efficiency of new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient algorithms for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) redistribution. The most significant improvement of our methods is that a processor does not need to construct the send/receive data sets for a redistribution. Based on the packing/unpacking information that derived from the BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) redistribution and vice versa, a processor can pack/unpack array elements into (from) messages directly. To evaluate the performance of our methods, we have implemented our methods along with the Thakur's methods and the PITFALLS method on an IBM SP2 parallel machine. The experimental results show that our algorithms outperform the Thakur's methods and the PITFALLS method for all test samples. This result encourages us to use the proposed algorithms for array redistribution.  相似文献   

12.
基于SIMD—PRAM模型的分块图像匹配算法设计   总被引:1,自引:0,他引:1  
该文研究了基于SIMD—PRAM计算模型的有限处理元阵列规模的并行计算算法设计,分析了求阵列和的并行算法性能,运用求阵列和、图像分块映射与四邻平移等方法提出了一种根据处理元局部可用资源进行分块计算的图像匹配算法。计算机模拟实验结果表明,该算法完整、高效地执行了图像匹配,具有良好的并行计算性能。  相似文献   

13.
The Finite Element Method for solving partial differential equations using the long vector mode of the DAP is presented. This work was developed on a 32 × 32 version of the DAP attached to a Perq scientific workstation.

First, the implementation of finite elements using the long vector mode of the DAP is given, followed by the treatment of boundary conditions and the solution of the finite element equations using a parallel conjugate gradient method. Two solution procedures for the parallel conjugate gradient method, first without global matrix assembly and second with global matrix assembly, are presented and their advantages and disadvantages are discussed. Preconditioners for the conjugate gradient method based on iteration methods are also discussed and results include a 1-step point Jacobi preconditioner, a m-step point Jacobi preconditioner and a m-step multi-colour preconditioner. Finally long vector implementations for a larger system which stores multinodes per processor using a sliced mapping technique and domain decomposition are included.  相似文献   


14.
The speedup factor in real time simulation of dynamic systems using multiprocessor resources depends on: the architecture of the multiprocessor system, type of interconnection between parallel processors, numerical methods and techniques used for discretization and task assignment and scheduling policy. The minimization of the number of processors needed for real time simulation requires the minimization of processors times for interprocessor communications and efficient scheduling policy. Therefore, this article presents a methodology for the real time simulation of dynamic systems including a new pre-emptive static assignment and scheduling policy. The advantages of applying digital signal processor with parallel architecture, for example TMS320C40, in real time simulation have been described. Some important issues in real time architectures necessary for efficient multiprocessor real time simulations, such as multiple I/O channels, concurrent I/O and CPU processing, direct high speed interprocessor communications, fast context switching, multiple busses, multiple memories, and powerful arithmetic units are inherent to this processor. These features minimize interprocessor communication time and maximize sustained CPU performance.  相似文献   

15.
16.
Earley's algorithm has been commonly used for the parsing of general context-free languages and the error-correcting parsing in syntactic pattern recognition. The time complexity for parsing is 0(n3). This paper presents a parallel Earley's recognition algorithm in terms of an ``X*' operator. By restricting the input context-free grammar to be ?-free, the parallel algorithm can be executed on a triangular-shape VLSI array. This array system has an efficient way of moving data to the right place at the right time. Simulation results show that this system can recognize a string with length n in 2n + 1 system time. We also present a parallel parse-extraction algorithm, a complete parsing algorithm, and an error-correcting recognition algorithm. The parallel complete parsing algorithm has been simulated on a processor array which is similar to the triangular VLSI array. For an input string of length n the processor array will give the correct right-parse at system time 2n + 1 if the string is accepted. The error-correcting recognition algorithm has also been simulated on a triangular VLSI array. This array recognizes an erroneous string of length n in time 2n + 1 and gives the correct error count. These parallel algorithms are especially useful for syntactic pattern recognition.  相似文献   

17.
A parallel finite element solution algorithm for analysing large rotationally periodic structures on MIMD parallel computer systems is described. For a rotationally periodic structure, the global stiffness matrix under the corresponding symmetric coordinate system is periodic, i.e. possesses isomorphic properties, so that the global equation system can be transformed into a number of smaller equation systems which are fully decoupled. These decoupled equation systems then can be solved simultaneously on a multiprocessor parallel computer. The algorithm also generates the decoupled equation systems in parallel, without explicitly assembling the global stiffness matrix of the structure. A prototype implementation of the algorithm on an array of transputers is presented, and the efficiency of the program is also studied in this paper. Finally, a numerical example is given to demonstrate the speedup of the program.  相似文献   

18.
线性系统求解中迭代算法的GPU加速方法   总被引:1,自引:0,他引:1  
在求解线性系统时,迭代法是一种基本的方法,特别是在系数矩阵为大规模稀疏矩阵的情况下,高效地使用迭代法求解变得十分重要。本文通过分析迭代法的一般特点,提出了使用具有强大计算能力和存储带宽的GPU加速迭代法的一般方法。利用这些方法,在两种主流GPU平台上实现了一个经典的迭代法PQMRCGSTAB,并且针对不同的GPU平台特点提出了具体的优化方法。与AMD Opteron 2.4GHz 4核处理器相比,双精度版本的PQMRCGSTAB算法经NVIDIA Tesla S1070加速后性能提高31倍,经AMD Radeon HD 4870 X2加速后性能提高9倍。  相似文献   

19.
Efficient sorting is a key requirement for many computer science algorithms. Acceleration of existing techniques as well as developing new sorting approaches is crucial for many real‐time graphics scenarios, database systems, and numerical simulations to name just a few. It is one of the most fundamental operations to organize and filter the ever growing massive amounts of data gathered on a daily basis. While optimal sorting models for serial execution on a single processor exist, efficient parallel sorting remains a challenge. In this paper, we present a hardware‐optimized parallel implementation of the radix sort algorithm that results in a significant speed up over existing sorting implementations. We outperform all known General Processing Unit (GPU) based sorting systems by about a factor of two and eliminate restrictions on the sorting key space. This makes our algorithm not only the fastest, but also the first general GPU sorting solution.  相似文献   

20.
The ICL Distributed Array Processor (DAP) is an SIMD array processor containing a large, 2-D array of bit serial processing elements. The architecture of the DAP makes it well suited to data processing applications where searching operations must be carried out on large numbers of data records. This paper discusses the use of the DAP for two such applications, these being the scanning of serial text files and the clustering of a range of types of database. The processing efficiency of the DAP, when compared with a serial processor, is greatest when fixed length records are processed.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号