1.

Distributed selectsort sorting algorithms on broadcast communication networks

Jau-Hsiung Huang Leonard Kleinrock 《Parallel Computing》1990,16(2-3):183-190

In this paper, a distributed selectsort algorithm and a parameterized selectsort algorithm are presented for distributed systems in cases where N ≫ P, where N is the number of elements to be sorted and P is the number of processors in the system. The distributed system considered in this paper uses a broadcast channel for communication between processors. We show that the number of messages required by the parameterized selectsort algorithm is independent of N and is of complexity O(P), which is optimal in a distributed system with P processors. Furthermore, the amount of communication required in terms of elements is N + O(P³), and the computation time complexity is O((N/P) lg N + P² lg(N/P)). Hence, when N ≫ P³, the computation time complexity is O((N/P) lg N), which is optimal using P processors. In addition, the parameterized algorithm provides a parameter K whose value can be chosen to trade off processing, memory, and communication requirements. It is shown that this parameterized algorithm can reduce the communication requirements significantly while only slightly increasing the computation requirements.
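
The final complexity claim can be checked with a short calculation. This is a sketch, assuming N ≥ P³, P ≥ 1, and base-2 logarithms; it is not taken from the paper itself:

```latex
\[
N \ge P^3 \;\Longrightarrow\; P^2 \le \frac{N}{P},
\qquad
P \ge 1 \;\Longrightarrow\; \lg\frac{N}{P} \le \lg N ,
\]
\[
\text{hence}\quad P^2 \lg\frac{N}{P} \;\le\; \frac{N}{P}\,\lg N
\quad\text{and}\quad
O\!\left(\frac{N}{P}\lg N + P^2 \lg\frac{N}{P}\right) = O\!\left(\frac{N}{P}\lg N\right).
\]
```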

2.

Parallel nested dissection

John M. Conroy 《Parallel Computing》1990,16(2-3):139-156

Nested dissection is a very popular direct method for solving sparse linear systems that arise from finite difference and finite element methods. Worley and Schreiber [16] give a fine grain algorithm for a square array of processors. Their algorithm uses O(N²) processors, each with O(N) memory, to factor an N² by N² sparse matrix whose graph is an N × N mesh. The efficiency of their method is between 1/46 and 1/12. George et al. [6] [8] give a medium grain algorithm for hypercube architectures, while George et al. [7] give an algorithm for shared memory machines. These papers present a column oriented approach which can exploit O(N) parallelism and yield efficiencies up to 50%. Lucas [11] also gives a column oriented scheme which achieves up to 75% efficiency and O(N) parallelism. In this paper, we present a medium to fine grain algorithm for a P × P array of processors with local memory. This algorithm can exploit up to O(N²) parallelism. The efficiency of the fine grain version is comparable to [16], while the medium grain version achieves about 49% efficiency. The strength of the method is due to three factors: its ability to pipeline much of the computation, its overlapping of computation and communication, and its use of level 3 BLAS-like primitives. In addition to its high efficiency, its memory requirement is optimal: only O(N² log N/P²) words of memory are needed per processor.

3.

Scheduling parallel iterative methods on multiprocessor systems

Nikolaos M. Missirlis 《Parallel Computing》1987,5(3):295-302

The paper describes the implementation of the Successive Overrelaxation (SOR) method on an asynchronous multiprocessor computer for solving large linear systems. The parallel algorithm is derived by dividing the serial SOR method into noninterfering tasks, which are then combined with an optimal schedule for a feasible number of processors. The important features of the algorithm are: (i) it achieves a speedup S_p ≈ O(N/3) and an efficiency E_p ≈ 2/3 using P = [N/2] processors, where N is the number of equations; (ii) it contains a high level of inherent parallelism, while the convergence theory of the parallel SOR method is the same as that of its sequential counterpart; and (iii) it may be modified to use block methods in order to minimise the overhead due to communication and synchronisation of the processors.
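
The efficiency figure follows from the speedup figure. A short check, assuming the standard definition E_p = S_p / P (the abstract does not state it) and ignoring the rounding in P = [N/2]:

```latex
\[
E_p \;=\; \frac{S_p}{P} \;\approx\; \frac{N/3}{N/2} \;=\; \frac{2}{3}.
\]
```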

4.

Efficient communication primitives on hypercubes

Ching-Tien Ho M. T. Raghunath 《Concurrency and Computation》1992,4(6):427-457

We give practical algorithms, complexity analysis and implementation for one-to-all broadcasting, all-to-all personalized communication and matrix transpose (with two-dimensional partitioning of the matrix) on hypercubes. We assume the following communication characteristics: circuit-switched, e-cube routing and a one-port communication model. For one-to-all broadcasting, we give an algorithm that combines the well-known recursive doubling algorithm [1] and the algorithm based on edge-disjoint spanning trees [2]. The measured times of the combined algorithm are always superior to those of the edge-disjoint spanning tree algorithm and outperform the recursive doubling algorithm. For all-to-all personalized communication, we propose a hybrid algorithm that combines the well-known recursive doubling algorithm [3,4] and the recently proposed direct-route algorithm [5,6]. Our hybrid algorithm balances the data transfer time and start-up time of these two algorithms, and its communication complexity is estimated to be better than that of the two previous algorithms for a range of machine parameters. For matrix transpose with two-dimensional partitioning of the matrix, we relate a two-phase algorithm to the previous result in Reference 7. The algorithm is predicted to be better than the recursive transpose algorithm [8] by n nearest-neighbor communications [4]. It takes advantage of circuit-switched routing and is congestion-free within each phase. We also suggest a way of storing the matrix such that the transpose operation can be realized in one phase without congestion.
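
As an illustration of the recursive doubling building block only (not the combined algorithm of the paper), the following minimal simulation shows a one-to-all broadcast on a hypercube under a one-port model. The function name `recursive_doubling_broadcast` and its return format are hypothetical choices made for this sketch.

```python
def recursive_doubling_broadcast(dim, root=0):
    """Simulate one-to-all broadcast on a hypercube with 2**dim nodes.

    In round `bit`, every node that already holds the message sends it to
    its neighbor across dimension `bit`, so each node sends at most one
    message per round (one-port model).
    Returns (rounds, holders), where rounds is a list of
    (sender, receiver) lists, one list per round.
    """
    holders = {root}
    rounds = []
    for bit in range(dim):
        # Flip one address bit to reach the neighbor in this dimension.
        sends = [(node, node ^ (1 << bit)) for node in sorted(holders)]
        holders.update(receiver for _, receiver in sends)
        rounds.append(sends)
    return rounds, holders


if __name__ == "__main__":
    rounds, holders = recursive_doubling_broadcast(3)
    for number, sends in enumerate(rounds):
        print("round", number, sends)
    print("nodes reached:", len(holders))
```

The number of holders doubles each round, so a cube of dimension d is covered in d rounds.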

5.

Communication-Efficient Sorting Algorithms on Reconfigurable Array of Processors With Slotted Optical Buses

《Journal of Parallel and Distributed Computing》1999,57(2):166-187

The reconfigurable array with slotted optical buses (RASOB) has recently received a lot of attention from the research community. In this paper, we first discuss the reconfiguration methods and communication capabilities of the RASOB architecture. Then, we use this architecture for the implementation of efficient sorting algorithms on the 1D RASOB and the 2D RASOB. Our parallel sorting algorithm on the 1D RASOB is based on an efficient divide-and-conquer scheme. It sorts N data items using N processors in O(k) communication cycles, where k is the size in bits of the data items to be sorted. We further develop a parallel sorting algorithm on the 2D RASOB based on the sorting algorithm on the 1D RASOB in conjunction with the well-known Rotatesort algorithm. Similarly, this algorithm sorts N data items on a 2D RASOB of size N in O(k) communication cycles. These sorting algorithms are much more efficient than state-of-the-art sorting algorithms on reconfigurable arrays of processors with electronic buses using the same number of processors.

6.

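Application of the parallel PCG algorithm in electrical prospecting

陈荣征 李代平 何驰 黄健 《微计算机信息》2007,23(9)

When the finite element method is used for electrical prospecting, it produces large sparse systems of linear equations, and improving the efficiency of solving them is key to geophysical prospecting research. Because traditional direct methods are difficult to parallelize, a parallel PCG algorithm is proposed for solving the linear systems of a geophysical prospecting system in a Beowulf cluster environment. In a cluster environment, the algorithm requires little communication between machines, has low time complexity, and is easy to parallelize. Experimental results show that the PCG algorithm achieves good parallel performance.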
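
For readers unfamiliar with PCG, here is a minimal serial sketch of the preconditioned conjugate gradient iteration. The abstract does not name the preconditioner, so the Jacobi (diagonal) preconditioner used here is an assumption, as are the function name `pcg` and the dense list-of-lists matrix format.

```python
def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    Serial sketch with a Jacobi preconditioner (an assumption; the
    abstract does not specify which preconditioner is used).
    """
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]   # M^{-1} for M = diag(A)
    x = [0.0] * n
    r = list(b)                                    # residual b - A x with x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]     # preconditioned residual
    p = list(z)                                    # search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    b = [6.0, 10.0, 8.0]
    print(pcg(A, b))
```

Each iteration consists of one matrix-vector product plus a few dot products and vector updates, which are the operations a cluster implementation would distribute.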

7.

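Application of a cluster-based EBE-PCG algorithm in geophysical prospecting

陈荣征 秦昭晖 张希花 赵娟 《现代计算机》2007,(6):56-58,69

When the finite element method is used for electrical prospecting, it produces large sparse systems of linear equations, and improving the efficiency of solving them is key to geophysical prospecting research. A coarse-grained EBE-PCG algorithm is proposed for handling geophysical prospecting problems in a Beowulf cluster environment. In a cluster environment, the algorithm requires little communication between machines and is easy to parallelize. Experimental results show that the EBE-PCG algorithm achieves good parallel performance.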

8.

Two minimum spanning forest algorithms on fixed-size hypercube computers

Sajal K. Das Narsingh Deo Sushil Prasad 《Parallel Computing》1990,15(1-3):179-187

Two parallel algorithms for finding a minimum spanning forest (MSF) of a weighted undirected graph on hypercube computers consisting of a fixed number of processors are presented. One algorithm is suited for sparse graphs, the other for dense graphs. Our design strategy is based on successive elimination of non-MSF edges. The input graph is partitioned equally among the processors, which then repeatedly eliminate non-MSF edges and merge results to gradually construct the desired MSF of the entire graph. Low communication overhead is achieved by restricting message flow to neighboring processors in the hypercube topology. The correctness of our approach is due to a theorem which states that, with totally ordered edges, if an edge of an arbitrary subgraph does not belong to that subgraph's MSF, then it does not belong to the MSF of the entire graph. For a graph of n vertices and m edges, our first algorithm finds an MSF in O((m log m)/p) time using p processors for p ≤ (m log m)/(n(1 + log(m/n))). The second algorithm, efficient for dense graphs, requires O(n²/p) time for p ≤ n/log n.
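
The elimination strategy can be sketched sequentially as follows. This is an illustrative simulation, not the authors' hypercube algorithm: the names `kruskal_msf` and `partitioned_msf`, the round-robin partition, and the pairwise merge order are all assumptions, and Kruskal's algorithm is used as the local MSF routine. Edges are `(weight, u, v)` tuples, so tuple comparison supplies the total order the theorem requires.

```python
def kruskal_msf(num_vertices, edges):
    """Return the MSF of the given edges as a sorted list (Kruskal)."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    forest = []
    for edge in sorted(edges):
        _, u, v = edge
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            forest.append(edge)
    return forest


def partitioned_msf(num_vertices, edges, num_parts):
    """Simulate: split edges, drop non-MSF edges locally, merge pairwise."""
    parts = [edges[i::num_parts] for i in range(num_parts)]
    # Each simulated processor keeps only the MSF edges of its own subgraph.
    survivors = [kruskal_msf(num_vertices, part) for part in parts]
    while len(survivors) > 1:
        merged = []
        for i in range(0, len(survivors), 2):
            combined = survivors[i] + (survivors[i + 1] if i + 1 < len(survivors) else [])
            merged.append(kruskal_msf(num_vertices, combined))
        survivors = merged
    return survivors[0]


if __name__ == "__main__":
    edges = [(4, 0, 1), (1, 1, 2), (3, 0, 2), (2, 2, 3), (5, 3, 4), (7, 1, 4), (6, 5, 6)]
    print(partitioned_msf(7, edges, 3))
```

Because an edge rejected by any local MSF cannot be in the global MSF, discarding it early is safe, and the merged result matches a single global run.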

9.

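3D human motion tracking based on an improved particle filter on a Beowulf cluster

李敏 宋曰聪 吴斌 彭保 《计算机工程与应用》2015,51(14):17-22

The standard particle filter algorithm, when applied to video-based 3D human motion tracking, suffers from a huge computational load, particle degeneracy, and tracking failure, and therefore cannot meet tracking accuracy and real-time requirements at the same time. To address these problems, a new improved particle filter algorithm on a Beowulf cluster is proposed. The new algorithm recovers automatically from tracking failure through automatic initialization of the 3D human model parameters and adjustment of the number of particles and the template. A migration-based parallel particle filter algorithm for the Beowulf cluster, designed around a dynamic task allocation strategy and a low-overhead communication strategy, overcomes particle degeneracy and increases computation speed. Experimental results show that the new method effectively alleviates particle degeneracy and tracking failure, reduces computation time, improves tracking accuracy, and can meet both the accuracy and real-time requirements of 3D human motion tracking.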

10.

Linear rotation based algorithm and systolic architecture for solving linear system equations

I-Chang Jou 《Parallel Computing》1989,11(3):367-379

A linear rotation based algorithm is proposed for solving systems of linear equations, Ax = b. This algorithm modifies the conventional Gaussian elimination method and can avoid the problems of numerical singularity and ill-conditioning. In this study, a trapezoidal systolic array of n²/2 + n − 2 processors and a linear array of n processors are implemented for this algorithm. The trapezoidal systolic array performs the triangularization of the matrix A using the modified linear rotation algorithm, while the linear array performs the backward substitution to evaluate the solution x. The computing time for solving a linear equation system is O(5n) time units. In addition, an implicit representation of the elimination factor by means of a sign parameter sequence, instead of a numerical value, is introduced to simplify the hardware. This systolic architecture is simple, uniform, and regular, and is therefore well suited to implementation as a VLSI chip.

11.

O(log n) bimodality analysis

Tsai-Yun Phillips Azriel Rosenfeld Allen C. Sher 《Pattern recognition》1989,22(6):741-746

The bimodality of a population P can be measured by dividing its range into two intervals so as to maximize the Fisher distance between the resulting two subpopulations P₁ and P₂. If P is a mixture of two (approximately) Gaussian subpopulations, then P₁ and P₂ are good approximations to the original Gaussians, if their Fisher distance is great enough. Moreover, good approximations to P₁ and P₂ can be obtained by dividing P into small parts; finding the maximum-distance (MD) subdivision of each part; combining small groups of these subdivisions into (approximate) MD subdivisions of larger parts; and so on. This divide-and-conquer approach yields an approximate MD subdivision of P in O(log n) computational steps using O(n) processors, where n is the size of P.
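
The maximum-distance subdivision can be illustrated with a brute-force sequential search over all thresholds. This is not the O(log n) divide-and-conquer scheme of the paper. The abstract does not define the Fisher distance, so the common form (μ₁ − μ₂)² / (σ₁² + σ₂²) is assumed here, and the function names are hypothetical.

```python
from statistics import mean, pvariance


def fisher_distance(left, right):
    """Assumed definition: squared mean gap over summed variances."""
    spread = pvariance(left) + pvariance(right)
    if spread == 0:
        return float("inf")   # both parts are constant
    return (mean(left) - mean(right)) ** 2 / spread


def max_distance_split(values):
    """Try every threshold and return the (left, right) split that
    maximizes the Fisher distance, or None if no split exists."""
    ordered = sorted(values)
    best_distance, best_cut = None, None
    for cut in range(1, len(ordered)):
        if ordered[cut - 1] == ordered[cut]:
            continue   # a threshold must separate distinct values
        distance = fisher_distance(ordered[:cut], ordered[cut:])
        if best_distance is None or distance > best_distance:
            best_distance, best_cut = distance, cut
    if best_cut is None:
        return None
    return ordered[:best_cut], ordered[best_cut:]


if __name__ == "__main__":
    print(max_distance_split([12, 1, 3, 11, 2, 13]))
```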

12.

An Efficient Parallel Algorithm for Computing the Gaussian Convolution of Multi-dimensional Image Data

Yip Hoi-Man Ahmad Ishfaq Pong Ting-Chuen 《The Journal of supercomputing》1999,14(3):233-255

In this paper, we propose a parallel convolution algorithm for estimating the partial derivatives of 2D and 3D images on distributed-memory MIMD architectures. Exploiting the separable characteristics of the Gaussian filter, the proposed algorithm consists of multiple phases such that each phase corresponds to a separated filter. Furthermore, it exploits both task and data parallelism, and reduces communication through data redistribution. We have implemented the proposed algorithm on the Intel Paragon and obtained a substantial speedup using more than 100 processors. The performance of the algorithm is also evaluated analytically. The analytical results agree with the experimental results and indicate that the proposed algorithm scales very well with the problem size and the number of processors. We have also applied our algorithm to the design and implementation of an efficient parallel scheme for the 3D surface tracking process. Although our focus is on 3D image data, the algorithm is also applicable to 2D image data, and can be useful for a myriad of important applications including medical imaging, magnetic resonance imaging, ultrasonic imagery, scientific visualization, and image sequence analysis.
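
The separability that the algorithm exploits can be shown with a small serial sketch for the 2D smoothing case: one 1D pass along rows, a transpose, and a second 1D pass. All names here are hypothetical, zero padding at the borders is an assumption, and the transpose only loosely stands in for the data redistribution step mentioned in the abstract.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights on [-radius, radius]."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_rows(image, kernel):
    """Filter every row with the 1D kernel, using zero padding."""
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                xx = x + k - radius
                if 0 <= xx < width:
                    acc += weight * image[y][xx]
            out[y][x] = acc
    return out


def transpose(image):
    return [list(column) for column in zip(*image)]


def separable_gaussian(image, sigma, radius):
    """Phase 1 filters along x; phase 2 filters along y via a transpose."""
    kernel = gaussian_kernel(sigma, radius)
    row_pass = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(row_pass), kernel))


if __name__ == "__main__":
    impulse = [[0.0] * 5 for _ in range(5)]
    impulse[2][2] = 1.0
    for row in separable_gaussian(impulse, 1.0, 1):
        print(["%.4f" % value for value in row])
```

For a kernel of width w, the two 1D passes cost about 2w multiplications per pixel instead of w² for the equivalent 2D filter.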

13.

Optimal and nearly optimal algorithms for approximating polynomial zeros

V.Y. Pan 《Computers & Mathematics with Applications》1996,31(12):97-138

We substantially improve the known algorithms for approximating all the complex zeros of an nth-degree polynomial p(x). Our new algorithms save both Boolean and arithmetic sequential time, versus the previous best algorithms of Schönhage [1], Pan [2], and Neff and Reif [3]. In parallel (NC) implementation, we dramatically decrease the number of processors, versus the parallel algorithm of Neff [4], which was the only NC algorithm known for this problem so far. Specifically, under the simple normalization assumption that the variable x has been scaled so as to confine the zeros of p(x) to the unit disc {x : |x| ≤ 1}, our algorithms (which promise to be practically effective) approximate all the zeros of p(x) within the absolute error bound 2^(−b), using on the order of n arithmetic operations and on the order of (b + n)n² Boolean (bitwise) operations (in both cases up to polylogarithmic factors). The algorithms allow optimal (work-preserving) NC parallelization, so that they can be implemented in polylogarithmic time using on the order of n arithmetic processors or (b + n)n² Boolean processors. All the cited bounds on the computational complexity are within polylogarithmic factors of the optimum (in terms of n and b) under both arithmetic and Boolean models of computation (in the Boolean case, under the additional (realistic) assumption that n = O(b)).

14.

A cost-optimal parallel tridiagonal system solver

Ferng-Ching Lin Kuo-Liang Chung 《Parallel Computing》1990,15(1-3):189-199

We first show how to transform the solution of an n × n tridiagonal system into suffix computations of continued fractions. Then a parallel substitution scheme is introduced to compute the suffix values. The derived parallel algorithm allows the tridiagonal system to be solved in O(log n) time on an unshuffle network with Θ(n / log n) processors. It is cost-optimal in the sense that the product of processor count and execution time is minimized. Our solver is conceptually simple and easy to implement.
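
The cost-optimality claim amounts to the following product, compared here against the Θ(n) time of sequential tridiagonal elimination (a standard fact, not stated in the abstract):

```latex
\[
\text{cost} \;=\; \underbrace{\Theta\!\left(\frac{n}{\log n}\right)}_{\text{processors}}
\times \underbrace{O(\log n)}_{\text{time}}
\;=\; O(n),
\]
```

which matches the sequential operation count up to a constant factor.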

15.

Pipelined Diagnosis of Wafer-Scale Linear Arrays

《Journal of Parallel and Distributed Computing》1994,20(2):212-223

We present a comparison-based algorithm for identifying faulty and fault-free elements in a wafer-scale linear array of processors (or other logic elements). Only nearest neighbor communication is assumed to be possible between the processors in the array. Because the algorithm is simple and requires no storage of test vectors or test outcomes, it is ideally suited for implementation on the wafer to provide the capability for built-in production (or post production) testing. We show that, surprisingly, this method achieves high accuracy of diagnosis over a wide range of yields even though the diagnosis may be based on a high proportion of results produced by faulty processors. We also propose an improvement to the above algorithm which uses a processor diagnosed as fault-free by the basic algorithm as the starting point in improving the accuracy with which faulty processors are identified. Quantitative and qualitative reasoning validate the efficiency of these schemes.

16.

Portable and Scalable Algorithm for Irregular All-to-All Communication

《Journal of Parallel and Distributed Computing》2002,62(10):1493-1526

In irregular all-to-all communication, messages are exchanged between every pair of processors. The message sizes vary from processor to processor and are known only at run time. This is a fundamental communication primitive in parallelizing irregularly structured scientific computations. Our algorithm reduces the total number of message start-ups. It also reduces node contention by smoothing out the lengths of the messages communicated. As compared to the earlier approaches, our algorithm provides deterministic performance and also reduces the buffer space at the nodes during message passing. The performance of the algorithm is characterised using a simple communication model of high-performance computing (HPC) platforms. We show the implementation on T3D and SP2 using C and the message passing interface standard. These can be easily ported to other HPC platforms. The results show the effectiveness of the proposed technique as well as the interplay among the machine size, the variance in message length, and the network interface.

17.

Efficient and scalable quicksort on a linear array with a reconfigurable pipelined bus system

Yi Pan Mounir Hamdi Keqin Li 《Future Generation Computer Systems》1998,13(6):501-513

Based on current fiber optic technology, a new computational model, called a linear array with a reconfigurable pipelined bus system (LARPBS), is proposed in this paper. A parallel quicksort algorithm is implemented on the model, and its time complexity is analyzed. For a set of N numbers, the quicksort algorithm reported in this paper runs in O(log₂ N) average time on a linear array with a reconfigurable pipelined bus system of size N. If the number of processors available is reduced to P, where P < N, the algorithm runs in O((N/P) log₂ N) average time and is still scalable. Besides proposing a new algorithm on the model, some basic data movement operations involved in the algorithm are discussed. We believe that these operations can be used to design other parallel algorithms on the same model. Future research in this area is also identified in this paper.

18.

Designing efficient parallel algorithms on CRAP

Tzong-Wann Kao Shi-Jinn Horng Yue-Li Wang Horng-Ren Tsai 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(5):554-560

A cross-bridge reconfigurable array of processors is a parallel processing system which can dynamically change the supported interconnection scheme during the execution of an algorithm. Based on this architecture, several O(1) time basic operations, such as the transpose, the untranspose, the shift, the unshift and the prefix sum of a binary sequence, are first proposed. These basic operations are then used to find the kth smallest element of N m-bit unsigned integers in O(m) time using N processors, and to sort N data items in O(1) time using O(N^(5/3)) processors instead of the O(N²) processors required by algorithms proposed by other researchers.

19.

A projection method for solving nonsymmetric linear systems on multiprocessors

Chandrika Kamath Ahmed Sameh 《Parallel Computing》1989,9(3):291-312

We consider the iterative solution of large sparse linear systems of equations arising from elliptic and parabolic partial differential equations in two or three space dimensions. Specifically, we focus our attention on nonsymmetric systems of equations whose eigenvalues lie on both sides of the imaginary axis, or whose symmetric part is not positive definite. This system of equations is solved using a block Kaczmarz projection method with conjugate gradient acceleration. The algorithm has been designed with special emphasis on its suitability for multiprocessors. In the first part of the paper, we study the numerical properties of the algorithm and compare its performance with other algorithms such as the conjugate gradient method on the normal equations, and conjugate gradient-like schemes such as ORTHOMIN(k), GCR(k) and GMRES(k). We also study the effect of using various preconditioners with these methods. In the second part of the paper, we describe the implementation of our algorithm on the CRAY X-MP/48 multiprocessor, and study its behavior as the number of processors is increased.
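
To show the projection idea in its simplest form, here is the classical cyclic row Kaczmarz iteration. It is a deliberately simplified stand-in: the paper uses a block variant with conjugate gradient acceleration, neither of which appears in this sketch, and the function name `kaczmarz` and the fixed sweep count are assumptions.

```python
def kaczmarz(A, b, sweeps=200):
    """Classical row Kaczmarz for a consistent system A x = b.

    Each step projects the current iterate onto the hyperplane
    defined by one equation; no symmetry or definiteness is needed.
    """
    x = [0.0] * len(A[0])
    for _ in range(sweeps):
        for row, rhs in zip(A, b):
            norm_sq = sum(a * a for a in row)
            if norm_sq == 0.0:
                continue   # skip an all-zero row
            residual = rhs - sum(a * xi for a, xi in zip(row, x))
            step = residual / norm_sq
            x = [xi + step * a for xi, a in zip(x, row)]
    return x


if __name__ == "__main__":
    # Nonsymmetric matrix with eigenvalues +sqrt(7) and -sqrt(7),
    # i.e. on both sides of the imaginary axis.
    A = [[1.0, 2.0], [3.0, -1.0]]
    b = [5.0, 1.0]   # exact solution is x = (1, 2)
    print(kaczmarz(A, b))
```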

20.

A simplified design strategy for mapping image processing algorithms on a SIMD torus

Guna Seetharaman 《Theoretical computer science》1995,140(2):319-331

It is proposed to enhance and simplify the programming of a two-dimensional (2-D) torus (and mesh) connected SIMD array of simple processing elements (PEs) by introducing two dedicated communication registers in each PE. A new SIMD algorithm to transpose a matrix using only two buffers at each PE is described. A method is proposed to effectively realize a large number of arbitrary, one-to-one, personalized, and concurrent communications between the PEs by suitably repeating the matrix transpose algorithm. Implementation of several image processing tasks of a shift-variant nature, such as the Hough transform, histograms, and median filters, which involve such communication, is enhanced by this approach. The dynamic behavior of such a SIMD implementation is data independent, unlike implementations that employ greedy methods for handling the overall communication. This feature facilitates coordinated use of several independently operating SIMD meshes in a newly emerging computer vision paradigm known as multiview image-sequence analysis (MVISA) for 3-D perception of unstructured dynamic scenes.