期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Implementation of GAMMA on a Massively Parallel Computer 总被引：1，自引：0，他引：1

Huang Linpeng Tong Weiqin Kam Wing Ng Sun Yongqiang 《计算机科学技术学报》1997,12(1):29-39

The GAMMA paradigm is recently proposed by Banatre and Metayer to describe the systematic construction of parallel programs without introducing artificial sequentiality.This paper presents two synchronous execution models for GAMMA and discusses how to implement them on MasPar MP-1,a massively data parallel computer.The results show that GAMMA paradign can be implemented very naturally on data parallel machines,and very high level language,such as GAMMA in which parallelism is left implicit,is suitable for specifying massively parallel applications. 相似文献

2.

Efficient parallel algorithms for linear recurrence computation

Albert C. Greenberg Richard E. Ladner Michael S. Paterson Zvi Galil 《Information Processing Letters》1982,15(1):31-35

相似文献

3.

A massively parallel fault-tolerant architecture for time-critical computing

Ishfaq Ahmad 《The Journal of supercomputing》1995,9(1-2):135-162

Building large-scale parallel computer systems for time-critical applications is a challenging task since the designers of such systems need to consider a number of related factors such as proper support for fault tolerance, efficient task allocation and reallocation strategies, and scalability. In this paper we propose a massively parallel fault-tolerant architecture using hundreds or thousands of processors for critical applications with timing constraints. The proposed architecture is based on an interconnection network called thebisectional network. A bisectional network is isomorphic to a hypercube in that a binary hypercube network can be easily extended as a bisectional network by adding additional links. These additional links add to the network some rich topological properties such as node symmetry, small diameter, small internode distance, and partitionability. The important property of partitioning is exploited to propose a redundant task allocation and a task redistribution strategy under realtime constraints. The system is partitioned into symmetric regions (spheres) such that each sphere has a central control point. The central points, calledfault control points (FCPs), are distributed throughout the entire system in an optimal fashion and provide two-level task redundancy and efficiently redistribute the loads of failed nodes. FCPs are assigned to the processing nodes such that each node is assigned two types of FCPs for storing two redundant copies of every task present at the node. Similarly, the number of nodes assigned to each FCP is the same. For a failure-repair system environment the performance of the proposed system has been evaluated and compared with a hypercube-based system. Simulation results indicate that the proposed system can yield improved performance in the presence of a high number of node failures. 相似文献

4.

VLSI design for massively parallel signal processors

SY Kung Jurgen Annevelink 《Microprocessors and Microsystems》1983,7(10):461-468

相似文献

5.

A parallel Householder tridiagonalization stratagem using scattered square decomposition

H. Y. Chang

S. Utku

M. SalamaD. Rapp 《Parallel Computing》1988,6(3):297-311

The parallel stratagem in this paper uses scattered square decomposition, introduced by G. Fox, for its data assignment and then exploits parallelism in the solution steps of the sequential Householder tridiagonalization algorithm. One may condense a real symmetric full matrix A of order n into a tridiagonal form by the stratagem in concurrent machines where N(= D²) processors are used. Expressions for efficiency and speedup are given for the evaluation of the stratagem. An alternative stratagem which requires less data transmission but more computations is also discussed. The results shown that the Householder Method of tridiagonalization may be implemented on a concurrent machine efficiently by scattered square decomposition provided that the number of matrix elements contained in each processor is much larger than the number of processors of the concurrent machine, and the ratio of the time to transmit one data item from one processor to any other processor to the time to perform a floating-point arithmetic operation is small enough. 相似文献

6.

Interconnection network analysis for a compliant massively parallel processor

D. Bryan Perdue Daniel Tabak 《Journal of Systems Architecture》1997,42(9-10):665-678

The paper analyzes and selects an appropriate interconnection network for a compliant multiprocessor. The multiprocessor is compliant to the tasks assigned to it in the sense that it can be reconfigured to provide a more efficient fit to the tasks to be executed. A number of possible candidate networks for the multiprocessor is considered: Omega, ADM, Hypercube and Torus. The potential applicability of these networks to the multiprocessor is analyzed from the points of view of partitionability, inter-PE delay, fault impact, and cost. After the individual analysts of the above points of consideration is completed, a weighted network factor is formed, and the optimal type of network is selected, under different performance criteria. The overall results point to the selection of the Torus or Hypercube network for most cases under consideration. 相似文献

7.

An improved parallel Jacobi method for diagonalizing a symmetric matrix

Alan H. Karp John Greenstadt 《Parallel Computing》1987,5(3):281-294

We compare five implementations of the Jacobi method for diagonalizing a symmetric matrix. Two of these, the classical Jacobi and sequential sweep Jacobi, have been used on sequential processors. The third method, the parallel sweep Jacobi, has been proposed as the method of choice for parallel processors. The fourth and fifth methods are believed to be new. They are similar to the parallel sweep method but use different schemes for selecting the rotations.

The classical Jacobi method is known to take O(n⁴) time to diagonalize a matrix of order n. We find that the parallel sweep Jacobi run on one processor is about as fast as the sequential sweep Jacobi. Both of these methods take O(n³ log₂n) time. One of our new methods also takes O(n³ log₂n) time, but the other one takes only O(n³) time. The choice among the methods for parallel processors depends on the degree of parallelism possible in the hardware. The time required to diagonalize a matrix on a variety of architectures is modeled.

Unfortunately for proponents of the Jacobi method, we find that the sequential QR method is always faster than the Jacobi method. The QR method is faster even for matrices that are nearly diagonal. If we perform the reduction to tridiagonal form in parallel, the QR method will be faster even on highly parallel systems. 相似文献

8.

Towards a single model of efficient computation in real parallel machines

Pilar de la Torre Clyde P Kruskal 《Future Generation Computer Systems》1992,8(4):395-408

We propose a model of parallel computation, the YPRAM, that allows general parallel algorithms to be designed for a wide class of parallel models. The basic model captures locality among processors, which is measured as a function of two parameters; latency and bandwidth.

We design YPRAM algorithms for solving several fundamental problems: parallel prefix, sorting, sorting numbers from a bounded range, and list ranking. We show that our model predicts, reasonably accurately, the actual known performances of several basic parallel models — PRAM, hypercube, mesh and tree — when solving these problems. 相似文献

9.

GPU支持的SAR影像几何校正大规模并行处理

下载免费PDF全文

杨景辉程春泉张继贤黄国满《中国图象图形学报》2015,20(3):374-385

目的几何校正(又称地理编码)是合成孔径雷达(SAR)影像处理流程中重要的一个步骤,具有一定的计算复杂度,需要用到几何定位模型。本文针对星载SAR影像,采用有理多项式系数(RPC)定位模型,提出了图形处理器(GPU)支持的几何校正大规模并行处理方法。方法该方法充分利用GPU计算资源强大及几何校正过程中每个像素处理步骤一致的特点,每次导入大量像素至GPU,为每个像素分配一个线程,每个线程执行有理函数计算、投影变换、插值采样等计算复杂度高的步骤,通过优化配置dim Grid和dim Block参数,提升GPU的并行性能。该方法通过分块处理实现SAR影像大幅面处理,且可适用于多个不同分块大小。结果实验结果显示其计算加速比为38 44,为全面客观地分析GPU并行处理的特点,还计算了整体加速比,通过多个实验分析影响整体加速性能的因素,提出大块读写提高I/O性能的优化方法。结论该方法形式简洁,通用性好,可适用于几乎所有的星载SAR影像、不同的影像幅面;且加速性能明显。相似文献

10.

Scalable communication architectures for massively parallel hardware multi-processors

Yahya Jan Lech Jóźwiak 《Journal of Parallel and Distributed Computing》2012

Modern complex embedded applications in multiple application fields impose stringent and continuously increasing functional and parametric demands. To adequately serve these applications, massively parallel multi-processor systems on a single chip (MPSoCs) are required. This paper is devoted to the design of scalable communication architectures of massively parallel hardware multi-processors for highly-demanding applications. We demonstrated that in the massively parallel hardware multi-processors the communication network influence on both the throughput and circuit area dominates the processors influence, while the traditionally used flat communication architectures do not scale well with the increase of parallelism. Therefore, we propose to design highly optimized application-specific partitioned hierarchical organizations of the communication architectures through exploiting the regularity and hierarchy of the actual information flows of a given application. We developed related communication architecture synthesis strategies and incorporated them into our quality-driven model-based multi-processor design methodology and related automated architecture exploration framework. Using this framework we performed a large series of architecture synthesis experiments. Some of the results of the experiments are presented in this paper. They demonstrate many features of the synthesized communication architectures and show that our method and related framework are able to efficiently synthesize well scalable communication architectures even for the high-end massively parallel multi-processors that have to satisfy extremely stringent computation demands. 相似文献

11.

The parallel computation of eigenvalues and eigenvectors of large Hermitian matrices using the AMT DAP 510

J. S. Weston M. Clint C. W. Bleakney 《Concurrency and Computation》1991,3(3):179-185

The solution of the algebraic eigenvalue problem is an important component of many applications in science and engineering. With the advent of novel architecture machines, much research effort is now being expended in the search for parallel algorithms for the computation of eigensystems which can gainfully exploit the processing power which these machines provide. Among important recent work References 1-4 address the real symmetric eigenproblem in both its dense and sparse forms, Reference 5 treats the unsymmetric eigenproblem, and Reference 6 investigates the solution of the generalized eigenproblem. In this paper two algorithms for the parallel computation of the eigensolution of Hermitian matrices on an array processor are presented. These algorithms are based on the Parallel Orthogonal Transformation algorithm (POT) for the solution of real symmetric matrices[7,8]. POT was developed to exploit the SIMD parallelism supported by array processors such as the AMT DAP 510. The new algorithms use the highly efficient implementation strategies devised for use in POT. The implementations of the algorithms permit the computation of the eigensolution of matrices whose order exceeds the mesh size of the array processor used. A comparison of the efficiency of the two algorithms for the solution of a variety of matrices is given. 相似文献

12.

Design and implementation of rasterization parallel algorithm based on boundary algebra filling

ZHOU Chen CHEN Zhen jie ZHANG Shuai 《计算机工程与科学》2013,35(4):37

相似文献

13.

Parallel Implementation of Linear Algebra Problems on Dawning-1000

下载免费PDF全文

Chi Xuebin 《计算机科学技术学报》1998,13(2):141-146

In this paper,some parallel algorithms are described for solving numerical linear algebra problems on Dawning-1000.They include matrix multiplication,LU factorization of a dense matrix,Cholesky factorization of a symmetric matrix,and eigendecomposition of symmetric matrix for real and complex data types.These programs are constructed based on fast BLAS library of Dawning-1000 under NX environment.Some comparison results under different parallel environments and implementing methods are also given for Cholesky factorization.The execution time,measured performance and speedup for each problem on Dawning-1000 are shown.For matrix multiplication and IU factorization,1.86GFLOPS and 1.53GFLOPS are reached. 相似文献

14.

Extending the DEVS formalism for massively parallel simulation

Yung-Hsin Wang Bernard P. Zeigler 《Discrete Event Dynamic Systems》1993,3(2-3):193-218

The use of multiprocessors for discrete event simulation is an active research area where work has focused on strategies for model execution with little regard for the underlying formalism in which models may be expressed. However, a formalism-based approach offers several advantages including the ability to migrate models from sequential to parallel platforms and the ability to calibrate simulation architectures to model structural properties. In this article, we extend the DEVS (discrete event system specification) formalism, originally developed for sequential simulation, to accommodate the full potential of parallel processing. The extension facilitates exploitation of both internal and external event parallelism manifested in hierarchical, modular DEVS models. After developing a mapping of the extended formalism to parallel architectures, we describe an implementation of the approach on a massively parallel architecture, the Connection Machine. Execution results are discussed for a class of models exhibiting high external and internal event parallelism, the so-called broadcast models. These verify the tenets of the underlying theory and demonstrate that significant reduction in execution time is possible compared to the same model executed in serial simulation. 相似文献

15.

The universal space for parallel computation

R. Zuczek 《Information Processing Letters》1977,6(2):42-45

相似文献

16.

更实际的并行计算模型 总被引：7，自引：0，他引：7

陈国良《小型微型计算机系统》1995,16(2):1-9

过去所报导的大量并行算法在小规模的并行机上均运行得很好，然而将其移植到大规模并行机上运行时性能却很差。原因之一就是并行计算模型（如ＰＲＡＭ）过于抽象，略去了一些诸如通信、同步等算法运行时不可忽略的因素。本文介绍目前所提出的几个较能反映近代并行机性能的更为实际的并行计算模型，包括异步ＰＲＡＭ，ＢＳＰ，ｌｏｇＰ和Ｃ３模型等。当然这些模型在与真实并行机吻合的程度、可使用性和分析较复杂算法时的可操作性等方面尚存异议，但是它们的确打开了研究并行计其模型的新途径，成为当今并行算法研究的热点之一。相似文献

17.

Overview of parallel processing research in Japan

R Ohbuchi 《Parallel Computing》1985,2(3):219-228

This paper gives an overview of Japanese research and development efforts on the parallel processing architectures. Projects are categorized by their application domains. Following an introduction, general trends and some examples of research projects for each of the application domains such as artificial intelligence, numerical processing, and others like database, image, graphics, etc. are presented. 相似文献

18.

Analysis of a model for parallel image processing

S. Yalamanchili J.K. Aggarwal 《Pattern recognition》1985,18(1):1-16

An increasing awareness of the need for high speed parallel processing systems for image analysis has stimulated a great deal of interest in the design and development of such systems. Efficient processing schemes for several specific problems have been developed providing some insight into the general problems encountered in designing efficient image processing algorithms for parallel architectures. However it is still not clear what architecture or architectures are best suited for image processing in general, or how one may go about determining those which are. An approach that would allow application requirements to specify architectural features would be useful in this context. Working towards this goal, general principles are outlined for formulating parallel image processing tasks by exploiting parallelism in the algorithms and data structures employed. A synchronous parallel processing model is proposed which governs the communication and interaction between these tasks. This model presents a uniform framework for comparing and contrasting different formulation strategies. In addition, techniques are developed for analyzing instances of this model to determine a high level specification of a parallel architecture that best ‘matches’ the requirements of the corresponding application. It is also possible to derive initial estimates of the component capabilities that are required to achieve predefined performance levels. Such analysis tools are useful both in the design stage, in the selection of a specific parallel architecture, or in efficiently utilizing an existing one. In addition, the architecture independent specification of application requirements makes it a useful tool for benchmarking applications. 相似文献

19.

CP-PACS: A massively parallel processor at the University of Tsukuba 总被引：1，自引：0，他引：1

Kisaburo Nakazawa Hiroshi Nakamura Taisuke Boku Ikuo Nakata Yoshiyuki Yamashita 《Parallel Computing》1999,25(13-14):1635-1661

Computational Physics by Parallel Array Computer System (CP-PACS) is a massively parallel processor developed and in full operation at the Center for Computational Physics at the University of Tsukuba. It is an MIMD machine with a distributed memory, equipped with 2048 processing units and 128 GB of main memory. The theoretical peak performance of CP-PACS is 614.4 Gflops. CP-PACS achieved 368.2 Gflops with the Linpack benchmark in 1996, which at that time was the fastest Gflops rating in the world.CP-PACS has two remarkable features. Pseudo Vector Processing feature (PVP-SW) on each node processor, which can perform high speed vector processing on a single chip superscalar microprocessor; and a 3-dimensional Hyper-Crossbar (3-D HXB) Interconnection network, which provides high speed and flexible communication among node processors.In this article, we present the overview of CP-PACS, the architectural topics, some details of hardware and support software, and several performance results. 相似文献

20.

Efficient parallel processing of competitive learning algorithms 总被引：1，自引：0，他引：1

Kentaro Sano Shintaro Momose Hiroyuki Takizawa Hiroaki Kobayashi Tadao Nakamura 《Parallel Computing》2004,30(12):1361-1383

Vector quantization (VQ) is an attractive technique for lossy data compression, which has been a key technology for data storage and/or transfer. So far, various competitive learning (CL) algorithms have been proposed to design optimal codebooks presenting quantization with minimized errors. Although algorithmic improvements of these CL algorithms have achieved faster codebook design than conventional ones, limitations of speedup still exist when large data sets are processed on a single processor. Considering a variety of CL algorithms, parallel processing on flexible computing environment, like general-purpose parallel computers is in demand for a large-scale codebook design. This paper presents a formulation for efficiently parallelizing CL algorithms, suitable for distributed-memory parallel computers with a message-passing mechanism. Based on this formulation, we parallelize three CL algorithms: the Kohonen learning algorithm, the MMPDCL algorithm and the LOJ algorithm. Experimental results indicate a high scalability of the parallel algorithms on three different types of commercially available parallel computers: IBM SP2, NEC AzusA and PC cluster. 相似文献