1.

Alan Colvin 《Computer Communications》1980,3(3):127-131

The four general methods of parallel processing are described. These are: single instruction single data stream (SISD), multiple instruction single data stream (MISD), single instruction multiple data stream (SIMD), and multiple instruction multiple data stream (MIMD). Most single computers in use are SISD machines, while most parallel processing applications use the MIMD approach. The paper outlines and compares the four basic MIMD architectures: tightly coupled, loosely coupled, voting, and peripheral processing. The latter is one of the most practical methods using existing minicomputers. Many modern programmable intelligent I/O controllers are peripheral processors. For a computer manufacturer to remain competitive, it is concluded that he will have to include such devices in his hardware.

2.

A high‐performance sorting algorithm for multicore single‐instruction multiple‐data processors

Hiroshi Inoue Takao Moriyama Hideaki Komatsu Toshio Nakatani 《Software》2012,42(6):753-777

Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both single‐instruction multiple‐data (SIMD) instructions and thread‐level parallelism. In this paper, we propose a new high‐performance sorting algorithm, called aligned‐access sort (AA‐sort), that exploits both the SIMD instructions and thread‐level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in‐core sorting phase and an out‐of‐core merging phase. The in‐core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out‐of‐core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA‐sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA‐sort using SIMD instructions outperformed IBM's optimized sequential sorting library by 1.8 times and bitonic mergesort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32‐bit integers. Also, a parallel version of AA‐sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic mergesort on both platforms. Copyright © 2011 John Wiley & Sons, Ltd.
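
For orientation, the in-core phase described above extends combsort. The sketch below is plain scalar comb sort only, not AA-sort itself: it contains no SIMD instructions, no alignment handling, and no threading. The function name `comb_sort` and the shrink factor of 1.3 are assumptions chosen for illustration (1.3 is a conventional value, not one taken from the paper).

```python
def comb_sort(values, shrink=1.3):
    """Return a sorted copy of `values` using scalar comb sort.

    Illustrative baseline only; `shrink` is an assumed, conventional factor.
    """
    data = list(values)
    gap = len(data)
    swapped = True
    # Keep going until the gap has shrunk to 1 and a full pass makes no swap.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        # Compare-exchange elements that are `gap` positions apart.
        for i in range(len(data) - gap):
            if data[i] > data[i + gap]:
                data[i], data[i + gap] = data[i + gap], data[i]
                swapped = True
    return data
```

Each pass applies the same compare-exchange at a fixed stride with no data-dependent branching on position, which is presumably why the algorithm is a reasonable candidate for vectorization.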

3.

Implementation of an ADI Method on parallel computers

Raad A. Fatoohi Chester E. Grosch 《Journal of scientific computing》1987,2(2):175-193

In this paper we discuss the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers. The computers were chosen so as to encompass a variety of architectures. They are the MPP, an SIMD machine with 16K bit-serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the Flex/32 and Cray/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
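
Each ADI half-step reduces to a set of independent tridiagonal systems. As a rough illustration of Gaussian elimination specialized to one such system (often called the Thomas algorithm), here is a minimal serial sketch. It assumes no pivoting is needed (for example, a diagonally dominant matrix); the function name `solve_tridiagonal` and the argument layout are hypothetical and not taken from the paper.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by Gaussian elimination without pivoting.

    lower: sub-diagonal (n-1 entries), diag: main diagonal (n entries),
    upper: super-diagonal (n-1 entries), rhs: right-hand side (n entries).
    Assumes the elimination never meets a zero pivot.
    """
    n = len(diag)
    c_prime = [0.0] * n  # modified super-diagonal
    d_prime = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    if n > 1:
        c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c_prime[i - 1]
        if i < n - 1:
            c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i - 1] * d_prime[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x
```

Both loops carry a dependency from one row to the next, so a single system is inherently sequential here; parallelism on an MIMD machine would come from solving many systems at once, whereas cyclic elimination restructures the work within one system.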

4.

Fast sorting algorithms on a linear array with a reconfigurable pipelined bus system

Datta A. Soundaralakshmi S. Owens R. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(3):212-222

We present two fast algorithms for sorting on a linear array with a reconfigurable pipelined bus system (LARPBS), one of the recently proposed parallel architectures based on optical buses. In our first algorithm, we sort N numbers in O(log N log log N) worst-case time using N processors. In our second algorithm, we sort N numbers in O((log log N)²) worst-case time using N^(1+ε) processors, for any fixed ε such that 0 < ε < 1. Our algorithms are based on a novel deterministic sampling scheme for merging two sorted arrays of length N each in O(log log N) time on an LARPBS with N processors. To our knowledge, the previous best sorting algorithm on this architecture has a running time of O((log N)²) using N processors.

5.

Parallel computer architectures for image processing

《Computer Vision, Graphics, and Image Processing》1984,25(1):68-88

Image processing problems frequently involve large structured arrays of data and a need for very rapid computation. Special parallel processing schemes have evolved over the last 20 years to deal with these problems. In this paper many parallel systems which have been developed for image processing are outlined and the features of their underlying architectures are discussed. Most of these special architectures may be loosely classified as either SIMD or pipeline structures although some MIMD structures have been designed for high level image analysis. In recent years several multiple SIMD (MSIMD) schemes have been proposed as suitable architectures for image processing. The fundamental problems of developing an effective MSIMD system are discussed and a simple SIMD/MIMD computational model for comparison with such systems is proposed.

6.

Parallel Algorithms for Image Template Matching on Hypercube SIMD Computers (Total citations: 1; self-citations: 0; citations by others: 1)

Fang Z Li X Ni LM 《IEEE transactions on pattern analysis and machine intelligence》1987,(6):835-841

This correspondence presents several parallel algorithms for image template matching on an SIMD array processor with a hypercube interconnection network. For an N by N image and an M by M window, the time complexity is reduced from O(N²M²) for the serial algorithm to O(M²/K² + M · log₂ N/K + log₂ N · log₂ K) for the N²K²-PE system (1 ≤ K ≤ M), or to O(N²M²/L²) for the L²-PE system (L ≤ N). With efficient use of the inter-PE communication network, each PE requires only a small local memory, many unnecessary data transmissions are eliminated, and the time complexity is greatly reduced.
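
To make the serial O(N²M²) baseline concrete, the sketch below slides an M by M window over an N by N image and scores every position. The abstract does not name the similarity measure, so the sum of squared differences used here is an assumption, as is the function name `match_template`.

```python
def match_template(image, template):
    """Return ((row, col), score) of the best match of `template` in `image`.

    `image` is an N x N list of lists and `template` an M x M list of lists.
    The score is the sum of squared differences (an assumed measure);
    lower is better and 0 means an exact match.
    """
    n, m = len(image), len(template)
    best_pos, best_score = None, None
    # O(N^2) window positions ...
    for r in range(n - m + 1):
        for c in range(n - m + 1):
            score = 0
            # ... times O(M^2) work per position.
            for i in range(m):
                for j in range(m):
                    diff = image[r + i][c + j] - template[i][j]
                    score += diff * diff
            if best_score is None or score < best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score
```

The two outer loops are independent across window positions, which is the part a processor array can distribute; the inner accumulation is where inter-PE communication would come into play.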

7.

A parallel algorithm for graph matching and its MasPar implementation

Allen R. Cinque L. Tanimoto S. Shapiro L. Yasuda D. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(5):490-501

Search of discrete spaces is important in combinatorial optimization. Such problems arise in artificial intelligence, computer vision, operations research, and other areas. For realistic problems, the search spaces to be processed are usually huge, necessitating long computation times, pruning heuristics, or massively parallel processing. We present an algorithm that reduces the computation time for graph matching by employing both branch-and-bound pruning of the search tree and massively-parallel search of the as-yet-unpruned portions of the space. Most research on parallel search has assumed that a multiple-instruction-stream/multiple-data-stream (MIMD) parallel computer is available. Since massively parallel single-instruction-stream/multiple-data-stream (SIMD) computers are much less expensive than MIMD systems with equal numbers of processors, the question arises as to whether SIMD systems can efficiently handle state-space search problems. We demonstrate that the answer is yes, and in particular, that graph matching has a natural and efficient implementation on SIMD machines.

8.

A survey of parallel computer architectures (Total citations: 1; self-citations: 0; citations by others: 1)

Duncan R. 《Computer》1990,23(2):5-16

An attempt is made to place recent architectural innovations in the broader context of parallel architecture development by surveying the fundamentals of both newer and more established parallel computer architectures and by placing these architectural alternatives in a coherent framework. The primary emphasis is on architectural constructs rather than specific parallel machines. Three categories of architecture are defined and discussed: synchronous architectures, comprising vector, SIMD (single-instruction-stream, multiple-data-stream) and systolic machines; MIMD (multiple-instruction-stream, multiple-data-stream) with either distributed or shared memory; and MIMD-based paradigms, comprising MIMD/SIMD hybrid, dataflow, reduction, and wavefront types.

9.

Experimental application-driven architecture analysis of an SIMD/MIMD parallel processing system

Bronson E.C. Casavant T.L. Jamieson L.H. 《Parallel and Distributed Systems, IEEE Transactions on》1990,1(2):195-205

An experimental analysis of the architecture of an SIMD/MIMD parallel processing system is presented. Detailed implementations of parallel fast Fourier transform (FFT) programs were used to examine the performance of the prototype of the PASM (Partitionable SIMD/MIMD) parallel processing system. Detailed execution-time measurements using specialized timing hardware were made for the complete FFT and for components of SIMD, MIMD, and barrier-synchronized MIMD implementations. The component measurements isolated the effects of floating-point arithmetic operations, interconnection network transfer operations, and program control overhead. The measurements allow an accurate extrapolation of the execution time, speedup, and efficiency of the MIMD, SIMD, and barrier-synchronized MIMD programs to a full 1024-processor PASM system. This constitutes one of the first results of this kind, in which controlled experiments on fixed hardware were used to make comparisons of these fundamental modes of computing. Overall, the experimental results demonstrate the value of mixed-mode SIMD/MIMD computing and its suitability for computationally intensive algorithms such as the FFT.

10.

Parallel marching Poisson solvers

Marian Vajteršic 《Parallel Computing》1984,1(3-4):325-330

The paper presents parallel algorithms for solving the Poisson equation at N² mesh points. The methods, based on marching techniques, are structured for efficient parallel realization. Using orthogonal decomposition properties of the arising matrices, the algorithms can be formulated in terms of transformed vectors. On an MIMD computer with not more than N processors, the computations can be performed in horizontal slices with minimal synchronization requirements. Considering an SIMD machine with N² processors, the complexity bound O(log N) has been achieved, whereby the single marching requires 10 log N steps only.

11.

An Overview of the Development of SIMD Computers

景晓军 方滨兴 《计算机科学》1995,22(3):4-8

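1. Introduction. Parallel processing is an effective way to improve computer performance and has become a focus of research in computer architecture. The SIMD (single instruction, multiple data stream) computer, by M. J. Flynn in 1966, on computer sys…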

12.

Analysis of Multi-Sort Algorithm on Multi-Mesh of Trees (MMT) architecture

Nitin Rakesh Nitin 《The Journal of supercomputing》2011,57(3):276-313

Various sorting algorithms using parallel architectures have been proposed in the search for more efficient results. This paper introduces the Multi-Sort Algorithm for Multi-Mesh of Trees (MMT) Architecture for N = n⁴ elements with more efficient time complexity compared to previous architectures. The shear sort algorithm on Single Instruction Multiple Data (SIMD) mesh model requires \(4\sqrt{N}+O(\sqrt{N})\) time for sorting N elements, arranged on a \(\sqrt{N}\times \sqrt{N}\) mesh, whereas Multi-Sort algorithm on the SIMD Multi-Mesh (MM) Architecture takes \(O(N^{1/4})\) time for sorting the same N elements, which proves that Multi-Sort is a better sorting approach. We have improved the time complexity of intrablock Sort. The communication time complexity for 2D Sort in MM is O(n), whereas this time in MMT is O(log n). The time complexity of the compare-exchange step in MMT is the same as that in MM, i.e., O(n). It has been found that the time complexity of the Multi-Sort on MMT is improved relative to that on the Multi-Mesh architecture.
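
For readers unfamiliar with the shear sort baseline mentioned above, here is a minimal sequential simulation of it on a 2D mesh. It is a sketch under stated assumptions, not the paper's Multi-Sort: Python's built-in sort stands in for the odd-even transposition sorts a real mesh would run along rows and columns, the function name `shear_sort` is hypothetical, and the phase count is deliberately generous (extra phases leave an already sorted mesh unchanged).

```python
import math


def shear_sort(grid):
    """Return a copy of `grid` sorted in snake (boustrophedon) order.

    Even-indexed rows end up ascending left to right, odd-indexed rows
    descending, so reading the mesh in snake order yields a sorted sequence.
    """
    mesh = [list(row) for row in grid]
    rows = len(mesh)
    cols = len(mesh[0]) if rows else 0

    def sort_rows():
        # Alternate direction per row to obtain the snake ordering.
        for r in range(rows):
            mesh[r].sort(reverse=(r % 2 == 1))

    def sort_columns():
        # Every column is sorted top to bottom.
        for c in range(cols):
            column = sorted(mesh[r][c] for r in range(rows))
            for r in range(rows):
                mesh[r][c] = column[r]

    phases = math.ceil(math.log2(rows)) + 1 if rows > 1 else 1
    for _ in range(phases):
        sort_rows()
        sort_columns()
    sort_rows()  # finish on a row phase
    return mesh
```

On a √N × √N mesh each row or column phase costs about √N steps, and there are on the order of log N phases, which is the cost the abstract contrasts with Multi-Sort.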

13.

Optimization of the Bitonic Sorting Algorithm Based on Graphics Hardware

宾洪斌 何锫 胡明辉 《计算机工程与设计》2008,29(14)

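This paper introduces a new parallel sorting algorithm. Building on bitonic merge sort, it exploits the parallel architecture of graphics hardware and the advantages of the binary sort tree data structure, replacing the sequential execution of all stages with partial parallelism in order to optimize the bitonic sorting algorithm. An analysis of the algorithm shows that, in theory, sorting n elements on p stream processors has an optimal time complexity of O((n log n)/p). Experimental results show that the optimized algorithm takes less time than other graphics-hardware-based bitonic merge sort algorithms.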
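
As background for the baseline being optimized, the following is a minimal sketch of the standard bitonic sorting network, run sequentially. It is not the paper's optimized variant and uses no binary sort tree; the function name `bitonic_sort` and the power-of-two length restriction are assumptions made to keep the example short.

```python
def bitonic_sort(values):
    """Return a sorted copy of `values` using the bitonic sorting network.

    Assumes len(values) is a power of two (or zero).
    """
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2  # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2  # distance between compared elements
        while j >= 1:
            # All compare-exchanges in this loop are independent of each
            # other; on parallel hardware they would run simultaneously.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data
```

The network has O(log² n) stages of n/2 comparisons each, and the comparison pattern is fixed in advance, which is what makes it a natural fit for stream processors.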

14.

A Heterogeneous Mixed-Mode Execution Model for Massively Parallel Systems

《Journal of Parallel and Distributed Computing》1999,56(1):2-16

In this paper, we consider a massively parallel system that is composed of heterogeneous processors, that is, processors with different processing power, and that combines the advantages of the SIMD and MIMD architectures. The heterogeneous mixed-mode (HeMM) execution model is composed of two main components, which operate in the well-known SIMD and MIMD paradigms. The main computing power comes from a component that is composed of a massive number of processors and operates in a data parallel manner. The other component is composed of a few (or even one) fast processors which operate in the MIMD paradigm. The operation of a small number of processors in an MIMD paradigm has been well demonstrated through actual systems. The processors in this component add flexibility to the execution of the parallel programs such that it adjusts to the changing parallelism of the program to enhance the performance. Based on this execution model we analyze the gains in performance that are obtainable by this new system. We show that substantial performance gains can be obtained by using the HeMM system.

15.

Architecture-independent parallel computation

Skillicorn D.B. 《Computer》1990,23(12):38-50

The major parallel architecture classes are considered: single-instruction multiple-data (SIMD) computers, tightly coupled multiple-instruction multiple-data (MIMD) computers, hypercuboid computers and constant-valence MIMD computers. An argument that the PRAM model is universal over tightly coupled and hypercube systems, but not over constant-valence-topology, loosely coupled systems is reviewed, showing precisely how the PRAM model is too powerful to permit broad universality. Ways in which a model of computation can be restricted to become universal over less powerful architectures are discussed. The Bird-Meertens formalism (R.S. Bird, 1989) is introduced and it is shown how it is used to express computations in a compact way. It is also shown that the Bird-Meertens formalism is universal over all four architecture classes and that nontrivial restrictions of functional programming languages exist that can be efficiently executed on disparate architectures. The use of the Bird-Meertens formalism as the basis for a programming language is discussed, and it is shown that it is expressive enough to be used for general programming. Other models and programming languages with architecture-independent properties are reviewed.

16.

PERFORM: a fast object recognition method using intersection of projection

Modayur B.R. Shapiro L.G. 《IEEE transactions on pattern analysis and machine intelligence》1997,19(5):499-506

This paper describes an object recognition methodology called PERFORM that finds matches by establishing correspondences between model and image features using this formulation. PERFORM evaluates correspondences by intersecting error regions in the image space. The algorithm is analyzed with respect to theoretical complexity as well as actual running times. When a single solution to the matching problem is sought, the time complexity of the sequential matching algorithm for 2D-2D matching using point features is of the order O(l³N²), where N is the number of model features and l is the number of image features. When line features are used, the sequential complexity is of the order O(l²N²). When a single solution is sought, PERFORM runs faster than the fastest known algorithm to solve the bounded-error matching problem. The PERFORM method is shown to be easily realizable on both SIMD and MIMD architectures.

17.

Massively parallel algorithms for trace-driven cache simulations

Nicol D.M. Greenberg A.G. Lubachevsky B.D. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(8):849-859

Considers the use of massively parallel architectures to execute a trace-driven simulation of a single cache set. A method is presented for the least-recently-used (LRU) policy, which, regardless of the set size C, runs in time O(log N) using N processors on the EREW (exclusive read, exclusive write) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. We present timings of this algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference-based line replacement policies are considered, which includes LRU as well as the least-frequently-used (LFU) and random replacement policies. A simulation method is presented for any such policy that, on any trace of length N directed to a C line set, runs in O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well suited for SIMD implementation.
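
To clarify what is being simulated, here is a minimal sequential reference for a trace-driven LRU simulation of one cache set. It is the straightforward serial version, not any of the parallel algorithms in the paper; the function name `simulate_lru_set` and the representation of the trace as a list of line tags are assumptions for illustration.

```python
from collections import OrderedDict


def simulate_lru_set(trace, capacity):
    """Return a list of booleans, one per reference: True for hit, False for miss.

    `trace` is a sequence of hashable line tags directed to a single cache
    set that holds `capacity` lines and uses LRU replacement.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    lines = OrderedDict()  # ordered from least to most recently used
    hits = []
    for tag in trace:
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
            hits.append(True)
        else:
            if len(lines) >= capacity:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = None
            hits.append(False)
    return hits
```

For example, `simulate_lru_set([1, 2, 3, 1, 4, 2], 3)` records a hit only on the fourth reference. The serial loop takes time proportional to the trace length N, which is the cost the parallel methods aim to cut to a logarithmic factor.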

18.

A Stage-Rotation Parallel Algorithm for Multi-Objective Dynamic Programming

康一梅 吴沧浦 《自动化学报》1994,20(5):561-569

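A stage-rotation parallel algorithm for multi-objective dynamic programming is proposed for parallel machines with SIMD and MIMD architectures. The stage-rotation iterative algorithm for multi-objective dynamic programming converts the whole-process optimization problem into subprocess optimization problems and then searches the non-inferior solution sets of the subprocesses for the non-inferior solutions of the whole process. In this way, the problem of insufficient memory in multi-objective dynamic programming is converted into a problem of time, and the very high computing speed of parallel machines is then used to resolve the memory shortage effectively. Analyses of time complexity and speedup, together with a worked example, demonstrate the effectiveness and advantages of the algorithm.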

19.

A Method for Optimizing Data Placement on Distributed-Memory Parallel Computers

孙彤 方滨兴 李晓明 《计算机学报》1995,(7)

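This paper analyzes the problem of optimizing data placement for parallel programs that target distributed-memory SIMD/MIMD parallel machines. Building on the FORALL program model and the MESH communication model, we study the optimization requirement of reducing communication cost during data decomposition. We use a dimension preference graph to describe the alignment requirements among parallel arrays; by eliminating the conflicts in the dimension preference graph, a dimension alignment graph is obtained. Each dimension alignment graph corresponds to one data placement scheme. The larger the total cost of a dimension alignment graph, the smaller the corresponding communication cost. An approximation algorithm for finding the maximum-cost dimension alignment graph is given.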

20.

Parallel Wavelet Transform over Distributed Computer Network for Real-Time Applications

《Real》2000,6(5):375-389

This paper presents an approach to parallel implementation of wavelet transforms in a distributed computing environment. To achieve robustness and efficiency, we propose a parallel algorithm for wavelet transform which can be implemented in SIMD, MIMD and pipeline architectures on the configured system. Our experimental results show that our proposed algorithm will speed up the wavelet-based image processing tasks on a network of computer workstation clusters.