期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Asynchronous problems on SIMD parallel computers

Shu W. Wu M.-Y. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(7):704-713

One of the essential problems in parallel computing is: Can SIMD machines handle asynchronous problems? This is a difficult, unsolved problem because of the mismatch between asynchronous problems and SIMD architectures. We propose a solution to let SIMD machines handle general asynchronous problems. Our approach is to implement a runtime support system which can run MIMD-like software on SIMD hardware. The runtime support system, named P kernel, is thread-based. There are two major advantages of the thread-based model. First, for application problems with irregular and/or unpredictable features, automatic scheduling can move some threads from overloaded processors to underloaded processors. Second, and more importantly, the granularity of threads can be controlled to reduce system overhead. The P kernel is also able to handle bookkeeping and message management, as well as to make these low-level tasks transparent to users. Substantial performance has been obtained on Maspar MP-1 相似文献

2.

A SIMD approach to parallel heuristic search

Ambuj MahantiCharles J. Daniels 《Artificial Intelligence》1993,60(2):243-282

Serial search algorithms often exhibit exponential run times and may require an exponential amount of storage as well. Thus, the design of parallel search algorithms with limited memory is of obvious interest. This paper presents an efficient SIMD parallel algorithm, called IDPS (for iterative-deepening parallel search). At a broad level IDPS is a parallel version of IDA*. While generically we have called our algorithm an IDPS, performance of four variants of it has been studied through experiments conducted on the well-known test-bed problem for search algorithms, namely the Fifteen Puzzle. During the experiments, data were gathered under two different static load balancing schemes. Under the first scheme, an unnormalized average efficiency of approximately 3/4 was obtained for 4K, 8K, and 16K processors. Under the second scheme, unnormalized average efficiencies of 0.92 and 0.76, and normalized average efficiencies of 0.70 and 0.63 were obtained for 8K and 16K processors, respectively. We show (as shown previously only for MIMD machines) that for admissible search, high average speedup can be obtained for problems of significant size. We believe that this research will enhance AI problem solving using parallel heuristic search algorithms. 相似文献

3.

Parallel fuzzy clustering on fixed size hypercube SIMD computers

E. L. Zapata F. F. Rivera O. G. Plata M. A. Ismail 《Parallel Computing》1989,11(3):291-303

This article presents PFCM, a parallel algorithm for fuzzy clustering of large data sets. Being a generalization of FCM, the algorithm enables arbitrary numbers of data points, features and clusters to be handled cost-optimally by hypercube SIMD computers of arbitrary cube dimension, the only limitation being the size of the local memories of the processors. Speedup responds optimally to enlarging the hypercube. PFCM owes its flexibility to the technique employed in its derivation from the sequential fuzzy C-means algorithm FCM: the association of each of the three dimensions of the problem (numbers of data points, features and clusters) with a distinct subset of hypercube dimensions. 相似文献

4.

Efficient tree codes on SIMD computer architectures

Kevin M. Olson 《Computer Physics Communications》1996,98(3):267-287

This paper describes changes made to a previous implementation of an N-body tree code developed for a fine-grained, SIMD computer architecture. These changes include (1) switching from a balanced binary tree to a balanced oct tree, (2) addition of quadrupole corrections, and (3) having the particles search the tree in groups rather than individually. An algorithm for limiting errors is also discussed. In aggregate, these changes have led to a performance increase of over a factor of 10 compared to the previous code. For problems several times larger than the processor array, the code now achieves performance levels of ∼ 1 Gflop on the Maspar MP-2 or roughly 20% of the quoted peak performance of this machine. This percentage is competitive with other parallel implementations of tree codes on MIMD architectures. This is significant, considering the low relative cost of SIMD architectures. 相似文献

5.

A SIMD parallel trapezoid rasterization algorithm

R.Victor Klassen 《Computers & Graphics》1991,15(4):553-559

An algorithm for the rasterization of trapezoids on the Connection Machine^TM (CM) is described. The input consists of an array of trapezoids, with two horizontal sides, arranged with one trapezoid per processor. (Unless otherwise indicated, “processor” should be taken to mean virtual processor.) Each trapezoid is converted to an edge record and the edge records are then distributed to enough processors so that each processor is responsible for one scanline of one trapezoid. Each processor computes a scan record for its scanline, and the scan records are then distributed to enough processors so that one processor is responsible for a single pixel. Final interpolation of position, and possibly shading information, is performed in parallel for all pixels thus created, and the pixels are then broadcast into a frame buffer array, with depth comparisons being performed at the receiving end to ensure that the nearest pixel appears in the array. The Connection Machine-specific features used by the algorithm are logarithmic time cumulative summing, general communication with comparisons for collision arbitration, and virtual processor sets. Performance is similar to that of good graphics workstations. The intended application is to display data already resident in the machine as the result of some previous computation, when a high-performance graphics workstation is not available. 相似文献

6.

Non-numerical methods on parallel computers

P.M. Flanders 《Computer Physics Communications》1982,26(3-4):363-371

相似文献

7.

Synthesizing variable instruction issue interpreters forimplementing functional parallelism on SIMD computers

Abu-Ghazaleh N.B. Wilsey P.A. Xianzhi Fan Hensgen D.A. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(4):412-423

Functional parallelism can be supported on SIMD machines by interpretation. Under such a scheme, the programs and data of each task are loaded on the processing elements (PEs) and the Control Unit of the machine executes a central control algorithm that causes the concurrent interpretation of the tasks on the PEs. The central control algorithm is, in many respects, analogous to the control store program on microprogrammed machines. Accordingly, the organization of the control algorithm greatly influences the performance of the synthesized MIMD environment. Most central control algorithms are constructed to interpret the execution phase of all instructions during every cycle (iteration). However, it is possible to delay the interpretation of infrequent and costly instructions to improve the overall performance. Interpreters that attempt improved performance by delaying the issue of infrequent instructions are referred to as variable issue control algorithms. This paper examines the construction of optimized variable issue control algorithms. In particular, a mathematical model for the interpretation process is built and two objective functions (instruction throughput and PE utilization) are defined. The problem of deriving variable issue control algorithms for these objective functions has been shown elsewhere to be NP-complete. Therefore, this paper investigates three heuristic algorithms for constructing near optimal variable issue control algorithms. The performance of the algorithms is studied on four different instruction sets and the trends of the schedulers with respect to the instruction sets and the objective functions are analyzed 相似文献

8.

Pyramidal thinning algorithm for SIMD parallel machines

Stphane 《Pattern recognition》1995,28(12):1993-2000

We propose a parallel thinning algorithm for binary pictures. Given an N × N binary image including an object, our algorithm computes in O(N²) the skeleton of the object, using a pyramidal decomposition of the picture. The behavior of this algorithm is studied considering a family of digitalization of the same object at a different level of resolution. With the Exclusive Read Exclusive Write (EREW) Parallel Random Access Machine (PRAM), our algorithm runs in O(log N) time using O(N²/logN) processors and it is work-optimal. The same result is obtained with high-connectivity distributed memory SIMD machines having strong hypercube and pyramid. We describe the basic operator, the pyramidal algorithm and some experimental results on the SIMD MasPar parallel machine. 相似文献

9.

Structured sparse matrix-vector multiplication on massively parallel SIMD architectures

T. Dehn M. Eiermann K. Giebermann V. Sperling 《Parallel Computing》1995,21(12):1867-1894

We propose an algorithm for the cost-effective evaluation of structured sparse matrix-vector multiplications on massively parallel SIMD computers with distributed memory. Under the assumption that a large percentage of the nonzero entries of the matrix is concentrated on almost full diagonals, a specific data transport problem arises, for which we derive an equivalent graph theoretical formulation. We find an optimal solution of this communication problem in terms of a minimum-cost spanning tree.

We demonstrate the efficiency of our matrix-vector multiplication on a MP-1 with 16,384 processors. 相似文献

10.

Environmental modeling on massively parallel computers

《Environmental Modelling & Software》2000,15(5):489-496

In a previous work we studied the concurrent implementation of a numerical model, CONDIFP, developed for the analysis of depth-averaged convection–diffusion problems. Initial experiments were conducted on the Intel Touchstone Delta System, using up to 512 processors and different problem sizes. As for other computation-intensive applications, the results demonstrated an asymptotic trend to unity efficiency when the computational load dominates the communication load. This paper relates some other numerical experiences, in both one and two space dimensions with various choices of initial and boundary conditions, carried out on the Intel Paragon XP/S Model L38 with the aim to illustrate the parallel solver versatility and reliability. 相似文献

11.

Pseudospectral methods on massively parallel computers

Richard B. Pelz 《Computer Methods in Applied Mechanics and Engineering》1990,80(1-3):493-503

Parallel algorithms are presented for the Fourier pseudospectral method and parts of the Chebyshev pseudospectral method. Performance of these schemes is reported as implemented on the NCUBE hypercube. The problem to which these methods are applied is the time integration of the incompressible Navier-Stokes equations. Despite serious communication requirements, the efficiencies are high; e.g., 92% for a 128³ mesh on 1024 processors. Benchmark timings rival those of optimized codes on supercomputers. 相似文献

12.

Optimal parallel algorithms for constructing and maintaining a balancedm-way search tree

Eliezer Dekel Shietung Peng S. Sitharma Lyengar 《International journal of parallel programming》1986,15(6):503-528

We present parallel algorithms for constructing and maintaining balancedm-way search trees. These parallel algorithms have time complexity O(1) for ann processors configuration. The formal correctness of the algorithms is given in detail. 相似文献

13.

Communication problems on MIMD parallel computers

G.P. McKeown V.J. Rayward-Smith 《Information Processing Letters》1984,19(2):69-73

Various proposals for networks of large numbers of processors are reviewed. Bottleneck problems arise in these networks with the flow of data between processors. Communication problems which can arise in practical situations are discussed and techniques for reducing bottlenecks are developed. Some simulation results are given for the binary n-cube. 相似文献

14.

Assembling genomes on large-scale parallel computers

A. S.J. P.S. S. 《Journal of Parallel and Distributed Computing》2007,67(12):1240

Assembly of large genomes from tens of millions of short genomic fragments is computationally demanding requiring hundreds of gigabytes of memory and tens of thousands of CPU hours. The advent of high throughput sequencing technologies, new gene-enrichment sequencing strategies, and collective sequencing of environmental samples further exacerbate this situation. In this paper, we present the first massively parallel genome assembly framework. The unique features of our approach include space-efficient and on-demand algorithms that consume only linear space, and strategies to reduce the number of expensive pairwise sequence alignments while maintaining assembly quality. Developed as part of the ongoing efforts in maize genome sequencing, we applied our assembly framework to genomic data containing a mixture of gene enriched and random shotgun sequences. We report the partitioning of more than 1.6 million fragments of over 1.25 billion nucleotides total size into genomic islands in under 2 h on 1024 processors of an IBM BlueGene/L supercomputer. We also demonstrate the effectiveness of the proposed approach for traditional whole genome shotgun sequencing and assembly of environmental sequences. 相似文献

15.

Numerical multiple integration on parallel computers

Alan C. Genz 《Computer Physics Communications》1982,26(3-4):349-352

This paper discusses the implementation of methods for the approximate calculation of multiple integrals on a SIMD parallel computer. Adaptive methods using polynomial integrating basic rules and Monte Carlo and number theoretic basic rules are considered with particular reference to implementation on the ICL DAP computer. Test results are given which compare serial and parallel versions of the same method. 相似文献

16.

Programming three parallel computers

Kallstrom M. Thakkar S.S. 《Software, IEEE》1988,5(1):11-22

Three commercial systems are considered from a programmer's point of view. The three are the Intel iPSC, a network of Inmos transputers, and the Sequent Balance. The differences in overhead are examined by implementing a solution to the traveling-salesman problem on all three. The evaluation focuses on three major issues in parallel programming: (1) how execution is divided among processing elements and how it is controlled; (2) how data are shared; and (3) how events are synchronized. The experiences of the authors are presented and some specific as well as general conclusions are drawn 相似文献

17.

Design, analysis, and implementation of a parallel tree search algorithm

Akl SG Barnard DT Doran RJ 《IEEE transactions on pattern analysis and machine intelligence》1982,(2):192-203

The alpha-beta algorithm for searching decision trees is adapted to allow parallel activity in different parts of a tree during the search. The algorithm has been implemented in a procedural simulation language (GASP IV). The simulation environment provides the illusion of multiple software processes and multiple hardware processors. A number of preliminary experiments have been done to gather statistics on run time, nodes scored, and nodes visited. Results indicate that a substantial reduction in time of search occurs because of the use of parallelism. An analytic expression for the storage requirements of the algorithm is derived. The analysis provides an example of the classical tradeoff between time and storage. 相似文献

18.

The scalability of FFT on parallel computers

Gupta A. Kumar V. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(8):922-932

The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh despite the fact that large scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures such as SIMD hypercube and mesh architectures and shared memory architecture 相似文献

19.

Representing shared data on distributed-memory parallel computers

K. T. Herley 《Theory of Computing Systems》1996,29(2):111-156

The problem of representing a setU≜{u ₁,...,u _m} of read-write variables on ann-node distributed-memory parallel computer is considered. It is shown thatU can be represented among then nodes of a variant of the mesh of trees usingO((m/n) polylog(m/n)) storage per node such that anyn-tuple of variables may be accessed inO(logn (log logn)²) time in the worst case form polynomial inn. This work was supported in part by the Joint Services Electronics Program under Contract F49620-87-C-0044 and by IBM under Agreement 12060043. Earlier versions of these results appeared in theProceedings of the 30th Annual Symposium on Foundations of Computer Science, Research Triangle Park, North Carolina, October 1989 and in theProceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, Crete, July 1990. 相似文献

20.

Competitive neural networks on message-passing parallel computers

Michele Ceccarelli Alfredo Petrosino Roberto Vaccaro 《Concurrency and Computation》1993,5(6):449-470

相似文献