首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
One of the essential problems in parallel computing is: Can SIMD machines handle asynchronous problems? This is a difficult, unsolved problem because of the mismatch between asynchronous problems and SIMD architectures. We propose a solution to let SIMD machines handle general asynchronous problems. Our approach is to implement a runtime support system which can run MIMD-like software on SIMD hardware. The runtime support system, named P kernel, is thread-based. There are two major advantages of the thread-based model. First, for application problems with irregular and/or unpredictable features, automatic scheduling can move some threads from overloaded processors to underloaded processors. Second, and more importantly, the granularity of threads can be controlled to reduce system overhead. The P kernel is also able to handle bookkeeping and message management, as well as to make these low-level tasks transparent to users. Substantial performance has been obtained on Maspar MP-1  相似文献   

2.
Serial search algorithms often exhibit exponential run times and may require an exponential amount of storage as well. Thus, the design of parallel search algorithms with limited memory is of obvious interest. This paper presents an efficient SIMD parallel algorithm, called IDPS (for iterative-deepening parallel search). At a broad level IDPS is a parallel version of IDA*. While generically we have called our algorithm an IDPS, performance of four variants of it has been studied through experiments conducted on the well-known test-bed problem for search algorithms, namely the Fifteen Puzzle. During the experiments, data were gathered under two different static load balancing schemes. Under the first scheme, an unnormalized average efficiency of approximately 3/4 was obtained for 4K, 8K, and 16K processors. Under the second scheme, unnormalized average efficiencies of 0.92 and 0.76, and normalized average efficiencies of 0.70 and 0.63 were obtained for 8K and 16K processors, respectively. We show (as shown previously only for MIMD machines) that for admissible search, high average speedup can be obtained for problems of significant size. We believe that this research will enhance AI problem solving using parallel heuristic search algorithms.  相似文献   

3.
This article presents PFCM, a parallel algorithm for fuzzy clustering of large data sets. Being a generalization of FCM, the algorithm enables arbitrary numbers of data points, features and clusters to be handled cost-optimally by hypercube SIMD computers of arbitrary cube dimension, the only limitation being the size of the local memories of the processors. Speedup responds optimally to enlarging the hypercube. PFCM owes its flexibility to the technique employed in its derivation from the sequential fuzzy C-means algorithm FCM: the association of each of the three dimensions of the problem (numbers of data points, features and clusters) with a distinct subset of hypercube dimensions.  相似文献   

4.
This paper describes changes made to a previous implementation of an N-body tree code developed for a fine-grained, SIMD computer architecture. These changes include (1) switching from a balanced binary tree to a balanced oct tree, (2) addition of quadrupole corrections, and (3) having the particles search the tree in groups rather than individually. An algorithm for limiting errors is also discussed. In aggregate, these changes have led to a performance increase of over a factor of 10 compared to the previous code. For problems several times larger than the processor array, the code now achieves performance levels of ∼ 1 Gflop on the Maspar MP-2 or roughly 20% of the quoted peak performance of this machine. This percentage is competitive with other parallel implementations of tree codes on MIMD architectures. This is significant, considering the low relative cost of SIMD architectures.  相似文献   

5.
An algorithm for the rasterization of trapezoids on the Connection MachineTM (CM) is described. The input consists of an array of trapezoids, with two horizontal sides, arranged with one trapezoid per processor. (Unless otherwise indicated, “processor” should be taken to mean virtual processor.) Each trapezoid is converted to an edge record and the edge records are then distributed to enough processors so that each processor is responsible for one scanline of one trapezoid. Each processor computes a scan record for its scanline, and the scan records are then distributed to enough processors so that one processor is responsible for a single pixel. Final interpolation of position, and possibly shading information, is performed in parallel for all pixels thus created, and the pixels are then broadcast into a frame buffer array, with depth comparisons being performed at the receiving end to ensure that the nearest pixel appears in the array. The Connection Machine-specific features used by the algorithm are logarithmic time cumulative summing, general communication with comparisons for collision arbitration, and virtual processor sets. Performance is similar to that of good graphics workstations. The intended application is to display data already resident in the machine as the result of some previous computation, when a high-performance graphics workstation is not available.  相似文献   

6.
7.
Functional parallelism can be supported on SIMD machines by interpretation. Under such a scheme, the programs and data of each task are loaded on the processing elements (PEs) and the Control Unit of the machine executes a central control algorithm that causes the concurrent interpretation of the tasks on the PEs. The central control algorithm is, in many respects, analogous to the control store program on microprogrammed machines. Accordingly, the organization of the control algorithm greatly influences the performance of the synthesized MIMD environment. Most central control algorithms are constructed to interpret the execution phase of all instructions during every cycle (iteration). However, it is possible to delay the interpretation of infrequent and costly instructions to improve the overall performance. Interpreters that attempt improved performance by delaying the issue of infrequent instructions are referred to as variable issue control algorithms. This paper examines the construction of optimized variable issue control algorithms. In particular, a mathematical model for the interpretation process is built and two objective functions (instruction throughput and PE utilization) are defined. The problem of deriving variable issue control algorithms for these objective functions has been shown elsewhere to be NP-complete. Therefore, this paper investigates three heuristic algorithms for constructing near optimal variable issue control algorithms. The performance of the algorithms is studied on four different instruction sets and the trends of the schedulers with respect to the instruction sets and the objective functions are analyzed  相似文献   

8.
Stphane 《Pattern recognition》1995,28(12):1993-2000
We propose a parallel thinning algorithm for binary pictures. Given an N × N binary image including an object, our algorithm computes in O(N2) the skeleton of the object, using a pyramidal decomposition of the picture. The behavior of this algorithm is studied considering a family of digitalization of the same object at a different level of resolution. With the Exclusive Read Exclusive Write (EREW) Parallel Random Access Machine (PRAM), our algorithm runs in O(log N) time using O(N2/logN) processors and it is work-optimal. The same result is obtained with high-connectivity distributed memory SIMD machines having strong hypercube and pyramid. We describe the basic operator, the pyramidal algorithm and some experimental results on the SIMD MasPar parallel machine.  相似文献   

9.
We propose an algorithm for the cost-effective evaluation of structured sparse matrix-vector multiplications on massively parallel SIMD computers with distributed memory. Under the assumption that a large percentage of the nonzero entries of the matrix is concentrated on almost full diagonals, a specific data transport problem arises, for which we derive an equivalent graph theoretical formulation. We find an optimal solution of this communication problem in terms of a minimum-cost spanning tree.

We demonstrate the efficiency of our matrix-vector multiplication on a MP-1 with 16,384 processors.  相似文献   


10.
In a previous work we studied the concurrent implementation of a numerical model, CONDIFP, developed for the analysis of depth-averaged convection–diffusion problems. Initial experiments were conducted on the Intel Touchstone Delta System, using up to 512 processors and different problem sizes. As for other computation-intensive applications, the results demonstrated an asymptotic trend to unity efficiency when the computational load dominates the communication load. This paper relates some other numerical experiences, in both one and two space dimensions with various choices of initial and boundary conditions, carried out on the Intel Paragon XP/S Model L38 with the aim to illustrate the parallel solver versatility and reliability.  相似文献   

11.
Parallel algorithms are presented for the Fourier pseudospectral method and parts of the Chebyshev pseudospectral method. Performance of these schemes is reported as implemented on the NCUBE hypercube. The problem to which these methods are applied is the time integration of the incompressible Navier-Stokes equations. Despite serious communication requirements, the efficiencies are high; e.g., 92% for a 1283 mesh on 1024 processors. Benchmark timings rival those of optimized codes on supercomputers.  相似文献   

12.
We present parallel algorithms for constructing and maintaining balancedm-way search trees. These parallel algorithms have time complexity O(1) for ann processors configuration. The formal correctness of the algorithms is given in detail.  相似文献   

13.
Various proposals for networks of large numbers of processors are reviewed. Bottleneck problems arise in these networks with the flow of data between processors. Communication problems which can arise in practical situations are discussed and techniques for reducing bottlenecks are developed. Some simulation results are given for the binary n-cube.  相似文献   

14.
Assembly of large genomes from tens of millions of short genomic fragments is computationally demanding requiring hundreds of gigabytes of memory and tens of thousands of CPU hours. The advent of high throughput sequencing technologies, new gene-enrichment sequencing strategies, and collective sequencing of environmental samples further exacerbate this situation. In this paper, we present the first massively parallel genome assembly framework. The unique features of our approach include space-efficient and on-demand algorithms that consume only linear space, and strategies to reduce the number of expensive pairwise sequence alignments while maintaining assembly quality. Developed as part of the ongoing efforts in maize genome sequencing, we applied our assembly framework to genomic data containing a mixture of gene enriched and random shotgun sequences. We report the partitioning of more than 1.6 million fragments of over 1.25 billion nucleotides total size into genomic islands in under 2 h on 1024 processors of an IBM BlueGene/L supercomputer. We also demonstrate the effectiveness of the proposed approach for traditional whole genome shotgun sequencing and assembly of environmental sequences.  相似文献   

15.
This paper discusses the implementation of methods for the approximate calculation of multiple integrals on a SIMD parallel computer. Adaptive methods using polynomial integrating basic rules and Monte Carlo and number theoretic basic rules are considered with particular reference to implementation on the ICL DAP computer. Test results are given which compare serial and parallel versions of the same method.  相似文献   

16.
Three commercial systems are considered from a programmer's point of view. The three are the Intel iPSC, a network of Inmos transputers, and the Sequent Balance. The differences in overhead are examined by implementing a solution to the traveling-salesman problem on all three. The evaluation focuses on three major issues in parallel programming: (1) how execution is divided among processing elements and how it is controlled; (2) how data are shared; and (3) how events are synchronized. The experiences of the authors are presented and some specific as well as general conclusions are drawn  相似文献   

17.
The alpha-beta algorithm for searching decision trees is adapted to allow parallel activity in different parts of a tree during the search. The algorithm has been implemented in a procedural simulation language (GASP IV). The simulation environment provides the illusion of multiple software processes and multiple hardware processors. A number of preliminary experiments have been done to gather statistics on run time, nodes scored, and nodes visited. Results indicate that a substantial reduction in time of search occurs because of the use of parallelism. An analytic expression for the storage requirements of the algorithm is derived. The analysis provides an example of the classical tradeoff between time and storage.  相似文献   

18.
The authors present the scalability analysis of a parallel fast Fourier transform (FFT) algorithm on mesh and hypercube connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. It is shown that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh despite the fact that large scale meshes are cheaper to construct than large hypercubes. Although the scope of this work is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures such as SIMD hypercube and mesh architectures and shared memory architecture  相似文献   

19.
The problem of representing a setU≜{u 1,...,u m} of read-write variables on ann-node distributed-memory parallel computer is considered. It is shown thatU can be represented among then nodes of a variant of the mesh of trees usingO((m/n) polylog(m/n)) storage per node such that anyn-tuple of variables may be accessed inO(logn (log logn)2) time in the worst case form polynomial inn. This work was supported in part by the Joint Services Electronics Program under Contract F49620-87-C-0044 and by IBM under Agreement 12060043. Earlier versions of these results appeared in theProceedings of the 30th Annual Symposium on Foundations of Computer Science, Research Triangle Park, North Carolina, October 1989 and in theProceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, Crete, July 1990.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号