Similar Documents
1.
An architecture that efficiently supports both message-passing and systolic communication in one system is presented. This architecture incorporates a variety of innovative features that unify computational power and communication flexibility in one VLSI component, the iWarp microprocessor. The message-based communication model is discussed, and an overview of the architecture is given. The two principal iWarp components, called the communication agent and the computation agent, and the register file they share are described. The efficiency of word-level communication is examined. The software development environment is also described.

2.
The inadequacies of conventional parallel languages for programming multicomputers are identified. The C* language is briefly reviewed, and a compiler that translates C* programs into C programs suitable for compilation and execution on a hypercube multicomputer is presented. Results illustrating the efficiency of executing data-parallel programs on a hypercube multicomputer are reported; they show the speedup achieved by three hand-compiled C* programs executing on an N-Cube 3200 multicomputer. The first two programs, Mandelbrot set calculation and matrix multiplication, have a high degree of parallelism and a simple control structure; for these, the C* compiler can generate relatively straightforward code with performance comparable to hand-written C code. Results for a C* program that performs Gaussian elimination with partial pivoting are also presented and discussed.

3.
The paper describes a new performance monitoring tool, called Tmon, that has been developed to help application programmers understand the run-time behavior of a parallel system and tune the performance of their programs. Tmon measures resource utilization and traces process activities transparently during execution. A global interrupt technique used by Tmon allows measurement tasks to be executed simultaneously on different processors and an accurate global clock to be maintained with minimal overhead. Experimental results indicate that both the accuracy and the overhead of the monitor are well within an acceptable range. We introduce a performance model from which several different parallel programming metrics can be derived. A weighted critical path analysis tool is also presented that focuses the user's attention on those parts of the program whose modification would most improve performance. An example in which the tool is used successfully to improve the performance of a parallel application is also presented. Tmon is currently implemented on top of the Trollius Operating System and runs on a 74-node transputer-based multicomputer.
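Weighted critical path analysis rests on finding the heaviest path through a graph of program activities. The Python sketch below is not Tmon's implementation: it assumes a hypothetical representation of the program as a DAG whose nodes carry measured execution times, and the names `critical_path`, `weights`, and `deps` are illustrative only. It returns the chain of activities whose shortening could shorten the whole run.

```python
from graphlib import TopologicalSorter

def critical_path(weights, deps):
    """Return (total_weight, path) of the heaviest path through a DAG.

    weights: dict mapping activity -> measured time (hypothetical units)
    deps:    dict mapping activity -> iterable of activities it waits on
    """
    order = TopologicalSorter({v: deps.get(v, ()) for v in weights}).static_order()
    best = {}  # activity -> weight of the heaviest path ending at it
    pred = {}  # activity -> predecessor on that heaviest path
    for v in order:  # predecessors are always visited first
        prev = max(deps.get(v, ()), key=lambda u: best[u], default=None)
        best[v] = weights[v] + (best[prev] if prev is not None else 0)
        pred[v] = prev
    node = max(best, key=best.get)  # end of the overall heaviest path
    path = []
    while node is not None:  # walk back along the recorded predecessors
        path.append(node)
        node = pred[node]
    path.reverse()
    return best[path[-1]], path
```

For example, with `read` (2) feeding both `fft` (5) and `filter` (3), which both feed `write` (1), the critical path is `read -> fft -> write` with weight 8; speeding up `filter` alone would not shorten the run.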

4.
Many CFD (computational fluid dynamics) and other scientific applications can be partitioned into subproblems. In general, however, the partitioned subproblems are very large: they demand high-performance computing power themselves, and their solutions have to be combined at each time step. In this paper, the cube-connected cube (CCCube) architecture is studied. The CCCube architecture is an extended hypercube structure in which each node is itself a cube. It requires fewer physical links between nodes than the hypercube, and it provides the same communication support as the hypercube for many applications. The links saved can be used to increase the bandwidth of the remaining links and, therefore, the overall performance. The concept of optimal CCCubes, which are CCCubes with the minimum number of links for a given total number of nodes, and a method for obtaining them are proposed. The superiority of optimal CCCubes over standard hypercubes is also shown in terms of link usage when embedding a binomial tree. A useful computation structure based on a semi-binomial tree for divide-and-conquer parallel algorithms is identified, and we show that this structure can be implemented in optimal CCCubes without performance degradation compared with regular hypercubes. The results presented in this paper should provide a useful approach to the design of scientific parallel computers.

5.
Talia D. IEEE Micro, 1993, 13(3): 62-72
The Tiny, CSN, Multiple Rings, Ordered Dimensions, and interval labeling routing systems for transputer networks are reviewed. The systems are compared with respect to several criteria, such as adaptivity, deadlock freedom, generality, livelock freedom, and network latency.
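The idea behind interval labeling can be seen on the simplest topology, a ring: each outgoing link of a node is labeled with a cyclic interval of destination addresses, and a message leaves on the link whose interval contains its destination. The sketch below is an illustration only, not taken from the reviewed article; the labeling and the names `next_hop` and `route` are hypothetical, and it makes no attempt at the deadlock freedom the article uses as a criterion.

```python
def next_hop(node, dest, n):
    """Interval-labeled routing on an n-node ring (hypothetical labeling).

    The clockwise link of `node` carries the cyclic interval
    [node+1, node+n//2]; every other destination uses the counter-clockwise link.
    """
    if dest == node:
        return None
    offset = (dest - node) % n  # clockwise distance to the destination
    return (node + 1) % n if offset <= n // 2 else (node - 1) % n

def route(src, dest, n):
    """Follow the interval labels hop by hop until `dest` is reached."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest, n))
    return path
```

On an 8-node ring, node 1 reaches node 5 clockwise (offset 4), but reaches node 6 counter-clockwise via nodes 0 and 7.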

6.
Implementation of Ada's parallel tasks on a multicomputer architecture requires additional communication and naming overhead because tasks can operate on shared data via global variables and pointers. This increases the complexity of implementing Ada and has a negative impact on program understandability.

7.
An efficient parallel algorithm is presented for convolution on a mesh-connected computer with wraparound. The algorithm does not require a broadcast feature for data values, as assumed by previously proposed algorithms; as a result, it is applicable to both SIMD and MIMD meshes. For an N×N image and an M×M template, the previous algorithms take O(M²q) time on an N×N mesh-connected multicomputer, where q is the number of bits in each entry of the convolution matrix. The new algorithms have complexity O(M²r), where r = max{number of bits in an image entry, number of bits in a template entry}. In addition to not requiring a broadcast capability, these algorithms are faster for binary images.
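The data movement that a wraparound mesh makes cheap is the cyclic shift: every processor passes its value to a neighbor in the same step. The Python sketch below simulates how an N×N array of processors can accumulate a cyclic convolution using only such shifts. It is not the paper's algorithm: it assumes the correlation-style definition out[i][j] = Σ w[a][b]·img[(i+a) mod N][(j+b) mod N], and every simulated processor simply reads `template[a][b]`, so the broadcast-free delivery of template values that the paper addresses is not modeled. All names are hypothetical.

```python
def shift_left(grid):
    """Each PE receives the value of its east neighbor (with wraparound)."""
    return [row[1:] + row[:1] for row in grid]

def shift_up(grid):
    """Each PE receives the value of its south neighbor (with wraparound)."""
    return grid[1:] + grid[:1]

def mesh_convolve(image, template):
    """Cyclic convolution of an N x N image with an M x M template via shifts."""
    n, m = len(image), len(template)
    acc = [[0] * n for _ in range(n)]  # one accumulator per simulated PE
    rows = image
    for a in range(m):
        cur = rows  # cur[i][j] == image[(i + a) % n][(j + b) % n]
        for b in range(m):
            w = template[a][b]
            for i in range(n):
                for j in range(n):
                    acc[i][j] += w * cur[i][j]  # local multiply-accumulate
            cur = shift_left(cur)
        rows = shift_up(rows)
    return acc
```

The simulation performs M² multiply-accumulate steps and O(M²) neighbor shifts, matching the O(M²·cost-per-step) shape of the complexities quoted above.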

8.
In this paper, a network-partitioning approach for one-to-all broadcasting on wormhole-routed networks is proposed. To broadcast a message, the scheme works in three phases. First, a number of data-distributing networks (DDNs), which can work independently, are constructed, and the message is evenly divided into submessages, each sent to a representative node in one DDN. Second, the submessages are broadcast on the DDNs concurrently. Finally, a number of data-collecting networks (DCNs), which can also work independently, are constructed, and on each DCN the submessages are concurrently collected and combined into the original message. Our approach, designed specifically for wormhole-routed networks, is conceptually similar to, but fundamentally very different from, the traditional approach of using multiple edge-disjoint spanning trees in parallel for broadcasting in store-and-forward networks. One interesting issue is the definition of independent DDNs and DCNs in the sense of wormhole routing. We show how to apply this approach to tori, meshes, and hypercubes. Thorough analyses and comparisons based on different system parameters and configurations are conducted. The results confirm the advantage of our scheme over other existing broadcasting algorithms under various system parameters and conditions.

9.
A new approach for dynamic job scheduling in mesh-connected multiprocessor systems that supports a multiuser environment is proposed in this paper. Our approach combines a submesh reservation policy with a priority-based scheduling policy to obtain high performance in terms of high throughput, high utilization, and low turnaround times for jobs. This high performance is achieved at the expense of strictly fair, FCFS scheduling; in fact, the algorithm is parameterized to allow trade-offs between performance and (short-term) POPS fairness. The proposed scheduler can be used with any submesh allocation policy. A fast and efficient implementation of the proposed scheduler is also presented. The performance of the proposed scheme is compared with that of the FCFS policy, the only existing scheduling strategy for meshes, to demonstrate the effectiveness of the proposed approach. Simulation results indicate that our scheduling strategy significantly outperforms the FCFS policy; specifically, it significantly reduces the average waiting delay of jobs. The fast implementation of the proposed scheduler results in low allocation and deallocation time overhead, as well as low space overhead.
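One common way to expose a performance versus short-term fairness knob, shown here only as an illustration and not as the authors' reservation policy, is to bound how many times a waiting job may be overtaken by later, smaller jobs. The Python sketch below treats the mesh as a plain pool of processors (submesh geometry is ignored), and the names `pick_next`, `bypassed`, and `max_bypass` are hypothetical.

```python
def pick_next(queue, free, bypassed, max_bypass):
    """Return the index of the next job to start, or None to wait.

    queue:      processor demands in arrival (FCFS) order
    free:       number of currently free processors
    bypassed:   list parallel to `queue`: how often each job has been overtaken
    max_bypass: fairness knob; 0 gives strict FCFS
    """
    if not queue:
        return None
    if queue[0] <= free:
        return 0  # the head of the queue fits: plain FCFS
    for k in range(1, len(queue)):
        if bypassed[k - 1] >= max_bypass:
            return None  # an earlier job hit its limit: hold capacity for it
        if queue[k] <= free:
            for j in range(k):
                bypassed[j] += 1  # every job ahead of k is overtaken once more
            return k
    return None
```

With 4 free processors and a queue `[8, 2]`, `max_bypass=0` makes the small job wait behind the large one (strict FCFS), while `max_bypass=1` lets it start once; the large job then cannot be overtaken again. Larger limits favor utilization, smaller ones favor fairness.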

10.
To successfully exploit the benefits of optical technology in a tightly coupled multicomputer, the architectural design must reflect both the advantages and limitations of optics. This article describes a class of such architectures, based upon inverted-graph topologies. We consider the physical construction of these systems, demonstrating the relevant technological components necessary to manufacture a working system. We then consider sample inverted-hypercube and inverted-mesh topologies, illustrating their properties, including processor labeling, topological embeddings, and message-routing algorithms.

11.
Reliable Communication on Cube-Based Multicomputers
We consider a distributed unicasting algorithm for hypercubes with faulty nodes (including disconnected hypercubes) using the safety level concept. The safety level of each node in an n-dimensional hypercube is an approximate measure of the number and distribution of faulty nodes in its neighborhood, and it can be easily calculated through n-1 rounds of information exchange among neighboring nodes. Optimal unicasting between two nodes is guaranteed if the safety level of the source node is no less than the Hamming distance between the source and the destination. The feasibility of optimal or suboptimal unicasting can be easily determined at the source node by comparing its safety level, together with its neighbors' safety levels, with the Hamming distance between the source and the destination. The proposed scheme is also the first attempt to address the unicasting problem in disconnected hypercubes. The safety level concept is also extended to hypercubes with both faulty nodes and links, and to generalized hypercubes.
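The abstract does not spell out how a safety level is computed. The Python sketch below assumes the commonly cited definition: a faulty node has level 0; a nonfaulty node sorts its neighbors' levels as S_0 ≤ … ≤ S_{n-1} and takes the smallest k with S_k < k, or n if there is none. Each pass of the loop models one round of neighbor exchange; the sketch iterates to a fixed point rather than hard-coding the n-1 rounds stated above. The function names are hypothetical.

```python
def safety_levels(n, faulty):
    """Safety level of every node of an n-dimensional hypercube (assumed definition)."""
    size = 1 << n
    level = [0 if v in faulty else n for v in range(size)]
    changed = True
    while changed:  # one pass = one synchronous round of neighbor exchange
        changed = False
        new = level[:]
        for v in range(size):
            if v in faulty:
                continue
            seq = sorted(level[v ^ (1 << d)] for d in range(n))  # neighbors' levels
            new[v] = next((k for k, s in enumerate(seq) if s < k), n)
            changed |= new[v] != level[v]
        level = new
    return level

def optimal_unicast_guaranteed(level, src, dst):
    """Sufficient condition from the abstract: level(src) >= Hamming distance."""
    return level[src] >= bin(src ^ dst).count("1")
```

In a 3-cube with faulty nodes 011 and 101, node 001 has two faulty neighbors and ends with level 1, while node 000 keeps level 3 and can therefore reach node 111 along a shortest path. The condition is only sufficient: node 001 still has a shortest path to 110 through 000, even though its level is below the distance of 3.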

12.
This paper presents an analysis and evaluation of a multicomputer system called SM3, which has a number of special architectural features, viz.: (1) a number of independent computer systems interconnected by a physically partitionable bus, which allows the system to be dynamically reconfigured into a number of clusters (or subnetworks) of adjacent computers to achieve a high degree of parallel MIMD processing; (2) a set of dual switchable main memory modules used to facilitate fast data/command/message transfers among the computers; and (3) a number of special-purpose control lines for efficient interprocessor control, communication, and synchronization. Three representative database algorithms are described and modeled analytically by a set of timing equations, and the results of a performance evaluation study are presented. The performance of the SM3 architecture is compared with that of several existing architectures. It is found that SM3 has excellent overall performance for various categories of database operations and that its architecture is flexible enough to accommodate different software algorithms.

13.
The Paradigm compiler for distributed-memory multicomputers
To harness the computational power of massively parallel distributed-memory multicomputers, users must write efficient software. This process is laborious because of the absence of a global address space: the programmer must manually distribute computations and data across processors and explicitly manage communication. The Paradigm (PARAllelizing compiler for DIstributed-memory, General-purpose Multicomputers) project at the University of Illinois addresses this problem by developing automatic methods for the efficient parallelization of sequential programs. A unified approach efficiently supports regular and irregular computations using data and functional parallelism.

14.
Efficient algorithms to compute the Hough transform on MIMD and SIMD hypercube multicomputers are developed. Our algorithms can compute p angles of the Hough transform of an N × N image, p ≤ N, in O(p + log N) time on both MIMD and SIMD hypercubes. These algorithms require O(N²) processors. We also consider the computation of the Hough transform on MIMD hypercubes with a fixed number of processors. Experimental results on an NCUBE/7 hypercube are presented. This research was supported by the National Science Foundation under grants DCR84-20935 and 86-17374. All correspondence should be mailed to Sanjay Ranka.
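To make clear what "p angles of the Hough transform" computes, here is a sequential reference in Python, useful for checking the output of a parallel version. It is not the hypercube algorithm and runs in O(pN²) time. It assumes the usual line parameterization ρ = x·cos θ + y·sin θ, evenly spaced angles θ_k = kπ/p, and ρ rounded to the nearest integer; the function name is hypothetical.

```python
import math
from collections import Counter

def hough(image, p):
    """Sequential Hough transform of a binary N x N image for p angles.

    Returns a list of p Counters; entry k maps a quantized distance rho to the
    number of set pixels (x, y) with round(x*cos(theta_k) + y*sin(theta_k)) == rho.
    """
    result = []
    for k in range(p):
        theta = k * math.pi / p  # assumed even angle spacing
        c, s = math.cos(theta), math.sin(theta)
        acc = Counter()
        for y, row in enumerate(image):  # row index is y, column index is x
            for x, pixel in enumerate(row):
                if pixel:
                    acc[round(x * c + y * s)] += 1
        result.append(acc)
    return result
```

For a 4 × 4 image containing a vertical line at x = 2, the θ = 0 accumulator has a single peak of 4 at ρ = 2, while the θ = π/2 accumulator spreads the pixels over ρ = 0..3.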

15.
Parallel Computing, 1988, 7(2): 227-247
Data-flow multiprocessors have been shown to be a very efficient solution to the problem of multiprocessor schedulability. Recent research has demonstrated the critical importance of properly allocating program partitions to processing elements. In this paper, we first describe several heuristic algorithms that have been used for program allocation. We then describe the layered approach to the allocation problem (syntax-directed and graph partitioning). A parallel approach to simulated annealing is used to perform allocation at the data-flow graph level. We also show how the results apply to the allocation problem in the Hughes Data-Flow Machine. Finally, simulation results indicate the validity of the solutions.
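Allocation at the data-flow graph level by simulated annealing can be illustrated with a small sequential sketch. The abstract uses a parallel variant and does not state its cost function; the Python code below assumes a hypothetical objective (inter-PE traffic plus the load of the busiest processing element), a single-node move, and a geometric cooling schedule. All names and parameter values are illustrative.

```python
import math
import random

def anneal_allocation(costs, edges, num_pes, steps=5000, t0=10.0, alpha=0.999, seed=0):
    """Simulated-annealing sketch for mapping data-flow graph nodes to PEs.

    costs: per-node execution costs
    edges: (u, v, traffic) tuples between nodes
    Returns (best_assignment, best_cost) under the assumed objective.
    """
    rng = random.Random(seed)

    def objective(assign):
        load = [0] * num_pes
        for node, pe in enumerate(assign):
            load[pe] += costs[node]
        cut = sum(w for u, v, w in edges if assign[u] != assign[v])
        return cut + max(load)  # communication plus busiest-PE load

    current = [rng.randrange(num_pes) for _ in costs]
    cur_cost = objective(current)
    best, best_cost = current[:], cur_cost
    temp = t0
    for _ in range(steps):
        node = rng.randrange(len(costs))
        old_pe = current[node]
        current[node] = rng.randrange(num_pes)  # propose moving one node
        new_cost = objective(current)
        delta = new_cost - cur_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost  # accept (always if not worse)
            if cur_cost < best_cost:
                best, best_cost = current[:], cur_cost
        else:
            current[node] = old_pe  # reject: undo the move
        temp *= alpha  # geometric cooling
    return best, best_cost
```

Keeping the best assignment seen, rather than the final one, guards against the search drifting uphill late in the schedule. For a four-node chain of equal-cost nodes on two PEs, the best split cuts only the middle edge.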

16.
A visualization programming environment for multicomputers
The programming and run-time environment used for the authors' multicomputer visualization software is described. This particular approach to using multicomputers for scientific visualization provides a uniform interface to system and communication facilities and promotes modularity and code reuse. No breakthrough technology is involved; rather, a collection of methods developed by others has been optimized for visual computing applications and unified into a system that is simple to use and easy to port to new hardware. The system is written in C. Initial experience with the system has been good.

17.
Hierarchically organized ensembles of shared-memory multiprocessors possess a richer and more complex model of locality than previous-generation multicomputers with single-processor nodes. These dual-tier computers introduce many new factors into the programmer's performance model. We present a methodology for implementing block-structured numerical applications on dual-tier computers and a run-time infrastructure, called KeLP2, that implements the methodology. KeLP2 supports two levels of locality and parallelism via hierarchical SPMD control flow, run-time geometric meta-data, and asynchronous collective communication. KeLP applications can effectively overlap communication with computation under conditions where nonblocking point-to-point message passing fails to do so. KeLP's abstractions hide considerable detail without sacrificing performance, and dual-tier applications written in KeLP consistently outperform equivalent single-tier implementations written in MPI. We describe the KeLP2 model and show how it facilitates the implementation of five block-structured applications specially formulated to hide communication latency on dual-tier architectures. We support our arguments with empirical data from applications running on various single- and dual-tier multicomputers. KeLP2 supports a migration path from single-tier to dual-tier platforms, and we illustrate this capability with a detailed programming example.

18.
A new methodology named CALMANT (CC-cube Algorithms on Meshes and Tori) is proposed for mapping a class of algorithms that we call CC-cube algorithms onto multicomputers with hypercube, mesh, or torus interconnection topologies. This methodology is suitable when the initial problem can be expressed as a set of processes that communicate through a hypercube topology (a CC-cube algorithm). Many important algorithms fit the CC-cube type. CALMANT is based on three techniques: (a) the standard embedding, to assign the processes of the algorithm to the nodes of the mesh multicomputer; (b) communication pipelining, to increase the level of communication parallelism inherent in CC-cube algorithms; and (c) optimal message-scheduling algorithms, proposed in this work to avoid conflicts and thereby minimize communication time. Although CALMANT is proposed for multicomputers with different interconnection network topologies, the paper focuses only on the particular case of meshes.
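The abstract names the standard embedding without defining it. A common form, assumed here, places a 2^d-node hypercube on a 2^r × 2^c mesh (r + c = d) by reading the low-order c bits of a node label as the column and the remaining high-order bits as the row. The Python sketch below (hypothetical function names) shows why mesh communication needs careful scheduling: a hypercube link along dimension i becomes a mesh path of length 2^i for column bits, or 2^(i-c) for row bits, so long hypercube links share mesh links.

```python
def standard_embedding(node, col_bits):
    """Map a hypercube label to (row, col): low col_bits bits -> column (assumed)."""
    return node >> col_bits, node & ((1 << col_bits) - 1)

def mesh_distance(a, b, col_bits):
    """Manhattan distance in the mesh between the images of hypercube nodes a and b."""
    (ra, ca), (rb, cb) = standard_embedding(a, col_bits), standard_embedding(b, col_bits)
    return abs(ra - rb) + abs(ca - cb)

# Dilation of each hypercube dimension for a 16-node cube on a 4 x 4 mesh.
dilation = [mesh_distance(0, 1 << i, col_bits=2) for i in range(4)]
```

Here `dilation` lists, for dimensions 0 to 3, the mesh distance between two hypercube neighbors along that dimension.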

19.
We propose a fault-tolerant tree-based multicast algorithm for 2-dimensional (2-D) meshes based on the concept of the extended safety level, which is a vector associated with each node to capture fault information in its neighborhood. In this approach, each destination is reached through a minimum number of hops. In order to minimize the total number of traffic steps, three heuristic strategies are proposed. This approach can be easily implemented by pipelined circuit switching (PCS). A simulation study is conducted to measure the total number of traffic steps under different strategies. Our approach is the first attempt to address the fault-tolerant tree-based multicast problem in 2-D meshes based on limited global information with a simple model and succinct information.

20.
In a mesh multicomputer, executing jobs requires allocating submeshes according to some processor allocation scheme. In order to assign incoming jobs to a free submesh, a task compaction scheme is needed to generate a larger contiguous free region. The overhead of compaction depends on the efficiency of the task migration scheme. In this paper, two simple task migration schemes are first proposed for n-dimensional mesh multicomputers that support dimension-ordered wormhole routing under the one-port communication model. Then, a hybrid scheme that combines the advantages of the two schemes is discussed. Finally, we evaluate the performance of all of these approaches.
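Dimension-ordered routing, which the migration schemes above assume, fixes the order in which a message corrects its coordinates. The Python sketch below computes the hop-by-hop path in an n-dimensional mesh, assuming dimensions are corrected in increasing order (dimension 0 first); it does not reproduce the paper's migration schemes, and the function name is hypothetical.

```python
def dimension_ordered_route(src, dst):
    """Hop-by-hop path from src to dst in an n-D mesh, dimension 0 first.

    src, dst: equal-length tuples of coordinates.
    """
    path = [tuple(src)]
    cur = list(src)
    for d in range(len(src)):  # fully correct one dimension before the next
        step = 1 if dst[d] > cur[d] else -1
        while cur[d] != dst[d]:
            cur[d] += step
            path.append(tuple(cur))
    return path
```

Because every message turns in the same fixed dimension order, the routes are deterministic and deadlock-free on a mesh, which is what lets a migration scheme predict which links concurrent task moves will contend for.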
