Similar literature
1.
Computer communication systems must undergo significant changes to keep pace with the increasingly demanding and diverse multimedia applications that will run on the next generation of high-performance networks. To facilitate these changes, we are developing A Dynamically Assembled Protocol Transformation, Integration, and eValuation Environment (ADAPTIVE). ADAPTIVE provides an integrated environment for developing and experimenting with flexible transport system architectures that support lightweight and adaptive communication protocols for diverse multimedia applications running on high-performance networks. Our approach employs a collection of reusable ‘building-block’ protocol mechanisms that may be composed together automatically based upon functional specifications. The resulting protocols execute in parallel on several target platforms including shared-memory and message-passing multiprocessors. ADAPTIVE provides a framework for (1) determining the functionality of customized lightweight protocol configurations that efficiently support multimedia applications and (2) mapping this functionality onto efficient parallel process architectures.

2.
The efficiency of the basic operations of a NUMA (nonuniform memory access) multiprocessor determines the parallel processing performance achievable on that machine. The authors present several analytical models for predicting and evaluating the overhead of interprocessor communication, process scheduling, process synchronization, and remote memory access, taking network contention and memory contention into account. Performance measurements supporting the models and analyses were made, through several numerical examples, on the BBN GP1000, a NUMA shared-memory multiprocessor. The analytical and experimental results give a comprehensive understanding of the various effects that are important for the effective use of a NUMA shared-memory multiprocessor. The results presented can be used to determine optimal strategies in developing an efficient programming environment for a NUMA system.

3.
The paper presents several parallel DSP (digital signal processing) algorithms and their performance analysis, targeting TurboNet, a hybrid message-passing and shared-memory architecture that has been built at the New Jersey Institute of Technology. The current version of our system contains eight powerful TMS320C40 processors. The algorithms are implemented on our system using message passing only, shared memory only, and, where possible, a combination of both of these parallel processing paradigms. Comparisons show that TurboNet's robust, hybrid architecture results in significant performance gains because of the flexibility it introduces.

4.
Portability, efficiency, and ease of coding are all important considerations in choosing the programming model for a scalable parallel application. The message-passing programming model is widely used because of its portability, yet some applications are too complex to code in it while also trying to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. This paper describes an approach, called Global Arrays (GAs), that combines the better features of both other models, leading to both simple coding and efficient execution. The key concept of GAs is that they provide a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. We have implemented the GA library on a variety of computer systems, including the Intel Delta and Paragon, the IBM SP-1 and SP-2 (all message passers), the Kendall Square Research KSR-1/2 and the Convex SPP-1200 (nonuniform access shared-memory machines), the CRAY T3D (a globally addressable distributed-memory computer), and networks of UNIX workstations. We discuss the design and implementation of these libraries, report their performance, illustrate the use of GAs in the context of computational chemistry applications, and describe the use of a GA performance visualization tool. (An earlier version of this paper was presented at Supercomputing'94.)

5.
Even though there has been active research on distributed virtual shared-memory (DVSM) systems, their architectures have not been widely used in current high-performance computing markets. The reason is that previously introduced DVSM systems use conventional interconnection technologies such as Ethernet, which incur high execution overhead because processes are interrupted during the data communication needed for memory consistency. In this paper, we present a DVSM architecture based on a next-generation interconnection technology, the InfiniBand Architecture (IBA). Because the IBA supports shared-memory programming semantics by means of remote direct-memory access (RDMA) and atomic operations in hardware, we can minimize the communication overhead for memory consistency on the DVSM system. To characterize multithreaded applications on our IBA-based DVSM system, we examined two different shared-memory programming models using SPMD and OpenMP benchmarks. We show that our DVSM system, which uses the full features of the IBA, improves performance significantly over the IPoIB-based DVSM system in all benchmarks, and is comparable to the bus-based shared-memory multiprocessor system in some benchmarks.

6.
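Comparison of shared memory and message passing on PC clusters   Total citations: 7; self-citations: 0; citations by others: 7
Shared memory and message passing are currently the two mainstream parallel programming models. It is generally held that message passing is less programmer-friendly than shared memory. OpenMP is the de facto industry standard for shared-memory programming. Cluster OpenMP systems provide an OpenMP programming environment on clusters and are easy to program and scalable, but their performance has long been a focus of attention. Taking the cluster OpenMP system OpenMP/JIAJIA and a typical message-passing system …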

7.
Compiling programs for distributed-memory multiprocessors   Total citations: 1; self-citations: 0; citations by others: 1
We describe a new approach to programming distributed-memory computers. Rather than having each node in the system explicitly programmed, we derive an efficient message-passing program from a sequential shared-memory program annotated with directions on how elements of shared arrays are distributed to processors. This article describes one possible input language for describing distributions and then details the compilation process and the optimization necessary to generate an efficient program. Research supported by Intel.

8.
Lightweight threads have an important role to play in parallel systems: they can be used to exploit shared-memory parallelism, to mask communication and I/O latencies, to implement remote memory access, and to support task-parallel and irregular applications. In this paper, we address the question of how to integrate threads and communication in high-performance distributed-memory systems. We propose an approach based on global pointer and remote service request mechanisms, and explain how these mechanisms support dynamic communication structures, asynchronous messaging, dynamic thread creation and destruction, and a global memory model via interprocessor references. We also explain how these mechanisms can be implemented in various environments. Our global pointer and remote service request mechanisms have been incorporated in a runtime system called Nexus that is used as a compiler target for parallel languages and as a substrate for higher-level communication libraries. We report the results of performance studies conducted using a Nexus implementation; these results indicate that Nexus mechanisms can be implemented efficiently on commodity hardware and software systems.

9.
Large-scale distributed shared-memory multiprocessors (DSMs) provide a shared address space by physically distributing the memory among different processors. A fundamental DSM communication problem that significantly affects scalability is an increase in remote memory latency as the number of system nodes increases. Remote memory latency, caused by accessing a memory location in a processor other than the one originating the request, includes both communication latency and remote memory access latency over I/O and memory buses. The proposed architecture reduces remote memory access latency by increasing connectivity and maximizing channel availability for remote communication. It also provides efficient and fast unicast, multicast, and broadcast capabilities, using a combination of aggressively designed multiplexing techniques. Simulations show that this architecture provides excellent interconnect support for a highly scalable, high-bandwidth, low-latency network.

10.
Future chip-multiprocessors (CMPs) will integrate many cores interconnected with a high-bandwidth, low-latency, scalable network-on-chip (NoC). However, the potential that this approach offers at the transport level needs to be paired with an analogous paradigm shift at the higher levels. In particular, the standard shared-memory programming model fails to address the scalability requirements of the many-core era. Fast data exchange among the cores and low-latency synchronization are desirable but hard to achieve in practice because of the memory hierarchy. The message-passing paradigm instead permits direct data communication and synchronization between the cores. Shared memory could still be used for instruction fetch. Hence, we propose a hybrid approach that combines shared memory and message passing in a single general-purpose CMP architecture that allows efficient execution of applications developed with both parallel programming approaches. Cores fetch instructions from a hierarchical memory and exchange their data either through the same memory, for compatibility with existing software, or directly through the fast NoC. We developed a fast SystemC-based cycle-accurate simulator for design space exploration, which we used to evaluate performance with real benchmarks. The various components have been RTL coded and mapped to a CMOS 45 nm technology to build a silicon area model that we used to select the best architectural configurations.

11.
Mining association rules from large databases is very costly. We propose to develop parallel algorithms for this task on a shared-memory multiprocessor (SMP). All parallel algorithms proposed for other paradigms follow the conventional level-wise approach: they need as many iterations as the length of the maximum large itemset. To make matters worse, they impose a synchronization in every iteration, which would cause serious I/O contention on a shared-memory parallel system. An adaptive asynchronous parallel mining algorithm, APM, has been proposed for SMPs. All processors generate candidates dynamically and count itemset supports independently without synchronization. Two optimization techniques have been proposed to reduce database scanning and the number of candidates. APM has been implemented on a Sun Enterprise 4000 shared-memory multiprocessor with 12 nodes. The experiments show that the optimizations are very effective and that APM has a substantial lead in performance over other proposed level-wise algorithms.

12.
This paper presents a schematic algorithm for distributed systems. This schematic algorithm uses a black-box procedure for communication, the output of which must meet two requirements: a global-order requirement and a deadlock-free requirement. This algorithm is valid in any distributed system model that can provide a communication procedure that complies with these requirements. Two such models exist in an asynchronous fail-stop environment: one in the shared-memory model and one in the message-passing model. The implementation of the black-box procedure in these models enables us to translate existing algorithms between the two models whenever these algorithms are based on the schematic algorithm. We demonstrate this idea in two ways. First, we present a randomized algorithm for the consensus problem in the message-passing model based on the algorithm of Aspnes and Herlihy [AH] in the shared-memory model. This solution is the fastest known randomized algorithm that solves the consensus problem against a strong fail-stop adversary with one-half resiliency. Second, we solve the processor renaming problem in the shared-memory model based on the solution of Attiya et al. [ABD+] in the message-passing model. The existence of the solution to the renaming problem should be contrasted with the impossibility result for the consensus problem in the shared-memory model [CIL], [DDS], [LA]. A preliminary version of this paper, Shared-Memory vs. Message-Passing in an Asynchronous Distributed Environment, appeared in Proc. 8th ACM Symp. on Principles of Distributed Computing, pp. 307–318, 1989. Part of this work was done while A. Bar-Noy visited the Computer Science Department, Stanford University, Stanford, CA 94305, USA, and his research was supported in part by a Weizmann fellowship, by Contract ONR N00014-88-K-0166, and by a grant of Stanford's Center for Integrated Systems.

13.
In the standard kernel organization on a bus-based multiprocessor, all processors share the code and data of the operating system; explicit synchronization is used to control access to kernel data structures. Distributed-memory multicomputers use an alternative approach, in which each instance of the kernel performs local operations directly and uses remote invocation to perform remote operations. Either approach to interkernel communication can be used in a large-scale shared-memory multiprocessor. In the paper we discuss the issues and architectural features that must be considered when choosing between remote memory access and remote invocation. We focus in particular on experience with the Psyche multiprocessor operating system on the BBN Butterfly Plus. We find that the Butterfly architecture is biased towards the use of remote invocation for kernel operations that perform a significant number of memory references, and that current architectural trends are likely to increase this bias in future machines. This conclusion suggests that straightforward parallelization of existing kernels (e.g. by using semaphores to protect shared data) is unlikely in the future to yield acceptable performance. We note, however, that remote memory access is useful for small, frequently-executed operations, and is likely to remain so.

14.
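Parallel computing is one of the important directions in the development of computer technology. Current parallel programming models are mainly of two kinds: the message-passing model and the shared-memory model. With the development of multicore processor technology, which integrates two or more complete computing engines (cores) in a single processor, making full use of the characteristics of multicore computers and exploiting their performance has become an important research direction. This paper introduces a new MPI implementation mechanism that combines the advantages of the shared-memory and message-passing models: it uses the shared-memory model within a node and the message-passing model between nodes, and it obtains better performance by automatically generating thread-level tasks.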

15.
J. Pal Singh, A. Gupta, M. Levoy. Computer, 1994, 27(7): 45-55
Recently, a new class of scalable, shared-address-space multiprocessors has emerged. Like message-passing machines, these multiprocessors have a distributed interconnection network and physically distributed main memory. However, they provide hardware support for efficient implicit communication through a shared address space, and they automatically exploit temporal locality by caching both local and remote data in a processor's hardware cache. In this article, we show that these architectural characteristics make it much easier to obtain very good speedups on the best known visualization algorithms. Simple and natural parallelizations work very well, the sequential implementations do not have to be fundamentally restructured, and the high degree of temporal locality obviates the need for explicit data distribution and communication management. We demonstrate our claims through parallel versions of three state-of-the-art algorithms: a recent hierarchical radiosity algorithm by Hanrahan et al. (1991), a parallelized ray-casting volume renderer by Levoy (1992), and an optimized ray-tracer by Spach and Pulleyblank (1992). We also discuss a new shear-warp volume rendering algorithm that provides the first demonstration of interactive frame rates for a 256×256×256 voxel data set on a general-purpose multiprocessor.

16.
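Fault-tolerant PVM with asynchronous checkpointing   Total citations: 1; self-citations: 0; citations by others: 1
Computing environments typified by workstation clusters are currently one of the research focuses of distributed systems and parallel computing, and the message-passing mechanism provided by PVM supports efficient heterogeneous network computing. Standard PVM, however, lacks support for system fault tolerance, which can be remedied by rollback recovery using checkpoints. This paper analyzes the design ideas and implementation techniques for achieving global fault tolerance for PVM at user level. The main idea is to use an asynchronous checkpointing algorithm with message logging, controlled by the PVM daemon and a global scheduler process; all operations are transparent to application programs. The system can further be used to implement transparent process migration and load balancing for PVM.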

17.
For the past few decades, computer engineers have focused on building high-performance, large-scale computer systems at low cost. One example is a distributed-memory computer system such as a cluster, in which fast processing nodes built from commodity processors are connected through a high-speed network. But it is not easy to develop applications on such a system, because a programmer must consider all data and control dependences between processes and program them explicitly. To alleviate this problem, the distributed virtual shared-memory (DVSM) system has been proposed. It is well known that the performance of a DVSM system depends heavily on the network’s performance and programming semantics, and currently its performance is very limited on a conventional network. Recently many advanced hardware-based interconnection technologies have been introduced, and one of them is the InfiniBand Architecture (IBA), which supports shared-memory programming semantics by means of remote direct-memory access (RDMA) and atomic operations. In this paper, we present the implementation of our InfiniBand-based DVSM system and analyze the performance of the SPEC OMP benchmarks in detail by comparing it with a DVSM system based on the traditional network architecture and with a hardware shared-memory multiprocessor (SMP) system. Our experimental results show that our DVSM system, which uses the full features of the IBA, improves performance significantly over the traditional IPoIB-based system on the IBA, and furthermore that the performance of one application on the IBA-based DVSM system is better than on the hardware SMP.

18.
Parallel computing performance on scalable shared-memory architectures is affected by the structure of the interconnection networks linking processors to memory modules and by the efficiency of the memory/cache management systems. Cache Coherence Nonuniform Memory Access (CC-NUMA) and Cache Only Memory Access (COMA) are two effective memory systems, and the hierarchical ring structure is an efficient interconnection network in hardware. This paper focuses on comparative performance modeling and evaluation of CC-NUMA and COMA on a hierarchical ring shared-memory architecture. Analytical models of the two memory systems are presented for comparative evaluation. Intensive performance measurements on data migrations have been conducted on the KSR-1, a COMA hierarchical ring shared-memory machine. Experimental results support the analytical models, and we present practical observations and comparisons of the two cache coherence memory systems. Our analytical and experimental results show that a COMA system balances the work load well. However, the overhead of frequent data movement may match the gains obtained from improving load balance. We believe our performance results could be further generalized to the two memory systems on a hierarchical network architecture. Although a CC-NUMA system may not automatically balance the load at the system level, it provides an option for a user to explicitly handle data locality for a possible performance improvement.

19.
Several parallel programming languages, libraries and environments have been developed to ease the task of writing programs for multiprocessors. Proponents of each approach often point out various language features that are designed to provide the programmer with a simple programming interface. However, virtually no data exist that quantitatively evaluate the relative ease of use of different parallel programming languages. The paper borrows techniques from the software engineering field to quantify the complexity of three predominant programming models: shared memory, message passing and High Performance Fortran (HPF). It is concluded that traditional software complexity metrics are effective indicators of the relative complexity of parallel programming languages. The impact of complexity on run-time performance is also discussed in the context of message passing vs. HPF on an IBM SP2. © 1998 John Wiley & Sons, Ltd.

20.
Distributed concurrent computing based on lightweight processes can potentially address performance and functionality limits in heterogeneous systems. The TPVM framework, based on the notion of ‘exportable services’, is an extension to the PVM message-passing system, but uses threads as units of computing, scheduling, and parallelism. TPVM facilitates and supports three different distributed concurrent programming paradigms: (a) the traditional, task based, explicit message-passing model; (b) a data-driven instantiation model that enables straightforward specification of computation based on data dependencies; and (c) a partial shared-address space model via remote memory access, with naming and typing of distributed data areas. The latter models offer significantly different computing paradigms for network-based computing, while maintaining a close resemblance to, and building upon, the conventional PVM infrastructure in the interest of compatibility and ease of transition. The TPVM system comprises three basic modules: a library interface that provides access to thread-based distributed concurrent computing facilities, a portable thread interface module which abstracts the required thread-related services, and a thread server module which performs scheduling and system data management. System implementation as well as applications experiences have been very encouraging, indicating the viability of the proposed models, the feasibility of portable and efficient threads systems for distributed computing, and the performance improvements that result from multithreaded concurrent computing. © 1998 John Wiley & Sons, Ltd.
