
1.

Shared virtual memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems

Angelos Bilas Dongming Jiang Jaswinder Pal Singh 《Journal of Parallel and Distributed Computing》2003,63(12):1257-1276

Although the shared memory abstraction is gaining ground as a programming abstraction for parallel computing, the main platforms that support it, small-scale symmetric multiprocessors (SMPs) and hardware cache-coherent distributed shared memory systems (DSMs), seem to lie inherently at the extremes of the cost-performance spectrum for parallel systems. In this paper we examine whether shared virtual memory (SVM) clusters can bridge this gap by examining how application performance scales on a state-of-the-art shared virtual memory cluster. We find that: (i) The level of application restructuring needed is quite high compared to applications that perform well on a DSM system of the same scale, and larger problem sizes are needed for good performance. (ii) However, surprisingly, SVM performs quite well for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end DSM system at the same scale and often much more.

2.

Parallelizing with BDSC, a resource-constrained scheduling algorithm for shared and distributed memory systems

《Parallel Computing》2015

We introduce a new parallelization framework for scientific computing based on BDSC, an efficient automatic scheduling algorithm for parallel programs in the presence of resource constraints on the number of processors and their local memory size. BDSC extends Yang and Gerasoulis’s Dominant Sequence Clustering (DSC) algorithm; it uses sophisticated cost models and addresses both shared and distributed parallel memory architectures. We describe BDSC, its integration within the PIPS compiler infrastructure and its application to the parallelization of four well-known scientific applications: Harris, ABF, equake and IS. Our experiments suggest that BDSC’s focus on efficient resource management leads to significant parallelization speedups on both shared and distributed memory systems, improving upon DSC results, as shown by the comparison of the sequential and parallelized versions of these four applications running on both OpenMP and MPI frameworks.

3.

A new memory consistency algorithm for distributed shared memory systems

邢浩沈美明高耀清《软件学报》1995,6(8):468-472

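An SVM system, also called a DSM system, implements a physically distributed but logically shared memory system on a distributed-memory multiprocessor. It combines the ease of programming of shared memory with the good scalability of distributed memory, which makes MPP computers more convenient to use. This paper first introduces the data consistency problem in SVM systems and its solutions, then proposes a new fixed distributed manager algorithm, NFDMA, and analyzes it, and finally compares it with Kai Li's fixed distributed manager algorithm.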
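
The idea behind a fixed distributed manager can be sketched briefly: every page has a statically assigned manager node, so any node can compute locally where to send a page request. The sketch below assumes a simple modulo mapping, which is one common choice; the abstract does not say which mapping NFDMA or the algorithm it is compared against actually uses, and the function names `manager_of` and `pages_managed_by` are hypothetical.

```python
def manager_of(page: int, num_nodes: int) -> int:
    """Return the node that manages `page` (assumed modulo mapping)."""
    return page % num_nodes


def pages_managed_by(node: int, num_pages: int, num_nodes: int) -> list[int]:
    """List the pages whose ownership records are kept on `node`."""
    return [p for p in range(num_pages) if manager_of(p, num_nodes) == node]
```

With such a mapping the manager role is spread evenly across nodes, instead of concentrating every ownership lookup on one central manager.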

4.

Design and implementation of a memory model simulator (total citations: 2; self-citations: 1; citations by others: 1)

吴俊敏杨超陈国良张淼辉门珂《计算机研究与发展》2005,42(3):394-403

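Memory consistency and cache coherence are the two most critical problems in shared-memory parallel computers. This work studies them quantitatively with a simulator, designing and implementing a memory model simulator, MMS. Using MMS, the behavior of several memory consistency models is simulated under different parallel machine architecture models; different memory consistency models are compared on different types of computational problems, and the experimental results are analyzed; several cache coherence protocols are implemented and their performance is compared.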

5.

A new distributed shared memory protocol: the user-level shared memory protocol

吴俊敏高原江松郑世荣《小型微型计算机系统》2000,21(4):337-340

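In massively parallel processing systems, the shared-memory and message-passing communication models each have their own limitations. This paper proposes a new design strategy for improving the performance of shared-memory systems, the user-level shared memory protocol, and tests two typical applications on SimDSM, a simulator of a distributed shared memory system based on x86 processors. The experimental results show that it performs significantly better than a traditional protocol.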

6.

Replication techniques for speeding up parallel applications on distributed systems

Henri E. Bal M. Frans Kaashoek Andrew S. Tanenbaum Jack Jansen 《Concurrency and Computation》1992,4(5):337-355

Most methods for programming loosely coupled systems are based on message-passing. Recently, however, methods have emerged based on ‘virtually’ sharing data. These methods simplify distributed programming, but are hard to implement efficiently, as loosely coupled systems do not contain physical shared memory. We introduce a new model, the shared data-object model, that eases the implementation of parallel applications on loosely coupled systems, but can still be implemented efficiently. In our model, shared data are encapsulated in passive data-objects, which are variables of user-defined abstract data types. To speed up access to shared data, data-objects are replicated. This ability to replicate objects is a significant difference from other object-based models (e.g. Emerald and Amber). Also, by replicating logical objects rather than physical pages, our model has many advantages over shared virtual memory systems. This paper discusses the design choices involved in replicating objects and their effect on performance. Important issues are: how to maintain consistency among different copies of an object; how to implement changes to objects; which strategy for object replication to use. We have implemented several options to determine which ones are the most efficient.

7.

DaSH: A benchmark suite for hybrid dataflow and shared memory programming models

《Parallel Computing》2015

The current trend in the development of parallel programming models is to combine different well-established models into a single programming model in order to support efficient implementation of a wide range of real-world applications. The dataflow model in particular has managed to recapture the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory models. Their findings have influenced the introduction of task dependency in the OpenMP 4.0 standard. This article presents DaSH, the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. DaSH also includes sequential and shared-memory implementations based on OpenMP and Intel TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics.

8.

Delay tolerant lazy release consistency for distributed shared memory in opportunistic networks

《Pervasive and Mobile Computing》2016

Opportunistic networks (ONs) allow mobile wireless devices to interact with one another through a series of opportunistic contacts. While ONs exploit mobility of devices to route messages and distribute information, the intermittent connections among devices make many traditional computer collaboration paradigms, such as distributed shared memory (DSM), very difficult to realize. DSM systems, developed for traditional networks, rely on relatively stable, consistent connections among participating nodes to function properly. We propose a novel delay tolerant lazy release consistency (DTLRC) mechanism for implementing distributed shared memory in opportunistic networks. DTLRC permits mobile devices to remain independently productive while separated, and provides a mechanism for nodes to regain coherence of shared memory if and when they meet again. DTLRC allows applications to utilize the most coherent data available, even in the challenged environments typical of opportunistic networks. Simulations demonstrate that DTLRC is a viable concept for enhancing cooperation among mobile wireless devices in opportunistic networking environments.

9.

Towards implementation of a novel scheme for data prefetching on distributed shared memory systems

Hsiao-Hsi Wang Kuan-Ching Li Ssu-Hsuan Lu Chun-Chieh Yang 《The Journal of supercomputing》2009,47(2):111-126

High-speed networks and rapidly improving microprocessor performance make the network of workstations an extremely important tool for parallel computing, speeding up the execution of scientific applications. Shared memory is an attractive programming model for designing parallel and distributed applications, where the programmer can focus on algorithmic development rather than data partitioning and communication. Because of this characteristic, systems have been designed to provide the shared memory abstraction on physically distributed memory machines, known as Distributed Shared Memory (DSM). DSM is built using specific software to combine a number of computer hardware resources into one computing environment. Such an environment not only provides an easy way to execute parallel applications, but also combines available computational resources with the purpose of speeding up execution of these applications. DSM systems need to maintain data consistency in memory, which usually leads to communication overhead. A number of strategies can therefore be used to reduce this overhead and improve overall performance. Strategies such as prefetching have been shown to perform well in DSM systems, since they can reduce the latency of accessing data on remote nodes. On the other hand, these strategies also transfer unnecessarily prefetched pages to remote nodes. In this research paper, we focus on the access pattern during execution of a parallel application, and then analyze the data type and behavior of parallel applications. We propose an adaptive data classification scheme that refines the prefetching strategy with the goal of improving overall performance. The adaptive data classification scheme classifies data according to the access sequence of pages, so that the home node uses the past access patterns of remote nodes to decide whether it needs to transfer related pages to them.
The experimental results show that the proposed method improves the accuracy of prefetching, reducing the number of page faults and mis-prefetches. With the proposed classification scheme, performance improves by about 9–25% over the same benchmark applications running on top of an original JIAJIA DSM system.


10.

Anonymous and fault-tolerant shared-memory computing

Rachid Guerraoui Eric Ruppert 《Distributed Computing》2007,20(3):165-177

The vast majority of papers on distributed computing assume that processes are assigned unique identifiers before computation begins. But is this assumption necessary? What if processes do not have unique identifiers or do not wish to divulge them for reasons of privacy? We consider asynchronous shared-memory systems that are anonymous. The shared memory contains only the most common type of shared objects, read/write registers. We investigate, for the first time, what can be implemented deterministically in this model when processes can fail. We give anonymous algorithms for some fundamental problems: time-stamping, snapshots and consensus. Our solutions to the first two are wait-free and the third is obstruction-free. We also show that a shared object has an obstruction-free implementation if and only if it satisfies a simple property called idempotence. To prove the sufficiency of this condition, we give a universal construction that implements any idempotent object.

11.

A quantitative performance analysis model for GPU matrix multiplication

尹孟嘉许先斌熊曾刚张涛《计算机科学》2015,42(12):13-17, 22

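Performance evaluation and optimization are essential to designing efficient parallel programs, and the performance of the memory system directly affects the overall performance of the processor. GPGPU-Sim is used to simulate the GPU memory hierarchy and to find the best ratio between the number of SMs and the number of memory controllers. Matrix multiplication is a basic building block of scientific computing and a typical application that is both compute-intensive and memory-intensive; its performance is an important indicator for GPU high-performance computing. As a new technical solution for evaluating the performance of parallel systems, performance models have many advantages that other evaluation methods cannot match. A performance model is built that quantitatively analyzes the instruction pipeline, shared memory accesses, and global memory accesses, identifying the bottleneck of program execution and improving execution speed. Experiments show that the model is practical and that it effectively optimizes matrix multiplication.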

12.

Adapting the Network Interface for High-Performance Computing: The CNI Approach

Sarkar Prasenjit Bailey Mary 《The Journal of supercomputing》1997,11(2):181-200

As the prices of commodity workstations go down, clusters of workstations have started to emerge as a viable economic solution for scalable computing. Recent advances in networking technology have made it possible to obtain high-bandwidth connections between applications. However, the interconnect latency between workstation nodes in a cluster remains a serious concern and can prove to be the limiting factor in workstation performance. In this paper, we present the CNI, or cluster network interface, which achieves the twin goals of low latency and high bandwidth. In addition, CNI efficiently supports multiple programming paradigms for programming generality. This is done by functionally coupling the network interface more closely to the CPU without violating the constraints of a standard workstation architecture. CNI results in performance gains for applications, substantially reducing communication overhead and delay.

13.

Issues and experiences in implementing a distributed tuplespace

James B. Fenwick Lori L. Pollock 《Software》1997,27(10):1199-1232


14.

A Polynomial-Time Algorithm for Memory Space Reduction

Yonghong Song Cheng Wang Zhiyuan Li 《International journal of parallel programming》2005,33(1):1-33

Reducing memory space requirements is important to many applications. For data-intensive applications, it may help avoid executing the program out-of-core. For high-performance computing, memory space reduction may improve the cache hit rate as well as performance. For embedded systems, it can reduce the memory requirement, the memory latency and the energy consumption. This paper investigates program transformations which a compiler can use to reduce the memory space required for storing program data. In particular, the paper uses integer programming to model the problem of combining loop shifting, loop fusion and array contraction to minimize the data memory required to execute a collection of multi-level loop nests. The integer programming problem is then reduced to an equivalent network flow problem which can be solved in polynomial time.
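
To make the transformations named in this abstract concrete, here is a minimal hand-written sketch of loop fusion followed by array contraction on a toy computation. It only illustrates the effect on memory use; it is not the paper's integer-programming formulation, loop shifting is not shown, and the function names are hypothetical.

```python
def unfused(a):
    """Two separate loops: the temporary `tmp` needs len(a) elements."""
    n = len(a)
    tmp = [0] * n                 # full-size temporary array
    for i in range(n):
        tmp[i] = a[i] * 2         # producer loop
    out = [0] * n
    for i in range(n):
        out[i] = tmp[i] + 1       # consumer loop
    return out


def fused_contracted(a):
    """Fused loop: each tmp[i] is consumed in the iteration that produces it,
    so the array can be contracted to a single scalar."""
    out = [0] * len(a)
    for i in range(len(a)):
        t = a[i] * 2              # scalar replaces tmp[i]
        out[i] = t + 1
    return out
```

Both functions compute the same result, but the second needs no temporary array. Contraction is legal here only because fusion puts the producer and consumer of each `tmp[i]` in the same iteration.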

15.

Design and Simulation of the Aquarius-II Multiprocessor

Vason P. Srini Tam M. Nguyen Darren R. Busing Mike J. Carlton Bruce K. Holmer Georges E. Smine Alvin M. Despain 《Journal of Systems Integration》1997,7(2):151-178

Aquarius-II is a cache coherent multiprocessor system designed for the parallel execution of Prolog programs. It contains two tiers of memory: synchronization memory and high bandwidth (HB) memory. The synchronization memory consists of snooping caches connected to a bus and is used to store rendezvous points, synchronization bits, synchronization variables such as locks and semaphores and most of the write shared data. The HB memory is used to store the bulk of the application program code and data. It contains caches and an inexpensive VLSI chip based crossbar interconnection network to memory. The caches connected to the crossbar do not have full snooping capability. The architecture is evaluated by a full simulation of parallel execution of Prolog programs on Aquarius-II. The design details of the components of the architecture and simulation results are presented. Simulation results indicate that the two tier memory system significantly reduces memory interference and speeds up synchronization when compared to a single bus multi. This shared memory multiprocessor architecture has the potential to support other parallel programming paradigms.

16.

Program Development Tools for Clusters of Shared Memory Multiprocessors

Chapman B. Merlin J. Pritchard D. Bodin F. Mevel Y. Sørevik T. Hill L. 《The Journal of supercomputing》2000,17(3):311-322

Applications are increasingly being executed on computational systems that have hierarchical parallelism. There are several programming paradigms which may be used to adapt a program for execution in such an environment. In this paper, we outline some of the challenges in porting codes to such systems, and describe a programming environment that we are creating to support the migration of sequential and MPI code to a cluster of shared memory parallel systems, where the target program may include MPI, OpenMP or both. As part of this effort, we are evaluating several experimental approaches to aiding in this complex application development task.

17.

An implementation of distributed shared memory

Umakishore Ramachandran M. Yousef A. Khalidi 《Software》1991,21(5):443-464

Shared memory is a simple yet powerful paradigm for structuring systems. Recently, there has been an interest in extending this paradigm to non-shared memory architectures as well. For example, the virtual address spaces for all objects in a distributed object-based system could be viewed as constituting a global distributed shared memory. We propose a set of primitives for managing distributed shared memory. We present an implementation of these primitives in the context of an object-based operating system as well as on top of Unix.

18.

DASD sharing in DOS/VSE

Josef S. Ottmann 《Software》1982,12(9):835-842

The first part of this report discusses DASD sharing in loosely-coupled data processing systems. For the locking of shared files two design concepts are presented. The second part describes DASD sharing in the IBM operating system DOS/VSE. The control program routines which serialize the lock requests within one processing unit and the cross-system locking facility are discussed.

19.

Research on message scalability techniques for InfiniBand networks

彭龙根尤洪涛尹万旺《计算机科学》2013,40(3):104-106

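InfiniBand is currently one of the mainstream interconnect networks for HPC systems. Its reliable connection transport service supports RDMA, atomic operations, and other features, and is therefore widely used by parallel programming models such as MPI. However, the message queue and buffer overhead needed to support reliable connections tends to grow sharply as the parallel scale increases, which limits how far applications can scale. To address the message scalability problem caused by this memory overhead, the paper first introduces the shared receive queue and extended reliable connection techniques from the perspective of InfiniBand transport optimization, and then proposes a group connection technique based on the parallel communication model. These techniques can reduce per-node memory overhead by two orders of magnitude, and the overhead does not grow noticeably as the parallel scale increases.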

20.

PAMM: an optimization model for inter-domain communication based on memory sharing

孙瑞辰孙磊《计算机科学》2015,42(Z11):218-221, 235

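The combination of cloud computing platforms and virtualization technology has created new requirements for inter-domain communication between virtual machines. Inter-domain communication based on memory sharing can improve communication efficiency between virtual machines running on the same physical machine. However, the context switches incurred during memory-sharing-based inter-domain communication limit how far it can be optimized. The paper introduces a new memory sharing model, PAMM, which adds a management module that aggregates the memory pages passed during sharing, reducing the number of hypercall requests and thereby the number of state switches. Experiments show that PAMM improves the efficiency of inter-domain communication based on memory sharing.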
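
The effect of aggregating pages can be illustrated with a toy model that only counts calls. This is a sketch under stated assumptions, not PAMM itself: `FakeHypervisor`, `share_one_by_one`, `share_aggregated`, and the `batch_size` parameter are all hypothetical, and a real system would issue hypercalls through the hypervisor's own interface.

```python
class FakeHypervisor:
    """Stand-in for a hypervisor; it only counts how often it is entered."""

    def __init__(self):
        self.calls = 0

    def share(self, pages):
        # Each invocation models one hypercall, i.e. one state switch.
        self.calls += 1


def share_one_by_one(pages, hv):
    """Baseline: one hypercall per shared page."""
    for page in pages:
        hv.share([page])


def share_aggregated(pages, hv, batch_size):
    """Aggregate pages and issue one hypercall per batch."""
    for start in range(0, len(pages), batch_size):
        hv.share(pages[start:start + batch_size])
```

For 10 pages and a batch size of 4, the baseline enters the hypervisor 10 times, while the aggregated version enters it 3 times (batches of 4, 4, and 2 pages).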