期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

陈小文陈书明鲁中海 Axel Jantsch 《计算机工程与科学》2012,34(7):54-59

在多核处理器芯片中,分布式共享存储DSM虽然提供了统一的全局寻址的存储空间,但却引入了虚地址向实地址转换的开销,这对性能产生了负面的影响。我们注意到,在并行程序的执行过程中,被处理的数据属性(私有或共享)并不是一成不变的。并行程序中不同的数据具有不同的属性,即使同一数据在程序的不同执行阶段也可能具有不同的属性。本文首先详细地阐述了一种混合式的分布式共享存储空间,支持对共享数据采用全局寻址的虚地址访问而对私有数据采用快速寻址的实地址访问;进而提出了一种针对混合式的分布式共享存储空间的实时划分技术。该技术根据并行程序中数据的属性,在程序运行时,实时地调整和划分分布式共享存储空间。当数据为私有时,通过实地址访问加快数据的访问速度,当数据为共享时则维持虚地址访问,从而减少整个并行程序运行过程中的地址转换开销,提高系统的性能。实际应用程序的实验结果表明,与传统的分布式共享存储空间相比,实时划分的混合式的分布式共享存储空间具有性能优势,性能的提升比例与具体的网络规模、计算规模、并行程序映射方式等有关。在我们的实验中,性能的提升比例最高为13.14%,最低为6.98%。相似文献

2.

TreadMarks: shared memory computing on networks of workstations 总被引：2，自引：0，他引：2

Amza C. Cox A.L. Dwarkadas S. Keleher P. Honghui Lu Rajamony R. Weimin Yu Zwaenepoel W. 《Computer》1996,29(2):18-28

Shared memory facilitates the transition from sequential to parallel processing. Since most data structures can be retained, simply adding synchronization achieves correct, efficient programs for many applications. We discuss our experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system. DSM allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. We illustrate a DSM system consisting of N networked workstations, each with its own memory. The DSM software provides the abstraction of a globally shared memory, in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value 相似文献

3.

VIA(Virtual Interface Architecture)上的软件DSM系统实现和性能

史岗尹宏达胡明昌胡伟武《计算机学报》2003,26(12):1621-1628

在由高性能PC搭建的Linux机群系统上，传统的网络接口体系结构引入了巨大的软件处理开销，无法满足虚拟共享存储并行应用对通信带宽、延迟和进程间同步的需求．用户级网络接口标准——虚拟接口体系结构(Vilxual Interface Architecture，VIA)与传统的网络接口体系结构相比，在软件协议开销、通信关键路径上操作系统的干预程度、通信和计算的重叠程度以及实现零拷贝等方面，具有明显的优势．通过在传统网络通信接口和VIA通信接口上虚拟共享存储系统的性能对比，采用VIA网络接口体系结构可有效地提高虚拟共享存储系统的性能和可扩展性．相似文献

4.

Indigo: user-level support for building distributed shared abstractions

Prince Kohli Mustaque Ahamad Karsten Schwan 《Concurrency and Computation》1998,10(1):1-29

Distributed systems that consist of workstations connected by high performance interconnects offer computational power comparable to moderate size parallel machines. Middleware like distributed shared memory (DSM) or distributed shared objects (DSO) attempts to improve the programmability of such hardware by presenting to application programmers interfaces similar to those offered by shared memory machines. This paper presents the portable Indigo data sharing library which provides a small set of primitives with which arbitrary shared abstractions are easily and efficiently implemented across distributed hardware platforms. Sample shared abstractions implemented with Indigo include DSM as well as fragmented objects, where the object state is split across different machines and where interfragment communications may be customized to application-specific consistency needs. The Indigo library's design and implementation are evaluated on two different target platforms: a workstation cluster and an IBM SP2 machine. As part of this evaluation, a novel DSM system and consistency protocol are implemented and evaluated with several high performance applications. Application performance attained with the DSM system is compared to the performance experienced when utilizing the underlying basic message-passing facilities or when employing Indigo to construct customized fragmented objects implementing the application's shared state. Such experimentation results in insights concerning the efficient implementation of DSM systems (e.g. how to deal with false sharing). It also leads to the conclusion that Indigo provides a sufficiently rich set of abstractions for efficient implementation of the next generation of parallel programming models for high performance machines. © 1998 John Wiley & Sons, Ltd. 相似文献

5.

Integrated Memory Controllers with Parallel Coherence Streams

Chaudhuri M. Heinrich M. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1159-1173

Previous work in scalable hardware distributed shared memory (DSM) multiprocessors has established the critical and dominant role that protocol processing bandwidth (or its inverse, occupancy) plays in determining overall performance in architectures with standalone memory/coherence controllers. However, with recent architectural trends toward integrated (on-chip) memory controllers and the well-known fact that processor frequency is increasing more rapidly than memory systems, we must ask whether parallel coherence processing engines (either multiple integrated protocol processors/cores or multiple protocol threads) are needed in DSM machines constructed from modern processor architectures and, if so, when. We construct a useful analytical model to give the designer insight into when parallel coherence streams will improve performance and verify our model via detailed simulation on 64-threaded microbenchmarks and parallel applications and on single-node multiprogrammed workloads. Surprisingly, and contrary to related work, we find that, in these architectures, adding a second coherence engine has almost no impact on performance. Further, for less-tuned applications that suffer from hot spots (contentious requests to the same memory line), additional engines offer no benefit whatsoever. Even with double the memory bandwidth (or channels), an additional coherence processing stream yields only slight performance improvement. Only for a special class of DSM machines employing directoryless broadcast protocols over unordered interconnects does parallel "snoop" processing offer reasonable performance improvement for communication-intensive applications. Overall, given the architectural trends, this is good news for DSM designers who want to minimize the resources necessary (protocol threads or integrated protocol processor cores for maintaining internode coherence, respectively) to create SMTp-based or multi-CMP-based scalable DSM machines using directory protocols. 相似文献

6.

Towards implementation of a novel scheme for data prefetching on distributed shared memory systems

Hsiao-Hsi Wang Kuan-Ching Li Ssu-Hsuan Lu Chun-Chieh Yang 《The Journal of supercomputing》2009,47(2):111-126

High speed networks and rapidly improving microprocessor performance make the network of workstations an extremely important tool for parallel computing in order to speedup the execution of scientific applications. Shared memory is an attractive programming model for designing parallel and distributed applications, where the programmer can focus on algorithmic development rather than data partition and communication. Based on this important characteristic, the design of systems to provide the shared memory abstraction on physically distributed memory machines has been developed, known as Distributed Shared Memory (DSM). DSM is built using specific software to combine a number of computer hardware resources into one computing environment. Such an environment not only provides an easy way to execute parallel applications, but also combines available computational resources with the purpose of speeding up execution of these applications. DSM systems need to maintain data consistency in memory, which usually leads to communication overhead. Therefore, there exists a number of strategies that can be used to overcome this overhead issue and improve overall performance. Strategies as prefetching have been proven to show great performance in DSM systems, since they can reduce data access communication latencies from remote nodes. On the other hand, these strategies also transfer unnecessary prefetching pages to remote nodes. In this research paper, we focus on the access pattern during execution of a parallel application, and then analyze the data type and behavior of parallel applications. We propose an adaptive data classification scheme to improve prefetching strategy with the goal to improve overall performance. Adaptive data classification scheme classifies data according to the accessing sequence of pages, so that the home node uses past history access patterns of remote nodes to decide whether it needs to transfer related pages to remote nodes. From experimental results, we can observe that our proposed method can increase the accuracy of data access in effective prefetch strategy by reducing the number of page faults and misprefetching. Experimental results using our proposed classification scheme show a performance improvement of about 9–25% over the same benchmark applications running on top of an original JIAJIA DSM system.

Kuan-Ching Li (Corresponding author)Email:

相似文献

7.

Characterizing distributed shared memory performance: a case studyof the Convex SPP1000

Abandah G.A. Davidson E.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(2):206-216

In a distributed shared memory (DSM) multiprocessor, the processors cooperate in solving a parallel application by accessing the shared memory. The latency of a memory access depends on several factors, including the distance to the nearest valid data copy, data sharing conditions, and traffic of other processors. To provide a better understanding of DSM performance and to support application tuning and compiler development for DSM systems, this paper extends microbenchmarking techniques to characterize the important aspects of a DSM system. We present an experiment-based methodology for characterizing the memory, communication, scheduling, and synchronization performance, and apply it to the Convex SPP1000. We present carefully designed microbenchmarks to characterize the performance of the local and remote memory, producer-consumer communication involving two or more processors, and the effects on performance when multiple processors contend for utilization of the distributed memory and the interconnection network 相似文献

8.

Data race avoidance and replay scheme for developing and debugging parallel programs on distributed shared memory systems

Yung-Chang Chiu Tyng-Yeu Liang 《Parallel Computing》2011,37(1):11-25

Distributed shared memory (DSM) allows parallel programs to run on distributed computers by simulating a global virtual shared memory, but data racing bugs may easily occur when the threads of a multi-threaded process concurrently access the physically distributed memory. Earlier tools to help programmers locate data racing bugs in non-DSM parallel programs are not easily applied to DSM systems. This study presents the data race avoidance and replay scheme (DRARS) to assist debugging parallel programs on DSM or multi-core systems. DRARS is a novel tool which controls the consistency protocol of the target program, automatically preventing a large class of data racing bugs when the parallel program is subsequently run, obviating much of the need for manual debugging. For data racing bugs that cannot be avoided automatically, DRARS performs a deterministic replay-type function on DSM systems, faithfully reproducing the behavior of the parallel program during run time. Because one class of data racing bugs has already been eliminated, the remaining manual debugging task is greatly simplified. Unlike previous debugging methods, DRARS does not require that the parallel program be written in a specific style or programming language. Moreover, DRARS can be implemented in most consistency protocols. In this paper, DRARS is realized and verified in real experiments using the eager release consistency protocol on a DSM system with various applications. 相似文献

9.

在操作系统中实现分布共享存储、存储管理和文件系统的集成 总被引：3，自引：0，他引：3

刘福岩尤晋元耿彤《计算机工程与应用》2000,36(5):36-39

开发分布共享存储系统的目的是为了在分布式存储器的基础上构造逻辑上的共享存储器模型,对于如何在共享存储器模型的基础上为用户进程构造虚拟空间,传统的分布共享系统并未给予足够的重视。只有在操作系统中把分布共享存储技术、存储器管理和文件系统结合起来,才能充分发挥分布共享存储技术具有的能力。基于以上思想,在文中提出了一个实现了分布共享存储的操作系统模型,并分析了该模型一个实现原型,讨论该原型具有的优缺点。通过在操作系统中取消进程的逻辑空间,使进程直接在文件上运行,该模型不仅能够实现分布共享存储,而且和许多传统操作系统以及传统分布共享存储系统相比,具有许多优点。该操作系统实现了分布共享存储技术和操作系统中的存储管理以及文件系统的完美结合。相似文献

10.

Design of the Munin Distributed Shared Memory System

Carter J. B. 《Journal of Parallel and Distributed Computing》1995,29(2)

相似文献

11.

Simulation tools to study a distributed shared memory for clusters of symmetric multiprocessors

《Future Generation Computer Systems》2006,22(1-2):57-66

Distributed shared memory (DSM) systems have become popular as a means of utilizing clusters of computers for solving large applications. We have developed a high-performance DSM, Strings. In addition, to improve the performance of our DSM, a memory hierarchy simulator has been developed that allows us to compare various techniques very quickly and with much less effort. This paper describes our simulator, DSMSim. We show that the simulator's performance closely matches the real system and demonstrate potential performance gains of up to 60% after adding optimization features to the simulator. The simulator also accepts the same code as the software distributed shared memory. 相似文献

12.

DSM系统中共享存储抽象机制的实现

下载免费PDF全文

史扬金士尧《计算机工程与科学》1998,20(4):61-65

虽然ＤＳＭ系统相互之间差异很大，但ＤＳＭ存在一个共同特征，即提供共享存储抽象机制。本文分析了ＤＳＭ系统共享存储抽象机制的实现，总结了各种不同的实现途径、实现细节及各自的优缺点，指出了ＤＳＭ发展的趋势及一些亟待解决的问题。相似文献

13.

A flexible framework for consistency management

S. Weber P. A. Nixon B. Tangney 《Concurrency and Computation》2002,14(1):33-53

Recent distributed shared memory (DSM) systems provide increasingly more support for the sharing of objects rather than portions of memory. However, like earlier DSM systems these distributed shared object systems (DSO) still force developers to use a single protocol, or a small set of given protocols, for the sharing of application objects. This limitation prevents the applications from optimizing their communication behaviour and results in unnecessary overhead. A current general trend in software systems development is towards customizable systems, for example frameworks, reflection, and aspect‐oriented programming all aim to give the developer greater flexibility and control over the functionality and performance of their code. This paper describes a novel object‐oriented framework that defines a DSM system in terms of a consistency model and an underlying coherency protocol. Different consistency models and coherency protocols can be used within a single application because they can be customized, by the application programmer, on a per‐object basis. This allows application specific semantics to be exploited at a very fine level of granularity and with a resulting improvement in performance. The framework is implemented in JAVA and the speed‐up obtained by a number of applications that use the framework is reported. Copyright © 2002 John Wiley & Sons, Ltd. 相似文献

14.

CAS-DSM: A Compiler Assisted Software Distributed Shared Memory

N. P. Manoj K. V. Manjunath R. Govindarajan 《International journal of parallel programming》2004,32(2):77-122

Traditional software Distributed Shared Memory (DSM) systems rely on the virtual memory management mechanisms to detect accesses to shared memory locations and maintain their consistency. The resulting involvement of the OS (kernel) and the associated overhead which is significant, can be avoided by careful compile time analysis and code instrumentation. In this paper, we propose such a Compiler Assisted Software support approach (CAS-DSM). In the CAS-DSM implementation, the involvement of the OS kernel is avoided by instrumenting the application code at the source level. The overhead caused by the execution of the instrumented code is reduced through several aggressive compile time optimizations. Finally, we also address the issue of reducing certain overheads in polling-based implementation of receiving asynchronous messages. We used SUIF, a public domain compiler tool, to implement compile time analysis, instrumentation and optimizations. We modified CVM, a publicly available software DSM to support the instrumentation inserted by the compiler. Detailed performance evaluation of CAS-DSM is reported using a set of Splash/Splash2 parallel application benchmarks on a distributed memory IBM SP-2 machine. CAS-DSM achieved moderate to good performance improvements for most of the applications compared to the original CVM implementation. Reducing the overheads in polling-based implementation improves the performance of CAS-DSM significantly resulting in an overall improvement of 12–52% over the original CVM implementation. 相似文献

15.

Transparent adaptation of sharing granularity in MultiView‐based DSM systems

Nitzan Niv Assaf Schuster 《Software》2001,31(15):1439-1459

In this paper we propose a mechanism that provides distributed shared memory (DSM) systems with a flexible sharing granularity. The size of the shared memory units is dynamically determined by the system during runtime. This size can range from that of a single variable up to the size of the entire shared memory space. During runtime, the DSM transparently adapts the granularity to the memory access pattern of the application in each phase of its execution. This adaptation, called ComposedView, provides efficient data sharing in software DSM while preserving sequential consistency. Neither complex code analysis nor annotation by the programmer or the compiler are required. Our experiments indicate a substantial performance boost (up to 80% speed‐up improvement) when running a large set of applications using our method, compared to running these benchmark applications with the best fixed granularity. Copyright © 2001 John Wiley & Sons, Ltd. 相似文献

16.

Moving address translation closer to memory in distributed shared-memory multiprocessors

Qiu X. Dubois M. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(7):612-623

To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (translation lookaside buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of distributed shared memory (DSM) multiprocessors, including CC-NUMAs (cache-coherent non-uniform memory access architectures) and COMAs (cache only memory access architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In the context of COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can dynamically migrate and replicate freely among nodes. As the address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect. We also observe that the TLB is very effective when it is merged with the shared-memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Even if the effectiveness of the TLB merged with the shared memory is very high, we also show that the TLB can be removed in a system with address translation done in memory because the frequency of translations is very low. 相似文献

17.

存储模型仿真器的设计与实现 总被引：2，自引：1，他引：1

吴俊敏杨超陈国良张淼辉门珂《计算机研究与发展》2005,42(3):394-403

存储一致性问题和高速缓存一致性问题是共享存储并行计算机中两个最关键的问题,通过仿真器对它们进行了量化研究,设计并实现了一个存储模型仿真器MMS．基于MMS仿真了不同并行机结构模型下多种存储一致性模型的行为;针对不同类型的计算问题比较了不同的存储一致性模型,并对实验结果进行了分析;实现了几个不同的高速缓存一致性协议,并比较了它们的性能．相似文献

18.

An application-driven study of parallel system overheads andnetwork bandwidth requirements

Sivasubramaniam A. Singla A. Ramachandran U. Venkateswaran H. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(3):193-210

Evaluating and analyzing the performance of a parallel application on an architecture to explain the disparity between projected and delivered performance is an important aspect of parallel systems research. However, conducting such a study is hard due to the vast design space of these systems. We study two important aspects related to the performance of parallel applications on shared memory parallel architectures. First, we quantify overheads observed during the execution of these applications on three different simulated architectures. We next use these results to synthesize the bandwidth requirements for the applications with respect to different network topologies. This study is performed using an execution-driven simulation tool called SPASM, which provides a way of isolating and quantifying the different parallel system overheads in a nonintrusive manner. The first exercise shows that in shared memory machines with private caches, as long as the applications are well-structured to exploit locality, the key determinant that impacts performance is network connection. The second exercise quantifies the network bandwidth needed to minimize the effect of network connection. Specifically, it is shown that for the applications considered, as long as the problem sizes are increased commensurate with the system size, current network technologies supporting 200-300 MBytes/sec link bandwidth are sufficient to keep the network overheads (such as latency and contention) within acceptable bounds 相似文献

19.

Experimenting with a Shared Virtual Memory Environment for Hypercubes

《Journal of Parallel and Distributed Computing》1995,29(2):228-235

This paper describes the design and implementation of a shared virtual memory (SVM) system for the nCUBE 2 machine. The SVM system provides the user a single coherent address space across all nodes. It is implemented at the user level in a C programming environment using high level constructs to support data sharing. Shared variables are treated as objects rather than pages. We have improved upon an existing algorithm for maintaining coherency in the SVM system, thus achieving a reduction in the number of internode messages required in coherency maintenance. Detailed timing analysis is conducted to analyze the feasibility of this shared environment. Experimental results indicate that parallel programs running under an SVM system show linear speedup, suggesting that SVM systems could provide an effective programming environment for the next generation of distributed memory parallel computers. The bottleneck of this implementation is associated with the expensive interrupt handling capability of the nCUBE 2. 相似文献

20.

A Lock-Based Cache Coherence Protocol for Scope Consistency 总被引：5，自引：2，他引：5

下载免费PDF全文

Hu Weiwu Shi Weisong Tang Zhimin Li Ming 《计算机科学技术学报》1998,13(2):97-109

Directory protocols are widely adopted to maintain cache coherence of distributed shared memory multiprocessors.Although scalable to a certain extent,directory protocols are complex enough to prevent it from being used in very large scale multiprocessors with tens of thousands of nodes.his paper proposes a lock-based cache coherence protocol for scope consistency.In does not rely on directory information to maintain cache coherence.Instead,cache coherence is maintained through requiring the releasing processor of a lock to stroe all write-notices generated in the associated critical section to the lock and the acquiring processor invalidates or updates its locally cached data copies according to the write notices of the lock.To evaluate the performance of the lock-based cache coherence protocol,a software SDM system named JIAJIA is built on network of workstations.Besides the lock-based cache coherence protocol,JIAJIA also characterizes itself with its shared memory organization scheme which combines the physical memories of multiple workstations to form a large shared space.Performance measurements with SPLASH2 program suite and NAS benchmarks indicate that,compared to recent SVM systems such as CVM,higher speedup is achieved by JIAJIA.Besides,JIAJIA can solve large scale problems that cannot be solved by other SVM systems due to memory size limitation. 相似文献