
1.

Performance improvement of parallel programs on a broadcast-based distributed shared memory multiprocessor by simulation

《Simulation Modelling Practice and Theory》2008,16(3):338-352

Due to advances in fiber optics and VLSI technology, interconnection networks that allow simultaneous broadcasts are becoming feasible. Distributed shared memory (DSM) implementations on such networks promise high performance even for small applications with small granularity. This paper, after summarizing the architecture of one such implementation called the Simultaneous Multiprocessor Optical Exchange Bus (SOME-Bus), presents simple algorithms for improving the performance of parallel programs running on the SOME-Bus multiprocessor implementing cache-coherent DSM. The algorithms are based on run-time data redistribution via a dynamic page migration protocol. To decide where shared data should be placed, they combine memory access references with one of four metrics reported by each node and gathered by hardware monitors: average channel utilization, average channel waiting time, number of messages in the channel queue, or short-term average channel waiting time. Simulations with four parallel codes on a 64-processor SOME-Bus show that the algorithms yield significant performance improvements, including reductions in execution time, number of remote memory accesses, average channel waiting time, and average network latency, and an increase in average channel utilization.
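The placement decision described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name `choose_home`, the `gain` and `wait_threshold` parameters, and the rule itself (migrate a page to the node that references it most, unless that node's channel is already congested) are assumptions made for illustration only.

```python
from collections import Counter


def choose_home(access_counts, current_home, channel_wait,
                wait_threshold=1.0, gain=2.0):
    """Pick the node that should host a page (hypothetical policy).

    access_counts: Counter of node id (int) -> references to the page
                   observed in the last monitoring interval
    channel_wait:  dict of node id -> short-term average channel waiting time
    """
    if not access_counts:
        return current_home
    # Node with the most references; ties go to the lowest node id.
    best, best_refs = max(access_counts.items(),
                          key=lambda kv: (kv[1], -kv[0]))
    home_refs = access_counts.get(current_home, 0)
    # Assumed rule: migrate only when the candidate clearly dominates the
    # current home and its own channel is not already congested.
    dominates = best_refs >= gain * max(home_refs, 1)
    uncongested = channel_wait.get(best, 0.0) <= wait_threshold
    if best != current_home and dominates and uncongested:
        return best
    return current_home
```

With `Counter({0: 3, 5: 10})`, home node 0 and a lightly loaded channel at node 5, the sketch moves the page to node 5. A high waiting time at node 5 keeps the page where it is.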

2.

An Asynchronous Protocol for Release Consistent Distributed Shared Memory Systems

Yeo Jaeheung Yeom Heon Y. Park Taesoon 《The Journal of supercomputing》2003,24(1):25-41

Distributed shared memory (DSM) systems provide a simple programming paradigm for networks of workstations, which are gaining popularity due to their cost-effective high computing power. However, DSM systems usually exhibit poor performance due to the large communication delay between the nodes, and many memory consistency models have been proposed to mask the network delay. In this paper, we propose an asynchronous protocol for the release consistent memory model, which we call an Asynchronous Release Consistency (ARC) protocol. Unlike other protocols where the communication adheres to the synchronous request/receive paradigm, the ARC protocol is asynchronous, such that the necessary pages are broadcast before they are requested. Hence, the network delay can be reduced by proper prefetching of necessary pages. We have also compared the performance of the ARC protocol with the lazy release protocol by running standard benchmark programs, and the experimental results showed that the ARC protocol achieves a performance improvement of up to 29%.

3.

Delay tolerant lazy release consistency for distributed shared memory in opportunistic networks

《Pervasive and Mobile Computing》2016

Opportunistic networks (ONs) allow mobile wireless devices to interact with one another through a series of opportunistic contacts. While ONs exploit mobility of devices to route messages and distribute information, the intermittent connections among devices make many traditional computer collaboration paradigms, such as distributed shared memory (DSM), very difficult to realize. DSM systems, developed for traditional networks, rely on relatively stable, consistent connections among participating nodes to function properly. We propose a novel delay tolerant lazy release consistency (DTLRC) mechanism for implementing distributed shared memory in opportunistic networks. DTLRC permits mobile devices to remain independently productive while separated, and provides a mechanism for nodes to regain coherence of shared memory if and when they meet again. DTLRC allows applications to utilize the most coherent data available, even in the challenged environments typical of opportunistic networks. Simulations demonstrate that DTLRC is a viable concept for enhancing cooperation among mobile wireless devices in opportunistic networking environments.

4.

TreadMarks: shared memory computing on networks of workstations (total citations: 2; self-citations: 0; citations by others: 2)

Amza C. Cox A.L. Dwarkadas S. Keleher P. Honghui Lu Rajamony R. Weimin Yu Zwaenepoel W. 《Computer》1996,29(2):18-28

Shared memory facilitates the transition from sequential to parallel processing. Since most data structures can be retained, simply adding synchronization achieves correct, efficient programs for many applications. We discuss our experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system. DSM allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. We illustrate a DSM system consisting of N networked workstations, each with its own memory. The DSM software provides the abstraction of a globally shared memory, in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value.

5.

Real-time partitioning of hybrid distributed shared memory space in multi-core processors

陈小文 (Chen Xiaowen), 陈书明 (Chen Shuming), 鲁中海 (Lu Zhonghai), Axel Jantsch 《计算机工程与科学》 (Computer Engineering & Science) 2012,34(7):54-59

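In multi-core processor chips, distributed shared memory (DSM) provides a unified, globally addressable memory space, but it introduces the overhead of translating virtual addresses to physical addresses, which hurts performance. We observe that, during the execution of a parallel program, the property of the data being processed (private or shared) is not fixed. Different data in a parallel program have different properties, and even the same data may have different properties in different phases of execution. This paper first describes in detail a hybrid distributed shared memory space that supports globally addressed virtual-address access for shared data and fast physical-address access for private data. It then proposes a real-time partitioning technique for this hybrid space. Based on the properties of the data in the parallel program, the technique adjusts and partitions the distributed shared memory space at run time. When data are private, physical-address access speeds up data access; when data are shared, virtual-address access is retained. This reduces address translation overhead over the whole run of the parallel program and improves system performance. Experimental results with real applications show that, compared with a conventional distributed shared memory space, the real-time partitioned hybrid space has a performance advantage, and that the size of the improvement depends on the network size, the computation size, the mapping of the parallel program, and similar factors. In our experiments, the performance improvement ranged from 6.98% to 13.14%.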
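The idea of skipping address translation for private data can be shown with a toy cost model. Everything here is assumed for illustration: the class `HybridDSM`, the cycle costs, and the block-level `set_private` call are hypothetical and do not come from the paper.

```python
class HybridDSM:
    """Toy model: shared blocks pay a virtual-to-physical translation,
    private blocks are accessed directly by physical address."""

    def __init__(self, translate_cost=3, access_cost=1):
        self.translate_cost = translate_cost  # assumed cycles per translation
        self.access_cost = access_cost        # assumed cycles per memory access
        self.private = set()                  # blocks currently marked private
        self.cycles = 0                       # total cycles spent so far

    def set_private(self, block, is_private):
        # Run-time repartitioning: move a block across the
        # private/shared boundary as its property changes.
        if is_private:
            self.private.add(block)
        else:
            self.private.discard(block)

    def access(self, block):
        cost = self.access_cost
        if block not in self.private:
            cost += self.translate_cost  # shared data keep virtual addressing
        self.cycles += cost
        return cost
```

In this model a block that is shared in one phase and private in the next costs 4 cycles per access in the first phase and 1 cycle per access in the second, which is the effect the partitioning technique exploits.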

6.

A full-temporal indexing method for moving objects built on DSM

顾星 (Gu Xing), 朱占宇 (Zhu Zhanyu), 杨群 (Yang Qun), 皮德常 (Pi Dechang) 《小型微型计算机系统》 (Journal of Chinese Computer Systems) 2012,33(7):1503-1509

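Moving-object indexing is one of the key technologies in the emerging, active field of moving-object databases. To address the tedious and complex data processing this technology involves, this paper proposes DSM_MSMON, a moving-object indexing method built on DSM that manages moving-object information in parallel in a distributed system and supports update and query operations. DSM_MSMON unifies the memory management strategies for single-machine and multi-machine settings and addresses the main problems of DSM systems, including data location, consistency maintenance, load balancing, and scalability, which effectively improves the efficiency of moving-object indexing. The paper presents the design and model of DSM_MSMON and analyzes its key techniques and program flow. Experimental results show that the method outperforms the MSMON structure.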

7.

Characterizing distributed shared memory performance: a case study of the Convex SPP1000

Abandah G.A. Davidson E.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(2):206-216

In a distributed shared memory (DSM) multiprocessor, the processors cooperate in solving a parallel application by accessing the shared memory. The latency of a memory access depends on several factors, including the distance to the nearest valid data copy, data sharing conditions, and traffic of other processors. To provide a better understanding of DSM performance and to support application tuning and compiler development for DSM systems, this paper extends microbenchmarking techniques to characterize the important aspects of a DSM system. We present an experiment-based methodology for characterizing the memory, communication, scheduling, and synchronization performance, and apply it to the Convex SPP1000. We present carefully designed microbenchmarks to characterize the performance of the local and remote memory, producer-consumer communication involving two or more processors, and the effects on performance when multiple processors contend for utilization of the distributed memory and the interconnection network.

8.

Towards implementation of a novel scheme for data prefetching on distributed shared memory systems

Hsiao-Hsi Wang Kuan-Ching Li Ssu-Hsuan Lu Chun-Chieh Yang 《The Journal of supercomputing》2009,47(2):111-126

High speed networks and rapidly improving microprocessor performance make the network of workstations an extremely important tool for parallel computing in order to speed up the execution of scientific applications. Shared memory is an attractive programming model for designing parallel and distributed applications, where the programmer can focus on algorithmic development rather than data partition and communication. Based on this important characteristic, systems that provide the shared memory abstraction on physically distributed memory machines have been developed, known as Distributed Shared Memory (DSM). DSM is built using specific software to combine a number of computer hardware resources into one computing environment. Such an environment not only provides an easy way to execute parallel applications, but also combines available computational resources with the purpose of speeding up execution of these applications. DSM systems need to maintain data consistency in memory, which usually leads to communication overhead. Therefore, there exist a number of strategies that can be used to overcome this overhead issue and improve overall performance. Strategies such as prefetching have been shown to perform well in DSM systems, since they can reduce data access communication latencies from remote nodes. On the other hand, these strategies also transfer unnecessary prefetching pages to remote nodes. In this research paper, we focus on the access pattern during execution of a parallel application, and then analyze the data type and behavior of parallel applications. We propose an adaptive data classification scheme to improve prefetching strategy with the goal to improve overall performance. The adaptive data classification scheme classifies data according to the accessing sequence of pages, so that the home node uses past history access patterns of remote nodes to decide whether it needs to transfer related pages to remote nodes. From experimental results, we can observe that our proposed method can increase the accuracy of data access in effective prefetch strategy by reducing the number of page faults and misprefetching. Experimental results using our proposed classification scheme show a performance improvement of about 9–25% over the same benchmark applications running on top of an original JIAJIA DSM system.

Kuan-Ching Li (Corresponding author)
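The following sketch illustrates the general idea of a home node using the past access history of remote nodes to decide whether to send a related page along with the requested one. It is a simplified stand-in, not the paper's adaptive data classification scheme: the class `HistoryPrefetcher`, the successor-count table, and the `min_count` threshold are all assumptions.

```python
from collections import defaultdict


class HistoryPrefetcher:
    """Home-node helper that learns, per remote node, which page
    usually follows which (hypothetical policy)."""

    def __init__(self, min_count=2):
        self.min_count = min_count  # assumed confidence threshold
        self.last_page = {}         # node -> last page it requested
        # (node, page) -> {next page -> times it followed}
        self.successors = defaultdict(lambda: defaultdict(int))

    def on_request(self, node, page):
        """Record the request and return the pages to send:
        the requested page plus at most one prefetched page."""
        prev = self.last_page.get(node)
        if prev is not None:
            self.successors[(node, prev)][page] += 1
        self.last_page[node] = page

        to_send = [page]
        followers = self.successors.get((node, page))
        if followers:
            nxt, count = max(followers.items(), key=lambda kv: kv[1])
            # Prefetch only when the pattern has repeated often enough,
            # to limit useless transfers (misprefetching).
            if count >= self.min_count:
                to_send.append(nxt)
        return to_send
```

For a node that requests pages 10, 11, 10, 11, 10 in turn, the sketch starts sending page 11 together with page 10 on the fifth request, once the 10 to 11 transition has been seen twice. A real system would also have to account for the prefetched page no longer being requested explicitly, which this sketch ignores.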


9.

A shared virtual memory system based on a new cache coherence protocol (total citations: 11; self-citations: 2; citations by others: 9)

胡伟武 (Hu Weiwu), 施巍松 (Shi Weisong), 唐志敏 (Tang Zhimin) 《计算机学报》 (Chinese Journal of Computers) 1999,22(5):467-475

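This paper introduces JIAJIA, a shared virtual memory system based on a new cache coherence protocol. Compared with representative shared virtual memory systems of the time, JIAJIA adopts a NUMA-based architecture that can organize the physical address spaces of multiple machines into a larger shared virtual address space. In addition, JIAJIA implements a new lock-based coherence protocol that maintains coherence through write-notices attached to locks, thereby avoiding the storage overhead and system complexity that the directory causes in traditional directory protocols. Using some widely used …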
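A lock-based protocol that carries write-notices on the lock can be sketched roughly as follows. This is a single-process toy, not JIAJIA's implementation: the classes `Lock` and `Node`, the `home` dictionary standing in for home memory, and the choice to flush modified pages to the home at release time are assumptions made for the example.

```python
class Lock:
    def __init__(self):
        # Pages modified in critical sections guarded by this lock.
        self.write_notices = set()


class Node:
    def __init__(self):
        self.cache = {}     # page -> locally cached value
        self.dirty = set()  # pages written since the last release

    def read(self, page, home):
        if page not in self.cache:
            self.cache[page] = home[page]  # fetch from the page's home
        return self.cache[page]

    def write(self, page, value):
        self.cache[page] = value
        self.dirty.add(page)

    def acquire(self, lock):
        # Invalidate every cached page named in the lock's write-notices;
        # no directory of sharers is consulted.
        for page in lock.write_notices:
            self.cache.pop(page, None)

    def release(self, lock, home):
        # Assumed: push modifications to the home copy, then attach
        # write-notices for them to the lock.
        for page in self.dirty:
            home[page] = self.cache[page]
        lock.write_notices |= self.dirty
        self.dirty.clear()
```

In this sketch a node that has not acquired the lock may keep reading a stale cached copy. It sees the new value only after its next acquire, which is where the write-notices take effect.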

10.

Indigo: user-level support for building distributed shared abstractions

Prince Kohli Mustaque Ahamad Karsten Schwan 《Concurrency and Computation》1998,10(1):1-29

Distributed systems that consist of workstations connected by high performance interconnects offer computational power comparable to moderate size parallel machines. Middleware like distributed shared memory (DSM) or distributed shared objects (DSO) attempts to improve the programmability of such hardware by presenting to application programmers interfaces similar to those offered by shared memory machines. This paper presents the portable Indigo data sharing library which provides a small set of primitives with which arbitrary shared abstractions are easily and efficiently implemented across distributed hardware platforms. Sample shared abstractions implemented with Indigo include DSM as well as fragmented objects, where the object state is split across different machines and where interfragment communications may be customized to application-specific consistency needs. The Indigo library's design and implementation are evaluated on two different target platforms: a workstation cluster and an IBM SP2 machine. As part of this evaluation, a novel DSM system and consistency protocol are implemented and evaluated with several high performance applications. Application performance attained with the DSM system is compared to the performance experienced when utilizing the underlying basic message-passing facilities or when employing Indigo to construct customized fragmented objects implementing the application's shared state. Such experimentation results in insights concerning the efficient implementation of DSM systems (e.g. how to deal with false sharing). It also leads to the conclusion that Indigo provides a sufficiently rich set of abstractions for efficient implementation of the next generation of parallel programming models for high performance machines. © 1998 John Wiley & Sons, Ltd.

11.

Architectural Support and Mechanisms for Object Caching in Dynamic Multithreaded Computations

《Journal of Parallel and Distributed Computing》1999,58(2):260-300

High-level parallel programming models supporting dynamic fine-grained threads in a global object space are becoming increasingly popular for expressing irregular applications based on sophisticated adaptive algorithms and pointer-based data structures. However, implementing these multithreaded computations on scalable parallel machines poses significant challenges, particularly with respect to object caching. Object caching techniques must be able to tolerate unresponsive processors and protocol handler occupancy delays. This paper examines whether these challenges can be offset by leveraging responsive general-purpose communication architectural features (such as remote memory access and atomic operations), possibly compensating for the lack of more sophisticated hardware primitives by relying upon increased involvement of the run-time system and the compiler. A detailed performance analysis of four irregular applications, using the Illinois Concert System on the Cray T3D and the SGI Origin 2000, finds that existing software distributed shared memory (DSM) systems are capable of delivering good performance only in the presence of a high level of responsive communication architecture support (specifically, support for remote atomic operations). Recognizing that this situation stems from the synchronous request–reply nature of DSM protocols, we present a composable object caching framework, called view caching, which exploits knowledge of application data access semantics to construct custom protocols that require reduced processor synchronization. View caching protocols are more tolerant to responsiveness and occupancy delays and are able to exploit even lower level responsive communication primitives (such as nonatomic remote memory accesses) for a performance benefit.

12.

Data pre-sending in distributed shared memory systems (total citations: 3; self-citations: 0; citations by others: 3)

谢向辉 (Xie Xianghui), 韩承德 (Han Chengde), 唐志敏 (Tang Zhimin) 《计算机学报》 (Chinese Journal of Computers) 1999,22(3):241-248

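The latency of remote data access has become the biggest obstacle to the development of distributed shared memory systems. It directly affects the efficiency of DSM systems, especially those implemented in software. To understand and analyze data behavior in DSM systems, the paper proposes a new structural model of distributed shared memory and, on that basis, a technique called "data pre-sending". The technique aims to reduce the number of communications in DSM and improve tolerance of remote access latency by narrowing the semantic gap of data between different levels of the system. The paper describes the principle and implementation of data pre-sending. Tests on a prototype system …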

13.

High Performance Software Coherence for Current and Future Architectures

《Journal of Parallel and Distributed Computing》1995,29(2):179-195

Shared memory provides an attractive and intuitive programming model for large-scale parallel computing, but requires a coherence mechanism to allow caching for performance while ensuring that processors do not use stale data in their computation. Implementation options range from distributed shared memory emulations on networks of workstations to tightly coupled fully cache-coherent distributed shared memory multiprocessors. Previous work indicates that performance varies dramatically from one end of this spectrum to the other. Hardware cache coherence is fast, but also costly and time-consuming to design and implement, while DSM systems provide acceptable performance on only a limited class of applications. We claim that an intermediate hardware option, memory-mapped network interfaces that support a global physical address space without cache coherence, can provide most of the performance benefits of fully cache-coherent hardware, at a fraction of the cost. To support this claim we present a software coherence protocol that runs on this class of machines, and use simulation to conduct a performance study. We look at both programming and architectural issues in the context of software and hardware coherence protocols. Our results suggest that software coherence on NCC-NUMA machines is a more cost-effective approach to large-scale shared-memory multiprocessing than either pure distributed shared memory or hardware cache coherence.

14.

Robust wireless sharing of internet video streams

Steven Nichols Yu Zhang Kien A. Hua 《Multimedia Systems》2013,19(1):65-76

The gateways are the performance bottleneck of wireless mesh access networks and thus alleviating stress on them is essential to making such wireless networks robust and scalable. Using proxy servers or wireless peer-to-peer streaming techniques can help reduce the gateway load. However, these techniques, because they are data caching methods, do not save wireless resources. We instead consider a communication-sharing approach in this paper. Traditional stream sharing solutions depend on cooperation with the video server. However, in the wireless access network it is difficult to cooperate with online video sites. To address this problem in wireless mesh access networks, we propose a distributed video sharing technique called Dynamic Stream Merging (DSM). DSM is able to improve the robustness of the access network without cooperation from the online video site or the users and has the intelligence to handle sudden spikes in demand for certain videos due to specific events, thereby preventing adverse effects to other daily wireless traffic. The technique can also leverage the 80:20 data access pattern, common for many video applications, to substantially increase the service throughput. We explain the DSM technique, present the system prototype, and discuss the experimental results.

15.

Integrated Memory Controllers with Parallel Coherence Streams

Chaudhuri M. Heinrich M. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1159-1173

Previous work in scalable hardware distributed shared memory (DSM) multiprocessors has established the critical and dominant role that protocol processing bandwidth (or its inverse, occupancy) plays in determining overall performance in architectures with standalone memory/coherence controllers. However, with recent architectural trends toward integrated (on-chip) memory controllers and the well-known fact that processor frequency is increasing more rapidly than memory systems, we must ask whether parallel coherence processing engines (either multiple integrated protocol processors/cores or multiple protocol threads) are needed in DSM machines constructed from modern processor architectures and, if so, when. We construct a useful analytical model to give the designer insight into when parallel coherence streams will improve performance and verify our model via detailed simulation on 64-threaded microbenchmarks and parallel applications and on single-node multiprogrammed workloads. Surprisingly, and contrary to related work, we find that, in these architectures, adding a second coherence engine has almost no impact on performance. Further, for less-tuned applications that suffer from hot spots (contentious requests to the same memory line), additional engines offer no benefit whatsoever. Even with double the memory bandwidth (or channels), an additional coherence processing stream yields only slight performance improvement. Only for a special class of DSM machines employing directoryless broadcast protocols over unordered interconnects does parallel "snoop" processing offer reasonable performance improvement for communication-intensive applications. Overall, given the architectural trends, this is good news for DSM designers who want to minimize the resources necessary (protocol threads or integrated protocol processor cores for maintaining internode coherence, respectively) to create SMTp-based or multi-CMP-based scalable DSM machines using directory protocols.

16.

Transparent adaptation of sharing granularity in MultiView‐based DSM systems

Nitzan Niv Assaf Schuster 《Software》2001,31(15):1439-1459

In this paper we propose a mechanism that provides distributed shared memory (DSM) systems with a flexible sharing granularity. The size of the shared memory units is dynamically determined by the system during runtime. This size can range from that of a single variable up to the size of the entire shared memory space. During runtime, the DSM transparently adapts the granularity to the memory access pattern of the application in each phase of its execution. This adaptation, called ComposedView, provides efficient data sharing in software DSM while preserving sequential consistency. Neither complex code analysis nor annotation by the programmer or the compiler is required. Our experiments indicate a substantial performance boost (up to 80% speed‐up improvement) when running a large set of applications using our method, compared to running these benchmark applications with the best fixed granularity. Copyright © 2001 John Wiley & Sons, Ltd.

17.

A VLSI implementation of an architecture for applicative programming

《Future Generation Computer Systems》1988,4(3):245-254

The Applicative Programming System Architecture contains a novel Data Structure Memory (DSM) which supports fast access operations on compact linear data structures. Several problems that arise in implementations of applicative and functional programming languages can be solved efficiently using special data representations on the DSM. Each memory word in the DSM contains a very small local processor, and there is also a tree-structured communications network within the DSM. Therefore the DSM is a massively parallel SIMD machine. This paper describes a VLSI implementation of the DSM architecture and compares its performance with implementations on a conventional sequential computer and the NASA Massively Parallel Processor.

18.

A comparison of message-passing and shared-memory programming models on the JIAJIA system

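曾丽芳 (Zeng Lifang), 杨学军 (Yang Xuejun), 黄春 (Huang Chun), 赵克佳 (Zhao Kejia), 曾劲松 (Zeng Jinsong) 《计算机工程》 (Computer Engineering) 2002,28(10):102-104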

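To study OpenMP implementations on software DSM systems, this paper takes a representative class of user applications as an example and measures the speedup of two implementations on the JIAJIA system. The first uses the message-passing system calls provided by JIAJIA to build an MPI-like version (approach 1). The second replaces the message-passing calls with reads and writes of a shared array by multiple processors (approach 2). The tests show that on systems with a small number of processors the two approaches are comparable, but as the number of processors grows, the performance of the shared-memory application drops sharply. Analysis of the results and further testing of the application show that the time in approach 2 is spent mainly on the large number of small messages caused by coherence processing and page-fault handling. The tests indicate that JIAJIA shared-memory programs generally impose a heavier network load than MPI programs. Building a practical shared parallel computing environment on top of JIAJIA shared memory, especially one that supports shared-memory programming languages such as OpenMP, requires further work.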

19.

JIACKPT: a recoverable software DSM system

章隆兵 (Zhang Longbing), 张福新 (Zhang Fuxin), 胡伟武 (Hu Weiwu), 唐志敏 (Tang Zhimin) 《软件学报》 (Journal of Software) 2005,16(2):165-173

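Software DSM (distributed shared memory) systems build a shared-memory programming environment on clusters, combining the ease of programming of shared memory with the scalability of clusters, and have attracted extensive research. Because a software DSM system is a distributed system, its risk of failure is high, and fault tolerance is needed to make it practical. Using user-level checkpointing, we design and implement JIACKPT (JIAjia with ChecKPoinTing), a recoverable and highly portable software DSM system built on JIAJIA, a software DSM system that supports the scope consistency model. By adopting a strong globally consistent state suited to software DSM systems, together with several optimizations, JIACKPT is easy to implement and performs well. Application tests on an 8-node PC cluster show that, even when a checkpoint is taken every minute, the checkpointing overhead of most applications is below 10%. JIACKPT is also highly portable. Together these results indicate that JIACKPT has become a fairly practical system.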
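To give a feel for the figures quoted above, the following back-of-the-envelope model relates checkpoint interval and overhead. It assumes that overhead is the time added by checkpointing divided by the checkpoint-free running time, and that every checkpoint costs the same. Both are assumptions of this example, since the abstract does not define how overhead was measured, and the function names are hypothetical.

```python
def checkpoint_overhead(interval_s, cost_s):
    """Relative overhead when one checkpoint of cost_s seconds is taken
    for every interval_s seconds of useful execution (assumed model)."""
    return cost_s / interval_s


def max_checkpoint_cost(interval_s, overhead_limit):
    """Largest per-checkpoint cost, in seconds, that keeps the relative
    overhead at or below overhead_limit under the same model."""
    return interval_s * overhead_limit
```

Under this model, an overhead below 10% at one checkpoint per minute corresponds to a per-checkpoint cost of under about 6 seconds (`max_checkpoint_cost(60, 0.10)`).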

20.

Design and Implementation of an Agent Home Scheme Strategy for Prefetch-Based DSM Systems

Hsiao-Hsi Wang Kuan-Ching Li Ssu-Hsuan Lu Chun-Chieh Yang Jean-Luc Gaudiot 《International journal of parallel programming》2008,36(6):521-542

In recent years, cluster computing has been widely investigated and there is no doubt that it can provide a cost-effective computing infrastructure by aggregating computational power, communication, and storage resources. Moreover, it is also considered to be a very attractive platform for low-cost supercomputing. Distributed shared memory (DSM) systems utilize the physical memory of each computing node interconnected in a private network to form a global virtual shared memory. Since this global shared memory is distributed among the computing nodes, accessing the data located in remote computing nodes is an absolute necessity. However, this action will result in significant remote memory access latencies which are major sources of overhead in DSM systems. For these reasons, in order to increase overall system performance and decrease this overhead, a number of strategies have been devised. Prefetching is one such approach which can reduce latencies, although it always increases the workload in the home nodes. In this paper, we propose a scheme named Agent Home Scheme. Its most noticeable feature, when compared to other schemes, is that the agent home distributes the workload of each computing node when sending data. By doing this, we can reduce not only the workload of the home nodes by balancing the workload for each node, but also the waiting time. Experimental results show that the proposed method can obtain about 20% higher performance than the original JIAJIA, about 18% more than History Prefetching Strategy (HPS), and about 10% higher than Effective Prefetch Strategy (EPS).