Similar literature
1.
Cache-only memory access (COMA) multiprocessors support scalable coherent shared memory with a uniform memory access programming model. The local portion of shared memory associated with a processor is organized as a cache. This cache-based organization of memory results in long remote memory access latencies. Latency-hiding mechanisms can reduce effective remote memory access latency by making data present in a processor's local memory by the time the data are needed. In this paper we study the effectiveness of latency-hiding mechanisms on the KSR2 multiprocessor in improving the performance of three programs. The communication patterns of each program are analyzed and the mechanisms for latency hiding are applied. Results from a 52-processor system indicate that these mechanisms hide a significant portion of the latency of remote memory accesses. The results also quantify benefits in overall application performance. An earlier version of this paper was presented at the 1995 International Conference on Parallel Processing Techniques and Applications.

2.
The success of large-scale, hierarchical and distributed shared memory systems hinges on our ability to reduce delays resulting from remote accesses to shared data. To facilitate this, we present a compile-time algorithm for analyzing programs with doall-style parallelism to determine when read and write accesses to shared data are redundant (unnecessary). Once identified, redundant remote accesses can be replaced by local accesses or eliminated entirely. This optimization improves program performance in two ways. First, slow memory accesses are replaced by faster ones. Second, the time to perform other remote memory accesses may be reduced as a result of the decreased traffic level. We also show how the information obtained through redundancy analysis can be used for other compiler optimizations such as prefetching and cache management.

3.
We study the performance benefits of speculation in a release consistent software distributed shared memory system. We propose a new protocol, speculative home-based release consistency (SHRC), that speculatively updates data at remote nodes to reduce the latency of remote memory accesses. Our protocol employs a predictor that uses patterns in past accesses to shared memory to predict future accesses. We have implemented our protocol in a release consistent software distributed shared memory system that runs on commodity hardware. We evaluate our protocol implementation using eight software distributed shared memory benchmarks and show that it can result in significant performance improvements.

4.
A. Chin, Algorithmica, 1994, 12(2-3): 170-181
Consider the problem of efficiently simulating the shared-memory parallel random access machine (PRAM) model on massively parallel architectures with physically distributed memory. To prevent network congestion and memory bank contention, it may be advantageous to hash the shared memory address space. The decision on whether or not to use hashing depends on (1) the communication latency in the network and (2) the locality of memory accesses in the algorithm. We relate this decision directly to algorithmic issues by studying the complexity of hashing in the Block PRAM model of Aggarwal, Chandra, and Snir, a shared-memory model of parallel computation which accounts for communication locality. For this model, we exhibit a universal family of hash functions having optimal locality. The complexity of applying these hash functions to the shared address space of the Block PRAM (i.e., by permuting data elements) is asymptotically equivalent to the complexity of performing a square matrix transpose, and this result is best possible for all pairwise independent universal hash families. These complexity bounds provide theoretical evidence that hashing and randomized routing need not destroy communication locality, addressing an open question of Valiant. This work was started when the author was a student at Oxford University, supported by a National Science Foundation Graduate Fellowship and a Rhodes Scholarship. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation or the Rhodes Trust.

5.
Hash tables are widely used in network applications, as they can achieve O(1) query, insert, and delete operations at moderate loads. However, at high loads, collisions are prevalent in the table, which increases the access time and induces non-deterministic performance. Slow rates and non-determinism can considerably hurt the performance and scalability of hash tables in multi-threaded parallel systems such as ASIC/FPGA and multi-core platforms, so it is critical to keep hash operations fast and deterministic. This paper presents a novel fast collision-free hashing scheme using Discriminative Bloom Filters (DBFs) to achieve fast and deterministic hash table lookup. A DBF is a compact summary stored in on-chip memory. It is composed of an array of parallel Bloom filters organized by discriminator. Each element lookup performs parallel membership checks on the on-chip DBF to produce a possible discriminator value. Then, the element plus the discriminator value is hashed to a possible bucket in an off-chip hash table to validate the match. This DBF-based scheme requires one off-chip memory access per lookup and uses less off-chip memory. Experiments show that our scheme achieves up to an 8.5-fold reduction in the number of off-chip memory accesses per lookup compared with previous schemes.
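
The lookup path described in this abstract can be sketched roughly as follows. This is only one possible reading of the abstract, not the paper's design: the class names `BloomFilter` and `DBFTable`, the filter size, the number of discriminators, and the SHA-256-based hash helper `_h` are all assumptions made for illustration, and the "parallel" membership checks are simply run in a loop.

```python
import hashlib


def _h(data: str, salt: str, modulus: int) -> int:
    """Hypothetical helper: deterministic salted hash reduced to [0, modulus)."""
    digest = hashlib.sha256(f"{salt}:{data}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class BloomFilter:
    """Small Bloom filter kept in an integer bit array (stands in for on-chip memory)."""

    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes, self.array = bits, hashes, 0

    def add(self, item):
        for i in range(self.hashes):
            self.array |= 1 << _h(item, f"bf{i}", self.bits)

    def __contains__(self, item):
        return all((self.array >> _h(item, f"bf{i}", self.bits)) & 1
                   for i in range(self.hashes))


class DBFTable:
    """One Bloom filter per discriminator value, plus a one-entry-per-bucket table."""

    def __init__(self, num_buckets=64, num_discriminators=4):
        self.filters = [BloomFilter() for _ in range(num_discriminators)]
        self.buckets = [None] * num_buckets  # stands in for the off-chip table

    def _bucket(self, key, discriminator):
        return _h(key, f"d{discriminator}", len(self.buckets))

    def insert(self, key, value):
        # Pick the first discriminator whose bucket is free, then record the
        # choice by adding the key to that discriminator's Bloom filter.
        for d, bloom in enumerate(self.filters):
            slot = self._bucket(key, d)
            if self.buckets[slot] is None:
                self.buckets[slot] = (key, value)
                bloom.add(key)
                return True
        return False  # no collision-free placement found in this sketch

    def lookup(self, key):
        # Membership checks propose discriminators; the bucket read validates.
        for d, bloom in enumerate(self.filters):
            if key in bloom:
                entry = self.buckets[self._bucket(key, d)]
                if entry is not None and entry[0] == key:
                    return entry[1]
        return None
```

In this sketch a Bloom-filter false positive can cost an extra bucket read before the match is validated, so it does not by itself reproduce the one-access-per-lookup property reported in the abstract.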

6.
This paper presents an efficient, writer-based logging scheme for recoverable distributed shared memory systems, in which logging of a data item is performed by its writer process, instead of every process that accesses the item logging it. Since the writer process maintains the log of data items, volatile storage can be used for logging. Only the readers' access information needs to be logged into the stable storage of the writer process to tolerate multiple failures. Moreover, to reduce the frequency of stable logging, only the data items accessed by multiple processes are logged with their access information when the items are invalidated, and also semantic-based optimization in logging is considered. Compared with the earlier schemes in which stable logging was performed whenever a new data item was accessed or written by a process, the size of the log and the logging frequency can be significantly reduced in the proposed scheme.

7.
Both hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in shared-memory multiprocessors; however, both types of prefetching have their shortcomings. While software schemes require less hardware support than hardware schemes, they must generate address calculation instructions and a prefetch instruction for each datum that needs to be prefetched. Hardware schemes, however, must become progressively more complex to be able to compute data access strides and to increase the prefetching lookahead. In this paper, we propose an integrated hardware/software prefetching method that uses simple hardware that can handle most data accesses and software prefetching for the few remaining accesses. A compile time algorithm analyzes the access streams formed by array references and determines sequences of consecutive memory accesses to an access stream that can be prefetched by the hardware mechanism. This analysis is based on the relative memory locations of consecutive accesses to an access stream and the number of intervening data references between consecutive accesses to an access stream. In addition, the prefetching lookahead can be set separately for each access stream. Our approach yields an effective scheme that minimizes both CPU overhead and hardware costs. Execution-driven simulations show our method to be very effective.

8.
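An Operating-System-Centric Memory Consistency Model: The Thread Consistency Model   Total citations: 3, self-citations: 0, citations by others: 3
To guarantee correct program execution, a distributed shared memory system must use a memory consistency model to constrain the order of accesses to shared memory, but existing models fall short in scalability and in operating-system-level implementation. Drawing on the characteristics of multithreading, this paper proposes an operating-system-centric thread consistency model, which observes and constrains the correct order of memory access events through the changes in thread state during the execution of a parallel program. This benefits system scalability, the convenience and completeness of obtaining consistency-maintenance information, and the design and implementation of the operating system itself. The paper covers the model's definition, correctness proof, implementation scheme, and performance analysis.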

9.
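In a blockchain-based traceability system for supply chain management, the blockchain is a distributed system in which every node keeps a copy of the data stored on the chain. Storing traceability data directly on the chain therefore consumes a large amount of storage, raises the maintenance cost of the traceability system, and slows its response. To address this, an off-chain extended storage scheme is proposed. The scheme first exploits the one-way property of the SHA-256 hash algorithm to hash the plaintext data into a hash value, then signs the hash value with a private key generated by the SM2 algorithm to ensure the reliability of the uploader's identity, and finally stores the hash value and the signature on the blockchain through a smart contract, while the plaintext data and the on-chain address of its hash value and signature are stored in a database. By combining the respective strengths of centralized storage and blockchain technology, the scheme both guarantees that traceability data cannot be tampered with and effectively reduces the space that traceability data occupies in the blockchain network. Finally, based on the proposed scheme, the traceability system is designed in detail and implemented on the Ethereum blockchain platform.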
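
The hash-on-chain, plaintext-off-chain flow in this abstract might look roughly like the sketch below. Everything here is a stand-in chosen for illustration: the blockchain is a plain Python list, the database is a dict, and because SM2 is not available in the Python standard library, the signature step is replaced by an HMAC-SHA256 tag, which is a symmetric substitute and not the SM2 signature the paper uses. The function names `store_record` and `verify_record` are hypothetical.

```python
import hashlib
import hmac


def store_record(record_id, plaintext, signing_key, chain, database):
    """Put only the digest and its tag 'on chain'; keep the plaintext off chain."""
    digest = hashlib.sha256(plaintext).hexdigest()
    # Placeholder for the SM2 signature over the hash value (assumption).
    signature = hmac.new(signing_key, digest.encode(), hashlib.sha256).hexdigest()
    chain.append({"hash": digest, "signature": signature})
    address = len(chain) - 1  # stands in for the on-chain storage address
    database[record_id] = {"plaintext": plaintext, "address": address}
    return address


def verify_record(record_id, signing_key, chain, database):
    """Recompute the digest of the off-chain plaintext and compare with the chain."""
    row = database[record_id]
    entry = chain[row["address"]]
    digest = hashlib.sha256(row["plaintext"]).hexdigest()
    expected = hmac.new(signing_key, digest.encode(), hashlib.sha256).hexdigest()
    return digest == entry["hash"] and hmac.compare_digest(expected, entry["signature"])
```

Under these assumptions, any change to the off-chain plaintext makes `verify_record` return `False`, because the recomputed digest no longer matches the value recorded on the simulated chain.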

10.
A discussion is presented of the use of dynamic storage schemes to improve parallel memory performance during three important classes of data accesses: vector accesses in which multiple strides are used to access a single vector, block accesses, and constant-geometry FFT accesses. The schemes investigated are based on linear address transformations, also known as XOR schemes. It has been shown that this class of schemes can be implemented more efficiently in hardware and has more flexibility than schemes based on row rotations or other techniques. Several analytical results are shown. These include: quantitative analysis of buffering effects in pipelined memory systems; design rules for storage schemes that provide conflict-free access using multiple strides, blocks, and FFT access patterns; and an analysis of the effects of memory bank cycle time on storage scheme capabilities.
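
To illustrate what an XOR-based (linear) address transformation does, the sketch below compares plain low-order interleaving with a simple XOR mapping that combines two address bit fields. This is a generic textbook-style example, not one of the specific schemes analyzed in the paper; the function names and the choice of which bit fields to XOR are assumptions.

```python
def interleaved_bank(address, bank_bits):
    """Plain interleaving: the bank is the low-order bank_bits of the address."""
    return address & ((1 << bank_bits) - 1)


def xor_bank(address, bank_bits):
    """XOR scheme: bank = low bit field XOR the next-higher bit field."""
    mask = (1 << bank_bits) - 1
    return (address & mask) ^ ((address >> bank_bits) & mask)


def banks_touched(bank_fn, start, stride, count, bank_bits):
    """Banks hit by `count` accesses starting at `start` with the given stride."""
    return [bank_fn(start + i * stride, bank_bits) for i in range(count)]
```

With 8 banks (`bank_bits=3`), eight accesses at stride 8 all land in bank 0 under plain interleaving but are spread over all eight banks by `xor_bank`, while stride 1 remains conflict-free. Some strides still collide under this particular mapping (stride 9 sends the same eight accesses to bank 0), which is the kind of case that design rules for multi-stride schemes have to handle.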

11.
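A Software Data Prefetching Strategy Based on the Weak Consistency Model   Total citations: 1, self-citations: 0, citations by others: 1
Dou Yong (窦勇), Zhou Xingming (周兴铭), Journal of Software (《软件学报》), 1997, 8(2): 81-86
To address the long latency of remote accesses in distributed shared memory, this paper proposes a memory access optimization strategy based on the weak ordering consistency model. It mainly uses the information provided by synchronization operations in parallel programs to prefetch, in blocks at synchronization points, the data that will be used. This method effectively hides the long latency of remote memory accesses.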

12.
In this paper, we present a novel hash map-based sparse data structure for Smoothed Particle Hydrodynamics, which allows for efficient neighbourhood queries in spatially adaptive simulations as well as direct ray tracing of fluid surfaces. Neighbourhood queries for adaptive simulations are improved by using multiple independent data structures utilizing the same underlying self-similar particle ordering, to significantly reduce non-neighbourhood particle accesses. Direct ray tracing is performed using an auxiliary data structure, with constant memory consumption, which allows for efficient traversal of the hash map-based data structure as well as efficient intersection tests. Overall, our proposed method significantly improves the performance of spatially adaptive fluid simulations and allows for direct ray tracing of the fluid surface with little memory overhead.

13.
We present compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space. Any type of code motion on explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. The analysis, called cycle detection, is based on work by Shasha and Snir and checks for cycles among interfering accesses. We improve the accuracy of their analysis by using additional information from synchronization analysis, which handles post–wait synchronization, barriers, and locks. We also make the analysis efficient by exploiting the common code image property of SPMD programs. We make no assumptions on the use of synchronization constructs: our transformations preserve program meaning even in the presence of race conditions, user-defined spin locks, or other synchronization mechanisms built from shared memory. However, programs that use linguistic synchronization constructs rather than their user-defined shared memory counterparts will benefit from more accurate analysis and therefore better optimization. We demonstrate the use of this analysis for communication optimizations on distributed memory machines by automatically transforming programs written in a conventional shared memory style into a Split-C program, which has primitives for nonblocking memory operations and one-way communication. The optimizations include message pipelining, to allow multiple outstanding remote memory operations, conversion of two-way to one-way communication, and elimination of communication through data reuse. The performance improvements are as high as 20–35% for programs running on a CM-5 multiprocessor using the Split-C language as a global address layer. Even larger benefits can be expected on machines with higher communication latency relative to processor speed.

14.
In light of the recent shift towards shared-memory systems in parallel explicit model checking, we explore the relative advantages and disadvantages of shared versus private hash tables. Since the use of shared state storage allows for techniques unavailable in distributed memory, these are evaluated, both theoretically and practically, in a prototype implementation. Experimental data are presented to assess the practical utility of those techniques, compared to static partitioning of the state space, which is more traditional in distributed memory algorithms.

15.
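Medical data matter greatly to patients and facilitate treatment across institutions. To share medical data securely and efficiently, a patient-centric medical data sharing model is proposed. Blockchain technology is used to build a trusted network environment, and chameleon hash functions are used to link blocks, making the data editable. All of a patient's medical data are stored in a single block, eliminating data fragmentation. With this sharing model, medical data can be shared securely and efficiently.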

16.
Everlasting demands for solutions to ever growing computation problems and demands for efficient means to manage and utilize sophisticated information have caused an increase in the amount of data necessary to handle a job, while drastic reduction in CPU prices is encouraging massive parallel architectures for gigantic data processing. These trends are increasing the importance of a large shared buffer memory with 10^3~10^4 simultaneously accessible ports. This paper proposes a multiport page buffer architecture that allows 10^3~10^4 concurrent accesses and causes neither access conflicts nor suspension. It consists of a set of memory banks and multistaged switching networks with controllers that control each row of the networks. Consecutive words in each page are stored orthogonally across banks. Memory interleaving may be applied to improve access rate in consecutive retrievals. When used as a disk cache memory, it decreases the number of disk accesses and increases both the page transfer rate and the maximum number of concurrent page accesses.

17.
In a Multi-Processor System-on-a-Chip (MPSoC) based on Network-on-Chip (NoC), which processes massive data in a distributed fashion, communication is concentrated on shared memory. This paper proposes an assignment algorithm that can minimize the total power consumption for data communication in executing application programs and a switch structure that can reduce communication congestion resulting from simultaneous accesses to the shared memory. The proposed assignment algorithm gives higher priority to the tasks transferring a larger amount of data to shared memory, so that these tasks can be assigned to the PEs close to shared memory. The proposed switch structure was designed to support multi-port memory, which is often used for shared memory. The ports of the proposed switch are dedicated to connecting with the in/out ports of shared memory in order to increase communication bandwidth between PEs and shared memories. By adopting the proposed scheme, the congestion caused by the concentrated requests to the memory can be reduced. Experimental results show that power consumption for transferring data in High-Definition (HD) H.264 decoder, Motion-JPEG decoder, MP3 decoder and 2D Wavelet transform codes has been reduced by 23.9% on average, when compared with the cases of applying the well-known FC, BN and SA algorithms. The area has been slightly increased by 1.7% compared to conventional NoC structures.

18.
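Zhang Guobing (张国兵), Zeng Wu (曾武), Huang Hao (黄皓), Journal of Computer Applications (《计算机应用》), 2005, 25(12): 2742-2744
In a network-processor-based firewall, the large number of memory accesses limits the processing speed for high-speed network flows. The hash table is an important data structure in firewalls; when collisions are resolved by separate chaining, the average number of memory accesses per lookup is proportional to the length of the corresponding chain. Splitting one chain into two shortens the chains and reduces the total number of memory accesses, thereby improving system performance. This paper introduces a method that handles hash table collisions with two chains, analyzes its effect on performance, and presents a concrete design and implementation using the IXP2400 network processor as an example.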
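
The idea of splitting each collision chain in two can be sketched as below. This is an illustrative guess at the mechanism rather than the paper's IXP2400 implementation: the class name `TwoChainHashTable`, the use of non-negative integer keys, and the rule that one extra bit of the key selects the chain are all assumptions. The probe count returned by `lookup` stands in for the number of memory accesses.

```python
class TwoChainHashTable:
    """Separate chaining where every bucket holds two chains instead of one."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        self.buckets = [([], []) for _ in range(num_buckets)]

    def _chain(self, key):
        # Assumption: keys are non-negative integers. The bucket comes from
        # key mod num_buckets; the next bit of the quotient picks the chain.
        bucket = key % self.num_buckets
        selector = (key // self.num_buckets) & 1
        return self.buckets[bucket][selector]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # overwrite an existing entry
                return
        chain.append((key, value))

    def lookup(self, key):
        """Return (value, probes); only one of the two chains is walked."""
        probes = 0
        for existing_key, value in self._chain(key):
            probes += 1
            if existing_key == key:
                return value, probes
        return None, probes
```

For example, with 4 buckets the keys 0, 4, 8, and 12 all collide in bucket 0, but they split into chains `[0, 8]` and `[4, 12]`, so finding key 12 takes 2 probes instead of the 4 a single chain in insertion order would need.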

19.
Building a high performance IP packet forwarding (PF) engine remains a challenge due to increasingly stringent throughput requirements and the growing size of IP forwarding tables. The router has to match the incoming packet's IP address against all entries in the forwarding table. The matching process has to be done at increasingly higher wire speed; hence, scalability and low power consumption are critical for PF engines. Various hash table based schemes have been considered for use in PF engines. Set associative memory can be used for hardware implementations of hash tables with the property that each bucket of a hash table can be searched in a single memory cycle. However, the classic hashing downsides, such as collisions and worst case memory access time, have to be dealt with. While open addressing hash tables, in general, provide good average case search performance, their memory utilization and worst case performance can degrade quickly due to collisions (that lead to bucket overflows). The two standard solutions to the overflow problem are either to use predefined probing (e.g., linear or quadratic probing) or to use multiple hash functions. This work presents two new simple hash schemes that extend both aforementioned solutions to tackle the overflow problem efficiently. The first scheme is a hash probing scheme that is called Content-based HAsh Probing (CHAP). As the name suggests, CHAP, based on the content of the hash table, avoids the classical side effects of predefined hash probing methods (i.e., primary and secondary clustering phenomena) and at the same time reduces the overflow. The second scheme, called Progressive Hashing (PH), is a general multiple hash scheme that reduces the overflow as well. The basic idea of PH is to split the prefixes into groups where each group is assigned one hash function, then reuse some hash functions in a progressive fashion to reduce the overflow. Both schemes are amenable to high-performance hardware implementations with low overflow and constant worst-case memory access time. We show by experimenting with real IP lookup tables and synthetic traces that both schemes outperform other existing hashing schemes.

20.
The Stanford Dash multiprocessor   Total citations: 2, self-citations: 0, citations by others: 2
The overall goals and major features of the directory architecture for shared memory (Dash) are presented. The fundamental premise behind the architecture is that it is possible to build a scalable high-performance machine with a single address space and coherent caches. The Dash architecture is scalable in that it achieves linear or near-linear performance growth as the number of processors increases from a few to a few thousand. This performance results from distributing the memory among processing nodes and using a network with scalable bandwidth to connect the nodes. The architecture allows shared data to be cached, significantly reducing the latency of memory accesses and yielding higher processor utilization and higher overall performance. A distributed directory-based protocol that provides cache coherence without compromising scalability is discussed in detail. The Dash prototype machine and the corresponding software support are described.
