Similar literature
 Found 20 similar documents; search time 242 ms
1.
This paper presents the comparison of the COMOPS benchmark performance in MPI and shared memory on four different shared memory platforms: the DEC AlphaServer 8400/300, the SGI Power Challenge, the SGI Origin2000, and the HP-Convex Exemplar SPP1600. The paper also qualitatively analyzes the obtained performance data based on an understanding of the corresponding architecture and the MPI implementations. Some conclusions are made for the inter-processor communication performance on these four shared memory platforms.

2.
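Chip multiprocessors (CMPs) have become the direction of processor development, and the focus of processor design has shifted to the interconnection network and the memory hierarchy. A key problem is how to keep the caches at every level of every processor coherent. In traditional shared-memory multiprocessors this problem is solved with cache coherence protocols, but a CMP has higher on-chip interconnect bandwidth and speed than a traditional multiprocessor, which places new demands on cache coherence protocols and also offers new opportunities for improvement. Traditional bus snooping protocols scale poorly and generate unnecessary broadcasts and excessive snoops, while directory protocols suffer from long indirection latency on misses, high complexity, and difficult verification. A ring interconnect scales better than a bus, and its implementation complexity is far lower than that of the packet-switched point-to-point networks typically used by directory protocols. This paper applies a ring-based snooping protocol to CMPs. It exploits the ordering property of the ring to eliminate the retries that conflicts cause in the original protocol, removing possible starvation, deadlock, and livelock, improving the protocol's stability, and reducing message traffic and power consumption. It also exploits the short latency of on-chip interconnects to propagate snoop results together with snoop requests, so that a processor can snoop selectively according to the snoop results, which reduces unnecessary snoop operations and lowers power consumption.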

3.
Recent advances in the development of optical technologies suggest the possible emergence of broadcast-based optical interconnects within cache-coherent distributed shared memory (DSM) multiprocessor architectures. It is well known that the cache-coherence protocol is a critical issue in designing such architectures because it directly affects memory latencies. In this paper, we evaluate via simulation the performance of three directory-based cache-coherence protocols (strict request-response, intervention forwarding and reply forwarding) on the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus), which is a low-latency and high-bandwidth broadcast-based fiber-optic interconnection network supporting DSM. The simulated system contains 64 nodes, each of which has a processor, a cache controller, a directory controller and an output channel. Simulations have been conducted for each protocol to measure average processor utilization, average network latency and average number of packets transferred over the network for varying values of the important DSM parameters such as the ratio of the mean channel service time to mean thread run time (T/R), the probability of a cache block being in modified state {P(M)}, the fraction of write misses {P(W)} and the home node contention rate. The results reveal that for all cases, except for low values of P(M), intervention forwarding gives the worst performance (lowest processor utilization and highest latency). The performance of strict request-response and reply forwarding is comparable for several values of the DSM parameters and contention rate. For a contention rate of 0%, the increase of P(M) makes reply forwarding perform better than strict request-response. The performance of all protocols decreases with the increase of P(W) and contention rate. However, strict request-response is the protocol least affected by the negative impact of increasing P(W) and contention rate. Therefore, in the full-contention case (i.e., a contention rate of 100%), for low values of P(M), or for mid values of P(M) and high values of P(W), strict request-response performs better than reply forwarding. These results are significant in the sense that they give multiprocessor architecture designers insight for comparing the performance of different directory-based cache-coherence protocols on a broadcast-based interconnection network for different values of the DSM parameters and varying rates of contention.

4.
Distributed shared memory (DSM) multiprocessors typically require disjoint networks for deadlock-free execution of cache coherence protocols. This is normally achieved by implementing virtual networks with the help of virtual channels or virtual lanes multiplexed on a single physical network. To keep the coherence protocol simple, messages are usually assigned to virtual lanes in a predefined static manner based on a cycle-free lane assignment dependence graph. However, this static split of virtual networks (such as request and reply networks) may lead to underutilization of certain virtual networks while saturating the other networks. In this paper, we explore different static and dynamic schemes to select the virtual lanes for outgoing messages and mix the load among them without restricting any particular type of message to be carried only by a particular virtual network. We achieve this by exposing the selection algorithms to the coherence protocol itself, so that it can inject messages into selected virtual lanes based on some local information, and still enjoy deadlock-freedom. Our execution-driven simulation on five applications from the SPLASH-2 suite shows that as the system scales, the virtual network selection algorithms play an important role. For 128-node systems, our dynamic selection algorithm speeds up parallel execution by as much as 22 percent over an optimized baseline system running a modified SGI Origin 2000 protocol. We also explore how network latency, the number of message buffers per virtual lane, and the depth of network interface output queues affect the relative performance of various virtual lane selection algorithms.

5.
We consider the problem of implementing transactional memory in large-scale distributed networked systems. We present Spiral, a novel distributed directory-based protocol for transactional memory, and theoretically analyze and experimentally evaluate it for the performance boundaries of this approach from the worst-case perspective. Spiral is designed for the data-flow distributed implementation of software transactional memory, which supports three basic operations: publish, allowing a shared object to be inserted in the directory so that other nodes can find it; lookup, providing a read-only copy of the object to the requesting node; and move, allowing the requesting node to write the object locally after the node gets it. The protocol runs on a hierarchical directory construction based on sparse covers, where clusters at each level are ordered to avoid race conditions while serving concurrent requests. Given a shared object, the protocol maintains a directory path pointing to the object. The basic idea is to use “spiral” paths that grow outward to search for the directory path of the object in a bottom-up fashion. For general networks, this protocol guarantees an \(\mathcal{O}(\log ^2 n\cdot \log D)\) approximation in sequential and one-shot concurrent executions of a finite set of move requests, where \(n\) is the number of nodes and \(D\) is the diameter of the network. It also guarantees poly-log approximation for any single lookup request. Our bounds are deterministic and hold in the worst case. Moreover, this protocol requires only polylogarithmic bits of memory per node. Experimental evaluations in real networks also confirm our theoretical findings. To the best of our knowledge, this is the first deterministic consistency protocol for distributed transactional memory that achieves poly-log approximation in general networks.

6.
The flash memory solid-state disk (SSD) is emerging as a killer application for NAND flash memory due to its high performance and low power consumption. To attain high write performance, recent SSDs use an internal SDRAM write buffer and a parallel architecture that uses interleaving techniques. In such an architecture, a coarse-grained address mapping called superblock mapping is inevitably used to exploit the parallelism. However, superblock mapping shows poor performance for random write requests. In this paper, we propose a novel victim block selection policy for the write buffer that considers the parallel architecture of the SSD. We also propose a multi-level address mapping scheme that supports small-sized write requests while utilizing the parallel architecture. Experimental results show that the proposed scheme improves the I/O performance of the SSD by up to 64% compared to the existing technique.

7.
When using a shared memory multiprocessor, the programmer faces the issue of selecting the portable programming model which will provide the best performance. Even if they restrict their choice to the standard programming environments (MPI and OpenMP), they have to select a programming approach among MPI and the variety of OpenMP programming styles. To help the programmer in their decision, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmark (CG, MG, FT, LU), two dataset sizes (A and B), and two shared memory multiprocessors (IBM SP3 NightHawk II, SGI Origin 3800). We have developed the first SPMD OpenMP version of the NAS benchmark and gathered other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared with MPI for a large set of experimental conditions. Not surprisingly, the two best OpenMP versions are those requiring the strongest programming effort. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences. Copyright © 2005 John Wiley & Sons, Ltd.

8.
As processor technology continues to advance at a rapid pace, the principal performance bottleneck of shared memory systems has become the memory access latency. In order to understand the effects of cache and memory hierarchy on system latencies, performance analysts perform benchmark analysis on existing multiprocessors. In this study, we present a detailed comparison of two architectures, the HP V-Class and the SGI Origin 2000. Our goal is to compare and contrast design techniques used in these multiprocessors. We present the impact of processor design, cache/memory hierarchies and coherence protocol optimizations on the memory system performance of these multiprocessors. We also study the effect of parallelism overheads such as process creation and synchronization on the user-level performance of these multiprocessors. Our experimental methodology uses microbenchmarks as well as scientific applications to characterize the user-level performance. Our microbenchmark results show the impact of L1/L2 cache size and TLB size on uniprocessor load/store latencies, the effect of coherence protocol design/optimizations and data sharing patterns on multiprocessor memory access latencies, and finally the overhead of parallelism. Our application-based evaluation shows the impact of problem size, dominant sharing patterns and number of processors used on speedup and raw execution time. Finally, we use hardware counter measurements to study the correlation of system-level performance metrics and the application’s execution time performance.
A preliminary version of this paper appeared in the 13th ACM International Conference on Supercomputing (ICS’99) (13). This work was done while Iyer and Bhuyan were at Texas A&M. It was supported in part by a Hewlett-Packard Equipment Grant. Amato and Rauchwerger were supported in part by NSF Grants ACI-9872126, EIA-9975018, EIA-0103742, EIA-9805823, ACR-0081510, ACR-0113971, CCR-0113974, EIA-9810937, EIA-0079874, by the DOE ASCI ASAP program, and by the Texas Higher Education Coordinating Board grant ATP-000512-0261-2001. Perdue was supported in part by a Dept. of Education Graduate Fellowship (GAANN).

9.
The rapidly increasing number of cores in modern microprocessors is pushing the current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems - distributed memory across nodes and shared memory with non-uniform memory access within each node - poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems - a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems including an SGI Altix 4700, an IBM p575+ and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.

10.
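When network nodes are attacked, data can leak, so an attack-resistant protocol is needed. This paper proposes an attack-resistant protocol for network node data based on address resolution. A network node distribution model and a channel resolution model are designed; a data forwarding control protocol is designed using a balanced link configuration method for network nodes; and the network nodes are distributed with optimal nodes deployed. Baud-spaced equalization control is applied to the node output channels, and a channel modulation model for network link forwarding is built, so that attack features can be extracted under optimized address resolution. Based on the address resolution results, attacking nodes in a dynamic wireless sensor network are quickly located and attacks are detected. Simulation results show that designing the attack-resistant protocol with this method improves the nodes' packet forwarding capability and throughput, gives good secure data transmission performance and strong attack resistance, and improves network security.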

11.
In this paper we propose a new protocol for reliable multicast in a multihop mobile radio network. The protocol is reliable, i.e., it guarantees message delivery to all multicast nodes even when the topology of the network changes during multicasting. The proposed protocol uses a core-based shared tree. The multicast tree may get fragmented due to node movements. The notion of a forwarding region is introduced, which is used to glue together fragments of multicast trees. The gluing process involves flooding the forwarding region of only those nodes that witness topology change due to node mobility. Delivery of multicast messages to mobile nodes is expedited through (i) pushing the message by witness nodes in their forwarding regions and (ii) pulling messages by a mobile node during the (re)joining process. Hence, the protocol conserves network bandwidth by using a combination of the push–pull approach and by restricting flooding only to the essential parts of the network that are affected by topology change. We develop a theoretical model to compute the probability of packet loss (as a function of the mobility rate) for our proposed scheme compared to the core-based tree protocol (CBT); we also evaluate the effectiveness of forwarding regions as compared to traditional flooding. Our analysis shows that the proposed scheme significantly outperforms CBT.

12.
Flash memory has critical drawbacks such as long latency of its write operation and a short life cycle. In order to overcome these limitations, the number of write operations to flash memory devices needs to be minimized. The B-Tree index structure, which is a popular hard disk based index structure, requires an excessive number of write operations when updating it to flash memory. To address this, it was proposed that another layer that emulates a B-Tree be placed between the flash memory and B-Tree indexes. This approach succeeded in reducing the write operation count, but it greatly increased search time and main memory usage. This paper proposes a B-Tree index extension that reduces both the write count and search time with limited main memory usage. First, we designed a buffer that accumulates update requests per leaf node and then simultaneously processes the update requests of the leaf node carrying the largest number of requests. Second, a type of header information was written on each leaf node. Finally, we made the index automatically control each leaf node size. Through experiments, the proposed index structure resulted in a significantly lower write count and a greatly decreased search time with less main memory usage, than placing a layer that emulates a B-Tree.
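
The first design step in the abstract above (buffer update requests per leaf node, then process the fullest leaf in one go) can be illustrated with a minimal Python sketch. The abstract does not give the paper's data structures, so the class name `LeafUpdateBuffer`, the `capacity` limit, the flush trigger and the `write_leaf` callback are all hypothetical choices made for illustration, not the authors' implementation.

```python
from collections import defaultdict
from typing import Callable, Hashable


class LeafUpdateBuffer:
    """Hypothetical in-memory buffer that batches updates per B-Tree leaf."""

    def __init__(self, capacity: int,
                 write_leaf: Callable[[Hashable, list], None]) -> None:
        self.capacity = capacity      # assumed limit on buffered requests
        self.write_leaf = write_leaf  # assumed callback: one flash write per call
        self.pending: dict = defaultdict(list)  # leaf id -> buffered requests
        self.size = 0                 # total number of buffered requests

    def add(self, leaf_id: Hashable, request) -> None:
        """Buffer one update; flush one leaf if the buffer overflows."""
        self.pending[leaf_id].append(request)
        self.size += 1
        if self.size > self.capacity:
            self.flush_largest()

    def flush_largest(self) -> None:
        """Apply all buffered requests of the leaf that holds the most."""
        if not self.pending:
            return
        leaf_id = max(self.pending, key=lambda k: len(self.pending[k]))
        requests = self.pending.pop(leaf_id)
        self.size -= len(requests)
        self.write_leaf(leaf_id, requests)  # many updates, a single write
```

Flushing the leaf with the most pending requests amortizes one write over as many updates as possible, which is the effect the abstract attributes to its buffer. The sketch assumes the caller already knows which leaf an update belongs to.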

13.
Deep packet inspection using parallel Bloom filters   Cited by: 2 (self-citations: 0, citations by others: 2)
There is a class of packet processing applications that inspect packets deeper than the protocol headers to analyze content. For instance, network security applications must drop packets containing certain malicious Internet worms or computer viruses carried in a packet payload. Content forwarding applications look at the hypertext transport protocol headers and distribute the requests among the servers for load balancing. Packet inspection applications, when deployed at router ports, must operate at wire speeds. With networking speeds doubling every year, it is becoming increasingly difficult for software-based packet monitors to keep up with the line rates. We describe a hardware-based technique using Bloom filters, which can detect strings in streaming data without degrading network throughput. A Bloom filter is a data structure that stores a set of signatures compactly by computing multiple hash functions on each member of the set. This technique queries a database of strings to check for the membership of a particular string. The answer to this query can be a false positive but never a false negative. An important property of this data structure is that the computation time involved in performing the query is independent of the number of strings in the database, provided the memory used by the data structure scales linearly with the number of strings stored in it. Furthermore, the amount of storage required by the Bloom filter for each string is independent of its length.
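
The Bloom filter properties stated in this abstract (several hash functions per member, possible false positives, no false negatives) can be shown with a short software sketch in Python. The paper describes a hardware design with parallel filters, so this is only an illustration under assumptions: the names `BloomFilter` and `scan_payload`, the SHA-256 based double hashing and the single fixed signature length are choices made here, not taken from the paper.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Software sketch of a Bloom filter over byte strings."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes) -> Iterator[int]:
        # Assumption: derive the k bit positions by double hashing
        # two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        """Set every bit position computed for the item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def scan_payload(payload: bytes, bloom: BloomFilter, sig_len: int) -> list[int]:
    """Return offsets whose sig_len-byte window may match a stored signature."""
    return [i for i in range(len(payload) - sig_len + 1)
            if payload[i:i + sig_len] in bloom]
```

A query always touches `num_hashes` bit positions, however many signatures are stored, which matches the abstract's point that query time does not depend on the size of the database. Because a hit can be a false positive, a real system would confirm each reported offset with an exact comparison.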

14.
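Most traditional routing algorithms in mobile ad hoc networks use a dragnet-style blind search, which leads to high routing overhead. To address this problem, a probabilistic forwarding algorithm based on direction prediction is proposed. The algorithm overhears the various packets transmitted in the network and extracts node IDs and time information from them; this information reflects the distance to the destination node. On this basis, it computes each node's forwarding probability and adjusts it adaptively as the network changes, so that the routing process always proceeds in the direction of the destination node, which bounds the search area. Simulation results show that the algorithm's routing overhead is 70% lower than that of flooding and 20% lower than that of the classic probabilistic forwarding algorithm, improving network performance.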

15.
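In recent years, the underwater Internet of Things and the ocean Internet of Things have become hot research directions, and routing protocols for underwater acoustic sensor networks, an important part of the ocean Internet of Things, have also received wide attention from researchers. Building on the HH-VBF protocol, this paper proposes ES-HH-VBF, an energy-saving vector-based forwarding routing protocol for underwater acoustic sensor networks. ES-HH-VBF still uses the position information of the next-hop node as a reference value for computing a node's forwarding factor, and additionally introduces the node's residual energy into that computation in order to balance energy consumption across the network. It also changes the preset distance threshold from the fixed value used in HH-VBF to a dynamic value that varies with the node's residual energy, so that data redundancy can be controlled dynamically. To verify the performance of ES-HH-VBF, HH-VBF and ES-HH-VBF were compared on Aqua-Sim, an underwater sensor network simulator. The simulation results show that, as the nodes' packet-sending interval increases, ES-HH-VBF achieves a packet delivery ratio about 4.2% higher than HH-VBF, an average network delay about 11.3% lower, and an average network energy consumption about 8.2% lower. The simulation experiments indicate that ES-HH-VBF has considerable advantages in raising the packet delivery ratio, lowering average energy consumption, and lowering average delay.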

16.
Most reactive mobile ad hoc network (MANET) routing protocols such as AODV and DSR do not search for new routes until the network topology changes. However, low node mobility does not affect MANET connectivity, so the same routes may be used for a long time. This may concentrate traffic on a few mobile stations (MSs), which results in congestion and hence longer end-to-end delay. In addition, continuous use of MSs may cause their battery power to be exhausted rapidly. Expiration of MS energy disrupts the connections traversing those MSs and could generate many simultaneous new routing requests. Therefore, we propose a load balancing approach called Simple Load Balancing Approach (SLBA), which can be transparently added to any current reactive routing protocol such as AODV and DSR. SLBA minimizes the traffic concentration by allowing each MS to drop RREQs or to give up packet forwarding depending on its own traffic load. Meanwhile, MSs may deliberately give up forwarding packets to save their own energy. To encourage MSs to volunteer in forwarding packets, we introduce a reward scheme for packet forwarding, named Protocol-Independent Fairness Algorithm (PIFA). We compare the performance of AODV and DSR with and without SLBA and PIFA. Simulation results indicate that SLBA can distribute traffic very well and improve the MANET performance. PIFA is also observed to prevent MANET partitioning and any performance degradation due to selfish nodes.

17.
We present pSPADE, a parallel algorithm for fast discovery of frequent sequences in large databases. pSPADE decomposes the original search space into smaller suffix-based classes. Each class can be solved in main-memory using efficient search techniques and simple join operations. Furthermore, each class can be solved independently on each processor requiring no synchronization. However, dynamic interclass and intraclass load balancing must be exploited to ensure that each processor gets an equal amount of work. Experiments on a 12 processor SGI Origin 2000 shared memory system show good speedup and excellent scaleup results.

18.
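A new vehicular network routing protocol based on geographic location information, GDGP, is proposed. GDGP is a packet forwarding mechanism that combines greedy forwarding with directional forwarding. Through two independent message exchange mechanisms, it updates the destination node's position information stored at each node, ensuring that the destination position recorded at the various nodes is consistent and thereby ensuring the reliability of the routing algorithm. Experiments were run on the NS-2 simulation platform with a node mobility pattern close to real vehicle movement, and GDGP was compared with the existing GPSR routing protocol. The simulation results show that the improved routing protocol performs well in urban scenarios.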

19.
Due to advances in fiber optics and VLSI technology, interconnection networks that allow simultaneous broadcasts are becoming feasible. Distributed shared memory (DSM) implementations on such networks promise high performance even for small applications with small granularity. This paper, after summarizing the architecture of one such implementation called the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus), presents simple algorithms for improving the performance of parallel programs running on the SOME-Bus multiprocessor implementing cache-coherent DSM. The algorithms are based on run-time data redistribution via a dynamic page migration protocol. They use memory access references together with the information of average channel utilization, average channel waiting time, number of messages in the channel queue or short-term average channel waiting time reported by each node and gathered by hardware monitors to make correct decisions related to the placement of shared data. Simulations with four parallel codes on a 64-processor SOME-Bus show that the algorithms yield significant performance improvements such as reduction in the execution times, number of remote memory accesses, average channel waiting times, average network latencies and increase in average channel utilizations.

20.
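Based on the WMPLS protocol architecture, and taking into account the characteristics of wireless mobile ad hoc networks as well as routing security, this paper proposes SA-WMPLS, a secure ad hoc network routing protocol in which WMPLS signaling that supports self-healing recovery establishes label-switched paths. The protocol not only improves route selection performance and simplifies the forwarding mechanism but can also quickly restore broken links. An ad hoc network simulation model was built to analyze the performance of the SA-WMPLS routing protocol by simulation and to verify the protocol's security properties.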
