Similar literature
 Found 20 similar documents; search time 15 ms
1.
The Cray X1 supercomputer, introduced in 2002, has several interesting architectural features. Two key features are the X1's distributed shared memory and its vector multiprocessors. The X1's distributed shared memory presents a 64-bit global address space that is directly addressable from every MSP, with an interconnect bandwidth per computation rate of 1 byte/flop. In this article, we characterize the performance of the X1's distributed shared-memory system and its interconnection network using microbenchmarks and applications.

2.
3.
Future chip-multiprocessors (CMP) will integrate many cores interconnected with a high-bandwidth and low-latency scalable network-on-chip (NoC). However, the potential that this approach offers at the transport level needs to be paired with an analogous paradigm shift at the higher levels. In particular, the standard shared-memory programming model fails to address the scalability requirements of the many-core era. Fast data exchange among the cores and low-latency synchronization are desirable but hard to achieve in practice due to the memory hierarchy. The message-passing paradigm instead permits direct data communication and synchronization between the cores. Shared memory could still be used for the instruction fetch. Hence, we propose a hybrid approach that combines shared memory and message passing in a single general-purpose CMP architecture that allows efficient execution of applications developed with both parallel programming approaches. Cores fetch instructions from a hierarchical memory and exchange their data through the same memory, for compatibility with existing software, or directly through the fast NoC. We developed a fast SystemC-based cycle-accurate simulator for design space explorations that we used to evaluate the performance with real benchmarks. The various components have been RTL coded and mapped to a CMOS 45 nm technology to build a silicon area model that we used to select the best architectural configurations.

4.
5.
Analytic evaluation of shared-memory architectures   Total citations: 1, self-citations: 0, citations by others: 1
This paper develops and validates an efficient analytical model for evaluating the performance of shared-memory architectures with ILP processors. First, we instrument the SimOS simulator to measure the parameters for such a model and we find a surprisingly high degree of processor memory request heterogeneity in the workloads. Examining the model parameters provides insight into application behaviors and how they interact with the system. Second, we create a model that captures such heterogeneous processor behavior, which is important for analyzing memory system design tradeoffs. Highly bursty memory request traffic and lock contention are also modeled in a significantly more robust way than in previous work. With these features, the model is applicable to a wide range of architectures and applications. Although the features increase the model complexity, it is a useful design tool because the size of the model input parameter set remains manageable, and the model is still several orders of magnitude quicker to solve than detailed simulation. Validation results show that the model is highly accurate, producing heterogeneous per-processor throughputs that are generally within 5 percent and, for the workloads validated, always within 13 percent of the values measured by detailed simulation with SimOS. Several examples illustrate applications of the model to studying architectural design issues and the interactions between the architecture and the application workloads.

6.
7.
8.
This paper presents a schematic algorithm for distributed systems. This schematic algorithm uses a black-box procedure for communication, the output of which must meet two requirements: a global-order requirement and a deadlock-free requirement. This algorithm is valid in any distributed system model that can provide a communication procedure that complies with these requirements. Two such models exist in an asynchronous fail-stop environment: one in the shared-memory model and one in the message-passing model. The implementation of the black-box procedure in these models enables us to translate existing algorithms between the two models whenever these algorithms are based on the schematic algorithm. We demonstrate this idea in two ways. First, we present a randomized algorithm for the consensus problem in the message-passing model based on the algorithm of Aspnes and Herlihy [AH] in the shared-memory model. This solution is the fastest known randomized algorithm that solves the consensus problem against a strong fail-stop adversary with one-half resiliency. Second, we solve the processor renaming problem in the shared-memory model based on the solution of Attiya et al. [ABD+] in the message-passing model. The existence of the solution to the renaming problem should be contrasted with the impossibility result for the consensus problem in the shared-memory model [CIL], [DDS], [LA]. A preliminary version of this paper, Shared-Memory vs. Message-Passing in an Asynchronous Distributed Environment, appeared in Proc. 8th ACM Symp. on Principles of Distributed Computing, pp. 307–318, 1989. Part of this work was done while A. Bar-Noy visited the Computer Science Department, Stanford University, Stanford, CA 94305, USA, and his research was supported in part by a Weizmann fellowship, by Contract ONR N00014-88-K-0166, and by a grant of Stanford's Center for Integrated Systems.

9.
To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (translation lookaside buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of distributed shared memory (DSM) multiprocessors, including CC-NUMAs (cache-coherent non-uniform memory access architectures) and COMAs (cache-only memory access architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In the context of COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can dynamically migrate and replicate freely among nodes. As the address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect. We also observe that the TLB is very effective when it is merged with the shared memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Even though the effectiveness of the TLB merged with the shared memory is very high, we also show that the TLB can be removed in a system with address translation done in memory because the frequency of translations is very low.

10.
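Research on a parallel compression algorithm based on shared memory and Gzip   Total citations: 1, self-citations: 1, citations by others: 1
Gzip lossless compression algorithm. Although the gzip algorithm achieves a very good compression ratio, its analysis and compression-encoding stages require a large amount of computation. To shorten compression time, a parallel compression strategy based on shared memory is proposed, and a parallel version of gzip is implemented using the OpenMP standard and the producer/consumer model. Tests on one SMP node (dual CPU) of a Beowulf cluster and on a Dawning Tiankuo server (4-way dual-core) show that the parallelized gzip program achieves a large performance improvement, especially when compressing large files.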
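The producer/consumer strategy described in this abstract can be sketched roughly as follows. This is not the authors' OpenMP implementation: it is a minimal Python illustration in which threads stand in for OpenMP threads, and the names `parallel_gzip`, `CHUNK_SIZE`, and the `workers` parameter are all hypothetical. It assumes that the input is split into independent blocks, each compressed as its own gzip member, and relies on the fact that concatenated gzip members form a valid gzip stream.

```python
import gzip
import queue
import threading

CHUNK_SIZE = 64 * 1024  # hypothetical block size; the abstract does not state one


def parallel_gzip(data, workers=4, chunk_size=CHUNK_SIZE):
    """Compress `data` as a sequence of independently compressed gzip members."""
    tasks = queue.Queue(maxsize=2 * workers)  # bounded buffer shared by all threads
    results = {}                              # chunk index -> compressed bytes
    lock = threading.Lock()

    def consumer():
        # Consumer: take chunks from the shared queue until a sentinel arrives.
        while True:
            item = tasks.get()
            if item is None:
                return
            index, chunk = item
            compressed = gzip.compress(chunk)
            with lock:
                results[index] = compressed

    threads = [threading.Thread(target=consumer) for _ in range(workers)]
    for t in threads:
        t.start()

    # Producer: split the input into fixed-size chunks and enqueue them in order.
    count = 0
    for offset in range(0, len(data), chunk_size):
        tasks.put((count, data[offset:offset + chunk_size]))
        count += 1
    for _ in threads:
        tasks.put(None)  # one sentinel per consumer
    for t in threads:
        t.join()

    # Reassemble in the original chunk order.
    return b"".join(results[i] for i in range(count))
```

Because each block is compressed without the history of its neighbours, the compression ratio of such a scheme is generally somewhat lower than that of a single sequential pass; smaller blocks expose more parallelism at the cost of ratio.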

11.
For the past decades, computer engineers have focused on building high-performance, large-scale computer systems at low cost. One example is a distributed-memory computer system such as a cluster, in which fast processing nodes built from commodity processors are connected through a high-speed network. It is not easy to develop applications on such a system, however, because a programmer must consider all data and control dependences between processes and program them explicitly. To alleviate this problem, the distributed virtual shared-memory (DVSM) system has been proposed. It is well known that the performance of a DVSM system depends heavily on the network's performance and programming semantics, and currently its performance is very limited on a conventional network. Recently, many advanced hardware-based interconnection technologies have been introduced. One of them is the InfiniBand Architecture (IBA), which supports shared-memory programming semantics by means of remote direct-memory access (RDMA) and atomic operations. In this paper, we present the implementation of our InfiniBand-based DVSM system and analyze the performance of SPEC OMP benchmarks in detail by comparing it with a DVSM based on the traditional network architecture and with a hardware shared-memory multiprocessor (SMP) system. Our experimental results show that our DVSM system, which uses the full features of the IBA, can improve performance significantly over the IPoIB-based traditional system on the IBA, and furthermore that the performance of one application on the IBA-based DVSM system is better than on the hardware SMP.

12.
Multi-agent systems have been applied to the challenges of information retrieval tasks in distributed environments. In this paper, we propose a framework based on a consensus choice selection method to evaluate the performance of cooperative information retrieval tasks carried out by multiple agents. To this end, two well-known measurements, precision and recall, are extended to handle consensual closeness (i.e., local and global consensus) between the sets of retrieved results. We show in a motivating example that the proposed criteria can address the rigidity problem of classical precision and recall. More importantly, the retrieved results can be ranked with respect to the consensual score, and the ranking mechanism has been verified to be more reasonable.
Jason J. Jung
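For reference, the classical precision and recall that this abstract takes as its starting point can be computed over sets of document identifiers as in the sketch below. The function name `precision_recall` and the document identifiers are hypothetical, and the consensus-based extension itself is not shown because the abstract does not define it.

```python
def precision_recall(retrieved, relevant):
    """Classical set-based precision and recall.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    An empty denominator is reported as 0.0 (a convention chosen for this sketch).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# One agent retrieves four documents, two of which are among the three relevant ones.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d5"})
```

Both measures treat a document as either a hit or a miss against a fixed relevant set, which is presumably the rigidity that the consensus-based criteria are meant to relax.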

13.
14.
Reliable multicast is a powerful communication primitive for structuring distributed programs in which multiple processes must cooperate closely. We propose a protocol for supporting reliable multicast in a distributed system that includes mobile hosts and evaluate the performance of our proposal through simulation. We consider a scenario in which mobile hosts communicate with a wired infrastructure by means of wireless technology. Our proposal provides several novel features. The sender of each multicast may select among three increasingly strong delivery ordering guarantees: FIFO, causal, total. Movements do not trigger the transmission of any message in the wired network as no notion of hand-off is used. The set of senders and receivers (group) may be dynamic. The size of data structures at mobile hosts, the size of message headers, and the number of messages in the wired network for each multicast are all independent of the number of group members. The wireless network is assumed to provide only incomplete spatial coverage and message losses could occur even within cells. Movements are not negotiated and a mobile host that leaves a cell may enter any other cell, perhaps after a potentially long disconnection. The simulation results show that the proposed protocol has good performance and good scalability properties.

15.
This paper presents an analytical method to derive the worst-case traffic pattern caused by a task graph mapped to a cache-coherent shared-memory system. Our analysis allows designers to rapidly evaluate the impact of different mappings of tasks to IP cores on the traffic pattern. The accuracy varies with the application’s data sharing pattern, and is around 65% in the average case and 1% in the best case when considering the traffic pattern as a whole. For individual connections, our method produces tight worst-case bandwidths.

16.
Providing efficient workload management is an important issue for a large-scale heterogeneous distributed computing environment where a set of periodic applications is executed. The considered shipboard distributed system is expected to operate in an environment where the input workload is likely to change unpredictably, possibly invalidating a resource allocation that was based on the initial workload estimate. The tasks consist of multiple strings, each made up of an ordered sequence of applications. There is a quality of service (QoS) minimum throughput constraint that must be satisfied for each application in a string, and a maximum utilization constraint that must be satisfied on each of the hardware resources in the system. The challenge, therefore, is to efficiently and robustly manage both computation and communication resources in this unpredictable environment to achieve high performance while satisfying the imposed constraints. This work addresses the problem of finding a robust initial allocation of resources to strings of applications that is able to absorb some level of unknown input workload increase without rescheduling. The proposed hybrid two-stage method of finding a near-optimal allocation of resources incorporates two specially designed mapping techniques: (1) the Permutation Space Genitor-Based heuristic, and (2) the follow-up Branch-and-Bound heuristic based on an Integer Linear Programming (ILP) problem formulation. The performance of the proposed resource allocation method is evaluated under different simulation scenarios and compared to an iteratively computed upper bound.

17.
The probability of a station failing to deliver packets before their deadlines, called the probability of dynamic failure, P_dyn, is an important measure for the communication subsystem of a distributed real-time system. Another closely related performance measure is the ε-bounded delivery time, T_ε, which is defined as the least time needed to deliver a packet with probability greater than 1−ε. Using P_dyn and T_ε, we comparatively evaluate four contention protocols often used in distributed real-time systems: (i) the token passing protocol and its priority-based variation (called the token scheduling protocol), and (ii) the P_i-persistent protocol and a priority-based variation thereof. The communication subsystem equipped with different contention protocols is modeled first as embedded Markov chains. Then, we derive the probability distributions of access delay, from which P_dyn and T_ε can be calculated. The blocking probability, Q_i, can also be derived from the access delay distribution. These measures are derived first under the assumption of a single buffer at each station. The single-buffer model is then extended to the multiple-buffer case. The effects of buffer size on P_dyn, T_ε, and Q_i, and the performance improvement with multiple buffers are analyzed over a wide range of network traffic. The work reported in this paper was supported in part by the ONR under Grant N00014-92-J-1080. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the ONR.
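As a small worked illustration of the two measures defined in this abstract, the sketch below computes the probability of dynamic failure and the bounded delivery time from a discrete delay distribution. Everything concrete here is an assumption for illustration: the function names, the tolerance symbol `epsilon`, the deadline, and the example distribution (delays in hypothetical time slots) are not taken from the paper, which derives the distribution from embedded Markov chains.

```python
def dynamic_failure_probability(delay_pmf, deadline):
    """P(delay > deadline) for a discrete distribution {delay: probability}."""
    return sum(p for d, p in delay_pmf.items() if d > deadline)


def bounded_delivery_time(delay_pmf, epsilon):
    """Least delay t with P(delay <= t) > 1 - epsilon, or None if no such t exists."""
    cumulative = 0.0
    for d in sorted(delay_pmf):
        cumulative += delay_pmf[d]
        if cumulative > 1.0 - epsilon:
            return d
    return None


# Hypothetical delivery-delay distribution, in time slots.
pmf = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}

p_dyn = dynamic_failure_probability(pmf, deadline=2)  # 0.125 + 0.125 = 0.25
t_eps = bounded_delivery_time(pmf, epsilon=0.2)       # cumulative 0.875 > 0.8 at delay 3
```

With this distribution, a deadline of 2 slots is missed with probability 0.25, and 3 slots is the least time within which a packet is delivered with probability greater than 0.8.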

18.
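Based on an analysis of the traditional development process for distributed real-time embedded systems, and in response to the drawbacks of existing quality-of-service (QoS) evaluation methods, this paper summarizes various explorations that use model-driven engineering (MDE) to manage the QoS of distributed real-time embedded systems and to bring QoS evaluation into the system development cycle. It identifies several key problems that this approach must solve, analyzes the strengths and weaknesses of several representative solutions, and proposes solutions and recommendations.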

19.
A hybrid system structure composed of distributed systems, to take advantage of locality of reference, and a central system, to handle transactions that access non-local data, is examined. Several transaction processing applications, such as reservation systems, insurance, and banking, have such regional locality of reference. A concurrency and coherency control protocol that maintains the integrity of the data and performs well for transactions that access local or non-local data is described. It is shown that the performance of the hybrid system is much less sensitive to the fraction of remote accesses than the distributed system and offers similar performance to the distributed system for local transactions.

20.
Algorithms and software for solving sparse symmetric positive definite systems on serial computers have reached a high state of development. In this paper, we present algorithms for performing sparse Cholesky factorization and sparse triangular solutions on a shared-memory multiprocessor computer, along with some numerical experiments demonstrating their performance on a Sequent Balance 8000 system. Research was supported by the Applied Mathematical Sciences Research Program of the Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc., by the U.S. Air Force Office of Scientific Research under contract AFOSR-ISSA-85-00083 and by the Canadian Natural Sciences and Engineering Research Council under grants A8111 and A5509.
