期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Reducing contention in shared-memory multiprocessors

Stenstrom P. 《Computer》1988,21(11):26-37

The techniques that can be used to design a memory system that reduces the impact of contention are examined. To exemplify the techniques, the implementations and the design decisions taken in each are reviewed. The discussion covers memory organization, interconnection networks, memory allocation, cache memory, and synchronization and contention. The multiprocessor implementations considered are C.mmp, CM*, RP3, Alliant FX, Cedar, Butterfly, SPUR, Dragon, Multimax, and Balance 相似文献

2.

Verifying sequential consistency on shared-memory multiprocessors by model checking

Qadeer S. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(8):730-741

The memory model of a shared-memory multiprocessor is a contract between the designer and the programmer of the multiprocessor. A memory model is typically implemented by means of a cache-coherence protocol. The design of this protocol is one of the most complex aspects of multiprocessor design and is consequently quite error-prone. However, it is imperative to ensure that the cache-coherence protocol satisfies the shared-memory model. We present a novel technique based on model checking to tackle this difficult problem for the important and well-known shared-memory model of sequential consistency. Surprisingly, verifying sequential consistency is undecidable in general, even for finite-state cache-coherence protocols. In practice, cache-coherence protocols satisfy the properties of causality and data independence. Causality is the property that values of read events flow from values of write events. Data independence is the property that all traces can be generated by renaming data values from traces where the written values are pairwise distinct. We show that, if a causal and data independent system also has the property that the logical order of write events to each location is identical to their temporal order, then sequential consistency is decidable. We present a novel model checking algorithm to verify sequential consistency on such systems for a finite number of processors and memory locations and an arbitrary number of data values. 相似文献

3.

A compiler optimization algorithm for shared-memory multiprocessors

McKinley K.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(8):769-787

This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel, Fortran programs operating over dense matrices. The programs initially were hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines 相似文献

4.

Memory access dependencies in shared-memory multiprocessors

Dubois M. Scheurich C. 《IEEE transactions on pattern analysis and machine intelligence》1990,16(6):660-673

The presence of high-performance mechanisms in shared-memory multiprocessors such as private caches, the extensive pipelining of memory access, and combining networks may render a logical concurrency model complex to implement or inefficient. The problem of implementing a given logical concurrency model in such a multiprocessor is addressed. Two concurrency models are considered, and simple rules are introduced to verify that a multiprocessor architecture adheres to the models. The rules are applied to several examples of multiprocessor architectures 相似文献

5.

Sequential hardware prefetching in shared-memory multiprocessors

Dahlgren F. Dubois M. Stenstrom P. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(7):733-746

To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of pre-fetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions of the number of read misses, the read penalty, and of the execution time by up to 78%, 58%, and 25% respectively 相似文献

6.

Data forwarding in scalable shared-memory multiprocessors

Koufaty D.A. Xiangfeng Chen Poulsen D.K. Torrellas J. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(12):1250-1264

Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that an optimistic support for forwarding speeds up five applications by an average of 50% for large caches and 30% for small caches. For large caches, most sharing read misses are eliminated, while for small caches, forwarding does not increase the number of conflict misses significantly. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups 相似文献

7.

Rsim: simulating shared-memory multiprocessors with ILP processors 总被引：1，自引：0，他引：1

Hughes C.J. Pai V.S. Ranganathan P. Adve S.V. 《Computer》2002,35(2):40-49

The early 1990s saw several announcements of commercial shared-memory systems using processors that aggressively exploited instruction-level parallelism (ILP), including the MIPS R10000, Hewlett-Packard PA8000, and Intel Pentium Pro. These processors could potentially reduce memory read stalls by overlapping read latency with other operations, possibly changing the nature of performance bottlenecks in the system. The authors' experience with Rsim demonstrates that modeling ILP features is important even in shared-memory multiprocessor systems. In particular, current simple processor-based approximations cannot model significant performance effects for applications exhibiting parallel read misses. Further, recent shared-memory designs such as aggressive implementations of sequential consistency use the aggressive ILP-enhancing features of modern processors that simple processor-based simulators do not model. As microprocessor systems become more complex, the availability of shared infrastructure source code is likely to become increasingly crucial. The authors plan to release a new Rsim version shortly that will include instruction caches, TLBs, multimedia extensions, simultaneous multithreading, Rabbit fast simulation mode, and ports to Linux platforms 相似文献

8.

Safe self-scheduling: A parallel loop scheduling scheme for shared-memory multiprocessors

Jie Liu Vikram A. Saletore Ted G. Lewis 《International journal of parallel programming》1994,22(6):589-616

In this papaer was present Safe Self-Scheduling (SSS), a new scheduling scheme that schedules parallel loops with variable length iteration execution times not known at compile time. The scheme assumes a shared memory space. SSS combines static scheduling with dynamic scheduling and draws favorable advantages from each. First, it reduces the dynamic scheduling overhead by statically scheduling a major portion of loop iterations. Second, the workload is balanced with a simple and efficient self-scheduling scheme by applying a new measure, thesmallest critical chore size. Experimental results comparing SSS with other scheduling schemes indicate that SSS surpasses other scheduling schemes. In the experiment on Gauss-Jordan, an application that is suitable for static scheduling schemes, SSS is the only self-scheduling scheme that outperforms the static scheduling scheme. This indicates that SSS achieves a balanced workload with a very small amount of overhead. This research has been supported in part by the National Science Foundation under Contract No. CCR-9210568. 相似文献

9.

A parallel copying garbage collection scheme for shared-memory multiprocessors

Khayri A. M. Ali 《New Generation Computing》1996,14(1):53-77

Earlier work on parallel copying garbage collection is based on parallelization of sequential schemes with breadth-first traversing of live data. Recently it has been demonstrated that sequential copying schemes with depth-first traversing of live data are more flexible and more efficient than the corresponding ones with breadth-first traversal. A clear advantage of the former is that they work with no extra space overheads on non-contiguous memory blocks, which allows more flexible implementation. This paper shows how to parallelize an efficient depth-first copying scheme while retaining its high efficiency and flexibility. Research interests: His research interests include techniques for parallel and distributed implementation of functional, logic, concurrent constraints and object-oriented programming systems. He is also interested in performance analysis and garbage collection. His current interest is efficient techniques for distributed implementation of the concurrent programming language Oz. 相似文献

10.

Memory referencing characteristics and caching performance of AND-Parallel Prolog on shared-memory multiprocessors

M. Hermenegildo E. Tick 《New Generation Computing》1989,7(1):37-58

This paper presents the performance analysis results for the RAP-WAM AND-Parallel Prolog architecture on shared-memory multiprocessor organizations. The goal of this parallel model is to provide inference speeds beyond those attainable in sequential systems, while supporting conventional logic programming semantics. Special emphasis is placed on sequential performance, storage efficiency, and low control overhead. First, the concepts and techniques used in the parallel execution model are described, along with the general methodology, benchmarks, and simulation tools used for its evaluation. Results are given both at the memory reference level and at the memory organization level. A two-level shared-memory architecture model is presented together with an analysis of various solutions to the cache coherency problem. Finally, RAP-WAM shared-memory simulation results are presented. It is argued that the RAP-WAM model can exploit coherent caches and attain speeds in excess of 2 MLIPS with current shared-memory multiprocessing technology for real applications that exhibit medium degrees of parallelism. MCC 相似文献

11.

Token coherence: a new framework for shared-memory multiprocessors

Martin M.M.K. Hill M.D. Wood D.A. 《Micro, IEEE》2003,23(6):108-116

Commercial workload and technology trends are pushing existing shared-memory multiprocessor coherence protocols in divergent directions. Token coherence provides a framework for new coherence protocols that can reconcile these opposing trends. The token coherence framework directly enforces the coherence invariant by counting tokens (requiring all of a block's tokens to write and at least one token to read). This token-counting approach enables more obviously correct protocols that do not rely on request ordering and can operate with alternative policies that seek to improve the performance of future multiprocessors. 相似文献

12.

Comparative performance evaluation of hot spot contention betweenMIN-based and ring-based shared-memory architectures

Xiaodong Zhang Yong Yan Castaneda R. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(8):872-886

Hot spot contention on a network-based shared-memory architecture occurs when a large number of processors try to access a globally shared variable across the network. While multistage interconnection network (MIN) and hierarchical ring (HR) structures are two important bases on which to build large scale shared-memory multiprocessors, the different interconnection networks and cache/memory systems of the two architectures respond very differently to network bottleneck situations. In this paper, we present a comparative performance evaluation of hot spot effects on the MIN-based and HR-based shared-memory architectures. Both nonblocking MIN-based and HR-based architectures are classified, and analytical models are described for understanding network differences and for evaluating hot spot performance on both architectures. The analytical comparisons indicate that HR-based architectures have the potential to handle various contentions caused by hot spots more efficiently than MIN-based architectures. Intensive performance measurements on hot spots have been conducted on the BBN TC2000 (MIN-based) and the KSR1 (HR-based) machines. Performance experiments were also conducted on the practical experience of hot spots with respect to synchronization lock algorithms. The experimental results are consistent with the analytical models, and present practical observations and an evaluation of hot spots on the two types of architectures 相似文献

13.

Reducing run queue contention in shared memory multiprocessors

Dandamudi S.P. 《Computer》1997,30(3):82-89

The performance of parallel processing systems, especially large systems, is sensitive to various types of overhead and contention. Performance consequences may be serious when contention occurs for hardware resources such as memory or the interconnection network. Contention can also occur for software resources such as critical data structures maintained by either system or application software. A run queue is one such critical data structure that can affect overall system performance. There are two basic types of run queues, centralized and distributed. Both present performance problems. There are also several techniques to mitigate their drawbacks, but none is completely satisfactory. Instead, the author proposes a different run queue organization, a hierarchical organization that inherits the best features of the centralized and the distributed queue organizations while avoiding their pitfalls. Thus, the hierarchical organization is suitable for building large-scale multiprocessor systems 相似文献

14.

Moving address translation closer to memory in distributed shared-memory multiprocessors

Qiu X. Dubois M. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(7):612-623

To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (translation lookaside buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of distributed shared memory (DSM) multiprocessors, including CC-NUMAs (cache-coherent non-uniform memory access architectures) and COMAs (cache only memory access architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In the context of COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can dynamically migrate and replicate freely among nodes. As the address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect. We also observe that the TLB is very effective when it is merged with the shared-memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Even if the effectiveness of the TLB merged with the shared memory is very high, we also show that the TLB can be removed in a system with address translation done in memory because the frequency of translations is very low. 相似文献

15.

Coarse-grained thread pipelining: a speculative parallel executionmodel for shared-memory multiprocessors

Kazi I.H. Lilja D.J. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(9):952-966

This paper presents a new parallelization model, called coarse-grained thread pipelining, for exploiting speculative coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the fine-grained thread pipelining model proposed for the superthreaded architecture, allows concurrent execution of loop iterations in a pipelined fashion with runtime data-dependence checking and control speculation. The speculative execution combined with the runtime dependence checking allows the parallelization of a variety of program constructs that cannot be parallelized with existing runtime parallelization algorithms. The pipelined execution of loop iterations in this new technique results in lower parallelization overhead than in other existing techniques. We evaluated the performance of this new model using some real applications and a synthetic benchmark. These experiments show that programs with a sufficiently large grain size compared to the parallelization overhead obtain significant speedup using this model. The results from the synthetic benchmark provide a means for estimating the performance that can be obtained from application programs that will be parallelized with this model. The library routines developed for this thread pipelining model are also useful for evaluating the correctness of the codes generated by the superthreaded compiler and in debugging and verifying the simulator for the superthreaded processor 相似文献

16.

PSCR: a coherence protocol for eliminating passive sharing inshared-bus shared-memory multiprocessors

Giorgi R. Prete C.A. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(7):742-763

In high-performance general-purpose workstations and servers, the workload can be typically constituted of both sequential and parallel applications. Shared-bus shared-memory multiprocessor can be used to speed-up the execution of such workload. In this environment, the scheduler takes care of the load balancing by allocating a ready process on the first available processor, thus producing process migration. Process migration and the persistence of private data into different caches produce an undesired sharing, named passive sharing. The copies due to passive sharing produce useless coherence traffic on the bus and coping with such a problem may represent a challenging design problem for these machines. Many protocols use smart solutions to limit the overhead to maintain coherence among shared copies. None of these studies treats passive-sharing directly, although some indirect effect is present while dealing with the other kinds of sharing. Affinity scheduling can alleviate this problem, but this technique does not adapt to all load conditions, especially when the effects of migration are massive. We present a simple coherence protocol that eliminates passive sharing using information from the compiler that is normally available in operating system kernels. We evaluate the performance of this protocol and compare it against other solutions proposed in the literature by means of enhanced trace-driven simulation. We evaluate the complexity in terms of the number of protocol states, additional bus lines, and required software support. Our protocol further limits the coherence-maintaining overhead by using information about access patterns to shared data exhibited in parallel applications 相似文献

17.

Hardware approaches to cache coherence in shared-memory multiprocessors, Part 1

《Micro, IEEE》1994,14(5):52

相似文献

18.

A subsystem-oriented performance analysis methodology forshared-bus multiprocessors

Chiung-San Lee Tai-Ming Parng 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(7):755-767

A methodology, called Subsystem Access Time (SAT) modeling, is proposed for the performance modeling and analysis of shared-bus multiprocessors. The methodology is subsystem-oriented because it is based on a Subsystem Access Time Per Instruction (SATPI) concept, in which we treat major components other than processors (e.g., off-chip cache, bus, memory, I/O) as subsystems and model for each of them the mean access time per instruction from each processor. The SAT modeling methodology is derived from the Customized Mean Value Analysis (CMVA) technique, which is request-oriented in the sense that it models the weighted total mean delay for each type of request processed in the subsystems. The subsystem-oriented view of the proposed methodology facilitates divide-and-conquer modeling and bottleneck analysis, which is rarely addressed previously. These distinguishing features lead to a simple, general, and systematic approach to the analytical modeling and analysis of complex multiprocessor systems. To illustrate the key ideas and features that are different from CMVA, an example performance model of a particular shared-bus multiprocessor architecture is presented. The model is used to conduct performance evaluation for throughput prediction. Thereby, the SATPIs of the subsystems are directly utilized to identify the bottleneck subsystem and find the requests or subsystem components that cause the bottleneck. Furthermore, the SATPIs of the subsystems are employed to explore the impact of several performance influencing factors, including memory latency, number of processors, data bus width, as well as DMA transfer 相似文献

19.

Two proposals for the inclusion of directory information in the last-level private caches of glueless shared-memory multiprocessors

Alberto Ros Ricardo Fernández-Pascual Manuel E. Acacio José M. García 《Journal of Parallel and Distributed Computing》2008

In glueless shared-memory multiprocessors where cache coherence is usually maintained using a directory-based protocol, the fast access to the on-chip components (caches and network router, among others) contrasts with the much slower main memory. Unfortunately, directory-based protocols need to obtain the sharing status of every memory block before coherence actions can be performed. This information has traditionally been stored in main memory, and therefore these cache coherence protocols are far from being optimal. In this work, we propose two alternative designs for the last-level private cache of glueless shared-memory multiprocessors: the lightweight directory and the SGluM cache. Our proposals completely remove directory information from main memory and store it in the home node’s L2 cache, thus reducing both the number of accesses to main memory and the directory memory overhead. The main characteristics of the lightweight directory are its simplicity and the significant improvement in the execution time for most applications. Its drawback, however, is that the performance of some particular applications could be degraded. On the other hand, the SGluM cache offers more modest improvements in execution time for all the applications by adding some extra structures that cope with the cases in which the lightweight directory fails. 相似文献

20.

The performance of cache-based error recovery in multiprocessors

Janssens B. Fuchs W.K. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(10):1033-1043

Several variations of cache-based checkpointing for rollback error recovery from transient errors in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated in this paper. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, we evaluate the performance effect of integrating the recovery schemes in the cache coherence protocol. Our results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead, but with uncontrollable high variability in the checkpoint interval 相似文献