20 similar documents found.
1.
Synchronization algorithms for shared-memory multiprocessors (cited 2 times: 0 self-citations, 2 by others)
A performance evaluation of the Symmetry multiprocessor system revealed that the synchronization mechanism did not perform well for highly contested locks, like those found in certain parallel applications. Several software synchronization mechanisms, designed to reduce contention for the lock, were developed and evaluated on the Symmetry multiprocessor system using a hardware monitor. The mechanisms remain valuable even when changes are made to the hardware synchronization mechanism to improve support for highly contested locks. The Symmetry architecture is described, and a number of lock algorithms and their use of hardware resources are examined. The performance of each lock is observed from the perspective of both the program itself and the total system performance.
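For illustration, here is a minimal sketch of one lock in the contention-reducing family the paper evaluates: a test-and-test-and-set spinlock with bounded exponential backoff, written with C11 atomics. The constants and the backoff policy are illustrative, not the Symmetry implementation.

```c
#include <stdatomic.h>

/* Test-and-test-and-set spinlock with bounded exponential backoff.
   Spinning on an ordinary load keeps the line in shared state, so
   waiters do not flood the bus with atomic operations; backoff
   spreads retries apart under heavy contention. */
typedef struct { atomic_int locked; } ttas_lock_t;

static void ttas_acquire(ttas_lock_t *l) {
    int delay = 1;
    for (;;) {
        /* Read-only spin until the lock looks free. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* Only now attempt the expensive atomic exchange. */
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
        /* Lost the race: back off before retrying. */
        for (volatile int i = 0; i < delay; i++)
            ;
        if (delay < (1 << 12))
            delay <<= 1;
    }
}

static void ttas_release(ttas_lock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```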
2.
The techniques that can be used to design a memory system that reduces the impact of contention are examined. To exemplify the techniques, the implementations and the design decisions taken in each are reviewed. The discussion covers memory organization, interconnection networks, memory allocation, cache memory, and synchronization and contention. The multiprocessor implementations considered are C.mmp, CM*, RP3, Alliant FX, Cedar, Butterfly, SPUR, Dragon, Multimax, and Balance.
3.
Dahlgren F. Dubois M. Stenstrom P. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(7):733-746
To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions in the number of read misses, the read penalty, and the execution time by up to 78%, 58%, and 25%, respectively.
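A sketch of how such an adaptive controller might look, assuming an epoch length and usefulness thresholds that are purely illustrative (the paper's mechanism is implemented in hardware, not software):

```c
/* Adaptive sequential prefetching sketch: on each read miss, prefetch
   `degree` consecutive blocks; at the end of each epoch, raise or
   lower `degree` based on the fraction of prefetches that were
   actually referenced. EPOCH and the 75%/25% thresholds are assumed. */
enum { EPOCH = 64, MAX_DEGREE = 8 };

typedef struct {
    int degree;   /* blocks prefetched per miss, adapted at run time */
    int issued;   /* prefetches issued in the current epoch */
    int useful;   /* issued prefetches later referenced */
    int misses;   /* read misses seen in the current epoch */
} adapt_t;

/* Called on every cache read miss for block `b`. */
void on_miss(adapt_t *a, long b, void (*prefetch)(long)) {
    for (int i = 1; i <= a->degree; i++)
        prefetch(b + i);                  /* sequential prefetch */
    a->issued += a->degree;
    if (++a->misses < EPOCH)
        return;
    /* End of epoch: adapt degree to the measured usefulness. */
    if (a->issued > 0) {
        int pct = 100 * a->useful / a->issued;
        if (pct > 75 && a->degree < MAX_DEGREE) a->degree++;
        else if (pct < 25 && a->degree > 0)     a->degree--;
    }
    a->issued = a->useful = a->misses = 0;
}

/* Called when a previously prefetched block is first referenced. */
void on_useful_prefetch(adapt_t *a) { a->useful++; }
```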
4.
Dubois M. Scheurich C. 《IEEE transactions on pattern analysis and machine intelligence》1990,16(6):660-673
The presence of high-performance mechanisms in shared-memory multiprocessors such as private caches, the extensive pipelining of memory access, and combining networks may render a logical concurrency model complex to implement or inefficient. The problem of implementing a given logical concurrency model in such a multiprocessor is addressed. Two concurrency models are considered, and simple rules are introduced to verify that a multiprocessor architecture adheres to the models. The rules are applied to several examples of multiprocessor architectures.
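The paper's verification rules are not reproduced here, but the classic store-buffering litmus test below shows the kind of behavior a logical concurrency model must pin down: under sequential consistency the outcome r1 == 0 && r2 == 0 is forbidden, while a machine that lets stores drain past subsequent loads can produce it.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int x, y;   /* both start at 0 */
int r1, r2;

void *t1(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    r1 = atomic_load_explicit(&y, memory_order_seq_cst);
    return NULL;
}

void *t2(void *arg) {
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_seq_cst);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* With seq_cst ordering, at least one thread sees the other's
       store: r1 = r2 = 0 is impossible. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```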
5.
Koufaty D.A. Xiangfeng Chen Poulsen D.K. Torrellas J. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(12):1250-1264
Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that an optimistic support for forwarding speeds up five applications by an average of 50% for large caches and 30% for small caches. For large caches, most sharing read misses are eliminated, while for small caches, forwarding does not increase the number of conflict misses significantly. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups.
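A sketch of the producer side of the idea; `cache_forward` and `compute_partial` are hypothetical stand-ins for the compiler-inserted primitive and the application code described in the paper:

```c
#include <stddef.h>

#define NPROC 32                       /* illustrative machine size */

/* Hypothetical primitive: push the freshly written line into the
   caches of the listed consumer processors so their later reads hit
   locally instead of missing remotely. */
extern void cache_forward(void *addr, const int *consumers, int n);
extern double compute_partial(int me); /* stand-in application code */

double partial[NPROC];

void producer(int me, const int *consumers, int nconsumers) {
    partial[me] = compute_partial(me);            /* normal cached write */
    cache_forward(&partial[me], consumers, nconsumers);
}

double consumer(int producer_id) {
    /* Without forwarding this read misses and fetches remotely;
       with forwarding the line is already resident locally. */
    return partial[producer_id];
}
```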
6.
Commercial workload and technology trends are pushing existing shared-memory multiprocessor coherence protocols in divergent directions. Token coherence provides a framework for new coherence protocols that can reconcile these opposing trends. The token coherence framework directly enforces the coherence invariant by counting tokens (requiring all of a block's tokens to write and at least one token to read). This token-counting approach enables more obviously correct protocols that do not rely on request ordering and can operate with alternative policies that seek to improve the performance of future multiprocessors.
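The token-counting invariant is simple enough to state in a few lines. This sketch shows only the permission checks, not a complete protocol; TOKENS_TOTAL and the owner-token handling are simplified assumptions.

```c
#include <stdbool.h>

/* Each block has a fixed number of tokens in the whole system; a node
   may read a block if it holds at least one token and may write it
   only if it holds all of them. */
enum { TOKENS_TOTAL = 64 };   /* e.g. one token per potential sharer */

typedef struct {
    int tokens;               /* tokens this node holds for the block */
    bool owner;               /* holds the distinguished owner token */
} block_state_t;

static bool can_read(const block_state_t *b)  { return b->tokens >= 1; }
static bool can_write(const block_state_t *b) { return b->tokens == TOKENS_TOTAL; }

/* On a write request from another node, surrender every token held;
   the tokens travel with the response message. */
static int surrender_all(block_state_t *b) {
    int sent = b->tokens;
    b->tokens = 0;
    b->owner = false;
    return sent;
}
```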
7.
Gao Yaoqing Wang Dingxing Zheng Weimin Shen Meiming Huang Zhiyi Hu Shouren Giotto Levi 《Journal of Computer Science and Technology》1993,8(4):43-50
Logic programs offer many opportunities for the exploitation of parallelism. But the parallel execution of a task incurs various overheads. This paper focuses on the issues relevant to parallelizing Prolog on shared-memory multiprocessors efficiently.
8.
《Journal of Parallel and Distributed Computing》2005,65(10):1158-1170
Synchronization is a crucial operation in many parallel applications. Conventional synchronization mechanisms are failing to keep up with the increasing demand for efficient synchronization operations as systems grow larger and network latency increases. The contributions of this paper are threefold. First, we revisit some representative synchronization algorithms in light of recent architecture innovations and provide an example of how the simplifying assumptions made by typical analytical models of synchronization mechanisms can lead to significant performance estimate errors. Second, we present an architectural innovation called active memory that enables very fast atomic operations in a shared-memory multiprocessor. Third, we use execution-driven simulation to quantitatively compare the performance of a variety of synchronization mechanisms based on both existing hardware techniques and active memory operations. To the best of our knowledge, synchronization based on active memory outperforms all existing spinlock and non-hardwired barrier implementations by a large margin.
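The flavor of an active-memory synchronization primitive can be suggested with a centralized barrier whose only shared-memory traffic is one fetch-and-add per arrival; this is a sketch of the software view, not the paper's hardware interface.

```c
#include <stdatomic.h>

/* Centralized sense-reversing barrier built on atomic fetch-and-add,
   the kind of operation an active memory controller executes at the
   memory module so the line never ping-pongs between processors. */
typedef struct {
    atomic_int count;        /* arrivals in the current generation */
    atomic_int generation;   /* bumped once per completed barrier  */
    int nthreads;
} fai_barrier_t;

void fai_barrier_wait(fai_barrier_t *b) {
    int gen = atomic_load(&b->generation);
    /* One atomic op per arrival; with active memory this is a single
       round trip to the memory controller. */
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);
        atomic_fetch_add(&b->generation, 1);   /* release everyone */
    } else {
        while (atomic_load(&b->generation) == gen)
            ;                                  /* spin until released */
    }
}
```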
9.
This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs initially were hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user-parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.
10.
Arun K. Nanda Honda Shing Ten-Hwan Tzen Lionel M. Ni 《Journal of Parallel and Distributed Computing》1991,12(4)
A standard metric conventionally employed to compare the performance of different multiprocessor systems is speedup. Although providing a measure of the improvement in execution speed achievable on a system, this metric does not yield any insight into the factors responsible for limiting the potential improvement in speed. This paper studies the performance degradation in shared-memory multiprocessors as a result of contention for shared-memory resources. A replicate workload framework with a flexible mechanism for workload specification is proposed for measuring performance. Two normalized performance metrics—efficiency and overhead factor—are introduced to quantify the factors limiting performance and facilitate comparison across architectures. Finally, the proposed model is employed to measure and compare the performance of three contemporary shared-memory systems, with special emphasis on the newly released BBN Butterfly-II (TC2000), currently undergoing beta test.
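In the usual notation, with T_1 the best sequential time and T_p the time on p processors, and with the caveat that the paper's precise definitions of its two metrics may differ, the quantities relate as:

```latex
% Speedup, efficiency, and an overhead factor (common formulations;
% the paper's exact definitions may differ).
S(p) = \frac{T_1}{T_p}, \qquad
E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}, \qquad
O(p) = \frac{p\,T_p}{T_1} = \frac{1}{E(p)} \;\ge\; 1 .
```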
11.
This paper presents a new parallelization model, called coarse-grained thread pipelining, for exploiting speculative coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the fine-grained thread pipelining model proposed for the superthreaded architecture, allows concurrent execution of loop iterations in a pipelined fashion with runtime data-dependence checking and control speculation. The speculative execution combined with the runtime dependence checking allows the parallelization of a variety of program constructs that cannot be parallelized with existing runtime parallelization algorithms. The pipelined execution of loop iterations in this new technique results in lower parallelization overhead than in other existing techniques. We evaluated the performance of this new model using some real applications and a synthetic benchmark. These experiments show that programs with a sufficiently large grain size compared to the parallelization overhead obtain significant speedup using this model. The results from the synthetic benchmark provide a means for estimating the performance that can be obtained from application programs that will be parallelized with this model. The library routines developed for this thread pipelining model are also useful for evaluating the correctness of the codes generated by the superthreaded compiler and in debugging and verifying the simulator for the superthreaded processor.
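A minimal sketch of the pipelined iteration structure, with hypothetical helpers `publish_dependences` and `execute_body_speculatively` standing in for the compiler-generated stages; the real model also performs runtime dependence checks and rollback, which are omitted here.

```c
#include <stdatomic.h>

#define N 1024                                   /* iterations (illustrative) */

extern void publish_dependences(int i);          /* hypothetical stage 1 */
extern void execute_body_speculatively(int i);   /* hypothetical stage 2 */

/* One flag per iteration: set once its dependence info is published. */
static atomic_int stage1_done[N];

/* Run by the thread assigned iteration i. Stage 1 of the iterations is
   serialized in loop order; stage 2 (the bulk of the work) overlaps
   across iterations, forming the pipeline. */
void run_iteration(int i) {
    if (i > 0)
        while (!atomic_load(&stage1_done[i - 1]))
            ;                                    /* wait for predecessor */
    publish_dependences(i);                      /* stage 1 */
    atomic_store(&stage1_done[i], 1);            /* unblock iteration i+1 */
    execute_body_speculatively(i);               /* stage 2, overlapped */
}
```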
12.
Jie Liu Vikram A. Saletore Ted G. Lewis 《International journal of parallel programming》1994,22(6):589-616
In this paper we present Safe Self-Scheduling (SSS), a new scheduling scheme that schedules parallel loops with variable-length iteration execution times not known at compile time. The scheme assumes a shared memory space. SSS combines static scheduling with dynamic scheduling and draws favorable advantages from each. First, it reduces the dynamic scheduling overhead by statically scheduling a major portion of loop iterations. Second, the workload is balanced with a simple and efficient self-scheduling scheme by applying a new measure, the smallest critical chore size. Experimental results indicate that SSS surpasses other scheduling schemes. In the experiment on Gauss-Jordan, an application that is suitable for static scheduling schemes, SSS is the only self-scheduling scheme that outperforms the static scheduling scheme. This indicates that SSS achieves a balanced workload with a very small amount of overhead.
This research has been supported in part by the National Science Foundation under Contract No. CCR-9210568.
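A sketch of an SSS-style schedule, assuming (illustratively) that 70% of the iterations are assigned statically and the rest are self-scheduled in shrinking chunks bounded below by a minimum chore size; the paper's actual formula for the smallest critical chore size is not reproduced.

```c
#include <stdatomic.h>

enum { N = 100000, NTHREADS = 8, MIN_CHORE = 32 };

/* First dynamically scheduled iteration: 70% handed out statically. */
static atomic_int next = (int)(0.7 * N);

void worker(int id, void (*body)(int)) {
    /* Static part: each thread owns an equal slice of the first 70%
       (rounding ignored in this sketch). */
    int share = (int)(0.7 * N) / NTHREADS;
    for (int i = id * share; i < (id + 1) * share; i++)
        body(i);
    /* Dynamic part: grab shrinking chunks, never smaller than the
       minimum chore size, until the loop is exhausted. */
    for (;;) {
        int remaining = N - atomic_load(&next);
        int chunk = remaining / (2 * NTHREADS);
        if (chunk < MIN_CHORE) chunk = MIN_CHORE;
        int start = atomic_fetch_add(&next, chunk);
        if (start >= N) break;
        int end = start + chunk < N ? start + chunk : N;
        for (int i = start; i < end; i++)
            body(i);
    }
}
```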
13.
Khayri A. M. Ali 《New Generation Computing》1996,14(1):53-77
Earlier work on parallel copying garbage collection is based on parallelization of sequential schemes with breadth-first traversal of live data. Recently it has been demonstrated that sequential copying schemes with depth-first traversal of live data are more flexible and more efficient than the corresponding ones with breadth-first traversal. A clear advantage of the former is that they work with no extra space overhead on non-contiguous memory blocks, which allows more flexible implementation. This paper shows how to parallelize an efficient depth-first copying scheme while retaining its high efficiency and flexibility.
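A sequential depth-first copying sketch of the kind the paper parallelizes, using an explicit stack instead of Cheney's breadth-first scan pointer; `alloc_tospace` is an assumed allocator, objects are assumed to start with a NULL forwarding pointer, and real collectors handle stack overflow rather than dropping work.

```c
#include <stddef.h>

typedef struct obj {
    struct obj *forward;            /* to-space copy, once evacuated */
    int nfields;
    struct obj *field[];            /* child pointers */
} obj_t;

extern obj_t *alloc_tospace(int nfields);   /* assumed allocator */

static obj_t *copy_one(obj_t *o) {
    if (o == NULL) return NULL;
    if (o->forward) return o->forward;      /* already evacuated */
    obj_t *c = alloc_tospace(o->nfields);
    c->forward = NULL;
    c->nfields = o->nfields;
    for (int i = 0; i < o->nfields; i++)
        c->field[i] = o->field[i];          /* still from-space refs */
    o->forward = c;
    return c;
}

void collect(obj_t **roots, int nroots) {
    enum { DEPTH = 4096 };                  /* sketch: no overflow handling */
    obj_t *stack[DEPTH];
    int top = 0;
    for (int i = 0; i < nroots; i++) {
        int fresh = roots[i] && !roots[i]->forward;
        roots[i] = copy_one(roots[i]);
        if (fresh) stack[top++] = roots[i]; /* push each object once */
    }
    while (top > 0) {                       /* depth-first scan */
        obj_t *o = stack[--top];
        for (int i = 0; i < o->nfields; i++) {
            int fresh = o->field[i] && !o->field[i]->forward;
            o->field[i] = copy_one(o->field[i]);
            if (fresh && top < DEPTH) stack[top++] = o->field[i];
        }
    }
}
```

Because the scan order is driven by a stack rather than a linear sweep of to-space, nothing requires to-space to be one contiguous block, which is the flexibility the paper highlights.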
14.
The memory model of a shared-memory multiprocessor is a contract between the designer and the programmer of the multiprocessor. A memory model is typically implemented by means of a cache-coherence protocol. The design of this protocol is one of the most complex aspects of multiprocessor design and is consequently quite error-prone. However, it is imperative to ensure that the cache-coherence protocol satisfies the shared-memory model. We present a novel technique based on model checking to tackle this difficult problem for the important and well-known shared-memory model of sequential consistency. Surprisingly, verifying sequential consistency is undecidable in general, even for finite-state cache-coherence protocols. In practice, cache-coherence protocols satisfy the properties of causality and data independence. Causality is the property that values of read events flow from values of write events. Data independence is the property that all traces can be generated by renaming data values from traces where the written values are pairwise distinct. We show that, if a causal and data independent system also has the property that the logical order of write events to each location is identical to their temporal order, then sequential consistency is decidable. We present a novel model checking algorithm to verify sequential consistency on such systems for a finite number of processors and memory locations and an arbitrary number of data values.
15.
《Journal of Systems Architecture》1999,45(9):687-708
Through simulations, the effect of several microarchitectural parameters on the performance of a dynamically scheduled out-of-order microprocessor is shown. Next, we show that memory instructions, especially stores, limit the available instruction-level parallelism (ILP) considerably. Techniques are proposed to mitigate the effect of memory instructions: a static technique, a mixed static/dynamic technique, and a fully dynamic technique. We focus on the fully dynamic technique, which enables the out-of-order execution of loads and stores. If a memory dependence fault is detected, the traditional branch misprediction recovery hardware is used for recovery. Since this scheme alone does not perform well, a dependence-fault-predicting cache is introduced.
16.
In high-performance general-purpose workstations and servers, the workload is typically constituted of both sequential and parallel applications. Shared-bus shared-memory multiprocessors can be used to speed up the execution of such workloads. In this environment, the scheduler takes care of load balancing by allocating a ready process on the first available processor, thus producing process migration. Process migration and the persistence of private data in different caches produce an undesired form of sharing, named passive sharing. The copies due to passive sharing produce useless coherence traffic on the bus, and coping with this problem can represent a challenging design issue for these machines. Many protocols use smart solutions to limit the overhead of maintaining coherence among shared copies. None of these studies treats passive sharing directly, although some indirect effect is present while dealing with the other kinds of sharing. Affinity scheduling can alleviate this problem, but this technique does not adapt to all load conditions, especially when the effects of migration are massive. We present a simple coherence protocol that eliminates passive sharing using information from the compiler that is normally available in operating system kernels. We evaluate the performance of this protocol and compare it against other solutions proposed in the literature by means of enhanced trace-driven simulation. We evaluate the complexity in terms of the number of protocol states, additional bus lines, and required software support. Our protocol further limits the coherence-maintaining overhead by using information about access patterns to shared data exhibited in parallel applications.
17.
Scalable shared-memory multiprocessor systems are typically NUMA (nonuniform memory access) machines, where the exploitation of the memory hierarchy is critical to achieving high performance. Iterative data parallel loops with near-neighbor communication account for many important numerical applications. In such loops, the communication of partial results stresses the memory system performance. In this paper, we develop data placement schemes that minimize communication time where the near-neighbor interaction is determined by a stencil. Under a given loop partition, our compile-time algorithm partitions global data into four classes for each processor, with each class requiring specific consistency maintenance requirements. The ADAPT (Automatic Data Allocation and Partitioning Tool) system was implemented to automatically partition parallel code segments for the BBN TC2000, a scalable shared-memory multiprocessor. ADAPT caches global arrays and maintains data consistency in software through instructions that flush data from private caches. Restructuring of a fluid flow code segment by ADAPT improved performance by a factor of more than 3 on the BBN TC2000. Features in current generation pipelined processors with multiple functional units permit the overlap of memory accesses with computation. Our experiments on the BBN TC2000 show that the degree of overlap is limited by architectural parameters, such as the number of CPU registers.
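The near-neighbor pattern in question is easiest to see in code. In a 5-point Jacobi relaxation step (sizes and coefficients below are illustrative), a block partition of the grid means only the rows on partition borders are produced by one processor and consumed by another, which is exactly the communication that ADAPT's placement classes target.

```c
enum { NX = 512, NY = 512 };

/* One 5-point stencil sweep: each interior point is updated from its
   four neighbors, reading array `a` and writing array `b`. */
void jacobi_step(const double a[NX][NY], double b[NX][NY]) {
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            /* Under a row-block partition, only the two border rows
               of each block are shared between processors; all other
               accesses are local to the owning processor. */
            b[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j]
                            + a[i][j - 1] + a[i][j + 1]);
}
```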
18.
Michael L. Scott John M. Mellor-Crummey 《International journal of parallel programming》1994,22(4):449-481
In a previous article,(1) Gupta and Hill introduced an adaptive combining tree algorithm for busy-wait barrier synchronization on shared-memory multiprocessors. The intent of the algorithm was to achieve a barrier in logarithmic time when processes arrive simultaneously, and in constant time after the last arrival when arrival times are skewed. A fuzzy(2) version of the algorithm allows a process to perform useful work between the point at which it notifies other processes of its arrival and the point at which it waits for all other processes to arrive. Unfortunately, adaptive combining tree barriers as originally devised perform a large amount of work at each node of the tree, including the acquisition and release of locks. They also perform an unbounded number of accesses to nonlocal locations, inducing large amounts of memory and interconnect contention. We present new adaptive combining tree barriers that eliminate these problems. We compare the performance of the new algorithms to that of other fast barriers on a 64-node BBN Butterfly 1 multiprocessor, a 35-node BBN TC2000, and a 126-node KSR 1. The results reveal scenarios in which our algorithms outperform all known alternatives, and suggest that both adaptation and the combination of fuzziness with tree-style synchronization will be of increasing importance on future generations of shared-memory multiprocessors.
At the University of Rochester, this work was supported in part by NSF Institutional Infrastructure grant number CDA-8822724 and ONR research contract number N00014-92-J-1801 (in conjunction with the ARPA Research in Information Science and Technology—High Performance Computing, Software Science and Technology program, ARPA Order No. 8930). At Rice University, this work was supported in part by NSF Cooperative Agreements CCR-8809615 and CCR-912008.
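The fuzzy idea is visible even in a centralized sketch: arrival and completion are separate calls, so a process can do useful work between them. The paper's combining trees replace this single counter for scalability; what follows shows the split-phase interface only, not their algorithm.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int count;        /* arrivals in the current generation */
    atomic_int generation;   /* bumped when the last process arrives */
    int nthreads;
} fuzzy_barrier_t;

/* Phase 1: announce arrival and return a generation ticket. */
int fb_arrive(fuzzy_barrier_t *b) {
    int gen = atomic_load(&b->generation);
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);
        atomic_fetch_add(&b->generation, 1);   /* last arrival releases */
    }
    return gen;
}

/* Phase 2: wait until every process from this generation has arrived.
   Useful work can be done between fb_arrive and fb_complete. */
void fb_complete(fuzzy_barrier_t *b, int gen) {
    while (atomic_load(&b->generation) == gen)
        ;
}
```

Typical use: `int g = fb_arrive(&b); do_overlappable_work(); fb_complete(&b, g);` so the wait is hidden behind work that does not depend on the other processes.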
19.
Large-scale distributed shared-memory multiprocessors (DSMs) provide a shared address space by physically distributing the memory among different processors. A fundamental DSM communication problem that significantly affects scalability is an increase in remote memory latency as the number of system nodes increases. Remote memory latency, caused by accessing a memory location in a processor other than the one originating the request, includes both communication latency and remote memory access latency over I/O and memory buses. The proposed architecture reduces remote memory access latency by increasing connectivity and maximizing channel availability for remote communication. It also provides efficient and fast unicast, multicast, and broadcast capabilities, using a combination of aggressively designed multiplexing techniques. Simulations show that this architecture provides excellent interconnect support for a highly scalable, high-bandwidth, low-latency network.
20.
To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (translation lookaside buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of distributed shared memory (DSM) multiprocessors, including CC-NUMAs (cache-coherent non-uniform memory access architectures) and COMAs (cache only memory access architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In the context of COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can dynamically migrate and replicate freely among nodes. As the address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect. We also observe that the TLB is very effective when it is merged with the shared memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Even if the effectiveness of the TLB merged with the shared memory is very high, we also show that the TLB can be removed in a system with address translation done in memory because the frequency of translations is very low.
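For reference, the operation whose position in the memory hierarchy the paper explores is the ordinary TLB-lookup-then-page-walk sequence sketched below; field widths, the table shape, and all names are illustrative, not any of the paper's five designs.

```c
#include <stdint.h>

/* Virtual-to-physical translation: a small direct-mapped TLB in front
   of a two-level page-table walk (32-bit addresses, 4 KB pages). */
enum { PAGE_BITS = 12, TLB_ENTRIES = 64 };

typedef struct { uint32_t vpn; uint32_t pfn; int valid; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];
extern uint32_t *page_dir[1024];            /* assumed top-level table */

uint32_t translate(uint32_t vaddr) {
    uint32_t off = vaddr & ((1u << PAGE_BITS) - 1);
    uint32_t vpn = vaddr >> PAGE_BITS;
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn)          /* TLB hit: fast path */
        return (e->pfn << PAGE_BITS) | off;
    /* TLB miss: walk the two-level table. Doing this deeper in the
       hierarchy is exactly the latency/frequency trade-off studied. */
    uint32_t *l2 = page_dir[vpn >> 10];     /* top 10 bits of the VPN */
    uint32_t pfn = l2[vpn & 0x3ff];         /* low 10 bits of the VPN */
    *e = (tlb_entry_t){ .vpn = vpn, .pfn = pfn, .valid = 1 };
    return (pfn << PAGE_BITS) | off;
}
```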