Similar documents
20 similar documents found (search time: 31 ms)
1.
We present a practical lock-free shared data structure that efficiently implements the operations of a concurrent deque as well as a general doubly linked list. The implementation supports parallelism for disjoint accesses and uses atomic primitives which are available in modern computer systems. Previously known lock-free algorithms of doubly linked lists are either based on non-available atomic synchronization primitives, only implement a subset of the functionality, or are not designed for disjoint accesses. Our algorithm only requires single-word compare-and-swap atomic primitives, supports fully dynamic list sizes, and allows traversal also through deleted nodes and thus avoids unnecessary operation retries. We have performed an empirical study of our new algorithm on two different multiprocessor platforms. Results of the experiments performed under high contention show that the performance of our implementation scales linearly with increasing number of processors. Considering deque implementations and systems with low concurrency, the algorithm by Michael shows the best performance. However, as our algorithm is designed for disjoint accesses, it performs significantly better on systems with high concurrency and non-uniform memory architecture.  相似文献   
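For illustration, here is a minimal C++ sketch (not the published algorithm) of the single-word compare-and-swap idiom such a doubly linked list builds on: linking a new node between two known neighbours with one CAS on the forward pointer and a best-effort fix of the backward pointer. The names (`Node`, `insert_between`) are illustrative only; the real algorithm additionally handles concurrent deletion with marks and helping.

```cpp
#include <atomic>

// Sketch of inserting a node between two known neighbours using only
// single-word CAS.  The published algorithm layers deletion marks and
// helping on top of this idiom; none of that is shown here.
struct Node {
    int value;
    std::atomic<Node*> prev;
    std::atomic<Node*> next;
    explicit Node(int v) : value(v), prev(nullptr), next(nullptr) {}
};

bool insert_between(Node* left, Node* right, Node* node) {
    node->prev.store(left);
    node->next.store(right);
    // Linearization point: make the node reachable in the forward direction.
    Node* expected = right;
    if (!left->next.compare_exchange_strong(expected, node))
        return false;                 // the list changed; the caller retries
    // Best-effort fix of the backward pointer; a lagging prev pointer is
    // tolerated and repaired lazily in the full algorithm.
    expected = left;
    right->prev.compare_exchange_strong(expected, node);
    return true;
}
```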

2.
Irregular parallel algorithms pose a significant challenge for achieving high performance because of the difficulty predicting memory access patterns or execution paths. Within an irregular application, fine-grained synchronization is one technique for managing the coordination of work; but in practice the actual performance for irregular problems depends on the input, the access pattern to shared data structures, the relative speed of processors, and the hardware support of synchronization primitives. In this paper, we focus on lock-free and mutual exclusion protocols for handling fine-grained synchronization. Mutual exclusion and lock-free protocols have received a fair amount of attention in coordinating accesses to shared data structures from concurrent processes. Mutual exclusion offers a simple programming abstraction, while lock-free data structures provide better fault tolerance and eliminate problems associated with critical sections such as priority inversion and deadlock. These synchronization protocols, however, are seldom used in parallel algorithm designs, especially for algorithms under the SPMD paradigm, as their implementations are highly hardware dependent and their costs are hard to characterize. Using graph-theoretic algorithms for illustrative purposes, we show experimental results on two shared-memory multiprocessors, the IBM pSeries 570 and the Sun Enterprise 4500, that irregular parallel algorithms with efficient fine-grained synchronization may yield good performance.  相似文献   
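As an illustration of the kind of fine-grained, lock-free update such irregular graph algorithms rely on (an assumed example, not code from the paper), the following C++ fragment atomically lowers a vertex label with a CAS retry loop instead of taking a per-vertex lock:

```cpp
#include <atomic>

// Fine-grained, lock-free update typical of label-propagation style graph
// algorithms: lower a shared vertex label only if our candidate is smaller.
bool atomic_min(std::atomic<int>& label, int candidate) {
    int cur = label.load();
    while (candidate < cur) {
        if (label.compare_exchange_weak(cur, candidate))
            return true;              // we lowered the label
        // on failure, cur now holds the freshly observed value; re-check
    }
    return false;                     // another thread already holds an equal or lower label
}
```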

3.
Most multiprocessors are multiprogrammed to achieve acceptable response time and to increase their utilization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two principal strategies for a concurrent, atomic update of shared data structures: (1) preemption-safe locking and (2) nonblocking (lock-free) algorithms. Preemption-safe locking requires kernel support. Nonblocking algorithms generally require a universal atomic primitive such as compare-and-swap or load-linked/store-conditional and are widely regarded as inefficient. We evaluate the performance of preemption-safe lock-based and nonblocking implementations of important data structures—queues, stacks, heaps, and counters—including nonblocking and lock-based queue algorithms of our own, in microbenchmarks and real applications on a 12-processor SGI Challenge multiprocessor. Our results indicate that our nonblocking queue consistently outperforms the best known alternatives and that data-structure-specific nonblocking algorithms, which exist for queues, stacks, and counters, can work extremely well. Not only do they outperform preemption-safe lock-based algorithms on multiprogrammed machines, they also outperform ordinary locks on dedicated machines. At the same time, since general-purpose nonblocking techniques do not yet appear to be practical, preemption-safe locks remain the preferred alternative for complex data structures: they outperform conventional locks by significant margins on multiprogrammed systems.
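A minimal sketch of the two styles on the simplest structure studied, a shared counter (illustrative names, not the paper's code):

```cpp
#include <atomic>
#include <mutex>

// Lock-based counter: a preempted holder of the mutex blocks everyone else.
struct LockCounter {
    long value = 0;
    std::mutex m;
    long increment() { std::lock_guard<std::mutex> g(m); return ++value; }
};

// Nonblocking counter: a single atomic fetch-and-add, so no thread can be
// delayed by a preempted lock holder, which is exactly the property that
// matters on multiprogrammed machines.
struct NonblockingCounter {
    std::atomic<long> value{0};
    long increment() { return value.fetch_add(1) + 1; }
};
```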

4.
This paper presents an extended version of our previous work on using compiler technology to automatically convert sequential C++ data abstractions, for example, queues, stacks, maps, and trees, to concurrent lock-free implementations. A key difference between our work and existing research in software transactional memory (STM) is that our compiler-based approach automatically selects the best state-of-the-practice nonblocking synchronization method for the underlying sequential implementation of the data structure. The extended material includes a broader collection of the state-of-the-practice lock-free synchronization techniques, additional formal correctness proofs of the overall integration of the different synchronizations in our system, and a more comprehensive experimental study of the integrated techniques. We evaluate our compiler-generated nonblocking data structures both by using a collection of micro-benchmarks, including the Synchrobench suite, and by using a multi-threaded application Dedup from PARSEC. Our automatically synchronized code attains performance competitive to that of concurrent data structures manually-written by experts and much better performance than heavier-weight support by STM.  相似文献   
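The paper's compiler performs this transformation automatically; purely as an illustration of the simplest nonblocking pattern such a tool can target for small objects, the hand-written C++ sketch below copies the current state, applies the unchanged sequential operation to the copy, and publishes the result with a single CAS. The class and function names are invented for this sketch, and reclamation of superseded versions is omitted.

```cpp
#include <atomic>

// Copy-and-CAS wrapper: turn a sequential update into a lock-free one by
// working on a private copy and publishing it atomically.
template <typename State>
class CasWrapped {
    std::atomic<State*> current;
public:
    explicit CasWrapped(State* initial) : current(initial) {}

    template <typename SequentialOp>
    void apply(SequentialOp op) {
        for (;;) {
            State* old_state = current.load();
            State* new_state = new State(*old_state);   // private copy
            op(*new_state);                             // unchanged sequential code
            if (current.compare_exchange_weak(old_state, new_state))
                return;                                 // published atomically
            delete new_state;                           // lost the race; retry
            // NOTE: old versions are leaked here; real systems need safe reclamation.
        }
    }
};
```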

5.
Operations on basic data structures such as queues, priority queues, stacks, and counters can dominate the execution time of a parallel program due to both their frequency and their coordination and contention overheads. There are considerable performance payoffs in developing highly optimized, asynchronous, distributed, cache-conscious, parallel implementations of such data structures. Such implementations may employ a variety of tricks to reduce latencies and avoid serial bottlenecks, as long as the semantics of the data structure are preserved. The complexity of the implementation and the difficulty in reasoning about asynchronous systems increase concerns regarding possible bugs in the implementation. In this paper we consider postmortem, black-box procedures for testing whether a parallel data structure behaved correctly. We present the first systematic study of algorithms and hardness results for such testing procedures, focusing on queues, priority queues, stacks, and counters, under various important scenarios. Our results demonstrate the importance of selecting test data such that distinct values are inserted into the data structure (as appropriate). In such cases we present an O(n) time algorithm for testing linearizable queues, an O(n log n) time algorithm for testing linearizable priority queues, and an O(np²) time algorithm for testing sequentially consistent queues, where n is the number of data structure operations and p is the number of processors. In contrast, we show that testing such data structures for executions with arbitrary input values is NP-complete. Our results also help clarify the thresholds between scenarios that admit polynomial time solutions and those that are NP-complete. Our algorithms are the first nontrivial algorithms for these problems.

6.
In asynchronous event systems, the production of an event is decoupled from its consumption via an event queue. The loose coupling of such systems allows great flexibility as to where and when within the system the event handler of a given event is actually executed, allowing them to scale with increasing number of processors on a given node or across multiple nodes in a distributed system. Although this flexibility is useful in heterogeneous distributed systems, such as SOA, it may appear not to be well suited for real-time systems, which require an upper bound on how long an event can remain unhandled in a queue. However, we show how the weighted fair scheduling algorithms developed for QoS (quality of service) routing can be adapted to event queues to bound the delay between production and consumption. We achieve this, despite the fact that the underlying execution environment is only weakly controlled, through the use of predictive techniques on a calibrated model of the event system. Our choice of algorithms and data structures is driven in part by the highly concurrent nature of the system. We show how a weighted fair scheduler can be built on a lock-free concurrent priority queue. We analyze the performance of various such queues proposed in the literature and show that a very simple linear quantizing queue yields the best performance.  相似文献   

7.
Core-to-core communication is critical to the effective use of multi-core processors. A number of software based concurrent lock-free queues have been proposed to address this problem. Existing solutions, however, suffer from performance degradation in real testbeds, or rely on auxiliary hardware or software timers to handle the deadlock problem when batching is used, making those solutions good in theory but difficult to use in practice. This paper describes the pros and cons of existing concurrent lock-free queues in both dummy and real testbeds and proposes B-Queue, an efficient and practical single-producer-single-consumer concurrent lock-free queue that solves the deadlock problem gracefully by introducing a self-adaptive backtracking mechanism. Experiments show that in real massively-parallel applications, B-Queue is faster than FastForward and MCRingBuffer, the two state-of-the-art concurrent lock-free queues, by up to 10x and 5x, respectively. Moreover, B-Queue outperforms FastForward and MCRingBuffer in terms of stability and scalability, making it a good candidate for fast core-to-core communication on multi-core architectures.  相似文献   
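For context, a minimal single-producer/single-consumer lock-free ring buffer of the kind these queues build on is sketched below. This is not B-Queue itself, which adds batching and the self-adaptive backtracking described above; the class name and sizes are illustrative.

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC ring buffer: the producer owns tail_, the consumer owns head_.
// One slot is kept empty to distinguish a full queue from an empty one.
template <typename T, std::size_t N>
class SpscRing {
    T buf_[N];
    std::atomic<std::size_t> head_{0};   // next slot to read  (consumer side)
    std::atomic<std::size_t> tail_{0};   // next slot to write (producer side)
public:
    bool push(const T& v) {              // called only by the producer thread
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;                // full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {                   // called only by the consumer thread
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                // empty
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```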

8.
The group mutual exclusion problem extends the traditional mutual exclusion problem by associating a type (or a group) with each critical section. In this problem, processes requesting critical sections of the same type can execute their critical sections concurrently. However, processes requesting critical sections of different types must execute their critical sections in a mutually exclusive manner. We present a distributed algorithm for solving the group mutual exclusion problem based on the notion of surrogate-quorum. Intuitively, our algorithm uses the quorum that has been successfully locked by a request as a surrogate to service other compatible requests for the same type of critical section. Unlike the existing quorum-based algorithms for group mutual exclusion, our algorithm achieves a low message complexity of O(q) and a low (amortized) bit-message complexity of O(bqr), where q is the maximum size of a quorum, b is the maximum number of processes from which a node can receive critical section requests, and r is the maximum size of a request while maintaining both synchronization delay and waiting time at two message hops. As opposed to some existing quorum-based algorithms, our algorithm can adapt without performance penalties to dynamic changes in the set of groups. Our simulation results indicate that our algorithm outperforms the existing quorum-based algorithms for group mutual exclusion by as much as 45 percent in some cases. We also discuss how our algorithm can be extended to satisfy certain desirable properties such as concurrent entry and unnecessary blocking freedom.  相似文献   

9.
We describe an approach to verifying concurrent data structures based on simulation between two Input/Output Automata (IOAs), modelling the specification and the implementation. We explain how we used this approach in mechanically verifying a simple lock-free stack implementation using forward simulation, and briefly discuss our experience in verifying three other lock-free algorithms which all required the use of backward simulation.  相似文献   

10.
The authors examine the design, implementation, and experimental analysis of parallel priority queues for device and network simulation. They consider: 1) distributed splay trees using MPI; 2) concurrent heaps using shared memory atomic locks; and 3) a new, more general concurrent data structure based on distributed sorted lists, designed to provide dynamically balanced work allocation and efficient use of shared memory resources. We evaluate performance for all three data structures on a Cray-T3E system at KFA Jülich. Our comparisons are based on simulations of single buffers and a 64×64 packet switch which supports multicasting. In all implementations, PEs monitor traffic at their preassigned input/output ports, while priority queue elements are distributed across the Cray-T3E virtual shared memory. Our experiments with up to 60000 packets and two to 64 PEs indicate that concurrent priority queues perform much better than distributed ones. Both concurrent implementations have comparable performance, while our new data structure uses less memory and has been further optimized. We also consider parallel simulation for symmetric networks by sorting integer conflict functions and implementing a packet indexing scheme. The optimized message passing network simulator can process ~500 K packet moves in one second, with an efficiency that exceeds ~50 percent for a few thousand packets on the Cray-T3E with 32 PEs. All developed data structures form a parallel library. Although our concurrent implementations use the Cray-T3E ShMem library, portability can be derived from OpenMP or MPI-2 standard libraries, which will provide support for one-way communication and shared memory lock mechanisms.

11.
An approach to automating the selection of implementations for the data representations used in a program is presented. Formalisms are developed for specifying data representations and implementation structures. Using these formalisms, algorithms are presented which will recognize the use in a program of known data representations (e.g., stacks, queues, lists, arrays, etc.) so that alternative implementations can be retrieved from a library and which will, for those representations not recognized, generate alternative implementations with a wide range of space-time tradeoffs. Experience with using these algorithms indicates they are reasonably successful, although there are several problems that must be solved before automatic implementation selection systems will be practical.

12.
The literature describes two high performance concurrent stack algorithms based on combining funnels and elimination trees. Unfortunately, the funnels are linearizable but blocking, and the elimination trees are non-blocking but not linearizable. Neither is used in practice since they perform well only at exceptionally high loads. The literature also describes a simple lock-free linearizable stack algorithm that works at low loads but does not scale as the load increases. The question of designing a stack algorithm that is non-blocking, linearizable, and scales well throughout the concurrency range, has thus remained open.  相似文献   
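The "simple lock-free linearizable stack" referred to above is commonly presented as a Treiber-style CAS loop on the top pointer; a minimal sketch follows (memory reclamation, and the ABA issue that freeing and reusing nodes would introduce, are deliberately ignored).

```cpp
#include <atomic>

// Treiber-style lock-free stack: every push/pop contends on the single top_
// pointer, which is exactly why it does not scale at high load.
template <typename T>
class SimpleLockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> top_{nullptr};
public:
    void push(const T& v) {
        Node* n = new Node{v, top_.load()};
        while (!top_.compare_exchange_weak(n->next, n)) {
            // on failure n->next was refreshed with the current top; retry
        }
    }
    bool pop(T& out) {
        Node* t = top_.load();
        while (t && !top_.compare_exchange_weak(t, t->next)) {
            // on failure t was refreshed; retry until empty or success
        }
        if (!t) return false;
        out = t->value;
        // t is leaked: safe reclamation (hazard pointers etc.) is out of scope here
        return true;
    }
};
```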

13.
First-in-first-out (FIFO) queues are among the most fundamental and highly studied concurrent data structures. The most effective and practical dynamic-memory concurrent queue implementation in the literature is the lock-free FIFO queue algorithm of Michael and Scott, included in the standard Java™ Concurrency Package. This work presents a new dynamic-memory concurrent lock-free FIFO queue algorithm that in a variety of circumstances performs better than the Michael and Scott queue. The key idea behind our new algorithm is a novel way of replacing the singly-linked list of Michael and Scott, whose pointers are inserted using a costly compare-and-swap (CAS) operation, by an “optimistic” doubly-linked list whose pointers are updated using a simple store, yet can be “fixed” if a bad ordering of events causes them to be inconsistent. We believe it is the first example of such an “optimistic” approach being applied to a real world data structure. A preliminary version of this paper appeared in the proceedings of the 18th International Conference on Distributed Computing (DISC) 2004, pages 117–131.
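For comparison with the optimistic approach, a sketch of the enqueue path of a Michael-and-Scott-style queue is given below, showing the costly CAS that links a node through the tail node's next pointer. This is illustrative code, not the authors'; the dequeue path, safe reclamation, and ABA tagging are omitted.

```cpp
#include <atomic>

template <typename T>
class MsQueueSketch {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
    };
    std::atomic<Node*> head_;
    std::atomic<Node*> tail_;
public:
    MsQueueSketch() {
        Node* dummy = new Node{};          // the queue always holds a dummy node
        head_.store(dummy);
        tail_.store(dummy);
    }
    void enqueue(const T& v) {
        Node* n = new Node{};
        n->value = v;
        for (;;) {
            Node* tail = tail_.load();
            Node* next = tail->next.load();
            if (tail != tail_.load()) continue;          // tail moved; re-read
            if (next != nullptr) {                       // tail is lagging: help it along
                tail_.compare_exchange_weak(tail, next);
                continue;
            }
            Node* expected = nullptr;
            if (tail->next.compare_exchange_weak(expected, n)) {   // the costly CAS
                tail_.compare_exchange_weak(tail, n);    // swing the tail (may fail, that is fine)
                return;
            }
        }
    }
};
```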

14.
Priority queues are widely used in many parallel algorithms (for example, multiprocessor scheduling and certain combinatorial optimization algorithms). In these algorithms, access conflicts on the shared priority queue limit the achievable speedup. This paper proposes a parallel insertion and deletion method for linked-list priority queues that has low parallel overhead and a high degree of parallelism, and that guarantees exactly the same priority order as the serial access algorithm: a delete operation returns the best element among all elements that have been inserted or are currently being inserted. We also describe the best-performing parallel insertion and deletion algorithms for heaps to date, compare the performance and applicability of the heap-based and linked-list-based parallel insertion and deletion algorithms, and further propose a hash-structured priority queue. Experimental results on an ENCORE Multimax520 multiprocessor confirm our theoretical analysis: a parallel branch-and-bound algorithm using the linked-list structure can achieve a substantial performance improvement.
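Purely as an illustration (not the paper's algorithm), the basic parallel step of a linked-list priority queue, inserting into a sorted list with a CAS on the predecessor's next pointer, can be sketched as follows; concurrent deletion (needed for delete-min) requires marking and is omitted.

```cpp
#include <atomic>

// Insertion into a sorted singly linked list with one CAS on the
// predecessor's next pointer; `head` is assumed to be a sentinel node
// whose key is smaller than every real key.
struct PqNode {
    int key;
    std::atomic<PqNode*> next;
    explicit PqNode(int k, PqNode* n = nullptr) : key(k), next(n) {}
};

void insert(PqNode* head, int key) {
    PqNode* node = new PqNode(key);
    for (;;) {
        PqNode* pred = head;
        PqNode* curr = pred->next.load();
        while (curr != nullptr && curr->key < key) {   // find the insertion point
            pred = curr;
            curr = curr->next.load();
        }
        node->next.store(curr);
        if (pred->next.compare_exchange_weak(curr, node))
            return;                                    // linked in atomically; otherwise rescan
    }
}
```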

15.
Software solutions for mutual exclusion developed over a 30‐year period, starting with complex ad hoc algorithms and progressing to simpler formal ones. While it is easy to dismiss software solutions for mutual exclusion, as this family of algorithms is antiquated and most platforms support atomic hardware instructions, there is still a need for these algorithms in threaded, embedded systems running on low‐cost processors lacking atomic instructions. While N‐thread solutions are usually short (10–25 lines of code), each is ingenious with exceptionally subtle aspects, often making it difficult to prove correctness or construct an implementation. This work examines correctness and performance of the implementations. An extensive survey of existing algorithms is presented, with explanations of the intuition behind the algorithms and how they work. Several errors were found and corrections made, as well as a few small improvements, in the existing algorithms; two new high‐performance algorithms were developed. Finally, a worst‐case high‐contention performance experiment is performed to compare the algorithms and contrast them with three common locks based on hardware atomic instructions. The results show our two new algorithms are highly competitive with an equivalent hardware lock (Mellor‐Crummey and Scott) over a range of 1–32 processors. Hence, threading is a viable alternative to event‐driven programming for complex embedded systems without atomic instructions. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   
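A representative member of this algorithm family is Peterson's two-thread algorithm, which uses only loads and stores; the sketch below (illustrative, not code from the survey) relies on sequentially consistent std::atomic accesses to make those loads and stores well defined on modern hardware.

```cpp
#include <atomic>

// Peterson's 2-thread mutual exclusion: no atomic read-modify-write
// instructions, only sequentially consistent loads and stores.
class Peterson2 {
    std::atomic<bool> intent_[2];
    std::atomic<int>  turn_;
public:
    Peterson2() : turn_(0) {
        intent_[0].store(false);
        intent_[1].store(false);
    }
    void lock(int self) {                 // self is 0 or 1
        int other = 1 - self;
        intent_[self].store(true);        // announce intent
        turn_.store(other);               // give way to the other thread
        while (intent_[other].load() && turn_.load() == other) {
            // busy-wait: the other thread is interested and it is its turn
        }
    }
    void unlock(int self) {
        intent_[self].store(false);
    }
};
```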

16.
Trace-based derivation of a scalable lock-free stack algorithm  (total citations: 1; self-citations: 1; citations by others: 0)
We show how a sophisticated, lock-free concurrent stack implementation can be derived from an abstract specification in a series of verifiable steps. The algorithm is based on the scalable stack algorithm of Hendler et al. (Proceedings of the sixteenth annual ACM symposium on parallel algorithms, 27–30 June 2004, Barcelona, Spain, pp 206–215), which allows push and pop operations to be paired off and eliminated without affecting the central stack, thus reducing contention on the stack, and allowing multiple pairs of push and pop operations to be performed in parallel. Our algorithm uses a simpler data structure than Hendler, Shavit and Yerushalmi’s, and avoids an ABA problem. We first derive a simple lock-free stack algorithm using a linked-list implementation, and discuss issues related to memory management and the ABA problem. We then add an abstract model of the elimination process, from which we derive our elimination algorithm. This allows the basic algorithmic ideas to be separated from implementation details, and provides a basis for explaining and comparing different variants of the algorithm. We show that the elimination stack algorithm is linearisable by showing that any execution of the implementation can be transformed into an equivalent execution of an abstract model of a linearisable stack. Each step in the derivation is either a data refinement which preserves the level of atomicity, an operational refinement which may alter the level of atomicity, or a refactoring step which alters the structure of the system resulting from the preceding derivation. We verify our refinements using an extension of Lipton’s reduction method, allowing concurrent and non-concurrent aspects to be considered separately.  相似文献   
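A much-simplified sketch of the elimination idea follows: one exchange slot instead of the paper's array of slots with time-outs and back-off, with invented names, and not the derived algorithm itself.

```cpp
#include <atomic>

// Single elimination slot: a push and a pop that collide here exchange a
// value and never touch the central stack.
class EliminationSlot {
    std::atomic<int*> slot_{nullptr};    // nullptr means the slot is empty
public:
    // Pusher: offer a (non-null) value; returns true if a concurrent pop took it.
    bool try_eliminate_push(int* value) {
        int* expected = nullptr;
        if (!slot_.compare_exchange_strong(expected, value))
            return false;                // slot busy: fall back to the stack
        // (the real algorithm waits a bounded time here before withdrawing)
        expected = value;
        if (slot_.compare_exchange_strong(expected, nullptr))
            return false;                // nobody came: withdrawn, fall back
        return true;                     // a popper took our value: eliminated
    }
    // Popper: try to grab a value offered by a concurrent push.
    int* try_eliminate_pop() {
        int* offered = slot_.load();
        if (offered == nullptr) return nullptr;
        if (slot_.compare_exchange_strong(offered, nullptr))
            return offered;              // paired with the pusher
        return nullptr;                  // lost the race: fall back to the stack
    }
};
```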

17.
This paper aims at designing a new token-based mutual exclusion algorithm for distributed systems. In some of the earlier work, token-based algorithms for mutual exclusion are proposed for the distributed environment assuming an inverted tree topology. In a wireless setup, such a stable, hierarchical topology is quite unrealistic due to frequent link failures. The proposed token-based algorithm works for processes with assigned priorities on any directed graph topology, with or without cycles. Although the proposed algorithm takes process priorities into account, it ensures liveness for token requests from low-priority processes. Moreover, the algorithm keeps control message traffic reasonably low. The simulation results exhibit the performance of the proposed algorithm under varied contexts, besides presenting a comparative performance with other recent algorithms for mutual exclusion such as FAPP (Fairness Algorithm for Priority Process).

18.
Efficient, scalable memory allocation for multithreaded applications on multiprocessors is a significant goal of recent research. In the distributed computing literature it has been emphasized that lock-based synchronization and concurrency control may limit the parallelism in multiprocessor systems. Thus, system services that employ such methods can hinder reaching the full potential of these systems. A natural research question is the pertinence and the impact of lock-free concurrency control in key services for multiprocessors, such as in the memory allocation service, which is the theme of this work. We show the design and implementation of NBmalloc, a lock-free memory allocator designed to enhance the parallelism in the system. The architecture of NBmalloc is inspired by Hoard, a well-known concurrent memory allocator, with a modular design that preserves scalability and helps avoid false-sharing and heap-blowup. Within our effort to design appropriate lock-free algorithms for NBmalloc, we propose and show a lock-free implementation of a new data structure, flat-set, supporting conventional “internal” set operations as well as “inter-object” operations, for moving items between flat-sets. The design of NBmalloc also involved a series of other algorithmic problems, which are discussed in the paper. Further, we present the implementation of NBmalloc and a study of its behaviour in a set of multiprocessor systems. The results show that the good properties of Hoard w.r.t. false-sharing and heap-blowup are preserved.

19.
We present an efficient and practical lock-free method for semiautomatic (application-guided) memory reclamation based on reference counting, aimed for use with arbitrary lock-free dynamic data structures. The method guarantees the safety of local as well as global references, supports arbitrary memory reuse, uses atomic primitives that are available in modern computer systems, and provides an upper bound on the amount of memory waiting to be reclaimed. To the best of our knowledge, this is the first lock-free method that provides all of these properties. We provide analytical and experimental study of the method. The experiments conducted have shown that the method can also provide significant performance improvements for lock-free algorithms of dynamic data structures that require strong memory management.  相似文献   
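The core idiom behind such CAS-based reference counting is to raise a count only while it is still non-zero, so an object whose count has dropped to zero can never be resurrected by a racing reader. A minimal sketch of that idiom is given below; it is not the published method, which additionally guarantees the safety of global references and bounds the amount of unreclaimed memory.

```cpp
#include <atomic>

// Increment-if-nonzero: only succeed while the object is still live.
bool try_acquire(std::atomic<long>& refcount) {
    long c = refcount.load();
    while (c != 0) {
        if (refcount.compare_exchange_weak(c, c + 1))
            return true;               // safe to use the object
        // on failure, c holds the freshly observed count; re-check for zero
    }
    return false;                      // object is already being reclaimed
}

// Release a reference; the thread that drops the count to zero reclaims.
bool release(std::atomic<long>& refcount) {
    return refcount.fetch_sub(1) == 1; // true: the caller must free the object
}
```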

20.
With the continuous development of network technology, distributed systems have been widely studied and applied. However, because network bandwidth in a distributed system is limited and the number of critical resources is fixed, it is important to design distributed mutual exclusion algorithms that impose a light network load and achieve high utilization of critical resources. This paper first reviews several traditional mutual exclusion algorithms and compares their performance; building on this analysis, it proposes a new token-based algorithm and describes its design ideas and data structures in detail. The main feature of the algorithm is that it introduces the concepts of priority and election algorithms into distributed mutual exclusion, which effectively improves the efficiency of inter-process communication.

