期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors

《Journal of Parallel and Distributed Computing》1998,51(1):1-26

Most multiprocessors are multiprogrammed to achieve acceptable response time and to increase their utilization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two principal strategies for a concurrent, atomic update of shared data structures: (1)preemption-safe lockingand (2)nonblocking(lock-free)algorithms. Preemption-safe locking requires kernel support. Nonblocking algorithms generally require a universal atomic primitive such ascompare-and-swaporload-linked/store-conditionaland are widely regarded as inefficient. We evaluate the performance of preemption-safe lock-based and nonblocking implementations of important data structures—queues, stacks, heaps, and counters—including nonblocking and lock-based queue algorithms of our own, in microbenchmarks and real applications on a 12-processor SGI Challenge multiprocessor. Our results indicate that our nonblocking queue consistently outperforms the best known alternatives and that data-structure-specific nonblocking algorithms, which exist for queues, stacks, and counters, can work extremely well. Not only do they outperform preemption-safe lock-based algorithms on multiprogrammed machines, they also outperform ordinary locks on dedicated machines. At the same time, since general-purpose nonblocking techniques do not yet appear to be practical, preemption-safe locks remain the preferred alternative for complex data structures: they outperform conventional locks by significant margins on multiprogrammed systems. 相似文献

2.

On the inherent weakness of conditional primitives

Faith Ellen Fich Danny Hendler Nir Shavit 《Distributed Computing》2006,18(4):267-277

Some well-known primitive operations, such as compare-and-swap, can be used, together with read and write, to implement any object in a wait-free manner. However, this paper shows that, for a large class of objects, including counters, queues, stacks, and single-writer snapshots, wait-free implementations using only these primitive operations and a large class of other primitive operations cannot be space efficient: the number of base objects required is at least linear in the number of processes that share the implemented object. The same lower bounds are obtained for implementations of starvation-free mutual exclusion using only primitive operations from this class. For wait-free implementations of a closely related class of one-time objects, lower bounds on the tradeoff between time and space are presented. 相似文献

3.

Linearizable counting networks

Maurice Herlihy Nir Shavit Orli Waarts 《Distributed Computing》1996,9(4):193-203

Summary. The counting problem requires n asynchronous processes to assign themselves successive values. A solution is linearizable if the order of the values assigned reflects the real-time order in which they were requested. Linearizable counting lies at the heart of concurrent time-stamp generation, as well as concurrent implementations of shared counters, FIFO buffers, and similar data structures. We consider solutions to the linearizable counting problem in a multiprocessor architecture in which processes communicate by applying read-modify-write operations to a shared memory. Linearizable counting algorithms can be judged by three criteria: the memory contention produced, whether processes are required to wait for one another, and how long it takes a process to choose a value (the latency). A solution is ideal if it has low contention, low latency, and it eschews waiting. The conventional software solution, where processes synchronize at a single variable, avoids waiting and has low latency, but has high contention. In this paper we give two new constructions based on counting networks, one with low latency and low contention, but that requires processes to wait for one another, and one with low contention and no waiting, but that has high latency. Finally, we prove that these trade-offs are inescapable: an ideal linearizable counting algorithm is impossible. Since ideal non-linearizable counting algorithms exist, these results establish a substantial complexity gap between linearizable and non-linearizable counting. Received: November 1991 / Accepted: July 1995 相似文献

4.

Solo-valency and the cost of coordination

Danny Hendler Nir Shavit 《Distributed Computing》2008,21(1):43-54

This paper introduces solo-valency, a variation on the valency proof technique originated by Fischer, Lynch, and Paterson. The new technique focuses on critical events that influence the responses of solo runs by individual operations, rather than on critical events that influence a protocol’s single decision value. It allows us to derive lower bounds on the time to perform an operation for lock-free implementations of concurrent objects such as linearizable queues, stacks, sets, hash tables, counters, approximate agreement, and more. Time is measured as the number of distinct base objects accessed and the number of stalls caused by contention in accessing memory, incurred by a process as it performs a single operation. We introduce the influence level metric that quantifies the extent to which the response of a solo execution of one process can be changed by other processes. We then prove the existence of a relationship between the space complexity, latency, contention and influence level of all lock-free object implementations. Our results are broad in that they hold for implementations that may use any collection of read-modify-write operations in addition to read and write, and in that they apply even if base objects have unbounded size. Part of this work was done while the first author was a graduate student in the Department of Computer Science, Tel-Aviv University. This work was supported in part by a grant from Sun Microsystems. 相似文献

5.

Scalable Room Synchronizations

Guy E. Blelloch Perry Cheng Phillip B. Gibbons 《Theory of Computing Systems》2003,36(5):397-430

This paper presents a scalable solution to the group mutual exclusion problem, with applications to linearizable stacks and queues, and related problems. Our solution allows entry and exit from the mutually exclusive regions in $O(t_r + \tau)$ time, where $t_r$ is the maximum time spent in a critical region by a user, and $\tau$ is the maximum time taken by any instruction, including a fetch-and-add instruction. This bound holds regardless of the number of users. We describe how stacks and queues can be implemented using two regions, one for pushing (enqueueing) and one for popping (dequeueing). These implementations are particularly simple, are linearizable, and support access in time proportional to a fetch-and-add operation. In addition, we present experimental results comparing room synchronizations with the Keane–Moir algorithm for group mutual exclusion. 相似文献

6.

Rank order filters and priority queues

Anne Kaldewaij Jan Tijmen Udding 《Distributed Computing》1992,6(2):99-105

Summary A derivation of a parallel algorithm for rank order filtering is presented. Both derivation and result differ from earlier designs: the derivations are less complicated and the result allows a number of different implementations. The same derivation is used to design a collection of priority queues. Both filters and priority queues are highly efficient: they have constant response time and small latency. Anne Kaldewaij received an M.Sc. degree in Mathematics from the University of Utrecht (The Netherlands) and a Ph.D. degree in Computing Science from the Eindhoven University of Technology. Currently, he is associate professor in Computing Science at Eindhoven University. His research includes parallel programming and the design of algorithms and data structures. He enjoys teaching and he has written a number of textbooks on mathematics and programming. Jan Tijmen Udding received an M.Sc. degree in Mathematics in 1980 and a Ph.D. degree in Computing Science in 1984 from Eindhoven University of Technology. Currently, he is associate professor at Groningen University. His main research interests are mathematical aspects of VLSI, program derivation and correctness, and functional programming. 相似文献

7.

Bounded-wait combining: constructing robust and high-throughput shared objects

Danny Hendler Shay Kutten 《Distributed Computing》2009,21(6):405-431

Shared counters are among the most basic coordination structures in distributed computing. Known implementations of shared counters are either blocking, non-linearizable, or have a sequential bottleneck. We present the first counter algorithm that is both linearizable, non-blocking, and can provably achieve high throughput in k-synchronous executions—executions in which process speeds vary by at most a constant factor k. The algorithm is based on a novel variation of the software combining paradigm that we call bounded-wait combining (BWC). It can thus be used to obtain implementations, possessing the same properties, of any object that supports combinable operations, such as a stack or a queue. Unlike previous combining algorithms where processes may have to wait for each other indefinitely, in the BWC algorithm, a process only waits for other processes for a bounded period of time and then “takes destiny in its own hands”. In order to reason rigorously about the parallelism attainable by our algorithm, we define a novel metric for measuring the throughput of shared objects, which we believe is interesting in its own right. We use this metric to prove that our algorithm achieves throughput of Ω(N/ log N) in k-synchronous executions, where N is the number of processes that can participate in the algorithm. Our algorithm uses two tools that we believe may prove useful for obtaining highly parallel non-blocking implementation of additional objects. The first are “synchronous locks”, locks that are respected by processes only in k-synchronous executions and are disregarded otherwise; the second are “pseduo-transactions”—a weakening of regular transactions that allows higher parallelism. A preliminary version of this paper appeared in [11]. D. Hendler is supported in part by a grant from the Israel Science Foundation. S. Kutten is supported in part by a grant from the Israel Science Foundation. 相似文献

8.

Optimal and load balanced mapping of parallel priority queues inhypercubes

Das S.K. Pinotti M.C. Sarkar F. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(6):555-564

We efficiently map a priority queue on the hypercube architecture in a load balanced manner, with no additional communication overhead, and present optimal parallel algorithms for performing insert and deletemin operations. Two implementations for such operations are proposed on the single port hypercube model. In a b-bandwidth, n-item priority queue in which every node contains b items in sorted order, the first implementation achieves optimal speed up of O(min{log n, b log n/log b+log log n}) for inserting b presorted items or deleting b smallest items, where b=O(n¹c/) with c>1. In particular, single insertion and deletion operations are cost optimal and require O(log n/p+log p) time using O(log n/log log n) processors. The second implementation is more scalable since it uses a larger number of processors, and attains a “nearly” optimal speedup on the single hypercube. Namely, the insertion of log n presorted items or the deletion of the log n smallest items is accomplished in O(log log n²) time using O(log² n/log log n) processors. Finally, on the slightly more powerful pipelined hypercube model, the second implementation performs log n operations in O(log log n) time using O(log² n/log log n) processors, thus achieving an optimal speed up. To the best of our knowledge, our algorithms are the first implementations of b-bandwidth distributed priority queues, which are load balanced and yet guarantee optimal speed ups 相似文献

9.

Fast and lock-free concurrent priority queues for multi-thread systems

《Journal of Parallel and Distributed Computing》2005,65(5):609-627

We present an efficient and practical lock-free implementation of a concurrent priority queue that is suitable for both fully concurrent (large multi-processor) systems as well as pre-emptive (multi-process) systems. Many algorithms for concurrent priority queues are based on mutual exclusion. However, mutual exclusion causes blocking which has several drawbacks and degrades the overall performance of the system. Non-blocking algorithms avoid blocking, and several implementations have been proposed. Previously known non-blocking algorithms of priority queues did not perform well in practice because of their complexity, and they are often based on non-available atomic synchronization primitives. Our algorithm is based on the randomized sequential list structure called Skiplist, and a real-time extension of our algorithm is also described. In our performance evaluation we compare our algorithm with a well-representable set of earlier known implementations of priority queues. The experimental results clearly show that our lock-free implementation outperforms the other lock-based implementations in practical scenarios for 3 threads and more, both on fully concurrent as well as on pre-emptive systems. 相似文献

10.

Randomized Priority Queues for Fast Parallel Access

《Journal of Parallel and Distributed Computing》1998,49(1):86-97

We present simple randomized algorithms for parallel priority queues on distributed memory machines. Inserting O(n) elements or deleting the O(n) out ofmsmallest elements usingnprocessors requires O(T_coll+ log(m/n)) amortized time with high probability whereT_collbounds the time for performing prefix sums and randomized routing. The memory requirement is bounded by (m/n)(1 +o(1)) + O(logn) whp. These bounds are an improvement over the best previously known algorithms for many interconnection networks and even matches the speed of the best known PRAM algorithms. Generalizations for accessing thek⪢nsmallest elements are even more efficient. A portable implementation using MPI demonstrates that our approach is already useful for medium scale parallelism. Two parallel selection algorithms for randomly placed data are a spin-off. One runs in time O(T_coll) with high probability, beating a lower bound for the worst case. The other requires only a single reduction operation. 相似文献

11.

Detecting Race Conditions in Parallel Programs that Use Semaphores

Klein Netzer Lu 《Algorithmica》2008,35(4):321-345

Abstract. We address the problem of detecting race conditions in programs that use semaphores for synchronization. Netzer and Miller showed that it is NP-complete to detect race conditions in programs that use many semaphores. We show in this paper that it remains NP-complete even if only two semaphores are used in the parallel programs. For the tractable case, i.e., using only one semaphore, we give two algorithms for detecting race conditions from the trace of executing a parallel program on p processors, where n semaphore operations are executed. The first algorithm determines in O(n) time whether a race condition exists between any two given operations. The second algorithm runs in O( np log n) time and outputs a compact representation from which one can determine in O(1) time whether a race condition exists between any two given operations. The second algorithm is near-optimal in that the running time is only O( log n) times the time required simply to write down the output. 相似文献

12.

Time and Space Lower Bounds for Implementations Using k-CAS

Attiya Hagit Hendler Danny 《Parallel and Distributed Systems, IEEE Transactions on》2010,21(2):162-173

This paper presents lower bounds on the time and space complexity of implementations that use k-compare&swap (k-CAS) synchronization primitives. We prove that using k-CAS primitives can improve neither the time nor the space complexity of implementations of widely used concurrent objects, such as counter, stack, queue, and collect. Surprisingly, overly restrictive use of k-CAS may even increase the space complexity required by such implementations. We prove a lower bound of Omega (log_2 n) on the round complexity of implementations of a collect object using read, write, and k-CAS, for any k, where n is the number of processes in the system. There is an implementation of collect with O(log_2 n) round complexity that uses only reads and writes. Thus, our lower bound establishes that k-CAS is no stronger than read and write for collect implementation round complexity. For k-CAS operations that return the values of all the objects they access, we prove that the total step complexity of implementing key objects such as counters, stacks, and queues is Omega (n log_k n). We also prove that k-CAS cannot improve the space complexity of implementing many objects (including counter, stack, queue, and single-writer snapshot). An implementation has to use at least n base objects even if k-CAS is allowed, and if all operations (other than read) swap exactly k base objects, then it must use at least k cdot n base objects. 相似文献

13.

Priority queues and sorting methods for parallel simulation

Grammatikakis M.D. Liesche S. 《IEEE transactions on pattern analysis and machine intelligence》2000,26(5):401-422

The authors examine the design, implementation, and experimental analysis of parallel priority queues for device and network simulation. They consider: 1) distributed splay trees using MPI; 2) concurrent heaps using shared memory atomic locks; and 3) a new, more general concurrent data structure based on distributed sorted lists, designed to provide dynamically balanced work allocation and efficient use of shared memory resources. We evaluate performance for all three data structures on a Cray-TSESOO system at KFA-Julich. Our comparisons are based on simulations of single buffers and a 64×64 packet switch which supports multicasting. In all implementations, PEs monitor traffic at their preassigned input/output ports, while priority queue elements are distributed across the Cray-TBE virtual shared memory. Our experiments with up to 60000 packets and two to 64 PEs indicate that concurrent priority queues perform much better than distributed ones. Both concurrent implementations have comparable performance, while our new data structure uses less memory and has been further optimized. We also consider parallel simulation for symmetric networks by sorting integer conflict functions and implementing a packet indexing scheme. The optimized message passing network simulator can process ~500 K packet moves in one second, with an efficiency that exceeds ~50 percent for a few thousand packets on the Cray-T3E with 32 PEs. All developed data structures form a parallel library. Although our concurrent implementations use the Cray-TSE ShMem library, portability can be derived from Open-MP or MP1-2 standard libraries, which will provide support for one-way communication and shared memory lock mechanisms 相似文献

14.

Automating the Selection of Implementation Structures

《IEEE transactions on pattern analysis and machine intelligence》1978,(6):494-506

An approach to automating the selection of implementations for the data representations used in a program is presented. Formalisms are developed for specifying data representations and implementation structures. Using these formalisms, algorithms are presented which will recognize the use in a program of known data representations (e. g., stacks, queues, lists, arrays, etc.) so that alternative implementations can be retrieved from a library and which will, for those representations not recognized, generate alternative implementations with a wide range of space-time tradeoffs. Experience with using these algorithms indicates they are reasonably successful, although there are several problems that must be solved before automatic implementation selection systems will be practical. 相似文献

15.

The group lock and its applications

Isaac A. Dimitrovsky 《Journal of Parallel and Distributed Computing》1991,11(4)

Parallel algorithms that use shared resources are notoriously difficult to write and verify, especially when properties such as fairness and efficiency are desired. This paper introduces a new synchronization method called the group lock. This method has some of the advantages of strict synchronization methods such as monitors; algorithms written using group lock are quite clear and easy to verify. At the same time, these algorithms generally make better use of parallelism than those written using stricter synchronization methods. In many cases we can obtain worst case time bounds that are constant or logarithmic in the number of processes, thus also insuring bounded fairness. The paper begins with a discussion of related synchronization methods and an introduction to the group lock. This is followed by some examples of applications in algorithms for a readers/writers problem, parallel stacks and queues, implementation of fetch- and-phi routines, and others. Finally, two implementations of group lock are described. A detailed implementation is given for the paracomputer, an idealized MIMD multiprocessor that supports the fetch-and-add synchronization instruction. An implementation is also outlined for general CREW multiprocessors using only reads and writes to shared memory. 相似文献

16.

优先队列的并行插入和删除

孙凝晖李国杰《计算机研究与发展》1993,(3)

优先队列广泛地使用在许多并行算法中(例如,多处理机调度和某些组合优化算法)。在这些算法中,共享优先队列的存取冲突限制了加速比的提高。本文提出一种链表优先队列的并行插入和删除方法,具有较小并行开销和较大的并行度,并且保证和串行存取算法的优先顺序完全一致,即删除操作返回已经插入和正在插入的所有元素中的最佳元素。同时,我们还介绍了目前性能最好的堆的并行插入和删除算法,并对准和链表结构并行插入和删除算法的性能和适用范围进行了比较,进一步提出了散列结构的优先队列。在ENCORE Multimax520多处理机上的实验结果验证了我们的理论分析结果:使用链表结构的并行分枝限界算法性能上可获得很大提高。相似文献

17.

Revisiting priority queues for image analysis

Cris L. Luengo Hendriks^{Author Vitae} 《Pattern recognition》2010,43(9):3003-3012

Many algorithms in image analysis require a priority queue, a data structure that holds pointers to pixels in the image, and which allows efficiently finding the pixel in the queue with the highest priority. However, very few articles describing such image analysis algorithms specify which implementation of the priority queue was used. Many assessments of priority queues can be found in the literature, but mostly in the context of numerical simulation rather than image analysis. Furthermore, due to the ever-changing characteristics of computing hardware, performance evaluated empirically 10 years ago is no longer relevant. In this paper I revisit priority queues as used in image analysis routines, evaluate their performance in a very general setting, and come to a very different conclusion than other authors: implicit heaps are the most efficient priority queues. At the same time, I propose a simple modification of the hierarchical queue (or bucket queue) that is more efficient than the implicit heap for extremely large queues. 相似文献

18.

Parallel heap: An optimal parallel priority queue

Narsingh Deo Sushil Prasad 《The Journal of supercomputing》1992,6(1):87-98

We describe a new parallel data structure, namely parallel heap, for exclusive-read exclusive-write parallel random access machines. To our knowledge, it is the first such data structure to efficiently implement a truly parallel priority queue based on a heap structure. Employing p processors, the parallel heap allows deletions of (p) highest priority items and insertions of (p) new items, each in O(log n) time, where n is the size of the parallel heap. Furthermore, it can efficiently utilize processors in the range 1 through n.This work was supported by U.S. Army's PM-TRADE contract N61339-88-g-0002, Florida High Technology and Industry grant 11-28-716, and Georgia State University's internal research support during spring and summer quarters, 1991. 相似文献

19.

High-Level Parallel Ant Colony Optimization with Algorithmic Skeletons

de Melo Menezes Breno A. Herrmann Nina Kuchen Herbert Buarque de Lima Neto Fernando 《International journal of parallel programming》2021,49(6):776-801

Parallel implementations of swarm intelligence algorithms such as the ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of Algorithmic Skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that later on will be converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain specific language Musket can cope with the development of a parallel implementation of ACO and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high performance code with similar execution times when compared to low-level implementations.

相似文献

20.

Two new methods for constructing double-ended priority queues from priority queues

Amr Elmasry Claus Jensen Jyrki Katajainen 《Computing》2008,83(4):193-204

We introduce two data-structural transformations to construct double-ended priority queues from priority queues. To apply our transformations the priority queues exploited must support the extraction of an unspecified element, in addition to the standard priority-queue operations. With the first transformation we obtain a double-ended priority queue which guarantees the worst-case cost of O(1) for find-min, find-max, insert, extract; and the worst-case cost of O(lg n) with at most lg n + O(1) element comparisons for delete. With the second transformation we get a meldable double-ended priority queue which guarantees the worst-case cost of O(1) for find-min, find-max, insert, extract; the worst-case cost of O(lg n) with at most lg n + O(lg lg n) element comparisons for delete; and the worst-case cost of O(min {lg m, lg n}) for meld. Here, m and n denote the number of elements stored in the data structures prior to the operation in question. The work of the authors was partially supported by the Danish Natural Science Research Council under contracts 21-02-0501 (project Practical data structures and algorithms) and 272-05-0272 (project Generic programming—algorithms and tools). A. Elmasry was supported by Alexander von Humboldt fellowship. 相似文献