High-level parallel programming models supporting dynamic fine-grained threads in a global object space are becoming increasingly popular for expressing irregular applications based on sophisticated adaptive algorithms and pointer-based data structures. However, implementing these multithreaded computations on scalable parallel machines poses significant challenges, particularly with respect to object caching. Object caching techniques must be able to tolerate unresponsive processors and protocol handler occupancy delays. This paper examines whether these challenges can be offset by leveraging responsive general-purpose communication architectural features (such as remote memory access and atomic operations), possibly compensating for the lack of more sophisticated hardware primitives by relying upon increased involvement of the run-time system and the compiler. A detailed performance analysis of four irregular applications, using the Illinois Concert System on the Cray T3D and the SGI Origin 2000, finds that existing software distributed shared memory (DSM) systems are capable of delivering good performance only in the presence of a high level of responsive communication architecture support (specifically, support for remote atomic operations). Recognizing that this situation stems from the synchronous request–reply nature of DSM protocols, we present a composable object caching framework, called view caching, which exploits knowledge of application data access semantics to construct custom protocols that require reduced processor synchronization. View caching protocols are more tolerant of responsiveness and occupancy delays and are able to exploit even lower-level responsive communication primitives (such as nonatomic remote memory accesses) for a performance benefit.

The compare-and-swap register (CAS) is a synchronization primitive for lock-free algorithms. Most uses of it, however, suffer from the so-called ABA problem. The simplest and most efficient solution to the ABA problem is to include a tag with the memory location such that the tag is incremented with each update of the target location. This solution, however, is theoretically unsound and has limited applicability. This paper presents a general lock-free pattern that is based on the synchronization primitive CAS without causing the ABA problem or wrap-around problems. It can be used to provide lock-free functionality for any data type. Our algorithm is a CAS variation of Herlihy's LL/SC methodology for lock-free transformation. The basis of our technique is to poll different locations on reading and writing objects in such a way that the consistency of an object can be checked by its location instead of its tag. It consists of simple code that can be easily implemented using C-like languages. A true problem of lock-free algorithms is that they are hard to design correctly, which holds even for apparently straightforward algorithms. We therefore develop a reduction theorem that enables us to reason about the general lock-free algorithm to be designed on a higher level than the synchronization primitives. The reduction theorem is based on Lamport's refinement mappings, and has been verified with the higher-order interactive theorem prover PVS. Using the reduction theorem, fewer invariants are required and some invariants are easier to discover and formulate without considering the internal structure of the final implementation.
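
As an illustration of the ABA problem and the tag-based remedy that the abstract above mentions, here is a minimal single-threaded Python sketch. It only simulates a CAS register; the class `Cell`, its methods `cas` and `tagged_cas`, and the function `aba_demo` are hypothetical names chosen for this example, and the interleaving of "threads" is written out by hand. It shows the simple tag scheme, not the location-based algorithm the paper proposes.

```python
class Cell:
    """Simulated memory word holding a value and an update counter (tag)."""

    def __init__(self, value):
        self.value = value
        self.tag = 0  # incremented on every successful update

    def cas(self, expected, new):
        """Plain compare-and-swap: compares the value only."""
        if self.value == expected:
            self.value = new
            self.tag += 1
            return True
        return False

    def tagged_cas(self, expected, expected_tag, new):
        """Tagged compare-and-swap: value and tag must both match."""
        if self.value == expected and self.tag == expected_tag:
            self.value = new
            self.tag += 1
            return True
        return False


def aba_demo():
    # Plain CAS: thread 1 reads "A", thread 2 changes A -> B -> A.
    cell = Cell("A")
    seen = cell.value
    cell.cas("A", "B")
    cell.cas("B", "A")
    # Thread 1's CAS succeeds although the cell was modified twice.
    plain_ok = cell.cas(seen, "C")

    # Tagged CAS: the same interleaving, but the tag is remembered too.
    cell = Cell("A")
    seen, seen_tag = cell.value, cell.tag
    cell.tagged_cas("A", 0, "B")
    cell.tagged_cas("B", 1, "A")
    # Thread 1's CAS now fails because the tag is 2, not 0.
    tagged_ok = cell.tagged_cas(seen, seen_tag, "C")
    return plain_ok, tagged_ok
```

Python integers never overflow, so this sketch cannot reproduce the wrap-around weakness of a fixed-width hardware tag, which is the limitation the abstract refers to.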

Explicit multithreading (XMT) is a parallel programming approach for exploiting on-chip parallelism. XMT introduces a computational framework with (1) a simple programming style that relies on fine-grained PRAM-style algorithms; (2) hardware support for low-overhead parallel threads, scalable load balancing, and efficient synchronization. The missing link between the algorithmic-programming level and the architecture level is provided by the first prototype XMT compiler. This paper also takes this new opportunity to evaluate the overall effectiveness of the interaction between the programming model and the hardware, and to enhance its performance where needed by incorporating new optimizations into the XMT compiler. We present a wide range of applications which, when written in XMT, obtain significant speedups relative to the best serial programs. We show that XMT is especially useful for more advanced applications with dynamic, irregular access patterns, while for regular computations we demonstrate performance gains that scale up to much higher levels than have been demonstrated before for on-chip systems.

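刘恒, 杨小帆. 《计算机应用研究》 (Application Research of Computers), 2012, 29(10): 3772-3775
Dynamic memory management is especially critical to the performance of lock-free dynamic data structures, because dynamic memory management in a multithreaded environment involves costly synchronization operations. This paper proposes a method for building memory pools for dynamic lock-free data structures that reduces dynamic memory usage and the accompanying memory management overhead. The method reduces memory overhead by balancing the dynamic memory consumption of threads; memory pools built with it are based on thread-private lock-free circular queues that support node stealing. The method has the following advantages: (a) the memory pools it builds are lock-free; (b) it balances the heap memory consumption of threads; (c) it integrates easily with dynamic lock-free data structures. Experimental results show that resource-stealing memory pools built with this method scale well and, under high load, effectively reduce the heap memory consumption and operation execution time of lock-free data structures. The balancing algorithm largely determines memory consumption, and the scalability of the memory pool under high load is also affected by the multithreaded access performance of the underlying data structure.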
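
The node-stealing idea in the abstract above can be sketched roughly as follows. This is a single-threaded Python illustration under strong assumptions: the per-thread free lists are ordinary `collections.deque` objects rather than lock-free circular queues, the victim-selection rule (steal from the fullest list) is invented for this example, and `StealingPool` and `pool_demo` are hypothetical names. It is meant only to show how stealing lets one thread reuse nodes freed by another instead of allocating from the heap.

```python
from collections import deque


class StealingPool:
    """Per-thread free lists of reusable nodes, with stealing between them."""

    def __init__(self, num_threads, capacity):
        self.free = [deque() for _ in range(num_threads)]
        self.capacity = capacity      # maximum nodes cached per thread
        self.heap_allocations = 0     # counts fresh allocations

    def acquire(self, tid):
        # Fast path: reuse a node from the caller's own free list.
        if self.free[tid]:
            return self.free[tid].pop()
        # Otherwise try to steal from the thread holding the most free nodes.
        victim = max(range(len(self.free)), key=lambda t: len(self.free[t]))
        if self.free[victim]:
            return self.free[victim].popleft()
        # Nothing to reuse anywhere: fall back to a fresh allocation.
        self.heap_allocations += 1
        return {}

    def release(self, tid, node):
        # Cache the node for reuse unless the caller's list is already full.
        if len(self.free[tid]) < self.capacity:
            self.free[tid].append(node)


def pool_demo():
    pool = StealingPool(num_threads=2, capacity=4)
    a = pool.acquire(0)      # fresh allocation
    pool.release(0, a)       # thread 0 caches the node
    b = pool.acquire(1)      # thread 1 steals it instead of allocating
    return b is a, pool.heap_allocations
```

In this sketch a single heap allocation serves both threads, which is the kind of balancing effect the abstract attributes to stealing.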

Despite the large number of Byzantine fault-tolerant algorithms for message-passing systems designed through the years, only recently have algorithms appeared for the coordination of processes subject to Byzantine failures using shared memory. This paper presents a new computing model in which shared memory objects are protected by fine-grained access policies, and a new shared memory object, the Policy-Enforced Augmented Tuple Space (PEATS). We show the benefits of this model by providing simple and efficient consensus algorithms. These algorithms are much simpler, require fewer shared memory operations, and use fewer memory bits than previous algorithms based on ACLs and sticky bits. We also prove that PEATS objects are universal, i.e., that they can be used to implement any other shared memory object, and present lock-free and wait-free universal constructions.

Dynamic storage allocation is a vital component of programming systems intended for multiprocessor architectures that support globally shared memory. Highly parallel algorithms for access to system data structures lie at the core of effective memory allocation strategies as well as solutions to other parallel systems problems. In this paper, we investigate four algorithms, all based on the first fit approach, that provide different granularities of parallel access to the allocator's data structures. These solutions employ a variety of design techniques including specialized locking protocols, the use of atomic fetch-and-Φ operations, and structural modifications. We describe experiments designed to compare the performance of these schemes. The results show that simple algorithms are appropriate when the expected number of concurrent requests per memory is low and the request pattern is not bursty. Algorithms that support finer granularity access while avoiding locking protocols are successful in a range of larger processor/memory ratios. This research was supported in part by the National Science Foundation under Grant Number DCR 8320136, DARPA/U.S. Army Engineer Topographic Laboratories under contract number DACA76-85-C-0001, and Unisys Corporation. A preliminary version appeared in International Conference on Parallel Processing, August 1987.
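
For readers unfamiliar with the first fit approach that all four algorithms in the abstract above build on, a minimal sequential sketch in Python follows. The function name `first_fit_alloc` and the free-list representation (a list of `[start, length]` blocks sorted by address) are assumptions made for this example; the parallel access schemes studied in the paper are not shown.

```python
def first_fit_alloc(free_list, size):
    """Allocate `size` units from the first free block that is large enough.

    free_list is a list of [start, length] blocks sorted by start address
    and is modified in place. Returns the start address of the allocated
    region, or None if no block can satisfy the request.
    """
    for i, (start, length) in enumerate(free_list):
        if length >= size:
            if length == size:
                # Exact fit: the block disappears from the free list.
                free_list.pop(i)
            else:
                # Split: the remainder stays on the free list.
                free_list[i] = [start + size, length - size]
            return start
    return None
```

With `free = [[0, 4], [10, 8], [30, 16]]`, a request for 6 units skips the first block, takes addresses 10 to 15 from the second, and leaves `[16, 2]` behind.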

We proposed in Ref. 5) a new, message-oriented implementation technique for Moded Flat GHC that compiled unification for data transfer into message passing. The technique was based on constraint-based program analysis, and significantly improved the performance of programs that used goals and streams to implement reconfigurable data structures. In this paper we discuss how the technique can be parallelized. We focus on a method for shared-memory multiprocessors, called the shared-goal method, though a different method could be used for distributed-memory multiprocessors. Unlike other parallel implementations of concurrent logic languages, which we call process-oriented, the unit of parallel execution is not an individual goal but a chain of message sends caused successively by an initial message send. Parallelism comes from the existence of different chains of message sends that can be executed independently or in a pipelined manner. Mutual exclusion based on busy waiting and on message buffering controls access to individual, shared goals. Typical goals allow last-send optimization, the message-oriented counterpart of last-call optimization. We have built an experimental implementation on Sequent Symmetry. In spite of the simple scheduling currently adopted, preliminary evaluation shows good parallel speedup and good absolute performance for concurrent operations on binary process trees.

Most multiprocessors are multiprogrammed to achieve acceptable response time and to increase their utilization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two principal strategies for a concurrent, atomic update of shared data structures: (1) preemption-safe locking and (2) nonblocking (lock-free) algorithms. Preemption-safe locking requires kernel support. Nonblocking algorithms generally require a universal atomic primitive such as compare-and-swap or load-linked/store-conditional and are widely regarded as inefficient. We evaluate the performance of preemption-safe lock-based and nonblocking implementations of important data structures (queues, stacks, heaps, and counters), including nonblocking and lock-based queue algorithms of our own, in microbenchmarks and real applications on a 12-processor SGI Challenge multiprocessor. Our results indicate that our nonblocking queue consistently outperforms the best known alternatives and that data-structure-specific nonblocking algorithms, which exist for queues, stacks, and counters, can work extremely well. Not only do they outperform preemption-safe lock-based algorithms on multiprogrammed machines, they also outperform ordinary locks on dedicated machines. At the same time, since general-purpose nonblocking techniques do not yet appear to be practical, preemption-safe locks remain the preferred alternative for complex data structures: they outperform conventional locks by significant margins on multiprogrammed systems.

This paper considers hardware support for the exploitation of control parallelism on data parallel architectures. It is well known that data parallel algorithms may also possess control parallel structure. However, the splitting of control leads to data dependency and synchronization issues that were implicitly handled in conventional SIMD architectures. These include synchronization of access to scalar and parallel variables, and synchronization for parallel communication operations. We propose a sharing mechanism for scalar variables and identify a strategy which allows synchronization of scalar variables between multiple streams. The techniques considered are based on a bit-interleaved register file structure which allows fast copy between register sets. Hardware cost estimates and timing analyses are provided, and comparison with an alternate scheme is presented. The register file structure has been designed and simulated for the HP 0.8 μm CMOS process, and circuit simulation indicates access times are less than six nanoseconds. In addition, the impact of this structure on system performance is also studied.

We present an efficient and practical lock-free implementation of a concurrent priority queue that is suitable for both fully concurrent (large multi-processor) systems and pre-emptive (multi-process) systems. Many algorithms for concurrent priority queues are based on mutual exclusion. However, mutual exclusion causes blocking, which has several drawbacks and degrades the overall performance of the system. Non-blocking algorithms avoid blocking, and several implementations have been proposed. Previously known non-blocking algorithms for priority queues did not perform well in practice because of their complexity, and they are often based on atomic synchronization primitives that are not available. Our algorithm is based on the randomized sequential list structure called Skiplist, and a real-time extension of our algorithm is also described. In our performance evaluation we compare our algorithm with a representative set of earlier known implementations of priority queues. The experimental results clearly show that our lock-free implementation outperforms the lock-based implementations in practical scenarios for three or more threads, on both fully concurrent and pre-emptive systems.

Garbage collection algorithms for shared-memory multiprocessors typically rely on some form of global synchronization to preserve consistency. Such global synchronization may lead to problems on asynchronous architectures: if one process is halted or delayed, other, nonfaulty processes will be unable to progress. By contrast, a storage management algorithm is lock-free if (in the absence of resource exhaustion) a process that is allocating or collecting memory can be delayed at any point without forcing other processes to block. The authors present the first algorithm for lock-free garbage collection in a realistic model. The algorithm assumes that processes synchronize by applying read, write, and compare&swap operations to shared memory. This algorithm uses no locks, busy-waiting, or barrier synchronization; it does not assume that processes can observe or modify one another's local variables or registers; and it does not use inter-process interrupts.

In shared memory multiprocessors, efficient synchronization is imperative to assure good performance. There are two aspects to the "cost" of a synchronization operation: the first is the waiting time at synchronization points, and the second is the intrinsic overhead in performing the operation. The overhead has two components. The first component deals with contention resolution for the synchronization operation among competing processors. The second component deals with the shared data accesses that the processor has to perform once it enters a synchronization region. We present a mechanism to reduce the overhead of performing synchronization operations in a cache-based shared memory multiprocessor. The mechanism is based on the intuitive notion that parallel programs invariably use synchronization operations to govern the access to shared data. Traditional multiprocessor cache protocols treat synchronization accesses the same way as normal read/write memory accesses, leading to inefficiencies in performing synchronization operations which ultimately limit the scalability of such systems. The key idea in our approach is to combine synchronization with the coherence maintenance for the cached data. Each cache line maintains states for synchronization as well as for cache coherence, and the cache protocol ensures the correctness of the synchronization operations and the coherence of the data at these synchronization points. To assess the performance gain due to the proposed mechanism, simulation studies are performed using a workload model that represents a dynamic scheduling paradigm which forms the core of several parallel programs. Results from simulation studies show that the proposed cache-based synchronization performs significantly better than traditional cache coherence approaches.

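孙乔, 黎雷生, 赵海涛, 赵慧, 吴长茂. 《软件学报》 (Journal of Software), 2021, 32(8): 2352-2364
Task parallelism is a fundamental design pattern in parallel programming. However, because of the complexity of the algorithms themselves and the peculiarities of the target platform, designing and implementing efficient task-parallel programs is often challenging for programmers. Based on the emerging SW26010 many-core CPU, this paper proposes SWAN, a general-purpose runtime framework that supports nested task parallelism. SWAN provides a high-level abstraction for implementing task-parallel programs, letting programmers focus on the algorithmic logic and thereby improving development efficiency. For performance, SWAN partitions many shared resources at fine granularity, which effectively avoids heavy contention for shared resources among the many threads. The core data structures of SWAN are optimized to reduce the framework's own overhead by fully exploiting platform features such as the high-speed memory access mechanism, the high-speed controllable cache, and atomic operations. SWAN also provides dynamic load balancing so that the resources of every processor core are fully utilized. Several typical nested parallel algorithms with recursive structure were implemented on the target platform using SWAN, including the N-queens problem, binary tree traversal, quicksort, and convex hull computation. Experimental results show that the algorithms parallelized with SWAN achieve speedups of 4.5× to 32× over their serial versions, demonstrating that SWAN is both practical and performant.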

《Parallel Computing》1988,9(1):1-24
The Connection Machine is a massively parallel architecture with 65 536 single-bit processors and 32 Mbytes of memory, organized as a high-dimensional hypercube. A sophisticated router system provides efficient communication between remote processors. A rich software environment, including a parallel extension of COMMON LISP, provides access to the processors and network. Virtual processor capability extends the degree of fine-grained parallelism beyond 1 000 000. We describe the hardware and the parallel programming environment. We then present implementations of SOR, Multigrid and Conjugate Gradient algorithms for solving Partial Differential Equations on the Connection Machine. Measurements of computational efficiency are provided as well as an analysis of opportunities for achieving better performance. Despite the lack of floating-point hardware, computation rates above 100 Mflops have been achieved in PDE solution. Virtual processors prove to be a real advantage, easing the effort of software development while improving system performance significantly.

Many computationally intensive problems from science and engineering are irregular in nature. This makes it difficult to develop an efficient parallel implementation, even for shared-memory machines. As a typical example, we investigate a parallel implementation of an irregular particle simulation algorithm. We concentrate on the question of what programming and system support is needed to yield an efficient implementation for a large number of processors. As an execution platform we use the SB-PRAM, a shared memory machine with up to 2048 processors. The processors of the SB-PRAM can access the global memory in unit time, which is the basis for an exact performance prediction. Common approaches for parallel implementations, like lock protection for concurrent accesses and sequential or distributed task queues, are replaced by more efficient access mechanisms and data structures which can be realized by the powerful multiprefix operations of the SB-PRAM. Their use simplifies the implementation and yields large speedup values.

Transient simulation in circuit simulation tools, such as SPICE and Xyce, depends on scalable and robust sparse LU factorizations for efficient numerical simulation of circuits and power grids. As the need for simulations of very large circuits grows, the prevalence of multicore architectures enables us to use shared memory parallel algorithms for such simulations. A parallel factorization is a critical component of such shared memory parallel simulations. We develop a parallel sparse factorization algorithm that can solve problems from circuit simulations efficiently and maps well to architectural features. This new factorization algorithm exposes hierarchical parallelism to accommodate the irregular structures that arise in our target problems. It also uses a hierarchical two-dimensional data layout which reduces synchronization costs and maps to the memory hierarchy found in multicore processors. We present an OpenMP-based implementation of the parallel algorithm in a new multithreaded solver called Basker in the Trilinos framework. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulation. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to the state-of-the-art solver KLU. Basker outperforms the Intel MKL Pardiso solver (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides a 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.

Systems that maintain coherence at large granularity, such as shared virtual memory systems, suffer from false sharing and extra communication. Relaxed memory consistency models have been used to alleviate these problems, but at a cost in programming complexity. Release Consistency (RC) and Lazy Release Consistency (LRC) are accepted to offer a reasonable tradeoff between performance and programming complexity. Entry Consistency (EC) offers a more relaxed consistency model, but it requires explicit association of shared data objects with synchronization variables. The programming burden of providing such associations can be substantial. This paper proposes a new consistency model for such systems, called Scope Consistency (ScC), which offers most of the performance advantages of the EC model without requiring explicit bindings between data and synchronization variables. Instead, ScC dynamically detects the associations implied by the programmer, using a programming interface similar to that of RC or LRC. We propose two ScC protocols: one that uses hardware support for fine-grained remote writes (automatic updates or AU) and the other, an all-software protocol. We compare the AU-based ScC protocol with Automatic Update Release Consistency (AURC), a modified LRC protocol that also takes advantage of automatic update support. AURC already improves performance substantially over an all-software LRC protocol. For three of the five applications we used, ScC further improves the speedups achieved by AURC by about 10%. Received October 1996, and in final form July 1997.

In this paper we consider the mutual exclusion problem on a multiple access channel. Mutual exclusion is one of the fundamental problems in distributed computing. In the classic version of this problem, n processes execute a concurrent program that occasionally triggers some of them to use shared resources, such as memory, a communication channel, or a device. The goal is to design a distributed algorithm to control entries to and exits from the shared resource (also called a critical section), in such a way that at any time there is at most one process accessing it. In our considerations, the shared resource is the shared communication channel itself (multiple access channel), and the main challenge arises because the channel is also the only means of communication between these processes. We consider both the classic and a slightly weaker version of mutual exclusion, called \(\varepsilon\)-mutual-exclusion, where for each period of a process staying in the critical section the probability that there is some other process in the critical section is at most \(\varepsilon\). We show that there are channel settings where the classic mutual exclusion is not feasible even for randomized algorithms, while the \(\varepsilon\)-mutual-exclusion is. In more relaxed channel settings, we prove an exponential gap between the makespan complexity of the classic mutual exclusion problem and its weaker \(\varepsilon\)-exclusion version. We also show how to guarantee fairness of mutual exclusion algorithms, i.e., that each process that wants to enter the critical section will eventually succeed.
