首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
戴华东  杨学军 《计算机学报》2002,25(12):1387-1396
存储一致性模型对共享存储系统的正确性,性能以及程序的复杂性都有重要的影响,该文立足于分布共享存储系统,提出了一种新的存储一致性模型框架-S^3C框架,该框架通过同步点的概念来描述不同模型正确的存储访问事件顺序;通过一致性维护点的概念,对同一模型的不同实现方式也能够进行区别和比较,结合S^3C框架,该文提出一种以操作系统为中心的线程一致性模型,并针对以顺序一致性模为代表的存储一致性模型的正确实现进行了论述。  相似文献   

2.
并发Java程序同步操作的有效删除   总被引:2,自引:0,他引:2  
吴萍  陈意云  张健 《软件学报》2005,16(10):1708-1716
同步操作是并发Java程序非常大的一部分开销.在现有程序分析方法的基础上,提出了一种精确而有效的冗余同步操作的静态删除方法.该方法分为基本处理和线程间时序分析两个阶段,充分考虑了控制流结构和线程交互时序对同步删除的影响.构造了一个Java编译器JTool,并在其上实现了同步删除算法.对于确定的单线程程序,同步删除率达到100%;对于多线程程序,同步删除率高于现有的分析工具.  相似文献   

3.
多核处理器性能的发挥依靠程序的并行,共享存储并行编程模型为大多数多核处理器所采用,而有效同步多个线程对共享变量的访问是其关键、也是难题.借鉴数据库中事务的思想,人们提出事务存储(transactional memory),旨在提供一种编程简单,对程序正确性推理容易的同步手段.简介了事务存储的起源,诠释了事务存储系统的概念.论述了事务存储的编程接口和执行模型.讨论了事务存储系统所涉及的主要内容,对各种方法和策略进行了比较.对事务存储中有待解决的问题进行了探讨.最后介绍了几个开源的事务存储研究平台.  相似文献   

4.
Lock Free技术及其应用研究   总被引:1,自引:0,他引:1  
多线程环境下,线程同步伴随着多线程的产生而产生,程序员对于共享内存互斥同步访问的思维已经根深蒂固。由于锁的存在,当访问互斥锁保护的共享内存时,线程首先请求锁的允许,然后才被接受访问共享内存,否则只能等待其它线程释放锁,同时锁也带来了很多负面影响。Lock Free一个崭新的多线程共享内存的同步访问技术,抛开了锁的概念,多线程平行运行,尤其是在多核时代更显优势。但Lock Free技术应用技巧性很强,难度比较大,应用范围一直非常受限,本文旨在揭示Lock Free技术应用的一般规律,使其在多核时代大放异彩。  相似文献   

5.
Distributed shared memory (DSM) allows parallel programs to run on distributed computers by simulating a global virtual shared memory, but data racing bugs may easily occur when the threads of a multi-threaded process concurrently access the physically distributed memory. Earlier tools to help programmers locate data racing bugs in non-DSM parallel programs are not easily applied to DSM systems. This study presents the data race avoidance and replay scheme (DRARS) to assist debugging parallel programs on DSM or multi-core systems. DRARS is a novel tool which controls the consistency protocol of the target program, automatically preventing a large class of data racing bugs when the parallel program is subsequently run, obviating much of the need for manual debugging. For data racing bugs that cannot be avoided automatically, DRARS performs a deterministic replay-type function on DSM systems, faithfully reproducing the behavior of the parallel program during run time. Because one class of data racing bugs has already been eliminated, the remaining manual debugging task is greatly simplified. Unlike previous debugging methods, DRARS does not require that the parallel program be written in a specific style or programming language. Moreover, DRARS can be implemented in most consistency protocols. In this paper, DRARS is realized and verified in real experiments using the eager release consistency protocol on a DSM system with various applications.  相似文献   

6.
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to destination’s cache before it is needed, eliminating cache misses in the destination’s cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors.  相似文献   

7.
Summary.  In recent years, there is a growing tendency to support high-level synchronization operations, such as read-modify-write, FIFO queues and stacks, as part of the programmer’s shared memory model. This paper examines the problem of implementing hybrid consistency with high-level synchronization operations. It is shown that for any implementation of weak consistency, the time required to execute a read-modify-write, a dequeue or a pop operation is Ω(d), where d is the network delay. Following this, an efficient and simple algorithm for providing hybrid consistency that supports most types of high-level synchronization operations and weak read and weak write operations is presented. Weak read and weak write operations are executed instantaneously, while the time required to execute strong operations is O(d). This is within a constant factor of the lower bounds for most of the commonly used types of operations. Received: August 1994 / Accepted: June 1995  相似文献   

8.
刘恒  杨小帆 《计算机应用研究》2012,29(10):3772-3775
动态内存管理的问题对无锁动态数据结构的性能尤为关键,因为多线程环境下的动态内存管理涉及开销较高的同步操作。提出一种构建用于动态无锁数据结构的内存池的方法来减少动态内存使用和与之相伴的动态内存管理开销。该方法通过平衡线程的动态内存消耗来减小内存开销,利用本方法构建的内存池基于线程私有的支持节点窃取的无锁循环队列。本方法具有以下优点:a)用本方法构建的内存池是无锁的;b)能够平衡线程的堆内存消耗;c)可以方便地与动态无锁数据结构集成。实验结果显示,用该方法构造的资源窃取型内存池扩展性较强,且能够在高负载下有效降低无锁数据结构的堆内存消耗和操作执行时间;平衡算法在很大程度上决定内存消耗量,内存池在高负载下的扩展性也受到它所用的数据结构自身多线程访问性能的影响。  相似文献   

9.
Ming Hsiang Huang  Wuu Yang 《Software》2020,50(10):1877-1904
OpenACC is a directive-based programming model which allows programmers to write graphic processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that would be (possibly recursively) called in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Thread blocks dynamically organize loop iterations into batches and execute the batches in a depth-first order. Different thread blocks share iterations among one another with an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.  相似文献   

10.
The authors present a data-race-free-1, shared-memory model that unifies four earlier models: weak ordering, release consistency (with sequentially consistent special operations), the VAX memory model, and data-race-free-0. Data-race-free-1 unifies the models of weak ordering, release consistency, the VAX, and data-race-free-0 by formalizing the intuition that if programs synchronize explicitly and correctly, then sequential consistency can be guaranteed with high performance in a manner that retains the advantages of each of the four models. Data-race-free-1 expresses the programmer's interface more explicitly and formally than weak ordering and the VAX, and allows an implementation not allowed by weak ordering, release consistency, or data-race-free-0. The implementation proposal for data-race-free-1 differs from earlier implementations by permitting the execution of all synchronization operations of a processor even while previous data operations of the processor are in progress. To ensure sequential consistency, two sychronizing processors exchange information to delay later operations of the second processor that conflict with an incomplete data operation of the first processor  相似文献   

11.
In this paper a proof outline logic is introduced for the partial correctness of multi-threaded object-oriented programs like in Java. The main contribution is a generalization of the Owicki& Gries proof method for shared-variable concurrency to dynamic thread creation. This paper also provides a formal justification of this generalization in terms of soundness and completeness proofs.  相似文献   

12.
多核处理器中,各个处理器核之间可以并发地进行外部存储访问,提供不同于单处理器的存储级并行(memory level parallelism)能力.不规则应用中的循环,传统的并行方法难以识别其并行性,不能充分利用多核处理器存储级并行能力和并行计算能力.对基于软件开发多核处理器存储级并行进行了讨论,提出一种前瞻并行多线程算法LLSM(loop level speculative mssultithreading).LLSM对不规则应用中的循环进行并行化,在多核处理器上的测试数据表明:该算法能够有效地挖掘多核处理器的存储级并行能力和计算能力,同时指出多核环境下存储级并行计算公式需要考虑线程同步开销.  相似文献   

13.
Disintermediated active communication (DAC) is a new paradigm of communication in which a sending thread actively engages a receiving thread when sending it a message via shared memory. DAC is different than existing approaches that use passive communication through shared-memory - based on intermittently checking for messages - or that use preemptive communication but must rely on intermediaries such as the operating system or dedicated interrupt channels. An implementation of DAC builds on existing cache coherency support and exploits light-weight user-level interrupts. Inter-thread communication occurs via monitored memory locations where the receiver thread responds to invalidations of monitored addresses with a light-weight user-level software-defined handler. Address monitoring is supported by cache line user-bits, or CLUbits. CLUbits reside in the cache next to the coherence state, are private per thread, and maintain user-defined per-cache-line state. A light weight software library can demultiplex asynchronous notifications and handle exceptional cases. In DAC-based programs threads coordinate with one another by explicit signaling and implicit resource monitoring. With the simple and direct communication primitives of DAC, multi-threaded workloads synchronize at a finer granularity and more efficiently utilize the hardware of upcoming multi-core designs. This paper introduces DAC, presents several signaling models for DAC-based programs, and describes a simple memory-based framework that supports DAC by leveraging existing cache-coherency models. Our framework is general enough to support uses beyond DAC  相似文献   

14.
Designers of distributed algorithms typically assume strong memory consistency guarantees, but system implementations provide weaker guarantees for better performance and scalability. This motivates the study of how to implement programs designed for sequential consistency on platforms with weaker consistency models. Typically, such implementations are impossible using only read and write operations to shared variables. One variant of processor consistency originally proposed by Goodman and called here PC-G is an exception because it provides just enough consistency to implement mutual exclusion using only reads and writes. This paper investigates the existence of compilers to convert arbitrary programs that use shared read/write variables with sequentially consistent memory semantics, to programs that use read/write variables with PC-G consistency semantics. We first provide a simple program transformation, and prove that it correctly compiles any 2-process program to a PC-G memory system, while preserving wait-freedom. We next prove that even a substantial generalization of this transformation cannot be a compiler for even a very restricted class of 3-process programs. Even though our program transformation is not a general compiler for three or more processes, it does correctly transform some specific n-process programs. In particular, for the special case of the (necessarily randomized) Test&Set algorithm of Tromp and Vitanyi, our transformation extends to any number of processes, thus providing the first algorithm for expected wait-free Test&Set on any weak memory system, using only read/write variables.  相似文献   

15.
存储模型仿真器的设计与实现   总被引:2,自引:1,他引:1  
存储一致性问题和高速缓存一致性问题是共享存储并行计算机中两个最关键的问题,通过仿真器对它们进行了量化研究,设计并实现了一个存储模型仿真器MMS.基于MMS仿真了不同并行机结构模型下多种存储一致性模型的行为;针对不同类型的计算问题比较了不同的存储一致性模型,并对实验结果进行了分析;实现了几个不同的高速缓存一致性协议,并比较了它们的性能.  相似文献   

16.
MILLIPEDE is a project aimed at developing a distributed shared memory environment for parallel programming. A major goal of this project is to support easy-to-grasp parallel programming languages that will also make it straightforward to parallelize existing code. Other targets are forward compatibility and availability of both the user programs (hence the shared memory support and the C-like parallel language PARC) and the system itself (which is thus implemented in user-level and using the operating system exported services). Locality of memory references, which implies efficiency and speedups, is maintained by MILLIPEDE} using page and thread migration, through which dynamic load-balancing and weak memory are implemented. ©1997 by John Wiley & Sons, Ltd.  相似文献   

17.
In this paper, we present DetLock, a runtime system to ensure deterministic execution of multithreaded programs running on multicore systems. DetLock does not rely on any hardware support or kernel modification to ensure determinism. For tracking the progress of the threads, logical clocks are used. Unlike previous approaches, which rely on non-portable hardware to update the logical clocks, DetLock employs a compiler pass to insert code for updating these clocks, thus increasing portability. For 4 cores, the average overhead of these clocks on tested benchmarks is brought down from 16 to 2 % by applying several optimizations. Moreover, the average overall overhead, including deterministic execution, is 14 %.  相似文献   

18.
Due to the large amount of potential parallelism, resource management is a critical issue in multithreaded execution. The challenge in code generation is to control the parallelism without reducing the machine's ability to exploit it. Controlled parallelism reduces idle time, communication, and delay caused by synchronization. At the same time it increases the potential for exploitation of program data structure locality. In this paper, we evaluate the performance of methods to control program parallelism and resource usage in the context of the fine-grain dataflow execution model. The methods are in themselves not new, but their performance analysis is. The two methods to control parallelism here are slicing and chunking. We present the methods and their compilation strategy and evaluate their effectiveness in terms of run time and matching store occupancy. Communication is categorized in memory, loop, call, and expression communication. Input and output message locality is measured. Two techniques to reduce communication are introduced. Grouping allocates loop and function bodies on one processor and bundling combines messages with the same sender and receiver into one. Their effects on the total communication volume are quantified.  相似文献   

19.
The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models address manycore parallelism (e.g., OpenMP, OpenACC, OpenCL) but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.  相似文献   

20.
在国产异构众核平台神威·太湖之光上的非结构网格计算具有稀疏存储、离散访存、数据依赖等特点,严重制约了众核处理器的性能发挥。为解决稀疏存储和离散访存问题,提出一种N阶对角染色算法,以有效平衡主从核计算并利用从核将全局访存转化为LDM访问。针对数据依赖造成的计算竞争问题,采用自适应和无依赖的任务划分方法,避免并行计算时的数据冲突。为对处理器架构和非结构网格计算进行优化,采用主核与从核异步并行的方式,差异化使用主从核以充分利用硬件资源,同时,取消处理器提供的寄存器通信机制,降低从核阵列的同步开销同时便于扩展到新一代神威平台。此外,使用计算访存异步重叠技术来充分隐藏访存延迟。利用SpMV、Integration、calcLudsFcc算子进行实验,结果表明,相比主核实现,组合加速算法在不同算例规模下平均取得了10倍的加速效果,加速比最高可达24倍,N阶对角染色算法相比非染色分块算法取得了超过5.8倍的性能加速,有效提升了数据局部性和计算并行度。该算法对有依赖关系的计算冲突算子同样具有良好的加速性能,验证了自适应和无依赖任务划分方法的有效性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号