期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Embedded-TM: Energy and complexity-effective hardware transactional memory for embedded multicore systems

Cesare Ferri Samantha Wood Tali Moreshet R. Iris Bahar Maurice Herlihy 《Journal of Parallel and Distributed Computing》2010

We investigate how transactional memory can be adapted for embedded systems. We consider energy consumption and complexity to be driving concerns in the design of these systems and therefore adapt simple hardware transactional memory (HTM) schemes in our architectural design. We propose several different cache structures and contention management schemes to support HTM and evaluate them in terms of energy, performance, and complexity. We find that ignoring energy considerations can lead to poor design choices, particularly for resource-constrained embedded platforms. We conclude that with the right balance of energy efficiency and simplicity, HTM will become an attractive choice for future embedded system designs. 相似文献

2.

Selective dynamic serialization for reducing energy consumption in hardware transactional memory systems

Epifanio Gaona J. Rubén Titos-Gil Juan Fernández Manuel E. Acacio 《The Journal of supercomputing》2014,68(2):914-934

In the search for new paradigms to simplify multithreaded programming, Transactional Memory (TM) is currently being advocated as a promising alternative to deadlock-prone lock-based synchronization. In this way, future many-core CMP architectures may need to provide hardware support for TM. On the other hand, power dissipation constitutes a first class consideration in multicore processor designs. In this work, we propose Selective Dynamic Serialization (SDS) as a new technique to improve energy consumption without degrading performance in applications with conflicting transactions by avoiding wasted work due to aborted transactions. Our proposal, which is implemented on top of a hardware transactional memory (HTM) system with an eager conflict management policy, detects and serializes conflicting transactions dynamically (at run-time). In its simplest form, in case of conflict, one transaction is allowed to continue whilst the rest are completely stalled. Once the executing transaction has finished, it wakes up several of the stalling transactions. More elaborated implementations of SDS try to delay this behavior until serialization of transactions is profitable, achieving the best trade-off between performance, energy savings and network traffic. SDS implementations differ from each other in the condition that triggers the serialization mode. We have evaluated several SDS schemes using GEMS, a full-system simulator implementing the LogTM-SE Eager–Eager HTM system, and several benchmarks from the STAMP suite. Results for a 16-core CMP show that SDS obtains reductions of 6 % on average in energy consumption (more than 20 % in high contention scenarios) in a wide range of benchmarks without affecting, on average, execution time. At the same time, network traffic level is also reduced by 22 %. 相似文献

3.

MetaTM/TxLinux: Transactional Memory for an Operating System

Ramadan H.E. Rossbach C.J. Porter D.E. Hofmann O.S. Bhandari A. Witchel E. 《Micro, IEEE》2008,28(1):42-51

Hardware transactional memory can reduce synchronization complexity while retaining high performance. MetaTM models changes to the x86 architecture to support transactional memory for user processes and the operating system. TxLinux is an operating system that uses transactional memory to facilitate synchronization in a large, complicated code base, where the burdens of current lock-based approaches are most evident. 相似文献

4.

Removal of Conflicts in Hardware Transactional Memory Systems

M. M. Waliullah Per Stenstrom 《International journal of parallel programming》2014,42(1):198-218

This paper analyzes the sources of performance losses in hardware transactional memory and investigates techniques to reduce the losses. It dissects the root causes of data conflicts in hardware transactional memory systems (HTM) into four classes of conflicts: true sharing, false sharing, silent store, and write-write conflicts. These conflicts can cause performance and energy losses due to aborts and extra communication. To quantify losses, the paper proposes the 5C cache-miss classification model that extends the well-established 4C model with a new class of cache misses known as contamination misses. The paper also contributes with two techniques for removal of data conflicts: One for removal of false sharing conflicts and another for removal of silent store conflicts. In addition, it revisits and adapts a technique that is able to reduce losses due to both true and false conflicts. All of the proposed techniques can be accommodated in a lazy versioning and lazy conflict resolution HTM built on top of a MESI cache-coherence infrastructure with quite modest extensions. Their ability to reduce performance is quantitatively established, individually as well as in combination. Performance and energy consumption are improved substantially. 相似文献

5.

硬件结构支持的基于同步的高速缓存一致性协议

黄河刘磊宋风龙马啸宇《计算机学报》2009,32(8)

共享存储系统中如何高效地实现高速缓存一致性是体系结构设计面临的一个关键问题和难点问题.已有的基于目录的协议存在难于实现、验证复杂和存储空间开销大等问题.面向片上众核处理器,文中提出一种由硬件结构支持、基于同步的高速缓存一致性协议.该方案不使用目录,而是通过使用bloom-filter表示一致性信息,并在并行程序中的同步点维护高速缓存一致性.与现有的基于目录的高速缓存一致性协议相比,该方案可以降低目录协议的实现、验证复杂度.用SPLASH一2测试程序集评估表明,基于同步的协议可以获得与基于目录的协议相当的性能. 相似文献

6.

Performance Pathologies in Hardware Transactional Memory

Bobba J. Moore K. Volos H. Yen L. Hill M.D. Swift M. Wood D.A. 《Micro, IEEE》2008,28(1):32-41

Transactional memory is a promising approach to ease parallel programming. Hardware transactional memory system designs reflect choices along three key design dimensions: conflict detection, version management, and conflict resolution. The authors identify a set of performance pathologies that could degrade performance in proposed HTM designs. Improving conflict resolution could eliminate these pathologies so designers can build robust HTM systems. 相似文献

7.

Rock: A High-Performance Sparc CMT Processor

Chaudhry S. Cypher R. Ekman M. Karlsson M. Landin A. Yip S. Zeffer H. Tremblay M. 《Micro, IEEE》2009,29(2):6-16

Rock, Sun's third-generation chip-multithreading processor, contains 16 high-performance cores, each of which can support two software threads. Rock uses a novel checkpoint-based architecture to support automatic hardware scouting under a load miss, speculative out-of-order retirement of instructions, and aggressive dynamic hardware parallelization of a sequential instruction stream. It is also the first processor to support transactional memory in hardware. 相似文献

8.

Improving the performance of software distributed shared memory with speculation 总被引：1，自引：0，他引：1

Kistler M. Alvisi L. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(9):885-896

We study the performance benefits of speculation in a release consistent software distributed shared memory system. We propose a new protocol, speculative home-based release consistency (SHRC) that speculatively updates data at remote nodes to reduce the latency of remote memory accesses. Our protocol employs a predictor that uses patterns in past accesses to shared memory to predict future accesses. We have implemented our protocol in a release consistent software distributed shared memory system that runs on commodity hardware. We evaluate our protocol implementation using eight software distributed shared memory benchmarks and show that it can result in significant performance improvements. 相似文献

9.

Distributed transactional memory for metric-space networks

Maurice Herlihy Ye Sun 《Distributed Computing》2007,20(3):195-208

Transactional Memory is a concurrent programming API in which concurrent threads synchronize via transactions (instead of locks). Although this model has mostly been studied in the context of multiprocessors, it has attractive features for distributed systems as well. In this paper, we consider the problem of implementing transactional memory in a network of nodes where communication costs form a metric. The heart of our design is a new cache-coherence protocol, called the Ballistic protocol, for tracking and moving up-to-date copies of cached objects. For constant-doubling metrics, a broad class encompassing both Euclidean spaces and growth-restricted networks, this protocol has stretch logarithmic in the diameter of the network. Supported by NSF grant 0410042 and by grants from Intel Corporation and Sun Microsystems. 相似文献

10.

Hardware transactional memory architecture with adaptive version management for multi-processor FPGA platforms

《Journal of Systems Architecture》2017

Multiprocessor embedded systems integrates diverse dedicated processing units to handle high performance applications such as in multimedia and network processing. However, lock-based synchronization limits the efficiency of such heterogeneous concurrent systems. Hardware Transactional Memory (HTM) is a promising approach in creating an abstraction layer for multi-threaded programming. However, HTM performance is application-specific and determined by version and conflict management configurations. Most previous HTM implementations for embedded system in literature were built on fixed version management that result in significant performance loss when transaction behaviour changes. In this paper, we propose a HTM targeted for embedded applications which is able to adapt its version management based on application behaviour at runtime. It is prototyped and analysed on Altera Cyclone IV platform. Random requests at different contention levels and different transaction sizes are used to verify the performance of the proposed HTM. Based on our experiments, lazy version management is able to obtain up to 12.82% speed-up compared to eager version management at high contention level. Meanwhile, eager version management obtains up to 37.84% speed-up compared to lazy version management at low contention. The adaptive mechanism is able to switch configuration at runtime based on applications behaviour for maximum performance. 相似文献

11.

Efficient execution of speculative threads and transactions with hardware transactional memory

《Future Generation Computer Systems》2014

Thread-level speculation (TLS) was researched to automatically parallelize portions of serial programs for execution, and transactional memory (TM) was studied as a promising alternative of lock for parallel programming due to its simplicity. Both TLS and TM require similar underlying support. In the paper, we present SeTM (sequential transactional memory), a hardware enhanced TM system which supports TLS at minor extra cost. Signature is an effective way to buffer speculative states in TM and TLS. But it cripples TM and TLS performance due to its false-positive in terms of conflict detection, especially for conflict-intensive TLS. SeTM adopts R/W bits and signature concurrently to ameliorate this bad influence. Additionally, SeTM introduces the fast rollback mechanism, which provides fast abort recovery for eager log-based HTM and TLS. The most important contribution of SeTM is the conflict-tolerant mechanism, which tolerates some ambiguous data conflicts in TLS. Finally, in order to achieve an efficient execution for these un-order transactions, we add an extra ordering mechanism for SeTM. With this ordering mechanism, the transactions in TM can also gain the performance improvement with the support of conflict-tolerant mechanism. Our evaluation major on TM and TLS separately. For the TLS applications, six representative benchmarks have been adopted to evaluate the above model. Our experimental results show that our scheme improves the execution performance of most tested codes at a modest hardware cost. For a set of important scientific loops, we report the highest speedup of 6.5 with 15 cores. Besides, experimental results also show good scalability of SeTM system. For the TM applications, with respect to LogTM-SE, the benchmarks from STAMP also gain performance improvement signally. 相似文献

12.

Adaptive thread mapping strategies for transactional memory applications

Márcio Castro Luís Fabrício W. Góes Jean-François Méhaut 《Journal of Parallel and Distributed Computing》2014

Transactional Memory (TM) is a programmer friendly alternative to traditional lock-based concurrency. Although it intends to simplify concurrent programming, the performance of the applications still relies on how frequent they synchronize and the way they access shared data. These aspects must be taken into consideration if one intends to exploit the full potential of modern multicore platforms. Since these platforms feature complex memory hierarchies composed of different levels of cache, applications may suffer from memory latencies and bandwidth problems if threads are not properly placed on cores. An interesting approach to efficiently exploit the memory hierarchy is called thread mapping. However, a single fixed thread mapping cannot deliver the best performance when dealing with a large range of transactional workloads, TM systems and platforms. In this article, we propose and implement in a TM system a set of adaptive thread mapping strategies for TM applications to tackle this problem. They range from simple strategies that do not require any prior knowledge to strategies based on Machine Learning techniques. Taking the Linux default strategy as baseline, we achieved performance improvements of up to 64.4% on a set of synthetic applications and an overall performance improvement of up to 16.5% on the standard STAMP benchmark suite. 相似文献

13.

Synergistic design of an application-oriented sparse directory on many-core embedded systems

《Journal of Systems Architecture》2017

As many-core embedded systems are evolving from single-memory based designs to systems-on-a-chip running on an on-chip network, implementing a cache coherence mechanism in large-scale many-core embedded systems turns out to be a technical challenge. However, existing coherence mechanisms are difficult to scale beyond tens of cores, which require either excessive area or energy, complex hierarchical protocols, or inexact representations of sharer sets. In this paper, we present a hardware-software synergistic design of a cache coherence mechanism by considering OS-level application allocation and hardware-level coherence operations. The proposed application-oriented sparse directory (AoSD) cooperates with a contiguous allocation algorithm to isolate cache coherence traffic and thereby reduce interferences among applications. The proposed micro-architecture of sharer set representations is area-efficient; moreover, it can also be configured dynamically to track a flexible and exact sharer set. We verify our design by analyzing memory requirements of different cache organizations and implementing our design on a popular simulator Graphite to evaluate cache coherence traffic improvement. The results show that our design is both area-efficient and efficient with improvements in memory network performance by 11.74%–28.72%. It is also indicated that our design is feasible to scale up to work well in thousands-of-cores embedded systems. 相似文献

14.

Transactional execution: toward reliable, high-performance multithreading

Rajwar R. Goodman J. 《Micro, IEEE》2003,23(6):117-125

Although lock-based critical sections are the synchronization method of choice, they have significant performance limitations and lack certain properties, such as failure atomicity and stability. Addressing both these limitations requires considerable software overhead. Transactional lock removal can dynamically eliminate synchronization operations and achieve transparent transactional execution by treating lock-based critical sections as lock-free optimistic transactions. 相似文献

15.

Hardware transactional memory: A high performance parallel programming model

Chen Fu Dongxin Wen Xiaoqun Wang Xiaozong Yang 《Journal of Systems Architecture》2010,56(8):384-391

The transactional memory in multicore processors has been a major research area over past several years. Many transactional memory systems have been proposed to be used to solve the synchronization problem of multicore processors. Hardware transactional memory is one of the critical methods to speedup communications in multicore environment. In this paper, we give a review of the current hardware transactional memory systems for multicore processors. We take a top-down approach to characterizing and classifying various hardware transactional design issues and present a taxonomy of hardware transactional memory systems which is consist of the five fundamental design issues: version management, conflict detection, contention management, virtualization and nesting. Finally, we discussed the active research challenge: the relationship between transactional memory and Input/Output operations and system calls. 相似文献

16.

Adaptive locks: Combining transactions and locks for efficient concurrency

Takayuki Usui Reimer Behrends Jacob Evans Yannis Smaragdakis 《Journal of Parallel and Distributed Computing》2010

Transactional memory is being advanced as an alternative to traditional lock-based synchronization for concurrent programming. Transactional memory simplifies the programming model and maximizes concurrency. At the same time, transactions can suffer from interference that causes them to often abort, from heavy overheads for memory accesses, and from expressiveness limitations (e.g., for I/O operations). In this paper we propose an adaptive locking technique that dynamically observes whether a critical section would be best executed transactionally or while holding a mutex lock. The critical new elements of our approach include the adaptivity logic and cost–benefit analysis, a low-overhead implementation of statistics collection and adaptive locking in a full C compiler, and an exposition of the effects on the programming model. In experiments with both micro and macrobenchmarks we found adaptive locks to consistently match or outperform the better of the two component mechanisms (mutexes or transactions). Compared to either mechanism alone, adaptive locks often provide 3-to-10x speedups. Additionally, adaptive locks simplify the programming model by reducing the need for fine-grained locking: with adaptive locks, the programmer can specify coarse-grained locking annotations and often achieve fine-grained locking performance due to the transactional memory mechanisms. 相似文献

17.

A Framework of Memory Consistency Models 总被引：2，自引：1，他引：2

下载免费PDF全文

Hu Weiwu Shi Weisong Tang Zhimin 《计算机科学技术学报》1998,13(2):110-124

相似文献

18.

Guest Editors' Introduction: Interaction of Many-Core Computer Architecture and Operating Systems

Cho Sangyeun Li Tao Mutlu Onur 《Micro, IEEE》2008,28(3):2-5

Rapid changes in platform hardware resources with the evolution of many-core architectures will require a fundamental reexamination of mainstream system-software design decisions to support multiple cores and to efficiently manage on-chip hardware resources shared among the multiple cores. In turn, the evolution of many-core processor architectures will be successfully sustained by the new capabilities and features added to the system software, perhaps while requiring substantial support from hardware. The guest editors introduce five articles on the interaction of computer architecture and operating systems for this special issue of IEEE Micro. 相似文献

19.

支持线程级猜测的存储体系结构设计

下载免费PDF全文

赖鑫刘聪王志英《计算机工程》2012,38(24):228-234

在线程级猜测中进行数据依赖相关检测时,存在Cache一致性协议无法容忍线程切换引起的Cache块替换等问题。为此,通过分析推测线程数据管理模型,结合推测线程切概率低的特点,提出一种分布-共享式恢复缓冲区结构。该结构在进行Cache一致性检验时结合作废向量和版本优先级寄存器进行数据依赖检测,利用L2 Cache进行推测数据缓冲和恢复以支持推测线程切换。修改SESC模拟器以验证和评估该存储体系结构。实验结果表明,在保持模拟器理想加速比的情况下,该存储体系结构可以较好地支持推测线程切换。相似文献

20.

Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems

Weiwei Fu Tianzhou Chen Chao Wang Li Liu 《The Journal of supercomputing》2014,69(3):1491-1516

On-chip distributed memory system has become an attractive solution for massive parallel memory accesses found in future many-core processors. However, increasing number of on-chip cores and memory controllers inevitably introduce many remote memory accesses, which generate a large amount of on-chip traffic and put great pressure on the interconnection. This paper tries to optimize on-chip memory access traffic via runtime thread migration. We first analyze memory access behaviors in multi-threaded applications and find that the memory access targets and volumes are similar during short periods, which makes runtime prediction feasible. But the memory access targets exhibit great mobility during long periods, motivating us to dynamically move threads towards the data. Based on these observations, we propose a novel low-cost and distributed thread migration algorithm which adjusts thread placement in chains based on benefit estimation. We present details of the workflow, including the trigger and arbitration of migration requests and the procedures to determine the migration chains. Simulation results show that our algorithm achieves system performance speedup of 11.5 % and reduces average memory access latency by 11.0 %. It can find a few but effective thread migrations to optimize on-chip memory access traffic with acceptable hardware and runtime overheads. 相似文献