1.

Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols

Eduardo H.M. Cruz Matthias Diener Marco A.Z. Alves Philippe O.A. Navaux 《Journal of Parallel and Distributed Computing》2014

In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared memory paradigm, the communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with a low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed a reduction of up to 13.9% of the execution time, 30.5% of the cache misses and 39.4% of the number of invalidation messages.
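
As a rough illustration of how a detected communication pattern could drive a mapping decision, the sketch below greedily pairs the threads that communicate most, so that each pair can be placed on cores sharing a cache. The abstract does not describe the authors' heuristics; the matrix format, the function name `greedy_pairs`, and the greedy rule itself are assumptions made for this example.

```python
from itertools import combinations

def greedy_pairs(comm):
    """Pair threads so that heavily communicating threads can share a cache.

    comm is a symmetric matrix where comm[i][j] is the amount of
    communication detected between threads i and j (hypothetical format).
    """
    n = len(comm)
    # Visit candidate pairs from most to least communication.
    candidates = sorted(combinations(range(n), 2),
                        key=lambda p: comm[p[0]][p[1]], reverse=True)
    paired, pairs = set(), []
    for i, j in candidates:
        if i not in paired and j not in paired:
            pairs.append((i, j))
            paired.update((i, j))
    return pairs

# Threads 0 and 1 communicate heavily, as do threads 2 and 3.
comm = [
    [0, 9, 1, 0],
    [9, 0, 0, 2],
    [1, 0, 0, 7],
    [0, 2, 7, 0],
]
pairs = greedy_pairs(comm)
```

With an odd number of threads, one thread is simply left unpaired. A real dynamic mapper would also have to weigh the cost of migrating threads, which this sketch ignores.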

2.

Extensible transactional memory testbed

Derin Harmanci Vincent Gramoli Pascal Felber Christof Fetzer 《Journal of Parallel and Distributed Computing》2010

Transactional Memory (TM) is a promising abstraction as it hides all synchronization complexities from the programmers of concurrent applications. More particularly, the TM paradigm has shifted complexity from application programming to TM programming. Therefore, expert programmers have now started to look for the ideal TM that will bring, once and for all, performance to all concurrent applications. Researchers have recently identified numerous issues TMs may suffer from. Surprisingly, no TMs have ever been tested in these scenarios. In this paper, we present the first TM testbed to date. We propose a framework, TMunit, that provides a domain-specific language for rapidly writing TM workloads, so that our test suite is easily extensible. Our semantic tests indicate through reproducible counter-examples that existing TMs do not satisfy recent consistency criteria. Our performance tests identify workloads where well-known TMs perform differently. Finally, additional tests indicate some workloads preventing contention managers from progressing.

3.

Integrating file operations into transactional memory

Brian Demsky 《Journal of Parallel and Distributed Computing》2011,71(10):1293-1304

Researchers have proposed transactional memory as a concurrency primitive to simplify the development of multi-threaded programs. In this paper we present a new approach for supporting I/O operations in the context of transactional memory. Our approach provides isolation between the file operations of different transactions while allowing multiple transactions to concurrently perform I/O. To ease adoption, our approach attempts to implement the traditional I/O programming interface as closely as possible. We formalize aspects of our approach and use the formalization to reason about the correctness of the approach. We have implemented our approach as a Java library and have integrated it with the DSTM2 transactional memory system. We have evaluated the approach with several benchmarks including JCarder, TupleSoup, a financial transaction benchmark, a parallel sort benchmark, and a parallel grep benchmark. Our experience shows that the approach provides a straightforward mechanism for developers to integrate I/O in a transactional memory environment and that it performs well.

4.

Software transactional memory (total citations: 1; self-citations: 0; citations by others: 1)

Nir Shavit Dan Touitou 《Distributed Computing》1997,10(2):99-116

Summary. As we learn from the literature, flexibility in choosing synchronization operations greatly simplifies the task of designing highly concurrent programs. Unfortunately, existing hardware is inflexible and is at best on the level of a Load–Linked/Store–Conditional operation on a single word. Building on the hardware based transactional synchronization methodology of Herlihy and Moss, we offer software transactional memory (STM), a novel software method for supporting flexible transactional programming of synchronization operations. STM is non-blocking, and can be implemented on existing machines using only a Load–Linked/Store–Conditional operation. We use STM to provide a general highly concurrent method for translating sequential object implementations to non-blocking ones based on implementing a k-word compare&swap STM-transaction. Empirical evidence collected on simulated multiprocessor architectures shows that our method always outperforms the non-blocking translation methods in the style of Barnes, and outperforms Herlihy’s translation method for sufficiently large numbers of processors. The key to the efficiency of our software-transactional approach is that unlike Barnes style methods, it is not based on a costly “recursive helping” policy. Received: January 1996 / Revised: June 1996 / Accepted: August 1996

5.

Software transactional memories for Scala

Daniel Goodman Behram Khan Salman Khan Mikel Luján Ian Watson 《Journal of Parallel and Distributed Computing》2013

Transactional memory is an alternative to locks for handling concurrency in multi-threaded environments. Instead of providing critical regions that only one thread can enter at a time, transactional memory records sufficient information to detect and correct for conflicts if they occur. This paper surveys the range of options for implementing software transactional memory in Scala. Where possible, we provide references to implementations that instantiate each technique. As part of this survey, we document for the first time several techniques developed in the implementation of Manchester University Transactions for Scala. We order the implementation techniques on a scale moving from the least to the most invasive in terms of modifications to the compilation and runtime environment. This shows that, while the less invasive options are easier to implement and more common, they are more verbose and invasive in the codes using them, often requiring changes to the syntax and program structure throughout the code.

6.

Good programming in transactional memory : Game theory meets multicore architecture

Raphael Eidenbenz Roger Wattenhofer 《Theoretical computer science》2011,412(32):4136-4150

In a multicore transactional memory (TM) system, concurrent execution threads interact and interfere with each other through shared memory. The less interference a thread provokes the better for the system. However, as a programmer is primarily interested in optimizing her individual code’s performance rather than the system’s overall performance, she does not have a natural incentive to provoke as little interference as possible. Hence, a TM system must be designed compatible with good programming incentives (GPI), i.e., writing efficient code for the overall system should coincide with writing code that optimizes an individual thread’s performance. We show that with most contention managers (CM) proposed in the literature so far, TM systems are not GPI compatible. We provide a generic framework for CMs that base their decisions on priorities and explain how to modify Timestamp-like CMs so as to feature GPI compatibility. In general, however, priority-based conflict resolution policies are prone to be exploited by selfish programmers. In contrast, a simple non-priority-based manager that resolves conflicts at random is GPI compatible.
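
The two families of contention managers contrasted in this abstract differ only in how they pick the winner of a conflict. The sketch below shows that decision rule in isolation. The `Tx` record and both function names are hypothetical, and the Timestamp-like rule shown (the older transaction wins) is the common textbook form, not necessarily the exact variant analysed in the paper.

```python
import random
from dataclasses import dataclass

@dataclass
class Tx:
    name: str
    start_time: int  # logical timestamp taken when the transaction began

def timestamp_cm(a, b):
    """Priority-based rule: the older transaction wins, the younger aborts."""
    return a if a.start_time <= b.start_time else b

def random_cm(a, b, rng=random):
    """Non-priority-based rule: the winner is picked uniformly at random."""
    return a if rng.random() < 0.5 else b

old, young = Tx("T1", start_time=10), Tx("T2", start_time=42)
winner = timestamp_cm(old, young)  # T1, because it started first
```

Under the timestamp rule a transaction's priority depends on something the programmer can influence (how long it has been running), whereas the random rule offers nothing to game. The sketch does not model the incentive analysis itself.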

7.

Boosting performance of transactional memory through O-GEHL predictors

Ehsan Atoofian 《Microprocessors and Microsystems》2014

Time-based Software Transactional Memory (STM) exploits a global clock to validate transactional data and guarantee consistency of transactions. While this method is simple to implement, it results in contention over the clock if transactions commit simultaneously. The alternative method is thread local clock (TLC), which exploits local variables to maintain consistency of transactions. However, TLC may increase false aborts and degrade performance of STMs. In this paper, we analyze global clock and TLC in the context of STM systems, highlighting both the implementation trade-offs and the performance implications of the two techniques. We demonstrate that neither global clock nor TLC is optimum across applications. To counter this challenge, we introduce two optimization techniques. The first is Adaptive Clock (AC), which dynamically selects one of the two validation techniques based on the probability of conflicts. AC is a speculative approach and relies on software O-GEHL predictors to speculate future conflicts. The second is AC+, which reduces the timing overhead of O-GEHL predictors by implementing the predictors in hardware. In addition, we exploit information theory to eliminate unnecessary computational resources and reduce storage requirements of the O-GEHL predictors. Our evaluation with TL2 and the STAMP benchmark suite reveals that AC is effective and improves execution time of transactional applications by up to 65%.

8.

Implementation tradeoffs in the design of flexible transactional memory support

Arrvindh Shriraman Sandhya Dwarkadas Michael L. Scott 《Journal of Parallel and Distributed Computing》2010

We present FlexTM (FLEXible Transactional Memory), a high performance TM framework that allows software to determine when (eagerly, lazily, or in a mixed fashion) and how to manage conflicts, while employing hardware to manage transactional state and to track conflicts. FlexTM coordinates four decoupled hardware mechanisms: read and write signatures, which summarize per-thread access sets; per-thread conflict summary tables (CSTs), which identify the processors with which conflicts have occurred; Programmable Data Isolation, which buffers speculative updates in the local cache and uses an overflow table to handle unbounded updates; and Alert-On-Update, which notifies a thread immediately when a specified location is written by another processor. The CSTs enable an STM-inspired commit protocol that manages conflicts in a decentralized manner (no global arbitration) and allows parallel commits.
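
Read and write signatures that summarize access sets are commonly built as Bloom filters, and the following software sketch illustrates that idea. It is an assumption that a Bloom-filter model is a fair stand-in for FlexTM's hardware signatures; the class name, the bit width, and the use of SHA-256 as the hash are all choices made for this example only.

```python
import hashlib

class Signature:
    """Fixed-size Bloom-filter summary of a set of addresses."""

    def __init__(self, bits=256, hashes=2):
        self.bits, self.hashes, self.mask = bits, hashes, 0

    def _positions(self, addr):
        # Derive one bit position per hash function for this address.
        for k in range(self.hashes):
            digest = hashlib.sha256(f"{k}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, addr):
        for pos in self._positions(addr):
            self.mask |= 1 << pos

    def may_contain(self, addr):
        return all((self.mask >> pos) & 1 for pos in self._positions(addr))

    def may_overlap(self, other):
        # Shared bits mean a possible, not a certain, common address.
        return (self.mask & other.mask) != 0

writes_t0, reads_t1 = Signature(), Signature()
writes_t0.add(0x1000)
reads_t1.add(0x1000)
reads_t1.add(0x2040)
conflict = writes_t0.may_overlap(reads_t1)  # True: both touched 0x1000
```

Such a summary never misses a real overlap, but it can report one that does not exist. That trade-off is what lets the access set fit in a fixed number of bits.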

9.

A transactional runtime system for the Cell/BE architecture

Alexandro Baldassin Felipe Goldstein Rodolfo Azevedo 《Journal of Parallel and Distributed Computing》2012

Single-core architectures have hit the end of the road and industry and academia are currently exploiting new multicore design alternatives. In particular, heterogeneous multicore architectures have attracted a lot of attention, but developing applications for such architectures is not an easy task due to the lack of appropriate tools and programming models. We present the design of a runtime system for the Cell/BE architecture that works with memory transactions. Transactional programs are automatically instrumented by the compiler, shortening development time and avoiding synchronization mistakes usually present in lock-based approaches (such as deadlock). Experimental results conducted with a prototype implementation and the STAMP benchmark show good scalability for applications with moderate to low contention levels, and whose transactions are not too small. For those cases in which a small performance loss is admissible, we believe that the ease of programming provided by transactions greatly pays off.

10.

Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling

《Future Generation Computer Systems》2014

Modern processor architectures are increasingly complex and heterogeneous, often requiring software solutions tailored to the specific hardware characteristics of each processor model. In this article, we address this problem by targeting two processors featuring Simultaneous MultiThreading (SMT) to improve the occupancy of their internal execution units through a sustained stream of instructions coming from more than one thread. We target the AMD Bulldozer and IBM POWER7 processors as case studies for specific hardware-oriented performance optimizations that increase the variety of instructions sent to each core to maximize the occupancy of all its execution units. WorkOver, presented in this article, improves thread scheduling by increasing the performance of floating point-intensive workloads on Linux-based operating systems. WorkOver is a user-space monitoring tool that automatically identifies FPU-intensive threads and schedules them in a more efficient way without requiring any patches or modifications at the kernel level. Our measurements using standard benchmark suites show that speedups of up to 20% can be achieved by simply allowing WorkOver to monitor applications and schedule their threads, without any modification of the workload.

11.

Communication‐aware thread mapping using the translation lookaside buffer

Eduardo H. M. Cruz Matthias Diener Philippe O. A. Navaux 《Concurrency and Computation》2015,27(17):4970-4992

Threads of parallel applications need to communicate in order to fulfill their tasks. The communication performance between the cores in modern multi‐core architectures differs because of the memory and interconnection hierarchies. In these architectures, it is important to map the threads of parallel applications by taking into account the communication between them, to improve their performance and energy consumption. In parallel applications based on shared memory, communication is implicit, which makes it difficult to detect the communication pattern between the threads. In this paper, we introduce a new lightweight mechanism to detect the communication pattern between threads of shared memory applications using the translation lookaside buffer. Our mechanism relies on hardware features, which make it transparent to the programmer and allow the detection to be performed by the operating system during the execution of the application. We also developed a heuristic mapping algorithm that uses the detected pattern to dynamically map the threads to cores. Experiments were performed with applications from the NAS‐OMP and PARSEC parallel benchmark suites in a simulated machine as well as a real machine. Results show that our mechanism can substantially improve parallel application performance, as well as processor and DRAM energy consumption. Copyright © 2015 John Wiley & Sons, Ltd.
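
One way to picture TLB-based detection is to treat each core's TLB as a set of page numbers and to count the pages that two cores have in common as evidence that their threads share data. The sketch below does exactly that. The abstract does not specify the hardware mechanism, so the snapshot format and the function name `communication_matrix` are assumptions made for illustration.

```python
def communication_matrix(tlbs):
    """Build a symmetric matrix of shared-page counts.

    tlbs[i] is the set of page numbers currently cached in core i's TLB
    (a hypothetical snapshot format).
    """
    n = len(tlbs)
    comm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Pages present in both TLBs suggest the two threads share data.
            shared = len(tlbs[i] & tlbs[j])
            comm[i][j] = comm[j][i] = shared
    return comm

tlbs = [{1, 2, 3}, {2, 3, 4}, {9}, {4, 9}]
comm = communication_matrix(tlbs)
```

Here cores 0 and 1 share two pages, so `comm[0][1]` is 2, while cores 0 and 2 share none. Page granularity makes this a coarse proxy, since two threads may use disjoint parts of the same page.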

12.

Keeping it together: The role of transactional situation awareness in team performance

《International Journal of Industrial Ergonomics》2016

It has been argued that communications in teams are a means of transmitting Situation Awareness to improve performance. This study explored the frequency and types of situation awareness transactions in two groups of teams. Twelve teams were grouped into either more effective or less effective teams, based on performance measures. Distributed Situation Awareness theory predicts that Situation Awareness transactions are a medium for co-ordinating teamwork, and that more of these transactions will lead to improved performance. Differences in the frequency and type of transactions were observed between the more effective teams and the less effective teams, with the former having a higher frequency of overall communications and, more importantly, a higher number of relevant situation awareness transaction types compared to less effective teams. Situation awareness transactions supported the team in making sense of the situation they found themselves in as it unfolded and enabled team members to perform their discrete tasks and therefore contribute to overall team success. Relevance to industry: Teams are a major feature of most industrial applications of work, and communication plays an important role in coordinating team work. Communication has been found to be linked to both team performance and situation awareness. Situation awareness is distributed in teams through transactions of information. A study was devised to explore the differences between more effective and less effective teams on a number of situation awareness transactional factors. Analysing the team as a functional unit of situation awareness is presented for future work.

13.

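Dynamic contention management strategy for software transactional memory

Lin Fei (林菲) 《Computer Engineering and Design》2010,31(7)

Software transactional memory is a programming technique that emerged to simplify parallel programming. To reduce the frequency of transaction conflicts in software transactional memory systems and thereby improve overall system performance, a new contention management strategy based on dynamic control and queue scheduling is proposed. The concept of contention intensity and the overall system framework are defined, and on this basis a method is given for dynamically adjusting contention intensity using runtime feedback. A design for transaction serialization is also presented, together with issues to watch for in its implementation: transactions with a high conflict probability are serialized so that the same conflict does not recur. The model and algorithm were tested with commonly used benchmark data structures, and the results demonstrate the correctness and effectiveness of the algorithm.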
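
The abstract does not give a formula for contention intensity. A simple way to realise the idea of adjusting it from runtime feedback is an exponential moving average of commit and abort outcomes, with serialization switched on when the average crosses a threshold. The sketch below makes that assumption; the class name, `alpha`, and `threshold` are hypothetical and not taken from the paper.

```python
class ContentionTracker:
    """Tracks contention intensity as a moving average of abort outcomes."""

    def __init__(self, alpha=0.5, threshold=0.5):
        self.alpha = alpha          # weight given to past history
        self.threshold = threshold  # serialize above this intensity
        self.intensity = 0.0

    def record(self, aborted):
        # Feed back one outcome: 1 for an abort, 0 for a commit.
        outcome = 1.0 if aborted else 0.0
        self.intensity = self.alpha * self.intensity + (1 - self.alpha) * outcome

    def should_serialize(self):
        # High intensity: queue the transaction instead of running it concurrently.
        return self.intensity > self.threshold

tracker = ContentionTracker()
for aborted in (True, True, False):
    tracker.record(aborted)
```

With the default parameters, two aborts push the intensity to 0.75 and would trigger serialization, and a following commit brings it back down to 0.375.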

14.

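Research on transactional memory (total citations: 1; self-citations: 0; citations by others: 1)

Huang Guorui (黄国睿) Zhang Ping (张平) Wei Guangbo (魏广博) Ma Hang (马航) 《Computer Engineering and Design》2010,31(2)

To study parallel programming on multicore processor systems, research on the transactional memory model was carried out. Transactional memory is described, methods of implementing transactional memory systems are introduced, and four transactional memory systems are used to explain the implementation in detail. The discussion focuses on six key techniques that affect the development of transactional memory, including implementation approach, data structure organization, concurrency control, conflict detection, and contention management. It is suggested that transactional memory will develop toward combined hardware and software designs, higher performance, improved correctness, and meeting the needs of multicore applications.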

15.

DMM: A dynamic memory mapping model for virtual machines (total citations: 2; self-citations: 0; citations by others: 2)

CHEN HaoGang WANG XiaoLin WANG ZhenLin ZHANG BinBin LUO YingWei & LI XiaoMing 《Science China Information Sciences》2010,(6):1097-1108

Memory virtualization is an important part of the design of virtual machine monitors (VMMs). In this paper, we propose the dynamic memory mapping (DMM) model, a mechanism that allows the VMM to change the mapping between a virtual machine's physical memory and the underlying hardware resource while the virtual machine is running. By utilizing DMM, the VMM can implement many novel memory management policies, such as Demand Paging, Swapping, Ballooning, Memory Sharing and Copy-On-Write, while preserving compatibility with va...

16.

Topological map-based approach for localization and mapping memory optimization

André S. Aguiar Filipe N. dos Santos Luis C. Santos Armando J. Sousa José Boaventura-Cunha 《Journal of Field Robotics》2023,40(3):447-466

Robotics in agriculture faces several challenges, such as the unstructured characteristics of the environments, variability of luminosity conditions for perception systems, and vast field extensions. To implement autonomous navigation systems in these conditions, robots should be able to operate during large periods and travel long trajectories. For this reason, it is essential that simultaneous localization and mapping algorithms can perform in large-scale and long-term operating conditions. One of the main challenges for these methods is maintaining low memory resources while mapping extensive environments. This work tackles this issue, proposing a localization and mapping approach called VineSLAM that uses a topological mapping architecture to manage the memory resources required by the algorithm. This topological map is a graph-based structure where each node is agnostic to the type of data stored, enabling the creation of a multilayer mapping procedure. Also, a localization algorithm is implemented, which interacts with the topological map to perform access and search operations. Results show that our approach is aligned with the state-of-the-art regarding localization precision, being able to compute the robot pose in long and challenging trajectories in agriculture. In addition, we prove that the topological approach advances the state of the art in memory management. The proposed algorithm requires less memory than the other benchmarked algorithms, and can maintain a constant memory allocation during the entire operation. This constitutes a significant innovation, since our approach opens the possibility for the deployment of complex 3D SLAM algorithms in real-world applications without scale restrictions.

17.

Optimised memory allocation for less false abortion and better performance in hardware transactional memory

Xiuhong Li Altenbek Gulila 《International Journal of Parallel, Emergent and Distributed Systems》2020,35(4):483-491

This paper introduces and tackles a special performance hazard in Hardware Transactional Memory (HTM): false abortion. False abortion causes many unnecessary transaction abortions in HTM and can greatly impact performance, making HTM less useful when it is adopted as a fast path for Software Transactional Memory. By introducing a new memory allocator design, we are able to put objects that are likely to be accessed together from different threads into different cache lines and thus avoid conflicts of hardware transactions in different threads. Experiments show that our method can reduce transaction abortions by 47% and achieve a speedup of up to 1.67× (22% on average), while consuming only 14% more memory, showing great potential to enhance current HTM technology.
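
The core idea, keeping objects out of each other's cache lines, can be shown with a small placement routine that rounds every object up to a whole number of lines. This is a simplification: the paper's allocator separates only objects likely to be accessed together from different threads, whereas the sketch pads all of them. The 64-byte line size and both function names are assumptions made for this example.

```python
CACHE_LINE = 64  # bytes; assumed line size

def place_objects(sizes, line=CACHE_LINE):
    """Return a start offset per object so that no two objects share a line.

    sizes is a list of positive object sizes in bytes.
    """
    offsets, cursor = [], 0
    for size in sizes:
        offsets.append(cursor)
        # Round this object's footprint up to a whole number of lines.
        cursor += -(-size // line) * line
    return offsets

def lines_touched(offset, size, line=CACHE_LINE):
    """Indices of the cache lines covered by an object."""
    return set(range(offset // line, (offset + size - 1) // line + 1))

sizes = [24, 40, 100]
offsets = place_objects(sizes)
```

The three objects land at offsets 0, 64 and 128, so hardware transactions that touch different objects no longer conflict on a shared line. The cost is extra memory: 256 bytes are reserved here for 164 bytes of data.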

18.

Energy saving strategies for parallel applications with point-to-point communication phases

Vaibhav Sundriyal Masha Sosonkina Alexander Gaenko Zhao Zhang 《Journal of Parallel and Distributed Computing》2013

Although high-performance computing traditionally focuses on the efficient execution of large-scale applications, both energy and power have become critical concerns when approaching exascale. Drastic increases in the power consumption of supercomputers affect significantly their operating costs and failure rates. In modern microprocessor architectures, equipped with dynamic voltage and frequency scaling (DVFS) and CPU clock modulation (throttling), the power consumption may be controlled in software. Additionally, network interconnect, such as Infiniband, may be exploited to maximize energy savings while the application performance loss and frequency switching overheads must be carefully balanced. This paper advocates for a runtime assessment of such overheads by means of characterizing point-to-point communications into phases followed by analyzing the time gaps between the communication calls. Certain communication and architectural parameters are taken into consideration in the three proposed frequency scaling strategies, which differ with respect to their treatment of the time gaps. The experimental results are presented for NAS parallel benchmark problems as well as for the realistic parallel electronic structure calculations performed by the widely used quantum chemistry package GAMESS. For the latter, three different process-to-core mappings were studied as to their energy savings under the proposed frequency scaling strategies and under the existing state-of-the-art techniques. Close to the maximum energy savings were obtained with a low performance loss of 2% on the given platform.

19.

Helenos: A realistic benchmark for distributed transactional memory

Paweł Kobyliński Konrad Siek Jan Baranowski Paweł T. Wojciechowski 《Software》2018,48(3):528-549

Transactional memory (TM) is an approach to concurrency control that aims to make writing parallel programs both effective and simple. The approach has been initially proposed for nondistributed multiprocessor systems, but it is gaining popularity in distributed systems to synchronize tasks at large scales. Efficiency and scalability are often the key issues in TM research; thus, performance benchmarks are an important part of it. However, while standard TM benchmarks like the Stanford Transactional Applications for Multi‐Processing suite and STMBench7 are available and widely accepted, they do not translate well into distributed systems. Hence, the set of benchmarks usable with distributed TM systems is very limited, and must be padded with microbenchmarks, whose simplicity and artificial nature often makes them uninformative or misleading. Therefore, this paper introduces Helenos, a realistic, complex, and comprehensive distributed TM benchmark based on the problem of the Facebook inbox, an application of the Cassandra distributed store.

20.

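Design of a transactional memory framework for data centers

Sun Yong (孙勇) 《Computer Engineering and Applications》2011,47(27):74-76

Targeting the characteristics of cloud computing data centers built from computer clusters, a distributed programming framework based on transactional memory is proposed. The framework encapsulates cloud computing tasks as transactions and automatically handles the scheduling and execution, load balancing, and failure recovery of all transactions. It encapsulates the data center's distributed data as transactional objects and guarantees the ACID properties when transactions access those objects. Compared with similar work, it does not require users to manage the parallel control of their programs, which makes it simple and easy to use. The framework has been implemented in a simulated environment, and experimental results show that it has good scalability and fault tolerance.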