期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Ariadne: Architecture of a Portable Threads System Supporting Thread Migration

EDWARD MASCARENHAS VERNON REGO 《Software》1996,26(3):327-356

Threads exhibit a simply expressed and powerful form of concurrency, easily exploitable in applications that run on both uni- and multi-processors, shared- and distributed-memory systems. This paper presents the design and implementation of Ariadne: a layered, C-based software architecture for multi-threaded distributed computing on a variety of platforms. Ariadne is a portable user-space threads system that runs on shared- and distributed-memory multiprocessors. Thread-migration is supported at the application level in homogeneous environments (e.g., networks of SPARCs and Sequent Symmetrys, Intel hypercubes). Threads may migrate between processes to access remote data, preserving locality of reference for computations with a dynamic data space. Ariadne can be tuned to specific applications through a customization layer. Support is provided for scheduling via a built-in or application-specific scheduler, and interfacing with any communications library. Ariadne currently runs on the SPARC (SunOS 4.x and SunOS 5.x), Sequent Symmetry, Intel i860, Silicon Graphics workstation (IRIX), and IBM RS/6000 environments. We present simple performance benchmarks comparing Ariadne to threads libraries in the SunOS 4.x and SunOS 5.x systems. 相似文献

2.

Exploiting locality and tolerating remote memory access latency using thread migration

Stephen Jenks Jean-Luc Gaudiot 《International journal of parallel programming》1997,25(4):281-304

Much research has focused on reducing and/or tolerating remote memory access latencies on distributed-memory parallel computers. Caching remote data is intended to reduce average access latency by handling as many remote memory accesses as possible using local copies of the data in the cache. Data-flow and multithreaded approaches help programs tolerate the latency of remote memory accesses by allowing processors to do other work while remote operations take place. The thread migration technique described here is a multithreaded architecture where threads migrate to remote processors that contain data they need. By exploiting access locality, the threads often use several data items from that processor before migrating to other processors for more data. Because the threads migrate in search of data, the approach is called Nomadic Threads. A prototype runtime system has been implemented on the CM5 and is portable to other distributed memory parallel computers. 相似文献

3.

Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems

Weiwei Fu Tianzhou Chen Chao Wang Li Liu 《The Journal of supercomputing》2014,69(3):1491-1516

On-chip distributed memory system has become an attractive solution for massive parallel memory accesses found in future many-core processors. However, increasing number of on-chip cores and memory controllers inevitably introduce many remote memory accesses, which generate a large amount of on-chip traffic and put great pressure on the interconnection. This paper tries to optimize on-chip memory access traffic via runtime thread migration. We first analyze memory access behaviors in multi-threaded applications and find that the memory access targets and volumes are similar during short periods, which makes runtime prediction feasible. But the memory access targets exhibit great mobility during long periods, motivating us to dynamically move threads towards the data. Based on these observations, we propose a novel low-cost and distributed thread migration algorithm which adjusts thread placement in chains based on benefit estimation. We present details of the workflow, including the trigger and arbitration of migration requests and the procedures to determine the migration chains. Simulation results show that our algorithm achieves system performance speedup of 11.5 % and reduces average memory access latency by 11.0 %. It can find a few but effective thread migrations to optimize on-chip memory access traffic with acceptable hardware and runtime overheads. 相似文献

4.

Non-Strict Execution in Parallel and Distributed Computing

Alfredo Cristobal-Salas Andrei Tchernykh Jean-Luc Gaudiot Wen-Yen Lin 《International journal of parallel programming》2003,31(2):77-105

This paper surveys and demonstrates the power of non-strict evaluation in applications executed on distributed architectures. We present the design, implementation, and experimental evaluation of single assignment, incomplete data structures in a distributed memory architecture and Abstract Network Machine (ANM). Incremental Structures (IS), Incremental Structure Software Cache (ISSC), and Dynamic Incremental Structures (DIS) provide non-strict data access and fully asynchronous operations that make them highly suited for the exploitation of fine-grain parallelism in distributed memory systems. We focus on split-phase memory operations and non-strict information processing under a distributed address space to improve the overall system performance. A novel technique of optimization at the communication level is proposed and described. We use partial evaluation of local and remote memory accesses not only to remove much of the excess overhead of message passing, but also to reduce the number of messages when some information about the input or part of the input is known. We show that split-phase transactions of IS, together with the ability of deferring reads, allow partial evaluation of distributed programs without losing determinacy. Our experimental evaluation indicates that commodity PC clusters with both IS and a caching mechanism, ISSC, are more robust. The system can deliver speedup for both regular and irregular applications. We also show that partial evaluation of memory accesses decreases the traffic in the interconnection network and improves the performance of MPI IS and MPI ISSC applications. 相似文献

5.

OpenMP for Networks of SMPs

Y. Charlie Hu Honghui Lu Alan L. Cox Willy Zwaenepoel 《Journal of Parallel and Distributed Computing》2000,60(12):160

In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared-memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks system uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intranode hardware shared memory. We present performance results for seven applications (Barnes-Hut, CLU, and Water from SPLASH-2, 3D-FFT from NAS, Red-Black SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the thread implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7–30% of the MPI versions. 相似文献

6.

Combining flow and dependence analyses to expose redundant array accesses

Elana D. Granston Alexander V. Veidenbaum 《International journal of parallel programming》1995,23(5):423-470

The success of large-scale, hierarchical and distributed shared memory systems hinges on our ability to reduce delays resulting from remote accesses to shared data. To facilitate this, we present a compile-time algorithm for analyzing programs with doall-style parallelism to determine when read and write accesses to shared data areredundant (unnecessary). One identified, redundant remote accesses can be replaced by local accesses or eliminated entirely. This optimization improves program performance in two ways. First, slow memory accesses are replaced by faster ones. Second, the time to perform other remote memory accesses may be reduced as a result of the decreased traffic level. We also show how the information obtained through redundancy analysis can be used for other compiler optimizations such as prefetching and cache management. 相似文献

7.

LAPT: A locality-aware page table for thread and data mapping

《Parallel Computing》2016

The performance and energy efficiency of current systems is influenced by accesses to the memory hierarchy. One important aspect of memory hierarchies is the introduction of different memory access times, depending on the core that requested the transaction, and which cache or main memory bank responded to it. In this context, the locality of the memory accesses plays a key role for the performance and energy efficiency of parallel applications. Accesses to remote caches and NUMA nodes are more expensive than accesses to local ones. With information about the memory access pattern, pages can be migrated to the NUMA nodes that access them (data mapping), and threads that communicate can be migrated to the same node (thread mapping).In this paper, we present LAPT, a hardware-based mechanism to store the memory access pattern of parallel applications in the page table. The operating system uses the detected memory access pattern to perform an optimized thread and data mapping during the execution of the parallel application. Experiments with a wide range of parallel applications (from the NAS and PARSEC Benchmark Suites) on a NUMA machine showed significant performance and energy efficiency improvements of up to 19.2% and 15.7%, respectively, (6.7% and 5.3% on average). 相似文献

8.

Recovering distributed objects

《Information Processing Letters》2001,77(2-4):143-150

Distributed multithreaded applications operating in shared-nothing environments present challenges to classical fault tolerance mechanisms. The piecewise determinism assumption is lost (due to multithreading), and data must be replicated (because of the shared-nothing environment). In this paper, we explore a systematic approach to providing fault tolerance, by considering data-race-free programs that have the benefits of piecewise determinism and yet allow multithreading. We base our logging and recovery algorithm on a logical ring structure that allows the underlying distributed system to migrate threads, migrate and replicate objects, and perform multi-object transactions. 相似文献

9.

Parallel Application Scheduling on Networks of Workstations

Stergios V. Anastasiadis Kenneth C. Sevcik 《Journal of Parallel and Distributed Computing》1997,43(2):364

Parallel applications can be executed using the idle computing capacity of workstation clusters. However, it remains unclear how to schedule the processors among different applications most effectively. Processor scheduling algorithms that were successful for shared-memory machines have proven to be inadequate for distributed memory environments due to the high costs of remote memory accesses and redistributing data. We investigate how knowledge of system load and application characteristics can be used in scheduling decisions. We propose a new algorithm based on adaptive equipartitioning, which, by properly exploiting both the information types above, performs better than other nonpreemptive scheduling rules, and nearly as well as idealized versions of preemptive rules (with free preemption). We conclude that the new algorithm is suitable for use in scheduling parallel applications on networks of workstations. 相似文献

10.

Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads

Teo Milanez Sylvain Collange Fernando Magno Quintão Pereira Wagner Meira Jr. Renato Ferreira 《Parallel Computing》2014

Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same processing unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is one architecture recently proposed that shares instruction decoding and execution between threads running the same program in an SMT processor, thereby generalizing the approach followed by Graphics Processing Units to general-purpose processors. In this paper we propose new ways to expose redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristic that handles function calls better than previous approaches. Our heuristic only inspects the program counter and the stack frame to reconverge threads; hence, it is amenable to efficient and inexpensive hardware implementation. Second, we demonstrate that this heuristic is able to reveal the existence of substantial regularity in inter-thread memory access patterns. We validate our results on data-parallel applications from the PARSEC and SPLASH suites. Our new reconvergence heuristic increases the throughput of our MMT model by 7%, when compared to a previous, and substantially more complex approach, due to Long et al. Moreover, it gives us an effective way to increase regularity in memory accesses. We have observed that over 70% of simultaneous memory accesses are either the same for all the threads, or are affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses. 相似文献

11.

On the design of global object space for efficient multi-threading Java computing on clusters

Weijian Fang Cho-Li Wang Francis C. M. Lau 《Parallel Computing》2003,29(11-12):1563

The popularity of Java and recent advances in compilation and execution technology for Java are making the language one of the preferred ones in the field of high-performance scientific and engineering computing. A distributed Java Virtual Machine supports transparent parallel execution of multi-threaded Java programs on a cluster of computers. It provides an alternative platform for high-performance scientific computations. In this paper, we present the design of a global object space for a distributed JVM. It virtualizes a single Java object heap across machine boundaries to facilitate transparent object accesses. We leverage runtime object connectivity information to detect distributed shared objects (DSOs) that are reachable from threads at different nodes to facilitate efficient memory management in the distributed JVM. Based on the concept of DSO, we propose a framework to characterize object access patterns, along three orthogonal dimensions. With this framework, we are able to effectively calibrate the runtime memory access patterns and dynamically apply optimized cache coherence protocols to minimize consistency maintenance overhead. The optimization devices include an object home migration method that optimizes the single-writer access pattern, synchronized method migration that allows the execution of a synchronized method to take place remotely at the home node of its locked object, and connectivity-based object pushing that uses object connectivity information to optimize the producer–consumer access pattern. Several benchmark applications in scientific computing have been tested on our distributed JVM. We report the performance results and give an in-depth analysis of the effects of the proposed adaptive solutions. 相似文献

12.

Modeling distributed data representation and its effect on parallel data accesses

《Journal of Parallel and Distributed Computing》2005,65(10):1281-1289

PC clusters have emerged as viable alternatives for high-performance, low-cost computing. In such an environment, sharing data among processes is essential. Accessing the shared data, however, may often stall parallel executing threads. We propose a novel data representation scheme where an application data entity can be incarnated into a set of objects that are distributed in the cluster. The runtime support system manages the incarnated objects and data access is possible only via an appropriate interface. This distributed data representation facilitates parallel accesses for updates. Thus, tasks are subject to few limitations and application programs can harness high degrees of parallelism. Our PC cluster experiments prove the effectiveness of our approach. 相似文献

13.

Accelerating engineering software on modern multi-core processors

《Advances in Engineering Software》2015

Recent multi-core designs migrated from Symmetric Multi Processing to cache coherent Non Uniform Memory Access architectures. In this paper we discuss performance issues that arise when designing parallel Finite Element programs for a 64-core ccNUMA computer and explore solutions for these issues. We first present the overview of the computer architecture and show that highly parallel code that does not take into account the aspects of the system memory organization scales poorly, achieving only 2.8× speedup when running with 64 threads. Then, we discuss how we identified the sources of overhead and evaluate three possible solutions for the problem. We show that the first solution does not require the application’s code to be modified, however, the speedup achieved is only 10.6×. The second solution enables the performance to scale up to 30.9×, however, it requires the programmer to manually schedule threads and allocate related data on local CPUs and memory banks and rely on ccNUMA aware libraries that are not portable across operating systems. Also, we propose and evaluate “copy-on-thread”, an alternative solution that enables the performance to scale up to 25.5× without relying on specialized libraries nor requiring specific data allocation and thread scheduling. Finally, we argue that the issues reported only happen for large data sets and conclude the paper with recommendations to help programmers to design algorithms and programs that perform well on such kind of machine. 相似文献

14.

On Migrating Threads 总被引：1，自引：0，他引：1

Bernd Mathiske Florian Matthes Joachim W. Schmidt 《Journal of Intelligent Information Systems》1997,8(2):167-191

Based on the notion of persistent threads in Tycoon (Matthes and Schmidt, 1994), we investigate thread migration as a programming construct for building activity-oriented distributed applications. We first show how a straight-forward extension of a higher-order persistent language can be used to define activities that span multiple (semi-) autonomous nodes in heterogeneous networks. In particular, we discuss the intricate binding issues that arise in systems where threads are first-class language citizens that may access local and remote, mobile and immobile resources.We also describe how our system model can be understood as a promising generalization of the more static architecture of first-order and higher-order distributed object systems. Finally, we give some insight into the implementation of persistent and migrating threads and we explain how to represent bindings to ubiquitous resources present at each node visited by a migrating thread on the network to avoid excessive communication or storage costs. 相似文献

15.

Development and performance analysis of real‐world applications for distributed and parallel architectures

T. Fahringer P. Blaha A. Hssinger J. Luitz E. Mehofer H. Moritsch B. Scholz 《Concurrency and Computation》2001,13(10):841-868

Several large real‐world applications have been developed for distributed and parallel architectures. We examine two different program development approaches. First, the usage of a high‐level programming paradigm which reduces the time to create a parallel program dramatically but sometimes at the cost of a reduced performance; a source‐to‐source compiler, has been employed to automatically compile programs—written in a high‐level programming paradigm—into message passing codes. Second, a manual program development by using a low‐level programming paradigm—such as message passing—enables the programmer to fully exploit a given architecture at the cost of a time‐consuming and error‐prone effort. Performance tools play a central role in supporting the performance‐oriented development of applications for distributed and parallel architectures. SCALA—a portable instrumentation, measurement, and post‐execution performance analysis system for distributed and parallel programs—has been used to analyze and to guide the application development, by selectively instrumenting and measuring the code versions, by comparing performance information of several program executions, by computing a variety of important performance metrics, by detecting performance bottlenecks, and by relating performance information back to the input program. We show several experiments of SCALA when applied to real‐world applications. These experiments are conducted for a NEC Cenju‐4 distributed‐memory machine and a cluster of heterogeneous workstations and networks. Copyright © 2001 John Wiley & Sons, Ltd. 相似文献

16.

数据流程序动态调度与优化

杨胜哲于俊清唐九飞《计算机工程与科学》2017,39(7):1201-1210

为了解决数据流编程模型的可用性问题,使其能在兼顾程序并行性的前提下适用于动态数据交互速率的流应用,设计了一种动态调度与静态优化相结合的数据流编译系统。编译器以COStream语言编写的源程序为输入,通过对源程序进行分析,以动态速率的数据通信边作为边界划分程序到粗粒度的子图,在子图内部应用静态优化。根据子图的每个计算单元的工作量估计计算资源的使用状况,实现子图内计算单元到处理器核的映射,经过阶段划分分配子图内计算单元到相应流水阶段。在运行时,每个子图在各个处理器核上均启动一个线程,通过对线程间通信的优化,避免了运行时多个线程对同一段内存同时读写产生的同步开销,减少了线程的上下文切换次数。使用信号量控制子图内线程间的同步,基于各子图计算单元运行时数据交互速率并结合当前线程的状态,动态调度各个子图的执行,构建动态的软件流水线,生成相应多线程目标代码。实验以通用X86-64多核处理器作为实验平台,测试和分析数据流编译的性能。实验结果表明,编译系统可以实现动态数据交互速率的数据流应用,扩大了编译系统可用性并且具有一定加速效果。相似文献

17.

A framework for parallel computational physics algorithms on multi-core: SPH in parallel

David W. Holmes John R. WilliamsPeter Tilke 《Advances in Engineering Software》2011,42(11):999-1008

In this paper, a simulation framework that enables distributed numerical computing in multi-core shared-memory environments is presented. Using multiple threads allows a single memory image to be shared concurrently across cores but potentially introduces race conditions. Race conditions can be avoided by ensuring each core operates on an isolated memory block. This is usually achieved by running a different operating system process on each core, such as multiple MPI processes. However, we show that in many computational physics problems, memory isolation can also be enforced within a single process by leveraging spatial sub-division of the physical domain. A new spatial sub-division algorithm is presented that ensures threads operate on different memory blocks, allowing for in-place updates of state, with no message passing or creation of local variables during time stepping. Additionally, the developed framework controls task distribution dynamically ensuring an events based load balance. Results from fluid mechanics analysis using Smoothed Particle Hydrodynamics (SPH) are presented demonstrating linear performance with number of cores. 相似文献

18.

Improving the performance of software distributed shared memory with speculation 总被引：1，自引：0，他引：1

Kistler M. Alvisi L. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(9):885-896

We study the performance benefits of speculation in a release consistent software distributed shared memory system. We propose a new protocol, speculative home-based release consistency (SHRC) that speculatively updates data at remote nodes to reduce the latency of remote memory accesses. Our protocol employs a predictor that uses patterns in past accesses to shared memory to predict future accesses. We have implemented our protocol in a release consistent software distributed shared memory system that runs on commodity hardware. We evaluate our protocol implementation using eight software distributed shared memory benchmarks and show that it can result in significant performance improvements. 相似文献

19.

A Read-Copy Update based parallel server for distributed crowd simulations

Guillermo Vigueras Juan M. Orduña Miguel Lozano 《The Journal of supercomputing》2013,64(1):156-166

The Read-Copy Update (RCU) synchronization method was designed to cope with multiprocessor scalability some years ago, and it was included in the Linux kernel October of 2002. Recently, libraries providing user-space access to this method have been released, although they still have not been used in complex applications. In this paper, we propose the evaluation of the RCU synchronization method for two different cases of use in a distributed system architecture for crowd simulations. We have compared the RCU implementation with a parallel implementation based on Mutex, a traditional locking synchronization method for solving race conditions among threads in parallel applications. The performance evaluation results show that the use of RCU significantly decreases the system response time and increases the system throughput, supporting a higher number of agents while providing the same latency levels. The reason for this behavior is that the RCU method allows read accesses in parallel with write accesses to dynamic data structures, avoiding the sequential access that a Mutex represents for these data structures. In this way, it can better exploit the existing number of processor cores. These results show the potential of this synchronization method for improving parallel and distributed applications. 相似文献

20.

A real-time multigrid finite hexahedra method for elasticity simulation using CUDA

Christian Dick Joachim Georgii Rüdiger Westermann 《Simulation Modelling Practice and Theory》2011,19(2):801-816

We present a multigrid approach for simulating elastic deformable objects in real time on recent NVIDIA GPU architectures. To accurately simulate large deformations we consider the co-rotated strain formulation. Our method is based on a finite element discretization of the deformable object using hexahedra. It draws upon recent work on multigrid schemes for the efficient numerical solution of partial differential equations on such discretizations. Due to the regular shape of the numerical stencil induced by the hexahedral regime, and since we use matrix-free formulations of all multigrid steps, computations and data layout can be restructured to avoid execution divergence of parallel running threads and to enable coalescing of memory accesses into single memory transactions. This enables to effectively exploit the GPU’s parallel processing units and high memory bandwidth via the CUDA parallel programming API. We demonstrate performance gains of up to a factor of 27 and 4 compared to a highly optimized CPU implementation on a single CPU core and 8 CPU cores, respectively. For hexahedral models consisting of as many as 269,000 elements our approach achieves physics-based simulation at 11 time steps per second. 相似文献