期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Complete Formal Specification of the OpenMP Memory Model

Greg Bronevetsky Bronis R. de Supinski 《International journal of parallel programming》2007,35(4):335-392

相似文献

2.

Parallel algorithms for bipartite matching problems on distributed memory computers

Johannes Langguth Md. Mostofa Ali PatwaryFredrik Manne 《Parallel Computing》2011,37(12):820-845

We present a new parallel algorithm for computing a maximum cardinality matching in a bipartite graph suitable for distributed memory computers.The presented algorithm is based on the Push-Relabel algorithm which is known to be one of the fastest algorithms for the bipartite matching problem. Previous attempts at developing parallel implementations of it have focused on shared memory computers using only a limited number of processors.We first present a straightforward adaptation of these shared memory algorithms to distributed memory computers. However, this is not a viable approach as it requires too much communication. We then develop our new algorithm by modifying the previous approach through a sequence of steps with the main goal being to reduce the amount of communication and to increase load balance. The first goal is achieved by changing the algorithm so that many push and relabel operations can be performed locally between communication rounds and also by selecting augmenting paths that cross processor boundaries infrequently. To achieve good load balance, we limit the speed at which global relabelings traverse the graph. In several experiments on a large number of instances, we study weak and strong scalability of our algorithm using up to 128 processors.The algorithm can also be used to find ?-approximate matchings quickly. 相似文献

3.

A protocol-centric approach to on-the-fly race detection

Perkovic D. Keleher P.J. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(10):1058-1072

We present the design and evaluation of a new data-race-detection technique. Our technique executes at runtime rather than post-mortem, and handles unmodified shared-memory applications that run on top of CVM, a software distributed shared memory system. We do not assume explicit associations between synchronization and shared data, and require neither compiler support nor program source. Instead, we use a binary code re-writer to instrument instructions that may access shared memory. The most novel aspect of our system is that we are able to use information from the underlying memory system implementation in order to reduce the number of comparisons made at runtime. We present an experimental evaluation of our techniques by using our system to look for data races in five common shared-memory programs. We quantify the effect of several optimizations to the basic technique: data flow analysis, instrumentation batching, runtime code modification, and instrumentation inlining. Our system correctly found races in three of the five programs, including two from a standard benchmark suite. The slowdown of this debugging technique averages less than 2.5 for our applications. 相似文献

4.

Axiomatic treatment of processes with shared variables revisited 总被引：1，自引：1，他引：0

Olaf Owe 《Formal Aspects of Computing》1992,4(4):323-340

The aim of this paper is to develop simple and practically useful formalisms for reasoning about processes with shared variables. Our approach is based on the axiomatic system described by Neelam Soundararajan. In contrast to that work, our formalism is first derived from a model; this guarantees soundness and completeness of the formal proof system, with respect to the model. As an additional advantage the rules become simpler than those of Soundararajan; in particular, the local assertions may freely refer to shared variables; and we remove the explicit use of the compatibility predicate.Next we augment the formalism by allowing global invariants, which may refer to shared variables (including shared histories), but with a different semantics than in the local assertions. The augmented system makes reasoning simpler in the sense that reasoning about the past is replaced by reasoning about the present. Finally we suggest a sufficiently complete set of mythical (auxiliary) variables free from embedded program structure. We demonstrate our formalism on some examples.Dedicated to the memory of Jan Helge DÆhlin, 1959–1989 相似文献

5.

Object-based approach to programming distributed systems

A. S. Tanenbaum H. E. Bal S. Ben Hassen M. Frans Kaashoek 《Concurrency and Computation》1994,6(4):235-249

Two kinds of parallel computers exist: those with shared memory and those without. The former are difficult to build but easy to program. The latter are easy to build but difficult to program. In this paper we present a hybrid model that combines the best properties of each by simulating a restricted object-based shared memory on machines that do not share physical memory. In this model, objects can be replicated on multiple machines. An operation that does not change an object can then be done locally, without any network traffic. Update operations can be done using the reliable broadcast protocol described in the paper. We have constructed a prototype system, designed and implemented a new programming language for it, and programmed various applications using it. The model, algorithms, language, applications and performance will be discussed. 相似文献

6.

Optimization of parallel query execution plans in XPRS 总被引：1，自引：0，他引：1

Wei Hong Michael Stonebraker 《Distributed and Parallel Databases》1993,1(1):9-32

In this paper, we describe our approach to optimization of query execution plans in XPRS, a multiuser parallel database system based on a shared memory multiprocessor and a disk array. The main difficulties in this optimization problem are the compile-time unknown parameters such as available buffer size and number of free processors, and the enormous search space of possible parallel plans. We deal with these problems with a novel two phase optimization strategy which dramatically reduces the search space and allows run time parameters without significantly compromising plan optimality. In this paper we present our two phase optimization strategy and give experimental evidence from XPRS benchmarks that indicate that it almost always produces optimal or close to optimal plans. 相似文献

7.

Partial Order Trace Analyzer (POTA) for Distributed Programs

Alper Sen Vijay K. Garg 《Electronic Notes in Theoretical Computer Science》2003,89(2):22

Checking the correctness of software is a growing challenge. In this paper, we present a prototype implementation of Partial Order Trace Analyzer (POTA), a tool for checking execution traces of both message passing and shared memory programs using temporal logic. So far runtime verification tools have used the total order model of an execution trace, whereas POTA uses a partial order model. The partial order model enables us to capture possibly exponential number of interleavings and, in turn, this allows us to find bugs that are not found using a total order model. However, verification in partial order model suffers from the state explosion problem – the number of possible global states in a program increases exponentially with the number of processes.POTA employs an effective abstraction technique called computation slicing. A slice of a computation (execution trace) with respect to a predicate is the computation with the least number of global states that contains all global states of the original computation for which the predicate evaluates to true. The advantage of this technique is that, it mitigates the state explosion problem by reasoning only on the part of the global state space that is of interest. In POTA, we implement computing slicing algorithms for temporal logic predicates from a subset of CTL. The overall complexity of evaluating a predicate in this logic upon using computation slicing becomes polynomial in the number of processes compared to exponential without slicing.We illustrate the effectiveness of our techniques in POTA on several test cases such as the General Inter-ORB Protocol (GIOP)[18] and the primary secondary protocol[32]. POTA also contains a module that translates execution traces to Promela[16] (input language SPIN). This module enables us to compare our results on execution traces with SPIN. In some cases, we were able to verify traces with 250 processes compared to only 10 processes using SPIN. 相似文献

8.

Parallelism versus Memory Allocation in Pipelined Router Forwarding Engines

Fan Chung Ronald Graham Jia Mao George Varghese 《Theory of Computing Systems》2006,39(6):829-849

A crucial problem that needs to be solved is the allocation of memory to processors in a pipeline. Ideally, the processor memories should be totally separate (i.e., one-port memories) in order to minimize contention; however, this minimizes memory sharing. Idealized sharing occurs by using a single shared memory for all processors but this maximizes contention. Instead, in this paper we show that perfect memory sharing of shared memory can be achieved with a collection of two-port memories, as long as the number of processors is less than the number of memories. We show that the problem of allocation is NP-complete in general, but has a fast approximation algorithm that comes within a factor of asymptotically. The proof utilizes a new bin packing model, which is interesting in its own right. Further, for important special cases that arise in practice a more sophisticated modification of this approximation algorithm is in fact optimal. We also discuss the online memory allocation problem and present fast online algorithms that provide good memory utilization while allowing fast updates. 相似文献

9.

GPU-Friendly Parallel Genome Matching with Tiled Access and Reduced State Transition Table

Yunho Oh Doohwan Oh Won W. Ro 《International journal of parallel programming》2013,41(4):526-551

In this paper, we propose a new parallel genome matching algorithm using graphics processing units (GPUs). Our proposed approach is based on the Aho–Corasick algorithm and it was developed based on a consideration of the architectural features of existing GPUs with a hundred or more cores. Thus, we provide an appropriate task partitioning method that runs on multiple threads and we fully utilize the cache memory and the shared memory structures available in GPUs. Especially, we propose a tiled access method for rapid data transfer from the global memory to the shared memory. We also provide new models for cache-friendly state transition table to improve performance of pattern matching operations on GPUs. The maximum throughput we achieved in various experiments was 15.3 Gbps. Moreover, we showed that our proposed design outperformed an earlier approach with a 15.4 % performance improvement. 相似文献

10.

Transactional Memory for Unstructured Mesh Simulations

Barna L. Bihari 《Journal of scientific computing》2013,54(2-3):311-332

In this paper we study transactional memory (TM) as a new tool for threading codes in this new era of multi- and many-core computers. In particular, we investigate the features and study the applicability of transactional memory as an efficient and easy-to-use alternative for handling memory conflicts in unstructured mesh simulations that use shared memory. The software tool used for our preliminary analysis of this novel construct is IBM’s freely available Software Transactional Memory (STM) system. For our studies, we developed the BUSTM benchmark which is a test code with state-of-the-art unstructured-mesh bookkeeping. The numerical algorithms are simplified yet still exhibit most of the salient features of modern unstructured mesh methods. We apply STM to two frequently used algorithm types used in multi-physics codes with realistic 3-D meshes. Our computational experiments indicate a good fit between these application scenarios and the TM features. 相似文献

11.

RTS: A system to simulate the real time cost behaviour of parallel computations

Bin Qin Howard A. Sholl Reda A. Ammar 《Software》1988,18(10):967-985

In this paper, we present a software tool, RTS (real time simulator), that analyses the time cost behaviour of parallel computations through simulation. It is assumed in RTS that the computer system which supports the executions of parallel computations has a limited number of processors all processors have the same speed and they communicate with each other through a shared memory. In RTS, the time cost of a parallel computation is defined as a function of the input, the algorithm, the data structure, the processor speed, the number of processors, the processor power allocation, the communication and the execution environment. How RTS models the time cost is first discussed in the paper. In the model, a locking technique is used to manipulate the access to the shared memory, processing power is equally allocated among all the operations that are currently being performed in parallel in the computer system, and the number of operations in the execution environment of a parallel computation changes from time to time. How RTS works and how the simulation is used to do time cost analysis are also discussed. 相似文献

12.

Efficient 3D stencil computations using CUDA

Marcin Krotkiewski Marcin Dabrowski 《Parallel Computing》2013

We present an efficient implementation of 7-point and 27-point stencils on high-end Nvidia GPUs. A new method of reading the data from the global memory to the shared memory of thread blocks is developed. The method avoids conditional statements and requires only two coalesced instructions to load the tile data with the halo (ghost zone). Additional optimizations include storing only one XY tile of data at a time in the shared memory to lower shared memory requirements, common subexpression elimination to reduce the number of instructions, and software prefetching to overlap arithmetic and memory instructions, and enhance latency hiding. The efficiency of our implementation is analyzed using a simple stencil memory footprint model that takes into account the actual halo overhead due to the minimum memory transaction size on the GPUs. Through experiments we demonstrate that in our implementation the memory overhead due to the halos is largely eliminated by good reuse of the halo data in the memory caches, and that our method of reading the data is close to optimal in terms of memory bandwidth usage. Detailed performance analysis for single precision stencil computations, and performance results for single and double precision arithmetic on two Tesla cards are presented. Our stencil implementations are more efficient than any other implementation described in the literature to date. On Tesla C2050 with single and double precision arithmetic our 7-point stencil achieves an average throughput of 12.3 and 6.5 Gpts/s, respectively (98 GFLOP/s and 52 GFLOP/s, respectively). The symmetric 27-point stencil sustains a throughput of 10.9 and 5.8 Gpts/s, respectively. 相似文献

13.

Distributed disk-based algorithms for model checking very large Markov chains

Alexander Bell Boudewijn R. Haverkort 《Formal Methods in System Design》2006,29(2):177-196

In this paper we present data structures and distributed algorithms for CSL model checking-based performance and dependability evaluation. We show that all the necessary computations are composed of series or sums of matrix-vector products. We discuss sparse storage structures for the required matrices and present efficient sequential and distributed disk-based algorithms for performing these matrix-vector products. We illustrate the effectivity of our approach in a number of case studies in which continuous-time Markov chains (generated in a distributed way from stochastic Petri net specifications) with several hundreds of millions of states are solved on a workstation cluster with 26 dual-processor nodes. We show details about the memory consumption, the solution times, and the speedup. The distributed message-passing algorithms have been implemented in a tool called PARSECS, that also takes care of the distributed Markov chain generation and that can also be used for distributed CTL model checking of Petri nets. 相似文献

14.

Applications of parallel processing technologies in heuristic search planning: methodologies and experiments

Tu Huy Phan Enrico Pontelli Tran Cao Son Son Thanh To 《Concurrency and Computation》2009,21(15):1928-1960

The goal of this paper is to investigate the application of parallel programming techniques to boost the performance of heuristic search‐based planning systems in various aspects. It shows that an appropriate parallelization of a sequential planning system often brings gain in performance and/or scalability. We start by describing general schemes for parallelizing the construction of a plan. We then discuss the applications of these techniques to two domain‐independent heuristic search‐based planners—a competitive conformant planner (CP A) and a state‐of‐the‐art classical planner (FF). We present experimental results—on both shared memory and distributed memory platforms—which show that the performance improvements and scalability are obtained in both cases. Finally, we discuss the issues that should be taken into consideration when designing a parallel planning system and relate our work to the existing literature. Copyright © 2009 John Wiley & Sons, Ltd. 相似文献

15.

Sharing Memory between Byzantine Processes Using Policy-Enforced Tuple Spaces

Bessani Alysson Neves Correia Miguel da Silva Fraga Joni Cheuk Lung Lau 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(3):419-432

Despite the large amount of Byzantine fault-tolerant algorithms for message-passing systems designed through the years, only recently algorithms for the coordination of processes subject to Byzantine failures using shared memory have appeared. This paper presents a new computing model in which shared memory objects are protected by fine-grained access policies, and a new shared memory object, the Policy-Enforced Augmented Tuple Space (PEATS). We show the benefits of this model by providing simple and efficient consensus algorithms. These algorithms are much simpler and requires less shared memory operations, using also less memory bits than previous algorithms based on ACLs and sticky bits. We also prove that PEATS objects are universal, i.e., that they can be used to implement any other shared memory object, and present lock-free and wait-free universal constructions. 相似文献

16.

A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs

下载免费PDF全文

杨杨崔慧敏冯晓兵薛京灵《计算机科学技术学报》2012,27(1):57-74

In this paper,we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPU by carefully balancing usage of registers and shared-memory.Unlike ea... 相似文献

17.

Computing in Totally Anonymous Asynchronous Shared Memory Systems

Hagit Attiya Alla Gorbach Shlomo Moran 《Information and Computation》2002,173(2):162

In the totally anonymous shared memory model of asynchronous distributed computing, processes have no identifiers and run identical programs. Moreover, processes have identical interface to the shared memory, and in particular, there are no single-writer registers. This paper assumes that processes do not fail, and the shared memory consists only of read/write registers, which are initialized to some default value. A complete characterization of the functions and agreement tasks that can be solved in this model is presented. Furthermore, it is shown that if a function is computable, then two registers are sufficient for some algorithm to compute it. Consensus is an important agreement task that can be computed. The paper proves logarithmic lower bounds on the number of registers and rounds needed for solving consensus in this model. A consensus protocol using a linear number of shared registers and rounds is also presented. 相似文献

18.

Efficient data race detection for async-finish parallelism

Raghavan Raman Jisheng Zhao Vivek Sarkar Martin Vechev Eran Yahav 《Formal Methods in System Design》2012,41(3):321-347

A major productivity hurdle for parallel programming is the presence of data races. Data races can lead to all kinds of harmful program behaviors, including determinism violations and corrupted memory. However, runtime overheads of current dynamic data race detectors are still prohibitively large (often incurring slowdowns of 10× or more) for use in mainstream software development. In this paper, we present an efficient dynamic race detection algorithm that handles both the async-finish task-parallel programming model used in languages such as X10 and Habanero Java (HJ) and the spawn-sync constructs used in Cilk. We have implemented our algorithm in a tool called TaskChecker and evaluated it on a suite of 12 benchmarks. To reduce overhead of the dynamic analysis, we have also implemented various static optimizations in the tool. Our experimental results indicate that our approach performs well in practice, incurring an average slowdown of 3.05× compared to a serial execution in the optimized case. 相似文献

19.

Designing fast LTL model checking algorithms for many-core GPUs

Jiří BarnatAuthor Vitae Petr BauchAuthor VitaeLuboš BrimAuthor Vitae Milan Češka 《Journal of Parallel and Distributed Computing》2012

Recent technological developments made various many-core hardware platforms widely accessible. These massively parallel architectures have been used to significantly accelerate many computation demanding tasks. In this paper, we show how the algorithms for LTL model checking can be redesigned in order to accelerate LTL model checking on many-core GPU platforms. Our detailed experimental evaluation demonstrates that using the NVIDIA CUDA technology results in a significant speedup of the verification process. Together with state space generation based on shared hash-table and DFS exploration, our CUDA accelerated model checker is the fastest among state-of-the-art shared memory model checking tools. 相似文献

20.

Active memory controller

Zhen Fang Lixin Zhang John B. Carter Sally A. McKee Ali Ibrahim Michael A. Parker Xiaowei Jiang 《The Journal of supercomputing》2012,62(1):510-549

Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs’ performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50× faster barriers, 12× faster spinlocks, 8.5×–15× faster stream/array operations, and 3× faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation. 相似文献