期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A message passing benchmark for unbalanced applications

《Simulation Modelling Practice and Theory》2008,16(9):1177-1189

We present a distributed memory parallel implementation of the unbalanced tree search (UTS) benchmark using MPI and investigate MPI’s ability to efficiently support irregular and nested parallelism through continuous dynamic load balancing. Two load balancing methods are explored: work sharing using a centralized work server and distributed work stealing using explicit polling to service steal requests. Experiments indicate that in addition to a parameter defining the granularity of load balancing, message-passing paradigms require additional techniques to manage the volume of communication and mitigate runtime overhead. Using additional parameters, we observed an improvement of up to 3–4X in parallel performance. We report results for three distributed memory parallel computer systems and use UTS to characterize the performance and scalability on these systems. Overall, we find that the simpler work sharing approach with a single work server achieves good performance on hundreds of processors and that our distributed work stealing implementation scales to thousands of processors and delivers more robust performance that is less sensitive to the particular workload and load balancing parameters. 相似文献

2.

Clique detection for nondirected graphs: Two new algorithms

Dr. L. Gerhards Dr. W. Lindenberg 《Computing》1979,21(4):295-322

Making use of special tree search algorithms the present paper describes two new methods for determining all maximal complete subgraphs (cliques) of a finite nondirected graph. In both methods the blockwise generation of all cliques induces characteristic properties, which guarantee an efficient calculation of special clique subsets, especially the set of all cliques of maximal length. Moreover, by their structure both algorithms allow to calculate the complete clique set by parallel processing. The algorithms have been tested for many series of characteristic graphs and compared with the algorithm of Bron-Kerbosch (Algorithm 457 of CACM) the most efficient algorithm which is known to the authors. 相似文献

3.

Theoretical underpinnings for maximal clique enumeration on perturbed graphs

William Hendrix Matthew C. Schmidt Paul Breimyer Nagiza F. Samatova 《Theoretical computer science》2010,411(26-28):2520-2536

The problem of enumerating the maximal cliques of a graph is a computationally expensive problem with applications in a number of different domains. Sometimes the benefit of knowing the maximal clique enumeration (MCE) of a single graph is worth investing the initial computation time. However, when graphs are abstractions of noisy or uncertain data, the MCE of several closely related graphs may need to be found, and the computational cost of doing so becomes prohibitively expensive.Here, we present a method by which the cost of enumerating the set of maximal cliques for related graphs can be reduced. By using the MCE for some baseline graph, the MCE for a modified, or perturbed, graph may be obtained by enumerating only the maximal cliques that are created or destroyed by the perturbation. When the baseline and perturbed graphs are relatively similar, the difference set between the two MCEs can be overshadowed by the maximal cliques common to both. Thus, by enumerating only the difference set between the baseline and perturbed graphs’ MCEs, the computational cost of enumerating the maximal cliques of the perturbed graph can be reduced.We present necessary and sufficient conditions for enumerating difference sets when the perturbed graph is formed by several different types of perturbations. We also present results of an algorithm based on these conditions that demonstrate a speedup over traditional calculations of the MCE of perturbed, real biological networks. 相似文献

4.

On runtime parallel scheduling for processor load balancing 总被引：3，自引：0，他引：3

Min-You Wu 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(2):173-186

Parallel scheduling is a new approach for load balancing. In parallel scheduling, all processors cooperate to schedule work. Parallel scheduling is able to accurately balance the load by using global load information at compile-time or runtime. It provides high-quality load balancing. This paper presents an overview of the parallel scheduling technique. Scheduling algorithms for tree, hypercube, and mesh networks are presented. These algorithms can fully balance the load and maximize locality at runtime. Communication costs are significantly reduced compared to other existing algorithms 相似文献

5.

On the scalability of parallel genetic algorithms

Cantú-Paz E Goldberg DE 《Evolutionary computation》1999,7(4):429-449

This paper examines the scalability of several types of parallel genetic algorithms (GAs). The objective is to determine the optimal number of processors that can be used by each type to minimize the execution time. The first part of the paper considers algorithms with a single population. The investigation focuses on an implementation where the population is distributed to several processors, but the results are applicable to more common master-slave implementations, where the population is entirely stored in a master processor and multiple slaves are used to evaluate the fitness. The second part of the paper deals with parallel GAs with multiple populations. It first considers a bounding case where the connectivity, the migration rate, and the frequency of migrations are set to their maximal values. Then, arbitrary regular topologies with lower migration rates are considered and the frequency of migrations is set to its lowest value. The investigationis mainly theoretical, but experimental evidence with an additively-decomposable function is included to illustrate the accuracy of the theory. In all cases, the calculations show that the optimal number of processors that minimizes the execution time is directly proportional to the square root of the population size and the fitness evaluation time. Since these two factors usually increase as the domain becomes more difficult, the results of the paper suggest that parallel GAs can integrate large numbers of processors and significantly reduce the execution time of many practical applications. 相似文献

6.

Using visualization tools to understand concurrency

Zernik D. Snir M. Malki D. 《Software, IEEE》1992,9(3):87-92

A visualization tool that provides an aggregate view of execution through a graph of events called the causality graph, which is suitable for systems with hundreds or thousands of processors, coarse-grained parallelism, and for a language that makes communication and synchronization explicit, is discussed. The methods for computing causality graphs and stepping through an execution with causality graphs are described. The properties of the abstraction algorithms and super nodes, the subgraphs in causality graphs, are also discussed 相似文献

7.

Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver

Alfred J. Park Kalyan S. Perumalla 《Journal of Parallel and Distributed Computing》2013

The algorithmic and implementation principles are explored in gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 nVidia Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation–communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance. 相似文献

8.

Automatic runtime frequency-scaling system for energy savings in parallel applications

Vaibhav Sundriyal Masha Sosonkina Zhao Zhang 《The Journal of supercomputing》2014,68(2):777-797

Although high-performance computing has always been about efficient application execution, both energy and power consumption have become critical concerns owing to their effect on operating costs and failure rates of large-scale computing platforms. Modern processors provide techniques, such as dynamic voltage and frequency scaling (DVFS) and CPU clock modulation (called throttling), to improve energy efficiency on-the-fly. Without careful application, however, DVFS and throttling may cause a significant performance loss due to system overhead. This paper proposes a novel runtime system that maximizes energy saving by selecting appropriate values for DVFS and throttling in parallel applications. Specifically, the system automatically predicts communication phases in parallel applications and applies frequency scaling considering both the CPU offload, provided by the network-interface card, and the architectural stalls during computation. Experiments, performed on NAS parallel benchmarks as well as on real-world applications in molecular dynamics and linear system solution, demonstrate that the proposed runtime system obtaining energy savings of as much as 14 % with a low performance loss of about 2 %. 相似文献

9.

Parallel space‐filling curve generation through sorting

J. Luitjens M. Berzins T. Henderson 《Concurrency and Computation》2007,19(10):1387-1402

In this paper we consider the scalability of parallel space‐filling curve generation as implemented through parallel sorting algorithms. Multiple sorting algorithms are studied and results show that space‐filling curves can be generated quickly in parallel on thousands of processors. In addition, performance models are presented that are consistent with measured performance and offer insight into performance on still larger numbers of processors. At large numbers of processors, the scalability of adaptive mesh refined codes depends on the individual components of the adaptive solver. One such component is the dynamic load balancer. In adaptive mesh refined codes, the mesh is constantly changing resulting in load imbalance among the processors requiring a load‐balancing phase. The load balancing may occur often, requiring the load balancer to perform quickly. One common method for dynamic load balancing is to use space‐filling curves. Space‐filling curves, in particular the Hilbert curve, generate good partitions quickly in serial. However, at tens and hundreds of thousands of processors serial generation of space‐filling curves will hinder scalability. In order to avoid this issue we have developed a method that generates space‐filling curves quickly in parallel by reducing the generation to integer sorting. Copyright © 2007 John Wiley & Sons, Ltd. 相似文献

10.

Parallel multilevel algorithms for hypergraph partitioning

Aleksandar Trifunović William J. Knottenbelt 《Journal of Parallel and Distributed Computing》2008

In this paper, we present parallel multilevel algorithms for the hypergraph partitioning problem. In particular, we describe for parallel coarsening, parallel greedy k-way refinement and parallel multi-phase refinement. Using an asymptotic theoretical performance model, we derive the isoefficiency function for our algorithms and hence show that they are technically scalable when the maximum vertex and hyperedge degrees are small. We conduct experiments on hypergraphs from six different application domains to investigate the empirical scalability of our algorithms both in terms of runtime and partition quality. Our findings confirm that the quality of partition produced by our algorithms is stable as the number of processors is increased while being competitive with those produced by a state-of-the-art serial multilevel partitioning tool. We also validate our theoretical performance model through an isoefficiency study. Finally, we evaluate the impact of introducing parallel multi-phase refinement into our parallel multilevel algorithm in terms of the trade off between improved partition quality and higher runtime cost. 相似文献