Similar literature
1.
This paper addresses the problem of global graph alignment on supercomputer-class clusters. We define the alignment of two graphs as a mapping of each vertex in the first graph to a unique vertex in the second graph so as to optimize a given similarity-based cost function. Using a state-of-the-art serial algorithm for the computation of vertex similarity scores called Network Similarity Decomposition (NSD), we derive corresponding parallel formulations. Coupling this parallel similarity algorithm with a parallel auction-based bipartite matching technique, we obtain a highly efficient and scalable graph matching pipeline. We validate the performance of our integrated approach on a large parallel platform and on diverse graph instances (including Protein Interaction, Wikipedia and Web networks). Experimental results demonstrate that our algorithms scale to large machine configurations (thousands of cores) and problem instances, enabling the alignment of networks of sizes two orders of magnitude larger than reported in the current literature.

2.
In this paper, we present parallel multilevel algorithms for the hypergraph partitioning problem. In particular, we describe parallel coarsening, parallel greedy k-way refinement and parallel multi-phase refinement. Using an asymptotic theoretical performance model, we derive the isoefficiency function for our algorithms and hence show that they are technically scalable when the maximum vertex and hyperedge degrees are small. We conduct experiments on hypergraphs from six different application domains to investigate the empirical scalability of our algorithms both in terms of runtime and partition quality. Our findings confirm that the quality of the partitions produced by our algorithms is stable as the number of processors is increased while being competitive with those produced by a state-of-the-art serial multilevel partitioning tool. We also validate our theoretical performance model through an isoefficiency study. Finally, we evaluate the impact of introducing parallel multi-phase refinement into our parallel multilevel algorithm in terms of the trade-off between improved partition quality and higher runtime cost.

3.
Stéphane. Pattern Recognition, 1995, 28(12): 1993-2000
We propose a parallel thinning algorithm for binary pictures. Given an N × N binary image including an object, our algorithm computes the skeleton of the object in O(N^2), using a pyramidal decomposition of the picture. The behavior of this algorithm is studied by considering a family of digitizations of the same object at different levels of resolution. With the Exclusive Read Exclusive Write (EREW) Parallel Random Access Machine (PRAM), our algorithm runs in O(log N) time using O(N^2/log N) processors and it is work-optimal. The same result is obtained with high-connectivity distributed memory SIMD machines having strong hypercube and pyramid. We describe the basic operator, the pyramidal algorithm and some experimental results on the SIMD MasPar parallel machine.
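
The work-optimality claim can be checked by multiplying the parallel running time by the processor count. A short derivation, reading the sequential bound as quadratic in N and taking the stated time and processor bounds at face value:

```latex
\[
W(N) \;=\; T(N)\cdot P(N)
     \;=\; O(\log N)\cdot O\!\left(\frac{N^{2}}{\log N}\right)
     \;=\; O(N^{2})
\]
```

The total work matches the sequential cost, which is what "work-optimal" means here.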

4.
A parallel FFT on an MIMD machine (cited 5 times: 0 self-citations, 5 citations by others)
In this paper we present a parallelization of the Cooley-Tukey FFT algorithm that is implemented on a shared-memory MIMD (non-vector) machine that was built in the Dept. of Computer Science, Tel Aviv University. A parallel algorithm is presented for the one-dimensional Fourier transform, with performance analysis. For a large array of complex numbers to be transformed, an almost linear speed-up is demonstrated. This algorithm can be executed by any number of processors, but generally the number is much less than the length of the input data.
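
For orientation, the serial Cooley-Tukey recursion that such a parallelization starts from can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `fft` and `dft` are made up here, the input length is assumed to be a power of two, and no processor assignment is shown.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    # The two half-size transforms do not depend on each other, which is
    # the independence a parallel version can exploit.
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dft(x):
    """Direct O(n^2) transform, included only to check the FFT."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

Comparing `fft(x)` with `dft(x)` on a short input is a quick way to confirm the butterfly step.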

5.
The particle swarm optimization (PSO) algorithm is a population-based algorithm for finding the optimal solution. Because of its simplicity of implementation and its fewer adjustable parameters compared to other global optimization algorithms, PSO is gaining attention for solving complex and large-scale problems. However, PSO often requires a long execution time to solve those problems. This paper proposes a parallel PSO algorithm, called delayed exchange parallelization, which improves the performance of PSO in distributed environments by hiding communication latency efficiently. By overlapping communication with computation, the proposed algorithm extracts the parallelism inherent in PSO. The performance of our proposed parallel PSO algorithm was evaluated using several applications. The results of the evaluation showed that the proposed parallel algorithm drastically improved the performance of PSO, especially in high-latency network environments.
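
As background, a basic serial global-best PSO loop might look like the sketch below. It does not implement delayed exchange parallelization; the function name `pso`, the parameter values (`w`, `c1`, `c2`, swarm size, iteration count) and the box clamping are assumptions chosen for illustration. In a distributed variant, the swarm-wide best (`g_pos`, `g_val`) is the value that would have to be communicated between processes.

```python
import random


def pso(f, dim, bounds, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise f over a box with a basic global-best particle swarm."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]          # personal best positions
    best_val = [f(p) for p in pos]          # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # swarm-wide best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Keep the particle inside the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val
```

For example, `pso(lambda p: sum(v * v for v in p), 2, (-5.0, 5.0))` searches for the minimum of a two-dimensional sphere function.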

6.
The restricted synchronization structure of so-called structured parallel programming paradigms has an advantageous effect on programmer productivity, cost modeling, and scheduling complexity. However, imposing these restrictions can lead to a loss of parallelism, compared to using a programming approach that does not impose synchronization structure. In this paper we study the potential loss of parallelism when expressing parallel computations in a programming model which limits the computation graph (DAG) to series–parallel topology, which characterizes all well-known structured programming models. We present an analytical model that approximately captures this loss of parallelism in terms of simple parameters that are related to DAG topology and workload distribution. We validate the model using a wide range of synthetic and real-world parallel computations running on shared and distributed-memory machines. Although the loss of parallelism is theoretically unbounded, our measurements show that for all of the above applications the performance loss due to choosing a series–parallel structured model is invariably limited to at most 10%. In all cases, the loss of parallelism is predictable provided the topology and workload variability of the DAG are known.

7.
We study online adaptive scheduling for multiple sets of parallel jobs, where each set may contain one or more jobs with time-varying parallelism. This two-level scheduling scenario arises naturally when multiple parallel applications are submitted by different users or user groups in large parallel systems, where both user-level fairness and system-wide efficiency are important concerns. To achieve fairness, we use the well-known equi-partitioning algorithm to distribute the available processors among the active job sets at any time. For efficiency, we apply a feedback-driven adaptive scheduler that periodically adjusts the processor allocations within each set by consciously exploiting the jobs’ execution history. We show that our algorithm achieves asymptotically competitive performance with respect to the set response time, which incorporates two widely used performance metrics, namely, total response time and makespan, as special cases. Both theoretical analysis and simulation results demonstrate that our algorithm improves upon an existing scheduler that provides only fairness but lacks efficiency. Furthermore, we provide a generalized framework for analyzing a family of scheduling algorithms based on feedback-driven policies with provable efficiency. Finally, we consider an extended multi-level hierarchical scheduling model and present a fair and efficient solution that effectively reduces the problem to the two-level model.
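
The fairness step can be illustrated with a small sketch of equi-partitioning across job sets. The function name `equi_partition` and the rule for handing out leftover processors are assumptions made for this example; the feedback-driven allocation inside each set is not shown.

```python
def equi_partition(total_processors, active_sets):
    """Split processors as evenly as possible among the active job sets."""
    if not active_sets:
        return {}
    share, extra = divmod(total_processors, len(active_sets))
    # The first `extra` sets receive one additional processor, so every
    # processor is handed out and no two shares differ by more than one.
    return {
        name: share + (1 if i < extra else 0)
        for i, name in enumerate(active_sets)
    }
```

Calling the function again whenever a set becomes active or finishes gives each active set an equal share "at any time" in the sense of the abstract.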

8.
In this paper, we consider the scheduling problem on identical parallel machines, in which jobs arrive over time and preemption is not allowed. The goal is to minimize the total completion time. Following the idea of the Delayed-SPT Algorithm proposed by Hoogeveen and Vestjens [Optimal on-line algorithms for single-machine scheduling. In: Proceedings 5th international conference on integer programming and combinatorial optimization (IPCO). Lecture notes in computer science, vol. 1084. Berlin: Springer; 1996. p. 404–14], we give an on-line algorithm for the scheduling problem on m identical parallel machines. We show that this algorithm is 2-competitive and the bound is tight.

9.
This paper considers four parallel Cholesky factorization algorithms, including SPOTRF from the February 1992 release of LAPACK, each of which calls parallel Level 2 or 3 BLAS, or both. A fifth parallel Cholesky algorithm that calls serial Level 3 BLAS is also described. The efficiency of these five algorithms on the CRAY-2, CRAY Y-MP/832, Hitachi Data Systems EX 80, and IBM 3090-600J is evaluated and compared with a vendor-optimized parallel Cholesky factorization algorithm. The fifth parallel Cholesky algorithm that calls serial Level 3 BLAS provided the best performance of all algorithms that called BLAS routines. In fact, this algorithm outperformed the Cray-optimized libsci routine (SPOTRF) by 13–44%, depending on the problem size and the number of processors used. This work was supported by grants from IMSL, Inc., and Hitachi Data Systems. The first version of this paper was presented as a poster session at Supercomputing '90, New York City, November 1990.
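
For readers unfamiliar with the factorization itself, an unblocked serial Cholesky routine can be sketched in a few lines. This is only an illustration of the computation A = L·Lᵀ; it is not SPOTRF, uses no BLAS calls, and the function name `cholesky` is chosen for this example.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already-computed parts of rows i and j.
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(d)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low
```

Blocked variants reorganize these same operations into matrix-matrix updates so that most of the time is spent in Level 3 BLAS.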

10.
Hierarchical agglomerative clustering (HAC) is very useful, but its practical use is limited by its high CPU time and memory complexity. Earlier, we proposed an efficient partitioning scheme, partially overlapping partitioning (POP), based on the fact that in HAC small and closely placed clusters are agglomerated initially, and only towards the end are larger and distant clusters agglomerated. Here, we present the parallel version of POP, pPOP. Theoretical analysis shows that, compared to the existing algorithms, pPOP achieves a CPU time speed-up and memory scale-down of O(c) without compromising accuracy, where c is the number of cells in the partition. A shared memory implementation shows that pPOP outperforms existing algorithms significantly.

11.
We present a parallel Monte Carlo photon transport algorithm that ensures the reproducibility of results. The important feature of this parallel implementation is the introduction of a pair of pseudo-random number generators. This pair of generators is structured in such a manner as to ensure minimal correlation between the two sequences of pseudo-random numbers produced. We term this structure a ‘pseudo-random tree’. Using this structure, we are able to reproduce results exactly in an asynchronous parallel processing environment. The algorithm tracks the history of photons as they interact with two carbon cylinders joined end to end. The algorithm was implemented on both a Denelcor HEP and a CRAY X-MP/48. We describe the algorithm and the pseudo-random tree structure and present speedup results of our implementation.

12.
This paper presents a simple yet effective algorithm to improve an arbitrary Poisson disk sampling to reach the maximal property, i.e., no more Poisson disks can be inserted. Taking a non-maximal Poisson disk sampling as input, our algorithm efficiently detects the regions allowing additional samples and then generates Poisson disks in these regions. The key idea is to convert the complicated plane or space searching problem into a simple search on circles or spheres, which is one dimension lower than the original sampling domain. Our algorithm is memory efficient and flexible, and generates maximal Poisson disk sampling in an arbitrary 2D polygon or 3D polyhedron. Moreover, our parallel algorithm can be extended from Euclidean space to curved surfaces in an intrinsic manner. Thanks to its parallel structure, our method can be implemented easily on modern graphics hardware. We have observed significant performance improvement compared to the existing techniques.

13.
A three-dimensional electromagnetic particle-in-cell code with Monte Carlo collision (PIC-MCC) is developed for MIMD parallel supercomputers. This code uses a standard relativistic leapfrog scheme incorporating Monte Carlo calculations to push plasma particles and to include collisional effects on particle orbits. A local finite-difference time-domain method is used to update the self-consistent electromagnetic fields. The code is implemented using the General Concurrent PIC (GCPIC) algorithm, which uses domain decomposition to divide the computation among the processors. Particles must be exchanged between processors as they move among subdomains. Message passing is implemented using the Express Cubix library and the PVM. We evaluate the performance of this code using a 512-processor Intel Touchstone Delta, a 512-processor Intel Paragon, and a 256-processor CRAY T3D. It is shown that a high parallel efficiency exceeding 95% has been achieved on all three machines for large problems. We have run PIC-MCC simulations using several hundred million particles with several million collisions per time step. For these large-scale simulations the particle push time achieved is in the range of 90–115 ns/particle/time step, and the collision calculation time in the range of a few hundred nanoseconds per collision.

14.
Continuous evolution and improvement of GPGPUs has significantly broadened areas of application. The massively parallel platform they offer, paired with the high efficiency of performing certain operations, opens many questions on the development of suitable techniques and algorithms. In this work, we present a novel algorithm and create a massively parallel, GPGPU-based solver for enumerating solutions of the N-Queens problem. We discuss two implementations of our algorithm for GPGPUs and provide insights on the optimizations we applied. We also evaluate the performance of our approach and compare our work to existing literature, showing a clear reduction in computational time.
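
To make the enumeration task concrete, a common serial baseline counts N-Queens solutions by bitmask backtracking. The sketch below is that baseline only, not the authors' GPGPU algorithm; the name `count_n_queens` is chosen for this example. Each partial placement spawns an independent subtree, which is the kind of independence a massively parallel solver can distribute.

```python
def count_n_queens(n):
    """Count all placements of n non-attacking queens on an n x n board."""
    full = (1 << n) - 1  # one bit per column

    def place(cols, diag_l, diag_r):
        if cols == full:          # every column holds a queen
            return 1
        total = 0
        # Bits set in `free` are columns not attacked in the current row.
        free = full & ~(cols | diag_l | diag_r)
        while free:
            bit = free & -free    # lowest free column
            free ^= bit
            total += place(cols | bit,
                           ((diag_l | bit) << 1) & full,
                           (diag_r | bit) >> 1)
        return total

    return place(0, 0, 0)
```

The diagonal masks are shifted by one column per row, so a single integer per direction tracks every attacked square in the next row.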

15.
We study the online batch scheduling problem on parallel machines with delivery times. Online algorithms are designed on m parallel batch machines to minimize the time by which all jobs have been delivered. When all jobs have identical processing times, we provide the optimal online algorithms for both bounded and unbounded versions of this problem. For the general case of processing times on unbounded batch machines, an online algorithm with a competitive ratio of 2 is given when the number of machines is m=2 or m=3. When m≥4, we present an online algorithm with a competitive ratio of 1.5+o(1).

16.
The growth in size of networked high performance computers along with novel accelerator‐based node architectures has further emphasized the importance of communication efficiency in high performance computing. The world's largest high performance computers are usually operated as shared user facilities due to the costs of acquisition and operation. Applications are scheduled for execution in a shared environment and are placed on nodes that are not necessarily contiguous on the interconnect. Furthermore, the placement of tasks on the nodes allocated by the scheduler is sub‐optimal, leading to performance loss and variability. Here, we investigate the impact of task placement on the performance of two massively parallel application codes on the Titan supercomputer, a turbulent combustion flow solver (S3D) and a molecular dynamics code (LAMMPS). Benchmark studies show a significant deviation from ideal weak scaling and variability in performance. The inter‐task communication distance was determined to be one of the significant contributors to the performance degradation and variability. A genetic algorithm‐based parallel optimization technique was used to optimize the task ordering. This technique provides an improved placement of the tasks on the nodes, taking into account the application's communication topology and the system interconnect topology. Application benchmarks after task reordering through genetic algorithm show a significant improvement in performance and reduction in variability, thereby enabling the applications to achieve better time to solution and scalability on Titan during production. Copyright © 2015 John Wiley & Sons, Ltd.

17.
In this paper we describe and implement a parallel algorithm to find approximate solutions for the Closest String Problem (CSP). The CSP, also known as the Motif Finding problem, has applications in Coding Theory and Computational Biology. The CSP is NP-hard, which motivates us to think about heuristics to solve large instances. Several approximation algorithms have been designed for the CSP, but all of them have a poor performance guarantee. Recently some researchers have shown empirically that integer programming techniques can be successfully used to solve moderate-size instances (10–30 strings, each of which is 300–800 characters long) of the CSP. However, real-world instances are larger than those tested. In this paper we show how a simple heuristic can be used to find near-optimal solutions to that problem. We implemented a parallel version of this heuristic and report computational experiments on large-scale instances. These results show the effectiveness of our approach.
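
The objective being minimized can be made concrete with a short sketch. The code below evaluates a candidate by its largest Hamming distance to the inputs and, as a naive baseline, returns the best of the input strings themselves. It is not the heuristic from the paper, and the function names are chosen for this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def radius(candidate, strings):
    """CSP objective: largest Hamming distance from candidate to any input."""
    return max(hamming(candidate, s) for s in strings)


def best_input_string(strings):
    """Baseline: the input string whose radius is smallest."""
    return min(strings, key=lambda s: radius(s, strings))
```

By the triangle inequality, any input string is within twice the optimal radius of every other input, so this baseline is never worse than a factor of two from optimal; heuristics aim to close that gap.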

18.
In this paper we introduce our estimation method for parallel execution times, based on identifying separate “parts” of the work done by parallel programs. Our run time analysis works without any source code inspection. The time of parallel program execution is expressed in terms of the sequential work and the parallel penalty. We measure these values for different problem sizes and numbers of processors and estimate them for unknown values in both dimensions using statistical methods. This allows us to predict parallel execution time for unknown inputs and non-available processor numbers with high precision. Our prediction methods require orders of magnitude fewer data points than existing approaches. We verified our approach on parallel machines ranging from a multicore computer to a peta-scale supercomputer.

19.
Database query processing can benefit significantly from parallelism. Parallel database algorithms combine substantial CPU and I/O activity, memory requirements, and massive data exchange between processes, all of which must be considered to obtain optimal performance. Since parallel external sorting is a very typical example, we have focused on sorting to tune Volcano, a new query processing system. The purpose of the Volcano project is to provide efficient, extensible tools for query and request processing in novel application domains, particularly in object-oriented and scientific database systems, and for experimental database performance research. It includes all query processing algorithms conventionally used in relational database systems as well as several new ones, and can execute all of them in parallel. In this article, we present Volcano's parallel external sorting algorithm and a sequence of enhancements to improve its performance. We obtained very good absolute performance, 84 seconds for 100 MB of data, as well as near-linear speedup with sixteen CPUs and disks. Furthermore, these results were achieved on a shared-memory machine despite the common belief that parallel query processing is best implemented on distributed-memory systems. We detail our tuning measures and report on their effectiveness.

20.
The Bees Algorithm is a population-based, computation-bound method, inspired by the natural behavior of honey bees, that finds a near-optimal solution to a search problem. Recently, many parallel swarm-based algorithms have been developed to run on GPUs (Graphics Processing Units), so developing a parallel Bees Algorithm that runs on the GPU has become very important. In this paper, we extend the Bees Algorithm to run on CUDA (Compute Unified Device Architecture); we call the result CUBA (CUDA-based Bees Algorithm). We evaluate the performance of CUBA by conducting experiments on numerous well-known optimization problems. Results show that CUBA significantly outperforms the standard Bees Algorithm on numerous different optimization problems.
