Similar Documents
20 similar documents found (search time: 31 ms)
1.
Fast Ant Colony Optimization on Runtime Reconfigurable Processor Arrays   Cited by: 4 (self-citations: 0, by others: 4)
Ant Colony Optimization (ACO) is a metaheuristic used to solve combinatorial optimization problems. As with other metaheuristics, like evolutionary methods, ACO algorithms often show good optimization behavior but are slow when compared to classical heuristics. Hence, there is a need to find fast implementations for ACO algorithms. In order to allow a fast parallel implementation, we propose several changes to a standard form of ACO algorithms. The main new features are the non-generational approach and the use of a threshold-based decision function for the ants. We show that the new algorithm has good optimization behavior and also allows a fast implementation on reconfigurable processor arrays. This is the first implementation of the ACO approach on a reconfigurable architecture. The running time of the algorithm is quasi-linear in the problem size n and the number of ants on a reconfigurable mesh with n² processors, each provided with only a constant number of memory words.
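As a rough illustration of such a threshold-based decision function (our own sketch; the paper's exact rule and its hardware mapping may differ), consider:

```cpp
#include <random>
#include <vector>

// Hypothetical sketch of a threshold-based ant decision: instead of the
// classical roulette-wheel selection (which needs a global normalization
// over all candidates), a randomly probed candidate city is accepted as
// soon as its pheromone value exceeds a threshold. Avoiding the global
// sum is what makes such a rule attractive for hardware implementations.
int chooseNextCity(const std::vector<double>& pheromone,   // tau[j] per candidate j
                   const std::vector<bool>& visited,
                   double threshold,
                   std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, (int)pheromone.size() - 1);
    for (int tries = 0; tries < 8 * (int)pheromone.size(); ++tries) {
        int j = pick(rng);
        if (!visited[j] && pheromone[j] >= threshold)
            return j;                          // accept: pheromone above threshold
    }
    for (int j = 0; j < (int)visited.size(); ++j)
        if (!visited[j]) return j;             // fallback: first unvisited city
    return -1;                                 // all cities visited
}
```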

2.
Gopal Gupta, Enrico Pontelli. Software, 2001, 31(12): 1143–1181
Naive parallel implementation of non-deterministic systems (such as a theorem proving system) and languages (such as logic, constraint, or concurrent constraint languages) can result in poor performance. We present three optimization schemas, based on flattening of the computation tree, procrastination of overheads, and sequentialization of computations, that can be systematically applied to parallel implementations of non-deterministic systems/languages to reduce the parallel overhead and to obtain improved efficiency of parallel execution. The effectiveness of these schemas is illustrated by applying them to the ACE parallel logic programming system. The performance data presented show that considerable improvement in execution efficiency can be achieved. Copyright © 2001 John Wiley & Sons, Ltd.

3.
Algorithmic skeletons simplify software development: they abstract typical patterns of parallelism and provide their efficient implementations, allowing the application developer to focus on the structure of algorithms rather than on implementation details. This becomes especially important for modern parallel systems with multiple graphics processing units (GPUs), whose programming is complex and error-prone because state-of-the-art programming approaches like CUDA and OpenCL lack high-level abstractions. We define a new algorithmic skeleton for allpairs computations, which occur in real-world applications ranging from bioinformatics to physics. We develop the skeleton’s generic parallel implementation for multi-GPU systems in OpenCL. To enable the automatic use of the fast GPU memory, we identify and implement an optimized version of the allpairs skeleton with a customizing function that follows a certain memory access pattern. We use matrix multiplication as an application study for the allpairs skeleton and its two implementations, and demonstrate that the skeleton greatly simplifies programming, saving up to 90% of lines of code as compared to OpenCL. The performance of our optimized implementation is up to 6.8 times higher than that of the generic implementation and is competitive with the performance of manually written, optimized OpenCL code.
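To make the pattern concrete, here is a minimal sequential C++ model of an allpairs skeleton with a customizing function; the names and the host-side, non-GPU setting are our illustrative assumptions, not the paper's OpenCL code:

```cpp
#include <vector>

// A minimal sequential model of an allpairs skeleton: apply a customizing
// function f to every pair (A[i], B[j]) and collect the results in an
// n x m matrix. A multi-GPU implementation parallelizes exactly this
// pattern; with dot product as f (and B holding the columns of the
// right-hand matrix), the skeleton instance is matrix multiplication.
template <typename T, typename F>
std::vector<std::vector<T>> allpairs(const std::vector<std::vector<T>>& A,
                                     const std::vector<std::vector<T>>& B,
                                     F f) {
    std::vector<std::vector<T>> C(A.size(), std::vector<T>(B.size()));
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < B.size(); ++j)
            C[i][j] = f(A[i], B[j]);   // all pairs independent: trivially parallel
    return C;
}

// Example customizing function: dot product, so C = A * B^T.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double s = 0.0;
    for (size_t k = 0; k < x.size(); ++k) s += x[k] * y[k];
    return s;
}
```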

4.
Pipelining has been used in the design of many PRAM algorithms to reduce their asymptotic running time. Paul, Vishkin, and Wagener (PVW) used the approach in a parallel implementation of 2-3 trees. The approach was later used by Cole in the first O(lg n) time sorting algorithm on the PRAM not based on the AKS sorting network, and has since been used to improve the time of several other algorithms. Although the approach has improved the asymptotic time of many algorithms, there are two practical problems: maintaining the pipeline is quite complicated for the programmer, and the pipelining forces highly synchronous code execution. Synchronous execution is less practical on asynchronous machines and makes it difficult to modify a schedule to use less memory or to take better advantage of locality. In this paper we show how futures (a parallel language construct) can be used to implement pipelining without requiring the user to code it explicitly, allowing for much simpler code and more asynchronous execution. A runtime system manages the pipelining implicitly. As with user-managed pipelining, we show how the technique reduces the depth of many algorithms by a logarithmic factor over the non-pipelined version. We describe and analyze four algorithms for which this is the case: a parallel merging algorithm on trees, parallel algorithms for finding the union and difference of two randomized balanced trees (treaps), and insertion into a variant of the PVW 2-3 trees. For three of these, the pipeline delays are data dependent, making them particularly difficult to pipeline by hand. To determine the runtime of algorithms, we first analyze the algorithms in a language-based cost model in terms of the work w and depth d of the computations, and then show universal bounds for implementing the language on various machine models. Received December 3, 1997, and in final form September 17, 1998.
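The futures construct itself is easy to demonstrate; here is a minimal C++ illustration (our own, not the paper's language or runtime) of how futures hide pipeline management: synchronization happens only at the point of use.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Producers run asynchronously; the consumer proceeds as soon as each
// result is ready, with synchronization deferred to get().
std::vector<int> produceChunk(int i) {
    std::vector<int> v(1000);
    std::iota(v.begin(), v.end(), i * 1000);   // stand-in for real stage-one work
    return v;
}

long long pipelinedSum(int chunks) {
    std::vector<std::future<std::vector<int>>> stage;
    for (int i = 0; i < chunks; ++i)           // launch all producers
        stage.push_back(std::async(std::launch::async, produceChunk, i));
    long long total = 0;
    for (auto& f : stage) {                    // stage two overlaps with stage one
        std::vector<int> chunk = f.get();      // blocks only if not yet ready
        total += std::accumulate(chunk.begin(), chunk.end(), 0LL);
    }
    return total;
}
```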

5.
Particle swarm optimization (PSO), like other population-based meta-heuristics, is intrinsically parallel and can be effectively implemented on Graphics Processing Units (GPUs), which are, in fact, massively parallel processing architectures. In this paper we discuss possible approaches to parallelizing PSO on graphics hardware within the Compute Unified Device Architecture (CUDA™), a GPU programming environment by nVIDIA™ which supports the company’s latest cards. In particular, two different ways of exploiting GPU parallelism are explored and evaluated. The execution speed of the two parallel algorithms is compared, on functions which are typically used as benchmarks for PSO, with a standard sequential implementation of PSO (SPSO), as well as with recently published results of other parallel implementations. An in-depth study of the computational efficiency of our parallel algorithms is carried out by assessing speed-up and scale-up with respect to SPSO. Also reported are some results on the optimization effectiveness of the parallel implementations with respect to SPSO, in cases where the parallel versions introduce possibly significant differences with respect to the sequential version.
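For reference, the per-particle update that such GPU parallelizations distribute across threads is the textbook PSO step sketched below; the constants follow common SPSO settings and are not taken from the paper:

```cpp
#include <random>
#include <vector>

// Textbook PSO update for one particle: the operation that GPU
// parallelizations typically map onto one thread per particle (or per
// particle-dimension pair).
void psoUpdate(std::vector<double>& x, std::vector<double>& v,
               const std::vector<double>& pbest,   // particle's best position
               const std::vector<double>& gbest,   // swarm's best position
               std::mt19937& rng) {
    const double w = 0.7298, c1 = 1.4962, c2 = 1.4962;
    std::uniform_real_distribution<double> u(0.0, 1.0);
    for (size_t d = 0; d < x.size(); ++d) {
        v[d] = w * v[d]
             + c1 * u(rng) * (pbest[d] - x[d])     // cognitive pull
             + c2 * u(rng) * (gbest[d] - x[d]);    // social pull
        x[d] += v[d];
    }
}
```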

6.
With the ability of customization for an application domain, extensible processors have been used more and more in embedded systems in recent years. Extensible processors customize an application domain by executing parts of application code in hardware instead of software. Determining which parts of application code to implement as custom instructions generally requires subgraph enumeration and subgraph selection, both of which are computationally difficult problems. Most previous work focuses on sequential algorithms for these two problems. In this paper, we present a parallel implementation of a recent subgraph enumeration algorithm on a computer cluster. A standard ant colony optimization (ACO) algorithm, a modified version of ACO with local optimum search, and a parallel ACO algorithm are also proposed to solve the subgraph selection problem in this work. Experimental results show that the parallel algorithms outperform the sequential algorithms in terms of runtime and/or quality of results. In addition, we formally prove an upper bound on the number of feasible solutions to the subgraph selection problem with or without the overlapping constraint.

7.
Two types of broadcast in algorithms are distinguished: (1) a data broadcast, where one data value is used in more than one computation, and (2) a computational broadcast, where one variable is computed in more than one computation. Both types of broadcast should preferably be eliminated when a processor array implementation using VLSI technology is desired.

When the algorithm computes only one variable value for each index vector, the computational broadcast can be eliminated in a straightforward manner by introducing counter values, resulting in single-assignment code. However, the case in which the algorithm computes two or more variable values, each specified by a different computational broadcast, has not been considered before. As far as is known, it has so far been handled only by heuristically deriving localized algorithms in single-assignment code.

In this paper we define this problem in terms of a system of affine recurrence equations and analyze the data dependencies introduced. We then present a synthesis procedure that eliminates the computational broadcast, and a few implementation examples are shown. The QR decomposition algorithm is also presented in localized single-assignment code using the proposed method, and several different parallel implementations are discussed.
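For intuition, here is the textbook localization of a data broadcast into uniform (nearest-neighbor) dependencies; this illustrative example is ours, not taken from the paper:

```latex
% Data broadcast: every row index i reads the same value x(j):
%   a(i,j) = x(j)  for  1 <= i <= n.
% Localized single-assignment form: the value is propagated along i,
% so each computation depends only on its nearest neighbor:
a(1,j) = x(j), \qquad a(i,j) = a(i-1,j) \quad (2 \le i \le n).
```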

8.
The aim of this paper is to evaluate OpenMP, TBB and Cilk Plus as basic language-based tools for simple and efficient parallelization of recursively defined computational problems and other problems that need both task and data parallelization techniques. We show how to use these models of parallel programming to transform the source code of Adaptive Simpson’s Integration into programs that can utilize multiple cores of modern processors. Using the example of the Bellman–Ford algorithm for solving single-source shortest path problems, we advise how to improve the performance of data-parallel algorithms by tuning data structures for better utilization of the vector extensions of modern processors. Manual vectorization techniques based on Cilk array notation and intrinsics are presented. We also show how to simplify such optimization using Intel SIMD Data Layout Template containers.
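As an illustration of the task-parallel transformation described above, here is a sketch of Adaptive Simpson's Integration with OpenMP tasks; the error-splitting rule and the cut-off depth are common textbook choices, not necessarily those used in the paper:

```cpp
#include <cmath>

// Simpson's rule on one interval [a, b].
static double simpson(double (*f)(double), double a, double b) {
    double c = 0.5 * (a + b);
    return (b - a) / 6.0 * (f(a) + 4.0 * f(c) + f(b));
}

static double adaptive(double (*f)(double), double a, double b,
                       double whole, double eps, int depth) {
    double c = 0.5 * (a + b);
    double left = simpson(f, a, c), right = simpson(f, c, b);
    if (std::fabs(left + right - whole) <= 15.0 * eps || depth <= 0)
        return left + right + (left + right - whole) / 15.0;
    double l, r;
    #pragma omp task shared(l)                 // left half runs as a task
    l = adaptive(f, a, c, left, eps / 2.0, depth - 1);
    r = adaptive(f, c, b, right, eps / 2.0, depth - 1);
    #pragma omp taskwait                       // join before combining
    return l + r;
}

double integrate(double (*f)(double), double a, double b, double eps) {
    double result = 0.0;
    #pragma omp parallel                       // team of worker threads
    #pragma omp single nowait                  // one thread seeds the task tree
    result = adaptive(f, a, b, simpson(f, a, b), eps, 16);
    return result;
}
```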

9.
We study the problem of guaranteeing correct execution semantics in parallel implementations of logic programming languages in the presence of built-in constructs that are sensitive to the order of execution. The declarative semantics of logic programming languages permit execution of the various goals in any arbitrary order (including in parallel). However, goals corresponding to extra-logical built-in constructs should respect the sequential order of execution to ensure correct semantics. Ensuring this correctness in the presence of such built-in constructs, while efficiently exploiting maximum parallelism, is a difficult problem. In this paper, we propose a formalization of this problem in terms of operations on dynamic trees. This abstraction enables us to: (i) show that existing schemes to handle order-sensitive computations used in current parallel systems are sub-optimal; (ii) develop a novel, optimal scheme to handle order-sensitive goals that requires only a constant-time overhead per operation. While we present our results in the context of logic programming, they apply equally well to most parallel non-deterministic systems. Received: 20 April 1998 / 3 April 2000

10.

Centrality measures, or indicators of centrality, identify the most relevant nodes of a graph. Although optimized algorithms exist for computing most of them, they are still time-consuming and even infeasible to apply to sufficiently large graphs, such as those representing social networks or extensive computer networks. In this paper, we present a parallel implementation in the C language of some optimal algorithms for computing several indicators of centrality. Our parallel version greatly reduces the execution time of its sequential (non-parallel) counterpart. The proposed solution relies on threading, allowing for a theoretical performance improvement close to the number of logical processors (cores) of the single computer on which it runs. Our software has been tested on several platforms, including the supercomputer Calendula, on which we achieved execution times close to 18 times faster with our parallel implementation than with our sequential one. Our solution is multi-platform and portable, working on any machine with several logical processors that is capable of compiling and running C language code.

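The thread-per-slice pattern the abstract describes can be sketched as follows; degree centrality and all names here are our illustrative choices, not the paper's code:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// The node set is split evenly across hardware threads and each thread
// scores its own slice. Degree centrality is used for brevity; the
// partitioning idea carries over to costlier indicators.
std::vector<int> degreeCentrality(const std::vector<std::vector<int>>& adj) {
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<int> score(adj.size(), 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            size_t lo = adj.size() * t / nthreads;       // static block partition:
            size_t hi = adj.size() * (t + 1) / nthreads; // thread t owns [lo, hi)
            for (size_t v = lo; v < hi; ++v)
                score[v] = (int)adj[v].size();           // disjoint writes, no locks
        });
    }
    for (auto& th : pool) th.join();
    return score;
}
```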

11.
Track-before-detect (TBD) algorithms are used in tracking systems where the object’s signal is below the noise floor (low-SNR objects). They require a large number of computations and memory transfers for real-time signal processing, and GPGPUs are well suited as parallel processing devices for TBD algorithms. Finding optimal or suboptimal code by hand is not possible due to the lack of documentation for low-level GPGPU programming, so high-level code optimization is necessary; an evolutionary approach based on a single parent and a single child, i.e., a local search, is considered. A brute-force search is not feasible, because there are N! code variants, where N is the number of motion-vector components. The proposed evolutionary operator, LREI (local random extraction and insertion), reorders the source code to reduce computation time through better organization of memory transfers and texture cache content. A starting point based on sorting and a minimal-execution-time metric is proposed, and the unbiased random and biased sorting techniques are compared experimentally. Tests show significant improvements in computation speed, about 8% over conventional CUDA code. Optimization of the sample code takes about 1 h (1,000 iterations) for the considered recursive spatio-temporal TBD algorithm.
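An extraction-and-insertion move with a single-parent, single-child acceptance rule can be sketched as below; measureRuntime and all identifiers are hypothetical placeholders for compiling and timing a reordered kernel:

```cpp
#include <random>
#include <vector>

// Assumes at least two statements in the order.
using Order = std::vector<int>;                      // permutation of statement indices

Order extractInsert(Order parent, std::mt19937& rng) {
    std::uniform_int_distribution<size_t> pick(0, parent.size() - 1);
    size_t from = pick(rng);
    int stmt = parent[from];
    parent.erase(parent.begin() + from);             // local random extraction
    size_t to = pick(rng) % parent.size();           // clamp into shrunken range
    parent.insert(parent.begin() + to, stmt);        // local random insertion
    return parent;
}

template <typename Timer>                            // Timer: Order -> seconds
Order localSearch(Order current, Timer measureRuntime,
                  int iterations, std::mt19937& rng) {
    double best = measureRuntime(current);
    for (int i = 0; i < iterations; ++i) {           // single parent, single child
        Order child = extractInsert(current, rng);
        double t = measureRuntime(child);
        if (t < best) { best = t; current = child; } // keep improvements only
    }
    return current;
}
```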

12.
The paper gives an overview of the DSPL programming environment, an integrated approach to automating the system design and implementation of applications run on dedicated parallel systems. The programming environment consists of a data-flow language and an integrated set of tools. The tools automatically derive a software model from the given application program. Based on the model, design decisions such as the network topology, the task mapping and schedule, and the optimal use of buffers are computed. Finally, the design decisions are automatically implemented by transforming the application program into executable code for the chosen processor network. The DSPL programming environment integrates model-based optimization techniques and program transformation techniques. The integration makes it possible to include new aspects in the optimization process, in particular optimizations crucial to the semantics of the program. The most important examples of such optimizations are the enforcement of the schedule in the case of data-dependent execution of tasks and the transformation of buffered communication into unbuffered communication. Both aspects are crucial to the generation of efficient parallel implementations. The integration of the two aspects is supported by a formal framework, which makes it possible to formally prove the correctness of the program optimizations performed by the programming environment.

13.
Hardware accelerators such as GPUs or the Intel Xeon Phi comprise hundreds or thousands of cores on a single chip and promise to deliver high performance. They are widely used to boost the performance of highly parallel applications. However, because of their divergent architectures, programmers face divergent programming paradigms. Programmers also have to deal with low-level concepts of parallel programming that make it a cumbersome task. In order to assist programmers in developing parallel applications, Algorithmic Skeletons have been proposed. They encapsulate well-defined, frequently recurring parallel programming patterns, thereby shielding programmers from low-level aspects of parallel programming. The main contribution of this paper is a comparison of two skeleton library implementations, one in C++ and one in Java, in terms of library design and programmability. In addition, on the basis of four benchmark applications, we evaluate the performance of the presented implementations on two test systems, a GPU cluster and a Xeon Phi system. The two implementations achieve comparable performance, with a slight advantage for the C++ implementation. Xeon Phi performance ranges between CPU and GPU performance.

14.
We examine combinatorial properties of a class of hash functions and their application to the simulation of classical models of parallel computation on other models, such as the BSP and the S*PRAM, optimally in communication to within additive lower-order terms. The BSP model can serve as a programming paradigm as well; we also examine the implications of architecture-independent parallel algorithm design in the context of the BSP model and show how it can lead to portable and scalable implementations of algorithms that can work on a multiplicity of hardware platforms with only recompilation of the source program code. Toward this end, dense Cholesky factorization algorithms are presented and their performance on three parallel hardware platforms, an SGI Power Challenge, an IBM SP2, and a Cray T3D, is examined and analyzed.

15.
This paper presents efficient and portable implementations of two useful primitives in image processing algorithms, histogramming and connected components. Our general framework is a single-address-space, distributed-memory programming model. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. Our connected components algorithm uses a novel approach for parallel merging which performs drastically limited updating during iterative steps, and concludes with a total consistency update at the final step. The algorithms have been coded in Split-C and run on a variety of platforms. Our experimental results are consistent with the theoretical analysis and provide the best known execution times for these two primitives, even when compared with machine-specific implementations.
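The histogramming primitive follows the usual distribute-then-coalesce pattern; the following shared-memory C++ sketch is our own illustration of that pattern (the paper itself targets a distributed-memory model, not std::thread):

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Each thread fills a private histogram over its slice of the image,
// and the private histograms are merged at the end, so bin updates
// never contend on shared memory.
std::vector<long> histogram(const std::vector<uint8_t>& pixels) {
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::vector<long>> local(nthreads, std::vector<long>(256, 0));
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            size_t lo = pixels.size() * t / nthreads;
            size_t hi = pixels.size() * (t + 1) / nthreads;
            for (size_t i = lo; i < hi; ++i)
                ++local[t][pixels[i]];           // contention-free updates
        });
    for (auto& th : pool) th.join();
    std::vector<long> h(256, 0);
    for (unsigned t = 0; t < nthreads; ++t)      // coalesce private histograms
        for (int b = 0; b < 256; ++b) h[b] += local[t][b];
    return h;
}
```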

16.

Image registration is a common task in medical image analysis, and a significant number of algorithms have been developed to perform rigid and non-rigid image registration. In particular, the free-form deformation algorithm is frequently used to carry out non-rigid registration; however, it is computationally very intensive. In this work, we describe an approach based on profiling data to identify the parts of this algorithm for which parallel implementations can be developed. The proposed approach assesses the efficiency of the algorithm by applying performance analysis techniques commonly available in traditional computer operating systems. Hence, this article provides guidelines to support researchers working on medical image processing and analysis in achieving real-time non-rigid image registration applications using common computing systems. According to our experimental findings, significant speedups can be accomplished by parallelizing sequential snippets, i.e., code regions that are executed more than once. For the costly functions previously identified in the studied free-form deformation algorithm, the developed parallelization decreased the runtime by up to a factor of seven relative to the single-threaded implementation. The implementations were developed with the Open Multi-Processing (OpenMP) application programming interface. In conclusion, this study confirms that, based on call graph visualization and the detected performance bottlenecks, one can easily find and evaluate snippets that are potential optimization targets, in addition to throughput in memory accesses.

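The profiling-guided workflow ends in parallelizing the identified hot loops; a minimal OpenMP sketch of such a loop-level parallelization (with a placeholder per-voxel computation, not the paper's code) looks like this:

```cpp
#include <vector>

// Once a hot, repeatedly executed loop nest has been identified by
// profiling, it is parallelized with an OpenMP work-sharing directive.
void transformVoxels(std::vector<float>& out, const std::vector<float>& in,
                     float scale, float offset) {
    // Iterations are independent, so the loop is a safe OpenMP target.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)in.size(); ++i)
        out[i] = scale * in[i] + offset;   // placeholder per-voxel work
}
```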

17.

This research presents a synthetic case study of multi-objective optimization for an active and passive design procedure based on dynamic programming, using genetic algorithms (GAs) and computational fluid dynamics (CFD). Both active and passive optimized variables are indispensable for efficient building design. This paper shows how to deal with these two different types of variables in a multi-objective optimization framework. Energy saving, thermal comfort, and indoor air quality are selected as objective functions. While demonstrating a synthetic multi-objective optimization with active and passive variables, this research analyzes the trade-off relations among the objective functions in the indoor environment. Representing fluctuating outdoor conditions as random variables, the building geometry (the passive design variable) and an HVAC system (the active design variable) were optimized using the dynamic programming approach. The research consists of several tasks: first, multi-objective optimization is carried out by genetic algorithms and computational fluid dynamics, and then dynamic programming is applied to the control system with random variables.


18.
Multiprocessor execution of functional programs   Cited by: 1 (self-citations: 0, by others: 1)
Functional languages have recently gained attention as vehicles for programming in a concise and elegant manner. In addition, it has been suggested that functional programming provides a natural methodology for programming multiprocessor computers. This paper describes research that was performed to demonstrate that multiprocessor execution of functional programs on current multiprocessors is feasible and results in a significant reduction in their execution times. Two implementations of the functional language ALFL were built on commercially available multiprocessors. Alfalfa is an implementation on the Intel iPSC hypercube multiprocessor, and Buckwheat is an implementation on the Encore Multimax shared-memory multiprocessor. Each implementation includes a compiler that performs automatic decomposition of ALFL programs and a run-time system that supports their execution. The compiler is responsible for detecting the inherent parallelism in a program and decomposing the program into a collection of tasks, called serial combinators, that can be executed in parallel. The abstract machine model supported by Alfalfa and Buckwheat is called heterogeneous graph reduction, which is a hybrid of graph reduction and conventional stack-oriented execution. This model supports parallelism, lazy evaluation, and higher-order functions while at the same time making efficient use of the processors in the system. The Alfalfa and Buckwheat runtime systems support dynamic load balancing, interprocessor communication (if required), and storage management. A large number of experiments were performed on Alfalfa and Buckwheat for a variety of programs. The results of these experiments, as well as the conclusions drawn from them, are presented. This research was supported in part by National Science Foundation grants DCR-8302018 and DCR-8521451, by a DARPA subcontract with SDC/Unisys, and by gifts from the Burroughs Austin Research Center and the Intel Corporation.

19.
Current computer systems separate main memory from storage, and programming languages typically reflect this distinction using different representations for data in memory and storage. However, moving data back and forth between these different layers and representations compromises both programming and execution efficiency. To remedy this, the concept of orthogonal persistence (OP) was proposed in the early 1980s, advocating that, from a programmer's standpoint, there should be no differences in the way that short-term and long-term data are manipulated. However, at that time, the underlying implementations still had to cope with the complexity of moving data across memory and storage. Today, recent nonvolatile memory (NVM) technologies, such as resistive RAM and phase-change memory, allow main memory and storage to be collapsed into a single layer of persistent memory, opening the way for more efficient programming abstractions for handling persistence. In this work, we revisit OP concepts in the context of NVM architectures and propose a persistent heap design for languages with automatic memory management. We demonstrate how it can significantly increase programmer and execution efficiency, removing the impedance mismatch of crossing semantic boundaries. To validate and demonstrate the presented concepts, we present JaphaVM, an implementation of the proposed design based on JamVM, an open-source Java Virtual Machine. Our results show that JaphaVM, in most cases, executes the same operations between one and two orders of magnitude faster than regular database-based and file-based implementations, while requiring significantly fewer lines of code.

20.
We apply the methodology of competitive analysis of algorithms to the implementation of programs on parallel machines. We consider the problem of finding the best on-line distributed scheduling strategy that executes in parallel an unknown directed acyclic graph (dag) which represents the data dependency relation graph of a parallel program and which is revealed as execution proceeds. We study the competitive ratio of some important classes of dags assuming a fixed communication delay ratio τ that captures the average interprocessor communication measured in instruction cycles. We provide competitive algorithms for divide-and-conquer dags, trees, and general dags, when the number of processors depends on the size of the input dag and when the number of processors is fixed. Our major result is a lower bound Ω(τ/log τ) on the competitive ratio for trees; it shows that it is impossible to design compilers that produce almost optimal execution code for all parallel programs. This fundamental result holds for almost any reasonable distributed-memory parallel computation model, including the LogP and BSP models. Received March 5, 1996; revised March 11, 1997.
