
1.

A parallel algorithm for the integer knapsack problem

D. Morales J.L. Roda C. Rodríguez F. Almeida F. García 《Concurrency and Computation》1996,8(4):251-260

A sequential algorithm with complexity O(M²+n) for the integer knapsack problem is presented. M is the capacity of the knapsack, and n the number of objects. The algorithm admits an efficient parallelization on a p-processor ring machine. The corresponding parallel algorithm is O(M²/p+n). The parallel algorithm is compared with a version of the well-known Lee algorithm adapted to the integer knapsack problem. Computational results on both a local area network and a transputer are reported.
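To make the problem statement concrete, the following is a minimal sketch of the textbook dynamic program for the integer knapsack problem, in which each object may be packed any number of times. It is not the O(M²+n) algorithm of the paper, whose details the abstract does not give; the function name and the parallel `weights`/`profits` lists are assumptions made for illustration, and weights are assumed to be positive integers.

```python
def integer_knapsack(capacity, weights, profits):
    """Best total profit when each object may be packed any number of times.

    Textbook O(n * M) dynamic program, shown only to illustrate the problem;
    it is NOT the paper's O(M^2 + n) algorithm. Weights must be positive ints.
    """
    best = [0] * (capacity + 1)  # best[c] = optimal profit for capacity c
    for c in range(1, capacity + 1):
        for w, p in zip(weights, profits):
            # Try packing one more copy of this object into capacity c.
            if w <= c and best[c - w] + p > best[c]:
                best[c] = best[c - w] + p
    return best[capacity]
```

For example, with capacity 10, weights [3, 4, 5] and profits [4, 5, 7], the best packing is two copies of the third object.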

2.

Parallel medical image reconstruction: from graphics processing units (GPU) to Grids

Maraike Schellmann Sergei Gorlatch Dominik Meiländer Thomas Kösters Klaus Schäfers Frank Wübbeling Martin Burger 《The Journal of supercomputing》2011,57(2):151-160

We present and compare a variety of parallelization approaches for a real-world case study on modern parallel and distributed computer architectures. Our case study is a production-quality, time-intensive algorithm for medical image reconstruction used in computer tomography (PET). We parallelize this algorithm for the main kinds of contemporary parallel architectures: shared-memory multiprocessors, distributed-memory clusters, graphics processing units (GPU) using the CUDA framework, the Cell processor and, finally, how various architectures can be accessed in a distributed Grid environment. The main contribution of the paper, besides the parallelization approaches, is their systematic comparison regarding four important criteria: performance, programming comfort, accessibility, and cost-effectiveness. We report results of experiments on particular parallel machines of different architectures that confirm the findings of our systematic comparison.

3.

An algorithmic framework for solving large-scale multistage stochastic mixed 0–1 problems with nonsymmetric scenario trees. Part II: Parallelization

Unai Aldasoro Laureano F. Escudero María Merino Gloria Pérez 《Computers & Operations Research》2013

This note is a sequel to the paper by Escudero et al. (2012) [1], in which the sequential Branch-and-Fix Coordination algorithm, referred to as BFC-MS, was introduced for solving large-scale multistage mixed 0–1 optimization problems up to optimality under uncertainty. The aim of the note is to present the parallel version of the BFC algorithm, referred to as PC-BFCMS, and to analyze the resulting reduction in elapsed solution time. The parallelization is performed at two levels. The inner level parallelizes the optimization of the MIP submodels attached to the set of scenario clusters that have been created by the modeler-defined break stage to decompose the original problem, where the nonanticipativity constraints are partially relaxed. Several strategies are presented for analyzing the performance of the inner parallel computation based on MPI (Message Passing Interface) threads to solve scenario cluster submodels versus the sequential version of the BFC-MS methodology. The outer level of parallelization defines a set of 0–1 variables, the combinations of whose 0–1 values, referred to as paths (one for each combination), allow independent models to be optimized in parallel, such that each one can itself be internally optimized with the inner parallelization approach. The results of using the outer–inner parallelization are remarkable: the larger the number of paths and MPI threads (in addition to the number of threads that the MIP solver allows to be used), the smaller the elapsed time to obtain the optimal solution. This new approach allows problems to be solved faster and can thus handle very large-scale problems that could not otherwise be solved by plain use of a state-of-the-art MIP solver, or could not be solved by the sequential version of the decomposition algorithm in acceptable elapsed time.

4.

A Parallel Implementation of the Simplex Function Minimization Routine

Donghoon Lee Matthew Wiswall 《Computational Economics》2007,30(2):171-187

This paper generalizes the widely used Nelder and Mead (Comput J 7:308–313, 1965) simplex algorithm to parallel processors. Unlike most previous parallelization methods, which are based on parallelizing the tasks required to compute a specific objective function given a vector of parameters, our parallel simplex algorithm uses parallelization at the parameter level. Our parallel simplex algorithm assigns to each processor a separate vector of parameters corresponding to a point on a simplex. The processors then conduct the simplex search steps for an improved point, communicate the results, and a new simplex is formed. The advantage of this method is that our algorithm is generic and can be applied, without re-writing computer code, to any optimization problem to which the non-parallel Nelder–Mead is applicable. The method is also easily scalable to any degree of parallelization up to the number of parameters. In a series of Monte Carlo experiments, we show that this parallel simplex method yields computational savings in some experiments up to three times the number of processors.

5.

Parallel multigrid algorithms based on generic approximate sparse inverses: an SMP approach

Christos K. Filelis-Papadopoulos George A. Gravvanis 《The Journal of supercomputing》2014,67(2):384-407

New parallel computational techniques are introduced for the parallelization of Generic Approximate Sparse Inverse multigrid methods, based on Portable Operating System Interface for UniX (POSIX) threads, for multicore systems. Parallelization of the Generic Approximate Sparse Inverse Matrix (GenAspI) algorithm is achieved based on a new computational approach, namely “strip,” which utilizes the data independence of the rows assigned in each available processor. Additionally, new parallel computational techniques are proposed for the parallelization of a modified multigrid V-Cycle method, based on POSIX Threads, for multicore systems. The modified V-Cycle utilized a Parallel PGenAspI Preconditioned Bi-Conjugate Gradient STABilized (BiCGSTAB) as a coarse solver to ensure better parallel performance of the multigrid method. For parallelization purposes, a replication of the multigrid method function is executed on each processor with different index bands and with proper synchronization points to ensure less thread-creation overhead and to maximize parallel performance. Theoretical estimates on speedups and efficiency are also presented. Finally, numerical results for the performance of the PGenAspI algorithm and the PGenAspI–MGV method for solving classical two-dimensional boundary value problems on multicore computer systems are presented. The implementation issues of the proposed method are also discussed using POSIX threads on multicore systems.

6.

A parallel algorithm for preemptive scheduling of uniform machines

Charles U. Martel 《Journal of Parallel and Distributed Computing》1988,5(6)

In this paper we describe a fast parallel algorithm for preemptive scheduling of n independent jobs on m uniform machines. Each job has a processing requirement, and each machine processes jobs at a different rate. The goal of the scheduling algorithm is to find a schedule which minimizes the time at which the last job is completed. T. Gonzalez and S. Sahni have developed a sequential algorithm which solves this problem in O(n + m log m) time. We develop a parallel version of this algorithm for a Concurrent Read Exclusive Write (CREW) shared memory computer. The algorithm runs in O(log n + log³m) time using n processors.
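The objective described above can be illustrated with the classical closed-form value of the optimal preemptive finish time on uniform machines: the largest of the total work divided by the total speed, and the work of the k largest jobs divided by the combined speed of the k fastest machines. The sketch below computes only that value; it does not build a schedule and is not the parallel algorithm of the paper. The function name is hypothetical, and inputs are assumed to be positive integers with at least one machine.

```python
from fractions import Fraction


def optimal_makespan(requirements, speeds):
    """Optimal finish time for preemptive scheduling on uniform machines.

    Uses the classical bound max(P_k / S_k for k < m, P_n / S_m), where P_k
    sums the k largest jobs and S_k the k fastest speeds. Illustration only:
    it returns the value, not a schedule. Inputs are positive integers.
    """
    jobs = sorted(requirements, reverse=True)
    rates = sorted(speeds, reverse=True)
    if not jobs:
        return Fraction(0)
    usable = min(len(jobs), len(rates))  # extra machines cannot help
    best = Fraction(0)
    work = 0
    rate = 0
    for k in range(usable - 1):
        # The k+1 largest jobs can at best occupy the k+1 fastest machines.
        work += jobs[k]
        rate += rates[k]
        best = max(best, Fraction(work, rate))
    # All the work spread over every usable machine.
    return max(best, Fraction(sum(jobs), sum(rates[:usable])))
```

With requirements [8, 4, 4] and speeds [2, 1] the total-work term dominates (16/3); with requirements [10, 1] and speeds [1, 1] the single longest job dominates (10).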

7.

Approximate string matching using within-word parallelism

Alden H. Wright 《Software》1994,24(4):337-362

Given a text string, a pattern string, and an integer k, the problem of approximate string matching with k differences is to find all substrings of the text string whose edit distance from the pattern string is less than k. The edit distance between two strings is defined as the minimum number of differences, where a difference can be a substitution, insertion, or deletion of a single character. An implementation of the dynamic programming algorithm for this problem is given that packs several characters and mod-4 integers into a computer word. Thus, it is a parallelization of the conventional implementation that runs on ordinary processors. Since a small alphabet means that characters have short binary codes, the degree of parallelism is greatest for small alphabets and for processors with long words. For an alphabet of size 8 or smaller and a 64 bit processor, a 21-fold parallelism over the conventional algorithm can be obtained. Empirical comparisons to the basic dynamic programming algorithm, to a version of Ukkonen's algorithm, to the algorithm of Galil and Park, and to a limited implementation of the Wu-Manber algorithm are given.
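The basic dynamic programming algorithm that the paper takes as its baseline can be sketched as follows. This is the conventional column-by-column formulation, one table cell per step, and not the word-packed implementation the abstract describes. The function name and the `max_diff` parameter are assumptions for illustration; the abstract's strict "less than k" corresponds to calling it with `max_diff = k - 1`.

```python
def match_ends(text, pattern, max_diff):
    """Return 1-based end positions in text where some substring ending
    there is within max_diff edits of pattern.

    Conventional dynamic programming, one cell at a time; this is NOT the
    word-packed variant from the paper.
    """
    m = len(pattern)
    # col[i] = fewest differences between pattern[:i] and any substring of
    # the text ending at the current position. Before any text is read,
    # matching pattern[:i] against nothing costs i deletions.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]  # value from the previous column, previous row
        col[0] = 0          # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[i] + 1,        # extra character in the text
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= max_diff:
            ends.append(j)
    return ends
```

For instance, searching for "abc" in "abd" with one allowed difference reports end positions 2 ("ab", one deletion) and 3 ("abd", one substitution).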

8.

Parallel hybrid particle/finite volume algorithm for transported PDF methods employing sub-time stepping

B. Rembold M. Grass 《Computers & Fluids》2008,37(3):181-193

A previously presented hybrid finite volume/particle method for the solution of the joint-velocity-frequency-composition probability density function (JPDF) transport equation in complex 3D geometries is extended for parallel computing. The parallelization strategy is based on domain decomposition. The finite volume method (FVM) and the particle method (PM) are parallelized separately and the algorithm is fully synchronous. For the FVM a standard method based on transferring data in ghost cells is used. Moreover, a subdomain interior decomposition algorithm to efficiently solve the implicit time integration for hyperbolic systems is described. The parallelization of the PM is more complicated due to the use of a sub-time stepping algorithm for the particle trajectory integration. Hereby, each particle obeys its local CFL criterion, and the covered distances per global time step can vary significantly. Therefore, an efficient algorithm which deals with this issue and has minimum communication effort was devised and implemented. Numerical tests to validate the parallel vs. the serial algorithm are presented, where also the effectiveness of the subdomain interior decomposition for the implicit time integration was investigated. A 3D dump-combustor configuration test case with about 2.5 × 10⁵ cells was used to demonstrate the good performance of the parallel algorithm. The hybrid algorithm scales well and the maximum speedup on 60 processors for this configuration was 50 (≈80% parallel efficiency).
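As a quick check of the scaling figures quoted above, assuming the standard definition of parallel efficiency as speedup divided by processor count (the symbols S, p and E are introduced here for illustration and do not come from the paper):

```latex
E = \frac{S}{p} = \frac{50}{60} \approx 0.83
```

The reported speedup therefore corresponds to roughly 83% efficiency, in line with the abstract's rounded figure of about 80%.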

9.

Comparing machine-independent versus machine-specific parallelization of a software platform for biological sequence comparison

P L Miller P M Nadkarni W R Pearson 《Computer applications in the biosciences》1992,8(2):167-175

A platform program that performs biological sequence comparison provides a case study to compare the relative advantages of a machine-independent approach to parallel computation versus a machine-specific approach. The program consists of two routines: (i) PSCANLIB, which compares a single biological sequence against a database of sequences, and (ii) PCOMPLIB, which compares a database of sequences against another database of sequences, or against itself. The program was first parallelized to run on the Intel Hypercube parallel computer using native Hypercube commands to coordinate the parallel computation. The parallelization logic of the program was then translated into a machine-independent parallel programming language, Linda. These two approaches to parallelization are contrasted in terms of: (i) the expressive power of the logic that coordinates the parallel computation, (ii) the portability of the machine-independent version to other parallel machines and (iii) the relative efficiency of the two versions of the program. In the benchmark tests reported, the benefits of the machine-independent approach were achieved with only a modest sacrifice in efficiency.

10.

Accelerating radiosity solutions through the use of hemisphere-base formfactor calculation

Akio Doi Takayuki Itoh 《Computer Animation and Virtual Worlds》1998,9(1):3-15

In this paper, we propose a novel formfactor calculation algorithm for accelerating radiosity solutions in complex environments. Our basic algorithm is an improved version of Spencer's (S.N. Spencer, ‘The hemisphere radiosity method: a tale of two algorithms’, in Photorealism in Computer Graphics, Spencer, 1992, pp. 127–135) and Van Wyk's (G.C. Van Wyk Jr., ‘A geometry-based insolation model for computer-aided design,’ Ph.D. Thesis, The University of Michigan, 1998.) methods, which fail to remove hidden surfaces for relatively large patches and cause large discretization errors in formfactors. We also demonstrate that our technique is superior to the hemi-cube method in terms of computation time. Moreover, we parallelize our approach on a shared-memory parallel computer and obtain high performance with our radiosity rendering system. Our method divides a hemisphere-base into regions and assigns a region to each processor. The approach can be applied to geometrical data generated by CAD systems, and is evaluated in terms of computation time, visual effects, and parallelization performance. © 1998 John Wiley & Sons, Ltd

11.

Parallel genetic algorithms: a survey and problem state of the art

D. S. Knysh V. M. Kureichik 《Journal of Computer and Systems Sciences International》2010,49(4):579-589

With the growth of computer capabilities and the appearance of multicore processors, parallel computing has made it possible to reduce the time needed to solve optimization problems. Of current interest are parallel computing methods for genetic algorithms that use the evolutionary model of development, in which the main component is a population of species (a set of alternative solutions to the problem). In this case, the efficiency of the algorithm increases because several populations develop in parallel. A survey of the basic parallelization strategies and the most interesting models of their implementation is presented. Theoretical ideas for improving existing parallelization mechanisms for genetic algorithms are described. A modified model of a parallel genetic algorithm is developed. Since genetic algorithms are used to solve optimization problems, the proposed model was studied on the problem of optimizing a multicriteria function. The algorithm's ability to escape local optima and the influence of algorithm parameters on the dynamics of the search for a deep extremum were studied. It is concluded that dynamic connections between processes are more efficient than static connections. New mechanisms for implementing dynamic connections and for analyzing their efficiency in distributed computing for genetic algorithms are necessary.

12.

Parallel computation and FASTA: confronting the problem of parallel database search for a fast sequence comparison algorithm (Citations: 4 total, 0 self-citations, 4 by others)

P L Miller P M Nadkarni N M Carriero 《Computer applications in the biosciences》1991,7(1):71-78

We have parallelized the FASTA algorithm for biological sequence comparison using Linda, a machine-independent parallel programming language. The resulting parallel program runs on a variety of different parallel machines. A straightforward parallelization strategy works well if the amount of computation to be done is relatively large. When the amount of computation is reduced, however, disk I/O becomes a bottleneck which may prevent additional speed-up as the number of processors is increased. The paper describes the parallelization of FASTA, and uses FASTA to illustrate the I/O bottleneck problem that may arise when performing parallel database search with a fast sequence comparison algorithm. The paper also describes several program design strategies that can help with this problem. The paper discusses how this bottleneck is an example of a general problem that may occur when parallelizing, or otherwise speeding up, a time-consuming computation.

13.

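A systolic-array-based HMMer acceleration system

Lu Zhijian (陆志坚) Wu Yanxia (吴艳霞) Guo Zhenhua (郭振华) Sun Yanteng (孙延腾) 《Computer Engineering and Applications (计算机工程与应用)》2013,49(8):76-80

HMMer is a bioinformatics software toolkit that uses PHMMs (profile hidden Markov models) to classify and match protein or amino-acid sequence queries. Because of HMMer's inherently parallel nature, however, running it on conventional serial CPU platforms is very time-consuming. An FPGA is used to accelerate P7Viterbi, the core algorithm of HMMer. P7Viterbi contains a loop-carried data dependence within a multi-level loop nest that limits parallelism; previous work either ignored this loop feedback or serialized that part of the program, which reduced accuracy or efficiency. An FPGA-based systolic-array parallel architecture is proposed that accommodates the data dependences of P7Viterbi and uses an automatic recomputation mechanism to resolve the back-edge problem that obstructs parallel computation. The acceleration system, implemented on the FPGA with parallel pipelining, effectively improves the computational efficiency of HMMer. Experimental results show that the proposed architecture with 20 processing elements achieves a speedup of 56.8 times over an Intel Core2 Duo 2.33 GHz CPU platform.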

14.

Parallelizing genetic linkage analysis: a case study for applying parallel computation in molecular biology (Citations: 1 total, 0 self-citations, 1 by others)

P L Miller P Nadkarni J E Gelernter N Carriero A J Pakstis K K Kidd 《Computers and biomedical research》1991,24(3):234-248

Parallel computers offer a solution to improve the lengthy computation time of many conventional, sequential programs used in molecular biology. On a parallel computer, different pieces of the computation are performed simultaneously on different processors. LINKMAP is a sequential program widely used by scientists to perform genetic linkage analysis. We have converted LINKMAP to run on a parallel computer, using the machine-independent parallel programming language, Linda. Using the parallelization of LINKMAP as a case study, the paper outlines an approach to converting existing highly iterative programs to a parallel form. The paper describes the steps involved in converting the sequential program to a parallel program. It presents performance benchmarks comparing the sequential version of LINKMAP with the parallel version running on different parallel machines. The paper also discusses alternative approaches to the problem of "load balancing," making sure the computational load is shared as evenly as possible among the available processors.

15.

The ParTriCluster Algorithm for Gene Expression Analysis

Renata Braga Araújo Guilherme Henrique Trielli Ferreira Gustavo Henrique Orair Wagner Meira Jr. Renato Antônio Celso Ferreira Dorgival Olavo Guedes Neto Mohammed Javeed Zaki 《International journal of parallel programming》2008,36(2):226-249

Analyzing gene expression patterns is becoming a highly relevant task in the Bioinformatics area. This analysis makes it possible to determine the behavior patterns of genes under various conditions, which is fundamental information for treating diseases, among other applications. A recent advance in this area is the Tricluster algorithm, which is the first algorithm capable of determining 3D clusters (genes × samples × timestamps), that is, groups of genes that behave similarly across samples and timestamps. However, even though biological experiments collect an increasing amount of data to be analyzed and correlated, the triclustering problem remains a bottleneck due to its NP-Completeness, so its parallelization seems to be an essential step towards obtaining feasible solutions. In this work we propose and evaluate the implementation of a parallel version of the Tricluster algorithm using the filter-labeled-stream paradigm supported by the Anthill parallel programming environment. The results show that our parallelization scales well with the data size, being able to handle severe load imbalances that are inherent to the problem. Furthermore, the parallelization strategy is applicable to any depth-first search.

16.

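GPU parallel computing and optimization of an alternating direction implicit CFD solver

Deng Liang (邓亮) Xu Chuanfu (徐传福) Liu Wei (刘巍) Zhang Lilun (张理论) 《Journal of Computer Applications (计算机应用)》2013,33(10):2783-2786

The alternating direction implicit (ADI) scheme is one of the common discretization schemes for partial differential equations, yet little work has been done so far on GPU parallelization of ADI schemes in practical computational fluid dynamics (CFD) applications. Starting from a finite-volume CFD application, and by analyzing the characteristics and computational flow of the ADI solver, two kinds of fine-grained GPU parallel algorithms, one based on grid points and one based on grid lines, are designed on the Compute Unified Device Architecture (CUDA) programming model, and several performance optimization methods are discussed. On the Tianhe-1A system, for a single-zone structured-grid test case with a grid size of 128×128×128, the GPU-parallel inviscid terms, viscous terms, and ADI iteration achieve speedups of 100.1, 40.1, and 10.3 times, respectively, relative to a single CPU core, and the overall GPU parallel speedup of the ADI CFD solver is 17.3.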

17.

Parallel Algorithms for the Circuit Value Update Problem

C. E. Leiserson K. H. Randall 《Theory of Computing Systems》1997,30(6):583-597

The circuit value update problem is the problem of updating values in a representation of a combinational circuit when some of the inputs are changed. We assume for simplicity that each combinational element has bounded fan-in and fan-out and can be evaluated in constant time. This problem is easily solved on an ordinary serial computer in O(W+D) time, where W is the number of elements in the altered subcircuit and D is the subcircuit's embedded depth (its depth measured in the original circuit). In this paper we show how to solve the circuit value update problem efficiently on a P-processor parallel computer. We give a straightforward synchronous, parallel algorithm that runs in expected time. Our main contribution, however, is an optimistic, asynchronous, parallel algorithm that runs in expected time, where W and D are the size and embedded depth, respectively, of the "volatile" subcircuit, the subcircuit of elements that have inputs which either change or glitch as a result of the update. To our knowledge, our analysis provides the first analytical bounds on the running time of an optimistic, asynchronous, parallel algorithm. Received November 1, 1995, and in final form November 25, 1996.

18.

Scenario-Based Hypersequential Programming

Naoshi Uchihira Hideji Kawata Fumitaka Tamura 《International journal of parallel programming》2000,28(2):155-157

Hypersequential programming is a new paradigm of concurrent programming. The original concurrent program is first serialized, then the sequential version is tested and debugged, and finally the target concurrent program is synthesized by parallelizing the debugged sequential version. In hypersequential programming, testing and debugging are performed on the sequential version of the program and the correctness is preserved in the subsequent parallelization process. Therefore, it offers both higher productivity and enhanced reliability. This paper describes a practical approach to hypersequential programming using the execution history called scenario. It also formalizes the parallelization process using a new equivalence relation called scenario graph equivalence, and gives the parallelization algorithm.

19.

Efficient minimum spanning tree algorithms on the reconfigurable mesh (Citations: 3 total, 0 self-citations, 3 by others)

Wan Yingyu (万颖瑜) Xu Yinlong (许胤龙) Gu Xiaodong (顾晓东) Chen Guoliang (陈国良) 《Journal of Computer Science and Technology (计算机科学技术学报)》2000,15(2):116-125

1 Introduction. Let G = (V, E) be a connected, undirected graph with a weight function W from the set E of edges to the set of reals. A spanning tree is a subgraph T = (V, E_T), E_T ⊆ E, of G such that T is a tree. The weight W(T) of a spanning tree T is the sum of the weights of its edges. A spanning tree with the smallest possible weight is called a minimum spanning tree (MST) of G. Computing an MST of a given weighted graph is an important problem that arises in many applications. For this …
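The definition of a minimum spanning tree given above can be made concrete with a short sequential sketch. It uses Kruskal's algorithm with a union-find structure, a standard textbook method chosen here purely for illustration; it is not the reconfigurable-mesh algorithm of the paper. The function name and the (weight, u, v) edge format are assumptions, and vertices are assumed to be numbered from 0.

```python
def minimum_spanning_tree(num_vertices, edges):
    """Return (total_weight, tree_edges) for a connected undirected graph.

    edges is an iterable of (weight, u, v) with vertices 0..num_vertices-1.
    Sequential Kruskal's algorithm, for illustration only; this is NOT the
    reconfigurable-mesh algorithm from the paper.
    """
    parent = list(range(num_vertices))

    def find(v):
        # Follow parent links to the component root, halving the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    total = 0
    for w, u, v in sorted(edges):      # consider the lightest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:           # edge joins two separate components
            parent[root_u] = root_v
            tree.append((u, v, w))
            total += w
    return total, tree
```

On a connected graph with n vertices the returned tree has n - 1 edges, and the total is the weight W(T) from the definition.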

20.

OpenMP on Networks of Workstations for Software DSMs (Citations: 3 total, 0 self-citations, 3 by others)

Zhang Feng (章锋) Chen Guoliang (陈国良) Zhang Zhaoqing (张兆庆) 《Journal of Computer Science and Technology (计算机科学技术学报)》2002,17(1):0-0

This paper describes the implementation of a sizable subset of OpenMP on networks of workstations (NOWs); the source-to-source OpenMP compiler (AutoPar) is used for the JIAJIA home-based shared virtual memory (SVM) system. The paper suggests some simple modifications and extensions to the OpenMP standard to account for the differences between SVM and SMP (symmetric multiprocessor), at which the OpenMP specification is aimed. The OpenMP translator is based on an automatic parallelization compiler, so it is possible to check the correctness of the semantics of OpenMP programs, which is not required in an OpenMP-compliant implementation. AutoPar is measured for five applications, including both programs from the NAS Parallel Benchmarks and real applications, on a cluster of eight Pentium II PCs connected by 100 Mbps switched Ethernet. The evaluation shows that parallelization by annotating OpenMP directives is simple and that the performance of the generated JIAJIA code is still acceptable on NOWs.