期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Massively parallel algorithms from physics and biology

《国际计算机数学杂志》2012,89(2):201-250

In this paper the recent developments of lattice kinetic models are discussed followed by the fundamental concepts of Evolutionary Algorithms and their application to optimization, machine learning, Neural networks and many other areas. Finally, the application of Evolutionary Algorithms to the biological eco-system is presented. 相似文献

2.

Massively parallel GPU computing for fast stereo correspondence algorithms

《Journal of Systems Architecture》2016

Current accurate stereo matching algorithms employ some key techniques that are not suitable for parallel GPU architecture. It will be tricky and cumbersome to directly take these techniques into GPU applications. Trying to tackle this difficulty, we design two GPU-based stereo matching algorithms, one using a local fixed aggregation window whose size is configurable, and the other using an adaptive aggregation window which only includes necessary pixels. We use the winner-takes-all (WTA) principle for optimization and a plain voting refinement for post-processing; both do not need complex data structures. We aim to implement on GPU platforms fast stereo matching algorithms that produce results with same-level quality as other WTA local dense methods that use window-based cost aggregation. In our GPU-based implementation of the fixed window partially demosaiced CFA stereo matching application, accelerations up to 20 times are obtained for large size images. In our GPU-based implementation of the adaptive window color stereo matching application, experiment results show that it can handle four pairs of standard images from Middlebury database within roughly 100 ms. 相似文献

3.

Massively parallel augmenting path algorithms for the assignment problem

S. Storøy T. Sørevik 《Computing》1997,59(1):1-16

In this paper we describe how to apply fine grain parallelism to augmenting path algorithms for the dense linear assignment problem. We prove by doing that the technique we suggest, can be efficiently implemented on commercial available, massively parallel computers. Using n processors, our method reduces the computational complexity from the sequentialO(n ³) to the parallel complexity ofO(n ²). Exhaustive experiments are performed on a Maspar MP-2 in order to determine which of the algorithmic flavors that fits best onto this kind of architecture. 相似文献

4.

Validating trace-driven microarchitectural simulations

Khalid H. 《Micro, IEEE》2000,20(6):76-82

Benchmark complexity and sluggish performance analysis create an ever-increasing gap between workload size and the speed of analysis. Experimental results with a new data-sampling methodology, however, show promise in closing this gap 相似文献

5.

Massively parallel adaptive mesh refinement and coarsening for dynamic fracture simulations

Andrei Alhadeff Sofie E. Leon Waldemar Celes Glaucio H. Paulino 《Engineering with Computers》2016,32(3):533-552

We use the graphical processing unit (GPU) to perform dynamic fracture simulation using adaptively refined and coarsened finite elements and the inter-element cohesive zone model. Due to the limited memory available on the GPU, we created a specialized data structure for efficient representation of the evolving mesh given. To achieve maximum efficiency, we perform finite element calculation on a nodal basis (i.e., by launching one thread per node and collecting contributions from neighboring elements) rather than by launching threads per element, which requires expensive graph coloring schemes to avoid concurrency issues. These developments made possible the parallel adaptive mesh refinement and coarsening schemes to systematically change the topology of the mesh. We investigate aspects of the parallel implementation through microbranching examples, which has been explored experimentally and numerically in the literature. First, we use a reduced-scale version of the experimental specimen to demonstrate the impact of variation in floating point operations on the final fracture pattern. Interestingly, the parallel approach adds some randomness into the finite element simulation on the structured mesh in a similar way as would be expected from a random mesh. Next, we take advantage of the speedup of the implementation over a similar serial implementation to simulate a specimen whose size matches that of the actual experiment. At this scale, we are able to make more direct comparisons to the original experiment and find excellent agreement with those results. 相似文献

6.

Massively parallel processing

N. Metropolis 《Journal of scientific computing》1986,1(2):115-116

相似文献

7.

Massively parallel methods for semiconductor device modelling

R. K. Coomer I. G. Graham 《Computing》1996,56(1):1-27

In this paper we describe, analyse and implement a parallel iterative method for the solution of the steady-state drift diffusion equations governing the behaviour of a semiconductor device in two space dimensions. The unknowns in our model are the electrostatic potential and the electron and hole quasi-Fermi potentials. Our discretisation consists of a finite element method with mass lumping for the electrostatic potential equation and a hybrid finite element with local current conservation properties for the continuity equations. A version of Gummel's decoupling algorithm which only requires the solution of positive definite symmetric linear systems is used to solve the resulting nonlinear equations. We show that this method has an overall rate of convergence which only degrades logarithmically as the mesh is refined. Indeed the (inner) nonlinear solves of the electrostatic potential equation converge quadratically, with a mesh independent asymptotic constant. We also describe an implementation on a MasPar MP-1 data parallel machine, where the required linear systems are solved by the preconditioned conjugate gradient method. Domain decomposition methods are used to parallelise the required matrix-vector multiplications and to build preconditioners for these very poorly-conditioned systems. Our preconditioned linear solves also have a rate of convergence which degrades logarithmically as the grid is refined relative to subdomain size, and their performance is resilient to the severe layers which arise in the coefficients of the underlying elliptic operators. Parallel experiments are given. 相似文献

8.

Massively parallel quantum computer simulator

K. De Raedt H. De Raedt B. Trieu 《Computer Physics Communications》2007,176(2):121-136

We describe portable software to simulate universal quantum computers on massive parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as a IBM BlueGene/L, a IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, a SGI Altix 3700 and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as benchmark for testing high-end parallel computers. 相似文献

9.

Massively parallel architectures for large scale neural networksimulations

Fujimoto Y. Fukuda N. Akabane T. 《Neural Networks, IEEE Transactions on》1992,3(6):876-888

A toroidal lattice architecture (TLA) and a planar lattice architecture (PLA) are proposed as massively parallel neurocomputer architectures for large-scale simulations. The performance of these architectures is almost proportional to the number of node processors, and they adopt the most efficient two-dimensional processor connections for WSI implementation. They also give a solution to the connectivity problem, the performance degradation caused by the data transmission bottleneck, and the load balancing problem for efficient parallel processing in large-scale neural network simulations. The general neuron model is defined. Implementation of the TLA with transputers is described. A Hopfield neural network and a multilayer perceptron have been implemented and applied to the traveling salesman problem and to identity mapping, respectively. Proof that the performance increases almost in proportion to the number of node processors is given. 相似文献

10.

Massively parallel software rendering for visualizing large-scaledata sets

Kwan-Liu Ma Parker S. 《Computer Graphics and Applications, IEEE》2001,21(4):72-83

We describe two highly scalable, parallel software volume-rendering algorithms - one renders unstructured grid volume data and the other renders isosurfaces. We designed one algorithm for distributed-memory parallel architectures to render unstructured grid volume data. We designed the other for shared-memory parallel architectures to directly render isosurfaces. Through the discussion of these two algorithms, we address the most relevant issues when using massively parallel computers to render large-scale volumetric data. The focus of our discussion is direct rendering of volumetric data 相似文献

11.

Optimizing graph algorithms for improved cache performance

Park J.-S. Penner M. Prasanna V.K. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(9):769-782

We develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. We present a cache-oblivious implementation of the Floyd-Warshall algorithm for the fundamental graph problem of all-pairs shortest paths by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of /spl Omega/(N/sup 3///spl radic/C), where N and C are the problem size and cache size, respectively. Experimental results show that this cache-oblivious implementation shows more than six times the improvement in real execution time over that of the iterative implementation with the usual row major data layout, on three state-of-the-art architectures. Second, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for minimum spanning tree problem. For these algorithms, we demonstrate up to two times the improvement in real execution time by using a simple cache-friendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show performance improvements of two to three times in real execution time by using the technique of making the algorithm initially work on subproblems to generate a suboptimal solution and, then, solving the whole problem using the suboptimal solution as a starting point. Experimental results are shown for the Pentium III, UltraSPARC III, Alpha 21264, and MIPS R12000 machines. 相似文献

12.

基于最小效用的流媒体缓存替换算法 总被引：7，自引：0，他引：7

田小波陈蜀宇《计算机应用》2007,27(3):733-736

提出最小缓存替换算法SCU-K，综合考虑流媒体文件最近K次访问情况，使缓存大小动态适应媒体流行度、字节有用性和已缓存部分大小的变化，降低了文件前缀部分被替换的概率，避免LRU和LFU算法中出现的媒体文件被连续替换的问题。在与LRU，LFU和LRU-2算法的对比实验中，SCU-K算法在提高缓存空间利用率、字节命中率和降低启动延迟方面具有更好的性能。相似文献

13.

Massively parallel computations on many-variable polynomials

Bernard Beauzamy Jean-Louis Frot Christian Millour 《Annals of Mathematics and Artificial Intelligence》1996,16(1):251-283

We show that a multivariate homogeneous polynomial can be represented on a hypercube in such a way that sums, products and partial derivatives can be performed by massively parallel computers. This representation is derived from the theoretical results of Beauzamy-Bombieri-Enflo-Montgomery [1]. The norm associated with it, denoted by [·], is itself a very efficient tool: when products of polynomials are performed, the best constant in inequalities of the form [PQ]C[P][Q] are provided, and the extremal pairs (that is, the pairs of polynomials for which the product is as small as possible) can be identified.Supported by the C.N.R.S (France) and the N.S.F. (USA), by contracts E.T.C.A./C.R.E.A. No. 20351/90 and 20357/91 (Ministry of Defense, France), and by Research Contract EERP-FR 22, DIGITAL Eq. Corp. 相似文献

14.

Massively parallel computation of conservation laws

Marc Garbey David Levine 《Parallel Computing》1990,16(2-3):293-304

We present a new method for computing solutions of conservation laws based on the use of cellular automata with the method of characteristics. The method exploits the high degree of parallelism available with cellular automata and retains important features of the method of characteristics. It yields high numerical accuracy and extends naturally to adaptive meshes and domain decomposition methods for perturbed conservation laws. We describe the method and its implementation for a Dirichlet problem with a single conservation law for the one-dimensional case.

Numerical results for the one-dimensional law with the classical Burgers nonlinearity or the Buckley-Leverett equation show good numerical accuracy outside the neighborhood of the shocks. The error in the area of the shocks is of the order of the mesh size. The algorithm is well suited for execution on both massively parallel computers and vector machines. We present timing results for an Alliant FX/8, Connection Machine Model 2, and CRAY X-MP. 相似文献

15.

Scalable cache invalidation algorithms for mobile data access 总被引：2，自引：0，他引：2

Elmagarmid A. Jin Jing Helal A. Choonhwa Lee 《Knowledge and Data Engineering, IEEE Transactions on》2003,15(6):1498-1511

In this paper, we address the problem of cache invalidation in mobile and wireless client/server environments. We present cache invalidation techniques that can scale not only to a large number of mobile clients, but also to a large number of data items that can be cached in the mobile clients. We propose two scalable algorithms: the Multidimensional Bit-Sequence (MD-BS) algorithm and the Multilevel Bit-Sequence (ML-BS) algorithm. Both algorithms are based on our prior work on the Basic Bit-Sequences (BS) algorithm. Our study shows that the proposed algorithms are effective for a large number of cached data items with low update rates. The study also illustrates that the algorithms ran be used with other complementary techniques to address the problem of cache invalidation for data items with varied update and access rates. 相似文献

16.

Disk cache architectures for transaction-like applications in parallel computers

Eckardt H. 《Computing》1994,53(1):13-31

Computing - I/O in computer systems is prone to become a bottleneck. This is a particular severe problem in highly parallel machines where some applications are fully I/O bound if only one or few... 相似文献

17.

Environments for prototyping parallel algorithms

《Journal of Parallel and Distributed Computing》1988,5(4):421-437

The current ferment in parallel computer architecture has exacerbated the already large problems in developing and testing parallel algorithms. Although only experimentation will identify effective designs, the variety of computing paradigms has precluded development of the (necessary) general environment for parallel programming. We propose an environment design system where design specification is divorced from implementation. Because the design system, Polylith, provides the communication structure of a parallel computation, the computation's individual tasks can be written in one of several relatively portable sequential languages. 相似文献

18.

Line-drawing algorithms for parallel machines

Pang A.T. 《Computer Graphics and Applications, IEEE》1990,10(5):54-59

The fact that conventional line-drawing algorithms, when applied directly on parallel machines, can lead to very inefficient codes is addressed. It is suggested that instead of modifying an existing algorithm for a parallel machine, a more efficient implementation can be produced by going back to the invariants in the definition. Popular line-drawing algorithms are compared with two alternatives; distance to a line (a point is on the line if sufficiently close to it) and intersection with a line (a point on the line if an intersection point). For massively parallel single-instruction-multiple-data (SIMD) machines (with thousands of processors and up), the alternatives provide viable line-drawing algorithms. Because of the pixel-per-processor mapping, their performance is independent of the line length orientation 相似文献

19.

Massively parallelized replica-exchange simulations of polymers on GPUs

Jonathan Gross Wolfhard Janke Michael Bachmann 《Computer Physics Communications》2011,(8):1638-1644

We discuss the advantages of parallelization by multithreading on graphics processing units (GPUs) for parallel tempering Monte Carlo computer simulations of an exemplified bead-spring model for homopolymers. Since the sampling of a large ensemble of conformations is a prerequisite for the precise estimation of statistical quantities such as typical indicators for conformational transitions like the peak structure of the specific heat, the advantage of a strong increase in performance of Monte Carlo simulations cannot be overestimated. Employing multithreading and utilizing the massive power of the large number of cores on GPUs, being available in modern but standard graphics cards, we find a rapid increase in efficiency when porting parts of the code from the central processing unit (CPU) to the GPU. 相似文献

20.

Massively parallel processing in coastal ocean circulation model

José Gómez-Valdés Dong-Ping Wang 《Journal of scientific computing》1995,10(3):305-323

Massively parallel computers are becoming quite common in their use in computational fluid dynamics. In this study, a parallel algorithm of a 3-D primitive-equation coastal ocean circulation model is designed on the hypercube MIMD computer architecture. The grid is partitioned using one-dimensional domain decomposition. The code is tested in a uniform rectangular grid problem for which the model domain in each node is a cube. For the problem where the grain size (n _y) is fixed, the speedup is linear and is close to ideal forP 8 processors. The overhead (F _C) increases as the number of processors increases. The background overhead is inversely proportional to the size of the grain. The slopeF _C vs.P is a measure of the fraction of non-parallel code. For the problem where the domain is fixed, the speedup is 7.8 using 8-processors and 29.6 using 32-processors. The overhead increases linearly withP. The slopeF _C is a measure of the communication cost. The load balancing problem is examined for a model of the Gulf of California whose computational domain is irregular. 相似文献