1.

A cell multipole based domain decomposition algorithm for molecular dynamics simulation of systems of arbitrary shape

Pasupulati Lakshminarasimhulu Jeffry D. Madura 《Computer Physics Communications》2002,144(2):141-153

A domain decomposition algorithm for molecular dynamics simulation of atomic and molecular systems with arbitrary shape and non-periodic boundary conditions is described. The molecular dynamics program uses the cell multipole method for efficient calculation of long-range electrostatic interactions and a multiple time step method to allow larger time steps. The system is enclosed in a cube and the cube is divided into a hierarchy of cells. The deepest-level cells are assigned to processors such that each processor has contiguous cells, and static load balancing is achieved by redistributing the cells so that each processor has approximately the same number of atoms. The resulting domains have irregular shapes and may have more than 26 neighbors. Atoms constituting bond angles and torsion angles may straddle more than two processors. An efficient strategy is devised for the initial assignment and subsequent reassignment of such multiple-atom potentials to processors. At each step, computation is overlapped with communication, greatly reducing the effect of communication overhead on parallel performance. The algorithm is tested on a spherical cluster of water molecules, and on a hexasaccharide and an enzyme, both solvated by a spherical cluster of water molecules. In each case a spherical boundary containing oxygen atoms with only repulsive interactions is used to prevent evaporation of water molecules. The algorithm shows excellent parallel efficiency even for a small number of cells/atoms per processor.
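The static load-balancing idea in this abstract (contiguous cells per processor, roughly equal atom counts) can be illustrated with a small sketch. This is not the authors' algorithm: it assumes the deepest-level cells have already been placed in a fixed linear order (for example along a space-filling curve), and the function name `partition_cells` and the greedy cut rule are hypothetical choices made only for illustration.

```python
from typing import List


def partition_cells(atom_counts: List[int], n_procs: int) -> List[List[int]]:
    """Split linearly ordered cells into contiguous blocks, one per processor.

    Assumption (not from the paper): cells are visited in a fixed order and a
    block is closed as soon as adding the next cell would move the running
    atom total further from the ideal cumulative share than leaving it out.
    """
    total = sum(atom_counts)
    blocks: List[List[int]] = [[] for _ in range(n_procs)]
    proc = 0
    running = 0  # atoms assigned so far to processors 0..proc
    for cell, count in enumerate(atom_counts):
        # Ideal cumulative atom count once processor `proc` is complete.
        target = total * (proc + 1) / n_procs
        overshoots = abs(running + count - target) > abs(running - target)
        if proc < n_procs - 1 and blocks[proc] and overshoots:
            proc += 1  # close the current block and start the next one
        blocks[proc].append(cell)
        running += count
    return blocks


if __name__ == "__main__":
    counts = [10, 0, 5, 5, 10, 10, 0, 0]
    blocks = partition_cells(counts, 2)
    print(blocks)
    print([sum(counts[c] for c in block) for block in blocks])
```

Keeping each block contiguous in the cell ordering is what keeps a processor's cells spatially close; a real implementation would also have to handle the three-dimensional neighbor relationships that the abstract mentions.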

2.

Performance of the 3D FFT on the 6D network torus QCDOC parallel supercomputer

Bin Fang Glenn Martyna 《Computer Physics Communications》2007,176(8):531-538

QCDOC is a massively parallel supercomputer with tens of thousands of nodes distributed on a six-dimensional torus network. The 6D structure of the network provides the needed communication resources for many communication-intensive applications. In this paper, we present a parallel algorithm for the three-dimensional Fast Fourier Transform and its implementation for a 4096-node QCDOC prototype. Two techniques have been used to increase its parallel performance: simultaneous multi-dimensional communication and communication-and-computation overlapping. Benchmarking experiments suggest that 3D FFTs of size 128×128×128 can scale well on such platforms up to 4096 nodes. Our performance results suggest stronger scalability on QCDOC than on the IBM BlueGene/L supercomputer.

3.

Static task scheduling on bus-interconnected cluster systems

章军 冯秀山 韩冀中 韩承德 《计算机研究与发展》1999,36(7):805-812

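Compared with massively parallel processing (MPP) systems, a bus-interconnected cluster is a relatively inexpensive parallel computing environment. This paper proposes a static task scheduling algorithm for bus-interconnected cluster systems. The algorithm has three main features: (1) because all communication between processors must pass through the shared bus, the scheduler treats the bus, together with the processors, as a resource to be allocated; (2) because a bus is well suited to broadcast, the scheduler takes broadcast into account, which for some applications can greatly reduce the number of communications; (3) when determining the start time of a task on a given processor …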

4.

Design and implementation of a global file system

蒋金虎 陈左宁 黄文政 《计算机工程》2005,31(1):71-72,F003

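The global file system (GFS) is a key technology for massively parallel processing computer systems and cluster systems. This paper systematically presents the design and implementation of a new global file system on the Sunway (神威) computer system, focusing on the series of high-performance and high-reliability measures the system adopts. It concludes with an outlook on next-generation global file systems.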

5.

Simulating spin systems on IANUS, an FPGA-based computer

F. Belletti M. Cotallo A. Cruz L.A. Fernández A. Gordillo A. Maiorano F. Mantovani E. Marinari A. Muñoz-Sudupe D. Navarro S. Pérez-Gaviro J.J. Ruiz-Lorenzo S.F. Schifano D. Sciretti A. Tarancón R. Tripiccione J.L. Velasco 《Computer Physics Communications》2008,178(3):208-216

We describe the hardwired implementation of algorithms for Monte Carlo simulations of a large class of spin models. We have implemented these algorithms as VHDL codes and we have mapped them onto a dedicated processor based on a large FPGA device. The measured performance on one such processor is comparable to O(100) carefully programmed high-end PCs: it turns out to be even better for some selected spin models. We describe here codes that we are currently executing on the IANUS massively parallel FPGA-based system.

6.

Electrostatic force computation for bio-molecules on supercomputers with torus networks

《Parallel Computing》2007,33(2):116-123

We present an application of the Ewald algorithm for electrostatic force computation on a supercomputer with a torus network, like those on QCDOC and BlueGene/L. Typical bio-molecular systems have thousands, possibly millions, of interacting atoms, with simulation times ranging from microseconds to milliseconds. The dominant time-consuming calculation for bio-molecules is that of the electrostatic interactions. The importance of an efficient all-gather method is discussed, in particular for QCDOC, since it does not have a network dedicated to global communication like the tree network on BlueGene/L. In addition, we demonstrate the ability of QCDOC to run non-QCD (Quantum Chromodynamics) applications, in particular electrostatic force computation on bio-molecules.

7.

dHybrid: A massively parallel code for hybrid simulations of space plasmas

L. Gargaté R. Bingham R.A. Fonseca L.O. Silva 《Computer Physics Communications》2007,176(6):419-425

A massively parallel simulation code, called dHybrid, has been developed to perform global scale studies of space plasma interactions. This code is based on an explicit hybrid model; the numerical stability and parallel scalability of the code are studied. A stabilization method for the explicit algorithm, for regions of near zero density, is proposed. Three-dimensional hybrid simulations of the interaction of the solar wind with unmagnetized artificial objects are presented, with a focus on the expansion of a plasma cloud into the solar wind, which creates a diamagnetic cavity and drives the Interplanetary Magnetic Field out of the expansion region. The dynamics of this system can provide insights into other similar scenarios, such as the interaction of the solar wind with unmagnetized planets.

8.

The Monarch parallel processor hardware design

Rettberg R.D. Crowther W.R. Carvey P.P. Tomlinson R.S. 《Computer》1990,23(4)

The Monarch architecture team took advantage of custom VLSI in the design of a shared-memory parallel processor. The simple structure eases the task of programming a massively parallel machine.

9.

A fuzzy neural network model and its hardware implementation (cited 3 times: 0 self-citations, 3 by others)

Kuo Y.-H. Kao C.-I. Chen J.-J. 《Fuzzy Systems, IEEE Transactions on》1993,1(3):171-183

A fuzzy classifier based on a four-layered feedforward neural network model is proposed. This connectionist fuzzy classifier, called CFC, realizes the weighted-Euclidean-distance fuzzy classification concept in a massively parallel manner to recognize input patterns. CFC employs a hybrid supervised/unsupervised learning scheme to organize referenced pattern vectors. This scheme not only overcomes the major drawbacks of multilayer perceptron models using the backpropagation algorithm, i.e., the local minima problem and long training time, but also avoids the disadvantage of the huge storage space requirement of the probabilistic neural network. According to experimental results, CFC shows better accuracy for speech recognition than several existing methods, even in a noisy environment. Moreover, it has higher stability of recognition rates for different environmental conditions. A massively parallel hardware architecture has been developed to implement CFC. A bus-oriented multiprocessor, systolic processor structure, and pipelining are used to obtain low-cost, high-performance fuzzy classification.
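The weighted-Euclidean-distance idea named in this abstract can be sketched as follows. This is a deliberately simplified, crisp nearest-reference classifier, not the CFC network itself: the fuzzy membership grades, the learning scheme, and the hardware mapping are all omitted, and the names `weighted_distance` and `classify` and the sample data are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def weighted_distance(x: Sequence[float], ref: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - ref_i)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, ref, weights)))


def classify(x: Sequence[float],
             references: Dict[str, List[Sequence[float]]],
             weights: Sequence[float]) -> Optional[str]:
    """Return the label of the reference vector closest to x.

    Each distance evaluation is independent of the others, which is the
    property a massively parallel implementation could exploit.
    """
    best_label, best_dist = None, math.inf
    for label, vectors in references.items():
        for ref in vectors:
            dist = weighted_distance(x, ref, weights)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label


if __name__ == "__main__":
    refs = {"a": [(0.0, 0.0)], "b": [(3.0, 1.0)]}
    print(classify((2.0, 0.0), refs, (1.0, 1.0)))   # equal feature weights
    print(classify((2.0, 0.0), refs, (0.1, 10.0)))  # second feature dominates
```

Changing the weights changes which features dominate the distance, so the same input can be assigned to a different class under a different weighting.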

10.

Application of a pipelined configuration technique in reconfigurable processors (cited 1 time: 1 self-citation, 0 by others)

于苏东 刘雷波 魏少军 《计算机工程》2010,36(8):227-229

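A pipelined configuration technique for reconfigurable processors is proposed that effectively reduces configuration time and speeds up application execution. The reconfigurable processor consists of a general-purpose processor and a coarse-grained reconfigurable array. The array handles the loops that account for most of an application's execution time; these loops are decomposed into rows that execute on the array in a pipelined manner. The technique was verified on an FPGA verification system. The applications tested include the integer discrete cosine transform and motion estimation from the H.264 benchmark. Compared with the conventional reconfigurable processors PipeRench and MorphoSys and with TI's TMS320DM642 DSP, performance improves by a factor of about 3.5.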

11.

The communication efficiency problem in parallel systems

林洪 陈国良 《小型微型计算机系统》1996,(1)

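Massively parallel processing (MPP) emphasizes the scalability of parallel architectures and parallel algorithms. On a scalable parallel architecture, a scalable parallel algorithm should make effective use of a growing number of processors; the effectiveness of an algorithm is usually measured by processor efficiency at run time. One widely overlooked factor is communication efficiency, which is an issue of general relevance. This paper defines communication efficiency, studies its relationship to processor efficiency, and, by analyzing the execution of a typical algorithm, examines the communication efficiency of several common parallel architectures. The results show that only processor efficiency and communication efficiency taken together can fully evaluate the scalability of an algorithm and guide the design of parallel architectures.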

12.

A VLSI implementation of an architecture for applicative programming

《Future Generation Computer Systems》1988,4(3):245-254

The Applicative Programming System Architecture contains a novel Data Structure Memory (DSM) which supports fast access operations on compact linear data structures. Several problems that arise in implementations of applicative and functional programming languages can be solved efficiently using special data representations on the DSM. Each memory word in the DSM contains a very small local processor, and there is also a tree-structured communications network within the DSM. Therefore the DSM is a massively parallel SIMD machine. This paper describes a VLSI implementation of the DSM architecture and compares its performance with implementations on a conventional sequential computer and the NASA Massively Parallel Processor.

13.

Compiling communication-efficient programs for massively parallel machines

Li J. Chen M. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(3):361-376

A method of generating parallel target code with explicit communication for massively parallel distributed-memory machines is presented. The source programs are shared-memory parallel programs with explicit control structures. The method extracts syntactic reference patterns from a program with shared address space, selects appropriate communication routines, places these routines in appropriate locations in the target program text and sets up correct conditions for invoking these routines. An explicit communication metric is used to guide the selection of data layout strategies.

14.

Time-parallel solution of linear partial differential equations on the Intel Touchstone Delta supercomputer

Nikzad Toomarian Amir Fijany Jacob Barhen 《Concurrency and Computation》1994,6(8):641-652

The paper presents the implementation of a new class of massively parallel algorithms for solving certain time-dependent partial differential equations (PDEs) on massively parallel supercomputers. Such PDEs are usually solved numerically, by discretization in time and space, and by applying a time-stepping procedure to data and algorithms potentially parallelized in the spatial domain. In a radical departure from such a strictly sequential temporal paradigm, we have developed a concept of time-parallel algorithms, which allows the marching in time to be fully parallelized. This is achieved by using a set of transformations based on eigenvalue-eigenvector decomposition of the matrices involved in the discrete formalism. Our time-parallel algorithms possess a highly decoupled structure, and can therefore be efficiently implemented on emerging, massively parallel, high-performance supercomputers, with a minimum of communication and synchronization overhead. We have successfully carried out a proof-of-concept demonstration of the basic ideas using a two-dimensional heat equation example implemented on the Intel Touchstone Delta supercomputer. Our results indicate that linear, and even superlinear, speed-up can be achieved and maintained for a very large number of processor nodes.
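The role of the eigenvalue-eigenvector decomposition can be seen in a much smaller setting than the paper's. The sketch below is an assumption-laden illustration, not the authors' algorithm: it uses a one-dimensional heat equation (the abstract's example is two-dimensional), an explicit Euler discretization with zero Dirichlet boundaries, and the known sine eigenvectors of the resulting tridiagonal update matrix. The function names are hypothetical. Once the initial data are expanded in eigenvectors, the solution at any step n is obtained directly, so different time steps could be evaluated independently of one another.

```python
import math
from typing import List


def step_sequential(u0: List[float], r: float, n_steps: int) -> List[float]:
    """Reference: march u_j <- u_j + r*(u_{j-1} - 2*u_j + u_{j+1}) step by step."""
    u = list(u0)
    for _ in range(n_steps):
        padded = [0.0] + u + [0.0]  # zero Dirichlet boundary values
        u = [padded[j] + r * (padded[j - 1] - 2.0 * padded[j] + padded[j + 1])
             for j in range(1, len(u) + 1)]
    return u


def modal_coefficients(u0: List[float]) -> List[float]:
    """Expand u0 in the eigenvectors v_k(j) = sin(j*k*pi/(N+1)), k = 1..N."""
    n = len(u0)
    return [2.0 / (n + 1) * sum(u0[j] * math.sin((j + 1) * k * math.pi / (n + 1))
                                for j in range(n))
            for k in range(1, n + 1)]


def solution_at_step(coeffs: List[float], r: float, step: int) -> List[float]:
    """Evaluate the solution at an arbitrary step without marching in time."""
    n = len(coeffs)
    result = []
    for j in range(1, n + 1):
        total = 0.0
        for k in range(1, n + 1):
            # Eigenvalue of the one-step update matrix for mode k.
            lam = 1.0 - 4.0 * r * math.sin(k * math.pi / (2 * (n + 1))) ** 2
            total += coeffs[k - 1] * lam ** step * math.sin(j * k * math.pi / (n + 1))
        result.append(total)
    return result


if __name__ == "__main__":
    u0 = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0]
    coeffs = modal_coefficients(u0)
    for n in (1, 5, 20):
        direct = solution_at_step(coeffs, 0.25, n)
        marched = step_sequential(u0, 0.25, n)
        print(n, max(abs(a - b) for a, b in zip(direct, marched)))
```

Each call to `solution_at_step` depends only on the modal coefficients and the step index, which is the decoupling that makes the time direction parallelizable in this simplified setting.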

15.

Embedded divide-and-conquer algorithm on hierarchical real-space grids: parallel molecular dynamics simulation based on linear-scaling density functional theory

Fuyuki Shimojo Rajiv K. Kalia Priya Vashishta 《Computer Physics Communications》2005,167(3):151-164

A linear-scaling algorithm has been developed to perform large-scale molecular-dynamics (MD) simulations, in which interatomic forces are computed quantum mechanically in the framework of density functional theory. A divide-and-conquer algorithm is used to compute the electronic structure, where the non-additive contribution to the kinetic energy is included with an embedded cluster scheme. Electronic wave functions are represented on a real-space grid, which is augmented with coarse multigrids to accelerate the convergence of iterative solutions and adaptive fine grids around atoms to accurately calculate ionic pseudopotentials. Spatial decomposition is employed to implement the hierarchical-grid algorithm on massively parallel computers. A converged solution to the electronic-structure problem is obtained for a 32,768-atom amorphous CdSe system on 512 IBM POWER4 processors. The total energy is well conserved during MD simulations of liquid Rb, showing the applicability of this algorithm to first principles MD simulations. The parallel efficiency is 0.985 on 128 Intel Xeon processors for a 65,536-atom CdSe system.

16.

Instrument-Specific Harmonic Atoms for Mid-Level Music Representation

Leveau P. Vincent E. Richard G. Daudet L. 《IEEE transactions on audio, speech, and language processing》2008,16(1):116-128

Several studies have pointed out the need for accurate mid-level representations of music signals for information retrieval and signal processing purposes. In this paper, we propose a new mid-level representation based on the decomposition of a signal into a small number of sound atoms or molecules bearing explicit musical instrument labels. Each atom is a sum of windowed harmonic sinusoidal partials whose relative amplitudes are specific to one instrument, and each molecule consists of several atoms from the same instrument spanning successive time windows. We design efficient algorithms to extract the most prominent atoms or molecules and investigate several applications of this representation, including polyphonic instrument recognition and music visualization.

17.

Parallel Execution of Nested Parallel Expressions

Simon C. Merrall 《Journal of Parallel and Distributed Computing》1996,33(2):122

This paper describes an implementation of Paralation Lisp for a massively parallel SIMD machine, the MasPar MP-1. The system is based on a byte code interpreter which can emulate as many virtual processors on each physical processor as desired (within the limits of memory). The implementation makes it possible to activate more virtual processors once execution has begun, and this feature can be used to support nested parallelism. Nested parallelism describes the ability to nest data parallel constructs, a feature of Paralation Lisp, Connection Machine Lisp, and NESL; however, the outer parallel forms usually have to be sequentialized, with only the innermost forms being executed in parallel. NESL and a subset of Paralation Lisp have been implemented to fully support nested parallelism by flattening nested structures at compile time. To do this the languages must impose various restrictions on both the data and control structures. There is an overhead associated with the runtime technique described here, but it is very versatile and can execute code in parallel that cannot be “flattened.” Hence this technique can be used to effectively support many of the more difficult aspects of Paralation Lisp.

18.

Parallel computation of convection-dominated diffusion problems (cited 1 time: 0 self-citations, 1 by others)

刘晓遇 赵凯 《数值计算与计算机应用》2000,21(3):178-186

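1. Introduction. Describing certain physical phenomena of fluid motion, and studying problems such as heat conduction and particle diffusion, leads to solving the convection-diffusion equation. When this equation is solved by finite differences, explicit methods have simple schemes but are only conditionally stable, so the time step must be very small; implicit methods are unconditionally stable but require solving a system of algebraic equations, which is comparatively difficult. In [2], D. J. Evans and A. R. Ahmad proposed an explicit alternating direction method for steady-state elliptic equations and carried out numerical experiments on the Laplace equation. This paper extends that method to time-dependent problems and makes it applicable to convection-dominated diffusion problems. Based on the second-order upwind scheme [1], this …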

19.

A portable fault-tolerant parallel software MPEG-1 encoder

Iskender Agi R. Jagannathan 《Multimedia Tools and Applications》1996,2(3):183-197

MPEG video compression is quite difficult to achieve in real time, and hardware solutions for this problem are expensive. We present a portable, fault-tolerant, parallel, MPEG-1 encoder implemented in software. We detail the implementation strategy for the encoder and give performance results on a network of workstations and a massively parallel processor. We also show that our encoder can efficiently adapt to fluctuating processing power typical in workstation networks.

20.

A Sweep Algorithm for Massively Parallel Simulation of Circuit-Switched Networks

《Journal of Parallel and Distributed Computing》1993,18(4):484-500

A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks that are controlled by a randomized-routing policy that includes trunk reservation. A single instruction multiple data implementation is described and corresponding experiments on a 16,384-processor MasPar parallel computer are reported. A multiple instruction multiple data implementation is also described and corresponding experiments on an Intel iPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.