Similar Documents
20 similar documents found.
1.
Region growing is a general technique for image segmentation, where image characteristics are used to group adjacent pixels together to form regions. This paper presents a parallel algorithm for solving the region growing problem based on the split-and-merge approach, and uses it to test and compare various parallel architectures and programming models. The implementations were done on the Connection Machine, models CM-2 and CM-5, in the data parallel and message passing programming models. Randomization was introduced to break ties during merging, increasing the degree of parallelism, and only one- and two-dimensional arrays of data were used in the implementations.
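To make the split-and-merge structure concrete, here is a minimal serial Python sketch (our own illustration, not the paper's data-parallel CM code; the threshold, helper names, and toy image are assumptions): blocks are split while their intensity range exceeds a threshold, then edge-adjacent leaves with similar means are merged with a union-find.

    import numpy as np

    def split(img, y, x, size, thresh, leaves):
        # Recursively quarter a square block until its intensity range <= thresh.
        block = img[y:y+size, x:x+size]
        if size == 1 or block.max() - block.min() <= thresh:
            leaves.append((y, x, size))
            return
        h = size // 2
        for dy in (0, h):
            for dx in (0, h):
                split(img, y + dy, x + dx, h, thresh, leaves)

    def overlap(a0, a1, b0, b1):
        # 1-D intervals share positive length.
        return min(a1, b1) - max(a0, b0) > 0

    def merge(img, leaves, thresh):
        # Union-find merge of edge-adjacent leaves with similar mean intensity.
        parent = list(range(len(leaves)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        means = [img[y:y+s, x:x+s].mean() for (y, x, s) in leaves]
        for i, (yi, xi, si) in enumerate(leaves):
            for j in range(i + 1, len(leaves)):
                yj, xj, sj = leaves[j]
                touch_x = xi + si == xj or xj + sj == xi
                touch_y = yi + si == yj or yj + sj == yi
                adjacent = ((touch_x and overlap(yi, yi + si, yj, yj + sj)) or
                            (touch_y and overlap(xi, xi + si, xj, xj + sj)))
                if adjacent and abs(means[i] - means[j]) <= thresh:
                    parent[find(i)] = find(j)      # merge the two regions
        return [find(i) for i in range(len(leaves))]

    img = np.random.randint(0, 2, (8, 8)) * 100    # toy two-level image
    leaves = []
    split(img, 0, 0, 8, thresh=20, leaves=leaves)
    labels = merge(img, leaves, thresh=20)

The paper's contribution lies in running the merge phase in parallel, where the randomized tie-breaking keeps neighbouring merges from serializing.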

2.
A review of five distinct artificial neural network implementations on the Connection Machine is presented along with a brief discussion of the more general issues surrounding the implementation of artificial neural network models in parallel. The implementation which proves to be fastest on the Connection Machine is parallel in the training patterns and runs at more than 1300 million interconnects per second.

3.
In this paper we present two algorithms for the parallel solution of first-order linear recurrences. We show that the algorithms can be used to efficiently solve both scalar and blocked versions of the problem on vector and SIMD architectures. The first algorithm is a parallel approach whose resulting code can be explicitly vectorized, making it suitable for efficient execution on vector architectures such as the Cray 2. The second algorithm is a modified recursive approach designed to reduce the communication overhead encountered in SIMD architectures such as the Connection Machine 2 (CM-2). We present the performance exhibited by the parallel algorithm implementations on the Cray 2 and CM-2 for both scalar and blocked versions of the recurrence problem.
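For concreteness: a first-order linear recurrence has the form x[i] = a[i]*x[i-1] + b[i], and each step is an affine map t -> a*t + b. Because composing affine maps is associative, all prefixes can be computed with a scan. A short vectorized sketch of the recursive-doubling idea (our illustration, not the paper's Cray or CM-2 code):

    import numpy as np

    def solve_recurrence(a, b, x0):
        # x[i] = a[i] * x[i-1] + b[i].  Each element holds the affine map
        # composed so far; recursive doubling needs O(log n) vectorized steps.
        a = np.asarray(a, dtype=float).copy()
        b = np.asarray(b, dtype=float).copy()
        d = 1
        while d < len(a):
            na, nb = a.copy(), b.copy()
            na[d:] = a[d:] * a[:-d]              # compose with the map d back
            nb[d:] = a[d:] * b[:-d] + b[d:]
            a, b, d = na, nb, 2 * d
        return a * x0 + b                        # each x[i] as a map of x0

    print(solve_recurrence([2, 3, 0.5], [1, 0, 2], x0=1.0))   # [3.  9.  6.5]

The blocked version of the problem has the same structure, with matrices in place of the scalars a[i].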

4.
This paper develops a data-parallel implementation of the L-shaped decomposition algorithm for stochastic linear programs with recourse. The algorithm decomposes the problem into independent scenario subproblems that are solved concurrently. These subproblems are structurally identical and can be solved on SIMD machines, such as the Connection Machine CM-2. The coordinating master program is a dense linear program and is solved efficiently by spreading its non-zero coefficients among multiple processors and using dense linear algebra subroutines. The parallel solution of the master program removes the serial bottleneck of the algorithm. The resulting implementation achieves good speed-ups and is scalable. Numerical results on the Connection Machine CM-2 and comparisons with a benchmark control-parallel implementation are included.

5.
We discuss the transformation of image data from one level of representation to another using the data parallel programming model of the Connection Machine system. Emphasis is placed on maintaining locality of reference in order to take advantage of fast, local communications. Image pyramids illustrate the transformation of image-based representations. We review pointer jumping as a transformation from the image to a sequence-based representation, the primary representation of data outside of the image plane. Using communication primitives, especially segmented scans, we review utilities for representing and manipulating such sequences. We then compare several algorithms for matching and evidence accumulation. The techniques emphasize the use of sorting and sparse representations of space, in order to limit the computational requirements of high level vision.
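A minimal sketch of the segmented-scan primitive mentioned above (serial Python for clarity; on the Connection Machine this runs as a parallel primitive): flags mark the first element of each segment, and the running sum restarts at every flag.

    def segmented_scan(values, flags):
        # Inclusive plus-scan that restarts wherever flags[i] is set.
        out, total = [], 0
        for v, f in zip(values, flags):
            total = v if f else total + v
            out.append(total)
        return out

    # Three segments: [1, 2, 3], [4, 5], [6]
    print(segmented_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1]))
    # [1, 3, 6, 4, 9, 6]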

6.
Finding the area and perimeter of the union/intersection of a set of iso-rectangles is a very important part of circuit extraction in VLSI design. We combine two techniques, the uniform grid and the vertex neighborhoods, to develop a new parallel algorithm for the area and perimeter problems which has average linear-time performance but is not worst-case optimal. The uniform grid technique is used to generate the candidate vertices of the union or intersection of the rectangles. An efficient point-in-rectangles inclusion test filters the candidate set to obtain the relevant vertices of the union or intersection. Finally, the vertex neighborhood technique is used to compute the mass properties from these vertices. The algorithm has an average time complexity of O(((n + k)/p) + log p), where n is the number of input rectangle edges, k the number of intersections, and p the number of processors, assuming a PRAM model of computation. An analysis of the algorithm on a SIMD architecture is also presented. The algorithm requires only simple data structures, which makes it easy to implement. We have implemented it on a Sun 4/280 workstation and a Connection Machine. The sequential implementation performs better than the optimal algorithm for large datasets, and the parallel implementation on a Connection Machine CM-2 with 32K processors also shows good results.
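The uniform-grid filtering step can be sketched briefly (a serial Python illustration under our own simplifications, not the CM-2 implementation): rectangles are bucketed into fixed-size cells, so a point-inclusion query only examines the candidate rectangles registered in its own cell.

    from collections import defaultdict

    def build_grid(rects, cell):
        # Bucket each rectangle (x0, y0, x1, y1) into every cell it overlaps.
        grid = defaultdict(list)
        for i, (x0, y0, x1, y1) in enumerate(rects):
            for gx in range(int(x0 // cell), int(x1 // cell) + 1):
                for gy in range(int(y0 // cell), int(y1 // cell) + 1):
                    grid[(gx, gy)].append(i)
        return grid

    def covered(p, rects, grid, cell):
        # Point-in-rectangles test that inspects only one cell's candidates.
        x, y = p
        return any(rects[i][0] <= x <= rects[i][2] and
                   rects[i][1] <= y <= rects[i][3]
                   for i in grid[(int(x // cell), int(y // cell))])

    rects = [(0, 0, 4, 3), (2, 1, 6, 5)]
    grid = build_grid(rects, cell=2.0)
    print(covered((3, 2), rects, grid, 2.0))   # True: inside both rectangles
    print(covered((7, 7), rects, grid, 2.0))   # False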

7.
A new parallel sorting algorithm, called parsort, suitable for implementation on tightly coupled multiprocessors is presented. The algorithm is based upon quicksort and two-way merging. An asynchronous parallel partitioning algorithm is used to distribute work evenly during merging to ensure a good load balance amongst processors, which is crucial for achieving high efficiency. The implementation of this parallel sorting algorithm exhibits theoretical and measured near-linear speed-up when compared to sequential quicksort. This is illustrated by the results of experiments carried out on the Sequent Balance 8000 multiprocessor.
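The overall shape of such a sort can be sketched in a few lines (our own Python approximation: sort chunks in parallel, then two-way merge; parsort's asynchronous parallel partitioning, which balances the merge work, is omitted here).

    from concurrent.futures import ProcessPoolExecutor
    from heapq import merge as merge2

    def parsort(data, p=4):
        chunk = (len(data) + p - 1) // p
        runs = [data[i:i+chunk] for i in range(0, len(data), chunk)]
        with ProcessPoolExecutor(max_workers=p) as ex:
            runs = list(ex.map(sorted, runs))      # parallel sort phase
        while len(runs) > 1:                       # two-way merge phase
            pairs = [runs[i:i+2] for i in range(0, len(runs), 2)]
            runs = [list(merge2(*pr)) if len(pr) == 2 else pr[0] for pr in pairs]
        return runs[0] if runs else []

    if __name__ == "__main__":
        import random
        xs = [random.random() for _ in range(10000)]
        assert parsort(xs) == sorted(xs)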

8.
Load balancing requirements in parallel image analysis are considered, and results on the performance of parallel implementations of two image feature extraction tasks on the Connection Machine and the iPSC/2 hypercube are reported and discussed. A load redistribution algorithm, which makes use of parallel prefix operations and one-to-one permutations among the processors, is described and used in both implementations. The expected improvement in performance resulting from load balancing is determined analytically and compared to actual performance results obtained from the above implementations. The analytical results demonstrate how the expected improvement depends on the computational and communication requirements of each task, characteristic machine parameters, a characterization of the prior load distribution in terms of parameters which can be computed dynamically at the start of task execution, and the overhead incurred by load redistribution.
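The parallel-prefix trick behind such redistribution is compact (a serial sketch with hypothetical names, not the Connection Machine or iPSC/2 code): an exclusive prefix sum over per-processor counts assigns every item a global rank, and dividing ranks by an even per-processor quota gives each item its destination processor.

    def redistribute(counts, p):
        # Exclusive prefix sum: global rank of each processor's first item.
        offsets = [0]
        for c in counts:
            offsets.append(offsets[-1] + c)
        quota = -(-offsets[-1] // p)     # ceil(total / p) items per processor
        return [[(offsets[proc] + i) // quota for i in range(c)]
                for proc, c in enumerate(counts)]

    # 4 processors with skewed loads 7, 1, 0, 4 (12 items, quota 3):
    print(redistribute([7, 1, 0, 4], 4))
    # [[0, 0, 0, 1, 1, 1, 2], [2], [], [2, 3, 3, 3]]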

9.
In this paper, we first describe a model for mapping the backpropagation artificial neural net learning algorithm onto a massively parallel computer architecture with a 2D-grid communications network. We then show how this model can be sped up by hypercube inter-processor connections that provide logarithmic-time segmented parallel prefix operations. This approach can serve as a general model for implementing algorithms for layered neural nets on any massively parallel computer that has a 2D-grid or hypercube communication network.

We have implemented this model on the Connection Machine CM-2, a general-purpose, massively parallel computer with a hypercube topology. Initial tests show that this implementation achieves about 180 million interconnections per second (IPS) for feed-forward computation and 40 million weight updates per second (WUPS) for learning. We use our model to evaluate this implementation, identifying which machine-specific features have improved performance and where further improvements can be made.


10.
One data-independent and one data-dependent algorithm for the computation of image histograms on parallel computers are presented, analysed and implemented on the Connection Machine system CM-2. The data-dependent algorithm has a lower communication bandwidth requirement because it transfers only bins with a non-zero count. Both algorithms perform all-to-all reduction, implemented as a sequence of exchanges defined by a butterfly network. The two algorithms are compared based on predicted and actual performance on the Connection Machine CM-2. With few pixels per processor, the data-dependent algorithm requires on the order of √B data transfers for B bins, compared with B data transfers for the data-independent algorithm. As the number of pixels per processor grows, this advantage of the data-dependent algorithm decreases; it increases with the number of bins in the histogram.
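The butterfly all-to-all reduction both algorithms share is easy to simulate (our sketch, ignoring the non-zero-bin compression of the data-dependent variant): with p = 2^k processors, stage s pairs processor i with processor i XOR 2^s, and after log2(p) exchange-and-add stages every processor holds the complete histogram.

    import numpy as np

    def butterfly_reduce(local):
        # local: one histogram per processor; len(local) must be a power of two.
        hist = [h.copy() for h in local]
        s = 1
        while s < len(hist):
            hist = [hist[i] + hist[i ^ s] for i in range(len(hist))]
            s *= 2
        return hist    # every "processor" now holds the full reduction

    parts = [np.random.randint(0, 5, size=8) for _ in range(4)]
    out = butterfly_reduce(parts)
    assert all((h == sum(parts)).all() for h in out)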

11.
Parallel Computing, 1997, 23(8): 1069-1088
We develop a scalable parallel implementation of the classical Benders decomposition algorithm for two-stage stochastic linear programs. Using a primal-dual, path-following algorithm to solve the scenario subproblems, we develop a parallel implementation that alleviates the difficulties of load balancing. Furthermore, the dual and primal step calculations can be implemented using a data-parallel programming paradigm. With this approach the code effectively utilizes both the multiple, independent processors and the vector units of the target architecture, the Connection Machine CM-5. The master program, usually the limiting factor, is solved very efficiently using the interior-point code LOQO on the front-end workstation. The implementation scales almost perfectly with problem and machine size. Extensive computational testing is reported with several large problems with up to 2 million constraints and 13.8 million variables.

12.
In this paper, we describe a massively parallel implementation of the Splitting Equilibration Algorithm using CM FORTRAN on the Thinking Machines CM-2 system. Numerical results using up to 32,768 (32K) processors on the Connection Machine CM-2 are presented for both input/output and social accounting matrix estimation problems, and compared with results obtained for the same problems on the IBM 3090. Our experiences with the relative ease and difficulty of the implementations on these fine-grain and coarse-grain parallel architectures are also presented and discussed.

13.
Efficient sorting is a key requirement for many computer science algorithms. Accelerating existing techniques and developing new sorting approaches is crucial for many real-time graphics scenarios, database systems, and numerical simulations, to name just a few. Sorting is one of the most fundamental operations for organizing and filtering the ever-growing massive amounts of data gathered on a daily basis. While optimal sorting models for serial execution on a single processor exist, efficient parallel sorting remains a challenge. In this paper, we present a hardware-optimized parallel implementation of the radix sort algorithm that results in a significant speedup over existing sorting implementations. We outperform all known graphics processing unit (GPU) based sorting systems by about a factor of two and eliminate restrictions on the sorting key space. This makes our algorithm not only the fastest, but also the first general GPU sorting solution.
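For reference, the digit-pass structure of least-significant-digit radix sort (a serial Python sketch; GPU implementations parallelize each pass's histogram and prefix-sum steps, and this is not the paper's kernel code):

    def radix_sort(keys, key_bits=32, radix_bits=8):
        # One stable counting pass per radix_bits-wide digit, low digit first.
        mask = (1 << radix_bits) - 1
        for shift in range(0, key_bits, radix_bits):
            buckets = [[] for _ in range(mask + 1)]
            for k in keys:
                buckets[(k >> shift) & mask].append(k)
            keys = [k for b in buckets for k in b]
        return keys

    import random
    xs = [random.randrange(1 << 32) for _ in range(1000)]
    assert radix_sort(xs) == sorted(xs)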

14.
External sorting, the process of sorting a file that is too large to fit into the computer's internal memory and must be stored externally on disks, is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls "randomized striping") as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM's lookahead forecasting technique for parallel prefetching and its forecast-and-flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is based on synchronous parallel I/O primitives provided by the TPIE programming environment [TPI]; whenever our program issues an I/O read (write) operation, one block of data is synchronously read from (written to) each disk in parallel. We compare the performance of SRM over a wide range of input sizes with that of disk-striped mergesort (DSM), which is widely used in practice. DSM consists of a standard mergesort in conjunction with striped I/O for parallel disk access. SRM merges together significantly more runs at a time compared with DSM, and thus it requires fewer merge passes. We demonstrate in practical scenarios that even though the streaming speeds for merging with DSM are a little higher than those for SRM (since DSM merges fewer runs at a time), sorting using SRM is often significantly faster than with DSM (since SRM requires fewer passes). The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort and multiway partitioning of a file into several other files. Since both parallel disk merging and multimedia processing deal with streams that get consumed at nonuniform and partially predictable rates, our techniques for lookahead based upon forecasting data may have relevance in video server applications.
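The forecasting rule itself is small; the sketch below is our reading of the classical technique, not the TPIE implementation: among runs that still have blocks on disk, the buffered block with the smallest last key is the one the merge will drain first, so that run's disk should be prefetched from next.

    def next_to_prefetch(buffers, on_disk):
        # buffers[i]: sorted keys of run i currently in memory
        # on_disk[i]: whether run i still has unread blocks on disk
        candidates = [(buf[-1], i) for i, buf in enumerate(buffers)
                      if buf and on_disk[i]]
        return min(candidates)[1] if candidates else None

    buffers = [[3, 9], [1, 4], [5, 7]]
    on_disk = [True, True, False]
    print(next_to_prefetch(buffers, on_disk))   # 1: its block (ending in 4) drains first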

15.
Graphics processing units (GPUs) offer parallel computing power that would otherwise require a cluster of networked computers or a supercomputer. While writing kernel code is fairly straightforward, achieving efficiency and performance requires very careful optimisation decisions and changes to the original serial algorithm. We introduce a parallel canonical ensemble Monte Carlo (MC) simulation that runs entirely on the GPU. In this paper, we describe two MC simulation codes of Lennard-Jones particles in the canonical ensemble: a single-CPU-core implementation and a parallel GPU implementation. Using the Compute Unified Device Architecture (CUDA), the parallel implementation enables the simulation of systems containing over 200,000 particles in a reasonable amount of time, which allows researchers to obtain more accurate simulation results. A remapping algorithm is introduced to balance the load across the device resources, and experimental results demonstrate that the efficiency of this algorithm is bounded by the available GPU resources. Our parallel implementation achieves an improvement of up to 15 times on a commodity GPU over our efficient single-core implementation for a system consisting of 256k particles, with the speedup increasing with the problem size. Furthermore, we describe our methods and strategies for optimising our implementation in detail.
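The core of a canonical-ensemble Lennard-Jones MC step fits in a short serial sketch (our NumPy illustration with made-up parameters; no periodic boundaries or cutoff, and a GPU code would parallelize the pair sums rather than recompute the total energy per move):

    import numpy as np

    def lj_energy(pos, eps=1.0, sigma=1.0):
        # Total energy: sum over pairs of 4*eps*((sigma/r)^12 - (sigma/r)^6).
        d = pos[:, None, :] - pos[None, :, :]
        r2 = (d ** 2).sum(-1)
        iu = np.triu_indices(len(pos), k=1)        # count each pair once
        inv6 = (sigma ** 2 / r2[iu]) ** 3
        return float(np.sum(4.0 * eps * (inv6 ** 2 - inv6)))

    def metropolis_step(pos, rng, beta=1.0, delta=0.1):
        # Displace one particle; accept with probability min(1, exp(-beta*dE)).
        # Recomputing the full O(N^2) energy is for clarity only; real codes
        # evaluate only the moved particle's interactions.
        trial = pos.copy()
        i = rng.integers(len(pos))
        trial[i] += rng.uniform(-delta, delta, size=pos.shape[1])
        dE = lj_energy(trial) - lj_energy(pos)
        return trial if dE <= 0 or rng.random() < np.exp(-beta * dE) else pos

    rng = np.random.default_rng(0)
    pos = rng.uniform(0.0, 5.0, size=(16, 3))
    for _ in range(100):
        pos = metropolis_step(pos, rng)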

16.
Narayanan P.J., Chen L.T., Davis L.S. Computer, 1992, 25(2): 68-73
The authors consider two examples from image understanding, focus-of-attention vision and contour image analysis, and present new parallel-processing methods that effectively support these types of computations. The research is a blend of theory and practice. On the one hand, the aim is to develop algorithms whose properties are well understood and can be formally related to key aspects of machine models. On the other hand, it is desired that the algorithms be easy to implement and practical in terms of their actual processing times on existing parallel machines. The experimental research was conducted on a 16,384-processor Connection Machine CM-2, and results of algorithm implementations on that machine are presented.

17.
In this contribution we report on a study of a very versatile neural network algorithm known as “Self-Organizing Feature Maps”, based on the earlier work of Kohonen [1,2]. In its original version, the algorithm addresses a fundamental issue of brain organization, namely how topographically ordered maps of sensory information can be formed by learning.

This algorithm is investigated for a large number of neurons (up to 16K) and for an input space of dimension d up to 900. To meet the computational demands, the algorithm was implemented on two parallel machines: a self-built Transputer systolic ring and a Connection Machine CM-2.

We present below:

(i) a simulation based on the feature map algorithm modelling part of the synaptic organization in the “hand region” of the somatosensory cortex;
(ii) a study of the influence of the dimension of the input space on the learning process;
(iii) a simulation of the extended algorithm, which explicitly includes lateral interactions; and
(iv) a comparison of the Transputer-based “coarse-grained” implementation of the model with the “fine-grained” implementation of the same system on the Connection Machine.
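A minimal sketch of one feature-map update (our serial NumPy illustration with made-up decay constants, not the Transputer or CM-2 code): find the best-matching unit, then pull every unit toward the input with a Gaussian neighbourhood that shrinks over time.

    import numpy as np

    def som_step(weights, x, t, sigma0=2.0, eta0=0.5, tau=1000.0):
        # weights: (rows, cols, dim) map; x: one input vector; t: step index.
        rows, cols, _ = weights.shape
        dist = ((weights - x) ** 2).sum(-1)
        w = np.unravel_index(dist.argmin(), dist.shape)   # best-matching unit
        gy, gx = np.mgrid[0:rows, 0:cols]
        grid_d2 = (gy - w[0]) ** 2 + (gx - w[1]) ** 2
        sigma = sigma0 * np.exp(-t / tau)              # shrinking neighbourhood
        eta = eta0 * np.exp(-t / tau)                  # decaying learning rate
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))[..., None]
        return weights + eta * h * (x - weights)

    rng = np.random.default_rng(0)
    weights = rng.uniform(size=(16, 16, 3))   # 16x16 map of 3-D feature vectors
    for t in range(2000):
        weights = som_step(weights, rng.uniform(size=3), t)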

18.
Parallel Computing, 1988, 9(1): 1-24
The Connection Machine is a massively parallel architecture with 65,536 single-bit processors and 32 Mbytes of memory, organized as a high-dimensional hypercube. A sophisticated router system provides efficient communication between remote processors. A rich software environment, including a parallel extension of COMMON LISP, provides access to the processors and network. Virtual processor capability extends the degree of fine-grained parallelism beyond 1,000,000. We describe the hardware and the parallel programming environment. We then present implementations of SOR, Multigrid and Conjugate Gradient algorithms for solving partial differential equations on the Connection Machine. Measurements of computational efficiency are provided, as well as an analysis of opportunities for achieving better performance. Despite the lack of floating-point hardware, computation rates above 100 Mflops have been achieved in PDE solution. Virtual processors prove to be a real advantage, easing the effort of software development while improving system performance significantly.
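For a flavour of the SOR kernel on such a machine, here is a tiny serial Python sketch (illustrative parameters, not the Connection Machine code): red-black ordering updates the two interleaved checkerboard colours alternately, so every point of one colour can be updated simultaneously, which is what makes the sweep data-parallel.

    import numpy as np

    def sor_redblack(u, f, h, omega=1.5, iters=100):
        # Red-black SOR for the 5-point Poisson stencil -laplace(u) = f.
        for _ in range(iters):
            for color in (0, 1):
                for i in range(1, u.shape[0] - 1):
                    for j in range(1, u.shape[1] - 1):
                        if (i + j) % 2 != color:
                            continue
                        gs = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1]
                                     + u[i, j+1] + h * h * f[i, j])
                        u[i, j] += omega * (gs - u[i, j])
        return u

    n = 33
    u = np.zeros((n, n))          # boundary held at zero
    f = np.ones((n, n))
    u = sor_redblack(u, f, h=1.0 / (n - 1))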

19.
We develop and experiment with a new parallel algorithm to approximate the maximum weight cut in a weighted undirected graph. Our implementation starts from the recent (serial) algorithm of Goemans and Williamson for this problem. We consider several different versions of this algorithm, varying the interior-point part of the algorithm in order to optimize the parallel efficiency of our method. Our work aims for an efficient, practical formulation of the algorithm with close-to-optimal parallelization. We analyze our parallel algorithm in the LogP model and predict linear speedup for a wide range of the parameters. We have implemented the algorithm using the Message Passing Interface (MPI) and run it on several parallel machines. In particular, we present performance measurements on the IBM SP2, the Connection Machine CM-5, and a cluster of workstations. We observe that the measured speedups are predicted well by our analysis in the LogP model. Finally, we test our implementation on several large graphs (up to 13,000 vertices), particularly on large instances of the Ising model.
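The rounding step at the heart of the Goemans-Williamson approach is easy to sketch (the SDP solve is omitted and a random unit embedding stands in for its solution, so this illustrates the rounding only):

    import numpy as np

    def hyperplane_round(V, W, trials=100, seed=0):
        # V: (n, d) unit vectors from the SDP relaxation; W: symmetric weights.
        # Cut by the sign of each vector's projection onto a random normal;
        # keep the best cut over several trials.
        rng = np.random.default_rng(seed)
        best_cut, best_side = -1.0, None
        for _ in range(trials):
            side = V @ rng.normal(size=V.shape[1]) >= 0
            cut = W[np.ix_(side, ~side)].sum()    # weight crossing the cut
            if cut > best_cut:
                best_cut, best_side = cut, side
        return best_cut, best_side

    # Toy 4-cycle with unit weights; the maximum cut has weight 4.
    W = np.zeros((4, 4))
    for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
        W[i, j] = W[j, i] = 1.0
    V = np.random.default_rng(1).normal(size=(4, 3))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # stand-in for SDP vectors
    print(hyperplane_round(V, W)[0])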

20.
Split-and-merge segmentation is a popular region-based segmentation scheme, valued for its robustness and computational efficiency. It is, however, hard to run in real time on larger images or video frames because of its iterative, sequential data-flow pattern. A quad-tree data structure is quite popular for software implementations of the algorithm, but local parallelism is difficult to establish there because of the inherent data dependency between processes. In this paper, we propose a parallel split-and-merge algorithm that depends only on local operations. The algorithm is mapped onto a hierarchical cell network, which is a parallel version of the Locally Excitatory Globally Inhibitory Oscillatory Network (LEGION). Simulation results show that the proposed design is faster than standard split-and-merge implementations without compromising segmentation quality. The timing performance is demonstrated in a finite-state-machine-based VLSI implementation on VIRTEX-series FPGA platforms. We also show that, although the split-and-merge algorithm trails state-of-the-art algorithms slightly in segmentation quality, it outperforms those sophisticated and complex algorithms in computational speed. Good segmentation performance at minimal computational cost enables the proposed design to tackle real-time segmentation of live video streams. In this paper, we demonstrate live PAL video segmentation using a VIRTEX 5 FPGA. Moreover, we extend our design to HD resolution, for which the processing time is under 5 ms, a throughput of 200 frames per second.
