共查询到20条相似文献,搜索用时 15 毫秒
1.
Grammatikakis M.D. Liesche S. 《IEEE transactions on pattern analysis and machine intelligence》2000,26(5):401-422
The authors examine the design, implementation, and experimental analysis of parallel priority queues for device and network simulation. They consider: 1) distributed splay trees using MPI; 2) concurrent heaps using shared memory atomic locks; and 3) a new, more general concurrent data structure based on distributed sorted lists, designed to provide dynamically balanced work allocation and efficient use of shared memory resources. We evaluate performance for all three data structures on a Cray-TSESOO system at KFA-Julich. Our comparisons are based on simulations of single buffers and a 64×64 packet switch which supports multicasting. In all implementations, PEs monitor traffic at their preassigned input/output ports, while priority queue elements are distributed across the Cray-TBE virtual shared memory. Our experiments with up to 60000 packets and two to 64 PEs indicate that concurrent priority queues perform much better than distributed ones. Both concurrent implementations have comparable performance, while our new data structure uses less memory and has been further optimized. We also consider parallel simulation for symmetric networks by sorting integer conflict functions and implementing a packet indexing scheme. The optimized message passing network simulator can process ~500 K packet moves in one second, with an efficiency that exceeds ~50 percent for a few thousand packets on the Cray-T3E with 32 PEs. All developed data structures form a parallel library. Although our concurrent implementations use the Cray-TSE ShMem library, portability can be derived from Open-MP or MP1-2 standard libraries, which will provide support for one-way communication and shared memory lock mechanisms 相似文献
2.
粒子模拟是研究离散粒子和连续介质运动规律的常用方法.而大规模的粒子模拟通常借助高性能计算系统.近年来,得益于其众核架构,图形处理器(GPU)已成为高性能计算的重要设备,并被广泛用于大规模粒子模拟过程的加速.本文讨论了一种对GPU加速的分布式粒子模拟进行在线可视化的方法.在该方法中,GPU除了被用于加速粒子模拟过程外,也被用于数据到图像的快速转换.同时,并行绘制技术被用于分布式数据的可视化.通过本文所述的方法,用户可在并行计算运行过程中,通过显示于拼接显示墙的高分辨率图像,实时地观察到粒子模拟中发生的现象,并对计算过程进行跟踪和调整. 相似文献
3.
On parallel integer sorting 总被引:1,自引:0,他引:1
We present an optimal algorithm for sortingn integers in the range [1,n
c
] (for any constantc) for the EREW PRAM model where the word length isn
, for any >0. Using this algorithm, the best known upper bound for integer sorting on the (O(logn) word length) EREW PRAM model is improved. In addition, a novel parallel range reduction algorithm which results in a near optimal randomized integer sorting algorthm is presented. For the case when the keys are uniformly distributed integers in an arbitrary range, we give an algorithm whose expected running time is optimal.Supported by NSF-DCR-85-03251 and ONR contract N00014-87-K-0310 相似文献
4.
《Simulation Modelling Practice and Theory》2008,16(3):329-337
The paper discusses automotive crash simulation in a stochastic context, whereby the uncertainties in numerical simulation results generated by parallel computing. Since crash is a non-repeatable phenomenon, qualification for crashworthiness based on a single test is not meaningful, and should be replaced by stochastic simulation. But the stochastic simulations may generate different results on parallel machines, if the same application is executed more than once. For a benchmark car model, differences between the position of a node in two simulation runs of PAMCRASH or LS-DYNA of up to 10 cm were observed, just as a result of round-off differences in the case of parallel computing. In this paper, some data mining algorithms are described to measure the scatter of parallel simulation results of car-crash and then provide hints to overcome this scatter to get more stable car model. 相似文献
5.
A new approach to accelerating parallel sorting processes is introduced in this paper. This approach involves the design of a new type of memory chip with sorting functions. This type of sorting memory chip is feasible with today's VLSI techniques. A memory module organizing several sorting memory chips associated with additional ECL or TTL control logic circuits is also presented. Using the sorting memory modules in a shared memory parallel processor machine, parallel sorting algorithms such as the column sort method can reduce the row access time significantly and avoid data collisions in the interconnection network. Experimental simulation results on the practical speedup achieved and the memory utilization for the proposed approach are described. 相似文献
6.
V. Singh V. Kumar G. Agha C. Tomlinson 《International journal of parallel programming》1991,20(2):95-131
We present two new parallel algorithms QSP1 and QSP2 based on sequential quicksort for sorting data on a mesh multicomputer, and analyze their scalability using the isoefficiency metric. We show that QSP2 matches the lower bound on the isoefficiency function for mesh multicomputers, while QSP1 is fairly close to optimal. Langet al.
(1) and Schnorret al.
(2) have developed parallel sorting algorithms for the mesh architecture that have either optimal (Schnorr) or close to optimal (Lang) run-time complexity for the one-element-perprocessor case. Both QSP1 and QSP2 have better scalability than the scaled-down variants of these algorithms (for the case in which there are more elements than processors). We also analyze a different variant of Lang's sort which is as scalable as QSP2. We briefly discuss another metric called resource consumption. According to this metric, both QSP1 and QSP2 are superior to variants of Lang's sort. 相似文献
7.
With the popularity of parallel database machines based on the shared-nothing architecture, it has become important to find external sorting algorithms which lead to a load-balanced computation, i.e., balanced execution, communication and output. If during the course of the sorting algorithm each processor is equally loaded, parallelism is fully exploited. Similarly, balanced communication will not congest the network traffic. Since sorting can be used to support a number of other relational operations (joins, duplicate elimination, building indexes etc.) data skew produced by sorting can further lead to execution skew at later stages of these operations. In this paper we present a load-balanced parallel sorting algorithm for shared-nothing architectures. It is a multiple-input multiple-output algorithm with four stages, based on a generalization of Batcher's odd-even merge. At each stage then keys are evenly distributed among thep processors (i.e., there is no final sequential merge phase) and the distribution of keys between stages ensures against network congestion. There is no assumption made on the key distribution and the algorithm performs equally well in the presence of duplicate keys. Hence our approach always guarantees its performance, as long asn is greater thanp
3, which is the case of interest for sorting large relations. In addition, processors can be added incrementally.
Recommended by: Patrick Valduriez 相似文献
8.
We generalize the well-known odd-even merge sorting algorithm, originally due to Batcher (1968), and show how this generalized algorithm can be applied to sorting on product networks. If G is an arbitrary factor graph with N nodes, its r-dimensional product contains Nr nodes. Our algorithm sorts Nr keys stored in the r-dimensional product of G in O(rrF(N)) time, where F(N) depends on G. We show that, for any factor graph G, F(N) is, at most, O(N), establishing an upper bound of O(r2 N) for the time complexity of sorting Nr keys on any product network. For product networks with bounded r(e.g. for grids), this leads to the asymptotic complexity of O(N) to sort Nr keys, which is optimal for several instances of product networks. There are factor graphs for which F(N)=O(log2 N), which leads to the asymptotic running time of O(log2 N) to sort Nr keys. For networks with bounded N (e.g. in the hypercube N=2, fixed), the asymptotic complexity becomes O(r2). We show how to apply the algorithm to several cases of well-known product networks, as well as others introduced recently. We compare the performance of our algorithm to well-known algorithms developed specifically for these networks, as well as others. The result of these comparisons led us to conjecture that the proposed algorithm is probably the best deterministic algorithm that can be found in terms of the low asymptotic complexity with a small constant 相似文献
9.
粒子模拟是目前化工、材料、生物等领域重要的研究手段之一.随着计算机软硬件的发展和大规模并行集群的出现,可模拟的粒子规模越来越大,模拟对象也越来越复杂.前处理足粒子模拟初始数据的生成环节,它负责将模拟对象转化为粒子系统,并按照模拟算例需求,将粒子数据输出为文件.前处理是连接模拟对象和模拟计算的纽带.是粒子模拟过程中关键的一步.本文提出的设计方案是首先使用BRLCAD建立模拟对象的三维模型,然后将二维模型转换为空间枚举,接着在空间枚举的规则块中填充粒子,同时通过使用元胞法检测粒子之间的冲突来保证粒子的合法性,最后根据粒子的类型和位置计算粒子的物性并将粒子数据输出到文件.本文根据该设计方案结合MPI并行计算技术,实现了大规模粒子模拟并行前处理系统,并进行了一系列的测试证明了该系统的实用性和可靠性. 相似文献
10.
S. G. Akl 《Computing》1984,32(1):1-11
Nonlinear equations are considered, where some input parameters are subjected to errors. By a class of monotone enclosing methods sequences of intervals are constructed, containing for each value of the perturbation parameter at least one zero of the problem. In finite dimensional spaces concrete realizations are given, e. g. of Newton-, Regula falsi- and Jacobi-Newton-type. 相似文献
11.
12.
A sorting classification of parallel rendering 总被引:26,自引:0,他引:26
We describe a classification scheme that we believe provides a more structured framework for reasoning about parallel rendering. The scheme is based on where the sort from object coordinates to screen coordinates occurs, which we believe is fundamental whenever both geometry processing and rasterization are performed in parallel. This classification scheme supports the analysis of computational and communication costs, and encompasses the bulk of current and proposed highly parallel renderers - both hardware and software. We begin by reviewing the standard feed-forward rendering pipeline, showing how different ways of parallelizing it lead to three classes of rendering algorithms. Next, we consider each of these classes in detail, analyzing their aggregate processing and communication costs, possible variations, and constraints they may impose on rendering applications. Finally, we use these analyses to compare the classes and identify when each is likely to be preferable 相似文献
13.
针对大数据量排序算法优化问题,提出一种基于Java的按位拆分的排序新算法。该排序算法按照位拆分数据,并结合Java的多线程对拆分的数据进行并行处理。数据实验结果表明,对于大数据量排序,该算法性能明显优于快速排序算法,而且算法具有很好的并行效率。 相似文献
14.
KLT算法已在多个领域得到成功的应用,其中特征点的排序是用来选择好的特征点跟踪的关键。针对传统排序算法计算耗时、实时性差的缺点,提出一种可并行的多层次归并排序算法并在FPGA中实现了其并行计算,同时分析了其周期精确的计算时间。结果表明该归并排序算法可以[O(N)]的时间复杂度完成特征点的排序,能够满足高清分辨率的图像/视频数据中KLT特征点排序的实时性要求。 相似文献
15.
Amitabha Das Louise E. Moser P. M. Melliar-Smith 《International journal of parallel programming》1991,20(5):403-419
The computational complexity of a parallel algorithm depends critically on the model of computation. We describe a simple and elegant rule-based model of computation in which processors apply rules asynchronously to pairs of objects from a global object space. Application of a rule to a pair of objects results in the creation of a new object if the objects satisfy the guard of the rule. The model can be efficiently implemented as a novel MIMD array processor architecture, the Intersecting Broadcast Machine. For this model of computation, we describe an efficient parallel sorting algorithm based on mergesort. The computational complexity of the sorting algorithm isO(nlog2
n), comparable to that for specialized sorting networks and an improvement on theO(n
1.5) complexity of conventional mesh-connected array processors. 相似文献
16.
《Journal of Microcomputer Applications》1994,17(3):273-286
This paper describes a special-purpose embedded multiprocessor architecture developed for performing real-time multi-line optical character recognition (MLOCR). MLOCR is a computationally intensive real-time application involving pattern recognition, character image extraction, gray-scale thresholding, rotation and scaling of individual characters, and character identification. The computational complexity of the MLOCR application dictated the development of custom hardware in a parallel processing environment in order to meet the real-time system requirements. The overall system organization is described, along with the functional partitioning of algorithms onto processors, development of specific custom hardware to implement the algorithms in real time, interprocess communications, and system control. 相似文献
17.
中国科学院过程工程研究所多相反应实验室,建立了一个通用粒子模拟平台并已开始应用。目前类似的并行模拟系统采用的Shift并行通信模式往往有一些问题,需要一种新的通信模式来弥补它的不足。本文设计具有良好通用性的非结构化通信模式All2All,用来完成通用粒子方法模拟平台中计算节点问的通信。本文的算例证明这种通信模式可解决在粒子并行模拟Shift通信模式所不能处理的,具有复杂拓扑关系的相邻节点间的数据通信问题。本文设计的All2All通信模式方法只需稍加修改,就可以方便地应用于其它领域的并行计算系统。 相似文献
18.
Auriga, an experimental simulator that utilizes five compilation techniques to reduce runtime complexity and promote concurrency in the simulation of VHDL models is described. Auriga is designed to translate a model using any VHDL construct into an optimized, parallel simulation. Auriga's distributed simulation uses a message-passing network to simulate a single VHDL model. The authors present results obtained with seven benchmark models to illustrate the compiler's aggressive optimization techniques: temporal analysis, waveform propagation, input desensitization, concurrent evaluation, and statement compaction 相似文献
19.
In this paper we study the sorting performance of a 128-processor CRAY T3D and discuss the efficient use of the toroidal network connecting the processors. The problems we consider range from that of sorting one word per processor to sorting the entire memory of the machine, and we give efficient algorithms for each case. In addition, we give both algorithms that make assumptions about the distribution of the data and those that make no assumptions. The clear winner, if data can be assumed to be uniformly distributed, is a method that we call a hash-and-chain sort. The time for this algorithm to sort one million words per processor over 64 processors is less than two seconds, which compares favorably to about four seconds using a 4-processor CRAY C90 and about 17 seconds using a 64-processor Thinking Machines CM-5. 相似文献
20.
COUPL+ is a programming environment for applications using unstructured and hybrid grids for numerical simulations. It automates parallelization by handling the partitioning of data and dependent data and maintaining halo interfaces and copy coherency. We explore some algorithms behind this package. A multi-level partitioning method is described which is effective in the presence of skewed data, solving the multi-set median-finding problem. Partitioning elements over a set of pre-partitioned nodes is explored and a novel method is suggested for reducing communication in the resulting distribution. 相似文献