Similar Articles
20 similar articles found (search time: 31 ms).
1.
We examine the design, implementation, and experimental analysis of parallel priority queues for device and network simulation. We consider: 1) distributed splay trees using MPI; 2) concurrent heaps using shared-memory atomic locks; and 3) a new, more general concurrent data structure based on distributed sorted lists, designed to provide dynamically balanced work allocation and efficient use of shared-memory resources. We evaluate the performance of all three data structures on a Cray T3E system at KFA-Jülich. Our comparisons are based on simulations of single buffers and of a 64×64 packet switch that supports multicasting. In all implementations, PEs monitor traffic at their preassigned input/output ports, while priority-queue elements are distributed across the Cray T3E virtual shared memory. Our experiments with up to 60,000 packets and two to 64 PEs indicate that concurrent priority queues perform much better than distributed ones. Both concurrent implementations have comparable performance, while our new data structure uses less memory and has been further optimized. We also consider parallel simulation for symmetric networks by sorting integer conflict functions and implementing a packet-indexing scheme. The optimized message-passing network simulator can process about 500 K packet moves per second, with an efficiency exceeding roughly 50 percent for a few thousand packets on the Cray T3E with 32 PEs. All of the developed data structures form a parallel library. Although our concurrent implementations use the Cray T3E SHMEM library, portability can be obtained through the OpenMP or MPI-2 standard libraries, which provide support for one-sided communication and shared-memory lock mechanisms.
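To make the concurrent-heap idea above concrete, here is a minimal C sketch of a shared binary min-heap of event timestamps guarded by a single OpenMP lock. It is an illustration only, with placeholder sizes, and is not the authors' Cray SHMEM implementation (which distributes the queue elements across PEs); a single lock serializes all operations, which is why more scalable designs distribute the queue across PEs.

```c
/* Minimal shared min-heap sketch (illustrative only, not the paper's
 * Cray SHMEM data structure).  Each entry is an event timestamp;
 * a single OpenMP lock serializes heap updates. */
#include <omp.h>
#include <stdio.h>

#define CAP 1024              /* placeholder capacity */

static double heap[CAP];
static int    size = 0;
static omp_lock_t heap_lock;

static int pq_insert(double t) {
    omp_set_lock(&heap_lock);
    if (size == CAP) { omp_unset_lock(&heap_lock); return 0; }  /* full: dropped in this sketch */
    int i = size++;
    heap[i] = t;
    while (i > 0 && heap[(i - 1) / 2] > heap[i]) {               /* sift up */
        double tmp = heap[i]; heap[i] = heap[(i - 1) / 2]; heap[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
    omp_unset_lock(&heap_lock);
    return 1;
}

static int pq_extract_min(double *t) {
    omp_set_lock(&heap_lock);
    if (size == 0) { omp_unset_lock(&heap_lock); return 0; }
    *t = heap[0];
    heap[0] = heap[--size];
    int i = 0;                                                   /* sift down */
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < size && heap[l] < heap[m]) m = l;
        if (r < size && heap[r] < heap[m]) m = r;
        if (m == i) break;
        double tmp = heap[i]; heap[i] = heap[m]; heap[m] = tmp;
        i = m;
    }
    omp_unset_lock(&heap_lock);
    return 1;
}

int main(void) {
    omp_init_lock(&heap_lock);
    #pragma omp parallel for
    for (int k = 0; k < 100; k++)
        pq_insert((double)(100 - k));       /* threads enqueue event timestamps */
    double t;
    while (pq_extract_min(&t))              /* events come back out in time order */
        ;
    omp_destroy_lock(&heap_lock);
    return 0;
}
```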

2.
Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed environments. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 95% of the optimal speedup.  相似文献   
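As a toy illustration of the trade-off described above, the C sketch below evaluates a crude cost model in which a ghost zone of width g amortizes one halo exchange plus synchronization over g time steps at the price of redundant border computation, and picks the g with the lowest estimated cost per step. Every constant is a made-up placeholder; the paper's actual model is built from NVIDIA Tesla hardware characteristics and per-application profiles.

```c
/* Toy cost model for choosing a ghost-zone width g (illustrative only). */
#include <stdio.h>

int main(void) {
    /* All numbers below are made-up placeholders. */
    const double sync_cost  = 50.0;   /* cost of one halo exchange + global sync */
    const double cell_cost  = 0.01;   /* cost of updating one cell               */
    const double tile_width = 256.0;  /* interior cells per tile (1D view)       */

    int best_g = 1;
    double best_t = 1e30;
    for (int g = 1; g <= 32; g++) {
        /* average cost per simulated time step:
         *   sync amortized over g steps
         * + interior work
         * + redundant border work, roughly (g-1) extra cells on each of two sides */
        double t = sync_cost / g
                 + cell_cost * tile_width
                 + cell_cost * 2.0 * (g - 1);
        if (t < best_t) { best_t = t; best_g = g; }
    }
    printf("chosen ghost-zone width: %d (est. cost/step %.3f)\n", best_g, best_t);
    return 0;
}
```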

3.
This paper addresses the problem of one-dimensional (1D) FFT parallel computation on constant-valence multicomputers, i.e. on parallel systems made up of processing elements (PEs) which do not share memory and are connected to a bounded number of neighbours. After a qualitative analysis of several possible partitionings of the DIT FFT algorithm, a decomposition is introduced that has good scalability properties and makes it possible to reuse sections of sequential code based on the most common 1D-FFT algorithms. If a computing architecture with an indirect binary n-cube interconnection network is used, the proposed decomposition guarantees strictly local communications and therefore requires no through-routing support. These characteristics have a positive impact on software development and also on overall performance. Furthermore, thanks to a pipelined organization of the PEs, the resulting architecture has high potential for real-time signal processing. As these useful features are obtained at the ‘expense’ of an uneven workload distribution, computing efficiency is relatively low but does not change significantly over a wide range of processor counts. An implementation on a Transputer-based system is presented along with the performance results obtained. Finally, a simple analytical model of the architecture is given that yields the main performance parameters as a function of the number of processors used and of the elementary response times of the first stage of PEs.

4.
The abundant hardware resources on current reconfigurable computing systems provide new opportunities for high-performance parallel implementations of scientific computations. In this paper, we study designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems. We first analyze design trade-offs in implementing this kernel. These trade-offs are caused by the inherent parallelism of matrix multiplication and the resource constraints, including the number of configurable slices, the size of on-chip memory, and the available memory bandwidth. We propose three parameterized algorithms which can be tuned according to the problem size and the available hardware resources. Our algorithms employ a linear array architecture with simple control logic. This architecture effectively utilizes the available resources and reduces routing complexity. The processing elements (PEs) used in our algorithms are modular so that it is easy to embed floating-point units into them. Experimental results on a Xilinx Virtex-II Pro XC2VP100 show that our algorithms achieve good scalability and high sustained GFLOPS performance. We also implement our algorithms on the Cray XD1, a high-end reconfigurable computing system that employs both general-purpose processors and reconfigurable devices. Our algorithms achieve a sustained performance of 2.06 GFLOPS on a single node of the XD1.

5.
Two parallel implementations of a 3D convex hull algorithm are reported. The paper considers a MIMD distributed memory architecture and the implementations are carried out on the Meiko Computing Surface using T800 transputers and the programming languages Occam and C. The first method uses a simple parallel geometric decomposition strategy and produces encouraging results. With the second approach a parallel generic Divide-and-Conquer kernel is incorporated. This is an example of the algorithmic skeleton approach to parallel programming and involves run-time, dynamic allocation of work to processors. The resulting performances for both methods are measured and compared.  相似文献   

6.
The paper deals with the parallelization of Delaunay triangulation, a widely used space-partitioning technique. Two parallel implementations of a three-dimensional incremental construction algorithm are presented. The first is based on the decomposition of the spatial domain, while the second relies on the master-slave approach. Both parallelization strategies are evaluated, stressing practical issues rather than theoretical complexity. We report on the exploitation of two different parallel environments: a tightly coupled distributed memory MIMD architecture and a network of workstations co-operating under the Linda environment. A third, hybrid solution is then proposed, specifically aimed at exploiting a higher degree of parallelism; it combines the other two solutions by grouping the processing nodes of the multicomputer into clusters and by exploiting parallelism at two different levels.

7.
Based on extending the sequential execution model of Prolog to include parallel execution, we present a method for OR-parallel execution of Prolog on a multiprocessor system. The method reduces the overhead incurred by parallel processing. It allows many processing elements (PEs) to process simultaneously a common branch of a search tree, and each of these PEs creates its local environment and selects a subtree for processing without communication. The run-time overhead is small: simple and efficient operations for selecting the proper subtree. Communication is necessary only when some PEs have exhausted their search spaces and there are others still searching for solutions. The method is able to utilize most of the technology devised for sequential implementation of Prolog. It is optimized for an architecture that supports broadcast copying.  相似文献   

8.
Simulated annealing is a general-purpose optimization technique capable of finding an optimal or near-optimal solution in various applications. However, the long execution time required for a good-quality solution has been a major drawback in practice. Extensive studies have been carried out to develop parallel algorithms for simulated annealing. Most of them were not very successful, mainly because multiple processing elements (PEs) were required to follow a single Markov chain and, therefore, only limited parallelism was exploited. In this paper, we propose new parallel simulated annealing algorithms which allow multiple Markov chains to be traced simultaneously by PEs that may communicate with each other. We have considered both synchronous and asynchronous implementations of the algorithms. Their performance has been analyzed in detail and also verified by extensive experimental results. It has been shown that for graph partitioning the proposed parallel simulated annealing schemes can find a solution of equivalent (or even better) quality up to an order of magnitude faster than the conventional parallel schemes. Among the proposed schemes, the one where PEs exchange information dynamically (rather than with a fixed period) performs best.
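A minimal sketch of the multiple-Markov-chain idea, assuming an OpenMP shared-memory setting and a toy one-dimensional objective: each thread traces its own chain and the chains periodically adopt the best solution found so far. The paper's algorithms target graph partitioning and include asynchronous variants; neither is reproduced here, and every parameter below is a placeholder.

```c
/* Multiple-Markov-chain simulated annealing sketch (OpenMP).
 * Each thread runs its own chain on a toy objective; chains
 * periodically share the best solution found so far. */
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>   /* rand_r (POSIX) */

static double energy(double x) {                /* toy objective */
    return (x - 3.0) * (x - 3.0) + sin(10.0 * x);
}

int main(void) {
    double global_best_x = 0.0;
    double global_best_e = energy(0.0);

    #pragma omp parallel
    {
        unsigned seed = 1234u + 17u * (unsigned)omp_get_thread_num();
        double x = 0.0, e = energy(x), T = 10.0;

        for (int sweep = 0; sweep < 100; sweep++) {
            for (int step = 0; step < 1000; step++) {      /* local Markov chain */
                double xn = x + ((double)rand_r(&seed) / RAND_MAX - 0.5);
                double en = energy(xn);
                double u  = (double)rand_r(&seed) / RAND_MAX;
                if (en < e || u < exp((e - en) / T)) { x = xn; e = en; }
            }
            T *= 0.95;                                     /* cool down */

            /* periodic exchange: chains adopt the best solution so far */
            #pragma omp critical
            if (e < global_best_e) { global_best_e = e; global_best_x = x; }
            #pragma omp barrier
            x = global_best_x;                             /* restart from best */
            e = global_best_e;
            #pragma omp barrier                            /* reads done before next writes */
        }
    }
    printf("best x = %.4f, energy = %.4f\n", global_best_x, global_best_e);
    return 0;
}
```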

9.
The Cellular Potts Model (CPM) is a lattice based modeling technique used for simulating cellular structures in computational biology. The computational complexity of the model means that current serial implementations restrict the size of simulation to a level well below biological relevance. Parallelization on computing clusters enables scaling the size of the simulation but marginally addresses computational speed due to the limited memory bandwidth between nodes. In this paper we present new data-parallel algorithms and data structures for simulating the Cellular Potts Model on graphics processing units. Our implementations handle most terms in the Hamiltonian, including cell–cell adhesion constraint, cell volume constraint, cell surface area constraint, and cell haptotaxis. We use fine level checkerboards with lock mechanisms using atomic operations to enable consistent updates while maintaining a high level of parallelism. A new data-parallel memory allocation algorithm has been developed to handle cell division. Tests show that our implementation enables simulations of >10⁶ cells with lattice sizes of up to 256³ on a single graphics card. Benchmarks show that our implementation runs ∼80× faster than serial implementations, and ∼5× faster than previous parallel implementations on computing clusters consisting of 25 nodes. The wide availability and economy of graphics cards mean that our techniques will enable simulation of realistically sized models at a fraction of the time and cost of previous implementations and are expected to greatly broaden the scope of CPM applications.
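The checkerboard idea mentioned above can be shown in a few lines: sites of one colour have no neighbours of the same colour, so they can all be updated in parallel without read/write conflicts. The C/OpenMP sketch below applies this to a generic q-state Potts-like lattice with a simple mismatch (adhesion-like) energy; it is not the full CPM Hamiltonian, omits volume/surface constraints and cell division, and all sizes are placeholders. The paper's implementation runs on GPUs and additionally uses atomic lock mechanisms.

```c
/* Generic checkerboard Metropolis sweep on a q-state Potts-like lattice
 * (OpenMP).  Illustrates the checkerboard decomposition only. */
#include <math.h>
#include <omp.h>
#include <stdlib.h>   /* rand_r (POSIX) */

#define N  256         /* lattice edge length (placeholder) */
#define QS 8           /* number of spin states (placeholder) */

static int spin[N][N];

static double site_energy(int x, int y, int s) {
    /* simple nearest-neighbour mismatch energy (adhesion-like term) */
    int e = 0;
    e += (spin[(x + 1) % N][y] != s);
    e += (spin[(x + N - 1) % N][y] != s);
    e += (spin[x][(y + 1) % N] != s);
    e += (spin[x][(y + N - 1) % N] != s);
    return (double)e;
}

static void checkerboard_sweep(double T, unsigned *seeds) {
    for (int parity = 0; parity < 2; parity++) {
        /* sites of one colour have no same-colour neighbours,
         * so all of them can be updated in parallel */
        #pragma omp parallel for schedule(static)
        for (int x = 0; x < N; x++) {
            unsigned *seed = &seeds[omp_get_thread_num()];   /* assumes <= 256 threads */
            for (int y = (x + parity) % 2; y < N; y += 2) {
                int s_old = spin[x][y];
                int s_new = (int)(rand_r(seed) % QS);         /* proposed new state */
                double dE = site_energy(x, y, s_new) - site_energy(x, y, s_old);
                double u  = (double)rand_r(seed) / RAND_MAX;
                if (dE <= 0.0 || u < exp(-dE / T))
                    spin[x][y] = s_new;                       /* Metropolis accept */
            }
        }
    }
}

int main(void) {
    unsigned seeds[256];
    for (int i = 0; i < 256; i++) seeds[i] = 1000u + (unsigned)i;
    for (int sweep = 0; sweep < 10; sweep++)
        checkerboard_sweep(1.0, seeds);
    return 0;
}
```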

10.
This paper describes the implementation of a logic programming language on a massively parallel architecture. This implementation is based on the AND/OR Process Model, which allows the exploitation of both AND and OR parallelism in logic programs. A distributed memory model is used, and a decentralized control mechanism has been designed. The multicomputer on which the system has been implemented consists of a network of Inmos Transputers. The AND/OR processes are implemented as Occam processes mapped onto the Transputer nodes. After the presentation of the system architecture and a detailed discussion of the distributed memory management, some preliminary performance results are discussed.

11.
Any parallel program has abstractions that are shared by the program's multiple processes. Such shared abstractions can considerably affect the performance of parallel programs, on both distributed and shared memory multiprocessors. As a result, their implementation must be efficient, and such efficiency should be achieved without unduly compromising program portability and maintainability. The primary contribution of the DSA library is its representation of shared abstractions as objects that may be internally distributed across different nodes of a parallel machine. Such distributed shared abstractions (DSA) are encapsulated so that their implementations are easily changed while maintaining program portability across parallel architectures. The principal results presented are: a demonstration that the fragmentation of object state across different nodes of a multiprocessor machine can significantly improve program performance; and that such object fragmentation can be achieved without compromising portability by changing object interfaces. These results are demonstrated using implementations of the DSA library on several medium scale multiprocessors, including the BBN Butterfly, Kendall Square Research, and SGI shared memory multiprocessors. The DSA library's evaluation uses synthetic workloads and a parallel implementation of a branch and bound algorithm for solving the traveling salesperson problem (TSP)  相似文献   

12.
Several possibilities exist to implement the propagation step of lattice Boltzmann methods. This paper describes common implementations and compares the number of memory transfer operations they require per lattice node update. A performance model based on the memory bandwidth is then used to obtain an estimation of the maximum achievable performance on different machines. A subset of the discussed implementations of the propagation step are benchmarked on different Intel- and AMD-based compute nodes using the framework of an existing flow solver that is specially adapted to simulate flow in porous media, and the model is validated against the measurements. Advanced approaches for the propagation step like “A–A pattern” or “Esoteric Twist” require more programming effort but often sustain significantly better performance than non-naïve but straightforward implementations.  相似文献   
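For orientation, a straightforward two-grid "pull" propagation step for a D2Q9 lattice is sketched below in C; per node update it reads and writes Q distribution values, which is the kind of memory-transfer count a bandwidth-based performance model works with. The A-A pattern and Esoteric Twist schemes compared in the paper update the lattice in place and are not shown here; D2Q9 and the lattice size are illustrative choices, not taken from the paper.

```c
/* Two-grid "pull" propagation step for a D2Q9 lattice Boltzmann method
 * (illustrative sketch; collision and boundary handling omitted). */
#include <stdlib.h>

#define Q  9
#define NX 128
#define NY 128

/* D2Q9 lattice velocities */
static const int cx[Q] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
static const int cy[Q] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };

#define IDX(x, y, q) (((y) * NX + (x)) * Q + (q))

/* Pull scheme: each node gathers the values it needs from its upstream
 * neighbours of the previous time step.  Per node update this reads Q
 * values and writes Q values, i.e. roughly 2*Q memory transfers. */
static void propagate_pull(const double *src, double *dst) {
    for (int y = 0; y < NY; y++)
        for (int x = 0; x < NX; x++)
            for (int q = 0; q < Q; q++) {
                int xs = (x - cx[q] + NX) % NX;   /* periodic boundaries */
                int ys = (y - cy[q] + NY) % NY;
                dst[IDX(x, y, q)] = src[IDX(xs, ys, q)];
            }
}

int main(void) {
    double *f0 = calloc((size_t)NX * NY * Q, sizeof *f0);
    double *f1 = calloc((size_t)NX * NY * Q, sizeof *f1);
    if (!f0 || !f1) return 1;
    propagate_pull(f0, f1);   /* one streaming step */
    free(f0); free(f1);
    return 0;
}
```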

13.
DADO is a parallel, tree-structured machine designed to provide significant performance improvements in the execution of large expert systems implemented in production system form. A full-scale version of the DADO machine would comprise a large set of processing elements (PEs) (on the order of thousands), each containing its own processor, a small amount (20K bytes, in the current prototype design) of local random access memory, and a specialized I/O switch. The PEs are interconnected to form a complete binary tree. This paper describes the application domain of the DADO machine and the rationale for its design. Parallel algorithms for production system execution are briefly described. We then focus on the machine architecture and detail the hardware design of a moderately large prototype comprising 1023 microprocessors currently operational at Columbia University. We conclude with very encouraging performance statistics recently calculated from an analysis of simulations of the system.  相似文献   

14.
Design of a Message Memory Network Interface for Network Parallel Computing Systems   (cited 4 times: 0 self-citations, 4 by others)
By qualitatively analysing typical parallel application programs, this paper proposes and defines a message-passing factor R, namely the proportion of heap-data transfers within the overall message traffic; trace statistics gathered for a set of typical parallel applications in a real NPC (network parallel computing) environment then confirm the analysis that R is close to 1. Based on this qualitative analysis and the quantitative statistics, and taking advantage of progress in memory technology, a message memory is introduced on the network interface of the NPC, so that every node in the NPC can directly access the message memory of other nodes; conclusions are then drawn for NPCs whose network interfaces are equipped with such a message memory.
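The hardware message memory itself cannot be reproduced in code, but the access pattern it enables (one node depositing data directly into another node's message area without the receiver's active involvement) is closely mirrored by one-sided communication. The sketch below uses standard MPI-2 RMA purely as an analogy; the window size and payload are placeholders and nothing here comes from the paper.

```c
/* Analogy only: each rank exposes a "message memory" window that
 * remote ranks can write into with one-sided MPI_Put operations. */
#include <mpi.h>
#include <stdio.h>

#define MSG_WORDS 16            /* placeholder size of the message area */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int msg_mem[MSG_WORDS] = {0};          /* this rank's "message memory" */
    MPI_Win win;
    MPI_Win_create(msg_mem, MSG_WORDS * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (nprocs > 1) {
        /* every rank deposits its id into slot 0 of the next rank's area */
        int payload = rank;
        MPI_Put(&payload, 1, MPI_INT, (rank + 1) % nprocs, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    printf("rank %d received %d in its message memory\n", rank, msg_mem[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```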

15.
A portable parallelization of the Cooley–Tukey FFT algorithm for MIMD multiprocessors is presented. The implementation uses the virtual machine for multiprocessors (VMMP) and PVM portable software packages. Since VMMP provides the same set of services on all target machines, a single version of the parallel FFT code was used for shared memory (25-processor Sequent Symmetry), shared bus (MOS, running a distributed UNIX) and distributed memory multiprocessors (transputer network and 64-processor IBM SP2). It is accompanied by a detailed performance analysis of the implementations. The algorithm achieved high efficiencies on all target machines. The analysis indicates that most overheads are caused by the target architecture and not by VMMP or PVM inefficiencies. The portability analysis of the FFT provides several important insights. On the message passing architecture, the parallel FFT algorithm can obtain linearly increasing speedup with respect to the number of processors with only a moderate increase in the problem size. The parallel FFT can be executed by any number of processors, but generally the number of processors is much less than the length of the input data. The results indicate that the parallel FFT is portable: it achieves very good speedups either on a shared memory multiprocessor with high memory bandwidth or on a message passing multiprocessor, without any change in the programs. © 1998 John Wiley & Sons, Ltd.

16.
This paper addresses parallel execution of chain code generation on a linear array architecture. The contours in the proposed algorithm are viewed as a set of edges (or contour segments) that can be traced by a top-down contour tracing method to generate the chain codes for the outer and inner object contours. A parallel algorithm that contains the chain code generating rules and operations needed is also described, and the algorithm is mapped onto a one-dimensional systolic array containing processing elements (PEs) to devise this architecture. The architecture extracts the contours of objects and quickly generates the corresponding chain codes after the image data in all rows are inputted in a linear fashion. The total processing time for generating the chain codes in an N×N image is O(3N). By doing so, the real-time requirement is fulfilled and the execution time is independent of the image content. In addition, a partition method is developed to process an image when the parallel architecture has a fixed number of PEs, say two or more. The total execution time for an N×N image with a fixed number M of PEs is N(N+1)/M+2(M−1).

17.
To address the lack of a general-purpose computational model for parallel computer architectures, this paper analyses several existing representative models, compares them with respect to synchrony, communication style, and parameters, and proposes an improved mzLogGP model based on the LogGP model. MPI parallel programs running under the conditions that node computing resources are not exclusively owned and that the network is congested are analysed and tested; by adding two parameters, the number of memory hierarchy levels and a network congestion index, the model computes the computation and communication overheads. Comparing measured times against predicted times shows that the systematic error keeps decreasing as the number of nodes grows, indicating that the new model helps improve the performance of parallel applications running on multi-core processor cluster platforms and exhibits good scalability.
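For reference, the point-to-point cost of sending a k-byte message in the baseline LogGP model that the abstract extends is commonly written as below; how the two extra parameters (memory hierarchy depth and the network congestion index) enter the cost terms is not specified in the abstract, so they are omitted here.

```latex
% Baseline LogGP send cost for a k-byte message:
%   L = network latency, o = per-message processor overhead at sender/receiver,
%   G = gap per byte (reciprocal of per-processor bandwidth) for long messages.
T_{\mathrm{send}}(k) \;=\; o_{\mathrm{send}} + (k-1)\,G + L + o_{\mathrm{recv}}
```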

18.
19.
The Applicative Programming System Architecture contains a novel Data Structure Memory (DSM) which supports fast access operations on compact linear data structures. Several problems that arise in implementations of applicative and functional programming languages can be solved efficiently using special data representations on the DSM. Each memory word in the DSM contains a very small local processor, and there is also a tree-structured communications network within the DSM. Therefore the DSM is a massively parallel SIMD machine. This paper describes a VLSI implementation of the DSM architecture and compares its performance with implementations on a conventional sequential computer and the NASA Massively Parallel Processor.  相似文献   

20.
This paper develops a performance model of an optically interconnected parallel computer system operating in a distributed shared memory environment. The performance model is developed to reflect the impact of the low-level optical media access protocol and of optical device switching latency on high-level system performance. This enables the model to predict the performance impact of supporting distributed shared memory with different address allocation schemes and media access protocols. The passive star-coupled photonic network operates through wavelength division multiple access. Two media access protocols are examined for this WDM network; both are designed to operate in a multiple-channel multiple-access environment and require each node to possess a wavelength-tunable transmitter and a fixed (or slowly tunable) receiver. A semi-Markov model has been developed to study the interaction of the distributed shared memory architecture and the two access protocols of the photonic network. This analytical model has been validated by extensive simulation. The model is then used to examine system performance with varying numbers of nodes and wavelength channels and varying memory and channel access times.

