Similar Documents
20 similar documents were found for this query.
1.
The lack of high-level languages and good compilers for parallel machines hinders their widespread acceptance and use. Programmers must address issues such as process decomposition, synchronization, and load balancing. We have developed a parallelizing compiler that, given a sequential program and a memory layout of its data, performs process decomposition while balancing parallelism against locality of reference. A process decomposition is obtained by specializing the program for each processor to the data that resides on that processor. If this analysis fails, the compiler falls back to a simple but inefficient scheme called run-time resolution. Each process's role in the computation is determined by examining the data required for execution at run-time. Thus, our approach to process decomposition is data-driven rather than program-driven. We discuss several message optimizations that address the issues of overhead and synchronization in message transmission. Accumulation reorganizes the computation of a commutative and associative operator to reduce message traffic. Pipelining sends a value as close to its computation as possible to increase parallelism. Vectorization of messages combines messages with the same source and the same destination to reduce overhead. Our results from experiments in parallelizing SIMPLE, a large hydrodynamics benchmark, for the Intel iPSC/2 show a speedup within 60% to 70% of handwritten code.
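To make the message-vectorization idea concrete, here is a minimal MPI sketch (not the compiler's generated code): values bound for the same destination are packed into one buffer and sent as a single message, so the per-message overhead is paid once instead of once per value. The ring-style neighbor exchange and the buffer size are illustrative assumptions.

```c
/* Hedged sketch: vectorized message transfer. Instead of sending N values to
 * the same neighbor one message at a time, they are packed into one buffer and
 * sent once, paying the per-message latency only once. The ring exchange and
 * buffer size are illustrative, not taken from the paper. */
#include <mpi.h>
#include <stdlib.h>

#define NVALS 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc(NVALS * sizeof(double));
    double *recvbuf = malloc(NVALS * sizeof(double));
    for (int i = 0; i < NVALS; i++)
        sendbuf[i] = rank + 0.001 * i;   /* stand-in for locally owned data */

    int dest = (rank + 1) % size;        /* right neighbor in a ring */
    int src  = (rank - 1 + size) % size; /* left neighbor */

    /* One combined message per neighbor instead of NVALS tiny ones. */
    MPI_Sendrecv(sendbuf, NVALS, MPI_DOUBLE, dest, 0,
                 recvbuf, NVALS, MPI_DOUBLE, src,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```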

2.
OpenMP is an emerging industry standard for shared memory architectures. While OpenMP has advantages in ease of use and incremental programming, message passing is today still the most widely used programming model for distributed memory architectures. How to effectively extend OpenMP to distributed memory architectures has been an active research topic. This paper proposes an OpenMP system, called KLCoMP, for distributed memory architectures. Based on the partially replicating shared arrays memory model, we propose ...

3.
We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single-source and multiple-source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multidimensional arrays, and emulation of butterfly networks. We also show how the processor configuration, the data aggregation, and the encoding of the address space affect the performance for two important basic computations: the multiplication of arbitrarily shaped matrices and the Fast Fourier Transform. We also give an example of the performance behavior for local matrix operations for a processor with a single path to local memory and a set of processor registers. The analytic models are verified by measurements on the Connection Machine Model CM-2.
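As an illustration of one modeled primitive, single-source broadcast on a Boolean cube can be realized as a binomial tree in log2(p) point-to-point steps. The MPI sketch below is a generic reconstruction of that pattern, not the code measured on the CM-2, and it assumes the number of processors is a power of two.

```c
/* Hedged sketch: single-source broadcast on a Boolean cube (hypercube) of
 * p = 2^d processors using a binomial tree, i.e. d point-to-point steps.
 * Illustrative only; assumes the communicator size is a power of two. */
#include <mpi.h>
#include <stdio.h>

static void cube_bcast(int *value, int root, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int rel = rank ^ root;                  /* address relative to the root */

    for (int mask = size >> 1; mask > 0; mask >>= 1) {
        if ((rel & (2 * mask - 1)) == 0) {
            /* already holds the value: forward it across dimension `mask` */
            MPI_Send(value, 1, MPI_INT, rank ^ mask, 0, comm);
        } else if ((rel & (2 * mask - 1)) == mask) {
            /* partner across dimension `mask` delivers the value */
            MPI_Recv(value, 1, MPI_INT, rank ^ mask, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int value = (rank == 0) ? 42 : -1;      /* only the root knows the value */
    cube_bcast(&value, 0, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, value);
    MPI_Finalize();
    return 0;
}
```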

4.
It was shown in the paper of Solchenbach and Trottenberg (in this special issue) that grid algorithms are inherently parallel and that parallel grid algorithms for regular grids can be efficiently implemented on dm-mp systems using the concept of grid partitioning.

In this paper, we demonstrate that grid applications can be implemented quite easily on dm-mp systems if a hardware-independent process system exists and convenient tools (such as the SUPRENUM mapping and communications library) are available.

The evaluation of parallel grid algorithms shows that the multiprocessor speedup and efficiency for single grid applications depend on the communication/calculation performance ratio of the hardware, on the communication/calculation ratio of the algorithms, and on the process size. The efficiency of parallel multigrid algorithms additionally depends on the number of nodes.
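One hedged way to make these dependences concrete is a generic surface-to-volume model (not taken from the paper):

```latex
% Generic single-grid model (not the paper's): per iteration each process
% spends T_calc on computation and T_comm on boundary communication, so
\[
  E \;=\; \frac{T_{\mathrm{calc}}}{T_{\mathrm{calc}} + T_{\mathrm{comm}}}
    \;=\; \frac{1}{1 + T_{\mathrm{comm}} / T_{\mathrm{calc}}}.
\]
% T_comm/T_calc grows with the hardware's communication/calculation cost ratio
% and with the algorithm's communication/calculation ratio, and shrinks as the
% process (subgrid) size grows, since communication scales with the subgrid
% surface while calculation scales with its volume.
```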


5.
Multiprocessor system-on-chip (MP-SoC) platforms represent an emerging trend for embedded multimedia applications. To enable MP-SoC platforms, scalable communication-centric interconnect fabrics, such as networks-on-chip (NoCs), have recently been proposed. The shared memory represents one of the key elements in designing MP-SoCs, providing support for data exchange and synchronization. This paper focuses on the energy/delay exploration of a distributed shared memory architecture suitable for low-power on-chip multiprocessors based on NoC. A mechanism is proposed for data allocation on the distributed shared memory space, dynamically managed by an on-chip hardware memory management unit (HwMMU). Moreover, the exploitation of the HwMMU primitives for the migration, replication, and compaction of shared data is discussed. Experimental results show the impact of different distributed shared memory configurations for a selected set of parallel benchmark applications from the power/performance perspective. Furthermore, a case study for a graph exploration algorithm is discussed, accounting for the effects of core mapping and network topology on energy and performance at the system level.

6.
The rapid rise of OpenMP as the preferred parallel programming paradigm for small-to-medium scale parallelism could slow unless OpenMP can show capabilities for becoming the model of choice for large-scale high-performance parallel computing in the coming decade. The main stumbling block for the adaptation of OpenMP to distributed shared memory (DSM) machines, which are based on architectures like cc-NUMA, stems from the lack of capabilities for data placement among processors and threads needed to achieve data locality. The absence of such a mechanism causes remote memory accesses and inefficient cache memory use, both of which lead to poor performance. This paper presents a simple software programming approach called copy-inside-copy-back (CC) that exploits the data privatization mechanism of OpenMP for data placement and replacement. This technique enables one to distribute data manually without taking away control and flexibility from the programmer and is thus an alternative to the automatic and implicit approaches. Moreover, the CC approach improves on the OpenMP-SPMD style of programming, making the development process of an OpenMP application more structured and simpler. The CC technique was tested and analyzed using the NAS Parallel Benchmarks on SGI Origin 2000 multiprocessor machines. This study shows that OpenMP improves performance of coarse-grained parallelism, although a fast copy mechanism is essential.
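A minimal sketch of the copy-inside-copy-back pattern, assuming a simple element-wise update rather than any NAS benchmark kernel: each thread copies its slice of a shared array into a privately allocated buffer (placed in local memory by first touch), computes on that copy, and copies the result back. Names and sizes are illustrative.

```c
/* Hedged sketch of copy-inside-copy-back (CC): exploit OpenMP privatization so
 * each thread works on a private, locally allocated copy of its slice and only
 * touches the shared array when copying in and out. Illustrative sizes/names. */
#include <omp.h>
#include <stdlib.h>

#define N (1 << 20)

int main(void) {
    double *shared = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) shared[i] = (double)i;

    #pragma omp parallel
    {
        int nt  = omp_get_num_threads();
        int tid = omp_get_thread_num();
        long lo = (long)N * tid / nt;        /* this thread's slice [lo, hi) */
        long hi = (long)N * (tid + 1) / nt;
        long len = hi - lo;

        /* copy inside: private buffer, allocated (and first-touched) locally */
        double *priv = malloc(len * sizeof(double));
        for (long i = 0; i < len; i++) priv[i] = shared[lo + i];

        /* compute on the private copy: all accesses are thread-local */
        for (long i = 0; i < len; i++) priv[i] = 2.0 * priv[i] + 1.0;

        /* copy back: one pass over the shared array per thread */
        for (long i = 0; i < len; i++) shared[lo + i] = priv[i];
        free(priv);
    }
    free(shared);
    return 0;
}
```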

7.
Two techniques to perform an irregularly structured Gröbner basis computation (a basic method used in symbolic polynomial manipulation) on distributed memory machines are developed. The first technique, based on relaxation of dependencies present in the sequential computation, exploits coarse-grain parallelism. In this so-called relaxation approach, at every step each processor reduces a local pair if available, exchanges the result and status information with the other processors, and updates the local set of pairs and basis. The basis is replicated on each processor while the set of pairs is distributed across the processors. The computation terminates when no pairs are left to be reduced on any processor. A relaxation algorithm based on this approach, along with its experimental results, is provided. The other technique, named quasi-barrier, is developed to enhance the performance of the relaxation algorithm. Using this technique, load balance and performance can be improved by synchronizing p processors when a fraction of the active tasks are completed. The performance enhancement is significant for large numbers of processors when the distribution of pair reduction times is close to exponential. The experimental results obtained on the Intel Paragon and IBM SP2 demonstrate the effectiveness of these techniques.

8.
With the advent of multicore processors, it has become imperative to write parallel programs if one wishes to exploit the next generation of processors. This paper deals with skyline computation as a case study of parallelizing database operations on multicore architectures. First we parallelize three sequential skyline algorithms, BBS, SFS, and SSkyline, to see if the design principles of sequential skyline computation also extend to parallel skyline computation. Then we develop a new parallel skyline algorithm, PSkyline, based on the divide-and-conquer strategy. Experimental results show that all the algorithms successfully utilize multiple cores to achieve a reasonable speedup. In particular, PSkyline achieves a speedup approximately proportional to the number of cores when parallel computation is needed most.
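For readers unfamiliar with the operator, the heart of every skyline algorithm is the dominance test. The sketch below is a naive parallel candidate filter (not BBS, SFS, SSkyline, or PSkyline), assuming smaller values are preferred in every dimension.

```c
/* Hedged sketch: naive parallel skyline filter. A point is in the skyline if
 * no other point dominates it (<= in every dimension, < in at least one).
 * This is not the paper's PSkyline algorithm, just the dominance test plus an
 * embarrassingly parallel candidate check; minimization in all dimensions. */
#include <omp.h>

#define DIM 4

/* does p dominate q? */
static int dominates(const double *p, const double *q) {
    int strictly_better = 0;
    for (int d = 0; d < DIM; d++) {
        if (p[d] > q[d]) return 0;
        if (p[d] < q[d]) strictly_better = 1;
    }
    return strictly_better;
}

/* mark skyline membership of n points (row-major n x DIM) in out[] */
void skyline_naive(const double *pts, int n, int *out) {
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; i++) {
        int in_skyline = 1;
        for (int j = 0; j < n && in_skyline; j++)
            if (j != i && dominates(&pts[j * DIM], &pts[i * DIM]))
                in_skyline = 0;
        out[i] = in_skyline;
    }
}
```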

9.
The False Nearest Neighbors (FNN) method is particularly relevant in several fields of science and engineering (medicine, economics, oceanography, biological systems, etc.). In some of these applications, it is important to give results within a reasonable time scale; hence, the execution time of the FNN method has to be reduced. This paper describes two parallel implementations of the FNN method for distributed memory architectures. A 'Single-Program, Multiple Data' (SPMD) paradigm is employed using a simple data decomposition approach where each processor runs the same program but acts on a different subset of the data. The computationally intensive part of the method lies mainly in the neighbor search, and therefore this task is parallelized and executed using 2 to 64 processors. The accuracy and the performance of the two parallel approaches are then assessed and compared with the best sequential implementation of the FNN method, which appears in the TISEAN project. The results indicate that the two parallel approaches, when the method is run using 64 processors on an SGI Origin 3800, are between 40 and 80 times faster than the sequential one. The efficiency is between 65 and 125%.
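The dominant kernel identified above is the neighbor search over delay vectors. As a stand-in for the paper's SPMD decomposition, the sketch below parallelizes a brute-force nearest-neighbor search across query vectors with OpenMP; the actual implementations distribute the data across separate processes, and the data layout here is an assumption.

```c
/* Hedged sketch: brute-force nearest-neighbor search over delay vectors,
 * parallelized across the query vectors. A stand-in for the FNN method's
 * dominant kernel; embedding dimension and data layout are illustrative. */
#include <omp.h>
#include <float.h>

/* vecs: n delay vectors of dimension m (row-major); nn[i] receives the index
 * of the nearest neighbor of vector i (excluding itself). */
void nearest_neighbors(const double *vecs, int n, int m, int *nn) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double best = DBL_MAX;
        int best_j = -1;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double d2 = 0.0;                 /* squared Euclidean distance */
            for (int k = 0; k < m; k++) {
                double diff = vecs[i * m + k] - vecs[j * m + k];
                d2 += diff * diff;
            }
            if (d2 < best) { best = d2; best_j = j; }
        }
        nn[i] = best_j;
    }
}
```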

10.
In this paper we consider the problem of designing a controller that is distributed to l processing nodes associated with l collocated control input-measurement output pairs of a given plant. We seek necessary and sufficient conditions so that a controller, distributed in this form, can provide pre-specified performance levels when certain nodes fail. Based on the Youla-Kucera parameterization, the problem can be converted to a convex problem structured in Q, the solution of which provides a construction for the distributed controller. In addition to nodal failures, we consider the effect of communication noise in the transmission of information on the regulated variable, and show that this can also be precisely captured in a convex-in-Q problem.
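The convexity claim rests on the standard fact that closed-loop maps are affine in the Youla-Kucera parameter; a generic statement of that fact, in notation that is not necessarily the paper's, is:

```latex
% Generic statement (not the paper's notation): with a doubly coprime
% factorization of the plant, the Youla-Kucera parameterization makes every
% achievable closed-loop map affine in the free stable parameter Q:
\[
  T_{zw}(Q) \;=\; T_{1} + T_{2}\, Q\, T_{3}.
\]
% Hence convex performance constraints on T_{zw}, together with convex
% structural constraints on Q (for instance the block structure induced by the
% nodal information pattern), define a convex feasibility problem in Q.
```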

11.
Dahlgren, F., Torrellas, J. Computer, 1999, 32(6): 72-79.
The shared memory concept makes it easier to write parallel programs, but tuning the application to reduce the impact of frequent long-latency memory accesses still requires substantial programmer effort. Researchers have proposed using compilers, operating systems, or architectures to improve performance by allocating data close to the processors that use it. The Cache-Only Memory Architecture (COMA) increases the chances of data being available locally because the hardware transparently replicates the data and migrates it to the memory module of the node that is currently accessing it. Each memory module acts as a huge cache memory in which each block has a tag with the address and the state. The authors explain the functionality, architecture, performance, and complexity of COMA systems. They also outline different COMA designs, compare COMA to traditional nonuniform memory access (NUMA) systems, and describe proposed improvements in NUMA systems that target the same performance obstacles as COMA.

12.
In an era of explosive growth in information and services, distributed trust evaluation has attracted wide attention as a new class of solutions, and many important results have been obtained. Focusing on the difficulties and key issues of distributed trust evaluation, this paper surveys how existing models address them and points out several problems that still remain open.

13.
Parallel Computing, 1997, 23(7): 915-925.
Direct volume rendering algorithms are too computationally expensive to offer interactive frame rates when rendering large 3D medical datasets on standard workstations. This article presents an image-space parallelization of an image-order volume rendering algorithm aimed at shared memory multiprocessors. This parallel implementation of direct volume rendering can significantly reduce rendering times and visualize 3D datasets at several frames per second. The algorithm was implemented and evaluated on Convex SPP Exemplar and SGI Challenge multiprocessors.
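A minimal sketch of image-space parallelization for an image-order (ray-casting) renderer, assuming independent rays and a placeholder for the sampling and compositing step; this is not the article's implementation.

```c
/* Hedged sketch: image-space parallel ray casting for an image-order renderer.
 * Rays are independent, so pixels can simply be divided among threads on a
 * shared-memory machine. cast_ray() is a placeholder for the real sampling and
 * compositing along the ray, not the article's algorithm. */
#include <omp.h>

#define W 512
#define H 512

static float cast_ray(int x, int y) {
    /* placeholder for resampling the volume along the ray and compositing */
    return (float)(x + y) / (float)(W + H);
}

void render(float image[H][W]) {
    /* dynamic scheduling helps because rays through the volume differ in cost */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            image[y][x] = cast_ray(x, y);
}
```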

14.
In this paper, we present a scalable three-dimensional hybrid parallel Delaunay image-to-mesh conversion algorithm (PDR.PODM) for distributed shared memory architectures. PDR.PODM is able to explore parallelism early in the mesh generation process thanks to the aggressive speculative approach employed by the Parallel Optimistic Delaunay Mesh generation algorithm (PODM). In addition, it decreases the communication overhead and improves data locality by making use of a data partitioning scheme offered by the Parallel Delaunay Refinement algorithm (PDR). PDR.PODM supports fully functional volume grading by creating elements of varying size: small elements are created near boundaries or inside critical regions in order to capture fine features, while large elements are created in the rest of the mesh. We tested PDR.PODM on Blacklight, a distributed shared memory (DSM) machine at the Pittsburgh Supercomputing Center. For uniform mesh generation, we observed a weak scaling speedup of 163.8 and above for up to 256 cores, as opposed to PODM, whose weak scaling speedup is only 44.7 on 256 cores. PDR.PODM scales well on uniform refinement cases running on DSM supercomputers. The end result is that PDR.PODM can generate 18 million elements per second, as opposed to 14 million per second in our earlier work. The varying-size version sharply reduces the number of elements compared to the uniform version and thus reduces the time to generate the mesh while keeping the same fidelity.

15.
We present a framework for optimizing the distributed performance of sparse matrix computations. These computations are optimally parallelized by distributing their operations across processors in a subtly uneven balance. Because the optimal balance point depends on the non-zero patterns in the data, the algorithm, and the underlying hardware architecture, it is difficult to determine. The Hogs and Slackers genetic algorithm (GA) identifies processors with many operations (hogs) and processors with few operations (slackers). Its intelligent operation-balancing mutation operator swaps data blocks between hogs and slackers to explore new balance points. We show that this operator is integral to the performance of the genetic algorithm and use the framework to conduct an architecture study that varies network specifications. The Hogs and Slackers GA is itself a parallel algorithm with near-linear speedup on a large computing cluster.
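The operation-balancing mutation can be pictured roughly as follows; the block-to-processor encoding and the random block choice are illustrative assumptions, not the paper's actual chromosome representation.

```c
/* Hedged sketch of a hogs-and-slackers style mutation: find the most loaded
 * processor (hog) and the least loaded one (slacker) from the per-block
 * operation counts, then swap one block of each to explore a new balance
 * point. The owner[] array is an illustrative stand-in for the GA encoding. */
#include <stdlib.h>

void hogs_slackers_mutate(int *owner, const long *block_ops,
                          int nblocks, int nprocs) {
    long *load = calloc(nprocs, sizeof(long));
    for (int b = 0; b < nblocks; b++) load[owner[b]] += block_ops[b];

    int hog = 0, slacker = 0;
    for (int p = 1; p < nprocs; p++) {
        if (load[p] > load[hog]) hog = p;
        if (load[p] < load[slacker]) slacker = p;
    }

    if (hog != slacker) {
        /* pick one block from each side at random and swap their owners */
        int bh = -1, bs = -1, tries = 8 * nblocks;
        while (tries-- > 0 && (bh < 0 || bs < 0)) {
            int b = rand() % nblocks;
            if (bh < 0 && owner[b] == hog) bh = b;
            else if (bs < 0 && owner[b] == slacker) bs = b;
        }
        if (bh >= 0 && bs >= 0) {
            owner[bh] = slacker;
            owner[bs] = hog;
        }
    }
    free(load);
}
```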

16.
Paradigm-oriented distributed computing is a new kind of distributed computing environment that implements distributed computation through paradigms built on top of a mobile-agent system. These paradigms provide the underlying infrastructure for communication, so application programmers only need to supply the application-specific functionality through a graphical interface. The environment supports five common paradigms: the bag-of-tasks paradigm, the search-tree paradigm, the inheritance-programming paradigm, the finite-difference paradigm, and the individual-based simulation paradigm. Taking the search-tree paradigm as an example, this paper describes the basic structure of the search tree, gives a distributed implementation of it, and illustrates the concrete implementation approach of paradigm-oriented distributed computing.

17.
Hybrid architectures are systems where a high-performance general-purpose computer is coupled to one or more special-purpose devices (SPDs). Such a system can be the optimal choice for several fields of computational science. Configuring the system and finding the optimal mapping of the application tasks onto the hybrid machine is often not straightforward. Performance modeling is a tool to tackle and solve these problems. We have developed a performance model to simulate the behavior of a hybrid architecture consisting of a parallel multiprocessor where some nodes host a GRAPE board. GRAPE is a very high performance SPD used in computational astrophysics. We validate our model on the architecture at our disposal and show examples of predictions that our model can produce.

18.
Controllability and observability problems may manifest themselves during the application of a checking sequence in a test architecture where there are multiple remote testers. These problems often require the use of external coordination message exchanges among testers during testing. However, the use of coordination messages requires an external network, which can increase the cost of testing and can be difficult to implement. In addition, the use of coordination messages introduces delays, and this can cause problems where there are timing constraints. Thus, it is sometimes desirable to construct a checking sequence from the specification of the system under test that is free from controllability and observability problems without requiring external coordination message exchanges. This paper gives conditions under which it is possible to produce such a checking sequence, using multiple distinguishing sequences, and an algorithm that achieves this.

19.
An efficient algorithm for the evaluation of the parametric sensitivities for mixed systems of differential and algebraic equations (DAEs) on computers involving multiple processors operating in parallel is presented. The algorithm derives its efficiency by decoupling the integration of the sensitivity equations from that of the original DAE system, and by allowing tasks associated with the evaluation of sensitivities at multiple time points to overlap instead of being carried out in sequence. Numerical experiments demonstrating the efficiency of the proposed algorithm with systems of more than 850 DAEs and 45 parameters are presented. In all the cases studied, the simultaneous integration of the original DAEs and their sensitivity equations is carried out in less than 10% more time than that of the original DAEs alone.
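For context, the sensitivity equations whose integration is decoupled from the original system are obtained by differentiating the DAE residual with respect to each parameter; in generic notation (not necessarily the paper's):

```latex
% Generic form (not necessarily the paper's notation): for a DAE system
%   F(t, x, \dot{x}, p) = 0,
% the sensitivity s_i = \partial x / \partial p_i of the solution with respect
% to parameter p_i satisfies the linear DAE
\[
  \frac{\partial F}{\partial x}\, s_i
  + \frac{\partial F}{\partial \dot{x}}\, \dot{s}_i
  + \frac{\partial F}{\partial p_i} \;=\; 0 ,
  \qquad i = 1, \dots, n_p .
\]
% These equations are linear and share the structure of the original system,
% which is what makes it possible to decouple their integration from that of
% the original DAEs and to evaluate sensitivities at several time points
% concurrently.
```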

20.