Similar Literature
Found 20 similar documents (search time: 421 ms)
1.
The communication and synchronization overhead inherent in parallel processing can lead to situations where adding processors to the solution method actually increases execution time. Problem type, problem size, and architecture type all affect the optimal number of processors to employ. In this paper we examine the numerical solution of an elliptic partial differential equation in order to study the relationship between problem size and architecture. The equation's domain is discretized into n^2 grid points which are divided into partitions and mapped onto the individual processor memories. We analytically quantify the relationships among grid size, stencil type, partitioning strategy, processor execution time, and communication network type. In doing so, we determine the optimal number of processors to assign to the solution (and hence the optimal speedup), and identify (i) the smallest grid size which fully benefits from using all available processors, (ii) the leverage on performance given by increasing processor speed or communication network speed, and (iii) the suitability of various architectures for large numerical problems. Finally, we compare the predictions of our analytic model with measurements from a multiprocessor and find that the model accurately predicts performance.
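As a rough illustration of the optimum this abstract describes, the sketch below models per-iteration time as compute, halo-exchange, and synchronization components and searches for the processor count that minimizes it. The model form and all constants are illustrative assumptions, not the paper's actual model.

```python
import math

# Illustrative cost model (assumed, not from the paper): an n x n grid split
# into p roughly square partitions with a five-point stencil. Each iteration
# costs per-point compute time, four boundary exchanges, and a global
# synchronization that grows with the processor count.

def iteration_time(n, p, t_calc=1.0, t_startup=100.0, t_byte=10.0, t_sync=200.0):
    compute = t_calc * n * n / p                        # interior updates
    exchange = 4 * (t_startup + t_byte * n / p ** 0.5)  # halo exchange per side
    sync = t_sync * math.log2(p) if p > 1 else 0.0      # global synchronization
    return compute + exchange + sync

def optimal_processors(n, p_max, **kw):
    """Past the optimum, synchronization outweighs the shrinking compute share."""
    return min(range(1, p_max + 1), key=lambda p: iteration_time(n, p, **kw))
```

With these constants, a 32x32 grid reaches its minimum time well below 256 processors; adding more processors past that point increases execution time, which is exactly the effect the abstract describes.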

2.
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objectives of this paper are to critically assess the state of the art in the theory of scalability analysis, and to motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms that have been developed for studying scalability issues, and discuss their interrelationships. For example, we derive an important relationship between time-constrained scaling and the isoefficiency function. We point out some of the weaknesses of the existing schemes for measuring scalability, and discuss possible ways of extending them.
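The isoefficiency function mentioned above can be illustrated with a textbook-style example (assumed here, not taken from the survey): for an algorithm whose total overhead grows as p·log2(p) work units, holding efficiency constant requires the problem size to grow at the same rate.

```python
import math

# Illustrative isoefficiency example (assumed overhead model, not from the
# paper): total overhead T_o(W, p) = p * log2(p) work units, so efficiency is
# E = W / (W + T_o). Growing W proportionally to p*log2(p) keeps E fixed.

def efficiency(W, p):
    overhead = p * math.log2(p)
    return W / (W + overhead)

def isoefficiency_work(p, K=4.0):
    """Problem size that holds efficiency at E = K/(K+1) for this model."""
    return K * p * math.log2(p)
```

Scaling W along the isoefficiency curve keeps efficiency at exactly 0.8 here, while running the same W on more processors than the curve prescribes drops it, which is the scalability trade-off the survey formalizes.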

3.
A molecular dynamics model is processed on PAX, a parallel array computer whose architecture is a nearest-neighbor mesh connection of processors. Two parallel schemes, named Lagrangian and Eulerian, are implemented; execution time and efficiency are analyzed and expressed in terms of basic parameters such as problem size and array size. The Lagrangian scheme achieves efficiency close to 1, which ensures linear speedup proportional to the size of the processor array. A parallel programming technique is also presented.

4.
Binary images can be compressed efficiently using context-based statistical modeling and arithmetic coding. However, this approach is fully sequential and therefore additional computing power from parallel computers cannot be utilized. We attack this problem and show how to implement context-based compression in parallel. Our approach is to segment the image into non-overlapping blocks, which are compressed independently by the processors. We give two alternative solutions for how to construct, distribute and utilize the model in parallel, and study the effect on compression performance and execution time. We show by experiments that the proposed approach achieves speedup that is proportional to the number of processors. The work efficiency exceeds 50% with any reasonable number of processors. Copyright © 2002 John Wiley & Sons, Ltd.

5.
The realization of modern processors is based on a multicore architecture with an increasing number of cores per processor. Multicore processors are often designed such that some level of the cache hierarchy is shared among cores. Usually, the last-level cache is shared among several or all cores (e.g., the L3 cache) and each core possesses private low-level caches (e.g., the L1 and L2 caches). Superlinear speedup is possible for a matrix multiplication algorithm executed on a shared-memory multiprocessor due to the existence of a superlinear region: a region where the cache requirements for matrix storage cause the sequential execution to incur more cache misses than the parallel execution. This paper shows theoretically and experimentally that there is a region where superlinear speedup can be achieved. We provide a theoretical proof of the existence of a superlinear speedup and determine the boundaries of the region where it can be achieved. The experiments confirm our theoretical results. Therefore, these results will have an impact on future software development and exploitation of parallel hardware on the basis of a shared-memory multiprocessor architecture. Copyright © 2013 John Wiley & Sons, Ltd.
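A back-of-envelope version of the region boundary can be sketched as follows. This is not the paper's proof; the working-set model and cache sizes are invented for illustration only.

```python
# Illustrative model (assumed, not the paper's): superlinear speedup becomes
# possible when each core's share of the three matrices fits in its private
# cache while the whole working set does not fit in a single core's cache.

def working_set_bytes(n, elem=8):
    return 3 * n * n * elem          # matrices A, B and C, 8-byte elements

def in_superlinear_region(n, cores, cache_per_core=256 * 1024, elem=8):
    seq_fits = working_set_bytes(n, elem) <= cache_per_core
    par_fits = working_set_bytes(n, elem) / cores <= cache_per_core
    return par_fits and not seq_fits  # parallel caches suffice, one cache doesn't
```

Under these assumed sizes, matrices too small to overflow one cache, or too large even for the combined caches, fall outside the region; only the intermediate sizes qualify.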

6.
The effectiveness of parallel processing of relational join operations is examined. The skew in the distribution of join attribute values and the stochastic nature of the task processing times are identified as the major factors that can affect the effective exploitation of parallelism. Expressions for the execution time of parallel hash join and semijoin are derived and their effectiveness analyzed. When many small processors are used in the parallel architecture, the skew can result in some processors becoming bottlenecks while other processors are underutilized. Even in the absence of skew, the variations in the processing times of the parallel tasks belonging to a query can lead to high task synchronization delay and limit the maximum speedup achievable through parallel execution. For example, when the task processing time on each processor is exponential with the same mean, the speedup is proportional to P/ln(P), where P is the number of processors. Other factors such as memory size, communication bandwidth, etc., can lead to even lower speedup. These are quantified using analytical models.
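The P/ln(P) behavior quoted above can be checked by simulation: with P i.i.d. exponential task times, the query finishes when the slowest task does, and the expected maximum of P unit-mean exponentials is the harmonic number H_P = 1 + 1/2 + ... + 1/P ≈ ln(P). The simulation below is a sketch of that effect, not code from the paper.

```python
import math
import random

# Monte Carlo sketch of the synchronization-delay effect: P parallel tasks
# with i.i.d. exponential processing times (mean 1). Sequential time is the
# sum; parallel time is the maximum, whose mean is H_P ~ ln(P), so the
# speedup behaves like P / ln(P) rather than P.

def expected_speedup(P, trials=20000, seed=1):
    rng = random.Random(seed)
    seq = par = 0.0
    for _ in range(trials):
        times = [rng.expovariate(1.0) for _ in range(P)]
        seq += sum(times)   # one processor runs all tasks back to back
        par += max(times)   # P processors wait for the slowest task
    return seq / par

def harmonic(P):
    return sum(1.0 / k for k in range(1, P + 1))
```

For P = 16 the simulated speedup sits near 16/H_16 ≈ 4.7, well below the linear speedup of 16.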

7.
This paper presents a parallel algorithm for computing the inverse of a dense matrix based on a modified Jordan elimination which requires fewer calculation steps than the standard one. The algorithm is proposed for implementation on a linear array with a small to moderate number of processors which operate in a parallel-pipeline fashion. Communication between neighboring processors is achieved by a common memory module implemented as a FIFO memory module. For the proposed algorithm we define a task scheduling procedure and prove that it is time optimal. In order to compute the speedup and efficiency of the system, two definitions (Amdahl's and Gustafson's) were used. For the proposed architecture, involving two to 16 processors, estimated Gustafson's (Amdahl's) speedups are in the range 1.99 to 13.76 (1.99 to 9.69).
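The two speedup definitions used above differ in what they hold fixed: Amdahl's law fixes the problem size, Gustafson's law scales it with the processor count. A minimal sketch of both formulas follows; the serial fraction used in the test is an illustrative assumption, not a value from the paper.

```python
# Standard forms of the two speedup definitions the abstract compares.
# serial_frac is the fraction s of the work that cannot be parallelized.

def amdahl_speedup(p, serial_frac):
    """Fixed problem size: S = 1 / (s + (1 - s)/p)."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / p)

def gustafson_speedup(p, serial_frac):
    """Scaled problem size: S = s + p * (1 - s)."""
    return serial_frac + p * (1.0 - serial_frac)
```

For any p > 1 and the same serial fraction, the Gustafson figure exceeds the Amdahl figure, which matches the ordering of the two ranges reported in the abstract.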

8.
Motif discovery is an important research problem in bioinformatics. Using a strategy of uniformly distributing suffix groups together with parallel elimination and merging methods, a parallel voting algorithm for motif discovery is designed on a cluster system. Experimental results show that, while guaranteeing solution accuracy, the parallel algorithm achieves good speedup, with an execution efficiency above 95%.

9.
Optimization of parallel query execution plans in XPRS
In this paper, we describe our approach to optimization of query execution plans in XPRS, a multiuser parallel database system based on a shared-memory multiprocessor and a disk array. The main difficulties in this optimization problem are the compile-time-unknown parameters, such as available buffer size and number of free processors, and the enormous search space of possible parallel plans. We deal with these problems with a novel two-phase optimization strategy which dramatically reduces the search space and allows run-time parameters without significantly compromising plan optimality. In this paper we present our two-phase optimization strategy and give experimental evidence from XPRS benchmarks indicating that it almost always produces optimal or close-to-optimal plans.

10.
When traditional genetic algorithms are used to solve compute-intensive tasks, the execution time of the fitness function grows very quickly, so the algorithm converges very slowly as the population size or number of generations increases. To address this, a "coarse-grained/master-slave" hybrid parallel genetic algorithm (HBPGA) was designed and implemented on the Sunway TaihuLight, the supercomputer currently ranked first on the TOP500 list. The algorithm model uses a two-level parallel architecture combining the MPI and Athread programming models. Compared with traditional genetic algorithms implemented on single-core machines or on multi-core clusters with one level of parallelism, it achieves two-level parallelism on the Sunway many-core processor and obtains better performance and higher speedup. In the experiments, with 16×64 slave cores, a maximum speedup of 544 is reached, and the slave-core speedup exceeds 31.

11.
Partitioning of processors on a multiprocessor system involves logically dividing the system into processor partitions. Programs can be executed in the different partitions in parallel. Optimally setting the partition size can significantly improve the throughput of multiprocessor systems. The speedup characteristics of parallel programs can be defined by execution signatures. The execution signature of a parallel program on a multiprocessor system is the rate at which the program executes in the absence of other programs and depends upon the number of allocated processors, the specific architecture, and the specific program implementation. Based on the execution signatures, this paper analyzes simple Markovian models of dynamic partitioning. From the analysis, when there are at most two multiprocessor partitions, the optimal dynamic partition size can be found which maximizes throughput. Compared against other partitioning schemes, the dynamic partitioning scheme is shown to be the best in terms of throughput when the reconfiguration overhead is low. If the reconfiguration overhead is high, dynamic partitioning is to be avoided. An expression for the reconfiguration overhead threshold is derived. A general iterative partitioning technique is presented. It is shown that the technique gives maximum throughput for n partitions.

12.
A parallel algorithm is derived for solving the discrete-ordinates approximation of the neutron transport equation, based on the naturally occurring decoupling between the mesh sweeps in the various discrete directions during each iteration. In addition, the parallel Source Iteration (SI) algorithm, characterized by its coarse granularity and static scheduling, is implemented for the Nodal Integral Method (NIM) into the Parallel Nodal Transport (P-NT) code on Intel's iPSC/2 hypercube. Measured parallel performance for solutions of two test problems is used as evidence of the parallel algorithm's potential for high speedup and efficiency. The measured performance data are also used to develop and validate a parallel performance model for the total, serial, parallel, and global-summation time components per iteration as a function of the spatial mesh size, the problem size (number of mesh cells and angular quadrature order), and the number of utilized processors. The potential for high performance (large speedup at high efficiency) for large problems is explored using the performance model, and it is concluded that present applications in three-dimensional Cartesian geometry will benefit by concurrent execution on parallel computers with up to a few hundred processors. Research sponsored by the U.S. Department of Energy, managed by Martin Marietta Energy Systems, Inc., under contract No. DE-AC05-84OR21400.

13.
We show how computations such as those involved in American- or European-style option price valuations with the explicit finite difference method can be performed in parallel. Towards this we introduce a latency-tolerant parallel algorithm for performing such computations efficiently that achieves optimal theoretical speedup p, where p is the number of processors of the parallel system. An implementation of the parallel algorithm has been undertaken, and an evaluation of its performance is carried out by performing an experimental study on a high-latency PC cluster and, at a smaller scale, on a multi-core processor using in addition the SWARM parallel computing framework for multi-core processors. Our implementation of the parallel algorithm is not only architecture but also communication library independent: the same code works under LAM-MPI and Open MPI and also BSPlib, library frameworks that facilitate parallel programming. The suitability of our approach to multi-core processors is also established.
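The reason the explicit finite-difference method parallelizes so well is that each grid value at the next time level depends only on three neighboring values at the current level, so the grid can be split among p processors with a one-point halo exchange. The toy below uses a generic explicit stencil update as a stand-in for the paper's option-pricing solver (the update coefficient and grid are invented) and checks that a partitioned computation reproduces the sequential one exactly.

```python
# Sketch only: a generic explicit three-point stencil, not the paper's exact
# Black-Scholes discretization. Boundary values are held fixed.

def explicit_step(v, a=0.25):
    """One explicit time step: v[i] <- v[i] + a*(v[i-1] - 2*v[i] + v[i+1])."""
    return [v[0]] + [v[i] + a * (v[i - 1] - 2 * v[i] + v[i + 1])
                     for i in range(1, len(v) - 1)] + [v[-1]]

def partitioned_step(v, p, a=0.25):
    """The same step computed in p chunks, as p processors would, each chunk
    reading a one-point halo from its neighbours; the results must agree."""
    n = len(v)
    out = [v[0]] * n
    bounds = [1 + (n - 2) * k // p for k in range(p + 1)]  # interior split
    for k in range(p):
        for i in range(bounds[k], bounds[k + 1]):
            out[i] = v[i] + a * (v[i - 1] - 2 * v[i] + v[i + 1])
    out[-1] = v[-1]
    return out
```

Because the chunks only read the previous time level, they can run concurrently, which is what makes the theoretical speedup of p attainable.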

14.
It has been difficult to develop simple formulations to predict the execution time of parallel programs due to the complexity of characterizing parallel hardware and software. In an attempt to clarify these characterizations, we introduce a methodology for applying a simple performance model based on Amdahl's law. Our formulation results in accurate predictions of execution time on available systems, allowing programmers to select the optimal number of processors to apply to a particular problem or to select an appropriate problem size for the number of processors available. In short, we accurately quantify the scalability of a specific algorithm when it is run on a specific parallel computer. Our predictions are based on simple experiments that characterize machine performance and on a simple analysis of the parallel program. We illustrate our method for a program executed on a Sequent Symmetry multiprocessor with 20 processors. Our predictions closely match experimental results, differing by no more than 5% from the actual execution times. Our results illustrate key performance limitations of parallel systems, showing the impact of overhead and the scaling of problem size.
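A methodology in this spirit can be sketched in a few lines: characterize a program by serial and parallelizable time components fitted from two measurements, then predict execution time for any processor count. The model form is the standard Amdahl decomposition; the timing numbers in the test are invented, not the paper's measurements.

```python
# Sketch of an Amdahl-style prediction model: t(p) = T_s + T_par / p,
# with T_s (serial component) and T_par (parallelizable component) fitted
# from two timing measurements (p1, t1) and (p2, t2).

def fit_amdahl(p1, t1, p2, t2):
    """Solve t = T_s + T_par/p for the two components."""
    T_par = (t1 - t2) / (1.0 / p1 - 1.0 / p2)
    T_s = t1 - T_par / p1
    return T_s, T_par

def predict(T_s, T_par, p):
    return T_s + T_par / p
```

For example, hypothetical measurements t(1) = 100 and t(4) = 32.5 give T_s = 10 and T_par = 90, predicting t(20) = 14.5; the serial component T_s is what caps the achievable speedup at T = T_s for large p.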

15.
The Garey–Johnson algorithm is a well-known polynomial-time algorithm constructing an optimal schedule for the maximum lateness problem with unit-execution-time tasks, two parallel identical processors, precedence constraints and release times. The paper is concerned with the worst-case analysis of a generalization of the Garey–Johnson algorithm to the case of an arbitrary number of processors. In contrast to other algorithms for the maximum lateness problem, the tight performance guarantee for an even number of processors differs from the tight performance guarantee for an odd number of processors.

16.
This paper examines measures for evaluating the performance of algorithms for single instruction stream–multiple data stream (SIMD) machines. The SIMD mode of parallelism involves using a large number of processors synchronized together. All processors execute the same instruction at the same time; however, each processor operates on a different data item. The complexity of parallel algorithms is, in general, a function of the machine size (number of processors), problem size, and type of interconnection network used to provide communications among the processors. Measures which quantify the effect of changing the machine-size/problem-size/network-type relationships are therefore needed. A number of such measures are presented and are applied to an example SIMD algorithm from the image processing problem domain. The measures discussed and compared include execution time, speed, parallel efficiency, overhead ratio, processor utilization, redundancy, cost effectiveness, speed-up of the parallel algorithm over the corresponding serial algorithm, and an additive measure called "sprice" which assigns a weighted value to computations and processors.
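Several of the measures listed above follow mechanically from a handful of raw numbers. The sketch below uses the standard textbook definitions (the sample timing and operation counts in the test are invented, and "sprice" is omitted since its weighting is specific to the paper):

```python
# Standard definitions of a few of the surveyed measures, computed from
# serial/parallel execution times and operation counts.

def measures(t_serial, t_parallel, p, ops_serial, ops_parallel):
    speedup = t_serial / t_parallel
    efficiency = speedup / p                # fraction of ideal linear speedup
    redundancy = ops_parallel / ops_serial  # extra operations parallelism costs
    utilization = redundancy * efficiency   # fraction of processor-time busy
    return {"speedup": speedup, "efficiency": efficiency,
            "redundancy": redundancy, "utilization": utilization}
```

For instance, a 10x speedup on 16 processors with 20% redundant operations gives efficiency 0.625 but utilization 0.75: the processors are busier than the efficiency suggests because some of their work is redundant.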

17.
At present, the meshless element-free Galerkin (EFG) method is being successfully applied in areas such as solid mechanics, fracture mechanics and thermal analysis. Being a meshless method, it has many advantages over the finite element method. One big hurdle to the wide adoption of this method is its computational cost. Therefore, in this paper, a parallel algorithm is proposed for the EFG method. The parallel code has been written in FORTRAN using the MPI message-passing library and executed on a four-node (eight-processor) MIMD-type, distributed-memory 'PARAM 10000' parallel computer. The total time, communication time, speedup and efficiency have been estimated for a three-dimensional heat transfer problem to validate the proposed algorithm. For eight processors, the speedup and efficiency are found to be 4.66 and 58.22%, respectively, for a data size of 1320 nodes.

18.
An Optimal Algorithm for Scheduling Fork-Join Task Graphs
The fork-join task graph is a basic structure in parallel processing. Although many algorithms can produce optimal schedules when the tasks satisfy certain conditions, they often fail to consider saving processors and reducing the total completion time of the task set, which lowers the algorithm's speedup and efficiency. Therefore, a balanced scheduling algorithm based on task duplication is proposed, with time complexity O(vq + v log v), where v and q denote the number of tasks in the task set and the number of processors used, respectively. By analyzing the load and idle time slots of the processors already in use, tasks are assigned to those processors as much as possible to balance the load and thereby improve their utilization. Experimental results show that the algorithm's speedup and overall efficiency are better than those of other algorithms. Therefore, the algorithm is a good choice for scheduling high-performance applications.
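The load-balancing idea can be illustrated with a much simpler stand-in: greedy longest-processing-time (LPT) assignment, where each task goes to the currently least-loaded processor. This is not the paper's duplication-based algorithm; it is a sketch of the "assign to the least-loaded processor" principle only, with invented task times.

```python
import heapq

# LPT sketch (stand-in, not the paper's algorithm): assign tasks in
# decreasing-time order to the processor with the smallest current load,
# tracked with a min-heap, so finish times stay balanced.

def balanced_schedule(task_times, q):
    """Return the per-processor loads after greedy LPT assignment to q processors."""
    loads = [0.0] * q
    heap = [(0.0, k) for k in range(q)]
    heapq.heapify(heap)
    for t in sorted(task_times, reverse=True):
        load, k = heapq.heappop(heap)   # least-loaded processor
        loads[k] = load + t
        heapq.heappush(heap, (loads[k], k))
    return loads
```

On the sample task set the greedy rule splits the 16 units of work perfectly into two loads of 8, i.e. the makespan equals total work divided by the processor count.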

19.
New computer architectures based on large numbers of processors are now used in various application areas ranging from embedded systems to supercomputers. Efficient parallel processing algorithms are applied in a wide variety of applications such as simulation, robot control, and image synthesis. This article presents two novel parallel algorithms for computing robot inverse dynamics (as well as control laws) starting from customized symbolic robot models. To gain the most benefit from the concurrent processor architecture, the whole job is divided into a large number of simple tasks, each involving only a single floating-point operation. Although requiring sophisticated scheduling schemes, fine granularity of tasks was the key factor for achieving nearly maximum efficiency and speedup. The first algorithm resolves the scheduling problem for an array of pipelined processors. The second one is devoted to parallel processors connected by a complete crossbar interconnection network. The main feature of the proposed algorithms is that they take into account the communication delays between processors and minimize both the execution time and communication cost. To prove the theoretical results, the algorithms have been verified by experiments on an INMOS T800 transputer-based system. We used four transputers in serial and parallel configurations. The experimental results show that the most complicated dynamic control laws can be executed in a submillisecond time interval. © 1993 John Wiley & Sons, Inc.

20.
1 Introduction
Algorithms have been proposed to solve linear recurrences in parallel [1-13]. Some of them suppose an unlimited number of processors being used, while others use a limited number of processors. P. M. Kogge and H. S. Stone proposed a recursive doubling algorithm for the solution of a general class of recurrence equations [1]. It is the fastest algorithm (the time is O(log₂ N)) when the number of processing elements p is equal to N. However, it is not optimal in terms of efficiency: its speedup is O(N/log₂ N), while its efficiency is O(1/log₂ N). The recursive doubling approach can pro…
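The recursive doubling idea referred to above can be sketched as follows: the recurrence x[i] = a[i]·x[i-1] + b[i] is viewed as a composition of affine maps, and maps are combined over distances 1, 2, 4, ... so only O(log N) combining rounds are needed. The parallel rounds are simulated sequentially here; this is a standard presentation of the technique, not code from this paper.

```python
# Recursive doubling sketch for x[i] = a[i]*x[i-1] + b[i] with x[-1] = 0.
# Each pair (A[i], B[i]) represents the affine map taking x[i-d] to x[i];
# one "parallel step" composes every map with the one ending d positions
# earlier, doubling d each round.

def recursive_doubling(a, b):
    n = len(a)
    A, B = list(a), list(b)
    d = 1
    while d < n:
        newA = [A[i] * A[i - d] if i >= d else A[i] for i in range(n)]
        newB = [A[i] * B[i - d] + B[i] if i >= d else B[i] for i in range(n)]
        A, B = newA, newB
        d *= 2
    return B   # with x[-1] = 0, the constant term B[i] equals x[i]
```

With N processors each round is a constant-time parallel step, giving the O(log₂ N) time quoted above, but the total operation count grows to O(N log N), which is exactly why the speedup is O(N/log₂ N) and the efficiency only O(1/log₂ N).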
