首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
This paper explores the problem of solving triangular linear systems on parallel distributed-memory machines. Working within the LogP model, tight asymptotic bounds for solving these systems using forward/backward substitution are presented. Specifically, lower bounds on execution time independent of the data layout, lower bounds for data layouts in which the number of data items per processor is bounded, and lower bounds for specific data layouts commonly used in designing parallel algorithms for this problem are presented in this paper. Furthermore, algorithms are provided which have running times within a constant factor of the lower bounds described. One interesting result is that the popular two-dimensional block matrix layout necessarily results in significantly longer running times than simpler one-dimensional schemes. Finally, a generalization of the lower bounds to banded triangular linear systems is presented.  相似文献   

2.
The problem of solving tridiagonal linear systems on parallel distributed-memory environments is considered in this paper. In particular, two common direct methods for solving such systems are considered: odd-even cyclic reduction and prefix summing. For each method, a variety of lower bounds on execution time for solving tridiagonal linear systems are presented. Specifically, lower bounds are presented that (a) hold when the number of data items per processor is bounded, (b) are general lower bounds, and (c) for specific data layouts commonly used in designing parallel algorithms to solve tridiagonal linear systems. Furthermore, algorithms are presented that have running times within a constant factor of the lower bounds provided. Lastly, a comparison of bounds for odd-even cyclic reduction and prefix summing is given.  相似文献   

3.
The paper deals with the scheduling of tasks on a set of parallel machines subject to precedence constraints and communication delays, that are not fully known before the execution. This is usual in parallel computing (message passing) and in production workshops (part transportation). The goal is to give a priori informations on the behavior of different schedules computed before the execution. This sensitivity analysis helps to choose among different available algorithms. It also provides tools to decide whether it is justified to change the schedule structure at execution time (run-time scheduling).By hypothesis, the communication delays are within a given interval. Worst case sensitivity bounds are proposed that give the maximum loss of performance. These bounds hold for algorithms which are optimal in the deterministic case, but can be easily adapted to arbitrary algorithms. Some applications to specific problems are presented, namely two identical machines, and unbounded number of machines, for tree-like precedence constraints.  相似文献   

4.
The main contribution of this paper is to introduce two parallel memory machines, the discrete memory machine (DMM) and the unified memory machine (UMM). Unlike well-studied theoretical parallel computational models such as parallel random access machines, these parallel memory machines are practical and capture the essential feature of the memory access by graphical processing units (GPUs). As a first step of the development of algorithmic techniques on the DMM and the UMM, we first evaluate the computing time for the contiguous access and the stride access to the memory on these models. We then present parallel algorithms to transpose a 2D array on these models and evaluate their performance. Finally, we show that, for any permutation given in offline, data in an array can be moved efficiently along the given permutation both on the DMM and on the UMM. Since the computing time of our permutation algorithms on the DMM and the UMM is equal to the sum of the lower bounds obtained from the memory bandwidth limitation and the latency limitation, they are optimal from the theoretical point of view. We believe that the DMM and the UMM can be good theoretical platforms to develop algorithmic techniques for GPUs.  相似文献   

5.
Randomized algorithms are algorithms that employ randomness in their solution method. We show that the performance of randomized algorithms is less affected by factors that prevent most parallel deterministic algorithms from attaining their theoretical speedup bounds. A major reason is that the mapping of randomized algorithms onto multiprocessors involves very little scheduling or communication overhead. Furthermore, reliability is enhanced because the failure of a single processor leads only to degradation, not failure, of the algorithm. We present results of an extensive simulation done on a multiprocessor simulator, running a randomized branch-and-bound algorithm. The particular case we consider is the knapsack problem, due to its ease of formulation. We observe the largest speedups in precisely those problems that take large amounts of time to solve. This work has been supported by the U.S. Army Research Office under Contract No. DAAG 29-85-K-0236.  相似文献   

6.
本文系统地总结和探讨了共享和分布式存储环境下的并行计算时间模型。微观上,结合并行机结构特征和通信机制,揭示了延长算法运行时间的关键因素,并据此提出一些优化原则和效率评价准则,能辅助用户修改并行算法达到最优性能;宏观上,给出了基本消息传递的常用通信原语类型和部分原语操作时间经验公式,能辅助用户选择最优通信原语和问题粒度,正确预测程序的运行时间和性能。  相似文献   

7.
Mackworth and Freuder have analyzed the time complexity of several constraint satisfaction algorithms.(1) Mohr and Henderson have given new algorithms, AC-4 and PC-3, for arc and path consistency, respectively, and have shown that the arc consistency algorithm is optimal in time complexity and of the same order space complexity as the earlier algorithms.(2) In this paper, we give parallel algorithms for solving node and arc consistency. We show that any parallel algorithm for enforcing are consistency in the worst case must have O(na) sequential steps, wheren is number of nodes, anda is the number of labels per node. We give several parallel algorithms to do arc consistency. It is also shown that they all have optimal time complexity. The results of running the parallel algorithms on a BBN Butterfly multiprocessor are also presented.This work was partially supported by NSF Grants MCS-8221750, DCR-8506393, and DMC-8502115.  相似文献   

8.
The block-cyclic data distribution is commonly used to organize array elements over the processors of a coarse-grained distributed memory parallel computer. In many scientific applications, the data layout must be reorganized at run-time in order to enhance locality and reduce remote memory access overheads. In this paper we present a general framework for developing array redistribution algorithms. Using this framework, we have developed efficient algorithms that redistribute an array from one block-cyclic layout to another. Block-cyclic redistribution consists of index set computation , wherein the destination locations for individual data blocks are calculated, and data communication , wherein these blocks are exchanged between processors. The framework treats both these operations in a uniform and integrated way. We have developed efficient and distributed algorithms for index set computation that do not require any interprocessor communication. To perform data communication in a conflict-free manner, we have developed direct indirect and hybrid algorithms. In the direct algorithm, a data block is transferred directly to its destination processor. In an indirect algorithm, data blocks are moved from source to destination processors through intermediate relay processors. The hybrid algorithm is a combination of the direct and indirect algorithms. Our framework is based on a generalized circulant matrix formalism of the redistribution problem and a general purpose distributed memory model of the parallel machine. Our algorithms sustain excellent performance over a wide range of problem and machine parameters. We have implemented our algorithms using MPI, to allow for easy portability across different HPC platforms. Experimental results on the IBM SP-2 and the Cray T3D show superior performance over previous approaches. When the block size of the cyclic data layout changes by a factor of K , the redistribution can be performed in O( log K) communication steps. This is true even when K is a prime number. In contrast, previous approaches take O(K) communication steps for redistribution. Our framework can be used for developing scalable redistribution libraries, for efficiently implementing parallelizing compiler directives, and for developing parallel algorithms for various applications. Redistribution algorithms are especially useful in signal processing applications, where the data access patterns change significantly between computational phases. They are also necessary in linear algebra programs, to perform matrix transpose operations. Received June 1, 1997; revised March 10, 1998.  相似文献   

9.
The advent of parallel machines brought about a controverse in the domain of parallel algorithms: is it worth to conceive parallel algorithms for NP-complete problems? In this work we present a parallel implementation of a sequential algorithm from Horowitz and Sahni for the knapsack problem on a FPS T20. Our aim is to establish some experimental results in order to allow comparisons for forthcoming works. The results show that the development of theory and technology yields the computation tractability of very large knapsack problems.  相似文献   

10.
The development of intelligent transportation systems (ITS) and the resulting need for the solution of a variety of dynamic traffic network models and management problems require faster‐than‐real‐time computation of shortest path problems in dynamic networks. Recently, a sequential algorithm was developed to compute shortest paths in discrete time dynamic networks from all nodes and all departure times to one destination node. The algorithm is known as algorithm DOT and has an optimal worst‐case running‐time complexity. This implies that no algorithm with a better worst‐case computational complexity can be discovered. Consequently, in order to derive algorithms to solve all‐to‐one shortest path problems in dynamic networks, one would need to explore avenues other than the design of sequential solution algorithms only. The use of commercially‐available high‐performance computing platforms to develop parallel implementations of sequential algorithms is an example of such avenue. This paper reports on the design, implementation, and computational testing of parallel dynamic shortest path algorithms. We develop two shared‐memory and two message‐passing dynamic shortest path algorithm implementations, which are derived from algorithm DOT using the following parallelization strategies: decomposition by destination and decomposition by transportation network topology. The algorithms are coded using two types of parallel computing environments: a message‐passing environment based on the parallel virtual machine (PVM) library and a multi‐threading environment based on the SUN Microsystems Multi‐Threads (MT) library. We also develop a time‐based parallel version of algorithm DOT for the case of minimum time paths in FIFO networks, and a theoretical parallelization of algorithm DOT on an ‘ideal’ theoretical parallel machine. Performances of the implementations are analyzed and evaluated using large transportation networks, and two types of parallel computing platforms: a distributed network of Unix workstations and a SUN shared‐memory machine containing eight processors. Satisfactory speed‐ups in the running time of sequential algorithms are achieved, in particular for shared‐memory machines. Numerical results indicate that shared‐memory computers constitute the most appropriate type of parallel computing platforms for the computation of dynamic shortest paths for real‐time ITS applications.  相似文献   

11.
In this article, we present a multigrid algorithm for parallel computers, the chopped parallel multigrid (CPMG) algorithm. The CPMG algorithm improves the processor utilization by reducing the work load on coarse grids without affecting the convergence rate of the algorithm. This is in contrast to earlier approaches (Gannon and van Rosendale, 1986; Frederickson and McBryan, 1989), where unutilized processors are used to improve the convergence rate. The CPMG algorithm reduces the coarse grid work bychopping the alternate cycles of multigrid. Using analytical results and simulations on sequential machines we show that the CPMG can achieve almost the same convergence rate as standard MG for many cases. Analytically we show that the advantage gained by CPMG over standard MG on a mesh connected massively parallel machine is 33% in hardware utilization, 50% in communication overheads and 38% in overall execution time. We have also evaluated the performance of CPMG on an actual massively parallel machine, the DAP-510. The advantage gained by CPMG over standard MG is 35% in overall execution time. Moreover, the CPMG can be integrated with other parallel multigrid algorithms, such as the PSMG algorithm (Frederickson and McBryan, 1989) and Decker's algorithm (Decker, 1990).  相似文献   

12.
Previous research on developing parallel triangulation algorithms concentrated on triangulating planar point sets.O(log3 n) running time algorithms usingO(n) processors have been developed in Refs. 1 and 2. Atallah and Goodrich(3) presented a data structure that can be viewed as a parallel analogue of the sequential plane-sweeping paradigm, which can be used to triangulate a planar point set inO(logn loglogn) time usingO(n) processors. Recently Merks(4) described an algorithm for triangulating point sets which runs inO(logn) time usingO(n) processors, and is thus optimal. In this paper we develop a parallel algorithm for triangulating simplicial point sets in arbitrary dimensions based on the idea of the sequential algorithm presented in Ref. 5. The algorithm runs inO(log2 n) time usingO(n/logn) processors. The algorithm hasO(n logn) as the product of the running time and the number of processors; i.e., an optimal speed-up.  相似文献   

13.
莫则尧 《计算机学报》2004,27(10):1311-1319
复杂物理现象通常由多类复杂的物理过程紧耦合构成,其数值模拟也通常由适用不同物理过程的多类并行应用程序紧耦合完成.如何设计这些物理过程之间的联接算法,既要保证程序之间数据传递的高效,又要保证程序各自运行和总体模拟的高效,还要保证程序各自开发的独立,是一个值得研究的课题.该文基于广泛应用于高温高压多物理研究中的辐射流体力学和中子输运多物理并行数值模拟,在非结构网格上,提出了两种联接算法:完全松散联接算法和两层紧耦合联接算法,前者侧重于实现程序各自运行的高效和开发的独立,后者在前者的基础上,还权衡了数据传递和总体模拟的高效.在两台并行机的数百个处理机上,通信复杂度分析和数值实验结果表明两个算法均是有效的,可推广适用于辐射或中子输运与其他流体力学的多物理并行数值模拟应用中.特别地,两层紧耦合联接算法是高效可扩展的,取得了近似最优的并行性能.  相似文献   

14.
Minimizing Makespan and Preemption Costs on a System of Uniform Machines   总被引:1,自引:0,他引:1  
It is well known that for preemptive scheduling on uniform machines there exist polynomial time exact algorithms, whereas for non-preemptive scheduling there are probably no such algorithms. However, it is not clear how many preemptions (in total, or per job) suffice in order to guarantee an optimal polynomial time algorithm. In this paper we investigate exactly this hardness gap, formalized as two variants of the classic preemptive scheduling problem. In generalized multiprocessor scheduling (GMS) we have a job-wise or total bound on the number of preemptions throughout a feasible schedule. We need to find a schedule that satisfies the preemption constraints, such that the maximum job completion time is minimized. In minimum preemptions scheduling (MPS) the only feasible schedules are preemptive schedules with the smallest possible makespan. The goal is to find a feasible schedule that minimizes the overall number of preemptions. Both problems are NP-hard, even for two machines and zero preemptions. For GMS, we develop polynomial time approximation schemes, distinguishing between the cases where the number of machines is fixed, or given as part of the input. Our scheme for a fixed number of machines has linear running time, and can be applied also for instances where jobs have release dates, and for instances with arbitrary preemption costs. For MPS, we derive matching lower and upper bounds on the number of preemptions required by any optimal schedule. Our results for MPS hold for any instance in which a job, Jj, can be processed simultaneously by ρj machines, for some ρj ≥ 1.  相似文献   

15.
Finding the best data layout has been an ultimate goal of memory optimization. Even with data access profile, heuristic algorithms are needed to reorganize data layout for better locality. The best layout could be found by running the given application with all possible data layouts and selecting the best performing layout. This approach, however, can incur too much overhead, particulary when the number of possible layouts are too many. In this paper, we present a composition-based cache simulation for structure reorganization. Instead of running all possible layouts, we simulate only the primary subsets of layouts and compose the cache misses for all layouts by summing up the cache misses of component subsets. Our experiment with the composition-based cache simulation shows that the differences in the cache misses are within 10% of the full cache simulation for 4-way and 8-way set associative caches. In addition to the cache miss estimation, our heuristic algorithm takes account of the extra instruction overhead incurred by structure reorganization. Our experiment with several structure intensive benchmarks shows the 37% reduction in the L1D read misses and the 28% reduction in the L2 read misses. As a result, the execution times are also reduced by 19% on average.  相似文献   

16.
We consider the NP-hard problem of scheduling parallel jobs with release dates on identical parallel machines to minimize the makespan. A parallel job requires simultaneously a prespecified, job-dependent number of machines when being processed. We prove that the makespan of any nonpreemptive list-schedule is within a factor of 2 of the optimal preemptive makespan. This gives the best-known approximation algorithms for both the preemptive and the nonpreemptive variant of the problem. We also show that no list-scheduling algorithm can achieve a better performance guarantee than 2 for the nonpreemptive problem, no matter which priority list is chosen. List-scheduling also works in the online setting where jobs arrive over time and the length of a job becomes known only when it completes; it therefore yields a deterministic online algorithm with competitive ratio 2 as well. In addition, we consider a different online model in which jobs arrive one by one and need to be scheduled before the next job becomes known. We show that no list-scheduling algorithm has a constant competitive ratio. Still, we present the first online algorithm for scheduling parallel jobs with a constant competitive ratio in this context. We also prove a new information-theoretic lower bound of 2.25 for the competitive ratio of any deterministic online algorithm for this model. Moreover, we show that 6/5 is a lower bound for the competitive ratio of any deterministic online algorithm of the preemptive version of the model jobs arriving over time.  相似文献   

17.
随着深度学习的快速发展,卷积神经网络已广泛应用于计算机视觉、自然语言处理等人工智能领域中.Winograd快速卷积算法因能有效降低卷积神经网络中卷积操作的计算复杂度而受到广泛关注.随着国防科技大学自主研制的飞腾多核处理器在智能领域的推广应用,对面向飞腾多核处理器的高性能卷积实现提出了强烈需求.针对飞腾多核处理器的体系结构特征与Wingorad快速卷积算法的计算特点,提出了一种高性能并行Winograd快速卷积算法.该算法不依赖通用矩阵乘库函数,由卷积核转换、输入特征图转换、逐元素乘、输出特征图逆变换等4个部分构成,融合设计了4个部分的数据操作,并设计了与之配套的数据布局、多级并行数据转换算法与多级并行矩阵乘算法,实现访存性能以及算法整体性能的提升.在两款飞腾多核处理器上的测试结果显示,与开源库ACL和NNPACK中的Winograd快速卷积实现相比,该算法分别能获得1.05~16.11倍与1.66~16.90倍的性能加速;集成到开源框架Mxnet后,该算法使得VGG16网络的前向计算获得了3.01~6.79倍的性能加速.  相似文献   

18.
We have designed a family of parallel data flow analysis algorithms for execution on distributed-memory MIMD machines, based on general-purpose, hybrid algorithms for data flow analysis [Marlowe and Ryder 1990]. We exploit a natural partitioning of the hybrid algorithms and explore a static mapping, dynamic scheduling strategy. Alternative mapping-scheduling choices and refinements of the flow graph condensation used are discussed. Our parallel hybrid algorithm family is illustrated on Reaching Definitions, although parallel algorithms also exist for many interprocedural (e.g., Aliasing) and intraprocedural (e.g., Available Expressions) problems [Marlowe 1989]. We have implemented the parallel hybrid algorithm for Reaching Definitions on an Intel iPSC/2. Our empirical results suggest the practicality of parallel hybrid algorithms.An earlier version of this paper was presented at Supercomputing '90.The research reported here was supported, in part, by the New Jersey Commission on Science and Technology and the CAIP Center's Industrial Members, by Siemens Research Corporation and by National Science Foundation grant CCR-8920078.  相似文献   

19.
The paper describes the implementation of the Successive Overrelaxation (SOR) method on an asynchronous multiprocessor computer for solving large, linear systems. The parallel algorithm is derived by dividing the serial SOR method into noninterfering tasks which are then combined with an optimal schedule of a feasible number of processors. The important features of the algorithm are: (i) achieves a speedup Sp O(N/3) and an efficiency Ep 2/3 using P = [N/2] processors, where N is the number of the equations, (ii) contains a high level of inherent parallelism, whereas on the other hand, the convergence theory of the parallel SOR method is the same as its sequential counterpart and (iii) may be modified to use block methods in order to minimise the overhead due to communication and synchronisation of the processors.  相似文献   

20.
为了达到有效缩短算法运行时间、加快定位速度的目的,研究并分析了匹配场声源定位算法中存在的固有并行性,构建了匹配场声源定位的并行计算算法框架。详细介绍了匹配场声源定位算法的网格法并行化方法;并以互连的双核PC机作为并行计算硬件平台,选用Windows操作系统、FORTRAN编译器,使用TCP/IP作为标准的通信协议,在MPI并行环境中对并行算法进行了测试和相应分析。研究表明,提出的并行算法运行效果较好,可以通过开发匹配场处理方法的并行性,有效解决信号匹配处理中遇到的计算量大、耗时长等问题,为实现快速声源定位提供解决途径。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号