共查询到20条相似文献,搜索用时 15 毫秒
1.
由于线性规划在理论和实践中的重要性,对求解大规模规划问题并行算法的研究已引起许多学者的兴趣.本文根据Galperin提出的线性规划的一种线性时间的立方算法特别适合并行的特点,提出了一种基于SPMD模型和主从式MPI的线性规划并行算法,并对算法性能进行了深入分析,理论分析和在曙光3000上的实验结果表明:该算法具有粗粒度并行、良好的可扩展性和理想加速比模型等优点,明显优于目前为止求解同类不对称线性规划问题的其他并行算法,可用于求解此类大规模线性规划问题的高性能计算. 相似文献
2.
可预测扩展并行性能的并行程序设计模型 总被引:1,自引:0,他引:1
BSP(Bulk-Synchronous)模型是独立于并行体系结构的,即可作为并行计算模型又可看作并地程序设计模型,该模型使程序员在算法设计阶段和编程调试阶段可精确地分析和预测并行程序性能。BSP程序可移植性强,可在多种并行系统发PVM,MPI等上实现。 相似文献
3.
4.
J. Dingel 《Formal Aspects of Computing》2002,14(2):123-197
Parallel computers have not yet had the expected impact on mainstream computing. Parallelism adds a level of complexity to
the programming task that makes it very error-prone. Moreover, a large variety of very different parallel architectures exists.
Porting an implementation from one machine to another may require substantial changes. This paper addresses some of these
problems by developing a formal basis for the design of parallel programs in the form of a refinement calculus. The calculus
allows the stepwise formal derivation of an abstract, low-level implementation from a trusted, high-level specification. The
calculus thus helps structuring and documenting the development process. Portability is increased, because the introduction
of a machine-dependent feature can be located in the refinement tree. Development efforts above this point in the tree are
independent of that feature and are thus reusable. Moreover, the discovery of new, possibly more efficient solutions is facilitated.
Last but not least, programs are correct by construction, which obviates the need for difficult debugging. Our programming/specification
notation supports fair parallelism, shared-variable and message-passing concurrency, local variables and channels. The calculus
rests on a compositional trace semantics that treats shared-variable and message-passing concurrency uniformly. The refinement
relation combines a context-sensitive notion of trace inclusion and assumption-commitment reasoning to achieve compositionality.
The calculus straddles both concurrency paradigms, that is, a shared-variable program can be refined into a distributed, message-passing
program and vice versa.
Received July 2001 / Accepted in revised form May 2002 相似文献
5.
6.
Keqin Li 《Journal of Parallel and Distributed Computing》2001,61(12):1709
Consider any known sequential algorithm for matrix multiplication over an arbitrary ring with time complexity O(Nα), where 2<α3. We show that such an algorithm can be parallelized on a distributed memory parallel computer (DMPC) in O(log N) time by using Nα/log N processors. Such a parallel computation is cost optimal and matches the performance of PRAM. Furthermore, our parallelization on a DMPC can be made fully scalable, that is, for all 1pNα/log N, multiplying two N×N matrices can be performed by a DMPC with p processors in O(Nα/p) time, i.e., linear speedup and cost optimality can be achieved in the range [1..Nα/log N]. This unifies all known algorithms for matrix multiplication on DMPC, standard or non- standard, sequential or parallel. Extensions of our methods and results to other parallel systems are also presented. For instance, for all 1p Nα /log N, multiplying two N×N matrices can be performed by p processors connected by a hypercubic network in O(Nα/p+(N2/p2/α)(log p)2(α−1)/α) time, which implies that if p=O(Nα/(log N)2(α−1)/(α−2)), linear speedup can be achieved. Such a parallelization is highly scalable. The above claims result in significant progress in scalable parallel matrix multiplication (as well as solving many other important problems) on distributed memory systems, both theoretically and practically. 相似文献
7.
《Journal of Parallel and Distributed Computing》2000,60(7):835-852
Parallel logic programming (PLP) systems are sophisticated examples of symbolic computing systems. PLP systems address problems such as allocating dynamic memory, scheduling irregular computations, and managing different types of implicit parallelism. Most PLP systems have been developed for bus-based architectures. However, the complexity of PLP systems and the large amount of data they process raise the question of whether logic programming systems can still achieve good performance on modern scalable architectures, such as distributed shared-memory (DSM) systems. In this work we use execution-driven simulation of a cache-coherent DSM architecture to investigate the performance of Andorra-I, a state-of-the-art PLP system, on a modern multiprocessor. The results of this simulation show that Andorra-I exhibits reasonable running time performance, but it does not scale well. Our detailed analysis of cache misses and their sources expose several opportunities for improvements in Andorra-I. Based on this analysis, we modify Andorra-I using a set of simple techniques that led to significantly better running time and scalability. These results suggest that Andorra-I can and should perform well on modern multiprocessors. Furthermore, as Andorra-I shares its main data structures with several PLP systems, we conclude that the methodology and techniques used in our work can greatly benefit these other PLP systems. 相似文献
8.
9.
CC$是一种并行编程语言,目的是解决分布式众核并行计算机的编程困难。CC$的编程模型以Multi BSP
模型为基础,将分布式众核并行计算机的硬件架构抽象为3层。数据按照存储的层次和共享范围分为5类,以便在不
同层次上提供共享。LL$还提出一类虚拟指令来解决不同层次之间的数据交换,实现数据访问的逻辑化描述。并行
程序按照3层Multi BSP超步嵌套执行。CC$具有统一的编程风格、内建的多层会共地址空间、数据访问请求的表达
式描述和数据传输编译优化4大特点。测试表明,CC$程序的运行效率高,易学易用,大幅地缩短了开发周期。 相似文献
10.
11.
机器学习问题通常会转换成一个目标函数去求解,优化算法是求解目标函数中参数的重要工具.在大数据环境下,需要设计并行与分布式的优化算法,通过多核计算和分布式计算技术来加速训练过程.近年来,该领域涌现了大量研究工作,部分算法也在各机器学习平台得到广泛应用.本文针对梯度下降算法、二阶优化算法、邻近梯度算法、坐标下降算法、交替方向乘子算法五类最常见的优化方法展开研究,每一类算法分别从单机并行和分布式并行来分析相关研究成果,并从模型特性、输入数据特性、算法评价、并行计算模型等角度对每个算法进行详细对比.随后对有代表性的可扩展机器学习平台中优化算法的实现和应用情况进行对比分析.同时对本文中介绍的所有优化算法进行多层次分类,方便用户根据目标函数类型选择合适的优化算法,也可以通过该多层次分类图交叉探索如何将优化算法应用到新的目标函数类型.最后分析了现有优化算法存在的问题,提出可能的解决思路,并对未来研究方向进行展望. 相似文献
12.
Lei Pan Ming Kin Lai Koji Noguchi Javid J. Huseynov Lubomir F. Bic Michael B. Dillencourt 《International journal of parallel programming》2004,32(1):1-37
Message Passing (MP) and Distributed Shared Memory (DSM) are the two most common approaches to distributed parallel computing. MP is difficult to use, whereas DSM is not scalable. Performance scalability and ease of programming can be achieved at the same time by using navigational programming (NavP). This approach combines the advantages of MP and DSM, and it balances convenience and flexibility. Similar to MP, NavP suggests to its programmers the principle of pivot-computes and hence is efficient and scalable. Like DSM, NavP supports incremental parallelization and shared variable programming and is therefore easy to use. The implementation and performance analysis of real-world algorithms, namely parallel Jacobi iteration and parallel Cholesky factorization, presented in this paper supports the claim that the NavP approach is better suited for general-purpose parallel distributed programming than either MP or DSM. 相似文献
13.
《Journal of Parallel and Distributed Computing》1994,22(3):479-487
In scalable concurrent architectures, the performance of a parallel algorithm depends on the resource management policies used. Such policies determine, for example, how data is partitioned and distributed and how processes are scheduled. In particular, the performance of a parallel algorithm obtained by using a particular policy can be affected by increasing the size of the architecture or the input. In order to support scalability, we are developing a methodology for modular specification of partition and distribution strategies (PDSs). As a consequence, a PDS may be changed without modifying the code specifying the logic of a parallel algorithm. We illustrate our methodology for parallel algorithms that use dynamic data structures. 相似文献
14.
15.
分布式系统并行编程的简单方法 总被引:1,自引:0,他引:1
本文以并行矩阵乘法为例子,给出在PVM, EXPRESS以及NX环境上的并行程序实现.通过对这个简单例子的分析,归纳出在一般情况下并行程序设计应采取的策略.这里介绍的并行程序设计方法,特别对大型应用问题的并行移植是非常有用的,其目的是为想写并行程序的科研人员提供参考. 相似文献
16.
袁道华 《计算机研究与发展》1994,31(9):56-60
本文概述了分布式并行系统和分布式共享存储器的一般概念,讨论了使用共享对象和可靠广播的并行程序设计模型,最后给出了我们的改进模型。 相似文献
17.
《Journal of Parallel and Distributed Computing》1994,22(3):506-522
Much prior work in AI on various attempts to speed up rule-based systems by parallel processing has been reported. Unfortunately, many of these results indicate that there is limited parallelism to be found when rules are applied to relatively small amounts of data. Thus, one can predict that much greater parallelism can be extracted when rules are applied to large amounts of data. However, traditional compile-time parallelization strategies as developed for main-memory based systems do not scale to large databases. We propose a scalable strategy for the efficient parallel implementation of rule-based systems operating upon large databases. We concentrate on load balancing techniques in a synchronous model of rule execution, where the variance in runtime of the distributed sites is minimized per cycle of rule processing, thus increasing utilization and speedup. We demonstrate that static load balancing techniques are insufficient, and thus low overhead dynamic load balancing is the key to successful scaling. We present a form of dynamic load balancing that is based upon predicting future system loads, rather than conventional demand-driven approaches that monitor current system state. We analyze a number of possible predictive dynamic load balancing protocols by isoefficiency analysis to guide the design of a parallel database rule processing system. 相似文献
18.
MapReduce:新型的分布式并行计算编程模型 总被引:3,自引:0,他引:3
MapReduce是Google提出的分布式并行计算编程模型,用于大规模数据的并行处理。Ma-pReduce模型受函数式编程语言的启发,将大规模数据处理作业拆分成若干个可独立运行的Map任务,分配到不同的机器上去执行,生成某种格式的中间文件,再由若干个Reduce任务合并这些中间文件获得最后的输出文件。用户在使用MapReduce模型进行大规模数据处理时,可以将主要精力放在如何编写Map和Reduce函数上,其它并行计算中的复杂问题诸如分布式文件系统、工作调度、容错、机器间通信等都交给MapReduce系统处理,在很大程度上降低了整个编程难度。MapReduce日益成为云计算平台的主流编程模型。Apache Hadoop项目提供开源的MapReduce系统还有待进一步完善。 相似文献
19.
Rachid Habel Frédérique Silber-Chaussumier François Irigoin Elisabeth Brunet François Trahay 《International journal of parallel programming》2016,44(6):1268-1295
This paper describes dSTEP, a directive-based programming model for hybrid shared and distributed memory machines. The originality of our work is the definition and an implementation of a unified high-level programming model addressing both data and computation distributions, providing a particularly fine control of the computation. The goal is to improve the programmer productivity while providing good performances in terms of execution time and memory usage. We define a generic compilation scheme for computation mapping and communication generation. We implement the solution in a source-to-source compiler together with a runtime library. We provide a series of optimizations to improve the performance of the generated code, with a special focus on reducing the communications time. We evaluate our solution on several scientific kernels as well as on the more challenging NAS BT benchmark, and compare our results with the hand written Fortran MPI and UPC implementations. The results show first that our solution allows to make explicit the non trivial parallel execution of the NAS BT benchmark using the dSTEP directives. Second, the results show that our generated MPI+OpenMP BT program runs with a 83.35 speedup over the original NAS OpenMP C benchmark on a hybrid cluster composed of 64 quadricores (256 cores). Overall, our solution dramatically reduces the programming effort while providing good time execution and memory usage performances. This programming model is suitable for a large variety of machines as multi-core and accelerator clusters. 相似文献
20.
Martínez Millán A. Fraguela Basilio B. Cabaleiro José C. 《International journal of parallel programming》2021,49(6):820-845
International Journal of Parallel Programming - The Divide-and-conquer (D&C) pattern appears in a large number of problems and is highly suitable to exploit parallelism. This has led to... 相似文献