Similar literature
1.
Continuations are used to define the flow of messages between low level tasks in a parallel logic programming language. A combination of compiler and runtime operations reduces message traffic by up to 50% when success continuations are passed as parameters in messages that start new processes. Continuations are also the key to fast task switching, a critical operation in this fine grain parallel system. Data from sample programs shows the effectiveness of continuations in reducing message traffic and the speed with which task switches are performed on a typical host architecture. Supported by NSF Grant CCR-8707177 and grants from Motorola, Inc, and Hewlett-Packard Corp.

2.
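Numerical combustion simulation usually uses unstructured meshes to model the computational domain. When such simulations run in parallel on unstructured meshes, mesh adaptation causes the computational load on different processes to change frequently and to differ greatly, which lowers parallel efficiency. An effective way to improve parallel efficiency is dynamic load balancing. This paper proposes a dynamic load balancing method driven by the chemical reaction state of the combustion. The method uses different strategies to predict the computational load on each process at different stages of the chemical reaction and, based on the predictions, evens out the computational tasks among processes to achieve load balance. Experimental analysis shows that the method effectively reduces the load imbalance among processes and lowers the overall simulation run time by 10%.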

3.
Scheduling of message passing for synchronous communication is found to be equivalent to colouring the edges of a graph without conflict. The graph edge-colouring problem, which has other applications, is studied. An algorithm which colours the graph with no more than deg + 1 colours, where deg is the degree of the graph, is implemented. The problem of minimising the sum of the largest weight for each colour is also investigated and an algorithm suggested. These algorithms are used to organise the communication as part of a finite element Euler solver. Different communication schemes and their effect on the performance of the flow solver are compared.
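The equivalence between communication scheduling and edge colouring can be sketched as follows. This is a minimal illustration only, not the algorithm from the abstract: it uses a simple greedy colouring, which may need up to 2·deg − 1 rounds rather than the deg + 1 bound quoted above. The function name `schedule_rounds` and the ring example are assumptions made for illustration.

```python
from collections import defaultdict


def schedule_rounds(edges):
    """Assign each message (an undirected edge u-v) to a communication round
    so that no process takes part in two messages in the same round.

    Greedy edge colouring: an illustrative stand-in, NOT the deg + 1
    algorithm described in the abstract.
    """
    used = defaultdict(set)  # process -> rounds in which it is already busy
    rounds = {}
    for u, v in edges:
        r = 0
        # Pick the smallest round that is free at both endpoints.
        while r in used[u] or r in used[v]:
            r += 1
        rounds[(u, v)] = r
        used[u].add(r)
        used[v].add(r)
    return rounds


# Hypothetical example: four processes exchanging messages around a ring.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
rounds = schedule_rounds(ring)
num_rounds = max(rounds.values()) + 1
```

Each colour corresponds to one round in which all messages of that colour can be exchanged simultaneously without conflict.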

4.
When parallel applications are run in large‐scale distributed environments, such as grids, peer‐to‐peer (P2P) systems, and clouds, the set of resources used can change dynamically as machines crash, reservations end, and new resources become available. It is vital for applications to respond to these changes. Therefore, it is necessary to keep track of the available resources, a problem which is known to be notoriously difficult. In this article we argue that resource tracking must be provided as standard functionality in the lower parts of the software stack. We propose a general solution to resource tracking: the Join–Elect–Leave (JEL) model. JEL provides unified resource tracking for parallel and distributed applications across environments. JEL is a simple yet powerful model based on notifying when resources have Joined or Left the computation. We demonstrate that JEL is suitable for resource tracking in a wide variety of programming models, ranging from the fixed resource sets traditionally used in MPI‐1 to flexible grid‐oriented programming models. We compare several JEL implementations, and show these to perform and scale well in several real‐world scenarios involving grids, clouds and P2P systems applied concurrently, and wide‐area systems with failing resources. Using JEL, we have won the first prize in a number of international distributed computing competitions. Copyright © 2010 John Wiley & Sons, Ltd.

5.
Although various strategies have been developed for scheduling parallel applications with independent tasks, very little work exists for scheduling tightly coupled parallel applications on cluster environments. In this paper, we compare four different strategies based on performance models of tightly coupled parallel applications for scheduling the applications on clusters. In addition to algorithms based on existing popular optimization techniques, we also propose a new algorithm called Box Elimination that searches the space of performance model parameters to determine the best schedule of machines. By means of real and simulation experiments, we evaluated the algorithms on single cluster and multi‐cluster setups. We show that our Box Elimination algorithm generates up to 80% more efficient schedules than other algorithms. We also show that the execution times of the schedules produced by our algorithm are more robust against the performance modeling errors. Copyright © 2009 John Wiley & Sons, Ltd.

6.
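程海英 (Cheng Haiying), 张武 (Zhang Wu). 《计算机工程与设计》 (Computer Engineering and Design), 2004, 25(11): 1961-1963, 2011
Based on the adaptive spline wavelet alternating direction (SW-ADI) method for solving reaction-diffusion equations, the serial program was directly parallelized using two parallel programming models, MPI and OpenMP, and the equations were solved with both MPI and OpenMP on Ziqiang 2000 (自强2000), the high-performance computer at Shanghai University. The results are analyzed, and the parallel speedup relative to the serial program is reported.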

7.
Server applications augmented with behavioral adaptation logic can react to environmental changes, creating self‐managing server applications with improved quality of service at runtime. However, developing adaptive server applications is challenging due to the complexity of the underlying server technologies and highly dynamic application environments. This paper presents an architecture framework, the Adaptive Server Framework (ASF), to facilitate the development of adaptive behavior for legacy server applications. ASF provides a clear separation between the implementation of adaptive behavior and the business logic of the server application. This means a server application can be extended with programmable adaptive features through the definition and implementation of control components defined in ASF. Furthermore, ASF is a lightweight architecture in that it incurs low CPU overhead and memory usage. We demonstrate the effectiveness of ASF through a case study, in which a server application dynamically determines the resolution and quality to scale an image based on the load of the server and network connection speed. The experimental evaluation demonstrates the performance gains possible by adaptive behavior and the low overhead introduced by ASF. Copyright © 2007 John Wiley & Sons, Ltd.

8.
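In computational fluid dynamics, the complexity of solving flow fields makes the design of efficient parallel algorithms a central topic in parallel flow computation. Against the background of applying the lattice Boltzmann method, this paper combines parallel computing with the computational problems that arise when the lattice Boltzmann method is used to simulate fluid flow, and discusses the computational procedure and characteristics of the LBGK D2Q9 model of the lattice Boltzmann method. A distributed parallel algorithm for the LBGK model is studied and implemented, and its parallel performance is analyzed and tested on Ziqiang 3000 (自强3000). The results show that the LBGK D2Q9 model of the lattice Boltzmann method is well suited to large-scale parallel computation, can improve computational accuracy and speed, and can handle complex flow-field problems.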

9.
Parallel implementations of dynamic structured adaptive mesh refinement (SAMR) methods lead to significant runtime management challenges that can limit their scalability on large systems. This paper presents a runtime engine that addresses the scalability of SAMR applications with localized refinements and high SAMR efficiencies on large numbers of processors (up to 1024 processors). The SAMR runtime engine augments hierarchical partitioning with bin-packing based load-balancing to manage the space-time heterogeneity of the SAMR grid hierarchy, and includes a communication substrate that optimizes the use of MPI non-blocking communication primitives. An experimental evaluation on the IBM SP2 supercomputer using the 3-D Richtmyer-Meshkov compressible turbulence kernel demonstrates the effectiveness of the runtime engine in improving SAMR scalability.
Manish Parashar
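The idea of bin-packing based load balancing mentioned in the abstract above can be sketched roughly as below. The abstract does not say which packing heuristic the runtime engine uses, so this sketch swaps in the well-known greedy "largest item first, least-loaded bin" heuristic. The function name `balance`, the patch names, and the workload numbers are all hypothetical.

```python
import heapq


def balance(workloads, num_procs):
    """Assign grid patches to processors with a greedy bin-packing heuristic.

    workloads: dict mapping patch name -> estimated cost (hypothetical units).
    Returns (assignment, loads), both keyed by processor index.
    Illustrative heuristic only; not the engine described in the abstract.
    """
    heap = [(0, p) for p in range(num_procs)]  # (current load, processor)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}
    # Place the most expensive patches first, each on the least-loaded processor.
    for patch, cost in sorted(workloads.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(patch)
        heapq.heappush(heap, (load + cost, p))
    loads = {p: sum(workloads[name] for name in patches)
             for p, patches in assignment.items()}
    return assignment, loads


# Hypothetical patch costs for a small grid hierarchy.
patches = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
assignment, loads = balance(patches, num_procs=2)
```

The heuristic is fast but not optimal: for these costs it produces loads of 17 and 13, whereas a perfect split of 15 and 15 exists.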

10.
The results of a study of a family of parallel symbolic architectures executing several parallel applications are presented. The class of architectures being simulated is characterized by a shared memory structure, by a hierarchical interconnect, and by clustered processors. Speedup measurements were obtained from six different application kernels. Measurements were also performed to assess the degradation of speedup as a function of the interconnection delays, and to study the effect of different scheduling algorithms. The results presented support the claim that the proposed architecture would be a powerful parallel symbolic computation system. The paper discusses processor starvation, fine grain parallelism, uneven loads, foreign reference, schedule and indeterminate computation with respect to the applications chosen. This work was completed within the Advanced Computer Architecture Program, Micro-electronics and Technology Computer Corporation, Austin, Texas.

11.
Block‐structured adaptive mesh refinement (BSAMR) is widely used within simulation software because it improves the utilization of computing resources by refining the mesh only where necessary. For BSAMR to scale onto existing petascale and eventually exascale computers, all portions of the simulation need to weak scale ideally. Any portions of the simulation that do not will become a bottleneck at larger numbers of cores. The challenge is to design algorithms that will make it possible to avoid these bottlenecks on exascale computers. One step of existing BSAMR algorithms involves determining where to create new patches of refinement. The Berger–Rigoutsos algorithm is commonly used to perform this task. This paper provides a detailed analysis of the performance of two existing parallel implementations of the Berger–Rigoutsos algorithm and develops a new parallel implementation of the Berger–Rigoutsos algorithm and a tiled algorithm that exhibits ideal scalability. The analysis and computational results up to 98 304 cores are used to design performance models which are then used to predict how these algorithms will perform on 100 M cores. Copyright © 2011 John Wiley & Sons, Ltd.

12.
Performance and scalability optimization of large HPC applications is currently a labor-intensive, manual process with very low productivity. Major difficulties come from the disaggregated environment for HPC application development: the compiler is only involved in local decisions (core or multithreaded domain), while a library-based, communication-oriented programming model realizes whole-machine parallelism. Realizing any major global change in such a disaggregated environment is very difficult and involves changing large portions of the source code. We present semi-automated techniques, based on structural analysis and rewriting, for performing global transformations on an HPC application source code. We present two case studies using the Self-Consistent Field (SCF) standalone benchmark as well as the Coupled Cluster (CCSD) module (2.9 million lines of Fortran code), a key module of the NWChem computational chemistry application. We demonstrate how structural rewriting techniques can be used to automate transformations that affect multiple sections of the application’s source code. We show that the transformations can be applied in a systematic fashion across the source code bases with minimal manual effort. These transformations improve the scalability of the SCF benchmark by more than two orders of magnitude and the performance of the full CCSD module by a factor of four.

13.
We have developed a one-dimensional parallel global adaptive quadrature algorithm for a machine with hypercube architecture, and studied the heuristics present in the algorithm. A mathematical model for how long it takes to process a balanced tree of sub-intervals under certain (almost implementable) conditions is developed. The results from numerical experiments are given. The speedups achieved depend on the problem and range from 0.83 to 30.5 on a 32-processor hypercube.
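To make "global adaptive quadrature" concrete, here is a sequential sketch of the general technique: the sub-interval with the largest error estimate is always the next one to be bisected. It is not the hypercube algorithm from the abstract. The choice of Simpson's rule, the error estimate (difference between one and two Simpson panels), and all names and default parameters are assumptions made for illustration.

```python
import heapq
import math


def simpson(f, a, b):
    """Simpson's rule on a single interval [a, b]."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))


def estimate(f, lo, hi):
    """Return (integral estimate, error estimate) for [lo, hi]."""
    mid = (lo + hi) / 2.0
    coarse = simpson(f, lo, hi)
    fine = simpson(f, lo, mid) + simpson(f, mid, hi)
    return fine, abs(fine - coarse)


def global_adaptive_quadrature(f, a, b, tol=1e-8, max_intervals=10000):
    """Sequential global adaptive quadrature (illustrative sketch only)."""
    value, err = estimate(f, a, b)
    heap = [(-err, a, b, value)]  # max-heap on the error via negation
    total_err = err
    while total_err > tol and len(heap) < max_intervals:
        # Bisect the interval that currently contributes the largest error.
        neg_err, lo, hi, _ = heapq.heappop(heap)
        total_err += neg_err
        mid = (lo + hi) / 2.0
        for start, end in ((lo, mid), (mid, hi)):
            v, e = estimate(f, start, end)
            heapq.heappush(heap, (-e, start, end, v))
            total_err += e
    # Sum the estimates of all remaining sub-intervals.
    return math.fsum(item[3] for item in heap)


result = global_adaptive_quadrature(math.sin, 0.0, math.pi)
```

In a parallel setting the sub-intervals in the queue would be distributed over processors, which is where load-balancing heuristics such as those studied in the abstract become relevant.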

14.
Optimal design of truss structures using parallel computing   Total citations: 1, self-citations: 0, citations by others: 1
Parallel design optimization of large structural systems calls for a multilevel approach to the optimization problem. The general optimization problem is decomposed into a number of non-interacting suboptimization problems on the first level. They are controlled from the second level through coordination variables. Thus, the solutions of the independent first-level subsystems are directed towards the overall system optimum. In the present paper, optimal design of truss structures using a parallel computing technique is described. In this method, optimization of a large truss structure has been carried out by decomposing the structure into sub-domains and suboptimization tasks. Each sub-domain has independent design variables and a small number of behaviour constraints. The two-level sub-domain optimum design approach is illustrated by several numerical examples with speedups and efficiencies of algorithms on message passing systems. It has been noticed that the efficiency of the algorithm for design optimization increases with the size of the structure.

15.
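Distributed parallel generation of finite element meshes over a local area network   Total citations: 2, self-citations: 0, citations by others: 2
In the common PC + Windows + LAN environment, distributed parallel finite element mesh generation over a local area network was implemented using the Winsock API network communication interface. On the server, the meshing domain is decomposed into subdomains according to the number of workstations, and these subdomains, together with the mesh control parameters, are sent to the workstations over the local area network (LAN). Each workstation meshes its subdomain into a submesh and returns it over the LAN to the server, where the submeshes are merged into the final mesh. Numerical examples show that, given enough compute nodes, the distributed parallel technique can greatly speed up mesh generation, while the fraction of time spent on network communication stays essentially constant.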

16.
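This paper proposes a parallel algorithm for constructing Voronoi diagrams based on adaptive grouping of the point set. The basic idea is to group the planar point set adaptively by binary tree splitting and to build a Voronoi diagram independently for the points in each group, called a sub-Voronoi diagram. The boundary points on the four sides of every group are then extracted and a Voronoi diagram is built for this boundary point set, called the boundary point Voronoi diagram. Finally, for each boundary point, the two polygons that correspond to it in the sub-Voronoi diagram and in the boundary point Voronoi diagram are extracted and merged, which ultimately merges the sub-networks. Because most of the running time is spent generating the Voronoi diagrams of the grouped point sets, and the computation for each group is unaffected by the other groups, parallel computing is used to accelerate this step. Theoretical analysis and tests show that the algorithm is an efficient parallel algorithm for Voronoi diagram generation.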

17.
This paper presents a parallel algorithm for constructing Voronoi diagrams based on point‐set adaptive grouping. The binary tree splitting method is used to adaptively group the point set in the plane and construct sub‐Voronoi diagrams for each group. Given that the construction of Voronoi diagrams in each group consumes the majority of time and that construction within one group does not affect that in other groups, the use of a parallel algorithm is suitable. After constructing the sub‐Voronoi diagrams, we extract the boundary points of the four sides of each sub‐group and use them to construct boundary site Voronoi diagrams. Finally, the sub‐Voronoi diagrams containing each boundary point are merged with the corresponding boundary site Voronoi diagrams. This produces the desired Voronoi diagram. Experiments demonstrate the efficiency of this parallel algorithm, and its time complexity is calculated as a function of the size of the point set, the number of processors, the average number of points in each block, and the number of boundary points. Copyright © 2013 John Wiley & Sons, Ltd.
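The grouping step described above, binary tree splitting of a planar point set, might look roughly like the sketch below. The abstract does not state the splitting rule, so the choices here (split at the median along the axis with the larger extent, stop when a group holds at most `max_group_size` points) are assumptions. The Voronoi construction and merge steps are not shown.

```python
def split_points(points, max_group_size):
    """Recursively split a planar point set into groups of bounded size.

    Splitting rule (an assumption, not taken from the abstract): cut at the
    median along the axis with the larger coordinate extent.
    """
    if max_group_size < 1:
        raise ValueError("max_group_size must be at least 1")
    if len(points) <= max_group_size:
        return [list(points)]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    # Each half becomes a child node of the binary splitting tree.
    return (split_points(ordered[:mid], max_group_size)
            + split_points(ordered[mid:], max_group_size))


# Hypothetical input: a 4 x 4 grid of points, grouped four at a time.
grid = [(x, y) for x in range(4) for y in range(4)]
groups = split_points(grid, max_group_size=4)
```

Because the groups are independent, each one could be handed to a separate worker to build its sub‐Voronoi diagram before the boundary merge.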

18.
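In the 1960s, learning control opened a new path for exploring the control of complex systems, and intelligent control based on artificial intelligence techniques emerged with it. Taking intelligent control as the main thread, this paper traces its development from learning control to parallel control. It first introduces the basic ideas of learning control and describes the architecture and operating mechanism of intelligent machines. As information technology advanced, data-based computational intelligence methods appeared. The paper therefore briefly reviews learning control methods based on computational intelligence and, taking adaptive dynamic programming as the entry point, analyzes how self-learning optimization problems for nonlinear dynamic systems are solved. Finally, for control problems of complex systems in which engineering complexity and social complexity are coupled, it presents a solution approach based on learning and optimization with parallel control and analyzes its advantages in solving optimal control problems for complex systems. Intelligent control has evolved from learning control through computational intelligence control to parallel control, which shows that parallel control is an effective method for achieving knowledge automation in complex systems.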

19.
Debuggers play an important role in developing parallel applications. They are used to control the state of many processes, to present distributed information in a concise and clear way, to observe the execution behavior, and to detect and locate programming errors. More sophisticated debugging systems also try to improve understanding of global execution behavior and intricate details of a program. In this paper we describe the design and implementation of SPiDER, which is an interactive source‐level debugging system for both regular and irregular High‐Performance Fortran (HPF) programs. SPiDER combines a base debugging system for message‐passing programs with a high‐level debugger that interfaces with an HPF compiler. SPiDER, in addition to conventional debugging functionality, allows a single process of a parallel program to be inspected or the entire program to be examined from a global point of view. A sophisticated visualization system has been developed and included in SPiDER to visualize data distributions, data‐to‐processor mapping relationships, and array values. SPiDER enables a programmer to dynamically change data distributions as well as array values. For arrays whose distribution can change during program execution, an animated replay displays the distribution sequence together with the associated source code location. Array values can be stored at individual execution points and compared against each other to examine execution behavior (e.g. convergence behavior of a numerical algorithm). Finally, SPiDER also offers limited support to evaluate the performance of parallel programs through a graphical load diagram. SPiDER has been fully implemented and is currently being used for the development of various real‐world applications. Several experiments are presented that demonstrate the usefulness of SPiDER. Copyright © 2002 John Wiley & Sons, Ltd.

20.
Nearest-neighbor mesh connections plus a global broadcasting/control bus characterize the architecture of the processor array PAX, which was constructed for, and is now operating in, many typical scientific applications. Not only these inter-processor connections but also the MIMD structure of the machine were found effective in particle transport problems, which require asynchronous operation.

The paper describes the architectural basis of two recent versions of the PAX computer and their hardware and software systems, and presents the effectiveness of the PAX-type architecture based on the implementation of scientific applications.

