Similar Literature
1.
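JPI: A Pure-Java Support Platform for Heterogeneous Parallel Processing   Total citations: 4, self-citations: 0, citations by others: 4
With respect to solutions that use the Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) packages, JPI is a software package that implements, in pure Java, functionality similar to what PVM and MPI provide, including task scheduling, communication, and global reduction operations. Running and performance-testing JPI-based parallel programs shows that JPI not only solves the problem of seamlessly porting parallel programs across heterogeneous environments but also provides effective development and runtime support for parallel programs, including network-intensive ones.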

2.
We present a distributed memory parallel implementation of the unbalanced tree search (UTS) benchmark using MPI and investigate MPI’s ability to efficiently support irregular and nested parallelism through continuous dynamic load balancing. Two load balancing methods are explored: work sharing using a centralized work server and distributed work stealing using explicit polling to service steal requests. Experiments indicate that in addition to a parameter defining the granularity of load balancing, message-passing paradigms require additional techniques to manage the volume of communication and mitigate runtime overhead. Using additional parameters, we observed an improvement of up to 3–4X in parallel performance. We report results for three distributed memory parallel computer systems and use UTS to characterize the performance and scalability on these systems. Overall, we find that the simpler work sharing approach with a single work server achieves good performance on hundreds of processors and that our distributed work stealing implementation scales to thousands of processors and delivers more robust performance that is less sensitive to the particular workload and load balancing parameters.
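The work-stealing idea in this abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the paper's MPI implementation: the function name `run_work_stealing`, the random victim selection, and the task model (each task is just a count of children it spawns) are hypothetical choices made for illustration.

```python
from collections import deque
import random


def run_work_stealing(initial_tasks, num_workers, seed=0):
    """Simulate work stealing; return (tasks executed per worker, steal count).

    initial_tasks: list of ints; each int is how many child tasks that task
    spawns (a stand-in for an unbalanced tree node). Children spawn nothing.
    """
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_workers)]
    queues[0].extend(initial_tasks)  # all work starts on worker 0
    executed = [0] * num_workers
    steals = 0

    while any(queues):
        for w in range(num_workers):
            if not queues[w]:
                # Idle worker: choose a random victim that still has work
                # and steal the oldest task from the head of its deque.
                victims = [v for v in range(num_workers) if queues[v]]
                if not victims:
                    break
                victim = rng.choice(victims)
                queues[w].append(queues[victim].popleft())
                steals += 1
            # The owner works on the newest task at the tail of its deque.
            children = queues[w].pop()
            executed[w] += 1
            queues[w].extend([0] * children)
    return executed, steals


executed, steals = run_work_stealing([3, 0, 5, 1], num_workers=4)
```

Owners take from the tail and thieves take from the head, so the two rarely contend for the same task; in a real message-passing setting the steal would be a request/reply exchange rather than a direct deque access.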

3.
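Research and Implementation of Process Migration for Message-Passing Parallel Programs   Total citations: 1, self-citations: 0, citations by others: 1
High availability is increasingly important in parallel computing environments. The implemented LAM/Migration system extends LAM/MPI with process migration, allowing an entire MPI job to migrate freely between nodes. Migration is transparent to applications and highly automated, and it can be applied to cluster node fault tolerance and load balancing, effectively improving cluster availability.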

4.
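Lu Zhao, Zhang Jinjuan, Shi Jun, Yu Jiaxin. 《微机发展》 (Microcomputer Development), 2010, (5): 132-135, 149
Parallel computing in cluster environments is increasingly widely used, and MPI is the most important programming tool for cluster systems. Load balancing plays an important role in parallel processing because it directly affects the efficiency of the whole algorithm. Based on the specific characteristics of the MPI programming environment, the paper proposes a method based on load-benefit estimation to decide whether to migrate tasks, gives algorithms for real-time load monitoring and scheduling, and runs tests at intervals on each node. Finally, the approach is validated with a parallel sorting method in an MPI environment set up for the purpose. The experimental results show a marked improvement once load balancing is applied, and the improvement becomes more pronounced as the task volume grows.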

5.
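Efficient Algorithms for a Parallel Matrix Eigenproblem Solver on SMP Cluster Systems   Total citations: 2, self-citations: 0, citations by others: 2
Tridiagonalizing a symmetric matrix and computing the eigenvalues of a symmetric tridiagonal matrix are the key steps in a parallel solver for dense symmetric eigenproblems. Targeting the multilevel architecture of SMP cluster systems, the paper presents hybrid MPI+OpenMP parallel algorithms for Householder-based tridiagonalization and for the divide-and-conquer algorithm for the tridiagonal eigenvalue problem. The study focuses on load balance, communication overhead, and performance evaluation on SMP cluster systems. The hybrid algorithms combine a coarse-grained thread-parallel model with a dynamic task-sharing invocation scheme, which improves the load balance of the MPI algorithms and reduces communication overhead. Experiments on the 深腾6800 (DeepComp 6800) show that the solver based on the hybrid algorithms performs and scales better than the pure-MPI version.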
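For readers unfamiliar with Householder tridiagonalization, the standard textbook construction is summarized below. It is general background, not the paper's parallel formulation. For a vector $x$ with first component $x_1$ and first unit vector $e_1$, the reflector is

```latex
v = x + \operatorname{sign}(x_1)\,\lVert x \rVert_2\, e_1, \qquad
H = I - \frac{2\, v v^{\mathsf T}}{v^{\mathsf T} v}, \qquad
H x = -\operatorname{sign}(x_1)\,\lVert x \rVert_2\, e_1 .
```

Applying such a reflector to the part of column $k$ below the diagonal, embedded as $H_k = \operatorname{diag}(I_k, \tilde H_k)$, and updating $A^{(k+1)} = H_k A^{(k)} H_k$ for $k = 1, \dots, n-2$ yields a tridiagonal matrix. Each $H_k$ is symmetric and orthogonal, so the eigenvalues are unchanged.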

6.
Hua Zhang, Joohan Lee, Ratan Guha. Software, 2008, 38(10): 1049-1071
Clusters, composed of symmetric multiprocessor (SMP) machines and heterogeneous machines, have become increasingly popular for high‐performance computing. Message‐passing libraries, such as message‐passing interface (MPI) and parallel virtual machine (PVM), are de facto parallel programming libraries for clusters that usually consist of homogeneous and uni‐processor machines. For SMP machines, MPI is combined with multithreading libraries like POSIX Thread and OpenMP to take advantage of the architecture. In addition to existing parallel programming libraries that are in C/C++ and FORTRAN programming languages, the Java programming language presents itself as another alternative with its object‐oriented framework, platform neutral byte code, and ever‐increasing performance. This paper presents a new parallel programming model and a library, VCluster, which implements this model. VCluster is based on migrating virtual threads instead of processes to support clusters of SMP machines more efficiently. The implementation uses thread migration, which can be used in dynamic load balancing. VCluster was developed in pure Java, utilizing the portability of Java to support clusters of heterogeneous machines. Several applications are developed to illustrate the use of this library and compare the usability and performance of VCluster with other approaches. Copyright © 2007 John Wiley & Sons, Ltd.

7.
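In recent years researchers have studied parallel I/O in high-performance computing in depth, but this work has mainly targeted MPP systems, and there has been little research on parallel I/O in cluster computer systems. Parallel I/O systems for cluster computing are therefore an important research topic. This paper studies the efficiency of parallel I/O transfer scheduling in cluster computing and designs a file transfer scheduler that aims for the fastest file transfers and maximum use of node resources, significantly improving I/O node throughput and response time. Tests and experiments on a large volume of data demonstrate the scheduler's effectiveness and applicability.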

8.
FuzzyCLIPS is a rule-based programming language and it is very suitable for developing fuzzy expert systems. However, it usually requires much longer execution time than algorithmic languages such as C and Java. To address this problem, we propose a parallel version of FuzzyCLIPS to parallelize the execution of a fuzzy expert system with data dependence on a cluster system. We have designed some extended parallel syntax following the original FuzzyCLIPS style. To simplify the programming model of parallel FuzzyCLIPS, we hide, as much as possible, the tasks of parallel processing from programmers and implement them in the inference engine by using MPI, the de facto standard for parallel programming for cluster systems. Furthermore, a load balancing function has been implemented in the inference engine to adapt to the heterogeneity of computing nodes. It will intelligently allocate different amounts of workload to different computing nodes according to the results of dynamic performance monitoring. The programmer only needs to invoke the function in the program for better load balancing. To verify our design and evaluate the performance, we have implemented a human resource website. Experimental results show that the proposed parallel FuzzyCLIPS can garner a superlinear speedup and provide a more reasonable response time.

9.
Data‐driven programming models such as many‐task computing (MTC) have been prevalent for running data‐intensive scientific applications. MTC applies over‐decomposition to enable distributed scheduling. To achieve extreme scalability, MTC proposes a fully distributed task scheduling architecture that employs as many schedulers as the compute nodes to make scheduling decisions. Achieving distributed load balancing and best exploiting data locality are two important goals for the best performance of distributed scheduling of data‐intensive applications. Our previous research proposed a data‐aware work‐stealing technique to optimize both load balancing and data locality by using both dedicated and shared task ready queues in each scheduler. Tasks were organized in queues based on the input data size and location. Distributed key‐value store was applied to manage task metadata. We implemented the technique in MATRIX, a distributed MTC task execution framework. In this work, we devise an analytical suboptimal upper bound of the proposed technique, compare MATRIX with other scheduling systems, and explore the scalability of the technique at extreme scales. Results show that the technique is not only scalable but can achieve performance within 15% of the suboptimal solution. Copyright © 2015 John Wiley & Sons, Ltd.

10.
The fast multipole method (FMM) is a complex, multi‐stage algorithm over a distributed tree data structure, with multiple levels of parallelism and inherent data locality. X10 is a modern partitioned global address space language with support for asynchronous activities. The parallel tasks comprising FMM may be expressed in X10 by using a scalable pattern of activities. This paper demonstrates the use of X10 to implement FMM for simulation of electrostatic interactions between ions in a cyclotron resonance mass spectrometer. X10's task‐parallel model is used to express parallelism by using a pattern of activities mapping directly onto the tree. X10's work stealing runtime handles load balancing fine‐grained parallel activities, avoiding the need for explicit work sharing. The use of global references and active messages to create and synchronize parallel activities over a distributed tree structure is also demonstrated. In contrast to previous simulations of ion trajectories in cyclotron resonance mass spectrometers, our code enables both simulation of realistic particle numbers and guaranteed error bounds. Single‐node performance is comparable with the fastest published FMM implementations, and critical expansion operators are faster for high accuracy calculations. A comparison of parallel and sequential codes shows the overhead of activity management and work stealing in this application is low. Scalability is evaluated for 8k cores on a Blue Gene/Q system and 512 cores on a Nehalem/InfiniBand cluster. Copyright © 2013 John Wiley & Sons, Ltd.

11.
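The paper introduces the MPI parallel programming environment and the characteristics of MPI parallel program design, discusses methods for achieving dynamic load balancing in MPI parallel programs, and proposes a dynamic load-balancing strategy that migrates tasks according to the computing capacity and real-time load of the compute nodes.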

12.
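On distributed-memory systems, MPI has proven to be an ideal parallel programming model. MPI is a message-passing model in which processes communicate by calling library functions, so the efficiency of the communication code directly affects the performance of an MPI parallel program. The paper improves an MPI parallel implementation of DNS twice: first by replacing point-to-point communication functions with collective communication functions, and then by using derived datatypes and creating new communicators. Experiments are used to derive a general approach to optimizing MPI parallel programs.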

13.
Scheduling large-scale applications in heterogeneous grid systems is a fundamental NP-complete problem that is critical to performance and execution cost. Achieving high performance in a grid system requires effective task partitioning, resource management and load balancing. The heterogeneous and dynamic nature of a grid, as well as the diverse demands of applications running on the grid, makes grid scheduling a major task. Existing schedulers in wide-area heterogeneous systems require a large amount of information about the application and the grid environment to produce reasonable schedules. However, this required information may not be available, may be too expensive to collect, or may increase the runtime overhead of the scheduler such that the scheduler is rendered ineffective. We believe that no one scheduler is appropriate for all grid systems and applications. In data parallel applications, where the data can be partitioned further, performance can be improved through efficient resource management, smart resource selection and load balancing; in functional (non-divisible task) parallel applications, such partitioning is impossible, difficult, or expensive in terms of performance. In this paper, we propose a scheduler for data parallel applications (SDPA) which offers an efficient task partitioning and load balancing strategy for data parallel applications in a grid environment. The proposed SDPA offers two major features: maintaining job priority even if an insufficient number of free resources is available, and pre-task assignment to cut the idle time of nodes. The SDPA selects nodes smartly according to the nature of the task and the nodes’ resource availability. Simulation results show that SDPA improves on the strategies reported in the reviewed literature in terms of execution time, throughput and waiting time.

14.
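To monitor cluster systems effectively, a simple and practical parallel task model is built on the Message Passing Interface (MPI) parallel library. The paper describes in detail the model's cluster monitoring, the structure of its node load-balance evaluation model, and data collection on Linux clusters. Experiments show that the model is simple to configure, has low resource overhead, and interferes little with the cluster system.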

15.
Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed‐memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task‐parallel programs executed on hybrid distributed‐memory CPU‐graphics processing unit (GPU) systems in a global‐address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU‐GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state‐of‐the‐art CCSD(T) application module from the computational chemistry domain. Copyright © 2016 John Wiley & Sons, Ltd.

16.
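Wang Wei, Zhang Jiemin. 《电子技术应用》 (Application of Electronic Technique), 2011, 37(12): 126-129
Because single-level (flat) MPI clusters communicate inefficiently, the paper compares flat and tree structures in cluster collective communication and proposes an MPI cluster system design based on a tree structure. The design reduces global communication traffic and balances the load on the master node, which improves cluster communication efficiency and makes the cluster more flexible to extend. Experiments verify the feasibility of the design.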
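The effect of a tree layout on the master node can be seen in a small gather simulation. This is a sketch under assumptions, since the abstract does not specify the paper's topology: the heap-ordered numbering, the names `children` and `tree_gather`, and the use of a sum as the gathered quantity are hypothetical.

```python
def children(node, num_nodes, fanout):
    """Children of `node` in a heap-ordered tree with the given fan-out."""
    first = fanout * node + 1
    return [c for c in range(first, first + fanout) if c < num_nodes]


def tree_gather(values, fanout):
    """Sum `values` (one per node) up a tree rooted at node 0, the master.

    Returns (total, inbound), where inbound[i] counts messages node i received.
    """
    n = len(values)
    partial = list(values)
    inbound = [0] * n
    # Children always have larger indices than their parent, so visiting
    # nodes from last to first finishes every subtree before its parent.
    for node in range(n - 1, -1, -1):
        for child in children(node, n, fanout):
            partial[node] += partial[child]
            inbound[node] += 1
    return partial[0], inbound


values = list(range(16))
flat_total, flat_inbound = tree_gather(values, fanout=len(values) - 1)
tree_total, tree_inbound = tree_gather(values, fanout=2)
```

With 16 nodes, the flat layout (fan-out 15) makes the master receive all 15 messages, while the binary tree sends the same 15 messages in total but no node, including the master, receives more than 2.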

17.
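Because finite element analysis of large-scale nonlinear structural dynamics problems is very time-consuming, the paper proposes several explicit finite element parallel algorithms, based on parallel solution strategies, for a Message Passing Interface (MPI) cluster environment. Building on domain decomposition with explicit message passing, it uses overlapping and non-overlapping domain decomposition together with dynamic task allocation, and optimizes inter-processor communication by overlapping computation with communication. Five algorithms are studied: domain decomposition with non-overlapped communication, domain decomposition with overlapped communication, group dynamic task allocation, dynamic task allocation, and dynamic load balancing. A parallel finite element program was written in the message-passing programming model, numerical examples were run on a workstation cluster, the performance of the algorithms was analyzed, and they were compared with the conventional Newmark algorithm. The examples show that group dynamic task allocation outperforms plain dynamic task allocation but falls short of the domain decomposition algorithms, and that the dynamic load balancing algorithm performs best. For problems of the same size, the proposed algorithms are faster than the Newmark algorithm. The proposed parallel algorithms are feasible and effective for finite element analysis of nonlinear structural dynamics problems.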

18.
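Parallel processing offers far more computing power than serial processing on a single processor. As the cost of personal computers and networks falls, using distributed systems for parallel processing is increasingly common, and distributed network systems mostly adopt MPI as the parallel programming standard. Load balancing is especially important for reducing program run time and improving the performance of MPI computations. The paper proposes a load-balancing method for MPI parallel processing that allocates and migrates tasks among nodes according to each node's computing capacity and load. Experiments show that the proposed method effectively improves the performance of MPI parallel processing.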
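Allocation by capacity and load can be sketched as a weighted split. The weighting below is an assumption rather than the paper's formula: the function name `allocate_tasks`, the weight `capacity * (1 - load)`, and the largest-remainder rounding are hypothetical, and the sketch covers only the initial allocation, not migration.

```python
def allocate_tasks(num_tasks, capacities, loads):
    """Split num_tasks across nodes in proportion to capacity * (1 - load).

    capacities: relative compute capacity of each node (positive numbers).
    loads: current utilization of each node, in the range [0, 1).
    """
    weights = [c * (1.0 - l) for c, l in zip(capacities, loads)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("no spare capacity on any node")
    exact = [num_tasks * w / total for w in weights]
    shares = [int(x) for x in exact]
    # Largest-remainder rounding: leftover tasks go to the nodes whose
    # exact share was truncated the most.
    leftover = num_tasks - sum(shares)
    by_fraction = sorted(range(len(exact)),
                         key=lambda i: exact[i] - shares[i], reverse=True)
    for i in by_fraction[:leftover]:
        shares[i] += 1
    return shares


shares = allocate_tasks(100, capacities=[1, 1, 2], loads=[0.0, 0.5, 0.5])
```

Here the third node has twice the capacity of the first but is half busy, so both end up with the same share, while the second node, equally busy but slower, receives half as much.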

19.
The parallel computation capabilities of modern graphics processing units (GPUs) have attracted increasing attention from researchers and engineers conducting studies that require high computational throughput. However, current single-GPU engineering solutions often struggle to fulfill their real-time requirements. Thus, the multi-GPU-based approach has become a popular and cost-effective choice for tackling the demands. In those cases, computational load balancing over multiple GPU “nodes” is often the key bottleneck affecting the quality and performance of the real-time system. The existing load balancing approaches are mainly based on the assumption that all GPU nodes in the same computer framework are of equal computational performance, which is often not the case due to cluster design and other legacy issues. This paper presents a novel dynamic load balancing (DLB) model for rapid data division and allocation on heterogeneous GPU nodes based on an innovative fuzzy neural network (FNN). In this research, a 5-state parameter feedback mechanism defining the overall cluster and node performance is proposed. The corresponding FNN-based DLB model is capable of monitoring and predicting individual node performance under different workload scenarios. A real-time adaptive scheduler has been devised to reorganize the data inputs to each node when necessary to maintain their runtime computational performance. The devised model has been implemented on two-dimensional (2D) discrete wavelet transform (DWT) applications for evaluation. Experimental results show that this DLB model enables high computational throughput while meeting the real-time and precision requirements of complex computational tasks.

20.
Reachability testing is an approach to verifying concurrent programs. During reachability testing, every partially ordered synchronization sequence of a program with a given input is exercised exactly once. In this paper, we present the design and implementation of a distributed reachability testing algorithm for a cluster of workstations. This algorithm allows different test sequences to be exercised concurrently by different workstations without any synchronization, and without any duplication of sequences among workstations. Dynamic load balancing is performed using a work‐stealing scheme. A novel aspect of this scheme is that work‐stealing requests progress in rounds. This round‐based structure identifies overloaded workstations to target for work stealing. Empirical studies show good speedup for four benchmark Java programs and one Lotos specification. Copyright © 2010 John Wiley & Sons, Ltd.
