Similar Literature
20 similar documents found
1.
An Effective Dynamic Load Balancing Method (cited 26 times: 0 self-citations, 26 by others)
Dynamic load balancing is an important factor affecting the performance of parallel computing on networks of workstations. This paper first shows that the fundamental source of extra overhead in load balancing is the migration of load, and then gives a qualitative formula for the granularity of each load migration. A benefit-estimation method is introduced so that load balancing is performed only when it is expected to pay off. A dynamic load balancing algorithm is also proposed. Finally, experiments compare the algorithm's results with other load balancing schemes and with running without load balancing. Whether the workstations are idle or under various loads, and across different application problem sizes, this method outperforms the scheme proposed by Siegell et al.
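A minimal sketch of the benefit-estimation idea described in this abstract; the cost parameters and the two-node setting are invented for illustration and are not the paper's formula:

```python
def should_migrate(local_work, remote_work, migrate_cost_per_unit, compute_cost_per_unit):
    """Migrate only if the predicted completion time after balancing
    (including the migration overhead) beats doing nothing."""
    # Time to finish if we do nothing: the more heavily loaded node dominates.
    t_no_balance = max(local_work, remote_work) * compute_cost_per_unit

    # Move half of the imbalance from the heavier to the lighter node.
    grain = abs(local_work - remote_work) / 2.0
    t_migrate = grain * migrate_cost_per_unit
    t_balanced = (local_work + remote_work) / 2.0 * compute_cost_per_unit + t_migrate

    benefit = t_no_balance - t_balanced
    return benefit > 0, grain, benefit


if __name__ == "__main__":
    ok, grain, gain = should_migrate(local_work=1000, remote_work=200,
                                     migrate_cost_per_unit=0.5,
                                     compute_cost_per_unit=1.0)
    print(f"migrate={ok}, grain={grain}, estimated gain={gain}")
```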

2.
This article presents an object-oriented mechanism to achieve group communication in large-scale grids. Group communication is a crucial feature for high-performance and grid computing. While previous work on collective communication imposed the use of dedicated interfaces, we propose a scheme where one can initiate group communications using the standard public methods of the class by instantiating objects through a special object factory. The object factory utilizes casting and introspection to construct a “parallel processing enhanced” implementation of the object which matches the original class's interface. This mechanism is then extended into “Object-Oriented SPMD” (OOSPMD), an evolution of the classical SPMD programming paradigm into the domain of clusters and grids. OOSPMD provides interprocess (inter-object) communication via transparent remote method invocations rather than custom interfaces. Such typed group communication constitutes a basis for improving component models, allowing advanced composition of parallel building blocks. The typed group pattern leads to an interesting, uniform, and complete model for programming applications intended to run on clusters and grids.
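The factory-plus-proxy idea can be illustrated with a small Python sketch: a hypothetical `group_factory` returns an object whose ordinary public methods, when called once, fan out to every member of the group. The names are invented; the article's mechanism relies on Java-style introspection and remote method invocation rather than this local stand-in.

```python
class Worker:
    def __init__(self, ident):
        self.ident = ident

    def compute(self, x):
        return x * x + self.ident


class GroupProxy:
    """Expose the members' public interface; one call fans out to all members."""
    def __init__(self, members):
        self._members = members

    def __getattr__(self, name):
        def group_call(*args, **kwargs):
            # Invoke the same public method on every member (sequentially here;
            # the real system performs the remote invocations in parallel).
            return [getattr(m, name)(*args, **kwargs) for m in self._members]
        return group_call


def group_factory(cls, group_size, *ctor_args):
    """Hypothetical factory: instantiate `group_size` members and wrap them."""
    return GroupProxy([cls(i, *ctor_args) for i in range(group_size)])


if __name__ == "__main__":
    grid_group = group_factory(Worker, 4)
    print(grid_group.compute(10))   # -> [100, 101, 102, 103]
```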

3.
A number of high-level parallel programming platforms for networks of workstations (NOWs) have been developed in recent times. Most of these platforms target the exploitation of data parallelism in applications. They do not allow applications to be expressed as a collection of tasks along with their precedence relationships. As a result, the control or task parallelism in an application cannot be expressed or exploited. The current work aims at integrating the notion of task parallelism and precedence relationships among constituent tasks into such high-level data parallel platforms for NOWs. Our model of integration provides for arbitrary nesting of data and task parallel modules. Also, the precedence relationships are clearly reflected in the program structure. The model relieves the programmer from the need to design applications for non-determinism in the order of completion of constituent tasks. The design of the runtime support as well as system-level bookkeeping is discussed. The model is general enough to be applied to a wide range of data parallel platforms. A specific case of integrating the model into anonymous remote computing (ARC), a data parallel programming platform, is presented. Performance-related aspects are also discussed. Copyright © 2000 John Wiley & Sons, Ltd.
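A small sketch of mixing task parallelism with explicit precedence constraints, using Python's standard concurrent.futures rather than the ARC platform discussed in the abstract; the task graph and helper names are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_dag(tasks, deps, max_workers=4):
    """tasks: name -> callable; deps: name -> set of prerequisite names.
    A task is submitted only after all of its predecessors have completed."""
    done, futures = set(), {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(done) < len(tasks):
            ready = [t for t, d in deps.items()
                     if d <= done and t not in done and t not in futures]
            for t in ready:
                futures[t] = pool.submit(tasks[t])
            finished, _ = wait(list(futures.values()), return_when=FIRST_COMPLETED)
            for name, fut in list(futures.items()):
                if fut in finished:
                    fut.result()          # propagate exceptions, mark as complete
                    done.add(name)
                    del futures[name]
    return done

if __name__ == "__main__":
    tasks = {n: (lambda n=n: print("running", n)) for n in "ABCD"}
    deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
    run_dag(tasks, deps)   # B and C run in parallel, D only after both
```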

4.
笑影 《现代计算机》2007,(1):111-112
As PCs demand ever more power from their power supplies, noise and heat dissipation have become a headache for consumers. To better meet demands for quiet operation, stability, and cooling, power supply vendors have each introduced their own silent power supplies, and the large fans once reserved for high-end cooling are gradually appearing in low- and mid-range units. So what exactly qualifies a power supply as a silent one?

5.
The current trend of research on multithreading processors is toward chip multithreading (CMT), which exploits thread-level parallelism (TLP) and improves the performance of software built on traditional threading components, e.g., Pthreads. There exist commercially available processors that support simultaneous multithreading (SMT) on multicore processors, but they are essentially based on the conventional sequential execution model and execute multiple threads in parallel under the control of an OS that handles interruptions. Moreover, few languages or programming techniques exist to utilize multicore processors effectively. We are taking another approach and developing a multithreading processor dedicated to TLP. Our processor, named Fuce, is based on continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions which are executed without interruption. Every thread execution is triggered only by an event called a continuation. This paper first introduces the continuation-based multithread execution model and its processor architecture, then gives multithreaded programming techniques and the continuation-based multithreading language system CML. Finally, the performance of the Fuce processor is evaluated by means of clock-level software simulation.
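A toy scheduler illustrating the continuation-triggered execution style described above (this is neither the Fuce hardware nor CML): each "thread" is a run-to-completion function that becomes runnable only once all of the continuation signals it waits for have arrived.

```python
from collections import defaultdict

class ContinuationScheduler:
    def __init__(self):
        self.threads = {}                 # name -> (fan_in, body)
        self.arrived = defaultdict(int)   # continuations received so far
        self.ready = []

    def register(self, name, fan_in, body):
        self.threads[name] = (fan_in, body)
        if fan_in == 0:
            self.ready.append(name)       # no prerequisites: runnable at once

    def send_continuation(self, name):
        """Signal `name`; once its fan-in count is reached it becomes runnable."""
        self.arrived[name] += 1
        if self.arrived[name] == self.threads[name][0]:
            self.ready.append(name)

    def run(self):
        while self.ready:
            name = self.ready.pop()
            _, body = self.threads[name]
            body(self)                    # runs without interruption, may signal others


if __name__ == "__main__":
    def load_a(sch):
        print("load a")
        sch.send_continuation("add")

    def load_b(sch):
        print("load b")
        sch.send_continuation("add")

    def add(sch):
        print("add: runs only after both loads have signalled")

    s = ContinuationScheduler()
    s.register("load_a", 0, load_a)
    s.register("load_b", 0, load_b)
    s.register("add", 2, add)      # fan-in of 2: waits for both continuations
    s.run()
```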

6.
Computational Fluid Dynamics (CFD) applications are highly demanding for parallel computing. Many such applications have been shifted from expensive MPP boxes to cost-effective networks of workstations (NOW). Auto-CFD-NOW is a pre-compiler that transforms sequential Fortran CFD programs into efficient message-passing parallel programs running on NOWs. Our work makes the following three unique contributions. First, this pre-compiler is highly automatic, requiring a minimum number of user directives for parallelization. Second, we have applied a dependency analysis technique for CFD applications, called analysis after partitioning, and we propose a mirror-image decomposition technique to parallelize self-dependent field loops that are hard to parallelize by existing methods. Finally, traditional optimizations of communication focus on eliminating redundant synchronizations; we have developed an optimization scheme that combines all the non-redundant synchronizations in CFD programs to further reduce the communication overhead. Auto-CFD-NOW has been implemented on networks of workstations and has been successfully used for automatically parallelizing structured CFD application programs. Our experiments show its effectiveness and scalability for parallelizing large CFD applications. This work is supported in part by the China National Aerospace Science Foundation, and by the U.S. National Science Foundation under grants CCR-9812187, CCR-0098055, CCF-0325760, CCF-0514078, and CNS-0549006.

7.
Heterogeneous networks of workstations and/or personal computers (NOWs) are increasingly used as a powerful platform for the execution of parallel applications. When applications previously developed for traditional parallel machines (homogeneous and dedicated) are ported to NOWs, performance worsens, owing in part to less efficient communication but more often to load imbalance.

In this paper, we address the problem of efficiently porting to heterogeneous NOWs data-parallel applications originally developed using the SPMD paradigm for homogeneous parallel systems with a regular topology such as a ring.

To achieve good performance, the computation time on the various machines composing the NOW must be as balanced as possible. This can be obtained in two ways: by using a heterogeneous data partitioning strategy with a single process per node, or by splitting data homogeneously among processes and assigning to each node a number of processes proportional to its computing power. The first method is, however, more difficult, since some modifications to the code are always needed, whereas the second approach requires very few changes.

We carry out a simplified but reliable analysis, and propose a simple model able to simulate performance in the various situations. Two test cases, matrix multiplication and computation of long-range interactions, are considered, obtaining a good agreement between simulated and experimental results.

Our analysis shows that an efficient porting of regular homogeneous data-parallel applications to heterogeneous NOWs is possible. In particular, the approach based on multiple processes per node turns out to be a straightforward and effective way of achieving very satisfactory performance in almost all situations, even when dealing with highly heterogeneous systems.
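A short sketch of the second approach (homogeneous data split, multiple processes per node): given relative node speeds, assign each node a process count roughly proportional to its power. The speed numbers and rounding policy below are made up for illustration.

```python
def processes_per_node(relative_speeds, total_processes):
    """Distribute `total_processes` so each node's count is proportional to its
    relative speed (largest-remainder rounding, at least one process per node)."""
    total_speed = sum(relative_speeds.values())
    raw = {n: total_processes * s / total_speed for n, s in relative_speeds.items()}
    alloc = {n: max(1, int(r)) for n, r in raw.items()}
    # Hand out processes lost to rounding to the nodes with the largest remainders.
    leftover = total_processes - sum(alloc.values())
    for n in sorted(raw, key=lambda n: raw[n] - int(raw[n]), reverse=True)[:max(0, leftover)]:
        alloc[n] += 1
    return alloc


if __name__ == "__main__":
    speeds = {"fast1": 4.0, "fast2": 4.0, "mid": 2.0, "slow": 1.0}
    print(processes_per_node(speeds, 11))   # {'fast1': 4, 'fast2': 4, 'mid': 2, 'slow': 1}
```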

8.
This paper uses the BP (back-propagation) neural network algorithm to build a mathematical model for network performance evaluation, taking the individual performance metrics as inputs and the overall network performance as output. Based on the least-squares principle, a gradient search technique is used to minimize the mean squared error between the network's actual outputs and the desired outputs. Experiments show that the model achieves good identification accuracy.
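The modeling approach (performance metrics in, a scalar performance score out, trained by gradient descent on the mean squared error) can be sketched with a tiny one-hidden-layer BP network in numpy; the synthetic data, layer sizes, and learning rate below are invented for illustration and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 performance metrics in, 1 scalar "network performance" out.
X = rng.random((200, 4))
y = (0.3 * X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.2 * X[:, 2] * X[:, 3]).reshape(-1, 1)

n_hidden, lr = 8, 0.5
W1 = rng.normal(scale=0.5, size=(4, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)          # hidden activations
    out = h @ W2 + b2                 # linear output layer
    err = out - y
    mse = float(np.mean(err ** 2))

    # Backward pass: gradient of the mean squared error
    g_out = 2.0 * err / len(X)
    gW2 = h.T @ g_out;  gb2 = g_out.sum(axis=0)
    g_h = (g_out @ W2.T) * h * (1.0 - h)
    gW1 = X.T @ g_h;    gb1 = g_h.sum(axis=0)

    # Gradient-descent update
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print("final training MSE:", mse)
```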

9.
In teaching MPI parallel programming, the first task is to manage the transition from sequential to parallel thinking; the key is for students to grasp how, under the SPMD model, the same piece of code can produce different computational behavior. Starting from multi-process/multi-thread programming on a single machine to acquire the SPMD concept, then learning the basic concepts of MPI, and only afterwards the more advanced features, makes the learning process go quite smoothly.
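A classic first SPMD example, here in Python with mpi4py (assuming mpi4py and an MPI runtime are installed): every process runs the same source, and the rank alone drives the different behaviors, which is exactly the conceptual jump the abstract emphasizes.

```python
# run with, e.g.:  mpiexec -n 4 python spmd_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Same program, but rank 0 takes the "coordinator" branch.
    data = list(range(size))
    print(f"rank 0 of {size}: scattering {data}")
else:
    data = None

# Every rank executes this same line, each receiving a different piece.
piece = comm.scatter(data, root=0)
print(f"rank {rank} got {piece}, local result = {piece * piece}")

# Collective reduction: identical call on all ranks, combined result on the root.
total = comm.reduce(piece * piece, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of squares:", total)
```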

10.
The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles due to pipeline stalling. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them rely strongly on the predictability of future accesses and often fail when memory accesses do not contain much locality. To solve the long-latency problem of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated from the traditional decoupled architecture concept and its limitations. The HiDISC approach implements an additional prefetching processor on top of a traditional access/execute architecture. Our design aims at providing low memory access latency by separating and decoupling otherwise sequential pieces of code into three streams and executing each stream on three dedicated processors. The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors. Detailed hardware design and performance evaluation are performed with the development of an architectural simulator and compiling tools. Our performance results show that the proposed HiDISC model reduces cache misses by 19.7% and improves the overall IPC (instructions per cycle) by 15.8%. With a slower memory model assuming 200 CPU cycles as memory access latency, HiDISC improves performance by 17.2%.

11.
Melab N., Talbi E.-G., Petiton S. The Journal of Supercomputing, 2000, 17(2): 167-185
This paper presents a parallel adaptive version of the block-based Gauss-Jordan algorithm, used to invert large matrices. This version includes a characterization of the workload and a mechanism for folding/unfolding it. Furthermore, the paper proposes a work scheduling strategy and an application-oriented solution to the fault tolerance problem. The application is implemented and experimented with MARS in dedicated and non-dedicated environments. The results show that an absolute efficiency of 92% is possible on a cluster of DEC/ALPHA processors interconnected by a Gigaswitch network, and an absolute efficiency of 67% can be obtained on an Ethernet network of SUN-Sparc 4 workstations. Moreover, the algorithm is tested on a meta-system including both sets of machines. Finally, an out-of-core solution for the algorithm is proposed. This solution saves 66% of the data input operations and reduces the main memory space required for storing the data space of the algorithm by a factor of q, where q is the dimension of the matrix to be inverted in terms of data blocks.
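For readers unfamiliar with the underlying kernel, here is the plain sequential Gauss-Jordan inversion that block-based parallel variants build on; this is only a baseline sketch, not the paper's adaptive block algorithm or its scheduling strategy.

```python
import numpy as np

def gauss_jordan_inverse(a):
    """Plain (unblocked, sequential) Gauss-Jordan inversion with partial pivoting."""
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    aug = np.hstack([a, np.eye(n)])                       # augment with the identity
    for col in range(n):
        pivot = col + np.argmax(np.abs(aug[col:, col]))   # partial pivoting
        aug[[col, pivot]] = aug[[pivot, col]]
        aug[col] /= aug[col, col]                         # normalize the pivot row
        for row in range(n):
            if row != col:
                aug[row] -= aug[row, col] * aug[col]      # eliminate the column
    return aug[:, n:]

if __name__ == "__main__":
    m = np.array([[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]])
    inv = gauss_jordan_inverse(m)
    print(np.allclose(m @ inv, np.eye(3)))   # True
```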

12.
Research and Implementation of a Parallel Matrix Multiplication Algorithm under PVM (cited 4 times: 0 self-citations, 4 by others)
In many practical computations in computer science, mathematics, and engineering, one frequently encounters calculations on large, high-order matrices, with the multiplication of two matrices being the most common. When the matrix order is high, the usual computation requires many working cells and a large amount of memory, and computational efficiency suffers. This paper studies a parallel algorithm for matrix multiplication, improves it based on its time complexity, and implements the improved algorithm in the PVM environment. The improved algorithm reduces both the number of processors required and the number of subtasks assigned to each processor. Analysis of the time complexity shows that it reduces the heavy communication overhead caused by selective transfers between processes and improves the running efficiency of the program.
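The paper targets PVM; as a hedged, minimal analogue in mpi4py (the row-block decomposition and all names here are our own choices, not the paper's algorithm), each process multiplies one block of rows of A by the full B:

```python
# run with, e.g.:  mpiexec -n 4 python matmul_rows.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n = 8                                  # toy matrix order; must be divisible by `size` here

if rank == 0:
    A = np.arange(n * n, dtype=float).reshape(n, n)
    B = np.eye(n) * 2.0
    row_blocks = np.split(A, size)     # one block of n // size rows per process
else:
    row_blocks, B = None, None

a_local = comm.scatter(row_blocks, root=0)   # each rank gets its block of A
B = comm.bcast(B, root=0)                    # every rank needs all of B
c_local = a_local @ B                        # local partial product
C_blocks = comm.gather(c_local, root=0)

if rank == 0:
    C = np.vstack(C_blocks)
    print(np.allclose(C, A @ B))             # True
```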

13.
Clusters of workstations employ flexible topologies: regular, irregular, and hierarchical topologies have been used in such systems. The flexibility poses challenges for developing efficient collective communication algorithms since the network topology can potentially have a strong impact on the communication performance. In this paper, we consider the all-to-all broadcast operation on clusters with cut-through and store-and-forward switches. We show that near-optimal all-to-all broadcast on a cluster with any topology can be achieved by only using the links in a spanning tree of the topology when the message size is sufficiently large. The result implies that increasing network connectivity beyond the minimum tree connectivity does not improve the performance of the all-to-all broadcast operation when the most efficient topology specific algorithm is used. All-to-all broadcast algorithms that achieve near-optimal performance are developed for clusters with cut-through and clusters with store-and-forward switches. We evaluate the algorithms through experiments and simulations. The empirical results confirm our theoretical finding.
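A small simulation of the spanning-tree claim (an illustration of the idea, not the paper's algorithms): build a BFS spanning tree of an arbitrary topology and count link traversals when every node's message is flooded along tree edges only. The topology below is made up; the point is that n*(n-1) traversals suffice regardless of the extra links.

```python
from collections import deque

def bfs_tree_edges(adj, root):
    """Return the edges of a BFS spanning tree of the connected topology `adj`."""
    parent, seen, q = {}, {root}, deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                parent[v] = u
                q.append(v)
    return [(parent[v], v) for v in parent]

def all_to_all_over_tree(adj, root=0):
    tree = bfs_tree_edges(adj, root)
    nodes = list(adj)
    traversals = 0
    for src in nodes:                       # flood src's message along tree edges only
        reached, frontier = {src}, [src]
        while frontier:
            nxt = []
            for u in frontier:
                for a, b in tree:
                    v = b if a == u else a if b == u else None
                    if v is not None and v not in reached:
                        reached.add(v)
                        nxt.append(v)
                        traversals += 1
            frontier = nxt
        assert reached == set(nodes)        # every node received src's message
    return traversals

if __name__ == "__main__":
    # An irregular topology with redundant links (the extra connectivity is never used).
    adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}
    n = len(adj)
    print(all_to_all_over_tree(adj), "==", n * (n - 1))   # 20 == 20
```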

14.
Computing the Maximum Speedup of Parallel Programs on Cluster Systems (cited 1 time: 0 self-citations, 1 by others)
Speedup is one of the key metrics of a parallel program. In most parallel systems, for a fixed problem size, the speedup grows as worker nodes are added. In most cluster systems, however, the nodes share the physical transmission medium, so for many parallel programs the speedup begins to decline once the number of nodes exceeds a certain value. By analyzing the execution time of parallel programs on cluster systems, this paper derives, for a fixed problem size, the maximum achievable speedup, the shortest computation time, and the number of nodes at which they are attained.
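A hedged, illustrative model of the effect described above (not the paper's exact expressions): with W units of work and a shared medium on which each node's communication of c time units is serialized, T(p) = W/p + c*p first falls and then rises, so the speedup peaks at roughly p = sqrt(W/c).

```python
import math

def run_time(work, comm_per_node, p):
    # Computation is divided among p nodes; communication on the shared
    # medium is serialized, so its cost grows with the number of nodes.
    return work / p + comm_per_node * p

def best_node_count(work, comm_per_node, max_p=64):
    times = {p: run_time(work, comm_per_node, p) for p in range(1, max_p + 1)}
    p_best = min(times, key=times.get)
    speedup = times[1] / times[p_best]
    return p_best, speedup

if __name__ == "__main__":
    W, c = 1000.0, 2.0
    p_best, s_max = best_node_count(W, c)
    print(f"analytic optimum ~ {math.sqrt(W / c):.1f} nodes")
    print(f"best p = {p_best}, maximum speedup = {s_max:.1f}")
```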

15.
This paper considers the parallel solution of tridiagonal linear systems on networks of workstations (NOWs). Based on the minimum-rank decoupling algorithm and the divide-and-conquer parallel computing pattern, a parallel minimum-rank decoupling algorithm (PMRD) is proposed. It preserves the structural properties of the original matrix during the computation and has good numerical stability. The paper analyzes the numerical characteristics of the algorithm as well as its computation and communication complexity, and compares it with Mehrmann's divide-and-conquer algorithm. All algorithms are implemented with the PVM software system and tested on a network of workstations.

16.
This paper examines the structure of artificial neural networks (ANN) and the operation of their algorithms in order to identify the forms of parallelism that may be inherent in them. Parallelism within the topological structure of ANNs is seen to be of two forms: neuron and synapse. Operational parallelism is also of two forms: training set parallelism and recall/teaching parallelism. Performance models are formulated to predict the likely speed improvement achieved due to parallelism.
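A sketch of "training set parallelism" in the sense used above: the training set is split into shards, each worker computes the gradient of a simple least-squares model on its shard, and the per-shard gradients are averaged before the weight update. The shard count, model, and data are illustrative, and Python's multiprocessing stands in for the parallel hardware the paper models.

```python
import numpy as np
from multiprocessing import Pool

def shard_gradient(args):
    """Gradient of the mean squared error for a linear model on one data shard."""
    w, X, y = args
    err = X @ w - y
    return 2.0 * X.T @ err / len(X)

def train(X, y, n_workers=4, lr=0.1, epochs=200):
    w = np.zeros(X.shape[1])
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    with Pool(n_workers) as pool:
        for _ in range(epochs):
            grads = pool.map(shard_gradient,
                             [(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)])
            w -= lr * np.mean(grads, axis=0)     # average the per-shard gradients
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(400, 3))
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w
    print(np.round(train(X, y), 2))              # close to [ 1. -2.  0.5]
```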

17.
LBS: A PVM-Based Dynamic Task Load Balancing System (cited 1 time: 0 self-citations, 1 by others)
Load balancing is an important factor affecting the parallel computing performance of workstation cluster systems.

18.
This paper analyzes the necessity of parallel implementations of network protocols, discusses implementation architectures and development approaches for parallel protocol systems on end systems and interconnection devices, and illustrates, through examples, the application prospects of protocol parallelization techniques.

19.
A Method for Collecting Load Parameters in Networks of Workstations (cited 2 times: 0 self-citations, 2 by others)
One key to effective load balancing in a network of workstations (NOW) is collecting the load information of each workstation in a timely manner. This paper proposes a new method for collecting load information in a NOW. Experiments show that applying it to a dynamic load balancing algorithm yields good performance.
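A minimal sketch of the kind of load sample a per-workstation collector might produce; the normalization and reporting format are our own choices, not the paper's method, and os.getloadavg is only available on Unix-like systems.

```python
import os, socket, time

def sample_load():
    """Sample the 1-minute load average, normalized by the number of CPUs."""
    one_min, _, _ = os.getloadavg()          # Unix-only
    cpus = os.cpu_count() or 1
    return {"host": socket.gethostname(),
            "timestamp": time.time(),
            "load_index": one_min / cpus}

if __name__ == "__main__":
    # In a NOW, these samples would be sent to (or pulled by) the node
    # running the dynamic load balancing algorithm.
    for _ in range(3):
        print(sample_load())
        time.sleep(1)
```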

20.
周笑波  陆桑璐  谢立 《软件学报》1998,9(10):734-739
This paper evaluates the performance of Dasic, a dynamic adaptive coscheduling algorithm for NOWs (networks of workstations). By simulating Dasic together with the typical NOW coscheduling algorithms MAX and Grab, and analyzing and comparing them on two main performance metrics, response time and system throughput, it verifies that in larger-scale NOW systems dynamic adaptivity has a considerable impact on coscheduling performance.
