首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
This article focuses on the effect of both process topology and load balancing on various programming models for SMP clusters and iterative algorithms. More specifically, we consider nested loop algorithms with constant flow dependencies, that can be parallelized on SMP clusters with the aid of the tiling transformation. We investigate three parallel programming models, namely a popular message passing monolithic parallel implementation, as well as two hybrid ones, that employ both message passing and multi-threading. We conclude that the selection of an appropriate mapping topology for the mesh of processes has a significant effect on the overall performance, and provide an algorithm for the specification of such an efficient topology according to the iteration space and data dependencies of the algorithm. We also propose static load balancing techniques for the computation distribution between threads, that diminish the disadvantage of the master thread assuming all inter-process communication due to limitations often imposed by the message passing library. Both improvements are implemented as compile-time optimizations and are further experimentally evaluated. An overall comparison of the above parallel programming styles on SMP clusters based on micro-kernel experimental evaluation is further provided, as well.  相似文献   

2.
Many large-scale finite element problems are intractable on current generation production supercomputers. High-performance computer architectures offer effective avenues to bridge the gap between computational needs and the power of computational hardware. The biggest challenge lies in the substitution of the key algorithms in an application program with redesigned algorithms which exploit the new architectures and use better or more appropriate numerical techniques. A methodology for implementing non-linear finite element analysis on a homogeneous distributed processing network is discussed. The method can also be extended to heterogeneous networks comprised of different machine architectures provided that they have a mutual communication interface. This unique feature has greatly facilitated the port of the code to the 8-node Intel Touchstone Gamma and then the 512-node Intel Touchstone Delta. The domain is decomposed serially in a preprocessor. Separate input files are written for each subdomain. These files are read in by local copies of the program executable operating in parallel. Communication between processors is addressed utilizing asynchronous and synchronous message passing. The basic kernel of message passing is the internal force exchange which is analogous to the computed interactions between sections of physical bodies in static stress analysis. Benchmarks for the Intel Delta are presented. Performance exceeding 1 gigaflop was attained. Results for two large-scale finite element meshes are presented.  相似文献   

3.
The implementation of CSP-S (a subset of CSP)—a high level language for distributed programming—is presented in this paper. The language CSP-S features a parallel command, communication by message passing and the use of guarded command. The implementation consists of a compiler translating the CSP-S constructs into intermediate language. The execution is carried out by a scheduler which creates an illusion of concurrency. Using the CSP-S language constructs, distributed algorithms are written, executed and tested with the compiler designed.  相似文献   

4.
There are a number of models that were proposed in recent years for message passing parallel systems. Examples are the postal model and its generalization the LogP model. In the postal model a parameter λ is used to model the communication latency of the message-passing system. Each node during each round can send a fixed-size message and, simultaneously, receive a message of the same size. Furthermore, a message sent out during round r will incur a latency of λ and will arrive at the receiving node at round r+λ-1. Our goal in this paper is to bridge the gap between the theoretical modeling and the practical implementation. In particular, we investigate a number of practical issues related to the design and implementation of two collective communication operations, namely, the broadcast operation and the global combine operation. Those practical issues include, for example, (1) techniques for measurement of the value of λ on a given machine, (2) creating efficient broadcast algorithms that get the latency h and the number of nodes n as parameters and (3) creating efficient global combine algorithms for parallel machines with λ which is not an integer. We propose solutions that address those practical issues and present results of an experimental study of the new algorithms on the Inter Delta machine. Our main conclusion is that the postal model can help in performance prediction and tuning, for example, a properly tuned broadcast improves the known implementation by more than 20%  相似文献   

5.
This paper concerns resource allocation in distributed message passing systems, i.e., the scheduling of accesses to exclusive system resources shared among concurrent processes. An efficient modular resource allocation algorithm is presented that uses any arbitrary resource allocation algorithm as a subroutine. It improves the performance of the subroutine by letting each process wait only for its currently conflicting processes, and therefore, allows more concurrency. For appropriate choices of the subroutine, we obtain resource allocation algorithms with the minimum worst case response times. Simulation studies were conducted which also indicate that on average, the obtained algorithms perform faster and require a smaller number of messages than other previously known algorithms, especially when resource contention among processes is high and the average time that a process remains in the critical region is large. Received: May 1997 / Accepted: May 1998  相似文献   

6.
Heterogeneous Message Passing and a Link to Resource Management   总被引:1,自引:0,他引:1  
Parallel process communication and system resource management have been seen as two separate entities in parallel and distributed systems. This causes difficulties in the dynamic mapping of newly spawned processes, because the application has little or no information on the availability, the connectivity and the current work-load of the target system. As a consequence, process mappings are often sub-optimal, overloading resources on one system while other processors are idling. We present a software system named "PLUS" that provides interprocess communication between different message passing models such as MPI, PVM and PARIX, and access to resource management systems for optimal process mapping and task migration.PLUS is a light-weight, extensible and efficient communication interface. With only four commands, PLUS is almost transparent ot the application code. Our current implementation supports inter-process communication between PVM, MPI and PARIX, but it can be easily extended to other vendor-specific message passing libraries. As PLUS has been designed for wide area networks, much effort has been spent on portability and on optimizing the communication speed across internet and also intranet links.  相似文献   

7.
Writing software for one parallel system is a feasible though arduous task. Reusing the substantial intellectual effort so expended for programming a second system has proved much more challenging. In sequential computing algorithms textbooks and portable software are resources that enable software systems to be written that are efficiently portable across changing hardware platforms. These resources are currently lacking in the area of multi-core architectures, where a programmer seeking high performance has no comparable opportunity to build on the intellectual efforts of others. In order to address this problem we propose a bridging model aimed at capturing the most basic resource parameters of multi-core architectures. We suggest that the considerable intellectual effort needed for designing efficient algorithms for such architectures may be most fruitfully expended in designing portable algorithms, once and for all, for such a bridging model. Portable algorithms would contain efficient designs for all reasonable combinations of the basic resource parameters and input sizes, and would form the basis for implementation or compilation for particular machines. Our Multi-BSP model is a multi-level model that has explicit parameters for processor numbers, memory/cache sizes, communication costs, and synchronization costs. The lowest level corresponds to shared memory or the PRAM, acknowledging the relevance of that model for whatever limitations on memory and processor numbers it may be efficacious to emulate it. We propose parameter-aware portable algorithms that run efficiently on all relevant architectures with any number of levels and any combination of parameters. For these algorithms we define a parameter-free notion of optimality. We show that for several fundamental problems, including standard matrix multiplication, the Fast Fourier Transform, and comparison sorting, there exist optimal portable algorithms in that sense, for all combinations of machine parameters. Thus some algorithmic generality and elegance can be found in this many parameter setting.  相似文献   

8.
This paper presents a novel approach to reducing the communication costs incurred when performing multiple multicasts on wormhole routed two-dimensional mesh multiprocessor systems. Both unicast and path-based implementations of multicasting incur communication costs due to the inherent message passing and contention for network resources. The start-up time dominates the transmission time when the data volume is small. However, in the presence of multiple multicasts when the data volume is very large, the communication delays due to message blocking and resource contention become very significant. Because of this, we present a hybrid static-dynamic technique to reduce the communication costs incurred when performing multiple multicasts on wormhole routed direct networks. This technique requires a focus on ordering and routing information for the individual message transmissions. At compile time, each message is assigned a priority using the recently developed collision graph model. Then at runtime these priorities are used to arbitrate the message transmissions. As a base, dimension-ordered routing is used. However, to further reduce the communication costs, some messages will be rerouted. This technique is useful either as a stand-alone algorithm or as an embedded procedure into existing algorithms. Furthermore, the techniques can be applied to higher dimension direct networks. For a single multicast, our work performs as well as conventional methods. For multiple multicasts, results show that our approach provides significant improvement over baseline techniques.  相似文献   

9.
路由选择机制是分布多式媒体系统中的重要研究方向。其中,基于服务质量的多媒体通信目的节点加入与退出算法是关键组成部分。该文在基于服务质量的多媒体通信初始路由建立算法的基础上,基于资源共享原则,提出支持成员动态加入与退出多媒体组通信的目的节点加入与退出算法,以增加使用费用最小为目标在满足服务质量约束的条件下完成目的节点加入,在不影响多媒体组通信服务质量的前提下,在完成目的节点退出的同时最大限度地释放已占用资源。文中还探讨了这些算法的有效性。它们和基于服务质量的多媒体通信初始路由建立算法相结合,可以提供对分布式多毁体组应用服务质量保证的支持。  相似文献   

10.
In many real applications, for example, those with frequent and irregular communication patterns or those using large messages, network contention and contention for message processing resources can be a significant part of the total execution time. This paper presents a new cost model, called LoGPC, that extends the LogP and LogGP models to account for the impact of network contention and network interface DMA behavior on the performance of message passing programs. We validate LoGPC by analyzing three applications implemented with Active Messages on the MIT Alewife multiprocessor. Our analysis shows that network contention accounts for up to 50 percent of the total execution time. In addition, we show that the impact of communication locality on the communication costs is at most a factor of two on Alewife. Finally, we use the model to identify trade-offs between synchronous and asynchronous message passing styles  相似文献   

11.
The dynamicity, coupled with the uncertainty that occurs between advertised resources and users’ resource requirement queries, remains significant problems that hamper the discovery of candidate resources in a cloud computing environment. Network size and complexity continue to increase dynamically which makes resource discovery a complex, NP-hard problem that requires efficient algorithms for optimum resource discovery. Several algorithms have been proposed in literature but there is still room for more efficient algorithms especially as the size of the resources increases. This paper proposes a soft-set symbiotic organisms search (SSSOS) algorithm, a new hybrid resource discovery solution. Soft-set theory has been proved efficient for tackling uncertainty problems that arises in static systems while symbiotic organisms search (SOS) has shown strength for tackling dynamic relationships that occur in dynamic environments in search of optimal solutions among objects. The SSSOS algorithm innovatively combines the strengths of the underlying techniques to provide efficient management of tasks that need to be accomplished during resource discovery in the cloud. The effectiveness and efficiency of the proposed hybrid algorithm is demonstrated through empirical simulation study and benchmarking against recent techniques in literature. Results obtained reveal the promising potential of the proposed SSSOS algorithm for resource discovery in a cloud environment.  相似文献   

12.
Message passing programs commonly use message buffers to avoid unnecessary synchronizations and to improve performance by overlapping communication with computation. Unfortunately, using buffers can introduce portability problems and can lead to deadlock problems on systems without a sufficient number of message buffers.We explore a variety of problems related to buffer allocation for safe and efficient execution of message passing programs. We show that determining the minimum number of message buffers or verifying that each process has a sufficient number of message buffers are intractable problems. However, we give a polynomial time algorithm to determine the minimum number of message buffers needed to ensure that no send operation is unnecessarily delayed due to lack of message buffers. We extend these results to several different buffering schemes, which in some cases make the problems tractable.  相似文献   

13.
The resource discovery problem was introduced by Harchol-Balter, Leighton, and Lewin. They developed a number of algorithms for the problem in the weakly connected directed graph model. This model is a directed logical graph that represents the vertices’ knowledge about the topology of the underlying communication network. The current paper proposes a deterministic algorithm for the problem in the same model, with improved time, message, and communication complexities. Each previous algorithm had a complexity that was higher at least in one of the measures. Specifically, previous deterministic solutions required either time linear in the diameter of the initial network, or communication complexity $O(n^3)$ (with message complexity $O(n^2)$), or message complexity $O(|E_0| \log n)$ (where $E_0$ is the arc set of the initial graph $G_0$). Compared with the main randomized algorithm of Harchol-Balter, Leighton, and Lewin, the time complexity is reduced from $O(\log^2n)$ to\pagebreak[4] $O(\log n )$, the message complexity from $O(n \log^2 n)$ to $O(n \log n )$, and the communication complexity from $O(n^2 \log^3 n)$ to $O(|E_0|\log ^2 n )$. \par Our work significantly extends the connectivity algorithm of Shiloach and Vishkin which was originally given for a parallel model of computation. Our result also confirms a conjecture of Harchol-Balter, Leighton, and Lewin, and addresses an open question due to Lipton.  相似文献   

14.
Advance reservation is important to guarantee the quality of services of jobs by allowing exclusive access to resources over a defined time interval on resources. It is a challenge for the scheduler to organize available resources efficiently and to allocate them for parallel advance reservation jobs with deadline constraint appropriately. This paper provides a slot-based data structure to organize available resources of multiprocessor systems in a way that enables efficient search and update operations and formulates a suite of scheduling policies to allocate resources for dynamically arriving advance reservation requests. The performance of the scheduling algorithms were investigated by simulations with different job sizes and durations, system loads, and scheduling flexibilities. Simulation results show that job sizes and durations, system load and the flexibility of scheduling will impact the performance metrics of all the scheduling algorithms, and the $\textit{PE}\; \textit{Worst Fit}$ algorithm becomes the best algorithm for the scheduler with the highest acceptance rate of advance reservation requests, and the jobs with the $\textit{First Fit}$ algorithm experience the lowest average slowdown. The data structure and scheduling policies can be used to organize and allocate resources for parallel advance reservation jobs with deadline constraint in large-scale computing systems.  相似文献   

15.
网格计算主要关注大规模的资源共享,且这种共享是高度可控的。为解决网格环境下文件资源共享与管理的问题,提出了一个网格文件资源共享模型FsvGrid。该模型引入注册通知机制,并采用确定性算法与非确定性算法相结合的消息传递机制,使得网格中的各个节点之间能够高效协作;采用分层结构,屏蔽了文件资源的多样性;增加了共享的安全性,可以对共享进行控制;提出了一种依靠虚拟组织来对文件资源进行管理的方式,解决分布式资源难以管理的问题。  相似文献   

16.
Model checking of safety properties can be scaled up by pooling the CPU and memory resources of multiple computers. As compute clusters containing 100s of nodes, with each node realized using multi-core (e.g., 2) CPUs will be widespread, a model checker based on the parallel (shared memory) and distributed (message passing) paradigms will more efficiently use the hardware resources. Such a model checker can be designed by having each node employ two shared memory threads that run on the (typically) two CPUs of a node, with one thread responsible for state generation, and the other for efficient communication, including (1) performing overlapped asynchronous message passing, and (2) aggregating the states to be sent into larger chunks in order to improve communication network utilization. We present the design details of such a novel model checking architecture called Eddy. We describe the design rationale, details of how the threads interact and yield control, exchange messages, as well as detect termination. We have realized an instance of this architecture for the Murphi modeling language. Called Eddy_Murphi, we report its performance over the number of nodes as well as communication parameters such as those controlling state aggregation. Nearly linear reduction of compute time with increasing number of nodes is observed. Our thread task partition is done in such a way that it is modular, easy to port across different modeling languages, and easy to tune across a variety of platforms. Supported in part by NSF award CNS-0509379 and SRC Contract 2005-TJ-1318.  相似文献   

17.
For many applications two-dimensional hydraulic models are time intensive to run due to their computational requirements, which can adversely affect the progress of both research and industry modelling projects. Computational time can be reduced by running a model in parallel over multiple cores. However, there are many parallelisation methods and these differ in terms of difficulty of implementation, suitability for particular codes and parallel efficiency. This study compares three parallelisation methods based on OpenMP, message passing and specialised accelerator cards. The parallel implementations of the codes were required to produce near identical results to a serial version for two urban inundation test cases. OpenMP was considered the easiest method to develop and produced similar speedups (of ~3.9×) to the message passing code on up to four cores for a fully wet domain. The message passing code was more efficient than OpenMP, and remained over 90% efficient on up to 50 cores for a completely wet domain. All parallel codes were less efficient for a partially wet domain test case. The accelerator card code was faster and more power efficient than the standard code on a single core for a fully wet domain, but was subject to longer development time (2 months compared to <2 week for the other methods).  相似文献   

18.
Multicast communication, in which the same message is delivered from a source node to an arbitrary number of destination nodes, is being increasingly demanded in parallel computing. System supported multicast services can potentially offer improved performance, increased functionality, and simplified programming, and may in turn be used to support various higher-level operations for data movement and global process control. This paper presents efficient algorithms to implement multicast communication in wormhole-routed direct networks, in the absence of hardware multicast support, by exploiting the properties of the switching technology. Minimum-time multicast algorithms are presented for n-dimensional meshes and hypercubes that use deterministic, dimension-ordered routing of unicast messages. Both algorithms can deliver a multicast message to m-1 destinations in [log 2 m] message passing steps, while avoiding contention among the constituent unicast messages. Performance results of implementations on a 64-node nCUBE-2 hypercube and a 168-node Symult 2010 2-D mesh are given  相似文献   

19.
Developing high‐quality, error‐free message‐passing concurrent programs is not trivial. Although a number of different primitives with associated semantics are available to assist such development, they often increase the complexity of the testing process. In this paper, we extend our previous test model for message‐passing programs and present new structural testing criteria, taking into account additional features used in this paradigm, such as collective communication, non‐blocking sends, distinct semantics for non‐blocking receives, and persistent operations. Our new model also recognizes that sender primitives cannot always be matched with every receive primitive. This improvement allows us to remove statically a significant number of infeasible synchronization edges that would otherwise have to be analyzed later by the tester. In this paper, the test model is presented using the Message‐Passing Interface standard; however, our new model has been designed to be flexible, and it can be configured to support a range of different message‐passing environments or languages. We have carried out case studies showing the applicability of the new test model to represent message‐passing programs and also to reveal errors, mainly those errors related to inter‐process communication. In addition to increasing the number of features supported by the test model, we have also reduced the overall cost of testing significantly. Our case studies suggest that the number of synchronization edges can be reduced by up to 93%, mainly by eliminating infeasible edges between unmatchable communication primitives. The main contribution of the paper is to present a more flexible test model that provides improved coverage for message‐passing programs and at the same time reduces the cost of testing significantly. Copyright © 2012 John Wiley & Sons, Ltd.  相似文献   

20.
In this paper we explore compiler techniques for achieving efficient communications on circuit switching interconnection networks. We propose a compilation framework for identifying communication patterns and compiling these patterns as network configuration directives. This has the potential of providing significant performance benefits when connections can be established in the network prior to the actual communications. The framework includes a flexible and powerful communication pattern representation scheme that captures the property of communication patterns and allows manipulation of these patterns. In this way, communication phases can be identified within the application. Additionally, we extend the classification of static and dynamic communications to include persistent communications. Persistent communications are a subclass of dynamic communications that remain unchanged for large segments of the application execution. An experimental compiler has been developed to implement the framework. This compiler is capable of detecting both static and persistent communications within an application. We show that for the NAS Parallel Benchmarks, 100% of the point-to-point communications can be classified as either static or persistent and 100% of the collectives are either static or persistent with the exception of IS. Simulation-based performance analysis demonstrates the benefit of using our compiler techniques for achieving efficient communications in multiprocessor systems.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号