期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Low‐level PGAS computing on many‐core processors with TSHMEM

Bryant C. Lam Alan D. George Herman Lam Vikas Aggarwal 《Concurrency and Computation》2015,27(17):5288-5310

Diminishing returns from increased clock frequencies and instruction‐level parallelism have forced computer architects to adopt architectures that exploit wider parallelism through multiple processor cores. While emerging many‐core architectures have progressed at a remarkable rate, concerns arise regarding the performance and productivity of numerous parallel‐programming tools for application development. Development of parallel applications on many‐core processors often requires developers to familiarize themselves with unique characteristics of a target platform while attempting to maximize performance and maintain correctness of their applications. The family of partitioned global address space (PGAS) programming models comprises the current state of the art in balancing performance and programmability. One such PGAS approach is SHMEM, a lightweight, shared‐memory programming library that has demonstrated high performance and productivity potential for parallel‐computing systems with distributed‐memory architectures. In the paper, we present research, design, and analysis of a new SHMEM infrastructure specifically crafted for low‐level PGAS on modern and emerging many‐core processors featuring dozens of cores and more. Our approach (with a new library known as TSHMEM) is investigated and evaluated atop two generations of Tilera architectures, which are among the most sophisticated and scalable many‐core processors to date, and is intended to enable similar libraries atop other architectures now emerging. In developing TSHMEM, we explore design decisions and their impact on parallel performance for the Tilera TILE‐Gx and TILEPro many‐core architectures, and then evaluate the designs and algorithms within TSHMEM through microbenchmarking and applications studies with other communication libraries. Our results with barrier primitives provided by the Tilera libraries show dissimilar performance between the TILE‐Gx and TILEPro; therefore, TSHMEM's barrier design takes an alternative approach and leverages the on‐chip mesh network to provide consistent low‐latency performance. In addition, our experiments with TSHMEM show that naive collective algorithms consistently outperformed linear distributed collective algorithms when executed in an SMP‐centric environment. In leveraging these insights for the design of TSHMEM, our approach outperforms the OpenSHMEM reference implementation, achieves similar to positive performance over OpenMP and OSHMPI atop MPICH, and supports similar libraries in delivering high‐performance parallel computing to emerging many‐core systems. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

2.

Redistribution of block-cyclic data distributions using MPI

D.W. Walker S.W. Otto 《Concurrency and Computation》1996,8(9):707-728

Arrays that are distributed in a block-cyclic fashion are important for many applications in the computational sciences since they often lead to parallel algorithms with good load balancing properties. We consider the problem of redistributing such an array to a new block size. This operation is directly expressible in High Performance Fortran (HPF) and will arise in applications written in this language. Efficient message passing algorithms are given for the redistribution operation, expressed in the standardized message passing interface, MPI. The algorithms are analyzed and performance results from the IBM SP-1 and Intel Paragon are given and discussed. The results show that redistribution can be done in time comparable to other collective communication operations, such as broadcast and MPI_ALLTOALL. 相似文献

3.

集群动态负载平衡系统的性能评价 总被引：18，自引：0，他引：18

唐丹金海张永坤《计算机学报》2004,27(6):803-811

该文使用随机Petri网对集群动态负载平衡系统建立了一个抽象模型．通过细化模型中的节点本地处理部分对5种动态负载平衡算法的性能进行了分析，并讨论了集群负载特性对动态负载平衡系统性能的影响，最后得出的主要结论有：(1)动态负载平衡算法可以取得比静态负载平衡算法更好的性能；(2)与传统的只考虑CPU就绪队列的负载平衡算法相比，考虑了各种I／O请求队列的负载平衡算法可以取得更好的性能；(3)即使在极端的集群负载特性中。集群动态负载平衡算法仍然能取得比较理想的性能，因此实现即使是十分简单的集群动态负载平衡系统也是很有必要的。相似文献

4.

Mapping of option pricing algorithms onto heterogeneous many-core architectures

Shuai Zhang Zhao Wang Ying Peng Bertil Schmidt Weiguo Liu 《The Journal of supercomputing》2017,73(9):3715-3737

The rapid development of technologies and applications in recent years poses high demands and challenges for high-performance computing. Because of their competitive performance/price ratio, heterogeneous many-core architectures are widely used in high-performance computing areas. GPU and Xeon Phi are two popular general-purpose many-core accelerators. In this paper, we demonstrate how heterogeneous many-core architectures, powered by multi-core CPUs, CUDA-enabled GPUs and Xeon Phis can be used as an efficient computational platform to accelerate popular option pricing algorithms. In order to make full use of the compute power of this architecture, we have used a hybrid computing model which consists of two types of data parallelism: worker level and device level. The worker level data parallelism uses a distributed computing infrastructure for task distribution, while the device level data parallelism uses both the multi-core CPUs and many-core accelerators for fast option pricing calculation. Experiments show that our implementations achieve good performance and scalability on this architecture and also outperform other state-of-the-art GPU-based solutions for Monte Carlo European/American option pricing and BSDE European option pricing. 相似文献

5.

基于Petri网的集群应用系统性能评价

丁静《计算机应用与软件》2007,24(3):129-131

利用随机Petri网理论给出了一个集群应用软件负载平衡系统的抽象模型,通过细化其中的本地节点处理部分对3种集群动态负载平衡的调度策略和应用系统体系结构对负载平衡系统的影响进行了分析,得到了一些对大部分应用系统的设计起到指导作用的结论.这些结论是:(1)无论是静态负载平衡还是动态负载平衡都能提高集群系统的性能,动态负载平衡会得到更好的性能;(2)在动态负载平衡算法中除了要考虑系统中最重要的等待队列--应用队列外,还要考虑数据库队列;(3)异步体系架构将任务切分到各处理子系统中,有助于将各子系统负载数据综合到负载向量中,能够更准确地衡量系统负载、提高负载平衡系统的性能,优于同步体系结构. 相似文献

6.

Tuning collective communication for Partitioned Global Address Space programming models

Rajesh Nishtala Yili Zheng 《Parallel Computing》2011,37(9):576-591

Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared memory programming style combined with locality control necessary to run on large-scale distributed memory systems. Even within a PGAS language programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads work together to perform the operation. In this paper we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and implementation. In particular, PGAS collectives have semantic issues that are different than in send-receive style message passing programs, and different implementation approaches that take advantage of the one-sided communication style in these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared memory, distributed memory and hybrids. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. Finally, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and demonstrate that the GASNet collectives can deliver scalable performance on a variety of state-of-the-art parallel machines including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with InfiniBand interconnect. 相似文献

7.

Scalable Load Balancing Techniques for Parallel Computers

《Journal of Parallel and Distributed Computing》1994,22(1):60-79

In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics: the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not possible (or very difficult) to estimate the size of total work at a given processor. Such problems require a load balancing scheme that distributes the work dynamically among different processors. Our goal here is to determine the most scalable load balancing schemes for different architectures such as hypercube, mesh, and network of workstations. For each of these architectures, we establish lower bounds on the scalability of any possible load balancing scheme. We present the scalability analysis of a number of load balancing schemes that have not been analyzed before. This gives us valuable insights into their relative performance for different problem and architectural characteristics. For each of these architectures, we are able to determine near optimal load balancing schemes. Results obtained from implementation of these schemes in the context of the Tautology Verification problem on the Ncube/2 (a trademark of the Ncube Corporation) multicomputer are used to validate our theoretical results for the hypercube architecture. These results also demonstrate the accuracy and viability of our framework for scalability analysis. 相似文献

8.

A Methodology for Programming Scalable Architectures

《Journal of Parallel and Distributed Computing》1994,22(3):479-487

In scalable concurrent architectures, the performance of a parallel algorithm depends on the resource management policies used. Such policies determine, for example, how data is partitioned and distributed and how processes are scheduled. In particular, the performance of a parallel algorithm obtained by using a particular policy can be affected by increasing the size of the architecture or the input. In order to support scalability, we are developing a methodology for modular specification of partition and distribution strategies (PDSs). As a consequence, a PDS may be changed without modifying the code specifying the logic of a parallel algorithm. We illustrate our methodology for parallel algorithms that use dynamic data structures. 相似文献

9.

Distribution assignment placement: effective optimization ofredistribution costs

Knoop J. Mehofer E. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(6):628-647

Data locality and workload balance are key factors for getting high performance out of data-parallel programs on multiprocessor architectures. Data-parallel languages such as High-Performance Fortran (HPF) thus offer means allowing a programmer both to specify data distributions and to change them dynamically in order to maintain these properties. On the other hand, redistributions can be quite expensive and can significantly degrade a program's performance. They must thus be reduced to a minimum. In this article, we present a novel, aggressive approach for avoiding unnecessary remappings, which works by eliminating partially dead and partially redundant distribution changes. Basically, this approach evolves from extending and combining two algorithms for these optimizations, each achieving optimal results on its own. In distinction to the sequential setting, the data-parallel setting leads naturally to a family of algorithms of varying power and efficiency, allowing requirement-customized solutions. The power and flexibility of the new approach are demonstrated by various examples, which range from typical HPF fragments to real-world programs. Performance measurements underline its importance and show its effectiveness on different hardware platforms and in different settings 相似文献

10.

Parallel space‐filling curve generation through sorting

J. Luitjens M. Berzins T. Henderson 《Concurrency and Computation》2007,19(10):1387-1402

In this paper we consider the scalability of parallel space‐filling curve generation as implemented through parallel sorting algorithms. Multiple sorting algorithms are studied and results show that space‐filling curves can be generated quickly in parallel on thousands of processors. In addition, performance models are presented that are consistent with measured performance and offer insight into performance on still larger numbers of processors. At large numbers of processors, the scalability of adaptive mesh refined codes depends on the individual components of the adaptive solver. One such component is the dynamic load balancer. In adaptive mesh refined codes, the mesh is constantly changing resulting in load imbalance among the processors requiring a load‐balancing phase. The load balancing may occur often, requiring the load balancer to perform quickly. One common method for dynamic load balancing is to use space‐filling curves. Space‐filling curves, in particular the Hilbert curve, generate good partitions quickly in serial. However, at tens and hundreds of thousands of processors serial generation of space‐filling curves will hinder scalability. In order to avoid this issue we have developed a method that generates space‐filling curves quickly in parallel by reducing the generation to integer sorting. Copyright © 2007 John Wiley & Sons, Ltd. 相似文献

11.

Dynamic Load Balancing and Pricing in Grid Computing with Communication Delay

Qin Zheng Chen-Khong Tham Bharadwaj Veeravalli 《Journal of Grid Computing》2008,6(3):239-253

Due to the emergence of Grid computing over the Internet, there is presently a need for dynamic load balancing algorithms which take into account the characteristics of Grid computing environments. In this paper, we consider a Grid architecture where computers belong to dispersed administrative domains or groups which are connected with heterogeneous communication bandwidths. We address the problem of determining which group an arriving job should be allocated to and how its load can be distributed among computers in the group to optimize the performance. We propose algorithms which guarantee finding a load distribution over computers in a group that leads to the minimum response time or computational cost. We then study the effect of pricing on load distribution by considering a simple pricing function. We develop three fully distributed algorithms to decide which group the load should be allocated to, taking into account the communication cost among groups. These algorithms use different information exchange methods and a resource estimation technique to improve the accuracy of load balancing. We conducted extensive simulations to evaluate the performance of the proposed algorithms and strategies. 相似文献

12.

Fault tolerant algorithm based on dynamic and active load balancing for redundant services

下载免费PDF全文

Jun-FengTian Jun-WeiZhang Feng-XianWang 《计算机科学技术学报》2004,19(6):0-0

A new Some-Read-Any-Write (SRAW) fault tolerant algorithm for redundant services is presented that allows a system to adjust failures dynamically in order to keep the availability and improve the performance.SRAW is based upon dynamic and active load balancing. By introducing dynamic and active load balancings cheme into redundant services, not only the processing speed of requests can be greatly improved, but also the load balancing can be simply and efficiently achieved. Integrated with consistency protocol in this paper, SRAW can also be applied to state services. The performance of SRAW algorithm is also analyzed, and comparisons with other fault tolerant algorithms, especially with RAWA, indicate that SRAW efficiently improves the performance of redundant services with guaranteeing system availability. 相似文献

13.

Interprocedural Compilation of Fortran D

Mary W. Hall Seema Hiranandani Ken Kennedy Chau-Wen Tseng 《Journal of Parallel and Distributed Computing》1996,38(2):114

Fortran D is a version of Fortran extended with data decomposition specifications. It is designed to provide a machine-independent programming model for data-parallel applications and has heavily influenced the design of High Performance Fortran (HPF). In previous work we described Fortran D compilation algorithms for individual procedures. This paper presents an interprocedural approach to analyze data and computation partitions, optimize communication, support dynamic data decomposition, and perform other tasks required to compile Fortran D programs. Our algorithms are designed to make interprocedural compilation efficient. First, we collect summary information after edits to solve important data-flow problems in a separate interprocedural propagation phase. Second, for nonrecursive programs we compile procedures in reverse topological order to propagate additional interprocedural information during code generation. We thus limit compilation to a single pass over each procedure body. We also perform optimizations across procedure boundaries by delaying instantiation of the computation partition, communication, and dynamic data decomposition. Empirical results show that interprocedural optimization is crucial in achieving acceptable performance for a common application code. 相似文献

14.

UPCBLAS: a library for parallel matrix computations in Unified Parallel C

Jorge Gonzlez‐Domínguez María J. Martín Guillermo L. Taboada Juan Tourio Ramn Doallo Damin A. Malln Brian Wibecan 《Concurrency and Computation》2012,24(14):1645-1667

The popularity of Partitioned Global Address Space (PGAS) languages has increased during the last years thanks to their high programmability and performance through an efficient exploitation of data locality, especially on hierarchical architectures such as multicore clusters. This paper describes UPCBLAS, a parallel numerical library for dense matrix computations using the PGAS Unified Parallel C language. The routines developed in UPCBLAS are built on top of sequential basic linear algebra subprograms functions and exploit the particularities of the PGAS paradigm, taking into account data locality in order to achieve a good performance. Furthermore, the routines implement other optimization techniques, several of them by automatically taking into account the hardware characteristics of the underlying systems on which they are executed. The library has been experimentally evaluated on a multicore supercomputer and compared with a message‐passing‐based parallel numerical library, demonstrating good scalability and efficiency. Copyright © 2012 John Wiley & Sons, Ltd. 相似文献

15.

A Dynamic Load Balancing Framework for Real-time Applications in Message Passing Systems

Ghada F. El Kabbany Nayer M. Wanas Nadia H. Hegazi Samir I. Shaheen 《International journal of parallel programming》2011,39(2):143-182

Load balancing algorithms are designed essentially to equally distribute the load on processors and maximize their utilities while minimizing the total task execution time. In order to achieve these goals, the load-balancing mechanism should be “fair” in distributing the load across the different processors. This implies that the difference between the heaviest-loaded and the lightest-loaded processors should be minimized. Therefore, the load information on each processor must be updated such that the load-balancing mechanism can be more effective. In this work, we present an application independent dynamic algorithm for scheduling tasks and load- balancing in message passing systems. We propose a DAG-based Dynamic Load Balancing algorithm for Real time applications (DAG-DLBR) that is designed to work dynamically to cope with possible changes in the load that might occur during runtime. This algorithm addresses the challenge of devising a load balancing scheme which judicially deals with the hybrid execution of existing real-time application (represented by a Direct Acyclic Graph (DAG)) together with newly arriving jobs. The main objective of this algorithm is to reduce response times of the newly arriving jobs while maintaining the time constrains of the existing DAG. To evaluate the performance of the DAG-DLBR algorithm, a comparison with the performance of two common dynamic load balancing algorithms is presented. This comparison is performed by evaluating, experimentally, the execution time of different load balancing algorithms on a homogenous real parallel machine. In addition, the values of load imbalance, the execution time, and the communication overhead time are evaluated analytically using different benchmarks as test-bed workloads. These workloads cover a wide range of dynamic applications with different task types. Experimental results illustrate the improved performance of the DAG-DLBR algorithm compared to both distributed and hierarchal based algorithms by at least 12 and 19%, respectively. This improvement is true for all workloads, even with highly dependent workload. The DAG-DLBR algorithm achieves lower computation time than its corresponding values of both the distributed and the hierarchical-based algorithms for 4, 8, 12 and 16 processors. 相似文献

16.

动态负载均衡算法在校园网格中的应用 总被引：2，自引：0，他引：2

李相朋《微计算机信息》2006,22(24):164-165

校园网格能有效消除信息孤岛,实现我国高校的计算资源和信息资源的有效共享。一个亟待解决的问题是在校园网格环境下,服务器节点响应能力低下。目前已提出多种技术与方案以解决并提高校园网格的服务器节点的响应能力,负载均衡技术就是一种全新的技术。本文根据校园网格的特点和影响负载均衡的因素,对基于校园网格的负载均衡技术进行了分析和探讨,并提出一种动态负载均衡算法。相似文献

17.

Optimizations of a GPU accelerated heat conduction equation by a programming of CUDA Fortran from an analysis of a PTX file

Shin-ichi Satake Hajime Yoshimori Takayuki Suzuki 《Computer Physics Communications》2012,183(11):2376-2385

The Fortran language has been commonly used for many kinds of scientific computation. In this paper, we focus on the solution of an unsteady heat conduction equation, which is one of the simplest problems for thermal dynamics. Recently, a GPU (graphics processing unit) has been enhanced with a Fortran programming language capability employing CUDA (compute unified device architecture), known as CUDA Fortran. We find that the speed performance of a system using an ordinary program coding of CUDA Fortran is lower than that of systems using a program coding of CUDA C. We also find that intermediate assembly files PTX (parallel thread execution) of the two languages are not coincident. Therefore, by comparing the PTX files from the two coding programs we could detect the bottleneck that causes the speed reduction. We propose three optimization techniques that can enable the calculated speeds using CUDA Fortran and CUDA C to be coincident. The optimizations can be performed by the Fortran language when improved by an analyzed PTX file. It is thus possible to improve the performance of CUDA Fortran by adding a correction to it, which happens to be at a programming language level. 相似文献

18.

XpressSpace: a programming framework for coupling partitioned global address space simulation codes

Fan Zhang Ciprian Docan Hoang Bui Manish Parashar Scott Klasky 《Concurrency and Computation》2014,26(3):644-661

Complex coupled multiphysics simulations are playing increasingly important roles in scientific and engineering applications such as fusion, combustion, and climate modeling. At the same time, extreme scales, increased levels of concurrency, and the advent of multicores are making programming of high‐end parallel computing systems on which these simulations run challenging. Although partitioned global address space (PGAS) languages attempt to address the problem by providing a shared memory abstraction for parallel processes within a single program, the PGAS model does not easily support data coupling across multiple heterogeneous programs, which is necessary for coupled multiphysics simulations. This paper explores how multiphysics‐coupled simulations can be supported by the PGAS programming model. Specifically, in this paper, we present the design and implementation of the XpressSpace programming system, which extends existing PGAS data sharing and data access models with a semantically specialized shared data space abstraction to enable data coupling across multiple independent PGAS executables. XpressSpace supports a global‐view style programming interface that is consistent with the PGAS memory model, and provides an efficient runtime system that can dynamically capture the data decomposition of global‐view data‐structures such as arrays, and enable fast exchange of these distributed data‐structures between coupled applications. In this paper, we also evaluate the performance and scalability of a prototype implementation of XpressSpace by using different coupling patterns extracted from real world multiphysics simulation scenarios, on the Jaguar Cray XT5 system at Oak Ridge National Laboratory. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

19.

Optimizing the Cray Graph Engine for performant analytics on cluster,SuperDome Flex,Shasta systems and cloud deployment

Christopher D. Rickett Kristyn J. Maschhoff Sreenivas R. Sukumar 《Concurrency and Computation》2024,36(10):e7982

We present updates to the Cray Graph Engine, a high performance in-memory semantic graph database, which enable performant execution across multiple architectures as well as deployment in a container to support cloud and as-a-service graph analytics. This paper discusses the changes required to port and optimize CGE to target multiple architectures, including Cray Shasta systems, large shared-memory machines such as SuperDome Flex (SDF), and cluster environments such as Apollo systems. The porting effort focused primarily on removing dependences on XPMEM and Cray PGAS and replacing these with a simplified PGAS library based upon POSIX shared memory and one-sided MPI, while preserving the existing Coarray-C++ CGE code base. We also discuss the containerization of CGE using Singularity and the techniques required to enable container performance matching native execution. We present early benchmarking results for running CGE on the SDF, Infiniband clusters and Slingshot interconnect-based Shasta systems. 相似文献

20.

Parallelizing a GIS on a shared address space architecture

Shekhar S. Ravada S. Kumar V. Chubb D. Turner G. 《Computer》1996,29(12):42-48

We are developing a high-performance GIS (our term for a parallel GIS) on an SGI Challenge, a 16-processor machine with a shared address space architecture (SASA). We describe how we parallelized a key GIS operation using a message-passing algorithm. We advocate the linking of two diverse approaches to the design of parallel architectures and algorithms. As part of our project, we evaluated the effect of parallelizing an important GIS operation: range query. We parallelized a range query using data partitioning (to reduce synchronization) and dynamic load balancing (to improve speedup). We found that these approaches do achieve the performance required for many GIS applications 相似文献