共查询到20条相似文献,搜索用时 15 毫秒
1.
Simulation has become an indispensable tool for researchers to explore systems without having recourse to real experiments. Depending on the characteristics of the modeled system, methods used to represent the system may vary. Multi-agent systems are often used to model and simulate complex systems. In any cases, increasing the size and the precision of the model increases the amount of computation, requiring the use of parallel systems when it becomes too large. In this paper, we focus on parallel platforms that support multi-agent simulations and their execution on high performance resources as parallel clusters. Our contribution is a survey on existing platforms and their evaluation in the context of high performance computing. We present a qualitative analysis of several multi-agent platforms, their tests in high performance computing execution environments, and the performance results for the only two platforms that fulfill the high performance computing constraints. 相似文献
2.
3.
Ifeanyi P. Egwutuoha Shiping Chen David Levy Bran Selic Rafael Calvo 《International Journal of Parallel, Emergent and Distributed Systems》2014,29(4):363-378
Cloud computing offers new computing paradigms, capacity and flexible solutions to high performance computing (HPC) applications. For example, Hardware as a Service (HaaS) allows users to provide a large number of virtual machines (VMs) for computation-intensive applications using the HaaS model. Due to the large number of VMs and electronic components in HPC system in the cloud, any fault during the execution would result in re-running the applications, which will cost time, money and energy. In this paper we presented a proactive fault tolerance (FT) approach to HPC systems in the cloud to reduce the wall-clock execution time and dollar cost in the presence of faults. We also developed a generic FT algorithm for HPC systems in the cloud. Our algorithm does not rely on a spare node prior to prediction of a failure. We also developed a cost model for executing computation-intensive applications on HPC systems in the cloud. We analysed the dollar cost of provisioning spare nodes and checkpointing FT to assess the value of our approach. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in cloud can be reduced by as much as 30%. The frequency of checkpointing of computation-intensive applications can be reduced up to 50% with our FT approach for HPC in the cloud compared with current FT approaches. 相似文献
4.
George Bosilca Rémi Delmas Jack Dongarra Julien Langou 《Journal of Parallel and Distributed Computing》2009
We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique [K. Huang, J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers (Spec. Issue Reliable & Fault-Tolerant Comp.) 33 (1984) 518–528] to the need of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flip) on the fly of a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix–matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix–matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result while one process failure has happened. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly. 相似文献
5.
针对高性能计算系统中故障定位难度高且实时性差的问题,提出了一种基于消息传递的故障定位框架(MPFL),包括基于树形拓扑的故障检测(TFD)和故障分析(TFA)算法。首先,在并行作业初始化时,将所有参与计算的节点进行逻辑上的树形划分,生成故障定位树(FLT),并将故障定位任务分布到节点上;然后,当消息库、操作系统等组件检测到节点异常状态时,基于TFD算法分析作业的FLT结构,根据负载平衡、性能开销等因素选择接收异常状态的节点;最后,节点利用TFA算法对接收到的异常状态进行推理得出故障,TFA算法使用基于规则的事件关联,并基于消息传递设计轻量级的主动探测,将两种方式相结合,提高了故障分析的准确性。实验以模拟节点停机故障为定位目标,并以NPB-FT与NPB-IS为基准测试,在集群上对MPFL框架进行了评估。实验结果表明,MPFL框架在故障定位能力与开销节省方面表现突出。 相似文献
6.
There has been an increasing research interest in extending the use of Java towards high‐performance demanding applications such as scalable Web servers, distributed multimedia applications, and large‐scale scientific applications. However, extending Java to a multicomputer environment and improving the low performance of current Java implementations pose great challenges to both the systems developer and application designer. In this survey, we describe and classify 14 relevant proposals and environments that tackle Java's performance bottlenecks in order to make the language an effective option for high‐performance network‐based computing. We further survey significant performance issues while exposing the potential benefits and limitations of current solutions in such a way that a framework for future research efforts can be established. Most of the proposed solutions can be classified according to some combination of three basic parameters: the model adopted for inter‐process communication, language extensions, and the implementation strategy. In addition, where appropriate to each individual proposal, we examine other relevant issues, such as interoperability, portability, and garbage collection. Copyright © 2002 John Wiley & Sons, Ltd. 相似文献
7.
宇宙射线辐射所导致的瞬态故障一直是航天计算面临的最主要挑战之一.而随着集成电路制造工艺的持续进步,现代处理器的性能在大幅度提高的同时,其可信性也正日益面临着瞬态故障的严重威胁.当前针对瞬态故障的容错技术可大致分为两类:基于硬件实现和基于软件实现.相比较前者,后者由于在实现成本和灵活性等方面的优势而备受关注.本文首先概述... 相似文献
8.
流式计算是大数据的一种重要计算模式,大数据流式计算已成为研究热点。任务管理是大数据流式计算的核心功能之一,负责对流式计算的任务进行资源调度及全生命周期管理。目前对于大数据流式计算的技术调研工作主要集中于流式计算应用需求、体系结构及整体技术,缺乏对大数据流式计算任务管理技术的精细化调研分析。首先给出流式计算任务管理的抽象功能模型,其次基于该模型对任务管理的关键技术进行了分类和综述,最后对既有主流的大数据流式计算系统对上述关键技术的应用、集成和优化进行了调研分析。 相似文献
9.
研究了集群的系统结构和主要优势,以及集群式高性能计算系统的诞生;分析了集群式高性能计算系统的架构和构建方式,集群构建包括网络部署、存储系统、计算节点、管理节点、登录节点等部分。在此基础上构建了基于Linux的集群式高性能计算系统。 相似文献
10.
Hameed Hussain Saif Ur Rehman Malik Abdul Hameed Samee Ullah Khan Gage Bickler Nasro Min-Allah Muhammad Bilal Qureshi Limin Zhang Wang Yongji Nasir Ghani Joanna Kolodziej Albert Y. Zomaya Cheng-Zhong Xu Pavan Balaji Abhinav Vishnu Fredric Pinel Johnatan E. Pecero Dzmitry Kliazovich Pascal Bouvry Hongxiang Li Lizhe Wang Dan Chen Ammar Rayes 《Parallel Computing》2013
An efficient resource allocation is a fundamental requirement in high performance computing (HPC) systems. Many projects are dedicated to large-scale distributed computing systems that have designed and developed resource allocation mechanisms with a variety of architectures and services. In our study, through analysis, a comprehensive survey for describing resource allocation in various HPCs is reported. The aim of the work is to aggregate under a joint framework, the existing solutions for HPC to provide a thorough analysis and characteristics of the resource management and allocation strategies. Resource allocation mechanisms and strategies play a vital role towards the performance improvement of all the HPCs classifications. Therefore, a comprehensive discussion of widely used resource allocation strategies deployed in HPC environment is required, which is one of the motivations of this survey. Moreover, we have classified the HPC systems into three broad categories, namely: (a) cluster, (b) grid, and (c) cloud systems and define the characteristics of each class by extracting sets of common attributes. All of the aforementioned systems are cataloged into pure software and hybrid/hardware solutions. The system classification is used to identify approaches followed by the implementation of existing resource allocation strategies that are widely presented in the literature. 相似文献
11.
Chouhan Siddharth Singh Kaul Ajay Singh Uday Pratap 《Multimedia Tools and Applications》2018,77(21):28483-28537
Multimedia Tools and Applications - Image segmentation is the method of partitioning an image into a group of pixels that are homogenous in some manner. The homogeneity dependents on some... 相似文献
12.
Zhengxiong HOU Hong SHEN Xingshe ZHOU Jianhua GU Yunlan WANG Tianhai ZHAO 《Frontiers of Computer Science》2022,16(5):165107
Nowadays, high-performance computing (HPC) clusters are increasingly popular. Large volumes of job logs recording many years of operation traces have been accumulated. In the same time, the HPC cloud makes it possible to access HPC services remotely. For executing applications, both HPC end-users and cloud users need to request specific resources for different workloads by themselves. As users are usually not familiar with the hardware details and software layers, as well as the performance behavior of the underlying HPC systems. It is hard for them to select optimal resource configurations in terms of performance, cost, and energy efficiency. Hence, how to provide on-demand services with intelligent resource allocation is a critical issue in the HPC community. Prediction of job characteristics plays a key role for intelligent resource allocation. This paper presents a survey of the existing work and future directions for prediction of job characteristics for intelligent resource allocation in HPC systems. We first review the existing techniques in obtaining performance and energy consumption data of jobs. Then we survey the techniques for single-objective oriented predictions on runtime, queue time, power and energy consumption, cost and optimal resource configuration for input jobs, as well as multi-objective oriented predictions. We conclude after discussing future trends, research challenges and possible solutions towards intelligent resource allocation in HPC systems. 相似文献
13.
This paper describes an approach to providing software fault tolerance for future deep‐space robotic National Aeronautics and Space Administration missions, which will require a high degree of autonomy supported by an enhanced on‐board computational capability. We focus on introspection‐based adaptive fault tolerance guided by the specific requirements of applications. Introspection supports monitoring of the program execution with the goal of identifying, locating, and analyzing errors. Fault tolerance assertions for the introspection system can be provided by the user, domain‐specific knowledge, or via the results of static or dynamic program analysis. This work is part of an on‐going project at the Jet Propulsion Laboratory in Pasadena, California. Copyright © 2011 John Wiley & Sons, Ltd. 相似文献
14.
This paper introduces the design of a hyper parallel processing (HPP) controller, which is a system controller used in heterogeneous high performance computing systems. It connects several heterogeneous processors via HyperTransport (HT) interfaces, a commercial Infiniband HCA card with PCI-express interface, and a customized global synchronization network with self-defined high-speed interface. To accelerate intra-node communication and synchronization, global address space is supported and some dedicated hardware is integrated in the HPP controller to enable intra-node memory and shared I/O resources. On the prototype system with the HPP controller, evaluation results show that the proposed design achieves high communication efficiency, and obvious acceleration to synchronization operations. 相似文献
15.
Yao SONG Limin XIAO Liang WANG Guangjun QIN Bing WEI Baicheng YAN Chenhao ZHANG 《Frontiers of Computer Science》2022,16(5):165105
Wide-area high-performance computing is widely used for large-scale parallel computing applications owing to its high computing and storage resources. However, the geographical distribution of computing and storage resources makes efficient task distribution and data placement more challenging. To achieve a higher system performance, this study proposes a two-level global collaborative scheduling strategy for wide-area high-performance computing environments. The collaborative scheduling strategy integrates lightweight solution selection, redundant data placement and task stealing mechanisms, optimizing task distribution and data placement to achieve efficient computing in wide-area environments. The experimental results indicate that compared with the state-of-the-art collaborative scheduling algorithm HPS+, the proposed scheduling strategy reduces the makespan by 23.24%, improves computing and storage resource utilization by 8.28% and 21.73% respectively, and achieves similar global data migration costs. 相似文献
16.
Steve C. Chiu 《The Journal of supercomputing》2008,46(2):105-107
The abundance of parallel and distributed computing platforms, such as MPP, SMP, and the Beowulf clusters, to name just a
few, has added many more possibilities and challenges to high performance computing (HPC), parallel I/O, mass data storage,
scalable architectures, and large-scale simulations, which traditionally belong to the realm of custom-tailored parallel systems.
The intent of this special issue is to discuss problems and solutions, to identify new issues, and to help shape future research
directions in these areas. From these perspectives, this special issue addresses the problems encountered at the hardware,
architectural, and application levels, while providing conceptual as well as empirical treatments to the current issues in
high performance computing, and the I/O architectures and systems utilized therein. 相似文献
17.
Thomas Brady Jack Dongarra Michele Guidolin Alexey Lastovetsky Keith Seymour 《Concurrency and Computation》2010,22(18):2467-2487
The paper presents the SmartGridRPC model, an extension of the GridRPC model, which aims to achieve higher performance. The traditional GridRPC provides a programming model and API for mapping individual tasks of an application in a distributed Grid environment, which is based on the client‐server model characterized by the star network topology. SmartGridRPC provides a programming model and API for mapping a group of tasks of an application in a distributed Grid environment, which is based on the fully connected network topology. The SmartGridRPC programming model and API and its performance advantages over the GridRPC model are outlined in this paper. In addition, experimental results using a real‐world application are also presented. Copyright © 2010 John Wiley & Sons, Ltd. 相似文献
18.
MRPC is an RPC system that is designed and optimized for MPMD parallel computing. Existing systems based on standard RPC incur an unnecessarily high cost when used on high‐performance multi‐computers, limiting the appeal of RPC‐based languages in the parallel computing community. MRPC combines the efficient control and data transfer provided by Active Messages (AM) with a minimal multithreaded runtime system that extends AM with the features required to support MPMD. This approach introduces only the necessary RPC overheads for an MPMD environment. MRPC has been integrated into Compositional C++ (CC++), a parallel extension of C++ that offers an MPMD programming model. Basic performance in MRPC is within a factor of two from those of Split‐C, a highly tuned SPMD language, and other messaging layers. CC++ applications perform within a factor of two to six from comparable Split‐C versions, which represent an order of magnitude improvement over previous CC++ implementations. Copyright © 1999 John Wiley & Sons, Ltd. 相似文献
19.
The use of a network of shared, heterogeneous workstations each harboring a reconfigurable computing (RC) system offers high performance users an inexpensive platform for a wide range of computationally demanding problems. However, effectively using the full potential of these systems can be challenging without the knowledge of the system's performance characteristics. While some performance models exist for shared, heterogeneous workstations, none thus far account for the addition of RC systems. Our analytic performance model includes the effects of the reconfigurable device, application load imbalance, background user load, basic message passing communication, and processor heterogeneity. The methodology proves to be accurate in characterizing these effects for applications running on shared, homogeneous, and heterogeneous HPRC resources. The model error in all cases was found to be less than 5% for application runtimes greater than 30 s, and less than 15% for runtimes less than 30 s. 相似文献
20.
High performance computing problems usually have the characteristics of parallelization of subtasks, and a lot of computing resources are consumed in the process of execution. It has been proved that traditional cloud computing based on virtual machine can deal with such problems, but the management of distributed environment and the distributed design of solutions make the processing more complex. Function computing is a new type of serverless cloud computing paradigm, its automatic expansion and considerable computing resources can be well combined with HPC problems. However, the cold start delay is an unavoidable problem on the public cloud function computing platform, especially in the task of HPC problems having high concurrent jobs of which delay will be further magnified. In this paper, we first analyze the completion time of a simple HPC task under cold start and hot start conditions, and analyze the causes of additional delay. According to these analyses, we combine the time series ana lysis tools and the platform's automatic expansion mechanism to propose an effective preheating method, which can effectively reduce the cold start delay of HPC tasks on the function computing platform. 相似文献