Similar Documents
20 similar documents found.
1.
The use of a network of shared, heterogeneous workstations, each harboring a reconfigurable computing (RC) system, offers high-performance users an inexpensive platform for a wide range of computationally demanding problems. However, exploiting the full potential of these systems is challenging without knowledge of the system's performance characteristics. While some performance models exist for shared, heterogeneous workstations, none thus far account for the addition of RC systems. Our analytic performance model includes the effects of the reconfigurable device, application load imbalance, background user load, basic message-passing communication, and processor heterogeneity. The methodology proves accurate in characterizing these effects for applications running on shared, homogeneous, and heterogeneous HPRC resources. The model error in all cases was found to be less than 5% for application runtimes greater than 30 s, and less than 15% for runtimes under 30 s.
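As a rough illustration of how such an analytic model can be evaluated, the hedged sketch below predicts a parallel runtime as the slowest node's compute time plus a communication term, with background load discounting each node's effective speed. The formula and all parameter names are illustrative assumptions, not the paper's actual model.

```c
/* Hedged sketch: a simplified analytic runtime predictor in the spirit of
 * the model above. The formula (max over nodes of compute time plus a
 * flat communication term) is an assumption for illustration only. */
#include <stdio.h>

double predict_runtime(int n, const double work[], const double speed[],
                       const double bg_load[], double t_comm)
{
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        /* Background user load shrinks the share of the node we get. */
        double eff_speed = speed[i] / (1.0 + bg_load[i]);
        double t = work[i] / eff_speed;   /* per-node compute time */
        if (t > worst) worst = t;         /* load imbalance: slowest node wins */
    }
    return worst + t_comm;                /* add message-passing cost */
}

int main(void)
{
    double work[3]  = { 1e9, 1e9, 2e9 };   /* operations per node */
    double speed[3] = { 1e9, 2e9, 2e9 };   /* ops/s, heterogeneous nodes */
    double bg[3]    = { 0.0, 0.5, 0.0 };   /* background load factors */
    printf("predicted runtime: %.2f s\n",
           predict_runtime(3, work, speed, bg, 0.1));
    return 0;
}
```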

2.
Rajkumar Buyya 《Software》2000,30(7):723-739
Workstation/PC clusters have become a cost-effective solution for high performance computing. C-DAC's PARAM 10000 (internal code name OpenFrame) is a large cluster of high-performance workstations interconnected through low-latency, high-bandwidth networks. Managing and controlling such a huge system is a tedious and challenging task, since workstations/PCs are typically designed to work as standalone systems rather than as parts of a cluster. We have designed and developed a tool called PARMON that allows effective monitoring and control of large clusters. It supports the monitoring of critical system resource activities and their utilization at three levels: the entire system, the node, and the component. It also allows the monitoring of multiple instances of the same component; for instance, multiple processors in SMP-type cluster nodes. PARMON is a portable, flexible, interactive, scalable, location-transparent, and comprehensive environment based on client–server technology. Its major components are parmon-server, which provides information on system resource activities and utilization, and parmon-client, a GUI-based client that interacts with parmon-server and with users, gathering data in real time and presenting it graphically for visualization. The client is developed as a Java application; the server is developed as a multithreaded server using C and POSIX/Solaris threads, since Java does not provide interfaces to access system internals. PARMON is regularly used to monitor the PARAM 10000 supercomputer, a cluster of 48+ Ultra-4 workstations running the Solaris operating system. The recent popularity of Beowulf-class clusters (dedicated Linux clusters) in terms of price–performance ratio motivated us to port PARMON to Linux (accomplished by porting the system-dependent portions of parmon-server). This enables management and monitoring of both Solaris- and Linux-based clusters (federated clusters) through a single user interface. Copyright © 2000 John Wiley & Sons, Ltd.
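The abstract describes the server side as a C program that exposes system internals to a remote Java GUI. The minimal sketch below, assuming a Linux /proc filesystem and a plain TCP connection, illustrates that server pattern only; the real PARMON protocol and message format are not given in the abstract.

```c
/* Hedged sketch of the parmon-server pattern: read a node-level metric
 * and serve it to clients over TCP. Illustrative only, not PARMON's
 * actual protocol; error handling is omitted for brevity. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9999);           /* arbitrary illustrative port */
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 8);
    for (;;) {
        int cli = accept(srv, NULL, NULL);
        char buf[128] = { 0 };
        FILE *f = fopen("/proc/loadavg", "r");   /* node utilization metric */
        if (f) { fgets(buf, sizeof buf, f); fclose(f); }
        write(cli, buf, strlen(buf));            /* ship metric to the GUI client */
        close(cli);
    }
}
```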

3.
Driven by the ever-growing demand for high performance computing from scientific research and commercial applications, HPC systems have advanced rapidly in both performance and scale. However, sharply rising power consumption severely constrains the design and operation of HPC systems, making low-power techniques a key technology in the field. As the core component of the whole system, the job scheduling system allocates user-submitted jobs to the limited system resources, so its energy efficiency plays a crucial role in controlling and regulating the energy consumption of the entire HPC system. This paper first introduces the main energy-efficiency techniques and commonly used job scheduling strategies, then analyzes the energy efficiency of current HPC job scheduling, and discusses the challenges it faces and directions for future development.

4.
This article studies the performance and scalability of a geometric multigrid solver implemented within the hierarchical hybrid grids (HHG) software package on current high performance computing clusters with up to nearly 300,000 cores. HHG is based on unstructured tetrahedral finite elements that are regularly refined to obtain a block-structured computational grid. One challenge is the parallel mesh generation from an unstructured input grid that roughly approximates a human head within a 3D magnetic resonance imaging data set. This grid is then regularly refined to create the HHG grid hierarchy. As test platforms, a BlueGene/P cluster located at the Jülich supercomputing center and an Intel Xeon 5650 cluster located at the local computing center in Erlangen are chosen. To estimate the quality of the implementation and to predict the runtime of the multigrid solver, a detailed performance and communication model is developed and used to evaluate the measured single-node performance as well as weak and strong scaling experiments on both clusters. Thus, for a given problem size, one can predict the number of compute nodes that minimizes the overall runtime of the multigrid solver. Overall, HHG scales up to the full machines, where the biggest linear system solved on Jugene had more than one trillion unknowns. Copyright © 2012 John Wiley & Sons, Ltd.

5.
We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique [K. Huang, J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers (Spec. Issue Reliable & Fault-Tolerant Comp.) 33 (1984) 518–528] to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance that can also detect and correct errors (bit flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix–matrix multiplication subroutine, and we propose models to predict its running time. Our parallel fault-tolerant matrix–matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even when a process failure occurs. This represents 65% of the machine's peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as the processor count increases, the overhead of the fault tolerance drops significantly.
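The cited Huang–Abraham technique encodes checksums into the operands so that an error in C = A·B violates an invariant that can be re-checked cheaply. The sketch below shows the detection half of that idea on a toy 3×3 case; the distributed, failure-recovering version in the paper is far more involved.

```c
/* Hedged sketch of the algorithm-based fault tolerance (ABFT) idea cited
 * above: column sums of C = A*B must equal (column sums of A) * B, so a
 * corrupted entry of C breaks the invariant. Sizes are toy values. */
#include <math.h>
#include <stdio.h>

#define N 3

int main(void)
{
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double C[N][N] = {{0}}, colsumA[N] = {0}, check[N] = {0};

    /* Checksum row of A: per-column sums (the extra encoded row in ABFT). */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) colsumA[j] += A[i][j];

    /* Ordinary product C = A*B. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) C[i][j] += A[i][k] * B[k][j];

    /* C[1][1] += 1.0;  // inject a bit-flip here to see detection fire */

    /* ABFT invariant: column sums of C must equal colsumA * B. */
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) check[j] += colsumA[k] * B[k][j];
    for (int j = 0; j < N; j++) {
        double got = 0;
        for (int i = 0; i < N; i++) got += C[i][j];
        if (fabs(got - check[j]) > 1e-9)
            printf("error detected in column %d\n", j);
    }
    puts("checksum verification done");
    return 0;
}
```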

6.
The growth in demand for web services requires increasing processing capacity to maintain adequate response times to customer requests and, consequently, results in increased energy consumption to support this infrastructure. This work addresses energy saving in large-scale web server clusters, towards the construction of "green" data centers. Reducing energy consumption is fundamental for both economic and environmental reasons: energy generation has high costs and produces millions of tons of carbon. However, while saving energy, the quality of service offered to customers should be kept above an acceptable minimum level. Our solution involves optimization techniques, the use of Dynamic Voltage and Frequency Scaling (DVFS) technology, and a hierarchical architecture that uses a heuristic approach to define the cluster configuration at each instant.
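On Linux servers, DVFS decisions like these are typically actuated through the cpufreq sysfs interface. The hedged sketch below shows that mechanism only; it assumes the "userspace" governor is active and write permission on sysfs, and it does not reproduce the paper's heuristic or optimization logic.

```c
/* Hedged sketch: setting a core's frequency through the Linux cpufreq
 * sysfs interface, the OS-level mechanism behind DVFS-based schemes.
 * Assumes the "userspace" governor is active and root permission. */
#include <stdio.h>

static int set_cpu_khz(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;                 /* needs root and userspace governor */
    fprintf(f, "%ld\n", khz);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Drop core 0 to 1.2 GHz during a low-traffic period (illustrative). */
    if (set_cpu_khz(0, 1200000) != 0)
        perror("scaling_setspeed");
    return 0;
}
```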

7.
Simulation has become an indispensable tool for researchers who want to explore systems without resorting to real experiments. Depending on the characteristics of the modeled system, the methods used to represent it may vary. Multi-agent systems are often used to model and simulate complex systems, and in any case, increasing the size and precision of the model increases the amount of computation, requiring the use of parallel systems when it becomes too large. In this paper, we focus on parallel platforms that support multi-agent simulations and their execution on high-performance resources such as parallel clusters. Our contribution is a survey of existing platforms and their evaluation in the context of high performance computing. We present a qualitative analysis of several multi-agent platforms, their tests in high performance computing execution environments, and the performance results for the only two platforms that fulfill the high performance computing constraints.

8.
9.
Multidisciplinary Design Optimization of a vehicle system for safety, NVH (noise, vibration and harshness) and weight, in a scalable HPC environment, is addressed. High performance computing, utilizing several hundred processors in conjunction with approximation methods, formal MDO strategies and engineering judgement, is effectively used to obtain superior design solutions with significantly reduced elapsed computing times. The increased computational complexity in this MDO work is due to addressing multiple safety modes including frontal crash, offset crash, side impact and roof crush, in addition to the NVH discipline, all with detailed, high-fidelity models and analysis tools. The reduction in large-scale MDO solution times through HPC is significant in that it now makes it possible for such technologies to impact the vehicle design cycle and improve engineering productivity.

10.
王鹏  周岩 《计算机应用》2018,38(12):3496-3499
Targeting Message Passing Interface (MPI) application scenarios in high performance computing, and in order to improve MPI's existing centralized data management model and strengthen its big-data processing capability, a data storage component based on MPI (MPI-DSP) suited to big-data processing was designed and developed, drawing on ideas from parallel and distributed systems. First, an interface function is created to realize the design goal of "moving computation to storage" with minimal impact on the MPI system, separating file allocation from computation so that MPI can break through the network-transfer bottleneck when reading large data files. The design goals, operating mechanism, and implementation strategy are then analyzed and described, and the design concept is validated by describing the use of the interface function MPI_Open in an MPI environment. A Wordcount experiment comparing the time performance of the MPI-DSP component against the original MPI in data-file processing provides preliminary verification of the feasibility of MPI's "computation migrates to storage" model, giving MPI big-data processing capability in high-performance application scenarios. The applicable environments and limitations of MPI-DSP are also analyzed, delimiting its scope of application.
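The abstract names an interface function MPI_Open but does not give its signature. The sketch below is a purely illustrative prototype, built on the standard MPI-IO call MPI_File_open, showing where a placement-aware open could slot in; the parameters and fallback behavior are assumptions, not MPI-DSP's actual design.

```c
/* Hedged sketch: a hypothetical prototype for the MPI_Open interface
 * function named in the abstract. The real MPI-DSP signature is unknown;
 * this version simply falls back to standard MPI-IO. */
#include <mpi.h>

/* Hypothetical entry point: open a data file so that each rank could be
 * steered to the node already holding its portion of the data. */
int MPI_Open(MPI_Comm comm, const char *path, MPI_File *fh)
{
    /* A real implementation would consult placement metadata here and
     * bind ranks to local fragments; we fall back to plain MPI-IO. */
    return MPI_File_open(comm, path, MPI_MODE_RDONLY, MPI_INFO_NULL, fh);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_File fh;
    if (MPI_Open(MPI_COMM_WORLD, "input.dat", &fh) == MPI_SUCCESS)
        MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```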

11.
李春艳  张学杰 《计算机应用》2013,33(12):3580-3585
Cloud computing is a new model of Internet-based resource utilization that provides a variety of IT services, and it has been widely applied in many fields, including high performance computing. However, virtualization introduces performance overheads, and different cloud platforms implement virtualization differently, so the performance of high-performance computing services varies widely across these platforms. Using the HPC Challenge (HPCC) Benchmark and the NAS Parallel Benchmark (NPB) to evaluate CPU, memory, network, scalability, and realistic high-performance computing workloads, this paper compares and analyzes the HPC performance of platforms such as Nimbus, OpenNebula, and OpenStack. Experiments show that OpenStack delivers better performance for compute-intensive high-performance workloads, making it a good choice of open-source cloud platform for implementing high performance computing.

12.
Parallel and distributed simulation is a powerful tool for developing complex agent-based simulations. Complex simulations require parallel and distributed high performance computing solutions, because sequential solutions cannot deliver answers within a feasible total execution time. Therefore, for the advance of computing science, it is important that High Performance Computing (HPC) techniques and solutions be proposed and studied. In the literature, we can find agent-based modeling and simulation tools that use HPC; however, none of these tools is designed to let the HPC expert propose new techniques and solutions without great effort. In this paper, we introduce Care High Performance Simulation (HPS), a scientific instrument that enables researchers to: (1) develop techniques and solutions for high-performance distributed simulations of agent-based models; and (2) study, design, and implement complex agent-based models that require HPC solutions. Care HPS was designed to make developing new agent-based models easy and quick, and to be extensible with new solutions for the main issues of parallel and distributed simulation, such as synchronization, communication, load and computation balancing, and partitioning algorithms. We conducted experiments with the aim of showing the completeness and functionality of Care HPS. The results show that Care HPS can serve as a scientific instrument for advancing the field of agent-based parallel and distributed simulations.

13.
A complete and efficient CUDA-sharing solution for HPC clusters
In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, and also permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrator's policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper, with a series of benchmarks and a production application, clearly demonstrates the viability of the proposal. Concretely, experiments with the matrix–matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and the data transfer performance of similar GPU virtualization frameworks are also evaluated.
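Because rCUDA virtualizes the CUDA runtime API, an ordinary host program like the hedged sketch below needs no source changes to use remote GPUs: the framework intercepts the calls and forwards them to GPUs on other nodes. Which remote devices are visible is configured outside the program (rCUDA uses environment variables for this; the exact settings are omitted here).

```c
/* Hedged sketch: unmodified CUDA runtime code. Under rCUDA, "device 0"
 * may live on another node; the program cannot tell the difference. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);            /* may enumerate remote GPUs */
    for (int i = 0; i < n; i++) {
        struct cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("device %d: %s\n", i, p.name);  /* physical location is hidden */
    }
    return 0;
}
```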

14.
This paper presents a novel way to control power consumption and performance in a multi-tier server cluster designed for e-commerce applications. The requests submitted to these server systems have a soft real-time constraint: although some may miss a pre-defined deadline, the system can still meet an agreed-upon performance level. Clusters of servers are in widespread use, and the steep increase in their total power consumption has raised economic and environmental concerns. We present ways of decreasing power expenditure and show the implementation of a SISO (Single Input Single Output) controller that acts on the speed of all server nodes capable of dynamic voltage and frequency scaling (DVFS), with QoS (Quality of Service) as the reference setpoint. For QoS, we use the request tardiness, defined as the ratio of the end-to-end response time to the deadline, rather than the usual metric that counts missed deadlines. We adjust the servers' operating frequencies to guarantee that a pre-defined p-quantile of the tardiness probability distribution of the requests meets the deadlines; in this way the QoS is guaranteed to be statistically p. We test this technique in a prototype multi-tier cluster, using open software, commodity hardware, and a standardized e-commerce application to generate a workload close to that of the real world. The main contribution of this paper is to empirically show the robustness of the SISO controller through a sensitivity analysis of its parameters. Experimental results show that our implementation outperforms other published state-of-the-art cluster implementations.
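A minimal sketch of that control structure follows, assuming a simple integral controller and an order-statistic quantile estimate over a sliding window. The gain, window size, and setpoint handling are illustrative assumptions, not the paper's tuned controller.

```c
/* Hedged sketch: a SISO loop that measures the p-quantile of request
 * tardiness (response time / deadline) and nudges a single frequency
 * command for all DVFS-capable nodes. Illustrative parameters only. */
#include <stdlib.h>
#include <stdio.h>

static int cmp(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* p-quantile of observed tardiness values over the last window. */
static double quantile(double *v, int n, double p)
{
    qsort(v, n, sizeof *v, cmp);
    int idx = (int)(p * (n - 1));
    return v[idx];
}

int main(void)
{
    double tard[5] = { 0.4, 0.7, 0.9, 1.1, 0.6 };  /* sample window */
    double freq = 1.6e9, ki = 2e8;                 /* Hz, integral gain */
    double setpoint = 1.0;                         /* quantile == deadline */

    double err = quantile(tard, 5, 0.95) - setpoint;
    freq += ki * err;     /* too tardy -> raise speed; slack -> save power */
    printf("new cluster frequency command: %.2e Hz\n", freq);
    return 0;
}
```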

15.
The abundance of parallel and distributed computing platforms, such as MPPs, SMPs, and Beowulf clusters, to name just a few, has added many more possibilities and challenges to high performance computing (HPC), parallel I/O, mass data storage, scalable architectures, and large-scale simulations, which traditionally belong to the realm of custom-tailored parallel systems. The intent of this special issue is to discuss problems and solutions, to identify new issues, and to help shape future research directions in these areas. From these perspectives, this special issue addresses the problems encountered at the hardware, architectural, and application levels, while providing conceptual as well as empirical treatments of the current issues in high performance computing and the I/O architectures and systems utilized therein.

16.
The current availability of a variety of computing infrastructures, including HPC, Grid, and Cloud resources, provides great computing power for many fields of science, but using them jointly to accomplish large scientific experiments is still a challenge. In this work, we use the paradigm of climate modeling to present the key problems standard applications face when run on hybrid distributed computing infrastructures, and we propose a framework that allows a climate model to take advantage of these resources in a transparent and user-friendly way. Furthermore, an implementation of this framework, using the Weather Research and Forecasting system, is presented as a working example. To illustrate the usefulness of this framework, a realistic climate experiment leveraging Cluster, Grid, and Cloud resources simultaneously has been performed. This test experiment saved more than 75% of the execution time compared to local resources. The framework and tools introduced in this work can be easily ported to other models and are probably useful in other scientific areas employing data- and CPU-intensive applications.

17.
This paper examines the architecture and main advantages of clusters, together with the emergence of cluster-based high performance computing systems. It then analyzes the architecture and construction of such systems: building a cluster involves network deployment, the storage system, compute nodes, management nodes, and login nodes. On this basis, a Linux-based cluster high performance computing system is constructed.
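Once the network, storage, compute, management, and login nodes described above are in place, a minimal MPI program is a common end-to-end smoke test for such a cluster. The sketch below assumes an MPI implementation (e.g., Open MPI or MPICH) is installed; it is generic and not tied to the paper's specific system.

```c
/* Hedged sketch: a minimal MPI smoke test for a freshly built Linux
 * cluster. Each rank reports its node, confirming that the scheduler,
 * network, and MPI stack work together end to end. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d on compute node %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
```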

18.
To improve cluster efficiency, data placement and task scheduling must take the performance of individual cluster nodes into account. In a heterogeneous cluster, node performance varies widely, which makes evaluating node performance very challenging. Benchmarks can be used to evaluate node performance, but different benchmarks assess nodes from different angles. The PageRank algorithm, used by Google to rank web sites, has also been applied to evaluate the influence of books, user behavior, and so on. This paper proposes a novel PageRank-based node performance evaluation algorithm that makes full use of the results of different benchmarks. Each node is first evaluated with mainstream benchmarks such as LINPACK, NPB, and IOzone; the PageRank algorithm then processes the results of each benchmark to obtain the node's performance. To apply PageRank, a graph model is built and the performance vector and probability transition matrix are computed. The algorithm has low computational complexity and yields a good composite evaluation.
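The core of the approach is the standard damped PageRank power iteration over a probability transition matrix. The hedged sketch below runs that iteration on a toy 3×3 column-stochastic matrix standing in for the node/benchmark graph; the abstract does not spell out how the real matrix is built from the LINPACK, NPB, and IOzone results, so its construction here is an assumption.

```c
/* Hedged sketch: damped PageRank power iteration. The transition matrix
 * is a toy stand-in for the node/benchmark graph described above. */
#include <stdio.h>
#include <math.h>

#define N 3

int main(void)
{
    /* Column-stochastic transition matrix (column j = outgoing weights). */
    double M[N][N] = {{0.0, 0.5, 0.5},
                      {0.5, 0.0, 0.5},
                      {0.5, 0.5, 0.0}};
    double r[N] = {1.0/N, 1.0/N, 1.0/N}, d = 0.85;  /* damping factor */

    for (int it = 0; it < 100; it++) {
        double next[N];
        for (int i = 0; i < N; i++) {
            double s = 0;
            for (int j = 0; j < N; j++) s += M[i][j] * r[j];
            next[i] = (1.0 - d) / N + d * s;   /* damped PageRank update */
        }
        double delta = 0;
        for (int i = 0; i < N; i++) { delta += fabs(next[i] - r[i]); r[i] = next[i]; }
        if (delta < 1e-12) break;              /* converged */
    }
    for (int i = 0; i < N; i++)
        printf("node %d performance score: %.4f\n", i, r[i]);
    return 0;
}
```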

19.
The paper presents a new open-source framework called KernelHive for multilevel parallelization of computations among various clusters, cluster nodes, and, finally, among both CPUs and GPUs for a particular application. An application is modeled as a directed acyclic graph whose nodes can run in parallel and can be expanded automatically (node unrolling) depending on the number of computation units available. A methodology is proposed for parallelizing an application and mapping it onto the environment; it includes selection of devices using a chosen optimizer, selection of the best grid configurations for compute devices, and optimization of data partitioning and execution. One of possibly many scheduling algorithms can be selected, considering execution time, power consumption, and so on. An easy-to-use GUI is provided for modeling and monitoring, with a repository of ready-to-use constructs and computational kernels. The methodology, execution times, and scalability have been demonstrated for a distributed and parallel password-breaking example run in a heterogeneous environment with a cluster and servers with different numbers of nodes and both CPUs and GPUs. Additionally, the performance of the framework has been compared with an MPI + OpenCL implementation using a parallel geospatial interpolation application employing up to 40 cluster nodes and 320 cores. Copyright © 2015 John Wiley & Sons, Ltd.

20.
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node, and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL's device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation.
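The work-group distribution decision discussed above ultimately comes down to the global and local sizes passed to clEnqueueNDRangeKernel. The fragment below is generic OpenCL, not the paper's LU code; `queue` and `kernel` are assumed to have been created already, and the specific sizes are illustrative.

```c
/* Hedged sketch: choosing a work-group size per device type before
 * launching an OpenCL kernel. Sizes are illustrative assumptions; the
 * global size must be divisible by the local size. */
#include <CL/cl.h>

void launch(cl_command_queue queue, cl_kernel kernel, cl_device_type type)
{
    size_t global = 1048576;                    /* total work-items */
    /* CPUs often prefer few large groups; GPUs many small ones. */
    size_t local = (type == CL_DEVICE_TYPE_GPU) ? 128 : 16;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global, &local, 0, NULL, NULL);
}
```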
