首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Symbolic computation has underpinned a number of key advances in Mathematics and Computer Science. Applications are typically large and potentially highly parallel, making them good candidates for parallel execution at a variety of scales from multi‐core to high‐performance computing systems. However, much existing work on parallel computing is based around numeric rather than symbolic computations. In particular, symbolic computing presents particular problems in terms of varying granularity and irregular task sizes that do not match conventional approaches to parallelisation. It also presents problems in terms of the structure of the algorithms and data. This paper describes a new implementation of the free open‐source GAP computational algebra system that places parallelism at the heart of the design, dealing with the key scalability and cross‐platform portability problems. We provide three system layers that deal with the three most important classes of hardware: individual shared memory multi‐core nodes, mid‐scale distributed clusters of (multi‐core) nodes and full‐blown high‐performance computing systems, comprising large‐scale tightly connected networks of multi‐core nodes. This requires us to develop new cross‐layer programming abstractions in the form of new domain‐specific skeletons that allow us to seamlessly target different hardware levels. Our results show that, using our approach, we can achieve good scalability and speedups for two realistic exemplars, on high‐performance systems comprising up to 32000 cores, as well as on ubiquitous multi‐core systems and distributed clusters. The work reported here paves the way towards full‐scale exploitation of symbolic computation by high‐performance computing systems, and we demonstrate the potential with two major case studies. © 2016 The Authors. Concurrency and Computation: Practice and Experience Published by John Wiley & Sons Ltd.  相似文献   

2.
Extreme scale supercomputers available before the end of this decade are expected to have 100 million to 1 billion computing cores. The power and energy efficiency issue has become one of the primary concerns of extreme scale high performance scientific computing. This paper surveys the research on saving power and energy for numerical linear algebra algorithms in high performance scientific computing on supercomputers around the world. We first stress the significance of numerical linear algebra algorithms in high performance scientific computing nowadays, followed by a background introduction on widely used numerical linear algebra algorithms and software libraries and benchmarks. We summarize commonly deployed power management techniques for reducing power and energy consumption in high performance computing systems by presenting power and energy models and two fundamental types of power management techniques: static and dynamic. Further, we review the research on saving power and energy for high performance numerical linear algebra algorithms from four aspects: profiling, trading off performance, static saving, and dynamic saving, and summarize state-of-the-art techniques for achieving power and energy efficiency in each category individually. Finally, we discuss potential directions of future work and summarize the paper.  相似文献   

3.
Distributed Java virtual machine (dJVM) systems enable concurrent Java applications to transparently run on clusters of commodity computers. This is achieved by supporting Java's shared‐memory model over multiple JVMs distributed across the cluster's computer nodes. In this work, we describe and evaluate selective dynamic diffing and lazy home allocation, two new runtime techniques that enable dJVMs to efficiently support memory sharing across the cluster. Specifically, the two proposed techniques can contribute to reduce the overheads due to message traffic, extra memory space, and high latency of remote memory accesses that such dJVM systems require for implementing their memory‐coherence protocol either in isolation or in combination. In order to evaluate the performance‐related benefits of dynamic diffing and lazy home allocation, we implemented both techniques in Cooperative JVM (CoJVM), a basic dJVM system we developed in previous work. In subsequent work, we carried out performance comparisons between the basic CoJVM and modified CoJVM versions for five representative concurrent Java applications (matrix multiply, LU, Radix, fast Fourier transform, and SOR) using our proposed techniques. Our experimental results showed that dynamic diffing and lazy home allocation significantly reduced memory sharing overheads. The reduction resulted in considerable gains in CoJVM system's performance, ranging from 9% up to 20%, in four out of the five applications, with resulting speedups varying from 6.5 up to 8.1 for an 8‐node cluster of computers. Copyright © 2007 John Wiley & Sons, Ltd.  相似文献   

4.
当前处理器由于较高的能量消耗,导致处理器热量散发的提高及系统可靠性的降低,已经成为目前计算机领域较为关心的问题.然而目前一些有效降低能量消耗的技术大多针对单处理器系统,较少考虑多处理器系统.提出的调度算法针对多处理器计算环境,以执行时间最快的任务优先调度为基础,结合其它有效技术(共享空闲时间回收),使得实时任务在其截止期内完成的同时能够有效地减低整个系统的能量消耗.针对独立任务集及具有依赖关系的任务集,提出两种针对同构计算环境的算法:STFBA1(Shortest—Task—First—Based Algorithm)及STFBA2,及两钟针对多任务集的算法HSA1(Hybrid Seheduling Algorithm)及HAS2.在单任务集计算环境下,与目前所知的有效算法相比,算法具有更好的性能(调度长度及能量消耗).在多任务集计算环境下,基于混合调度策略的算法能够明显改进调度性能.  相似文献   

5.
由于科学研究与商业应用等对高性能计算的需求与日俱增,高性能计算的性能和系统规模得到迅速发展。但是,急剧增长的功耗严重限制了高性能计算系统的设计和使用,使得低功耗技术成为高性能计算领域的关键技术。作为整个系统的核心组件,作业调度系统立足有限的系统资源,对用户提交的应用进行作业-资源分配,其能效性对于整个高性能计算系统的能耗控制与调节起到至关重要的作用。首先介绍主要的能量效率技术和常用的作业调度策略,然后对当前高性能计算作业调度能效性进行分析,并讨论了其面临的挑战及未来发展方向。  相似文献   

6.
The power consumption of modern high‐performance computing (HPC) systems that are built using power hungry commodity servers is one of the major hurdles for achieving Exascale computation. Several efforts have been made by the HPC community to encourage the use of low‐powered system‐on‐chip (SoC) embedded processors in large‐scale HPC systems. These initiatives have successfully demonstrated the use of ARM SoCs in HPC systems, but there is still a need to analyze the viability of these systems for HPC platforms before a case can be made for Exascale computation. The major shortcomings of current ARM‐HPC evaluations include a lack of detailed insights about performance levels on distributed multicore systems and performance levels for benchmarking in large‐scale applications running on HPC. In this paper, we present a comprehensive evaluation of results that covers major aspects of server and HPC benchmarking for ARM‐based SoCs. For the experiments, we built an unconventional cluster of ARM Cortex‐A9s that is referred to as Weiser and ran single‐node benchmarks (STREAM, Sysbench, and PARSEC) and multi‐node scientific benchmarks (High‐performance Linpack (HPL), NASA Advanced Supercomputing (NAS) Parallel Benchmark, and Gadget‐2) in order to provide a baseline for performance limitations of the system. Based on the experimental results, we claim that the performance of ARM SoCs depends heavily on the memory bandwidth, network latency, application class, workload type, and support for compiler optimizations. During server‐based benchmarking, we observed that when performing memory intensive benchmarks for database transactions, x86 performed 12% better for multithreaded query processing. However, ARM performed four times better for performance to power ratios for a single core and 2.6 times better on four cores. We noticed that emulated double precision floating point in Java resulted in three to four times slower performance as compared with the performance in C for CPU‐bound benchmarks. Even though Intel x86 performed slightly better in computation‐oriented applications, ARM showed better scalability in I/O bound applications for shared memory benchmarks. We incorporated the support for ARM in the MPJ‐Express runtime and performed comparative analysis of two widely used message passing libraries. We obtained similar results for network bandwidth, large‐scale application scaling, floating‐point performance, and energy‐efficiency for clusters in message passing evaluations (NBP and Gadget 2 with MPJ‐Express and MPICH). Our findings can be used to evaluate the energy efficiency of ARM‐based clusters for server workloads and scientific workloads and to provide a guideline for building energy‐efficient HPC clusters. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

7.
实时多处理器系统中基于能量节约的动态调度算法   总被引:1,自引:0,他引:1  
当前处理器由于较高的能量消耗。导致处理器热量散发的提高及系统可靠性的降低,已经成为目前计算机领域较为关心的问题.然而目前一些有效降低能量消耗的技术大多针对单处理器系统,较少考虑多处理器系统.本文提出的调度算法针对多处理器系统,以最短任务优先调度为基础,结合其它有效技术,如共享空闲时间回收等,使得实时任务在其截止期内完成的同时能够有效地减低整个系统的能量消耗.针对独立任务集及具有依赖关系的任务集,本文提出两种算法:STFBA1及STFBA2(Shortest Task First—Based Algorithm).与目前所知的有效算法相比,我们的算法具有更好的性能(调度长度及能量消耗).  相似文献   

8.

Cloud computing systems are splitting compute- and data-intensive jobs into smaller tasks to execute them in a parallel manner using clusters to improve execution time. However, such systems at increasing scale are exposed to stragglers, whereby abnormally slow running tasks executing within a job substantially affect job performance completion. Such stragglers are a direct threat towards attaining fast execution of data-intensive jobs within cloud computing. Researchers have proposed an assortment of different mechanisms, frameworks, and management techniques to detect and mitigate stragglers both proactively and reactively. In this paper, we present a comprehensive review of straggler management techniques within large-scale cloud data centres. We provide a detailed taxonomy of straggler causes, as well as proposed management and mitigation techniques based on straggler characteristics and properties. From this systematic review, we outline several outstanding challenges and potential directions of possible future work for straggler research.

  相似文献   

9.
This survey gives an overview of the current state of the art in GPU techniques for interactive large‐scale volume visualization. Modern techniques in this field have brought about a sea change in how interactive visualization and analysis of giga‐, tera‐ and petabytes of volume data can be enabled on GPUs. In addition to combining the parallel processing power of GPUs with out‐of‐core methods and data streaming, a major enabler for interactivity is making both the computational and the visualization effort proportional to the amount and resolution of data that is actually visible on screen, i.e. ‘output‐sensitive’ algorithms and system designs. This leads to recent output‐sensitive approaches that are ‘ray‐guided’, ‘visualization‐driven’ or ‘display‐aware’. In this survey, we focus on these characteristics and propose a new categorization of GPU‐based large‐scale volume visualization techniques based on the notions of actual output‐resolution visibility and the current working set of volume bricks—the current subset of data that is minimally required to produce an output image of the desired display resolution. Furthermore, we discuss the differences and similarities of different rendering and data traversal strategies in volume rendering by putting them into a common context—the notion of address translation. For our purposes here, we view parallel (distributed) visualization using clusters as an orthogonal set of techniques that we do not discuss in detail but that can be used in conjunction with what we present in this survey.  相似文献   

10.
林伟伟  吴文泰 《软件学报》2016,27(4):1026-1041
云计算引领了计算机科学的一场重大变革,但与此同时,也不可避免地带来了日益凸显的能源消耗问题,因此,云计算能耗管理成为近几年的研究热点.云计算系统的能耗测量和管理直接关系到云计算的可持续发展,能耗数据不仅关系到能耗模型的建立,而且也是检验云计算资源调度算法的基础.为此,在广泛研究现有能耗测量方法的基础上,归纳总结了当前云计算环境的4种能耗测量方法:基于软件或硬件的直接测量方法、基于能耗模型的估算方法、基于虚拟化技术的能耗测量方法、基于仿真的能耗评估方法,并分析和比较了它们的优势、缺陷和适用环境.在此基础上,指出了云计算能耗管理的未来重要研究趋势:智能主机电源模块、面向不同类型应用的能耗模型、混合任务负载的能耗模型、可动态管理的高效云仿真工具、动态异构分布式集群的能耗管理、面向大数据分析处理和任务调度的节能方法以及新能源供电环境下的节能规划,为云计算节能领域的研究指明了方向.  相似文献   

11.
The complexity of computing systems introduces a few issues and challenges such as poor performance and high energy consumption. In this paper, we first define and model resource contention metric for high performance computing workloads as a performance metric in scheduling algorithms and systems at the highest level of resource management stack to address the main issues in computing systems. Second, we propose a novel autonomic resource contention‐aware scheduling approach architected on various layers of the resource management stack. We establish the relationship between distributed resource management layers in order to optimize resource contention metric. The simulation results confirm the novelty of our approach.Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

12.
High performance computers provide strategic computing power in the construction of national economy and defense, and become one of symbols of the country's overall strength. Over 30 years, with the supports of governments, the technology of high performance computers is in the process of rapid development, during which the computing performance increases nearly 3 million times and the processors number expands over 10 hundred thousands times. To solve the critical issues related with parallel efficiency and scalability, scientific researchers pursued extensive theoretical studies and technical innovations. The paper briefly looks back the course of building high performance computer systems both at home and abroad, and summarizes the significant breakthroughs of international high performance computer technology. We also overview the technology progress of China in the area of parallel computer architecture, parallel operating system and resource management, parallel compiler and performance optimization, environment for parallel programming and network computing. Finally, we examine the challenging issues, "memory wall", system scalability and "power wall", and discuss the issues of high productivity computers, which is the trend in building next generation high performance computers.  相似文献   

13.
Since its release, the Java programming language has attracted considerable attention from the high‐performance computing (HPC) community because of its portability, high programming productivity, and built‐in multithreading and networking support. As a consequence, several initiatives have been taken to develop a high‐performance Java message‐passing library to program distributed memory architectures, such as clusters. The performance of Java message‐passing applications relies heavily on the communications performance. Thus, the design and implementation of low‐level communication devices that support message‐passing libraries is an important research issue in Java for HPC. MPJ Express is our Java message‐passing implementation for developing high‐performance parallel Java applications. Its public release currently contains three communication devices: the first one is built using the Java New Input/Output (NIO) package for the TCP/IP; the second one is specifically designed for the Myrinet Express library on Myrinet; and the third one supports thread‐based shared memory communications. Although these devices have been successfully deployed in many production environments, previous performance evaluations of MPJ Express suggest that the buffering layer, tightly coupled with these devices, incurs a certain degree of copying overhead, which represents one of the main performance penalties. This paper presents a more efficient Java message‐passing communications device, based on Java Input/Output sockets, that avoids this buffering overhead. Moreover, this device implements several strategies, both in the communication protocol and in the HPC hardware support, which optimizes Java message‐passing communications. In order to evaluate its benefits, this paper analyzes the performance of this device comparatively with other Java and native message‐passing libraries on various high‐speed networks, such as Gigabit Ethernet, Scalable Coherent Interface, Myrinet, and InfiniBand, as well as on a shared memory multicore scenario. The reported communication overhead reduction encourages the upcoming incorporation of this device in MPJ Express ( http://mpj‐express.org ). Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

14.
15.
In recent years, heterogeneous systems and cooperative computing have become popular research directions in the field of high performance computing. With fast scaling of the size of high performance computer systems, problems such as power consumption and reliability come to the forefront. The aim of high performance computing has thus shifted from merely seeking peak performance to comprehensively pursuing high efficiency, which takes into consideration many factors including performance, cost, power, reliability and so on. A heterogeneous computing system consisting of general-purpose CPU(s) and special-purpose accelerator(s) features high performance, lower power consumption and low cost, etc. Hence, it has already become the mainstream in the field of high performance computing. However, such systems still face many challenges and problems, for example, programmability and reliability. In this paper, we firstly analyze the main challenges facing heterogeneous computing systems. Then, we introduce the architecture of the first petaflop computing system in China, the Tianhe-1 (TH-1) heterogeneous system, including its hardware/software interface and interconnect network. During development of the TH-1 system, several challenges were encountered; research into the solutions of these challenges is subsequently presented.  相似文献   

16.
Biomedical modelling that is mathematically described by ordinary differential equations (ODEs) is often one of the most computationally intensive parts of simulations. With high inherent parallelism, hardware acceleration based on field programmable gate array has great potential to increase the computational performance of the ODE model integration while being very power efficient. ODE‐based Domain‐specific Synthesis Tool is a tool we proposed previously to automatically generate the complete hardware/software co‐design framework for computing biomedical models based on CellML. Although it provides remarkable performance improvement and high energy efficiency compared with CPUs and GPUs, there is still a great potential for optimisation. In this paper, we investigate a set of optimisation strategies including compiler optimisation, resource fitting and balancing, and multiple pipelines. They all have in common that they can be performed automatically and hence can be integrated in our domain‐specific high level synthesis tool. We evaluate the optimised hardware accelerator modules generated by ODE‐based Domain‐specific Synthesis Tool on real hardware based on their resource usage, processing speed and power consumption. The results are compared with single threaded and multi‐core CPUs with/without Streaming SIMD Extension (SSE) optimisation and a graphics card. The results show that the proposed optimisation strategies provide significant performance improvement and result in even more energy‐efficient hardware accelerator modules. Furthermore, the resources of the target field programmable gate array device can be more efficiently utilised in order to fit larger biomedical models than before. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

17.
In this paper we discuss queueing network methodology as a framework to address issues that arise in the design and planning of discrete manufacturing systems. Our review focuses on three aspects: modeling of manufacturing facilities, performance evaluation and optimization with queueing networks. We describe both open and closed network models and present several examples from the literature illustrating applications of the methodology. We also provide a brief outline of outstanding research issues. The paper is directed towards the practitioner with operations research background and the operations management researcher with interest in this topic.  相似文献   

18.
Diminishing returns from increased clock frequencies and instruction‐level parallelism have forced computer architects to adopt architectures that exploit wider parallelism through multiple processor cores. While emerging many‐core architectures have progressed at a remarkable rate, concerns arise regarding the performance and productivity of numerous parallel‐programming tools for application development. Development of parallel applications on many‐core processors often requires developers to familiarize themselves with unique characteristics of a target platform while attempting to maximize performance and maintain correctness of their applications. The family of partitioned global address space (PGAS) programming models comprises the current state of the art in balancing performance and programmability. One such PGAS approach is SHMEM, a lightweight, shared‐memory programming library that has demonstrated high performance and productivity potential for parallel‐computing systems with distributed‐memory architectures. In the paper, we present research, design, and analysis of a new SHMEM infrastructure specifically crafted for low‐level PGAS on modern and emerging many‐core processors featuring dozens of cores and more. Our approach (with a new library known as TSHMEM) is investigated and evaluated atop two generations of Tilera architectures, which are among the most sophisticated and scalable many‐core processors to date, and is intended to enable similar libraries atop other architectures now emerging. In developing TSHMEM, we explore design decisions and their impact on parallel performance for the Tilera TILE‐Gx and TILEPro many‐core architectures, and then evaluate the designs and algorithms within TSHMEM through microbenchmarking and applications studies with other communication libraries. Our results with barrier primitives provided by the Tilera libraries show dissimilar performance between the TILE‐Gx and TILEPro; therefore, TSHMEM's barrier design takes an alternative approach and leverages the on‐chip mesh network to provide consistent low‐latency performance. In addition, our experiments with TSHMEM show that naive collective algorithms consistently outperformed linear distributed collective algorithms when executed in an SMP‐centric environment. In leveraging these insights for the design of TSHMEM, our approach outperforms the OpenSHMEM reference implementation, achieves similar to positive performance over OpenMP and OSHMPI atop MPICH, and supports similar libraries in delivering high‐performance parallel computing to emerging many‐core systems. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

19.
It is a fact that the attention of research community in computer science, business executives, and decision makers is drastically drawn by big data. As the volume of data becomes bigger, it needs performance‐oriented data‐intensive processing frameworks such as MapReduce, which can scale computation on large commodity clusters. Hadoop MapReduce processes data in Hadoop Distributed File System as jobs scheduled according to YARN fair scheduler and capacity scheduler. However, with advancement and dynamic changes in hardware and operating environments, the performance of clusters is greatly affected. Various efforts in literature have been made to address the issues of heterogeneity (i.e., clusters consisting of virtual machines and machines with different hardware), network communication, data locality, better resource utilization, and run‐time scheduling. In this paper, we present a survey to discuss various research efforts made so far to improve Hadoop MapReduce scheduling. We classify scheduling algorithms and techniques proposed in the literature so far based on their addressing areas and present a taxonomy. Furthermore, we also discuss various aspects of open issues and challenges in the scheduling of MapReduce to improve its performance. Copyright © 2015 John Wiley & Sons, Ltd.  相似文献   

20.
We review fast networking technologies for both wide-area and high performance cluster computer systems. We describe our experiences in constructing asynchronous transfer mode (ATM)-based local- and wide-area clusters and the tools and technologies this experience led us to develop. We discuss our experiences using Internet Protocol on such systems as well as native ATM protocols and the problems facing wide-area integration of cluster systems. We are presently constructing Beowulf-class computer clusters using a mix of Fast Ethernet and Gigabit Ethernet technology and we anticipate how such systems will integrate into a new local-area Gigabit Ethernet network and what technologies will be used for connecting shared HPC resources across wide-areas. High latencies on wide-area cluster systems led us to develop a metacomputing problem-solving environment known as distributed information systems control world (DISCWorld). We summarize our main developments in this project as well as the key features and research directions for software to exploit computational services running on fast networked cluster systems.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号