Similar Articles
20 similar articles found (search time: 15 ms)
1.
We present here a performance analysis of three current architectures that have become commonplace in the High Performance Computing world. Blue Gene/Q is the third generation of systems from IBM that use modestly performing cores, but at large scale, in order to achieve high performance. The XE6 is the latest in a long line of Cray systems that use a 3-D topology, but the first to use the Gemini interconnection network. InfiniBand provides the flexibility of using compute nodes from many vendors, connected in many possible topologies. The performance characteristics of each vary vastly, and the way in which nodes are allocated in each type of system can significantly impact achieved performance. In this work we compare these three systems using a combination of micro-benchmarks and a set of production applications. We also examine the differences in performance variability observed on each system and quantify the lost performance using a combination of empirical measurements and performance models. Our results show that significant performance can be lost in normal production operation of the Cray XE6 and InfiniBand clusters in comparison to Blue Gene/Q.

2.
An Analysis of the Runtime Behavior of High-Performance Application Software on a Teraflops Cluster System
By invoking PAPI (Performance Application Programming Interface) functions, we traced a selection of application programs running on the LSSC-II teraflops cluster system of the national "973" Program between March and April 2004, collecting a large amount of valuable performance data. Based on these data, we give a preliminary analysis of how high-performance software currently performs in China. The results show that most applications run at a fairly low level: parallel programs typically use between 1 and 64 processors, average processor efficiency is below 10%, and average performance is below 300 Mflops.
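The efficiency figure reported above is simply the ratio of achieved to peak floating-point throughput, which PAPI-style counters make directly measurable. A minimal sketch of the computation (the peak rating used here is an illustrative value, not one from the study):

```python
def processor_efficiency(measured_mflops, peak_mflops):
    """Fraction of peak floating-point throughput actually achieved."""
    return measured_mflops / peak_mflops

# Illustrative numbers: a node rated at 4000 Mflops peak sustaining 300 Mflops,
# matching the sub-10% average efficiency reported above.
eff = processor_efficiency(300.0, 4000.0)
print(f"{eff:.1%}")
```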

3.
This paper addresses stability and performance of sampled-data systems with a variable sampling rate, where the change between sampling rates is decided by a scheduler. A motivational example is presented in which a stable continuous-time system is controlled at two sampling rates. It is shown that the resulting system can be unstable when sampling switches between these two rates, even though each individual closed-loop system is stable under a controller designed to minimize the same continuous loss function. Two solutions are presented in this paper. The first is to impose restrictions on switching sequences such that only stable sequences are chosen. The second is more general: a piecewise constant state-feedback control law is designed which guarantees stability for all possible variations of the sampling rate. Furthermore, the performance defined by a continuous-time quadratic cost function for the sampled-data system with variable sampling rate can be optimized using the proposed synthesis method.

4.
Parallel accelerators are playing an increasingly important role in scientific computing. However, their perceived weakness today is reduced "programmability" in comparison with traditional general-purpose CPUs. For the domain of dense linear algebra, we demonstrate that this is not necessarily the case. We show how the libflame library carefully layers routines and abstracts details related to storage and computation, so that extending it to take advantage of multiple accelerators is achievable without introducing platform-specific complexity into the library code base. We focus on the experience of a library developer as he develops a library routine for a new operation, reduction of a generalized Hermitian positive definite eigenvalue problem to standard Hermitian form, and configures the library to target a multi-GPU platform. It becomes obvious that the library developer does not need to know about the parallelization or the details of the multi-accelerator platform. Excellent performance on a system with four NVIDIA Tesla C2050 GPUs is reported. This makes libflame the first released library to incorporate multi-GPU functionality for dense matrix computations, setting a new standard for performance.

5.
GPU general-purpose computing has flourished in recent years, and the number of developers and GPGPU applications is growing rapidly. To meet the requirements of different applications and the habits of different developers, NVIDIA and its partners have jointly developed many different programming technologies around CUDA-architecture GPUs. This paper describes their characteristics and intended audiences in detail, in the hope of helping developers choose the programming technology best suited to their own programming habits and application requirements.

6.
The use of a network of shared, heterogeneous workstations, each harboring a reconfigurable computing (RC) system, offers high-performance users an inexpensive platform for a wide range of computationally demanding problems. However, effectively using the full potential of these systems can be challenging without knowledge of the system's performance characteristics. While some performance models exist for shared, heterogeneous workstations, none thus far account for the addition of RC systems. Our analytic performance model includes the effects of the reconfigurable device, application load imbalance, background user load, basic message-passing communication, and processor heterogeneity. The methodology proves to be accurate in characterizing these effects for applications running on shared, homogeneous, and heterogeneous HPRC resources. The model error in all cases was found to be less than 5% for application runtimes greater than 30 s, and less than 15% for runtimes under 30 s.

7.
Ahmed M., Lester, Reda. Performance Evaluation, 2005, 60(1-4): 303-325
In studying or designing parallel and distributed systems, one should have available a robust analytical model that includes the major parameters determining system performance. Jackson networks have been very successful in modeling computer systems. However, the ability of Jackson networks to predict performance as a system changes remains an open question, since they do not apply to systems with population size constraints. Also, the product-form solution of Jackson networks assumes steady state and exponential service centers or certain specialized queueing disciplines. In this paper, we present a transient model for Jackson networks that is applicable to any population size and any finite workload (no new arrivals). Using several non-exponential distributions, we show to what extent the exponential distribution can be used to approximate other distributions and transient systems with finite workloads. When the number of tasks to be executed is large enough, the model approaches the product-form (steady-state) solution. We also study the case where the non-exponential servers have queueing, to which Jackson networks cannot be applied. Finally, we show how to use the model to analyze the performance of parallel and distributed systems.
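As a point of reference for the steady-state limit the transient model approaches, the product-form solution of an open Jackson network is easy to compute. A sketch for the simplest case, a tandem network where every job visits each node once (the arrival and service rates are made-up illustrative values):

```python
def jackson_mean_queue_lengths(arrival_rate, service_rates):
    """Mean number of jobs at each node of a tandem Jackson network.

    In a tandem network each node i sees the full arrival stream, so it
    behaves as an M/M/1 queue with utilization rho_i = lambda / mu_i and
    mean queue length rho_i / (1 - rho_i); steady state needs rho_i < 1.
    """
    means = []
    for mu in service_rates:
        rho = arrival_rate / mu
        if rho >= 1:
            raise ValueError("unstable node: utilization >= 1")
        means.append(rho / (1 - rho))
    return means

# Two nodes, lambda = 2 jobs/s, mu = 4 and 5 jobs/s: means 1.0 and 2/3.
print(jackson_mean_queue_lengths(2.0, [4.0, 5.0]))
```

The transient model in the paper relaxes exactly the assumptions this closed form relies on (steady state and exponential servers).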

8.
Resource management remains one of the main issues for cloud computing providers because system resources have to be continuously allocated to handle workload fluctuations while guaranteeing Service Level Agreements (SLAs) to the end users. In this paper, we propose novel capacity allocation algorithms able to coordinate multiple distributed resource controllers operating in geographically distributed cloud sites. The capacity allocation solutions are integrated with a load redirection mechanism which, when necessary, distributes incoming requests among different sites. The overall goal is to minimize the cost of allocated resources, in terms of virtual machines, while guaranteeing SLA constraints expressed as a threshold on the average response time. We propose a distributed solution which integrates workload prediction and distributed non-linear optimization techniques. Experiments show how the proposed solutions improve on other heuristics proposed in the literature without penalizing SLAs, and our results are close to the global optimum obtainable by an oracle with perfect knowledge of the future offered load.
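The trade-off such controllers navigate can be illustrated with a toy M/M/1-style sizing rule (our illustration, not the paper's algorithm): with per-VM service rate mu and site arrival rate lambda spread over n VMs, the average response time 1/(mu - lambda/n) must stay below the SLA threshold, and the controller wants the smallest such n.

```python
import math

def min_vms(arrival_rate, service_rate, sla_response_time):
    """Smallest VM count keeping average M/M/1 response time within the SLA.

    Each of n VMs serves arrival_rate/n requests/s; average response time
    is 1 / (mu - lambda/n), so we need lambda/n < mu - 1/T_sla.
    All rates and thresholds here are illustrative placeholders.
    """
    spare = service_rate - 1.0 / sla_response_time
    if spare <= 0:
        raise ValueError("SLA unattainable at this per-VM service rate")
    return max(1, math.ceil(arrival_rate / spare))

# 90 req/s offered load, 10 req/s per VM, 0.5 s average-response-time SLA.
print(min_vms(arrival_rate=90.0, service_rate=10.0, sla_response_time=0.5))
```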

9.
This paper discusses oscillation analysis of (a large number of) linearly coupled piecewise affine (PWA) systems, motivated by various kinds of reaction–diffusion systems including cell-signaling dynamics and neural dynamics. We derive a sufficient condition under which the system shows an oscillatory behavior called Y-oscillation. It is known that the analysis of PWA systems is difficult due to their switching nature. An important feature of the result obtained is that, under the assumption that every subsystem has a specific property in common, the criteria can be rewritten in terms of coupling topology in an easily checkable way, so it is applicable to large scale systems. The results obtained are applied to theoretical investigation of the cardiac action potential generation/propagation represented by spatio-temporal FitzHugh–Nagumo equations.

10.
11.
In transactional systems, quality-of-service objectives are often specified by Service Level Objectives (SLOs) that stipulate a response time to be achieved for a percentile of the transactions. Usually, there are different client classes with different SLOs. In this paper, we extend a technique that enforces fulfilment of the SLOs using admission control. The admission control of new user sessions is based on a response-time model. The technique proposed in this paper dynamically adapts the model to changes in workload characteristics and system configuration, so that the system can work autonomically, without human intervention. The technique requires no knowledge of the internals of the system; thus, it is easy to use and can be applied to many systems. Its utility is demonstrated by a set of experiments on a system that implements the TPC-App benchmark. The experiments show that the model adaptation works correctly in very different situations, including large and small changes in response times, increasing and decreasing response times, and different patterns of workload injection. In all these scenarios, the technique updates the model progressively until it adjusts to the new situation, and in intermediate situations the model never exhibits abnormal behaviour that could lead to a failure of the admission control component.

12.
Transparent models search for a balance between interpretability and accuracy. This paper is about the estimation of transparent models of chaotic systems from data, models accurate yet simple enough for their expression to be understandable by a human expert. The models we propose are discrete, built upon common blocks in control engineering (gain, delay, sum, etc.) and optimized both in their complexity and accuracy. The accuracy of a discrete model can be measured by the average error between its prediction for the next sampling period and the true output at that time, the 'one-step error'. A perfect model has zero one-step error, but a small error is not always associated with an approximate model, especially in chaotic systems. In chaos, an arbitrarily small difference between two initial states will produce uncorrelated trajectories, so a model with a low one-step error may be very different from the desired one. Even though a recursive evaluation (multi-step prediction) improves the fitting, in this work we show that a learning algorithm may not converge to an appropriate model unless we include some terms that depend on estimates of certain properties of the model (so-called 'invariants' of the chaotic series). We show this graphically, by means of the reconstructed attractors of the original system and the model. Therefore, we also propose to follow a multi-objective approach to model chaotic processes and to apply a simulated annealing-based optimization to obtain transparent models.

13.
Parallel Statistical Computing Based on R
As the scale and complexity of data in statistical analysis keep growing, high-performance computing is beginning to play an important role in fields dominated by statistical computation, such as finance, economics, and management. This paper surveys the state of the art and recent advances in parallel computing for R-based statistical analysis, focusing from the user's perspective on how parallel statistical computing is realized in R on computing platforms with different architectures. Benchmarks on a synthetic and a real application demonstrate its effectiveness.
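The dominant pattern in such surveys is embarrassingly parallel split-apply-combine: scatter chunks of data to workers, compute a partial statistic on each, and combine the results. R specifics aside, a minimal Python sketch of the same pattern using only the standard library (threads stand in here for the worker processes an R parallel backend would spawn):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_mean(xs, workers=4):
    """Split-apply-combine mean: each worker sums one chunk of the data,
    and the partial sums are combined at the end. This mirrors the
    scatter/gather style of parallel apply functions in statistical
    computing environments."""
    size = max(1, len(xs) // workers)
    chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums) / len(xs)

# Mean of 1..100 computed over 4 workers.
print(parallel_mean(list(range(1, 101))))
```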

14.
Hardware monitoring through performance counters is available on almost all modern processors. Although these counters were originally designed for performance tuning, they have also been used for evaluating power consumption. We propose two approaches for modelling and understanding the behaviour of high performance computing (HPC) systems relying on hardware monitoring counters. We evaluate the effectiveness of our system modelling approach considering both optimizing the energy usage of HPC systems and predicting the energy consumption of HPC applications as target objectives. Although hardware monitoring counters are used for modelling the system, other methods, including partial phase recognition and cross-platform energy prediction, are used for energy optimization and prediction. Experimental results for energy prediction demonstrate that we can accurately predict the peak energy consumption of an application on a target platform, whereas results for energy optimization indicate that, with no a priori knowledge of the workloads sharing the platform, we can save up to 24% of the overall HPC system's energy consumption under benchmarks and real-life workloads.

15.
This paper studies the problems of H-infinity performance optimization and controller design for continuous-time networked control systems (NCSs) with both sensor-to-controller and controller-to-actuator communication constraints (limited communication channels). By taking the derivative character of the network-induced delay fully into consideration and defining new Lyapunov functions, linear matrix inequality (LMI)-based H-infinity performance optimization and controller design methods are presented for NCSs with limited communication channels. The proposed design methods remain applicable when there are no constraints on the communication channels. The merit of the proposed methods lies in their reduced conservativeness, achieved by avoiding bounding inequalities for cross products of vectors. Simulation results illustrate the merit and effectiveness of the proposed H-infinity controller design for NCSs with limited communication channels.

16.
Monitoring and information system (MIS) implementations provide data about available resources and services within a distributed system, or Grid. A comprehensive performance evaluation of an MIS can aid in detecting potential bottlenecks, advise deployment, and help improve future system development. In this paper, we analyze and compare the performance of three implementations in a quantitative manner: the Globus Toolkit® Monitoring and Discovery Service (MDS2), the European DataGrid Relational Grid Monitoring Architecture (R-GMA), and the Condor project's Hawkeye. We use the NetLogger toolkit to instrument the main service components of each MIS and conduct four sets of experiments to benchmark their scalability with respect to the number of users, the number of resources, and the amount of data collected. Our study provides quantitative measurements comparable across all systems. We also identify performance bottlenecks, relate them to the design goals, underlying architectures, and implementation technologies of the corresponding MIS, and present guidelines for deploying MISs in practice.

17.
Machine failures, defects, multiple rework loops, and similar phenomena make rework systems difficult to model, and the performance analysis of such systems has therefore received only limited attention in the past. We propose an analytical method for the performance evaluation of rework systems with unreliable machines and finite buffers. To characterize the rework flow in the system, a new 3M1B (three-machine, one-buffer) Markov model is first presented. Unlike previous models, it is capable of representing multiple rework loops, and the rework fraction of each loop is calculated based on the quality of the material flow in the system. A decomposition method is then developed for multistage rework systems using the proposed 3M1B model as one of the building blocks. The experimental results demonstrate that the decomposition method provides accurate estimates of performance measures such as throughput and Work-In-Process (WIP). We have applied this method to several problems, such as determining the optimal inspection location and identifying bottleneck machines in rework systems.

18.
We describe a study of the use of decision-theoretic policies for optimally joining human and automated problem-solving efforts. We focus specifically on the challenge of determining when it is best to transfer callers from an automated dialog system to human receptionists. We demonstrate the sensitivities of transfer actions to both the inferred competency of the spoken-dialog models and the current sensed load on human receptionists. The policies draw upon probabilistic models constructed via machine learning from cases that were logged by a call routing service deployed at our organization. We describe the learning of models that predict outcomes and interaction times and show how these models can be used to generate expected-utility policies that identify when it is best to transfer callers to human operators. We explore the behavior of the policies with simulations constructed from real-world call data. See D'Agostino (2005) for a reflection from the business community about the failure to date of automated speech recognition systems to penetrate widely.
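At its core, an expected-utility transfer policy compares the expected cost of keeping the caller in the automated dialog against the load-dependent cost of queueing for a human. A hedged sketch of that comparison (the probabilities and cost values below are invented placeholders, not figures from the study):

```python
def should_transfer(p_auto_success, cost_auto_fail, cost_auto_ok,
                    queue_wait_cost):
    """Transfer the caller iff the expected cost of staying automated
    exceeds the (load-dependent) cost of queueing for a receptionist.
    Costs are abstract disutilities; all numbers used are illustrative.
    """
    expected_auto_cost = (p_auto_success * cost_auto_ok
                          + (1 - p_auto_success) * cost_auto_fail)
    return expected_auto_cost > queue_wait_cost

# Confident dialog model, lightly loaded operators:
# expected cost 0.9*1 + 0.1*10 = 1.9 < 3.0, so stay automated.
print(should_transfer(0.9, 10.0, 1.0, 3.0))
# Low-confidence model: 0.3*1 + 0.7*10 = 7.3 > 3.0, so transfer.
print(should_transfer(0.3, 10.0, 1.0, 3.0))
```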

19.
Performance evaluation models are used by companies to design, adapt, manage and control their production systems. In the literature, most of the effort has been dedicated to developing efficient methodologies for estimating first-moment performance measures of production systems, such as the expected production rate, the buffer levels and the mean completion time. However, there is industrial evidence that the higher moments of the production output can drastically impact the capability of managing the system operations, causing the observed system performance to differ greatly from what is expected. This paper presents a methodology to analyze the cumulated output and the lot completion time moments of Markovian reward models. Both the discrete- and continuous-time cases are considered. The technique is applied to unreliable manufacturing systems characterized by general Markovian structures. Numerical results show how the theory developed in this paper can be applied to analyse the dependency of the output variability and the service level on the system parameters. Moreover, they highlight previously uninvestigated features of the system behavior that are useful when operating the system in practical settings.

20.
Replication of data blocks is one of the main technologies on which storage systems for cloud computing and Big Data applications are based. Given heterogeneous nodes and a constantly changing topology, maintaining the reliability of the data contained in a common large-scale distributed file system is an important research challenge. Common approaches are based either on replication of data or on erasure codes. The former stores each data block several times on different nodes of the considered infrastructure; the drawback is that this can lead to large overhead and suboptimal resource utilization. Erasure coding instead exploits Maximum Distance Separable codes that minimize the information required to restore blocks in case of node failure; this approach can lead to increased complexity and transfer time, because several blocks, coming from different sources, are required to reconstruct lost information. In this paper we study, by means of discrete event simulation, the performance that can be obtained by combining both techniques, with the goal of minimizing the overhead and increasing reliability while preserving performance. The analysis proves that a careful balance between replication and erasure codes significantly improves reliability and performance while avoiding the large overheads incurred by using either technique in isolation.
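The erasure-coding side of this trade-off can be illustrated with the simplest MDS-like code, a single XOR parity block. Real deployments use Reed-Solomon-style codes that tolerate multiple failures; this sketch only shows why one lost block is recoverable from the survivors while storing k+1 rather than the 2k or 3k blocks replication would need:

```python
def add_parity(blocks):
    """Append one parity block: the bytewise XOR of all data blocks
    (all blocks are assumed to be the same length)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return blocks + [bytes(parity)]

def recover(blocks_with_parity, lost_index):
    """Rebuild one missing block by XOR-ing all surviving blocks:
    since d0 ^ d1 ^ ... ^ parity == 0, any single block equals the
    XOR of the rest."""
    survivors = [b for i, b in enumerate(blocks_with_parity)
                 if i != lost_index]
    rebuilt = bytearray(len(survivors[0]))
    for block in survivors:
        for i, byte in enumerate(block):
            rebuilt[i] ^= byte
    return bytes(rebuilt)

data = [b"abcd", b"efgh", b"ijkl"]
stored = add_parity(data)             # 4 blocks stored for 3 blocks of data
print(recover(stored, 1) == data[1])  # lost block reconstructed
```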

