期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Visualizing and exploring profiles with calling context ring charts

Philippe Moret Walter Binder Alex Villazón Danilo Ansaloni Abbas Heydarnoori 《Software》2010,40(9):825-847

Calling context profiling is an important technique for analyzing the performance of object‐oriented software with complex inter‐procedural control flow. The Calling Context Tree (CCT) is a common data structure that stores dynamic metrics, such as CPU time, separately for each calling context. As CCTs may comprise millions of nodes, there is a need for a condensed visualization that eases the localization of performance bottlenecks. In this article, we discuss Calling Context Ring Charts (CCRCs), a compact visualization for CCTs, where callee methods are represented in ring segments surrounding the caller's ring segment. In order to reveal hot methods, their callers, and callees, the ring segments can be sized according to a chosen dynamic metric. We describe two case studies where CCRCs help us to detect and fix performance problems in applications. A performance evaluation also confirms that our implementation can efficiently handle large CCTs. Copyright © 2010 John Wiley & Sons, Ltd. 相似文献

2.

Using Bytecode Instruction Counting as Portable CPU Consumption Metric

Walter Binder Jarle Hulaas 《Electronic Notes in Theoretical Computer Science》2006,153(2):57

Accounting for the CPU consumption of applications is crucial for software development to detect and remove performance bottlenecks (profiling) and to evaluate the performance of algorithms (benchmarking). Moreover, extensible middleware may exploit resource consumption information in order to detect a resource overuse of client components (detection of denial-of-service attacks) or to charge clients for the resource consumption of their deployed components. The Java Virtual Machine (JVM) is a predominant target platform for application and middleware developers, but it currently lacks standard mechanisms for resource management.In this paper we present a tool, the Java Resource Accounting Framework, Second Edition (J-RAF2), which enables precise CPU management on standard Java runtime environments. J-RAF2 employs a platform-independent CPU consumption metric, the number of executed JVM bytecode instructions. We explain the advantages of this approach to CPU management and present five case studies that show the benefits in different settings. 相似文献

3.

Differential profiling

Paul E. McKenney 《Software》1999,29(3):219-234

Performance can be a critical aspect of software quality; in some systems, poor performance can cause financial loss, physical damage, or even death. In such cases, it is imperative to identify system performance problems before deployment, preferably well before implementation. Unfortunately, the size of most software systems grossly exceeds the capacity of current performance‐modelling techniques. Hence, there is a need for techniques to quickly identify the portions of the system that are performance‐critical. These portions are often small enough to be modelled directly. This paper describes one such technique, differential profiling. Differential profiling combines two or more conventional profiles of a given program run in different situations or conditions. The technique mathematically combines corresponding buckets of the conventional profiles, then sorts the resulting list by these combined values. Different combining functions are suitable for different situations. This combining of conventional profiles frequently yields much greater insight than could be obtained from either of the conventional profiles. Hence, differential profiling helps to locate difficult‐to‐find performance bottlenecks, such as those that are distributed widely throughout a large program or system, perhaps by being concealed within macros or inlined functions. This paper also describes how this technique may be used to pinpoint certain types of performance bottlenecks in large programs running on large‐scale shared‐memory multiprocessors. In this environment, the critical bottleneck might consume only a small fraction of the total CPU time, since typical critical sections can consume at most one CPUs worth of computation. This sort of bottleneck, particularly when widely distributed throughout the program under consideration, is often invisible to traditional profiling techniques. Copyright © 1999 John Wiley & Sons, Ltd. 相似文献

4.

A 2589 line topology optimization code written for the graphics card 总被引：1，自引：0，他引：1

Stephan Schmidt Volker Schulz 《Computing and Visualization in Science》2011,14(6):249-256

We investigate topology optimization based on the solid isotropic material with penalization approach on compute unified device architecture enabled graphics cards in three dimensions. Linear elasticity is solved entirely on the GPU by a matrix-free conjugate gradient method using finite elements. Due to the unique requirements of the single instruction, multiple data stream processors, special attention is given to the procedural generation of matrix?Cvector products entirely on the graphics card. The GPU code is found to be extremely efficient, being faster than a 48 core shared memory CPU system. CPU and GPU implementations show different performance bottlenecks. The sources are available at http://www.mathematik.uni-trier.de/~schmidt/gputop. 相似文献

5.

Booting,browsing and streaming time profiling,and bottleneck analysis on android-based systems

Ying-Dar Lin Cheng-Yuan Ho Yuan-Cheng Lai Tzu-Hsiung Du Shun-Lee Chang 《Journal of Network and Computer Applications》2013,36(4):1208-1218

Android-based systems perform slowly in three scenarios: booting, browsing, and streaming. Time profiling on Android devices involves three unique constraints: (1) the execution flow of a scenario invokes multiple software layers, (2) these software layers are implemented in different programming languages, and (3) log space is limited. To compensate for the first and second constraints, we assumed a staged approach using different profiling tools applied to different layers and programming languages. As for the last constraint and to avoid generating enormous quantities of irrelevant log data, we began profiling scenarios from an individual module, and then iteratively profiled an increased number of modules and layers, and finally consolidated the logs from different layers to identify bottlenecks. Because of this iteration, we called this approach a staged iterative instrumentation approach. To analyze the time required to boot the devices, we conducted experiments using off-the-shelf Android products. We determined that 72% of the booting time was spent initializing the user-space environment, with 44.4% and 39.2% required to start Android services and managers, and preload Java classes and resources, respectively. Results from analyzing browsing performance indicate that networking is the most significant factor, accounting for at least 90% of the delay in browsing. With regard to online streaming, networking and decoding technologies are two most important factors occupying 77% of the time required to prepare a 22 MB video file over a Wi-Fi connection. Furthermore, the overhead of this approach is low. For example, the overhead of CPU loading is about 5% in the browsing scenario. We believe that this proposed approach to time profiling represents a major step in the optimization and future development of Android-based devices. 相似文献

6.

Profiling techniques for communication in fine‐grained parallel languages

Chris J. Scheiman Bjoern Haake Maximilian Ibel Klaus E. Schauser 《Software》1999,29(6):519-550

Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks. In this paper we present two profiling techniques for the fine‐grained parallel programming language Split‐C, which provides a simple global address space memory model. One profiler provides a detailed analysis of a program's execution. The other profiler collects cumulative information. As our experience shows, it is quite challenging to profile programs that make use of efficient, low‐overhead communication. We incorporated techniques which minimize profiling effects on the running program, and quantified the profiling overhead. We present several Split‐C applications showing that the profiler is useful in determining performance bottlenecks. Copyright © 1999 John Wiley & Sons, Ltd. 相似文献

7.

Load balancing in a heterogeneous world: CPU-Xeon Phi co-execution of data-parallel kernels

Nozal Raúl Perez Borja Bosque Jose Luis Beivide Ramón 《The Journal of supercomputing》2019,75(3):1123-1136

Heterogeneous systems composed by a CPU and a set of different hardware accelerators are very compelling thanks to their excellent performance and energy consumption features. One of the most important problems of those systems is the workload distribution among their devices. This paper describes an extension of the Maat library to allow the co-execution of a data-parallel OpenCL kernel on a heterogeneous system composed by a CPU and an Intel Xeon Phi. Maat provides an abstract view of the heterogeneous system as well as set of load balancing algorithms to squeeze the performance out of the node. It automatically performs the data partition and distribution among the devices, generates the kernels and efficiently merges the partial outputs together. Experimental results show that this approach always outperforms the baseline with only a Xeon Phi, giving excellent performance and energy efficiency. Furthermore, it is essential to select the right load balancing algorithm because it has a huge impact in the system performance and energy consumption.

相似文献

8.

Platform‐independent profiling in a virtual execution environment

Walter Binder Jarle Hulaas Philippe Moret Alex Villazón 《Software》2009,39(1):47-79

Virtual execution environments, such as the Java virtual machine, promote platform‐independent software development. However, when it comes to analyzing algorithm complexity and performance bottlenecks, available tools focus on platform‐specific metrics, such as the CPU time consumption on a particular system. Other drawbacks of many prevailing profiling tools are high overhead, significant measurement perturbation, as well as reduced portability of profiling tools, which are often implemented in platform‐dependent native code. This article presents a novel profiling approach, which is entirely based on program transformation techniques, in order to build a profiling data structure that provides calling‐context‐sensitive program execution statistics. We explore the use of platform‐independent profiling metrics in order to make the instrumentation entirely portable and to generate reproducible profiles. We implemented these ideas within a Java‐based profiling tool called JP. A significant novelty is that this tool achieves complete bytecode coverage by statically instrumenting the core runtime libraries and dynamically instrumenting the rest of the code. JP provides a small and flexible API to write customized profiling agents in pure Java, which are periodically activated to process the collected profiling information. Performance measurements point out that, despite the presence of dynamic instrumentation, JP causes significantly less overhead than a prevailing tool for the profiling of Java code. Copyright © 2008 John Wiley & Sons, Ltd. 相似文献

9.

Cross‐profiling for Java processors

Walter Binder Martin Schoeberl Philippe Moret Alex Villazón 《Software》2009,39(18):1439-1465

Performance evaluation of embedded software is essential in an early development phase so as to ensure that the software will run on the embedded device's limited computing resources. The prevailing approaches either require the deployment of the software on the embedded target, which can be tedious and may be impossible in an early development phase, or rely on simulation, which can be very slow. In this article, we introduce a customizable cross‐profiling framework for embedded Java processors, including processors featuring a method cache. The developer profiles the embedded software in the host environment, completely decoupled from the target system, on any standard Java virtual machine, but the generated profiles represent the execution time metric of the target system. Our cross‐profiling framework is based on bytecode instrumentation. We identify several pointcuts in the execution of bytecode that need to be instrumented in order to estimate the CPU cycle consumption on the target system. An evaluation using the JOP embedded Java processor as target confirms that our approach reconciles high profile accuracy with moderate overhead. Our cross‐profiling framework also enables the performance evaluation of new processor architectures before they are implemented. As a case study, we explore the performance impact of various processor design choices and optimizations, such as different cache sizes or pipeline organizations, and come up with an improved processor design that yields speedups of up to 40% on standard Java benchmarks. Copyright © 2009 John Wiley & Sons, Ltd. 相似文献

10.

Scalability analysis of three monitoring and information systems: MDS2, R-GMA,and Hawkeye

Xuehai Zhang Jeffrey L. Freschl Jennifer M. Schopf 《Journal of Parallel and Distributed Computing》2007

Monitoring and information system (MIS) implementations provide data about available resources and services within a distributed system, or Grid. A comprehensive performance evaluation of an MIS can aid in detecting potential bottlenecks, advise in deployment, and help improve future system development. In this paper, we analyze and compare the performance of three implementations in a quantitative manner: the Globus Toolkit^®

^{®}

Monitoring and Discovery Service (MDS2), the European DataGrid Relational Grid Monitoring Architecture (R-GMA), and the Condor project's Hawkeye. We use the NetLogger toolkit to instrument the main service components of each MIS and conduct four sets of experiments to benchmark their scalability with respect to the number of users, the number of resources, and the amount of data collected. Our study provides quantitative measurements comparable across all systems. We also find performance bottlenecks and identify how they relate to the design goals, underlying architectures, and implementation technologies of the corresponding MIS, and we present guidelines for deploying MISs in practice. 相似文献

11.

GPU implementation of a parallel two‐list algorithm for the subset‐sum problem

Lanjun Wan Kenli Li Jing Liu Keqin Li 《Concurrency and Computation》2015,27(1):119-145

The subset‐sum problem is a well‐known non‐deterministic polynomial‐time complete (NP‐complete) decision problem. This paper proposes a novel and efficient implementation of a parallel two‐list algorithm for solving the problem on a graphics processing unit (GPU) using Compute Unified Device Architecture (CUDA). The algorithm is composed of a generation stage, a pruning stage, and a search stage. It is not easy to effectively implement the three stages of the algorithm on a GPU. Ways to achieve better performance, reasonable task distribution between CPU and GPU, effective GPU memory management, and CPU–GPU communication cost minimization are discussed. The generation stage of the algorithm adopts a typical recursive divide‐and‐conquer strategy. Because recursion cannot be well supported by current GPUs with compute capability less than 3.5, a new vector‐based iterative implementation mechanism is designed to replace the explicit recursion. Furthermore, to optimize the performance of the GPU implementation, this paper improves the three stages of the algorithm. The experimental results show that the GPU implementation has much better performance than the CPU implementation and can achieve high speedup on different GPU cards. The experimental results also illustrate that the improved algorithm can bring significant performance benefits for the GPU implementation. Copyright © 2014 John Wiley & Sons, Ltd. 相似文献

12.

Dynamic load distribution using anti-tasks and load state vectors

Qin Lu Sau-Ming Lau Kwong-Sak Leung 《Concurrency and Computation》1998,10(14):1251-1269

Polling-based load distribution (LD) algorithms suffer from two weaknesses: (i) load information exchanged during a polling session is confined to the two negotiating nodes only; (ii) as the distributed system grows in size (in terms of the number of constituent nodes), a larger number of polling sessions, and thus a higher amount of network bandwidth consumption and CPU overhead, are needed. We propose a new LD algorithm which is based on anti-tasks and load state vectors. This new algorithm avoids the above weaknesses of polling-based LD algorithms. Anti-tasks are composite agents which travel around a distributed system to facilitate the pairing up of task senders and receivers, as well as the collection and dissemination of load information. Time-stamped load information of processing nodes is stored in load state vectors which, when used together with anti-tasks, encourage mutual sharing of load information among processing nodes. Anti-tasks, which make use of load state vectors to decide their traveling paths, are spontaneously directed towards processing nodes having high transient workload, thus allowing their surplus workload to be relocated quickly. Using simulations, we evaluate the performance of our new algorithm by comparing its performance with a number of well-known polling-based load distribution algorithms. We found that our algorithm provides significant reduction of mean task response time over a large range of system sizes. The cost of achieving this performance gain in terms of CPU overhead and channel bandwidth consumption is generally comparable to the other algorithms we studied. © 1998 John Wiley & Sons, Ltd. 相似文献

13.

Using SPEC CPU2006 to evaluate the sequential and parallel code generated by commercial and open-source compilers

Aldea Sergio Llanos Diego R. González-Escribano Arturo 《The Journal of supercomputing》2012,59(1):486-498

相似文献

14.

An exercise in resource allocation

H. Gomaa 《Software》1974,4(3):199-213

In a large operating system, the probability that bottlenecks exist is high. The outcome of modifications to the system attempting to overcome these bottlenecks are often not easy to predict. It is frequently difficult to discover:

(1) Whether an improvement has actually been made to the system.
(2) Where exactly the improvement in system performance, if any, is occurring.
(3) How to adjust parameters of the system to achieve an improved performance.

Performance tools are described in this paper which were used to help resolve these points in the implementation of a Peripheral Processor and Channel Scheduling mechanism in the operating system used at CERN on a CDC 6000 system. The paper shows how analysis of the performance data provided a clearer appreciation of the performance of the scheduling mechanism. 相似文献

15.

Incorporating System Overhead in Queuing Network Models

《IEEE transactions on pattern analysis and machine intelligence》1980,(4):381-390

Multiclass queuing network models of multiprogramming computer systems are frequently used to predict the performance of computing systems as a function of user workload and hardware configuration. This paper examines three different methods for incorporating operating system overhead in multiclass queuing network models. The goal of the resultant model is to provide an accurate account of the processing performance and the system CPU overhead of each of the several different types of jobs (batch, timesharing, transaction processing, etc.) that together make up the multiprogramming workload. The first method introduces an operating sysbtm workload consisting of a fixed number of jobs to represent system CPU overhead processing. The second method extends the jobs' CPU service requests to include explicitly the CPU overhead necessary for system processing. The third method employs a communicating set of user and system job classes so that the CPU overhead can be modeled by switching jobs from user to system class whenever they require system CPU service. The capabilities and accuracy of the three methods are assessed and compared against performance and overhead data measured on a Univac 1110 computer. 相似文献

16.

CUDA平台下的实时超声扫描转换

王伟民王合闯王华军《计算机应用》2011,31(10):2760-2763

为了克服传统医学超声扫描转换不能实时的缺陷,实时超声扫描转换算法利用计算统一设备架构(CUDA)技术,通过分配最优的线程结构、合理规划中央处理器(CPU)和图形处理器(GPU)之间的数据传输方式和计算任务的划分,提高了算法的吞吐量,满足了实时性。传统CPU算法和3种GPU算法的实验结果对比显示,GPU处理3121×936大小的图片,帧速率可达746fps,并行算法加速比可达300以上。相似文献

17.

一种基于系统实时负载的网络流量均衡方法

周计梁刚《计算机安全》2014,(3):7-11

为了解决传统NAPI机制中单核网络处理的性能瓶颈问题,Google提出了Receive packet steering（RP$）,将接收到的网络流量均衡到多个CPU核上并行处理,从而提高网络吞吐量。但是,RPS只是一种静态均衡策略,不能充分利用系统资源。改进了RPS的流量均衡方法,提出了一种基于系统实时负载的网络流量均衡方法,综合考虑各CPu核的实时负载情况,动态地均衡网络流量。实验结果表明,该方法能够根据系统当前的各CPU核的负载情况动态地分配网络流量,因此能够更加有效地提高系统的网络吞吐量。相似文献

18.

Automatic performance debugging of SPMD-style parallel programs

Xu LiuAuthor Vitae Jianfeng ZhanAuthor Vitae Kunlin ZhanAuthor Vitae Dan Meng^{Author Vitae} 《Journal of Parallel and Distributed Computing》2011,71(7):925-937

Automatic performance debugging of parallel applications includes two main steps: locating performance bottlenecks and uncovering their root causes for performance optimization. Previous work fails to resolve this challenging issue in two ways: first, several previous efforts automate locating bottlenecks, but present results in a confined way that only identifies performance problems with a priori knowledge; second, several tools take exploratory or confirmatory data analysis to automatically discover relevant performance data relationships, but these efforts do not focus on locating performance bottlenecks or uncovering their root causes.The simple program and multiple data (SPMD) programming model is widely used for both high performance computing and Cloud computing. In this paper, we design and implement an innovative system, AutoAnalyzer, that automates the process of debugging performance problems of SPMD-style parallel programs, including data collection, performance behavior analysis, locating bottlenecks, and uncovering their root causes. AutoAnalyzer is unique in terms of two features: first, without any prior knowledge, it automatically locates bottlenecks and uncovers their root causes for performance optimization; second, it is lightweight in terms of the size of performance data to be collected and analyzed. Our contributions are three-fold: first, we propose two effective clustering algorithms to investigate the existence of performance bottlenecks that cause process behavior dissimilarity or code region behavior disparity, respectively; meanwhile, we present two searching algorithms to locate bottlenecks; second, on the basis of the rough set theory, we propose an innovative approach to automatically uncover root causes of bottlenecks; third, on the cluster systems with two different configurations, we use two production applications, written in Fortran 77, and one open source code—MPIBZIP2 (http://compression.ca/mpibzip2/), written in C++, to verify the effectiveness and correctness of our methods. For three applications, we also propose an experimental approach to investigating the effects of different metrics on locating bottlenecks. 相似文献

19.

Reasons for bottlenecks in very large-scale system of systems development

《Information and Software Technology》2014,56(10):1403-1420

ContextSystem of systems (SoS) is a set or arrangement of systems that results when independent and useful systems are to be incorporated into a larger system that delivers unique capabilities. Our investigation showed that the development life cycle (i.e. the activities transforming requirements into design, code, test cases, and releases) in SoS is more prone to bottlenecks in comparison to single systems.ObjectiveThe objective of the research is to identify reasons for bottlenecks in SoS, prioritize their significance according to their effect on bottlenecks, and compare them with respect to different roles and different perspectives, i.e. SoS view (concerned with integration of systems), and systems view (concerned with system development and delivery).MethodThe research method used is a case study at Ericsson AB.ResultsResults show that the most significant reasons for bottlenecks are related to requirements engineering. All the different roles agree on the significance of requirements related factors. However, there are also disagreements between the roles, in particular with respect to quality related reasons. Quality related hinders are primarily observed and highly prioritized by quality assurance responsibles. Furthermore, SoS view and system view perceive different hinders, and prioritize them differently.ConclusionWe conclude that solutions for requirements engineering in SoS context are needed, quality awareness in the organization has to be achieved end to end, and views between SoS and system view need to be aligned to avoid sub optimization in improvements. 相似文献

20.

Toward low CPU usage and efficient DPDK communication in a cluster

Wu Mingjie Chen Qingkui Wang Jingjuan 《The Journal of supercomputing》2022,78(2):1852-1884

In recent years, DPDK (Data Plane Development Kit, a data plane development tool set provided by Intel, focusing on high-performance processing of data packets in network applications), one of the high-performance packet I/O frameworks, is widely used to improve the efficiency of data transmission in the cluster. But, the busy polling used in DPDK will not only waste a lot of CPU cycles and cause certain power consumption, but also the high CPU usage will have a great impact on the performance of other applications in the host. Although some technologies, such as DVFS (dynamic voltage and frequency scaling, which is to dynamically adjust the operating frequency and voltage of the chip according to the different needs of the computing power of the application running on the chip, so as to achieve the purpose of energy saving) and LPI (low power idle, a technology that saves power by turning off the power of certain supporting circuits when the CPU core is idle), can reduce power consumption by adjusting CPU voltage and frequency, they can also cause performance degradation in other applications. Using thread sleep technology is a promising method to reduce the CPU usage and power consumption. However, it is challenging because the appropriate thread sleep duration cannot be obtained accurately. In this paper, we propose a model that finds the optimal thread sleep duration to solve the above challenges. From the model, we can balance the thread CPU usage and transmission efficiency to obtain the optimal sleep duration called the transmission performance threshold. Experiments show that the proposed models can significantly reduce the thread CPU usage. Generally, while the communication performance is slightly reduced, the CPU utilization is reduced by about 80%.

相似文献