Similar Documents
20 similar documents found.
1.
Although traditional approaches to code profiling help locate performance bottlenecks, they offer only limited support for removing these bottlenecks. The main reason is the lack of detailed visual runtime information to identify and eliminate computation redundancy. We provide three profiling blueprints that help identify and remove performance bottlenecks. The structural distribution blueprint graphically represents the CPU consumption share for each method and class of an application. The behavioral distribution blueprint depicts the distribution of CPU consumption along method invocations and hints at method candidates for caching optimizations. The behavioral evolution blueprint compares profiles of different versions of a software system and highlights performance-critical changes in the system. These three blueprints helped us to significantly optimize Mondrian, an open source visualization engine. Our implementation is freely available for the Pharo development environment and has been evaluated in a number of different scenarios. Copyright © 2011 John Wiley & Sons, Ltd.

2.
As today’s standard screening methods frequently fail to detect breast cancer before metastases have developed, early diagnosis is still a major challenge. With the promise of high-quality volume images, three-dimensional ultrasound computer tomography is likely to improve this situation, but has high computational needs. In this work, we investigate the acceleration of the ray-based transmission reconstruction by a GPU-based implementation of the iterative numerical optimization algorithm TVAL3. We identified the regular and transposed sparse-matrix–vector multiply as the performance-limiting operations. For accelerated reconstruction we propose two different concepts and devise a hybrid scheme as the optimal configuration. In addition, we investigate multi-GPU scalability and derive the optimal number of devices for our two primary use-cases: a fast preview mode and a high-resolution mode. In order to achieve a fair estimation of the speedup, we compare our implementation to an optimized CPU version of the algorithm. Using our accelerated implementation we reconstructed a preview 3D volume with 24,576 unknowns, a voxel size of (8 mm)³ and approximately 200,000 equations in 0.5 s. A high-resolution volume with 1,572,864 unknowns, a voxel size of (2 mm)³ and approximately 1.6 million equations was reconstructed in 23 s. This constitutes an acceleration of over one order of magnitude in comparison to the optimized CPU version.
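The two kernels named above are compact enough to state directly. Below is a minimal CPU-side sketch of the regular and transposed sparse matrix–vector products over a CSR matrix; all type and function names are our own illustration, not the authors' GPU implementation.

```cpp
// Sketch only: CSR sparse matrix-vector products, the two operations
// identified above as performance-limiting. All names are illustrative.
#include <algorithm>
#include <cstddef>
#include <vector>

struct CsrMatrix {
    std::size_t rows = 0, cols = 0;
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<float> val;            // size nnz
};

// y = A * x (regular SpMV): a gather per row.
void spmv(const CsrMatrix& A, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < A.rows; ++i) {
        float acc = 0.0f;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            acc += A.val[k] * x[A.col_idx[k]];
        y[i] = acc;
    }
}

// y = A^T * x (transposed SpMV): a scatter per row, which is the harder of
// the two to parallelize on a GPU without atomics or a transposed copy.
void spmv_t(const CsrMatrix& A, const std::vector<float>& x, std::vector<float>& y) {
    std::fill(y.begin(), y.end(), 0.0f);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            y[A.col_idx[k]] += A.val[k] * x[i];
}
```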

3.
Performance evaluation of embedded software is essential in an early development phase so as to ensure that the software will run on the embedded device's limited computing resources. The prevailing approaches either require the deployment of the software on the embedded target, which can be tedious and may be impossible in an early development phase, or rely on simulation, which can be very slow. In this article, we introduce a customizable cross-profiling framework for embedded Java processors, including processors featuring a method cache. The developer profiles the embedded software in the host environment, completely decoupled from the target system, on any standard Java virtual machine, but the generated profiles represent the execution time metric of the target system. Our cross-profiling framework is based on bytecode instrumentation. We identify several pointcuts in the execution of bytecode that need to be instrumented in order to estimate the CPU cycle consumption on the target system. An evaluation using the JOP embedded Java processor as target confirms that our approach reconciles high profile accuracy with moderate overhead. Our cross-profiling framework also enables the performance evaluation of new processor architectures before they are implemented. As a case study, we explore the performance impact of various processor design choices and optimizations, such as different cache sizes or pipeline organizations, and come up with an improved processor design that yields speedups of up to 40% on standard Java benchmarks. Copyright © 2009 John Wiley & Sons, Ltd.
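To make the cross-profiling idea concrete, here is a minimal sketch of the cycle-accounting logic: each instrumentation pointcut charges the target's estimated cycle cost to the current method, including a method-cache penalty at invocation sites. The cycle table and cache penalty are invented placeholders, not JOP's actual timing model, and the sketch is C++ rather than the Java bytecode instrumentation the article uses.

```cpp
// Sketch only: charge the target's estimated cycle cost at each pointcut.
// The cycle table and method-cache penalty are invented placeholders.
#include <cstdint>
#include <string>
#include <unordered_map>

struct CrossProfiler {
    std::unordered_map<std::string, std::uint64_t> cycles;

    // Pointcut: one executed bytecode inside 'method'.
    void on_instruction(const std::string& method, int opcode) {
        cycles[method] += target_cycles(opcode);
    }

    // Pointcut: method invocation; models a method cache where a miss pays
    // an (assumed) load penalty proportional to the method's size.
    void on_invoke(const std::string& method, bool cache_miss, int method_bytes) {
        if (cache_miss)
            cycles[method] += 10 + 2 * static_cast<std::uint64_t>(method_bytes);
    }

private:
    static std::uint64_t target_cycles(int opcode) {
        return opcode < 96 ? 1 : 4;  // placeholder, not JOP's real timing
    }
};
```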

4.
Hybrid systems with CPU and GPU have become the new standard in high performance computing. Workload can be split and distributed between the CPU and GPU to exploit data parallelism in hybrid systems. But it is challenging to manually split and distribute the workload between CPU and GPU, since GPU performance is sensitive to the workload it receives. Therefore, current dynamic schedulers balance the workload between CPU and GPU periodically and dynamically. This periodic balancing causes frequent synchronizations between CPU and GPU, which often degrade overall performance because of their overhead. To solve this problem, we propose a Co-scheduling strategy based on Asymptotic Profiling (CAP). CAP dynamically splits and distributes the workload to the CPU and GPU with only a few synchronizations. It adopts a profiling technique to predict performance and partitions the workload accordingly. It is also optimized for the GPU's performance characteristics. We evaluate our proof-of-concept system with six benchmarks, and the results show that CAP yields up to a 42.7% average performance improvement over state-of-the-art co-scheduling strategies.
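A minimal sketch of the asymptotic-profiling idea, under our own assumptions (the real CAP scheduler is more elaborate): measure per-device throughput each round, split the next chunk proportionally, and grow the chunk size so that synchronizations become progressively rarer.

```cpp
// Sketch only: split each round's chunk by measured throughput and grow the
// chunk size so synchronizations become progressively rarer.
#include <algorithm>
#include <cstddef>

struct AsymptoticPartitioner {
    double cpu_rate = 1.0, gpu_rate = 1.0;  // items/s, refined each round
    std::size_t chunk = 1024;               // items dispatched per round

    void record_round(double cpu_items_per_s, double gpu_items_per_s) {
        cpu_rate = cpu_items_per_s;
        gpu_rate = gpu_items_per_s;
        chunk = std::min<std::size_t>(chunk * 2, 1 << 20);  // asymptotic growth
    }

    // Share of the next chunk to hand to the GPU; the rest goes to the CPU.
    std::size_t gpu_items() const {
        return static_cast<std::size_t>(chunk * gpu_rate / (cpu_rate + gpu_rate));
    }
};
```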

5.
OFMC: A symbolic model checker for security protocols
We present the on-the-fly model checker OFMC, a tool that combines two ideas for analyzing security protocols based on lazy, demand-driven search. The first is the use of lazy data types as a simple way of building efficient on-the-fly model checkers for protocols with very large, or even infinite, state spaces. The second is the integration of symbolic techniques and optimizations for modeling a lazy Dolev–Yao intruder whose actions are generated in a demand-driven way. We present both techniques, along with optimizations and proofs of correctness and completeness. Our tool is state of the art in terms of both coverage and performance. For example, it finds all known attacks and discovers a new one in a test suite of 38 protocols from the Clark/Jacob library in a few seconds of CPU time for the entire suite. We also give examples demonstrating how our tool scales to, and finds errors in, large industrial-strength protocols.

6.
Molecular dynamics (MD) is an important research tool extensively applied in materials science. Running MD on a graphics processing unit (GPU) is an attractive new approach for accelerating MD simulations. Currently, GPU implementations of MD usually run in a one-host-process-one-GPU (OHPOG) scheme. This scheme may pose a limitation on the system size that an implementation can handle due to the small device memory relative to the host memory. In this paper, we present a one-host-process-multiple-GPU (OHPMG) implementation of MD with embedded-atom-model or semi-empirical tight-binding many-body potentials. Because more device memory is available in an OHPMG process, the system size that can be handled is increased to a few million or more atoms. In comparison with the serial CPU implementation, in which Newton’s third law is applied to improve the computational efficiency, our OHPMG implementation has achieved a 28.9x–86.0x speedup in double precision, depending on the system size, the cut-off ranges and the number of GPUs. The implementation can also handle a group of small simulation boxes in one run by combining the small boxes into a large box. This approach greatly improves the GPU computing efficiency when a large number of MD simulations for small boxes are needed for statistical purposes.

7.
This article presents the parallel implementation on a GPU of a real-time dynamic tone-mapping operator. The operator we describe in this article is generic and may be used by any application. However, the goal of our work is to integrate this operator into the graphic rendering process of a car driving simulator; thus, we studied its real-time implementation. The tone-mapping operator outputs a low dynamic range (LDR) image, keeping as much as possible the contrast and luminance of the original input high dynamic range (HDR) image. It maps the luminances of the original scene to the output device's display values. We address the problem of mapping HDR images to standard displays, in which case tone mapping compresses the luminance ratios. Several tone-mapping operators can be found in the literature, as well as some parallelizations; however, they use either static operators or adaptations of static operators. We have adapted Irawan's dynamic operator and parallelized it on the GPU. Algorithmic optimizations have been performed, and we have explored different strategies for partitioning the computation between the CPU and the GPU. We chose to implement on the GPU the color-space conversions and the histogram interpolation, which are the most time-consuming steps on the CPU (1–2 s per 1,002 × 666 image). All of these optimizations increase the processing rate and the number of HDR images tone-mapped to LDR and displayed per second. This operator has been implemented on a Pentium 4 at 3.6 GHz and an Nvidia 8800 GTX GPU (728 MB, 518 GFLOPS). The execution speed has been multiplied by a factor of 15 compared to the naive implementation of the algorithm. The display rate reaches 30 images per second, which fulfills our goal of a real-time video rate of 25 images per second.
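For illustration, the core luminance-compression step of a simple global tone mapper (Reinhard-style) is sketched below. Irawan's dynamic operator, which the authors parallelize, additionally involves histogram interpolation and temporal adaptation, so this is a stand-in for the general HDR-to-LDR mapping, not their algorithm.

```cpp
// Sketch only: a global Reinhard-style luminance compression, standing in
// for the dynamic histogram-based operator described above.
#include <cmath>
#include <vector>

void tone_map(std::vector<float>& lum, float key = 0.18f) {
    // 1. Log-average luminance of the HDR frame (its "overall brightness").
    double log_sum = 0.0;
    for (float L : lum) log_sum += std::log(1e-6 + L);
    const float l_avg = static_cast<float>(std::exp(log_sum / lum.size()));

    // 2. Scale to the key value, then compress luminance ratios into [0, 1).
    for (float& L : lum) {
        const float ls = key * L / l_avg;
        L = ls / (1.0f + ls);  // LDR display value
    }
}
```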

8.
In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer of China and the largest GPU-accelerated heterogeneous system attempted at the time. A hybrid programming model consisting of MPI, OpenMP, and streaming computing is described to exploit the task, thread, and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU–GPU communication, we present a software pipelining technique to hide the communication overhead. Combined with other traditional optimizations, the Linpack we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
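A hedged sketch of the software-pipelining idea using standard CUDA streams (the function and buffer names are ours, not TianHe-1's actual Linpack code): the next panel is copied to the GPU asynchronously while the current one is processed, so the transfer hides behind computation.

```cpp
// Sketch only: double-buffered panel transfers overlapped with computation.
// Compiles with nvcc; host memory must be pinned for the copies to be async.
#include <cuda_runtime.h>
#include <cstddef>

// Placeholder for the GPU part of the panel update.
void process_panel(float* /*dev_panel*/, std::size_t /*elems*/, cudaStream_t /*s*/) {
    // launch the update kernels for this panel here
}

void pipelined_updates(const float* host, float* dev[2], std::size_t elems, int n) {
    cudaStream_t copy, compute;
    cudaStreamCreate(&copy);
    cudaStreamCreate(&compute);

    cudaMemcpyAsync(dev[0], host, elems * sizeof(float),
                    cudaMemcpyHostToDevice, copy);
    for (int i = 0; i < n; ++i) {
        cudaStreamSynchronize(copy);               // panel i is now resident
        if (i + 1 < n)                             // prefetch panel i+1 ...
            cudaMemcpyAsync(dev[(i + 1) % 2], host + (i + 1) * elems,
                            elems * sizeof(float), cudaMemcpyHostToDevice, copy);
        process_panel(dev[i % 2], elems, compute); // ... while computing on i
    }
    cudaStreamSynchronize(compute);
    cudaStreamDestroy(copy);
    cudaStreamDestroy(compute);
}
```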

9.
Dissipative particle dynamics (DPD) simulation is implemented on multiple GPUs using NVIDIA’s Compute Unified Device Architecture (CUDA) in this paper. Data communication between GPUs is handled by POSIX threads. Compared with the single-GPU implementation, this implementation provides faster computation and more storage space, allowing simulations of significantly larger systems. In benchmarks, the performance of the GPUs is compared with that of Material Studio running on a single CPU core. We achieve more than a 90x speedup by using three C2050 GPUs to perform simulations on an 80×80×80 system. This implementation is applied to a study of the dispersancy of lubricant succinimide dispersants. A series of simulations is performed on lubricant–soot–dispersant systems to study how factors such as concentration and interaction with the lubricant affect dispersancy, and the simulation results are consistent with our earlier study.
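The one-thread-per-GPU structure such an implementation relies on can be sketched as follows; we use std::thread in place of raw POSIX threads, and the kernel launches and boundary exchange are left as placeholder comments.

```cpp
// Sketch only: one host thread per device; kernel launches and halo
// exchanges are placeholders.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void run_slice(int device, int steps) {
    cudaSetDevice(device);  // CUDA calls in this thread now target 'device'
    for (int s = 0; s < steps; ++s) {
        // launch force/integration kernels on this device's particle slice,
        // then exchange boundary particles with neighbor devices via the host
    }
    cudaDeviceSynchronize();
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    std::vector<std::thread> workers;
    for (int d = 0; d < n; ++d)
        workers.emplace_back(run_slice, d, 1000);
    for (auto& t : workers) t.join();
}
```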

10.
Although modern computer hardware offers an increasing number of processing elements organized in nonuniform memory access (NUMA) architectures, prevailing middleware engines for executing business processes, workflows, and Web service compositions have not been optimized to properly exploit the abundant processing resources of such machines. Among the factors limiting performance are inefficient thread scheduling by the operating system, which can result in suboptimal use of system memory and CPU caches, and sequential code sections that cannot take advantage of multiple available cores. In this article, we study the performance of the JOpera process execution engine on recent multicore machines. We first evaluate its performance without any dedicated optimization for multicore hardware, showing that additional cores do not significantly improve performance, although the engine has a multithreaded design. Therefore, we apply optimizations based on replication together with an improved, hardware-aware usage of the underlying resources such as NUMA nodes and CPU caches. Thanks to our optimizations, we achieve speedups from a factor of 2 up to a factor of 20 (depending on the target machine) when compared with a baseline execution ‘as is’. Copyright © 2012 John Wiley & Sons, Ltd.
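One of the described optimizations, pinning each engine replica to the cores of a single NUMA node, might look like the following Linux-specific sketch; the node-to-core map is an assumed topology standing in for what hwloc or /sys discovery would provide.

```cpp
// Sketch only: pin each replica's thread to the cores of one NUMA node so
// its working set stays node-local. Linux-specific; the topology below is
// an assumption.
#include <pthread.h>
#include <sched.h>
#include <cstddef>
#include <thread>
#include <vector>

void pin_to_cores(std::thread& t, const std::vector<int>& cores) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c : cores) CPU_SET(c, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    // Assumed topology: node 0 = cores 0-3, node 1 = cores 4-7.
    const std::vector<std::vector<int>> node_cores = {{0, 1, 2, 3}, {4, 5, 6, 7}};
    std::vector<std::thread> replicas;
    for (std::size_t node = 0; node < node_cores.size(); ++node) {
        replicas.emplace_back([] { /* run one engine replica */ });
        pin_to_cores(replicas.back(), node_cores[node]);
    }
    for (auto& t : replicas) t.join();
}
```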

11.
Stepwise refinement is a crucial conceptual tool for system development, encouraging program construction via a number of separate correctness-preserving stages which ideally can be understood in isolation. A crucial conceptual component of security is an adversary’s ignorance of concealed information. We suggest a novel method of combining these two ideas. Our suggestion is based on a mathematical definition of “ignorance-preserving” refinement that extends classical refinement by limiting an adversary’s access to concealed information: moving from specification to implementation should never increase that access. The novelty is the way we achieve this in the context of sequential programs. Specifically we give an operational model (and detailed justification for it), a basic sequential programming language and its operational semantics in that model, a “logic of ignorance” interpreted over the same model, then a program-logical semantics bringing those together — and finally we use the logic to establish, via refinement, the correctness of a real (though small) protocol: Rivest’s Oblivious Transfer. A previous report treated Chaum’s Dining Cryptographers similarly. In passing we solve the Refinement Paradox for sequential programs.

12.
Many researchers have developed applications using transactional memory (TM) with the purpose of benchmarking different implementations, and studying whether or not TM is easy to use. However, comparatively little has been done to provide general-purpose tools for profiling and optimizing programs which use transactions. In this paper we introduce a series of profiling and optimization techniques for TM applications. The profiling techniques are of three types: (i) techniques to identify multiple potential conflicts from a single program run, (ii) techniques to identify the data structures involved in conflicts by using a symbolic path through the heap, rather than a machine address, and (iii) visualization techniques to summarize how threads spend their time and which of their transactions conflict most frequently. Altogether they provide in-depth and comprehensive information about the wasted work caused by aborting transactions. To reduce the contention between transactions we suggest several TM-specific optimizations which leverage nested transactions, transaction checkpoints, early release, and other mechanisms. To examine the effectiveness of the profiling and optimization techniques, we provide a series of illustrations from the STAMP TM benchmark suite and from the synthetic WormBench workload. First we analyze the performance of TM applications using our profiling techniques and then we apply various optimizations to improve the performance of the Bayes, Labyrinth and Intruder applications. We discuss the design and implementation of the profiling techniques in the Bartok-STM system. We process data offline or during garbage collection, where possible, in order to minimize the probe effect introduced by profiling.
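The raw data behind such summaries is simple to picture: per-site counters of commits, aborts, and time wasted in aborted attempts. A minimal sketch under our own assumptions (the hooks are placeholders; a real system such as Bartok-STM would wire these into its own abort and commit paths):

```cpp
// Sketch only: per-transaction-site statistics on wasted work. The hooks
// are placeholders for a TM runtime's abort/commit paths.
#include <atomic>
#include <cstdint>

struct TxSiteStats {
    std::atomic<std::uint64_t> commits{0};
    std::atomic<std::uint64_t> aborts{0};
    std::atomic<std::uint64_t> wasted_ns{0};  // time spent in aborted attempts
};

// Invoked from the (placeholder) abort hook of the TM runtime.
inline void on_abort(TxSiteStats& s, std::uint64_t attempt_ns) {
    s.aborts.fetch_add(1, std::memory_order_relaxed);
    s.wasted_ns.fetch_add(attempt_ns, std::memory_order_relaxed);
}

// Invoked from the (placeholder) commit hook.
inline void on_commit(TxSiteStats& s) {
    s.commits.fetch_add(1, std::memory_order_relaxed);
}
```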

13.
During actual program execution, profiling provides the compiler with accurate profile information. Guided by this information, compiler optimizations can make informed trade-offs and improve the performance of the generated code. This paper describes the edge profiling technique in the Loongson/ORC (龙芯/ORC) compiler and presents CPU2000 benchmark results obtained with the aid of edge profiling.
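To illustrate what edge profiling instruments, here is a hand-written stand-in for compiler-inserted code: one counter per control-flow edge, bumped when the edge is taken and dumped at exit for the feedback-directed optimization pass. This is our own sketch, not the Loongson/ORC implementation.

```cpp
// Sketch only: hand-written stand-in for compiler-inserted edge counters.
#include <cstdint>
#include <cstdio>

static std::uint64_t edge_count[4];  // one counter per instrumented CFG edge

int abs_clamped(int x) {
    if (x < 0) { ++edge_count[0]; x = -x; }  // edge: test -> negate
    else       { ++edge_count[1]; }          // edge: test -> fallthrough
    if (x > 100) { ++edge_count[2]; x = 100; }
    else         { ++edge_count[3]; }
    return x;
}

int main() {
    for (int i = -500; i < 500; ++i) abs_clamped(i);
    for (int e = 0; e < 4; ++e)  // the profile later consumed by the compiler
        std::printf("edge %d: %llu\n", e, (unsigned long long)edge_count[e]);
}
```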

14.
Parallel programming on workstation clusters is subject to many factors and problems which determine the potential success or failure of any individual implementation. The most obvious problems are the difficulty in developing parallel algorithms and the high communication latency which may render such algorithms inefficient. In an attempt to address some of these issues we propose a strategy for estimating the potential speedup of a parallel program based on computation and communication profiling. We show that our proposed strategy yields accurate estimates of the speedup. We also propose a complete communication model so that the speedup can be estimated under different program inputs, and show that moderately accurate estimates can be obtained. High communication latency is the major problem with workstation cluster computing. We attempt to examine this problem from the system level point of view and show experimentally that latency hiding can allow almost full utilisation of the CPU resource even though individual programs may suffer from high communication latencies. © 1997 John Wiley & Sons, Ltd.
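The flavor of such an estimate can be captured in a few lines: predicted parallel time is the profiled computation time divided across the workstations, plus a latency/bandwidth model of the profiled communication. The linear model and its parameters below are illustrative assumptions, not the paper's calibrated model.

```cpp
// Sketch only: a linear latency/bandwidth communication model with
// illustrative parameters.
#include <cstdio>

double predicted_speedup(double t_comp,  // profiled sequential compute time (s)
                         int p,          // number of workstations
                         double msgs,    // profiled message count
                         double bytes,   // profiled bytes transferred
                         double alpha,   // per-message latency (s)
                         double beta) {  // per-byte transfer cost (s)
    const double t_par = t_comp / p + msgs * alpha + bytes * beta;
    return t_comp / t_par;
}

int main() {
    // Example: 60 s of compute on 8 nodes; 10^4 messages at 1 ms each and
    // 100 MB at an effective 10 MB/s.
    std::printf("predicted speedup: %.2fx\n",
                predicted_speedup(60.0, 8, 1e4, 1e8, 1e-3, 1e-7));
}
```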

15.
An effectiveness-based adaptive cache replacement policy
Belady’s optimal cache replacement policy is an algorithm for working out the theoretical minimum number of cache misses, but the rationale behind it was too simple. In this work, we revisit the essential function of caches to develop an underlying analytical model. We argue that frequency and recency are the only two affordable attributes of cache history that can be leveraged to predict a good replacement. Based on those two properties, we propose a novel replacement policy, the Effectiveness-Based Replacement policy (EBR), and a refinement, Dynamic EBR (D-EBR), which combine measures of recency and frequency to form a rank sequence inside each set and evict the block with the lowest rank. To evaluate our design, we simulated all 30 applications from SPEC CPU2006 for a uni-core system and a set of combinations for 4-core systems, for different cache sizes. The results show that EBR achieves an average miss rate reduction of 12.4%. With the help of D-EBR, we can tune the weight ratio between ‘frequency’ and ‘recency’ dynamically; D-EBR can nearly double the miss reduction achieved by EBR alone. In terms of hardware overhead, EBR requires half the hardware overhead of real LRU, and even compared with pseudo-LRU the overhead is modest.
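A minimal sketch of the EBR eviction rule under our own assumptions (a fixed 4:1 frequency-to-recency weighting and 4-bit saturating counters; D-EBR would tune the weighting dynamically):

```cpp
// Sketch only: rank = weighted frequency + recency; evict the lowest rank.
// The 4:1 weighting and 4-bit counters are assumptions.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Block {
    std::uint64_t tag = 0;
    std::uint32_t freq = 0;      // saturating reference counter
    std::uint64_t last_use = 0;  // recency timestamp
    bool valid = false;
};

struct CacheSet {
    std::vector<Block> ways;
    std::uint64_t clock = 0;
    explicit CacheSet(std::size_t assoc = 8) : ways(assoc) {}

    std::uint64_t rank(const Block& b) const { return 4 * b.freq + b.last_use; }

    void access(std::uint64_t tag) {
        ++clock;
        for (auto& b : ways)
            if (b.valid && b.tag == tag) {  // hit: bump frequency and recency
                b.freq = std::min<std::uint32_t>(b.freq + 1, 15);
                b.last_use = clock;
                return;
            }
        auto victim = std::min_element(     // miss: evict the lowest-ranked way
            ways.begin(), ways.end(), [this](const Block& a, const Block& b) {
                if (a.valid != b.valid) return !a.valid;  // prefer empty ways
                return rank(a) < rank(b);
            });
        *victim = Block{tag, 1, clock, true};
    }
};
```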

16.
We present a novel algorithm to solve the non-negative single-source shortest path problem on road networks and graphs with low highway dimension. After a quick preprocessing phase, we can compute all distances from a given source in the graph with essentially a linear sweep over all vertices. Because this sweep is independent of the source, we are able to reorder vertices in advance to exploit locality. Moreover, our algorithm takes advantage of features of modern CPU architectures, such as SSE and multiple cores. Compared to Dijkstra’s algorithm, our method needs fewer operations, has better locality, and is better able to exploit parallelism at multi-core and instruction levels. We gain additional speedup when implementing our algorithm on a GPU, where it is up to three orders of magnitude faster than Dijkstra’s algorithm on a high-end CPU. This makes applications based on all-pairs shortest paths practical for continental-sized road networks. Several algorithms, such as computing the graph diameter, arc flags, or exact reaches, can be greatly accelerated by our method.
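The sweep phase can be sketched as follows, assuming preprocessing has already renumbered the vertices into sweep order and a small initial search has seeded the distance array; the data layout is our illustration of why the pass scans memory linearly.

```cpp
// Sketch only: the linear sweep. Vertices are assumed renumbered so that
// sweep order is 0..n-1, which is what makes the pass scan memory linearly
// (and lets the real implementation use SIMD and multiple cores).
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

constexpr std::uint32_t INF = std::numeric_limits<std::uint32_t>::max();

struct SweepGraph {                    // arcs grouped by head vertex
    std::vector<std::uint32_t> first;  // size n + 1, offsets into arc arrays
    std::vector<std::uint32_t> tail;   // arc tails (already-swept vertices)
    std::vector<std::uint32_t> weight;
};

// dist[] has been seeded from the source; the sweep finalizes every
// distance in one pass, with no priority queue.
void sweep(const SweepGraph& g, std::vector<std::uint32_t>& dist) {
    for (std::uint32_t v = 0; v + 1 < g.first.size(); ++v)
        for (std::uint32_t a = g.first[v]; a < g.first[v + 1]; ++a)
            if (dist[g.tail[a]] != INF)
                dist[v] = std::min(dist[v], dist[g.tail[a]] + g.weight[a]);
}
```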

17.
In this paper we present HySAT, a bounded model checker for linear hybrid systems, incorporating a tight integration of a DPLL-based pseudo-Boolean SAT solver and a linear programming routine as its core engine. In contrast to related tools like MathSAT, ICS, or CVC, our tool exploits the various optimizations that arise naturally in the bounded model checking context, e.g. isomorphic replication of learned conflict clauses or tailored decision strategies, and extends them to the hybrid domain. We demonstrate that those optimizations are crucial to the performance of the tool.

18.
Virtual execution environments, such as the Java virtual machine, promote platform-independent software development. However, when it comes to analyzing algorithm complexity and performance bottlenecks, available tools focus on platform-specific metrics, such as the CPU time consumption on a particular system. Other drawbacks of many prevailing profiling tools are high overhead, significant measurement perturbation, as well as reduced portability of profiling tools, which are often implemented in platform-dependent native code. This article presents a novel profiling approach, which is entirely based on program transformation techniques, in order to build a profiling data structure that provides calling-context-sensitive program execution statistics. We explore the use of platform-independent profiling metrics in order to make the instrumentation entirely portable and to generate reproducible profiles. We implemented these ideas within a Java-based profiling tool called JP. A significant novelty is that this tool achieves complete bytecode coverage by statically instrumenting the core runtime libraries and dynamically instrumenting the rest of the code. JP provides a small and flexible API to write customized profiling agents in pure Java, which are periodically activated to process the collected profiling information. Performance measurements point out that, despite the presence of dynamic instrumentation, JP causes significantly less overhead than a prevailing tool for the profiling of Java code. Copyright © 2008 John Wiley & Sons, Ltd.
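A sketch of the calling-context tree such a profiler maintains is shown below; JP builds this in pure Java through bytecode instrumentation, whereas this C++ stand-in just shows the enter/count/exit protocol and a platform-independent metric (bytecodes executed).

```cpp
// Sketch only: the enter/count/exit protocol over a calling-context tree,
// accumulating a platform-independent metric. Names are ours, not JP's API.
#include <cstdint>
#include <map>
#include <memory>
#include <string>

struct CctNode {
    std::uint64_t invocations = 0;
    std::uint64_t bytecodes = 0;  // platform-independent metric
    std::map<std::string, std::unique_ptr<CctNode>> children;
};

struct Profiler {
    CctNode root;
    CctNode* current = &root;

    CctNode* enter(const std::string& callee) {  // inserted at method entry
        CctNode* caller = current;
        auto& child = current->children[callee];
        if (!child) child = std::make_unique<CctNode>();
        current = child.get();
        ++current->invocations;
        return caller;                           // saved for the matching exit
    }
    void count(std::uint64_t n) { current->bytecodes += n; }  // per basic block
    void exit(CctNode* caller) { current = caller; }          // at method exit
};
```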

19.
Nowadays, computing has become a service on cloud computing resources. Users reserve virtual machines to execute their applications with the minimum number of processing cores to save money. Optimizing user applications at the level of a single core of a physical machine is highly desirable to users, to reduce cost, as well as to cloud providers, to reduce power consumption. In this paper, we show how to exploit all the processing resources available in a single physical CPU core to optimize the performance of the 2D spatial filtering operation, a basic kernel in important image and multimedia applications such as image enhancement, edge detection, image segmentation, and image analysis. We propose a novel computational procedure to restructure the conventional image filtering operation. We then demonstrate the merits of combining hand-optimized source-code restructuring, auto-optimized compiler techniques including vectorization, and hand-optimized threading to squeeze the most performance out of a single CPU core. Our performance evaluations with Sobel filters, on a variety of image sizes, using the Linux perf tool on a single core of a quad-core Intel Core i7 processor, show that our source-code restructuring with compiler auto-vectorization, using Intel AVX vector instructions, is 1.3X faster than the non-restructured auto-vectorized version of the CImg library for computing the image gradient. Moreover, using OpenMP directives we studied different image partitioning strategies to better exploit the two hardware threads inside a CPU core, which boosted performance to 2.6X. Compared with the conventional CImg implementation, we obtained an average enhancement of 5.0X for image sizes ranging from 0.5 MPixel to 8 MPixel. Comparing our best-optimized code to the conventional non-optimized serial code, without threading, resulted in a significant enhancement of 23X. The overall results show how significant performance gains in important image processing applications can be obtained by applying source-code restructuring before employing automatic compiler optimizations to exploit the degrees of ILP, DLP, and TLP parallelism inside a single core of a multi-core CPU.
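As a flavor of the restructuring, a fused Sobel gradient kernel with OpenMP row partitioning and an auto-vectorizable inner loop might look like the sketch below; it is our own illustration, not the authors' code, and omits their blocking and partitioning variants.

```cpp
// Sketch only: a fused Sobel gradient with OpenMP row partitioning and an
// auto-vectorizable inner loop. Compile with -fopenmp (and e.g. -O3 -mavx).
#include <cmath>

void sobel_gradient(const float* in, float* out, int w, int h) {
    #pragma omp parallel for schedule(static)  // rows split across threads
    for (int y = 1; y < h - 1; ++y) {
        const float* r0 = in + (y - 1) * w;    // row above
        const float* r1 = in + y * w;          // current row
        const float* r2 = in + (y + 1) * w;    // row below
        for (int x = 1; x < w - 1; ++x) {      // contiguous, vectorizable
            float gx = (r0[x + 1] + 2 * r1[x + 1] + r2[x + 1])
                     - (r0[x - 1] + 2 * r1[x - 1] + r2[x - 1]);
            float gy = (r2[x - 1] + 2 * r2[x] + r2[x + 1])
                     - (r0[x - 1] + 2 * r0[x] + r0[x + 1]);
            out[y * w + x] = std::sqrt(gx * gx + gy * gy);
        }
    }
}
```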

20.
Optimizing main-memory join on modern hardware
In the past decade, the exponential growth in commodity CPU speed has far outpaced advances in memory latency. A second trend is that CPU performance advances are not only brought by increased clock rates, but also by increasing parallelism inside the CPU. Current database systems have not yet adapted to these trends and show poor utilization of both CPU and memory resources on current hardware. In this paper, we show how these resources can be optimized for large joins and translate these insights into guidelines for future database architectures, encompassing data structures, algorithms, cost modeling, and implementation. In particular, we discuss how vertically fragmented data structures optimize cache performance on sequential data access. On the algorithmic side, we refine the partitioned hash-join with a new partitioning algorithm called "radix-cluster", which is specifically designed to optimize memory access. The performance of this algorithm is quantified using a detailed analytical model that incorporates memory access costs in terms of a limited number of parameters, such as cache sizes and miss penalties. We also present a calibration tool that extracts such parameters automatically from any computer hardware. The accuracy of our models is proven by exhaustive experiments conducted with the Monet database system on three different hardware platforms. Finally, we investigate the effect of implementation techniques that optimize CPU resource usage. Our experiments show that large joins can be accelerated almost an order of magnitude on modern RISC hardware when both memory and CPU resources are optimized.
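One pass of radix-cluster partitioning is easy to sketch (the hash is simplified to the identity, and a real implementation uses multiple passes and preallocated cluster buffers to respect TLB and cache-line limits):

```cpp
// Sketch only: one pass of radix-cluster partitioning. Tuples are scattered
// into 2^B clusters on B low-order bits of the (here: identity) hash, so
// that each cluster of both join inputs later fits the cache.
#include <cstdint>
#include <vector>

struct Tuple { std::uint32_t key; std::uint32_t payload; };

std::vector<std::vector<Tuple>> radix_cluster(const std::vector<Tuple>& in, int bits) {
    const std::uint32_t mask = (1u << bits) - 1;
    std::vector<std::vector<Tuple>> clusters(1u << bits);
    for (const Tuple& t : in)
        clusters[t.key & mask].push_back(t);  // scatter on low-order bits
    return clusters;
}
// A partitioned hash-join then builds a hash table per cluster of the inner
// relation and probes it with the matching cluster of the outer relation.
```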
