首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Hybrid systems with CPU and GPU have become new standard in high performance computing. Workload can be split and distributed to CPU and GPU to utilize them for data-parallelism in hybrid systems. But it is challenging to manually split and distribute the workload between CPU and GPU since the performance of GPU is sensitive to the workload it received. Therefore, current dynamic schedulers balance workload between CPU and GPU periodically and dynamically. The periodical balance operation causes frequent synchronizations between CPU and GPU. It often degrades the overall performance because of the overhead of synchronizations. To solve the problem, we propose a Co-Scheduling Strategy Based on Asymptotic Profiling (CAP). CAP dynamically splits and distributes the workload to CPU and GPU with only a few synchronizations. It adopts the profiling technique to predict performance and partitions the workload according to the performance. It is also optimized for GPU’s performance characteristics. We examine our proof-of-concept system with six benchmarks and evaluation result shows that CAP produces up to 42.7% performance improvement on average compared with the state-of-the-art co-scheduling strategies.  相似文献   

2.
CPU是计算机进行运算的核心,主要性能指标有字长、频率、高速缓存、前端总线频率、超线程技术的应用、支持的扩展指令集等对整个计算机的性能起着至关重要的作用。在计算机的使用中常见的CPU超频故障、计算机感染病毒使CPU性能大幅度下降,偶伴随死机等现象,逐步掌握CPU主要性能与故障的排除技巧,达到举一反三的效果。  相似文献   

3.
Multiclass queuing network models of multiprogramming computer systems are frequently used to predict the performance of computing systems as a function of user workload and hardware configuration. This paper examines three different methods for incorporating operating system overhead in multiclass queuing network models. The goal of the resultant model is to provide an accurate account of the processing performance and the system CPU overhead of each of the several different types of jobs (batch, timesharing, transaction processing, etc.) that together make up the multiprogramming workload. The first method introduces an operating sysbtm workload consisting of a fixed number of jobs to represent system CPU overhead processing. The second method extends the jobs' CPU service requests to include explicitly the CPU overhead necessary for system processing. The third method employs a communicating set of user and system job classes so that the CPU overhead can be modeled by switching jobs from user to system class whenever they require system CPU service. The capabilities and accuracy of the three methods are assessed and compared against performance and overhead data measured on a Univac 1110 computer.  相似文献   

4.
CPU是计算机进行运算的核心,主要性能指标有字长、频率、高速缓存、前端总线频率、超线程技术的应用、支持的扩展指令集等对整个计算机的性能起着至关重要的作用。在计算机的使用中常见的CPU超频故障、计算机感染病毒使CPU性能大幅度下降,偶伴随死机等现象,逐步掌握CPU主要性能与故障的排除技巧,达到举一反三的效果。  相似文献   

5.
Feature tracking and matching in video using programmable graphics hardware   总被引:2,自引:0,他引:2  
This paper describes novel implementations of the KLT feature tracking and SIFT feature extraction algorithms that run on the graphics processing unit (GPU) and is suitable for video analysis in real-time vision systems. While significant acceleration over standard CPU implementations is obtained by exploiting parallelism provided by modern programmable graphics hardware, the CPU is freed up to run other computations in parallel. Our GPU-based KLT implementation tracks about a thousand features in real-time at 30 Hz on 1,024 × 768 resolution video which is a 20 times improvement over the CPU. The GPU-based SIFT implementation extracts about 800 features from 640 × 480 video at 10 Hz which is approximately 10 times faster than an optimized CPU implementation.  相似文献   

6.
wpa/wpa2-psk高速暴力破解器的设计和实现   总被引:1,自引:0,他引:1       下载免费PDF全文
针对基于单核CPU的wpa/wpa2-psk暴力破解器破解速度慢的缺点,提出一种分布式多核CPU加GPU的高速暴力破解器.采用分布式技术将密钥列表合理地分配到各台机器上,在单机上利用多核CPU和GPU形成多个计算核心并行破解,利用GPU计算密集型并行任务强大的计算能力提高破解速度.实验结果证明,该暴力破解器的破解速度相...  相似文献   

7.
We present a new hybrid CPU/GPU collision detection technique for rigid and deformable objects based on spatial subdivision. Our approach efficiently exploits the massive computational capabilities of modern CPUs and GPUs commonly found in off‐the‐shelf computer systems. The algorithm is specifically tailored to be highly scalable on both the CPU and the GPU sides. We can compute discrete and continuous external and self‐collisions of non‐penetrating rigid and deformable objects consisting of many tens of thousands of triangles in a few milliseconds on a modern PC. Our approach is orders of magnitude faster than earlier CPU‐based approaches and up to twice as fast as the most recent GPU‐based techniques.  相似文献   

8.
本文较全面地阐述了X386机器从系统加电到系统核心程序被加载进内存的硬盘引导的全过程,给出了各个阶段的工作流程图。  相似文献   

9.
64位Linux并行计算大气模型效率优化研究*   总被引:3,自引:1,他引:2  
研究了CMAQ大气模型在64位Linux操作系统上不同CPU核心数目并行计算模拟耗时以及结果的差异情况。研究结果表明,并行计算能大幅缩短CMAQ模拟耗时,以16个CPU核心并行处理为性价比最佳值;此时连续模拟中国区域37天空气质量状况(分辨率36 km、167行×97列、垂直14层)平均耗时小于16 min/d,而相同情况下单核模拟耗时大于2 h/d;多于16个核心并行处理时,随核心数量的增加模型性能提升的趋势减缓;操作系统和参与运算的核心数目对CMAQ模型模拟结果没有影响。  相似文献   

10.
当前世界上排前几位的超级计算机都基于大量CPU和GPU组合的混合架构,它们对某些特殊问题,譬如基于FFT的图像处理或N体颗粒计算等领域可获得很高的性能。但是对由有限差分(或基于网格的有限元)离散的偏微分方程问题,于CPU/GPU集群上获得较好的性能仍然是一种挑战。本文提出并测试一种基于这类集群架构的混合算法。算法的可扩展性通过区域分解算法实现,而GPU的性能由基于光滑聚集的代数多重网格法获得,避免了在GPU上表现不理想的不完全分解算法。本文的数值实验采用32CPU/GPU求解用差分离散后达三千万未知数的偏微分方程。  相似文献   

11.
矩量法是广泛使用的高精度电磁数值算法之一。在仿真复杂电磁问题时,该算法需要处理大型复数稠密矩阵方程,这导致其面临内存需求高、计算时间长的问题。与传统基函数相比,本文采用的高阶多项式基函数能够在保证计算精度的前提下大幅度降低未知量,进而降低矩阵阶数。在此基础上,本文设计了基于分块矩阵的高效并行策略,在国内超级计算机平台开展了并行高阶矩量法的超级电磁计算研究,大幅度提升了矩量法的仿真能力。在国产神威蓝光超级计算机上,以机载天线阵列的辐射特性计算为例,对并行规模高达30720 CPU核时的算法性能进行了评估,测试结果表明算法在并行规模扩大20倍以上时仍可获得50%以上的并行效率。在当前排名世界第一的天河2号超级计算机上,以飞机散射特性计算为例,对并行规模高达201600 CPU核时的算法性能进行了评估,测试结果表明算法在并行规模扩大约8倍时可获得50%以上的并行效率。数值仿真结果表明并行高阶矩量法可以在不同架构的超级计算机上高效完成复杂电大目标的精确电磁计算。  相似文献   

12.
Classification using Ant Programming is a challenging data mining task which demands a great deal of computational resources when handling data sets of high dimensionality. This paper presents a new parallelization approach of an existing multi-objective Ant Programming model for classification, using GPUs and the NVIDIA CUDA programming model. The computational costs of the different steps of the algorithm are evaluated and it is discussed how best to parallelize them. The features of both the CPU parallel and GPU versions of the algorithm are presented. An experimental study is carried out to evaluate the performance and efficiency of the interpreter of the rules, and reports the execution times and speedups regarding variable population size, complexity of the rules mined and dimensionality of the data sets. Experiments measure the original single-threaded and the new multi-threaded CPU and GPU times with different number of GPU devices. The results are reported in terms of the number of Giga GP operations per second of the interpreter (up to 10 billion GPops/s) and the speedup achieved (up to 834× vs CPU, 212× vs 4-threaded CPU). The proposed GPU model is demonstrated to scale efficiently to larger datasets and to multiple GPU devices, which allows the expansion of its applicability to significantly more complicated data sets, previously unmanageable by the original algorithm in reasonable time.  相似文献   

13.
We present a hybrid ray tracing system, where the work is divided between the CPU cores and the GPU in an integrated chip, and communication occurs via shared memory. Rays are organized in large packets that can be distributed among the two units as needed. Testing visibility between rays and the scene is mostly performed using an optimized kernel on the GPU, but the CPU can help as necessary. The CPU cores typically handle most or all shading, which makes it easy to support complex appearances. For efficiency, the CPU cores shade whole batches of rays by sorting them on material and shading each material using a vectorized kernel. In addition, we introduce a method to support light paths with arbitrary recursion, such as multiple recursive Whitted‐style ray tracing and adaptive sampling where the result of a ray is examined before sending the next, while still batching up rays for the benefit of GPU‐accelerated traversal and vectorized shading. This allows our system to achieve high rendering performance while maintaining the flexibility to accommodate different rendering algorithms.  相似文献   

14.
本文讨论的是一种单片机控制的通用文字/语音综合报警系统。探讨了系统的结构组成、使单片机数据存储器寻址范围远超出64KB的分段管理技术及人机接口。还讨论了语音芯片与单片机的接口以及发光二极管点阵用于汉字显示的驱动与接口电路。并对系统的软件设计和单片机内部资源的配置也作了详细的说明。  相似文献   

15.
Fang  Juan  Zhang  Xibei  Liu  Shijian  Chang  Zeqing 《The Journal of supercomputing》2019,75(8):4519-4528

When multiple processor (CPU) cores and a GPU integrated together on the same chip share the last-level cache (LLC), the competition for LLC is more serious. CPU and GPU have different memory access characteristics, so that they have differences in the sensitivity of LLC capacity. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications have high number of concurrent threads and they can tolerate access latency. Taking into account the GPU program memory latency tolerance characteristics, we propose an LLC buffer management strategy (buffer-for-GPU, BFG) for heterogeneous multi-core. A buffer is added on the side of LLC to filtrate streaming requests of GPU. Cache-insensitive GPU messages directly access to buffer instead of accessing to LLC, thereby filtering the GPU request and freeing up the LLC space for the CPU application. Then, for the different characteristics of CPU and GPU applications, an improved LRU replacement taking into account the recent access time and access frequency of the cache block is adopted. The cache misses-aware algorithm dynamically selects the improved LRU or LRU algorithm to fit the current operating state by comparing the miss rate of cache in buffer so that the performance of the system will be improved significantly.

  相似文献   

16.
为解决超大图像(2048×2048)的FBP与OR-OSEM扇束图像重建,作者采用PC机群的并行处理技术。将图像重建算法改写为并行运算方式,按角度数均匀地分配计算任务给各个CPU。并行运算结果表明:图像重建速度与CPU的个数基本上成线性正比关系,可提高近25倍(CPU数为25时)。超大图像的在线重建可采用CPU阵列机来高速实现,这一技术对发展高精度CT具有重要的作用。  相似文献   

17.
The emerging integrated CPU–GPU architectures facilitate short computational kernels to utilize GPU acceleration. Evidence has shown that, on such systems, the GPU control responsiveness (how soon the host program finds out about the completion of a GPU kernel) is essential for the overall performance. This study identifies the GPU responsiveness dilemma: host busy polling responds quickly, but at the expense of high energy consumption and interference with co-running CPU programs; interrupt-based notification minimizes energy and CPU interference costs, but suffers from substantial response delay. We present a program level solution that wakes up the host program in anticipation of GPU kernel completion. We systematically explore the design space of an anticipatory wakeup scheme through a timer-delayed wakeup or kernel splitting-based pre-completion notification. Experiments show that our proposed technique can achieve the best of both worlds, high responsiveness with low power and CPU costs, for a wide range of GPU workloads.  相似文献   

18.
We present a novel, hybrid parallel continuous collision detection (HPCCD) method that exploits the availability of multi‐core CPU and GPU architectures. HPCCD is based on a bounding volume hierarchy (BVH) and selectively performs lazy reconstructions. Our method works with a wide variety of deforming models and supports self‐collision detection. HPCCD takes advantage of hybrid multi‐core architectures – using the general‐purpose CPUs to perform the BVH traversal and culling while GPUs are used to perform elementary tests that reduce to solving cubic equations. We propose a novel task decomposition method that leads to a lock‐free parallel algorithm in the main loop of our BVH‐based collision detection to create a highly scalable algorithm. By exploiting the availability of hybrid, multi‐core CPU and GPU architectures, our proposed method achieves more than an order of magnitude improvement in performance using four CPU‐cores and two GPUs, compared to using a single CPU‐core. This improvement results in an interactive performance, up to 148 fps, for various deforming benchmarks consisting of tens or hundreds of thousand triangles.  相似文献   

19.
In this paper, we present an algorithm and provide design improvements needed to port the serial Lempel–Ziv–Storer–Szymanski (LZSS), lossless data compression algorithm, to a parallelized version suitable for general purpose graphic processor units (GPGPU), specifically for NVIDIA’s CUDA Framework. The two main stages of the algorithm, substring matching and encoding, are studied in detail to fit into the GPU architecture. We conducted detailed analysis of our performance results and compared them to serial and parallel CPU implementations of LZSS algorithm. We also benchmarked our algorithm in comparison with well known, widely used programs: GZIP and ZLIB. We achieved up to 34× better throughput than the serial CPU implementation of LZSS algorithm and up to 2.21× better than the parallelized version.  相似文献   

20.
With fierce competition between CPU and graphics processing unit (GPU) platforms, performance evaluation has become the focus of various sectors. In this paper, we take a well‐known algorithm in the field of biosequence matching and database searching, the Smith–Waterman (S‐W) algorithm as an example, and demonstrate approaches that fully exploit its performance potentials on CPU, GPU, and field‐programmable gate array (FPGA) computing platforms. For CPU platforms, we perform two optimizations, single instruction, multiple data and multithread, with compiler options, to gain over 70 × speedups over naive CPU versions on quad‐core CPU platforms. For GPU platforms, we propose the combination of coalesced global memory accesses, shared memory tiles, and loop unfolding, achieving 50 × speedups over initial GPU versions on an NVIDIA GeForce GTX 470 card. Experimental results show that the GPU GTX 470 gains 12 × speedups, instead of 100 × reported by some studies, over Intel quadcore CPU Q9400, under the same manufacturing technology and both with fully optimized schemes. In addition, for FPGA platforms, we customize a linear systolic array for the S‐W algorithm in a 45‐nm FPGA chip from Xilinx (XC6VLX760), with up to 1024 processing elements. Under only 133 MHz clock rate, the FPGA platform reaches the highest performance and becomes the most power‐efficient platform, using only 25 W compared with 190 W of the GPU GTX 470. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号